Publications

End-to-End Deep Learning for Child Speech Recognition

Published in Stanford CS224U: Natural Language Understanding, 2020

Abstract: Child Speech Recognition (CSR) is a less explored and more challenging task than typical Automatic Speech Recognition (ASR). This task has significant applications in the classroom and is especially important in a remote learning environment. We present findings from training deep-learning-based speech recognition models on the MyST corpus, the largest publicly available English-language child speech corpus. We obtained a word error rate (WER) of 27.26% on the MyST test set with a DeepSpeech2 baseline. Our best model, a Conformer model pre-trained on LibriSpeech and fine-tuned on the MyST corpus, achieved a test WER of 23.45%. Our results show that pre-training on adult speech is essential for model performance. We also provide an error analysis of our best model and a discussion of the results.
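For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It is an illustration only, not the scoring code used in the paper; the function names and the example sentences are hypothetical, and text normalization (casing, punctuation, filler words) is assumed to have been done beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WER figures quoted in the abstract above.
    print(relative_reduction(27.26, 23.45))
```

Applied to the two figures in the abstract, the second helper gives a relative WER reduction of roughly 14% from the DeepSpeech2 baseline to the fine-tuned Conformer.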

Download here

Robust 3D Object Tracking in Autonomous Vehicles

Published in Stanford CS238: Decision Making under Uncertainty, 2019

Abstract: We present a stereo-camera-based 3D vehicle-tracking system that uses Kalman filtering to improve robustness. The objective of our system is to accurately predict the locations and orientations of vehicles from stereo camera data. It consists of three modules: a 2D object detection network, 3D position extraction, and 3D object correlation/smoothing. The system approaches the 3D localization performance of LIDAR and significantly outperforms state-of-the-art monocular vehicle tracking systems. The addition of Kalman filtering increases our system’s robustness to missed detections and improves the recall of our detector. Kalman filtering improves the MAP score of 3D localization for moderately difficult vehicles by 7.7% compared to our unfiltered baseline. Our system predicts the correct orientation of vehicles with 78% accuracy.
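To illustrate why a Kalman filter helps with missed detections, here is a minimal sketch that tracks a single coordinate under a constant-velocity motion model. When a frame has no detection, the filter skips the update step and reports the predicted position, so the track continues through the gap. This is a simplified 1D stand-in, not the 3D system described in the abstract: the motion model, the function name, and all noise parameters are assumptions chosen for the example.

```python
from typing import List, Optional, Sequence


def kalman_track_1d(
    measurements: Sequence[Optional[float]],
    dt: float = 1.0,
    accel_var: float = 0.1,      # assumed process noise (acceleration variance)
    meas_var: float = 0.25,      # assumed measurement noise variance
    init_vel_var: float = 100.0,  # large: initial velocity is unknown
) -> List[float]:
    """Filter positions; a None measurement marks a missed detection."""
    if not measurements or measurements[0] is None:
        raise ValueError("the first frame must contain a detection")

    # State is [position, velocity]; covariance is the 2x2 matrix
    # [[p00, p01], [p10, p11]], written out explicitly to avoid dependencies.
    pos, vel = float(measurements[0]), 0.0
    p00, p01, p10, p11 = meas_var, 0.0, 0.0, init_vel_var

    # Process noise for a white-noise acceleration model.
    q00 = accel_var * dt**4 / 4
    q01 = accel_var * dt**3 / 2
    q11 = accel_var * dt**2

    estimates = [pos]
    for z in measurements[1:]:
        # Predict: x = F x, P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos += dt * vel
        a = p00 + dt * p10
        b = p01 + dt * p11
        p00, p01, p10, p11 = (
            a + dt * b + q00,
            b + q01,
            p10 + dt * p11 + q01,
            p11 + q11,
        )

        # Update only when a detection exists (H = [1, 0]).
        if z is not None:
            s = p00 + meas_var           # innovation variance
            k0, k1 = p00 / s, p10 / s    # Kalman gain
            residual = z - pos
            pos += k0 * residual
            vel += k1 * residual
            p00, p01, p10, p11 = (
                (1 - k0) * p00,
                (1 - k0) * p01,
                p10 - k1 * p00,
                p11 - k1 * p01,
            )
        estimates.append(pos)
    return estimates


if __name__ == "__main__":
    # An object moving one unit per frame, with the detection at frame 3 missing.
    print(kalman_track_1d([0.0, 1.0, 2.0, None, 4.0, 5.0]))
```

In the example run, the estimate for the frame with the missing detection comes from the prediction step alone, which is how a tracker of this kind can keep reporting an object that the detector temporarily fails to find.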

Download here