End-to-End Deep Learning for Child Speech Recognition
Published in Stanford CS224U: Natural Language Understanding, 2020
Abstract: Child Speech Recognition (CSR) is a less-explored and more challenging task than typical Automatic Speech Recognition (ASR). It has significant applications in the classroom and is especially important in remote learning environments. We present findings from training deep learning-based speech recognition models on the MyST corpus, the largest publicly available English-language child speech corpus. We obtained a 27.26% word error rate (WER) on the MyST test set with a DeepSpeech2 baseline. Our best model, a Conformer pre-trained on LibriSpeech and fine-tuned on the MyST corpus, achieved a test WER of 23.45%. Our results show that pre-training on adult speech is essential for model performance. We also provide additional error analysis of our best model and a discussion of the results.
Download here
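For reference, WER, the metric reported in the abstract, is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is an illustrative, self-contained implementation, not the evaluation code used in the paper; the example transcripts are hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref_words)][len(hyp_words)] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical transcripts for illustration only.
    print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> ~0.167
```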