Flatstart-CTC: a new acoustic model training procedure for speech recognition


We present a new procedure for training acoustic models from scratch for large-vocabulary speech recognition that requires no previously trained model for alignment or bootstrapping. We augment the Connectionist Temporal Classification (CTC) objective function so that acoustic models can be trained directly from a parallel corpus of audio and transcripts. With this augmented CTC function, we train a phoneme-recognition acoustic model directly from the written-domain transcript. Further, we outline a mechanism for generating context-dependent phonemes from a CTC model trained to predict phonemes, and we then train a second CTC model to predict these context-dependent phonemes. Since this approach does not require training any prior non-CTC model, it drastically reduces the overall data-to-model training time from 30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state of the art by XX-XX\%.
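For readers unfamiliar with the objective being augmented, the following is a minimal sketch of the standard (unaugmented) CTC likelihood, computed with the usual forward recursion over a blank-extended label sequence. It is not the flatstart-CTC objective proposed here. The function name `ctc_likelihood`, the choice of index 0 for the blank symbol, and the use of plain probabilities rather than log-probabilities are illustrative assumptions; practical implementations work in the log domain for numerical stability.

```python
def ctc_likelihood(probs, labels, blank=0):
    """Probability that per-frame outputs `probs` collapse to `labels` under standard CTC.

    probs:  list of T frames, each a list of per-symbol probabilities.
    labels: target symbol indices (no blanks).
    blank:  index of the blank symbol (assumed to be 0 here).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    S = len(ext)

    # A valid path may start at the leading blank or at the first label.
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping the intermediate blank is allowed only between distinct labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid path may end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)
```

The negative logarithm of this quantity is the per-utterance CTC loss. Training with it needs only the label sequence, not a frame-level alignment, which is the property that lets the procedure dispense with a bootstrap model.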