End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow


This article discusses strategies for end-to-end training of state- of-the-art acoustic models for Large Vocabulary Continuous Speech Recognition (LVCSR), with the goal of leveraging Ten- sorFlow components so as to make efficient use of large-scale training sets, large model sizes, and high-speed computation units such as Graphical Processing Units (GPUs). Benchmarks are presented that evaluate the efficiency of different approaches to batching of training data, unrolling of recurrent acoustic models, and device placement of TensorFlow variables and op- erations. An overall training architecture developed in light of those findings is then described. The approach makes it possi- ble to take advantage of both data parallelism and high speed computation on GPU for state-of-the-art sequence training of acoustic models. The effectiveness of the design is evaluated for different training schemes and model sizes, on a 20, 000 hour Voice Search task.