High quality agreement-based semi-supervised training data for acoustic modeling


This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Further more, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets.