Speech recognition for medical conversations

Chung-Cheng Chiu
Kat Chou
Chris Co
Navdeep Jaitly
Diana Jaunzeikare
Patrick Nguyen
Ananth Sankar
Justin Jesada Tansuwan
Nathan Wan
Frank Zhang
Interspeech 2018 (2018)

Abstract

In this paper we document our experiences with developing speech recognition for Medical Transcription -- a system that automatically transcribes notes from doctor-patient conversations. Towards this goal, we built a system along two different methodological lines -- a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen Attend and Spell (LAS) model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and the development of a matched language model were essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models were able to achieve a word error rate of 20.1%, and the LAS models were able to achieve 18.5%.
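
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring pipeline used in the paper; the function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    text normalization (casing, punctuation, numerals) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one deleted word out of five reference words.
    print(word_error_rate("the patient has a fever", "the patient has fever"))
```

Under this definition, a word error rate of 18.5% corresponds to roughly 18 to 19 word-level errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.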