An Online Sequence-to-Sequence Model Using Partial Conditioning


Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a new model that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, our method computes the next-step distribution conditioned on the partial input sequence observed and the partial sequence generated. It accomplishes this goal using an encoder recurrent neural network (RNN) that computes features at the same frame rate as the input, and a transducer RNN that operates over blocks of input steps. The transducer RNN extends the sequence produced so far using a local sequence-to-sequence model. During training, our method uses alignment information to generate supervised targets for each block. Approximate alignment is easily available for tasks such as speech recognition, action recognition in videos, etc. During inference (decoding), beam search is used to find the most likely output sequence for an input sequence. This decoding is performed online - at the end of each block, the best candidates from the previous block are extended through the local sequence-to-sequence model. On TIMIT, our online method achieves 19.8% phone error rate (PER). For comparison with published sequence-to-sequence methods, we used a bidirectional encoder and achieved 18.7% PER compared to 17.6% from the best reported sequence-to-sequence model. Importantly, unlike sequence-to-sequence our model is minimally impacted by the length of the input. On artificially created longer utterances, it achieves 20.9% with a unidirectional model, compared to 20% from the best bidirectional sequence-to-sequence models.