Olivier Siohan

Research Areas: Speech Processing

Authored Publications
    Large Scale Self-Supervised Pretraining for Active Speaker Detection
    Alice Chuang
    Keith Johnson
    Tony (Tuấn) Nguyễn
    Wei Xia
    Yunfan Ye
    ICASSP 2024 (2024) (to appear)
    In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. Compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that pretraining not only yields more stable supervised training, thanks to better audio-visual features used for initialization, but also improves ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
    In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.
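
    As a concrete illustration of the semiring construction mentioned in this abstract, the sketch below computes the entropy of the path distribution of a tiny lattice with a forward dynamic program over pairs ⟨probability mass, mass-weighted log-probability⟩. It is a minimal, hypothetical example in plain Python, not the open-source CTC/RNN-T implementation referenced above.

```python
# Minimal sketch of the entropy semiring (illustrative only; not the paper's
# open-source CTC/RNN-T implementation). Each semiring element is a pair
# <x, y>: x accumulates path probability mass, y accumulates
# sum over paths of w(path) * log w(path).
import math

def e_plus(a, b):
    """Semiring addition: elementwise sum of the two components."""
    return (a[0] + b[0], a[1] + b[1])

def e_times(a, b):
    """Semiring multiplication: <x1*x2, x1*y2 + x2*y1>."""
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def arc_weight(p):
    """Lift an arc probability p into the entropy semiring."""
    return (p, p * math.log(p))

def lattice_entropy(arcs, num_states, start, final):
    """Forward dynamic program over a lattice given in topological order.

    arcs: list of (src_state, dst_state, probability).
    Returns the entropy of the normalized distribution over complete paths.
    """
    zero, one = (0.0, 0.0), (1.0, 0.0)
    alpha = [zero] * num_states
    alpha[start] = one
    for src, dst, p in arcs:                 # assumes arcs are topologically sorted
        alpha[dst] = e_plus(alpha[dst], e_times(alpha[src], arc_weight(p)))
    z, w_logw = alpha[final]                 # <Z, sum_path w * log w>
    return math.log(z) - w_logw / z          # H = log Z - (1/Z) * sum w log w

# Toy 2-path lattice: 0 -> 1 -> 3 and 0 -> 2 -> 3 with unequal path weights.
arcs = [(0, 1, 0.7), (0, 2, 0.3), (1, 3, 1.0), (2, 3, 1.0)]
print(lattice_entropy(arcs, num_states=4, start=0, final=3))  # ~0.611 nats
```
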
    It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g., the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.
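
    The cascade idea can be pictured with a small sketch: route each utterance to the audio-visual model only when enough video is present, otherwise fall back to the audio-only model. The interfaces below (av_model, audio_only_model, the coverage threshold) are hypothetical placeholders, not the paper's actual implementation.

```python
# Illustrative sketch of a cascaded fallback for missing video (hypothetical
# interfaces; the paper does not prescribe this exact API). The idea: use the
# audio-visual model only when video frames are present, otherwise fall back
# to the audio-only model, so missing video can never make the system worse
# than the single-modality baseline.
from typing import Optional, Sequence

def cascade_transcribe(audio: Sequence[float],
                       video_frames: Optional[Sequence],
                       av_model,
                       audio_only_model,
                       min_video_coverage: float = 0.5) -> str:
    """Return a transcript, choosing the model per utterance.

    video_frames may be None or partially empty; coverage below the threshold
    triggers the audio-only fallback.
    """
    if video_frames:
        coverage = sum(f is not None for f in video_frames) / len(video_frames)
    else:
        coverage = 0.0
    if coverage >= min_video_coverage:
        return av_model.transcribe(audio, video_frames)
    return audio_only_model.transcribe(audio)
```
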
    Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap between speech recognition and active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training we reduce the gap in ASD classification accuracy by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
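
    A rough sketch of the multi-task objective described above: the total loss is a weighted sum of the ASR loss and an ASD classification loss computed over the competing face tracks. The loss weight and the toy numbers are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a joint multi-task objective of the kind described above
# (names, weighting, and numbers are illustrative, not the paper's exact
# formulation): a single encoder feeds both an ASR head and an ASD head, and
# the total loss is a weighted sum of the two task losses.
import numpy as np

def multitask_loss(asr_loss: float, asd_loss: float, asd_weight: float = 0.1) -> float:
    """L_total = L_asr + lambda * L_asd."""
    return asr_loss + asd_weight * asd_loss

# Example: combine a per-batch ASR loss with a per-frame speaker-detection
# cross-entropy over the candidate faces, averaged over frames.
asd_logits = np.array([[2.0, -1.0], [0.5, 0.3]])          # (frames, faces)
asd_labels = np.array([0, 1])                              # active face per frame
log_probs = asd_logits - np.log(np.exp(asd_logits).sum(axis=1, keepdims=True))
asd_ce = -log_probs[np.arange(len(asd_labels)), asd_labels].mean()
print(multitask_loss(asr_loss=42.7, asd_loss=asd_ce, asd_weight=0.1))
```
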
    Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. a 3D version of VGG). Recently, image transformer networks (Dosovitskiy et al., 2020) demonstrated the ability to extract rich visual features for the image classification task. In this work, we propose to replace the 3D convolution with a video transformer as the video feature extractor. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. We then evaluate the performance on a labeled subset of YouTube as well as on the public LRS3-TED corpus. Our best video-only model achieves 34.9% WER on YTDEV18 and 19.3% on LRS3-TED, a 10% and 9% relative improvement over the convolutional baseline. After fine-tuning our model, we achieve state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER).
    Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, in particular often relying on information conveyed by the motion of the speaker's mouth. The use of the visual signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system (Makino et al., 2019). This is traditionally done with some form of 3D convolutional network (e.g. VGG), as widely used in the computer vision community. Recently, video transformers (Dosovitskiy et al., 2020) have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual frontend typically used for AV-ASR and lip-reading tasks with a video transformer frontend. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based frontend shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer frontend performs as well as a VGG frontend for clean audio, but outperforms the VGG frontend when the audio is corrupted by noise.
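
    To make the frontend swap concrete, here is a hedged PyTorch sketch contrasting a VGG-style 3D-convolution frontend with a transformer frontend that attends over spatial patches of each frame. Layer sizes, patch size, and pooling choices are illustrative placeholders rather than the configuration used in these papers.

```python
# Hedged sketch (PyTorch) of the two visual frontends contrasted above. Shapes,
# layer sizes, and the patch-embedding scheme are illustrative assumptions,
# not the configuration used in the papers.
import torch
import torch.nn as nn

class Conv3DFrontend(nn.Module):
    """VGG-style 3D-convolution frontend: video -> per-frame feature vectors."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv3d(3, out_dim, kernel_size=(3, 7, 7), padding=(1, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # keep time, pool space

    def forward(self, video):                            # (B, 3, T, H, W)
        x = torch.relu(self.conv(video))
        x = self.pool(x).squeeze(-1).squeeze(-1)         # (B, out_dim, T)
        return x.transpose(1, 2)                         # (B, T, out_dim)

class TransformerFrontend(nn.Module):
    """Transformer frontend: per-frame patch embedding + attention over patches."""
    def __init__(self, out_dim: int = 256, num_layers: int = 4, num_heads: int = 4):
        super().__init__()
        self.embed = nn.Conv3d(3, out_dim, kernel_size=(1, 16, 16), stride=(1, 16, 16))
        layer = nn.TransformerEncoderLayer(d_model=out_dim, nhead=num_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video):                            # (B, 3, T, H, W)
        x = self.embed(video)                            # (B, D, T, H/16, W/16)
        b, d, t, h, w = x.shape
        tokens = x.permute(3, 4, 0, 2, 1).reshape(h * w, b * t, d)  # patches as sequence
        tokens = self.encoder(tokens)                    # self-attention over patches
        frame_feats = tokens.mean(dim=0)                 # (B*T, D), pooled over patches
        return frame_feats.reshape(b, t, d)              # (B, T, D)

feats = TransformerFrontend()(torch.randn(2, 3, 8, 64, 64))
print(feats.shape)                                       # torch.Size([2, 8, 256])
```
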
    This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers. The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for overlapping speakers. This work extends previous work on audio-only multi-talker ASR applied to two-party conversations in a call center application. It also extends work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating an attention-weighted combination of visual features in A/V multi-talker RNN-T models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped speech corpus.
    Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time. Their small size and minimal latency make them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal, with no future context. Nevertheless, non-streaming models can be used as teacher models to improve streaming ASR systems: an arbitrarily large set of unsupervised utterances is distilled from such teacher models so that streaming models can be trained on the generated labels. However, the performance gap between teacher and student word error rates (WER) remains high. In this paper, we propose to reduce this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, fusing RNN-T and CTC models makes stronger teachers, as they improve the performance of streaming student models. With this approach, we outperform a baseline streaming RNN-T trained from non-streaming RNN-T teachers by 27% to 42% depending on the language.
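
    For intuition, the snippet below shows the voting half of a ROVER-style combination in a heavily simplified form: it assumes the teacher hypotheses have already been word-aligned (real ROVER builds the alignment with dynamic programming first) and takes a per-slot majority vote to produce the label used to train the streaming student.

```python
# Greatly simplified stand-in for ROVER voting (illustrative only): real ROVER
# first builds a word transition network by dynamic-programming alignment of
# the hypotheses; here the hypotheses are assumed to be already aligned
# position by position (None marks a deletion) and a simple majority vote
# picks the combined word sequence.
from collections import Counter
from typing import List, Optional

def rover_vote(aligned_hyps: List[List[Optional[str]]]) -> List[str]:
    """Combine position-aligned hypotheses from several teacher models."""
    combined = []
    for slot in zip(*aligned_hyps):                      # one column per word slot
        winner, _ = Counter(slot).most_common(1)[0]
        if winner is not None:                           # a None winner means "delete"
            combined.append(winner)
    return combined

hyps = [
    ["play", "some", "jazz", None],
    ["play", "sum",  "jazz", "music"],
    ["play", "some", "jazz", "music"],
]
print(" ".join(rover_vote(hyps)))                        # "play some jazz music"
```
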
    Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it had traditionally been studied in isolation, assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments carried out over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Second, we show under closer scrutiny that the end-to-end model performs at least as well as a considerably larger two-step system connected with a hard decision boundary, under various noise conditions and numbers of parallel face tracks.
    Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face in the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen, one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face, while still showing the benefits of employing the visual signal instead of the audio alone.
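
    A minimal numpy sketch of the soft-selection mechanism: an audio-derived query scores each candidate face track per frame, a softmax over tracks yields attention weights, and the encoder consumes the weighted mixture of visual features. Dimensions and the dot-product scoring are assumptions for illustration, not the exact architecture.

```python
# Minimal numpy sketch of soft-selection over face tracks (dimensions and the
# scoring function are illustrative, not the paper's architecture): an
# audio-derived query attends over per-track visual embeddings, and the ASR
# encoder consumes the attention-weighted mixture instead of a single
# hard-selected face track.
import numpy as np

def soft_select_faces(audio_query: np.ndarray, track_feats: np.ndarray):
    """audio_query: (T, D); track_feats: (num_tracks, T, D).

    Returns the soft-selected visual features (T, D) and the per-frame
    attention weights (T, num_tracks).
    """
    # Scaled dot-product scores between the audio query and each track, per frame.
    scores = np.einsum("td,ktd->tk", audio_query, track_feats) / np.sqrt(audio_query.shape[-1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over tracks
    selected = np.einsum("tk,ktd->td", weights, track_feats)
    return selected, weights

rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 64))                       # 100 frames, 64-dim query
tracks = rng.normal(size=(3, 100, 64))                   # 3 candidate face tracks
feats, att = soft_select_faces(audio, tracks)
print(feats.shape, att.shape)                            # (100, 64) (100, 3)
```
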
    Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
    Basi Garcia
    Brendan Shillingford
    Yannis Assael
    Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (2019)
    This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (AV) dataset of segmented utterances extracted from public YouTube videos, leading to 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: an internal set of YouTube utterances (YouTube-AV-Dev-18) and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluate the performance of our system on the YouTube-AV-Dev-18 set artificially corrupted with additive background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.
    This paper describes the technical and system-building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that perform multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a WER reduction of over 18% relative compared to the current production system.
    Automatic Optimization of Data Perturbation Distributions for Multi-Style Training in Speech Recognition
    Mortaza Doulaty
    Proceedings of the IEEE 2016 Workshop on Spoken Language Technology (SLT2016)
    Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances are significantly different from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted, and potentially mismatched, training speech corpus to better match target domain utterances. This paper addresses the problem of determining the distribution of perturbation levels, for a given set of perturbation types, that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set. Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker-related variation such as frequency and temporal warping. The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study evaluates the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
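
    The sketch below illustrates how such an estimated distribution might be applied: each training utterance draws its perturbation levels (noise SNR, reverberation time, speed factor) from the empirical distribution estimated on the small target-domain sample. The perturbation types and probabilities shown are hypothetical placeholders, not values from the paper.

```python
# Illustrative sketch of applying an estimated perturbation-level distribution
# to a training corpus (perturbation types and probabilities below are
# placeholders, not values from the paper): each training utterance is assigned
# noise/reverberation/warping levels sampled from the empirical distribution
# estimated on the small target-domain sample.
import random

# Hypothetical empirical distribution over perturbation levels, as would be
# estimated from target-domain utterances.
PERTURBATION_DIST = {
    "snr_db":       {5: 0.2, 10: 0.5, 20: 0.3},
    "rt60_sec":     {0.2: 0.4, 0.5: 0.4, 0.8: 0.2},
    "speed_factor": {0.9: 0.25, 1.0: 0.5, 1.1: 0.25},
}

def sample_perturbation(rng: random.Random) -> dict:
    """Draw one perturbation configuration from the estimated distribution."""
    config = {}
    for name, dist in PERTURBATION_DIST.items():
        levels, probs = zip(*dist.items())
        config[name] = rng.choices(levels, weights=probs, k=1)[0]
    return config

rng = random.Random(0)
for utt_id in ["utt_001", "utt_002", "utt_003"]:
    print(utt_id, sample_perturbation(rng))
```
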
    While research has often shown that building dialect-specific automatic speech recognizers is the optimal approach to dealing with dialectal variations of the same language, we have observed that dialect-specific recognizers do not always output the best recognitions: often enough, another dialectal recognizer outputs a better recognition than the dialect-specific one. In this paper, we present two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers. We follow a machine learning approach, extracting features from the speech recognition output along with word embeddings, and use shallow neural networks for classification. Our experiments using Dictation and Voice Search data from the four main Arabic dialects show good WER improvements for the hypothesis selection scheme, reducing the WER by 2.1% to 12.1% depending on the test set, and promising results for the hypothesis combination scheme.
    Recently, Google launched YouTube Kids, a mobile application for children that uses a speech recognizer built specifically for recognizing children's speech. In this paper we present the techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, and the filtering of language modeling data to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNNs). We found that a CLDNN acoustic model outperforms an LSTM across a variety of conditions, but does not model child speech relatively better than adult speech. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.
    In this paper we construct a data set for semi-supervised acoustic model training by selecting spoken utterances from a massive collection of anonymized Google Voice Search utterances. Semi-supervised training usually retains high-confidence utterances, which are presumed to have an accurate hypothesized transcript, a necessary condition for successful training. Selecting high-confidence utterances can, however, restrict the diversity of the resulting data set. We propose to introduce a constraint enforcing that the distribution of context-dependent state symbols, obtained by running forced alignment of the hypothesized transcript, matches a reference distribution estimated from a curated development set. The quality of the resulting training set is illustrated on large-scale Voice Search recognition experiments, where it outperforms random selection of high-confidence utterances.
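
    One simple way to realize the distribution-matching constraint is a greedy selection that, at each step, adds the utterance whose context-dependent state counts bring the selected set closest (in KL divergence) to the reference distribution. The sketch below illustrates that idea on toy data; it is not the exact selection procedure used in the paper.

```python
# Simple greedy sketch of the distribution-matching idea described above (not
# the paper's exact procedure): from a pool of high-confidence utterances,
# repeatedly pick the utterance whose context-dependent state counts move the
# selected set's state distribution closest (in KL divergence) to the reference
# distribution estimated on a curated development set.
import numpy as np

def kl_to_reference(counts: np.ndarray, reference: np.ndarray, eps: float = 1e-9) -> float:
    """KL(reference || normalized counts), with smoothing to avoid division by zero."""
    p = reference
    q = (counts + eps) / (counts.sum() + eps * len(counts))
    return float(np.sum(p * np.log(p / q)))

def greedy_select(utt_state_counts: np.ndarray, reference: np.ndarray, num_utts: int):
    """utt_state_counts: (num_pool, num_states) CD-state counts per utterance."""
    selected, totals = [], np.zeros(utt_state_counts.shape[1])
    remaining = set(range(len(utt_state_counts)))
    for _ in range(num_utts):
        best = min(remaining,
                   key=lambda i: kl_to_reference(totals + utt_state_counts[i], reference))
        selected.append(best)
        totals += utt_state_counts[best]
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
pool = rng.integers(0, 20, size=(200, 8)).astype(float)   # toy pool, 8 CD states
reference = np.full(8, 1 / 8)                              # uniform target distribution
print(greedy_select(pool, reference, num_utts=5))
```
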
    A big data approach to acoustic model training corpus selection
    John Alex
    Conference of the International Speech Communication Association (Interspeech) (2014)
    Deep neural networks (DNNs) have recently become the state-of-the-art technology in speech recognition systems. In this paper we propose a new approach to constructing large, high-quality unsupervised sets to train DNN models for large vocabulary speech recognition. The core of our technique consists of two steps. We first redecode speech logged by our production recognizer with a very accurate (and hence too slow for real-time usage) set of speech models to improve the quality of the ground-truth transcripts used for training alignments. Using confidence scores, transcript length, and transcript flattening heuristics designed to cull salient utterances from three decades of speech per language, we then carefully select training data sets consisting of up to 15K hours of speech to be used to train acoustic models without any reliance on manual transcription. We show that this approach yields models with approximately 18K context-dependent states that achieve a 10% relative improvement in large vocabulary dictation and voice-search systems for Brazilian Portuguese, French, Italian and Russian.
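
    The selection heuristics can be illustrated with a small filter over redecoded utterances, keeping only those with high confidence and a plausible transcript length. Thresholds and the omitted transcript-flattening step are assumptions for illustration only.

```python
# Hedged sketch of confidence- and length-based culling of redecoded utterances
# (thresholds and field names are illustrative; the paper also applies a
# transcript-flattening heuristic that is not reproduced here).
def keep_utterance(confidence: float, transcript: str,
                   min_confidence: float = 0.9,
                   min_words: int = 3, max_words: int = 30) -> bool:
    """Retain an utterance only if the redecoded transcript looks trustworthy."""
    num_words = len(transcript.split())
    return confidence >= min_confidence and min_words <= num_words <= max_words

utterances = [
    (0.97, "set a timer for ten minutes"),
    (0.55, "uh maybe the thing"),            # low confidence: dropped
    (0.99, "ok"),                            # too short: dropped
]
selected = [t for c, t in utterances if keep_utterance(c, t)]
print(selected)
```
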
    In large vocabulary continuous speech recognition, decision trees are widely used to cluster triphone states. In addition to the commonly used phonetically based questions, others have proposed additional questions such as phone position within the word or syllable. This paper examines using the word or syllable context itself as a feature in the decision tree, providing an elegant way of introducing word- or syllable-specific models into the system. Positive results are reported on two state-of-the-art systems, voicemail transcription and search by voice, across a variety of acoustic model and training set sizes.
    An Audio Indexing System for Election Video Material
    Christopher Alberti
    Ari Bezman
    Anastassia Drofa
    Ted Power
    Arnaud Sahuguet
    Maria Shugrina
    Proceedings of ICASSP (2009), pp. 4873-4876
    In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material, makes the content searchable (by indexing the meta-data and transcripts of the videos), and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product. Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of the system.
    The IBM 2007 speech transcription system for European parliamentary speeches
    Bhuvana Ramabhadran
    Abhinav Sethy
    ASRU (2007), pp. 472-477
    Vocabulary independent spoken term detection
    Jonathan Mamou
    Bhuvana Ramabhadran
    SIGIR (2007), pp. 615-622
    Comments on Vocal Tract Length Normalization Equals Linear Transformation in Cepstral Space
    Mohamed Afify
    IEEE Transactions on Audio, Speech & Language Processing, vol. 15 (2007), pp. 1731-1732
    The IBM 2006 speech transcription system for European parliamentary speeches
    Bhuvana Ramabhadran
    Lidia Mangu
    Geoffrey Zweig
    Martin Westphal
    Henrik Schulz
    Alvaro Soneiro
    INTERSPEECH (2006)
    The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings
    Jing Huang
    Martin Westphal
    Stanley F. Chen
    Daniel Povey
    Vit Libal
    Alvaro Soneiro
    Henrik Schulz
    Thomas Ross
    Gerasimos Potamianos
    MLMI (2006), pp. 432-443
    Automated Quality Monitoring for Call Centers using Speech and NLP Technologies
    Geoffrey Zweig
    George Saon
    Bhuvana Ramabhadran
    Daniel Povey
    Lidia Mangu
    Brian Kingsbury
    HLT-NAACL (2006)
    Fast vocabulary-independent audio search using path-based graph indexing
    INTERSPEECH (2005), pp. 53-56
    A new verification-based fast-match for large vocabulary continuous speech recognition
    Mohamed Afify
    Feng Liu
    Hui Jiang
    IEEE Transactions on Speech and Audio Processing, vol. 13 (2005), pp. 546-553
    Use of metadata to improve recognition of spontaneous speech and named entities
    Bhuvana Ramabhadran
    Geoffrey Zweig
    INTERSPEECH (2004)
    Sequential estimation with optimal forgetting for robust speech recognition
    Mohamed Afify
    IEEE Transactions on Speech and Audio Processing, vol. 12 (2004), pp. 19-26
    Speech recognition error analysis on the English MALACH corpus
    Bhuvana Ramabhadran
    Geoffrey Zweig
    INTERSPEECH (2004)
    Hierarchical class n-gram language models: towards better estimation of unseen events in speech recognition
    Imed Zitouni
    Chin-Hui Lee
    INTERSPEECH (2003)
    Advances in natural language call routing
    Hong-Kwang Jeff Kuo
    Joseph P. Olive
    Bell Labs Technical Journal, vol. 7 (2003), pp. 155-170
    Backoff hierarchical class n-gram language modelling for automatic speech recognition systems
    Imed Zitouni
    Hong-Kwang Jeff Kuo
    Chin-Hui Lee
    INTERSPEECH (2002)
    A discriminative training criterion and an associated EM learning algorithm
    Mohamed Afify
    ICASSP (2002), pp. 1065-1068
    Bell labs approach to Aurora evaluation on connected digit recognition
    Jingdong Chen
    Dimitris Dimitriadis
    Hui Jiang
    Qi Li
    Tor André Myrvoll
    Frank K. Soong
    INTERSPEECH (2002)
    Structural maximum a posteriori linear regression for fast HMM adaptation
    Tor André Myrvoll
    Chin-Hui Lee
    Computer Speech & Language, vol. 16 (2002), pp. 5-24
    A dynamic in-search discriminative training approach for large vocabulary speech recognition
    Hui Jiang
    Frank K. Soong
    Chin-Hui Lee
    ICASSP (2002), pp. 113-116
    Towards knowledge-based features for HMM based large vocabulary automatic speech recognition
    Benoit Launay
    Arun C. Surendran
    Chin-Hui Lee
    ICASSP (2002), pp. 817-820
    Upper and lower bounds on the mean of noisy speech: application to minimax classification
    Mohamed Afify
    Chin-Hui Lee
    IEEE Transactions on Speech and Audio Processing, vol. 10 (2002), pp. 79-88
    Minimax classification with parametric neighborhoods for noisy speech recognition
    Mohamed Afify
    Chin-Hui Lee
    INTERSPEECH (2001), pp. 2355-2358
    Joint maximum a posteriori adaptation of transformation and HMM parameters
    Cristina Chesta
    Chin-Hui Lee
    IEEE Transactions on Speech and Audio Processing, vol. 9 (2001), pp. 417-428
    A new verification-based fast match approach to large vocabulary speech recognition
    Feng Liu
    Mohamed Afify
    Hui Jiang
    INTERSPEECH (2001), pp. 851-854
    An auditory system-based feature for robust speech recognition
    Qi Li
    Frank K. Soong
    INTERSPEECH (2001), pp. 619-622
    A real-time Japanese broadcast news closed-captioning system
    Akio Ando
    Mohamed Afify
    Hui Jiang
    Chin-Hui Lee
    Qi Li
    Feng Liu
    Kazuo Onoe
    Frank K. Soong
    Qiru Zhou
    INTERSPEECH (2001), pp. 495-498
    Evaluating the Aurora connected digit recognition task - a bell labs approach
    Mohamed Afify
    Hui Jiang
    Filipp Korkmazskiy
    Chin-Hui Lee
    Qi Li
    Frank K. Soong
    Arun C. Surendran
    INTERSPEECH (2001), pp. 633-636
    Small group speaker identification with common password phrases
    Aaron E. Rosenberg
    S. Parthasarathy
    Speech Communication, vol. 31 (2000), pp. 131-140
    Constrained maximum likelihood linear regression for speaker adaptation
    Mohamed Afify
    INTERSPEECH (2000), pp. 861-864
    Extended maximum a posterior linear regression (EMAPLR) model adaptation for speech recognition
    Wu Chou
    Tor André Myrvoll
    Chin-Hui Lee
    INTERSPEECH (2000), pp. 616-619
    Structural maximum a-posteriori linear regression for unsupervised speaker adaptation
    Tor André Myrvoll
    Chin-Hui Lee
    Wu Chou
    INTERSPEECH (2000), pp. 540-543
    A high-performance auditory feature for robust speech recognition
    Qi Li
    Frank K. Soong
    INTERSPEECH (2000), pp. 51-54
    Maximum a posteriori linear regression for hidden Markov model adaptation
    Cristina Chesta
    Chin-Hui Lee
    EUROSPEECH (1999)
    Comparative experiments of several adaptation approaches to noisy speech recognition using stochastic trajectory models
    Yifan Gong
    Jean Paul Haton
    Speech Communication, vol. 18 (1996), pp. 335-352
    Noise adaptation using linear regression for continuous noisy speech recognition
    Yifan Gong
    Jean Paul Haton
    EUROSPEECH (1995)
    A comparison of three noisy speech recognition approaches
    Yifan Gong
    Jean Paul Haton
    ICSLP (1994)
    A Bayesian approach to phone duration adaptation for Lombard speech recognition
    Yifan Gong
    Jean Paul Haton
    EUROSPEECH (1993)
    Minimization of speech alignment error by iterative transformation for speaker adaptation
    Yifan Gong
    Jean Paul Haton
    ICSLP (1992)