Parisa Haghani

Authored Publications
Second-pass rescoring is a well-known technique for improving the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from the N-best list by integrating information from multiple sources, such as the input acoustic representations, the N-best hypotheses, additional first-pass statistics, and unpaired textual information through an external language model, has shown success in rescoring for RNN-T first-pass models. Multilingual first-pass speech recognition models often outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and applying rescoring on top of a multilingual first-pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian, and Swedish.
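
The hypothesis-selection step can be pictured with a minimal sketch, not the NOS model itself: each first-pass hypothesis is scored by interpolating its first-pass score with a score from a second-pass model, and the best combined score wins. The `second_pass_score` callable and the interpolation weight below are assumptions for illustration only.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],             # (hypothesis text, first-pass log-score)
    second_pass_score: Callable[[str], float],  # e.g. a (multilingual) rescoring model
    weight: float = 0.5,
) -> str:
    """Return the hypothesis with the best interpolated first/second-pass score."""
    best_hyp, best_score = nbest[0][0], float("-inf")
    for hyp, first_pass in nbest:
        score = (1.0 - weight) * first_pass + weight * second_pass_score(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```
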
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. When trained on related or low-resource languages, these models often outperform their monolingual counterparts. As with other forms of multi-task models, when the group of languages is unrelated, or when large amounts of training data are available, multilingual models can suffer from performance loss. We investigate the use of a mixture-of-experts approach to assign per-language parameters in the model and increase network capacity in a structured fashion. We introduce a novel variant of this approach, 'informed experts', which attempts to tackle inter-task conflicts by eliminating gradients from other tasks in these task-specific parameters. We conduct experiments on a real-world task covering English, French, and four dialects of Arabic to show the effectiveness of our approach.
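
A minimal sketch of the informed-experts idea, assuming a PyTorch layer with one small feed-forward expert per language: the known language id routes each utterance to exactly one expert, so an expert's parameters only ever receive gradients from its own language. The class name, dimensions, and residual layout are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn


class InformedExpertLayer(nn.Module):
    """Shared projection followed by one small expert per language."""

    def __init__(self, dim: int, num_languages: int):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_languages))

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, dim]; lang_id: [batch] integer language indices.
        h = torch.relu(self.shared(x))
        out = torch.zeros_like(h)
        for i, expert in enumerate(self.experts):
            mask = (lang_id == i).view(-1, 1, 1).to(h.dtype)
            # Routing is "informed" by the known language: expert i only processes
            # (and therefore only receives gradients from) utterances of language i.
            out = out + mask * expert(h)
        return x + out  # residual connection around the expert block
```
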
With a large share of the world's population speaking more than one language, multilingual automatic speech recognition (ASR) has gained popularity in recent years. While lower-resource languages can benefit from quality improvements in a multilingual ASR system, including unrelated or higher-resource languages in the mix often results in performance degradation. In this paper, we propose distilling from multiple teachers, with each language using its best teacher during training, to tackle this problem. We introduce self-adaptive distillation, a novel technique for automatically weighting the distillation loss using the student's and teachers' confidences. We analyze the effectiveness of the proposed techniques on two real-world use cases and show that the performance of multilingual ASR models can be improved by up to 11.5% without any increase in model capacity. Furthermore, we show that when our methods are combined with an increase in model capacity, we can achieve quality gains of up to 20.7%.
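
A minimal sketch of confidence-weighted distillation, assuming per-frame posteriors from a language-specific teacher and a multilingual student: the distillation term is weighted more where the teacher is confident and the student is not. The particular weighting (teacher confidence minus student confidence, clamped to [0, 1]) is an illustrative choice, not the paper's exact formula.

```python
import torch.nn.functional as F


def self_adaptive_distillation_loss(student_logits, teacher_logits, targets):
    # student_logits / teacher_logits: [batch, frames, classes]; targets: [batch, frames].
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets, reduction="none")

    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)  # per frame

    # Confidence = probability assigned to the most likely class.
    student_conf = log_p_student.exp().max(dim=-1).values
    teacher_conf = p_teacher.max(dim=-1).values
    # Trust the teacher more where it is confident and the student is not.
    w = (teacher_conf - student_conf).clamp(min=0.0, max=1.0).detach()

    return (ce + w * kl).mean()
```
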
Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu, Brian Farris, Hainan Xu, Han Lu, Qian Zhang
Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from several perspectives: they can provide quality improvements, especially for lower-resource languages, and they simplify the training and deployment procedure. End-to-end speech recognition has further simplified multilingual modeling, as a single model, rather than the several components of a classical system, has to be built and maintained. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id, and we analyze the trade-offs of each method with regard to quality gains and the number of additional parameters introduced. We conduct experiments on a real-world task consisting of five languages. Our experimental results demonstrate ~10% and ~15% relative gains over the baseline multilingual model.
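
One lightweight way to condition self-attention on a language id, shown here as a hedged sketch rather than any of the paper's specific parameterizations: a learned language embedding is injected before a standard multi-head self-attention block. The class name and the embedding-plus-residual layout are assumptions; a streaming model would additionally need a causal attention mask, omitted here.

```python
import torch
import torch.nn as nn


class LanguageConditionedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_languages: int):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, dim]; lang_id: [batch] integer language indices.
        lang = self.lang_embed(lang_id).unsqueeze(1)   # [batch, 1, dim]
        h = self.norm(x + lang)                        # inject the language id
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out                                 # residual connection
```
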
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. Depending on the amount of training data and the relatedness of the languages, these models can outperform their monolingual counterparts. However, the performance of these models relies heavily on an externally provided language id, which is used to augment the input features or to modulate the network's per-layer outputs through a language gate. In this paper, we introduce a novel technique for inferring the language id in a streaming fashion using the RNN-T loss, eliminating the reliance on knowing the utterance's language. We conduct experiments on two sets of languages, Arabic and Nordic, and show the effectiveness of our approach.
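
One simple way such an idea could be realized, sketched under assumptions that are not necessarily the paper's recipe: reserve one output label per language and prefix each utterance's RNN-T target sequence with its language token, so the model learns to emit the language id early in the stream and downstream components can condition on it. The vocabulary size, locale codes, and helper below are illustrative.

```python
from typing import List

# Reserve one label id per language, after the regular output vocabulary.
VOCAB_SIZE = 4096
LANG_TOKENS = {"ar-EG": VOCAB_SIZE, "ar-SA": VOCAB_SIZE + 1, "da-DK": VOCAB_SIZE + 2}


def add_language_token(targets: List[int], language: str) -> List[int]:
    """Prefix an utterance's RNN-T targets with its language token (training only)."""
    return [LANG_TOKENS[language]] + targets
```
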
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top-N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio-to-semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves prediction accuracy.
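
A highly simplified sketch of the joint-optimization idea only, not the paper's encoder-decoder architectures: a shared audio encoder feeds a CTC transcript head and an utterance-level intent head, and the two losses are summed. All module shapes, vocabulary sizes, and the loss mix below are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F


class JointAudioToSemantics(nn.Module):
    """Shared audio encoder with a CTC transcript head and an intent head."""

    def __init__(self, feat_dim=80, hidden=512, text_vocab=4096, num_intents=64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.text_head = nn.Linear(hidden, text_vocab + 1)   # +1 for the CTC blank
        self.intent_head = nn.Linear(hidden, num_intents)

    def forward(self, feats):
        enc, _ = self.encoder(feats)                          # [batch, frames, hidden]
        text_logprobs = self.text_head(enc).log_softmax(-1)
        intent_logits = self.intent_head(enc.mean(dim=1))     # pool over time
        return text_logprobs, intent_logits


def joint_loss(model, feats, feat_lens, text_targets, text_lens, intents, alpha=0.5):
    text_logprobs, intent_logits = model(feats)
    # CTC expects [frames, batch, classes]; the blank is the last class here.
    l_text = F.ctc_loss(
        text_logprobs.transpose(0, 1), text_targets, feat_lens, text_lens,
        blank=text_logprobs.size(-1) - 1, zero_infinity=True,
    )
    l_intent = F.cross_entropy(intent_logits, intents)
    return alpha * l_text + (1.0 - alpha) * l_intent
```
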
Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined by factors like application, sampling rate, and codec. When such recognizers are used in conditions that do not match the training domain, performance drops significantly. In this paper, we explore the idea of building a single domain-invariant model that works well for varied use cases. We do this by combining large-scale training data from multiple application domains. Our final system is trained on 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and varying sampling rates. Our results show that, even at such a scale, a model trained this way works almost as well as models fine-tuned to specific subsets: a single model can be trained to be robust to multiple application domains and to other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains: using as little as 10 hours of data to adapt a domain-invariant model to a new domain, we can match the performance of a domain-specific model trained from scratch on roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need to be addressed in future work.
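
A minimal sketch of the multi-condition training idea: each clean 16 kHz training utterance is randomly distorted (additive noise at a sampled SNR, and an optional narrowband round-trip to mimic a telephony/codec path) before feature extraction. The distortion set, SNR range, and probabilities below are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy.signal import resample_poly


def distort(utterance: np.ndarray, noise: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly distort one clean 16 kHz utterance before feature extraction."""
    # Mix in background noise at an SNR drawn uniformly from [0, 20] dB.
    snr_db = rng.uniform(0.0, 20.0)
    seg = np.resize(noise, len(utterance))          # tile/crop noise to length
    speech_pow = np.mean(utterance ** 2) + 1e-10
    noise_pow = np.mean(seg ** 2) + 1e-10
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    noisy = utterance + scale * seg

    # With probability 0.5, simulate a narrowband (8 kHz) channel by
    # downsampling and upsampling again.
    if rng.random() < 0.5:
        noisy = resample_poly(resample_poly(noisy, 1, 2), 2, 1)[: len(utterance)]
    return noisy
```
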
Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we treat speech data collected for different applications as separate domains and investigate the robustness of acoustic models trained on multi-domain data when applied to unseen domains. Specifically, we use the Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method than selectively fine-tuning part of the network, without dramatically increasing the number of model parameters. Furthermore, we find that using singular value decomposition to initialize the low-rank bases of an FHL model leads to faster convergence and improved performance.
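
A minimal sketch of an FHL-style layer, under assumed shapes: the hidden-layer weight is a shared matrix plus a domain-weighted combination of low-rank bases, so adapting to a new domain can update only the small domain vector. The class name, basis count, and random initialization are illustrative (the paper's SVD-based initialization is noted but not implemented here).

```python
import torch
import torch.nn as nn


class FactorizedHiddenLayer(nn.Module):
    """Shared weight plus a domain-weighted sum of low-rank bases."""

    def __init__(self, dim: int, rank: int, num_bases: int):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        # Low-rank bases B_i = U_i @ V_i. The paper initializes these from an
        # SVD-based scheme; random initialization is used here for brevity.
        self.U = nn.Parameter(0.01 * torch.randn(num_bases, dim, rank))
        self.V = nn.Parameter(0.01 * torch.randn(num_bases, rank, dim))
        # Domain-dependent interpolation vector (one per target domain in practice).
        self.domain_weights = nn.Parameter(torch.zeros(num_bases))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bases = torch.einsum("bdr,bre->bde", self.U, self.V)           # [bases, dim, dim]
        delta = torch.einsum("b,bde->de", self.domain_weights, bases)  # [dim, dim]
        return torch.relu(self.shared(x) + x @ delta)
```
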
We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1,400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and sMBR loss using asynchronous stochastic gradient descent (ASGD) on a parallel computation infrastructure for large-scale training. With feature frames computed every 30 ms, our acoustic models are well suited to syllable-level modeling compared to phonemes, which can have a shorter duration. Additionally, compared to word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform as well as context-independent (CI) phone-output models and, under certain circumstances, can beat the performance of our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than with CI models, and vastly faster than with CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.
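
A minimal sketch of syllable-output acoustic modeling with CTC, assuming an LSTM over acoustic frames emitting a distribution over roughly 1,400 Mandarin syllables plus a blank symbol; the sMBR sequence-training stage is not shown. Feature dimension, layer sizes, and the exact vocabulary are illustrative.

```python
import torch
import torch.nn as nn

NUM_SYLLABLES = 1400   # approximate size of the Mandarin syllable inventory
BLANK = NUM_SYLLABLES  # reserve the last index for the CTC blank


class SyllableCTCModel(nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 512, layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, NUM_SYLLABLES + 1)   # syllables + blank

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(feats)                 # [batch, frames, hidden]
        return self.out(h).log_softmax(-1)      # per-frame syllable log-probs


# CTC training criterion (expects log-probs shaped [frames, batch, classes]).
ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)
```
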