Michael Riley
Michael Riley has a B.S., M.S., and Ph.D. from MIT, all in computer science. He began his career at Bell Labs and AT&T Labs, where, together with Mehryar Mohri and Fernando Pereira, he introduced and developed the theory and use of weighted finite-state transducers (WFSTs) in speech and language processing. He is currently a Distinguished Research Scientist at Google, Inc. His interests include speech and natural language processing, machine learning, and information retrieval. He is a principal author of the OpenFst library. He manages a group whose expertise includes speech recognition and synthesis, NLP, information retrieval, image processing, algorithms, machine learning, and privacy. He is an IEEE and ISCA Fellow.
Authored Publications
On Weight Interpolation of the Hybrid Autoregressive Transducer Model
Interspeech 2022 (to appear)
This paper explores ways to improve a two-pass speech recognition system in which the first pass is a hybrid autoregressive transducer model and the second pass is a neural language model. The main focus is on the scores provided by each of these models, their quantitative analysis, how to improve them, and the best way to integrate them with the objective of better recognition accuracy. Several analyses are presented to show the importance of the choice of integration weights for combining the first-pass and second-pass scores. A sequence-level weight estimation model along with four training criteria is proposed to allow adaptive integration of the scores per acoustic sequence. The effectiveness of this algorithm is demonstrated by constructing and analyzing models on the Librispeech data set.
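As a minimal illustration of the score-integration idea, a fixed weight can linearly interpolate per-hypothesis log scores from the two passes before picking the best hypothesis. The hypotheses, scores, and weight below are invented for the sketch; the paper's contribution is estimating such weights adaptively per acoustic sequence.

```python
def combine_scores(first_pass, second_pass, weight):
    """Linearly interpolate per-hypothesis log scores from two models.

    `first_pass` and `second_pass` map hypothesis strings to log scores;
    `weight` is the interpolation weight given to the first pass.
    """
    return {
        hyp: weight * first_pass[hyp] + (1.0 - weight) * second_pass[hyp]
        for hyp in first_pass
    }

# Hypothetical scores for two competing transcripts.
first = {"play the song": -4.2, "play this on": -4.0}
second = {"play the song": -3.1, "play this on": -5.6}

rescored = combine_scores(first, second, weight=0.4)
best = max(rescored, key=rescored.get)
```

With these numbers the second-pass preference wins out and `best` is "play the song".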
We introduce a framework for adapting a virtual keyboard to individual user behavior by modifying a Gaussian spatial model to use personalized key center offset means and, optionally, learned covariances. Through numerous real-world studies, we determine the importance of training data quantity and weights, as well as the number of clusters into which to group keys to avoid overfitting. While past research has shown the potential of this technique using artificially-simple virtual keyboards and games or fixed typing prompts, we demonstrate effectiveness using the highly-tuned Gboard app with a representative set of users and their real typing behaviors. Across a variety of top languages, we achieve small-but-significant improvements in both typing speed and decoder accuracy.
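A minimal sketch of the spatial-model idea: score each tap under a per-key isotropic Gaussian whose center is shifted by a user-specific learned offset, then decode the most likely key. The key geometry, offsets, and variance below are invented for illustration; the actual model also supports learned covariances.

```python
import math

def key_log_likelihood(tap, center, offset, var=1.0):
    """Log-likelihood of a tap under an isotropic Gaussian centered at the
    key's nominal center shifted by the user's learned offset."""
    dx = tap[0] - (center[0] + offset[0])
    dy = tap[1] - (center[1] + offset[1])
    return -0.5 * (dx * dx + dy * dy) / var - math.log(2 * math.pi * var)

# Hypothetical key centers and per-user offsets (learned from typing data).
centers = {"a": (0.0, 0.0), "s": (1.0, 0.0)}
offsets = {"a": (0.2, 0.0), "s": (-0.1, 0.0)}

def decode_tap(tap):
    """Return the key whose personalized Gaussian best explains the tap."""
    return max(centers, key=lambda k: key_log_likelihood(tap, centers[k], offsets[k]))
```

For a tap at (0.5, 0.0), the personalized center for "a" is closer than the one for "s", so the decoder returns "a".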
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
James Qin
Quoc-Nam Le-The
Anmol Gulati
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus, E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
Approximating probabilistic models as weighted finite automata
Vlad Schogol
Computational Linguistics, vol. 47 (2021), pp. 221-254
Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library.
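A crude sketch of the counting step: draw sequences from the source model and normalize bigram counts into a deterministic automaton whose states are the previous symbol. This empirical-count stand-in is only illustrative; the paper's algorithm minimizes KL divergence via expected counts and a difference-of-convex optimization, and the sample sequences below are invented.

```python
from collections import defaultdict

def bigram_wfa_from_samples(samples):
    """Estimate a bigram automaton from sequences sampled from an
    arbitrary source model. States are the previous symbol; arc weights
    are normalized counts over the outgoing arcs of each state."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in samples:
        prev = "<s>"
        for sym in list(seq) + ["</s>"]:
            counts[prev][sym] += 1.0
            prev = sym
    # Normalize so each state's outgoing arc weights sum to one.
    return {
        state: {sym: c / sum(nexts.values()) for sym, c in nexts.items()}
        for state, nexts in counts.items()
    }

wfa = bigram_wfa_from_samples(["ab", "ab", "ac"])
```

Here state "a" assigns probability 2/3 to "b" and 1/3 to "c", matching the sample frequencies.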
Hybrid Autoregressive Transducer (HAT)
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6139-6143
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
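A sketch of how a HAT-style score can be combined with an external LM at inference: subtract a weighted estimate of the internal LM score and add a weighted external LM score. The weights and scores below are invented for illustration, not values from the paper.

```python
def hat_rescore(hat_score, internal_lm_score, external_lm_score,
                ilm_weight=0.1, elm_weight=0.5):
    """Combined log score for a hypothesis when decoding a HAT-style
    model with an external LM: discount the model's internal LM
    contribution and credit the external LM, each with its own weight."""
    return (hat_score
            - ilm_weight * internal_lm_score
            + elm_weight * external_lm_score)

score = hat_rescore(-10.0, -4.0, -3.0)
```

Because the internal LM score is subtracted, a hypothesis the internal LM already favored gains less from the external LM than one it did not.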
Distilling weighted finite automata from arbitrary probabilistic models
Vlad Schogol
Proceedings of FSMNLP (2019), pp. 87-97
Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization, both of which can be performed efficiently. We demonstrate the usefulness of our approach on some tasks including distilling n-gram models from neural models.
Federated Learning of N-gram Language Models
Adeline Wong
The SIGNLL Conference on Computational Natural Language Learning (2019)
We propose algorithms to train production-quality n-gram language models using federated learning. Federated learning is a machine learning technique to train global models to be used on portable devices such as smart phones, without the users' data ever leaving their devices. This is especially relevant for applications handling privacy-sensitive data, such as virtual keyboards. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by n-gram language models, mostly for latency reasons.
We propose to train a recurrent neural network language model using the decentralized "FederatedAveraging" algorithm directly on user devices, and to approximate this federated model server-side with an n-gram model that can be deployed to devices for fast inference.
Our technical contributions include novel ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite state transducer algorithms to convert word language models to word-piece language models and vice versa.
The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard.
Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that high-quality n-gram language models can be trained directly on client mobile devices without sensitive training data ever leaving the device.
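The core of FederatedAveraging can be sketched in a few lines: the server averages client model parameters, weighting each client by its number of local training examples. The parameter vectors and client sizes below are invented; real rounds also sample clients and run local gradient steps first.

```python
def federated_average(client_weights, client_sizes):
    """One aggregation step of FederatedAveraging: average client model
    parameters, weighting each client by its example count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical clients with 3 and 1 local examples respectively.
avg = federated_average([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

The client with more data pulls the average toward its parameters, giving [0.75, 0.25] here.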
Latin script keyboards for South Asian languages with finite-state normalization
Lawrence Wolf-Sonkin
Vlad Schogol
Proceedings of FSMNLP (2019), pp. 108-117
The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script. We explore several compact finite-state architectures that permit variable spellings of words during mobile text entry. We find that approaches making use of transliteration transducers provide large accuracy improvements over baselines, but that simpler approaches involving a compact representation of many attested alternatives yields much of the accuracy gain. This is particularly important when operating under constraints on model size (e.g., on inexpensive mobile devices with limited storage and memory for keyboard models), and on speed of inference, since people typing on mobile keyboards expect no perceptual delay in keyboard responsiveness.
Algorithms for Weighted Finite Automata with Failure Transitions
International Conference on Implementation and Application of Automata (CIAA) (2018), pp. 46-58
In this paper we extend some key weighted finite automata (WFA) algorithms to automata with failure transitions (phi-WFAs). Failure transitions, which are taken only when no immediate match is possible at a given state, are used to compactly represent automata and have many applications. An efficient intersection algorithm and a shortest distance algorithm (over R+) are presented, as well as a related algorithm to remove failure transitions from a phi-WFA.
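The failure-transition semantics can be sketched directly: to read a symbol from a state, follow phi (failure) arcs, which consume no input, until a state with a matching arc is found. The toy automaton below is invented for illustration; state 1 falls back to state 0, as in a backoff n-gram model.

```python
def phi_next_state(arcs, phi, state, symbol):
    """Follow the transition for `symbol` from `state`, taking failure
    (phi) transitions until a match is found. `arcs[state]` maps input
    symbols to next states; `phi[state]` is the failure successor,
    absent at states with no phi arc."""
    while symbol not in arcs.get(state, {}):
        if state not in phi:
            return None  # no match anywhere along the failure chain
        state = phi[state]
    return arcs[state][symbol]

# A toy backoff-like phi-automaton: state 1 falls back to state 0.
arcs = {0: {"a": 1, "b": 2}, 1: {"a": 1}}
phi = {1: 0}
```

Reading "b" from state 1 fails locally, backs off to state 0, and matches there; this is why phi arcs let shared suffix behavior be stored once.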
Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant
Ian Williams
Justin Scheiner
Interspeech 2018, ISCA (2018), pp. 2222-2226
Recent interest in intelligent assistants has increased demand for Automatic Speech Recognition (ASR) systems that can utilize contextual information to adapt to the user’s preferences or the current device state. For example, a user might be more likely to refer to their favorite songs when giving a “music playing” command or request to watch a movie starring a particular favorite actor when giving a “movie playing” command. Similarly, when a device is in a “music playing” state, a user is more likely to give volume control commands.
In this paper, we explore using semantic information inside the ASR word lattice by employing Named Entity Recognition (NER) to identify and boost contextually relevant paths in order to improve speech recognition accuracy. We use broad semantic classes comprising millions of entities, such as songs and musical artists, to tag relevant semantic entities in the lattice. We show that our method reduces Word Error Rate (WER) by 12.0% relative on a Google Assistant “media playing” commands test set, while not affecting WER on a test set containing commands unrelated to media.
On Lattice Generation for Large Vocabulary Speech Recognition
Johan Schalkwyk
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan (2017)
Lattice generation is an essential feature of the decoder for many speech recognition applications. In this paper, we first review lattice generation methods for WFST-based decoding and describe in a uniform formalism two established approaches for state-of-the-art speech recognition systems: the phone-pair and the N-best histories approaches. We then present a novel optimization method, pruned determinization followed by minimization, that produces a deterministic minimal lattice that retains all paths within specified weight and lattice size thresholds. Experimentally, we show that before optimization, the phone-pair and the N-best histories approaches each have conditions where they perform better when evaluated on video transcription and mixed voice search and dictation tasks. However, once this lattice optimization procedure is applied, the phone-pair approach has the lowest oracle WER for a given lattice density by a significant margin. We further show that the pruned determinization presented here is efficient to use during decoding, unlike the classical weighted determinization from which it is derived. Finally, we consider on-the-fly lattice rescoring, in which lattice generation and combination with the secondary LM are done in one step. We compare the phone-pair and N-best histories approaches for this scenario and find the former superior in our experiments.
Transliterated mobile keyboard input via weighted finite-state transducers
Lars Hellsten
Prasoon Goyal
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP) (2017)
We present an extension to a mobile keyboard input decoder based on finite-state transducers that provides general transliteration support, and demonstrate its use for input of South Asian languages using a QWERTY keyboard. On-device keyboard decoders must operate under strict latency and memory constraints, and we present several transducer optimizations that allow for high accuracy decoding under such constraints. Our methods yield substantial accuracy improvements and latency reductions over an existing baseline transliteration keyboard approach. The resulting system was launched for 22 languages in Google Gboard in the first half of 2017.
Contextual prediction models for speech recognition
Yoni Halpern
Keith Hall
Vlad Schogol
Martin Baeuml
Proceedings of Interspeech 2016
We introduce an approach to biasing language models towards known contexts without requiring separate language models or explicit contextually-dependent conditioning contexts. We do so by presenting an alternative ASR objective, where we predict the acoustics and words given the contextual cue, such as the geographic location of the speaker. A simple factoring of the model results in an additional biasing term, which effectively indicates how correlated a hypothesis is with the contextual cue (e.g., given the hypothesized transcript, how likely is the user's known location). We demonstrate that this factorization allows us to train relatively small contextual models which are effective in speech recognition. An experimental analysis shows both a perplexity reduction and a significant word error rate reduction on a voice search task when using the user's location as a contextual cue.
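The biasing term that falls out of the factorization can be sketched as the log ratio of the hypothesis probability given the cue to its unconditional probability, added to the base ASR score. The probabilities below are invented placeholders, not model outputs.

```python
import math

def biased_score(asr_logprob, hyp_prob_given_context, hyp_prob):
    """Add the biasing term log p(W|context) - log p(W) to an ASR
    hypothesis log score, boosting hypotheses correlated with the cue."""
    return asr_logprob + math.log(hyp_prob_given_context) - math.log(hyp_prob)

# A hypothesis twice as likely given the user's location gets a log(2) boost.
score = biased_score(-5.0, 0.2, 0.1)
```

Hypotheses uncorrelated with the cue (p(W|context) = p(W)) are left unchanged, which is what lets a small contextual model be used safely.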
We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semi-supervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz backoff model. We compare semi-supervised adaptation of language models for YouTube video speech recognition in two conditions: when using full lattices with our new algorithm versus just the 1-best output from the baseline speech recognizer. Unlike 1-best methods, the new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of channels. We show that channels with the most data yielded the largest gains. The algorithm was implemented via a new semiring in the OpenFst library and will be released as part of the OpenGrm ngram library.
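The histogram computation described above can be sketched with the standard Poisson-binomial dynamic program: given independent occurrence probabilities for an n-gram (e.g., posterior probabilities from lattice paths containing it), compute the probability it occurs exactly k times. The probabilities below are invented inputs; the paper implements this via a semiring in OpenFst.

```python
def count_histogram(occurrence_probs, max_k):
    """P(an n-gram occurs exactly k times), for k = 0..max_k, given
    independent per-instance occurrence probabilities. Iterates the
    Poisson-binomial recurrence, updating counts from high k to low so
    each probability is absorbed exactly once."""
    hist = [1.0] + [0.0] * max_k
    for p in occurrence_probs:
        for k in range(max_k, 0, -1):
            hist[k] = hist[k] * (1 - p) + hist[k - 1] * p
        hist[0] *= 1 - p
    return hist

# Two uncertain occurrences, each present with probability 0.5.
hist = count_histogram([0.5, 0.5], 2)
```

The resulting histogram [0.25, 0.5, 0.25] is what a generalized Katz backoff estimator would consume in place of hard counts.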
Distributed representation and estimation of WFST-based n-gram models
Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM) (2016), pp. 32-41
We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be merged into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the OpenGrm library.
Composition-based on-the-fly rescoring for salient n-gram biasing
Keith Hall
Eunjoon Cho
Noah Coccaro
Kaisuke Nakajima
Linda Zhang
Interspeech 2015, International Speech Communication Association
Pushdown automata in statistical machine translation
Bill Byrne
Adrià de Gispert
Gonzalo Iglesias
Computational Linguistics, vol. 40 (2014), pp. 687-723
Smoothed marginal distribution constraints for language modeling
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL) (2013), pp. 43-52
We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library.
The OpenGrm Open-Source Finite-State Grammar Software Libraries
Terry Tai
ACL (System Demonstrations) (2012), pp. 61-66
Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
Johan Schalkwyk
Boulos Harb
Peng Xu
Preethi Jyothi
Thorsten Brants
Vida Ha
Will Neveitt
University of Toronto (2012)
A critical component of a speech recognition system targeting web search is the language model. The talk presents an empirical exploration of the google.com query stream with the end goal of high-quality statistical language modeling for mobile voice search. Our experiments show that after text normalization the query stream is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high orders such as n=5/4, respectively. Using large-scale, distributed language models can improve performance significantly, with up to 10% relative reductions in word error rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns, and may even degrade performance slightly. Perhaps less surprisingly, we show that locale matters significantly for English query data across the USA, Great Britain, and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative n-gram language models and derive small but significant gains in recognition performance.
Mobile Music Modeling, Analysis and Recognition
Pavel Golik
Boulos Harb
Alex Rudnick
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012)
We present an analysis of music modeling and recognition techniques in the context of mobile music matching, substantially improving on the techniques presented in [Mohri et al., 2010]. We accomplish this by adapting the features specifically to this task, and by introducing new modeling techniques that enable using a corpus of noisy and channel-distorted data to improve mobile music recognition quality. We report the results of an extensive empirical investigation of the system's robustness under realistic channel effects and distortions. We show an improvement of recognition accuracy by explicit duration modeling of music phonemes and by integrating the expected noise environment into the training process. Finally, we propose the use of frame-to-phoneme alignment for high-level structure analysis of polyphonic music.
This paper explores various static interpolation methods for approximating a single dynamically-interpolated language model used for a variety of recognition tasks on the Google Android platform. The goal is to find the statically-interpolated first-pass LM that best reduces search errors in a two-pass system, or that even allows eliminating the more complex dynamic second pass entirely. Static interpolation weights that are uniform, prior-weighted, and the maximum likelihood, maximum a posteriori, and Bayesian solutions are considered. Analysis argues, and recognition experiments on Android test data show, that a Bayesian interpolation approach performs best.
Hierarchical Phrase-Based Translation Representations
Gonzalo Iglesias
William Byrne
Adrià de Gispert
Proceedings of EMNLP 2011
A Filter-based Algorithm for Efficient Composition of Finite-State Transducers
Johan Schalkwyk
International Journal of Foundations of Computer Science, vol. 22 (2011), pp. 1781-1795
Language Modeling for Automatic Speech Recognition Meets the Web: Google Search by Voice
Johan Schalkwyk
Boulos Harb
Peng Xu
Thorsten Brants
Vida Ha
Will Neveitt
OGI/OHSU Seminar Series, Portland, Oregon, USA (2011)
The talk presents key aspects faced when building language models (LM) for the google.com query stream, and their use for automatic speech recognition (ASR). Distributed LM tools enable us to handle a huge amount of data, and to experiment with LMs that are two orders of magnitude larger than usual. An empirical exploration of the problem led us to rediscovering a lesser-known interaction between Kneser-Ney smoothing and entropy pruning, possible non-stationarity of the query stream, as well as strong dependence on various English locales: USA, Britain, and Australia. LM compression techniques allowed us to use one billion n-gram LMs in the first pass of an ASR system built on FST technology, and to evaluate empirically whether a two-pass system architecture has any losses over one pass.
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.
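The regularized split objective can be sketched as picking, among candidate phonetic-context questions, the one maximizing likelihood gain minus a penalty proportional to the transducer states the split would introduce. The candidate names, gains, and state counts below are invented for the sketch.

```python
def best_split(candidates, reg):
    """Select the phonetic-context split maximizing
    (likelihood gain) - reg * (transducer states introduced).
    Each candidate is a (name, gain, new_states) tuple."""
    return max(candidates, key=lambda c: c[1] - reg * c[2])

# A higher-gain split can lose to a cheaper one once states are penalized.
splits = [("vowel?", 10.0, 50), ("nasal?", 9.0, 5)]
choice = best_split(splits, reg=0.1)
```

With reg = 0.1 the "vowel?" split scores 5.0 and "nasal?" scores 8.5, so the regularizer steers the construction toward the more compact transducer.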
This paper describes a weighted finite-state transducer composition algorithm that generalizes the notion of the composition filter and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. This filtering allows us to compose together large speech recognition context-dependent lexicons and language models much more efficiently in time and space than previously possible. We present experiments on Broadcast News and Google Search by Voice that demonstrate a 5% to 10% overhead for dynamic, runtime composition compared to a static, offline composition of the recognition transducer. To our knowledge, this is the first such system with such a small overhead.
Web-derived Pronunciations
Arnab Ghoshal
Martin Jansche
Sanjeev Khudanpur
Morgan Ulinski
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009), pp. 4289-4292
Pronunciation information is available in large quantities on the Web, in the form of IPA and ad-hoc transcriptions. We describe techniques for extracting candidate pronunciations from Web pages and associating them with orthographic words, filtering out poorly extracted pronunciations, normalizing IPA pronunciations to better conform to a common transcription standard, and generating phonemic representations from ad-hoc transcriptions. We show improvements on a letter-to-phoneme task when using web-derived vs. Pronlex pronunciations.
Web Derived Pronunciations for Spoken Term Detection
Doğan Can
Erica Cooper
Arnab Ghoshal
Martin Jansche
Sanjeev Khudanpur
Bhuvana Ramabhadran
Murat Saraçlar
Abhinav Sethy
Morgan Ulinski
Christopher White
32nd Annual International ACM SIGIR Conference (2009), pp. 83-90
Indexing and retrieval of speech content in various forms such as broadcast news, customer care data and on-line media has gained a lot of interest for a wide range of applications, from customer analytics to on-line media search. For most retrieval applications, the speech content is typically first converted to a lexical or phonetic representation using automatic speech recognition (ASR). The first step in searching through indexes built on these representations is the generation of pronunciations for named entities and foreign language query terms. This paper summarizes the results of the work conducted during the 2008 JHU Summer Workshop by the Multilingual Spoken Term Detection team, on mining the web for pronunciations and analyzing their impact on spoken term detection. We first present methods to use the vast amount of pronunciation information available on the Web, in the form of IPA and ad-hoc transcriptions. We describe techniques for extracting candidate pronunciations from Web pages and associating them with orthographic words, filtering out poorly extracted pronunciations, normalizing IPA pronunciations to better conform to a common transcription standard, and generating phonemic representations from ad-hoc transcriptions. We then present an analysis of the effectiveness of using these pronunciations to represent Out-Of-Vocabulary (OOV) query terms on the performance of a spoken term detection (STD) system. We provide comparisons of Web pronunciations against automated techniques for pronunciation generation as well as pronunciations generated by human experts. Our results cover a range of speech indexes based on lattices, confusion networks and one-best transcriptions at both the word and word-fragment levels.
OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language
Martin Jansche
Proceedings of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT) 2009 conference, Tutorials
Finite-state methods are well established in language and speech processing. OpenFst (available from www.openfst.org) is a free and open-source software library for building and using finite automata, in particular, weighted finite-state transducers (FSTs). This tutorial is an introduction to weighted finite-state transducers and their uses in speech and language processing. While there are other weighted finite-state transducer libraries, OpenFst (a) offers, we believe, the most comprehensive, general and efficient set of operations; (b) makes available full source code; (c) exposes high- and low-level C++ APIs that make it easy to embed and extend; and (d) is a platform for active research and use among many colleagues.
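As a rough illustration of the central operation such a library provides, the sketch below composes two toy transducers over the tropical semiring (weights are costs combined by addition) in plain Python. The arc representation is invented for the example and is not the OpenFst API; a real implementation also restricts itself to reachable state pairs and handles epsilon labels.

```python
def compose(fst1, fst2):
    """Toy transducer composition: pair arcs whose output and input
    labels match, pairing their states and adding their (tropical)
    weights. Each transducer is a list of arcs
    (src, in_label, out_label, weight, dst) with start state 0."""
    arcs = []
    for (s1, i1, o1, w1, d1) in fst1:
        for (s2, i2, o2, w2, d2) in fst2:
            if o1 == i2:
                arcs.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    return arcs

# "a" -> "b" at cost 1, composed with "b" -> "c" at cost 2,
# yields "a" -> "c" at cost 3.
fst1 = [(0, "a", "b", 1.0, 1)]
fst2 = [(0, "b", "c", 2.0, 1)]
result = compose(fst1, fst2)
```

This pattern, applied at scale with lazy expansion and composition filters, is how a lexicon, context-dependency model, and language model are combined into a recognition transducer.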
Sample Selection Bias Correction Theory
Proceedings of The 19th International Conference on Algorithmic Learning Theory (ALT 2008), Springer, Heidelberg, Germany, Budapest, Hungary
Speech Recognition with Weighted Finite-State Transducers
Handbook on Speech Processing and Speech Communication, Part E: Speech recognition, Springer-Verlag, Heidelberg, Germany (2008)
On the Computation of the Relative Entropy of Probabilistic Automata
Ashish Rastogi
International Journal of Foundations of Computer Science, vol. 19 (2008), pp. 219-242
Speech Recognition with Weighted Finite-State Transducers
Handbook on Speech Processing and Speech Communication, Part E: Speech recognition, Springer-Verlag, Heidelberg, Germany (2007)
OpenFst: a General and Efficient Weighted Finite-State Transducer Library
Johan Schalkwyk
Wojciech Skut
Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), Springer-Verlag, Heidelberg, Germany, Prague, Czech Republic
Efficient Computation of the Relative Entropy of Probabilistic Automata
Ashish Rastogi
Proceedings of the 7th Latin American Symposium (LATIN 2006), Springer-Verlag, Heidelberg, Germany, Valdivia, Chile
Automata and Graph Compression
MAP adaptation of stochastic grammars
Weighted Automata in Text and Speech Processing
Statistical Modeling for Unit Selection in Speech Synthesis
42nd Meeting of the Association for Computational Linguistics (ACL 2004), Proceedings of the Conference, Barcelona, Spain
A Generalized Construction of Integrated Speech Recognition Transducers
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada
Voice Signatures
Proceedings of The 8th IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2003), St. Thomas, U.S. Virgin Islands
A Comparison of Two LVR Search Optimization Techniques
Stephan Kanthak
Hermann Ney
Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP '02), Denver, Colorado
Weighted Finite-State Transducers in Speech Recognition (Tutorial)
Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP '02), Denver, Colorado
An Efficient Algorithm for the N-Best-Strings Problem
Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP '02), Denver, Colorado
Weighted Finite-State Transducers in Speech Recognition
Computer Speech and Language, vol. 16 (2002), pp. 69-88
A Weight Pushing Algorithm for Large Vocabulary Speech Recognition
Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech '01), Aalborg, Denmark (2001)
Weighted Finite-State Transducers in Speech Recognition
Proceedings of the ISCA Tutorial and Research Workshop, Automatic Speech Recognition: Challenges for the New Millennium (ASR2000), Paris, France
The Design Principles of a Weighted Finite-State Transducer Library
Theoretical Computer Science, vol. 231 (2000), pp. 17-32
Integrated Context-Dependent Networks in Very Large Vocabulary Speech Recognition
Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, Hungary (1999)
Network Optimizations for Large Vocabulary Speech Recognition
Efficient General Lattice Generation and Rescoring
Rapid Unit Selection from a Large Speech Corpus for Concatenative Speech Synthesis
Mark Beutnagel
Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, Hungary (1999)
A Rational Design for a Weighted Finite-State Transducer Library
Proceedings of the Second International Workshop on Implementing Automata (WIA '97), Springer-Verlag, Berlin-NY (1998), pp. 144-158
Full Expansion of Context-Dependent Networks in Large Vocabulary Speech Recognition
Don Hindle
Andrej Ljolje
Fernando C. N. Pereira
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), Seattle, Washington (1998), pp. 665-668
A Rational Design for a Weighted Finite-State Transducer Library
Proceedings of the Workshop on Implementing Automata (WIA '97), University of Western Ontario, London, Ontario, Canada (1997)
Transducer Composition for Context-Dependent Network Expansion
Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech '97), Rhodes, Greece (1997), pp. 1427-1430
Speech Recognition by Composition of Weighted Finite Automata
Finite-State Language Processing, MIT Press, Cambridge, Massachusetts (1997), pp. 431-453
Weighted Determinization and Minimization for Large Vocabulary Speech Recognition
Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech '97), Rhodes, Greece (1997)
Compilation of Weighted Finite-State Transducers from Decision Trees
Algorithms for Speech Recognition and Language Processing
Rational Power Series in Text and Speech Processing
Graduate course, University of Pennsylvania, Department of Computer Science, Philadelphia, PA (1996)
Finite-State Transducers in Language and Speech Processing
Tutorial at the 16th International Conference on Computational Linguistics (COLING-96), COLING, Copenhagen, Denmark (1996)
Weighted Automata in Text and Speech Processing
Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended finite state models of language, John Wiley and Sons, Chichester, Budapest, Hungary (1996)
The AT&T 60,000 Word Speech-to-Text System
Andrej Ljolje
Don Hindle
Eurospeech'95: ESCA 4th European Conference on Speech Communication and Technology, Madrid, Spain (1995), pp. 207-210
Weighted Rational Transductions and their Application to Human Language Processing
Human Language Technology Workshop, Morgan Kaufmann, San Francisco, California (1994), pp. 262-267
A spoken language translator for restricted-domain context-free languages
David B. Roe
Alejandro Macarrón
Speech Communication, vol. 11 (1992), pp. 311-319
Efficient Grammar Processing for a Spoken Language Translation System
David B. Roe
Alejandro Macarrón
Proceedings of ICASSP, IEEE, San Francisco, California (1992), pp. 213-216
Toward a Spoken Language Translator for Restricted-Domain Context-Free Languages
David B. Roe
Alejandro Macarrón
EUROSPEECH 91 -- 2nd European Conference on Speech Communication and Technology, Genova, Italy (1991), pp. 1063-1066