Georg Heigold
Georg Heigold received the Diplom degree in
physics from ETH Zurich, Switzerland, in 2000.
He was a Software Engineer at De La Rue, Berne,
Switzerland, from 2000 to 2003. From 2004 to 2010,
he was with the Computer Science Department,
RWTH Aachen University, Aachen, Germany.
Since 2010, he has been a Research Scientist at
Google, Mountain View, CA. His research interests
include automatic speech recognition, discriminative
training, and log-linear modeling.
Authored Publications
Conditional Object-Centric Learning from Video
Thomas Kipf
Austin Stone
Rico Jonschkowski
Alexey Dosovitskiy
Klaus Greff
ICLR (2022)
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Dirk Weissenborn
Jakob Uszkoreit
Sylvain Gelly
ICLR (2021)
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision tasks, attention is usually either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks, while keeping their overall structure in place. We show that this reliance on ConvNets is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-10, etc), these transformers attain excellent accuracy, matching or outperforming the best convolutional networks while requiring substantially less computational resources to train.
Object-Centric Learning with Slot Attention
Francesco Locatello
Dirk Weissenborn
Jakob Uszkoreit
Alexey Dosovitskiy
Thomas Kipf
NeurIPS 2020
Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
End-to-End Text-Dependent Speaker Verification
Samy Bengio
Noam M. Shazeer
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system’s components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal “Ok Google” benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.
Preview abstract
This article proposes and evaluates a Gaussian Mixture Model (GMM) represented as the last layer of a Deep Neural Network (DNN) architecture and jointly optimized with all previous layers using Asynchronous Stochastic Gradient Descent (ASGD). The resulting “Deep GMM” architecture was investigated with special attention to the following issues: (1) The extent to which joint optimization improves over separate optimization of the DNN-based feature extraction layers and the GMM layer; (2) The extent to which depth (measured in number of layers, for a matched total number of parameters) helps a deep generative model based on the GMM layer, compared to a vanilla DNN model; (3) Head-to-head performance of Deep GMM architectures vs. equivalent DNN architectures of comparable depth, using the same optimization criterion (frame-level Cross Entropy (CE)) and optimization method (ASGD); (4) Expanded possibilities for modeling offered by the Deep GMM generative model. The proposed Deep GMMs were found to yield Word Error Rates (WERs) competitive with state-of-the-art DNN systems, at the cost of pre-training using standard DNNs to initialize the Deep GMM feature extraction layers. An extension to Deep Subspace GMMs is described, resulting in additional gains.
Asynchronous, Online, GMM-free Training of a Context Dependent Acoustic Model for Speech Recognition
Proceedings of the European Conference on Speech Communication and Technology (2014) (to appear)
Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
Erik McDermott
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization.
Word Embeddings for Speech Recognition
Samy Bengio
Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech (2014)
Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still be used to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
GMM-Free DNN Training
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2014)
Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks
Andrew Senior
Erik McDermott
Rajat Monga
Mark Mao
Interspeech (2014)
We recently showed that Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform state-of-the-art deep neural networks (DNNs) for large scale acoustic modeling where the models were trained with the cross-entropy (CE) criterion. It has also been shown that sequence discriminative training of DNNs initially trained with the CE criterion gives significant improvements.
In this paper, we investigate sequence discriminative training of LSTM RNNs in a large scale acoustic modeling task. We train the models in a distributed manner using the asynchronous stochastic gradient descent optimization technique. We compare two sequence discriminative criteria, maximum mutual information and state-level minimum Bayes risk, and we investigate a number of variations of the basic training strategy to better understand issues raised by both the sequential model and the objective function. We obtain significant gains over the CE trained LSTM RNN model using sequence discriminative training techniques.
Multilingual acoustic models using distributed deep neural networks
Patrick Nguyen
Marc'aurelio Ranzato
Matthieu Devin
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, Canada (2013)
Today’s speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual).
An Empirical study of learning rates in deep neural networks for speech recognition
Marc'aurelio Ranzato
Ke Yang
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, Canada (2013) (to appear)
Recent deep neural network systems for large vocabulary speech recognition are trained with minibatch stochastic gradient descent but use a variety of learning rate scheduling schemes. We investigate several of these schemes, particularly AdaGrad. Based on our analysis of its limitations, we propose a new variant ‘AdaDec’ that decouples long-term learning-rate scheduling from per-parameter learning rate variation. AdaDec was found to result in higher frame accuracies than other methods. Overall, careful choice of learning rate schemes leads to faster convergence and lower word error rates.
Multiframe Deep Neural Networks for Acoustic Modeling
Matthieu Devin
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, Canada (2013)
Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures, however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.
Deep Neural Networks with Auxiliary Gaussian Mixture Models for Real-Time Speech Recognition
Xin Lei
Hui Lin
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Vancouver, Canada (2013)
Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data
Patrick Nguyen
Mitchel Weintraub
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Kyoto, Japan (2012), pp. 4437-4440
The acoustic models in state-of-the-art speech recognition systems are based on phones in context that are represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alternative, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this not yet well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model. This approach guarantees at least baseline performance and focuses on the refined modeling of words with sufficient data. Experimental results for the Voice Search and the YouTube tasks are presented.
RWTH OCR: A Large Vocabulary Optical Character Recognition System for Arabic Scripts
Philippe Dreuw
Hermann Ney
Guide to OCR for Arabic Scripts, Springer (2012), pp. 215-254
WFST Enabled Solutions to ASR Problems: Beyond HMM Decoding
Björn Hoffmeister
Ralf Schlüter
Hermann Ney
IEEE Transactions on Audio, Speech, and Language Processing, vol. 20 (2012), pp. 551-564
Latent Log-Linear Models for Handwritten Digit Classification
Thomas Deselaers
Tobias Gass
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34/6 (2012), pp. 1105-1117
The RWTH Aachen University Open Source Speech Recognition System
Christian Gollan
Björn Hoffmeister
Jonas Lööf
Ralf Schlüter
Hermann Ney
Interspeech (2009), pp. 2111-2114
Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition
Speech Recognition with State-based Nearest Neighbour Classifiers
Minimum Exact Word Error Training
Ralf Schlüter
Hermann Ney
Automatic Speech Recognition and Understanding (2005), pp. 186-190