Hugo Larochelle
I am a Principal Scientist in the Google DeepMind team in Montreal. My main area of expertise is deep learning. My previous work includes unsupervised pretraining with autoencoders, denoising autoencoders, visual attention-based classification, neural autoregressive distribution models and zero-shot learning. More broadly, I’m interested in applications of deep learning to natural language processing, code, computer vision and environmental sustainability problems.
Previously, I was Associate Professor at the Université de Sherbrooke (UdeS). I also co-founded Whetlab, which was acquired in 2015 by Twitter, where I then worked as a Research Scientist in the Twitter Cortex group. From 2009 to 2011, I was also a member of the machine learning group at the University of Toronto, as a postdoctoral fellow under the supervision of Geoffrey Hinton. I obtained my Ph.D. at the Université de Montréal, under the supervision of Yoshua Bengio.
My academic involvement includes being a member of the boards for the International Conference on Machine Learning (ICML) and for the Neural Information Processing Systems (NeurIPS) conference. I also co-founded the journal Transactions on Machine Learning Research.
Finally, I have a popular online course on deep learning and neural networks, freely accessible on YouTube.
Authored Publications
Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions
Rishab Goel
International Conference on Learning Representations (ICLR) (2023)
The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a "static" setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. As an alternative, we develop an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and "learns to execute" descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.
Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
Mike Mozer
Proceedings of the 39th International Conference on Machine Learning, PMLR (2022)
Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method, fine-tuning all parameters of the source model to the target domain, possibly because fine-tuning allows the model to leverage useful information from intermediate layers which is otherwise discarded. We explore the hypothesis that these intermediate layers might be directly exploited by linear probing. We propose a method, Head2Toe, that selects features from all layers of the source model to train a target-domain classification head. In evaluations on the Visual Task Adaptation Benchmark, Head2Toe matches performance obtained with fine-tuning on average, but critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.
Impact of Aliasing on Generalization in Deep Convolutional Networks
Nicolas Le Roux
Rob Romijnders
International Conference on Computer Vision ICCV 2021, IEEE/CVF (2021)
Traditionally, image pre-processing in the frequency domain has played a vital role in computer vision and was even part of the standard pipeline in the early days of Deep Learning. However, with the advent of large datasets, many practitioners concluded that this was unnecessary due to the belief that these priors can be learned from the data itself if they aid in achieving stronger performance. Frequency aliasing is a phenomenon that may occur when down-sampling (sub-sampling) any signal, such as an image or feature map. We demonstrate that substantial improvements in out-of-distribution (OOD) generalization can be obtained by mitigating the effects of aliasing by placing non-trainable blur filters and using smooth activation functions at key locations in the ResNet family of architectures, helping to achieve new state-of-the-art results on two benchmarks without any hyper-parameter sweeps.
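To make the aliasing effect concrete, here is a minimal one-dimensional sketch of blurring before sub-sampling. It is an illustration only, not the paper's implementation: the `[1, 2, 1] / 4` kernel, the edge-replication padding, and the function names `blur` and `subsample` are assumptions chosen for this example, and the paper applies its filters inside ResNet architectures rather than to a toy signal.

```python
def blur(signal):
    """Apply a fixed (non-trainable) [1, 2, 1] / 4 low-pass filter.

    Edges are handled by replicating the first and last samples (an assumption).
    """
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [
        (padded[i - 1] + 2 * padded[i] + padded[i + 1]) / 4
        for i in range(1, len(padded) - 1)
    ]


def subsample(signal, stride=2):
    """Keep every `stride`-th sample."""
    return list(signal[::stride])


# A signal oscillating at the highest representable frequency.
high_freq = [1.0, -1.0] * 4

# Sub-sampling directly keeps only the +1 samples, so the oscillation
# is misread as a constant signal: this is aliasing.
naive = subsample(high_freq)

# Blurring first removes the oscillation, so (apart from the padded edge)
# the sub-sampled signal no longer contains a spurious constant.
anti_aliased = subsample(blur(high_freq))
```

In this toy case the naive path turns a pure oscillation into a constant, while the blurred path suppresses it everywhere except at the padded boundary.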
The goal of program synthesis from examples is to find a computer program that is consistent with a given set of input-output examples. Most learning-based approaches try to find a program that satisfies all examples at once. Our work, by contrast, considers an approach that breaks the problem into two stages: (a) find programs that satisfy only one example, and (b) leverage these per-example solutions to yield a program that satisfies all examples. We introduce the Cross Aggregator neural network module, based on a multi-head attention mechanism, that learns to combine the cues present in these per-example solutions to synthesize a global solution. Evaluation across programs of different lengths and under two different experimental settings reveals that, when given the same budget, our technique significantly improves the success rate over PCCoder [Zohar et al. 2018] and other ablation baselines.
A Unified Few-Shot Classification Benchmark to Compare Transfer and Meta Learning Approaches
Sylvain Gelly
NeurIPS Datasets and Benchmarks Track (2021)
Meta and transfer learning are two successful families of approaches to few-shot learning. Despite highly related goals, state-of-the-art advances in each family are measured largely in isolation of each other. As a result of diverging evaluation norms, a direct or thorough comparison of different approaches is challenging. To bridge this gap, we introduce a few-shot classification evaluation protocol named VTAB+MD with the explicit goal of facilitating sharing of insights from each community. We demonstrate its accessibility in practice by performing a cross-family study of the best transfer and meta learners which report on both a large-scale meta-learning benchmark (Meta-Dataset, MD), and a transfer learning benchmark (Visual Task Adaptation Benchmark, VTAB). We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet. In contrast, meta-learning approaches struggle to compete on VTAB when trained and validated on MD. However, BiT is not without limitations, and pushing for scale does not improve performance on highly out-of-distribution MD tasks. We hope that this work contributes to accelerating progress on few-shot learning research.
Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a structure that can define a wide array of dataset-specialized models, by plugging in appropriate parameter-light components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of task-specific parameters to insert into the universal template. We design a separate network that produces a carefully-crafted initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves state-of-the-art on the challenging Meta-Dataset benchmark.
Few-shot classification aims to recognize unseen classes given only a few samples. We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources. This problem has seen growing interest and has inspired the development of benchmarks such as Meta-Dataset. A key challenge in this multi-domain setting is effectively integrating the feature representations from the diverse set of training domains. Here, we propose a Universal Representation Transformer (URT) layer that meta-learns to leverage universal features for few-shot classification by dynamically re-weighting and composing the most appropriate domain-specific representations. In experiments, we show that URT sets a new state-of-the-art result on Meta-Dataset. Specifically, it outperforms the best previous model on 3 data sources and matches it on the others. We analyze variants of URT and present a visualization of the attention score heatmaps that sheds light on how the model performs cross-domain generalization.
Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples
Eleni Triantafillou
Kelvin Xu
Carles Gelada
International Conference on Learning Representations (submission) (2020)
Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle this recently, we find the current procedure and datasets that are used to systematically assess progress in this setting lacking. To address this, we propose META-DATASET: a new benchmark for training and evaluating few-shot classifiers that is large-scale, consists of multiple datasets, and presents more natural and realistic tasks. The aim is to measure the ability of state-of-the-art models to leverage diverse sources of data to achieve higher generalization, and to evaluate that generalization ability in a more challenging setting. We additionally measure robustness of current methods to variations in the number of available examples and the number of classes. Finally, our extensive empirical evaluation leads us to identify weaknesses in Prototypical Networks and MAML, two popular few-shot classification methods, and to propose a new method, ProtoMAML, which achieves improved performance on our benchmark.
Learning Graph Structure With A Finite-State Automaton Layer
Thirty-fourth Conference on Neural Information Processing Systems (2020)
Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., bonds in chemical molecules or abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that this model outperforms baseline methods even without providing the hand-engineered semantic edges that those baselines use.
The Hanabi Challenge: A New Frontier for AI Research
Nolan Bard
Jakob N. Foerster
Sarath Chandar
Neil Burch
Marc Lanctot
H. Francis Song
Emilio Parisotto
Subhodeep Moitra
Edward Hughes
Iain Dunning
Shibl Mourad
Marc G. Bellemare
Michael Bowling
Artificial Intelligence, vol. 280 (2020)
From the early days of computing, games have been important testbeds for studying how well machines can do sophisticated decision making. In recent years, machine learning has made dramatic advances with artificial agents reaching superhuman performance in challenge domains like Go, Atari, and some variants of poker. As with their predecessors of chess, checkers, and backgammon, these game domains have driven research by providing sophisticated yet well-defined challenges for artificial intelligence practitioners. We continue this tradition by proposing the game of Hanabi as a new challenge domain with novel problems that arise from its combination of purely cooperative gameplay with two to five players and imperfect information. In particular, we argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground. We believe developing novel techniques for such theory of mind reasoning will not only be crucial for success in Hanabi, but also in broader collaborative efforts, especially those with human partners. To facilitate future research, we introduce the open-source Hanabi Learning Environment, propose an experimental framework for the research community to evaluate algorithmic advances, and assess the performance of current state-of-the-art techniques.
Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks
Thirty-fourth Conference on Neural Information Processing Systems (2020)
Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Network (IPA-GNN), which achieves systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by developing a spectrum of models between RNNs operating on program traces with branch decisions as latent variables and GNNs. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a value function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.
Revisiting Fundamentals of Experience Replay
Liam B. Fedus
Mark Rowland
Prajit Ramachandran
Will Dabney
Yoshua Bengio
International Conference on Machine Learning (2020)
Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay: greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively, we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
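The two properties studied in this abstract, replay capacity and replay ratio, can be illustrated with a small sketch. This is a toy loop under stated assumptions rather than the setup used in the paper: the `ReplayBuffer` class, the `run` function, the placeholder transitions, and the fractional update budget are all hypothetical names and choices made for illustration, and no actual Q-learning update is performed.

```python
import random
from collections import deque


class ReplayBuffer:
    """FIFO buffer: once `capacity` is reached, the oldest transition is evicted."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng):
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)


def run(num_env_steps, capacity, replay_ratio, batch_size=4, seed=0):
    """Collect experience and perform `replay_ratio` learning updates per step on average."""
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity)
    num_updates = 0
    update_budget = 0.0
    for step in range(num_env_steps):
        # Placeholder for one transition collected from an environment.
        buffer.add(("placeholder_transition", step))
        update_budget += replay_ratio
        while update_budget >= 1.0:
            batch = buffer.sample(batch_size, rng)
            assert batch  # a real learner would take a gradient step on `batch` here
            num_updates += 1
            update_budget -= 1.0
    return num_updates, len(buffer.storage)


# Small capacity, one update every four collected transitions.
few_updates = run(num_env_steps=100, capacity=10, replay_ratio=0.25)

# Large capacity, two updates per collected transition.
many_updates = run(num_env_steps=100, capacity=1000, replay_ratio=2.0)
```

Each call returns the number of updates performed and the number of transitions retained, which makes it easy to vary capacity and replay ratio independently.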
Recall Traces: Backtracking Models for Efficient Reinforcement Learning
Anirudh Goyal
Philemon Brakel
Liam Fedus
Soumye Singhal
Timothy Lillicrap
Sergey Levine
Yoshua Bengio
ICLR (2019)
In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
InfoBot: Structured Exploration in Reinforcement Learning Using Information Bottleneck
Anirudh Goyal
Riashat Islam
Daniel Strouse
Matthew Botvinick
Yoshua Bengio
Sergey Levine
ICLR (2019)
A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
Meta-Learning for Semi-Supervised Few-Shot Classification
Eleni Triantafillou
Jake Snell
Josh Tenenbaum
Mengye Ren
Richard Zemel
Sachin Ravi
ICLR (2018)
In few-shot classification, we are interested in learning algorithms that train a classifier from only a handful of labeled examples. Recent progress made in few-shot classification has featured meta-learning, in which a parameterized model for a learning algorithm is defined and trained on episodes representing different classification problems, each with a small labeled training set and its corresponding test set. In this work, we advance this few-shot classification paradigm towards a scenario where unlabeled examples are also available within each episode. We consider two situations: one where all unlabeled examples are assumed to belong to the same set of classes as the labeled examples of the episode, as well as the more realistic situation where examples from other distractor classes are also provided. To address this paradigm, we propose novel extensions of prototypical networks (Snell et al. 2017) that are augmented with the ability to use unlabeled examples when producing prototypes. These models are trained in an end-to-end way on episodes, to learn to leverage the unlabeled examples successfully.
We evaluate these methods on versions of the Omniglot and mini-ImageNet benchmarks, adapted to this new framework augmented with unlabeled examples. We also propose a new split of ImageNet. Our experiments confirm that our prototypical networks can learn to improve their predictions due to unlabeled examples, much like a semi-supervised algorithm would.
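One way to picture "using unlabeled examples when producing prototypes" is a single soft k-means style refinement step, sketched below. This is an assumption made for illustration, since the abstract does not spell out the update rule: the function names, the squared Euclidean distance, and the softmax assignment are choices of this sketch, the embeddings are hand-written vectors rather than outputs of a trained network, and distractor classes are not handled.

```python
import math


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine_prototypes(labeled, unlabeled):
    """Return (initial, refined) prototypes.

    labeled: dict mapping class name -> list of embedding vectors.
    unlabeled: list of embedding vectors with unknown class.
    """
    classes = sorted(labeled)
    protos = {c: mean(labeled[c]) for c in classes}

    # Soft-assign each unlabeled embedding to classes via a softmax
    # over negative squared distances to the initial prototypes.
    weights = []
    for u in unlabeled:
        logits = [-sq_dist(u, protos[c]) for c in classes]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])

    # Recompute each prototype as a weighted mean of labeled embeddings
    # (weight 1) and unlabeled embeddings (soft weight).
    refined = {}
    for k, c in enumerate(classes):
        mass = len(labeled[c]) + sum(w[k] for w in weights)
        refined[c] = [
            (
                sum(v[d] for v in labeled[c])
                + sum(w[k] * u[d] for w, u in zip(weights, unlabeled))
            )
            / mass
            for d in range(len(protos[c]))
        ]
    return protos, refined


labeled = {"a": [[0.0, 0.0]], "b": [[4.0, 0.0]]}
unlabeled = [[1.0, 0.0]]
initial, refined = refine_prototypes(labeled, unlabeled)
```

In this example the unlabeled point lies close to class "a", so the refined prototype for "a" moves toward it while the prototype for "b" barely changes.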
Modulating early visual processing by language
Harm de Vries
Florian Strub
Jérémie Mary
Olivier Pietquin
Aaron Courville
Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 6594-6604
It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet (MORES), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial. We finally show that ResNet image features are effectively grounded.
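The idea of conditioning batch normalization parameters on a language embedding can be sketched in a few lines. The sketch below is a simplified illustration under assumptions, not the MORES implementation: it treats a single channel as a flat list of activations, predicts offsets to the scale and shift with a plain linear map, and uses hypothetical names (`language_conditioned_params`, `w_gamma`, `w_beta`) and hand-picked numbers.

```python
import math


def batch_norm(activations, gamma, beta, eps=1e-5):
    """Normalize one channel's activations, then scale by gamma and shift by beta."""
    mean = sum(activations) / len(activations)
    var = sum((a - mean) ** 2 for a in activations) / len(activations)
    return [gamma * (a - mean) / math.sqrt(var + eps) + beta for a in activations]


def language_conditioned_params(embedding, w_gamma, w_beta, gamma0=1.0, beta0=0.0):
    """Predict offsets to the pretrained scale/shift from a language embedding.

    The linear map is an assumption of this sketch.
    """
    delta_gamma = sum(w * e for w, e in zip(w_gamma, embedding))
    delta_beta = sum(w * e for w, e in zip(w_beta, embedding))
    return gamma0 + delta_gamma, beta0 + delta_beta


activations = [1.0, 2.0, 3.0, 4.0]
w_gamma, w_beta = [0.5, -0.2], [0.1, 0.3]

# A non-zero question embedding changes how the visual features are scaled and shifted.
gamma, beta = language_conditioned_params([1.0, 2.0], w_gamma, w_beta)
modulated = batch_norm(activations, gamma, beta)

# A zero embedding leaves the pretrained parameters unchanged.
unmodulated = batch_norm(
    activations, *language_conditioned_params([0.0, 0.0], w_gamma, w_beta)
)
```

Predicting offsets rather than the parameters themselves means that a zero embedding recovers the behaviour of the unmodified pretrained network, which is one plausible reason to structure the conditioning this way.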
Matrix factorization (MF) is one of the most popular techniques for product recommendation, but is known to suffer from serious cold-start problems. Item cold-start problems are particularly acute in settings such as Tweet recommendation where new items arrive continuously. In this paper, we present a meta-learning strategy to address item cold-start when new items arrive continuously. We propose two deep neural network architectures that implement our meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history while the second architecture learns a neural network whose biases are instead adapted based on item history. We evaluate our techniques on the real-world problem of Tweet recommendation. On production data at Twitter, we demonstrate that our proposed techniques significantly beat the MF baseline with lookup table based user embeddings and also outperform the state-of-the-art production model for Tweet recommendation.
MADE: Masked Autoencoder for Distribution Estimation
Mathieu Germain
Karol Gregor
Iain Murray
Proceedings of the 32nd International Conference on Machine Learning (2015)
Guest editors' introduction: Special section on learning deep architectures
Samy Bengio
Li Deng
Honglak Lee
Ruslan Salakhutdinov
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35 (2013), pp. 1795-1797
Domain-Adversarial Training of Neural Networks
Yaroslav Ganin
Evgeniya Ustinova
Hana Ajakan
Pascal Germain
François Laviolette
Mario Marchand
Victor Lempitsky
Journal of Machine Learning Research, vol. 17 (2016)
An autoencoder approach to learning bilingual word representations
Sarath Chandar A P
Stanislas Lauly
Mitesh Khapra
Balaraman Ravindran
Vikas C Raykar
Amrita Saha
Advances in Neural Information Processing Systems 27 (2014)
Practical Bayesian optimization of machine learning algorithms
Jasper Snoek
Ryan P. Adams
Advances in Neural Information Processing Systems 25 (2012)
The Neural Autoregressive Distribution Estimator
Iain Murray
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (2011)
Conditional Restricted Boltzmann Machines for Structured Output Prediction
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
Pascal Vincent
Isabelle Lajoie
Yoshua Bengio
Pierre-Antoine Manzagol
Journal of Machine Learning Research, vol. 11 (2010)
Learning to combine foveal glimpses with a third-order Boltzmann machine
Exploring strategies for training deep neural networks
Yoshua Bengio
Jérôme Louradour
Journal of Machine Learning Research, vol. 10 (2009)
Extracting and composing robust features with denoising autoencoders
Pascal Vincent
Yoshua Bengio
Pierre-Antoine Manzagol
Proceedings of the 25th International Conference on Machine Learning (2008)
Classification using discriminative restricted Boltzmann machines
Yoshua Bengio
Proceedings of the 25th International Conference on Machine Learning (2008)
Zero-data learning of new tasks
Yoshua Bengio
Proceedings of the 23rd AAAI Conference on Artificial Intelligence (2008)
An empirical evaluation of deep architectures on problems with many factors of variation
Aaron Courville
James Bergstra
Yoshua Bengio
Proceedings of the 24th International Conference on Machine Learning (2007)
Greedy layer-wise training of deep networks
Yoshua Bengio
Dan Popovici
Advances in neural information processing systems 19 (2007)