Mostafa Dehghani
I'm a Research Scientist at Google Brain, where I work on machine learning, in particular, deep learning. My areas of interest include self-supervised learning, generative models, training giant models, and sequence modeling.
Before Google, I was doing a PhD at the University of Amsterdam. My PhD research focused on improving the process of learning with imperfect supervision. I explored ideas around injecting inductive biases into algorithms, incorporating prior knowledge, and meta-learning the properties of the data using the data itself, in order to help learning algorithms learn better from noisy and/or limited data.
You can learn more about me here: mostafadehghani.com.
Authored Publications
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Carlos Riquelme
Yi Tay
Siamak Shakeri
Daniel Salz
Michael Tschannen
Mandar Joshi
Filip Pavetić
Anurag Arnab
Yuanzhong Xu
Keran Rong
Computer Vision and Pattern Recognition Conference (CVPR) (2024)
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Scaling Vision Transformers to 22 Billion Parameters
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Justin Gilmer
Mathilde Caron
Rodolphe Jenatton
Michael Tschannen
Anurag Arnab
Carlos Riquelme
Fisher Yu
Avital Oliver
Fantine Huot
Mark Collier
Yi Tay
Filip Pavetić
Thomas Kipf
Arxiv (2023)
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features) ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
DSI++: Updating Transformer Memory with New Documents
Yi Tay
Jinfeng Rao
Emma Strubell
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents.
Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by +21.1% over competitive baselines for NQ and requires 6 times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
Dual PatchNorm
Transactions on Machine Learning Research (2023) (to appear)
We find that simply placing two LayerNorms, one before and one after the patch embedding layer, leads to improvements over well-tuned ViT models. In particular, this outperforms an exhaustive search for alternative LayerNorm placement strategies in the transformer block itself.
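As a rough illustration of the placement described in this abstract, the sketch below applies a LayerNorm to each flattened patch, projects it, and applies a second LayerNorm. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy patches, and the projection weights are hypothetical, and the learned scale and bias that LayerNorm layers normally carry are omitted.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weights):
    """Project a vector with a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def dual_patchnorm_embed(patches, weights):
    """LayerNorm -> patch embedding -> LayerNorm, applied to each flattened patch."""
    return [layer_norm(linear(layer_norm(p), weights)) for p in patches]

# Two flattened 2x2 patches (4 pixel values each), projected to 3 dimensions.
# The values and weights are made up for illustration.
patches = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.0, 20.0, 40.0]]
weights = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.6, 0.2], [0.7, 0.1, 0.1, -0.5]]
tokens = dual_patchnorm_embed(patches, weights)
```

Each resulting token has approximately zero mean and unit variance regardless of the scale of the input pixels.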
UL2: Unifying Language Learning Paradigms
Yi Tay
Xavier Garcia
Jason Wei
Hyung Won Chung
Steven Zheng
ICLR (2023)
Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives, two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. Finally, we show that UL2 20B works well with chain-of-thought prompting and reasoning tasks, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. We publicly release Flax-based T5X model checkpoints for the 20B model.
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Ashish Teku Vaswani
Dani Yogatama
Hyung Won Chung
Jinfeng Rao
Liam B. Fedus
Samira Abnar
Sharan Narang
Yi Tay
ICLR (2022)
Kaplan et al. argue that the performance of a Transformer model strongly depends on the model size, but only weakly on the model shape. Our work empirically confirms their results for upstream training, but then reveals a striking discrepancy when fine-tuning: downstream task performance is strongly influenced by model shape (e.g. depth and width). We find that widely adopted models including T5-base, T5-large and T5-XL/XXL (Raffel et al. 2019) are inefficient on a compute-performance Pareto curve. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster.
We conclude by demonstrating that our improved scaling protocol also holds in other domains.
Simple Open-Vocabulary Object Detection with Vision Transformers
Austin Stone
Maxim Neumann
Dirk Weissenborn
Alexey Dosovitskiy
Anurag Arnab
Zhuoran Shen
Thomas Kipf
ECCV (Poster) (2022)
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub (https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
Confident Adaptive Language Modeling
Adam Fisch
Yi Tay
NeurIPS 2022
Recent advances in Transformer-based large language models (LLMs) have achieved significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a method for dynamically allocating different amounts of compute per example and per generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse generation tasks, we demonstrate the efficacy of our method in reliably reducing compute while maintaining high performance.
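To make the idea of a per-token exit decision concrete, here is a minimal sketch of confidence-thresholded early exit. It rests on assumptions that are not in the abstract: the confidence measure is taken to be the top softmax probability, the threshold of 0.9 is arbitrary, and `per_layer_logits` is a hypothetical stand-in for the logits a model would produce after each layer. It does not cover the sequence-level calibration or the handling of missing hidden states that the abstract mentions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(per_layer_logits, threshold=0.9):
    """Return (token_id, layers_used) for one decoding step.

    Exits at the first layer whose top probability reaches the threshold.
    Expects at least one layer of logits.
    """
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return best, depth
    # No layer was confident enough: fall back to the final layer's prediction.
    return best, depth

# Hypothetical logits after layer 1 and layer 2 for an easy and a hard token.
easy = [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
hard = [[0.2, 0.0, 0.1], [0.5, 0.6, 0.0]]
easy_result = early_exit_predict(easy)
hard_result = early_exit_predict(hard)
```

In this toy setup the easy token exits after the first layer, while the hard token uses both layers.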
Preview abstract
Recent developments in large-scale machine learning have created a tempting picture suggesting that by scaling up data, model size and training time properly, one can obtain a model that can be used successfully in few-shot settings in all downstream tasks. In this work, we investigate this premise empirically and provide a strong case against it. In particular, we consider the image recognition task with large-scale models (Vision Transformers) trained on the largest scale of available data (JFT). We show that as we improve the performance of the upstream task, either by scaling up or through hyper-parameter and architectural choices, the performance of many downstream tasks eventually plateaus. We showcase an even more extreme scenario where upstream and downstream performance are at odds with each other, i.e., in order to have better downstream performance, we need to hurt upstream accuracy. We delve deeper into understanding the reasons that give rise to these phenomena by designing interventions and investigating different components of the models, which gives us crude yet useful insights into the mechanisms behind these observations.
Transformer Memory as a Differentiable Search Index
Yi Tay
Jianmo Ni
NeurIPS 2022
In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
Retrieval Enhanced Machine Learning
Hamed Zamani
SIGIR 2022: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Perspectives Track)
Information access systems have supported people during tasks across a variety of domains. In this perspective paper, we advocate for broadening the scope of information access research to include machines. We believe that machine learning can be substantially advanced by developing a research program around retrieval as a core algorithmic method. This paper describes how core principles of indexing, representation, retrieval, and relevance can extend supervised learning algorithms. It proposes a generic retrieval-enhanced machine learning (REML) framework and describes challenges in and opportunities introduced by implementing REML. We also discuss different optimization approaches for training REML models and review a number of case studies that are simplified and special implementations of the proposed framework. The research agenda introduced in this paper will smooth the path towards developing machine learning models with better scalability, sustainability, effectiveness, and interpretability.
Preview abstract
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which encourages ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.
Preview abstract
In the era of pretrained language models, Transformers are the de facto choice of model architecture. While recent works have shown promise in entirely convolution-based architectures, these CNN-based models have not been widely adopted or evaluated under the pretrain-finetune paradigm. In the context of language models, are convolutional models competitive when pretrained? This paper investigates this research question and presents several interesting findings. Across a set of extensive experiments, our findings show that CNN-based pretrained models are highly competitive and outperform Transformer-based pretrained models in certain scenarios, albeit with caveats. Overall, the findings of this paper should implore the broader academic community to perhaps not conflate pretraining advances with architectural advances, as both sets of techniques could be studied in isolation.
TokenLearner: Adaptive Space-Time Tokenization for Videos
Michael Ryoo
Anurag Arnab
Conference on Neural Information Processing Systems (NeurIPS) (2021)
In this paper, we present an approach for representation learning from videos. Instead of relying on hand-designed splitting strategies to obtain space-time tokens from videos, our approach learns to mine important tokens in video frames. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise interactions between such tokens over a longer temporal horizon. We introduce a vector transformer to capture such pairwise space-time relations, and a technique to fuse the transformed tokens while learning their spatio-temporal patterns. The proposed approach is designed with the intention to allow the tokenizer to adaptively react to input video frames containing diverse visual content, and then to have the vector transformer and subsequent modules learn the underlying spatio-temporal interactions and long-range dependencies in video inputs. We show the effectiveness of the proposed approach over challenging video classification datasets, outperforming the state-of-the-art, despite using much less compute. We further conduct extensive ablation experiments to study the method.
Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay
Samira Abnar
Yikang Shen
Jinfeng Rao
Sebastian Ruder
ICLR 2021 (to appear)
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable performance to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative performance amongst many models. This paper proposes a systematic and unified benchmark, Long Range Arena (LRA), specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.
IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression
Rianne van den Berg
Casper Kaae Sønderby
ICLR 2021 (to appear)
In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states that invertible flows for discrete random variables are less flexible than their continuous counterparts. We demonstrate with a proof that this claim does not hold for integer discrete flows due to the embedding of data with finite support into the countably infinite integer lattice. Furthermore, we zoom in on the effect of gradient bias due to the straight-through estimator in integer discrete flows, and demonstrate that its influence is highly dependent on architecture choices and less prominent than previously thought. Finally, we show how different modifications to the architecture improve the performance of this model class for lossless compression.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Dirk Weissenborn
Jakob Uszkoreit
Sylvain Gelly
ICLR (2021)
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision tasks, attention is usually either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks, while keeping their overall structure in place. We show that this reliance on ConvNets is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-10, etc.), these transformers attain excellent accuracy, matching or outperforming the best convolutional networks while requiring substantially fewer computational resources to train.
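The phrase "sequences of image patches" can be illustrated with a small sketch that cuts an image into non-overlapping square patches and flattens each one, giving the sequence a Transformer would consume after a linear projection. This is a simplified illustration under stated assumptions: the image is a single-channel list of rows, and the function name and the 4x4 toy image are hypothetical.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping
    patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 toy image whose pixel values are 0..15, cut into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
```

With the paper's 16x16 patch size, a 224x224 image would yield 14 x 14 = 196 such patches per channel layout.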
OmniNet: Omnidirectional Representations from Transformers
Yi Tay
Vamsi Aribandi
ICML 2021
This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention mechanism that has the receptive field of the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. In order to mitigate the computationally expensive costs of full receptive field attention, we leverage efficient self-attention models such as kernel-based (Choromanski et al., 2020), low-rank attention (Wang et al., 2020) and/or Big Bird (Zaheer et al., 2020) as the meta-learner. We conduct extensive experiments on autoregressive language modeling (LM1B, C4), Machine Translation, Long Range Arena (LRA) and Image Recognition, showing that OmniNet achieves considerable improvements not only when equipped with sequence-based (1D) Transformers but also on image recognition (finetuning and few-shot learning) tasks. OmniNet also achieves state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr and Long Range Arena.
Preview abstract
Having the right inductive biases can be crucial in many tasks or scenarios where data or computing resources are a limiting factor, or where training data is not perfectly representative of the conditions at test time. However, defining, designing and efficiently adapting inductive biases is not necessarily straightforward. In this paper, we explore the power of knowledge distillation for transferring the effect of inductive biases from one model to another. We consider families of models with different inductive biases, LSTMs vs. Transformers and CNNs vs. MLPs, in the context of tasks and scenarios where having the right inductive biases is critical. We study how the effect of inductive biases is transferred through knowledge distillation, in terms of not only performance but also different aspects of converged solutions.
MetNet: A Neural Weather Model for Precipitation Forecasting
Casper Kaae Sønderby
Avital Oliver
Jason Hickey
Submission to journal (2020)
Weather forecasting is a long-standing scientific challenge with direct social and economic impact. The task is suitable for deep neural networks due to vast amounts of continuously collected data and a rich spatial and temporal structure that presents long range dependencies. We introduce MetNet, a neural network that forecasts precipitation up to 8 hours into the future at the high spatial resolution of 1 km and at the temporal resolution of 2 minutes with a latency in the order of seconds. MetNet takes as input radar and satellite data and forecast lead time and produces a probabilistic precipitation map. The architecture uses axial self-attention to aggregate the global context from a large input patch corresponding to a million square kilometers. We evaluate the performance of MetNet at various precipitation thresholds and find that MetNet outperforms Numerical Weather Prediction at forecasts of up to 7 to 8 hours on the scale of the continental United States.
Preview abstract
Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.
Preview abstract
Learning meaningful and useful task-dependent data representations requires many training instances, but training labels are expensive to obtain and may be of varying quality. This creates a fundamental quality-versus-quantity trade-off in the learning process. Do we learn from the small amount of high-quality data or the potentially large amount of weakly-labeled data (obtained from heuristics or crowd-sourcing, etc.)? We argue that if we could somehow know and take the label-quality into account when learning the data representation, we could get the best of both worlds.
To this end, we propose "fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis according to the posterior confidence of the label-quality estimated by a teacher. Both student and teacher are learned from the data. We evaluate FWL on two real-world tasks in information retrieval and natural language processing where we outperform state-of-the-art alternative semi-supervised methods, indicating that our approach makes better use of the label information and results in better task-dependent data representations.
Preview abstract
Users try to articulate their complex information needs during search sessions by reformulating their queries. In order to make this process more effective, search engines provide related queries to help users to specify the information need in their search process.
In this paper, we propose a customized sequence-to-sequence model for session-based query suggestion. In our model, we employ a query-aware attention mechanism to capture the structure of the session context. This enables us to control the scope of the session from which we infer the suggested next query, which helps not only handle the noisy data but also automatically detect session boundaries. Furthermore, we observe that based on user query reformulation behavior, a large portion of terms of a query in a session is retained from the previously submitted queries in the same session and consists of mostly infrequent or unseen terms that are usually not included in the vocabulary. We therefore empower the decoder of our model to access the source words from the session context during decoding by incorporating a copy mechanism. Moreover, we propose evaluation metrics to assess the quality of the generative models for query suggestion. We conduct an extensive set of experiments and analysis. The results suggest that our model outperforms the baselines in terms of both generating queries and scoring candidate queries for the task of query suggestion.
Neural Ranking Models with Weak Supervision
Hamed Zamani
Jaap Kamps
W. Bruce Croft
Proceedings of The 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2017)
Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision, natural language processing, and speech recognition tasks, such improvements have not generally been observed in ranking for information retrieval. The reason might be related to the complexity of the ranking problem, in the sense that it is not obvious how to learn from queries and documents when no supervised signal is available. Hence, in this paper, we propose to train a neural ranking model from a weak supervision signal, which is a training signal that can be obtained automatically without human labeling or any external resources (e.g., click data). To this aim, we use the output of a known unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and using different input representations (i.e., from encoding query-document pairs into dense/sparse vectors to using word embedding representation). We train our network on 5 million unique queries obtained from the publicly available AOL query logs and two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that feeding raw data to the networks and letting them learn representations for the input data leads to an impressive performance, with over 13% and 35% MAP improvements compared to the BM25 model on the Robust and the ClueWeb collections, respectively. Our findings suggest that neural ranking models can greatly benefit from large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
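The weak supervision signal described here can be sketched as follows: score documents for a query with BM25, then turn the resulting order into pairwise training examples for a ranker. This is a minimal sketch under stated assumptions, not the paper's pipeline: the BM25 variant (with `k1=1.2`, `b=0.75` and a smoothed IDF), the helper names, and the three toy documents are all illustrative choices.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

def weak_pairwise_examples(query, docs):
    """Build (query, preferred_doc_idx, other_doc_idx) triples from the BM25 order."""
    scores = bm25_scores(query, docs)
    return [(query, i, j)
            for i in range(len(docs)) for j in range(len(docs))
            if scores[i] > scores[j]]

# Toy corpus, made up for illustration.
docs = [
    "neural ranking models for retrieval".split(),
    "weather forecasting with neural networks".split(),
    "cooking pasta at home".split(),
]
query = "neural ranking".split()
pairs = weak_pairwise_examples(query, docs)
```

Each triple says that BM25 prefers the first document index over the second for this query; a pair-wise neural ranker could be trained to reproduce that preference without any human labels.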
Learning to Learn from Weak Supervision by Full Supervision
Jaap Kamps
NIPS workshop on Meta-Learning (MetaLearn 2017)
In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network (the learner) and a confidence network (the meta-learner). The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model.
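The effect of scaling gradient updates by a confidence score can be shown with a toy example. The sketch below is an assumption-laden simplification: the target network is replaced by a linear model with squared error, and the confidence is passed in as a plain number in [0, 1] rather than being produced by a trained confidence network. The function name and values are hypothetical.

```python
def weighted_sgd_step(weights, features, weak_label, confidence, lr=0.1):
    """One squared-error SGD step on a linear model, scaled by a confidence in [0, 1]."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - weak_label
    # The gradient of 0.5 * error**2 w.r.t. each weight is error * x;
    # the confidence score shrinks the step for labels that are less trusted.
    return [w - lr * confidence * error * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]
trusted = weighted_sgd_step(weights, [1.0, 2.0], weak_label=1.0, confidence=1.0)
doubted = weighted_sgd_step(weights, [1.0, 2.0], weak_label=1.0, confidence=0.1)
ignored = weighted_sgd_step(weights, [1.0, 2.0], weak_label=1.0, confidence=0.0)
```

The same weak label moves the weights ten times less when its confidence is 0.1 instead of 1.0, and not at all when the confidence is 0.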
Preview abstract
Making use of weak or noisy signals, like the output of heuristic methods or user click-through data, for training deep neural networks is increasing, in particular for tasks where an adequate amount of data with true labels is not available. In a semi-supervised setting, we can use a large set of data with weak labels to pretrain a neural network and fine-tune the parameters with a small amount of data with true labels. However, these two independent stages do not leverage the full capacity of clean information from true labels during pretraining.
In this paper, we propose a semi-supervised learning method where we train two neural networks in a multi-task fashion: a target network and a confidence network. The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to weight the gradient updates to the target network using the scores provided by the confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model. We evaluate our learning strategy on two different tasks: document ranking and sentiment classification. The results demonstrate that our approach not only enhances the performance compared to the baselines but also speeds up the learning process from weak labels.