Ankur Bapna
I am a Staff Software Engineer on the Brain team. My current research interests include multimodal representation learning for speech and text, massively multilingual modeling and applications of these approaches to translation, ASR, TTS and tasks involving end-to-end speech understanding and generation.
Authored Publications
Google Publications
Other Publications
Sort By
Multimodal Language Identification
Shikhar Bharadwaj
Sriram (Sri) Ganapathy
Sid Dalmia
Wei Han
Yu Zhang
Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024) (2024)
Preview abstract
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however in the case of video data there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an ablation study that describes the distinct contribution of each modality for language recognition.
View details
Label Aware Speech Representation Learning For Language Identification
Shikhar Bharadwaj
Sriram Ganapathy
Wei Han
Proceedings of Interspeech 2023, pp. 5351-5355
Preview abstract
The speech representation learning approaches, for nonsemantic tasks like language recognition, have either explored supervised embedding extraction methods using a classifier model or the self-supervised representation learning approach using raw data. In this paper, we propose a novel framework of combining the self-supervised representation learning with the language label information for the pre-training task. This framework, termed as label aware speech representation learning (LASR), uses a triplet based objective function to incorporate the language labels along with the self-supervised loss function. The speech representations are further fine-tuned for the identification task. The language recognition experiments are performed on two public datasets - FLEURS and Dhwani. In these experiments, we illustrate that the proposed LASR framework improves over the state-of-art systems in terms of recognition performance. We also report an analysis of the robustness of the LASR approach to noisy/missing labels as well as the application of the LASR model for downstream multi-lingual speech recognition tasks.
View details
LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
Interspeech 2023 (2023)
Preview abstract
This paper introduces a new speech dataset called ``LibriTTS-R'' designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples showed significantly improved sound quality compared to those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from [URL-HERE]
View details
Preview abstract
We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on un-labeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition(ASR), Automatic Speech Translation (AST)and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains ona sequence-to-sequence masked denoising objective similar to T5 on both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoSTAST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improv-ing on xx-en translation over the previous best by 1.9 Bleu points and on en-xx translation by 0.9 Bleu points. On Voxpopuli ASR, our model matches the performance of a mSLAM model finetuned with a RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
View details
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech
Takaaki Saeki
Zhehuai Chen
Nobuyuki Morioka
Yu Zhang
ICASSP (2023)
Preview abstract
This paper proposes Virtuoso, a massive multilingual speech–text joint learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech–text paired data in low-resource languages. This study extends Maestro, which is a speech–text semi-supervised joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline TTS models in seen languages, and 2) these models can synthesize reasonably good speech for unseen languages where no paired TTS data is available.
View details
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
WASPAA 2023 (2023) (to appear)
Preview abstract
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio-quality. To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) linguistic features extracted from transcripts and PnG-BERT for conditioning features. Experiments show that the proposed model (i) is robust against various audio degradation, (ii) can restore samples in the LJspeech dataset and improves the quality of text-to-speech (TTS) outputs without changing the model and hyper-parameters, and (iii) enable us to train a high-quality TTS model from restored speech samples collected from the web.
View details
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Lisa Wang
Ahsan Wahab
Nasanbayar Ulzii-Orshikh
Allahsera Auguste Tapo
Nishant Subramani
Artem Sokolov
Claytone Sikasote
Monang Setyawan
Supheakmungkol Sarin
Sokhar Samb
Benoît Sagot
Clara E. Rivera
Annette Rios
Isabel Papadimitriou
Salomey Osei
Pedro Javier Ortiz Suárez
Iroro Fred Ọ̀nọ̀mẹ̀ Orife
Kelechi Ogueji
Rubungo Andre Niyongabo
Toan Nguyen
Mathias Müller
André Müller
Shamsuddeen Hassan Muhammad
Nanda Muhammad
Ayanda Mnyakeni
Jamshidbek Mirzakhalov
Tapiwanashe Matangira
Colin Leong
Nze Lawson
Yacine Jernite
Mathias Jenny
Bonaventure F. P. Dossou
Sakhile Dlamini
Nisansa de Silva
Sakine Çabuk Ballı
Stella Biderman
Alessia Battisti
Ahmed Baruwa
Pallavi Baljekar
Israel Abebe Azime
Ayodele Awokoya
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
TACL (2022)
Preview abstract
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
View details
XTREME-S: Evaluating Cross-lingual Speech Representations
Clara E. Rivera
Sebastian Ruder
Simran Khanuja
Ye Jia
Yu Zhang
Proc. Interspeech 2022
Preview abstract
We introduce \xtremes, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, retrieval and speech-to-text translation. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in ``universal'' speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. The code and pre-processing scripts will be made publicly available.\footnote{\small\url{https://huggingface.co/datasets/google/xtreme_s}}
View details
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau
Simran Khanuja
Yu Zhang
Siddharth Dalmia
Clara Rivera
IEEE Spoken Language Technology Workshop (SLT) (2022)
Preview abstract
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.
View details
MAESTRO: Matched Speech Text Representations through Modality Matching
Pedro Jose Moreno Mengibar
Yu Zhang
Zhehuai Chen
interspeech 2022 (2022) (to appear)
Preview abstract
We present Maestro, a self-supervised training method to
unify representations learnt from speech and text modalities.
Self-supervised learning from speech signals aims to learn the
latent structure inherent in the signal, while self-supervised
learning from text attempts to capture lexical information.
Learning aligned representations from unpaired speech and
text sequences is a challenging task. Previous work either
implicitly enforced the representations learnt from these two
modalities to be aligned in the latent space through multi-
tasking and parameter sharing or explicitly through conversion
of modalities via speech synthesis. While the former suffers
from interference between the two modalities, the latter
introduces additional complexity. In this paper, we propose
Maestro, a novel algorithm to learn unified representations from
both these modalities simultaneously that can transfer to diverse
downstream tasks such as Automated Speech Recognition
(ASR) and Speech Translation (ST). Maestro learns unified
representations through sequence alignment, duration predic-
tion and matching embeddings in the learned space through
an aligned masked-language model loss. We establish a new
state-of-the-art (SOTA) on VoxPopuli multilingual ASR with
a 8% relative reduction in Word Error Rate (WER), multi-
domain SpeechStew ASR (3.7% relative) and 21 languages to
English multilingual ST on CoVoST 2 with an improvement of
2.8 BLEU averaged over 21 languages.
View details
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Zhehuai Chen
Yu Zhang
Pedro Moreno Mengibar
IEEE SLT (2022)
Preview abstract
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in~\cite{zhehuai2021} can be leveraged to train a massively multilingual ASR model without any transcribed speech. In most zero resource conditions, lack of transcribed speech also implies lack of lexicons. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero transcribed speech, real-world setting to expand the set of languages covered by ASR models with only unlabeled speech and text in the target languages. We define the task to cover $102$ languages, where transcribed speech is available in $52$ of these languages and can be used to improve end-to-end ASR quality on the remaining $50$. First, we show that by combining speech representations with byte-level text representations coupled with the effective use of language embeddings, we can dramatically reduce the resource requirements for deploying an ASR model to a new language. On the FLEURS dataset, this approach is able to reduce the CER on languages with no transcribed speech from 64.1\% to 29.6\%, a relative reduction of 54\%. Second, using a subset of Indic languages we show that the proposed method can learn effectively from languages with transcribed speech even when there is limited to no graphemeic overlap with the target languages, reducing the average CER of the target languages from 56.3 to 17.2. We believe this is the first demonstration that competitive ASR performance can be achieved for an unseen language using no language resources other than text and untranscribed speech.
View details
Preview abstract
Multilingual neural machine translation (NMT) typically learns to maximize the likelihood of training examples from a combination set of multiple language pairs. However, this mechanical combination only relies on the basic sharing to learn the inductive bias, which undermines the generalization and transferability of multilingual NMT models. In this paper, we introduce a multilingual crossover encoder-decoder (mXEnDec) to fuse language pairs at instance level to exploit cross-lingual signals. For better fusions on multilingual data, we propose several techniques to deal with the language interpolation, dissimilar language fusion and heavy data imbalance. Experimental results on a large-scale WMT multilingual data set show that our approach significantly improves model performance on general multilingual test sets and the model transferability on zero-shot test sets (up to $+5.53$ BLEU).
Results on noisy inputs demonstrates the capability of our approach to improve model robustness against the code-switching noise. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level.
View details
Building Machine Translation Systems for the Next Thousand Languages
Julia Kreutzer
Mengmeng Niu
Pallavi Nikhil Baljekar
Xavier Garcia
Maxim Krikun
Pidong Wang
Apu Shah
Macduff Richard Hughes
Google Research (2022)
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Dmitry (Dima) Lepikhin
Maxim Krikun
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference (2021)
Preview abstract
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6x.
View details
Preview abstract
Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
View details
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
Henry Tsai
Naveen Ari
AAAI 2020 (2020)
Preview abstract
Recently proposed Massively Multilingual Neural Machine Translation system has been shown to be capable of translating 102 languages to and from English within a single model. In this paper, we evaluate the cross-lingual effectiveness of representations from the encoder of such a model on 5 downstream classification and sequence tagging tasks spanning more than 50 languages. We compare our results to a strong multilingual baseline, BERT and show modest gains on zero-shot cross-lingual transfer in 4 out of these 5 tasks. Our results provide strong insight into how applicable the representations learned from multilingual machine translation are, across languages and tasks.
View details
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Naveen Ari
ACL 2020 (2020)
Preview abstract
Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 28 BLEU on ro-en translation without any parallel data or back-translation.
View details
Preview abstract
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S=s1,...,sS, we propose truncating the target-side context used for incremental predictions by making a Markov (N-gram) assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4,...,8, depending on the task.
View details
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Dmitry (Dima) Lepikhin
George Foster
Maxim Krikun
Naveen Ari
(2019)
Preview abstract
We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to the quality and practicality towards universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research.
View details
Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
Interspeech 2019 (2019) (to appear)
Preview abstract
Multilingual end-to-end (E2E) models have shown great
promise as a means to expand coverage of the world’s lan-
guages by automatic speech recognition systems. They im-
prove over monolingual E2E systems, especially on low re-
source languages, and simplify training and serving by elimi-
nating language-specific acoustic, pronunciation, and language
models. This work aims to develop an E2E multilingual system
which is equipped to operate in low-latency interactive applica-
tions as well as handle the challenges of real world imbalanced
data. First, we present a streaming E2E multilingual model.
Second, we compare techniques to deal with imbalance across
languages. We find that a combination of conditioning on a
language vector and training language-specific adapter layers
produces the best model. The resulting E2E multilingual model
system achieves lower word error rate (WER) than state-of-the-
art conventional monolingual models by at least 10% relative
on every language.
View details
Preview abstract
Neural Networks trained with gradient descent are known to be susceptible to catastrophic forgetting caused by parameter shift during the training process. In the context of Neural Machine Translation (NMT) this results in poor performance on heterogeneous datasets, and on sub-tasks like rare phrase translation. On the other hand, non-parametric approaches are immune to forgetting, perfectly complementing the generalization ability of NMT. However, attempts to combine non-parametric or retrieval based approaches with NMT have only been successful on narrow domains, possibly due to over-reliance on sentence level retrieval. We propose a novel n-gram level retrieval approach that relies on local phrase level similarities, allowing us to retrieve neighbors that are useful for translation even when overall sentence similarity is low. We complement this with an expressive neural model, allowing our model to extract information from the noisy retrieved context. We evaluate our semi-parametric NMT approach on a heterogeneous dataset composed of WMT, IWSLT, JRC-Acquis and OpenSubtitles, and demonstrate gains on all 4 evaluation sets. The semi-parametric nature of our approach also opens the door for non-parametric domain adaptation, demonstrating strong inference-time adaptation performance on new domains without the need for any parameter updates.
View details
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Youlong Cheng
Dehao Chen
HyoukJoong Lee
Jiquan Ngiam
NeurIPS (2019)
Preview abstract
Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.
View details
Preview abstract
Multilingual Neural Machine Translation (NMT) models have yielded large empirical success in transfer learning settings. However, these black-box representations are poorly understood, and their mode of transfer remains elusive. In this work, we attempt to understand massively multilingual NMT representations (with over 100 languages) using Singular Value Canonical Correlation Analysis (SVCCA), a representation similarity framework that allows us to compare representations across different languages, layers and models. Our analysis validates several empirical results and long-standing intuitions, and unveils new observations regarding how representations evolve in a multilingual translation model. We draw two major results from our analysis: (i) Representations of the same sentences across different languages cluster based on linguistic similarity and (ii) Source sentence representations learned by the encoder are dependent on the target language. We further confirm our observations with carefully designed experiments and connect our findings with existing results in multilingual NMT and cross-lingual transfer learning.
View details
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
George Foster
Llion Jones
Macduff Hughes
Mike Schuster
Niki J. Parmar
ACL'18 (2018) (to appear)
Preview abstract
The past year has witnessed rapid advances in sequence-to-sequence (seq2seq)
modeling for Machine Translation (MT). The classic RNN-based approaches to MT
were first out-performed by the convolutional seq2seq model, which was then
out-performed by the more recent Transformer model. Each of these new
approaches consists of a fundamental architecture accompanied by a set of
modeling and training techniques that are in principle applicable to other
seq2seq architectures. In this paper, we tease apart the new architectures and
their accompanying techniques in two ways. First, we identify several key
modeling and training techniques, and apply them to the RNN architecture,
yielding a new RNMT+ model that outperforms all of the three fundamental architectures
on the benchmark WMT'14 English to French and
English to German tasks. Second, we analyze the properties of each
fundamental seq2seq architecture and devise new hybrid architectures intended
to combine their strengths. Our hybrid models obtain further improvements,
outperforming the RNMT+ model on both benchmark datasets.
View details
Revisiting Character-Based Neural Machine Translation with Capacity and Compression
George Foster
Empirical Methods in Natural Language Processing (2018)
Preview abstract
Translating characters instead of words or word-fragments has the potential to simplify the processing pipeline for neural machine translation (NMT), and improve results by eliminating hyper-parameters and manual feature engineering. However, it results in longer sequences in which each symbol contains less information, creating both modeling and computational challenges. In this paper, we show that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments. This result implies that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences. From this perspective, we evaluate several techniques for character-level NMT, verify that they do not match the performance of our deep character baseline model, and evaluate the performance versus computation time tradeoffs they offer. Within this framework, we also perform the first evaluation for NMT of conditional computation over time, in which the model learns which timesteps can be skipped, rather than having them be dictated by a fixed schedule specified before training begins.
View details
Preview abstract
While current state-of-the-art NMT models, both LSTM based and Transformers, are much deeper compared to their early counterparts, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper transformer and BiLSTM encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in significant improvements on the benchmark WMT'14 English-German and WMT'15 Czech-English tasks for both architectures.
View details
Building a Conversational Agent Overnight with Dialogue Self-Play
Pararth Shah
Dilek Hakkani-Tur
Gokhan Tur
Neha Nayak
Larry Heck
arxiv.org (2018)
Preview abstract
We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue "outlines", i.e. sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We propose a new corpus of 3,000 dialogues spanning 2 domains collected with M2M, and present comparisons with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows.
View details
Preview abstract
State-of-the-art slot filling models for goal-oriented human/machine conversational language understanding systems rely on deep learning methods. Multi-task training of such models alleviate the need for in-domain annotated datasets, as they benefit from shared wording, meanings and schema elements across different tasks and domains. However, bootstrapping a semantic parsing model for a new domain using only the semantic frame, such as the back-end API or knowledge graph schema, is still one of the holy grail tasks of language understanding. This paper proposes a deep learning based approach that can utilize only the slot label descriptions in context without the need of any labeled or unlabeled in-domain examples, to quickly bootstrap a new domain. The main idea is using the encoding of the slot names and descriptions within a multi-task deep learning slot filling model, resulting in soft alignments across domains by leveraging implicit transfer learning. Such an approach is promising for solving the domain scaling problem of language understanding models and eliminates dependency on large amounts manually annotated training data sets. Furthermore, our controlled experiments using a multitude of domains show that this approach results in significantly better semantic parsing performance when compared to using only in-domain data.
View details
Preview abstract
Spoken Language Understanding (SLU) is a key component of goal oriented dialogue systems that would parse user utterances into semantic frame representations. Traditionally SLU does not utilize the dialogue history beyond the previous system turn and contextual ambiguities are resolved by the downstream components. In this paper, we explore novel approaches for modeling dialogue context in a recurrent neural network (RNN) based language understanding system. We propose the Sequential Dialogue Encoder Network, that allows encoding context from the dialogue history in chronological order. We compare the performance of our proposed architecture with two context models, one that uses just the previous turn context and another that encodes dialogue context in a memory network, but loses the order of utterances in the dialogue history. Experiments with a multi-domain dialogue dataset demonstrate that the proposed architecture results in reduced semantic frame error rates.
View details
No Results Found