Yun-hsuan Sung

Authored Publications
    Despite recent progress, it has been difficult to prevent semantic hallucinations in generative Large Language Models. One common solution is to augment LLMs with a retrieval system and ensure that the generated output is attributable to the retrieved information. Given this added constraint, it is plausible to expect that the overall quality of the output will be affected, for example in terms of fluency. Can scaling language models help? Here we examine the relationship between fluency and attribution in LLMs prompted with retrieved evidence in knowledge-heavy dialog settings. Our experiments use a set of automatic metrics aligned with human preferences to evaluate a large set of generations produced under varying LLM parameters and supplied contexts. We show that larger models tend to do much better in both fluency and attribution, and that (naively) using top-k retrieval instead of top-1 retrieval improves attribution but hurts fluency. We then propose a recipe that allows smaller models to close the gap with larger models and to preserve the benefits of top-k retrieval while avoiding its drawbacks.
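    As an illustration of the setup described above (not code from the paper), the sketch below contrasts top-1 and top-k evidence in the prompt of a retrieval-augmented dialogue model; `retrieve` and `generate` are hypothetical stand-ins for the retriever and the LLM call.

```python
# Minimal sketch of retrieval-augmented generation with top-1 vs. top-k evidence.
# `retrieve(query)` and `generate(prompt)` are hypothetical stand-ins.

def build_prompt(dialog_history, evidence_passages):
    """Formats retrieved evidence plus the dialogue so the reply can be
    attributed to (grounded in) the passages the model was shown."""
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(evidence_passages, 1))
    return (
        "Evidence:\n" + evidence + "\n\n"
        "Dialogue:\n" + "\n".join(dialog_history) + "\n"
        "Reply using only the evidence above, citing passage numbers:"
    )

def respond(dialog_history, query, retrieve, generate, k=1):
    # k=1 is the top-1 setting; larger k is the naive top-k setting that the
    # abstract reports improves attribution but can hurt fluency.
    passages = retrieve(query)[:k]
    return generate(build_prompt(dialog_history, passages))
```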
    LongT5: Efficient Text-To-Text Transformer for Long Sequences
    Joshua Ainslie
    David Uthus
    Jianmo Ni
    Yinfei Yang
    Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics
    Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
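    A toy, single-head sketch of the Transient Global idea (not the LongT5 implementation, and without learned projections): each position attends to a local window plus "global" tokens formed on the fly as per-block means of the input.

```python
import numpy as np

def tglobal_attention(x, radius=2, block=4):
    """Toy Transient Global attention over an input of shape (n, d); n is
    assumed to be a multiple of `block`. Global tokens are per-block means
    computed on the fly, so no extra side-inputs are required."""
    n, d = x.shape
    global_tokens = x.reshape(n // block, block, d).mean(axis=1)
    keys = np.concatenate([x, global_tokens], axis=0)      # local + global keys

    scores = x @ keys.T / np.sqrt(d)
    mask = np.full_like(scores, -np.inf)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        mask[i, lo:hi] = 0.0     # local window around position i
        mask[i, n:] = 0.0        # every position sees all global tokens
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys        # keys double as values in this toy version
```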
    Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
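    The training signal described above can be pictured with a symmetric in-batch contrastive loss; this is a generic NumPy sketch, not the paper's code, and the temperature value is arbitrary.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss for a dual encoder: matched
    image/text pairs (the diagonal) are pulled together and every other
    pair in the batch acts as a negative."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarities

    def xent_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Image-to-text and text-to-image directions, averaged.
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```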
    Self-supervised Learning for Pairwise Data Refinement
    Bowen Liang
    Wei Wang
    Zarana Parekh
    Yinfei Yang
    Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China (2020), pp. 435-446 (to appear)
    We present a self-supervised method to refine pairwise data using the contents of the data itself. Our method computes cross-lingual similarity scores with a dual-encoder model and uses them to select data for training new dual-encoder models in an iterative way. To illustrate the functionality of our method, we apply it to the task of denoising parallel texts mined from the internet on two language pairs: en-fr and en-de. We train dual-encoder models on the refined data and test them on the BUCC bitext mining tasks. The dual-encoder models show steady performance improvement with every iteration. We also use the refined data to train machine translation models that we integrate in our method for further improvement of the dual-encoder models. The machine translation models that we evaluate are competitive against similar models trained with data filtered with a supervised approach. Because our method is entirely self-supervised, it is well suited to text data for which there is no prior knowledge about the language or for which labeled clean data is not available.
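    A compact sketch of the iterative loop described above; `train_dual_encoder` is a hypothetical function returning a model that exposes a `score(src, tgt)` similarity, and the keep fraction is illustrative.

```python
def refine(pairs, train_dual_encoder, iterations=3, keep_fraction=0.8):
    """Score noisy pairs with the current dual encoder, keep the
    highest-scoring fraction, and retrain on the kept subset."""
    data = list(pairs)
    model = train_dual_encoder(data)
    for _ in range(iterations):
        ranked = sorted(data, key=lambda p: model.score(*p), reverse=True)
        data = ranked[: int(len(ranked) * keep_fraction)]   # drop likely-noisy pairs
        model = train_dual_encoder(data)                    # retrain on cleaner data
    return model, data
```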
    Machine Translation Aided Bilingual Data-to-Text Generation and Semantic Parsing
    Heming Ge
    Oshin Agarwal
    Siamak Shakeri
    3rd Workshop on Natural Language Generation from the Semantic Web (2020)
    We present a system for bilingual Data-To-Text Generation and Semantic Parsing. We use a text-to-text generator to learn a single model that works for both languages on each of the tasks. The model is aided by machine translation during both pre-training and fine-tuning. We evaluate the system on WebNLG 2020 data, which consists of RDF triples in English and natural language sentences in English and Russian for both tasks. We achieve considerable gains over monolingual models, especially on unseen relations and on Russian.
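    One way to picture the single bilingual text-to-text model is via task-prefixed examples; the prefixes and the `translate` helper below are hypothetical, not taken from the paper.

```python
def to_text_examples(rdf_triples, sentence_en, translate):
    """Turns one WebNLG-style record into (input, target) pairs covering
    both tasks and both languages; machine translation supplies the
    Russian side."""
    data = " | ".join(" ; ".join(triple) for triple in rdf_triples)
    sentence_ru = translate(sentence_en, target="ru")
    return [
        ("generate en: " + data, sentence_en),   # data-to-text, English
        ("generate ru: " + data, sentence_ru),   # data-to-text, Russian
        ("parse en: " + sentence_en, data),      # semantic parsing, English
        ("parse ru: " + sentence_ru, data),      # semantic parsing, Russian
    ]
```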
    Universal Sentence Encoder
    Yinfei Yang
    Sheng-yi Kong
    Nan Hua
    Nicole Lyn Untalan Limtiaco
    Rhomni St. John
    Steve Yuan
    Chris Tar
    Brian Strope
    Ray Kurzweil
    In submission to: EMNLP demonstration, Association for Computational Linguistics, Brussels, Belgium (2018)
    We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word-level transfer learning via pretrained word embeddings as well as baselines that do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word-level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.
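    A minimal usage sketch of the released encoder, assuming the TF Hub module path below is still current and TensorFlow 2 is installed.

```python
import numpy as np
import tensorflow_hub as hub

# Module path assumed; check tfhub.dev for the current version.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["How old are you?", "What is your age?", "The weather is nice today."]
embeddings = embed(sentences).numpy()          # shape: (3, 512)

# Cosine similarities: the two paraphrases should score highest with each other.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(np.round(normed @ normed.T, 2))
```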
    This paper presents a computationally efficient machine-learned method for natural language response suggestion. Feed-forward neural networks using n-gram embedding features encode messages into vectors which are optimized to give message-response pairs a high dot-product value. An optimized search finds response suggestions. The method is evaluated in a large-scale commercial e-mail application, Inbox by Gmail. Compared to a sequence-to-sequence approach, the new system achieves the same quality at a small fraction of the computational requirements and latency.
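    The scoring step can be pictured as below (a sketch with precomputed embeddings, not the production system); in practice an optimized or approximate search replaces the exhaustive dot products.

```python
import numpy as np

def suggest(message_vec, response_vecs, responses, k=3):
    """Score each candidate response by its dot product with the encoded
    message and return the k highest-scoring suggestions."""
    scores = response_vecs @ message_vec        # one dot product per candidate
    top = np.argsort(-scores)[:k]
    return [responses[i] for i in top]
```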
    We investigate the task of modeling open-domain, multi-turn, unstructured, multi-participant, conversational dialogue. We specifically study the effect of incorporating different elements of the conversation. Unlike previous efforts, which focused on modeling messages and responses, we extend the modeling to long context and participant's history. Our system does not rely on handwritten rules or engineered features; instead, we train deep neural networks on a large conversational dataset. In particular, we exploit the structure of Reddit comments and posts to extract 2.1 billion messages and 133 million conversations. We evaluate our models on the task of predicting the next response in a conversation, and we find that modeling both context and participants improves prediction accuracy.
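    A toy illustration of extending the scorer beyond message/response pairs; `encode` is a hypothetical text encoder returning a vector, and the weights are illustrative only, not the paper's model.

```python
import numpy as np

def score_response(encode, message, context, participant_history, response):
    """Score a candidate response against the message plus encoded
    conversation context and the author's previous comments."""
    query = (encode(message)
             + 0.5 * encode(" ".join(context))
             + 0.25 * encode(" ".join(participant_history)))
    return float(np.dot(query, encode(response)))
```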
    We evaluate different architectures to recognize multilingual speech for real-time mobile applications. In particular, we show that combining the results of several recognizers greatly outperforms other solutions such as training a single large multilingual system or using an explicit language identification system to select the appropriate recognizer. Experiments are conducted on a trilingual English-French-Mandarin mobile speech task. The data set includes Google searches, Maps queries, as well as more general inputs such as email and short message dictation. Without pre-specifying the input language, the combined system achieves comparable accuracy to that of the monolingual systems when the input language is known. The combined system is also roughly 5% absolute better than an explicit language identification approach, and 10% better than a single large multilingual system.
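    One simple combination strategy, shown only as an illustration (the paper's combination method may differ): run every monolingual recognizer and keep the most confident hypothesis. Each recognizer is a hypothetical callable returning (transcript, confidence).

```python
def combine(recognizers, audio):
    """Run all monolingual recognizers on the utterance and return the
    transcript with the highest confidence score."""
    hypotheses = [recognize(audio) for recognize in recognizers]
    return max(hypotheses, key=lambda h: h[1])[0]
```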
    Deploying Google Search by Voice in Cantonese
    Martin Jansche
    12th Annual Conference of the International Speech Communication Association (Interspeech 2011), pp. 2865-2868
    We describe our efforts in deploying Google search by voice for Cantonese, a southern Chinese dialect widely spoken in and around Hong Kong and Guangzhou. We collected audio data from local Cantonese speakers in Hong Kong and Guangzhou by using our DataHound smartphone application. This data was used to create appropriate acoustic models. Language models were trained on anonymized query logs from Google Web Search for Hong Kong. Because users in Hong Kong frequently mix English and Cantonese in their queries, we designed our system from the ground up to handle both languages. We report on experiments with different techniques for mapping the phoneme inventories for both languages into a common space. Based on extensive experiments we report word error rates and web scores for both Hong Kong and Guangzhou data. Cantonese Google search by voice was launched in December 2010.
    Letter units, or graphemes, have been reported in the literature as a surprisingly effective substitute to the more traditional phoneme units, at least in languages that enjoy a strong correspondence between pronunciation and orthography. For English however, where letter symbols have less acoustic consistency, previously reported results fell short of systems using highly-tuned pronunciation lexicons. Grapheme units simplify system design, but since graphemes map to a wider set of acoustic realizations than phonemes, we should expect grapheme-based acoustic models to require more training data to capture these variations. In this paper, we compare the rate of improvement of grapheme and phoneme systems trained with datasets ranging from 450 to 1200 hours of speech. We consider various grapheme unit configurations, including using letter-specific, onset, and coda units. We show that the grapheme systems improve faster and, depending on the lexicon, reach or surpass the phoneme baselines with the largest training set.
    Index Terms: Acoustic modeling, graphemes, directory assistance, speech recognition.
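    The practical appeal of grapheme units is that the lexicon can be generated mechanically; a minimal sketch (ignoring the letter-specific, onset, and coda variants mentioned above):

```python
def grapheme_lexicon(words):
    """Map each word to its letter sequence, replacing a hand-curated
    phoneme pronunciation lexicon."""
    return {word: list(word.lower()) for word in words}

# Example: {'google': ['g', 'o', 'o', 'g', 'l', 'e'], 'voice': ['v', 'o', 'i', 'c', 'e']}
print(grapheme_lexicon(["Google", "voice"]))
```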
    Hidden Conditional Random Fields for Speech Recognition
    Ph.D. Thesis, Stanford University (2010)
    Hidden Conditional Random Fields for Phone Recognition
    Dan Jurafsky
    IEEE workshop on Automatic Speech Recognition and Understanding (2009)
    Maximum Conditional Likelihood Linear Regression and Maximum a Posteriori for Hidden Conditional Random Fields Speaker Adaptation
    Constantinos Boulis
    Dan Jurafsky
    ICASSP (2008)
    Regularization, Adaptation, and Non-Independent Features Improve Hidden Conditional Random Fields for Phone Classification
    Constantinos Boulis
    Christopher Manning
    Dan Jurafsky
    IEEE workshop on Automatic Speech Recognition and Understanding (2007)
    Detection of Word Fragments in Mandarin Telephone Conversation
    Cheng-Tao Chu
    Yuan Zhao
    Dan Jurafsky
    Interspeech (2006)