Sneha Reddy Kudugunta

I joined the Google AI Residency right after receiving my bachelor’s degree in Computer Science and Engineering from the Indian Institute of Technology, Hyderabad. While there, I worked on viewing neural networks in terms of quasi-convex optimization problems. I’ve also spent time at the Institute for Pure and Applied Mathematics, UCLA, modeling cryptographic side-channel attacks, and at the Information Sciences Institute, USC, using machine learning to detect social bots. These experiences have gotten me interested both in understanding language, especially in less structured contexts, and in using mathematics to understand machine learning. Given my diverse interests, what drew me to this program was the opportunity to explore different research directions before grad school at a place with teams working on a broad variety of areas. I’m currently working on improving transfer learning for natural language understanding; the difficulties of natural language understanding make transfer learning in this context an especially exciting challenge. My time here so far has been amazing, and over the next year I’m excited to learn and make use of all the opportunities available at Google. In my free time, I enjoy reading, exposing my friends to cringe-pop, doodling (sketchpad optional!), and being outdoors.
Authored Publications
    In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid, or navigate out of, regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
    Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
    Julia Kreutzer
    Lisa Wang
    Ahsan Wahab
    Nasanbayar Ulzii-Orshikh
    Allahsera Auguste Tapo
    Nishant Subramani
    Artem Sokolov
    Claytone Sikasote
    Monang Setyawan
    Supheakmungkol Sarin
    Sokhar Samb
    Benoît Sagot
    Clara E. Rivera
    Annette Rios
    Isabel Papadimitriou
    Salomey Osei
    Pedro Javier Ortiz Suárez
    Iroro Fred Ọ̀nọ̀mẹ̀ Orife
    Kelechi Ogueji
    Rubungo Andre Niyongabo
    Toan Nguyen
    Mathias Müller
    André Müller
    Shamsuddeen Hassan Muhammad
    Nanda Muhammad
    Ayanda Mnyakeni
    Jamshidbek Mirzakhalov
    Tapiwanashe Matangira
    Colin Leong
    Nze Lawson
    Yacine Jernite
    Mathias Jenny
    Bonaventure F. P. Dossou
    Sakhile Dlamini
    Nisansa de Silva
    Sakine Çabuk Ballı
    Stella Biderman
    Alessia Battisti
    Ahmed Baruwa
    Pallavi Baljekar
    Israel Abebe Azime
    Ayodele Awokoya
    Duygu Ataman
    Orevaoghene Ahia
    Oghenefego Ahia
    Sweta Agrawal
    Mofetoluwa Adeyemi
    TACL (2022)
    With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
    Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
    Dmitry (Dima) Lepikhin
    Maxim Krikun
    (2021)
    Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving. In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models. On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9x when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6x.
    Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., 2021), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets; more importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL improves zero-shot mean recall by 14.4% on average for eight under-resourced languages and by 6.6% on average when fine-tuning. Interestingly, we also find that text representations learned from MURAL cluster based on areal linguistics as well, like the Balkan sprachbund, and not just language genealogy.
    Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 28 BLEU on ro-en translation without any parallel data or back-translation.
    Multilingual Neural Machine Translation (NMT) models have yielded large empirical success in transfer learning settings. However, these black-box representations are poorly understood, and their mode of transfer remains elusive. In this work, we attempt to understand massively multilingual NMT representations (with over 100 languages) using Singular Value Canonical Correlation Analysis (SVCCA), a representation similarity framework that allows us to compare representations across different languages, layers and models. Our analysis validates several empirical results and long-standing intuitions, and unveils new observations regarding how representations evolve in a multilingual translation model. We draw two major results from our analysis: (i) Representations of the same sentences across different languages cluster based on linguistic similarity and (ii) Source sentence representations learned by the encoder are dependent on the target language. We further confirm our observations with carefully designed experiments and connect our findings with existing results in multilingual NMT and cross-lingual transfer learning.