Ananda Theertha Suresh

Ananda Theertha Suresh is a research scientist at Google. He obtained his PhD from the University of California, San Diego, where he was advised by Prof. Alon Orlitsky. His research interests lie at the intersection of machine learning, information theory, and statistics. More details can be found at theertha.info.
Authored Publications
    Efficient Language Model Architectures for Differentially Private Federated Learning
    Yanxiang Zhang
    Privacy Regulation and Protection in Machine Learning Workshop at ICLR 2024 (2024) (to appear)
    Preview abstract Cross-device federated learning (FL) is a technique that trains a model on data distributed across typically millions of edge devices without data ever leaving the devices. SGD is the standard client optimizer for on-device training in cross-device FL, favored for its memory and computational efficiency. However, in centralized training of neural language models, adaptive optimizers are preferred as they offer improved stability and performance. In light of this, we ask if language models can be modified such that they can be efficiently trained with SGD client optimizers and answer this affirmatively. We propose a scale-invariant Coupled Input Forget Gate (SI CIFG) recurrent network by modifying the sigmoid and tanh activations in the recurrent cell and show that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL in large-scale experiments. We further show that the proposed scale-invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate that the scale-invariant modification is also compatible with other non-adaptive algorithms. In particular, our results suggest an improved privacy-utility trade-off in federated learning with differential privacy. View details
    Preview abstract Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M parameter Transformer that achieves the same perplexity as that of a similarly sized LSTM with ~10x smaller client-to-server communication cost and 11% lower perplexity than smaller LSTMs commonly studied in the literature. View details
    Preview abstract We propose a practical maximum-likelihood-estimation framework for regression as an alternative to the typical approach of Empirical Risk Minimization (ERM) over a specific loss metric. Our approach is better suited to capture inductive biases in datasets, and can output post-hoc estimators at inference time that can optimize different types of loss metrics. We present theoretical evidence (in the fixed design setting) to demonstrate that our approach is always competitive with using ERM over the loss metric, and in many practical scenarios can be much superior to ERM. For time series forecasting, we propose an end-to-end MLE based training and inference approach that can flexibly capture various inductive biases, and optimize prediction accuracy for a variety of typical loss metrics, without having to choose a specific loss metric at training time. We demonstrate empirically that our method instantiated with a well-designed general purpose likelihood can obtain superior performance over ERM for a variety of time-series forecasting and regression datasets with different inductive biases and data distributions. View details
    FedJAX: Federated learning simulation with JAX
    Ke Wu
    1st NeurIPS Workshop on New Frontiers in Federated Learning (NFFL 2021) (2021)
    Preview abstract Federated learning is a machine learning technique that enables training across decentralized data. Recently, federated learning has become an active area of research due to an increased focus on privacy and security. In light of this, a variety of open source federated learning libraries have been developed and released. We introduce FedJAX, a JAX-based open source library for federated learning simulations that emphasizes ease-of-use in research. With its simple primitives for implementing federated learning algorithms, prepackaged datasets, models and algorithms, and fast simulation speed, FedJAX aims to make developing and evaluating federated algorithms faster and easier for researchers. Our benchmark results show that FedJAX can be used to train models with federated averaging on the EMNIST dataset in a few minutes and the Stack Overflow dataset in roughly an hour with standard hyperparameters using TPUs. View details
    A Field Guide to Federated Optimization
    Jianyu Wang
    Gauri Joshi
    Maruan Al-Shedivat
    Galen Andrew
    A. Salman Avestimehr
    Katharine Daly
    Deepesh Data
    Suhas Diggavi
    Hubert Eichner
    Advait Gadhikar
    Antonious M. Girgis
    Filip Hanzely
    Chaoyang He
    Samuel Horvath
    Martin Jaggi
    Tara Javidi
    Sai Praneeth Karimireddy
    Jakub Konečný
    Sanmi Koyejo
    Tian Li
    Peter Richtarik
    Virginia Smith
    Mahdi Soltanolkotabi
    Weikang Song
    Sebastian Stich
    Ameet Talwalkar
    Hongyi Wang
    Blake Woodworth
    Honglin Yuan
    Mi Zhang
    Tong Zhang
    Chunxiang (Jake) Zheng
    Chen Zhu
    arXiv (2021)
    Preview abstract Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. View details
    Preview abstract In distributed learning settings such as federated learning, the training algorithm can be potentially biased towards different clients. Mohri et al. (2019) proposed a domain-agnostic learning algorithm, where the model is optimized for any target distribution formed by a mixture of the client distributions in order to overcome this bias. They further proposed an algorithm for the cross-silo federated learning setting, where the number of clients is small. We consider this problem in the cross-device setting, where the number of clients is much larger. We propose a communication-efficient distributed algorithm called Agnostic Federated Averaging (or AgnosticFedAvg) to minimize the domain-agnostic objective proposed in (Mohri et al., 2019), which is amenable to other private mechanisms such as secure aggregation. We highlight two types of naturally occurring domains in federated learning and argue that AgnosticFedAvg performs well on both. To demonstrate the practical effectiveness of AgnosticFedAvg, we report positive results for large-scale language modeling tasks in both simulation and live experiments, where the latter involves training language models for a Spanish virtual keyboard for millions of user devices. View details
    Preview abstract We present a theoretical and algorithmic study of the multiple-source domain adaptation problem in the common scenario where the learner has access only to a limited amount of labeled target data, but where he has at his disposal a large amount of labeled data from multiple source domains. We show that a new family of algorithms based on model selection ideas benefits from very favorable guarantees in this scenario and discuss some theoretical obstacles affecting some alternative techniques. We also report the results of several experiments with our algorithms that demonstrate their practical effectiveness in several tasks. View details
    Preview abstract Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library. View details
    Preview abstract We study multiple-source domain adaptation, when the learner has access to abundant labeled data from multiple source domains and limited labeled data from the target domain. We analyze existing algorithms and propose an instance-optimal approach based on model selection. We provide efficient algorithms and empirically demonstrate the benefits of our approach. View details
    Preview abstract Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients, which gives rise to the client drift phenomenon. In this work, we propose a general algorithmic framework, Mime, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms such as SGD and Adam to the federated learning setting. Mime uses a combination of control-variates and server-level statistics (e.g. momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on iid data. We prove a reduction result showing that Mime can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Further, we show for the first time that multiple local steps can lead to faster convergence in the cross-device FL setting. Our thorough theoretical and empirical analyses establish Mime's superiority over other baselines. View details
    Preview abstract Communication cost is often a bottleneck in federated learning and other client-based distributed learning scenarios. To overcome this, several gradient compression and model compression algorithms have been proposed. In this work, we propose an alternative approach whereby an ensemble of pre-trained base predictors is trained via federated learning. This method allows a model that would otherwise exceed the communication bandwidth and storage capacity of the clients to be learned with on-device data through federated learning. Motivated by language modeling, we prove the optimality of ensemble methods for density estimation for standard empirical risk minimization and agnostic risk minimization. We provide communication-efficient ensemble algorithms for federated learning, where per-round communication cost is independent of the size of the ensemble. Furthermore, unlike works on gradient compression, our proposed approach reduces the communication cost of both server-to-client and client-to-server communication. View details
    Sampled softmax with random Fourier features
    Jiecao (Jack) Chen
    Advances in Neural Information Processing Systems (NeurIPS) (2019)
    Preview abstract The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the gradient based on these classes, known as the sampled softmax method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach (without theoretical justification) involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier features to enable more efficient and accurate sampling from the (approximate) softmax distribution. We show that RF-softmax leads to low-bias estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes. View details
    Preview abstract We propose algorithms to train production-quality n-gram language models using federated learning. Federated learning is a machine learning technique to train global models to be used on portable devices such as smartphones, without the users' data ever leaving their devices. This is especially relevant for applications handling privacy-sensitive data, such as virtual keyboards. While the principles of federated learning are fairly generic, its methodology assumes that the underlying models are neural networks. However, virtual keyboards are typically powered by n-gram language models, mostly for latency reasons. We propose to train a recurrent neural network language model using the decentralized "FederatedAveraging" algorithm and to approximate this federated model server-side with an n-gram model that can be deployed to devices for fast inference. Our technical contributions include novel ways of handling large vocabularies, algorithms to correct capitalization errors in user data, and efficient finite state transducer algorithms to convert word language models to word-piece language models and vice versa. The n-gram language models trained with federated learning are compared to n-grams trained with traditional server-based algorithms using A/B tests on tens of millions of users of a virtual keyboard. Results are presented for two languages, American English and Brazilian Portuguese. This work demonstrates that high-quality n-gram language models can be trained directly on client mobile devices without sensitive training data ever leaving the device. View details
    Preview abstract Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization, both of which can be performed efficiently. We demonstrate the usefulness of our approach on some tasks including distilling n-gram models from neural models. View details
    Preview abstract Modeling tasks that use a large vocabulary require two words-to-vector maps, one for the embedding layer and one for the softmax layer. A majority of model parameters for such modeling tasks are in the embedding and the softmax layers, while only a small fraction of the parameters are used for the core of the model, e.g., recurrent structures such as LSTM. When trained on small to medium-sized corpora, these models are subject to overfitting as well as large storage and memory footprint requirements. We propose to compress the embedding and softmax matrices by imposing structure on the parameter space. The embedding and softmax matrices are factored as the product of a sparse matrix and a structured dense matrix. Without compromising performance, we achieve a significant compression rate for the embedding layer and a moderate compression rate for the softmax layer. The factoring of the embedding and softmax matrix before training allows us to jointly train these matrix values to optimize the training objective. Being able to compress the embedding and softmax layers allows us to use this saved memory for increased recurrent unit size, which results in improved performance at an uncompressed memory footprint. We report results of this compression technique on standard datasets and a state-of-the-art on-device automatic speech recognition system. View details
    Preview abstract The decentralized nature of federated learning makes detecting and defending against adversarial attacks a challenging task. This paper focuses on backdoor attacks in the federated learning setting, where the goal of the adversary is to reduce the performance of the model on targeted tasks while maintaining a good performance on the main task. Unlike existing works, we allow non-malicious clients to have correctly labeled samples from the targeted tasks. We conduct a comprehensive study of backdoor attacks and defenses for the EMNIST dataset, a real-life, user-partitioned, and non-iid dataset. We observe that in the absence of defenses, the performance of the attack largely depends on the fraction of adversaries present and the “complexity” of the targeted task. Moreover, we show that norm clipping and “weak” differential privacy mitigate the attacks without hurting the overall performance. We have implemented the attacks and defenses in TensorFlow Federated (TFF), a TensorFlow framework for federated learning. In open sourcing our code, our goal is to encourage researchers to contribute new attacks and defenses and evaluate them on standard federated datasets. View details
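As a rough illustration of the norm clipping and "weak" differential privacy defenses mentioned in the abstract above, the following is a minimal sketch in plain Python. It is not the paper's TensorFlow Federated implementation. The function names `clip_update` and `aggregate_with_defense`, the `max_norm` threshold, and the `noise_std` parameter are all hypothetical and chosen only for this example.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale a client update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    if norm <= max_norm:
        return list(update)
    scale = max_norm / norm
    return [v * scale for v in update]

def aggregate_with_defense(updates, max_norm, noise_std=0.0, rng=None):
    """Average clipped updates, then add Gaussian noise to each coordinate.

    noise_std is an assumed knob standing in for "weak" differential privacy;
    this sketch does not compute any formal privacy guarantee.
    """
    rng = rng or random.Random()
    clipped = [clip_update(u, max_norm) for u in updates]
    n, d = len(clipped), len(clipped[0])
    avg = [sum(u[j] for u in clipped) / n for j in range(d)]
    return [a + rng.gauss(0.0, noise_std) for a in avg]
```

Clipping bounds how far a single boosted update can move the average, which is the intuition behind why such a defense can limit an adversary without affecting clients whose updates are already small.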
    Advances and Open Problems in Federated Learning
    Brendan Avent
    Aurélien Bellet
    Mehdi Bennis
    Arjun Nitin Bhagoji
    Graham Cormode
    Rachel Cummings
    Rafael G.L. D'Oliveira
    Salim El Rouayheb
    David Evans
    Josh Gardner
    Adrià Gascón
    Phillip B. Gibbons
    Marco Gruteser
    Zaid Harchaoui
    Chaoyang He
    Lie He
    Zhouyuan Huo
    Justin Hsu
    Martin Jaggi
    Tara Javidi
    Gauri Joshi
    Mikhail Khodak
    Jakub Konečný
    Aleksandra Korolova
    Farinaz Koushanfar
    Sanmi Koyejo
    Tancrède Lepoint
    Yang Liu
    Prateek Mittal
    Richard Nock
    Ayfer Özgür
    Rasmus Pagh
    Ramesh Raskar
    Dawn Song
    Weikang Song
    Sebastian U. Stich
    Ziteng Sun
    Florian Tramèr
    Praneeth Vepakomma
    Jianyu Wang
    Li Xiong
    Qiang Yang
    Felix X. Yu
    Han Yu
    arXiv (2019)
    Preview abstract Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g., service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and mitigates many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents a comprehensive list of open problems and challenges. View details
    Preview abstract Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication-efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For d variables and n ≈ d clients, the proposed method uses O(log log(nd)) bits of communication per client per coordinate and ensures constant privacy. We also improve the previous analysis of the Binomial mechanism, showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest. View details
    Preview abstract Motivated by the need for distributed optimization algorithms with low communication cost, we study communication-efficient algorithms to perform distributed mean estimation. We study the scenarios in which each client sends one bit per dimension. We first show that for d-dimensional data with n clients, a naive stochastic rounding approach yields a mean squared error of Θ(d/n). We then show that by applying a structured random rotation of the data (an O(d log d) algorithm), the error can be reduced to O((log d)/n). The methods we show in this paper do not depend on the distribution of the data. View details
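The naive stochastic rounding approach described in the abstract above can be sketched as follows. This is a minimal illustration in plain Python under the assumption that each client rounds every coordinate to either the minimum or the maximum of its own vector, with probabilities chosen so the result is unbiased. The function names are hypothetical, and the structured random rotation step is not shown.

```python
import random

def stochastic_binary_round(x, rng):
    """Round each coordinate of x to min(x) or max(x) without introducing bias."""
    lo, hi = min(x), max(x)
    if hi == lo:  # constant vector: nothing to quantize
        return list(x)
    out = []
    for v in x:
        p_hi = (v - lo) / (hi - lo)  # probability of rounding up to hi
        out.append(hi if rng.random() < p_hi else lo)
    return out

def estimate_mean(client_vectors, rng):
    """Server-side estimate: average the quantized vectors from all clients."""
    n = len(client_vectors)
    d = len(client_vectors[0])
    total = [0.0] * d
    for x in client_vectors:
        q = stochastic_binary_round(x, rng)
        for j in range(d):
            total[j] += q[j]
    return [t / n for t in total]
```

In this sketch a client would transmit one bit per coordinate plus the two scalars `lo` and `hi`. The expected value of each rounded coordinate equals the original coordinate, so the server's average is an unbiased estimate of the true mean.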
    Preview abstract Recurrent neural network language models (RNNLM) and Long Short Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks. However, these models are computationally more expensive than N-gram LMs for decoding, and thus, challenging to integrate into speech recognizers. Recent research has proposed the use of lattice-rescoring algorithms using RNNLMs and LSTMLMs as an efficient strategy to integrate these models into a speech recognition system. In this paper, we evaluate existing lattice rescoring algorithms along with a few of our own novel variants on a YouTube speech recognition task. Lattice rescoring using LSTMLMs reduces the word error rate (WER) for this task by about 6% relative to the WER obtained using an N-gram LM. View details
    Preview abstract We propose a multiscale quantization approach for fast similarity search on large, high-dimensional datasets. The key insight of the approach is that quantization methods, in particular product quantization, perform poorly when there is large variance in the norms of the data points. This is a common scenario for real-world datasets, especially when doing product quantization of residuals obtained from coarse vector quantization. To address this issue, we propose a multiscale formulation where we learn a separate scalar quantizer of the residual norm scales. All parameters are learned jointly in a stochastic gradient descent framework to minimize the overall quantization error. We provide theoretical motivation for the proposed technique and conduct comprehensive experiments on two large-scale public datasets, demonstrating substantial improvements in recall over existing state-of-the-art methods. View details
    Preview abstract Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model with training data distributed over a large number of clients each with unreliable and relatively slow network connections. We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model. The typical clients in this setting are mobile phones, and communication efficiency is of utmost importance. In this paper, we propose two ways to reduce the uplink communication costs. The proposed methods are evaluated on the application of training a deep neural network to perform image classification. Our best approach reduces the upload communication required to train a reasonable model by two orders of magnitude. View details
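The round structure described in the abstract above (each client computes an update from its local data, and the server aggregates the updates into a new global model) can be sketched as follows. This is a toy illustration in plain Python assuming a linear model with squared loss and a single local gradient step. It does not implement the two uplink-reduction methods the paper proposes, and the names `client_update` and `server_round` are hypothetical.

```python
def client_update(w, data, lr=0.1):
    """One local gradient step on mean squared loss; returns the model delta."""
    d = len(w)
    grad = [0.0] * d
    for x, y in data:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j in range(d):
            grad[j] += err * x[j] / len(data)
    return [-lr * g for g in grad]

def server_round(w, clients, lr=0.1):
    """Collect one update per client and apply their unweighted average."""
    updates = [client_update(w, data, lr) for data in clients]
    n = len(updates)
    avg = [sum(u[j] for u in updates) / n for j in range(len(w))]
    return [wj + aj for wj, aj in zip(w, avg)]
```

Only the delta returned by `client_update` travels to the server in this sketch, so that vector is the quantity an uplink compression scheme would target.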
    Preview abstract We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for this behavior. Motivated by this discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from O(d^2) to O(d log d), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of applications. View details
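The idea of replacing the random Gaussian matrix with a scaled random orthogonal matrix can be sketched in plain Python as below. This is a minimal illustration, not the authors' code. It assumes the orthogonal matrix is obtained by Gram-Schmidt on Gaussian rows and that each row is rescaled by the norm of a fresh Gaussian vector, so that row norms are distributed as they would be for a Gaussian matrix. All function names and the bandwidth parameter `sigma` are hypothetical, and the faster structured variant (SORF) is not shown.

```python
import math
import random

def random_orthogonal(d, rng):
    """Gram-Schmidt on i.i.d. Gaussian rows yields a random orthogonal matrix."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:  # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def orf_matrix(d, sigma, rng):
    """Orthogonal rows, each rescaled by the norm of a fresh Gaussian vector."""
    w = []
    for row in random_orthogonal(d, rng):
        s = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        w.append([s * a / sigma for a in row])
    return w

def features(w, x):
    """Cosine and sine features whose inner product approximates the kernel."""
    proj = [sum(a * b for a, b in zip(row, x)) for row in w]
    scale = 1.0 / math.sqrt(len(w))
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]
```

With these features, the inner product of `features(w, x)` and `features(w, y)` serves as an estimate of the Gaussian kernel value exp(-||x - y||^2 / (2 sigma^2)). The dense matrix-vector product in `features` costs O(d^2), which is the cost SORF is designed to reduce.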