Jump to Content
Satyen Kale

Satyen Kale

My current research is the design of efficient and practical algorithms for fundamental problems in Machine Learning and Optimization. More specifically, I am interested in decision making under uncertainty, statistical learning theory, combinatorial optimization, and convex optimization techniques such as linear and semidefinite programming. More information, including a complete list of publications, can be found at satyenkale.com.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Large deep learning models have achieved state-of-the-art performance across various natural language processing (NLP) tasks and demonstrated remarkable few-shot learning performance. However, training them is often challenging and resource-intensive. In this paper, we study an efficient approach to train language models using few-shot learners. We show that, by leveraging the fast learning nature of few-shot learners, one can train language models efficiently in a stagewise manner. Our main insight is that stacking a good few-shot learner on a good small language model provides a good initializer for a larger language model. Using this insight and building upon progressive stacking approaches, we develop novel approaches for training such networks in a stagewise manner. Furthermore, we also provide a theoretical framework and accompanying empirical studies to support our insights, thereby creating a theoretical foundation for progressive stacking. Finally, we provide empirical results to demonstrate the effectiveness of our approach in reducing the training time of few-shot learners. View details
    On the Convergence of Federated Averaging with Cyclic Client Participation
    Yae Jee Cho
    Pranay Sharma
    Gauri Joshi
    Tong Zhang
    International Conference on Machine Learning (ICML) (2023) (to appear)
    Preview abstract Federated Averaging (FedAvg) and its variants are the most popular optimization algorithms in federated learning (FL). Previous convergence analyses of FedAvg either assume full client participation or partial client participation where the clients can be uniformly sampled. However, in practical cross-device FL systems, only a subset of clients that satisfy local criteria such as battery status, network connectivity, and maximum participation frequency requirements (to ensure privacy) are available for training at a given time. As a result, client availability follows a natural cyclic pattern. We provide (to our knowledge) the first theoretical framework to analyze the convergence of FedAvg with cyclic client participation with several different client optimizers such as GD, SGD, and shuffled SGD. Our analysis discovers that cyclic client participation can achieve a faster asymptotic convergence rate than vanilla FedAvg with uniform client participation under suitable conditions, providing valuable insights into the design of client sampling protocols. View details
    Preview abstract Federated learning (FL) enables learning from decentralized privacy-sensitive data, with computations on raw data confined to take place at edge clients. This paper introduces mixed FL, which incorporates an additional loss term calculated at the coordinating server (while maintaining FL's private data restrictions). There are numerous benefits. For example, additional datacenter data can be leveraged to jointly learn from centralized (datacenter) and decentralized (federated) training data and better match an expected inference data distribution. Mixed FL also enables offloading some intensive computations (e.g., embedding regularization) to the server, greatly reducing communication and client computation load. For these and other mixed FL use cases, we present three algorithms: PARALLEL TRAINING, 1-WAY GRADIENT TRANSFER, and 2-WAY GRADIENT TRANSFER. We state convergence bounds for each, and give intuition on which are suited to particular mixed FL problems. Finally we perform extensive experiments on three tasks, demonstrating that mixed FL can blend training data to achieve an oracle's accuracy on an inference distribution, and can reduce communication and computation overhead by over 90%. Our experiments confirm theoretical predictions of how algorithms perform under different mixed FL problem settings. View details
    A Field Guide to Federated Optimization
    Jianyu Wang
    Gauri Joshi
    Maruan Al-Shedivat
    Galen Andrew
    A. Salman Avestimehr
    Katharine Daly
    Deepesh Data
    Suhas Diggavi
    Hubert Eichner
    Advait Gadhikar
    Antonious M. Girgis
    Filip Hanzely
    Chaoyang He
    Samuel Horvath
    Martin Jaggi
    Tara Javidi
    Sai Praneeth Karimireddy
    Jakub Konečný
    Sanmi Koyejo
    Tian Li
    Peter Richtarik
    Virginia Smith
    Mahdi Soltanolkotabi
    Weikang Song
    Sebastian Stich
    Ameet Talwalkar
    Hongyi Wang
    Blake Woodworth
    Honglin Yuan
    Mi Zhang
    Tong Zhang
    Chunxiang (Jake) Zheng
    Chen Zhu
    arxiv (2021)
    Preview abstract Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. View details
    Preview abstract Multiclass logistic regression is a fundamental task in machine learning with applications in classification and boosting. Previous work (Foster et a. 2018) has highlighted the importance of improper predictors for achieving ``fast rates'' in the online multiclass logistic regression problem without suffering exponentially from secondary problem parameters, such as the norm of the predictors in the comparison class. While Foster et al. introduced a statistically optimal algorithm, it is in practice computationally intractable due to its run-time complexity being a large polynomial in the time horizon and dimension of input feature vectors. It has remained an open problem if algorithms exist that are both statistically and computationally efficient. Jezequel et al. answered this question for binary logistic regression by introducing AIOLI, an algorithm based on online newton step. However, their technique fails to generalize to more than two classes and it remains open whether their algorithm can be applied to practical applications of logistic regression, such as bandit multiclass learning or online boosting. In this paper, we develop a new algorithm, FOLKLORE, for the problem which runs significantly faster than the algorithm of Foster et al. -- the running time per iteration scales quadratically in the dimension -- at the cost of a linear dependence on the norm of the predictors in the regret bound. This yields the first practical algorithm for online multiclass logistic regression, resolving an open problem of Jezequel et al. We resolve the open questions by deriving an extension of AIOLI to the multi-class setting. Furthermore, we show that our algorithm can be applied to online bandit multiclass prediction and online multiclass boosting, yielding more practical algorithms for both problems compared to the ones in Foster et al. with similar performance guarantees. Finally, we also provide an online-to-batch conversion result for our algorithm. View details
    Preview abstract Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients which gives rise to the client drift phenomenon. In this work, we propose a general algorithmic framework, \mime, which i) mitigates client drift and ii) adapts arbitrary centralized optimization algorithms such as SGD and Adam to the federated learning setting. Mime uses a combination of control-variates and server-level statistics (e.g. momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on iid data. We prove a reduction result showing that \mime can translate the convergence of a generic algorithm in the centralized setting into convergence in the federated setting. Further, we show for the first time that multiple local steps can lead to faster convergence in the cross-device FL setting. Our thorough theoretical and empirical analyses establish Mime's superiority over other other baselines. View details
    Preview abstract We present a series of new PAC-Bayes learning guarantees for randomized algorithms with sample-dependent priors. Our most general bounds make no assumption on the priors and are given in terms of certain covering numbers under the infinite-Renyi divergence and the L1 distance. We show how to use these general bounds to derive leaning bounds in the setting where the sample-dependent priors obey an infinite-Renyi divergence or L1-distance sensitivity condition. We also provide a flexible framework for computing PAC-Bayes bounds, under certain stability assumptions on the sample-dependent priors, and show how to use this framework to give more refined bounds when the priors satisfy an infinite-Renyi divergence sensitivity condition. View details
    Preview abstract In extreme classification settings, embedding-based neural network models are currently not competitive with sparse linear and tree-based methods in terms of accuracy. Most prior works attribute this poor performance to the low-dimensional bottleneck in embedding-based methods. In this paper, we demonstrate that theoretically there is no limitation to using low-dimensional embedding-based methods, and provide experimental evidence that overfitting is the root cause of the poor performance of embedding-based methods. These findings motivate us to investigate novel data augmentation and regularization techniques to mitigate overfitting. To this end, we propose GLaS, a new regularizer for embedding-based neural network approaches. It is a natural generalization from the graph Laplacian and spread-out regularizers, and empirically it addresses the drawback of each regularizer alone when applied to the extreme classification setup. With the proposed techniques, we attain or improve upon the state-of-the-art on most widely tested public extreme classification datasets with hundreds of thousands of labels. View details
    Preview abstract We consider the problem of retrieving the most relevant labels for a given input when the size of the output space is very large. Retrieval methods are modeled as set-valued classifiers which output a small set of classes for each input, and a mistake is made if the label is not in the output set. Despite its practical importance, a statistically principled, yet practical solution to this problem is largely missing. To this end, we first define a family of surrogate losses and show that they are calibrated and convex under certain conditions on the loss parameters and data distribution, thereby establishing a statistical and analytical basis for using these losses. Furthermore, we identify a particularly intuitive class of loss functions in the aforementioned family and show that they are amenable to practical implementation in the large output space setting (i.e. computation is possible without evaluating scores of all labels) by developing a technique called Stochastic Negative Mining. We also provide generalization error bounds for the losses in the family. Finally, we conduct experiments which demonstrate that Stochastic Negative Mining yields benefits over commonly used negative sampling approaches. View details
    Preview abstract Adaptive gradient methods that rely on scaling gradients down by the square root of exponential moving averages of past squared gradients, such RMSPROP, ADAM, ADADELTA have found wide application in optimizing the non-convex problems that arise in deep learning. However, it has been recently demonstrated that such methods can fail to converge even in simple convex optimization settings. In this work, we provide a new analysis of such methods applied to nonconvex stochastic optimization problems, characterizing the effect of increasing minibatch size. Our analysis shows that under this scenario such methods do converge to stationarity up to the statistical limit of variance in the stochastic gradients (scaled by a constant factor). In particular, our result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent the non-convergence issues. Furthermore, we provide a new adaptive optimization algorithm, YOGI, which controls the increase in effective learning rate, leading to even better performance with similar theoretical guarantees on convergence. Extensive experiments show that YOGI with very little hyperparameter tuning outperforms methods such as ADAM in several challenging machine learning tasks View details
    Online Learning of Quantum States
    Scott Aaronson
    Xinyi Chen
    Ashwin Nayak
    NeurIPS (2018)
    Preview abstract Suppose we have many copies of an unknown n-qubit state ρ. We measure some copies of ρ using a known two-outcome measurement E1, then other copies using a measurement E2, and so on. At each stage t, we generate a current hypothesis σt about the state ρ, using the outcomes of the previous measurements. We show that it is possible to do this in a way that guarantees that |Tr(Eiσt)−Tr(Eiρ)|, the error in our prediction for the next measurement, is at least ε at most O(n/ε2) times. Even in the "non-realizable" setting---where there could be arbitrary noise in the measurement outcomes---we show how to output hypothesis states that do significantly worse than the best possible states at most O(Tn‾‾‾√) times on the first T measurements. These results generalize a 2007 theorem by Aaronson on the PAC-learnability of quantum states, to the online and regret-minimization settings. We give three different ways to prove our results---using convex optimization, quantum postselection, and sequential fat-shattering dimension---which have different advantages in terms of parameters and portability. View details
    Logistic Regression: The Importance of Being Improper
    Dylan Foster
    Haipeng Luo
    Karthik Sridharan
    Conference on Learning Theory (2018)
    Preview abstract Learning linear predictors with the logistic loss—both in stochastic and online settings—is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing “fast rates” for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is 1-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter (2012) when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. We also show that the improved dependence on predictor norm is near-optimal. Leveraging this improved dependency on the predictor norm yields the following applications: (a) we give algorithms for online bandit multiclass learning with the logistic loss with an O*(√n) relative mistake bound across essentially all parameter ranges, thus providing a solution to the COLT 2009 open problem of Abernethy and Rakhlin (2009), and (b) we give an adaptive algorithm for online multiclass boosting with optimal sample complexity, thus partially resolving an open problem of Beygelzimer et al. (2015) and Jung et al. (2017). Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parameteric and even nonparametric settings. View details
    On the convergence of Adam and Beyond
    International Conference on Learning Representations (2018)
    Preview abstract Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with “long-term memory” of past gradients, and propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance. View details
    Dual Decomposition for Fast Learning in Large Output Spaces
    Ian Yen
    Dan Holtmann-Rice
    Pradeep Ravikumar
    International Conference on Machine Learning (2018)
    Preview abstract For problems with large output spaces, evaluation of the loss function and its gradient are expensive, typically taking linear time in the size of the output space. Recently, methods have been developed to speed up learning via efficient data structures for Nearest-Neighbor Search (NNS) or Maximum Inner-Product Search (MIPS). However, the performance of such data structures typically degrades in high dimensions. In this work, we propose a novel technique to reduce the intractable high dimensional search problem to several much more tractable lower dimensional ones via dual decomposition of the loss function. At the same time, we demonstrate guaranteed convergence to the original loss via a greedy message passing procedure. In our experiments on multiclass and multilabel classification with hundreds of thousands of classes, as well as training skip-gram word embeddings with a vocabulary size of half a million, our technique consistently improves the accuracy of search-based gradient approximation methods and outperforms sampling-based gradient approximation methods by a large margin. View details
    Near-Optimal Algorithms for Online Matrix Prediction
    Elad Hazan
    Shai Shalev-Shwartz
    SIAM Journal of Computing (2017)
    Preview abstract In several online prediction problems of recent interest the comparison class is composed of matrices. For example, in the online max-cut problem, the comparison class is matrices which represent cuts of a given graph and in online gambling the comparison class is matrices which represent permutations over $n$ teams. Another important example is online collaborative filtering in which a widely used comparison class is the set of matrices with a small trace norm. In this paper we isolate a property of matrices, which we call $(\beta,\tau)$-decomposability, and derive an efficient online learning algorithm, that enjoys a regret bound of $\tilde{O}(\sqrt{\beta\,\tau\,T})$ for all problems in which the comparison class is composed of $(\beta,\tau)$-decomposable matrices. By analyzing the decomposability of cut matrices, low trace-norm matrices, and triangular matrices, we derive near optimal regret bounds for online max-cut, online collaborative filtering, and online gambling. In particular, this resolves (in the affirmative) an open problem posed by Abernethy (2010) and Kleinberg et al (2010). Finally, we derive lower bounds for the three problems and show that our upper bounds are optimal up to logarithmic factors. In particular, our lower bound for the online collaborative filtering problem resolves another open problem posed by Shamir and Srebro (2010). View details
    Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP
    Zohar Karnin
    Tengyuan Liang
    David Pal
    International Conference on Machine Learning (2017)
    Preview abstract Online sparse linear regression is an online problem where an algorithm repeatedly chooses a subset of coordinates to observe in an adversarially chosen feature vector, makes a real-valued prediction, receives the true label, and incurs the squared loss. The goal is to design an online learning algorithm with sublinear regret to the best sparse linear predictor in hindsight. Without any assumptions, this problem is known to be computationally intractable. In this paper, we make the assumption that data matrix satisfies restricted isometry property, and show that this assumption leads to computationally efficient algorithms with sublinear regret for two variants of the problem. In the first variant, the true label is generated according to a sparse linear model with additive Gaussian noise. In the second, the true label is chosen adversarially. View details
    Parameter-Free Online Learning via Model Selection
    Dylan Foster
    Karthik Sridharan
    Neural Information Processing Systems (2017)
    Preview abstract We introduce an efficient algorithmic framework for model selection in online learning, also known as parameter-free online learning. Departing from previous work, which has focused on highly structured function classes such as nested balls in Hilbert space, we propose a generic meta-algorithm framework that achieves online model selection oracle inequalities under minimal structural assumptions. We give the first computationally efficient parameter-free algorithms that work in arbitrary Banach spaces under mild smoothness assumptions; previous results applied only to Hilbert spaces. We further derive new oracle inequalities for matrix classes, non-nested convex sets, and R^d with generic regularizers. Finally, we generalize these results by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. These results are all derived through a unified meta-algorithm scheme using a novel “multi-scale” algorithm for prediction with expert advice based on random playout, which may be of independent interest. View details
    Hardness of Online Sleeping Combinatorial Optimization Problems
    Chansoo Lee
    David Pal
    Neural Information Processing Systems (2016)
    Preview abstract We show that several online combinatorial optimization problems that admit efficient no-regret algorithms become computationally hard in the sleeping setting where a subset of actions becomes unavailable in each round. Specifically, we show that the sleeping versions of these problems are at least as hard as PAC learning DNF expressions, a long standing open problem. We show hardness for the sleeping versions of Online Shortest Paths, Online Minimum Spanning Tree, Online k-Subsets, Online k-Truncated Permutations, Online Minimum Cut, and Online Bipartite Matching. The hardness result for the sleeping version of the Online Shortest Paths problem resolves an open problem presented at COLT 2015 [Koolen et al., 2015]. View details
    No Results Found