H. Brendan McMahan
Authored Publications
Google Publications
Other Publications
Sort By
Learning to Generate Image Embeddings with User-level Differential Privacy
Maxwell D. Collins
Yuxiao Wang
Sewoong Oh
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023) (to appear)
Preview abstract
We consider training feature extractors with user-level differential privacy to map images to embeddings from large-scale supervised data. To achieve user-level differential privacy, federated learning algorithms are extended and applied to aggregate user partitioned data, together with sensitivity control and noise addition. We demonstrate a variant of federated learning algorithm with partial aggregation and private reconstruction can achieve strong privacy utility trade-offs. When a large scale dataset is provided, it is possible to train feature extractors with both strong utility and privacy guarantees by combining techniques such as public pretraining, virtual clients, and partial aggregation.
View details
Federated Learning of Gboard Language Models with Differential Privacy
Yanxiang Zhang
Galen Andrew
Jesse Rosenstock
Yuanbo Zhang
ACL industry track (2023) (to appear)
Preview abstract
We train language models (LMs) with federated learning (FL) and differential privacy (DP) in the Google Keyboard (Gboard). We apply the DP-Follow-the-Regularized-Leader (DP-FTRL)~\citep{kairouz21b} algorithm to achieve meaningfully formal DP guarantees without requiring uniform sampling of client devices.
To provide favorable privacy-utility trade-offs, we introduce a new client participation criterion and discuss the implication of its configuration in large scale systems. We show how quantile-based clip estimation~\citep{andrew2019differentially} can be combined with DP-FTRL to adaptively choose the clip norm during training or reduce the hyperparameter tuning in preparation for training.
With the help of pretraining on public data, we train and deploy more than twenty Gboard LMs that achieve high utility and $\rho-$zCDP privacy guarantees with $\rho \in (0.2, 2)$, with two models additionally trained with secure aggregation~\citep{bonawitz2017practical}.
We are happy to announce that all the next word prediction neural network LMs in Gboard now have DP guarantees, and all future launches of Gboard neural network LMs will require DP guarantees.
We summarize our experience and provide concrete suggestions on DP training for practitioners.
View details
Adaptive Federated Optimization
Jakub Konečný
(2021)
Preview abstract
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Yogi and Adam, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can improve the performance of federated learning.
View details
Practical and Private (Deep) Learning without Sampling or Shuffling
Preview
Abhradeep Thakurta
38th International Conference on Machine Learning (ICML 2021) (2021) (to appear)
Preview abstract
Building privacy-preserving systems for machine learning and data science on decentralized data
View details
A Field Guide to Federated Optimization
Jianyu Wang
Gauri Joshi
Maruan Al-Shedivat
Galen Andrew
A. Salman Avestimehr
Katharine Daly
Deepesh Data
Suhas Diggavi
Hubert Eichner
Advait Gadhikar
Antonious M. Girgis
Filip Hanzely
Chaoyang He
Samuel Horvath
Martin Jaggi
Tara Javidi
Sai Praneeth Karimireddy
Jakub Konečný
Sanmi Koyejo
Tian Li
Peter Richtarik
Karan Singhal
Virginia Smith
Mahdi Soltanolkotabi
Weikang Song
Sebastian Stich
Ameet Talwalkar
Hongyi Wang
Blake Woodworth
Honglin Yuan
Mi Zhang
Tong Zhang
Chunxiang (Jake) Zheng
Chen Zhu
arxiv (2021)
Preview abstract
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
View details
Generative Models for Effective ML on Private, Decentralized Datasets
8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net
Preview abstract
To improve real-world applications of machine learning, experienced modelers develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data—of representative samples, of outliers, of misclassifications—is an essential tool in a) identifying and fixing problems in the data, b) generating new modeling hypotheses, and c) assigning or refining human-provided labels. However, manual data inspection is risky for privacy-sensitive datasets, such as those representing the behavior of real-world individuals. Furthermore, manual data inspection is impossible in the increasingly important setting of federated learning, where raw examples are stored at the edge and the modeler may only access aggregated outputs such as metrics or model parameters. This paper demonstrates that generative models—trained using federated methods and with formal differential privacy guarantees—can be used effectively to debug data issues even when the data cannot be directly inspected. We explore these methods in applications to text with differentially private federated RNNs and to images using a novel algorithm for differentially private federated GANs.
View details
Federated Heavy Hitters with Differential Privacy
Haicheng Sun
Vivian (Wei) Li
International Conference on Artificial Intelligence and Statistics (AISTATS) 2020
Preview abstract
The discovery of heavy hitters (most frequent items) in user-generated data streams drives improvements in the app and web ecosystems, but can incur substantial privacy risks if not done with care. To address these risks, we propose a distributed and privacy-preserving algorithm for discovering the heavy hitters in a population of user-generated data streams. We leverage the sampling property of our distributed algorithm to prove that it is inherently differentially private, without requiring additional noise. We also examine the trade-off between privacy and utility, and show that our algorithm provides excellent utility while also achieving strong privacy guarantees. A significant advantage of this approach is that it eliminates the need to centralize raw data while also avoiding the significant loss in utility incurred by local differential privacy. We validate our findings both theoretically, using worst-case analyses, and practically, using a Twitter dataset with 1.6M tweets and over 650k users. Finally, we carefully compare our approach to Apple's local differential privacy method for discovering heavy hitters.
View details
Privacy Amplification via Random Check-Ins
Borja Balle
Abhradeep Thakurta
Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020
Preview abstract
Differentially Private Stochastic Gradient Descent (DP-SGD) forms a fundamental building block in many applications for learning over sensitive data. Two standard approaches, privacy amplification by subsampling, and privacy amplification by shuffling, permit adding lower noise in DP-SGD than via na\"{\i}ve schemes. A key assumption in both these approaches is that the elements in the data set can be uniformly sampled, or be uniformly permuted --- constraints that may become prohibitive when the data is processed in a decentralized or distributed fashion. In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL) wherein the data is distributed among many devices (clients). Our main contribution is the random check-in distributed protocol, which crucially relies only on randomized participation decisions made locally and independently by each client. It has privacy/accuracy trade-offs similar to privacy amplification by subsampling/shuffling. However, our method does not require server-initiated communication, or even knowledge of the population size. To our knowledge, this is the first privacy amplification tailored for a distributed learning framework, and it may have broader applicability beyond FL. Along the way, we extend privacy amplification by shuffling to incorporate $(\epsilon,\delta)$-DP local randomizers, and exponentially improve its guarantees. In practical regimes, this improvement allows for similar privacy and utility using data from an order of magnitude fewer users.
View details
Training Production Language Models without Memorizing User Data
Galen Andrew
(2020)
Preview abstract
This paper presents the first consumer-scale next-word prediction (NWP) model trained with Federated Learning (FL) while leveraging the Differentially Private Federated Averaging (DP-FedAvg) technique. There has been prior work on building practical FL infrastructure, including work demonstrating the feasibility of training language models on mobile devices using such infrastructure. It has also been shown (in simulations on a public corpus) that it is possible to train NWP models with user-level differential privacy using the DP-FedAvg algorithm. Nevertheless, training production-quality NWP models with DP-FedAvg in a real-world production environment on a heterogeneous fleet of mobile phones requires addressing numerous challenges. For instance, the coordinating central server has to keep track of the devices available at the start of each round and sample devices uniformly at random from them, while ensuring \emph{secrecy of the sample}, etc. Unlike all prior privacy-focused FL work of which we are aware, for the first time we demonstrate the deployment of a differentially private mechanism for the training of a production neural network in FL, as well as the instrumentation of the production training infrastructure to perform an end-to-end empirical measurement of unintended memorization.
View details
Towards Federated Learning at Scale: System Design
Hubert Eichner
Wolfgang Grieskamp
Dzmitry Huba
Vladimir Ivanov
Chloé M Kiddon
Jakub Konečný
Stefano Mazzocchi
Timon Van Overveldt
David Petrou
Jason Roselander
SysML 2019
Preview abstract
Federated Learning is a distributed machine learning approach which enables training on a large corpus of data which never needs to leave user devices. We have spent some effort over the last two years building a scalable production system for FL. In this paper, we report about the resulting high-level design, sketching the challenges and the solutions, as well as touching the open problems and future directions.
View details
Preview abstract
We consider convex SGD updates with a block-cyclic structure, i.e. where each cycle consists of a small number of blocks, each with many samples from a possibly different, block-specific, distribution. This situation arises, e.g., in Federated Learning where the mobile devices available for updates at different times during the day have different characteristics. We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but propose a simple approach that allows prediction with the same performance guarantees as for i.i.d., non-cyclic, sampling.
View details
Federated Learning with Autotuned Communication-Efficient Secure Aggregation
Fariborz Salehi
Jakub Konečný
Marco Gruteser
Asilomar Conference on Signals, Systems, and Computers (2019)
Preview abstract
Federated Learning enables mobile devices to collaboratively learn a shared inference model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. Existing work on federated learning with limited communication demonstrates how random rotation can enable users' model updates to be quantized much more efficiently, reducing the communication cost between users and the server. Meanwhile, secure aggregation enable the server to learn an aggregate of at least a threshold number of device's model contributions without observing any individual device's contribution in unaggregated form. In this paper, we highlight some of the challenges of setting the parameters for secure aggregation to achieve communication efficiency, especially in the the context of the aggressively quantized inputs enabled by random rotation. We then develop a recipe for auto-tuning communication-efficient secure aggregation, based on specific properties of random rotation and secure aggregation -- namely, the predictable distribution of vector entries post-rotation and the modular wrapping inherent in secure aggregation. We present both theoretical results and initial experiments.
View details
Advances and Open Problems in Federated Learning
Brendan Avent
Aurélien Bellet
Mehdi Bennis
Arjun Nitin Bhagoji
Graham Cormode
Rachel Cummings
Rafael G.L. D'Oliveira
Salim El Rouayheb
David Evans
Josh Gardner
Adrià Gascón
Phillip B. Gibbons
Marco Gruteser
Zaid Harchaoui
Chaoyang He
Lie He
Zhouyuan Huo
Justin Hsu
Martin Jaggi
Tara Javidi
Gauri Joshi
Mikhail Khodak
Jakub Konečný
Aleksandra Korolova
Farinaz Koushanfar
Sanmi Koyejo
Tancrède Lepoint
Yang Liu
Prateek Mittal
Richard Nock
Ayfer Özgür
Rasmus Pagh
Ramesh Raskar
Dawn Song
Weikang Song
Sebastian U. Stich
Ziteng Sun
Florian Tramèr
Praneeth Vepakomma
Jianyu Wang
Li Xiong
Qiang Yang
Felix X. Yu
Han Yu
Arxiv (2019)
Preview abstract
Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g., service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and mitigates many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents a comprehensive list of open problems and challenges.
View details
Can You Really Backdoor Federated Learning?
Ziteng Sun
Federated learning workshop at NeurRIPS (2019)
Preview abstract
The decentralized nature of federated learning makes detecting and defending against adversarial attacks a challenging task. This paper focuses on backdoor attacks in the federated learning setting, where the goal of the adversary is to reduce the performance of the model on targeted tasks while maintaining a good performance on the main task. Unlike existing works, we allow non-malicious clients to have correctly labeled samples from the targeted tasks. We conduct a comprehensive study of backdoor attacks and defenses for the EMNIST dataset, a real-life, user-partitioned, and non-iid dataset. We observe that in the absence of defenses, the performance of the attack largely depends on the fraction of adversaries present and the “complexity” of the targeted task. Moreover, we show
that norm clipping and “weak” differential privacy mitigate the attacks without hurting the overall performance. We have implemented the attacks and defenses in TensorFlow Federated (TFF), a TensorFlow framework for federated learning. In open sourcing our code, our goal is to encourage researchers to contribute new attacks and defenses and evaluate them on standard federated datasets.
View details
Learning Differentially Private Recurrent Language Models
Kunal Talwar
Li Zhang
International Conference on Learning Representations (ICLR) (2018)
Preview abstract
We demonstrate that it is possible to train large recurrent language models with user-level differential privacy guarantees with only a negligible cost in predictive accuracy. Our work builds on recent advances in the training of deep networks on user-partitioned data and privacy accounting for stochastic gradient descent. In particular, we add user-level privacy protection to the federated averaging algorithm, which makes "large step" updates from user-level data. Our work demonstrates that given a dataset with a sufficiently large number of users (a requirement easily met by even small internet-scale datasets), achieving differential privacy comes at the cost of increased computation, rather than in decreased utility as in most prior work. We find that our private LSTM language models are quantitatively and qualitatively similar to un-noised models when trained on a large dataset.
View details
A General Approach to Adding Differential Privacy to Iterative Training Procedures
Galen Andrew
Ilya Mironov
Steve Chien
Úlfar Erlingsson
NIPS (2018)
Preview abstract
In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples --- for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges.
View details
cpSGD: Communication-efficient and differentially-private distributed SGD
Naman Agarwal
Neural Information Processing Systems (2018)
Preview abstract
Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For d variables and n \approx d clients, the proposed method uses \cO(\log \log(nd)) bits of communication per client per coordinate and ensures constant privacy. We also improve previous analysis of the \emph{Binomial mechanism} showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest.
View details
Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization
Blake Woodworth
Jialei Wang
Nathan Srebro
Advances in Neural Information Processing Systems (NIPS) (2018)
Preview abstract
We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the “natural” algorithms are not known to be optimal.
View details
Preview abstract
Communication on heterogeneous edge networks is a fundamental bottleneck in Federated Learning (FL), restricting both model capacity and user participation. To address this issue, we introduce two novel strategies to reduce communication costs: (1) the use of lossy compression on the global model sent server-to-client; and (2) Federated Dropout, which allows users to efficiently train locally on smaller subsets of the global model and also provides a reduction in both client-to-server communication and local computation. We empirically show that these strategies, combined with existing compression approaches for client-to-server communication, collectively provide up to a 14× reduction in server-to-client communication, a 1.7× reduction in local computation, and a 28× reduction in upload communication, all without degrading the quality of the final model. We thus comprehensively reduce FL's impact on client device resources, allowing higher capacity models to be trained, and a more diverse set of users to be reached.
View details
A Survey of Algorithms and Analysis for Adaptive Online Learning
Journal of Machine Learning Research, vol. 18 (2017)
Preview abstract
We present tools for the analysis of Follow-The-Regularized-Leader (FTRL), Dual Averaging, and Mirror Descent algorithms when the regularizer (equivalently, prox-function or learning rate schedule) is chosen adaptively based on the data. Adaptivity can be used to prove regret bounds that hold on every round, and also allows for data-dependent regret bounds as in AdaGrad-style algorithms (e.g., Online Gradient Descent with adaptive per-coordinate learning rates). We present results from a large number of prior works in a unified manner, using a modular and tight analysis that isolates the key arguments in easily re-usable lemmas. This approach strengthens previously known FTRL analysis techniques to produce bounds as tight as those achieved by potential functions or primal-dual analysis. Further, we prove a general and exact equivalence between an arbitrary adaptive Mirror Descent algorithm and a correspond- ing FTRL update, which allows us to analyze any Mirror Descent algorithm in the same framework. The key to bridging the gap between Dual Averaging and Mirror Descent algorithms lies in an analysis of the FTRL-Proximal algorithm family. Our regret bounds are proved in the most general form, holding for arbitrary norms and non-smooth regularizers with time-varying weight.
View details
Communication-Efficient Learning of Deep Networks from Decentralized Data
Eider Moore
Seth Hampson
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) (2017)
Preview abstract
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.
We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.
View details
Distributed Mean Estimation with Limited Communication
International Conference on Machine Learning (2017)
Preview abstract
Motivated by the need for distributed optimization algorithms with low communication cost, we study communication efficient algorithms to perform distributed mean estimation. We study the scenarios in which each client sends one bit per dimension. We first show that for d dimensional data with n clients, a naive stochastic rounding approach yields a mean squared error Theta(d/n). We then show by applying a structured random rotation of the data (an O(dlogd) algorithm), the error can be reduced to O(logd/n). The methods we show in this paper do not depend on the distribution of the data.
View details
On the Protection of Private Information in Machine Learning Systems: Two Recent Approaches
Úlfar Erlingsson
Ian Goodfellow
Ilya Mironov
Kunal Talwar
Li Zhang
Proceedings of 30th IEEE Computer Security Foundations Symposium (CSF) (2017), pp. 1-6
Preview abstract
The recent, remarkable growth of machine learning has led to intense interest in the privacy of the data on which machine learning relies, and to new techniques for preserving privacy. However, older ideas about privacy may well remain valid and useful. This note reviews two recent works on privacy in the light of the wisdom of some of the early literature, in particular the principles distilled by Saltzer and Schroeder in the 1970s.
View details
Practical Secure Aggregation for Privacy-Preserving Machine Learning
Antonio Marcedone
Benjamin Kreuter
Sarvar Patel
Vladimir Ivanov
CCS (2017)
Preview abstract
We design a novel, communication-efficient, failure-robust protocol for secure aggregation of high-dimensional data. Our protocol allows a server to collect an aggregate of user-held data from mobile devices in a privacy-preserving manner, and can be used, for example, in a federated learning setting, to aggregate user-provided model updates for a deep neural network. We prove the security of our protocol in the honest-but-curious and malicious server settings, and show that privacy is preserved even if an arbitrarily chosen subset of users drop out at any time. We evaluate the efficiency of our protocol and show, by complexity analysis and a concrete implementation, that its runtime and communication overhead remain low even on large data sets and client pools. For 16-bit input values, our protocol offers 1.73× communication expansion for 2^10 users and 2^20-dimensional vectors, and 1.98× expansion for 2^14 users and 2^24-dimensional vectors.
View details
Federated Learning: Strategies for Improving Communication Efficiency
Jakub Konečný
Peter Richtarik
NIPS Workshop on Private Multi-Party Machine Learning (2016)
Preview abstract
Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model with training data distributed over a large number of clients each with unreliable and relatively slow network connections. We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model. The typical clients in this setting are mobile phones, and communication efficiency is of utmost importance. In this paper, we propose two ways to reduce the uplink communication costs. The proposed methods are evaluated on the application of training a deep neural network to perform image classification. Our best approach reduces the upload communication required to train a reasonable model by two orders of magnitude.
View details
Deep Learning with Differential Privacy
Andy Chu
Ian Goodfellow
Ilya Mironov
Kunal Talwar
Li Zhang
23rd ACM Conference on Computer and Communications Security (ACM CCS) (2016), pp. 308-318
Preview abstract
Machine learning techniques based on neural networks are achieving remarkable results in a wide variety of domains. Often, the training of models requires large, representative datasets, which may be crowdsourced and contain sensitive information. The models should not expose private information in these datasets. Addressing this goal, we develop new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy. Our implementation and experiments demonstrate that we can train deep neural networks with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
View details
Preview abstract
We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes. The goal is to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of the utmost importance and minimizing the number of rounds of communication is the principal goal.
A motivating example arises when we keep the training data locally on users' mobile devices instead of logging it to a data center for training. In federated optimization, the devices are used as compute nodes performing computation on their local data in order to update a global model. We suppose that we have extremely large number of devices in the network --- as many as the number of users of a given service, each of which has only a tiny fraction of the total data available. In particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, it is reasonable to assume that no device has a representative sample of the overall distribution.
We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results for sparse convex problems. This work also sets a path for future research needed in the context of federated optimization.
View details
Practical Secure Aggregation for Federated Learning on User-Held Data
Vladimir Ivanov
Ben Kreuter
Antonio Marcedone
Sarvar Patel
NIPS Workshop on Private Multi-Party Machine Learning (2016)
Preview abstract
Secure Aggregation is a class of Secure Multi-Party Computation algorithms wherein a group of
mutually distrustful parties u ∈ U each hold a private value x_u and collaborate to compute an
aggregate value, such as the sum_{u∈U} x_u, without revealing to one another any information about
their private value except what is learnable from the aggregate value itself. In this work, we consider
training a deep neural network in the Federated Learning model, using distributed gradient descent
across user-held training data on mobile devices, wherein Secure Aggregation protects the privacy of
each user’s model gradient. We identify a combination of efficiency and robustness requirements
which, to the best of our knowledge, are unmet by existing algorithms in the literature. We proceed to
design a novel, communication-efficient Secure Aggregation protocol for high-dimensional data that
tolerates up to 1/3 users failing to complete the protocol. For 16-bit input values, our protocol offers
1.73x communication expansion for 2^10 users and 2^20-dimensional vectors, and 1.98x expansion
for 2^14 users and 2^24 dimensional vectors.
View details
Federated Optimization: Distributed Optimization Beyond the Datacenter
Jakub Konečný
NIPS Optimization for Machine Learning Workshop (2015), pp. 5
Preview abstract
We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of utmost importance.
A motivating example for federated optimization arises when we keep the training data locally on users' mobile devices rather than logging it to a data center for training. Instead, the mobile devices are used as nodes performing computation on their local data in order to update a global model. We suppose that we have an extremely large number of devices in our network, each of which has only a tiny fraction of data available totally; in particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, we assume that no device has a representative sample of the overall distribution.
We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results. This work also sets a path for future research needed in the context of federated optimization.
View details
Preview abstract
We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.
View details
Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations
Francesco Orabona
Proceedings of the 27th Annual Conference on Learning Theory (COLT) (2014)
Preview abstract
We study algorithms for online linear optimization in Hilbert spaces, focusing on the case where the player is unconstrained. We develop a novel characterization of a large class of minimax algorithms, recovering, and even improving, several previous results as immediate corollaries. Moreover, using our tools, we develop an algorithm that provides a regret bound of $O(U \sqrt{T \log( U \sqrt{T} \log^2 T +1)})$, where $U$ is the $L_2$ norm of an arbitrary comparator and both $T$ and $U$ are unknown to the player. This bound is optimal up to $\sqrt{\log \log T}$ terms. When $T$ is known, we derive an algorithm with an optimal regret bound (up to constant factors). For both the known and unknown $T$ case, a Normal approximation to the conditional value of the game proves to be the key analysis tool.
View details
Preview abstract
We design and analyze minimax-optimal algorithms for online linear optimization games where the player's choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player's and the adversary's optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game.
View details
Estimation, Optimization, and Parallelism when Data is Sparse
John C. Duchi
Michael I. Jordan
Advances in Neural Information Processing Systems (NIPS) (2013)
Preview abstract
We study stochastic optimization problems when the \emph{data} is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties---in terms of increased sample complexity that sparse data necessitates---and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks.
View details
Ad Click Prediction: a View from the Trenches
Michael Young
Dietmar Ebner
Julian Grady
Lan Nie
Eugene Davydov
Sharat Chikkerur
Dan Liu
Arnar Mar Hrafnkelsson
Tom Boulos
Jeremy Kubica
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2013)
Preview abstract
Predicting ad click--through rates (CTR) is a massive-scale learning
problem that is central to the multi-billion dollar online
advertising industry. We present a selection of case studies and
topics drawn from recent experiments in the setting of a deployed
CTR prediction system. These include improvements in the context of
traditional supervised learning based on an FTRL-Proximal online
learning algorithm (which has excellent sparsity and convergence
properties) and the use of per-coordinate learning rates.
We also explore some of the challenges that arise in a real-world
system that may appear at first to be outside the domain of
traditional machine learning research. These include useful tricks
for memory savings, methods for assessing and visualizing
performance, practical methods for providing confidence estimates
for predicted probabilities, calibration methods, and methods for
automated management of features. Finally, we also detail several
directions that did not turn out to be beneficial for us, despite
promising results elsewhere in the literature. The goal of this
paper is to highlight the close relationship between theoretical
advances and practical engineering in this industrial setting, and
to show the depth of challenges that appear when applying
traditional machine learning methods in a complex dynamic system.
View details
Large-Scale Learning with Less RAM via Randomization
Michael Young
Proceedings of the 30 International Conference on Machine Learning (ICML) (2013), pp. 10
Preview abstract
We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this
reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs.
View details
Preview abstract
Some of the most compelling applications of online convex
optimization, including online prediction and classification, are
unconstrained: the natural feasible set is R^n. Existing
algorithms fail to achieve sub-linear regret in this setting unless
constraints on the comparator point x* are known in advance. We
present algorithms that, without such prior knowledge, offer
near-optimal regret bounds with respect to any choice of
x*. In particular, regret with respect to x* = 0 is
constant. We then prove lower bounds showing that our
guarantees are near-optimal in this setting.
View details
Open Problem: Better Bounds for Online Logistic Regression
COLT/ICML Joint Open Problem Session, JMLR: Workshop and Conference Proceedings (2012)
Preview abstract
Known algorithms applied to online logistic regression on a feasible set of L2 diameter D achieve regret bounds like O(eD log T) in one dimension, but we show a bound of O(sqrt(D) + log T) is possible in a binary 1-dimensional problem. Thus, we pose the following question: Is it possible to achieve a regret bound for online logistic regression that is O(poly(D)log(T))? Even if this is not possible in general, it would be interesting to have a bound that reduces to our bound in the one-dimensional case.
View details
Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) (2011)
Preview abstract
We prove that many mirror descent algorithms for online convex optimization (such as online gradient descent) have an equivalent interpretation as follow-the-regularized-leader (FTRL) algorithms. This observation makes the relationships between many commonly used algorithms explicit, and provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles L1 regularization explicitly, it has been observed that the FTRL-style Regularized Dual Averaging (RDA) algorithm is even more effective at producing sparsity. Our results demonstrate that the key difference between these algorithms is how they handle the cumulative L1 penalty. While FOBOS handles the $L_1$ term exactly on any given update, we show that it is effectively using subgradient approximations to the L1 penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm, which we introduce, can be seen as a hybrid of these two algorithms, and significantly outperforms both on a large, real-world dataset.
View details
Adaptive Bound Optimization for Online Convex Optimization
Proceedings of the 23rd Annual Conference on Learning Theory (COLT) (2010)
Preview abstract
We introduce a new online convex optimization algorithm that
adaptively chooses its regularization function based on the loss
functions observed so far. This is in contrast to previous algorithms
that use a fixed regularization function such as L2-squared, and
modify it only via a single time-dependent parameter. Our algorithm's
regret bounds are worst-case optimal, and for certain realistic
classes of loss functions they are much better than existing bounds.
These bounds are problem-dependent, which means they can exploit the
structure of the actual problem instance. Critically, however, our
algorithm does not need to know this structure in advance. Rather, we
prove competitive guarantees that show the algorithm provides a bound
within a constant factor of the best possible bound (of a certain
functional form) in hindsight.
View details
Sleeping Experts and Bandits with Stochastic Action Availability and Adversarial Rewards
Preview
Varun Kanade
Proceedings of the 12th International Conference on Artificial Intelligence and Statistic (AISTATS) (2009)
Tighter Bounds for Multi-Armed Bandits with Expert Advice
Preview
Proceedings of the 22nd Annual Conference on Learning Theory (COLT) (2009)
Robust Submodular Observation Selection
Preview
Andreas Krause
Carlos Guestrin
Anupam Gupta
Journal of Machine Learning Research (JMLR), vol. 9 (2008), pp. 2761-2801
Efficiently Computing Minimax Expected-Size Confidence Regions
Preview
Chad M. Schafer
Jeff Schneider
Proc. 24th ICML, ACM, Corvalis (2007), pp. 97-104
Selecting Observations Against Adversarial Objectives
Preview
Andreas Krause
Carlos Guestrin
Anupam Gupta
Advances in Neural Information Processing Systems (NIPS 2007)
A Fast Bundle-based Anytime Algorithm for Poker and other Convex Games
Geoffrey J. Gordon
Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS) (2007)
A Unification of Extensive-Form Games and Markov Decision Processes
Robust Planning in Domains with Stochastic Outcomes, Adversaries, and Partial Observability
CMU (2006)
Generalizing Dijkstra's algorithm and Gaussian Elimination for solving MDPs
Fast Exact Planning in Markov Decision Processes
Geoffrey J. Gordon
International Conference on Automated Planning and Scheduling (ICAPS) (2005)
Online convex optimization in the bandit setting: gradient descent without a gradient
Abraham Flaxman
Adam Tauman Kalai
Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2005)
Bounded Real-Time Dynamic Programming: RTDP with monotone upper bounds and performance guarantees
Maxim Likhachev
Geoffrey J. Gordon
Proceedings of the 22nd International Conference on Machine Learning (ICML) (2005)
Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary
Avrim Blum
Proceedings of the Seventeenth Annual Conference on Learning Theory (COLT) (2004), pp. 109-123
Multi-source spanning trees: algorithms for minimizing source eccentricities
Planning in the Presence of Cost Functions Controlled By An Adversary
Geoffrey J. Gordon
Avrim Blum
In Proceedings of the 20th International Conference on Machine Learning (ICML) (2003)