Jump to Content
Afshin Rostamizadeh

Afshin Rostamizadeh

Afshin is a research scientist at Google Research NY, where he specializes in designing and applying machine learning algorithms. He received his BS in Electrical Engineering and Computer Science from UC Berkeley, his PhD in Computer Science from the Courant Institute at NYU with advisor Mehryar Mohri and was a post-doc at UC Berkeley in Peter Bartlett's group.

He has worked on problems such as learning from non-iid samples, learning from biased samples, learning from data with missing features and automatic kernel selection for kernelized algorithms such as SVM.

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Search engines including Google are beginning to support local-dining queries such as ``At which nearby restaurants can I order the Indonesian salad \textit{gado-gado}?''. Given the low coverage of online menus worldwide, and only 30\% even having a website, this remains a challenge. Here we leverage the power of the crowd: online users who are willing to answer questions about dish availability at restaurants visited. While motivated users are happy to contribute knowledge for free, they are much less likely to respond to ``silly'' or embarrassing questions (e.g., ``Does \textit{Pizza Hut} serve pizza?'' or ``Does \textit{Mike's Vegan Restaurant} serve hamburgers?'') In this paper, we study the problem of \textit{Vexation-Aware Active Learning}, where judiciously selected questions are targeted towards improving restaurant-dish model prediction, subject to a limit on the percentage of ``unsure'' answers or ``dismissals'' (e.g., swiping the app closed) used to measure vexation. We formalize the problem as an integer linear program and solve it efficiently using a distributed solution that scales linearly with the number of candidate questions. Since our algorithm relies on precise estimation of the unsure-dismiss rate (UDR), we give a regression model that provides accurate results compared to baselines including collaborative filtering. Finally, we demonstrate in a live system that our proposed vexation-aware strategy performs competitively against classical (margin-based) active learning approaches while not exceeding UDR bounds. View details
    Preview abstract In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is to also maintain a low difference in predictions compared to the base model (i.e. predictive churn). If model retraining results in vastly different behavior, then it could cause negative effects in downstream systems, especially if this churn can be avoided with limited impact on model accuracy. In this paper, we show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn. We then show that distillation performs strongly for low churn training against a number of recent baselines on a wide range of datasets and model architectures, including fully-connected networks, convolutional networks, and transformers. View details
    Preview abstract Consider a setting where we wish to automate an expensive task with a machine learning algorithm using a limited labeling resource. In such settings, examples routed for labeling are often out of scope for the machine learning algorithm. For example, in a spam detection setting, human reviewers not only provide labeled data but are such high-quality detectors of spam that examples routed to them no longer require machine evaluation. A consequence is that distribution of examples routed to the machine is intimately tied to the process generating labels. We introduce a formalization of this setting, and give an algorithm that simultaneously learns a model and decides when to request a label by leveraging ideas from both the abstention and active learning literatures. We prove an upper bound on the algorithm’s label complexity and a matching lower bound for any algorithm in this setting. We conduct a thorough set of experiments including an ablation study to test different components of our algorithm. We demonstrate the effectiveness of an efficient version of our algorithm over margin sampling on a variety of datasets. View details
    Preview abstract Federated learning is typically approached as a distributed optimization problem, where the goal is to minimize a global loss function by distributing computation across many client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as inference of the global posterior distribution over model parameters. While exact inference is often intractable, this perspective provides a consistent way to search for global optima in federated settings. Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm---\emph{federated posterior averaging} (\FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to iteratively refine the global estimate of the posterior mode. Finally, we show that \FedPA generalizes federated averaging (\FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster, to better optima. View details
    Preview abstract The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources. Batch active learning, which adaptively issues batched queries to a labeling oracle, is a common approach for addressing this problem. The practical benefits of batch sampling come with the downside of less adaptivity and the risk of sampling redundant examples within a batch -- a risk that grows with the batch size. In this work, we analyze an efficient active learning algorithm, which focuses on the large batch setting. In particular, we show that our sampling method, which combines notions of uncertainty and diversity, easily scales to batch sizes (100K-1M) several orders of magnitude larger than used in previous studies and provides significant improvements in model training efficiency compared to recent baselines. View details
    Understanding the Effects of Batching in Online Active Learning
    Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (2020)
    Preview abstract Online active learning (AL) algorithms often assume immediate access to a label once a query has been made. However, due to practical constraints, the labels of these queried examples are generally only available in ``batches''. In this work, we present a novel analysis for a generic class of batch online AL algorithms and reveal that the effects of batching are in fact mild and only result in an additional term in the label complexity that is linear in the batch size. To our knowledge, this provides the first theoretical justification for such algorithms and we show how they can be applied to batch variants of three canonical online AL algorithms: IWAL, ORIWAL, and DHM. We also conduct an empirical study that corroborates the novel theoretical insights. View details
    An Analysis of SVD for Deep Rotation Estimation
    Jake Levinson
    Arthur Chen
    Angjoo Kanazawa
    Advances in Neural Information Processing Systems (NeurIPS) 2020
    Preview abstract Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto O(n) or SO(n). These tools have long been used for applications in computer vision, for example optimal 3D alignment problems solved by orthogonal Procrustes, rotation averaging, or Essential matrix decomposition. Despite its utility in different settings, SVD orthogonalization as a procedure for producing rotation matrices is typically overlooked in deep learning models, where the preferences tend toward classic representations like unit quaternions, Euler angles, and axis-angle, or more recently-introduced methods. Despite the importance of 3D rotations in computer vision and robotics, a single universally effective representation is still missing. Here, we explore the viability of SVD orthogonalization for 3D rotations in neural networks. We present a theoretical analysis of SVD as used for projection onto the rotation group. Our extensive quantitative analysis shows simply replacing existing representations with the SVD orthogonalization procedure obtains state of the art performance in many deep learning applications covering both supervised and unsupervised training. View details
    A System for Massively Parallel Hyperparameter Tuning
    Liam Li
    Kevin Jamieson
    Ekaterina Gonina
    Jonathan Ben-tzur
    Moritz Hardt
    Benjamin Recht
    Ameet Talwalkar
    Third Conference on Systems and Machine Learning (2020) (to appear)
    Preview abstract Modern learning models are characterized by large hyperparameter spaces and long training times; this coupled with the rise of parallel computing and productionization of machine learning motivate developing production- quality hyperparameter optimization functionality for a distributed computing setting. We address this challenge with a simple and robust hyperparameter optimization algorithm ASHA, which exploits parallelism and aggressive early-stopping to tackle large-scale hyperparameter optimization problems. Our extensive empirical results show that ASHA outperforms state-of-the-art hyperparameter optimization methods; scales linearly with the number of workers in distributed settings; and is suitable for massive parallelism, converging to a high quality configuration in half the time taken by Vizier (Google’s internal hyperparameter optimization service) in an experiment with 500 workers. We end with a discussion of the systems considerations we encountered and our associated solutions when implementing ASHA in SystemX, a production-quality service for hyperparameter tuning. View details
    Preview abstract In modern machine learning tasks, the presence of categorical features with extremely large vocabularies is a reality. This becomes a bottleneck when using an ML model, which generally grows at least linearly with the vocabulary size, affecting the memory, training and inference costs, as well as overfitting risk. In this work, we seek to compress the vocabulary by maximizing the mutual information between the compressed categorical feature and the target binary labels. We note the relationship of this problem to that of quantization in a discrete memoryless channel, where there exists a quadratic-time algorithm to solve the problem. Unfortunately, such an algorithm does not scale to data sets with massive vocabularies and, in this paper, we develop a distributed quasi-linear O(n log n) algorithm with provable approximation guarantees. We first observe that although entropy is a submodular function, this is not the case for mutual information between a categorical feature and label. To tackle this problem, we define a set function over a different space, which still contains the optimal solution, and prove this function is submodular. We also provide a query oracle to the submodular function that runs in amortized logarithmic time, and is easy to compute in a distributed fashion. Combining these results with a greedy algorithm allows us to achieve a (1-1/e)-approximation in quasi-linear time. Finally, we compare our proposed algorithm to several existing approaches using the large-scale Criteo learning task and demonstrate better performance in retaining mutual information and also verify the learning performance of the compressed vocabulary. View details
    Preview abstract Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a-priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used ℓ1 decoder. The convex ℓ1 decoder prevents gradient propagation as needed in standard gradient-based training. Our method is based on the insight that unrolling the convex decoder into T projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. We compare the empirical performance of 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). Our experiments show that there is indeed additional structure beyond sparsity in the real datasets. Our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods. We illustrate an application of our method in learning label embeddings for extreme multi-label classification. Our experiments show that our method is able to match or outperform the precision scores of SLEEC, which is one of the state-of-the-art embedding-based approaches for extreme multi-label learning. View details
    Preview abstract We tested, in a production setting, the use of active learning for selecting text documents for human annotations used to train a Thai segmentation machine learning model. In our study, two concurrent annotated samples were constructed, one through random sampling of documents from a text corpus, and the other through model-based scoring and ranking of documents from the same corpus. We observed that several of the assumptions forming the basis of offline (simulated) evaluation largely failed in the live setting. We present these challenges and propose guidelines addressing each of them which can be used for the design of live experimentation of active learning, and more generally for the application of active learning in live settings. View details
    Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
    Liam Li
    Kevin Jamieson
    Ameet Talwalkar
    Journal of Machine Learning Research, vol. 18-185 (2018), pp. 1-52
    Preview abstract Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, øuralg , for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare øuralg with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that øuralg can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems. View details
    Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits
    Ameet Talwalkar
    Giulia DeSalvo
    Kevin Jamieson
    Lisha Li
    5th International Conference on Learning Representations (2017)
    Preview abstract Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian Optimiza- tion to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation. We present H YPERBAND , a novel algorithm for hyperparameter optimization that is simple, flexible, and theoretically sound. H YPERBAND is a principled early-stoppping method that adaptively allocates a pre- defined resource, e.g., iterations, data samples or number of features, to randomly sampled configurations. We compare H YPERBAND with state-of-the-art Bayesian Optimization methods on several hyperparameter optimization problems. We ob- serve that H YPERBAND can provide over an order of magnitude speedups over competitors on a variety of neural network and kernel-based learning problems. View details
    Where to sell: Simulating auctions from learning algorithms
    Hamid Nazerzadeh
    Proceedings of the Seventeenth ACM Conference on Economics and Computation (EC2016)
    Preview abstract Ad Exchange platforms connect online publishers and advertisers and facilitate selling billions of impressions every day. We study these environments from the perspective of a publisher who wants to find the profit maximizing exchange to sell his inventory. Ideally, the publisher would run an auction among exchanges. However, this is not possible due to technological and other practical considerations. The publisher needs to send each impression to one of the exchanges with an asking price. We model the problem as a variation of multi-armed bandits where exchanges (arms) can behave strategically in order to maximizes their own profit. We propose mechanisms that find the best exchange with sub-linear regret and have desirable incentive properties. View details
    Preview abstract The problem of matrix column subset selection has recently attracted a large body of research, with feature selection serving as one obvious and important application. Among the techniques that have been applied to solve this problem, the greedy algorithm has been shown to be quite effective in practice. However, our theoretical guarantees on its performance have not been ex- plored thoroughly, especially in a distributed set- ting. In this paper, we study the greedy algorithm for the column subset selection problem from a theoretical and empirical perspective and show its effectiveness in a distributed setting. In par- ticular, we provide an improved approximation guarantee for the greedy algorithm, and present the first distributed implementation of this algo- rithm with provable approximation factors. We use the idea of randomized composable core- sets, developed recently in the context of sub- modular maximization. Finally, we validate the effectiveness of this distributed algorithm via an empirical study. View details
    Scalable and interpretable data representation for high-dimensional complex data
    Been Kim
    Kayur Patel
    Julie Shah
    AAAI Conference on Artificial Intelligence (2015)
    Preview
    Preview abstract We give the first O(1/sqrt{T})-error online algorithm for reconstructing noisy statistical databases, where T is the number of (online) sample queries received. The algorithm is optimal up to the poly(log(T)) factor in terms of the error and requires only O(log T) memory. It aims to learn a hidden database-vector w* in R^d in order to accurately answer a stream of queries regarding the hidden database, which arrive in an online fashion from some unknown distribution D. We assume the distribution D is defined on the neighborhood of a low-dimensional manifold. The presented algorithm runs in O(dD)-time per query, where d is the dimensionality of the query-space. Contrary to the classical setting, there is no separate training set that is used by the algorithm to learn the database —- the stream on which the algorithm will be evaluated must also be used to learn the database-vector. The algorithm only has access to a binary oracle O that answers whether a particular linear function of the database-vector plus random noise is larger than a threshold, which is specified by the algorithm. We note that we allow for a significant O(D) amount of noise to be added while other works focused on the low noise o(sqrt{D}) setting. For a stream of T queries our algorithm achieves an average error O(1/sqrt{T}) by filtering out random noise, adapting threshold values given to the oracle based on its previous answers and, as a consequence, recovering with high precision a projection of a database-vector w* onto the manifold defining the query-space. Our algorithm may be also applied in the adversarial machine learning context to compromise machine learning engines by heavily exploiting the vulnerabilities of the systems that output only binary signal and in the presence of significant noise. View details
    Applying WebTables in Practice
    Sreeram Balakrishnan
    Alon Halevy
    Boulos Harb
    Warren Shen
    Kenneth Wilder
    Fei Wu
    Cong Yu
    Conference on Innovative Data Systems Research (2015)
    Preview
    Preview abstract Google Research recently tested a massive online class model for an internal engineering education program, with machine learning as the topic, that blended theoretical concepts and Google-specific software tool tutorials. The goal of this training was to foster engineering capacity to leverage machine learning tools in future products. The course was delivered both synchronously and asynchronously, and students had the choice between studying independently or participating with a group. Since all students are company employees, unlike most publicly offered MOOCs we can continue to measure the students’ behavioral change long after the course is complete. This paper describes the course, outlines the available data set and presents directions for analysis. View details
    Theoretical Foundations for Learning Kernels in Supervised Kernel PCA
    Modern Nonparametrics 3: Automating the Learning Pipeline, Neural Information Processing Systems, Workshop (2014)
    Preview abstract This paper presents a novel learning scenario which combines dimensionality reduction, supervised learning as well as kernel selection. We carefully define the hypothesis class that addresses this setting and provide an analysis of its Rademacher complexity and thereby provide generalization guarantees. The proposed algorithm uses KPCA to reduce the dimensionality of the feature space, i.e. by projecting data onto top eigenvectors of covariance operator in a kernel reproducing space. Moreover, it simultaneously learns a linear combination of base kernel functions, which defines a reproducing space, as well as the parameters of a supervised learning algorithm in order to minimize a regularized empirical loss. The bound on Rademacher complexity of our hypothesis is shown to be logarithmic in the number of base kernels, which encourages practitioners to combine as many base kernels as possible. View details
    Preview abstract Motivated by real-time advertising exchanges, we analyze the problem of pricing inventory in a repeated posted-price auction. We consider both the cases of a truthful and surplus-maximizing buyer, where the former makes decisions myopically on every round, and the latter may strategically react to our algorithm, forgoing short-term surplus in order to trick the algorithm into setting better prices in the future. We further assume a buyer’s valuation of a good is a function of a context vector that describes the good being sold. We give the first algorithm attaining sublinear (O(T^{ 2/3})) regret in the contextual setting against a surplus-maximizing buyer. We also extend this result to repeated second-price auctions with multiple buyers. View details
    Preview abstract Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value for a good when the buyer is repeatedly interacting with the seller through a posted-price mechanism. We model the buyer as a strategic agent, interested in maximizing her long-term surplus, and are interested in optimizing seller revenue. We show conditions under which the seller cannot hope to gain an advantage by learning the buyer’s value – i.e. the buyer can always manipulate the exchange to hide her value. This result is accompanied by a seller algorithm that is able to achieve no-regret when the buyer is unable to incur the short-term costs of such manipulation. View details
    Ensembles of Kernel Predictors
    Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011)
    Preview
    Generalization Bounds for Learning Kernels
    Proceedings of the 27th Annual International Conference on Machine Learning (ICML 2010)
    Preview abstract This paper presents several novel generalization bounds for the problem of learning kernels based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels using L1 regularization admits only a √log p dependency on the number of kernels, which is tight and considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a non-negative combination of p base kernels with an L2 regularization whose dependency on p is also tight and only in p^(1/4). We present similar results for Lq regularization with other values of q, and outline the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Experiments with a large number of kernels further validate the behavior of the generalization error as a function of p predicted by our bounds. View details
    Two-Stage Learning Kernel Algorithms
    Proceedings of the 27th Annual International Conference on Machine Learning (ICML 2010)
    Preview abstract This paper examines two-stage techniques for learning kernels based on a notion of alignment. It presents a number of novel theoretical, algorithmic, and empirical results for alignmentbased techniques. Our results build on previous work by Cristianini et al. (2001), but we adopt a different definition of kernel alignment and significantly extend that work in several directions: we give a novel and simple concentration bound for alignment between kernel matrices; show the existence of good predictors for kernels with high alignment, both for classification and for regression; give algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP; and report the results of extensive experimentswith this alignment-based method in classification and regression tasks, which show an improvement both over the uniformcombination of kernels and over other state-of-the-art learning kernel methods. View details
    Preview abstract Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. View details
    Domain Adaptation: Learning Bounds and Algorithms
    Yishay Mansour
    Proceedings of The 22nd Annual Conference on Learning Theory (COLT 2009), Omnipress, Montr\'eal, Canada
    Preview
    Rademacher Complexity Bounds for Non-I.I.D. Processes
    Advances in Neural Information Processing Systems (NIPS 2008), MIT Press, Vancouver, Canada (2009)
    Preview
    Multiple Source Adaptation and the Renyi Divergence
    Yishay Mansour
    Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montr\'eal, Canada
    Preview
    Domain Adaptation with Multiple Sources
    Yishay Mansour
    Advances in Neural Information Processing Systems (NIPS 2008), MIT Press, Vancouver, Canada (2009)
    Preview
    L2 Regularization for Learning Kernels
    Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montr\'eal, Canada
    Preview
    Stability Bounds for Non-i.i.d. Processes
    Advances in Neural Information Processing Systems (NIPS 2007), MIT Press, Vancouver, Canada (2008)
    Preview
    Sample Selection Bias Correction Theory
    Proceedings of The 19th International Conference on Algorithmic Learning Theory (ALT 2008), Springer, Heidelberg, Germany, Budapest, Hungary
    Preview
    Learning sequence kernels
    Proceedings of IEEE International Workshop on Machine Learning for Signal Processing (2008)
    Preview
    An $\tildeO(\frac1\sqrtT)$-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying mode
    Foundations of Coupled Nonlinear Dimensionality Reduction
    Perceptron Mistake Bounds
    CoRR, vol. abs/1305.0208 (2013)
    Foundations of Machine Learning
    Ameet Talwalkar
    MIT Press (2012), I-XII, 1-412
    Learning with Missing Features
    Alekh Agarwal
    Peter Bartett
    Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011)
    Stability Bounds for Stationary phi-mixing and beta-mixing Processes
    Journal of Machine Learning Research, vol. 11 (2010), pp. 789-814
    Multiple Source Adaptation and the Rényi Divergence
    Yishay Mansour
    UAI (2009), pp. 367-374
    New Generalization Bounds for Learning Kernels
    Stability Bound for Stationary Phi-mixing and Beta-mixing Processes
    CoRR, vol. abs/0811.1629 (2008)
    Rademacher Complexity Bounds for Non-I.I.D. Processes
    NIPS (2008), pp. 1097-1104