Walid Krichene
Walid Krichene received his Ph.D. in Electrical Engineering and Computer Sciences from the University of California, Berkeley. His research focuses on optimization theory and differential privacy, and their applications to recommendation systems. He received an M.S. in Applied Mathematics from the École des Mines ParisTech and an M.A. in Mathematics from the University of California, Berkeley. He previously worked at Criteo Labs, the CDC department at Caltech, and the CAOR group at the École des Mines. Walid received the Best Paper Award at KDD 2020, the Leon Chua Award for outstanding achievement in nonlinear science, the Valedictorian Prize for the Tunisian Baccalaureate, and a Bronze Medal at the Pan African Math Olympiads.
Authored Publications
    Revisiting the Performance of iALS on Item Recommendation Benchmarks
    Steffen Rendle
    Li Zhang
    Yehuda Koren
    Proceedings of the 16th ACM Conference on Recommender Systems, Association for Computing Machinery (2022), pp. 427-435
     Matrix factorization learned by implicit alternating least squares (iALS) is a popular baseline in recommender system research publications. iALS is known to be one of the most computationally efficient and scalable collaborative filtering methods. However, recent studies suggest that its prediction quality is not competitive with the current state of the art, in particular autoencoders and other item-based collaborative filtering methods. In this work, we revisit four well-studied benchmarks where iALS was reported to perform poorly and show that with proper tuning, iALS is highly competitive and outperforms any method on at least half of the comparisons. We hope that these high-quality results, together with iALS's known scalability, spark new interest in applying and further improving this decade-old technique.
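     A minimal sketch of the classic implicit-ALS user-factor update that iALS builds on, using the standard Hu/Koren/Volinsky confidence weighting. This is an illustrative toy implementation, not the tuned solver evaluated in the paper; all names and constants (R, V, alpha, reg) are ours.

import numpy as np

def ials_user_step(R, V, alpha=10.0, reg=0.1):
    """One iALS half-step: solve for user factors U with item factors V fixed.
    R is an (n_users, n_items) binary implicit-feedback matrix, V is (n_items, d)."""
    n_users, d = R.shape[0], V.shape[1]
    U = np.zeros((n_users, d))
    VtV = V.T @ V                                   # shared Gramian, computed once
    for u in range(n_users):
        observed = np.nonzero(R[u])[0]              # items user u interacted with
        Vo = V[observed]
        # Normal equations (V^T C_u V + reg*I) p_u = V^T C_u r_u with
        # confidence C_u = I + alpha * diag(r_u) and binary r_u.
        A = VtV + alpha * (Vo.T @ Vo) + reg * np.eye(d)
        b = (1.0 + alpha) * Vo.sum(axis=0)
        U[u] = np.linalg.solve(A, b)
    return U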
     We study the problem of differentially private (DP) matrix completion under user-level privacy. We design an $(\epsilon,\delta)$-joint differentially private variant of the popular Alternating-Least-Squares (ALS) method that achieves: i) (nearly) optimal sample complexity for matrix completion (in terms of number of items and users), and ii) the best known privacy/utility trade-off, both theoretically and on benchmark data sets. In particular, despite the non-convexity of low-rank matrix completion and ALS, we provide the first global convergence analysis of ALS with noise introduced to ensure DP. For $n$ the number of users and $m$ the number of items in the rating matrix, our analysis requires only about $\log(n+m)$ samples per user (ignoring rank and condition number factors) and obtains a sample complexity of $n=\tilde\Omega(m/(\sqrt{\zeta}\cdot \epsilon))$ to ensure a relative Frobenius norm error of $\zeta$. This improves significantly on the previous best result of $n=\tilde\Omega\left(m^{5/4}/(\zeta^{5}\epsilon)\right)$ for the private-FW method of Jain et al. (2018). Furthermore, we extensively validate our method on synthetic and benchmark data sets (MovieLens 10M, MovieLens 20M), and observe that private ALS only suffers a drop of about 6 percentage points in accuracy compared to the non-private baseline for $\epsilon\leq 10$. Furthermore, compared to the prior work of Jain et al. (2018), it is better by at least 10 percentage points for all choices of the privacy parameters.
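     A hedged sketch of the mechanism the abstract describes: alternating least squares in which the sufficient statistics of each least-squares solve are perturbed with Gaussian noise after clipping each user's contribution. The clipping rule, noise calibration, and joint-DP accounting in the paper are more involved; the function and constants below are illustrative only.

import numpy as np

def noisy_item_step(R, U, reg=0.1, clip=1.0, sigma=1.0, rng=None):
    """Solve for item factors V from ratings R (n_users x n_items) and user
    factors U (n_users x d), adding Gaussian noise to the per-item statistics."""
    rng = np.random.default_rng() if rng is None else rng
    n_items, d = R.shape[1], U.shape[1]
    # Bound each user's contribution by clipping the norm of their factor.
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U_c = U * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    V = np.zeros((n_items, d))
    for i in range(n_items):
        raters = np.nonzero(R[:, i])[0]
        Ui, ri = U_c[raters], R[raters, i]
        A = Ui.T @ Ui + sigma * rng.standard_normal((d, d)) + reg * np.eye(d)
        A = (A + A.T) / 2.0                        # symmetrize the noisy Gramian
        b = Ui.T @ ri + sigma * rng.standard_normal(d)
        V[i] = np.linalg.solve(A, b)
    return V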
     The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics where only a smaller set of random items and the relevant items are ranked. This paper investigates sampled metrics in more detail and shows that they are inconsistent with their exact version, in the sense that they do not preserve relative statements, e.g., "recommender A is better than B", not even in expectation. Moreover, the smaller the sampling size, the less difference there is between metrics, and for very small sampling sizes, all metrics collapse to the AUC metric. We show that it is possible to improve the quality of the sampled metrics by applying a correction, obtained by minimizing different criteria such as bias or mean squared error. We conclude with an empirical evaluation of the naive sampled metrics and their corrected variants. To summarize, our work suggests that sampling should be avoided for metric calculation; however, if an experimental study needs to sample, the proposed corrections can improve the quality of the estimate.
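     A minimal sketch contrasting an exact ranking metric with its sampled counterpart for a single (user, relevant item) pair, to make the object of study concrete. The corrections proposed in the paper are not reproduced here, and the function names are ours.

import numpy as np

def hit_at_k(scores, relevant, k):
    """Exact Hit@k: rank the relevant item against the full catalogue."""
    rank = np.sum(scores > scores[relevant])        # number of items ranked above it
    return float(rank < k)

def sampled_hit_at_k(scores, relevant, k, n_samples, rng=None):
    """Sampled Hit@k: rank the relevant item against n_samples random negatives."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = np.delete(np.arange(len(scores)), relevant)
    negatives = rng.choice(candidates, size=n_samples, replace=False)
    rank = np.sum(scores[negatives] > scores[relevant])
    return float(rank < k)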
    Rankmax: An Adaptive Projection Alternative to the Softmax Function
    Weiwei Kong
    Nicolas Mayoraz
    Steffen Rendle
    Li Zhang
    (2020)
     Several machine learning models involve mapping a score vector to a probability vector. Usually, this is done by projecting the score vector onto a probability simplex, and such projections are often characterized as Lipschitz continuous approximations of the argmax function, whose Lipschitz constant is controlled by a parameter that is similar to a softmax temperature. The aforementioned parameter has been observed to affect the quality of these models and is typically either treated as a constant or decayed over time. In this work, we propose a method that adapts this parameter to individual training examples. The resulting method exhibits desirable properties, such as sparsity of its support and numerically efficient implementation, and we find that it significantly outperforms competing non-adaptive projection methods. In our analysis, we also derive the general solution of (Bregman) projections onto the (n, k)-simplex, a result which may be of independent interest.
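     For context, the standard Euclidean projection onto the plain probability simplex (the sort-and-threshold algorithm) is sketched below. Rankmax itself works with projections onto the more general (n, k)-simplex and an adaptive, per-example parameter, which this sketch does not implement.

import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {p >= 0, sum(p) = 1}."""
    u = np.sort(v)[::-1]                            # scores in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > css - 1.0)[0][-1]     # largest index kept in the support
    theta = (css[rho] - 1.0) / (rho + 1.0)          # shared shift
    return np.maximum(v - theta, 0.0)

print(project_simplex(np.array([0.5, 1.2, -0.3])))  # -> [0.15, 0.85, 0.0]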
    Superbloom: Bloom filter meets Transformer
    John Roberts Anderson
    Qingqing Huang
    Steffen Rendle
    Li Zhang
    Google (2020)
     We extend the idea of word pieces in natural language models to machine learning tasks on opaque ids. This is achieved by applying hash functions to map each id to multiple hash tokens in a much smaller space, similarly to a Bloom filter. We show that by applying a multi-layer Transformer to these Bloom filter digests, we are able to obtain models with high accuracy. They outperform models of a similar size without hashing and, to a large degree, models of a much larger size trained using sampled softmax with the same computational budget. Our key observation is that it is important to use a multi-layer Transformer for Bloom filter digests to remove ambiguity in the hashed input. We believe this provides an alternative method for solving problems with a large vocabulary size.
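     A hedged sketch of the Bloom-filter-style input encoding described above: each opaque id is hashed into a few tokens from a much smaller vocabulary, and their embeddings form the input to a downstream model (a multi-layer Transformer in the paper). The hash scheme and the averaging combiner here are illustrative assumptions, not the paper's exact setup.

import numpy as np

def bloom_tokens(item_id, num_hashes=3, vocab_size=10_000):
    """Map one opaque id to num_hashes tokens in a vocabulary of size vocab_size."""
    # Simple salted hashes; any family of independent hash functions would do.
    return [hash((salt, item_id)) % vocab_size for salt in range(num_hashes)]

def embed_id(item_id, table):
    """Average the hash-token embeddings of a single id (its "Bloom filter digest")."""
    tokens = bloom_tokens(item_id, vocab_size=table.shape[0])
    return table[tokens].mean(axis=0)

table = np.random.default_rng(0).normal(size=(10_000, 64))   # hash-token embeddings
vec = embed_id(123_456_789, table)                            # 64-dim representation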
    Neural Collaborative Filtering vs. Matrix Factorization Revisited
    Steffen Rendle
    Li Zhang
    John Roberts Anderson
    RecSys 2020
     Embedding-based models have been the state of the art in collaborative filtering for over a decade. Traditionally, the dot product or higher-order equivalents have been used to combine two or more embeddings, e.g., most notably in matrix factorization. In recent years, it was suggested to replace the dot product with a learned similarity, e.g., using a multilayer perceptron (MLP). This approach is often referred to as neural collaborative filtering (NCF). In this work, we revisit the experiments of the NCF paper that popularized learned similarities using MLPs. First, we show that with a proper hyperparameter selection, a simple dot product substantially outperforms the proposed learned similarities. Second, while an MLP can in theory approximate any function, we show that it is non-trivial to learn a dot product with an MLP. Finally, we discuss practical issues that arise when applying MLP-based similarities and show that MLPs are too costly to use for item recommendation in production environments, while dot products allow very efficient retrieval algorithms to be applied. We conclude that MLPs should be used with care as an embedding combiner and that dot products might be a better default choice.
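     A minimal sketch of the two scoring functions being compared: the dot product of user and item embeddings versus a small MLP applied to their concatenation. The weights below are random placeholders purely to show the shapes; in the paper both models are trained and carefully tuned.

import numpy as np

def dot_score(u, v):
    """Matrix-factorization style similarity."""
    return u @ v

def mlp_score(u, v, W1, b1, w2, b2):
    """Learned similarity: one ReLU layer over the concatenated embeddings."""
    h = np.maximum(W1 @ np.concatenate([u, v]) + b1, 0.0)
    return w2 @ h + b2

rng = np.random.default_rng(0)
d = 16
u, v = rng.normal(size=d), rng.normal(size=d)
W1, b1 = rng.normal(size=(32, 2 * d)), np.zeros(32)
w2, b2 = rng.normal(size=32), 0.0
print(dot_score(u, v), mlp_score(u, v, W1, b1, w2, b2))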
     Recent results have shown that for two-layer fully connected neural networks, gradient flow converges to a global optimum in the infinite width limit, by making a connection between the mean field dynamics and the Wasserstein gradient flow. These results were derived for first-order gradient flow, and a natural question is whether second-order dynamics, i.e., dynamics with momentum, exhibit a similar guarantee. We show that the answer is positive for the heavy ball method. In this case, the resulting integro-PDE is a nonlinear kinetic Fokker-Planck equation, and unlike the first-order case, it has no apparent connection with the Wasserstein gradient flow. Instead, we study the variations of a Lyapunov functional along the solution trajectories to characterize the stationary points and to prove convergence. While our results are asymptotic in the mean field limit, numerical simulations indicate that global convergence may already occur for reasonably small networks.
    Efficient Training on Very Large Corpora via Gramian Estimation
    Nicolas Mayoraz
    Steffen Rendle
    Li Zhang
    Lichan Hong
    John Anderson
    ICLR 2019
     We study the problem of learning similarity functions over very large corpora using neural network embedding models. These models are typically trained using SGD with random sampling of unobserved pairs, with a sample size that grows quadratically with the corpus size, making it expensive to scale. We propose new efficient methods to train these models without having to sample unobserved pairs. Inspired by matrix factorization, our approach relies on adding a global quadratic penalty and expressing this term as the inner product of two generalized Gramians. We show that the gradient of this term can be efficiently computed by maintaining estimates of the Gramians, and develop variance reduction schemes to improve the quality of the estimates. We conduct large-scale experiments that show a significant improvement both in training time and generalization performance compared to sampling methods.
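     The identity at the heart of this approach can be checked in a few lines: the quadratic penalty summed over all user-item pairs equals a Frobenius inner product of two d x d Gramians, so its gradient never requires enumerating pairs. This is a toy verification of that identity, not the paper's streaming Gramian estimators or variance reduction schemes.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 1000, 500, 8
U = rng.normal(size=(n_users, d))                 # user embeddings
V = rng.normal(size=(n_items, d))                 # item embeddings

# sum_{i,j} (u_i . v_j)^2 over all pairs ...
naive = np.sum((U @ V.T) ** 2)
# ... equals <U^T U, V^T V>_F, which only involves two d x d Gramians.
gram = np.sum((U.T @ U) * (V.T @ V))
assert np.allclose(naive, gram)

# Gradient of the penalty with respect to U, again without touching all pairs.
grad_U = 2.0 * U @ (V.T @ V)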
    Scaling Up Collaborative Filtering Data Sets through Randomized Fractal Expansions
    Francois Belletti
    Karthik Singaram Lakshmanan
    Nicolas Mayoraz
    Yi-fan Chen
    John Anderson
    Taylor Robie
    Tayo Oguntebi
    Amit Bleiwess
    Dan Shirron
    arXiv (2019)
     Recommender system research suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap, we propose to generate large-scale user/item interaction data sets by expanding pre-existing public data sets. Our key contribution is a technique that expands user/item incidence matrices to large numbers of rows (users), columns (items), and non-zero values (interactions). The proposed method adapts Kronecker Graph Theory to preserve key higher order statistical properties such as the fat-tailed distribution of user engagements, item popularity, and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark recommender systems and the systems employed to train them. We further apply our stochastic expansion algorithm to the binarized MovieLens 20M data set, which comprises 20M interactions between 27K movies and 138K users. The resulting expanded data set has 1.2B ratings, 2.2M users, and 855K items, which can be scaled up or down.
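     The size growth at the core of Kronecker-style expansion can be illustrated with a toy example: taking the Kronecker product of an interaction matrix with a small seed pattern multiplies both the number of users and the number of items. The paper's randomized fractal expansion additionally derives the seed from the data and randomizes the expansion to preserve the statistical properties listed above; that machinery is not shown here.

import numpy as np

R = np.array([[1, 0, 1],
              [0, 1, 0]])                 # tiny "real" user/item interaction matrix
seed = np.array([[1, 1],
                 [0, 1]])                 # small expansion pattern
expanded = np.kron(R, seed)               # (2*2) users x (3*2) items
print(R.shape, "->", expanded.shape, "nonzeros:", np.count_nonzero(expanded))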
    Randomized Fractal Expansions for Production-Scale Public Collaborative-Filtering Data Sets
    Francois Belletti
    John Anderson
    Karthik Singaram Lakshmanan
    Nicolas Mayoraz
    Pankaj Kanwar
    Taylor Robie
    Tayo Oguntebi
    Yi-fan Chen
    arXiv (2019)
     Recommender system research suffers from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap, we propose to generate large-scale user/item interaction data sets by expanding pre-existing public data sets. Our key contribution is a technique that expands user/item incidence matrices to large numbers of rows (users), columns (items), and non-zero values (interactions). The proposed method adapts Kronecker Graph Theory to preserve key higher order statistical properties such as the fat-tailed distribution of user engagements, item popularity, and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which can be employed reliably to benchmark recommender systems and the systems employed to train them. We further apply our stochastic expansion algorithm to the binarized MovieLens 20M data set, which comprises 20M interactions between 27K movies and 138K users. The resulting expanded data set has 1.2B ratings, 2.2M users, and 855K items. Furthermore, we present collaborative filtering experiments demonstrating that the generated synthetic data entails valuable insights for machine learning at scale in recommender systems. We provide code pointers to reproduce our data and our experiments.
    Acceleration and Averaging in Stochastic Descent Dynamics
    Peter Bartlett
    31st Conference on Neural Information Processing Systems (NIPS) (2017)
     We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.
    Adaptive Averaging in Accelerated Descent Dynamics
    Alexandre Bayen
    Peter Bartlett
    30th Conference on Neural Information Processing Systems (NIPS) (2016)
     We study accelerated descent dynamics for constrained convex optimization. These dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate η(t), and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights w(t). Using a Lyapunov argument, we give sufficient conditions on η and w to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging, and give numerical experiments to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.
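     A discretized, hedged sketch of the averaging formulation described above: a dual variable accumulates gradients at a rate eta_k while the primal iterate is a weighted average of the mirrored dual trajectory with weights w_k. The mirror map (entropic, on the simplex), the rates, and the weights below are illustrative choices, not the adaptive scheme analyzed in the paper.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def averaged_mirror_descent(grad, x0, steps=500):
    """Dual variable z accumulates gradients; primal x averages softmax(z) over time."""
    x, z = x0.copy(), np.zeros_like(x0)
    weighted_sum, total_weight = np.zeros_like(x0), 0.0
    for k in range(1, steps + 1):
        eta_k, w_k = 1.0 / np.sqrt(k), float(k)        # illustrative rates eta(t), w(t)
        z -= eta_k * grad(x)                           # dual: accumulate gradients
        weighted_sum += w_k * softmax(z)               # mirror the dual trajectory
        total_weight += w_k
        x = weighted_sum / total_weight                # primal: weighted average
    return x

# Example: minimize ||x - target||^2 over the probability simplex.
target = np.array([0.7, 0.2, 0.1])
print(averaged_mirror_descent(lambda x: 2.0 * (x - target), np.ones(3) / 3.0))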
    On Learning How Players Learn: Estimation of Learning Dynamics in the Routing Game
    Kiet Lam
    Alexandre Bayen
    ACM/IEEE 7th International Conference on Cyber-Physical Systems (ICCPS) (2016)
    Minimizing Regret on Reflexive Banach Spaces and Nash Equilibria in Continuous Zero-Sum Games
    Maximilian Balandat
    Claire Tomlin
    Alexandre Bayen
    Advances in Neural Information Processing Systems 29 (NIPS) (2016)
    On Social Optimal Routing Under Selfish Learning
    Milena Suarez Castillo
    Alexandre Bayen
    IEEE Transactions on Control of Network Systems (2016)
    Continuous and Discrete Dynamics for Online Learning and Convex Optimization
    Ph.D. Thesis, University of California, Berkeley (2016)
    Online Learning of Nash Equilibria in Congestion Games
    Benjamin Drighès
    Alexandre Bayen
    SIAM Journal on Control and Optimization (2015)
    The Hedge Algorithm on a Continuum
    Maximilian Balandat
    Claire Tomlin
    Alexandre Bayen
    32nd International Conference on Machine Learning (ICML) (2015)
    Efficient Bregman Projections onto the Simplex
    Syrine Krichene
    Alexandre Bayen
    IEEE 54th Annual Conference on Decision and Control (CDC) (2015)
    Accelerated Mirror Descent in Continuous and Discrete Time
    Alexandre Bayen
    Peter Bartlett
    Advances in neural information processing systems 28 (NIPS) (2015), pp. 2845-2853
    Differential Privacy of Populations in Routing Games
    Roy Dong
    Alexandre Bayen
    S. Shankar Sastry
    IEEE 54th Annual Conference on Decision and Control (CDC) (2015), pp. 2798-2803
    A PDE-ODE Model for a Junction with Ramp Buffer
    Maria-Laura Delle Monache
    Jack Reilly
    Samitha Samaranayake
    Paula Goatin
    Alexandre Bayen
    SIAM Journal on Applied Mathematics, vol. 74 (2014), pp. 22-39
    On the Convergence of No-Regret Learning in Selfish Routing
    Benjamin Drighès
    Alexandre Bayen
    31st International Conference on Machine Learning (ICML) (2014), pp. 163-171
    Stability of Nash Equilibria in the Congestion Game under Replicator Dynamics
    Benjamin Drighès
    Alexandre Bayen
    IEEE 53rd Annual Conference on Decision and Control (CDC) (2014)