Sergei Vassilvitskii
Authored Publications
Google Publications
Other Publications
Sort By
Preview abstract
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements.
Our space is near optimal up to poly-logarithmic factors even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters and counting low frequency elements, to the simpler, counting/summing problems in the same setting.
Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
View details
Preview abstract
Transformer-based language models are being pre-trained on ever growing datasets (hundreds of gigabytes) using ever growing amounts of parameters (millions to billions).
These large amounts of training data are typically scraped from the public web, and may contain (public) personally identifiable information such as names and phone numbers.
Moreover, recent findings also show that the capacity of these models allow them to memorize parts of the training data. One defense against such memorization that has not yet been fully explored in this context is differential privacy (DP).
We focus on T5, a popular encoder-decoder, and show that by using recent advances in JAX and XLA we can train models with DP that do not suffer a big drop in utility, nor in training speed, and can still be fine-tuned to high accuracies on downstream tasks such as GLUE. Moreover, we show that T5's span corruption pre-training task, unlike next token prediction, is a good defense against data memorization.
View details
Scalable Differentially Private Clustering via Hierarchically Separated Trees
Chris Schwiegelshohn
David Saulpic
2022 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2022) (to appear)
Preview abstract
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space.
By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods.
We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime.
Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
View details
Secretaries with Advice
Paul Duetting
Proceedings of the 22nd ACM Conference on Economics and Computation (EC'21) (2021), pp. 409-429
Preview abstract
The secretary problem is probably the purest model of decision making under uncertainty. In this
paper we ask which advice can we give the algorithm to improve its success probability?
We propose a general model that unifies a broad range of problems: from the classic secretary problem
with no advice, to the variant where the quality of a secretary is drawn from a known distribution and
the algorithm learns each candidate’s quality quantile on arrival, to more modern ML-based versions of
advice where a binary classifier gives us noisy advice about whether or not the current secretary is the
best on the market.
Our main technique is a factor revealing LP that captures all of the problems above. We use this LP
formulation to gain structural insight into the optimal policy and present two case studies: a re-derivation
of the classic known distributions result with tools from linear programming and a tight analysis of the
noisy binary advice model.
View details
Sliding Window Algorithms for k-Clustering Problems
Michele Borassi
Neurips 2020 (to appear)
Preview abstract
The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
View details
Preview abstract
The use of machine learning and predictive models have produced a revolution in science and engineering. Online optimization problems are a natural source of uncertainty that predictions can be used to manage and improve performance. This paper studies how predictive techniques can be used to break through worst case barriers in online scheduling.
The make-span minimization problem on unrelated machines and its special case, restricted assignment, are classic problems in online scheduling theory. Worst case analysis of these problems yields Ω(log m) lower bounds on the competitive ratio in the online setting. In this paper we construct non-trivial predictions for these problems and design algorithms that utilize these predictions to compute solutions online. Our predictions are compact in size, having dimension linear in the number of machines. The performance guarantees of our algorithms depend on the accuracy of the predictions, and moderately accurate predictions allow our techniques to beat the worst case lower bounds. More precisely, the predictions can be used to construct O(log η)-competitive fractional assignments, where η is the error of the predictions. We then give an online algorithm that is O(poly(log log(m)))-competitive to round these fractional assignments
View details
Fair Hierarchical Clustering
Benjamin Moseley
Marina Knittel
Yuyan Wang
Neurips 2020
Preview abstract
As machine learning has become more and more integrated into our businesses and lifestyles, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering.
In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a certain objective~\cite{dasgupta}. For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, surprisingly, with only a negligible loss in the objective.
View details
Preview abstract
Modern online learning algorithms achieve low (sublinear) regret in a variety of diverse settings. These algorithms, however, update their solution at every time step. While, these updates are computationally efficient, the very requirement of frequent updates makes the algorithms untenable in some practical applications. In this work we look for online learning algorithms that update a sublinear number of times. We give a meta algorithm based on non-homogeneous Poisson Processes that gives a smooth trade- off between final regret and frequency of updates. Empirically, we show that in many cases we can significantly reduce the number at a minimal increase in regret.
View details
Preview abstract
Differentially private learning algorithms protect individual participants in the training dataset by guaranteeing that their presence does not significantly change the resulting model. In order to make this promise, such algorithms need to know the maximum contribution that can be made by a single user: the more data an individual can contribute, the more noise will need to be added to protect them. While most existing analyses assume that the maximum contribution is known and fixed in advance—indeed, it is often assumed that each user contributes only a single example— we argue that in practice there is a meaningful choice to be made. On the one hand, if we allow users to contribute large amounts of data, we may end up adding excessive noise to protect a few outliers, even when the majority contribute only modestly. On the other hand, limiting users to small contributions keeps noise levels low at the cost
of potentially discarding significant amounts of excess data, thus introducing bias. Here, we characterize this trade-off for an empirical risk minimization setting, showing that in general there is a “sweet spot” that depends on measurable properties of the dataset, but that there is also a concrete cost to privacy that cannot be avoided simply by collecting more data.
View details
Preview abstract
We propose online algorithms for Column Subset Selection (CSS) and Principal Component
Analysis (PCA), two methods that are widely employed for data analysis, summarization, and
visualization. Given a data matrix A that is revealed one column at a time, the online CSS
problems asks to keep a small set of columns, S, that best approximates the space spanned by
the columns of A. As each column arrives, the algorithm must irrevocably decide whether to
add it to S, or to ignore it. In the online PCA problem, the goal is to output a projection of
each column to a low dimensional subspace. In other words, the algorithm must provide an
embedding for each column as it arrives, which cannot be changed as new columns arrive.
While both of these problems have been studied in the online setting, only additive approximations were known prior to our work. The core of our approach is an adaptive sampling technique that gives a practical and efficient algorithm for both of these problems. We prove that by sampling columns using their “residual norm” (i.e. their norm orthogonal to directions sampled so far), we end up with a significantly better dependence between the number of columns sampled, and the desired error in the approximation.
We further show how to combine our algorithm “in series” with prior algorithms. In particular, using the results of Boutsidis et al. [4] and Frieze et al. [14] that have additive guarantees, we show how to improve the bounds on the error of our algorithm.
View details
Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives
Michele Borassi
Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)
Preview abstract
The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic space.
In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties.
Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. Then we instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization. In doing so, we improve state-of-the art results obtained using problem-specific algorithms.
View details
Preview abstract
The desire to use machine learning to assist in human decision making has spawned a large area of research in understanding the impact of such systems not only on the society as a whole, but also the specific impact on different subpopulations. Recent work has shown that while there are several natural ways to quantify the fairness of a particular system, none of them are universal, and except for trivial cases, satisfying one means violating another~\citet{Kleinberg, Goel, Kleinberg2}.
In parallel to understanding the interplay between different definitions of fairness, researchers have looked to develop algorithms for finding fair solutions to specific classes of problems. For example, there is recent work on fair regression, fair clustering, fair bandit algorithms and many others.
In this work we tackle another large class of problems that form a useful primitive in machine learning and optimization, that of optimizing a set function subject to a set of matroid constraints. A matroid is a combinatorial object that generalizes linear independence between vectors and is general enough to encode cardinality constraints (e.g. selecting at most $k$ elements), connectivity constraints (e.g. selecting a spanning tree of a graph), or matching constraints (ensuring a subgraph has a perfect matching).
View details
Testing Incentive Compatibility in Display Ad Auctions
Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018
Preview abstract
Consider a buyer participating in a repeated auction, such as those
prevalent in display advertising. How would she test whether the
auction is incentive compatible? To bid effectively, she is
interested in whether the auction is single-shot incentive
compatible---a pure second-price auction, with fixed reserve
price---and also dynamically incentive compatible---her bids
are not used to set future reserve prices. In this work we develop
tests based on simple bid perturbations that a buyer can use to
answer these questions, with a focus on dynamic incentive
compatibility.
There are many potential A/B testing setups that one could use, but
we find that many natural experimental designs are, in fact, flawed.
For instance, we show that additive perturbations can lead to paradoxical results,
where higher bids lead to lower optimal reserve prices. We precisely
characterize this phenomenon and show that reserve prices are only
guaranteed to be monotone for distributions satisfying the Monotone
Hazard Rate (MHR) property. The experimenter must also decide how to split traffic to apply
systematic perturbations. It is tempting to have this split be
randomized, but we demonstrate empirically that unless the perturbations are
aligned with the partitions used by the seller to compute
reserve prices, the results are guaranteed to be inconclusive.
We validate our results with experiments on real display auction
data and show that a buyer can quantify both single-shot and dynamic
incentive compatibility even under realistic conditions where only
the cost of the impression is observed (as opposed to the exact
reserve price). We analyze the cost of running such experiments,
exposing trade-offs between test accuracy, cost, and underlying
market dynamics.
View details
Preview abstract
We consider the reachability indexing problem for private-public directed graphs. In these graphs nodes come in three flavors: public—nodes visible to all users, private—nodes visible to a specific set of users, and protected—nodes visible to any user who can see at least one of the node’s parents. We are interested in computing the set of nodes visible to a specific user online. There are two obvious algorithms: precompute the result for every user, or run a reachability algorithm at query time. This paper explores the trade-off between these two strategies.
Our approach is to identify a set of additional visible seed nodes for each user. The online reachability algorithm explores the graph starting at these nodes. We first formulate the problem as asymmetric k-center with outliers, and then give an efficient and practical algorithm. We prove new theoretical guarantees for this problem and show empirically that it performs very well in practice.
View details
Preview abstract
We introduce the {\em consistent $k$-clustering} problem, which asks to minimize the number of re-clusterings necessary to maintain a constant approximate solution as points arrive in an online manner. We prove lower bounds of $\Omega(k \log n)$ on the number of clustering changes necessary, and give an algorithm that needs only $O(k^3 \log^5 n)$ changes to maintain a constant competitive solution. This is an exponential improvement on the naive solution of reclustering at every time step. Finally, we show experimentally that our approach performs much better than the theoretical bound, with the number of changes growing approximately as $O(\log n)$
View details
Preview abstract
We study the question of fair clustering under the disparate impact doctrine, where each protected
class must have approximately equal representation in every cluster. We formulate the fair clustering
problem under both the k-center and the k-median objectives, and show that even with two protected
classes the problem is challenging, as the optimum solution can violate common conventions—for instance a point may no longer be assigned to its nearest cluster center!
En route we introduce the concept of fairlets, which are minimal sets that satisfy fair representation
while approximately preserving the clustering objective. We show that any fair clustering problem can be
decomposed into first finding good fairlets, and then using existing machinery for traditional clustering
algorithms. While finding good fairlets can be NP-hard, we proceed to obtain efficient approximation
algorithms based on minimum cost flow.
We empirically quantify the value of fair clustering on real-world datasets with sensitive attributes.
View details
Submodular Optimization Over Sliding Windows
Proceedings of the 26th International World Wide Web Conference, WWW (2017)
Preview abstract
Maximizing submodular functions under cardinality constraints lies at
the core of numerous data mining and machine learning applications,
including data diversification, data summarization, and coverage problems.
In this work, we study this question in the context of data
streams, where elements arrive one at a time, and we want to
design low-memory and fast update-time algorithms that maintain a good solution.
Specifically, we focus on the sliding window
model, where we are asked to maintain a solution that considers only
the last $W$ items.
In this context, we provide the first non-trivial algorithm that
maintains a provable approximation of the optimum using space sublinear in the size of the window.
In particular we give a $\nicefrac{1}{3} - \epsilon$ approximation algorithm that uses space polylogarithmic in
the spread of the values of the elements,
$\Spread$, and linear in the solution size $k$ for any constant $\epsilon > 0$. At the same
time, processing each element only requires a
polylogarithmic number of evaluations of the function itself. When a
better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $\nicefrac{1}{2} - \epsilon$
approximation, and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem.
We demonstrate the efficacy of the algorithms on a number of real
world datasets, showing that their practical performance far exceeds
the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a
negligible fraction of them.
View details
Preview abstract
We study the question of setting and testing reserve prices in single item auctions when the bidders are not identical. At a high level, there are two generalizations of the standard second price auction: in the lazy version we first determine the winner, and then apply reserve prices; in the eager version we first discard the bidders not meeting their reserves, and then determine the winner among the rest. We show that the two versions have dramatically different properties: lazy reserves are easy to optimize, and A/B test in production, whereas eager reserves always lead to higher welfare, but their optimization is NP-complete, and naive A/B testing will lead to incorrect conclusions. Despite their different characteristics, we show that the overall revenue for the two scenarios is always within a factor of 2 of each other, even in the presence of correlated bids. Moreover, we prove that the eager auction dominates the lazy auction on revenue whenever the bidders are independent or symmetric. We complement our theoretical results with simulations on real world data that show that even suboptimally set eager reserve prices are preferred from a revenue standpoint.
View details
Pricing a low-regret seller
Hoda Heidari
Sadra Yazdanbod
Proceedings of the Thirty-Third International Conference on Machine Learning (ICML 2016)
Preview abstract
As the number of ad exchanges has grown, publishers have turned to low regret learning algorithms to decide which exchange offers the best price for their inventory. This in turn opens the following question for the exchange: how to set prices to attract as many sellers as possible and maximize revenue. In this work we formulate this precisely as a learning problem, and present algorithms showing that by simply knowing that the counterparty is using a low regret algorithm is enough for the exchange to have its own low regret learning algorithm to find the optimal price.
View details
Learning mobile phone battery consumptions
Ashish Sharma
Paul Eastham
Workshop on On Device Intelligence (2016)
Preview abstract
We introduce a novel, data-driven way for predicting battery consumption of apps. The state-of-the-art models used to blame battery consumption on apps are based on micro-benchmark experiments. These experiments are carried out on controlled setups where one can measure how much battery is consumed by each internal resource (CPU, bluetooth, WiFi...). The battery blame allocated to an app is simply the sum of the blames of the resources consumed by the app. We argue that this type of models do not capture the way phones work "in the wild" and propose instead to train a regression model using data collected from logs. We show that this type of learning is correct in the sense that under some assumptions, we can recover the true battery discharge rate of each component. We present experimental results where we consistently do better predictions than a model trained on micro-benchmarks.
View details
Preview abstract
A significant part of optimizing revenue for ad auctions is setting a
good reserve (or minimum) price. Set it too low, and the impression
may yield little revenue, set it too high and there may not anyone
willing to buy the item. Previous work has looked at predicting this
value directly, however, the strongly non-convex objective function
makes this a challenging proposition. In contrast, motivated by the
fact that computing an optimal reserve price for a set of bids is
easy, we propose a clustering approach, first finding a good partition
of the data, and then finding an optimal reserve price for each
partition. In this work, we take a major step in this direction: we
derive the specific objective function that corresponds to revenue
optimization in auctions, and give algorithms that optimize it.
View details
Incentivizing Advertiser Networks to Submit Multiple Bids
Patrick Hummel
R. Preston McAfee
International Journal of Game Theory, vol. 45 (2016), pp. 1031-1052
Preview abstract
This paper illustrates a method for making side payments to advertiser networks that creates an incentive for the advertiser networks to submit the second-highest bids they received to an ad exchange and simultaneously ensures that the publishers will make more money on average in the short run as a result of adopting this scheme. We also illustrate how this payment scheme affects publisher payoffs in the long run after advertisers have a chance to modify their strategies in response to the changed incentives of the mechanism.
View details
Value of Targeting
Patrick Hummel
Proceedings of the 7th International Symposium on Algorithmic Game Theory (SAGT) (2014), pp. 194-205
Preview abstract
We undertake a formal study of the value of targeting data to an advertiser. As expected, this value is increasing in the utility difference between realizations of the targeting data and the accuracy of the data, and depends on the distribution of competing bids. However, this value may vary non-monotonically with an advertiser’s budget. Similarly, modeling the values as either private or correlated, or allowing other advertisers to also make use of the data, leads to unpredictable changes in the value of data. We address questions related to multiple data sources, show that utility of additional data may be non-monotonic, and provide tradeoffs between the quality and the price of data sources. In a game-theoretic setting, we show that advertisers may be worse off than if the data had not been available at all. We also ask whether a publisher can infer the value an advertiser would place on targeting data from the advertiser’s bidding behavior and illustrate that this is impossible.
View details
Preview abstract
Computing connected components of a graph lies at the core of many data mining algorithms, and is a fundamental subroutine in graph clustering. This problem is well studied, yet many of the algorithms with good theoretical guarantees perform poorly in practice, especially when faced with graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on traditional MapReduce architecture for large scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees, and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges.
View details
Scalable K-Means by ranked retrieval
Lluis Garcia-Pueyo
Vanja Josifovski
Srihari Venkatesan
Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242
Preview abstract
The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.
View details
Ad auctions with data
Preview
Hu Fu
Patrick R. Jordan
Inbal Talgam-Cohen
INFOCOM Workshops (2012), pp. 184-189
Ad serving using a compact allocation plan
Preview
Peiji Chen
Wenjing Ma
Srinath Mandalapu
Chandrashekhar Nagarjan
Jayavel Shanmugasundaram
Erik Vee
Manfai Yu
Jason Zien
Proceedings of the 13th ACM Conference on Electronic Commerce, ACM, New York, NY, USA (2012), pp. 319-336
Preview abstract
We consider the problem of designing an online
exchange. We identify the goals of exchange design,
and present key techniques for accomplishing these
goals along with the tradeoffs inherent in the choices.
View details
To match or not to match: economics of cookie matching in online advertising
Preview
Arpita Ghosh
Preston McAfee
Proceedings of the 13th ACM Conference on Electronic Commerce, ACM, New York, NY, USA (2012), pp. 741-753
Efficiently Encoding Term Co-occurrences in Inverted Indexes
Preview
Marcus Fontoura
Maxim Gurevich
Vanja Josifovski
20th ACM Conference on Information and Knowledge Management (CIKM 2011) (to appear)
Filtering: a method for solving graph problems in MapReduce.
Benjamin Moseley
Siddharth Suri
SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 85-94
Preview abstract
The MapReduce framework is currently the de facto standard used throughout both industry and academia for petabyte scale data analysis. As the input to a typical MapReduce computation is large, one of the key requirements of the framework is that the input cannot be stored on a single machine and must be processed in parallel. In this paper we describe a general algorithmic design technique in the MapReduce framework called filtering. The main idea behind filtering is to reduce the size of the input in a distributed fashion so that the resulting, much smaller, problem instance can be solved on a single machine. Using this approach we give new algorithms in the MapReduce framework for a variety of fundamental graph problems. Specifically, we present algorithms for minimum spanning trees, maximal matchings, approximate weighted matchings, approximate vertex and edge covers and minimum cuts. In all of these cases, we will parameterize our algorithms by the amount of memory available on the machines allowing us to show tradeoffs between the memory available and the number of MapReduce rounds. For each setting we will show that even if the machines are only given substantially sublinear memory, our algorithms run in a constant number of MapReduce rounds. To demonstrate the practical viability of our algorithms we implement the maximal matching algorithm that lies at the core of our analysis and show that it achieves a significant speedup over the sequential version.
View details
Hiring a secretary from a poset.
Ravi Kumar
Andrea Vattani
Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), pp. 39-48
Preview abstract
The secretary problem lies at the core of mechanism design for online auctions. In this work we study the generalization of the classical secretary problem in a setting where there is only a partial order be- tween the elements and the goal of the algorithm is to return one of the maximal elements of the poset. This is equivalent to the setting where the seller has a multidimensional objective function with only a partial order among the outcomes. We obtain an algorithm that succeeds with probability at least?1 + l
k^{−k/(k−1)} ((1+log^{-1/(k-1)} k)^k -1) where k is the number of maximal elements in the poset and is the only information about the poset that is known to the algorithm. On the other hand, we prove an almost matching upper bound of k^{−1/(k−1)} on the success probability of any algorithm for this problem; this upper bound holds even if the algorithm knows the complete structure of the poset.
View details
Factorization-based Lossless Compression of Inverted Indices
Preview
George Beskales
Marcus Fontoura
Maxim Gurevich
Vanja Josifovski
20th ACM Conference on Information and Knowledge Management (CIKM 2011) (to appear)
Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe
Preview
Shirshanka Das
Marcus Fontoura
Bhaskar Ghosh
Vanja Josifovski
Jayavel Shanmugasundaram
The 20th International World Wide Web Confererence (WWW 2011)
Maximally representative allocations for guaranteed delivery advertising campaigns
R. Preston McAfee
Review of Economic Design, vol. 17 (2013), pp. 83-94
Densest Subgraph in Streaming and MapReduce
SHALE: an efficient algorithm for allocation of guaranteed display advertising
Vijay Bharadwaj
Peiji Chen
Wenjing Ma
Chandrashekhar Nagarajan
John Tomlin
Erik Vee
Jian Yang
KDD (2012), pp. 1195-1203
Scalable K-Means++
Bahman Bahmani
Benjamin Moseley
Andrea Vattani
Ravi Kumar
PVLDB, vol. 5 (2012), pp. 622-633
Ad Serving Using a Compact Allocation Plan
Peiji Chen
Wenjing Ma
Srinath Mandalapu
Chandrashekhar Nagarajan
Jayavel Shanmugasundaram
Erik Vee
Manfai Yu
Jason Y. Zien
CoRR, vol. abs/1203.3593 (2012)
Handling forecast errors while bidding for display advertising
Ad serving using a compact allocation plan
Peiji Chen
Wenjing Ma
Srinath Mandalapu
Chandrashekhar Nagarajan
Jayavel Shanmugasundaram
Erik Vee
Manfai Yu
Jason Y. Zien
ACM Conference on Electronic Commerce (2012), pp. 319-336
Inventory Allocation for Online Graphical Display Advertising using Multi-objective Optimization
Jian Yang
Erik Vee
John Tomlin
Jayavel Shanmugasundaram
Tasos Anastasakos
Oliver Kennedy
ICORES (2012), pp. 293-304
SHALE: An Efficient Algorithm for Allocation of Guaranteed Display Advertising
Vijay Bharadwaj
Peiji Chen
Wenjing Ma
Chandrashekhar Nagarajan
John Tomlin
Erik Vee
Jian Yang
CoRR, vol. abs/1203.3619 (2012)
Counting triangles and the curse of the last reducer
Filtering: a method for solving graph problems in MapReduce
The Multiple Attribution Problem in Pay-Per-Conversion Advertising
Optimal Envy-Free Pricing with Metric Substitutability
Factorization-based lossless compression of inverted indices
George Beskales
Marcus Fontoura
Maxim Gurevich
Vanja Josifovski
CIKM (2011), pp. 327-332
Factorization-based Lossless Compression of Inverted Indices
George Beskales
Marcus Fontoura
Maxim Gurevich
Vanja Josifovski
CoRR, vol. abs/1108.1956 (2011)
Cross-Validation and Mean-Square Stability
Hiring a secretary from a poset
Ravi Kumar
Andrea Vattani
ACM Conference on Electronic Commerce (2011), pp. 39-48
Efficiently encoding term co-occurrences in inverted indexes
Efficiently evaluating graph constraints in content-based publish/subscribe
Shirshanka Das
Marcus Fontoura
Bhaskar Ghosh
Vanja Josifovski
Jayavel Shanmugasundaram
WWW (2011), pp. 497-506
Efficiently evaluating complex boolean expressions
Marcus Fontoura
Suhas Sadanandan
Jayavel Shanmugasundaram
Erik Vee
Srihari Venkatesan
Jason Y. Zien
SIGMOD Conference (2010), pp. 3-14
Finding the Jaccard Median
A Model of Computation for MapReduce
Optimal online assignment with forecasts
Erik Vee
Jayavel Shanmugasundaram
ACM Conference on Electronic Commerce (2010), pp. 109-118
Generalized distances between rankings
Inventory Allocation for Online Graphical Display Advertising
Jian Yang
Erik Vee
John Tomlin
Jayavel Shanmugasundaram
Tasos Anastasakos
Oliver Kennedy
CoRR, vol. abs/1008.3551 (2010)
Contract Auctions for Sponsored Search
Social Networks and Stable Matchings in the Job Market
Getting Recommender Systems to Think outside the Box
Zeinab Abbassi
Sihem Amer-Yahia
Laks V. S. Lakshmanan
Cong Yu
RecSys (2009), pp. 285-288
Indexing Boolean Expressions
Steven Whang
Chad Brower
Jayavel Shanmugasundaram
Erik Vee
Ramana Yerneni
Hector Garcia-Molina
PVLDB, vol. 2 (2009), pp. 37-48
The Hiring Problem and Lake Wobegon Strategies
Adam Kirsch
Ravi Kumar
Michael Mitzenmacher
Eli Upfal
SIAM J. Comput., vol. 39 (2009), pp. 1233-1255
Adaptive bidding for display advertising
Arpita Ghosh
Benjamin I. P. Rubinstein
Martin Zinkevich
WWW (2009), pp. 251-260
Nearest-neighbor caching for content-match applications
Sandeep Pandey
Flavio Chierichetti
Vanja Josifovski
Ravi Kumar
WWW (2009), pp. 441-450
Bidding for Representative Allocations for Display Advertising
Top-k aggregation using intersections of ranked inputs
Battling Predictability and Overconcentration in Recommender Systems
Sihem Amer-Yahia
Laks V. S. Lakshmanan
Cong Yu
IEEE Data Eng. Bull., vol. 32 (2009), pp. 33-40
Indexing Boolean Expressions
Steven Euijong Whang
Chad Brower
Jayavel Shanmugasundaram
Erik Vee
Ramana Yerneni
Hector Garcia-Molina
Proc. 35th Int'l Conf. on Very Large Data Bases (PVLDB) (2009), pp. 37-48
Similarity caching
Bidding for Representative Allocations for Display Advertising
Arpita Ghosh
Randolph Preston McAfee
CoRR, vol. abs/0910.0880 (2009)
Impression-plus-click auctions
Worst-Case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-Means Method
Social Networks and Stable Matchings in the Job Market
Optimal envy-free pricing with metric substitutability
The hiring problem and Lake Wobegon strategies
Adam Kirsch
Ravi Kumar
Michael Mitzenmacher
Eli Upfal
SODA (2008), pp. 1184-1193
Relaxation in text search using taxonomies
Marcus Fontoura
Vanja Josifovski
Ravi Kumar
Christopher Olston
PVLDB, vol. 1 (2008), pp. 672-683
Sponsored Search Auctions with Reserve Prices: Going Beyond Separability
Tracing the Path: New Model and Algorithms for Collaborative Filtering
On threshold behavior in query incentive networks
Esteban Arcaute
Adam Kirsch
Ravi Kumar
David Liben-Nowell
ACM Conference on Electronic Commerce (2007), pp. 66-74
k-means++: the advantages of careful seeding
Worst-case and Smoothed Analysis of the ICP Algorithm, with an Application to the k-means Method
How slow is the k-means method?
Using web-graph distance for relevance feedback in web search
Efficiently computing succinct trade-off curves
Efficiently Computing Succinct Trade-Off Curves
On the General Reconfiguration Problem for Expanding Cube Style Modular Robots
On the General Reconfiguration Problem for Expanding Cube Style Modular Robots
Jeremy Kubica
Eleanor G. Rieffel
John W. Suh
Mark Yim
Proceedings of the 2002 IEEE Int. Conference on Robotics and Automation, pp. 801-808