Alessandro Epasto

I am a staff research scientist at Google, New York, working in the Algorithms and Optimization team led by Vahab Mirrokni. I received a Ph.D. in computer science from Sapienza University of Rome, where I was advised by Professor Alessandro Panconesi and supported by the Google Europe Ph.D. Fellowship in Algorithms (2011). I was also a post-doc in the Department of Computer Science at Brown University in Providence, RI, USA, where I was advised by Professor Eli Upfal. My research interests include algorithmic problems in machine learning and data mining, in particular in the areas of clustering, privacy, and large-scale graph analysis.
Authored Publications
    Abstract: The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible. Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental $\ell_p$ $(p\in [0,+\infty))$ frequency moment estimation problem under this setting, and give an $\varepsilon$-DP algorithm that achieves $(1+\eta)$-relative approximation $(\forall \eta\in(0,1))$ with $\mathrm{poly}\log(Tn)$ additive error and uses $\mathrm{poly}\log(Tn)\cdot \max(1, n^{1-2/p})$ space, where $T$ is the length of the stream and $n$ is the size of the universe of elements. Our space is near optimal up to poly-logarithmic factors even in the non-private setting. To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters and counting low frequency elements, to the simpler counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the $\ell_p$ frequency moment estimation problem. We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past $W$ data items.
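The counting/summing primitive in the continual release model can be illustrated with the classic binary tree (dyadic decomposition) mechanism. The sketch below is a minimal, self-contained illustration of that standard primitive on an assumed 0/1 stream; it is not the paper's frequency moment algorithm, and the noise calibration is the textbook one rather than the paper's.

```python
import math
import numpy as np

def private_continual_counts(stream_bits, epsilon):
    """Binary tree mechanism: release a running count of a 0/1 stream so that
    the entire sequence of outputs is epsilon-DP under continual release.
    Each stream position participates in O(log T) dyadic intervals, so Laplace
    noise of scale O(log T / epsilon) per interval suffices."""
    T = len(stream_bits)
    levels = max(1, math.ceil(math.log2(T + 1)))
    scale = levels / epsilon                      # split the budget across levels
    noisy_node = {}                               # (level, block index) -> noisy partial sum
    outputs = []
    for t in range(1, T + 1):
        # create the dyadic intervals that end exactly at position t
        for lvl in range(levels):
            size = 1 << lvl
            if t % size == 0:
                true_sum = sum(stream_bits[t - size:t])
                noisy_node[(lvl, t // size)] = true_sum + np.random.laplace(0.0, scale)
        # the prefix [1, t] decomposes into at most `levels` dyadic intervals
        estimate, remaining = 0.0, t
        while remaining > 0:
            lvl = (remaining & -remaining).bit_length() - 1   # largest block ending at `remaining`
            estimate += noisy_node[(lvl, remaining >> lvl)]
            remaining -= 1 << lvl
        outputs.append(estimate)
    return outputs
```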
    Abstract: Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
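For intuition, here is a hedged sketch of how re-identification risk can be estimated empirically with a generic nearest-neighbor matching attack. The function, its inputs, and the attack itself are illustrative assumptions, not the hypothesis-testing framework or the attack algorithms from the paper.

```python
import numpy as np

def reidentification_rate(reference_embs, released_embs, trials=1000, pool_size=100, seed=0):
    """Empirical re-identification risk via a nearest-neighbor matching attack.

    reference_embs, released_embs: (n_users, d) arrays holding two independently
    generated representations of the same users (e.g., from disjoint time periods).
    The attacker sees one released embedding plus a candidate pool of reference
    embeddings (pool_size <= n_users assumed) and guesses the closest candidate."""
    rng = np.random.default_rng(seed)
    n = len(reference_embs)
    hits = 0
    for _ in range(trials):
        target = rng.integers(n)
        others = rng.choice(np.delete(np.arange(n), target), size=pool_size - 1, replace=False)
        pool = np.concatenate(([target], others))
        dists = np.linalg.norm(reference_embs[pool] - released_embs[target], axis=1)
        if pool[np.argmin(dists)] == target:
            hits += 1
    # a rate close to 1/pool_size means little risk beyond random guessing
    return hits / trials
```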
    Private estimation algorithms for stochastic block models and mixture models
    Hongjie Chen
    Tommaso D'Orsi
    Jacob Imola
    David Steurer
    Stefan Tiegel
    55th Annual ACM Symposium on Theory of Computing (STOC'23) (2023)
    Abstract: We introduce general tools for designing efficient private estimation algorithms in high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient (ϵ,δ)-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an (ϵ,δ)-differentially private algorithm that recovers the centers of the k-mixture when the minimum separation is at least O(k^{1/t}/√t). For all choices of t, this algorithm requires sample complexity n≥k^O(1)d^O(t) and time complexity (nd)^O(t). Prior work required minimum separation at least O(√k) as well as an explicit upper bound on the Euclidean norm of the centers.
    Abstract: We study the differentially private (DP) $k$-means and $k$-median clustering problems of $n$ points in $d$-dimensional Euclidean space in the massively parallel computation (MPC) model. We provide two near-optimal algorithms where the near-optimality is in three aspects: they both achieve (1) $O(1)$ parallel computation rounds, (2) near-linear in $n$ and polynomial in $k$ total computational work (i.e., near-linear running time in the sequential setting), and (3) $O(1)$ relative approximation and $\text{poly}(k, d)$ additive error, where $\Omega(1)$ relative approximation is provably necessary even for any polynomial-time non-private algorithm, and $\Omega(k)$ additive error is a provable lower bound for any polynomial-time DP $k$-means/median algorithm. Our two algorithms provide a tradeoff between the relative approximation and the additive error: the first has $O(1)$ relative approximation and $\sim (k^{2.5} + k^{1.01} \sqrt{d})$ additive error, and the second achieves $(1+\gamma)$ relative approximation to the optimal non-private algorithm for an arbitrarily small constant $\gamma>0$, with $\text{poly}(k, d)$ additive error for a larger polynomial dependence on $k$ and $d$. To achieve our results, we develop a general framework which partitions the data and reduces the DP clustering problem for the entire dataset to the DP clustering problem for each part. To control the blow-up of the additive error introduced by each part, we develop a novel charging argument which might be of independent interest.
    Abstract: We study the private $k$-median and $k$-means clustering problems in $d$-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear time, $\tilde{O}(nkd)$, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
    Abstract: Personalized PageRank (PPR) is a fundamental tool in unsupervised learning of graph representations such as node ranking, labeling, and graph embedding. However, while data privacy is one of the most important recent concerns, existing PPR algorithms are not designed to protect user privacy. PPR is highly sensitive to the input graph edges: the difference of only one edge may cause a large change in the PPR vector, potentially leaking private user data. In this work, we propose an algorithm which outputs an approximate PPR and has provably bounded sensitivity to input edges. In addition, we prove that our algorithm achieves similar accuracy to non-private algorithms when the input graph has large degrees. Our sensitivity-bounded PPR directly implies private algorithms for several graph learning tools, such as differentially private (DP) PPR ranking, DP node classification, and DP node embedding. To complement our theoretical analysis, we also empirically verify the practical performance of our algorithms.
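For context, the sketch below shows the standard (non-private) power-iteration PPR primitive together with a generic output-perturbation baseline under an assumed sensitivity bound. It is not the sensitivity-bounded algorithm of the paper; the `sensitivity` parameter is a hypothetical input that would have to be justified separately.

```python
import numpy as np

def personalized_pagerank(adj, seed, alpha=0.15, iters=50):
    """Standard power-iteration PPR on a row-normalized adjacency matrix.
    adj: (n, n) numpy array, seed: index of the source node.
    This is the non-private primitive the abstract refers to; the paper's
    sensitivity-bounding modifications are not reproduced here."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float), where=deg > 0)
    s = np.zeros(n)
    s[seed] = 1.0
    pi = s.copy()
    for _ in range(iters):
        pi = alpha * s + (1 - alpha) * (pi @ P)   # restart with prob. alpha
    return pi

def output_perturbed_ppr(adj, seed, epsilon, sensitivity):
    """Generic output-perturbation baseline: add Laplace noise scaled to an
    assumed L1 edge-sensitivity bound. Only meaningful if `sensitivity` truly
    bounds how much one edge can change the PPR vector."""
    pi = personalized_pagerank(adj, seed)
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=pi.shape)
    return np.clip(pi + noise, 0.0, None)
```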
    Abstract: As machine learning has become more and more integrated into our businesses and lifestyles, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering. In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a certain objective (Dasgupta's objective). For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, surprisingly, with only a negligible loss in the objective.
    Abstract: The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we give simple and practical algorithms that come with stronger performance guarantees than previously known results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with cost only slightly worse than those returned by algorithms that have access to the full dataset.
    Fair Correlation Clustering
    23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020) (2020) (to appear)
    Abstract: In this paper, we study correlation clustering under fairness constraints. Fair variants of k-median and k-center clustering have been studied recently, and approximation algorithms using a notion called fairlet decomposition have been proposed. We obtain approximation algorithms for fair correlation clustering under several important types of fairness constraints. Our results hinge on obtaining a fairlet decomposition for correlation clustering by introducing a novel combinatorial optimization problem. We define a fairlet decomposition with cost similar to the k-median cost and this allows us to obtain approximation algorithms for a wide range of fairness constraints. We complement our theoretical results with an in-depth analysis of our algorithms on real graphs where we show that fair solutions to correlation clustering can be obtained with limited increase in cost compared to the state-of-the-art (unfair) algorithms.
    Abstract: Hierarchical clustering is an unsupervised data analysis method which has been widely used for decades. Despite its popularity, it long had an underdeveloped analytical foundation; to address this, Dasgupta recently introduced an optimization viewpoint of hierarchical clustering with pairwise similarity information that spurred a line of work shedding light on old algorithms (e.g., Average-Linkage) as well as designing new ones. Here, for the maximization dual of Dasgupta's objective (introduced by Moseley-Wang), we present polynomial-time 42.46% approximation algorithms that use Max-Uncut Bisection as a subroutine. The previous best worst-case approximation factor in polynomial time was 33.6%, improving only slightly over Average-Linkage, which achieves 33.3%. Finally, we complement our positive results by providing APX-hardness (even for 0-1 similarities) under the Small Set Expansion hypothesis.
    Abstract: Motivated by the increasing need to preserve privacy in digital devices, we introduce the on-device public-private model of computation. Our motivation comes from social-network-based recommender systems in which the users want to receive recommendations based on the information available on their devices, as well as the suggestions of their social contacts, without sharing such information or contacts with the central recommendation system. Our model allows us to solve many algorithmic problems while providing absolute (deterministic) guarantees of the privacy of on-device data and the user's contacts. In fact, we ensure that the private data and private contacts are never revealed to the central system. Our restrictive model of computation presents several interesting algorithmic challenges because any computation based on private information and contacts must be performed on local devices of limited capabilities. Despite these challenges, under realistic assumptions of inter-device communication, we show several efficient algorithms for fundamental data mining and machine learning problems, ranging from k-means clustering to heavy hitters. We complement this analysis with strong impossibility results for efficient private algorithms without allowing inter-device communication. In our experimental evaluation, we show that our private algorithms provide results almost as accurate as those of the non-private ones while speeding up the on-device computations by orders of magnitude.
    Abstract: Clustering is a fundamental problem in unsupervised machine learning. In many applications, clustering needs to be performed in the presence of additional constraints, such as fairness or diversity constraints. In this paper, we formulate the problem of k-center clustering without over-representation, and propose approximation algorithms to solve the problem, as well as hardness results. We empirically evaluate our clusterings on real-world datasets and show that fairness can be obtained with limited effect on clustering quality.
    Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives
    Michele Borassi
    Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019)
    Abstract: The streaming computation model is a standard model for large-scale data analysis: the input arrives one element at a time, and the goal is to maintain an approximately optimal solution using only a constant, or, at worst, poly-logarithmic amount of space. In practice, however, recency plays a large role, and one often wishes to consider only the last w elements that have arrived, the so-called sliding window problem. A trivial approach is to simply store the last w elements in a buffer; our goal is to develop algorithms with space and update time sublinear in w. In this regime, there are two frameworks: exponential histograms and smooth histograms, which can be used to obtain sliding window algorithms for families of functions satisfying certain properties. Unfortunately, these frameworks have limitations and cannot always be applied directly. A prominent example is the case of a submodular function with cardinality constraints. Some of these difficulties can be rectified, but often only on a case-by-case basis. Here, we describe an alternative approach to designing efficient sliding window algorithms for maximization problems. We then instantiate this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization. In doing so, we improve state-of-the-art results obtained using problem-specific algorithms.
    Abstract: Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the node participates. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph -- a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to $90\%$. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.
    Abstract: Maximizing diversity is a central problem in information retrieval and data mining systems, with prominent applications in web search, recommender systems, news aggregators and product search. In this paper, we study a diversity maximization problem (a.k.a. the maximum dispersion problem) in which, given a set of n objects in a metric space, one wants to find a subset of k objects with the maximum sum of pairwise distances. To solve this problem in a scalable distributed manner, we apply a novel distributed framework for tackling large-scale problems known as randomized composable core-sets: divide the big data set into smaller parts, solve the problem for each part, combine the solutions from each part, and solve the problem on the union of these solutions. Our algorithms improve significantly over the approximation guarantees of state-of-the-art core-set-based algorithms while using the minimum possible intermediate output size. In particular, we present a simple distributed algorithm that achieves an almost optimal communication complexity and, moreover, asymptotically achieves an approximation factor of 1/2, which is the best possible approximation factor for the global optimization problem under certain complexity-theoretic assumptions. Our algorithms are scalable and practical, as shown by our extensive empirical evaluation with large datasets, and they can be easily adapted to major distributed computing systems like MapReduce. Furthermore, we show empirically that, in real-life instances, our algorithms reach close-to-optimal solutions with an approximation factor of > 90%. This approximation factor far exceeds the approximation barrier for the problem and provides useful output.
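The divide-solve-combine recipe described in the abstract can be sketched in a few lines. The per-part solver below is a simple greedy heuristic used only as a stand-in; it is not the exact subroutine analyzed in the paper, and the partition count is an arbitrary assumption.

```python
import random

def greedy_dispersion(points, k, dist):
    """Simple greedy heuristic for max-sum dispersion: start from an arbitrary
    point, then repeatedly add the point with the largest total distance to the
    points chosen so far."""
    pts = list(points)
    if not pts or k <= 0:
        return []
    chosen = [pts.pop(0)]
    while pts and len(chosen) < k:
        best = max(pts, key=lambda p: sum(dist(p, c) for c in chosen))
        pts.remove(best)
        chosen.append(best)
    return chosen

def coreset_dispersion(points, k, dist, num_parts=8, seed=0):
    """Randomized composable core-set framework sketched in the abstract:
    randomly partition the data, solve on each part, then solve once more on
    the union of the per-part solutions."""
    rng = random.Random(seed)
    parts = [[] for _ in range(num_parts)]
    for p in points:
        parts[rng.randrange(num_parts)].append(p)          # random partition
    union = [q for part in parts for q in greedy_dispersion(part, k, dist)]
    return greedy_dispersion(union, k, dist)                # final pass on the union
```

In a distributed deployment, each part would be processed on a separate machine and only the per-part solutions (size O(k) each) would be shipped to the aggregator, which is what keeps the intermediate output small.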
    Abstract: Covering the edges of a bipartite graph by a minimum set of complete bipartite graphs (bicliques) is a basic graph-theoretic problem with numerous applications. In particular, it is used to characterize parsimonious models of a set of observations (each biclique corresponds to a factor or feature that relates the observations in the two sets of nodes connected by the biclique). The decision version of the minimum biclique cover problem is NP-complete, and unless $P=NP$, the cover size cannot be approximated in general within less than a sub-linear factor of the number of nodes (or edges) in the graph. In this work, we consider two natural restrictions to the problem, motivated by practical applications. In the first case, we restrict the number of bicliques a node can belong to. We show that when this number is at least $5$, the problem is still NP-hard. In contrast, we show that when nodes belong to no more than 2 bicliques, the problem has efficient approximations. The second model we consider corresponds to observing a set of independent samples from an unknown model, governed by a possibly large number of factors. The model is defined by a bipartite graph $G=(L,R,E)$, where each node in $L$ is assigned to an arbitrary subset of up to a constant $f$ factors, while the nodes in $R$ (the independent observations) are assigned to random subsets of the set of $k$ factors, where $k$ can grow with the size of the graph. We show that this practical version of the biclique cover problem is amenable to efficient approximations.
    Incentive-Aware Learning for Large Markets
    Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 1369-1378
    Abstract: In a typical learning problem, one key step is to use training data to pick one model from a collection of models that optimizes an objective function. In many multi-agent settings, the training data is generated through the actions of the agents, and the model is used to make a decision (e.g., how to sell an item) that affects the agents. An illustrative example of this is the problem of learning the reserve price in an auction. In such cases, the agents have an incentive to influence the training data (e.g., by manipulating their bids in the case of an auction) to game the system and achieve a more favorable outcome. In this paper, we study this incentive-aware learning problem in a general setting and show that it is possible to approximately optimize the objective function under two assumptions: (i) each individual agent is a "small" part of the market; and (ii) there is a cost associated with manipulation. For our illustrative application, this nicely translates to a mechanism for setting approximately optimal reserve prices in auctions where no individual agent has significant market share. For this application, we also show that the second assumption (that manipulations are costly) is not necessary, since we can "perturb" any auction to make it costly for the agents to manipulate.
    Abstract: Maximizing submodular functions under cardinality constraints lies at the core of numerous data mining and machine learning applications, including data diversification, data summarization, and coverage problems. In this work, we study this question in the context of data streams, where elements arrive one at a time, and we want to design low-memory and fast update-time algorithms that maintain a good solution. Specifically, we focus on the sliding window model, where we are asked to maintain a solution that considers only the last $W$ items. In this context, we provide the first non-trivial algorithm that maintains a provable approximation of the optimum using space sublinear in the size of the window. In particular, we give a $(1/3 - \epsilon)$-approximation algorithm that uses space polylogarithmic in the spread of the values of the elements and linear in the solution size $k$, for any constant $\epsilon > 0$. At the same time, processing each element only requires a polylogarithmic number of evaluations of the function itself. When a better approximation is desired, we show a different algorithm that, at the cost of using more memory, provides a $(1/2 - \epsilon)$-approximation and allows a tunable trade-off between average update time and space. This algorithm matches the best known approximation guarantees for submodular optimization in insertion-only streams, a less general formulation of the problem. We demonstrate the efficacy of the algorithms on a number of real world datasets, showing that their practical performance far exceeds the theoretical bounds. The algorithms preserve high quality solutions in streams with millions of items, while storing a negligible fraction of them.
    Bicriteria Distributed Submodular Maximization in a Few Rounds
    Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (2017)
    Abstract: We study the problem of efficiently optimizing submodular functions under cardinality constraints in the distributed setting. Recently, several distributed algorithms for this problem have been introduced which either achieve a sub-optimal solution or run in a super-constant number of rounds of computation. Unlike previous work, we aim to design distributed algorithms in multiple rounds with almost optimal approximation guarantees at the cost of outputting a larger number of elements. Toward this goal, we present a distributed algorithm that, for any ε > 0 and any constant r, outputs a set S of O(rk/ε^(1/r)) items in r rounds, and achieves a (1-ε)-approximation of the value of the optimum set with k items. This is the first distributed algorithm that achieves an approximation factor of (1-ε) running in fewer than log(1/ε) rounds. We also prove a hardness result showing that the output of any (1-ε)-approximation distributed algorithm limited to one distributed round must have at least Ω(k/ε) items. In light of this hardness result, our distributed algorithm in one round, r = 1, is asymptotically tight in terms of the output size. We support the theoretical guarantees with an extensive empirical study of our algorithm showing that achieving almost optimum solutions is indeed possible in a few rounds for large-scale real datasets.
    The Spread of Physical Activity Through Social Networks
    Proceedings of the 26th International World Wide Web Conference (WWW 2017)
    Abstract: We study the evolution of daily physical activity of 44.5K users on the Fitbit social network over a period of eight months. A time-aggregated analysis shows that average alter activity, gender, and body mass index (BMI) are significantly predictive of ego activity when controlling for ego BMI, gender, and number of friends. The direction and effect size of the associations surfaced vary when considering chronic conditions self-reported by a large portion of the users, including diabetes, dyslipidemia, hypertension and depression. When considering the co-evolution of activity and friendship on a month-by-month basis in a within-subject analysis, we show via fixed effects modeling that fluctuations in average alter activity significantly predict fluctuations in ego activity. Finally, we investigate the causal factors that may drive change in physical activity over time. We leverage a class of novel non-parametric statistical tests to rule out homophily as the sole source of dependence in activity, even in the presence of unobserved individual traits.
    Abstract: We propose a new framework called Ego-splitting for detecting clusters in complex networks, which leverages the local structures known as ego-nets (i.e., the subgraph induced by the neighborhood of each node) to de-couple overlapping clusters. Ego-splitting is a highly scalable and flexible framework, with provable theoretical guarantees, that reduces the complex overlapping clustering problem to a simpler and more amenable non-overlapping (partitioning) problem. We can solve community detection in graphs with tens of billions of edges and outperform previous solutions based on ego-net analysis. More precisely, our framework works in two steps: a local ego-net analysis and a global graph partitioning. In the local step, we first partition the nodes' ego-nets using non-overlapping clustering. We then use these clusters to split each node of the graph into its persona nodes, which represent the instantiations of the node in its communities. Then, in the global step, we partition these new persona nodes to obtain an overlapping clustering of the original graph.
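The local/global two-step pipeline translates naturally into code. The sketch below follows the structure described in the abstract but uses connected components as a placeholder for both the local and global clustering subroutines, which are only stand-ins for the algorithms actually evaluated in the paper; it assumes an undirected networkx graph.

```python
import networkx as nx

def ego_splitting(G, local_cluster=None, global_cluster=None):
    """Minimal sketch of the ego-splitting framework for an undirected nx.Graph.
    1. Local step: cluster each node's ego-net (its neighbors only) and create
       one persona per local cluster.
    2. Global step: partition the persona graph with any non-overlapping method;
       mapping personas back to their original node yields overlapping clusters."""
    local_cluster = local_cluster or (lambda H: list(nx.connected_components(H)))
    global_cluster = global_cluster or (lambda H: list(nx.connected_components(H)))
    persona_owner = {}           # persona id -> original node
    membership = {}              # node -> {neighbor -> persona id of that node}
    next_id = 0
    for u in G.nodes():
        ego = G.subgraph(G.neighbors(u))          # ego-net without the ego itself
        membership[u] = {}
        for cluster in local_cluster(ego) or [set()]:
            pid, next_id = next_id, next_id + 1
            persona_owner[pid] = u
            for v in cluster:
                membership[u][v] = pid
    P = nx.Graph()
    P.add_nodes_from(range(next_id))
    for u, v in G.edges():
        P.add_edge(membership[u][v], membership[v][u])   # connect matching personas
    return [{persona_owner[p] for p in part} for part in global_cluster(P)]
```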
    Abstract: We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches, which use hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they will use. We show a full analysis of the variance of the estimations and novel concentration bounds for these quantities. Our experimental results on very large graphs show that TRIÈST outperforms state-of-the-art approaches in accuracy and exhibits a small update time.
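The reservoir-sampling idea is easy to illustrate for global triangle counting. The sketch below is a simplified, insertion-only version written in the spirit of the approach; the exact reweighting, the local per-vertex counters, and the fully-dynamic insertion/deletion handling follow the paper and are not reproduced here.

```python
import random
from collections import defaultdict

def streaming_triangle_estimate(edge_stream, M, seed=0):
    """One-pass global triangle-count estimate using an edge reservoir of size M
    (M >= 2 assumed). Simplified, insertion-only sketch."""
    rng = random.Random(seed)
    sample = []                      # edges currently in the reservoir
    neighbors = defaultdict(set)     # adjacency of the sampled subgraph
    estimate, t = 0.0, 0
    for u, v in edge_stream:
        t += 1
        # weight each triangle closed by (u, v) by (roughly) the inverse
        # probability that its two other edges are both in the sample now
        weight = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        estimate += weight * len(neighbors[u] & neighbors[v])
        # standard reservoir sampling over edges
        if len(sample) < M:
            keep = True
        elif rng.random() < M / t:
            old_u, old_v = sample.pop(rng.randrange(M))
            neighbors[old_u].discard(old_v)
            neighbors[old_v].discard(old_u)
            keep = True
        else:
            keep = False
        if keep:
            sample.append((u, v))
            neighbors[u].add(v)
            neighbors[v].add(u)
    return estimate
```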
    Abstract: In this paper, we present a study of the community structure of ego-networks—the graphs representing the connections among the neighbors of a node—for several online social networks. Toward this goal, we design a new technique to efficiently build and cluster all the ego-nets of a graph in parallel (note that even just building the ego-nets efficiently is challenging on large networks). Our experimental findings are quite compelling: at a microscopic level it is easy to detect high quality communities. Leveraging this fact, we then develop new features for friend suggestion based on co-occurrences of two nodes in different ego-nets' communities. Our new features can be computed efficiently on very large scale graphs by just analyzing the neighborhood of each node. Furthermore, we prove formally on a stylized model, and show by experimental analysis, that this new similarity measure outperforms the classic local features employed for friend suggestion.
    Abstract: We introduce the public-private model of graphs. In this model, we have a public graph and each node in the public graph has an associated private graph. The motivation for studying this model stems from social networks, where the nodes are the users, the public graph is visible to everyone, and the private graph at each node is visible only to the user at the node. From each node's viewpoint, the graph is just the union of its private graph and the public graph. We consider the problem of efficiently computing various properties of the graphs from each node's point of view, with a minimal amount of recomputation on the public graph. To illustrate the richness of our model, we explore two powerful computational paradigms for studying large graphs, namely sketching and sampling, focus on some key problems in social networks, and show efficient algorithms in the public-private graph model. In the sketching model, we show how to efficiently approximate the neighborhood function, which in turn can be used to approximate various notions of centrality. In the sampling model, we focus on all-pair shortest path distances, node similarities, and correlation clustering.
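The reason sketching fits the public-private model is that mergeable sketches can be precomputed once on the public graph and combined on the fly with each user's private edges. The sketch below uses a generic bottom-k distinct-count sketch as an illustration; it is not the paper's neighborhood-function estimator, and the parameter k = 64 is an arbitrary assumption.

```python
import hashlib

def _hash(x):
    # map any hashable item to a pseudo-uniform float in (0, 1]
    h = hashlib.sha256(repr(x).encode()).hexdigest()
    return (int(h[:15], 16) + 1) / 16**15

def bottom_k_sketch(items, k=64):
    """Bottom-k sketch of a set: the k smallest hash values. Two sketches can be
    merged, which is what makes the public-private model cheap: public sketches
    are precomputed once and merged on the fly with each user's private edges."""
    return sorted({_hash(x) for x in items})[:k]

def merge(sk_a, sk_b, k=64):
    return sorted(set(sk_a) | set(sk_b))[:k]

def estimate_size(sk, k=64):
    if len(sk) < k:
        return len(sk)             # the set was small enough to store exactly
    return (k - 1) / sk[k - 1]     # standard bottom-k distinct-count estimate

# usage sketch (hypothetical variables):
#   public_sketch = bottom_k_sketch(public_neighbors_of_u)
#   private_view  = merge(public_sketch, bottom_k_sketch(private_neighbors_of_u))
#   approx_degree = estimate_size(private_view)
```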
    Abstract: Densest subgraph computation has emerged as an important primitive in a wide range of data analysis tasks such as community and event detection. Social media such as Facebook and Twitter are highly dynamic, with new friendship links and tweets being generated incessantly, calling for efficient algorithms that can handle very large and highly dynamic input data. While either scalable or dynamic algorithms for finding densest subgraphs have been proposed, a viable and satisfactory solution for addressing both the dynamic aspect of the input data and its large size is still missing. We study the densest subgraph problem in the dynamic graph model, for which we present the first scalable algorithm with provable guarantees. In our model, edges are added adversarially while they are removed uniformly at random from the current graph. We show that at any point in time we are able to maintain a 2(1+ε)-approximation of the current densest subgraph, while requiring O(polylog(n+r)) amortized cost per update (with high probability), where r is the total number of update operations executed and n is the maximum number of nodes in the graph. In contrast, a naive algorithm that recomputes a dense subgraph every time the graph changes requires Ω(m) work per update, where m is the number of edges in the current graph. Our theoretical analysis is complemented with an extensive experimental evaluation on large real-world graphs showing that (approximate) densest subgraphs can be maintained efficiently within hundreds of microseconds per update.
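For reference, the classical static baseline that the dynamic algorithm is compared against is greedy peeling (a 2-approximation for maximum average degree density). The sketch below implements only that static routine; the dynamic maintenance machinery described in the abstract is not reproduced here.

```python
import heapq

def densest_subgraph_peeling(adj):
    """Greedy peeling: repeatedly delete a minimum-degree node and return the
    intermediate subgraph with the highest edge density |E|/|V| (a static
    2-approximation). adj: dict node -> set of neighbors, undirected, no
    self-loops; nodes must be hashable and mutually comparable."""
    adj = {u: set(vs) for u, vs in adj.items()}
    n_total = len(adj)
    m = sum(len(vs) for vs in adj.values()) // 2
    n = n_total
    heap = [(len(vs), u) for u, vs in adj.items()]
    heapq.heapify(heap)
    best_density = m / n if n else 0.0
    best_size = n
    removed, order = set(), []
    while heap:
        d, u = heapq.heappop(heap)
        if u in removed or d != len(adj[u]):
            continue                          # stale heap entry
        removed.add(u)
        order.append(u)
        for v in adj[u]:
            adj[v].discard(u)
            heapq.heappush(heap, (len(adj[v]), v))
        m -= d
        n -= 1
        if n and m / n > best_density:
            best_density, best_size = m / n, n
    # the densest prefix is the set of nodes not yet removed at the best step
    dense_nodes = set(adj) - set(order[: n_total - best_size])
    return dense_nodes, best_density
```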
    Communities, Random Walks, and Social Sybil Defense.
    Lorenzo Alvisi
    Allen Clement
    Alessandro Panconesi
    Internet Mathematics (2014)
    Abstract: We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system, or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g., the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including the number of common neighbors and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
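To make the query model concrete, here is a hedged sketch of the simplest of the similarity measures mentioned (number of common neighbors), restricted on the fly to a category subset. The data layout and function are illustrative assumptions; the paper's optimizations for lopsided graphs and its Personalized PageRank variant are not shown.

```python
from collections import defaultdict

def similar_actors(actor_items, item_actors, item_category, query_actor, categories, top_k=10):
    """Rank actors by the number of in-category items shared with `query_actor`.
    actor_items: actor -> set of items; item_actors: item -> set of actors
    (the reverse index); item_category: item -> category label."""
    categories = set(categories)
    scores = defaultdict(int)
    for item in actor_items[query_actor]:
        if item_category[item] not in categories:
            continue                       # ignore items outside the queried categories
        for actor in item_actors[item]:
            if actor != query_actor:
                scores[actor] += 1         # one shared in-category item
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```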
    SoK: The Evolution of Sybil Defense via Social Networks
    Lorenzo Alvisi
    Allen Clement
    Alessandro Panconesi
    2013 IEEE Symposium on Security and Privacy, SP 2013
    Abstract: Sybil attacks, in which an adversary forges a potentially unbounded number of identities, are a danger to distributed systems and online social networks. The goal of sybil defense is to accurately identify sybil identities. This paper surveys the evolution of sybil defense protocols that leverage the structural properties of the social graph underlying a distributed system to identify sybil identities. We make two main contributions. First, we clarify the deep connection between sybil defense and the theory of random walks. This leads us to identify a community detection algorithm that, for the first time, offers provable guarantees in the context of sybil defense. Second, we advocate a new goal for sybil defense that addresses the more limited, but practically useful, goal of securely white-listing a local region of the graph.
    Reconstructing Hidden Permutations Using the Average-Precision (AP) Correlation Statistic
    Lorenzo De Stefani
    Eli Upfal
    Fabio Vandin
    In Proceedings of the 30th AAAI Conference on Artificial Intelligence 2016
    Spreading Rumours without the Network
    Pawel Brach
    Alessandro Panconesi
    Piotr Sankowski
    In Proceedings of the 2nd ACM Conference on Online Social Networks (COSN 2014)