## Abstract

For over a decade now we have been witnessing the success of massive parallel computation (MPC) frameworks, such as MapReduce, Hadoop, Dryad, or Spark. One of the reasons for their success is the fact that these frameworks are able to accurately capture the nature of large-scale computation. In particular, compared to the classic distributed algorithms or PRAM models, these frameworks allow for much more local computation. The fundamental question that arises in this context is though: can we leverage this additional power to obtain even faster parallel algorithms?

A prominent example here is the maximum matching problem---one of the most classic graph problems. It is well known that in the PRAM model one can compute a 2-approximate maximum matching in O(log n) rounds. However, the exact complexity of this problem in the MPC framework is still far from understood. Lattanzi et al. (SPAA 2011) showed that if each machine has n^(1+Omega(1)) memory, this problem can also be solved 2-approximately in a constant number of rounds. These techniques, as well as the approaches developed in the follow up work, seem though to get stuck in a fundamental way at roughly O(log n) rounds once we enter the (at most) near-linear memory regime. It is thus entirely possible that in this regime, which captures in particular the case of sparse graph computations, the best MPC round complexity matches what one can already get in the PRAM model, without the need to take advantage of the extra local computation power.

In this paper, we finally refute that perplexing possibility. That is, we break the above O(log n) round complexity bound even in the case of slightly sublinear memory per machine. In fact, our improvement here is {\em almost exponential}: we are able to deliver a (2+epsilon)-approximation to maximum matching, for any fixed constant epsilon>0, in O((log log n)^2) rounds.

To establish our result we need to deviate from the previous work in two important ways that are crucial for exploiting the power of the MPC model, as compared to the PRAM model. Firstly, we use vertex--based graph partitioning, instead of the edge--based approaches that were utilized so far. Secondly, we develop a technique of round compression. This technique enables one to take a (distributed) algorithm that computes an O(1)-approximation of maximum matching in O(log n) independent PRAM phases and implement a super-constant number of these phases in only a constant number of MPC rounds.