An effective technique for solving optimization problems over massive data sets is to partition the data into smaller pieces, solve the problem on each piece and compute a representative solution from it, and finally obtain a solution inside the union of the representative solutions for all pieces. This technique can be captured via the concept of composable core-sets, and has been recently applied to solve diversity maximization problems as well as several clustering problems [7,15,8]. However, for coverage and submodular maximization problems, impossibility bounds are known for this technique . In this paper, we focus on efficient construction of a randomized variant of composable core-sets where the above idea is applied on a random clustering of the data. We employ this technique for the coverage, monotone and non-monotone submodular maximization problems. Our results significantly improve upon the hardness results for non-randomized core-sets, and imply improved results for submodular maximization in a distributed and streaming settings. The effectiveness of this technique has been confirmed empirically for several machine learning applications , and our proof provides a theoretical foundation to this idea.
In summary, we show that a simple greedy algorithm results in a 1/3-approximate randomized composable core-set for submodular maximization under a cardinality constraint. Our result also extends to non-monotone submodular functions, and leads to the first 2-round MapReduce-based constant-factor approximation algorithm with O(n) total communication complexity for either monotone or non-monotone functions. Finally, using an improved analysis technique and a new algorithm PseudoGreedy, we present an improved 0.545-approximation algorithm for monotone submodular maximization, which is in turn the first MapReduce-based algorithm beating factor 1/2 in a constant number of rounds.