Robust and Probabilistic Failure-Aware Placements


Motivated by the growing complexity of heterogeneity and hierarchical organization in data centers, in this paper we address the problem of placing copies of tasks on machines with the goal of increasing availability in the presence of failures. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (higher weight implying higher reliability) and the adversary can remove any subset of nodes of total weight at most a given bound W and our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure and we want to find a placement that maximizes the probability that at least K out of N tasks are available at any time.

The problems in the first category covering adversarial failures lie in $\Sigma_2$, the second level of the polynomial hierarchy. We show that a basic variant, which we call RobustFAP, is co-NP-hard, and an all-or-nothing version of RobustFAP is $\Sigma_2$-complete. Our first algorithmic result here is is a PTAS for RobustFAP, a key ingredient of which is a solution to a fractional variant of RobustFAP. Our second algorithmic result is for fractional HierRobustFAP over hierarchies: a new Generalized Spreading algorithm, which is simultaneously optimal for all W. This algorithm generalizes the classical notion of {\em max-min fairness} to work with nodes of differing capacities, differing reliability weights and hierarchical structures. We then apply randomized rounding to obtain an algorithm for integral RobustFAP over hierarchies.

For the probabilistic version ProbFAP, we first focus on the single level problem and give an algorithm that achieves an additive \eps approximation in the failure probability while giving up a (1 + \eps) multiplicative factor in the number of failures. We then extend the result to the hierarchical version achieving a \eps additive approximation in failure probability while giving up a (L + \eps) multiplicative factor in the number of failures, where L is the number of levels in the hierarchy.