Multi-Objective Weighted Sampling


Key value data sets of the form ${(x,w_x)}$ where $w_x >0$ are prevalent. Common queries over such data are {\em segment $f$-statistics} $Q(f,H) = \sum_{x\in H}f(w_x)$, specified for a segment $H$ of the keys and a function $f$. Different choices of $f$ correspond to count, sum, moments, cap, and threshold statistics.

When the data set is large, we can compute a smaller sample from which we can quickly estimate statistics. A weighted sample of keys taken with respect to $f(w_x)$ provides estimates with statistically guaranteed quality for $f$-statistics. Such a sample $S^{(f)}$ can be used to estimate $g$-statistics for $g\not=f$, but quality degrades with the disparity between $g$ and $f$. In this paper we address applications that require quality estimates for a set $F$ of different functions. A naive solution is to compute and work with a different sample $S^{(f)}$ for each $f\in F$. Instead, this can be achieved more effectively and seamlessly using a single {\em multi-objective} sample $S^{(F)}$ of a much smaller size.

We review multi-objective sampling schemes and place them in our context of estimating $f$-statistics. We show that a multi-objective sample for $F$ provides quality estimates for any $f$ that is a positive linear combination of functions from $F$. We then establish a surprising and powerful result when the target set $M$ is {\em all} monotone non-decreasing functions, noting that $M$ includes most natural statistics. We provide efficient multi-objective sampling algorithms for $M$ and show that a sample size of $k \ln n$ (where $n$ is the number of active keys) provides the same estimation quality, for any $f\in M$, as a dedicated weighted sample of size $k$ for $f$.