In-memory column-stores are becoming more and more widespread, making interactive data analysis feasible for ever growing amounts of data. However, data volume in many big data scenarios still exceeds the available memory and thus requires smart algorithms if interactive response times are desired. PowerDrill  is such a system which is used Google-internally for data discovery in log data. In this paper we present results from the “fast approximations” initiative, where we explore possibilities for optimizing performance at the expense of accuracy.
In one of the optimizations we use some of the theory behind Data-Sketches to considerably improve memory efficiency of the data encoding scheme. For particularly expensive fields in our largest datasets this saves a factor of 4.5 when compared to a standard production grade compression algorithm.
We additionally evaluate the effects of using sampling on accuracy and performance. We propose a simple heuristic for estimating the accuracy of individual result-values. We show that annotating result-values as accurate (or not) is essential for the usefulness of intermediate results over sub-samples. Only with these annotations our users can start interpreting intermediate results before final results are available. For a large set of queries this effectively brings down the 95th latency percentile from 30 to 4 seconds.