Estimating Uncertainty for Massive Data Streams


We address the problem of estimating the variability of an estimator computed from a massive data stream. While nearly-linear statistics can be computed exactly or approximately from “Google- scale” data, second-order analysis is a challenge. Unfortunately, massive sample sizes do not obviate the need for uncertainty calculations: modern data often have heavy tails, large coefficients of variation, tiny effect sizes, and generally exhibit bad behaviour. We describe in detail this New Frontier in statistics, outline the computing infrastructure required, and motivate the need for modification of existing methods. We introduce two procedures for basic uncertainty estimation, one derived from the bootstrap and the other from a form of subsampling. Their costs and theoretical properties are briefly discussed, and their use is demonstrated using Google data.