Online search systems that display ads continually offer new features that advertisers can use to fine-tune and enhance their ad campaigns. An important question is whether a new feature actually helps advertisers. In an ideal world for statisticians, we would answer this question by running a statistically designed experiment. But that would require randomly assigning a set of advertisers to the treatment group and forcing them to use the feature, which is not realistic. Accordingly, in the real world, new features for advertisers are seldom evaluated with a traditional experimental protocol. Instead, customer service representatives (CSRs) select advertisers who are invited to be among the first to test a new feature (i.e., white-listed), and then each white-listed advertiser chooses whether or not to use the new feature. Neither the CSR nor the advertiser chooses at random.
This paper addresses the problem of drawing valid inferences from white-list trials about the effects of new features on advertiser happiness. We are guided by three principles. First, statistical procedures for white-list trials are likely to be applied in an automated way, so they should be robust to violations of modeling assumptions. Second, standard analysis tools should be preferred over custom-built ones, both for clarity and for robustness: standard tools have withstood the test of time and have been thoroughly debugged. Finally, it should be easy to compute reliable confidence intervals for the estimator. We review an estimator that has all these attributes, allowing us to make valid inferences about the effects of a new feature on advertiser happiness. In the example analyzed in this paper, the new feature was introduced during the holiday shopping season, which further complicates the analysis.
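To make the setting concrete, the following is a minimal, hypothetical sketch of one standard approach to such a comparison: a difference-in-differences estimate (the change in a happiness metric for advertisers who adopted the feature, minus the change for those who did not, over the same calendar period) with a percentile-bootstrap confidence interval. The data, metric, and choice of estimator here are illustrative assumptions for exposition; they are not the specific estimator the paper reviews.

```python
import random

def did_estimate(adopters_pre, adopters_post, others_pre, others_post):
    """Difference-in-differences: the change in the mean metric for
    feature adopters minus the change for non-adopters, which nets out
    shared calendar effects (e.g., the holiday shopping season)."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(adopters_post) - mean(adopters_pre)) - (
            mean(others_post) - mean(others_pre))

def bootstrap_ci(adopters, others, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the DiD estimate, resampling whole
    advertisers (paired pre/post observations) within each group so
    that within-advertiser correlation is preserved."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        a = [rng.choice(adopters) for _ in adopters]
        o = [rng.choice(others) for _ in others]
        stats.append(did_estimate([pre for pre, _ in a], [post for _, post in a],
                                  [pre for pre, _ in o], [post for _, post in o]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy illustration with made-up (pre, post) metric pairs per advertiser.
adopters = [(1.0, 3.0), (2.0, 4.5), (1.5, 3.2)]
others = [(1.0, 2.0), (2.0, 2.8), (1.2, 2.1)]
est = did_estimate([p for p, _ in adopters], [q for _, q in adopters],
                   [p for p, _ in others], [q for _, q in others])
lo, hi = bootstrap_ci(adopters, others)
```

Note that differencing alone does not remove bias from non-random selection into the treatment group; that is precisely the difficulty the paper's principles are meant to address.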