## Abstract

The study of skylines and their variants has received considerable attention in recent years. Skylines are essentially sets of most interesting (undominated) tuples in a database. However, since the skyline is often very large, much research effort has been devoted to identifying a smaller subset of (say k) “representative skyline” points. Several different definitions of representative skylines have been considered. Most of these formulations are intuitive in that they try to achieve some kind of clustering “spread” over the entire skyline, with k points. In this work, we take a more principled approach in defining the representative skyline objective. One of our main contributions is to formulate the problem of displaying k representative skyline points such that the probability that a random user would click on one of them is maximized.

Two major research questions arise naturally from this formulation. First, how does one mathematically model the likelihood with which a user is interested in and will “click” on a certain tuple? Second, how does one negotiate the absence of the knowledge of an explicit set of target users; in particular what do we mean by “a random user”? To answer the first question, we model users based on a novel formulation of threshold preferences which we will motivate further in the paper. To answer the second question, we assume a probability distribution of users instead of a fixed set of users. While this makes the problem harder, it lends more mathematical structures that can be exploited as well, as one can now work with probabilities of thresholds and handle cumulative density functions.

On the theoretical front, our objective is NP-hard. For the case of a finite set of users with known thresholds, we present a simple greedy algorithm that attains an approximation ratio of (1 − 1/e) of the optimal. For the case of user distributions, we show that a careful yet similar greedy algorithm achieves the same approximation ratio. Unfortunately, it turns out that this algorithm is rather involved and computationally expensive. So we present a threshold sampling based algorithm that is more computationally affordable and, for any fixed epsilon > 0, has an approximation ratio of (1 − 1/e − epsilon). We perform experiments on both real and synthetic data to show that our algorithm significantly outperforms previously proposed approaches.