Large Scale Page-Based Book Similarity Clustering


The Google Books corpus now counts over 15M books spanning 7 centuries and countless languages. Traditional cataloguing at that scale is imprecise, and often fails to identify more complex book-to-book relationships, such as ‘same text, different pagination’ or ‘partial overlap’. Our contribution is a two-step technique for clustering books based on content similarity (at both book and page level) and classifying their relationships. We run this on our corpora consisting of more than 15M books (5B pages). We first detect similar books and similar pages within matching books, using hashing techniques and judicious thresholds. We then combine those features to identify the exact relationship between matching books. In this paper, we describe the basic approach to making the problem tractable, as well as the features and classifiers that we used. We enumerate a small number of relationships to qualify the link between scanned real-world books. Finally, we provide precision and recall measurements of the classifier.