AI

A Practical Algorithm for Solving the Incoherence Problem of Topic Models In Industrial Applications

Abstract

Topic models are often applied in industrial settings to discover user profiles from activity logs where documents correspond to users and words to complex objects such as web sites and installed apps. Standard topic models ignore the content-based similarity structure between these objects largely because of the inability of the Dirichlet prior to capture such side information of word-word correlation. Several approaches were proposed to replace the Dirichlet prior with more expressive alternatives. However, this added expressivity comes with a heavy premium: inference becomes intractable and sparsity is lost which renders these alternatives not suitable for industrial scale applications. In this paper we take a radically different approach to incorporating word-word correlation in topic models by applying this side information at the posterior level rather than at the prior level. We show that this choice preserves sparsity and results in a graph-based sampler for LDA whose computational complexity is asymptotically on bar with the state of the art Alias base sampler for LDA. We illustrate the efficacy of our approach over real industrial datasets that span up to billion of users, tens of millions of words and thousands of topics. To the best of our knowledge, our approach provides the first practical and scalable solution to this important problem