AI

Language Modeling in the Era of Abundant Data

Abstract

The talk presents an overview of statistical language modeling as applied to real-word problems: speech recognition, machine translation, spelling correction, soft keyboards to name a few prominent ones. We summarize the most successful estimation techniques, and examine how they fare for applications with abundant data, e.g. voice search. We conclude by highlighting a few open problems: getting an accurate estimate for the entropy of text produced by a very specific source, e.g. query stream); optimally leveraging data that is of different degrees of relevance to a given "domain"; does a bound on the size of a "good" model for a given source exist?