Large-Scale Audio Event Discovery in One Million YouTube Videos


Internet videos provide a virtually boundless source of audio with a conspicuous lack of localized annotations, presenting an ideal setting for unsupervised methods. With this motivation, we perform an unprecedented exploration into the large-scale discovery of recurring audio events in a diverse corpus of one million YouTube videos (45K hours of audio). Our approach is to apply a streaming, nonparametric clustering algorithm to both spectral features and out-of-domain neural audio embeddings, and use a small portion of manually annotated audio events to quantitatively estimate the intrinsic clustering performance. In addition to providing a useful mechanism for unsupervised active learning, we demonstrate the effectiveness of the discovered audio event clusters in two downstream applications. The first is weakly-supervised learning, where we exploit the association of video-level metadata and cluster occurrences to temporally localize audio events. The second is informative activity detection, an unsupervised method for semantic saliency based on the corpus statistics of the discovered event clusters.