DaMN – Discriminative and Mutually Nearest: Exploiting Pairwise Category Proximity for Video Action Recognition


We propose a method for learning discriminative category-level features and demonstrate state-of-the-art results on large-scale action recognition in video. The key observation is that one-vs-rest classifiers, which are ubiquitously employed for this task, face challenges in separating very similar categories (such as running vs. jogging). Our proposed method automatically identifies such pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, using a category-level similarity matrix where each entry corresponds to the one-vs-one SVM margin for pairs of categories. We then exploit the observation that while splitting such "Siamese Twin" categories may be difficult, separating them from the remaining categories in a two-vs-rest framework is not. This enables us to augment one-vs-rest classifiers with a judicious selection of "two-vs-rest" classifier outputs, formed from such discriminative and mutually nearest (DaMN) pairs. By combining one-vs-rest and two-vs-rest features in a principled probabilistic manner, we achieve state-of-the-art results on the UCF101 and HMDB51 datasets. More importantly, the same DaMN features, when treated as a mid-level representation also outperform existing methods in knowledge transfer experiments, both cross-dataset from UCF101 to HMDB51 and to new categories with limited training data (one-shot and few-shot learning). Finally, we study the generality of the proposed approach by applying DaMN to other classification tasks; our experiments show that DaMN outperforms related approaches in direct comparisons, not only on video action recognition but also on their original image dataset tasks.