Large scale distributed neural network training through online distillation


Although techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model, they are seldom used: the multi-stage training setups they require are cumbersome, and the extra hyperparameters they introduce make tuning even more expensive. In this paper we explore a variant of distillation that is relatively straightforward to use because it does not require a complicated multi-stage setup. We also show that distillation can serve as a meaningful distributed learning algorithm: instead of exchanging gradients, which requires worrying about delays and synchronization, independent workers can exchange full model checkpoints. Checkpoints can be exchanged far less frequently than gradients, which breaks one of the scalability barriers of stochastic gradient descent. We report experiments on Criteo clickthrough rate prediction and on the largest dataset used to date for neural language modeling, which is based on Common Crawl and contains $6\times 10^{11}$ tokens. In these experiments we show that we can scale at least $2\times$ as well as the maximum limit of distributed stochastic gradient descent. Finally, we show that online distillation can dramatically reduce the churn in predictions between different versions of a model.
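To make the checkpoint-exchange idea concrete, the following is a minimal toy sketch and not the implementation used in the experiments. It assumes logistic-regression workers, a distillation term written as the cross-entropy between a worker's prediction and the prediction of a stale copy of its peer, and a ring in which worker i receives the checkpoint of worker i+1. The abstract specifies none of these details; all function names, the synthetic data, and the hyperparameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def make_shard(rng, n):
    """Hypothetical toy data: the label is 1 when 2*x0 - x1 > 0."""
    shard = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]  # last entry is a bias input
        y = 1.0 if 2.0 * x[0] - x[1] > 0 else 0.0
        shard.append((x, y))
    return shard


def sgd_step(w, x, y, teacher_w, lr, distill_weight):
    """One SGD step on label cross-entropy plus an assumed distillation term."""
    p = predict(w, x)
    grad_logit = p - y  # gradient of cross-entropy w.r.t. the logit
    if teacher_w is not None:
        q = predict(teacher_w, x)  # stale peer prediction, treated as a constant
        grad_logit += distill_weight * (p - q)
    return [wi - lr * grad_logit * xi for wi, xi in zip(w, x)]


def codistill(shards, steps=2000, exchange_every=200, lr=0.1,
              distill_weight=0.5, seed=0):
    """Train one worker per shard; workers share checkpoints, never gradients."""
    rng = random.Random(seed)
    n = len(shards)
    dim = len(shards[0][0][0])
    weights = [[0.0] * dim for _ in range(n)]
    stale = [None] * n  # each worker's last received copy of its peer
    exchanges = 0
    for step in range(1, steps + 1):
        for i, shard in enumerate(shards):
            x, y = rng.choice(shard)
            weights[i] = sgd_step(weights[i], x, y, stale[i], lr, distill_weight)
        if step % exchange_every == 0:
            # Infrequent checkpoint exchange: worker i copies worker (i + 1) % n.
            stale = [list(weights[(i + 1) % n]) for i in range(n)]
            exchanges += 1
    return weights, exchanges


def accuracy(w, data):
    correct = sum((predict(w, x) > 0.5) == (y > 0.5) for x, y in data)
    return correct / len(data)


if __name__ == "__main__":
    data_rng = random.Random(1)
    shards = [make_shard(data_rng, 200), make_shard(data_rng, 200)]
    weights, exchanges = codistill(shards)
    held_out = make_shard(data_rng, 500)
    print("checkpoint exchanges:", exchanges)
    print("held-out accuracy per worker:", [accuracy(w, held_out) for w in weights])
```

In this sketch the workers communicate only once every `exchange_every` steps, and each trains against a peer copy that may be many steps out of date. That is the property the abstract relies on when it contrasts checkpoint exchange with gradient exchange.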