A Bayesian Perspective on Generalization and Stochastic Gradient Descent


Zhang et al. (2016) argued that understanding deep learning requires rethinking generalization. To justify this claim, they showed that deep networks can easily memorize randomly labeled training data, despite generalizing well when shown real labels of the same inputs. We show here that the same phenomenon occurs in small linear models with fewer than a thousand parameters; however there is no need to rethink anything, since our observations are explained by evaluating the Bayesian evidence in favor of each model. This Bayesian evidence penalizes sharp minima. We also explore the “generalization gap” observed between small and large batch training, identifying an optimum batch size which scales linearly with both the learning rate and the size of the training set. Surprisingly, in our experiments the generalization gap was closed by regularizing the model.