Train faster, generalize better: Stability of stochastic gradient descent


We show that any model trained by a stochastic gradient method with few iterations has vanishing generalization error. Our results apply to both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Our bounds hold in cases where existing uniform convergence bounds do not apply, for instance, when there is no explicit form of regularization and the model capacity far exceeds the sample size. Conceptually, our findings help explain the widely observed empirical success of training large models with gradient descent methods. They further underscore the importance of reducing training time beyond its obvious benefit.
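The stability argument behind these bounds can be illustrated empirically. The following is a minimal sketch (not code from the paper, and all names are illustrative): run SGD with a shared sequence of sampled indices on two datasets that differ in a single example, and observe that the resulting parameters remain close. Closeness of the two trajectories is exactly the kind of algorithmic stability that controls generalization error.

```python
import math
import random

# Illustrative sketch: stability of SGD on a convex, smooth loss.
# Train a 1-D logistic regression by SGD on two datasets S and S'
# that differ in exactly one example, following the same sequence
# of sampled indices. If SGD is stable, the two parameter
# trajectories stay close after a modest number of iterations.

random.seed(0)
n = 200
X = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [random.choice([0.0, 1.0]) for _ in range(n)]

# S' differs from S only at index 0 (flipped label).
X2, y2 = list(X), list(y)
y2[0] = 1.0 - y2[0]

def sgd(xs, ys, order, lr=0.1):
    w = 0.0
    for i in order:
        p = 1.0 / (1.0 + math.exp(-w * xs[i]))  # sigmoid prediction
        w -= lr * (p - ys[i]) * xs[i]           # log-loss gradient step
    return w

steps = 500
order = [random.randrange(n) for _ in range(steps)]  # shared sample path
w1 = sgd(X, y, order)
w2 = sgd(X2, y2, order)

print(f"parameter divergence after {steps} steps: {abs(w1 - w2):.4f}")
```

Because the logistic loss is convex and smooth, each gradient step with a small learning rate is non-expansive, so the divergence between the two runs grows only when the single differing example is sampled; this is the mechanism that keeps the divergence, and hence the generalization gap, small when the number of iterations is small.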