We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on average when propagating inputs forward or gradients backward.
The exponential forward dynamics cause the geometry of the input space to collapse rapidly, while the exponential backward dynamics cause gradients to vanish or explode drastically. We show, in contrast, that by converting to residual connections, with most activations such as tanh or a power of the ReLU unit, the network will adopt subexponential, and in many cases polynomial, forward and backward dynamics. The exponents of these polynomials are obtained through analytic methods, proved correct, and verified empirically. In terms of the
``edge of chaos'' hypothesis, these subexponential and polynomial laws allow residual networks to ``hover over the boundary between stability and chaos,'' thus preserving the geometry of the input space and the gradient information flow. We also train a grid of tanh residual networks on MNIST, and observe that, as predicted by the theory developed in this paper, the peak performances of these models are determined by the product between the standard deviation of weights and the square root of the depth. Thus in addition to improving our understanding of residual networks, our theoretical tools can guide the research toward better initialization schemes. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.