Roman Novak
Authored Publications
Fast Neural Kernel Embeddings for General Activations
Insu Han
Amir Zandieh
Amin Karbasi
NeurIPS 2022 (to appear)
The infinite width limit has shed light on generalization and optimization aspects of deep learning by establishing connections between neural networks and kernel methods. Despite their importance, the utility of these kernel methods has been limited in large-scale learning settings due to their (super-)quadratic runtime and memory complexities. Moreover, most prior works on neural kernels have focused on the ReLU activation, mainly due to its popularity but also due to the difficulty of computing such kernels for general activations. In this work, we overcome these difficulties by providing methods to work with general activations. First, we compile and expand the list of activation functions admitting exact dual activation expressions to compute neural kernels. When an exact computation is unknown, we present methods to effectively approximate them. We propose a fast sketching method that approximates any multi-layered Neural Network Gaussian Process (NNGP) kernel and Neural Tangent Kernel (NTK) matrices for a wide range of activation functions, going beyond the commonly analyzed ReLU activation. This is done by showing how to approximate the neural kernels using the truncated Hermite expansion of any desired activation function. While most prior works require data points on the unit sphere, our methods do not suffer from such limitations and are applicable to any dataset of points in ℝ^d. Furthermore, we provide a subspace embedding for NNGP and NTK matrices with near input-sparsity runtime and near-optimal target dimension, which applies to any homogeneous dual activation function with a rapidly convergent Taylor expansion. Empirically, with respect to exact convolutional NTK (CNTK) computation, our method achieves a 106× speedup for the approximate CNTK of a 5-layer Myrtle network on the CIFAR-10 dataset.
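To make the dual-activation approximation concrete, here is a minimal NumPy/SciPy sketch (not the paper's sketching algorithm): approximate E[σ(u)σ(v)] from the truncated Hermite expansion of σ, using the fact that normalized Hermite coefficients a_k yield Σ_k a_k² ρ^k for unit-variance inputs with correlation ρ; the erf example at the end is checked against its known closed-form dual.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from scipy.special import erf

def hermite_coeffs(activation, degree):
    """Coefficients a_k of `activation` in the normalized probabilists' Hermite
    basis he_k = He_k / sqrt(k!), via Gauss-Hermite quadrature under N(0, 1)."""
    x, w = hermegauss(2 * degree + 2)      # nodes/weights for weight exp(-x^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)           # renormalize to the standard Gaussian density
    coeffs = np.zeros(degree + 1)
    for k in range(degree + 1):
        basis_k = hermeval(x, np.eye(degree + 1)[k])   # He_k(x)
        coeffs[k] = np.sum(w * activation(x) * basis_k) / math.sqrt(math.factorial(k))
    return coeffs

def dual_activation(coeffs, rho):
    """Truncated E[act(u) act(v)] for (u, v) ~ N(0, [[1, rho], [rho, 1]]):
    equals sum_k a_k^2 rho^k when both marginals have unit variance."""
    return np.sum(coeffs ** 2 * rho ** np.arange(len(coeffs)))

coeffs = hermite_coeffs(erf, degree=25)
rho = 0.3
print(dual_activation(coeffs, rho))              # Hermite approximation
print(2.0 / np.pi * np.arcsin(2.0 * rho / 3.0))  # closed-form dual of erf
```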
Fast Finite Width Neural Tangent Kernel
The Neural Tangent Kernel (NTK), defined as the outer product of the neural network (NN) Jacobians, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility.
We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency.
Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks.
We open-source our implementations within the Neural Tangents package at https://github.com/google/neural-tangents.
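For reference, below is a naive JAX baseline for the finite-width NTK, the explicit Jacobian contraction Θ(x1, x2) = Σ_θ (∂f(x1)/∂θ)(∂f(x2)/∂θ)ᵀ summed over output dimensions; this is the quantity whose compute and memory exponents the proposed algorithms improve (`apply_fn` stands for any differentiable model, and the package ships optimized empirical-NTK functions).

```python
import jax
import jax.numpy as jnp

def naive_ntk(apply_fn, params, x1, x2):
    """Finite-width NTK by explicit Jacobian contraction (the slow baseline;
    the paper's algorithms avoid instantiating these Jacobians)."""
    j1 = jax.jacobian(lambda p: apply_fn(p, x1))(params)  # pytree: [N1, out, *param_shape]
    j2 = jax.jacobian(lambda p: apply_fn(p, x2))(params)

    def contract(a, b):
        a = a.reshape(a.shape[0], -1)   # flatten output and parameter axes
        b = b.reshape(b.shape[0], -1)
        return a @ b.T                  # [N1, N2], summed over outputs and parameters

    per_leaf = jax.tree_util.tree_map(contract, j1, j2)
    return sum(jax.tree_util.tree_leaves(per_leaf))
```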
Wide Bayesian Neural Networks Have a Simple Weight Posterior: Theory and Accelerated Sampling
We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. We observe up to 50x higher effective sample size relative to no reparameterisation for both fully-connected and residual networks. Improvements are achieved at all widths, with the margin between reparameterised and standard BNNs growing with layer width.
Dataset Distillation with Infinitely Wide Convolutional Networks
The effectiveness of machine learning algorithms arises from being able to extract useful features from large amounts of data. As model and dataset sizes increase, dataset distillation methods that compress large datasets into significantly smaller yet highly performant ones will become valuable in terms of training efficiency and useful feature extraction. To that end, we apply a novel distributed kernel-based meta-learning framework to achieve state-of-the-art results for dataset distillation using infinitely wide convolutional neural networks. For instance, using only 10 datapoints (0.02% of the original dataset), we obtain over 64% test accuracy on the CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%. Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN. Furthermore, we perform some preliminary analyses of our distilled datasets to shed light on how they differ from naturally occurring data.
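As a rough sketch of the kernel-based meta-learning idea, assuming the distilled images are obtained by differentiating a kernel ridge regression readout on real data (all names below are illustrative; `kernel_fn` stands for an infinite-width NNGP/NTK kernel such as those provided by Neural Tangents):

```python
import jax
import jax.numpy as jnp

def krr_predict(kernel_fn, x_support, y_support, x_target, reg=1e-6):
    """Kernel ridge regression readout from a (distilled) support set."""
    k_ss = kernel_fn(x_support, x_support)
    k_ts = kernel_fn(x_target, x_support)
    alpha = jnp.linalg.solve(k_ss + reg * jnp.eye(k_ss.shape[0]), y_support)
    return k_ts @ alpha

def distillation_loss(x_support, y_support, x_target, y_target, kernel_fn):
    """MSE of the KRR predictions on real data; gradients w.r.t. x_support
    drive the distilled images."""
    preds = krr_predict(kernel_fn, x_support, y_support, x_target)
    return jnp.mean((preds - y_target) ** 2)

# One sketch of a distillation step: move the support images down the meta-gradient.
grad_support = jax.grad(distillation_loss, argnums=0)

def step(x_support, y_support, x_batch, y_batch, kernel_fn, lr=0.1):
    g = grad_support(x_support, y_support, x_batch, y_batch, kernel_fn)
    return x_support - lr * g
```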
Neural Tangents: Fast and Easy Infinite Neural Networks in Python
Jiri Hron
Jascha Sohl-Dickstein
Sam Schoenholz
ICLR (2020)
Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space.
The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices.
Neural Tangents is available at
https://github.com/google/neural-tangents
We also provide an accompanying interactive Colab notebook at
https://colab.sandbox.google.com/github/neural-tangents/neural-tangents/blob/master/notebooks/neural_tangents_cookbook.ipynb
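To illustrate the workflow, here is a brief sketch along the lines of the library's examples (data shapes are placeholders): `stax` layers return an `(init_fn, apply_fn, kernel_fn)` triple, and `kernel_fn` evaluates the closed-form NNGP and NTK kernels of the corresponding infinite-width network; the `predict` call assumes the `gradient_descent_mse_ensemble` API of recent releases.

```python
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

# Finite-width constructors plus the analytic kernel of the infinite-width network.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x_train = jnp.ones((20, 32))   # placeholder data
y_train = jnp.ones((20, 1))
x_test = jnp.ones((5, 32))

nngp = kernel_fn(x_train, x_train, 'nngp')   # exact Bayesian (NNGP) kernel
ntk = kernel_fn(x_train, x_train, 'ntk')     # Neural Tangent Kernel

# Mean test predictions of infinitely wide networks trained on MSE via the NTK.
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_test_ntk = predict_fn(x_test=x_test, get='ntk')
```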
Finite versus Infinite Neural Networks: an Empirical Study
Sam S. Schoenholz
Jeffrey Pennington
Jascha Sohl-Dickstein
NeurIPS 2020
We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully connected finite width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; ensembles of finite networks have reduced posterior variance and behave similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence of finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction. Using these best practices we achieve state-of-the-art results for non-trainable kernels on CIFAR-10 classification tasks.
Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron
Jascha Sohl-Dickstein
International Conference on Machine Learning 2020 (to appear)
There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
Sam Schoenholz
Jascha Sohl-Dickstein
Jeffrey Pennington
NeurIPS (2019)
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
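A minimal JAX sketch of the linearization studied here: the first-order Taylor expansion of the network in its parameters around initialization, implemented with a Jacobian-vector product (the Neural Tangents library exposes an equivalent utility).

```python
import jax

def linearize(apply_fn, params0):
    """Return f_lin with
    f_lin(params, x) = f(params0, x) + J_params f(params0, x) @ (params - params0)."""
    def f_lin(params, x):
        dparams = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        # Forward-mode JVP gives the directional derivative along (params - params0).
        f0, df = jax.jvp(lambda p: apply_fn(p, x), (params0,), (dparams,))
        return f0 + df
    return f_lin
```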
Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes
Greg Yang
Jiri Hron
Dan Abolafia
Jeffrey Pennington
Jascha Sohl-Dickstein
ICLR (2019)
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state-of-the-art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.
Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit, a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
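The Monte Carlo estimator mentioned above can be sketched in a few lines, assuming stax-style `init_fn(key, input_shape)` / `apply_fn(params, x)` conventions (Neural Tangents provides a tuned version of this estimator): sample many finite random networks and average the empirical covariance of their outputs.

```python
import jax
import jax.numpy as jnp

def monte_carlo_nngp(init_fn, apply_fn, x1, x2, key, n_samples=128):
    """Estimate the NNGP kernel K(x1, x2) by averaging f(x1) f(x2)^T / n_outputs
    over random initializations; useful when the analytic kernel is intractable."""
    total = 0.0
    for k in jax.random.split(key, n_samples):
        _, params = init_fn(k, x1.shape)
        f1 = apply_fn(params, x1)                 # [N1, n_outputs]
        f2 = apply_fn(params, x2)                 # [N2, n_outputs]
        total = total + f1 @ f2.T / f1.shape[-1]  # average over output channels
    return total / n_samples
```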
Sensitivity and Generalization in Neural Networks: an Empirical Study
In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets.
We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that this robustness correlates well with generalization. We further establish that factors associated with poor generalization (such as full-batch training or using random labels) correspond to lower robustness, while factors associated with good generalization (such as data augmentation and ReLU non-linearities) give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
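For concreteness, a short JAX sketch of the per-example sensitivity metric above, the Frobenius norm of the input-output Jacobian, evaluated at a batch of points (`apply_fn` and `params` denote any trained model; the names are illustrative).

```python
import jax
import jax.numpy as jnp

def jacobian_norms(apply_fn, params, xs):
    """Frobenius norm of d f(params, x) / d x at each point x in the batch xs."""
    def norm_at(x):
        jac = jax.jacobian(lambda xi: apply_fn(params, xi))(x)  # [n_outputs, *x.shape]
        return jnp.sqrt(jnp.sum(jac ** 2))
    return jax.vmap(norm_at)(xs)
```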
Deep Neural Networks as Gaussian Processes
Sam Schoenholz
Jeffrey Pennington
Jascha Sohl-Dickstein
ICLR (2018)
It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network.
In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs. We then use the resulting GPs to perform Bayesian inference for wide deep neural networks on MNIST and CIFAR10. We observe that trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We further find that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks. Finally, we connect the performance of these GPs to the recent theory of signal propagation in random neural networks.
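As a concrete instance of such a covariance computation, here is a small NumPy sketch of the compositional NNGP kernel for a fully-connected ReLU network, where the required Gaussian expectation has the closed arc-cosine form; the weight and bias variances are illustrative choices, not values from the paper.

```python
import numpy as np

def nngp_relu_kernel(x1, x2, depth, sigma_w=1.5, sigma_b=0.1):
    """Compositional NNGP kernel of a `depth`-hidden-layer fully-connected ReLU net.
    x1: [N1, d], x2: [N2, d]; returns the [N1, N2] covariance of the outputs."""
    d = x1.shape[-1]
    k12 = sigma_b**2 + sigma_w**2 * (x1 @ x2.T) / d          # input-layer covariance
    k11 = sigma_b**2 + sigma_w**2 * np.sum(x1**2, -1) / d    # diagonal terms, [N1]
    k22 = sigma_b**2 + sigma_w**2 * np.sum(x2**2, -1) / d    # diagonal terms, [N2]
    for _ in range(depth):
        norms = np.sqrt(np.outer(k11, k22))
        theta = np.arccos(np.clip(k12 / norms, -1.0, 1.0))
        # E[relu(u) relu(v)] for (u, v) ~ N(0, K): the arc-cosine kernel of degree 1.
        dual = norms * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
        k12 = sigma_b**2 + sigma_w**2 * dual
        k11 = sigma_b**2 + sigma_w**2 * k11 / 2              # E[relu(u)^2] = K_uu / 2
        k22 = sigma_b**2 + sigma_w**2 * k22 / 2
    return k12
```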
Iterative Refinement for Machine Translation
Existing machine translation decoding algorithms generate translations in a strictly monotonic fashion and never revisit previous decisions. As a result, earlier mistakes cannot be corrected at a later stage. In this paper, we present a translation scheme that starts from an initial guess and then makes iterative improvements that may revisit previous decisions. We parameterize our model as a convolutional neural network that predicts discrete substitutions to an existing translation based on an attention mechanism over both the source sentence as well as the current translation output. By making less than one modification per sentence, we improve the output of a phrase-based translation system by up to 0.4 BLEU on WMT15 German-English translation.