Recent work has demonstrated significant progress towards modeling the distribution of natural images with tractable likelihood using deep neural networks. This was achieved by modeling the joint distribution of the pixels in an image as a product of conditional distributions, thereby turning it into a sequence modeling problem, and applying recurrent or convolutional neural networks to it.
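
For concreteness, the factorization referred to above can be written as follows. This is a sketch under assumptions not stated in this abstract: the image is flattened into a sequence $x = (x_1, \dots, x_n)$ of pixel intensities in some fixed order (raster-scan order is a common choice), and $n$ denotes the length of that sequence.

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}),
\qquad
\log p(x) = \sum_{i=1}^{n} \log p(x_i \mid x_1, \dots, x_{i-1})
```

Each factor is a distribution over a single element given all preceding ones, so the likelihood of an image is tractable whenever each conditional can be evaluated.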

In this work we instead build on the Transformer, a recently proposed network architecture based on self-attention, to model the conditional distributions in similar factorizations. We present two extensions of the network architecture, allowing it to scale to images and to take advantage of their two-dimensional structure.

While conceptually simple, our generative models are competitive with or outperform the current state of the art on two image data sets, CIFAR-10 and ImageNet, as measured by log-likelihood.

We also present results on image super-resolution with a large magnification ratio, using an encoder-decoder configuration of our architecture. In a human evaluation study, images produced by our super-resolution models fool a naive human observer three times more often than those of previously published autoregressive super-resolution models.

Lastly, we provide examples of images generated or completed by our various models which, following previous work, we also believe to look pretty cool.