Matthew D. Hoffman

Authored Publications
    Sequential Monte Carlo Learning for Time Series Structure Discovery
    Feras Saad
    Vikash Mansinghka
    Proceedings of the 40th International Conference on Machine Learning (2023), pp. 29473-29489
    Abstract: This paper presents a new approach to automatically discovering accurate models of complex time series data. Working within a Bayesian nonparametric prior over a symbolic space of Gaussian process time series models, we present a novel structure learning algorithm that integrates sequential Monte Carlo (SMC) and involutive MCMC for highly effective posterior inference. Our method can be used both in "online" settings, where new data is incorporated sequentially in time, and in "offline" settings, by using nested subsets of historical data to anneal the posterior. Empirical measurements on a variety of real-world time series show that our method can deliver 10x-100x runtime speedups over previous MCMC and greedy-search structure learning algorithms for the same model family. We use our method to perform the first large-scale evaluation of Gaussian process time series structure learning on a widely used benchmark of 1,428 monthly econometric datasets, showing that our method discovers sensible models that deliver more accurate point forecasts and interval forecasts over multiple horizons as compared to prominent statistical and neural baselines that struggle on this challenging data.
    Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
    Abstract: In order to prepare for and control the continued spread of the COVID-19 pandemic while minimizing its economic impact, the world needs to be able to estimate and predict COVID-19’s spread. Unfortunately, we cannot directly observe the prevalence or growth rate of COVID-19; these must be inferred using some kind of model. We propose a hierarchical Bayesian extension to the classic susceptible-exposed-infected-removed (SEIR) compartmental model that adds compartments to account for isolation and death and allows the infection rate to vary as a function of both mobility data collected from mobile phones and a latent time-varying factor that accounts for changes in behavior not captured by mobility data. Since confirmed-case data is unreliable, we infer the model’s parameters conditioned on deaths data. We replace the exponential-waiting-time assumption of classic compartmental models with Erlang distributions, which allows for a more realistic model of the long lag between exposure and death. The mobility data gives us a leading indicator that can quickly detect changes in the pandemic’s local growth rate and forecast changes in death rates weeks ahead of time. This is an analysis of observational data, so any causal interpretations of the model's inferences should be treated as suggestive at best; nonetheless, the model’s inferred relationships between different kinds of trips and the infection rate do suggest some possible hypotheses about what kinds of activities might contribute most to COVID-19’s spread.
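    To make the Erlang-dwell-time idea concrete, here is a minimal sketch of a deterministic SEIR-style simulator in which the exposed and infectious periods are Erlang(k) rather than exponential, implemented by chaining k sub-compartments. It is an illustration only: the paper's model is a hierarchical Bayesian extension with isolation and death compartments, a mobility-driven infection rate, and inference conditioned on deaths, none of which appear here, and all names and parameter values below are ours.

      import numpy as np

      def simulate_seir_erlang(beta=0.3, sigma=0.2, gamma=0.1, k_e=2, k_i=2,
                               n_days=200, pop=1e6, i0=10.0):
          # Exposed/infectious dwell times are Erlang(k, k*rate): same mean as the
          # exponential version (1/rate) but a less skewed, more realistic shape.
          E = np.zeros(k_e)
          I = np.zeros(k_i); I[0] = i0
          S, R = pop - i0, 0.0
          history = []
          for _ in range(n_days):                        # forward Euler, dt = 1 day
              new_exposed = beta * S * I.sum() / pop
              e_out = k_e * sigma * E                    # flow out of each E sub-stage
              i_out = k_i * gamma * I                    # flow out of each I sub-stage
              S -= new_exposed
              E += np.concatenate(([new_exposed], e_out[:-1])) - e_out
              I += np.concatenate(([e_out[-1]], i_out[:-1])) - i_out
              R += i_out[-1]
              history.append((S, E.sum(), I.sum(), R))
          return np.array(history)

      print(simulate_seir_erlang()[-1])  # final (S, E, I, R) counts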
    Automatically batching control-intensive programs for modern accelerators
    Alexey Radul
    Dougal Maclaurin
    Third Conference on Systems and Machine Learning, Austin, TX (2020)
    Abstract: We present a general approach to batching arbitrary computations for GPU and TPU accelerators. We demonstrate the effectiveness of our method with orders-of-magnitude speedups on the No U-Turn Sampler (NUTS), a workhorse algorithm in Bayesian statistics. The central challenge of batching NUTS and other Markov chain Monte Carlo algorithms is data-dependent control flow and recursion. We overcome this by mechanically transforming a single-example implementation into a form that explicitly tracks the current program point for each batch member, and only steps forward those in the same place. We present two different batching algorithms: a simpler, previously published one that inherits recursion from the host Python, and a more complex, novel one that implements recursion directly and can batch across it. We implement these batching methods as a general program transformation on Python source. Both the batching system and the NUTS implementation presented here are available as part of the popular TensorFlow Probability software package.
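    The core difficulty named above (data-dependent control flow) is easiest to see in a toy example. The sketch below batches a while loop whose trip count differs per batch member by keeping an explicit "still active" mask and only advancing the active members, which is the flavor of the transformation described in the abstract; it does not implement the paper's program-counter machinery or batched recursion, and all names are ours.

      import numpy as np

      def batched_collatz_steps(x):
          # Count Collatz steps for every element of x at once, even though each
          # element needs a different number of loop iterations.
          x = np.asarray(x, dtype=np.int64).copy()
          steps = np.zeros_like(x)
          active = x != 1                     # which batch members are still looping
          while active.any():
              nxt = np.where(x % 2 == 1, 3 * x + 1, x // 2)
              x = np.where(active, nxt, x)    # finished members keep their value
              steps += active.astype(np.int64)
              active = x != 1
          return steps

      print(batched_collatz_steps([1, 6, 27]))  # -> [0 8 111]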
    Abstract: Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to the reuse of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
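    The memory saving comes from never materializing the (L, L, d) tensor of per-pair relative embeddings. Below is a minimal NumPy sketch of the published "skewing" procedure; the shapes and variable names are ours, and real implementations apply this per attention head under a causal mask.

      import numpy as np

      def relative_logits(q, e_rel):
          # q:     (L, d) queries for one attention head.
          # e_rel: (L, d) embeddings for relative distances -(L-1), ..., 0.
          # Returns (L, L) logits whose (i, j) entry is q[i] . e_rel[distance j - i];
          # entries with j > i are junk and assumed masked by causal attention.
          L = q.shape[0]
          qe = q @ e_rel.T                        # (L, L); no (L, L, d) intermediate
          padded = np.pad(qe, [(0, 0), (1, 0)])   # prepend a dummy column -> (L, L+1)
          return padded.reshape(L + 1, L)[1:]     # reshape, drop first row -> (L, L)

    The only intermediates are the (L, d) embedding table and an (L, L) matrix that standard attention already materializes, so the extra memory for relative information grows linearly in the sequence length.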
    Abstract: Hamiltonian Monte Carlo is a powerful algorithm for sampling from difficult-to-normalize posterior distributions. However, when the geometry of the posterior is unfavorable, it may take many expensive evaluations of the target distribution and its gradient to converge and mix. We propose neural transport (NeuTra) HMC, a technique for learning to correct this sort of unfavorable geometry using inverse autoregressive flows (IAF), a powerful neural variational inference technique. The IAF is trained to minimize the KL divergence from an isotropic Gaussian to the warped posterior, and then HMC sampling is performed in the warped space. We evaluate NeuTra HMC on a variety of synthetic and real problems, and find that it significantly outperforms vanilla HMC both in time to reach the stationary distribution and asymptotic effective-sample-size rates.
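    The change of variables behind this idea is simple to state: if F is the trained flow and z the roughly isotropic latent, HMC targets log p(F(z)) + log|det dF/dz| in z-space. A hedged sketch follows, with a hand-rolled affine "flow" standing in for the IAF; the function and variable names are ours, not the TensorFlow Probability API.

      import numpy as np

      def warped_log_prob(z, target_log_prob, flow_forward, flow_log_det_jac):
          # Density that NeuTra-style HMC samples in the latent ("warped") space.
          x = flow_forward(z)
          return target_log_prob(x) + flow_log_det_jac(z)

      # Toy example: a badly scaled Gaussian becomes isotropic after an affine flow,
      # so HMC in z-space no longer needs a tiny step size for the stiff dimension.
      scales = np.array([100.0, 0.1])
      target_log_prob = lambda x: -0.5 * np.sum((x / scales) ** 2)
      flow_forward = lambda z: scales * z
      flow_log_det_jac = lambda z: np.sum(np.log(scales))

      print(warped_log_prob(np.ones(2), target_log_prob, flow_forward, flow_log_det_jac))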
    Abstract: Texture synthesis techniques based on matching the Gram matrix of feature activations in neural networks have achieved spectacular success in the image domain. In this paper we extend these techniques to the audio domain. We demonstrate that synthesizing diverse audio textures is challenging, and argue that this is because audio data is relatively low-dimensional. We therefore introduce two new terms to the original Gram-matrix loss: an autocorrelation term that preserves rhythm, and a diversity term that encourages the optimization procedure to synthesize unique textures. We quantitatively study the impact of our design choices on the quality of the synthesized audio by introducing an audio analogue to the Inception loss which we term the VGGish loss. We show that there is a trade-off between the diversity and quality of the synthesized audio using this technique. Finally we perform a number of experiments to qualitatively study how these design choices impact the quality of the synthesized audio.
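    A hedged sketch of the two matching terms described above, written over a generic (time, channels) feature array; the paper's full objective also includes the diversity term and operates on neural-network activations, and all names here are ours.

      import numpy as np

      def gram_matrix(features):
          # features: (T, C) time-by-channel activations; the Gram matrix is (C, C)
          # and is invariant to shuffling time, which is why it misses rhythm.
          return features.T @ features / features.shape[0]

      def autocorrelation(features, max_lag):
          # Channel-wise autocorrelation up to max_lag, which carries the rhythmic
          # structure the Gram matrix discards.
          T = features.shape[0]
          return np.stack([(features[:T - lag] * features[lag:]).mean(axis=0)
                           for lag in range(1, max_lag + 1)])

      def texture_loss(feat_synth, feat_target, max_lag=32, w_gram=1.0, w_ac=1.0):
          l_gram = np.sum((gram_matrix(feat_synth) - gram_matrix(feat_target)) ** 2)
          l_ac = np.sum((autocorrelation(feat_synth, max_lag) -
                         autocorrelation(feat_target, max_lag)) ** 2)
          return w_gram * l_gram + w_ac * l_ac

      rng = np.random.default_rng(0)
      print(texture_loss(rng.normal(size=(1000, 8)), rng.normal(size=(1000, 8))))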
    Failure Modes of Variational Inference for Decision Making
    Carlos Riquelme
    Matthew Johnson
    ICML Workshop (2018)
    Abstract: In this paper we highlight the risks of relying on mean-field variational inference to learn models that are used as simulators for decision making. We study the role of accurate inference for latent variable models in terms of cumulative reward performance. We show how naive mean-field variational inference at test time can lead to poor decisions in basic but fundamental quadratic control problems with continuous actions, as relevant correlations in the latent space are ignored. We then extend these examples to a more complex non-linear scenario with asymmetric costs, where regret is even more significant.
    Abstract: Deriving conditional and marginal distributions using conjugacy relationships can be time consuming and error prone. In this paper, we propose a strategy for automating such derivations. Unlike previous systems which focus on relationships between pairs of random variables, our system (which we call Autoconj) operates directly on Python functions that compute log-joint distribution functions. Autoconj provides support for conjugacy-exploiting algorithms in any Python-embedded PPL. This paves the way for accelerating development of novel inference algorithms and structure-exploiting modeling strategies. The package can be downloaded at https://github.com/google-research/autoconj.
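    The kind of input Autoconj consumes is an ordinary Python/NumPy function that evaluates a log-joint density. The Beta-Bernoulli example below is ours (it is not taken from the package's documentation); given such a function, the system's job is to recognize the conjugacy structure and return, for instance, the Beta complete conditional for p without any manual derivation.

      import numpy as np
      from scipy import special

      def log_joint(p, x, alpha=2.0, beta=2.0):
          # Beta(alpha, beta) prior on p, Bernoulli(p) likelihood for binary data x.
          log_prior = ((alpha - 1) * np.log(p) + (beta - 1) * np.log1p(-p)
                       - special.betaln(alpha, beta))
          log_lik = np.sum(x * np.log(p) + (1 - x) * np.log1p(-p))
          return log_prior + log_lik

      print(log_joint(0.3, np.array([1, 0, 1, 1])))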
    Abstract: Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive retraining. In this paper, we develop a method to condition generation without retraining the model. By learning latent constraints post hoc (value functions that identify regions in latent space that generate outputs with desired attributes), we can conditionally sample from these regions with gradient-based optimization or amortized actor functions. Combining attribute constraints with a universal "realism" constraint, which enforces similarity to the data distribution, we generate realistic conditional images from an unconditional variational autoencoder. Further, using gradient-based optimization, we demonstrate identity-preserving transformations that make the minimal adjustment in latent space to modify the attributes of an image. Finally, with discrete sequences of musical notes, we demonstrate zero-shot conditional generation, learning latent constraints in the absence of labeled data or a differentiable reward function.
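    The gradient-based conditional sampling described above reduces to hill-climbing a value function in latent space. The sketch below does this with a finite-difference gradient and a toy critic; a real implementation would use the critic network's exact gradients and a trained decoder, and every name here is a hypothetical placeholder.

      import numpy as np

      def constrain_latent(z0, value_fn, steps=200, lr=0.1, eps=1e-4):
          # Move a latent code toward the region the critic scores highly.
          z = np.array(z0, dtype=float)
          for _ in range(steps):
              grad = np.zeros_like(z)
              for i in range(z.size):        # finite-difference gradient of value_fn
                  dz = np.zeros_like(z)
                  dz[i] = eps
                  grad[i] = (value_fn(z + dz) - value_fn(z - dz)) / (2 * eps)
              z = z + lr * grad              # gradient ascent on the constraint
          return z

      # Toy critic: the "desired attribute" region is centered at (2, -1).
      value_fn = lambda z: -np.sum((z - np.array([2.0, -1.0])) ** 2)
      print(constrain_latent(np.zeros(2), value_fn))   # ends up near [2, -1]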
    Abstract: We present a general-purpose method to train Markov chain Monte Carlo kernels (parameterized by deep neural networks) that converge and mix quickly to their target distribution. Our method generalizes Hamiltonian Monte Carlo and is trained to maximize expected squared jump distance, a proxy for mixing speed. We demonstrate significant empirical gains (up to $124\times$ greater effective sample size) on a collection of simple but challenging distributions. Finally, we show quantitative and qualitative gains on a real-world task: latent-variable generative modeling. Python source code is included as supplemental material, and will be open-sourced with the camera-ready paper.
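    For reference, the mixing proxy named in this abstract, expected squared jump distance, can be written (in our notation, with $\alpha$ the Metropolis-Hastings acceptance probability and $q_\theta$ the learned proposal kernel) as

    $$\mathrm{ESJD}(\theta) = \mathbb{E}_{x \sim p,\; x' \sim q_\theta(\cdot \mid x)}\big[\alpha(x, x')\,\lVert x' - x \rVert^2\big],$$

    which the kernel parameters $\theta$ are trained to maximize.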
    Abstract: We describe Edward2, a low-level probabilistic programming language. Edward2 distills the core of probabilistic programming down to a single abstraction—the random variable. By blurring the line between model and computation, Edward2 enables numerous applications not shown before: a model-parallel variational auto-encoder (VAE) with tensor processing units (TPUs); a data-parallel autoregressive model (Image Transformer) with TPUs; and multi-GPU No-U-Turn Sampler (NUTS). Edward2 achieves an optimal linear speedup from 4 to 256 TPUs. With VAEs, Edward2 sees up to a 20x speedup on TPUs over Pyro and Edward on GPUs; with Bayesian neural networks, Edward2 sees up to a 51x speedup. With NUTS, Edward2 sees a 20x speedup on GPUs over Stan and 7x over PyMC3.
    The Beta VAE's Implicit Prior
    Carlos Riquelme
    Matthew Johnson
    NIPS Workshop (2017)
    Abstract: Variational autoencoders are a popular and powerful class of deep generative models. They resemble a classical autoencoder, except that the latent code $z=f(x)$ is replaced with a distribution $q(z\mid x)$ over latent codes, and this distribution is regularized to have small KL divergence to a (usually pre-specified) marginal distribution $p(z)$. If the reconstruction log-likelihood $\mathbb{E}_q[\log p(x\mid z)]$ has the same weight as the KL-divergence penalty $\mathbb{E}_q[\log \frac{p(z)}{q(z\mid x)}]$, then the training procedure can be interpreted as maximizing a bound on the marginal likelihood $p(x)$ (sometimes called the evidence lower bound or ELBO). However, recent work has explored applying different weights to the KL-divergence term, either to alleviate optimization issues during training or to exert greater control over the sorts of latent spaces that get learned. Following Higgins et al. (2017), we will call models fit with this approach "beta-VAEs". Below, we will analyze beta-VAEs where the KL-divergence weight $\beta < 1$. We will argue that optimizing this partially regularized ELBO is equivalent to doing approximate variational EM with an implicit prior $r(z)$ that depends on the marginal posterior $q(z) \triangleq \frac{1}{N}\sum_n q(z\mid x_n)$, with one main difference: it ignores the normalizing constant of this implicit distribution. We show how to estimate this missing normalizing constant, and demonstrate that beta-VAEs with $\beta < 1$ can actually achieve higher held-out likelihoods than standard VAEs.
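    In the notation of the abstract, the $\beta$-weighted objective being analyzed is

    $$\mathcal{L}_\beta(x) = \mathbb{E}_{q(z\mid x)}[\log p(x\mid z)] - \beta\,\mathrm{KL}\big(q(z\mid x)\,\big\|\,p(z)\big),$$

    which recovers the standard ELBO at $\beta = 1$; the analysis above concerns the regime $\beta < 1$.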
    Abstract: Deep latent Gaussian models are powerful and popular probabilistic models of high-dimensional data. These models are almost always fit using variational expectation-maximization, an approximation to true maximum-marginal-likelihood estimation. In this paper, we propose a different approach: rather than use a variational approximation (which produces biased gradient signals), we use Markov chain Monte Carlo (MCMC, which allows us to trade bias for computation). We find that our MCMC-based approach has several advantages: it yields higher held-out likelihoods, produces sharper images, and does not suffer from the variational overpruning effect. MCMC’s additional computational overhead proves to be significant, but not prohibitive.
    TensorFlow Distributions
    Josh Dillon
    Dustin Tran
    Eugene Brevdo
    Dave Moore
    Workshop on Probabilistic Programming Languages, Semantics, and Systems (PPS 2018) (2017)
    Abstract: The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Building on two basic abstractions, it offers flexible building blocks for probabilistic computation. Distributions provide fast, numerically stable methods for generating samples and computing statistics, e.g., log density. Bijectors provide composable volume-tracking transformations with automatic caching. Together these enable modular construction of high dimensional distributions and transformations not possible with previous libraries (e.g., pixelCNNs, autoregressive flows, and reversible residual networks). They are the workhorse behind deep probabilistic programming systems like Edward and empower fast black-box inference in probabilistic models built on deep-network components. TensorFlow Distributions has proven an important part of the TensorFlow toolkit within Google and in the broader deep learning community.
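    A minimal sketch of the Distribution/Bijector composition described above, written against today's tensorflow_probability package (the library the abstract describes now lives there); the example itself is ours.

      import tensorflow_probability as tfp

      tfd, tfb = tfp.distributions, tfp.bijectors

      # A log-normal built by pushing a Normal base distribution through an Exp
      # bijector; sampling and log_prob come for free, with the Jacobian volume
      # change handled by the bijector.
      log_normal = tfd.TransformedDistribution(
          distribution=tfd.Normal(loc=0.0, scale=1.0),
          bijector=tfb.Exp())

      samples = log_normal.sample(3)
      print(log_normal.log_prob(samples))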