Kevin P. Murphy

Authored Publications
Abstract: In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
    COVID-19 Open-Data: a global-scale spatially granular meta-dataset for coronavirus disease
    Oscar Wahltinez
    Aurora Cheung
    Ruth Alcantara
    Donny Cheung
    Mayank Daswani
    Anthony Erlinger
    Matt Lee
    Pranali Yawalkar
    Paula Lê
    Ofir Picazo Navarro
    Scientific Data (2022)
Abstract: This paper introduces the COVID-19 Open Dataset (COD), available at goo.gle/covid-19-open-data. A static copy of the dataset is also available at https://doi.org/10.6084/m9.figshare.c.5399355. This is a very large “meta-dataset” of COVID-related data, containing epidemiological information from 22,579 unique locations within 232 different countries and independent territories. For 62 of these countries we have state-level data, and for 23 of these countries we have county-level data. For 15 countries, COD includes cases and deaths stratified by age or sex. COD also contains information on hospitalizations, vaccinations, and other relevant factors such as mobility, non-pharmaceutical interventions and static demographic attributes. Each location is tagged with a unique identifier so that these different types of information can be easily combined. The data is automatically extracted from 121 different authoritative sources, using scalable open source software. This paper describes the format and construction of the dataset, and includes a preliminary statistical analysis of its content, revealing some interesting patterns.
    Machine Learning on Graphs: A Model and Comprehensive Taxonomy
    Ines Chami
    Sami Abu-El-Haija
    Chris Ré
    Journal of Machine Learning Research, vol. 23 (2022), pp. 1-64
Abstract: There has been a surge of recent interest in graph representation learning (GRL). GRL methods have generally fallen into three main categories, based on the availability of labeled data. The first, network embedding, focuses on learning unsupervised representations of relational structure. The second, graph regularized neural networks, leverages graphs to augment neural network losses with a regularization objective for semi-supervised learning. The third, graph neural networks, aims to learn differentiable functions over discrete topologies with arbitrary structure. However, despite the popularity of these areas there has been surprisingly little work on unifying the three paradigms. Here, we aim to bridge the gap between network embedding, graph regularization and graph neural networks. We propose a comprehensive taxonomy of GRL methods, aiming to unify several disparate bodies of work. Specifically, we propose the GraphEDM framework, which generalizes popular algorithms for semi-supervised learning (e.g. GraphSage, GCN, GAT), and unsupervised learning (e.g. DeepWalk, node2vec) of graph representations into a single consistent approach. To illustrate the generality of GraphEDM, we fit over thirty existing methods into this framework. We believe that this unifying view both provides a solid foundation for understanding the intuition behind these methods, and enables future research in the area.
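As a concrete reference point for one of the methods the taxonomy covers, here is a minimal NumPy sketch of the symmetrically normalized GCN propagation rule; it is a toy illustration, not code from the paper.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    A: (n, n) adjacency matrix, H: (n, d_in) node features,
    W: (d_in, d_out) learned weights. Toy sketch, not the paper's code.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Tiny usage example: a 3-node path graph with 2-d features.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.random.randn(3, 2)
W = np.random.randn(2, 4)
print(gcn_layer(A, H, W).shape)  # (3, 4)
```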
    Plex: Towards Reliability using Pretrained Large Model Extensions
    Du Phan
    Mark Patrick Collier
    Zi Wang
    Zelda Mariet
    Clara Huiyi Hu
    Neil Band
    Tim G. J. Rudner
    Joost van Amersfoort
    Andreas Christian Kirsch
    Rodolphe Jenatton
    Honglin Yuan
    Kelly Buchanan
    Yarin Gal
    ICML 2022 Pre-training Workshop (2022)
Abstract: A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 11 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large-model extensions (henceforth abbreviated as Plex) for vision and language modalities. Plex greatly improves the state-of-the-art across tasks, and as a pretrained model Plex unifies the traditional protocol of designing and tuning one model for each reliability task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on new tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
Abstract: Digital contact tracing apps for COVID, such as the one developed by Google and Apple, need to estimate the risk that a user was infected during a particular exposure, in order to decide whether to notify the user to take precautions, such as entering into quarantine, or requesting a test. Such risk score models contain numerous parameters that must be set by the public health authority. In this paper, we show how to automatically learn these parameters from data. Our method needs access to exposure and outcome data. Although this data is already being collected (in an aggregated, privacy-preserving way) by several health authorities, in this paper we limit ourselves to simulated data, so that we can systematically study the different factors that affect the feasibility of the approach. In particular, we show that the parameters become harder to estimate when there is more missing data (e.g., due to infections which were not recorded by the app), and when there is model misspecification. Nevertheless, the learning approach outperforms a strong manually designed baseline. Furthermore, the learning approach can adapt even when the risk factors of the disease change, e.g., due to the evolution of new variants, or the adoption of vaccines.
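To make the learning setup concrete, here is a minimal sketch of fitting risk-score parameters to simulated exposure/outcome data by gradient descent, assuming a simple logistic risk model with three illustrative exposure features; the paper's actual risk model and features differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated exposures: three illustrative features (e.g., duration,
# attenuation proxy, days since symptom onset), all assumptions here.
X = rng.normal(size=(5000, 3))
true_w, true_b = np.array([1.2, -0.8, 0.5]), -2.0
y = rng.random(5000) < 1 / (1 + np.exp(-(X @ true_w + true_b)))  # outcomes

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(2000):                     # gradient descent on log-loss
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y                             # dL/dlogit for logistic loss
    w -= lr * X.T @ g / len(y)
    b -= lr * g.mean()

print(np.round(w, 2), round(b, 2))        # moves toward ~[1.2, -0.8, 0.5], -2.0
```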
Abstract: This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real world trajectory data, and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of models that predict multi-future trajectories. The second contribution is a new model to generate multiple plausible future trajectories, which contains novel designs of using multi-scale location encodings and convolutional RNNs over graphs. We refer to our model as Multiverse. We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset (which just contains one possible future).
Abstract: Being able to design biological sequences like DNA or proteins to have desired properties would have considerable impact in medical and industrial applications. However, doing so presents a challenging black-box optimization problem that requires multiple rounds of expensive, time-consuming experiments. In response, we propose using reinforcement learning (RL) for biological sequence design. RL is a flexible framework that allows us to optimize generative sequence policies to achieve a variety of criteria, including diversity among high-quality sequences discovered. We use model-based RL to improve sample efficiency, where at each round the policy is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods.
Abstract: The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments requires methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences.
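A schematic of the core PBO sampling step, with toy proposal "methods" and random stand-in scores in place of wet-lab measurements; the names and the credit-update rule are illustrative, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def pbo_round(methods, credit, batch_size):
    """Sample a batch of sequences from an ensemble of proposal methods,
    allocating draws in proportion to each method's past credit."""
    probs = credit / credit.sum()
    picks = rng.choice(len(methods), size=batch_size, p=probs)
    return [methods[i]() for i in picks], picks

# Toy "methods": each proposes a random DNA sequence of length 8.
alphabet = np.array(list("ACGT"))
methods = [lambda: "".join(rng.choice(alphabet, 8)) for _ in range(3)]
credit = np.ones(3)                       # start with uniform credit

batch, picks = pbo_round(methods, credit, batch_size=6)
scores = rng.random(6)                    # stand-in for wet-lab measurements
for i, s in zip(picks, scores):           # reweight methods by outcome
    credit[i] += s
print(batch, np.round(credit / credit.sum(), 2))
```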
    Towards Differentiable Resampling
    Michael Zhu
    Rico Jonschkowski
    arXiv:2004.11938 (2020)
Abstract: Resampling is a key component of sample-based recursive state estimation in particle filters. Recent work explores differentiable particle filters for end-to-end learning. However, resampling remains a challenge in these works, as it is inherently non-differentiable. We address this challenge by replacing traditional resampling with a learned neural network resampler. We present a novel network architecture, the particle transformer, and train it for particle resampling using a likelihood-based loss function over sets of particles. Incorporated into a differentiable particle filter, our model can be end-to-end optimized jointly with the other particle filter components via gradient descent. Our results show that our learned resampler outperforms traditional resampling techniques on synthetic data and in a simulated robot localization task.
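For contrast with the learned neural resampler, here is the classic systematic resampling step that such work replaces: the index selection is piecewise constant in the weights, which is why gradients cannot flow through it. A minimal sketch, not code from the paper.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: returns ancestor indices. The selection is
    piecewise constant in the weights, so its gradient is zero almost
    everywhere; this is the obstacle a learned resampler removes."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return np.minimum(idx, n - 1)          # guard against float round-off

rng = np.random.default_rng(0)
particles = rng.normal(size=10)
weights = rng.random(10); weights /= weights.sum()
idx = systematic_resample(weights, rng)
print(particles[idx])   # resampled particle set; weights become uniform
```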
    Biological Sequences Design using Batched Bayesian Optimization
    Zelda Mariet
    Ramya Deshpande
    David Dohan
    Olivier Chapelle
    NeurIPS workshop on Bayesian Deep Learning (2019)
Abstract: Being able to effectively design biological sequences like DNA and proteins would have transformative impact on medicine. Currently, the most popular method in the life sciences for performing design is directed evolution, which explores sequence space by making small mutations to existing sequences. Alternatively, Bayesian optimization (BO) provides an attractive framework for model-based black-box optimization, and has achieved many recent successes in life sciences applications. However, within the ML community, most large-scale BO efforts have focused on hyper-parameter tuning. These methods often do not translate to biological sequence design, where the search space is over a discrete alphabet, wet-lab experiments are run with considerable parallelism (1K-100K sequences per batch), and experiments are sufficiently slow and expensive that only a few rounds of experiments are feasible. This paper discusses the particularities of batched BO on a large discrete space, and investigates the design choices that must be made in order to obtain robust, scalable, and experimentally successful models within this unique context.
Abstract: In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image and a source image plus source text: an embedding and composition function trained so that the target image feature is close to the feature of the composed source image and text. We propose a new way to combine image and text using such a function, designed specifically for the retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to classify input queries, in addition to image retrieval.
Abstract: Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
Abstract: Instance embeddings are an efficient and versatile image representation that facilitates applications like recognition, verification, retrieval, and clustering. Many metric learning methods represent the input as a single point in the embedding space. Often the distance between points is used as a proxy for match confidence. However, this can fail to represent uncertainty which can arise when the input is ambiguous, e.g., due to occlusion or blurriness. This work addresses this issue and explicitly models the uncertainty by “hedging” the location of each input in the embedding space. We introduce the hedged instance embedding (HIB) in which embeddings are modeled as random variables and the model is trained under the variational information bottleneck principle (Alemi et al., 2016; Achille & Soatto, 2018). Empirical results on our new N-digit MNIST dataset show that our method leads to the desired behavior of “hedging its bets” across the embedding space upon encountering ambiguous inputs. This results in improved performance for image matching and classification tasks, more structure in the learned embedding space, and an ability to compute a per-exemplar uncertainty measure which is correlated with downstream performance.
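A minimal sketch of the "hedging" idea: treat each embedding as a Gaussian and estimate a match probability by Monte Carlo. The sigmoid calibration parameters a and b below are illustrative stand-ins for learned values, not the trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def match_prob(mu1, sigma1, mu2, sigma2, n_samples=1000, a=1.0, b=0.0):
    """Monte Carlo estimate of p(match) for two Gaussian ("hedged")
    embeddings: sample both, average a sigmoid of negative distance."""
    z1 = mu1 + sigma1 * rng.normal(size=(n_samples, len(mu1)))
    z2 = mu2 + sigma2 * rng.normal(size=(n_samples, len(mu2)))
    d = np.linalg.norm(z1 - z2, axis=1)
    return np.mean(1 / (1 + np.exp(a * d + b)))

mu = np.zeros(8)
print(match_prob(mu, 0.1, mu, 0.1))   # tight embeddings: high match prob
print(match_prob(mu, 2.0, mu, 2.0))   # ambiguous inputs: prob drops
```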
Abstract: We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the “ground truth” surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and finetuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved state-of-the-art results on several datasets, using a model that runs at 12 fps on a standard mobile phone.
Abstract: This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%.
Abstract: We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents. Our method is based on a graph-structured variational recurrent neural network (Graph-VRNN), which is trained end-to-end to infer the current state of the (partially observed) world, as well as to forecast future states. We show that our method outperforms various baselines on two sports datasets, one based on real basketball trajectories, and one generated by a soccer game engine.
Abstract: Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
Abstract: Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.
    Progressive Neural Architecture Search
    Chenxi Liu
    Barret Zoph
    Maxim Neumann
    Jonathan Shlens
    Wei Hua
    Jia Li
    Fei-Fei Li
    Alan Yuille
    ECCV (2018)
Abstract: We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state-of-the-art classification accuracies on CIFAR-10 and ImageNet.
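A schematic of the sequential model-based search loop, with random stand-ins for the learned surrogate and for expensive training; the operation names and beam size are illustrative, not the paper's search space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grow candidate structures one "block" at a time, rank them with a cheap
# surrogate, and only train the top-K for real. All names are illustrative.
OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]

def surrogate_score(arch):        # stand-in for the learned predictor
    return rng.random()

def train_and_eval(arch):         # stand-in for expensive training
    return rng.random()

beam, K = [[]], 2
for depth in range(3):            # progressively add one op per step
    candidates = [arch + [op] for arch in beam for op in OPS]
    candidates.sort(key=surrogate_score, reverse=True)
    beam = candidates[:K]         # expensive evaluation only for top-K
    accs = [train_and_eval(a) for a in beam]
    # (in the real method, the surrogate is refit on these measurements)
print(beam, np.round(accs, 2))
```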
    Fixing a Broken ELBO
    Ben Poole
    Josh Dillon
    Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholmsmässan, Stockholm Sweden (2018), pp. 159-168
Abstract: Recent work in unsupervised representation learning has focused on learning deep directed latent variable models. Fitting these models by maximizing the marginal likelihood or evidence is typically intractable, thus a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether exact or approximate) does not necessarily result in a good latent representation, as we demonstrate both theoretically and empirically. In particular, we derive variational lower and upper bounds on the mutual information between the input and the latent variable, and use these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy. Using this framework, we demonstrate that there is a family of models with identical ELBO, but different quantitative and qualitative characteristics. Our framework also suggests a simple new method to ensure that latent variable models with powerful stochastic decoders do not ignore their latent code.
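The rate-distortion view is easy to compute for a diagonal-Gaussian posterior with a standard-normal prior. A toy numerical sketch (not the paper's code): the ELBO is the negative of rate plus distortion, so models can trade one term for the other.

```python
import numpy as np

def rate_distortion_terms(mu, logvar, neg_log_lik):
    """ELBO = -(distortion + rate): rate is KL(q(z|x) || p(z)) for a
    diagonal-Gaussian posterior and standard-normal prior; distortion is
    the reconstruction negative log-likelihood."""
    rate = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    distortion = neg_log_lik
    return rate, distortion, -(rate + distortion)

# Two posteriors with different rate/distortion splits:
print(rate_distortion_terms(np.array([0.5, -0.3]), np.array([-1.0, -1.0]), 10.0))
print(rate_distortion_terms(np.zeros(2), np.zeros(2), 12.0))
```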
Abstract: Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist, including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
Abstract: We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.
Abstract: We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform optical flow based methods. Finally, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.
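A minimal sketch of colorization-by-copying: each target pixel attends over reference-frame pixels by feature similarity and copies a weighted mix of their colors. Random features stand in for the learned CNN embeddings; this is an illustration of the mechanism, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def copy_colors(f_ref, colors_ref, f_tgt, temp=0.1):
    """Pointer-style color copy: softmax attention over reference pixels."""
    sims = f_tgt @ f_ref.T / temp             # (n_tgt, n_ref) similarities
    sims -= sims.max(axis=1, keepdims=True)   # numerically stable softmax
    attn = np.exp(sims); attn /= attn.sum(axis=1, keepdims=True)
    return attn @ colors_ref                  # (n_tgt, 3) copied colors

f_ref = rng.normal(size=(100, 16))            # reference-frame embeddings
colors_ref = rng.random((100, 3))             # reference-frame colors
f_tgt = f_ref + 0.01 * rng.normal(size=(100, 16))  # slightly moved pixels
print(np.abs(copy_colors(f_ref, colors_ref, f_tgt) - colors_ref).max())
```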
    Context-aware Captions from Context-agnostic Supervision
    Shanmukha Ramakrishna Vedantam
    Samy Bengio
    Devi Parikh
    Gal Chechik
    CVPR (2017)
Abstract: We describe a model to induce discriminative image captions based only on generative ground-truth training data. For example, given images and descriptions of “zebras” and “horses”, our system can generate discriminative language that describes the zebra images while capturing the differences with the “horse” images. Producing discriminative language is a foundational problem in the study of pragmatic behavior: Humans can effortlessly repurpose language for being persuasive and effective in communication. We first propose a novel inference procedure based on a reflex speaker and an introspector to induce discrimination between concepts. Intuitively, the reflex speaker models a good utterance for some concept (“zebra”), while the introspector models how discriminative the sentence is between the concepts (“zebra” and “horse”). Unlike previous approaches, the form of our listener has the attractive property of being amenable to joint approximate inference to select utterances that satisfy both the speaker and the introspector, yielding an introspective speaker. We apply our introspective speaker to the CUB-Text dataset to describe why an image contains a particular bird category as opposed to some other closely related bird category, and to the MS COCO dataset to generate language that points to one out of two semantically similar images. Evaluations with discriminative ground truth collected on CUB and with humans on MS COCO reveal that our approach outperforms baseline approaches for discrimination. We then draw qualitative insights from our model outputs which suggest that in some cases one may interpret the introspective speaker outputs to be lies in service of the higher goal of discrimination.
Abstract: We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
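A minimal sketch of the VIB objective with the reparameterization trick, using toy linear encoder/decoder stand-ins; beta and all shapes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def vib_loss(x, y_onehot, enc, dec, beta=1e-3):
    """One Monte Carlo evaluation of the VIB objective
    L = E_q[-log p(y|z)] + beta * KL(q(z|x) || N(0, I)),
    via the reparameterization trick. enc returns (mu, logvar),
    dec returns class probabilities; both are toy stand-ins."""
    mu, logvar = enc(x)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparam
    ce = -np.log(dec(z) @ y_onehot + 1e-9)                     # -log p(y|z)
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return ce + beta * kl

# Toy linear encoder/decoder: 4-d inputs, 2-d bottleneck, 3 classes.
We, Wd = rng.normal(size=(4, 4)), rng.normal(size=(2, 3))
enc = lambda x: (x @ We[:, :2], x @ We[:, 2:])
def dec(z):
    logits = z @ Wd
    e = np.exp(logits - logits.max())
    return e / e.sum()

print(vib_loss(rng.normal(size=4), np.array([1., 0., 0.]), enc, dec))
```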
Abstract: Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. The cartoon dataset we collected for this purpose is in the process of being released as a new benchmark for semantic style transfer.
Abstract: We propose a method for multi-person detection and 2-D keypoint localization (human pose estimation) that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector with an Inception-ResNet architecture. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Our final system achieves average precision of 0.636 on the COCO test-dev set and 0.628 on the test-standard set, outperforming the CMU-Pose winner of the 2016 COCO keypoints challenge. Further, by using additional labeled data we obtain an even higher average precision of 0.668 on the test-dev set and 0.658 on the test-standard set, thus achieving a roughly 10% improvement over the previous best performing method on the same challenge.
Abstract: The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN, R-FCN and SSD systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that runs at over 50 frames per second and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
    Deep Metric Learning via Facility Location
    Hyun Oh Song
    Stefanie Jegelka
    Vivek Rathod
    IEEE CVPR (2017)
Abstract: Learning the representation and the similarity metric in an end-to-end fashion with deep networks has demonstrated outstanding results for clustering and retrieval. However, these recent approaches still suffer from the performance degradation stemming from the local metric training procedure which is unaware of the global structure of the embedding space. We propose a global metric learning scheme for optimizing the deep metric embedding with the learnable clustering function and the clustering metric (NMI) in a novel structured prediction framework. Our experiments on CUB200-2011, Cars196, and Stanford online products datasets show state of the art performance both on the clustering and retrieval tasks measured in the NMI and Recall@K evaluation metrics.
    Attention-based Extraction of Structured Information from Street View Imagery
    Zbigniew Wojna
    Alex Gorban
    Dar-Shyang Lee
    Qian Yu
    Julian Ibarz
    ICDAR (2017), pp. 8
Abstract: We present a neural network model, based on CNNs, RNNs and attention mechanisms, which achieves 84.04% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state of the art (Smith’16), which achieved 72.46%. Furthermore, our new method is much simpler and more general than the previous approach. To demonstrate the generality of our model, we also apply it to two datasets, derived from Google Street View, in which the goal is to extract business names from store fronts, and extract structured date/time information from parking signs. Finally, we study the speed/accuracy tradeoff that results from cutting pretrained inception CNNs at different depths and using them as feature extractors for the attention mechanism. The resulting model is not only accurate but efficient, allowing it to be used at scale on a variety of challenging real-world text extraction problems.
Abstract: Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated. The SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for. In this paper, we show how to use a policy gradient (PG) algorithm to directly optimize a combination of SPICE and CIDEr (a combination we call SPIDEr): the SPICE score ensures our captions are semantically faithful to the image, and the CIDEr score ensures our captions are syntactically fluent. The PG algorithm we propose improves on the prior MIXER approach, by using Monte Carlo rollouts instead of mixing ML training with PG. We show empirically that our algorithm leads to improved results compared to MIXER. Finally, we show that using our PG algorithm to optimize the novel SPIDEr metric results in image captions that are strongly preferred by human raters compared to captions generated by the same model but trained using different objective functions.
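A toy REINFORCE-style sketch of optimizing a non-differentiable caption metric with Monte Carlo rollouts and a baseline; the "reward" below is a stand-in, not the real SPIDEr metric, and the bag-of-words policy is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(3)                       # per-word logits (toy policy)

def sample_caption():
    p = np.exp(theta) / np.exp(theta).sum()
    words = rng.choice(3, size=2, p=p)
    logp_grad = -2 * p                    # d/dtheta of sum_i log p(w_i)
    for w in words:
        logp_grad[w] += 1
    return words, logp_grad

def spider_reward(words):                 # stand-in metric, not real SPIDEr
    return float(np.all(words == 0))      # rewards the caption "0 0"

rollouts = [sample_caption() for _ in range(64)]
rewards = np.array([spider_reward(w) for w, _ in rollouts])
baseline = rewards.mean()                 # Monte Carlo baseline
grad = sum((r - baseline) * g for (_, g), r in zip(rollouts, rewards)) / 64
theta += 1.0 * grad                       # ascend expected reward
print(np.round(theta, 3))                 # logit for word 0 increases
```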
    PixColor: Pixel Recursive Colorization
    Ryan Dahl
    Mohammad Norouzi
    Jonathon Shlens
    Proceedings of the 28th British Machine Vision Conference (BMVC) (2017)
Abstract: We propose a novel approach to automatically produce multiple colorized versions of a grayscale image. Our method results from the observation that the task of automated colorization is relatively easy given a low-resolution version of the color image. We first train a conditional PixelCNN to generate a low resolution color for a given grayscale image. Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of an image. We demonstrate that our approach produces more diverse and plausible colorizations than existing methods, as judged by human raters in a "Visual Turing Test".
    Generation and Comprehension of Unambiguous Object Descriptions
    Junhua Mao
    Alexander Toshev
    Oana Camburu
    Computer Vision and Pattern Recognition (2016)
Abstract: We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MSCOCO. We have released the dataset and a toolbox for visualization and evaluation, see https://github.com/mjhucla/Google_Refexp_toolbox.
    G-RMI Object Detection
    Anoop Korattikara
    Menglong Zhu
    Vivek Rathod
    Zbigniew Wojna
    2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop, Amsterdam (2016)
Abstract: We present our submission to the COCO 2016 Object Detection challenge.
    Detecting Events and Key Actors in Multi-Person Videos
    Vignesh Ramanathan
    Alexander Gorban
    Li Fei-Fei
    Computer Vision and Pattern Recognition (CVPR) (2016)
Abstract: Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time-instant. The attended features are then processed using another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism is able to consistently localize the relevant players.
    Bayesian Dark Knowledge
    Anoop Korattikara
    Vivek Rathod
    Max Welling
    Advances in Neural Information Processing Systems (2015)
Abstract: We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities, e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time). We describe a method for "distilling" a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [Hernandez-Lobato and Adams, 2015] and an approach based on variational Bayes [Blundell et al., 2015]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.
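A schematic of the distillation step: average the predictions of many posterior samples (the teacher) and fit a single compact model (the student) to that averaged predictive. Linear models stand in here for the deep networks and SGLD draws used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 5))
posterior_ws = rng.normal(loc=1.0, scale=0.3, size=(50, 5))  # e.g. SGLD draws
teacher_pred = (X @ posterior_ws.T).mean(axis=1)   # posterior predictive mean

# Student: one weight vector fit by least squares to the teacher's output,
# replacing 50 stored copies of the parameters at test time.
w_student, *_ = np.linalg.lstsq(X, teacher_pred, rcond=None)
print(np.round(w_student, 2))
```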
    Im2Calories: towards an automated mobile vision food diary
    Austin Myers
    Vivek Rathod
    Anoop Korattikara
    Alex Gorban
    Nathan Silberman
    George Papandreou
    ICCV (2015)
Abstract: We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories. The simplest version assumes that the user is eating at a restaurant for which we know the menu. In this case, we can collect images offline to train a multi-label classifier. At run time, we apply the classifier (running on your phone) to predict which foods are present in your meal, and we look up the corresponding nutritional facts. We apply this method to a new dataset of images from 23 different restaurants, using a CNN-based classifier, significantly outperforming previous work. The more challenging setting works outside of restaurants. In this case, we need to estimate the size of the foods, as well as their labels. This requires solving segmentation and depth / volume estimation from a single image. We present CNN-based approaches to these problems, with promising preliminary results.
    What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision
    Jonathan Malmaud
    Vivek Rathod
    Andrew Rabinovich
North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015)
Abstract: We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
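A toy monotonic (Viterbi-style) alignment of recipe steps to transcript segments, using word overlap as a stand-in for the paper's HMM emission model; steps may only advance, and we maximize the total match score.

```python
import numpy as np

steps = ["chop the onion", "fry onion in oil", "add tomato sauce"]
segments = ["first chop an onion", "heat oil in a pan",
            "fry the onion", "pour in tomato sauce"]

def score(step, seg):                      # toy match score: shared words
    return len(set(step.split()) & set(seg.split()))

S, T = len(steps), len(segments)
dp = np.full((S, T), -np.inf)
dp[0, 0] = score(steps[0], segments[0])    # alignment starts at step 0
for t in range(1, T):
    for s in range(S):
        stay = dp[s, t - 1]                                   # same step
        advance = dp[s - 1, t - 1] if s > 0 else -np.inf      # next step
        dp[s, t] = max(stay, advance) + score(steps[s], segments[t])

s = int(np.argmax(dp[:, -1]))              # backtrace the best path
assign = []
for t in range(T - 1, -1, -1):
    assign.append((segments[t], steps[s]))
    if t > 0 and s > 0 and dp[s - 1, t - 1] >= dp[s, t - 1]:
        s -= 1
print(list(reversed(assign)))              # segment -> recipe step
```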
    Large-Scale Object Classification Using Label Relation Graphs
    Jia Deng
    Yangqing Jia
    Andrea Frome
    Samy Bengio
    Yuan Li
    European Conference on Computer Vision (2014)
Abstract: In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.
    Knowledge Base Completion via Search-Based Question Answering
    Robert West
    Evgeniy Gabrilovich
    Shaohua Sun
    Dekang Lin
    WWW (2014)
Abstract: Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search-based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa's mother, we could ask the query "who is the mother of Frank Zappa". However, this is likely to return "The Mothers of Invention", which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa's place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.
    Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion
    Xin Luna Dong
    Evgeniy Gabrilovich
    Geremy Heitz
    Wilko Horn
    Ni Lao
    Thomas Strohmann
    Shaohua Sun
    Wei Zhang
    The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA - August 24 - 27, 2014, pp. 601-610
Abstract: Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
Abstract: Today’s Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, using a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package, PMTK (probabilistic modeling toolkit), that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
    Group Sparse Priors for Covariance Estimation
    Benjamin M. Marlin
    Mark W. Schmidt
    CoRR, vol. abs/1205.2626 (2012)
    A Stick-Breaking Likelihood for Categorical Data Analysis with Latent Gaussian Models
    Mohammad Emtiyaz Khan
    Shakir Mohamed
    Benjamin M. Marlin
    Journal of Machine Learning Research - Proceedings Track, vol. 22 (2012), pp. 610-618
    Bayesian structure learning using dynamic programming and MCMC
    Daniel Eaton
    CoRR, vol. abs/1206.5247 (2012)
    Piecewise Bounds for Estimating Bernoulli-Logistic Latent Gaussian Models
    Benjamin M. Marlin
    Mohammad Emtiyaz Khan
    ICML (2011), pp. 633-640
    Multiscale Conditional Random Fields for Semi-supervised Labeling and Classification
    David K. Duvenaud
    Benjamin M. Marlin
    CRV (2011), pp. 371-378
    Identifying players in broadcast sports videos using conditional random fields
    Wei-Lwun Lu
    Jo-Anne Ting
    James J. Little
    CVPR (2011), pp. 3249-3256
    Convex Structure Learning in Log-Linear Models: Beyond Pairwise Potentials
    Mark W. Schmidt
    Journal of Machine Learning Research - Proceedings Track, vol. 9 (2010), pp. 709-716
    Using the forest to see the trees: exploiting context for visual object detection and localization
    Antonio Torralba
    William T. Freeman
    Commun. ACM, vol. 53 (2010), pp. 107-114
    Time-Bounded Sequential Parameter Optimization
    Frank Hutter
    Holger H. Hoos
    Kevin Leyton-Brown
    LION (2010), pp. 281-298
    Causal learning without DAGs
    David K. Duvenaud
    Daniel Eaton
    Mark W. Schmidt
    Journal of Machine Learning Research - Proceedings Track, vol. 6 (2010), pp. 177-190
    Review of "Probabilistic graphical models" by Koller and Friedman
    Artif. Intell., vol. 174 (2010), pp. 145-146
    SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors
    Rodrigo Goya
    Mark G. F. Sun
    Ryan D. Morin
    Gillian Leung
    Gavin Ha
    Kimberley C. Wiegand
    Janine Senz
    Anamaria Crisan
    Marco A. Marra
    Martin Hirst
    David G. Huntsman
    Sam Aparicio
    Sohrab P. Shah
    Bioinformatics, vol. 26 (2010), pp. 730-736
    Variational bounds for mixed-data factor analysis
    Mohammad Emtiyaz Khan
    Benjamin M. Marlin
    Guillaume Bouchard
    NIPS (2010), pp. 1108-1116
    Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
    Baback Moghaddam
    Benjamin M. Marlin
    Mohammad Emtiyaz Khan
    NIPS (2009), pp. 1285-1293
    A Hybrid Conditional Random Field for Estimating the Underlying Ground Surface From Airborne LiDAR Data
    Wei-Lwun Lu
    James J. Little
    Alla Sheffer
    Hongbo Fu
    IEEE T. Geoscience and Remote Sensing, vol. 47 (2009), pp. 2913-2922
    Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm
    Mark W. Schmidt
    Ewout van den Berg
    Michael P. Friedlander
    Journal of Machine Learning Research - Proceedings Track, vol. 5 (2009), pp. 456-463
    Modeling Discrete Interventional Data using Directed Cyclic Graphical Models
    Mark W. Schmidt
    UAI (2009), pp. 487-495
    An experimental investigation of model-based parameter optimisation: SPO and beyond
    Frank Hutter
    Holger H. Hoos
    Kevin Leyton-Brown
    GECCO (2009), pp. 271-278
    Sparse Gaussian graphical models with unknown block structure
    Benjamin M. Marlin
    ICML (2009), pp. 89
    Model-based clustering of array CGH data
    Sohrab P. Shah
    K-John Cheung Jr.
    Nathalie A. Johnson
    Guillaume Alain
    Randy D. Gascoyne
    Douglas E. Horsman
    Raymond T. Ng
    Bioinformatics, vol. 25 (2009)
    Group Sparse Priors for Covariance Estimation
    Benjamin M. Marlin
    Mark W. Schmidt
    UAI (2009), pp. 383-392
    Structure learning in random fields for heart motion abnormality detection
    Mark W. Schmidt
    Glenn Fung
    Rómer Rosales
    CVPR (2008)
    LabelMe: A Database and Web-Based Tool for Image Annotation
    Bryan C. Russell
    Antonio Torralba
    William T. Freeman
    International Journal of Computer Vision, vol. 77 (2008), pp. 157-173
    Modeling changing dependency structure in multivariate time series
    Xiang Xuan
    ICML (2007), pp. 1055-1062
    Learning Graphical Model Structure Using L1-Regularization Paths
    Mark W. Schmidt
    Alexandru Niculescu-Mizil
    AAAI (2007), pp. 1278-1283
    Figure-ground segmentation using a hierarchical conditional random field
    Jordan Reynolds
    CRV (2007), pp. 175-182
    Sharing Visual Features for Multiclass and Multiview Object Detection
    Antonio Torralba
    William T. Freeman
    IEEE Trans. Pattern Anal. Mach. Intell., vol. 29 (2007), pp. 854-869
    Modeling recurrent DNA copy number alterations in array CGH data
    Sohrab P. Shah
    Wan L. Lam
    Raymond T. Ng
    ISMB/ECCB (Supplement of Bioinformatics) (2007), pp. 450-458
    Bayesian structure learning using dynamic programming and MCMC
    Daniel Eaton
    UAI (2007), pp. 101-108
    Exact Bayesian structure learning from uncertain interventions
    Daniel Eaton
    Journal of Machine Learning Research - Proceedings Track, vol. 2 (2007), pp. 107-114
    Efficient parameter estimation for RNA secondary structure prediction
    Mirela Andronescu
    Anne Condon
    Holger H. Hoos
    David H. Mathews
    ISMB/ECCB (Supplement of Bioinformatics) (2007), pp. 19-28
    A non-myopic approach to visual search
    Julia Vogel
    CRV (2007), pp. 227-234
    Accelerated training of conditional random fields with stochastic gradient methods
    S. V. N. Vishwanathan
    Nicol N. Schraudolph
    Mark W. Schmidt
    ICML (2006), pp. 969-976
    Object Detection and Localization Using Local and Global Features
    Antonio Torralba
    Daniel Eaton
    William T. Freeman
    Toward Category-Level Object Recognition (2006), pp. 382-400
    Integrating copy number polymorphisms into array CGH analysis using a robust HMM
    Sohrab P. Shah
    Xiang Xuan
    Ronald J. deLeeuw
    Mehrnoush Khojasteh
    Wan L. Lam
    Raymond T. Ng
    ISMB (Supplement of Bioinformatics) (2006), pp. 431-439
    Shared Features for Multiclass Object Detection
    Antonio Torralba
    William T. Freeman
    Toward Category-Level Object Recognition (2006), pp. 345-361
    Representing Hierarchical POMDPs as DBNs for Multi-scale Robot Localization
    Georgios Theocharous
    Leslie Pack Kaelbling
    ICRA (2004), pp. 1045-1051
    Contextual Models for Object Detection Using Boosted Random Fields
    Antonio Torralba
    William T. Freeman
    NIPS (2004)
    Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection
    Antonio Torralba
    William T. Freeman
    CVPR (2) (2004), pp. 762-769
    Graphical Model For Recognizing Scenes and Objects
    Antonio Torralba
    William T. Freeman
    NIPS (2003)
    Context-based vision system for place and object recognition
    Antonio Torralba
    William T. Freeman
    Mark A. Rubin
    ICCV (2003), pp. 273-280
    A coupled HMM for audio-visual speech recognition
    Ara V. Nefian
    Luhong Liang
    Xiaobo Pi
    Xiaoxiang Liu
    Crusoe Mao
    ICASSP (2002), pp. 2013-2016
    Dynamic Bayesian Networks for Audio-Visual Speech Recognition
    Ara V. Nefian
    Luhong Liang
    Xiaobo Pi
    Xiaoxing Liu
    EURASIP J. Adv. Sig. Proc., vol. 2002 (2002), pp. 1274-1288
    Linear-time inference in Hierarchical HMMs
    Mark A. Paskin
    NIPS (2001), pp. 833-840
    The Factored Frontier Algorithm for Approximate Inference in DBNs
    Yair Weiss
    UAI (2001), pp. 378-385
    Rao-Blackwellised Particle Filtering for Dynamic Bayesian Networks
    Arnaud Doucet
    Nando de Freitas
    Stuart J. Russell
    UAI (2000), pp. 176-183
    Loopy Belief Propagation for Approximate Inference: An Empirical Study
    Yair Weiss
    Michael I. Jordan
    UAI (1999), pp. 467-475
    A Dynamic Bayesian Network Approach to Figure Tracking using Learned Dynamic Models
    Vladimir Pavlovic
    James M. Rehg
    Tat-Jen Cham
    ICCV (1999), pp. 94-101
    Vision-Based Speaker Detection Using Bayesian Networks
    James M. Rehg
    Paul W. Fieguth
    CVPR (1999), pp. 2110-2116
    Bayesian Map Learning in Dynamic Environments
    NIPS (1999), pp. 1015-1021
    A Variational Approximation for Bayesian Networks with Discrete and Continuous Latent Variables
    UAI (1999), pp. 457-466
    Learning the Structure of Dynamic Probabilistic Networks
    Nir Friedman
    Stuart J. Russell
    UAI (1998), pp. 139-147
    Space-Efficient Inference in Dynamic Probabilistic Networks
    John Binder
    Stuart J. Russell
    IJCAI (1997), pp. 1292-1296
    Automata-Theoretic Models of Mutation and Alignment
    David B. Searls
    ISMB (1995), pp. 341-349