Anelia Angelova

Anelia Angelova is a research scientist in the area of computer vision. She leads the Robot Vision research team in Brain Robotics at Google Brain. Her most recent research focuses on deep learning for robotics perception, including semantic and 3D scene understanding and real-time algorithms for pedestrian detection and robot grasp localization. She has integrated her work in production systems, including the first deep neural network models running onboard Google's self-driving car, now Waymo. Anelia received her MS and PhD degrees in Computer Science from California Institute of Technology.
Authored Publications
    Preview abstract We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix. View details
    Preview abstract Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI is a model that generates text based on visual and textual inputs. Using this interface, PaLI is able to perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes: the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing Vision Transformer (ViT) to date. For the data, we build an image-text training set of over 10B images covering over 100 languages. PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art results on multiple vision and language tasks (image classification, image captioning, visual question answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling. View details
    Preview abstract We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and cross-dataset transfer detection, in addition to significant training speed-up and compute savings. Code will be released at https://sites.google.com/corp/view/f-vlm/home. View details
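The abstract does not give the exact fusion rule for combining the detector head with the frozen VLM; the NumPy sketch below shows one plausible geometric-mean blend of the two score sources, with the blending exponent and softmax temperature as assumed values rather than the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combine_region_scores(detector_scores, vlm_text_sims, alpha=0.65, temperature=0.01):
    """Fuse detector-head scores with frozen-VLM region/text similarities.

    detector_scores: (R,) scores from the finetuned detector head.
    vlm_text_sims:   (R, C) cosine similarities between pooled region features of
                     the frozen VLM and C category text embeddings.
    alpha, temperature: assumed values for this sketch, not taken from the paper.
    """
    vlm_probs = softmax(vlm_text_sims / temperature, axis=-1)
    det = detector_scores[:, None]                       # broadcast over categories
    return det ** (1.0 - alpha) * vlm_probs ** alpha     # geometric-mean-style blend

# Toy usage: 3 region proposals, 4 open-vocabulary categories.
rng = np.random.default_rng(0)
print(combine_region_scores(rng.random(3), rng.standard_normal((3, 4))).shape)  # (3, 4)
```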
    Mechanical Search on Shelves with Efficient Stacking and Destacking of Objects
    Huang Huang
    Letian Fu
    Michael Danielczuk
    Chung Min Kim
    Zachary Tam
    Jeff Ichnowski
    Brian Ichter
    Ken Goldberg
    The International Symposium of Robotics Research (ISRR) (2023)
    Preview abstract Stacking increases storage efficiency in shelves, but the lack of visibility and accessibility makes the mechanical search problem of revealing and extracting target objects difficult for robots. In this paper, we extend the lateral-access mechanical search problem to shelves with stacked items and introduce two novel policies -- Distribution Area Reduction for Stacked Scenes (DARSS) and Monte Carlo Tree Search for Stacked Scenes (MCTSSS) -- that use destacking and restacking actions. MCTSSS improves on prior lookahead policies by considering future states after each potential action. Experiments in 1200 simulated and 18 physical trials with a Fetch robot equipped with a blade and suction cup suggest that destacking and restacking actions can reveal the target object with 82--100% success in simulation and 66--100% in physical experiments, and are critical for searching densely packed shelves. In the simulation experiments, both policies outperform a baseline and achieve similar success rates but take more steps compared with an oracle policy that has full state information. In simulation and physical experiments, DARSS outperforms MCTSSS in median number of steps to reveal the target, but MCTSSS has a higher success rate in physical experiments, suggesting robustness to perception noise. View details
    Joint Adaptive Representations for Image-Language Learning
    Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
    Preview abstract Image-language transformer models have achieved tremendous success, but they come at high computational costs. We here propose joint adaptive image-language representation learning, which adaptively and iteratively fuses the multi-modal features. This consistently reduces the model cost and size, allows the model to scale without a large increase in FLOPs or memory, and outperforms bigger and much more expensive models. With only 40M training examples and 39 GFLOPs, our model outperforms models many times larger, some reaching 800 GFLOPs. View details
    Preview abstract We present a simple approach that turns a ViT encoder into an efficient video model and works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can train and run inference on both. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. It achieves state-of-the-art results. View details
    Dynamic Pre-training of Vision-Language Models
    Wei Li
    ICLR 2023 Workshop on Multimodal Representation Learning (2023)
    Preview abstract Vision-Language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. In this paper, we propose a novel dynamic pretraining resampling strategy for a variety of pretraining tasks. Unlike recent large-scale vision-language approaches, we show that a set of diverse self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. Further, the approach is sample-efficient, using much less data and compute to address a range of downstream tasks. We show that a single 330M-parameter model, pretrained using only smaller and publicly accessible datasets, achieves competitive or state-of-the-art performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering. View details
    Preview abstract We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) – a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models. View details
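As a rough illustration of the cropped positional embedding idea above, the sketch below randomly crops a square region of a ViT positional-embedding grid and resizes it back to the full token grid; the nearest-neighbor resize and the crop-size range are simplifications, not the paper's exact recipe.

```python
import numpy as np

def crop_resize_pos_embed(pos_embed, out_size, rng):
    """Randomly crop a square region of an (H, W, D) positional-embedding grid and
    resize it back to (out_size, out_size, D). Nearest-neighbor resizing and the
    crop-size range are simplifications for this sketch."""
    H, W, D = pos_embed.shape
    crop = int(rng.integers(max(2, H // 4), H + 1))      # random square crop size
    top = int(rng.integers(0, H - crop + 1))
    left = int(rng.integers(0, W - crop + 1))
    region = pos_embed[top:top + crop, left:left + crop]
    rows = np.linspace(0, crop - 1, out_size).round().astype(int)
    cols = np.linspace(0, crop - 1, out_size).round().astype(int)
    return region[rows][:, cols]

rng = np.random.default_rng(0)
pe = rng.standard_normal((14, 14, 32))              # a ViT-style 14x14 grid of 32-d embeddings
print(crop_resize_pos_embed(pe, 14, rng).shape)     # (14, 14, 32)
```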
    Diversifying Joint Vision-Language Tokenization Learning
    Vardaan Pahuja
    Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
    Preview abstract Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from both modalities but should also be diverse for better generalization performance. To this end, we propose joint vision-language representation learning by diversifying the tokenization learning process, enabling tokens which are sufficiently disentangled from each other to be learned from both modalities. We observe that our approach outperforms the baseline models in a majority of settings and is competitive with state-of-the-art methods. View details
    Preview abstract The development of language models has moved from encoder-decoder to decoder-only designs. In addition, conventional wisdom holds that the two most popular multimodal tasks, the generative and contrastive tasks, tend to conflict with one another, are hard to accommodate in one architecture, and further need complex adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach. View details
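A toy NumPy illustration of how a single, weight-shared text decoder could serve both objectives purely by switching the attention mask between two passes; the attention, pooling, and mask details here are assumptions for exposition, not MaMMUT's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, mask):
    """Toy single-head self-attention with an additive mask; the weights (here the
    token features themselves) are shared across both passes, only the mask changes."""
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d) + mask
    return softmax(logits) @ x

T, D = 6, 16
x = np.random.default_rng(0).standard_normal((T, D))    # text-token features in the decoder

bidirectional = np.zeros((T, T))                         # pass 1: contrastive text embedding
causal = np.triu(np.full((T, T), -1e9), k=1)             # pass 2: autoregressive generation

text_embedding = attention(x, bidirectional).mean(axis=0)   # pooled, matched to image features
next_token_feats = attention(x, causal)                     # per-token, fed to the LM head
print(text_embedding.shape, next_token_feats.shape)         # (16,) (6, 16)
```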
    Learning Open-World Object Proposals without Learning to Classify
    Tsung-Yi Lin
    In So Kweon
    Robotics and Automation Letters (RA-L) Journal and International Conference on Robotics and Automation (ICRA) (2022)
    Preview abstract Object proposals have become an integral preprocessing step of many vision pipelines including object detection, weakly supervised detection, object discovery, tracking, etc. Compared to learning-free methods, learning-based proposals have become popular recently due to the growing interest in object detection. The common paradigm is to learn object proposals from data labeled with a set of object regions and their corresponding categories. However, this approach often struggles with novel objects in the open world that are absent from the training set. In this paper, we identify the problem: the binary classifiers in existing proposal methods tend to overfit to the training categories. Therefore, we propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlap with any ground-truth object (e.g., centerness and IoU). This strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization on COCO. We further explore more challenging cross-dataset generalization onto the RoboNet and EpicKitchens datasets and demonstrate clear improvement over state-of-the-art object detectors and object proposers. The code is publicly available. View details
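The localization-quality targets the abstract mentions (centerness and IoU) can be illustrated in a few lines of NumPy; this is a toy sketch of the targets themselves, not the OLN training code.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def centerness(point, box):
    """FCOS-style centerness of a point (x, y) with respect to a ground-truth box."""
    l, t = point[0] - box[0], point[1] - box[1]
    r, b = box[2] - point[0], box[3] - point[1]
    if min(l, r) <= 0 or min(t, b) <= 0:
        return 0.0
    return float(np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b))))

gt = [20, 20, 80, 100]
print(centerness((50, 60), gt))        # point near the box center -> high objectness target
print(iou([25, 30, 75, 90], gt))       # proposal quality from overlap alone, no class label
```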
    Preview abstract We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or precomputed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories compared to strong single-task baselines. All of these are accomplished by a single, unified and efficient model. View details
    Preview abstract We present Answer-Me, a task-aware multi-task framework which unifies multiple question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pretrain a vision-language joint model, which is multi-task as well, and uses the entire architecture end-to-end. Our results, which are in the challenging open-vocabulary generative setting, show state-of-the-art performance, zero-shot generalization, and robustness to forgetting. View details
    Preview abstract We present a novel efficient image-language learning model for multi-task visual question answering tasks which works at a fraction of the computational cost. New compact features are learned adaptively to jointly represent the image and language modalities according to the data. Our method outperforms the state-of-the-art multi-task approaches on SNLI-VE and GQA, and works competitively on VQA2.0. The model is highly efficient using 7-10 fewer GFLOPs and scales well to more than twice the input image size. View details
    Mechanical Search on Shelves using a Novel “Bluction” Tool
    Huang Huang
    Michael Danielczuk
    Chung Min Kim
    Letian Fu
    Zachary Tam
    Jeff Ichnowski
    Brian Andrew Ichter
    Ken Goldberg
    International Conference on Robotics and Automation (ICRA) (2022) (to appear)
    Preview abstract Shelves are common in homes, warehouses, and commercial settings due to their storage efficiency. However, this efficiency comes at the cost of reduced visibility and accessibility. When looking from a side (lateral) view of a shelf, most objects will be fully occluded, resulting in a constrained lateral-access mechanical search problem. To address this problem, we introduce: (1) a novel bluction tool, which combines a thin pushing blade and suction cup gripper, (2) an improved LAX-RAY simulation pipeline and perception model that combines ray-casting with 2D Minkowski sums to efficiently generate target occupancy distributions, and (3) a novel SLAX-RAY search policy, which optimally reduces target object distribution support area using the bluction tool. Experimental data from 2000 simulated shelf trials and 18 trials with a physical Fetch robot equipped with the bluction tool suggest that using suction grasping actions improves the success rate over the highest performing push-only policy by 26% in simulation and 67% in physical environments. View details
    Preview abstract Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA, outperforming the previous state of the art by large margins. At the same time, our model requires only 67 GFLOPs, making it a highly efficient video question answering model. View details
    Preview abstract We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods. View details
    Preview abstract 3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments. To achieve a mapping between image views of objects and 3D shapes, we leverage CAD model priors from existing large-scale databases, and propose a novel approach towards constructing a joint embedding space between 2D images and 3D CAD models in a patch-wise fashion – establishing correspondences between patches of an image view of an object and patches of CAD geometry. This enables part similarity reasoning for retrieving similar CADs to a new image view without exact matches in the database. Our patch embedding provides more robust CAD retrieval for shape estimation in our end-to-end estimation of CAD model shape and pose for detected objects in a single input image. Experiments on in-the-wild, complex imagery from ScanNet show that our approach is more robust than state of the art in real-world scenarios without any exact CAD matches. View details
    4D-Net for Learned Multi-Modal Alignment
    Michael Ryoo
    International Conference on Computer Vision (ICCV) (2021)
    Preview abstract We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines on the Waymo Open Dataset. 4D-Net is better able to use motion cues and dense image information to detect distant objects more successfully. We will open source the code. View details
    Mechanical Search on Shelves using LAX-RAY: Lateral Access X-RAY
    Huang Huang
    Marcus Dominguez-Kuhne
    Vishal Satish
    Michael Danielczuk
    Kate Sanders
    Jeff Ichnowski
    Andrew Lee
    Ken Goldberg
    IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2021)
    Preview abstract Finding an occluded object in a lateral access environment such as a shelf or cabinet is a problem that arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. While this problem, known as mechanical search, is well-studied in overhead access environments, lateral access environments introduce constraints on the poses of objects and on available grasp actions, and pushing actions are preferred to preserve the environment structure. We propose LAXRAY (Lateral Access maXimal Reduction in support Area of occupancY distribution): a system that combines target object occupancy distribution prediction with a mechanical search policy that sequentially pushes occluding objects to reveal a given target object. For scenarios with extruded polygonal objects, we introduce two lateral-access search policies that encode a history of predicted target distributions and can plan up to three actions into the future. We introduce a First-Order Shelf Simulator (FOSS) and use it to evaluate these policies in 800 simulated random shelf environments per policy. We also evaluate in 5 physical shelf environments using a Fetch robot with an embedded PrimeSense RGBD Camera and an attached pushing blade. The policies outperform baselines by up to 25 % in simulation and up to 60% in physical experiments. Additionally, the two-step prediction policy is the highest performing in simulation for 8 objects with a 69 % success rate, suggesting a tradeoff between future information and prediction errors. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray. View details
    Tiny Video Networks
    Michael Ryoo
    Applied AI Letters Journal (2021)
    Preview abstract Automatic video understanding is becoming more important for applications where real-time performance is crucial and compute is limited. Yet, accurate solutions so far have been computationally intensive. We propose efficient models for videos - Tiny Video Networks - which are video architectures, automatically designed to comply with fast runtimes and, at the same time are effective at video recognition tasks. The Tiny Video Networks run at faster-than-real-time speeds and demonstrate strong performance across several video benchmarks. These models not only provide new tools for real-time video applications, but also enable fast research and development in video understanding. Code and models are available. View details
    SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping
    Austin Stone
    Daniel Maurer
    Alper Ayvaci
    Rico Jonschkowski
    Computer Vision and Pattern Recognition (CVPR) (2021)
    Preview abstract We present SMURF, a method for unsupervised learning of optical flow that improves state of the art on all benchmarks by 36% to 40% (over the prior best method UFlow) and even outperforms several supervised approaches such as PWC-Net and FlowNet2. Our method integrates architecture improvements from supervised optical flow, i.e. the RAFT model, with new ideas for unsupervised learning that include a sequence-aware self-supervision loss, a technique for handling out-of-frame motion, and an approach for learning effectively from multi-frame video data while still only requiring two frames for inference. View details
    Preview abstract In this paper we address the problem of automatically discovering atomic actions from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which could, for example, automatically 'read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and provides automatic and unsupervised self-labeling. View details
    TokenLearner: Adaptive Space-Time Tokenization for Videos
    Michael Ryoo
    Anurag Arnab
    Conference on Neural Information Processing Systems (NeurIPS) (2021)
    Preview abstract In this paper, we present an approach for representation learning from videos. Instead of relying on hand-designed splitting strategies to obtain space-time tokens from videos, our approach learns to mine important tokens in video frames. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise interactions between such tokens over a longer temporal horizon. We introduce a vector transformer to capture such pairwise space-time relations, and a technique to fuse the transformed tokens while learning their spatio-temporal patterns. The proposed approach is designed with the intention to allow the tokenizer to adaptively react to input video frames containing diverse visual content, and then to have the vector transformer and subsequent modules learn the underlying spatio-temporal interactions and long-range dependencies in video inputs. We show the effectiveness of the proposed approach over challenging video classification datasets, outperforming the state-of-the-art, despite using much less compute. We further conduct extensive ablation experiments to study the method. View details
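A minimal NumPy sketch of the adaptive token-mining idea above: learned spatial attention maps select and pool the important positions of a frame into a small set of tokens. The single-matrix attention module and the token budget are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mine_tokens(frame_feats, w_attn):
    """Mine a small set of tokens from one frame via learned spatial attention.

    frame_feats: (H*W, D) flattened per-position features of a frame.
    w_attn:      (D, S) weights producing S spatial attention maps (a stand-in for
                 the learned attention module; a simplification of the method).
    Returns (S, D): each token is an attention-weighted average over positions.
    """
    attn = softmax(frame_feats @ w_attn, axis=0)   # (H*W, S), normalized over space
    return attn.T @ frame_feats                    # (S, D)

rng = np.random.default_rng(0)
H, W, D, S = 16, 16, 64, 8                         # 8 tokens per frame (an assumed budget)
tokens = mine_tokens(rng.standard_normal((H * W, D)), 0.1 * rng.standard_normal((D, S)))
print(tokens.shape)                                # (8, 64)
```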
    Adaptive Intermediate Representations for Video Understanding
    Juhana Kangaspunta
    Rico Jonschkowski
    Michael Ryoo
    MUltimodal Learning and Applications (MULA) Workshop, CVPR (2021)
    Preview abstract A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow. In this work, we first introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding and use it in a way that requires no additional labeling. Second, we propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task and allows the adaptation of the representations to the end goal. Despite the use of intermediate representations within the network, during inference, no additional data beyond RGB sequences is needed. Finally, we present a way to find the optimal learning configuration by searching for the best loss weighting via evolution. We obtain more powerful visual representations for videos which lead to performance gains over the state of the art. View details
    Taskology: Utilizing Task Relations at Scale
    Yao Lu
    Sören Pirk
    Jan Dlabal
    Anthony Brohan
    Ankita Pasad
    Zhao Chen
    Ariel Gordon
    Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
    Preview abstract Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g., object classification, detection, scene segmentation, depth estimation, etc. We show that we can leverage the inherent relationships among collections of tasks, as they are trained jointly, supervising each other through their known relationships via consistency losses. Furthermore, explicitly utilizing the relationships between tasks allows improving their performance while dramatically reducing the need for labeled data, and allows training with additional unsupervised or simulated data. We demonstrate a distributed joint training algorithm with task-level parallelism, which affords a high degree of asynchronicity and robustness. This allows learning across multiple tasks, or with large amounts of input data, at scale. We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion and ego-motion estimation, and object tracking and 3D detection in point clouds. We observe improved performance across these tasks, especially in the low-label regime. View details
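As a concrete example of a consistency loss between two related tasks, the sketch below couples depth and surface-normal predictions through their known geometric relationship, so the two models can supervise each other without labels; the specific task pair and the squared-error form are illustrative assumptions, not the paper's exact losses.

```python
import numpy as np

def normals_from_depth(depth):
    """Derive unit surface normals from a depth map via finite differences."""
    dz_dx = np.gradient(depth, axis=1)
    dz_dy = np.gradient(depth, axis=0)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9)

def consistency_loss(pred_depth, pred_normals):
    """Penalize disagreement between the depth task and the normal-prediction task.
    No labels are needed: the known relationship between the tasks supervises both."""
    return float(np.mean((normals_from_depth(pred_depth) - pred_normals) ** 2))

rng = np.random.default_rng(0)
depth = 1.0 + rng.random((32, 32))
normals = normals_from_depth(depth) + 0.01 * rng.standard_normal((32, 32, 3))
print(consistency_loss(depth, normals))   # small value: the two predictions agree
```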
    Visionary: Vision Architecture Discovery for Robot Learning
    Iretiayo Akinola
    Yao Lu
    Yevgen Chebotar
    Dmitry Kalashnikov
    Jake Varley
    Julian Ibarz
    Michael Ryoo
    International Conference on Robotics and Automation (ICRA) (2021)
    Preview abstract We propose a vision-based architecture search algorithm for learning robot manipulation tasks, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs. The architectures are automatically designed while training for the task itself and are capable of discovering novel ways of combining action and image feature inputs, as well as features from previous stages of learning. The obtained new architectures demonstrated better task success rates, in some cases by a large margin, compared to a recent high-performing baseline. Our real-robot experiments also uncovered architectures which improve grasping performance by 6%. This is the first approach to demonstrate that a tailored architecture can be simultaneously modified and trained for a real-robot task. View details
    Preview abstract In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos, based on a sequential stochastic autoregressive model for temporal segmentation of videos. The model learns to represent and discover the sequential relationship between different atomic actions of the task, and provides automatic and unsupervised self-labeling. View details
    Unsupervised Monocular Depth Learning in Dynamic Scenes
    Hanhan Li
    Ariel Gordon
    Hang Zhao
    Conference on Robot Learning (CoRL) (2020)
    Preview abstract We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily-underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including semantically-aware methods. The code is available at https://github.com/google-research/google-research/tree/master/depth_and_motion_learning. View details
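The two priors described above (sparsity, and constancy within rigid moving objects) can be sketched as simple penalties on a dense translation field; the L1 and total-variation forms and the weights below are assumptions for illustration, not the paper's exact regularizers.

```python
import numpy as np

def translation_field_regularizer(trans_field, sparsity_weight=1.0, constancy_weight=1.0):
    """Regularize a dense per-pixel 3D translation field of shape (H, W, 3).

    Sparsity: most of the scene is static, so most translations should be zero
    (L1 penalty). Constancy: translations should vary little within rigid moving
    objects (total-variation penalty). Functional forms and weights are assumed.
    """
    sparsity = np.abs(trans_field).mean()
    dx = np.abs(np.diff(trans_field, axis=1)).mean()
    dy = np.abs(np.diff(trans_field, axis=0)).mean()
    return sparsity_weight * sparsity + constancy_weight * (dx + dy)

rng = np.random.default_rng(0)
mostly_static = np.zeros((64, 64, 3))
mostly_static[20:30, 20:40] = 0.5          # one rigid object moving coherently
noisy = rng.standard_normal((64, 64, 3))   # implausible, everywhere-moving field
print(translation_field_regularizer(mostly_static) < translation_field_regularizer(noisy))  # True
```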
    Preview abstract We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero- or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Second, we introduce the concept of loss function evolution, using an evolutionary search algorithm to automatically find the optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Third, we propose an unsupervised representation evaluation metric that uses distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces results similar to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets. View details
    Semantically-Agnostic Unsupervised Monocular Depth Learning in Dynamic Scenes
    Hanhan Li
    Ariel Gordon
    Hang Zhao
    Workshop on Perception for Autonomous Driving, ECCV 2020 (2020)
    Preview abstract We present a method for jointly training the estimation of depth, egomotion, and a dense 3D translation field of objects, suitable for dynamic scenes containing multiple moving objects. Monocular photometric consistency is the sole source of supervision. We show that this apparently heavily-underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: They are sparse, since most of the scene is static, and they tend to be constant through rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work, including methods that require semantic input. View details
    Tiny Video Networks: Architecture Search for Efficient Video Models
    Michael Ryoo
    ICML Workshop on Automated Machine Learning (AutoML) (2020)
    Preview abstract Video understanding is a challenging problem with great impact on real-world applications. Yet, solutions so far have been computationally intensive, with the fastest algorithms running at a few hundred milliseconds per video snippet on powerful GPUs. We use architecture search to build highly efficient models for videos - Tiny Video Networks - which run at unprecedented speeds and, at the same time, are effective at video recognition tasks. The Tiny Video Networks run faster than real time, e.g., at less than 20 milliseconds per video on a GPU, and are much faster than contemporary video models. These models not only provide new tools for real-time applications such as mobile vision and robotics, but also enable fast research and development for video understanding. The project site is available at https://sites.google.com/view/tinyvideonetworks. View details
    AssembleNet++: Assembling Modality Representations via Attention Connectivity
    Michael Ryoo
    Juhana Kangaspunta
    European Conference on Computer Vision (ECCV) (2020)
    Preview abstract We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or modality. Even without any pre-training, our models outperform previous work on standard public activity recognition datasets with continuous videos, establishing a new state of the art. We also confirm that our findings, namely having neural connectivity from the object modality and using peer-attention, are generally applicable to different existing architectures, improving their performance. View details
    Preview abstract Object recognition has seen significant progress in the image domain, with focus primarily on 2D perception. We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses. We present Mask2CAD, which jointly detects objects in real-world images and for each detected object, optimizes for the most similar CAD model and its pose. We construct a joint embedding space between the detected regions of an image corresponding to an object and 3D CAD models, enabling retrieval of CAD models for an input RGB image. This produces a clean, lightweight representation of the objects in an image; this CAD-based representation ensures a valid, efficient shape representation for applications such as content creation or interactive scenarios, and makes a step towards understanding the transformation of real-world imagery to a synthetic domain. Experiments on real-world images from Pix3D demonstrate the advantage of our approach in comparison to state of the art. To facilitate future research, we additionally propose a new image-to-3D baseline on ScanNet which features larger shape diversity, real-world occlusions, and challenging image views. View details
    Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization
    Peter Karkus
    Rico Jonschkowski
    International Conference on Robotics and Automation (ICRA) (2020)
    Preview abstract Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: https://sites.google.com/view/differentiable-mapping. View details
    Preview abstract Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time. View details
    Preview abstract This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos. Learning latent terminals, non-terminals, and production rules directly from continuous data allows the construction of a generative model capturing sequential structures with multiple possibilities. Our model is fully differentiable, and provides easily interpretable results which are important in order to understand the learned structures. It outperforms the state-of-the-art on several challenging datasets and is more accurate for forecasting future activities in videos. We plan to open-source the code at https://sites.google.com/corp/view/differentiable-grammars. View details
    KeyPose: Multi-View 3D Labeling and Keypoint Estimationfor Transparent Objects
    Xingyu Liu
    Rico Jonschkowski
    Kurt Konolige
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
    Preview abstract Estimating the 3D pose of desktop objects is crucial for applications such as robotic manipulation. Many existing approaches to this problem require a depth map of the object for both training and prediction, which restricts them to opaque, lambertian objects that produce good returns in an RGBD sensor. In this paper we forgo using a depth sensor in favor of raw stereo input. We address two problems: first, we establish an easy method for capturing and labeling 3D keypoints on desktop objects with an RGB camera; and second, we develop a deep neural network, called KeyPose, that learns to accurately predict object poses using 3D keypoints, from stereo input, and works even for transparent objects. To evaluate the performance of our method, we create a dataset of 15 clear objects in five classes, with 48K 3D-keypoint labeled images. We train both instance and category models, and show generalization to new textures, poses, and objects. KeyPose surpasses state-of-the-art performance in 3D pose estimation on this dataset by factors of 1.5 to 3.5, even in cases where the competing method is provided with ground-truth depth. Stereo input is essential for this performance as it improves results compared to using monocular input by a factor of 2. We will release a public version of the data capture and labeling pipeline, the transparent object database, and the KeyPose models and evaluation code. Project website: https://sites.google.com/corp/view/keypose. View details
    X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions
    Michael Danielczuk
    Ken Goldberg
    International Conference on Intelligent Robots and Systems (IROS) (2020)
    Preview abstract For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at https://sites.google.com/corp/berkeley.edu/x-ray. View details
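A toy NumPy sketch of the core policy idea above: predict a target-occupancy distribution for each candidate action and pick the action whose distribution has the smallest support. The threshold and the hand-made candidate distributions are stand-ins for the learned network's outputs.

```python
import numpy as np

def support_area(occupancy, eps=1e-3):
    """Number of grid cells where the target-occupancy probability is non-negligible."""
    return int((occupancy > eps).sum())

def choose_action(candidate_distributions):
    """Greedy mechanical-search step: pick the action whose predicted post-action
    occupancy distribution has the smallest support (fewest cells where the target
    could still be hiding). Real candidates would come from the learned network."""
    return min(candidate_distributions, key=lambda a: support_area(candidate_distributions[a]))

rng = np.random.default_rng(0)
wide = rng.random((10, 20))
wide /= wide.sum()                          # target could be almost anywhere
narrow = np.zeros((10, 20))
narrow[4:6, 8:12] = 1 / 8.0                 # target confined to a few cells
print(choose_action({"push_left": wide, "push_right": narrow}))  # push_right
```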
    Probabilistic Object Detection: Definition and Evaluation
    David Hall
    Feras Dayoub
    John Skinner
    Haoyang Zhang
    Dimity Miller
    Peter Corke
    Gustavo Carneiro
    Niko Suenderhauf
    WACV (2020)
    Preview abstract We introduce Probabilistic Object Detection, the task of detecting objects in images and accurately quantifying the spatial and semantic uncertainties of the detections. Given the lack of methods capable of assessing such probabilistic object detections, we present the new Probability-based Detection Quality measure (PDQ). Unlike AP-based measures, PDQ has no arbitrary thresholds and rewards spatial and label quality, and foreground/background separation quality while explicitly penalising false positive and false negative detections. We contrast PDQ with existing mAP and moLRP measures by evaluating state-of-the-art detectors and a Bayesian object detector based on Monte Carlo Dropout. Our experiments indicate that conventional object detectors tend to be spatially overconfident and thus perform poorly on the task of probabilistic object detection. Our paper aims to encourage the development of new object detection approaches that provide detections with accurately estimated spatial and label uncertainties and are of critical importance for deployment on robots and embodied AI systems in the real world. View details
    Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos
    Ankita Pasad
    Ariel Gordon
    Tsung-Yi Lin
    CVPR 2020 Workshop on Learning from Unlabeled Videos (2020) (to appear)
    Preview abstract We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames. The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model, significantly enhancing its quality, or, alternatively, reducing the number of labels the segmentation model needs. Our experiments were performed on the ScanNet dataset. View details
    AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
    Xiaofang Wang
    Xuehan Xiong
    Maxim Neumann
    Michael Ryoo
    Kris Kitani
    Wei Hua
    European Conference on Computer Vision (ECCV) (2020) (to appear)
    Preview abstract Convolutional operations have two limitations: (1) they do not explicitly model where to focus, as the same filter is applied to all positions, and (2) they are unsuitable for modeling long-range dependencies, as they only operate on a small neighborhood. While both limitations can be alleviated by attention operations, many design choices remain to be determined when using attention, especially when applying attention to videos. Towards a principled way of applying attention to videos, we address the task of spatiotemporal attention cell search. We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both the Kinetics-600 and MiT datasets. The discovered attention cells outperform non-local blocks on both datasets, and demonstrate strong generalization across different modalities, backbones, and datasets. Inserting our attention cells into I3D-R50 yields state-of-the-art performance on both datasets. View details
    What Matters in Unsupervised Optical Flow
    Rico Jonschkowski
    Austin Stone
    Ariel Gordon
    Kurt Konolige
    ECCV (2020)
    Preview abstract We systematically compare and analyze a set of key components in unsupervised optical flow to identify which photometric loss, occlusion handling, and smoothness regularization is most effective. Alongside this investigation we construct a number of novel improvements to unsupervised flow models, such as cost volume normalization, stopping the gradient at the occlusion mask, encouraging smoothness before upsampling the flow field, and continual self-supervision with image resizing. By combining the results of our investigation with our improved model components, we are able to present a new unsupervised flow technique that significantly outperforms the previous unsupervised state-of-the-art and performs on par with supervised FlowNet2 on the KITTI 2015 dataset, while also being significantly simpler than related approaches. View details
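A minimal NumPy sketch of the loss components discussed above: a Charbonnier photometric loss on non-occluded pixels (with the occlusion mask treated as a constant, mirroring the gradient-stopping recommendation) plus a first-order smoothness term. The nearest-neighbor warp and the specific constants are simplifications.

```python
import numpy as np

def warp(image, flow):
    """Backward-warp image (H, W) by flow (H, W, 2) with nearest-neighbor sampling."""
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xs2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    ys2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return image[ys2, xs2]

def photometric_loss(img1, img2, flow, occlusion_mask):
    """Charbonnier photometric loss on non-occluded pixels. In a real training loop
    the occlusion mask would be gradient-stopped, as the paper recommends."""
    diff = img1 - warp(img2, flow)
    per_pixel = np.sqrt(diff ** 2 + 1e-6)
    return (per_pixel * occlusion_mask).sum() / (occlusion_mask.sum() + 1e-9)

def smoothness_loss(flow):
    """First-order smoothness of the flow field (edge-aware weighting omitted)."""
    return np.abs(np.diff(flow, axis=0)).mean() + np.abs(np.diff(flow, axis=1)).mean()

rng = np.random.default_rng(0)
img2 = rng.random((32, 32))
true_flow = np.full((32, 32, 2), 2.0)                 # shift by (2, 2) pixels
img1 = warp(img2, true_flow)
mask = np.ones((32, 32))
print(photometric_loss(img1, img2, true_flow, mask))  # near zero for the correct flow
print(smoothness_loss(true_flow))                     # zero for a constant field
```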
    Adversarial Generative Grammars for Human Activity Prediction
    Alexander Toshev
    Michael Ryoo
    European Conference on Computer Vision (ECCV) (2020)
    Preview abstract In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, being able to predict much more accurately and further in the future, than prior work. Code will be open sourced. View details
    ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors
    Jitendra Malik
    Tsung-Yi Lin
    International Conference on Computer Vision (ICCV) (2019)
    Preview abstract Instance segmentation aims to detect and segment individual objects in a scene. Most existing methods rely on precise mask annotations of every category. However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required. We introduce ShapeMask, which learns the intermediate concept of object shape to address the problem of generalization in instance segmentation to novel categories. ShapeMask starts with a bounding box detection and gradually refines it by first estimating the shape of the detected object through a collection of shape priors. Next, ShapeMask refines the coarse shape into an instance level mask by learning instance embeddings. The shape priors provide a strong cue for object-like prediction, and the instance embeddings model the instance specific appearance information. ShapeMask significantly outperforms the state-of-the-art by 6.4 and 3.8 AP when learning across categories, and obtains competitive performance in the fully supervised setting. It is also robust to inaccurate detections, decreased model capacity, and small training data. Moreover, it runs efficiently with 150ms inference time on a GPU and trains within 11 hours on TPUs. With a larger backbone model, ShapeMask increases the gap with state-of-the-art to 9.4 and 6.2 AP across categories. Code will be publicly available at: https://sites.google.com/view/shapemask/home. View details
    Unsupervised monocular depth and ego-motion learning with structure and semantics
    Soeren Pirk
    CVPR Workshop on Visual Odometry & Computer Vision Applications Based on Location Clues (2019)
    Preview abstract We present an approach which takes advantage of both structure and semantics for unsupervised monocular learning of depth and ego-motion. More specifically we model the motions of individual objects and learn their 3D motion vector jointly with depth and egomotion. We obtain more accurate results, especially for challenging dynamic scenes not addressed by previous approaches. This is an extended version of Casser et al. Code and models have been open sourced at: https://sites.google.com/corp/view/struct2depth. View details
    Preview abstract Learning to predict scene depth from RGB inputs is a challenging task both for indoor and outdoor robot navigation. In this work we address unsupervised learning of scene depth and robot ego-motion where supervision is provided by monocular videos, as cameras are the cheapest, least restrictive and most ubiquitous sensor for robotics. Previous work in unsupervised image-to-depth learning has established strong baselines in the domain. We propose a novel approach which produces higher quality results, is able to model moving objects and is shown to transfer across data domains, e.g. from outdoors to indoor scenes. The main idea is to introduce geometric structure in the learning process, by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos as input. Furthermore an online refinement method is introduced to adapt learning on the fly to unknown domains. The proposed approach outperforms all state-of-the-art approaches, including those that handle motion e.g. through learned flow. Our results are comparable in quality to the ones which used stereo as supervision and significantly improve depth prediction on scenes and datasets which contain a lot of object motion. The approach is of practical relevance, as it allows transfer across environments, by transferring models trained on data collected for robot navigation in urban scenes to indoor navigation settings. The code associated with this paper can be found at https://sites.google.com/view/struct2depth. View details
    OnboardDepth: Depth Prediction for Onboard Systems
    Devesh Yamparala
    Justin Vincent
    Chris Leger
    European Conference on Mobile Robots (ECMR) (2019)
    Preview abstract Depth sensing is important for robotics systems for both navigation and manipulation tasks. We here present a learning-based system which predicts accurate scene depth and can take advantage of many types of sensor supervision. We develop an algorithm which combines both supervised and unsupervised constraints to produce high quality depth and which is robust to the presence of noise, sparse sensing, and missing information. Our system is running onboard in real-time, is easy to deploy, and is applicable to a variety of robot platforms. View details
    Evolving Space-Time Neural Architectures for Videos
    Alexander Toshev
    Michael Ryoo
    International Conference on Computer Vision (ICCV) (2019)
    Preview abstract We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutions, obtained promising results by manually designing video CNN architectures. We here develop a novel evolutionary search algorithm that automatically explores models with different types and combinations of layers to jointly learn interactions between spatial and temporal aspects of video representations. We demonstrate the generality of this algorithm by applying it to two meta-architectures, obtaining new architectures superior to manually designed architectures: EvaNet. Further, we propose a new component, the iTGM layer, which more efficiently utilizes its parameters to allow learning of space-time interactions over longer time horizons. The iTGM layer is often preferred by the evolutionary algorithm and allows building cost-efficient networks. The proposed approach discovers new and diverse video architectures that were previously unknown. More importantly they are both more accurate and faster than prior models, and outperform the state-of-the-art results on multiple datasets we test, including HMDB, Kinetics, and Moments in Time. We will open source the code and models, to encourage future model development at https://sites.google.com/corp/view/evanet-video. View details
    Differentiable Mapping Networks: Learning Task-Oriented Latent Maps with Spatial Structure
    Peter Karkus
    Rico Jonschkowski
    Perception as Generative Reasoning Workshop, NeurIPS 2019
    Preview abstract To efficiently operate in previously unseen environments, robots must be able to build a map – an internal representation of the environment – even from a small number of observations. But how should that map be represented and which information should be stored in it, to enable downstream tasks, for example localization? Classic approaches use a fixed map representation with strong spatial structure, such as voxels or point clouds, which makes them applicable to a wide range of robotic tasks. Data-driven approaches, on the other hand, are able to learn rich and robust representations by optimizing them directly for a downstream task. Eslami et al., for example, learn to construct representations of simulated environments from a few images that allow them to generate images from novel viewpoints. The challenge for learning in complex environments is choosing suitable priors that enable generalization while having only a limited amount of data for training. A desirable approach would combine the best of both worlds: retain the spatial structure of the classic approaches, but also leverage the power of deep neural networks to learn a flexible and effective map representation for the downstream task. In this paper we explore how structure and learning can be combined in the context of a sparse visual localization task. View details
    Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras
    Ariel Gordon
    Hanhan Li
    Rico Jonschkowski
    The IEEE International Conference on Computer Vision (ICCV) (2019)
    Preview abstract We present a novel method for simultaneously learning depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as a supervision signal. Similarly to prior work, our method learns by applying differentiable warping to frames and comparing the result to adjacent ones, but it provides several improvements: We address occlusions geometrically and differentiably, directly using the depth maps as predicted during training. We introduce randomized layer normalization, a novel regularizer, and we account for object motion relative to the scene. To the best of our knowledge, our work is the first to learn the camera intrinsic parameters, including lens distortion, from video in an unsupervised manner, thereby allowing us to extract accurate depth and motion from arbitrary videos of unknown origin at scale. We evaluate our results on the Cityscapes, KITTI, and EuRoC MAV datasets, establishing new state of the art on depth prediction and odometry, and demonstrate qualitatively that depth prediction can be learned from a collection of YouTube videos. The code is publicly available at github.com/google-research/google-research/tree/master/depth_from_video_in_the_wild. View details
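The differentiable-warping supervision above relies on projecting pixels from one frame into the next using predicted depth, egomotion, and intrinsics; a bare-bones NumPy version of that reprojection (with plain arrays standing in for the network outputs, and lens distortion omitted) looks like this.

```python
import numpy as np

def reproject(pixels, depth, K, R, t):
    """Warp pixel coordinates from one frame into the next, given per-pixel depth,
    camera intrinsics K, and relative camera motion (R, t). In the paper all of
    depth, egomotion, and K itself are network outputs trained only for
    cross-frame consistency; here they are plain arrays for illustration."""
    ones = np.ones((pixels.shape[0], 1))
    rays = np.linalg.inv(K) @ np.concatenate([pixels, ones], axis=1).T   # unproject to rays
    points = rays * depth                                                # 3D points in frame 1
    moved = R @ points + t[:, None]                                      # move into frame 2
    proj = K @ moved
    return (proj[:2] / proj[2:3]).T                                      # pixel coords in frame 2

K = np.array([[200.0, 0, 64], [0, 200.0, 48], [0, 0, 1]])   # toy "learned" intrinsics
R = np.eye(3)                                               # no rotation between frames
t = np.array([0.1, 0.0, 0.0])                               # small sideways egomotion
px = np.array([[64.0, 48.0], [10.0, 20.0]])
print(reproject(px, np.array([2.0, 4.0]), K, R, t))
```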
    Evolving Losses for Unlabeled Video Representation Learning
    Michael Ryoo
    CVPR 2019 Workshop on Learning from Unlabeled Videos (2019)
    Preview abstract We present a new method to learn video representations from large-scale unlabeled video data. We formulate our unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are also shared across different modalities via distillation. Our formulation allows for the distillation of audio, optical flow and temporal information into a single, RGB-based convolutional neural network. We also compare the effects of using additional unlabeled video data and evaluate our representation learning on standard public video datasets. We further introduce the concept of using an evolutionary algorithm to obtain a better multi-modal, multi-task loss function for training the network. AutoML has successfully been applied to architecture search and data augmentation. Here we extend the concept of AutoML to unsupervised representation learning by automatically finding the optimal weighting of tasks for representation learning. View details
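    The evolved loss can be pictured as a weighted sum of per-task, per-modality losses whose weights are themselves searched by an evolutionary loop. The toy sketch below uses placeholder task names and a random stand-in for the expensive fitness evaluation (training with a candidate weighting and measuring representation quality).

import random

# Toy sketch of evolving a weighting over several self-supervised losses.
# Task names, loss values, and the proxy fitness are placeholders.
TASKS = ["rgb_recon", "flow_distill", "audio_distill", "temporal_order"]

def random_weights():
    return {task: random.random() for task in TASKS}

def combined_loss(weights, per_task_losses):
    return sum(weights[t] * per_task_losses[t] for t in TASKS)

def fitness(weights):
    # Placeholder: stands in for "train with these weights, then measure
    # how useful the learned representation is downstream".
    return -combined_loss(weights, {t: random.random() for t in TASKS})

def evolve_weights(pop=16, gens=20):
    population = [random_weights() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop // 4]
        children = []
        for _ in range(pop - len(survivors)):
            child = dict(random.choice(survivors))
            child[random.choice(TASKS)] = random.random()   # mutate one weight
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

print(evolve_weights())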
    Learning Differentiable Grammars for Videos
    Michael Ryoo
    Bay Area Machine Learning Symposium (BayLearn) (2019)
    Preview abstract This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos. Learning latent terminals, nonterminals, and production rules directly from continuous data allows the construction of a generative model capturing sequential structures with multiple possibilities. Our model is fully differentiable, and provides easily interpretable results which are important in order to understand the learned structures. It outperforms the state-of-the-art on several challenging datasets and is more accurate for forecasting future activities in videos. View details
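    One simple way to make production-rule choices differentiable, shown in the hedged sketch below, is to give each nonterminal trainable logits over its rules and to soften rule selection with a softmax, so the expected terminal distribution at every step is differentiable in the grammar parameters. This illustrates the general idea only, not the paper's exact formulation.

import numpy as np

# Illustrative sketch of a differentiable regular grammar: each nonterminal
# has trainable logits over its production rules (A -> terminal, next
# nonterminal), and rule choices are softened with a softmax so expected
# symbol distributions are differentiable.
rng = np.random.default_rng(0)
NUM_NONTERMINALS, NUM_TERMINALS, RULES_PER_NT = 3, 4, 2

rule_logits = rng.normal(size=(NUM_NONTERMINALS, RULES_PER_NT))       # trainable
rule_terminal = rng.integers(NUM_TERMINALS, size=(NUM_NONTERMINALS, RULES_PER_NT))
rule_next_nt = rng.integers(NUM_NONTERMINALS, size=(NUM_NONTERMINALS, RULES_PER_NT))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rollout(steps=5):
    nt_dist = np.zeros(NUM_NONTERMINALS)
    nt_dist[0] = 1.0                                    # start symbol
    sequence = []
    for _ in range(steps):
        term_dist = np.zeros(NUM_TERMINALS)
        next_nt = np.zeros(NUM_NONTERMINALS)
        for a in range(NUM_NONTERMINALS):
            probs = softmax(rule_logits[a])             # soft rule selection
            for r in range(RULES_PER_NT):
                term_dist[rule_terminal[a, r]] += nt_dist[a] * probs[r]
                next_nt[rule_next_nt[a, r]] += nt_dist[a] * probs[r]
        sequence.append(term_dist)                      # expected terminals at this step
        nt_dist = next_nt
    return sequence

for dist in rollout():
    print(np.round(dist, 3))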
    Evolving Losses for Video Representation Learning
    Michael Ryoo
    Bay Area Machine Learning Symposium (BayLearn) (2019)
    Preview abstract We present a new method to learn video representations from unlabeled data. We formulate our unsupervised representation learning as a multi-modal, multi-task learning problem. We also introduce the concept of finding a better loss function to train such a multi-task, multi-modal representation space using an evolutionary algorithm; our method automatically searches over different combinations of loss functions capturing multiple (self-supervised) tasks and modalities. View details
    Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics
    Soeren Pirk
    CVPR Workshop on Visual Odometry & Computer Vision Applications Based on Location Clues (VOCVALC) (2019)
    Preview abstract We present an approach which takes advantage of both structure and semantics for unsupervised monocular learning of depth and ego-motion. More specifically, we model the motion of individual objects and learn their 3D motion vector jointly with depth and ego-motion. We obtain more accurate results, especially for challenging dynamic scenes not addressed by previous approaches. This is an extended version of Casser et al. [AAAI'19]. Code and models have been open sourced at: https://sites.google.com/view/struct2depth. View details
    EvaNet: A Family of Diverse, Fast and Accurate Video Architectures
    Alexander Toshev
    Michael Ryoo
    Bay Area Machine Learning Symposium (BayLearn) (2019)
    Preview abstract We present a novel evolutionary algorithm that automatically constructs architectures of layers exploring space-time interactions for videos. The discovered architectures are accurate, diverse and efficient. Ensembling such models leads to further accuracy gains and yields faster and more accurate solutions than previous state-of-the-art models. Evolved models can be used across datasets and to build more powerful models for video understanding. View details
    Preview abstract In this paper, we present a new method for evolving video CNN models to find architectures that more optimally capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutional layers, obtained promising results by manually designing CNN architectures for videos. We here develop an evolutionary algorithm that automatically explores models with different types and combinations of space-time convolutional layers to jointly capture various spatial and temporal aspects of video representations. We further propose a new key component in video model evolution, the iTGM layer, which more efficiently utilizes its parameters to allow learning of space-time interactions over longer time horizons. The experiments confirm the advantages of our video CNN architecture evolution, with results outperforming previous state-of-the-art models. Our algorithm discovers new and interesting video architecture structures. View details
    Preview abstract We present a novel approach for unsupervised learning of depth and ego-motion from monocular video. Unsupervised learning removes the need for separate supervisory signals (depth or ego-motion ground truth, or multi-view video). Prior work in unsupervised depth learning uses pixel-wise or gradient-based losses, which only consider pixels in small local neighborhoods. Our main contribution is to explicitly consider the inferred 3D geometry of the scene, enforcing consistency of the estimated 3D point clouds and ego-motion across consecutive frames. This is a challenging task and is solved by a novel (approximate) backpropagation algorithm for aligning 3D structures. We combine this novel 3D-based loss with 2D losses based on photometric quality of frame reconstructions using estimated depth and ego-motion from adjacent frames. We also incorporate validity masks to avoid penalizing areas in which no useful information exists. We test our algorithm on the KITTI dataset and on a video dataset captured on an uncalibrated mobile phone camera. Our proposed approach consistently improves depth estimates on both datasets, and outperforms the state-of-the-art for both depth and ego-motion. Because we only require a simple video, learning depth and ego-motion on large and varied datasets becomes possible. We demonstrate this by training on the low quality uncalibrated video dataset and evaluating on KITTI, ranking among top performing prior methods which are trained on KITTI itself. View details
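    The 3D consistency idea can be sketched as follows: back-project both depth maps into point clouds, move one cloud by the estimated egomotion, and penalize the remaining mismatch. The numpy snippet below uses placeholder depths and egomotion and a simple point-to-point error; the paper's approximate backpropagation through an ICP-style alignment is not reproduced here.

import numpy as np

# Sketch of a 3D consistency term: back-project two depth maps into point
# clouds, move frame t's cloud by the estimated egomotion, and measure how
# far it is from frame t+1's cloud. Depths, intrinsics, and egomotion are
# placeholders standing in for network predictions.
K_inv = np.linalg.inv(np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1.0]]))

def backproject(depth):
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3)
    return depth.reshape(-1, 1) * (pixels @ K_inv.T)    # N x 3 points

depth_t = np.full((4, 6), 10.0)
depth_t1 = np.full((4, 6), 10.0)
R_est, t_est = np.eye(3), np.array([0.1, 0.0, 0.0])     # estimated egomotion

cloud_t = backproject(depth_t) @ R_est.T + t_est        # move cloud into frame t+1
cloud_t1 = backproject(depth_t1)
consistency_loss = np.mean(np.linalg.norm(cloud_t - cloud_t1, axis=1))
print(consistency_loss)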
    Preview abstract Predicting the future to anticipate the outcome of events and actions is a critical attribute of autonomous agents; particularly for agents which must rely heavily on real time visual data for decision making. Working towards this capability, we address the task of predicting future frame segmentation from a stream of monocular video by leveraging the 3D structure of the scene. Our framework is based on learnable sub-modules capable of predicting pixel-wise scene semantic labels, depth, and camera ego-motion of adjacent frames. We further propose a recurrent neural network based model capable of predicting future ego-motion trajectory as a function of a series of past ego-motion steps. Ultimately, we observe that leveraging 3D structure in the model facilitates successful prediction, achieving state of the art accuracy in future semantic segmentation. View details
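    The future ego-motion module can be pictured as a small sequence model that reads past ego-motion steps and is rolled forward to produce a future trajectory. The sketch below uses a GRU and placeholder 6-DoF steps; the sizes and the exact recurrent architecture are assumptions for illustration.

import torch

# Sketch of forecasting ego-motion with a recurrent model: read a series
# of past ego-motion steps and predict the next one, then roll forward.
class EgomotionForecaster(torch.nn.Module):
    def __init__(self, dof=6, hidden=32):
        super().__init__()
        self.rnn = torch.nn.GRU(dof, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, dof)

    def forward(self, past_steps):
        out, _ = self.rnn(past_steps)          # B x T x hidden
        return self.head(out[:, -1])           # predicted next ego-motion step

model = EgomotionForecaster()
past = torch.randn(1, 8, 6)                    # 8 past ego-motion steps (placeholder)
trajectory = []
for _ in range(4):                             # roll out 4 future steps
    step = model(past)
    trajectory.append(step)
    past = torch.cat([past[:, 1:], step.unsqueeze(1)], dim=1)
print(torch.stack(trajectory, dim=1).shape)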
    Preview abstract Predicting the future to anticipate the outcome of events and actions is a critical attribute of autonomous agents. In this work, we address the task of predicting future frame segmentation from a stream of monocular video by leveraging the 3D structure of the scene. Our framework is based on learnable sub-modules capable of predicting pixelwise scene semantic labels, depth, and camera ego-motion of adjacent frames. Ultimately, we observe that leveraging 3D structure in the model facilitates successful positioning of objects in the 3D scene, achieving state of the art accuracy in future semantic segmentation. View details
    Preview abstract We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling using only unlabeled object proposals. The key contributions of our work are 1) a k-means clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism which will make the approach completely unsupervised. View details
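    A minimal sketch of a differentiable k-means-style objective is given below: the cluster centers are ordinary network parameters (the memory units), assignments are softened so the objective is differentiable, and the embedding is optimized jointly. Dimensions, data, and the optimizer are placeholder assumptions; in practice additional terms are needed to prevent trivial collapse.

import torch

# Minimal sketch of a differentiable k-means-style objective: cluster
# centers are trainable parameters, assignments are softened so the whole
# objective can be minimized jointly with the embedding by SGD.
torch.manual_seed(0)
num_clusters, dim = 5, 16
centers = torch.nn.Parameter(torch.randn(num_clusters, dim))
embedder = torch.nn.Linear(32, dim)                   # stands in for a deep embedding
optimizer = torch.optim.Adam(list(embedder.parameters()) + [centers], lr=1e-2)

features = torch.randn(128, 32)                       # unlabeled object proposals (placeholder)
for step in range(100):
    z = embedder(features)                            # B x dim embeddings
    dists = torch.cdist(z, centers)                   # B x K distances to centers
    assign = torch.softmax(-dists, dim=1)             # soft cluster assignments
    loss = (assign * dists.pow(2)).sum(dim=1).mean()  # soft k-means objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))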
    Preview abstract We consider the problem of next frame prediction from video input. A recurrent convolutional neural network is trained to predict depth from monocular video input, which, along with the current video image and the camera trajectory, can then be used to compute the next frame. Unlike prior next-frame prediction approaches, we take advantage of the scene geometry and use the predicted depth for generating the next frame prediction. Our approach can produce rich next frame predictions which include depth information attached to each pixel. Another novel aspect of our approach is that it predicts depth from a sequence of images (e.g. in a video), rather than from a single still image. We evaluate the proposed approach on the KITTI dataset, a standard dataset for benchmarking tasks relevant to autonomous driving. The proposed method produces results which are visually and numerically superior to existing methods that directly predict the next frame. We show that the accuracy of depth prediction improves as more prior frames are considered. View details
    Preview abstract We approach structured output prediction by learning a deep value network (DVN) that evaluates different output structures for a given input. For example, when applied to image segmentation, the value network takes an image and a segmentation mask as inputs and predicts a scalar score evaluating the mask quality and its correspondence with the image. Once the value network is optimized, at inference, it finds output structures that maximize the score of the value net via gradient descent on continuous relaxations of structured outputs. Thus DVN takes advantage of the joint modeling of the inputs and outputs. Our framework applies to a wide range of structured output prediction problems. We conduct experiments on multi-label classification based on text data and on image segmentation problems. DVN outperforms several strong baselines and the state-of-the-art results on these benchmarks. In addition, on image segmentation, the proposed deep value network learns complex shape priors and effectively combines image information with the prior to obtain competitive segmentation results. View details
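    At inference time the key step is gradient ascent on a continuous relaxation of the output, with the value network frozen. The sketch below shows that loop with a tiny untrained placeholder value network and a relaxed segmentation mask clamped to [0, 1]; it illustrates the procedure rather than the trained model from the paper.

import torch

# Sketch of DVN-style inference: with the value network frozen, run
# gradient ascent on a relaxed output mask to find the structure the
# value net scores highest.
torch.manual_seed(0)
value_net = torch.nn.Sequential(              # scores an (image, mask) pair
    torch.nn.Linear(64 + 64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
for p in value_net.parameters():
    p.requires_grad_(False)                   # only the output is optimized

image = torch.randn(1, 64)                    # placeholder image features
mask = torch.full((1, 64), 0.5, requires_grad=True)   # relaxed mask in [0, 1]
optimizer = torch.optim.SGD([mask], lr=0.1)

for step in range(50):
    score = value_net(torch.cat([image, mask], dim=1)).mean()
    optimizer.zero_grad()
    (-score).backward()                       # ascend on the predicted score
    optimizer.step()
    with torch.no_grad():
        mask.clamp_(0.0, 1.0)                 # keep the relaxation feasible

print(float(value_net(torch.cat([image, mask], dim=1))))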
    Preview abstract Learning a set of diverse and representative features from a large set of unlabeled data has long been an area of active research. We present a method that separates proposals of potential objects into semantic classes in an unsupervised manner. Our preliminary results show that different object categories emerge and can later be retrieved from test images. We propose a differentiable clustering approach which can be integrated with Deep Neural Networks to learn semantic classes in an end-to-end fashion without manual class labeling. View details
    Learning with Proxy Supervision for End-To-End Visual Learning
    Jiří Čermák
    Deep Learning for Vehicle Perception Workshop, Intelligent Vehicles Symposium (2017)
    Preview abstract Learning with deep neural networks forms the state-of-the-art in many tasks such as image classification, image detection, speech recognition, and text analysis. We here set out to gain an understanding of learning in an ‘end-to-end’ manner for an autonomous vehicle, which refers to directly learning the decision that will result from the perception of the scene. For example, we consider learning a binary ‘stop’/‘go’ decision, with respect to pedestrians, given the input image. In this work we propose to use additional information, referred to as ‘proxy supervision’, for improved learning and study its effects on the overall performance. We show that the proxy labels significantly improve the robustness of learning, while achieving as good, or better, accuracy than in the original task of binary classification. View details
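    The proxy-supervision setup can be sketched as a shared trunk with two heads: the main stop/go decision and an auxiliary head trained on the extra proxy labels, combined into one loss. The network sizes, the proxy target, and the weighting below are illustrative assumptions.

import torch

# Sketch of proxy supervision: a shared trunk feeds both the main binary
# stop/go head and an auxiliary head trained on proxy labels (here a
# placeholder 4-dimensional regression target).
trunk = torch.nn.Sequential(torch.nn.Linear(512, 128), torch.nn.ReLU())
decision_head = torch.nn.Linear(128, 1)       # binary stop/go logit
proxy_head = torch.nn.Linear(128, 4)          # e.g. a coarse pedestrian box

features = torch.randn(16, 512)               # placeholder image features
stop_go_labels = torch.randint(0, 2, (16, 1)).float()
proxy_labels = torch.rand(16, 4)

h = trunk(features)
main_loss = torch.nn.functional.binary_cross_entropy_with_logits(
    decision_head(h), stop_go_labels)
proxy_loss = torch.nn.functional.mse_loss(proxy_head(h), proxy_labels)
total_loss = main_loss + 0.5 * proxy_loss     # proxy weight is illustrative
total_loss.backward()
print(float(total_loss))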
    Improved generator objectives for GANs
    Jascha Sohl-Dickstein
    NIPS Workshop on Adversarial Learning (2016)
    Preview abstract We present a new framework to understand GAN training as alternating density ratio estimation with divergence minimization. This provides a new interpretation for the GAN generator objective used in practice and explains the problem of poor sample diversity. Furthermore, we derive a family of objectives that target arbitrary f-divergences without minimizing a lower bound, and use them to train generative image models that target either improved sample quality or sample diversity. View details
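    The density-ratio view can be illustrated as follows: a logistic-loss discriminator's logit on a sample approximates log(p_data / p_gen), and different functions of that estimated log-ratio yield generator objectives with different divergence-targeting behavior. The sketch below shows three standard examples; the specific family of objectives derived in the paper is not reproduced here.

import torch
import torch.nn.functional as F

# Illustrative sketch of generator objectives expressed through the
# discriminator's estimated log density ratio. With D = sigmoid(logit)
# trained by logistic loss, logit ~ log(p_data / p_gen).
def generator_losses(disc_logits_on_fakes):
    log_ratio = disc_logits_on_fakes          # ~ log(p_data / p_gen)
    return {
        "minimax": (-F.softplus(log_ratio)).mean(),        # E[log(1 - D)]
        "non_saturating": F.softplus(-log_ratio).mean(),   # -E[log D]
        "reverse_kl_like": (-log_ratio).mean(),            # targets KL(p_gen || p_data)
    }

fake_logits = torch.randn(8)                  # placeholder discriminator outputs
for name, loss in generator_losses(fake_logits).items():
    print(name, float(loss))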
    Real-Time Pedestrian Detection With Deep Network Cascades
    Alex Krizhevsky
    Abhijit Ogale
    Dave Ferguson
    Proceedings of BMVC 2015
    Preview abstract We present a new real-time approach to object detection that exploits the efficiency of cascade classifiers with the accuracy of deep neural networks. Deep networks have been shown to excel at classification tasks, and their ability to operate on raw pixel input without the need to design special features is very appealing. However, deep nets are notoriously slow at inference time. In this paper, we propose an approach that cascades deep nets and fast features, and that is both extremely fast and extremely accurate. We apply it to the challenging task of pedestrian detection. Our algorithm runs in real time at 15 frames per second. The resulting approach achieves a 26.2% average miss rate on the Caltech Pedestrian detection benchmark, which is competitive with the very best reported results. It is the first work we are aware of that achieves extremely high accuracy while running in real time. View details
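    The cascade idea can be sketched in a few lines: a very cheap classifier scores every candidate window, and only the small surviving fraction is passed to the expensive deep network. Both classifiers below are placeholders, standing in for the fast and accurate stages described in the abstract.

import numpy as np

# Toy sketch of a detection cascade: score all windows cheaply, keep a
# small fraction, and only run the expensive model on those survivors.
def cheap_score(window):
    return float(window.mean())                  # stand-in for a fast first stage

def deep_net_score(window):
    return float(np.tanh(window).mean())         # stand-in for a slow, accurate deep net

def cascade_detect(windows, keep_fraction=0.05, threshold=0.5):
    scores = np.array([cheap_score(w) for w in windows])
    keep = np.argsort(scores)[-max(1, int(len(windows) * keep_fraction)):]
    return [i for i in keep if deep_net_score(windows[i]) > threshold]

windows = [np.random.rand(64, 32) for _ in range(1000)]   # candidate pedestrian boxes
print(cascade_detect(windows))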
    Real-Time Grasp Detection Using Convolutional Neural Networks
    Joseph Redmon
    International Conference on Robotics and Automation (ICRA), IEEE (2015)
    Preview abstract We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can simultaneously perform classification so that in a single step it recognizes the object and finds a good grasp rectangle. A modification to this model predicts multiple grasps per object by using a locally constrained prediction mechanism. The locally constrained model performs significantly better, especially on objects that can be grasped in a variety of ways. View details
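    As a hedged illustration of single-stage grasp regression, the sketch below maps pooled image features directly to one grasp rectangle plus a class score, with no sliding windows or proposals. The (x, y, w, h, theta) parametrization, feature dimension, and head sizes are assumptions, not the paper's exact design.

import torch

# Minimal sketch of a single-stage grasp regression head: one pass from
# image features to a grasp rectangle and an object class.
class GraspRegressor(torch.nn.Module):
    def __init__(self, feature_dim=256, num_classes=10):
        super().__init__()
        self.grasp_head = torch.nn.Linear(feature_dim, 5)        # x, y, w, h, theta
        self.class_head = torch.nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        grasp = self.grasp_head(features)
        return {"x": grasp[:, 0], "y": grasp[:, 1],
                "w": grasp[:, 2], "h": grasp[:, 3], "theta": grasp[:, 4],
                "class_logits": self.class_head(features)}

model = GraspRegressor()
features = torch.randn(2, 256)               # placeholder pooled image features
out = model(features)
print(out["theta"], out["class_logits"].argmax(dim=1))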
    Preview abstract Pedestrian detection is of crucial importance to autonomous driving applications. Methods based on deep learning have shown significant improvements in accuracy, which makes them particularly suitable for applications, such as pedestrian detection, where reducing the miss rate is very important. Although they are accurate, their runtime has been at best in seconds per image, which makes them impractical for onboard applications. We present here a Large-Field-Of-View (LFOV) deep network for pedestrian detection, which can achieve high accuracy and is designed to make deep networks work faster for detection problems. The idea of the proposed Large-Field-of-View deep network is to learn to make classification decisions simultaneously and accurately at multiple locations. The LFOV network processes larger image areas at much faster speeds than typical deep networks have been able to do, and can intrinsically reuse computations. Our pedestrian detection solution, which is a combination of an LFOV network and a standard deep network, runs at 280 ms per image on a GPU and achieves a 35.85% average miss rate on the Caltech Pedestrian Detection Benchmark. View details
    Object Recognition from Short Videos for Robotic Perception
    Ivan Bogun
    Navdeep Jaitly
    CoRR, vol. abs/1509.01602 (2015)
    Preview abstract Deep neural networks have become the primary learning technique for object recognition. Videos, unlike still images, are temporally coherent which makes the application of deep networks non-trivial. Here, we investigate how motion can aid object recognition in short videos. Our approach is based on Long Short-Term Memory (LSTM) deep networks. Unlike previous applications of LSTMs, we implement each gate as a convolution. We show that convolutional-based LSTM models are capable of learning motion dependencies and are able to improve the recognition accuracy when more frames in a sequence are available. We evaluate our approach on the Washington RGBD Object dataset and on the Washington RGBD Scenes dataset. Our approach outperforms deep nets applied to still images and sets a new state-of-the-art in this domain. View details
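    Implementing each LSTM gate as a convolution keeps the hidden state spatial across frames. The sketch below is a generic convolutional LSTM cell with placeholder channel counts and kernel size, applied to a short random clip; it illustrates the mechanism rather than the exact model in the paper.

import torch

# Sketch of an LSTM cell whose gates are convolutions rather than matrix
# multiplies, so the hidden state keeps its spatial layout across frames.
class ConvLSTMCell(torch.nn.Module):
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.gates = torch.nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                                     kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                    # update the cell state
        h = o * torch.tanh(c)                # new spatial hidden state
        return h, (h, c)

cell = ConvLSTMCell(in_ch=3, hidden_ch=8)
frames = torch.randn(5, 1, 3, 32, 32)        # a short video clip (placeholder)
h = torch.zeros(1, 8, 32, 32)
c = torch.zeros(1, 8, 32, 32)
for frame in frames:
    out, (h, c) = cell(frame, (h, c))
print(out.shape)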
    Feature combination with Multi-Kernel Learning for Fine-Grained Visual Classification
    Alexandru Niculescu-Mizil
    IEEE Winter Conference on Applications of Computer Vision (WACV) (2014)
    Benchmarking Large-Scale Fine-Grained Categorization
    Phil Long
    IEEE Winter Conference on Applications of Computer Vision (WACV) (2014)
    Efficient object detection and segmentation for fine-grained recognition
    Shenghuo Zhu
    Computer Vision and Pattern Recognition (CVPR), IEEE (2013)
    Development and Deployment of a Large-Scale Flower Recognition Mobile App
    Shenghuo Zhu
    Yuanqing Lin
    Josephine Wong
    Chelsea Specht
    NEC Laboratories America (2012)
    Terrain Adaptive Navigation for Planetary Rovers
    Daniel Helmick
    Larry Matthies
    Journal of Field Robotics (JFR) (2009)
    Characterization of traverse slippage experienced by Spirit rover on Husband Hill at Gusev crater
    Rongxing Li
    Bo Wu
    Kaichang Di
    Raymond E. Arvidson
    I-Chieh Lee
    Mark Maimone
    Larry H. Matthies
    Lutz Richter
    Robert Sullivan
    Michael H. Sims
    Rebecca Greenberger
    Steven W. Squyres
    Journal of Geophysical Research - Planets (2008)
    Visual Prediction of Rover Slip: Learning Algorithms and Field Experiments
    Ph.D. Thesis, California Institute of Technology (2008)
    Experimental results from a terrain adaptive navigation system for planetary rovers
    Daniel Helmick
    Larry Matthies
    Chris Brooks
    Ibrahim Halatci
    Steve Dubowsky
    Karl Iagnemma
    International Symposium on Artificial Intelligence, Robotics and Automation in Space (2008)
    Terrain Adaptive Navigation for a Mars Rover
    Daniel Helmick
    Matthew Livianu
    Larry Matthies
    IEEE Aerospace Conference (2007)
    Learning And Prediction of Slip Using Visual Information
    Larry Matthies
    Daniel Helmick
    Pietro Perona
    Journal of Field Robotics (JFR) (2007)
    Dimensionality Reduction Using Automatic Supervision for Vision-Based Terrain Learning
    Larry Matthies
    Daniel Helmick
    Pietro Perona
    Robotics: Science and Systems (RSS) (2007)
    Learning Slip Behavior Using Automatic Mechanical Supervision
    Larry Matthies
    Daniel Helmick
    Pietro Perona
    IEEE International Conference on Robotics and Automation (ICRA) (2007)
    Fast Terrain Classification Using Variable-Length Representation for Autonomous Navigation
    Larry Matthies
    Daniel Helmick
    Pietro Perona
    Computer Vision and Pattern Recognition (CVPR), IEEE (2007)
    Computer Vision on Mars
    Larry Matthies
    Mark Maimone
    Andrew Johnson
    Yang Cheng
    Reg Willson
    Carlos Villalpando
    Steve Goldberg
    Andres Huertas
    Andrew Stein
    International Journal of Computer Vision (IJCV) (2007)
    Learning to Predict Slip for Ground Robots
    Larry Matthies
    Daniel Helmick
    Gabe Sibley
    Pietro Perona
    IEEE International Conference on Robotics and Automation (ICRA) (2006)
    Towards Learned Traversability for Robot Navigation: From Underfoot to the Far Field
    Andrew Howard
    Michael Turmon
    Larry Matthies
    Benyang Tang
    Eric Mjolsness
    Journal of Field Robotics (JFR) (2006)
    Slip Prediction Using Visual Information
    Larry Matthies
    Daniel Helmick
    Pietro Perona
    Robotics: Science and Systems (RSS) (2006)
    Learning for Autonomous Navigation
    Larry Matthies
    Michael Turmon
    Andrew Howard
    Benyang Tang
    Eric Mjolsness
    NIPS, Workshop on Machine Learning Based Robotics in Unstructured Environments (2005)
    Pruning Training Sets for Learning of Object Categories
    Yaser Abu-Mostafa
    Pietro Perona
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
    Data Pruning