Susanna Ricco

My research sits at the intersection of computer vision and ML fairness. I lead a team developing techniques to bring more inclusive machine learning systems to Google products and the broader community. I have a Ph.D. in computer vision from Duke University, where my research focused on long-term dense motion estimation in video.
Authored Publications
A Step Toward More Inclusive People Annotations for Fairness
AAAI/ACM Conference on AI, Ethics, and Society (AIES) (2021)
Abstract: The Open Images Dataset contains approximately 9 million images and is a widely used dataset for computer vision research. As is common practice for large datasets, the annotations are not exhaustive: bounding boxes and attribute labels exist for only a subset of the classes in each image. In this paper, we present a new set of annotations on a subset of the Open Images dataset, called the "MIAP (More Inclusive Annotations for People)" subset, containing bounding boxes and attributes for all of the people visible in those images. The attributes and labeling methodology for the MIAP subset were designed to enable research into model fairness. In addition, we analyze the original annotation methodology for the person class and its subclasses, discussing the resulting patterns to inform future annotation efforts. By considering both the original and exhaustive annotation sets, researchers can now also study how systematic patterns in training annotations affect modeling.
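As a rough illustration of how per-box annotations like these are typically consumed, the sketch below loads one row per annotated person box and compares box statistics across an attribute column. This is not the official loader; the file name and column names (ImageID, XMin, XMax, YMin, YMax, AgePresentation) are assumptions to be checked against the actual MIAP release.

```python
import pandas as pd

# Hypothetical file and column names; verify against the real MIAP schema.
boxes = pd.read_csv("miap_boxes_train.csv")  # one row per annotated person box

# Normalized box coordinates -> relative box area, useful for size analyses.
boxes["area"] = (boxes["XMax"] - boxes["XMin"]) * (boxes["YMax"] - boxes["YMin"])

# Because the MIAP subset is exhaustively annotated, grouping by image gives
# the true number of visible people per image, not just a labeled subset.
people_per_image = boxes.groupby("ImageID").size()
print(people_per_image.describe())

# Slice by an attribute column (assumed name) to compare subgroup statistics.
for value, group in boxes.groupby("AgePresentation"):
    print(value, len(group), group["area"].median())
```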
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels, with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations, with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) the use of movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon current state-of-the-art methods and demonstrates better performance on the JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for new approaches to video understanding.
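To make the annotation structure concrete, here is a small, hypothetical reader for AVA-style CSV rows, plus a check of how often a single person keyframe carries multiple action labels. The field order (video_id, timestamp, box corners, action_id, person_id) and the file name are assumptions; verify them against the released AVA format before relying on them.

```python
import csv
from collections import defaultdict

def load_ava(path):
    """Collect the set of action labels per (video, keyframe, person) triple."""
    labels_per_person = defaultdict(set)
    with open(path, newline="") as f:
        for video_id, ts, x1, y1, x2, y2, action_id, person_id in csv.reader(f):
            key = (video_id, float(ts), int(person_id))
            labels_per_person[key].add(int(action_id))
    return labels_per_person

# Multiple labels per person occur frequently: count keyframes where a single
# tracked person carries more than one atomic action label.
labels = load_ava("ava_train.csv")  # hypothetical filename
multi = sum(1 for actions in labels.values() if len(actions) > 1)
print(f"{multi} of {len(labels)} person keyframes have multiple action labels")
```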
SfM-Net: Learning of Structure and Motion from Video
arXiv preprint (2017)
Abstract: We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, and camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels, and backpropagates. The model can be trained with various degrees of supervision: (1) completely unsupervised, (2) supervised by ego-motion (camera motion), (3) supervised by depth (e.g., as provided by RGBD sensors), or (4) supervised by ground-truth optical flow. We show that SfM-Net successfully estimates segmentation of the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates, or in-fills depth from RGBD sensors, and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research in geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM), and Structure from Motion (SfM). SfM-Net is an important first step towards providing a learning-based approach for such tasks. A major benefit over existing optimization approaches is that our proposed method can improve itself by processing more videos, and by learning to explicitly model moving objects in dynamic scenes.
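The geometric core of the abstract, turning per-pixel depth plus a rigid camera motion into a dense frame-to-frame flow field, can be sketched in a few lines of NumPy. This is an illustration of the underlying geometry only (static scene, pinhole camera, camera-induced flow), not the authors' network code; it omits the independently moving objects SfM-Net also models.

```python
import numpy as np

def rigid_flow_from_depth(depth, K, R, t):
    """Camera-induced optical flow for a static scene.

    depth: (H, W) per-pixel depth in frame 1
    K:     (3, 3) camera intrinsics
    R, t:  rotation (3, 3) and translation (3,) from frame 1 to frame 2
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW

    # Backproject pixels to 3D using depth, move them with (R, t), reproject.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    pts2 = R @ pts + t[:, None]
    proj = K @ pts2
    uv2 = (proj[:2] / proj[2:]).T.reshape(H, W, 2)

    return uv2 - np.stack([u, v], axis=-1)  # frame-to-frame pixel motion

# Toy example: a fronto-parallel plane at 5 m seen by a camera sliding along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
flow = rigid_flow_from_depth(np.full((480, 640), 5.0), K, np.eye(3),
                             np.array([0.1, 0.0, 0.0]))
print(flow.shape, flow[0, 0])
```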
Discovering the Physical Parts of an Articulated Object Class from Multiple Videos
Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Abstract: We propose a method to discover the physical parts of an articulated object class (e.g., tiger, horse) from multiple videos. Since the individual parts of an object can move independently of one another, we discover them as object regions that consistently move relative to the rest of the object across videos. We then learn a location model of the parts and segment them accurately in the individual videos using an energy function that also enforces temporal and spatial consistency in the motion of the parts. Traditional methods for motion segmentation or non-rigid structure from motion cannot discover parts unless they display independent motion, since they operate on one video at a time. Our method overcomes this problem by discovering parts across videos, which allows parts found in videos where they move to be transferred to videos where they do not. We evaluate our method on a new dataset of 32 videos of tigers and horses, where we significantly outperform state-of-the-art motion segmentation on the task of part discovery (roughly twice the accuracy).
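A toy sketch of the central cue, scoring a candidate region by how consistently it moves relative to the rest of the object across a video collection, is shown below. It is purely illustrative and far simpler than the paper's energy function with spatial and temporal consistency terms; all function and variable names are hypothetical.

```python
import numpy as np

def relative_motion_score(region_flows, object_flows):
    """Average relative-motion magnitude of a candidate region over videos.

    region_flows, object_flows: lists (one entry per video) of (T, 2) arrays
    holding the mean optical-flow vector per frame for the region / whole object.
    """
    scores = []
    for region, whole in zip(region_flows, object_flows):
        residual = region - whole  # per-frame motion relative to the object
        scores.append(np.linalg.norm(residual, axis=1).mean())
    # Averaging across videos lets parts that are static in some videos still
    # be discovered, as long as they move independently in others.
    return float(np.mean(scores))

# Toy usage with random per-frame mean flow vectors for 3 videos of 50 frames.
rng = np.random.default_rng(0)
region = [rng.normal(size=(50, 2)) for _ in range(3)]
whole = [rng.normal(size=(50, 2)) for _ in range(3)]
print(relative_motion_score(region, whole))
```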
Video Motion for Every Visible Point
Susanna Ricco, Carlo Tomasi
International Conference on Computer Vision (ICCV) (2013)