Sean Fanello

I am a Research Scientist and Manager at Google, where I am leading efforts to solve real-world human perception tasks, often relying on performance capture and neural rendering to train deep learning models that generalize to in-the-wild applications. My research interests include 3D performance capture, photorealistic rendering, neural rendering, relighting, and viewpoint synthesis. Previously, I was a Senior Scientist and a Founding Team Member at perceptiveIO, Inc., where I developed computer vision and machine learning algorithms for 3D sensing, visual recognition and human-computer interaction. Prior to that, I was a Post-Doc Researcher in the Interactive 3D Technologies (I3D) group at Microsoft Research Redmond, where I substantially contributed to the HoloLens 3D sensing capabilities. I was also one of the main contributors to the Holoportation project. I obtained my PhD in Robotics, Cognition and Interaction Technologies at the Italian Institute of Technology in collaboration with the University of Genoa in 2013. During my PhD I developed computer vision and machine learning techniques for the iCub humanoid robot. In 2010 I completed my Master’s Degree in Computer Engineering at Sapienza University of Rome, with a specialization in Artificial Intelligence and Pattern Recognition. Personal website: http://seanfanello.it
Authored Publications
    Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
    Phil A. Chou
    Berivan Isik
    Hugues Hoppe
    Danhang Tang
    Jonathan Taylor
    Philip Davidson
    arXiv:2402.05887 (2024)
    We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec’s performance on its intended content, it can effectively adapt the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit “neural code images” that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec’s design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 3 adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potentials in image/video compression and streaming.
    Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
    Ziqian Bai
    Danhang "Danny" Tang
    Di Qiu
    Abhimitra Meka
    Mingsong Dou
    Ping Tan
    Thabo Beeler
    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE
    We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expressions synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.
    Multi-Camera Lighting Estimation for Photorealistic Front-Facing Mobile AR
    Yiqin Zhao
    Tian Guo
    Association for Computing Machinery, New York, NY, USA (2023), 68–73
    Lighting estimation plays an important role in virtual object composition, including mobile augmented reality (AR) applications. Prior work often targets recovering lighting from the physical environment to support photorealistic AR rendering. Because the common workflow is to use a backward-facing camera to capture the overlay of the physical world and virtual objects, we refer to this usage pattern as backward-facing AR. However, existing methods often fall short of supporting emerging front-facing virtual try-on applications where a mobile user leverages a front-facing camera to explore the effect of various products, e.g., glasses or hats, of different styles. This lack of support can be attributed to the unique challenges of obtaining 360° HDR environment maps, an ideal format of lighting representation, from the front-facing camera. In this paper, we propose to leverage a dual-camera streaming setup (front and backward-facing), to perform multi-view lighting estimation. Our approach results in improved rendering quality and visually coherent AR try-on experiences. Our contributions include energy conserving data capturing, high-quality environment map generation, and parametric directional light estimation.
    Sandwiched Image Compression: Increasing the resolution and dynamic range of standard codecs
    Phil Chou
    Hugues Hoppe
    Danhang "Danny" Tang
    Philip Davidson
    2022 Picture Coding Symposium (PCS), IEEE (to appear)
    Given a standard image codec, we compress images that may have higher resolution and/or higher bit depth than allowed in the codec's specifications, by sandwiching the standard codec between a neural pre-processor (before the standard encoder) and a neural post-processor (after the standard decoder). Using a differentiable proxy for the standard codec, we design the neural pre- and post-processors to transport the high resolution (super-resolution, SR) or high bit depth (high dynamic range, HDR) images as lower resolution and lower bit depth images. The neural processors accomplish this with spatially coded modulation, which acts as watermarks to preserve the important image detail during compression. Experiments show that compared to conventional methods of transmitting high resolution or high bit depth through lower resolution or lower bit depth codecs, our sandwich architecture gains ~9 dB for SR images and ~3 dB for HDR images at the same rate over large test sets. We also observe significant gains in visual quality.
    Neural Light Transport for Relighting and View Synthesis
    Xiuming Zhang
    Yun-Ta Tsai
    Tiancheng Sun
    Tianfan Xue
    Philip Davidson
    Christoph Rhemann
    Paul Debevec
    Ravi Ramamoorthi
    ACM Transactions on Graphics, vol. 40 (2021)
    The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without separate treatment for both problems that prior work requires.
    HumanGPS: Geodesic PreServing Feature for Dense Human Correspondence
    Danhang "Danny" Tang
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sofien Bouaziz
    Ping Tan
    Computer Vision and Pattern Recognition 2021 (2021), pp. 8
    In this paper, we address the problem of building dense correspondences between human images under arbitrary camera viewpoints and body poses. Prior art either assumes small motion between frames or relies on local descriptors, which cannot handle large motion or visually ambiguous body parts, e.g. left vs. right hand. In contrast, we propose a deep learning framework that maps each pixel to a feature space, where the feature distances reflect the geodesic distances among pixels as if they were projected onto the surface of a 3D human scan. To this end, we introduce novel loss functions to push features apart according to their geodesic distances on the surface. Without any semantic annotation, the proposed embeddings automatically learn to differentiate visually similar parts and align different subjects into a unified feature space. Extensive experiments show that the learned embeddings can produce accurate correspondences between images with remarkable generalization capabilities on both intra and inter subjects.
    Sandwiched Image Compression: Wrapping Neural Networks Around a Standard Codec
    Phil Chou
    Hugues Hoppe
    Danhang "Danny" Tang
    Philip Davidson
    2021 IEEE International Conference on Image Processing (ICIP), IEEE, Anchorage, Alaska, pp. 3757-3761
    We sandwich a standard image codec between two neural networks: a preprocessor that outputs neural codes, and a postprocessor that reconstructs the image. The neural codes are compressed as ordinary images by the standard codec. Using differentiable proxies for both rate and distortion, we develop a rate-distortion optimization framework that trains the networks to generate neural codes that are efficiently compressible as images. This architecture not only improves rate-distortion performance for ordinary RGB images, but also enables efficient compression of alternative image types (such as normal maps of computer graphics) using standard image codecs. Results demonstrate the effectiveness and flexibility of neural processing in mapping a variety of input data modalities to the rigid structure of standard codecs. A surprising result is that the rate-distortion-optimized neural processing seamlessly learns to transport color images using a single-channel (grayscale) codec.
    We propose a novel system for portrait relighting and background replacement, which maintains high-frequency boundary details and accurately synthesizes the subject’s appearance as lit by novel illumination, thereby producing realistic composite images for any desired scene. Our technique includes foreground estimation via alpha matting, relighting, and compositing. We demonstrate that each of these stages can be tackled in a sequential pipeline without the use of priors (e.g. known background or known illumination) and with no specialized acquisition techniques, using only a single RGB portrait image and a novel, target HDR lighting environment as inputs. We train our model using relit portraits of subjects captured in a light stage computational illumination system, which records multiple lighting conditions, high quality geometry, and accurate alpha mattes. To perform realistic relighting for compositing, we introduce a novel per-pixel lighting representation in a deep learning framework, which explicitly models the diffuse and the specular components of appearance, producing relit portraits with convincingly rendered non-Lambertian effects like specular highlights. Multiple experiments and comparisons show the effectiveness of the proposed approach when applied to in-the-wild images.
    Multiresolution Deep Implicit Functions for 3D Shape Representation
    Zhang Chen
    Kyle Genova
    Sofien Bouaziz
    Christian Haene
    Cem Keskin
    Danhang "Danny" Tang
    ICCV (2021)
    We introduce Multiresolution Deep Implicit Functions (MDIF), a hierarchical representation that can recover fine details, while being able to perform more global operations such as shape completion. Our model represents a complex 3D shape with a hierarchy of latent grids, which can be decoded into different resolutions. Training is performed in an encoder-decoder manner, while decoder-only optimization is supported during inference, so the model can better generalize to novel objects, especially when performing shape completion. To the best of our knowledge, MDIF is the first model that can at the same time (1) reconstruct local detail; (2) perform decoder-only inference; (3) fulfill shape reconstruction and completion. We demonstrate superior performance against prior art in our experiments.
    State of the Art on Neural Rendering
    Ayush Tewari
    Christian Theobalt
    Eli Shechtman
    Gordon Wetzstein
    Jason Saragih
    Jun-Yan Zhu
    Justus Thies
    Kalyan Sunkavalli
    Maneesh Agrawala
    Matthias Niessner
    Michael Zollhöfer
    Ohad Fried
    Ricardo Martin Brualla
    Stephen Lombardi
    Tomas Simon
    Vincent Sitzmann
    Computer Graphics Forum (2020)
    The efficient rendering of photo-realistic virtual worlds is a long-standing effort of computer graphics. Over the last few years, rapid orthogonal progress in deep generative models has been made by the computer vision and machine learning communities, leading to powerful algorithms to synthesize and edit images. Neural rendering approaches are a hybrid of both of these efforts that combine physical knowledge, such as a differentiable renderer, with learned components for controllable image synthesis. Nowadays, neural rendering is employed for solving a steadily growing number of computer graphics and vision problems. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. Specifically, we are dealing with the type of control, i.e., how the control is provided, which parts of the pipeline are learned, explicit vs. implicit control, generalization, and stochastic vs. deterministic synthesis. The second half of this state-of-the-art report is focused on the many important use cases for the described algorithms such as novel view synthesis, semantic photo manipulation, facial and body reenactment, re-lighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.
    Computational stereo has reached a high level of accuracy, but degrades in the presence of occlusions, repeated textures, and correspondence errors along edges. We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. Our network uses a novel architecture to fuse these two sources of information and can overcome the above-mentioned limitations of pure binocular stereo matching. Our method provides a dense depth map with sharp edges, which is crucial for computational photography applications like synthetic shallow-depth-of-field or 3D Photos. Additionally, we avoid the inherent ambiguity due to the aperture problem in stereo cameras by designing the stereo baseline to be orthogonal to the dual-pixel baseline. We present experiments and comparisons with state-of-the-art approaches to show that our method offers a substantial improvement over previous works.
    Learning Illumination from Diverse Portraits
    Wan-Chun Alex Ma
    Christoph Rhemann
    Jason Dourgarian
    Paul Debevec
    SIGGRAPH Asia 2020 Technical Communications (2020)
    We present a learning-based technique for estimating high dynamic range (HDR), omnidirectional illumination from a single low dynamic range (LDR) portrait image captured under arbitrary indoor or outdoor lighting conditions. We train our model using portrait photos paired with their ground truth illumination. We generate a rich set of such photos by using a light stage to record the reflectance field and alpha matte of 70 diverse subjects in various expressions. We then relight the subjects using image-based relighting with a database of one million HDR lighting environments, compositing them onto paired high-resolution background imagery recorded during the lighting acquisition. We train the lighting estimation model using rendering-based loss functions and add a multi-scale adversarial loss to estimate plausible high frequency lighting detail. We show that our technique outperforms the state-of-the-art technique for portrait-based lighting estimation, and we also show that our method reliably handles the inherent ambiguity between overall lighting strength and surface albedo, recovering a similar scale of illumination for subjects with diverse skin tones. Our method allows virtual objects and digital characters to be added to a portrait photograph with consistent illumination. As our inference runs in real-time on a smartphone, we enable realistic rendering and compositing of virtual objects into live video for augmented reality.
    Deep Relightable Textures: Volumetric Performance Capture with Neural Rendering
    Abhi Meka
    Christian Haene
    Peter Barnum
    Philip Davidson
    Daniel Erickson
    Jonathan Taylor
    Sofien Bouaziz
    Wan-Chun Alex Ma
    Ryan Overbeck
    Thabo Beeler
    Paul Debevec
    Shahram Izadi
    Christian Theobalt
    Christoph Rhemann
    SIGGRAPH Asia and TOG (2020)
    The increasing demand for 3D content in augmented and virtual reality has motivated the development of volumetric performance capture systems such as the Light Stage. Recent advances are pushing free viewpoint relightable videos of dynamic human performances closer to photorealistic quality. However, despite significant efforts, these sophisticated systems are limited by reconstruction and rendering algorithms which do not fully model complex 3D structures and higher order light transport effects such as global illumination and sub-surface scattering. In this paper, we propose a system that combines traditional geometric pipelines with a neural rendering scheme to generate photorealistic renderings of dynamic performances under desired viewpoint and lighting. Our system leverages deep neural networks that model the classical rendering process to learn implicit features that represent the view-dependent appearance of the subject independent of the geometry layout, allowing for generalization to unseen subject poses and even novel subject identity. Detailed experiments and comparisons demonstrate the efficacy and versatility of our method to generate high-quality results, significantly outperforming the existing state-of-the-art solutions.
    Light Stage Super-Resolution: Continuous High-Frequency Relighting
    Tiancheng Sun
    Zexiang Xu
    Xiuming Zhang
    Christoph Rhemann
    Paul Debevec
    Yun-Ta Tsai
    Ravi Ramamoorthi
    SIGGRAPH Asia and TOG (2020)
    The light stage has been widely used in computer graphics for the past two decades, primarily to enable the relighting of human faces. By capturing the appearance of the human subject under different light sources, one obtains the light transport matrix of that subject, which enables image-based relighting in novel environments. However, due to the finite number of lights in the stage, the light transport matrix only represents a sparse sampling on the entire sphere. As a consequence, relighting the subject with a point light or a directional source that does not coincide exactly with one of the lights in the stage requires interpolation and resampling the images corresponding to nearby lights, and this leads to ghosting shadows, aliased specularities, and other artifacts. To ameliorate these artifacts and produce better results under arbitrary high-frequency lighting, this paper proposes a learning-based solution for the "super-resolution" of scans of human faces taken from a light stage. Given an arbitrary "query" light direction, our method aggregates the captured images corresponding to neighboring lights in the stage, and uses a neural network to synthesize a rendering of the face that appears to be illuminated by a "virtual" light source at the query location. This neural network must circumvent the inherent aliasing and regularity of the light stage data that was used for training, which we accomplish through the use of regularized traditional interpolation methods within our network. Our learned model is able to produce renderings for arbitrary light directions that exhibit realistic shadows and specular highlights, and is able to generalize across a wide variety of subjects. Our super-resolution approach enables more accurate renderings of human subjects under detailed environment maps, or the construction of simpler light stages that contain fewer light sources while still yielding comparable quality renderings as light stages with more densely sampled lights.
    Deep Implicit Volume Compression
    Danhang "Danny" Tang
    Phil Chou
    Christian Haene
    Mingsong Dou
    Jonathan Taylor
    Shahram Izadi
    Sofien Bouaziz
    Cem Keskin
    CVPR (2020)
    We describe a novel approach for compressing truncated signed distance fields (TSDF) stored in voxel grids and their corresponding textures. To compress the TSDF, our method relies on a block-based neural architecture trained end-to-end, achieving state-of-the-art compression rates. To prevent topological errors, we losslessly compress the signs of the TSDF, which as a side effect also bounds the maximum reconstruction error by the voxel size. To compress the affiliated texture, we designed a fast block-based charting and Morton packing technique generating a coherent image that can be efficiently compressed using existing image-based compression algorithms. We demonstrate the performance of our algorithms on a large set of 4D performance sequences captured using multi-camera RGBD setups.
    Deep Reflectance Fields - High-Quality Facial Reflectance Field Inference from Color Gradient Illumination
    Abhi Meka
    Christian Haene
    Michael Zollhöfer
    Graham Fyffe
    Xueming Yu
    Jason Dourgarian
    Peter Denny
    Sofien Bouaziz
    Peter Lincoln
    Matt Whalen
    Geoff Harvey
    Jonathan Taylor
    Shahram Izadi
    Paul Debevec
    Christian Theobalt
    Julien Valentin
    Christoph Rhemann
    SIGGRAPH (2019)
    Photo-realistic relighting of human faces is a highly sought after feature with many applications ranging from visual effects to truly immersive virtual experiences. Despite tremendous technological advances in the field, humans are often capable of distinguishing real faces from synthetic renders. Photo-realistically relighting any human face is indeed a challenge with many difficulties, ranging from modelling sub-surface scattering and blood flow to estimating the interaction between light and individual strands of hair. We introduce the first system that combines the ability to deal with dynamic performances with the realism of 4D reflectance fields, enabling photo-realistic relighting of non-static faces. The core of our method consists of a deep neural network that is able to predict full 4D reflectance fields from two images captured under spherical gradient illumination. Extensive experiments not only show that two images under spherical gradient illumination can be easily captured in real time, but also that these particular images contain all the information needed to estimate the full reflectance field, including specularities and high frequency details. Finally, side by side comparisons demonstrate that the proposed system outperforms the current state-of-the-art in terms of realism and speed.
    Volumetric Capture of Humans with a Single RGBD Camera via Semi-Parametric Learning
    Anastasia Tkach
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Ricardo Martin Brualla
    George Papandreou
    Philip Davidson
    Cem Keskin
    Shahram Izadi
    CVPR (2019)
    Volumetric (4D) performance capture is fundamental for AR/VR content generation. Whereas previous work in 4D performance capture has shown impressive results in studio settings, the technology is still far from being accessible to a typical consumer who, at best, might own a single RGBD sensor. Thus, in this work, we propose a method to synthesize free viewpoint renderings using a single RGBD camera. The key insight is to leverage previously seen "calibration" images of a given user to extrapolate what should be rendered in a novel viewpoint from the data available in the sensor. Given these past observations from multiple viewpoints, and the current RGBD image from a fixed view, we propose an end-to-end framework that fuses both these data sources to generate novel renderings of the performer. We demonstrate that the method can produce high fidelity images, and handle extreme changes in subject pose and camera viewpoints. We also show that the system generalizes to performers not seen in the training data. We run exhaustive experiments demonstrating the effectiveness of the proposed semi-parametric model (i.e. calibration images available to the neural network) compared to other state of the art machine learned solutions. Further, we compare the method with more traditional pipelines that employ multi-view capture. We show that our framework is able to achieve compelling results, with substantially less infrastructure than previously required.
    The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting
    Kaiwen Guo
    Peter Lincoln
    Philip Davidson
    Xueming Yu
    Matt Whalen
    Geoff Harvey
    Jason Dourgarian
    Danhang Tang
    Anastasia Tkach
    Emily Cooper
    Mingsong Dou
    Graham Fyffe
    Christoph Rhemann
    Jonathan Taylor
    Paul Debevec
    Shahram Izadi
    SIGGRAPH Asia (2019) (to appear)
    We present "The Relightables", a volumetric capture system for photorealistic and high quality relightable full-body performance capture. While significant progress has been made on volumetric capture systems, focusing on 3D geometric reconstruction with high resolution textures, much less work has been done to recover photometric properties needed for relighting. Results from such systems lack high-frequency details and the subject's shading is prebaked into the texture. In contrast, a large body of work has addressed relightable acquisition for image-based approaches, which photograph the subject under a set of basis lighting conditions and recombine the images to show the subject as they would appear in a target lighting environment. However, to date, these approaches have not been adapted for use in the context of a high-resolution volumetric capture system. Our method combines this ability to realistically relight humans for arbitrary environments, with the benefits of free-viewpoint volumetric capture and new levels of geometric accuracy for dynamic performances. Our subjects are recorded inside a custom geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. Our system innovates in multiple areas: First, we designed a novel active depth sensor to capture 12.4MP depth maps, which we describe in detail. Second, we show how to design a hybrid geometric and machine learning reconstruction pipeline to process the high resolution input and output a volumetric video. Third, we generate temporally consistent reflectance maps for dynamic performers by leveraging the information contained in two alternating color gradient illumination images acquired at 60Hz. Multiple experiments, comparisons, and applications show that The Relightables significantly improves upon the level of realism in placing volumetrically captured human performances into arbitrary CG scenes.
    The Need 4 Speed in Real-Time Dense Visual Tracking
    Christoph Rhemann
    Jonathan Taylor
    Philip Davidson
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sameh Khamis
    Danhang Tang
    Vladimir Tankovich
    Julien Valentin
    Shahram Izadi
    SIGGRAPH Asia (2018)
    The advent of consumer depth cameras has incited the development of a new cohort of algorithms tackling challenging computer vision problems. The primary reason is that depth provides direct geometric information that is largely invariant to texture and illumination. As such, substantial progress has been made in human and object pose estimation, 3D reconstruction and simultaneous localization and mapping. Most of these algorithms naturally benefit from the ability to accurately track the pose of an object or scene of interest from one frame to the next. However, commercially available depth sensors (typically running at 30fps) can allow for large inter-frame motions to occur that make such tracking problematic. A high frame rate depth camera would thus greatly ameliorate these issues, and further increase the tractability of these computer vision problems. Nonetheless, the depth accuracy of recent systems for high-speed depth estimation [Fanello et al. 2017b] can degrade at high frame rates. This is because the active illumination employed produces a low SNR and thus a high exposure time is required to obtain a dense accurate depth image. Furthermore, in the presence of rapid motion, longer exposure times produce artifacts due to motion blur and necessitate a lower frame rate, which introduces large inter-frame motion that often yields tracking failures. In contrast, this paper proposes a novel combination of hardware and software components that avoids the need to compromise between a dense accurate depth map and a high frame rate. We document the creation of a full 3D capture system for high speed and quality depth estimation, and demonstrate its advantages in a variety of tracking and reconstruction tasks. We extend the state-of-the-art active stereo algorithm presented in Fanello et al. [2017b] by adding a space-time feature in the matching phase. We also propose a machine learning based depth refinement step that is an order of magnitude faster than traditional postprocessing methods. We quantitatively and qualitatively demonstrate the benefits of the proposed algorithms in the acquisition of geometry in motion. Our pipeline executes in 1.1ms leveraging modern GPUs and off-the-shelf cameras and illumination components. We show how the sensor can be employed in many different applications, from [non-]rigid reconstructions to hand/face tracking. Further, we show many advantages over existing state-of-the-art depth camera technologies beyond framerate, including latency, motion artifacts, multi-path errors, and multi-sensor interference.
    StereoNet: Guided Hierarchical Refinement for Edge-Aware Depth Prediction
    Sameh Khamis
    Christoph Rhemann
    Julien Valentin
    Shahram Izadi
    European Conference on Computer Vision (2018)
    Preview abstract This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free depth maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high depth precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right image. A first estimate of the disparity is computed in a very low resolution cost volume, then hierarchically the model re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging color input as a guide, this function is capable of producing high-quality edge-aware output. We achieve compelling results on multiple benchmarks, showing how the proposed method offers extreme flexibility at an acceptable computational budget. View details
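To make the idea of a low-resolution cost volume with sub-pixel disparity concrete, here is a minimal NumPy sketch. It is not the StereoNet implementation: the squared-difference cost, the soft-argmin read-out, and the function names `cost_volume` and `soft_argmin` are assumptions chosen for illustration, and the feature maps are assumed to come from some shared (Siamese) feature extractor that is not shown.

```python
import numpy as np

def cost_volume(left_feat, right_feat, max_disp):
    """Build an (H, W, max_disp) matching-cost volume from (H, W, C) features.

    Illustrative assumption: cost is the squared feature difference between
    left pixel x and right pixel x - d.
    """
    h, w, _ = left_feat.shape
    # Large cost where the shifted right pixel would fall outside the image.
    volume = np.full((h, w, max_disp), 1e3, dtype=np.float64)
    for d in range(max_disp):
        diff = left_feat[:, d:, :] - right_feat[:, : w - d, :]
        volume[:, d:, d] = np.sum(diff ** 2, axis=-1)
    return volume

def soft_argmin(volume):
    """Sub-pixel disparity as the expectation under softmax(-cost)."""
    logits = -volume
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    disparities = np.arange(volume.shape[-1])
    return np.sum(weights * disparities, axis=-1)
```

Because the read-out is a weighted average rather than a hard minimum, the estimate can fall between integer disparities, which is one way a coarse volume can still yield fine depth precision.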
    UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects, and Environments
    Anastasia Tkach
    Christine Kaeser-Chen
    Christoph Rhemann
    Jonathan Taylor
    Julien Valentin
    Kaiwen Guo
    Mingsong Dou
    Sameh Khamis
    Shahram Izadi
    Sofien Bouaziz
    Thomas Funkhouser
    Yinda Zhang
    Preview abstract This is a set of slide decks presenting a full tutorial on 3D capture and reconstruction, with high-level applications in VR and AR. The slides are intended for the tutorial website: https://augmentedperception.github.io/cvpr18/ View details
    LookinGood: Enhancing Performance Capture with Real-Time Neural Re-Rendering
    Ricardo Martin Brualla
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Julien Valentin
    Sameh Khamis
    Philip Davidson
    Anastasia Tkach
    Peter Lincoln
    Christoph Rhemann
    Cem Keskin
    Steve Seitz
    Shahram Izadi
    SIGGRAPH Asia (2018)
    Preview abstract Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus on real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolution textures. We take the novel approach of augmenting such real-time performance capture systems with a deep architecture that takes a rendering from an arbitrary viewpoint, and jointly performs completion, super resolution, and denoising of the imagery in real-time. We call this approach neural (re-)rendering, and our live system "LookinGood". Our deep architecture is trained to produce high resolution and high quality images from a coarse rendering in real-time. First, we propose a self-supervised training method that does not require manual ground-truth annotation. We contribute a specialized reconstruction error that uses semantic information to focus on relevant parts of the subject, e.g. the face. We also introduce a salient reweighing scheme of the loss function that is able to discard outliers. We specifically design the system for virtual and augmented reality headsets where the consistency between the left and right eye plays a crucial role in the final user experience. Finally, we generate temporally stable results by explicitly minimizing the difference between two consecutive frames. We tested the proposed system in two different scenarios: one involving a single RGB-D sensor and upper body reconstruction of an actor, the second consisting of full body 360 degree capture. Through extensive experimentation, we demonstrate how our system generalizes across unseen sequences and subjects. View details
    ActiveStereoNet: Unsupervised End-to-End Learning for Active Stereo Systems
    Yinda Zhang
    Sameh Khamis
    Christoph Rhemann
    Julien Valentin
    Vladimir Tankovich
    Michael Schoenberg
    Shahram Izadi
    European Conference on Computer Vision (2018)
    Preview abstract In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitly handles occlusions. We introduce a novel reconstruction loss that is more robust to noise and texture-less patches, and is invariant to illumination changes. The proposed loss is optimized using a window-based cost aggregation with an adaptive support weight scheme. This cost aggregation is edge-preserving and smooths the loss function, which is key to allow the network to reach compelling results. Finally we show how the task of predicting invalid regions, such as occlusions, can be trained end-to-end without ground-truth. This component is crucial to reduce blur and particularly improves predictions along depth discontinuities. Extensive quantitative and qualitative evaluations on real and synthetic data demonstrate state of the art results in many challenging scenes. View details
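The self-supervised setting described above can be illustrated with a generic photometric reconstruction loss: warp the right image by the predicted disparity and compare it with the left image. This sketch is an assumption-laden stand-in, not the loss proposed in the paper: it uses a plain L1 error with linear interpolation, and the names `warp_right_to_left`, `reconstruction_loss`, and the optional `valid` mask are hypothetical.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view: left_hat(y, x) = right(y, x - disparity(y, x)).

    right and disparity are (H, W) arrays; sampling uses linear interpolation
    along x, with coordinates clamped to the image border.
    """
    h, w = right.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]

def reconstruction_loss(left, right, disparity, valid=None):
    """Mean absolute photometric error over pixels marked valid.

    The valid mask is a stand-in for excluding regions such as occlusions.
    """
    err = np.abs(left - warp_right_to_left(right, disparity))
    if valid is None:
        valid = np.ones_like(err, dtype=bool)
    return err[valid].mean()
```

No ground-truth depth appears anywhere in the loss: a correct disparity map is rewarded only because it makes the warped right image agree with the left one.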
    Preview abstract Real time non-rigid reconstruction pipelines are extremely computationally expensive and easily saturate the highest end GPUs currently available. This requires careful strategic choices to be made about a set of highly interconnected parameters that divide up the limited compute. Offline systems, however, prove the value of increasing voxel resolution, more iterations, and higher frame rates. To this end, we demonstrate a set of remarkably simple, but effective modifications to these algorithms that significantly reduce the average per-frame computation cost allowing these parameters to be increased. Specifically, we divide the depth stream into sub-frames and fusion-frames, disabling both model accumulation (fusion) and non-rigid alignment (model tracking) on the former. Instead, we efficiently track point correspondences across neighboring sub-frames. We then leverage these correspondences to initialize the standard non-rigid alignment to a fusion-frame where data can then be accumulated into the model. As a result, compute resources in the modified non-rigid reconstruction pipeline can be immediately re-purposed. Finally, we leverage recent high framerate depth algorithms to build a novel “twin” sensor consisting of a low-res/high-fps sub-frame camera and a second low-fps/high-res fusion camera. View details
    SOS: Stereo Matching in O(1) with Slanted Support Windows
    Vladimir Tankovich
    Michael John Schoenberg
    Christoph Rhemann
    Mirko Schmidt
    Maksym Dzitsiuk
    Julien Valentin
    Shahram Izadi
    IROS (2018)
    Preview abstract Depth cameras have accelerated research in many areas of computer vision. Most triangulation-based depth cameras, whether structured light systems like the Kinect or active (assisted) stereo systems, are based on the principle of stereo matching. Depth from stereo is an active research topic dating back 30 years. Despite recent advances, algorithms usually trade off accuracy for speed. In particular, efficient methods rely on fronto-parallel assumptions to reduce the search space and keep computation low. We present SOS (Slanted O(1) Stereo), the first algorithm capable of leveraging slanted support windows without sacrificing speed or accuracy. We use an active stereo configuration, where an illuminator textures the scene. Under this setting, local methods, such as PatchMatch Stereo, obtain state of the art results by jointly estimating disparities and slant, but at a large computational cost. We observe that these methods typically exploit local smoothness to simplify their initialization strategies. Our key insight is that local smoothness can in fact be used to amortize the computation not only within initialization, but across the entire stereo pipeline. Building on these insights, we propose a novel hierarchical initialization that is able to efficiently perform search over disparity and slants. We then show how this structure can be leveraged to provide high quality depth maps. Extensive quantitative evaluations demonstrate that the proposed technique yields significantly more precise results than current state of the art, but at a fraction of the computational cost. Our prototype implementation runs at 4000 fps on modern GPU architectures. View details
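The difference between a fronto-parallel and a slanted support window can be shown in a few lines. The sketch below only illustrates the general notion of a planar disparity model inside a matching window; it is not the SOS algorithm, and the function name `window_disparities` and its parameters are hypothetical.

```python
import numpy as np

def window_disparities(radius, d0, slant_x=0.0, slant_y=0.0):
    """Per-pixel disparities inside a square support window.

    The window is modeled as a plane through its center:
        d(dx, dy) = d0 + slant_x * dx + slant_y * dy
    where (dx, dy) are pixel offsets from the center. Zero slants reduce
    to the fronto-parallel assumption (one disparity for the whole window).
    """
    offsets = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(offsets, offsets)  # dx varies along columns
    return d0 + slant_x * dx + slant_y * dy

# A fronto-parallel window versus one slanted along x.
flat = window_disparities(radius=1, d0=5.0)
slanted = window_disparities(radius=1, d0=5.0, slant_x=0.5)
```

With slants, each candidate has three parameters per pixel instead of one, which is why a naive search over slanted windows is expensive.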
    Real-time Compression and Streaming of 4D Performances
    Danhang Tang
    Mingsong Dou
    Peter Lincoln
    Philip Davidson
    Kaiwen Guo
    Jonathan Taylor
    Cem Keskin
    Sofien Bouaziz
    Shahram Izadi
    ACM Transactions on Graphics (2018)
    Preview abstract We introduce a real-time compression architecture for 4D performance capture that is two orders of magnitude faster than current state-of-the-art techniques, yet achieves comparable visual quality and bitrate. We note how much of the algorithmic complexity in traditional 4D compression arises from the necessity to encode geometry in an explicit model (i.e. a triangle mesh). In contrast, we propose an encoder that leverages an implicit model to represent the observed geometry and its changes through time. View details
    Depth from motion for smartphone AR
    Julien Valentin
    Neal Wadhwa
    Max Dzitsiuk
    Michael John Schoenberg
    Vivek Verma
    Ambrus Csaszar
    Ivan Dryanovski
    Joao Afonso
    Jose Pascoal
    Konstantine Nicholas John Tsotsos
    Mira Angela Leung
    Mirko Schmidt
    Sameh Khamis
    Vladimir Tankovich
    Shahram Izadi
    Christoph Rhemann
    ACM Transactions on Graphics (2018)
    Preview abstract Augmented reality (AR) for smartphones has matured from a technology for early adopters, available only on select high-end phones, to one that is truly available to the general public. One of the key breakthroughs has been in low-compute methods for six degree of freedom (6DoF) tracking on phones using only the existing hardware (camera and inertial sensors). 6DoF tracking is the cornerstone of smartphone AR, allowing virtual content to be precisely locked on top of the real world. However, to really give users the impression of believable AR, one requires mobile depth. Without depth, even simple effects such as a virtual object being correctly occluded by the real world are impossible. However, requiring a mobile depth sensor would severely restrict the access to such features. In this article, we provide a novel pipeline for mobile depth that supports a wide array of mobile phones, and uses only the existing monocular color sensor. Through several technical contributions, we provide the ability to compute low latency dense depth maps using only a single CPU core of a wide range of (medium-high) mobile phones. We demonstrate the capabilities of our approach on high-level AR applications including real-time navigation and shopping. View details