Rohit Kumar Pandey

Rohit is a machine learning researcher and engineer in the augmented perception team at Google. His recent efforts are focused on applying deep learning to style transfer, novel view synthesis and relighting for humans. He has also worked on designing and implementing efficient deep learning solutions that can be deployed on mobile devices. Prior to Google, he graduated from the University at Buffalo, SUNY with a PhD in Computer Science, where his research focused on privacy preserving deep learning and its applications to biometric authentication.
Authored Publications
    Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
    Ziqian Bai
    Danhang "Danny" Tang
    Di Qiu
    Abhimitra Meka
    Mingsong Dou
    Ping Tan
    Thabo Beeler
    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE
    Abstract: We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expression synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a convolutional neural network in the UV space is critical for incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.
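To make the feature-anchoring idea above concrete, here is a minimal NumPy sketch of interpolating learned features attached to deformed 3DMM vertices at a volumetric query point and decoding them into radiance. The function names (`interpolate_anchored_features`, `decode_radiance`) and the inverse-distance weighting are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of anchoring learned local features on
# deformed 3DMM vertices and interpolating them at a volumetric query point.
import numpy as np

def interpolate_anchored_features(query_xyz, vertices, vertex_features, k=8, eps=1e-8):
    """Inverse-distance-weighted blend of per-vertex features near a query point.

    vertices:        (V, 3) 3DMM vertex positions after expression/pose deformation.
    vertex_features: (V, C) learned local features anchored on those vertices.
    """
    d = np.linalg.norm(vertices - query_xyz[None, :], axis=1)      # (V,) distances to anchors
    nearest = np.argsort(d)[:k]                                    # indices of k closest anchors
    w = 1.0 / (d[nearest] + eps)
    w = w / w.sum()                                                # normalized weights
    return (w[:, None] * vertex_features[nearest]).sum(axis=0)     # blended feature, (C,)

def decode_radiance(feature, view_dir):
    """Hypothetical decoder stub: maps a blended feature (+ view dir) to (density, rgb)."""
    density = float(np.maximum(feature[:1].sum(), 0.0))
    rgb = 1.0 / (1.0 + np.exp(-(feature[1:4] + 0.1 * view_dir)))   # sigmoid to [0, 1]
    return density, rgb

# Toy usage with random data
V, C = 5023, 16
verts = np.random.randn(V, 3).astype(np.float32)
feats = np.random.randn(V, C).astype(np.float32)
f = interpolate_anchored_features(np.array([0.1, 0.0, 0.2], np.float32), verts, feats)
print(decode_radiance(f, np.array([0.0, 0.0, 1.0], np.float32)))
```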
    Total Relighting: Learning to Relight Portraits for Background Replacement
    ACM Transactions on Graphics (SIGGRAPH 2021)
    Abstract: We propose a novel system for portrait relighting and background replacement, which maintains high-frequency boundary details and accurately synthesizes the subject’s appearance as lit by novel illumination, thereby producing realistic composite images for any desired scene. Our technique includes foreground estimation via alpha matting, relighting, and compositing. We demonstrate that each of these stages can be tackled in a sequential pipeline without the use of priors (e.g. known background or known illumination) and with no specialized acquisition techniques, using only a single RGB portrait image and a novel, target HDR lighting environment as inputs. We train our model using relit portraits of subjects captured in a light stage computational illumination system, which records multiple lighting conditions, high quality geometry, and accurate alpha mattes. To perform realistic relighting for compositing, we introduce a novel per-pixel lighting representation in a deep learning framework, which explicitly models the diffuse and the specular components of appearance, producing relit portraits with convincingly rendered non-Lambertian effects like specular highlights. Multiple experiments and comparisons show the effectiveness of the proposed approach when applied to in-the-wild images.
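As a rough illustration of the compositing stage described in this abstract, the sketch below blends a relit foreground (split into diffuse and specular terms) over a new background using an alpha matte. The array names and the simple linear blend are assumptions for exposition, not the system's actual pipeline.

```python
# Toy compositing sketch: relit foreground = albedo * diffuse light + specular,
# blended over a new background with an alpha matte. All arrays are hypothetical
# H x W x 3 images in linear color.
import numpy as np

def relight_and_composite(albedo, diffuse_light, specular, alpha, background):
    """albedo, diffuse_light, specular, background: (H, W, 3); alpha: (H, W, 1) in [0, 1]."""
    relit_fg = albedo * diffuse_light + specular              # diffuse + non-Lambertian term
    composite = alpha * relit_fg + (1.0 - alpha) * background # alpha blend over the new scene
    return np.clip(composite, 0.0, None)                      # keep linear radiance non-negative

H, W = 4, 4
img = lambda: np.random.rand(H, W, 3).astype(np.float32)
out = relight_and_composite(img(), img(), 0.1 * img(),
                            np.random.rand(H, W, 1).astype(np.float32), img())
print(out.shape)
```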
    HumanGPS: Geodesic PreServing Feature for Dense Human Correspondence
    Danhang "Danny" Tang
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sofien Bouaziz
    Ping Tan
    Computer Vision and Pattern Recognition (CVPR) 2021, pp. 8
    Abstract: In this paper, we address the problem of building dense correspondences between human images under arbitrary camera viewpoints and body poses. Prior art either assumes small motion between frames or relies on local descriptors, which cannot handle large motion or visually ambiguous body parts, e.g. left vs. right hand. In contrast, we propose a deep learning framework that maps each pixel to a feature space, where the feature distances reflect the geodesic distances among pixels as if they were projected onto the surface of a 3D human scan. To this end, we introduce novel loss functions to push features apart according to their geodesic distances on the surface. Without any semantic annotation, the proposed embeddings automatically learn to differentiate visually similar parts and align different subjects into a unified feature space. Extensive experiments show that the learned embeddings can produce accurate correspondences between images, with remarkable generalization capabilities for both intra- and inter-subject pairs.
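The sketch below illustrates, under simplifying assumptions, the kind of geodesic-preserving objective described above: corresponding pixels are pulled together in feature space while other pairs are pushed apart by a margin that grows with their geodesic distance. The threshold, scale, and function name are hypothetical, not the paper's exact losses.

```python
# Illustrative geodesic-preserving loss sketch (not the paper's exact losses).
import numpy as np

def geodesic_preserving_loss(feat_a, feat_b, geo_dist, match_thresh=0.05, scale=1.0):
    """feat_a, feat_b: (N, C) features of paired pixels; geo_dist: (N,) surface geodesic distances."""
    d = np.linalg.norm(feat_a - feat_b, axis=1)                  # feature-space distances
    pull = np.where(geo_dist < match_thresh, d ** 2, 0.0)        # matches: small feature distance
    push = np.where(geo_dist >= match_thresh,
                    np.maximum(0.0, scale * geo_dist - d) ** 2,  # non-matches: geodesic-proportional margin
                    0.0)
    return float(np.mean(pull + push))

N, C = 1024, 16
print(geodesic_preserving_loss(np.random.randn(N, C), np.random.randn(N, C),
                               np.abs(np.random.randn(N))))
```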
    Neural Light Transport for Relighting and View Synthesis
    Xiuming Zhang
    Yun-Ta Tsai
    Tiancheng Sun
    Tianfan Xue
    Philip Davidson
    Christoph Rhemann
    Paul Debevec
    Ravi Ramamoorthi
    ACM Transactions on Graphics, vol. 40 (2021)
    Abstract: The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without the separate treatment of the two problems that prior work requires.
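A minimal sketch of the residual formulation described above: a physically based diffuse term is computed per texel of the texture atlas and a learned component adds the non-diffuse and global light-transport residual. `residual_net` is a hypothetical stand-in for the trained network, not the paper's model.

```python
# Diffuse base + learned residual, evaluated per texel of a texture atlas (sketch).
import numpy as np

def diffuse_base(albedo, normals, light_dir):
    """Lambertian texel shading: albedo * max(n . l, 0). albedo, normals: (T, 3)."""
    n_dot_l = np.clip((normals * light_dir[None, :]).sum(axis=1, keepdims=True), 0.0, None)
    return albedo * n_dot_l

def residual_net(texel_features, light_dir, view_dir):
    """Placeholder for the learned residual (specular, subsurface, global illumination)."""
    return 0.05 * np.tanh(texel_features[:, :3])   # toy residual, (T, 3)

def render_texels(albedo, normals, texel_features, light_dir, view_dir):
    return diffuse_base(albedo, normals, light_dir) + residual_net(texel_features, light_dir, view_dir)

T = 2048
print(render_texels(np.random.rand(T, 3), np.random.randn(T, 3),
                    np.random.randn(T, 8), np.array([0.0, 0.0, 1.0]),
                    np.array([0.0, 1.0, 0.0])).shape)
```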
    Learning Illumination from Diverse Portraits
    Wan-Chun Alex Ma
    Christoph Rhemann
    Jason Dourgarian
    Paul Debevec
    SIGGRAPH Asia 2020 Technical Communications (2020)
    Abstract: We present a learning-based technique for estimating high dynamic range (HDR), omnidirectional illumination from a single low dynamic range (LDR) portrait image captured under arbitrary indoor or outdoor lighting conditions. We train our model using portrait photos paired with their ground truth illumination. We generate a rich set of such photos by using a light stage to record the reflectance field and alpha matte of 70 diverse subjects in various expressions. We then relight the subjects using image-based relighting with a database of one million HDR lighting environments, compositing them onto paired high-resolution background imagery recorded during the lighting acquisition. We train the lighting estimation model using rendering-based loss functions and add a multi-scale adversarial loss to estimate plausible high frequency lighting detail. We show that our technique outperforms the state-of-the-art technique for portrait-based lighting estimation, and we also show that our method reliably handles the inherent ambiguity between overall lighting strength and surface albedo, recovering a similar scale of illumination for subjects with diverse skin tones. Our method allows virtual objects and digital characters to be added to a portrait photograph with consistent illumination. As our inference runs in real-time on a smartphone, we enable realistic rendering and compositing of virtual objects into live video for augmented reality.
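The rendering-based loss idea can be sketched as follows: render a diffuse sphere under the predicted and ground-truth HDR environments and penalize the difference between the renders. The equirectangular sampling and function names below are simplified assumptions, not the paper's training code.

```python
# Rendering-based loss sketch for lighting estimation: compare diffuse-sphere
# renders under predicted vs. ground-truth equirectangular HDR environments.
import numpy as np

def env_directions(h, w):
    """Unit directions and solid-angle weights for an equirectangular map."""
    theta = (np.arange(h) + 0.5) / h * np.pi            # polar angle
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi        # azimuth
    t, p = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(t) * np.cos(p), np.cos(t), np.sin(t) * np.sin(p)], -1)
    weights = np.sin(t) * (np.pi / h) * (2.0 * np.pi / w)
    return dirs.reshape(-1, 3), weights.reshape(-1)

def render_diffuse_sphere(env, normals):
    """env: (H, W, 3) HDR map; normals: (N, 3) sphere normals -> (N, 3) shaded colors."""
    dirs, w = env_directions(*env.shape[:2])
    cos = np.clip(normals @ dirs.T, 0.0, None)          # (N, H*W) clamped cosines
    return (cos * w[None, :]) @ env.reshape(-1, 3) / np.pi

def rendering_loss(pred_env, gt_env, normals):
    return float(np.mean(np.abs(render_diffuse_sphere(pred_env, normals)
                                - render_diffuse_sphere(gt_env, normals))))

normals = np.random.randn(256, 3)
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
print(rendering_loss(np.random.rand(16, 32, 3), np.random.rand(16, 32, 3), normals))
```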
    State of the Art on Neural Rendering
    Ayush Tewari
    Christian Theobalt
    Eli Shechtman
    Gordon Wetzstein
    Jason Saragih
    Jun-Yan Zhu
    Justus Thies
    Kalyan Sunkavalli
    Maneesh Agrawala
    Matthias Niessner
    Michael Zollhöfer
    Ohad Fried
    Ricardo Martin Brualla
    Stephen Lombardi
    Tomas Simon
    Vincent Sitzmann
    Computer Graphics Forum (2020)
    Abstract: The efficient rendering of photo-realistic virtual worlds is a long-standing effort of computer graphics. Over the last few years, rapid orthogonal progress in deep generative models has been made by the computer vision and machine learning communities, leading to powerful algorithms to synthesize and edit images. Neural rendering approaches are a hybrid of both of these efforts that combine physical knowledge, such as a differentiable renderer, with learned components for controllable image synthesis. Nowadays, neural rendering is employed to solve a steadily growing number of computer graphics and vision problems. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. Specifically, we discuss the type of control, i.e. how the control is provided, which parts of the pipeline are learned, explicit vs. implicit control, generalization, and stochastic vs. deterministic synthesis. The second half of this state-of-the-art report focuses on the many important use cases for the described algorithms, such as novel view synthesis, semantic photo manipulation, facial and body reenactment, relighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.
    Deep Relightable Textures: Volumetric Performance Capture with Neural Rendering
    Abhi Meka
    Christian Haene
    Peter Barnum
    Philip Davidson
    Daniel Erickson
    Jonathan Taylor
    Sofien Bouaziz
    Wan-Chun Alex Ma
    Ryan Overbeck
    Thabo Beeler
    Paul Debevec
    Shahram Izadi
    Christian Theobalt
    Christoph Rhemann
    ACM Transactions on Graphics (SIGGRAPH Asia 2020)
    Abstract: The increasing demand for 3D content in augmented and virtual reality has motivated the development of volumetric performance capture systems such as the Light Stage. Recent advances are pushing free viewpoint relightable videos of dynamic human performances closer to photorealistic quality. However, despite significant efforts, these sophisticated systems are limited by reconstruction and rendering algorithms which do not fully model complex 3D structures and higher order light transport effects such as global illumination and sub-surface scattering. In this paper, we propose a system that combines traditional geometric pipelines with a neural rendering scheme to generate photorealistic renderings of dynamic performances under desired viewpoint and lighting. Our system leverages deep neural networks that model the classical rendering process to learn implicit features that represent the view-dependent appearance of the subject independent of the geometry layout, allowing for generalization to unseen subject poses and even novel subject identity. Detailed experiments and comparisons demonstrate the efficacy and versatility of our method to generate high-quality results, significantly outperforming the existing state-of-the-art solutions.
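A rough sketch of the relightable-texture idea as described: per-pixel UV coordinates produced by the geometric pipeline index into a learned texture, and a decoder turns the sampled features plus the view direction into color. The nearest-texel lookup and the `decode_pixel` placeholder are assumptions for illustration, not the paper's model.

```python
# Sketch: sample learned texture features via UV coordinates, then decode them
# with the view direction into view-dependent color.
import numpy as np

def sample_neural_texture(texture, uv):
    """texture: (H, W, C); uv: (N, 2) in [0, 1] -> nearest-texel features (N, C)."""
    h, w, _ = texture.shape
    x = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    y = np.clip((uv[:, 1] * (h - 1)).round().astype(int), 0, h - 1)
    return texture[y, x]

def decode_pixel(features, view_dir):
    """Placeholder decoder: a learned network in the real system."""
    return 1.0 / (1.0 + np.exp(-(features[:, :3] + 0.1 * view_dir[None, :])))

tex = np.random.randn(256, 256, 8).astype(np.float32)
uv = np.random.rand(1024, 2).astype(np.float32)
rgb = decode_pixel(sample_neural_texture(tex, uv), np.array([0.0, 0.0, 1.0], np.float32))
print(rgb.shape)   # (1024, 3)
```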
    GeLaTO: Generative Latent Textured Objects
    Ricardo Martin Brualla
    Sofien Bouaziz
    European Conference on Computer Vision (2020)
    Abstract: Accurate modeling of 3D objects exhibiting transparency, reflections and thin structures is an extremely challenging problem. Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low frequency geometry with learned neural textures, to encode both medium and fine scale geometry as well as view-dependent appearance. To generate the proxies' textures, we learn a joint latent space allowing category-level appearance and geometry interpolation. The proxies are independently rasterized with their corresponding neural texture and composited using a U-Net, which generates an output photorealistic image including an alpha map. We demonstrate the effectiveness of our approach by reconstructing complex objects from a sparse set of views. We show results on a dataset of real images of eyeglasses frames, which are particularly challenging to reconstruct with classical methods. We also demonstrate that these coarse proxies can be handcrafted when the underlying object geometry is easy to model, like eyeglasses, or generated using a neural network for more complex categories, such as cars.
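To illustrate how independently rasterized proxy layers can be combined into a single RGBA image, the sketch below uses a classic back-to-front "over" operator; in the paper this step is performed by a learned U-Net, so the fixed operator here is only a stand-in.

```python
# Back-to-front "over" compositing of per-proxy RGBA layers (stand-in for the
# learned U-Net compositor described in the abstract).
import numpy as np

def over_composite(layers_rgba):
    """layers_rgba: list of (H, W, 4) layers ordered back-to-front."""
    h, w, _ = layers_rgba[0].shape
    out_rgb = np.zeros((h, w, 3), np.float32)
    out_a = np.zeros((h, w, 1), np.float32)
    for layer in layers_rgba:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)   # new layer goes in front
        out_a = a + out_a * (1.0 - a)
    return np.concatenate([out_rgb, out_a], axis=-1)

layers = [np.random.rand(8, 8, 4).astype(np.float32) for _ in range(3)]
print(over_composite(layers).shape)   # (8, 8, 4)
```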
    Deep Reflectance Fields - High-Quality Facial Reflectance Field Inference from Color Gradient Illumination
    Abhi Meka
    Christian Haene
    Michael Zollhöfer
    Graham Fyffe
    Xueming Yu
    Jason Dourgarian
    Peter Denny
    Sofien Bouaziz
    Peter Lincoln
    Matt Whalen
    Geoff Harvey
    Jonathan Taylor
    Shahram Izadi
    Paul Debevec
    Christian Theobalt
    Julien Valentin
    Christoph Rhemann
    SIGGRAPH (2019)
    Abstract: Photo-realistic relighting of human faces is a highly sought after feature with many applications ranging from visual effects to truly immersive virtual experiences. Despite tremendous technological advances in the field, humans are often capable of distinguishing real faces from synthetic renders. Photo-realistically relighting any human face is indeed a challenge, with difficulties ranging from modelling sub-surface scattering and blood flow to estimating the interaction between light and individual strands of hair. We introduce the first system that combines the ability to deal with dynamic performances with the realism of 4D reflectance fields, enabling photo-realistic relighting of non-static faces. The core of our method is a deep neural network that is able to predict full 4D reflectance fields from two images captured under spherical gradient illumination. Extensive experiments not only show that two images under spherical gradient illumination can be easily captured in real time, but also that these particular images contain all the information needed to estimate the full reflectance field, including specularities and high frequency details. Finally, side by side comparisons demonstrate that the proposed system outperforms the current state-of-the-art in terms of realism and speed.
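Once a 4D reflectance field is available, relighting reduces to a per-light weighted sum of the one-light-at-a-time basis images, weighted by the target environment's light colors. The sketch below shows only that weighted sum; the network that predicts the basis images from the two gradient-lit inputs is not modeled, and the array names are illustrative.

```python
# Image-based relighting from a reflectance field: weighted sum of
# one-light-at-a-time (OLAT) basis images by per-light HDR colors.
import numpy as np

def relight_from_reflectance_field(olat_images, light_colors):
    """olat_images: (L, H, W, 3) OLAT basis; light_colors: (L, 3) HDR light weights."""
    return np.einsum('lhwc,lc->hwc', olat_images, light_colors)

L, H, W = 64, 16, 16
relit = relight_from_reflectance_field(np.random.rand(L, H, W, 3), np.random.rand(L, 3))
print(relit.shape)   # (16, 16, 3)
```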
    The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting
    Kaiwen Guo
    Peter Lincoln
    Philip Davidson
    Xueming Yu
    Matt Whalen
    Geoff Harvey
    Jason Dourgarian
    Danhang Tang
    Anastasia Tkach
    Emily Cooper
    Mingsong Dou
    Graham Fyffe
    Christoph Rhemann
    Jonathan Taylor
    Paul Debevec
    Shahram Izadi
    SIGGRAPH Asia 2019
    Abstract: We present "The Relightables", a volumetric capture system for photorealistic and high quality relightable full-body performance capture. While significant progress has been made on volumetric capture systems, focusing on 3D geometric reconstruction with high resolution textures, much less work has been done to recover photometric properties needed for relighting. Results from such systems lack high-frequency details and the subject's shading is prebaked into the texture. In contrast, a large body of work has addressed relightable acquisition for image-based approaches, which photograph the subject under a set of basis lighting conditions and recombine the images to show the subject as they would appear in a target lighting environment. However, to date, these approaches have not been adapted for use in the context of a high-resolution volumetric capture system. Our method combines this ability to realistically relight humans for arbitrary environments, with the benefits of free-viewpoint volumetric capture and new levels of geometric accuracy for dynamic performances. Our subjects are recorded inside a custom geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. Our system innovates in multiple areas: First, we designed a novel active depth sensor to capture 12.4MP depth maps, which we describe in detail. Second, we show how to design a hybrid geometric and machine learning reconstruction pipeline to process the high resolution input and output a volumetric video. Third, we generate temporally consistent reflectance maps for dynamic performers by leveraging the information contained in two alternating color gradient illumination images acquired at 60Hz. Multiple experiments, comparisons, and applications show that The Relightables significantly improves upon the level of realism in placing volumetrically captured human performances into arbitrary CG scenes.
    Neural Rerendering in the Wild
    Moustafa Mahmoud Meshry
    Sameh Khamis
    Hugues Hoppe
    Ricardo Martin Brualla
    Computer Vision and Pattern Recognition (CVPR) (2019)
    Abstract: We explore total scene capture — recording, modeling, and rerendering a scene under varying appearance such as season and time of day. Starting from internet photos of a tourist landmark, we apply traditional 3D reconstruction to register the photos and approximate the scene as a point cloud. For each photo, we render the scene points into a deep framebuffer, and train a neural network to learn the mapping of these initial renderings to the actual photos. This rerendering network also takes as input a latent appearance vector and a semantic mask indicating the location of transient objects like pedestrians. The model is evaluated on several datasets of publicly available images spanning a broad range of illumination conditions. We create short videos demonstrating realistic manipulation of the image viewpoint, appearance, and semantic labeling. We also compare results with prior work on scene reconstruction from internet photos. Code and additional information are available on the project webpage.
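A conceptual sketch of the rerendering inputs described above: the rendered point-cloud buffer, a spatially broadcast latent appearance vector, and a semantic mask are stacked channel-wise before being fed to the rerendering network. The channel counts and the pass-through `rerendering_network` placeholder are assumptions, not the actual model.

```python
# Assemble the rerendering network input from the deep framebuffer, a latent
# appearance code, and a semantic mask (conceptual sketch only).
import numpy as np

def build_rerendering_input(deep_buffer, appearance_code, semantic_mask):
    """deep_buffer: (H, W, C); appearance_code: (A,); semantic_mask: (H, W, S)."""
    h, w, _ = deep_buffer.shape
    appearance_plane = np.broadcast_to(appearance_code, (h, w, appearance_code.shape[0]))
    return np.concatenate([deep_buffer, appearance_plane, semantic_mask], axis=-1)

def rerendering_network(x):
    """Stand-in for the learned image-to-image network."""
    return x[..., :3]   # toy: just pass the first three channels through

x = build_rerendering_input(np.random.rand(64, 64, 6), np.random.rand(8),
                            np.random.rand(64, 64, 2))
print(rerendering_network(x).shape)   # (64, 64, 3)
```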
    Volumetric Capture of Humans with a Single RGBD Camera via Semi-Parametric Learning
    Anastasia Tkach
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Ricardo Martin Brualla
    George Papandreou
    Philip Davidson
    Cem Keskin
    Shahram Izadi
    CVPR (2019)
    Abstract: Volumetric (4D) performance capture is fundamental for AR/VR content generation. Whereas previous work in 4D performance capture has shown impressive results in studio settings, the technology is still far from being accessible to a typical consumer who, at best, might own a single RGBD sensor. Thus, in this work, we propose a method to synthesize free viewpoint renderings using a single RGBD camera. The key insight is to leverage previously seen "calibration" images of a given user to extrapolate what should be rendered in a novel viewpoint from the data available in the sensor. Given these past observations from multiple viewpoints, and the current RGBD image from a fixed view, we propose an end-to-end framework that fuses both these data sources to generate novel renderings of the performer. We demonstrate that the method can produce high fidelity images, and handle extreme changes in subject pose and camera viewpoints. We also show that the system generalizes to performers not seen in the training data. We run exhaustive experiments demonstrating the effectiveness of the proposed semi-parametric model (i.e. calibration images available to the neural network) compared to other state-of-the-art machine-learned solutions. Further, we compare the method with more traditional pipelines that employ multi-view capture. We show that our framework is able to achieve compelling results, with substantially less infrastructure than previously required.
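The semi-parametric idea can be sketched as selecting the calibration view whose camera direction is closest to the requested novel viewpoint and fusing it with the live RGBD frame. The nearest-view heuristic and the placeholder fusion step below are illustrative assumptions rather than the paper's learned end-to-end network.

```python
# Toy semi-parametric rendering sketch: pick the angularly closest calibration
# view and pair it with the live RGBD frame as input to a (placeholder) network.
import numpy as np

def nearest_calibration_view(target_dir, calib_dirs):
    """target_dir: (3,) unit camera direction; calib_dirs: (K, 3) unit directions."""
    cos = calib_dirs @ target_dir
    return int(np.argmax(cos))            # index of the most aligned calibration view

def render_novel_view(target_dir, calib_dirs, calib_images, live_rgbd):
    k = nearest_calibration_view(target_dir, calib_dirs)
    network_input = np.concatenate([calib_images[k], live_rgbd], axis=-1)
    return network_input[..., :3]         # stand-in for the fusion network output

K, H, W = 12, 32, 32
dirs = np.random.randn(K, 3)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
out = render_novel_view(np.array([0.0, 0.0, 1.0]), dirs,
                        np.random.rand(K, H, W, 3), np.random.rand(H, W, 4))
print(out.shape)   # (32, 32, 3)
```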
    UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects, and Environments
    Anastasia Tkach
    Christine Kaeser-Chen
    Christoph Rhemann
    Jonathan Taylor
    Julien Valentin
    Kaiwen Guo
    Mingsong Dou
    Sameh Khamis
    Shahram Izadi
    Sofien Bouaziz
    Thomas Funkhouser
    Yinda Zhang
    Abstract: A set of slide decks presenting a full tutorial on 3D capture and reconstruction, with high-level applications in VR and AR, presented at CVPR 2018. The slides are available on the tutorial website: https://augmentedperception.github.io/cvpr18/
    LookinGood: Enhancing Performance Capture with Real-Time Neural Re-Rendering
    Ricardo Martin Brualla
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Julien Valentin
    Sameh Khamis
    Philip Davidson
    Anastasia Tkach
    Peter Lincoln
    Christoph Rhemann
    Cem Keskin
    Steve Seitz
    Shahram Izadi
    SIGGRAPH Asia (2018)
    Abstract: Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus in real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolution textures. We take the novel approach of augmenting such real-time performance capture systems with a deep architecture that takes a rendering from an arbitrary viewpoint, and jointly performs completion, super-resolution, and denoising of the imagery in real-time. We call this approach neural (re-)rendering, and our live system "LookinGood". Our deep architecture is trained to produce high resolution and high quality images from a coarse rendering in real-time. First, we propose a self-supervised training method that does not require manual ground-truth annotation. We contribute a specialized reconstruction error that uses semantic information to focus on relevant parts of the subject, e.g. the face. We also introduce a salient reweighting scheme of the loss function that is able to discard outliers. We specifically design the system for virtual and augmented reality headsets, where the consistency between the left and right eye plays a crucial role in the final user experience. Finally, we generate temporally stable results by explicitly minimizing the difference between two consecutive frames. We tested the proposed system in two different scenarios: one involving a single RGB-D sensor and upper-body reconstruction of an actor, the second consisting of full-body 360 degree capture. Through extensive experimentation, we demonstrate how our system generalizes across unseen sequences and subjects.
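A simplified sketch of the training signals mentioned in this abstract: a reconstruction loss reweighted by a semantic mask (e.g. emphasizing the face) plus a temporal term that penalizes the difference between consecutive outputs. The specific weighting and loss form are assumptions, not the production objective.

```python
# Semantic-weighted reconstruction loss plus a temporal stability term (sketch).
import numpy as np

def lookingood_style_loss(pred_t, gt_t, pred_prev, semantic_weight, temporal_w=0.1):
    """pred_t, gt_t, pred_prev: (H, W, 3); semantic_weight: (H, W, 1) >= 0."""
    recon = np.mean(semantic_weight * np.abs(pred_t - gt_t))      # semantics-weighted L1
    temporal = np.mean(np.abs(pred_t - pred_prev))                # frame-to-frame stability
    return float(recon + temporal_w * temporal)

H, W = 32, 32
print(lookingood_style_loss(np.random.rand(H, W, 3), np.random.rand(H, W, 3),
                            np.random.rand(H, W, 3), 1.0 + np.random.rand(H, W, 1)))
```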