Dan B Goldman

I'm a researcher working at the intersection of computer graphics, computer vision, and human-computer interaction. At Google, I currently lead an R&D team working on real-time 3D human capture and rendering. A full bio and publication list can be found on my personal website.
Authored Publications
    Project Starline: A high-fidelity telepresence system
    Supreeth Achar
    Gregory Major Blascovich
    Joseph G. Desloge
    Tommy Fortes
    Eric M. Gomez
    Sascha Häberling
    Hugues Hoppe
    Andy Huibers
    Claude Knaus
    Brian Kuschak
    Ricardo Martin-Brualla
    Harris Nover
    Andrew Ian Russell
    Steven M. Seitz
    Kevin Tong
    ACM Transactions on Graphics (Proc. SIGGRAPH Asia), vol. 40(6) (2021)
    We present a real-time bidirectional communication system that lets two people, separated by distance, experience a face-to-face conversation as if they were copresent. It is the first telepresence system that is demonstrably better than 2D videoconferencing, as measured using participant ratings (e.g., presence, attentiveness, reaction-gauging, engagement), meeting recall, and observed nonverbal behaviors (e.g., head nods, eyebrow movements). This milestone is reached by maximizing audiovisual fidelity and the sense of copresence in all design elements, including physical layout, lighting, face tracking, multi-view capture, microphone array, multi-stream compression, loudspeaker output, and lenticular display. Our system achieves key 3D audiovisual cues (stereopsis, motion parallax, and spatialized audio) and enables the full range of communication cues (eye contact, hand gestures, and body language), yet does not require special glasses or body-worn microphones/headphones. The system consists of a head-tracked autostereoscopic display, high-resolution 3D capture and rendering subsystems, and network transmission using compressed color and depth video streams. Other contributions include a novel image-based geometry fusion algorithm, free-space dereverberation, and talker localization.
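    The transmitted representation described above (compressed color and depth streams) lends itself to a simple geometric illustration. Below is a minimal sketch, not the paper's geometry fusion algorithm, of back-projecting a depth image into a colored point cloud with assumed pinhole intrinsics; all names and constants here are illustrative.

        # Minimal sketch (not the paper's fusion algorithm): back-project a depth
        # image into a colored 3D point cloud using assumed pinhole intrinsics.
        import numpy as np

        def backproject(depth, color, fx, fy, cx, cy):
            """depth: HxW in meters, color: HxWx3; returns Nx3 points and Nx3 colors."""
            h, w = depth.shape
            u, v = np.meshgrid(np.arange(w), np.arange(h))
            z = depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            valid = z > 0
            points = np.stack([x[valid], y[valid], z[valid]], axis=-1)
            return points, color[valid]

        # Toy usage with synthetic data standing in for one capture stream.
        depth = np.full((480, 640), 1.5)                 # a flat surface 1.5 m away
        color = np.zeros((480, 640, 3), dtype=np.uint8)
        points, colors = backproject(depth, color, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
        print(points.shape)                              # (307200, 3)

    A head-tracked renderer would then reproject points like these toward each of the viewer's eye positions to produce the stereopsis and motion parallax cues the abstract describes.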
    GeLaTO: Generative Latent Textured Objects
    Ricardo Martin Brualla
    Sofien Bouaziz
    European Conference on Computer Vision (2020)
    Accurate modeling of 3D objects exhibiting transparency, reflections and thin structures is an extremely challenging problem. Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low frequency geometry with learned neural textures, to encode both medium and fine scale geometry as well as view-dependent appearance. To generate the proxies' textures, we learn a joint latent space allowing category-level appearance and geometry interpolation. The proxies are independently rasterized with their corresponding neural texture and composited using a U-Net, which generates an output photorealistic image including an alpha map. We demonstrate the effectiveness of our approach by reconstructing complex objects from a sparse set of views. We show results on a dataset of real images of eyeglasses frames, which are particularly challenging to reconstruct with classical methods. We also demonstrate that these coarse proxies can be handcrafted when the underlying object geometry is easy to model, like eyeglasses, or generated using a neural network for more complex categories, such as cars.
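    As a rough illustration of the compositing step described above, the sketch below concatenates per-proxy neural-texture feature maps and maps them to RGB plus an alpha map with a small convolutional network. The paper uses a U-Net; the proxy count, channel counts, and class names here are assumptions.

        # Illustrative compositor only (the paper uses a U-Net): fuse per-proxy
        # rasterized neural-texture features into an RGB image plus an alpha map.
        import torch
        import torch.nn as nn

        class TinyCompositor(nn.Module):
            def __init__(self, n_proxies=3, feat_ch=8):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(n_proxies * feat_ch, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 4, 3, padding=1),      # 3 RGB channels + 1 alpha
                )

            def forward(self, proxy_feats):
                # proxy_feats: list of (B, feat_ch, H, W) rasterized neural textures
                x = torch.cat(proxy_feats, dim=1)
                out = self.net(x)
                return torch.sigmoid(out[:, :3]), torch.sigmoid(out[:, 3:4])

        # Toy usage: random feature maps stand in for rasterized proxy textures.
        feats = [torch.rand(1, 8, 64, 64) for _ in range(3)]
        rgb, alpha = TinyCompositor()(feats)
        print(rgb.shape, alpha.shape)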
    State of the Art on Neural Rendering
    Ayush Tewari
    Christian Theobalt
    Eli Shechtman
    Gordon Wetzstein
    Jason Saragih
    Jun-Yan Zhu
    Justus Thies
    Kalyan Sunkavalli
    Maneesh Agrawala
    Matthias Niessner
    Michael Zollhöfer
    Ohad Fried
    Ricardo Martin Brualla
    Stephen Lombardi
    Tomas Simon
    Vincent Sitzmann
    Computer Graphics Forum (2020)
    The efficient rendering of photo-realistic virtual worlds is a long-standing effort of computer graphics. Over the last few years, rapid orthogonal progress in deep generative models has been made by the computer vision and machine learning communities, leading to powerful algorithms to synthesize and edit images. Neural rendering approaches are a hybrid of both of these efforts that combine physical knowledge, such as a differentiable renderer, with learned components for controllable image synthesis. Nowadays, neural rendering is employed to solve a steadily growing number of computer graphics and vision problems. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. Specifically, we consider the type of control, i.e., how the control is provided, which parts of the pipeline are learned, explicit vs. implicit control, generalization, and stochastic vs. deterministic synthesis. The second half of this state-of-the-art report focuses on the many important use cases for the described algorithms, such as novel view synthesis, semantic photo manipulation, facial and body reenactment, relighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.
    Neural Rerendering in the Wild
    Moustafa Mahmoud Meshry
    Sameh Khamis
    Hugues Hoppe
    Ricardo Martin Brualla
    Computer Vision and Pattern Recognition (CVPR) (2019)
    We explore total scene capture: recording, modeling, and rerendering a scene under varying appearance, such as season and time of day. Starting from internet photos of a tourist landmark, we apply traditional 3D reconstruction to register the photos and approximate the scene as a point cloud. For each photo, we render the scene points into a deep framebuffer and train a neural network to learn the mapping from these initial renderings to the actual photos. This rerendering network also takes as input a latent appearance vector and a semantic mask indicating the location of transient objects like pedestrians. The model is evaluated on several datasets of publicly available images spanning a broad range of illumination conditions. We create short videos demonstrating realistic manipulation of the image viewpoint, appearance, and semantic labeling. We also compare results with prior work on scene reconstruction from internet photos. Code and additional information are available on the project webpage.
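    The "deep framebuffer" input described above can be illustrated with a simple z-buffered point splat. The sketch below projects a colored point cloud into a color-and-depth buffer with assumed intrinsics and pose; it illustrates the input construction only, not the paper's pipeline.

        # Minimal sketch of building a color + depth buffer from a point cloud;
        # shapes, intrinsics, and names are assumed for illustration.
        import numpy as np

        def splat(points, colors, K, R, t, h, w):
            """points: Nx3 world coords, colors: Nx3, K: 3x3 intrinsics, (R, t): camera pose."""
            cam = points @ R.T + t                       # world -> camera coordinates
            z = cam[:, 2]
            front = z > 1e-6
            proj = (cam[front] / z[front, None]) @ K.T   # perspective projection
            u = np.round(proj[:, 0]).astype(int)
            v = np.round(proj[:, 1]).astype(int)
            ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            u, v = u[ok], v[ok]
            zf, cf = z[front][ok], colors[front][ok]
            depth = np.full((h, w), np.inf)
            color = np.zeros((h, w, 3))
            order = np.argsort(-zf)                      # draw far-to-near so near points win
            depth[v[order], u[order]] = zf[order]
            color[v[order], u[order]] = cf[order]
            return color, depth

        # Toy usage: random colored points two meters in front of an identity camera.
        pts = np.random.rand(1000, 3) + [0.0, 0.0, 2.0]
        cols = np.random.rand(1000, 3)
        K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
        color_buf, depth_buf = splat(pts, cols, K, np.eye(3), np.zeros(3), 240, 320)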
    Deformed Reality
    Antoine Petit
    Nazim Haouchine
    Frederick Roy
    Stephane Cotin
    Computer Graphics and Visual Computing (CGVC), The Eurographics Association (2019)
    We present Deformed Reality, a new way of interacting with an augmented reality environment by manipulating 3D objects in an intuitive and physically consistent manner. Using the core principle of augmented reality, estimating rigid pose over time, our method lets the user deform the targeted object while it is rendered with its natural texture, giving the sense of interactive scene editing. Our framework follows a computationally efficient pipeline that uses a proxy CAD model for pose computation, physically based manipulation, and scene appearance estimation. The final composition is built upon a continuous image completion and re-texturing process to preserve visual consistency. The presented results show that our method can open new ways of using augmented reality by not only augmenting the environment but also interacting with objects intuitively.
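    The rigid-pose principle the abstract builds on can be illustrated with a standard PnP solve from 2D-3D correspondences against a proxy CAD model. The sketch below uses OpenCV's solvePnP with made-up correspondences; the deformation, re-texturing, and image-completion stages are not shown.

        # Sketch of the rigid-pose step only: recover an object pose from 2D-3D
        # correspondences against a proxy CAD model using PnP (OpenCV).
        # The correspondences below are made up for illustration.
        import numpy as np
        import cv2

        object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                                  [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]],
                                 dtype=np.float32)       # points on the CAD proxy (meters)
        image_points = np.array([[320, 240], [400, 238], [322, 170],
                                 [318, 300], [402, 168], [398, 302]],
                                dtype=np.float32)        # their detected pixel locations
        K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])

        ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
        R, _ = cv2.Rodrigues(rvec)                       # object rotation in the camera frame
        print(ok, tvec.ravel())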
    Approximate svBRDF Estimation From Mobile Phone Video
    Rachel Albert
    Dorian Yao Chang
    James O'Brien
    In Proceedings of EGSR 2018, vol. 37 (2018), pp. 12
    We demonstrate a new technique for obtaining a spatially varying BRDF (svBRDF) of a flat object using printed fiducial markers and a cell phone capable of continuous flash video. Our homography-based video frame alignment method does not require the fiducial markers to be visible in every frame, thereby enabling us to capture larger areas at a closer distance and higher resolution than in previous work. Pixels in the resulting panorama are fit with a BRDF based on a recursive subdivision algorithm, utilizing all the light and view positions obtained from the video. We show the versatility of our method by capturing a variety of materials with both one and two camera input streams and rendering our results on 3D objects under complex illumination.
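    One way to picture the homography-based alignment is to chain frame-to-frame homographies back to a reference frame in which the fiducial markers were visible. The sketch below does this with ORB features and RANSAC; the feature choice and thresholds are assumptions, not the paper's exact procedure.

        # Sketch of homography chaining: align each frame to a reference frame
        # (one that contained the fiducial markers) via pairwise feature matches.
        import numpy as np
        import cv2

        def pair_homography(img_a, img_b):
            """Homography mapping img_a pixel coordinates into img_b."""
            orb = cv2.ORB_create(2000)
            kp_a, des_a = orb.detectAndCompute(img_a, None)
            kp_b, des_b = orb.detectAndCompute(img_b, None)
            matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
            src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
            dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
            H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
            return H

        def chain_to_reference(frames):
            """Homographies mapping every frame into frames[0]'s coordinate system."""
            Hs = [np.eye(3)]
            for prev, cur in zip(frames, frames[1:]):
                Hs.append(Hs[-1] @ pair_homography(cur, prev))   # compose back to frame 0
            return Hs

    Each aligned frame could then be warped into a common panorama (e.g., with cv2.warpPerspective) before per-pixel BRDF fitting.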
    ESPReSSo: Efficient Slanted PatchMatch for Real-time Spacetime Stereo
    Harris Nover
    Supreeth Achar
    In Proceedings of Sixth International Conference on 3D Vision (3DV) (2018)
    We present ESPReSSo, the first real-time implementation of spacetime stereo, offering improved quality vs. existing real-time systems. ESPReSSo uses a local stereo reconstruction algorithm that precomputes subpixel-shifted binary descriptors, then iteratively samples those descriptors along slanted disparity plane hypotheses, applying an edge-aware filter for spatial cost aggregation. Plane hypotheses are shared across rectangular tiles, but every pixel gets a different winner, much as in PatchMatch Filter. This architecture performs very few descriptor computations but many cost aggregations, and we tune our choice of descriptor and filter accordingly: We propose a new 32-bit binary spacetime descriptor breve that combines the benefits of small spatial extent with robustness to scene motion, and the system aggregates costs using the permeability filter, a very efficient edge-aware filter. Our prototype system outputs 60 depth frames per second on a desktop GPU, using less than 11ms total computation per frame.
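    The descriptor-and-cost machinery the abstract refers to can be illustrated with a classic census transform and a Hamming matching cost. The sketch below is only that baseline illustration; it is not the paper's 32-bit spacetime descriptor, slanted-plane sampling, or permeability filter.

        # Baseline illustration only: a 3x3 census descriptor and a per-pixel
        # Hamming matching cost for a candidate integer disparity.
        import numpy as np

        def census_3x3(img):
            """8-bit census descriptor per pixel (image borders wrap via np.roll)."""
            desc = np.zeros(img.shape, dtype=np.uint8)
            bit = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    neighbor = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
                    desc |= (neighbor > img).astype(np.uint8) << bit
                    bit += 1
            return desc

        def hamming_cost(desc_left, desc_right, disparity):
            """Per-pixel Hamming cost of matching left to right at an integer disparity."""
            shifted = np.roll(desc_right, disparity, axis=1)
            xor = desc_left ^ shifted
            return np.unpackbits(xor[..., None], axis=-1).sum(axis=-1)

        left = (np.random.rand(48, 64) * 255).astype(np.uint8)
        right = np.roll(left, -4, axis=1)                # synthetic shift: true disparity 4
        cost = hamming_cost(census_3x3(left), census_3x3(right), 4)
        print(cost.mean())                               # small: zero except at wrap columns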
    LookinGood: Enhancing Performance Capture with Real-Time Neural Re-Rendering
    Ricardo Martin Brualla
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Julien Valentin
    Sameh Khamis
    Philip Davidson
    Anastasia Tkach
    Peter Lincoln
    Christoph Rhemann
    Cem Keskin
    Steve Seitz
    Shahram Izadi
    SIGGRAPH Asia (2018)
    Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus on real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolution textures. We take the novel approach of augmenting such real-time performance capture systems with a deep architecture that takes a rendering from an arbitrary viewpoint and jointly performs completion, super-resolution, and denoising of the imagery in real time. We call this approach neural (re-)rendering, and our live system "LookinGood". Our deep architecture is trained to produce high-resolution, high-quality images from a coarse rendering in real time. First, we propose a self-supervised training method that does not require manual ground-truth annotation. We contribute a specialized reconstruction error that uses semantic information to focus on relevant parts of the subject, e.g. the face. We also introduce a saliency reweighing scheme for the loss function that is able to discard outliers. We specifically design the system for virtual and augmented reality headsets, where the consistency between the left and right eye plays a crucial role in the final user experience. Finally, we generate temporally stable results by explicitly minimizing the difference between two consecutive frames. We tested the proposed system in two different scenarios: the first involving a single RGB-D sensor and upper-body reconstruction of an actor, the second consisting of full-body 360-degree capture. Through extensive experimentation, we demonstrate how our system generalizes across unseen sequences and subjects.
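    The reweighted, outlier-tolerant reconstruction loss described above can be sketched as a masked L1 term that upweights semantically important pixels (e.g., the face) and drops the largest residuals. The weights, drop fraction, and function names below are assumptions, not the paper's values.

        # Sketch of a semantically reweighted L1 loss with outlier rejection; the
        # face weight and 10% drop fraction are assumed, not the paper's settings.
        import torch

        def reweighted_l1(pred, target, face_mask, face_weight=5.0, drop_frac=0.1):
            """pred, target: (B, 3, H, W); face_mask: (B, 1, H, W) with 1 on the face."""
            per_pixel = (pred - target).abs().mean(dim=1, keepdim=True)
            weights = 1.0 + (face_weight - 1.0) * face_mask             # upweight the face region
            weighted = (per_pixel * weights).flatten(1)
            keep = int(weighted.shape[1] * (1.0 - drop_frac))
            kept, _ = torch.topk(weighted, keep, dim=1, largest=False)  # drop largest residuals
            return kept.mean()

        pred = torch.rand(2, 3, 64, 64, requires_grad=True)
        target = torch.rand(2, 3, 64, 64)
        face_mask = (torch.rand(2, 1, 64, 64) > 0.8).float()
        loss = reweighted_l1(pred, target, face_mask)
        loss.backward()
        print(loss.item())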
    Perspective-aware manipulation of portrait photos
    Ohad Fried
    Eli Shechtman
    Adam Finkelstein
    ACM Transactions on Graphics (Proc. SIGGRAPH), vol. 35(4) (2016)
    This paper introduces a method to modify the apparent relative pose and distance between camera and subject given a single portrait photo. Our approach fits a full perspective camera and a parametric 3D head model to the portrait, and then builds a 2D warp in the image plane to approximate the effect of a desired change in 3D. We show that this model is capable of correcting objectionable artifacts such as the large noses sometimes seen in “selfies,” or of deliberately bringing a distant camera closer to the subject. This framework can also be used to re-pose the subject, as well as to create stereo pairs from an input portrait. We show convincing results on both an existing dataset and a new dataset we captured to validate our method.
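    The geometric effect being modeled, how apparent head shape changes with camera distance at a fixed image size, can be illustrated by reprojecting 3D points under a compensated dolly. The sketch below uses a random blob in place of the parametric head model and measures the per-point 2D displacements that a warp would approximate.

        # Sketch of the dolly-zoom geometry: move the camera back, scale the focal
        # length to keep the subject the same image size, and measure per-point
        # 2D displacements. A random blob stands in for the parametric head model.
        import numpy as np

        def project(points, f, cam_dist):
            """Perspective projection with the camera cam_dist meters from the head center."""
            z = points[:, 2] + cam_dist
            return f * points[:, :2] / z[:, None]

        head = np.random.randn(500, 3) * [0.08, 0.10, 0.06]      # rough head-sized blob (meters)

        near_dist, far_dist = 0.4, 1.2                           # selfie vs. farther camera
        f_near = 800.0
        f_far = f_near * far_dist / near_dist                    # keeps overall image size fixed
        uv_near = project(head, f_near, near_dist)
        uv_far = project(head, f_far, far_dist)
        warp = uv_far - uv_near                                  # 2D displacement per point
        print(np.abs(warp).max())                                # points nearest the camera move most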
    VidCrit: Video-based asynchronous video review
    Amy Pavel
    Björn Hartmann
    Maneesh Agrawala
    Proceedings of UIST 2016
    Video production is a collaborative process in which stakeholders regularly review drafts of the edited video to indicate problems and offer suggestions for improvement. Although practitioners prefer in-person feedback, most reviews are conducted asynchronously via email due to scheduling and location constraints. The use of this impoverished medium is challenging for both providers and consumers of feedback. We introduce VidCrit, a system for providing asynchronous feedback on drafts of edited video that incorporates favorable qualities of an in-person review. The system consists of two separate interfaces: (1) a feedback recording interface captures reviewers’ spoken comments, mouse interactions, hand gestures, and other physical reactions; (2) a feedback viewing interface transcribes and segments the recorded review into topical comments so that the video author can browse the review by either text or timelines. Our system features novel methods to automatically segment a long review session into topical text comments and to label such comments with additional contextual information. We interviewed practitioners to inform a set of design guidelines for giving and receiving feedback, and based our system’s design on these guidelines. Video reviewers using our system preferred our feedback recording interface over email for providing feedback due to the reduction in time and effort. In a fixed amount of time, reviewers provided 10.9 (σ = 5.09) more local comments than when using text. All video authors rated our feedback viewing interface preferable to receiving feedback via email.
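    One ingredient of the review segmentation, splitting a recorded session into topical comments, can be approximated by grouping time-stamped transcript words at long pauses. The sketch below shows that heuristic only; the gap threshold and data layout are assumptions, not the system's actual segmentation method.

        # Sketch of one segmentation ingredient: split time-stamped transcript
        # words into separate comments at pauses longer than a threshold.
        def segment_by_pauses(words, gap=2.0):
            """words: list of (text, start_sec, end_sec); returns a list of comment strings."""
            comments, current = [], []
            for i, (text, start, end) in enumerate(words):
                if current and start - words[i - 1][2] > gap:
                    comments.append(" ".join(w for w, _, _ in current))
                    current = []
                current.append((text, start, end))
            if current:
                comments.append(" ".join(w for w, _, _ in current))
            return comments

        transcript = [("trim", 0.0, 0.4), ("this", 0.5, 0.7), ("shot", 0.8, 1.1),
                      ("audio", 5.0, 5.4), ("is", 5.5, 5.6), ("too", 5.7, 5.9), ("quiet", 6.0, 6.4)]
        print(segment_by_pauses(transcript))             # ['trim this shot', 'audio is too quiet']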