Andrea Tagliasacchi

Please refer to https://taiya.github.io
Authored Publications
    Preview abstract A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery. View details
    Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation
    Kyle Genova
    Xiaoqi Yin
    Leonidas Guibas
    Frank Dellaert
    Conference on Computer Vision and Pattern Recognition (2022)
    Preview abstract We present Panoptic Neural Fields (PNF), an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff). Each object is represented by an oriented 3D bounding box and a multi-layer perceptron (MLP) that takes position, direction, and time and outputs density and radiance. The background stuff is represented by a similar MLP that additionally outputs semantic labels. The object MLPs are instance-specific and thus can be smaller and faster than previous object-aware approaches, while still leveraging category-specific priors incorporated via meta-learned initialization. Our model builds a panoptic radiance field representation of any scene from just color images. We use off-the-shelf algorithms to predict camera poses, object tracks, and 2D image semantic segmentations. Then we jointly optimize the MLP weights and bounding box parameters using analysis-by-synthesis with self-supervision from color images and pseudo-supervision from predicted semantic segmentations. During experiments with real-world dynamic scenes, we find that our model can be used effectively for several tasks like novel view synthesis, 2D panoptic segmentation, 3D scene editing, and multiview depth prediction. View details
    LOLNeRF: Learn from One Look
    Daniel Rebain
    Kwang Yi
    Dmitry Lagun
    Computer Vision and Pattern Recognition (CVPR) (2022)
    Preview abstract We present a method for learning a generative 3D model based on neural radiance fields, trained solely from single views of objects. While generating realistic images is no longer a difficult task, producing the corresponding 3D structure such that it can be rendered from different views is non-trivial. Here, we show that, unlike existing methods, one does not need any multi-view data to achieve this goal. Specifically, we show that by learning to reconstruct many images aligned to an approximate canonical pose, with a single network conditioned on a shared latent space, one can learn a space of radiance fields that models the shape and appearance of a class of objects. We demonstrate this by training models to reconstruct a number of object categories including humans, cats, and cars, all using datasets that contain only single views of each subject and no depth or geometry information. Our experiments show that this method achieves state-of-the-art results in novel view synthesis and monocular depth prediction. View details
    Preview abstract In the era of deep learning, human pose estimation from multiple cameras with unknown calibration has received little attention to date. We show how to train a neural model to perform this task with high precision and minimal latency overhead. The proposed model takes into account joint location uncertainty due to occlusion from multiple views, and requires only 2D keypoint data for training. Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines on the well-established Human3.6M dataset, as well as the more challenging in-the-wild Ski-Pose PTZ dataset. View details
    Preview abstract We present NeSF, a method for producing 3D semantic fields from pre-trained density fields and sparse 2D semantic supervision. Our method side-steps traditional scene representations by leveraging neural representations where 3D information is stored within neural fields. In spite of being supervised by 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the pre-trained density fields improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on convincing synthetic scenes while also offering features unavailable to existing methods. View details
    Kubric: A scalable dataset generator
    Anissa Yuenming Mak
    Austin Stone
    Carl Doersch
    Cengiz Oztireli
    Charles Herrmann
    Daniel Rebain
    Derek Nowrouzezahrai
    Dmitry Lagun
    Fangcheng Zhong
    Florian Golemo
    Francois Belletti
    Henning Meyer
    Hsueh-Ti (Derek) Liu
    Issam Laradji
    Klaus Greff
    Kwang Moo Yi
    Matan Sela
    Noha Radwan
    Thomas Kipf
    Tianhao Wu
    Vincent Sitzmann
    Yilun Du
    Yishu Miao
    (2022)
    Preview abstract Data is the driving force of machine learning. The amount and quality of training data are often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypotheses, and for empirically evaluating the behaviour of complex systems. Synthetic data generation represents a powerful tool that can address all these shortcomings: 1) it is cheap, 2) it supports rich ground-truth annotations, 3) it offers full control over data, and 4) it can circumvent privacy and legal concerns. Unfortunately, the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground truth annotations. We also publish a collection of generated datasets and baseline results on several vision tasks. View details
    Vector Neurons: A General Framework for SO(3)-Equivariant Networks
    Congyue Deng
    Or Litany
    Yueqi Duan
    Adrien Poulenard
    Leonidas J. Guibas
    ICCV (2021)
    Preview abstract Invariance and equivariance to the rotation group have been widely discussed in the 3D deep learning community for pointclouds. Yet most proposed methods either use complex mathematical tools that may limit their accessibility, or are tied to specific input data types and network architectures. In this paper, we introduce a general framework built on top of what we call Vector Neuron representations for creating SO(3)-equivariant neural networks for pointcloud processing. Extending neurons from 1D scalars to 3D vectors, our vector neurons enable a simple mapping of SO(3) actions to latent spaces thereby providing a framework for building equivariance in common neural operations -- including linear layers, non-linearities, pooling, and normalizations. Due to their simplicity, vector neurons are versatile and, as we demonstrate, can be incorporated into diverse network architecture backbones, allowing them to process geometry inputs in arbitrary poses. Despite its simplicity, our method performs comparably well in accuracy and generalization with other more complex and specialized state-of-the-art methods on classification and segmentation tasks. We also show for the first time a rotation equivariant reconstruction network. View details
    Canonical Capsules: Unsupervised Capsules in Canonical Pose
    Weiwei Sun
    Boyang Deng
    Soroosh Yazdani
    Geoffrey Everest Hinton
    Kwang Moo Yi
    Neural Information Processing Systems (NeurIPS) (2021)
    Preview abstract We propose an unsupervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decomposition that satisfies the capsule invariance/equivariance properties. This not only enables the training of a semantically consistent decomposition, but also allows us to learn a canonicalization operation that enables object-centric reasoning. In doing so, we require neither classification labels nor manually-aligned training datasets to train. Yet, by learning an object-centric representation in an unsupervised manner, our method outperforms the state-of-the-art on 3D point cloud reconstruction, registration, and unsupervised classification. We will release the code and dataset to reproduce our results as soon as the paper is published. View details
    COTR: Correspondence Transformer for Matching Across Images
    Wei Jiang
    Jan Hosang
    Kwang Moo Yi
    International Conference on Computer Vision (2021)
    Preview abstract We propose a novel framework for finding correspondences in images based on a deep neural network that, given two images and a query point in one of them, finds its correspondence in the other. By doing so, one has the option to query only the points of interest and retrieve sparse correspondences, or to query all points in an image and obtain dense mappings. Importantly, in order to capture both local and global priors, and to let our model relate between image regions using the most relevant among said priors, we realize our network using a transformer. At inference time, we apply our correspondence network by recursively zooming in around the estimates, yielding a multiscale pipeline able to provide highly-accurate correspondences. Our method significantly outperforms the state of the art on both sparse and dense correspondence problems on multiple datasets and tasks, ranging from wide-baseline stereo to optical flow, without any retraining for a specific dataset. We commit to releasing data, code, and all the tools necessary to train from scratch and ensure reproducibility. View details
    PIE-NET: Parametric Inference of Point Cloud Edges
    Xiaogang Wang
    Yuelang Xu
    Kevin Kai Xu
    Bin Zhou
    Ali Mahdavi-Amiri
    Hao Zhang
    Proceedings of Neural Information Processing Systems (2020)
    Preview abstract We introduce an end-to-end learnable technique to robustly identify feature edges in 3D point cloud data. We represent these edges as a collection of parametric curves (i.e., lines, circles, and B-splines). Accordingly, our deep neural network, coined PIE-NET, is trained for parametric inference of edges. The network relies on a "region proposal" architecture, where a first module proposes an over-complete collection of edge and corner points, and a second module ranks each proposal to decide whether it should be considered. We train and evaluate our method on the ABC dataset, a large dataset of CAD models, and compare our results to those produced by traditional (non-learning) processing pipelines, as well as a recent deep learning based edge detector (EC-NET). Our results significantly improve over the state-of-the-art from both a quantitative and qualitative standpoint. View details
    CoSE: Compositional Stroke Embeddings
    Emre Aksan
    Thomas Deselaers
    Otmar Hilliges
    Proceedings of Neural Information Processing Systems (2020)
    Preview abstract We present a generative model for stroke-based drawing tasks which is able to model complex free-form structures. While previous approaches rely on sequence-based models for drawings of basic objects or handwritten text, we propose a model that treats drawings as a collection of strokes that can be composed into complex structures such as diagrams (e.g., flow-charts). At the core of the approach lies a novel auto-encoder that projects variable-length strokes into a latent space of fixed dimension. This representation space allows a relational model, operating in latent space, to better capture the relationship between strokes and to predict subsequent strokes. We demonstrate qualitatively and quantitatively that our proposed approach is able to model the appearance of individual strokes, as well as the compositional structure of larger diagram drawings. Our approach is suitable for interactive use cases such as auto-completing diagrams. View details
    Preview abstract Polygonal meshes are ubiquitous in the digital 3D domain, yet they have only played a minor role in the deep learning revolution. Leading methods for learning generative models of shapes rely on implicit functions, and generate meshes only after expensive iso-surfacing routines. To overcome these challenges, we are inspired by a classical spatial data structure from computer graphics, Binary Space Partitioning (BSP), to facilitate 3D learning. The core ingredient of BSP is an operation for recursive subdivision of space to obtain convex sets. By exploiting this property, we devise BSP-Net, a network that learns to represent a 3D shape via convex decomposition. Importantly, BSP-Net is unsupervised since no convex shape decompositions are needed for training. The network is trained to reconstruct a shape using a set of convexes obtained from a BSP-tree built on a set of planes. The convexes inferred by BSP-Net can be easily extracted to form a polygon mesh, without any need for iso-surfacing. The generated meshes are compact (i.e., low-poly) and well suited to represent sharp geometry; they are guaranteed to be watertight and can be easily parameterized. We also show that the reconstruction quality by BSP-Net is competitive with state-of-the-art methods while using far fewer primitives. View details
    ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning
    Weiwei Sun
    Wei Jiang
    Kwang Moo Yi
    Computer Vision and Pattern Recognition (CVPR) (2020)
    Preview abstract Many problems in computer vision require dealing with sparse, unordered data in the form of point clouds. Permutation-equivariant networks have become a popular solution: they operate on individual data points with simple perceptrons and extract contextual information with global pooling. This can be achieved with a simple normalization of the feature maps, a global operation that is unaffected by the order. In this paper, we propose Attentive Context Normalization (ACN), a simple yet effective technique to build permutation-equivariant networks robust to outliers. Specifically, we show how to normalize the feature maps with weights that are estimated within the network, excluding outliers from this normalization. We use this mechanism to leverage two types of attention: local and global. By combining them, our method is able to find the essential data points in high-dimensional space to solve a given task. We demonstrate through extensive experiments that our approach, which we call Attentive Context Networks (ACNe), provides a significant leap in performance compared to the state-of-the-art on camera pose estimation, robust fitting, and point cloud classification under noise and outliers. View details
    CvxNet: Learnable Convex Decomposition
    Boyang Deng
    Kyle Genova
    Soroosh Yazdani
    Sofien Bouaziz
    Geoffrey Hinton
    Computer Vision and Pattern Recognition (CVPR) (2020)
    Preview abstract Any solid object can be decomposed into a collection of convex polytopes (in short, convexes). When a small number of convexes are used, such a decomposition can be thought of as a piece-wise approximation of the geometry. This decomposition is fundamental in computer graphics, where it provides one of the most common ways to approximate geometry, for example, in real-time physics simulation. A convex object also has the property of being simultaneously an explicit and implicit representation: one can interpret it explicitly as a mesh derived by computing the vertices of a convex hull, or implicitly as the collection of half-space constraints or support functions. Their implicit representation makes them particularly well suited for neural network training, as they abstract away from the topology of the geometry they need to represent. However, at testing time, convexes can also generate explicit representations, polygonal meshes, which can then be used in any downstream application. We introduce a network architecture to represent a low dimensional family of convexes. This family is automatically derived via an auto-encoding process. We investigate the applications of this architecture including automatic convex decomposition, image to 3D reconstruction, and part-based shape retrieval. View details
    ShapeFlow: Learnable Deformations Among 3D Shapes
    Max Jiang
    Jingwei Huang
    Leonidas Guibas
    Proceedings of Neural Information Processing Systems (2020)
    Preview abstract We present ShapeFlow, a flow-based model for learning a deformation space for entire classes of 3D shapes with large intra-class variations. ShapeFlow allows learning a multi-template deformation space that is agnostic to shape topology, yet preserves fine geometric details. Different from a generative space where a latent vector is directly decoded into a shape, a deformation space decodes a vector into a continuous flow that can advect a source shape towards a target. Such a space naturally allows the disentanglement of geometric style (coming from the source) and structural pose (conforming to the target). We parametrize the deformation between geometries as a learned continuous flow field via a neural network and show that such deformations can be guaranteed to have desirable properties, such as bijectivity, freedom from self-intersections, or volume preservation. We illustrate the effectiveness of this learned deformation space for various downstream applications, including shape generation via deformation, geometric style transfer, unsupervised learning of a consistent parameterization for entire classes of shapes, and shape interpolation. View details
    NASA: Neural Articulated Shape Approximation
    Boyang Deng
    JP Lewis
    Timothy Jeruzalski
    Gerard Pons-Moll
    Geoffrey Hinton
    Mohammad Norouzi
    European Conference on Computer Vision (ECCV) (2020)
    Preview abstract Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent 3D objects using polygonal meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables efficient representation of articulated deformable objects using neural indicator functions that are conditioned on pose. Occupancy testing using NASA is straightforward, circumventing the complexity of meshes and the issue of water-tightness. We demonstrate the effectiveness of NASA for 3D tracking applications, and discuss other potential extensions. View details
    Deep Implicit Volume Compression
    Danhang "Danny" Tang
    Phil Chou
    Christian Haene
    Mingsong Dou
    Jonathan Taylor
    Shahram Izadi
    Sofien Bouaziz
    Cem Keskin
    CVPR (2020)
    Preview abstract We describe a novel approach for compressing truncated signed distance fields (TSDF) stored in voxel grids and their corresponding textures. To compress the TSDF, our method relies on a block-based neural architecture trained end-to-end, achieving state-of-the-art compression rates. To prevent topological errors, we losslessly compress the signs of the TSDF, which, as a side effect, also bounds the maximum reconstruction error by the voxel size. To compress the affiliated texture, we designed a fast block-based charting and Morton packing technique generating a coherent image that can be efficiently compressed using existing image-based compression algorithms. We demonstrate the performance of our algorithms on a large set of 4D performance sequences captured using multi-camera RGBD setups. View details
    Preview abstract We propose a novel image sampling method for differentiable image transformation in deep neural networks. The sampling schemes currently used in deep learning, such as Spatial Transformer Networks, rely on bilinear interpolation, which performs poorly under severe scale changes, and more importantly, results in poor gradient propagation. This is due to their strict reliance on direct neighbors. Instead, we propose to generate random auxiliary samples in the vicinity of each pixel in the sampled image, and create a linear approximation with their intensity values. We then use this approximation as a differentiable formula for the transformed image. We demonstrate that our approach produces more representative gradients with a wider basin of convergence for image alignment, which leads to considerable performance improvements when training networks for classification tasks. This is not only true under large downsampling, but also when there are no scale changes. We compare our approach with multi-scale sampling and show that we outperform it. We then demonstrate that our improvements to the sampler are compatible with other tangential improvements to Spatial Transformer Networks and that it further improves their performance. View details
    Volumetric Capture of Humans with a Single RGBD Camera via Semi-Parametric Learning
    Anastasia Tkach
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Ricardo Martin Brualla
    George Papandreou
    Philip Davidson
    Cem Keskin
    Shahram Izadi
    CVPR (2019)
    Preview abstract Volumetric (4D) performance capture is fundamental for AR/VR content generation. Whereas previous work in 4D performance capture has shown impressive results in studio settings, the technology is still far from being accessible to a typical consumer who, at best, might own a single RGBD sensor. Thus, in this work, we propose a method to synthesize free viewpoint renderings using a single RGBD camera. The key insight is to leverage previously seen "calibration" images of a given user to extrapolate what should be rendered in a novel viewpoint from the data available in the sensor. Given these past observations from multiple viewpoints, and the current RGBD image from a fixed view, we propose an end-to-end framework that fuses both these data sources to generate novel renderings of the performer. We demonstrate that the method can produce high fidelity images, and handle extreme changes in subject pose and camera viewpoints. We also show that the system generalizes to performers not seen in the training data. We run exhaustive experiments demonstrating the effectiveness of the proposed semi-parametric model (i.e. calibration images available to the neural network) compared to other state of the art machine learned solutions. Further, we compare the method with more traditional pipelines that employ multi-view capture. We show that our framework is able to achieve compelling results, with substantially less infrastructure than previously required. View details
    Deep Reflectance Fields - High-Quality Facial Reflectance Field Inference from Color Gradient Illumination
    Abhi Meka
    Christian Haene
    Michael Zollhöfer
    Graham Fyffe
    Xueming Yu
    Jason Dourgarian
    Peter Denny
    Sofien Bouaziz
    Peter Lincoln
    Matt Whalen
    Geoff Harvey
    Jonathan Taylor
    Shahram Izadi
    Paul Debevec
    Christian Theobalt
    Julien Valentin
    Christoph Rhemann
    SIGGRAPH (2019)
    Preview abstract Photo-realistic relighting of human faces is a highly sought after feature with many applications ranging from visual effects to truly immersive virtual experiences. Despite tremendous technological advances in the field, humans are often capable of distinguishing real faces from synthetic renders. Photo-realistically relighting any human face is indeed a challenge with many difficulties going from modelling sub-surface scattering and blood flow to estimating the interaction between light and individual strands of hair. We introduce the first system that combines the ability to deal with dynamic performances with the realism of 4D reflectance fields, enabling photo-realistic relighting of non-static faces. The core of our method consists of a deep neural network that is able to predict full 4D reflectance fields from two images captured under spherical gradient illumination. Extensive experiments not only show that two images under spherical gradient illumination can be easily captured in real time, but also that these particular images contain all the information needed to estimate the full reflectance field, including specularities and high frequency details. Finally, side by side comparisons demonstrate that the proposed system outperforms the current state-of-the-art in terms of realism and speed. View details
    Real-time Compression and Streaming of 4D Performances
    Danhang Tang
    Mingsong Dou
    Peter Lincoln
    Philip Davidson
    Kaiwen Guo
    Jonathan Taylor
    Cem Keskin
    Sofien Bouaziz
    Shahram Izadi
    ACM Transactions on Graphics (2018)
    Preview abstract We introduce a realtime compression architecture for 4D performance capture that is two orders of magnitude faster than current state-of-the-art techniques, yet achieves comparable visual quality and bitrate. We note how much of the algorithmic complexity in traditional 4D compression arises from the necessity to encode geometry in an explicit model (i.e. a triangle mesh). In contrast, we propose an encoder that leverages an implicit model to represent the observed geometry and its changes through time. View details
    Preview abstract Real time non-rigid reconstruction pipelines are extremely computationally expensive and easily saturate the highest end GPUs currently available. This requires careful strategic choices to be made about a set of highly interconnected parameters that divide up the limited compute. Offline systems, however, prove the value of increasing voxel resolution, more iterations, and higher frame rates. To this end, we demonstrate a set of remarkably simple, but effective modifications to these algorithms that significantly reduce the average per-frame computation cost allowing these parameters to be increased. Specifically, we divide the depth stream into sub-frames and fusion-frames, disabling both model accumulation (fusion) and non-rigid alignment (model tracking) on the former. Instead, we efficiently track point correspondences across neighboring sub-frames. We then leverage these correspondences to initialize the standard non-rigid alignment to a fusion-frame where data can then be accumulated into the model. As a result, compute resources in the modified non-rigid reconstruction pipeline can be immediately re-purposed. Finally, we leverage recent high framerate depth algorithms to build a novel “twin” sensor consisting of a low-res/high-fps sub-frame camera and a second low-fps/high-res fusion camera. View details
    The Need 4 Speed in Real-Time Dense Visual Tracking
    Christoph Rhemann
    Jonathan Taylor
    Philip Davidson
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sameh Khamis
    Danhang Tang
    Vladimir Tankovich
    Julien Valentin
    Shahram Izadi
    SIGGRAPH Asia (2018)
    Preview abstract The advent of consumer depth cameras has incited the development of a new cohort of algorithms tackling challenging computer vision problems. The primary reason is that depth provides direct geometric information that is largely invariant to texture and illumination. As such, substantial progress has been made in human and object pose estimation, 3D reconstruction and simultaneous localization and mapping. Most of these algorithms naturally benefit from the ability to accurately track the pose of an object or scene of interest from one frame to the next. However, commercially available depth sensors (typically running at 30fps) can allow for large inter-frame motions to occur that make such tracking problematic. A high frame rate depth camera would thus greatly ameliorate these issues, and further increase the tractability of these computer vision problems. Nonetheless, the depth accuracy of recent systems for high-speed depth estimation [Fanello et al. 2017b] can degrade at high frame rates. This is because the active illumination employed produces a low SNR and thus a high exposure time is required to obtain a dense accurate depth image. Furthermore, in the presence of rapid motion, longer exposure times produce artifacts due to motion blur and necessitate a lower frame rate, which introduces large inter-frame motion that often yields tracking failures. In contrast, this paper proposes a novel combination of hardware and software components that avoids the need to compromise between a dense accurate depth map and a high frame rate. We document the creation of a full 3D capture system for high speed and quality depth estimation, and demonstrate its advantages in a variety of tracking and reconstruction tasks. We extend the state-of-the-art active stereo algorithm presented in Fanello et al. [2017b] by adding a space-time feature in the matching phase. We also propose a machine learning based depth refinement step that is an order of magnitude faster than traditional postprocessing methods. We quantitatively and qualitatively demonstrate the benefits of the proposed algorithms in the acquisition of geometry in motion. Our pipeline executes in 1.1ms leveraging modern GPUs and off-the-shelf cameras and illumination components. We show how the sensor can be employed in many different applications, from [non-]rigid reconstructions to hand/face tracking. Further, we show many advantages over existing state of the art depth camera technologies beyond framerate, including latency, motion artifacts, multi-path errors, and multi-sensor interference. View details