Jump to Content
Ruofei Du

Ruofei Du

Ruofei Du serves as Interactive Perception & Graphics Lead / Manager at Google and devotes to creating novel interactive technologies for XR. As a Research Scientist, Ruofei's research covers a wide range of topics in technical HCI, Graphics, and Perception, including XR interactions, visual programming, augmented communication, XR social platforms, digital human, foveated rendering, accessibility, and deep learning in graphics. Du serves as an Associate Chair in program committee of CHI and UIST; an Associate Editor of IEEE TCSVT. Ruofei holds 3 US patents and has published over 30 peer-reviewed publications in top venues of HCI, Computer Graphics, and Computer Vision, including CHI, UIST, SIGGRAPH Asia, TVCG, CVPR, ICCV, ECCV, ISMAR, VR, and I3D. In their own words: I am passionate about inventing interactive technologies with graphics, perception, and HCI. See my research, artsy, projects, youtube, talks, github, and shadertoy demos for fun!


Personal Website
Google Scholar
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
    Phil A. Chou
    Berivan Isik
    Hugues Hoppe
    Danhang Tang
    Jonathan Taylor
    Philip Davidson
    arXiv:2402.05887 (2024)
    Preview abstract We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec’s performance on its intended content, it can effectively adapt the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit “neural code images” that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec’s design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 3 adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potentials in image/video compression and streaming. View details
    Experiencing InstructPipe: Building Multi-modal AI Pipelines via Prompting LLMs and Visual Programming
    Zhongyi Zhou
    Jing Jin
    Xiuxiu Yuan
    Jun Jiang
    Jingtao Zhou
    Yiyi Huang
    Kristen Wright
    Jason Mayes
    Mark Sherwood
    Ram Iyengar
    Na Li
    Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 5 (to appear)
    Preview abstract Foundational multi-modal models have democratized AI access, yet the construction of complex, customizable machine learning pipelines by novice users remains a grand challenge. This paper demonstrates a visual programming system that allows novices to rapidly prototype multimodal AI pipelines. We first conducted a formative study with 58 contributors and collected 236 proposals of multimodal AI pipelines that served various practical needs. We then distilled our findings into a design matrix of primitive nodes for prototyping multimodal AI visual programming pipelines, and implemented a system with 65 nodes. To support users' rapid prototyping experience, we built InstructPipe, an AI assistant based on large language models (LLMs) that allows users to generate a pipeline by writing text-based instructions. We believe InstructPipe enhances novice users onboarding experience of visual programming and the controllability of LLMs by offering non-experts a platform to easily update the generation. View details
    ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
    Brian Moreno Collins
    Karthik Ramani
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
    Preview abstract Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subject user feedback imply that ChatDirector significantly enhances communication efficacy and user engagement. View details
    Human I/O: Towards Comprehensive Detection of Situational Impairments in Everyday Activities
    Xingyu Bruce Liu
    Jiahao Nick Li
    Xiang 'Anthony' Chen
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 18 (to appear)
    Preview abstract Situationally Induced Impairments and Disabilities (SIIDs) can significantly hinder user experience in everyday activities. Despite their prevalence, existing adaptive systems predominantly cater to specific tasks or environments and fail to accommodate the diverse and dynamic nature of SIIDs. We introduce Human I/O, a real-time system that detects SIIDs by gauging the availability of human input/output channels. Leveraging egocentric vision, multimodal sensing and reasoning with large language models, Human I/O achieves good performance in availability prediction across 60 in-the-wild egocentric videos in 32 different scenarios. Further, while the core focus of our work is on the detection of SIIDs rather than the creation of adaptive user interfaces, we showcase the utility of our prototype via a user study with 10 participants. Findings suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs, paving the way for more adaptive and accessible interactive systems in the future. View details
    UI Mobility Control in XR: Switching UI Positionings between Static, Dynamic, and Self Entities
    Siyou Pei
    Yang Zhang
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 12 (to appear)
    Preview abstract Extended reality (XR) has the potential for seamless user interface (UI) transitions across people, objects, and environments. However, the design space, applications, and common practices of 3D UI transitions remain underexplored. To address this gap, we conducted a need-finding study with 11 participants, identifying and distilling a taxonomy based on three types of UI placements --- affixed to static, dynamic, or self entities. We further surveyed 113 commercial applications to understand the common practices of 3D UI mobility control, where only 6.2% of these applications allowed users to transition UI between entities. In response, we built interaction prototypes to facilitate UI transitions between entities. We report on results from a qualitative user study (N=14) on 3D UI mobility control using our FingerSwitches technique, which suggests that perceived usefulness is affected by types of entities and environments. We aspire to tackle a vital need in UI mobility within XR. View details
    Experiencing Visual Blocks for ML: Visual Prototyping of AI Pipelines
    Na Li
    Jing Jin
    Michelle Carney
    Jun Jiang
    Xiuxiu Yuan
    Kristen Wright
    Mark Sherwood
    Jason Mayes
    Lin Chen
    Jingtao Zhou
    Zhongyi Zhou
    Ping Yu
    Ram Iyengar
    ACM (2023) (to appear)
    Preview abstract We demonstrate Visual Blocks for ML, a visual programming platform that facilitates rapid prototyping of ML-based multimedia applications. As the public version of Rapsai , we further integrated large language models and custom APIs into the platform. In this demonstration, we will showcase how to build interactive AI pipelines in a few drag-and-drops, how to perform interactive data augmentation, and how to integrate pipelines into Colabs. In addition, we demonstrate a wide range of community-contributed pipelines in Visual Blocks for ML, covering various aspects including interactive graphics, chains of large language models, computer vision, and multi-modal applications. Finally, we encourage students, designers, and ML practitioners to contribute ML pipelines through https://github.com/google/visualblocks/tree/main/pipelines to inspire creative use cases. Visual Blocks for ML is available at http://visualblocks.withgoogle.com. View details
    InstructPipe: Building Visual Programming Pipelines with Human Instructions
    Zhongyi Zhou
    Jing Jin
    Xiuxiu Yuan
    Jun Jiang
    Jingtao Zhou
    Yiyi Huang
    Kristen Wright
    Jason Mayes
    Mark Sherwood
    Ram Iyengar
    Na Li
    arXiv, vol. 2312.09672 (2023)
    Preview abstract Visual programming provides beginner-level programmers with a coding-free experience to build their customized pipelines. Existing systems require users to build a pipeline entirely from scratch, implying that novice users need to set up and link appropriate nodes all by themselves, starting from a blank workspace. We present InstructPipe, an AI assistant that enables users to start prototyping machine learning (ML) pipelines with text instructions. We designed two LLM modules and a code interpreter to execute our solution. LLM modules generate pseudocode of a target pipeline, and the interpreter renders a pipeline in the node-graph editor for further human-AI collaboration. Technical evaluations reveal that InstructPipe reduces user interactions by 81.1% compared to traditional methods. Our user study (N=16) showed that InstructPipe empowers novice users to streamline their workflow in creating desired ML pipelines, reduce their learning curve, and spark innovative ideas with open-ended commands. View details
    ThingShare: Ad-Hoc Digital Copies of Physical Objects for Sharing Things in Video Meetings
    Erzhen Hu
    Jens Emil Sloth Grønbæk
    Wen Ying
    Seongkook Heo
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM
    Preview abstract During a video meeting, people may share various physical objects with remote users, such as physical documents, design prototypes, and personal items. However, sharing physical objects in video meetings has several challenges, e.g., the difficulty of referencing a remote object, the limited size and clarity of the object, and the inefficiency of coordinating the position and distances of an object to the camera. We introduce ThingShare, a video-conferencing system designed to support sharing of physical objects during remote meetings. ThingShare allows users to easily create digital copies of physical objects in the video feeds, which can then be magnified on a separate panel for focused sharing, overlaid on the user’s video feed for sharing in context, and stored in the object drawer for reviews. Our user study showed that ThingShare made initiating object-centric discussions more efficient and provided a more stable and detailed viewing of the objects. View details
    Rapsai: Accelerating Machine Learning Prototyping of Multimedia Applications through Visual Programming
    Na Li
    Jing Jin
    Michelle Carney
    Scott Joseph Miles
    Maria Kleiner
    Xiuxiu Yuan
    Anuva Kulkarni
    Xingyu “Bruce” Liu
    Ahmed K Sabie
    Ping Yu
    Ram Iyengar
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM
    Preview abstract In recent years, there has been a proliferation of multimedia applications that leverage machine learning (ML) for interactive experiences. Prototyping ML-based applications is, however, still challenging, given complex workflows that are not ideal for design and experimentation. To better understand these challenges, we conducted a formative study with seven ML practitioners to gather insights about common ML evaluation workflows. This study helped us derive six design goals, which informed Rapsai, a visual programming platform for rapid and iterative development of end-to-end ML-based multimedia applications. Rapsai is based on a node-graph editor to facilitate interactive characterization and visualization of ML model performance. Rapsai streamlines end-to-end prototyping with interactive data augmentation and model comparison capabilities in its no-coding environment. Our evaluation of Rapsai in four real-world case studies (N=15) suggests that practitioners can accelerate their workflow, make more informed decisions, analyze strengths and weaknesses, and holistically evaluate model behavior with real-world input. View details
    Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals
    Xingyu Bruce Liu
    Vladimir Kirilyuk
    Xiuxiu Yuan
    Xiang ‘Anthony’ Chen
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, pp. 1-20
    Preview abstract Computer-mediated platforms are increasingly facilitating verbal communication, and capabilities such as live captioning and noise cancellation enable people to understand each other better. We envision that visual augmentations that leverage semantics in the spoken language could also be helpful to illustrate complex or unfamiliar concepts. To advance our understanding of the interest in such capabilities, we conducted formative research through remote interviews (N=10) and crowdsourced a dataset of 1500 sentence-visual pairs across a wide range of contexts. These insights informed Visual Captions, a real-time system that we integrated into a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We report on our findings from a lab study (N=26) and a two-week deployment study (N=10), which demonstrate how Visual Captions has the potential to help people improve their communication through visual augmentation in various scenarios. View details
    Modeling and Improving Text Stability in Live Captions
    Xingyu "Bruce" Liu
    Jun Zhang
    Leonardo Ferrer
    Susan Xu
    Vikas Bahirwani
    Extended Abstract of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM, 208:1-9
    Preview abstract In recent years, live captions have gained significant popularity through its availability in remote video conferences, mobile applications, and the web. Unlike preprocessed subtitles, live captions require real-time responsiveness by showing interim speech-to-text results. As the prediction confidence changes, the captions may update, leading to visual instability that interferes with the user’s viewing experience. In this work, we characterize the stability of live captions by proposing a vision-based flickering metric using luminance contrast and Discrete Fourier Transform. Additionally, we assess the effect of unstable captions on the viewer through task load index surveys. Our analysis reveals significant correlations between the viewer's experience and our proposed quantitative metric. To enhance the stability of live captions without compromising responsiveness, we propose the use of tokenized alignment, word updates with semantic similarity, and smooth animation. Results from a crowdsourced study (N=123), comparing four strategies, indicate that our stabilization algorithms lead to a significant reduction in viewer distraction and fatigue, while increasing viewers' reading comfort. View details
    Experiencing Augmented Communication with Real-time Visuals using Large Language Models in Visual Captions
    Xingyu 'Bruce' Liu
    Vladimir Kirilyuk
    Xiuxiu Yuan
    Xiang ‘Anthony’ Chen
    Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2023) (to appear)
    Preview abstract We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals that are relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the usage of Visual Captions in open-vocabulary settings, and how the addition of visuals based on the context of conversations could improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three approaches people can interact with the system with different levels of AI proactivity. Visual Captions is open-sourced at https://github.com/google/archat. View details
    Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
    Ziqian Bai
    Danhang "Danny" Tang
    Di Qiu
    Abhimitra Meka
    Mingsong Dou
    Ping Tan
    Thabo Beeler
    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE
    Preview abstract We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expressions synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches. View details
    Experiencing Rapid Prototyping of Machine Learning Based Multimedia Applications in Rapsai
    Na Li
    Jing Jin
    Michelle Carney
    Xiuxiu Yuan
    Ping Yu
    Ram Iyengar
    CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, ACM, 448:1-4
    Preview abstract We demonstrate Rapsai, a visual programming platform that aims to streamline the rapid and iterative development of end-to-end machine learning (ML)-based multimedia applications. Rapsai features a node-graph editor that enables interactive characterization and visualization of ML model performance, which facilitates the understanding of how the model behaves in different scenarios. Moreover, the platform streamlines end-to-end prototyping by providing interactive data augmentation and model comparison capabilities within a no-coding environment. Our demonstration showcases the versatility of Rapsai through several use cases, including virtual background, visual effects with depth estimation, and audio denoising. The implementation of Rapsai is intended to support ML practitioners in streamlining their workflow, making data-driven decisions, and comprehensively evaluating model behavior with real-world input. View details
    Sandwiched Image Compression: Increasing the resolution and dynamic range of standard codecs
    Phil Chou
    Hugues Hoppe
    Danhang "Danny" Tang
    Philip Davidson
    2022 Picture Coding Symposium (PCS), IEEE (to appear)
    Preview abstract Given a standard image codec, we compress images that may have higher resolution and/or higher bit depth than allowed in the codec's specifications, by sandwiching the standard codec between a neural pre-processor (before the standard encoder) and a neural post-processor (after the standard decoder). Using a differentiable proxy for the the standard codec, we design the neural pre- and post-processors to transport the high resolution (super-resolution, SR) or high bit depth (high dynamic range, HDR) images as lower resolution and lower bit depth images. The neural processors accomplish this with spatially coded modulation, which acts as watermarks to preserve the important image detail during compression. Experiments show that compared to conventional methods of transmitting high resolution or high bit depth through lower resolution or lower bit depth codecs, our sandwich architecture gains ~9 dB for SR images and ~3 dB for HDR images at the same rate over large test sets. We also observe significant gains in visual quality. View details
    Preview abstract Hand-based gestural interaction in augmented reality (AR) is an increasingly popular mechanism for spatial interactions. However, it presents many challenges. For example, most hand gesture interactions work well for interactions with virtual content and interfaces, but seldom work with physical devices and users’ environment. To explore this, and rather than inventing new paradigms for AR interactions, this paper revisits Zigelbaum, Kumpf, Vazquez, and Ishii's 2008 project ‘Slurp’ - a physical eyedropper to interact with digital content from IoT devices. We revive this historical work in a new modality of AR through a five step process: re-presecencing, design experimentation, scenario making, expansion through generative engagements with designers, and reflection. For the designers we engaged, looking back and designing with a restored prototype helped increased understanding of interactive strategies, intentions and rationales of original work. By revisiting Slurp, we also found many new potentials of its metaphorical interactions that could be applied in the context of emerging spatial computing platforms (e.g., smart home devices). In doing so, we discuss the value of mining past works in new domains and demonstrate a new way of thinking about designing interactions in emerging platforms. View details
    OmniSyn: Synthesizing 360 Videos with Wide-baseline Panoramas
    David Li
    Christian Haene
    Danhang "Danny" Tang
    Amitabh Varshney
    2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), IEEE
    Preview abstract Immersive maps such as Google Street View and Bing Streetside provide true-to-life views with a massive collection of panoramas. However, these panoramas are only available at sparse intervals along the path they are taken, resulting in visual discontinuities during navigation. Prior art in view synthesis is usually built upon a set of perspective images, a pair of stereoscopic images, or a monocular image, but barely examines wide-baseline panoramas, which are widely adopted in commercial platforms to optimize bandwidth and storage usage. In this paper, we leverage the unique characteristics of wide-baseline panoramas and present OmniSyn, a novel pipeline for 360° view synthesis between wide-baseline panoramas. OmniSyn predicts omnidirectional depth maps using a spherical cost volume and a monocular skip connection, renders meshes in 360° images, and synthesizes intermediate views with a fusion network. We demonstrate the effectiveness of OmniSyn via comprehensive experimental results including comparison with the state-of-the-art methods on CARLA and Matterport datasets, ablation studies, and generalization studies on street views. We envision our work may inspire future research for this unheeded real-world task and eventually produce a smoother experience for navigating immersive maps. View details
    Opportunistic Interfaces for Augmented Reality: Transforming Everyday Objects into Tangible 6DoF Interfaces Using Ad hoc UI
    Mathieu Le Goc
    Shengzhi Wu
    Danhang "Danny" Tang
    Jun Zhang
    David Joseph New Tan
    Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, ACM
    Preview abstract Real-time environmental tracking has become a fundamental capability in modern mobile phones and AR/VR devices. However, it only allows user interfaces to be anchored at a static location. Although fiducial and natural-feature tracking overlays interfaces with specific visual features, they typically require developers to define the pattern before deployment. In this paper, we introduce opportunistic interfaces to grant users complete freedom to summon virtual interfaces on everyday objects via voice commands or tapping gestures. We present the workflow and technical details of Ad hoc UI (AhUI), a prototyping toolkit to empower users to turn everyday objects into opportunistic interfaces on the fly. We showcase a set of demos with real-time tracking, voice activation, 6DoF interactions, and mid-air gestures and prospect the future of opportunistic interfaces. View details
    RetroSphere: Self-Contained Passive 3D Controller Tracking for Augmented Reality
    Ananta Narayanan Balaji
    Clayton Merrill Kimber
    David Li
    Shengzhi Wu
    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 6(4) (2022), 157:1-157:36
    Preview abstract Advanced AR/VR headsets often have a dedicated depth sensor or multiple cameras, high processing power, and a highcapacity battery to track hands or controllers. However, these approaches are not compatible with the small form factor and limited thermal capacity of lightweight AR devices. In this paper, we present RetroSphere, a self-contained 6 degree of freedom (6DoF) controller tracker that can be integrated with almost any device. RetroSphere tracks a passive controller with just 3 retroreflective spheres using a stereo pair of mass-produced infrared blob trackers, each with its own infrared LED emitters. As the sphere is completely passive, no electronics or recharging is required. Each object tracking camera provides a tiny Arduino-compatible ESP32 microcontroller with the 2D position of the spheres. A lightweight stereo depth estimation algorithm that runs on the ESP32 performs 6DoF tracking of the passive controller. Also, RetroSphere provides an auto-calibration procedure to calibrate the stereo IR tracker setup. Our work builds upon Johnny Lee’s Wii remote hacks and aims to enable a community of researchers, designers, and makers to use 3D input in their projects with affordable off-the-shelf components. RetroSphere achieves a tracking accuracy of about 96.5% with errors as low as ∼3.5 cm over a 100 cm tracking range, validated with ground truth 3D data obtained using a LIDAR camera while consuming around 400 mW. We provide implementation details, evaluate the accuracy of our system, and demonstrate example applications, such as mobile AR drawing, 3D measurement, etc. with our Retrosphere-enabled AR glass prototype. View details
    PRIF: Primary Ray-based Implicit Function
    Brandon Yushan Feng
    Danhang "Danny" Tang
    Amitabh Varshney
    European Conference on Computer Vision (ECCV) (2022)
    Preview abstract We introduce a new implicit shape representation called Primary Ray-based Implicit Function (PRIF). In contrast to most existing approaches based on the signed distance function (SDF) which handles spatial locations, our representation operates on oriented rays. Specifically, PRIF is formulated to directly produce the surface hit point of a given input ray, without the expensive sphere-tracing operations, hence enabling efficient shape extraction and differentiable rendering. We demonstrate that neural networks trained to encode PRIF achieve successes in various tasks including single shape representation, category-wise shape generation, shape completion from sparse or noisy observations, inverse rendering for camera pose estimation, and neural rendering with color. View details
    ProtoSound: A Personalized and Scalable Sound Recognition System for Deaf and Hard-of-Hearing Users
    DJ Jain
    Khoa Huynh Anh Nguyen
    Steven Goodman
    Rachel Grossman-Kahn
    Hung Ngo
    Leah Findlater
    Jon E. Froehlich
    Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI), ACM, pp. 24
    Preview abstract Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fine-grained categories. ProtoSound is motivated by prior work examining sound awareness needs of DHH people and by a survey we conducted with 472 DHH participants. To evaluate ProtoSound, we characterized performance on two real-world sound datasets, showing significant improvement over state-of-the-art (e.g., +9.7% accuracy on the first dataset). We then deployed ProtoSound's end-user training and real-time recognition through a mobile application and recruited 19 hearing participants who listened to the real-world sounds and rated the accuracy across 56 locations (e.g., homes, restaurants, parks). Results show that ProtoSound personalized the model on-device in real-time and accurately learned sounds across diverse acoustic contexts. We close by discussing open challenges in personalizable sound recognition, including the need for better recording interfaces and algorithmic improvements. View details
    Sandwiched Image Compression: Wrapping Neural Networks Around a Standard Codec
    Phil Chou
    Hugues Hoppe
    Danhang "Danny" Tang
    Philip Davidson
    2021 IEEE International Conference on Image Processing (ICIP), IEEE, Anchorage, Alaska, pp. 3757-3761
    Preview abstract We sandwich a standard image codec between two neural networks: a preprocessor that outputs neural codes, and a postprocessor that reconstructs the image. The neural codes are compressed as ordinary images by the standard codec. Using differentiable proxies for both rate and distortion, we develop a rate-distortion optimization framework that trains the networks to generate neural codes that are efficiently compressible as images. This architecture not only improves rate-distortion performance for ordinary RGB images, but also enables efficient compression of alternative image types (such as normal maps of computer graphics) using standard image codecs. Results demonstrate the effectiveness and flexibility of neural processing in mapping a variety of input data modalities to the rigid structure of standard codecs. A surprising result is that the rate-distortion-optimized neural processing seamlessly learns to transport color images using a single-channel (grayscale) codec. View details
    HumanGPS: Geodesic PreServing Feature for Dense Human Correspondence
    Danhang "Danny" Tang
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sofien Bouaziz
    Ping Tan
    Computer Vision and Pattern Recognition 2021 (2021), pp. 8
    Preview abstract In this paper, we address the problem of building dense correspondences between human images under arbitrary camera viewpoints and body poses. Prior art either assumes small motion between frames or relies on local descriptors, which cannot handle large motion or visually ambiguous body parts, e.g. left v.s. right hand. In contrast, we propose a deep learning framework that maps each pixel to a feature space, where the feature distances reflect the geodesic distances among pixels as if they were projected onto the surface of a 3D human scan. To this end, we introduce novel loss functions to push features apart according to their geodesic distances on the surface. Without any semantic annotation, the proposed embeddings automatically learn to differentiate visually similar parts and align different subjects into an unified feature space. Extensive experiments show that the learned embeddings can produce accurate correspondences between images with remarkable generalization capabilities on both intra and inter subjects. View details
    Saliency Computation for Virtual Cinematography in 360° Videos
    Amitabh Varshney
    Computer Graphics and Applications, vol. 41(4) (2021), pp. 99-106
    Preview abstract Recent advances in virtual reality cameras have contributed to a phenomenal growth of 360° videos. Estimating regions likely to attract user attention is critical for efficiently streaming and rendering 360° videos. In this article, we present a simple, novel, GPU-driven pipeline for saliency computation and virtual cinematography in 360° videos using spherical harmonics (SH). We efficiently compute the 360° video saliency through the spectral residual of the SH coefficients between multiple bands at over 60FPS for 4K resolution videos. Further, our interactive computation of spherical saliency can be used for saliency-guided virtual cinematography in 360° videos. View details
    GazeChat: Enhancing Virtual Conferences with Gaze-aware 3D Photos
    Zhenyi He
    Keru Wang
    Brandon Yushan Feng
    Ken Perlin
    The 34th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2021), pp. 769-782
    Preview abstract Communication software such as Clubhouse and Zoom has evolved to be an integral part of many people's daily lives. However, due to network bandwidth constraints and concerns about privacy, cameras in video conferencing are often turned off by participants. This leads to a situation in which people can only see each others' profile images, which is essentially an audio-only experience. Even when switched on, video feeds do not provide accurate cues as to who is talking to whom. This paper introduces GazeChat, a remote communication system that visually represents users as gaze-aware 3D profile photos. This satisfies users' privacy needs while keeping online conversations engaging and efficient. GazeChat uses a single webcam to track whom any participant is looking at, then uses neural rendering to animate all participants' profile images so that participants appear to be looking at each other. We have conducted a remote user study (N=16) to evaluate GazeChat in three conditions: audio conferencing with profile photos, GazeChat, and video conferencing. Based on the results of our user study, we conclude that GazeChat maintains the feeling of presence while preserving more privacy and requiring lower bandwidth than video conferencing, provides a greater level of engagement than to audio conferencing, and helps people to better understand the structure of their conversation. View details
    A Log-Rectilinear Transformation for Foveated 360-degree Video Streaming
    David Li
    Adharsh Babu
    Camelia D. Brumar
    Amitabh Varshney
    IEEE Transactions on Visualization and Computer Graphics, vol. 27, pp. 2638-2647
    Preview abstract With the rapidly increasing resolutions of 360° cameras, head-mounted displays, and live-streaming services, streaming high-resolution panoramic videos over limited-bandwidth networks is becoming a critical challenge. Foveated video streaming can address this rising challenge in the context of eye-tracking-equipped virtual reality head-mounted displays. However, conventional log-polar foveated rendering suffers from a number of visual artifacts such as aliasing and flickering. In this paper, we introduce a new log-rectilinear transformation that incorporates summed-area table filtering and off-the-shelf video codecs to enable foveated streaming of 360° videos suitable for VR headsets with built-in eye-tracking. To validate our approach, we build a client-server system prototype for streaming 360° videos which leverages parallel algorithms over real-time video transcoding. We conduct quantitative experiments on an existing 360° video dataset and observe that the log-rectilinear transformation paired with summed-area table filtering heavily reduces flickering compared to log-polar subsampling while also yielding an additional 11% reduction in bandwidth usage. View details
    Multiresolution Deep Implicit Functions for 3D Shape Representation
    Zhang Chen
    Kyle Genova
    Sofien Bouaziz
    Christian Haene
    Cem Keskin
    Danhang "Danny" Tang
    ICCV (2021)
    Preview abstract We introduce Multiresolution Deep Implicit Functions (MDIF), a hierarchical representation that can recover fine details, while being able to perform more global operations such as shape completion. Our model represents a complex 3D shape with a hierarchy of latent grids, which can be decoded into different resolutions. Training is performed in an encoder-decoder manner, while the decoder-only optimization is supported during inference, hence can better generalize to novel objects, especially when performing shape completion. To the best of our knowledge, MDIF is the first model that can at the same time (1) reconstruct local detail; (2) perform decoder-only inference; (3) fulfill shape reconstruction and completion. We demonstrate superior performance against prior arts in our experiments. View details
    CollaboVR: A Reconfigurable Framework for Creative Collaboration in Virtual Reality
    Zhenyi He
    Ken Perlin
    2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), IEEE, pp. 11
    Preview abstract Writing or sketching on whiteboards is an essential part of collaborative discussions in business meetings, reading groups, design sessions, and interviews. However, prior arts in collaborative virtual reality (VR) systems have rarely explored the design space of multi-user layouts and interaction modes with virtual whiteboards. In this paper, we present CollaboVR, a reconfigurable framework for distributed and co-located multi-user communication in VR. Our system unleashes the creativity of VR users by sharing freehand drawings, converting 2D sketches into 3D models, and generating procedural animations in real time. To save the computational budgets for VR clients, we leverage a cloud architecture where the computational expensive applications (Chalktalk) are hosted directly on the servers and results are streamed to every client. We further devise two user layouts, side-by-side and face-to-face layouts to reduce visual clutter and compare interaction engagement. Users may also enable the projection mode in CollaboVR to mirror their private sketches at hands to the shared virtual whiteboards. We conduct two rounds of user study with 16 participants to evaluate CollaboVR. Our findings reveal that users appreciate the custom configurations and real-time animations in CollaboVR. Generally, face-to-face layout is preferred while the projection mode may yield less eye contacts. We will opensource CollaboVR to facilitate future research and development of natural user interface and real-time networking systems for VR collaborations. View details
    3D-Kernel Foveated Rendering for Light Fields
    Xiaoxu Meng
    Joseph F. JaJa
    Amitabh Varshney
    IEEE Transactions on Visualization and Computer Graphics (2020)
    Preview abstract Light fields capture both the spatial and angular rays, thus enabling free-viewpoint rendering and custom selection of the focal plane. Scientists can interactively explore pre-recorded microscopic light fields of organs, microbes, and neurons using virtual reality headsets. However, rendering high-resolution light fields at interactive frame rates requires a very high rate of texture sampling, which is challenging as the resolutions of light fields and displays continue to increase. In this paper, we present an efficient algorithm to visualize 4D light fields with 3D-kernel foveated rendering (3D-KFR). The 3D-KFR scheme coupled with eye-tracking has the potential to accelerate the rendering of 4D depth-cued light fields dramatically. We have developed a perceptual model for foveated light fields by extending the KFR for the rendering of 3D meshes. On datasets of high-resolution microscopic light fields, we observe 3.47x-7.28x speedup in light field rendering with minimal perceptual loss of detail. We envision that 3D-KFR will reconcile the mutually conflicting goals of visual fidelity and rendering speed for interactive visualization of light fields. View details
    Eye-dominance-guided Foveated Rendering
    Xiaoxu Meng
    Amitabh Varshney
    IEEE Transaction on Visualization and Computer Graphics, vol. 26 (2020), pp. 1972-1980
    Preview abstract Optimizing rendering performance is critical for a wide variety of virtual reality (VR) applications. Foveated rendering is emerging as an indispensable technique for reconciling interactive frame rates with ever-higher head-mounted display resolutions. Here, we present a simple yet effective technique for further reducing the cost of foveated rendering by leveraging ocular dominance -- the tendency of the human visual system to prefer scene perception from one eye over the other. Our new approach, eye-dominance-guided foveated rendering (EFR), renders the scene at a lower foveation level (higher detail) for the dominant eye compared to the non-dominant eye. Compared with traditional foveated rendering, EFR can be expected to provide superior rendering performance while preserving the same level of perceived visual quality. View details
    Experiencing Real-time 3D Interaction with Depth Maps for Mobile Augmented Reality in DepthLab
    Maksym Dzitsiuk
    Luca Prasso
    Ivo Duarte
    Jason Dourgarian
    Joao Afonso
    Jose Pascoal
    Josh Gladstone
    Nuno Moura e Silva Cruces
    Shahram Izadi
    Konstantine Nicholas John Tsotsos
    Adjunct Publication of the 33rd Annual ACM Symposium on User Interface Software and Technology, ACM (2020), pp. 108-110
    Preview abstract We demonstrate DepthLab, a wide range of experiences using the ARCore Depth API that allows users to detect the shape and depth in the physical environment with a mobile phone. DepthLab encapsulates a variety of depth-based UI/UX paradigms, including geometry-aware rendering (occlusion, shadows, texture decals), surface interaction behaviors (physics, collision detection, avatar path planning), and visual effects (relighting, 3D-anchored focus and aperture effects, 3D photos). We have open-sourced our software at https://github.com/googlesamples/arcore-depth-lab to facilitate future research and development in depth-aware mobile AR experiences. With DepthLab, we aim to help mobile developers to effortlessly integrate depth into their AR experiences and amplify the expression of their creative vision. View details
    MeteoVis: Visualizing Meteorological Events in Virtual Reality
    David Li
    Eric Lee
    Elijah Schwelling
    Mason G. Quick
    Patrick Meyers
    Amitabh Varshney
    Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, ACM, Honolulu, Hawaii, pp. 9
    Preview abstract Modern meteorologists in the National Oceanic and Atmospheric Administration (NOAA) use the Advanced Weather Interactive Processing System (AWIPS) to visualize weather data. However, AWIPS presents critical challenges when comparing data from multiple satellites for weather analysis. To address its limitations, we iteratively design with Earth Science experts and developed MeteoVis, an interactive system to visualize spatio-temporal atmospheric weather data from multiple sources simultaneously in an immersive 3D environment. In a preliminary case study, MeteoVis enables forecasters to easily identify the Atmospheric River event that caused intense flooding and snow storms along the western coast of North America during February 2019. We envision that MeteoVis will inspire future development of atmospheric visualization and analysis of the causative factors behind atmospheric processes improving weather forecast accuracy. A demo video of MeteoVis is available at https://youtu.be/GG96uO3WIy4. View details
    DepthLab: Real-time 3D Interaction with Depth Maps for Mobile Augmented Reality
    Maksym Dzitsiuk
    Luca Prasso
    Ivo Duarte
    Jason Dourgarian
    Joao Afonso
    Jose Pascoal
    Josh Gladstone
    Nuno Moura e Silva Cruces
    Shahram Izadi
    Konstantine Nicholas John Tsotsos
    Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, ACM (2020), pp. 829-843
    Preview abstract Mobile devices with passive depth sensing capabilities are ubiquitous, and recently active depth sensors have become available on some tablets and VR/AR devices. Although real-time depth data is accessible, its rich value to mainstream AR applications has been sorely under-explored. Adoption of depth-based UX has been impeded by the complexity of performing even simple operations with raw depth data, such as detecting intersections or constructing meshes. In this paper, we introduce DepthLab, a software library that encapsulates a variety of depth-based UI/UX paradigms, including geometry-aware rendering (occlusion, shadows), surface interaction behaviors (physics-based collisions, avatar path planning), and visual effects (relighting, depth-of-field effects). We break down depth usage into localized depth, surface depth, and dense depth, and describe our real-time algorithms for interaction and rendering tasks. We present the design process, system, and components of DepthLab to streamline and centralize the development of interactive depth features. We have open-sourced our software to external developers, conducted performance evaluation, and discussed how DepthLab can accelerate the workflow of mobile AR designers and developers. We envision that DepthLab may help mobile AR developers amplify their prototyping efforts, empowering them to unleash their creativity and effortlessly integrate depth into mobile AR experiences. View details
    Language-based Colorization of Scene Sketches
    Changqing Zou
    Haoran Mo
    Chengying Gao
    Hongbo Fu
    ACM Transactions on Graphics, vol. 38, pp. 16
    Preview abstract Being natural, touchless, and fun-embracing, language-based inputs have demonstrated effective for various tasks from image generation to literacy education for children. This paper for the first time presents a language-based system for interactive colorization of scene sketches, based on their semantic comprehension. The proposed system is built upon deep neural networks trained on a large-scale repository of scene sketches and cartoon-style color images with text descriptions. Given a scene sketch, our system allows users, via language-based instructions, to interactively localize and colorize specific foreground object instances to meet various colorization requirements in a progressive way. We demonstrate the effectiveness of our approach via comprehensive experimental results including alternative studies, comparison with the state of the art, and generalization user studies. Given the unique characteristics of language-based inputs, we envision a combination of our interface with a traditional scribble-based interface for a practical, multi-modal colorization system, benefiting various applications. View details
    ORC Layout: Adaptive GUI Layout with OR-Constraints
    Yue Jiang
    Christof Lutteroth
    Wolfgang Stuerzlinger
    Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI 2019), ACM, pp. 1-12
    Preview abstract We propose a novel approach for constraint-based graphical user interface (GUI) layout based on OR-constraints (ORC) in standard soft/hard linear constraint systems. ORC layout unifies grid layout and flow layout, supporting both their features as well as cases where grid and flow layouts individually fail. We describe ORC design patterns that enable designers to safely create flexible layouts that work across different screen sizes and orientations. We also present the ORC Editor, a GUI editor that enables designers to apply ORC in a safe and effective manner, mixing grid, flow and new ORC layout features as appropriate. We demonstrate that our prototype can adapt layouts to screens with different aspect ratios with only a single layout specification, easing the burden of GUI maintenance. Finally, we show that ORC specifications can be modified interactively and solved efficiently at runtime. View details
    No Results Found