Building Rome in a Day

With the advent of digital photography and the popularity of cloud-based image-sharing websites, there has been a huge proliferation of publicly accessible photographs of popular cities (and landmarks thereof) across the world. As a result, the ability to leverage these photos in a meaningful manner is of great interest to the computer vision community. One key research area that could benefit immensely from this is city-scale 3D reconstruction. Traditionally, systems for this task have relied on images and data acquired in a structured manner, which keeps computation simple. In contrast, images uploaded to the internet come with no such constraints, necessitating the development of algorithms that can work on “extremely diverse, large, and unconstrained image collections”. Building upon previous research and incorporating elements from other disciplines of computer science, this paper proposes a system that constructs large-scale 3D geometry from large, unorganized image collections publicly available on the internet, and that can process more than a hundred thousand images in a day. ...
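
Such pipelines begin by matching local features across image pairs to discover which photos see the same scene. As a rough illustration of that first stage (a minimal sketch, not the paper's distributed implementation; the image paths are placeholders), here is pairwise SIFT matching with OpenCV and Lowe's ratio test:

```python
# Minimal sketch of pairwise feature matching, the first stage of
# structure-from-motion pipelines like the one described above.
# Assumes OpenCV >= 4.4 (SIFT in the main module); paths are placeholders.
import cv2

def match_pair(path_a, path_b, ratio=0.75):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)

    # Brute-force matching with Lowe's ratio test to reject ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_a, desc_b, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]

    # Enough good matches suggests the two photos overlap, adding an
    # edge to the match graph that the 3D reconstruction is built on.
    return kp_a, kp_b, good
```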

October 19, 2020 · 5 min · Kumar Abhishek

KinectFusion: Real-Time Dense Surface Mapping and Tracking

The surge of interest in augmented and mixed reality applications can, at least in part, be attributed to research on “real-time infrastructure-free” tracking of a camera with the simultaneous generation of detailed maps of physical scenes. While computer vision research has enabled accurate camera tracking and dense scene surface reconstruction using structure from motion and multiview stereo algorithms, these algorithms are not well suited to either real-time applications or detailed surface reconstruction. There has also been a contemporaneous improvement in camera technology, especially depth-sensing cameras based on time-of-flight or structured light sensing, such as the consumer-grade Microsoft Kinect. The Kinect features a structured-light depth sensor (hereafter, the sensor) and generates an 11-bit $640 \times 480$ depth map at 30 Hz using an on-board ASIC. However, these depth images are usually noisy, with ‘holes’ indicating regions where a depth reading was not possible. This paper proposes a system that processes these noisy depth maps and performs real-time (9 million new point measurements per second) dense simultaneous localization and mapping (SLAM), generating an incremental, consistent 3D scene model while also tracking the sensor’s motion (all six degrees of freedom) through each frame. While the paper presents quite an involved description of the method, its key components are briefly summarized here. ...
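
As a concrete illustration of the per-frame starting point of such a system, each depth map is back-projected through the camera intrinsics into a 3D vertex map. Here is a minimal NumPy sketch; the focal lengths and principal point below are illustrative placeholders, not the Kinect's calibrated values:

```python
# Minimal sketch: back-project a depth map into a 3D vertex map,
# the per-frame input to dense SLAM systems like KinectFusion.
# Intrinsics are illustrative placeholders, not calibrated values.
import numpy as np

def depth_to_vertex_map(depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """depth_m: (H, W) array of depth in meters; 0 marks a 'hole'."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Pinhole camera model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    vertices = np.stack([x, y, z], axis=-1)  # (H, W, 3) point per pixel
    valid = z > 0                            # mask out the 'holes'
    return vertices, valid

# A 640x480 frame at 30 Hz yields ~9.2 million point measurements/second,
# matching the throughput figure quoted above.
```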

October 19, 2020 · 4 min · Kumar Abhishek

A Practical Model for Subsurface Light Transport

In computer graphics, a bidirectional reflectance distribution function (BRDF) is used to model the light reflectance properties of a surface, and is defined as the ratio of the reflected radiance to the irradiance incident on the surface. All BRDF models assume surface scattering, i.e., that “light scatters at one surface point” and that light enters and exits the material at the same position; they do not model subsurface transport of incident light. Although this assumption holds for metals, translucent surfaces modeled using BRDFs exhibit a distinctly hard, computer-generated appearance and poor blending of local color and geometry features. While there has been work on modeling subsurface transport of light, the existing methods are either slow or inefficient for anisotropic or highly scattering translucent media (such as skin and milk). This paper attempts to address this shortcoming by proposing a model for subsurface light transport in translucent materials using the bidirectional surface scattering reflectance distribution function (BSSRDF). BSSRDFs are a generalization of BRDFs and, unlike the latter, can model light transport between any two rays that hit a surface. Since the exact BSSRDF derivation is quite involved, we present only a brief summary here, followed by its extension to a model for rendering computer graphics. ...
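
The generalization is easiest to see from the two definitions side by side (standard notation: $x$ is a surface point, $\vec{\omega}_i$ and $\vec{\omega}_o$ the incident and outgoing directions). The BRDF $f_r$ relates outgoing radiance to incident irradiance at the same point, whereas the BSSRDF $S$ relates outgoing radiance at a point $x_o$ to the incident flux at a possibly different point $x_i$:

$$f_r(x, \vec{\omega}_i, \vec{\omega}_o) = \frac{dL_o(x, \vec{\omega}_o)}{dE(x, \vec{\omega}_i)}, \qquad dL_o(x_o, \vec{\omega}_o) = S(x_i, \vec{\omega}_i; x_o, \vec{\omega}_o)\, d\Phi_i(x_i, \vec{\omega}_i)$$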

October 12, 2020 · 5 min · Kumar Abhishek

Interactive Reconstruction of Monte Carlo Image Sequences using a Recurrent Denoising Autoencoder

Owing to the immense popularity of ray tracing and path tracing rendering algorithms for visual effects, there has been a surge of interest in developing filtering and reconstruction methods to deal with the noise present in Monte Carlo renderings. While offline rendering can afford large sampling rates (up to thousands of samples per pixel before filtering), even the fastest interactive ray tracers are limited to a few rays per pixel, and such low sampling budgets will remain realistic for the foreseeable future. This paper proposes a learning-based approach for reconstructing global illumination at interactive rates from very low sampling budgets, as low as 1 sample per pixel (spp). At 1 spp, the Monte Carlo integration of indirect illumination produces extremely noisy images, so the problem is framed as reconstruction rather than denoising. Previous offline and interactive denoising methods for Monte Carlo rendering suffer from a trade-off between speed and quality, require user-defined parameters, and scale poorly to large scenes. Inspired by progress in deep learning for single image restoration (denoising), the authors propose an encoder-decoder architecture with recurrent connections for improved temporal consistency. The proposed model requires no user guidance, is end-to-end trainable, and can exploit auxiliary pixel features for improved performance. ...
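
To make the recurrent connections concrete, here is a heavily simplified PyTorch sketch of a single encoder stage that fuses its current activations with those from the previous frame. The channel counts, single convolution per stage, and input layout (noisy RGB concatenated with auxiliary features such as depth and normals) are illustrative assumptions, not the paper's exact configuration:

```python
# Simplified sketch of a recurrent encoder stage: a convolution whose
# output is fused with its own activation from the previous frame,
# giving the network temporal context across an image sequence.
# Channel counts and layer structure are illustrative, not the paper's.
import torch
import torch.nn as nn

class RecurrentEncoderStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Recurrent fusion: combine current features with last frame's.
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x, hidden=None):
        feat = self.act(self.conv(x))
        if hidden is None:                      # first frame: no history yet
            hidden = torch.zeros_like(feat)
        feat = self.act(self.fuse(torch.cat([feat, hidden], dim=1)))
        return self.pool(feat), feat            # downsampled out, new hidden

# Usage: 3 noisy RGB channels + 4 auxiliary channels (e.g. depth, normals).
stage = RecurrentEncoderStage(in_ch=7, out_ch=32)
hidden = None
for frame in torch.randn(5, 1, 7, 128, 128):    # a 5-frame sequence
    out, hidden = stage(frame, hidden)
```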

October 12, 2020 · 4 min · Kumar Abhishek

Mesh R-CNN

Although deep learning has enabled massive strides in visual recognition tasks, including object detection, most of these advances have been made in 2D object recognition. However, these improvements rest on a critical omission: objects in the real world extend beyond the $XY$ image plane, into 3D space. While there has also been significant progress in 3D shape understanding, the authors call attention to the need for methods that amalgamate these two tasks, i.e., approaches that (a) work in the real world, where, unlike in carefully curated datasets, there are far fewer constraints on object count, occlusion, illumination, etc., and (b) do so without ignoring the rich 3D information present therein. They build upon the immensely popular Mask R-CNN multi-task framework and extend it with a mesh prediction branch that simultaneously learns to generate a “high-resolution triangle mesh” for each detected object. Whereas previous works on single-view shape prediction rely on post-processing or are limited in the topologies they can represent as meshes, Mesh R-CNN uses multiple 3D shape representations: 3D voxels and 3D meshes, where the latter is obtained by refining the former. ...
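
A minimal sketch of that voxels-then-meshes idea, using PyTorch3D's `cubify` operator to turn an occupancy grid into an initial mesh. The tiny MLP offset predictor below is an illustrative stand-in for the paper's graph-convolution refinement stages, not the actual architecture:

```python
# Minimal sketch of Mesh R-CNN's voxel-to-mesh idea: predict a coarse
# voxel occupancy grid, "cubify" it into an initial triangle mesh, then
# refine vertex positions. The offset predictor here is a toy stand-in
# for the paper's graph-conv refinement over image-aligned features.
import torch
import torch.nn as nn
from pytorch3d.ops import cubify

# A fake (N, D, H, W) grid of predicted occupancy probabilities.
voxel_probs = torch.rand(1, 16, 16, 16)

# Voxels above the threshold become cubes; shared faces are merged,
# yielding a watertight initial mesh of the object's coarse shape.
mesh = cubify(voxel_probs, thresh=0.5)

# Toy refinement: an MLP predicts a small 3D offset for each vertex.
offset_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
verts = mesh.verts_packed()                  # (V, 3) vertex positions
refined = mesh.offset_verts(0.1 * torch.tanh(offset_net(verts)))
```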

October 5, 2020 · 5 min · Kumar Abhishek