PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Several computer vision tasks require perceiving or interacting with 3D environments and the objects therein, making a strong case for 3D deep learning. However, unlike images, which are almost universally represented as regular arrays of pixels, 3D data comes in multiple representations, e.g., meshes, point clouds, volumetric and boundary representations, RGB-D images, etc. Of these, point clouds are arguably the closest to raw sensor data, and their simple, unified structure makes them a canonical 3D representation that is easy to convert to and from other forms. The majority of previous work applying deep learning to 3D data has used multi-view CNNs (by projecting point clouds into 2D images), volumetric CNNs (by applying 3D CNNs to voxelized shapes), spectral CNNs (on meshes), and fully connected networks (on feature vectors extracted from 3D data). These approaches suffer from several shortcomings, such as data sparsity, high computational cost, inability to extend to tasks beyond shape classification or to non-isometric shapes, and the limited expressiveness of the extracted features. To address these concerns, the authors propose PointNet, a deep neural network architecture that processes point clouds directly for various 3D tasks such as shape classification, part segmentation, and scene understanding, and that is robust to corruption and perturbation of the input points. ...
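The key to operating directly on unordered point sets is permutation invariance: PointNet applies a shared multilayer perceptron to every point independently and then aggregates with max pooling, a symmetric function. Below is a minimal sketch of this idea in PyTorch; it omits the paper's input and feature alignment networks (the T-Nets), and the class name and toy usage are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style classifier (no input/feature transform nets)."""
    def __init__(self, num_classes=40):
        super().__init__()
        # Shared per-point MLP, implemented as 1D convolutions with kernel size 1
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, points):
        # points: (batch, 3, num_points) -- an unordered set of XYZ coordinates
        per_point = self.point_mlp(points)           # (batch, 1024, num_points)
        # Max pooling over the point dimension is a symmetric function, so the
        # global feature is invariant to the ordering of the input points
        global_feat = per_point.max(dim=2).values    # (batch, 1024)
        return self.classifier(global_feat)

# Example: a batch of 8 point clouds with 1024 points each
logits = TinyPointNet()(torch.randn(8, 3, 1024))     # (8, 40)
```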

October 5, 2020 · 5 min · Kumar Abhishek

Mask R-CNN

The instance segmentation task in computer vision involves labeling each pixel in an image with both a class and an instance label. It can be thought of as a generalization of the semantic segmentation task, since it requires segmenting all the objects in the image while also distinguishing between individual instances of the same class. As such, it is a dense prediction task that combines elements from two popular computer vision tasks: semantic segmentation (pixelwise labeling without differentiating between instances) and object detection (localization using bounding boxes). This makes instance segmentation vulnerable to the challenges of both parent tasks, such as the difficulty of segmenting small objects and overlapping instances. Recent advances in instance segmentation, driven primarily by the success of R-CNN, have relied upon sequential (cascaded) prediction of segmentation and classification labels. This paper, on the other hand, proposes Mask R-CNN, a multi-task architecture for simultaneously detecting objects, classifying them, and delineating their fine boundaries within the detected bounding boxes. Mask R-CNN builds upon the massively popular Faster R-CNN model, which was not designed for “pixel-to-pixel alignment between network inputs and outputs”, by adding a mask prediction branch for simultaneous segmentation prediction. ...
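For readers who want to try the model, a re-implementation ships with torchvision (this is torchvision's version, not the paper's original code; the `weights` argument assumes torchvision >= 0.13, with older releases using `pretrained=True`). The snippet below runs a COCO-pretrained Mask R-CNN on a dummy image and shows the per-instance outputs the mask branch adds on top of Faster R-CNN's boxes and labels.

```python
import torch
import torchvision

# Load a Mask R-CNN with a ResNet-50-FPN backbone, pre-trained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy RGB image with values in [0, 1]; in practice, load a real image
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])[0]

# Each detection comes with a bounding box, a class label, a confidence
# score, and a per-instance soft mask over the full image
print(predictions["boxes"].shape)   # (num_detections, 4)
print(predictions["labels"].shape)  # (num_detections,)
print(predictions["masks"].shape)   # (num_detections, 1, 480, 640)
```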

September 28, 2020 · 4 min · Kumar Abhishek

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

The optical flow estimation task in computer vision is the following: given two images $\mathcal{I}_1$ and $\mathcal{I}_2$, estimate, for each pixel in $\mathcal{I}_1$, where it moves to in $\mathcal{I}_2$. This dense pixel correspondence task is a long-standing problem that has remained largely unsolved because of difficulties including, but not limited to, shadows, reflections, occlusions, fast-moving objects, and low-texture surfaces. Traditional approaches, which frame optical flow as a hand-crafted optimization problem over the “space of dense displacement fields” between an image pair, with the optimization performed at inference time, are limited by the difficulty of hand-crafting the optimization objective. Motivated by these traditional optimization-based approaches, this paper proposes an end-to-end differentiable deep learning (DL)-based architecture called RAFT (Recurrent All-Pairs Field Transforms) for estimating optical flow. The RAFT architecture comprises 3 main components: (a) a convolutional feature encoder to extract feature vectors from a pair of images, (b) a correlation layer to construct a 4D correlation volume, followed by pooling to produce volumes at multiple lower resolutions, and (c) a gated, GRU-based update unit that iteratively refines a single flow field using values retrieved from the correlation volumes. ...
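Component (b) admits a compact sketch: the all-pairs correlation volume is simply the inner product between every pixel feature of the first image and every pixel feature of the second, and the pyramid pools only over the second image's coordinates. The function below is an illustrative reconstruction, not the authors' code; the function name and the $1/\sqrt{d}$ normalization are assumptions made here for the sketch.

```python
import torch
import torch.nn.functional as F

def all_pairs_correlation(fmap1, fmap2, num_levels=4):
    """Build the 4D all-pairs correlation volume plus pooled copies.

    fmap1, fmap2: (batch, dim, H, W) feature maps from the shared encoder.
    Returns a list of volumes of shape (batch, H, W, H/2^i, W/2^i).
    """
    b, d, h, w = fmap1.shape
    # Inner product between every pixel of fmap1 and every pixel of fmap2,
    # scaled by 1/sqrt(dim) for numerical stability (an assumption here)
    corr = torch.einsum("bdij,bdkl->bijkl", fmap1, fmap2) / d**0.5
    # Pool only over the last two dimensions (the fmap2 coordinates), keeping
    # full resolution in the first image so fine flow details are preserved
    pyramid = [corr]
    vol = corr.reshape(b * h * w, 1, h, w)
    for _ in range(num_levels - 1):
        vol = F.avg_pool2d(vol, 2, stride=2)
        pyramid.append(vol.reshape(b, h, w, *vol.shape[-2:]))
    return pyramid

# Example: 1/8-resolution features of a 256x512 image pair
f1, f2 = torch.randn(1, 256, 32, 64), torch.randn(1, 256, 32, 64)
volumes = all_pairs_correlation(f1, f2)
# Pooled fmap2 resolutions: 32x64, 16x32, 8x16, 4x8
print([tuple(v.shape[-2:]) for v in volumes])
```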

September 28, 2020 · 4 min · Kumar Abhishek

ImageNet: A Large-Scale Hierarchical Image Database

The availability of large volumes of data is a key requirement for developing efficient, robust, and advanced machine learning based prediction models. This paper introduces the ImageNet database, “a large scale ontology of images” built upon the hierarchical structure of WordNet, an online lexical database of meaningful concepts. These concepts, each described by one or more words or word phrases, are known as synonym sets, or synsets. The ImageNet dataset contains 3.2 million labeled images organized into 12 subtrees and 5247 synsets in total, with an average of 600 images per synset, making it one of the largest publicly available image datasets, and one that stands out for the diversity of its images, the accuracy of its labels, and its hierarchical organization. ...
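To get a concrete feel for the synset hierarchy that ImageNet inherits, the short sketch below queries WordNet through NLTK. NLTK is used here purely for illustration and is not part of the paper; it assumes `nltk` is installed and the WordNet corpus has been downloaded.

```python
from nltk.corpus import wordnet as wn  # first run: nltk.download("wordnet")

# A synset groups the words/word phrases that describe a single concept
dog = wn.synset("dog.n.01")
print(dog.lemma_names())   # ['dog', 'domestic_dog', 'Canis_familiaris']
print(dog.definition())

# WordNet's "is-a" hierarchy is what ImageNet's subtrees (mammal, bird,
# vehicle, etc.) are built on; ImageNet populates such nodes with images
for synset in dog.hypernym_paths()[0]:
    print(synset.name())   # entity.n.01 -> ... -> canine.n.02 -> dog.n.01
```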

September 21, 2020 · 4 min · Kumar Abhishek

Multi-Scale Context Aggregation By Dilated Convolutions

The semantic segmentation task in computer vision involves partitioning an image into a set of non-overlapping and semantically interpretable regions. This entails assigning a class label to every pixel in the image, making it a dense prediction task. Owing to the massive improvements in image classification performance achieved by CNNs in recent years, several works have successfully repurposed popular image classification CNN architectures for dense prediction tasks. This paper questions this approach, and instead investigates whether modules specifically designed for dense prediction would improve segmentation performance even further. Unlike image classification networks, which aggregate multi-scale contextual information through successive downsampling operations to obtain a global prediction, a dense prediction task like semantic segmentation requires “multi-scale contextual reasoning in combination with full-resolution output”. However, increasing the receptive field of the convolution operator comes at the cost of more parameters, and the authors therefore propose the dilated convolution operator to address this. To this end, this paper makes threefold contributions: (a) a generalized form of the convolution operator that accounts for dilation, (b) a multi-scale context aggregation module that relies on dilated convolutions, and (c) a simplified front-end module which gets rid of “vestigial components” carried over from image classification networks. ...
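Concretely, the paper generalizes discrete convolution to a dilated form, $(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$, where $l$ is the dilation factor and $l = 1$ recovers ordinary convolution. The sketch below is a toy illustration, not the paper's full context module: it uses PyTorch's built-in `dilation` argument to show that stacking 3x3 convolutions with dilations 1, 2, 4 grows the receptive field from 3 to 7 to 15 pixels at full output resolution, with no increase in parameters per layer.

```python
import torch
import torch.nn as nn

# Three 3x3 convolutions with dilations 1, 2, 4: the receptive field grows
# exponentially (3 -> 7 -> 15 pixels) while every layer keeps the same 3x3
# kernel, i.e. no extra parameters relative to an undilated stack
layers = []
for dilation in (1, 2, 4):
    # padding = dilation keeps the spatial resolution unchanged -- crucial
    # for dense prediction, where the output must stay at full resolution
    layers += [nn.Conv2d(16, 16, kernel_size=3,
                         padding=dilation, dilation=dilation), nn.ReLU()]
context = nn.Sequential(*layers)

x = torch.randn(1, 16, 64, 64)
print(context(x).shape)    # (1, 16, 64, 64) -- full resolution preserved
n_params = sum(p.numel() for p in context.parameters())
print(n_params)            # identical to three undilated 3x3 conv layers
```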

September 21, 2020 · 3 min · Kumar Abhishek