Object Detection

Although deep learning has enabled massive strides in visual recognition tasks including object detection, most of these advances have been made in 2D object recognition. However, these improvements are built upon a critical omission: objects in the real world exist beyond the $XY$ image plane and in a 3D space. While there has also been significant progress in 3D shape understanding tasks, the authors call to attention for methods that amalgamate these two tasks: i.e., approaches which (a) can work in the real world where there are far fewer constraints (as compared to carefully curated datasets) such as constraints on object count, occlusion, illumination, etc., and (b) can do so without ignoring the rich 3D information present therein. They build upon the immensely popular Mask R-CNN multi-task framework and extend it by adding a mesh prediction branch that learns to generate “high-resolution triangle mesh” of the detected objects simultaneously. Whereas previous works on single-view shape prediction rely on post-processing or are limited in the topologies that they can represent as meshes, Mesh R-CNN uses multiple 3D shape representations: 3D voxels and 3D meshes, where the latter is obtained by refining the former. ...

The instance segmentation task in computer vision involves labeling each pixel in an image with a class and an instance label. It can be thought of as a generalization of the semantic segmentation task, since it requires segmenting all the objects in the image while also segmenting each instance. As such, it is a dense prediction task which combines elements from two popular computer vision tasks: semantic segmentation (pixelwise labeling without differentiating between instances) and object detection (detection using bounding boxes). This makes the instance segmentation task vulnerable to challenges from both the parent tasks, such as difficulty segmenting small objects and overlapping instances. Recent advances in instance segmentation, driven primarily by the success of R-CNN, have relied upon sequential (cascaded) prediction of segmentation and classification labels. This paper, on the other hand, proposes Mask R-CNN, a multi-task prediction architecture for simultaneously detecting objects, classifying them, and delineating their fine boundaries within the detected bounding boxes. Mask R-CNN builds upon the massively popular Faster R-CNN model, which was not designed for “pixel-to-pixel alignment between network inputs and outputs”, by adding a mask prediction branch for simultaneous segmentation predictions. ...