Mask R-CNN

The instance segmentation task in computer vision involves labeling each pixel in an image with both a class label and an instance label. It can be thought of as a generalization of semantic segmentation, since it requires segmenting every object class in the image while also separating the individual instances of each class. As such, it is a dense prediction task that combines elements of two popular computer vision tasks: semantic segmentation (pixelwise labeling without differentiating between instances) and object detection (localizing objects with bounding boxes). Instance segmentation therefore inherits challenges from both parent tasks, such as difficulty segmenting small objects and overlapping instances. Recent advances in instance segmentation, driven primarily by the success of R-CNN, have relied upon sequential (cascaded) prediction of segmentation and classification labels. This paper, on the other hand, proposes Mask R-CNN, a multi-task prediction architecture that simultaneously detects objects, classifies them, and delineates their fine boundaries within the detected bounding boxes. Mask R-CNN builds upon the massively popular Faster R-CNN model, which was not designed for “pixel-to-pixel alignment between network inputs and outputs”, by adding a mask prediction branch that predicts segmentation masks in parallel with the existing classification and bounding box branches. ...

September 28, 2020 · 4 min · Kumar Abhishek