Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

The ability to give robots natural language instructions for carrying out navigational tasks has been a longstanding goal of robotics and artificial intelligence. The task requires visual perception and natural language understanding in tandem, and while advances in visual question answering and visual dialog have enabled models to combine visual and linguistic reasoning, they do not “allow an agent to move or control the camera”. Language-only command settings, on the other hand, abstract away the visual perception component and are not very linguistically rich. Simulators built on hand-crafted rendering models and environments attempt to address these problems, but they possess a limited set of 3D assets and textures, reducing the robot’s challenging open-set problem in the real world to a much simpler closed-set one, which in turn hurts performance in previously unseen environments. Finally, although reinforcement learning has been used to train navigational agents, such agents either do not leverage language instructions or rely on very simple linguistic settings. This paper proposes the Matterport3D Simulator, “a large-scale reinforcement learning environment based on real imagery”, and an associated Room-to-Room (R2R) dataset, with the hope that these will push forward research in vision-and-language navigation (VLN) and improve generalization to previously unseen environments. ...
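To make the Room-to-Room setup more concrete, here is a minimal sketch of what a single R2R episode looks like and how it might be inspected. The field names (`scan`, `path`, `heading`, `instructions`) follow the publicly released R2R split files, but the sample values below are illustrative placeholders rather than real dataset entries.

```python
import json

# Illustrative R2R-style entry (placeholder values, not a real dataset record).
# Each episode pairs crowd-sourced instructions with a path through a
# Matterport3D scan, given as a sequence of panoramic viewpoint IDs.
sample_episode = {
    "path_id": 0,
    "scan": "EXAMPLE_SCAN_ID",  # Matterport3D building scan (placeholder)
    "heading": 3.14,            # agent's initial heading in radians
    "path": ["VIEWPOINT_A", "VIEWPOINT_B", "VIEWPOINT_C"],
    "instructions": [
        "Walk down the hallway and stop at the second door on the left.",
    ],
}

def describe_episode(episode: dict) -> None:
    """Print a short summary of one R2R navigation episode."""
    print(f"Scan: {episode['scan']}")
    print(f"Path length (viewpoints): {len(episode['path'])}")
    for i, instruction in enumerate(episode["instructions"], start=1):
        print(f"Instruction {i}: {instruction}")

if __name__ == "__main__":
    describe_episode(sample_episode)
    # The released splits (train / val_seen / val_unseen / test) are JSON lists
    # of such entries and could be loaded with, e.g.:
    #   episodes = json.load(open("R2R_train.json"))
```

The key point of the benchmark is that validation and test splits include scans held out from training, so an agent must generalize its grounding of instructions to visually unseen buildings rather than memorizing a fixed set of environments.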

November 2, 2020 · 4 min · Kumar Abhishek