PolyGen: An Autoregressive Generative Model of 3D Meshes

Polygonal meshes are widely used in computer graphics, robotics, and game development to represent virtual objects and scenes. Existing learning-based methods for 3D object generation have relied on template models and parametric shape families. Progress with deep learning approaches has also been limited because meshes are challenging for deep networks to work with, and recent works have therefore turned to alternative representations of object shape, such as voxels, point clouds, occupancy functions, and surfaces. These works, however, leave mesh reconstruction as a post-processing step, leading to inconsistent mesh quality. Drawing inspiration from the success of neural autoregressive models on sequential raw data (e.g., images, text, and raw audio waveforms), and building upon previously proposed components (e.g., Transformers and pointer networks), this paper presents PolyGen, a neural autoregressive model that generates 3D meshes directly. ...
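
To make the autoregressive idea concrete, below is a minimal sketch (in PyTorch; not the authors' code) of how a decoder-only Transformer can model a mesh's vertices as a flat sequence of quantized coordinate tokens via next-token prediction. The faces model, which uses pointer networks to index into the generated vertices, is omitted, and all hyperparameters here are illustrative assumptions rather than the paper's values.

```python
# A minimal sketch of PolyGen's vertex model, not the authors' implementation:
# quantize each vertex coordinate to 8 bits, flatten vertices into one token
# sequence (z1, y1, x1, z2, ...), and model it with a decoder-only Transformer.
# Vocabulary size, model width, depth, and sequence length are illustrative.
import torch
import torch.nn as nn

class VertexModel(nn.Module):
    def __init__(self, vocab=256 + 2, d_model=256, n_heads=8, n_layers=4, max_len=2400):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)   # 256 coordinate bins + start/stop tokens
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):  # tokens: (batch, seq) of quantized coordinate tokens
        seq_len = tokens.shape[1]
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.tok(tokens) + self.pos(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        h = self.backbone(x, mask=causal)         # each token attends only to its past
        return self.head(h)                       # logits for the next coordinate token
```

Training such a model minimizes cross-entropy between the predicted logits and the next token; sampling then proceeds token by token until a stop token is produced.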

November 9, 2020 · 5 min · Kumar Abhishek

CARLA: An Open Urban Driving Simulator

The development and subsequent deployment of autonomous ground vehicles is a popular instantiation of the broader goal of achieving sensorimotor control in 3D environments. It is a perception-driven control task, and one of its most difficult scenarios is navigating densely populated urban environments, primarily because of but not limited to ...

November 2, 2020 · 5 min · Kumar Abhishek

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

The ability to provide human language instructions to robots for carrying out navigational tasks has been a longstanding goal of robotics and artificial intelligence. The task requires visual perception and natural language understanding in tandem, and while advancements in visual question answering and visual dialog have enabled models to combine visual and linguistic reasoning, they do not “allow an agent to move or control the camera”. Natural language-only commands abstract away the visual perception component and are not very linguistically rich. Simulators built upon hand-crafted rendering models and environments try to address these problems, but they possess a limited set of 3D assets and textures, converting the robot’s challenging open-set problem in the real world into a considerably simpler closed-set problem, which in turn degrades performance in previously unseen environments. Finally, although reinforcement learning has been used to train navigational agents, these approaches either do not leverage language instructions or rely on very simple linguistic settings. This paper proposes the Matterport3D Simulator, “a large-scale reinforcement learning environment based on real imagery”, and an associated Room-to-Room (R2R) dataset, with the hope that these will help push forward vision-and-language navigation (VLN) research and improve generalizability to previously unseen environments. ...

November 2, 2020 · 4 min · Kumar Abhishek

As-Rigid-As-Possible Shape Manipulation

The problem of shape manipulation is of interest to many domains, including but not limited to image editing platforms, real-time live performance, and graphical user interfaces. The goal of shape manipulation is to let users move and deform shapes in a manner akin to interacting with an object in the real world. Previous approaches to shape manipulation can be broadly categorized into (a) space-warp and (b) physics-based techniques. ...

October 26, 2020 · 5 min · Kumar Abhishek

DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills

The adoption of physically simulated character animation in the industry remains challenging, primarily because existing methods lack directability and generalizability. With the goal of amalgamating data-driven behavior specification with the ability to reproduce such behavior in a physical simulation, several categories of approaches have been proposed. Kinematic models rely on large amounts of data, and their ability to generalize to unseen situations can be limited. Physics-based models incorporate prior knowledge of the physics of motion, but they do not perform well for “dynamic motions” involving long-term planning. Motion imitation approaches can achieve highly dynamic motions, but are limited by system complexity and a lack of adaptability to task objectives. Techniques based on reinforcement learning (RL), although comparatively successful at achieving the defined objectives, often produce unrealistic motion artifacts. This paper addresses these problems with DeepMimic, a deep RL “framework for physics-based character animation” that combines a motion-imitation objective with a task objective. This combination allows it to demonstrate a wide range of motion skills and to adapt to a variety of characters, skills, and tasks by leveraging rich information from high-dimensional state and environment descriptions; it is also conceptually simpler than prior motion-imitation approaches, and can work with data provided as either motion capture clips or keyframed animation. While the paper presents intricate details about the DeepMimic framework, the high-level details and novel contribution claims are summarized here, skipping the common details of deep RL problem formulations. ...
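
As a rough illustration of the combined objective described above, the per-timestep reward can be written as a weighted sum of an imitation term and a task term. This is a sketch under assumed state fields, shaping functions, and weights, not the paper's implementation:

```python
import numpy as np

def deepmimic_style_reward(sim, ref, goal_pos, w_imitate=0.7, w_task=0.3):
    """Weighted sum of a motion-imitation reward (track the reference clip)
    and a task reward (satisfy the goal). The state fields, exponential
    shaping, and weights below are illustrative assumptions."""
    # Imitation: penalize squared deviation of joint angles from the reference pose.
    pose_err = np.sum((sim["joint_angles"] - ref["joint_angles"]) ** 2)
    r_imitate = np.exp(-2.0 * pose_err)

    # Task: e.g., reward bringing the character's root close to a target position.
    dist = np.linalg.norm(sim["root_pos"] - goal_pos)
    r_task = np.exp(-0.5 * dist ** 2)

    return w_imitate * r_imitate + w_task * r_task
```

In the paper, the imitation term additionally matches quantities such as velocities, end-effector positions, and center of mass; the single pose term here is only meant to convey the structure of the objective.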

October 26, 2020 · 5 min · Kumar Abhishek