The adoption of physically simulated character animation in industry remains a challenging problem, primarily because existing methods lack directability and generalizability. Several categories of approaches aim to combine data-driven behavior specification with the ability to reproduce such behavior in a physical simulation. Kinematic models rely on large amounts of data, and their ability to generalize to unseen situations can be limited. Physics-based models incorporate prior knowledge about the physics of motion, but they perform poorly on dynamic motions that require long-term planning. Motion imitation approaches can achieve highly dynamic motions, but are limited by system complexity and a lack of adaptability to task objectives. Techniques based on reinforcement learning (RL), although comparatively successful at achieving the defined objectives, often produce unrealistic motion artifacts. This paper addresses these problems by proposing DeepMimic, a deep RL framework for physics-based character animation that combines a motion-imitation objective with a task objective. The framework demonstrates a wide range of motion skills and adapts to a variety of characters, skills, and tasks by leveraging rich information from high-dimensional state and environment descriptions; it is conceptually simpler than prior motion-imitation approaches and can work with data provided as either motion capture clips or keyframed animation. While the paper presents the DeepMimic framework in considerable detail, this summary covers the high-level design and novel contributions, skipping standard details of deep RL problem formulations.
The input to the system is a character model, a set of kinematic reference motions (either motion capture clips or keyframed animations), and a task with an associated reward function. The state features consist of the character's body configuration: the positions of all links relative to a local coordinate frame with the pelvis as origin, their rotations (as quaternions), linear and angular velocities, and a phase variable denoting the stage of the motion cycle. The policy is queried at 30 Hz, and each action specifies target orientations for the proportional-derivative (PD) controllers at the joints. The action distribution is modeled as a Gaussian with a network-specified, state-dependent mean and a fixed diagonal covariance matrix. The networks use fully connected layers with ReLU activations, and for vision-based tasks the input is augmented with a heightmap of the surrounding terrain sampled on a uniform grid.
The reward at each step is a weighted sum of a task-fulfillment term and a motion-imitation term, where the latter is composed of rewards for matching the pose and velocity of the joints and the positions of the character's hands, feet, and center of mass (a sketch of this reward is given below). The policies are trained episodically with proximal policy optimization (PPO), with each episode simulated either to a fixed time horizon or until a termination condition is met, and with separate networks maintained for the policy and the value function. Instead of a fixed initial state, which limits exploration and performance on complex dynamic motion sequences, the paper proposes reference state initialization (RSI): before each training episode, a state is sampled from the reference motion and used to initialize the agent, allowing it to encounter promising, high-reward states early in training. The authors also introduce early termination (ET), where an episode ends if the torso or the head touches the ground, indicating that the character has fallen; this discourages undesirable behavior and avoids biasing the data distribution towards failure states (both mechanisms are sketched below). Finally, they incorporate multi-skill integration through (a) the ability to utilize multiple reference clips, (b) the ability to execute an arbitrary sequence of skills on demand, and (c) a composite policy constructed with a Boltzmann function, allowing the network to jointly learn multiple related skills.
The experiments cover four characters with a wide range of properties (link count, mass, height, degrees of freedom, state features, and action parameters): a 3D humanoid, the Atlas robot model, a T-Rex, and a dragon; and four tasks: target heading (traveling in a specified direction with a minimum velocity), strike (hitting a randomly placed spherical object with specific links), throw (hitting a target by throwing a ball at it), and terrain traversal (traversing four obstacle-filled environments in a fixed direction). Ablation studies indicate that (a) the best results are obtained when both the task-specific and motion-imitation reward terms are used, and dropping either leads to a drop in performance or to unexpected behaviors, and (b) RSI and ET are both crucial: without RSI the model fails to produce the desired behaviors, and without ET it gets stuck in local optima.
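To make the reward structure concrete, the following is a minimal sketch of the per-step reward described above, assuming simple NumPy arrays for poses and velocities. The dictionary keys, weights, and error scales are illustrative placeholders rather than the paper's exact settings, and the joint-rotation error is simplified to a vector difference where the paper uses quaternion differences.

```python
import numpy as np

def imitation_reward(sim, ref,
                     weights=(0.65, 0.10, 0.15, 0.10),
                     scales=(2.0, 0.1, 40.0, 10.0)):
    """Imitation term: exponentiated tracking errors for joint poses, joint
    velocities, end-effector (hand/foot) positions, and center of mass.
    Weights and scales are illustrative, not the paper's exact values."""
    pose_err = np.sum((sim["joint_rot"] - ref["joint_rot"]) ** 2)  # simplified; the paper uses quaternion differences
    vel_err  = np.sum((sim["joint_vel"] - ref["joint_vel"]) ** 2)
    ee_err   = np.sum((sim["end_eff"]   - ref["end_eff"])   ** 2)  # hand and foot positions
    com_err  = np.sum((sim["com"]       - ref["com"])       ** 2)

    terms = [np.exp(-s * e) for s, e in zip(scales, (pose_err, vel_err, ee_err, com_err))]
    return float(np.dot(weights, terms))

def step_reward(sim, ref, task_reward, w_imitation=0.7, w_task=0.3):
    """Total per-step reward: weighted sum of imitation and task terms."""
    return w_imitation * imitation_reward(sim, ref) + w_task * task_reward
```

When the simulated state matches the reference exactly, the imitation term equals the sum of its weights (1.0 here) and decays exponentially as tracking errors grow, so the relative weighting of the imitation and task terms controls how strongly the policy adheres to the reference style versus the task objective.

The training-episode structure with RSI and ET can be sketched in a similar spirit. Here `env`, `policy`, and `reference_motion` are hypothetical stand-ins for the physics simulator, the PPO policy network, and the kinematic reference clip; they are not the authors' actual interfaces.

```python
import numpy as np

def rollout_episode(env, policy, reference_motion, max_steps=600, rng=np.random):
    """One training episode with reference state initialization (RSI) and
    early termination (ET). The env/policy/reference_motion interfaces are
    hypothetical, not the paper's implementation."""
    # RSI: start from a state sampled uniformly along the reference motion,
    # so the agent encounters high-reward states (e.g. mid-flip) from the outset.
    phase = rng.uniform(0.0, 1.0)
    state = env.reset_to(reference_motion.state_at(phase))

    trajectory = []
    for _ in range(max_steps):
        # Gaussian policy: the network outputs a state-dependent mean over PD
        # target orientations; exploration comes from a fixed diagonal covariance.
        action = policy.sample(state)
        next_state, reward, fallen = env.step(action)
        trajectory.append((state, action, reward))

        # ET: stop as soon as the torso or head touches the ground, so the data
        # distribution is not dominated by unrecoverable failure states.
        if fallen:
            break
        state = next_state
    return trajectory
```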
Character and environment retargeting experiments show that the system generalizes well to new characters and to different terrains. Finally, although the policies were not trained with external perturbations, they are robust to perturbation forces applied in simulation, which can largely be attributed to the exploration noise of the stochastic policies during training.
This paper introduces DeepMimic, a deep RL-based system for learning character animations from data that generalizes well to new characters, environments, motion skills, and tasks. The paper presents an excellent description of the method, sound justifications for design choices, and an extensive set of experiments and ablation studies to analyze and evaluate DeepMimic's performance. However, the authors acknowledge that DeepMimic has room for improvement in several areas: (a) the phase variable in the state features must stay synchronized with the reference motion, limiting the ability to adjust motion timing; (b) the multi-clip integration approach performs well only for a small number of motion clips; (c) the imitation reward must be manually specified for each motion; and (d) training times are long (up to several days per skill).
This summary was written in Fall 2020 as a part of the CMPT 757 Frontiers of Visual Computing course.