UBC Faculty Research and Publications

Guided learning of control graphs for physics-based characters Liu, Libin; van de Panne, Michiel; Yin, KangKang Nov 30, 2015


Guided Learning of Control Graphs for Physics-based Characters

LIBIN LIU and MICHIEL VAN DE PANNE
The University of British Columbia
and
KANGKANG YIN
National University of Singapore

The difficulty of developing control strategies has been a primary bottleneck in the adoption of physics-based simulations of human motion. We present a method for learning robust feedback strategies around given motion capture clips as well as the transition paths between clips. The output is a control graph that supports real-time physics-based simulation of multiple characters, each capable of a diverse range of robust movement skills, such as walking, running, sharp turns, cartwheels, spin-kicks, and flips. The control fragments that comprise the control graph are developed using guided learning. This leverages the results of open-loop sampling-based reconstruction to produce state-action pairs, which are then transformed into a linear feedback policy for each control fragment using linear regression. Our synthesis framework allows for the development of robust controllers with a minimal amount of prior knowledge.

1. INTRODUCTION

Designing controllers to realize complex human movements remains a challenge for physics-based character animation. Difficulties arise from non-linear dynamics, an under-actuated system, and the obscure nature of human control strategies. There is also a need to design and control effective transitions between motions in addition to the individual motions. Since the early work on this problem over two decades ago, controllers have been developed for many simulated skills, including walking, running, swimming, numerous aerial maneuvers, and bicycle riding. However, controller design often relies on specific insights into the particular motion being controlled, and the methods often do not generalize to wider classes of motions.
It also remains difficult to integrate motion controllers together in order to produce a multi-skilled simulated character.

{libinliu, van}@cs.ubc.ca, kkyin@comp.nus.edu.sg

In this paper, we develop controllers for a wide variety of realistic, dynamic motions, including walking, running, aggressive turns, dancing, flips, cartwheels, and getting up after falls, as well as transitions between many of these motions. Multiple simulated characters can physically interact in real-time, opening the door to the possible use of physics in a variety of sports scenarios.

Our method is designed around the use of motion capture clips as reference motions for the control, which allows existing motion capture data to be readily repurposed to our dynamic setting. It also helps achieve a high degree of realism for the final motion without needing to experiment with objective functions and solution shaping, as is often required by optimization approaches. The control itself is broken into a sequence of control fragments, each typically 0.1s in length, and a separate linear feedback control strategy is learned for each such fragment. An iterative guided learning process is used for learning: a sampling-based control method serves as a control oracle that provides high-quality solutions in the form of state-action pairs; linear regression on these pairs then provides an estimated linear control policy for any given control fragment. Importantly, successive iterations of the learning are coupled together by using the current estimated linear control policy to inform the construction of the solution provided by the control oracle; it provides the oracle with an estimated solution, which can then be refined as needed. This coupling encourages the oracle and the learned control policy to produce mutually compatible solutions.
The final control policies are compact in nature and have low computational requirements.

Our work makes two principal contributions: (1) A guided-learning algorithm that combines a sampling-based control oracle with full-rank linear regression to iteratively learn time-varying linear feedback policies that robustly track input motion capture clips. The pipeline further supports motion retargeting. (2) An overall clips-to-controllers framework that learns robust controllers for a wide range of cyclic and non-cyclic human motions, including many highly dynamic motions, as well as learning transitions between the controllers in order to produce flexible control graphs. Results demonstrate the integrated motion capabilities on real-time simulations of multiple characters that are capable of physics-based interactions with each other. Four different stand-up strategies, each based on motion capture data, allow characters to recover from falls voluntarily.

2. SYSTEM OVERVIEW

Figure 1 provides an overview of the system components and how they interact. As input, the system takes individual motion clips and the desired connectivity of these clips, as represented by a motion graph. The output is then a control graph that is constructed from a large set of control fragments, typically of short duration, e.g., 0.1s, which afford the connectivity described by the desired motion graph. Each control fragment is defined by a short target tracking trajectory, m̂, its duration, δt, and a related linear feedback policy, π, as developed during an offline guided learning process.

The learning process begins with the application of a SAMpling-based CONtrol strategy (SAMCON) that produces an open-loop trajectory for the controls, and therefore for each control fragment, that does well at reproducing a given input motion clip or motion-clip transition. This serves two purposes.
First, it replaces the input motion, which is often not physically feasible due to modeling errors and possible retargeting, with a physically-realizable motion. Second, it provides a nominal open-loop reference motion and the associated control values, around which we will then learn linear feedback control strategies to provide robust control. The reference motion and the control values are stored in the control fragments that underlie any given motion clip.

Next, iterative guided learning is used to learn linear feedback policies for each control fragment. This involves the repeated use of SAMCON in order to produce multiple new solutions (motions and the control values underlying them) that each do well at reproducing the reference motion. These then serve to provide state-and-corresponding-action data for learning a local linear feedback model for each control fragment, using linear regression. However, the initial linear feedback policies learned in this fashion do not work well in practice; when applied in simulation, the resulting motion quickly visits regions of the state space that are far removed from the regions for which the original state-and-action data was obtained. To remedy this, guided SAMCON uses the current linear control policy as an initial guess for computing its solutions, thereby implicitly looking for control solutions that exhibit a degree of compatibility with the current linear-feedback control policies. Over multiple iterations of the guided learning loop, the process converges towards robust linear feedback control policies for the sequence of control fragments. In order for the described method to also be able to robustly handle transitions between motion clips, as modeled by the desired connectivity in the motion graph, the motions for which we use SAMCON to collect the desired state-and-action data come from long random walks on the desired motion graph.
In this way, a control fragment that immediately follows incoming transitions from multiple branches of a motion graph will see state-and-action data from all of these different arrival paths, and will therefore be encouraged to produce a control policy that is compatible with this possibly-diverse set of starting states.

During online simulation, a motion planner or user input specifies a desired path through the control graph, and thus provides a desired sequence of control fragments. Linear feedback control is applied once at the beginning of each control fragment, based on the current state at that time. The computed control action specifies a constant offset from the reference controls that is then applied for the duration of the control fragment. The linear feedback control decisions are thus made at the time scale of the control fragments. Finally, proportional-derivative (PD) controllers are used to convert the control actions, which are target joint angles in our case, into joint torques. The use of this final low-level control construct at a fine time scale (milliseconds) allows for the rapid, adaptive generation of torques upon ground contact or other collisions, and helps enable the key control decisions to be made at the coarser time scale of the control fragments.
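This two-time-scale scheme (one feedback decision per fragment, PD torques every simulation step) can be sketched on a toy 1-DoF joint with unit inertia. The gains, the toy dynamics, and the helper names here are illustrative assumptions, not the paper's articulated-body implementation:

```python
import numpy as np

DT_SIM = 0.005   # 5 ms physics step, as in the paper
DT_FRAG = 0.1    # approximate control-fragment duration (20 sim steps)

def pd_torque(q, qdot, q_hat, kp=300.0, kd=30.0):
    # PD-servo: tau = kp * (q_hat - q) - kd * qdot
    return kp * (q_hat - q) - kd * qdot

def run_fragment(q, qdot, target_clip, M, a_hat):
    """Execute one control fragment on a toy 1-DoF joint.
    The linear feedback policy is queried ONCE, at the fragment start;
    the resulting offset is held constant while PD control runs at
    every 5 ms simulation step."""
    s = np.array([q, qdot])            # toy 2-D state features
    offset = float(M @ s + a_hat)      # a = M s + a_hat, fixed for the fragment
    for q_hat in target_clip:          # one open-loop target pose per sim step
        tau = pd_torque(q, qdot, q_hat + offset)
        qdot += tau * DT_SIM           # semi-implicit Euler, unit inertia
        q += qdot * DT_SIM
    return q, qdot

# usage: track a constant open-loop target of 1.0 rad for one fragment,
# with a zero-initialized feedback policy (M = 0, a_hat = 0)
clip = np.full(int(DT_FRAG / DT_SIM), 1.0)
q, qdot = run_fragment(0.0, 0.0, clip, M=np.zeros((1, 2)), a_hat=np.zeros(1))
```

Note how the offset computed from the fragment-start state shifts every target pose of the clip, exactly as the constant-offset description above requires.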
Instead, these are taken into account implicitly during multiple simulations that are used to evaluate the impact of parameter adaptations made to the components of the control system. These methods are commonly characterized by proportional-derivative joint control, force generation using the Jacobian transpose, finite state machines that model motion phases, and phase-specific feedback laws that govern tasks such as hand and foot placement. Controllers for many types of motion skills have been synthesized, including walking, running, bicycling, and other agile motions, e.g., [Hodgins et al. 1995; Yin et al. 2007; Wang et al. 2009; Kwon and Hodgins 2010; Lee et al. 2010; Ha et al. 2012; Liu et al. 2012; Al Borno et al. 2013; Tan et al. 2014]. Knowledge and insights about the desired motions can be incorporated into the design of the control system, into an objective function to be used for optimization, or, most commonly, into both. Realistic muscle models can also be integrated into such approaches, e.g., [Wang et al. 2012; Geijtenbeek et al. 2013].

Fig. 1: System Overview

Optimized inverse dynamics methods: Another popular category of approach combines knowledge of the equations of motion with optimization in order to solve directly for the control actions, typically joint torques. This can be done at various time scales. Short time-horizon methods optimize for the controls, accelerations, and ground contact forces for the current time step, and are commonly solved using quadratic programming, which allows ground contact constraints to be conveniently imposed.
Knowledge about the phase-based nature of the motion can be encoded into phase-specific objective functions, and anticipatory knowledge can be incorporated into simplified models that then participate in the objective function. The approach has been successfully applied to a wide range of motions, e.g., [Da Silva et al. 2008; Macchietto et al. 2009; de Lasa et al. 2010; Ye and Liu 2010; Zordan et al. 2014; Al Borno et al. 2014]. Long-horizon methods optimize for the motion and the underlying controls for a finite-duration horizon into the future, possibly encompassing the entire motion, e.g., [Popović and Witkin 1999; Sulejmanpašić and Popović 2005; Wampler and Popović 2009]. For interactive applications, model-predictive control is used, whereby only the immediate control actions are employed and the remainder of the time horizon is treated as a motion plan that is then extended and reoptimized at the next control time step, e.g., [Tassa et al. 2012]. Recent work has further shown that the phase structure can also be learned for a variety of motions [Mordatch et al. 2012].

Motion tracking: Motion capture data can be used as part of controller design as a means of producing high-quality motions without needing to first fully decipher the many factors that may influence how humans move. Motion capture clips have been used as reference trajectories for passive simulation [Zordan et al. 2005] and spacetime optimization [Popović and Witkin 1999; Sulejmanpašić and Popović 2005]. With the help of robust abstract feedback policies, motion capture data can be used to guide the creation of closed-loop controllers for realistic walking [Sok et al. 2007; Yin et al. 2007; Lee et al. 2010] and running [Kwon and Hodgins 2010] motions. Model-based optimal control provides a general method for developing robust control about given reference trajectories [Muico et al. 2009; Muico et al. 2011].
In general, however, it remains unclear how to adapt tracking-based control methods to complex contact conditions and to a wide range of motions. The sampling-based control strategy proposed in [Liu et al. 2010] has demonstrated the ability to robustly track a wide variety of motions, including those involving complex changing contacts. However, the solutions are open-loop and require offline computation.

Compact linear feedback: Low-dimensional linear feedback policies can perform surprisingly well in many circumstances, suggesting that compact and simple solutions often exist for producing robust locomotion control [Raibert and Hodgins 1991; Yin et al. 2007]. Robust reduced-order linear feedback policies can also be learned for a variety of motions using optimization in the space of reduced-order linear policies [Ding et al. 2015]. This method has further been demonstrated in the synthesis of control for several parkour-like skills [Liu et al. 2012] and skeleton-driven soft body characters [Liu et al. 2013]. However, using the same reduced-order linear policy across all phases of a motion is insufficient for complex motions, and thus the work of Liu et al. [2012] requires manual segmentation into motion phases, followed by the optimization of separate feedback policies for each motion phase. In this paper, we avoid the need for this manual segmentation by allowing each short-duration control fragment to have its own linear feedback model for its fine-scale (approximately 0.1s) motion phase. Our regression-based learning can efficiently learn the large number of linear feedback parameters that result from this parameter-rich model.

Multiple controller integration: Kinematic approaches offer easy-to-use graph structures for organizing and composing motion clips. However, research on sequencing and interpolation of controllers remains sparse.
Given a set of existing controllers, oracles can be learned to predict the basins of attraction for controllers, and therefore to predict when transitions can safely be made between controllers [Faloutsos et al. 2001]. Tracking multiple trajectories simultaneously has been used to enhance the robustness of locomotion control [Muico et al. 2011]. Transitions between running and obstacle clearing maneuvers are realized in [Liu et al. 2012] using careful design of the structure and objectives of the transition behaviors. In this paper, we systematically realize robust transitions between many different skills.

Reinforcement learning: Reinforcement learning (RL) provides a convenient and well-studied framework for control and planning. It seeks an optimal policy that maximizes the expected returns given rewards that characterize a desired task. Value-iteration RL methods have been used on kinematic motion models, e.g., for boxing [Lee and Lee 2006] and flexible navigation [Lee and Lee 2006; Treuille et al. 2007; Lee et al. 2010], and for physics-based models, e.g., terrain traversal with constraints [Coros et al. 2009] and with highly dynamic gaits [Peng et al. 2015]. Policy search methods are often applied to problems having continuous action spaces, often searching the parameter space using stochastic optimization algorithms such as policy gradient [Peters and Schaal 2008], related EM-based approaches [Peters and Schaal 2007], and approaches with compact-but-adaptive policy representations [Tan et al. 2014]. Despite such progress, policy search often suffers from common issues related to optimization in high-dimensional spaces, such as being sensitive to the policy representation, requiring large numbers of samples, and convergence to local optima. Several recent works make progress on this problem using forms of guided policy search, an iterative process where new samples from a control oracle inform the construction of an improved policy, which then informs the collection of new samples, and so forth, e.g., [Ross et al. 2011; Levine and Koltun 2013; 2014; Mordatch and Todorov 2014].

Our learning pipeline has a similar guided-learning structure but is unique in: (1) the use of an implicit-dynamics, sample-based motion reconstruction method as the control oracle; (2) the use of simple time-indexed linear feedback policies and linear regression to learn these policies; (3) a focus on difficult, dynamic, and realistic 3D full-body human motion skills; and (4) the ability to learn transitions between skills to yield integrated multi-skilled characters.

Table I: Symbols

  Symbol   Description
  p        pose
  p̂        target pose for PD-servos
  Δp̂       offset on target poses
  m        motion clip, i.e., a sequence of poses in time
  m̃        reference motion capture clip
  m̂        control clip / tracking target trajectory
  G̃        reference motion graph
  G        control graph
  C        control fragment
  δt       duration of a control fragment
  π        feedback policy of a control fragment
  M, â     gain matrix and affine term of a feedback policy
  Σ        variance of policy exploration
  s        state vector
  a        action vector
  τ_k      simulation tuple corresponding to C_k, τ_k = (s_{k−1}, a_k, s_k)
  W        random walk on the control graph, W = {C_k}
  τ        execution episode of the random walk, τ = {τ_k}
  i        sample index for policy search
  j        sample index for guided SAMCON
  k        index for control fragments

4. STRUCTURE OF CONTROLLERS

We model a virtual character as an under-actuated articulated rigid-body system, whose pose p = (x_0, q_0, q_j), j = 1, ..., n is fully determined by the position (x_0) and orientation (q_0) of the root and the rotations of all n joints. We drive each DoF (degree of freedom) with PD-servos:

    \tau = k_p (\hat{q} - q) - k_d \dot{q}    (1)

where q and q̇ represent the joint rotation and rotational speed, respectively, and the tracking target q̂ is given by a target pose p̂.
The system is simulated with the open-source Open Dynamics Engine (ODE). For better stability, we follow the idea of Stable-PD control [Tan et al. 2011] and replace the second term of Equation 1 with implicit damping in the same way as described in [Liu et al. 2013]. This allows us to use a large simulation time step (5 ms), which significantly speeds up the learning process and improves the online performance.

Fig. 2: A control fragment: when the simulated state s'_0 drifts away from the reference start state s_0, the feedback policy π is invoked to compute a compensation Δp̂ that offsets the open-loop control clip m̂ to m̂'. By tracking m̂' with PD-servos, the simulation can end near the reference end state s_e in δt seconds.

Fig. 3: A chain of control fragments

Control fragments, represented by the symbol C, are the basic units of the controllers in our framework. A control fragment is a tuple {δt, m̂, π}, as indicated in Figure 2, where m̂ = p̂(t) represents an open-loop control clip consisting of a sequence of target poses in time, which can be tracked by PD-servos to simulate a character from a start state s_0 to the end state s_e in δt seconds. s_0 and s_e are derived from the reference m̂. In practice, the simulation state in effect when a control fragment begins, s'_0, will not be exactly at the expected starting state, s_0, due to perturbations. The feedback policy, π, is therefore used to compute a corrective action, a, which consists of an offset, Δp̂, that is added to m̂ in order to eliminate the deviation. As illustrated in Figure 2, this offset remains fixed during the whole control fragment, yielding a resulting control clip m̂' = Δp̂ ⊕ m̂ that is then tracked instead of m̂ in order to have the state end near the desired end state, s_e.
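The benefit of the implicit damping mentioned above can be seen on a toy 1-DoF, unit-inertia joint: with a 5 ms step and stiff damping, explicitly integrating Equation 1 diverges, while the implicit-damping variant remains stable. This is only an illustrative sketch of the idea behind Stable-PD, with made-up gains, not the paper's articulated-body implementation:

```python
def explicit_pd_step(q, qdot, q_hat, kp, kd, dt):
    # Equation (1) with damping evaluated at the CURRENT velocity.
    tau = kp * (q_hat - q) - kd * qdot
    qdot = qdot + dt * tau              # unit inertia
    return q + dt * qdot, qdot

def implicit_damped_pd_step(q, qdot, q_hat, kp, kd, dt):
    # Damping evaluated at the NEXT velocity (Stable-PD-style):
    # qdot' = qdot + dt * (kp * (q_hat - q) - kd * qdot'),
    # which has the closed-form solution below.
    qdot = (qdot + dt * kp * (q_hat - q)) / (1.0 + dt * kd)
    return q + dt * qdot, qdot

def settle(stepper, kp=1000.0, kd=500.0, dt=0.005, steps=400):
    """Track a unit target for 2 s and return the final angle."""
    q, qdot = 0.0, 0.0
    for _ in range(steps):
        q, qdot = stepper(q, qdot, 1.0, kp, kd, dt)
    return q
```

With these gains, kd·dt = 2.5 > 2, so the explicit velocity update over-corrects and oscillates with growing amplitude, whereas the implicit form stays stable; this kind of unconditional stability is what permits the large 5 ms time step.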
Here the operator ⊕ represents a collection of quaternion multiplications between corresponding joint rotations.

Our framework employs a linear feedback policy for every control fragment:

    a = \pi(s; M, \hat{a}) = M s + \hat{a}    (2)

where M represents a feedback gain matrix, â is an affine term, and s and a are vectors representing the simulation state and feedback action, respectively. We use a selected subset of state and action features in order to facilitate a compact control policy. For all the skills developed in this paper, we use s = (q*_0, h_0, c, ċ, d_l, d_r, L), consisting of the root orientation q*_0, the root height h_0, the centroid position c and velocity ċ, the vectors pointing from the center of mass to the centers of both feet, d_l and d_r, and the angular momentum L. All these quantities are measured in a coordinate frame that has one axis vertically aligned and another aligned with the character's facing direction. As the vertical component of the root orientation q_0 is always zero in this reference frame, q*_0 contains only the two planar components of the corresponding exponential map of q_0. s thus represents 18 degrees of freedom (DoF). Similarly, we use an 11-DoF action vector a that consists of the offset rotations of the waist, hips, and knees, represented in terms of the exponential map. Knee joints have one DoF in our model. The final compensation offset, Δp̂, is then computed from a, where we set the offset rotations of all remaining joints to zero.

A controller can be defined as a cascade of control fragments, as depicted in Figure 3, which can be executed to reproduce a given motion clip. We also wish to be able to organize the control fragments into a graph as shown in Figure 4(b), whereby multiple possible outgoing or incoming transitions are allowed at the boundaries of the control fragments at transition states, such as s_1, s_2, and s_3. We further define the chains of control fragments between transition states as controllers; each controller is uniquely colored in Figure 4(b).
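A minimal sketch of assembling the 18-D state vector and evaluating Equation 2 follows; the input quantities are assumed to have been precomputed in the facing-direction frame, and the function and variable names are illustrative:

```python
import numpy as np

STATE_DIM, ACTION_DIM = 18, 11   # dimensions given in the text

def state_features(q0_planar, h0, c, c_dot, d_l, d_r, L):
    """Assemble s = (q0*, h0, c, c_dot, d_l, d_r, L).
    All vectors are assumed to already be expressed in the character
    frame (vertical axis aligned, second axis along the facing direction)."""
    s = np.concatenate([
        q0_planar,    # 2: planar components of the root-orientation exp. map
        [h0],         # 1: root height
        c, c_dot,     # 3 + 3: centroid position and velocity
        d_l, d_r,     # 3 + 3: center-of-mass-to-foot vectors
        L,            # 3: angular momentum
    ])
    assert s.shape == (STATE_DIM,)
    return s

def feedback_action(s, M, a_hat):
    # Equation (2): a = M s + a_hat; a holds exponential-map offsets
    # for the waist, hips, and knees (11 DoF in total).
    return M @ s + a_hat

# usage: with zero gains (the initial configuration, M = 0, a_hat = 0),
# the policy simply returns its affine term
s = state_features(np.zeros(2), 1.0, np.zeros(3), np.zeros(3),
                   np.zeros(3), np.zeros(3), np.zeros(3))
a = feedback_action(s, np.zeros((ACTION_DIM, STATE_DIM)), np.zeros(ACTION_DIM))
```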
In practice, controllers need to produce particular skills, e.g., running, and to perform dedicated transitions between skills, e.g., speeding up to a run. In Figure 4(c) we illustrate the corresponding connectivity between controllers. Here, an arrow indicates that the controller associated with the tail ends in a state that is near the expected starting state of the controller associated with the head. Based on this graph structure, the sequencing of skills is achieved simply by walking on this graph while executing the encountered control fragments.

Fig. 4: Control graph: a control graph is created by (a) building a reference motion graph from example motion clips, then (b) converting each clip of the motion graph to a chain of control fragments. (c) shows a compact representation of the control graph (b), where each node represents a chain of control fragments, or rather, a controller.

In our framework, the structure of a control graph is predefined and fixed during the learning process. Given example motion clips of desired skills, this is done by first building a reference motion graph, and then converting it into a control graph. Figure 4(a) shows a simple motion graph consisting of three motion clips and transitions between sufficiently similar frames, e.g., s_1, s_2, s_3, which define the transition states. Any portion of a motion clip that is between two transition frames is then converted to a chain of control fragments, or equivalently, a controller, between the corresponding transition states. In this conversion, the motion clip is segmented into K identical-duration pieces, with K chosen to yield time intervals δt ≈ 0.1s. We construct high-quality open-loop control trajectories from the input motion clips using the improved SAMCON algorithm and noise reduction and time scaling techniques [Liu et al. 2015; Liu et al. 2013], and initialize the control fragments with the resulting open-loop controls. The feedback policies π are initialized to zero, i.e., M = 0, â = 0.

The initial configuration of control fragments as described thus far cannot produce robust execution of skills because of the lack of feedback. In the next section, we introduce the details of our learning pipeline, which augments the control graph with a feedback policy for each control fragment.

5. GUIDED LEARNING OF FEEDBACK POLICIES

We desire a control graph that supports random walks on the graph for physics-based characters, analogous to the use of a motion graph for kinematic motion synthesis. For these physics-based skills to be robust, feedback policies need to be developed for the control fragments. Formally, given a control graph that consists of K control fragments, {C_k}, we need to learn feedback policies for these control fragments that ensure successful random walks W = {C_{k_1}, C_{k_2}, ...}, k_i ∈ {1, ..., K} on the graph. To this end, we first generate a long sequence W via a random walk on the control graph, in which each control fragment C_k appears at least 200 times. We then formulate the learning process as a policy search problem to be evaluated on W, and use an iterative process to develop suitable feedback policies.

Figure 5 provides a toy illustration of the guided learning process. Given a random graph-walk, W, consisting of 9 control fragments, a successful execution episode of W is generated using Guided SAMCON, as will be discussed in further detail in §5.3. This provides a sequence, τ, of states and corresponding control actions that does well at reproducing the desired reference motions corresponding to W. In this toy example, we simply use four discrete states as an abstract representation of a larger continuous state space, and the actions are simply represented as the arrows that transition to the state at the start of the next control fragment. Because each control fragment occurs multiple times in W, multiple state-action pairs, (s, a), are collected for each control fragment, i.e., four for C_1, three for C_2, and so forth. These are then used to develop a linear (affine in practice) regression model for each control fragment that predicts a as a linear function of s. The resulting predictive model then becomes the control policy, π, for the control fragment. This control policy is then used to help inform the next round of motion reconstruction using Guided SAMCON.

Fig. 5: A sketch of the guided learning process for a toy control graph. (Random walk W: C_1, C_2, C_2, C_3, C_1, C_1, C_2, C_3, C_1.)

In the following section, we describe how the iterative use of the linear regression model can be understood as an EM-based (expectation maximization) policy search algorithm. Alternatively, readers can choose to jump directly to the specific details of the linear regression for our problem, as described in §5.2.

5.1 Guided Learning as EM-based Policy Search

Starting from a state s_{k−1}, each execution of a control fragment C_k results in a simulation tuple τ = (s_{k−1}, a_k, s_k). Given a reward function R(τ) that measures the goodness of this execution, policy search seeks the optimal policy that maximizes the expected return

    J(\theta) = \int_\tau P(\tau; \theta) R(\tau)    (3)

with respect to the feedback parameters θ.
The probability density of a simulation tuple is determined by:

    P(\tau; \theta) = P(s_k \mid s_{k-1}, a_k) \, \pi_k(a_k \mid s_{k-1}; \theta) \, P(s_{k-1})    (4)

where P(s_k | s_{k−1}, a_k) is the transition probability density and π_k(a_k | s_{k−1}; θ) represents the probability density of the feedback action given the start state and the feedback parameters. We model π_k(a_k | s_{k−1}; θ) as Gaussian exploration superimposed onto the deterministic feedback policy of Equation 2, i.e.:

    \pi_k(a_k \mid s_{k-1}; \theta) := \pi_k(a_k \mid s_{k-1}; M_k, \hat{a}_k, \Sigma_k) \sim \mathcal{N}(M_k s_{k-1} + \hat{a}_k, \Sigma_k)    (5)

The feedback parameters are then defined as θ = {M_k, â_k, Σ_k}. We use a diagonal covariance matrix Σ_k, with the assumption that each dimension of the action space is independent.

An EM-style algorithm offers a simple way to find the optimal policy by iteratively improving the estimated lower bound of the policy's expected return. [Peters and Schaal 2007] applies an EM algorithm to episodic policy search for a linear policy and shows that the iterative update procedure is just a weighted linear regression over the execution episodes of the current policy. Specifically, let θ_0 be the current estimate of the policy parameters; EM-based policy search computes a new estimate θ that maximizes:

    \log \frac{J(\theta)}{J(\theta_0)} = \log \int_\tau P(\tau; \theta) R(\tau) / J(\theta_0)    (6)
      \geq \frac{1}{J(\theta_0)} \int_\tau P(\tau; \theta_0) R(\tau) \log \frac{P(\tau; \theta)}{P(\tau; \theta_0)}    (7)
      \propto \int_\tau P(\tau; \theta_0) R(\tau) \log \frac{\pi(a_k \mid s_{k-1}; \theta)}{\pi(a_k \mid s_{k-1}; \theta_0)}    (8)
      = L(\theta; \theta_0) + C    (9)

where Equation 7 applies Jensen's inequality to the concave logarithm function, C is a constant independent of θ, and

    L(\theta; \theta_0) := \int_\tau P(\tau; \theta_0) R(\tau) \log \pi(a_k \mid s_{k-1}; \theta)    (10)

Note that the optimal θ must satisfy J(θ) ≥ J(θ_0), because Equation 7 is always zero when θ = θ_0. L(θ; θ_0) can be further estimated from a number of simulation tuples {τ_k^i} sampled according to the current policy π_k(a_k | s_{k−1}; θ_0) as:

    L(\theta; \theta_0) \approx \frac{1}{N_k} \sum_{i=1}^{N_k} R(\tau^i) \log \pi_k(a_k^i \mid s_{k-1}^i; \theta)    (11)
By letting ∂L(θ;θ0)/∂θ =0 we can find the locally optimal estimation of θ by solving0 =∂∂θL(θ;θ0)∝Nk∑i=1R(τ i)∂∂θlogpik(aik|sik−1;θ) (12)With this maximization step in place (the M step), we then up-date θ0 with this optimal θ and then recompute a new set of sam-ples (the E step) and repeat the EM iteration until obtaining optimalpolicies.As we are learning the feedback policies against the random walkW , the sample tuples {τ ik} for all the control fragments {Ck} canbe collected simultaneously by generating a long successful execu-tion ofW , represented by τ = {τk1 , τk2 , . . . }, and then extractingsimulation tuples for each individual control fragment from it. Fig-ure 5 provides a simple sketch of this procedure. Furthermore, weassign a constant reward to all such tuples, which implies a specialreward function in the form ofR(τ) = 1 tuple τ is good enough in the long run sothat the random walkW can succeed.0 otherwise. (13)Solving Equation 12 against this reward function and the Gaus-sian Explorations of Equation 5 leads to the linear regression thatwe describe next.5.2 Estimation of Linear Feedback PolicyThe linear regression problem solved for control fragment k yieldsa model to predict a as an affine function of s, as per equation 2,whereMk =[(STk Sk)−1(STk Ak)]T(14)aˆk = a¯k −Mks¯k−1 (15)diag(Σk) =1Nτdiag[(Ak − SkMTk )T (Ak − SkMTk )](16)where a¯k and s¯k−1 are the averages of aik and sik−1 respectively,the Nk-row matrices Sk and Ak represent the centered collectionsof all the tuples, i.e.Sk =[s1k−1 − s¯k−1, . . . , sNkk−1 − s¯k−1]T(17)Ak =[a1k − a¯k, . . . ,aNkk − a¯k]T(18)To prevent the regression from being underdetermined, the ran-dom walk is generated to be long enough so that Nk ≥ 200 for allcontrol fragments. We further regularize the Frobenius norm of thefeedback gain matrixMk, so thatMk =[(STk Sk + λI)−1(STk Ak)]T(19)is used instead of Equation 14. 
The regularization coefficient is $\lambda = 10^{-6}$ in all our experiments.

We further use the observed prediction variances, as captured by $\Sigma_k$, to provide a stochastic version of the control policy (cf. §5.1) that is used to guide the sampling of the search algorithm (SAMCON) that serves as our control oracle:

$$\pi_k(a_k \mid s_{k-1};\theta) := \mathcal{N}(M_k s_{k-1} + \hat{a}_k,\; \Sigma_k)$$

Algorithm 1 Guided SAMCON
Input:
  a random walk on the control graph W = {C_k}, k = 1, ..., N
  the start state s_0
Output: a successful execution of the sequence τ
 1: {s_0^j} ← initialize the starting set with N_s replicas of s_0
 2: for k ← 1 to N do
 3:   for each sample j do
 4:     generate action a_k^j ~ π(a_k | s_{k-1}^j) ~ N(M_k s_{k-1}^j + â_k, Σ_k)
 5:     s_k^j ← execute control fragment C_k against a_k^j
 6:     record a simulation tuple τ_k^j = (s_{k-1}^j, a_k^j, s_k^j)
 7:     E_k^j ← evaluate end state s_k^j
 8:   end for
 9:   {τ_k^{j*}} ← select n_s elite samples according to {E_k^j}
10:   {s_k^j} ← resample {s_k^{j*}} to get a new starting set of size N_s
11: end for
12: τ = {τ_k} ← select the best path from all saved {τ_k^{j*}}

5.3 Guided SAMCON

A key to the success of the guided learning is that it learns from useful examples, i.e., only those that result from successful completions of the random walk in every iteration. Levine and Koltun [2013; 2014] suggest that trajectory optimization can be used to collect such useful examples for guided policy search. We draw inspiration from this type of guided learning and develop a form that relies on sampling-based motion control (SAMCON) methods, as proposed by [Liu et al. 2010; Liu et al. 2015], to provide successful executions of the target random walk. SAMCON allows us to work with long sequences of complex motions, and has proved capable of generating the controls for a wide variety of motion skills. We call this new method Guided SAMCON.

We begin by reviewing SAMCON. For this review, the exact index of a control fragment is unimportant, so we represent the control fragments by their sequence index in the random walk, i.e., $W = \{C_1, C_2, \dots, C_N\}$, where $N = \sum_{k=1}^{K} N_k$ is the length of the random walk and $N_k$ is the number of times a given control fragment $k$ appears in it. Beginning from the initial state of the first control fragment, $C_1$, we utilize SAMCON to develop a sequence of actions $\{a_k\}$ that results in control-fragment end states $\{s_k\}$ that are close to those of the desired reference motions. The simulation tuples $\{(s_{k-1}, a_k, s_k)\}$ originating from the same control fragment $k$ are then collected together for regression. Note that this represents a slight abuse of notation, in that we use $s_{k-1}$ to refer to the previous state in the random walk sequence rather than a true control fragment index, whose identification numbering can be in an arbitrary order. However, this significantly simplifies the notation for the remainder of the paper.

Algorithm 1 outlines the main steps of SAMCON, and Figure 5 provides a simple example. SAMCON can be viewed as a type of Sequential Monte Carlo algorithm [Doucet and Johansen 2011]. Specifically, for the first control fragment, SAMCON initializes a starting set of states $\{s_0^j\}$, $j \in 1 \dots N_s$, as replicas of the start state $s_0$, and samples an action $a_1^j$ for each $s_0^j$ according to a sample distribution $\pi(a_1 \mid s_0)$. It then advances the simulation from $s_0^j$ while executing the control fragment with the corresponding compensation offset $\Delta\hat{p}_1^j$ computed from $a_1^j$, as described in Section 4. The simulation results in a tuple $\tau_1^j = (s_0^j, a_1^j, s_1^j)$ whose end state $s_1^j$ is evaluated according to its similarity to the reference end state that corresponds to the control fragment. We measure this similarity with a cumulative cost function:

$$E = w_p E_p + w_r E_r + w_e E_e + w_b E_b + w_c E_c + w_v E_v + w_L E_L + w_a E_a \qquad (20)$$

where the terms for pose control $E_p$, root control $E_r$, end-effector control $E_e$, and balance control $E_b$ are identical to the original work [Liu et al. 2010].
We additionally regularize the differences between the simulation and the reference in terms of centroid position $E_c$, centroid velocity $E_v$, and angular momentum $E_L$. The last term, $E_a$, simply regularizes the Euclidean norm of the actions. We use $(w_p, w_r, w_e, w_b, w_c, w_v, w_L, w_a) = (4.0, 4.0, 10.0, 1.0, 3.0, 0.1, 0.03, 0.5)$ for all our experiments. Our results are not sensitive to the exact values of these weights, so other values within the same order of magnitude may be used as well.

After executing all sample actions and obtaining $N_s$ simulation tuples $\{\tau_1^j\}$, guided SAMCON selects and saves the $n_s$ best tuples $\{\tau_1^{j*}\}$, as measured by the lowest cumulative costs, and then systematically resamples the corresponding end states $\{s_1^{j*}\}$ according to their costs to obtain a new starting set $\{s_1^j\}$ of size $N_s$ for the successive control fragment. This is akin to the resampling procedure used in particle filtering, i.e., better samples produce more successors. This sampling procedure is repeated for each stage of the motion, i.e., once per control fragment, until the end of the random walk is reached. Finally, the resulting execution episode $\tau = \{\tau_k\}$ is chosen as the best path through all saved tuples $\{\tau_k^{j*}\}$.

Guided SAMCON uses the current policy of every control fragment $C_k$ as the distribution to sample from, i.e., $\pi(a_k \mid s_{k-1}) = \pi_k(a_k \mid s_{k-1}; M_k, \hat{a}_k, \Sigma_k) \sim \mathcal{N}(M_k s_{k-1} + \hat{a}_k, \Sigma_k)$. This can be viewed as an enhancement of both the original SAMCON algorithm [Liu et al. 2010], which employed a fixed sampling distribution $\pi(a_k \mid s_{k-1}) \sim \mathcal{N}(0, \Sigma_0)$, and the improved SAMCON algorithm [Liu et al. 2015], which evolves the mean and covariance of the sample distributions iteratively in a state-independent fashion, i.e., $\pi(a_k \mid s_{k-1}) \sim \mathcal{N}(\hat{a}_k, \Sigma_k)$.
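The per-stage elite selection and resampling (lines 9–10 of Algorithm 1) can be sketched as follows. The cost-to-weight mapping and all names are illustrative assumptions of ours, not the paper's exact implementation:

```python
import numpy as np

def select_and_resample(end_states, costs, n_elite, n_samples, rng):
    """One SAMCON stage: keep the n_elite lowest-cost samples, then
    systematically resample them (better samples get more successors)
    to form the starting set of size n_samples for the next fragment."""
    order = np.argsort(costs)[:n_elite]            # elite indices, lowest cost first
    elite_states = [end_states[i] for i in order]
    # Turn costs into resampling weights (an illustrative choice).
    w = np.exp(-np.asarray(costs)[order])
    w /= w.sum()
    # Systematic resampling, as used in particle filters.
    positions = (rng.random() + np.arange(n_samples)) / n_samples
    picks = np.searchsorted(np.cumsum(w), positions)
    return [elite_states[i] for i in picks]

# Tiny usage example with scalar "states":
rng = np.random.default_rng(1)
starts = select_and_resample(end_states=[0.0, 1.0, 2.0, 3.0],
                             costs=[5.0, 0.1, 4.0, 0.2],
                             n_elite=2, n_samples=6, rng=rng)
# 'starts' now contains six successors drawn from the two elite end states.
```

Systematic resampling keeps the successor counts nearly proportional to the weights while introducing less variance than independent multinomial draws.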
The guided sample selection and resampling implicitly focus the exploration on regions of the state space that are relevant to the current policy, as well as on regions of the action space that are known to yield the desired motions.

Voluntarily including noise in optimization has been shown to prevent over-fitting and to allow the learned policy to deal with larger uncertainty [Wang et al. 2010; Liu et al. 2012]. We build on this idea by further adding a Gaussian noise vector $\varepsilon_k \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$ to the action samples; we thus compute the compensation offset $\Delta\hat{p}_k^j$ from $a_k^j + \varepsilon_k$. The noise vector is assumed to be unknown to the feedback policies, and is not recorded or included in the regression. We find that a uniform setting of $\sigma_\varepsilon = 3°$ is enough to allow all of our motions to be robustly executed.

5.4 Learning Control Graphs

Algorithm 2 summarizes the whole guided learning framework for control graphs. Given several example motion clips of the target skills as input, the pipeline builds a control graph that synthesizes robust dynamic motions from arbitrary random walks over the graph. This allows motion planners, which are beyond the scope of this paper, to treat the graph as a simple high-level abstraction of the motion capabilities. The pipeline consists of the following sub-procedures:

Algorithm 2 Guided Learning Pipeline
Input: example motion clips of skills
Output: a control graph G
 1: build a reference motion graph G̃ from the input motion clips
 2: initialize a control graph G = {C_k} according to G̃
 3: generate a random walk W = {C_{k_1}, ..., C_{k_N}}
 4: refine the open-loop control clip m̂_k for every C_k
 5: initialize M_k = 0, â_k = 0, Σ_k = σ_0² I for every C_k
 6: for every EM iteration do          ▷ policy search
 7:   generate a successful execution τ of W with Guided SAMCON
 8:   for each control fragment C_k do
 9:     {τ_k^i} ← extract sample simulation tuples of C_k from τ
10:     update M_k, â_k, Σ_k by linear regression on {τ_k^i}
11:   end for
12: end for

Building control graphs: A reference motion graph is first built (line 1 of Algorithm 2) and then converted to a control graph (line 2) as described in Section 4. Building high-quality motion graphs can be a non-trivial task, even with the help of automated techniques such as the one proposed by [Kovar et al. 2002]. Manual tuning is often necessary to achieve natural-looking transitions and to remove artifacts such as foot-skating. Fortunately, the use of simulation naturally offers the ability to produce physically plausible motions for the control graph, so the reference motion graph does not need to be carefully tuned. In this paper, we simply specify the connectivity of the motion graphs for our control graphs manually. We kinematically blend a few frames of the relevant motion clips near the transition points. Our learning procedure is robust to kinematic flaws due to blending or noise, and is able to generate high-quality simulated motions.

Refining open-loop control clips: The initial control clips of every control fragment are computed directly from the individual motion capture example clips, which are not necessarily physically plausible for the transitions in the control graph. To facilitate the graph learning process, we further refine these open-loop controls as indicated on line 4. Specifically, this is done by performing the original SAMCON on the motion sequence corresponding to the random walk $W$, and then replacing the initial open-loop control clip $\hat{m}_k$ and the reference end states with the average over all simulation instances of the control fragment $C_k$ in $W$. The averaging not only reduces the noise due to random sampling, as suggested by [Liu et al. 2015], but also maximizes the possibility of finding a robust feedback policy that can deal with all possible transitions.

Learning feedback policies: In line 5, the feedback policies are initialized, as are the default exploration covariances.
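On a toy one-dimensional system, the policy-search loop of lines 6–12 can be illustrated as follows. The dynamics, the elite-selection "oracle", and all names here are stand-ins of our own devising, not the paper's simulator or SAMCON itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "control fragment": s_next = s + a.  A perfect policy drives the
# end state to 0, i.e. a = -s  (so ideally M = -1, a_hat = 0).
def step(s, a):
    return s + a

M, a_hat, sigma = 0.0, 0.0, 0.5       # initial linear policy and exploration

for iteration in range(10):           # EM iterations (Alg. 2, lines 6-12)
    samples = []
    for _ in range(200):
        s = rng.normal()              # a start state along the random walk
        # Sample candidate actions from the current stochastic policy and
        # keep the one with the best end state -- a stand-in for the elite
        # selection that guided SAMCON performs.
        cands = M * s + a_hat + sigma * rng.normal(size=10)
        best = cands[np.argmin(np.abs(step(s, cands)))]
        samples.append((s, best))
    S = np.array([s for s, _ in samples])
    A = np.array([a for _, a in samples])
    # Refit the linear policy by regression (the M step).
    M = np.sum((S - S.mean()) * (A - A.mean())) / np.sum((S - S.mean()) ** 2)
    a_hat = A.mean() - M * S.mean()
    sigma = max(0.05, np.std(A - (M * S + a_hat)))  # keep some exploration

# After a few iterations M approaches -1 and a_hat approaches 0.
```

The point of the sketch is the alternation: the stochastic policy generates samples, an oracle keeps only the good ones, and regression pulls the policy toward the oracle's choices, shrinking the exploration variance as the fit improves.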
We find that $\sigma_0 = 5°$ works for all the skills we have tested. The EM-based policy search is performed in lines 6–12, where guided SAMCON trials and linear regressions are alternated to improve the feedback policies iteratively. In all our experiments, this policy search converges to robust feedback policies in at most 20 iterations. Guided SAMCON can occasionally fail when generating a long random-walk sequence, especially in the first iteration, where the initial policy is applied. To mitigate this problem, we generate more samples ($N_s = 1000$) per stage during the first iteration than for the successive iterations ($N_s = 200$). If the algorithm fails to complete the designated graph walk, we roll back the execution of the latest three controllers (25–50 control fragments) and then restart guided SAMCON from that point.

6. RESULTS

We have implemented our framework in C++. We augment the Open Dynamics Engine (ODE) v0.12 with an implicit damping scheme [Liu et al. 2013] to simulate the character faster and more stably. On a desktop with an Intel Core i5 @ 2.67GHz CPU, our single-threaded implementation runs at 10× real-time using a simulation time step of 5 ms. Except for the retargeting experiments, all our experiments are performed with a human model that is 1.7 m tall and weighs 62 kg. It has 45 DoFs in total, including 6 DoFs for the position and orientation of the root. Two sets of PD gains are used in our experiments: (a) for basic locomotion, we simply set $k_p = 500$, $k_d = 50$ for all joints; (b) for highly dynamic stunts, a stronger waist ($k_p = 2000$, $k_d = 100$) and stronger leg joints ($k_p = 1000$, $k_d = 50$) are necessary.

Fig. 6: The relationship between the number of learning iterations and the controller robustness, as indicated by the resisting duration before failure under random pushes of growing magnitude, for the StridingRun, Kick, DanceSpin, Catwalk, Backflip, and Waltz skills.

Skills        PD-Gains Set   Tcycle (s)   tlearning (min)   tf (s)
Catwalk       (a)            0.7          40.3              107
StridingRun   (a)            0.45         21.1              370
Waltz         (a)            5.0          314               67.2
Kick          (b)            1.6          93.8              280
DanceSpin     (b)            1.6          102               139
Backflip      (b)            2.5          153               73.7

Table II: Performance statistics for cyclic motions. Tcycle represents the length of a reference cycle. tlearning is the learning time for each skill on a 20-core computer. tf represents the average resisting duration before failure under random pushes of growing magnitude.

6.1 Cyclic Skills

The simplest non-trivial control graphs are those built from individual cyclic skills. A variety of cyclic skills have been tested to fully evaluate the capability of the proposed learning framework, including basic locomotion gaits, dancing elements, flips, and kicks. The example motion clips for these skills come from various sources and were captured from different subjects. We simply apply them to our human model and kinematically blend the beginning and end of each clip to obtain cyclic reference motions. Errors due to model mismatches and blending are automatically handled by our physics-based framework. The animation sequences shown in Figure 7 demonstrate the execution of several learned skills. We encourage readers to watch the supplemental video to better evaluate the motion quality and robustness of the controllers.

The offline learning is performed on compute clusters with tens of cores. The performance of the learning pipeline is determined by the number of necessary runs of guided SAMCON, whose computational cost scales linearly with the length of the clip and the number of samples $N_s$, and inversely with the number of cores available. Table II lists the learning time for several cyclic skills, measured on a small cluster of 20 cores.
Note that here we run the learning pipeline with the same configuration for all motions for ease of comparison, i.e., 20 iterations of guided learning, with $N_s = 1000$ in the first iteration and $N_s = 200$ for the remaining iterations. In practice, the required number of SAMCON samples can be much lower after the first few learning iterations, e.g., $N_s = 50 \sim 100$, as the feedback policies usually converge quickly under guided learning.

The learned skills are robust enough to enable repeated executions even under external perturbations. In the supplemental video we show that 400 N × 0.2 s impulses can be applied to the character's trunk during the flight phase of kicking without causing the motion to fail. To more systematically test the robustness of the learned controllers, we apply a sequence of horizontal pushes in random directions to the character's trunk, and measure the time that the motion skill lasts before the character falls. The magnitude of the perturbation force is generated from a normal distribution with increasing variance. This experiment is performed 100 times, and the average performance is computed as an indication of robustness, as shown in the last column of Table II. Figure 6 further illustrates how the robustness improves as a function of the number of guided policy iterations. Generally speaking, faster motions such as running and kicking take fewer learning iterations to achieve stable cyclic skills that can execute indefinitely and tolerate larger perturbations. In contrast, slow motions such as the catwalk and the waltz are more sensitive to perturbations.

All the tested skills can be learned with the standard settings described in the previous sections, while special treatment is applied for walking and running in order to achieve symmetric gaits. Specifically, we pick one stride (half step) from the example clip and concatenate its mirrored stride to generate a symmetric reference motion.
In addition, we learn the feedback policies only for the first stride, and mirror the states and actions for the second stride so that the feedback policies are symmetric too. We further employ contact-aligned phase-resetting for the walking and running controllers, which improves their robustness to large perturbations. Interestingly, we found contact-aligned phase-resets not helpful for learning controllers for complex skills such as kicks and backflips, which may indicate that contact events are not informative phase indicators for such motions. Another interesting observation about the learned walking and running controllers is that the character turns when gentle sideways pushes are applied. This offers a simple way to parameterize these locomotion skills: we can record the corresponding actions under external pushes and add them voluntarily to the action vectors in order to make the character turn at moderate speed. We use this simple parameterization method to achieve basic steering behaviors in our demos. For rapid turns we still need to use controllers learned from relevant motion capture examples.

The robustness of the learning framework enables additional interesting applications. For example, in the supplemental video we show that two significantly different backflips can be learned from a single motion capture example, where one is learned from a shorter reference cycle than the other. The guided learning process automatically finds a physically feasible movement that fills in the missing segment of the shorter reference trajectory. Our framework also supports retargeting controllers onto characters whose morphology differs significantly from that of the motion-captured subjects. We simply re-run the pipeline on the new character, with the open-loop clip refinement step warm-started from the results built for our default character.
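The stride mirroring used for symmetric gaits can be expressed as a sagittal-plane reflection of the state and action features. A sketch under the assumption that these reflections are linear maps $D_s$ and $D_a$ (sign/permutation matrices of our own construction, not the paper's exact feature layout):

```python
import numpy as np

def mirror_policy(M, a_hat, D_s, D_a):
    """Given a linear policy a = M s + a_hat learned for one stride,
    return the policy for the mirrored stride:
        pi_mirror(s) = D_a * pi(D_s * s)
    so that M' = D_a M D_s and a_hat' = D_a a_hat."""
    return D_a @ M @ D_s, D_a @ a_hat

# Illustrative 2D features: element 0 is sagittal (unchanged by mirroring),
# element 1 is lateral (sign-flipped).
D = np.diag([1.0, -1.0])          # its own inverse: D @ D = I
M = np.array([[0.4, -0.1],
              [0.2,  0.3]])
a_hat = np.array([0.5, -0.2])

M_r, a_hat_r = mirror_policy(M, a_hat, D_s=D, D_a=D)
s = np.array([1.0, 2.0])
# Applying the mirrored policy to a mirrored state reproduces the
# mirrored action of the original policy:
assert np.allclose(M_r @ (D @ s) + a_hat_r, D @ (M @ s + a_hat))
```

Because the second stride reuses the first stride's gains in this conjugated form, only half the policy parameters need to be learned, and the resulting controller is symmetric by construction.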
Figure 8 shows several examples where we retarget the cyclic kick and the dance spin to characters with modified body segment lengths. The retargeted controllers are as robust to external perturbations as before.

Fig. 7: Simulations of learned skills under external pushes. Top: kick. Middle: backflip. Bottom: waltz.

Fig. 8: Retargeting the kick (top) and the dance spin (bottom) to characters with modified body ratios.

Fig. 9: Applications of the control graph. Top: random walk with external perturbations. Bottom: steering.

Fig. 10: Two prototype control graphs, progressively learned in the order marked in different colors. Only the major controllers are shown in the graph for clarity. The rising skills pointed to by dashed arrows are triggered automatically once the character falls. Left: locomotion and gymnastics graph. Right: Bollywood dancing graph. Action1: arm hip shake; Action2: chest pump + swag throw; Action3: pick and throw; Action4: hand circular pump.

6.2 Control Graphs

Figure 10 shows two prototype control graphs, one consisting of runs, turns, gymnastic movements, and balancing, and the other consisting of Bollywood dancing elements, get-up motions, and balancing. Only the major controllers are shown in the graphs for clarity. The two control graphs can be further composed into a larger one through the standing behavior. Learning the controllers for an entire control graph all at once can be inefficient because different controllers converge at different speeds, i.e., some controllers quickly become robust, while others may cause SAMCON to fail and restart constantly.
This disparity would result in excessive samples being used for the easy controllers and excessive restarts for the difficult ones if the entire control graph were learned all at once. To mitigate this problem, we learn the control graph progressively. Figure 10 illustrates the learning orders we use. Specifically, we start by learning controllers for a few cyclic skills, using the process described in the previous subsection. Non-cyclic skills are then gradually incorporated into the latest subgraph by re-running the whole learning pipeline. This progressive process excludes the already-learned skills from guided SAMCON by merely executing their learned policies instead of generating additional exploratory samples for them. Another scheme we employ to improve learning efficiency is to generate random walks that visit each skill with approximately equal probability; some connections between the learned skills are temporarily neglected to achieve this condition. Our experiments show that both control graphs can be learned within one night in this fashion.

We further include a few rising skills in the control graphs that are executed automatically when the character falls. These rising skills have only one-way connections to the graph, and we learn them in a separate procedure. We create learning cycles by pushing the character on the trunk in suitable directions and then invoking a ragdoll controller that tracks the starting pose of the target rising skill. When the difference between the simulated pose and this starting pose is small enough, the rising skill is activated; the character gets up and once again transitions to the beginning of the learning cycle. We currently use a simple fall detection algorithm that monitors the magnitude of the action vector computed by the feedback policies.
Once this magnitude exceeds a fixed threshold, we activate the ragdoll control followed by an appropriate rising skill.

We show several applications of the prototype control graphs in the supplemental video. The learned skills in the graph are quite robust: a random walk on the graph always succeeds when no perturbations are applied. With the help of a simple greedy planner, we can easily achieve interactive navigation in the scene. The characters can also robustly perform the desired motions in the presence of moderate external perturbations, such as pushes on the trunk and ball impacts, as shown in Figure 9. The character falls if it is disturbed too much, which automatically activates the rising controllers that return the character to performing the motions designated by the high-level planner.

Figure 11 demonstrates two simulated characters, steered by a high-level planner, that always try to run into each other. They repeat the overall behaviors of colliding, falling, and getting up. The complex contacts and interactions between the characters would be too difficult to synthesize via kinematic approaches, while our framework can easily generate these motions in real time thanks to the physics-based nature of the simulations and the robustness of the control graphs. In the video, we further show another example involving four characters, all simultaneously simulated in real time, that perform the same overall interaction behaviors.

Fig. 11: Two simulated characters try to run into each other. Both are controlled by the same control graph.

7. DISCUSSION

We have introduced a general framework that learns and organizes physics-based motion skills from example motion capture clips. Key to achieving the results are the use of control graphs composed of control fragments, the use of random walks on the control graphs for learning, and the use of guided policy search to develop linear feedback policies for the control fragments.
To the best of our knowledge, this is the first time that such a diverse range of motions, including locomotion, highly dynamic kicks and gymnastics, standing, and rising motions, can be synthesized in real time for 3D physics-based characters and controlled and integrated together in a uniform framework. We believe the proposed method can be readily applied to a variety of other skilled motions. This offers a potential solution to the difficulties in developing general control methods that still prevent the more widespread adoption of physics-based methods for character animation.

A primary finding of our work is that sequences of linear feedback policies based on a fixed set of state features and action features, as implemented by the control fragments, do well at controlling many skills, including not only basic locomotion but also rising skills and complex, highly agile movements. It is further interesting to note that these linear policies can be learned from a suitable data stream using standard linear regression methods. Two components of this success are: (a) the ability to generate high-quality open-loop motion reconstructions using SAMCON; and (b) the use of guided learning, which effectively selects samples in the vicinity of the states and actions produced by the policy and therefore encourages convergence between the developed offline solutions and the learned linear policies.

In practice, every run of the learning pipeline on a given skill usually results in different policies, which suggests that the feedback policies may have a low-rank structure that admits multiple solutions. This has been shown to be true for basic locomotion [Ding et al. 2015; Liu et al. 2012].
Learning phase-specific reduced-order policies for complex skills is an interesting topic for future work.

Our current state features and action features were selected with skills in mind such as locomotion, kicks, and dancing, all of which are skills where the character's legs are used extensively for balance. However, these features proved suitable for a wider range of skills, including those where the arms play an important role, e.g., cartwheels and rising-up motions. For motions that are dominated by the control applied to the arms, such as a hand-stand or a hand-stand walk, we expect that some new state features and action features may need to be introduced.

A sequence of control fragments, or a controller, implicitly defines a time-indexed piecewise-linear feedback policy with a time interval of approximately 0.1 s, as inherited from the original SAMCON algorithm [Liu et al. 2010]. The feedback is therefore dependent on both the current simulation state and the time index in the reference motions. This scheme mitigates many difficulties in learning the feedback policies, but makes the learned policies less flexible. As future work, we wish to develop state-based feedback policies from executions of the time-indexed policies by leveraging more complex policy representations, such as neural networks [Levine and Koltun 2013; Mordatch and Todorov 2014; Tan et al. 2014].

We also wish to develop and integrate parameterized versions of the motions and their feedback controllers. Parameterization is another form of generalization, and an appropriate learning process can likely be bootstrapped from the initial unparameterized motions. Currently, our ability to steer the character is developed in an ad hoc fashion. Parameterization with continuous optimization [Yin et al. 2008; Liu et al. 2012] and interpolation between controllers [da Silva et al. 2009; Muico et al. 2011] may help enrich the variety of learned skills.
It may also be possible to integrate the use of abstract models, such as the inverted pendulum model [Coros et al. 2010] or feature-based control [Mordatch et al. 2010], in support of generalization to larger perturbations or motion parameterization.

The efficiency of our current learning pipeline could potentially be improved in several respects, as many sample simulations are discarded without being fully exploited. For example, guided SAMCON discards all the simulation tuples except those belonging to the best path, and even the saved simulation tuples are discarded after their use in the linear regression of the current guided-learning iteration. These samples could likely be further utilized to reduce the necessary duration of the random walk and to enhance the robustness of the learned policies, as they offer extra information about the policy space. Reusing these samples with importance weights [Hachiya et al. 2009] offers one possible path toward a more efficient learning process.

After the initialization procedures, the current framework is largely automated, with uniform parameter settings used to develop most of the motions. However, manually designing the reference motion graph is still necessary at the beginning of the pipeline. Developing good open-loop control clips for difficult skills, or from poor-quality reference motions, remains the part of the learning pipeline that still requires some manual intervention. For future work, we would like to create a fully automated pipeline.

ACKNOWLEDGMENTS

This project is partially supported by NSERC Discovery Grants Program RGPIN-2015-04843 and Singapore Ministry of Education Academic Research Fund, Tier 2 (MOE2011-T2-2-152).

REFERENCES

AL BORNO, M., DE LASA, M., AND HERTZMANN, A. 2013. Trajectory optimization for full-body movements with complex contacts. TVCG 19, 8, 1405–1414.
AL BORNO, M., FIUME, E., HERTZMANN, A., AND DE LASA, M. 2014. Feedback control for rotational movements in feature space.
Computer Graphics Forum 33, 2.
COROS, S., BEAUDOIN, P., AND VAN DE PANNE, M. 2009. Robust task-based control policies for physics-based characters. ACM Trans. Graph. 28, 5 (Dec.), 170:1–170:9.
COROS, S., BEAUDOIN, P., AND VAN DE PANNE, M. 2010. Generalized biped walking control. ACM Trans. Graph. 29, 4 (July), 130:1–130:9.
DA SILVA, M., ABE, Y., AND POPOVIĆ, J. 2008. Simulation of human motion data using short-horizon model-predictive control. In Computer Graphics Forum. Vol. 27. Wiley Online Library, 371–380.
DA SILVA, M., DURAND, F., AND POPOVIĆ, J. 2009. Linear Bellman combination for control of character animation. ACM Trans. Graph. 28, 3 (July), 82:1–82:10.
DE LASA, M., MORDATCH, I., AND HERTZMANN, A. 2010. Feature-based locomotion controllers. ACM Trans. Graph. 29, 4 (July), 131:1–131:10.
DING, K., LIU, L., VAN DE PANNE, M., AND YIN, K. 2015. Learning reduced-order feedback policies for motion skills. In Proceedings of the 14th ACM SIGGRAPH / Eurographics Symposium on Computer Animation. SCA '15. ACM, New York, NY, USA, 83–92.
DOUCET, A. AND JOHANSEN, A. M. 2011. A tutorial on particle filtering and smoothing: Fifteen years later. In Handbook of Nonlinear Filtering. Oxford University Press, Oxford, UK.
FALOUTSOS, P., VAN DE PANNE, M., AND TERZOPOULOS, D. 2001. Composable controllers for physics-based character animation. In Proceedings of SIGGRAPH 2001. 251–260.
GEIJTENBEEK, T. AND PRONOST, N. 2012. Interactive character animation using simulated physics: A state-of-the-art review. In Computer Graphics Forum. Vol. 31. Wiley Online Library, 2492–2515.
GEIJTENBEEK, T., VAN DE PANNE, M., AND VAN DER STAPPEN, A. F. 2013. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics (TOG) 32, 6, 206.
HA, S., YE, Y., AND LIU, C. K. 2012. Falling and landing motion control for character animation. ACM Trans. Graph. 31, 6 (Nov.), 155:1–155:9.
HACHIYA, H., PETERS, J., AND SUGIYAMA, M. 2009. Efficient sample reuse in EM-based policy search.
In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science, vol. 5781. Springer Berlin Heidelberg, 469–484.
HODGINS, J. K., WOOTEN, W. L., BROGAN, D. C., AND O'BRIEN, J. F. 1995. Animating human athletics. In Proceedings of SIGGRAPH. ACM, New York, NY, USA, 71–78.
KOVAR, L., GLEICHER, M., AND PIGHIN, F. 2002. Motion graphs. In SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. ACM, New York, NY, USA, 473–482.
KWON, T. AND HODGINS, J. 2010. Control systems for human running using an inverted pendulum model and a reference motion capture sequence. In SCA. Eurographics Association, Aire-la-Ville, Switzerland, 129–138.
LEE, J. AND LEE, K. H. 2006. Precomputing avatar behavior from human motion data. Graphical Models 68, 2, 158–174.
LEE, Y., KIM, S., AND LEE, J. 2010. Data-driven biped control. ACM Trans. Graph. 29, 4 (July), 129:1–129:8.
LEE, Y., WAMPLER, K., BERNSTEIN, G., POPOVIĆ, J., AND POPOVIĆ, Z. 2010. Motion fields for interactive character locomotion. ACM Trans. Graph. 29, 6 (Dec.), 138:1–138:8.
LEVINE, S. AND KOLTUN, V. 2013. Guided policy search. In ICML '13: Proceedings of the 30th International Conference on Machine Learning.
LEVINE, S. AND KOLTUN, V. 2014. Learning complex neural network policies with trajectory optimization. In ICML '14: Proceedings of the 31st International Conference on Machine Learning.
LIU, L., YIN, K., AND GUO, B. 2015. Improving sampling-based motion control. Computer Graphics Forum 34, 2.
LIU, L., YIN, K., VAN DE PANNE, M., AND GUO, B. 2012. Terrain runner: control, parameterization, composition, and planning for highly dynamic motions. ACM Trans. Graph. 31, 6, Article 154.
LIU, L., YIN, K., VAN DE PANNE, M., SHAO, T., AND XU, W. 2010. Sampling-based contact-rich motion control. ACM Trans. Graph. 29, 4, Article 128.
LIU, L., YIN, K., WANG, B., AND GUO, B. 2013. Simulation and control of skeleton-driven soft body characters. ACM Trans. Graph.
32, 6, Article215.MACCHIETTO, A., ZORDAN, V., AND SHELTON, C. R. 2009. Momentumcontrol for balance. ACM Trans. Graph. 28, 3.MORDATCH, I., DE LASA, M., AND HERTZMANN, A. 2010. Robustphysics-based locomotion using low-dimensional planning. ACM Trans.Graph. 29, 4 (July), 71:1–71:8.MORDATCH, I. AND TODOROV, E. 2014. Combining the benefits offunction approximation and trajectory optimization. In Proceedings ofRobotics: Science and Systems. Berkeley, USA.MORDATCH, I., TODOROV, E., AND POPOVIC´, Z. 2012. Discovery ofcomplex behaviors through contact-invariant optimization. ACM Trans.Graph. 31, 4 (July), 43:1–43:8.MUICO, U., LEE, Y., POPOVIC´, J., AND POPOVIC´, Z. 2009. Contact-aware nonlinear control of dynamic characters. ACM Trans. Graph. 28, 3.MUICO, U., POPOVIC´, J., AND POPOVIC´, Z. 2011. Composite control ofphysically simulated characters. ACM Trans. Graph. 30, 3 (May), 16:1–16:11.PENG, X. B., BERSETH, G., AND VAN DE PANNE, M. 2015. Dynamicterrain traversal skills using reinforcement learning. ACM Transactionson Graphics (to appear).PETERS, J. AND SCHAAL, S. 2007. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the24th International Conference on Machine Learning. ICML ’07. ACM,New York, NY, USA, 745–750.PETERS, J. AND SCHAAL, S. 2008. Reinforcement learning of motor skillswith policy gradients. NEURAL NETWORKS 21, 4 (MAY), 682–697.POPOVIC´, Z. AND WITKIN, A. 1999. Physically based motion transforma-tion. In Proceedings of the 26th annual conference on Computer graphicsand interactive techniques. ACM Press/Addison-Wesley Publishing Co.,11–20.RAIBERT, M. H. AND HODGINS, J. K. 1991. Animation of dynamiclegged locomotion. In ACM SIGGRAPH Computer Graphics. Vol. 25.ACM, 349–358.ROSS, S., GORDON, G., AND BAGNELL, J. A. D. 2011. 
A reduction ofimitation learning and structured prediction to no-regret online learning.In Proceedings of the 14th International Conference on Artifical Intelli-gence and Statistics (AISTATS).SOK, K. W., KIM, M., AND LEE, J. 2007. Simulating biped behaviorsfrom human motion data. ACM Trans. Graph. 26, 3, Article 107.SULEJMANPASˇIC´, A. AND POPOVIC´, J. 2005. Adaptation of performedballistic motion. ACM Transactions on Graphics (TOG) 24, 1, 165–179.TAN, J., GU, Y., LIU, C. K., AND TURK, G. 2014. Learning bicycle stunts.ACM Trans. Graph. 33, 4 (July), 50:1–50:12.TAN, J., LIU, C. K., AND TURK, G. 2011. Stable proportional-derivativecontrollers. IEEE Comput. Graph. Appl. 31, 4, 34–44.TASSA, Y., EREZ, T., AND TODOROV, E. 2012. Synthesis and stabilizationof complex behaviors through online trajectory optimization. In Intelli-gent Robots and Systems (IROS), 2012 IEEE/RSJ International Confer-ence on. IEEE, 4906–4913.TREUILLE, A., LEE, Y., AND POPOVIC´, Z. 2007. Near-optimal characteranimation with continuous control. ACM Trans. Graph. 26, 3 (July).WAMPLER, K. AND POPOVIC´, Z. 2009. Optimal gait and form for animallocomotion. ACM Trans. Graph. 28, 3, Article 60.WANG, J. M., FLEET, D. J., AND HERTZMANN, A. 2009. Optimizingwalking controllers. ACM Trans. Graph. 28, 5, Article 168.UBC Computer Science, Technical Report • 13WANG, J. M., FLEET, D. J., AND HERTZMANN, A. 2010. Optimizingwalking controllers for uncertain inputs and environments. ACM Trans.Graph. 29, 4 (July), 73:1–73:8.WANG, J. M., HAMNER, S. R., DELP, S. L., AND KOLTUN, V. 2012. Op-timizing locomotion controllers using biologically-based actuators andobjectives. ACM Trans. Graph. 31, 4, 25.YE, Y. AND LIU, C. K. 2010. Optimal feedback control for characteranimation using an abstract model. ACM Trans. Graph. 29, 4 (July),74:1–74:9.YIN, K., COROS, S., BEAUDOIN, P., AND VAN DE PANNE, M. 2008.Continuation methods for adapting simulated skills. ACM Trans.Graph. 27, 3, Article 81.YIN, K., LOKEN, K., AND VAN DE PANNE, M. 
2007. SIMBICON: Simplebiped locomotion control. ACM Trans. Graph. 26, 3, Article 105.ZORDAN, V., BROWN, D., MACCHIETTO, A., AND YIN, K. 2014. Controlof rotational dynamics for ground and aerial behavior. Visualization andComputer Graphics, IEEE Transactions on 20, 10 (Oct), 1356–1366.ZORDAN, V. B., MAJKOWSKA, A., CHIU, B., AND FAST, M. 2005. Dy-namic response for motion capture animation. ACM Trans. Graph., 697–701.