Developing Locomotion Skills with Deep Reinforcement Learning

by

Xue Bin Peng

B.Sc., The University of British Columbia, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

April 2017

© Xue Bin Peng, 2017

Abstract

While physics-based models for passive phenomena such as cloth and fluids have been widely adopted in computer animation, physics-based character simulation remains a challenging problem. One of the major hurdles for character simulation is that of control, the modeling of a character's behaviour in response to its goals and environment. This challenge is further compounded by the high-dimensional and complex dynamics that often arise from these systems. A popular approach to mitigating these challenges is to build reduced models that capture important properties for a particular task. These models often leverage significant human insight, and may nonetheless overlook important information. In this thesis, we explore the application of deep reinforcement learning (DeepRL) to develop control policies that operate directly using high-dimensional low-level representations, thereby reducing the need for manual feature engineering and enabling characters to perform more challenging tasks in complex environments.

We start by presenting a DeepRL framework for developing policies that allow characters to agilely traverse across irregular terrain. The policies are represented using a mixture-of-experts model, which selects from a small collection of parameterized controllers. Our method is demonstrated on planar characters of varying morphologies and different classes of terrain. Through the learning process, the networks develop the appropriate strategies for traveling across various irregular environments without requiring extensive feature engineering. Next, we explore the effects of different action parameterizations on the performance of RL policies. We compare policies trained using low-level actions, such as torques, target velocities, target angles, and muscle activations. Performance is evaluated using a motion imitation benchmark. For our particular task, the choice of higher-level actions that incorporate local feedback, such as target angles, leads to significant improvements in performance and learning speed. Finally, we describe a hierarchical reinforcement learning framework for controlling the motion of a simulated 3D biped. By training each level of the hierarchy to operate at different spatial and temporal scales, the character is able to perform a variety of locomotion tasks that require a balance between short-term and long-term planning. Some of the tasks include soccer dribbling, path following, and navigation across dynamic obstacles.

Lay Abstract

Humans and other animals are able to agilely move through and interact with their environment using a rich repertoire of motor skills. Modeling these skills has been a long-standing challenge with far-reaching implications for fields such as computer animation and robotics, among many others. Reinforcement learning has emerged as a promising paradigm for developing these skills, where an agent learns through trial-and-error in order to discover the appropriate behaviours for accomplishing a task. The goal of this work is to leverage deep reinforcement learning techniques to develop locomotion skills that enable simulated agents to move agilely through their surroundings in a task-driven manner.

Preface

Chapter 4 was published as Xue Bin Peng, Glen Berseth, and Michiel van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2016), 35(4), 2016. Chapter 5 is currently under review, and Chapter 6 will be published as Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2017), 36(4), 2017. My supervisor, Michiel van de Panne, was instrumental in the paper-writing and in providing the vision and direction for the projects. My co-authors Glen and KangKang also made significant contributions to the paper-writing. This work would not have been possible if not for their efforts and our countless hours of discussions. I was responsible for designing and implementing the learning frameworks, tuning the algorithms, performing experiments, and writing the papers. Glen also contributed to the coding and collection of results.

Table of Contents

Abstract
Lay Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Thesis Overview
  1.3 Terrain-Adaptive Locomotion
  1.4 Action Parameterization
  1.5 Hierarchical Locomotion Skills
2 Related Work
  2.1 Physics-based Character Animation
  2.2 Reinforcement Learning for Motion Control
  2.3 Deep Reinforcement Learning
    2.3.1 Direct Policy Approximation
    2.3.2 Deep Q-Learning
    2.3.3 Policy Gradient Methods
3 Background
  3.1 Value Functions
  3.2 Q-Learning
  3.3 Policy Gradient Methods
  3.4 CACLA
4 Terrain-Adaptive Locomotion
  4.1 Introduction
  4.2 Related Work
  4.3 Overview
  4.4 Characters and Terrains
    4.4.1 Controllers
    4.4.2 Terrain classes
  4.5 Policy Representation
    4.5.1 State
    4.5.2 Actions
    4.5.3 Reward
    4.5.4 Policy Representation
  4.6 Learning
  4.7 Results
  4.8 Discussion
5 Action Parameterizations
  5.1 Introduction
  5.2 Related Work
  5.3 Task Representation
    5.3.1 Reference Motion
    5.3.2 States
    5.3.3 Actions
    5.3.4 Reward
    5.3.5 Initial State Distribution
  5.4 Learning Framework
  5.5 Results
  5.6 Discussion
6 Hierarchical Locomotion Skills
  6.1 Introduction
  6.2 Related Work
  6.3 Overview
  6.4 Policy Representation and Learning
  6.5 Low-Level Controller
    6.5.1 Reference Motion
    6.5.2 LLC Reward
    6.5.3 Bilinear Phase Transform
    6.5.4 LLC Network
    6.5.5 LLC Training
    6.5.6 Style Modification
  6.6 High-level Controller
    6.6.1 HLC Training
    6.6.2 HLC Network
    6.6.3 HLC Tasks
  6.7 Results
    6.7.1 LLC Performance
    6.7.2 HLC Performance
    6.7.3 Transfer Learning
  6.8 Discussion
7 Conclusion
  7.1 Discussion
  7.2 Conclusion
  7.3 Future Work
Bibliography

List of Tables

Table 4.1 Performance of the final policies.
Table 4.2 Performance of applying policies to unfamiliar terrains.
Table 5.1 Actuation models and their respective actuator parameters.
Table 5.2 Training hyperparameters.
Table 5.3 The number of state, action, and actuation parameters.
Table 5.4 Performance for different characters and actuation models.
Table 6.1 LLC robustness tests.
Table 6.2 HLC performance.
Table 6.3 Performance of different combinations of LLCs and HLCs.

List of Figures

Figure 4.1 Terrain traversal using a learned actor-critic ensemble. The color-coding of the center-of-mass trajectory indicates the choice of actor used for each leap.
Figure 4.2 System overview.
Figure 4.3 Character models.
Figure 4.4 Dog controller motion phases.
Figure 4.5 State features.
Figure 4.6 Experience tuples.
Figure 4.7 Policy network.
Figure 4.8 Comparisons of learning performance.
Figure 4.9 Action space evolution for using MACE(3) with initial actor bias.
Figure 4.10 Action space evolution for using MACE(3) without initial actor bias.
Figure 4.11 Policy generalization to easier and more-difficult terrains.
Figure 4.12 Raptor and goat control policies.
Figure 4.13 Dog control policies.
Figure 5.1 Neural network control policies trained for various simulated planar characters.
Figure 5.2 Initial state distribution.
Figure 5.3 Character models.
Figure 5.4 Policy network.
Figure 5.5 Learning curves.
Figure 5.6 Simulated motions.
Figure 5.7 Performance during actuator optimization.
Figure 5.8 Optimized MTU performance.
Figure 5.9 Action and torque profiles.
Figure 5.10 Performance from different network initializations.
Figure 5.11 Performance using different amounts of exploration noise.
Figure 5.12 Performance from different network architectures.
Figure 5.13 Robustness tests.
Figure 5.14 Robustness to terrain variation.
Figure 5.15 Performance using different query rates.
Figure 6.1 Locomotion skills learned using hierarchical reinforcement learning. (a) Following a varying-width winding path. (b) Dribbling a soccer ball. (c) Navigating through obstacles.
Figure 6.2 System overview.
Figure 6.3 State features.
Figure 6.4 Footstep plan.
Figure 6.5 LLC network.
Figure 6.6 Height maps.
Figure 6.7 HLC network.
Figure 6.8 Learning curves.
Figure 6.9 Learning curves for each stylized LLC.
Figure 6.10 Action feedback curves.
Figure 6.11 HLC learning curves.
Figure 6.12 Learning curves with and without control hierarchy.
Figure 6.13 HLC tasks.
Figure 6.14 Transfer performance.
Figure 6.15 Transfer learning curves.
Figure 6.16 LLC walk cycles.

Glossary

S   state space
s   state
s′  next state
A   action space
a   action
r   reward
γ   discount factor
π   policy
V   state-value function
Q   action-value function
A   advantage function

Acknowledgments

I would like to thank my supervisor, Dr. Michiel van de Panne, for his mentoring and support throughout my time at UBC. It has been a pleasure working with you, and I cannot thank you enough for all that you have taught me over these past years. I will always recall fondly our many conversations and the atmosphere that you have fostered in the group. Your vision and poise have been, and will continue to be, a great source of inspiration for my aspirations both in work and life.

I would like to extend my thanks to Dr. Dinesh Pai for being part of my examining committee and for his feedback on this thesis. Also, a great debt of gratitude is owed to my co-authors Glen Berseth and Dr. KangKang Yin. All of this work would not be what it is had it not been for your efforts. Furthermore, I want to thank my friends and colleagues; it has been a pleasure to meet and learn alongside all of you.

Finally, and most importantly, I want to thank my family for their support throughout my life, and for the opportunities that they have provided me, at times at the cost of their own. Though it can often be taken for granted and left overlooked, your perseverance and tireless efforts have been an enduring source of inspiration. At the risk of being banal, none of this would have been possible without you.

Chapter 1

Introduction

Be it the leisurely stride of a walk or the acrobatic stunts of parkour, humans and other animals are able to agilely move through and interact with their environments using a rich repertoire of motor skills. Modeling these skills has been a long-standing challenge with far-reaching applications for fields ranging from biomechanics and robotics to computer animation, and many more. While significant efforts have been devoted to building models of motions such as walking and running, the generalization of such models to a more diverse array of skills can be limited. The manual engineering of models is further complicated as the domain shifts towards more dynamic motions, where human insight becomes increasingly scarce.

In this work, we adopt a learning-based approach that aims to develop skillful agents while reducing the dependency on human domain knowledge. At the core of our work is the reinforcement learning (RL) paradigm, whereby an agent develops skillful behaviours through trial-and-error. While our work is presented in the context of physics-based character simulation for computer animation, we believe that the underlying insights gained may be of interest to other domains where motion control is of prominent interest.

1.1 Motivation

In computer animation, developing physics-based models of character motion offers a wealth of potential applications. The most immediate may be in alleviating the tedious manual efforts currently required by artists to author motions for characters.
By simulating the physics and low-level details ofa motion, artistic effort can be better directed towards the high-level and narrative-relevant content ofa production. In the space of interactive media, such as games, character simulation offers the possi-bility of behaviour synthesis, in which appropriate reactions are generated for characters in response todifferent scenarios, where exhaustive manual scripting becomes infeasible. With the advent of virtualreality, the interactions a user can have with virtual agents are richer than would be possible through1more tradition interfaces, such as keyboards and game controllers. As the possible forms of interactionsgrow, the capacity of developers to generate or collect natural responsive behaviours for the variousforms of interaction becomes increasingly strained. Therefore the role of simulation will likely becomemore prominent as new technologies emerge that enable ever more immersive virtual experiences. Re-inforcement learning may also offer the potential for adaptive agents that can develop cooperative oradversarial strategies in accordance to a user’s behaviour.For robotics, learning models for legged-locomotion skills will prove valuable as machines becomemore prevalent in urban environments, commonly designed for human traversal. As agent are presentedwith unstructured environments that can be difficult to parameterize a priori, building systems that canprocess rich high-dimensional sensory information, while also adapting its behaviours to unfamiliar sit-uations, will be vital. Better models of human and animal motions are also of interest for biomechanicsand physiotherapy, with applications such as injury prevention and rehabilitation. Such model can alsoassist in the design and personalization of prosthetics that can help patients better recover their naturalrange of motion. Models of human motion will also benefit the development of exoskeletons, wheresystems need to anticipate and assist users in performing their intended tasks.1.2 Thesis OverviewThe work presented in this thesis explores the use of reinforcement learning (RL) to develop controlpolicies for motion control tasks, with an emphasis on locomotion. We begin with an overview ofrelated work in character animation and reinforcement learning (Chapter 2). Discussions of relevantwork are also available within each chapter. Chapter 3 provides a review of fundamental conceptsand algorithms in reinforcement learning, which form the foundations for the methods explored in thefollowing chapters. The framework presented for Terrain-Adaptive Locomotion (Chapter 4) developscontrol policies that enable simulated 2D character to traverse across irregular obstacle-filled terrains.The use of a deep convolutional neural network to represent the policy provides a means of processinghigh-dimensional low-level descriptions of the environment, giving rise to policies that can adapt to avariety of obstacles without requiring additional manual feature engineering. Next, the work on ActionParameterizations (Chapter 5) explores the effects of different choices of action parameterizations onthe performance of RL policies. We show that simple low-level action abstractions can offer significantimprovements to performance and learning speed. Finally, in Hierarchical Locomotion Skills (Chapter6), we present a hierarchical RL framework that enables a 3D simulated biped to perform a varietyof locomotion tasks, such as soccer dribbling, path following, and obstacle avoidance. 
The hierarchicalcontrol policies eliminate much of the hand-engineered control structures (e.g. finite-state machines andfeedback strategies) often leveraged by previous systems, and replaces them with learned neural networkcontrollers that operate at different spatial and temporal abstractions. This hierarchical decompositionsallows the policies to fulfill high-level task objectives while also satisfying low-level goals. We concludewith a discussion on the limitations of our methods and suggest possible directions for future work.21.3 Terrain-Adaptive LocomotionIn this work, we consider the task of traversing across irregular terrain with randomly generated se-quences of obstacles (Chapter 4). The goal of a policy is to select the appropriate actions for a charactersuch that it is able to continue moving through the environment while maintaining a desired forwardvelocity. The environment consists of randomly varying irregular slopes interrupted by step, gap, andwall obstacles. One of the challenges of this task lies in representing the shape of the terrain. Due tothe irregularity of the environment, it can be challenging to manually design a compact set of featuresthat can sufficiently parameterize the terrain variations encountered by the characters. With deep re-inforcement learning, we are able to directly provide a neural network policy with high-dimensionalheight-field representations of the terrain, and the network in turn learns to extract relevant features fortraversing across the obstacles.Our method is characterized by high-dimensional low-level state representations, parameterizedfinite-state machines (FSM) that provide the policies with high-level action abstractions, and a mixtureof actor-critic experts (MACE) policy model. The MACE model is composed of a collection of subpoli-cies, referred to as actors, and their corresponding set of value functions, referred to as critics. At thestart of a cycle of the FSM, each actor proposes a different action in response to the state of the characterand the up-comping terrain. The critics then predict the expected performance of each actor, and theaction from the actor with the highest predicted performance is selected as the action for the cycle. Incomparison to the more common choice of a unimodal Gaussian distribution, the mixture model allowsfor richer multi-modal action distributions.Our framework is applied to train policies for planar dog, raptor, and goat characters. The policiesenable the characters to agilely traverse across different classes of terrain without requiring terrain-specific or character-specific feature engineering. We evaluate the impact of different design decisionsand show that the mixture model leads to significant improvements in performance compared to moreconventional RL models.1.4 Action ParameterizationWhile the previous work leveraged hand-crafted FSMs to provide high-level action abstractions, in thiswork we look to further reduce the domain knowledge involved in crafting such action parameterizations(Chapter 5). Instead of high-level action abstractions, we explore a number of low-level action represen-tations, such as torques and target angles, and evaluate the impact of different choices of action spaceson policy performance. In alignment with recent trends in machine learning of moving towards high-dimensional low-level feature representations, DeepRL for motion control have often adopted torquesas the action of choice. 
These policies are trained to directly map state observations to torques for each of the character's joints. However, the motion quality from these systems often falls short of what has been previously achieved with hand-crafted action abstractions, such as FSMs. Directly using torques as the action parameterization also overlooks the synergy between control and biomechanics exhibited in nature. Passive dynamics, from muscles and tendons, play an important role in providing low-level feedback that shapes the motions produced by humans and other animals.

In this work, we compare the use of torques, target velocities, target angles, and muscle activations as action parameterizations for a policy. Our benchmark consists of a motion-imitation task for various simulated 2D characters. Policies are trained to control each character to imitate a given reference motion, specified by a sequence of kinematic keyframes. Policies using different action parameterizations are then evaluated according to a number of metrics, such as final performance and learning speed. Our results suggest that the choice of action parameterization can have a significant impact on performance. For our task, high-level action abstractions that incorporate low-level feedback, such as target angles, can lead to faster learning and better overall performance. The differences between the various actions grow as the morphology of the character becomes more complex, with the performance of low-level actions, such as torques, deteriorating noticeably as character complexity increases. Additionally, we propose a method that interleaves policy learning and actuator optimization. This method enables neural networks to directly control complex muscle models through muscle activations, without requiring additional control abstractions such as Jacobian-transpose control or other inverse-dynamics methods.

1.5 Hierarchical Locomotion Skills

Finally, we present the use of hierarchical deep reinforcement learning to develop locomotion skills for a 3D biped (Chapter 6). The method developed for Terrain-Adaptive Locomotion (Chapter 4) trains policies that leverage hand-crafted FSMs to provide high-level action abstractions, enabling the policies to operate at the timescale of running steps (e.g. 2 Hz). The work in Action Parameterization (Chapter 5) developed control policies that utilize low-level action parameterizations, such as torques and target angles specified for every joint, resulting in policies that operate over finer timescales (e.g. 60 Hz). In this chapter, we look to develop hierarchical policies, where each level of the hierarchy operates at different spatial and temporal scales, thereby allowing the policies to simultaneously address high-level task objectives while also satisfying low-level goals.

Our control hierarchy consists of a high-level controller and a low-level controller, operating at 2 Hz and 30 Hz respectively. At the start of each walking step, the high-level controller specifies footstep goals for the low-level controller for the upcoming step. The low-level controller then specifies target angles for PD controllers positioned at each of the character's joints in order to realize the footstep goals.
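To make the division of labour between these two timescales concrete, the following is a minimal sketch, not the system described in Chapter 6, of how such a hierarchy might be stepped: a high-level controller queried once per walking step, a low-level controller queried at 30 Hz to produce PD target angles, and proportional-derivative feedback converting those targets into joint torques at the simulation rate. The `hlc`, `llc`, and `sim` interfaces, the rates, and the gains are illustrative placeholders rather than the actual system.

```python
# Illustrative rates: an HLC queried once per walking step (~2 Hz),
# an LLC queried at 30 Hz, and a faster physics integration rate (assumed here).
SIM_DT = 1.0 / 600.0      # physics timestep (placeholder)
SIM_STEPS_PER_LLC = 20    # 600 Hz / 30 Hz
KP, KD = 300.0, 30.0      # PD gains (placeholder values)

def pd_torque(q, q_dot, q_target):
    """Low-level feedback: torque = kp * (target - angle) - kd * velocity."""
    return KP * (q_target - q) - KD * q_dot

def run_episode(hlc, llc, sim, num_walking_steps):
    for _ in range(num_walking_steps):
        # High-level controller: one footstep goal per walking step.
        goal = hlc.select_footstep_goal(sim.high_level_state())
        while not sim.walking_step_completed():
            # Low-level controller: PD target angles at 30 Hz,
            # conditioned on the current footstep goal.
            q_targets = llc.select_target_angles(sim.low_level_state(), goal)
            for _ in range(SIM_STEPS_PER_LLC):
                q, q_dot = sim.joint_state()
                sim.apply_torques([pd_torque(qi, qdi, qti)
                                   for qi, qdi, qti in zip(q, q_dot, q_targets)])
                sim.advance(SIM_DT)
```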
We demonstrate the capabilities of this hierarchical control structure on different locomotiontasks such as soccer dribbling, path following, and navigation through an environment consisting ofstatic or dynamic obstacles.We show that the hierarchical decomposition allows the character to learn tasks that would otherwise4be challenging for a single-timescale policy. The resulting policies are also robust to significant exter-nal perturbations, rivaling controllers build using hand-crafted feedback strategies. Furthermore, wedemonstrate transfer learning between different combinations of high-level and low-level controllers forthe various tasks, yielding significant reductions in training time compared to retraining from randominitialization.5Chapter 2Related WorkModeling movement skills, locomotion in particular, has a long history in computer animation, robotics,and biomechanics. It has also recently seen significant interest from the machine learning community asan interesting and challenging domain for reinforcement learning. In this chapter we focus on the mostclosely related work in computer animation and reinforcement learning.2.1 Physics-based Character AnimationSignificant progress has been made in recent years in developing methods to create motion from firstprinciples, i.e., control and physics, as a means of character animation. A recent survey on physics-basedcharacter animation and control techniques [Geijtenbeek and Pronost, 2012] provides a comprehensiveoverview of work in this area. These methods can be coarsely categorized as model-free and model-based. Model-free methods assume no access to the equations of motion and rely on domain knowledgeto develop simplified models that can be used in the design of controllers. An early and enduringapproach to controller design has been to structure control policies around finite state machines (FSMs)and feedback rules that use a simplified abstract model or feedback law. These general ideas havebeen applied to human athletics, running, and a rich variety of walking styles [Hodgins et al., 1995,Laszlo et al., 1996, Yin et al., 2007, Sok et al., 2007, Coros et al., 2010, Lee et al., 2010a]. Model-based methods that assumes access to the equations of motion also provide highly effective solutions. Adynamics model is often utilized by inverse-dynamics based methods, with the short and long-term goalsbeing encoded into the objectives of a quadratic program that is solved at every time-step, e.g., [da Silvaet al., 2008, Muico et al., 2009, Mordatch et al., 2010, Ye and Liu, 2010]. Many model-free and model-based methods use some form of model-free policy search, wherein a controller is first designed andthen has a number of its free parameters optimized using episodic evaluations. Policy search methods,e.g., stochastic local search or CMA [Hansen, 2006], can be used to optimize the parameters of thegiven control structures to achieve a richer variety of motions e.g., [Yin et al., 2008, Wang et al., 2009,Coros et al., 2011b, Liu et al., 2012, Tan et al., 2014b], and efficient muscle-driven locomotion [Wang6et al., 2009, Lee et al., 2014]. 
Policy search has also been successfully applied directly to time-indexedsplines and neural networks in order to learn a variety of bicycle stunts [Tan et al., 2014b].An alternative class of approach is given by trajectory optimization methods, which can computesolutions offline, e.g., [Al Borno et al., 2013], and then later adapted for online model-predictive control[Tassa et al., 2012, Ha¨ma¨la¨inen et al., 2015]. Alternative, optimized actions can be computed for thecurrent time-step using quadratic programming, e.g., [Macchietto et al., 2009, de Lasa et al., 2010]. Tofurther improve motion quality and enrich the motion repertoire, data-driven models incorporate mo-tion capture examples in constructing controllers, most often using a learned or model-based trajectorytracking method [Sok et al., 2007, da Silva et al., 2008, Muico et al., 2009, Liu et al., 2012, 2016b].2.2 Reinforcement Learning for Motion ControlReinforcement learning as guided by state-value or action-value functions have been used for synthesisof kinematic motions. In particular, this type of RL has been highly successful for making decisionsabout which motion clip to play next in order to achieve a given objective [Lee and Lee, 2006, Treuilleet al., 2007, Lee et al., 2009, 2010b, Levine et al., 2012]. Work that applies RL with nonparametricfunction approximators to the difficult problem of controlling the movement of physics-based charac-ters has been more limited [Coros et al., 2009, Peng et al., 2015]. Due to the curse of dimensionality,nonparametric methods have often relied on the manual selection of a compact set of important featuresthat succintly describes the state of the character and its environment. The use of parametric functionapproximators, most notably neural networks, have demonstrated impressive capabilities for processinghigh-dimensional low-level input features. This advantage has spurred significant efforts towards tack-ling reinforcement learning problems for physics-based characters with high-dimensional continuousstate and action spaces. In the following section, we review some of the approaches used to developdeep neural network control policies.2.3 Deep Reinforcement LearningThe recent successes of deep learning have also seen a resurgence of its use for learning control policies.The use of deep neural networks as function approximators for RL is generally referred to as deepreinforcement learning (DeepRL). These methods can be broadly characterized into three categories,direct policy approximation, deep Q-learning, and policy gradient methods, which we now elaborateon.2.3.1 Direct Policy ApproximationDeep neural networks (DNNs) can be used to directly approximate a control policy from example datapoints generated by an oracle, represented by some other control process. For example, trajectory op-7timization can be used to compute families of optimal control solutions from various initial states, andeach trajectory then yields a large number of data points suitable for supervised learning. A naive ap-plication of these ideas will often fail because when the approximated policy is deployed, the systemstate will nevertheless easily drift into regions of state space for which no data has been collected. Thisproblem is rooted in the fact that a control policy and the state-distribution that it encounters are tightlycoupled. Recent methods have made progress on this issue by proposing iterative techniques. 
Optimaltrajectories are generated in close proximity to the trajectories resulting from the current policy; thecurrent policy is then updated with the new data, and the iteration repeats. These guided policy searchmethods have been applied to produce motions for robust bipedal walking and swimming gaits for pla-nar models [Levine and Koltun, 2014, Levine and Abbeel, 2014] and a growing number of challengingrobotics applications. Relatedly, methods that leverage contact-invariant trajectory optimization havealso demonstrated many capabilities, including planar swimmers, planar walkers, and 3D arm reach-ing [Mordatch and Todorov, 2014], and, more recently, simulated 3D swimming and flying, as well as3D bipeds and quadrupeds capable of skilled interactive stepping behaviors [Mordatch et al., 2015b].In this most recent work, the simulation is carried out as a state optimization, which allows for largetimesteps, albeit with the minor caveat that the final motion comes from an optimization step rather thandirectly from a forward dynamics simulation2.3.2 Deep Q-LearningFor decision problems with discrete action spaces, a DNN can be used to approximate a set of valuefunctions, one for each action, thereby learning a complete state-action value (Q) function. One ofthe distinguishing characteristics of Q-learnings is that a policy can be implicitly represented by theQ-function. When a policy is queried for a particular state, an action can be chosen by first enumeratingover all possible actions using the Q-function, and then selecting the action with the maximum pre-dicted value. A notable achievement of this approach has been the ability to learn to play a large suiteof Atari games at a human-level of skill, using only raw screen images and the score as input [Mnihet al., 2015]. Many additional improvements have since been proposed, including prioritized experiencereplay [Schaul et al., 2015], double Q-learning [Van Hasselt et al., 2015], better exploration strate-gies [Stadie et al., 2015], and accelerated learning using distributed computation [Nair et al., 2015].However, it is not obvious how to directly extend these methods to control problems with continuousaction spaces. Since action selection involves optimizing the Q-function over all possible actions, thisprocess quickly becomes intractable for continuous action spaces. Methods have been proposed thatimpose particular structures onto the Q-function, which allows the optimization over continuous actionspaces to be solved efficiently [Gu et al., 2016b].82.3.3 Policy Gradient MethodsIn the case of continuous actions learned in the absence of an oracle, DNNs can be used to explicitlymodel a policy pi(s). Policy gradient methods is a popular class of techniques for training a parameter-ized policy [Sutton et al., 2001]. These methods perform gradient ascent on the overall task objectiveusing Monte-Carlo estimates of the objective gradient with respect to the policy parameters. The RE-INFORCE algorithm is a classic example of a policy gradient method for stochastic policies [Williams,1992]. More recently, deterministic policy gradient methods have been proposed where a Q-function,Q(s,a), is learned alongside the policy, pi(s). This leads to a single composite network, Q(s,pi(s)),that allows for the back-propagation of value-gradients back through to the control policy, and there-fore provides a mechanism for policy improvement. 
Since the original method for using deterministic policy gradients [Silver et al., 2014b], several variations have been proposed with a growing portfolio of demonstrated capabilities. This includes a method for stochastic policies that can span a range of model-free and model-based methods [Heess et al., 2015], as demonstrated on examples that include a monoped, a planar biped walker, and an 8-link planar cheetah model. Recently, further improvements have been proposed to allow end-to-end learning of image-to-control-torque policies for a wide selection of physical systems [Lillicrap et al., 2015b]. Another recent work proposes the use of policy-gradient DNN-RL with parameterized controllers for simulated robot soccer [Hausknecht and Stone, 2015a]. Recent work using generalized advantage estimation together with TRPO [Schulman et al., 2016] demonstrates robust 3D locomotion for a biped with ball feet, although with (subjectively speaking) awkward movements. While the aforementioned methods are promising, the resulting capabilities and motion quality as applied to locomoting articulated figures still fall well short of what is needed for animation applications.

Chapter 3

Background

Our tasks will be structured as standard reinforcement learning problems where an agent interacts with its environment according to a policy in order to maximize a reward signal. Let π(s) : S → A represent a deterministic policy, which maps a state s ∈ S to an action a ∈ A, while a stochastic policy π(s,a) : S × A → R represents the conditional probability distribution of a given s, π(s,a) = p(a|s). At each control step t, the agent observes a state s_t and samples an action a_t from π. The environment in turn responds with a scalar reward r_t, and a new state s′_t = s_{t+1} sampled from its dynamics p(s′|s,a). The reward function r_t = r(s_t, a_t) provides the agent with a feedback signal on the desirability of performing an action at a given state. In reinforcement learning, the objective is often to learn an optimal policy π*, which maximizes the expected long-term cumulative reward J(π), expressed as the discounted sum of immediate rewards r_t,

J(\pi) = E_{r_0, r_1, \ldots, r_T}\big[\, r_0 + \gamma r_1 + \cdots + \gamma^T r_T \mid \pi \,\big] = E_{\{r_t\}}\Big[\, \sum_{t=0}^{T} \gamma^t r_t \;\Big|\; \pi \,\Big]

with γ ∈ [0,1] as the discount factor, and T as the horizon, which may be infinite. The discount factor ensures that the cumulative reward is finite, and captures the intuition that events occurring in the distant future are likely to be of less consequence than those occurring in the more immediate future. The optimal policy is then defined as

\pi^* = \arg\max_\pi J(\pi)

3.1 Value Functions

Value functions provide estimates of a policy's expected performance as measured by the cumulative reward. This information is leveraged by value-function-based methods to improve a policy's performance [Sutton and Barto, 1998]. Given a policy, two common classes of value functions are state-value functions V(s) and action-value functions Q(s,a). The state-value function V(s) can be interpreted as the desirability of the agent being in a given state. This is formulated as the expected cumulative reward of following the policy starting at a given state,

V(s) = E_{\{r_t\}}\Big[\, \sum_{t=0}^{T} \gamma^t r_t \;\Big|\; s_0 = s, \pi \,\Big]

Similarly, the action-value function Q(s,a) can be interpreted as the desirability of performing a particular action at a given state.
This is represented as the expected cumulative reward of performing action a at state s and following the policy starting at the next state s′,

Q(s,a) = E_{\{r_t\}}\Big[\, \sum_{t=0}^{T} \gamma^t r_t \;\Big|\; s_0 = s, a_0 = a, \pi \,\Big]

Whenever convenient and without ambiguity, the dependency on π will be implicitly assumed and dropped from the notation; V(s) will be referred to simply as the value function, and Q(s,a) as the Q-function. The value functions can be defined recursively according to

V(s) = E_{r,s'}\big[\, r + \gamma V(s') \mid s_0 = s \,\big]

Q(s,a) = E_{r,s'}\Big[\, r + \gamma \int_{a'} \pi(s',a')\, Q(s',a')\, da' \;\Big|\; s_0 = s, a_0 = a \,\Big]

The relationship between the two value functions can be seen as follows:

V(s) = E_a\big[\, Q(s,a) \,\big]

Q(s,a) = E_{r,s'}\big[\, r + \gamma V(s') \mid s_0 = s, a_0 = a \,\big]

These recursive definitions give rise to the value iteration algorithm for learning the value function through repeated rollouts of a policy. Starting from an initial guess of the value function V^0(s), for each state s encountered by the agent, the policy π is queried to sample an action a, which then results in a new state s′ and reward r. Each state transition can be summarized by an experience tuple τ = (s, a, r, s′), and used to update the value function,

V^{k+1}(s) \leftarrow V^k(s) + \alpha\big[\, r + \gamma V^k(s') - V^k(s) \,\big]

where α < 1 is a stepsize and V^k(s) is the value function at the kth iteration. The update step r + γV(s′) − V(s) is referred to as the temporal difference (TD), and can be interpreted as the difference between the predicted value at a given state, V(s), and an updated approximation of the actual value observed by the agent, r + γV(s′). Algorithm 1 illustrates a method for learning a value function.

Algorithm 1 Policy Evaluation Using Value Iteration
1: V^0 ← initialize value function
2: while not done do
3:   s ← start state
4:   a ∼ π(s, a)
5:   Apply a for one step
6:   s′ ← end state
7:   r ← reward
8:   V^{k+1}(s) ← V^k(s) + α[r + γ V^k(s′) − V^k(s)]
9: end while

The Q-function can be learned in a similar manner, but as it is a function of both state and action, in addition to recording the next state s′, the next action a′ selected by the policy at s′ is also required. Each experience tuple therefore consists of τ = (s, a, r, s′, a′), and the update proceeds according to

Q^{k+1}(s,a) \leftarrow Q^k(s,a) + \alpha\big[\, r + \gamma Q^k(s',a') - Q^k(s,a) \,\big]

This update provides the foundation for the SARSA algorithm [Sutton and Barto, 1998].
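As a concrete counterpart to Algorithm 1, the following is a minimal tabular TD(0) policy-evaluation sketch. The discrete environment interface (`env.reset()` and `env.step(a)` returning the next state, reward, and a done flag) and the callable `policy` are assumptions made for this example, not part of the framework used in later chapters.

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_states, episodes=1000,
                          alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = np.zeros(num_states)                        # initial guess V^0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                           # a ~ pi(s, .)
            s_next, r, done = env.step(a)           # apply a for one step
            td = r + gamma * V[s_next] * (not done) - V[s]   # temporal difference
            V[s] += alpha * td
            s = s_next
    return V
```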
3.2 Q-Learning

An advantage of learning a Q-function over a state-value function is that the Q-function provides an implicit representation of a policy. Suppose that π* is the optimal policy and Q* the optimal Q-function. An optimal deterministic policy can be defined as a policy that selects the action that maximizes the Q-value at a given state,

\pi^*(s) = \arg\max_a Q^*(s,a)

where the optimal Q-function satisfies the property

Q^*(s,a) = E_{r,s'}\big[\, r + \gamma \max_{a'} Q^*(s',a') \,\big] = E_{r,s'}\big[\, r + \gamma\, Q^*(s', \pi^*(s')) \,\big]    (3.1)

More generally, given a Q-function that may or may not be optimal, a policy can be constructed by selecting the action that maximizes the Q-function at a given state. Of course, the resulting policy may no longer be optimal.

Q-learning takes advantage of the recursive definition of the optimal Q-function to learn an approximation of Q* starting from an initial guess Q^0. The agent proceeds by collecting experiences τ = (s, a, r, s′) using the current Q-function to define a policy π(s) = arg max_a Q(s,a). The experiences are then used to improve the Q-function by interpreting Equation 3.1 as an update step,

Q^{k+1}(s,a) \leftarrow Q^k(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q^k(s',a') - Q^k(s,a) \,\big]

However, the method described so far is insufficient for convergence to the optimal Q-function. In particular, a deterministic policy that always selects the action which maximizes the Q-function does not allow the agent to explore new actions that may yield higher rewards. Since the initial Q-function is likely a poor approximation of Q*, the recommended action may not be optimal. This leads to the exploration-exploitation tradeoff, where an agent must balance between exploiting the current policy (i.e. selecting the actions proposed by the policy) and exploring new actions that may lead to improved performance. A simple heuristic to control this tradeoff is ε-greedy exploration, using an exploration rate ε ∈ [0,1]. With ε-greedy exploration, the agent selects an action according to the policy with probability (1 − ε) and selects a random action with probability ε. ε can therefore be interpreted as an agent's skepticism of its current policy, where ε = 0 results in a deterministic policy that always selects the proposed actions, and ε = 1 results in a stochastic policy that ignores the proposed actions. As such, a common heuristic for selecting ε is to start with a high value (e.g. ε ≈ 1) and slowly decrease it as training progresses and the policy improves (e.g. ε ≈ 0.2). Algorithm 2 summarizes the overall learning process. With sufficient exploration and under suitable assumptions, Q^k can be shown to converge to Q*, yielding the optimal policy π* [Sutton and Barto, 1998].

Algorithm 2 Q-Learning With ε-Greedy Exploration
1: Q^0 ← initialize value function
2: while not done do
3:   s ← start state
4:   with probability ε do
5:     a ← random action
6:   else
7:     a ← arg max_a Q^k(s, a)
8:   end with
9:   Apply a for one step
10:  s′ ← end state
11:  r ← reward
12:  Q^{k+1}(s, a) ← Q^k(s, a) + α[r + γ max_{a′} Q^k(s′, a′) − Q^k(s, a)]
13: end while
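A corresponding tabular sketch of Algorithm 2, again assuming a gym-style `reset`/`step` environment interface and a linearly annealed exploration rate; the schedules and hyperparameter values are illustrative only.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=5000,
               alpha=0.1, gamma=0.9, eps_start=1.0, eps_end=0.2):
    """Tabular Q-learning with epsilon-greedy exploration (cf. Algorithm 2)."""
    Q = np.zeros((num_states, num_actions))         # initial guess Q^0
    for ep in range(episodes):
        # Anneal epsilon from ~1 (mostly random) toward ~0.2 as the policy improves.
        eps = eps_start + (eps_end - eps_start) * ep / max(episodes - 1, 1)
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < eps:
                a = np.random.randint(num_actions)  # explore
            else:
                a = int(np.argmax(Q[s]))            # exploit: arg max_a Q(s, a)
            s_next, r, done = env.step(a)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])   # Q-learning update
            s = s_next
    return Q
```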
3.3 Policy Gradient Methods

Q-learning has led to groundbreaking advances for a rich repertoire of control problems, with a notable recent example being Deep Q-Networks [Mnih et al., 2015], which combines deep neural networks with Q-learning to achieve human-level performance on a suite of Atari games. However, many of the effective applications of Q-learning have been for tasks with discrete action spaces, where the set of possible actions is reasonably small. This limitation partly stems from the optimization over the action space needed during policy evaluation and Q-function updates. This optimization quickly becomes intractable for large sets of possible actions, or if the action space becomes continuous. For motion control problems, the actions are often naturally parameterized by continuous action spaces, with parameters such as forces or target positions. This limitation of Q-learning motivates the use of a different class of methods that can better cope with continuous action spaces. Nonetheless, value functions continue to play an important role in the methods explored in the following sections.

One of the challenges of applying Q-learning to continuous action spaces is that the policy is implicitly represented by the Q-function. As such, during policy evaluation, when the agent queries the policy for an action, it must optimize the Q-function over the action space in order to find the action with the maximum predicted value. This can lead to intractable runtime costs when deploying a policy. An alternative approach is to learn an explicit representation of the policy that can directly map a query state to an action. A popular class of techniques for learning an explicit policy representation is policy gradient methods [Sutton et al., 2001].

Consider a policy modeled as a parametric function π(s,a|θ) with parameters θ; then the expected cumulative reward can be re-expressed as J(θ), and the goal of learning π* can be formulated as finding the optimal set of parameters θ*,

\theta^* = \arg\max_\theta J(\theta)

Policy gradient methods learn a policy by performing gradient ascent on the objective using empirical estimates of the policy gradient \nabla_\theta J(\theta), i.e. the gradient of J(θ) with respect to the policy parameters θ. The policy gradient can be determined according to the policy gradient theorem [Sutton et al., 2001], which provides a direction of improvement for adjusting the policy parameters θ,

\nabla_\theta J(\theta) = \int_S d(s|\theta) \int_A \nabla_\theta \log(\pi(s,a|\theta))\, A(s,a) \, da \, ds

where d(s|\theta) = \int_S \sum_{t=0}^{T} \gamma^t p_0(s_0)\, p(s_0 \rightarrow s \mid t, \theta)\, ds_0 is the unnormalized discounted state distribution, with p_0(s) representing the initial state distribution, and p(s_0 → s | t, θ) modeling the likelihood of reaching state s by starting at s_0 and following the policy π(s,a|θ) for T steps [Silver et al., 2014a]. A(s,a) represents a generalized advantage function. The choice of advantage function gives rise to a family of policy gradient algorithms, but in this work we will focus on the one-step temporal difference (TD) advantage function [Schulman et al., 2015],

A(s,a) = r + \gamma V(s') - V(s)

where V(s) is the state-value function of the policy π(s,a|θ). An action with a positive advantage implies better than average performance for a given state, and a negative advantage implies worse than average performance. The policy gradient can therefore be interpreted as increasing the likelihood of actions that result in higher than average performance, while decreasing the likelihood of actions that result in lower than average performance. A parameterized policy π(s,a|θ_π) and parameterized value function V(s|θ_V), with parameters θ_π and θ_V, can be learned in tandem using an actor-critic framework [Konda and Tsitsiklis, 2000]. Algorithm 3 provides an example of an actor-critic algorithm using policy gradients. In an actor-critic algorithm, the policy is commonly referred to as the actor, and the value function is referred to as the critic. α_π and α_V denote the actor and critic stepsizes.

Algorithm 3 Actor-Critic Algorithm Using Policy Gradients
1: θ_π ← initialize actor parameters
2: θ_V ← initialize critic parameters
3: while not done do
4:   s ← start state
5:   a ∼ π(s, a|θ_π)
6:   Apply a for one step
7:   s′ ← end state
8:   r ← reward
9:   y ← r + γ V(s′|θ_V)
10:  θ_V ← θ_V + α_V ∇_{θ_V} V(s|θ_V) (y − V(s|θ_V))
11:  θ_π ← θ_π + α_π ∇_{θ_π} log(π(s, a|θ_π)) A(s, a)
12: end while

Next, we consider some design decisions for modeling the distribution represented by a stochastic policy. Given a state s, a stochastic policy π(s,a|θ) = p(a|s,θ) models a distribution over the action space. With discrete action spaces, a policy can often be represented as a discrete probability distribution over the possible actions, by predicting a score for each action. However, with continuous action spaces this is no longer feasible, and additional modeling decisions are often needed to represent the action distribution. A common choice is to model the action distribution as a unimodal Gaussian distribution with a parameterized mean μ(s|θ) and fixed covariance matrix Σ,

\pi(s,a|\theta) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (a - \mu(s|\theta))^T \Sigma^{-1} (a - \mu(s|\theta)) \right)

where n = |A| is the dimension of the action space. Actions can be sampled from this distribution by applying Gaussian noise to the mean action,

a = \mu(s|\theta) + \mathcal{N}(0, \Sigma)

The corresponding policy gradient for a Gaussian policy assumes the form

\nabla_\theta J(\theta) = \int_S d(s|\theta) \int_A \nabla_\theta \mu(s|\theta)\, \Sigma^{-1} (a - \mu(s|\theta))\, A(s,a) \, da \, ds

which can be interpreted as shifting the mean of the action distribution towards actions that lead to higher than expected rewards, while moving away from actions that lead to lower than expected rewards.
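The sketch below combines the updates of Algorithm 3 with the Gaussian policy just described, using linear function approximators so the gradients can be written out explicitly: the critic follows the one-step TD target, and the actor mean is shifted along \Sigma^{-1}(a - \mu(s)) scaled by the TD advantage. The use of the raw state as the feature vector, the fixed diagonal covariance, and the stepsizes are simplifying assumptions for illustration, not the network models used in this thesis.

```python
import numpy as np

class GaussianActorCritic:
    """One-step TD actor-critic (cf. Algorithm 3) with linear function
    approximators and a fixed-covariance Gaussian policy (illustrative only)."""

    def __init__(self, state_dim, action_dim, sigma=0.1,
                 actor_lr=1e-3, critic_lr=1e-2, gamma=0.99):
        self.W = np.zeros((action_dim, state_dim))   # actor:  mu(s) = W s
        self.w = np.zeros(state_dim)                 # critic: V(s)  = w . s
        self.sigma2 = sigma ** 2                     # fixed diagonal covariance
        self.actor_lr, self.critic_lr, self.gamma = actor_lr, critic_lr, gamma

    def act(self, s):
        # Sample a = mu(s) + Gaussian noise.
        mu = self.W @ s
        return mu + np.random.normal(0.0, np.sqrt(self.sigma2), size=mu.shape)

    def update(self, s, a, r, s_next, done):
        v, v_next = self.w @ s, self.w @ s_next
        y = r + self.gamma * v_next * (not done)     # TD target
        advantage = y - v                            # one-step TD advantage
        # Critic: w <- w + alpha_V * grad_w V(s) * (y - V(s)), with grad_w V(s) = s.
        self.w += self.critic_lr * advantage * s
        # Actor: grad_W log pi(s,a) = Sigma^{-1} (a - mu(s)) s^T.
        mu = self.W @ s
        self.W += self.actor_lr * advantage * np.outer((a - mu) / self.sigma2, s)
```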
3.4 CACLA

The Continuous Actor-Critic Learning Automaton (CACLA) is a variant of policy gradient methods that attempts to address some of the challenges stemming from the choice of Gaussian policies [Van Hasselt, 2012]. To motivate CACLA, consider the use of the TD advantage function with a Gaussian policy. A positive-TD update has the effect of shifting the mean of the action distribution towards an action that was observed to perform better than average, while a negative-TD update shifts the mean away from an action observed to perform worse than average. In the case of negative-TD updates, shifting the mean away from an observed action is equivalent to shifting it towards an unknown action, which may perform worse than the current mean action [Van Hasselt, 2012]. Though in expectation the empirical estimates of the policy gradient converge to the true policy gradient, in practice the gradients are estimated using minibatches of experience tuples, which only provide noisy estimates of the true gradient. Therefore, negative-TD updates can result in the policy adopting less desirable actions, which in turn influence the behaviour of the agent as it collects subsequent experiences. In practice, this can result in instabilities during training, where policy performance fluctuates drastically between iterations.

To mitigate some of these challenges, CACLA proposes the use of an alternative advantage function based on positive temporal differences,

A(s,a) = I[\delta > 0] = \begin{cases} 1, & \delta > 0 \\ 0, & \text{otherwise} \end{cases}

\delta = r + \gamma V(s') - V(s)

where δ represents the temporal difference. When using a Gaussian policy, the CACLA update shifts the mean towards an action only if it was observed to perform better than average; otherwise the policy remains unchanged. However, the resulting policy gradient is no longer a valid gradient of the original objective J(θ). Instead, CACLA can be interpreted as learning a policy that maximizes the likelihood of performing better than average. The choice of a step function has the additional advantage of being invariant to the scale of the reward function. As with most gradient descent algorithms, selecting the appropriate stepsize can significantly impact performance. With the standard TD advantage function, scaling the reward function will often entail retuning the stepsize in order to compensate for the scaling. With CACLA, however, the algorithm is invariant to uniform scaling of the reward function, which in practice reduces the tuning required for the stepsize. CACLA's positive temporal difference update is at the heart of the learning frameworks detailed in the following chapters.
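For contrast with the standard Gaussian policy gradient, the following sketch applies CACLA's rule to the same linear actor-critic setup as above: the critic update is unchanged, while the actor mean is regressed toward the executed action only when the temporal difference is positive. The linear models and stepsizes remain illustrative assumptions.

```python
import numpy as np

def cacla_update(W, w, s, a, r, s_next, done,
                 actor_lr=1e-3, critic_lr=1e-2, gamma=0.99):
    """One CACLA update for a linear actor mu(s) = W s and critic V(s) = w . s."""
    v, v_next = w @ s, w @ s_next
    delta = r + gamma * v_next * (not done) - v      # temporal difference
    w = w + critic_lr * delta * s                    # critic: standard TD update
    if delta > 0:
        # Positive TD only: regress the mean toward the action that performed
        # better than expected; negative-TD samples leave the actor unchanged.
        mu = W @ s
        W = W + actor_lr * np.outer(a - mu, s)
    return W, w
```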
Building on recent progressin deep reinforcement learning, we introduce a mixture of actor-critic experts (MACE) approach thatlearns terrain-adaptive dynamic locomotion skills using high-dimensional state and terrain descriptionsas input, and parameterized leaps or steps as output actions. MACE learns more quickly than a singleactor-critic approach and results in actor-critic experts that exhibit specialization. Additional elementsof our solution that contribute towards efficient learning include Boltzmann exploration and the intro-duction of initial actor biases to encourage specialization. Results are demonstrated for multiple planarcharacters and terrain classes.4.1 IntroductionIn practice, a number of challenges need to be overcome when applying the RL framework to problemswith continuous and high-dimensional states and actions, as required by movement skills. A controlpolicy needs to select the best actions for the distribution of states that will be encountered, but thisdistribution is often not known in advance. Similarly, the distribution of actions that will prove to beuseful for these states is also seldom known in advance. Furthermore, the state-and-action distributionsare not static in nature; as changes are made to the control policy, new states may be visited, and,conversely, the best possible policy may change as new actions are introduced. It is furthermore notobvious how to best represent the state of a character and its environment. Using large descriptorsallows for very general and complete descriptions, but such high-dimensional descriptors define large17state spaces that pose a challenge for many RL methods. Using sparse descriptors makes the learningmore manageable, but requires domain knowledge to design an informative-and-compact feature set thatmay nevertheless be missing important information.In this chapter we use deep neural networks in combination with reinforcement learning to ad-dress the above challenges. This allows for the design of control policies that operate directly on high-dimensional character state descriptions (83D) and an environment state that consists of a height-fieldimage of the upcoming terrain (200D). We provide a parameterized action space (29D) that allows thecontrol policy to operate at the level of bounds, leaps, and steps. We introduce a novel mixture of actor-critic experts (MACE) architecture to enable accelerated learning. MACE develops n individual controlpolicies and their associated value functions, which each then specialize in particular regimes of theoverall motion. During final policy execution, the policy associated with the highest value function isexecuted, in a fashion analogous to Q-learning with discrete actions. We show the benefits of Boltzmannexploration and various algorithmic features for our problem domain. We demonstrate improvements inmotion quality and terrain abilities over previous work.4.2 Related WorkWhile signnificant progress has been made in developing physics-based controllers for a rich repertoireof locomotion skills, many methods focus mainly on controlling locomotion over flat terrain. A com-bination of manually-crafted and learned feedback strategies have been incoporated into controllers toimprove robustness to unexpected perturbations and irregularities in the environment [Yin et al., 2007,Coros et al., 2010, 2011a, Geijtenbeek et al., 2013, Ding et al., 2015]. Nonetheless, these controllersare still purely reactive and cannot actively anticipate changes in the environment. 
Planning algorithmscan be incorporated to guide characters through irregular terrains. But due to the complex dynamicsand high-dimensional state space of articulated systems, these methods often perform planning usingreduced state representations [Coros et al., 2008, Mordatch et al., 2010]. Reinforcement learning usingnonparametric function approximators have been previously explored for terrain traversal [Peng et al.,2015]. But again relied on hand-crafted reduced state representations to describe the character and en-vironment. As a result, the policies are limited to classes of terrain where compact descriptions areavailable.In this work, we present a mixture model policy representation where subpolicies are trained tospecialize in handling different situations. The idea of modular selection and identification for controlhas been proposed in many variations. Well-known work in sensorimotor control proposes the use of aresponsibility predictor that divides experiences among several contexts [Haruno et al., 2001]. Similarconcepts can be found in the use of skill libraries indexed based on sensory information [Pastor et al.,2012], Gaussian mixture models for multi-optima policy search for episodic tasks [Calinon et al., 2013],and the use of random forests for model-based control [Hester and Stone, 2013]. Ensemble methodsfor RL problems with discrete actions have been investigated in some detail [Wiering and Van Hasselt,182008]. Adaptive mixtures of local experts [Jacobs et al., 1991] allow for specialization by allocatinglearning examples to a particular expert among an available set of experts according to a local gatingfunction, which is also learned so as to maximize performance. This has also been shown to work well inthe context of reinforcement learning [Doya et al., 2002, Uchibe and Doya, 2004]. More recently, therehas been strong interest in developing deep RL architectures for multi-task learning, where the tasks areknown in advance [Parisotto et al., 2015, Rusu et al., 2015], with a goal of achieving policy compression.In the context of physics-based character animation, a number of papers propose to use selections orcombinations of controllers as the basis for developing more complex locomotion skills [Faloutsos et al.,2001, Coros et al., 2008, da Silva et al., 2009, Muico et al., 2011].Our work: We propose a deep reinforcement learning method based on learning Q-functions anda policy, pi(s), for continuous action spaces as modeled on the CACLA RL algorithm [Van Hasseltand Wiering, 2007, Van Hasselt, 2012]. In particular, we show the effectiveness of using a mixtureof actor-critic experts (MACE), as constructed from multiple actor-critic pairs that each specialize inparticular aspects of the motion. Unlike prior work on dynamic terrain traversal using reinforcementlearning [Peng et al., 2015], our method can work directly with high-dimensional character and terrainstate descriptions without requiring the feature engineering often needed by nonparametric methods.Our results also improve on the motion quality and expand upon the types of terrains that can be navi-gated with agility.4.3 OverviewAn overview of the system is shown in Figure 4.2, which illustrates three nested loops that each corre-spond to a different timescale. In the following description, we review its operation for our dog controlpolicies, with other control policies being similar.The inner-most loop models the low-level control and physics-based simulation process. 
At eachtime-step4t, individual joint torques are computed by low-level control structures, such as PD-controllersand Jacobian transpose forces (see §4.4). These low-level control structures are organized into a smallnumber of motion phases using a finite state machine. The motion of the character during the time stepis then simulated by a physics engine.The middle loop operates at the time scale of locomotion cycles, i.e., leaps for the dog. Touch-downof the hind-leg marks the beginning of a motion cycle, and at this moment the control policy, a = pi(s),chooses the action that will define the subsequent cycle. The state, s, is defined by C, a set of 83 numbersdescribing the character state, and T , a set of 200 numbers that provides a one-dimensional heightfield“image” of the upcoming terrain. The output action, a, assigns specific values to a set of 29 parametersof the FSM controller, which then governs the evolution of the motion during the next leap.The control policy is defined by a small set of actor-critic pairs, whose outputs taken together repre-sent the outputs of the learned deep network (see Figure 4.7). Each actor represents an individual control19Figure 4.2: System Overviewpolicy; they each model their own actions, Aµ(s), as a function of the current state, s. The critics, Qµ(s),each estimate the quality of the action of their corresponding actor in the given situation, as given by theQ-value that they produce. This is a scalar that defines the objective function, i.e., the expected value ofthe cumulative sum of (discounted) future rewards. The functions Aµ(s) and Qµ(s) are modeled usinga single deep neural network that has multiple corresponding outputs, with most network layers beingshared. At run-time, the critics are queried at the start of each locomotion cycle in order to select theactor that is best suited for the current state, according to the highest estimated Q-value. The outputaction of the corresponding actor is then used to drive the current locomotion cycle.Learning requires exploration of “off-policy” behaviors. This is implemented in two parts. First, anactor can be selected probabilistically, instead of deterministically choosing the max-Q actor. This isdone using a softmax-based selection, which probabilistically selects an actor, with higher probabilitiesbeing assigned to actor-critic pairs with larger Q-values. Second, Gaussian noise can be added to theoutput of an actor with a probability εt , as enabled by an exploration choice of λ = 1.For learning purposes, each locomotion cycle is summarized in terms of an experience tuple τ =(s,a,r,s′,µ,λ ), where the parameters specify the starting state, action, reward, next state, index of theactive actor, and a flag indicating the application of exploration noise. The tuples are captured in areplay memory that stores the most recent 50k tuples and is divided into a critic buffer and an actorbuffer. Experiences are collected in batches of 32 tuples, with the motion being restarted as needed,i.e., if the character falls. Tuples that result from added exploration noise are stored in the actor buffer,while the remaining tuples are stored in the critic buffer, which are later used to update the actors andcritics respectively. Our use of actor buffers and critic buffers in a MACE-style architecture is new, tothe best of our knowledge, although the actor buffer is inspired by recent work on prioritized experiencereplay [Schaul et al., 2015].20Figure 4.3: Left: 21-link planar dog. 
Right: 19-link raptor.The outer loop defines the learning process. After the collection of a new minibatch of tuples, alearning iteration is invoked. This involves sampling minibatches of tuples from the replay memory,which are used to improve the actor-critic experts. The actors are updated according to a positive-temporal difference strategy, modeled after CACLA [Van Hasselt, 2012], while the critics are updatedusing the standard temporal difference updates, regardless of the sign of their temporal differences. Formore complex terrains, learning requires on the order of 300k iterations.4.4 Characters and TerrainsOur planar dog model is a reconstruction of that used in previous work [Peng et al., 2015], although itis smaller, standing approximately 0.5 m tall at the shoulders, as opposed to 0.75 m. It is composed of21 links and has a mass of 33.7 kg. The pelvis is designated as the root link and each link is connectedto its parent link with a revolute joint, yielding a total of 20 internal degrees of freedom and a further3 degrees of freedom defined by the position and orientation of the root in the world. The raptor iscomposed of 19 links, with a total mass of 33 kg, and a head-to-tail body length of 1.5 m. The motionof the characters are driven by the application of internal joint torques and is simulated using the Bulletphysics engine [Bullet, 2015] at 600 Hz, with friction set to 0.81.4.4.1 ControllersSimilar to much prior work in physics-based character animation, the motion is driven using jointtorques and is guided by a finite state machine. Figure 4.4 shows the four phase-structure of the con-troller. In each motion phase, the applied torques can be decomposed into three components,τ = τspd+ τg+ τvfwhere τspd are torques computed from joint-specific stable (semi-implicit) proportional-derivative (SPD)controllers [Tan et al., 2011], τg provides gravity compensation for all links, as referred back to the rootlink, and τvf implements virtual leg forces for the front and hind legs when they are in contact withthe ground, as described in detail in [Peng et al., 2015]. The stable PD controllers are integrated into21Figure 4.4: Dog controller motion phases.Bullet with the help of a Featherstone dynamics formulation [Featherstone, 2014]. For our system,conventional PD-controllers with explicit Euler integration require a time-step4t = 0.0005s to remainstable, while SPD remains stable for a time-step of4t = 0.0017s, yielding a two-fold speedup once thesetup computations are taken into account.The character’s propulsion is principally achieved by exerting forces on the ground with its endeffectors, represented by the front and back feet. Virtual force controllers are used to compute the jointtorques τe needed to exert a desired force fe on a particular end effector e.τe = δeJTe fewhere δe is the contact indicator variable for the end effector, and Je is the end effector Jacobian. Thefinal control forces for the virtual force controllers are the sum of the control forces for the front andback feet.τv f = τ f + τb4.4.2 Terrain classesWe evaluate the learning method on multiple classes of terrain obstacles that include gaps, steps, walls,and slopes. It is possible to make the obstacles arbitrarily difficult and thus we use environments thatare challenging while remaining viable. All of the terrains are represented by 1D height-fields, andgenerated randomly by drawing uniformly from predefined ranges of values for the parameters that22characterize each type of obstacle. 
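As a concrete illustration of this procedure, the sketch below generates a 1D height-field with randomly placed gaps. The parameter ranges mirror the flat-terrain settings listed next, while the function name, sample spacing, and sampling details are assumptions of this sketch rather than the exact generator used here.

```python
import numpy as np

def generate_gap_terrain(length_m=100.0, dx=0.05,
                         gap_spacing=(4.0, 7.0), gap_width=(0.5, 2.0),
                         gap_depth=2.0, rng=None):
    """Illustrative 1D height-field with randomly spaced gaps.

    dx is the sample spacing (5 cm, matching the terrain features); the
    remaining parameters are drawn uniformly from the given ranges.
    """
    rng = rng or np.random.default_rng()
    n = int(length_m / dx)
    height = np.zeros(n)
    x = rng.uniform(*gap_spacing)           # position of the first gap
    while x < length_m:
        w = rng.uniform(*gap_width)         # width of this gap
        i0, i1 = int(x / dx), min(int((x + w) / dx), n)
        height[i0:i1] = -gap_depth          # carve the gap into the height-field
        x += w + rng.uniform(*gap_spacing)  # distance to the following gap
    return height
```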
In the flat terrains, gaps are spaced between 4 to 7m apart, withwidths ranging from 0.5 to 2m, and a fixed gap depth of 2m. Steps are spaced 5 to 7m apart, withchanges in height ranging from 0.1 to 0.4m. Walls are spaced 6 to 8m apart, with heights rangingbetween 0.25 to 0.5m, and a fixed width of 0.2m. Slopes are generated by varying the change in slope ofthe terrain at each vertex following a momentum model. The height yi of vertex i is computed accordingtoyi = yi−1+4xsisi = si−1+4si4si = sign(U(−1,1)− si−1smax )×U(0,4smax)where smax = 0.5 and 4smax = 0.05, 4x = 0.1m, and vertices are ordered such that xi−1 < xi. Whenslopes are combined with the various obstacles, the obstacles are adjusted to be smaller than those inthe flat terrains.4.5 Policy RepresentationA policy is a mapping between a state space S and an action space A, i.e., pi(s) : S 7→ A. For ourframework, S is a continuous space that describes the state of the character as well as the configurationof the upcoming terrain. The action space A is represented by a 29D continuous space where each actionspecifies a set of parameters to the FSM. The following sections provide further details about the policyrepresentation.4.5.1 StateA state s consists of features describing the configuration of the character and the upcoming terrain. Thestate of the character is represented by its pose q and velocity q˙, where q records the positions of thecenter of mass of each link with respect to the root and q˙ records the center of mass velocity of eachlink. The terrain features, T , consist of a 1D array of samples from the terrain height-field, beginningat the position of the root and spanning 10 m ahead. All heights are expressed relative to the height ofthe terrain immediately below the root of the character. The samples are spaced 5 cm apart, for a totalof 200 height samples. Combined, the final state representation is 283-dimensional. Figure ?? illustratethe character and terrain features.4.5.2 ActionsA total of 29 controller parameters serve to define the available policy actions. These include specifi-cations of the target spine curvature as well as the target joint angles for the shoulder, elbow, hip, knee,hock, and hind-foot, for each of the four motion phases defined by the controller FSM. Additionally,23Figure 4.5: Left: The character features consist of the displacements of the centers of mass ofall links relative to the root (red) and their linear velocities (green). Right: Terrain featuresconsist of height samples of the terrain in front of the character, evenly spaced 5cm apart. Allheights are expressed relative to the height of the ground immediately under the root of thecharacterthe x and y components of the hind-leg and front-leg virtual forces, as applied in phases 1 and 3 of thecontroller, are also part of the parameter set. Phases 2 and 4 apply the same forces as phases 1 and 3,respectively, if the relevant leg is still in contact with the ground. Lastly, the velocity feedback gain forthe swing hip (and shoulder) provides one last action parameter.Prior to learning the policy, a small set of initial actions are created which are used to seed thelearning process. The set of actions consists of 4 runs and 4 leaps. All actions are synthesized usinga derivative-free optimization process, CMA [Hansen, 2006]. Two runs are produced that travel atapproximately 4 m/s and 2 m/s, respectively. These two runs are then interpolated to produce 4 runs ofvarying speeds. 
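The blending of controller parameter sets can be illustrated with a short sketch. The function below is a simplification (plain linear interpolation of FSM parameter vectors) rather than the exact procedure used to build the initial action set, and its name is an assumption.

```python
import numpy as np

def interpolate_actions(a_slow, a_fast, num=4):
    """Linearly blend two FSM parameter vectors (e.g. the 2 m/s and 4 m/s runs)
    into a small family of intermediate actions."""
    weights = np.linspace(0.0, 1.0, num)
    return [(1.0 - w) * np.asarray(a_slow) + w * np.asarray(a_fast)
            for w in weights]
```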
Given a sequence of successive fast-run cycles, a single cycle of that fast-run is thenoptimized for distance traveled, yielding a 2.5 m leap that can then be executed from the fast run. Theleap action is then interpolated with the fast run to generate 4 parametrized leaps that travel differentdistances.4.5.3 RewardIn reinforcement learning the reward function, r(s,a), is used as a training signal to encourage or dis-courage behaviors in the context of a desired task. The reward provides a scalar value reflecting thedesirability of a particular state transition that is observed by performing action a starting in the initialstate s and resulting in a successor state s′. Figure 4.6 is an example of a sequence of state transitionsfor terrain traversal. For the terrain traversal task, the reward is provided byr(s,a) =0, character falls during the cyclee−ω(v∗−v)2 , otherwisewhere a fall is defined as any link of the character’s trunk making contact with the ground for an ex-tended period of time, v is the average horizontal velocity of the center of mass during a cycle, v∗= 4m/sis the desired velocity, and ω = 0.5 is the weight for the velocity error. This simple reward is thereforedesigned to encourage the character to travel forward at a consistent speed without falling. If the char-24Figure 4.6: Each state transition can be recorded as a tuple τt = (st ,at ,rt ,s′t). st is the initial state,at is the action taken, s′t is the resulting state, and rt is the reward received during cycle t.acter falls during a cycle, it is reset to a default state and the terrain is regenerated randomly. The goalof learning is to find a control policy that maximizes the expected value of the long term cumulativereward.4.5.4 Policy RepresentationTo represent the policy, we use a convolutional neural network, with weights θ , following the structureillustrated in Figure 4.7. The network is queried once at the start of each locomotion cycle. The overallstructure of the convolutional network is inspired by the recent work of Minh et al. [Mnih et al., 2015].For a query state s = (q, q˙,T ), the network first processes the terrain features T by passing it through 168×1 convolution filters. The resulting feature maps are then convolved with 32 4×1 filters, followedby another layer of 32 4×1 filters. A stride of 1 is used for all convolutional layers. The output of thefinal convolutional layer is processed by 64 fully-connected units, and the resulting features are thenconcatenated with the character features q and q˙, as inspired by [Levine et al., 2015]. The combinedfeatures are processed by a fully-connected layer composed of 256 units. The network then branchesinto critic and actor subnetworks. The critic sub-network predicts the Q-values for each actor, whileeach actor subnetwork proposes an action for the given state. All subnetworks follow a similar structurewith a fully connected layer of 128 units followed by a linear output layer. The size of the output layersvary depending on the subnetwork, ranging from 3 output units for the critics to 29 units for each actor.The combined network has approximately 570k parameters. Rectified linear units are used for all layers,except for the output layers.During final runtime use, i.e., when learning has completed, the actor associated with the highestpredicted Q-value is selected and its proposed action is applied for the given state.µ∗ = arg maxµQµ(s)pi(s) = Aµ∗(s)The inputs are standardized before being used by the network. 
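The architecture described above could be assembled along the following lines. This is a hedged PyTorch sketch rather than the original implementation (the networks in this chapter were built in Caffe); the layer sizes follow the description, while details such as padding, weight initialization, and the class name MACENet are assumptions.

```python
import torch
import torch.nn as nn

class MACENet(nn.Module):
    """Sketch of the MACE network: a terrain convolution branch, a shared
    trunk, and separate critic and actor subnetworks."""

    def __init__(self, terrain_dim=200, char_dim=83, num_actors=3, action_dim=29):
        super().__init__()
        self.conv = nn.Sequential(                     # terrain branch (1D convolutions)
            nn.Conv1d(1, 16, kernel_size=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=4), nn.ReLU(),
        )
        conv_out = 32 * (terrain_dim - 8 + 1 - 4 + 1 - 4 + 1)  # stride 1, no padding
        self.terrain_fc = nn.Sequential(nn.Linear(conv_out, 64), nn.ReLU())
        self.trunk = nn.Sequential(nn.Linear(64 + char_dim, 256), nn.ReLU())
        # Critic subnetwork: one Q-value per actor.
        self.critic = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                    nn.Linear(128, num_actors))
        # Actor subnetworks: each proposes a full action vector.
        self.actors = nn.ModuleList(
            nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, action_dim))
            for _ in range(num_actors))

    def forward(self, terrain, char):
        t = self.conv(terrain.unsqueeze(1))            # (batch, 32, L)
        t = self.terrain_fc(t.flatten(start_dim=1))    # (batch, 64)
        h = self.trunk(torch.cat([t, char], dim=1))    # (batch, 256)
        q = self.critic(h)                             # (batch, num_actors)
        actions = torch.stack([actor(h) for actor in self.actors], dim=1)
        return q, actions                              # actions: (batch, num_actors, 29)
```

With the default sizes, this sketch has roughly 570k parameters, consistent with the count reported above.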
The mean and standard deviation for thisstandardization are determined using data collected from an initial random-action policy. The outputs25Figure 4.7: Schematic illustration of the MACE convolutional neural network. T and C are theinput terrain and character features. Each Aµ represents the proposed action of actor µ , andQµ is the critic’s predicted reward when activating the corresponding actor.are also followed by the inverse of a similar transformation in order to allow the network to learnstandardized outputs. We apply the following transformation:Aµ(s) = ΣA¯µ(s)+βµwhere A¯µ is the output of each actor subnetwork, Σ= diag(σ0,σ1, ...), {σi} are pre-specified scales foreach action parameter, and βµ is an actor bias. {σi} are selected such that the range of values for eachoutput of A¯µ remains approximately within [-1, 1]. This transformation helps to prevent excessivelysmall or large gradients during back-propagation, improving stability during learning. The choice ofdifferent biases for each actor helps to encourage specialization of the actors for different situations, apoint which we revisit later. For the dog policy, we select a fast run, slow run, and large jump as biasesfor the three actors.4.6 LearningAlgorithm 4 illustrates the overall learning process. θ represents the weights of the composite actor-critic network, and Qµ(s|θ) is the Q-value predicted by the critic for the result of activating actor Aµin state s, where Aµ(s|θ) is the action proposed by the actor for s. Since an action is decided by firstselecting an actor followed by querying the chosen actor for its proposed action, exploration in MACEcan be decomposed into critic exploration and actor exploration. Critic exploration allows for theselection of an actor other than the “best” actor as predicted by the Q-values. For this we use Boltzmannexploration, which assigns a selection probability pµ to each actor based on its predicted Q-value:pµ(s) =eQµ (s|θ)/Tt∑ j eQµ j (s|θ)/Tt,26where Tt is a temperature parameter. Actors with higher predicted values are more likely to be selected,and the bias in favor of actors with higher Q-values can be adjusted via Tt . Actor exploration results inchanges to the output of the selected actor, in the form of Gaussian noise that is added to the proposedaction. This generates a new action from the continuous action space according to:a = Aµ(s)+N(0,Σ),where Σ are pre-specified scales for each action parameter. Actor exploration is enabled via a Bernoulliselector variable, λ ∼ Ber(εt): Ber(εt) = 1 with probability εt , and Ber(εt) = 0 otherwise.Once recorded, experience tuples are stored into separate replay buffers, for use during updates.Tuples collected during actor exploration are stored in an actor buffer Da, while all other tuples arestored in a critic buffer Dc. 
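The exploration scheme and buffer routing just described can be summarized in a short sketch. The function and variable names below (mace_explore_step, noise_scales, and so on) are illustrative assumptions; only the logic — Boltzmann selection over critic values, Bernoulli-gated Gaussian noise, and the actor/critic buffer split — mirrors the text.

```python
import numpy as np

def mace_explore_step(q_values, actions, temperature, eps, noise_scales, rng):
    """One MACE action selection with critic and actor exploration.

    q_values:     predicted Q-value for each actor in the current state
    actions:      each actor's proposed action for the current state
    temperature:  Boltzmann temperature T_t
    eps:          probability of applying Gaussian actor-exploration noise
    noise_scales: per-parameter exploration scales (diagonal of Sigma)
    rng:          a numpy Generator, e.g. np.random.default_rng()
    """
    # Critic exploration: Boltzmann (softmax) selection over the actor Q-values.
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    mu = rng.choice(len(q_values), p=probs)

    # Actor exploration: optionally perturb the chosen actor's proposed action.
    apply_noise = rng.random() < eps          # Bernoulli selector lambda
    a = np.array(actions[mu], dtype=float)
    if apply_noise:
        a += noise_scales * rng.standard_normal(a.shape)
    return mu, a, apply_noise

def store_tuple(tup, apply_noise, actor_buffer, critic_buffer):
    # Exploratory tuples later train the actors; noise-free tuples train the critics.
    (actor_buffer if apply_noise else critic_buffer).append(tup)
```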
This separation allows the actors to be updated using only the off-policytuples in Da, and the critics to be updated using only tuples without exploration noise in Dc During acritic update, a minibatch of n = 32 tuples {τi} are sampled from Dc and used to perform a Bellmanbackup,yi = ri+ γ maxµ Qµ(s′i|θ)θ ← θ +α(1n∑iOθQµi(si|θ)(yi−Qµi(si|θ)))During an actor update, a minibatch of tuples {τ j} are sampled from Da and a CACLA-style positive-temporal difference update is applied to each tuple’s respective actor,δ j = y j−maxµ Qµ(s j|θ)if δ j > 0 : θ ← θ +α(1nOθAµ j(s j|θ)(a j−Aµ j(s j|θ)))Target network: Similarly to [Mnih et al., 2015], we used a separate target network when comput-ing the target values yi during updates. The target network is fixed for 500 iterations, after which it isupdated by copying the most up-to-date weights θ , and then held fixed again for another 500 iterations.Hyperparameter settings: m = 32 steps are simulated before each update. Updates are performedusing stochastic gradient descent with momentum, with a learning rate, α = 0.001, a weight decay of0.0005 for regularization, and momentum set to 0.9. εt is initialized to 0.9 and linearly annealed to 0.2after 50k iterations. Similarly, the temperature Tt used in Boltzmann exploration is initialized to 20 andlinearly annealed to 0.025 over 50k iterations.Initialization: The actor and critic buffers are initialized with 50k tuples from a random policy thatselects an action uniformly from the initial action set for each cycle. Each of the initial actions aremanually associated with a subset of the available actors using the actor bias. When recording an initialexperience tuple, µ will be randomly assigned to be one of the actors in its respective set.27Algorithm 4 MACE1: θ ← random weights2: Initialize Dc and Da with tuples from a random policy3: while not done do4: for step = 1, ...,m do5: s← character and terrain initial state6: µ ← select each actor with probability pµi = exp(Qµi (s|θ)/Tt)∑ j(exp(Qµ j (s|θ)/Tt))7: λ ← Ber(εt)8: a← Aµ(s|θ)+λNt9: Apply a and simulate forward 1 cycle10: s′← character and terrain terminal state11: r← reward12: τ ← (s,a,r,s′,µ,λ )13: if λ = 1 then14: Store τ in Da15: else16: Store τ in Dc17: end if18: end for19: Update critic:20: Sample minibatch of n tuples {τi = (si,ai,ri,s′i,µi,λi)} from Dc21: yi← ri+ γ maxµ Qµ(s′i|θ) for each τi22: θ ← θ +α (1n ∑iOθQµi(si|θ)(yi−Qµi(si|θ)))23: Update actors:24: Sample minibatch of n tuples {τ j = (s j,a j,r j,s′j,µ j,λ j)} from Da25: for each τ j do26: y j←maxµ Qµ(s j|θ)27: y′j← r j + γ maxµ Qµ(s′j|θ)28: if y′j > y j then29: θ ← θ +α (1nOθAµ j(s j|θ)(a j−Aµ j(s j|θ)))30: end if31: end for32: end while4.7 ResultsThe motions resulting from the learned policies are best seen in the supplemental video. The majority ofthe results we present are on policies for the dog, as learned for the 7 different classes of terrain shown inFigure 4.13. By default, each policy uses three actor-critic pairs. The final policies are the result of 300kiterations of training, collecting about 10 million tuples, and requiring approximately 20h of computetime on a 16-core cluster, using a multithreaded C++ implementation. All networks are built and trained28using Caffe [Jia et al., 2014b]. Source code is available at https://github.com/xbpeng/DeepTerrainRL.The learning time remains dominated by the cost of simulating the motion of the character rather thanthe neural network updates. 
Because of this, we did not pursue the use of GPU-accelerated training.Once the control policies have been learned, all results run faster than real time. Exploration is turnedoff during the evaluation of the control policies. Separate policies are learned for each class of terrain.The development of a single policy that would be capable of all these terrain classes is left as futurework.In order to evaluate attributes of the learning algorithm, we use the mean distance before a fall as ourperformance metric, as measured across 250 evaluations. We do not use the Q-function estimates com-puted by the policies for evaluation as these can over time provide inflated estimates of the cumulativerewards that constitute the true objective.We compare the final performance of using a mixture of three actor-critic experts, i.e., MACE(3), totwo alternatives. First, we compare to Double Q-learning using a set of DNN approximators, Qi(s), foreach of the 8 initial actions [Van Hasselt et al., 2015]. This is motivated by the impressive success ofDNNs in learning control policies for discrete action spaces [Mnih et al., 2015]. We use a similar DNNarchitecture to the network shown in Figure 4.7, except for the exclusion of the actor subnetworks. Sec-ond, we compare against the CACLA algorithm [Van Hasselt, 2012]. In our case, CACLA is identicalto MACE(1), except without the use of a critic buffer, which means that CACLA uses all experiencetuples to update the critic, while MACE(1) only uses the tuples generated without applied explorationnoise. In practice, we often find the performance of CACLA and MACE(1) to be similar.Table 5.4 gives the final performance numbers for all the control policies, including the MACE/Q/-CACLA comparisons. The final mean performance may not always tell the full story, and thus we alsocompare the learning curves, as we shall discuss shortly. MACE(3) significantly outperforms Q-learningand CACLA for all three terrain classes used for comparison. In two out of three terrain classes, CA-CLA outperforms Q-learning with discrete actions. This may reflect the importance of being able tolearn new actions. As measured by their final performance, all three approaches rank the terrains in thesame order in terms of difficulty: narrow-gaps (hardest), slopes-mixed, mixed (easiest). Thetight-gaps terrain proved to be the most difficult for MACE(3); it consists of the same types of gapsas narrow-gaps, but with half the recovery distance between sequences of gaps.In addition to the dog, we apply MACE(3) to learn control policies for other characters and terraintypes, as shown in Figure 4.12. The raptor model uses a 4-states-per-step finite state machine controller,with fixed 0.0825 s state transitions and a final state transition occurring when the swing foot strikes theground. The control policy is invoked at the start of each step to make a decision with regard to the FSMparameters to apply in the next step. It uses a set of 28 control parameters, similar to those of the dog.We also explored goat locomotion by training the dog to climb never-ending sequences of steep stepsthat have variable widths and heights. A set of 5 initial actions are provided, corresponding to jumps ofheights varying from 0.2 m to 0.75 m, and 3 of which are used as the initial actor biases. 
Though the29Scenario Performance (m)dog + mixed: MACE(3) 2094dog + mixed: Q 194dog + mixed: CACLA 1095dog + slopes-mixed: MACE(3) 1364dog + slopes-mixed: Q 110dog + slopes-mixed: CACLA 739dog + narrow-gaps: MACE(3) 176dog + narrow-gaps: Q 74dog + narrow-gaps: CACLA 38dog + tight-gaps: MACE(3) 44dog + slopes-gaps: MACE(3) 1916dog + slopes-steps: MACE(3) 3782dog + slopes-walls: MACE(3) 4312goat + variable-steps: MACE(3) 1004raptor + mixed: MACE(3) 1111raptor + slopes-mixed: MACE(3) 562raptor + narrow-gaps: MACE(3) 145Table 4.1: Performance of the final policies.character uses the same underlying model as the dog, it is rendered as a goat in the figures and video ashomage to the inspiration for this scenario.In Figure 4.8, we provide learning curve comparisons for different architectures and algorithm fea-tures. As before, we evaluate performance by measuring the mean distance before a fall across 100randomly generated terrains. Figure 4.8(a) compares MACE(3) with discrete-action-set Q-learning andCACLA, as measured for the dog on mixed terrain. The Q-learning plateaus after 50k iterations, whileCACLA continues to improve with further learning iterations, but at a slower rate than MACE(3).Figure 4.8(b) shows the effects of disabling features of MACE that are enabled by default. Boltz-mann exploration has a significant impact on learning. This is likely because it helps to encouragespecialization by selecting actors with high predicted Q-values more frequently, while also enablingexploration of multiple actors in cases where they share similar predicted values, thus helping to betterdisambiguate between the utilities of the different actors. The actor buffer has a large impact, as it allowslearning to focus on exploratory actions that yielded an improvement. The initial use of actor bias alsoproves to be significant, which we attribute to the breaking of initial symmetry between the actor-criticpairs. Lastly, the critic buffer, which enables the critics to learn exclusively from the actions of the deter-ministic actor policies without exploration noise, does not show a significant benefit for this particularexample. However, we found the critic buffer to yield improved learning in earlier experiments, and wefurther keep this feature because of the more principled learning that it enables.Figure 4.8(c) shows the impact of the number of actor-critic pairs, comparing MACE(1), MACE(2),MACE(3), MACE(6) for the dog on mixed terrain. MACE(2) and MACE(3) yield the best learning30performance, while MACE(6) results in a drop in learning performance, and MACE(1) is the worst.A larger number of actor-critic pairs allows for increased specialization, but results in fewer learningtuples for each actor-critic pair, given that our current MACE learning method only allows for a singleactor-critic pair to learn from a particular tuple.(a) MACE(3) vs CACLA and Q-learning on mixed terrain (b) MACE(3) with features disabled.(c) MACE(n) for various number of actors.Figure 4.8: Comparisons of learning performance.Learning good control policies requires learning new actions that yield improved performance. TheMACE architecture supports actor-critic specialization by having multiple experts that can specializein different motions, as well as taking advantage of unique biases for each actor to encourage diversityin their specializations. Figure 4.9 illustrates the space of policy actions early and late in the learn-ing process. 
This visualization is created using t-SNE (t-distributed Stochastic Neighbor Embedding)[van der Maaten and Hinton, 2008], where a single embedding is constructed using all the action sam-ples collected at various iterations in the training process. Samples from each iteration are then renderedseparately. The numbers embedded in the plots correspond to the set of 8 initial actions. The actionsbegin nearby the initial actions, then evolve over time as demanded by the task, while remaining spe-cialized. The evolution of the actions during learning is best seen in the supplementary videos.31(a) 10k iterations (b) 300k iterationsFigure 4.9: Action space evolution for using MACE(3) with initial actor bias.(a) 10k iterations (b) 300k iterationsFigure 4.10: Action space evolution for using MACE(3) without initial actor bias.To encourage actor specialization, the actor-specific initialization bias helps to break symmetry earlyon in the learning process. Without this initial bias, the benefit of multiple actor-critic experts is dimin-ished. Figure 4.10 illustrates that actor specialization is much less evident when all actors receive thesame initial bias, i.e., the fast run for the dog.We test the generalization capability of the MACE(3) policy for the dog in the slopes-mixed ter-rain, which can be parameterized according to a scale parameter, ψ , that acts a multiplier for the size ofall the gaps, steps, and walls. Here, ψ > 1 implies more difficult terrains, while ψ < 1 implies easier ter-rains. Figure 4.11 shows that the performance degrades gracefully as the terrain difficulty is increased.To test the performance of the policies when applied to unfamiliar terrain, i.e., terrains not encoun-tered during training, we apply the policy trained in mixed to slopes-mixed, slopes-gaps toslopes-mixed, and slopes-mixed to slopes-gaps. Table 5.4 summarizes the results of eachscenario. The policies perform poorly when encountering unfamiliar obstacles, such as slopes for themixed policy and walls for the slopes-gaps policy. The slopes-mixed policy performs well32Scenario Perf. (m)mixed policy in slopes-mixed 80slopes-gaps policy in slopes-mixed 35slopes-mixed policy in slopes-gaps 1545Table 4.2: Performance of applying policies to unfamiliar terrains.Figure 4.11: Policy generalization to easier and more-difficult terrains.in slopes-gaps, since it has been previously exposed to gaps in slopes-mixed, but nonethelessdoes not reach a similar level of performance as the policy trained specifically for slopes-gaps.4.8 DiscussionThe use of a predefined action parameterization helps to provide a significant degree of action ab-straction as compared to other recent deep-RL methods that attempt to learn control policies that aredirectly based on joint torques, e.g., [Levine and Abbeel, 2014, Levine and Koltun, 2014, Mordatch andTodorov, 2014, Mordatch et al., 2015b, Heess et al., 2015, Lillicrap et al., 2015b]. Instead of focusingon the considerable complexities and challenges of tabula rasa learning we show that deep-RL enablesnew capabilities for physics-based character animation, as demonstrated by agile dynamic locomotionacross a multitude of terrain types. 
In future work, it will be interesting to explore the best ways of learn-ing action abstractions and of determining the benefits, if any, of working with actuation that allows forcontrol of stiffness, as allowed by PD-controllers or muscles.Our overall learning approach is quite different from methods that interleave trajectory optimization(to generate reference data) and neural network regression for supervised learning of the control pol-icy [Levine and Abbeel, 2014, Levine and Koltun, 2014, Mordatch and Todorov, 2014, Mordatch et al.,2015b]. The key role of trajectory optimization makes these methods strongly model-based, with thecaveat that the models themselves can possibly be learned. In contrast, our approach does not need anoracle that can provide reference solutions. Another important difference is that much of our network isdevoted to processing the high-dimensional terrain description that comprises a majority of our high-D33state description.Recent methods have also demonstrated the use of more direct policy-gradient methods with deepneural network architectures. This involves chaining a state-action value function approximator in se-quence with a control policy approximator, which then allows for backpropagation of the value functiongradients back to the control policy parameters, e.g., [Silver et al., 2014b, Lillicrap et al., 2015b]. Recentwork applies this approach with predefined parameterized action abstractions [Hausknecht and Stone,2015a]. We leave comparisons to these methods as important future work. To our knowledge, thesemethods have not yet been shown to be capable of highly dynamic terrain-adaptive motions. Our workshares many of the same challenges of these methods, such as making stationary-distribution assump-tions which are then violated in practice. We do not yet demonstrated the ability to work directly withinput images to represent character state features, as shown by others [Lillicrap et al., 2015b].We have encountered difficult terrains for which learning does not succeed, such as those that havesmall scale roughness, i.e., bumps 20-40 cm in width that are added to the other terrain features. Withextensive training of 500k iterations, the dog is capable of performing robust navigation across thetight-gaps terrain, thereby in some sense “aiming for” the best intermediate landing location.However, we have not yet seen that a generalized version of this capability can be achieved, i.e., onethat can find a suitable sequence of foot-holds if one exists. Challenging terrains may need to learn newactions and therefore demand a carefully staged learning process. A common failure mode is that ofintroducing overly-challenging terrains which therefore always cause early failures and a commensuratelack of learning progress.There remain many architecture hyperparameters that we set based on limited experimentation,including the number of layers, the number of units per layer, regularization parameters, and learningbatch size. The space of possible network architectures is large and we have explored this in only aminimal way. The magnitude of exploration for each action parameter is currently manually specified;we wish to develop automated solutions for this. Also of note is that all the actors and critics for anygiven controller currently share most of their network structure, branching only near the last layers of thenetwork. 
We would like to explore the benefit, if any, of allowing individual critic and actor networks,thereby allowing additional freedom to utilize more representational power for the relevant aspects oftheir specialization. We also note that a given actor in the mixture may become irrelevant if its expectedperformance, as modeled by its critic, is never better than that of its peers for all encountered states.This occurs for MACE(3) when applied to the goat on variable-steps, where the distribution ofactor usages is (57, 0, 43), with all figures expressed as percentages. Other example usage patterns are(43,48,9) for the dog on mixed, (23,66,11) for dog on slopes-mixed, and (17,73,10) for dog onnarrow-gaps. One possible remedy to explore in future work would be to reinitialize an obsoleteactor-critic pair with a copy of one of its more successful peers.We wish to develop principled methods for integrating multiple controllers that are each trained fora specific class of terrain. This would allow for successful divide-and-conquer development of control34strategies. Recent work has tackled this problem in domains involving discrete actions [Parisotto et al.,2015]. One obvious approach is to use another mixture model, wherein each policy is queried for itsexpected Q-value, which is in turn modeled as the best Q-value of its individual actors. In effect, eachpolicy would perform its own analysis of the terrain for the suitability of its actions. However, thisignores the fact that the Q-values also depend on the expected distributions of the upcoming characterstates and terrain, which remain specific to the given class of terrain. Another problem is the lack of amodel for the uncertainty of Q-value estimates. Nevertheless, this approach may yield reasonable resultsin many situations that do not demand extensive terrain-specific anticipation. The addition of modelsto predict the state would allow for more explicit prediction and planning based on the available setof controllers. We believe that the tradeoff between implicit planning, as embodied by the actor-criticnetwork, and explicit planning, as becomes possible with learned forward models, will be a fruitful areafor further reseach.We have not yet demonstrated 3D terrain adaptive locomotion. We expect that the major challengefor 3D will arise in the case of terrain structure that requires identifying specific feasible foothold se-quences. The capacity limits of a MACE(n) policy remain unknown.Control policies for difficult terrains may need to be learned in a progressive fashion via some formof curriculum learning, especially for scenarios where the initial random policy performs so poorlythat no meaningful directions for improvement can be found. Self-paced learning is also a promisingdirection, where the terrain difficulty is increased once a desired level of competence is achieved withthe current terrain difficulty. It may be possible to design the terrain generator to work in concertwith the learning process by synthesizing terrains with a bias towards situations that are known to beproblematic. This would allow for “purposeful practice” and for learning responses to rare events. Otherpaths towards more data-efficient learning include the ability to transfer aspects of learned solutionsbetween classes of terrain, developing an explicit reduced-dimensionality action space, and learningmodels of the dynamics, e.g., [Assael et al., 2015]. 
It would also be interesting to explore the coevolution of the character and its control policies.

It is not yet clear how to best enable control of the motion style. The parameterization of the action space, the initial bootstrap actions, and the reward function all provide some influence over the final motion styles of the control policy. Available reference motions could be used to help develop the initial actions or to help design style rewards.

Figure 4.12: Raptor and goat control policies. (a) Raptor, mixed terrain; (b) Raptor, slopes-mixed terrain; (c) Raptor, narrow-gaps terrain; (d) Goat, variable-steps terrain.

Figure 4.13: Dog control policies. (a) mixed terrain; (b) slopes-mixed terrain; (c) narrow-gaps terrain; (d) tight-gaps terrain; (e) slopes-gaps terrain; (f) slopes-steps terrain; (g) slopes-walls terrain.

Chapter 5

Action Parameterizations

Figure 5.1: Neural network control policies trained for various simulated planar characters.

The use of deep reinforcement learning allows for high-dimensional state descriptors, but little is known about how the choice of action representation impacts the learning difficulty and the resulting performance. The work presented in Chapter 4 leveraged a hand-crafted FSM to provide the policy with a high-level action parameterization. In this chapter, we reduce the manual effort needed to craft high-level action representations by training policies that directly use low-level action parameterizations. We compare the impact of four different low-level action parameterizations (torques, muscle activations, target joint angles, and target joint-angle velocities) in terms of learning time, policy robustness, motion quality, and policy query rates. Our results are evaluated on a gait-cycle imitation task for multiple planar articulated figures and multiple gaits. We demonstrate that the local feedback provided by higher-level action parameterizations can significantly impact the learning, robustness, and motion quality of the resulting policies.

5.1 Introduction

The introduction of deep learning models to reinforcement learning (RL) has enabled policies to operate directly on high-dimensional, low-level state features. As a result, deep reinforcement learning has demonstrated impressive capabilities, such as developing control policies that can map from input image pixels to output joint torques [Lillicrap et al., 2015a]. However, the motion quality and robustness often fall short of what has been achieved with hand-crafted action abstractions, e.g., Coros et al. [2011a], Geijtenbeek et al. [2013]. While much is known about the learning of state representations, the choice of action parameterization is a design decision whose impact is not yet well understood.

Joint torques can be thought of as the most basic and generic representation for driving the movement of articulated figures, given that muscles and other actuation models eventually result in joint torques. However, this ignores the intrinsically embodied nature of biological systems, particularly the synergy between control and biomechanics. Passive dynamics, such as elasticity and damping from muscles and tendons, play an integral role in shaping motions: they provide mechanisms for energy storage, as well as mechanical impedance that generates instantaneous feedback without requiring any explicit computation. Loeb coins the term preflexes [Loeb, 1995] to describe these effects, and their impact on motion control has been described as providing "intelligence by mechanics" [Blickhan et al., 2007].
Thiscan also be thought of as a kind of partitioning of the computations between the control and physicalsystem.In this paper we explore the impact of four different actuation models on learning to control dy-namic articulated figure locomotion: (1) torques (Tor); (2) activations for musculotendon units (MTU);(3) target joint angles for proportional-derivative controllers (PD); and (4) target joint velocities (Vel).Because DeepRL methods are capable of learning control policies for all these models, it now becomespossible to directly assess how the choice of actuation model affects the learning difficulty. We alsoassess the learned policies with respect to robustness, motion quality, and policy query rates. We showthat action spaces which incorporate local feedback can significantly improve learning speed and per-formance, while still preserving the generality afforded by torque-level control. Such parameterizationsalso allow for more complex body structures and subjective improvements in motion quality.Our specific contributions are: (1) We introduce a DeepRL framework for motion imitation tasks;(2) We evaluate the impact of four different actuation models on the learned control policies accordingto four criteria; and (3) We propose an optimization approach that combines policy learning and actuatoroptimization, allowing neural networks to effective control complex muscle models.5.2 Related WorkDeepRL has driven impressive recent advances in learning motion control, i.e., solving for continuous-action control problems using reinforcement learning. All four of the actions types that we explore haveseen previous use in the machine learning literature. Wawrzyn´Ski and Tanwani [2013] use an actor-critic approach with experience replay to learn skills for an octopus arm (actuated by a simple musclemodel) and a planar half cheetah (actuated by joint-based PD-controllers). Recent work on deterministicpolicy gradients [Lillicrap et al., 2015a] and on RL benchmarks, e.g., OpenAI Gym, generally use jointtorques as the action space, as do the test suites in recent work [Schulman et al., 2015] on usinggeneralized advantage estimation. Other recent work uses: the PR2 effort control interface as a proxyfor torque control [Levine et al., 2015]; joint velocities [Gu et al., 2016a]; velocities under an implicit39control policy [Mordatch et al., 2015a]; or provide abstract actions [Hausknecht and Stone, 2015b].Our learning procedures are based on prior work using actor-critic approaches with positive temporaldifference updates [Van Hasselt, 2012].Work in biomechanics has long recognized the embodied nature of the control problem and theview that musculotendon systems provide “preflexes” [Loeb, 1995] that effectively provide a form in-telligence by mechanics [Blickhan et al., 2007], as well as allowing for energy storage. The controlstrategies for physics-based character simulations in computer animation also use all the forms of ac-tuation that we evaluate in this paper. Representative examples include quadratic programs that solvefor joint torques [de Lasa et al., 2010], joint velocities for skilled bicycle stunts [Tan et al., 2014a],muscle models for locomotion [Wang et al., 2012, Geijtenbeek et al., 2013], mixed use of feed-forwardtorques and joint target angles [Coros et al., 2011a], and joint target angles computed by learned linear(time-indexed) feedback strategies [Liu et al., 2016a]. 
Lastly, control methods in robotics use a mix ofactuation types, including direct-drive torques (or their virtualized equivalents), series elastic actuators,PD control, and velocity control. These methods often rely heavily on model-based solutions and thuswe do not describe these in further detail here.5.3 Task Representation5.3.1 Reference MotionIn our task, the goal of a policy is to imitate a given reference motion {q∗t } which consists of a se-quence of kinematic poses q∗t in reduced coordinates. The reference velocity q˙∗t at a given time t isapproximated by finite-difference q˙∗t ≈ q∗t+4t−q∗t4t . Reference motions are generated via either using arecorded simulation result from a preexisting controller (“Sim”), or via manually-authored keyframes.Since hand-crafted reference motions may not be physically realizable, the goal is to closely reproducea motion while satisfying physical constraints.5.3.2 StatesTo define the state of the agent, a feature transformation Φ(q, q˙) is used to extract a set of features fromthe reduced-coordinate pose q and velocity q˙. The features consist of the height of the root (pelvis)from the ground, the position of each link with respect to the root, and the center of mass velocity ofeach link. When training a policy to imitate a cyclic reference motion {q∗t }, knowledge of the motionphase can help simplify learning. Therefore, we augment the state features with a set of target featuresΦ(q∗t , q˙t∗), resulting in a combined state represented by st = (Φ(qt , q˙t),Φ(q∗t , q˙t∗)).405.3.3 ActionsWe train separate policies for each of the four actuation models, as described below. Each actuationmodel also has related actuation parameters, such as feedback gains for PD-controllers and musculo-tendon properties for MTUs. These parameters can be manually specified, as we do for the PD and Velmodels, or they can be optimized for the task at hand, as for the MTU models. Table 5.1 provides a listof actuator parameters for each actuation model.Target Joint Angles (PD): Each action represents a set of target angles qˆ, where qˆi specifies thetarget angles for joint i. qˆ is applied to PD-controllers which compute torques according toτ i = kip(qˆi−qi)+ kid( ˆ˙qi− q˙i)where ˆ˙qi = 0, and kip and kid are manually-specified gains.Target Joint Velocities (Vel): Each action specifies a set of target velocities ˆ˙q which are used tocompute torques according toτ i = kid( ˆ˙qi− q˙i)where the gains kid are specified to be the same as those used by the PD-controllers for target angles.Torques (Tor): Each action directly specifies torques for every joint, and constant torques are ap-plied for the duration of a control step. Due to torque limits, actions are bounded by manually specifiedlimits for each joint. Unlike the other actuation models, the torque model does not require additional ac-tuator parameters, and can thus be regarded as requiring the least amount of domain knowledge. Torquelimits are excluded from the actuator parameter set as they are common for all parameterizations.Muscle Activations (MTU): Each action specifies activations for a set of musculotendon units(MTU). Detailed modeling and implementation information are available in Wang et al. [2012]. EachMTU is modeled as a contractile element (CE) attached to a serial elastic element (SE) and parallelelastic element (PE). 
The force exerted by the MTU can be calculated according to
\[
F_{MTU} = F_{SE} = F_{CE} + F_{PE}
\]
Both F_SE and F_PE are modeled as passive springs, while F_CE is actively controlled according to
\[
F_{CE} = a_{MTU} \, F_0 \, f_l(l_{CE}) \, f_v(v_{CE})
\]
with a_MTU being the muscle activation, F_0 the maximum isometric force, and l_CE and v_CE the length and velocity of the contractile element. The functions f_l(l_CE) and f_v(v_CE) represent the force-length and force-velocity relationships, modeling the variations in the maximum force that can be exerted by a muscle as a function of its length and contraction velocity. Analytic forms are available in Geyer et al. [2003]. Activations are bounded between [0, 1]. The length of each contractile element l_CE is included among the state features. To simplify control and reduce the number of internal state parameters per MTU, the policies directly control muscle activations instead of indirectly through excitations [Wang et al., 2012].

Actuation Model                Actuator Parameters
Target Joint Angles (PD)       proportional gains k_p, derivative gains k_d
Target Joint Velocities (Vel)  derivative gains k_d
Torques (Tor)                  none
Muscle Activations (MTU)       optimal contractile element length, serial elastic element rest length, maximum isometric force, pennation, moment arm, maximum moment arm joint orientation, rest joint orientation
Table 5.1: Actuation models and their respective actuator parameters.

5.3.4 Reward

The reward function consists of a weighted sum of terms that encourage the policy to track a reference motion:
\[
r = w_{pose} r_{pose} + w_{vel} r_{vel} + w_{end} r_{end} + w_{root} r_{root} + w_{com} r_{com}
\]
\[
r_{pose} = \exp\!\left(-\|q^* - q\|^2_W\right), \qquad r_{vel} = \exp\!\left(-\|\dot{q}^* - \dot{q}\|^2_W\right)
\]
\[
r_{end} = \exp\!\left(-40 \sum_e \|x^*_e - x_e\|^2\right)
\]
\[
r_{root} = \exp\!\left(-10 \, (h^*_{root} - h_{root})^2\right), \qquad r_{com} = \exp\!\left(-10 \, \|\dot{x}^*_{com} - \dot{x}_{com}\|^2\right)
\]
\[
w_{pose} = 0.5, \quad w_{vel} = 0.05, \quad w_{end} = 0.15, \quad w_{root} = 0.1, \quad w_{com} = 0.2
\]
r_pose penalizes deviation of the character pose from the reference pose, and r_vel penalizes deviation of the joint velocities. r_end and r_root account for the position error of the end-effectors and root. r_com penalizes deviations of the center-of-mass velocity from that of the reference motion. q and q* denote the character pose and reference pose represented in reduced coordinates, while q̇ and q̇* are the respective joint velocities. W is a manually-specified per-joint diagonal weighting matrix. h_root is the height of the root from the ground, and ẋ_com is the center-of-mass velocity.

5.3.5 Initial State Distribution

We design the initial state distribution, p_0(s), to sample states uniformly along the reference trajectory (Figure 5.2). At the start of each episode, q* and q̇* are sampled from the reference trajectory and used to initialize the pose and velocity of the agent. This helps guide the agent to explore states near the target trajectory. Figure 5.2 illustrates a comparison between fixed and sampled initial state distributions.

Figure 5.2: Left: a fixed initial state biases the agent to regions of the state space near the initial state, particularly during early iterations of training. Right: initial states sampled from the reference trajectory allow the agent to explore the state space more uniformly around the reference trajectory.

5.4 Learning Framework

For our learning algorithm, we adapt the positive temporal difference (PTD) update proposed by Van Hasselt [2012]. Stochastic policies are used during training for exploration, while deterministic policies are used for evaluation at runtime.
The choice between a stochastic and deterministic policycan be specified by the addition of a binary indicator variable λ ∈ [0,1]at = µ(st |θpi)+λN(0,Σ)where λ = 1 corresponds to a stochastic policy with exploration noise, and λ = 0 corresponds to adeterministic policy that always selects the mean of the distribution. Noise from a stochastic policy willresult in a state distribution that differs from that of the deterministic policy at runtime. To mitigate thisdiscrepancy, we incorporate ε-greedy exploration to the original Gaussian exploration strategy. Duringtraining, λ is determined by a Bernoulli random variable λ ∼ Ber(ε), where λ = 1 with probabilityε ∈ [0,1]. The exploration rate ε is annealed linearly from 1 to 0.2 over 500k iterations, which slowlyadjusts the state distribution encountered during training to better resemble the distribution at runtime.Since the policy gradient is defined for stochastic policies, only tuples recorded with exploration noise(i.e. λ = 1) can be used to update the actor, while the critic can be updated using all tuples.Training proceeds episodically, where the initial state of each episode is sampled from p0(s), andthe episode duration is drawn from an exponential distribution with a mean of 2s. To discourage falling,an episode will also terminate if any part of the character’s trunk makes contact with the ground foran extended period of time, leaving the agent with zero reward for all subsequent steps. Algorithm 7summarizes the complete learning process. A summary of the hyperparameter settings is available inTable 5.2.43Parameter Value Descriptionγ 0.9 cumulative reward discount factorαpi 0.001 actor learning rateαV 0.01 critic learning ratemomentum 0.9 stochastic gradient descent momentumφ weight decay 0 L2 regularizer for critic parametersθpi weight decay 0.0005 L2 regularizer for actor parametersminibatch size 32 tuples per stochastic gradient descent stepreplay memory size 500000 number of the most recent tuples stored for future updatesTable 5.2: Training hyperparameters.Bounded Action Space: Properties such as torque and neural activation limits result in boundson the range of values that can be assumed by actions for a particular parameterization. Improperenforcement of these bounds can lead to unstable learning as the gradient information outside the boundsmay not be reliable [Hausknecht and Stone, 2015b]. To ensure that all actions respect their bounds, weadopt a method similar to the inverting gradients approach proposed by Hausknecht and Stone [2015b].Let Oa = (a− µ(s))A(s,a) be the empirical action gradient from the policy gradient estimate of aGaussian policy. Given the lower and upper bounds [li,ui] of the ith action parameter, the boundedgradient of the ith action parameter Oa˜i is determined according toOa˜i =li−µ i(s), µ i(s)< li and Oai < 0ui−µ i(s), µ i(s)> ui and Oai > 0Oai, otherwiseUnlike the inverting gradients approach, which scales all gradients depending on proximity to thebounds, this method preserves the empirical gradients when bounds are respected, and alters the gradi-ents only when bounds are violated.MTU Actuator Optimization: Actuation models such as MTUs are defined by further parameterswhose values impact performance [Geijtenbeek et al., 2013]. Geyer et al. [2003] uses existing anatom-ical estimates for humans to determine MTU parameters, but such data is not be available for morearbitrary creatures. Alternatively, Geijtenbeek et al. 
[2013] uses covariance matrix adaptation (CMA),a derivative-free evolutionary search strategy, to simultaneously optimize MTU and policy parameters.This approach is limited to policies with reasonably low dimensional parameter spaces, and is thus ill-suited for neural network models with hundreds of thousands of parameters. To avoid manual-tuning ofactuator parameters, we propose a heuristic approach that alternates between policy learning and actua-tor optimization. The actuator parameters ψ can be interpreted as a parameterization of the dynamics ofthe system p(s′|s,a,ψ). The expected cumulative reward J(θpi) can then be re-parameterized accordingto44J(θpi ,ψ) =∫Sd(s|θpi ,ψ)∫Apiθ (s,a|θpi)A(s,a)da dswhere d(s|θpi ,ψ) =∫S∑Tt=0 γ t p0(s0)p(s0→ s|t,piθ ,ψ)ds0 is the discounted state distribution [Silveret al., 2014a]. θpi and ψ are then learned in an alternating fashion as per Algorithm 5. This alternatingmethod optimizes both the control and dynamics in order to maximize the expected value of the agent,as analogous to the role of evolution in biomechanics. During each pass, the policy parameters θpi aretrained to improve the agent’s expected value for a fixed set of actuator parameters ψ . Next, ψ is opti-mized using CMA to improve performance while keeping θpi fixed. The expected value of each CMAsample of ψ is estimated using the average (undiscounted) cumulative reward over multiple rollouts.Figure 5.3: Simulated articulated figures and their state representation. Revolute joints connect alllinks. left-to-right: 7-link biped; 19-link raptor; 21-link dog; State features: root height,relative position (red) of each link with respect to the root and their respective linear velocity(green).Figure 5.4: Neural Network Architecture. Each policy is represented by a three layered network,with 512 and 256 fully-connected hidden units, followed by a linear output layer.Algorithm 5 Alternating Actuator Optimization1: θpi ← θ 0pi2: ψ ← ψ03: while not done do4: θpi ← argmaxθ ′pi J(θ ′pi ,ψ) with Algorithm 75: ψ ← argmaxψ ′ J(θpi ,ψ ′) with CMA6: end while45Algorithm 6 Actor-critic Learning Using Positive Temporal Differences1: θpi ← θ 0pi2: θV ← θ 0V3: while not done do4: for step = 1, ...,m do5: s← start state6: λ ← Ber(εt)7: a← µ(s|θpi)+λN(0,Σ)8: Apply a and simulate forward 1 step9: s′← end state10: r← reward11: τ ← (s,a,r,s′,λ )12: store τ in replay memory13: if episode terminated then14: Sample s0 from p0(s)15: Reinitialize state s to s016: end if17: end for18: Update critic:19: Sample minibatch of n tuples {τi = (si,ai,ri,λi,s′i)} from replay memory20: for each τi do21: δi← ri+ γV (s′i|θV )−V (si|θV )22: θV ← θV +αV 1nOθV V (si|θV )δi23: end for24: Update actor:25: Sample minibatch of n tuples {τ j = (s j,a j,r j,λ j,s′j)} from replay memory where λ j = 126: for each τ j do27: δ j← r j + γV (s′j|θV )−V (s j|θV )28: if δ j > 0 then29: Oa j← a j−µ(s j|θpi)30: Oa˜ j← BoundActionGradient(Oa j,µ(s j|θpi))31: θpi ← θpi +αpi 1nOθpiµ(s j|θpi)Σ−1Oa˜ j32: end if33: end for34: end while5.5 ResultsThe motions are best seen in the supplemental video https://youtu.be/L3vDo3nLI98. We evaluate theaction parameterizations by training policies for a simulated 2D biped, dog, and raptor as shown inFigure 5.3. Depending on the agent and the actuation model, our systems have 58–214 state dimensions,6–44 action dimensions, and 0–282 actuator parameters, as summarized in Table 5.3. 
The MTU models46Figure 5.5: Learning curves for each policy during 1 million iterations.have at least double the number of action parameters because they come in antagonistic pairs. As well,additional MTUs are used for the legs to more accurately reflect bipedal biomechanics. This includesMTUs that span multiple joints.Character + Actuation Model State Parameters Action Parameters Actuator ParametersBiped + Tor 58 6 0Biped + Vel 58 6 6Biped + PD 58 6 12Biped + MTU 74 16 114Raptor + Tor 154 18 0Raptor + Vel 154 18 18Raptor + PD 154 18 36Raptor + MTU 194 40 258Dog + Tor 170 20 0Dog + Vel 170 20 20Dog + PD 170 20 40Dog + MTU 214 44 282Table 5.3: The number of state, action, and actuation model parameters for different charactersand actuation models.Each policy is represented by a three layer neural network, as illustrated in Figure 5.4 with 512and 256 fully-connected units, followed by a linear output layer where the number of output units varyaccording to the number of action parameters for each character and actuation model. ReLU activationfunctions are used for both hidden layers. Each network has approximately 200k parameters. The valuefunction is represented by a similar network, except having a single linear output unit. The policies arequeried at 60Hz for a control step of about 0.0167s. Each network is randomly initialized and trainedfor about 1 million iterations, requiring 32 million tuples, the equivalent of approximately 6 days ofsimulated time. Each policy requires about 10 hours for the biped, and 20 hours for the raptor and dog47Figure 5.6: Simulated motions. The biped uses an MTU action space while the dog and raptor aredriven by a PD action space.on an 8-core Intel Xeon E5-2687W.Only the actuator parameters for MTUs are optimized with Algorithm 5, since the parameters forthe other actuation models are few and reasonably intuitive to determine. The initial actuator parametersψ0 are manually specified, while the initial policy parameters θ 0pi are randomly initialized. Each passoptimizes ψ using CMA for 250 generations with 16 samples per generation, and θpi is trained for250k iterations. Parameters are initialized with values from the previous pass. The expected valueof each CMA sample of ψ is estimated using the average cumulative reward over 16 rollouts with aduration of 10s each. Separate MTU parameters are optimized for each character and motion. Eachset of parameters is optimized for 6 passes following Algorithm 5, requiring approximately 50 hours.Figure 5.7 illustrates the performance improvement per pass. Figure 5.8 compares the performance ofMTUs before and after optimization. For most examples, the optimized actuator parameters significantlyimprove learning speed and final performance. For the sake of comparison, after a set of actuatorparameters has been optimized, a new policy is retrained with the new actuator parameters and its48performance compared to the other actuation models.Policy Performance and Learning Speed: Figure 5.5 shows learning curves for the policies andthe performance of the final policies are summarized in Table 5.4. Performance is evaluated using thenormalized cumulative reward (NCR), calculated from the average cumulative reward over 32 episodeswith lengths of 10s, and normalized by the maximum and minimum cumulative reward possible foreach episode. No discounting is applied when calculating the NCR. The initial state of each episodeis sampled from the reference motion according to p(s0). 
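For concreteness, the NCR computation can be sketched as follows, assuming per-step rewards bounded in [0, 1] as with the weighted exponential reward terms above; the helper below is otherwise a placeholder rather than the thesis implementation.

import numpy as np

def normalized_cumulative_reward(episode_returns, steps_per_episode,
                                 r_min=0.0, r_max=1.0):
    # episode_returns: undiscounted reward sums from the evaluation episodes.
    # The best and worst possible cumulative rewards scale with episode length,
    # assuming each step's reward lies in [r_min, r_max].
    best = r_max * steps_per_episode
    worst = r_min * steps_per_episode
    ncr = (np.asarray(episode_returns) - worst) / (best - worst)
    return ncr.mean(), ncr.std()

# e.g. 32 evaluation episodes of 10 s at a 60 Hz control rate:
# mean_ncr, std_ncr = normalized_cumulative_reward(returns, steps_per_episode=600)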
To compare learning speeds, we use thenormalized area under each learning curve (AUC) as a proxy for the learning speed of a particularactuation model, where 0 represents the worst possible performance and no progress during training,and 1 represents the best possible performance without requiring training. Figure 5.7 illustrates theimprovement in performance during the optimization process, as applied to motions for three differentagents. Figure 5.8 compares the learning curves for the initial and final MTU parameters, for the samethree motions.Figure 5.7: Performance of intermediate MTU policies and actuator parameters per pass of actua-tor optimization following Algorithm 5.Figure 5.8: Learning curves comparing initial and optimized MTU parameters.PD performs well across all examples, achieving comparable-to-the-best performance for all mo-tions. PD also learns faster than the other parameterizations for 5 of the 7 motions. The final per-formance of Tor is among the poorest for all the motions. Differences in performance appear morepronounced as characters become more complex. For the simple 7-link biped, most parameterizationsachieve similar performance. However, for the more complex dog and raptor, the performance of Torpolicies deteriorate with respect to other policies such as PD and Vel. MTU policies often exhibited49Character + Actuation Motion Performance (NCR) Learning Speed (AUC)Biped + Tor Walk 0.7662 ± 0.3117 0.4788Biped + Vel Walk 0.9520 ± 0.0034 0.6308Biped + PD Walk 0.9524 ± 0.0034 0.6997Biped + MTU Walk 0.9584 ± 0.0065 0.7165Biped + Tor March 0.9353 ± 0.0072 0.7478Biped + Vel March 0.9784 ± 0.0018 0.9035Biped + PD March 0.9767 ± 0.0068 0.9136Biped + MTU March 0.9484 ± 0.0021 0.5587Biped + Tor Run 0.9032 ± 0.0102 0.6938Biped + Vel Run 0.9070 ± 0.0106 0.7301Biped + PD Run 0.9057 ± 0.0056 0.7880Biped + MTU Run 0.8988 ± 0.0094 0.5360Raptor + Tor Run (Sim) 0.7265 ± 0.0037 0.5061Raptor + Vel Run (Sim) 0.9612 ± 0.0055 0.8118Raptor + PD Run (Sim) 0.9863 ± 0.0017 0.9282Raptor + MTU Run (Sim) 0.9708 ± 0.0023 0.6330Raptor + Tor Run 0.6141 ± 0.0091 0.3814Raptor + Vel Run 0.8732 ± 0.0037 0.7008Raptor + PD Run 0.9548 ± 0.0010 0.8372Raptor + MTU Run 0.9533 ± 0.0015 0.7258Dog + Tor Bound (Sim) 0.7888 ± 0.0046 0.4895Dog + Vel Bound (Sim) 0.9788 ± 0.0044 0.7862Dog + PD Bound (Sim) 0.9797 ± 0.0012 0.9280Dog + MTU Bound (Sim) 0.9033 ± 0.0029 0.6825Dog + Tor Rear-Up 0.8151 ± 0.0113 0.5550Dog + Vel Rear-Up 0.7364 ± 0.2707 0.7454Dog + PD Rear-Up 0.9565 ± 0.0058 0.8701Dog + MTU Rear-Up 0.8744 ± 0.2566 0.7932Table 5.4: Performance of policies trained for the various characters and actuation models. Per-formance is measured using the normalized cumulative reward (NCR) and learning speed isrepresented by the normalized area under each learning curve (AUC). The best performingparameterizations for each character and motion are in bold.the slowest learning speed, which may be a consequence of the higher dimensional action spaces, i.e.,requiring antagonistic muscle pairs, and complex muscle dynamics. Nonetheless, once optimized, theMTU policies produce more natural motions and responsive behaviors as compared to other parameter-izations. We note that the naturalness of motions is not well captured by the reward, since it primarilygauges similarity to the reference motion, which may not be representative of natural responses whenperturbed from the nominal trajectory. 
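The AUC learning-speed proxy defined above can be computed directly from the recorded learning curve. A one-function sketch, assuming the curve stores NCR values sampled at evenly spaced training iterations and that trapezoidal integration is acceptable:

import numpy as np

def learning_curve_auc(ncr_values):
    # ncr_values: NCR of intermediate policies at evenly spaced iterations,
    # already normalized to [0, 1]. Normalizing the iteration axis to [0, 1]
    # makes a flat curve at 0 score 0 and a curve pinned at 1 score 1.
    x = np.linspace(0.0, 1.0, len(ncr_values))
    return np.trapz(ncr_values, x)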
Action and torque trajectories over one cycle of a desired motion are shown in Figure 5.9.

Figure 5.9: Policy actions over time and the resulting torques for the four action types. Data is from one biped walk cycle (1 s). Left: Actions (60 Hz) for the right hip for PD, Vel, and Tor, and the right gluteal muscle for MTU. Right: Torques applied to the right hip joint, sampled at 600 Hz.

Sensitivity Analysis: Due to the non-convex nature of the optimization problem, policies trained from different random initializations may converge to different results. To analyze the sensitivity of the results, we compare policies trained with different initializations and design decisions. Figure 5.10 compares the learning curves from multiple policies trained using different random initializations of the networks. Four policies are trained for each actuation model. The results for a particular actuation model are similar across different runs, and the trends between the various actuation models also appear to be consistent. To evaluate the sensitivity to the amount of exploration noise applied during training, we trained policies where the standard deviation of the action distribution is twice and half the default values. Figure 5.11 illustrates the learning curves for each policy. Overall, the performance of the policies does not appear to change significantly over this range of values. Finally, Figure 5.12 compares the results using different network architectures. The network variations include doubling the number of units in both hidden layers, halving the number of hidden units, and inserting an additional layer with 512 units between the two existing hidden layers. The choice of network structure does not appear to have a noticeable impact on the results, and the differences between the actuation models appear to be consistent across the different networks.

Figure 5.10: Learning curves from different random network initializations. Four policies are trained for each actuation model.

Figure 5.11: Learning curves comparing the effects of scaling the standard deviation of the action distribution by 1x, 2x, and 1/2x.

Figure 5.12: Learning curves for different network architectures. The network structures include doubling the number of units in each hidden layer, halving the number of units, and inserting an additional hidden layer with 512 units between the two existing hidden layers.

Policy Robustness: To evaluate robustness, we recorded the NCR achieved by each policy when subjected to external perturbations. The perturbations take the form of random forces applied to the trunk of the characters. Figure 5.13 illustrates the performance of the policies when subjected to perturbations of different magnitudes. The magnitude of the forces is constant, but the direction varies randomly. Each force is applied for 0.1 to 0.4 s, with 1 to 4 s between perturbations. Performance is estimated using the average over 128 episodes of length 20 s each. For the biped walk, the Tor policy is significantly less robust than those for the other types of actions, while the MTU policy is the least robust for the raptor run. Overall, the PD policies are among the most robust for all the motions. In addition to external forces, we also evaluate robustness when locomoting over randomly generated terrain consisting of bumps with varying heights and slopes with varying steepness (Figure 5.14). There are a few consistent patterns in this test.
The Vel and MTU policies are significantly worse than theTor and PD policies for the dog bound on the bumpy terrain. The unnatural jittery behavior of the dogTor policy proves to be surprisingly robust for this scenario. We suspect that the behavior prevents thetrunk from contacting the ground for extended periods for time, and thereby escaping our system’s falldetection.Figure 5.13: Performance when subjected to random perturbation forces of different magnitudes.Query Rate: Figure 5.15 compares the performance of different parameterizations for different pol-icy query rates. Separate policies are trained with queries of 15Hz, 30Hz, 60Hz, and 120Hz. Actuationmodels that incorporate low-level feedback such as PD and Vel, appear to cope more effectively to lowerquery rates, while the Tor degrades more rapidly at lower query rates. It is not yet obvious to us whyMTU policies appear to perform better at lower query rates and worse at higher rates. Lastly, Figure 5.953Figure 5.14: Performance of different action parameterizations when traveling across randomlygenerated irregular terrain. Left: Dog running across bumpy terrain, where the height ofeach bump varies uniformly between 0 and a specified maximum height. Middle: andRight: biped and dog traveling across randomly generated slopes with bounded maximumsteepness.shows the policy outputs as a function of time for the four actuation models, for a particular joint, aswell as showing the resulting joint torque. Interestingly, the MTU action is visibly smoother than theother actions and results in joint torques profiles that are smoother than those seen for PD and Vel.Figure 5.15: Left: Performance of policies with different query rates for the biped. Right: Per-formance for the dog. Separate policies are trained for each query rate.5.6 DiscussionOur experiments suggest that action parameterizations that include basic local feedback, such as PDtarget angles, MTU activations, or target velocities, can improve policy performance and learning speedacross different motions and character morphologies. Such models more accurately reflect the embodiednature of control in biomechanical systems, and the role of mechanical components in shaping theoverall dynamics of motions and their control. The difference between low-level and high-level actionparameterizations grow with the complexity of the characters, with high-level parameterizations scalingmore gracefully to complex characters. As a caveat, there may well be tasks, such as impedance control,where lower-level action parameterizations such as Tor may prove advantageous. We believe that no54single action parameterization will be the best for all problems. However, since objectives for motioncontrol problems are often naturally expressed in terms of kinematic properties, higher-level actionssuch as target joint angles and velocities may be effective for a wide variety of motion control problems.We hope that our work will help open discussions around the choice of action parameterizations.Our results have only been demonstrated on planar articulated figure simulations; the extension to3D currently remains as future work. Furthermore, our current torque limits are still large as compared towhat might be physically realizable. Tuning actuator parameters for complex actuation models such asMTUs remains challenging. 
Though our actuator optimization technique is able to improve performanceas compared to manual tuning, the resulting parameters may still not be optimal for the desired task.Therefore, our comparisons of MTUs to other action parameterizations may not be reflective of thefull potential of MTUs with more optimal actuator parameters. Furthermore, our actuator optimizationcurrently tunes parameters for a specific motion, rather than a larger suite of motions, as might beexpected in nature.Since the reward terms are mainly expressed in terms of positions and velocities, it may seem thatit is inherently biased in favour of PD and Vel. However, the real challenges for the control policieslie elsewhere, such as learning to compensate for gravity and ground-reaction forces, and learning foot-placement strategies that are needed to maintain balance for the locomotion gaits. The reference poseterms provide little information on how to achieve these hidden aspects of motion control that willultimately determine the success of the locomotion policy. While we have yet to provide a concreteanswer for the generalization of our results to different reward functions, we believe that the choice ofaction parameterization is a design decision that deserves greater attention regardless of the choice ofreward function.Finally, it is reasonable to expect that evolutionary processes would result in the effective co-designof actuation mechanics and control capabilities. Developing optimization and learning algorithms toallow for this kind of co-design is a fascinating possibility for future work.55Chapter 6Hierarchical Locomotion SkillsFigure 6.1: Locomotion skills learned using hierarchical reinforcement learning. (a) Following avarying-width winding path. (b) Dribbling a soccer ball. (c) Navigating through obstacles.In this chapter we aim to learn a variety of environment-aware locomotion skills with a limitedamount of prior knowledge. We adopt a two-level hierarchical control framework. First, low-level con-trollers are learned that operate at a fine timescale and which achieve robust walking gaits that satisfystepping-target and style objectives. Second, high-level controllers are then learned which plan at thetimescale of steps by invoking desired step targets for the low-level controller. The high-level controllermakes decisions directly based on high-dimensional inputs, including terrain maps or other suitable rep-resentations of the surroundings. Both levels of the control policy are trained using deep reinforcementlearning. Results are demonstrated on a simulated 3D biped. Low-level controllers are learned for avariety of motion styles and demonstrate robustness with respect to force-based disturbances, terrainvariations, and style interpolation. High-level controllers are demonstrated that are capable of followingtrails through terrains, dribbling a soccer ball towards a target location, and navigating through static ordynamic obstacles.6.1 IntroductionPhysics-based simulations of human skills and human movement have long been a promising avenuefor character animation, but it has been difficult to develop the needed control strategies. While thelearning of robust balanced locomotion is a challenge by itself, further complexities are added when the56locomotion needs to be used in support of tasks such as dribbling a soccer ball or navigating amongmoving obstacles. Hierarchical control is a natural approach towards solving such problems. 
A low-level controller (LLC) is desired at a fine timescale, where the goal is predominately about balanceand limb control. At a larger timescale, a high-level controller (HLC) is more suitable for guiding themovement to achieve longer-term goals, such as anticipating the best path through obstacles. In thispaper, we leverage the capabilities of deep reinforcement learning (RL) to learn control policies at bothtimescales. The use of deep RL allows skills to be defined via objective functions, while enabling forcontrol policies based on high-dimensional inputs, such as local terrain maps or other abundant sensoryinformation. The use of a hierarchy enables a given low-level controller to be reused in support ofmultiple high-level tasks. It also enables high-level controllers to be reused with different low-levelcontrollers.Our principal contribution is to demonstrate that environment-aware 3D bipedal locomotion skillscan be learned with a limited amount of prior structure being imposed on the control policy. In supportof this, we introduce the use of a two-level hierarchy for deep reinforcement learning of locomotionskills, with both levels of the hierarchy using an identical style of actor-critic algorithm. To the bestof our knowledge, we demonstrate some of the most capable dynamic 3D walking skills for model-free learning-based methods, i.e., methods that have no direct knowledge of the equations of motion,character kinematics, or even basic abstract features such as the center of mass, and no a priori control-specific feedback structure. Our method comes with its own limitations, which we also discuss.6.2 Related WorkLearning of high-level controllers for physics-based characters has been successfully demonstrated forseveral locomotion and obstacle-avoidance tasks [Coros et al., 2009, Peng et al., 2015, 2016]. Alter-natively, planning using a learned high-level dynamics model has also been proposed for locomotiontasks [Coros et al., 2008]. However, the low-level controllers for these learned policies are still designedwith significant human insight, and the work presented in Chapter 4 are demonstrated only for planarmotions.Motion planning is a well-studied problem, which typically investigates how characters or robotsshould move in constrained environments. For wheeled robots, such problem can usually be reducedto finding a path for a point robot [Kavraki et al., 1996]. Motion planning for legged robots is signifi-cantly more challenging due to the increased degrees of freedom and tight coupling with the underlyinglocomotion dynamics. When quadrupeds are equipped with robust mobility control, a classic A∗ pathplanner can be used to compute steering and forward speed commands to the locomotion controller tonavigate in real-world environment with high success [Wooden et al., 2010]. However, skilled balancedmotions are more difficult to achieve for bipeds and thus they are harder to plan and control [Kuffneret al., 2005]. Much of the work in robotics emphasizes footstep planning, e.g., [Chestnutt et al., 2005],with some work on full-body motion generation, e.g., [Grey et al., 2016]. Possibility graphs are pro-57Figure 6.2: System overviewposed [Grey et al., 2016] to use high-level approximations of constraint manifolds to rapidly explore thepossibility of actions, thereby allowing lower-level motion planners to be utilized more efficiently. 
Ourhierarchical planning framework and the step targets produced by the HLC are partly inspired by thisprevious work from humanoid robotics.Motion planning in support of character animation has been studied for manipulation tasks [Yamaneet al., 2004, Bai et al., 2012] as well as full-body behaviours. The full-body behaviour planners oftenwork with kinematic motion examples [Pettr et al., 2003, Lee and Lee, 2004, Lau and Kuffner, 2005].Planning for physics-based characters is often achieved with the help of abstract dynamic models in low-dimensional spaces [Mordatch et al., 2010, Ye and Liu, 2010]. A hybrid approach is adopted in [Liuet al., 2012] where a high-level kinematic planner directs the low-level dynamic control of specificmotion skills.6.3 OverviewAn overview of the DeepLoco system is shown in Figure 6.2. The system is partitioned into two compo-nents that operate at different timescales. The high-level controller (HLC) operates at a coarse timescaleof 2 Hz, the timescale of walking steps, while the low-level controller (LLC) operates at 30 Hz, thetimescale of low-level control actions such as PD target angles. Finally, the physics simulation is per-formed at 3 kHz. Together, the HLC and LLC form a two-level control hierarchy where the HLCprocesses the high-level task goals gH and provides the LLC with low-level intermediate goals gL thatdirect the character towards fulfilling the overall task objectives. When provided with an intermediategoal from the HLC, the LLC coordinates the motion of the character’s various joints in order to fulfill theintermediate goals. This hierarchical partitioning of control allows the controllers to explore behavioursspanning different spatial and temporal abstractions, thereby enabling more efficient exploration of task-relevant strategies.The inputs to the HLC consist of the state, sH , and the high-level goal, gH , as specified by thetask. It outputs an action, aH , which then serves as the current goal gL for the LLC. sH provides both58proprioceptive information of the character’s configuration as well as exteroceptive information aboutits environment. In our framework, the high level action, aH , consists of a footstep plan for the LLC.The LLC receives the state, sL, and an intermediate goal, gL, as specified by the HLC, and outputsan action aL. Unlike the high-level state sH , sL consists mainly of proprioceptive information describingthe state of the character. The low-level action aL specifies target angles for PD controllers positionedat each joint, which in turn compute torques that drive the motion of the character.The actions from the LLC are applied to the simulation, which in turn produces updated states sHand sL by extracting the relevant features for the HLC and LLC respectively. The environment thenalso provides separate reward signals rH and rL to the HLC and LLC, reflecting progress towards theirrespective goals gH and gL. Both controllers are trained with a common actor-critic learning algorithm.The policy (actor) is trained using a positive-temporal difference update scheme modeled after CACLA[Van Hasselt, 2012], and the value function (critic) is trained using Bellman backups.6.4 Policy Representation and LearningLet pi(s,g) : S×G→ A represent a deterministic policy, which maps a state s ∈ S and goal g ∈ G to anaction a ∈ A, while a stochastic policy pi(s,g,a) : S×G×A→ R represents the conditional probabilitydistribution of a given s and g, pi(s,g,a) = p(a|s,g). 
For a particular s and g, the action distribution is modeled by a Gaussian π(s,g,a) = G(μ(s,g), Σ), with a parameterized mean μ(s,g) and a fixed covariance matrix Σ. Each policy query samples an action from the distribution according to

a = \mu(s,g) + N, \quad N \sim G(0, \Sigma)    (6.1)

that is, by applying Gaussian noise to the mean action μ(s,g). While the covariance Σ = diag({σ_i}) is represented by manually-specified values {σ_i} for each action parameter, the mean is represented by a neural network μ(s,g|θ) with parameters θ.

During training, a stochastic policy enables the character to explore new actions that may prove promising, but the addition of exploration noise can impact performance at runtime. Therefore, at runtime, a deterministic policy, which always selects the mean action π(s,g) = μ(s,g), is used instead. The choice between a stochastic and a deterministic policy can be denoted by the addition of a binary indicator variable λ ∈ {0,1}:

a = \mu(s,g) + \lambda N    (6.2)

where λ = 1 indicates a stochastic policy with added exploration noise, and λ = 0 a deterministic policy that always selects the mean action. During training, ε-greedy exploration can be incorporated by randomly enabling and disabling exploration noise according to a Bernoulli distribution λ ∼ Ber(ε), where ε represents the probability of exploring by applying noise to the mean action.

The reward function r_t = r(s_t, g_t, a_t) provides the agent with feedback regarding the desirability of performing action a_t in state s_t given goal g_t. The reward function is therefore an interface through which users can shape the behaviour of the agent by assigning higher rewards to desirable behaviours and lower rewards to less desirable ones. The policies are trained using an actor-critic framework to maximize the expected cumulative reward. Algorithm 7 illustrates the common learning algorithm for both the LLC and HLC. For the purpose of learning, the character's experiences are summarized by tuples τ_i = (s_i, g_i, a_i, r_i, s'_i, λ_i), recording the start state, goal, action, reward, next state, and application of exploration noise for each action performed by the character. The tuples are stored in an experience replay memory D and used to update the policy. Each policy is trained using an actor-critic framework, where a policy π(s,g,a|θ_π) and value function V(s,g|θ_V) are learned in tandem. The value function is trained to predict the expected cumulative reward of following the policy starting at a given state s and goal g. To update the value function, a minibatch of n tuples {τ_i} is sampled from D and used to perform a Bellman backup

y_i \leftarrow r_i + \gamma V(s'_i, g_i \mid \theta_V)    (6.3)

\theta_V \leftarrow \theta_V + \alpha_V \left( \frac{1}{n} \sum_i \nabla_{\theta_V} V(s_i, g_i \mid \theta_V) \, \big( y_i - V(s_i, g_i \mid \theta_V) \big) \right)    (6.4)

The learned value function is then used to update the policy. Policy improvement is performed using a CACLA-style positive temporal difference update [Van Hasselt, 2012]. Since the policy gradient as defined above is for stochastic policies, policy updates are performed using only tuples with added exploration noise (i.e. λ_i = 1).

\delta_i \leftarrow r_i + \gamma V(s'_i, g_i \mid \theta_V) - V(s_i, g_i \mid \theta_V)    (6.5)

\text{if } \delta_i > 0: \quad \theta_\pi \leftarrow \theta_\pi + \alpha_\mu \left( \frac{1}{n} \nabla_{\theta_\pi} \mu(s_i, g_i \mid \theta_\pi) \, \Sigma^{-1} \big( a_i - \mu(s_i, g_i \mid \theta_\pi) \big) \right)    (6.6)

Equation 6.6 can be interpreted as a stochastic gradient ascent step along an estimate of the policy gradient for a Gaussian policy.

6.5 Low-Level Controller

The low-level controller (LLC) is responsible for coordinating joint torques to mimic the overall style of a reference motion while satisfying footstep goals and maintaining balance. The reference motion is represented by keyframes that specify target poses at each timestep t.
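Before turning to the details of the LLC, the updates in Equations 6.3 to 6.6 can be written out for a single tuple. The sketch below uses linear function approximators so the gradients are explicit; in the actual system the mean and value are the neural networks of Sections 6.5.4 and 6.6.2 and the gradients come from backpropagation, so all names and shapes here are illustrative.

import numpy as np

def positive_td_update(phi, phi_next, a, r, w_v, M_pi, sigma,
                       gamma=0.95, alpha_v=0.01, alpha_mu=0.001):
    # phi, phi_next: feature vectors for (s, g) and (s', g).
    # Critic: V(s, g) = w_v . phi, so grad_{w_v} V = phi (Equations 6.3 and 6.4).
    v, v_next = w_v @ phi, w_v @ phi_next
    y = r + gamma * v_next
    w_v = w_v + alpha_v * phi * (y - v)

    # Actor: mu(s, g) = M_pi @ phi, updated only on a positive temporal
    # difference, and only for tuples gathered with exploration noise
    # (Equations 6.5 and 6.6). sigma holds the diagonal of Sigma.
    delta = r + gamma * v_next - v
    if delta > 0:
        mu = M_pi @ phi
        M_pi = M_pi + alpha_mu * np.outer((a - mu) / sigma, phi)
    return w_v, M_pi

The minibatch form in Algorithm 7 simply averages this step over n tuples sampled from the replay memory.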
The LLC is queried at 30 Hz , whereeach query provides as input the state sL, representing the character state, and goal gL, representing afootstep plan. The LLC then produces an action aL specifying PD target angles for every joint, relativeto their parent link.60Algorithm 7 Actor-Critic Algorithm Using Positive Temporal Difference Updates1: θpi ← random weights2: θV ← random weights3: while not done do4: for step = 1, ...,m do5: s← start state6: g← goal7: λ ← Ber(εt)8: a← µ(s,g|θpi)+λN, N ∼ G(0,Σ)9: Apply a and simulate forward one step10: s′← end state11: r← reward12: τ ← (s,g,a,r,s′,λ )13: store τ in D14: end for15: Update value function:16: Sample minibatch of n tuples {τi = (si,gi,ai,ri,s′i,λi)} from D17: for each τi do18: yi← ri+ γV (s′i,gi|θV )−V (si,gi|θV )19: end for20: θV ← θV +αv(1n ∑iOθV V (si,gi|θV )(yi−V (si,gi|θV )))21: Update policy:22: Sample minibatch of n tuples {τ j = (s j,g j,a j,r j,s′j,λ j)} from D where λ j = 123: for each τ j do24: δ j← r j + γV (s′j,g j|θV )−V (s j,g j|θV )25: if δ j > 0 then26: 4a j← a j−µ(s j,g j|θpi)27: θpi ← θpi +αµ(1nOθpiµ(s j,g j|θpi)Σ−14a j)28: end if29: end for30: end whileLLC State: The LLC input state sL, shown in Figure 6.3 (left), consists mainly of features describingthe character’s configuration. These features include the center of mass positions of each link relative tothe character’s root, designated as the pelvis, their relative rotations with respect to the root expressedas quaternions, and their linear and angular velocities. Two binary indicator features are included,corresponding to the character’s feet. The features are assigned 1 when their respective foot is in contactwith the ground and 0 otherwise. A phase variable φ ∈ [0,1] is also included as an input, which indicatesthe phase along a walk cycle. Each walk cycle has a fixed period of 1 s, corresponding to 0.5 s per step.The phase variable advances at a fixed rate and helps keep the LLC in sync with the reference motion.Combined, these features create a 110D state space.61Figure 6.3: left: The character state features consist of the positions of each link relative to theroot (red arrows), their rotations, linear velocities (green arrows), and angular velocities.right: The terrain features consist of a 2D heightmap of the terrain sampled on a regular grid.All heights are expressed relative to height of the ground immediately under the root of thecharacter. The heightmap has a resolution of 32x32 and occupies an area of approximately11x11m.Figure 6.4: The goal gL for the LLC is represented as a footstep plan, specifying the target posi-tions pˆ0 and pˆ1 for the next two steps, and the target heading for the root θˆroot .LLC Goal: Each footstep plan gL = (pˆ0, pˆ1, θˆroot), as shown in Figure 6.4, specifies the 2D targetposition pˆ0 relative to the character on the horizontal plane for the swing foot at the end of the nextstep, as well as the target location for the following step pˆ1. This is motivated by work showing that”two steps is enough” [Zaytsev et al., 2015]. In addition to target step positions, the footstep plan alsoprovides a desired heading θˆroot for the root of the character for the immediate next step.LLC Action: The action aL produced by the LLC specifies target positions for PD controllers posi-tioned at each joint. The target joint positions are represented in 4 dimensional axis-angle form, withaxis normalization occurring when applying the actions, i.e., the output action from the network neednot be normalized. 
This yields a 22D action space.626.5.1 Reference MotionA reference motion (or set of motions) serves to help specify the desired walking style, while alsohelping to guide the learning. The reference motion can be a single manually keyframed motion cycle,or one or more motion capture clips. The goal for the LLC is to mimic the overall style of the referencemotion rather than precisely tracking it. The reference motion will generally not satisfy the desiredfootstep goals, and is often not physically realizable in any case because of the approximate natureof a hand-animated walk cycle, or model mismatches in the case of a motion capture clip. At eachtimestep t a reference motion provides a reference pose qˆ(t) and reference velocity ˆ˙q(t), computed viafinite-differences ˆ˙q(t)≈ qˆ(t+4t)−qˆ(t)4t . The use of multiple reference motion clips can help produce betterturning behaviors, as best seen in the supplemental video. To make use of multiple reference motions,we construct a kinematic controller qˆ∗(·)← K(s,gL), when given the simulated character state s and afootstep plan gL, selects the appropriate motion from a small set of motion clips that best realizes thefootstep plan gL. To construct the set of reference motions for the kinematic controller, we segmented7 s of motion capture data of walking and turning motions into individual clips qˆ j(·), each correspondingto a single step. A step begins on the stance foot heel-strike and ends on the swing foot heel-strike. Eachclip is preprocessed to be in right stance with a step duration of 0.5 s. During training, the referencemotions are mirrored as necessary to be consistent with the simulated character’s stance leg. A vectorof features Λ(qˆ j(·)) = (pstance, pswing,θroot) are then extracted for each clip and later used to select theappropriate clip for a given query. The features include the stance foot position pstance at the start of aclip, the swing foot position pswing at the end of the clip, and the root orientation θroot on the horizontalplane at the end of the clip.During training, K(s,gL) is queried at the beginning of each step to select the reference clip for theupcoming step. To select among the motion clips, a similar set of features Λ(s,gL) are extracted from sand gL, where pstance is specified by the stance foot position from the simulated character state s, pswingand θroot are specified by the target footstep position pˆ0 and root orientation θˆroot from gL. The mostsuitable clip is then selected according to:K(s,gL) = arg minqˆ j(·)||Λ(s,gL)−Λ(qˆ j(·))|| (6.7)The selected clip then acts as the reference motion to shape the reward function for the LLC over thecourse of the upcoming step.6.5.2 LLC RewardGiven the reference pose qˆ(t) and velocity ˆ˙q(t) the LLC reward rL is defined as a weighted sum ofobjectives that encourage the character to imitate the style of the reference motion while following the63footstep plan,rL = wposerpose+wvelrvel +wrootrroot +wcomrcom+wendrend +wheadingrheading (6.8)using (wpose,wvel,wroot ,wcom,wend ,wheading) = (0.5,0.05,0.1,0.1,0.2,0.1). 
rpose, rvel , rroot , and rcomencourages the policy to reproduce the given reference motion, while rend and rheading encourages it tofollow the footstep plan.rpose = exp(−∑iwid(qˆi(t),qi)2)rvel = exp(−∑iwi|| ˆ˙qi(t)− q˙i||2)rroot = exp(−10(hˆroot −hroot)2)rcom = exp(−||vˆcom− vcom||2)rend = exp(−||pˆswing− pswing||2−||pˆstance− pstance||2)rheading = 0.5cos(θˆroot −θroot)+0.5where qi represents the quaternion rotation of joint i and d(·, ·) computes the distance between twoquaternions. wi are manually specified weights for each joint. hroot represents the height of the rootfrom the ground, vcom is the center of mass velocity, pswing and pstance are the positions of the swing andstance foot. The target position for the swing foot pˆswing = pˆ0 is provided by the footstep plan, whilethe target position for the stance foot pˆstance is provided by the reference motion. θroot represents theheading of the root on the horizontal plane, and θˆroot is the desired heading provided by the footstepplan. We choose to keep rewards constrained to r ∈ [0,1].6.5.3 Bilinear Phase TransformWhile the phase variable φ helps to keep the LLC in sync with the reference motion, in our experimentsthis did not appear to be sufficient for the network to clearly distinguish the different phases of a walk,often resulting in foot-dragging artifacts. To help the network better distinguish between different phases64of a motion, we take inspiration from bilinear pooling models for vision tasks [Fukui et al., 2016]. Fromthe scalar phase variable φ we construct a tile-coding Φ = (Φ0,Φ1,Φ2,Φ3)T , where Φi ∈ {0,1} is 1 ifφ lies within its phase interval and 0 otherwise. For example, Φ0 = 1 iff 0 ≤ φ < 0.25, and Φ1 = 1 iff0.25≤ φ < 0.5, etc. Given the original input vector (sL,gL), the bilinear phase transform computes theouter product (sLgL)ΦT =[Φ0(sLgL),Φ1(sLgL),Φ2(sLgL),Φ2(sLgL)](6.9)which is then processed by successive layers of the network. This representation results in a feature setwhere only a sparse subset of the features, corresponding to the current phase interval, are active at agiven time. This effectively encodes a prior into the network that different behaviours are expected atdifferent phases of the motion. Note that the scalar phase variable φ is still included in sL to allow theLLC to track its progress within each phase interval.Figure 6.5: Schematic illustration of the LLC network. The input consists of the state sL andgoal gL. The first layer applies the bilinear phase transform and the resulting features areprocessed by a series of fully-connected layers. The output layer produces the action aL,which specifies PD targets for each joint.6.5.4 LLC NetworkA schematic diagram of the LLC network is shown in Figure 6.5. The LLC is represented by a 4-layeredneural network that receives as input sL and gL, and outputs the mean µ(sL,gL) of the action distribution.The first layer applies the bilinear phase transform to the inputs, and the resulting bilinear features arethen processed by two fully-connected layers with 512 and 256 units each. ReLU activation functionsare applied to both hidden layers [Nair and Hinton, 2010]. Finally a linear output layer computes themean action. The LLC value function VL(sL,gL) is modeled by a similar network, but with a singlelinear unit in the output layer. Each LLC network has approximately 500k parameters.656.5.5 LLC TrainingLLC training proceeds episodically where the character is initialized to a default pose at the beginningof each episode. 
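Returning briefly to the bilinear phase transform of Equation 6.9, its input-layer computation is small enough to sketch directly; the quarter-cycle binning and outer product follow the description above, while everything else is illustrative.

import numpy as np

def bilinear_phase_transform(s_l, g_l, phase):
    # Tile-code the scalar phase into four quarter-cycle indicators and take
    # the outer product with the concatenated inputs (Equation 6.9). Only the
    # copy of the features belonging to the active phase interval is nonzero.
    x = np.concatenate([s_l, g_l])
    tiles = np.zeros(4)
    tiles[min(int(phase * 4), 3)] = 1.0   # phase is assumed to lie in [0, 1]
    return np.outer(tiles, x).flatten()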
An episode is simulated for a maximum of 200 s but is terminated early if the characterfalls, leaving the character with 0 reward for the remainder of the episode. A fall is detected when thetorso of the character makes contact with the ground. At the beginning of each walking step, a newfootstep plan gkL = (pˆk0, pˆk1, θˆkroot) is generated by randomly adjusting the previous plan gk−1L accordingtopˆk0 = pˆk−11θˆ kroot = θˆk−1root +N, N ∼ G(0,0.252)pˆk1 = pˆk0+4p(θˆ kroot)(6.10)where 4p(θˆ kroot) advances the step position along the heading direction θˆ kroot by a fixed step length of0.4 m to obtain a new target step position.After a footstep plan has been determined for the new step, the kinematic controller K(sL,gL) isqueried for a new reference motion. The reference motion qˆ(·) is then used by the reward function forthe duration of the step, which guides the LLC towards a stepping motion that approximately achievesthe desired footstep goal gL.6.5.6 Style ModificationIn addition to imitating a reference motion, the LLC can also be stylized by simple modifications to thereward function. In the following examples, we consider the addition of a style term cstyle to the posereward rpose.rpose = exp(−∑iwid(qˆi(t),qi)2−wstylecstyle)(6.11)cstyle provides an interface through which the user can shape the motion of the LLC. wstyle is a user-specified weight that trades off between conforming to the reference motion and satisfying the desiredstyle.Forward/Sideways Lean: By using cstyle to specify a desired waist orientation, the LLC can be steeredtowards learning a robust walk while leaning forward or sideways.cstyle = d(qˆ(t)waist ,qwaist)2 (6.12)where qˆ(·)waist is a quaternion specifying the desired waist orientation.66Straight Leg(s): Similarly, cstyle can be used to penalize bending of the knees, resulting in a locked-knee walk.cstyle = d(qI,qknee)2 (6.13)with qI being the identity quaternion. Using this style term, we trained two LLC’s, one with the rightleg encouraged to be straight, and one with both legs straight.High-Knees: A high-knees walk can be created by using cstyle to encourage the character to lift itsknees higher during each step,cstyle = (hˆknee−hknee)2 (6.14)where hˆknee = 0.8m is the target height for the swing knee with respect to the ground.In-place Walk: By replacing the reference motion qˆ(·) with a single hand-authored clip of an in-placewalk, the LLC can be trained to step in-place.Separate networks are trained for each stylized LLC by bootstrapping from the nominal walk LLC.The weights of each network are initialized from those of the nominal walk, then fine-tuned using thestylized reward functions. Furthermore, we show that it is possible to interpolate different stylizedLLC’s while also remaining robust. Let piaL(sL,gL) and pibL(sL,gL) represent LLC’s trained for style aand b. A new LLC picL(s,g) can be defined by linearly interpolating the outputs of the two LLC’spicL(sL,gL) = (1−u)piaL(sL,gL)+upibL(sL,gL) (6.15)u ∈ [0,1], allowing the character to seamlessly transition between the different styles. As shown in theresults, we can also allow for moderate extrapolation.6.6 High-level ControllerWhile the LLC is primarily responsible for low-level coordination of the character’s limbs for loco-motion, the HLC is responsible for high-level task-specific objectives such as navigation. The HLC isqueried at 2 Hz, corresponding to the beginning of each step. Every query provides as input a state sHand a task-specific goal gH . 
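The random footstep-plan generation of Equation 6.10, used above to train the LLC, amounts to shifting the previous plan forward and perturbing the heading. A short sketch, where the mapping from heading to a planar step offset is an assumption:

import numpy as np

def next_footstep_plan(p0_prev, p1_prev, heading_prev, step_length=0.4):
    # Equation 6.10: the old second target becomes the new first target, the
    # heading receives Gaussian noise, and the second target advances along
    # the new heading by a fixed 0.4 m step length.
    p0 = p1_prev
    heading = heading_prev + np.random.normal(0.0, 0.25)
    p1 = p0 + step_length * np.array([np.cos(heading), np.sin(heading)])
    return p0, p1, heading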
The HLC output action aH specifies a footstep plan gL for the LLC. Therole of the HLC is therefore to provide intermediate goals for the LLC in order to achieve the overalltask objectives.HLC State: Unlike sL, which provides mainly proprioceptive information describing the configurationof the character, sH includes both proprioceptive and exteroceptive information describing the characterand its environment. Each state sH = (C,T ), consists of a set of character features C and terrain features67T , shown in Figure 6.3 (right). C shares many of the same features as the LLC state sL, but excludes thephase and contact features. T is represented by a 32×32 heightmap of the terrain around the character.The heightmap is sampled on a regular grid with an area of approximately 11× 11 m. The samplesextend 10 m in front of the character and 1 m behind. Example terrain maps are shown in Figure 6.6.The combined features result in a 1129D state space.6.6.1 HLC TrainingAs with the LLC training, the character is initialized to a default pose at the start of each episode. Eachepisode terminates after 200 s or when the character falls. At the start of each step, the HLC is queried tosample an action aH from the policy, which is then applied to the LLC as a footstep goal gL. The LLC isexecuted for 0.5 s, the duration of one step, and an experience tuple τ is recorded for the step. Note thatthe weights for the LLC are frozen and only the HLC is being trained. Therefore, once trained, the sameLLC can be applied to a variety of tasks by training task-specific HLC’s that specify the appropriateintermediate footstep goals.6.6.2 HLC NetworkA schematic diagram of the HLC network is available in Figure 6.7. The HLC is modeled by a deepconvolutional neural network that receives as input the state sH = (C,T ) and task-specific goal gH , andthe output action aH = gL specifies a footstep plan for the LLC for a single step. The terrain map Tis first processed by a series of three convolutional layers, with 16 5×5 filters, 32 4×4 filters, and 323× 3 filters respectively. The features maps from the final convolutional layer are processed by 128fully-connected units. The resulting feature vector is concatenated with C and gH , and processed bytwo additional fully-connected layers with 512 and 256 units. ReLUs are used for all hidden layers.The linear output layer produces the final action. Each HLC network has approximately 2.5 millionparameters.Figure 6.6: 32× 32 height maps are included as input features to the HLC. Each map covers a11×11 m area. Values in the images are normalized by the minimum and maximum heightwithin each map. left: path; middle: pillar obstacles; right: block obstacles.68Figure 6.7: Schematic illustration of the HLC network. The input consists of a terrain map T ,character features C, and goal gH . The output action aH specifies a footstep plan gL for theLLC.6.6.3 HLC TasksPath Following: In this task an HLC is trained to navigate narrow trails carved into rocky terrain. Atarget location is placed randomly along the trail, and the target advances along the trail as the charactermoves sufficiently close to the target. The HLC goal gH = (θtar,dtar) is represented by the directionto the target θtar relative to the character’s facing direction, and the distance dtar to the target on thehorizontal plane. The path and terrain are randomly generated, with the path width varying between0.5 m and 2 m. 
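As a concrete example of the goal features just described, the path-following goal g_H = (θ_tar, d_tar) could be computed from the character's root position and facing direction roughly as follows; the angle conventions are an assumption rather than the thesis implementation.

import numpy as np

def path_following_goal(root_pos, facing_angle, target_pos):
    # root_pos, target_pos: 2D points on the horizontal plane.
    # facing_angle: the character's heading in radians.
    to_target = target_pos - root_pos
    d_tar = np.linalg.norm(to_target)
    theta_tar = np.arctan2(to_target[1], to_target[0]) - facing_angle
    theta_tar = (theta_tar + np.pi) % (2.0 * np.pi) - np.pi   # wrap to [-pi, pi)
    return theta_tar, d_tar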
Since the policy is not provided with an explicit parameterization of the path as input, itmust learn to recognize the path from the terrain map T and plan its footsteps accordingly.The reward for this task is designed to encourage the character to move towards the target at adesired speed.rH = exp(−(min(0,uTtarvcom− vˆcom))2) (6.16)where vcom is the agent’s centre of mass velocity on the horizontal plane, and utar is a unit vector onthe horizontal plane pointing towards the target. vˆcom = 1 m/s specifies the desired speed at which thecharacter should move towards the target.Soccer Dribbling: Dribbling is a challenging task requiring both high-level and low-level planning.The objective is to move a ball to a target location, where the initial ball and target locations are randomlyset at the beginning of each episode. The ball has a radius of 0.2 m and a mass of 0.1 kg. Having to learna proper sequence of sub-tasks in the correct order makes this task particularly challenging. The agentmust first move to the ball, and once it has possession of the ball, dribble the ball towards the target.When the ball has arrived at the target, the agent must then learn to stop moving the ball to avoid kickingthe ball past the target. Since the policy does not have direct control over the ball, it must rely on complexcontact dynamics in order to manipulate the ball. Furthermore, considering the LLC was not trained with69motion data comparable to dribbling, the HLC has to learn to provide the appropriate footstep plansin order to elicit the necessary LLC behaviour. The goal gH = (θtar,dtar,θball,dball,hball,vball,ωball)consists of the target direction relative to the ball θtar, distance between the target and ball dtar, balldirection relative to the agent’s root θball , distance between the ball and the agent dball , height of theball’s center of mass from the ground hball , the ball’s linear velocity vball , and angular velocity ωball .Because the dribbling task occurs on a flat plane, the terrain map T is excluded from the HLC inputs,and the convolutional layers are removed from the network.The reward for the soccer task consists of a weighted sum of terms which encourages the agent tomove towards the ball rcv, stay close to the ball rcp, move the ball towards the target rbv, and keep theball close to the target rbp.rH = wcvrcv+wcprcp+wbvrbv+wbprbp (6.17)rcv = exp(−(min(0,uTballvcom− vˆcom))2)rcp = exp(−d2ball)rbv = exp(−(min(0,uTtarvball− vˆball))2)rbp = exp(−d2tar)with weights (wcv,wcp,wbv,wbp) = (0.17,0.17,0.33,0.33). uball is a unit vector pointing in the directionfrom the character to the ball, vcom the character’s center of mass velocity, and vˆcom = 1m/s the desiredspeed with which the character should move towards the ball. Similarly, utar represents the unit vectorpointing from the ball to the target position, vball the velocity of the ball, and vˆball = 1m/s the desiredspeed for the ball with which to move towards the target. Once the ball is within 0.5 m of the target andthe character is within 2 m of the ball, then the goal is considered fulfilled and the character receives aconstant reward of 1 from all terms, corresponding to the maximum possible reward.Pillar Obstacles: A more common task is to traverse a reasonably dense area of static obstacles.Similar to the path following task, the objective is to reach a randomly placed target location. However,unlike the path following task, there exists many possible paths to reach the target. 
The HLC is thereforeresponsible for planning and steering the agent along a particular path. When the agent reaches thetarget, the target location is randomly changed. The base of each obstacle measures 0.75× 0.75 m,with height varying between 2 m and 8 m. Each environment instance is generated by randomly placing70obstacles at the start of each episode. The goal gH and reward function are the same as those used forthe path following task.Block Obstacles: This environment is a variant of the pillar obstacles environment, where the obstaclesconsist of large blocks with side lengths varying between 0.5 m and 7 m. The policy therefore must learnto navigate around large obstacles to find paths leading to the target location.Dynamic Obstacles: In this task, the objective is to navigate across a dynamically changing environ-ment in order to reach a target location. The environment is populated with obstacles moving at fixedvelocities back and forth along randomly oriented linear paths. The velocities vary from 0.2 m/s to1.3 m/s, with the agent’s maximum velocity being approximately 1 m/s. Given the presence of dy-namic obstacles, rather than using a heightfield as input, the policy is provided with a velocity-map.The environment is sampled for moving obstacles where each sample records the 2D velocity along thehorizontal plane if a sample overlaps with an obstacle. If a sample point does not contain an obstacle,then the velocity is recorded as 0. The goal features and reward function are identical to those used inthe path following task. This example should not be confused with a multiagent simulation because themoving obstacles themselves are not reactive.6.7 ResultsThe motions from the policies are best seen in the supplemental videos. We learn locomotion skills fora 3D biped, as modeled by eight links: three links for each leg and two links for the torso. The bipedis 1.6 m tall and has a mass of 42 kg. The knee joints have one degree of freedom (DOF) and all otherjoints are spherical, i.e., three DOFs. We use a ground friction of µ = 0.9. The character’s motionis driven by internal joint torques from stable PD controllers [Tan et al., 2011] and simulated usingthe Bullet physics engine [Bullet, 2015] at 3000 Hz. The kp gains for the PD controllers are (1000,300, 300, 100) Nm/rad for the (waist, hip, knee, ankle), respectively. Derivative gains are specified askd = 0.1kp. Torque limits are (200, 200, 150, 90) Nm, respectively. Joint limits are also in effect for alljoints. All neural networks are built and trained with Caffe [Jia et al., 2014a]. The values of the inputstates and output actions of the networks are normalized to range approximately between [-1, 1] usingmanually-specified offsets and scales. The output of the value network is normalized to be between[0, 1] by multiplying the cumulative reward by (1− γ). This normalization helps to ensure reasonablegradient magnitudes during backpropagation. Once trained, all results run faster than real-time.LLC reference motions: We train controllers using a single planar keyframed motion cycle as a motionstyle to imitate, as well as a set of ten motion capture steps that correspond to approximately 7 s of datafrom a single human subject. The clips consist of walking motions with different turning rates. The71character was designed to have similar measurements to those of the human subject. By default, we usethe results based on the motion capture styles, as they allow for sharper turns and produce a moderateimprovement in motion quality. 
LLC reference motions: We train controllers using a single planar keyframed motion cycle as a motion style to imitate, as well as a set of ten motion capture steps that correspond to approximately 7 s of data from a single human subject. The clips consist of walking motions with different turning rates. The character was designed to have similar measurements to those of the human subject. By default, we use the results based on the motion capture styles, as they allow for sharper turns and produce a moderate improvement in motion quality. Please see the supplementary video for a direct comparison.

Hyperparameter settings: Both LLC and HLC training share similar hyperparameter settings. Batches of m = 32 tuples are collected before every update. The experience replay memory D records the 50k most recent tuples. Updates are performed by sampling minibatches of n = 32 tuples from D and applying stochastic gradient descent with momentum, with value function stepsize αV = 0.01, policy stepsize αµ = 0.001, and momentum 0.9. L2 weight decay of 0.0005 is applied to the policy, but none is applied to the value function. Both the LLC and HLC use a discount factor of γ = 0.95. For the LLC, the ε-greedy exploration rate εt is initialized to 1 and linearly annealed to 0.2 over 1 million iterations. For the HLC, εt is initialized to 1 and annealed to 0.5 over 200k iterations. The LLC is trained for approximately 6 million iterations, requiring about 2 days of compute time on a 16-core cluster using a multithreaded C++ implementation. Each HLC is trained for approximately 1 million iterations, requiring about 7 days. All computations are performed on the CPU and no GPU acceleration was leveraged.
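The update schedule above can be summarized in a short sketch. The callbacks (collect_tuples, update_value, update_policy) are hypothetical placeholders for the simulation and network-update machinery, which the text describes only at the level of hyperparameters.

```python
import random
from collections import deque

# Hyperparameters from the text; the surrounding loop structure is illustrative only.
BATCH_M, MINIBATCH_N = 32, 32
REPLAY_SIZE = 50_000

def epsilon_llc(t):
    """LLC exploration rate: linearly annealed from 1.0 to 0.2 over 1M iterations."""
    return max(0.2, 1.0 - 0.8 * t / 1_000_000)

def epsilon_hlc(t):
    """HLC exploration rate: linearly annealed from 1.0 to 0.5 over 200k iterations."""
    return max(0.5, 1.0 - 0.5 * t / 200_000)

replay = deque(maxlen=REPLAY_SIZE)

def training_step(t, collect_tuples, update_value, update_policy, epsilon_fn=epsilon_llc):
    """One update cycle: collect a batch of m = 32 tuples with epsilon-greedy
    exploration, add them to the replay memory, then update the value and policy
    networks from a minibatch of n = 32 tuples (SGD with momentum 0.9)."""
    replay.extend(collect_tuples(BATCH_M, epsilon_fn(t)))
    minibatch = random.sample(list(replay), min(MINIBATCH_N, len(replay)))
    update_value(minibatch, stepsize=0.01)    # alpha_V = 0.01
    update_policy(minibatch, stepsize=0.001)  # alpha_mu = 0.001, 0.0005 L2 weight decay
```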
Though the LLC’swere trained exclusive on flat terrain, the nominal LLC is able to walk up 16% inclines without falling.After normalizing for character weight and size differences, the robustness of the nominal walk LLCis comparable to figures reported for SIMBICON, which leverages manually-crafted balance strategies[Yin et al., 2007]. The LLC’s robustness likely stems from the application of exploration noise duringtraining. The added noise perturbs the character away from its nominal trajectory, requiring it to learnrecovery strategies for unexpected perturbations. We believe that robustness could be further improvedby presenting the character with examples of different pushes and terrain variations during training, andby letting it anticipate pushes and upcoming terrain. We also test for robustness with respect to changesin the gait period, i.e., forcing the controller to walk with shorter or longer duration steps. The gaits aregenerally robust to changes in gait period of ±20%.To better understand the feedback strategies developed by the networks we analyze the action out-puts from the nominal walk LLC for different character state configurations. Figure 6.10 illustrates theswing and stance hip target angles as a function of character’s state. The state variations we considerinclude the waist leaning forward and backward at different angles, and pushing the root at different73LLC Forward Side Incline DeclineNominal Walk 200N 210N 16%(9.1◦) 11%(6.3◦)High-Knees 140N 190N 9%(5.1◦) 5%(2.9◦)Straight Leg 150N 180N 12%(6.8◦) 6%(3.4◦)Straight Legs 90N 130N 9%(5.1◦) 5%(2.9◦)Forward Lean 180N 290N 10%(5.7◦) 16%(9.1◦)Sideways Lean 160N 220N 7%(4.0◦) 16%(9.1◦)Table 6.1: Maximum forwards and sideways push, and steepest incline and decline each LLC cantolerate before falling. Each push is applied for 0.25 s.velocities. The LLC exhibits intuitive feedback strategies reminiscent of SIMBICON [Yin et al., 2007].When the character is leaning too far forward or its forward velocity is too high, then the swing hipis raised higher to help position the swing foot further in front to regain balance in the following step,and vice-versa. but unlike SIMBICON, whose linear balance strategies are manually-crafted, the LLCdevelops nonlinear strategies without explicit user intervention.Figure 6.10: PD target angles for the swing and stance hip as a function of character state. top:character’s waist is leaning forward at various angles, with positive theta indicates a back-ward lean. bottom: the root is given a push at different velocities.74Figure 6.11: HLC learning curves.Task Reward (NCR)Path Following 0.55Soccer Dribbling 0.77Pillar Obstacles 0.56Block Obstacles 0.70Dynamic Obstacles 0.18Table 6.2: Performance summary of HLC’s trained for each task. The NCR is calculated using theaverage of 256 episodes.6.7.2 HLC PerformanceLearning curves for HLC’s trained for different tasks are available in Figure 6.11. Intermediate policyperformance is evaluated every 5k iterations using 32 episodes with a length of 200 s each. Note that themaximum normalized cumulative reward, NCR = 1, may not always be attainable. For soccer dribbling,the maximum NCR would require instantly moving the ball to the target location. For the navigationtasks, the maximum NCR would require a straight and unobstructed path between the character andtarget location.For soccer dribbling, the HLC learns to correctly sequence the required sub-tasks. The HLC firstdirects the character towards the ball. It then dribbles the ball towards the target. 
For soccer dribbling, the HLC learns to correctly sequence the required sub-tasks. The HLC first directs the character towards the ball. It then dribbles the ball towards the target. Once the ball is sufficiently close to the target, the HLC develops a strategy of circling around the ball, while maintaining some distance, to avoid perturbing the ball away from the target or tripping over it. Alternatively, the ball can be replaced with a box, and the HLC is able to generalize to the different dynamics without additional training. The HLC's for the path following, pillar obstacles, and block obstacles tasks all learned to identify and avoid obstacles using heightmaps and to navigate across different environments seeking randomly placed targets. The more difficult dynamic obstacles environment proved challenging for the HLC, which reaches a competent level of performance but is still prone to occasional missteps, particularly when navigating around faster-moving obstacles. We note that the default LLC training consists of constant speed forward walks and turns but no stopping, which limits the options available to the HLC when avoiding obstacles.

Figure 6.13: Snapshots of HLC tasks: (a) soccer dribbling, (b) path following, (c) pillar obstacles, (d) block obstacles, (e) dynamic obstacles. The red marker represents the target location and the blue line traces the trajectory of the character's center of mass.

Figure 6.12 compares the learning curves with and without the control hierarchy for soccer dribbling and path following. To train the policies without the control hierarchy, the LLC's inputs are augmented with $g_H$ and, for the path following task, the terrain map T is also included as part of the input. Convolutional layers are added to the path following LLC. The augmented LLC's are then trained to imitate the reference motions and perform the high-level tasks. Without the hierarchical decomposition, both LLC's failed to perform their respective tasks.

Figure 6.12: Learning curves with and without the control hierarchy.

6.7.3 Transfer Learning

Another advantage of a hierarchical structure is that it enables a degree of interchangeability between the different components. While a common LLC can be used by the various task-specific HLC's, a common HLC can also be applied to multiple LLC's without additional training. This form of zero-shot transfer allows the character to swap between different LLC's while retaining a reasonable level of aptitude for a task. Furthermore, the HLC can then be fine-tuned to improve performance with a new LLC, greatly decreasing the training time required when compared to retraining from scratch.
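The three transfer settings compared below (zero-shot substitution, fine-tuning, and retraining) can be summarized as a small sketch. The function and parameter names are ours; only the iteration counts and initialization choices come from the text.

```python
def transfer_hlc(nominal_hlc_params, new_llc, train_hlc, random_params):
    """Three evaluation settings for pairing an HLC with a new LLC:
    1) zero-shot: reuse the HLC trained with the nominal LLC, no extra training;
    2) fine-tune: initialize from the nominal HLC and train 200k iterations
       with the new LLC in the loop;
    3) retrain:  train 1M iterations from random initialization.
    `train_hlc`, `random_params`, and the parameter containers are placeholders."""
    zero_shot = dict(nominal_hlc_params)
    fine_tuned = train_hlc(init_params=dict(nominal_hlc_params),
                           llc=new_llc, iterations=200_000)
    retrained = train_hlc(init_params=random_params(),
                          llc=new_llc, iterations=1_000_000)
    return zero_shot, fine_tuned, retrained
```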
In Figure 6.14, the performance when using different LLC's is shown for soccer dribbling before and after HLC fine-tuning, and after retraining from scratch. Fine-tuning is applied for 200k iterations, using the HLC trained for the nominal LLC as initialization. Retraining is performed for 1 million iterations from random initialization. For soccer dribbling, the ability to substitute different LLC's is style dependent, with the forward lean exhibiting the least degradation and the high-knees exhibiting the most. Table 6.3 summarizes the results of transfer learning between different HLC and LLC combinations.

Figure 6.14: Performance using different LLC's for various tasks with and without HLC fine-tuning, and with retraining. HLC's are originally trained for the nominal LLC.

Figure 6.15: Learning curves for fine-tuning HLC's trained for the nominal LLC to different LLC's, and for retraining HLC's from scratch.

Task + LLC                No Fine-Tuning   With Fine-Tuning   Retraining
Soccer + High-Knees       0.19             0.56               0.51
Soccer + Straight Leg     0.51             0.64               0.45
Soccer + Straight Legs    0.16             0.63               -
Soccer + Forward Lean     0.64             0.69               -
Soccer + Sideways Lean    0.41             0.72               0.74
Path + High-Knees         0.10             0.39               -
Path + Forward Lean       0.43             0.44               -
Path + Straight Leg       0.08             0.19               -
Pillars + High-Knees      0.06             0.43               -
Pillars + Forward Lean    0.43             0.45               -
Pillars + Straight Leg    0.15             0.35               -

Table 6.3: Performance (NCR) of different combinations of LLC's and HLC's. No Fine-Tuning: directly using the HLC's trained for the nominal LLC. With Fine-Tuning: HLC's fine-tuned using the nominal HLC's as initialization. Retraining: HLC's retrained from random initialization for each task and LLC.

6.8 Discussion

The method described in this paper allows for skills to be designed while making few assumptions about the controller structure or explicit knowledge of the underlying dynamics. Skill development is guided by the use of objective functions for low-level and high-level policies. Taken together, the hierarchical controller allows for combined planning and physics-based movement based on high-dimensional inputs. Overall, the method further opens the door to learning-based approaches that allow for rapid and flexible development of movement skills at multiple levels of abstraction. The same deep RL method is used at both timescales, albeit with different states, actions, and rewards. Taken as a whole, the method allows for learning skills that directly exploit a variety of information, such as the terrain maps for navigation-based tasks, as well as skills that require finer-scale local interaction with the environment, such as soccer dribbling.

Imitation objective: The LLC learns in part from a motion imitation objective, utilizing a reference motion that provides a sketch of the expected motion. This can be as simple as a single keyframed planar walk cycle that helps guide the control policy towards a reasonable movement pattern, as opposed to learning it completely from scratch. Importantly, it further provides a means of directing the desired motion style. Once a basic control policy is in place, the policy can be further adapted using new goals or objective functions, as demonstrated by our multiple LLC style variations.

Phase information: Our LLC's currently still use phase information as part of the character state, which can be seen as a basic memory element, i.e., "where am I in the gait cycle." We still do not fully understand why a bilinear phase representation works better for LLC learning, in terms of achieving a given motion quality, than the alternative of a continuously-valued phase representation, i.e., (cos(φ), sin(φ)). In future work, we expect that the phase could be stored and updated using an internal memory element in a recurrent network or LSTM. This would also allow for phase adaptations, such as stopping or reversing the phase advancement when receiving a strong backwards push.

Figure 6.16: LLC walk cycles: (a) nominal walk, (b) in-place walk, (c) high-knees, (d) straight leg, (e) straight legs, (f) forward lean, (g) sideways lean.
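For reference, the two phase encodings mentioned above might look as follows. The bilinear construction shown here, which discretizes the phase into a few bins and combines the bin indicator with the state features via an outer product, is only one plausible construction assumed for illustration; the exact formulation used by the LLC may differ, and the bin count is arbitrary.

```python
import numpy as np

def phase_continuous(phi):
    """Continuously-valued phase encoding (cos, sin) mentioned in the text."""
    return np.array([np.cos(phi), np.sin(phi)])

def phase_bilinear(state, phi, num_bins=4):
    """Illustrative bilinear phase feature: the phase in [0, 2*pi) is discretized
    into `num_bins` bins, and the one-hot bin indicator is combined with the
    state via an outer product. Each bin effectively gets its own copy of the
    state, letting the first network layer apply phase-dependent weights."""
    indicator = np.zeros(num_bins)
    indicator[int(phi / (2.0 * np.pi) * num_bins) % num_bins] = 1.0
    return np.outer(indicator, np.asarray(state)).flatten()
```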
HLC-LLC interface and integration: Currently, the representation used as the interface between the HLC and LLC, $a_H \equiv g_L$, is manually specified, in our case corresponding to the next two footstep locations and the body orientation at the first footstep. The HLC treats these as abstract handles for guiding the LLC, and thus may exploit regions of this domain for which the LLC has never been trained. This is evident in the HLC behaviour visualizations, which show that unattainable footsteps are regularly demanded of the LLC by the HLC. This is not problematic in practice, because the HLC learns to avoid regions of the action space that lead to problematic behaviours from the LLC. Learning the best representation for the interface between the HLC and LLC, as demonstrated in part by [Heess et al., 2016], is an exciting avenue for future work. It may be possible to find representations that allow for LLC substitutions with less performance degradation. An advantage of the current explicitly-defined LLC goals, $g_L$, is that they can serve to define the reward used for LLC training. However, it does result in the LLC's and HLC's being trained with different reward functions, whereas a more conceptually pure approach might simply use a single objective function.

Motion planning: Some tasks, such as path navigation, could also be accomplished using existing motion planning techniques based on geometric constraints and geometric objectives. However, developing efficient planning algorithms for tasks involving dynamic worlds, such as the dynamic obstacles task or the soccer dribbling task, is much less obvious. In the future, we also wish to develop skills that are capable of automatically selecting efficient and feasible locomotion paths through challenging 3D terrains.

Transfer and parameterization: Locomotion can be seen as encompassing a parameterized family of related movements and skills. Knowing one style of low-level motion should help in learning another style, and, similarly, knowing the high-level control for one task, e.g., avoiding static obstacles, should help in learning another related task, e.g., avoiding dynamic obstacles. This paper has demonstrated several aspects of transfer and parameterization. The ability to interpolate (and to moderately extrapolate) between different LLC motion styles provides a rich and conveniently parameterized space of motions. The LLC motions are robust to moderate terrain variations, external forces, and changes in gait period, by virtue of the exploration noise they experience during the learning process. As demonstrated, the HLC-LLC hierarchy also allows for substitution of HLC's and LLC's. However, for HLC/LLC pairs that have never been trained together, performance will be degraded for tasks that are sensitive to the dynamics, such as soccer dribbling. The HLC's can, however, be efficiently readapted to improve performance with additional fine-tuning.

Learning efficiency: The sample efficiency of the training process can likely be greatly improved. Interleaving improvements to a learned dynamics model with policy improvements is one possible approach. While we currently use a positive-temporal-difference advantage function in our actor-critic framework, we intend to more fully investigate other alternatives in future work.
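The positive temporal-difference update mentioned above, in the spirit of CACLA-style actor-critic methods [Van Hasselt, 2012], can be sketched as follows. The object interfaces (`predict`, `train`) are placeholders; the actual implementation details differ.

```python
def positive_td_update(tup, value_fn, policy, gamma=0.95):
    """Critic is regressed towards the TD target on every tuple; the actor is
    moved towards the exploratory action only when the TD error, used here as
    the advantage estimate, is positive."""
    s, a, r, s_next = tup
    td_target = r + gamma * value_fn.predict(s_next)
    delta = td_target - value_fn.predict(s)   # TD error / advantage estimate
    value_fn.train(s, td_target)              # critic update
    if delta > 0:                             # positive-TD gating of the actor update
        policy.train(s, a)
    return delta
```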
Chapter 7
Conclusion

7.1 Discussion

In this thesis we have presented a number of frameworks for developing locomotion skills for simulated characters using deep reinforcement learning. Though the systems share similar underlying learning algorithms, the choice of timescales and action abstractions plays a critical role in determining the skills that can be achieved by the different policies. In Chapter 4, the use of parameterized FSMs enabled policies to operate and explore actions at the timescale of running and jumping steps. The MACE model can be viewed as a control hierarchy where the critics make high-level decisions in selecting the class of actions to perform, while the individual actors make low-level decisions regarding the execution of their respective classes of actions. An important distinction from the hierarchy introduced in Chapter 6 is that both levels of the MACE hierarchy operate at the same timescale, but at different levels of action abstraction. Chapter 5 then explores the development of low-level policies that operate at fine timescales (e.g., 60 Hz). Low-level policies often rely on less domain knowledge in crafting task-specific action abstractions, and instead work directly with simple representations such as torques and target angles. This flexibility partly stems from the policies' fine timescale, which provides a tight feedback loop that can quickly respond and adjust the actions in accordance with changes in the state. However, since reward discounting is performed in a per-step fashion, small timesteps limit a policy's ability to plan over longer time horizons, thereby producing more myopic behaviours. In contrast, policies that operate over coarse timescales can be more effective at planning over longer time horizons, enabling them to perform more complex tasks that require longer-term planning. But larger timesteps often require low-level feedback structures, such as the SIMBICON balance strategy used by the FSMs, to handle unexpected perturbations over the duration of a control step.

The hierarchical policy outlined in Chapter 6 attempts to combine the advantages of different timescales by training each level of the hierarchy to operate with a different timestep. The LLC replaces the hand-crafted FSM utilized in Chapter 4 with learned low-level control strategies that remain robust in spite of significant perturbations. The LLC then provides the HLC with a higher-level action abstraction, in the form of footstep goals, that allows the HLC to explore actions over larger timescales. By acting over larger timesteps, the HLC is able to better accommodate the long-term planning required for more complex tasks such as navigation and soccer dribbling.
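As a rough back-of-the-envelope check of the timescale argument above: with per-step discounting, rewards beyond roughly 1/(1 − γ) steps contribute little, so the effective planning horizon in seconds scales with the control timestep. The step duration of about 0.5 s assumed below for a footstep-level policy is illustrative only.

```python
def effective_horizon_seconds(gamma, dt):
    """Rough effective planning horizon: contributions beyond about
    1 / (1 - gamma) steps are heavily discounted."""
    return dt / (1.0 - gamma)

# With the gamma = 0.95 used in Chapter 6:
print(effective_horizon_seconds(0.95, 1.0 / 60.0))  # ~0.33 s for a 60 Hz low-level policy
print(effective_horizon_seconds(0.95, 0.5))         # ~10 s for a ~0.5 s footstep-level policy
```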
7.2 Conclusion

One of the defining characteristics of physics-based character animation is the high-dimensional continuous state and action spaces that often arise from motion control problems. While handcrafted reduced models have been effective in reproducing a rich repertoire of motions, individual models are often limited in their capacity to generalize to new skills, leading to a catalog of task-specific controllers. Deep reinforcement learning offers the potential for a more ubiquitous framework for motion modeling. By leveraging expressive but general models, control policies can be developed for diverse motions and character morphologies while reducing the need for task-specific control structures. Neural networks' capacity to process high-dimensional inputs also opens the opportunity to develop sensorimotor policies that utilize rich low-level descriptions of the environment.

While DeepRL has enabled the use of high-dimensional low-level state descriptions, our experience suggests that the choice of action abstraction remains a crucial design decision. The choice of action representation influences the timescale at which a policy can operate and shapes the exploration behaviour of the agent during training. Hierarchical policies provide a means of leveraging the advantages of different action abstractions, and can be beneficial for complex tasks that require a balance between long-term and short-term objectives. In the future, we hope to explore techniques that can more effectively learn useful task-specific action abstractions directly from low-level action representations.

7.3 Future Work

Deep reinforcement learning opens many exciting avenues for computer animation and motion control. The methods we have presented in this thesis have been predominantly model-free, as the policies do not leverage any models of the dynamics of their environments. Model-based RL, which utilizes learned dynamics models, may allow agents to explore potentially beneficial behaviours more efficiently than the current use of random exploration noise. A learned dynamics model can also provide agents with valuable mechanisms for planning, which may improve performance on more challenging tasks that require strategic long-term behaviours.

While most of our work has focused on single agents interacting with a predominantly passive environment, multi-agent systems are a promising future application, particularly for films and games, where agents are seldom situated in isolation. Training policies for both cooperative and adversarial tasks opens a wealth of potential applications. Cooperative policies that can observe and anticipate the behaviours of other agents will be a vital component in integrating robots into assistive roles alongside their human counterparts.

With deep reinforcement learning, our policies are able to utilize rich sensory information from high-dimensional low-level state representations, such as heightmaps of upcoming terrain. However, our applications have mainly utilized visual and proprioceptive information. Exploring the use of additional sensory data, such as tactile information, may help in developing more dexterous skills for manipulation tasks. Furthermore, incorporating memory units, such as long short-term memory (LSTM), into the network can be beneficial for tasks that place greater emphasis on exploration and memorization, such as maze navigation.

By providing the policies with reference motions during training, we are able to significantly improve motion quality compared to previous model-free RL methods [Lillicrap et al., 2015a, Schulman et al., 2016]. But the quality of our resulting motions still falls short of what has previously been achieved with carefully engineered models [Coros et al., 2011b, Geijtenbeek et al., 2013]. Due to the prevalent use of simplified character models, designing appropriate motion priors is crucial to reproduce biologically plausible motions. Motion manifolds such as those proposed by Holden et al. [2016] may be a promising tool for providing a policy with feedback on the naturalness of its motions. Alternatively, motion priors can be incorporated into the actuation model by leveraging more biologically plausible actuators [Geijtenbeek et al., 2013]. More realistic muscle models may lead to the emergence of natural behaviours without the need for extensive reward shaping, while also offering more predictive power for applications such as injury prevention and rehabilitation. We believe deep learning will be a valuable tool in developing policies for controlling these complex muscle models.

Bibliography

M. Al Borno, M. de Lasa, and A. Hertzmann. Trajectory optimization for full-body movements with complex contacts. TVCG, 19(8):1405–1414, 2013. → pages 7
J.-A. M. Assael, N. Wahlström, T. B. Schön, and M. P. Deisenroth. Data-efficient learning of feedback policies from image pixels using deep dynamical models. arXiv preprint arXiv:1510.02173, 2015. → pages 35
Y. Bai, K. Siu, and C. K. Liu. Synthesis of concurrent object manipulation tasks. ACM Transactions on Graphics (TOG), 31(6):156, 2012. → pages 58
R. Blickhan, A. Seyfarth, H. Geyer, S. Grimmer, H. Wagner, and M. Günther. Intelligence by mechanics. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 365(1850):199–220, 2007. → pages 39, 40
Bullet. Bullet physics library, Dec. 2015. http://bulletphysics.org. → pages 21, 71
S. Calinon, P. Kormushev, and D. G. Caldwell. Compliant skills acquisition and multi-optima policy search with em-based reinforcement learning. Robotics and Autonomous Systems, 61(4):369–379, 2013. → pages 18
J. Chestnutt, M. Lau, G. Cheung, J. Kuffner, J. Hodgins, and T. Kanade. Footstep planning for the honda ASIMO humanoid. In ICRA05, pages 629–634, 2005. → pages 57
S. Coros, P. Beaudoin, K. K. Yin, and M. van de Panne. Synthesis of constrained walking skills. ACM Trans. Graph., 27(5):Article 113, 2008. → pages 18, 19, 57
S. Coros, P. Beaudoin, and M. van de Panne. Robust task-based control policies for physics-based characters. ACM Transactions on Graphics, 28(5):Article 170, 2009. → pages 7, 57
S. Coros, P. Beaudoin, and M. van de Panne. Generalized biped walking control. ACM Transactions on Graphics, 29(4):Article 130, 2010. → pages 6, 18
S. Coros, A. Karpathy, B. Jones, L. Reveret, and M. van de Panne. Locomotion skills for simulated quadrupeds. ACM Transactions on Graphics, 30(4):Article TBD, 2011a. → pages 18, 38, 40
S. Coros, A. Karpathy, B. Jones, L. Reveret, and M. van de Panne. Locomotion skills for simulated quadrupeds. ACM Transactions on Graphics, 30(4):Article 59, 2011b. → pages 6, 84
M. da Silva, Y. Abe, and J. Popović. Interactive simulation of stylized human locomotion. ACM Trans. Graph., 27(3):Article 82, 2008. → pages 6, 7
M. da Silva, F. Durand, and J. Popović. Linear bellman combination for control of character animation. ACM Trans. Graph., 28(3):Article 82, 2009. → pages 19
M. de Lasa, I. Mordatch, and A. Hertzmann. Feature-based locomotion controllers. In ACM Transactions on Graphics (TOG), volume 29, page 131. ACM, 2010. → pages 7, 40
K. Ding, L. Liu, M. van de Panne, and K. Yin. Learning reduced-order feedback policies for motion skills. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation, 2015. → pages 18
K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002. → pages 19
P. Faloutsos, M. van de Panne, and D. Terzopoulos. Composable controllers for physics-based character animation. In Proceedings of SIGGRAPH 2001, pages 251–260, 2001. → pages 19
R. Featherstone. Rigid Body Dynamics Algorithms. Springer, 2014. → pages 22
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. CoRR, abs/1606.01847, 2016. URL http://arxiv.org/abs/1606.01847. → pages 65
T. Geijtenbeek and N. Pronost. Interactive character animation using simulated physics: A state-of-the-art review. In Computer Graphics Forum, volume 31, pages 2492–2515. Wiley Online Library, 2012. → pages 6
T. Geijtenbeek, M. van de Panne, and A. F. van der Stappen. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics, 32(6), 2013. → pages 18, 39, 40, 44, 84
H. Geyer, A. Seyfarth, and R. Blickhan. Positive force feedback in bouncing gaits? Proc. Royal Society of London B: Biological Sciences, 270(1529):2173–2183, 2003. → pages 41, 44
M. X. Grey, A. D. Ames, and C. K. Liu. Footstep and motion planning in semi-unstructured environments using possibility graphs. CoRR, abs/1610.00700, 2016. URL http://arxiv.org/abs/1610.00700. → pages 57, 58
S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1610.00633, 2016a. → pages 39
S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. CoRR, abs/1603.00748, 2016b. URL http://arxiv.org/abs/1603.00748. → pages 8
P. Hämäläinen, J. Rajamäki, and C. K. Liu. Online control of simulated humanoids using particle belief propagation. ACM Transactions on Graphics (TOG), 34(4):81, 2015. → pages 7
N. Hansen. The cma evolution strategy: A comparing review. In Towards a New Evolutionary Computation, pages 75–102, 2006. → pages 6, 24
M. Haruno, D. H. Wolpert, and M. Kawato. Mosaic model for sensorimotor learning and control. Neural Computation, 13(10):2201–2220, 2001. → pages 18
M. Hausknecht and P. Stone. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015a. → pages 9, 34
M. J. Hausknecht and P. Stone. Deep reinforcement learning in parameterized action space. CoRR, abs/1511.04143, 2015b. → pages 40, 44
N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2926–2934, 2015. → pages 9, 33
N. Heess, G. Wayne, Y. Tassa, T. P. Lillicrap, M. A. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. CoRR, abs/1610.05182, 2016. URL http://arxiv.org/abs/1610.05182. → pages 80
T. Hester and P. Stone. Texplore: real-time sample-efficient reinforcement learning for robots. Machine Learning, 90(3):385–429, 2013. → pages 18
J. K. Hodgins, W. L. Wooten, D. C. Brogan, and J. F. O'Brien. Animating human athletics. In Proceedings of SIGGRAPH 1995, pages 71–78, 1995. → pages 6
D. Holden, J. Saito, and T. Komura. A deep learning framework for character motion synthesis and editing. ACM Trans. Graph., 35(4):Article 138, 2016. → pages 84
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. → pages 19
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM '14, pages 675–678. ACM, 2014a. ISBN 978-1-4503-3063-3. doi:10.1145/2647868.2654889. URL http://doi.acm.org/10.1145/2647868.2654889. → pages 71
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM '14, pages 675–678, New York, NY, USA, 2014b. ACM. ISBN 978-1-4503-3063-3. doi:10.1145/2647868.2654889. URL http://doi.acm.org/10.1145/2647868.2654889. → pages 29
L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics & Automation, 12(4):566–580, 1996. → pages 57
V. Konda and J. Tsitsiklis. Actor-critic algorithms. In SIAM Journal on Control and Optimization, pages 1008–1014. MIT Press, 2000. → pages 15
J. Kuffner, K. Nishiwaki, S. Kagami, M. Inaba, and H. Inoue. Motion Planning for Humanoid Robots, pages 365–374. Springer Berlin Heidelberg, 2005. → pages 57
J. Laszlo, M. van de Panne, and E. Fiume. Limit cycle control and its application to the animation of balancing and walking. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 155–162. ACM, 1996. → pages 6
M. Lau and J. Kuffner. Behavior planning for character animation. In SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 271–280, 2005. → pages 58
J. Lee and K. H. Lee. Precomputing avatar behavior from human motion data. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '04, pages 79–87, 2004. → pages 58
J. Lee and K. H. Lee. Precomputing avatar behavior from human motion data. Graphical Models, 68(2):158–174, 2006. → pages 7
Y. Lee, S. J. Lee, and Z. Popović. Compact character controllers. ACM Transactions on Graphics, 28(5):Article 169, 2009. → pages 7
Y. Lee, S. Kim, and J. Lee. Data-driven biped control. ACM Transactions on Graphics, 29(4):Article 129, 2010a. → pages 6
Y. Lee, K. Wampler, G. Bernstein, J. Popović, and Z. Popović. Motion fields for interactive character locomotion. ACM Transactions on Graphics, 29(6):Article 138, 2010b. → pages 7
Y. Lee, M. S. Park, T. Kwon, and J. Lee. Locomotion control for many-muscle humanoids. ACM Trans. Graph., 33(6):218:1–218:11, Nov. 2014. ISSN 0730-0301. doi:10.1145/2661229.2661233. URL http://doi.acm.org/10.1145/2661229.2661233. → pages 7
S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1071–1079. Curran Associates, Inc., 2014. → pages 8, 33
S. Levine and V. Koltun. Learning complex neural network policies with trajectory optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 829–837, 2014. → pages 8, 33
S. Levine, J. M. Wang, A. Haraux, Z. Popović, and V. Koltun. Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics (TOG), 31(4):28, 2012. → pages 7
S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. CoRR, abs/1504.00702, 2015. → pages 25, 39
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015a. → pages 38, 39, 84
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015b. → pages 9, 33, 34
L. Liu, K. Yin, M. van de Panne, and B. Guo. Terrain runner: control, parameterization, composition, and planning for highly dynamic motions. ACM Trans. Graph., 31(6):154, 2012. → pages 6, 7, 58
L. Liu, M. van de Panne, and K. Yin. Guided learning of control graphs for physics-based characters. ACM Transactions on Graphics, 35(3), 2016a. → pages 40
L. Liu, M. van de Panne, and K. Yin. Guided learning of control graphs for physics-based characters. ACM Trans. Graph., 35(3):Article 29, 2016b. doi:10.1145/2893476. → pages 7
G. Loeb. Control implications of musculoskeletal mechanics. In Engineering in Medicine and Biology Society, 1995, IEEE 17th Annual Conference, volume 2, pages 1393–1394. IEEE, 1995. → pages 39, 40
A. Macchietto, V. Zordan, and C. R. Shelton. Momentum control for balance. In ACM SIGGRAPH 2009 Papers, SIGGRAPH '09, pages 80:1–80:8, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-726-4. doi:10.1145/1576246.1531386. URL http://doi.acm.org/10.1145/1576246.1531386. → pages 7
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. → pages 8, 13, 25, 27, 29
I. Mordatch and E. Todorov. Combining the benefits of function approximation and trajectory optimization. In Robotics: Science and Systems (RSS), 2014. → pages 8, 33
I. Mordatch, M. de Lasa, and A. Hertzmann. Robust physics-based locomotion using low-dimensional planning. ACM Trans. Graph., 29(4):Article 71, 2010. → pages 6, 18, 58
I. Mordatch, K. Lowrey, G. Andrew, Z. Popović, and E. Todorov. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems 28, pages 3132–3140, 2015a. → pages 40
I. Mordatch, K. Lowrey, G. Andrew, Z. Popović, and E. V. Todorov. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems, pages 3114–3122, 2015b. → pages 8, 33
U. Muico, Y. Lee, J. Popović, and Z. Popović. Contact-aware nonlinear control of dynamic characters. ACM Trans. Graph., 28(3):Article 81, 2009. → pages 6, 7
U. Muico, J. Popović, and Z. Popović. Composite control of physically simulated characters. ACM Trans. Graph., 30(3):Article 16, 2011. → pages 19
A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015. → pages 8
V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814. Omnipress, 2010. URL http://www.icml2010.org/papers/432.pdf. → pages 65
E. Parisotto, J. L. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015. → pages 19, 35
P. Pastor, M. Kalakrishnan, L. Righetti, and S. Schaal. Towards associative skill memories. In Humanoid Robots (Humanoids), 2012 12th IEEE-RAS International Conference on, pages 309–315. IEEE, 2012. → pages 18
X. B. Peng, G. Berseth, and M. van de Panne. Dynamic terrain traversal skills using reinforcement learning. ACM Transactions on Graphics, 34(4), 2015. → pages 7, 18, 19, 21, 57
X. B. Peng, G. Berseth, and M. van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2016), 35(5), 2016. → pages 57
J. Pettré, J.-P. Laumond, and T. Siméon. 2-stages locomotion planner for digital actors. In SCA '03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 258–264, 2003. → pages 58
A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015. → pages 19
T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015. → pages 8, 20
J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015. → pages 14, 39
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR 2016), 2016. → pages 9, 84
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proc. International Conference on Machine Learning, pages 387–395, 2014a. → pages 14, 45
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014b. → pages 9, 34
K. W. Sok, M. Kim, and J. Lee. Simulating biped behaviors from human motion data. ACM Trans. Graph., 26(3):Article 107, 2007. → pages 6, 7
B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015. → pages 8
R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation, 2001. → pages 9, 14
R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981. → pages 10, 12, 13
J. Tan, K. Liu, and G. Turk. Stable proportional-derivative controllers. Computer Graphics and Applications, IEEE, 31(4):34–44, 2011. → pages 21, 71
J. Tan, Y. Gu, C. K. Liu, and G. Turk. Learning bicycle stunts. ACM Trans. Graph., 33(4):50:1–50:12, 2014a. ISSN 0730-0301. → pages 40
J. Tan, Y. Gu, C. K. Liu, and G. Turk. Learning bicycle stunts. ACM Transactions on Graphics (TOG), 33(4):50, 2014b. → pages 6, 7
Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE, 2012. → pages 7
A. Treuille, Y. Lee, and Z. Popović. Near-optimal character animation with continuous control. ACM Transactions on Graphics (TOG), 26(3):Article 7, 2007. → pages 7
E. Uchibe and K. Doya. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In Proc. of International Conference on Simulation of Adaptive Behavior: From Animals and Animats, pages 287–296, 2004. → pages 19
L. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008. → pages 31
H. Van Hasselt. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning, pages 207–251. Springer, 2012. → pages 16, 19, 21, 29, 40, 43, 59, 60
H. Van Hasselt and M. A. Wiering. Reinforcement learning in continuous action spaces. In Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, pages 272–279. IEEE, 2007. → pages 19
H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461, 2015. → pages 8, 29
J. M. Wang, D. J. Fleet, and A. Hertzmann. Optimizing walking controllers. ACM Transactions on Graphics, 28(5):Article 168, 2009. → pages 6
J. M. Wang, S. R. Hamner, S. L. Delp, and V. Koltun. Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Trans. Graph., 2012. → pages 40, 41, 42
P. Wawrzyński and A. K. Tanwani. Autonomous reinforcement learning with experience replay. Neural Networks, 41:156–167, 2013. → pages 39
M. Wiering and H. Van Hasselt. Ensemble algorithms in reinforcement learning. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 38(4):930–936, 2008. → pages 18
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229–256, May 1992. ISSN 0885-6125. doi:10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696. → pages 9
D. Wooden, M. Malchano, K. Blankespoor, A. Howard, A. A. Rizzi, and M. Raibert. Autonomous navigation for BigDog. In ICRA10, pages 4736–4741, 2010. → pages 57
K. Yamane, J. J. Kuffner, and J. K. Hodgins. Synthesizing animations of human manipulation tasks. ACM Trans. Graph., 23(3):532–539, 2004. → pages 58
Y. Ye and C. K. Liu. Optimal feedback control for character animation using an abstract model. ACM Trans. Graph., 29(4):Article 74, 2010. → pages 6, 58
K. Yin, K. Loken, and M. van de Panne. Simbicon: Simple biped locomotion control. ACM Transactions on Graphics, 26(3):Article 105, 2007. → pages 6, 18, 73, 74
K. Yin, S. Coros, P. Beaudoin, and M. van de Panne. Continuation methods for adapting simulated skills. ACM Transactions on Graphics, 27(3):Article 81, 2008. → pages 6
P. Zaytsev, S. J. Hasaneini, and A. Ruina. Two steps is enough: no need to plan far ahead for walking balance. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 6295–6300. IEEE, 2015. → pages 62
