UBC Theses and Dissertations
Learning Locomotion: Symmetry and Torque Limit Considerations. Abdolhosseini, Farzad. 2019.


Full Text

Learning Locomotion: Symmetry and Torque Limit Considerations

by

Farzad Abdolhosseini

B.Sc., Sharif University of Technology, 2017

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

The University of British Columbia (Vancouver)

September 2019

© Farzad Abdolhosseini, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Learning Locomotion: Symmetry and Torque Limit Considerations

submitted by Farzad Abdolhosseini in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:
Michiel van de Panne, Computer Science (Supervisor)
Leonid Sigal, Computer Science (Additional Examiner)

Abstract

Deep reinforcement learning offers a flexible approach to learning physics-based locomotion. However, these methods are sample-inefficient, and the results usually have poor motion quality when learned without the help of motion capture data. This work investigates two approaches that can make motions more realistic while having equal or higher learning efficiency.

First, we propose a way of enforcing torque limits on the simulated character without degrading performance. Torque limits indicate how strong a character is and therefore have implications for how realistic the resulting motion looks. We show that using realistic limits from the beginning can hinder training performance. Our method uses a curriculum learning approach in which the agent is gradually faced with more difficult tasks. This way the resulting motion becomes more realistic without sacrificing performance.

Second, we explore methods that can incorporate left-right symmetry into the learning process, which greatly increases motion quality. Gait symmetry is an indicator of health, and asymmetric motion is easily noticeable by human observers. We compare two novel approaches as well as two existing methods of incorporating symmetry into the reinforcement learning framework. We also introduce a new metric for evaluating gait symmetry and confirm that the resulting motion has higher motion quality.

Lay Summary

Reinforcement learning offers a flexible approach to learning locomotion skills in simulation. This work investigates two approaches that can make the learned motions more realistic. We incorporate gait symmetry into the learning process. We also propose to begin the learning process with exceptionally strong characters, which enables them to rapidly discover good solution modes, and then progressively revert to a weaker character in order to obtain a more realistic motion.

Preface

Chapter 4 is unpublished work done by myself with additional input from my supervisor, Michiel van de Panne.

Chapter 5 has been accepted as a long paper at Motion, Interaction and Games (MIG) 2019 as: Farzad Abdolhosseini, Hung Yu Ling, Zhaoming Xie, Xue Bin Peng, Michiel van de Panne. On Learning Symmetric Locomotion. The DUP, PHASE, and NET methods were respectively invented by Hung Yu (Ben) Ling, Xue Bin (Jason) Peng, and myself. The majority of the coding, the writing of the article, and the experiments were done by me and Hung Yu (Ben) Ling. Zhaoming Xie and Xue Bin (Jason) Peng also contributed by conducting the experiments on Cassie and DeepMimic, respectively. Michiel van de Panne helped in writing the paper.
Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
2 Related Work
  2.1 Motion Symmetry
  2.2 Kinematic Locomotion
  2.3 Physics-Based Locomotion
3 Reinforcement Learning
  3.1 Markov Decision Process
  3.2 Policy Gradient Methods
  3.3 Proximal Policy Optimization (PPO)
4 Torque Limit Considerations
  4.1 Introduction
  4.2 Environments
  4.3 Methods
  4.4 Results
    4.4.1 Torque Limit Baseline Experiments
    4.4.2 Torque Limit Curriculum
    4.4.3 Ablation and More Environments
    4.4.4 Curriculum Sensitivity
  4.5 Conclusions
5 Symmetry Considerations
  5.1 Symmetry Enforcement Methods
    5.1.1 Duplicate Tuples (DUP)
    5.1.2 Auxiliary Loss (LOSS)
    5.1.3 Phase-Based Mirroring (PHASE)
    5.1.4 Symmetric Network Architecture (NET)
    5.1.5 Practical Considerations
  5.2 Gait Symmetry Metrics
  5.3 Environments
  5.4 Results
    5.4.1 Summary
    5.4.2 Effect on Learning Speed
    5.4.3 Symmetry Enforcement Effectiveness
    5.5 Discussion
    5.6 Conclusions
6 Conclusions
Bibliography
Appendix: Supporting Materials
  A.1 Chapter 4 Hyper-parameters
  A.2 Mirroring Functions
  A.3 Alternate Symmetric Network Architecture
  A.4 Symmetry in DeepMimic Environment

List of Tables

Table 5.1: Actuation SI. Lower numbers are better.
Table 5.2: Phase-portrait index. Lower numbers are better.
Table 1: Hyper-parameters used in Chapter 4.

List of Figures

Figure 4.1: Final performance of the agent with different torque limit multiplier (TLM) values.
Figure 4.2: The effect of curriculum learning for Walker2D.
Figure 4.3: Torque limit curriculum results for Half Cheetah, Ant, and Hopper.
Figure 4.4: Curriculum sensitivity to initial TLM value.
Figure 4.5: Curriculum sensitivity to the number of steps.
Figure 5.1: A universal method for converting any neural network into a symmetric network policy.
Figure 5.2: Environments.
Figure 5.3: Learning curves for different symmetry methods in each of the four locomotion environments (Section 5.3).
Figure 5.4: Phase portraits for Walker2D and Walker3D. The green curve is for the left hip flexion and red for the right side. The more symmetric the motion, the more aligned the curves are.
Figure 1: Learning curves for the original DeepMimic environment. BASE and PHASE correspond to the symmetry enforcement methods in Figure 5.3.

Glossary

DRL: deep reinforcement learning
FSM: finite-state machine
MDP: Markov decision process
PPO: proximal policy optimization
TL: torque limit
TLM: torque limit multiplier
TRPO: trust-region policy optimization

Acknowledgments

I would like to express my gratitude to those who have helped me throughout this journey. First, my supervisor, Michiel van de Panne, who has guided me at every step of the way. Your cheerful attitude is always motivating, and the way you earnestly care for everyone around you is inspiring. I admire you, not just for your knowledge, but as a great human being.

Second, I would like to thank my lovely partner, Ainaz Hajimoradlu, for her unwavering love and support. Next in line are my colleagues who have helped me improve day by day: Glen Berseth, Hung Yu (Ben) Ling, Zhaoming Xie, Xue Bin (Jason) Peng, and Prashant Sachdeva. Even more important, I would like to thank my family, who have supported and encouraged me all my life.
Particularly my mom and my aunt, Eshrat, without whose sacrifices I would not be here right now. Your love knows no bounds and I am forever grateful for all that you have given me.

Chapter 1: Introduction

We humans can gracefully move around without consciously thinking about how to coordinate our muscles. Our level of mastery over this basic skill makes it seem so mundane that most of us never think much of it. Yet, we are still unable to create controllers that replicate what a newborn horse can do after two hours. This is not, however, due to a lack of effort, as the problem of locomotion has been studied in many fields.

Computer graphics is interested in locomotion in order to bring virtual characters to life. We can divide motion generation techniques into two categories: kinematic and dynamic (physics-based) methods. The kinematic approach produces the desired motion by directly manipulating the object positions and joint angles, as well as their respective velocities. However, modifying these values directly can lead to unrealistic motions. One major reason for this is that the resulting motions can break the laws of physics, and as humans we are adept at picking up such inconsistencies. An intuitive approach to fixing this problem is to use the laws of physics as constraints. This method is known as physics-based animation.

Physics-based animation first requires us to create accurate models of objects and articulated bodies by taking into account their parameters such as masses, dimensions, joint types, torque limits, and so on. We can then apply well-studied Newtonian laws to simulate the interactions between these objects. Finally, the question becomes how we can provide the control needed to generate the movements that we desire.

This question has led many researchers to tackle this challenging problem. Since 1985, when physics was first used for animation [1], many different approaches for motion generation have been proposed. These range from huge optimization problems constrained by the laws of physics that generated motions for a Luxo character [42], to the highly structured FSM-based approaches [44] and the beloved CMA algorithm [12]. Recently, reinforcement learning (RL) has emerged as a promising approach to learning locomotion skills, and it holds significant potential because of its flexibility. However, there remains a compelling need to improve learning efficiency and motion quality for RL to become a widely adopted animation tool. One way of going about this problem is by incorporating expert knowledge, such as left-right symmetry, into the model.

RL for locomotion is generally formulated as a control task in which the agent can manipulate joint motors individually at each time step to achieve certain goals that are modelled by a reward signal. The generality of the RL formulation makes it applicable to a wide variety of settings. Recent successes of RL in animation and machine learning show that it can produce robust learned locomotion for simulated humans, animals, and imaginary legged creatures.

Classical control theory studies almost the same problem as RL but from a different perspective. Control theory usually assumes complete knowledge about the system under investigation, including the environment. Disturbances and uncertainties are modelled explicitly, and extra assumptions about their nature are made in this framework. A control algorithm, such as LQR, can solve the task at hand very efficiently.
However, this approach can fail to deliver any solution for non-linear systems.

RL has been moving in the opposite direction, with model-free methods that attempt to make little to no assumption about the task at hand. This is at odds with the classical control theory perspective, and it reflects why we still talk about RL and control theory as different disciplines. This separation from concrete models allows RL to solve problems at a higher abstraction level than control theory does. This can lead to a universal algorithm that can solve many sorts of problems and is highly flexible. However, this generality comes at a price. These algorithms are generally known to be inefficient, as they require countless interactions with the system under investigation. To make this more concrete, it is common for a task to be considered solved after ten million time steps, which translates to more than one hundred days of non-stop interaction with the character or the robot.

Furthermore, encoding the properties of a desirable motion through the reward signal can prove highly challenging and can fail in unintuitive ways. Common reward functions for walking and running are primarily based on forward progress or a fixed root velocity. Surprisingly, the agent can produce a walk-like behaviour given little to no extra knowledge. However, the motions are usually far from appealing. Because progress is the primary reward signal, the agent tends to learn peculiar motions with the hands flailing around in the air and the head fixed in unnatural positions. Even characters without an upper body commonly find irregular gaits, such as a fencing gait that keeps one foot in front of the other.

Further engineering of the reward can alleviate some of the problems, but it can also produce other issues that are difficult to debug. Specifically, a common argument is that even though different styles of walking exist, humans and other animals tend to choose the most energy-efficient one. Therefore, a common remedy to the problem raised above is to use an energy expenditure cost in the reward function. However, in practice RL practitioners tend to set the weight of this term low enough that it can effectively be ignored, since it can make learning difficult [45].

This motivates us to look at other ways to augment the traditional RL formulation with prior knowledge from experts in order to obtain more natural-looking motions. In this work, we explore two methods that can help achieve higher-quality motion as well as faster learning.

In Chapter 4, we look at the effects of modifying the torque limits on the learning process. This can be seen as an alternative to the energy expenditure cost and the problems it causes, as discussed above. We show that using realistic torque limits from the start can hinder the training process to the point that the agent never learns to move in the allotted time. We can overcome this by using a simple curriculum schedule.

In Chapter 5, we look at a big contributor to the poor quality of the motions that are generated via the RL paradigm, namely asymmetric walking patterns. The left and right sides of the human body are approximately symmetric. Consequently, the walking patterns of healthy humans are generally quite symmetric as well. Symmetric motions are also perceived to be more attractive, e.g., for dance [5], and gait symmetry is seen as a desirable outcome for physiological manipulation [30].
However, RL agents commonly find asymmetric gait patterns such as fencing, which leads with one primary foot and tends not to switch the leading foot. We explore four methods for incorporating symmetry into the RL paradigm and discuss their advantages and drawbacks. We then compare their motion quality as well as the degree to which they achieve gait symmetry in practice.

Chapter 2: Related Work

Locomotion in humans and other animals is a long-standing problem. Different aspects of this problem have been the subject of study in numerous fields for decades, such as computer graphics, robotics, biomechanics, control, and more recently machine learning. In this work, however, we will be focusing on the results from computer graphics and, to a more limited extent, the machine learning community.

2.1 Motion Symmetry

Motion symmetry has been a topic of interest for many years in the study of human motion and movement biomechanics. Symmetric motions are perceived to be more attractive, e.g., for dance [5], and gait symmetry is seen as a desirable outcome for physiological manipulation [30]. While symmetry is a common assumption in the study of gait and posture, individual gaits often do exhibit asymmetries due to various possible functional causes [34]. We refer the reader to a past review article [31] for insights into the degree of symmetry of lower-limb movement during able-bodied gait and the potential influence of limb dominance on the motion symmetry of the lower extremities and human gaits [29]. It is also not obvious how to best quantify the asymmetry of human gaits, and thus specific symmetry metrics have been proposed [18, 39].

2.2 Kinematic Locomotion

Key-framing is a common approach to making characters move. In this technique the animator has to decide the character position and joint orientations for each frame, often using computer-assisted software. However, this approach is time-consuming and requires a highly skilled user. The use of motion capture systems is an alternative approach, but in the past this has required expensive equipment and is constrained by the confines of the studio. Much work has been put into reusing captured motions using techniques such as Motion Graphs [21]. Recent approaches, such as Phase-Functioned Neural Networks [17] and Mode-Adaptive Neural Networks [46], use neural networks to learn kinematic models. They can act based on a directional command from the user and can respond to terrain height variations.

Key-framing relies heavily on the mental biases and knowledge of the animator. It is therefore perhaps the simplest way of incorporating all forms of expert knowledge into the resulting motion, including energy efficiency and symmetry. It is also common in kinematic-based approaches to mirror all the available motion data in order to double the effective size of the dataset and to reflect the often-symmetric nature of human locomotion, e.g., [6, 16].

2.3 Physics-Based Locomotion

The robust control of physics-based character locomotion is a long-standing challenge for character animation. We refer the reader to a survey paper for a detailed history [11]. An early and enduring approach to controller design has been to structure control policies around finite state machines (FSMs) and feedback rules that use a simplified abstract model or feedback law. These general ideas have been applied to human athletics, running [15], and a rich variety of walking styles [7, 22, 44].
Many controllers developed for physics-based animation further use optimization methods to improve controllers developed around an FSM structure, or use an FSM to define phase-dependent objectives for an inverse dynamics optimization to be solved at each time step. Policy search methods, e.g., stochastic local search or CMA [12], can be used to optimize the parameters of the given control structures to achieve a richer variety of motions, e.g., [8, 44], and efficient muscle-driven locomotion [40]. Many of the FSM controllers use hard-coded symmetries, which assign the roles of stance leg or swing leg to the left and right legs as a function of the FSM state. The trajectory-optimization-based methods also commonly assume motion symmetry when convenient, e.g., [25].

More recently, locomotion synthesis has attracted significant attention from the reinforcement learning (RL) community, where the OpenAI Gym tasks have become a popular RL benchmark [4]. In this context, symmetry constraints are commonly not imposed and the torque limits are unrealistic. As a consequence, the resulting motions are often idiosyncratic and have noticeable asymmetries. Further work extends these efforts in a variety of ways, including traversing challenging terrains [13]. More realistic and dynamic motions can be achieved with the help of motion-capture clips [27, 28]; these use what in Chapter 5 is referred to as the PHASE symmetry method, with the goal of more efficient learning. [24] uses a variation of PHASE in which individual strides (half steps) are mirrored and concatenated to generate symmetric reference motions. However, there exist no robust documented experiments to verify efficiency gains. The efficient learning of controllers that are capable of producing high-quality motion for realistic-strength characters remains a challenging problem in the absence of motion capture data. Recent work makes progress on this problem using RL with a combination of energy optimization, a learning curriculum, and an auxiliary motion symmetry loss [45], which we shall refer to as the LOSS method.

Chapter 3: Reinforcement Learning

In this chapter, we provide a brief review of reinforcement learning (RL). RL emerges from the idea that humans and other animals tend to learn about the world through interaction. We can observe the world around us and act in certain ways to achieve our goals, and at the same time learn more about the world that we live in. RL is a computational approach to learning from interactions with the final goal of maximizing a numerical reward signal [36].

RL is applicable to a wide variety of problems, but this generality can also be problematic. The learner is not told which actions are better or worse, and it needs to figure everything out itself using a possibly weak reward signal. To make matters worse, the consequences of an action may not be immediately visible to the learner, as they might be revealed only after a long duration of time. This thesis is focused on introducing some structure into this formulation while keeping its flexibility.

3.1 Markov Decision Process

The problem formulation in RL is based on the concept of a Markov decision process (MDP). An MDP is defined by a tuple {S, A, P, r, γ}, where S ⊆ R^n and A ⊆ R^m are the state space and action space of the problem, respectively.
The transition function P : S × A × S → [0, ∞) is a probability density function, with P(s_{t+1} | s_t, a_t) being the probability density of visiting s_{t+1} given that the system takes action a_t in state s_t. The reward function r : S × A → R gives a scalar reward for each transition of the system. γ ∈ (0, 1] is the discount factor. A deterministic MDP is a special case where the transition function and the initial state distribution are deterministic.

Tasks can be categorized into two categories: episodic and continuing [36]. Episodic tasks can naturally be divided into subsequences known as episodes, such as a single match in sports or one play of a game. Each episode ends after a certain period of time, known as the time horizon, has passed or a pre-specified terminal state has been reached. In continuing tasks, the task goes on without limit and there is no natural notion of an episode. Here, we consider the episodic case with a fixed time horizon T. For a more in-depth discussion please refer to [36]. The goal of reinforcement learning is to find a parameterized policy π_θ, where π_θ : S × A → [0, ∞) is the probability density of a_t given s_t, that solves the following optimization problem:

    \max_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\Big[\sum_{t=0}^{T-1} \gamma^t\, r(s_t, a_t)\Big].    (3.1)

Here, τ = (s_0, a_0, r_0, ..., s_T) is known as a trajectory, and the probability of encountering a trajectory is computed according to

    p_\theta(\tau) = P(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t).

The discount factor γ controls how much we care about future rewards rather than immediate rewards. The (possibly discounted) sum of the future rewards is also known as the return of a trajectory:

    G(\tau) \doteq \sum_{t=0}^{T-1} \gamma^t\, r(s_t, a_t).

Another important categorization of tasks is based on whether the state and action spaces are discrete or continuous. Discrete spaces are generally known to have more well-defined solutions, namely tabular algorithms, than continuous spaces. In this thesis, we focus on tasks where both state and action spaces are continuous.
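To make the episodic formulation above concrete, the following minimal sketch collects one episode and computes its discounted return G(τ). It assumes a Gym-style environment interface and a `policy` callable; these names are illustrative and are not taken from the thesis code.

```python
import numpy as np

def rollout_return(env, policy, T, gamma=0.99):
    """Collect one episode with `policy` and compute its discounted return G(tau).

    Assumed interface: env.reset() -> s, env.step(a) -> (s', r, done, info),
    and policy(s) -> a.
    """
    s = env.reset()
    G, discount = 0.0, 1.0
    for t in range(T):
        a = policy(s)                   # sample a_t ~ pi_theta(. | s_t)
        s, r, done, _ = env.step(a)     # environment transition and reward r(s_t, a_t)
        G += discount * r               # accumulate gamma^t * r_t
        discount *= gamma
        if done:                        # early termination, e.g., the character falls
            break
    return G
```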
3.2 Policy Gradient Methods

Many algorithms have been proposed for solving RL tasks, and each is useful in certain scenarios. This section explains the proximal policy optimization (PPO) algorithm, which is used in the following chapters. For a discussion of other existing methods please refer to [2].

PPO belongs to the class of policy gradient methods, which have been shown to work well on continuous tasks. The idea behind the policy gradient (PG) algorithm is straightforward: optimize the average return by computing an approximate gradient with respect to the underlying policy parameters, and then take a gradient ascent step to increase it.

To optimize the objective, PG directly optimizes the policy π_θ. One of the underlying assumptions of PG is that the policy should be stochastic rather than deterministic for the algorithm to work, although this assumption can be relaxed [23]. Furthermore, we assume that this stochastic policy is parameterized by parameters θ, and therefore the algorithm's job is to find the optimal parameters θ* that maximize Equation (3.1). To maximize the objective, we need to know the gradient ∇_θ J(θ). Since computing this gradient exactly requires knowledge about the dynamics of the MDP, PG approximates it by using the REINFORCE trick [41]:

    \nabla_\theta J(\theta) = \nabla_\theta \int G(\tau)\, p_\theta(\tau)\, d\tau    (3.2)
                            = \int G(\tau)\, \nabla_\theta p_\theta(\tau)\, d\tau    (3.3)
                            = \int G(\tau)\, p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, d\tau    (3.4)
                            = \mathbb{E}_{\tau \sim p_\theta}\big[G(\tau)\, \nabla_\theta \log p_\theta(\tau)\big].    (3.5)

We can switch the integral and the gradient operator if the policy is differentiable everywhere. The equality in (3.4) is based on the following identity:

    p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau).

Next, we can expand ∇_θ log p_θ(τ):

    \nabla_\theta \log p_\theta(\tau) = \nabla_\theta \Big[\log p(s_1) + \sum_{t=1}^{T} \big(\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big)\Big]    (3.6)
                                      = \nabla_\theta \Big[\sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)\Big]    (3.7)
                                      = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t).    (3.8)

The extra terms are independent of θ and therefore do not contribute to the gradient. Substituting this back into Equation (3.5), we arrive at the following expression, where for simplicity the discount factor γ has been set to one:

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\Big[G(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]    (3.9)
                            = \mathbb{E}_{\tau \sim p_\theta}\Big[\Big(\sum_{t=1}^{T} r_t\Big)\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big].    (3.10)

Using this formulation, we can use Monte Carlo sampling [35] to approximately compute the gradient in order to iteratively improve the policy:

    \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big[\Big(\sum_{t=1}^{T} r_{i,t}\Big)\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big],    (3.11)

where τ_i = (s_{i,1}, a_{i,1}, r_{i,1}, ..., s_{i,T}) are sample trajectories from p_θ. With this we arrive at the REINFORCE algorithm [41].

Unfortunately, this is not a good estimator in practice, as its variance can be quite high. Multiple tricks exist that try to alleviate this problem. The simplest is to increase the number of sampled trajectories N; however, this also makes the algorithm less efficient. One observation is that the reward at time step t only causally depends on actions taken up to time t and is independent of decisions made afterwards. In other words, action a_t can only be responsible for the cost-to-go from time t forward. With some abuse of notation we can write:

    \nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim p_\theta}\Big[\sum_{t=1}^{T} \Big(\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r_{t'}\Big)\Big]    (3.12)
                             = \mathbb{E}_{s_t \sim p_\theta}\Big[T\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r_{t'}\Big],    (3.13)

where s_t is any state randomly sampled from a randomly sampled trajectory.

Another trick is to use a learned value function V̂(s), also known as a critic. It can be shown that subtracting a fixed value from the cumulative return in the above formula does not change the value of the expectation. Therefore, if V̂(s_t) is a good estimate of Σ_{t'=t}^{T} r_{t'}, then the following approximator has lower variance:

    \nabla_\theta J(\theta) \approx \mathbb{E}_{s_t \sim p_\theta}\Big[T\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big(\sum_{t'=t}^{T} r_{t'} - \hat{V}(s_t)\Big)\Big].    (3.14)
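A minimal sketch of this estimator is given below, written as a surrogate loss whose gradient approximates Equation (3.14) up to a constant factor. It assumes `policy(s)` returns a factorized torch.distributions.Normal over actions and `value_fn(s)` a scalar state-value estimate; both interfaces are illustrative stand-ins, not the exact ones used in the thesis code.

```python
import torch

def reinforce_with_baseline_loss(policy, value_fn, states, actions, rewards):
    """Surrogate loss for the Monte Carlo policy gradient with a value baseline.

    states, actions: tensors of shape [T, dim]; rewards: tensor of shape [T].
    """
    # Cost-to-go from each time step: sum_{t'=t}^{T} r_{t'} (undiscounted, as in Eq. 3.12).
    returns_to_go = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    # Advantage: subtract the value-function baseline V(s_t) to reduce variance.
    advantages = returns_to_go - value_fn(states).squeeze(-1).detach()
    # log pi_theta(a_t | s_t), summed over action dimensions of the diagonal Normal.
    log_probs = policy(states).log_prob(actions).sum(-1)
    # Minimizing this loss performs gradient ascent on the Eq. (3.14) estimate.
    return -(log_probs * advantages).mean()
```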
Thetrust-region policy optimization (TRPO) algorithm [32] was introduced to solve this problem.The authors explain that destructive updates happen because the algorithm makes large updatesbased on an optimistic guess. Therefore TRPO only allows the policy to be updated within atrusted region. In other words, it is a step-size control mechanism.Later on, PPO was introduced as a faster alternative to TRPO. Instead of explicitly en-forcing a trust-region, PPO slightly changes the PG update rule as follows:๐ฟ๐ถ๐ฟ๐ผ๐‘ƒ = ๐”ผ๐‘ ๐‘กโˆผ๐‘๐œƒ [min(๐‘Ÿ๐‘ก(๐œƒ) ฬ‚๐ด๐‘ก,clip(๐‘Ÿ๐‘ก(๐œƒ),1โˆ’๐œ–,1+๐œ–) ฬ‚๐ด๐‘ก)], (3.15)where the probability ratio is defined as ๐‘Ÿ๐‘ก(๐œƒ)= ๐œ‹๐œƒ(๐‘Ž๐‘ก|๐‘ ๐‘ก)๐œ‹๐œƒ๐‘œ๐‘™๐‘‘ (๐‘Ž๐‘ก|๐‘ ๐‘ก) and the advantage is just a shorthandfor ฬ‚๐ด๐‘ก =โˆ‘๐‘‡๐‘กโ€ฒ=๐‘ก ๐‘Ÿ๐‘กโˆ’ ฬ‚๐‘‰ (๐‘ ๐‘ก). The hyper-parameter ๐œ– controls the step-size and it commonly takeson a value in the range [0.1,0.2]. For more information please refer to [33].Intuitively, by clipping the objective, the flow of gradient is blocked if it tries to push the9current policy outside the interval [1โˆ’๐œ–,1+๐œ–] around the old policy. Therefore, PPO has beenadvertised as enforcing a trust-region on the updates. Although the validity of this claim hasrecently been challenged [19], the fact remains that with the right hyper-parameter setting PPOis one of the best model-free algorithms in practice today.10Chapter 4Torque Limit ConsiderationsMuch effort in reinforcement learning has been spent on finding better or faster algorithmsthat can solve any RL problem in a model-free way, i.e. only through interactions with theenvironment. However, most success stories, such as TD-Gammon [37] and AlphaGo [10], stillrequired careful engineering. Therefore, we believe that the design of the character and the taskscan be in some cases far more important to achieving superior results than the algorithm itself.In other words, making the problem itself simpler and more compatible with the algorithm usedcan be more productive than finding new algorithms that claim to solve more difficult problems.In this chapter, we look at the effects of different torque limit (TL) settings and we show thata simple torque limit curriculum can help achieve higher rewards and more reliable results.4.1 IntroductionTo measure the strength of a person we can measure the strength of his/her muscles. Similarly,we can measure the strength of a robot by measuring the strength of its motors. To do this, weneed to consider each joint separately. A natural measure of strength is the amount of torquethat the joint can produce. Furthermore, as ideal robots are consistent and never get tiredlike humans do, we only need to measure the maximum amount of torque output. Hence, astandard way of representing robot strength is through a TL vector. Naturally, a simulatedcharacter modelled after a robot needs to have a pre-specified TL. The correctness of TL is ofhigh importance as it determines the capabilities of the robot. Having a higher or lower TLsetting, therefore, has serious implications on what the agent can or cannot do and how quicklyor more reliably the task can be solved.Experimenting in a simulator, however, allows us to use torque limits that are much higheror lower than in real life and to investigate their effects. In most applications, lower limitstend to be more desirable. 
Chapter 4: Torque Limit Considerations

Much effort in reinforcement learning has been spent on finding better or faster algorithms that can solve any RL problem in a model-free way, i.e., only through interactions with the environment. However, most success stories, such as TD-Gammon [37] and AlphaGo [10], still required careful engineering. Therefore, we believe that the design of the character and the task can in some cases be far more important to achieving superior results than the algorithm itself. In other words, making the problem itself simpler and more compatible with the algorithm used can be more productive than finding new algorithms that claim to solve more difficult problems. In this chapter, we look at the effects of different torque limit (TL) settings, and we show that a simple torque limit curriculum can help achieve higher rewards and more reliable results.

4.1 Introduction

To measure the strength of a person we can measure the strength of his or her muscles. Similarly, we can measure the strength of a robot by measuring the strength of its motors. To do this, we need to consider each joint separately. A natural measure of strength is the amount of torque that the joint can produce. Furthermore, since ideal robots are consistent and never get tired like humans do, we only need to measure the maximum torque output. Hence, a standard way of representing robot strength is through a TL vector. Naturally, a simulated character modelled after a robot needs to have a pre-specified TL. The correctness of the TL is of high importance, as it determines the capabilities of the robot. Having a higher or lower TL setting therefore has serious implications for what the agent can or cannot do and how quickly or reliably the task can be solved.

Experimenting in a simulator, however, allows us to use torque limits that are much higher or lower than in real life and to investigate their effects. In most applications, lower limits tend to be more desirable. In robotics, using excessive forces can cause damage to the robot as well as its surroundings, including any humans or animals that may be in its vicinity. Using a low torque limit can be seen as a simple but effective safety mechanism that can alleviate many problems. The consequences of using unrealistic limits in computer graphics may not be as dire, but humans are good at perceiving inconsistent movements performed by human-like characters. An impossibly high jump, pushing a heavy object with one hand, and recovering from a fall by using excessive strength are all immediately recognizable by a human observer. Low limits, on the other hand, are usually not noticeable unless the character or robot fails to accomplish the task at hand, as people rarely use maximum strength for day-to-day tasks.

Consequently, researchers in robotics and computer graphics tend to care about using realistic limits on their characters. In contrast, the machine learning community is usually less concerned with such details and tends to use characters with excessive TL. This has been a source of (informal) complaints from the aforementioned communities.

One common way of compelling the agent to use less force is via an energy consumption cost. This approach is widely used in practice. However, it has two disadvantages. First, the agent can still use excessive force if it so desires. More importantly, the energy consumption cost can be harmful for training. Consequently, the weight of the energy consumption cost is often set low enough that it may be ignored by the agent [45].

We aim to solve both of these problems by using TL as hard constraints. As a result, this chapter aims to answer the following questions: how much does the torque limit affect learning, and can we make use of this knowledge to find better solutions?

Our results indicate that the TL setting strongly affects the final solution. Below a certain threshold, the learning algorithm generally fails to find, in a reasonable amount of time, any solution that can solve the task. We demonstrate that this phenomenon is not a consequence of the problem setup but rather a limitation of the optimization procedure, by showing the existence of higher-performing solutions with the same setting. Finally, we offer a simple way to fix this problem by using a curriculum during learning. This helps the agent to learn in a simpler environment and then transfer the learned solution to the more difficult setting.

4.2 Environments

We experiment with a set of existing locomotion environments from Roboschool [20]. All characters are simulated using PyBullet [9], which is a Python interface for the Bullet3 physics engine. In all of the environments, the task is for the character to walk as far as possible in the forward direction in the allotted time. The reward function also includes terms to encourage the agent to use less energy and to stay "alive" longer, i.e., to not fall. The observation space in all of these environments consists of root information (root z-coordinate, x and y heading vector, root velocity, roll, and pitch), joint angles, joint angular velocities, and binary foot contact information. The torques are all normalized between −1 and 1.

Walker2D is a simplified bipedal character whose movements are constrained to a 2D plane. An action is a 6D vector corresponding to torques at the hip, knee, and ankle of both the left and right legs. The observation space is 22D and the character weighs about 24 kg.
Hopper is a one-legged character constrained to a 2D plane. It is similar to the Walker2D with one leg missing. The action space is 3D, with the dimensions controlling the torques at the hip, knee, and ankle. The observation space is 15D and the character weighs about 16 kg.

HalfCheetah is a 2D model that closely resembles a quadruped with only a fore and a hind leg (no sides). An action is a 6D vector, similar to Walker2D, corresponding to the normalized torques at the thigh, shin, and foot of each of the fore and hind legs. The observation space is 26D and consists of the same information as Walker2D with the addition of more fine-grained contact information. The character weighs about 38 kg.

Ant is a 3D character that resembles an insect with four legs. It consists of a torso as well as four legs that are each divided into two segments. The action space is 8D and the observation space is 28D, containing the same information as Walker2D. Despite being three-dimensional, this character is highly stable due to having four legs. The character weighs about 182 kg.
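As a quick way to inspect these environments, the sketch below loads the PyBullet versions and prints their observation and action dimensions. The environment IDs shown are the standard ones registered by the pybullet_envs package and are an assumption here; the thesis experiments use slightly modified versions of these environments.

```python
import gym
import pybullet_envs  # noqa: F401  (registers the Bullet versions of the Roboschool tasks)

# Env IDs assumed from the standard pybullet_envs registry; the thesis code may differ.
for env_id in ["Walker2DBulletEnv-v0", "HopperBulletEnv-v0",
               "HalfCheetahBulletEnv-v0", "AntBulletEnv-v0"]:
    env = gym.make(env_id)
    print(env_id,
          "obs:", env.observation_space.shape[0],   # e.g., 22 for Walker2D
          "act:", env.action_space.shape[0])        # e.g., 6 for Walker2D
    env.close()
```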
4.3 Methods

To experiment with different torque limits, we need a way to dynamically modify the limits in each environment. We therefore slightly modified the aforementioned environments to include a parameter denoted the torque limit multiplier (TLM). TLM is a single scalar variable, and all the torque limits are multiplied by this shared scalar, which lets us shrink or expand the limits all at once. It is possible to have a specific multiplier for each joint; however, a single variable is sufficient for the purposes of our work.

We use our implementation of PPO [33] for all the experiments.¹ The hyper-parameters used can be found in Section A.1. Each experiment collected 6 million environment time steps in total, and all the experiments were replicated five times.

¹ The source code is available at https://github.com/farzadab/walking-benchmark

4.4 Results

In this section, we begin by demonstrating the effect of the torque limit on training and then provide a more appropriate way of enforcing this limit. All plots in this section report the average performance as well as the minimum and maximum across five runs.

4.4.1 Torque Limit Baseline Experiments

To determine how the torque limit can affect the final solution, we run a set of experiments on the Walker2D environment with different values of TLM. To rule out the effect of random noise, which is an important contributing factor in deep reinforcement learning (DRL) experiments [14], we run each experiment five times with different random seeds, and the mean of the data is reported along with the best and the worst results.

[Figure 4.1: Final performance of the agent with different TLM values: (a) cumulative rewards and (b) amount of progress. The error bars reflect the best and the worst performance achieved by the same algorithm over five runs with different random seeds.]

The final cumulative return, as well as the amount of progress made at test time, are shown in Figure 4.1. Not surprisingly, the agents with a higher torque limit achieved higher returns. Interestingly, below a certain point (all runs with TLM ≤ 0.8 as well as some runs with TLM = 1) the agent failed to make any forward progress. Note that the cumulative return in these cases is still as high as one thousand. This is the result of the agent learning to stand still and avoiding early termination instead of learning to walk.

The results for the default torque limit (TLM = 1 in the same figure) show us something interesting. First, the variance in the results is high. This is a known problem of reinforcement learning algorithms [14]. More importantly, this hints at the existence of a local optimum where the agent does not learn to move and just avoids early termination by standing still. Even though the results of training with higher torque limits show variation as well, none of them seem to be stuck in this local optimum. Lastly, the results seem to indicate that it is not possible to walk with the lower torque limits of 0.6 and 0.8.

4.4.2 Torque Limit Curriculum

If our hypothesis is correct that the agents are getting stuck in a local optimum with lower torque limits, it may be possible to get a better controller simply by using a better initialization. Agents trained with higher torque limits can intuitively provide a good starting point. This leads us to curriculum learning [3].

Curriculum learning is motivated by how humans and animals learn, and it is based on the idea of learning gradually, from simple concepts to more difficult ones. In this approach, instead of training on the most difficult version of the problem, the training is divided into multiple stages, where the first stage is a simplified version of the problem and the task becomes more difficult at each stage, until the last stage, in which the agent is faced with the original version of the problem.

According to Figure 4.1, tasks with higher torque limits seem to be easier to solve. Therefore, we can define a curriculum where the agent first sees high torque limit environments and the limit is then gradually lowered linearly until it matches our final target. To make the training more stationary, the training is divided into several levels during which the TLM stays fixed. The number of these levels is a hyper-parameter denoted NLevels. For most experiments we used NLevels = 10.
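A minimal sketch of these two ingredients, the shared TLM scaling and the stepped annealing schedule, is given below. The wrapper is a hypothetical illustration: it rescales the normalized action by the TLM, which is equivalent to scaling the torque limits only when the environment maps actions linearly to joint torques; the thesis implementation instead modifies the limits inside the environment.

```python
import gym
import numpy as np

class TorqueLimitMultiplier(gym.Wrapper):
    """Scale the effective torque limits by a single scalar (the TLM)."""
    def __init__(self, env, tlm=1.0):
        super().__init__(env)
        self.tlm = tlm

    def step(self, action):
        # Normalized actions lie in [-1, 1]; rescaling them shrinks/expands the limits.
        return self.env.step(np.clip(action, -1.0, 1.0) * self.tlm)

def tlm_schedule(progress, tlm_start=1.2, tlm_end=0.6, n_levels=10):
    """Stepped linear annealing of the TLM over training.

    `progress` is the fraction of total training steps completed (0 to 1);
    the TLM is held fixed within each of the `n_levels` stages.
    """
    level = min(int(progress * n_levels), n_levels - 1)
    frac = level / (n_levels - 1)
    return tlm_start + frac * (tlm_end - tlm_start)
```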
The results of applying the torque limit curriculum can be seen in Figure 4.2. This method not only achieves higher performance, it also manages to sidestep the local optimum, as is evident in Figure 4.2b. As a bonus, this approach seems to be more reliable in most cases, as its variance is lower than that of the baseline approach, except at TLM = 0.6. In the latter case, the baseline always converges to the local optimum, which has relatively low variance; however, this is not the desired behaviour.

[Figure 4.2: The effect of curriculum learning for Walker2D: (a) cumulative rewards and (b) amount of progress. Blue plots show the final performance of the agents which started out with TLM = 1.6, where the TLM was decreased to the target value in ten steps. The red bars are the same as in Figure 4.1.]

4.4.3 Ablation and More Environments

To further validate our approach, we test it on the Half Cheetah, Ant, and Hopper environments as well. As a baseline, we keep the TLM fixed at the target value. We also compare our results with another curriculum approach that is very similar to our own, namely an exploration rate curriculum, where the TLM is kept fixed, the same as in our baseline. In this approach, the amount of exploration noise is varied at training time by starting at a high value to explore the solution space well and annealing it to a lower value in order to increase the final motion quality.

The results are shown in Figure 4.3. All experiments used the same hyper-parameters, namely NLevels = 10 and the TLM going from 1.2 to 0.6. The exploration curriculum seems to be helpful to some extent, but the TLM curriculum works best in all environments, most notably for the Ant.

[Figure 4.3: Torque limit curriculum results (mean cumulative reward and mean progress over training time steps) for Half Cheetah, Ant, and Hopper. The dips in performance reveal the points at which the value of TLM was reduced. The final TLM is constant across all methods. The shaded area corresponds to the minimum and the maximum values across five runs.]

4.4.4 Curriculum Sensitivity

This curriculum learning technique seems to be useful in the different environments that we tested on, even though the gains vary across domains. However, as with many approaches, this method also includes hyper-parameters that need to be chosen by the user. We can assume that the target torque limits are a given, but the starting limits are not fixed. Furthermore, the optimal value for the NLevels hyper-parameter is also unknown. An important question to ask is: how sensitive is this method to its hyper-parameters? We designed two experiments to answer this question.

First, we look at the starting TLM. We assume that the final TLM of 0.6 is fixed, and we vary the starting TLM value. The results of the experiment on the Walker2D environment can be seen in Figure 4.4. Surprisingly, all initial values seem to work relatively well. Specifically, by comparing the results for TLM = 0.8 with Figure 4.1b, we see that the final performance has increased rather than decreased, even though the final TLM has decreased. This might simply be due to randomness, but it is also possible that the sudden changes in curriculum training let the agent escape local optimum regions more easily. More importantly, the results seem to indicate that the method is not sensitive to this hyper-parameter as long as the initial TLM value is high enough to stumble upon a good solution.

[Figure 4.4: Curriculum sensitivity to the initial TLM value (mean progress). All experiments with TLM > 1.2 have an acceptable performance, both in terms of the average and the worst performance.]

Next, we look at the number of curriculum steps required, while keeping the total number of simulation steps fixed. A low number means abrupt changes, whereas a high number means changing slowly but frequently. Both approaches have their merits.
Changing slowly means the previous solution will still work under the new conditions, but it also means that there might not be enough time to adjust to the new situation before the next change. Abrupt changes can also be useful for getting out of a local solution. Again, we run an experiment similar to the previous one on the Walker2D environment, with the TLM going from 1.6 to 0.6 using different numbers of steps. The results are provided in Figure 4.5.

[Figure 4.5: Curriculum sensitivity to the number of curriculum steps (mean progress).]

The results in both cases show some variability, which is to be expected, but there is no clear winner or general trend to be pointed out. This is reassuring, as it shows that the method is not sensitive to small hyper-parameter changes. This makes the method more easily applicable to different settings.

4.5 Conclusions

In this chapter, we showed that the construction and the details of the locomotion environment are important, and that the best design for learning need not be the most realistic one. Here, we looked at the effects of torque limits on learning. Torque limits describe how strong the character is and indirectly decide which movements are possible and which ones are not. The robotics and computer graphics communities tend to specify torque limits based on real-life robots and animals, but the machine learning community is less concerned with such details.

The torque limit setting is indeed important, as evident from the decrease in the final performance achieved under different settings. Furthermore, we show that this setting is especially important in the context of reinforcement learning, since with more restrictive configurations the learner tends to get stuck in local optimum regions increasingly often. Therefore, doing the initial training with higher torque limits is useful for sidestepping local optima, and the resulting policy is perhaps more robust as a result of experiencing slightly different versions of the same environment.

Chapter 5: Symmetry Considerations

One obvious path towards faster-and-better learning relies on exploiting the motion symmetry that is a common attribute of human and animal locomotion; gait symmetry is an indicator of healthy outcomes in physiotherapy [29, 30]. Relatedly, asymmetric gaits are often associated with physical injuries and neural impairments such as stroke. A symmetry constraint or symmetry-favouring bias thus offers a readily available and convenient path towards faster learning and more realistic outcomes. It is also largely orthogonal to most other efficiency improvements.

Naively, exploiting symmetry might be expected to yield a 2× learning speedup, and it may help to avoid some of the undesired asymmetric local minima that DRL is prone to exploit. On the other hand, it could also be the case that asymmetric policies and motions serve a useful role as an intermediate path towards finding eventual optimal symmetric motions, and therefore hard symmetry constraints may be problematic. Another important subtlety is that while a symmetric policy helps achieve symmetric motions, it does not guarantee a symmetric outcome. For example, a quadruped gallop and a biped lope are asymmetric gait cycles, as each gait cycle begins with a leading left or right foot, while the underlying policy can still be fully symmetric.

What is the best way to integrate a symmetry bias or other forms of symmetry enforcement into the learning process? How much benefit does it offer in terms of learning speed and learning outcomes?
What are other considerations for symmetry-informed methods? The principal contribution of our work is an in-depth analysis of four different methods of incorporating symmetry into the learning process:

DUP: Duplicating tuples with their symmetric counterparts.
LOSS: Adding a symmetry auxiliary loss.
PHASE: Motion phase mirroring.
NET: Enforcing symmetry in the network itself.

Two of these methods are new (DUP, NET) and two are already present in the existing literature (LOSS, PHASE), albeit without a systematic evaluation of all the issues around symmetry enforcement. The methods incorporate knowledge of symmetry into the policy structure (NET), the learning data (DUP, PHASE), or via the learning loss (LOSS). We also believe that the results are of more general interest because they illustrate (and experimentally validate) various ways that inductive biases can be incorporated into DRL methods.

5.1 Symmetry Enforcement Methods

We now describe four methods for enforcing symmetry, using duplicate tuples, auxiliary losses, a time-indexed motion phase, and architecture-based methods. We begin by formally defining symmetric trajectories and symmetric policies. Two trajectories are symmetric if, for each state-action tuple (s, a) from one trajectory, the corresponding state-action tuple of the other trajectory is given by (M_s(s), M_a(a)), where M_s and M_a are defined as follows:

    M_s : S \to S,  M_s(s) = the mirror of state s,
    M_a : A \to A,  M_a(a) = the mirror of action a.

Note that the mirroring functions are attributes of the environment and not attributes of the enforcement method or the learning pipeline. Here we use "environment" to refer to the combination of the character, its simulated world, and the task, as is common in RL settings. All the symmetry enforcement methods we shall describe require both of these functions as a minimum requirement. Similarly, we can define a symmetric policy to be one where the following holds for all states s ∈ S:

    \pi_\theta(M_s(s)) = M_a(\pi_\theta(s)).    (5.1)

A symmetric policy thus produces the mirrored action when given the mirrored state as input. RL methods such as PPO also use state-value functions during the learning process. The output of these value functions should remain unchanged for any state and its mirrored counterpart. The construction of the mirror functions for our environments (Section 5.3) is further elaborated in Section A.2.

The methods discussed in this section attempt to achieve symmetric gaits by encouraging or constraining the learned policies to be symmetric. However, even if successful, this may be insufficient to guarantee a symmetric gait. In particular, a symmetric policy may learn to favour motions with staggered poses, where the dominant foot is always in front. This may confer advantages concerning balance and agility. Once such a policy is initialized to an initial asymmetric staggered pose, it can continue with an asymmetric motion. With regard to the policy, it is not always possible to achieve exact symmetry in a parameterized model such as a neural network. For example, regions of the state space may remain unexplored during the learning process, and thus symmetry cannot be enforced for such regions. Therefore, the equality in Equation (5.1) is not always assumed to be strict.
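To make the mirroring functions concrete, the sketch below shows one common way of implementing M_s and M_a for a vector whose entries are split into unpaired (common) values, some of which flip sign under reflection, and paired left/right values. The index lists are hypothetical placeholders; the actual per-environment layouts are given in Section A.2, and the sketch assumes the sign-flipping entries belong to the common part.

```python
import numpy as np

# Hypothetical index layout, for illustration only.
NEGATE = [3, 6]         # common entries that flip sign when mirrored (e.g., y-velocity, roll)
LEFT   = [10, 11, 12]   # indices of left-side joint quantities
RIGHT  = [13, 14, 15]   # indices of the corresponding right-side quantities

def mirror(x, negate=NEGATE, left=LEFT, right=RIGHT):
    """Generic mirroring: swap left/right entries and negate lateral quantities.

    The same recipe implements both M_s (on states) and M_a (on actions),
    with index lists chosen per environment.
    """
    x = np.asarray(x, dtype=np.float64)
    m = x.copy()
    m[negate] *= -1.0      # lateral quantities change sign under reflection
    m[left] = x[right]     # swap the left- and right-side entries
    m[right] = x[left]
    return m
```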
It is possible to directly optimize for gait symmetry with reinforcement learning by including quantitative symmetry measures in the reward function, such as the Symmetry Index [30] or other measures [39]. However, we share the sentiment of previous work [45] that directly optimizing such measures may be ineffective, as they introduce delayed or sparse rewards that may make the learning problem more difficult. Consequently, our work focuses on methods that can be used for obtaining approximately symmetric policies, which are described in the remainder of this section.

5.1.1 Duplicate Tuples (DUP)

This method may be the most intuitive way of achieving symmetry and is a form of data augmentation. In this approach each trajectory tuple is duplicated, mirrored, and then added as a valid experience tuple along with the original. More formally, let τ = (s_1, a_1, r_1, ..., s_T) be a trajectory sampled from the environment. A post-processing step computes the mirrored trajectory of τ, i.e., τ' = (M_s(s_1), M_a(a_1), r_1, ..., M_s(s_T)), and both τ and τ' are added to the roll-out memory buffer for learning. Notice that the rewards r_1, ..., r_{T−1} are the same in both τ and τ'. This is because the reward function r(s, a) is automorphic under the symmetry transformation, namely r(s, a) = r(M_s(s), M_a(a)).

One drawback of this approach is that the mirrored tuples are not strictly on-policy, as assumed by policy-gradient RL methods. Thus it could be problematic when used with methods such as PPO [33] and TRPO [32]. The off-policy issue arises because at training time the policy π_θ is not guaranteed to be symmetric, and therefore the probability of sampling action M_a(a_t) from π_θ(M_s(s_t)) could be low, effectively corresponding to an off-policy action. However, our results show that this is not necessarily a critical issue in practice.

5.1.2 Auxiliary Loss (LOSS)

In this method, proposed by Yu et al. [45], the authors create a symmetry loss defined as follows:

    L_{sym}(\theta) = \sum_{t=1}^{T} \big\| \pi_\theta(s_t) - M_a(\pi_\theta(M_s(s_t))) \big\|^2    (5.2)

and optimize it as an auxiliary loss in addition to the default PPO loss:

    \pi_\theta = \arg\min_\theta\ L_{PPO}(\theta) + w\, L_{sym}(\theta),    (5.3)

where w is a scalar hyper-parameter used to balance the gait symmetry loss with the standard policy optimization loss, which aims to maximize the original objective. The authors use w = 4 for their results. An alternative approach would be to simply include the symmetry loss as an extra reward term. However, the auxiliary loss is generally preferable: the loss term is differentiable and therefore provides a clear signal to optimize, rather than being included via the PPO-approximated gradient. Changing the reward function may also induce unexpected behaviours.

Yu et al. [45] showed improvements in sample efficiency for their four tasks by a factor of approximately two (see Figure 8 in [45]). However, the symmetric loss is shown to be beneficial only in the context of a given curriculum learning algorithm; in its absence, there was no significant improvement over a vanilla-PPO baseline, and in one case (the humanoid) using the symmetric loss proved to be detrimental (please refer to the same plot). The addition of an extra hyper-parameter may generally be seen as undesirable. However, in practice, we find in our experiments that the method is not very sensitive to the choice of w, and we end up using the default value in all settings.
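Both of the methods above are straightforward to express in code. The sketch below shows the core of each, treating `policy` as a differentiable torch module that outputs mean actions and the mirror functions as torch-compatible operations (index permutations and sign flips) acting on the last, feature dimension of batched tensors. These are assumptions of the sketch, which simplifies the implementation used for the experiments.

```python
import torch

def duplicate_and_mirror(rollout, mirror_state, mirror_action):
    """DUP: augment a rollout with its mirrored counterpart (rewards unchanged)."""
    mirrored = [(mirror_state(s), mirror_action(a), r) for (s, a, r) in rollout]
    return rollout + mirrored

def symmetry_loss(policy, states, mirror_state, mirror_action):
    """LOSS: auxiliary term of Equation (5.2), here averaged over a batch of states."""
    a = policy(states)
    a_mirrored = mirror_action(policy(mirror_state(states)))
    return ((a - a_mirrored) ** 2).sum(-1).mean()

# Total objective as in Equation (5.3): total_loss = ppo_loss + w * symmetry_loss(...),
# with w = 4 as the default weight.
```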
5.1.2 Auxiliary Loss (LOSS)

In this method, proposed by Yu et al. [45], a symmetry loss is defined as follows:

    L_{sym}(\theta) = \sum_{t=1}^{T} \left\| \pi_\theta(s_t) - M_a\big(\pi_\theta(M_s(s_t))\big) \right\|^2    (5.2)

and optimized as an auxiliary loss in addition to the default PPO loss:

    \pi_\theta = \arg\min_\theta \; L_{PPO}(\theta) + w \, L_{sym}(\theta),    (5.3)

where w is a scalar hyper-parameter used to balance the gait symmetry loss against the standard policy optimization loss, which aims to maximize the original objective. The authors use w = 4 for their results. An alternative approach would be to simply include the symmetry loss as an extra reward term. However, the auxiliary loss is generally preferable: the loss term is differentiable and therefore provides a clear signal to optimize, rather than being included via the PPO-approximated gradient. Changing the reward function may also induce unexpected behaviours.

Yu et al. [45] showed improvements in sample efficiency for their four tasks by a factor of approximately two (see Figure 8 in [45]). However, the symmetric loss is shown to be beneficial only in the context of a given curriculum learning algorithm; in its absence, there was no significant improvement over a vanilla-PPO baseline, and in one case (the humanoid) using the symmetric loss proved to be detrimental (see the same plot). The addition of an extra hyper-parameter may generally be seen as undesirable. However, in practice we find in our experiments that the method is not very sensitive to the choice of w, and we use the default value in all settings.
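A minimal PyTorch-style sketch of Equations (5.2) and (5.3) is shown below. It assumes the policy maps a batch of states to mean actions and that batched mirroring functions operating on tensors are available; the details of our actual training code may differ.

```python
import torch

def symmetry_loss(policy, states, mirror_state, mirror_action):
    """L_sym(theta) from Equation (5.2), summed over the batch of states."""
    actions = policy(states)                                   # pi_theta(s_t), mean actions
    mirrored = mirror_action(policy(mirror_state(states)))     # M_a(pi_theta(M_s(s_t)))
    return ((actions - mirrored) ** 2).sum()

def total_loss(ppo_loss, policy, states, mirror_state, mirror_action, w=4.0):
    """Combined objective of Equation (5.3); w = 4 follows Yu et al. [45]."""
    return ppo_loss + w * symmetry_loss(policy, states, mirror_state, mirror_action)
```

In an actual update step, `total_loss` would simply replace the plain PPO loss before calling `backward()`; everything else in the optimizer loop stays the same.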
5.1.3 Phase-Based Mirroring (PHASE)

To study locomotion, the gait is usually divided into repeated gait cycles, which can further be parameterized using a phase variable φ ∈ [0, 1) that wraps back to φ = 0 upon reaching φ = 1. A common assumption is to advance the phase linearly with time. Another strategy, which can provide additional robustness, is to perform a phase-reset at each bipedal foot strike, e.g., set φ = 0 upon left-foot strike and φ = 0.5 upon right-foot strike. To enforce symmetry, a policy is only learned for the first half cycle, and is replaced by the policy with mirrored states and actions during the second half cycle:

    a_t = \begin{cases} \pi_\theta(s_t) & 0 \le \phi(s_t) < 0.5 \\ M_a\big(\pi_\theta(M_s(s_t))\big) & 0.5 \le \phi(s_t) < 1 \end{cases}    (5.4)

In our experiments, we strictly advance the motion phase as a function of time and do not implement phase-resets. For forward-progress tasks, this corresponds to mandating a fixed duration for each half-cycle of the motion. The phase-based method is particularly useful for imitation-guided learning scenarios such as those presented in [27], [28], and [43]. The goal in these cases is to imitate a reference motion capture clip with the help of a phase-indexed reward that measures the distance from the reference motion. The use of PHASE symmetry in that context is motivated by the potential for faster learning.

The PHASE approach is simple to implement and does not require modifying the training process in any way, since it can be implemented directly within the environment. However, when the phase is strictly computed as a function of time, the potential for abrupt changes exists at φ = 0.5.
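A sketch of Equation (5.4) and of the linear phase advance is shown below. The Gym-style environment interface, the control time-step, and the cycle period are assumptions for illustration; the period is task-dependent in our experiments.

```python
def phase_mirrored_action(policy, s, phase, mirror_state, mirror_action):
    """Equation (5.4): query the policy directly in the first half cycle and
    with mirrored states/actions in the second half cycle."""
    if phase < 0.5:
        return policy(s)
    return mirror_action(policy(mirror_state(s)))

def rollout(env, policy, mirror_state, mirror_action, period=1.0, control_dt=1.0 / 60.0):
    """Sketch of a rollout where the phase advances linearly with time and
    wraps back to zero upon reaching one (Gym-style env assumed)."""
    s, phase, done = env.reset(), 0.0, False
    while not done:
        a = phase_mirrored_action(policy, s, phase, mirror_state, mirror_action)
        s, r, done, _ = env.step(a)
        phase = (phase + control_dt / period) % 1.0
    return
```

Since the mirroring sits entirely inside the action-selection step, the learning algorithm itself is untouched, which is why PHASE requires no change to the training pipeline.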
Figure 5.1: A universal method for converting any neural network into a symmetric network policy. The M block is an environment-dependent state mirroring function. The two policy blocks are the same neural network module, with the output terminals re-ordered for illustration clarity. The s, c, and o terminals correspond to side, common, and opposite joints as described in Section A.2.

5.1.4 Symmetric Network Architecture (NET)

Another approach towards enforcing symmetry is to impose symmetry at the network architecture level. The goal here is to choose a network architecture such that Equation (5.1) holds for all states s and all network parameters θ. There are multiple ways to go about designing such an architecture. However, they may require some knowledge about how the actions and/or states are structured, in which case having access to the mirroring functions M_s and M_a is, strictly speaking, not enough.

A general description of this method would be lengthy, and thus we focus only on the key aspects here. The simplest case occurs when we can assume that the action vector is divided into two halves, one corresponding to each side of the body, and that the actions of one side can readily be applied to the other side through a simple swapping operation. This ignores the common parts, such as the torso and the head, for the time being. More concretely, consider:

    a = \begin{bmatrix} a_l \\ a_r \end{bmatrix}, \qquad M_a(a) = \begin{bmatrix} a_r \\ a_l \end{bmatrix},

where a_l and a_r are vectors of equal size. In this case, we can define a symmetric policy composed of an inner network f as follows:

    \pi_{side}(s) = \begin{bmatrix} f(s, M_s(s)) \\ f(M_s(s), s) \end{bmatrix}

It is easy to show that Equation (5.1) holds in this case, using the fact that mirroring is an involution, i.e., M_s(M_s(s)) = s:

    \pi_{side}(M_s(s)) = \begin{bmatrix} f(M_s(s), M_s(M_s(s))) \\ f(M_s(M_s(s)), M_s(s)) \end{bmatrix}
                       = \begin{bmatrix} f(M_s(s), s) \\ f(s, M_s(s)) \end{bmatrix}
                       = M_a\left( \begin{bmatrix} f(s, M_s(s)) \\ f(M_s(s), s) \end{bmatrix} \right)
                       = M_a\big(\pi_{side}(s)\big)

When the action space also includes actions for common parts, i.e., those such as the torso and head that have no symmetric counterparts, it is easy to define π_com(s) = h(s) + h(M_s(s)), which is invariant to left/right mirroring. The full policy is then a combination of the common and side actions:

    \pi_\theta(s) = \begin{bmatrix} \pi_{com}(s) \\ \pi_{side}(s) \end{bmatrix}

Please refer to Figure 5.1 for an illustration of the NET method.

A drawback of this method is that it requires knowledge of the state and action symmetry structure in order to redefine the network. This method is also highly sensitive to state and action normalization: an ordinary normalization based on past experience may break the symmetry. Though the other methods introduced here can also suffer from this problem, this method is much more sensitive to the issue.
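The construction above can be written compactly in code. The following PyTorch sketch assumes a batched state-mirroring function that operates on tensors and an output ordering of [common, left, right]; it illustrates the idea rather than reproducing the exact module used in our experiments, and it inherits the normalization sensitivity noted above.

```python
import torch
import torch.nn as nn

class SymmetricPolicy(nn.Module):
    """NET: pi(s) = [pi_com(s); pi_side(s)], built so that Equation (5.1) holds."""

    def __init__(self, mirror_state, obs_dim, common_dim, side_dim, hidden=256):
        super().__init__()
        self.mirror_state = mirror_state   # batched M_s, assumed to accept tensors
        # f(x, y): produces the actions of one side from the ordered pair of inputs.
        self.f = nn.Sequential(nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, side_dim))
        # h(x): used twice, symmetrically, to produce the common (torso/head) actions.
        self.h = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, common_dim))

    def forward(self, s):
        ms = self.mirror_state(s)                       # M_s(s)
        a_common = self.h(s) + self.h(ms)               # invariant under mirroring
        a_left = self.f(torch.cat([s, ms], dim=-1))     # f(s, M_s(s))
        a_right = self.f(torch.cat([ms, s], dim=-1))    # f(M_s(s), s)
        return torch.cat([a_common, a_left, a_right], dim=-1)
```

Feeding M_s(s) to this module swaps the inputs of the two f evaluations and leaves the h term unchanged, which is exactly the mirrored action M_a(π(s)) under the assumed ordering.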
5.1.5 Practical Considerations

There are some practical considerations to take into account when working with each of the methods introduced above. In terms of implementation, the DUP and PHASE methods are the easiest, as they require little to no change to the learning pipeline. Architecture-based mirroring (NET) requires the most modification to both the learning pipeline and the environments. The LOSS method is the only approach here that allows us to balance the desire for symmetry against the original learning objective, albeit at the cost of an extra hyper-parameter. The NET method produces a truly symmetric policy, which is not possible with the other methods. The PHASE method is best suited for coping with neutral states, which are symmetric states where it may become problematic to break symmetry; we revisit this point later. PHASE is also restrictive in that it enforces a predefined walk-cycle timing.

One more consideration relates to the normalization of network inputs, which is commonly done using statistics gathered from the data itself. However, this can break some of the mirroring assumptions. The problem is most severe when using a symmetric network architecture, although the other methods are also impacted. Fortunately, developing a normalization scheme that works correctly is relatively straightforward. A simple approach is to duplicate the states (or actions) as in Section 5.1.1 and to compute the statistics from the aggregated set of states (or actions) and their mirrored counterparts.

5.2 Gait Symmetry Metrics

All of the methods discussed only provide indirect paths, via the learned policies, for achieving symmetry in the actual motions. Therefore it is important to evaluate how well these methods do at achieving their final goal. Yu et al. [45] use an established metric from the biomechanics literature known as the Robinson Symmetry Index (SI):

    SI = \frac{2\,|X_R - X_L|}{X_L + X_R} \cdot 100,    (5.5)

where X_R is a scalar feature of interest, such as the duration of the stance phase for the right leg, and X_L is its counterpart for the left leg. Previous work using the LOSS method [45] chooses the average actuation magnitude as the feature of interest, which leads to X_R = \sum_{t=1}^{T} \|\tau_{t,R}\|_2, where τ_{t,R} is the vector of torques applied to the right leg at time t. We will refer to this as the actuation symmetry index (ASI). In practice, we found that the ASI can be misleading in some circumstances: e.g., a high torque applied to the right hip can be conflated with a high torque applied to the left knee, which is not desirable. The ASI also loses information about the signs of the applied torques.

The phase-portrait is another tool that can be used to qualitatively investigate the symmetry or asymmetry of a gait, as seen in [18]. The phase-portrait is a scatter plot drawn over a period of time, usually a single gait cycle. The x- and y-axes of the 2D plot correspond to the position and velocity, respectively, of a joint of interest, such as the hip flexion. For an asymmetric gait, the phase-portraits of the two sides will not fully overlap. To numerically quantify the similarity between two phase-portraits, we propose a phase-portrait index (PPI). One problem to address is that the left and right limbs usually have a phase offset, even for a symmetric motion. This is not a problem when inspecting phase-portraits visually, but it must be addressed to compute a meaningful metric. We solve this by finding the best phase offset between the left and the right side through an exhaustive search. We also normalize each axis so that x, y ∈ [−1, 1] to address the potential discrepancy between the magnitudes of different gaits. The final PPI is defined according to:

    PPI = \frac{1}{C} \min_{s} \sum_{t=0}^{C-1} \left\| q^R_t - q^L_{t+s} \right\|_1 + \left\| \dot{q}^R_t - \dot{q}^L_{t+s} \right\|_1,    (5.6)

where C is the length of a gait cycle, and q^R_t and \dot{q}^R_t are the normalized right joint position and velocity at time t. Similarly, q^L_{t+s} is the normalized left joint position at time t + s modulo C; elements that are shifted beyond the last position are reintroduced at the beginning.
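Both metrics can be computed directly from logged torques and joint trajectories. The following NumPy sketch assumes that the per-leg quantities have already been extracted, resampled over a single gait cycle of length C, and (for the PPI) normalized to [−1, 1].

```python
import numpy as np

def symmetry_index(x_right, x_left):
    """Robinson Symmetry Index, Equation (5.5), for scalar features X_R and X_L."""
    return 2.0 * abs(x_right - x_left) / (x_left + x_right) * 100.0

def actuation_si(tau_right, tau_left):
    """ASI: total actuation magnitude per leg as the feature of interest.
    tau_* are (T, n_joints) arrays of applied torques for one leg."""
    x_r = np.linalg.norm(tau_right, axis=1).sum()
    x_l = np.linalg.norm(tau_left, axis=1).sum()
    return symmetry_index(x_r, x_l)

def phase_portrait_index(q_r, qdot_r, q_l, qdot_l):
    """PPI, Equation (5.6): exhaustive search over the left/right phase offset.
    Inputs are length-C arrays for one joint, already normalized to [-1, 1]."""
    C = len(q_r)
    best = np.inf
    for s in range(C):                    # try every cyclic offset
        q_l_s = np.roll(q_l, -s)          # q^L_{t+s}, indices wrapping modulo C
        qdot_l_s = np.roll(qdot_l, -s)
        cost = np.abs(q_r - q_l_s).sum() + np.abs(qdot_r - qdot_l_s).sum()
        best = min(best, cost)
    return best / C
```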
Figure 5.2: Environments. Top-left: Walker2D. Top-right: Walker3D. Bottom-left: Stepper. Bottom-right: Cassie.

5.3 Environments

We evaluate the effectiveness of the enforcement methods described in Section 5.1 on four different locomotion tasks, i.e., RL "environments", chosen to represent a fairly diverse range of locomotion settings. They are described in detail below. For each environment, we run each method 5 times and plot the mean results.

Walker2D  The implementation of the Walker2D environment is taken directly from PyBullet [9] without further modification. This character is almost identical to the one used in Section 4.1. The purpose of this environment is to evaluate each symmetry method on a well-established existing reinforcement learning environment. The task is for the character to walk as far as possible in the forward direction in the allotted time. An action is a 6D vector corresponding to a normalized torque at each of the hip, knee, and ankle on both the left and right legs. The observation space is 22D and consists of root information (root z-coordinate, x and y heading vector, root velocity, roll, and pitch), joint angles, joint angular velocities, and binary foot contact information.

Walker3D  This is a 3D character simulated in PyBullet, with targets randomly placed, at a distance, in the half-plane in front of the character. The task requires the character to navigate towards the target and then stop at it. A new target is chosen, in the forward half-plane of the current character orientation, once the target is reached and one second has passed. The 3D character has 21 DoF, corresponding to abdomen (×3), hip (×3), knee, ankle, shoulder (×3), and elbow. The observation space is 52D and is analogous to that of Walker2D, with an additional 2D vector representing the target location in the character root frame.

Stepper  Stepper uses the same character model as Walker3D and requires it to traverse terrain consisting of a sequence of stepping blocks. The blocks are randomly generated by sampling from the following distributions: spacing d ∼ U(0.65, 0.85) meters and height variation of the next step h ∼ U(−25, 25)°. The character receives information about the two upcoming blocks as (x, y, z) offsets in the character root space. The stepping-block information advances when either foot contacts the immediately next block, which effectively forces the character to step precisely on each block. The precise foot-placement requirement, together with the variable terrain height, makes this environment more challenging than Walker3D.

Cassie  The task requires the bipedal robot Cassie to walk forward at a desired speed while mimicking a reference motion. Since the reference motion is time-indexed, the character receives a phase variable φ ∈ [0, 1) as input, which advances over the gait cycle. In addition to the phase, the character receives other inputs including the height, the orientation expressed as a unit quaternion, pelvis velocities, angular velocities, and acceleration, as well as joint angles and angular velocities. In total, the Cassie robot has a 10D action space and a 47D observation space. Another important distinction between Cassie and the other tasks is that it is implemented in MuJoCo [38], while the other environments use the PyBullet [9] physics engine. This simulated model has also been validated to be close to the physical Cassie robot [43].
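For illustration, the sketch below generates a sequence of stepping-block positions from the distributions above, under one plausible reading in which the height variation acts as an elevation angle toward the next block and lateral variation is ignored; the actual terrain generator may differ in these details.

```python
import numpy as np

def sample_steps(n_steps, rng=np.random):
    """Sketch of Stepper terrain generation: spacing d ~ U(0.65, 0.85) m and
    height variation of the next step h ~ U(-25, 25) degrees."""
    positions = [np.zeros(3)]
    for _ in range(n_steps):
        d = rng.uniform(0.65, 0.85)                # distance to the next block
        h = np.deg2rad(rng.uniform(-25.0, 25.0))   # height-variation angle
        offset = np.array([d * np.cos(h), 0.0, d * np.sin(h)])
        positions.append(positions[-1] + offset)
    return np.stack(positions)   # (n_steps + 1, 3) block centres in world space
```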
5.4 Results

We compare the four methods, together with an asymmetric baseline (BASE), across four different locomotion tasks of varying difficulty. The source code is available at https://github.com/UBCMOCCA/SymmetricRL.

5.4.1 Summary

We begin with a high-level summary of our findings. All symmetry enforcement methods improve motion quality over the baseline, but they cannot be reliably ranked across different environments. In general, DUP is the least effective at enforcing symmetry, while LOSS is the most consistent. For imitation-guided tasks, where the reward is related to imitating a time-indexed reference motion, such as Cassie and DeepMimic, the PHASE method appears to be superior.

Regarding learning speed, the symmetry enforcement methods have no consistent and predictable impact, positive or negative. While this contradicts our initial expectation, it does not provide the full picture. In particular, even though BASE achieves relatively high rewards in Stepper, it was unable to make forward progress in any of the five runs. In summary, we suggest that symmetry methods be used for producing higher-quality symmetric motions, i.e., closer to what we might expect from human and animal movement, but not necessarily for faster learning. We expand on the comparison of the different methods in the two sections below.

5.4.2 Effect on Learning Speed

One of our initial hypotheses was that learning speed can be improved by enforcing symmetry. Symmetry can be considered domain knowledge that may otherwise be difficult to learn, especially considering its abstract nature. However, our experiments indicate that enforcing symmetry has, in general, no consistent impact on learning speed. As shown in Figure 5.3, BASE performs well in Walker2D and Walker3D. In particular, although BASE was not initially the fastest in Walker3D, it ultimately achieves a higher return than all mirroring methods. On the other hand, BASE fails to learn the Stepper task in all five runs; it often pauses near the beginning without taking a single step. This is consistent with the findings of Yu et al., who also find that symmetry enforcement can be crucial when learning more difficult tasks.

For the Cassie environment, the benefit of enforcing symmetry is evident because the reward explicitly encourages the character to imitate a symmetric reference motion. We hypothesize that in such a case, symmetry suitably constrains the search space for the symmetric task. However, if symmetry is not rewarded, explicitly or implicitly, then its effects may not be reflected in the learning curve. Finally, among the symmetry methods themselves, there is no clear winner in terms of learning speed.

Figure 5.3: Learning curves for different symmetry methods in each of the four locomotion environments (Section 5.3). The Walker2D plot contains two additional experiments aside from the baseline and four symmetry methods. NET-ALT uses an alternate formulation of the symmetric network architecture described in Section A.3. NET-POL is an ablation of NET with symmetry enforcement only on the policy network and not on the value network.

PHASE and Imitation-Guided Learning  In the phase-based symmetry experiments, we define a phase variable in correspondence with the gait cycle. For the Cassie environment, we use a period of 0.8 s, which is determined from the reference motion. For all other environments, we assign a period based on a working solution.

We find phase-based symmetry enforcement to be effective for imitation-guided learning, as it outperforms the other methods by a significant margin for Cassie. When comparing Cassie with DeepMimic [28], which also uses an imitation objective, we find the results to be consistent. The learning curves for our DeepMimic symmetry experiment are presented in Section A.4. We hypothesize that phase-based symmetry is effective for imitation-guided tasks when the motion clips used for training contain suitably periodic and symmetric motions. On the other hand, PHASE constrains the period of the gait cycle, which can be harmful for non-imitation tasks. PHASE performs poorly in terms of learning speed when used without a reference motion, i.e., for Walker2D, Walker3D, and Stepper, although it can still do well in terms of motion quality, e.g., for Walker3D.

Alternate Symmetric Network  The NET method presented in Figure 5.1 is an intuitive way of converting any neural network into a symmetric policy. However, it is perhaps not the immediate solution that one would come up with when tasked to design a symmetric neural network. We include one of our earlier constructions of a symmetric policy in Section A.3, which we refer to as NET-ALT. A major difference between NET and NET-ALT is that the latter uses shared weights at the layer level to explicitly enforce the symmetry constraint of Equation (5.1). Despite this, the two architecture-based mirroring methods should, in theory, have similar performance. As can be seen in Figure 5.3, NET-ALT significantly outperforms NET in the Walker2D environment, along with the baseline and all other mirroring methods. We believe that the structure of the symmetric layer matrix in Section A.3 may be the key to resolving this gap, which remains to be verified.

Policy Network Ablation Study  As an ablation study, we removed the symmetry constraint from the value network in the NET method. Since our goal is to produce a symmetric policy, and the value network is discarded after training, we want to see how enforcing symmetry in the value network during training affects learning speed. In Figure 5.3, the two curves of interest are NET and NET-POL, where the latter has the symmetry constraint removed from the value network. Our experiment shows that it is beneficial to enforce the symmetry constraint on the value network during training, as the difference between the two curves is not insignificant.

5.4.3 Symmetry Enforcement Effectiveness

Although learning speed is a major point of interest from the ML perspective, our work is nevertheless motivated by the aesthetics of symmetric gaits that are needed for applications in animation. We measure the effectiveness of each symmetry enforcement method using the metrics defined in Section 5.2. In most cases, we find that symmetric gaits are better achieved when any of the enforcement methods is applied, as compared to the baseline. The motions produced by the symmetry methods are also more natural-looking, subjectively speaking, than without mirroring.

Figure 5.4 shows the phase-portraits for Walker2D and Walker3D. The symmetry metrics for all environments are summarized in Table 5.1 and Table 5.2. To perform consistent measurements for the metrics, we omit the first two strides to limit the influence of the transition period from standing to locomotion. The reported metrics are calculated from the median of the ten subsequent strides after the initial two. For the Stepper task, we use the median over five strides to accommodate the increased difficulty. Also, note that the Stepper results are missing for BASE because it was unable to produce consistent gait cycles that could be measured. In most cases, the BASE policy either learns to pause at the starting location or falls after taking one or two steps.

Figure 5.4: Phase-portraits for Walker2D and Walker3D under each method (BASE, DUP, LOSS, PHASE, NET). The green curve is for the left hip flexion and the red curve for the right side. The more symmetric the motion, the more aligned the curves.
As in learning speed, there is not a single best mirroring method across all environments. However, from the overall picture, we found LOSS and PHASE to be the most consistent among all methods. In general, ASI and PPI do not agree on a single best method, except for the Cassie task, where PHASE is the best.

           Walker2D   Walker3D   Stepper   Cassie
  BASE       3.97       6.36        –        9.27
  DUP        3.77       7.57       7.54      6.58
  LOSS       2.56       4.48       6.36     15.72
  PHASE      3.77       2.55       3.99      4.49
  NET        2.00      10.64      28.97      5.15
  NET-ALT    1.04        –          –         –
  NET-POL    1.71        –          –         –

Table 5.1: Actuation symmetry index (ASI). Lower numbers are better. The Stepper entry for BASE is omitted because it fails to produce a measurable gait on that task; NET-ALT and NET-POL were only evaluated on Walker2D.

           Walker2D   Walker3D   Stepper   Cassie
  BASE       1.06       2.16        –        0.49
  DUP        0.39       1.61       0.57      0.41
  LOSS       0.33       0.19       0.46      0.31
  PHASE      0.57       0.30       0.49      0.17
  NET        0.16       0.58       0.65      0.23
  NET-ALT    0.16        –          –         –
  NET-POL    0.28        –          –         –

Table 5.2: Phase-portrait index (PPI). Lower numbers are better.

5.5 Discussion

Symmetry can sometimes be harmful, especially when the character begins from, or otherwise arrives at, a neutral pose, i.e., a symmetric pose where s = M_s(s). The problem is that a symmetric policy is incapable of escaping from a neutral pose, since the action it takes would also be symmetric, and when a symmetric action is applied in a symmetric state, the next state is necessarily also symmetric. For instance, a character that starts from the T-pose will likely perform some kind of hopping gait, since the feasible locomotion possibilities which perpetuate symmetric states and actions are limited. To make matters worse, states near the neutral states can also become problematic.

The symmetry-breaking problem is most severe when enforcing symmetry through the network architecture, as this method is guaranteed to produce truly symmetric policies. While the DUP and LOSS methods can suffer from the same issue, they can implement workarounds at an additional cost. This issue, however, does not affect PHASE. A simple workaround is to always start the character from a non-neutral position, which can easily be achieved by adding some random noise to each joint of the initial pose at the start of the task. In practice, we did notice that on occasion the character would still converge to a hopping gait. However, this simple workaround works well for the majority of cases in our experiments.
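A sketch of this workaround is shown below; the noise magnitude and distribution are hypothetical choices for illustration.

```python
import numpy as np

def randomize_initial_pose(joint_angles, noise_std=0.05, rng=np.random):
    """Break the initial symmetry by perturbing every joint of the starting
    pose with small Gaussian noise (in radians) at the beginning of each episode."""
    return joint_angles + rng.normal(0.0, noise_std, size=len(joint_angles))
```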
Our work is motivated by the premise that healthy human gaits are usually symmetric. However, this remains a controversial issue in the biomedical literature [29, 31]. The strongest argument for asymmetry in human motor control is the general belief that humans have a dominant side that is often the preferred choice for manipulating objects. This is also tied to the need for a leading foot to start a walk or run cycle, as in the neutral-state problem. One should, therefore, be aware of the implications of enforcing perfect symmetry. Quadrupedal locomotion, which has six commonly observed gaits as opposed to the three gaits of bipeds [26], is also interesting to examine. Of these six, half are fairly symmetric, including walk, trot, and rack. However, the remaining three, also known as the in-phase gaits and used at high speeds, are often asymmetric. Since the symmetry of the gait and of the policy are not the same, it would be interesting to see whether it is possible to nevertheless achieve these non-symmetric quadrupedal gaits with a symmetric policy.

5.6 Conclusions

In this chapter, we explore the use of symmetry constraints for DRL-based learning of locomotion skills. We compared four different enforcement methods, in addition to a symmetry-free baseline, across four different locomotion tasks of varying difficulty. We find that enforcing symmetry constraints can sometimes be harmful to learning efficiency, but that, in general, it produces higher-quality motions. When comparing the symmetry methods, we find the results, both in terms of learning speed and motion symmetry, to be environment-dependent. A notable exception is that the phase-based mirroring method generally performs better than the baseline in imitation-guided reward settings such as Cassie and DeepMimic.

The differences between the enforcement methods are more pronounced from the implementation standpoint. The LOSS and PHASE methods carry the burden of an additional hyperparameter to tune. However, the additional parameter can also be viewed as an advantage in terms of flexibility: in LOSS, the hyperparameter can be used to adjust the strength of the symmetry constraint, and in PHASE, the phase variable allows us to define a desired locomotion period. Given the similarities across all methods, it is perhaps justifiable to choose one based on implementation overhead. DUP is the easiest to implement and evaluate, since it requires minimal change to the existing RL pipeline and has no hyperparameter to tune. Finally, if the application requires absolute symmetry, then the NET method is guaranteed to produce a symmetric policy.

The application of symmetric policies is not limited to locomotion. Many classical control tasks may benefit significantly from leveraging symmetry, including acrobot, cart-pole, and pendulum [4]. Furthermore, the notion of symmetry extends beyond left-right symmetry and even beyond character motion. The game of Sudoku is an example of a task that exhibits multiple types of symmetry. Whether a learning method can take full advantage of all these symmetries remains an open question. However, this work lays a foundation for enabling future studies on inductive biases based on symmetry.

Chapter 6

Conclusions

Reinforcement learning provides a promising new path for motion generation, one that can generalize to new terrains and character morphologies. However, the current methods are computationally inefficient, and unless motion capture data is used, the motion quality is typically unsatisfactory for applications in computer graphics.

Our work tackles these two problems by investigating two issues: excessively large torque limits and gait asymmetry. We show that more realistic torque limits, though resulting in more natural motions, can hinder training in the beginning. We propose a simple curriculum learning technique that starts with higher torque limits to speed up training and then gradually decreases the limits to arrive at more natural final motions. This way we get the best of both worlds.

Next, we looked at ways of incorporating gait symmetry into the training process. Symmetric motions are generally perceived as more attractive in humans, and asymmetric patterns are commonly associated with disability or injury. We compared four methods of enforcing symmetry in various environments and discussed their advantages and drawbacks in different scenarios.

As with any other work, some questions remain to be answered. I would be interested in using the torque-limit curriculum approach to determine the lowest torque limit that still permits locomotion in a robot, such as Cassie.
This can help with designing cheaper robots.Another promising direction is to look at the time-step used for locomotion. In all of ourwork, we used a prespecified control frequency. Humans, however, tend to plan at different timehorizons depending on the task and their skill level. We have promising initial results showingthat using curriculum learning to transfer skills learned with a specific control frequency toanother can improve performance. Perhaps the reason for this is that the agent can betterfocus on the long-term goals while still having the opportunity to refine its movements. Thiscan help bridge the gap between low-level skill learning and long-term planning. This idea isalso closely related to hierarchical reinforcement learning.35Bibliography[1] W. W. Armstrong and M. W. Green. The dynamics of articulated rigid bodies forpurposes of animation. The Visual Computer, 1(4):231โ€“240, Dec 1985. ISSN 1432-2315.doi:10.1007/BF02021812. URL https://doi.org/10.1007/BF02021812. โ†’ page 1[2] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey ofdeep reinforcement learning. CoRR, abs/1708.05866, 2017. URLhttp://arxiv.org/abs/1708.05866. โ†’ page 7[3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. InProceedings of the 26th Annual International Conference on Machine Learning, ICMLโ€™09, pages 41โ€“48, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1.doi:10.1145/1553374.1553380. URL http://doi.acm.org/10.1145/1553374.1553380. โ†’ page15[4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, andW. Zaremba. Openai gym, 2016. โ†’ pages 5, 34[5] W. M. Brown, L. Cronk, K. Grochow, A. Jacobson, C. K. Liu, Z. Popoviฤ‡, andR. Trivers. Dance reveals symmetry especially in young men. Nature, 438(7071):1148,2005. โ†’ pages 3, 4[6] A. Bruderlin and T. W. Calvert. Goal-directed, dynamic animation of human walking. InACM SIGGRAPH Computer Graphics, volume 23, pages 233โ€“242. ACM, 1989. โ†’ page 5[7] S. Coros, P. Beaudoin, and M. van de Panne. Generalized biped walking control. ACMTransctions on Graphics, 29(4):Article 130, 2010. โ†’ page 5[8] S. Coros, A. Karpathy, B. Jones, L. Reveret, and M. van de Panne. Locomotion skills forsimulated quadrupeds. ACM Transactions on Graphics, 30(4):Article 59, 2011. โ†’ page 5[9] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games,robotics and machine learning. http://pybullet.org, 2016โ€“2019. โ†’ pages 12, 27, 28[10] K. S. I. A. A. H. A. G. T. H. L. B. M. L. A. B. Y. C. T. L. F. H. L. S. G. v. d. D. T. G.D. H. David Silver, Julian Schrittwieser. Mastering the game of go without humanknowledge. MNature, 529, 2016. URL https://doi.org/10.1038/nature16961. โ†’ page 11[11] T. Geijtenbeek and N. Pronost. Interactive character animation using simulated physics:A state-of-the-art review. In Computer Graphics Forum, volume 31, pages 2492โ€“2515.Wiley Online Library, 2012. โ†’ page 5[12] N. Hansen. The cma evolution strategy: A comparing review. In Towards a NewEvolutionary Computation, pages 75โ€“102, 2006. โ†’ pages 1, 536[13] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez,Z. Wang, S. M. A. Eslami, M. A. Riedmiller, and D. Silver. Emergence of locomotionbehaviours in rich environments. CoRR, abs/1707.02286, 2017. URLhttp://arxiv.org/abs/1707.02286. โ†’ page 5[14] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deepreinforcement learning that matters. CoRR, abs/1709.06560, 2017. 
URLhttp://arxiv.org/abs/1709.06560. โ†’ pages 13, 14[15] J. K. Hodgins, W. L. Wooten, D. C. Brogan, and J. F. Oโ€™Brien. Animating humanathletics. In Proceedings of SIGGRAPH 1995, pages 71โ€“78, 1995. โ†’ page 5[16] D. Holden, T. Komura, and J. Saito. Phase-functioned neural networks for charactercontrol. ACM Trans. Graph., 36(4):42:1โ€“42:13, July 2017. ISSN 0730-0301.doi:10.1145/3072959.3073663. URL http://doi.acm.org/10.1145/3072959.3073663. โ†’ page5[17] D. Holden, T. Komura, and J. Saito. Phase-functioned neural networks for charactercontrol. ACM Trans. Graph., 36(4):42:1โ€“42:13, July 2017. ISSN 0730-0301.doi:10.1145/3072959.3073663. URL http://doi.acm.org/10.1145/3072959.3073663. โ†’ page4[18] E. T. Hsiao-Wecksler, J. D. Polk, K. S. Rosengren, J. J. Sosnoff, and S. Hong. A reviewof new analytic techniques for quantifying symmetry in locomotion. Symmetry, 2:1135โ€“1155, 2010. โ†’ pages 4, 26[19] A. Ilyas, L. Engstrom, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry.Are deep policy gradient algorithms truly policy gradient algorithms? CoRR,abs/1811.02553, 2018. URL http://arxiv.org/abs/1811.02553. โ†’ page 10[20] O. Klimov and J. Schulman. Roboschool, open-source software for robot simulation.https://github.com/openai/roboschool, 2017. โ†’ page 12[21] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In Proceedings of the 29th AnnualConference on Computer Graphics and Interactive Techniques, SIGGRAPH โ€™02, pages473โ€“482, New York, NY, USA, 2002. ACM. ISBN 1-58113-521-1.doi:10.1145/566570.566605. URL http://doi.acm.org/10.1145/566570.566605. โ†’ page 4[22] Y. Lee, S. Kim, and J. Lee. Data-driven biped control. ACM Transctions on Graphics, 29(4):Article 129, 2010. โ†’ page 5[23] T. P. Lillicrap, J. J. Hunt, A. e. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, andD. Wierstra. Continuous control with deep reinforcement learning. arXiv e-prints, art.arXiv:1509.02971, Sep 2015. โ†’ page 7[24] L. Liu, M. van de Panne, and K. Yin. Guided learning of control graphs for physics-basedcharacters. ACM Transactions on Graphics, 35(3), 2016. โ†’ page 5[25] A. Majkowska and P. Faloutsos. Flipping with physics: motion editing for acrobatics. InProceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on Computeranimation, pages 35โ€“44. Eurographics Association, 2007. โ†’ page 537[26] T. A. McMahon. Muscles, Reflexes, and Locomotion. Princeton University Press,Princeton, New Jersey, 1 edition, 1984. ISBN 069102376X. โ†’ page 33[27] X. B. Peng, G. Berseth, K. Yin, and M. van de Panne. Deeploco: Dynamic locomotionskills using hierarchical deep reinforcement learning. ACM Transactions on Graphics(Proc. SIGGRAPH 2017), 36(4), 2017. โ†’ pages 5, 23[28] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. Deepmimic: Example-guideddeep reinforcement learning of physics-based character skills. ACM Transactions onGraphics (Proc. SIGGRAPH 2018), 37(4), 2018. โ†’ pages 5, 23, 29, 42[29] J. L. Riskowski, T. J. Hagedorn, A. B. Dufour, V. A. Casey, and M. T. Hannan.Evaluating gait symmetry and leg dominance during walking in healthy older adults.Institute of Aging Research, Hebrew SeniorLife, Boston, Usa, Harvad Medical School,USA, School of Public Health, Boston University, Boston, 2011. โ†’ pages 4, 20, 33[30] R. Robinson, W. Herzog, and B. Nigg. Use of force platform variables to quantify theeffects of chiropractic manipulation on gait symmetry. Journal of manipulative andphysiological therapeutics, 10(4):172โ€“176, 1987. โ†’ pages 3, 4, 20, 22[31] H. Sadeghi, P. Allard, F. 
Prince, and H. Labelle. Symmetry and limb dominance inable-bodied gait: a review. Gait & Posture, 12(1):34โ€“45, 2000. ISSN 0966-6362.doi:https://doi.org/10.1016/S0966-6362(00)00070-9. URLhttp://www.sciencedirect.com/science/article/pii/S0966636200000709. โ†’ pages 4, 33[32] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policyoptimization. CoRR, abs/1502.05477, 2015. URL http://arxiv.org/abs/1502.05477. โ†’pages 9, 22[33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policyoptimization algorithms. CoRR, abs/1707.06347, 2017. URLhttp://arxiv.org/abs/1707.06347. โ†’ pages 9, 13, 22[34] M. K. Seeley, B. R. Umberger, and R. Shapiro. A test of the functional asymmetryhypothesis in walking. Gait & posture, 28(1):24โ€“28, 2008. โ†’ page 4[35] A. Shapiro. Monte carlo sampling methods. In Stochastic Programming, volume 10 ofHandbooks in Operations Research and Management Science, pages 353 โ€“ 425. Elsevier,2003. doi:https://doi.org/10.1016/S0927-0507(03)10006-0. URLhttp://www.sciencedirect.com/science/article/pii/S0927050703100060. โ†’ page 8[36] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press,Cambridge, MA, 2 edition, 2018. โ†’ pages 6, 7[37] G. Tesauro. Temporal difference learning and td-gammon. Commun. ACM, 38(3):58โ€“68,Mar. 1995. ISSN 0001-0782. doi:10.1145/203330.203343. URLhttp://doi.acm.org/10.1145/203330.203343. โ†’ page 11[38] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages5026โ€“5033, Oct 2012. doi:10.1109/IROS.2012.6386109. โ†’ page 2838[39] S. Viteckova, P. Kutilek, Z. Svoboda, R. Krupicka, J. Kauler, and Z. Szabo. Gaitsymmetry measures: A review of current and prospective methods. Biomedical SignalProcessing and Control, 42:89โ€“100, 2018. ISSN 1746-8094.doi:https://doi.org/10.1016/j.bspc.2018.01.013. URLhttp://www.sciencedirect.com/science/article/pii/S1746809418300193. โ†’ pages 4, 22[40] J. M. Wang, D. J. Fleet, and A. Hertzmann. Optimizing walking controllers. ACMTransctions on Graphics, 28(5):Article 168, 2009. โ†’ page 5[41] R. J. Williams. Simple statistical gradient-following algorithms for connectionistreinforcement learning. Machine learning, 8(3-4):229โ€“256, 1992. โ†’ pages 7, 9[42] A. Witkin and M. Kass. Spacetime constraints. In Proceedings of the 15th AnnualConference on Computer Graphics and Interactive Techniques, SIGGRAPH โ€™88, pages159โ€“168, New York, NY, USA, 1988. ACM. ISBN 0-89791-275-6.doi:10.1145/54852.378507. URL http://doi.acm.org/10.1145/54852.378507. โ†’ page 1[43] Z. Xie, P. Clary, J. Dao, P. Morais, J. W. Hurst, and M. van de Panne. Iterativereinforcement learning based design of dynamic locomotion skills for cassie. CoRR,abs/1903.09537, 2019. URL http://arxiv.org/abs/1903.09537. โ†’ pages 23, 28[44] K. Yin, K. Loken, and M. van de Panne. Simbicon: Simple biped locomotion control.ACM Trans. Graph., 26(3):Article 105, 2007. โ†’ pages 1, 5[45] W. Yu, G. Turk, and C. K. Liu. Learning symmetric and low-energy locomotion. ACMTransactions on Graphics (Proc. SIGGRAPH 2018 - to appear), 37(4), 2018. โ†’ pages2, 5, 12, 22, 23, 26, 29[46] H. Zhang, S. Starke, T. Komura, and J. Saito. Mode-adaptive neural networks forquadruped motion control. ACM Trans. Graph., 37(4):145:1โ€“145:11, July 2018. ISSN0730-0301. doi:10.1145/3197517.3201366. URLhttp://doi.acm.org/10.1145/3197517.3201366. 
โ†’ page 4[titletoc]39AppendixSupporting MaterialsA.1 Chapter 4 Hyper-parametersThe following contains the hyper-parameters used for all of the experiments in Chapter 4. Thecode is also available under https://github.com/farzadab/walking-benchmark/.Name DescriptionPolicy LR 3๐‘’โˆ’3Value Network LR 5๐‘’โˆ’3Total Environment Steps 6 MillionClipping Parameter (๐œ–) 0.2Decay (๐›พ) 0.99GAE ๐œ† 0.95Policy type Fixed Diagonal GaussianPolicy stdev exp(โˆ’1)Network Size 3 Layers of 256 nodesNetwork Activations ReLUMax Grad Norm 1Optimization Batch Size 128PPO Optimization Steps 10Table 1: Hyper-parameters used in Chapter 4.A.2 Mirroring FunctionsThe mirroring functions, โ„ณ๐‘  and โ„ณ๐‘Ž as described in Section 5.1 are properties of the envi-ronment. Consequently, the environment is responsible for providing the necessary informationfor policies to perform the mirroring operation on state and action. Although the mirroringfunctions can be arbitrarily complex, we found that all the environments in Section 5.3 share asimilar construction. Using Walker3D as an example, the method for deriving mirror functionsare described in detail below.40TheWalker3D character has a total of 21-DoF and each DoF is modelled as a one-dimensionalhinge joint. Furthermore, let the x-axis be the forward direction and the z-axis pointing up inthe local coordinate frame of the character. For mirroring purposes, the joints can be dividedinto three categories, common, opposite, and side. The common categories contain joints thatare unchanged by the mirroring function, such as ๐‘Ž๐‘๐‘‘๐‘œ๐‘š๐‘’๐‘›๐‘ฆ. In general, joints that rotateabout the y-axis should remain unchanged after mirroring. The opposite categories containjoints that are mainly on the torso of the character and they need to be negated for mirroring.In the case of Walker3D, only ๐‘Ž๐‘๐‘‘๐‘œ๐‘š๐‘’๐‘›๐‘ฅ and ๐‘Ž๐‘๐‘‘๐‘œ๐‘š๐‘’๐‘›๐‘ง would fall under this category. Theside categories contain joints that are on the limbs. Importantly, for each joint on one side,there must be a corresponding joint on the other side; for instance, the right knee correspondsto the left knee. With the one-to-one mapping, โ„ณ๐‘Ž can simply interchange the applied torquesfor the respective joints on either side. We found that it is more straightforward if the jointrotation axes are flipped, except for axes aligned on the y-axis, for the left and right limbs.Otherwise, additional negation operations need to be applied after interchanging left and rightactions.โ„ณ๐‘  follows a similar pattern as described above. For state information that is derived fromthe character, such as joint angles and angular velocities, the mirrored counterpart would havenegated and interchanged values. Besides, the environment may provide additional informa-tion, such as character orientation, velocity, and target location in character root space, asin Walker3D. For vector-valued information, such as velocity and target location, the valuesalong the ๐‘ฆ-axis should be negated; for orientations, values representing roll and yaw should benegated.A.3 Alternate Symmetric Network ArchitectureIn Figure 5.1, we presented a universal method for embedding any neural network into a sym-metric policy. The NET method effectively uses the same policy module twice with flippedinputs for ๐‘  and ๐‘€(๐‘ ). While this construction is relatively simple to implement, alternativesymmetric policy constructions do exist. 
In this section, we describe the construction used forNET-ALT in Figure 5.3.Recall that a symmetric policy is one that satisfies Equation (5.1), along with the fact thatour mirror functions (Section A.2) essentially perform negation and swapping operation on thestate and action vectors. Let us then consider the individual layers of a neural network asmatrix operations, in particular, before the application of non-linear activation functions. Thefull matrix form of the first layer for ๐‘  and โ„ณ๐‘ (๐‘ ) can be written as,โŽกโŽขโŽขโŽขโŽฃ๐‘Š๐‘‹๐‘Œ๐‘โŽคโŽฅโŽฅโŽฅโŽฆ= (๐‘Ž๐‘–๐‘—)4ร—4โŽกโŽขโŽขโŽขโŽฃ๐ถ๐‘‚๐‘…๐ฟโŽคโŽฅโŽฅโŽฅโŽฆand,โŽกโŽขโŽขโŽขโŽฃ๐‘Šโˆ’๐‘‹๐‘๐‘ŒโŽคโŽฅโŽฅโŽฅโŽฆ= (๐‘Ž๐‘–๐‘—)4ร—4โŽกโŽขโŽขโŽขโŽฃ๐ถโˆ’๐‘‚๐ฟ๐‘…โŽคโŽฅโŽฅโŽฅโŽฆ.41๐ถ, ๐‘‚, ๐‘…, ๐ฟ represent the portions of the state vector corresponding to common, opposite,right, and left respectively. The uppercase letters for ๐ถ, ๐‘‚, ๐‘…, ๐ฟ, ๐‘Š , ๐‘‹, ๐‘Œ , and ๐‘ indicate thatthese are not necessarily scalars. For instance, for Walker3D, ๐‘‚ contains both ๐‘Ž๐‘๐‘‘๐‘œ๐‘š๐‘’๐‘›๐‘ฅ and๐‘Ž๐‘๐‘‘๐‘œ๐‘š๐‘’๐‘›๐‘ง. Similarly, the matrix, (๐‘Ž๐‘–๐‘—)4ร—4, is dimensionally consistent with the correspondingelements in the state vector. For example, ๐‘Ž2๐‘— is a two-column wide block that matches withthe two elements in ๐‘‚ for Walker3D. In addition, notice the negated ๐‘‚ and ๐‘‹, as well as theinterchanged ๐‘… and ๐ฟ are the effect of the mirroring functions. Overall, there are a total of16 unknowns and 8 equations. A symmetric layer can be obtained by solving this system ofequations. In particular, NetAlt contains symmetric layers of the following form,(๐‘Ž๐‘–๐‘—)4ร—4 =โŽกโŽขโŽขโŽขโŽฃ๐›ผ 0 ๐›ฝ ๐›ฝ0 ๐›พ ๐›ฝ โˆ’๐›ฝ๐›ฟ ๐œ– ๐œ ๐œ‚๐›ฟ โˆ’๐œ– ๐œ‚ ๐œโŽคโŽฅโŽฅโŽฅโŽฆ.To maintain the symmetric policy constraint, the activation function applied to the negationportion, ๐‘‚, must be an odd function such as tanh or softsign. A similar procedure can be followedfor intermediate and output layers, as long as the sizes for each of the portions are correctlymaintained. Finally, a symmetric policy network can be constructed by stacking symmetriclayers.A.4 Symmetry in DeepMimic EnvironmentTo evaluate the effectiveness of phase-based mirroring, we ran an experiment for the originalDeepMimic environment [28] in additional to the Cassie environment. In both cases, our datashows that phase-based mirroring does indeed make the learning faster. However, in the caseof DeepMimic, the difference in final return is small between BASE and PHASE and only aminor difference can be seen from the video.420 1 2 3 4 5 6 7 8Time Step 1e70100200300400500Average ReturnHumanoid RunPhaseNoMirror0 1 2 3 4 5 6 7Time Step 1e70100200300400500600Average ReturnHumanoid WalkFigure 1: Learning curves for the original DeepMimic environment. BASE and Phasecorresponds to the symmetry enforcement methods in Figure 5.343
