Learning Reduced Order Linear Feedback Policies for Motion Skills

by Kai Ding
B.Eng., Zhejiang University, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University Of British Columbia (Vancouver)
October 2011
© Kai Ding, 2011

Abstract

Skilled character motions need to adapt to their circumstances and this is typically accomplished with the use of feedback. However, good feedback strategies are difficult to author and this has been a major stumbling block in the development of physics-based animated characters. In this thesis we present a framework for the automated design of compact linear feedback strategies. We show that this can be an effective substitute for manually-designed abstract models such as the use of inverted pendulums for the control of simulated walking. Results are demonstrated for a variety of motion skills, including balancing, hopping, ball kicking, single-ball juggling, ball volleying, and bipedal walking. The framework uses policy search in the space of reduced-order linear feedback matrices as a means of developing an optimized linear feedback strategy. The generality of the method allows for the automated development of highly-effective unconventional feedback loops, such as the use of foot pressure feedback to achieve robust physics-based bipedal walking.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization
2 Related Work
  2.1 Reduced Order Models for Physics-based Simulation
  2.2 Output Feedback
  2.3 Model-based Biped Control
  2.4 Policy Search Methods
3 Feedback Control Framework
  3.1 Control Framework
  3.2 Linear Output Feedback Structure
    3.2.1 Full Matrix Form
    3.2.2 Reduced Order Form
    3.2.3 Affine Form
4 Policy Search
  4.1 Cost Functions for Optimization
  4.2 Optimization Method
  4.3 Incremental Learning
5 Control Tasks
  5.1 Balancing
  5.2 Constrained Hopping
  5.3 Controllable Ball Kicking
  5.4 Single-ball Juggling
  5.5 Ball Volleying
  5.6 Bipedal Walking
6 Results and Discussion
  6.1 Performances of Learned Feedback Policies
  6.2 Choice of Sensory Information
  6.3 Model Reduction
  6.4 Feature Selection
  6.5 Incremental Scheme
  6.6 Affine Feedback
  6.7 Performance of Optimization
  6.8 Workflow of Controller Design for Motion Skills
7 Conclusions
  7.1 Limitations and Future Work
Bibliography

List of Tables

Table 6.1 Optimized Function Scores for Balancing
Table 6.2 Optimized Function Scores for Constrained Hopping
Table 6.3 Performance of Constrained Hopping with Feedback
Table 6.4 Optimized Function Scores for Ball Kicking
Table 6.5 Optimized Function Scores for Single-ball Juggling
Table 6.6 Optimized Function Scores for Ball Volleying
Table 6.7 Maximal Recoverable Forces and Cost Function Scores of Bipedal Walking
Table 6.8 Number of Evaluations and Total Time for Optimization

List of Figures

Figure 2.1 SIMBICON Feedback Policy [37] (Used with Permission)
Figure 3.1 Classical Control Framework
Figure 3.2 Linear Output Feedback Framework
Figure 3.3 Feature Selection on Sensory Data and Control Inputs
Figure 4.1 Directional Optimization of Covariance Matrix Adaptation [33] (Used with Permission)
Figure 5.1 In-place Balancing
Figure 5.2 Character Configuration for Balancing Example
Figure 5.3 Motion of the Tilting Platform in Optimization Scenario
Figure 5.4 Luxo Constrained Hopping
Figure 5.5 Luxo Configuration
Figure 5.6 Constrained Hopping Optimization Scenario
Figure 5.7 Controllable Ball Kicking
Figure 5.8 Human Leg Configuration
Figure 5.9 Single-ball Juggling
Figure 5.10 Human Arm Configuration
Figure 5.11 Ball Volleying
Figure 5.12 Biped Walking
Figure 5.13 Biped Configuration
Figure 5.14 Modeling Contact Forces for Bipedal Walking
Figure 6.1 Performance of Balancing in New Scenarios
Figure 6.2 Ball Trajectory for 5000 Simulation Steps in Ball Juggling
Figure 6.3 Ball Trajectory for 20 Seconds in Ball Volleying
Figure 6.4 Impact of Sensory Information Choice on Performance in Constrained Hopping Example
Figure 6.5 Average Squared Distance Error with Transformed Sensory Data in Ball Kicking Example
Figure 6.6 Feature Selection for Ball Juggling Example
Figure 6.7 Feature Selection for Bipedal Walking Example
Figure 6.8 Effect of Incremental Optimization in MRF for Biped Walking
Figure 6.9 Comparison between Affine Feedback and Non-affine Feedback in Constrained Hopping Example
Figure 6.10 Comparison between Optimization Methods

Acknowledgments

This thesis would not have been possible without the support of many people. My sincerest gratitude goes first to my supervisor, Professor Michiel van de Panne, for his guidance, inspiration and patience. Michiel always kindly gives me the freedom and encouragement to explore different ideas, which form the basis of this thesis. I am also grateful to my second reader, Professor Dinesh Pai, for his valuable feedback on the writing of this thesis.

I would like to thank my grad buddy, Stelian Coros, for his help and discussions during the early days of my master's studies, which guided me into the exciting field of character animation. Many thanks to members of Imager Lab for creating such a fun place to work and study.

I wish to thank my good friends: Tao Su, Suwen Wang, Xi Chen, Li Li, Caoyu Wang, Jun Chen and many others for their support, which helped me ride out the ups and downs of graduate school and made my student life in Vancouver an enjoyable memory.

Finally, and most importantly, I thank my parents Zongliang Ding and Xiuying Sun for their perpetual love and support.

Chapter 1 Introduction

1.1 Motivation

Dynamically-simulated characters provide many desired properties. By taking physics into account, they can faithfully interact with the surrounding environment and they can respond realistically to unexpected perturbations such as collisions or pushes. However, their success depends on first designing an appropriate controller. For most control tasks, the control needs to regulate or adapt a motion with respect to a known reference action. For example, the controller for a walking biped requires adapting the default walking step to specific circumstances such as a loss of balance. In general, carefully-tuned feedback laws are required in order to adapt the control to unexpected circumstances or towards specific desired goals.

Linear feedback provides a simple and powerful model for controlling dynamical systems.
However, the classical approaches for designing linear feedback systems demand extensive and explicit modeling choices in order to develop a suitable state-feedback matrix. In this thesis, we investigate a direct-design alternative: the control policy is assumed to take the form of a linear feedback matrix and elements of the matrix are then treated as the free parameters of the policy. The method allows for the use of arbitrary sensory variables to provide linear output feedback to an arbitrary set of control inputs. This allows for unconventional feedback strategies to be explored, such as the use of ground reaction forces to influence the target trajectories of a walking motion in order to successfully maintain balance.

Directly searching the space of all output-feedback matrices may not scale well. Given $n$ sensors and $m$ control inputs, the control policy defined by the full output feedback has $m \times n$ elements and is very high dimensional. We therefore investigate the use of reduced-order output feedback matrices, which effectively identify a low-dimensional subspace that is well suited for the linear feedback control of a given task. We show that automatically-synthesized reduced-order feedback models can be an effective substitute for manually-designed abstract models, such as the use of center-of-mass abstractions for the control of simulated walking. To achieve further compactness in the feedback model, we develop a method that allows the control policy to identify the most useful sensory variables and control inputs for the task at hand. This performs a function analogous to feature selection.

In this thesis we show that reduced-order feedback solutions can be automatically synthesized for a significant variety of simulated skills, including balancing, hopping, ball kicking, planar paddle-based juggling and volleying of a ball, and bipedal walking.
We demonstrate the effective use of many different types of sensory information for the control of bipedal walking.

1.2 Contributions

The primary contributions of this thesis can be summarized in two parts. First, we show that a reduced order linear feedback structure can be automatically synthesized for a variety of motion skills with our simple framework. Second, a sparsity-enforcing method is integrated into the framework; it evaluates the importance of components in the feedback and eliminates useless features.

We implement six example control tasks: in-place balancing, constrained hopping, ball kicking, single-ball juggling, ball volleying and bipedal walking. The proposed feedback strategy has been successfully applied in these examples, which demonstrates the feasibility of stabilizing a large variety of dynamical systems. The results of feature selection in our framework also show that the learned feedback structure converges to the manually designed feedback law which is based upon prior knowledge.

1.3 Organization

The remainder of this thesis is structured as follows. We discuss the related work in Chapter 2. The linear output feedback structure is described in Chapter 3. In Chapter 4, we describe the optimization technique to learn the feedback policy with feature selection. Six control tasks are demonstrated in Chapter 5. The corresponding results are presented in Chapter 6. We summarize our work in Chapter 7 and discuss directions for future work.

Chapter 2 Related Work

Aspects of the ideas we present in this thesis have been explored previously. In this chapter, we briefly review the research literature related to reduced order models for physics-based simulation, output feedback, model-based biped control and policy search methods in character animation and robotics.

2.1 Reduced Order Models for Physics-based Simulation

In this thesis we present a reduced order method for physics-based character control.
This is motivated by a number of reduced order models for physics-based simulation, including sound synthesis [19], fluid simulation [27, 32] and deformable object simulation [9, 10].

Picard et al. [19] develop a multi-scale method to render sounds of virtual objects using modal synthesis. Their method is able to perform the voxelization using a sparse regular grid embedding of a given object and therefore permits the construction of plausible lower resolution approximations of the modal model.

For fluid simulation, Treuille et al. [27] present a framework for the reduced-dimensional simulation of incompressible fluids. They first learn a low-dimensional basis from a set of high-resolution fluid simulations. The fluid simulation is then performed in the subspace of the small basis to achieve real-time performance. Wicke et al. [32] further extend the idea of model reduction by constructing a set of reduced models which capture spatially localized fluid behavior and rearranging these models at runtime.

Simulating deformable objects is computationally expensive. Kim et al. [9] develop an online model reduction method to accelerate deformable simulation. Their method learns fast reduced order models on-the-fly and detects when the full-model computation can be skipped and replaced with the subspace models. Alternatively, with a domain-decomposition method [10], real-time simulation of articulated deformable characters can also be achieved in a subspace framework.

2.2 Output Feedback

Our work is an example of output feedback in classical control theory, which is covered in many optimal control books [1, 14]. Output feedback is used to modify the dynamics of a system when only its output is measured.
An important open question in control engineering is the static output feedback problem [25]: find a static output feedback for a given linear time-invariant system to achieve desirable characteristics, or determine that such a feedback does not exist. Extensive literature [21, 28, 30] discusses the necessary and sufficient conditions for static output feedback stabilizability and proposes various approaches for feedback controller design.

Output feedback is usually designed using model-based approaches where systems are linearized. In this case, the output feedback can be represented in compact form and be designed using convex optimization. Levine et al. [13] discuss the optimal control of linear multi-variable systems with output feedback. The constant feedback gains are optimized with respect to a quadratic performance criterion. Scherer et al. [23] synthesize linear output-feedback controllers by solving a system of linear matrix inequalities where the design objectives are formulated in a Lyapunov function and are a mix of system performance and constraints. Khalil et al. [8] linearize a single-input-single-output system by an input-output model and design an adaptive output feedback controller to track given reference signals.

Output feedback also involves the design of reduced-order controllers. David et al. [6] develop a potential reduction method to solve minimum rank problems involving linear matrix inequalities for the design of reduced-order feedback controllers. Harn et al. [35] introduce an LQG-like parametrization scheme for low-order controller design where the controller parameters consist of the reduced-order regulator and estimator gain matrices. These parameters are optimized via a descent-based method with respect to an $H_2$ cost. Burns et al.
[3] use a reduced basis approach to develop low-order nonlinear feedback controllers where low-order finite dimensional compensators are computed to approximate the optimal infinite dimensional feedback control laws.

Although this previous work offers a solid theoretical foundation, we are not aware of demonstrations of it being applied to the control of articulated figures performing motion skills.

2.3 Model-based Biped Control

The high dimensional nature of human motion makes it difficult to design controllers for seemingly trivial tasks such as walking, running and balancing. Research in animation and robotics often adopts simplified abstract models to approximate character dynamics and then designs feedback strategies accordingly to produce robust motion.

In locomotion tasks, the character can often be approximated by an inverted pendulum. Coros et al. [4] utilize an inverted pendulum model to compute proper foot placement in the design of a generalized biped walking controller so that the desired velocity can be achieved. The inverted pendulum model assumes constant leg length and is able to work across a wide range of body types. Tsai et al. [29] develop a walking controller based on trajectory tracking and use an inverted pendulum model to modulate the desired motion trajectory so that the resulting motion is physically plausible and adapted to dynamic environments. Kwon et al. [11] build control systems for human running. The systems make use of an abstract model consisting of an inverted pendulum on a cart, which resembles the balancing actions of a human. A linear quadratic regulator is further adopted as a controller for this simplified model.

Structures with fewer degrees of freedom are often used to approximate humanoid characters. da Silva et al. [5] construct a three link model based on the geometry and inertial properties of the full character model.
The dynamics of this model are approximated by a discrete-time linear time-varying system and an optimal balance feedback policy is further optimized based on this approximation.

The center of mass (COM) state, linear momentum, $L$, and angular momentum, $H$, are high level features that abstract the full state of the character. Many biped control strategies involve directly controlling these high level features. Macchietto et al. [15] develop an in-place balancing controller by guiding the linear and angular momentum to govern the positions of the center of pressure (COP) and the projected COM. A similar approach is applied in bipedal walking where the COM and the swing foot are guided by a momentum-based supervisor to improve the character's robustness to disturbances [34]. Ye et al. [36] design an abstract model based on COM, $L$ and $H$ to simplify the dynamics. Optimal control is applied in this abstract model to control character motion under physical perturbations and changes in the environment.

Figure 2.1: SIMBICON Feedback Policy [37] (Used with Permission)

Given simplified models, the feedback strategies can also be manually designed. SIMBICON [37] is an example, where the inputs and outputs for the feedback are chosen manually. As shown in Figure 2.1, SIMBICON employs a hand-tuned balance feedback law on the swing hip joint and stance ankle joint, depending on the horizontal distance, $d$, from the stance ankle to the character's center of mass (COM) and the velocity, $v$, of the COM: $\theta_d = \theta_{d_0} + c_d d + c_v v$. Once the feedback form is determined, the gains can be further refined by optimization to improve the robustness [31]. The manual design of feedback control usually requires knowledge and experience about the typical tasks and characters. Based on biomechanical principles and captured data about human recovery response, Shiratori et al. [24] construct a physics-based controller with finite state machines to simulate balance recovery responses to trips.
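As a concrete illustration of the hand-designed feedback law described above (target angle = default target angle + $c_d$ times distance + $c_v$ times velocity), the following minimal Python sketch evaluates the law for a single joint. The function name and all numeric values are assumptions chosen purely for illustration; they are not taken from SIMBICON [37] or from this thesis.

```python
def balance_feedback_target(theta_d0, d, v, c_d, c_v):
    """Return the corrected target angle theta_d = theta_d0 + c_d * d + c_v * v.

    theta_d0: default target joint angle (radians)
    d:        horizontal distance from the stance ankle to the COM
    v:        velocity of the COM
    c_d, c_v: hand-tuned feedback gains (hypothetical values below)
    """
    return theta_d0 + c_d * d + c_v * v


# Hypothetical inputs: the character leans forward (d > 0) and moves forward (v > 0),
# so the default target angle is shifted by the two feedback terms.
theta_d = balance_feedback_target(theta_d0=-0.4, d=0.1, v=0.5, c_d=0.5, c_v=0.2)
print(theta_d)
```

With both gains set to zero the function returns the default target angle unchanged, which corresponds to open-loop tracking of the reference pose.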

2.4 Policy Search Methods

In many situations where the application of optimal control formalisms is intractable or impractical, it is known that policy search methods applied to simple control structures can provide good sub-optimal control [2, 17]. Frequently, many of the key parameters in the policy parameterization are in fact feedback gains [2, 16, 17, 20]. Policy search methods applied to articulated figures have often used target angle trajectories as a control variable for achieving motor tasks such as throwing [12] or weightlifting [20]. The efficient computation of good policy gradients is important in policy search methods and has been the subject of much analysis, e.g., [16, 18] and many others. The learning of control policies for some of the types of tasks that we demonstrate in our work has also been demonstrated by others, such as juggling [22], throwing [12], and walking [26]. One of the goals in our work is to demonstrate the applicability of a single technique to a wide range of control tasks.

Chapter 3 Feedback Control Framework

In this chapter, we provide necessary background information about the widely used linear control framework that is adopted in this thesis. We also describe the linear output feedback structure and how it can be easily integrated into the existing control framework.

3.1 Control Framework

A common control framework is shown in Figure 3.1. We approximate the character to be controlled as a number of rigid bodies that are connected by joints. The desired motion is generated by applying proper torques to the joints of the character. The control framework is a closed loop. At each simulation step, an estimate of the system state, $x$, is produced based upon the available observations, $s$. The feedback control takes the classical state as input and produces corrections, $\delta a$, on the reference actions, $a_0$.
The mapping between the classical states and actions is usually manually designed and is based on typical knowledge about the control task and the morphology of the character. The low level joint PD controllers then compute torques, $\tau$, for each joint according to the desired actions and apply these torques in the simulation.

Figure 3.1: Classical Control Framework

In the classical control framework, the state estimation mechanism and the state-action mapping require extensive and explicit modeling of many aspects of the system. In our linear output feedback framework, we aim to use as little prior knowledge as possible. Instead of estimating the classical state description, we would like to use the raw sensory data directly to produce the action correction. We treat the relationship between sensory states and actions as a black box, which is encoded by a linear output feedback structure. The design of feedback control can be automated through the learning process. The control framework with output feedback structure is shown in Figure 3.2.

Figure 3.2: Linear Output Feedback Framework

3.2 Linear Output Feedback Structure

3.2.1 Full Matrix Form

The linear output feedback is simple in form, as shown in Equation 3.1. It linearly maps changes in the sensory state, $\delta s = s - s_0$, to changes in the control inputs, $\delta a = a - a_0$. The changes are defined with respect to the reference measurements for the sensory data, $s_0$, and the reference control actions, $a_0$. The control policy is defined by the elements of the feedback matrix, $M_F$. The feedback policy computes an action correction, $\delta a$, as the product of $M_F$ and $\delta s$ and then applies the modified action according to $a = a_0 + \delta a$. The feedback matrix $M_F$ is of size $m \times n$ for $m$ control parameters and $n$ sensory inputs.

\[ \delta a = M_F \cdot \delta s \qquad (3.1) \]

3.2.2 Reduced Order Form

In the full matrix form of the feedback structure, the size of the feedback matrix $M_F$ depends on the dimensions of control parameters and sensory inputs.
If either $\delta a$ or $\delta s$ lies in a high dimensional space, $M_F$ will be large, increasing the difficulty of the learning process. Moreover, the full matrix form provides little intuition about the importance of individual components either in the sensory data or in the control parameters.

In order to produce significantly more compact linear feedback policies, we can factor $M_F$ into two components, as described by Equation 3.2: (i) a sensory projection matrix, $M_{sp}$, that projects high-dimensional sensory data to a reduced-order state space; and (ii) an action projection matrix, $M_{ap}$, that maps the reduced-order state back to the full action space to produce the feedback compensation. The dimensions of $M_{sp}$ and $M_{ap}$ are then given by the reduced dimension, $n_r$: $M_{sp}$ is of size $n_r \times n$ and $M_{ap}$ is of size $m \times n_r$. We define $S_F$ as the parameterization of the control policy and in this case $S_F = \{M_{sp}, M_{ap}\}$.

\[ \delta a = M_{ap} \cdot M_{sp} \cdot \delta s \qquad (3.2) \]

By controlling the dimension of the reduced-order state space, the number of policy parameters can be dramatically reduced. As shown in Figure 3.3, column vectors in $M_{sp}$ correspond to components in the sensory data and row vectors in $M_{ap}$ correspond to components in the control inputs. By additionally rewarding L1 sparsity in the learning process, as described in Chapter 4, we can use the L1 norms of these vectors to evaluate the importance of particular pieces of sensory data and particular control parameters in producing a robust motion.

Figure 3.3: Feature Selection on Sensory Data and Control Inputs

3.2.3 Affine Form

The resulting feedback policy depends on the reference measurements for the sensory data, $s_0$, and the reference motion, $a_0$. With a better choice of $s_0$ and $a_0$, it is easier to find a better feedback solution in the learning process. In the affine form of the output feedback structure, we consider $s_0$ and $a_0$ as well as the feedback matrices and learn these components all together in the off-line optimization.
The parameterization of the control policy is then given by Equation 3.3 for the full matrix form and by Equation 3.4 for the reduced-order form.

\[ S_F = \{s_0, a_0, M_F\} \qquad (3.3) \]

\[ S_F = \{s_0, a_0, M_{sp}, M_{ap}\} \qquad (3.4) \]

Chapter 4 Policy Search

This chapter describes our approach for learning the linear output feedback structures. We use stochastic optimization with repeated roll-outs in order to learn the policy parameters, i.e., the feedback structure, $S_F$, which can be either in the full matrix form or in the reduced-order form. In the remainder of this chapter, we discuss the cost functions for the off-line optimization, the optimization method, and the incremental learning scheme.

4.1 Cost Functions for Optimization

For a given task, a cost function is defined to optimize parameters in the feedback structure so that the specific goal can be achieved. The cost functions for all tasks share a common structure:

\[ \mathrm{cost}(S_F) = w \cdot [\, S(S_F) \;\; E(S_F) \;\; U(S_F) \;\; R(S_F) \,]^T \qquad (4.1) \]

The function score is a weighted sum of four terms: $S(S_F)$ rewards structures that make the resulting motion as robust as possible; $E(S_F)$ measures how well the resulting motion meets the environment constraints; $U(S_F)$ measures how well the motion satisfies user specifications; and the regularization term $R(S_F)$ is used to enforce the sparsity of $S_F$ and therefore also implicitly performs feature selection on the sensing and control variables.

In practice, we use L1 regularization terms for L1 norms of column vectors in the sensory projection matrix, $M_{sp}$, as well as L1 norms of row vectors in the action projection matrix, $M_{ap}$, as shown in Equation 4.2.

\[ R(S_F) = w_0 \sum_i \sum_j \lVert M_{sp_{ij}} \rVert_1 + w_1 \sum_i \sum_j \lVert M_{ap_{ij}} \rVert_1 \qquad (4.2) \]

With L1 regularization, the resulting matrices are sparse and therefore we can use L1 norms of vectors to evaluate the importance of corresponding components in the sensory data and control parameters.
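The reduced-order feedback of Equation 3.2 and the regularization term of Equation 4.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: matrices are plain nested lists, the sensory projection matrix has one row per reduced dimension and one column per sensor, the action projection matrix has one row per control input, and all function names and numeric values are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def feedback_action(a0, s, s0, M_ap, M_sp):
    """Apply a = a0 + M_ap * M_sp * (s - s0), as in Equation 3.2."""
    delta_s = [si - s0i for si, s0i in zip(s, s0)]
    reduced = matvec(M_sp, delta_s)   # project sensors to the reduced space
    delta_a = matvec(M_ap, reduced)   # map reduced state to action corrections
    return [a0i + dai for a0i, dai in zip(a0, delta_a)]


def sensor_importance(M_sp):
    """L1 norm of each column of M_sp (one value per sensory variable)."""
    return [sum(abs(row[j]) for row in M_sp) for j in range(len(M_sp[0]))]


def control_importance(M_ap):
    """L1 norm of each row of M_ap (one value per control input)."""
    return [sum(abs(v) for v in row) for row in M_ap]


def l1_regularizer(M_sp, M_ap, w0, w1):
    """Weighted sum of absolute entries of both matrices, as in Equation 4.2."""
    return w0 * sum(sensor_importance(M_sp)) + w1 * sum(control_importance(M_ap))


# Hypothetical policy: 3 sensors, 1 reduced dimension, 2 control inputs.
M_sp = [[1.0, 0.0, -2.0]]
M_ap = [[0.5], [0.0]]
a = feedback_action(a0=[0.1, 0.2], s=[2.0, 5.0, 0.5], s0=[0.0, 0.0, 0.0],
                    M_ap=M_ap, M_sp=M_sp)
print(a)
print(sensor_importance(M_sp), control_importance(M_ap))
print(l1_regularizer(M_sp, M_ap, w0=0.1, w1=0.1))
```

In this toy example the second sensor and the second control input have zero L1 norm, so neither influences the action; this is the kind of signal that the feature selection described above relies on.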

4.2 Optimization Method

We use a stochastic global optimization technique, Covariance Matrix Adaptation (CMA) [7], to optimize the feedback structure. We apply the linear output feedback framework to various control tasks where the policy search involves high dimensional nonlinear optimization. CMA has been shown to be effective for such problems, where local gradient-based optimization algorithms can become trapped in local optima.

Figure 4.1: Directional Optimization of Covariance Matrix Adaptation [33] (Used with Permission)

CMA is an evolutionary strategy. As shown in Figure 4.1, CMA starts with a uniform Gaussian distribution and gradually directs the search to approach the global optimum. In each iteration, a number of candidates are sampled from a previously fitted Gaussian distribution over the space of parameter values. We evaluate these samples and pick a number of the best candidates to fit a new Gaussian distribution. This process is performed iteratively until the optimization converges to the expected cost or the maximal number of evaluations is reached.

The optimization procedure is generalized across various control tasks in that the optimization for the feedback matrix always begins from an initial guess consisting of zero entries, which makes it easier to automate the design of the feedback policy.

4.3 Incremental Learning

For many difficult control tasks, the optimization is challenging due to the complexity of the dynamical system. The solution for the task is usually highly constrained and far away from the initial guess. For some tasks, we therefore break the optimization into multiple stages, each with increasing difficulty, and each using the solution of the previous stage as a starting point.

Chapter 5 Control Tasks

We apply the feedback policy with linear structure to various control problems. In this chapter, we detail the sensory variables, control inputs, and cost functions for each task.
The 3D bipedal walking simulation uses Open Dynamics Engine (ODE) as the underlying physics simulator, while all the other examples use the Box2D physics engine.

5.1 Balancing

Given a free-standing character on flat terrain, the goal is to provide the character with the ability to maintain balance when the ground starts tilting, as shown in Figure 5.1. Figure 5.2 shows the character, which is modeled by four links connected by three revolute joints. All the links are 4.0 m in length and 32.0 kg in weight.

Figure 5.1: In-place Balancing

Figure 5.2: Character Configuration for Balancing Example

The initial control inputs, $\theta_0$, consist of three fixed joint angles for this character. The feedback policy is applied here to change the constant target joint angles over time so that the character adapts to the changing environment. We take changes in the slope of the terrain, $\Delta\alpha$, as well as changes in the ground reaction forces, $\Delta F_c$, as sensory inputs. $\alpha_0$ and $F_{c_0}$ are the default sensory information when the character stands on the flat ground. The feedback policy is therefore formulated in Equation 5.1.

\[ [\delta\theta]_{3 \times 1} = S_F \cdot \begin{bmatrix} (\Delta\alpha)_{1 \times 1} \\ (\Delta F_c)_{4 \times 1} \end{bmatrix}_{5 \times 1} \qquad (5.1) \]

In the optimization, we only consider the robustness term, $S(S_F)$. We reward structures that help keep balance for a longer period, $t_{balance}$. Additionally, we consider that the character is within its most stable states when its base is well planted on the terrain. This is measured by $t_{stable}$, the overall duration of continuous, stable ground contact. Equation 5.2 shows the cost function. The optimization scenario is simulated for 30 seconds where the motion of the tilting platform, $\alpha(t)$, is interactively specified by the user, as shown by Figure 5.3.
\[
\mathrm{cost}(\mathrm{SF}) = -\log\left(t_{balance} + 0.1 \cdot t_{stable}\right)
\tag{5.2}
\]

Figure 5.3: Motion of the Tilting Platform in Optimization Scenario

5.2 Constrained Hopping

Given a hopping controller for a four-link articulated figure (Luxo), the goal of this task is to adapt the hopping to an environment constraint, which is a sequence of desired foot-step locations, shown as black dots in Figure 5.4.

Figure 5.4: Luxo Constrained Hopping

Luxo consists of four links, as shown in Figure 5.5. The head and the upper body are connected by a fixed joint. The joints that connect the base, the lower body and the upper body are controllable. The base and the lower body are each 0.8m in length and 0.32kg in weight. The upper body is 1.0m in length and 0.4kg in weight. The head weighs 0.0225kg.

Figure 5.5: Luxo Configuration (links: Base, Lower Body, Upper Body, Head)

The original hopping controller is a finite state machine (FSM) consisting of three states. The transition between states depends on the defined state duration, t_s0, and the FSM restarts when Luxo hits the ground. In each state, we specify the constant target angles, θ_0, and the PD gains, [k_p0, k_d0], to track these poses. We provide feedback compensation on these control inputs. We consider the difference, Δd, between the regular step length, d_0, and the required step length, d, as one component of the sensory inputs. We also include in the sensory data the difference, Δs_r, in the reduced state of the character at the moment Luxo hits the ground. The reduced state, s_r, is obtained by PCA learned from the motion generated by the original controller, and s_r0 is the mean value calculated from that motion. In the next chapter, we discuss why this PCA sensory information is in fact optional. Equation 5.3 shows the resulting feedback policy.

Figure 5.6: Constrained Hopping Optimization Scenario

\[
\begin{bmatrix} \delta\theta \\ \delta t_s \\ \delta k_p \\ \delta k_d \end{bmatrix}_{12\times 1}
= \mathrm{SF} \cdot
\begin{bmatrix} (\Delta s_r)_{2\times 1} \\ \Delta d_{1\times 1} \end{bmatrix}_{3\times 1}
\tag{5.3}
\]

In the optimization, in the S(SF) term, we expect Luxo to hop without falling.
In the E(SF) term, we would like to meet the environment constraint by minimizing the differences between the actual foot positions, x_i, and the desired foot positions, x_di. In addition, we want Luxo to hop as fast as possible. Equation 5.4 is the cost function for the off-line optimization.

\[
\begin{aligned}
\mathrm{cost}(\mathrm{SF}) &= S(\mathrm{SF}) + E(\mathrm{SF}) + U(\mathrm{SF}) \\
S(\mathrm{SF}) &= 1000 \cdot 0.98^{t} \\
E(\mathrm{SF}) &= 10 \cdot \sum_i (x_i - x_{d_i})^2 \\
U(\mathrm{SF}) &= t
\end{aligned}
\tag{5.4}
\]

The optimization scenario is a sequence of 16 target landing locations. Given the default hop length of 1.93m in the original controller, the desired step lengths in the optimization scenario range from 1.248m to 2.358m, as shown in Figure 5.6.

5.3 Controllable Ball Kicking

Given a controller that enables the character to kick a ball, which falls from height h_d0, towards position p_d0 on the wall, the goal is to apply the feedback policy to adapt the controller so that it works for different pairs [h_d, p_d]. This task is illustrated in Figure 5.7, where the blue cross marks the initial ball position and the red cross marks the target location.

We use four links and three controllable joints to model the character, as shown in Figure 5.8. The torso is static in the scene. It is 1.0m in length and 0.5m in width. The upper leg is 1m in length and 0.3m in width, and weighs 12kg. The lower leg is 1.3m in length and 0.3m in width, and weighs 15.6kg. The foot is 0.6m in length and 0.2m in width, and weighs 4.8kg. The radius of the ball is 0.5m and it weighs 0.78kg.

The control actions for this motion consist of the timing positions, u_t, and the corresponding values, u_θ, of the control points that are used to model the trajectories of the desired motion of the character. The given controller consists of a set of default values [u_t0, u_θ0]. For the sensory inputs, we consider the difference, Δh_d, between the desired starting height of the ball, h_d, and the default one, h_d0. We also consider the difference, Δp_d, between the desired target position, p_d, and the default position, p_d0.
Equation 5.5 is the applied feedback policy.

\[
\begin{bmatrix} \delta u_t \\ \delta u_\theta \end{bmatrix}_{26\times 1}
= \mathrm{SF} \cdot
\begin{bmatrix} (\Delta h_d)_{1\times 1} \\ (\Delta p_d)_{1\times 1} \end{bmatrix}_{2\times 1}
\tag{5.5}
\]

Figure 5.7: Controllable Ball Kicking

Figure 5.8: Human Leg Configuration (links: Torso, Thigh, Shin, Foot)

In the optimization, we expect the character to kick the ball to a position, p, close to the target. This constraint is specified in the U(SF) term. We apply a large penalty in the S(SF) term when the character misses the ball. We evaluate the feedback structure under 112 example scenarios, each consisting of a ball height, h_di, and a target location, p_di, where h_di ∈ {3.5, 4.5, ..., 9.5} and p_di ∈ {9, 8, ..., −6}. Equation 5.6 shows the cost function that we use in the optimization.

\[
\begin{aligned}
\mathrm{cost}(\mathrm{SF}) &= \sum_i \big( S_i(\mathrm{SF}) + U_i(\mathrm{SF}) \big) \\
S_i(\mathrm{SF}) &=
\begin{cases}
0, & \text{the character kicks the ball;} \\
100, & \text{the character fails to catch the ball.}
\end{cases} \\
U_i(\mathrm{SF}) &= \| p_i - p_{d_i} \|_1
\end{aligned}
\tag{5.6}
\]

5.4 Single-ball Juggling

Figure 5.9: Single-ball Juggling

Given a controller that enables the character to bounce the ball for only a very short time, we apply the feedback policy to enable the system to remain stable indefinitely. This task is shown in Figure 5.9.

As shown in Figure 5.10, the character is composed of four parts. The shoulder, whose radius is 0.1m, is fixed in the scene. The upper arm is 0.2m in length and 0.05m in width, and weighs 0.8kg. The lower arm is 0.25m in length, 0.05m in width and 1.0kg in weight. The racket is 0.35m in length, 0.01m in width and 0.28kg in weight. These components are connected by three controllable joints. The ball weighs 0.15kg. We set the ball's radius to 0.1m and its restitution to 0.8.

Figure 5.10: Human Arm Configuration

The control inputs are the same as in the previous example. We apply the feedback policy every time the velocity of the ball comes to zero. For the sensory inputs, we consider not only changes in the state of the ball, including the position, Δp, and the horizontal velocity, Δv_x, but also changes in the state of the character, Δs.
The state of the character is an 18-dimensional vector, which includes the positions, the linear velocities, the orientations and the angular velocities of the rigid bodies in the character frame: s = {x, ẋ, θ, θ̇}. Equation 5.7 shows the applied feedback policy.

\[
\begin{bmatrix} \delta u_t \\ \delta u_\theta \end{bmatrix}_{24\times 1}
= \mathrm{SF} \cdot
\begin{bmatrix} \Delta p_{2\times 1} \\ (\Delta v_x)_{1\times 1} \\ \Delta s_{18\times 1} \end{bmatrix}_{21\times 1}
\tag{5.7}
\]

Every time the ball begins falling, we would like to constrain its state [p_i, v_xi] to be similar to the initial state [p_0, v_x0]. We also reward feedback structures that enable the system to work for a longer period, t, before the character fails to catch the ball. Equation 5.8 shows the cost function for the optimization.

\[
\begin{aligned}
\mathrm{cost}(\mathrm{SF}) &= S(\mathrm{SF}) + U(\mathrm{SF}) \\
S(\mathrm{SF}) &= 100 \cdot t^{-0.95} \\
U(\mathrm{SF}) &= 20 \cdot \frac{\sum_{i=1}^{N} \| p_i - p_0 \|_2^2}{N}
 + 10 \cdot \frac{\sum_{i=1}^{N} \| v_{x_i} - v_{x_0} \|_2^2}{N}
\end{aligned}
\tag{5.8}
\]

We apply an incremental scheme here. We first run the optimization with the scenario simulated for 5 seconds. We then start from the optimized result and further refine the feedback structure in a new optimization in which the scenario is simulated for 15 seconds.

5.5 Ball Volleying

In Figure 5.11, two characters have the task of volleying the ball back and forth to each other. The initial controller only enables this system to work for a very short period. Our feedback policy is applied here to stabilize the whole dynamical system.

Figure 5.11: Ball Volleying

The character configuration and the control inputs are the same as in the previous example. The feedback law is invoked every time the ball crosses the midline between the two characters. We take changes in the state of the ball, including the vertical position, Δp_y, and the velocity, Δv, as well as changes in the state of the character, Δs, as the sensory data. Equation 5.9 shows the applied feedback policy.
\[
\begin{bmatrix} \delta u_t \\ \delta u_\theta \end{bmatrix}_{24\times 1}
= \mathrm{SF} \cdot
\begin{bmatrix} (\Delta p_y)_{1\times 1} \\ \Delta v_{2\times 1} \\ \Delta s_{18\times 1} \end{bmatrix}_{21\times 1}
\tag{5.9}
\]

In the optimization, every time the vertical velocity of the ball approaches zero, we want to constrain the ball's position, p, and its horizontal velocity, v_x, to be close to the initial state: p_0 = [2.5, 0.0] and v_x0 = 2.5. The optimization also rewards structures that enable the system to work for a longer time duration, t. Equation 5.10 shows the cost function.

\[
\begin{aligned}
\mathrm{cost}(\mathrm{SF}) &= S(\mathrm{SF}) + U(\mathrm{SF}) \\
S(\mathrm{SF}) &= T - t \\
U(\mathrm{SF}) &= \frac{\sum_{i=1}^{N} \| p_i - p_0 \|_2^2}{N}
 + \frac{\sum_{i=1}^{N} \| v_{x_i} - v_{x_0} \|_2^2}{N}
\end{aligned}
\tag{5.10}
\]

We adopt the same two-phase incremental scheme as in the previous example, with the simulation time, T, set to 5 seconds for the first phase and 30 seconds for the second phase.

5.6 Bipedal Walking

Figure 5.12: Biped Walking

For the walking control task, shown in Figure 5.12, our interest is in learning feedback control strategies that provide robust balance. Without feedback, the physics-based locomotion controller is apt to fail even under subtle disturbances. We build on an implementation of the well-studied SIMBICON balance strategy [37]. The configuration of the biped model is shown in Figure 5.13. It has a total mass of 70.4kg and 34 degrees of freedom. We use simple geometries such as capsules, spheres and boxes for collision detection and fine meshes for visualization.

Figure 5.13: Biped Configuration (joints labeled with 1, 2 or 3 DOF)

The goal here is to replace the SIMBICON feedback law with an automatically synthesized feedback strategy. To simplify the problem, we constrain the pelvis of the 3D character to remain vertical in the sagittal plane. What we control here is the target pose, θ, for all joints of the character at the current simulation step. We have experimented with three different sets of sensory inputs. In the first set, we use the full state of the character as the sensory input.
The full state, s_full, includes the position, x_r, linear velocity, ẋ_r, orientation, θ_r, and angular velocity, θ̇_r, of the root in the world frame, as well as the relative orientation, q, and angular velocity, q̇, of each joint with respect to its parent: s_full = {x_r, ẋ_r, θ_r, θ̇_r, q, q̇}. This set contains only the raw sensory data, without explicitly modeling the dynamics of the character with the COM. Here we use the null vector as the initial measurement of the full state, s_full0. Equation 5.11 shows the resulting feedback policy.

\[
[\delta\theta]_{19\times 1} = \mathrm{SF} \cdot [\Delta s_{full}]_{108\times 1}
\tag{5.11}
\]

Figure 5.14: Modeling Contact Forces for Bipedal Walking

In the second set, we consider the ground reaction forces on both feet of the character as the sensory inputs. In practice, there may be many contact points on each foot at runtime, each with a force, F_i, from the ground, as shown by the grey arrows in Figure 5.14. For each foot, we take the sum of the contact forces as the overall ground reaction force, i.e. F_C = ∑_i F_i. The ground reaction force is shown by the red arrows in Figure 5.14. Equation 5.12 shows the resulting feedback policy.

\[
[\delta\theta]_{19\times 1} = \mathrm{SF} \cdot
\begin{bmatrix} (\Delta F_{C_{swing}})_{3\times 1} \\ (\Delta F_{C_{stance}})_{3\times 1} \end{bmatrix}_{6\times 1}
\tag{5.12}
\]

In the third set, the center of pressure (COP) information on both feet is considered as the sensory data. All the forces acting between the foot and the ground can be summed to a single ground reaction force acting at the COP, about which there is no moment. Therefore, given the vertical components of the contact forces, |F_i|_y, at the different contact points, p_i, on the foot, the COP location, p_cop, can be calculated as follows:

\[
p_{cop} = \frac{\sum_i p_i \cdot |F_i|_y}{\sum_i |F_i|_y},
\qquad
|F_i|_y = F_i \cdot (0, 1, 0)
\tag{5.13}
\]

We use the COM position of the foot as the initial measurement. At every control time step, the feedback policy takes the distance between the COP and the COM on both feet as inputs and produces the compensation. When the foot is not in contact with the ground, we set the sensory inputs to zero.
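The COP computation of Equation 5.13, together with the zero-input convention for a foot that is off the ground, can be sketched as follows. The function names and the tuple-based data layout are hypothetical. The y-up convention follows the (0, 1, 0) vector in Equation 5.13, and treating a non-positive total vertical force as "no contact" is an added assumption.

```python
def center_of_pressure(points, forces):
    """Force-weighted average of contact points (Equation 5.13).

    points: list of (x, y, z) contact positions on one foot.
    forces: list of (fx, fy, fz) contact forces at those points.
    Returns None when there is no usable vertical load (assumption).
    """
    up = (0.0, 1.0, 0.0)
    # Vertical component of each contact force: F_i . (0, 1, 0).
    weights = [sum(f_k * u_k for f_k, u_k in zip(f, up)) for f in forces]
    total = sum(weights)
    if total <= 0.0:
        return None
    return tuple(
        sum(w * p[k] for w, p in zip(weights, points)) / total
        for k in range(3)
    )


def cop_sensor_input(points, forces, foot_com):
    """COP offset from the foot COM, or zeros when the foot is in the air."""
    cop = center_of_pressure(points, forces)
    if cop is None:
        return (0.0, 0.0, 0.0)
    return tuple(c - m for c, m in zip(cop, foot_com))


# Two contact points; the second carries three times the vertical load.
contact_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
contact_forces = [(0.0, 1.0, 0.0), (0.0, 3.0, 0.0)]
stance_input = cop_sensor_input(contact_points, contact_forces,
                                foot_com=(0.5, 0.0, 0.0))
swing_input = cop_sensor_input([], [], foot_com=(0.5, 0.0, 0.0))
```

In this toy case the COP lies three quarters of the way toward the more heavily loaded contact point, so the stance input is offset from the foot COM along x only, while the foot without contacts yields an all-zero input.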
Equation 5.14 shows the applied feedback policy.

\[
\begin{aligned}
[\delta\theta]_{19\times 1} &= \mathrm{SF} \cdot
\begin{bmatrix} (\Delta D_{swing})_{3\times 1} \\ (\Delta D_{stance})_{3\times 1} \end{bmatrix}_{6\times 1} \\
\Delta D &=
\begin{cases}
p_{cop} - p_{com}, & \text{foot in contact with the ground;} \\
0, & \text{otherwise.}
\end{cases}
\end{aligned}
\tag{5.14}
\]

In the optimization, S(SF) rewards structures that enable the character to walk without falling for a long period, t. U(SF) constrains the resulting motion to be similar to the one generated by the original SIMBICON controller in terms of the average step length, l_0, the average velocity per step, v_0, and the average state, s_0, when the swing foot hits the ground. Additionally, we require the character to walk with minimal energy consumption, which is approximated by the average applied torque, τ̄. R(SF) is used to perform feature selection when using the reduced-order form of the feedback structure. Equation 5.15 shows the cost function; the weights, w = [w_1, ..., w_6], are set to [2000, 1000, 1000, 0.5, 0.0001, 200].

\[
\begin{aligned}
\mathrm{cost}(\mathrm{SF}) &= S(\mathrm{SF}) + U(\mathrm{SF}) + R(\mathrm{SF}) \\
S(\mathrm{SF}) &= w_1 \cdot (T - t) \\
U(\mathrm{SF}) &= \frac{\sum_{i=1}^{N} \big( w_2 (l_i - l_0)^2 + w_3 (v_i - v_0)^2 + w_4 (s_i - s_0)^2 \big)}{N} + w_5 \cdot \bar{\tau}^2 \\
R(\mathrm{SF}) &= w_6 \cdot \Big( \sum_i \sum_j \| M_{sp_{ij}} \|_1 + \sum_i \sum_j \| M_{ap_{ij}} \|_1 \Big)
\end{aligned}
\tag{5.15}
\]

We apply a three-phase optimization for this task. In the first phase, we optimize the feedback structure in a scenario without pushes for 5 seconds. In the second phase, the scenario is the same but the simulation time, T, increases to 50 seconds. In the last phase, we apply gradually increasing forces in the scenario so that the refined feedback structure further enhances the robustness of the walking motion under pushes.

Chapter 6 Results and Discussion

The motor tasks in Chapter 5 are performed with reasonable skill using learned feedback policies. In this chapter, we begin by presenting the performance of the learned feedback policies on the control tasks. To obtain further insight into the method, we discuss the impact of various choices on the resulting performance, including the choice of sensory information, the L1 feature selection, the incremental scheme, the affine feedback form and the optimization method.
Lastly, we discuss a typical workflow for designing controllers for motion skills using our feedback framework.

6.1 Performances of Learned Feedback Policies

Balancing: We use feedback structures of full matrix form (FM) and first-order reduced form (R1). Their performance is very similar in terms of the optimized cost function scores (Table 6.1). The character with learned feedback is able to maintain balance for the whole 30 seconds of the optimization scenario, and the resulting systems work across various new scenarios with irregular oscillating motions of the platform, as shown in Figure 6.1.

Feedback Form    Number of Parameters    Score
R1               8                       -3.444
FM               15                      -3.452

Table 6.1: Optimized Function Scores for Balancing

Figure 6.1: Performance of Balancing in New Scenarios

Constrained Hopping: We optimize the full matrix feedback, the first-order feedback and the second-order feedback (R2). The optimized feedback policies enable the character to adjust the hopping length to satisfy the stepping constraints, as shown in Figure 5.4. Table 6.2 shows the optimized function scores.

Feedback Form    Number of Parameters    Score
R1               15                      21.897
R2               30                      21.846
FM               36                      23.030

Table 6.2: Optimized Function Scores for Constrained Hopping

We further test the performance of the adapted hopping controller under new scenarios. New sequences of desired stepping locations are sampled from a normal distribution whose mean is the regular step length of the character (1.93m) and whose variance is set to 0.2. We generate 100 sample sequences, each consisting of 16 target stepping locations. For each sequence, we accumulate the distance error between the actual stepping positions and the desired ones, and we record the average accumulated distance error, ε, over the 100 sequences:

\[
\varepsilon = \frac{1}{100} \sum_{j=1}^{100} \sum_{i=1}^{16} \| x_i - x_{d_i} \|
\]

It is possible for the character to fail on a new sequence that contains extremely small or large steps. Therefore we record the success rate as well.
Table 6.3 shows the performance under new synthetic scenarios in terms of ε and the success rate. The full matrix feedback outperforms the reduced-order feedbacks in this case, indicating that the learned reduced-order feedbacks may overfit to the optimization scenario.

Feedback Form    ε       Success Rate
R1               5.44    32%
R2               3.30    53%
FM               1.65    69%

Table 6.3: Performance of Constrained Hopping with Feedback

Controllable Ball Kicking: With optimized feedback policies, the character can always catch the ball and kick it towards the target locations. We apply full matrix feedback and first-order feedback. A reduction to a 1D feedback state results in a functional system with a slight degradation in performance. For the full matrix form, 77.68% of the kicks in the optimization scenario produce a distance error for the ball of less than 0.5, which drops to 42.86% for the first-order system. We further test the resulting systems under different scenarios, [h_di, p_di], where h_di ∈ {3, 4, ..., 9} and p_di ∈ {8.5, 7.5, ..., −6.5}. The average distance error is 0.33 for the full matrix feedback and 1.32 for the first-order feedback.

Feedback Form    Number of Parameters    Score
R1               28                      110.66
FM               52                      32.70

Table 6.4: Optimized Function Scores for Ball Kicking

Figure 6.2: Ball Trajectory for 5000 Simulation Steps in Ball Juggling (axes: position x, position y; the initial position of the ball is marked)

Single-ball Juggling: We experiment with feedback policies of full matrix form and reduced-order forms (R1, R2, R3). All the optimized models result in stable working systems, as shown in Figure 5.9. Figure 6.2 shows the trajectory of the ball position for 5000 simulation steps. This demonstrates that the resulting systems are robust to disturbances in the ball state. As shown in Table 6.5, the optimized cost function values generally decrease with respect to the dimension of the reduced space.
However, the score for the full matrix form is larger than the scores for the second-order and third-order forms.

We further test the robustness of the resulting systems. We measure the range of perturbations that can be applied to the starting state of the ball such that the system still works in a stable way. The ball always starts from zero velocity in the vertical direction. The perturbation of the initial ball position varies from −0.065m to 0.019m in the horizontal direction and from −0.072m to 0.058m in the vertical direction. The initial ball velocity can be perturbed from −0.02m/s to 0.027m/s in the horizontal direction.

Feedback Form    Number of Parameters    Score
R1               45                      0.0196
R2               90                      0.0022
R3               135                     0.0021
FM               504                     0.0173

Table 6.5: Optimized Function Scores for Single-ball Juggling

Figure 6.3: Ball Trajectory for 20 Seconds in Ball Volleying (the initial position of the ball is marked)

Ball Volleying: The optimization fails to find working solutions for the first-order and third-order feedback policies, while the second-order form and the full matrix form can be optimized to stabilize the system. Some frames from the resulting animation are shown in Figure 5.11. A trajectory of the ball position for 20 seconds is shown in Figure 6.3. The full matrix feedback outperforms the second-order feedback in terms of the optimized cost function score. The failure of the first-order feedback shows that one dimension is insufficient to model the state of the dynamical system. The result of the third-order feedback is surprising, in that a higher-order feedback should in principle do at least as well as a lower-order factorization because it is capable of implementing the same feedback matrix. In this case, it is evident that the CMA optimization algorithm has not found the optimal solution in the higher-dimensional reduced-order state space.
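The claim that a higher-order factorization can implement any feedback matrix reachable by a lower-order one can be checked with a small sketch: padding the two factors with a zero column and a zero row raises the order by one without changing their product. The helper names and the toy matrices below are hypothetical, and the factor names only mirror the M_ap and M_sp notation of the thesis.

```python
def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def raise_order(m_ap, m_sp):
    """Embed an order-k factorization into order k+1 by zero padding.

    m_ap: (n_actions x k) matrix, m_sp: (k x n_sensors) matrix.
    The extra reduced-state dimension is unused, so the product
    m_ap @ m_sp is unchanged.
    """
    m_ap_hi = [row + [0.0] for row in m_ap]
    m_sp_hi = [list(row) for row in m_sp] + [[0.0] * len(m_sp[0])]
    return m_ap_hi, m_sp_hi


# Toy first-order policy: 3 control outputs, 2 sensory inputs.
m_ap1 = [[0.5], [-1.0], [2.0]]
m_sp1 = [[1.0, 3.0]]
m_ap2, m_sp2 = raise_order(m_ap1, m_sp1)

feedback_order1 = matmul(m_ap1, m_sp1)
feedback_order2 = matmul(m_ap2, m_sp2)
```

Since the optimum of the order-1 search is therefore also a feasible point of the order-2 and order-3 searches, a worse result at higher order points to the optimizer rather than to the expressiveness of the policy.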
Feedback Form    Number of Parameters    Score
R1               45                      failed
R2               90                      0.125
R3               135                     failed
FM               504                     0.089

Table 6.6: Optimized Function Scores for Ball Volleying

The ball starts from the midline between the two characters. We measure how much we can perturb the initial velocity and the vertical position of the ball such that the system continues to work. The perturbation of the initial velocity varies from −0.071m/s to 0.2m/s in the horizontal direction and from −0.02m/s to 0.027m/s in the vertical direction. The perturbation of the vertical position varies from −0.023m to 0.045m.

Bipedal Walking: We note that learning a full matrix feedback for the biped character with the full-character-state sensory input is infeasible, given the 108 × 19 parameters that define the matrix (Equation 5.11). Therefore, we document the reduced-order results (R1, R2, R3) for the case of full-character-state sensory input, FS, as well as the full matrix results (FM) and reduced-order results for the cases of reduced sensory input (the ground reaction forces, CF, and the center of pressure location, CP). All the models result in stable walking motion and are robust to pushes. Frames of the resulting motion are shown in Figure 5.12. We optimize the SIMBICON feedback gains according to the same cost function as defined in Equation 5.15 and use its performance as the benchmark for comparison. There is little difference in performance among the feedback models applied to the original cost objective without the regularization term, indicating that a one-dimensional state summary is largely sufficient for balance in the sagittal plane. We also measure the performance of the learned feedback policy by the maximal recoverable force (MRF) that can be applied at the beginning of the walking motion. The force is applied for 0.4 seconds and the system is able to return to a stable walking cycle within 5.0 seconds. Table 6.7 shows the MRF and the optimized function scores without the regularization term.
Sensor Type    Order    # of Parameters    MRF      Score without R(SF)
FS             R1       127                82 N     706.76
FS             R2       254                53 N     717.80
FS             R3       371                52 N     708.82
CF             R1       25                 73 N     700.05
CF             R2       50                 145 N    714.22
CF             R3       75                 133 N    711.39
CF             FM       114                154 N    688.36
CP             R1       25                 226 N    703.43
CP             R2       50                 189 N    704.16
CP             R3       75                 189 N    705.80
CP             FM       114                219 N    691.62
SIMBICON       N/A      6                  159 N    682.46

Table 6.7: Maximal Recoverable Forces and Cost Function Scores of Bipedal Walking

6.2 Choice of Sensory Information

The linear output feedback framework couples off-line policy search and feature selection. The off-line policy search automatically discovers the relationship between changes in the sensory data and changes in the action space. The integration of feature selection eliminates useless components in the feedback inputs and outputs. Together, these two components allow a wide range of choices for the sensory information. In this section, we discuss the effect of the choice of sensory information on the performance of the learned feedback policy.

We begin by looking at how the specific choice of sensory information made available for a task can impact task performance. We first consider the constrained hopping task. The sensory information made available in the baseline case, A0, consists of the change in desired hop length, Δd, and a reduced model of the deviation, Δs_r, of the initial state with respect to a reference starting state. In a first alternative, A1, we remove Δs_r from the available sensory information, which removes knowledge of any initial state deviations. In a second alternative, A2, we add information regarding the position and velocity of the center of mass. As expected, the additional information in A2 improves the performance to 22.13, in terms of the average optimized cost function score, as compared to the baseline case (23.14). Surprisingly, A1 also outperforms A0, with an average optimized score of 22.45.
These performance indexes are computed across 5 independent runs of the optimization for each case, as shown in Figure 6.4, where the black bar shows the range of the independent scores and the red bar shows the median value.

Figure 6.4: Impact of Sensory Information Choice on Performance in Constrained Hopping Example (cases A0, A1, A2)

We next consider the effect of the type of sensory information on the bipedal walking example, where three alternatives are considered: the full character state (FS, 108-dimensional), the 3D components of the ground contact forces (CF, 6-dimensional), and the 3D components of the center of pressure location relative to the center of mass (CP, 6-dimensional). All these forms of sensory input perform similarly with respect to the final cost function score without the regularization term, and very nearly on par with an optimized SIMBICON controller, as shown in Table 6.7. This may reflect the fact that, for the given cost function, all three types of sensory input are equally informative with respect to feedback for the given task. As a second test, we consider the maximal recoverable perturbation force during a walk cycle. As shown in Table 6.7, for this new measurement the relative performances do begin to differ significantly, with the relative ranking being: CP > SIMBICON > CF > FS. The optimized use of center of pressure information yields better performance by a large margin (30%) over an optimized SIMBICON feedback model, and the ground contact force model does nearly as well as SIMBICON.

Figure 6.5: Average Squared Distance Error with Transformed Sensory Data in Ball Kicking Example

We next consider the effect of applying an orthogonal transformation to the sensory information. The resulting performance should in principle be invariant to such transformations of the input data, given that the same information remains embedded. This is tested with the controllable ball kicking task.
We compare performance for the original sensory information, S1, which consists of the height at the start of the drop and the target height; a second scenario, S2, where the original sensory information is rotated by 45 degrees; and a third scenario, S3, where the original sensory inputs [Δh_d, Δp_d]^T are transformed into [Δh_d + Δp_d, Δh_d − Δp_d]^T. The resulting performance in Figure 6.5 shows only a marginal reduction in performance for cases S2 and S3, and therefore the method does exhibit invariance to these transformations of the sensory data.

6.3 Model Reduction

The ability of the method to develop reduced-order feedback models is tested using several of the tasks. In principle, the performance of optimized feedback in full matrix form should be the upper bound for the performance of policies in reduced-order forms. The flexibility of the model increases with the dimension of the reduced-order structure. In practice, however, the off-line optimization affects the resulting performance. Higher dimensions in the feedback structure increase the number of parameters to be optimized, and the optimization of higher-order structures is more easily trapped in local minima. In this sense, it is possible for lower-rank feedback policies to perform similarly to, or better than, higher-rank feedback policies.

6.4 Feature Selection

The effect of feature selection is two-fold. First, useless sensory inputs or even noise terms can be eliminated by feature selection. Second, feature selection helps discover the important components in the sensory inputs and control parameters.

We apply feature selection in the ball juggling example with a feedback structure of first-order reduced form. The weight for the regularization term is set to 3.0. Starting from the original feedback form, we add a useless term, η, to the control parameters and a noise term, ε, to the sensory inputs.
Every time the feedback policy is invoked, ε is sampled from the Gaussian distribution N(0,1). The feedback policy can then be formulated as in Equation 6.1.

\[
\begin{bmatrix} \delta u_t \\ \delta u_\theta \\ \eta \end{bmatrix}_{25\times 1}
= (M_{ap})_{25\times 1} \cdot (M_{sp})_{1\times 22} \cdot
\begin{bmatrix} \Delta p \\ \Delta v_x \\ \Delta s \\ \varepsilon \end{bmatrix}_{22\times 1}
\tag{6.1}
\]

Figure 6.6: Feature Selection for Ball Juggling Example. (a) L1 norms of M_ap row vectors, with the useless control parameter η marked. (b) L1 norms of M_sp column vectors, with the noise term ε marked.

By plotting the L1 norms of the column vectors in M_sp and the L1 norms of the row vectors in M_ap, as shown in Figure 6.6, we observe that η and ε are eliminated by feature selection.

We also examine the ability to select the most relevant sensory parameters and control inputs for a reduced first-order model in bipedal walking by using L1 regularization during optimization. For the sensory inputs, we use the ones in the SIMBICON model: the distance from the stance ankle to the COM, d, and the velocity of the COM, v, both in the character frame. We use the full 19-dimensional set of control parameters. We add a useless term, η, to the control inputs and a noise term, ε, to the sensory data. The feedback policy is formulated in Equation 6.2. Figure 6.7 shows that the most relevant control inputs are readily identified. The swing hip angle dominates because of its important role in foot placement. The horizontal velocity of the root link is identified as the key sensory parameter. Interestingly, the vertical height of the root link is also identified as a key parameter. Because it is one of the most constant parameters, we suspect that it is being utilized because of the benefits of an affine transformation over a linear transformation.
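The way Figures 6.6 and 6.7 are read can be sketched in a few lines: compute the L1 norm of each row of M_ap and each column of M_sp, then flag the entries whose norm is close to zero. The helper names, the toy matrices, and the threshold `tol` are hypothetical; the thesis inspects the norms visually and does not state a cutoff. The penalty function mirrors the R(SF) term of Equation 5.15.

```python
def row_l1_norms(matrix):
    """L1 norm of each row (one per control parameter in M_ap)."""
    return [sum(abs(v) for v in row) for row in matrix]


def col_l1_norms(matrix):
    """L1 norm of each column (one per sensory input in M_sp)."""
    return [sum(abs(row[j]) for row in matrix)
            for j in range(len(matrix[0]))]


def l1_penalty(m_sp, m_ap, weight):
    """Regularization term: weight * (sum |M_sp_ij| + sum |M_ap_ij|)."""
    total_sp = sum(abs(v) for row in m_sp for v in row)
    total_ap = sum(abs(v) for row in m_ap for v in row)
    return weight * (total_sp + total_ap)


def eliminated(norms, tol=1e-3):
    """Indices whose norm is below a hypothetical cutoff."""
    return [i for i, n in enumerate(norms) if n < tol]


# Toy first-order model: 3 control outputs, 3 sensory inputs.
# Output 1 and input 1 play the role of the useless/noise terms.
toy_m_ap = [[0.25], [0.0], [-0.5]]
toy_m_sp = [[0.5, 0.0, -0.25]]

unused_outputs = eliminated(row_l1_norms(toy_m_ap))
unused_inputs = eliminated(col_l1_norms(toy_m_sp))
penalty = l1_penalty(toy_m_sp, toy_m_ap, weight=2.0)
```

A zero row in M_ap means the corresponding control parameter never receives compensation, and a zero column in M_sp means the corresponding sensory input never influences the reduced state, which is the sense in which η and ε are "eliminated".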
\[
\begin{bmatrix} \delta\theta \\ \eta \end{bmatrix}_{20\times 1}
= (M_{ap})_{20\times 1} \cdot (M_{sp})_{1\times 7} \cdot
\begin{bmatrix} d \\ v \\ \varepsilon \end{bmatrix}_{7\times 1}
\tag{6.2}
\]

Figure 6.7: Feature Selection for Bipedal Walking Example. (a) L1 norms of M_ap row vectors; labeled entries: swing hip angle in sagittal plane, swing ankle angle in sagittal plane, torso orientation in sagittal plane, relative angle in sagittal plane between torso and pelvis, useless control parameter η. (b) L1 norms of M_sp column vectors; labeled entries: horizontal v in sagittal plane, vertical d in sagittal plane, noise term ε.

6.5 Incremental Scheme

By exposing the optimization to similar scenarios of increasing difficulty, the robustness of the learned feedback policy can be significantly improved. We consider the bipedal walking example. In Figure 6.8, we compare the maximal recoverable forces (MRF) that can be applied to the walking system with and without running the third phase of optimization, in which we introduce gradually increasing pushes in the simulated scenario. The maximal recoverable forces for all the models increase after running the third phase of optimization, with an improvement of up to 190X in terms of MRF, as shown by the case of CF R3.

6.6 Affine Feedback

We investigate the effect of optimizing affine feedback instead of linear feedback by adding s_0 and a_0 to the policy parameters. We experiment on the constrained hopping example. For each case, we run the optimization five times. Figure 6.9 shows the box plot of the optimized function scores, where the red central marks are the medians. The affine form consistently outperforms the linear form. This indicates that the linear solution is using values for s_0 and a_0 that are significantly sub-optimal.
While the affine form will in principle always yield performance that is at least as good as the linear form, a caveat is that the additional free parameters increase the risk of finding a local minimum instead of a more global one.

Figure 6.8: Effect of Incremental Optimization in MRF for Biped Walking

6.7 Performance of Optimization

We evaluate the performance of the CMA optimization procedure. With CMA, the policy search is able to find working solutions for all tasks. We record the number of evaluations and the total time for the optimization of each working solution, as shown in Table 6.8. All the optimizations are conducted on a 2.66 GHz Intel i5 desktop with 6GB of memory. For complicated tasks, the optimization takes hours. However, with the learned feedback structures, all scenarios run in real time.

Lastly, we test the impact of using a local optimization method instead of the CMA stochastic optimization. We implement a greedy stochastic local search algorithm. The algorithm starts from the initial guess, which is the feedback structure with zero entries. At each iteration, it perturbs the current best solution, within a fixed window, in a uniform random fashion. The search advances if the perturbed sample has better performance. The algorithm stops when the maximal number of evaluations is reached.

Figure 6.9: Comparison between Affine Feedback and Non-affine Feedback in Constrained Hopping Example (cases: Non-affine Form, Affine Form)

We compare the performance of both optimization methods on the balancing task and the ball volleying task, both with the full matrix form of the feedback policy. Figure 6.10 shows how the optimized score changes with respect to the number of evaluations. For the former, the two methods converge to solutions with very similar performance, while the local stochastic method converges faster.
For the ball volleying task, the local optimization method fails to find a solution of comparable quality even in the first phase. In contrast, CMA converges very quickly to a working solution. We believe that, in systems with complicated dynamics, the cost function landscape is noisy, so local randomized search does not work well in these cases.

Table 6.8: Number of Evaluations and Total Time for Optimization

Task                    Order   Number of Evaluations   Total Time (hrs)
Balancing               FM      11400                   0.23
Balancing               R1      11300                   0.14
Constrained Hopping     FM      38000                   0.67
Constrained Hopping     R1      17000                   0.35
Constrained Hopping     R2      3000                    0.06
Ball Kicking            FM      20000                   69.84
Ball Kicking            R1      26000                   164.24
Single-ball Juggling    FM      23000                   3.42
Single-ball Juggling    R1      66000                   17.99
Single-ball Juggling    R2      55000                   14.94
Single-ball Juggling    R3      85000                   25.30
Ball Volleying          FM      13000                   9.14
Ball Volleying          R2      43000                   47.73
Bipedal Walking (FS)    R1      88600                   248.15
Bipedal Walking (FS)    R2      83800                   162.06
Bipedal Walking (FS)    R3      86400                   165.47
Bipedal Walking (CF)    FM      59400                   179.79
Bipedal Walking (CF)    R1      89600                   223.94
Bipedal Walking (CF)    R2      86700                   275.58
Bipedal Walking (CF)    R3      88900                   276.21
Bipedal Walking (CP)    FM      56000                   280.00
Bipedal Walking (CP)    R1      68100                   253.17
Bipedal Walking (CP)    R2      75100                   226.47
Bipedal Walking (CP)    R3      83200                   265.77

[Figure 6.10: Comparison between Optimization Methods. (a) Balancing. (b) Ball Volleying.]

6.8 Workflow of Controller Design for Motion Skills

With our feedback framework, we can design robust controllers for motion skills. The first step is to model the reference motion. Although a good reference motion may yield better performance, little knowledge about the given task is required, because the learned feedback policy is able to correct the reference motion in order to achieve the task goal, and the reference motion can itself be optimized when affine feedback is used. The next step is to choose the sensory inputs and control parameters for the feedback policy. As tested with several examples, our framework allows for the arbitrary use of sensory information and works for various control parameters. Therefore, this step is straightforward and demands little tweaking.
The last step is to define a cost function and optimize the feedback structure in the policy search, which is the most time-consuming part of the design workflow. In practice, we find that more complex applications require more manual user intervention to accomplish the desired goals. This includes adding constraint terms, tuning weights in the cost function, and applying a staged optimization scheme. Therefore, expertise on the given task can facilitate this process.

Chapter 7

Conclusions

The design of good feedback strategies for motor control tasks can be a complex endeavor. All of our control tasks involve some form of regulation around a default known reference action. For example, our walking biped already knows how to take a reasonable walking step, but without further feedback it will rapidly lose balance and fall. As perturbations occur, the goal of the feedback is to bring the system back towards the reference pose or reference motion.

Common approaches require tackling issues of system identification (modeling), state estimation from a variety of sensors, and feedback loop design. In this thesis, we explore the possibilities of applying policy search to linear output feedback control policies. This family of control policies can exploit rather arbitrary combinations of sensory variables and available control inputs. In doing so, the method eschews the use of classical state descriptions. An interesting feature of the method is that reduced-order models can be learned with the framework. This is achieved by imposing a factored feedback matrix structure that generates an intermediate summarized state representation of a given desired dimensionality. Further sparsity in the use of sensory information and control inputs can be achieved with the help of L1 regularization during the policy search. We show that these ideas can be applied to a variety of dynamical systems and different types of sensory inputs.
Non-standard sensory inputs such as ground reaction forces are shown to be a rich source of sensory information for feedback tasks such as balance.

7.1 Limitations and Future Work

Many challenges and issues remain. For policy search, we use CMA, a global search method, rather than more local policy-gradient methods. It is not clear what types of problems mandate the use of global search methods rather than local ones. The choice of control representation remains important for our problems. Linear feedback is limited in many ways; more complex problems may require learning multiple linear models for different regimes of operation, analogous to gain scheduling models. The stochastic policy search method requires some knowledge of the range of reasonable gains that should be explored; it is not obvious how this prior knowledge can be arrived at in a more general setting. Lastly, the current method still requires large numbers of policy rollouts, an issue which needs to be addressed in order for the method to be practical in real-world applications.

Bibliography

[1] B. Anderson and J. Moore. Optimal control: linear quadratic methods. Dover Books on Engineering Series. Dover Publications, 2007. ISBN 9780486457666. → pages 5
[2] J. A. D. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the International Conference on Robotics and Automation 2001. IEEE, May 2001. → pages 8
[3] J. A. Burns and B. B. King. A reduced basis approach to the design of low-order feedback controllers for nonlinear continuous systems. Journal of Vibration and Control, 4(3):297–323, 1998. → pages 6
[4] S. Coros, P. Beaudoin, and M. van de Panne. Generalized biped walking control. ACM Transactions on Graphics, 29(4):Article 130, 2010. → pages 6
[5] M. da Silva, Y. Abe, and J. Popović. Interactive simulation of stylized human locomotion. ACM Trans. Graph., 27:82:1–82:10, August 2008. ISSN 0730-0301.
→ pages 6
[6] J. David and B. De Moor. Designing reduced order output feedback controllers using a potential reduction method. American Control Conference, 1994, 1:845–849, 1994. → pages 5
[7] N. Hansen. The CMA evolution strategy: a comparing review. In J. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, editors, Towards a new evolutionary computation. Advances on estimation of distribution algorithms, pages 75–102. Springer, 2006. → pages 14
[8] H. Khalil. Adaptive output feedback control of nonlinear systems represented by input-output models. Automatic Control, IEEE Transactions on, 41(2):177–188, 1996. → pages 5
[9] T. Kim and D. L. James. Skipping steps in deformable simulation with online model reduction. ACM Trans. Graph., 28:123:1–123:9, December 2009. ISSN 0730-0301. → pages 4, 5
[10] T. Kim and D. L. James. Physics-based character skinning using multi-domain subspace deformations. In Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’11, pages 63–72, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0923-3. → pages 4, 5
[11] T. Kwon and J. K. Hodgins. Control systems for human running using an inverted pendulum model and a reference motion capture sequence. The ACM SIGGRAPH / Eurographics Symposium on Computer Animation (SCA 2010), 2010. → pages 6
[12] G. Lawrence. Efficient gradient estimation for motor control learning. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 2003. → pages 8
[13] W. Levine and M. Athans. On the determination of the optimal constant output feedback gains for linear multivariable systems. Automatic Control, IEEE Transactions on, 15(1):44–48, 1970. → pages 5
[14] F. Lewis and V. Syrmos. Optimal control. A Wiley-Interscience publication. J. Wiley, 1995. ISBN 9780471033783. → pages 5
[15] A. Macchietto, V. Zordan, and C. R. Shelton. Momentum control for balance. ACM Trans. Graph., 28:80:1–80:8, July 2009. ISSN 0730-0301. → pages 7
[16] A. Ng and M. Jordan.
Pegasus: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 406–415, 2000. → pages 8
[17] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics. MIT Press, 2004. → pages 8
[18] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006. → pages 8
[19] C. Picard, F. Faure, G. Drettakis, and P. G. Kry. A robust and multi-scale modal analysis for sound synthesis. In Proc. of the 12th Int. Conference on Digital Audio Effects, 2009. → pages 4
[20] M. T. Rosenstein and A. G. Barto. Robot weightlifting by direct policy search. In Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2, pages 839–844, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-812-5, 978-1-558-60812-2. → pages 8
[21] D. Rosinová, V. Veselý, and V. Kucera. A necessary and sufficient condition for static output feedback stabilizability of linear discrete-time systems. Kybernetika, 39(4):447–459, 2003. → pages 5
[22] S. Schaal and C. G. Atkeson. Robot juggling: An implementation of memory-based learning. Control Systems Magazine, 14(1):57–71, 1994. → pages 8
[23] C. Scherer, P. Gahinet, and M. Chilali. Multiobjective output-feedback control via LMI optimization. Automatic Control, IEEE Transactions on, 42(7):896–911, 1997. → pages 5
[24] T. Shiratori, B. Coley, R. Cham, and J. K. Hodgins. Simulating balance recovery responses to trips based on biomechanical principles. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’09, pages 37–46, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-610-6. → pages 8
[25] V. Syrmos, C. Abdallah, P. Dorato, and K. Grigoriadis.
Static output feedback: A survey, 1997. → pages 5
[26] R. Tedrake. Stochastic policy gradient reinforcement learning on a simple 3D biped. In Proc. of the 10th Int. Conf. on Intelligent Robots and Systems, pages 2849–2854, 2004. → pages 8
[27] A. Treuille, A. Lewis, and Z. Popović. Model reduction for real-time fluids. ACM Trans. Graph., 25:826–834, July 2006. ISSN 0730-0301. → pages 4
[28] A. Trofino-Neto and V. Kucera. Stabilization via static output feedback. Automatic Control, IEEE Transactions on, 38(5):764–765, 1993. → pages 5
[29] Y.-Y. Tsai, W.-C. Lin, K. B. Cheng, J. Lee, and T.-Y. Lee. Real-time physics-based 3D biped character animation using an inverted pendulum model. IEEE Transactions on Visualization and Computer Graphics, 16:325–337, March 2010. ISSN 1077-2626. → pages 6
[30] V. Veselý. Static output feedback controller design. Kybernetika, 37(2):205–221, 2001. → pages 5
[31] J. M. Wang, D. J. Fleet, and A. Hertzmann. Optimizing walking controllers for uncertain inputs and environments. ACM Trans. Graph., 29:73:1–73:8, July 2010. ISSN 0730-0301. → pages 8
[32] M. Wicke, M. Stanton, and A. Treuille. Modular bases for fluid dynamics. ACM Trans. Graph., 28:39:1–39:8, July 2009. ISSN 0730-0301. → pages 4
[33] Wikipedia. CMA-ES, 2011. URL http://en.wikipedia.org/wiki/CMA-ES. → pages vi, 14
[34] C.-C. Wu and V. Zordan. Goal-directed stepping with momentum control. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’10, pages 113–118, Aire-la-Ville, Switzerland, 2010. Eurographics Association. → pages 7
[35] H. Y.-P. and R. Kosut. Optimal low-order controller design via LQG-like parametrization. Decision and Control, 1992., Proceedings of the 31st IEEE Conference on, 1:1099–1104, 1992. → pages 5
[36] Y. Ye and C. K. Liu. Optimal feedback control for character animation using an abstract model. ACM Trans. Graph., 29:74:1–74:9, July 2010. ISSN 0730-0301. → pages 7
[37] K. Yin, K. Loken, and M.
van de Panne. Simbicon: Simple biped locomotion control. ACM Trans. Graph., 26(3):Article 105, 2007. → pages vi, 7, 27