SHAPING AND POLICY SEARCH FOR NEAREST-NEIGHBOUR CONTROL POLICIES WITH APPLICATIONS TO VEHICLE STEERING by K E N A L T O N B.Sc , The University of Calgary, 2000 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF M A S T E R OF SCIENCE in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) T H E U N I V E R S I T Y OF BRITISH C O L U M B I A December, 2004 © K E N A L T O N , 2004 Abstract ii Abstract The graceful animal motion we see in nature has proven extremely difficult to reproduce algorith-mically. There is a need for further research into motion control techniques to address this problem adequately for computer animation and robotics applications. In this thesis, we describe a novel method for the synthesis of compact control policies for a variety of motion control problems. Direct policy search is applied to a nearest-neighbour control policy, which uses a Voronoi cell discretization of the observable state space, as induced by a set of control nodes located in this space. Such a semi-parametric representation allows for policy refinement through the adaptive addition of nodes. Thus, a coarse-to-fine policy search can be performed in such a way that the problem is shaped for easy learning. We apply our method to developing policies for steering various vehicles around a winding track. In particular, a policy is learned to steer a double-trailer truck backwards, a problem of considerable difficulty given the instability of the trailers. We also generate policies for the problems of balancing an inverted pendulum on a moving cart and driving the state of a point mass around a user-specified limit cycle. We examine these motion control results critically and discuss the limitations of the nearest-neighbour policy representation and our policy search method. Table of Contents . iii Table of Contents Abstract 1 1 Table of Contents i i i List of Tables v i List of Figures . , vii Acknowledgements . . . i x 1 Introduction 1 1.1 Motivation 1 1.1.1 Understanding Motion Control 1 1.1.2 Applications of Motion Control 2 1.2 Goals and Scope 3 1.3 Assumptions 4 1.3.1 Reactive Control 4 1.3.2 Control in Simulation 4 1.4 Reinforcement Learning Framework 5 1.4.1 R L and Other Frameworks 6 1.4.2 Challenges in R L 7 1.5 Policy Search 9 1.6 Contributions 1 0 1.7 Thesis Outline 10 2 P O M D P and Policy Search Framework 12 2.1 P O M D P s and Reinforcement Learning 12 2.1.1 Policies and State Trajectories • • 14 2.1.2 Reward and R L Problems 15 2.2 Value-based Methods and Limitations 15 2.2.1 Dynamic Programming 15 2.2.2 Standard Value-based Methods 16 2.2.3 State and Action Discretization 17 2.2.4 Value Function Approximation 17 2.2.5 Partial Observability 19 2.3 Policy Search 19 2.3.1 Previous Work 20 2.3.2 Policy Representation 22 2.3.3 Evaluation 23 2.3.4 Optimization 24 2.3.5 Shaping 26 2.4 Summary 30 Table of Contents iv 2.4.1 Relevance of Framework 30 2.4.2 Related Work .31 Nearest-Neighbour Control Policies 33 3.1 Policy Definition 33 3.2 Related Work 34 3.3 A 2D Example 35 3.3.1 Problem Formulation 36 3.3.2 Policy Evaluation 37 3.3.3 Policy Search 39 3.3.4 Results 40 3.4 Summary 40 Cart and Pole 41 4.1 Problem Overview 41 4.2 Related Work and Problem Variations 42 4.3 Problem Formulation 43 4.4 Policy Evaluation 44 4.5 Policy Search 46 4.6 Results 48 4.7 Discussion and Future Work 51 4.7.1 A 2-Node Policy 52 4.8 Interactive Extensions 54 Vehicle Steering 55 5.1 Related Work 57 5.2 Problem Formulation 59 5.2.1 Vehicle Tracks 61 5.2.2 Policy Representation 63 5.3 Policy Evaluation 65 5.4 Policy Search 67 5.5 Results 70 5.5.1 Backwards Driving 71 5.5.2 Forwards Driving 79 5.6 Summary and Future Work 80 Discussion 83 6.1 Nearest-Neighbour Policy Representation 83 6.2 Policy Search Method 90 Conclusion 92 A 2D Example: Time-Based Policy 94 A . l Constructing a Time-Based Policy 94 A. 2 Bootstrapping a Nearest-Neighbour Policy 95 Cart and Pole: Additional Information 97 B. l Equations of Motion 97 B.2 Perturbation Algorithm • • • 97 Table of Contents v C Vehicle Steering: Additional Information 99 C . l Left-Right Symmetry Proof for Steering Policy 99 C.2 Equations of Motion 100 C.3 Perturbation Algorithm 101 Bibliography 104 List of Tables vi List of Tables 4.1 Paramaters and results for cart and pole problem 49 5.1 Our vehicle steering problem versus car control for RARS 58 5.2 Paramaters and results for vehicle steering problems 82 B . l Constants used for the cart and pole perturbation algorithm. 98 List of Figures vii List of Figures 1.1 Agent observing and acting in an environment 6 2.1 Evolution of a state trajectory 14 2.2 Comparison of a poorly-shaped and well-shaped optimization 27 2.3 Progressive refinement using a series of optimizations 29 3.1 A n abstract diagram of a nearest-neighbour policy. 34 3.2 Nearest-neighbour policy for 2D trajectory tracking problem 36 3.3 Dynamics for control of a point mass in one dimension 37 3.4 A P O M D P reward function for the trajectory tracking problem. 38 3.5 Policy evaluation for the 2D trajectory tracking problem 39 4.1 Setup of the cart and pole problem 43 4.2 Dynamics for the cart and pole system 44 4.3 Policy performance during the cart-pole policy search 48 4.4 Example state evolution for the cart and pole system 50 4.5 Controllable states for the cart and pole system 51 4.6 Trajectory following the Voronoi cell boundary to the goal region 53 5.1 Steering one and two trailer trucks backwards 55 5.2 Jack-knifing of backwards-driving trucks 56 5.3 Truck sensory information 59 5.4 Steering Angle versus Simulation Time Step for vehicle steering 60 5.5 Dynamics for the vehicle simulation 61 5.6 A n example of a complete track for backwards truck-steering 62 5.7 Tracks from 3 different track classes 63 5.8 Symmetric observations of the truck policy. 64 5.9 Distance measured along the center of the track. 65 5.10 Initial states for truck backing-up .' 67 5.11 Options for perturbing policy parameters 70 5.12 Truck dimensions for the single-trailer truck backing problem 71 5.13 Steering a one-trailer truck backwards 72 5.14 Policy performance versus function evaluations for truck 73 5.15 Driving a truck backwards in unexpected situations 74 5.16 Driving a two-trailer truck backwards 75 5.17 Difficult initial states for two-trailer truck 76 5.18 Car driving backwards on an unfamiliar track 77 5.19 Car driving backwards on two variable tracks 78 5.20 Car driving backwards in unexpected situations 78 5.21 Car driving forwards on a narrow track 79 5.22 Car driving forwards on a wide track 79 List of Figures viii 5.23 Truck driving forwards on an unfamiliar track 80 6.1 Poor nearest-neighbour approximation of a linear function 84 6.2 Poor nearest-neighbour approximation of a discrete function 85 6.3 Chattering between the opposing actions of Voronoi cells 86 6.4 Stretched Voronoi cells with unnormalized sensory variables 88 6.5 Nearest-neighbour node positions for a walking trajectory 89 A . l A time-based policy approximating a desired limit cycle 95 C . l Motion of the truck during a simulation time step 101 Acknowledgements ix Acknowledgements I would like to thank Michiel van de Panne for ideas that inspired this work, valuable criticism, and enthusiasm. Michiel's thoughts and encouragement were instrumental from the beginning of this thesis through to its completion. Ian Mitchell has provided much insightful feedback about the work, for which I am grateful. His comments have been helpful in pointing out some of the weaknesses of our method and possible alternatives. This knowledge will prove valuable in extending and improving on the ideas presented in this thesis. Also, I would like to acknowledge Dana Sharon for interesting discussions about our closely related work. She has shown that a similar policy search method can be used to learn policies for planar biped walking, which I find very exciting. Finally, I am thankful for the encouragement and support of my friends and family, who all helped to restore balance to my life during the most stressful and discouraging times of this research. Particularly, I want to thank Zorana for helping me keep in mind what is important in life and my parents for their loving long-distance support. Much of this work has been supported by a NSERC PGS-A scholarship. Chapter 1. Introduction 1 C h a p t e r 1 Introduction How are humans and animals able to move through complex environments with seeming effortlessness? This is a problem that has received much attention in the fields of biomechanics, computer animation, and robotics, yet it remains very much unsolved. This thesis looks at learning automatic controllers for several specific motion control problems. The intention is that this research will shed light on issues in motion control and take us a step further towards algorithmically reproducing the graceful motion that is evident in humans and animals. In this introduction, we discuss compelling reasons to study motion control and present the major goals of our work. We also describe the assumptions that are made regarding the type of motion control considered in this thesis and justify these assumptions. We briefly introduce the reinforcement learning framework and discuss some of its major chal-lenges. Next, policy search is considered as a practical approach for solving motion control problems. This is the approach that is taken in our work. Finally, we summarize the contributions of our work and present an outline of the rest of the thesis. 1.1 Motivation 1.1.1 U n d e r s t a n d i n g M o t i o n C o n t r o l The movement of animals and people can seem so natural and effortless that this type of motion control is usually taken for granted. Everyday we see many such examples: a man walking his dog, a crow flying overhead and settling down on the edge of a rooftop, or a squirrel scurrying up the bark of a tree. When we see that squirrel perform some fearless acrobatic routine across the branches of the tree, we may pause and wonder at its incredible agility. However, few casual observers actually attempt to understand how such graceful animal motion is produced. Researchers in the field of biomechanics are an exception. They study, among other things, the neuro-muscular activity that produces animal motion. Also, researchers in computer animation and robotics may investigate the biomechanics of animal motion for the purpose of reproducing it. Alternatively, one can attempt to synthesize animal-like motion without necessarily understanding all the details of the neuro-muscular systems that produce this motion in nature. Computer animators Chapter 1. Introduction 2 and robotics engineers are mostly concerned with results. If satisfactory motion results can be obtained with a simplified model then why complicate it with physiological details? For example, in this thesis we synthesize steering controllers for various vehicles in simulation. We would like the motion of the vehicles to appear as if they are driven by a person, even though that person is not visible in the results. However, we do not attempt to model the biomechanics of the driver in order to produce realistic steering motions. Of course, in this case, the subtleties of the human motion are hidden. If the goal is to synthesize a specific animal motion, like canine walking, it becomes more compelling to model the biomechanics in order to capture all the nuances. We shall use the term motion control to mean the control of a physical system, simulated or real, to produce a desired type of motion. Since we are more concerned with understanding principles of general motion control rather than the details of a particular biological implementation, we synthesize motion using an abstract control representation. In this way, we hope to gain insight into general motion control, which could be applied to reproducing animal motions for animation or robotics or to synthesizing entirely new types of motion. 1.1.2 A p p l i c a t i o n s o f M o t i o n C o n t r o l Motion control is currently applied extensively in robotics. For industrial processes, robot arms and manipulators are common, and their movement must be controlled in a safe and purposeful manner. Also, a great deal of research has gone into the engineering of mobile robots. Many of the most difficult control problems in robotics involve imitating human and animal motion. State of the art humanoid robots are capable of walking, but in a manner that is, at present, clumsy in comparison to humans. A better understanding of motion control is imperative in creating robots that move in a more graceful and robust fashion. This is true whether we are trying to reproduce human or animal motion in a robot or designing entirely new ways for robots to move and manipulate their environments. In a similar vein, human prosthetics are also a sensible application for motion control. The human body is an incredibly tuned motion control system. Artificial replacement parts should be carefully designed using both biomechanics and principles of general motion control. Prosthetics should integrate as seamlessly as possible with the body in both physical appearance and motion properties. In computer animation, motion control has much potential. Initially, the motion of characters was designed entirely by human animators. Recently, motion capture techniques have allowed the motion of real human actors to be recorded and transferred to animated characters. Once the data acquired by motion capture is available, it can be used to quickly produce realistic animations. However, Chapter 1. Introduction 3 producing the variety of motion seen in the real world requires an enormous amount of data, the acquisition of which is expensive and time-consuming. Also, motion captured from a particular actor cannot easily be applied to a character of different proportions or morphology. Most importantly, current motion capture methods cannot be used to animate the complex and often unpredictable interaction of a character with elements of its environment. A promising alternative is to use motion control of a virtual character with a physical simulation of its body and environment. The physical simulation of fluids and rigid bodies for animation has been very successful. However, moving a character with the agility of a human or with the grace and style of a hand-animated character requires great improvement in the motion control techniques available. 1.2 Goals and Scope The long-term goal of our research is to form a better understanding of general motion control through the development of a principled methodology for synthesizing motion controllers. Such a methodology could provide great benefit in application areas such as animation and robotics. Toward this end, this work focuses on automatically learning reactive controllers for a limited class of problems in simulation, including the classical cart and pole problem and the steering of cars and trucks. Of course, motion control is simply a type of control, which includes other types, such as the control of chemical processes. Our work focuses on motion control, although the benefits of the knowledge obtained may naturally carry over into other areas. The representation of a motion controller is important and we want to gain insight into desirable properties and key trade-offs in the choice of representation. In this work, we test and evaluate a particular representation, the nearest-neighbour control policy, to help accomplish this. Also, guiding or shaping of the learning process has always been an important aspect in the training of a human or animal to perform a task. For example, when an athlete learns to perform a difficult physical manoeuvre, he typically learns in stages of refinement. First, the crude motion is learned; then, it is carefully tailored, often with the guidance of an coach. A goal of ours is to investigate shaping the problem of learning a motion controller, which we believe is necessary to make high-dimensional control problems tractable. For this reason, this thesis looks at the progressive refinement of a controller and other shaping techniques for motion control. Chapter 1. Introduction 4 1.3 Assumptions 1.3.1 R e a c t i v e C o n t r o l In this thesis, we study reactive motion control. In reactive control, actions are determined solely by the current observation of the system state. In other words, a reactive controller has no memory of past observations or temporary plans that affect its actions. Of course, the current observation may be ambiguous, in which case consideration of the context of the observation may be needed to take optimal actions. However, the complexity of control tasks that can be handled by a purely reactive controller is surprising. For instance, we demonstrate the reactive control of a simulated double-trailer truck driving backwards along a winding track with only local sensory information. Also, reactive control is particularly suited to unpredictable environments, such as flying a simulated spaceship through an asteroid field. At some point it becomes inefficient to make long-term plans when they are constantly being rendered useless by unpredictable events. In arriving at an understanding of general motion control, it is valuable to study reactive control, given its conceptually simple nature. 1.3.2 C o n t r o l i n S i m u l a t i o n We restrict our investigation to motion control in simulated systems. It is easier to justify studying motion control in the real world because of the obvious tangible benefits, such as industrial robots. However, there are a number of reasons why motion control in simulation is an important topic of inquiry, some of which are listed here: 1. It is far less costly and less risky to experiment with motion control in simulation than to experiment with actual robots. 2. Simulation affords some luxuries that are not possible in the real world. For example, it is often possible to invert the dynamics of a simulation and move backwards in time, allowing for a bidirectional search to connect an initial state with a goal state. 3. Once they are adequately developed, techniques for designing controllers in simulation can often be used to build robots in the real world. 4. Motion controllers can be ported to the real world if the dynamic simulation is sufficiently realistic. 5. For a robot to predict and plan in a world that contains dynamic people, animals, and other robots, it would benefit from a simulated model of how these other entities move. Chapter 1. Introduction 5 6. The motion control of virtual animals and humans in simulated environments can be used for animation. 1.4 Reinforcement Learning Framework We look at motion control problems in a reinforcement learning (RL) framework. A particular motion control problem is specified using the formalism of a partially observable Markov decision process (POMDP) . In this section, we briefly describe the R L framework and explain our reasons for choosing it among other possible frameworks. The R L framework and the P O M D P are further formalized in Chapter 2. Motion control can be viewed as a sequence of action choices with potentially delayed consequences in an uncertain dynamic environment. The P O M D P , which originated in operations research and was introduced to the A l community by [CKL94], is a way of formalizing these types of sequential decision-making problems. A P O M D P describes the relationship between an agent and its environment, like that illustrated in Figure 1.1. It specifies what aspects of the system state can be observed by the agent, what sort of actions the agent is permitted to take, and how those actions affect the evolution of the state. It also specifies the amount of reward that is received by the agent when it takes an action to transition between states. In this way, the goal of the agent is defined, since its objective is to obtain as much reward as possible. Dynamic systems with discrete or continuous state can be described using the P O M D P formalism. The state can be partially or completely observable to the agent. Note that partial observability means there is a limited and perhaps noisy view of the state of the system, analogous to viewing the world through a dirty windshield. This assumption makes these problems considerably more difficult and also much more realistic. In the rest of the introductory discussion, we ignore the complication of partial observability. A control policy, or simply policy1, is a set of behavioural rules for the agent. It tells the agent which action it should take in any particular state. The goal in reinforcement learning is to learn the optimal policy, which is the policy that chooses the best action in any given state. The best action is the one that allows the agent to receive the most reward over time. We define a reinforcement learning problem (RL problem)2 to be a problem that can be framed in terms of finding the optimal policy for a specified P O M D P 3 . 1In the RL framework, policy is an appropriate synonym for controller and both terms are used interchangeably in this thesis. 2 In the RL framework, RL problem is an appropriate synonym for control problem and both terms are used inter-changeably in this thesis. 3Note that in [SB98] the reinforcement learning problem is a general framing of "the problem of learning from Chapter 1. Introduction 6 action Figure 1.1: Agent observing and acting in an environment. Standard methods to solve R L problems include value iteration and policy iteration, which are based on the principals of dynamic programming and require complete knowledge of the environment dynamics. However, other methods such as temporal difference learning and Q-learning do not require an explicit model of the environment. A l l of these methods try to find or approximate the optimal policy indirectly by computing an optimal state-value function (or, alternatively, action-value function), which assigns a value to each state (or each state-action pair). Any method that is used to solve R L problems by computing a state-value or action-value function we call a value-based method. In general, we use the term RL method to refer to any method that is used to solve R L problems, not restricted to value-based methods. 1.4.1 R L a n d O t h e r F r a m e w o r k s It is important to note that there are other standard frameworks, such as control theory [Oga70] and kinodynamic planning [HKLR02], for looking at motion control problems. Control theory techniques are often used when a target state trajectory is given and the problem is to track that trajectory. This is accomplished by minimizing the error between the current state and the desired state on the target trajectory. A serious limitation is that many of the existing mathematical techniques in this framework assume that the system dynamics are linear and time-invariant. Motion control problems often have complex nonlinear and discontinuous dynamics. In particular, this is the case when motion may be jarred by contact with other objects, like the contact of a character's foot interaction to achieve a goal" using the Markov decision process, a fully-observable version of the POMDP. Chapter 1. Introduction 7 on the hard ground. There have been many practical successes based on control theory, especially in controlling chemical and mechanical industrial processes. However, there is a large gap between the state of the art in control theory and that which is necessary to create graceful and physically-realistic character animation, for example. Furthermore, with respect to trajectory tracking, this framework often does not consider how the trajectory is obtained and how geometric constraints in the environment influence it. We might also look at motion control from a kinodynamic planning (KP) perspective [HKLR02]. In this framework, an initial state and a goal state are given and the problem is to plan a trajectory that connects the two states, subject to the kinematic and dynamic constraints of the system. K P can be used to complement the control theory framework. A desired trajectory can be obtained using K P and then a control theory technique can be used to track the trajectory. However, K P is limited in that it assumes the problem can be specified using an initial and goal state, or perhaps a sequence of states. Frameworks like control theory and K P have produced useful tools in robotics engineering. The motion system for a robot is typically designed using several subsystems, each solving a particular problem. For example, the system may be composed of a high-level task planner, a kinodynamic planner, a trajectory tracker, and perhaps a sensory subsystem. Each of these subsystems can be designed within a separate framework. Reinforcement Learning is an extremely general framework. As is pointed out in [SB98], it can be used to specify a large variety of problems, including those of different robotic subsystems. Path planning, trajectory tracking, and high-level task planning can all be cast as problems of learning a policy that obtains the maximum reward over time. In this thesis, under the framework of R L , we solve problems as diverse as following a specified limit-cycle trajectory for a point mass in one dimension and steering a truck backwards on an unseen track. We frame motion control problems using R L because of its generality and conceptual simplicity. 1.4.2 Cha l l enges i n R L The R L framework is general and intuitive but it still faces some major challenges. For one, the optimal policy may be extremely difficult or even impossible to learn in large or infinite state spaces. This is a serious issue, since most interesting problems, such as humanoid walking, have many con-tinuous state variables. In these cases it is necessary to learn an approximation of the optimal policy. This can be accomplished by using a priori knowledge to limit the region of the state space that is explored and otherwise limit the set of policies considered. To learn an approximate optimal policy we might evaluate policy actions from a finite sample Chapter 1. Introduction 8 of states in a potentially infinite state space. If the state sampling is representative of all states of concern and the policy generalizes to these states, then hopefully a good policy for the sample is good in general. It would be advantageous to sample states for evaluation by their importance for determining the optimal policy. For example, many states in the state space can never be visited or would only be visited by very poor policies so there is no need to waste computation to determine what action should be taken in these states. This can be accomplished in part by simulating state trajectories using the current approximation of the optimal policy during an iterative scheme, as is suggested in [SB98] and [BT96]. It is also useful to apply a priori knowledge in choosing the initial states for simulation. Furthermore, in some states, the choice of action may be irrelevant to the outcome of the system. For example, once a walking robot has fallen over, any possible action may prove futile. It is inefficient to explore all possible actions in states where those actions are equally ineffective. Expert knowledge can often be used to limit the part of the state space that is explored to that which is relevant and to give preference when sampling evaluation states to those where control has a large impact on the outcome. In infinite state and action spaces it is also necessary to limit the set of policies considered by choosing a certain representation or parameterization of this set. In effect, this choice imposes a structure on the considered policies, so that solutions must conform to a certain policy type. There are two conflicting goals in this choice of parameterization. We want the parameterized set to include some policies that are close enough to optimal so that we will be satisfied with their performance. However, we also want the structure of this set of policies to be small and simple enough that it is easy to find one of these satisfactory policies within it. This choice typically requires some specific knowledge about the structure of the R L problem and a good idea about what type of policy would be useful in solving it. Of course, if an expert is capable of narrowing the set of potential policies down to a single candidate, then no search of this set is required and we are left with a hand-engineered solution. Another challenge is to specify rewards in order to elicit a desired agent behaviour. In the R L framework, much of the difficulty in solving a specific problem, is delegated to the task of choosing a reward scheme. The P O M D P , and particularly the reward scheme, must be chosen carefully, so that the resulting optimal policy corresponds to agent behaviour that is desired. Furthermore, the choice of rewards can be used to ease the process of learning the optimal policy. For example, [Ng03] describes shaping rewards, which are modifications to the reward scheme that can accelerate learning, yet leave the optimal policy unchanged. In this thesis, we do not specifically examine the choice of rewards. However, we do consider the closely related topic of shaping the policy search through the Chapter 1. Introduction 9 choice of policy evaluation function and initial policy parameters and by staging the learning process. Many successful solutions to practical R L problems require expert knowledge to narrow down the search of possible policies and to shape the learning process. One way of doing this is to conduct a policy search using shaping techniques, as is discussed next. 1.5 Policy Search In this thesis, we use policy search to compute the policies of several motion control problems. In policy search, a limited set of policies is defined and that set is searched directly for an optimal policy 4. Any method that uses policy search to solve an R L problem we call a policy search method. These are contrasted with value-based methods [Lar68, BT96, SB98], which use value functions to determine the optimal policy. Policy search can be much more efficient than value-based methods if the set of policies can be narrowed considerably using a priori knowledge of the problem. A successful practical example of policy search is in [Ng03]5, where the authors devise a rough control program to fly a helicopter with some parameters left undetermined. The various forms of this control program that arise from different parameter settings comprise the set of considered policies. Policy search is then used to find the optimal policy within the set, or in other words to learn the optimal parameters for the control program. This example is indicative of how policy search is a compromise between hand-engineering and learning a policy 6 . Hand-engineering is used to create a general policy structure and the policy search fills in the details by learning the optimal parameters for the policy structure. To effectively solve a motion control problem using policy search it is important to choose: 1. A policy representation that is adequate for the problem and yet as simple as possible. 2. An observable state representation that highlights features of the state that are pertinent to the motion control task. 3. An evaluation method for policies such that the policy score is indicative of the quality of the solution to the problem. 4. Initial policy parameters that provide a good starting point for the policy search. A l l of these choices guide policy learning through the use of a priori knowledge to impose a structure on the policy search. Any technique that is used to shape the search in this way, we refer to as a 4 The optimal policy within the set that may not correspond to the overall optimal policy. 5 This example is discussed, further in Section 2.3.1. 6Because learning optimal policy parameters and conducting a search for an optimal policy usually mean the same thing, we use the phrases searching for a policy and learning a policy interchangeably. Chapter 1. Introduction 10 shaping technique. The choice of policy representation, observable state representation, evaluation function, and shaping techniques in general are further explained in Chapter 2. 1.6 Contributions We demonstrate that policy search applied to a nearest-neighbour policy representation (NNPR) can be used to solve a variety of motion control problems. This representation's semi-parametric character lends itself to adaptive policy refinement, and we show how this can be used as an effective technique for shaping the policy search. We illustrate the use of other important shaping techniques, including the choice of evaluation function and observable state representation. The flexibility of the nearest-neighbour representation is conveyed through the range of motion control problems solved in this thesis. These include the classical cart and pole problem and the steering of cars and trucks down a winding track using local environment sensing. In one example, we demonstrate the steering of a truck with two trailers backwards along the track, which is of particular interest because of the considerable instability of the trailers. In [NW89], the authors solve a related problem using self-learning of a neural network policy to steer a single-trailer truck backwards towards a fixed goal state. Also, T D learning has been used to learn policies that drive a car forwards down a specific winding track in the context of R A R S , the Robot Auto Racing Simulator [Cou02, PH98]. The steering problem we solve is considerably different than these R A R S examples in that we use a fixed vehicle velocity and only local track observations, we handle the steering of unstable systems, and perhaps most significantly, our learned policies are shown to generalize to different tracks of the same style as the training track. We also look at the problem of driving the state of a one-dimensional point mass around a user-specified limit cycle in a robust fashion. Almost all types of locomotion are cyclic in nature, so it is important that a controller be able to follow a desired limit cycle robustly in the presence of perturbations, such as wind gusts or terrain variations. For example, [Sha04] describes the use of a nearest-neighbour policy to control planar biped walking on variable terrain with the variations in ground level being unanticipated. 1.7 Thesis Outline In Chapter 2, we formalize the P O M D P and policy search framework in which we will be viewing motion control problems. In Chapter 3, we define a nearest-neighbour policy, which we shall use as our policy representation and we provide the example of using such a policy to drive the state of a point-mass around a limit cycle. Moreover, we discuss specific shaping techniques that are used Chapter 1. Introduction 11 in our work, including adaptive refinement of the policy. The application of the nearest-neighbour representation to the control of a cart and pole system is described in Chapter 4. We apply our method to the steering of cars and trucks in Chapter 5. Chapter 6 takes a critical look at the benefits and drawbacks of the nearest-neighbour representation and our policy search method. The work is concluded and future directions laid out in Chapter 7. Related work is distributed throughout the thesis in order to locate it directly with the appropriate related discussions. Theoretical works on POMDPs , reinforcement learning, and policy search are referenced in Chapter 2. References specific to the problem of cart and pole are included in Chapter 4 and those for vehicle steering problems are in Chapter 5. Chapter 2. POMDP and Policy Search Framework 12 C h a p t e r 2 POMDP and Policy Search Framework It is useful to have a consistent language in which to talk about motion control problems. We use the P O M D P as a way of formalizing such problems. Also, we take a particular approach to solving R L problems, namely policy search. This chapter describes this perspective of viewing and attacking problems in motion control. In Section 2.1, we formalize the P O M D P and explain R L problems in further detail. Value-based R L methods and their limitations are examined in Section 2.2. R L problems are framed from a policy search perspective in Section 2.3. We go over previous work in solving R L problems, and particularly motion control problems, using policy search. We look at the important choices of policy representation, evaluation metric, and optimization technique. Also, shaping and how it can be used to aid policy learning is discussed. Finally, in Section 2.4, we summarize the discussion of the P O M D P and policy search framework and outline which aspects and previous work are most relevant to our work in motion control. 2.1 POMDPs and Reinforcement Learning The partially observable Markov decision process (POMDP) is a useful way of formalizing sequential decision making with delayed consequences in'uncertain dynamic environments. We use the P O M D P as a way of framing various motion control problems in a consistent manner, such that the com-monalities and distinctions between the problems are highlighted. A P O M D P describes the states which an agent can take on within its environment, the aspects of its state which it can observe, and actions it can take to affect its state. It also describes how the agent gets rewarded for changes in its state and how important future rewards are to current action decisions. A P O M D P is defined as {S, O, A, v, w, r, 7 ) , where: • S is the set of states, • O is the set of observations, • A is the set of actions, Chapter 2. POMDP and Policy Search Framework 13 • v . S x A —* 5 is the state transition function, • uj : S —• O is the observation function, • r : 5 x 4 x 5 - > R i s the reward function, • 7 € [0,1] is the discount factor. This definition is generally consistent with a standard P O M D P definition [MPKK99]. However, we consider only deterministic P O M D P s , while in [MPKK99] and other definitions [Ng03, CKL94], v and u> are probabilistic 1. For all of our control problems, S is a continuous state space, i.e., S = Mds, where ds is the dimension of the state space. Let s G 5 be a state, represented by a d s-vector of real numbers. Similarly, O is a continuous observation space, i.e., O = R d ° , where d0 is the dimension of the observation space. Let o € O be an observation represented by a dG-vector of real numbers. Also, A C Rda, with da being the dimension of the action space, as the control problems described in this work have one or more control variables with a range of real values. Let a £ A be an action represented by a dQ-vector of real numbers. We use the notations st, ot, and at to refer to the state, observable state, and action respectively at a discrete time step t 2 . The state transition function, v, describes how the state of the system will evolve, given the input of certain actions, by mapping a state, st, and action, at, to a successor state, St+i- As a window onto the state of the system, the observation function, u> : S —* O, describes what can be observed about the state. If the state is completely observable, O = S and OJ = I is the identity function. The reward function, r, maps a state, st, and the action taken at that state, at, to a real-valued reward. We take rt = r(st, at) to be the reward received at time step t 3 . The reward function rewards good choices of action and punishes bad ones indirectly (and not always immediately), through the transitions in system state that are achieved by a sequence of actions. In a P O M D P , as in real life, gratification is not always instant and actions at one time can affect the state of the system such that the repercussions are felt only considerably later. Finally, 7 determines how important rewards obtained in the future are to choosing actions now, in a way that will be formalized shortly. 1 Probabilistic state transition and observation functions are normally used to model environments with noisy ob-servation and action, which are typical conditions for real-world robots. 2 We make the assumption that time is discretized, which is reasonable considering the problems we consider involve simulation on a digital computer. 3Note that, in some P O M D P definitions, rt is based only on st (e.g. [Ng03]). Chapter 2. POMDP and Policy Search Framework 14 2.1.1 Policies and State Trajectories A P O M D P describes the environment in which an agent acts. This agent learns some rules of behaviour, called a policy, which tell it exactly how to act in each possible situation. With POMDPs , an observation, ot, may not give the policy enough information to discern exactly what is the current state of the system, st. Of course, st is often what is needed by a policy to determine the optimal action required to drive the state of the system towards that which is desired. For this reason, a policy might put its current observation, ot, into context by considering the history of all observations oi,... ,ot in determining the current action, at- To accomplish this, a policy would require a memory in which to store and manipulate (e.g., consolidate) past observations. A policy with a memory is capable of having temporary goals and is sometimes called a purposeful policy. On the contrary, a policy without a memory must simply decide on the action, at, it will take based only on the current observation, ot- Such a policy could be considered to have a purpose or goals but they are permanently inscribed in its rules of behaviour. Policies of this type are called reactive or memoryless policies. time step (-/ i t Figure 2.1: Evolution of a state trajectory. In this thesis, we deal with reactive policies. The symbol IT refers to both a policy and its characteristic mapping of observation to action. For example, 7r : O —• A, describes how the policy 7r responds to each possible observable state, o. Let II be the set of all such policies. When a policy acts within a P O M D P the system evolves over sequence of states, {s\,si), called a state trajectory, or simply trajectory. This trajectory is determined by the following recursion: _ f Sinit if t = 1 S t - \ v(st-u«(u>{st-i))) if * > 1, where Smit is the initial state. Figure 2.1 illustrates how a state trajectory evolves. It is useful to notice that a trajectory is entirely determined given Sinu, v, u>, and n. Chapter 2. POMDP and Policy Search Framework 15 2.1.2 R e w a r d and R L P r o b l e m s The reward function, r : S x A —> K, allows us to evaluate policies, and therefore to determine which policies are optimal. It specifies the reward that is received by a policy when an action is taken in a particular state. Typically, the P O M D P (and, in particular, the reward function, r) is framed so as to elicit an optimal policy that results in a desired behaviour of the entire system. A policy is optimal if, from any initial state, it receives as much discounted reward as possible, summed over the resulting trajectory. This cumulative discounted reward is determined by the policy 7r and the initial state s and is used to define a state-value function for the policy, oo V*(a) = Y,1t-lru t=i where 7 6 [0,1] and rt is the reward received at time step t when policy TT is executed from the initial state s. Note that the smaller 7 is, the more discounted future rewards are, and the more short-sighted optimal policies will be in determining actions. Choosing the reward function, r, and the discount factor, 7 , that correspond to a desired system behaviour is certainly not trivial in many cases and this continues to be an open area of R L research. In reinforcement learning, we take a hedonistic viewpoint by trying to maximize discounted reward received over time. In other words, we try to find a policy TT*, such that V*'(s) = maxV*(s), (2.1) 71-en for all states, s. The objective of a reinforcement learning problem is to find an optimal policy, ir*, or some near-optimal approximation of ir*. Note that it is usually impractical or impossible to search through all policies to find one that is optimal, since | I I | = |j4|I°L 2.2 Value-based Methods and Limitations Most value-based methods were developed for MDPs, which are the subset of P O M D P s with complete observation, i.e, with u> being the identity function. For the discussion of value-based methods, we will initially assume that we are dealing only with MDPs. A comprehensive description of value-based methods for general (including nondeterministic and deterministic) MDPs can be found in [SB98, BT96, KLM96]. 2.2.1 D y n a m i c P r o g r a m m i n g Much of reinforcement learning theory is based on the principles of dynamic programming [Bel57], which exploit the structure of an M D P to find IT* more efficiently than through an exhaustive search Chapter 2. POMDP and Policy Search Framework 16 of n . For MDPs with finite state and action spaces, an optimal policy, it*, can be found indirectly by solving the Bellman optimality equation for V*, of which the following is one of a number of equivalent forms: V*{s) = msxMs,a)+V*(s')), (2.2) where s' is the state reached by taking action a from state s. The unique solution gives the optimal state-value function, V*, which is equivalent to the solution of Equation 2.1, V" . There may be more than one policy, 7r*, that is optimal, but they share the same optimal state-value function. Once V* is available, an optimal policy can be denned by choosing actions greedily to achieve as much cumulative discounted reward as possible: 7T*(s) = argmaxa€A(r(s,a) + V*(s')) 2.2.2 S t a n d a r d Va lue -based M e t h o d s One standard method for solving Equation 2.2 to obtain V* is the value iteration algorithm, which makes successive improved approximations of V* and has proven guarantees of convergence. Another method is policy iteration, which manipulates policies directly by evaluating the current policy ir to obtain Vv and determining the next improved policy by choosing actions greedily based on V. Policy iteration terminates after at most \A\\S\ iterations. Value iteration and policy iteration assume that a model of the M D P is available, i.e, that the transition function, v, and the reward function, r, are known. However, in many R L problems, a policy must be learned without direct access to an M D P model, but only through the experience of choosing sequences of actions and taking note of the resulting rewards. The TD(X) algorithm can be used for policy evaluation (i.e., to find Vw) without a model by updating the values of states in experienced trajectories. The function V* can be estimated by always using the policy that is greedy with respect to the current value function estimate to compute the trajectories. Similarly, Q-learning can be used to learn the optimal action-value function, Q*(s,a), through updates to action-values during actual experience. The function Q*(s,a) assigns values to state-action pairs by specifying the cumulative discounted reward for taking action a from state s and taking optimal actions afterwards. Q-learning and other temporal difference (TD) methods can be contrasted with value and policy iteration, which require a model and update the value function with full sweeps of the state space. Al l of these methods can be considered generalized policy iteration, which includes any method that alternates between improving an estimate of the value function for the current policy and improving the policy to take greedy actions with respect to the current value function estimate. Chapter 2. POMDP and Policy Search Framework 17 The value-based methods described above are sufficient for MDPs with small, finite state and action spaces. However, most interesting R L problems have large and/or continuous state spaces and some have large and/or continuous action spaces4. For the vast majority of problems with continuous state spaces, it is impossible to learn a value function5 exactly by one of these value-based methods. In these cases, it is necessary to solve the R L problem approximately by discretizing the problem or approximating the value function in some other way. 2.2.3 Sta te and A c t i o n D i s c r e t i z a t i o n Continuous state and action spaces can be discretized to solve an R L problem approximately [Lar68]. However, the curse of dimensionality dictates that the number of discretized states varies exponen-tially with the dimension of the state space, ds, which makes most R L problems, even of moderate (5-10) dimension, intractable. Since some regions of the optimal value function may have more variation than others, adaptive resolution schemes can be used to focus computation and storage resources on interesting areas of the state space. For instance, [MA95] presents the PartiGame algorithm for finding feasible trajectories to goal regions by adaptively partitioning the state space. This algorithm was tested successfully on problems with continuous state spaces ranging from 2 to 9 dimensions. PartiGame is limited to solving problems with a fixed goal region, as opposed to having arbitrary reward functions. Unlike R L problems, as we have defined them, it does not approximate the value function and does not necessarily find optimal or near-optimal trajectories to the goal. In [MM02], a variable discretization of the value function is used. The state space is partitioned into cells using a tree representation. The decision of whether to split a cell is based not only on local variation in the value function or optimal policy but also on the non-local impact on other cells. This value-based method is used to find near-optimal solutions to deterministic control problems of as many as 5 continuous state dimensions. 2.2.4 V a l u e F u n c t i o n A p p r o x i m a t i o n R L problems can also be solved by using a function approximator with generalization capabilities, such as a neural network, to approximate value functions. Generalization can be useful in large state spaces, whether discrete or continuous. [Tes95] used TD-learning with a backpropogation neural network to learn a policy that could play backgammon at the human master level. The algorithm 4 For example, a 9 degree-of-freedom biped used to learn walking control in [Sha04] moves in an 18-dimensional continuous state space. If an action consists of setting a target position for each joint or applying a force to each joint the action space would be continuous and 6-dimensional. 5 Here we use value function to refer to a state-value or action-value function, although in most other cases value function will refer only to the state-value function. Chapter 2. POMDP and Policy Search Framework 18 learned through several months of self-play and the most sophisticated version used hand-crafted useful features in the state vector. Although backgammon has discrete state, the number of possible states is estimated to be more than 10 2 0 , which makes approximation of the value function necessary. In another example, a neural network and TD-learning was used to approximate the value function in [Cou02]. A policy was found that allowed a snake-like creature to swim in an 12-dimensional continuous state space with 4-dimensional continuous actions. Although the dimensionality of this problem was large, it is questionable whether it is more complex than problems with highly nonlinear control and 4-dimensional state spaces, like the cart-pole swing-up and acrobat problem, which were also addressed in the work. There are likely a large number of policies that would allow the snake to swim in one direction, which would make it relatively easy to find one that is locally optimal. However, this example demonstrates that neural networks can effectively generalize in high-dimensional spaces. [Cou02] also used TD-learning to learn a policy for car racing, which we compare to our vehicle steering work in Chapter 5. The main disadvantage of using a neural network (or a similar generalizing function approximator) is its poor locality. Incremental updates to the weights of the neural network may have side-effects in regions of the value function domain which are unrelated to the training data currently being learned. For this reason, there are few convergence guarantees in this setting, thus nullifying one of strongest features of value-based methods. Some limited positive convergence results are in [BT96], including a proof convergence of TD(X) algorithm for approximate policy evaluation6 using linear function approximators. A n example of divergence for the case of nonlinear approximation is also shown. A framework for handling continuous-time control problems, without explicitly discretizing the state space, the action space, or time, has been presented in [DoyOO]. The main advantages of this framework are that the action can vary smoothly yielding better policy performance, the gradient of the value function can be used to determine near-optimal control, and there is no need to determine how to partition state, action, and time. The task of finding the right granularities is delegated to the function approximation and numerical integration algorithms. A cart-pole swing-up problem with highly nonlinear control in 4 continuous state dimensions was solved using this framework and a normalized Gaussian network is used to approximate the value function and policy. This approximation scheme has the disadvantage that the basis functions are positioned in a regular grid and therefore number exponentially with the number of state space dimensions, although it is possible to use other approximation schemes within the framework. 6Policy evaluation computes the value function, V", of a fixed policy 7r. Chapter 2. POMDP and Policy Search Framework 19 2.2.5 P a r t i a l Observabil i ty-A l l value-based methods described above are designed to work for MDPs, where the state is completely observable. In the setting of POMDPs , with partially-observable state, the problem of finding an optimal policy is considerably more difficult. This is because the observable state will typically not be Markovian 7 in which case the choice of optimal actions may depend on previous observations, not simply the current observation. It is possible to use a value-based method for MDPs on a P O M D P using the observable states and ignoring the fact that they may not be not Markovian. If the observable state is sufficiently close to Markovian then this might perform well. However, performance may be arbitrarily poor in the general case. A more robust method is to use a model of the P O M D P (including state transition function, v, and observation function, w) to compute a belief state, which represents the probability of being in any state given past and current observations. The belief state is Markovian and so it can theoretically be used to compute the optimal policy using an value-based method for MDPs. However, the space of belief states will be much larger than the original state space. For example, given a discrete state space, S, the corresponding belief state space will be continuous with |5 | — 1 dimensions. Finding the optimal policy for a P O M D P using belief states and dynamic programming can often be very computationally intensive or practically impossible. 2.3 Policy Search Although some dynamic programming algorithms offer guarantees of convergence to an optimal value function, it is yet to be shown that such algorithms can be applied to learning complex control in continuous state spaces of moderate-to-high dimension and continuous action spaces. Many motion control problems, such as the control of an articulated figure for animation or robotic simulation, fall into this category. A practical alternative to value-based methods is policy search, where a restricted set of policies is searched directly for an optimal policy, without computing a value function. In this setting, some prior knowledge of useful policy structure is used to define a fixed class of policies, n C II, to be considered. A vector of policy parameters, 0 , defines a specific policy within II. We then search LT for an optimal policy, 7r*, which is equivalent to learning the optimal policy parameters, O*. By defining a promising policy structure with a simple parameterization, it can be significantly more efficient to search n for an optimal controller 7r* than it is to find an adequate approximation to V* 7 If a state is Markovian, the successor state is determined independent of any previous states. Chapter 2. POMDP and Policy Search Framework 20 using dynamic programming. Of course, ft must be chosen carefully so that it includes satisfactory solutions. Furthermore, the parameterization and search algorithm must be chosen intelligently so that one of these solutions is found quickly and reliably. Usually it is difficult to guarantee that a policy search algorithm will find n* that is close to n* by some objective measure. For this reason, policy search methods are typically evaluated by the quality of the solutions computed in practise, while unfortunately remaining weak in theoretical guarantees. In Section 2.3.1, we consider examples of policy search being used to solve motion control problems. This is followed by a discussion of important aspects of the policy search method, including the policy representation, the optimization method, and other ways of shaping the search. 2.3.1 P r e v i o u s W o r k Complex motion control problems have been solved using policy search. For example, a policy was designed for flying an autonomous helicopter[Ng03]. The state space of the helicopter was 12-dimensional and the action space 4-dimensional. In order to learn a stable hovering behaviour, ft was defined to be a set of neural networks of a specific topology that map a specific choice of state variables and position errors to actions. The connection weights of the network were left for the policy search to determine. Prior knowledge was incorporated into the network structure by choosing useful observable state variables as input, appropriate computation (sum or tanh) at each network node, and only connecting state input to an action output if it was considered relevant. The learned policy was able to hover at a fixed target position. With a slightly modified network and by slowly varying the target position for the helicopter to create a target trajectory, another policy was learned that could perform several flying manoeuvres. In another example, a controller for robot weightlifting was learned using policy search and shaping techniques modelled after human motor learning [RB01]. The task was to lift a three jointed robot arm with a payload from straight-down equilibrium position to a straight-up unstable position by applying a torque to each joint. This is similar in concept to the classical pendulum swing-up problem. The task is made difficult when the joint torques are limited and payload mass sufficient such that the payload cannot be lifted by trivial motion of the arm. A priori structure was introduced into the policy by defining a via point, which is a useful part-way arm position for completing the task. The policy was then composed of three timed proportional-derivative (PD) controllers, one to drive the arm to the via point, one for the goal configuration, and one to hold the arm at the goal configuration. A matrix of weights was used to couple the torques applied to the joints. The timing parameters, P D parameters, and matrix weights were learned with a simple stochastic search algorithm. The payload mass was increased as policy learning progressed as a form of shaping. Chapter 2. POMDP and Policy Search Framework 21 In several instances, policy search has been used to learn policies for the motion of virtual char-acters with potential application to computer animation: • [vdPF93] used a randomized generate-and-test algorithm followed by fine-tuning using stochas-tic search to generate locomotion policies for a number of articulated virtual creatures. In this case the policies were structured as sensor-actuator networks (SANs), which are similar to ar-tificial neural networks except that they have node delays. These delays give a S A N dynamic properties of its own, making it a purposeful policy with internal state as opposed to a reactive policy. • Policies based on stimulus-response mappings were optimized using genetic algorithms in [NM93]. The evaluation function rewarded a policy for making a character of a particular shape move as far as possible in one direction during a given evaluation period. Several different characters were taught various locomotion strategies using this technique. Our nearest-neighbour policy representation (NNPR) is similar to the stimulus-response policy representation used in [NM93]. A more detailed comparison is included in Section 3.1. • [Sim94] used genetic algorithms to evolve both the policies and morphologies of virtual creatures for various behaviours, such as swimming, walking, jumping. This work looked at a broader problem than an R L problem since not only the policy, or mapping from observable state to action, was learned. Sensors were evolved, optimizing the form of the observable state, and the creatures' bodies and potential actions were evolved, optimizing aspects of the system dynamics. Since parts of the entire system were being optimized, this problem was particularly suited to a standard optimization using genetic algorithms and it is difficult to see how this problem could be made to fit the R L framework. The policies evolved were purposeful because timed oscillators were an important component. Creatures were evaluated specifically for each action, e.g., using the distance travelled for walking and swimming behaviours. • A genetic algorithm was also used in [Gri97] to find policies to control the motion of an animated desk lamp character. The policy was intended to move the lamp forward a variable distance to a goal position. In this way, the motion of the learned policy was parameterized, i.e., an animator could specify the distance the lamp should move. The actions of the learned policy could be dependent on time, making it a purposeful policy. The evaluation function took into account the error from the goal position as well as some aspects of motion style. Shaping of the policy learning was achieved by slowly introducing style considerations into the evaluation function over successive generations of the algorithm. Chapter 2. POMDP and Policy Search Framework 22 2.3.2 P o l i c y Re p r e se n t a t i on There are a variety of ways of incorporating knowledge of useful policy structure into the represen-tation. Out of necessity, R L methods have imposed structure on solutions by using various function approximations for policies or value functions, such as grid discretizations [Lar68], neural networks [Tes95, Cou02], and normalized Gaussian networks [DoyOO]. Knowledge that optimal policies and value functions often have inconsistent variation through the observable state space has encour-aged the use of variable-resolution methods [MA95, MM02], Genotype representations of policies [Sim94, Gri97] can help to reuse useful structures within a policy, while learning these structures only once. A l l these representations simplify policy learning partly through reducing the set of considered policies by making a coarse approximation of all possible policies. The main simplifying assumption that is typically made is that the policy to be approximated is mostly continuous over the state space. However, most of the above approximation methods still attempt to model general functions, without incorporating specific task knowledge into the policy. On the other hand, some policy search methods, such as [Ng03, RB01] have used prior knowledge about the control task to make the policy representation highly-structured and problem specific, thus significantly limiting the number of policy parameters to be learned. Certain properties of the policy representation may be useful to an R L method: • Compactness: Generally speaking, the fewer the policy parameters the easier the search for an effective policy. This also effects the efficiency of policy storage and algorithms that work directly with policy parameters. • Smooth Parameterization: If changes in parameters smoothly affect the policy, the policy search is generally made easier. • Nonparametric or semi-parametric: Free parameters in the policy representation can be added or removed to suit the complexity of the task being learned. This is in contrast to a parametric representation which has a fixed number of free parameters to be learned. A nonparametric policy representation typically stores experience data directly, whereas with a semi-parametric representation experience data is consolidated. • Adaptive refinement: The policy granularity can be made to vary across the observable state space, rather than being uniform. In other words, the policy can be refined further in some regions of the domain than others, as required. • Locality: Local regions of the policy can be updated without unwanted side-effects to other regions. Chapter 2. POMDP and Policy Search Framework 23 • Generalization: Aspects of the policy parameters should affect various regions of the policy when relevant. Particularly in high-dimensional state spaces, symmetries in the policy should be exploited as much as possible. • Conceptual simplicity: Whenever possible, a policy representation that can be readily under-stood by people is preferable. Of course, it can be very difficult to incorporate all of these properties into a single representation, and trade-offs must be made. Some properties, such as locality and generalization, often directly conflict with one another. We use the nearest-neighbour policy representation, defined and explained in Chapter 3. A discussion of properties of the N N P R and how they may have affected the policy search for our various motion control tasks is in Section 6.1. 2.3.3 Evaluation In order to use policy search to find an optimal policy we must define a real-valued evaluation function, E(Q), that models how well the policy with parameter vector 0 solves a particular control problem 8. The object of the policy search is to find the optimal parameters9: Typically, E(Q) evaluates a policy by executing the policy from one or more initial states and considering some success measure of the resulting trajectories1 0. Since we use this type of evaluation for all control problems considered in this thesis it is useful to establish a consistent notation. A policy is evaluated from p initial states, s\nit, 1 < i < p for Tevai simulation time steps each. The set of all p initial states is Smit- This results in p state trajectories, <7*(0) = (s^O) = s\nit, ^ ( O ) , . . . , « T e „ a , where 1 < i < p. Each of these trajectories is evaluated by a function e(<rI(0)), which measures how well the policy does from initial state s\nit. We sum the performances of the policy from each initial position to obtain an overall evaluation metric 8 The evaluation function, E(@), is distinct from the reward function, r, of a POMDP, which measures the desirability of taking an action in a particular state. In conducting a policy search, we desire that an optimal policy as denned by E(@), is close to an optimal policy of the underlying P O M D P as defined in Section 2.1.2. However, it is difficult to obtain proven guarantees in this regard. 9Although we usually attempt to maximize the evaluation function, it is very common to minimize a cost function. These are equivalent and an evaluation function can be obtained by negating a cost function and vice-versa. With regard to optimization, we may use the term ascent for increasing the value of the evaluation function towards a maximum and descent for decreasing the value of the cost function towards a minimum. 1 0 For example, the evaluation could start a virtual creature from a single position and evaluate how far it can walk in a particular direction. 0* = argmaxQE(@). p (2-3) Chapter 2. POMDP and Policy Search Framework 24 In some cases, we consider e(cr'(0)) to represent an error in the trajectory and the sum is negated in Equation 2.3. Although we use an L\ norm of the individual trajectory evaluations to calculate the overall evaluation, there are other possibilities that are reasonable, such as an L2 norm. It is unclear how the choice of norm would affect policy learning and the effects would likely be problem dependent. This is an interesting issue that is not addressed in this thesis. The evaluation of a trajectory, e(cr l(©)), can be viewed as an estimation of the value (or cumulative discounted reward) of the corresponding initial state, V*(s\nit). In this way, if provided with an M D P (S, O, A, v, u>, r, 7 ) , a reasonable trajectory evaluation function would be 6 ( ^ ( 9 ) ) = £ 7 ^ ( 4 ( 0 ) , ^ ) j=i For such an evaluation metric, if Sinit contained all states and Tevai was sufficiently large, 0* would specify the optimal policy, 7r*. However, in policy search we typically avoid evaluating all states for large values of Tevai because this is impractical. The choice of initial states, Sinit, is critical in policy search. Often the evaluation of a policy is computationally expensive because it involves simulation of many trajectories. Each simulation of potentially thousands of timesteps can be expensive in itself. It is useful to reduce p * Tevai as much as possible without degrading the evaluation quality. The set Sinu must be chosen so that the distribution of initial states and resulting trajectories is representative of state regions that are important for control. This means that the distribution should be more dense in states that are visited frequently or for which policy changes result in large differences in cumulative discounted reward, Vn(s). Often it is difficult to choose the set Smit deterministically, and it is simpler to define a distribution D of potential initial states and randomly sample the states that form Sinit from D. This method of choosing initial states is also described in [Ng03]. The choice of D can be critical for shaping the optimization, which is further discussed in Section 2.3.5. 2.3.4 O p t i m i z a t i o n Any number of search techniques can be used to find locally or globally optimal policies. For example, we extensively use variations of stochastic greedy ascent (SGA), a simple optimization technique. This method, outlined in Algorithm 1, perturbs parameters 0 by a random vector SQ and if an improvement is made keeps the change. Otherwise, the unperturbed parameters are kept and another random perturbation is tried. These perturbations and comparisons are applied iteratively until some convergence criterion is satisfied. Despite the simplicity of this technique, good solutions can be found if E(Q) is adequately shaped, as will be described further in Section 2.3.5. In this thesis, we focus Chapter 2. POMDP and Policy Search Framework 25 on the shaping of optimization problems so that a simple optimization technique, such as SGA, is adequate. Leveraging existing, more sophisticated optimization algorithms that are particularly suited to our problem would possibly improve the search performance, but as yet, we have not experimented in this direction. Algorithm 1 Stochastic Greedy Ascent (SGA) input 0 {policy parameters} input p {step size} input A {decay factor} Ebest *— E(Q) repeat for i < — 1... de do SQ[i] <— Rand(-l, 1) {Assumes dimensions of the policy parameter space are normalized} end for &new <- 0 + 5 0 Enein * E(Qnew) if Enevj > Ebest then 0 * 0netw Ebest * E n e w end if P «- V until convergence Optimization techniques such as genetic algorithms and simulated annealing can be used to find globally optimal solutions. These may prove considerably more effective than simple SGA if the optimization landscape has many local minima that are not satisfactory solutions. Gradient-based optimization techniques are another alternative. If it is possible to compute partial derivatives of E(Q) with respect to 0 then local maxima could be efficiently computed by steepest ascent or other numerical optimization techniques that utilize gradient information. However, for most policy parameterizations, E(&) does not vary smoothly with changes to 0 . For example, if E(Q) evaluates one or more trajectories, a small change to the policy parameters could result in a large jump in the policy evaluation. This may occur if there are discontinuities in the state transition function, v, the observation function, ui, or the reward function, r, which is common for R L problems. In [BM99], a method that can be used to conduct a policy search 1 1 using the gradient of an evaluation function is presented. The evaluation is based on the expected value of a trajectory error with respect to trajectory probabilities. A stochastic policy is used so that the distribution of policy actions and also the distribution of trajectories varies smoothly with changes to the policy parameters. n T h e algorithm is sufficiently general that it can be instantiated as either a value-based or policy search algorithm, or some combination of the two. Chapter 2. POMDP and Policy Search Framework 26 However, only finite state and action spaces are considered and it is not clear how this method would generalize to continuous state and action spaces, where the number of possible trajectories, even with a small time horizon, is infinite. It is also possible to estimate the gradient of E(Q), even if it is non-differentiable, and use the estimate to move in the direction of steepest descent (or ascent). For instance, approximate partial derivatives of E(Q) can be computed using finite differences (i,e., by evaluating one modified policy for each policy parameter, each with just a small change in that parameter). There are also methods for computing approximate gradients using stochastic perturbations to policy parameters. In [RB01], a stochastic search algorithm that blends a random walk with movement along a rough estimate of the gradient is used to find a policy for robot weightlifting. 2.3.5 S h a p i n g We take shaping to mean guiding policy learning through introducing a priori structure into the policy search. In psychology, the term shaping has been used to refer to training a human or animal to perform a complex task by leading up to it with a sequence of progressively more difficult tasks. We include this kind of shaping in our definition of shaping for policy search. In [Ng03], reward shaping referred to choosing the rewards for a P O M D P to aid policy learning. Our definition of shaping includes the choice of evaluation function, which is related to the reward function, to guide the optimization of policy parameters. Finding good solutions to complex R L problems in high-dimensional or continuous state and action spaces can be computationally intractable if assumptions are not made with regard to policy learning. Making various assumptions about the types of policies to consider and how to learn a policy can drastically reduce computational effort in policy search. General Techniques One way shaping can be accomplished is by choosing a compact, smooth policy representation. Since the domain of the evaluation function, E(Q), is the set of possible policy parameters, 0 , the choice of parameterization is important for facilitating the search for an optimal policy. The policy search can also be shaped by choosing useful inputs and outputs for the policy. The representation of observable state and actions available to the policy can greatly affect policy learning. Observable state should be directly relevant to the choice of action necessary to receive reward. The action representation should be as simple as possible while allowing the capability to succeed at the R L problem. Careful representational choices can simplify the optimal policy, making it easier to approximate with a given policy representation. Changes to the observation function and actions can Chapter 2. POMDP and Policy Search Framework 27 either be viewed as modifications to the P O M D P or to the policy structure. If the R L problem is fixed, these changes must be limited to a mapping from observable states, O, to modified observable states, O', and a mapping from modified actions, A', to the original actions, A. With modified observations and actions, the goal is to learn an optimal policy, TT* : O' —> A'. Another critical choice is how to evaluate policies. If the evaluation time, Tevai for each trajectory is too small, short-sighted policies may be favoured. If, however, initial states, Sinit, are distributed densely enough that trajectories overlap then short-sightedness is less of an issue. The evaluation will focus on control in certain regions of the state space if the distribution of initial states is biased. Having many initial states in regions of critical control can help shape the policy search. This is because the search can first find a policy which solves these critical regions and then fine-tune the policy. It is also useful to have initial states with a range of control difficulty, so that, early on, the search can optimize the policy for the easy trajectories and progressively optimize for more difficult trajectories. Finally, the evaluation of individual trajectories depends on the choice of reward function. Rewards can be given for accomplishing known intermediate goals to guide policy learning. Since evaluations are of limited time, Tevai, it is critical that rewards are chosen such that the quality of each trajectory can be judged. If reward is only received at a goal state, and the trajectories cannot reach that goal state, they will be useless for the evaluation. 0 i n i l ©* Policy Parameters 0 Figure 2.2: A poorly shaped optimization problem (top) and a well-shaped optimization problem (bottom) given initial parameters, 0 t m t . Choosing a good policy representation, observation and action representations, and evaluation function serves to shape the evaluation landscape and can significantly aid policy optimization. The evaluation landscape can be considered very well-shaped if a gradient ascent optimization can find a satisfactory solution (see Figure 2.2 (bottom)). This is generally true if the initial policy parameters, Chapter 2. POMDP and Policy Search Framework 28 © m i s , lie within the basin of a satisfactory optimum. A n optimization may be poorly-shaped if the evaluation landscape is plagued by poor local minima and Qinu does not lie within the basin of a satisfactory local optimum (see Figure 2.2 (top)). Some optimization techniques allow the current solution to escape local minimum but this makes the optimization problem a more difficult one. When the optimization is poorly-shaped, it is necessary to use a global optimization technique, such as simulated annealing or genetic algorithms to find satisfactory optima, which can be a computationally intensive process. Of course, to make an optimization well-shaped it is possible to initialize Qinu to lie close to a good solution. We call this technique bootstrapping12. In this way, another process could be used to find a very good estimate of policy parameters and then policy search could be used to refine them to a locally optimal solution. A n example of this is described in Section 3.3.3, where the optimization of a time-based policy is used to bootstrap the initial policy parameters for a policy search. Series Shaping We are particularly interested in using bootstrapping in a series of optimization episodes that pro-gressively refines the policy, much in the same way that shaping (in the psychological sense) can be used to train animals. We call this type of shaping series shaping. A n optimization series, cj>i,..., <j>m, is conducted, where the policy parameters resulting from 4>i are used to initialize the parameters for the next episode of optimization, 4>i+\, as shown in Figure 2.3. This type of policy refinement is analogous to the way a beginner might learn to play a video game, by starting off with very basic missions and using the skills they have developed to attempt increasingly more difficult missions. Series shaping bears some resemblance to simulated annealing, where solutions to problems with local minima are also gradually refined. Simulated annealing performs a search to minimize a fixed objective function and gradually decreases the probability that changes which increase the objective function are accepted. On the other hand, series shaping gradually changes the objective (evaluation) function, increasing the difficulty of the optimization problem. Series shaping usually provides a way to introduce a priori knowledge into the optimization about what makes the problem difficult and how it could be solved in stages. Simulated annealing is suited for problems in which a global optimum is required but little is known about the optimization landscape, but it can be very computationally intensive. We use series shaping to learn policies for steering vehicles by increasing the free parameters available in the policy representation with each optimization episode. First, a coarse policy for 1 2Note that in [SB98], the term bootstrapping is used in the context of value function estimation to refer to "updating] estimates of the values of states based on estimates of the values of successor states", as opposed to estimating the value of a state independent of other state value estimates. Chapter 2. POMDP and Policy Search Framework 29 ginil 4j Policy Parameters 9 Figure 2.3: Progressive refinement of parameters, 0 , using a series of progressively more difficult optimizations. Each optimization episode is bootstrapped using the result of the previous one. steering is learned, and then, the addition of policy nodes refines the policy in certain regions of the observation space, O. The difficulty of the optimization problem increases with the additional degrees of freedom. However, many of the parameters should be bootstrapped to reasonable values by the previous optimization (for more details, see Section 5.4). Series shaping can also be done by gradually changing the evaluation function and thus increasing the difficulty of the optimization problem. In other words, the criteria of what is considered a good solution is tightened as the series of optimizations progresses. For example, the evaluation time, Tevai, can be gradually increased, so that successively longer trajectories are optimized. Also, the set of initial positions, 5 ,„jt, can be progressively augmented to provide a denser sampling in interesting areas of the state space. A n early optimization episode would find a coarse policy that can handle some basic situations well. Later episodes would refine the policy to handle more situations, and Chapter 2. POMDP and Policy Search Framework 30 perhaps, more difficult situations than before. Moreover it is often useful to optimize a basic motion behaviour first and then gradually add more restrictive criteria. For instance, one might optimize a vehicle driving policy to first avoid crashing and later consider fuel consumption. In [Gri97], a policy is optimized first to accomplish a basic jumping motion and then style considerations are phased in. Finally, series shaping can be conducted by modifying the difficulty of the R L problem itself. This was done in [RB01], where the weight that the robot was required to lift was increased periodically during the learning process. Another example would be to train a vehicle steering policy on pro-gressively more difficult tracks (e.g., by narrowing the track and adding more corners or making the corners sharper). 2.4 Summary In this chapter, we formalized the concept of the P O M D P , which we use to frame motion control problems. We also discussed value-based methods for solving R L problems, as well as their limitations. We suggested policy search as a promising alternative and covered some of the significant related work in this field. Lastly, we outlined some of the issues in successfully applying policy search to R L problems. Below, we specify how the P O M D P and policy search framework, as well as previous work in these areas, is relevant to our motion control work. 2.4.1 Relevance of Framework Although we use the P O M D P formalism to frame pur motion control problems, we do not attempt to find optimal or near-optimal solutions to the R L problems as is typically the case in R L research. Instead we are interested in finding satisfactory solutions. These are solutions that accomplish the general task that is specified and do so in a way that graceful, robust, and reasonably efficient13. Policy search is often used to find satisfactory solutions to practical problems with guarantees of near-optimality being difficult to obtain. Most aspects of the P O M D P formalism, including S, O, A, u>, 7r, and trajectories, will be di-rectly relevant to describing our motion control problems. Some aspects, such as the reward function, r, and the discount factor, 7 , are useful conceptually for constructing an evaluation function, E(&). Trajectory evaluation can be viewed as summing reward along the trajectory and thus approximating the value of the corresponding initial state. The values of other states along the trajectory are also 13Gracefulness can usually only be evaluated qualitatively, although it is very important for computer animation results. Robustness can be measured by evaluating the policy in a range of situations, particularly those that are not expected to occur during normal policy operation. Efficiency can be measured in terms of the energy used during policy actions. In computer animation, sometimes energy use is minimized with the intention of making motions more graceful. This is because of the reasoning that animals and humans tend to conserve energy in motion if possible. Chapter 2. POMDP and Policy Search Framework 31 being approximated to a lesser degree. We do not use a discount factor for evaluation, but the eval-uation time, Tevai, performs a similar function to 7 in that they both specify the short-sightedness of the policy evaluation. One reason that we do not use 7 in evaluating trajectories is that we want the evaluation of decisions at states toward the end of a trajectory to be given significant weight in the evaluation. In this way, a finite number of trajectories, can better evaluate the infinite number of states in the continuous state space. Most aspects of policy search discussed above are directly relevant to our work. We explain our choice of policy representation, the nearest-neighbour policy, in Chapter 3. Specification of the eval-uation metric, particularly the trajectory evaluation function, initial states, and evaluation time, is critical to the success of our solution policies. We discuss the evaluation metric in more detail in Sections 3.3.2, 4.4, and 5.3 for the limit cycle tracking problem, the cart and pole problem, and the ve-hicle steering problem, respectively. The optimization technique used can largely affect the efficiency of convergence to a satisfactory optimum. However, we do not focus on developing or leveraging a sophisticated optimization technique. Instead, we use minor variations on a simple stochastic greedy ascent (SGA) optimization (1). We do, however, customize the algorithm to make local perturbations in the policy for the vehicle steering problem, which is described further in Section C.3. Shaping techniques are crucial for using prior knowledge to solve complex control problems ef-ficiently and we demonstrate the use of a few of these. The choice of policy parameterization and evaluation metric serve to shape the optimization landscape. More interestingly, we adaptively refine policies for vehicle steering by adding free parameters to the policy representation in a series of opti-mization episodes. This technique, explained further in Section 5.4, achieves shaping by conducting a coarse-to-fine search for a policy. We also choose a vector of observable variables for the vehicle steering that is small and easy to compute and yet provides adequate information for the task. 2.4.2 R e l a t e d W o r k We will restrict our discussion of related work to that which is most pertinent or directly influential to our work. The P O M D P and policy search framework that we use is very similar to that used in [Ng03]. In particular, our policy evaluation metric (Equation 2.3) is the same as the approximate utility for policies in [Ng03], except that we do not include the discount factor. However, in our work, we emphasize the choice of initial state distribution, D, as important for shaping the search in complex control. Series shaping has been used to learn motion control in [Gri97] and [RB01]. Shaping is accom-plished by phasing style considerations into the evaluation of a hopping motion in the former. In the latter, the physical model is gradually modified by increasing the mass held by a robot arm in order Chapter 2. POMDP and Policy Search Framework 32 to shape the learning of a robot weightlifting policy. In our work, we use adaptive refinement of the policy through progressively adding free parameters into the representation to shape policy learning, as explained in Section 2.3.5. This topic will be further addressed in Section 5.4. Note that an adap-tive resolution scheme has been used to partition the state space to find feasible trajectories to goal regions in [MA95]. Also, [MM02] adaptively refines a value function using a tree representation to solve continuous state R L problems. Furthermore, in [AM03], value functions and policies are stored using trajectories and the representation can be refined by adding trajectories to new parts of the state space. The policy search method used to learn walking controllers for a bipedal articulated figure in [Sha04] is similar to the work in this thesis in a few ways. For one, the N N P R is also used to represent policies for the walking problem. A similar evaluation metric that evaluates several trajectories from relevant initial states is used. The optimization technique is also a simple greedy algorithm, although it makes systematic deterministic perturbations, in contrast to the stochastic perturbations of our SGA. Shaping is accomplished by progressively increasing the evaluation time, Tevai, as opposed to progressively adding policy parameters, as we do for the vehicle steering problem. However, a walking policy for bumpy terrain is achieved by duplicating policy parameters in a single episode at the same time that the terrain bumps are introduced, in a similar form of policy refinement. Our N N P R is similar to the stimulus-response representation used in [NM93] and we compare the two representations in Section 3.1. We discuss work directly related to the cart and pole and vehicle steering problems in Chapters 4 and 5, respectively. Chapter 3. Nearest-Neighbour Control Policies 33 C h a p t e r 3 Nearest-Neighbour Control Policies We represent policies using a semi-parametric nearest-neighbour function approximator (NNFA). This representation is local and conceptually simple. Also, we demonstrate the compactness of NNFA in Chapters 4 and 5, as it allows policies for complex control tasks to be modelled using few parameters. Moreover, NNFAs have an advantage over regular grids in that the resolution of the approximation can be varied over the function domain (i.e., they can be adaptively refined). In other words, representational complexity can be concentrated in more interesting or critical regions of the observable state space. Our work applies policy search within the set of policies, ft, that can be represented using an NNFA. The idea is to create a variable discretization of the policy domain, O, such that the policy can be adaptively refined in regions of particular importance1. In this chapter, we formally define the nearest-neighbour policy representation and go over some related work. As a concrete example, we present a policy that was synthesized to drive a point mass along a target trajectory and describe the method used to create it. 3.1 Policy Definition We define a policy 7r using a set of n nodes, with each node i having a location Q in the observable state space, O, and an action \n € A. The node locations are used to induce a Voronoi partitioning on O and each Voronoi cell is mapped to the action of its corresponding node. A policy 7r is thus defined by 7T(o) = tlc(0) c{o) = argmini<=i<=n(\\o - 0\\), (3.1) where c(o) is the active node, given an observation, o. Figure 3.1 shows an abstract diagram of a nearest-neighbour policy. A policy can be specified using the parameter vector, 0 = [Ci. • • •, Cn, Mi> • • • > Mn], consisting of all node locations and corresponding actions. Notice that the number of nodes and location of the 1In Section 5.4 of the chapter on vehicle steering, we explain how adaptive refinement through the addition of policy nodes can be used to shape the policy search. Chapter 3. Nearest-Neighbour Control Policies 34 observation space O action space A H i Figure 3.1: A n abstract diagram of a nearest-neighbour policy of 3 nodes with a 2-dimensional input state space and 1-dimensional action space. nodes can be chosen to adequately represent the complexity of the required control. For example, a policy for the cart and pole problem was synthesized using just 2 nodes (Section 4.6), while the more difficult problem of steering a single-trailer truck backwards around a track was solved using a policy with 14 nodes (5.5). 3.2 Related Work As in [NM93], a stimulus-response metaphor can also be used to describe the N N P R . The current stimulus, Ot, is found to lie within a Voronoi cell, activating the corresponding node, c(ot), and triggering the associated response, / i c ( 0 [ ) . With the nearest-neighbour representation, the action Hi is taken if the corresponding node location, has the minimum distance of any node location to the current state, s t. On the other hand, the stimulus-response (SR) representation used in [NM93] correlates each action with a stimulus function, which maps observable states to a stimulus value. The current action becomes that of the highest-valued stimulus function, only if that value is positive. If no stimulus function is positive the current action remains unchanged. The stimulus functions are positive only within a hyper-rectangle in state space and the maximum value is at the center of the hyper-rectangle. If a hyper-rectangle does not intersect with others any state within it will always map to the corresponding action. However, if two or more hyper-rectangles intersect, the action chosen in the intersecting region will correspond to the stimulus-function with highest value. The boundary between two neighbouring nodes in the N N P R will always be formed of a single hyperplane, Chapter 3. Nearest-Neighbour Control Policies 35 while the boundaries in overlapping hyper-rectangles in the SR representation are more complex. In the SR representation, it is also possible to have states which do not have a corresponding action because they are not within a stimulus hyper-rectangle. If there is no stimulus activated, the action corresponding to the last stimulus will be kept, meaning that the resulting policy is not entirely reactive. A n NNFA can be considered a simple form of locally weighted regression (LWR). A n L W R is a function approximation that fits a model to a local set of weighted data points, rather than attempting to fit a model to all data points. LWRs have been used to represent action-value functions in instance-based learning [SK01, FA02], In these cases, data points are stored for experiences that the R L agent encounters. A regression of neighbouring data points is conducted for each query of the action-value function. An NNFA is an L W R where only the data point closest (using Euclidean distance) to the query point has a non-zero weight and the value of the data point is used as the constant function approximation in that neighbourhood. For our nearest-neighbour policy representation, we do not use agent experience for data points, but instead we try to optimize the position and value of these data points (which we call nodes) to find an optimal policy given the restrictions of the representation. We are not aware of previous work that uses an NNFA to represent policies. However, the concurrent work on walking policies in [Sha04] also uses the NNFA. 3.3 A 2D Example This example problem is useful for visualizing how a nearest-neighbour policy operates, because the state space, S, is two-dimensional. In this section, we present an example of a resulting policy and also describe the method used to synthesize the policy. Figure 3.2 shows a policy that was synthesized to control the motion of a point mass in one dimen-sion. The goal of the policy is to drive the state of the mass along a cyclical position/velocity profile, shown in the figure as the target state trajectory. Examples of such cyclical one-dimensional motion are the height of a drumstick during a repetitive drumming motion or the height of a water surface at one point when there is a consistently repeating wave pattern. Such a limit cycle is also observed in the various degrees of freedom of a walking person. For example, the angle of the knee follows a consistent cyclical pattern during a walk. Using policy search, we have synthesized nearest-neighbour policies for many such user-specified limit cycles, one of which is shown in Figure 3.2. The algorithm takes a target trajectory as input and outputs a parameter vector, 0 , to represent the generated policy. The number of policy nodes, n, is problem dependent, and matches the representational com-Chapter 3. Nearest-Neighbour Control Policies 36 state trajectory current state (x,x) target state trajectory state flow field node position Q Voronoi cell boundary Figure 3.2: Nearest-neighbour policy for a point mass in one dimension with a 2D position-velocity state space. This policy drives the point mass towards a limit cycle which is meant to approximate the target state trajectory. plexity required to follow the specific target state trajectory2. In the following, we discuss the system dynamics, policy evaluation metric, and policy search method used to create this example policy. 3.3.1 Problem Formulation The dynamics for this problem are illustrated in Figure 3.3. The state of the point-mass is given by its position, x, and velocity, x, and this state is directly observable by the policy ir. Thus, o = s = (x, x) £ <5. The control action fit £ A associated with each Voronoi cell i are the parameters, (x, kp, kd), for a proportional-derivative controller, used to drive the point mass towards the target position, x. The force on the point mass is computed as F = kp(x - x) - kdx, (3.2) where kp is the spring constant and kd the damping constant 3 . The state transition function v, is thus given by Equation 3.2 in combination with the physics for a force acting on a point mass in one dimension. In this way, the P D control is considered part of the P O M D P dynamics, and the P D 2 Appendix A explains how the number of nodes and initial node positions is chosen by approximating the target trajectory with a time-based policy. 3 We did not experiment with using other types of controllers (e.g, PID) to actuate the system, as we obtained successful results using this simple PD controller. In our experiments, the motion that resulted from a nearest-neighbour policy composed of discrete PD control actions always converged towards either a single state or a limit cycle, a property useful for solving this particular problem. Interesting future work might involve proving why this is the case. Chapter 3. Nearest-Neighbour Control Policies 37 state transition function V PD control function —7K i time step t-i 1 t i i Figure 3.3: Dynamics for control of a point mass in one dimension. controller parameters, [x, kp, kj), the policy output. Alternatively, one could view the force, F, as the output of 7r and the P D control as part of the policy structure. 3.3.2 Policy Evaluation Given these system dynamics, we wish to find a nearest-neighbour control policy, n e ft, that moves the point mass in a path that is as close to the target trajectory as possible. We would like the generated policy to be both accurate and robust. When the state is on the target trajectory, it should continue to move along that trajectory. When the state is off the target trajectory, the actual trajectory should converge towards the target trajectory as quickly as possible. We do not impose any form of control energy cost. In the P O M D P framework, a reward function, r(st,at) = f(st,^{st,ai)) = f(st,st+i) 4 might reward the policy for how close st+\ is to the desired state, st+i, given an approximate previous state, st, on the target trajectory. This state, st, could be the closest point (e.g., by a Euclidean measure) on the target trajectory to st. Figure 3.4 shows graphically how such a reward metric would be calculated. This reward function would reward the policy for removing error between the current state and the corresponding desired state on the target trajectory. Our policy evaluation function, E(Q), assigns a high utility to policies that drive states on the 4Note that with a deterministic transition function, v, the state st+i is determined by st and at, so we may use a reward function, f(st, « t+ i ) , that depends on the state, st, and its successor, st+i. Chapter 3. Nearest-Neighbour Control Policies 38 actual state trajectory target state trajectory Figure 3.4: A reward function for a P O M D P framing of the 2 D trajectory tracking problem. The approximate previous state st is determined by st = o/rgmins&straj\\s — st\\, where Straj is the set of states on the target trajectory. In the figure, d = \\st — st\\ is the distance between st and the closest state on the target trajectory, st. The desired state, st+i, is the state reached by following the target trajectory one time step from st. The reward r is given by r(st,st+i) = - | | s t + i - s t + 1 | | 2 . target trajectory toward their desired successor states. We evaluate the policy by computing trajec-tories from p initial states, s\nit,l < i < p, for Tevai simulation time steps. The initial states are positioned on the target trajectory, distributed at even time intervals, Tintervai, around the trajectory, as illustrated in Figure 3.5. The error of the ith trajectory, e(<7*(0)), is computed as: e (<7 I (0)) = | | 4 _ , ( e ) - 4 _ 1 ( 0 ) | | 2 , (3.3) where sip ( 0 ) is the desired final state, or the state that would be reached by following the target trajectory from initial state s\nit for Tevai simulation time steps5. The policy evaluation function is defined as the negated sum of trajectory errors: E(0) = -J2e(*im- (3-4) i=l In this way, a policy is evaluated by how well it tracks the cyclical target trajectory from several initial states distributed on the trajectory. Performance from other regions could also be taken into account by evaluating from initial states off of the target trajectory. Although this was not tried, it would likely result in a policy that is more robust. Policy shaping could be achieved by first evaluating from only on the target trajectory and then phasing in initial states further off the trajectory to gradually incorporate robustness into the policy. However, the policies we generate are surprisingly robust to state perturbations, even though evaluation is only conducted from initial states on the target trajectory. 5 Note that this distance metric is dependent upon the units used to measure distance and velocity. Ideally, the metric should use some kind of normalized coordinates for the state space. Chapter 3. Nearest-Neighbour Control Policies 39 target state trajectory (a) (b) Figure 3.5: (a) Initial states and evaluation trajectories. The initial states are Tintervai time steps apart along the target trajectory. Each trajectory is evaluated for Tevai time steps, (b) Evaluation of a single trajectory. The evaluation error for the trajectory, e(al(Q)), is the square of the distance between the actual final state, szT^al (0), and the desired final state, slT^al{Q), on the target trajectory. The specific parameter settings for our results are Tevai = 2000 and T i n t e r v a i = 500 simulation time steps. This setting of Tintervai results in about 20 initial states for a typical target trajectory. In other words, it takes about 10000 time steps to cycle once around a typical target trajectory, so Tevai is about g of the total cycle time. This seems to be enough initial states to evaluate the policy in sufficient detail on the entire trajectory. The size of Tevai appears to be a good compromise between consideration of short and long term consequences of policy actions. The result is that small details in the target trajectory may be ignored in favour of following the trajectory better in the long term. However, if the error was integrated over an evaluation trajectory the small details in the relevant area of the target trajectory would be considered. A metric which integrates the error would have an advantage over the metric that we employed in that it would not be as heavily dependent on the choice of Tintervai- As of yet, we have not experimented with this alternative for evaluation. 3 . 3 . 3 P o l i c y Search Choosing good initial parameters, Qinit, is important for accelerating the search for a satisfactory policy, especially when the parameter space is large. For this example, the initial parameters are bootstrapped using a time-based policy generated to follow the target trajectory. A time-based policy is a policy for which actions are a function of time, rather than a function of the current Chapter 3. Nearest-Neighbour Control Policies 40 observable state. To avoid diversion from the main subject of the thesis, we defer a more detailed discussion of time-based policies to Appendix A, where we explain how such a policy is generated for the target trajectory, and how it is used to initialize the search. Careful bootstrapping using a time-based policy should give us initial parameters that are fairly close to an optimal n-node nearest-neighbour policy. This good initial starting point for the opti-mization is critical since the number of nodes, and therefore the number of policy parameters, can be quite large making a naive search of the parameter space nearly impossible6. Furthermore, it is difficult to estimate the number of nodes, n, needed to represent a policy that can successfully follow the target trajectory without the first step of finding a time-based policy. 3.3.4 R e s u l t s We conduct a n S G A optimization to maximize the evaluation criteria, given by Equation 3.4. In this manner, our algorithm generates policies for moving the state of a point mass along a cyclical target trajectory, such as the policy in Figure 3.2. In the figure, note the successful tracking behaviour achieved with what is a sparse control policy representation. The example trajectory begins well off the target trajectory but is rapidly drawn to it. Also, the state flow field indicates the robustness of this policy to unexpected state perturbations, as the flow tends towards the target state trajectory. 3.4 Summary A l l of the motion control problems in this thesis, as well as the walking problem in [Sha04] use the nearest-neighbour policy representation. We have formally described this representation and presented a concrete example of the N N P R in use. In the next two chapters, we look at some positive results of the N N P R and policy search being applied to motion control. Chapter 4 addresses the cart and pole problem and Chapter 5 presents results for vehicle steering. 6 A typical policy for this problem has around 10 nodes. Each node has 2 parameters to locate it within the state space and a 3 parameter action, so the total number of policy parameters is typically around 10 X (2 + 3) = 50. Chapter 4. Cart and Pole 41 C h a p t e r 4 Cart and Pole The cart and pole problem is well-known in the control and reinforcement learning disciplines because of its simple structure and inherent instability. This problem has 4 state dimensions, all of which are observable to the policy. We consider the problem as one that is more complex, in terms of observable state dimensions, than the 2D trajectory tracking problem of Chapter 3, but less complex than the vehicle steering problems described in Chapter 5, which have 5 to 7 dimensions. This chapter describes the cart and pole problem in detail and examines the results of using policy search with a nearest-neighbour policy representation to find a solution. 4.1 Problem Overview For this problem, an inverted pendulum, which is called the pole, is hinged onto a cart that can slide along a track in 1-dimension, as shown in Figure 4.1. The objective is to move the cart to a goal position and keep it there, while maintaining the balance of the pole. This must be accomplished by the careful application of a continuous-valued force, F, to the cart in either direction. Of course, the pole is unstable because gravity will act to increase any deviation from the upright position. If the pole is falling to the left, for instance, a force can be applied to push the cart to the left, which in turn applies a force to the bottom of the pole, acting to keep it upright. If the pole falls to the horizontal position, then it contacts the cart, and the balance of the pole cannot be recovered. This problem is similar to balancing a broomstick on a finger tip by attempting to keep the bottom of the broomstick underneath its center of gravity. However, in this broomstick task, there is no clear goal position and the fingertip can be moved in three spatial dimensions. Section 4.3 will define the problem more formally. As an extension to the cart and pole problem, we allow the cart goal position to be interactively changed by a user. In this way, the cart can be made to follow a trajectory of goal positions. The user can also apply force perturbations to the cart, in order to test the robustness of the learned policy. These interactive extensions are explained further in Section 4.8. Chapter 4. Cart and Pole 42 4.2 Related Work and Problem Variations A version of the cart and pole problem was first studied in the reinforcement learning framework by [BSA83]. In this version, called the pole-balancing problem, the objective was simply to keep the pole from falling below a certain angle to the vertical, while also keeping the cart within set track boundaries. This problem is simpler than ours in that it does not require the cart to be brought to a goal position and kept there. However, our problem does not have the track boundaries, which can serve to make control more difficult. Another difference is that the pole-balancing problem typically allows only a discrete left or right force to be applied to the cart (sometimes referred to as bang-bang control), while we allow a continuous range of forces that can be applied using a P D controller with some parameter restrictions (see Section 4.3). [And87] also looked at the pole-balancing problem, and extended the learning system in [BSA83] by using a feed-forward neural network to estimate value functions. Another version of the cart and pole problem was examined in [DoyOO]. The cart-pole swing-up task, as it is called, involves swinging the pole up to the vertical position from any arbitrary angle by applying forces to the cart. In this version, the pole is permitted to be in a hanging position, below the horizontal. This task is difficult because of its highly nonlinear nature. If the force that can be applied to the cart is limited, the pole must be swung back and forth to bring it to the upright vertical position. Doya used a continuous TD(X) formulation with a normalized Gaussian network to approximate the value function and policy for this problem. Coulom also solved the cart-pole swing-up task using the continuous TD(X) algorithm, but with a neural network representation of the value function[Cou02]. In the control literature, the cart and pole problem, which is often called the inverted pendulum problem, is usually viewed as the problem of removing error in the pole angle by pushing the cart with a real-valued force, similar to the pole-balancing problem described above except without track boundaries. It is considered a particularly interesting problem because of its instability, nonlinearity, and similarity to the problem of balancing a missile atop a booster rocket [DW91]. If the angle and angular velocity of the pole are assumed to be small then the problem can be approximated as a linear system and analyzed using linear control theory. For example, in [Oga70], conditions for constants of a P D controller that would insure stability of the pole for a small range of operating conditions were derived. In [DW91], the constants for a P D controller that was able to recover from a large range (up to 80 degrees from the vertical) of pole angles were determined experimentally. However, a pole-balancing problem does not consider simultaneously removing error in the cart position and pole angle, as does our version of the problem. [SSR99] solved this more difficult problem Chapter 4. Cart and Pole 43 using a nonlinear optimization technique on a fuzzy logic controller. The controller is bootstrapped using a optimal control table estimated by using dynamic programming on a cell discretization of the state space. The controllable region of the state space for their optimized controller was found to be significantly larger than that of a linear quadratic regulator (LQR)'that was analytically derived using a linearization of the dynamics. We have not undertaken a detailed comparison between the controllable regions of a L Q R or other control technique and our learned policy. Particularly with pole swing-up problems, nonlinear control must be employed to bring the state of the system close to the goal state, where a linear controller is capable of stabilizing the state of the system. For example, in both [Sti99] and [RK03], a pole is swung up from a hanging position by a nonlinear control technique and once the pole is near the standing position a L Q R is used to stabilize it. The cart and pole problem can be made even more difficult by attaching a second pole on top of the first. This inverted double pendulum system is very unstable in a similar way to backing up a double trailer truck, which is the problem we look at in Section 5.5 of the next chapter. In [RN0098], fuzzy control rules for controlling such a system are acquired automatically using genetic algorithms. 4.3 Problem Formulation The state, s, of the cart and pole is represented by the vector (x, x,a,a), where x is the position of the cart relative to a fixed reference position 1, x is the velocity of the cart, a is the angle of the pole relative to the upright vertical position, and a is the angular velocity. This setup is illustrated in Figure 4.1. This state is fully observable to the policy, i.e., o = s. reference position x ' cart Figure 4.1: Setup of the cart and pole problem. In our problem formulation, a policy controls the cart indirectly through a P D controller. Thus, the action Hi € A of a policy node is the parameters, (x, kp, kd), of the P D controller. The force, F, 1 A n extension where the reference point can be interactively moved is discussed in Section 4.8 Chapter 4. Cart and Pole 44 on the cart is then calculated as F = kp(x — x) — kdi. The force, F, acts to accelerate the cart. It also indirectly causes a horizontal force to be applied to the bottom end of the pole, which interacts with the force of gravity to determine the angular acceleration of the pole. The equations of motion of the cart and pole are given in Appendix B . Figure 4.2 illustrates how the state of the cart-pole system will evolve over time. (x,X, OC, Ci)f.j > observation function (0 = 1 state transition function V cart-pole equations of motion ^XyXy OC} C ^ ^ ^ ^ ~ (x^ Xf OCj OC) t-j policy 7t (x,kp,kd)t-j time step t-l j Figure 4.2: Dynamics for the cart and pole system. The policy's objective is drive the state of the system towards the goal state, s* = (0,0,0,0), in minimal time. The goal state is characterized by having the cart stationary at the reference position with the pole upright and not falling in either direction. For the cart-pole system to be controllable with reasonable limits for F the pole should always be close to the upright position and without large d, especially in the direction the pole is leaning. If the pole angle reaches 90° from the upright position it hits the cart and cannot swing through. The simulation is ended as it is impossible for the pole to be pushed up from a horizontal position. 4.4 Policy Evaluation In practice, the goal state, s*, is unstable and the trajectory of a good policy will oscillate around it. In the P O M D P setting, it would be sensible to have the reward function, r, give positive rewards to i a region of states that is close to s* and zero reward to any state outside of this region. The reward in the goal region would be larger the closer the state is to s*. The policy could be encouraged to Chapter 4. * Cart and Pole 45 drive the state to the goal region quickly by making the discount factor, 7 , less than 1. Additionally, it would be useful to penalize the policy if the pole falls over2. This would shape the problem by encouraging the learning of pole-balancing explicitly. In unstable systems, such as the cart and pole, backwards truck-driving, or biped walking, balance is critical to success and should be learned by the policy early. Our policy evaluation metric, E(Q), is based on similar ideas as the above reward function3. A policy that drives the state close to s* within a fixed evaluation period, Tevai, is favoured. A n additional penalty is applied if the pole falls during this period. To encourage balancing the pole for as long as possible, the penalty is proportional to the time remaining in the evaluation period when the pole falls. The policy is evaluated from p initial states, each resulting in a trajectory. The weighted error of the ith trajectory is given by e(cr l(9)) = wfau(Tevai - tfaii) + wxx*T^ai + Wii?^ + waalT^ai + WadzTeva[, where Wfaii, wx, w±, wa, and Wa are weights for the error terms and feai« XT^i > Q T e „ a i , 4 e „ , ) = 4 . „ a l (©) is the final state of the ith evaluation trajectory4. The variable tfau measures the time when a reached 90°, if the pole fell during the evaluation period, and tjau = Tevai, otherwise. We define the evaluation function to return the negated sum of all the trajectory errors just as in Equation 3.4: E{Q) = -JTe\Q). The p initial states are at fixed intervals across a range of cart positions, [—xumit,xiimit], with x, a, and a always initialized to zero. The distance between two neighbouring initial cart positions is xSpace. This simple distribution of initial states allows evaluation from a fairly broad range of states that can almost certainly be controlled. A typical issue for evaluating policies is the range and density of initial states from which to evaluate. Including states that are difficult or impossible to control not only increases the computation required for each evaluation but can also complicate the search process. We shape the policy search by phasing in more difficult initial states through a gradual increase to xumit (see Section 4.5). Our evaluation does not consider the potentially more difficult initial states, where a or a are nonzero. Considering these states, at least later on in the 2In the original RL problem of pole-balancing [BSA83], the only reinforcement was a penalty for pole falling over 3Note that the specific policy evaluation parameters used to obtain our results are included in Table 4.1 4 Our policy search finds fairly robust policies (see Section 4.6) using a light-weight evaluation that considers only the final state of each trajectory. However, integrating the error over the entire trajectory might better ensure that the resulting policy brings that state of the system, s, to the goal state, s', as quickly as possible and stabilizes it there. We have not experimented with this alternative. Chapter 4. Cart and Pole 46 search process, would most likely result in a policy that is more robust (i.e., with a larger region of controllable states). However, currently our evaluation metric only considers the simple distribution of initial states described above. 4.5 Policy Search The policy search algorithm we employ is straightforward. The specific policy search parameters used to obtain our results are included in Table 4.1. The policy parameters, ®init, are given sensible but highly untuned initial guesses. We employ series shaping by increasing the range of initial states for the policy evaluation at each optimization episode. Within each episode, we conduct an optimization on the policy parameters using an algorithm very similar to the SGA, as presented in Algorithm 1 in Chapter 2. Throughout the entire search, the step size, p, of the stochastic perturbations in policy parameters is decayed. Pseudocode for the policy search is given in Algorithm 2. A l g o r i t h m 2 Cart-Pole Policy Search i n p u t 7r {policy} i n p u t xumit {the limit on the initial position of the cart for evaluation} i n p u t Xinc {the amount that xumit is incremented for each episode} i n p u t p {step size} i n p u t A {decay factor} i n p u t btotai {number of iterations per optimization episode} i n p u t cstep {number of iterations without improvement before p decay} i n p u t dtotai {total number of optimization episodes} d e f i n e c <— 0 {number of iterations without improvement} f o r d <— 1 to dtotai d o Ebest <— E(n.Q,Xumit) f o r b * - 1 to btotai d o ^new <- perturbationl(n, p) Enew < E(^7Tnew.0, Xiimit) i f E n e w > Ebest t h e n 7"* * l^new Ebest •* E n e w c<-0 e l s e c <- c+ 1 e n d i f i f c 7^ 0 and c mod cstep = 0 t h e n p = Xp e n d i f e n d f o r ^limit * %limit "f" Xinc e n d f o r O u t p u t 7T Chapter 4. Cart and Pole 47 We initially allocate 2 nodes to the policy and make reasonable guesses for the parameters, Qmtt 5 . In this way, we use some prior knowledge of solution structure to bootstrap the search. The values chosen for the parameters, Qmit, are very approximate, since we rely on the policy search for fine-tuning. Initial parameters for a successful search are shown in Table 4.1. With only 2 nodes it is reasonable that one generally push the cart to the left and the other to the right 6. Thus the P D target, x, of node 1 is initialized to be negative and that of node 2 positive. The P D constants, kp and kd, for both nodes are set to the same intermediate values. The only other knowledge built into the initial parameters is that extreme situations should be countered with a force in the reasonable direction. For this reason, we initialize the location, £i, of node 1, such that node 1 will be activated to push the cart to the left if the cart is positioned to the right and moving further right or the pole is leaning to the left and rotating further to the left. To do this, we set the node location, Ci, to have a positive cart position, 'x, and velocity, x, and a negative pole angle, a, and angular velocity, d. The initial location for node 2, is simply that of node 1 but mirrored about the state space origin. These initial node locations and actions, although very coarse, provide an estimate of a decent 2-node policy that can aid the beginning of the search. To shape policy learning, we conduct the search in a series of dtotai optimization episodes. With each episode, the range of initial cart positions, [—xiimit,xumit], considered in the evaluation is increased by incrementing xumit by x , n c . In this manner, the initial episode focusses on states that are close to the goal state, s*, and relatively easy to control. Each subsequent episode adds states further away from the goal, thus increasing the difficulty of the control task. However, the policy parameters from the previous episode are carried over, providing a good initial starting point for the next episode. Another way to perform series shaping would be to refine the granularity of the policy representation by adding nodes at each optimization episode, as is done for the vehicle steering policy search and described in Section 5.4. For this problem, we do not experiment with adaptive refinement, so the number of nodes, n, remains 2 throughout the policy search. Each episode is conducted as a slight variation of SGA, which optimizes policy parameters 6 = [CiiC2,P-i, 1^2] with regard to the evaluation function, E(Q). A n episode corresponds to a single iteration of the outer loop in Algorithm 2. For this problem, the algorithm does not use a convergence check as in Algorithm 1, but instead runs for a fixed number of iterations, btotai- Also, the perturbation step size, p, is not decayed every iteration, but only after cstep iterations have elapsed since the last policy improvement or step size decay. Finally, the perturbation algorithm is slightly modified because 5 The initialization of the policy parameters is not included in the Algorithm 2, since this could easily be replaced with another suitable initialization. 6Here, we use the term left to mean "in the negative x direction" and right to mean "in the positive x direction". Chapter 4. Cart and Pole 48 the policy parameters are not normalized and the details are included in Appendix B.2. 4.6 Results Our policy search algorithm can find a 2-node policy that works well for a limited range of initial states. We used the evaluation and search parameters shown in Table 4.1. The search required 16000 total policy evaluations. Figure 4.3 shows the best overall evaluation, E(Q), as the policy search progresses. For later episodes, E(Q), represents a more difficult evaluation that takes into account a broader range of initial states, which is why the best value of the policy may drop at the beginning of an episode. -o.ol--0.1 © W > o OH -10 -100 Episode 3 (xUmit = 0.3) Episode 2 (xlimit = 0.2) Episode 1 (xlimit = 0.l) Episode 4 (xUmi, = 0.4) 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 1 0 0 0 0 1 2 0 0 0 Policy Search Function Evaluations 1 4 0 0 0 1600C Figure 4.3: Policy performance versus evaluations during the policy search for the cart and pole system. Note that for each subsequent optimization episode the range of initial cart positions, [—xiimit,xumit} is expanded, thereby making the control task more difficult. The parameters of a learned policy are also listed in Table 4.1. A state trajectory of this policy converging toward the goal state is shown in Figure 4.4. Figure 4.5 evaluates the robustness of a generated policy, by showing the range of controllable states, where the initial cart velocity, Xmit, and the initial angular velocity, are zero. A state, Sinu, is considered to be controllable if after 10 s of simulation time the policy has brought the system state from Sinu to a goal region specified Chapter 4. Cart and Pole 49 Vdues Problem Parameters Cart Mass (kg) 1 Pole Mass (kg) 0.1 Pole Length (m) 0.5 Allowed Range of Spring Constants (1/s ) 0.1-2.0 Allowed Range of Damping Constants (1/s) 0.1-2.0 Simulation Time Step (s) 0.001 Evaluation Parameters Equation Time Tmai (s) 15 Space between Initial Cart Positions xspace (m) 0.01 Error Weight for Pole Falling wfaU 1 Error Weight for Cart Position wx 1 Error Weight for Cart V docity w-K 0 Error Weight for Pole Angle wa 1 Error Weight for Pole Angular Velocity 0 Search Parameters Initial Action for Node 1 p., = (x,kp,kd) (-0.4, 0.5, 0.5) Initial Action for Node 2 ^ = (x,kp,kd) (0.4, 0.5, 0.5) Initial Location for Node 1 = (x,x,a,a) (0.1,0.05, -0.1,-0.05) Initial Location for Node 2 £ 2 = (x,x,a,a) (-0.1, -0.05, 0.1,0.05) Step Size p 1 Decay Factor X 0.91 Total Iterations per Episode batai 4000 Iterations without Improvement before Decay cstep 200 Total Episodes dmal 4 Initial Cart Position Range xiMt (m) 0.1 Cart Position Range Increment xinc (m) 0.1 Results Optimized Action for Node 1 \\,{ =(x,kp,kd) (-0.4037, 0.5844, 0.6894) Optimized Action for Node 2 U2 = (x,kp,kd) (0.6832, 0.4297, 0.5086) Optimized Location for Node 1 r { = (x,x,a,a) (-0.0112, -01928, -01358, -0.1320) Optimized Location for Node 2 £ 2 = (x'x ,a,a) (0.0173, -0.1412, 0.2203, 0.0664) Total Policy Evaluations 16000 Table 4.1: Simulation parameters, evaluation parameters, policy search parameters, and numerical results for the cart and pole problem. by |a;| < 0.02, \x\ < 0.003, \a\ < 0.005, and |d | < 0.005. While the learned policy is quite robust in controlling a wide range of initial cart positions, it is less robust in controlling error in pole angle. By comparison, the optimized fuzzy logic controller in [SSR99] was capable of controlling a much larger Chapter 4. Cart and Pole 50 angular error. However, their controller was allowed to apply a force of up to 50 Newtons (N) in either direction, while our policy was limited by the strengths of the component P D controllers, i.e., by the constraints on the constants, kp and kd- The operation of our learned policy typically results in a maximum magnitude of about 0.6 N being exerted by the P D controllers. •8 '3 0.1 Simulation Time t o Figure 4.4: Example state evolution for a learned policy controlling the cart and pole system from and initial state with xinit = 0.230496, xinit = 0.001754, a , n i t = 0.015482, and Q ; n i t = 0.000626). This shows the cart converging towards the goal state, s*. x is measured in meters (m), x in meters/second(m/s), a in radians (rad), a in radians/second (rad/s), and t in seconds (s). Chapter 4. Cart and Pole 51 t 1 1 1 1 1 1 1 1 r Uncontrollable Region Controllable Region Uncontrollable Region _ 1 I I I I I I I I 1 1 3 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Initial Cart Position xinit Figure 4.5: Region of initial cart positions and pole angles for which the system is controllable using the learned policy. For all initial states tested, Xinit = 0 and c\init = 0. The cart position, x, is measured in meters (m) and the pole angle, a, in radians (rad). 4.7 Discussion and Future Work We looked at solving the cart and pole problem as a proof of concept of our nearest-neighbour policy representation and policy search algorithm. We were most interested in applying the N N P R to steering vehicles, as will be presented in Chapter 5. We did not perform a thorough analysis of using nearest-neighbour policies for the cart and pole problem and we did not undertake a careful comparison with other control methods that solve a similar unstable, nonlinear control problem. It would be interesting to compare the performance of our policy representation with that of a simple linear controller with similar constraints on the force, F, that can be applied. A linear controller would probably be more efficient than our learned 2 node policy which relies on a rapid alternation between opposing forces, a behaviour which is described in Section 4.7.1 below. Our N N P R can be viewed as a mixture of limited linear controllers,7 each of which operates in a different Voronoi region of the state space and has a distinct target cart position. Since our policy represenation 7 The PD controller components are linear controllers with a force output that is only dependent on the cart position, x, and the cart velocity, i , rather than the full system state. 0.1 0.08 0.06 0.04 ° 0.02 u ^ ° o Ph -0.02 -0.04 -0.06 -0.08 -0.1 Chapter 4. Cart and Pole 52 is richer than a single linear controller (with similar constraints on the observable variables and constants), we speculate it would be capable of performing better on the cart and pole task. However, this needs to be validated experimentally, especially since it is not clear that our search algorithm finds a globally optimal nearest-neighbour policy. Because we obtained a reasonably successful policy using only 2 policy nodes, we did not attempt to refine the policy using additional nodes. In future work, we might look at how policy refinement affects the range of controllable states (see Figure 4.5), as well as the efficiency of the policy operation. The following section discusses the learning and behaviour of the 2-node policy. 4.7.1 A 2 - N o d e P o l i c y It is interesting that a policy with just 2 nodes is able to solve a seemingly nontrivial problem like the cart and pole problem. The following is a qualitative description of its operation in driving the system state towards the goal, s*, from an initial state with x ^ 0, x = 0, a = 0, and d = 0. The policy begins by moving the cart directly away from the target position, so that the pole begins to fall towards the target position. During this period, typically only a single node is active. For the rest of the state trajectory, the policy rapidly alternates (chatters) between the action of the two nodes. The policy begins pushing the cart back towards the target position with the pole now leaning in the direction of motion. As the cart is approaching the target position, the policy will have accelerated enough so that the pole is now leaning away from the target position and against the direction of cart motion. This will allow the policy to decellerate the cart without the pole falling forward. The policy tends to overshoot the target position, and the cart begins to oscillate back and forth around the target position, with the pole counter-oscillating to lean towards the target position. The motion of the cart is damped until the oscillation around the target position is very small, i.e., within the numerical precision of the simulation. This description approximately matches the state trajectory shown in Figure 4.4 except that, in the figure, the pole begins leaning away from the target cart position, meaning that the policy must initially push even further away from the target position than if the pole began upright. Most of the time while the trajectory is converging on the goal state, s*, as illustrated in Figure 4.6, the policy is chattering between the actions of the two nodes. While this chattering occurs, the state trajectory is following the boundary between the two corresponding Voronoi cells 8. This boundary is a 3-dimensional hyperplane in the 4-dimensional state space. So, altogether, a trajectory from an initial state to s* consists of the first section, where one node is activated to drive the state towards 8Note that a more energy efficient solution would better dampen the oscillation along the boundary and follow it more directly. For more discussion of chattering see Section 6.1. Chapter 4. Cart and Pole 53 Figure 4.6: Trajectory following the Voronoi cell boundary to the goal region, (a) The state is first driven from the initial state towards the Voronoi cell boundary by the action of one Voronoi node, (b) Once at the boundary the state is pushed along the Voronoi boundary by actions that alternate between the two nodes. The state converges on the goal state via the Voronoi boundary. the cell boundary, and the rest, which follows along the boundary towards s*. It is suprising that it is possible to easily get from a range of controllable initial states to this hyperplanar boundary and then follow it all the way to the goal region9. This means that this boundary must intersect the goal region, lie within reach of all controllable initial states, and the trajectories taken through the boundary must all be in the region of controllable states. To test the conjecture that the s* lies close to the Voronoi boundary, we calculated the squared distance from the node locations to s* and found them to be near equal: | | d - s*||2 = 0.073163 ~ |C 2 - s*\\2 = 0.73178. The policy search arrives at polices with the above-mentioned strategy through perturbations of the policy parameters. Perturbing node positions, Q, usually moves the Voronoi cell boundary. The boundary must intersect the goal region, so s* can be reached by following the boundary. Perturbing action parameters, mainly affects how trajectories interact with the boundary. Actions are found so that the states close to the boundary are driven towards it. Of course, once the state is near the boundary chattering results in a trajectory along the boundary. The actions must be chosen so that this trajectory converges on the goal region. It would be interesting to investigate how this convergence from controllable states within the boundary to the goal region occurs. 9 B y goal region we mean a small neighbourhood of states near s*. Chapter 4. Cart and Pole 54 4.8 Interactive Extensions We have developed a cart and pole simulation in a interactive GUI framework that allows the user to better explore the capabilities of the solution policy. For example, the reference position, as shown in Figure 4.1, can be changed interactively by the user and the policy will react by driving the system state towards the latest reference position. In this way, the cart can be made to track a trajectory of reference positions, all the time with the pole remaining upright 1 0. This sort of interactive control would be very useful for a more complex domain, such as the animation of biped walking. The underlying policy would be responsible for controlling the balanced and physically realistic motion of the biped towards a reference position. The animator would then be able to interactively move the reference position so that the overall result would be a physically realistic walk that approximately follows the motion trajectory laid out by the animator. We note that the idea of tracking a trajectory of goal states is not new, particularly for robot motion execution. Another feature of our framework allows the user to apply random force impulses to the cart to interactively test robustness of the controller in this regard. The user is also able to change elements of the system state, such as the cart velocity or pole angle, simply by clicking and dragging on graphical indicators of these state variables. This allows the user to test policy robustness to various kinds of system state perturbations. A 1 0 T h e pole will stay balanced as long as the user does not drag the reference position far enough away from the cart position that the system is no longer in a controllable state. Chapter 5. Vehicle Steering 55 C h a p t e r 5 Vehicle Steering In this chapter, we describe the use of our policy search method to learn nearest-neighbour policies for steering cars and trucks both backwards and forwards down a winding track. To illustrate the type of steering problems we consider, Figure 5.1 shows the steering behaviour for policies trained to drive trucks backwards on a class of tracks. Such vehicle steering problems are part of a class of problems known as non-holonomic control problems because the instantaneous direction of the vehicle motion is constrained1. The steering problems have from 5 to 7 observable state dimensions, so in this regard, they are the most complex control problems examined in this thesis. (a) (b) Figure 5.1: Steering backwards-driving trucks along unfamiliar tracks: (a) One-trailer truck (b) Two-trailer truck. lA car or truck is capable of only moving forwards or backwards, perhaps along a circular arc if it is turning, but not sideways. For a more detailed discussion of nonholonomic constraints, see [Lat91], p. 27 and 406. Chapter 5. Vehicle Steering 56 The steering policies are provided with local observations of the track (see Figure 5.3) and must produce steering commands that will navigate the vehicle down the track without it crashing into track barriers. The fact that a policy is given only a local state representation is important for several reasons: 1. A reactive policy that uses only local state information is limited in anticipating future situations to decide on a current action. For example, a policy does not know what is coming up around a corner in deciding how wide to take the turn. 2. The limitation of the policy to fairly realistic sensory information2 makes it easier to apply to . real world automatic steering problems. Also, if global state is not required then localization of the robot relative to the global coordinate frame is unnecessary. 3. A policy that relies only on local state is better able to generalize to similar situations on the same track or another track entirely. Furthermore, such a policy is better at reacting to unexpected changes in the environment because it does not base its control decisions on global environment information that may have become outdated. (b) Figure 5.2: (a) Jack-knifing of a truck driving backwards when the steering is kept fixed, (b) Jack-knifing of a two-trailer truck driving backwards. Of the steering problems considered, the steering of a backwards-moving truck-with-trailer is particularly difficult because of the inherent instability in the motion of the trailer(s). Figure 5 shows 2Although our vehicle observations are realistic in that include only range data and internal angle measurements, we have not considered sensory noise, which is an important issue in robotics. Chapter 5. Vehicle Steering 57 how backwards-moving trucks can jacknife without corrective control. The trailer is unstable in much the same way that the pole is unstable in the cart and pole problem. However, in the case of a truck, this instability must be controlled through setting the angle of the steered wheels on the cab of the truck. The requirement that the trailer instability must be handled at the same time as navigating the truck around sharp corners makes this problem a considerable challenge. Adding a second trailer to the truck increases the instability further, making the problem even more difficult. 5.1 Related Work Nonholonomic systems, such as the truck-and-trailer configurations with which we experiment, have been the focus of much research over the past 15 years. One common approach to controlling these systems is to use a trajectory-planning stage followed by a trajectory-tracking scheme. [Lat91] de-scribes motion planning methods, which plan a trajectory that brings one or more objects from one configuration to another while avoiding obstacles and possibly subject to holonomic or nonholonomic constraints. Suchmotion planning techniques can be used to plan a trajectory for a nonholonomic vehicle that results in a desired state. [KN03] examines the problem of determining a control input in real time to bring a model of a nonholonomic vehicle from an initial state to a goal state. Errors in execution of the trajectory are not taken into account, but trajectories can be replanned in real time to consider new information in a changing or unknown envirionment. In [DW97], the problem of parallel parking or docking (see the "truck backer-upper" below) nonholomic vehicles with zero, one, or two trailers is solved by planning a trajectory offline and using a linear quadratic regulator to track the trajectory. The trajectory-tracking system is tested online using a real tractor-trailer robot. [ASW01] develops a controller to stabilize the motion of a real truck-and-trailer robot driv-ing backwards along a path formed of lines and arcs of circles. A l l these methods are dependent on trajectory-planning, which requires a world model. With our reactive, sensor-based approach to steering we wish to avoid construction of a world model or localization relative to an a priori world model. Policy search is another approach to solving nonholonomic control problems. For example, policy search has been used to solve variations of the "truck backer-upper" problem presented in [NW89, Koz92, HFGS96, HGS97]. In general, the goal is to backup a one or two-trailer truck such that the end of the last trailer approaches a goal position, which represents a loading dock. Most versions of this problem assume knowledge of the global state of the truck[NW89, Koz92]. [NW89] used self-learning of a neural network policy and [Koz92] used genetic programming to solve this problem. [HFGS96, HGS97] does not assume global state but does allow the self-organizing neural network Chapter 5. Vehicle Steering 58 controller access to the angle of the last trailer with respect to the goal state. [HFGS96] describes learning a controller on a real tractor-trailer robot as well as in simulation, while [HGS97] solves the more difficult problem of backing-up a two-trailer truck, but only in simulation. We also use policy search to find controllers for backing up tractor-trailer vehicles but, in contrast to these examples, our steering problem is defined in terms of steering performance around a sharply-curved track rather than that of reaching a single well-defined goal state. Furthermore, we require our controller to work with only local sensory information instead of global state information. T D learning, a valued-based R L method, has been used to learn policies that manoever cars in the Robot Automobile Racing Simulation (RARS) setting [Cou02, PH98]. Table 5.1 compares the problem of car control in R A R S to our vehicle steering problem. The physical model of car motion and control in R A R S is similar to a point-mass space-ship model. The magnitude and direction of the car acceleration is controlled directly within constraints imposed by engine power and tire slippage3. In our problem formulation, we assume a fixed vehicle speed and no tire slippage. However, we assume strict nonholonomic constraints, which are neglected in the RARS model. The tracks in RARS have rounded corners of various radii while our tracks have sharp corners, which are generally right-angled. Our Vehicle Steering Car Control in R A R S Policy Representation nearest-neighbour neural network approximation of value or Q function Policy Learning direct policy search TD(X) value-based R L method Vehicle Types car, single-trailer and double trailer trucks car Vehicle Motion forwards or backwards forwards Physical Model constant vehicle speed, no tire slippage point-mass space-ship model, frictional model of tires Vehicle Control , steering angle car acceleration Observation Model distance sensors truck hitch angles car position and velocity relative to local track features Track Style sharp corners round corners of various radii Track Generalization generalizes well no mention of evaluation Table 5.1: Our vehicle steering problem versus car control for R A R S [Cou02, PH98]. In the R A R S problem, actions are learned based upon information about the car's position and velocity relative to local track features. In [Cou02], the car's curvilinear distance from the start line of the' track is also available to the policy. Both of these methods assume knowledge of the local track curvature and learned policies are only tested on the same tracks that were used for training. We develop solutions based on simpler sensory information provided by a set of distance 3 A simplified model of the RARS dynamics is described in [Cou02], p. 110, while a more complete version of the RARS dynamics can be found at http://rars.sourceforge.net/. Chapter 5. Vehicle Steering 59 sensors, and measurements of the current steering angle and internal hitch angles for the trucks. Also, we demonstrate that a learned steering policy is capable of generalizing to new tracks that are similar variations of the training track and coping with unexpected situations for which it was not trained. Moreover, we apply our steering method to significantly more challenging models, such as backwards-moving tractor-trailers. 5.2 Problem Formulation Figure 5.3: Truck sensory information. We work with several steering problems, including steering cars and trucks both forwards and backwards. For clarity, we will initially emphasize the specific problem of steering a truck-with-trailer backwards around a winding track. Appendix C.2 contains a detailed description of the vehicle dynamics. Physical constants and other important simulation and optimization parameters used for the various vehicle steering problems are included in Table 5.2. The truck observes a local model of the track in the direction of travel using an array of distance sensors, d\,.., ,d^. Each of these sensors has a sweep of 22.5° and returns the distance of the closest object (i.e., part of the track barrier) within this range. A diagram of the truck with sensors is shown in Figure 5.3. A truck observation, o, consists of the vector (a, (3, di, d%, d$, di), where a gives angle Chapter 5. Vehicle Steering 60 0.8 0 20 40 60 80 100 120 140 160 180 200 Simulation Time Step Figure 5.4: Steering Angle versus Simulation Time Step for a forward-driving car steering policy. of the steered wheels (front wheels of the cab) relative to the long axis of the cab, and (3 gives the angle of the trailer relative to the cab. A truck action, a, consists of setting a desired steering angle, a for the steered wheels. The angle of the steered wheels, a, is not set directly, because this would cause an unrealistic jump in steering direction. Instead it is driven towards the desired angle a by the following dynamical system, which is the equivalent of a P D controller: a = kp(a — a) — k^a, (5.1) where kp and kj. are treated as constants for a particular steering problem. Figure 5.4 shows the relationship between a and a during the simulation of the steering of a forward-driving car by a learned policy. A truck-steering policy, ir : O —> A, maps an observation, (a, P,d\,d2,d^,d^), to a desired steering angle, a. Given the steering action of the policy, the simulation can determine the next system state by using Equation 5.1 to calculating the angular acceleration of the steered wheels, a, and integrating the vehicle dynamics, as illustrated in Figure 5.5. The truck dynamics assume that the front axis of the truck cab moves at a fixed speed in the Chapter 5. Vehicle Steering 61 state transition function V PD control function i time step t-l 1 t i i Figure 5.5: Dynamics for the vehicle simulation. backwards direction of the steered wheels. The tires of the truck cannot slip, so the speed of the truck is of no consequence with regard to traction during cornering. In fact, the speed is only important because of the discrete time approximation of the dynamics and also because of the delayed effect of steering actions. If the fixed speed parameter is increased the steering response will further lag behind the relevant track situation, increasing the difficulty of the problem. However, the steering response could be scaled to match the vehicle speed by adjusting the constants kp and kd in Equation 5.1. The truck continues at a fixed speed until it crashes into a track barrier or jack-knifes, at which time the truck's motion ceases. Both of these events are referred to as a collision. A truck is considered to have jack-knifed if a trailer angle exceeds 90°. 5.2.1 V e h i c l e Tracks The truck steering problem we address involves driving around a sharp-cornered track without a collision. The truck is always located between the two side barriers of a track. The track is a circuit, consisting of straight segments connected by right-angled corners, as shown in Figure 5.64. Our goal is that learned policies will be able to generalize their steering control to different tracks with similar properties to the track on which the policy was trained. In order to generate tracks of various classes, we use a parameterized track generator. The parameters that define a track class include the track width, Wtrack, the minimum track segment length, lmin, and maximum track 4 A l l tracks that we have used for training have right-angled corners, although we have tested some of the learned policies on tracks with corners of varying angle, as is shown in Section 5.5 Chapter 5. Vehicle Steering 62 0 Figure 5.6: A n example of a complete track for backwards truck-steering. segment length, / m o x . For example, Figure 5.7 shows instances of 3 different track classes. We have also tested steering policies on tracks with varying width and varying turn angle. These track classes have parameters for minimum and maximum track width, as well as minimum and maximum corner angle. Although a detailed description of the track generation algorithm is beyond the scope of this thesis, the following is high-level overview of how a track with fixed width, wtrack, and right-angle corners would be generated. The algorithm begins by sampling the length of the first segment from a uniform distribution over the range [lmin> lmax\- The first segment is placed and a left or right corner is chosen with equal probability. Then the length of the next segment is sampled and it is attached to the first such that it points in the correct direction. The track is grown by recursively attaching track segments until the end of last segment is close to the first segment and perpendicular to it. The first and last segments are then stitched together to form a complete circuit. If a newly generated track segment intersects another during the track generation, that segment is removed and another segment is generated. If at any stage of the recursion it is found that the algorithm is failing to connect up the track after a fixed number of attempts, back-tracking is performed so that a different path of track segments may be tried. The entire algorithm may return failure after a number of attempts at generating a complete circuit has failed. Chapter 5. Vehicle Steering 63 _ l n (a) (b) (c) B 13 Figure 5.7: Tracks from 3 different track classes, all with right-angled corners and fixed width: (a) wtrack — 1-2, lmin = 0.1, and lmax = 2.0 (b) w t r a c k = 1.2, lmin = 1.0, and lmax = 4.0 (c) IMtrack — 2.4, Imv 5.2.2 Policy Representation We will be using the nearest-neighbour policy representation to define steering policies for this prob-lem. Each node i will have a location Ci in the observation space, O. As an example, for the truck, C,i = (a,P,di,d,2, d3,d,4)i. Each node i will also have a corresponding action, fn G A, where m = dj sets the target angle for the steered wheels. This means that the observation space will be carved up into Voronoi cells, each cell i corresponding to the node i. When the current observation ot lies within a particular cell i, the policy will act by setting the target steering angle to d;. We enforce solutions to the vehicle steering problem to be left-right symmetric. On any particular Chapter 5. Vehicle Steering 64 Figure 5.8: Symmetric observations of the truck policy. cyclic track it may be advantageous to steer asymmetrically, for example, if the track has a counter-clockwise flow so that most corners are left-handed. However, we want steering policies that generalize to any track of the same class, including one that is an exact mirror image (i.e., all left-hand corners are replaced with right-hand corners, and vice versa) of the track on which the policy was trained. Since the dynamics of the vehicle are symmetric (see Appendix C.2), and the class of possible testing tracks is symmetric (it includes mirror images of all tracks), it makes sense to enforce left-right symmetry in the policy. For example, if the policy makes a particular left turn by executing a series of actions, it should make a symmetric right turn by executing the exact opposite actions, i.e. setting the target steering angle, d , to the negated values. More formally, a policy for truck-steering is left-right symmetric, if and only if, for all observations o = (a, f3, d±, cfo, 0I3, Ai), if 7r(o) = d then 7r(o). = — d , where 0 = (—a, —(3, d i , cfa, cfa, d\) is the mirrored observation of o. Figure 5.8 shows an example of a mirrored observation and gives some intuition as to why it takes this form and why a left-right symmetric policy is a useful constraint for steering policies that generalize to various tracks. In only considering left-right symmetric policies, we simplify the problem, since the search does not need to discover the benefit of this symmetry. Left-right symmetry is enforced through the pairing of policy nodes. Given n node pairs, every node j, 1 < j < n, has a partner node n + j, for a total of Chapter 5. Vehicle Steering 65 2n nodes. If a node j has location Q = (a, f3, d\,d2,d3, d 4) and action p,j = a, then its partner node n+j will have location Cn+j = {-a, —(5, dit d3, d3, di) and action /j,n+j = - d . Appendix C . l contains a proof that this representation ensures that the nearest-neighbour policy is left-right symmetric. 5.3 Policy Evaluation In general terms, we wish to find a policy that can steer the truck backwards as far as possible around the track in a fixed period of time. It must do this without jack-knifing or crashing it into the track barriers. To measure progress around the track we define a track center-line and compute the point, P, on the center line that is closest to the reference point, P, on the vehicle. We use P to measure the distance travelled by the truck along the track center-line, as illustrated in Figure 5.9. t d(s,s') = dl+d2 + d3 Figure 5.9: Distance measured along the center of the track. The evaluation function, E(Q), should assign a high value to policies which steer the truck quickly around the track from a reasonable initial starting state. An evaluation function that measures the policy performance from a single initial state is inadequate in that the solutions may become overly tuned to the situations encountered from that initial state. To overcome this, we evaluate the policy from p initial states, s\nit,\ < i < p, sampled from a initial state distribution, D. The initial states s\nit are sampled once and then held constant throughout the optimization process in order to fix the evaluation function for convergence. The policy is executed from each of initial state, Chapter 5. Vehicle Steering 66 s\nit, for Tevai simulation time steps, resulting in the corresponding trajectory, o-%(@) = (s\(Q) = s\nit, sl2(&), • • •, slTcval (©))• If a collision occurs at time tcoiude, the simulation is stopped and slt(Q) = s\cMide{&) for tcoiude < t <= TeVat- The policy is evaluated for each trajectory, o-1(Q), by using the distance covered over the trajectory, d(s\nit, slT^ai (0)). The complete evaluation function is given by: E(e, Tevai) = f > ( 4 » t > 4 . . . , (©)) - PerashV(sir.m, (6)) • (5.2) t=l where Pcrash is a constant collision penalty and n(s) is a crash indicator function that returns 1 if s is a collision state and returns 0 otherwise. Although a crash is implicitly punished because the vehicle stops and no longer can cover track distance, this punishment is often not significant when the crash occurs towards the end of the evaluation time period. The crash penalty is added to the E(Q, Tevai) to more effectively deter crashes no matter when they occur. Tevai defines the time horizon of the evaluation, which determines its short-sightedness. We choose Tevai to be of sufficient duration for the vehicle to navigate around at least two corners. This discourages the policy from making bad short-sighted choices. Initial State Distribution The specific choice of the initial state distribution, D, can strongly influence the learned policy. It can be used to manipulate the trade-off between providing control over a large domain versus providing higher-performance control over a smaller domain. In the former case, a learned policy may be able to guide not-often-seen but difficult initial states towards successful track navigation. In the latter case, a higher-performance solution could be achieved, one that perhaps does faster, more refined turns around corners at the expense of having to start the vehicle in a familiar situation, i.e., located near to the track center. The distribution D is defined by a initial state generation algorithm, which is parameterized for the specific steering problem. The following describes how initial states are generated for the backwards truck steering. For concreteness, the parameters for this problem are included. The algorithm parameters for the all problems are listed in Table 5.2. The angle, a, and angular velocity, d of steered wheels is always initialized to zero. The position of the center of the cab is uniformly distributed around the length of the track and uniformly distributed across 80% of the track width. The initial value for (3, the cab/trailer angle, is sampled uniformly in the range of -0.4 to 0.4 radians. However, any initial positions, which intersect with the track barrier or nearly do so are rejected. A near-collision is determined by computing a collision with an enlarged version of the truck, for which the back half of the trailer is 2.5 times as long as normal and the width of the truck is 1.4 times the Chapter 5. Vehicle Steering 67 Figure 5.10: Several initial states sampled from the distribution D for the single-trailer truck-backing problem. The evaluation for this problem uses 500 such initial states. normal width. Positions that are near-collisions typically result in crashes which the policy can do little to avoid or prolong. These near-collisions can distract the policy evaluation from its job, which is to reward policies that move the vehicle quickly around the track under reasonable conditions. Several legal initial states are displayed in Figure 5.10. 5.4 Policy Search In Chapter 4, we conducted a policy search using a nearest-neighbour policy with a fixed number of policy nodes in order to control a cart and pole system. The problem with fixing the number of policy nodes is that it is difficult to know a priori how much representational complexity is required for a satisfactory solution to the problem. Our policy search algorithm found a reasonable 2-node policy for the cart and pole problem. However, perhaps a much better solution could have been found among 4-node or 8-node policies. An adaptive policy search allows nodes to be added to the policy if this results in a better solution, so that the number of nodes required to adequately solve the problem does not need to be guessed before the search begins. Another advantage of an adaptive policy search is that the easier task of finding a coarse solution can be conducted initially, and then that solution can be gradually refined. For the vehicle steering problem, initial experiments that used Chapter 5. Vehicle Steering 68 policy search with a fixed number of nodes, n > 3, discovered that the correspondingly large number of policy parameters made a simple optimization prohibitively difficult. With fixed n > 3 the policy search was unable to find a reasonable solution and with fixed n < 3 the solutions generated were rather coarse. Adaptive refinement of a coarse solution seemed like an obvious way of improving such a steering policy. Adaptive refinement, as a way of adding representational complexity only where it is useful, has been used for grids [MM02, MA95] and growing algorithms for neural networks [Bis95]. We exploit the semi-parametric character of the nearest-neighbour policy representation, by con-ducting the policy search as a series of optimization episodes and adding a new node with each episode(as mentioned in Section 2.3.5). The pseudo-code for the optimization series is outlined in Algorithm 3. Algorithm 3 Optimization Series input 7r {policy} global Tevai {evaluation time} Ebest <— E(lT.Q, Tevai) repeat Knew add-node(n) Knew < — optimize(irnew) E-new * E(lTnew . 0 , Tevai) if E n e w > Ebest then Ebest * Enew end if until convergence output policy 7T Policy nodes should be added in a location in O that will allow improvement to the evaluation, E(Q), of the policy. The location, Cnew, of a newly added node is drawn uniformly from the distribu-tion of observations encountered during the evaluation of the previous control policy. The location of the first-added node is sampled from the observations of a policy that always steers straight. This simple heuristic biases the refinement to regions in O that are encountered frequently. It discourages the addition of new nodes to regions of the observation space that are rarely or never encountered. We note that this heuristic could be improved as it does not consider locations which are less frequently seen but for which the policy could improve its choice of action considerably. However, this is only a starting point for the node location, which will be further optimized by the policy search. The action, /J.new, is initialized to be the same as that of the closest node, or in other words, the action that would be taken by the previous policy at the state Cnew This bootstraps the initial action of the new node to be a reasonable value and the optimization is used to further refine that value. The action of the first node is sampled from a uniform distribution on the range [-1.0,1.0] Chapter 5. Vehicle Steering 69 radians. The addition of a new node and bootstrapping of its parameters is done in the addjnode routine. After a node is added, an optimization on the augmented policy parameters is performed in the optimize routine. Notice that the new policy resulting from addjnode and optimize is only kept if its evaluation is improved over its predecessor. Otherwise, we revert to the predecessor and try another node addition and optimization. The optimize routine is a variation on the SGA algorithm, as defined in Section 2.3.4, and is outlined in Algorithm 4. We limit the total number of perturbations and the number of perturbations that occurred since the last improvement to the evaluation of the policy. This limits the computation used to optimize a policy after the addition of a new node. Sometimes the added node only interferes with the policy in the sense that any optimum that can be reached by our local optimization is worse than the policy value before the node had been added. In this case the policy node will not be kept and we would not want to unnecessarily waste computational resources on the optimization. A l g o r i t h m 4 optimize^) input 7T {policy} global Tevai {evaluation time} global p {step size} global A {decay factor} global b m a x {maximum number of iterations allowed} global cmax {maximum number of iterations without improvement} global cstep {number of iterations without improvement before p decay} define b < — 0 {number of iterations} define c <— 0 {number of iterations without improvement} Ebest < — E(TT.@, TeVal) while b < b m a x and c < cmax do Knew *— perturbation2{tr, p) Enew 4 E(^7Tnew . 0 , Tevai) i f Enew > Ebest then 7T < T^new Ebest * E n e w else c <- c+ 1 end i f i f c / 0 and c mod csteP = 0 then p = Xp end i f b <- b+ 1 end while Output 7T The perturbation routine, which is defined in Appendix C.3, stochastically perturbs policy pa-rameters to create a new candidate policy. Several options for which policy nodes will be perturbed Chapter 5. Vehicle Steering 70 during the optimization are outlined in Figure 5.11. Though it is possible to perturb all policy pa-rameters, in the search for an locally optimal policy, this may be impractical if there are many nodes so the dimensionality of the parameter space is large. Alternatively, we could restrict perturbations to just the parameters of the new node, given that with the addition of a new node, we are trying to refine the policy in a local region of the observation space. However, certain steering behaviours may require the coordination of actions in contiguous Voronoi cells. Therefore, a reasonable com-promise might involve perturbing the parameters of the newly added node plus those of some of its neighbouring nodes, where the neighbourhood is denned according to the trajectories entering and exiting the cell. We take this final approach, since preliminary experiments indicated that the first two approaches were unsuccessful at refining the optimized one-node policy. (a) (b) (c) Figure 5.11: Options for perturbing policy parameters (a) Perturb all policy parameters (b) Perturb just the parameters of the newly added node (c) Perturb parameters of nodes in the newly added node's neighbourhood. 5.5 Results Using the policy search method described in the previous section, policies for various vehicle steering problems have been developed. For example, policies have been created for driving a car forwards and backwards; driving a truck with one trailer forwards and backwards; and driving a truck with two trailers backwards. A l l of these policies have been tested on various tracks of the same style, yet different from the tracks on which they were trained. Additionally, some of the policies have been tested on tracks which are quite different from the training track or which have obstacles that were not used in training. The results for a variety of steering problems follow. A comparison of simulation and optimization parameters, as well as some numerical results, of each steering problems is included in Table 5.2. Chapter 5. Vehicle Steering 71 5.5.1 B a c k w a r d s D r i v i n g Single-Trailer Truck We have been able to generate a number of steering controllers for backward driving trucks which have similar strategies. As is true of most optimization algorithms, some parameters must be hand-tuned, and our policy search algorithm is no exception. k M X Figure 5.12: Truck dimensions for the single-trailer truck backing problem. To give an idea of the difficulty of the problem, Figure 5.12 shows the dimensions of the truck and the track it must drive on. Tracks for both training and testing are generated using the algorithm described in Section 5.2.1. For this problem, the track is 1.8 units wide, which is 5.14 times the width of the truck. The straight segments between turns have a length uniformly distributed between 2 units and 4 units. Some physical parameters for the simulation must be chosen, as well. We take a simulation time step to be 0.01 s and the fixed speed of the truck to be 1 unit/s. The constants kp and kd for the Chapter 5. Vehicle Steering 72 dynamical system denned in Equation 5.1, which is used to drive the steered wheels, are 10 ^ and 3.5 i , respectively. We evaluate a policy from p = 500 initial positions for Tevai = 800 simulation time steps, which corresponds to about 8 s of simulation time and is typically enough time for the truck to round two corners. Initial states are generated by the algorithm outlined in Section 5.3. : r n : •'•fcC3-:-:.CX3--: (a) (b) "•'ct3;_\CE3? Figure 5.13: Steering a one-trailer backwards-driving truck along unfamiliar tracks of: (a) The same class as the training track (b) Wider and with shorter segments than the training track. We take the initial optimization step size to be p = 2.5. This defines the maximum magnitude of perturbations in the policy parameter space in a way that is elaborated in Appendix C.3. The decay factor is A = 0.9. The maximum number of iterations during an optimization episode is set to bmax = 150 and the maximum number of iterations allowed without improvement is cmax = 50. The number of iterations without improvement before the step size p is decayed is cstep = 10. With these parameter settings for the simulation and policy search, a policy with 14 nodes (or 28 including symmetric nodes) has been developed to steer the truck backwards. Each policy node has 7 parameters (6 for the node location and 1 for the node action) so the total number of policy parameters is 98. Figures 5.1(a) and 5.13(a) shows this policy driving the truck around a track that has the same properties, yet is distinct, from the track on which the policy was. trained. Figure 5.13 Chapter 5. Vehicle Steering 73 (b) shows the policy steering on a track that is significantly wider and with shorter segments. The policy appears to have difficulty navigating the u-turns for which it was not trained, although it does manage to stabilize its motion. Policy evaluations take the bulk of the computational resources during the policy search. There is one policy evaluation conducted for each of a number of iterations during an optimization episode. This policy search took 106 optimization episodes and 13476 policy evaluations (i.e., computations of E(Q)). The entire process took a total of 134877 s = 37.5 hr on a modern PC, without necessarily having sole access to the processor. Table 5.2 shows a summary of the parameters and some policy search statistics for this steering problem and others. -1500 -5000 *e - *© _1 1_ _J 1_ 1000 2000 3000 4000 5000 6000 Policy Search Function Evaluations 7000 8000 9000 10000 Best Value for All Optimizations # New Node Added x Reverted to Previous Controller O Accepted New Controller Best Value for Current Optimization Figure 5.14: Policy performance versus function evaluations during the policy search for a single-trailer truck driving backwards on a winding track. Figure 5.14 shows how the value of the best policy is improved over the course of a policy search. Each solid line in the figure correspond to a optimization episode. With each optimization episode a Chapter 5. Vehicle Steering 74 node is added and the policy is optimized. If the resulting policy is better than the previous policy it is kept. Otherwise, we revert to the previous policy. Notice that most improvement in the policy evaluation, -E(O), takes place with the acceptance of the first 3 nodes. After that, additional nodes tend to add only slight improvements to the policy. This may correspond to the fact that during normal policy operation only 3 nodes plus their symmetric nodes are mostly used. The other nodes are activated only in extreme circumstances. It can be speculated that the very slight improvements are made by the addition of nodes that are activated in rare situations, perhaps to postpone a imminent crash slightly and thereby improve the evaluation. (a) (b) Figure 5.15: Driving a one-trailer truck backwards in unexpected situations: (a) Track with varying width and turn angles (b) Track with obstacles. The learned truck-steering policy demonstrated reasonable robustness in driving situations quite different from the class of tracks for which it was trained. For example, in Figure 5.15 (a) the policy is controlling the motion of the truck down a track with varied steering angles and track widths, although the policy was trained on a track with 90° corners and a fixed width of 1.8 units. Also, the policy was able to cope fairly well with obstacles on the track as shown in Figure 5.15 (b), even though obstacles were not used in training. The frequency of crashes does increase with the unfamiliarity of the situation but the policy generalizes suprisingly well. In order to test robustness with regard to errors in the truck model, we applied the computed policy to one-trailer trucks with the position of the rear trailer wheels varied. The truck policy still steers well when this wheel position is changed by as much as 10% as measured from the middle of the trailer. We ran the policy search algorithm 4 times with identical parameters but different random number Chapter 5. Vehicle Steering 75 initializations to check its robustness in this regard. The track instance used for training, the initial evaluation positions, and the policy perturbations are all dependent on the random number seed. The resulting policies all had different node layouts but were similar in their ability to navigate tracks of the specified class. Double-Trailer Truck Steering a two-trailer truck that is moving backwards is a particular challenge because of the consid-erably instability in the trailer motion. Either trailer can jack-knife if the swing of the trailer is not carefully controlled as shown in Figure 5 (b). The instability of this steering problem is similar to that of attempting to balance a double-inverted pendulum. Figure 5.16: Steering of a backwards-driving two-trailer truck around a previously unseen track. The two-trailer truck is just like the one-trailer truck but with an additional identical trailer. It is equipped with an additional sensor to measure the angle between the first and second trailers, (32-The observation space, O, of the policy is thus 7-dimensional, with o = (a, Pi,P2, di, di, d$, cLi) being an observation. A n action is the same as in the single-trailer problem and consists of setting the desired steering angle, d. The track for this problem is 2.7 units wide and the straight segments have a length uniformly distributed between 3 units and 4 units. The starting position of the cab center is in the center 40% of the track width. The initial trailer angles, P and P2, are uniformly sampled in the range of -0.2 and 0.2 radians. As with the single-trailer truck, initial states are pruned to remove Chapter 5. Vehicle Steering 76 collision states or near-collisions. The physical constants for the simulation, including kd and kp for the steering, are identical. The evaluation and optimization parameters are also the same as for the one-trailer case. (a) (b) (c) (d) Figure 5.17: Initial states resulting in a collision for two-trailer truck: (a)-(c) Jack-knifing of front trailer (d) Crash into barrier. Our algorithm was successful in learning a policy that can steer the truck backwards around a class of winding tracks. Figures 5.1(b) and 5.16 demonstrate the learned steering policy operating on such a track. The policy uses a total of 30 nodes (or 60 including symmetric nodes). Each policy node has 8 parameters (7 for the node location and 1 for the node action) so the total number of policy parameters is 240. As with the one-trailer truck policy, only about 3 pairs of nodes are used for the vast majority of steering manoevers. The rest of the nodes are almost always activated in unusual circumstances that often result in a crash. The policy search was completed in 129 optimization episodes with a total of 15482 policy evaluations. The process took 220922 s = 61 hr using comparable computing resources to that used for the one-trailer problem. See Table 5.2 for a summary of the problem parameters and some statistics for the policy search. The learned policy works well from some initial states, but others give the policy difficulty as it is not able to guide the state trajectory into a stable steering pattern. Examples of some difficult initial states and the collision trajectories that result are shown in Figure 5.17. Note for a trajectory such Chapter 5. Vehicle Steering 77 as (c) in the figure, the policy would likely have been able to steer the truck around the corner if not for exceeding the constraint of < 90° resulting in a jack-knifing of the front trailer. These trailer-angle constraints only compound the difficulty of the problem. As the complexity of the steering problem increases it becomes more difficult to determine whether a stable motion without a collision can result from a particular initial state. It may be that any given policy is doomed to crash or jack-knife from this state making it uncontrollable. In many cases, the only feasable way to determine if a state is controllable is to generate by policy search or some other method a policy capable of controlling the state. Determining that a state is uncontrollable is considerably harder but this topic will not be discussed further in this thesis. The policy has not been tested as extensively as the policy for the single-trailer truck for robustness in situations for which it was not trained. However, it can be stated qualitatively that the double-trailer truck policy is not nearly as robust. Since the car does not have any hitch angle sensors, the observation space, O, of the policy is 5-dimensional, with o = (a, d\,d,2,d3, d$) being an observation. The policy search for the backwards-driving car required 43 optimization episodes with a total of 3665 policy evaluations. The process took 84325 s = 23.4 hr using similar computing resources to the other steering problems. The resulting policy contains 22 nodes. Each policy node has 6 parameters (5 for the node location and 1 for the node action) so the total number of policy parameters is 132. The steering behaviour of the backwards-driving car policy on an unseen track of the same class as the training track is shown in Figure 5.18. This policy is particularly robust to changes in the track width and turn angles, as shown in Figure 5.19. Also, Figure 5.20 (a) demonstrates that the policy is reasonably capable at avoiding obstacles on the track, despite the fact that obstacles were never encounterd in the training. However, certain configurations of obstacles that are not shown Car Figure 5.18: Car driving backwards on an unfamiliar track. Chapter 5. Vehicle Steering 78 (a) (b) Figure 5.19: Car driving backwards on two tracks with varying width and'corner angle. are obviously passable and yet cause the policy to crash, as might be expected since it has not been trained for these situations. This car policy can also successfully navigate a " Y " or "T" branch in the track, as illustrated in Figure 5.20 (b). This backwards car-steering policy was the best of all policies, including the forwards car-steering policy for generalizing to situations it had not seen in training. Figure 5.20: Car driving backwards in unexpected situations: (a) Track with obstacles, (b) Track with " Y " and "T" branches Chapter 5. Vehicle Steering 79 5.5.2 Forwards D r i v i n g The task of forwards-driving around the track is simpler than the equivalent backwards-driving task because of the stability provided by having the steered wheels lead the motion. Because the goal is to travel around the track in minimal time, we would expect to see behaviors such as steering close to inside corners and steering directly (diagonally) between corners. A critical issue to be addressed is that of avoiding collisions with corners, which is especially interesting in the case of the trucks-with-trailers given that the trailers do not directly follow the path of the cab. For this reason, the cab must swing wide in a turn in order to provide sufficient corner clearance for the trailer. Steering close enough to the corner so that near-maximum progress around the track is made, while not approaching so close that a collision is risked is a compromise that is defined by the evaluation function and learned during the policy search process. Figure 5.21: Car policy trained to drive forwards on a narrow track. Figure 5.22: Car policy trained to drive forwards on a wide track. In Figures 5.21 and 5.22, we see the resulting steering behaviour of policies developed for forward Chapter 5. Vehicle Steering 80 car-driving on wide and narrow tracks, respectively. For the wide track, the path of the car cuts to the inside on corners to progress more quickly around the track. This policy uses 15 nodes and was computed with 27 optimization episodes and 1342 function evaluations in 66192 s = 18.4 hr. For the narrow track, the turns must be executed with precision in order to avoid collisions with the sides of the track. This policy uses 21 nodes and was computed with 47 optimization episodes and 3158 function evaluations in 70479 s = 19.6 hr. Figure 5.23: Truck driving forwards on an unfamiliar track. Finally, Figure 5.23 shows a forwards-driving single-trailer truck. Note the necessity of the truck cab to swing wide on turns in order to avoid the trailer clipping the corner. In contrast to the backwards-driving truck policies, this policy does not observe the hitch angle, as it is less necessary for control due to trailer stability. The policy has 20 nodes and required 37 optimization episodes, 2801 function evaluations, and 62234 s = 17.3 hr of computation. A l l the results shown are for tracks that are novel, but that belong to the same family of tracks used during training. To test for robustness with respect to modeling errors, we made alterations to the wheel-base of the car of up to 20% and found that the policy still steered well. 5.6 Summary and Future Work We have demonstrated the use of policy search using adaptive refinement to generate policies for a variety of steering tasks, including the difficult problem of steering a truck with one or more trailers Chapter 5. Vehicle Steering 81 down a sharply-cornered winding track with only local sensory information. However, during this process, a number of issues have come up which would benefit from further investigation. Nearly all of the steering behaviours for the backwards-moving vehicles arise from motion of the observation state along the boundary between symmetric nodes. The observable state, o, tends to bounce back and forth across the boundary between the two symmetric Voronoi cells, much the same as described with the cart and pole problem in Section 4.7.1. This chattering behaviour occurs not only when the vehicle is driving straight, but also as the vehicle is rounding a corner. It is evident that this type of bang-bang control is found by the policy search to be particularly useful at controlling the unstable backwards motion of the vehicles. Chattering is less prevalent in the forwards-driving vehicle policies which have inherently stable motion. Furthermore, the backwards-driving policies tend to be much more robust to changes in track width and corner angle, and are more capable at avoiding obstacles. It can be speculated that this robustness is somehow related to the chattering behviour, but further analysis would be required to understand why this might be the case. General policy chattering is discussed further in Section 6.1. No previous work focusses on this particular formulation of the vehicle steering problem so it is difficult to compare with other methods for finding steering controllers. However, considerable work has been done on other variations of the nonholonomic vehicle control problem, as noted in Section 5.1. Future work should involve more rigorous comparisons of the performance of our adaptive policy search method versus other control techniques. A strong point of our method is that it is capable of solving a wide range of steering problems automatically using only minor parameter modifications. However, a disadvantage is that our policy search typically takes on the order of a day to learn a good steering solution. It is likely that the method could benefit from a more principled optimiza-tion approach that utilizes gradient information or a richer policy representation, for example, with interpolation between policy node actions. Much further research could be conducted into the use of different optimization techniques or policy representations. Our vehicle steering problems were physically idealistic in a number of ways, including having a fixed vehicle speed, no tire friction model, and noiseless sensors and actuators. For the policies generated by our method to be applicable to real world control the physical model of the vehicle motion would need to be much more realistic. Chapter 5. Vehicle Steering 82 (a) 1 (b) (c)' (f) Problem Parameters V ehicle Width (units) 0.35 0.35 0.35 0.35 0.35 0.35 Track Width (units) 1.8 2.7 1.04 0.85 1.8 1.04 Minimum Track Segment Length (units) 2 3 0.5 0.1 0.1 0.5 Maximum Track Segment Length (units) 4 4 2.5 2.5 2.5 3.5 Vehicle Speed (units/s) 1 1 1 1 1 1 kp(l/s2) 10 10 7.5 10 6 10 kd(l/s) 3.5 3.5 3 3.5 2.5 3.5 Observable State Variables 6 7 5 5 5 5 Blffiiati oir^ prameters Evaluation Time Tmai (s) 8 8 7 7 5 10 Number of Initial Positions 500 500 500 500 300 500 Maximumlnitial Vehicle Angle (rad) 0.4 0.2 0.4 0.4 0.4 0.4 Portion of Track Width for Initial State (%) 80 40 80 80 80 80 Width Ratio for Near-Collision Test 1.4 1.4 1.4 1.4 1.4 1.4 Length Ratio for Near-Collision Test 2.5 2.5 2.5 2.5 2.5 2.5 Search Parameters Step Size p 2.5 2.5 2 2 2 2 Decay Factor X 0.9 0.9 0.8 0.8 0.8 0.8 b max 150 150 100 80 80 100 C max 50 50 30 25 25 30 C step 10 10 8 7 7 8 Results Nodes n 14 30 22 21 15 20 Number of Policy Parameters 98 240 132 126 90 120 Optimization Episodes 106 129 43 47 27 37 Policy Evaluations 13476 15482 3665 3158 1342 2801 Time to Conduct Search (h) 37.5 61 23.4 19.6 18.4 17.3 Table 5.2: Simulation parameters, policy search parameters, and numerical results for various vehicle steering problems: (a) One-trailer truck backwards (b) Two-trailer truck backwards (c) Car backwards (d) Car forwards (e) Car forwards on wide track (f) One-trailer truck forwards. Chapter 6. Discussion 83 C h a p t e r 6 Discussion Evaluating a control problem solution is itself a difficult problem. This is because there are several common notions of what constitutes an optimal solution, including robustness, energy efficiency, and time efficiency. It is usually best to blend several of these notions of optimality into a single evaluation criteria. However, the choice of what weight to give each factor is difficult and highly problem specific. If a policy is capable of controlling a large range of intial states, we consider it to be robust. However, the resulting trajectories from those initial states may be highly nonoptimal in terms of the time or energy used to make progress towards a goal region or along a path. The evaluation metrics used to solve the control problems in this thesis considered a compromise of robustness and progress made during a fixed evaluation time. We did not consider energy used by the controller in evaluating policies. Other notions of optimality such as the smoothness of trajectories were not explicitly considered in the evaluation metric but may have been implicitly rewarded because trajectories that make maximum progress tend to be smooth. This chapter discusses some advantages and disadvantages of our choices of policy representation and policy search method and makes brief comparisons whith some alternatives. 6.1 Nearest-Neighbour Policy Representation When choosing a policy representation various factors can affect the shape of the optimization land-scape. Ideally the policy parameterization should be compact and smooth, yet contain good solutions. In this section, we evaluate various properties of the nearest-neighbour representation. Arbitrary Function Approximation Our policy search method operates within the class of policies, II, which can be represented using a nearest-neighbour (NN) function approximator (FA). Ideally, we want n to contain the optimal policy 7T* or a good approximation of it. In other words, it would be good if the N N function approximator was capable of approximating any possible function. Indeed, it can be seen that as the number of nodes is increased towards infinity, and they are spread to cover the entire policy domain, they can make an arbitrarily fine discretization of the domain. This means that the N N F A can be used to Chapter 6. Discussion 84 approximate any function, and most importantly, the optimal policy. optimal pol icy n* nearest-neighbour policy 7t Voronoi cel l boundary node location observation space O Figure 6.1: Poor approximation of a linear function using a nearest-neighbour function approximator. Efficient Parameterization of Function Approximator It is not enough to know that the optimal policy can be approximated; it is equally important that it be approximated efficiently For example, if the N N policy representation takes an inordinate number of nodes to approximate a simple optimal policy then it is not a good choice. The dimensionality of the space of policies is linear in the number of nodes, so this may make a search for a good approximation to the optimal policy prohibitively expensive to compute. It is instructive to discuss some simple policies for which the N N function approximation is poor so that we know where to avoid it. For instance, any function that is linear but not constant is poorly approximated by an FA with discrete values, which includes the NNFA and a discrete grid FA. A discrete FA must have a large number of discrete steps to approximate a function that can be exactly represented by a simple hyperplane. A n illustration of an inefficient N N function approximation of a linear optimal policy is shown in Figure 6.1. Interpolation between policy actions may help to better approximate some control policy functions. It is important to note, however, that linear controllers (or even other more complex control models) can be used as components that each operate within a Voronoi cell of the nearest-neighbour Chapter 6. Discussion 85 policy. In fact, the N N policies used to solve various control problems in this thesis use P D control actions, which correspond to linear combinations of 2 state variables. Even an optimal policy with discrete control may be poorly approximated by an NNFA. Such a-policy is one with curved boundaries between regions of discrete action. Since a discrete region of an NNFA is a Voronoi cell, whose boundaries are hyperplanes, it makes the approximation of discrete policies with curved boundaries difficult. Figure 6.2 shows an example of this. observation space O Figure 6.2: Poor approximation of a discrete function with curved boundaries using a nearest-neighbour function approximator. Chattering and Energy Waste In our results for both the cart and pole (Chapter 4) and vehicle steering (Chapter 5), we have encountered a policy behaviour that involves chattering between neighbouring Voronoi cells using opposing actions. This behaviour is illustrated in Figure 6.3. The structure itself is not a cause for concern since often reasonable policies use alternating actions to maintain stability. For example, when balancing a pole restricted to a planar domain it is useful to alternate between pushing the pole one direction and the other, to counteract instability in the balance of the pole. However, because of the discrete approximation of the NNFA the opposition of the alternating actions can be quite strong, Chapter 6. Discussion 86 resulting in harsh chattering between the discrete action of two Voronoi regions, as with bang-bang control. Obviously, this chattering results in a unnecessary waste of energy. Since we are running in simulation and we do not consider energy use in evaluating policies this is not a concern, but it would be for engineers designing policies for real world control. Also policies that use less energy tend to appear more natural, which is a useful property for animation, particularly of animals and people. In future work, it would be valuable to examine the power/robustness tradeoff of the policies generated. Using a policy representation that allows for smoother variation in the policy ir : O —> A, would be useful for decreasing power use. This might be done by using smoother interpolation kernels for the policy representation, such as could be achieved by the k-nearest neighbour or radial basis network approaches. Smoothness of Policy Parameter Space Choosing a good policy representation can be critical for ensuring that the policy search is well-shaped. Typically, to make the search simpler, the policy function, ir : O —» A should vary smoothly as we move about in the policy parameter space. In other words, if a small change in policy parameters 0 results in a large change in TT : O —» A, it is also likely that the state trajectories for the evaluation will change drastically, and therefore there will be a large jump in E(Q). This means that discontinuities in the parameterization of the policy can cause discontinuities in the policy evaluation function, E(@), which increases the difficulty of the policy search. With the N N policy representation, moving within the policy parameter space mostly results in smooth change in 7r. However, an exception occurs when the locations of two nodes, Q and Q , cross directly over one another. In this case, the Voronoi cells \ observation space O trajectory Figure 6.3: Chattering between the opposing actions of Voronoi cells. Chapter 6. Discussion 87 of nodes i and j will be swapped, meaning that the actions, (ii and in those Voronoi cell regions will be swapped. This may result in a large jump in the policy function IT, which would presumably make for a discontinuity in E(@), and a more difficult policy search. Fortunately, in a random walk of the N N policy parameter space, the occurance of two nodes crossing directly over one another is quite rare, as long as the nodes are sparsely positioned. Effect of Redundant and Unnormalized Policy Domain Nearest-neighbour function approximators (NNFAs) use a distance metric which weighs all dimensions of the the policy domain (or observation space, O) equally. Lets assume that the optimal policy IT* : O —• A is fairly smooth, meaning that there is a rough corellation between the distance between two observations, |oi — 02I and the distance between the corresponding optimal actions, |7r*(oi) — ir*{o2)\-If there is redundancy in (or a correlation,between) observation variables or these variables have significantly different ranges, the N N function approximation of IT will become less effective. This is because the redundant variables or variables with greater ranges will be weighted more heavily in the Z/2 distance metric used to determine the Voronoi cells. This tends to align Voronoi boundaries and stretch Voronoi cells in a certain direction, making them less efficient at representing a smooth policy. This problem may occur when observation dimensions are measured in different units that have different ranges, e.g., when one sensor measures the speed of a car in m/s while another sensor measures the steering angle in radians. Figure 6.4 illustrates the stretched shape of Voronoi cells when the range of two observation variables differs by an order of magnitude. In such problems it would be useful to normalize the sensory variables so that their ranges are ap-proximately the same. Normalization of sensory variables is useful for other function approximators such as radial basis functions. A neural network function approximator normalizes input variables automatically with input weights. The nearest-neighbour policies we used did not require normaliza-tion since the ranges of input variables were comparable. As for redundant input variables, it would likely be beneficial to use a dimensionality reduction technique such as principal component analysis to create compressed observations for input to the nearest-neighbour policy. N N F A in High-Dimensional Spaces Although we have used a N N policy represenation to solve some difficult problems in policy domains with as many as 7 dimensions (eg. for the two-trailer truck), the N N F A will likely not work well in observation spaces of high dimension, unless it is known that a policy need only operate in a lower-dimensional manifold. Randomly-sampled points from a uniform distribution in a high-dimensional space tend to be spaced such that the distance between any two points is equal. So if nodes are Chapter 6. Discussion 88 i i • • •••••••• m i " " " " " " " " " • • m i l , , , , • * 2 — •• • " 1 1 " » m i n i m i ••••••• i i m i i m m 1 1 1 " 1 1 1 1 2 3 4 5 6 observation variable 1 Figure 6.4: Stretched Voronoi cells with unnormalized sensory variables. placed randomly in a high dimensional space they will likely have this property. Also, a randomly sampled state will tend to have the same distance to all such nodes. In other words, there would be a large region in the observation space that has nearly the same distance to all nodes. The choice of action in this region would be rather ambiguous. We might consider this the curse of dimensionality for NNFAs. If we know there is a low-dimensional manifold of normal operation in the observation space, NNFAs can still work fairly well in moderately high dimensional spaces. Consider, as an example the control of human walking [Sha04]. Although the state space of a human figure has many dimensions (i.e., all joint angles and angular velocities), we know that, as long as the walk is proceeding normally, the state will remain close to a 1-dimensional trajectory. For a N N walking policy that does not need to recover from abnormal states it is possible to place nodes along the normal walking trajectory, thus carving up the region of interest into discrete actions, as shown in Figure 6.5. A walking policy was generated in [Sha04], using such an initialization of node positions. Advantages of N N F A Despite the shortcomings of the N N policy representation, it has still proven to be a very useful tool for solving a variety of difficult control problems. For example, using such a representation we have 50-40-3 > a o 30-20-O 10-Chapter 6. Discussion 89 * £ * observation space 0 • • * Voronoi cell boundary node location \\valk trajectory • Figure 6.5: Dividing up a 1-dimensional walking trajectory through a potentially high-dimensional observation space into the cells of a nearest-neighbour policy. The distance metric and Voronoi cells are still computed in the high dimensional space, because the actual state may not lie exactly on the manifold. developed policies for difficult vehicle steering tasks. There are a few reasons why an NNFA is effective as a policy representation for certain control problems. It is a semi-parametric representation, which allows the complexity of the representation to be augmented to fit the problem. In other words, the policy can be adaptively refined in regions of the observation space as needed. A n N N policy is an irregular represenation, unlike uniform grids, since nodes can be concentrated in specific regions. This allows the distribution of representational complexity through the policy domain to match that of the needed control complexity. Secondly, an NNFA efficiently represents a policy that requires an alternation between two opposing actions. This behaviour can be achieved by having the observation state bounce back and force between contiguous Voronoi cells with opposing actions. This is similar to bang-bang control and is useful for controlling certain unstable processes such as balancing a pole or steering a tractor-trailer backwards. Finally, an action is represented locally in a particular Voronoi cell, meaning that the policy can be adapted locally without any far-reaching side-effects. In general, N N policies can be a very compact representation. Interesting control problems can be solved using only a few nodes, which each have only a few parameters equal to do + dA, where do and dA are the dimensionality of the observation space and action space. Furthermore, it is impressive that a policy representation that is so conceptually simple, as is evident by Equation 3.1, is capable of handling rather complex control. It is also computationally simple, as a naive algorithm need only Chapter 6. Discussion 90 calculate and compare the distances between the current state and each node, which is linear in ndo, where n is the number of nodes. 6.2 Policy Search Method In Section 6.1, we mainly discussed features of the N N policy representation in terms of how well it could approximate IT*. However, even if ir* can be approximated, that does not mean a policy search algorithm will find a good approximation quickly. This discussion looks critically at our policy search algorithm and suggests some potential improvements. Adaptive Addition of Nodes As discussed in Sections 2.3.5 and 5.4, adaptive addition of nodes can be useful for shaping the search for a policy through progressive refinement. However, this method will only be effective if the number of nodes used for the initial optimization episode is adequate to roughly solve the control problem. If this number is not enough then often a poor initial solution to the problem will be found, one that achieves some short-term reward, with little consideration of the complexity of the problem as a whole. Of course, if the rough solution of the first optimization episode is poor, it is unlikely that refinement of that solution will result in a good approximation of the t t * . The reason we use a sequence of progressively more difficult optimizations is so that early ones can bias the later ones with a good starting points. If early optimization episodes have an inadequate number of nodes and thereby result in poor solutions, such biasing can only hamper later optimizations. Stochastic Greedy Ascent Optimization SGA optimization is but one of many possible optimization methods. Successful application of SGA to difficult control problems has demonstrated the effectiveness of our policy search shaping techniques. It would be beneficial to try other optimization techniques. Obvious examples include genetic algorithms and simulated annealing. A variation of the downhill simplex method proposed in [PTVF97], p. 451 solves some noted inefficiencies of typical simulated annealing techniques and may be a promising approach for policy search, particularly with control problems characterized by local minima. [RB01] uses a stochastic search algorithm biased by a rough estimate of the gradient. This is a alternative to our SGA with a simple model of the local landscape. If gradient information for £ ( 6 ) is available it can be used with a number of nonlinear program-ming techniques, such as line-search and trust-region techniques, to find local minima very efficiently. Our policy search could benefit significantly by using such techniques, as long as the evaluation function is adequately shaped. Chapter 6. Discussion 91 Value Function Approximation There are certain benefits to using value-based R L methods to compute control policies. The main advantage is that knowledge of the structure of a control problem is used to compute a value or Q function, which guides the search for an optimal policy. When our SGA optimization blindly perturbs parameters and tests if they result in a better policy, it is ignoring this inherent structure in the control problem. If we had access to a value function that assigned values to states, this could be used to alter a local region of the policy so that state trajectories passing through that region can be driven uphill on the value function landscape. The NNFA may provide a useful approximation of the value function. We might consider assigning values to policy nodes, as well as actions. The resulting N N approximation of the value function could be used to guide changes to the policy, during the policy search. In using policy search, we are losing the theoretical guarantee of convergence to an optimal policy that value-based R L methods can provide, although in practice such convergence guarantees are less meaningful, given that they typically have an embedded assumption that requires visiting every state infinitely often. Instead, we must resort to giving a good argument of why our policy representation and search methods are effective, and demonstrate some solutions to difficult control problems to back up our claims. Indeed, we have solved difficult problems using this policy search method, such as the assortment of steering results shown in Section 5.5. Chapter 7. Conclusion 92 C h a p t e r 7 Conclusion We have demonstrated the effective use of a nearest-neighbour (NN) policy representation combined with direct policy search to solve control problems of varying difficult. Importantly, we have also presented a method for shaping the policy search by adaptively adding nodes to the policy in a series of optimization episodes (Section 5.4). The N N representation allows a sparse and non-uniform discretization of the policy function. The N N representation has been shown to model policies for nontrivial control problems in a compact manner (Sections 5.5 and 4.6). N N policies are conceptually simple and the control computation is linear in the number of nodes and the dimension of the observation space. This representation also ensures locality within Voronoi cells, permitting local modification of the control policy without side-effects. The variety of control problems that we have solved is a testament to the flexibility of the N N policy representation and our direct policy search method. We have developed policies that follow a user-specified limit cycle in a 2D state space and ones that can move a cart to a desired location, while simultaneously balancing a pole. N N policies have been found to steer a number of different vehicles around winding tracks. These steering policies are not specific to a track, but can generalize to a variety of tracks with similar properties. Future Work N N policies do have some drawbacks (Section 6.1) and it would be valuable to look at alternative representations to deal with these. Certain classes of optimal policies are not represented efficiently. Basis functions that provide a smoother interpolation between actions may provide a better repre-sentation for these cases. N N policies may be hampered by redundant or unnormalized observations. Perhaps the observation state could be preprocessed, so that the policy sees only state information that is compact and relevent to control. The N N representation does not provide the capability to generalize by reusing certain useful structures in different areas of the policy domain, as is possi-ble with hierarchical representations like those used in genetic programming. It may be valuable to introduce hierarchical structure into an N N policy so that certain useful arrangements of policy Chapter 7. Conclusion 93 nodes could be reused and parameterized. Locality and generalization are conflicting goals in devel-oping a policy representation and more investigation should go into effectively balancing these useful properties. Although many of our results have provided encouraging evidence of our method, it would be useful to derive some theoretical statements outlining under what conditions our method can find a reasonable solution and how close a solution must be to the optimal policy. However, as the complexity of the systems considered increases towards that of the real world, theoretical guarantees become exceedingly difficult to find. At some point it becomes prudent to evaluate a method as successful if it reliably generates useful results and it is simple enough to be understood and readily extended. How to validate and fairly evaluate developed control systems is an ongoing issue. Control researchers could benefit greatly from a standardized system for documenting and comparing the capabilities of controllers they design. Our policy search method could be applied to other interesting control problems. For example, we would like to develop N N control policies to control vehicle steering in fields of obstacles and vehicle parking. Some initial experiments with these problems have encountered some success. Bicycle steering and the control of a car on a slippery surface would also be interesting problems that could be applied to computer animation. We must admit that in comparison to real world control problems, like making a canine robot roll over, for example, the ones discussed in this thesis are quite simple. We also have the benefit of working in simulation, which allows our policies to observe whatever state we decide is useful. Furthermore, observations and actions are noiseless, which is typically not the case for robots in the real world. Reactive policies, such as those used in this thesis, are usually not adequate for an environment where observations are ambiguous. Investigation into the use of memory to model and contextualize observations would help to bridge the gap between our work and real robotics. Appendix A. A 2D Example: Time-Based Policy 94 A p p e n d i x A A 2D Example: Time-Based Policy A . l Constructing a Time-Based Policy The initial parameters for the policy search are bootstrapped using a time-based policy generated to follow the limit cycle. We use the term time-based policy1 to refer to a policy that chooses actions based on simulation time elapsed, t, as opposed to choosing actions based on the observable state, ot, as is the case with ordinary policies. Of course, if t was added to the observable state then a time-based policy would be ordinary. Time-based policies are very effective at following a specific trajectory from a single starting state if the dynamics are deterministic. However, if a time-based policy starts from any other state or is bumped off its familiar trajectory it will not be able to react appropriately. A time-based policy is the extreme in purposeful policies; it operates completely open loop. In other words, the observable state of the system has no effect on the action choices of the policy. The time-based policy is constructed in a piece-wise manner. We begin at a state on the desired limit-cycle and try to find the P D controller parameters that track the limit cycle for as long as possible, while the accumulated error of the actual trajectory remains under a specified threshold. Once the P D parameters and resulting trajectory for the first piece have been determined, we optimize the parameters of a second P D controller to take over and track the desired limit-cycle from the end state of the trajectory for the first P D controller. In this way, several P D controllers can be used to approximately track the desired limit cycle, each activated in sequence to control the trajectory of the point-mass for a specific amount of simulation time. More formally, let fii = (x, kp, kd)i be the parameters of the ith P D controller activated and let a ' = (s{, s\,..., s\.) be the trajectory produced by that controller, where s\ = s\nit is the starting point of the trajectory and U is the trajectory length. A diagram of a constructed time-based policy with corresponding labels is shown in Figure A . l . We begin by setting s\nit to some state on the desired limit cycle and optimize m using stochastic greedy descent (SGA) such that the length, Zi, of trajectory, a1, is as large as possible while the error remains under a given threshold, ip. The trajectory error, e(a1), is the sum-squares error between the trajectory and the desired limit cycle, 1Time-based policy is a synonym for open-loop policy, which is used to refer to a policy without state feedback. Appendix A. A 2D Example: Time-Based Policy 95 velocity x Figure A . l : A time-based policy approximating a desired limit cycle. analogous to that given in Equation 3.3. If two sets of parameters, fii and / i ^ , result in trajectories of equal length, then the tie is broken by rewarding the least error. Once the optimization for the first P D controller has converged, we optimize a second P D controller, with parameters ^2- The initial state, smit, for trajectory o1 is set to be the last state, of trajectory a1. The second optimization similarily tries to maximize I2, while ensuring e{a2) < ip- We continue in this way, until a chain of trajectories makes its way around the limit cycle, ensuring that the ends of the trajectory chain are sufficiently close (i.e., \\s^ — s\\\ < e). Thus, piece-by-piece, we construct a time-based policy that approximates the full limit cycle with a series of trajectories, a1, a2,... ,an. The policy delegates control to the P D controllers given by parameters fi2, • • •, in sequence for li, I2, •. •, ln simulation time steps, repectively. A.2 Bootstrapping a Nearest-Neighbour Policy Now that we have a time-based policy that approximates the desired limit cycle, from it we need to derive an initial nearest-neighbour policy 7 r t n , t with parameters Q m t = [d , . . . , £ N , MI, . . . , pn]mxt that follows roughly the same trajectory. The policy irmlt will have n nodes, where n is the number of P D controllers used in the time-based policy. Of course, the actions for the n nodes will simply be the P D controller parameters /Uj optimized for the time-based policy. Now it remains to determine the locations of the the nodes, HI, . . . , pn. Intuitively, it would be ideal if the nodes were positioned such that each Voronoi cell i covered exactly the corresponding trajectory, a1 of the time-based policy. If this was true, then executing the nearest-neighbour policy from any position on the time-based trajectory should continue to follow that trajectory, thus mimicking the behaviour of the time based Appendix A. A 2D Example: Time-Based Policy 96 policy. We first initialize the node location using, & <- | (si+sj.) to be half way between the beginning and the end of the corresponding time-based trajectory. Then we perform a SGA optimization on the node locations to minimize an error function which penalizes a trajectory transition that is not exactly half way between its neigbouring nodes or that has any node closer than its neighbouring nodes. This optimization attempts to find the ideal node positions described above, thus completing the parameters Qmlt of the initial nearest-neighbour policy. Appendix B. Cart and Pole: Additional Information 97 A p p e n d i x B Cart and Pole: Additional Information B. l Equations of Motion The motion of the cart and pole is modelled using the following differential equations [Pas98]. mg sin a — cos a(F + mpla2 sin a) / ( | m — mpcos2a) F + mpl(a2 sin a — a cos a) x = — -, m where g = 9.81 m s - 2 is gravitational accleration, mc = 1.0 kg is the cart mass, mp = 0.1 kg is the pole mass, I = 0.5 m is half the pole length, and m = mc + mp. The dynamics of the cart and pole system is computed using Euler integration with a time step of 0.001 s. B.2 Perturbation Algorithm Algorithm 5 perturbation^, p) input 7T {policy} input p {step size} for i <— 1... de do 89[i] <- -K.w\i\Rand(-\,l) end for 7T.0 <— 7T.0 + (50 clip 7T.0 to [Tr.min,7t.max] O u t p u t 7T Algorithm 5 outlines the perturbation algorithm for the cart and pole policy search. In calculating the direction of the perturbation in the policy parameter space, the size of the perturbation in each parameter i is weighted by 7r.w[z]. The parameter weights, TT.W, are chosen by hand based on the type of parameter (e.g., kp of a node action or a of a node location) to be approximately proportional to a reasonable range of values for that parameter. After the perturbation is applied to 7T.O, each parameter 7r.0[i] is clipped to be within the range [7r.min[i], 7r.maa;[i]]. The values in ir.min and Appendix B. Cart and Pole: Additional Information 98 K.max are also chosen manually and most parameters are not clipped, i.e., [i.mm[i], 7r.maa;[i]] = [—00,00]. Table B . l lists the values that are used in K.W, K.min, and -K.max for each type of policy parameter. TT.W •K.min •K.max action /J. a 0.5 —00 00 hp 0.5 0.1 2.0 kd 0.5 0.1 2.0 location £ X 1.0 —00 00 X 1.0 —00 00 a 0.1 —00 00 a 0.1 —00 00 Table B . l : Constants used for the cart and pole perturbation algorithm. Appendix C. Vehicle Steering: Additional Information 99 A p p e n d i x C Vehicle Steering: Additional Information C . l Left-Right Symmetry Proof for Steering Policy Assume that a truck steering policy 7r contains n pairs of nodes or 2n nodes in total. For 1 < j < n, let node j and node n + j form a pair of nodes. For each node i, 1 < i < 2n, we define i to be the index of the partner node of node i. Let = (a;, Pi, d\^, d2ii, d3<i, diti) be the location of node i and Hi = &i be its action. Furthermore, let Q = ( Q J , / ? ; , dl-i, d2-i, dz j , c?4j) = (—a*, d 4 i j , a^,, d2ti, di^) and /xj = -Hi1. We wish to show that for any observation, o = (a, P,di,d2,d3,d4), if 7r(o) = a then TT(O) = —a, where 5 = (—a, —d 4 ,d 3 , cfo, di). Alternatively, we can show that if o lies in the Voronoi cell of node k then 5 lies in the Voronoi cell of node k, the partner of node k. If an observation, o, lies in Voronoi cell k this implies that for all i, such that 1 <= i <= 2n and i y£ k, \o — Ck\ < \o — Ci\- This means that: Vi, ( Q - ak)2 + (p- pk)2 + (d x - dltk)2 + {d2 - d2,kf + (d 3 - d3,k)2 + (d4 - d4,k)2 < (a - ai)2 + (P- Pi)2 + (di - du)2 + (d2 - d2,i)2 + (d3 - d3A)2 + (d4 - d 4 ] i ) 2 . By multiplying the first two terms on either side of the inequality by 1 = (—l) 2 and reversing the order of the last four terms on each side we get: Vi, (-a - (-ak))2 + (-p - (-pk))2 + (d4 - d^)2 + (d 3 - d3,k)2 + (d2 - d2,k)2 + (d1 - dlik)2 < (-a - (-at))2 + (-p - (-P,))2 + (d4 - d4ii)2 + (d3 - d3il)2 + (da - d2A)2 + (d, - dhi)2. But this simply states that for all i, |6 — C/il < \o — Ci l i which is the same as saying that 0 lies in the Voronoi cell of node k. This is exactly what we set out to show, so the symmetric node policy representation that we are using, indeed ensures a left-right symmetric policy. It is easy to show that similar symmetric node representations for the observation spaces of the car and two-trailer truck also ensure left-right symmetric policies. 1Note that this definition is not contradictory, since the location and action of a node are the same as the location and action of a node's partner's partner. Appendix C. Vehicle Steering: Additional Information 100 C.2 Equations of Motion This section describes how we obtain the truck state at simulation time step, st.= (xt, yt, Ot, o-t, d t , fit) from the state at the previous time step, st~i = (xt-i,yt-i, #t-i, c*t-i, &t-i, Pt-i), where (xt, yt) = Pt is the position of the truck cab center, Ot is angle that the long axis of the truck cab makes with the positive x-axis, at is the angle of the steered wheels, at is the angular velocity of the steered wheels, and pt, is the hitch angle, all at discrete time step t. Let Tstep = 0.01 s be the duration of a time step and v = 1.0 units/s be the speed of the vehicle. The following equations of motion are for the forward-driving truck. They can easily be modified to remove the trailer or add more trailers. If the vehicle is to drive backwards, the speed, v, can be changed to a negative value, but otherwise the same equations of motion can be used. If the steered wheels are pointed straight ahead, i.e, at-\ = 0, the new truck cab position, (xt,yt) is calculated simply, as follows: xt <— xt + vTsteP cos(# t-i) Vt *-Vt + vTstep sin(0 t_i). Otherwise, if at-i i= 0, the new position of the point between the front wheels, Pf:t = {xf,t,yf,t) and the new position of the point between the back wheels, Pb,t = (xb,t,yb,t) is computed, as illustrated in Figure C . l . In the figure, the trailer of the truck is not shown for clarity. It is assumed that the point between the front wheels will move in the direction that the steered wheels are pointing: xf,t <— xf,t + vTstep cos(0 t_i + at-i) Vf,t *~'Vf,t + vTstepSm(0t-i + at-i). It is also assumed that the point between the back wheels will move in the direction that truck cab is pointing. The point, (xt,^,yb,t), is thus found by computing a point that lies on the long axis of the car a distance of df + db away from {xftt,y/tt), where df and db are the distances from the center of the car to the front and rear axles, respectively. A quadratic equation that encodes these constraints can be solved to get {xb,t,yb,t)- The new cab position, (xt,yt) is then determined by finding the appropriate point on the line segment between Pftt and Pf, i t: D i \ dbPj,t + dfPb,t / r i 1 x Pt = (xt,yt)* J]Tdb~^- ( C 1 ) The angle Bt can be computed as the angle that vector, P;t — Pbtt, makes with the a;-axis. The motion of the trailer is calculated using a method similar to that which is used to compute the motion of the cab. The new position of the hitch is given by Pb,t = (xb,t,Vb,t)- The point between Appendix C. Vehicle Steering: Additional Information 101 y /K Figure C . l : Determining the motion of the truck during a simulation time step. the wheels of the trailer is constrained to move along the direction of the trailer, and the new position of this point can be found by solving a quadratic equation. Given this point, it is simple to determine Pt, the new angle of the trailer relative to the cab. Given the angular acceleration of the steered wheels, d, Euler integration is used to calculate the new angular velocity, dt, and angle, at, of the steered wheels. C.3 Perturbation Algorithm Algorithm 6 outlines the perturbation routine for the vehicle steering policy search. This is distinct from Algorithm 5 in that the parameters for each policy node are perturbed separately. If a node has a large energy, as stored in the energynoa-e vector, it is more likely to be perturbed and have its parameters modified to a larger degree. The action, fi = d, of a node is perturbed by the dstep and clipped to within the range [—1,1] radians. The location, £, of a node is perturbed by moving it ^ ' " ^ closer to a sample observation encountered during the previous policy evaluation and within the node's Voronoi cell. The call to observationsample(i) provides such a sample from Voronoi cell i. Perturbing node locations in this manner has the effect of keeping them within regions of the observable state space, O, which are visited regularly while still allowing the locations and actions Appendix C. Vehicle Steering: Additional Information 102 Algorithm 6 perturbation2(-K, p) input 7r {policy} input p {current optimization step size and total perturbation energy} global n {number of nodes and index of the newly added node} global pinit {initial optimization step size before decay} define energynode {vector - perturbation energy of each node} define dnode {step for current node} define (sample {sample observable state within Voronoi cell of node - vector of size de> observable state mapped onto location parameters} energynode <— 0 energynode <— propagate_perturbation(n, p, energy„ode) for i <— 1... n do if Rand(0,1) < ^ " ' ^ " ^ ' ' 1 then dstep = energy^de [i] Rand(-1,1) 7T .0 < — 7T .0 + dstev in dimension of p.% (sample *~ observation_sample(i) for j <— each location parameter index for node i do n.Q[j] - TT.QlJ} + ^A((sample[j] - n.e[j}) end for end if end for clip 77.0 to [n.min, n.max] O U t p u t 7T to be optimized. The node location parameters are not clipped to a particular range, but they still remain close to the typical region of operation of the policy in O. The energynode value for each node is determined by the propagate-perturbation recursive routine (see Algorithm 7), which propagates the perturbation energy specified by the energy parameter from node inode to neighbouring nodes and accumulates the results in the energynode vector. Of the energy that comes into node inode, portions may be propagated to each neighbouring node, while the rest is kept to that node. The portions that are kept or propagated depend on the transitions between Voronoi cells during the previous evaluation of the policy, ir. These transitions are recorded in the transitions matrix, where for each time step t during the evaluation, transitions[c(ot)][c(ot+\)] is incremented2. The constants used to obtain our results are fractionmin = 0.1, fractionmax = 1.0, x = 40, and energythresh = 0.05. 2Recall from Equation 3.1 that c(o) is the index of the active node given observation, o. Appendix C. Vehicle Steering: Additional Information 103 Algorithm 7 propagatejperturbation(inode, energy, energynode) input i {node index} input energy {energy entering node} input energynode *- P {total perturbation energy} global transitions {matrix of transitions between Voronoi cells} global fractionmin global fractionmax global x global energythresh define fractionkept {fraction of energy kept by node inode} define connection {vector - the transitions between inode and each node} for i <— 1... n do connection\i] <— transitions[inode\[i] + transitions[i][inode\ connectiontotai <~ connectiontotai + connection[i] end for fractionkept <— fractionmin + (fractionmax - / r a e t i o n m i n ) ( c ° ™ ^ ' , ° ^ ^ )x energykept <-fractionkeptenergy energynode[inode] <— energynode[inode\ + energykept energy <— energy - energykept connectiontotai <— connectiontotai — connection\inode] for i «— all node indices except inode do connectionli] energy portion <- energy c o n n e c t i o n if energyp0rtion > energythresh then energynode < — propagate.perturbation(i, energyportion, energynode) else energy node\inode] *- energy'node[inode} + energy portion end if end for output energynode Bibliography 104 Bibliography [AM03] Christopher G. Atkeson and Jun Morimoto. Nonparametric representation of policies and value functions: A trajectory-based approach. In S. Thrun S. Becker and K . Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1611-1618. 2003. [And87] Charles W. Anderson. Strategy learning with multilayer connectionist representations. In Proceedings of the Fourth International Workshop on Machine Learning, pages 103-114. Morgan Kaufmann, 1987. [ASW01] C. Altafini, A . Speranzon, and B. Wahlberg. A feedback control scheme for reversing a truck and trailer vehicle. IEEE Transactions on Robotics and Automation, 17(6):915-922, December 2001. [Bel57] Richard Bellman. Dynamic Programming. Princeton University Press, 1957. [Bis95] Christopher M . Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. [BM99] L . Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968-974, 1999. [BSA83] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptiveelements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13,5, pages 834-846, 1983. [BT96] Dimitri P. Bersekas and John N . Tsitsiklis. Neuro-Dynamic Programming. Athena Sci-entific, 1996. [CKL94] Anthony Cassandra, Leslie Pack Kaelbling, and Michael Littman. Acting optimally in partially observable stochastic domains. In Twelfth National Conference on Artificial Intelligence, (AAAI) Seattle, WA, 1994. [Cou02] Remi Coulom. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002. Bibliography 105 [DoyOO] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12:243-269, 2000. [DW91] Thomas L. Dean and Michael P. Wellman. Planning and Control. Morgan Kaufmann Publishers, San Mateo, California, 1991. [DW97] A. Divelbiss and J . T. Wen. Trajectory tracking control of a car trailer system. IEEE Transactions on Control Systems Technology, 5(3):269-278, May 1997. [FA02] J. Forbes and D. Andre. Representations for learning control policies. In Proceedings of the ICML-2002 Workshop on Development of Representations, pages 7-14, 2002. [Gri97] Larry Gritz. Genetic programming evolution of controllers for 3-d character animation. Genetic Programming 1997: Proceedings of the 2nd Annual Conference, pages 139-146, 1997. [HFGS96] D. F. Hougen, J . Fischer, M . Gini, and J . Slagle. Fast connectionist learning for trailer backing using a real robot. In IEEE International Conference on Robotics and Automa-tion, pages 1917-1922, 1996. [HGS97] D. F. Hougen, M . Gini, and J . Slagle. Rapid, unsupervised connectionist learning for backing a robot with two trailers. In IEEE International Conference on Robotics and Automation, 1997. [HKLR02] David Hsu, Robert Kindel, Jean-Claude Latombe, and Stephen Rock. Randomized kino-dynamic motion planning with moving obstacles. Int. J. of Robotics Research, 21(3):233-255, March 2002. [KLM96] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learn-ing: A survey. Journal of Artificial Intelligence Research, ^-.237-285, 1996. [KN03] A . Kelly and B. Nagy. Reactive nonholonomic trajectory generation via parametric opti-mal control. The International Journal of Robotics Research, 22(7):583-601, July 2003. [Koz92] J . R. Koza. A genetic approach to finding a controller to back up a tractor-trailer truck. In Proceedings of the American Control Conference, 1992. [Lar68] R. E . Larson. State Increment Dynamic Programming. Elsevier, New York, 1968. [Lat91] Jean-Claude Latombe. Robot Motion Planning. Kluwer Academic Publishers, 1991. Bibliography 106 [MA95] Andrew Moore and Chris Atkeson. The parti-game algorithm for variable resolution re-inforcement learning in multidimensional state-spaces. Machine Learning, 21:1-36, 1995. [MM02] R. Munos and A. Moore. Variable resolution discretization in optimal control. Machine Learning, 49, Numbers 2/3:291-323, November/December 2002. [MPKK99] Nicolas Meuleau, Leonid Peshkin, Kee-Eung K i m , and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. In UAI1999, pages 427-436, 1999. [Ng03] Andrew Y . Ng. Shaping and policy search in Reinforcement learning, PhD thesis, Uni-versity of California, Berkeley, 2003. [NM93] J . T. Ngo and J . Marks. Spacetime constraints revisited. Proceedings of ACM SIGGRAPH '93, pages 343-350, 1993. [NW89] D. H. Nguyen and B. Widrow. The truck backer-upper: A n example of self-learning in neural networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN89), pages 357-363, 1989. [Oga70] Katsuhiko Ogata. Modern Control Engineering. Prentice Hall, Inc., Englewood Cliffs, N.J . , 1970. [Pas98] F. Pasemann. Evolving neurocontrollers for balancing an inverted pendulum. Network: Computational Neural Systems. 9, pages 495-511, 1998. [PH98] L . D. Pyeatt and A. E . Howe. Learning to race: Experiments with a simulated race car. In 11th International Floria Artificial Intelligence Research Society Conference, 1998. [PTVF97] William H. Press, Saul A . Teukolsky, Willaim T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1997. [RB01] M.T . Rosenstein and A . G . Barto. Robot weightlifting by direct policy search. In Pro-ceedings ofUCAI 2001, pages 839-844, 2001. [RK03] S. Ramamoorthy and B. Kuipers. Qualitative heterogeneous control of higher order systems. Hybrid Systems: Computation and Control, Lecture Notes in Computer Science, 2003. Bibliography 107 [RN0098] Muchammad Romzi, Junji Nishino, Tomohiro Odaka, and Hisakazu Ogura. Refining fuzzy control rules for the inverted double pendulum system based on a transformation-type genetic algorithm. Systems and Computers in Japan 29(14), pages 50-55, 1998. [SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. [Sha04] Dana Sharon. Synthesis of stylized walking controllers for planar bipeds. Master's thesis, University of British Columbia, 2004. [Sim94] Kar l Sims. Evolving virtual creatures. Proceedings of ACM SIGGRAPH '94, pages 15-22, 1994. [SK01] William D. Smart and Leslie Pack Kaelbling. Reinforcement learning for robot control. Mobile Robots XVI, 2001. [SSR99] Feijun Song, S.M. Smith, and C.G. Rizk. A fuzzy logic controller design methodology for 4d systems with optimal global performance using enhanced cell state space based best estimate directed search method. IEEE International Conference on Systems, Man and Cybernetics, pages 138-143, 1999. [Sti99] Andrew K . Stimac. Standup and stabilization of the inverted pendulum. Master's thesis, Massachusetts Institute of Technology, 1999. [Tes95] G. Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58-67, 1995. [vdPF93] Michiel van de Panne and Eugene Fiume. Sensor-actuator networks. Proceedings of ACM SIGGRAPH '93, pages 335-342, 1993.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Shaping and policy search for nearest-neighbour control...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Shaping and policy search for nearest-neighbour control policies with applications to vehicle steering Alton, Ken 2004
pdf
Page Metadata
Item Metadata
Title | Shaping and policy search for nearest-neighbour control policies with applications to vehicle steering |
Creator |
Alton, Ken |
Date Issued | 2004 |
Description | The graceful animal motion we see in nature has proven extremely difficult to reproduce algorithmically. There is a need for further research into motion control techniques to address this problem adequately for computer animation and robotics applications. In this thesis, we describe a novel method for the synthesis of compact control policies for a variety of motion control problems. Direct policy search is applied to a nearest-neighbour control policy, which uses a Voronoi cell discretization of the observable state space, as induced by a set of control nodes located in this space. Such a semi-parametric representation allows for policy refinement through the adaptive addition of nodes. Thus, a coarse-to-fine policy search can be performed in such a way that the problem is shaped for easy learning. We apply our method to developing policies for steering various vehicles around a winding track. In particular, a policy is learned to steer a double-trailer truck backwards, a problem of considerable difficulty given the instability of the trailers. We also generate policies for the problems of balancing an inverted pendulum on a moving cart and driving the state of a point mass around a user-specified limit cycle. We examine these motion control results critically and discuss the limitations of the nearest-neighbour policy representation and our policy search method. |
Extent | 5695597 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-12-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051749 |
URI | http://hdl.handle.net/2429/16091 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2004-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2005-0001.pdf [ 5.43MB ]
- Metadata
- JSON: 831-1.0051749.json
- JSON-LD: 831-1.0051749-ld.json
- RDF/XML (Pretty): 831-1.0051749-rdf.xml
- RDF/JSON: 831-1.0051749-rdf.json
- Turtle: 831-1.0051749-turtle.txt
- N-Triples: 831-1.0051749-rdf-ntriples.txt
- Original Record: 831-1.0051749-source.json
- Full Text
- 831-1.0051749-fulltext.txt
- Citation
- 831-1.0051749.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051749/manifest