Reinforcement Learning Using Sensorimotor Traces

by

Jingxian Li

B.Sc., Zhejiang University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

December 2013

© Jingxian Li, 2013

Abstract

The skilled motions of humans and animals are the result of learning good solutions to difficult sensorimotor control problems. This thesis explores new models for using reinforcement learning to acquire motion skills, with potential applications to computer animation and robotics. Reinforcement learning offers a principled methodology for tackling control problems. However, it is difficult to apply in high-dimensional settings, such as the ones that we wish to explore, where the body can have many degrees of freedom, the environment can have significant complexity, and there can be further redundancies in the sensory representations that are available to perceive the state of the body and the environment. In this context, challenges to overcome include: a state space that cannot be fully explored; the need to model how the state of the body and the perceived state of the environment evolve together over time; and solutions that can work with only a small number of sensorimotor experiences.

Our contribution is a reinforcement learning method that implicitly represents the current state of the body and the environment using sensorimotor traces. A distance metric is defined between the ongoing sensorimotor trace and previously experienced sensorimotor traces, and this is used to model the current state as a weighted mixture of past experiences. Sensorimotor traces play multiple roles in our method: they provide an embodied representation of the state (and therefore also the value function and the optimal actions), and they provide an embodied model of the system dynamics.

In our implementation, we focus specifically on learning steering behaviors for a vehicle driving along straight roads, winding roads, and through intersections. The vehicle is equipped with a set of distance sensors. We apply value iteration using off-policy experiences in order to produce control policies capable of steering the vehicle in a wide range of circumstances. An experimental analysis is provided of the effect of various design choices.

In the future we expect that similar ideas can be applied to other high-dimensional systems, such as bipedal systems that are capable of walking over variable terrain, also driven by control policies based on sensorimotor traces.

Preface

This thesis is joint work by me and my supervisor, Michiel van de Panne. I was responsible for the program implementation, for contributing to the discussion of the ideas and the project, and for writing the thesis. Michiel guided the discussion of ideas and gave feedback on the thesis writing.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Research Goals
  1.4 Thesis Organization
2 Related Work
  2.1 Embodied Learning
  2.2 Sensorimotor System
  2.3 Window Memory - Sensory Memory Trace
  2.4 Reinforcement Learning
  2.5 Data Driven Dynamic System Model
  2.6 Policy Search Methods
3 Learning Framework
  3.1 Introduction
  3.2 Data, State, and Distance Metrics
  3.3 Reinforcement Learning Framework
    3.3.1 Background Introduction
    3.3.2 The Agent-Environment Interface
    3.3.3 Returns
    3.3.4 Value Functions
    3.3.5 Dynamic Programming
    3.3.6 Our RL Implementation
  3.4 Storage of Sensorimotor States
  3.5 Fast Nearest Neighbor Queries
  3.6 Summary
4 Car Steering Using Sensorimotor Traces
  4.1 Car Model and State Representation
    4.1.1 Dubins Simple Car Model
    4.1.2 Example Environments and Tasks
    4.1.3 Sensors Configuration
    4.1.4 Linear Speed and Uniform Sampling
    4.1.5 Condition Definition
    4.1.6 Car State Representation
  4.2 Application of RL
    4.2.1 Different Configurations between CSRL and STRL
    4.2.2 Classical State RL (CSRL)
    4.2.3 Sensorimotor Traces RL (STRL) with Local Sensors
  4.3 Summary
5 Results
  5.1 Comparison Using CSRL
    5.1.1 Policy Evaluation for Straight Road Driving Problem
    5.1.2 Driving on Curved Tracks
  5.2 Tracks with Sharp Turns
  5.3 Results with Complex Tracks
  5.4 Comparison Between with and without Trace
  5.5 Tracks with Multiple Choices
  5.6 Data Usage Effectiveness
6 Discussion and Conclusion
  6.1 Contributions
  6.2 Limitations
  6.3 Future Work
Bibliography

List of Tables

Table 4.1  Summary of Algorithm Design Choices for Different Scenarios
Table 5.1  Configuration Comparison between CSRL and STRL

List of Figures

Figure 3.1  Diagram of State Information
Figure 3.2  Agent-Environment Interaction
Figure 3.3  NN Estimation
Figure 3.4  Continuity Guess
Figure 4.1  System Overview
Figure 4.2  Dubins Car Model
Figure 4.3  Example of Task
Figure 4.4  Car Sensors Configuration
Figure 4.5  Norm Sensors Clear the Ambiguity
Figure 4.6  Configuration Comparison
Figure 4.7  y-θ Scenario Diagram
Figure 4.8  2D Grid Sampling
Figure 4.9  Results of Greedy and RL Policy
Figure 4.10 A Scenario Illustrates the Drawback of Greedy Policy
Figure 5.1  Performance Comparison between STRL and CSRL
Figure 5.2  Performance Comparison as Data Size Increases
Figure 5.3  Results on Curved Tracks
Figure 5.4  STRL Applied to 90 Degree Turning
Figure 5.5  STRL Applied to 180 Degree Turning
Figure 5.6  Winding Track Results
Figure 5.7  Track with Obstacles
Figure 5.8  Compare the Result with or without Trace
Figure 5.9  Steering Comparison with and without Trace
Figure 5.10 All Possible Ways to Pass Branch Points
Figure 5.11 An Illustration of Multi-choice Training
Figure 5.12 MLSR Mode Steering
Figure 5.13 MSLR Mode Steering
Figure 5.14 Data Usage Frequency

Acknowledgments

This thesis would not have been possible without the support of many people. My sincerest gratitude goes first to my supervisor, Professor Michiel van de Panne, for his excellent guidance, caring, patience, and for providing me with an excellent atmosphere for doing research. Michiel always kindly gave me the freedom and encouragement to explore different ideas, which form the basis of this thesis.

I am also grateful to my second reader, Professor Dinesh Pai, for his valuable feedback on the writing of this thesis. Many thanks to the members of Imager Lab for creating such a fun place to work and study. I wish to thank my good friends Junhao Shi, Baipeng Han, Chuan Zhu, Shuo Shen, Xinxin Zhang, and many others for their support, which helped me and made my student life in Vancouver an enjoyable memory.

Finally, I would also like to thank my parents, Xuyang Li and Yuanfang Li, and my girlfriend, Xin Zhou. They were always supporting and encouraging me with their best wishes.
Chapter 1: Introduction

Many artificial intelligence problems are focused on how computers and robots can learn from and understand the world. For physics-based animation we may also wish to have characters develop skills based on trial-and-error learning as observed through embodied sensors. In this thesis we explore new models for using reinforcement learning to acquire motion skills, with potential applications to computer animation and robotics.

We begin by discussing the motivation behind using sensorimotor traces in reinforcement learning. We then detail some of the challenges of reinforcement learning and its application to motion control. We outline the goals we hope to achieve. We conclude this chapter with a summary of the thesis structure.

1.1 Motivation

The acquisition of skills via reinforcement learning is a problem of interest to many fields of research, including machine learning, computer animation and robotics.

Machine Learning for Sequential Decision Problems

Reinforcement learning is a well-studied learning method for sequential decision problems. Due to a computational complexity that is exponential in the dimension of the problem, research has focused on mapping high-dimensional problems into low-dimensional forms. The application of reinforcement learning to high-dimensional problems, such as humanoid motions with many degrees of freedom, remains an open topic.

Robotics

Anthropomorphic robots, or "humanoid robots", have long been a fascination of mankind. Roboticists also develop other types of robots, such as wheeled robots, for specific tasks. A significant long-term challenge of building more intelligent robots lies with creating the ability to learn motion strategies, instead of having them be preprogrammed.

Computer Animation

Building intelligent, autonomous characters for use in computer graphics faces some of the same challenges as building capable robots. It is a simpler problem in that it lets us easily experiment with the control problems without having to deal with the complexities of hardware. Intelligent and reactive motions are also a key part of what makes a character seem "alive". Characters will be perceived as intelligent if they can perform movement in autonomous and flexible ways, such as adaptation to new environments. For example, a biped could learn how to stand up from a variety of initial postures using an appropriate generalizable control strategy.

Embodied Learning

Embodied learning describes how an agent can respond to its environment in sensible ways without needing a full world model. When an agent explores the world, the sensory data can often provide actionable information for a given situation with or without building a world model. We want the agent to learn and perform the appropriate motion when it senses the corresponding scenario. Achieving intelligent emergent behavior in this way is the motivation behind many embodied learning efforts.

1.2 Challenges

Reinforcement learning faces several important challenges when applied to problems such as controlling human movement.

High-dimensional Complexity

Many learning frameworks suffer from the need to explicitly model high-dimensional configurations. The human body has many degrees of freedom, and the environment can have significant complexity. There can also be redundancy in the sensory representations: some sensory data does not provide useful information to guide learning and moving.

State Space Exploration

Based on limited sensorimotor experiences with specific skills, it is challenging to distinguish among different scenarios and guide the character or agent to move appropriately according to a learned policy. Much of the state space will never have been visited before.

Sensorimotor Traces for State Representation

As sensory data provides a limited view of the real environmental conditions, it is a challenge to estimate the true state of the agent and its environment. Instead, we explore directly using time windows of sensorimotor traces as our state representation.

Learning Efficiency

For data-driven reinforcement learning, the main topic of this thesis, the data efficiency of the learning is a core challenge. How much data should be used? More data achieves better results, but demands more memory and computation. What is the right trade-off?

1.3 Research Goals

The general problems of data-driven reinforcement learning for skilled motion control remain unsolved in many ways. The major goals for our work are as follows:

- To develop representations and policy learning techniques that are effective when learning from only a small number of collected sensorimotor experiences.
- The learned policies should generalize well across similar-but-different environments.
- The method should support the progressive addition of new skills or adaptation to new environments.
- The method should generalize well to many kinds of dynamical systems. As we shall see later, we leave this goal as future work.

1.4 Thesis Organization

The remainder of this thesis is structured as follows. In Chapter 2, we present a selection of previous work relating to reinforcement learning. We also describe related work on embodied learning, sensorimotor traces, and data-driven dynamic system modeling. Chapter 3 describes our reinforcement learning framework and gives the detailed specification of all the steps and algorithms. Chapter 4 describes our implementation for car steering problems. We present details regarding system configuration and parameter settings. The results of experiments are presented in Chapter 5. We demonstrate the performance of our method by comparison with a traditional state-space representation on an easily modeled problem, and also present several adaptations towards harder problems. We conclude in Chapter 6 with a summary of contributions, a discussion of problems and limitations, and directions for future work.
Chapter 2: Related Work

Many aspects of the ideas we present in this thesis have been previously explored. In this chapter, we briefly review the research areas that are related to embodied learning, reinforcement learning, state representation, and modeling high-dimensional continuous dynamic systems.

2.1 Embodied Learning

Embodied cognition is a branch of cognitive science in which the mind, body and the environment interact, enabling learners to acquire or construct new knowledge [PB06]. The goal is to learn in goal-directed, real-time environments that engage the senses, perceptions, and prior experiences [Ker02]. Embodied cognition is a reaction against the Cartesian dualism posited by René Descartes, which informed instructional design for many centuries. Cartesian dualism articulates a distinct separation between the mind and the body. According to this theory, learning takes place solely through the mind.

With these ideas in mind, learning in robotics can be summarized as follows: given a specific goal and a sensory system for perceiving the world, the agent learns a skill by practicing in order to achieve the goal. Robot learning is like a baby exploring new things with few prior experiences. Skilled motion control cannot be learned if the agent has no opportunity to practice and get feedback from the world. With this approach, we can help objects (programmable robots, computers, etc.) to learn by giving them an understandable goal and letting them develop and refine their motion control skills through practice.

2.2 Sensorimotor System

Central to the idea of embodied learning is that the mind is grounded in the sensations coming from the body and that the mind controls the body to further explore the environment. The sensorimotor system is thus an online stream of all that is happening. Prior to a movement, an animal's current sensory state is used to generate a motor command. To generate a motor command, first, the current sensory state is compared to the desired or target state. Then, the nervous system transforms the sensory coordinates into the motor system's coordinates, and the motor system generates the necessary commands to move the muscles so that the target state is reached [Fla11].

The sensorimotor system is widely researched. Some research focuses on how an agent can learn an internal model of the world [TN99, WGJ95]. Research also uses the sensorimotor system to help understand the concept of evolving abstractions in the brain [VG05]. It is proposed that the sensorimotor system has the right kind of structure to characterize both sensorimotor and more abstract concepts.

Models of the sensorimotor system have been applied to dynamic categorization [MI04], such as the ability of an autonomous agent to distinguish 2D objects like rectangles and triangles. Rolf Pfeifer [PS97] has pointed out that the problem of categorization in the real world is significantly simplified if it is viewed as one of sensorimotor coordination, rather than one of information processing happening on the input side.

The sensorimotor system is widely modeled in many learning tasks by combining sensor memory and motor actions together. The integration of the sensory stream and motor actions allows the agent to gain more competence on complex problems. This is inspired by the fact that motions lead to changes in what is sensed, while what is sensed provides the information to decide which actions to take online. Because of the continuity of the sensor stream, we can, to some extent, make predictions based on observed sensory patterns. We use these ideas in this thesis to construct a structured model for the dynamics.

2.3 Window Memory - Sensory Memory Trace

Sensory memory allows individuals to retain impressions of sensory information after the original stimulus has ceased. According to this notion, it is reasonable to think of a sensation as being a trace of sensory memory rather than just the current sensory values. During every moment of an organism's lifetime, sensory information is taken in by sensory receptors and processed by the nervous system. The information received can also be transferred to short-term memory.

A sensory memory trace (SMT) can help an agent better understand the world it is experiencing. SMTs can help to stabilize the understanding of the world and eliminate ambiguity [PB08] when searching for "familiar scenarios" through memory storage. It has been noted that ambiguity normally causes visual perception to waver unpredictably when such ambiguities are presented briefly with intervening blank periods. Sensory memory traces can provide more helpful information for state representation and environment recognition [PB08].

Even though we cannot directly know the best duration to use for the short-term memory, it is not hard to treat this as a free parameter that can be solved for using optimization. In this thesis, we apply a framework with sensory memory traces, which are defined as the periods of time of sensory memory that help the agent to make action decisions.

2.4 Reinforcement Learning

For general introductions to reinforcement learning from varying perspectives, we refer the reader to the books by Bertsekas and Tsitsiklis [BT95] and Sutton and Barto [SB98], and the more recent books by Bertsekas [Ber07], Powell [Pow07], and Szepesvári [Sze10].

For model-based reinforcement learning problems, one can use dynamic programming (Bellman [Bel57]; Howard [How60]; Puterman [Put94]; Sutton and Barto [SB98]; Bertsekas [Ber07, Ber05]), or one can sample from the model and use one of the reinforcement learning algorithms discussed in [vH07]. Model-free reinforcement learning systems, such as Q-learning (Watkins [H.89]) or advantage updating (Baird [C.93]), require that a function f(x,u) be learned, where x represents the state, u represents the control actions, and f is the function that maps x and u to the system dynamics. Schwartz [Sch93] examined the problem of adapting Q-learning to an average-reward framework. Although his R-learning algorithm exhibits convergence problems for some MDPs [How60, vH07, Ber87], several researchers have found the average-reward criterion to more closely represent the true problem they wish to solve, as compared to a discounted criterion, and therefore prefer R-learning to Q-learning.

There are many traditional benchmark problems for reinforcement learning. Mountain car [SS] is a standard RL benchmark simulation where the agent starts in a valley between two hills and takes actions to reach the top of the right hill. The cart-pole system [Spo98] is composed of a cart that moves along a single axis with a pendulum (the pole) attached by a pin joint.

However, when the agent is not localized within the environment it is exploring, embodied reinforcement learning can be highly beneficial. In this thesis, we apply model-free reinforcement learning to a simplified dynamical system, in order to teach the agent to adapt to its environment. The car driving problem in this thesis is in part based on previous work on a policy-search method [AvdP05] that uses a simple sensorimotor space to achieve intelligent steering behaviors on a winding track. The dynamic system for steering problems is a useful platform on which to test ideas.

Our thesis tackles a high-dimensional, continuous sensory-space learning problem [vH07, C.93, IJR12, Kim07] using reinforcement learning based on sensorimotor traces.

2.5 Data Driven Dynamic System Model

Motion fields [Lee10] provide inspiration with regard to representations and data-driven control for dynamical systems. A carefully crafted distance metric is defined for generating an interpolated motion field through the state space. The method uses a data-driven reconstruction of the available actions and the resulting dynamics. It also describes several ideas for dealing with data scarcity when the simulated state encounters unknown or sparsely populated regions of the space.

2.6 Policy Search Methods

Although many dynamic programming algorithms can guarantee convergence to an approximation of the optimal value function, it remains to be shown that these algorithms can be used to learn complex control in continuous state spaces of moderate-to-high dimension and continuous action spaces. Many motion control problems fall into this category.

A popular method for policy search is the policy gradient method (PGM) [Sut00, PV06, Lau12]. PGMs are formulated in the Markov Decision Process (MDP) setting. The idea of a PGM is to update a policy in a direction that minimizes a predefined cost via gradient descent. Readers who are familiar with optimization will recognize the gradient descent step, θ := θ - α ∂J/∂θ, where θ is the parameter set, α is the update step, and J is the cost to be minimized. Suppose, then, that a policy is parameterized by a set of parameters θ and that we have a cost function to be minimized. To compute the gradient, using a local model for the cost function and taking J = (s - s̄)ᵀ(s - s̄), we have ∂J/∂θ = 2(s - s̄)ᵀ ∂s/∂θ, where s represents the current state and s̄ describes the target state. The ∂s/∂θ term requires knowledge of the dynamics model, i.e., the change of state with respect to the change of the control parameters. Two approaches are commonly proposed to work around this requirement, namely the finite differencing method [SB98] and the likelihood ratio method [BB01].
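To make the gradient step above concrete, the following is a minimal Python sketch of the finite-differencing variant. The quadratic cost J, the perturbation size, and the rollout interface are illustrative assumptions for this sketch, not details taken from the cited works.

```python
import numpy as np

def finite_difference_policy_gradient(rollout, theta, eps=1e-2):
    """Estimate dJ/dtheta by perturbing each policy parameter in turn.

    rollout(theta) -> final state s reached when running the policy with
    parameters theta (a stand-in for a full episode of the dynamics).
    """
    s_target = np.zeros_like(rollout(theta))           # assumed target state s*

    def cost(th):
        s = rollout(th)
        return float((s - s_target) @ (s - s_target))  # J = (s - s*)^T (s - s*)

    grad = np.zeros(len(theta))
    for i in range(len(theta)):
        d = np.zeros(len(theta))
        d[i] = eps
        # Central difference approximates dJ/dtheta_i without a dynamics model.
        grad[i] = (cost(theta + d) - cost(theta - d)) / (2.0 * eps)
    return grad

def policy_gradient_step(rollout, theta, alpha=0.1):
    """One gradient-descent update, theta := theta - alpha * dJ/dtheta."""
    return theta - alpha * finite_difference_policy_gradient(rollout, theta)
```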
Complex motion control problems have been solved using policy search in many examples. For example, a policy was designed for flying an autonomous helicopter [Ng03]. In order to learn a stable hovering behavior, the policy was defined as a set of neural networks of a specific topology that map a specific choice of state variables and position errors to actions. The connection weights of the networks were the parameters left for the policy search to determine. Another example involves a controller for robot weightlifting, which was learned using policy search and shaping techniques modeled after human motor learning [RB01]. The task is to lift a three-jointed robot arm with a payload from a straight-down equilibrium position to a straight-up unstable position by applying a torque to each joint. The timing parameters, proportional-derivative (PD) controller parameters, and the matrix weights that couple the torques applied to the joints are learned with a simple stochastic search algorithm. In order to reduce the likelihood of the search getting stuck in a local maximum, this method needs to increase the number of seed points for the optimization. Policy search is broadly applicable, particularly when little is known about the model and the relationship between the policy parameters.

Chapter 3: Learning Framework

3.1 Introduction

In this chapter, we present a framework for reinforcement learning based on sensorimotor trace data and applied to a dynamic system. A sensory memory trace (SMT) is used as the state representation, and distance metrics are used for state comparisons. The use of distance metrics helps to cope with the challenging high-dimensional nature of problems that are commonly encountered in RL. This property of our learning architecture provides us with one way of solving many kinds of high-dimensional reinforcement learning problems. Several methods for effective data utilization are also described in this chapter. These methods help to save memory and speed up the online action queries.

3.2 Data, State, and Distance Metrics

We assume in this thesis that a recorded dataset is constructed from multiple trajectories and their sensory history. Sensory data are recorded by sampling at discrete time intervals. We define the configuration, C_n, as the sensory vector at time n. We denote the action taken at time n as A_n.

To represent the system state at time n, a sensory trace S_n = {C_{n-w+1}, C_{n-w+2}, ..., C_n} is taken into consideration, where w is the duration of the trace, as measured in sample points. The time window helps disambiguate the inherent ambiguity of sensory information at any given instant in time. As shown in Figure 3.1, along the time axis, red rectangles show the window of data that contributes to the state representation at a given time.

Figure 3.1: Diagram of state information. Red rectangles cover the sensory time window of data used as the state representation at times t1 and t2. The sensory trace duration is specified as w. Each horizontal line represents one sensory channel.

To define a good distance metric we want to identify contributing sensors and also determine appropriate weightings for their contributions. More sensors can provide more informative sensory data, while excessively many sensors will introduce redundancy. In this thesis, we determine the distance metrics empirically. For example, we assume that older data within a sensorimotor trace should be given less weight as compared to more recent data. Further discussion on distance metrics can be found in Section 6.2.

We define a distance metric on the sensory vector, d_C(C_{t1}, C_{t2}), which will be used in support of our non-parametric representations. We use a weighted Euclidean distance metric defined by:

    d_C(C_{t_1}, C_{t_2}) = \sqrt{ \sum_i \beta_i \left( C^i_{t_1} - C^i_{t_2} \right)^2 }    (3.1)

The superscript i in C^i_t indicates the i-th element of C_t, and β_i denotes the weight assigned to each element of C_t in the distance metric calculation. We now define the corresponding distance on states S_t associated with a window length of w:

    d_S(S_{t_1}, S_{t_2}) = \sum_{i=0}^{w-1} \lambda^i \, d_C(C_{t_1-i}, C_{t_2-i})    (3.2)

where λ, in the range [0,1], is the discount parameter used to weaken the influence of configurations C that lie further in the past. When λ = 0, the state, S, is only related to the currently observed samples of sensory values, C_t. When λ = 1, all samples in the time window are weighted equally. Usually, as illustrated by the color ramp in Figure 3.1, we attach more weight to more recent time samples.
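As a concrete illustration of Equations (3.1) and (3.2), the short Python sketch below computes the weighted per-sample distance d_C and the discounted trace distance d_S. The channel weights and the trace discount are free parameters chosen empirically, as discussed above; the specific values shown here are placeholders.

```python
import numpy as np

def d_C(c1, c2, beta):
    """Weighted Euclidean distance between two sensory vectors, Eq. (3.1)."""
    diff = np.asarray(c1) - np.asarray(c2)
    return np.sqrt(np.sum(beta * diff * diff))

def d_S(trace1, trace2, beta, lam):
    """Discounted distance between two sensorimotor traces, Eq. (3.2).

    trace1, trace2: sequences of w sensory vectors, oldest first, so that
    trace[-1] is the current sample C_t and trace[-1 - i] is C_{t-i}.
    """
    w = len(trace1)
    assert len(trace2) == w
    return sum(lam ** i * d_C(trace1[-1 - i], trace2[-1 - i], beta)
               for i in range(w))

# Example usage with placeholder weights and a window of w = 3 samples.
beta = np.ones(12)       # one weight per sensory channel (illustrative)
lam = 0.8                # more weight on recent samples
trace_a = [np.random.rand(12) for _ in range(3)]
trace_b = [np.random.rand(12) for _ in range(3)]
print(d_S(trace_a, trace_b, beta, lam))
```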
3.3 Reinforcement Learning Framework

In this section, we review fundamental concepts of reinforcement learning (RL). We also introduce the specific RL technique that we employ for our system. A good summary of RL methods can be found in [SB98].

3.3.1 Background Introduction

Reinforcement learning is an area of machine learning concerned with how agents should take actions in an environment so as to maximize cumulative reward. A key benefit of reinforcement learning is that it is a goal-oriented learning method that uses rewards to develop a competency without needing to specify how to achieve a given task. In other words, the agent is not told which actions to take, but rather discovers which actions yield the best cumulative rewards.

Another key feature of reinforcement learning is that it explicitly considers the problem of a goal-directed agent interacting with an uncertain environment. This is in contrast to many approaches that consider sub-problems without addressing how they might fit into a larger picture. For instance, much of machine learning research is concerned with supervised learning without explicitly specifying where the supervisory reference solutions might come from.

Beyond the agent and the environment, one can identify four main sub-elements of a reinforcement learning system: a policy, π; a reward function, r; a value function, V; and, optionally, a model of the environment.

A policy defines the learning agent's choice of action at a given time instant. Usually, a policy is a mapping from perceived states (or an estimated state) of the environment to the actions to be taken when in those states.

Whereas the reward function, r(s,a), indicates what is desirable in an immediate sense, a value function, V(s), specifies what is good in the long run. The value function gives the cumulative discounted reward that an agent can expect to accumulate in the future, starting from that state.

3.3.2 The Agent-Environment Interface

The reinforcement learning problem is a particular way of framing the problem of learning from interaction to achieve a specified goal. We refer to the learner and decision-maker as the agent or controller. All that it interacts with, including everything outside the controller, is called the environment. The controller decides which action to take at every time step, and that action changes the state with respect to the environment. Meanwhile, the environment also provides a current reward, a numerical value that the controller tries to optimize over time.

Figure 3.2: The agent-environment interaction in reinforcement learning.

More formally, the controller and the environment interact at each of the discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the controller receives some state information about the environment, s_t ∈ S, in which S is the space of all possible states. Based on s_t, we can choose an action, a_t ∈ A(s_t), where A(s_t) is the set of all available actions in state s_t. One time step later, as a consequence of the action just taken, the controller receives a reward, r_{t+1} ∈ R, and finds itself in a new state, s_{t+1}. We define the state transition function as

    s_{t+1} = f(s_t, a_t)    (3.3)

Figure 3.2 illustrates the standard reinforcement learning controller-environment interaction model.

At each time step, the controller uses a mapping from states to select among the possible actions for the particular current state. We call this mapping the policy of the controller, denoted π(s_t, a_t). Usually, π(s_t, a_t) is defined as the probability that a_t = a if s_t = s. In this thesis, the agent deterministically chooses the action with the highest expected return and thus we do not consider stochastic policies. Reinforcement learning methods specify how the controller changes its policy as a result of its experience.

3.3.3 Returns

As stated earlier, the controller's goal is to maximize the total amount of reward it receives over the long run. The expected return, R_t, is defined as a function of the sequence of rewards. For a finite time horizon of T steps, this can be defined as:

    R_T = \sum_{i=1}^{T} r_{t+i}    (3.4)

In this thesis we use an infinite horizon with discounted rewards, as is common in RL problems. According to this approach, the agent selects actions to maximize the sum of the discounted rewards it expects to receive in the future. Specifically,

    R_t = \sum_{k=1}^{\infty} \gamma^k r_{t+k}    (3.5)

where γ is the discount rate, with 0 ≤ γ ≤ 1.

The discount rate determines the present value of future rewards: a reward received k time steps in the future is worth only γ^k times what it would be worth if it were received immediately. For small values of γ, the controller strongly favors rewards in the near future. As γ approaches 1, the objective gives more weight to future rewards and the controller becomes more farsighted.

3.3.4 Value Functions

Almost all reinforcement learning algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how desirable it is for the controller to be in a given state. This notion is defined in terms of the future rewards that can be expected, or, to be precise, in terms of expected return. Of course, the rewards the controller can expect to receive in the future depend on what actions it will take. Thus, value functions are associated with particular policies.

The value of a state s under a policy π, denoted V^π(s), is the expected return when starting in s and following π thereafter. V^π(s) gives the expected cumulative reward for the state s. In reinforcement learning, the goal is to find a policy π*, such that

    V^{\pi^*}(s) = \max_{\pi \in \Pi} \{ V^{\pi}(s) \}    (3.6)

3.3.5 Dynamic Programming

Much of reinforcement learning theory is based on the principles of dynamic programming, which exploit the structure of a Markov Decision Process (MDP) to find π* more efficiently than the alternative of an exhaustive search of the space of all possible policies, Π. For MDPs with finite state and action spaces, an optimal policy, π*, can be found indirectly by solving the Bellman optimality equation for V*, of which the following is one of a number of equivalent forms:

    V^*(s) = \max_{a \in A(s)} \{ r(s,a) + \gamma V^*(s') \}    (3.7)

where s' is the state reached by taking action a from state s. The unique solution gives the optimal state-value function, V*, which is equivalent to the solution of Equation (3.6), V^{π*}. There may be more than one policy, π*, that is optimal, but they share the same optimal state-value function. Once V* is available, an optimal policy can be defined by choosing actions greedily to achieve as much cumulative discounted reward as possible:

    \pi^*(s) = \arg\max_{a \in A(s)} \{ r(s,a) + \gamma V^*(s') \}    (3.8)

To compute the optimal state-value function, we combine Equations (3.5) and (3.7) into the iterative update:

    V_{k+1}(s_t) = \max_{a \in A(s_t)} \{ r_{t+1}(s,a) + \gamma V_k(s_{t+1}) \}    (3.9)

To implement this iteration, as given by Equation (3.9), we choose to use in-place updates.

Algorithm 1: Reinforcement Learning Iteration
 1: Initialize V(s) = 0, for all s ∈ S
 2: repeat
 3:   Δ ← 0
 4:   for all s ∈ S do
 5:     v ← V(s)
 6:     V(s) ← max_{a ∈ A(s)} { r(s,a) + γ V(s') }   (Bellman backup)
 7:     Δ ← max(Δ, |v − V(s)|)
 8:   end for
 9: until Δ < ε (a small positive number)
10: Output V ≈ V^π

V(s) converges to the optimal state-value function, from which an optimal policy can be extracted greedily.
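A compact Python rendering of Algorithm 1 is given below for a finite MDP described by explicit reward and transition functions. The dictionary-based interface is an illustrative assumption for this sketch, not the thesis implementation.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, eps=1e-6):
    """In-place Bellman backups until the largest update falls below eps.

    states:           iterable of hashable states
    actions(s):       available actions in state s
    transition(s, a): deterministic next state s'
    reward(s, a):     immediate reward r(s, a)
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman backup (Line 6 of Algorithm 1).
            V[s] = max(reward(s, a) + gamma * V[transition(s, a)]
                       for a in actions(s))
            delta = max(delta, abs(v_old - V[s]))
        if delta < eps:
            return V

def greedy_policy(V, states, actions, transition, reward, gamma=0.9):
    """Extract an optimal policy from V*, following Equation (3.8)."""
    return {s: max(actions(s),
                   key=lambda a: reward(s, a) + gamma * V[transition(s, a)])
            for s in states}
```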
In our thesis, the reward function consists of17two parts, a positively-valued term, and a negatively-valued cost term. The detailsof these terms will be described in Section 4.2.Nearest Neighbor Estimation of the State DynamicsIn order to obtain the reward (rt+1) and estimated value of Vk(st+1) for next state,we first need to develop an estimation approach to evaluate the next state dynam-ics. We use nearest-neighbor(NN) estimation in this thesis as simple but effectiveestimator.The estimated state transition function can be written as:s?t+1 = f? (st ,at). (3.10)Figure 3.3: Illustration of NN estimation. Colors of dot A, F , C, G distinguishthe actions that was taken in these states. Circle shows the range to de-fine neighborhoods. Solid arrows represent the real transition recordedin memory, while the dash arrow shows the estimated transition at stateA by applying action a3 in state C, the nearest neighbor applied a3. Par-ticularly,?AE=?CD.18NN estimation evaluates ?s, the change of state s in state transition, for a givens using regression over a dataset that contains {(si,a,?si)} pairs, which are pro-vided by every two continuous data points recorded in memory. Formally, weestimate ?st+1 = f (st ,at) using a nearest neighbor estimator, from which we thencompute st+1 = st +?st+1.This is illustrated in Figure 3.3, here, state A is the query state, while blackarrows represent various ?s. Colors of dot A, F , C, G distinguish the actions thatwas taken in these states. C and G applied the same action, while A and F used dif-ferent motions. The grey circle centered with point A represents the neighbor rangeof A, thus we regard state F , C and G as A?s neighbors that they are close enoughto contribute similar state transition if the same action is applied. According tothis assumption, we estimate the expected state transition f? (st ,at) with particularaction at by referring to the nearest neighbor that takes action at . The change instate is assumed to be the same as that of the nearest neighbor that also takes thesame action. As shown by the blue dash arrow in Figure 3.3, the change in state?AE is estimated from?CD, but not?GH, since C is the nearest neighbor that appliedaction a3. Formally, ??s = ?sNN (s,a). In other words, among all the neighbors ofstate A, there can be many different actions to try out, and among all neighbors thattake the same action, we estimate the next state using the nearest neighbor.With next state st+1 estimated using our approach, we can calculate the rewardrt+1 according to the information of st+1.kNN Estimation of Value FunctionWe use kNN estimation to obtain the estimated value of Vk(sk+1).For a query state S, we estimate its value by referring to its nearest k neighbors.We denote the neighborhood of state S as Nk (S), in which N notes for the set ofnearest neighborhoods with size of k.V? = ?i wiV (si), where wi =1? ?1d(S,Si)2 . Here, Si is the ith neighbor of S and? = ?i 1d(S,Si)2 is a normalization factor to ensure the weights sum to 1, and d(S,Si)is a distance metric that we define in Section 3.2.19Summary of Estimated RLWith these two estimation technique, we can solve the value function iteration inEquation (3.9). The next state can be evaluated by applying NN Estimation. Oncethe estimated next state s?k+1 is obtained, we can calculate the estimated rewardr?k+1 according to the state information. 
Summary of Estimated RL

With these two estimation techniques, we can solve the value function iteration in Equation (3.9). The next state is evaluated by applying NN estimation. Once the estimated next state ŝ_{k+1} is obtained, we can calculate the estimated reward r̂_{k+1} according to the state information. In addition, the value V_k(s_{k+1}) is evaluated by applying kNN estimation.

3.4 Storage of Sensorimotor States

The quality of the learned policy is dependent on the quantity of example data that is used to model the dynamics and the rewards.

We store the set of experienced sensorimotor states, {S_i}, via time indexing into the original recorded sensorimotor trace data. This allows for significant efficiency, given the temporal overlap of the time windows, as compared to the alternative of storing the data for each S_i independently.

3.5 Fast Nearest Neighbor Queries

This section describes practical methods with which we can efficiently access the required data.

The time required for searching through the memory depends on the size of the data to search. There are several ways to speed up the memory search, and we may apply different methods to narrow down the search range depending on the scenario. If the agent is simulated in environments that are similar to the training scenarios, we can apply the Continuity Guess method. For general problems, the Successor List method is practical.

Continuity Guess

After learning some experiences under particular scenarios, the agent can explore the world and expect those learned, familiar scenarios to show up again. For example, the agent remembers that after seeing state s1, states s2 and s3 followed. Intuitively, when the character explores new scenarios and sees s1, it can reasonably predict that s2 and s3 may follow.

Figure 3.4: Demonstration of the reappearance of continuous scenarios. States s1, s2, s3, s4, experienced during learning, can be seen in sequence when the agent explores new environments.

Using this idea of state continuity along a trajectory, we propose a method called Continuity Guess, which assumes that the next states of the current k nearest neighbors provide a good prediction of the likely nearest neighbors at the next time step.

However, the prediction could be wrong. In this case, we have to correct the recognition of the real world by searching through all the data points to find new nearest neighbors. The frequency of this adjustment depends both on the similarity between the new and learned environments and on how a good prediction is defined (usually with a threshold).

This particular method is not useful for exploring novel environments, since familiar scenarios are rarely encountered there.

Successor List

If the dynamic system is in a new environment, where the Continuity Guess method will not work well, we widen the range of possible next states from just one next data point to a union of possible next data points. Based on this concept, we bind each data point in memory to a list (union) of possible next states, which we call the Successor List.

Each data point p has one successor list, which contains the indices of data points that either once were the next state of p or appear in the successor lists of p's neighbors. At each time step of simulation, we call the set of neighbors of p the set of activated data points, P_ac. The successor lists are updated at every time step:

    L'_{p_i} = p_i^{+} \cup \left( \bigcup_{j=0}^{k-1} L_{p_j} \right), \quad p_i \in P_{ac}    (3.11)

Here, L_{p_i} represents the successor list of point p_i, and L'_{p_i} is the updated version of that list; the old version of L_{p_i} is needed to perform the update. p_i^{+} stands for the next state point of p_i along the trajectory, and p_i is one of the activated data points from P_ac. At every time step, we update the successor lists of all the activated data points. Each L_p is initialized as an empty set. As the online simulation goes on, the successor lists of many data points start to grow. We define a maximum successor list size, n_L, which prevents the lists from increasing without bound.

Thus, every time we search for nearest neighbors, we search a list of data points with a maximum size of k · n_L, where k is the number of nearest neighbors. Usually, many successor lists are much smaller than this limit. As we can keep k · n_L << n_dat, where n_dat is the total size of the dataset, this method saves significant search time. It also represents a type of anticipation for the motion.
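A minimal sketch of the successor-list bookkeeping of Equation (3.11) is shown below. The capped-set representation and the choice of which entries to drop when a list exceeds n_L are illustrative assumptions.

```python
def update_successor_lists(successors, activated, next_of, n_L):
    """Apply Eq. (3.11) to every activated data point.

    successors: dict mapping point index -> set of candidate next indices
    activated:  indices of the current k nearest neighbors (P_ac)
    next_of:    dict mapping point index -> index of its recorded next state
    n_L:        maximum successor-list size
    """
    # Union of the old lists of all activated points, shared by every update.
    pooled = set()
    for p in activated:
        pooled |= successors.get(p, set())
    for p in activated:
        new_list = {next_of[p]} | pooled
        # Keep the list bounded; which entries to drop is a free choice here.
        successors[p] = set(list(new_list)[:n_L])
    return successors

def candidate_search_set(successors, activated):
    """Restrict the next nearest-neighbor query to at most k * n_L points."""
    out = set()
    for p in activated:
        out |= successors.get(p, set())
    return out
```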
Elimination of Unused Sensorimotor States

We can further narrow down the search range by eliminating seldom-used sensory memory trace data. The implementation of this idea uses an array of counters indicating how frequently each memory data point has been referenced.

After multiple simulations, we find that only portions of the sensorimotor trace data are used for online simulation. If the rarely encountered points are eliminated from all query searches, this provides further time savings.

Elimination of Redundant Sensorimotor States

While sampling the data, it is possible that near-duplicate trajectories are recorded. In our work, we do not add a new sensorimotor state to the dataset if 60% or more of its data are within a distance, ε, of the data points that are currently stored. Distance metrics are used to determine how close two data points are.

Other Methods

We note that there also exist many other methods for fast approximate nearest neighbor lookup.

3.6 Summary

In this chapter, the core ideas of our reinforcement learning method were described. It is an architecture for data-driven reinforcement learning, based on sensorimotor trace data as a model for the system state. Regression estimation methods are used to estimate state dynamics and value functions. We also described some techniques for effective data utilization.

In the next chapter, we apply our learning framework to a car steering problem.

Chapter 4: Car Steering Using Sensorimotor Traces

In Chapter 3, the framework of our learning method was described in a general way. The learning framework is theoretically applicable to many types of dynamical system. In this chapter, we describe the details of applying the framework to car steering problems.

Figure 4.1: Overview of the RL system.

Figure 4.1 shows an overview of our RL framework. Our system aims to acquire motion skills by applying reinforcement learning to a data-driven dynamics system using sensorimotor traces. During the data gathering phase, the dynamic system with sensors is simulated to gather raw sensorimotor data. The motions used for data gathering can be elementary and uninformed (off-policy), such as random steering. In the learning phase, with the learning goal defined, reinforcement learning is applied to the data, and for each state an optimal action is selected, which defines a new mapping from state to action, i.e., a learned policy. At run time, the dynamics system uses the learned policy to guide movement. At each time step t, the agent queries for the right action a_t to take based on the current sensory memory trace values.

4.1 Car Model and State Representation

4.1.1 Dubins Simple Car Model

Figure 4.2: Illustration of the Dubins simple car model. Given a constant forward speed, (x, y, θ) represents the state of the car in the world. The steering angle, a, is the action taken by the car.

In this thesis, we use the Dubins simple car to model our car dynamical system. Figure 4.2 illustrates the car dynamics, with x and y representing the car coordinates in the world, and θ defining the car orientation. The angle a describes the car steering angle. The car has a constant linear speed v and only one degree of freedom for action, i.e., the steering.

    \dot{x} = v \cos(\theta)    (4.1)
    \dot{y} = v \sin(\theta)    (4.2)
    \dot{\theta} = \omega = \frac{v}{L} \tan a    (4.3)

where L is the car's wheelbase. Over a short time interval, Equations (4.1), (4.2) and (4.3) describe the car dynamics, which are used to simulate the next condition of the car at run time. Within a short time interval, we can assume the car is driving along the orientation given by θ, using Equations (4.1) and (4.2). At the end of each simulation time step, we update the car orientation using Equation (4.3). As the car is steering, we can imagine that it is rotating around a circle center with radius ρ. Using the angular velocity, we can integrate the orientation, θ.
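The following Python snippet integrates Equations (4.1)-(4.3) with simple forward-Euler steps, updating the heading at the end of each step as described above. The wheelbase, speed, and time-step values are illustrative placeholders, not the settings used in the thesis.

```python
import math

def step_dubins_car(x, y, theta, a_deg, v=1.0, L=1.0, dt=0.1):
    """Advance the Dubins car one time step.

    x, y:   position in the world
    theta:  heading (radians)
    a_deg:  steering angle in degrees (the single control input)
    """
    # Drive along the current heading for the duration of the step.
    x_new = x + v * math.cos(theta) * dt                 # Eq. (4.1)
    y_new = y + v * math.sin(theta) * dt                 # Eq. (4.2)
    # Update the heading at the end of the step.
    omega = (v / L) * math.tan(math.radians(a_deg))      # Eq. (4.3)
    theta_new = theta + omega * dt
    return x_new, y_new, theta_new

# Example: hold a 3-degree left steer for a few steps.
state = (0.0, 0.0, 0.0)
for _ in range(5):
    state = step_dubins_car(*state, a_deg=3.0)
```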
4.1.2 Example Environments and Tasks

In this chapter, we solve car steering problems to achieve the task of driving the car down the middle of winding tracks, as shown in Figure 4.3. Environments for the car steering problems consist of various types of tracks. Chapter 5 includes results for different environments.

Figure 4.3: One example of the car steering tasks and environments.

By changing the reward function, other motion skills can be learned. For example, we can guide the car to learn to drive along the right edge of the track by altering the reward function so that it gives a high reward for that behavior.

We can also develop an interface for a user to define a specific behavior to perform upon encountering particular situations. Results related to this type of user-guided route behavior are also presented in the next chapter.

4.1.3 Sensors Configuration

The car model is equipped with 6 distance sensors, as shown in Figure 4.4. A distance sensor measures the distance between the car's center and the track edges. We denote the distance measurements as d_1, d_2, ..., d_6. Taking the forward driving direction as a reference, d_1 and d_2 are offset by 20 degrees, d_3 and d_4 by 50 degrees, and d_5 and d_6 by 90 degrees. We choose these sensors in order to have sensing towards the left, the right, and the front.

Normal sensors are used to add more information about the track circumstances. Specifically, each n_i represents the angle of the normal of the road edge with respect to the incident direction of the ray that defines the distance measurement.

Figure 4.4: The car sensor configuration. This consists of 6 distance sensors and 6 normal sensors. In this specific case, n_5 and n_6 are both 0, n_1 = -60° and n_4 = 30°.

The normal sensors are taken into consideration to reduce ambiguity. Consider the two situations presented in Figure 4.5. All distance sensors have the same length. However, the normal sensors in the front provide sufficient information to distinguish the scenarios of straight tracks and turning corners.

Figure 4.5: Two scenarios with similar distance sensor data but different normals.

4.1.4 Linear Speed and Uniform Sampling

In real life the speed of a car varies, because of traffic and adaptation to changing circumstances. In this thesis, however, we assign the car a constant speed. The fixed speed and constant sampling time intervals make it possible to use sensorimotor traces more easily.

4.1.5 Condition Definition

Given a time t and the sensory data, we define the sensed state C_t for the car as

    C_t = \{ d_1, d_2, d_3, d_4, d_5, d_6, n_1, n_2, n_3, n_4, n_5, n_6 \}    (4.4)

In this thesis we also define a transformed set of sensory measurements, \tilde{C}_t. This represents the car sensory state in a way that has a degree of scale invariance. Specifically, we desire that driving down the center of a wide road yields the same values for \tilde{C}_t as driving down the center of a narrower road. This is done by encoding the ratios of the left and right sensor readings. Specifically,

    \tilde{C}_t = \{ pair(\hat{d}_1, \hat{d}_2),\ pair(\hat{d}_3, \hat{d}_4),\ pair(\hat{d}_5, \hat{d}_6),\ n_1, n_2, n_3, n_4, n_5, n_6 \}    (4.5)

where

    pair(a, b) = \frac{a - b}{\max(a, b)}, \quad a > 0,\ b > 0    (4.6)

and, in Equation (4.5),

    \hat{d}_i = \frac{1}{d_i}, \quad i = 1, 2, \dots, 6    (4.7)

The absolute value of pair(a,b) gives the difference between a and b, while its sign identifies whether the left or the right reading is greater. Moreover, as the difference is divided by the maximum of a and b, the results are restricted to the range (-1, 1). This encoding helps the car to recognize similar situations despite the sensor values being different.

In Equation (4.5), we use \hat{d}_i, as described in Equation (4.7), in order to make the sensors more sensitive to small distances. Informally, we want to allocate additional precision in the distance metric when the car is close to the side of the road.

For convenience, we use the notation {}^{n}\tilde{C}_t to refer to the n-th element of \tilde{C}_t. For any two time steps t_1 and t_2, we can evaluate how similar the environments are to each other by looking at \tilde{C}_{t_1} and \tilde{C}_{t_2} and calculating the distance between them. We define the distance according to:

    d^2(t_1, t_2) = \sum_{i=1}^{9} \beta_i \left( {}^{i}\tilde{C}_{t_1} - {}^{i}\tilde{C}_{t_2} \right)^2    (4.8)

where β_i gives a weight to each term difference. We further discuss the selection of β_i in Section 6.2.

4.1.6 Car State Representation

We apply the sensory state representation described in Section 3.2 to a variety of car steering problems. The sensor space is defined according to \tilde{C}_t, and the car sensorimotor trace, S_t, is used as the state representation at time t, i.e.,

    S_t = \{ \tilde{C}_{t-w+1}, \dots, \tilde{C}_{t-2}, \tilde{C}_{t-1}, \tilde{C}_t \}    (4.9)
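The scale-invariant encoding of Equations (4.5)-(4.7) can be transcribed directly into code, as sketched below. The sensor ordering follows Equation (4.4); everything else is a straightforward transcription rather than the thesis implementation.

```python
def pair(a, b):
    """Eq. (4.6): signed, scale-normalized difference, in (-1, 1) for a, b > 0."""
    return (a - b) / max(a, b)

def transformed_sensors(C):
    """Build the transformed sensory vector of Eq. (4.5).

    C: [d1, ..., d6, n1, ..., n6] as in Eq. (4.4), with all d_i > 0.
    Returns a 9-element vector: three pair() terms plus the six normals,
    matching the summation range i = 1..9 in Eq. (4.8).
    """
    d = C[:6]
    normals = C[6:]
    d_hat = [1.0 / di for di in d]            # Eq. (4.7), emphasizes near walls
    return [pair(d_hat[0], d_hat[1]),         # 20-degree sensor pair
            pair(d_hat[2], d_hat[3]),         # 50-degree sensor pair
            pair(d_hat[4], d_hat[5]),         # 90-degree (side) sensor pair
            *normals]
```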
In (b), P1 - P6 construct episode 1, and episode 2 consistsof the rest points in (b).30In our thesis, datasets used for STRL are sampled from car driving simulationwith fixed time steps. Data is gathered during episodes of ?off policy? behavior,for example, it can come from episodes of random steering at the stage of datagathering, as shown in Figure 4.6(b). Figure 4.6(a) presents the alternative of gridsampling over sensor space for CSRL. At each grid sampling data point, the agentcan try out all possible actions, using simulation technique to obtain the accuratenext state. Episodic sampling, as compared to grid sampling, can not promise tocover the complete space of possible states and actions. To explore and evaluatenew actions for recorded data points, estimation approaches described in 3.3.6 areused. However, grid sampling is used for the simple track (CSRL) as in orderto generate neat data baseline. Both of the sampling approaches shown in Figure4.6 provide a discrete data set containing many data points for observed states andactions.Every data point in memory record and update its V ? when applying iteration ofreinforcement learning. For discrete dataset, we can hardly find exact value of V ?for most cases. Basically, reinforcement learning is generating a policy that teacheseach data point in memory to iteratively select an optimal action, as described inLine 6 (?Bellman backup? operation) in Algorithm 1. Trying actions other than thesampled one will transfer the state to unknown space. To estimate value function,interpolation methods are applicable when state space grid sampled (for CSRL),while episode sampling requires kNN Estimation (for STRL).To achieve better reinforcement learning, literally we should try over all avail-able actions in ?Bellman backup? operation. For example, assuming the car cansteer between [-5,5] integer degrees, we should test through -5,. . .,4,5, as illustratedin Figure 4.6(a), and pick the action that brings in maximum value. We can easilyinspect all possible actions if we simulate every state transition, however, STRLhas limited action choices because of the episode sampling. We can firstly look fornearest k neighbors and make sure their distance are within a threshold ? . Afterthat, we go over all of these neighbors, try their actions and see whether each ofthose actions makes ?Bellman backup? operation to achieve greater value.31Simple trackCSRLComplex trackSTRL(A)Data SamplingGrid SamplingSampling fromcar driving(B)DynamicsSimulated Estimated(C)V ? estimationLinearinterpolationk-nearest-neighbor(D)Action spaceEnumerate allpossible actionsEnumerated fromk nearest neighbors.(E)State representationClassical statey,?Sensory trace usingsensors on car.Table 4.1: Summary of algorithm design choices for different scenarios.Figure 4.7: Illustration for y?? scenario and sensor configuration.4.2.2 Classical State RL (CSRL)Figure 4.7 shows the scenario of the car steering under sensor space of y?? . Thecar is driving from right to left in a track with infinite length and constant width.The goal of the driving task is to steer the car in the middle of the track, startingfrom various initial states. Given an arbitrary initial state, (y,?), we want to learna policy that helps to drive along the track center, i.e., y = R/2. Here R is the thewidth of the track.Regarding Table 4.1, we use the simple configuration of A,B,C and D for thisy?? scenario. This is the well-modeled RL problem. ? 
Table 4.1: Summary of algorithm design choices for different scenarios.

                              Simple track (CSRL)                Complex track (STRL)
    (A) Data sampling         Grid sampling                      Sampling from car driving
    (B) Dynamics              Simulated                          Estimated
    (C) V* estimation         Linear interpolation               k-nearest-neighbor
    (D) Action space          Enumerate all possible actions     Enumerated from the k nearest neighbors
    (E) State representation  Classical state (y, θ)             Sensory trace using the sensors on the car

Figure 4.7: Illustration of the y-θ scenario and the sensor configuration.

4.2.2 Classical State RL (CSRL)

Figure 4.7 shows the car steering scenario with the state space defined by y and θ. The car drives from right to left on a track of infinite length and constant width. The goal of the driving task is to steer the car to the middle of the track, starting from various initial states. Given an arbitrary initial state, (y, θ), we want to learn a policy that drives the car along the track center, i.e., y = 0, where y is measured from the track center and R is the width of the track.

Regarding Table 4.1, we use the simple configuration of (A), (B), (C) and (D) for this y-θ scenario. This is the well-modeled RL problem. θ becomes more positive as the car rotates counterclockwise, as shown in Figure 4.7.

We use n_y and n_θ samples in the y and θ dimensions, respectively, with n_y = 20 and n_θ = 20 as typical values. Figure 4.8 shows one example of grid sampling, in which y ranges over [-100, 100] and θ over [-30, 30] with a sampling size of 30×30. Figure 4.8 can be regarded as a panel showing the state transitions: each segment starts with a dot representing the start state and ends at the transitioned-to state.

Figure 4.8: 2D sampling of y-θ on a 30×30 grid. The horizontal axis represents the track position, y, while the vertical axis represents θ. Red segments mean that the transition was simulated with a left (positive degree) turn, while blue segments correspond to turning right (negative degree).

The reward for this scenario is defined as

    r(s) = ( (0.5R - |y|) / (0.5R) )²        (4.10)

which gives a large reward for driving in the middle of the road and a low reward for driving close to the sides of the road.

After one iteration of reinforcement learning, the reward function in (4.10) produces a greedy policy. As shown in Figure 4.9(a), all data points with a positive y value turn left and those with a negative y value turn right. This is a straightforward policy for encouraging the car to drive in the middle of the road, but it does not exhibit anticipation. For example, in the situation of Figure 4.10, the policy of Figure 4.9(a) takes the action of turning left, since the car is on the right hand side of the road. However, this orients the car sharply downwards, and it therefore overshoots the center of the road.

Figure 4.9: Comparison between the greedy policy and the converged RL policy. Red segments represent state transitions with a left (positive degree) turn, while blue segments correspond to turning right (negative degree).

Figure 4.10: A particular example that illustrates the benefit of a long-term policy.

Figure 4.9(b) shows the final converged policy, π*. The trajectory traces in Figures 4.9(a) and 4.9(b) show how the car, starting from initial state P, reaches the goal state when applying the greedy policy (RL with one iteration) and the converged reinforcement learning policy. The trajectory produced by the greedy policy overshoots the center of the road several times.
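For illustration, the sketch below performs value iteration on a grid of (y, θ) states with the reward of Equation (4.10). It is only a sketch under assumptions of our own: simulate_step stands in for the simulated car dynamics and is not the thesis code, the grid resolution, discount factor, and iteration count are arbitrary, R = 200 is inferred from the y range of Figure 4.8, and values at off-grid next states are read from the nearest grid point here for brevity, whereas the thesis configuration uses linear interpolation.

```python
import numpy as np

R = 200.0                          # track width; y spans [-R/2, R/2] (assumed)
ys = np.linspace(-100, 100, 30)    # grid over lateral position
thetas = np.linspace(-30, 30, 30)  # grid over heading
actions = range(-5, 6)             # integer steering angles in [-5, 5]

def reward(y):
    # Equation (4.10): largest in the middle of the road, small near the edges.
    return ((0.5 * R - abs(y)) / (0.5 * R)) ** 2

def value_iteration(simulate_step, gamma=0.9, iters=100):
    # simulate_step(y, theta, a) -> (y', theta'): assumed simulated dynamics.
    V = np.zeros((len(ys), len(thetas)))
    def nearest(y, th):                         # snap a continuous state to the grid
        return np.abs(ys - y).argmin(), np.abs(thetas - th).argmin()
    for _ in range(iters):
        for i, y in enumerate(ys):
            for j, th in enumerate(thetas):
                backups = []
                for a in actions:               # all actions are available in CSRL
                    yn, thn = simulate_step(y, th, a)
                    ii, jj = nearest(yn, thn)
                    backups.append(reward(y) + gamma * V[ii, jj])
                V[i, j] = max(backups)          # Bellman backup
    return V
```

Stopping after a single sweep corresponds to the greedy policy of Figure 4.9(a), while running the iteration to convergence yields the anticipatory behavior of Figure 4.9(b).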
4.2.3 Sensorimotor Traces RL (STRL) with Local Sensors

We use the complex track settings of Table 4.1 to configure our RL framework. We collect random driving episodes on a straight road, as shown in Figures 4.6(b) and 4.7, for the initial data gathering. This approach allows us to apply off-policy learning based on random steering rather than human-guided examples. In order to avoid the car driving essentially straight because of successive short random actions, we do not change the steering angle at every time step but instead hold it constant for q time steps.

The gathered data take the form of episodes, which convey sequential information and thereby support the sensorimotor trace representation. The initial states for each episode are generated stochastically using uniform random distributions.

For the next-state and value function evaluation, the NN estimation and kNN estimation described in Section 3.3.6 are used.

We change the reward function to the definition in Equation (4.11), which gives high scores to situations with similar left and right sensor values. This allows the agent to reward itself without truly knowing where the center of the road is, which is useful for novel training scenarios, e.g., on curved roads.

    r(s) = 1 - ( ω_1 pair(d̂_1, d̂_2) + ω_2 pair(d̂_3, d̂_4) + ω_3 pair(d̂_5, d̂_6) ) / 3        (4.11)

The best reward, r(s) = 1, is achieved when all three pair terms return a value of 0. In the reward, the pair terms are regularized to the range [0, 1], so that the learned control skills transfer more readily across tracks of various widths.
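A minimal sketch of this self-supervised reward follows, assuming uniform weights and the use of the pair magnitudes to keep each term in [0, 1]; these assumptions, and the function names, are ours and not taken from the thesis code.

```python
def pair(a, b):
    # Normalized signed difference, in (-1, 1).
    return (a - b) / max(a, b)

def strl_reward(d, w=(1.0, 1.0, 1.0)):
    # Sketch of Equation (4.11): the reward approaches 1 when each left/right pair of
    # inverted distance readings is balanced. Uniform weights and the use of pair
    # magnitudes are assumptions.
    dh = [1.0 / x for x in d]                      # inverted readings, Equation (4.7)
    terms = [abs(pair(dh[0], dh[1])),
             abs(pair(dh[2], dh[3])),
             abs(pair(dh[4], dh[5]))]
    return 1.0 - sum(wi * ti for wi, ti in zip(w, terms)) / 3.0

print(strl_reward([3.0, 3.0, 6.0, 6.0, 2.0, 2.0]))  # 1.0: readings are balanced
print(strl_reward([1.0, 5.0, 6.0, 6.0, 2.0, 2.0]))  # lower: car is close to one side
```

Note that the reward depends only on local sensing, so the same function can be evaluated on a wide road, a narrow road, or a curve without any knowledge of the track geometry.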
4.3 Summary

We have introduced two car steering scenarios to demonstrate the use of the reinforcement learning framework for a sensor-based dynamical system. The results are detailed in Chapter 5.

By changing the reward function, many different learning tasks can be achieved. With a well-defined interface, users can assign goals by defining the reward function interactively, as will be illustrated in Chapter 5 using behaviors that steer through road intersections in desired ways.

Chapter 5

Results

In this chapter, we present the results of applying Sensorimotor Traces RL (STRL) in various scenarios. Classical State RL (CSRL) is used as a baseline for performance comparison. We also demonstrate the progressive acquisition of skills: skills learned in simple scenarios are used to guide driving on curved, complex tracks, and using target-trajectory RL, the car can drive on tracks with intersections that offer various route choices.

5.1 Comparison Using CSRL

We begin with evaluations using the simple track test scenario introduced earlier in Section 4.2. In this section, we compare the results of our method (STRL) against the CSRL baseline. To better understand the differences between STRL and CSRL, Table 4.1 provides a detailed configuration comparison and Table 5.1 provides a further high-level summary.

Both STRL and CSRL are model-based. CSRL has explicit knowledge of the system dynamics, whereas STRL models the environment implicitly, using approximations of the dynamics and rewards to support the learning iteration.

Figure 5.1 illustrates the training episodes and the performance of the resulting learned control policy for varying amounts of training data. As the volume of the dataset increases, the policy learned by our method quickly approaches that of the benchmark. In each condition, the image on the left visualizes the training episodes. These are generated using stochastically chosen steering actions, with steering angles randomly sampled from a ∼ U[-5, 5], a ∈ Z.

Figure 5.1: STRL learning for varying amounts of training data. In all cases, the car moves from right to left. The images on the left visualize the training episodes, a ∼ U[-5, 5]. The images on the right illustrate the STRL performance (red) compared to that of CSRL (black).

The images on the right side of Figure 5.1 show three trajectories resulting from the learned control policy as applied to three different initial states, with CSRL trajectories in black and STRL trajectories in red. The car drives from right to left. The black lines can be regarded as a good model of the optimal policy actions. We can observe that the red lines, the policy learned by STRL, progressively approach these optimal actions as the volume of sensory trace data increases.

Table 5.1: Configuration comparison between STRL and CSRL for straight track steering.

    Data gathered:
        STRL: multiple long episodes.
        CSRL: explicit sampling of all actions on a regular grid of sampled states.
    Next-state look-up:
        STRL: regression-based estimation from past experiences.
        CSRL: exact simulation, which tells exactly where the state goes.
    Reward function:
        STRL: high reward for states with similar left and right sensor distances.
        CSRL: high reward for states near the middle of the road.
    Environment understanding:
        STRL: the environment is implicitly modeled via the sensors.
        CSRL: the environment is explicitly modeled as part of the state description.

Policy Evaluation

In order to compare policy performance numerically, we define a real-valued evaluation function, E(π), that models how well the policy solves a particular control problem. Typically, E(π) evaluates a policy by executing it from one or more initial states and recording the cumulative sum of rewards. A policy is evaluated from m initial states, sⁱ_init, 1 ≤ i ≤ m, for a duration of T_eval time steps from each initial state. The evaluation of state trajectory i, with initial state sⁱ_init, has a cumulative score e_i(π) = Σ_{j=1}^{T_eval} r(sⁱ_j). We sum the performance of the policy over the initial states to obtain an overall evaluation metric:

    E(π) = Σ_{i=1}^{m} e_i(π)        (5.1)

Because different control problems may require different values of m and T_eval, we also define Ē(π), the average state evaluation, which is normalized to the range [0, 1]:

    Ē(π) = E(π) / (m T_eval)        (5.2)
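The evaluation metric can be written down directly. The sketch below assumes hypothetical step, reward, and policy functions for rolling out a policy; it is not the thesis implementation, only an illustration of Equations (5.1) and (5.2).

```python
def evaluate_policy(policy, initial_states, step, reward, T_eval=60):
    # E(pi): summed reward over rollouts from m initial states (Equation 5.1),
    # returned in normalized form E_bar(pi) in [0, 1] (Equation 5.2).
    total = 0.0
    for s in initial_states:
        for _ in range(T_eval):
            s = step(s, policy(s))   # advance the simulation by one time step
            total += reward(s)
    m = len(initial_states)
    return total / (m * T_eval)      # E_bar(pi)
```

Because each per-step reward lies in [0, 1], the normalized score can be compared directly across experiments with different numbers of initial states or rollout lengths.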
5.1.1 Policy Evaluation for the Straight Road Driving Problem

Figure 5.2: Performance comparison between the two methods.

We use this policy evaluation method to test STRL against CSRL with increasingly larger datasets of sensory traces. In this evaluation, m is set to 10, meaning that the policy is run from 10 initial states, and each trajectory has a fixed length of T_eval = 60 frames. Figure 5.2 presents the performance Ē(π) of CSRL and STRL. Recall that CSRL is generated from a 30×30 state sampling, while 600-700 data points are enough for STRL to reach performance very similar to that of CSRL. CSRL cannot reach Ē(π) = 1 because many of the initial states used for policy evaluation are far from the ideal state. The CSRL data size is fixed at 900 in Figure 5.2, while STRL approaches higher performance as its number of data samples increases. The dataset is thus one of the key elements for generating good performance with STRL, but the data volume needed to achieve good performance is relatively small.

In addition, STRL's learning from local sensing provides considerable flexibility for learning in broader environments that are unspecified or difficult to model. We will see more of this in later sections.

5.1.2 Driving on Curved Tracks

Figure 5.3: Four STRL-based driving results on curved tracks with various road widths and track conditions.

For CSRL, it is non-trivial to develop policies that can steer on curved tracks such as those shown in Figure 5.3. However, the policy learned by STRL can easily be applied to curved tracks, even with varying track widths.

In these curved track results, the car retains a maximum steering angle of 5 degrees. In Figure 5.3(a), the track width is five times the car width. In Figure 5.3(b), the track width is only 1.5 times the car width. In Figures 5.3(c) and 5.3(d), the circle consists of 8 segments that each vary in width. These results illustrate the adaptive ability of the STRL method.

5.2 Tracks with Sharp Turns

Figure 5.4: STRL applied to a 90 degree turn.

In order to develop STRL control policies for more challenging tasks, we apply a progressive learning strategy. In particular, control policies for sharp corners are developed by proceeding from straight road steering, to 90 degree turns, and finally to 180 degree turns.

Exploration is accomplished during episode generation by adding bounded uniform random noise to the current policy action. Practicing, particularly in this section, refers to learning and consolidating skills based on newly explored experiences. Starting from the straight road driving skill, as shown in Figure 5.4, we try to learn to steer through a 90 degree left turn, which is impossible at first because a 5 degree steering limit is not large enough. We define the policy developed for the straight road with a ∈ [-5, 5] as π_1. To gradually explore new experiences, we generate off-policy steering variations according to a(s) = π(s) + a_N, where a_N ∼ U[-5, 5].

These newly generated episodes serve as a new set of data points for our reinforcement learning system. The newly learned policy, π_2, steers the car within (-10, 10), and turning left (positive) gains more reward. New routes and experiences are explored and practiced to learn π_3, π_4, and so on, until the 90 degree left turn can be achieved, as shown in Figure 5.4. For the 90 degree turn, we limit the sharpest steering angle to 25 degrees.
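The sketch below illustrates one way this off-policy episode generation could be written, with hypothetical names throughout. Holding the perturbed action for several steps mirrors the convention described in Section 4.2.3 for the initial data gathering, and the clamping bound is the assumed steering limit for the current stage (for example, 25 degrees for the 90 degree turn); none of these details are taken from the thesis code.

```python
import random

def explore_episode(policy, s0, step, n_steps=200, hold=5, noise=5, a_max=25):
    # Generate one off-policy episode: a(s) = policy(s) + a_N, with a_N ~ U[-noise, noise],
    # holding each perturbed action for `hold` time steps and clamping to [-a_max, a_max].
    s, episode, a = s0, [], 0
    for t in range(n_steps):
        if t % hold == 0:
            a_noise = random.randint(-noise, noise)
            a = max(-a_max, min(a_max, policy(s) + a_noise))
        s_next = step(s, a)              # assumed simulation step
        episode.append((s, a, s_next))   # recorded transition for the next RL pass
        s = s_next
    return episode
```

Each stage of practice appends such episodes to memory, learns a new policy from the enlarged dataset, and then repeats the exploration around that improved policy.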
Figure 5.5: STRL applied to a 180 degree turn.

Applying a similar technique, a policy for driving through a U-turn is acquired by exploring and practicing, building on the already-acquired ability to perform a 90 degree left turn, which can be regarded as the base knowledge for acquiring the new skill, as illustrated in Figure 5.5(a)-(d). The 90 degree turn and 180 degree turn scenarios required 5 and 9 policy iterations, respectively.

This learning procedure demonstrates embodied learning. Given some nearly successful experiences, we practice, obtain feedback through the reward, and finally tune our actions to achieve the desired movement. In this example, straight road driving with a ∈ [-5, 5] is the basic acquired skill that helps the car gradually learn more. Once control policies capable of 90 degree turns or U-turns have been achieved, the intermediate sensorimotor traces can be discarded, if desired. After this series of learning procedures, we can store the resulting three scenarios of memories and learned policies for the car. An interesting consequence of using sensorimotor traces is that it becomes easy to recognize a scenario by its sensorimotor trace signature. Experimentally, we find that the STRL control policies are sufficient to deal with many intermediate scenarios, such as 45 degree turns. In a later section, we show results for complex tracks using these three well-learned policies.

5.3 Results with Complex Tracks

Figure 5.6: The steering result on a complex track.

In this section, we present a number of results for steering in more complex environments. In Figure 5.6, different kinds of turns are tested, including both sharp and smooth turns of various widths and angles. Symmetry is exploited for all of our datasets, so that a policy learned for turning left can also be used for turning right. The width of the track in Figure 5.6 is roughly 2.5 times the car width.

The resulting car trajectories are smooth and exhibit anticipation for many of the turns. On straight roads, the car steers towards the middle of the track. On winding tracks and tracks with ragged edges, the car steers smoothly and is not adversely affected by the ragged edges.

In Figure 5.7, we add obstacles to the road, which introduces considerable noise and randomness into the sensory observations. We can see that the car steers well through the many obstacles.

Figure 5.7: STRL control policy on a track with obstacles.

5.4 Comparison With and Without the Trace

Figure 5.8: Comparison of the STRL control policy for a car that has been trained to drive straight on a track: (a) using only the current sensor readings, and (b) using the full sensorimotor trace time window.

The sensorimotor trace is used in our learning system to capture a specific state of the dynamical system. It consists of a time series of sensory data that captures information about the state and the environment in the recent past. The ambiguity present at a given instant of time can be resolved by the sensory trace, which provides further contextual information.

For example, in Figure 5.8(a) the car drives without using sensorimotor traces: it treats only the current sensory data as the car state and drives through a fork in the road, a scenario it has never encountered before. The downward branch influences the car to turn left, which finally leads to a crash. Figure 5.8(b) is trained using the same data and procedures, except that it uses sensorimotor traces for RL.

Figure 5.9: Steering comparison with and without the trace.

The comparison shown in Figure 5.9 illustrates the difference more strikingly. The control policy without the trace performs dramatic left or right turns upon seeing a jump in any sensor value, and once a dramatic turn is made, the system typically cannot recover. In comparison, the full STRL policy performs well and drives smoothly through these novel environments.

In summary, sensory trace reinforcement learning helps the agent understand the environment by better capturing the task at hand. The policy learned via STRL generalizes well and, as a result, exhibits greater flexibility.
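The practical difference between the two policies lies in what is used as the query state. A sketch, with hypothetical names, of forming the windowed state of Equation (4.9) from a history of transformed sensor conditions is shown below; the padding of an incomplete history by repeating the oldest reading is our own assumption.

```python
from collections import deque
import numpy as np

class TraceState:
    # Maintains the last w transformed conditions C'_t and concatenates them
    # into the sensorimotor-trace state S_t of Equation (4.9).
    def __init__(self, w):
        self.w = w
        self.history = deque(maxlen=w)

    def update(self, c_t):
        self.history.append(np.asarray(c_t, dtype=float))

    def state(self):
        # Assumes at least one reading has been received; until w readings are
        # available, the oldest one is repeated (an assumption of this sketch).
        assert self.history, "at least one sensor reading is required"
        padded = list(self.history)
        while len(padded) < self.w:
            padded.insert(0, padded[0])
        return np.concatenate(padded)

# With w = 1 this reduces to the trace-free policy of Figure 5.8(a).
```

Distances between such windowed states compare whole recent histories, which is what allows the fork in Figure 5.8 to be distinguished from the situations seen during training.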
Figure 5.10: All possible ways to pass branch points.

5.5 Tracks with Multiple Choices

The results presented up to now cannot make steering decisions at intersections where there are multiple possible roads to follow. The presence of intersections within a track introduces several additional challenges.

There are many possible types of turns through intersections; we discuss only those shown in Figure 5.10. The one-choice scenarios were discussed in previous sections and are still included here. We use the notation L, S and R to represent the scenarios of turning left, keeping straight and turning right, respectively. Two-choice scenarios offer two possible choices whenever the car encounters a branch point. We use SL as the notation for the scenario where the car can either go straight or turn left, with the preference in this scenario being to go straight. Similarly, SR, LR, RL, RS and LS are two-choice scenarios, as implied by the two letters, with the letter ordering giving the preference. In the same manner, LXX, SXX and RXX represent three-way scenarios that prefer to turn left, go straight and turn right, respectively. This gives 12 scenarios in total, which are explained further in the remainder of this section.

There are 6 modes of car steering to choose from in a world containing two-way and three-way intersections: M_LSR, M_LRS, M_SLR, M_SRL, M_RSL and M_RLS. These notations define the turning preferences. For instance, M_LSR means that the car prefers to turn left whenever a left turn is available; if no left turn can be made, going straight is the second choice. To cover the full list of scenarios needed to perform mode M_LSR, we need data for the scenarios L, S, R, LS, LR, SR and LXX. When symmetry is applied, the number of required scenarios can be reduced, so the 12 scenarios can be covered by data for 7 scenarios. We also developed an interface for easily switching among the 6 driving modes.

In the case of multiple-choice scenarios, we need to define the reward function in a new way, because the reward described in Chapter 4 does not behave well when facing multiple choices. Here we use a different reward definition: the distance to an example target trajectory.

Figure 5.11 gives an example of learning using target trajectories. Four scenarios are presented in the figure: S, L, LS and LXX. Each example trajectory is generated using random steering. The green lines represent the target routes, i.e., how we want the car to act in each scenario.

Figure 5.11: An illustration of multi-choice training. Four scenarios are shown: S, L, LS and LXX. The green lines show the target routes.

When gathering the sensorimotor trace data, we stochastically guide the off-policy exploration in order to converge more quickly to the desired policy. For example, for the LXX scenario, we can bias the car to turn left.

Figure 5.12 shows the driving result for M_LSR and Figure 5.13 shows that for M_SLR.

Figure 5.12: Driving result on a multiple-choice track under the M_LSR mode.

Figure 5.13: Driving result on a multiple-choice track under the M_SLR mode.

5.6 Data Usage Effectiveness

Sensorimotor traces have a measurable frequency of use while the car is driving, and many of them are rarely or never accessed. We collect data from 40 episodes, starting from point A in Figure 5.14, and record the frequency of use across all of the memory data. Blue means a data point was never used, and red means it was referenced as one of the k nearest neighbors at least 20 times, with the car searching for k = 20 neighbors at each query; intermediate cases are given interpolated colors.

Figure 5.14: A visualization of the frequency with which sensorimotor traces are encountered during policy execution. 40 simulation episodes are performed with starting points around A. Red shows data that were frequently used, while blue means seldom used.

As can be seen in Figure 5.14, only a small fraction of the sensorimotor traces are frequently encountered in practice. This is reasonable, because many blue segments correspond to experiences with bad outcomes that should never be experienced again. We can exploit this property to optimize the online search by effectively discarding these portions of the sensorimotor traces.
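One simple way this could be exploited is sketched below, with hypothetical names: count how often each stored data point is returned as a nearest neighbor during rollouts, then drop the points that are never (or rarely) referenced before deploying the policy. The counting scheme and the pruning threshold are our own assumptions, not the thesis implementation.

```python
import numpy as np

def usage_counts(queries, data, dist, k=20):
    # Count how often each stored point appears among the k nearest neighbors
    # of the query states encountered during simulation episodes.
    counts = np.zeros(len(data), dtype=int)
    for q in queries:
        d = np.array([dist(q, x) for x in data])
        for i in np.argsort(d)[:k]:
            counts[i] += 1
    return counts

def prune(data, counts, min_uses=1):
    # Keep only the data points that were actually referenced while driving.
    return [x for x, c in zip(data, counts) if c >= min_uses]
```

Pruning in this way shrinks the memory that must be searched online while leaving the frequently used traces, shown in red in Figure 5.14, intact.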
Chapter 6

Discussion and Conclusion

As shown in Chapter 5, our learning framework is capable of generating control policies that use sensorimotor traces as the underlying representation, and we further demonstrated the progressive acquisition of more complex steering skills. In this chapter we summarize our contributions, discuss open problems and limitations, and, lastly, elaborate on several future research directions.

6.1 Contributions

Implicit Representation of State

In Chapter 3 we described the representation of state using sensory traces. Policy learning using "situated" sensory data provides good generalization compared to a classical state representation. The policy is largely robust to changes in orientation, translation and scale, and the agent can learn to perform well in an environment that is complex and contains many obstacles.

This representation can implicitly match the current state to previously observed scenarios by searching through the sensorimotor traces. In Chapter 5, we demonstrated the importance of using a sensory trace, i.e., a time window of sensory information, instead of only the instantaneous sensory state. The results showed that the sensory trace helps to disambiguate scenarios in a useful way. Once a scenario has been recognized, the car can steer according to specific motion preferences when facing multiple choices.

An important property of our learning framework, which is built on distance metrics, is that it does not suffer from the curse of dimensionality in the way that other RL methods do. All comparisons between states are made through distance metrics, and there is no explicit model of the full state space.

Effectiveness in Time and Memory

The size of the vector used in calculating the distance metric increases if more sensory data is added or a larger time window is used. The calculation has O(mn) time complexity, where n is the vector size and m is the number of data points.

Because data efficiency can be optimized (Section 3.4), we can run fast online simulation on top of a large memory. Our learning framework can be adapted via the size of the episodic dataset and the sensory trace length required for different kinds of dynamical systems.

Our learning method is an effective alternative to policy search methods such as [AvdP05], which solves a similar problem. Policy search optimization often incurs significant computational expense. Compared to policy search methods, our method needs less computation, since it learns the policy from a small set of captured episodes, and the kNN policy look-up helps to cover all situations through estimation.

Progressively Acquiring New Skills

We have illustrated that our learning method can help the agent adapt to new environments by progressively exploring and practicing (Section 5.2).

Generalization to Various Kinds of Environments

We have experimentally demonstrated that STRL methods have the potential to generalize well when the resulting policies are used in new situations or environments.

6.2 Limitations

Hand Tuned Distance Metrics

The learning framework described in this thesis makes extensive use of distance metrics. A good distance metric should help distinguish between scenarios in meaningful ways.

In Chapter 4, the distance metrics are defined empirically. These hand-tuned distance metrics are the product of subjective experience, which implies that extra time is needed to define a distance metric when working with a new dynamical system. Although we construct the distance metrics empirically, there are several directions that could help define them in a more automated fashion.

One method is to optimize the weight parameters, λ_i, for each of the elements of the sensorimotor trace, S. For a given set of weight parameters, the system can learn a control policy, and the resulting policy can then be evaluated for its performance. As a result, we can consider searching through the λ space for a value of λ that generates the best policy. However, this method incurs a significant computational cost when the parameter space becomes large.
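Such a search could be as simple as random sampling of candidate weight vectors, as in the hedged sketch below; learn_policy and evaluate_policy are placeholders for the learning and evaluation procedures described earlier, and nothing here is taken from the thesis implementation.

```python
import numpy as np

def search_weights(learn_policy, evaluate_policy, dim=9, n_trials=50, seed=0):
    # Random search over distance-metric weights: learn a policy for each candidate
    # weight vector and keep the one whose policy evaluates best.
    rng = np.random.default_rng(seed)
    best_lam, best_score = None, -np.inf
    for _ in range(n_trials):
        lam = rng.uniform(0.0, 1.0, size=dim)
        lam /= lam.sum()                       # normalize so overall scale is irrelevant
        policy = learn_policy(lam)             # full RL run: this is the costly step
        score = evaluate_policy(policy)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score
```

Because every trial requires a full learning run followed by policy evaluation, the cost grows quickly with the number of weights, which is exactly the computational burden noted above.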
Improved Regression Procedures

There remain many ways to improve the regression procedures used to estimate the dynamics and the value function. Applying NN and kNN estimation can introduce limitations that prevent the agent from achieving a better policy. For example, the action space of STRL is limited by the neighbors and usually cannot cover all possible actions for each data point.

Lack of a Motor Trace

In our explanations we refer to a sensorimotor trace; however, only a sensory trace is used in practice. It would likely also be useful to include motor signals (or their efference copies) as part of the sensorimotor traces.

6.3 Future Work

State Prediction

We demonstrated how an agent can recognize a particular scenario by searching memory for the most similar sensorimotor trace. Currently we do this in a brute-force manner. However, it should be possible to exploit the known temporal structure in order to predict future sensory information. Such predictions provide a simple means of verifying whether the agent is experiencing anything unexpected: a "surprise" means observing conditions that do not agree with the prediction or that have never been experienced before. We believe that prediction and surprise can be used to acquire better learning experiences, which is left for future work.

Motions with More Degrees of Freedom

In this thesis, we assume that the relation between motion and sensory data is simple. The dynamical system in Chapter 4 has only one degree of freedom, the steering, in its action space. Clearly, agents with more degrees of freedom will complicate the motion control problem, as the motion trajectory becomes the combined result of a much higher-dimensional action space.

Several interesting topics related to robotics and physics-based character animation can then be explored, for example biped walking on uneven surfaces with unknown friction.

Bibliography

[AvdP05] Ken Alton and Michiel van de Panne. Learning to steer on winding tracks using semi-parametric control policies. Control Problems in Robotics and Automation, pages 4588-4593, April 2005.

[BB01] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.

[Bel57] Richard Bellman. Dynamic Programming. Princeton University Press, 1957.

[Ber87] Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.

[Ber05] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. I. Athena Scientific, 2005.

[Ber07] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 2007.

[BT95] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming: an overview. Proceedings of the 34th Conference on Decision and Control, 1:560-564, December 1995.

[C.93] Leemon C. Baird III. Advantage updating. Technical Report WL-TR-93-1146, 1993.

[Fla11] Martha Flanders. What is the biological basis of sensorimotor integration? Biological Cybernetics, 104:1-8, 2011.

[H.89] C. J. C. H. Watkins. Learning from delayed rewards. Doctoral thesis, Cambridge University, 1989.

[How60] Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[IJR12] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems, pages 2645-2653, 2012.

[Ker02] Sandra Kerka. Somatic/embodied learning and adult education. Trends and Issues Alert, 2002.

[Kim07] Hajime Kimura. Reinforcement learning in multi-dimensional state-action space using random rectangular coarse coding and Gibbs sampling. Intelligent Robots and Systems, pages 88-95, 2007.

[Lau12] Tak Kit Lau. Stunt driving via policy search. Robotics and Automation (ICRA), pages 4699-4704, May 2012.

[Lee10] Yongjoon Lee. Motion fields for interactive character locomotion. Proceedings of ACM SIGGRAPH Asia, 29, December 2010.

[MI04] G. Morimoto and T. Ikegami. Evolution of plastic sensory-motor coupling and dynamic categorization. Proceedings of Artificial Life IX, pages 188-191, 2004.

[Ng03] Andrew Y. Ng. Shaping and policy search in reinforcement learning. PhD thesis, University of California, Berkeley, 2003.

[PB06] Rolf Pfeifer and Josh Bongard. How the Body Shapes the Way We Think. MIT Press, 2006.

[PB08] Joel Pearson and Jan Brascamp. Sensory memory for ambiguous vision. Trends in Cognitive Sciences, 12:334-341, September 2008.

[Pow07] Warren B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Blackwell, 2007.

[PS97] Rolf Pfeifer and Christian Scheier. Sensory-motor coordination: the metaphor and beyond. Robotics and Autonomous Systems, 20:157-178, June 1997.

[Put94] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.

[PV06] Jan Peters and Sethu Vijayakumar. Policy gradient methods for robot control. Intelligent Robots and Systems, pages 2219-2225, October 2006.

[RB01] Michael T. Rosenstein and Andrew G. Barto. Robot weightlifting by direct policy search. Proceedings of the International Joint Conference on Artificial Intelligence, 2:839-844, 2001.

[SB98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Sch93] Anton Schwartz. A reinforcement learning method for maximizing undiscounted rewards. Proceedings of the Tenth International Conference on Machine Learning, pages 298-305, 1993.

[Spo98] Mark W. Spong. Underactuated mechanical systems. Control Problems in Robotics and Automation, pages 135-150, 1998.

[SS] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 1996.

[Sut00] Richard S. Sutton. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 2000.

[Sze10] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4:1-103, 2010.

[TN99] Jun Tani and Stefano Nolfi. Learning to perceive the world as articulated: an approach for hierarchical learning in sensory-motor systems. Neural Networks, 1999.

[VG05] V. Gallese and G. Lakoff. The brain's concepts: the role of the sensory-motor system in conceptual knowledge. Cognitive Neuropsychology, 22:455-534, May 2005.
[vH07] Hado van Hasselt. Reinforcement learning in continuous action spaces. Approximate Dynamic Programming and Reinforcement Learning, pages 272-279, April 2007.

[WGJ95] Daniel M. Wolpert, Zoubin Ghahramani, and Michael I. Jordan. An internal model for sensorimotor integration. Science, 269:1880-1882, September 1995.