REINFORCEMENT LEARNING USING THE GAME OF SOCCERByRoger David FordB. Sc. (Computer Science) University of Lethbridge, 1992A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE FACULTY OF GRADUATE STUDIES(DEPARTMENT OF COMPUTER SCIENCE)We accept this thesis as conformingto 7ie required s andardTHE UNIVERSITY OF BRITISH COLUMBIAOctober, 1994© Roger David Ford, 1994Signature(s) removed to protect privacyIn presenting this thesis in partial fulfilment of the requirements for an advanceddegree at the University of British Columbia, I agree that the Library shall make itfreely available for reference and study. I further agree that permission for extensivecopying of this thesis for scholarly purposes may be granted by the head of mydepartment or by his or her representatives, It is understood that copying orpublication of this thesis for financial gain shall not be allowed without my writtenpermission.(Signature)Department of Y1[)LJ#v Sc4-lcedThe University of British ColumbiaVancouver, CanadaDate 0C4. JY, (‘??‘YDE-6 (2/88)Signature(s) removed to protect privacyAbstractTrial and error learning methods are often ineffective when applied to robots. This isdue to certain characteristics found in robotic domains such as large continuous statespaces, noisy sensors and faulty actuators. Learning algorithms work best with smalldiscrete state spaces, discrete deterministic actions, and accurate identification of state.Since trial and error learning requires that an agent learn by trying actions under allpossible situations, the large continuous state space is the most problematic of the abovecharacteristics, causing the learning algorithm to become inefficient. There is rarelyenough time to explicitly visit every state or enough memory to store the best action forevery state.This thesis explores methods for achieving reinforcement learning on large continuousstate spaces, where actions are not discrete. This is done by creating abstract states,allowing one state to represent numerous similar states. This saves time since not everystate in the abstract state needs to be visited and saves space since only one state needsto be stored.The algorithm tested in this thesis learns which volumes of the state space are similarby recursively subdividing each volume with a KD-tree. Identifying if an abstract stateshould be split, which dimension should be split, and where that dimension should besplit is done by collecting statistics on the previous effects of actions. Continuous actionsare dealt with by giving actions inertia, so they can persist past state boundaries if itnecessary.11AbstractTable of Contents11List of TablesList of FiguresAcknowledgement1 Introduction1.1 Motivation1.2 Goals and Objectives1.3 Soccer1.4 Outlineviviiviii2 Background2.1 Markov Decision Processes2.2 Dynamic Programming2.2.1 Value iteration2.2.2 Policy Iteration2.3 Reinforcement learning2.3.1 Temporal Credit Assignment.2.3.2 Structural Credit Assignment77810111214193 The Dynamite Testbed11125114554 Experiments: Learning to play Soccer4.1 Different Algorithms4.1.1 The Hand Coded controller .4.1.2 Regular Grid4.1.3 Random Agent4.1.4 Adaptive KD-tree4.2 Evaluation4.2.1 Measures of Performance4.2.2 Simulation results4.2.3 Real Robot results5 Conclusion5.1 Summary5.2 Future Work5.2.1 Adding existing knowledgeBibliography 52Appendices 55A Main Reinforcement algorithm 553.1 System Overview3.2 The Reactive Deliberation Controller . .3.2.1 Executor3.2.2 Deliberator3.2.3 Adapting Reactive Deliberation to do learning25262828293031343435363939404748484950ivB Update Q-values algorithm 58C KD-tree algorithms 59D A KD-tree 66VList of Tables2.1 The Value Iteration Algorithm 112.2 The Policy Iteration Algorithm 122.3 The Q-learning Algorithm 164.4 KD-Tree Algorithm 364.5 Adaptive KD-tree algorithm 37viList of Figures2.1 A MDP that cause problems for a 1-shot update . . . 152.2 Different ways to divide up a 2D space 212.3 A G tree that has split the space into 3 clusters 233.4 The dynamite soccer playing testbed 263.5 The controller 274.6 Convergence on the simulator 414.7 Adaptive KD-tree with different t thresholds 424.8 Adaptive KD tree with different update schedules . 444.9 Convergence using a fixed tree 454.10 Performance using pure exploitation (simulator) . . .. 46viiAcknowledgementI would like to address my sincere appreciation to my supervisor Craig Boutilier, forputting up with me and showing me how to do research. Even if I didn’t always followhis teaching.I would also like to thank Michael Sahota for creating the reactive-deliberation controller on which much of the results of the thesis depend, and for aiding me in findingnumerous problems in my code and how it interfaced with his controller.Finally, thanks to the various people, and agencies that funded my research. Themajority of the money provided came from a NSERC graduate scholarship.viiiChapter 1Introduction1.1 MotivationOne of the goals of the Artificial Intelligence field is the creation of intelligent agents.An agent is an autonomous system that is able to sense its environment, act on thatinformation and influence the environment. These agents may be physical like robotsand factory controllers, or like schedulers they may be entirely software. An intelligentagent must do more than simply interact with its environment. It must act to cause theenvironment to reach desirable configurations or goals. The agent often has several goalsthat may be conflicting. Intelligent behavior often involves achieving these goals wherethe order and manner in which they are achieved is important.Simple agents like a chess player can be programmed directly because the environmentis well understood, unchanging with deterministic actions. The agent may not be ableto predict exactly how the opponent will behave, but it expects actions to be completelyobservable and deterministic. If the agent moves its knight to square C3 then the knightwill be on square C3 not some other square nearby. It is possible to give a complete andcorrect description of the environment and the task, from which the agent can plan itsbest actions. One reason why there are not more successful intelligent agents, like thechess playing ones, is that as the task and environment become more complex specifyingcomplete and correct models becomes increasingly difficult. There are three situationswhere learning can be used to attack this complexity.1Chapter 1. Introduction 2Most agents rigidly follow a set of preprogrammed responses. The programmer mustprogram a response for all possible situations that the agent is likely to encounter. This isnot possible for most environments. Usually the environment is so complex that it is hardto program for all possible situations, especially in nondeterministic environments wherethe number of possible situations could be very large. So the environment is constrained,as in the case of assembly line robots, to allow only a few possible situations. With thereduced complexity the programmer is able to explicitly tell the robot what to do. Tellinga housekeeping robot how to wash dishes would be an nearly impossible task, if we hadto specify torques and joint angles, and model every possible dish it might encounter. Itwould be nice if the robot could learn without having to be explicitly told what to do.Even with the reduced environment the human programmer may not understand theenvironment well enough, or the information may be unknown. A programmer typicallyruns through a test-calibrate cycle several times before the agent has a reasonable performance. Here the programmer has learned about the interaction between the robot andthe environment, and adapted its behaviour. Having a human do the learning works wellbecause humans are very good at learning. However, its not always practical or possibleto have a human tune the robot to its environment. Having to send out a specialist withevery housekeeping robot, just to tune it to every different house, would not be profitable.The third situation is when there is a Dynamic environment. Even if the agent had agood model of the environment initially it would soon become outdated. An agent in oneof these environments would have to constantly update its model of the environment, forexample, I might have to reprogram my robot to vacuum the rug, just because I movedthe furniture. This could be better solved by having the robot “learn” the location ofthe furniture.Robotic agents make good cases for studying learning algorithms because they oftenfall prey to all three of the above difficulties. In addition, in order for any learning systemChapter 1. Introduction 3work effectively in a robotic domain it must have several properties [19]:• Fast convergence: Many learning systems require extended training data. Performing millions of training examples on a real robot is impractical, and it could beexpensive, especially if the robot is continually breaking from driving into wallsor off the edge of a cliff, from following a plan made before it has learned theappropriate behaviour• Noise immunity: Robotic sensors are typically very noisy. They often give onlyapproximations to the state information. Any learning system that requires correcttraining data would not be very effective.• Incremental: It may be the case that the agent needs to actually perform its actionswhile learning. As in the case of a dynamic environment, it may not be possible todo a separate off-line learning phase.• Tractable: Robots must keep pace with the environment, there is little time tosit and think before choosing the right action. If the agent sits idle for too longthe environment will have changed and the plan is worthless. Any learning systemmust be able to perform the required computations in a reasonable amount of time.• Grounded: The system should only depend upon information available from thesensors on the robot. While tabula rasa learning is impossible, the robot might notbe able to depend upon a human to give it the necessary information. The humanmight not known the information, or the information may be incorrect.Robotic systems make excellent test cases for evaluating learning systems. One criticism of learning systems is that they are only tested on toy examples. Learning on a robotforces the system to deal with very large sized domains, that are often not well-behavedChapter 1. Introduction 4or well-understood. In short if it performs well on a robotic system then it should beable learn in a lot of other environments that are better behaved and better understood.Whether the robot benefits from learning depends upon if the cost of doing the learningout-weighs the cost of programming the robot directly.1.2 Goals and ObjectivesThe main goal of this thesis is the development of a reinforcement learning algorithm[29, 32, 17] that is capable of learning on very large continuous state-spaces, with thecharacteristics needed by a robot, that is, fast convergence, noise immunity, incrementality, and real-time operation. Current reinforcement algorithms only work on discretestates with discrete actions. They also suffer the “curse of dimensionality”: while learningcan be polynomial in the number of states, the number of states tends to grow exponentially [31]. Overcoming the problem of too many inputs is necessary before learning canbe applied to more than just simple problems.The solution to this problem outlined in this thesis is to automatically divide up thestate space into abstract states. By testing the ability of different learning paradigmsto learn to play soccer, an objective measure of the effectiveness of each paradigm in acomplex domain can be achieved. This thesis develops experiments for testing differentlearning strategies. In comparing the performance of each learning strategy againstan independently developed, non-adaptive agent an objective empirical measure of thestrength and weakness of different algorithms can be determined. Creating a winningsoccer player is not an essential goal.Chapter 1. Introduction 51.3 SoccerThe game of soccer was chosen as an environment for testing learning agents because,as an abstraction of the world, it simplifies the problem but keeps the essential essenceof the problem. In particular it preserves the aspects of the world corresponding tothe three situations listed above. Soccer contains knowledge that is hard to program,knowledge that is incomplete, a dynamic changing environment. In addition it containshostile agents and provides an objective measure of performance [261.The world is not completely predictable, and much of the information needed todo classical planning is not known. This state of affairs should be preserved in anyabstraction used to test AT theories. Soccer preserves unpredictability and unknowability.It is not possible in practice to predict where the ball will go, where the opponent is going,or where team mates will be. The information needed to create the optimal soccer playingagent is not known. True, experts exist, but they are unable, or maybe just unwilling,to divulge this information. It is also not clear that their experience is transferable to arobot, whose sensors, actuators, and dynamics are different.Even though an agent cannot be compared to an optimal agent, the standard wayof judging correctness, they can be judged as being relatively more correct than anotheragent. An agent that is able to score goals, prevent goals from being scored, and havean overall higher score, can be said to be more correct than another agent without suchabilities. Such an objective measure is essential for testing learning agents.1.4 OutlineChapter 2 describes the reinforcement learning problem, and some of the other work donein an attempt to solve the input generalization problem.Chapter 3 describes the Dynamite testbed developed at the University of BritishChapter 1. Introduction 6Columbia. An outline of the Reactive Deliberation controller, developed by MichaelSahota[25, 26], is given, as well as a description of how it is adapted to perform learning.Chapter 4 details the different learning algorithms tested, the results and their significance. The basic learning algorithms tested are all based on Q-learning [32]. The onlydifferences between them is how they divide up the state-space.Chapter 2Background2.1 Markov Decision ProcessesClassical planning [2] works in a well structured deterministic domain. When actionsbegin to have non-deterministic effects, sensors only give a probability of being in a state,or when the environment behaves non-deterministically, classical planners run into majorproblems. The planner must plan for all the effects of the actions, not just the expectedor desired ones. This causes the search tree to expand rapidly and a planner quickly runsinto computational difficulty. Usually the plan is generated assuming a deterministicworld, and if an action “fails” a new plan is generated. The non-deterministic world canbe more adequately modeled by a Markov Decision Process [5, 13]. A Markov DecisionProcess (MDP) consists of four parts:• A set of states S. The state space of the world. Usually this is thought of as afinite discrete set, but the model allows continuous state spaces as well.• A set of actions A(s). This is a function that gives all the possible actions for eachstate Si.• A transition function P(s, a, si). Actions in a MDP are not assumed to be deterministic. The function P(s, a, s) gives the probability that performing action awhile in state s will change the world into state s. As expected, the sum of allthe transitions from a single state should be equal to one. That is for any state7Chapter 2. Background 8s e S,and each action a E A(s),P(s, a, s) = 1Si S• A reward function R(s) This describes the immediate value of the agent beingin that state. The simplest function is to set to R = 1 at the goal states and 0everywhere else.The one essential property of MDPs is that deciding the best action to performdepends only on the current state. The current state of the process contains all theinformation necessary to make a prediction about the usefulness of executing a givenaction. Knowing the past history, states and actions, does not give any more usefulinformation. This means that regardless of how the state was reached the actions at thecurrent state still have the same transition probabilities. That is the state space capturesall the essential properties for predicting rewards and transitions. This is known as theMarkov property. Any system can be modeled as a Markov process if enough details areprovided.2.2 Dynamic ProgrammingDeciding on the best action to do next with a MDP is just a matter of maximizing rewards.The agent must not only pick the action that maximizes the expected immediate reward,but also maximize the potential for rewards in the future. Picking an action that givesa high immediate reward may not be the best thing to do, if in the next state the agentcan only pick actions that give negative rewards. A policy r is a function, or universalplan, Tr(s), that specifies the action to perform for each state s.Chapter 2. Background 9The expected value of the state s, given a fixed policy r, is denoted V (s), and dependsnot only on the reward at that state but the expected rewards received by following thepolicy from that state. Future rewards are discounted by a factor, -y, a constant betweeno and 1. This forces the agent to pick the actions that reach the goals as quickly asis possible, since, rewards received further in the future are worth less to the decisionmaker at the immediate state. The value function then is written as the immediatereward added to the discounted expected value of following the policy. Equation 2.1gives a system of SI linear equations that can be solved to give the value of any policyat any state.V,(s) = R(s) +-y P(s, ir(s), s)V,,(s) (2.1)sESThe decision of what to do at each state is equivalent to finding the Optimal policy,and following it. An optimal policy is a policy that if followed will produce a higherexpected value than any other policy for any state. Let s0 be the initial state of theagent. Then the value of the optimal policy V,,. (SO) must be greater then or equal tothe expected value of all possible policies, starting from s0. This will hold regardless ofwhich s0 e S is chosen as the initial state. As long as the agents start in the same state,the one following policy 7r* is expected to do as least as well as the agent following anyother policy. There is always an optimal policy for a MDP.Given the transition function and the reward function it is possible to find the optimalpolicy for the system. Finding the optimal policy by trying all policies is very inefficient.Finding the decisions that could improve the policy is not simple. For example, in chessthe actions that give a high immediate reward are not necessarily the best. Similarly,actions that may get low immediate rewards can lead to good positions. It is possible alsoto get into a doomed state, where even the best decision leads to failure. Blaming theChapter 2. Background 10actions just before failure would be wrong when the decision that needs to be changedis somewhere further back in the chain. The process of finding which actions to improvewhen the actions interact is called credit assignment. Dynamic Programming finds thedecisions that need to be changed by an incremental procedure, building successivelycloser approximations to the optimal policy. These methods work well but they needto be given a model, the reward function and the transition function. These methodsbecome too time consuming for large state spaces. Recent work has addressed the ideaof how to shrinking the state space until the computation becomes reasonable [12, 7].2.2.1 Value iterationThe value iteration algorithm proposed by Bellman[5] (see table 2.1), does not actuallysolve the equations for the value of a policy, thus avoids the time consuming step. Theidea is intuitively simple, making successive approximations to the optimal policy, andthe value function, until the difference between one approximation and the next is closeto zero. Initially, the value of each state, call it V° is set to 0. In practice the initial valuesare usually set to the immediate reward. This is the value of the state when following theoptimal policy for 0 steps. Then a closer approximation of the Value function is createdfrom the last approximation. For each state the action that maximizes the expectedimmediate reward plus the old estimate of the expected value of the state reached isfound. Since, V° was set to the immediate reward, V’ becomes the value of performingonly one action in that state, and then stopping. This is done repeatedly,(step 3(b)),the V value of the state being calculated from the V”’. In general V” is the valueof the state for following the optimal policy if the agent was only going to perform nactions. So as ii —b ——+ 0. Value iteration suffers from the problem, of notknowing when it has reached the optimal policy. Clearly as n —f cc the policy approachesoptimal. The algorithm itself, however does not continue on indefinitely, but terminatesChapter 2. Background 111. Initialize V° to the immediate reward values.2. i:=O3. repeat(a) i:=i+1(b) Vs e S doV(s)a){R(8 a) + P(s, a,S(c) 7r(s):= the action a from previous step.until — ui_i) =Table 2.1: The Value Iteration Algorithmwhen the improvement between successive iterations is small. It is not known in advancehow many iterations this will take; for this reason policy iteration, which is known toconverge in a fixed number of steps, is usually the preferred approach.2.2.2 Policy IterationThe policy Iteration Algorithm [13], does solve the system of S equations generatingthe value function for the sub-optimal policy. Unlike value iteration, which successivelyestimates the value of the optimal policy, policy iteration works on improving a suboptimal policy. This is done by identifying states that can be improved, and quits whenthe policy can no longer be improved. A search is done through the states, finding statess where there is an action a which if taken for just this one state, and then following 7tresults in a higher expected return than just following K(s). After finding such a stateand action the policy is changed so that it(s) now selects the action a. This results in anew policy that must have a higher value than the old policy. When no more such statesChapter 2. Background 121. Let it’ be any policy on S2. while it it’ do(a) it := it’(b) Calculate V,,.(s) Vs e S(c) Vs E S doif a e A s.t.(R(s) + P(s, a, s’)V(s’)) > V(s)S’ ESthen it’(s) := a else ir’(s) := it(s)3. return itTable 2.2: The Policy Iteration Algorithmand actions can be found, then the optimal policy has been found. This algorithm, unlikevalue iteration, is guaranteed to converge in a finite number of iterations, in practice itis often polynomial in SI.2.3 Reinforcement learningBoth value iteration and policy iteration require that the transition function, P(si, a, s2),and the reward function R(s), be known. For many real problems these functions maybe unknown. The only alternative then is to try actions and observe the results. Onecan then either try to learn the transition function and the reward function and solvefor the policy using one of the above methods, or one can attempt to learn the policydirectly from the observations. Learning the transition probabilities would require atable of about size IS 2 A, while learning the policy directly would only require orderSIIAI space. Learning the transitions also has the problem that the agent is not able toincrementally use the information it has learned. The dynamic programming methodsChapter 2. Background 13are too computationally intensive to perform after each observation. This means thatthe agent is required to follow the old policy until such time as the policy can be updatedto incorporate the new observations, usually after a goal state is reached.Reinforcement learning is learning from rewards and penalties. This is learning fromtrial and error, as opposed to supervised learning. In supervised learning the agent isgiven pairs of items, an input vector and an output vector. After training the agent issupposed to produce the correct output vector, when given an input vector. The obviousproblem with this learning by example is that the agent inherits the bias of the teacher,and this may be bad if the teacher’s bias is incorrect. The main advantage of learning byexample is its rapid convergence, they give good approximations with a lot less trainingthan self-guided learners. Since, reinforcement learning is self-guiding and able to learnalmost tabula rasa it can be considered a more general form of learning. It still requiresa teacher, or at least a means of identifying which situations are “good” and whichsituations are “bad”. This is a much weaker requirement on the teacher than the onerequired by example learning systems. Its conceptually easier to tell when something isright or wrong then it is to show how to do the task.Learning methods where nothing is predefined are called strong learning methods.These methods may be either taught by a teacher or self-guided. Like most methodsthat are capable of learning everything, strong learning methods are very slow to learnanything. Weak learning methods require some prior knowledge and are able to learnfaster because of this background information. Weak learning methods suffer from thefact that they must have background information, and they must accept the backgroundinformation as absolutely correct. The majority of weak learning methods are unableto determine if the domain knowledge they started with is reliable or not. Clearly whatis needed is a general strong learning algorithm that is able to accept existing domainknowledge, and override such knowledge if it sees fit.Chapter 2. Background 142.3.1 Temporal Credit AssignmentThe field of reinforcement learning is mostly concerned with learning controllers forMarkov decision processes when a model is not available. This involves two separatecredit assignment problems. The first is the temporal credit assignment, and the secondis the structural credit assignment. Temporal Credit assignment is the process of assigning responsibility for a good or a bad outcome to the sequence of decisions observed,that is, identifying which decisions, in the sequence that lead to a reward, were primarilyresponsible for that reward [29, 32, 19]. Structural credit assignment assigns values todecisions at a state, a state that may not have been observed, such that similar stateshave would have similar values for a given decision. Early work done in this area includesSamuel’s checker player [27] and the BOXES algorithm [211. The dynamic programmingmethods above solve both credit assignment problems, since DP has a model, it canadjust the values of every state and decision with out having to observe it. Although DPsuffers from the curse of dimensionality, the problem is not as bad as that experiencedby learning systems, since they have to actually visit each state.One of the earliest learning systems to solve the temporal credit assignment problemwas Samuel’s checker player. There are two basic methods for solving the temporal creditassignment problem. The first, and most obvious, is to save every decision and state untila reward is received. Then the values of all the decisions are adjusted according to somevariant of Equation 2.1. The second, and the one used by Samuel, is the method ofTemporal Differences (TD) [29]. Samuel’s checker player was to learn an evaluationfunction for board positions in a checker game. Although a straightforward simple modelfor checkers exists, the size of the state space makes solving for the optimal policy difficult.The evaluation function gives a heuristic that tells how good the position is and can beused to constrain a search. Samuel’s learning method was to apply the current evaluationChapter 2. Background 15Figure 2.1: A MDP that cause problems for a 1-shot updatefunction to the current state, perform a move, apply the evaluation function to the newstate, then use the difference between the two values to adjust the evaluation of the firststate. When a sequence of moves lead to a reward, the state just before will have itsvalue increased. Each successive time through the sequence the increase will back up tothe earlier states (see figure 2.1).The method of Temporal differences, although used frequently earlier, was formalizedby Sutton in 1987 [29]. Most importantly he developed a convergence theorem showingthat TD learning methods asymptotically converge to the correct values. Also it wasshown that TD methods are able to converge to optimal values with finite training. Hereoptimal, refers to the best values that can be determined from the training, not optimalin the sense of a dynamic programming solution. This idea of optimal is the maximumlikelihood estimate. Take, for example, tossing a coin. If the coin was tossed ten times andheads turned up seven, then the prediction of getting a head on the next toss based on amodel of a fair coin would be 0.5, but asked to estimate based solely on the observationsit would be . From the observations the maximum-likelihood estimate is in a sensethe optimal prediction. Sutton proved that the TD methods, with judiciously chosenparameters, will converge to the same predictions as a method that remembers all stateChapter 2. Background 161. Initialize Q-values.2. While true do(a) Find current state of world, s(b) find a = maxaQ(s, a), the current “policy” action(c) Decide whether to follow policy or to explore. Perform either a or a randomaction. Call the action actually performed a’.(d) Get the new state, s1, and the reward R(si)(e) Update Q-values with the formulaQ(s, a’) = Q(s, a’) + (R(si) + 7maxaQ(si, a) — Q(s, a’))Table 2.3: The Q-learning Algorithmtransitions that occurred and their outcomes which means that TD learning convergesto the maximum-likehood estimate.Watkins [32] applies Suttons TD learning methods to perform incremental dynamicprogramming. Watkins’s algorithm, called Q-learning because of the notation used in histhesis, learns the optimal policy and is incrementally updated, generating the improvedpolicy after every step. The temporal difference methods, as used by Samuel and Suttonabove, learned an evaluation function. It was still necessary to store the actual policy asan additional data structure which needs to be recomputed whenever the value functionchanges. Another method, the one used by Samuel’s checker player, uses the valuefunction to control a search. The agent searches through all possible actions at a stateand picks the action that leads to the highest valued state. Where the value of the statesis determined using the value function that was learned. Q-learning learns the policy andthe evaluation function at the same time using a single data structure. A look-up tableis created with one entry in the table for each state-action pair. The table entry Q(s, a)is the current estimate of the value of performing action a while in state s. The value ofChapter 2. Background 17the state is the value of the best action in that state. So the table of Q-values is able tostore the policy, the value functions, and the value of doing non-policy actions. Decidingwhat is the best action to perform is done just by finding the action with the maximumvalue for that state.Q-learning (table 2.3) is a form of value-iteration describe in section 2.2.1. Theimportant fact to notice is that value-iteration does not require that each V be updatedin a systematic way, i.e.. V’, The method still converges, although perhaps notas quickly, if the value of every state is updated in a arbitrary order, so long as eachstate is updated often [32]. The value of the entry Q(s, a) is defined as the value of theimmediate reward for performing action a plus the (discounted) expected value of thestate reached. where the expected value of any state is just the value of that states bestaction. Letting s1 be the state reached by performing action a in state s, the expectedvalue of performing action a then is:Q(s,a) = r(s1)+7maxQ(s,a)a EAThis of course requires that the values of s and a be remembered. So that after performingthe action, the new state s1 can be observed and the table values updated. Following themethod of temporal differences, the values for the Q-table are updated by the rule:Q(s,a) ÷— Q(s,a) + /3((r(si)+7maxQ(s1, j)— Q(s,a))aThe new Q-value is the error between the earlier estimate and the observation, multiplied by a learning rate parameter /3, used to control the convergence rate. 3 is set inthe range (0, 1]; by setting /3 to 1, the slope given by the error term is descended to theminimum, and the new Q-value is updated to treat the new observation as correct. Thisone-shot updating of the Q-values causes problems when performing an action sometimesresults in a positive reward and sometimes results in a negative reward. For example,Chapter 2. Background 18take the situation shown in figure 2.1. State B allows only one action, this action however, leads to either a negative reward or a positive reward with equal probability. If thefirst time the action is performed a positive reward is received, the value of that actionis given a value of about (1), making the expected value of B appear to any state thatcan reach it, like A, appear to be higher then is should. If the next time the action istried the negative reward is received, then the value of the action is moved all the wayto around (—1). With j3 set as one, the value of such a non-deterministic action, willnever have the more appropriate value of , that using transition probabilities wouldhave given. Setting j3 to a value less then one causes the Q-values change slower, andpreserves more of the past observations.Q-learning has several beneficial properties. It does not require an initial model or apriori domain knowledge, which as explained in chapter 1, is not always available. All Qlearning requires is a reinforcement function. Q-learning, also, does not require that entiresequence of states be saved; like all TD methods only one observation at a time needsto be stored. Finally, Q-learning is incremental, there is no need for a separate learningphase. This requires some method of determining if the current, possibly sub-optimal,policy should be followed, or some other action that might improve the policy shouldbe performed (Exploitation or Exploration). This is usually solved by some stochasticmethod.The disadvantage of Q-learning is that it is slow to converge. Q-learning solves thestructural credit assignment the same as dynamic programming does, by actually visitingeach state. Since the model of state transitions is not known, it is not possible to knowwhich actions will lead to unexplored states. Q-learning relies on the fact, that byoccasionally choosing random actions, eventually the entire state space will be explored.This is even more of a problem with robotic systems, since, they tend to have extremelylarge, usually continuous, state spaces. Also, it is not possible to do the thousands ofChapter 2. Background 19trials needed for Q-learning to converge on most robotic systems.2.3.2 Structural Credit AssignmentThe structural credit assignment problem is the problem of assigning a value to states andactions that haven’t been visited, based upon some measure of similarity between states.This is also called the input generalization problem. Dynamic programming methods, aswell as standard Q-learning, don’t solve the structural credit assignment problem: theymust either solve for, or visit, every state. The problem is less acute for dynamic programming since solving a system of equations takes less time, then random exploration.However, in robotic domains the state spaces are extremely large, often more than 2100states [10]. With state spaces this large even a polynomial time algorithm would take toolong. The major difficulty in assigning structural credit is knowing which unvisited statesare similar to states that have been experienced and for which an informed decision canbe made.The BOXES algorithm[21] used the most obvious method of assigning structuralcredit. Boxes’s purpose was to learn the classic problem of balancing a pole. The problemis to balance a pole on top of a car. A negative reward is given if the pole falls over.In the simplest case the car has two possible actions: move forward or move back. Thestate space is defined by four continuous variables: the position of the car x, the angleO of the pole from the vertical, and the rate of change of these two variables over time,dx, dO. The boxes algorithm divides up each of the dimensions into fixed sized intervals,arbitrarily chosen by the programmer. With four dimensions this discretises the statespace into boxes (thus the name). Every value that falls within the boundaries of thebox is treated as a member of one state. The obvious problem is knowing beforehandhow many boxes are needed, and where the boundaries should be.Dividing up the space uniformly along each dimension results in the hypercubes,Chapter 2. Background 20or the boxes used above (see figure 2.2). When the actual Q-values are not uniformlydistributed through the space, then many of the boxes will be nearly empty; not only isthis a waste of space but the number of states can grow exponentially. Certain areas ofstate space need to be divided up finely to correctly partition the spaces where Q-valuesare tightly grouped, but areas where Q-values are sparse should not be divided up asfine. The adaptive grid (figure 2.2) is more flexible then the regular grid; rather thanuniformly partitioning each dimension, the location of the boundaries of boxes can varyalong each dimension. Adaptive grids still suffer from the “curse of dimensionality”. Eachpartition is global, in effect partitioning every state along its path. Recursive partitioningschemes, (Figure 2.2 recursive grids, KD-trees, Arbitrary Hyperplanes), don’t have theglobal partitioning problem, further partitions only effect the subspace in which they areapplied.The other common way to assign values to unexplored states is to use Neural networks.The backpropagation algorithm[24] is an extension of the percepetron[22], that allowsmultilevel networks with hidden nodes to be trained. Backpropagation uses gradientdescent to adjust the weights of connections in the network. A supervised learningmethod, the network is presented with a training (input,output) pair. The current net isapplied to the input vector, and the result vector is compared to the correct output. Theerror between the output the net generated and the actual output is used to adjust theweights. The adjustments are propagated backwards through the net from output nodesto input nodes. Backpropagation is primarily orientated to solving the structural creditproblem. Each node can be considered as performing a test if the weighted sum of itsinputs is greater than a threshold. Essentially this is just a partition of the input space bya hyperplane, with multilevel networks allowing recursive partitioning. Backpropagationsuffers from the lack of a good theory explaining when it should work, how many nodesare needed, and how they should be arranged. Lin[17] and others have had successChapter 2. Background 21j__Recursi efrkLtarHvnem anesKD TrJRegu1Grid Adapti eJridFigure 2.2: Different ways to divide up a 2D spaceChapter 2. Background 22coupling NN and Q-learning, still others like Chapman and Kaelbling [10] have had poorresults.Chapman and Kaelbling [10] developed another type of generalization algorithm,called the G algorithm. The G algorithm grows a tree, the internal nodes which testselected input bits and follow either the left or right branches of the tree according tothe results. The leaves of the tree each contain a single Q-value. This Q-value representsthe Q-values for all the states that have the same value for the bits tested in the internalnodes. By testing only the most important bits the leaves of the tree can represent alarge number of states, states that have similar Q-values, with a single Q-value. Thissaves both space and time (see figure 2.3).Initially the algorithm considers all the bits to be irrelevant, clustering the entire statespace into one state. Statistical evidence is collected for the relevance of bits. When abit is discovered, by the t-test, to be significant the node is split into two leaves, one forwhen the bit is off and one for when the bit is on. Then more statistics are gatheredwithin the leaves, which themselves can be split if necessary. The G algorithm has onemajor drawback, that is the testing is done at the bit level. If the sensor information isnot presented such that it makes sense to test individual bits, then the method will notbe efficient. Consider again the pole balancing problem, here the sensors give four realvalues. Testing real numbers a single bit at a time would give a rather large tree, largerthen might really be necessary. Also it needs the bits to be individually relevant.The other alternative, to clustering based upon dividing up the state space into subspaces, is to cluster based upon similarity in the states response vectors. This enablesstates, that might not be nearby in the state space, to be clustered together simplybecause they produce “similar” responses. One such method used by Mahadevan andConnell [19] clusters states together solely on the basis of a similarity metric. Each observation is compared to all the clusters and if it matches a cluster it is added to it. IfChapter 2. Background 23//\Q(B3=1)B7/Q(B6=O A B3=O) Q(B6=1 AB3=O)Figure 2.3: A G tree that has split the space into 3 clusters.Chapter 2. Background 24it does not match any clusters a new cluster is formed with only one member. A clusterdescription is a vector of probabilities, < Pi, P2, ...p > where each p. is the probabilitythat the ih bit is on given that a state matches the cluster. From this, the probabilitythat a state matches a cluster can be found, if this probability is greater then a threshold,then the state is said to match the cluster. Clusters can be merged together if the “distance” between clusters is less then the threshold p and the Q-values of the cluster agreein sign and the difference in magnitude is less then the threshold ö. The best action isselected by maximizing the combination of the probability of matching the cluster, andthat clusters Q-value.All methods that solve structural credit assignment work by defining a similaritymetric across the state space defined by the sensors. This similarity metric is used tointerpolate between known states and states that have not been visited yet. The hope isthat given a good metric the entire space will not have to be explored, and the learningmethod will converge to an adequate policy in a reasonable amount of time. Methodsare usually based on some variation on dividing up the state space or maximizing somemeasure of similarity of points in a cluster.Chapter 3The Dynamite TestbedThe Dynamite testbed is a platform for testing theories about controlling mobile robotsin dynamic domains, in particular mobile robots in the soccer domain. This provides acommon task for comparison of different control strategies. It consists of several radiocontrolled cars that share a common vision system, and play a version of soccer on adining table sized field.3.1 System OverviewThe Dynamite testbed, see figure 3.4, consists of a small playing field, about 224 cm longand 122 cm wide.Unlike a real soccer field it has boards around the outside to prevent theball from going out of bounds. The field provides a playing surface for multiple soccerplaying agents. The agents are made from off the shelf remote controlled cars retrofittedto work with the vision system. Each car is fitted with two circular colour markers thatallows the vision system to determine position and orientation of the car. Each car alsohas small bumpers enabling them to push the ball, since cars cannot kick.The vision system consists of a single overhead camera that transmits full colour videoto special purpose datafiow computer, named the Data Cube. The Datacube is able toperform rapid filtering of colours, classifying pixels into colour groups, or blobs. Thecolour information is then given to some transputer nodes for additional processing. Thetransputer nodes are able to output the world coordinates for the center of each colourblob. With the two circles on each car, the vision system is able to produce real-valued25Chapter 3. The Dynamite Testbed 26RGB Camera(Single CCD)User Nodes: Reasoning & ControlRadio• Transputer Network______________________TransmitterFigure 3.4: The dynamite soccer playing testbedcoordinates and orientation for each car, and the ball, sixty times a second. The accuracyof the values are within millimeters of the real position. The agents operate on the remotebrain concept: all sensing, reasoning and control is done on a remote computer. The carscan receive commands from a transputer controlled radio transmitter, fifty times a second.The entire team, for both sides, can be controlled on the same set of transputers, sinceeach of the n transputers nodes in figure 3.4 can control an agent independent of therest. Complete details of the system are given in [25, 26, 3, 4].3.2 The Reactive Deliberation ControllerThe Reactive Deliberation Controller is a robot control architecture proposed by MichaelSahota [25]. It consists of two basic parts, the Executor and the Deliberator (see figure3.5). The main idea is that the two halves run asynchronously— in this case they each getone transputer node. This allows the executor to keep up with the environment, alwaysSoccer FieldChapter 3. The Dynamite Testbed 27SensorUboratorAction Selection :rid F’ rmters—: Sensor andStatt..is data——xecutor‘V1\ctLIatorDoritrolsFigure 3.5: The controllerSensorIData—Chapter 3. The Dynamite Testbed 28monitoring the environment and sending control signals. The deliberator however, is ableto become more involved in deciding what to do.3.2.1 ExecutorThe lowest level of the controller’s hierarchy is the Executor. It consist of behaviorsthat can be executed rapidly, with the sensor to actuator ioop being as short as possible.Lengthy computations cannot be performed at this level. These behaviors, which Sahotacalls action schemas, perform the low-level control of the robot, receiving raw sensor data,and action selection with parameters then sending out low-level controls to the actuators.It also sends filtered sensor and status data to the next higher level. For example, one ofthe behaviours from the soccer agent is “follow path.” Follow path is activated and givena path to follow by the deliberator. The behaviour then compares the world coordinatesfrom the sensors to the path data and sends directional controls to the robot. The higherlevels of the controller can actually have crashed and the robot won’t notice until thebehaviour is finished.3.2.2 DeliberatorThe Deliberator is similar in structure to the executor. The main difference is thatit is allowed to spend more time between actions, thus the name, and it cannot senseor control the environment directly. The Deliberator is forced to receive all its sensorinformation from the executor. This could be simply the raw sensor data passed through,or some more complex filtering could be done. For the soccer playing robot, the sensordata is filtered to provide accelerations and status of the executor. Each behaviour of thedeliberator uses the sensor information given by the executor to compute which of theexecutor actions should be selected. It then computes the run time parameters for thataction, and produces a bid based upon the results of the planning and how appropriateChapter 3. The Dynamite Testbed 29that action is thought to be. The behaviour with the highest bid then gets to select theaction that gains control of the executor. Unlike the executor, each behaviour is runningin parallel with the others. However, only the behaviour that produces the highest bidget control of the executor. If one of the other behaviours, one not currently in control,produces a bid greater then the current bid, then the active behaviour is interrupted andthe new behaviour is put in control.3.2.3 Adapting Reactive Deliberation to do learningAdaption of the Reactive Deliberation Controller such that it is able to learn requiresthat the voting system of the Deliberator be removed and arbitration be performed bya higher level. Originally the arbitration is very simple, simply a comparison ioop tofind the highest bid then select that behaviour. This simple selection will be changed.No longer will bids be sent to the next level, rather the full sensor data will be sent.This sensor data defines the state space, the arbitrator then performs Q-learning. Theset of actions for the Q-learning algorithms is the set of behaviours of the deliberator.Q-learning requires that the states and actions be discrete. Both of these conditions areviolated with this architecture, for the state space is continuous and the actions persistthrough state boundaries. The continuous state space can be divided into discrete spaces,but that cannot be done with the actions. Actions are given an inertia, so that whenan action crosses a state boundary, the executor is allowed the chance to continue theaction it had started earlier.In the spirit of behavioral control only one behavior per level is allowed to sendactuator control commands at one time. Unlike subsumption[9], the decision of whichbehaviour is in control is not fixed. Instead, the decision is made at the next higher levelof the controller. This scheme allows greater flexibility, since unlike subsumption, thepriority between behaviours is not hardwired at compile time.Chapter 4Experiments: Learning to play SoccerThe game of soccer that the agent is required to learn is a simplified version of the realgame. The major simplification is in the number of players: in this version there are onlytwo players. The state of each object in the game is given by the vector < x, y, 8, v >.This vector is made up out of the position of the object on two dimensional field x, y,the orientation of that object 0, and the object’s velocity v. The ball’s orientation isfound by tracking it over time, unlike the cars the ball’s orientation corresponds to thedirection its moving, not which way it’s pointing. In the one-on-one version of soccerplayed here, there are only three objects on the field at one time, this gives a total oftwelve dimensions for the agent to partition and explore. It would be easy to createmore input dimensions (like acceleration), but the twelve given here would seem to bethe reasonable dimensions for the task.Each agent was trained on a simulator, where a large number of games can be playedeasier then they can on the real robots. The agent receives a +1 reward for scoring a goal,and a -1 reward when the opponent scores a goal. After training the agents were testedin pure exploitation mode with no learning. In other words they are tested following thepolicy that was learned, both in simulation and on the real robots.The visual system is easily able to resolve the position of each object to the nearesthalf-centimeter and smaller [25]. As a result the discrete state space, defined by dividingup each input dimension at the resolution of the sensors, is extremely large. The playingsurface being 244cm long and 122cm wide can be resolved into nearly 120,000 states,30Chapter 4. Experiments: Learning to play Soccer 31velocity and angular orientation contribute about 100 states each. In total each objectin the game can be accurately positioned into about 1.2 billion different states. Sincethe state of any object in the game is potentially important, following all three objectsin the game would require a state space of about 1.7 * 1027 or 291 states. It would beimpractical to create a look-up table of this size, let alone explore the states adequatelyand find a good policy. Furthermore, if the game is scaled up, the state space is goingto be multiplied about 1.2 billion times for every new player added. Even a linear timealgorithm would have problems exploring that many states in a reasonable amount oftime. Reinforcement algorithms are at best polynomial time algorithms [31, 16] andwould be completely unreasonable. The first job of any learning agent in this domainis to define a similarity metric over the states, grouping similar states together until thenumber of states is small enough to explore in a reasonable amount of time.4.1 Different AlgorithmsEach of the learning agents have a similar structure. The only major difference is thestrategy they use to divided up the state space. They all perform Q-learning with exactlythe same parameters. Any differences between the different learning algorithms shouldbe entirely due to the way the state space was subdivided.The first step for every learning agent tested is to determine which state it is currentlyoccupying. This is done by taking the input vector <TO, o, o, V0,. . . , x3 y, O, v3 > anddetermining into which bucket this places the agent. When one of the 12 input dimensionshas changed enough so that a bucket boundary is crossed, or when 10 milliseconds haspassed, the agent “changes state” and a new action will be selected.At each state an agent is to select one of the following nine actions for the deliberatorto perform:Chapter 4. Experiments: Learning to play Soccer 321. Wait: This is the simplest behaviour. In short it waits for the situation to improve.2. Shoot: This is the behaviour responsible for planning and following a path that willcause the ball to go into the opponent’s goal.3. Clear: This behaviour finds and follows a path that will cause the ball away fromthe home goal. This is a weaker requirement then shoot which requires a path thatcauses the ball to reach the opponent’s goal.4. Go Home: A simple behaviour that follows a path back to the agents own net.5. Defend Home: Take up goalie position and keep the car situated between the balland the net.6. Go Red: Drives the car to the centre of the playing surface.7. Defend Red: Similar to Defend Home except it occurs at the centre line.8. Servo: Equivalent to “kicking the ball”. This drives the car at the ball.9. Unwedge: Tries to drive the car out of a stuck position.After selecting an action anew the agent must also check which action is currently active atthe deliberator level a0ld. If anew is different from a0zd then the deliberator is interruptedand started on the new task. Since actions are not discrete in this model, when anew isthe same as a0zd the agent must decide if this means that the deliberator is to continueperforming its current action, or if the deliberator is to replan and start the same actionagain. The method used here is to give the actions “inertia”. The idea of inertia is thatan action automatically is assumed to continue into the next state (unless the actionhappened to have already stopped). Another way of solving this would be to add onebit per action to the input vector so that the agent learns at each state if is shouldChapter 4. Experiments: Learning to play Soccer 33continue or re-select the action. This would probably be more robust but would addextra dimensions to an already large state space.Q-learning has three tunable parameters (/3 = 0.9,-y = 0.5, PBEST = 0.9) whichcontrol the learning rate, the discount for future rewards and the exploration strategyrespectively. Since all the agents have exactly the same values for these parameters anydetrimental effects of using suboptimal values should be similar for each agent.The choice for -y is some what arbitrary. Setting 70.9 would be more realistic. However, because rewards are only a 1 or a 0 and goal states are self-absorbing, this meansthat if one Q-value is higher then means that the action with the higher Q-value leadsto a goal state more quickly. The choice of ‘y only effects the magnitude of the Q-values,not their relative sizeSemi-uniform exploration was chosen for the exploration strategy. This is the simpleststrategy for deciding when to follow the current policy and when to try new actions togain better information. The method is very simple, at point 2(c) of table 2.3 thereis a fixed probability PBEST of choosing the current policy action and 1 — PBESTof choosing a random action. A more intuitive approach is for PBEST to start at asmall value to encourage more exploration and slowly increase. Other strategies involveadjusting this probability according to how often the state has been visited, or accordingto the magnitude of change between Q-value estimates [31]. Adjusting PBEST onlyeffects the convergence rate.Each different agent was tested first on the simulator and then the best of those weretested on the real robots. On the simulator the robot play until a goal is scored, than therobots and the ball are automatically positioned at the centre for a new trial. A gamefor a fixed interval of time, about 5 minutes. The only change required for games run onthe physical robots is that the robots and the ball must be repositioned by hand afterevery goal.Chapter 4. Experiments: Learning to play Soccer 344.1.1 The Hand Coded controllerThe hand-crafted, non-adaptive controller created by Sahota [25] provides the standardagent that all the other agents use as the standard opponent for training and evaluation.Since, the optimal policy is not known, this hand coded controller plays the part of theoptimal policy for evaluation purposes, therefore to converge to the “optimal” performance is to perform as well as the hand-coded controller (i.e. to tie or perhaps beat thiscontroller).The Hand-coded agent uses more information than the learning agents. The learningagents are limited to the four variables <x, y, 8, v > for each object. In the Hand-codedagent, however, the controller at the top, does not need to know positional information.Each of the lower behaviours decides for itself how important it is. They do this usinga series of heuristic rules of the form “If I am already selected then add 2 to my value”and “If I can’t find a path then make my value 0”. The arbitrator receives a value forhow important each behaviour thinks it is, and selects the behaviour with the highestvalue. Since, each behaviour has access to its internal state, it uses more information ingenerating its bid than just the four the learning agents use.4.1.2 Regular GridThis is the first of the learning agents. This agent uses the BOXES [21] approach tolearning. First state space is divided up a priori into a fixed grid and the Q-learningperformed on this discrete state space. The purpose of including this controller is tocompare learning with a fixed similarity metric versus a adaptive similarity metric.The inherent problem with dividing up each dimension globally versus recursively isthat when fine resolution is needed the entire state space must be divided up finely. Thisgives a huge number of states, with twelve dimensions dividing up each dimension intoChapter 4. Experiments: Learning to play Soccer 35four boxes would give 412, about 17 million, total states. Each state requires one realnumber for each action. Assuming 32bit floating point numbers, 224 states require about226 bytes of memory, an excessive amount for a mobile robot. Four boxes per dimensionis probably not fine enough resolution, BOXES used about 256, but it only had twodimensions and two actions.The boxes for this soccer player were created by dividing up the four dimensions ofeach object in the same way. The x coordinate value was divided into four equally sizedboxes, same with the y coordinate. The orientation of each object was given two possibleboxes, either towards the goal or away from it. The velocity of the object was given onlyone possible value. This gives a total of 215 states occupying about l28Kbytes. It wouldbe better if velocity and orientation also were divided into four boxes, but the size of theQ-table would be enormous.Deciding how many buckets each dimension should be divided into and where theboundaries of the buckets should be are problems that are difficult to solve withoutextensive experience with the problem. The usual approach is to divide up the statespace test to see how well it performs. Then the state space is divided up again usingthe information learned from the first test. This generate, test and generate again cycleis repeated until either satisfactory performance is reached or the programmer gives upin despair!4.1.3 Random AgentThis agent is included as a means of comparing the how the agents improve over time.It performs no learning merely chooses a random action at each state. Using the samestate space partition as the regular grid agent. The performance of the random agent isdue entirely to the inherent information of the deliberator’s behaviours.Chapter 4. Experiments: Learning to play Soccer 364.1.4 Adaptive KD-treeThis algorithm adaptively divides up the state space, rather then learning on a fixedpartition. It is based upon Chapman and IKaelbling’s G-algorithm. However, rather thentesting individual bits of the perceptual inputs and growing a binary tree that tests therelevant bits, a KD-tree is grown that tests multidimensional real valued inputs.A KD-tree [28, 6] is a binary tree that branches on all dimensions rather then thebranching on just one dimension like a normal binary tree. A normal binary tree organizesrecords based upon the “index” dimension, as a result queries based on other dimensionsare very inefficient. A KD-tree allows multivariate queries, like those needed in nearestneighbour searches, to be performed in logarithmic time [6]. Bentley’s algorithm forcreating a KD-tree (table 4.4), is to first identify the best dimension to split. Next apartition plane is chosen, perpendicular to the dimension, such that the records are splitinto roughly equal subsets, each record being either above or below the partition plane.This procedure is continued recursively until the leaves of the tree contain only onerecord. This process has been extended to arbitrary planes that are not perpendicularto a dimension [28].Repeat until all each leaf contains one record:For each leaf do:1. Compute the variance of the records for each dimension. The partition dimensionis the one with the greatest variance.2. Erect a partition plane perpendicular to the partition dimension, that passesthrough the point dm, the median value of the records.3. Partition the records into the high child if they are above the plane, or into the lowchild if they are below the plane.Table 4.4: KD-Tree AlgorithmThe KD-tree algorithm needs to be changed slightly to solve the structural creditChapter 4. Experiments: Learning to play Soccer 371. Initialize Tree2. While true do(a) Use sensor input s to find current leaf of the tree L(s)(b) Find the best action: a = maxaiQ(L(s) , a1) Decide either to follow the currentpolicy or to try a random action. Let a be the action performed.(c) Find the new state reached s1 and immediate reward R(si)(d) calculate the New Q-value for the leaf,L(s) using:Q(L(s), a) = Q(L(s), a) + (R(si) + y max Q(L(s1),a) — Q(L(s), a))(e) Push the pair < s, Q(L(s)) > on the history stack for L(s)(f) Decide if the tree should updated. For each leaf with a history do:i. Compute the variance of all the records in the history stack. Let D bethe input dimension with the greatest variance.ii. Sort the records into two subsets based on which side of the plane, perpendicular to D passing through Med(D), they are on.iii. Perform a t-test to decide if it is a good split.Table 4.5: Adaptive KD-tree algorithmassignment problem. First, Bentley’s algorithm assumes that all the records exist, sothe KD-tree is just a method of organizing them. For learning applications this is nottrue, the records (i.e. the Q-values) do not exist. The reason for clustering the statesis that there is not enough time to collect all the q-values. The algorithm needs to beadapted to work without having all the records available for classification. Second, thetermination condition needs to be changed. Rather then stopping when each record isuniquely classified, the algorithm should terminate when all the Q-values in a leaf aresufficiently similar, or like the 1D3 algorithm, when information gained by splitting issmall [23].The algorithm tested here (table 4.5) solves the first problem, the problem of notChapter 4. Experiments: Learning to play Soccer 38having the Q-values to build the tree with, by keeping a history queue. Whenever theQ-value of a node is updated, a tuple is created consisting of the Q-value and the inputvector that caused the update, and is pushed on the queue. When the queue fills up theoldest observation is discarded and replaced with the new one. This allows statistics onhow the Q-values within the cluster vary. From these statistics the best dimension andlocation for a partition plane can be calculated.Initially the entire state space is considered one cluster. Q-values are collected andstored in the history queue. The variance of each dimension is calculated in two steps:1. Calculation of the mean value for a dimension k:Qt * Smk:where (si, Qt) is the jth pair in a history queue of length n. Sk is the value of thekt dimension of observation s.2. Calculation of variance for a dimension k:d2 — Q(sk — mk)2Sk—The dimension with the greatest variance is identified and a partition plane is createdand the q-values in the queue are sorted onto one side of the plane or the other. At-test is performed to see if continued splitting is required. The sample means x1,x2, thesample variances S?, S, and the sample sizes n1, n are calculated for each of the twosubsets. A t-test is computed using the equation:If the t-value is above a threshold the original leaf is replaced with the two new leaves,otherwise the original leaf is kept. This threshold corresponds to a fixed probability thatChapter 4. Experiments: Learning to play Soccer 39the two subsets are actually distinct and not just two groups of samples on the same set.The exact value of this threshold depends upon the data set and the number of clustersdesired, and should be tuned for maximum performance.An example of a KD-tree generated using a fixed depth update strategy, updatingevery 7.5 seconds to a fixed depth of 5 nodes, is shown in appendix D. It shows thedimensions that are split, where they are split, and the best action for each cluster.4.2 Evaluation4.2.1 Measures of PerformanceThe performance of each agent is judged on the final score of each game played. Thescore is transformed into a reward measure by taking the goals the agent scores andsubtracting the number of goals the opponent scored. This gives a good estimate of howwell the agent does in each game with a single integer. A tie game would be zero, a winwould be a positive integer and a loss a negative integer. The performance of the agentover time is calculated by taking the moving average over the last 100 games played.The criteria for comparing agents is done on three measures [15]:Correctness is conceptually the simplest measure of performance. A agent is correctif for a given input it selects the same outputs as the agent following the optimal policywould select. The problem with this measure is it can be hard to define, as it is in thiscase, since the optimal policy may not be known. For this domain the hand coded agentplays the part of the optimal agent, therefore the the agent can be said to be correct ifit receives continued positive rewards, i.e. it always wins. To measure the correctnessof the agent all learning parameters are turned off, and the agent follows the policy itlearned in pure exploitation mode.Convergence is another important measure of performance. The fact that a learningChapter 4. Experiments: Learning to play Soccer 40system converges to the optimal policy is not always sufficient, especially if the agent haspoor performance over most of its lifetime, only improving after hundreds of thousandsof trials. An agent that improves rapidly is usually better then one that does it slowly,although the slower converging agent may eventually be more correct. This distinction isvery important for agents that operate in the real world, where finding an approximatesolution quickly can have better results (win more games) over time.Complexity is not directly related to the reward the agent receives. The time taken forcomputations and the space required should be bounded. Agents that take unboundedtime to compute, might not keep up changes in the environment, and as a result willtend to achieve fewer rewards. Similarly, if the memory requirements can increase withoutbound then it might outgrow the memory available. The faster an agent can computeand determine the best action the better. Again, there might be a tradeoff between aslower more complex agent that is more correct and a simpler faster algorithm that isless correct.4.2.2 Simulation resultsQ-learning has several parameters that can be tuned to maximize performance. However,for these tests the only parameters of interest are not those of the Q-learning algorithmbut the two parameters of the KD-tree algorithm, namely the threshold for the t-test,and the frequency with which tree is updated.Figure 4.6 shows the best results for convergence for all the algorithms tested, afterall the parameters were tuned. The sine like shape of the fixed grid agent is probably dueto the fact that it does not have enough information. The limitations of memory onlyallowed a very coarse grid to be created. It is also important to note the performance ofthe random agent. This agent only loses by an average of 3 to 4 goals in 100 games. Thisquality of performance is due entirely to the “domain information” implicitly containedChapter 4. Experiments: Learning to play Soccer210—1—2—3ci)—4—5—6—741—80 100 200 300 400 500 600 700Number of games playedFigure 4.6: Convergence on the simulator800 900 1000in the behaviours.Figure 4.7 shows the effect of adjusting the threshold for the t-test. A low thresholdcauses the algorithm to more susceptible to noise and it divides up the state space poorly.A high threshold causes the state space not to be divided up at all. Choosing the thresholdat about 0.7 or 0.75 seems to give the best performance.After fixing the threshold for splitting states at 0.7, the update schedule is adjusted.This the parameter that selects when a trial split on the tree should be performed. Sinceit is a rather lengthy calculation it cannot be performed at every update. Three methods“4..,lyr4f.4 Ii/44,1I44’1.1‘IJ.(—— Fixed Grid- — ‘- KD—treeRandomChapter 4. Experiments: Learning to play Soccer 42Convergence with Different t—threasholds10.5 t=0.750 t=0.05.41!LI—3.54I I I I0 100 200 300 400 500 600Number of games playedFigure 4.7: Adaptive KD-tree with different t thresholdsChapter 4. Experiments: Learning to play Soccer 43were tested, each with the same t-threshold (figure 4.8). A fixed update interval, wherethe the tree was tested at every n second interval, was tested for 5, 7.5 and 10 secondintervals. This method worked great initially, the best results were achieved with ainterval of 7.5 seconds. The 5 second interval was too short, the agent quickly startedspending more time growing the tree then selecting actions. The 10 second interval wastoo long allowing the relevant history to be pushed out of the buffer. As the tree growslarger, however, the amount of computation that must be performed at each intervalgrows. The agent begins to spends more and more time trying to split the tree andignoring what is happening on the soccer field and performance declines.The second method tried was to increase the interval as the tree increased in size.Initially the tree was checked every 7.5 seconds, this interval was increased by 7.5 secondseverytime a new state was added. This method had the best performance but the statespace was not divided up as well as it could have been. This is because as the delaybetween testing the tree increases the history buffer tends to contain more similar valuesand the t-values decrease.The third method is to update on a fixed interval but only over a selected area ofthe tree. The agent tested, checks the tree every 7.5 seconds, but only to a fixed depthof 5 levels. This corresponds to checking 32 leaf nodes to see if they can be split whichtakes about the 10 milliseconds before a new action is selected. In this way the Q-valuesof the history buffer are not lost, and the agent does not spend an increasing amount oftime growing the tree. However it tends to build smaller but more balanced trees. Thiswould explain why it does not perform as well.Updating the tree means the agent stops paying attention to the task at hand. Thisis why the convergence curves of figure 4.8 have a sinusoidal shape, particularly theincreasing interval method. Here the tree may not be updated for scores of games andthen it may have to update the whole tree at once. This allows the opponent to win50 100 150 200 250 300Number of games playedFigure 4.8: Adaptive ND tree with different update schedulesChapter 4. Experiments: Learning to play Soccer 44- —— Fixed Depth— —- Increasing IntervalFixed Interval(Uci)L..0—0.5—1—1.5—2—2.5—3—3.5JI(14!II iIlvi II /I III.’VPII c“IIvS.0 350 400Chapter 4. Experiments: Learning to play Soccer 45ci)400Figure 4.9: Convergence using a fixed treeseveral games and brings down the average score.To test the quality of the trees created by the different parameters. The KD-treealgorithm was turned off, and Q-learning was performed on the fixed tree that had beengenerated. The results are shown in figure 4.9. This only makes it more obvious that theincreasing interval update method has built a better tree.Testing the correctness of the agents required that all learning parameters be turnedoff and the agents act in pure exploitation mode. The results are shown in Figure 4.10.Both KD-tree algorithms perform about the same, averaging about —0.5 reward overall.—4.50 50 100 150 200 250 300 350Number of games playedChapter 4. Experiments: Learning to play Soccer 460—2—41086I’//Fixed GridFixed DepthIncreasing Interval2—% __-% —-—0 50 100 150 200 250 300 350Figure 4.10: Performance using pure exploitation (simulator)Chapter 4. Experiments: Learning to play Soccer 47The regular grid learner performs about the same level as the random agent around anaverage reward of —3. Notice that the KD-tree created using the fixed depth updateinitially does extremely well, a reversal from its performance in the training phase.4.2.3 Real Robot resultsDoing all the training and testing on a simulator has some serious drawbacks. Thebiggest being that the simulator is never the same as the physical system. The agentsshould always be tested on the physical robots. Ideally they should be trained on thephysical system as well but because of logistics this would have been impractical. Insteadonly the pure exploitation performance was tested on the robots and only for the twoKD-tree agents. The results for the increasing interval KD-tree algorithm, were similarto those achieved in simulation. After playing for 5 minutes the score was 6 to 5 for thehand-coded agent. The fixed depth KD-tree agent performed very poorly on the realrobots, loosing spectacularly 5 to nothing. This only shows that the results achieved insimulation are not directly transposable to real robots.Chapter 5Conclusion5.1 SummaryIn many systems where it would be beneficial to have a learning agent, the agent isineffective because of the curse of too many inputs. In order for an agent to learn in areasonable amount of time some method of identifying the important characteristic of theinputs must be used. In this thesis a methods of automatically identifying the importantareas of the inputs was explored. Previous methods surveyed were able to cluster largestate spaces but depended upon having continuous state spaces discretized, and havingdiscrete actions.The Adaptive KD-tree Algorithm was designed to work with continuous real valuedinputs of high dimension and persistent actions. The algorithm does this by:• Collecting statistics on the previous effects of actions. Then using those statisticsto identify if and where a cluster should be further divided.• Giving an “inertia” to actions, so that they can exist through different states; Ifan action is selected in two states in sequence. Then the action knows that thissecond selection is a continuation of the action started in the first state (unless theaction happened to already have finished its task).In general the Adaptive KD-tree algorithm works better then learning with a naivehand divided state space, although not as well as the non-adaptive agent that has beentuned by hand. To be fair the hand-coded controller does use more information than any48Chapter 5. Conclusion 49of the learning agents. It also remains to be seen how well other clustering algorithms,like neural nets, work on this domain.The one drawback of the adaptive KD-tree algorithm is its complexity. Having ahistory buffer is very memory intensive, and imposes a limits on the type of trees thatcan be built. This can be partly rectified by choosing a good strategy for deciding whento test clusters. Testing at a fixed interval just does not work because the agent startsto neglect selecting actions and spends all of its time testing the tree. Increasing theinterval as the tree grows works better, but it means the relevant history may be missed.A better approach might be to keep a measure of the relative error at each cluster andupdate at fixed intervals only the nodes with large errors.Further testing is also needed to determine the effect of the various Q-learning parameters and to see how varying them effects convergence rates and optimality. The twomain tests that need to be done in this area are to vary the learning rate /3 and theexploration parameter PBEST. Using the Boltzmann model [31] for adjusting PBESTso that more exploration occurs earlier in trials but once information is learned the bestaction is more often chosen. Changing these parameters shouldn’t effect the relativeperformance between the various algorithms only cause them to all improve.5.2 Future WorkOne extension to the adaptive RD-tree algorithm is to replace the history buffer with aspatial buffer. Keeping a buffer that only stores the last n decisions at the cluster makesthe assumption that those n decisions will be randomly distributed through the entirevolume of the cluster. This assumption is often incorrect. A buffer that stores decisionsbased upon where in the cluster they occurred would probably work better. This requiresthat some method of ensuring that all the samples in the buffer are of a similar age ifChapter 5. Conclusion 50one area is much older then another, it may appear as if the state should be split, whenin fact it would not be split if the samples were all the same age.The next step would be to get rid of the buffer altogether. The buffer is uses a lot ofspace and imposes limits on the size of the tree that can be built. To remove the bufferwould require that the best dimension to split and the location of the split be foundwithout using statistics.5.2.1 Adding existing knowledgeOne observation about learning agents is that the more knowledge they have built in thefaster they are to learn a task. The disadvantage is that they usually have to assumethat the knowledge is correct. The multi-level learning system used in this thesis allowsknowledge to be included into the system in such a way that the agent is able to overridethe knowledge if it seems irrelevant.Knowledge can be included in the system in the form of behaviours. Each behaviourrepresents a skill or concept the programmer has explicitly written into the system. If theinformation is incorrect or irrelevant the agent will determine that through the course ofnormal reinforcement learning.A similar method of adding information is through the creation of sensor behaviours.The standard reactive-deliberation controller as it is tested in this thesis uses only controlbehaviours. A controller behaviour is primarily concerned with the question: “This isthe current inputs; what should the output at the lower level be?” A sensor behaviouranswers the question: “This the raw sensor information (from the lower level), what doesthis mean?” For example, in soccer a typical sensor behaviour that might be useful isthe Closer behaviour. This behaviour simply looks at the x, y coordinates of each objectand sends a 0 or a 1 to the higher level, a 1 if the agent is closer to the ball, or a 0 if theopponent is closer. This simple function provides a new measure on the state space, aChapter 5. Conclusion 51measure that would have been hard for the agent to develop with only a lookup table.Adding in sensor behavours is more costly then control behaviour. Each new controlbehaviour linearly increase the size of the lookup table. A sensor behaviour adds an newdimension to the state space, potentially an exponential growth in the size of the table.Of course if the agent uses a good clustering algorithm this is not a major problem.However, learning could be done by converting all raw sensor data into sensor behavioursand learning only over the space of sensor behaviours, discarding raw data in the learningprocess. Through judicious choice of sensor behaviours, a programmer can drasticallyreduce the size of the learning space.Another way of adding information to the learner is through more detailed rewardfunctions. In the experiments performed here, the reward function is very sparse, widelyspread in time. This make learning difficult, a real soccer player don’t learn this way;rather they receive intermediate rewards by coaches. When an soccer player performsa good action, like clearing the ball out of his own end, they are given an immediateevaluation of their action by the coach. This leads to (subjectively) good behaviourwhich can speedup the learning process by leading to “ultimate rewards” or goals morequickly.Bibliography[1] E. W. ABOAF, Task-level robot learning, Tech. Report AI-TR 1079, MIT ArtificialIntelligence Laboratory, 1988.[2] J. ALLEN, J. HENDLER, AND A. TATE, eds., Readings In Planning, Morgan Kaufmann Publishers Inc., 1990.[3] R. BARMAN, S. KINGDON, A. MACKWORTH, D. PAT, M. SAHOTA, H. WILKINSON, AND Y. ZHANG, Dynamo: real-time experiments with multiple mobile robots,in Proceedings of Intelligent Vehicles Symposium, 1993.[4] , Dynamite: A testbed for multiple mobile robots., in Proceedings IJCAI Workshop on Dynamically Interacting Robots, 1993.[5] R. BELLMAN, Dynamic Programming, Princeton University Press, 1957.[6] J. L. BENTLEY, Multidimensional binary search trees used for associative searching.,Communications of ACM, (1975).[7] C. B0uTILIER AND R. DEARDEN, Using abstraction for decision-theoretic planningwith time constraints, in Proceedings of the 12th National Conference on ArtificialIntellegence, AAAI press/The MIT press., 1994.[8] R. A. BROOKS, A robust layered control system for a mobile robot, IEEE Journalon Robotics and Automation, (1986).[9] , A robot that walks: Emergent behavior from a carefully evolved network, NeuralComputation, (1989).[10] D. CHAPMAN AND L. P. KAELBLING, Learning from delayed reinforcement, Tech.Report TR-90-11, MIT, 1990.[11] P. CHEESEMAN, J. KELLY, M. SELF, J. STUTZ, W. TAYLOR, AND D. FREEMAN,Autoclass: A bayesian classification system, in Readings in Machine Learning, J. W.Shavlik and T. G. Dietterich, eds., Morgan Kaufmann Publishers, INC., 1990.[12] T. DEAN, L. P. KAELBLING, J. KIRMAN, AND A. NICHOLSON, Planning withdeadlines in stochastic domains, in Proceedings of th 11th National Conference onAT., AAAI press/The MIT press, 1993, pp. 574—579.52Bibliography 53[13] R. A. HOWARD, Dynamic Programming and Markov Processes, MIT press, Cambridge, Massachusetts, 1960.[14] R. A. JACOBS AND M. I. JORDON, Learning piecewise control stategies in a modularneural network architecture, IEEE transactions on Systems, Man and Cybernetics,(1993).[15] L. P. KAELBLING, Learning in embedded systems, The MIT Press, Cambridge,Massachusetts, London, UK, 1993. Book Version of Phd. Thesis.[16] S. KOENING AND R. G. SIMMONS, Complexity analysis of real-time reinforcementlearning applied to finding shortest paths in deterministic domains, Tech. ReportCMU-CS-93- 106, Caregie Mellon University, 1992.[17] L.-J. LIN, Self-improving reactive agents: Case studies of reinforcement learningframeworks, Tech. Report CMU-CS-90-109, Carnegie Mellon University, 1990.[18] P. MAES AND R. A. BROOKS, Learning to coordinate behaviors, AAAI, (1993).[19] 5. MAHADEVAN AND J. CONNELL, Automatic programming of behavior-based robotsusing reinforcement learning, tech. report, IBM T.J. Watson Research Center, 1990.[20] M. J. MATARIC, A distributed model for mobile robot environment-learning andnavigation, Tech. Report AI-TR-1223, MIT Artificial Intelligence Laboratory, 1990.[21] D. MIcHIE AND R. CHAMBERS, Boxes: An experiment in adaptive control., inMachine Intelligence 2, E. Dale and D. Michie, eds., Oliiver and Boyd, 1968.[22] M. L. MINSKY AND S. A. PAPERT, Learning, in Readings in Machine Learning,J. W. Shavlik and T. G. Dietterich, eds., Morgan Kaufmann Publishers, INC., 1990.Originally published in Perceptrons: An Introduction to Computational Geometry,1969. MIT Press Cambridge, MA.[23] J. R. QUINLAN, Induction of decision trees, in Readings in Machine Learning, J. W.Shavlik and T. G. Dietterich, eds., Morgan Kaufmann Publishers, INC., 1990.[24] D. RUMELHART, G. E. HINTON, AND R. J. WILLIAMS, Learning internal representations by error propagation, in Readings in Machine Learning, J. W. Shavlikand T. G. Dietterich, eds., Morgan Kaufmann Publishers, INC., 1990.[25] M. K. SAHOTA, Real-time intelligent behaviour in dynamic environments: Soccerplaying robots, master’s thesis, The University of British Columbia, 1993.[26] M. K. SAHOTA AND A. K. MACWORTH, Can situated robots play soccer?, CanadianAI-91, 1994. Submitted.Bibliography 54[27] A. SAMUEL, Some studies in machine learning using the game of checkers., IBMJournal on Research and Development, (1959), pp. 210—229.[28] R. F. SPROULL, Refinements to nearest-neighbor searching in k-dimensional trees,Algorithmica, 6 (1991), pp. 579—589.[29] R. S. SUTTON, Learning to predict by the methods of temporal differences, Tech.Report TR87-509.1, GTE Laboratories Incorportated, 1987.[30] , Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, AAAI, (1990).[31] S. B. THRUN, Efficient exploration in reinforcement learning, tech. report, CarnegieMellon University, 1992.[32] C. WATKINS, Learning from delayed Rewards, PhD thesis, King’s College, 1989.[33] J. Y. YANGSHENG AND X. C. CHEN, Hidden markov model approach to skill learning and its application to telerobotics, Tech. Report CMU-RI-TR-93-01, CarnegieMellon University The Robotics Institute, 1993.Appendix AMain Reinforcement algorithm**************************************DECIDEC)This is the arbritrator between the behaviorsvoid decide(){double pri;static mt old_behaviourO;static mt old_time;static mt R_time;static Object old_obj [MAX_OBJECTS];mt i;Ktree *state,*old_state;new_exec = FALSE;if (program == PROGRAM_NORMAL_PLAY) {command_f lag=TRUE;evaluate_f lag=TRUE;stateget_state(obj);old_stateget_state(old_obj);i=reward(obj);if((state!old_state) I I(current_time*TYME_TO_MS>(old_time+iO)))/* Oont select a new action *//* until state changes or times out*/if(state—>high!=NULL) printf(”Error \n”);behaviourselect_action(state);old_time=current_time*TYME_TO_MS;update_Qvalues(old_state,state,i,old_behaviour,old_obj);}if((current_time*TYME_TO_MS)>(R_tmme+7500)) /*milliseconds*/55Appendix A. Main Reinforcement algorithm 56{/* decides how often to rebuild the tree ms*//* 5.0,7.5,10.0 seconds*/K=rebuild_tree(Jfl;R_time=current_time*TYNE_T0_MS;}if((behaviour==old_behaviour)fl(car.exec = EXEC_IDLE))/* INERTIA for actions: only replan if new action or executor is idle*/return_of _control=TRUE;elsereturn_of_control=FALSE;prifn[behaviour]O; /* Do the behaviour*/for(i0; i<NAX_OBJECTS ; i++){old_obj [ii .xobj[il .x;old_obj [ii .y=obj Ii] .y;old_obj Ii) .headobj Fi] .head;old_obj [ii .speedobj Li] .speed;}old_behaviour=behaviour;}/*****************************************************************************/Ktree *get_state(Object *o)return(search_tree (K, o));I**************************************************************4c**************Imt select_action(Ktree *s){/*select the maximum behaviour*/mt i,j,k;float PBESThO.9; /* the prob. that the best action is chosen */mt actiono; /4’ the best action */bit best[MAX_BER]; /* All the best actions */imt nbest=0;Appendix A. Main Reinforcement algorithm 57/* find the Best Action(s) */if((((float) (randO7.lOO))/iOO.O)< PBEST){/* randomly select one of the best */actionresponse Cs);}else{1* select any randomly */action=(randO7.MAX_BEH);J.return(action);}Appendix BUpdate Q-values algorithm/*****************************************************************************/mt update_Qvalues(Ktree *s,Ktree *sl,int r,int b,Object *o){/* Qvalues q, state s, reward r, action b */mt i,j;float E,Ei,error,resp[MAX_BEHJ ,BETAO.9,GAMMAO.5;El0;/* find the expected utility E *1if(sl!NULL){iresponse(sl); /9’ i is the best action */E10;for(j=O;j<(sl—>d);j++)E1E1+(sl—>bucket [j] .response Ii]);if((si—>d) =0)E1E1/(float)(sl—>d);1* avg reward for best action action */elseEl0;}E0;for(i0; i<HAX_BEH;i++) resp[il=0;if Cs! =NULL)/* the top of the stack is the current Qvalue for this state*/for(i=0; i<MAX_BEH; i++)resp[ilCs—>bucket[Cs—>d)—ll .response[i]);E=resp[bJ;error(r+GAMMA*Ei-E);resp [hi E+BETA*error;update(s,o,resp);}58Appendix CKD-tree algorithmstinclude “ktree .h”Ktree *search_tree(Ktree *kt, Object *o){float m[DIMEN1;if(oNULL) return(NULL);else{1*12 dimensions *//* BALL*/m[O]o[0J .x;m[1]o[0J .y;m [21=0 [0) . head2;m[31 0 [01 . speed=3;/* CAR 1 */m[4]o[1) .x;m[5)o[1].y;m[6]o[1] .head;m[7]o[1] .speed;/* CAR 2 */n[8)=o[2) .x;m[9]o[2] .y;m[10)o[2) .head;m[111o[21 .speed;}if(kt==NULL) return(NULL);if (kt->bucketNULL)/*check uhere to split. kt—>d is the dimension and kt—>k is the value*/if(m[(kt—>d)]>(kt—>k))if(kt—>high=NULL) printf(’Error in tree (search)\n’);59Appendix C. KD-tree algorithms 60else return(search_tree(kt—>high, o));elseif(kt—>low=NULL) printf(”Error in tree(search)\n”);else return(search_tree(kt—>low, o));else/* return the correct model */if((kt->bucket!NULL)&&((kt->high!=NULL)I I (kt->low!=NULL)))printf (“Error in get state\m”);return(kt);}/*****************************************************************************procedure: splitdescription: finds the best dimension and location to split. Does a trialsplit and performs a t—test.Ktree *split(Ktree *kt){Ktree *nl,*n2,*n3;float t,k,m[DIMENJ;float N1,N2,ml,m2,df,s,si,s2;mt d,i;if((n1(Ktree *) malloc(sizeof(Ktree)))==NULL){kt—>c0; printf(”Err: Malloc\n”); return(kt) ;}if((n2(Ktree *) malloc(sizeof(Ktree)))==NULL){kt—>c=D; return(kt);}if((n3(Ktree *) malloc(sizeof(Ktree)))==NULL){kt->c=O; return(kt) ;}nl->high=n2;ni—>lown3;nl->bucketNULL;n2—>d0;n2—>cl;n2->highNULL;n2->1owNULL;if((n2—>bucket(sample *)malloc(SAMPLES*(sizeof(sample))))==NULL)kt—>c0;return(kt);Appendix C. KD-tree algorithms 61}m3->d0;n3—>c1;n3->high=NULL;n3->lowNULL;if((n3—>bucket(sample *)malloc(SAMPLES*(sizeof(sample))))==NULL)kt—>c0;return(kt);}/* find the dimension with the*//* maximum variance */var(kt->d,kt->bucket ,m);k0;d0;for(i0 ; i<DINEN ; i++)if(m[iJ>k){km[i] ;di;}nl->dd;/* Find the location for the partion plane */k=median(d,kt—>d,kt—>bucket);ni->kk;/* split the samples high or low */for(i0;i<(kt—>d) ;i++)if((kt—>bucket[i] .inputs[dl)>k){n2—>bucket [n2—>d] kt—>bucket [ii;(n2—>d) ++;}else{n3—>bucket [n3—>d] kt—>bucket [ii;(m3—>d)++;}siO;s20;t0;if(((n2—>d)>5)&&((n3—>d)>S)) /* need at least S samples each */{/* Do a t—test to see if the split is valid *//* Null hypothesis means are equal */mean(n2—>d,n2->bucket ,m);Appendix C. KD-tree algorithms 62mlm[d];mesn(n3—>d,n3—>bucket ,nO;m2m[d];var(n2—>d,n2—>bucket ,m);slm[d];var(n3->d,n3->bucket ,m);s2m[d];N10;for(i0;i<(n2—>d) ;j++)N1N1+max(n2—>bucket [i] .response);N20;for(i0;i<(n3—>d) ;i++)N2N2+max(n3—>bucket [ii .response);t((sl*si)/Ni)+((s2*s2)/N2);if (t<O) t—1*t;tsqrt(t);t(mi—m2)/t;}if((t<—O.7O)I I(t>O.70)) /* threshold???*/{/* reject Null hypothesis *//* Split the node */free(kt);nl—>c0;return(ni);}else{/* accept Null hypothesis */free(nl);free(n2);free(n3);kt—>c0;if(kt—>high!=O) printf(tError in split\n”);return(kt);}}Appendix C. KD-tree algorithms 63/**********.******************************************************************mt response(Ktree *kt)‘Cmt N,i,j;sample *B;float r[MAX_BETI];if(kt==NIJLL){ prmntf (“Error in tree(response)\n”);}else{N=kt->d;Bkt->bucket;if((N>SAMPLES) II (B==NULL)) printf(”Error in Bucket\n”);for(j0;j<MAX_BEH;j++) r[j]=O;use the top element */i0;for(j0 j<MAX_BEH;j++)‘Cr[j]B[N—1] .response[j]if(rCj]>rti])i=j;}return(r [i]);}}/*****************************************************************************void update(Ktree *kt,Object *o,float *r)mt i,j;kt—>cl;if((kt—>bucketfrNuLL)&&((kt—>high!NULL)I I(kt—>low!NULL)))printf (“Error in Update\n”);if ( (kt->d) <SAMPLES){for(i0; i<MAX_BEH; i++)Appendix C. KD-tree algorithms 64kt—>bucket [kt—>d] response [i] =r [i]kt—>bucket[kt—>dl inputs[01o[0] .x;kt—>bucket[kt—>dl .inputs[1)o[0] .y;kt—>bucket [kt—>dl inputs 12) =0 [0] .head;kt—>bucket [kt—>dl inputs [3] =0 [0] speed;kt—>bucket[kt—>d) .inputs[4]o[1] .x;kt—>bucket [kt—>d] inputs [5] =0 [1] y;kt—>bucket[kt—>d] .inputs[6]o[11 .head;kt—>bucket [kt—>d] inputs [7] o [11 speed;kt—>bucket [kt—>d] inputs [81=0 [21 x;kt—>bucket [kt—>dI inputs [91=012)kt—>bucket [kt—>dl inputs [101=012] .head;kt—>bucket [kt—>dl inputs [111=012] speed;kt->d++;}else{/*push down like a queue*/for(i0;i<(SAMPLES—2) ;i++){for(j=0;jGIAX_BEU;j++)kt—>bucket[i] .response[j]kt—>bucket[i+1] .response[j];for(j=0;j<DIMEN;j++)kt—>bucket[i] .inputs[jlkt—>bucket[i+1] .inputs[j]}j=(SAHPLES—1);for(i0; i<11AX_BEH; i++)kt—>bucket[j] .response[i]r[ilkt—>bucket[j] .inputs[0]=o[0] .x;kt—>bucket[j] .inputs[l]o[0] .y;kt—>bucket [j] inputs [2] =0 [0] .head;kt—>bucket [j] inputs [31=0 [0] speed;kt—>bucket[jl .inputs[41=o[11 .x;kt—>bucket[j] .inputs[51=o[1] .y;kt—>bucket [j] inputs [61=0111 .head;kt—>bucket [j] inputs [71=0111 speed;kt—>bucket[jl .inputs[81o[21 .x;kt—>bucket[jl .inputs[9]o[21 .y;kt—>bucket [ii inputs [101=0 [2] .head;Appendix C. KD-tree algorithms 65kt—>bucket[j] inputs [1i]o [21 speed;}if((kt->bucket!=NULL)tc&((kt—>high!=NULL)I I(kt—>low!NULL)))printf (“Error in Update\n”);}/***********************************************************‘n******’ic*********Ktree *rebuild_tree(Ktree *kt){mt MAXJ)EPTH6;static mt depthO;/* oct2. modified to only split to a fixed depth.*/if (depth<MAX_DEPTH)if ( (kt—>bucket) NULL)printf(”depthY.d\n” ,depth);if((kt—>c)1) ktsplit(kt);}else{depthdepth+l;if((kt—>high) !=NULL) (kt->high)rebuild_tree(kt—>high);if ( (kt—>low) NULL) (kt—>low) rebuild_tree (kt—>low);depthdepth-l;}if((kt—>bucket !NULL)&&(kt—>low!NULL)) prmntf(”Error in Rebuild tree\n”);return(kt);}Appendix DA KD-tree10,0.88,1327,59 7,39A KD-tree generated after 360 games. The RD-tree is grown using a fixed depth updateschedule, testing all the nodes less then a depth of 5 every 7.5 seconds.Each internal node of the tree contains the number of the dimension that is split andthe location of the partition plane. For example the root node contains the pair (0, 121)which means that the balls x-dimension was split at the value 121. If the balls x-valueis greater then 121 the right path is followed, otherwise the left branch is followed. Thenumbers of each dimension are:66Appendix D. A KD-tree 670 The balls x-dimension.1 The balls y-dimension.2 The balls orientation 0.3 The balls speed.4 The opponents x-dimension.5 The opponents y-dimension.6 The opponents orientation.7 The opponents speed.8 The agents x-dimension.9 The agents y-dimension.10 The agents orientation.11 The agents speed.The leaf nodes contain the number of the action that was learned for that cluster.The list of possible actions and their corresponding numbers are:0 wait1 Shoot2 Clear3 Go Home4 Defend HomeAppendix D. A KD-tree 685 Go Red6 Defend Red7 Servo8 Unwedge
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Reinforcement learning using the game of soccer
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Reinforcement learning using the game of soccer Ford, Roger David 1994
pdf
Page Metadata
Item Metadata
Title | Reinforcement learning using the game of soccer |
Creator |
Ford, Roger David |
Date Issued | 1994 |
Description | Trial and error learning methods are often ineffective when applied to robots. This is due to certain characteristics found in robotic domains such as large continuous state spaces, noisy sensors and faulty actuators. Learning algorithms work best with small discrete state spaces, discrete deterministic actions, and accurate identification of state. Since trial and error learning requires that an agent learn by trying actions under all possible situations, the large continuous state space is the most problematic of the above characteristics, causing the learning algorithm to become inefficient. There is rarely enough time to explicitly visit every state or enough memory to store the best action for every state. This thesis explores methods for achieving reinforcement learning on large continuous state spaces, where actions are not discrete. This is done by creating abstract states, allowing one state to represent numerous similar states. This saves time since not every state in the abstract state needs to be visited and saves space since only one state needs to be stored. The algorithm tested in this thesis learns which volumes of the state space are similar by recursively subdividing each volume with a KD-tree. Identifying if an abstract state should be split, which dimension should be split, and where that dimension should be split is done by collecting statistics on the previous effects of actions. Continuous actions are dealt with by giving actions inertia, so they can persist past state boundaries if it necessary. |
Extent | 1119444 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-03-04 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051238 |
URI | http://hdl.handle.net/2429/5465 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1994-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1994-0520.pdf [ 1.07MB ]
- Metadata
- JSON: 831-1.0051238.json
- JSON-LD: 831-1.0051238-ld.json
- RDF/XML (Pretty): 831-1.0051238-rdf.xml
- RDF/JSON: 831-1.0051238-rdf.json
- Turtle: 831-1.0051238-turtle.txt
- N-Triples: 831-1.0051238-rdf-ntriples.txt
- Original Record: 831-1.0051238-source.json
- Full Text
- 831-1.0051238-fulltext.txt
- Citation
- 831-1.0051238.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051238/manifest