OPTIMAL CONTROL OF DYNAMIC SYSTEMS THROUGH THEREINFORCEMENT LEARNING OF TRANSITION POINTSByKenneth M. BucklandM.A.Sc. (Electrical Engineering) University of British ColnmbiaB.Sc. (Electrical Engineering) University of CalgaryA THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinTHE FACULTY OF GRADUATE STUDIESDEPARTMENT OF ELECTRICAL ENGINEERINGWe accept this thesis as conformingto the required standard...THE UNIVERSITY OF BRITISH COLUMBIAApril 1994© Kenneth M. Bnckland, 1994In presenting this thesis in partial fulfilment of the requirements for an advanceddegree at the University of British Columbia, I agree that the Library shall make itfreely available for reference and study. I further agree that permission for extensivecopying of this thesis for scholarly purposes may be granted by the head of mydepartment or by his or her representatives. It is understood that copying orpublication of this thesis for financial gain shall not be allowed without my writtenpermission.(Signature)_________________________________Department of___________________The University of British ColumbiaVancouver, CanadaDateDE-6 (2/88)AbstractThis work describes the theoretical development and practical application of transitionpoint dynamic programming (TPDP). TPDP is a memory-based, reinforcement learning,direct dynamic programming approach to adaptive optimal control that can reduce thelearning time and memory usage required for the control of continuous stochastic dynamicsystems. TPDP does so by determining an ideal set of transition points (TPs) whichspecify, at various system states, only the control action changes necessary for optimalcontrol. TPDP converges to an ideal TP set by using a variation of Q-learning to assessthe merits of adding, swapping and removing TPs from states throughout the state space.This work first presents how optimal control is achieved using dynamic programming,in particular Q-learning. It then presents the basic TPDP concept and proof that TPDPconverges to an ideal set of TPs. After the formal presentation of TPDP, a PracticalTPDP Algorithm will be described which facilitates the application of TPDP to practical problems. The compromises made to achieve good performance with the PracticalTPDP Algorithm invalidate the TPDP convergence proofs, but near optimal control policies were nevertheless learned in the practical problems considered. These policies werelearned very quickly compared to conventional Q-learning, and less memory was requiredduring the learning process.A neural network implementation of TPDP is also described, and the possibility ofthis neural network being a plausible model of biological movement control is speculatedupon. Finally, the incorporation of TPDP into a complete hierarchical controller isdiscussed, and potential enhancements of TPDP are presented.11Table of ContentsAbstract iiList of Figures xAcknowledgements xiiDedication1 Introduction 11.1 Dynamic Programming Control 11.2 The Transition Point Dynamic Programming Approach 21.3 The Format of this Work 31.4 Brief Summary of Contributions 42 Q-learning 52.1 Dynamic Programming Control 52.1.1 Control as a Markov Decision Process 52.1.2 Markov Decision Processes and Optimal Control 72.1.3 The Optimality Equation 82.1.4 Solving the Optimality Equation 92.2 Direct DP Control 102.2.1 Direct and Indirect DP 102.2.2 Direct DP in the form of Q-learning 112.2.3 Convergence of Q-learning 121113 Transition Point Dynamic Programming (TPDP)3.1 General Description of TPDP3.1.1 Inspiration3.1.2 The Shape of the Uniform Region3.1.3 The Benefits of TPDP3.1.4 TPDP and Inertia3.2 The Goal of TPDP3.2.1 Transition Points (TPs)1414181820212122232324252728282.2.4 Exploration in Q-learning 132.2.5 One-step Q-learning 142.3 The Characteristics of Q-learning2.3.1 Q-learning is Direct DP Control2.3.2 Q-learning Can Be Implemented as Memory-Based Control .2.3.3 Q-learning is Reinforcement Learning .2.3.4 Q-learning Addresses the Credit Assignment Problem2.3.5 Q-learning is Adaptive Optimal Control2.3.6 Q-learning and Temporal Differences . .2.3.7 Q-learning and Noise2.4 Practical DP Control2.4.1 The Curse of Dimensionality2.4.2 Associative Content Addressable Memories (ACAMs)2.4.3 Amalgamation of States2.4.4 Approximation of Evaluation Functions2.4.5 Prioritized Exploration2.4.6 Replaying ExperiencesBoundaries2929293132333434iv4 Practical TPDP 693.2.2 Environments 343.2.3 Closed State Spaces 353.2.4 Optimal Closed State Spaces 373.2.5 Minimal TP Optimal Control 373.2.6 Summary of the TPDP State Sets 383.3 The Specific Operation of TPDP 393.3.1 Pursuing Minimal TP Optimal Control 393.3.2 TPDP is a Form of Q-learning 403.3.3 Determining the Q-values of TPs 413.3.4 Determining the Evaluation Function Values of TP States . . 433.3.5 Swapping TPs 453.3.6 The Result of Continued TP Swapping 493.3.7 The Limitations of TP Swapping 523.3.8 Adding TPs 533.3.9 Removing TPs 573.3.10 Preparing External States 593.3.11 Minimal TP Optimal Control is Guaranteed 613.3.12 Summary of TPDP Operation 633.4 The Characteristics of TPDP 643.4.1 The Characteristics Shared with Q-learning 643.4.2 A Minimal Form of Direct DP Control 643.4.3 TPDP is Action Centered 653.4.4 TPDP and Temporal Differences 663.4.5 Continuous State Spaces and TPDP 66V4.1 The Practical TPDP Approach 694.1.1 The Problem With the Theoretical Form of TPDP 694.1.2 Concurrent Assessment of TP Modifications 704.1.3 Conventional Q-learning and Practical TPDP 704.1.4 Minimal TP Optimal Control and Practical TPDP 714.2 The Specific Operation of Practical TPDP 724.2.1 Using Weights to Concurrently Assess TPs 724.2.2 Policy TP Determination 734.2.3 Exploration in Practical TPDP 744.2.4 General Operation of the Practical TPDP Algorithm 764.2.5 Delayed Updating 794.2.6 The Practical TPDP Exploration Parameters 804.2.7 The Other Practical TPDP Parameters 845 Application of Practical TPDP 865.1 The Race Track Problem 865.1.1 Description of the Problem 865.1.2 A Discrete-time Stochastic Dynamic System Formulation 885.1.3 A Continuous Version of the Problem 885.1.4 Problem Specifics Used During Practical TPDP Application . 895.1.5 The Practical TPDP Algorithm Parameters Used 915.1.6 Evaluation of Performance on the Problem 915.2 Performance on the Race Track Problem 925.2.1 Comparing Practical TPDP and Conventional Q-learning 925.2.2 Near Optimal Performance of Practical TPDP 955.3 Practical TPDP and Generalization 99vi5.3.1 Generalization by Copying5.3.2 Generalization by Copying in the Race Track Problem5.3.3 Practical TPDP Glitches on the Race Track Problem5.3.4 A Performance Metric5.3.5 Comparing Generalization Levels With the Performance Metric5.4 Practical TPDP and TP Allocation5.4.1 TP Allocation in the Race Track Problem5.4.2 Superfluous TPs5.4.3 Stopping Learning5.4.4 Arbitrarily Limiting the Number of TPs5.4.5 Placing a Price on TPs5.4.6 Eliminating Suboptimal TPs5.4.7 TP Allocation in a One-Dimensional Race Track Problem . .5.5 Varying the Practical TPDP Algorithm Parameters6 Neural TPDP6.1 A Neural Network Design for Direct DP6.1.1 Neural Networks and Evaluation Function Approximation6.1.2 A Neural Network Design for State Space Control6.1.3 ACAM Operation of the Neural Network Design6.1.4 Implementing Mutual Inhibition6.1.5 Synapses as TPs6.1.6 The Full Implementation of Neural TPDP6.1.7 Allocating Identification Neurons6.1.8 Parallel Processing and Neural TPDP6.2 Analysis of Neural TPDP123123• . . . 123• . • • 124127127129• . . . 130132133• . . . 13499100101102103104104105107108109111113116vii6.2.1 The Localized Operation of Neural TPDP 1346.2.2 Generalization by Copying in Neural TPDP 1356.2.3 Elemental and Composite Actions 1366.3 Biological Plausibility 1396.3.1 A Possible Model of Biological Movement Control 1396.3.2 Increasing the Localized Operation of Neural TPDP 143‘7’ Practical TPDP as Part of a Complete Hierarchical Controller 1497.1 Practical TPDP Facilitates Lower Movement Control 1497.1.1 Context Lines 1497.1.2 A Sketch of the Complete Hierarchical Controller 1517.2 Using Higher Level Knowledge 1517.2.1 Guided Learning 1517.2.2 Implementing Guided Learning in Practical TPDP 1527.2.3 Gross Inverse Models 1537.2.4 Optimality Criteria and Higher Knowledge 1567.2.5 Dynamic Optimality Criteria 1598 Conclusion 1618.1 The Main Benefit of TPDP 1618.2 The Main Disappointment of TPDP 1628.3 Direct DP Optimal Control with TPDP is Practical 1638.4 Contributions of this Work 1638.5 Future Work 164Glossary 166List of Variables 175vu’Bibliography 182A Proof of the Convergence of the TPDP Version of Q-Learning 190B Full Description of the Practical TPDP Algorithm 195B.1 General Operation of the Practical TPDP Algorithm 195B.2 The Practical TPDP Algorithm 195B.3 The Stack Updating Procedure 200ixList of Figures3.1 Stylized Application of TPDP to a Phase Plane Control Task 313.2 Sample Immediate Cost Error Factors 684.3 The Practical TPDP Algorithm 774.4 The Stack Update Procedure 785.5 A Race Track 875.6 Performance of Practical TPDP and Conventional Q-learning 945.7 Performance of Practical TPDP and Delayed Updating Q-learning . . 955.8 Performance on a Discrete Version of the Problem 965.9 Performance of Practical TPDP 965.10 Five Typical Race Track Routes After 300 Epochs 975.11 Five Typical Race Track Routes After 800 Epochs 975.12 Five Typical Race Track Routes After 1300 Epochs 975.13 Five Typical Race Track Routes After 1800 Epochs 985.14 Performance of Practical TPDP With and Without Generalization. . 1015.15 The Effect of Changing Pr(ogeneraijze> 0) 1035.16 TP Allocation as Practical TPDP Learning Progressed 1055.17 The Effect of Limiting TP Allocation 1095.18 The Effect of Incorporating a TP Allocation Cost 1115.19 The Effect of Eliminating Suboptimal TPs 1125.20 The One-Dimensional Race Track and Its Phase Plane 1135.21 Limiting TP Allocation on the One-Dimensional Race Track 115x5.22 Performance of Practical TPDP on the One-Dimensional Race Track 115Pr(uswaTp > 0)Pr(JaddTp > 0)Pr(exteii> 0)Pr(crchuLge> 0)a1171171191201201211211226.316.326.336.346.357.367.377.387.39B.40B.41A Generic Neuron ModelA Neural Network Design for State SpaceThe Localized Operation Practical TPDP1471541541581582052065.23 TP Positioning in the One-Dimensional Race Track5.24 TP Positioning with a 25% TP Allocation Limit5.25 The effect of Changing ãdelay5.265.275.285.295.30TheTheTheTheTheEffect ofEffect ofEffect ofEffect ofEffect ofChangingChangingChangingChangingChangingControlAlgorithm124125145146The Localized Operation Stack Update ProcedurePerformance of Practical TPDP with Increased Localized Operation .The Biased Action Specifications Used for Guided LearningPerformance of Practical TPDP During Guided LearningA Roughly Optimal Path on the Race TrackPerformance of Practical TPDP with Increased Optimality InformationThe Practical TPDP AlgorithmThe Stack Update ProcedurexiAcknowledgementsI would like to thank the following people:John Ip, my very good friend and kindred spirit, for his extensive help withthe proofs in this thesis — and for our discussions in general.My supervisor, Peter Lawrence, for providing many insights into reality.The members of my thesis committee, David Lowe, Mabo Ito and Chris Ma,for their suggestions, and for taking the time to be involved in this work.The taxpayers of Canada for their financial support.All of those whose ideas and work have been of fundamental importance tomy own and to this thesis, especially Marvin Minsky, Andrew Barto, JamesAlbus and, always, Ayn Rand.xiiDedicationThis thesis, and all the best of my effort behind it,is dedicated to my wonderful wife, Heather.All in one, the best of women.Full instinct of woman, self refined.Lovely, healthy, firm, with thick hair and broad smile.Unique air and look, deeply warming and exciting.That quirky something.A grand passion for adventure, worldwide and domestic.New possessions she adores, choosing clothes and things for home.The best of mothers, caring, careful, informed, and always a full heart.Scientific in mind, practical, but a wide streak of creativity.Art comes easily in boldness, flow and color.And inventions that amuse, yet always have good purpose.Trusting, reliable, loyal, relentlessly supporting.A strong, natural sense of the decent and the right.I love her fully.xliiChapter 1Introduction1.1 Dynamic Programming ControlDynamic programming (DP) methods can be used to optimally and adaptively controllinear and non-linear systems (Barto et al., 1991). In such applications, DP methods areused to determine the expected cost for each action that may be taken in each state.The action that has the lowest expected cost is then selected as the optimal action.Specifically, the determination of the optimal action for each state is formed as a Markovdecision process to which DP methods can be applied (Watkins, 1989).DP methods are effective in determining optimal actions in this manner. The problemwith DP control is that, in a state space of any size, the learning time required todetermine the optimal actions can be extensive (Barto et al., 1991). Another problemis that the amount of memory required to store the DP computational elements can beenormous (Barto et al., 1991).Many different approaches have been taken to solve these two problems. A commonapproach is to use functional approximation to facilitate generalization of the expectedcosts being learned (Barto et al., 1989, 1990c, Anderson, 1989a, 1989b, 1993, Werbos,1990, Chinchuan et al., 1990, Yee, 1992, White et al., 1992, Thrun et al., 1993). Suchgeneralization reduces the learning time required to determine the optimal actions. Approaches have also been developed that partition the state space in ways which reducethe amount of memory required for representation (Moore, 1991, Chapman et al., 1991).1Chapter 1. Introduction 2Generally such partitioning approaches also reduce the learning time required becausethey group states together, facilitating the simultaneous (as opposed to individual) learning of optimal policies for all of the grouped states. Other approaches to reducing thelearning time include ones where learning is focused on the regions of the state spacewhere it will produce the best results (Moore et al., 1993, Yee, 1992), and approacheswhere previous experiences are replayed (Lin, 1991a). Sections 2.4.3 through 2.4.6 willdescribe these various approaches in more detail.1.2 The Transition Point Dynamic Programming ApproachThis work describes another approach to reducing the learning time and memory usagerequired for DP control. Transition point dynamic programming (TPDP) operates byplacing DP computational elements only at locations in the state space where changesto the action being specified are required for optimal control. The resulting reductionin memory usage decreases learning time because learning time is normally related tothe number of DP computational elements (see Section 3.1.3). TPDP is thus a memory-based control approach (see Section 2.3.2) that can control a system using a minimalamount of memory’.TPDP employs reinforcement learning (see Section 2.3.3) to determine the properplacement of the DP computational elements that specify the control action changes. Asa reinforcement learning approach TPDP can learn optimal control policies when thefinal outcome of a control sequence can be judged, but the desired intermediate actionsare unknown. This capability typically requires longer learning times relative to otherapproaches (see Section 2.3.3). As a result, TPDP may not be the best control approachto take if the desired intermediate actions are known.1Assuming a fixed, finely resolved state space (see Section 3.1.3).Chapter 1. Introduction 3TPDP operates as a direct DP controller that does not employ an explicit systemmodel (see Section 2.2.1). TPDP is thus a form of Q-learning (Watkins, 1989).TPDP is best suited to learning control policies for continuous stochastic dynamicsystems— systems that have inertia. Some examples of systems to which TPDP controlcould be successfully applied are robot manipulators, chemical process control, hydraulicsystem control and flight control. The conditions under which TPDP operates best aredescribed in Section 3.1.4.1.3 The Format of this WorkTPDP will be described and demonstrated in this work in the following manner:Chapter 2 DP will be described in detail, particularly direct DP in the form of“Q-Learning”. The problems of DP will be discussed, as well as solutions to them that have been investigated by others.Chapter 3 TPDP will be described in detail. The inspiration behind TPDP willbe explained, and proof that TPDP converges on optimal control policies will be presented.Chapter 4 A practical form of TPDP will be described, and an algorithmic implementation of that practical form presented.Chapter 5 The effectiveness of TPDP will be demonstrated on two problems. Inthe process, many characteristics of TPDP will be illustrated.Chapter 6 TPDP was developed in the context of neural network control. Therelationship of TPDP to neural networks will be described, as well asa neural network implementation of TPDP.Chapter 1. Introduction 4Chapter 7 The incorporation of TPDP into a complete hierarchical controller willbe described, and some attributes and capabilities of such a controllerwill be discussed.Chapter 8 In conclusion, the overall attributes of TPDP will be described.1.4 Brief Summary of ContributionsThe development of TPDP is the main contribution of this work. The originality ofTPDP is that it specifies only the control action changes necessary for the optimal controlof continuous stochastic dynamic systems. As a result, when learning optimal controlpolicies for such systems, TPDP requires less time and less memory than conventionalQ-learning.In the course of developing TPDP a number of additional contributions have beenmade:1. The development of “generalization by copying” to facilitate generalization duringTPDP learning, and to thereby increase the rate at which optimal actions arelearned.2. The development of a neural network implementation strategy for direct DP controllers — including TPDP controllers.3. The development of approaches for the incorporation of TPDP into a completehierarchical controller, including an investigation of the ways in which high levelcontrol knowledge can aid TPDP in learning optimal policies for the lowest level ofcontrol.Chapter 2Q-learningThis Chapter will provide the background information about dynamic programming (DP)and Q-learning required for the explanation of transition point dynamic programming(TPDP). It will also describe some of the solutions to the problems of DP, and thecharacteristics of Q-learning that are shared with TPDP.2.1 Dynamic Programming Control2.1.1 Control as a Markov Decision ProcessThe purpose of this Chapter is to present a control paradigm in which optimal controlcan be achieved by treating the control action determination in each state as a Markovdecision process (Ross, 1983). When control action decisions are framed as Markov decision processes, optimal control policies can be determined using dynamic programming(DP) methods. Q-learning (Watkins, 1989, Watkins et al., 1992), which is the standarddirect form of DP (see Section 2.2.1), and which is the basis for transition point dynamicprogramming, will be described fully.Consider a discrete-time stochastic dynamic system with a finite state set S = 1, ..., ii(Barto et al., 1991). A controller sends action’ specifications to this system, which actas inputs to the system and affect its state in some way. At time step t this system isin state i S, where the controller selects an action u from the possible actions U(i)‘Control actions will be referred to as “actions” throughout this work.5Chapter 2. Q-learning 6in that state; u U(i). As a result of the application of action u, the system makesa transition to some state j E S in the next time step (t + 1). The probability of atransition occurring from state i to each state j as a result of u is the state transitionprobabilityp3(i,u); 0 p3(i,u) 1. Ifp3(i,u) isO, a state transition is not possible fromstate ito state j as a result of action u. If pj(i, u) is 1, the state transition resulting fromaction u is a deterministic one that always moves the system from i to j. For all actionsu taken in state i it is true that:= 1 (2.1)jESAs a result of the application of action u in state i, an immediate cost c(i, u) isincurred. The state transition probabilities pj(i, u) and the immediate costs c(i, u) areboth functions of only the system state i and the action u taken in that state (Ross,1983, Barto et al., 1991).The goal of a controller applied to such a system is to determine, for each state i,a policy action (i)€U(i) that should always be taken to fulfill some control task (ortask) required of the system. The control task could be to move the system along sometrajectory, or to direct the system to some target state. The set of policy actions overthe entire state space S constitutes the policy2 1tt; t = [t(1), ..., p(n)].Given a policy ,i, and some initial state i, the sequence of states that the system canmove through after leaving i forms a “Markov chain” (Narendra et al., 1989) based onthe state transition probabilities. Because this Markov chain is the result of the policyaction selection, the control process is referred to as a “Markov decision process” (Ross,1983, Barto et al., 1991).2Control policies will be referred to as “policies” throughout this work.Chapter 2. Q-learning 72.1.2 Markov Decision Processes and Optimal ControlGiven a discrete-time stochastic dynamic system as described, an optimality criteria isrequired that can be used as a basis for an optimal policy determination. The optimalitycriteria used throughout this research was the “expected total infinite-horizon discountedcost”. Other criteria, like total cost or average cost, can be used in Markov decisionprocesses (Ross, 1983, Watkins, 1989), but DP controllers typically use the expectedtotal infinite-horizon discounted cost. The expected total infinite-horizon discountedcost resulting if policy t is applied from state i onward is the evaluation function value:V(i) = E {7tc(St,1t(St))Iso = i] (2.2)Where ‘y, 0 y 1, is the discount factor of future immediate costs c(i, u), and St is thestate at time step t. The notation “E,” indicates that V(i) is based on the expectedcosts that will be incurred, given that policy is applied from state i onward.The set of evaluation function values V(i) for each state i E S over the entire statespace S is known as the evaluation function and is denoted ; V, = [4(1), ..., V(n)].The evaluation function indicates, for each state, the expected total infinite-horizon discounted cost that will result from the existing policy 1u.To facilitate optimal control, a policy ,u must be determined that minimizes theevaluation function value V(i) for each and every state i. The policy that achieves thisis the optimal policy, denoted 1z*; z9 = [,a*(1),...,a*(n)j. Such a policy depends on 7,and it may not be unique — more than one action in each state may result in the sameminimal expected total infinite-horizon discounted cost (Ross, 1983).The optimal evaluation function values for each state i that result from the optimalpolicy t’’ are denoted (i). The set of these values over the entire state space Sconstitutes the optimal evaluation function V*; VrL* = {.(1), ..., V*(n)].If an optimal policy is determined that minimizes V(i) for each state i, and 4(i) isChapter 2. Q-learning 8composed of future immediate costs as indicated in Equation 2.2, then the immediatecosts (in conjunction with the discount rate j) are the basis upon which the optimalpolicy is determined. As a result, the control task (see Section 2.1.1) that the policyleads the system to perform is inherently defined by the setting of the immediate costs.Generally, lower immediate costs are associated with target states, or with states alongdesired trajectories, resulting in policies that direct the system towards those states.2.1.3 The Optimality EquationA method of determining V,i is required to determine an optimal policy p, and tothereby facilitate optimal control. As will be explained, once V is known, the optimalpolicy it” can readily be determined.Q-values (Watkins’ notation, 1989) are defined as3:Q(i,u) c(i,u) +7p(i,u)V(j) (2.3)jEsThe Q-value Q(i, u) is the expected total infinite-horizon discounted cost if action u istaken in state i and policy p is followed in all states thereafter (Barto et al., 1989).To facilitate optimal control, the policy action p(i) for each state i should be theaction that has the lowest Q-value for that state. The policy action p(i) should thereforebe an action which satisfies:Q(i,p(i)) = mm Q(i,u) (2.4)uEU()If the policy actions determined in this manner are always taken, the evaluation functionvalue V(i) for each state i will be the Q-value of the policy action p(i):3A more precise notation for Q-values would be “Qv,. (i, u)”, but for the purpose of clarity “Q(i, u)”will be used throughout this work.Chapter 2. Q-learning 9V(i) = Q(i,j(i)) (2.5)= mm Q(i,u) (2.6)uEU(i)= u) (2.7)jESThe evaluation function T/ resulting from the determination of policy actions in thismanner will actually be an optimal evaluation function (Ross, 1983). This is true because,if Equation 2.7 is recursively expanded out along the Markov chain extending from statei, V(i) will consist entirely of sums of minimum costs. As a result of this, the optimalevaluation function is defined by:V*(i) = mm Q*(j) (2.8)uEU(:)= u)[C(iU) +7EPi(iu)V*(i)] (2.9)JESEquation 2.9 is one form of the Bellman Optimality Equation” (Bellman, 1957). Thegoal of DP, in all its variations, is to solve Equation 2.9 for each state i to determine .(Barto et al., 1989). Once V has been determined, the optimal Q-values4Q*(j, u) can bedetermined using Equation 2.3, and the optimal action *() E U(i) can be determinedfor each state i by finding those actions which satisfy Equation 2.4. The optimal policyis thereby determined.2.1.4 Solving the Optimality EquationThe goal of DP is to solve Equation 2.9, the Bellman Optimality Equation, for eachstate i to determine the optimal evaluation function V1. and thereby determine themore precise notation for optimal Q-values would be “Qv. (i, u)”, but for the purpose of clarityQ*(j u)” will be used throughout this work.Chapter 2. Q-learning 10optimal policy . The standard approach is that of value iteration (Barto et al., 1991).Value iteration is a successive approximation procedure where an evaluation functionapproximation Vk is updated so that it converges onFollowing Barto et al. (1991), Vk is defined as the estimate of V,* at stage k (k =0, 1,...) of the approximation computation. Where Vk(i) is the approximate optimalevaluation function value of state i at stage k, the successive approximation procedure isdefined as follows:Vk+l(i) = mm Qvk(i,u) (2.10)uEU(z)= mm c(i,u)+7p(i,u)Vk(j) (2.11)UEU(:) jESUpdating the evaluation function in this manner is referred to as backing-up the costsbecause, for each state i, Vk(i) is updated with the evaluation function value approximations Vk(j) of the states j that the system may make a transition to after taking actionu U(i) in state i (Barto et al., 1991).2.2 Direct DP Control2.2.1 Direct and Indirect DPThe description of DP so far has focused on systems where an explicit model of thesystem response is known or has been learned using system identification (Ogata, 1970).This model is defined by the state transition probability values pj(i, u), which indicatethe probability of a transition occurring from state i to state j if action u is employedin state i. Controllers which use such explicit models are indirect controllers (Sutton,1991, Sutton et al., 1992, Barto et al., 1991, Narendra, 1992). Explicit models can beconstructed using methods more compact than simple probability tables, but for the sakeof brevity it will be assumed throughout this work that they are not.Chapter 2. Q-learning 11Direct controllers do not use explicit models of the system response. Instead theydirectly determine how to control the system by making observations of how the systemresponds to the various actions in the different system states (Sutton et al., 1992, Bartoet al., 1991). Direct DP controllers function by repeatedly observing the actual coststhat are incurred when an action u U(i) is employed in state i (Barto et al., 1989).2.2.2 Direct DP in the form of Q-learningA standard way to implement direct DP controllers is using Q-learning (Watkins, 1989,Watkins et al., 1992). Q-learning operates as follows. In Equation 2.8, the optimalevaluation function value . (i) is determined in each state i by equating it to the lowestof the optimal Q-values Q*(i, u) in that state. Direct DP control is thus possible if theoptimal Q-values can be determined without the use of an explicit system model. Thisis done by making successive approximations of the optimal Q-values. Where Qj(i, u)is the approximate Q-value of taking action u in state i at time step t, the successiveapproximation update is defined as follows5:1[1 — u)] Q(i, u) + u) [c(i, u) +7Vt(st+i)1 if i = t, U = UtQt+i(i,u) =I.. Qt(i, u) otherwise(2.12)Where at(i,u), 0 < ct(Z,U) < 1, is the update rate, and where V(s+i) is the approximation of the evaluation function value of the state.sti that the system (actually) makesa transition to after action U 15 applied in the current state s. Using successive approximations of the Q-values, successive approximations of the optimal evaluation functionvalues VL (i) for each state i are determined with:V+1(i) mm Qt+i(i,U) (2.13)uEU(z)5Because direct observation of the system response is being made, each time step t corresponds to anapproximation computation stage k.Chapter 2. Q-learning 12Because the system is stochastic, and Q-learning does not employ an explicit systemmodel, St+1 cannot be known or predicted until the state transition actually occurs. Qvalue updating in Q-learning thus depends on observation of the actual response of thesystem in terms of transitions made from one state to the next, as well as the costsincurred during those transitions.By using direct experience of the response of a system to update the Q-values of adirect DP controller (Equation 2.12), and then by using these Q-values to update theapproximation of the optimal evaluation function values 14(i) for each state i (Equation2.13), Q-learning can converge to the optimal evaluation function V. and thereby determine an optimal policy (Barto et aL, 1991). This is done without the formation of anyexplicit models of the system response. Information about the stochastic behavior of thesystem is contained implicitly in the Q-values as part of the costs that are experiencedas a result of the stochastic state transitions.2.2.3 Convergence of Q-IearningWatkins (1989, et al., 1992) has proven that Q-learning will converge to the optimalQ-values necessary for optimal control under the following conditions:1. Every state i and action u€U(i) combination must be attempted an infinitenumber of times as learning progresses.2. The immediate costs c(i, u) must be bounded for each action u E U(i) in state i.3. It must be true that 0 <7 < 1.4. It must be true, for each action u E U(i) in each state i at time step t, that0< ct(i,u) <1, and that crt(i,u) —*0 as t —* 00.Chapter 2. Q-learning 135. Where k is the time step in which a Q-value Qnk(i,’u) is updated (i St u = Ut)for the kth time, it must be true that, for each action u E U(i) in state i:= 00, [nk(i,U)12<00 (2.14)2.2.4 Exploration in Q-learningBecause Q-learning relies on actual experience of state transitions and immediate costs todetermine optimal Q-values, the possible policies must be thoroughly explored to ensurethat optimal Q-values are converged to. To that end, it is required that Q-learningcontrollers have a non-zero probability of visiting every state i E S, and employing everypossible action a U(i) in that state for all time (Watkins et al., 1992). This is the firstQ-learning convergence condition given in Section 2.2.3.The exploration that is performed in Q-learning must result in a thorough search ofthe possible policies. While this exploration can be extensive, it is not strictly exhaustivebecause every possible sequence of action specifications does not have to be attempted(Barto et al., 1989, 1993). The backing-up of the evaluation function values (see Section2.1.4) facilitates thorough exploration using only transitions from one state to the next.In Q-learning, when the evaluation function approximation 4 is close to being optimal, it is not normally worthwhile to continue exploring state and action combinationswhose Q-values are large relative to other Q-values for the same state. Such state andaction combinations are unlikely to be part of the optimal policy and the probability ofexploring them should be reduced as learning progresses— although not to 0. Doing sois particularly desirable if reasonable, near optimal, control of the system is valued during the continuing learning process. Determining how frequently such state and actioncombinations should be explored is part of a fundamental problem in optimal adaptiveChapter 2. Q-learning 14control (Thrun, 1992, Barto et aL, 1991). This problem concerns balancing the exploration needed to ensure that optimal control is achieved with the desire to providereasonable control (Michie et al., 1968). In the context of Q-learning, many differentexploration strategies have been investigated (Watkins, 1989, Sutton, 1990, Barto et al.,1990a, 1990c, 1991, Kaelbling, 1990).2.2.5 One-step Q-learningThe form of Q-learning presented thus far is “one-step Q-learning” (Watkins, 1989). Itis the simplest form of Q-learning, where Q-values are updated immediately after eachaction is applied using the evaluation function value V(i) of the next state encountered.More complex forms of Q-learning are described by Watkins (1989) and mentioned brieflyin Section 2.3.6.The simple one-step form of Q-learning, as well as discretized memory-based representation of the state space (see Section 2.3.2), will be the approach to Q-learning assumedthroughout this work. The reason for this is that the transition point dynamic programming (TPDP) controller that will be described makes use of Q-learning in a complicatedway. To keep the analysis of TPDP as clear as possible, it is therefore best to use thesimplest form of Q-learning.2.3 The Characteristics of Q-learning2.3.1 Q-learning is Direct DP ControlAs explained in Section 2.2.2, Q-learning is direct DP control. Q-learning operates without the use of explicit system models. This leads to Q-learning having a number ofadvantages and disadvantages when compared to indirect DP controllers. These will bedescribed in this Section.Chapter 2. Q-learning 15Q-learning Requires Less MemoryOne significant advantage of Q-learning over indirect DP control is that, because the system model is implicitly contained in the Q-values, an explicit and separate representationof the model is not necessary (Watkins, 1989). This can greatly reduce the amount ofmemory required for a DP controller (Barto et al., 1990b)6 In stochastic control applications, for each possible action a E U(i) in each state i, indirect DP controllers must storea state transition probability pj(i, a) for each state j that may be reached as a result ofaction a. If any action made in any state has some probability of causing a transition toevery other state, the number of state transition probability values that must be storedis (Watkins, 1989):worst-case number of state transition probability values =iES uEU(i)Where jS is the number of states in the state space S. In addition to these statetransition probabilities, indirect DP controllers must store an evaluation function valuefor each state i.Q-learning must store a Q-value for each possible action a€U(i) in each state i:number of Q-values 1iES uEU(i)Q-learning must also store an evaluation function value for each state i. The worst-case difference then between the memory requirements of Q-learning and of indirect DPcontrollers consists of the difference between the worst-case number of state transitionprobabilities that must be stored for indirect DP control and the number of Q-valuesthat must be stored for Q-learning. The former is SI times larger than the latter.In practice it is unlikely that the worst-case number of indirect DP state transitionprobabilities need to be stored. Some state transitions may be impossible, or of such6Under the assumption stated in Section 2.2.1 that explicit models are always constructed usingprobability tables.Chapter 2. Q-learning 16low likelihood that they can be ignored. Even so, the best case memory requirement forindirect DP is when only one state transition is possible for each action u U(i) in eachstate i. In this deterministic case the memory requirement for indirect DP control isonly as low as that of Q-learning. Furthermore, considerable computational effort mightbe required to determine just which state transition probabilities are important for anindirect DP controller to store (Watkins, 1989).Q-learning is Computationally SimplerThe implicit models in Q-learning controllers are updated continuously as state transitions, and their associated costs, are observed. Indirect DP controllers, with their explicitmodels, do not have this feature. As a result, computational steps must be taken in suchcontrollers to ensure that the model is kept up to date (Barto et al., 1990b, 1991). “Thiscomputation is inherently complex, making adaptive methods in which the optimal controls are estimated directly7more attractive” (Sutton et al., 1992).The Task-Specificity of Implicit System ModelsWhile Q-learning controllers require less memory than indirect DP controllers becausetheir system model is contained implicitly in their Q-values, there is a disadvantage inthis approach to system modeling. The disadvantage is that, because an explicit modelis not available, that model cannot be used in all the different control tasks that a systemmay be required to perform.Most DP controllers, both indirect and direct (Q-learning), can oniy perform onecontrol task at a time. If more than one control task is required of a system, controlmust be shifted from one DP controller to another. If indirect DP control is being appliedto a system, the same system model can be used by all of the DP controllers fulfilling70f which Q-learning is one.Chapter 2. Q-learning 17the various control tasks. In contrast, the system model in Q-learning cannot be usedin different control tasks because that model is implicitly stored in the Q-values and theQ-values are specific to each control task.The value of being able to develop a general purpose system model can be questionedhowever if the extensive memory requirements of explicit models are considered. Section7.2.3 will present more analysis of the modeling issue.Off-Line Updating and Q-learningAnother disadvantage of Q-learning is that the evaluation functions of controllers employing Q-learning cannot be updated off-line (Barto et al., 1991). That is, they cannotbe updated when the controller is not controlling the system. As described by Bartoet al. (1991), off-line updating is the backing-up of costs that can be done when indirect DP control is being employed. Indirect DP controllers have explicit system modelscontaining state transition probabilities. If these transition probabilities are accurate,they can be used at any time to facilitate the backing-up of costs (see Section 2.1.4).It is not necessary to make actual observations of system state transitions to performback-ups as state transitions can effectively be simulated. By reducing the amount ofactual experience required to learn optimal policies, off-line back-ups can greatly reducethe time required to learn those policies.Because the system model in Q-learning controllers is contained implicitly in the Qvalues, distinct state transition probabilities that could be used for off-line back-ups arenot available. As a result, off-line updating cannot be performed. Updating in Q-learningcan only occur when the system is being controlled and actual state transitions are beingobserved — unless of course an explicit system model is learned in addition to the implicitmodel of Q-learning (Sutton, 1991).Chapter 2. Q-learning 182.3.2 Q-learning Can Be Implemented as Memory-Based ControlMulti-dimensional tables that associate an action u E U(i) with each state i can beused to control systems regardless of any non-linearities in those systems (Atkeson, 1989,1991). Such controllers, whose table dimensions correspond to the state dimensions ofthe system being controlled, are called memory-based controllers. They represent statesin a completely localized way (Atkeson, 1989).Q-learning controllers can be implemented as memory-based controllers that operateby associating a table entry with each state i e S. In each entry they store Q-valuesQ(i,u) that directly reflect the costs that will be incurred when each action u E U(i) isapplied in state i. Q-learning is well suited to such implementation.Like other memory-based DP controllers (Moore, 1991), memory-based Q-learningcontrollers typically handle continuous state spaces by discretizing them (Baker et al.,1992). This discretization facilitates the representation of the states in a tabular format(Barto et al., 1991).The table entries associated with each state in a DP controller will be referred to asDP elements.2.3.3 Q-learning is Reinforcement LearningQ-learning is inherently reinforcement learning (Sutton et al., 1992). Q-learning operatesby determining Q-values, and from them evaluation function values (see Section 2.2.2).Referring to Equation 2.2, the costs in this Equation that are discounted and summedare the immediate costs c(i, u). These immediate costs constitute a scalar reinforcementsignal that the controller receives as a result of the actions that it specifies the systemshould take. Aside from observable changes in the state of the system, this is all theinformation which Q-learning controllers receive.Chapter 2. Q-learning 19In contrast to reinforcement learning approaches like Q-learning, supervised learningapproaches (error propagation methods for example, Rumeihart et al., 1986) receivecomplete information about what the desired output (in the case of controllers this isthe optimal action) is at any point in time (Williams, 1986, 1987a). When such fullinformation is available it can result in faster learning than the scalar reinforcementsignal of reinforcement learning (Williams, 1987b). If desired output information is notavailable however, supervised learning approaches cannot be utilized (Anderson, 1989a).As a result, “reinforcement learning is clearly more generally applicable than supervisedlearning” (Williams, 1987b).An example of a learning application where reinforcement learning is necessary isthat of a novice snowboarder. A person in such a situation may not know exactly whatmovements he should make to stay erect, but he certainly knows when he has fallen. Inthis case the discomfort of the fall constitutes a negative reinforcement signal.The reinforcement learning capability of Q-learning facilitates the learning of optimalpolicies without any of the optimal actions being known in advance. Such learning is notpossible with supervised learning approaches.Another potential advantage of reinforcement learning approaches over supervisedlearning approaches occurs when many outputs are equally desirable in response to agiven input (in the case of controllers this is when more than one action is optimal ina given state). In such a situation reinforcement learning approaches can learn any ofthe appropriate outputs, while supervised learning approaches must learn one arbitrarilychosen desired output (Williams, 1987a, 1987b). The increased flexibility of reinforcementlearning approaches in such situations has the potential to result in faster learning.Finally, an important difference between reinforcement learning and supervised learning approaches is that supervised approaches typically follow learning gradients to determine their input to output mappings (Rumeihart et al., 1986). Reinforcement learningChapter 2. Q-learning 20approaches do not, so they must employ mechanisms which ensure full exploration ofthe mapping possibilities (Williams, 1987a)8. An advantage of not following learninggradients is that (most) reinforcement learning approaches will not prematurely settle onlocal minima (or maxima)— a problem which confronts the gradient-following supervisedlearning approaches (Rumelhart et aL, 1986).As described by Barto (1992), Q-learning is an Adaptive Critic Method. AdaptiveCritic Methods are reinforcement learning approaches that include a mechanism for anticipating future reinforcements. Some common citations made to sketch out the courseof development of Adaptive Critic Methods include: Samuel (1959), Michie et al. (1968),Widrow et al. (1973), Barto et al. (1983), Sutton (1988) and Watkins (1989).2.3.4 Q-learning Addresses the Credit Assignment ProblemThe basic idea of reinforcement learning (see Section 2.3.3) is to increase or decreasethe likelihood of a controller specifying a given action based on whether positive ornegative reinforcement normally results from that action (Mendel, 1973). The difficultylies in determining how much each action, of all those taken in the time proceeding thereinforcement, should be credited (positively or negatively) with the outcome. This isthe credit assignment problem (Minsky, 1985), and it has a structural and a temporalcomponent (Williams, 1987b).Q-learning addresses both components of the credit assignment problem by backingup evaluation function values V(i). This backing-up distributes, in reverse, credit forincurred costs, and the distribution is simultaneously structural and temporal. The statetransition probabilities pj(i, u) and the discount factor determine the amount of creditassigned to each state for each incurred cost (see Sections 2.1.1 to 2.1.4).8Exploration issues are discussed in Section 2.2.4.Chapter 2. Q-learning 212.3.5 Q-learning is Adaptive Optimal ControlQ-learning is an adaptive optimal control approach (Sutton et at., 1992, Barto et aL,1991). That Q-learning provides optimal control has been discussed throughout thisChapter (also see White et at., 1992). Q-learning is also adaptive in providing optimalcontrol because the model that Q-learning makes of a system is implicitly contained inthe Q-values (Barto et at., 1991). As the system changes, those changes will be reflectedin the updating of the Q-values. At any instant in time, a controller using Q-learningwill be converging on the optimal policy for the system that exists at that time.Indirect DP controllers do not provide adaptive optimal control as readily as Qlearning controllers. To adaptively follow system changes such controllers must continually modify their explicit system model. Depending on the exact approach taken, thismay require complete recalculation of both the model and the optimal policy based onthat model (Barto et al., 1991). This can be extremely computationally expensive if thesystem changes with any regularity.2.3.6 Q-Iearning and Temporal DifferencesQ-learning is a form of temporal difference learning (Sutton, 1988, Watkins, 1989, Dayan,1991). Temporal difference learning is notated as TD() (Sutton, 1988). To learn expected total infinite-horizon discounted costs, TD() updating of Q-values is performedas follows (derived from Watkins, 1989):Qt+i(st, ut) = [1—cvt(st, at)] Qt(st, Ut) + (2.15)at(st,Ut) (c(St+k,ut+k) + (1 —Where \, 0 A < 1, is the weighting of future experiences.In the Q-learning methods described in this work, A will be set to 0, resulting inTD(0) (Sutton, 1988). Setting A to 0 in Equation 2.15 converts it into the familiar formChapter 2. Q-learning 22of Equation 2.12. TD(O) updates the values being learned using oniy the informationobtained in the next time step, t + 1. In one application Sutton (1988) found that TD)worked best at X values around .3, but as explained by Watkins (1989) using anythingbut TD(O) to update Q-values requires considerably more computational effort.Watkins (1989) also pointed out that setting ,\ to 1 in Equation 2.15 results in Qvalue updating based simply on the sum of the actual infinite-horizon discounted costsexperienced (see Equation 2.2). TD(1) learning then does not use any evaluation function values V(i) in the Q-value updating. As the evaluation function values are onlyapproximations of the optimal evaluation function until convergence on that function iscomplete (see Section 2.1.4), TD(1) avoids the use of approximate evaluation functionvalues.2.3.T Q-learning and NoiseIn a system being controlled with a Q-learning controller, where actions u E U(i) arespecified by the controller for each state i that the system enters, there are four ways inwhich noise can affect control of the system:1. Noise can perturb the controller specification of the action u E U(i) applied to thesystem.2. Noise can perturb the response of the system to the action u e U(i).3. Noise can perturb the immediate costs c(i, u) returned to the controller as a resultof the action u that it took in state i.4. Noise can perturb controller observation of the system state so that the controllerinaccurately assesses which state the system is in.Chapter 2. Q-learning 23The first three potential effects of noise are intrinsic to the control of discrete-timestochastic dynamic systems. These noise components are what make the state transitionsstochastic (Barto et al., 1989). As such, these effects of noise are inherently handled byQ-learning. As Q-learning proceeds to learn optimal Q-values, these effects of noiseare experienced and amalgamated into the expected costs that the Q-values reflect. Aslong as a system controlled by Q-learning is controllable to the extent desired (wherecontrollability is in part affected by the first two types of noise), Q-learning will convergeon a policy that provides optimal control in those noise conditions.The fourth effect of noise is not addressed by Q-learning. Such noise affects theobservability of the system, not its stochastic response, and it cannot be incorporatedinto the implicit model of the system (Kaelbling, 1990). All controllers are vulnerableto this type of noise however, and it is fairly safe to say that, as an optimal controlapproach, Q-learning can handle it as well as any.More generally, “dynamic programming is the only exact and efficient method available to solve the problem of [cost minimization] over time in the general case where noiseand nonlinearity may be present” (Werbos, 1990).2.4 Practical DP Control2.4.1 The Curse of DimensionalityIn order to facilitate DP control, memory must be allocated to store the DP elementsnecessary for every state i E S. These memory entries constitute the table describedin Section 2.3.2. Normally the table structure is defined by the state space dimensionsand the resolution of each of those dimensions. As the number of dimensions increases,the total memory requirement increases exponentially (Barto et al., 1991, Moore, 1991).Chapter 2. Q-learning 24Bellman (1957) called this the “curse of dimensionality”. This curse haunts DP controllers when they are applied to systems that have anything but the most restricted ofstate spaces.As the number of states and their associated DP elements grow, the curse of dimensionality also increases the learning time required for DP controllers to converge tooptimal policies (Moore, 1991, Yee, 1992). This is because DP controllers must performa certain amount of computation for each state.As described, the curse of dimensionality results in extensive memory usage and protracted learning times. To make DP control practical these problems must be addressed.Sections 2.4.2 to 2.4.6 describe work done by others to that end.2.4.2 Associative Content Addressable Memories (ACAMs)Associative content addressable memories (ACAMs) can be used to significantly reducethe memory requirements of memory-based controllers (Atkeson, 1989), including DPcontrollers. ACAMs are not addressed in the manner of standard memories, but are“content-addressable in that part of the contents of the memory are used to select therest of the memory” (Atkeson et al., 1988). In the case of DP controllers, an ACAMwould use the state information to select the memory entry containing the DP elementinformation associated with that state (see Section 2.3.2).ACAMs are able to reduce the memory requirement of memory-based controllers,including DP controllers, because systems fulfilling specific control tasks may never entersignificant portions of the fully expanded state space defined by their state space dimensions (see Section 2.3.2). It may even be impossible for the system to reach some of thedefined states. As a result, if ACAMs are employed to contain the DP elements, memoryentries need only be allocated for those states that actually are encountered as the control task is fulfilled (Atkeson, 1989). Such allocation may result in far less memory beingChapter 2. Q-learning 25used than would be required if a full tabular memory is employed (see Section 2.3.2).ACAMs can be implemented using specialized hardware; although such hardware israrely used. ACAMs can also be simulated using massively parallel computers like theConnection Machine (Atkeson, 1989, Hillis, 1985). Hashing techniques can be employedto simulate ACAMs (Standish, 1980). Korf (1990) applied hashing in a heuristic searchalgorithm that operated basically like an indirect DP controller. Neural networks canalso be designed to operate as ACAMs, as will be described in Sections 6.1.2 and 6.1.3.2.4.3 Amalgamation of StatesWithin a state space S there may be groups of neighboring states that are all relativelyunimportant in terms of the fulfillment of the required control task (Moore, 1991). Thatis, these states may be ones that the system is not likely to enter during the normalcourse of task fulfillment and, as a result, the evaluation function values V(i) associatedwith them need not be exact. If inaccuracies in the evaluation function are tolerable, it ispossible to amalgamate groups of such states together together and treat each group likea single state for the purposes of DP control. Such amalgamation reduces the memoryrequired for DP control (Chapman et al., 1991, Moore, 1991) and increases the rate oflearning (Singh, 1992a). The rate of learning is increased because it can progress in allof the amalgamated states simultaneously instead of just one state at a time (Chapmanet at., 1991).There are two difficulties to such an approach (Buckland et at., 1993). One is determining which states are of limited enough importance to be included in an amalgamationwithout optimal control being seriously affected. The other difficulty is ensuring that theamalgamated states each have similar enough evaluation function values T/(i) that thesharing of the same value across all of them will not distort the overall evaluation function V. (Anderson, 1989a) in a way that significantly perturbs the learning of the optimalChapter 2. Q-learning 26policy throughout the entire state space (Buckland et at., 1993, Moore, 1991). Determining which states can prudently be amalgamated, and then determining appropriateamalgamation groupings requires computational effort beyond that necessary to learnoptimal policies. The extent of that extra effort must be weighed against the benefits ofamalgamation.Moore (1991) used “trie” data structures (Omohundro, 1987) to facilitate variableresolution in the evaluation function used by an indirect DP controller. The resolutionof the evaluation function was made finer in states that were projected as being closerto the optimal trajectory of the system concerned. Such variable resolution effectivelyamalgamates the states that are coarsely resolved.Chapman et at. (1991) recursively split the state space of a system that was represented by many binary dimensions. This splitting resulted in a binary tree of Q-values.Each node in this tree which was not split down to an irreducible size was effectivelyan amalgamation of the unsplit states it contained. The splitting decisions were basedon statistics gathered regarding differences in the reinforcements received within unsplitstates.Mahadevan et al. (1992) investigated two techniques that involved state amalgamation. In one, all of the Q-values within a fixed Hamming distance of the current statewere updated, resulting in an amalgamation effect. In the other, statistical clustering wasused to flexibly group states in close proximity to each other that had similar Q-values.Buckland et at. (1993) investigated a number of Q-learning approaches that usedstatistics gathered regarding differences in reinforcements received. These statistics facilitated both the splitting and combining of state amalgamations. These investigationswere of limited success because the computational effort required to make the amalgamations was often extensive and it disturbed the computations being made to learn theoptimal actions (Buckland et at., 1993).Chapter 2. Q-learning 272.4.4 Approximation of Evaluation FunctionsUp to this point the evaluation function V, of DP controllers has been described asconsisting of separate evaluation function values V,(i) for each state i. The curse ofdimensionality makes such discrete representation of the evaluation function impracticalfor state spaces of any size however (see Section 2.4.1), so the evaluation function isnormally parametrically approximated in some way (Barto et al., 1989, 1990c, Anderson,1989a, 1989b, 1993, Werbos, 1990, Chinchuan et at., 1990, Yee, 1992, White et at., 1992,Thrun et at., 1993). In addition to reducing memory usage, such parametric approximation facilitates generalization between the states (Barto et al., 1989), thereby increasingthe rate of learning.Many different approaches to evaluation function parametric approximation have beentaken. Barto et at. (1989) have formulated how to generally apply TD methods (Sutton,1988) to evaluation function parametric approximation. This formulation includes approximation with supervised learning neural networks (Barto et at., 1989). Werbos (1990)has done work utilizing such supervised learning in DP controllers. Watkins (1989) andTham et al. (1993) used the CMAC neural model (Albus, 1975a, 1975b) for evaluationfunction and Q-value parametric approximation.The evaluation function can also be approximated in other ways. Yee (1992) has usedhierarchical binary trees and hierarchical neural networks to that end.The main drawback of evaluation function approximation approaches is that the approximating mechanisms themselves typically require extensive computational effort todevelop a reasonably accurate approximation. Further, evaluation function approximation often invalidates any proofs that have been made of the convergence of a DP controller to an optimal policy (Barto et at., 1991).Chapter 2. Q-learning 28Evaluation function approximation is made even more difficult in Q-learning controllers because of the fact that Q-values are a function of action as well as state, andan additional action dimension must be included in any approximation that representsQ-values2.4.5 Prioritized ExplorationAnother way to increase the rate of learning in DP controllers is to explore the statespace in ways that concentrate the learning effort on those states where learning willproduce the best results. As exploration is essential to Q-learning, many researchershave developed highly effective exploration strategies (see Section 2.2.4). In addition tothis work however, much work has been done that focuses on exploration in DP controllersin general.Sutton (1990), Kaelbling (1990) and Yee (1992) have done work on associating accuracy information with the costs being backed-up to indicate which back-ups are themost likely to enhance learning. Along similar lines, Peng et aL (1992) and Moore et al.(1993) have used the size of the change that occurs with each cost update to prioritizethe future exploration of the state that update was associated with. The largest changesin cost were given the highest priority.2.4.6 Replaying ExperiencesIf real-time computation constraints make it possible, the rate of learning in DP controllers can be increased by replaying past experiences and making new updates basedon those experiences (Lin, 1991a, 1991b). This approach requires that the details of pastexperiences be stored somehow.Chapter 3Transition Point Dynamic Programming (TPDP)This Chapter will present a full description of transition point dynamic programming(TPDP), including proof of its convergence to optimal control policies. The descriptionwill rely on the explanation of DP and Q-learning presented in Chapter 2, but thisChapter will not be a direct extension of Chapter 2. Instead the background informationof Chapter 2 will be utilized as required to explain TPDP.3.1 General Description of TPDP3.1.1 InspirationTransition point dynamic programming (TPDP) was developed as a solution to the extensive memory use and protracted learning time that results from the “curse of dimensionality” in DP controllers (see Section 2.4.1). The TPDP concept arose out of anapproach (Buckland et al., 1993) to direct DP where artificial neurons were connectedin a network, with each neuron operating as a separate DP element (see Section 2.3.2)—in fact, each neuron operated as an ACAM (see Section 2.4.2). Various techniques wereemployed that modified the neural connections so that groups of neighboring states inthe state space that had similar control requirements could be amalgamated (see Section2.4.3). Such groups had to have evaluation function values V(i) for each state i thatwere very similar, as well as the same optimal action (Buckland et at., 1993).These attempts at neural network amalgamation of neighboring states where largely29Chapter 3. Transition Point Dynamic Programming (TPDP) 30unsuccessful because the computational effort required to make the amalgamations wasoften extensive and it disturbed the computations being made to learn the optimal actions(Buckland et aL, 1993). This research lead to the realization however that individualstates on the boundaries of a state space region could be used to specify the actionemployed throughout that region. That is, if there existed uniform regions of states thatall had the same optimal action (or the same set of optimal actions1), that action couldbe specified at the states on the boundary of a uniform region and then be maintainedas long as the system remained within that uniform region. So at every boundary statewhere a uniform region might stochastically be entered from states outside that region,a new action is specified. The action thus specified is then maintained until the uniformregion is left and another uniform region is entered (where another set of boundary statesspecifies the next action). As the system moves about each uniform region, it may passthrough dormant states that are not on the boundary. These dormant states make nochange in the action specification, they simply leave it the same. They have no DPelements associated with them at all (see Section 2.3.2). This is the fundamental ideabehind TPDP.States that are neither boundary states nor dormant states, are external states. As willbe explained in Section 3.2.4, external states are not entered during system movement.Figure 3.1 illustrates the TPDP concept when movement control of a “car” on a onedimensional track is desired. The car, with some initial positive velocity to the right,must pass Position A and return to the left. The “transition points” (see Section 3.2.1)in Figure 3.1 (represented by boxes) are located at all boundary states. The shadedregions indicate all of the states that the system can possibly move through given theactions specified at the boundary states and the stochastic response of the car. Shaded‘The simplifying assumption that there is only one optimal action in each uniform region will generallybe made throughout this work. The existence of more than one optimal action does not alter theoperation of TPDP in any way.Chapter 3. Transition Point Dynamic Programming (TPDP)states without transition points are therefore dormant states. Uniform regions consist ofadjacent boundary states where the same action is specified, as well as the shaded regionthrough which that action is maintained before another boundary is encountered (seeFigure 3.1). Boundary states that do not seem to be on the main state transition routes(the one identified in Figure 3.1 for example) ensure that any stochastic deviations fromthose routes are realigned. Unshaded states are external states.3.1.2 The Shape of the Uniform Region BoundariesThe boundaries that are constructed in TPDP are multi-dimensional surfaces that mayrange from smooth to highly irregular. The boundaries surfaces need not be closed, asboundaries only have to exist at states where a uniform region can possibly be enteredfrom states outside that region.+31DormantStateUniformRegionBoundaryStateAPositionThe control task shown on this phase plane is to startwith a positive velocity, pass position A, and return.Each II is a transition point (TP).Figure 3.1: Stylized Application of TPDP to a Phase Plane Control TaskChapter 3. Transition Point Dynamic Programming (TPDP) 323.1.3 The Benefits of TPDPThe main benefit of the TPDP approach is that, where uniform regions exist, they canbe represented by a relatively small number of DP elements (depending on the shapeof the boundaries and the size of the uniform regions they encompass). This reductionin memory usage results in an accompanying reduction in the learning time requiredto learn optimal policies (see Section 2.4.3 and Chapman et al., 1991). Further, thisreduction in memory usage is accomplished without having to expend any computationaleffort determining how states should be amalgamated (Buckland et al., 1993). As willbe explained in Sections 3.3.2 through 3.3.11, the determination of whether a state isa boundary state or a dormant state can be made as part of the computations thatdetermine the optimal actions for each state.Another benefit of TPDP is that the boundary states learn optimal actions independently of each other, and these determinations are made with the same fine resolutionthat they would be if the state space was fully represented with a DP element associatedwith every single state. In contrast, approaches that amalgamate states to achieve memory and learning time reductions result in coarse resolutions in some regions of the statespace. These coarsely resolved regions can perturb the learning of the optimal policylocally and throughout the entire state space (see Section 2.4.3).In general, TPDP learns optimal action change specifications where those specifications can be accurately placed in highly resolved state spaces. By learning only actionchanges, TPDP is able to “compress” the necessary action specifications and minimizeits use of memory. Further, the time constant of those action change specifications (thetemporal separation between them as the system moves through the state space) canvary in an unlimited way throughout the state space (as long as it is greater than thestate space resolution).Chapter 3. Transition Point Dynamic Programming (TPDP) 333.1.4 TPDP and InertiaTPDP is best suited to control applications where reasonably sized uniform regions exist.Such regions are most likely to be found in continuous control applications where thesystem exhibits inertia. In such applications, the inertia of the system limits the effectsof any actions instantaneously specified by individual states if those action specificationsare different from the specifications of neighboring states. In other words, no single statecan make much of a difference in terms of controlling (optimally or otherwise) the system.As a result, optimal actions are initiated and maintained over relatively long periods oftime as the system moves through many states.Continuous dynamic systems with inertia are thus ideally suited to TPDP as it wasdescribed in Section 3.1.3. Applied to such systems, TPDP can determine the finelyresolved uniform region boundaries necessary for optimal control, and it can learn theoptimal actions for the boundary states. Some examples of systems to which TPDPcontrol could be successfully applied are robot manipulators, chemical process control,hydraulic system control and flight control.TPDP is not well suited to control applications that exhibit little or no system inertia,and that do not have large uniform regions. Much DP work has been concerned withoptimal decision-making (backgammon playing is one example, Tesauro, 1991). Theactions taken in decision-making tasks can drastically alter the state of the system, andit is not commonplace in such tasks for the optimal action to be the same for more thanone time step. As a result, uniform regions in such control applications are ill-defined.In general terms, TPDP performs best on continuous state space stochastic controlapplications (that have been discretized as described in Section 2.3.2) and performs poorlyon inherently discrete control applications — applications that are better described asdecision tasks. Kaelbling (1990) makes a distinction similar to this between “statisticalChapter 3. Transition Point Dynamic Programming (TPDP) 34learning” (the former), and “symbolic learning” (the latter).3.2 The Goal of TPDP3.2.1 Transition Points (TPs)A transition point (TP) is simply the association of an action (a TP action, 1LTP E U(i))with a state. A state i with such an association is called a TP state2. More than oneTP may be associated with each state, but one TP at each state must be chosen as thepolicy TP (z(i) = uTp(i), the action of the policy TP). Dynamic systems can be entirelycontrolled using TPs.Not all of the states in a system being controlled with TPs need have TPs associatedwith them. Those which do not are called non-TP states2. Controllers that employ TPsmust maintain the last action specified by a TP when the system moves through non-TPstates.Both Barto et al. (1991) and Watkins (1989) have suggested something similar toTPs, but neither fully investigated the concept.3.2.2 EnvironmentsA system being controlled by TPs is defined as having an environment in which it operates. Environments include:1. The system under consideration.2. A set of starting states Ss C S that the control task may start from, as well as theprobability of it being started from each of these states and the specification of theactions to be taken in those states. The starting states Ss must include at leastone state (Ss 0), and may include the entire state space S (S = 5).2These state definitions will be made somewhat more exclusive in Section 3.2.3.Chapter 3. Transition Point Dynamic Programming (TPDP) 353. A set of absorbing states SA C S at which the control task terminates. There neednot be any absorbing states— SA can be an empty set (SA = 0). When an absorbingstate is encountered the task is restarted at one of the starting states Ss accordingto the starting probabilities.4. A mechanism that ensures that each of the TP actions UTP U(i) at each TPstate i has some finite non-zero probability of being employed each time state i isentered by the system.5. A mechanism that ensures that each of the starting states Ss is revisited with somefinite non-zero probability as the control task continues to run. Such a mechanismis necessary if states exist from which the system cannot reach all of the startingstates Ss through some sequence of TP actions3,and from which the system cannotreach any of the absorbing states 5A through some sequence of TP actions. Suchstates represent regions in the state space that can be entered but not left. Themechanism required when such states exist is one that, after some large number ofstate transitions, terminates the control task and restarts from one of the startingstates S.Barto et. al. (1991) describe something like an environment, although not to thislevel of specificity.3.2.3 Closed State SpacesGiven a valid environment and a set of TPs, if all of the starting states Ss are continuallyrevisited with some finite non-zero probability, and if state transitions are made from thestarting states Ss using the TP actions (with each TP action TP E U(i) being employedin each state i with some finite non-zero probability), then eventually every state that3The sequence of actions to each starting state may include passage through other starting states.Chapter 3. Transition Point Dynamic Programming (TPDP) 36can be reached from the starting states Ss using the TP actions will be reached. This setof states is called the closed state space Sc C S. The closed state space Sc includes allthe starting states Ss (Ss C Sc C 5) and any states that the system can reach through asequence of transitions from the starting states. In Figure 3.1 the shaded states representa closed state space.In a valid environment, the closed state space Sc is defined entirely by the startingstates Ss, the existing set of TPs, and the system state transition probabilities (seeSection 2.1.1). The closed state space Sc can include all of the states S. States not inthe closed state space Sc are the external states SE C S (SE fl Sc = 0). External statesare never visited.The TP states SB are properly defined as being states within the closed state spaceSc that have TPs (SB C Sc C S)4. Similarly, the non-TP states SD are properly definedas being states within the closed state space Sc that do not have TPs (SD C Sc C S)4.TPs associated with external states are called ineffectual TPs because they cannotaffect the movement of the system — being outside the closed state space Sc the systemnever reaches them.The relationships between these various state sets are as follows:S—SCUSE, ScflSE—O (3.16)SC=SBUSD, SBflSD—O (3.17)That is, none of the state sets SB, SD, SE overlap, and together they constitute the entirestate space S.4The containment of these states within Sc was not part of the Section 3.2.1 state definitions.Chapter 3. Transition Point Dynamic Programming (TPDP) 373.2.4 Optimal Closed State SpacesIf an optimal control policy has been determined that specifies an optimal actionit*(i) for each state i, those optimal actions can be used to completely define a set ofTPs. That is, one action (the optimal action) can be made the TP action in each statei (up(i) = 1*()). If the TPs so defined are employed in a valid environment that hasstarting states Ss and absorbing states SA, a closed state space Sc. C S will result thatis the optimal closed state space. This optimal closed state space includes all of the statesthat can be visited if the optimal actions (the TP actions) are employed after the systemmoves from one of the starting states S5.Any states outside of the optimal closed state space S. are optimal external statesSE* C S (Sc. fl SE. 0) which are never visited if the optimal control policy p isfollowed. As a result, using memory to represent these optimal external states in anyway is unnecessary for the purposes of optimal control.3.2.5 Minimal TP Optimal ControlConsider a state i in the optimal closed state space Sj. defined by an optimal set of TPs.The entry actions Ue() of that state consist of all the actions that have some non-zeroprobability of leading to a transition from some other state in Sc to state i5. If everyentry action u E Ue() is one of the possible actions in state i (Ue() C U(i))6 and is alsoan optimal action if applied in state i, then no TP is necessary at state i. That is, if noaction is specified at state i, actions specified previously and maintained through thatstate (see Section 3.2.1) will be optimal.This is again the fundamental idea behind TPDP: that optimal control can be facilitated even in states that specify no actions as long as all the actions that cause a5This definition of entry actions U€(i) also holds in closed state spaces that are not optimal.6This assumption shall be used throughout this work.Chapter 3. Transition Point Dynamic Programming (TPDP) 38transition to such states, if maintained, are optimal actions for those states.In an optimal closed state space S0. defined by an optimal set of TPs, the optimalTP states are SB. C Scm. If every unnecessary TP is removed from Sc., a minimal set ofTPs results, and minimal TP optimal control has been achieved. The optimal TP statesin this case are the boundary states SB C Sc. (see Section 3.1.1). The boundary statesSBO are thus a special case of the optimal TP states SB when optimal control has beenachieved with a minimal set of TPs.Correspondingly, in an optimal closed state space S0. defined by an optimal set ofTPs, the optimal non.-TP states are SD* C Scm. When minimal TP optimal control hasbeen achieved the non-TP states are the dormant states SDO C Sc (see Section 3.1.1), aspecial case of SD.. As explained in Section 3.1.1, the TPs at the boundary states ensurethat optimal actions are employed throughout the uniform regions that these dormantstates reside in.3.2.6 Summary of the TPDP State SetsThis Section will summarize the various state sets involved in TPDP and describe howthey change when minimal TP optimal control is achieved. Given an arbitrary set ofTPs, the various state sets are:Sc Closed state space that results from the environment and the existing TPs.SB TP states in the closed state space Sc.SD Non-TP states in the closed state space ScSE External states outside S0 (which may or may not have TPs).Chapter 3. Transition Point Dynamic Programming (TPDP) 39As described in Section 3.2.5, minimal TP optimal control involves these state sets:Sc* Optimal closed state space that results from the environment and the TPs.SBO Boundary states in the optimal closed state space Sc that have the minimalTPs required for optimal control (special case of the optimal TP states SB*).SDO Dormant states in the optimal closed state space Sc that do not requireTPs (special case of the optimal non-TP states SD*).SE* External states outside Sc (which may or may not have TPs).In terms of the various state sets, the following is achieved when the set of TPs istransformed into a minimal TP optimal control set:1. The closed state space Sc defined by the environment and the TPs becomes anoptimal closed state space Scm.2. The set of TP states SB C Sc becomes the set of boundary states SB C Sc.3. The set of non-TP states SD C Sc becomes the set of dormant states SD C Sc,leaving no unnecessary TPs within Sc*.4. The set of external states SE becomes the set of optimal external states SE*.3.3 The Specific Operation of TPDP3.3.1 Pursuing Minimal TP Optimal ControlThe goal of TPDP is to achieve minimal TP optimal control for any system operatingwithin a valid environment. TPDP pursues this goal by performing two main tasks:1. Locating the set of boundary states SBO that require TPs for optimal control.2. Determining an optimal TP action for each boundary state i E SB•Chapter 3. Transition Point Dynamic Programming (TPDP) 40In other words, the right TPs must be found for the right states. Given an arbitraryset of initial TPs, TPDP modifies that set so that it is transformed into a minimal TPoptimal control set. Modifications can include the addition and removal of TPs, and theswapping of one TP for another (each specifying different actions) at the same state i.These modifications are performed one at a time in arbitrary order, and can continueindefinitely.During the modification process at most one TP is associated with each state at anygiven time. Although Section 3.2.1 stated that more than one TP could be associatedwith each state, the single TP restriction is part of the formal definition of TPDP— itwill be relaxed in Chapter 4. This restriction implies that the one TP at each state willalways be the policy TP ((i) = wrp(i) for each TP state i).Clearly a sequence of TP modifications can be specified that will transform any initialset of TPs into a minimal TP optimal control set. For example, if the minimal TP optimalcontrol set were known, TPs could be added to any states that were known to requirethem, and then all of the TPs could be swapped for ones specifying optimal actions.The difficulty is that the minimal TP optimal control set is not likely to be known. Thepurpose of TPDP then is to discover it. Sections 3.3.2 through 3.3.11 will explain howTPDP does so, and provide proof that TPDP will always achieve minimal TP optimalcontrol.3.3.2 TPDP is a Form of Q-learningQ-learning (see Chapter 2) cannot be directly applied to learn optimal policies in controllers that use TPs. The irregularities caused by the addition, swapping, and removalof TPs would perturb any convergence of Q-learning to an optimal policy. Q-learningis used as a component process of TPDP however. Specifically, Q-learning can be employed to learn the Q-values (see Section 2.1.3) of TPs (see Section 3.3.3). Q-values canChapter 3. Transition Point Dynamic Programming (TPDP) 41be associated with TPs to indicate the costs that result when the actions specified bythose TPs are employed. In TPDP, Q-values are utilized in this manner to determinethe relative merits of different TPs and to thereby indicate prudent modifications to theexisting set of TPs. Sections 3.3.3 through 3.3.11 will describe fully how this is done.Because TPDP employs Q-values, and because it updates these values using a variation of the Q-value updating Equation 2.12 (see Section 3.3.3), it is a form of Q-learning(see Chapter 2). TPDP is not strictly Q-learning however, because it does not associatea Q-value Q(i, u) with each possible action u E U(i) in each state i. Instead it associatesQ-values only with the state and action combinations defined by the existing set of TPs.3.3.3 Determining the Q-values of TPsGiven an arbitrary set of TPs, and the closed state space Sc that they define in avalid environment, Q-learning can be employed to determine the Q-values of each TPin Sc. That is, Q-learning can be used to determine the exact expected infinite-horizondiscounted cost (see Section 2.1.2) that results from the action that each TP specifies.This is done by performing Q-learning while using only the actions specified by theexisting set of TPs to control the system. Unlike conventional applications of Q-learning,the different actions possible in each state are not randomly attempted. Only the TPactions are used.As a result, the set of possible actions in each state i is effectively reduced to U(i) ={uTp(i)}. In this context only one action is possible in each state, so it must be the“optimal” action (u*(i) = uTp(i) for each state i). The motivation behind using Qlearning in TPDP is therefore to have it converge to the exact Q-values for each actionspecified by the TPs in the course of trivially deciding that these actions are the “optimal”ones (see Chapter 2).Used in this manner, Q-learning cannot learn overall optimal policies. As explainedChapter 3. Transition Point Dynamic Programming (TPDP) 42in Section 3.3.2 however, it is not intended that Q—learning do so in TPDP. Instead it isused as a component process of TPDP.Determining TP Q-values in this manner has two significant ramifications regardingineffectual TPs and non-TP states respectively. Ineffectual TPs, like all of the externalstates SE, will not be included in the Q-learning process. As defined by the existing set ofTPs and the environment, the reduced Markov decision problem addressed by Q-learningin the manner described simply does not reach these external states (whether they haveineffectual TPs or not). As a result, these states do not in any way affect the Q-valuesof the TPs at TP states in ScThe second ramification is that because non-TP states SD do not have any TPs or TPactions, there are no Q-values that can be determined for these states. Unlike the externalstates however, the non-TP states do affect the Q-values being determined for TPs atthe TP states. This is because actions specified in the TP states and then maintainedthrough non-TP states result in overall state transition probabilities to other TP statesthat are a result of the state transition probabilities of the intermediate non-TP states.But regardless of the various routes that the system may stochastically take from one TPstate, through non-TP states, to another TP state, the overall probability of each suchtransition is fixed. If Q-learning is employed, all of the fixed overall TP state transitionprobabilities will be included in the various TP Q-values as direct observation of the statetransitions and associated costs are made (see Section 2.2.2). As a result, the Q-valueupdating Equation 2.12 need only be modified to bypass the non-TP states:d—1[1— at(i,u)]Qt(i,u) + at(i,u) {(7mc(s÷u)) +Vt(st+d)]Qt+d(i,U)= if i = St,IL UtQt(i,U) otherwise(3.18)Chapter 3. Transition Point Dynamic Programming (TPDP) 43Where d is the number of time steps after a TP state is left before another TP state isencountered. During that time the system moves through non-TP states. If Equation3.18 is used for Q-learning, the non-TP states will be treated as inherent parts of thedynamic response of the system. The difference between Equation 2.12 and Equation3.18 is the replacement of:d—1c(i, u) + 7Vt(st+i) with ( 7’c(st, u)) +7dVt(st+d)If d is set to 1, reflecting the case where a transition is made directly from one TP state toanother with no intervening non-TP states, these Equations are the same. The Q-valueupdating Equation 2.12 is thus a special case of the TP updating Equation 3.18.Proof of the convergence of Q-learning using the TP updating Equation 3.18 is presented in Appendix A. This proof is based on the work of Jaakkola et al. (1993).3.3.4 Determining the Evaluation Function Values of TP StatesAs explained in Section 3.3.1, at most one TP is associated with each state during theoperation of TPDP. As a result, the evaluation function value (i) of each TP statei E SB C Sc is (based on Equation 2.6):V(i) = Q(i, (i)) (3.19)Where j(i) is the one TP action uTp(i) at each state i.Further, the following relationship can be established between the evaluation functionvalues of two TP states:V(i) = Q(i,1L(i)) = C(i, {j}) + i(i,j)V(j) (3.20)Where the excluded cost C,(i, {j}) is the expected cost of all state space transitions fromstate i that can occur with policy (see Equation 2.2)— excluding those after state j hasChapter 3. Transition Point Dynamic Programming (TPDP) 44been encountered, and (i,j) is the participation factor of V(j) in 1’(i):II—i(i,j) = II 7pxk+l(xk,(i)) (3.21)rEX(i,j) k=OWhere X(i,j) is the set of all possible state transition routes £ from state i to state j( E X(i,j)), = [so, x1, ..., x,] is one possible state transition route from state ito statej of variable length n (S = i,x = j), and is the number of states along each suchroute.The participation factor Equation 3.21 represents the summed overall probability ofeach state transition route that can be taken from state i to state j, attenuated by thediscount factor -y at each transition step. Participation factors will always fall between0 and 1 (0 i,j) < 1). The maximum value of participation factors is 1 becauseEquation 3.21 has its maximum value when y is 1. When y is 1, Equation 3.21 becomesthe overall probability of eventually making a transition from state i to state j. Thisprobability is between 0 and 1. The minimum value of participation factors is 0 becauseall the elements of Equation 3.21 are positive.Returning to Equation 3.20, the term (i,j)V(j) then represents the portion of theexpected total infinite-horizon discounted cost (see Section 2.1.2) at state i that occursafter the system has made a transition to state j. The evaluation function value V(j) ofstate j reflects those costs, and the participation factor ensures that they are discountedappropriately for inclusion in the determination of V(i).Equation 3.20 can be readily extended to include an arbitrary number of states:V(i) = C(i, J) + i,j)V(j) (3.22)jEJWhere J C S is the arbitrarily chosen set of states for which explicit representation isdesired in the evaluation function value computation of state i.Chapter 3. Transition Point Dynamic Programming (TPDP) 453.3.5 Swapping TPsBy making use of TP Q-values (determined as described in Section 3.3.3), the swappingof TPs can be facilitated. TP swapping occurs when a TP associated with a state isexchanged for another TP, specifying a different action, at that same state.Swapping Rule: TP swapping is performed by using Q-learning to determine the Q-value of an existing TP. Then that TP is replaced with a trial TPand Q-learning is used to determine the Q-value of the trial TP. If the Q-valueof the trial TP is less than that of the old TP, a swap is made. Otherwise theswap attempt is aborted.Theorem 3.1: When a TP swap is made according to the Swapping Rulein a valid environment, the evaluation function values (i) of all of the TPstates i E SB C Sc will be monotonically reduced7.Proof: When a TP swap is made, the evaluation function value V,(j) of theswapped TP at state j is reduced as a direct result of the Swapping Rule andof the fact that T4(j)=Q(j,p(j)) where (j) is the one TP action uTp(j) atstate j (Equation 3.19).The evaluation function values (i) of all the other states i can be expressed in terms of V(j) (Equation 3.20):V(i) = C(i, {j}) + (i,j)V(j)Where neither C,(i, {j}) nor i,j) will be altered as a result of the TPswap at state j. This is because both of these values are the result of costsand state transitions probabilities encountered before a transition is made tostate j (see Section 3.3.4).7”Monotonically reduced” will mean “reduced or kept the same” throughout this work.Chapter 3. Transition Point Dynamic Programming (TPDP) 46As a result, when V(j) is reduced, the evaluation function values V,(i) ofall the other states i E SB C Sc will also be monotonically reduced. DIt is possible to attempt TP swaps at more than one state concurrently. Such concurrent swapping may be necessary to break up deadlocks which can occur if the costsexperienced by more than one TP are dependent on each other, and none of the TPs canmake a beneficial swap until the others have done so. To make concurrent swap attemptsthe Swapping Rule is applied simultaneously at each of the states concerned.Theorem 3.2: Any and all TP swaps that are made according to a concurrent, but individual application of the Swapping Rule in a valid environmentwill result in the evaluation function values V,(i) of all of the TP statesi e SB C Sc being monotonically reduced.Proof: Equation 3.22 can be readily expanded to:= C(i, (SG U SG’)) + + r(i,j)V(j) (3.23)jESG JESGIWhere SG (the aborted swap states) is the set of TP states for which a TP swapwas attempted but aborted (SG C SB), and SG’ (the accepted swap states)is the set of TP states for which a TP swap was attempted and accepted(SG’ C SB; SG fl SG’ = 0).Equation 3.23 indicates the evaluation function value V,(i) of each TPstate i before any TP swaps were made. When the complete set of TP swapswas attempted, the evaluation function value of each TP state i was thefollowing:i(i,j)Vi(j)+ > q(i,j)V(j) (3.24)jESa jESG,Chapter 3. Transition Point Dynamic Programming (TPDP) 47Where ‘ indicates that the attempted TP swaps altered existing policy ,u.Now to prove that the evaluation function values of all the TP states iare monotonically reduced if a subset of the attempted TP swaps is accepted(based on individual application of the Swapping Rule to each swap attempt),an iterative approach will be employed determine what the resultant evaluation function values are — based on the evaluation function values before theswap attempt and on those determined during the swap attempt8. The iterative calculation will concern only the evaluation function values V(i) of thestates i for which a TP swap was attempted (i E (SG U SG’)). The iterativecalculations are defined as follows:C(i, (SG U SG’)) + (i,j)Vk(j) + i,j)V(j) V i ST/ — jES0 jESGIVk+11i) —> i(i,j)Vk(j)+ r’(i,j)V(j)jESG jES0,(3.25)The combined usage of C(i, (SG U SG’)), C,i(i, (SG U SG’)), p(,j) andi,j) values in Equation 3.25 is valid because of the nature of the composition of the evaluation function values. The state space transitions occurringfrom aborted swap states i E SG will be the same up to the point where theyencounter other states j (SG U SG’) where swaps were attempted. Thus, forthose states (i e SG), C,(i, (SG U SG’)) and (i,j) will remain the same.Similarly, after a subset of the swaps are accepted, the state space transitions occurring from accepted swap states i E SGI will be the same as thoseexperienced during the concurrent swap attempt up to the point where theyencounter other states j E (SG U 50’) where swaps were attempted. Thus, forthose states (i E SG’), C,i(i, (S0 U SG’)) and iW(i,j) will remain the same.8This iterative approach does not represent a form of Q-learning, as it has nothing directly to dowith the learning of an optimal policy. It is employed simply in the context of this proof.Chapter 3. Transition Point Dynamic Programming (TPDP) 48Equation 3.25 is therefore valid for use in iterative calculations. The initialvalues for these calculations are:I1’(i) ViESGVo(i) = (3.26)j V(i) ViESG’Now since 1’(i) for states i S is based on evaluation function valuesV(j) for states j SG’ (see Equation 3.23) that existed before TP swapswere made at those states (j SG’), and since the initial iterative value Vo(j)is lower than V(j) for each state j E SG’, Vi(i) for all states i E SG is sureto be lower than Vo(i) (Vo(i) = V(i)). Similar analysis can be made to showthat Vi(i) for all states i SG’ is sure to be lower than Vo(i) (Vo(i) = V(i)).In turn, since V1(i) is lower than Vo(i) for all states i E (SG U SG’), V2(i)will be lower than V1 (i) for all states i. Each iterative cycle k will likewiseresult in a monotonic reduction of Vk(i) for each state i E (SGUSGI). None ofthe iterative evaluation function values Vk(i) will ever go below 0 because allof the elements of Equation 3.25 are positive. Therefore, Vk(i) for each statei E (SG U SG’) will converge on some positive value below Vo(i). The exactvalue each converges on is not important — what is important is that it is lowerthan (i) for all states i E SG, and lower than V’(i) (with V(i) < V(i))for all states i E SG’.Because the evaluation function values for all states i E (SG U SG’) will bemonotonically reduced when a subset of the swaps are accepted, and becauseevaluation function values at all TP states where a swap was not attemptedwill also be monotonically reduced (based on a simple application of Equation3.22 to those states), the evaluation function values for all TP states will bemonotonically reduced. 0Chapter 3. Transition Point Dynamic Programming (TPDP) 49Swapping can change the set of states in the closed state space Sc. That is, TPswaps can change the shape of Sc. The actions specified by swapped TPs may directthe system into states that had previously been external to Sc, and it may also directthe system entirely away from states that had previously been part of Sc• Such changesare desired if an optimal policy is sought. They result when new state transition routesoutside Sc have been found that result in lower costs than those available inside Sc. Theswapping of TPs makes these lower cost routes permanently available.Changes in the closed state space Sc do not affect the validity of Theorems 3.1 and3.2 because ineffectual TPs associated with states outside Sc (see Section 3.2.3) do nothave Q-values. These states are never reached with the existing set of TPs, and it is thusmeaningless to associate any sort of costs with them. As a result, if TPs are included in(or excluded from) Sc as the result of TP swapping, nothing can really be said aboutthe changes in their Q-values.3.3.6 The Result of Continued TP SwappingUsing the Swapping Rule, swaps can be concurrently attempted at one or more TPstates in Sc (see Section 3.3.5). Each time swaps are made the evaluation function valueV(i) (Equation 3.19: V(i) = Q(i, (i))) of each TP state i E Sc will be monotonicallyreduced. Consecutive TPs swaps therefore result in a continuous monotonic reduction ofthe evaluation function values of the TP states within Sc•Theorem 3.3: Given any initial set of states with TPs, including TP statesSB C Sc and external states SE with ineffectual TPs, successive randomlyattempted concurrent TP swaps at those states, each made according to theSwapping Rule in a valid environment, will result in a TP action being associated with every TP state in the closed state space Sc that has the lowest™Chapter 3. Transition Point Dynamic Programming (TPDP) 50expected total infinite-horizon discounted cost possible at that state (for thatset of states with TPs).Intuitively Theorem 3.3 seems valid because if monotonic reduction of the evaluationfunction values continues for all of the TP states in Sc (according to Theorem 3.2),eventually the lowest possible value will be reached for each TP state. The issue is madecomplicated however by the fact that TP swaps can change the states included in Sc(see Section 3.3.5). Thus it has to be proven that, regardless of such changes to Sc, thelowest cost action will be determined for every TP state in some sort of lowest cost Sc.Basically, Sc has to be “anchored” in some way.Proof: There are three cases to be considered for each of the starting statesi E Ss from which state transitions begin:1. The starting state i E Ss is a TP state.2. The starting state i Ss is not a TP state, but the actions specified inthis state (defined as part of the environment) result in stochastic statetransitions that lead to one or more TP states.3. The starting state i e Ss is not a TP state, and the actions specified inthis state (defined as part of the environment) do not result in stochasticstate transitions that lead to any TP states.In the third case actions are never specified by TPs as the system movesthrough the state space, so there is no way to swap TPs. The claims madein Theorem 3.3 are therefore moot.In the first and second case the system will either start at a TP state,or a TP state will be reached after a number of state transitions from thestarting state i E Ss. In the second case the costs experienced before a TPChapter 3. Transition Point Dynamic Programming (TPDP) 51state is encountered cannot be altered by any TP swaps. In both cases theenvironmental definition and the state transition probabilities ensure that afixed set of initial anchored states (TP states) will always be reached. As aresult, these anchored states will remain part of Sc — regardless of any TPswaps made.Considering the anchored states, Theorem 3.2 guarantees that the evaluation function values V(i) of these states will be monotonically reduced by anycombination of TP swaps made in Sc (whichever other states may includedin Sc at any time). Successive randomly attempted TP swaps will eventuallyreduce the evaluation function values of the anchored states to their lowestpossible levels (given the set of states with TPs). This is true because themonotonic reduction process will never stop before that point. Any interdependent relationships that may develop between the TPs, preventing furtherevaluation function value reductions, can and will always be broken by somerandomly attempted simultaneous swapping of all of the interdependent TPs.In the worst case, given enough random attempts, a simultaneous TP swapattempt will inevitably be made where the TP at every TP state is swappedfor a TP specifying (coincidentally) the lowest cost action for that state.Once the evaluation function values of the anchored states have beenmonotonically reduced to their lowest possible levels (given the set of stateswith TPs), it will also be true that the evaluation function values of all ofthe other TP states in Sc will be at their lowest possible levels (regardless ofthe shape of Sc at that time). This is true because, according to Equation3.20, the evaluation function value of every anchored state can be defined using the evaluation function value of every other TP state. Therefore, for theChapter 3. Transition Point Dynamic Programming (TPDP) 52evaluation function value of every anchored state to be minimized, the evaluation function value of all other TP states in Sc must be minimized (notingthat every TP state in Sc is reached through transitions from at least oneanchored state — otherwise it would not be part of Sc).Finally, since the evaluation function value of every TP state in the closedstate space Sc will inevitably be reduced to its lowest possible level (giventhe set of states with TPs), it must be true that an action having the lowestexpected total infinite-horizon discounted cost is associated with every TPstate in Sc (see Equation 2.6). 03.3.7 The Limitations of TP SwappingSection 3.3.6 presented proof (Theorem 3.3) that, in a valid environment with a giveninitial set of TP states, continual randomly attempted TP swapping eventually leads tothe lowest cost action being associated with all TP states i E Sc There is no guaranteehowever that the resultant set of TPs are a minimal TP optimal control set (see Section3.2.5), or that the closed state space Sc they define is the optimal closed state space Sc*.There are two reasons for this:1. It may be necessary to associate new TPs with some of the non-TP states SD(converting them to TP states) to make SB match SB* (see Section 3.2.6).2. There may be state transition routes through the external states SE that result inlower costs than those available through the closed state space Sc, and it may notbe possible to discover those routes without first associating some ineffectual TPs(see Section 3.2.3) with external states. These ineffectual TPs may be necessaryso that the system can be directed through the available low cost routes in theexternal states.Chapter 3. Transition Point Dynamic Programming (TPDP) 53The first reason why minimal TP optimal control might be prevented, which necessitatesthe addition of TPs, will be addressed in Section 3.3.8. The second reason, which requiressome sort of structuring of the ineffectual TP configurations in the external states SE,will be addressed in Section 3.3.10.3.3.8 Adding TPsTPs are added in a manner very similar to how they are swapped. The Q-value Q(i, uTp)that results when a trial TP is associated with a non-TP state i E SD is determined andcompared to an assessment of the costs that result when no TP is associated with statei. This comparison procedure is made difficult however by the fact that, since a non-TPstate has no TP, it has no Q-value associated with it that can be used to determinethe costs incurred when there is no TP. As a result an B-value is instead determined.R-values are determined the same way that TP Q-values are, using the Q-value updatingEquation 3.18 with only notation change:/d—1[1— crt(i)1 R(i) + at(i) u)) +7dv(sd) if =Rt+d(z) = n=OR(i) otherwise(3.27)Where d is the number of time steps after the non-TP state i is left before a TP state isencountered, and ‘u is an action specified at some time step before t.Evaluation function values are not generated from R-values. It is not necessary todo so as neither R-value nor Q-value updates make use of evaluation function values atnon-TP states— the states R-values are associated with.To illustrate that the R-value updating Equation 3.27 will converge on exact R-valuesthat indicate the costs incurred if no action is specified in a given non-TP state, entryaction probabilities first have to be explained. Entry action probabilities p(i, n) indicate,Chapter 3. Transition Point Dynamic Programming (TPDP) 54for a given state i, the probability of the system arriving at i as the result of an actionU E Ue() specified in some previous TP state, relative to the probability that the systemwill arrive at i at all. This means that:> p(i,u) = 1 (3.28)ttEUe (i)Entry action probabilities are entirely a function of the existing set of TPs and theenvironment, as these determine the probability of the system making state transitionsfrom the starting states Ss to TP states where action u E Ue() can result in a transitionto i. The entry action probabilities p(i, u) for any state i can thus be altered by anymodification of the existing set of TPs.The entry action probabilities p(i, u) for a system operating with a given set of TPs ina valid environment will be fixed. As a result, a non-TP state i can be considered to bea TP state that specifies a non-action action . The state transition probabilities pj(i, )that result from this non-action action will be a product of the entry action probabilitiesand the system state transition probabilities, and they too will be fixed:= p(i,’u)pj(i,U) (3.29)uEU6(i)If a non-TP state i can be considered to be a TP state that specifies a non-action actionü which results in fixed state transition probabilities p3(i, ), then Q-learning can beapplied (as described in Section 3.3.3) to learn the Q-value Q(i, i1) that results from thataction. A Q-value determined in this manner is in fact an R-value. So Equation 3.27,which is essentially the TP updating Equation 3.18, can be employed to learn the R-valueof a non-TP state.With exact R-values available, the addition of TPs can be made according to thefollowing Addition Rule.Chapter 3. Transition Point Dynamic Programming (TPDP) 55Addition Rule: TP addition is performed by using Q-learning to determinethe R-value of a non-TP state. Then a trial TP is placed at that state andQ-learning is used to determine the Q-value of that TP. If the Q-value of thetrial TP is less than or equal to the R-value, the TP is added. Otherwise theaddition is aborted.Even though the evaluation function value V,j(i) (Equation 3.19: V(i) Q(i,i(i)))of a state i with a newly added TP will be less than or equal to the R-value of that statebefore the TP was added (as a result of the Addition Rule), the addition of a TP to astate does not necessarily result in monotonic reduction of the evaluation function valuesof the other TP states. Consider that the R-value R(i) for a non-TP state i is a productof the Q-values of the entry actions to that state:R(i) = p(i,’u)Q(i,u) (3.30)UEUe (1)Now if there existed two TP states a and b that always made a transition to non-TP statei with the same action, a E Ue() for one and ub € U€(i) for the other, with Q(i, Ua)being lower than Q(i, ub), then it may be possible to add a TP to state i that has anintermediate Q-value Q(i, uTp) between Q(i, ua) and Q(i, ub). In this case, depending onthe entry action probabilities (see Equation 3.30), the R-value R(i) of state i might behigher than Q(i, uTp). The TP would then be added (as a result of the Addition Rule),increasing the costs experienced by TP state a as it reduced the costs experienced by TPstate b (see Equation 2.2). In this example the low costs of entry action Ua are balancedby the high costs of entry action ub so that the overall R-value (see Equation 3.30) of thenon-TP state is higher than the Q-value of the TP added to it. Such a scenario may beunlikely, but this example illustrates that the addition of a TP to a state, done accordingto the Addition Rule, may result in an increase in the costs experienced by other TPChapter 3. Transition Point Dynamic Programming (TPDP) 56states. These increased costs will be reflected in the evaluation function values of thosestates.Theorem 3.4: When a TP addition is made to state i e SD C Sc accordingto the Addition Rule in a valid environment, either the Q-value Q(i, uTp) ofthe added TP will be less than the Q-value Q(i, u) of at least one entry actionU E Ue(), or Q(i,uTp) = Q(i,u) Vu E Ue().Proof: By contradiction. For Theorem 3.4 not to be true Q(i, uTp) must begreater than Q(i, u) for at least one entry action u E Ue(), and equal to allof the rest. If such were the case then:p(i,u)Q(i,’uTp) > p(i,u)Q(i,’u) (3.31)uEUe(i) EUe(i)Which, because of Equation 3.28, means that:Q(i,uTp) > p(i,u)Q(i,u) (3.32)UEUe (1)Considering Equation 3.30, this means that:Q(i,uTp) > R(i) (3.33)Which is a violation of the Theorem 3.4 assumption that the Addition Ruleis adhered to. DTheorem 3.4 illustrates why TPs must be added. If an addition attempt is successfulthen the costs incurred when the system encounters state i will either remain the samefor all entry actions u Ue(), or will be reduced for at least one entry action. In theformer case the evaluation function values of all the TP states in Sc will remain the sameand the addition of the TP will do no harm. In the latter case some of the evaluationfunction values of the TP states may be increased as a result of the TP addition, but theChapter 3. Transition Point Dynamic Programming (TPDP) 57addition is necessary to facilitate eventual optimal control of the system. This is because,without a TP at state i, there would be no way to reduce the costs incurred when thesystem encounters that state as a result of any entry actions u E Ue() whose Q-valuesare higher than Q(i, UT?). If some of the entry action Q-values are lower than Q(i, UT?),a swap will eventually occur at state i for a TP with a Q-value the same or lower thanthat of the lowest entry action.Theorem 3.5: A TP addition can always be made to a state i€ SD C S’according to the Addition Rule in a valid environment.Proof: A TP addition can always be attempted where the action UT? associated with that TP has a Q-value Q(i, UT?) that is as low as the lowest Q-valueQ(i,u) of all the entry actions u€ Ue() in state i. The Q-value Q(i,uTp) ofsuch a TP is certain to be less than or equal to the R-value R(i) of non-TPstate i because the R-value, according to Equation 3.30, is composed of theentry action Q-values— all of which are the same or higher than the Q-valueof the action the TP specifies. As a result, the Addition Rule will ensure thatthe TP is added. This will occur even if all of the entry action Q-values arethe same as Q(i, UT?). In that case the R-value R(i) of non-TP state i willbe equal to Q(i, wpp), which is sufficient for the Addition Rule. C3.3.9 Removing TPsIn order to achieve minimal TP optimal control it may be necessary to remove TPs. SomeTPs may be unnecessarily associated with dormant states SD. TP removal is performedin a manner reverse that of TP addition.Removal Rule: TP removal is performed by using Q-learning to determinethe Q-value of a TP. Then removal of that TP is attempted and Q-learning isChapter 3. Transition Point Dynamic Programming (TPDP) 58used to determine the R-value of the resultant non-TP state. If the R-valueof the trial non-TP state is less than or equal to the Q-value of the TP, theTP is removed. Otherwise the removal is aborted.Even though the R-value of a state i with a recently removed TP will be less than orequal to the evaluation function value V,j(i) of that state when it had a TP, the removal ofTPs does not necessarily result in monotonic reduction of the evaluation function valuesof the other TP states. This is true for the same reasons that it is true of TP addition(see Section 3.3.8). Briefly, TP removal can reduce the costs incurred when the systemencounters a state with some entry actions while increasing those incurred with others.Theorem 3.6: A TP can only be removed from the state i€ SB C Sc it isassociated with, according to the Removal Rule in a valid environment, if atleast one entry action u€ Ue() has a Q-value Q(i, u) less than the Q-valueQ(i, uTp) of that TP, or if Q(i, wrp) = Q(i, u) V E Ue().Proof: By contradiction. For Theorem 3.6 not to be true the Q-valueQ (i, UTP) of a removed TP must be less than Q(i, n) for at least one entryaction u E Ue(), and equal to all of the rest. If such were the case then:p(i,u)Q(i,uTp) < p(i,u)Q(i,u) (3.34)UEUe(i) uEU€(i)Which, because of Equation 3.28, means that:Q(i,uTp) < p(i,u)Q(i,u) (3.35)uEU (1)Considering Equation 3.30, this means that:Q(i,nTp) <R(i) (3.36)Which is a violation of the Theorem 3.6 assumption that the Removal Ruleis adhered to.Chapter 3. Transition Point Dynamic Programming (TPDP) 59The overall effect of the Addition and Removal Rules is that TPs may be addedand removed from states continuously depending on how their Q-values compare withthe R-values of the states concerned at any given time. Once the lowest Q-value TPis associated with a state however, it cannot be removed unless all of the entry actionsU E U(i) have the same low Q-value. If the state is not a dormant state, this will neveroccur.3.3.10 Preparing External StatesSections 3.3.8 and 3.3.9 described how TPs are added to non-TP states SD within Sc,and how they will be adopted permanently if those states are not dormant states. Suchadditions ensure that, after extended TP swapping, the lowest cost actions are associatedwith every TP state in Sc (Theorem 3.3). There is no guarantee however that theresulting set of TPs defines an optimal closed state space SC.. To guarantee this, all ofthe state transition routes external to Sc. (through SE) must be tried to ensure that noneexist that have lower costs than those available within Sc. If lower cost routes are found,TP modifications must be made to change the shape of Sc so that it will include them.In a given environment, access to external states SE is facilitated by making TPmodifications in S that direct the system into the external states. External states thatcannot be accessed by TP modifications are not of concern as they simply cannot bereached in the existing environment.Theorem 3.7: If some external states are entered in a valid environment asthe result of a TP addition, swap or removal attempt at state i E SC C 5,and the Q-value or R-value of the attempted TP modification is less thanthe Q-value or R-value normally experienced at state i, then the TP will bemodified according to the Addition, Swapping or Removal Rule (whicheverChapter 3. Transition Point Dynamic Programming (TPDP) 60is appropriate) to include the external states in Sc.Proof: The immediate costs experienced in the external states SE are nodifferent from those experienced in Sc (see Equation 2.2). As a result, theQ-values and R-values determined during TP addition, swap or removal attempts are the same regardless of whether external states are entered or not.The Addition, Swapping and Removal Rules will therefore operate in exactlythe same manner, making the TP modification if it has been found to resultin a lower Q-value or R-value than that normally experienced. DTheorem 3.7 is fairly self-evident. It was presented mainly to clarify the situationwith external states — specifically that they are no different from internal states in termsof experienced costs. This assumption was implicitly made in Theorems 3.1 through 3.6because the TP modifications that each of these Theorems described could have includedentry into external states.So external states can readily be included in Sc if state transition routes throughthem result in low enough costs. TP additions, swaps and removals in the existing closedstate space Sc will facilitate the inclusion of any low cost external state routes. Thereis a difficulty that arises from assessing external state routes in this manner however.This is that lower cost transition routes through the external states may not be readilyavailable. It may not be a simple matter of just entering the external states to discoverthe lower cost routes. Instead it may be necessary to have some ineffectual TPs alreadyin place that can guide the system through convoluted low cost routes. Only then mightsome low cost routes be discovered.TP modifications in the external states do not affect the costs experienced in the closedstate space Sc (see Section 3.3.3), so ineffectual TP configurations can be modified freelyin the external states to prepare low cost routes.Chapter 3. Transition Point Dynamic Programming (TPDP) 61If random TP configurations are successively adopted in the external states and if aTP configuration is possible that results in a lower cost state transition route throughthese states, eventually this configuration will be discovered by a TP modification in Scand the lower cost route will be adopted. It may be desirable to prepare external state TPconfigurations in a more structured manner however — a manner that facilitates quickerdiscovery of the lower cost routes. This can be done by somehow modifying externalstate TP configurations that have already been found to be of relatively low cost, or itcan be done by employing some knowledge of the system response.3.3.11 Minimal TP Optimal Control is GuaranteedCombining the results of Sections 3.3.3 through 3.3.10 a guarantee of TPDP convergingto optimal control and achieving minimal TP optimal control (see Section 3.2.5) can bemade9:Theorem 3.8: Given any initial set of TPs in a valid environment, by sequentially making random TP addition, swap and removal attempts at statesin the closed state space Sc according to the Addition, Swapping and Removal Rules, and by simultaneously randomly configuring ineffectual TPs inthe external states SE, optimal control of the system will be converged to andminimal TP optimal control will be achieved.Proof: Theorem 3.8 is true because a worst-case achievement of minimal TPoptimal control is always possible given a long enough random sequence ofTP modifications. It can occur as follows:1. TPs are randomly and sequentially added to every boundary state SBin Sc. Such additions can always be made (Theorem 3.5).9Assuming that accurate Q-values and R-values can be determined during the TPDP process (seeSections 3.3.3 and 2.2.3, and Appendix A).Chapter 3. Transition Point Dynamic Programming (TPDP) 622. Ineffectual TPs are randomly configured to be at every boundary stateSB in SE. Such TPs can be added at any time (see Section 3.3.10).3. Repeated, randomly attempted TP swapping monotonically reduces theevaluation function values of all TP states in Sc to their lowest possiblelevels, and thereby results in the lowest cost action being associated withevery TP state (Theorem 3.3). During this process all low cost routesthrough the external states will be incorporated into Sc (Theorem 3.7).4. As TPs are associated with all of the boundary states SBG, when lowestcost actions are associated with all of the TP states, optimal control isachieved (see Section 3.2.4). This optimal control will be permanentlymaintained because TP additions, swaps and removals cannot be madethat will increase costs above the optimal costs experienced with optimalcontrol (Theorems 3.4, 3.2 and 3.6 respectively).5. Randomly attempted TP removals eliminate all of the unnecessary TPsat the dormant states SDO in Sc, reducing the TP states to the boundarystates SB. At this point minimal TP optimal control is achieved. Suchunnecessary TPs can always be removed (Theorem 3.6).After minimal TP optimal control is achieved, random TP additions mayrepeatedly add unnecessary TPs to the dormant states SDO. Such is possibleaccording to Theorem 3.4, but it will not affect the optimal control of thesystem and eventually these TPs will be randomly removed again.While TP additions and removals may continue after minimal TP optimalcontrol is achieved, no TP swaps will occur. This is because the SwappingRule prevents TP swaps after the lowest cost action is associated with anyTP state concerned. DChapter 3. Transition Point Dynamic Programming (TPDP) 63The proof of Theorem 3.8 is based on an unlikely sequence of random TP modificationsbeing made. This sequence will inevitably occur however, given enough random TPmodifications. Unfortunately the complexities of the TPDP process require that such abrutish proof be made.In practice however, it is highly probable that TPDP can achieve minimal TP optimalcontrol in a much more straight-forward way. This is because, if Theorems 3.1 through3.7 are considered, the Addition, Swapping and Removal Rules all tend to either mono-tonically reduce costs or to ensure the continuance of important low cost TPs. Effectivelythen, TPDP “locks-in” any cost reductions discovered. As a result, TPDP steadily determines a set of TPs that will provide minimal TP optimal control. Further, in contrastto the sequence of events described in the proof of Theorem 3.8 that span the entire statespace, in most applications (ones without any complex cost interdependencies betweenlarge numbers of states) TPDP can operate in a piecemeal manner, reducing the costsexperienced in small regions of the state space independently of other small regions. Every such set of small reductions contributes to the overall advancement towards minimalTP optimal control.3.3.12 Summary of TPDP OperationSection 3.3.8 described how TPs can be added to non-TP states in the closed state spaceSc so that actions can be specified at every state where such specification is requiredfor optimal control. Section 3.3.10 detailed how ineffectual TPs can be configured inthe external states SE to prepare state transition routes through SE which may resultin lower costs than those available in Sc. Section 3.3.5 described how TP swaps aremade, and Section 3.3.6 explained how repeated TP swaps monotonically reduce theevaluation function values of the TP states within Sc. Section 3.3.9 explained how TPscould be removed from states if they were not required for optimal control, facilitatingChapter 3. Transition Point Dynamic Programming (TPDP) 64the minimization of the set of TPs.Together these TPDP operations result in: the “filling out” of Sc so that it acquiresall of the TPs necessary for optimal control; the configuration of TPs in SE to facilitatethe inclusion of lower cost external state transition routes in Sc; the swapping of TPsso that optimal control can be achieved; and the removal of unnecessary TPs. Whencombined, these operations lead to minimal TP optimal control.3.4 The Characteristics of TPDP3.4.1 The Characteristics Shared with Q-learningBecause TPDP is a modified form of Q-learning, it has many of the characteristics thatQ-learning has. As described in Chapter 2, these include the fact that TPDP is:• direct dynamic programming control• memory-based control• reinforcement learning control• a solution to the credit assignment problem• adaptive control3.4.2 A Minimal Form of Direct DP ControlLike Q-learning, TPDP employs no explicit system models (see Section 2.2.1), so it usesless memory than indirect dynamic programming controllers (see Section 2.3.1). Beyondthis however, TPDP is likely to use substantially less memory than Q-learning in thesame continuous control applications. This is true for two reasons. The first is thatTPDP stores only individual TPs; that is, the single Q-value, R-value and action that aChapter 3. Transition Point Dynamic Programming (TPDP) 65TP associates with a given state. Q-learning on the other hand stores a complete set ofQ-values for each state— although it does not have to store R-values.The second reason that TPDP can use less memory than Q-learning is that, if anACAM is employed to store the TPs (see Section 2.4.2), TPDP can take advantage ofuniform regions in which dormant states exist. Such dormant states require no TPs, andTPDP can thus save memory for each dormant state. Even though Q-learning can takeadvantage of the ACAM storage of Q-values (see Section 2.4.2), it cannot take advantageof this further saving. It must store the Q-values of every state encountered duringcontrol of the system.The nature of the system being controlled determines how much of a memory savingis possible if TPDP is employed instead of Q-learning. The advantages of TPDP arereduced if there are few uniform regions in which dormant states exist (see Section 3.1.4),or if there are few possible actions U(i) in each state i.3.4.3 TPDP is Action CenteredUnlike most DP approaches to control, TPDP is highly focused on the consequences ofactions. Determining optimal actions is the final goal of any DP controller, but most areprimary concerned with learning an evaluation function that leads indirectly to optimalpolicy determination. TPDP is directly concerned with evaluating specific TPs; that is,evaluating specific actions taken in specific states. This is true of Q-learning controllers ingeneral, but it is especially so with TPDP because of the emphasis on adding, swappingand removing TPs that are associated with specific actions.Because TPDP focuses on the effects of actions instead of on the determination ofan evaluation function, the evaluation function approximation techniques described inSection 2.4.4 are not really appropriate for TPDP. In general, TPDP is a very concrete,memory-based approach to control.Chapter 3. Transition Point Dynamic Programming (TPDP) 663.4.4 TPDP and Temporal DifferencesSection 2.3.6 described Q-learning in the context of temporal difference learning. It wasexplained in that Section how conventional one-step Q-learning (see Section 2.2.5) isa form of TD(O) learning. The relationship of TPDP to temporal difference learningis less clear. This is because the updating that occurs in TPDP involves evaluationfunction values that can be further than a single time step away from the state beingupdated (see Equation 3.18). TPDP is thus not TD(O) learning. TPDP is also notTD(1) learning because it does not update using only the actual costs experienced afteran action has been specified in a given state (see Section 2.3.6). TPDP updates witha series of immediate costs followed by a single evaluation function value, so it is somehybrid of TD(O) and TD(1) learning — maybe “stretched TD(O) learning”.Even though TPDP lies in the middle ground between TD(O) and TD(1), it wouldbe inaccurate to describe it as TD() learning (Sutton, 1988). This is because it doesnot make updates using combinations of immediate costs and evaluation function values(see Equation 2.15).3.4.5 Continuous State Spaces and TPDPAs explained in Section 3.1.4, TPDP performs best on continuous state space stochasticcontrol applications. To facilitate control on these applications however, the continuousstate space must be discretized, with each state dimension being divided into a numberof intervals (see Section 2.3.2). Furthermore, TPDP operates by sampling the systemstate in discrete time intervals. So while TPDP operates on continuous systems, it doesso by sampling a discretized state space in discrete time intervals. As long as the levelof discretization is fine enough in both space and time, this approach can be successfullyemployed to control continuous systems.Chapter 3. Transition Point Dynamic Programming (TPDP) 67Up to this point the description of TPDP has involved only the time step intervalof 1 that is conventionally used with discrete state spaces. In continuous state spacescontrol is resolved with a general time step interval T. The time step interval can bemade as short as is necessary (given the practical limitations of the controller hardware)to effectively follow the continuous response of the system. As TPDP performs best incontinuous state spaces, the general time step interval T will be used henceforth in thedescription of TPDP.An incidental result of reducing the time step interval T is that the Q-values determined by any DP controller employing Q-learning will be changed. Even though thesame immediate costs c(i, u) will be experienced as the system moves about the statespace, the immediate costs experienced over shorter intervals will be added to the totalinfinite-horizon discounted cost and discounted more frequently. This can greatly alterthe Q-values Q(i, u) that are learned for each action u taken in each state i if updatingEquations 2.12 or 3.18 is employed. To roughly compensate for this effect the followingequations can be used to calculate new c’(i, u) and‘values from the old ones when a timestep interval change has been made and it is desired that the same total infinite-horizondiscounted costs be experienced:c’(i,) = c(i,u)- (3.37)7’ = (3.38)Equation 3.38 will result in accurate discounting of the evaluation function valueVt(st+d) that is included in updating Equations 2.12 and 3.18’°. The immediate costcomponents in these equations will not be discounted entirely accurately however. Assuming that a constant immediate cost per unit time is experienced over all time stepintervals, and that is an integer, the following factor of error will be incurred in the10When the update occurs after exactly r and d time steps respectively.Chapter 3. Transition Point Dynamic Programming (TPDP) 687=0.8 7=0.9=io =1oo =io =iood = 1 -9.4% -10.3% -4.6% -5.0%d = 10 -41.8% -45.8% -29.9% -32.8%Figure 3.2: Sample Immediate Cost Error Factorsimmediate cost portion of updating Equations 2.12” and 3.18:immediate cost error factor = .7d 1 — 7d 1 (339)T 71 7—1Figure 3.2 presents some immediate cost error factors predicted by Equation 3.39 forsome and values.‘1For Equation 2.12 d is always 1.Chapter 4Practical TPDPThis Chapter will present a description of a practical form of TPDP. The theoretical formof TPDP presented in Chapter 3 has a number of limitations that make it impracticalfor use in real control applications. These limitations are addressed in the practical formof TPDP.4.1 The Practical TPDP Approach4.1.1 The Problem With the Theoretical Form of TPDPTPDP was theoretically described in Chapter 3 as a method of learning a minimal set ofTPs that could provide optimal control of a system in a valid environment (see Section3.2.2). As such it was shown that TPDP is certain to arrive at this minimal set ifrandom TP additions, swaps and removals are continually made over an indefinite butfinite period of time (see Section 3.3.11). The problem with this approach is that eachTP addition, swap1 and removal requires the determination of exact Q-values and Rvalues. For such values to be determined, Q-learning must be allowed to progress withthe existing set of TPs until Q-values and R-values have been converged upon which areexact or negligibly close to being exact. This can take a considerable period of time, andthis process must be repeated for each TP addition, swap and removal. As a result, thetheoretical form of TPDP described in Chapter 3 is not very practical in terms of being‘Of which more than one can be made concurrently (see Section 3.3.5).69Chapter 4. Practical TPDP 70a viable control approach.4.1.2 Concurrent Assessment of TP ModificationsTo solve the problem of the protracted learning time required by the theoretical formof TPDP, many TP modifications (additions, swaps and removals) can be assessed concurrently. That is, Q-learning can be employed not just to determine the Q-values andR-values for a single TP modification, but instead to learn these values for a number ofconcurrent modification attempts. Further, modification attempts, and the learning ofthe values required for them, need not be initiated simultaneously. The determination ofeach value can be made part of the Q-learning process whenever new TP modificationsare randomly attempted. This approach is called Practical TPDP.Practical TPDP basically consists of a continually running Q-learning process, whereQ-learning is used to learn a constantly changing set of TP Q-values and R-values2.Thesevalues are used in the assessment of randomly attempted TP modifications. The TPsbeing assessed at any one time (with their associated Q-values and R-values) constitutethe assessment group.Practical TPDP can result in more than one TP being simultaneously associated witha state. As a result a policy TP (see Section 3.2.1) must be determined that defines thepolicy action p(i) (see Section 2.1.1) for each state i.4.1.3 Conventional Q-learning and Practical TPDPTo the extent that the Q-learning process is continuous in Practical TPDP (instead ofconsisting of successive rounds of Q-learning in the manner of the theoretical form ofTPDP), and to the extent that a policy action 1t(i) may have to be chosen from actions2A separate R-value must be associated with each TP at the same state when concurrent evaluationsare made (see Appendix Section B.3, Lines 20 to 33). As a result, the R-value notation must be modifiedaccordingly: R(i) becomes R(i, u).Chapter 4. Practical TPDP 71specified by a number of TPs associated with the same state i, Practical TPDP is likeconventional Q-learning. Practical TPDP is unlike conventional Q-learning however inthat the addition, swapping and removal of TPs completely disrupts the convergence tooptimal values that occurs in conventional Q-learning (see Section 2.2.3).4.1.4 Minimal TP Optimal Control and Practical TPDPAs explained in Section 4.1.3, Practical TPDP disrupts the convergence to optimal valuesthat occurs in conventional Q-learning. As a result, exact Q-values and R-values are notsure to be determined in Practical TPDP (see Sections 3.3.3 and 2.2.3, and Appendix A),making the proof that minimal TP optimal control will be achieved with the theoreticalform of TPDP (Theorem 3.8) inapplicable to Practical TPDP. No extended proof hasbeen developed that can surmount the complex Q-value and R-value interdependenciesthat exist in Practical TPDP. So no proof exists that Practical TPDP is certain to achieveminimal TP optimal control.Nonetheless, it is reasonable to assume that Practical TPDP will achieve minimal TPoptimal control. The reasons why this is true are the same reasons why the theoreticalform of TPDP is likely to achieve minimal TP optimal control more readily than throughthe unlikely sequence of events described in Theorem 3.8 (see Section 3.3.11). Basically,the addition, swapping and removal of TPs tends to steadily reduce the costs experiencedduring system control and to “lock-in” these cost reductions3. This will occur whetherthese TP modifications are made based on the exactly determined Q-values and R-valuesof the theoretical form of TPDP, or on values that are determined in the concurrent manner of Practical TPDP. Chapter 5 will present some demonstrations of the effectivenessof Practical TPDP.31n fact, the proofs presented in Chapter 3 were developed mainly to indicate what could be expected(although not guaranteed) with Practical TPDP.Chapter 4. Practical TPDP 724.2 The Specific Operation of Practical TPDP4.2.1 Using Weights to Concurrently Assess TPsThe main difficulty that arises when TPs are concurrently assessed is that of determiningwhen an assessment is complete. That is, when the Q-values and R-values associatedwith each TP have been learned well enough for a TP modification to be made based onthem. The technique employed to address this problem is to associate a weight w(i, u)with each TP that indicates the general merit of that TP. The basic idea of weights is tofacilitate the random addition of trial TPs to a TP assessment group with a low initialweight Wjjj. The Q-values and R-values of the TPs in the assessment group are thenlearned in an ongoing Q-learning process, and the TP weights are adjusted heuristicallyusing those values:1. New TPs are given an initial weight of wjp.jtj1; 0 < wjpjj < Wmax.2. Each time the Q-value of a TP is updated (whenever the system enters the stateassociated with that TP and takes the action it specifies), the weight w(i, u) of thatTP is incremented if Q(i, u) <R(i, u) and decremented otherwise.3. Each TP weight w(i, u) is limited to a maximum value of Wmax.4. If a TP weight w(i, u) is decremented to 0, the TP is removed.The TP weights are limited to a maximum value of Wmax to prevent them from becomingso large that they cannot readily be reduced again if the system or the learning situationchanges.A simpler approach to TP modification assessment might have been to associatesome sort of counter with the Q-values and R-values of each TP that indicated howmany times they had been updated. These counters could then be used in assessmentChapter 4. Practical TPDP 73decisions. Similar techniques were discussed in Section 2.4.5. The approach using weightswas chosen because it facilitated the prompt removal of TPs that were clearly not prudentfor a state to employ, and because it allowed for an ongoing assessment of each TP asconditions changed.4.2.2 Policy TP DeterminationAfter each change in TP weights or Q-values, the policy TP for the state associated withthat TP has to be redetermined. This is done by considering all of the TPs with weightsw(i, u) greater than or equal to Wt1J (Wipjtjaj < W < Wmax), and finding the one with thelowest Q-value Q(i, u). The threshold value Wthr prevents new and untested TPs frombeing made policy TPs (see Section 4.1.2). This decision process is formalized as findinga TP whose associated action UTP fulfills:mm Q(i,u) Vu with w(i,u) Wthr (4.40)uEU(:)If no TPs exist whose weights fulfill this criteria (w(i, u) Wthi.), the policy TP isdetermined by finding the TP whose weight is closest to Wt. This decision process isformalized as finding the TP whose associated action fulfills:max w(i,u) Vu with w(i,u) >0 (4.41)uEU(:)If a state has only one TP associated with it, that TP will always be the policy TP.Regarding the formalities of the theoretical form of TPDP presented in Chapter 3, thiscombined approach to TP assessment (see Section 4.2.1) and policy TP determinationresults in TP additions when a TP is associated with a non-TP state. TP swaps resultwhen, according to Equations 4.40 and 4.41, the policy TP at a state is changed. AndTP removals result when the single TP associated with a state (which must be the policyTP) is removed.Chapter 4. Practical TPDP 744.2.3 Exploration in Practical TPDPThere are two basic types of exploration which must be performed in Practical TPDP.The first, as described in general terms in Section 2.2.4, is to continually try the actionsthat can be performed in each state i. In the case of Practical TPDP, this means that theactions specified by all of the TPs associated with each state must be continually tried.Thorough exploration of this type is readily ensured by randomly selecting between theTP actions associated with each state encountered. Practical TPDP makes such randomselection, so this type of exploration will be discussed no further.The second type of exploration involves the identification of new TPs that should beassessed by Q-learning. Such TPs can lead to policy improvements, and they can beassociated both with TP states and non-TP states. There are two modes of this type ofexploration, internal exploration and external exploration.The purpose of internal exploration is to identify TPs that may reduce the costsexperienced within the closed state space Sc defined by the existing set of policy TPs(see Section 3.2.1). Internal exploration is performed by randomly trying sequences ofactions in Sc that are not specified by the existing set of TPs. If such experimentalsequences of actions result in lower costs than would have been experienced with theexisting set of TPs, new TPs that specify the experimental actions are associated withthe states where those actions were applied. Those TPs, and the actions they specify,are then fully assessed using Q-learning — they are made part of the assessment group.This process is described in detail in Appendix Section B.3.The purpose of external exploration is to associate TPs with external states SE (seeSection 3.2.1) to prepare state transition routes that can be followed by the systemthrough those states. Such preparation may be necessary for the discovery of lower costroutes through the external states (see Section 3.3.10). External exploration is performedChapter 4. Practical TPDP 75by having the system make random action specifications in SE until a TP is encountered.When a TP is encountered, a new TP is associated with the state where the last randomaction was specified. This TP then indicates a route that the system can follow to returnto the closed state space Sc when it is moving through the external states SE. Over timesuch TP allocations will build upon each other to indicate complete routes through theexternal states SE TP allocations are made in the external states in this conservativemanner to ensure that the number of TPs allocated does not become excessive. Excessiveallocation would occur, for example, if TPs were allocated every time a new action wasspecified during external exploration.Generally, internal exploration can be viewed as a way of ensuring that uniform regionboundaries (see Section 3.1.1) are located correctly. That is, ensuring that the TPs thatdefine these boundaries are associated with exactly the right states and that they shouldnot be shifted slightly in the state space. Internal exploration also ensures that absolutelyoptimal actions are specified at the boundaries. External exploration can be viewed asa way of ensuring that no state space transition routes are available in SE that result inlower costs than those already being followed in Sc.As described, internal and external exploration result respectively in the allocationof new, lower cost TPs in the closed state space Sc, and in the arbitrary allocation ofTPs in the external states SE. In reality the correspondence between the two modes ofexploration and the portion of the state space S in which the TPs are allocated is ratherrough. This is because there is no practical way to fully discriminate between Sc andSE during the operation of Practical TPDP. This is not a problem however, because theclear demarcation of these two regions was not found to be required for the successfuloperation of Practical TPDP (see Chapter 5).Chapter 4. Practical TPDP 764.2.4 General Operation of the Practical TPDP AlgorithmFigure 4.3 presents a complete algorithm for Practical TPDP that makes use of the TPassessment method described in Section 4.2.1, the policy TP determination method described in Section 4.2.2, and the internal and external exploration described in Section4.2.3. This Practical TPDP Algorithm is employed once for every learning trial (sequenceof state transitions from the starting states Ss to the absorbing states SA4), and it repeatedly calls the Stack Updating Procedure presented in Figure 4.4. The “stack” (Standish,1980) contains time step, state and action information about all TPs that are encountered in the period between every set of Q-value, R-value and weight updates requestedby the Practical TPDP Algorithm (see Appendix Section B.2). When the PracticalTPDP Algorithm requests an updating, the Stack Updating Procedure performs it usingthe contents of the stack. Appendix Section B.2 fully describes the Practical TPDPAlgorithm, and Appendix Section B.3 fully describes the Stack Updating Procedure itcalls.The general operation of the Practical TPDP Algorithm is as follows. The PracticalTPDP controller can be in one of three exploration modes. The mode in effect at anytime is identified by the variable explore as ‘none’, ‘internal’ or ‘external’. When noexploration is occurring actions are randomly chosen from those specified by the TPs atthe states encountered. The immediate costs incurred when those actions are taken areobserved, and the Q-values, R-values and weights of the TPs specifying those actions areupdated accordingly. Internal and external exploration (see Section 4.2.3) are randomlyinitiated in the midst of this process and are allowed to continue until a TP state isencountered. Internal and external exploration facilitate the allocation of new TPs thatcan be further assessed.4A trial can also be terminated by a mechanism that ensures that all of the states are continuallyvisited, as explained in Section 3.2.2, but such terminations will not be dealt with in this Chapter.Chapter 4. Practical TPDP 771. randomly choose starting state E Ss2. choose a starting action u0 E U(s0)3. if (state has a TP): ‘none’ explore4. otherwise: ‘external’ explore5. 0 t, 0 tupdate, 0 tItTp10. while St 15 not an absorbing state SA:11. if (state St St—T) or (state St = 5t—T for time Ttk):20. if (state t has a TP):21. if (UswapTp > 0):22. update-stack(explore, t, Q(s, f’(st)), Vexpected)23. ‘internal’ explore24. randomly choose action Ut e U(s)25. push-on-stack(t, 5, Ut, 0)26. otherwise if (explore ‘none’) or (t > tupdate)27. update-stack(explore, t, Q(st, It(St)), Vexpecte)28. ‘none’ = explore29. t + Ode1ay tupdate30. randomly choose action Ut from TP actions in U(s)31. push-on-stack(t, 5, Ut, ‘true’)32. otherwise: push-on-stack(t, 5t, Ut, ‘false’)33. Q(st,p(st)) Vexpecteci, Q(st,iz(st)) =‘ ‘IS-TP34. t = tfrsf40. if (state.s has no TP) and (explore ‘none’) and (athange> 0):41. if (explore = ‘external’): flush-stack42. randomly choose action Ut E U(s)43. push-on-stack(t, 5, Ut, 0)50. if (state s has no TP) and (explore = ‘none’) and (addTp > 0)51. update-stack(’none’, t1astTP, V1astTP, 0)52. if (Oexternai> 0): ‘external’ = explore53. otherwise: ‘internal’ = explore54. randomly choose action Ut E U(s)55. push-on-stack(t, st, Ut, 0)60. t+T=t61. (Vexpected— C(St, Ut)) Vexpecteci62. observe system for new state s63. update-stack(explore, t, S, Vexpected)Figure 4.3: The Practical TPDP AlgorithmChapter 4. Practical TPDP 78[Parameters passed: explore, t, Vupdate, Vexpected]1. while (time at top of stack t, t): pop-off-stack(t3is, u3, specified3)2. Vupdate Gtotal10. while (there are entries in stack):11. pop-off-stack(t5,i, u.,, specified5)12. while (t> t3):13. 7Cotaj+c(is,us) ‘ 0totai14. t—Tt20. if (explore = ‘none’):21. for (each TP action u E U(i5) in state i3):22. if(u=u3):23. (1— cr)Q(i3,u5) + aCtotaj Q(i, u5)24. if (Q(i5,u)< R(i3,u8)):25. w(i5,u5) + 1 = w(i5,u5)26. if (w(i3,u) > wmax): Wmax w(i5,u)27. otherwise:28. w(i5,u3)—1 = w(i5,u)29. if (w(is, u5) = 0): remove the TP30. if [(state i5 has only one TP) and (specified5= ‘false’)] or[(u u3) and (another TP at state i. specifies u8)]:31. (1— c)R(i5,u) + aCtotai = R(i5,u)32. for (each TP action u E U(i5) in state i5):33. if [(w(i3,1u(i3)) < wi) and (w(i5,u) > w(i5,z(i5)))] or[(w(i5,u()) w) and (w(i5,u) w) and(Q(i5,u) < Q(i3,p(i5)))]: u =‘ ,u(i5)40. if (state i has TPs) and (memory can be allocated) and(explore ‘internal’) and (C031 < Q(i8,(i5))):41. allocate a new TP at state i with action u,42. C0 = Q(i, u5), C01 R(i5,s)43. Wjnjtjal = w(i5,u5)44. if (w(i5,u8) > w(i5,1z(i5))) or w(i3,u5) wt): u5 = [L(i5)50. if (state i5 has no TPs) and (memory can be allocated) and[((explore = ‘internal’) and Vupdate < Vexpected)) or (explore = ‘external’)]:51. allocate a new TP at state i5 with action u552. C0j= Q(i5,u),C05.1 = R(i5,u)53. = w(i3,u)54. = (i3)Figure 4.4: The Stack Update ProcedureChapter 4. Practical TPDP 794.2.5 Delayed UpdatingIt was found, during experimental work with Practical TPDP, that updating the stackas soon as a new TP was encountered (see Equations 3.18 and 3.27) resulted in poorperformance. This was because the TP states were frequently close together, and onewas often encountered soon after another had been left. This resulted in the specifiedaction being changed after only brief intervals, which prevented observation of the costswhich each action specification would incur if they had time to truly affect the system.As explained in Section 3.1.4, actions specified for short durations cannot really affectsystems that have inertia. In response to this, delayed updating was incorporated intoPractical TPDP. Delayed updating prevents updating for a reasonable period of time,facilitating Q-value and R-value updating with longer term cost consequences.Delayed updating is facilitated with the use of an update time tupdate (see Figure 4.3).Instead of simply applying a TP action at a given TP state and updating the Q-values, Rvalues and weights of that TP as soon as another TP state is encountered, tupdate is usedto maintain the action specified by the TP and delay updating. The delay may continueas state transitions are made through many TP states. This is in fact the primary reasonwhy a stack is employed in Practical TPDP5.The stack records the information relatedto all of the TPs that are encountered in the period before the update time tupdate hasbeen reached.The update time tupdate is set (see Appendix Section B.2, Lines 20 to 34) by adding arandom positive delay parameter crdelay to the current time step t. Section 4.2.6 describeshow the random distribution of de1ay is determined. Once tupdate is set, stack updatingand changes in the action specification are prevented until t exceeds tupdae (unless internalor external exploration is initiated).5The stack is also used to record all of the action specifications made during internal exploration.Chapter 4. Practical TPDP 80Update times are not used during internal and external exploration because all exploration is terminated as soon as a TP state is encountered. This is done becauseinternal and external exploration are not performed to assess existing TPs. Explorationis performed to determine if new TPs should be allocated for future assessment, and thisdetermination can only be made based on the evaluation function values of the TP statesat the start and end of the exploratory transition sequences through the state space (seeAppendix Section B.3). In other words, all exploration takes place during transitionsequences between TP states.If a given TP is the only TP associated with a state, and updating is not occasionallydelayed when the system enters that state, then the action associated with that TPwill always be chosen for specification when that state is entered. This will preclude allupdating of the R-value of that TP (see Appendix Section B.3). This is another reasonwhy delayed updating is necessary.4.2.6 The Practical TPDP Exploration ParametersOf the various parameters in the Practical TPDP Algorithm that must be adjusted toensure effective algorithm performance, the exploration parameters were found duringexperimental work to be the most important (see Chapter 5). This Section will describethese parameters, and present guidelines for their determination.The Delay ParameterAs explained in Section 4.2.5, ‘7delay is a random positive delay parameter added to i todetermine the next update time tupdate. Basically, tupdate determines how often updatesshould occur. The distribution of de1ay can be uniform or Gaussian6,with the meande1ay entirely depending on the system being considered. The value of Udelay must be set6With negative values being made equal to zero.Chapter 4. Practical TPDP 81high enough to ensure that the long term consequences of the action specified by eachTP can be observed in the system under consideration (see Section 4.2.5). It must notbe set so high however that the action specified by each TP is frequently maintained wellbeyond the boundaries of the uniform region (see Section 3.1.1) that it is best suited for.This would result in the costs experienced after actions are specified being unnecessarilyhigh.Generally, &delay is related to the natural modes of the system being controlled. If theaction specification must change after an average interval ofT11If0for a given system tobe optimally controlled (that is, it takes an average time ofT1ç0to cross each uniformregion), then Practical TPDP is most effective when:de1ay = Tuniform (4.42)A rough value for Tjf0can be determined by considering the general response characteristics of the system concerned. A rough value is sufficient, as Practical TPDP is notoverly sensitive to the setting of this parameter (see Section 5.5).The Swap ParameterThe random swap parameter swap-TP (see Appendix Section B.2), which has some fixedprobability Pr(oswapTp > 0) of being greater than zero each time it is considered, determines whether or not internal exploration is initiated at TP states. This probability mustbe set low enough to ensure frequent updates of the TPs already associated with eachstate, and high enough to facilitate thorough exploration of alternative TP possibilitiesat each state. Experimental work with Practical TPDP (see Chapter 5) indicated that a0.2 probability ofc7swapTP being greater than zero generally produced good results — thatis, internal exploration should be initiated once every five times that a TP is encountered.Chapter 4. Practical TPDP 82The Initiate ParameterThe random initiate parameter add-TP (see Appendix Section B.2), which has somefixed probability Pr(Jaad..Tp > 0) of being greater than zero each time it is considered,determines whether or not exploration (internal or external) is initiated at non-TP states.This probability must be set low enough to ensure that the updating of existing TPs isnot constantly interrupted, and high enough to facilitate thorough exploration of new TPaddition possibilities. As °add-TP is considered every time step during which the system isnot in a TP state, the cumulative probability of exploration (internal or external) beinginitiated between TP states must be adjusted to be appropriate for the system underconsideration.Experimental work with Practical TPDP (see Chapter 5) indicated that the bestresults were obtained when the cumulative probability of exploration was 0.5 or higherwhen the system passed through the first non-TP state encountered after leaving a TPstate. This facilitated the addition of new TPs that could subtly refine the boundarystate locations (see Section 3.1.1). By employing dqujck, the number of time steps requiredto pass through a single state at the highest system velocity, a formula for Pr(JadaTp > 0)can thus be determined:0.5 = 1 — [1— Pr(uadaTp > J)]dquiCkPr(uaddTp > 0) = 1 — 51/dquick (4.43)Using Equation 4.43 to determine a value for Pr(Jadd’rp > 0) can result in the updating of existing TPs constantly being interrupted early in the learning process when theTPs are well spaced out. At this time refinement of the boundary locations is unnecessary and undesirable. A method of varying Pr(oaad.Tp > 0) as learning progresses wouldtherefore be valuable. Such a method was not developed during this work however.Chapter 4. Practical TPDP 83The External ParameterOnce internal or external exploration is initiated at non-TP states (based on the randomparameter OaddTp), the probability Pr(uexteni > 0) of the random external parameterexterxia1 being greater than zero determines whether or not external exploration is initiatedinstead of internal exploration (see Appendix Section B.2). Experimental work withPractical TPDP (see Chapter 5) indicated that a 0.5 probability of extema1 being greaterthan zero generally produced good results — that is, when exploration is initiated atnon-TP states, internal and external exploration should be chosen with equal likelihood.The Route Change ParameterDuring internal and external exploration, the probability Pr(uchange> 0) of the randomroute change parameter o’cllange being greater than zero determines whether or not a newexperimental action is specified during each time step (see Appendix Section B.2). Asc1-tange is considered every time step during exploration, the cumulative probability of anaction change must be adjusted so that action changes are made at intervals that reflectthe ability of the system concerned to respond.Like the delay parameter ãdelay, Pr(uchange > 0) is related to the natural modes ofthe system being controlled. Experimental work with Practical TPDP (see Chapter 5)indicated that the best results were obtained when there was a 0.5 cumulative probabilityof an action change occurring after the same action had been specified for one fifth ofTuniform, the average time required to cross each uniform region:0.5 = 1 — [1— Pr(ochge> o)]TUIUf0ITfl/(5T)Pr(ochge> 0) = 1 — (444)Chapter 4. Practical TPDP 844.2.7 The Other Practical TPDP ParametersThe Learning RateThe learning rate o should be set high enough to facilitate some learning, but not so highthat the results from each learning experience can drastically alter what has already beenlearned. Learning rate values are commonly set around 0.3, and this value was foundto produce good results during experimental work with Practical TPDP (see Chapter5). There is no reason why this value should not produce equally good results in anyapplication of Practical TPDP.The Weight ParametersThe weight parameters must be set so that when a new TP is added to the assessmentgroup with an initial weight of that TP will be quickly removed if its R-value isthereafter less than its Q-value (see Section 4.2.1). The threshold value w above whicha TP can be considered as a potential policy TP for a given state (see Section 4.2.2) mustbe set high enough so that a newly added TP must be updated a number of times beforeit can reach that threshold. The maximum weight value Wmax must be set so that theweight of a TP can be quickly reduced below w if a system change reduces its merit(see Section 4.2.1). Experimental work with Practical TPDP (see Chapter 5) indicatedthat the following weight parameter values produced good results:= 2WtFj. = 10Wmax = 20There is no reason why these values should not produce equally good results in anyapplication of Practical TPDP.Chapter 4. Practical TPDP 85The Same State Limit ParameterThe parameter Ttk is used to initiate exploration when the system has been “stuck” inthe same state for too long (see Appendix Section B.2). It should be set so that Ttk isa time period longer than that for which the system under consideration would normallybe expected to stop.Chapter 5Application of Practical TPDPThis Chapter will describe the application of Practical TPDP to two different controltasks. These applications allow the characteristics of Practical TPDP to be analyzed anddescribed.5.1 The Race Track Problem5.1.1 Description of the ProblemGardner (1973) described a problem called Race Track which Barto et al. (1991) latermodified for the application of DP controllers. As described by Barto et al. (1991), RaceTrack consists of a “track” like the one shown in Figure 5.5. A single “car” starts at theleft of the track shown, accelerates down the length of it, and finishes after making a leftturn. Each square in this track represents a position that the car can move to. The fourpositions where the car can start are the starting states Ss (with all velocities zero). Theabsorbing states SA include all four finishing positions, with different velocities possible ineach position. When one of the finishing positions is reached the problem is terminated.Collisions with the walls are handled by leaving the car at the point of collision andsetting its velocity to zero1. As explained by Barto et al. (1991), this implies that thecar cannot leave the track except by reaching the finishing positions.As the car moves down the track its horizontal and vertical accelerations can be1Litigation is not a concern in this problem.86Chapter 5. Application of Practical TPDP 87FinishingPositionsStartingPositionscontrolled, with the velocity of the car in both directions being limited. This limitationensures that the state space S of the problem is finite (Barto et aL, 1991), although wallcollisions will inherently limit the velocities possible on the track shown in Figure 5.5.Any acceleration specification has a probability Pineffective of having no effect on the carduring the time step in which it is applied. This makes the problem stochastic.The control task is to control the horizontal and vertical accelerations of the car sothat it runs the length of the track in the minimum possible time. This optimality criteriais established by imposing the same immediate cost c(i, u) = 1 for all actions u E U(i)taken in each state i, with the exception of actions taken in the absorbing states SA.All actions taken after reaching the absorbing states SA, for all future time, result inimmediate costs of c(i, u) = 0. The discount rate y is set to 1. As a result, the lowestpossible costs are incurred when the car runs the length of the track in the minimumpossible time.Figure 5.5: A Race TrackChapter 5. Application of Practical TPDP 885.1.2 A Discrete-time Stochastic Dynamic System FormulationFollowing Barto et al. (1991), the Race Track problem can be described as having a statespace S consisting of states St = [Xt, Yt, t, ‘t1, where xt and Yt are the horizontal andvertical position of the car respectively at time step t. The possible actions U(i) arethe same for each state i and consist of u — [zi, up], where both ‘u and u, are in theset {—a, 0, a}, and where a is a constant. With velocity limitations of thL and thefollowing state transition will occur at time step t with probability 1— pineffective:Xt+T = max(—L, min(thL, + uT)))Yt+T max(—L, min(L, + uT)))Xt+T = Xt + Xt+TTYt+T Yt + Yt+TT (5.45)If the acceleration specifications have been ineffective (probability pineffective), the followingstate transition will occur at time step t:Xt+T =Yt+T = YtXt+T = Xt + Xt+TTYt+T = lit + yt+TT (5.46)Given the state transitions described by Equations 5.45 and 5.46, if a, XL, YL, T andthe initial state values [xc,, yo, o, ‘o] are all integers, the state values will always remainintegers.5.1.3 A Continuous Version of the ProblemAs explained in Section 3.1.4, TPDP performs best in continuous state spaces. Therefore,to fully demonstrate Practical TPDP, it was applied to a continuous version of the RaceChapter 5. Application of Practical TPDP 89Track problem. To make the problem continuous, the time step interval T in Equations5.45 and 5.46 is made infinitesimally small.When this is done there is no guarantee that the state values will remain illtegers.As a result, in order to control the car with a finite number of DP elements (see Section2.3.2), the state values must be quantized into discrete intervals. For this application ofPractical TPDP the state values were simply rounded-off to the nearest integer.Making the state space continuous changes the total infinite-horizon discounted costsexperienced as described in Section 3.4.5. Equations 3.37 and 3.38 were used to compensate for these effects during Practical TPDP solution of the Race Track problem.By making the time step interval T small, and thereby making the Race Track problemcontinuous, the system state is prevented from drastically changing between time steps.In the discrete-time (T = 1) version of the problem considered by Barto et al. (1991) thecar velocity could instantaneously change, and the car could move from one position toanother one many squares away without passing through the positions in between. If thetime step interval T is made small, state space transitions are restricted to occur onlybetween neighboring states, and both of these behaviors are eliminated. As explainedin Section 3.1.4, this is necessary for the effective operation of TPDP. If drastic statechanges are possible, uniform region boundaries cannot be established.5.1.4 Problem Specifics Used During Practical TPDP ApplicationThe track used for Practical TPDP application was the one used by Barto et at. (1991).It is the track shown in Figure 5.5. Continuous movement was approximated by settingT = 0.05. This setting was small enough to ensure that state transitions could only bemade between neighboring states, but large enough to ensure that the problem simulationtime did not become exorbitant.With T set to 0.05 on this track the system can enter roughly 15440 discretized states.Chapter 5. Application of Practical TPDP 90The car was started in one of the four starting positions with zero velocity, each of whichwas chosen with equal probability.When Practical TPDP was applied to the Race Track problem, the various problemparameters were generally set to the values used by Barto et al. (1991):= 1 (this value is unaffected by the y compensation Equation 3.38)0 for all actions taken in the absorbing states SAc(i, u) 3 for any actions that lead to a wall collision1- = 0.05 for all other actions (based on Equation 3.37)Pineffective = 0.1a=1XL = 6= 6As indicated, the immediate cost c(i, u) incurred when an action lead to a wall collisionwas set to 3. In the discrete-time (T = 1) version of this problem considered by Barto etal. (1991), no extra cost was incurred other than that which resulted from the additionaltime required to accelerate the car again. When the problem was made continuous, thecollision cost of 3 was used because the ability of the car to accelerate was changed. Thischange lead to the uninteresting result that the total cost incurred was roughly the samewhether the car swept smoothly around the left turn or whether it accelerated fully intothe right wall, collided with it, and then accelerated again to reach the finishing positions.The collision cost of 3 made the latter strategy much less attractive.Chapter 5. Application of Practical TPDP 915.1.5 The Practical TPDP Algorithm Parameters UsedDuring solution of the Race Track problem, the Practical TPDP Algorithm parameters(see Sections 4.2.1, 4.2.5, 4.2.6, 4.2.7, and Figures 4.3 and 4.4) were set to:de1ay = 2.5Pr(JswaTp > 0) = 0.2Pr(Jadd..Tp > 0) = 0.25Pr(externai> 0) = 0.5Pr(Jchange> 0) = 0.1= 0.3 V t,i,u= 2Wthr 10Wmax = 20Ttck = 1The random parameter de1ay had a uniform distribution from 0 to 5 (see Section 4.2.6).The values for Odelay and Pr(uthge> 0) were based on a T1f0value of 2.5 being usedin Equations 4.42 and 4.44 respectively. The value for Pr(craddTp > 0) was based ona dquick value of 3 being used in Equation 4.43. The rest of the parameter values werechosen based on preliminary experimentation.5.1.6 Evaluation of Performance on the ProblemIt was not expected that Practical TPDP would learn absolutely optimal policies for theRace Track problem — this would have taken far too long. Instead near optimal policieswere sought. To evaluate the performance of Practical TPDP as it learned such nearChapter 5. Application of Practical TPDP 92optimal policies, successive trials were simulated that included one run of the car fromthe starting positions to the finishing positions. After every set of 20 such learning trials,500 test trials were simulated where Practical TPDP learning was suspended and thepolicy learned up to that point was employed to control the car. The average track timeof the car down the track during the 500 test trials was used as an indication of how wellthe policy had been learned. The combination of 20 learning trials followed by 500 testtrials was called an epoch, and it was also used by Barto et al. (1991).All test trials were terminated in which a total cost of more than 500 was experiencedbefore the car arrived at the finishing positions. Further, during testing, only actionsspecified by policy TPs (see Section 4.1.2) with weights w(i, u) greater than wereapplied to the system.5.2 Performance on the Race Track Problem5.2.1 Comparing Practical TPDP and Conventional Q-learningFigure 5.6 shows the average track time results (see Section 5.1.6) when both PracticalTPDP and conventional Q-learning were applied to a continuous version (see Section5.1.3) of the Race Track problem. As explained in Section 5.1.4, the track shown inFigure 5.5 was used, as well as the problem parameters presented in that Section. ThePractical TPDP Algorithm parameters used were those presented in Section 5.1.5. Theparameters and exploration method used for conventional Q-learning were those employedby Barto et al. (1991) on this same track during their discrete-time (T = 1) experiments.Figure 5.6 shows that Practical TPDP learned a near optimal policy much faster thanconventional Q-learning. This comparison is somewhat inequitable however, because theconventional Q-learning approach used (Barto et al., 1991) is best suited for discrete-time(T = 1) applications. When the time step interval T is made small enough to make theChapter 5. Application of Practical TPDP 93problem approximately continuous (see Section 3.4.5), state transitions are made onlybetween neighboring states. As a result, Q-value updating is performed by conventionalQ-learning using only the immediate costs experienced after short periods of time. Thismakes learning a near optimal policy very difficult (see Section 4.2.5). This problem wassolved in Practical TPDP by making use of delayed updating and its randomly chosenupdate time tupdate (see Section 4.2.5). In order to make the comparison between PracticalTPDP and conventional Q-learning more equitable, this delayed updating approach wasadded to conventional Q-learning.The average track time results after delayed updating was incorporated into conventional Q-learning are presented in Figure 5.7. Figure 5.7 shows that the performanceof conventional Q-learning on the continuous Race Track problem was indeed improvedby adding delayed updating to it, but that Practical TPDP still learned a near optimalpolicy considerably faster. The reason for this, as explained in Section 3.1.3, is thatconventional Q-learning associates a DP element with every possible state and actioncombination. Practical TPDP only associates DP elements (in the form of TPs) with afraction of this total, and thus enjoys the benefit of not having to make DP computationsfor all of them.Considering total computational effort, to complete 1500 epochs (with no test trials)on the Race Track problem Q-learning without delayed updating required 983 CPUseconds on a Sun SPARCstation 10, Q-learning with delayed updating required 229CPU seconds, and Practical TPDP required 528 CPU seconds. As explained, Q-learningwithout delayed updating is not well suited to application in such continuous state spaces,and the protracted learning times which result necessitate more computational effort.More computational effort was required for Practical TPDP than for Q-learning withdelayed updating because its computations are more complex (see Chapter 4). In realtime control applications however, the learning time, not the total computational effort,Chapter 5. Application of Practical TPDP 94100EFC-)Ct5i 50ci)Cuci)>02500Epoch NumberFigure 5.6: Performance of Practical TPDP and Conventional Q-learningis likely to be the larger concern.Finally, as a matter of interest, Figure 5.8 shows the average track time results whenboth Practical TPDP and conventional Q-learning were applied to the discrete-timeversion of the Race Track problem considered by Barto et al. (1991)2. The averagetrack time performance of conventional Q-learning in this case was, not surprisingly,very similar to that obtained by Barto et al. (1991). As shown in Figure 5.8, PracticalTPDP again learned a near optimal policy much faster than conventional Q-learning.One fact is not clear in Figure 5.8 however. This is that, even though Practical TPDPsettled on a policy much sooner than conventional Q-learning, the average track time thatresulted from that policy remained roughly constant, and it was slightly higher than theaverage track time that conventional Q-learning was able to achieve after 2000 epochs.2The immediate cost schedule used by Barto et al. (1991) on this problem was used as well.0Chapter 5. Application of Practical TPDP 95100ci)EF-‘C.)50ci)c3)ci)>00 2500Epoch NumberFigure 5.7: Performance of Practical TPDP and Delayed Updating Q-learningUndoubtedly this reflects an inability of Practical TPDP to learn an absolutely optimalpolicy in this discrete-time application. This is not a concern however, as Practical TPDPwas not developed to operate in such discrete-time applications (see Section 3.1.4).5.2.2 Near Optimal Performance of Practical TPDPFigure 5.9 again shows the average track time results when Practical TPDP was appliedto the Race Track problem. This Figure indicates that after Practical TPDP settled onan initial average track time of around 17, and maintained that track time for roughly600 epochs, a learning transition occurred that reduced the average track time to around14. To illustrate why this distinct learning transition occurred, Figures 5.10, 5.11, 5.12and 5.13 show five typical car paths down the length of the track after 300, 800, 1300and 1800 epochs respectively of Practical TPDP learning.Chapter 5. Application of Practical TPDP 96ci)2H0H0)0)>a)EH0j2O0)Cuci)>1005000 2500Epoch NumberFigure 5.8: Performance on a Discrete Version of the Problem30100 300 800 1300 1800 2500Epoch NumberFigure 5.9: Performance of Practical TPDPChapter 5. Application of Practical TPDP 97StartingPositionsStartingPositionsStartingPositionsFinishingPositionsFinishingPositionsFinishingPositionsFigure 5.10: Five Typical Race Track Routes After 300 EpochsFigure 5.11: Five Typical Race Track Routes After 800 EpochsFigure 5.12: Five Typical Race Track Routes After 1300 EpochsChapter 5. Application of Practical TPDP 98FinishingPositionsStartingPositionsAfter 300 epochs (Figure 5.10) Practical TPDP had learned the policy of fully accelerating the car down the track, colliding with the right wall, and then accelerating thecar towards the finishing positions. After 800 epochs (Figure 5.11) Practical TPDP wasstill employing this same basic policy, but it had started learning to curve the car pathupward. This reduced the time required to reach the finishing positions once the rightwall had been struck. Figure 5.11 also shows the path taken by a car when it recoveredfrom a collision with the top wall after attempting to curve upward too much.After 800 epochs the learning transition began. The learning transition consisted ofPractical TPDP learning that if the car was not accelerated horizontally to full speed(full speed is 6, see Section 5.1.4) while moving towards the right wall, it could beaccelerated upward beginning at roughly the track midpoint to produce a smooth leftturn into the finishing positions. By making such a turn the car could avoid collisionwith any walls, and reach the finishing positions in the shortest possible time. Figure5.12 shows five typical car paths taken after 1300 epochs of learning, in the midst ofthe learning transition. Some paths included smooth turns; others were like the collisionpaths taken after 800 epochs. After 1800 epochs the learning transition was complete.Figure 5.13 shows that smooth turns were consistently made at this stage of PracticalTPDP learning.Figure 5.13: Five Typical Race Track Routes After 1800 EpochsChapter 5. Application of Practical TPDP 99The learning transition described illustrates fully the exploration issues discussed inSection 4.2.3. Basically what occurred during the Practical TPDP learning process wasthat an initial low cost policy was learned quickly (that of fully accelerating, collidingwith the right wall, and moving towards the finishing positions— at 300 epochs, see Figure5.10). The TPs specifying this low cost policy defined a closed state space Sc. Oncethis policy had been learned, internal exploration within Sc lead to reductions in thecosts experienced when it was followed (800 epochs, see Figure 5.11). As this occurred,external exploration resulted in the placement of ineffectual TPs that could guide thecar along an even lower cost path (smoothly turning— 1800 epochs, see Figure 5.13). Ataround 1300 epochs enough such TPs were in place for a learning transition to be made.Figure 5.9 shows that the near optimal policy was not consistently followed afterabout 2000 epochs. The average track time increased sporadically after 2000 epochs.This was due to the continued allocation of superfluous TPs, a problem which will bediscussed in Section 5.4.2.5.3 Practical TPDP and Generalization5.3.1 Generalization by CopyingSection 2.4.4 described how the ability to generalize is often incorporated into DP controllers by parametrically approximating the evaluation function. It was further explainedthat such parametric approximation typically requires extensive computational effort forthe approximation to be learned, and that parametric approximation is particularly difficult to incorporate into Q-learning controllers. This is because Q-values are a functionof action as well as state, and an action dimension must also be included in any approximation that represents Q-values. As a result of all this, parametric approximation wasnever investigated in the development of Practical TPDP.Chapter 5. Application of Practical TPDP 100Another type of generalization is possible with Practical TPDP however. This is tobias the selection of the random actions attempted during exploration (see Section 4.2.4and Appendix Section B.2) towards actions already specified by TPs in the state spacevicinity around the state where the random action is being attempted. This type ofgeneralization is called generalization by copying. Generalization by copying is highlylocalized in terms of the state space regions over which generalizations are made, but itcan be extremely effective when used in Practical TPDP. This is because, when PracticalTPDP is applied to the continuous state space systems to which it is best suited (seeSection 3.1.4), it is likely that a TP that has been found to be worthwhile at one statewill also be worthwhile if duplicated in neighboring states.5.3.2 Generalization by Copying in the Race Track ProblemGeneralization by copying was attempted during Practical TPDP solution of the RaceTrack problem. Whenever an action had to be randomly chosen during internal orexternal exploration, the states adjacent to the one where the random action was beingattempted were inspected. If any of them had TPs, and the random generalizationparameter genera1ize was greater than zero, generalization by copying was performed.Specifically, an action was chosen, with equal probability for each choice, from all theactions specified by TPs in the adjacent states. Generalization by copying was onlyperformed on random occasions (according to the random value of cTgeneraljze) to preventthe internal and external exploration processes from losing too much vigor. That is, toensure that they were not biased too frequently.Figure 5.14 illustrates the results of attempting generalization by copying duringPractical TPDP solution of the Race Track problem, with Pr(ugeneraijze > 0) = 0.2.Figure 5.14 shows that generalization by copying can increase the rate at which PracticalTPDP learns a near optimal policy.Chapter 5. Application of Practical TPDP 10130EF-C)j2OCu>100 2500Epoch NumberFigure 5.14: Performance of Practical TPDP With and Without Generalization5.3.3 Practical TPDP Glitches on the Race Track ProblemInspection of Figure 5.14 reveals that there are short duration glitches in the average tracktime at epochs 709, 2164 and 2328. While these glitches were the result of attemptinggeneralization by copying during Practical TPDP, they appeared during many otherexperiments in which Practical TPDP was applied to the Race Track problem. Theywere caused by the removal of TPs associated with zero velocity wall states (states thathad zero velocity and a position immediately next to a wall). Such TPs were crucialbecause they specified actions that started the car moving again once it had collidedwith a wall. When they were removed during Practical TPDP learning, and the carentered the states they were associated with during test trials, the car would remain inthose states until the test trial was aborted. This lack of movement resulted in largecosts being incurred, which produced the glitches seen in Figure 5.14.Chapter 5. Application of Practical TPDP 102Glitches did not result when TPs were removed from other states because the carwould continue to move through those states whether TPs specified new actions in themor not. Thus, it was the definition of the Race Track problem that lead to the occurrenceof the glitches. If the cars had bounced off the walls for example, instead of stoppingcompletely, the glitches would not have occurred. As they were a result of the Race Trackproblem definition, glitches also occurred when conventional Q-learning was applied tothe Race Track problem. In the case of conventional Q-learning, they resulted when thepolicy action at the zero velocity wall states did not accelerate the car away from thewall. Figures 5.6 and 5.8 show many glitches in the average track time resulting fromconventional Q-learning. In fact, it was the rapid convergence of Practical TPDP to anear optimal policy that makes the Practical TPDP glitches appear so onerous. Theglitches rise up in stark contrast to the generally settled average track time curve.Any number of techniques could be employed to prevent glitches in the PracticalTPDP solution of the Race Track problem. TP removal could be prevented in zerovelocity states for example. No such techniques were investigated however, as the problemwas not considered to be particularly serious. It was mainly a result of the peculiarities ofthe Race Track problem definition, not a result of some fundamental failing of PracticalTPDP. Further, the rapid recovery of Practical TPDP when glitches occurred was seenas further evidence of the viability of the TPDP approach.5.3.4 A Performance MetricTo determine the level of generalization by copying (Pr(crgeneraijze > 0)) that producedthe best results during Practical TPDP solution of the Race Track problem, a performance metric was developed that facilitated quantitative comparison between the resultsobtained when different versions of Practical TPDP were employed. This performancemetric was designed to indicate when the learning transition had been made (see SectionChapter 5. Application of Practical TPDP 1032500 I I I2 20000.8 1.0Probability of Generalization by Copying During ExplorationFigure 5.15: The Effect of Changing Pr(ugenerajjze> 0)5.2.2). That is, when Practical TPDP had learned to smoothly turn the car, as opposedto colliding with the right wall. The performance metric was based on the first 100 testtrials in which the average track time was less than 14. The average number of epochsrequired for the first 100 such track times was used as the performance metric.5.3.5 Comparing Generalization Levels With the Performance MetricUsing the performance metric described in Section 5.3.4, Figure 5.15 was generated toillustrate the performance results from different levels of generalization by copying. Thatis, Figure 5.15 illustrates the results of varying the probability that the random parametergenera1ize was greater than zero. The portions of the curve which are missing from Figure5.15 indicate Practical TPDP solutions of the Race Track problem where the averagetrack time was not found to be less than 14 at least 100 times within 2500 epochs.Chapter 5. Application of Practical TPDP 104It was found that generalization by copying had the general effect of increasing therate at which an existing policy was refined. That is, generalization by copying aidedthe internal exploration process in determining the lowest cost TPs for the closed statespace Sc (see Section 4.2.3). It did so by focusing the internal exploration on actionspecifications that were already known to have merit. However, by restricting explorationto the actions already being specified in the various regions of Sc, generalization bycopying inhibited external exploration and the discovery of alternative low cost routesthrough the state space (see Section 4.2.3). External exploration inherently requiresmore vigorous experimentation. As a consequence of this, Figure 5.15 indicates that ageneralization by copying level of 0.2 produced the best results during Practical TPDPsolution of the Race Track problem.The 0.2 level of generalization by copying did not result in the lowest overall performance metric value however. The 0.6 level did. But the low performance metric valueresulting from the 0.6 level was the spurious result of a near optimal policy being discovered exceptionally and fortuitously early in the Practical TPDP learning process. AsFigure 5.15 indicates, at a generalization by copying level of 0.7 a near optimal policy wasnot found at all (no performance metric value was determined within 2500 epochs). Thelow performance metric value resulting from the 0.6 level is therefore not an indicationthat such a high level of generalization by copying is likely to produce superior results.5,4 Practical TPDP and TP Allocation5.4.1 TP Allocation in the Race Track ProblemFigure 5.16 shows the percentage of TPs allocated as Practical TPDP learned a nearoptimal policy for the Race Track problem. This percentage was based on the fact thatroughly 15440 states could be entered during movement of the car around the trackChapter 5. Application of Practical TPDP 10530 302—1HnnC.)aC)I10 100 00 2500Epoch NumberFigure 5.16: TP Allocation as Practical TPDP Learning Progressedconsidered, and that 9 actions were possible in each state. As a result, it was possiblefor 138960 TPs to be allocated.5.4.2 Superfluous TPsFigure 5.16 shows that the percentage of TPs allocated during Practical TPDP solutionof the Race Track problem continued to increase after epoch 1500, when the learningtransition had certainly occurred (see Section 5.2.2) and a near optimal policy had basically been learned. In any application of Practical TPDP, TP allocation will continue(likely with a decreasing rate of allocation, see Figure 5.16) until every state in the systemconcerned has at least one TP associated with it. There are two reasons for this.The first reason for continued TP allocation is that the R-value of any TP specifying an optimal action will always be the same or higher than the Q-value of that TPChapter 5. Application of Practical TPDP 106(theoretically, see Theorem 3.6). If it is higher, then that TP will rightfully never beremoved (see Figure 4.4). If it is the same, then that TP is unnecessary (see Sections3.2.5 and 3.3.9). But any small system disturbance (noise, or a new suboptimal TP forexample) will make it higher. As a result, TPs specifying optimal actions at dormantstates (see Section 3.1.1) are rarely removed, and as new TPs specifying optimal actionsare continually added to the closed state space Sc (as a result of internal exploration,see Section 4.2.3), the number of TPs in Sc will continue to grow.The second reason for the continued allocation of TPs is that one TP is allocatedevery time external exploration occurs (see Section 4.2.3). As Practical TPDP learningprogresses, this will result in TPs being allocated at states more and more remote fromthe closed state space Sc.Continued TP allocation is undesirable because it results in Practical TPDP usingmore memory than is necessary, and that additional memory usage does not typicallyfacilitate any real improvement in system performance. In fact, continued TP allocationcan negatively affect any near optimal policies that have been learned. This can be seenin Figure 5.16, where the average track time sporadically increased after a near optimalpolicy had been learned (around 1800 epochs). Such distortion of the policy resultswhen new TPs, specifying suboptimal actions, are allocated at states between thosewhere optimal actions are already being specified by TPs. Eventually these new TPs willbe removed, or replaced by TPs specifying optimal actions, but until that happens theyperturb the policy that has been learned, increasing the costs experienced by the system.Continued TP allocation can also prevent the removal of TPs at dormant states(see Sections 3.2.5 and 3.3.9) specifying optimal actions. As described previously inthis Section, the removal of such TPs is prevented when system disturbances increasetheir R-values. The allocation of new TPs specifying suboptimal actions creates suchdisturbances, so the continued allocation of new TPs can deter the removal of older,Chapter 5. Application of Practical TPDP 107unnecessary TPs.In general, once a near optimal policy has basically been learned, there is little point inPractical TPDP allocating any more TPs. Continued allocation will have the undesirableeffects described, and while there is some possibility that each new TP can increase theoptimality of the existing policy, this is not a significant concern. This is because theaction specified by each new TP, even if it is optimal (and many will not be), can atbest slightly improve the overall policy. If the system being controlled has any inertia,and TPs have already been allocated that result in a near optimal policy, then new TPsplaced between the existing TPs can only specify short duration control adjustments thatwill have very limited effect (see Section 3.1.4). As the addition of such TPs is not reallyworthwhile, continued TP allocation should be prevented in some way.Sections 5.4.3 through 5.4.5 present a number of ways in which continued TP allocation can be prevented in Practical TPDP.5.4.3 Stopping LearningThe simplest way to prevent continued TP allocation in Practical TPDP is to arbitrarilystop the Practical TPDP learning process at some point and retain the policy whichhas been learned up to that point. Such an approach requires the ability to judge theoptimality of the existing policy so that the stopping decision can be made. Such anapproach also rules out adaptive control as it cannot be applied to a system which ischanging. Both of these requirements can be met in many potential applications ofPractical TPDP however.A variation to stopping learning is to steadily decrease the level of internal and external exploration by varying the random exploration parameters appropriately (see Section4.2.6). This approach requires some knowledge of an appropriate rate of exploration reduction.Chapter 5. Application of Practical TPDP 1085.4.4 Arbitrarily Limiting the Number of TPsAnother approach to preventing continued TP allocation is to arbitrarily limit the numberof TPs which can be allocated. Considering Figure 5.16, it would seem reasonable tolimit the percentage of TPs allocated during solution of the Race Track problem to thepercentage at which the learning transition to a near optimal policy occurred (see Section5.2.2). The TP allocation at that point was around 24%. When arbitrary TP limitationwas attempted during Practical TPDP solution of the Race Track problem, allocationlimits between 20% and 30% produced good results. At an allocation limit of 20% thelearning transition occurred sooner, and the sporadic increases in the average track timeafter the learning transition occurred were reduced. The average track time curve atthe 20% allocation limit was generally lower and flatter. This was due to the preventionof superfluous TPs, as described in Section 5.4.2. As the allocation limit was increasedfrom 20% to 30%, where the allocation of new TPs was not limited much at all (see thepercentage of allocation curve in Figure 5.16), these beneficial effects subsided.Reducing the allocation limit below 20% resulted in increased glitching (see Section5.3.3) and, eventually, severe disruptions in the learning of a near optimal policy. Withan allocation limit of 10% the learning transition (see Section 5.2.2) did not occur within2500 epochs. Figure 5.17 shows the results of a 15% TP allocation limit. This 15%allocation limit resulted in the learning transition being made at roughly the same pointat which it was made when no allocation limit was applied (a point later than that whichresulted from a 20% allocation limit). The average track time curve was lower and flatterthan that which resulted when there was no allocation limit however.Chapter 5. Application of Practical TPDP 10930 30E -F-20 20-C) =0C)H //3) /10 /ci) /> I0 00 2500Epoch NumberFigure 5.17: The Effect of Limiting TP Allocation5.4.5 Placing a Price on TPsThe rate of TP allocation can be reduced by placing a cost on the allocation and preservation of each TP. This is done by adding an allocation cost Callocation to the followingvalues upon which decisions are based in the Stack Updating Procedure (see AppendixSection B.3, Figure 4.4):1. Line 24: (Q(i3,u3) < R(i3,ne)) becomes (Q(i3,u8) + CJjocatjon < R(i3,t8))2. Line 40: (Ctotai < Q(i3,1(i3))) becomes (Ctotai + Callocatjon < Q(is, i(i3)))3. Line 50: (Vupaate < Vexpected) becomesVupdate + Callocation < Vexpected)Change 1 results in TPs being removed if their Q-value is not at least Cocatjofl lower thantheir R-value. Change 1 thereby increases the likelihood that TPs specifying optimalChapter 5. Application of Practical TPDP 110actions at dormant states will be removed (see Section 5.4.2). Changes 2 and 3 preventthe allocation of new TPs as a result of internal exploration unless the costs experiencedduring the exploratory state space transitions were COcatjOfl lower than the costs whichwould normally have been experienced.Figure 5.18 shows the result of employing an allocation cost Cajiocation of 0.1 duringPractical TPDP solution of the Race Track problem. Doing so had the effect of makingthe learning transition (see Section 5.2.2) occur sooner than when there was no allocationcost, as well as reducing the sporadic increases in average track time after the learningtransition had occurred. The average track time curve when Cocatjofl was set to 0.1was generally lower and flatter, although it did include many more glitches (see Section5.3.3). The increased number of glitches was due to the allocation cost increasing thelikelihood that TPs at zero velocity wall states would be removed (see Sectioll 5.3.3). Allof these results were due to the prevention of superfluous TPs, as described in Section5.4.2. Figure 5.18 indicates that the expected reduction in TP allocation did in factresult from the use of the allocation cost. Overall, employing an allocation cost of 0.1during Practical TPDP solution of the Race Track problem proved to be very beneficial.Allocation costs above 0.1 resulted in the learning transition (see Section 5.2.2) notbeing made before 2500 epochs had passed. This occurred because the allocation costdeterred TP allocations that were necessary for the learnillg of a near optimal policy.Extensive glitching also occurred as increasing allocation costs increased the likelihoodthat TPs at zero velocity wall states would be removed (see Section 5.3.3).The allocation cost approach to reducing TP allocation results in suboptimal policesbeing learned because it effectively makes reduction of the number of TPs part of theoptimality criteria. This prevents Practical TPDP from learning optimal policies thatare full in accordance with the optimality criteria. This effect, making TP allocationpart of the optimality criteria, may be exactly what is desired in some applications ofChapter 5. Application of Practical TPDP 11130 30a)E —IHnn nnC) =00gE10 10w-...t100 00 2500Epoch NumberFigure 5.18: The Effect of Incorporating a TP Allocation CostPractical TPDP however.5.4.6 Eliminating Suboptimal TPsAs explained in Section 4.1.2, Practical TPDP can result in more than one TP beingsimultaneously associated with a state. As learning progresses in Practical TPDP, anda near optimal policy is learned, a policy TP will be selected for each state that specifiesan optimal action (see Section 4.2.2). At this point the actions specified by other TPsassociated with the same states no longer need to be considered. As a result, the TPsspecifying suboptimal actions can be eliminated. Eliminating such suboptimal TPs willreduce the amount of memory used by Practical TPDP, and have few other effects.There are innumerable ways in which suboptimal TPs can be identified and eliminated. One approach that was attempted during Practical TPDP solution of the Race02500Chapter 5. Application of Practical TPDP 11230 30. -20 20>10 1000Epoch NumberFigure 5.19: The Effect of Eliminating Suboptimal TPsTrack problem was to randomly eliminate TPs whose Q-value was greater than the Qvalue of the policy TP. Such elimination was done randomly in order to ensure thatPractical TPDP was provided with some time (stochastically determined) in which tofully learn the Q-values of each TP. Figure 5.19 shows the results of eliminating suboptimal TPs in this manner, with each suboptimal TP having a 0.25 probability of beingeliminated each time its Q-value was updated. Figure 5.19 indicates that eliminating thesuboptimal TPs, aside from reducing the sporadic increases in the average track timeafter the learning transition had occurred (see Section 5.2.2), had almost no effect on thelearning of a near optimal policy — as compared to when elimination was not performed.The allocation of TPs was significantly reduced however.Chapter 5. Application of Practical TPDP 1136000ci)>-6PositionI Ii I — i 1111 III IStarting FinishingPosition Position(must have zero velocity)Figure 5.20: The One-Dimensional Race Track and Its Phase Plane5.4.7 TP Allocation in a One-Dimensional Race Track ProblemIn order to observe the actual allocation of TPs in Practical TPDP, it was applied to aone-dimensional Race Track problem. This track is shown at the bottom of Figure 5.20.For this problem the car was started at the starting position with a randomly choseninteger velocity from 0 to 6 (motion towards the right had a positive velocity). It thenhad to move horizontally until it arrived at the finishing position with zero velocity. Asidefrom the different track, all of the Race Track problem and Practical TPDP parameterswere kept the same as those used on the more complex track described in Sections 5.1.1to 5.1.6. As a result, the car could enter 420 states, and with 3 actions possible in eachstate, it was possible for 1260 TPs to be allocated.Above the track shown in Figure 5.20 is a phase plane that indicates the horizontalposition and velocity of the car as it moves about the track. The state of the car onthe one-dimensional track is fully described with these two values, thus facilitating theChapter 5. Application of Practical TPDP 114two-dimensional representation of the system state space in the form of the phase plane.Figure 5.20 also shows the phase plane traces resulting from typical runs of the cardown the track after 200 epochs of Practical TPDP learning of the optimal policy. Thetrace produced when the initial velocity was set to 6 (the highest initial velocity possible,and the one with the top trace in Figure 5.20) indicates that the car moved past thefinishing position and then returned to it. This was due to the fact that, with thatinitial velocity, the car simply could not decelerate in time to avoid passing the finishingposition. The optimal policy in that case was to return to the finishing position as quicklyas possible.Figure 5.21 shows the phase plane traces resulting from typical runs of the car afterPractical TPDP with a 25% TP allocation limit (see Section 5.4.4) had been applied to theone-dimensional Race Track problem for 200 epochs of learning. Comparing this figureto Figure 5.20, it is clear that the 25% TP allocation limit inhibited the learning of a nearoptimal policy. Figure 5.22 shows the average track time and TP allocation results whenPractical TPDP was applied to the problem both with and without a 25% TP allocationlimit. This figure indicates that the average track time sporadically increased once a nearoptimal policy had been learned with the 25% TP allocation limit. TP allocation limitshigher than 25% resulted in performance comparable to that which resulted when therewas no allocation limit.Both the TP allocation percentage curves (Figures 5.16 and 5.22) and the lowestfeasible TP allocation limit were higher when Practical TPDP was applied to the onedimensional track than they were when it was applied to the more complex track (seeSections 5.4.1 and 5.4.4). This was mainly due to the fact that on the one-dimensionaltrack, with only three possible actions in each state, the 25% TP allocation limit resultedin an average of 0.75 TPs being allocated per state. In contrast, the more complex track,with nine possible actions in each state, had an average of 2.25 TPs per state (at theChapter 5. Application of Practical TPDP6C-)oOa)>Position115IllI I I I Ill I I II I I I 11114 hIll IStartingPositionFinishingPosition(must have zero velocity)Figure 5.21: Limiting TP Allocation on the One-Dimensional Race Track30ci)2H0HEl)0)E210El)>00Epoch Number75-1-D50>0C)51)0D020020: ; :1:-650 100 150Figure 5.22: Performance of Practical TPDP on the One-Dimensional Race TrackChapter 5. Application of Practical TPDP 11625% TP allocation limit). The difference in the feasible TP limit was also due to thefact that there were fewer external states that did not require TPs (see Section 3.2.4),relative to the total number of states, in the one-dimensional Race Track problem.Finally, Figures 5.23 and 5.24 show the location of the policy TPs in the phase planeafter 200 epochs of Practical TPDP learning both with and without a 25% TP allocationlimit. The tails extending from each TP in these figures indicate the phase plane tracethat resulted when the system was started at the center of each TP state, the actionspecified by the policy TP was applied to it, and state transitions were allowed to continue(without the stochastic interference of noise) until another TP state was encountered.Both figures show that the policy TPs generally specified actions that directed the car tothe finishing position. Both figures also show that many policy TPs in the upper-rightand lower-left corners of the phase planes did not specify such actions. In some casesthe policy TPs in these regions actually specified actions that accelerated the car awayfrom the finishing position. This was because, when the car was in those states, it couldnot decelerate in time to prevent collision with the ends of the track. As a result, theoptimal policy was to accelerate, causing the collision to occur sooner, so that movementtowards the finishing position could begin as soon as possible.Figure 5.23 indicates that a TP was allocated for almost every state when there was noTP allocation limit. Figure 5.24 shows the more interesting TP positioning that resultedwhen TP allocation was limited.5.5 Varying the Practical TPDP Algorithm ParametersFigures 5.25 to 5.30 indicate the results, in terms of performance metric values (seeSection 5.3.4), of varying the Practical TPDP Algorithm parameters during differentsolutions of the complex Race Track problem (see Sections 5.1.1 to 5.1.6). While eachChapter 5. Application of Practical TPDPC..)0ci)>Positiont.-T.-.-i.-T .-i‘.1.- .-I. . . . . . . .1.1vJrJ.1i-I..i•‘S .11 •i..j.1w’ea..?- SW L [L f1 •1 f t £ t‘ ‘‘ Ii’ ‘E! I • ‘• ‘• t-• t-• r ‘• • ‘ ..• • • • • r rr rri ‘ • • • N t •i—t• • • i . . .. • .1-.!•. • . i • . • . .1 .1-il-i . . . .—— —+‘— —4+——+—+• •t1 • irr ‘• •• ‘r •rt’rrr ••f • ‘• • •• •j}zfff. ifT it T II I[[1 II [I I I I IFinishingPosition(must have zero velocity)117C)0ci)>60-6I I I II II — II III IIStarting FinishingPosition Position(must have zero velocity)Figure 5.23: TP Positioning in the One-Dimensional Race TrackPosition60-6I •I.•I•.•:4 .P • -- ._1.4. • • - q • • • • • • • • •.• /I[Wj I I •.LL LI J•.. . ‘.r. . ‘.1’. .‘. . ‘. ‘.‘. I / . . .— —— —I— — — .4•• •r• •.1r. .....j. A-i • -. • •i- • •IE. •E1 • • • Lt • • •j . • - • - . •• - • I [ • • • • •I. I .1—.... ..----- -i — ; — ; • — a . — S • — — — S S S S • • SStartingPositionFigure 5.24: TP Positioning with a 25% TP Allocation LimitChapter 5. Application of Practical TPDP 118parameter was varied the others were maintained at the settings specified in Section5.1.5. The parameters wjjjj, Wt and Wmax were not analyzed in this way because theseparameters, as long as they were set to reasonable values (see Sections 4.2.1, 4.2.2 and4.2.7), had little effect on the learning performance of Practical TPDP.Missing sections of the curves shown in Figures 5.25 to 5.30 indicate parameter settings for which a performance metric value (see Section 5.3.4) was not determined within2500 epochs. This means that the average track time was not determined to be less than14 more than 100 times within 2500 epochs, and that the learning transition (see Section5.2.2) had thus not occurred.As explained in Section 4.2.6, for each application of Practical TPDP add-TP must beadjusted to match the time required to pass through the high velocity states in the systemunder consideration, and chge must be adjusted so that the action changes specifiedduring internal and external exploration are made at intervals that reflect the ability ofthe system concerned to respond. As a result, both of these parameters must be tunedto suit the problems to which Practical TPDP is applied. As can be seen in Figures 5.27and 5.29 there was a fairly wide range of settings over which both parameters producedgood performance on the complex Race Track problem.The setting of the parameter 0de1ay also depends on the dynamic characteristics ofthe system to which Practical TPDP is being applied (see Section 4.2.6). Figure 5.25indicates however that the performance of Practical TPDP was fairly independent of thesetting of this parameter during solution of the complex Race Track problem.Of the various parameters, Practical TPDP was found to be most sensitive to thesetting of swap-TP (see Figure 5.26). Normally, the main concern with such sensitivitywould be that a suitable parameter setting will be difficult to determine when PracticalTPDP is applied to different problems. This is not a concern with swap-TP however,because Oswap-TP does not have to be adjusted to match the system to which PracticalChapter 5. Application of Practical TPDP 119250020001500100000- 50000 1 2 3 4Average Update Time DelayFigure 5.25: The Effect of Changing de1ayTPDP is being applied. It simply sets the balance between the evaluation of TPs alreadyassociated with each state, and the internal exploration of other TP possibilities for thosestates (see Section 4.2.6 and Appendix Section B.2). Setting Pr(JswaTp > 0) to 0.2produced good results during Practical TPDP solution of both Race Track problems,and the 0.2 setting should result in reasonable performance in most applications.Figure 5.28 indicates, uninterestingly, that the setting of addTP does not have to beprecise. Figure 5.30 indicates, unsurprisingly, that the learning rate o should be set highenough to facilitate some learning, but not so high that the results from each learningexperience can drastically alter what has already been learned.5Chapter 5. Application of Practical TPDP 1202500 I I I I.s 20001500100000- 5000I I0.0 0.2 0.4 0.6 0.8 1.0Probability of Internal Exploration Being Initiated at TPFigure 5.26: The Effect of Changing Pr(crswaTp > 0)2500 I I IS 20001500100000.. 5000I I0.0 0.2 0.4 0.6 0.8 1.0Probability of Exploration Being Initiated at Non-TP StateFigure 5.27: The Effect of Changing Pr(oaaaTp > 0)Chapter 5. Application of Practical TPDP 1212 I I I I2000150010000U)0- 5000 I0.0 0.2 0.4 0.6 0.8 1.0Probability of Exploration Mode Being ExternalFigure 5.28: The Effect of Changing Pr(oextern> 0)2500 I I I I2000150010000ci0- 5000I I0.0 0.2 0.4 0.6 0.8 1.0Probability of New Action Being Specified During ExplorationFigure 5.29: The Effect of Changing Pr(Jchange> 0)Chapter 5. Application of Practical TPDP 12225002000150010000ci)0 5000—0.0 0.2 0.4 0.6 0.8 1.0Learning RateFigure 5.30: The Effect of Changing aI IChapter 6Neural TPDPTPDP was developed in the context of neural network control. This Chapter will describehow Practical TPDP can be implemented with a neural network. Some characteristicsof this neural implementation, as well as its biological plausibility, will be discussed.6.1 A Neural Network Design for Direct DP6.1.1 Neural Networks and Evaluation Function ApproximationAs explained in Section 2.4.4, neural networks have been used extensively to parametrically approximate the evaluation function of DP controllers. Such approximation facilitates generalization of the evaluation function and combats the “curse of dimensionality”(see Section 2.4.1) by eliminating the need for the evaluation function to be representedby an individual value V(i) at each state i (see Section 2.4.4).As further explained in Section 2.4.4, the main drawback to evaluation function approximation is that the approximating mechanisms themselves typically require extensive computational effort to develop a reasonably accurate approximation. If parametricapproximation is being done with neural networks, the computational effort of approximation is that required for network training (for example Rumeihart et al., 1986).Evaluation function approximation is more difficult in Q-learning controllers, including TPDP controllers, because Q-values are a function of action as well as state, andan additional action dimension must be included in any approximation that represents123Chapter 6. Neural TPDP 124INPUTS OUTPUTQ-values (see Section 2.4.4). In the case of TPDP controllers, evaluation function approximation is inherently more duffcult because TPDP focuses on the effects of actionsinstead of on the determination of an evaluation function (see Section 3.4.3).As a result of all this, evaluation function approximation was not attempted duringthis research, neither with neural networks or with any other approach. A neural networkdesign that could directly facilitate Practical TPDP control was developed however, andit will be discussed in Sections 6.1.2 to 6.1.8. In fact, as mentioned in Section 3.1.1,this neural network design actually led to the development of TPDP. TPDP evolvedout of attempts to modify neural connections so that groups of neighboring states withsimilar control requirements could be amalgamated and represented with one neuron(Buckland et al., 1993). The general neural network design used for the attempts at suchamalgamation is what will be presented.6.1.2 A Neural Network Design for State Space ControlA generic neuron model is presented in Figure 6.31. In general terms, a neuron is adevice which accepts a number of inputs and passes these inputs through some functionto produce an output. A generic neuron model of this type can be used as the constituentelement of a controller as shown in Figure 6.32.The state lines shown in Figure 6.32 carry binary signals, and each represents aFigure 6.31: A Generic Neuron ModelChapter 6. Neural TPDPSTATE LINESCl)-Jz00Iz0C)Figure 6.32: A Neural Network Design for State Space Control125quantized interval of one dimension of the state space. As a result, each state spacedimension is represented by a group of such state lines, one for each interval of thedimension. One state line in such a group will always be high (active) while all of theothers are low (inactive), indicating which interval of the dimension the system is in. Asingle high state line in every group of state lines fully defines the state of the system.As indicated in Figure 6.32, identification neurons inspect the state lines. Each ofthem is responsible for identifying a particular state i. They output a high signal whenthe system enters the state they are responsible for. The function of the identificationneurons is to act as and-gates. They and together a single state line from each group (eachstate space dimension) to identify when the system is in the state i uniquely identifiedby that set of state lines. Only one identification neuron outputs a high signal at anyIDENTIFICATIONNEURONSgiven time because each one identifies a separate state, and the system can only be inChapter 6. Neural TPDP 126one state at a time.The driving neurons specify the actions that control the system. When appropriate,each one signals with a high binary signal that a specific action should be performed.The action specified by each driving neuron u is always the same, and it is different fromthose specified by the other neurons. To prevent more than one driving neuron fromspecifying an action at the same time, they are connected to mutually inhibit each other(see Figure 6.32, Feldman et al., 1982). The inhibiting signals are randomly perturbedslightly to resolve deadlocks between driving neurons that are equally active’.The outputs of the identification neurons are connected as inputs to the drivingneurons, and the point of connection is called a synapse2.Whenever one of the inputs toa driving neuron is high, that neuron is active and is capable of outputting a high signalto specify an action. If it is the victor of a mutual inhibition contest with any otherdriving neurons that may be active, it is able to do so. The driving neurons thus actas (mutually inhibiting) or-gates, allowing a number of different identification neurons(only one of which outputs a high signal at any given time) to specify the same action.The connections between the identification neurons and the driving neurons— thesynapses — define how the system is controlled. When a system state is identified, ahigh signal is passed from the corresponding identification neuron, through one or moresynapses, to a number of driving neurons. The driving neurons that are activated asa result engage in a mutual inhibition contest to determine which action will actuallybe specified. If each identification neuron is connected to only one driving neuron, theresulting network operates in exactly the manner of the control paradigm described inSection 2.1.1.1A11 active driving neurons are equally active in the neural network design described up to this point.This will not remain true as the design is more fully developed however.2The input connection points on the identification neurons are not, by this definition, synapses. Thereason for this is presented in Section 6.1.5.Chapter 6. Neural TPDP 127Overall the neural network performs an and-or function. Similar network designshave been investigated by others— Minsky (1985) and Ayestaran et at. (1993) are twoexamples. Further, the mutual inhibition between the driving neurons results in a formof competitive learning (Rumeihart et at., 1985).6.1.3 ACAM Operation of the Neural Network DesignA controller implemented with a neural network in the manner being described is inherently a memory-based controller (see Section 2.3.2). Specifically, it operates as anACAM (see Section 2.4.2). The identification neurons act as self-activating table entries,specifying through the driving neurons3 the action that is appropriate for the state towhich they attend. Because they are self-activating, the number of identification neuronsallocated need only be the number required to specify an action in each state that thesystem actually enters. As a result, the total amount of memory required for control canbe reduced below what would be necessary if a full table were employed. This is one ofthe main rationales for implementing a controller with this type of neural network.6.1.4 Implementing Mutual InhibitionThere are a number of ways in which signal level determination can be made in a mutuallyinhibiting neural network controller. This Section will describe a simple and effectiveheuristic approach (developed during this research, although similar to work done byothers — for example Feldman et at., 1982) that facilitates parallel computation, in eachof the driving neurons, of the mutual inhibition contest victor. It is based on the followingupdate equation for the driving neuron output mt(u) of each driving neuron u at time3The number of driving neurons will always be fixed — there must be one for each possible action.Chapter 6. Neural TPDP 128step t:mt+i(u) = min(1,max(kat(u),m(u) + + mt(u) — mt(v)j)) (6.47)vEUWhere at(u) is the activity level of driving neuron u at time step t, ic is the low-gradeoutput factor, b is the updating rate, and U is the set of actions specified by all of thedriving neurons. Normally i should be set to around 0.1 to keep the low-grade neuronoutput low and unintrusive.The activity level at(u) is the result of or-ing together the inputs to each drivingneuron u (see Section 6.1.2). It is the signal level that each driving neuron would outputif it was not being mutually inhibited by the other driving neurons. The activity levelat(u) is multiplied by the low-grade output factor i to determine a weak low-grade outputlevel (see Equation 6.47) that each driving neuron should output to indicate that it isactive (see Section 6.1.2).When all of the driving neuron outputs mt(u) are at a low level, updating Equation6.47 will increase all of them. The output level of each will continue increasing until,for each neuron, + rn(u) — > m(v) <0. When each driving neuron reaches thisvEUbalancing point its output Tflj(U) will begin decreasing. The output level of each drivingneuron at the beginning of the mutual inhibition contest determines how soon each neuronreaches its balancing point. The lower the initial output level of each neuron, the soonerit will reach its balancing point.As the mutual inhibition contest continues, all but one of the driving neurons willreach a balancing point. This neuron will be the mutual inhibition contest victor, andits output level will continue increasing until it reaches 1. Equation 6.47 ensures thatthis high level in the victorious driving neuron results in all of the other driving neuronoutputs being reduced to their low-grade output level Kat(u).For Equation 6.47 to produce these results, it must be ensured that the combinedChapter 6. Neural TPDP 129low-grade output levels of all the inhibited driving neurons are not enough to affect theneuron output mt(uvjctor) of the mutual inhibition contest victor. This is ensured when(from Equation 6.47):+— [ mt(v)Iworstcase] 0 (6.48)vEU+1-[(IUI-1)+1j 0 (6.49)2(IUI— 1) (6.50)Where UI is the number of driving neurons.The driving neuron which emerges victorious from the mutual inhibition contest willalways be the one that had the highest output level mt(u) at the start of the contest. Asthe starting level equals the low-grade output level lat(u), and t is a constant, the victorof the mutual inhibition contest will always be the driving neuron u with the highestactivity level at(u). Since all active driving neurons will have the same nominal activitylevel of 1, at(u) is actually determined by multiplying the output from the driving neuronor-ing function by a uniformly distributed random value ranging from 0 to 1. Randomlyperturbing the activity levels in this manner ensures that the mutual inhibition contestvictor will be chosen randomly from amongst the active driving neurons.6.1.5 Synapses as TPsAn optimal policy (see Section 2.1.2) is manifested ill the neural network controller beingdescribed (see Figure 6.32) by making the right connections between the identificationneurons and the driving neurons. That is, the right set of synapses must be determinedfor the network. Practical TPDP can be used to determine what this set should be.To facilitate Practical TPDP learning of optimal policies in the controller being described, each synapse is treated like a TP. Q-values Q(i, u), R-values R(i, ‘a) and weightsChapter 6. Neural TPDP 130w(i, u) are associated with each synapse and are used together to determine the merit ofthe single state/action combination that the synapse defines. For a given synapse/TP,the state i associated with that TP is the state attended to by the identification neuronoutputting to the synapse, and the action u that the TP specifies is the action specifiedby the driving neuron to which that synapse is attached. Because synapses are essentiallyTPs in this context, the term “synapse” will henceforth refer to TPs4.The Q-value and R-value of each synapse reflect the costs experienced each time ahigh signal passes through that synapse (Section 6.1.6 will describe this more fully). Theweight value w(i, u) of each synapse is modified based on the Q-value and R-value ofthat synapse (see Section 4.2.1), and indicates the integrity of the synapse. A low weightindicates a new, untested synapse, or a synapse that has been found to lack merit. Ahigh weight indicates a synapse that has been found to be worthwhile. When the weightof a synapse is decremented to zero that synapse is removed. The synapse weights arenot used to modulate the signals going through the synapses in any way.6.1.6 The Full Implementation of Neural TPDPSection 6.1.5 presented the idea that Practical TPDP can be used to learn optimal policiesin the neural network controller being described if synapses are treated like TPs. Thisneural implementation of Practical TPDP is called Neural TPDP. Neural networks canbe used to implement Practical TPDP without the basic functionality of the PracticalTPDP Algorithm being affected in any way (see Sections 4.2.4, Appendix Sections B.2and B.3, and Figures 4.3 and 4.4). That is, a Neural TPDP controller will functionexactly like a Practical TPDP controller, with the same Q-value, R-value and weightupdates being made in what is effectively just a different processing environment. This4This is why the term “synapse” is not applied to identification neuron input connections — “synapse”has a special meaning related to TPs that does not apply to such connections.Chapter 6. Neural TPDP 131Section will explain in detail how such is done.Basically, all states identified by identification neurons are TP states (see Section3.2.1). The identification neurons that identify such TP states always output to at leastone synapse. To ensure that action specifications are maintained as the system movesthrough non-TP states between TP states, which is a fundamental requirement of TPDP(see Section 3.1.1), the driving neurons must have some positive feedback. Such positivefeedback ensures that, once a driving neuron has won a mutual inhibition contest (seeSection 6.1.4), it will continue to specify the same action until another mutual inhibitioncontest occurs. If Equation 6.47 is used to determine the mutual inhibition contestwinner, the required positive feedback occurs.Movement of the system through the state space occurs as a result of successivemutual inhibition contests between the driving neurons (see Section 6.1.2). Each suchcontest is initiated when the current mutual inhibition contest victor flash reduces itsoutput from 1 to its low-grade output level Iat(u) (see Equation 6.47). Mutual inhibitioncontests are initiated when the system is in a TP state, but not at every TP state. Asexplained in Section 4.2.5, the random parameter de1ay is used to maintain the sameaction specification while the system passes through a number of TP states. This isdone so that the costs resulting from each action specification can be observed over aninterval of reasonable length. Delayed updating is facilitated in Neural TPDP by flashreducing the output of the mutual inhibition contest victor (see Section 6.1.6) only whenthe system is in a TP state and t > tupdate (see Section 4.2.5). The former condition isindicated to the mutual inhibition contest victor by the presense of at least one non-zerolow-grade signal on its mutual inhibition inputs (see Section 6.1.4).The Q-value of each synapse is updated (using Equation 3.18) when that synapsebecomes active and the driving neuron that it is connected to is the mutual inhibitioncontest victor. When the Q-value of a synapse is updated, the weight of that synapse isChapter 6. Neural TPDP 132updated as well (according to the rules of Section 4.2.1). The R-value of each synapse isupdated (using Equation 3.27) when that synapse becomes active and the driving neuronthat it is connected to is a mutual inhibition contest loser. In this way the evaluationof each synapse is performed according to the Practical TPDP Algorithm (see AppendixSection B.2).Internal and external exploration (see Section 4.2.3) are facilitated by using the random exploration parameters0swap-TP and add-TP to determine when exploration shouldoccur (Section 4.2.6). To initiate exploration the activity level at(u) of each driving neuron u is set to a uniformly distributed random value between 0 and 1. This ensures thateach action specification will be chosen with equal probability during exploration. Whenthis is done the output of the mutual inhibition contest winner is also flash reduced.Action specification changes during exploration are facilitated by using the random exploration parameter0change to determine when new mutual inhibition contests should beinitiated in the same manner.The random exploration parameter externa1 is used to select between internal andexternal exploration when exploration is initiated in non-TP states (see Appendix SectionB.2, Lines 50 to 55, and Section 4.2.6).If the results of internal or external exploration indicate that new TPs should beallocated (see Section 4.2.3), such is performed by adding synapses that connect theidentification neurons with the driving neurons.6.1i Allocating Identification NeuronsAs well as being able to add synapses that correspond to the allocation of new TPs,Neural TPDP must be able to allocate new identification neurons. This is necessarywhen the first TP associated with a state is allocated (see Figure 6.32). Similarly, NeuralTPDP must be able to deallocate identification neurons whell the last TP associatedChapter 6. Neural TPDP 133with a state is removed. Driving neurons need not be allocated or deallocated becauseeach of them permanently specifies a different control action.An interesting alternative Neural TPDP design that does not require identificationneuron allocation involves making the driving neurons more complex so that they canperform the state identification function. The driving neurons can be modified to operateas and-or gates, first and-ing together state lines to identify individual states, and thenor-ing together the results to combine the identification signals in the usual manner(see Section 6.1.2). Similar neuron designs have been investigated by others for entirelydifferent applications (for example Feldman et al., 1982). If driving neurons of this typeare employed, identification neurons are not required at all. The price paid for thissimplification is that the mechanisms required to identify each state must be replicatedat every synapse related to that state.6.1.8 Parallel Processing and Neural TPDPAside from the fact that Neural TPDP operates like an ACAM (see Section 6.1.3), animportant rationale for using Neural TPDP is that such an approach facilitates theparallel processing possible with neural networks. To do so a universal “stack” is notused, as described in Section 4.2.4, to store the information necessary for Q-value and Rvalue updating. Instead, the relevant information is stored locally by each neuron so thatupdating can occur in a parallel fashion. Some information still has to be distributed toall of the neurons however. This information includes the immediate costs experienced,the mode of exploration, the fact that an update should be made, and the evaluationfunction value V(i) of the state i that the system is in.Chapter 6. Neural TPDP 1346.2 Analysis of Neural TPDP6.2.1 The Localized Operation of Neural TPDPAs described in Section 6.1.2, Neural TPDP fulfills its function as a controller throughthe connections between the identification neurons and the driving neurons (see Figure6.32). It is thus the connectivity of the Neural TPDP network that matters the most.This flexible connectivity is altered based on the synapse weights w(i, n), which indicatethe integrity of each connection. This type of operation is very different from conventional neural networks (the popular error propagation method for example, Rumeihartet al., 1986, Fahiman et al., 1987), which are normally fully interconnected with fixedconnections, and which use their weights to modulate the signals passing through eachconnection. The neurons in such networks work together to process the signals input tothe network. In contrast, only one identification neuron and one driving neuron in NeuralTPDP output a high signal at any one time. Neural TPDP thus functions using localizedoperation (Baker et al., 1992, Schmidhuber, 1990b). Other neural network designs thathave localized operation include CMAC (Albus, 1975a, 1975b, Kraft et al., 1992) andradial basis function networks (Poggio et al., 1990, Girosi et al., 1989).Related to the localized operation of Neural TPDP is the fact that identificationneurons and driving neurons output binary signal levels. This is different from thecontinuous signal levels output by most neural networks.The main disadvantage of localized neural network designs is that such networkscannot readily make generalizations (Jacobs et al., 1991). Conventional neural networksin contrast excel at making generalizations (Baker et al., 1992). The localized operation ofNeural TPDP precludes the formation of generalizations between states and the actionsspecified in those states5. Section 6.2.2 will present a way in which generalization by5This is different from the evaluation function approximation that Practical TPDP also precludes (orChapter 6. Neural TPDP 135copying (see Section 5.3.1) can be facilitated with Neural TPDP however.The main advantages of localized neural network designs are that repeated trainingsessions are not required to facilitate generalization (Baker et al., 1992), and that learningin one region of the input to output mapping does not affect other parts of the mapping(Baker et al., 1992, Ayestaran et al., 1993). In conventional neural network designs, generalization results from the network learning, through repeated presentation of trainingvectors, how it should adjust its weights to produce the desired input to output mappingfor all of the training vectors.6.2.2 Generalization by Copying in Neural TPDPSection 5.3.1 described how generalization by copying can be facilitated in PracticalTPDP by biasing the selection of the random actions attempted during exploration (seeSection 4.2.4 and Appendix Section B.2, Lines 20 to 34, 40 to 43, and 50 to 55) towardsactions already specified by TPs in the state space vicinity around the current state.This type of generalization is facilitated in Neural TPDP by modifying the identificationneurons so that they operate as enhanced and-gates. The goal of this modification is tohave each identification neuron output a low-grade signal when the system is in a stateclose to that which it identifies. Such low-grade signals are passed through synapsesto the driving neurons connected to each partially activated identification neuron, andindicate to those driving neurons that they should modify their activity levels at(u) tobias the random action selection (see Section 6.1.6).There are many ways in which low-grade signals can be produced in the identificationneurons. One way is to define the identification neuron function as follows:Ot()= [J [(1 — 6)z(l) + 6] (6.51)1EL(i)at least makes very difficult — see Sections 2.4.4 and 6.1.1).Chapter 6. Neural TPDP 136Where Ot() is an identification neuron output at time step t, L(i) is the set of state linesconnected as inputs to identification neuron i, Zt(l) is the signal level of each state line1 E L(i) at time step t, and 6 is the desired low-grade signal level.Equation 6.51 results in the output of the identification neurons being 69ow, whereni0 is the number of input state lines 1 L(i) which have low signal levels. If 6 is setto 0.1, and the driving neurons bias their activity levels at(u) only when they receivelow-grade input signals of at least 0.1, generalization by copying will result between allimmediately adjacent states.Once it has been determined that a driving neuron should bias its activity levelduring random action selection (see Section 6.1.6), a biasing technique must be employed.Normally when mutual inhibition contests are used to randomly select actions duringexploration the activity levels at(u) of each driving neuron are set to a random valuebetween 0 and 1 (see Section 6.1.6). To bias the random action selection, bjas randomvalues between 0 and 1 are generated for each driving neuron that has a sufficient low-grade signal, and the highest one is used as the activity level at(u) for that neuron. Thisincreases nbI-fold the likelihood of each driving neuron biased in this manner being thevictor of the mutual inhibition contest.6.2.3 Elemental and Composite ActionsThe neural network implementation of Practical TPDP brings to light one importantconsideration in TPDP control, and in control in general. In the Race Track problempresented in Sections 5.1.1 through 5.1.3 the possible actions U(i) in each state i wereu = un], where both u and u, were in the set {—a, 0, a}, and where a was a constant.Each action was thereby a composite action constructed from simpler elemental actions.That is, each action u consisted of an elemental action ur in the x direction, and anelemental action u, in the y direction. In real controllers there is often a limit to theChapter 6. Neural TPDP 137number of composite actions that can be constructed out of elemental actions. Whenthat limit is reached the action specification in each state must be made using a numberof elemental actions or less complex composite actions.In conventional Q-learning controllers (see Chapter 2) the amount of memory availablelimits the complexity of the composite actions. The amount of memory required forcontrol is proportional to the number of composite actions that can be constructed fromthe elemental actions, and this number grows exponentially with the number of elementalaction dimensions. Basically, a Q-value must be stored for each composite action ineach state. As a result, the amount of memory available determines how complex thecomposite actions can be.Other types of memory-based controllers (see Sections 2.3.2 and 2.4.2), ones which donot allocate a memory entry for each state/composite action combination, do not sufferfrom exponentially increasing memory requirements as the number of elemental actiondimensions grows. An example of such memory-based controllers would be a PracticalTPDP controller operating as an ACAM (see Section 2.4.2). Such a controller wouldonly allocate memory entries to associate a small number of composite actions with eachstate (often just one), and it may not associate any actions with some states. In the caseof Neural TPDP, each memory entry would correspond to a synapse.As an ACAM control approach (see Section 6.1.3), Neural TPDP does not experienceexponentially increasing memory requirements as the number of elemental action dimensions grows. But the number of composite actions possible with Neural TPDP is limitedfor another reason. This reason is that Neural TPDP uses one driving neuron for eachcomposite action, and having too many such driving neurons would make implementation of the neural network impractical. The same problem does not occur when PracticalTPDP is implemented using tables (see Section 2.3.2), because each table entry can consist of an arbitrary list of elemental action specifications defining a composite action. AsChapter 6. Neural TPDP 138a result, the number of possible composite actions is limited only to the number of TPsin Practical TPDP.Aside from memory usage concerns, complex composite actions can make the learningof an optimal policy more difficult in any controller. If, for example, a TPDP controlleris presented with the task of learning an optimal snowboarding policy, the compositeactions available could include hj, a hip elemental action combined with a left thumbelemental action, and Uhf, the same hip elemental action combined with a right thumbelemental action. The TPDP controller would then have to determine the Q-values, Rvalues and weights for composite actions uhi and Uhr. Clearly it would be more effectiveto learn these values for only the hip elemental action uh.Conversely, separating complex composite actions into elemental actions (or intosmaller composites of elemental actions) results in credit assignment (see Section 2.3.4)being made more difficult. If an undesirable snowboarding outcome occurs, it must bedetermined whether the hip elemental action uh or the torso elemental action u wasresponsible for this result. This determination is difficult, and it is normally made usingextensive observation of the outcomes when various combinations of elemental actionsare attempted. It may be that TPDP, with it Q-values and R-values, is very well suitedto credit assignment determination. This is because these values indicate not only theresults incurred when an elemental action is taken, but the results incurred when it isnot taken. This extra information, which few learning controllers make use of, may beof great use in assigning credit properly. This issue requires further research.Chapter 6. Neural TPDP 1396.3 Biological Plausibility6.3.1 A Possible Model of Biological Movement ControlNeural TPDP is, arguably and speculatively, a plausible model of biological movementcontrol and refinement. Specifically, Neural TPDP can be viewed as a model of howmuscle control signals are learned and generated at the lowest levels of control in thebrain. Neural TPDP functions by making synaptic connections between identificationneurons that identify states and driving neurons that output action signals. Each suchsynapse is either maintained or discarded based on evaluations of whether or not thestate/action association it makes improves the control of the system. These evaluationsare made in a way that solves the credit assignment problem (see Section 2.3.4) — anecessity in biological control. Further, the action specified by each synapse is maintaineduntil another state is encountered in which an action is specified. As a result, a limitednumber of synapses can define entirely how the system is controlled, and these synapsescan be modified based on observations of the system response. This is a very simple buteffective approach to movement control and refinement, and as such it may be a plausiblemodel of biological control.Beyond the intuitive appeal of Neural TPDP as a model of biological movementcontrol, Neural TPDP has the following specific attributes which add to its plausibilityas a biological model. Many of these attributes were discussed in relation to Q-learningand the theoretical form of TPDP in Sections 2.3.1, 2.3.2, 2.3.3, 2.3.5, 3.4.1 and 3.4.2.Minimal Use of MemoryAs a direct DP control approach, Neural TPDP avoids the extensive use of memory thatis required for indirect DP control (see Sections 2.2.2 and 2.3.1). The memory savingsinherent in Neural TPDP go much further however. Because Neural TPDP operates as aChapter 6. Neural TPDP 140ACAM (see Sections 2.4.2 and 6.1.3), it only allocates memory at those states which areactually entered during control of the system. It does not require a full, multi-dimensionaltable to produce the non-linear control advantages that result from memory-based control(see Section 2.3.2). Further, because of the way in which actions are specified and thenmaintained in TPDP (see Section 3.1.1), Neural TPDP can achieve optimal control byallocating memory at only a limited number of uniform region boundary states (seeSection 3.4.2).Reduced memory usage in Neural TPDP means that the number of synapses requiredfor control are reduced. By making a limited number of connections between identification neurons and driving neurons, Neural TPDP can achieve optimal control. Further,because reduced memory usage increases the rate of DP learning (see Section 2.4.1),reduced numbers of synapses increase the rate of learning in Neural TPDP.Rapid Learning of Preliminary PoliciesSection 5.2.2 illustrated that one of the main attributes of Practical TPDP is that itlearns effective preliminary policies very rapidly. Neural TPDP, as a way of implementingPractical TPDP, shares this attribute, and it is an important one in terms of makingNeural TPDP a plausible model of biological movement control. This is because theearly learning of reasonably effective policies has obvious survival value for biologicalsystems.In many control applications, biological and otherwise, the ability to rapidly learneffective preliminary policies may be the most important attribute of a controller. Oncesuch a preliminary policy is learned, it can be refined and optimized as more learningeffort is exerted. As an example, consider a casual and a professional snowboarder.The former wishes to rapidly acquire a level of competence in the sport by learning aneffective but suboptimal policy. The latter wishes to highly refine his skills in the sport,Chapter 6. Neural TPDP 141and is willing to go through considerable training effort to do so. In the context of NeuralTPDP, the professional performs much more extensive exploration, and allocates a largernumber of synapses as he seeks to learn an absolutely optimal policy.Localized Operation and Reinforcement LearningAs explained in Section 6.2.1, the neurons in Neural TPDP operate in a localized way.Further, Neural TPDP, as a form of Q-learning, is a reinforcement learning approach(see Section 2.3.3). These two characteristics of Neural TPDP mean that it can learnoptimal policies with each neuron operating in parallel, and with only limited amounts ofinformation being passed between the neurons (other than the direct identification neuronto driving neuron synaptic signaling). Such localized operation and limited informationpassing are necessary for any plausible model of biological movement control (Alkon,1989, Alkon et al., 1990, Schmidhuber, 1990a). Neural network designs which do notoperate in a localized way, like the ubiquitous error propagation method (Rumelhart etal., 1986), require learning supervisors and complex information passing routes that areunlikely to exist in biological movement controllers.The only information that the neurons in Neural TPDP require from external sourcesis the state information and a universally distributed reinforcement signal. The reinforcement signal indicates the immediate costs c(i, u) incurred as the system moves about thestate space.The internal information that Neural TPDP passes includes the mode of exploration,the evaluation function value V(i) of the state i that the system is in, and the factthat an update of Q-values, R-values and weights should be made. The last item ofinformation can be inferred indirectly by each driving neuron if they observe the actionspecification changes indicated by their mutual inhibition inputs (see Figure 6.32), andcombine that information with their knowledge of the exploration mode (see AppendixChapter 6. Neural TPDP 142Section B.2 and Figure 4.3).In summary, aside from the explicit network connections shown in Figure 6.32, NeuralTPDP operates with only the following information being passed between the neurons:the immediate costs c(i, u), the mode of exploration, and the evaluation function valueV(i) of the state i that the system is in. All three of these items of information are universally distributed, and can thus be equated to hormone levels in a biological movementcontroller. As a result, the neurons in Neural TPDP operate in a highly independentmanner. In contrast, most other neural network designs involve complex supervisorysystems and information flows that do not seem biologically plausible6.Continuous State Space ControlAs explained in Section 3.1.4, TPDP operates best in continuous state space controlapplications. As biological movement controllers operate entirely in continuous statespaces, this attribute of Neural TPDP is an important part of its plausibility as a modelof biological control.Adaptive ControlAs a form of Q-learning, Neural TPDP is an adaptive control approach (see Section2.3.5). This attribute is an essential one in any model of biological movement controlbecause biological controllers are clearly adaptive.Composite Action LearningAny plausible model of biological movement control must have the ability to learn optimal policies using elemental actions, or partial composites of elemental actions (see6Fidelity to biological systems is admittedly noi the goal of most artificial neural models however.Practical application on non-biological problems is instead the concern.Chapter 6. Neural TPDP 143Section 6.2.3). This is because biological systems, with their enormous number of elemental actions, could not possibly operate using composite actions constructed fromevery combination of elemental actions. Unfortunately, although Neural TPDP seemswell suited to learning with elemental actions, this capability has not yet been thoroughlyinvestigated.6.3.2 Increasing the Localized Operation of Neural TPDPOne aspect of Neural TPDP which has not been discussed up to this point is how thepolicy synapse (the policy TP) (i) is determined for each state i. There are a number ofways in which policy synapses can be determined, but all are rather involved. One methodof determination will be presented in this Section, and then a simplifying approach toNeural TPDP which does not require policy synapse determination will be described.This simplification will be combined with an additional simplification to increase the independent operation of the Neural TPDP neurons. Such increased independent operationmakes Neural TPDP even more biologically plausible.One way to determine the policy synapse is to have each identification neuron ioccasionally output some sort of special signal (a bursty one perhaps) when the system isin the state i identified by that neuron. This special signal indicates to all of the drivingneurons activated by that identification neuron that they should modulate their activitylevels at(u) not with a random value (see Section 6.1.4), but with 1 — ‘ . ThisQmax-possiblemodulation will result in the mutual inhibition contest victor being the driving neuron uwhose synapse has the lowest Q-value Q(i, u). This result, which indicates which synapseis the policy synapse z(i) in state i, can then be noted at all network locations wherethis information is of consequence.To avoid the complexity of policy synapse determination, Neural TPDP can be modified to operate without policy synapses (i), or the evaluation function values V(i)Chapter 6. Neural TPDP 144determined from them; V(i) = Q(i,1t(i)). This is done, at each state i where an updateis performed, by updating not with V,.(i) but with the Q-value Q(i, u) of the action uthat is actually specified at that state (see Appendix Sections B.2 and B.3).A Neural TPDP implementation difficulty similar to that of policy synapse determination is the problem of informing each synapse as to whether or not other synapses areassociated with the same state that it is. This synapse allocation knowledge can be acquired by each synapse through inspection of the mutual inhibition inputs to the drivingneuron to which they are attached. But this determination is again rather involved. Toavoid this complexity, the use of synapse allocation knowledge can also be eliminatedfrom Neural TPDP.Eliminating both the use of policy synapses and the use of synapse allocation knowledge requires changes to the Practical TPDP Algorithm, as well as to the Stack UpdatingProcedure it calls (see Section 4.2.4, and Appendix B). The necessary changes are presented in Figures 6.33 and 6.34.Regarding the changes, when internal exploration is initiated at a TP state, a policyaction is first selected, and then a random action is reselected (Lines 22 and 25 of themodified Practical TPDP Algorithm, see Figure 4.3). This is done so that stack updatingcan be performed using a Q-value of an existing TP (Line 23). If this were not done,the randomly chosen exploration action might not have a TP associated with it, and thestack could not be updated. The other algorithm modifications will not be explained indetail as they are fairly straight-forward.Figure 6.35 presents the results of using the modified Practical TPDP Algorithm tosolve the Race Track problem described in Sections 5.1.1 to 5.1.3 (using the parametersdefined in Sections 5.1.4 and 5.1.5). One further modification had to be made to thePractical TPDP Algorithm to obtain these results. The random selection of TP actionshad to be biased towards TPs with high weight values w(i, u) when no exploration wasChapter 6. Neural TPDP 1451. randomly choose starting state so E Ss2. choose a starting action u0 E U(so)3. if (state has a TP): ‘none’ = explore4. otherwise: ‘external’ = explore5. 0 t, 0 tupdae10. while s is not an absorbing state SA:11. if (state St .St—T) or (state St = 5t—T for time Tt&):20. if (state t has a TP):21. if (t7swap-TP > 0):22. randomly choose action Ut from TP actions in U(s)23. update-stack( explore, t, Q(st, Ut), Vexpected)24. Q(st,ut) Vexpecte25. randomly re-choose action Ut E U(st)26. ‘internal’ ‘ explore27. push-on-stack(t, 5, Ut, 0)28. otherwise if (explore # ‘none’) or (t > tupdate):29. randomly choose action ‘Ut from TP actions in U(s)30. update-stack( explore, t, Q(st, Ut), Vexpectea)31. Q(st,u) Vexpecteci32. ‘none’ =- explore33. t + °ieiay =‘ tupdate34. push-on-stack(t, 5, Ut, ‘true’)35. otherwise: push-on-stack(t, 5, Ut, ‘false’)40. if (state s, has no TP) and (explore ‘none’) and (Uchange> 0):41. if (explore = ‘external’): flush-stack42. randomly choose action Ut E U(s)43. push-on-stack(t, 5, Ut, 0)50. if (state st has no TP) and (explore = ‘none’) and (UacldTp > 0):51. flush-stack52. if (cTexternai> 0): ‘external’ = explore53. otherwise: ‘internal’ explore54. randomly choose action Ut E U(s)55. push-on-stack(t,.St, Ut, 0)60.61. (Vexpected — C(St, Ut)) Vexpected62. observe system for new state s63. update-stack(explore, t, St, Vexpected)Figure 6.33: The Localized Operation Practical TPDP AlgorithmChapter 6. Neural TPDP 146[Parameters passed: explore, t, Vupdate, Vexpecteci]1. while (time at top of stack t, t): pop-off-stack(t,, is, u,, specified,)2. Vupdate :=. Ctotai10. while (there are entries in stack):11. pop-ofF-stack(t,, i,, u,, specified,)12. while (t > t,):13. 7Ctota.i + c(i,, u,) Ctotai14.20. if (explore = ‘none’):21. for (each TP action u€U(i,) in state i,):22. if (u =23. (1 — cr)Q(i,, u,) + cCtotaj Q(i,, u,)24. if(Q(i,,u,) > R(i,,u,)):25. w(i,, u,) + 1 = w(i,, u,)26. if (w(i,, u,) > wmax): Wmax ‘ w(i,, u,)27. otherwise:28. w(i,, u,) — 1 = w(i,, u,)29. if (w(i,, u,) 0): remove the TP30. if (specified, = ‘false’) or (u u,):31. (1 — c)R(i,, u) + cvCt0t,,ci R(i,, u)40. if (memory can be allocated) and[((explore = ‘internal’) and Vupciate < Vexpected)) or (explore = ‘external’)]:41. allocate a new TP at state i, with action U,42. Ctotai =‘. Q(i,,u,) = R(i,,u,)43. = w(i,, u,)Figure 6.34: The Localized Operation Stack Update ProcedureChapter 6. Neural TPDP 14730EF0Fa)G)>100 2500Epoch NumberFigure 6.35: Performance of Practical TPDP with Increased Localized Operationoccurring. This was necessary to ensure that the Q-values used for stack updating werepredominantly ones whose high weight values indicated that they were associated withnear optimal actions. In effect, this biasing resulted in evaluation function values beingused for the updating.To bias the random action selection, w(i, u)2 random values from 0 to 1 were generatedfor each driving neuron u, where w(i, u) was the weight of the active synapse attached toeach driving neuron. The highest of these values was then used to modulate the activitylevel at(u) of that driving neuron. This resulted in the probability of each action beingselected being proportional to the square of w(i, u) (see Section 6.1.6).Figure 6.35 indicates that the modified Practical TPDP Algorithm performed veryChapter 6. Neural TPDP 148well. The learning transition (see Section 5.2.2) occurred later than it did with conventional Practical TPDP, but the average track time curve was generally flatter. Considering how little information was passed between neurons in the modified algorithm, it issurprising that it performed as well as it did.In Section 6.1.7 it was suggested that Neural TPDP could be made simpler if identification neurons were not used at all, and the driving neurons were instead designed tooperate as and-or gates. The driving neurons could then perform the state identification.As described in Section 6.1.7, this design change would eliminate the need to determinewhen identification neurons should be allocated and deallocated. This change would alsoincrease the independence of the neurons in Neural TPDP because any identificationneuron allocation mechanism would require synapse allocation knowledge. Using a network design that does not employ identification neurons would eliminate the need forthis knowledge, as well as the information routes that pass it.Chapter 7Practical TPDP as Part of a Complete Hierarchical ControllerThis Chapter will describe how Practical TPDP can be incorporated into a completehierarchical controller. The attributes and capabilities of such a controller will also bepresented.7.1 Practical TPDP Facilitates Lower Movement ControlIt was speculated in Section 6.3.1 that Neural TPDP may be a plausible model of lowermovement control in biological systems. If such is the case, the question that arises is howNeural TPDP would function as part of a complete hierarchical controller. This Chapterwill discuss this issue, and while it will do so frequently in the context of biologicalcontrollers, the arguments presented could be equally well applied to the incorporationof Practical TPDP in any hierarchical controller. The terms “Practical TPDP” and“Neural TPDP” will therefore be used interchangeably in this Chapter, with the latterindicating a biological controller context.7.1.1 Context LinesInherently, to function as a low level movement controller within a hierarchical controller(Albus, 1983, 1988, 1991), Neural TPDP must be able to perform more than one controltask on the same system. If it cannot, the hierarchical control layers above it, whichdecide which specific low level tasks should be performed, have no purpose.An effective way to facilitate the performance of more than one task with the same149Chapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 150Neural TPDP controller is to have context lines (similar to input from the “plan units”in Massone et al., 1989) descending from higher levels of control. Each context lineis associated with a single task, and when higher levels of control have decided thata particular task should be performed, the appropriate context line is made high. Asonly one task can be performed at a time, only one context line can be high at a time.Context lines are used with state lines (Singh made a similar combination, 1991b) asinput to the identification neurons of a Neural TPDP controller (see Section 6.1.2 andFigure 6.32). A single context line is used as an input to each identification neuron, andbecause of the and-ing function of these neurons, it will act as an identification “enabler”.By connecting the same context line to all of the identification neurons involved in eachcontrol task, the context lines can be made to act as task enablers, preparing completesets of identification neurons to make the state/action associations (through their drivingneuron synaptic connections — see Section 6.1.2) appropriate for each task.As Neural TPDP learning progresses, the set of identification neurons enabled by eachcontext line learns the optimal policy for the task associated with that context line. Thislearning is based on the immediate costs incurred whenever that context line is high. Itis therefore assumed that there is some consistent relationship between when the higherlevels of control activate each context line, and the situation that the system is in (asreflected in the immediate costs incurred). For example, it is assumed that the higherlevels of control would not indicate a snowboarding task when the system is relaxing onthe beach and requires a beer.If Practical TPDP is implemented as a memory-based controller (see Section 2.3.2),the context line function can be included in the controller by making use of additionaltable dimensions.Chapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 1517.1.2 A Sketch of the Complete Hierarchical ControllerThe complete model of a controller incorporating Neural TPDP is that of a hierarchicalcontroller where the abstract control decisions are made, based on sensor information,at higher levels of control. As a result of these abstract control decisions, a context lineis activated that enables a set of identification neurons. These identification neurons areindividually activated based on state line signals, and specify control actions through thedriving neurons (see Section 6.1.2 and Figure 6.32). The set of identification neuronsthus enabled is the set which makes the appropriate state/action associations for thecontrol task indicated by the context line. Neural TPDP learns these associations byobserving the costs incurred whenever each context line is high, and by modifying itssynapses accordingly (see Sections 6.1.2 and 6.1.5).7.2 Using Higher Level KnowledgeIn a hierarchical controller incorporating Neural TPDP, the relationship between thelevels of control extends beyond the high levels deciding which context lines should beactivated (see Section 7.1.1). The high levels can also aid the lowest level (Neural TPDP)in learning optimal policies (the use of high level knowledge has been investigated by,among others, Lin, 1991a, 1991b, Utgoff et al., 1991). Sections 7.2.1 through 7.2.5 willdescribe some ways in which this can occur.7.2.1 Guided LearningOne way in which higher levels of control can aid the lowest level in learning optimalpolicies is by guiding the exploration that occurs in the lowest level. If the higher levelsChapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 152possess some knowledge’ about what an optimal policy is, in general terms, this knowledge can be passed to the lowest of control to direct exploration. This is called guidedlearning.As a general example of guided learning, consider a novice snowboarder. The taskconfronting that snowboarder is to learn an optimal snowboarding policy. If the snow-boarder has abstract knowledge of generally how to position his body, that knowledgecan be passed to his lower movement controller through conscious positioning and conscious exertions of force. Such conscious intervention biases learning exploration, andcan greatly increase the rate of learning. If such were not the case, the jobs of many asnowboarding instructor (whose verbal messages guide conscious positioning) would bein jeopardy.7.2.2 Implementing Guided Learning in Practical TPDPIn the case of Neural TPDP, guided learning is facilitated by having the higher levels ofcontrol transmit signals down binary biasing lines connected, one each, to the drivingneurons. These lines bias the mutual inhibition contests that are an essential part ofexploration (see Section 6.1.6) by increasing the activity level at(u) of the driving neuronu to which they are connected (see Section 6.1.4). If higher levels of control possessknowledge’ that a certain action will result in good performance if applied in a certainstate, and the system enters that state, a high signal is transmitted down the biasing lineconnected to the driving neuron which specifies the appropriate action. This increasesthe activity level at(u) of the driving neuron, which in turn increases the likelihood thatthe desired action will be specified as a result of a mutual inhibition contest at that state(see Section 6.1.2).1llow the higher levels of control obtain and store this knowledge is immaterial — this descriptionis concerned only with how such knowledge can be passed down to the lowest level of control (NeuralTPDP) and used effectively.Chapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 153To control movement, Neural TPDP learns state/action associations which direct thecontrolled system through its state space. Initially, before optimal state/action associations have been learned, higher levels of control may have knowledge of the general trajectory that the system should follow. That knowledge can be utilized, through guidedlearning, to direct learning exploration along the general trajectory. As learning progresses along the general trajectory, Neural TPDP “fleshes out” the trajectory by addingand modifying synapses that define specific optimal state/action associations (see Sections 6.1.2 and 6.1.5) along it. Optimal control can be achieved by this means — withoutexhaustive exploration being necessary.Figure 7.37 presents the results of applying Practical TPDP, with guided learning,to the Race Track problem described in Sections 5.1.1 to 5.1.3 (using the parametersdefined in Sections 5.1.4 and 5.1.5). Guided learning was facilitated by biasing therandom selection of actions during exploration so that the actions shown in Figure 7.36were selected 50% of the time. The actions shown in Figure 7.36 were not optimal, theywere simply intelligent guesses (made by this researcher) at the optimal actions. Figure7.37 indicates that guided learning was indeed very effective in increasing the rate oflearning.‘7.2.3 Gross Inverse ModelsAs a direct DP control approach, Practical TPDP does not develop an explicit systemmodel to facilitate control (see Sections 2.2.2 and 2.3.1). In Section 2.3.1 it was statedthat an explicit model of a system is very useful if that system has to perform many tasks.This is because that model can be utilized for every task, and the modeling knowledgeacquired learning the optimal policy for each task can be used by the others.The problem with such a general purpose model is that, if it is the only model usedin all tasks, it must be highly resolved. That is, it must have fine enough resolutionChapter 7. Practical TPDP as Part of a Complete Hierarchical ControllerFigure 7.36: The Biased Action Specifications Used for Guided Learning30G)EI-C),20asci)>100Epoch Number2500154yStartingPositionsxFinishingPositionsHHliii —-IHH EHHEEEHEEEEEEEEEEEEHEEEC—/A: bias [1, 0] B: ‘11bias 1 [—1, 1] if th > 2.5B: Ubi = [1, —1] 1 0 otherwiseC: Ubi = [1,0] [—1, 11 if < —2.5F: Ubias= {D: lLb = [1, 1] 0 otherwiseFigure 7.37: Performance of Practical TPDP During Guided LearningChapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 155to be of use in each and every task. This could result in a very prohibitive memoryrequirement. To reduce this memory requirement, a variable resolution model could beemployed (Moore, 1991). Such a model would have increased resolution in the statesassociated with those tasks that the system must perform. But if it did, there is somequestion as to whether or not it is really a general purpose model. That is, few knowledgesharing benefits may result from combining the variable resolution models determinedfor each task.Alternatively, a controller can be constructed that has two levels of modeling. In thecontext of Practical TPDP, such a controller would operate as follows. Higher levels ofcontrol would employ Practical TPDP, at the lowest level of control, to develop a grossinverse model (for further description of inverse models see Aboaf et al., 1988, Moore,1992). This model would be developed by having Practical TPDP learn policies that movethe system to a general set of states or gross positions (by doing so gross inverse modelsare similar to the variable temporal resolution models investigated by Singh, 1992a). Themovement to each such gross position would constitute a single control task, and wouldhave a context line (see Section 7.1.1) associated with it — a gross position context line.The controller would learn a complete gross inverse model by learning the policy requiredto reach every interesting region of the state space. The resulting gross inverse modelwould provide the controller with general knowledge of how to control the system.Once a gross inverse model has been developed, that model can be used by high levelsof control to generally direct the system when policies for specific tasks must be learned(this is similar to the approaches taken by Singh, 1991a, 1991b, 1992b, Schmidhuber,1990b). In other words, gross inverse models facilitate guided learning (see Sections 7.2.1and 7.2.2).In Section 7.2.2 the use of bias lines that activated driving neurons was describedas a way to facilitate guided learning in Neural TPDP. An alternative approach is toChapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 156modify the neural design so that multiple context lines (see Section 7.1.1) can be activated simultaneously. Specifically, modify the design so that one primary context linecan be activated fully to indicate what specific task is being performed, and secondarycontext lines can be partially activated to bias the random action selections during exploration. If a gross inverse model exists, higher levels of control can partially activate,at the appropriate time, gross position context lines associated with the intermediategross positions that the system should reach during completion of a specific task. Thiswill partially activate one or more identification neurons, which will in turn partiallyactivate a number of driving neurons. The end result will be the biasing of the mutualinhibition contest between the driving neurons (see Section 6.1.2), biasing the randomaction selection to actions that take the system to the intermediate gross positions.7.2.4 Optimality Criteria and Higher KnowledgeSection 7.2.1 described how higher levels of control can aid the lowest level in learningoptimal policies by using knowledge of optimal actions to bias the random action selectionduring exploration. This biasing approach was called guided learning. There is anotherway in which high level, abstract knowledge can aid low level learning. If the optimalactions are not known, but extensive general knowledge of the desired system behaviorexists in higher levels of control1, this knowledge can be used to increase the rate atwhich the optimal policy is learned (Barto, 1992).Consider the Race Track problem described in Sections 5.1.1 to 5.1.3. The car inthat problem experiences the same immediate cost every time step until it reaches thefinishing positions. When it has reached those positions it experiences no further costs.As a result, the optimal behavior of the car is to reach the finishing positions as soon aspossible. Practical TPDP and conventional Q-learning learn this policy by backing-upthe costs (see Section 2.1.4) that will be experienced before the car reaches the finishingChapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 157positions. Accurate estimates of these costs, in the form of evaluation function values (seeSection 2.1.2), are initially learned in the states that the car enters immediately beforeencountering the finishing positions. These accurate evaluation function values are thenused in the learning of other accurate evaluation function values in states progressivelyfurther away from the finishing positions. As accurate evaluation function values arelearned for each state, optimal actions can be determined. The learning of the optimalpolicy thus backs-up from the finishing positions.Because the learning of the optimal policy backs-up from the finishing positions,protracted periods of back-up time can pass before optimal actions can be determinedfor states distant from those positions. If higher levels of control have extensive generalknowledge of the desired system behavior however, this back-up time can be reduced. Forexample, if it is known that the car in the Race Track problem should make a left turn(even though the actions facilitating that turn are not known), that knowledge can beused to modify the optimality criteria that the controller is presented with. Figure 7.38indicates a roughly optimal path that the car should follow for the Race Track problemdescribed in Sections 5.1.1 to 5.1.3 (using the parameters defined in Sections 5.1.4 and5.1.5). Figure 7.39 shows the average track time results (see Section 5.1.6) when PracticalTPDP was applied to this Race Track problem, and the immediate costs at car positionsalong the path shown in Figure 7.38 were reduced to one quarter their normal level (seeSection 5.1.4).Figure 7.39 indicates that the learning transition (see Section 5.2.2) occurred muchsooner when lower immediate costs were incurred at car positions along the path shownin Figure 7.38. This is because the path provided additional optimality information, andit acted as a source from which accurate evaluation function values could be backed-up.As a source of accurate evaluation function values, it was much closer to a large number ofstates than the finishing positions were, so the backing-up time was reduced accordingly.Chapter 7. Practical TPDP as Part of a Complete Hierarchical ControllerII1111111111111111111Figure 7.38: A Roughly Optimal Path on the Race Track0Epoch Number250030a)EF-C)F-ci)ci>10158FinishingPositionsSti11ir’ig I I I I I I I I I I I I I I I I I I I I I IIPositions_______________ ____________-1-IIIIIIIIIIIIIIIIIIIIIIIi{ii I IIII I I I I I I I I [111111 I liii I 11111111111 I_____Figure 7.39: Performance of Practical TPDP with Increased Optimality InformationChapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 159The use of additional optimality information in the Race Track problem illustrateshow optimality information from higher levels of control can aid in the learning of optimalpolicies. If prudent (or optimal) state space trajectories which lead to the arrival at goalstates are known, but the actions that will result in those trajectories are not known, thosetrajectories can be incorporated into the optimality criteria. This is done by reducing theimmediate costs experienced by the lowest level of control when the system is followingthose trajectories.7.2.5 Dynamic Optimality CriteriaAs learning progresses in a controller which incorporates Practical TPDP, the controllercan continually make abstract, high level observations (or predictions) about the optimalbehavior of the system. These observations may include the determination or expansionof prudent state space trajectories (see Section 7.2.4), where expansion may involve identifying trajectories that lead to other trajectories. All of this knowledge can be utilizedin the learning process by dynamically modifying the optimality criteria presented to thelowest level of control (Practical TPDP).In fact, optimal control can be viewed as a race between higher levels of control andthe lowest level of control. As higher levels continually develop more comprehensive andinformative optimality criteria, and the lowest level strives to learn the optimal policyfor the optimality criteria that the higher levels have developed at any one time.In a complete hierarchical controller, the emphasis should be placed on the higherlevels of control rapidly developing the optimality criteria. This is because even smallenhancements in the optimality knowledge can drastically reduce the learning time required to learn optimal policies. The lowest level of control operates in highly resolved,Chapter 7. Practical TPDP as Part of a Complete Hierarchical Controller 160continuous state spaces, and learning at this level is inherently laborious, requiring exhaustive experimentation and exploration. At higher levels however, where the optimality criteria is developed, the state space is abstracted into large, discrete, grossly definedblocks. High level control frequently consists of polar decisions. One such decision, “If Ilift my snowboard nose higher, I will turn quicker”, can be made without exhaustive experimentation, and when incorporated into the optimality criteria it can have enormousimpact on how rapidly the optimal policy is learned at the lowest level.It would seem that human intelligence is a fine example of the importance of emphasizing the development of optimality criteria (at the highest levels of control) overoptimal policy determination (at the lowest level). Humans develop incredibly complexoptimality criteria which facilitate the learning of policies for amazingly involved tasks.The lower movement controllers used to perform these tasks are little different from thosefound in many other mammals (especially other primates), but the tasks humans perform are far more involved. This difference is entirely a result of humans developing morecomplex optimality criteria.Chapter 8Conclusion8.1 The Main Benefit of TPDPThe main benefit of TPDP is that it learns near optimal policies very rapidly. This wasdemonstrated throughout Chapter 5 in applications of Practical TPDP. As described inSection 5.2.1, Practical TPDP learned near optimal policies for the Race Track problemmuch more quickly than conventional Q-learning.The main reason that Practical TPDP rapidly learns near optimal policies is that afew initial TP allocations (made during early learning exploration) can direct the systemalong state space transition routes which are generally good (see Section 5.2.2). Aftersuch routes have been discovered, further TP allocations can be made along them which“flesh out” the policy, making it optimal at all points. This sequence of events resultsin Practical TPDP very quickly directing its learning effort to regions of the state spacewhere that effort will produce the best results.The hazard of learning optimal policies in this manner is that the learning effortcan sometimes be concentrated on suboptimal state space transition routes which werediscovered early on, while undiscovered, truly optimal routes are ignored. The externalexploration of Practical TPDP is a way of preventing this (see Section 4.2.3).Aside from the way in which it rapid focuses learning effort, Practical TPDP alsolearns near optimal policies quickly because it makes evaluation function value back-ups(see Section 2.1.4) between states which are widely separated (see Chapter 3). This161Chapter 8. Conclusion 162results in faster learning.It was speculatively suggested in Section 6.3.1 that the rapid learning of good preliminary policies makes TPDP a plausible model of biological movement control andrefinement.8.2 The Main Disappointment of TPDPThe main disappointment of TPDP was that it did not, in practise, result in the significant memory usage reductions that were hoped for (see Chapter 5). The reason for thiswas that superfluous TPs were continually added to the state space after a near optimalpolicy had basically been learned— a result of the way in which Practical TPDP operates(see Section 5.4.2). In Sections 5.4.3 to 5.4.6 a number of improvements were suggestedand demonstrated which combat this problem.The absence of a significant reduction in memory usage in TPDP is exacerbated bythe fact that, for each state/action association made by a TP, more memory is requiredfor TPDP than for Q-learning. That is, for every TP, two floating point values (the Qvalue and R-value) and one integer (the weight) must be stored. In contrast, Q-learningrequires only one floating point value (the Q-value) for each state/action association.Further, to take advantage of the sparsity of TPs in the state space, Practical TPDPis best implemented with an ACAM (see Sections 2.4.2 and 3.4.2). This requires theadditional storage of two integers (the discretized state and action) as ACAM addressfields. Q-learning can also be implemented with an ACAM if at least one integer (thediscretized state) is stored as an address field. Assuming that floating point values requiretwice the storage space of integers, and ignoring the many different ways in which bothPractical TPDP and Q-learning can be implemented with ACAMs, Practical TPDPmust allocate no more than 40% of the maximum number of TPs (one for each possibleChapter 8. Conclusion 163state/action combination) to require less memory in a given application than Q-learning.This was achieved in both of the applications of Practical TPDP described in Chapter 5.8.3 Direct DP Optimal Control with TPDP is PracticalAdaptive optimal control is very desirable, but DP approaches to such control, whiletheoretically possible, are often impractical. The main reasons are that they require toomuch memory, too much learning time, or both. Direct DP approaches like Q-learningrequire less memory than indirect approaches (see Sections 2.2.1 and 2.2.2), but oftenrequire more learning time to compensate for the lack of an explicit system model. Asdescribed in this work however, TPDP is a direct DP approach that increases the rate oflearning by focusing the learning effort and increasing the distance over which back-upsare made (see Section 8.1). This increase in the rate of learning could make TPDP apractical alternative in many control applications.If Practical TPDP is incorporated into a complete hierarchical controller (see Chapter7), the rate of learning can be increased still further by employing guided learning (seeSection 7.2.1), and by passing high level optimality knowledge to the lowest level ofcontrol— the level at which Practical TPDP operates (see Section 7.2.4).8.4 Contributions of this WorkThe main contribution of this work is the development of TPDP. This developmentincludes proof of the fact that TPDP is sure to achieve minimal TP optimal control (seeChapter 3), as well as formulation of the Practical TPDP Algorithm (see Chapter 4). Noproof exists that Practical TPDP is sure to achieve minimal TP optimal control, but theTheorems of Chapter 3 were developed in such a way that they make the achievement ofminimal TP optimal control seem highly likely when Practical TPDP is employed.Chapter 8. Conclusion 164In addition to the development of TPDP and the Practical TPDP Algorithm, thefollowing contributions were also made:1. “Generalization by copying” was developed and demonstrated as a way of facilitating generalization in Practical TPDP (see Sections 5.3.1 and 5.3.2).2. Neural TPDP was developed as a neural implementation of Practical TPDP (seeChapter 6). This development included the design of a neural model which couldbe used for control (see Section 6.1.2), a neural mechanism that facilitated mutualinhibition contests (see Section 6.1.4), and a neural mechanism that facilitatedgeneralization by copying (see Section 6.2.2).3. Incorporation of Practical TPDP into a complete hierarchical controller was investigated (see Chapter 7), and the ways in which high level control knowledgecould be made use of in such a controller were developed and demonstrated. Theseincluded “guided learning” (see Section 7.2.1), “gross inverse models” (see Section7.2.3) and the use of high level optimality knowledge (see Section 7.2.4).4. A number of approaches were developed and demonstrated that combat the allocation of superfluous TPs in Practical TPDP (see Sections 5.4.2 to 5.4.6).8.5 Future WorkThere are a number of ways in which this work could be continued:1. Complete proof that Practical TPDP achieves minimal TP optimal control couldbe developed — a daunting task.2. Techniques other than generalization by copying (see Section 5.3.1) could be developed which further extend the ability of TPDP to make generalizations.Chapter 8. Conclusion 1653. The problem of the allocation of superfluous TPs in Practical TPDP (see Section5.4.2) could be more fully investigated, possibly leading to the development of acomplete strategy to combat this effect.4. The incorporation of Practical TPDP into a complete hierarchical controller couldbe more fully investigated (see Chapter 7), possibly leading to the use of TPDPideas in higher levels of control. Specifically, TPs could be used to specify actionsof varying abstractness at all levels of control. This is a particularly promisingdirection.In general, the idea behind TPs is expressed as: “Try this for a while and see whathappens”. TPDP provides a framework in which controllers that operate using thisprinciple can be developed.GlossaryAddition Rule: a rule guiding the addition of new TPs (see Section 3.3.8).DP: dynamic programming.DP element: a memory entry associated with a single state for the purposes of dynamicprogramming control.Neural TPDP: a neural network implementation of Practical TPDP (see Chapter 6).Practical TPDP: a specific practical approach to TPDP (see Chapter 4).Practical TPDP Algorithm: an algorithm that facilitates Practical TPDP control.Q-learning: a direct DP method (see Chapter 2).Q-value: Q(i, u); the expected total infinite-horizon discounted cost if action u is takenin state i and the current policy is followed in all states thereafter.R-value: R(i); the expected total infinite-horizon discounted cost if no new action isspecified in state i and the current policy is followed in all states thereafter.Removal Rule: a rule guiding the removal of TPs (see Section 3.3.9).Stack Updating Procedure: a procedure called by the Practical TPDP Algorithm.Swapping Rule: a rule guiding the swapping of TPs (see Section 3.3.5).TP: transition point— the association of an action UTP with a state i.TP action: ‘uTp; the action associated with a TP.166Glossary 167TP states: SB; states associated with TPs.TPDP: transition point dynamic programming.aborted swap states: SG; the set of TP states for which a TP swap was attemptedbut aborted.absorbing states: SA; states where system transitions are terminated.accepted swap states: SG’; the set of TP states for which a TP swap was attemptedand accepted.action: (control action) the action a controller specifies that a system should perform.activity level: at(u); the signal level that each driving neuron would output if it wasnot being mutually inhibited by the other driving neurons.addition: adding a TP to a closed state space Sc.allocation cost: cocatjon; a cost placed on the allocation and preservation of each TP.anchored states: TP states which are always reached from the starting states Ss (considered in the proof of Theorem 3.3).assessment group: the set of TPs being evaluated at any one time as Practical TPDPlearning progresses.back-up time: the time required for accurate cost estimates to be backed-up.backing-up: the backwards movement of cost estimates that results from updating thecost estimates at each state using evaluation function values of later states.balancing point: the point at which the output of a driving neuron stops increasingand starts decreasing during a mutual inhibition contest.Glossary 168biasing lines: lines carrying binary signals, each one of which can increase the activitylevel of a single driving neuron in order to bias mutual inhibition contest outcomes.boundary: a set of boundary states associated with the same uniform region.boundary states: SB; states where TPs specify an action that will be used throughouta uniform state space region.closed state space: Sc; the states which can be reached in a valid environment with agiven set of TPs.composite action: an action consisting of a number of elemental action specifications.context lines: lines carrying binary signals, each one of which indicates if the systemis to perform a specific task.control task: a task that the system is to perform.controller: a mechanism controlling a system or plant.delay parameter: aaelay; a random positive parameter added to time step t to determinethe update time tupdate.delayed updating: the delaying of Q-value, R-value and weight updating until longerterm costs can be observed.direct controllers: controllers which use implicit system models.discount factor: y; the attenuation factor used for future immediate costs.dormant states: SDO; states within a uniform region that are not boundary states — noactions need be specified at them.Glossary 169driving neuron output: rnt(u); the output level of driving neuron u at time step t.driving neurons: neurons that each specify a separate action— they or together outputsfrom identification neurons attending to states for which their action is appropriate.elemental action: actions representing the finest control action resolution possible.entry action probability: p(i, u); the probability that action u led to the arrival ofthe system at state i (in a valid environment, with the existing set of TPs).entry actions: Ue(i); all the actions that can result in a transition to state i from someother state (in a valid environment, with the existing set of TPs).environment: a set of conditions under which TPDP operates (see Section 3.2.2).epoch: a set of 20 learning and 500 testing trials.evaluation function: ; the set of evaluation function values V(i) over the entirestate space S.evaluation function value: V(i); the expected total infinite-horizon discounted costif the existing policy is applied from state i onward.excluded cost: C,(i, J); the expected cost of all state space transitions from state ithat can occur with policy z — excluding those after all states j in the arbitrarilychosen set of states J have been encountered.expected evaluation function value: Vexpected; an estimate of what the evaluationfunction value will be at the next TP state.experienced cost: C0t; the total infinite-horizon discounted cost experienced afterthe system leaves each of the states recorded in the stack.Glossary 170external exploration: learning exploration intended to discover low cost state transition routes through the external states SE.external parameter: uextena]; a random parameter used to determine whether or notexternal exploration is initiated instead of internal exploration.external states: SE; states outside the closed state space Sc.flash reducing: instantaneous reduction of the output of a driving neuron from 1 to itslow-grade output level — ends inhibition of other driving neurons by that neuron.generalization by copying: generalization by attempting actions, during learning exploration, which have been found effective in nearby states.generalization parameter: generaJjze; a random parameter used to determine whetheror not generalization by copying should be performed.gross inverse model: a general system model consisting of knowledge of the actionsnecessary to move the system to a number of gross positions.gross position: a loosely defined set of neighboring states.gross position context lines: context lines that specify that the control task is toreach a gross position.guided learning: prudent exploration choices made during learning based on higherlevel control knowledge.high: an active binary signal.identification neuron output: oj(i); the output level of each identification neuron iat time step t.Glossary 171identification neurons: neurons that inspect binary state lines, each identifying a separate state.immediate cost: c(i, u); the cost incurred when action u is applied in state i.indirect controllers: controllers which use explicit system models.ineffectual TPs: TPs associated with external states SE.initiate parameter: uaddTp; a random parameter used to determine whether or notexploration is initiated at non-TP states.internal exploration: learning exploration intended to discover cost-reducing TPs thatshould be added to the closed state space Sc.learning transition: a distinct change in the learning of the optimal policy which occurred during application of Practical TPDP (see Section 5.2.2).localized operation: the independent operation of neurons in certain types of neuralnetwork designs.low: an inactive binary signal.low-grade output factor: i; the parameter used to adjust the low-grade output level.low-grade output level: the output signal of driving neurons which are active butinhibited by another driving neuron.memory-based controllers: controllers which use look-up tables that associate actionswith states.minimal TP optimal control: optimal control with TPs when all unnecessary TPshave been removed.Glossary 172non-TP states: SD; states not associated with TPs.off-line: calculations which a controller makes when it is not busy controlling the system.optimal Q-value: Q*(j, u); the Q-value when the policy is optimal.optimal TP states: SB*; states associated with TPs when the policy is optimal.optimal action: 1L*(i); the policy action at state i when the policy is optimal.optimal closed state space: Sc*; closed state space when the policy is optimal.optimal control: control of a system when that control is optimal with regard to someoptimality criteria.optimal evaluation function: 14,j*; the evaluation function when the policy is optimal.optimal evaluation function value: V*(i); the evaluation function value of state iwhen the policy is optimal.optimal external states: S; the external states when the policy is optimal.optimal non-TP states: SD*; the non-TP states when the policy is optimal.optimal policy: *; a policy that provides optimal control.optimality criteria: cost criteria which defines when optimal control has been achieved.participation factor: i(i,j); a factor which indicates how much of the evaluationfunction value V(i) of state i depends on the evaluation function value V(j) ofanother state j with policy u.performance metric: used during the performance evaluation of Practical TPDP (seeSection 5.3.4).Glossary 173phase plane: a two-dimensional diagram showing the position and velocity of a system.policy: p; the set of policy actions p(i), over the entire state space, that the controllerspecifies in each state i.policy TP: the one TP at each TP state which specifies the policy action.policy action: it(i); the action that the controller specifies in state i (when learningexploration is not occurring).policy synapse: a policy TP in Neural TPDP.possible actions: U(i); the set of actions that can be performed in each state i.removal: removing a TP from a closed state space Sc.route change parameter: change a random parameter used to determine whether ornot a new experimental action is specified during exploration.starting states: Ss; states in which a system is started.state lines: lines carrying binary signals, each one of which indicates if the system is ina specific quantized interval of one dimension of the state space.state space: 5; the set of all states that the system can enter.state transition probability: pj(i, u); the probability that a transition will occur tostate j if action u is applied in state i.swap parameter: swap-TP a random parameter used to determine whether or not internal exploration is initiated at TP states.swapping: exchanging one TP for another in a closed state space Sc.Glossary 174synapse: a connection between the output of an identification neuron and the input ofa driving neuron— acts as a TP, associating an action with a state.synapse allocation knowledge: the knowledge each synapse requires as to whether ornot other synapses are associated with the same state.task: (control task) a task that the system is to perform.time step interval: T; the time between two control time steps.transition point: (TP) the association of an action UTP with a state i.trial: a complete set of system state transitions from starting states to absorbing states.uniform region: a region of neighboring states where the same action is optimal.update rate: u); the rate at which a Q-value or R-.value is updated in Q-learning.update time: tupdate; the time step when the next Q-value, R-value and weight updateshould occur.value iteration: a successive approximation approach for determining the evaluationfunction (see Section 2.1.4).weight: w(i, u); the merit of the TP associating action u with state i.List of Variables€Vt(, u) update rate: the rate at which a Q-value or R-value is updated in Q-learning.7 discount factor: the attenuation factor used for future immediate costs.6 the low-grade signal level which is sought from an identification neuron.i,j) participation factor: a factor which indicates how much of the evaluationfunction value V(i) of state i depends on the evaluation function value V,1(j)of another state j with policy p.low-grade output factor: the parameter used to adjust the low-grade output level.p policy: the set of policy actions p(i), over the entire state space, that thecontroller specifies in each state i.optimal policy: a policy that provides optimal control.policy action: the action that the controller specifies in state i (when learning exploration is not occurring).optimal action: the policy action at state i when the policy is optimal.p(i, u) entry action probability: the probability that action u led to the arrival ofthe system at state i (in a valid environment, with the existing set of TPs).175List of Variables 1760add-TP initiate parameter: a random parameter used to determine whether or notexploration is initiated at non-TP states.change route change parameter: a random parameter used to determine whetheror not a new experimental action is specified during exploration.0delay delay parameter: a random positive parameter added to time step t to determine the update timetupdate.externa1 external parameter: a random parameter used to determine whether or notexternal exploration is initiated instead of internal exploration.genera1ize generalization parameter: a random parameter used to determine whetheror not generalization by copying should be performed.swap-TP swap parameter: a random parameter used to determine whether or notinternal exploration is initiated at TP states.C1(i, J) excluded cost: the expected cost of all state space transitions from state ithat can occur with policy t— excluding those after all states j in the arbitrarily chosen set of states J have been encountered.C0 experienced cost: the total infinite-horizon discounted cost experienced after the system leaves each of the states recorded in the stack.L(i) the set of state lines connected as inputs to identification neuron i.Q(i, u) Q-value: the expected total infinite-horizon discounted cost if action u istaken in state i and the current policy is followed in all states thereafter.Q*(j u) optimal Q-value: the Q-value when the policy is optimal.List of Variables 177R(i) R-value: the expected total infinite-horizon discounted cost if no new actionis specified in state i and the current policy is followed in all states thereafter.S state space: the set of all states that the system can enter.SA absorbing states: states where system transitions are terminated.SB TP states: states associated with TPs.SB* optimal TP states: states associated with TPs when the policy is optimal.SB boundary states: states where TPs specify an action that will be usedthroughout a uniform state space region.Sc closed state space: the states which can be reached in a valid environmentwith a given set of TPs.Sc* optimal closed state space: closed state space when the policy is optimal.SD non-TP states: states not associated with TPs.SD* optimal non-TP states: the non-TP states when the policy is optimal.SDO dormant states: states within a uniform region that are not boundary states— no actions need be specified at them.SE external states: states outside the closed state space ScSE* optimal external states: the external states when the policy is optimal.SG aborted swap states: the set of TP states for which a TP swap was attempted but aborted.List of Variables 178SG’ accepted swap states: the set of TP states for which a TP swap was attempted and accepted.Ss starting states: states in which a system is started.T time step interval: the time between two control time steps.Tstuck the time period for which no system movement will be permitted before exploration is initiated.Tuniform the average time required to cross each uniform region.U(i) possible actions: the set of actions that can be performed in each state i.Ue() entry actions: all the actions that can result in a transition to state i fromsome other state (in a valid environment, with the existing set of TPs).evaluation function: the set of evaluation function values V(i) over theentire state space S.optimal evaluation function: the evaluation function when the policy isoptimal.V,(i) evaluation function value: the expected total infinite-horizon discountedcost if the existing policy is applied from state i onward.(i) optimal evaluation function value: the evaluation function value of statei when the policy is optimal.Vexpected expected evaluation function value: an estimate of what the evaluationfunction value will be at the next TP state.List of Variables 179X(i,j) the set of all possible state transition routes from state ito state j.at(u) activity level: the signal level that each driving neuron would output if itwas not being mutually inhibited by the other driving neurons.c(i, u) immediate cost: the cost incurred when action u is applied in state i.Cocatjon allocation cost: a cost placed on the allocation and preservation of each TP.d the number of time steps between TP states.dquick the number of time steps required to pass through a single discretized state atthe highest system velocity.i a particular system state.j a particular system state.k iteration number.1 a state line.mt(u) driving neuron output: the output level of driving neuron u at time step t.Ot(Z) identification neuron output: the output level of each identification neuron i at time step t.p3(i, u) state transition probability: the probability that a transition will occur tostate j if action u is applied in state i.Pineffective the probability that a specified car acceleration instruction had no effect (seeSection 5.1.2).List of Variables 180the state at time t.t time step: the current time step.tlastTP the last time step in which a TP state was encountered.tupdate update time: the time step when the next Q-value, R-value and weight update should occur.u an action taken in a given state.Tp TP action: the action associated with a TP.a hypothetical non-action action.w(i, u) weight: the merit of the TP associating action u with state i.wijj the initial setting of a TP weight (see Section 4.2.1).Wmax the maximum value of a TP weight (see Section 4.2.1).WtI a threshold TP weight value (see Section 4.2.2).the horizontal position of the car at time step t (see Section 5.1.2).the horizontal velocity of the car at time step t (see Section 5.1.2).XL the horizontal velocity limit of the car (see Section 5.1.2).a possible state transition route between two states.Yt the vertical position of the car at time step t (see Section 5.1.2).the vertical velocity of the car at time step t (see Section 5.1.2).List of Variables 181IlL the vertical velocity limit of the car (see Section 5.1.2).Zt(l) the signal level of state line 1 at time step t.BibliographyAboaf, E. W., C. 0. Atkeson and D. J. Reinkensmeyer (1988), “Task-level robot learning”, Proceedings of the 1988 International Conference on Robotics and Automation, vol.2, 1988, PP. 1309-1310.Albus, J. S. (1975a), “A new approach to manipulator control: the cerebellar model articulation controller (CMAC)”, Transactions of the ASME, vol. 97, Sept. 1975, pp. 220-227.Albus, J. S. (1975b), “Data storage in the cerebellar model articulation controller(CMAC)”, Transactions of the ASME, vol. 97, Sept. 1975, pp. 228-233.Albus, J. S. (1983), “A structure for generation and control of intelligent behavior”,Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers, Port Chester, New York, 1983, pp. 25-28.Albus, J. 5. (1988), “The central nervous system as a low and high level control system”,NATO ASI Series: Sensors and Sensory Systems for Advanced Robots, vol. F43, ed.: P.Dario, Berlin: Springler-Verlag, 1988, pp. 3-20.Albus, J. S. (1991), “Outline for a theory of intelligence”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-21, no. 3, May/June 1991, pp. 473-509.Alkon, D. L. (1989), “Memory storage and neural systems”, Scientific America, July,1989, PP. 42-50.Alkon, D. L., K. T. Blackwell, 0. S. Barbour, A. K. Rigler and T. P. Vogl (1990),“Pattern-recognition by an artificial network derived from biologic neuronal systems”,Biological Cybernetics, vol. 62, 1990, pp. 363-376.Anderson, C. W. (1989a), “Learning to control an inverted pendulum using neural networks”, IEEE Control Systems Magazine, vol. 9, no. 3, Apr. 1989, pp. 31-37.Anderson, C. W. (1989b), “Towers of Hanoi with connectionist networks: learning newfeatures”, Machine Learning: Proceedings of the 6th International Conference, San Mateo, California: Morgan Kaufmann Publishers, 1989, pp. 345-350.182Bibliography 183Anderson, C. W. (1993), “Q-learning with hidden unit restarting”, Advances in NeuralInformation Processing Systems 5, San Mateo, California: Morgan Kaufmann Publishers, 1993, pp. 81-88.Atkeson, C. G. and D. J. Reinkensmeyer (1988), “Using associative content-addressablememories to control robots”, Proceedings of the 27th Conference on Decision and Control, Austin, Texas, Dec. 1988, pp. 792-797.Atkeson, C. G. (1989), “Learning arm kinematics and dynamics”, Annual Review of Neuroscience, vol. 12, 1989, pp. 157-183.Atkeson, C. G. (1991), “Using locally weighted regression for robot learning”, Proceedingsof the 1991 IEEE International Conference on Robotics and Automation, Sacramento,California, Apr. 1991, pp. 958-963.Ayestaran, H. E. and R. W. Prager (1993), “The logical gates growing network”, Report CUED/F-INFENG/TR 137, Cambridge University Engineering Department, Cambridge, July 1993.Baker, W. L. and J. A. Farrell (1992), “An introduction to connectionist learning controlsystems”, Handbook of Intelligent Control, eds.: D. A. White and D. A. Sofge, New York:Van Nostrand Reinhold, 1992, pp. 35-63.Barto, A. G., R. S. Sutton and C. W. Anderson (1983), “Neuronlike elements that cansolve difficult learning control problems”, IEEE Transactions on Systems, Man, and Cybernetics, vol. 13, no. 5, 1983, pp. 835-846.Barto, A. G., R. S. Sutton and C. J. C. H. Watkins (1989), “Learning and sequential decision making”, COINS Technical Report 89-95, University of Massachusetts, Sept. 1989.Barto, A. G. and S. P. Singh (1990a), “Reinforcement learning and dynamic programming”, Proceedings of the 6th Yale Workshop on Adaptive and Learning Systems, Yale,1990, pp. 83-88.Barto, A. C. and S. P. Singh (1990b), “On the computational economics of reinforcementlearning”, Connectionist Models: Proceedings of the 1990 Summer School, San Mateo,California: Morgan Kaufmann Publishers, 1990, pp. 35-44.Barto, A. G., R. S. Sutton and C. J. C. H. Watkins (1990c), “Sequential decision problems and neural networks”, Advances in Neural Information Processing Systems 2, SanMateo, California: Morgan Kaufmann Publishers, 1990, pp. 686-693.Bibliography 184Barto, A. C., S. J. Bradtke and S. P. Singh (1991), “Real-time learning and control using asynchronous dynamic programming”, COINS Technical Report 91-57, University ofMassachusetts, Aug. 1991.Barto, A. 0. (1992), “Reinforcement learning and adaptive critic methods”, Handbookof Intelligent Control, eds.: D. A. White and D. A. Sofge, New York: Van NostrandReinhold, 1992, pp. 469-491.Barto, A. 0., S. J. Bradtke and S. P. Singh (1993), “Learning to act using real-timedynamic programming”, Department of Computer Science, University of Massachusetts,Jan. 1993.Bellman (1957), Dynamic Programming, Princeton: Princeton University Press, 1957.Buckland, K. M. and P. D. Lawrence (1993), “A connectionist approach to direct dynamicprogramming control”, Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, Canada, 1993, vol. 1, pp. 284-287.Chapman, D. and L. P. Kaelbling (1991), “Input generalization in delayed reinforcement-learning: an algorithm and performance comparisons”, Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, Aug. 1991, pp.726-731.Chinchuan, C., C. Y. Maa and M. A. Shanblatt (1990), “An artificial neural networkalgorithm for dynamic programming”, International Journal of Neural Systems, vol. 1,no. 3, 1990, pp. 211-220.Dayan, p. (1991), “Navigating through temporal difference”, Advances in Neural Information Processing Systems 3, San Mateo, California: Morgan Kaufmann Publishers,1991, pp. 464-470.Fahlman, S. E. and 0. E. Hinton (1987), “Connectionist architectures for intelligence”,Computer, Jan. 1987, pp. 100-109.Feldman, J. A. and D. H. Ballard (1982), “Connectionist models and properties”, Cognitive Science, vol. 6, 1982, pp. 205-254.Gardner, M. (1973), “Mathematical games”, Scientific American, 228:108, Jan. 1973.Bibliography 185Girosi, F. and T. Poggio (1989), “Networks and the best approximation property”, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, A.I. Memo No.1164, C.B.I.P. Paper No. 45, Oct. 1989.Hillis, W. D. (1985), The Connection Machine, Cambridge, Massachusetts: MIT Press,1985.Jaakkola, T., M. I. Jordan and S. P. Singh (1993), “On the convergence of stochasticiterative dynamic programming algorithms”, Advances in Neural Information ProcessingSystems 5, San Mateo, California: Morgan Kaufmann Publishers, 1993.Jacobs, R. A. and M. I. Jordan (1991), “A competitive modular connectionist architecture”, Advances in Neural Information Processing Systems 3, San Mateo, California:Morgan Kaufmann Publishers, 1991, pp. 767-773.Kaelbling, L. P. (1990), Learning in Embedded Systems, Ph.D. Thesis, Stanford University, Department of Computer Science, Stanford, California, Tech. Rep. TR-90-04, 1990.Korf, R. E. (1990), “Real-time heuristic search”, Artificial Intelligence, vol. 42, 1990, pp.189-211.Kraft, L. 0. W. T. Miller and D. Dietz (1992), “Development and application of CMACneural network-based control”, Handbook of Intelligent Control, eds.: D. A. White andD. A. Sofge, New York: Van Nostrand Reinhold, 1992, pp. 215-232.Lin, L. J. (1991a), “Programming robots using reinforcement learning and teaching”,Proceedings of the 9th International Conference on Artificial Intelligence, Cambridge,Massachusetts: MIT Press, 1991, pp. 781-786.Lin, L. J. (1991b), “Self-improvement based on reinforcement learning, planning andteaching”, Machine Learning: Proceedings of the 8th International Workshop, eds.: L.Birnbaum and G. Collins, San Mateo, California: Morgan Kaufmann Publishers, 1991,pp. 323-327.Mahadevan, S. and J. Connell (1992), “Automatic programming of behavior-based robotsusing reinforcement learning”, Artificial Intelligence, vol. 55, 1992, pp. 311-365.Massone, L. and E. Bizzi (1989), “A neural network model for limb trajectory formation”, Biological Cybernetics, vol. 61, 1989, pp. 417-425.Bibliography 186Mendel, J. M. (1973), “Reinforcement learning models and their applications to controlproblems”, Learning Systems: A Symposium of the AACC Theory Committee, 1973, pp.3-18.Michie, D. and R. A. Chambers (1968), “BOXES: an experiment in adaptive control”,Machine Intelligence 2, eds.: E. Dale and D. Michie, Oliver and Boyd, 1968, pp. 137-152.Minsky, M. L. (1985), The Society of Mind, New York: Simon and Schuster, 1985.Moore, A. W. (1991), “Variable resolution dynamic programming: efficiently learningaction maps in multivariate real-valued state-spaces”, Machine Learning: Proceedings ofthe 8th International Workshop, eds.: L. Birnbaum and 0. Collins, San Mateo, California: Morgan Kaufmann Publishers, 1991.Moore, A. W. (1992), “Fast, robust adaptive control by learning only forward models”,Advances in Neural Information Processing Systems 4, San Mateo, California: MorganKaufmann Publishers, 1992, pp. 571-578.Moore, A. W. and C. 0. Atkeson (1993), “Prioritized sweeping: reinforcement learningwith less data and less real time”, Machine Learning, Oct. 1993.Narendra, K. S. and M. A. L. Thathachar (1989), Learning Automata: An Introduction,New Jersey: Prentice-Hall, 1989.Narendra, K. S. (1992), “Adaptive control of dynamical systems using neural networks”,Handbook of Intelligent Control, eds.: D. A. White and D. A. Sofge, New York: VanNostrand Reinhold, 1992, pp. 141-183.Ogata, K. (1970), Modern Control Engineering, Englewood Cliffs, New Jersey: PrenticeHall, 1970.Omohundro, S. M. (1987), “Efficient algorithms with neural network behavior”, ComplexSystems, vol. 1, 1987, pp. 273-347.Peng, J. and R. J. Williams (1992), “Efficient search control in Dyna”, College of Computer Science, Northeastern University, Mar. 1992.Poggio, T. and F. Girosi (1990), “Networks for approximation and learning”, Proceedingsof the IEEE, vol. 78, no. 9, Sept. 1990, pp. 1481-1497.Bibliography 187Ross, 5. (1983), Introduction to Stochastic Dynamic Programming, New York: AcademicPress, 1983.Rumeihart, D. E. and D. Zipser (1985), “Feature discovery by competitive learning”,Cognitive Science, vol. 9, 1985, pp. 75-112.Rumeihart, D. E., G. E. Hinton and R. J. Williams (1986), “Learning internal representations by error propagation”, Parallel Distributed Processing, eds.: D. E. Rumeihartand J. L. McClelland, Boston: MIT Press, 1986, vol. 1, pp. 318-362.Samuel, A. L. (1959), “Some studies in machine learning using the game of checkers”,IBM Journal on Research and Development, 1959, pp. 210-229.Singh, S. P. (199hz), “Transfer of learning across compositions of sequential tasks”, Machine Learning: Proceedings of the 8th International Workshop, eds.: L. Birnbaum andG. Collins, San Mateo, California: Morgan Kaufmann Publishers, 1991, pp. 348-352.Singh, S. P. (1991b), “Transfer of learning by composing solutions of elemental sequentialtasks”, Department of Computer Science, University of Massachusetts, Dec. 1991.Singh, S. P. (1992a), “Scaling reinforcement learning algorithms by learning variabletemporal resolution models”, Proceedings of the 9th Machine Learning Conference, eds.:D. Sleeman and P. Edwards, July 1992.Singh, S. P. (1992b), “The efficient learning of multiple task sequences”, Advances inNeural Information Processing Systems 4, San Mateo, California: Morgan KaufmannPublishers, 1992, pp. 251-258.Schmidhuber, J. (1990a), “A local learning algorithm for dynamic feedforward and recurrent networks”, Report FKI-124-90, Institut für Informatik, Technische UniversitätMünchen, Munich, Germany, 1990.Schmidhuber, J. (1990b), “Learning algorithms for networks with internal and externalfeedback”, Connectionist Models: Proceedings of the 1990 Summer School, eds.: D. S.Touretzky et. al., San Mateo, California: Morgan Kaufmann Publishers, 1990, pp. 52-61.Standish, T. A. (1980), Data Structure Techniques, Reading, Massachusetts: AddisonWesley Publishing Company, 1980.Sutton, R. S. (1988), “Learning to predict by the methods of temporal differences”, Machine Learning, vol. 3, 1988, pp. 9-43.Bibliography 188Sutton, R. S. (1990), “Integrated architectures for learning, planning, and reacting basedon approximating dynamic programming”, Machine Learning: Proceedings of the 7thInternational Conference, eds.: L. Birnbaum and G. Collins, San Mateo, California:Morgan Kaufmann Publishers, 1990, pp. 216-224.Sutton, R. S. (1991), “Integrated modeling and control based on reinforcement learningand dynamic programming”, Advances in Neural Information Processing Systems 3, SanMateo, California: Morgan Kaufmann Publishers, 1991, pp. 471-478.Sutton, R. S., A. 0. Barto and R. J. Williams (1992), “Reinforcement learning is directadaptive optimal control”, IEEE Control Systems Magazine, vol. 12, no. 2, Apr. 1992,pp. 19-22.Tesauro, G. J. (1991), “Practical issues in temporal difference learning”, Research ReportRC 17223 (#76307), IBM Research Division, T. J. Watson Research Center, YorktownHeights, New York, 1991.Tham, C. K. and R. W. Prager (1993), “Reinforcement learning methods for multi-linkedmanipulator obstacle avoidance and control”, Engineering Department, Cambridge University, Cambridge, Mar. 1993.Thrun, S. B. (1992), “The role of exploration in learning control”, Handbook of IntelligentControl, eds.: D. A. White and D. A. Sofge, New York: Van Nostrand Reinhold, 1992,pp. 527-559.Thrun, S. B. and A. Schwartz (1993), “Issues in using function approximation for reinforcement learning”, Proceedings of the th Connectionist Models Summer School, Hills-dale, New Jersey: Lawrence Erlbaum Publisher, Dec. 1993.Utgoff, P. E. and J. A. Clouse (1991), “Two kinds of training information for evaluationfunction learning”, Proceedings of the 9th Annual Conference on Artificial Intelligence,San Mateo, California: Morgan Kaufmann Publishers, 1991, pp. 596-600.Watkins, C. J. C. H. (1989), Learning from Delayed Rewards, Ph.D. Thesis, CambridgeUniversity, Cambridge, England, 1989.Watkins, C. J. C. H. and P. Dayan (1992), “Q-learning”, Machine Learning, vol. 8, 1992,pp. 279-292.Bibliography 189Werbos, P. J. (1990), “Consistency of HDP applied to a simple reinforcement learningproblem”, Neural Networks, vol. 3, no. 3, 1990, pp. 179-189.White, D. A. and M. I. Jordan (1992), “Optimal control: a foundation for intelligentcontrol”, Handbook of Intelligent Control, eds.: D. A. White and D. A. Sofge, New York:Van Nostrand Reinhold, 1992, pp. 185-214.Widrow, B., N. K. Gupta and S. Maitra (1973), “Punish/reward: learning with a criticin adaptive threshold systems”, IEEE Transactions on Systems, Man, and Cybernetics,vol. 3, no. 5, 1973, pp. 455-465.Williams, R. J. (1986), “Reinforcement learning in connectionist networks: a mathematical analysis”, ICS Report 8605, Institute for Cognitive Science, University of California,San Diego, June 1986.Williams, R. J. (1987a), “A class of gradient-estimating algorithms for reinforcementlearning in neural networks”, Proceedings of the IEEE First International Conference onNeural Networks, San Diego, California, 1987, vol. 2, pp. 601-608.Williams, R. J. (1987b), “Reinforcement-learning connectionist systems”, Technical Report NU-CCS-8’7-3, College of Computer Science, Northeastern University, Boston, Feb.1987.Yee, R. C. (1992), “Abstraction in control learning”, COINS Technical Report 92-16,University of Massachusetts, Mar. 1992.Appendix AProof of the Convergence of the TPDP Version of Q-LearningProof of the convergence of Q-learning performed using TP updating Equation 3.18 isbased on the dynamic programming convergence proofs developed by Jaakkola et. al.(1993). Theorem 1 of Jaakkola et. al.’s work (1993) is:Theorem 1: A random iterative process = (1 — c(x))z(x) +3(x)F(x) converges to zero w.p.1 under the following assumptions:1. The state space is finite.2. cr(x) = co, Zc(x) < oo, = cc, Z5(x) < co, andE{/3(x)IP} E{a(x)P} uniformly w.p.1.3. WE{F(x)IP}Uw <YWLnWW, where -y e (0,1).4. Var{F(x)P} K(1 + IIWw)2,where K is some constant.Here P={z, ..., F,_1 ..., c_.i, ••8fl—i, ...} stands for the past at stepn. F(x), o(x) and 3(x) are allowed to depend 011 the past insofar as theabove conditions remain valid. The notation II w refers to some weightedmaximum norm.The proof of Jaakkola et. al.’s Theorem 1 will not be presented (see Jaakkola et.al., 1993), but the convergence of Q-learning performed using TP updating Equation3.18 will be proven by relating the updating process to the converging stochastic processdefined by Theorem 1.190Appendix A. Proof of the Convergence of the TPDP Version of Q-Learning 191Theorem A:’ Updating Q-values with the following form of the TP updatingEquation 3.182:d— 1Qt+d(st, Ut) [1—t(St, Ut)] Qt(st, Ut) + t(St, Ut) [(7c(st+n Ut)) +7dv(s)](A.52)results in convergence to optimal Q-values Q*(s, U) if:1. The state and action spaces are finite.2. at(s,U) — cc and c(s,U) <cc uniformly w.p.1.3. Var{c(s, u)} is bounded, and c(s, U) > 0 V s, u.4. If -y = 1 all policies lead to a cost free absorbing state w.p.1.Proof: By subtracting Q*(st, Ut) from both sides of updating Equation A.52and by defining:L(s, U) = Qt(s, U) — Q*(s U)Ut)=(‘7nc(st+ Ut)) + -yd V(s d) — Q*( Ut)updating Equation A.52 can be seen to have the form of the process inJaakkola et. al.’s Theorem 1 with /3t(S, U) = c(s, U).Conditions 1 and 2 of Jaakkola et. al.’s Theorem 1 are met by conditions 1and 2 of Theorem A, so it remains to be shown that F(s, U) has the propertiesrequired by Theorem 1 — those defined by its conditions 3 and 4.1Theorem A closely follows the conventional Q-learning theorem presented by Jaakkola eL at. (1993).2This form is Equation 3.18 with a few minor notation changes.Appendix A. Proof of the Convergence of the TPDP Version of Q-Learning 192To show that condition 3 of Jaakkola et. al.’s Theorem 1 is met, we have:E{Ft(s,u)} = Ct(s,SB) + i7(s,j)4(j)— Ct(s,SB)—i7(s,j)V(j)jES2 jESB= ‘t(s,j)(Vt(j)— (A.53)jES2Where C(s, SB) is the excluded cost of state transitions from state s to the TPstates SB (with all costs incurred after the TP states SB are encountered beingexcluded from the total — see Section 3.3.4), and 71t(s,j) is the participationfactor of TP state j in the evaluation function value of state s (see Section3.3.4).Manipulating E{F(s, u)} produces:E{Ft(s,u)}I = r(s,j)(V(j) —jES2>jES2it(s,j)IminQt(j,v)_rnjnQ*(j,v)IjES2?7t(s,j)maxQt(j,v) — Q*(j,v)I (A.54)The maximum value the right hand side of Equation A.54 can have is whena one-step transition is made, w.p.1, to a TP state j that has the largest Qvalue error (max IQt(j, v) — Q*(j, v)I) of all the TP states SB. Using weightedmaximum norm weights of 1 the following can thus be stated:IIE{Ft(s,u)}Uw ymaxmaxlQt(j,v) — Q*(jv)I (A.55)HE{Ft(s, u)}IIw ‘(j, v)IIw (A.56)Since the notation of v)Iw can be arbitrarily changed to It(S, u)IIw,condition 3 of Jaakkola et. al.’s Theorem 1 is met.Appendix A. Proof of the Convergence of the TPDP Version of Q-Learning 193To show that condition 4 of Jaakkola et. aL’s Theorem 1 is met, we have:d—1Var{Ft(s,u)} = Var{ (E7nc(sflu))+7dVt(sd) — Q*(s,u)so =< 3Var{7nc(snu)so } +3Var{7dVi(sd)so =K1 + 3Var{7dvt (sd)J s0 = (A.57)Where K1 is a constant reflecting the fact that the variance of the immediatecosts incurred is bounded as a result of conditions 3 and 4 of Theorem A.Continuing on (and assuming throughout that s0 =Var{Ft(s, u)} K +3 ç II Pxk+l (xk, u))(7Iv(i) — E{7dv(})2iESB EX(s,j)(A.58)Where X(s,j) is the set of all possible state transition routes from state sto TP state j ( X(s,j)), = [x0,1...,x,] is one possible state transitionroute from state s to TP state j of variable length n (xo = s, x,-,= j), and jis the number of states along each such route.The maximum value the right hand side of Equation A.58 can have iswhen a one-step transition is made, w.pJ, to a TP state j that has thelargest evaluation function value V(j) of all the TP states SB, and when theexpected value of .ydT/(s) is 0. Using weighted maximum norm weights of 1the following can thus be stated3:/Var{Ft(s,u)} Ki+3(ymaxVt(j)—0\‘ IVar{Ft(s,u)} K1 +32(mxV(j)) (A.59)3The outcome is the same if (j) is minimized and the expected value of 7dVL(sd) is maximized.Appendix A. Proof of the Convergence of the TPDP Version of Q-Learning 194By again using weighted maximum norm weights of 1, the bound of condition 4 of Jaakkola et. al.’s Theorem 1 can be expressed as follows:K(1 + IItIIw)2 = i(1 +maxmaxlQt(s,u) —K(i +maxlminQt(s,u) _minQ*(s,u)l)> K(1+maxIV(s)-V*(s)) (A.60)Since the notation of Equation A.60 can be arbitrarily changed to:2K(1 + IItUw)2 K(i +mxIVt(j) - (A.61)Equations A.59 and A.61 can be employed to verify that condition 4 ofJaakkola et. al.’s Theorem 1 is met:2 2K1+372(mxVt(j)) K(1 +mxVt(j)-V.(j)I) (A.62)Equation A.62 is true because conditions 3 and 4 of Theorem A ensure thatthe maximum value of V(j) for any state j is bounded — the left hand sideof Equation A.62 is therefore also bounded. The right hand side of EquationA.62 always has a value of at least K, and K can be arbitrarily increased toexceed the left hand side bound. As a result:Ki+372(maxVt(j)) K(1+mxIVt(i)_*(i)I)Var{Ft(s,u)} K(1 + IItIw)2 (A.63)condition 4 of Jaakkola et. al.’s Theorem 1 is therefore met. UAppendix BFull Description of the Practical TPDP AlgorithmAppendix Section B.2 fully describes the Practical TPDP Algorithm, and AppendixSection B.3 fully describes the Stack Updating Procedure it calls.B.1 General Operation of the Practical TPDP AlgorithmThis Appendix Section repeats part of Section 4.2.4.The general operation of the Practical TPDP Algorithm is as follows. The PracticalTPDP controller can be in one of three exploration modes. The mode in effect at anytime is identified by the variable explore as ‘none’, ‘internal’ or ‘external’. When noexploration is occurring actions are randomly chosen from those specified by the TPs atthe states encountered. The immediate costs incurred when those actions are taken areobserved, and the Q-values, R-values and weights of the TPs specifying those actions areupdated accordingly. Internal and external exploration (see Section 4.2.3) are randomlyinitiated in the midst of this process and are allowed to continue until a TP state isencountered. Internal and external exploration facilitate the allocation of new TPs thatcan be further assessed.B.2 The Practical TPDP AlgorithmConsidering the Practical TPDP Algorithm presented in Figure B.40 (a convenientlylocated copy of Figure 4.3):195Appendix B. Full Description of the Practical TPDP Algorithm 196Lines 1 to 5These lines initialize the algorithm by chosing a starting state s0 and a starting actionuo E U(so) for the trial. They also set a number of algorithm variables to startingvalues. These include explore; t, the time step; tupdate, an indicator of when the nextstack updating should occur; and tlastTP, the last time step in which a TP state wasencountered.The variable explore is initialized to ‘external’ if the starting state s0 has no TPbecause an initial TP action must be specified for the exploration mode to be made‘none’. The exploration mode ‘external’ results in random movement of the systemthrough the state space until a TP state is encountered. This is a reasonable behaviorif there is no TP associated with the starting state which specifies some sort of initialaction.Lines 10 to 11, and 60 to 63These lines determine when the calculations are made for Practical TPDP. Calculationsare made whenever the system enters a new state, or after a period Ttth has passed inthe same state (Line 11). The latter condition is used to prevent Practical TPDP frompermanently remaining in the same state, continually specifying the same action. Line10 ensures that Practical TPDP calculations continue until the system has reached oneof the absorbing states SA.Line 61 is used to update the expected evaluation function value Vexpected, a runningestimate of what the evaluation function value (i) will be at the next TP state iencountered. This estimate is based on the evaluation function value of the last TP stateencountered (Line 33) and the immediate costs experienced since then. The update ofVexpected is done recursively and is based on Equation 2.2.Appendix B. Full Description of the Practical TPDP Algorithm 197Line 63 calls the Stack Updating Procedure (Figure B.41.) to perform a final updatingwhen an absorbing state has been reached.Lines 20 to 34These lines define the three algorithm operations that can be performed when the systemencounters a TP state.The first operation (Lines 22 to 25) is to initiate internal exploration at the currentstate. This operation is performed when the value of the random parameter SWaTP 5greater than zero (Line 21). The value of swapTP has some fixed probability of beinggreater than zero each time it is considered’.When internal exploration is initiated the stack is updated using the evaluation function value V(s) ((St) = Q(s, it(St))) for the current state St (Line 22, see AppendixSection B.3). Then the explore variable is set to indicate the mode of exploration occurring (Line 23), and an experimental action Ut 1S randomly chosen from the set U(s)(Line 24). Finally, the current time step, state and action information is stored on thestack for future updating (Line 25).The second operation (Lines 27 to 31) is to randomly choose an action from thosespecified by the TPs at the current state st (Line 30). This facilitates evaluation of theTP specifying that action. To prepare for such evaluation the stack is updated (Line27), and the explore variable is set to indicate that no internal or external exploration isoccurring (Line 28). The current time step, state and action information is also storedon the stack for future updating (Line 31).The second operation is always performed if internal or external exploration has beenoccurring (excluding internal exploration which has just been initiated — in Lines 21to 25), and it terminates all such exploration (Line 26). The second operation is also‘Section 4.2.6 describes how the random distribution of °swap-Ti’ is determined.Appendix B. Full Description of the Practical TPDP Algorithm 198performed if no exploration has been occurring, but the update time tupdate has beenreached (see Section 4.2.5).The third operation (Line 32) is performed if neither of the other operations is. Thatis, the third operation is performed only if no exploration is occurring and the updatetime tupdate has not yet been reached. This operation involves simply storing the currenttime step, state and action information in the stack for future updating. When thestack update is performed, the Q-values, R-values and weights will be updated for eachstate identified in this manner. As will be explained in Appendix Section B.3, this isdone because even though no actions are specified in the states concerned, it will still bepossible to update some of the Q-values, R-values and weights associated with them.The specified field stored in the stack entry during the third operation is set to ‘false’to indicate that no action was specified at the state concerned (Line 32). This field is setto ‘true’ when the second operation, that of choosing a TP action, is performed (Line31). This field is set to a null value (0) when the first operation, the initiation of internalexploration, is performed (Line 25). This is because the value of this variable has nomeaning when exploration is occurring (see Appendix Section B.3).Finally, Lines 33 and 34 update variables Vexpected, V1astTP and tlast-TP — all of whichare based on the last TP state encountered. The expected evaluation function valueVexpecteci is a running estimate of what the evaluation function value V(i) will be at thenext TP state i encountered. As explained earlier in this Appendix Section (Lines 10 to11, and 60 to 63), this estimate is set to the evaluation function value of the last TP stateencountered (V(s) = Q(st, IZ(St))), and then recursively updated as immediate costs areexperienced (Line 61).The variable t1Tp indicates the time step in which the last TP state was encountered.This variable is used if internal or external exploration is initiated between TP states.It facilitates a stack update using the evaluation function value of the last TP stateAppendix B. Full Description of the Practical TPDP Algorithm 199encountered (see Appendix Section B.3).Lines 40 to 43These lines are used to make random changes in the route followed through the statespace during internal and external exploration. Route changes are made when exploration is already occurring, and the system is making state transitions through non-TPstates. The determination of when to make route changes is made based on the randomparameter chge, which has some fixed probability of being greater than zero each timeit is considered (Line 40)2.When a route change is made a new experimental action is randomly selected (Line42), and the current time step, state and action information are stored on the stack torecord the change (Line 43). If external exploration is occurring, the stack is first flushedso that only the last action specified will be stored on it when a TP state is encounteredand a stack update is performed (Line 41). As explained in Section 4.2.3, this ensuresthat only the last action specified before a TP state is encountered is allocated a TP.Lines 50 to 55These lines are used to initiate internal and external exploration when the system ismaking state transitions through non-TP states. The determination of when to initiatethis exploration is made based on the random parameter add-TP, which has some fixedprobability of being greater than zero each time it is considered (Line 5O).When exploration is initiated the stack is updated using the time step t1Tp inwhich the last TP state was encountered (Line 51). Then the determination of whetherthe exploration should be internal or external is made based on the random parameter2Section 4.2.6 describes how the random distribution of °change is determined.3Section 4.2.6 describes how the random distribution of °dd-Tp is determined.Appendix B. Full Description of the Practical TPDP Algorithm 200externai, which has some fixed probability of being greater than zero each time it isconsidered (Line 52). Finally, a new action is randomly selected (Line 54) and thecurrent time step, state and action information are stored on the stack to record thechange (Line 55).B.3 The Stack Updating ProcedureThe Stack Updating Procedure (see Figure B.41— a conveniently located copy of Figure4.4) operates by retrieving state and action combinations from the stack in reverse ofthe order in which they were experienced and recorded. As a result, updating begins atthe last TP state encountered before the stack update was initiated, and the experiencedcost 0totaj (the total infinite-horizon discounted cost experienced after the system leaveseach of the states recorded in the stack) can be recursively calculated starting with theevaluation function value Vupdate of that last TP state (Lines 2 and 13). The recursivelycalculated is used during updating in various ways, depending on the mode ofexploration in effect.Parameters passedThe Stack Updating Procedure is passed four parameters by the Practical TPDP Algorithm. These are explore, the mode of exploration occurring when the stack updatewas requested; the updating time step t, which is initialized to the time step when thelast TP state was encountered; Vupate, the evaluation function value of the last TP stateencountered; and Vexpected, the running estimate of what the evaluation function valueV(i) will be at the next TP state i encountered (see Section B.2).The mode of exploration passed to the Stack Updating Procedure as explore is always4Section 4.2.6 describes how the random distribution of0extemal is determined.Appendix B. Full Description of the Practical TPDP Algorithm 201the same as the mode of exploration that was occurring when each of the stack entrieswas stored. This is because the stack is updated every time the mode of exploration ischanged (see Section B.2 and Figure B.40).Lines 1 to 2Line 1 removes all entries from the stack whose time step t equals or exceeds the initialvalue of t — the time step when the last TP state was encountered. This operation willonly have an effect if the current state is a non-TP state and stack entries have beenrecorded since the last TP state was encountered. It removes all such entries because,unless the system is at a TP state, no evaluation function value will be available withwhich these non-TP states can be updated.Line 2 initializes the value of C0j.Lines 10 to 14These lines determine which state should next be updated as entries representing thestates are retrieved from the stack. For each recorded state the values t.9, i3, ‘u.s andspecified3 are retrieved from the stack (Line 11). The experienced cost is also recursivelycalculated by discounting C03.1 and adding the immediate cost c(i5,u3) to it for eachtime step (Lines 12 to 14).Lines 20 to 33These lines perform the updating of states when no internal or external exploration hasbeen occurring. If, at the state i5 being updated, a TP exists that specifies an actioncorresponding to the action u taken in that state, the Q-value of that TP is updated(Line 23). This is done whether the TP itself specified the action, or whether it had beenAppendix B. Full Description of the Practical TPDP Algorithm 202specified previously at another TP state. In either case the Q-value can legitimatelybe updated based on the consequences of that action being applied to the system. TheQ-value updating is done according to the TP updating Equation 3.18, and the updatemakes use of 0total• Incorporated in is both the evaluation function value Vupdateof the last TP state encountered by the system, and the immediate costs experienced asthe system moved from i3 to that state. Once the Q-value updating has been performed,the weight of the TP concerned is updated according to the rules presented in Section4.2.1 (Lines 24 to 29).Lines 30 to 31 are used to update R-values according to the R-value updating Equation3.27. If there is only one TP associated with state i3, and it did not specify an action whenstate i5 was encountered (as indicated by specified5;see Appendix Section B.2, Lines 20to 34), the R-value of that TP is updated. If there is more than one TP associated withstate i and one of them specifies the action w that was taken in state i3, the Ft-valuesof all of the other TPs are updated.These two cases facilitate updating that is very effective in producing R-values thatcan be used to assess the merits of each TP. The former case results in the R-value for asingle TP at state i reflecting the costs incurred if that TP did not exist. This R—valuecan then be directly compared to the Q-value of the single TP to assess the value of thatTP. The latter case results in the R-value for each TP associated with state i5 reflectingthe costs incurred when other TPs at state i specify an action. The R-value calculatedfor a TP in this manner can be compared with the Q-value of that TP to assess thevalue of that TP relative to the other .TPs. If this comparison reveals that the Q-valueof a TP is higher than its Ft-value, then it is known that other TPs at state i result inlower costs. The TP is removed as a result. The elimination of high Q-value TPs in thismanner results in the R-values of the remaining TPs being lowered, which leads to moreTP removals. Eventually a single TP will remain, and it will then be assessed solely onAppendix B. Full Description of the Practical TPDP Algorithm 203the basis of the costs that will be incurred if it is removed (as described previously). Thiswhole process is basically one of competitive learning (Rumelhart et. al., 1985).Finally, the policy TP for each state i8 identified in the stack must be determined (seeSection 4.2.2). Lines 32 to 33 make this determination using Equations 4.40 and 4.41.Lines 40 to 44These lines perform the updating of TP states when internal exploration has been occurring. At most one such state, the oldest state in the stack, will be identified duringeach stack update. This is because all exploration is terminated when a TP state isencountered, and the only TP state stored on the stack during internal exploration willbe the one in which the exploration was initiated (see Section B.2 and Figure B.40).If internal exploration is initiated in a TP state i8, the merit of the route takenthrough the state space during that exploration can be determined simply by comparingthe experienced cost Ctoai with the evaluation function value V(i3)(T(i8)= Q(i8,t(i3)))of that state. If Ct0tai is less than Q(i3,t(i3)) (Line 40), then the action u3 specified atthe initiation of the internal exploration is worthy of further consideration and a TP isallocated to specify that action at the TP state concerned (Line 41).The Q-value and R-value of any TPs allocated are initially set to Ctota.i (Line 42).This is the best initial estimate of the Q-value. The R-value is also set to this valueso that it will diverge away from the initial Q-value to become higher or lower. If thetrue R-value of a TP is higher than its true Q-value, and it is initialized to some valuelower than the initial Q-value, it will have to be updated until it exceeds the Q-value.During the period that it remains lower than the Q-value, the weight of that TP will befallaciously decreased (the TP may even be removed as a result). In a similar manner,the weight of a TP may be fallaciously increased if its R-value is initialized to some valuehigher than its initial Q-value. Initializing the Q-values and R-values of each TP withAppendix B. Full Description of the Practical TPDP Algorithm 204the same value is therefore the best strategy.When a new TP is allocated its weight is initialized to wjj (Line 43). This weightis also compared to the weight of the policy TP to see if the new TP should be madethe policy TP (Line 44). This comparison is a reduced version of the comparison madein Line 33, with the reductions resulting from conditions that are sure to be met when anew TP is being allocated.Lines 50 to 54These lines perform the updating of non-TP states when internal or external explorationhas been occurring.If external exploration has been occurring, the stack will only have one entry (seeAppendix Section B.2, Lines 40 to 43). For the reasons presented in Section 4.2.3, a newTP is unconditionally allocated at the appropriate state to specify the action indicatedby that stack entry.If internal exploration has been occurring, the merit of the route taken through thestate space during that exploration can be determined by making use of the expectedevaluation function value Vexpected (see Figure B.40 and Appendix Section B.2, Lines 10to 11, and 60 to 63). If the actual evaluation function value of the TP state encountered,Vupdate, is lower than Vexpected (Line 50), then the route taken through the state spaceduring the internal exploration was a low cost one and the actions specified along thatroute are worthy of further consideration. In this case new TPs are allocated to specifyeach such action at the states where they were specified during exploration.When new TPs are allocated at non-TP states, the allocation operations are the sameas those performed when allocating new TPs at states that already have TPs (Lines 41to 44). The oniy difference is that, being the only TPs at the states concerned, the newTPs are automatically made the policy TPs (Line 54).Appendix B. Full Description of the Practical TPDP Algorithm 2051. randomly choose starting state ü E Ss2. choose a starting action u0 E U(so)3. if (state has a TP): ‘none’ = explore4. otherwise: ‘external’ explore5. 0 t, 0 tupdate, 0 tlastTP10. while s is not an absorbing state SA:11. if (state s.9t—T) or (state St = 8t—T for time Ttk):20. if (state s has a TP):21. if (UswaTp > 0):22. update-stack( explore, t, Q(s, It(st)), Vexpectea)23. ‘internal’ explore24. randomly choose action Ut E U(s)25. push-on-stack(t, 5, t, 0)26. otherwise if (explore ‘none’) or (t > tupaate):27. update-stack( explore, t, Q(s, I1(st)), Vexpected)28. ‘none’ =‘ explore29. t + Udelay tupdate30. randomly choose action Ut from TP actions in U(s)31. push-on-stack(t, 5, Ut, ‘true’)32. otherwise: push-on-stack(t, s, Ut, ‘false’)33. Q(st,LL(st)) Vexpecteci, Q(st,p(st)) = V5ti’p34. t40. if (state s, has no TP) and (explore ‘none’) and (Uchge> 0):41. if (explore = ‘external’): flush-stack42. randomly choose action Ut E U(s)43. push-on-stack(t, 5, Ut, 0)50. if (state s has no TP) and (explore = ‘none’) and (cTaddTp > 0):51. update-stack(’none’,tItTP, V1atTp, 0)52. if (7extema1> 0): ‘external’ = explore53. otherwise: ‘internal’ z explore54. randomly choose action Ut E U(s)55. push-on-stack(t, 5t, Ut, 0)60. t+T=t61. (Vexpected — C(St, Ut)) Vexpected62. observe system for new state st63. update-stack( explore, t,.St, VexpecLed)Figure B.40: The Practical TPDP AlgorithmAppendix B. Full Description of the Practical TPDP Algorithm 206[Parameters passed: explore, t, Vupdate, Vexpected]while (time at top of stack t3 t): pop-off-stack(t5i, u5, specified3)Vupdaewhile (there are entries in stack):pop-off-stack(t5,js, u, specified)while (t > t5):7Gtotaj + c(i3,u5) =t—Ttif (explore = ‘none’):for (each TP action u é U(i3) in state i5):if (u = u3):(1 — )Q(i5,u3) + crCtotai = Q(i5,u5)if (Q(i5,u) < R(i3,u5)):w(i3,u.9) + 1 =‘ w(i3,u5)if (w(i,u5)> Wmax) Wmax = w(i5,u)otherwise:w(i3,u5)—1 = w(i3,u)if (w(i3,u3) = 0): remove the TPif [(state i has oniy one TP) and (specified3=[(u u3) and (another TP at state i5 specifies(1 — c)R(i5,u) + aCtotaj = R(i5,u)for (each TP action u E U(i5) in state i5):if [(w(i5,t(i5)) < Wt1,) and (w(i5,u) > w(i5,t(i3)))] or[(w(i1u(i)) ‘wh1.) and (w(i5,u) > W.) and(Q(i5,u) < Q(i5,.t(i5)))]: u = i(i5)40. if (state i5 has TPs) and (memory can be allocated) and(explore = ‘internal’) and (Ctotai < Q(i5,jz(8))):41. allocate a new TP at state i3 with action u42. 0totai Q(i5,u), = R(i3,u5)43. Winitial w(i5,u5)44. if (w(i3,u3) > w(i5,p(i3))) or w(i3,u9) wt): z& = IL(i3)50. if (state i3 has no TPs) and (memory can be allocated) and[((explore = ‘internal’) and Vupaate < Vexpected)) or (explore = ‘external’)]:allocate a new TP at state i3 with action uC01 = Q(i3,u3), Ctotai =‘ R(i5,u3)wjj = w(i5,u.3)1.2.10.11.12.13.14.20.21.22.23.24.25.26.27.28.29.30.31.32.33.‘false’)] or51.52.53.54.Figure B.41: The Stack Update Procedure
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Optimal control of dynamic systems through the reinforcement...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Optimal control of dynamic systems through the reinforcement learning of transition points Buckland, Kenneth M. 1994
pdf
Page Metadata
Item Metadata
Title | Optimal control of dynamic systems through the reinforcement learning of transition points |
Creator |
Buckland, Kenneth M. |
Date Issued | 1994 |
Description | This work describes the theoretical development and practical application of transition point dynamic programming (TPDP). TPDP is a memory-based, reinforcement learning, direct dynamic programming approach to adaptive optimal control that can reduce the learning time and memory usage required for the control of continuous stochastic dynamic systems. TPDP does so by determining an ideal set of transition points (TPs) which specify, at various system states, only the control action changes necessary for optimal control. TPDP converges to an ideal TP set by using a variation of Q-learning to assess the merits of adding, swapping and removing TPs from states throughout the state space. This work first presents how optimal control is achieved using dynamic programming, in particular Q-learning. It then presents the basic TPDP concept and proof that TPDP converges to an ideal set of TPs. After the formal presentation of TPDP, a Practical TPDP Algorithm will be described which facilitates the application of TPDP to practical problems. The compromises made to achieve good performance with the Practical TPDP Algorithm invalidate the TPDP convergence proofs, but near optimal control policies were nevertheless learned in the practical problems considered. These policies were learned very quickly compared to conventional Q-learning, and less memory was required during the learning process. A neural network implementation of TPDP is also described, and the possibility of this neural network being a plausible model of biological movement control is speculated upon. Finally, the incorporation of TPDP into a complete hierarchical controller is discussed, and potential enhancements of TPDP are presented. |
Extent | 4144678 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-04-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065157 |
URI | http://hdl.handle.net/2429/6896 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1994-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1994-893346.pdf [ 3.95MB ]
- Metadata
- JSON: 831-1.0065157.json
- JSON-LD: 831-1.0065157-ld.json
- RDF/XML (Pretty): 831-1.0065157-rdf.xml
- RDF/JSON: 831-1.0065157-rdf.json
- Turtle: 831-1.0065157-turtle.txt
- N-Triples: 831-1.0065157-rdf-ntriples.txt
- Original Record: 831-1.0065157-source.json
- Full Text
- 831-1.0065157-fulltext.txt
- Citation
- 831-1.0065157.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065157/manifest