Learning and Planning in Structured Worlds

by Richard W. Dearden
M.Sc., University of British Columbia, 1994

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 2000
© Richard W. Dearden, 2000

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
2366 Main Mall
Vancouver, BC, Canada V6T 1Z4

Abstract

This thesis is concerned with the problem of how to make decisions in an uncertain world. We use a model of uncertainty based on Markov decision problems, and develop a number of algorithms for decision-making both for the planning problem, in which the model is known in advance, and for the reinforcement learning problem in which the decision-making agent does not know the model and must learn to make good decisions by trial and error.

The basis for much of this work is the use of structured representations of problems. If a problem is represented in a structured way we can compute or learn plans that take advantage of this structure for computational gains. This is because the structure allows us to perform abstraction. Rather than reasoning about each situation in which a decision must be made individually, abstraction allows us to group situations together and reason about a whole set of them in a single step. Our approach to abstraction has the additional advantage that we can dynamically change the level of abstraction, splitting a group of situations in two if they need to be reasoned about separately to find an acceptable plan, or merging two groups together if they no longer need to be distinguished. We present two planning algorithms and one learning algorithm that use this approach.

A second idea we present in this thesis is a novel approach to the exploration problem in reinforcement learning. The problem is to select actions to perform given that we would like good performance now and in the future. We can select the current best action to perform, but this may prevent us from discovering that another action is better, or we can take an exploratory action, but we risk performing poorly now as a result. Our Bayesian approach makes this tradeoff explicit by representing our uncertainty about the values of states and using this measure of uncertainty to estimate the value of the information we could gain by performing each action. We present both model-free and model-based reinforcement learning algorithms that make use of this exploration technique.

Finally, we show how these ideas fit together to produce a reinforcement learning algorithm that uses structure to represent both the problem being solved and the plan it learns, and that selects actions to perform in order to learn using our Bayesian approach to exploration.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
1.1 Decision Theory and Planning
1.2 Reinforcement Learning
1.3 Putting it All Together
1.4 Organisation of this Thesis

2 Markov Decision Processes and Structured Representations
2.1 Markov Decision Processes
2.2 Policies and Optimality
2.2.1 The value of a policy
2.2.2 Optimal policies
2.3 Algorithms
2.3.1 Policy iteration
2.3.2 Value iteration
2.3.3 Modified policy iteration
2.4 Bayesian Networks
2.5 Structured Representations of Actions
2.5.1 Two-stage Temporal Bayesian Networks
2.6 Previous Work in Decision-Theoretic Planning
2.6.1 "Envelope" Algorithms
2.6.2 Real-Time Dynamic Programming
2.6.3 Abstraction by Ignoring Features
2.6.4 Other Approaches

3 Exploiting Structure for Planning
3.1 Structured Policy Iteration
3.1.1 Decision-Theoretic Regression
3.1.2 Structured Successive Approximation
3.1.3 Structured Policy Improvement
3.1.4 Computing Norms of Structured Value Functions
3.1.5 Structured Value Iteration
3.1.6 Results and Analysis
3.2 Approximate Structured Value Iteration
3.2.1 Ranged Value Trees and Pruning
3.2.2 Value Iteration Using Approximate Value Functions
3.2.3 Termination for ASVI
3.2.4 Results and Analysis
3.3 Summary

4 Reinforcement Learning: Background and Previous Work
4.1 Reinforcement Learning: The Basics
4.2 Model-free learning
4.3 Model Learning
4.4 Exploration
4.4.1 Directed Approaches
4.5 Generalisation and Structured Learning
4.5.1 G-Learning
4.5.2 Function Approximators
4.5.3 Variable Resolution Dynamic Programming
4.5.4 Learning with Structure
4.6 Structure and Exploration

5 Bayesian Approaches to Exploration in Reinforcement Learning
5.1 Q-value Distributions and the Value of Information
5.1.1 Distributions over Q-Values
5.1.2 The Myopic Value of Perfect Information
5.2 The Model-free Algorithm
5.2.1 The Normal-gamma Distribution
5.2.2 Updating the Q-distribution
5.2.3 Convergence
5.2.4 Results
5.3 Model-based Bayesian Exploration
5.3.1 Representing Uncertainty in the Model
5.3.2 The Dirichlet Distribution
5.3.3 The Sparse-multinomial Distribution
5.3.4 Translating from model uncertainty to Q-value uncertainty
5.3.5 Representing Uncertainty in the Q-values
5.3.6 Results

6 Structure-based Reinforcement Learning
6.1 Structured Prioritized Sweeping
6.1.1 Local Decision-Theoretic Regression
6.1.2 The Structured Model
6.1.3 The Structured Prioritized Sweeping Algorithm
6.1.4 Results
6.2 Bayesian Reinforcement Learning Where the Structure is Known
6.2.1 Model Uncertainty
6.2.2 The Beta Distribution
6.2.3 Q-Value Uncertainty
6.2.4 Structured Value of Information Calculation
6.3 Unknown Structure

7 Conclusions
7.1 Future Work
7.1.1 Future Work in Planning
7.1.2 Future Work in Learning

Bibliography

Appendix A Example MDPs
A.1 The Coffee Robot MDP
A.2 The Worst-Case and Best-Case Examples
A.3 Exogenous Events
A.4 The Process Planning Problem for Structured Prioritized Sweeping

List of Tables

1.1 One possible structured representation of the MOVE action from Figure 1.1
2.1 The transition matrix for action a in Figure 2.1
2.2 Two policies for the example MDP of Figure 2.1. The second is an optimal policy for this MDP
2.3 Successive approximations to the optimal value function produced by value iteration
3.1 Results for SPI and MPI on process-planning problems
3.2 Results for ASVI on the robot-coffee problem with exogenous events
3.3 Results for ASVI on the same robot-coffee problem as in Table 3.2, but with the exogenous events removed
5.1 Average and standard deviation of accumulated rewards over 10 runs of the chain domain. Each phase consists of 1,000 steps, so the first phase is the average performance over the first 1,000 steps of the algorithm. In each of the Bayesian runs, the priors were set such that the expected value of the mean was zero, and the variance of the mean was 400 for both VPI algorithms, 100 for QS+Mom, and 800 for QS+Mix. The mean of the variance was 1 in all cases, and the variance of the variance was 0.005 for all cases except QS+Mix, where it was 0.05. In most cases, the most important parameter was the variance of the mean. The algorithms are relatively insensitive to the variance parameters
5.2 Average and standard deviation of accumulated rewards over 10 runs on the loop domain. A phase consists of 1,000 steps. The priors were set with E[μ] = 0 and Var[μ] = 50, except for QS+Mix where Var[μ] = 200
5.3 Average and standard deviation of accumulated rewards over 10 runs of the maze domain. A phase consists of 20,000 steps. The priors were fixed with E[μ] = 0 and Var[μ] = 200 for all the Bayesian algorithms

List of Figures

1.1 The coffee-delivery robot problem. The robot prefers the world to be such that the person doesn't want coffee (State 5). The rectangles correspond to the states of the system, and the arrows to actions (in fact, all actions are possible in any state, only the "best" action in each state is shown). When the person wants coffee, the robot must move to the shop, get coffee, move to the office, and deliver the coffee
1.2 A path-planning task. The agent has already discovered the path shown with a solid line, but needs to explore to find the shorter path shown with a dashed line
2.1 An example Markov decision process with six states and two actions
2.2 The policy iteration algorithm
2.3 The value iteration algorithm
2.4 The modified policy iteration algorithm
2.5 An example of a Bayesian belief network
2.6 An example of an influence diagram
2.7 Action network with tree-structured CPTs
2.8 Reward tree for a structured MDP
3.1 The example coffee domain showing all three actions, the reward tree, and an initial policy. The dashed arcs represent persistence, where the corresponding tree has the variable at the root, 1.0 as the value of the "true" subtree, and 0.0 as the value of the "false" subtree
3.2 Examples of (a) a policy tree, and (b) a value tree
3.3 The modified policy iteration algorithm (slightly changed from Figure 2.4)
3.4 A tree simplified by removal of redundant nodes (triangles denote subtrees)
3.5 Appending tree T2 to the leaf labelled "2" in tree T1. The leaf labels are combined using the max function
3.6 An example of a value tree partitioning the state space. States x and y can be treated together, while z must be distinguished
3.7 A simple action network and the reward and value tree. In this and subsequent figures, variables at time t+1 (after the action) will be indicated by primed variable names (Z'); unprimed variables are at time t
3.8 Decision-theoretic regression of Tree(V) through action a in Figure 3.7 to produce Tree(Q_a^V): (a) Tree(V); (b) PTree(V, a); (c) FVTree(V, a); (d) Tree(Q_a^V)
3.9 Decision-theoretic regression of Tree(V) through action a in Figure 3.7 to produce Tree(Q_a^V): (a) Tree(V); (b) Partially completed PTree(V, a); (c) Unsimplified version of PTree(V, a); (d) PTree(V, a); (e) FVTree(V, a); (f) Tree(Q_a^V)
3.10 The decision-theoretic regression algorithm
3.11 PRegress(Tree(V), a). The algorithm for producing PTree(V, a)
3.12 Applying the successive approximation algorithm to the robot-coffee example with the policy and value function shown in Figure 3.2: (a) The future value trees for each action; (b) the unsimplified merged future value tree; (c) the final future value tree
3.13 A series of approximations to the value of the action in Figure 3.7
3.14 Three action trees used for structured policy improvement, the new value tree formed by merging them, and the corresponding policy
3.15 The continuation of the SPI algorithm from Figure 3.14. The value function that results from successive approximation, the Q-trees, and the resulting value function and (optimal) policy are shown
3.16 The structured value iteration algorithm
3.17 Optimal value trees for the (a) worst-case and (b) best-case problems with three variables
3.18 (a) Time and (b) space performance for SPI and MPI on the worst-case series of examples
3.19 (a) Time and (b) space performance for SPI and MPI on the best-case series of examples
3.20 The approximate structured value iteration algorithm. The changes from the SVI algorithm of Figure 3.16 are in boldface
3.21 Pruning applied to the value function computed in Figure 3.14. The trees are pruned to 20% and 50% tolerance respectively
3.22 Algorithm for optimal sequence of pruned ranged value trees [11]
3.23 The final pruned value tree, the corresponding greedy policy, and the value of that policy for 20% pruning applied to the robot-coffee problem
3.24 The final pruned value tree, the corresponding greedy policy, and the value of that policy for 50% pruning applied to the robot-coffee problem
3.25 The optimal value function for the robot-coffee problem
3.26 The effect of pruning on running time, average, and maximum error for the 55 thousand state process-planning problem
4.1 A reinforcement learning agent
4.2 Comparison of a learning agent's performance with optimal. The shaded area is the regret of the system
4.3 The Q-learning algorithm
4.4 The Dyna algorithm
4.5 The prioritised sweeping algorithm
4.6 The interval estimation algorithm
4.7 Meuleau and Bourgine's IEQL+ algorithm
4.8 The G-learning algorithm
5.1 A simple maze with prior information about the values of states. In (a) the priors are point values, while in (b) they are probability distributions. After a small amount of learning, the priors in (a) have almost disappeared, while those in (b) are still available to guide the agent
5.2 The behaviour of a Q-distribution over time
5.3 Q-distributions for the three possible actions in some state
5.4 The Bayesian Q-learning algorithm. Changes from standard Q-learning are in boldface
5.5 Examples of Q-value distributions of two actions for which Q-value sampling has the same exploration policy even though the payoff of exploring action 2 in (b) is higher than in (a)
5.6 The chain domain [74]. Each arc is labeled with an action and the corresponding reward. With probability 0.2, the other action is performed (i.e. if a was selected, the arc corresponding to b is followed)
5.7 Actual discounted reward as a function of number of steps. Results for the chain domain
5.8 The loop domain [113]. Many algorithms will converge before the left-hand loop is explored
5.9 Actual discounted reward as a function of number of steps. Results for the loop domain
5.10 Task 3. A navigation problem. S is the start state. The agent receives a reward upon reaching G based on the number of flags F collected
5.11 Actual discounted reward as a function of number of steps. Results for the maze domain
5.12 The effects of priors on the learning rate. Results are for the loop domain
5.13 Outline of the model-based Bayesian reinforcement learning algorithm
5.14 The global sampling with repair algorithm for translating from model uncertainty to Q-value uncertainty
5.15 Mean and variance of the Q-value distribution for a state, plotted as a function of time. Note that the means of each method converge to the true value of the state at the same time that the variances approach zero
5.16 Samples, Gaussian approximation, and kernel estimates of a Q-value distribution after 100, 300, and 700 steps of Naive global sampling on the same run as Figure 5.15
5.17 The (a) "trap" and (b) larger maze domains
5.18 Discounted future reward received for the "trap" domain
5.19 Comparison of Q-value estimation techniques on the larger maze domain. In all cases, kernel estimation was used to smooth the Q-distributions
5.20 The effects of smoothing techniques on performance in the large maze domain. Naive global sampling was used to produce the samples for all the algorithms
6.1 The prioritized sweeping algorithm
6.2 Local-Regress(Tree(V), a, φ, Tree(Q_a), Tree(R)), the local decision-theoretic regression algorithm
6.3 Local-PRegress(Tree(V), a, φ, Tree(Q_a)). The algorithm for producing PTree(V, a, φ)
6.4 Local decision-theoretic regression of a value tree through the action in Figure 6.5 to produce the Q-tree for the action given the value function
6.5 A simple action represented using a DBN, and the reward tree for the MDP
6.6 Learning the simple action of Figure 6.5
6.7 The structured prioritized sweeping algorithm
6.8 Structured prioritized sweeping compared with generalized prioritized sweeping on the 256 state linear (best-case) domain
6.9 Structured prioritized sweeping compared with generalized prioritized sweeping on the process-planning domain
6.10 The structured Bayesian exploration algorithm for reinforcement learning when the structure is known in advance
6.11 A structured representation of an action. The boxes represent parameters that must be learned

Acknowledgements

Firstly, I'd like to thank my supervisor, Craig Boutilier. Discussions with Craig have shaped almost all of the work in this thesis. I didn't fully appreciate just how valuable a resource he was until he left for the University of Toronto.

Secondly, I owe a great debt of gratitude to my co-authors on some of the papers that have been incorporated into this thesis. The structured planning work in Section 3.1 was a collaboration between Craig Boutilier, Moises Goldszmidt, and myself. In addition, the work on Bayesian approaches to reinforcement learning in Chapter 5 was a collaboration between Nir Friedman and myself, with additional contributions by Stuart Russell in Section 5.1.1 and David Andre in Section 5.3. In particular, Nir's influence on this thesis has been considerable—if nothing else, he taught me that I have nothing to fear from statistics!

Thanks are also due to the members of my committee, David Poole, David Lowe, Nick Pippenger and Marty Puterman. Having to read and provide comments on a thesis as long as this one qualifies as cruel and unusual punishment.

Numerous others have contributed to this thesis in less tangible ways. My parents and the rest of my family were always encouraging and supportive, no matter how long graduate school dragged on. The computer science department has been a very good environment for research. Particular thanks to Valerie McRea for making things work when I needed them to. Finally, thanks to all my friends both in Vancouver and elsewhere, for keeping me happy and filling my life with amusing diversions. Some of the people who have made the past six years particularly good ones are Eric, Aleina, Mike, Meighen, Lisa, Vanessa, and Emma.

RICHARD W. DEARDEN
The University of British Columbia
October 2000

Chapter 1

Introduction

This thesis is primarily concerned with the problem of how to make decisions in an uncertain world. In the most general terms, this means that there is a choice of possible courses of action, and some information about the effects of each course of action and preferences about how the world should be; and the problem is to decide what to do in order to change the world to reflect these preferences. In artificial intelligence (AI), this decision-making problem is frequently referred to as planning.
In discussing problems of this type, the idea of an agent is frequently helpful. The agent is the entity that does the decision making. It is situated in some environment or world, and has a variety of actions that it can carry out. Some of these actions change the state of the world, for example, by moving the agent or picking up an object, while others may be purely informational in that they don't change the world, but give the agent more information about it. An agent also has preferences about the way the world should be. These preferences allow the agent to choose between different actions it could carry out. Actions that change the world in ways that result in greater agreement with the agent's preferences are "better" than actions which lead to non-preferred states of the world. Where these preferences come from will not concern us here; we can assume that the agent is acting for someone, perhaps bringing them coffee, or running a factory for them, and the agent's preferences come from this person's preferences.

As an example of what we mean by decision-making in an uncertain world, consider a robotic agent that delivers coffee to a person as shown in Figure 1.1. The world it inhabits consists of a number of states (five are shown in the example) which can be described by a set of features or variables, such as whether or not the robot has coffee. The agent also has a number of actions it can perform, for example MOVE and DELIVER-COFFEE. These actions may be stochastic—they have several possible outcomes and each outcome has a probability of occurring. In our example, the DELIVER-COFFEE action performed in state 4 can leave the system in state 5 (the person has coffee), or in state 1 (neither the person nor the robot has coffee because it has been dropped). Finally, the agent has some set of preferences about how the world should be, for example our coffee robot would prefer it to be the case that whenever the person wants coffee, they have coffee (state 5 in the figure is preferred). In order to satisfy these preferences, the agent may have to perform a sequence of actions, perhaps anticipating when coffee will be wanted, moving to the coffee shop, buying a cup of coffee, moving to the office, and delivering the coffee.

Figure 1.1: The coffee-delivery robot problem. The robot prefers the world to be such that the person doesn't want coffee (State 5). The rectangles correspond to the states of the system, and the arrows to actions (in fact, all actions are possible in any state, only the "best" action in each state is shown). When the person wants coffee, the robot must move to the shop, get coffee, move to the office, and deliver the coffee.

Given a model of the world the agent is in, a description of the effects of the agent's actions, and a description of the agent's preferences, the planning problem consists of finding actions to perform in some subset of the states so that by performing those actions, the agent will satisfy some or all of its preferences. We call these actions a plan or policy. One such policy is shown by the sequence of actions in Figure 1.1. Although the other actions are not shown, they are available to the robot in every state (for example, a MOVE action can be performed at any time). The set of actions shown comprises one possible policy for satisfying the agent's preferences.

In describing our coffee robot problem, we have made the assumption that the agent has an accurate description of the effects of its actions and of its preferences about the world. Sometimes, however, this is not the case, and in order to make a good plan the agent must first learn this information. Learning is the second area discussed in this thesis, but rather than looking at the entire field of learning in AI, which encompasses the general task of building a model of a collection of data, we will restrict ourselves to the problem of learning how to act, or learning for the purposes of planning. The problem for a learning agent is to perform actions, and by observing the effects of the actions, eventually to be able to act in a way that satisfies its preferences. We will return to this learning problem in Section 1.2.

We said in the opening paragraph that we were interested in decision making in uncertain worlds. We have explained what we mean by decision making, but we have only touched on the subject of uncertainty. From an agent's point of view, there are a number of sources of uncertainty in the world. These include:

• The agent may not have complete information about the world. In other words, while the agent may be able to observe some aspects of the state of the world, other parts of the state may be unobserved or unobservable. We call this partial observability, and the alternative, where the agent always knows the current state in its entirety, we refer to as complete observability.

• The agent's actions may be uncertain in the sense that the agent does not know ahead of time exactly what the effects of its actions will be. The agent may know what outcomes are possible when it performs an action, and may also know what the chances are of each possible outcome actually occurring when it performs the action, but it may not know which outcome actually does occur until after it has performed the action. We call these actions stochastic since each outcome has a fixed probability of occurring if the action is performed in a particular state, and contrast them with deterministic actions in which a single outcome occurs whenever the action is performed in a particular state.

• The agent may be uncertain about the possible outcomes of its actions and about its preferences. This is the type of uncertainty we referred to in our discussion of learning. In this case the uncertainty can eventually disappear as the agent builds a better model of the world and how its actions change the world.

Obviously this problem is of great interest. Making decisions is one of the basic problems that any intelligent agent must tackle, and it would be a rare agent indeed that didn't have to deal with some form of uncertainty. In its most general form, it is also a very difficult problem. To make it more tractable, we will be making some assumptions about the types of problems we will examine in this thesis. In particular, we are interested in problems that display structure. By this we mean that rather than looking at arbitrary decision-making problems, we will restrict our attention to problems that can be represented and reasoned about in a compact manner. In our example we might reason about the problem using the representation in terms of the variables (the robot's location, whether the person has coffee, whether the person wants coffee, and whether the robot has coffee), rather than directly using the states. This allows us to represent actions in terms of the variables they affect—for example MOVE affects the location variable, but has no effect on the others.

Is this restriction to structured worlds a reasonable one? Does this restricted class of structured worlds include interesting problems? We believe that the answer to both these questions is "yes." In fact we would go so far as to say that most interesting problems display structure of this kind. When people describe problems or worlds, they do not describe them with statements such as "when you pull this lever in state number 102, the system makes a transition to state 76." Instead, they say things like "pulling this lever turns on the car's wind-screen wipers." Implied in this description is the fact that the lever has this effect regardless of what the driving conditions are like, or what gear you are in, so the description of the action is generalised over a wide variety of different states. This is exactly the sort of structure that we hope to take advantage of—descriptions of actions in general terms that hold over classes of similar states. We take advantage of the structure by reasoning about these classes of states together.

The idea of describing the world in terms of features such as the state of the wind-screen wipers and whether it is raining or not is of course a standard one in AI planning systems. The difference here is that we are applying it to a much more general class of problems than classical AI planning, namely ones that include the kinds of uncertainty we have discussed above. One way to think about this work—especially the planning parts of it—is as taking knowledge representation techniques from classical AI and applying them to problems from operations research, where planning under uncertainty has been of interest for much longer.

There is also plenty of practical evidence for the importance of structured problems. We have already mentioned classical planning, which has considerable success in tackling real-world problems such as spacecraft mission planning [61, 39] and scientific experiment design [99]. Similarly, structured representations based on Bayesian networks have been used extensively in medical diagnosis applications [52, 2]. An example that is closer to the stochastic systems we describe in this work is Livingston [115], a model-based reactive reasoning system for diagnosing faults in spacecraft and other complex systems. Livingston is a hybrid between a probabilistic system that handles uncertainty and stochastic actions, and a higher level logical reasoning system.

There are other types of structure apart from features that can also be exploited in planning and learning. For example, a robot moving around a building can reason in terms of rooms and hallways, rather than its absolute position. The algorithms we describe may not be well suited for exploiting these other forms of structure, but they are often complementary to other algorithms that are, so that a combination of two different structured approaches can lead to better performance than either approach alone. For a discussion of other algorithms that exploit problem structure of various kinds, see Section 2.6.

Using structure to simplify problems can be thought of as exploiting a type of prior knowledge—namely prior knowledge about the regularity of the world. We are also interested in exploiting other types of prior knowledge, particularly in learning. For example, imagine an agent that does not know the effects of its actions and is attempting to learn them in order to decide how to act. In many cases it is reasonable to assume that some information is available about the agent's actions, but that the information is uncertain. For example, when we build a robot agent, we can be fairly certain that operating the motor that powers the wheels will cause the robot to move, but we may not know with what probability the wheels will slip on different surfaces, or what chance there is that the motor will fail. We may be able to provide estimates of these probabilities to the agent, but they will be uncertain. This is another kind of prior information that we would like to exploit. By specifying prior information to the agent, but including a measure of how confident we are about that information, the agent can decide when to trust the information and when it may be unreliable. When the agent is relatively confident about the accuracy of its prior information, it does not need to spend time learning about those values, and can instead concentrate its learning time on states about which it is less confident. Obviously this can have a great impact on the effectiveness of a learning agent.

In summary, this thesis is concerned with different forms of prior information and how they can be used to gain leverage that will allow us to solve some very difficult problems in planning and learning. We will now describe in a little more detail how we can take advantage of prior information.

1.1 Decision Theory and Planning

Classical AI planning generally involves a static world in which nothing changes except as a result of the agent's actions, these actions are deterministic, and there is a single fixed goal. The agent's task is to find a sequence of actions that will achieve the goal from the current state. More recently, the field of decision-theoretic planning has investigated problems in which a number of these assumptions are relaxed. Decision-theoretic planning typically looks at problems where the agent's actions may have probabilistic effects (each outcome of an action in Figure 1.1 has an associated probability, so DELIVER-COFFEE in state 4 results in state 5 with probability 0.9 and state 1 with probability 0.1), there may be external actions or events occurring in the world (for example, the person may decide they don't want coffee after all), and preferences about the state of the world may be described much more generally, for example, using a function that gives the agent a numeric reward or reinforcement that depends on the state of the world. A numeric reward allows us to represent more complex kinds of preferences such as conflicting goals, as well as giving the agent the power to trade off different goals and determine which preferences are more important and so should be concentrated on.

Table 1.1: One possible structured representation of the MOVE action from Figure 1.1.

  Precondition         Action    Effects
  location = office    MOVE      location = shop (p = 0.8), location = office (p = 0.2)
  location = shop      MOVE      location = office (p = 0.8), location = shop (p = 0.2)
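As a concrete illustration of the kind of structured action description shown in Table 1.1, the following minimal Python sketch encodes MOVE as a mapping from a precondition on the location variable to a distribution over its new values, with every unmentioned variable left unchanged (the convention assumed in the text). The dictionary layout, function name, and the extra state variable in the example are illustrative choices, not notation from the thesis.

```python
import random

# Structured description of MOVE (Table 1.1): the outcome depends only on the
# current value of "location"; all other variables persist unchanged.
MOVE = {
    ("location", "office"): {"shop": 0.8, "office": 0.2},
    ("location", "shop"):   {"office": 0.8, "shop": 0.2},
}

def apply_move(state, rng=random.random):
    """Sample the result of performing MOVE in a state given as a dict of variables."""
    outcomes = MOVE[("location", state["location"])]
    new_state = dict(state)          # variables not mentioned by the action persist
    r, total = rng(), 0.0
    for value, prob in outcomes.items():
        total += prob
        if r <= total:
            new_state["location"] = value
            break
    return new_state

# Example: the robot starts in the office without coffee.
print(apply_move({"location": "office", "robot_has_coffee": False}))
```

The point of the representation is that the description stays the same size no matter how many other variables the domain has, since only the variables the action actually affects appear in it.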
Decision-theoretic planning has provided the tools  to tackle a great variety of interesting new problems that have not been studied in classical A l , and will be the focus of our research in planning. The greater generality and flexibility of decision theory comes at a cost, however. In practice, the computational demands of decision-theoretic approaches tend to be greater than for classical A l techniques in areas such as planning. Even problems with relatively small numbers of states and actions can be quite computationally demanding when represented in a decision-theoretic way, while similar sized problems are trivial for classical planners.  Part of the reason for this  is one of the great strengths of classical planning; its structured representation of problems in terms of features. This representation allows irrelevant features of the world to be ignored, and actions to be described both compactly in terms of their preconditions—the values that certain features must have for the action to have a particular result—and their effects—the features that are changed by performing the action, features that the action does not affect are left out of the action description. For example, the  MOVE  action in Figure 1.1 can be represented as shown in  Table 1.1, in which it is assumed that the values of all variables that do not appear explicitly are left unchanged by the action. This thesis is largely motivated by the desire to use some of these strengths of  8  classical planning in a decision-theoretic framework. We will not examine another type of structure commonly used in classical planning, namely the use of schemas  action  where a single description, for example "goto(X)", can describe a whole  class of real actions such as "go to the office" and "go to the coffee shop." While this is another form of structure that can be exploited, it is outside the scope of this work. One traditional approach to problem solving in A l (and in fact in many other fields) is  abstraction.  Abstraction can be thought of as making a difficult problem  easier by ignoring some of the details. We will use abstraction extensively in this thesis, primarily in the form of state  space abstraction,  where the details we ignore  are distinctions between states (the action schemas we mention above are another example of abstraction, in this case action abstraction). There are many ways to perform abstraction. For our purposes, we will be abstracting by treating sets of states of the world as if they were a single state. We would like such an abstraction to be non-uniform,  by which we mean that the level of abstraction varies in different  parts of the problem—in places where we need a lot of detail to decide what action to perform, we will do very little abstraction, while in other places we will be able to group many more states together. We would also like our abstraction scheme to be  adaptive  rather rather than fixed—if we choose a particular level of abstraction  in a certain part of the state space, and in the course of planning discover that we need more or less detail, we can change the abstraction we use to suit our changing needs. The  structured  policy  iteration  (SPI) algorithm we present in Chapter 3 is  an attempt to transfer two of the strengths of classical planning that we discussed above to decision-theoretic planning. 
These are the representation of the world using features, and the use of abstraction to ignore irrelevant features and group sets of world states together and treat them as a single state if they differ only in these irrelevant details. In our example, the agent might determine that all the states (not  9  shown in the figure) in which the person doesn't want coffee can be grouped together because the robot acts the same (in fact, does nothing) in all of them. This use of abstraction allows SPI to solve much larger problems than other decision-theoretic planning algorithms, provided that the problems are fairly structured. SPI produces as output a structured representation of an optimal policy or plan. As we said above, decision-theoretic planners frequently produce optimal plans, while classical A l planning is generally more concerned with finding any reasonable plan for achieving the goals. W i t h this in mind we also present an approximate version of SPI that sacrifices optimality for even more computational gain. We anticipate that this algorithm will actually be far more useful in the long term than the SPI algorithm. The key contribution of our work in decision-theoretic planning is the structured representation of problems and their solutions, and the powerful abstractions this allows us to create, and in particular, the  decision-theoretic  regression  opera-  tor (described in Section 3.1.1) which is the basis for these algorithms. Decisiontheoretic regression allows us to maintain the structured representation as we incrementally construct a plan.  1.2  Reinforcement Learning  Learning is also possible in a decision-theoretic framework.  Reinforcement  Learning  (RL), in which the agent receives a reward or reinforcement based on the current state and the action it chooses to perform at each time step, is a common approach to learning in stochastic environments. R L agents typically operate by trying out various actions in different states of the world, observing the reinforcements that they accumulate over time, and using this information to determine which action is best to perform in each state. Going back to the problem in Figure 1.1, the task for a reinforcement learning agent is the same as for our planning agent, to find a plan that gives the person coffee whenever they want coffee, but this time, the agent 10  SHOP  Figure 1.2: A path-planning task. The agent has already discovered the path shown with a solid line, but needs to explore to find the shorter path shown with a dashed line. doesn't have a complete model of the problem (the effects of the actions, and its preferences) available in advance. It must learn the information it needs to find a good plan. One important issue for a learning agent is how to  explore  effectively. When  we evaluate the performance of a R L agent, we typically do not allow it a "free" learning phase where it can try out actions without incurring any cost. Instead, we expect a R L agent to start performing well as soon as possible. This produces a dilemma for the agent—should it perform the action that seems to be the best, or instead choose an action that doesn't look as good at present, but will give the agent more information about the world? In general terms, the answer to this question is to explore a lot at the beginning of learning, but as time goes on, the amount of exploration is steadily reduced so that, more and more frequently, the estimated best action is chosen. 
Consider a lower-level version of the coffee robot problem in which the robot must plan a route from the office to the shop and back as shown in Figure 1.2, in which the agent already has a plan available to it, namely the solid line in the figure. Exploration is needed if the agent is to discover other potentially superior  11  plans such as the route shown with a dashed line, or the short-cut via the meeting room. Another important issue for R L agents is whether to learn a model of the world or not. Both  model-free  learning algorithms, where the agent attempts to learn  how to act without building a model of its environment, and  model-based  learning  algorithms, in which the agent builds a model and uses that to determine how to act, are possible, and there has been much debate over the advantages and disadvantages of each approach. Model-based systems tend to learn faster because they can do additional reasoning but model-free approaches are more generally applicable, and can more effectively make use of strategies like function approximation to reduce the complexity of a learning problem. As part of this thesis, we will present three reinforcement learning algorithms. The first two,  Bayesian  Q-learning  and  model-based Bayesian  exploration,  are re-  spectively, model-free and model-based approaches to the exploration problem. The common motivation for these approaches is that an agent should decide when to explore and when to how  confident  exploit  the knowledge it already has by taking into account  it is about that knowledge. When the agent is relatively confident  about the values of different actions in a particular state, it probably will learn little through exploration, so it can go ahead and perform the best action. On the other hand, when the agent is more uncertain about the values of the actions, it is worth more to explore since there is greater likelihood that an action that it currently thinks is sub-optimal will turn out to be the best after all. This confidence-based approach allows prior information to be easily incorporated into a learning problem since we can specify both the information itself, and also how much trust the agent should have in the accuracy of the information. The key contribution we introduce in our work on exploration is the idea of selecting an action to perform based on the value of the information we might potentially get by performing it. Exploration can be thought of as a tradeoff be-  12  tween performance now and performance in the future, and the information  myopic  value of  measure we introduce attempts to quantify the future performance so  that the tradeoff can be made in an informed manner.  1.3  Putting it All Together  Given the close relationship between decision-theoretic planning and reinforcement learning, we would like to apply our ideas about structure in learning problems as well as in planning. Our aim is to produce a model-based reinforcement learning algorithm that learns a structured model, and uses the learned structure in the same way that SPI does to produce abstract policies that ignore irrelevant details. The third R L algorithm we will present,  structured prioritized  sweeping,  does  just this, learning more quickly by exploiting problem structure to do more with the data it observes. It can be combined with model-based Bayesian exploration to produce an algorithm that takes advantage of both abstraction and good exploration strategies to substantially increase the effectiveness of learning.  
1.4  Organisation of this Thesis  In the next chapter, we will introduce the basics of Markov Decision Processes (MDPs), a model used extensively both in decision-theoretic planning and in reinforcement learning. We will introduce the mathematical background of M D P s , describe the idea of optimality in this context, and present three commonly used algorithms for finding optimal policies.  We will also provide an introduction to  Bayesian Networks and influence diagrams. These will be our means of representing structure in problems, both for planning and learning. Also in Chapter 2, we will examine some of the previous work using M D P s that has been done in the A l community. Chapter 3 contains descriptions of two planning algorithms that we have de-  13  veloped to take advantage of structure in problems, structured policy iteration, and approximate  structured  value iteration  (ASVI). The SPI algorithm takes advantage  of structure in problems to find optimal policies more efficiently, while A S V I uses the structure to greatly reduce the complexity of the problem being solved, but sacrifices optimality in the process. In Chapter 4 we turn our attention to learning, and in particular to reinforcement learning. This chapter is a summary of the standard techniques in reinforcement learning, as well as work that is related to the approaches we will be describing in Chapters 5 and 6. We particularly concentrate on the issue of exploration in reinforcement learning. Chapter 5 presents the two learning algorithms we have developed that mostly tackle the issue of how to explore effectively, Bayesian Q-learning, which is a model-free learning algorithm that allows us to more effectively use prior information that may be available to the learning agent, and model-based Bayesian exploration, the model-based analogue of Bayesian Q-learning, which also uses V P I to select actions to perform, but bases its decision on the model of the system that it creates. We compare both these algorithms with many of the standard algorithms we present in Chapter 4. In Chapter 6, we describe preliminary research designed to integrate the learning work from Chapter 5 with the planning algorithms in Chapter 3. We describe the structured prioritised sweeping algorithm, a model-based reinforcement learning algorithm that uses ideas from SPI to learn more efficiently than standard techniques.  We also describe how this algorithm fits together with model-based  Bayesian exploration to produce an algorithm that explores very effectively to gain useful data for learning and then uses this data very efficiently to speed learning. Finally, in Chapter 7, we summarise the main points of this thesis, and discuss future directions that could be taken with this work.  14  Chapter 2  M a r k o v Decision Processes and S t r u c t u r e d Representations In this chapter we introduce  Markov  decision problems  (MDPs), the mathematical  foundation that underlies our work in both planning and learning. As we said in Chapter 1, we are interested in exploiting structure in problems for computational gain. Section 2.5 will describe a way of representing M D P s using Bayesian networks that explicitly models the kinds of structure that we want to exploit. In Chapters 3 and 6 we describe algorithms that use this representation to increase efficiency in planning and learning respectively. 
The types of problems we are interested in for both planning and learning have the following characteristics: • There are finitely many states and actions. Although domains with infinitely many states are also interesting, and many of our techniques can potentially be applied in these domains, we will restrict ourselves here to finite state and action spaces for computational and expositional reasons. • Time passes in discrete steps. The agent makes one decision (chooses one action to perform) at every time step. Again, more complex models of time exist, but we will not consider them here.  15  • Actions have stochastic effects. When an action is performed in a particular state, there are a number of possible states that could result, and each of these outcomes has a probability associated with it. In other words, actions can be viewed as mappings from states to probability distributions over states. The set of deterministic actions is a subset of the set of stochastic ones, so the ideas we discuss are also applicable in the deterministic setting, but we will not make such a restrictive assumption about the world here. • The agent receives a numerical reward or reinforcement at each time step which depends on the state the system is in, and may also depend on the action selected by the agent. These rewards can be thought of as a way of representing immediate preferences over states. States that are preferred are given large rewards, while states that should be avoided are made lees rewarding. Rewards can also be thought of as representing the cost of performing a particular action in a state. • The system runs indefinitely, or for a sufficiently long time that it can be approximated by a model that runs indefinitely. Many of our algorithms can actually be applied to problems which halt after a known finite number of steps with only small modifications, but for ease of presentation, we will concentrate on the infinite-horizon problem here. • The world is completely observable by the agent. A t any time step, the agent knows with certainty the true state of the world, and this knowledge is available at no cost to the agent. Again we can imagine relaxing this assumption, and many of the ideas in this thesis can potentially be applied if we do, but there are a number of additional sources of complexity in these models, so we will restrict ourselves to the completely observable case. One common approach to modeling domains with these characteristics is to use a Markov decision process [7, 57, 58]. Markov decision processes are a very gen-  16  eral formulation of decision making problems involving various types of uncertainty and have all the desirable properties we have mentioned above. They have been extensively studied in the field of Operations Research and there is an extensive literature, both practical and theoretical, upon which to draw. Markov decision processes have been applied to problems in both decision-theoretic planning and reinforcement learning, and so provide a simple framework in which to combine the two. It therefore seems natural to use Markov decision processes as the underlying model for all the problems that will be described in this thesis.  2.1  Markov Decision Processes  Following Puterman [84], we describe a Markov decision process model in terms of five elements: decision epochs, actions, states, transition probabilities, and rewards. 
Partially-observable Markov decision processes typically also deal with observations, but since we are restricting ourselves to completely observable problems, we will not discuss observations here. In a Markov decision process, an agent makes decisions at points in time which we refer to as decision epochs. Although the set T of decision epochs can be either finite or infinite, and either discrete or continuous, we will restrict ourselves to considering discrete infinite sets. In practice this means that the agent will make decisions at a series of fixed times (rather than being able to make decisions at any time), and will continue to make decisions indefinitely. While the agent may not, in fact, continue acting forever, in situations where the agent makes a very large number of decisions, or does not know in advance when it will stop acting, this is a reasonable approximation. We refer to this as an infinite-horizon  model since the  agent must consider the effects of its actions over infinitely many future decision epochs. A t each decision epoch, the system is in some state s £ S, where S is the set of all possible system states. A t this point, the agent chooses some action a to  17  perform, where a is a member of the set of allowable actions in state s, A . Let s  A — Uses  ^e  t  n  e  s e t  °^ ^ actions. We will assume that all actions are possible a  in every state, in which case A = A for all s. s  The transition function, Pr(s,a,s')  represents the probability that when ac-  tion a is performed with the system in state's, the system moves to state s'. For all states s and actions a we have that X^'gs •  ^M > > s') = 1. s  a  As a result of performing action a in state s, the agent receives some finite,  real-valued reward R(s, a), and the system moves to a new state s' according to the transition function Pr(s, a, s'). The reward function R is a mapping from states and actions to real numbers. In the learning community, rewards are frequently referred to as reinforcements  and we will use the terms interchangeably here.  As well as depending on the state and action, rewards can also depend on the state that results from performing the action, in which case we write the reward function as R(s, a, s'). If we let R(s, a) — Y^s'eS -^( > » ')P ( , s  value of R(s,a,s'),  a  s  r  s  a  i ') s  D e  the expected  there is no difference under most notions of optimality between  planning using R(s,a,s')  and R(s,a).  This does not hold, however, for learning.  Most reinforcement learning algorithms assume that the reinforcement received is only a function of s and a, and will not necessarily learn good policies if this is not the case. We will assume that R is a function only of s and a. We call the tuple <T,S,A,Pr,R>  a Markov  decision process.  It is Markovian because the transition function and  reward depend only on the current state and action, not on any previous states the system may have inhabited. Although studies have been made of history-dependent decision processes, they are beyond the scope of this research, and can often be translated into Markovian problems by including the necessary history explicitly in the state and action models [71]. We note that for large problems, both the reward and transition functions will  18  Figure 2.1: A n example Markov decision process with six states and two actions. require substantial amounts of memory to store in explicit functional form. 
For an n-state, m-action Markov decision process, the reward function may be represented as an m X n matrix, while the transition function is stored as a set of m n X n matrices, one for each action. In this representation, element  of the matrix for action a  contains Pr(s,-, a, Sj). In practice we can expect that most of the matrix entries will be zero since there is zero probability of a transition between the two states. For this reason, sparse matrices are generally used to represent the transition function, in which case each matrix is roughly of size nb where b is the average branching factor—the number of possible outcomes of the action. Figure 2.1 shows an example Markov decision process. The state space S consists of six states {si, S2, S3, S4, S5, SQ}, A contains two actions a, and b, and the transition function Pr is shown by the probabilities associated with the arcs. For example, Pr(si,b,  S5) = 0.7, while Pr(si,a,s ) 5  = 0.0 since there is no arc between  those states labelled with a, which we interpret as meaning that the probability is zero. The full transition matrix for action a is shown in Table 2.1. The reward  19  Table 2.1: The transition matrix for action a in Figure 2.1 To From Sl S2 S3  s  4  $6  0.4 0 0.8 0 0 0  S2  S3  s  ss  s&  0.6 1.0 0 0.3 0.5 0  0 0 0 0 0.5 0  0 0 0.2 0 0 0  0 0 0 0 0 0  0 0 0 0.7 0 1.0  4  function for the Markov decision process is not given in the figure. One possible reward function would be to give a reward of —1 for any action in any state other than se, and a reward of 0 for any action in state SQ. This reward function suggests that we would like the world to be in state se, and would like to keep it there as much as possible, but that we don't care which actions are used to achieve this.  2.2  Policies and Optimality  For our purposes, a policy can be considered as a mapping IT : S —l A from states to actions which tells an agent which action to perform in every state. Because of the Markov property, policies need not depend on the history of states visited and actions performed (see [84] for a proof of this). We can think of a policy as a universal plan [92]. Since we are restricting our interest primarily to infinite-horizon Markovian processes, we will restrict our attention to policies are stationary—they do not depend on the decision epoch. Furthermore, we examine only policies that are deterministic—for each state the policy consists of a single action rather than a probability distribution over actions. For completely-observable, infinite-horizon Markov Decision Processes (we define what we mean by this below), there is always an optimal policy that is stationary and deterministic (again see [84] for a proof), so we will only be interested in policies with these properties when planning. However,  20  in Chapters 4 and 5, we will discuss stochastic policies in the context of learning.  2.2.1  The value of a policy  Given a particular starting state, execution of a policy TT will cause the system to pass through an infinite succession of states, and collect a series of rewards. We use this series of rewards to compute the value of TT for every state. There are a number of possible criteria for assigning a value for a policy. These include the expected total reward, and the expected average reward per decision epoch. 
The criterion we will concentrate on here is the expected total discounted reward, in which rewards received t steps in the future are discounted by γ^t, where 0 ≤ γ < 1 is the discount factor, usually close to one. We use the discount factor to ensure that rewards received in the future are worth less than rewards received now. While this may not always seem like an appropriate measure of the value of a policy, it has the computational advantage that the value of all policies is finite. Expected discounted reward over an infinite horizon can also be justified as modeling a world in which the agent has a probability of 1 − γ of "dying" at any decision epoch, or where the agent knows that it will halt at some decision epoch, but does not know in advance which one. Alternatively, it can be justified as an economic model in which a dollar earned today is worth more than a dollar earned tomorrow.

Let v_γ^π be the value of policy π with discount factor γ. If X_t is a random variable denoting the state of the system at time t, then the value of following policy π with the world initially in state s can be written as:

    v_\gamma^\pi(s) = \lim_{N \to \infty} E\left[ \sum_{t=1}^{N} \gamma^t R(X_t, \pi(X_t)) \right]

where the expectation is over states visited while following policy π starting at state s.

Howard [57] has derived a more useful expression, at least from the point of view of computation, for the value of policy π. Howard's formula, which is the basis for the policy iteration algorithm described in Section 2.3.1, is:

    v_\gamma^\pi(s) = R(s, \pi(s)) + \gamma \sum_{t \in S} \Pr(s, \pi(s), t)\, v_\gamma^\pi(t)    (2.1)

We call v_γ^π(s) for all s ∈ S the value function for policy π with discount factor γ.

One possible policy for the MDP in Figure 2.1 would be to do a if the current state is s1 or s3, and b otherwise. With a discount factor of 0.9, the value of this policy is given in the third column of Table 2.2.

Table 2.2: Two policies for the example MDP of Figure 2.1. The second is an optimal policy for this MDP.

          Policy 1              Optimal Policy
State     Action    Value       Action    Value
s1        a         -9.86       b         -2.47
s2        b         -9.83       b         -3.28
s3        a         -9.40       a         -3.12
s4        b         -7.23       a         -1.89
s5        b         -6.80       b         -1
s6        b         -6.44       a          0

2.2.2  Optimal policies

Now that we have defined the value of a policy, the natural next step is to define an optimal policy, which we do in the obvious way. We say that π* is an optimal policy under expected total discounted reward if, for all s ∈ S and all π:

    v_\gamma^{\pi^*}(s) \geq v_\gamma^\pi(s)

We call v_γ^{π*} the optimal value function for this MDP. There may be many optimal policies for a given MDP, but there is exactly one optimal value function [58, 84]. We refer to a Markov decision process together with an optimality criterion as a Markov decision problem (MDP). The optimal policy and its value for the example MDP of Figure 2.1 are shown in Table 2.2.
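To make the matrix representation and Equation 2.1 concrete, the sketch below (Python is used purely for illustration; the array names are ours, not part of the formalism) encodes the transition matrix of Table 2.1 together with the reward function described above, and evaluates the policy "do a in every state" by solving the |S| linear equations of Equation 2.1 with γ = 0.9. The resulting values agree with those reported for this policy when policy iteration is traced on the example in Section 2.3.1.

import numpy as np

# Transition matrix for action a (Table 2.1); row i gives Pr(s_i, a, .).
P_a = np.array([
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

# Reward of -1 in every state except s6, independent of the action performed.
r = np.array([-1.0, -1.0, -1.0, -1.0, -1.0, 0.0])
gamma = 0.9

# For the policy "do a in every state", Equation 2.1 is the linear system
# (I - gamma * P_a) v = r, which we solve directly.
v = np.linalg.solve(np.eye(6) - gamma * P_a, r)
print(v.round(2))   # [-10.   -10.    -8.87  -3.7   -9.49   0.  ]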
2.3  Algorithms

We now present three algorithms for finding optimal policies for MDPs. The first, policy iteration [57], is based on the optimality equation given earlier, and is conceptually the easiest algorithm to describe, but will not be used further in this work. The second algorithm, value iteration [7], is the basis for the structured approximate value iteration algorithm described in Section 3.2, and is closely related to the way reinforcement learning operates; thus it plays an important role in Chapters 4 to 6. The final algorithm, modified policy iteration [85], is a generalisation of both policy and value iteration that usually performs better than either of them [84]. It is the basis for the structured policy iteration algorithm described in Section 3.1.

2.3.1  Policy iteration

The policy iteration algorithm [57] iteratively explores the space of possible policies. The idea is to begin with some policy, find a better one, and continue until no better policy can be found. When this occurs, the final policy is guaranteed to be optimal (see [84] for proofs). The algorithm is given in Figure 2.2.

Input: A Markov decision process <T, S, A, Pr, R>, a discount factor γ.
Output: An optimal policy π*.
1. Let n = 0, and π_0 be an arbitrary initial policy.
2. Compute v_n, the value of π_n, by solving the set of |S| equations in |S| unknowns given by Equation 2.1.
3. For each s ∈ S, let π_{n+1}(s) ∈ A^+, where A^+ is the set of actions a that maximise
       R(s, a) + γ Σ_{t∈S} Pr(s, a, t) v_n(t).
   If π_n(s) ∈ A^+, then let π_{n+1}(s) = π_n(s).
4. If π_{n+1}(s) = π_n(s) for all s ∈ S, then stop, and set π* = π_n; otherwise increment n and go to step 2.

Figure 2.2: The policy iteration algorithm.

Step 2 of the policy iteration algorithm is generally called policy evaluation. It is usually accomplished using Gaussian elimination, and since the matrices are usually sparse, can be solved in O(|S|^2) time [91] (O(|S|^3) if the matrix is dense). Step 3, policy improvement, is O(|S||A|) (or O(|S|^2|A|) for dense matrices). The number of iterations of policy iteration is polynomial in |S| (see [27] for details), and in practice the number of iterations is usually small.

On the example MDP of Figure 2.1, policy iteration operates by beginning with some arbitrary policy, for example "do action a in every state". Computing the value of this policy we get v(s1) = v(s2) = −10, v(s3) = −8.87, v(s4) = −3.7, v(s5) = −9.49, v(s6) = 0. We then search for an improving policy by checking every action in each state to find an action better than the one in the policy, and get π(s1) = π(s2) = π(s5) = b, π(s3) = π(s4) = π(s6) = a. The value of this policy is v(s1) = −2.47, v(s2) = −3.28, v(s3) = −3.12, v(s4) = −1.89, v(s5) = −1, v(s6) = 0. At this point we can find no improving action for any state, so we have an optimal policy.
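The following sketch is one possible rendering of Figure 2.2 for the flat matrix representation; the function signature and the tie-breaking tolerance are our own choices rather than anything prescribed here. P maps each action to its |S| × |S| transition matrix and r maps each action to a length-|S| vector of rewards R(s, a), as in the earlier sketch.

import numpy as np

def policy_iteration(P, r, gamma):
    """A sketch of Figure 2.2 for the explicit matrix representation."""
    actions = list(P)
    n = len(r[actions[0]])
    pi = [actions[0]] * n                          # Step 1: arbitrary initial policy.
    while True:
        # Step 2 (policy evaluation): solve the |S| linear equations of Equation 2.1.
        P_pi = np.array([P[pi[s]][s] for s in range(n)])
        r_pi = np.array([r[pi[s]][s] for s in range(n)])
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Step 3 (policy improvement): pick a maximising action, keeping the old one on ties.
        q = {a: r[a] + gamma * P[a] @ v for a in actions}
        pi_new = []
        for s in range(n):
            best = max(actions, key=lambda a: q[a][s])
            pi_new.append(pi[s] if q[pi[s]][s] >= q[best][s] - 1e-9 else best)
        # Step 4: stop when the policy no longer changes.
        if pi_new == pi:
            return pi, v
        pi = pi_new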
2.3.2  Value iteration

The value iteration algorithm [7] finds an optimal policy for an MDP by first computing the optimal value function, and then finding a policy that has that value. Value iteration is an approximation algorithm. As the iteration i increases, v^i(s) approaches the optimal value function for state s. When one of the termination conditions given below holds, the policy found is ε-optimal, where π* is ε-optimal if, for all s:

    v_\gamma^{\pi^*}(s) \geq v_\gamma^*(s) - \epsilon

The value iteration algorithm is given in Figure 2.3.

Input: A Markov decision process <T, S, A, Pr, R>, a discount factor γ, an optimality constant ε.
Output: An ε-optimal policy π*.
1. Let v^0 be an arbitrary value function, let ε > 0, set n = 0.
2. For each s ∈ S, compute v^{n+1}(s) using Equation 2.2.
3. If the termination condition (Equation 2.3 or 2.4) holds then go to step 4; otherwise increment n and go to step 2.
4. For each s ∈ S, choose
       π*(s) ∈ arg max_{a∈A} { R(s, a) + γ Σ_{t∈S} Pr(s, a, t) v^{n+1}(t) }

Figure 2.3: The value iteration algorithm.

The algorithm works by making a series of better and better approximations to the true value of the optimal policy by repeated application of the following equation:

    v^{n+1}(s) = \max_{a \in A} \left\{ R(s, a) + \gamma \sum_{t \in S} \Pr(s, a, t)\, v^n(t) \right\}    (2.2)

We will often refer to the use of this equation as performing a Bellman backup. There are a number of possible termination conditions [84]. The supremum norm criterion:

    \max_{s \in S} | v^{n+1}(s) - v^n(s) | < \frac{\epsilon (1 - \gamma)}{2\gamma}    (2.3)

is commonly used when a close-to-optimal value function is needed. It produces a value function that is within ε/2 of optimal (and an ε-optimal policy). In situations where finding a good policy is more important than obtaining a good estimate of v*, faster convergence can be achieved using the span-seminorm termination criterion:

    \max_{s \in S} \{ v^{n+1}(s) - v^n(s) \} - \min_{s \in S} \{ v^{n+1}(s) - v^n(s) \} < \frac{\epsilon (1 - \gamma)}{\gamma}    (2.4)

When we apply the value iteration algorithm to the example MDP of Figure 2.1, we start with the value v^0 equal to the reward function R, then repeatedly apply Equation 2.2 for each state to get a sequence of new value functions as shown in Table 2.3. ε-optimality with ε = 0.1 is reached after five iterations. The corresponding policy is given in the final column of the table.

Table 2.3: Successive approximations to the optimal value function produced by value iteration.

State    v^0    v^1    v^2    v^3    v^4    v^5    π
s1       -1     -1.9   -2.14  -2.33  -2.39  -2.44  b
s2       -1     -1.9   -2.71  -2.97  -3.14  -3.21  b
s3       -1     -1.9   -2.6   -2.82  -2.99  -3.05  a
s4       -1     -1.27  -1.51  -1.73  -1.80  -1.85  a
s5       -1     -1     -1     -1     -1     -1     b
s6        0      0      0      0      0      0     a
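A compact sketch of Figure 2.3 using the span-seminorm test (2.4) is given below; the inputs follow the earlier sketches and the names are ours. It starts from an all-zero value function, which the algorithm permits since the initial value function is arbitrary.

import numpy as np

def value_iteration(P, r, gamma, eps):
    """A sketch of Figure 2.3: P and r are dicts keyed by action, as before."""
    actions = list(P)
    n = len(r[actions[0]])
    v = np.zeros(n)                                    # Step 1: arbitrary initial values.
    while True:
        q = np.array([r[a] + gamma * P[a] @ v for a in actions])   # Equation 2.2.
        v_new = q.max(axis=0)
        diff = v_new - v
        if diff.max() - diff.min() < eps * (1 - gamma) / gamma:    # Criterion (2.4).
            pi = [actions[i] for i in q.argmax(axis=0)]            # Step 4: greedy policy.
            return pi, v_new
        v = v_new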
2.3.3  Modified policy iteration

Modified policy iteration (MPI) [85] can be thought of as a hybrid of the policy iteration and value iteration algorithms, or as a generalisation of both. The key idea is that Step 2 of the policy iteration algorithm, solving the system of linear equations, is very expensive and an exact solution is generally unnecessary for finding an improving policy. In MPI, rather than solve the equations exactly, we approximate the solution by performing a few iterations of successive approximation:

    v_\pi^{n+1}(s) = R(s, \pi(s)) + \gamma \sum_{t \in S} \Pr(s, \pi(s), t)\, v_\pi^n(t)    (2.5)

where π is the policy found in the previous iteration, and v_π^n(s) is the previous estimate of the value of state s under π. A step of successive approximation is equivalent to performing a Bellman backup (Equation 2.2) where the policy is fixed.

Figure 2.4 shows the MPI algorithm in detail. At each iteration we find an improving policy and compute its value by performing successive approximation for m steps. If the termination criterion (typically either supremum norm or span-seminorm) is met, we halt. MPI generally converges faster than either of the preceding two algorithms [84]. As with value iteration, it is an approximation algorithm, and produces an ε-optimal policy.

Input: An MDP <T, S, A, Pr, R>, a discount factor γ, an optimality constant ε, and an integer m that is the number of steps of policy evaluation to perform per iteration.
Output: An ε-optimal policy π*.
1. Let n = 0, select an arbitrary initial value function v^0.
2. (Policy improvement) Let π_{n+1} be any policy for which
       R(s, π_{n+1}(s)) + γ Σ_{t∈S} Pr(s, π_{n+1}(s), t) v^n(t) ≥ v^n(s)
   for all s ∈ S.
3. (Policy evaluation) Let k = 0, and define u_0(s) by
       u_0(s) = R(s, π_{n+1}(s)) + γ Σ_{t∈S} Pr(s, π_{n+1}(s), t) v^n(t).
4. If the termination condition (Equations 2.3 or 2.4) applied to u_0 and v^n holds then go to step 8.
5. If k = m then go to step 7. Otherwise compute u_{k+1} by
       u_{k+1}(s) = R(s, π_{n+1}(s)) + γ Σ_{t∈S} Pr(s, π_{n+1}(s), t) u_k(t).
6. Increment k and go to step 4.
7. Let v^{n+1} = u_m and go to step 2.
8. Let π* = π_{n+1} and stop.

Figure 2.4: The modified policy iteration algorithm.

2.4  Bayesian Networks

A Bayesian belief network [82] (usually shortened to Bayesian network) is a graphical representation of a probability distribution. Given a domain described in terms of a set of variables, a Bayesian network represents a joint probability distribution over the set. This allows us to explicitly depict which variables are dependent on which other variables, and which pairs of variables are (conditionally) independent. Our interest in Bayesian networks is simply as a representational tool. As we shall see in Section 2.5.1, we can represent an action in an MDP using a Bayesian network, capturing structure that would be obscured by standard matrix-based representations. (In fact, we can represent a complete MDP using influence diagrams; see below.) For this reason we do not present any algorithms for computing probabilities using Bayesian networks, we simply describe the representation itself.

A Bayesian network is a directed acyclic graph (DAG) where nodes are labelled with the variables they represent, and edges represent direct dependence between variables. Each variable has an associated conditional probability table (CPT) which shows the probability of it having each of its possible values, given the values of its parents. We write a Bayesian network as a tuple (N, E, P) where:

• N is a set of nodes such that each node n ∈ N is labelled with a random variable and has associated with it a set of possible values Ω_n.
• E is a set of edges (x, y) such that x, y ∈ N, and the graph <N, E> is directed and acyclic.
• P is a set of conditional probabilities P_n for each n ∈ N given its parents (those nodes x such that (x, n) ∈ E). If n has no parents, then P_n is the prior probability of n.

The independence assumption embedded in a network states that if V_1, ..., V_n are the parents of a variable V, then all other variables that are not descendents of V are independent of V given {V_1, ..., V_n}. A consequence of this independence assumption is the d-separation criterion, which describes independencies that are implied by the structure of the graph. Two sets of nodes in the graph are independent given a third if they are d-separated according to the definition below:

Definition [82] If X, Y, and Z are three disjoint subsets of the nodes in a Bayesian network, then Z d-separates X from Y if there is no path between a node in X and a node in Y such that:
1. Every node with converging arrows in the path is in Z or has a descendent in Z.
2. Every other node in the path is not in Z.

Rain   Umbrella   P(Wet)
T      T          0.1
T      F          0.95
F      T          0.0
F      F          0.0

Figure 2.5: An example of a Bayesian belief network.

Figure 2.5 shows a typical Bayesian network with five boolean variables. The network represents the fact that if it is raining and I do not have an umbrella, then I will probably get wet; if I am wet, then I may catch a cold; and if I have my umbrella then I will get tired. Since there is no arrow from Rain to Tired, there is no direct dependence between the two. Furthermore, if I know the value of Wet, whether or not I catch a cold is independent of whether it is raining. Although every variable in the network has its own CPT, only the one for Wet is given in the figure.
The CPT for Wet shows that if it is raining and I don't have an umbrella, I will get wet with probability 0.95, but if I do have an umbrella, then Wet will be true only with probability 0.1; and regardless of whether I have an umbrella or not, Wet will be false with probability 1 if Rain is false.

Bayesian networks are a powerful tool for representing uncertain knowledge, but for the purposes of planning and learning to act, we also need a way to represent decisions and their utilities. For this we use influence diagrams [56, 81], which are an extension of Bayesian networks that includes decision variables and a value or utility node.

Figure 2.6: An example of an influence diagram.

Figure 2.6 shows a typical influence diagram. rain and forecast are called chance nodes. These are random variables over which the agent has no control. The utility node, V, represents the agent's utilities (or rewards) for various assignments of values to the variables in its parent nodes. The take umbrella node is a decision node, a random variable over which the agent has complete control, which represents an action that the agent can take, in this case whether or not to take an umbrella. The agent only has the value of forecast available to it when it chooses its action since that is the only parent of the take umbrella node. As before, all the chance nodes have conditional probability tables to determine their probabilities given the values of their parents. The utility node has a table which gives the value to the agent of each possible combination of values of its parents. A possible table for V is given in the figure.

Formally, an influence diagram is a tuple <N, E, P, F> where:

• N is a set of nodes (or random variables), partitioned into C, the set of chance nodes, D, the set of decision nodes, and V, the value node. (We assume throughout that there is a single value node per influence diagram, although this need not be the case.) A node n in C ∪ D is labelled with a random variable and has associated with it a set of possible values Ω_n.
• E is a set of edges (x, y) as before. If there are multiple decision nodes, we assume "no forgetting": all the variables that are observable when making some decision are also observable when making any subsequent decisions, as are the previous decisions themselves. That is, if there is an edge (x, d) ∈ E where d ∈ D, then there is also an edge (x, d') ∈ E and an edge (d, d') ∈ E for any subsequent decision node d' ∈ D.
• P is a set of conditional probabilities for each n ∈ C given its parents.
• F is the utility function for the value node V, defined in terms of its parents.

A number of algorithms for evaluating an influence diagram—determining the agent's maximum utility, and the corresponding decisions—have been proposed. We will not present any here, but direct the interested reader to [59, 94] for examples of such algorithms.

2.5  Structured Representations of Actions

In Sections 2.1 and 2.2 we have presented a very general model of MDPs. Unfortunately, this generality has the disadvantage that it can be quite an inefficient way of representing a problem. This is because typically we don't represent the world explicitly using states, but instead we describe it in terms of a set of features or variables as in the Bayesian networks of the previous section. The number of states is exponential in the number of variables we use to describe the world (n binary variables results in 2^n states).
This exponential "blow-up" in the size of the problem is known as the curse of dimensionality [7].

Since we can expect the world to be represented in terms of a set of variables, it is unreasonable to expect users to supply the huge transition matrices and reward vectors required as input to the algorithms in that form. The description of influence diagrams in Section 2.4 suggests a more compact representation. If we represent the world directly in terms of the set of variables that make it up then we can exploit independence in the same way that Bayesian Networks do in order to greatly reduce the size of the description of an MDP. We will describe the details of this in Section 2.5.1.

There has been some previous work in using structured world representations with MDPs for planning. As we shall see in Section 2.6, both Nicholson and Kaelbling [80] and Dearden and Boutilier [31, 33] suggest using Bayesian Networks or similar representations for MDPs. Both these approaches improve performance by producing approximately optimal policies through the use of abstract MDPs. In Section 3.1, we will describe a method that uses a structured representation to produce an optimal policy by maintaining the structure throughout the computation, thus allowing considerable savings in time and especially space when compared with the algorithms described in Section 2.3.

FETCH-COFFEE (action network; key: matrix and tree representation of each CPT)
Figure 2.7: Action Network with Tree-structured CPTs.

2.5.1  Two-stage Temporal Bayesian Networks

We represent MDPs compactly by describing actions using two-stage temporal Bayesian Networks (2TBNs) [29, 25]. We assume that the world can be described using a set of variables P, which induces a state space of size 2^{|P|} (assuming that all the variables are binary). For each action, we have a Bayesian network with one set of nodes representing the system state at some decision epoch t prior to the action (one node for each variable), another set representing the world at decision epoch t + 1 after the action has been performed, and directed arcs representing causal influences between the pre-action and post-action nodes, and potentially between pairs of post-action nodes. Each post-action node has an associated conditional probability table (CPT) which quantifies the effect of the action on the corresponding variable, given the values of the variables that influence it (parents in the network). Thus we have a Bayesian network for each action that describes the action's transition function in terms of its effect on each of the variables individually.

Figure 2.7 illustrates this representation for a single action. The figure is taken from an example domain in which a robot is supposed to get coffee from a coffee shop across the street, can get wet if it is raining unless it has an umbrella, and is rewarded if it brings coffee when the user requests it, and penalised (to a lesser extent) if it gets wet [16]. The network in the figure describes the action of fetching coffee. In the conditional probability tables in Figure 2.7, we use primed variables (such as W') to represent the value of the variable at decision epoch t + 1, and unprimed variables to represent the value at time t.
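As an illustration of how such a tree-structured CPT might be stored, the sketch below encodes one reading of Tree(fetch-coffee, W) as nested dictionaries. Only the two leaves explicitly discussed in the text (W already true, and W false with rain and an umbrella) are taken from this chapter; the remaining two leaf probabilities are guesses marked as such in the comments, and the encoding and names are ours rather than the representation used by any particular implementation.

# An internal node is ('split', variable, {value: subtree}); a leaf is ('leaf', p_true),
# giving the probability that W is true after the action.
tree_W = ('split', 'W', {
    True:  ('leaf', 1.0),                       # already wet: stays wet (from the text)
    False: ('split', 'R', {
        True:  ('split', 'U', {
            True:  ('leaf', 0.1),               # rain, umbrella: 0.1 (from the text)
            False: ('leaf', 1.0),               # rain, no umbrella: a guess
        }),
        False: ('leaf', 0.0),                   # no rain: a guess
    }),
})

def prob_true(tree, state):
    """Walk the tree using the pre-action values of the variables in `state`."""
    while tree[0] == 'split':
        _, var, children = tree
        tree = children[state[var]]
    return tree[1]

print(prob_true(tree_W, {'W': False, 'R': True, 'U': True}))   # 0.1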
As is the case with Bayesian networks, the lack of an arc from a pre-action variable X to a post-action variable Y in the network for action a reflects the independence of a's effect on Y from the prior value of X. Thus, the network in Figure 2.7 should be interpreted as follows: Whether it is raining (R) at time t + 1 depends only on whether it is raining at time t (persistence), and similarly for whether the robot has the umbrella (U), and whether the user wants coffee (WC). Whether the robot is wet (W) at time t + 1 depends on whether it is raining, whether the robot has the umbrella, and whether the robot was already wet at time t. The likelihood that the robot is carrying coffee (HC) depends only on whether it was carrying coffee at time t.

Although none appear in Figure 2.7, there could also be arcs between post-action variables, indicating that the variables are correlated. We call these synchronic arcs because they link nodes in a single time-step. In Figure 2.7, if the action could fail without the robot ever going outside, the value of W^{t+1} (W at decision epoch t + 1) would depend on whether or not HC^{t+1} was true, so there would be a synchronic arc from HC^{t+1} to W^{t+1}.

Unlike a regular Bayesian network, there are no CPTs for the pre-action variables in the 2TBNs. This is because the values of these variables are obtained from the state that the MDP is in before the action is performed. Because we are using the 2TBNs only to represent the effects of actions, we are not interested in the prior probabilities of the pre-action variables, which correspond to a prior distribution over states in the MDP.

We capture additional independence by assuming structured CPTs. In particular, we use a decision tree to represent the function that maps combinations of parent variable values to (conditional) probabilities. The decision trees allow us to represent independence given specific variable assignments by representing facts such as "X is independent of Y whenever Z has value z". We call this context-specific independence [18]. The decision trees also provide a more compact representation than the usual CPTs. For example, in the CPT for W we see that if W is true before the action is performed, then it is true with probability 1 afterwards. Thus if we know that W^t is true, the value of W^{t+1} is independent of the values of R^t and U^t, and four rows in the CPT can be summarised with a single value. In this representation, each leaf in the decision tree for a variable V contains the probabilities that V has each of its possible values at time t + 1 in any world where all the variables in the path from the root to that leaf have the corresponding values at time t. For instance, for the trees in Figure 2.7, where we assume that a left-pointing arrow represents the value "true", and a right-pointing arrow represents "false", W will be true with probability 0.1 at time t + 1 in any world where W is false, R is true, and U is true at time t. We write Tree(a, X) for the conditional probability tree for action a and variable X.

Formally, a decision tree is a compact representation of a set of non-empty sets of states, which we call partitions, such that the union of all the partitions is the entire set of states, and no state is in more than one partition. Each of the leaves of a decision tree for variable V corresponds with one of these partitions, and is labelled with a probability for each value of V. Let l be some leaf in a decision tree.
Every node in the path from the root of the tree to / is labelled with some variable, and the edge below that node is labelled with some value of that variable. Let  Assigns(l)  be the set of such variables and their values. We assume that no  35  Figure 2.8: Reward tree for structured M D P . variable can appear more than once in this set, and that each node in the tree has exactly one subtree for each of the corresponding variable's values. Thus we see that all the partitions are disjoint, since they must all differ on the value of at least one variable (the variable where the corresponding paths in the tree diverge), and that the union of the partitions is the set of possible values for the parents of V. The structure in the tree representation can be exploited computationally when solving an M D P as we describe in Chapter 3. A similar representation can be used to represent the reward function R, as shown in Figure 2.8. We call this the (immediate) reward tree, Tree(R).  2.6  Previous Work in Decision-Theoretic Planning  In Sections 2.1 to 2.3 we described M D P s and the standard algorithms for solving them that have come from the operations research community. In this section we will examine some of the approaches developed in the artificial intelligence community. Most of these approaches do not attempt to solve the M D P s exactly, but rather they take advantage of extra information that may be available to find approximate  36  solutions, or solutions to certain classes of problem more efficiently. The additional information they use might be in the form of simplified reward functions, knowledge of the start state of the M D P , or structured representations of the problem such as the one we describe above. We will describe many of these algorithms in planning terms rather than in terms of solving M D P s . In most cases this is because planning is the task that motivated development of the algorithms and they are therefore more easily expressed in terms of planning. In fact, most of these algorithms still use one of the standard algorithms we describe in Section 2.3 to actually create a plan. The contribution of most of these algorithms is not in the solution method used, but rather in the way a simpler M D P (or set of simpler M D P s ) is constructed that can be solved to find a good policy for the original problem.  2.6.1  "Envelope" Algorithms  The first approach we describe is the Plexus algorithm of Dean et. al. [28]. They assume that the reward function is goal-based, meaning that the state space is divided into goal states which have a reward of 0, and all other states which have reward —1. They also assume that the start state of the system is known, or that planning is interleaved with execution so that knowledge of the current state can be used in constructing a plan. Rather than generating a policy for an entire M D P , the Plexus algorithm operates by building a simpler M D P , and then solving that using policy iteration. The state space of the simpler M D P , called the  envelope,  3  is a subset of the states  in the full M D P , augmented with a state OUT that represents all the states outside the envelope. The initial envelope is constructed by forward search from the current state until a goal state is found. This small envelope is then extended by adding states outside the envelope that might be reached with high probability. The idea 3  The Plexus algorithm is also often called the envelope method.  
37  is to include in the envelope all the states that are likely to be reached on the way to the goal. Once the envelope has been constructed, a policy is found for it using policy iteration, and the agent executes the policy. If at any point the agent leaves the envelope, it must re-plan by extending the envelope (the envelope can also be pruned by discarding states that the agent has already passed) and computing a new policy. The main weakness of this algorithm is its dependence on specific types of reward function. Although the authors point out that more general rewards can be used by changing the method of envelope building, much of the algorithm's usefulness derives from having easily determined goals which allow it to restrict the envelope size. Tash and Russell [108] propose a similar algorithm to Plexus. They also assume rewards of 0 for goal states, and —1 otherwise, and interleave planning and execution. Their algorithm keeps an envelope of states near the current one, but they also use a heuristic function to estimate the values of other states. Rather than keeping a single OUT state, they keep a fringe of states on the edge of the envelope. When they compute a policy for the envelope, they make the fringe states absorbing and assign them their heuristic value as a reward. Thus the value they produce by solving the envelope M D P is the true cost of reaching a fringe state plus the estimated (by the heuristic) cost of getting from the fringe state to a goal state. Once the envelope M D P has been solved, the heuristic value of the states in the envelope can be replaced with their value according to the envelope M D P , so that over time (assuming an initial heuristic that underestimates values) the heuristic value of each visited state converges to the optimal value of that state.  2.6.2  Real-Time Dynamic Programming  Like the Plexus algorithm, Real-time dynamic programming (RTDP) [5] is best suited to situations where planning and execution of the plan can be interleaved.  38  RTDP is an alternative to value iteration in which rather than performing Bellman backups for every state at each iteration, only some subset of the states, including the current state, is backed up. The set of states could be constructed by forward search from the current state, or by any other means. RTDP concentrates on un-discounted models with specified start and goal states. As with Plexus, goal states have reward 0 and all other states have reward —1. Goal states are also assumed to be  absorbing,  meaning that all actions performed  in a goal state leave the system in the same state with probability one.  RTDP  also assumes that a series of trials are performed to learn the optimal policy, where each trial begins in a randomly selected start state and continues for some finite number of actions or until a goal state is reached. Under these assumptions, RTDP will converge on an optimal policy for all  relevant  states (states that it visits in its  trials), but may perform many fewer backups that value iteration. Typically RTDP only does Bellman backups for states that it visits, and hence it will not find the optimal value for states that cannot be reached from the start state while following the optimal policy. 2.6.3  A b s t r a c t i o n by Ignoring Features  The techniques we have described above simplify the MDP to be solved by ignoring certain states. 
In the case of RTDP, these are states that are never visited when following the optimal policy from the start state, while in the Plexus algorithm the states are unlikely to be visited from the current state and so are abstracted into a single OUT state. We will now examine a different approach to simplifying the MDP. Rather than using a state-based representation of a problem, the algorithms we describe below use feature-based representations like the one we describe in Section 2.5. To perform abstraction they ignore certain features of the state space so that states which agree on the values of all the other features are treated as if they were a  39  single state. As we said in the introduction, we can classify abstraction mechanisms by whether they are uniform, where the level of abstraction is the same everywhere, and adaptive, where the level of abstraction may change over the course of the algorithm.  These distinctions are a useful way to compare the flexibility of an  abstraction mechanism. Ideally we would like an abstraction that is non-uniform and adaptive. This allows the algorithm to concentrate its computation in parts of the state space that need it the most, and allows it to discover those parts as it runs. Nicholson and Kaelbling Nicholson and Kaelbling [80] represent M D P s using the 2 T B N representation we described in Section 2.5. They leave out features of this M D P to produce a sequence of abstract M D P s that range from quite similar to the original M D P to being a very coarse-grained approximation of it. The idea is that a rough policy for the original M D P can be produced from the optimal policy in the most abstract M D P , and then if time is available this policy can be improved by finding the optimal policy for the next M D P in the sequence, and so on. The algorithm is therefore adaptive, although it is uniform in that the same features are ignored throughout the state space. To select features to remove when producing the abstract M D P s , Nicholson and Kaelbling use sensitivity analysis to determine the degree to which the conditional probability distribution for a particular variable is influenced by another variable. If a variable has only a small effect on the other variables, it can safely be abstracted away. The algorithm is intended to work hand-in-hand with the Plexus algorithm described above, building an envelope to reduce the size of the state space, and then swiftly finding a policy for the envelope by abstracting away some of the detail.  40  Dearden and Boutilier Dearden and Boutilier [33, 31] take a somewhat different approach to finding good abstractions, and again produce an approximately optimal policy. Although they represent the M D P using probabilistic  STRIPS  rules [42, 66] rather than 2TBNs,  the representational power is similar, and we will explain their approach here using 2TBNs. The idea behind this algorithm is to simplify the reward function by removing features that have only a small effect on rewards. A n abstract M D P can then be constructed that uses this simpler reward function and includes only those features that directly or indirectly have an impact on the reward received (according to this simple reward function). This is achieved by searching backwards through the 2 T B N for each action, beginning with the features in the new reward function, and adding any feature to the abstract M D P that has a edge leading from it to a feature that is already in the abstract M D P . 
For example, the reward function in Figure 2.8 could be simplified by removing the distinctions due to feature W. Now only WC and HC directly affect the reward, and looking at the action in Figure 2.7 we see that the only features that affect their value are again WC and HC. In practice of course, this procedure would be performed on all the actions, rather than just the single one shown in the figure, and any new features added would themselves have to be checked against all the actions until no new features were added. This algorithm suffers from the major weakness that it may not be applicable at all in some M D P s because no simpler reward function can be found that allows features to be removed from the problem. It is also rather inflexible as its abstractions are uniform and fixed. However, it requires much less computation to build the abstract M D P than Nicholson and Kaelbling's approach, and allows a bound to be computed a priori on the loss of optimality due to abstraction. Dearden and Boutilier [32] also describe a forward search method that can be used during execution of the policy discovered using the abstract M D P to try to  41  locally improve performance. Abstraction Using Bounded-Parameter M D P s Another approach to solving large M D P s by ignoring features is the use of bounded parameter MDPs  [46] for abstraction. Bounded parameter M D P s are M D P s in which  the transition probabilities and reward function are specified using closed intervals rather than point values.  They can be solved using the interval value iteration  algorithm [46] to produce bounds on the optimal value function given the intervals used in the transition and reward functions. In [26] bounded parameter M D P s are used for abstraction.  They take a  top-down approach to producing an abstract M D P , beginning with a very coarsegrained abstraction that splits the state space into a small number of partitions, and refining it by finding and splitting partitions if the abstract transition functions— by which we mean the total probability of ending up in a particular partition by executing some action—of individual states in the partition differ by more than some parameter e. The algorithm produces abstractions that are non-uniform, but the abstraction is fixed before the interval value iteration algorithm is run. It is in some ways similar to the approximation algorithm we describe in Section 3.2. The main difference being that our algorithm partitions states together when their values according to the current policy are similar, rather than when their transition probabilities are similar.  2.6.4  O t h e r Approaches  Dietterich and Flann [37, 38] describe a different form of abstraction that is inspired by work in explanation-based learning. Although they describe their algorithm in terms of reinforcement learning (see Chapter 4) rather than planning, the basic idea applies in both situations. Rather than performing Bellman backups on individual states, they find regions of the state space that can be backed up collectively. These  42  regions are discovered by an analogue of goal regression [83] in which a set of states with identical values are regressed backwards through the actions to discover new sets of states such that if the action were performed in them it would leave the agent in the original set, and which therefore have identical value for that action. This idea is closely related to the decision-theoretic regression algorithm we describe in Chapter 3. 
In [37] Dietterich and Flann applied their algorithm only in deterministic domains, which makes the regression operation much easier. They extend the approach to work in stochastic domains in [38], although they still use a goal-based model rather.than an arbitrarily complex reward function. As they point out in the paper, the algorithm works much more effectively when there are funnel actions that when performed, move the system from a large number of states to a single state. These types of action allow large regions of the state space to be discovered which all have the same value for executing the funnel action. One example domain where this approach works particularly well is in navigation domains. Actions such as "move north until blocked by a wall" act as funnel actions, making Dietterich and Flann's algorithm particularly effective. In contrast, the SPI algorithm we describe in Chapter 3 performs poorly in this kind of domain because it is unable to exploit structure in this form. A n approach that is much closer to classical planning is found in the  DRIPS  system [49, 50]. Here the abstraction is in terms of actions rather than states. A set { a i , • • •, a } of actions that have comparable effects are combined to produce an n  abstract action a which summarises the effects of all the original actions. Planning with these abstract actions can be more efficient because the abstract actions ignore certain features of the state space, so reducing the complexity of the planning task. The abstract plans found can then be translated back into concrete plans that use the original actions. This is done by replacing a single plan that uses some abstract action a (an abstraction of concrete actions {a\, • • •, a }) with a set of plans in which n  a is replaced by each of a\, • • •, a . The abstract plan can then be used to efficiently n  43  evaluate the quality of each of these new plans and any sub-optimal ones can be discarded. The  DRIPS  approach is of interest to us because abstracting actions is in  many ways orthogonal to the state-abstraction approach we describe in Chapter 3. One of the weaknesses of our algorithm (and many of the others described above) is that they perform very badly when there are a large number of actions to consider. Combining them with an approach such as  DRIPS  may alleviate this problem.  Another approach based on classical planning is the [66].  BURIDAN  BURIDAN  uses a classical partial-order planning system such as  algorithm of SNLP  [72] for  plan creation, and a probabilistic reasoning system to evaluate partial plans. In addition to an action model, the algorithm requires a distribution over start states, a set of distinguished goal states, and a probability threshold. The algorithm produces a partially-ordered plan which leaves the system in a goal state with probability greater than the threshold. Like most classical planning approaches,  BURIDAN  builds a complete plan  without executing any actions. One characteristic of the plans it finds is that actions are frequently repeated a number of times to increase their probability of success. The  C-BURIDAN  algorithm [40] avoids this behaviour by interleaving planning and  execution so that observations can be made of the success or failure of actions, cBURIDAN  builds plans that may include loops and branch points that depend on  observations of the current state. 
This has not been an exhaustive survey of work in decision-theoretic planning; we have merely tried to present outlines of some of the algorithms that are more closely related to those we will present in the next chapter. For a more wide-ranging survey of the decision-theoretic planning literature see [15]. The work of Dietterich and Flann we described above is also not the only related work from the field of reinforcement learning. We will survey work in this field separately in Chapter 4. Work on value function approximation is also of interest, especially as it is closely related to the work on compact representations of value functions we present in the next chapter. As much of this work has been done in the reinforcement learning community, we will discuss it in Section 4.5.

Chapter 3

Exploiting Structure for Planning

In Chapter 2 we presented the definition of a Markov decision process—the mathematical theory underlying this work—and a compact representation for MDPs that uses Bayesian networks to make certain types of structure in the problem explicit. Here we provide a framework for using this structured representation and present two algorithms that solve MDPs more efficiently than the standard techniques we presented in Chapter 2. The intuition behind these algorithms is that if a problem has a compact representation then it will often also have a compactly representable solution, and we should be able to find this solution more efficiently by taking this into account.

The structured policy iteration (SPI) algorithm that we will present in this section takes as input a structured representation of an MDP, and produces a structured optimal policy and corresponding value function. It is based on the following observations:

• If we have a structured policy π, and a structured estimate V^i of the value of π, an improved estimate V^{i+1} can often preserve much of this structure; and
• If we have a structured value estimate V^π of the current policy, we can construct an improving policy π' (if such a policy exists) in such a way that the structured representation is maintained.

These two facts suggest an algorithm based on modified policy iteration (see Section 2.3.3). The first implies that we can perform a structured version of successive approximation, while the second means that we can do policy improvement while maintaining structure. Any initial policy can be used, but we favour strongly structured policies (that is, policies that can be represented very compactly) in the hope that much of this structure will be maintained throughout the computation.

Often in artificial intelligence the problem is to find good solutions, or even—and this is typically the case in classical AI planning—any solution that achieves our goals. Approximation algorithms that trade computation time for solution quality are therefore of great interest to us. The framework for reasoning about structure that we present here is very well-suited to use in approximation algorithms, and in Section 3.2 we present a variant of the SPI algorithm that produces approximately optimal solutions.

3.1  Structured Policy Iteration

If a problem has a compact representation, this must be because the representation exploits regularities and structure present in the problem domain. The work in this chapter is predicated on the hypothesis that, given this structure, optimal policies for the problem will also have certain structure, as will value functions.
To prove this hypothesis we would need to examine a number of real-world problems and see if their optimal policies and value functions can be represented in a structured way. While we haven't done this, we feel that the artificial problems we use in Section 3.1.6 have many of the characteristics of real-world problems, and the structure we find in their solutions gives us confidence that the hypothesis is a reasonable one. Structured solutions to real-world scheduling problems, and problems in other areas of AI, also support our hypothesis.

GET-UMBRELLA   DO-NOTHING   REWARD   POLICY
Figure 3.1: The example coffee domain showing all three actions, the reward tree, and an initial policy. The dashed arcs represent persistence, where the corresponding tree has the variable at the root, 1.0 as the value of the "true" subtree, and 0.0 as the value of the "false" subtree.

In this section we describe a method for optimal policy construction that eliminates the need to construct the explicit transition matrices, reward and value vectors, and policy vectors used in the standard MDP algorithms of Section 2.3. Our method is based on modified policy iteration (see Section 2.3.3), but exploits the fact that at any stage in the computation, both the current policy π and the current estimate of the value function V^π may be representable in a compact fashion. We represent this structure using the same type of decision tree we used in Section 2.5 to represent CPTs for the Bayesian network representation of an action. We write Tree(a, X_i) for the conditional probability tree that describes the effect of action a on variable X_i, and similarly, we write Tree(R) for the tree-structured representation of the reward function. We also need a tree-structured representation of a policy, which we write Tree(π), and value function Tree(V). We label the leaf nodes of policy trees with the action to be performed, and the leaf nodes of value trees with our current estimate of the value of the corresponding states.

Throughout this chapter, we will use as our example the MDP shown in Figure 3.1, which we call the robot-coffee example. The problem is very simple, containing only 32 states and three actions, but is sufficient to illustrate most of the complexities of our algorithm. Examples of a tree-structured policy and value function for this domain are shown in Figure 3.2.

Figure 3.2: Examples of (a) a policy tree, and (b) a value tree.

The robot-coffee domain contains neither multi-valued (i.e., non-binary) variables nor synchronic arcs (correlated action effects). For ease of exposition, we will present our algorithm in this section under the assumption that neither of these are present. Multi-valued variables add very little in the way of complexity to the algorithm. They make the representation of policies and value functions as trees, and the algorithm for simplifying trees, slightly more complex. The effect of correlations is more problematic, but the algorithm can be amended in a relatively straightforward manner to handle this case. See [12] or [17] for details.

How can we expect this structured representation of a problem to provide computational savings?
The idea is conceptually simple and closely related to that of Dietterich and Flann [37] as described in Section 2.6.4: We will perform modified policy iteration, but whenever some set of states behaves identically under the current policy we will treat them as if they were a single state—we will perform a single Bellman backup to update the value of all the states in the set. When we say that a set of states behaves identically under a policy we mean that the policy calls for the same action to be performed in all of them, the action has the same effects on relevant features of the problem in all of them (we explain what we mean by this below), and their value is the same. The decision trees we use to represent the current policy and value function tell us when we can treat a set of states as a single state.

A run of the structured policy iteration (SPI) algorithm will produce a sequence of policy trees and associated value trees that exactly match the sequence of policies and value functions produced by modified policy iteration. The final output will be an ε-optimal policy and value function in the form of decision trees. The sequence of value functions and policies will be the same as that produced by MPI (modulo actions with identical value), but will simply be represented (in many cases) more compactly. Figure 3.3 is a slightly changed description of the modified policy iteration algorithm.

Input: An MDP <T, S, A, Pr, R>, a discount factor γ, an optimality constant ε, and an integer m that is the number of steps of policy evaluation to perform per iteration.
Output: An ε-optimal policy π*.
1. Let n = 0, select an arbitrary initial value function v^0.
2. (Policy improvement) For each action a, compute for all s ∈ S:
       Q(s, a) = R(s, a) + γ Σ_{t∈S} Pr(s, a, t) v^n(t)
   Let π_{n+1} be any policy for which, for all s ∈ S,
       Q(s, π_{n+1}(s)) = max_{a∈A} Q(s, a)
   and let k = 0 and u_0(s) = Q(s, π_{n+1}(s)).
3. If the termination condition (Equations 2.3 or 2.4) holds for u_0 and v^n then go to step 7.
4. (Successive approximation) If k = m then go to step 6. Otherwise compute u_{k+1} by
       u_{k+1}(s) = R(s, π_{n+1}(s)) + γ Σ_{t∈S} Pr(s, π_{n+1}(s), t) u_k(t)
5. Increment k and go to step 4.
6. Let v^{n+1} = u_m and go to step 2.
7. Let π* = π_{n+1} and stop.

Figure 3.3: The modified policy iteration algorithm (slightly changed from Figure 2.4).
A l l these algorithms will operate by first building the structure of the value function or policy, and then applying the appropriate decision-theoretic calculation once per leaf of the tree to compute the new value function or policy. If the size of the tree is significantly smaller than the size of the state space, the computational savings can be substantial. To produce these tree-structured representations, we will require the following basic tree operations: 52  Figure 3.4: A tree simplified by removal of redundant nodes (triangles denote subtrees). Tree Simplification. This is the removal of redundant interior nodes from a tree. For example, in Figure 3.4, the tree contains multiple interior nodes labelled X along a single branch. The tree can be simplified as shown to produce a new tree in which X appears only once. As in the figure, only the topmost occurrence of the duplicated variable X is retained in the tree, and subsequent occurrences of X in the subtree in which X =  replaced with their own  subtree labelled X{. Their other subtrees can never be relevant since we already know that X = X{. More complex tree simplifications, such as removing interior nodes for which all subtrees are identical, or even reducing the size of trees by reordering them are also possible, but are not essential for the algorithm. Appending Trees. When we append a tree T 2 to some leaf I of tree T\, we attach the structure of T2 in place of /. The new leaves added below / are labelled with some function of the label of the leaf in T 2 and the label of / (the function will always be clear from the description of the algorithm). We write Append(T\, I, T2) to denote the resulting tree. For example, Figure 3.5 shows tree T2 being appended to the leaf labelled 2 of T\ and the max operator being applied. As in the figure, we generally assume that the resulting tree is 53  z  X  Y  3  \  7  Y  W  A A 1 4  2  X  0  5  4  W  Append here  T  2  5  Figure 3.5: Appending tree T 2 to the leaf labelled "2" in tree T\. The leaf labels are combined using the max function. simplified. Merging trees. A set of trees T\, • • •, T can be merged to form a single tree that n  makes all the distinctions made in any of the trees.  As with the append  operation, the leaves of this new tree are labelled with some function of the labels of the corresponding leaves in the original trees. Merging of trees can be implemented as repeated appends, where each tree in turn is appended to all the leaves of the previous tree. Again we generally assume that the resulting tree is simplified. If only two trees are to be merged, we will often write Append(Ti,^)  3.1.1  to indicate that T 2 should be appended to every leaf of  D e c i s i o n - T h e o r e t i c Regression  The key to the SPI algorithm is a process we call decision-theoretic regression. In classical planning the regression of a set C of conditions through an action a is the weakest set of preconditions such that performing a will make C true. Decisiontheoretic regression is a stochastic analogue to this process. Since we are concerned with values rather than goals, rather than regressing a single condition, we regress a set of conditions, each with an associated value, through an action. And because our  54  actions have stochastic effects, there is no absolute set of preconditions for making any condition C in this set true. Instead, we find sets of preconditions under which action a will make each of the regressed conditions true with identical probability. 
3.1.1  Decision-Theoretic Regression

The key to the SPI algorithm is a process we call decision-theoretic regression. In classical planning, the regression of a set C of conditions through an action a is the weakest set of preconditions such that performing a will make C true. Decision-theoretic regression is a stochastic analogue to this process. Since we are concerned with values rather than goals, rather than regressing a single condition, we regress a set of conditions, each with an associated value, through an action. And because our actions have stochastic effects, there is no absolute set of preconditions for making any condition C in this set true. Instead, we find sets of preconditions under which action a will make each of the regressed conditions true with identical probability. Since each of the conditions has a value associated with it, decision-theoretic regression produces sets of preconditions each of which has the same expected value under a.

Decision-theoretic regression is used to perform the structured versions of Equations 3.1 and 3.4. These equations differ only in that the first does a Bellman backup for a single action in all states, while the second does a Bellman backup for a fixed policy (i.e., the action may be different for different states). We shall describe the single action case (Equation 3.1) here. Given an algorithm for the decision-theoretic regression of a value function through a single action, we can easily perform it for a policy. We do this by computing the single action tree for each action that appears in the policy, which results in a tree-structured representation of the Q-function for each of those actions. The tree for action a is then appended to every leaf node in the policy tree that is labelled with a, and the resulting tree is simplified.

To compute the Q-function Tree(Q_a^V) with respect to Tree(V) (Equation 3.1), we perform decision-theoretic regression through the single action a. To do this, we use the structure in the reward function, the current value function and the representation of a to determine when states have the same Q-values and can therefore be grouped together. It is clear from the equation that two states s_i and s_j have the same Q-value if they have the same reward and the same expected future value. Their rewards can simply be read off Tree(R), so we will focus on their future value.

The value tree Tree(V) partitions the set of all states into a set of partitions P_1, ..., P_n, one for each leaf in V, where the states in each P_k have the same value according to V. To determine whether states s_i and s_j can be treated together (that is, whether they have identical future expected value), we must examine what the effects of performing action a in each of them would be. For each state s, performing a induces a stochastic transition to some set of states. Let p_{s,k} be the total probability that performing a in s results in a state in partition P_k. If for every partition P_k, p_{s_i,k} = p_{s_j,k}, then s_i and s_j can be treated together when we compute their future value. This is illustrated in Figure 3.6, in which states x and y can be clustered together since they have identical probabilities of reaching each partition of the state space, while state z must be treated separately.

Figure 3.6: An example of a value tree partitioning the state space. States x and y can be treated together, while z must be distinguished.

Thus the Q-value tree for V and a, Tree(Q_a^V), should only distinguish conditions under which a makes some branch of Tree(V) (that is, some partition P_k) true with differing probability. These conditions can be determined by examining the effects of a on each variable X_i in Tree(V), and these effects are defined in the action network for a in Tree(a, X_i).

We illustrate how the decision-theoretic regression algorithm works using the simple example shown in Figure 3.7.

Figure 3.7: A simple action network and the reward and value tree. In this and subsequent figures, variables at time t + 1 (after the action) will be indicated by primed variable names (Z'), unprimed variables are at time t.

This shows an action a in a domain with three variables, X, Y, and Z. Action a has no effect on X, but makes Y true with
Action a has no effect on X, but makes Y true with probability 0.9 if X is true, and makes Z true with probability 0.9 if Y is true (both Y and Z are unaffected if they are true before a is performed). The figure also shows the reward tree, which also provides the initial value function.

Figure 3.8 illustrates the steps in computing Tree(Q_a^V), the Q-tree for this action and value function. First we determine the conditions under which a will have different expected future value with respect to V. Since V depends only on the truth of Z, the expected future value depends only on the conditions that affect the probability of Z being true or false after a is performed. These conditions can be found in the conditional probability tree for Z in the network for a, Tree(a, Z). This tree tells us that the post-action probability of Z being true depends only on the pre-action probabilities of Z and Y. Each branch of this tree thus describes exactly the conditions we want: the conditions under which a will lead with fixed probability to each partition of the state space induced by V. This probability tree is shown in Figure 3.8(b) as PTree(V, a).

Since each leaf in PTree(V, a) is labelled with a distribution over the value of Z, and only Z influences future value, it is now straightforward to compute the expected future value of performing a, as shown in Figure 3.8(c). For example, when Z is false and Y is true before the action, Z becomes true with probability 0.9 after the action, and this has value 10; Z remains false with probability 0.1, which has value 0. Hence the expected future value of all states where these conditions hold is 9. We denote by FVTree(V, a) this "future value tree" obtained by converting the distributions at each leaf of PTree(V, a) to expected values with respect to V.

Figure 3.8: Decision-theoretic regression of Tree(V) through action a in Figure 3.7 to produce Tree(Q_a^V): (a) Tree(V); (b) PTree(V, a); (c) FVTree(V, a); (d) Tree(Q_a^V).

Finally, in Figure 3.8(d), we produce the Q-tree Tree(Q_a^V). We do this by applying the discount factor at every leaf of FVTree(V, a) and then merging this tree with the reward tree, summing the values at the leaves. In this case, since the reward tree only contains Z, the simplified tree is unchanged in structure from FVTree(V, a).

The decision-theoretic regression operation becomes more complex when the value tree being regressed through contains more structure. In Figure 3.9 we assume that the value function Tree(V) is the tree Tree(Q_a^V) from Figure 3.8, and compute the next successive approximation to the value of the policy "do a everywhere." As before, we begin by computing PTree(V, a). This time we need to know the conditions that make Z true with fixed probability and, when Z could be false, the conditions that make Y true with fixed probability. To find these conditions we regress each of the variables individually through a and piece together the final tree from the trees that result. Beginning at the root of the tree, we first regress variable Z through a to produce the tree in Figure 3.9(b).
Next we regress Y through a and append the resulting tree to each branch of the tree for Z where Z has a non-zero probability of being false. At leaves where Z is true with probability 1, the value of Y is irrelevant because Y does not appear in the subtree of Tree(V) where Z is true. The leaves of the appended tree are labelled with the union of the original labels. This gives us the tree shown in Figure 3.9(c), in which each leaf is labelled with a distribution over both Z and Y. This tree can then be simplified to get PTree(V, a) as shown in Figure 3.9(d). Since there are no synchronic arcs in this problem, Z and Y are independent given the prior state, and the product of the two probabilities labelling each leaf of PTree(V, a) is sufficient to determine future values. If Z and Y were not independent (i.e., there was a synchronic arc between them in the action network) we would need a complete joint distribution over the two variables at each leaf of the tree. We can now use this tree to compute expected future values, resulting in FVTree(V, a), shown in Figure 3.9(e).

Figure 3.9: Decision-theoretic regression of Tree(V) through action a in Figure 3.7 to produce Tree(Q_a^V): (a) Tree(V); (b) partially completed PTree(V, a); (c) unsimplified version of PTree(V, a); (d) PTree(V, a); (e) FVTree(V, a); (f) Tree(Q_a^V).

Input: Tree(V), an action a, Tree(R).  Output: Tree(Q_a^V).

1. Let PTree(V, a) be the tree returned by PRegress(Tree(V), a).
2. Build FVTree(V, a) by: for each branch b of PTree(V, a) with leaf l_b:
   (a) Let Pr_b be the joint distribution formed from the product of the individual variable distributions at l_b.
   (b) Compute
       v_b = Σ_{b' ∈ Tree(V)} Pr_b(b') · V(b')
       where the b' are the branches of Tree(V), Pr_b(b') is the probability, according to the distribution at l_b, of the conditions labelling branch b', and V(b') is the value labelling the leaf at the end of branch b'.
   (c) Re-label leaf l_b with v_b.
3. Discount FVTree(V, a) with discount factor γ by multiplying the value at each leaf by γ.
4. Merge FVTree(V, a) with Tree(R), using addition to combine the values at the leaves. Simplify the resulting tree, which is Tree(Q_a^V).

Figure 3.10: The decision-theoretic regression algorithm.

Input: Tree(V), an action a.  Output: PTree(V, a).

1. If Tree(V) contains a single leaf node, return an empty tree.
2. Let X be the variable at the root of Tree(V). Let T_X = Tree(a, X) be the conditional probability tree for X in action a.
3. For each x_i ∈ val(X) that occurs with non-zero probability in the distribution at some leaf of T_X, let:
   (a) T_V^{x_i} be the subtree of Tree(V) attached to the root by the arc labelled x_i;
   (b) T_P^{x_i} be the tree produced by calling PRegress(T_V^{x_i}, a).
4. For each leaf l ∈ T_X labelled with probability distribution Pr^l:
   (a) Let val_l(X) = {x_i ∈ val(X) : Pr^l(x_i) > 0}.
   (b) Let T_l = Merge({T_P^{x_i} : x_i ∈ val_l(X)}), using union to combine the labels (probability distributions) at the leaves.
   (c) Revise T_X by appending T_l to leaf l, again using union to combine the leaf labels.
5. Return PTree(V, a) = T_X.

Figure 3.11: PRegress(Tree(V), a). The algorithm for producing PTree(V, a).
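Step 2 of the regression algorithm in Figure 3.10 can be pictured with a short sketch. This assumes no synchronic arcs, so the joint distribution at a PTree leaf is just the product of the per-variable distributions; the representation and names are illustrative rather than the thesis implementation, and the final line reproduces the expected value of 9 computed in the first example above.

from itertools import product

def value_lookup(value_tree, assignment):
    """Traverse a value tree (variable, {value: subtree}) to a leaf value."""
    while isinstance(value_tree, tuple):
        var, children = value_tree
        value_tree = children[assignment[var]]
    return value_tree

def expected_future_value(leaf_dists, value_tree):
    """leaf_dists: {variable: {value: probability}} labelling one PTree leaf.
    Returns the expectation of value_tree under the product distribution."""
    variables = list(leaf_dists)
    total = 0.0
    for combo in product(*(leaf_dists[v].items() for v in variables)):
        prob = 1.0
        assignment = {}
        for var, (val, p) in zip(variables, combo):
            prob *= p
            assignment[var] = val
        if prob > 0.0:
            total += prob * value_lookup(value_tree, assignment)
    return total

# Figure 3.8: V depends only on Z; at the PTree leaf reached when Z was false
# and Y was true, the action makes Z true with probability 0.9.
tree_V = ("Z", {True: 10.0, False: 0.0})
leaf_distribution = {"Z": {True: 0.9, False: 0.1}}
print(expected_future_value(leaf_distribution, tree_V))   # 9.0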
Finally, we discount the expected future value tree and merge it with the reward tree to obtain Tree(Q_a^V), shown in Figure 3.9(f).

These examples illustrate the main intuitions of the regression algorithm used to compute Equation 3.1. The main steps of the algorithm are shown in Figure 3.10, while the details of the creation of PTree(V, a) are shown in Figure 3.11. The soundness of this algorithm is ensured by the following result:

Theorem 3.1 Let PTree(Q_a^V) be the tree produced by the PRegress(Tree(V), a) algorithm. For any branch b of PTree(Q_a^V), let B denote the event determined by its edge labels, and assume the leaf of b is labelled by distributions {P_1, ..., P_n}. Let Pr_b denote the joint product distribution over X = {X_1, ..., X_n} induced by the P_i(X_i) for 1 ≤ i ≤ n. Then:
(a) Any x ∈ val(X) such that Pr_b(x) > 0 corresponds to a unique branch of Tree(V). That is, any assignment to X that has positive probability is sufficient to "traverse" Tree(V) to a unique leaf node.
(b) Let x ∈ val(X) and let s_i be any state satisfying B. Then Pr(S^{t+1} |= x | s_i, A^t = a) = Pr_b(x); in other words, Pr(S^{t+1} |= x | S^t |= B, A^t = a) = Pr_b(x).

Proof We prove part (a) inductively on the depth of tree Tree(V). The base case, for a tree of depth 0 (i.e., when Tree(V) consists of a single leaf labelled with a value), is immediate, since PRegress returns an empty tree, which is sufficient to traverse Tree(V) to a leaf. Now assume that the result holds for all value trees with depth less than d. Let Tree(V) have depth d with root labelled by variable X and subtrees Tree(x_i) for each x_i ∈ val(X). Since all subtrees Tree(x_i) have depth less than d, by the inductive hypothesis PRegress will return a probability tree PTree(x_i) capturing a joint distribution over the variables in Tree(x_i) such that any assignment to those variables given non-zero probability allows subtree Tree(x_i) to be traversed to a leaf. Now PTree(Q_a^V) is constructed by appending to each leaf l of the conditional probability tree Tree(a, X) those trees PTree(x_i) for which the distribution over X labelling l assigns Pr(x_i) > 0, and maintaining the subsequent product distribution at each resulting leaf. If this resulting product distribution at any leaf of PRegress(Tree(V), a) assigns Pr(x_i) > 0, then we may traverse the x_i arc from the root of Tree(V). The fact that this distribution must include information from PTree(x_i) means that any nonzero event permits the navigation of the subtree Tree(x_i) of Tree(V). If the resulting product distribution assigns Pr(x_i) = 0, then we will never traverse the x_i arc, and the fact that the distributions from PTree(x_i) are not included in the product distribution is irrelevant.

To prove part (b), let leaf l_b of PTree(Q_a^V) be labelled by a distribution over the set of variables X = {X_1, ..., X_n} and the corresponding branch b labelled with conditions B. By construction, the conditions labelling b entail the conditions labelling exactly one branch b_{X_i} of the conditional probability tree Tree(a, X_i), for each X_i. Denote these conditions B_{X_i}. The semantics of the Bayesian network, given the absence of synchronic arcs, ensures that

Pr(X_i^{t+1} | B^t, A^t = a) = Pr(X_i^{t+1} | B_{X_i}^t, A^t = a, C^t, C^{t+1})

where C^t is any event over the pre-action variables X_j^t consistent with B_{X_i}, and C^{t+1} is any event over the post-action variables X_j^{t+1} with j ≠ i.
Since, for each X_i, the distribution labelling l_b is exactly Pr(X_i^{t+1} | B_{X_i}, A^t = a), the corresponding product distribution is

Pr(X_1^{t+1} | B^t, A^t = a) · Pr(X_2^{t+1} | B^t, A^t = a) · · · Pr(X_n^{t+1} | B^t, A^t = a).

Since the X_i^{t+1} are independent given B^t, this is exactly

Pr(X_1^{t+1}, X_2^{t+1}, ..., X_n^{t+1} | B^t, A^t = a).

Since the X_i^{t+1} are independent of any event C^t consistent with B^t, the result follows. □

It follows almost immediately that the decision-theoretic regression algorithm is sound. This is because when it is applied to a value tree Tree(V) and an action a, it uses PTree(Q_a^V) to determine the distribution over values in Tree(V), and the conditions under which to use that distribution to compute the expected future value of performing a. Adding the immediate reward to the discounted future value is straightforward.

Corollary 3.2 Let Tree(Q_a^V) be the tree produced by applying the decision-theoretic regression algorithm to Tree(V) and a. For any branch b of Tree(Q_a^V), let B denote the event determined by its edge labels, and assume the leaf of b is labelled by the value v_b. For any state s_i |= B, we have Q_a^V(s_i) = v_b. In other words, Tree(Q_a^V) accurately represents Q_a^V.

Note that the algorithms in Figures 3.10 and 3.11 could be implemented considerably more efficiently than they have been presented. Examples of optimisations include performing steps 3 and 4 of the regression algorithm while the tree is being constructed rather than separately at the end, and building shared substructures (such as the subtree rooted at W that appears multiple times in the future value tree for the FETCH-COFFEE action in Figure 3.12(a)) only once in the PRegress algorithm. The algorithms are very sensitive to tree ordering. A different ordering of variables can greatly change the size of the tree required to represent the same value function. Although we have not discussed it here, it may be desirable at any point to reorder a policy or value tree to improve efficiency. Choosing an optimal ordering of variables to minimise the size of a tree is an NP-hard problem [89], but a greedy search can be used to select an ordering which is empirically close to optimal [11]. Because the ordering of variables in the value trees is closely based on the ordering in the action trees, natural representations of the actions are often invaluable in minimising the size of policy and value trees.

3.1.2 Structured Successive Approximation

The decision-theoretic regression algorithm described above is also used to perform Step 4 of the modified policy iteration algorithm, the successive approximation step given by Equation 3.4. However, there is the added complication that the action being performed varies over the state space, since it depends on the current policy. The simplest solution to this problem is to apply the regression algorithm to every action that appears in the policy tree. The Q-tree for action a is then appended to the policy tree at every leaf at which a is the action being performed, with the leaves of the new tree being labelled with their label from the Q-tree for that action. This process is shown for the robot-coffee domain in Figure 3.12.
Here the policy and initial value function are those given in Figure 3.2, and the future value trees for each action are shown (Figure 3.12(a)), along with the unsimplified and simplified merged trees (Figure 3.12(b) and (c) respectively). At this point the tree in Figure 3.12(c) would be discounted and merged with the reward tree as usual.

Figure 3.12: Applying the successive approximation algorithm to the robot-coffee example with the policy and value function shown in Figure 3.2: (a) the future value trees for each action; (b) the unsimplified merged future value tree; (c) the final future value tree.

Since the value in the merged tree Tree(Q_π^V) at a state s is just the value from the corresponding Q-tree Tree(Q_{π(s)}^V), it is quite obvious that the algorithm produces a sound representation of Q_π^V.

Proposition 3.3 Let Tree(Q_π^V) be the tree produced by the decision-theoretic regression algorithm applied to Tree(V) and Tree(π). For any branch b of Tree(Q_π^V), let B denote the event determined by its edge labels, and assume the leaf of b is labelled by the value v_b. For any state s_i |= B, we have Q_π^V(s_i) = v_b. In other words, Tree(Q_π^V) accurately represents Q_π^V.

Figure 3.13: A series of approximations to the value of the action in Figure 3.7.

The value of the current policy is computed by performing this successive approximation step (computing a more accurate approximation to the value of the current policy) some number of times. This process is illustrated in Figure 3.13, which shows part of the sequence of value trees produced if the action in Figure 3.7 is used as the policy everywhere. As Figure 3.13 shows, the structure of the tree tends to converge long before the actual values at the leaves converge. This suggests another important optimisation for the algorithm: once the tree structure has converged, we need only keep the probability tree along with the current value tree, and compute the next successive approximation directly from them, without needing to apply the PRegress algorithm. This optimisation is possible because once the tree structure has remained the same from one iteration to the next, it cannot change at subsequent iterations.

Theorem 3.4 Let Tree(V_π^n) and Tree(V_π^{n+1}) be two trees produced by successive iterations of structured successive approximation. If Tree(V_π^n) and Tree(V_π^{n+1}) have identical structure (i.e., they are identical except possibly for the value labels at their leaves), then Tree(V_π^{n+j}) will have the same structure for any j > 0.

Proof Suppose that Tree(V_π^n) and Tree(V_π^{n+1}) have the same structure. The decision-theoretic regression algorithm applied to Tree(V_π^n) and Tree(π) produces the structure of Tree(V_π^{n+1}) based on the structure of Tree(V_π^n), without regard to the values at the leaves. Since Tree(V_π^{n+1}) has identical structure (it differs only in its leaf values), the algorithm will produce Tree(V_π^{n+2}) with identical structure again. A simple inductive argument proves the result. □
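The optimisation suggested by Theorem 3.4 only needs a structural comparison of successive value trees. A minimal sketch of that check, on the nested-tuple representation used in the earlier sketches (our own illustration, with made-up leaf values), is:

def same_structure(t1, t2):
    """True if two value trees make exactly the same distinctions,
    ignoring the values at their leaves (the test used to trigger the
    optimisation suggested by Theorem 3.4)."""
    t1_leaf = not isinstance(t1, tuple)
    t2_leaf = not isinstance(t2, tuple)
    if t1_leaf or t2_leaf:
        return t1_leaf and t2_leaf
    (v1, c1), (v2, c2) = t1, t2
    return (v1 == v2 and c1.keys() == c2.keys()
            and all(same_structure(c1[k], c2[k]) for k in c1))

# Two successive approximations with the same structure but different leaf
# values (illustrative numbers): once this returns True, PRegress need not
# be re-run and only the leaf values are updated on later iterations.
v_n  = ("Z", {True: 19.0, False: ("Y", {True: 17.1, False: 0.0})})
v_n1 = ("Z", {True: 26.1, False: ("Y", {True: 23.5, False: 0.0})})
print(same_structure(v_n, v_n1))   # True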
3.1.3 Structured Policy Improvement

To perform Step 2 of the MPI algorithm (Equations 3.2 and 3.3) we have to compute an improving policy tree from the Q-trees for all the actions. We showed how to compute these Q-trees using decision-theoretic regression in Section 3.1.1, so it remains to show how the improving policy and an estimate of its value function can be computed. This is done by merging the Q-trees for each action, since these Q-trees contain a Bellman backup of the previous value function. The leaves of the new tree are labelled with the maximum of the labels of the Q-trees to produce the new value function, and with an action for which the corresponding Q-tree had the maximal value at this leaf to produce the new policy tree (a sketch of this merge appears at the end of this subsection).

Figure 3.14 shows this procedure applied to the robot-coffee example. The Q-trees for each of the three actions are shown (these are computed using decision-theoretic regression on an initial value function that is the same as the reward function from Figure 3.1), along with the policy tree and value tree that result when the Q-trees are merged. Note that the policy tree is somewhat simpler than the value tree. This is almost always the case, since many branches with differently valued leaves may nevertheless have the same best action (for example, if it isn't raining, the value function will be different depending on whether the robot is already wet or not, but the policy will be the same). Although the policy tree can be simplified in this way, it is often not worth doing this in practice because the next iteration of successive approximation will add the additional complexity back into the tree structure.

Figure 3.14: Three action trees used for structured policy improvement, the new value tree formed by merging them, and the corresponding policy.

Figure 3.15: The continuation of the SPI algorithm from Figure 3.14. The value function that results from successive approximation, the Q-trees, and the resulting value function and (optimal) policy are shown.

Figure 3.15 shows the continuation of this process in computing the optimal policy for this problem. Structured successive approximation is performed on the value tree from Figure 3.14 to produce the value tree in Figure 3.15, then policy improvement is again performed, producing a Q-tree for each action, and these are merged to produce the final value function and policy. Another round of successive approximation (not shown in the figure) is then run until convergence is reached.
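A compact sketch of this merge-with-maximisation step, reusing the nested-tuple tree representation from the earlier sketches, is given below. It is our own illustration, not the thesis implementation; real Q-tree leaves would carry the values produced by decision-theoretic regression.

def map_leaves(tree, f):
    if not isinstance(tree, tuple):
        return f(tree)
    var, children = tree
    return (var, {v: map_leaves(sub, f) for v, sub in children.items()})

def append_everywhere(tree, other, combine):
    """Append `other` below every leaf of `tree`, combining leaf labels."""
    if not isinstance(tree, tuple):
        return map_leaves(other, lambda l: combine(tree, l))
    var, children = tree
    return (var, {v: append_everywhere(sub, other, combine)
                  for v, sub in children.items()})

def simplify(tree, context=()):
    if not isinstance(tree, tuple):
        return tree
    var, children = tree
    fixed = dict(context)
    if var in fixed:
        return simplify(children[fixed[var]], context)
    return (var, {v: simplify(sub, context + ((var, v),))
                  for v, sub in children.items()})

def improve(q_trees):
    """q_trees: {action: Q-tree with numeric leaves}. Returns a merged tree
    whose leaves are (best value, one maximising action) pairs, i.e. the new
    value tree and policy tree rolled into one."""
    actions = list(q_trees)
    merged = map_leaves(q_trees[actions[0]], lambda q: (q, actions[0]))
    for a in actions[1:]:
        merged = append_everywhere(
            merged, q_trees[a],
            lambda old, q, a=a: (q, a) if q > old[0] else old)
    return simplify(merged)

# Toy example with two actions over a single boolean variable X.
q = {"stay": ("X", {True: 5.0, False: 1.0}),
     "go":   ("X", {True: 4.0, False: 3.0})}
print(improve(q))   # ('X', {True: (5.0, 'stay'), False: (3.0, 'go')})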
3.1.4 Computing Norms of Structured Value Functions

The final piece required for implementing the MPI algorithm is the ability to compare two value functions to evaluate the stopping criterion. We have described two possible norms for halting the algorithm, which are given in Equations 2.3 and 2.4. To compute each of these, we need to find the maximum (and, in the case of the span-seminorm measure, the minimum) of the difference between two value functions. The difference tree can be computed by merging the two value function trees and labelling the leaves with the difference between the values of the leaves in the value trees. The maximum and minimum values in the difference tree can then be found by a simple tree traversal. For efficiency the two procedures can be combined, with the maximum and minimum values being calculated as the value trees are being merged (in fact, the merged tree need not be produced at all, as long as the values at its leaves are computed).

The operations described above are all that is required to implement the SPI algorithm. The structure of the algorithm is identical to that shown in Figure 3.3. The only difference is that each step of the algorithm is performed in a structured way. The soundness of the components of the algorithm ensures the soundness of the algorithm as a whole.

3.1.5 Structured Value Iteration

Algorithms other than MPI can also be implemented in a structured way. Policy iteration can easily be implemented just by changing the termination conditions of MPI, although the structured version of policy iteration produces an ε-optimal policy rather than an optimal policy, because it computes the value of each policy using successive approximation, while policy iteration solves a system of equations to compute policy values exactly. We can also implement a structured version of value iteration (SVI). The algorithm is shown in Figure 3.16. Although all the results in Section 3.1.6 are computed using SPI, SVI plays an important role in the approximation algorithms we discuss in Section 3.2. SVI works by repeatedly constructing Q-trees for each action based on the current estimate of the value function, and then merging them to produce a new estimate. Note that an ε-optimal value function is returned rather than a policy. An ε-optimal policy can easily be constructed from this by performing structured policy improvement on the returned value function.

Input: A Markov decision process ⟨T, S, A, Pr, R⟩ in structured form, a discount factor γ, an optimality constant ε.  Output: An ε-optimal value function Tree(V*).

1. Let Tree(V^0) be Tree(R); set n = 0.
2. For each action a, compute: Tree(Q_a^{V^n}) = Regress(Tree(V^n), a, Tree(R)).
3. Merge the Q-trees Tree(Q_a^{V^n}) for each a, using maximisation on the leaf labels, to obtain Tree(V^{n+1}).
4. If the termination condition (Equation 2.3 or 2.4) holds then go to step 5; otherwise increment n and go to step 2.
5. Return Tree(V*) = Tree(V^{n+1}).

Figure 3.16: The structured value iteration algorithm.

3.1.6 Results and Analysis

In this section we describe a number of experiments that compare the performance of structured policy iteration with modified policy iteration, the best of the standard algorithms. We examine performance on a number of different problems and attempt to characterise the types of domains on which SPI works well.
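Before turning to the experiments, here is a minimal sketch of the stopping-criterion computation from Section 3.1.4: the extreme values of the difference between two tree-structured value functions can be found during a (conceptual) merge, without ever materialising the merged difference tree. The representation and the example values are illustrative only.

def diff_extremes(t1, t2, context=()):
    """Return (min, max) of t1(s) - t2(s) over the states consistent with
    the branch conditions accumulated in `context`."""
    fixed = dict(context)
    if isinstance(t1, tuple):
        var, children = t1
        if var in fixed:
            return diff_extremes(children[fixed[var]], t2, context)
        ranges = [diff_extremes(sub, t2, context + ((var, v),))
                  for v, sub in children.items()]
        return (min(r[0] for r in ranges), max(r[1] for r in ranges))
    if isinstance(t2, tuple):
        var, children = t2
        if var in fixed:
            return diff_extremes(t1, children[fixed[var]], context)
        ranges = [diff_extremes(t1, sub, context + ((var, v),))
                  for v, sub in children.items()]
        return (min(r[0] for r in ranges), max(r[1] for r in ranges))
    return (t1 - t2, t1 - t2)

def sup_norm(t1, t2):
    lo, hi = diff_extremes(t1, t2)
    return max(abs(lo), abs(hi))

def span_seminorm(t1, t2):
    lo, hi = diff_extremes(t1, t2)
    return hi - lo

# Illustrative successive value trees.
v_old = ("Z", {True: 10.0, False: ("Y", {True: 9.0, False: 0.0})})
v_new = ("Z", {True: 19.0, False: ("Y", {True: 17.1, False: 8.1})})
print(sup_norm(v_new, v_old), span_seminorm(v_new, v_old))  # 9.0 and about 0.9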
For fairness, we have tried to compare SPI with as efficient an implementation of MPI as possible. Our implementation of MPI stores the transition matrices for each action sparsely to save on space and lookup times; however, we include the cost of translating from the structured action representation into this sparse matrix representation in the computation times for MPI. We do this because we believe that the structured representation is the more natural way to represent problems, and real problems are more likely to be described in this form (or some variation of it). These results were produced using implementations written in C++ running on a 400MHz Pentium II with 640MB of memory.

Best and Worst Case Performance

To illustrate the best-case and worst-case performance of SPI, we have constructed two synthetic domains. The worst-case domain is a problem with a compact representation but in which every state has a distinct value, so the optimal value function tree contains a leaf for every possible state of the system. We use an MDP consisting of n boolean variables X_1, ..., X_n and n deterministic actions a_1, ..., a_n. Each action a_i makes the corresponding variable X_i true if it is false, but only if all the preceding variables X_1, ..., X_{i-1} are true. In the process, it also makes X_1, ..., X_{i-1} false. Thus to make X_i true, we must first have all of X_1, ..., X_{i-1} true, and then perform action a_i. This makes X_i true, but makes all of X_1, ..., X_{i-1} false. Therefore, to make X_{i+1} true, we must now make all of X_1, ..., X_{i-1} true again. An example of this type of MDP is given in Appendix A. The reward functions for these MDPs have a large value if all the variables are true, and zero otherwise. Because of this, it takes a large number of iterations for the reward to propagate back to all the states.

Figure 3.17: Optimal value trees for the (a) worst-case and (b) best-case problems with three variables.

Figure 3.18: (a) Time and (b) space performance for SPI and MPI on the worst-case series of examples.

An example of the optimal value function for the worst-case domain is given in Figure 3.17(a) for a problem with only three variables, a reward of one for making all of them true, and a discount factor of 0.9. We call this a worst case for SPI since it must enumerate the entire state space in the same way that MPI does, but must also pay the additional overhead of constructing and traversing the trees.

Figure 3.18 compares the performance of SPI with MPI on these worst-case problems with between six and twelve variables (64 to 4096 states). Figure 3.18(a) shows that SPI runs roughly 100 times slower than MPI on these problems. This roughly constant overhead is expected given that the complete tree needed to represent the value function is roughly twice as large as the equivalent table-based value function. As Figure 3.18(b) shows, the space required by SPI increases at a significantly greater rate than with MPI. Again this is due to the fact that the tree representation must store the interior nodes in the value tree. While the overhead in the worst-case example is large, we designed the problem specifically to illustrate how badly SPI could potentially perform.
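To make the worst-case construction concrete, here is a small sketch under our reading of the description above (action a_i succeeds only when X_1, ..., X_{i-1} are all true, and then resets them). The exact MDP used in the experiments is the one given in Appendix A, so treat this as illustrative.

def worst_case_step(state, i):
    """Apply deterministic action a_i (1-indexed) to a state given as a tuple
    of booleans (X_1, ..., X_n)."""
    s = list(state)
    if all(s[:i - 1]):           # precondition: X_1 .. X_{i-1} all true
        s[i - 1] = True          # effect: X_i becomes true ...
        for j in range(i - 1):   # ... and the preceding variables are reset
            s[j] = False
    return tuple(s)

def worst_case_reward(state, big=10.0):
    return big if all(state) else 0.0

# Reaching the all-true state for n = 3 requires 2^3 - 1 = 7 actions, which is
# why the reward takes many iterations to propagate and every state ends up
# with a distinct value.
s = (False, False, False)
for a in [1, 2, 1, 3, 1, 2, 1]:
    s = worst_case_step(s, a)
print(s, worst_case_reward(s))   # (True, True, True) 10.0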
As long as we can expect benefits in the best or average case, the risk of this behaviour may well be acceptable. To examine the best-case performance, we have constructed another synthetic domain. In the best-case domain, although every variable is relevant to the optimal value function (and hence all the variables appear in the tree), each variable only appears once in the optimal value tree. This is best-case performance in that there is no smaller value tree in which every variable plays a role.

For the best-case examples, we again have n boolean variables X_1, ..., X_n and associated actions a_1, ..., a_n, and as before, we have a large reward if all the variables are true, and a reward of zero otherwise. However, in this case, each action a_i makes variable X_i true if all preceding variables X_1, ..., X_{i-1} are true, and has no effect otherwise. But it also makes all following variables X_{i+1}, ..., X_n false. Hence the optimal policy is to perform the action corresponding to the first variable that is false. This then makes all the following variables false, and each of those must then be made true in turn. Hence the optimal value tree consists of a single branch with n internal nodes and n+1 leaves, each with a distinct value. An example for n = 3 is shown in Figure 3.17(b), and given in detail in Appendix A.

Figure 3.19 shows a comparison of SPI and MPI on a series of these problems with between six and 20 variables (64 states to 1.05 million states). As we would expect, the space and time requirements of MPI grow exponentially in the number of variables, with insufficient space available to solve the 20-variable problem. In comparison, SPI has almost constant memory requirements, since the value tree grows by a single internal node and a single leaf node for each additional variable added to the problem. SPI also outperforms MPI considerably in terms of computation time, solving the 18-variable problem in 1.4 seconds, compared with 2923 seconds for MPI.

Figure 3.19: (a) Time and (b) space performance for SPI and MPI on the best-case series of examples.

Process-Planning Problems

The results for the best and worst-case domains illustrate the extremes of how we can expect SPI to perform. To investigate where more typical problems fall in this continuum, we ran SPI on a set of problems from a synthetic manufacturing domain. These problems are loosely based on a domain used in [98] and [31] and describe a manufacturing problem in which a product is built by creating two component parts and then joining them together. The parts must be milled, polished, painted, and attached together by either bolting or gluing. There are two types of finished product, high and low quality, and the policies for producing them are quite different. For example, high quality parts should be hand-painted, drilled, and bolted together, which requires skilled labour, a drill press, and a supply of bolts. This process is too expensive for producing low quality parts, so they are spray-painted and glued, which in turn requires a spray gun and glue.

Table 3.1: Results for SPI and MPI on process-planning problems.

Problem          Alg.  States     Acts.  SPI Leaves  Time (s)  Space (M)
Manufacturing1   MPI   55,296     14     -           349       78
                 SPI                     5786        650       22
Manufacturing2   MPI   221,184    14     -           1775      160
                 SPI                     14,117      1969      54
Manufacturing3   MPI   1,769,472  14     -           > 9000    -
                 SPI                     40,278      5155      129
Table 3.1 compares the performance of SPI and MPI on three different versions of this problem, with sizes ranging from 55 thousand to 1.8 million states. MPI is unable to run to completion on the largest of these due to memory requirements, but SPI solves the problem in approximately one third of the time we expect MPI would require.

For the smallest of the problems, there is insufficient structure in the problem to produce time savings for SPI. The structure in the problem has reduced it from the equivalent of 15.75 binary variables (that is, 2^15.75 states) to 12.5 binary variables (2^12.5 leaves), which is equivalent to removing a little more than three variables from the problem. For the second problem, the computation times for SPI and MPI are roughly equivalent, and the reduction in problem size is from the equivalent of 17.75 binary variables to the equivalent of 13.75. This suggests that in order to gain significant savings in computation time from using SPI, we need to remove at least the equivalent of four binary variables from the problem. Notice that adding a variable that has no effect on the optimal actions in any state effectively doubles the state space size for the problem and so doubles the time required for MPI. In contrast, SPI can discover the irrelevance of these variables, so that they have no effect on the running time for SPI (we have verified this empirically).

SPI performs very well compared with MPI in terms of space requirements for these problems. Because it discovers that there are only a relatively small number of distinct values for states in the optimal value function, SPI stores that function far more efficiently than MPI can, and so the space requirements are considerably reduced, even on the smallest of these problems. SPI performs so well on these problems because in certain parts of the state space, certain variables are completely irrelevant to the optimal value function. For example, if high quality products are required, they should be bolted together for maximum utility, so the availability of glue is irrelevant to the optimal value in this case. SPI discovers this kind of conditional irrelevance, which allows it to perform abstraction by ignoring variables when certain other variables have particular values. Conditional irrelevance can be expected to appear in many real-world problems, particularly where there are a number of different ways to achieve an objective, but only one is optimal under a particular set of circumstances.

Random Exogenous Events

The robot-coffee domain we have used as an example throughout this chapter is very simplistic, but more complex versions are easy to construct. In particular, we have constructed a somewhat more complex version of the problem in which the domain behaves as a process: events continuously occur and the robot must react to them. We call these random occurrences exogenous events. In this robot-coffee problem the robot must deliver coffee when it is requested, must deliver mail whenever there is mail waiting, and must keep a lab tidy. Each of these tasks has an associated exogenous event. One produces a request for coffee, one causes new undelivered mail to arrive, and one makes the lab less tidy. Each of these can occur at any time with some low probability. They are represented in the MDP by rolling them into the effects of the actions, which has the unfortunate consequence that any action can affect the values of each of these variables.
In this situation, the conditional irrelevance property we discussed above no longer holds, because all the actions affect the values of these variables, and hence the decision-theoretic regression algorithm builds more complex trees that take these effects into account.

We compare the performance of SPI on two versions of the problem which are identical except for the presence or absence of these exogenous events (see Appendix A for details). Each problem has six variables, four with two values and the other two with five values, for a total of 400 states. There are eight actions in each problem. Without exogenous events, SPI runs in 11.9 seconds and 1.85M of memory. It produces an optimal value tree with 291 leaves, and a policy tree with 196 leaves. Even without exogenous events, most variables are relevant in this problem, so the value tree is not significantly more compact than a table-based representation. For comparison, MPI requires 0.31 seconds and 1.5M of memory to find the optimal policy.

If exogenous events are present in the problem, SPI produces a final value tree with 300 leaves and a policy tree with 219 leaves, and requires 2.0M of memory. This is slightly larger than the version without exogenous events because the variables associated with tasks that do not currently require action are now relevant to the value function, since those tasks might need to be performed in the near future. More worrisome is the fact that SPI required 100.6 seconds to compute the optimal policy with exogenous events. This increase is partly a result of the extra iterations required for convergence. A more significant factor is that even though the final value and policy trees are only slightly larger than they would be without exogenous events, the trees tend to get large much more quickly when exogenous events are present: the intermediate steps in the algorithm are slower because the intermediate trees are larger.

Discussion

As we said in our discussion of process-planning problems, we expect the overhead of tree-building to pay off for the SPI algorithm if the equivalent of four to five variables can be removed from the problem, that is, if there are between 1/15 and 1/30 as many leaves in the trees as there are states in the problem. We expect that this level of reduction will be quite achievable in large problems. Of course, in terms of space we can expect dramatic savings compared with MPI even when time savings are minimal or non-existent. Thus we can hope to solve MDPs that are much too large to even be represented with state-based transition matrices.

The problems posed by exogenous events cannot be addressed directly by SPI. Fortunately, the representation lends itself naturally to approximation schemes such as the one we describe below in Section 3.2. SPI also combines very naturally with other abstraction schemes that have been proposed for MDPs, such as the envelope approach of [28] (see Section 2.6). Because the abstraction method used in SPI is orthogonal to those proposed in many of these approaches, we can hope to apply both algorithms together and combine the computational savings. This may be especially valuable in situations with exogenous events such as the example described above, when the reward function can be decomposed into separate rewards for each task (in this case fetching coffee, delivering mail, and tidying the lab).
A number of algorithms which operate on decomposable rewards of this kind and then build a policy by combining the optimal value functions for each sub-task have been proposed; see for example [13], [75], and [95]. Similar approaches have also been examined in the reinforcement learning community, for example by [97], [62] and [36].

Recent work by Hoey et al. [53] on a more optimised version of SPI that uses algebraic decision diagrams [4] rather than trees shows great promise for making SPI competitive with MPI on smaller problems. Their SPUDD algorithm finds an optimal policy for the medium-sized process-planning problem (220 thousand states) in 111 seconds, more than 15 times faster than either SPI or MPI. This dramatic improvement is due both to the optimisation of the tree structure made possible by the use of algebraic decision diagrams, and to the use of highly optimised routines for traversing and constructing the data structures. We are optimistic that the SPUDD approach may allow algorithms based on SPI to be competitive with MPI in problems that exhibit much less structure than the ones we have examined here.

3.2 Approximate Structured Value Iteration

One advantage of using a tree-based representation for value functions is that the trees lend themselves naturally to use in approximation algorithms, simply through pruning of branches. We are interested in approximation algorithms for two reasons. The first is that, in certain situations, there may not be enough structure in a problem for SPI to produce computational or memory savings. This is the case, for example, in the worst-case problems we described in Section 3.1.6. Here the optimal value function takes on a large number of different values, precluding a compact representation. In these types of problems, approximation by pruning can dramatically decrease the size of the trees, while still producing a value function that is close to optimal. The second reason for our interest in approximation algorithms is that often we are more interested in achieving "good" performance (or perhaps in performing better than some threshold) than in producing an optimal policy. In classical planning, the task is frequently not to find the best plan, but to find any plan that achieves the goals. Here again, we can use pruning to trade off policy quality against computation time. We produce policies which are not optimal, but which can be shown to be within some threshold of optimal, and we do this in much less time than finding an optimal policy would require.

Our algorithm, approximate structured value iteration (ASVI), is based on the structured version of value iteration described in Section 3.1.5 and Figure 3.16.

Input: A Markov decision process ⟨T, S, A, Pr, R⟩ in structured form, a discount factor γ, an optimality constant ε, and an acceptable approximation value δ or maximum tree size σ.  Output: A value tree Tree(V*_δ) that is within δ of optimal, or a value tree Tree(V*_σ) that is the best possible value tree with at most σ leaves.

1. Let Tree(V^0) be Tree(R); set n = 0.
2. Prune Tree(V^0), under some pruning criterion of δ or σ, to produce the pruned tree Tree(Ṽ^0).
3. For each action a, compute: Tree(Q_a^n) = Regress(Tree(Ṽ^n), a, Tree(R)).
4. Merge the Q-trees Tree(Q_a^n) for each a, using maximisation on the leaf labels, to obtain Tree(V^{n+1}).
5. Prune Tree(V^{n+1}), under some pruning criterion of δ or σ, to produce Tree(Ṽ^{n+1}).
6. If the termination condition (see Section 3.2.3) holds then go to step 7; otherwise increment n and go to step 3.
7. Return Tree(V^{n+1}) as the final value function Tree(V*_δ) or Tree(V*_σ).

Figure 3.20: The approximate structured value iteration algorithm. The changes from the SVI algorithm of Figure 3.16 are steps 2 and 5 and the termination test.

The main difference is that after each step of value iteration, the value trees produced are pruned. Subtrees which make only small distinctions in value are replaced with a single leaf. The tree that results no longer reflects regions that have identical value, but regions of similar value. The ASVI algorithm is shown in Figure 3.20. In broad outline, we construct a sequence of approximate ranged value trees by: (a) pruning a value tree so that it makes fewer distinctions and approximates its true value; and (b) generating a new value tree by decision-theoretic regression of the approximate value tree. Another way to look at this is that we perform region-based dynamic programming, as we did in SPI, but coalesce regions that make distinctions of marginal utility.

Figure 3.21: Pruning applied to the value function computed in Figure 3.14. The trees are pruned to 20% and 50% tolerance respectively.

3.2.1 Ranged Value Trees and Pruning

Suppose we are given a value tree such as that in Figure 3.21, but are unhappy with its size. A simple way to reduce the size of the tree is to replace a (nontrivial) subtree with a single leaf, as shown in the figure. Since we no longer distinguish, for example, states where W and R are both true from states where W is true and R is false, the resulting tree can only approximately represent the true value function. One obvious choice of value assignment for the larger region (new leaf) is the midpoint of the values being replaced; this minimises the maximum error in the approximate value tree. We could instead label the new region with a range that encompasses all the replaced values, as shown in the figure. Ranges play a valuable role in ASVI, so we assume that all approximate value functions are represented by ranged value trees: each leaf is labelled with a range [l, u] representing the minimum (lower) and maximum (upper) values associated with states in the corresponding region. Point-valued regions (hence, exact value functions) are represented by setting u = l.

For any ranged value function V, we take the upper value function V↑ to be the value function induced by considering the upper entries of V. The lower value function V↓ and the midpoint function V↕ are defined in the obvious way. In choosing a particular value for a region given V (e.g., in action selection), one can easily recover the midpoint from the range and use the function V↕ as needed.
Given a fixed decision tree (assuming training has been completed), Bohanec and Bratko [11] present an algorithm for producing the sequence of pruned trees of decreasing size such that each tree in the sequence is the most accurate among all trees of that size. We can apply similar ideas in our setting, where the aim is to produce ranged value functions with the smallest span. The algorithm is shown in Figure 3.2.1. Several points are worth noting. We assume that the initial tree is ranged, and produce the new ranged trees with (possibly) larger ranges by collapsing subtrees. The sequence of trees is implicit in the variable S E Q - L A B E L : subtrees are replaced in the order described by S E Q - L A B E L . If variables are boolean, the sequence of produced trees is dense (there is a tree for each size less than the initial size of the tree). Finally, in practice, the algorithm is not run to completion. Instead, we terminate when either: a) the ranged tree at some point has a range larger than some maximum specified range 5, in which case the previous  pruned tree is the  desired tree; or b) the ranged tree has been reduced to some maximum allowable size. Which of these choices is used will be application-dependent. The amount of pruning that one can do by removing subtrees within acceptable tolerances—indeed the size of the tree before pruning—may be strongly influenced by the node ordering used in the value tree. Again, this issue arises in research on classification [86, 112]. Finding the smallest decision tree representing a given function is NP-hard [60], but there are feasible heuristics one can use in our  82  Input: ranged value tree T Output: Labels S E Q - L A B E L indicating order in which to replace subtrees rooted at labelled node  1. Let  SEQ = 1  2. Let F be the set of penultimate nodes in T (non-leaf nodes all of whose children are leaves) 3. For each n £ F, set R-label (n) = [u ,l ] where u — max{-u : [u, I] labels a child of n}, [u, I] labels a child of n}, and n  n  n  and  /„  =  min{/  :  4. While F ± 0 (a) Let n = argmin{w„ — l : n £ F} n  (b) Set SEQ - LABEL(n)  = SEQ; SEQ = SEQ + 1  (c) Set F = F - {n} (d) If m = Parent(n) exists and Children(m) fl F — 0 then add m to F and set R-label(m) = [u ,l ] where u = max{tf : [u , l ] is R-label of a child c} l = min{/ : [u , l ] is R-label of a child c} m  m  m  c  c  c  c  m  c  c  Figure 3.22: Algorithm for optimal sequence of pruned ranged value trees [11] setting to reorder the tree to make it smaller and/or more amenable to pruning. Among these, one appears rather promising and is strongly related to the information gain heuristic [86]. The idea is that we take the existing tree and categorise each variable according to the size of the ranges induced when it is either true or false—this can be done in one sweep through the tree. The variable with the smallest ranges is installed at the root of the tree (for example see [112]). This is repeated with the variables remaining in the new subtrees.  1  In the results we present below, we will concentrate on pruning to find the smallest tree of a fixed minimum accuracy. We either prune so that all subtrees whose values are within S are pruned, or more typically, we let 8 depend on the range of possible values of the M D P . In this case, if M and m are the maximum and 1  Thanks to Will Evans for his help with these ideas.  
In this case, if M and m are the maximum and minimum values in the current value function, 20% pruning corresponds to pruning all the subtrees whose values differ by no more than 20% of M − m. Figures 3.21, 3.23 and 3.24 are all pruned in this way. Pruning using this sliding δ value avoids the situation where, early in the value iteration algorithm, the values of all states are relatively similar, so the tree is pruned to a single leaf (or a very small number of leaves), which can lead to overly early convergence (see Section 3.2.3). Sliding δ pruning ensures that as the value function increases in range, the amount of pruning also increases, so early trees are pruned less vigorously than later ones.

3.2.2 Value Iteration Using Approximate Value Functions

Armed with a method for pruning a ranged value tree, we now examine how this can be applied to a policy construction technique like SVI. Our basic strategy (see Figure 3.20) can be described as follows. Given any ranged function V^i, we create a pruned tree Tree(Ṽ^i) by pruning Tree(V^i) to some specified tolerance (or size) to get a more compact, but approximate, representation of V^i. The pruned tree Tree(Ṽ^i) is then used as the basis for Bellman backups to produce a new ranged tree Tree(V^{i+1}). This new ranged tree is constructed in a manner very similar to that used in ordinary SVI, the key difference lying in the use of the ranges in Ṽ^i instead of point values. We note that Tree(V^{i+1}) is itself an approximation of the true (i+1)-step value function, since it was produced using an approximation of V^i. However, V^{i+1} will be further approximated by pruning Tree(V^{i+1}) to produce Tree(Ṽ^{i+1}).

The steps involved in producing the value function Tree(V^{i+1}) (i.e., Steps 3 and 4) are reasonably straightforward, but deserve some elaboration. The production of each Q-tree proceeds exactly as it does in SVI, with one minor exception. Since the value tree Tree(Ṽ^i) is labelled with ranges rather than single values, we want to produce ranges for the leaves of the Q-tree. To do this, we calculate the expected future value using the upper values at each leaf in exactly the same way as SPI (see Section 3.1.1), and do the same using the lower values, to get the new ranges. Conceptually, we treat the new Q-tree as having ranges produced using the trees Tree(Ṽ^i↑) and Tree(Ṽ^i↓).

Slightly more subtle is the merging of Q-trees in Step 4. Merging requires that for each state we determine which action choice maximises future expected value. In SVI this is reasonably straightforward: we find a partition (tree) that subsumes each Q-tree and label the leaves of this larger tree with the maximum value from the corresponding partitions in the set of Q-trees. In ASVI, these partitions are labelled with ranges that cannot simply be compared with a max operator. Instead we label the leaves of Tree(V^{i+1}) with the maximum of all upper labels of the corresponding partitions in the Q-trees, and the maximum of all lower labels of the corresponding partitions. Clearly, choosing the maximum of the upper labels is correct and bounds the true value of a state s. In the case of the lower labels, there exists an action that guarantees state s has the maximum of the expected values among the lower labels, namely the action used to derive the maximising Q-tree. This is therefore a tight lower bound on the true value of state s. This argument relies crucially on the fact that we need not pick an action at this point.
This argument relies crucially on the fact that we need not pick an action at this point.  There  will generally be no single action that one can assign to each state in the region to ensure this maximum lower bound is achieved for all states, but this is irrelevant to the construction of the value function. Figures 3.23 and 3.24 show the continuation of the A S V I algorithm with 20% and 50% pruning respectively, applied to the trees shown in Figure 3.21. Note that in Figure 3.23 the greedy policy is unable to distinguish between the values of the DO-NOTHING and G E T - U M B R E L L A actions at the circled leaf of the tree. The true value of the policy is greatly different depending on the action chosen, but the value according to ASVI's value function is indistinguishable, so the greedy policy arbitrarily chooses an action. In Figure 3.24, the pruned value function is equivalent to ignoring the effects of being wet on the value function. As a result, the  85  FINAL PRUNED VALUE TREE  GREEDY POLICY WC D  IC  W  W  W  , ^ s 1.82 -  '  W  N 0 T H I N G  DO-NOTHING ^0  w R  FETCH-COFFEE  '  5.52 [10.04  °-  V A L U E OF POLICY WC  U  15.43]  10  W 20  -  -  14.63'  FETCH-COFFEE \ DO-NOTHING I • GET-UMBRELLA;  0  5^f^R  ' '  FETCH-COFFEE  10  >•  15.60 -20  ! 10.82  Figure 3.23: The final pruned value tree, the corresponding greedy policy, and the value of that policy for 20% pruning applied to the robot-coffee problem.  FINAL PRUNED V A L U E TREE  GREEDY POLICY  IC  IC [2.71 5.42] [-1.68 1.03]  V A L U E OF POLICY  VyIC „  C  R.71 0] DO-NOTHING  >f  DO-NOTHING FETCH-COFFEE  }Y  10  W  -10  0  20 s l f ^ R 15.60 14.63  6.6  Figure 3.24: The final pruned value tree, the corresponding greedy policy, and the value of that policy for 50% pruning applied to the robot-coffee problem.  OPTIMAL VALUE FUNCTION iC dj 10  -10 20  5  .  6  0  R 60  14.63  10.82  Figure 3.25: The optimal value function for the robot-coffee problem.  86  greedy policy is optimal with the exception of ignoring the possibility of getting the umbrella. For comparison purposes, the optimal value function is shown in Figure 3.25. A S V I with either 20% or 50% pruning only differ from an optimal policy in states where WC and R are true, and HC, W, and U are false.  3.2.3  T e r m i n a t i o n for A S V I  The termination of A S V I raises some interesting issues. Exact value iteration is guaranteed to converge because the transformation operation (the Bellman backup) on value functions is a contraction operator with respect to the supremum norm (we do not consider span-seminorm termination here). The same does not apply when the intermediate value functions are approximated. Indeed, without a wellthought out stopping criterion, we can construct quite straightforward and natural examples in which the pruning of value trees causes A S V I to cycle through a sequence of identical value functions without termination.  2  To deal with this situation, we  adopt a fairly conservative approach: we stop whenever the ranges of two consecutive value functions indicate that the stopping criterion (Equation 2.3) might be satisfied. Specifically, the use of encompassing ranges allows us to test this condition in a way that is impossible with simple point valued approximation. 
3.2.3 Termination for ASVI

The termination of ASVI raises some interesting issues. Exact value iteration is guaranteed to converge because the transformation operation (the Bellman backup) on value functions is a contraction operator with respect to the supremum norm (we do not consider span-seminorm termination here). The same does not apply when the intermediate value functions are approximated. Indeed, without a well-thought-out stopping criterion, we can construct quite straightforward and natural examples in which the pruning of value trees causes ASVI to cycle through a sequence of identical value functions without termination.² To deal with this situation, we adopt a fairly conservative approach: we stop whenever the ranges of two consecutive value functions indicate that the stopping criterion (Equation 2.3) might be satisfied. Specifically, the use of encompassing ranges allows us to test this condition in a way that is impossible with simple point-valued approximation.

For any two ranged value functions V and W, we define

(V − W)(s) = min{ |r − r'| : V↓(s) ≤ r ≤ V↑(s), W↓(s) ≤ r' ≤ W↑(s) }.

We terminate ASVI when the following condition holds:

||V^{i+1} − Ṽ^i|| < ε.    (3.5)

In other words, when the ranges for every state in successive value approximations either overlap or lie within ε of one another, we terminate. We note that testing this condition with two value trees is no more difficult than in the case of SPI.

²For further discussion of convergence problems that arise due to approximation, see [20].

Regardless of the pruning criterion, as long as it produces sound ranged value trees, we can show the following results.

Theorem 3.5 Let Tree(V^n) be the nth value tree produced by ASVI, and let V^n denote the n-step value function produced by exact value iteration. Then V^n↓ ≤ V^n ≤ V^n↑.

Proof We prove this by induction. For the base case, Tree(Ṽ^0) is formed by pruning Tree(V^0), so the property holds by the soundness of our pruning techniques. For the inductive case, we have that V^n↓ ≤ V^n ≤ V^n↑ and must show the same for n + 1. By the soundness of the decision-theoretic regression algorithm (Theorem 3.1) we have, for every action a, that Q_a^n↓ ≤ Q_a^n ≤ Q_a^n↑. We must show that max_a Q_a^n↓ ≤ max_a Q_a^n and that max_a Q_a^n ≤ max_a Q_a^n↑. The first is immediate, since Q_a^n↓ ≤ Q_a^n for every a. We prove the second by contradiction. If it were false, there would be a state s at which, taking a* = argmax_a Q_a^n(s), we had Q_{a*}^n(s) > Q_b^n↑(s) for every action b; but taking b = a* gives Q_{a*}^n(s) > Q_{a*}^n↑(s), contradicting Q_{a*}^n ≤ Q_{a*}^n↑, so no such state can exist. We therefore have V^{n+1}↓ ≤ V^{n+1} ≤ V^{n+1}↑ for the unpruned tree Tree(V^{n+1}), and all we do to produce Tree(Ṽ^{n+1}) is to prune this tree, which maintains the property. □

Thus the n-step ranged value functions "contain" the optimal n-step value functions for the MDP. Theorem 3.5 also guarantees termination, due to the following result:
•  n  + 6)7/(1  —  7)  and  We note that the policy improvement step of the SPI algorithm can be applied to produce a tree-structured policy using the midpoint value function Tree(V ^). n  [96] for discussion of policy error given an approximate value function. 89  See  It is also worth noting that the argument for convergence of A S V I cannot be applied to approximate structured modified policy iteration (ASPI) since it requires that intermediate policies be produced (and partially evaluated). The problem is that because ranges are used one cannot generally guarantee that the sequence of policies is improving for every state in a range. One action may be better in one part of a region but worse in another. While A S P I works well on many examples, it can rather easily fall into cyclic behaviour. Thus, value iteration seems the ideal candidate for approximation using ranges.  3.2.4  Results and Analysis  To demonstrate the effectiveness of the pruning algorithm, we have applied it to three of the problems described in Section 3.1.6. These are the robot-coffee problem with and without random exogenous events, the worst-case problem, and the process-planning problem. For each of these, we have performed sliding S pruning for a variety of different pruning factors ranging from 10 to 90 percent, and compared them with both unpruned structured value iteration, and SPI. One important point to notice is that unpruned SVI is much less efficient on these problems that SPI. In fact, on the robot-coffee problem with exogenous events, SVI requires almost five times the computation time used by SPI. This means that the performance improvements due to pruning must often be considerable, simply to overcome the performance advantages of using SPI instead of SVI. For each of the problems we examine we compute the average and maximum error over the whole state space. The averaging is done on a per-state basis rather than per leaf, and all error calculations are based on the actual value of the greedy policy formed using Tree(V **). n  function Tree(V *), n<  That is, we first compute the midpoint value  then we use that to find a greedy policy, compute the true  value of that policy, and compare that value function with the optimal value function state by state to compute the maximum and average errors.  90  Table 3.2: Results for A S V I on the robot-coffee problem with exogenous events. Pruning 0 10% 20% 30% 40% 50% 60% 70% 80%  Time (s) 471 91 45 26 7 7 <1 <1 <1  Mem. (M) 2.34 2.11 2.07 2.04 2.03 1.97 1.60 1.60 1.60  Leaves 400 275 260 229 204 153 26 24 17  Avg. Error 0 0.82 0.93 3.08 3.34 4.43 5.54 8.34 7.50  Max. Error 0 1.27 13.33 19.87 19.22 18.88 17.88 18.19 18.14  The Robot-Coffee Problem with Exogenous Events As we saw in Section 3.1.6, the problem with exogenous events is that even with small probabilities associated with them, they considerably increase the complexity of value trees because they make many more variables relevant in each action. This problem seems like an ideal one for A S V I since we should be able to ignore the small differences in value caused by the low-probability exogenous events. Table 3.2 shows the effect of pruning on this problem when exogenous events are present. For comparison, SPI ran in 100.6 seconds and 2.0 megabytes of memory producing a value tree with 300 leaves. The range of values in the optimal value is from -69.72 to -31.46. 
If we perform 20% pruning in this domain, we effectively remove the exogenous events from consideration, halving the computation time compared with SPI and producing a policy in which states have an average error of only 0.93, which means that on average, the value of each state was only 2.4 percent below the optimal value. The maximum error of 13.33 indicates however that the error was mostly confined to a relatively small number of states. This is just as we would expect if the pruning had allowed us to ignore the effects of the exogenous events. To test this hypothesis, we applied the A S V I algorithm on the problem without the exogenous events, with the results shown in Table 3.3.  91  Table 3.3: Results for A S V I on the same robot-coffee problem as in Table 3.2, but with the exogenous events removed. Pruning 0 10% 20% 30% 40%  Time (s) 20 8 4 2 1  Mem. (M) 2.18 1.85 1.77 1.73 1.69  Leaves 400 211 146 124 72  Avg. Error 0 3.07 5.51 10.48 8.63  Max. Error 0 37.82 37.60 36.82 33.96  We can see from Table 3.3 that pruning has a much larger effect on the average and maximum error when the exogenous events are not present, again leading us to suspect that up to 20% pruning is simply removing the effects of the exogenous events. For comparison in this case, SPI runs in 11.9 seconds and 1.85 megabytes, producing a value tree with 291 leaves with values ranging from -51.51 to zero. In this case we see that 10% pruning gives us a speed improvement over SPI, and the resulting policy is on average only six percent worse than the optimal policy, although again the error is concentrated in a relatively small number of particularly bad states.  The Worst-Case Problem In the worst-case problem, a polynomial-sized problem description becomes an exponential-sized value tree for the optimal policy where the value of every state is different, but states that differ only in the value of the first few variables have very similar values (see Figure 3.17 for an example of one of these value trees). Because of this similarity of "nearby states", we might hope to see similar improvements in performance on the worst-case problem to those we see in the robot-coffee problem with exogenous events present. Unfortunately, the results are not promising for the worst-case problem. On the ten variable version (1024 states), pruning to 10% tolerance results in a tree with  92  Pruning percentage  Figure 3.26: The effect of pruning on running time, average, and maximum error for the 55 thousand state process-planning problem. only 195 leaves which is produced in 18 seconds, using 1.9 megabytes of memory. While these results compare very well with SPI (807 seconds and 3.5 megabytes), the errors in the value function of the policy produced with even this little pruning are extremely large. The problem is that each of those "nearby" states, although they have similar values, have different actions associated with them, and the greedy policy can't distinguish between them to find the optimal action for each.  The Process Planning Problem Figure 3.26 shows the performance of the A S V I algorithm on the small processplanning problem with 55 thousand states. We can see that at 30% pruning, the computational time and space requirements are less that SPI (524 seconds and 4.3 megabytes as compared with 650 seconds and 22 megabytes), and the value of the  93  policy that results is on average only 9.3% worse than the optimal policy. 
As with the other problems, larger pruning factors produce even faster performance, but with a corresponding loss of policy quality. The key however is that while computation time tends to decrease exponentially as the pruning factor is increased, the average error per state increases at a close to linear rate, meaning that even at very high pruning levels, the policy found is on average only 20 — 25% worse than the optimal policy, and can be computed in vastly less time.  Discussion The results we have presented do not make use of any tree reordering algorithm. We would expect that using this technique will produce even better performance on most problems, although we have found in practice that the way people describe problems (for example, the way they order the variables in the reward tree and the conditional probability trees) tends to produce variable orderings that naturally lead to compact value trees. Given that tree reordering techniques have not been used, the results for the A S V I algorithm seem particularly promising. On all the domains we tested, we have seen quite a reduction in computation time without having a huge impact on policy quality, even at quite high levels of pruning. In [33], we discussed the idea of building a simplified abstract M D P from some original M D P , computing an optimal policy for the abstract M D P , and then using that value function or policy as a seed to compute an optimal policy for the original M D P . Similar approaches may be possible here if the policy produced by A S V I is not acceptable. For example, we might use the median value tree Tree(V *) n<r  seed for SPI by using it as the initial value function. Since Tree(V **) n  as a  will be fairly  close to the optimal value function in many cases, it should considerably reduce the number of iterations SPI requires for convergence.  We have performed one  experiment with this approach in the 55 thousand state process-planning problem. A S V I was run with a pruning factor of 70%, taking 19 seconds to run to convergence,  94  and the resulting value function (the corresponding greedy policy has an average error of 19.2%) was given as the initial value function to SPI. SPI ran to convergence in 561 seconds, for a total saving in computation time of 10.7% when compared with SPI using the reward function as the initial value function. While this is only a small speed improvement, it suggests that the technique has promise and should be investigated further. A number of other approaches for incrementally improving the value function found by A S V I are also possible. For example, one could perform decision-theoretic regression locally (See Chapter 6) on small subtrees where the value function is believed to be poor, or even run a small number of iterations of unpruned structured value iteration on the final tree. We hope that techniques such as these, which are designed to add some of the value distinctions back into the tree that have been removed by pruning, may considerably improve the quality of the greedy policy without impacting too heavily on the performance of the algorithm.  3.3  Summary  The SPI and A S V I algorithms that we have presented here demonstrate that we can exploit structured representations for computational gains in the M D P framework. 
The decision-theoretic regression algorithm, which generalises classical goal regression to stochastic domains, groups together states that have identical value at various points in the dynamic programming procedure for solving an M D P . The SPI algorithm uses decision-theoretic regression to exploit uniformity in the value function, specifically the fact that under certain conditions, some variables have no impact on value. This is very similar to the classical planning approach in which goal regression allows a planner to ignore variables that have no impact on the achievement of the goal. The A S V I algorithm illustrates the strengths of this approach in approximation as well, and we anticipate that in larger problems where finding optimal 95  policies is impractical, A S V I may prove to be a very useful approach, possibly combined with an algorithm that can locally improve a structured value function of the kind produced by A S V I . In [33, 31], we presented a comparable approach to approximation in structured M D P s .  The difference between this approach and A S V I is that we select  the level of abstraction in advance by deciding which variables can be left out of the problem without compromising policy quality too much. Thus the abstraction scheme is non-adaptive and fixed, whereas the approach in A S V I adapts the level of abstraction from one iteration to the next, and also uses different levels of abstraction in different parts of the state-space. Indeed, there are some problems, including the robot-coffee problem with exogenous events, in which our previous algorithm was unable to find any acceptable abstraction at all. In comparison, as we have seen, A S V I works very effectively on this problem. In Chapter 6, we will describe a reinforcement learning algorithm that also builds structured representations of the value function. In fact, the ideas behind that algorithm can be applied directly in the planning domain as well. The algorithm is based on a local version of decision theoretic regression where rather than regressing an action through a complete value function, only a small part of the value function—a subtree of the value tree—is regressed through the action. As we will see, this can be used to produce an asynchronous version of value iteration similar to R T D P (see Section 2.6.2).  96  Chapter 4  Reinforcement Learning: Background and Previous Work As we have shown in Chapter 3, the use of two time-step Bayesian Networks for representing actions allows us, in many cases, to find good or optimal policies for M D P s more efficiently than standard approaches. A natural question about such a representation is "can we expect real-world problems to come described in this way?" There are two issues to be addressed in answering this question. The first is "Can we expect a human expert, when describing a problem, to describe it in a structured way?" The answer to this question is an unequivocal yes. It has been shown in a number of studies [48, 43] that human experts not only can describe problems in structured ways, but in fact prefer to do so. The second part of the question is "can we expect to learn such a world model directly from data?" 
If we can learn our structured representation, then we can build an agents that, when placed in an unknown environment, can learn the effects of their actions in that environment, learn to distinguish good and bad states of the world, and efficiently plan and execute good or optimal strategies, all without intervention from a user. This is the initial motivation for our interest in learning. We have a representation for actions that allows us to take advantage of prior knowledge about structure for  97  solving planning problems. What we would like to do is to take advantage of this prior knowledge of structure, as well as other forms of prior knowledge, in learning as well. Consider an agent in a poorly-explored or partially-described world, such as an agent controlling a factory.  It is unrealistic to expect that the agent has  complete knowledge of the factory's operations, but instead it may have a default policy and value function based on the way the factory has been run previously, or the operation of other similar factories. It may also start with information about things it shouldn't do, such as exceeding the maximum operating parameters for machines. The agent's task is to use the available prior information about the effects of its actions and the reward function, along with information it acquires by acting, to learn how to act optimally—in this case to control the factory as effectively as possible. There are a number of important issues that arise in attempting to build a learning and planning agent of this kind; for example, how to incorporate prior knowledge into the system, how to explore the world efficiently, how to generalise over states, and how to update a model when new observations suggest that it is in error. In this chapter we will begin by ignoring considerations of structure and concentrating on the mechanics of learning to act. We will review previous work in this area, beginning in Section 4.1 with the basics of reinforcement learning, and in Sections 4.2 and 4.3 with model-free and model-based algorithms respectively. In Section 4.4 we will examine the issue of exploration, and look at some previouslypublished exploration techniques. Finally in Sections 4.5 and 4.6 we will discuss how learning structure fits into this framework, and the interactions between problem structure and exploration algorithms.  98  4.1  Reinforcement Learning: The Basics  Reinforcement learning (RL) is a technique used to learn how to behave well by a process of trial and error. A reinforcement learning agent repeatedly performs actions in a dynamic environment, and by collecting statistics about those actions, learns how to act well or optimally according to some function of rewards or reinforcements  it receives for its actions. The agent learns to perform optimally either  by building a model of its environment and planning using the model, or simply by learning the value of each action in every state and performing the action with the highest value. Reinforcement learning problems can be divided into two classes; immediate  reward models in which the agent receives immediate feedback on the  value of an action, and delayed reward models where the value of an action may not be realized until a number of other actions have been performed. 
An example of an immediate reward model is a multi-armed bandit problem, where the agent must determine which arm has the best expected payoff, but where the system has no state, so the payoff at each decision epoch is independent of the system's history. MDPs fall into the second category, delayed reinforcement models, because they have multiple states, so the value of an action can depend on states and actions that occur some number of decision epochs later than the action itself. Since we are using MDPs as the foundation for this thesis, we will concern ourselves only with the delayed reward case here. Excellent surveys of the field of reinforcement learning, including immediate reward models, can be found in [64] and [105]. For a more detailed examination of the immediate reward case, see [8] or [116]. As Figure 4.1 shows, there are two main parts to a reinforcement learning agent. The first, the learner, receives a sequence of inputs of the form  (s,a,r,t)  where s is the state the world was in, a is the action that the agent performed in that state, r is the reinforcement that the agent received as a result of performing a in s, and t is the resulting new state. The task for this part of the agent is to summarise this data in a form that can be used to select actions (see below).  99  R L Agent Prior Knowledge  Updated  *• Learner  Reward, N e w State  Model  Explorer  Environment -*  Action  Figure 4.1: A reinforcement learning agent. Typically, this is done by learning the "long-term" value of every action in every state (or at least for every action and state that is reachable while following the optimal policy). The second part of a reinforcement learning agent is the exploration strategy, which selects the action to be performed in the current state. Action selection can be based on any of the agent's prior knowledge or experiences, but it is important that the exploration strategy doesn't always select the action that the agent currently believes is the best. A n agent that always selects what it currently believes to be the best action will be unlikely to learn that other actions are in fact better, since it will never perform them. A t the same time, since performance while learning matters—we would like the agent to act intelligently even when it knows relatively little about its environment—the exploration strategy should tend to prefer better actions over poorer ones. The trade-off between these two conflicting objectives is the major challenge in designing an exploration strategy. An important question in all learning research is how to compare the quality of different learning algorithms. For R L algorithms, a number of possible metrics that measure the quality of learning are available, including: Eventual convergence to optimal. A n algorithm that eventually converges to  100  Optimal Performance  Time Figure 4.2: Comparison of a learning agent's performance with optimal. The shaded area is the regret of the system. optimal behaviour is obviously very useful, and most of the learning algorithms we examine will have this property. However, without a measure of how quickly -convergence occurs, this property is mostly of theoretical interest. Convergence rate. Convergence to optimality is frequently asymptotic.  Two  other related metrics are convergence to near-optimality and performance after a given amount of time or computation, both of which are rather vague notions. 
Algorithms that only aim to maximise convergence rates also fail to take into account initial performance. They may learn quickly at the cost of performing very poorly early in the learning process. Regret. Another possible measure is the amount of reward lost because the agent spends time learning rather than performing optimally from the beginning. Figure 4.2 is a graph of the performance of a learning agent over time. The shaded area is the regret of the agent, the difference between optimal performance and the actual performance. Frequently, a graph of the value of the current greedy policy (i.e., the policy  101  the agent would follow if it always picked actions it currently thinks are best) is used to compare the regret of different algorithms. One problem with this is that because of the exploration strategy, R L algorithms never actually follow this strategy.  Instead, we advocate graphing the actual discounted future  reward received by the agent. On each run we compute at every stage the discounted value of the future rewards actually received on that run. Averaged over a number of runs, this measure penalises algorithms that discover good policies but over-explore and hence never follow them, and prefers algorithms that try never to perform too badly while exploring. Another important issue in examining R L algorithms is whether to learn a model of the world or not. Somewhat surprisingly, it is entirely possible to learn to act optimally without having any explicit understanding of the system dynamics. A system that only learns a policy or value function is called a model-free algorithm. In contrast, systems which learn a model of their environment—that is, the effects of their actions and the reward function—and then use the model to determine what action to perform are called model-based learning algorithms. In general, model-based approaches tend to learn from fewer observations of the environment. However, there are situations where learning a model provides little or no computational advantage because of the difficulty of learning the model, the difficulty of using the learned model for reasoning, or in very large domains where the space requirements of storing the model are prohibitive. In these cases model-free algorithms are generally preferred, although some of the approaches described in this thesis may provide leverage to apply model-based learning more successfully under these conditions.  102  1. Let the current state be s. 2. Select an action a to perform. 3. Let the reward received for performing a be r system transitions to be t.  and the state that the  4. Update Q(s, a) to reflect the observation (s, a, r, t) as follows: Q(s, a) = (1 - a)Q(s, a) + a(r + 7 max Q(t, a')) a'eAt  where a is the current learning rate. 5. Go to step 1. Figure 4.3: The Q-learning algorithm.  4.2  Model-free learning  A number of model-free learning algorithms have been proposed. We will mention three here, the Adaptive  Q-learning  Heuristic  Critic  [100, 54], TD(\) [101] (see below), and  [113, 114]. Of these, the simplest and most widely used is Q-learning  (see Figure 4.3 for the algorithm). Q-learning works by learning, for each state s and action a, the value of performing a in s and then acting optimally. Given a value function V , the Q-value for a state s and action a with respect to V is given by: Q (s, v  a) = R{s, a)  +7E  P r  ( ' ' s  a  tes If V is the optimal value function, we write Q*(s,a).  
So Q*(s,a)  is the value of  performing a in state s and then acting optimally, and the value of state s under the optimal policy, V*(s) is the maximum over all actions of Q*(s, a): Q*(s, a) = R(s, a) + 7 E  Pr{s, a,  t)V*(t)  tes V*(s) = maxQ*(s, a) aeA  In Q-learning we store for each state and action the current estimate Q(s,a)  of  Q*(s, a). When we receive an experience (s, a, r, t), we update the Q-value for s and 103  a using the Q-learning rule: Q(s,a) = ( l - c v ) Q ( s , a) + a(r + 7 max Q(t, a'))  (4.1)  a'eAt  where a is a learning rate parameter between 0 and 1, and 7 is the discount factor as before. The learning rate is typically decayed over time to prevent recent observations from overwhelming prior experiences. We write a(n) for the learning rate at time n. The Q-learning rule adjusts the current estimate of Q(s,a) to be closer to the newly observed value r + 7 max < ^ Q(t, a'). a  g  t  If each action is performed in each state infinitely often on an infinite run, if 2^2^=0 { ) a  n  =  0  0  a  n  d ^ X^^Lo ( ) a  n  2  <  0  0  (that is, if the learning rate a is decreased  appropriately slowly), then the Q values will converge to Q* [114, 70]. In many applications of Q-learning, a is set to be - where n is the number of times this action has been performed in this state. A variant of Q-learning called R-learning [93] that operates with the average reward optimality criterion (see [84] for details) has also been developed. The adaptive heuristic critic works in a similar way to Q-learning, but splits the learning task into two components, a reinforcement learner and a critic. The job of the learner is to learn a policy that maximises the values it gets from the critic, while the critic learns the expected discounted future reward for the policy that the learner is following. The two components run simultaneously, so the policy changes as the value function does. Given an observation (s, a, r, t), the critic learns the value of the policy using the TD(0) procedure [101]:  V(s) = V(s) + a(r + V(t) - V{s)) y  where V is the critic's current estimate of the value of the policy, and a is a learning rate parameter as before. TD(0) is an instance of a more general class of algorithms known as TD(A). The reinforcement learner typically chooses actions in a greedy fashion to maximise the value it gets from the critic. 104  In TD(A), the algorithm keeps an eligibility  trace e(s) for each state that  records how recently the state was visited, and therefore how much its value is likely to change due to the change in the current state's value function. Given an observation (s,a,r,t),  the eligibility trace can be updated by:  (  X"fe(u) + 1  A7e(u)  if u = s  otherwise.  and the learning rule becomes: V(u) = V{u) + a{r + jV(t)  -  V(s))e(u)  When A = 0, this is the TD(0) rule given above in which only the most recent state is updated. When A = 1 it is equivalent to updating states proportionally to the number of times they were visited in this run (that is, a pure Monte Carlo approach [105]). The learner used in adaptive heuristic critic algorithms typically implements a simple stochastic policy. When the T D error r + yV(t) — V(s) that results from performing action a is positive, the probability of selecting a in the future should be increased, and the probability should be decreased when the T D error is negative. 
One commonly used approach is to use the Gibbs softmax method [105], in which the probability of selecting action a in state s is: e  P(s.a)  £ ^  P r ( a | s ) =  where p(s, a) is the current value of the modifiable policy parameters of the actor. Increasing p(s, a) makes action a more likely to be selected in future.  4.3  Model Learning  The algorithms described above that learn policies without learning models make very inefficient use of the data they collect. This is because they only record information about the values of the states they visit. The effects of an action in a 105  state are summarized by a single number, the Q-value, and other potentially useful information such as the reward received, and counts of transitions are ignored. Algorithms that learn a model and use it to decide how to act are much more data efficient, and offer many advantages in domains where real-world actions are expensive, but computation time is relatively cheap. Since our interest is primarily in learning structured models, we will examine a number of model-learning algorithms here.  Dyna The first technique we examine is Sutton's Dyna architecture [103, 102]. Dyna uses its observations to learn the value of actions in the same way that Q-learning does, but at the same time it learns a model of the world (typically in the form of transition and reward functions) from the observations. Dyna uses this model to increase the rate of learning by performing a number of hypothetical actions for every real-world action that the agent performs. The hypothetical actions are simulated using the current estimated model to predict the outcome, and the Q-values are updated based on both the real-world observations, and on the results of the hypothetical actions. Figure 4.4 is a high-level description of the Dyna algorithm. The algorithm requires the following three additional components: • A model of the world and a learning algorithm that updates it when a new observation is made. • A model-free reinforcement learning algorithm (for example, Q-learning). • A procedure for selecting hypothetical states and actions to perform. The original description of Dyna [103] makes little mention of the world model to be used. For the experiments in [102], all actions are deterministic, so a very simple model that maps state-action pairs to their resulting states is used. In more recent work [64] these ideas have been expanded to learn the effects of probabilistic 106  Perform the following two procedures in parallel: • Repeat 1. Observe the state s of the world and select an action a to perform. 2. Observe the resulting reward r and the new state t. 3. Apply reinforcement learning to the tuple  (s,a,r,t).  4. Update the world model based on the tuple  (s,a,r,t).  • Repeat (typically some number of times for every iteration of the previous procedure). 1. Select a state s' and action a' according to some strategy. 2. Use the world model to predict the reward r' and new state t' that would result if a' were to be performed in state s'. 3. Apply reinforcement learning to the tuple (s', a', r', t'). Figure 4.4: The Dyna algorithm. actions. In this case, the world model is typically stored in two tables,  R(s,a),  which stores the current estimate of the reward received by performing action a in state s, and Pr(s,a,t),  which stores the current estimate of the probability of  state t resulting when a is performed in s. 
The data in each of these tables is simple observed frequencies—the probabilities and rewards are found by computing expected values based on all the cases seen—and is updated as follows: •^r, Pr(s,a,t)=  n(s,a,t) y '/  ,  4.2  n(s,a) R{s, a)  =—  V  n{s,a)  r  {  ^  where n(s, a, t) is the number of times state t has resulted when action a was performed in state  s,  and n(s,a)  =  Ylt ( > n  s  i ^ s  n e  number of times action a has  been performed in state s. For the reinforcement learning steps of Dyna, any algorithm is acceptable. The Dyna-Q algorithm, which uses Q-learning, is a common choice, in which case the reinforcement learning step for a tuple (s, a, r, t) updates the Q-value for s and a 107  using Equation 4.1 as before. In the case of the hypothetical actions, we use R(s, a) and Pr(s, a, t) rather than the value according to a particular outcome, so Equation 4.1 becomes: Q(s, a)  = (1 -  Pr(t | s, o)  a)Q(s, a) + a(R(s, a) + 7  max{C}(£, a')}a)  tes  Several methods for choosing hypothetical states and actions have been suggested. Kaelbling [64] suggests selecting states and actions at random, which at least guarantees that all states and actions will be investigated . Other approaches 1  such as those mentioned in Section 4.4 below could also be used, but given that the actions are hypothetical and will not affect the actual reward received by the agent, it seems sensible to encourage evaluation of less promising actions that may lead to undesirable performance if executed in the real world. Prioritised Sweeping Prioritised sweeping [79] is a variant of Dyna that tries to concentrate its hypothetical actions in "interesting" areas of the state space. The algorithm is shown in Figure 4.5. It maintains a priority queue of states whose values need updating, where the priorities are estimates of how much that state's value will change as a result of other changes in the value function or model. After each real-world action and corresponding update of the model using Equation 4.2, a number of value-propagation steps are performed. There may be a constant number of these steps per real-world action, or'if real-world actions must be performed at fixed time intervals, as many value-propagation steps may be done as are possible in the available time. Each value-propagation step consists of popping the highest priority state off the queue and updating its Q-value by performing a Bellman backup (Equation 2.2) in that state. When a state t is updated, the change in its value, A is recorded, t  Since we cannot perform hypothetical actions without some real-world experience, most selection schemes will only select state-action pairs that have already been performed in the real world. 1  108  1. Let the current state be s. 2. Select an action a to perform in the real world, and observe its outcome r and t. 3. Update the model to reflect the new observation. 4. Promote state 5 to the top of the priority queue. 5. While there is computation time remaining do (a) Pop the top state s' from the priority queue. (b) For each action, perform a Bellman backup to recompute its Q-value, set V(s') to be the maximum Q-value, and set A < to be the magnitude of the change in V(s'). s  (c) For each predecessors s" of s', push s" onto the priority queue with its priority set to the maximum of its old priority (if it was on the queue already) and m a x Pr(s", a, a  s')A i. s  Figure 4.5: The prioritised sweeping algorithm. 
and for every state s such that for some action a, Pr(t | s, a) > 0, s is added to the priority queue with a priority that is the maximum of its previous priority and Pr(t | s,a)A . t  This means that when something "surprising" happens (the value  of t changes by a relatively large amount), as well as the value of t being updated, states that lead to t (and therefore may also change in value) are given high priority and will eventually have their values updated. When the estimate of the value of t is close to the value that results, little effort is made to update the value estimates for that region of the state space. Prioritised sweeping has also been generalised to work with structured representations of problems [3]. We describe this GenPS algorithm in our discussion of structured learning algorithms in Section 4.5.4. Adaptive Real-Time Dynamic Programming Adaptive real-time dynamic programming [5] is a learning version of the real-time dynamic programming algorithm (see Section 2.6.2). As with R T D P , it works with undiscounted models with fixed start and goal states. Under these conditions it  109  converges to an accurate world model and policy for relevant states, while avoiding the computational effort needed to learn a model of actions for some or all irrelevant states. When it receives an input tuple (s, a, r,t), adaptive R T D P updates the model using Equation 4.2 and then uses the updated model to perform the R T D P algorithm (see Section 2.6.2). Finally, it selects an action to perform in the new state t. Adaptive R T D P converges much faster on domains that meet its (rather strict) requirements than Q-learning does. However, the need for goal states is rather restrictive. Although adaptive R T D P does still converge if discounted models with more general reward models are used, the computational gains from concentrating on relevant states seem to be greatly reduced if goal states are not absorbing or if there is not a relatively small set of start states.  4.4  Exploration  Deciding what action to perform is a major issue for R L systems. Since we rarely allow an R L algorithm a "free" learning phase in which its actions cost nothing, we would like it to perform well at the same time that it learns.  To do this a  learner must balance exploitation of the knowledge it already has with exploration to discover new knowledge that might lead to better performance in the future. Clearly the decision of when to explore and when to exploit should be based on how confident we are in our estimates of the values of the actions we are choosing between—when we know little about the value of the actions we should be more exploratory, but once we have tried the actions numerous times we should simply exploit the information we have since our value estimates should now be good. Many ways of dealing with this explore-exploit tradeoff have been proposed [110, 41]. The simplest are ad hoc, taking into account only the differences in value between actions (preferring actions with high value). The simplest of these, known as uniform random exploration involves selecting the action with the highest Q-value with some probability 1 — p, and with probability p selecting an action at random. 110  Typically p is decayed over time so that exploration is initially more strongly encouraged, but the probability of performing an exploratory action decreases over time. Uniform random exploration performs poorly when there are two actions with similar values, one of which is the best action. 
In this case one of the actions has a high probability of being performed while the other is relatively unlikely to be performed even though its value is similar. Boltzmann  exploration  is a commonly-  used technique that avoids this problem. In Boltzmann exploration, the probability of performing action a in state s is given by the Boltzmann distribution [113]: Q{',o)IT  e  P  r  [  a  )  =  Za'tA.  t  '  Q{S  a)/T  where T is a temperature parameter which governs the degree of randomness or exploration, and which should be slowly decreased over time. The Boltzmann distribution is designed to allow any action to be selected, but to choose actions with higher estimated values more frequently. In practice it tends to work better than uniform random exploration [113]. It's only weakness is that in many cases it tends to prefer actions that have already been explored over actions that haven't been tried. This is due to the fact that Q-values are often initially set low, so that when positive rewards are observed Q-values increase making explored actions look preferable to those that haven't yet been tried. This weakness is characteristic of many of the value-based exploration algorithms that have been proposed. The convergence proofs for algorithms such as Q-learning rely on the fact that on an infinite run of the algorithm, every action is performed infinitely often (see Section 4.2). Both uniform random and Boltzmann exploration meet this criterion (assuming that the learning rate is decayed appropriately), so algorithms which use these strategies are guaranteed to converge (eventually) to the optimal policy  2  [70]. Many of the other exploration methods we will discuss below do not have this Although due to this requirement these algorithms never actually follow the optimal policy they discover! 2  Ill  property and hence convergence cannot be guaranteed.  4.4.1  Directed Approaches  The ad hoc techniques described above are often referred to as being undirected in the sense that they make action selection decisions without basing them on any exploration-specific information. In directed approaches, action selection decisions are based on both the values of the possible actions and some exploration-specific information. A wide variety of exploration measures have been proposed, we describe them below along with the algorithms that use them. The simplest directed exploration techniques are counter- [6] or recencybased [102]. In counter-based techniques, exploration is performed by counting the number of times each action has been performed. If there are actions whose count is less than some threshold, one of them is performed, otherwise the best action is performed. Even without the advantages of reasoning about the learned model (i.e., when counter-based exploration is applied in a model-free learning algorithm), counter-based exploration generally outperforms any of the model-free exploration techniques we have described [6, 110], although convergence is no longer guaranteed. One common counter-based approach, first introduced as part of the Prioritised Sweeping algorithm [79], is optimism in the face of uncertainty. The idea here is that, in the absence of other evidence, any action in any state is assumed to lead us directly to a fictional absorbing state with large reward  r°P*. The amount  of evidence to the contrary required to overcome this optimism is set by a parameter TjjQj.gj. 
If the number of observations of the action in the state is less than •^bored'  w  e  3 5 8 1 1 1 1 1 6  the transition leads to the fictional state with long-term reward  r ° P V ( l - 7 ) . If we have observed the transition more than T j j  0 r e (  j times, then we  use the true future value of the state. This approach is global in that the value of the fictional state is propagated backwards through the Bellman equation, but in many instances may be too optimistic—the value of T k 112  o r e (  j required to sufficiently  For each action and state, keep a sliding window of past observations of the Q-value of the action. Given an observation (s,a,r,t): 1. Make a new observation r + yV(t) of Q(s, a), drop the least recent observation from the sliding window and replace it with this one. 2. Select an action to perform in state t by for each action a: (a) Based on the sliding window for Q(t, a), build a probability distribution over the value of Q(t,a). This might be a normal distribution, or other non-parametric distribution such as kernel estimation (see Section 5.3.5). (b) Compute ubg(Q(t, a)), the (1 — 6) confidence interval for the probability distribution for Q(t,a). 3. perform action a* = argmax nbs(Q(t, a)). a  Figure 4.6: The interval estimation algorithm. explore the state space to find the optimal policy will also result in a great deal of exploration of irrelevant states. Recency-based exploration [102] is similar to counter-based, but is designed more for use when the environment is slowly changing over time. It is designed to encourage performance of actions that have rarely been attempted or that have not been attempted recently. The number of time steps n  S;a  since action a was actually  performed in state s is stored for each state-action pair. A n exploration t^/n-^a, where e is a small constant, is then added to Q(s,a)  bonus of  when deciding which  action to perform. The exploration methods we have described so far can be used with modelfree algorithms such as Q-learning. When a model is being explicitly learned, such as in Dyna or prioritised sweeping, exploration is often based on the model rather than on Q-values.  113  Interval Estimation and I E Q L + Kaelbling's interval estimation  algorithm [63] (see Figure 4.6) was the first attempt  to select actions based on an explicit measure of confidence in the accuracy of Devalues. Interval estimation chooses an action based on the estimated value of each action, but also includes an exploration bonus to encourage the use of poorly-known actions. Each time an action is selected its value is updated using the Bellman equation, just as in standard Q-learning; but in interval estimation the last n computed values are stored. To select an action, we compute for each action the upper bound of the 1—6 confidence interval (typically the 95% or 99% confidence interval). The action with the highest upper bound is then performed. In this algorithm, the difference between the upper bound and the mean of the confidence interval is the exploration bonus. Interval estimation is quite effective in many domains, learning good policies considerably faster than the ad hoc approaches. Where it fails to work it is typically because it doesn't manage to explore the entire state space before the Q-values for the states it visits frequently stabilise and the system stops exploring. 
This is because interval estimation is only doing local exploration—the exploration bonus is based only on the variability of the observed Q-values in the current state. In contrast with local exploration methods like interval estimation, Meuleau and Bourgine [74] have proposed a series of algorithms that back up exploration bonuses at the same time as they do Bellman backups of the Q-values. The simplest of their algorithms, I E Q L + , is shown in Figure 4.7. In domains where the optimal policy is particularly difficult to find, these algorithms often considerably outperform interval estimation. By propagating the exploration bonus back to previous states these algorithms can choose actions that will lead to poorly explored regions of the state space several steps in the future, rather than just a single step.  114  For state s and action a, we keep n , the number of times a has been performed in s, n , the number of times performing a in s has resulted in state t, and iV", the value estimate for s and a. We also keep a confidence coefficient 9, and c r , the maximum possible range of values that the problem could have. Finally we have a learning rate a(n) that depends on the number of observations of the particular parameter being learned. A t each decision epoch, the following procedure is performed: a  s  a  st  max  1. Let the current state be s. Choose an action a by: If there exists an action a' that has never been tried, let a = a'; otherwise let a = argmax < A " . r  a  2. execute a and observe the reward r and new state t. 3. Increment n  a  and n . a  st  s  4. Calculate the exploration bonus:  where zg/ is the value of the 1 — 9 confidence interval for a standard normal distribution. 2  5. Update  A7  a s  by:  NZ = NZ + a(n ) ( a  s  5 ( l - ) ) + max(/V a  r +  £  s  7  7  a t  - Nf)  a'€.A  6. Go to 1. Figure 4.7: Meuleau and Bourgine's I E Q L + algorithm.  115  Explicit Exploit or Explore In a recent work, Kearns and Singh [65] have produced a polynomial (in the size of the state space and the horizon time) reinforcement learning algorithm. Their explicit exploit or explore (E ) algorithm works by dividing the state space into a 3  set of known and a set of unknown states. The method used to select an action depends on which set the current state is in. If the agent is in an unknown state, it performs balanced wandering, choosing the action it has tried the fewest times in this state. If the agent is in a known state (one that has been visited sufficiently often through balanced wandering), it evaluates the current exploiting policy and if it is better than some threshold—which estimates the value of an optimal policy from that state—it performs the best action, otherwise it follows a policy designed to move it to an unknown state as quickly as possible. Kearns and Singh show that this behaviour produces a near-optimal policy in time polynomial in the size of the state space and the horizon time 1/(1 — 7). However, their algorithm is quite cautious, in that it takes a long time to decide that a state is well enough known to do more than explore randomly. Many of the other exploration strategies we have discussed here may be more efficient than E in practice despite the fact that 3  they do not have its guarantee of polynomial performance.  Hence E is more of 3  theoretical interest than of use as a practical exploration algorithm.  
4.5  Generalisation and Structured Learning  A l l the algorithms we have presented so far depend critically on the ability to explicitly enumerate the entire state space. There are a number of reasons why this might not be desirable. The most obvious is that the algorithms cannot be applied (in their usual form) to problems with continuous state spaces—even very large discrete state spaces present problems due to the size of the data structures involved. More interestingly from our point of view is that for large state spaces their rate  116  1. Initially, let the tree G be a single cluster containing all states. 2. For every input tuple (s,a,r,t)  received:  (a) Update Q(s,a) for the cluster in G that contains s. (b) Update the statistics for that cluster to include the new data point. (c) For each statistic collected, and for each feature of the state space that has not already been split, check to see if the feature is significant enough to split on. If so, split the tree, adding a leaf for each value of the feature, and discard all statistics and Q-values for this cluster. Figure 4.8: The G-learning algorithm. of learning is very slow simply because of the time required to observe every action being performed in every state. To overcome this difficulty we need an algorithm that can generalise—learning policies and values for a large number of states based on observations from only some of these states. We again distinguish between model-free and model-learning algorithms. A l gorithms that approximate Q-values or learn a compact representation of Q-values we will refer to as generalisation  techniques.  Algorithms that learn structured repre-  sentations for actions and reward functions—that is, structured models—and then use the learned model to decide how to act, we call structured  4.5.1  learning  techniques.  G-Learning  One of the earliest attempts to perform generalisation in reinforcement learning problems was G-learning [22]. The algorithm was developed for a problem with 100 bits of input, making it far too large for conventional Q-learning. The G-learning approach is to construct a decision-tree in which the interior nodes are bits from the input, and the leaf nodes are learned estimates of the value of all states that agree on the values of the bits above them in the tree. These trees are very similar to the structures we presented in Chapter 3. The main difference is that rather than storing a different tree for every action, G-learning uses a single tree, and stores Q-values for every action at the leaf.  117  The G-learning algorithm is presented in Figure 4.8. The algorithm begins with a single cluster that contains all the states. It collects Q-values and statistics on this cluster, and tests the statistics to see if the cluster should be split. When the test indicates that a feature is significant, the cluster is split into a set of clusters, one for each value of the feature.  A t the time of the split, both the Q-values  and statistics must be discarded, and new values learned. This loss of information could be avoided if the agent records all the experiences it has seen so far, but this considerably increases the space requirements of the algorithm. To decide when to split a cluster, the G-learning algorithm uses Student's t test, which determines whether two sets of data could have arisen from a single distribution. For each cluster, the algorithm keeps statistics on the reward for that action in that state, as well as the Q-value of that action. 
These statistics are kept for every feature that has not already been used to distinguish that cluster. The t test requires that the data is normally distributed, which is frequently not the case, but Chapman and Kaelbling suggest preventing splitting until sufficient data has been collected for it to approximate a normal distribution. G-learning will converge far more slowly than Q-learning on problems with a sufficiently small state space for both to be applied. This is because of the time that G-learning takes to learn to split the state space (and the statistics that it discards in the process). G-learning's main advantage is that it can be used on much larger problems than Q-learning can handle. It performs best under many of the same conditions as the SPI algorithm (see Section 3.1), in particular when Q-values are independent or conditionally independent of many variables.  4.5.2  Function Approximators  There has been a lot of interest in methods for using function approximators to perform generalisation [20, 68, 9]. Two different approaches are typically taken. In the first, the value function is represented using, for example, a backpropagation  118  network. The neural network is then updated using Q-learning or value iteration. This approach has been used very successfully for learning to play Backgammon [109] among other applications. Other approximations, such as the use of linear function approximators [111] have also been proposed. More recently, approaches that approximate the value function based on states of known (or estimated) value close to the current state, such as locally weighted regression [69, 20], local linear regression [107], or ^-nearest neighbour algorithms [76] have also been used. While function approximation has been used in practice on certain types of problem, theoretical results about its performance are less promising. Boyan and Moore [20] have shown that function approximators frequently do not converge to good policies even on relatively simple problems due to interactions between the approximator and the learning rule. Some more recent work by Sutton [104] and Gordon [47] has demonstrated that convergence, although not necessarily convergence to optimal values, can be guaranteed by choosing an appropriate approximator, and in fact these approaches are in widespread use for dealing with continuous state spaces. Problems with very large but discrete state spaces are also sometimes converted to continuous problems to allow the use of function approximators for generalisation. However, current function approximation techniques do not lead to representations of value functions that can easily be reasoned about, and for this reason do not seem suited to the sorts of problems we are interested in.  4.5.3  Variable Resolution D y n a m i c Programming  Moore [77, 78] describes two approaches to generalisation that are designed to allow traditional dynamic programming to be used in domains with real-valued features where discretisation would result in extremely large state spaces. The V R D P and PartiGame algorithms rely on kd-trees to partition the state space into progressively finer regions, depending on the importance of the region in question. The PartiGame algorithm [78] in particular learns very quickly. It includes a local controller that  119  allows it to move quickly from one region to another, causing it to examine quite different trajectories through the state space before examining paths with only small local changes. 
PartiGameis unfortunately only applicable in deterministic problems.  4.5.4  Learning with Structure  The generalisation techniques we have described operate directly on the value function. For model-learning algorithms, we not only want a generalising representation of Q-values or value functions, but also a generalising model of the reward function and the transition probabilities from states. A n example of such a model is the 2 T B N representation introduced in Section 2.5.1. With this kind of model, we can take a set of observations made in a number of similar states and generalise transition probabilities and a reward function for all of the states from them. This allows considerable improvements in learning speed as small numbers of observations can be leveraged for reasoning.  The H-learning Algorithm Tadepalli and Ok have proposed the H-learning algorithm [107] which does this type of generalisation for average-reward reinforcement learning. Their approach combines local linear regression to approximate the value function with the use of 2TBNs.  In H-learning the network structure of the 2TBNs is is assumed to be  given, so the only requirement is to learn the probabilities that make up the entries in the conditional probability tables. These are computed via maximum likelihood estimation. The learned transition probabilities and reward function are then used to update the value function using an average-reward version of the R T D P algorithm (see Section 2.6.2). Tadepalli and Ok demonstrate impressive empirical results for the H-learning algorithm, as well as a version of A R T D P that uses 2TBNs and local linear regression in the infinite-horizon discounted case. However, there are a number of advantages of  120  using structured representations that they do not exploit. In particular, they do not discuss exploration strategies at all in their work, and they use A R T D P to update the value function when the model changes, effectively doing a small amount of local search from the current state to decide which states' value functions to update. Generalised Prioritised Sweeping Generalised prioritised sweeping [3] tackles this second problem, namely which devalues to update when there is a change in the model. Recall that prioritised sweeping [79] (see Section 4.3) keeps a priority queue of states to update based on an estimate of how much their Q-values might change and updates the highest priority states between performing real actions. Generalised prioritised sweeping formalises the priority calculations, but more importantly, extends the algorithm to operate with structured representations of actions. A s before, it keeps a priority queue of states to update although now the sum of the priorities is kept when a state on the queue is again selected for update, rather than the maximum of the new and previous priorities. The priorities are changed in two cases; when the model changes at some state, and when the value of a state is updated. In each case, the idea is to update (by setting their priorities high) the states whose value function will change the most. This can be approximated by max \VQ(s, a) • Ag\, the maximum over a  all actions of the change in the Bellman error at state s (the difference between the current value of s and the one-step lookahead value) as a function of the change in Q(s,a).  For state-based representations, the priority update rules are very similar to those of prioritised sweeping. 
If the model changes as a result of an observation (s, a, r, t), then:

    ∇Q(s, a) · Δ_θ = Δ_{R_{s,a}} + γ · ( V(t) − Σ_{t'} Pr(s, a, t') V(t') ) / ( Σ_{t'} N_{s,a,t'} )

where Δ_{R_{s,a}} is the change in the value of the reward at s, and N_{s,a,t'} is the number of times the transition from s to t' with action a has been observed (plus any prior information about this transition), and N_{s,a,t'} and Pr(s, a, t') are calculated using the updated model. In this case s is the only state whose priority changes. If the value of some state t changes as a result of a value update, then:

    ∇Q(s, a) · Δ_θ = γ Pr(s, a, t) Δ_{V(t)}

For a structured representation using 2TBNs, the computation of the priorities after updating the value of a state is the same since Q-values are still stored individually for each state. However, when a real-world action is observed and the model changes, a much larger number of states may need to have their values updated. In the 2TBN case, each state is an assignment of values to some set of variables Y_1, ..., Y_n. When we observe a transition from s to t due to action a, the probabilities in the CPT for the value of each variable in s will change, so any other state which uses the same entry in the CPT will need its value updated. Let Pa_i be the parents in the Bayesian network of variable Y_i. A state s' that assigns the same values to Pa_i as s does under action a will need its priority changed. Unfortunately, the magnitude of the change cannot be computed efficiently (i.e., it is as complex as solving the Bellman equation directly), so a heuristic is used that estimates the importance of each variable to the value function. State s' is added to the priority queue with the sum over all the variables for which s' agrees with s of the heuristic as its priority.

4.6 Structure and Exploration

The H-learning algorithm and generalised prioritised sweeping show the effectiveness of using structured representations to speed reinforcement learning. However, there are a number of areas they have not examined in which structure has an important role to play.

The first of these is the problem of learning the structure. Both H-learning and generalised prioritised sweeping assume that the structure is given in advance, and only the probabilities must be learned. While this is a reasonable assumption in many situations—expert knowledge of the structure of a problem is often available even when the actual probabilities are unknown, or can only be guessed at—one cannot assume that this will always be the case. There are many interesting problems in applying techniques from the field of Bayesian network learning to learning the action effects directly from the observed effects of actions.

A potentially more interesting area which has yet to be examined is the effects of structured representations on the method used to select actions. We have said above that we would like any exploration algorithm to be directed and global, but this is even more important in the case of structure. If we are choosing between an action that will give us information about just the current state and an action that gives us (partial) information about a large number of states, we should certainly include this information in our decision.

In the following chapters, we make some progress in both of these areas.
Chapter 5 examines exploration in explicit state-based problems in much greater detail, describing Bayesian approaches to exploration in both model-free and model-based reinforcement learning algorithms that are both directed and global, and that attempt to use the available observations much more efficiently than previous approaches. The problem of how to exploit structured representations in exploration is examined in Chapter 6, along with learning the Bayesian network structure of the actions during learning.

Chapter 5

Bayesian Approaches to Exploration in Reinforcement Learning

In this chapter we describe a new Bayesian approach to the exploration-exploitation tradeoff. We propose a new measure of the value of exploration which we call myopic value of perfect information (VPI) that can be used in both model-free and model-based reinforcement learning algorithms.

The motivation for this approach comes from two sources. The first is to do as much as possible with the available observational data. Undirected exploration strategies such as Boltzmann exploration base their action selections on a single Q-value for each action (perhaps the mean of the observed values). Directed approaches such as interval estimation and IEQL+ extend this by basing exploration on the variance, in the form of a confidence interval, of the set of observed values for an action. In the Bayesian approach we explicitly represent our uncertainty about Q-values as a probability distribution. This representation of uncertainty about the values of actions allows us to estimate the value (in terms of its effects on future performance) of doing an exploratory action rather than the current best action.

Figure 5.1: A simple maze with prior information about the values of states. In (a) the priors are point values, while in (b) they are probability distributions. After a small amount of learning, the priors in (a) have almost disappeared, while those in (b) are still available to guide the agent.
Under these conditions, observations made when the value function is relatively inaccurate do have an effect on the distribution (so that for example, incorrect priors can eventually be corrected if enough observations can be made) but rather than completely obliterating the prior information, they spread out the probability weight, making the variance of the distribution greater. The information in the priors remains intact to guide the agent's exploration. We can think of the exploration problem as a partially observable M D P . We have an agent exploring some M D P , which we'll call the underlying MDP, and call the agent's current knowledge about the underlying M D P , for instance its estimates of Q-values and partially learned model, the agent's belief state. The idea is that the belief state of the agent is the state, and the actions performed provide new data that moves the agent from one belief state to another. More precisely, a state in this exploration POMDP consists of the agent's state in the underlying M D P , together with the agent's current belief state. The reward function and actions in  126  the exploration P O M D P are the same as in the underlying M D P , but the transition function includes both the change in the underlying state as a result of the action, and also the change in the model due to the observation of another underlying transition. The form of the agent's belief state depends on the type of learning agent we are talking about.  For a model-based learning agent, the belief state  consists of the agent's current model of the underlying M D P . For a model-free agent the belief state is just its current set of Q-values for the states and actions. If we could solve the exploration P O M D P , the resulting optimal policy would be an optimal exploration strategy for the underlying M D P . Unfortunately, we cannot expect to solve the P O M D P exactly. The state space is infinite in many dimensions since it consists of the space of all possible sets of observations, and the transition function is also extremely complex. In fact, we wouldn't want to compute a general policy at all, since such a policy would be optimal for any underlying M D P with the same number of states and actions, which is far to general-purpose for our requirements.  For a model-based agent, because we never forget any information,  the exploration P O M D P has the property that we never return to a previously visited state, so computing a policy for a state containing information inconsistent with what we already know is obviously of no help to us in deciding how to act. As a result of this, we would like to locally solve the P O M D P , giving us an exploration policy for the set of likely underlying M D P s . The ideas presented in this chapter are an attempt to approximate a local solution to this exploration P O M D P . We model the set of likely underlying M D P s using probability distributions over Q-values or M D P s , and the value of information measure we introduce below is an approximation to the local exploration policy. In the case of bandit problems, the exploration P O M D P is much simpler since the problem consists of a single state. For this simpler problem, the optimal exploration strategy can be computed exactly under certain assumptions [8]. In the next section we discuss the use of probability distributions to represent  127  uncertainty about Q-values, and describe how to select actions to perform using VP1. 
Section 5.2 introduces the Bayesian  Q-learning  algorithm, which is a model-free  learning algorithm based on these ideas, and Section 5.3 introduces the model-based version of the algorithm.  5.1 5.1.1  Q-value Distributions and the Value of Information Distributions over Q-Values  A reinforcement learning agent needs to perform exploration because it is uncertain about the true Q-values of actions—indeed, this is the reason for doing learning in the first place. A n agent that knows and is confident about the Q-values of all the actions has no reason to explore since it doesn't expect there to be a better policy to discover. It seems reasonable to propose that decisions about when to explore and when to exploit should be made based on how uncertain the agent is about the Q-values of the actions in question, and therefore that uncertainty should be represented explicitly in a learning system. We represent the uncertainty we have about the Q-value of each action using a Q-distribution,  a probability distribution over the possible values that the Q-  value might take. We begin by giving the learning agent prior distributions over Q-values for each state and action—either default priors representing an absence of knowledge, or priors based on our previous knowledge of the problem—and then update these priors based on the agent's experiences. Formally, let R  S)(l  be a random  variable that denotes the total discounted reward received when action a is executed in state s and an optimal policy is followed thereafter. In particular, we note that Q*(s,a)  = E[H a]. St  The agent is initially uncertain about how R  s a  is distributed,  and its objective is to learn i?[R ]. s>a  Figure 5.2 shows the expected behaviour of a distribution over some i?[R ] S|0  as learning occurs.  Initially, the distribution is quite flat, indicating that there  128  Figure 5.2: The behaviour of a Q-distribution over time. is little or no information about i?[R ] available. S)0  As time goes on, the agent  observes a sequence of values from E'pR.^o], each of which changes its beliefs. As more observations are made, the distribution should get progressively more peaked as the agent becomes less uncertain about the. actual value of i?[R ]. S!a  5.1.2  The Myopic Value of Perfect Information  When the agent selects an action to perform, we would like it to strike a balance between present performance (the action that appears to be best now) and future performance (an action that might turn out to be better in the future). This is the essence of the exploration-exploitation tradeoff. In the Bayesian approach proposed here, the agent keeps a Q-distribution for each state-action pair s, a which represents its uncertainty about the Q-value Q(s,a). This information is used to measure the value of exploration and hence to balance present and future performance. For example, Figure 5.3 shows Q-distributions for three actions available in  129  Figure 5.3: Q-distributions for the three possible actions in some state some state. Which action should be executed? Obviously action three is a poor choice. Its expected value is a lot less than action one, so the cost of performing it rather than the better action one is quite high. Also, the variance of action three is relatively small which indicates that it is relatively well explored. This means that the likelihood of it actually being the best action is very small, so a future gain in performance due to information learned from doing action three is very unlikely. 
On the other hand, action two is certainly a candidate for being executed. Its expected value is only slightly less than that of action one, so the expected cost right now of doing two instead of one is small. Action two's Q-distribution also has relatively high variance, which indicates that its Q-value is relatively uncertain, and there is a significant probability that it could be better than action one, in which case the expected future performance gains because of the information gained by doing action two may outweigh the cost of doing what is now believed to be a suboptimal action. Our new exploration measure, myopic value of perfect information (VPI) [55, 90], measures the value of an action based on these kinds of considerations.

To select an action to perform, the VPI is computed for every action possible in the current state. However, we must also take into account the cost now of performing action a rather than the action a_1 that has the highest expected value. Thus the action a for which the VPI minus the difference between the expected value of a and the expected value of a_1 is greatest is performed. By doing this, the gain in the expected value of the action due to possible future observations is balanced against the cost of performing the action now rather than performing the action that currently appears to be best (that is, the action with the highest expected value).

Computing the VPI for an action can be thought of as asking the question "what would I be willing to pay in order to learn the true value v* of this action." If I am willing to pay a lot for the information then the information may be worth discovering through exploration. If the information is only worth a little, then I would probably rather not explore and I should perform a different action.

The VPI of an action is computed as follows. Let us imagine that we have an oracle available that will tell us the true expected value q*_a of an action a. We wish to compute the worth of this information. We first note that knowing this value is worth nothing if it does not change the current policy. Let a_1 be the action we currently believe to be best (i.e., the action with the highest expected Q-value given the present state of knowledge). Let a_2 be the action we currently believe is the second best. If we learn that the true expected value of a_1 is higher than the expected value of a_2, then all we've done is confirmed that the current policy (do action a_1) is correct. Since the policy is unchanged, the expected future discounted reward is unchanged so the information has no value. Note that the expectation of the future discounted reward may have changed as a result of learning a different value for a_1, but the actual expected discounted future reward is unchanged because the policy is unchanged. Similarly if we learn the true value of some other action a, but its value is still smaller than the expected value of a_1, then as before, the policy will remain the same and the information has no value.

We are left with two cases in which learning the true value of an action has some worth.
These are when the oracle tells us that an action currently believed to be sub-optimal has a higher Q-value than the expected value of a_1, and when the oracle tells us that the true value of a_1 is less than the expected value of a_2. In both cases, the worth of the information gain is the difference between the value of the current policy in which a_1 is performed in this state and the policy in which a_1 is replaced with the action with the new highest Q-value. If the oracle tells us that some previously sub-optimal action is in fact optimal, then this worth is just the difference between the expected value of a_1 and the new value obtained from the oracle. In the case where the previous best action is now sub-optimal, the worth is the difference between the expected value of a_2 and the oracle's value for a_1.

To formalize this discussion, let the current state be s, the action whose VPI is being computed be a, and let q_{s,a} be a possible value of Q*(s, a) = E[R_{s,a}]. Let the current best action be a_1 and the second best action be a_2. Then the myopic value of perfect information for learning that a has expected value q*_a is:

    VPI_{s,a}(q*_a) =
        E[q_{s,a_2}] − q*_a     if a = a_1 and q*_a < E[q_{s,a_2}]
        q*_a − E[q_{s,a_1}]     if a ≠ a_1 and q*_a > E[q_{s,a_1}]
        0                       otherwise

Of course we don't actually have an oracle we can consult to determine q*_a. Instead, what we have is a probability distribution (the Q-distribution for a and s) over the value of q*_a. So to compute the worth of learning the true value of a, we need to take into account not only the gain from learning that a has a particular value q*_a, but also the probability that it has this value given our current Q-distribution over Q*(s, a). By taking the integral of the product of these two quantities we can compute the expected value of perfect information (EVPI) about q_{s,a} as follows:

    EVPI(s, a) = ∫_{−∞}^{∞} VPI_{s,a}(x) Pr(q_{s,a} = x) dx        (5.1)

The expected value of perfect information gives an upper bound on the myopic value of information for exploring action a. It is an upper bound because in practice we don't gain the information that E[q_{s,a}] = q*_a by performing a single action—it takes many repetitions of the action before the actual Q-value can be learned. The expected cost incurred for this exploration is given by the difference between the value of a and the value of the current best action, i.e., max_{a'} E[q_{s,a'}] − E[q_{s,a}]. This suggests we choose the action a that maximizes

    EVPI(s, a) − (max_{a'} E[q_{s,a'}] − E[q_{s,a}]).

Clearly, this strategy is equivalent to choosing the action that maximizes:

    E[q_{s,a}] + EVPI(s, a).

We see that the expected value of perfect information estimate is used as a way of boosting the desirability of different actions. When the agent is confident of the estimated Q-values, the EVPI of each action is close to 0, and the agent will always choose the action with the highest expected value.

It is clear that the value of perfect information is an optimistic assessment of the value of performing a; by performing a once, we do not get perfect information about it, but only one more training instance. Thus, we might consider weighting the EVPI estimate by some constant. However, since we are only using VPI for the purpose of comparison, and since we would prefer to minimize the number of parameters our algorithms have, we will not consider this here.
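As a concrete illustration of this selection rule, the following Python sketch estimates EVPI(s, a) by Monte Carlo integration over samples drawn from each action's Q-distribution and then picks the action maximizing E[q_{s,a}] + EVPI(s, a). It is a minimal sketch only: the expected_q and sample_q structures standing in for the agent's Q-distributions are hypothetical placeholders, not part of the algorithm described here.

import random

def evpi(action, expected_q, sample_q, n_samples=1000):
    """Monte Carlo estimate of the expected value of perfect information
    for `action`, given dictionaries of expected Q-values and of callables
    that draw from each action's Q-distribution (hypothetical interface)."""
    actions = sorted(expected_q, key=expected_q.get, reverse=True)
    a1, a2 = actions[0], actions[1]          # current best and second-best actions
    total = 0.0
    for _ in range(n_samples):
        q = sample_q[action]()               # a possible true value q*_a
        if action == a1 and q < expected_q[a2]:
            total += expected_q[a2] - q      # best action turns out worse than a2
        elif action != a1 and q > expected_q[a1]:
            total += q - expected_q[a1]      # another action turns out better than a1
    return total / n_samples

def select_action(expected_q, sample_q):
    """Choose the action maximizing E[q_{s,a}] + EVPI(s, a)."""
    return max(expected_q,
               key=lambda a: expected_q[a] + evpi(a, expected_q, sample_q))

# Illustrative Gaussian Q-distributions for three actions (values are made up).
expected_q = {'a1': 5.0, 'a2': 4.8, 'a3': 1.0}
sample_q = {'a1': lambda: random.gauss(5.0, 0.5),
            'a2': lambda: random.gauss(4.8, 2.0),
            'a3': lambda: random.gauss(1.0, 0.3)}
print(select_action(expected_q, sample_q))

With these illustrative numbers, the well-explored but poor action a3 is never worth exploring, while the uncertain action a2 can overtake a1 when its EVPI outweighs the small gap in expected value.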
5.2  The Model-free Algorithm  We now describe in more detail the representations and techniques used to apply the Bayesian approach we described above in the Bayesian  Q-learning  algorithm [30].  For the case of model-free reinforcement learning we will directly update the Q-distributions based on observations of the future values of states. The underlying 133  1. Let the current state be s. 2. Select an action a to perform using the myopic value of information measure. 3. Let the reward received for performing a be r, and the state that the system transitions to be t. 4. Update the Q-distribution R value for action a in state s:  s a  to reflect the newly observed  r + 7 max  K, ' t a  a'eAt  5. Go to step 1. Figure 5.4: The Bayesian Q-learning algorithm. Changes from standard Q-learning are in boldface. R L algorithm we have used is Q-learning (see Section 4.2 and Figure 4.3); our variant of the algorithm is shown in Figure 5.4. There are two changes from the standard Q-learning algorithm. In Step 2, we use the myopic V P I approach to exploration we have described above to select actions, rather than using the ad hoc exploration schemes typically found in Q-learning. In Step 4, rather than using a learning rate parameter to update the Q-value for state s and action a, we will be updating the Q-distribution  for s and a. We will use the new value we have observed—the  discounted probability distribution over the value of state t plus the instantaneous reward ?—to update the current Q-distribution. As we said in Section 5.1.1, we want to learn for each state s and action a the expected value of an unknown random variable H , which denotes the expected s  a  total discounted reward received when a is executed in s and an optimal policy is followed thereafter. To make learning R Assumption 1: R  s a  s a  practical, we first make the following assumption:  has a normal distribution.  The main reason we make this assumption is pragmatic. to model our uncertainty about the distribution of R 134  s a  It implies that  we need only model a  distribution over the mean ^  s > a  and precision r . SA  is the reciprocal of its variance (r =  The precision of a normal variable  and along with the mean is all that is  1/a ), 2  required to describe a normal distribution.  If a random variable has a normal  distribution with mean fi and precision r , its probability density function is: /(x|M,r) =  y^ -H^)  2  e  We use the precision here because it is considerably simpler to represent uncertainty about the precision than about the variance [34]. We again note that the mean ^  S i a  corresponds to Q(s, a). A second argument in favour of this assumption is that since the accumulated reward is the discounted sum of immediate rewards, each of which is a random event, by the central limit theorem, if the discount factor 7 is close to 1 and the underlying M D P is ergodic when the optimal policy is applied, then the distribution over R  S]a  is approximately normal. In practice, because of the max terms in the Bellman equation, the distribution tends to be heavy-tailed. However, empirical evidence from looking at actual distributions (and see Section 5.3.5) also suggests that the assumption is reasonable. Our next assumption is that our prior beliefs about R  s a  are independent of  those about R ' ' s  ia  Assumption 2: The prior distribution over / i distribution over / v , a ' and  s > a  and r  S;a  is independent of the prior  for s / s' or a' / a.  
This assumption is fairly innocuous, in that it restricts only the form of prior knowledge about the system. Note that this assumption does not imply that the posterior distribution satisfies such independencies. (We return to this issue below.)

Next we assume that the prior distributions over the parameters of the distribution over each R_{s,a} are from a particular family:

Assumption 3: The prior p(μ_{s,a}, τ_{s,a}) is a normal-gamma distribution (see below).

5.2.1 The Normal-gamma Distribution

Assumption 1 allows us to represent our uncertainty about R_{s,a} as probability distributions over μ_{s,a} and τ_{s,a}. The normal-gamma distribution is conjugate to the normal distribution—it is a single, joint distribution over both the mean and precision of an unknown, normally distributed random variable and is therefore an excellent candidate for modeling this kind of uncertainty. See [34] for more details of this family of distributions.

A normal-gamma distribution over the mean μ and the precision τ of an unknown normally distributed variable R is determined by a tuple of hyper-parameters ρ = (μ_0, λ, α, β). We say that a distribution p(μ, τ) ~ NG(μ_0, λ, α, β) if:

    p(μ, τ) ∝ τ^{1/2} e^{−(λτ/2)(μ−μ_0)²} · τ^{α−1} e^{−βτ}

The normal-gamma distribution has the property that the conditional distribution of μ given τ is a normal distribution with mean μ_0 and precision λτ, and the marginal distribution of τ is a gamma distribution with parameters α and β. Standard results show how to update such a prior distribution when we receive independent samples of values of R:

Theorem 1 [34] Let p(μ, τ) ~ NG(μ_0, λ, α, β) be a prior distribution over the unknown parameters of a normally distributed variable R, and let r_1, ..., r_n be n independent samples of R with M_1 = (1/n) Σ_i r_i and M_2 = (1/n) Σ_i r_i². Then:

    p(μ, τ | r_1, ..., r_n) ~ NG(μ_0', λ', α', β')

where

    μ_0' = (λμ_0 + nM_1) / (λ + n),
    λ'   = λ + n,
    α'   = α + n/2,
    β'   = β + (n/2)(M_2 − M_1²) + nλ(M_1 − μ_0)² / (2(λ + n)).

That is, given a single normal-gamma prior, the posterior after any sequence of independent observations is also a normal-gamma distribution.

The normal-gamma distribution has a number of other useful properties. We will use the following standard results throughout this chapter. Theorem 2 gives equations for the first and second moments of a normal-gamma distribution, while Theorem 3 provides equations for the marginal probability density of the mean of a normal-gamma distribution, and the marginal cumulative probability of the mean.

Theorem 2 [34] Let R be a normally distributed variable with unknown mean μ and unknown precision τ, and let p(μ, τ) ~ NG(μ_0, λ, α, β). Then E[R] = μ_0, and

    E[R²] = μ_0² + (λ + 1)β / (λ(α − 1)).

Theorem 3 [34] If p(μ, τ) ~ NG(μ_0, λ, α, β), then

    p(μ) = (λ/(2π))^{1/2} · (β^α Γ(α + 1/2) / Γ(α)) · (β + (1/2)λ(μ − μ_0)²)^{−(α + 1/2)}        (5.2)

and

    Pr(μ < x) = T((x − μ_0)√(λα/β) : 2α)        (5.3)

where T(x : d) is the cumulative t-distribution with d degrees of freedom:

    T(x : d) = ∫_{−∞}^{x} (Γ((d+1)/2) / (Γ(d/2)√(dπ))) (1 + t²/d)^{−(d+1)/2} dt

and Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt is the gamma function. Moreover, E[μ] = μ_0, and Var[μ] = β / (λ(α − 1)) for α > 1.

Assumption 3 implies that to represent the agent's prior over the distribution of R_{s,a}, we only need to maintain a tuple ρ_{s,a} = (μ_0^{s,a}, λ^{s,a}, α^{s,a}, β^{s,a}) of hyper-parameters. Given Assumptions 2 and 3, we can represent our prior by a collection of hyper-parameters for each state s and action a. Theorem 1 implies that if we have independent samples from the distribution over each R_{s,a}, the same compact representation can be used for the joint posterior.
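To make Theorem 1 concrete, the following Python sketch updates a normal-gamma belief (μ_0, λ, α, β) given a batch of independent samples. It is a minimal sketch under the notation above; the function name and the example numbers are illustrative, not taken from the thesis.

def update_normal_gamma(mu0, lam, alpha, beta, samples):
    """Posterior hyper-parameters after observing independent samples of R,
    following the update equations of Theorem 1."""
    n = len(samples)
    if n == 0:
        return mu0, lam, alpha, beta
    m1 = sum(samples) / n                    # sample mean M_1
    m2 = sum(x * x for x in samples) / n     # sample second moment M_2
    mu0_new = (lam * mu0 + n * m1) / (lam + n)
    lam_new = lam + n
    alpha_new = alpha + n / 2.0
    beta_new = (beta
                + 0.5 * n * (m2 - m1 ** 2)
                + n * lam * (m1 - mu0) ** 2 / (2.0 * (lam + n)))
    return mu0_new, lam_new, alpha_new, beta_new

# A vague prior updated with a few observed returns (illustrative numbers only).
prior = (0.0, 1.0, 1.1, 1.0)
print(update_normal_gamma(*prior, samples=[2.3, 1.9, 2.7]))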
We now assume that the posterior has this form.

Assumption 4: At any stage, the agent's posterior over μ_{s,a} and τ_{s,a} is independent of the posterior over μ_{s',a'} and τ_{s',a'} for s ≠ s' or a' ≠ a.

In an MDP setting, this assumption is likely to be violated; the agent's observations about the reward-to-go at different states and actions can be strongly correlated—in fact, they are related by the Bellman equations. Despite this, in order to make the problem tractable, we shall assume that we can represent the posterior as though the observations were independent, i.e., we use a collection of hyper-parameters ρ_{s,a} for the normal-gamma posterior for the mean and precision parameters of the distribution over each R_{s,a}.

Given these assumptions we can use the properties of the normal-gamma (see [34]) to reduce the computation of EVPI(s, a) in Equation 5.1 to a closed form equation involving the cumulative distribution of μ_{s,a} (which can be computed efficiently), along with the corresponding distributions of the best action μ_{s,a_1} and second-best action μ_{s,a_2}:

    EVPI(s, a) =
        c_{s,a} + (E[μ_{s,a_2}] − E[μ_{s,a_1}]) · Pr(μ_{s,a_1} < E[μ_{s,a_2}])     if a = a_1
        c_{s,a} + (E[μ_{s,a}] − E[μ_{s,a_1}]) · Pr(μ_{s,a} > E[μ_{s,a_1}])         if a ≠ a_1

where

    c_{s,a} = ( √(β_{s,a}) Γ(α_{s,a} + 1/2) ) / ( (α_{s,a} − 1/2) Γ(α_{s,a}) Γ(1/2) √(2λ_{s,a}) ) · (1 + λ_{s,a}(θ − μ_0^{s,a})² / (2β_{s,a}))^{−α_{s,a} + 1/2}

and θ is the threshold against which a is compared (E[μ_{s,a_2}] when a = a_1, and E[μ_{s,a_1}] otherwise).

5.2.2 Updating the Q-distribution

We now turn to the question of how to update the estimate of the distribution over Q-values after executing a transition. The analysis of the updating step is complicated by the fact that a distribution over Q-values is a distribution over expected total rewards, whereas the available observations are instances of actual local rewards. Thus, we cannot use the Bayesian updating results in Theorem 1 directly.

Suppose that the agent is in state s, executes action a, receives reward r, and moves to state t. We would like to know the complete sequence of rewards received from t onwards, but this is not available. Let R_t be a random variable denoting the discounted sum of rewards from t. If we assume that the agent will follow the apparently optimal policy, then R_t is distributed as R_{t,a_t}, where a_t is the action with the highest expected value at t. We might hope to use this distribution to substitute in some way for the unknown future experiences. We now discuss two ways of going about this.

Moment updating

The idea of moment updating is, notionally, to randomly sample values R_t^1, ..., R_t^n from the distribution R_t, and then update R_{s,a} with the sample r + γR_t^1, ..., r + γR_t^n, where we take each sample to have weight 1/n. Theorem 1 implies that we only need the first two moments of this sample to update our distribution. Assuming that n tends to infinity, these two moments are:

    M_1 = E[r + γR_t] = r + γE[R_t]
    M_2 = E[(r + γR_t)²] = E[r² + 2γrR_t + γ²R_t²] = r² + 2γrE[R_t] + γ²E[R_t²]

Since the estimate of the distribution of R_t is a normal-gamma distribution over the mean and variance of R_t, we can use Theorem 2 to compute the first two moments of R_t.
Now we can update the hyper-parameters ρ_{s,a} as though we had seen a collection of examples with total weight 1, mean M_1, and second moment M_2 as follows:

    μ_0' = (λμ_0 + M_1) / (λ + 1),
    λ'   = λ + 1,
    α'   = α + 1/2,
    β'   = β + (1/2)(M_2 − M_1²) + λ(M_1 − μ_0)² / (2(λ + 1)).

This approach results in a simple closed-form equation for updating the hyper-parameters for R_{s,a}. Unfortunately, it quickly becomes too confident of the value of the mean μ_{s,a}. To see this, note that we can roughly interpret the parameter λ as the confidence in our estimate of the unknown mean. The method we have just described updates μ_0 and λ with the mean of the unknown future value, which is just r + γE[R_t], as if we were assured of this being a true sample. Our uncertainty about the value of R_t is represented by the second moment M_2, which is only used to update the β parameter of R_{s,a}, and hence mainly affects the estimate of the variance. Thus, our uncertainty about R_t is not directly translated to uncertainty about the mean of R_{s,a}. Instead, it leads to a higher estimate of the variance of R_{s,a}. The result of all of this is that the precision of the mean increases too fast, leading to low exploration values and hence to premature convergence on sub-optimal strategies.

The problem here is that the normal-gamma distribution is designed to be used when we have observations of an unknown distribution with fixed mean and variance. In this case, what we have instead are samples from the future value R_t, which is a distribution over the mean as well as the precision of the value of state t, so there are two sources of variability in the sampled values: uncertainty about the mean, and the variance itself. Consider two extreme cases. In the first, the mean of R_t is known with certainty and the variance of R_t is high. In the second, we are quite uncertain about the mean, but the variance is close to zero. The moment updating approach cannot distinguish between these two cases because M_1 and M_2 are not sufficient to distinguish these two sources of uncertainty.

Mixture updating

The problem with moment updating described in the preceding section can be avoided by using the distribution over R_t in a slightly different way. Let p(μ_{s,a}, τ_{s,a} | R) be the posterior distribution over μ_{s,a}, τ_{s,a} after observing discounted accumulated future reward R. If we observe the value R_t = x, then the updated distribution over R_{s,a} is p(μ_{s,a}, τ_{s,a} | r + γx). We can then use the same trick we used in the computation of EVPI described above to capture our uncertainty about the value x by weighting these distributions by the probability that R_t = x. This results in the following mixture posterior:

    p^{mix}(μ_{s,a}, τ_{s,a}) = ∫_{−∞}^{∞} p(μ_{s,a}, τ_{s,a} | r + γx) p(R_t = x) dx

Unfortunately, the posterior p^{mix}(μ_{s,a}, τ_{s,a}) does not have a simple representation—it is no longer a member of the normal-gamma family of distributions. Using a more complex representation would lead to more problems when another observation means that the distribution must be updated again since this would result in an even more complex posterior. We can avoid this complexity by approximating p^{mix}(μ_{s,a}, τ_{s,a}) with a normal-gamma distribution after each update.
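Before turning to the details of that approximation, the simpler moment update can be sketched directly: it is just the update of Theorem 1 applied to a single weighted observation whose first two moments are computed from the successor state's belief. The Python sketch below assumes the hyper-parameter tuples (μ_0, λ, α, β) described above and uses the second-moment formula of Theorem 2; the helper names are illustrative only.

def ng_moments(mu0, lam, alpha, beta):
    """First and second moments of R under a normal-gamma belief (Theorem 2).
    Requires alpha > 1."""
    e_r = mu0
    e_r2 = mu0 ** 2 + (lam + 1.0) * beta / (lam * (alpha - 1.0))
    return e_r, e_r2

def moment_update(hyper_sa, hyper_t, r, gamma):
    """Moment update of the Q-distribution for (s, a) after observing reward r
    and moving to state t, whose best-action belief is hyper_t."""
    mu0, lam, alpha, beta = hyper_sa
    e_rt, e_rt2 = ng_moments(*hyper_t)
    m1 = r + gamma * e_rt                                   # E[r + gamma * R_t]
    m2 = r ** 2 + 2.0 * gamma * r * e_rt + gamma ** 2 * e_rt2
    mu0_new = (lam * mu0 + m1) / (lam + 1.0)
    lam_new = lam + 1.0
    alpha_new = alpha + 0.5
    beta_new = (beta + 0.5 * (m2 - m1 ** 2)
                + lam * (m1 - mu0) ** 2 / (2.0 * (lam + 1.0)))
    return mu0_new, lam_new, alpha_new, beta_new

The sketch makes the overconfidence problem visible: the successor state's uncertainty enters only through m2, and hence only through the β parameter, while λ grows by one on every update regardless.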
We compute the best normal-gamma approximation to the posterior by minimizing the Kullback-Leibler divergence KL(q, p) [24] between the true mixture distribution q and its normal-gamma approximation p.

Theorem 4 Let q(μ, τ) be some density measure over μ and τ and let ε > 0. If we constrain α to be greater than 1 + ε, the distribution p(μ, τ) ~ NG(μ_0, λ, α, β) that minimizes the divergence KL(q, p) is defined by the following equations:

    μ_0 = E_q[μτ] / E_q[τ]
    λ   = (E_q[μ²τ] − E_q[τ]μ_0²)^{−1}
    α   = max(1 + ε, f(log E_q[τ] − E_q[log τ]))
    β   = α / E_q[τ]

where f(x) is the inverse of g(y) = log y − ψ(y), and ψ(x) = d log Γ(x)/dx is the digamma function.

Proof To minimize the Kullback-Leibler divergence, we choose the parameters of the normal-gamma distribution p so KL(q, p) is minimal, or ∂KL(q, p)/∂p = 0. From [24] we have that KL(q, p) = E_q[log q] − E_q[log p]. The first term of this doesn't depend on p, so we can ignore it, and simply differentiate the second, taking advantage of the fact that (∂/∂p) E_q[log p] = E_q[(∂/∂p) log p]. From the definition of the normal-gamma distribution, along with the definitions of the gamma and normal distributions, we have that:

    p(τ) = (β^α / Γ(α)) τ^{α−1} e^{−βτ}

and:

    p(μ | τ) = (λτ / 2π)^{1/2} e^{−(λτ/2)(μ−μ_0)²}

Taking the partial derivatives of the logarithms of p, we get:

    ∂ log p(τ)/∂α      = log τ + log β − ∂ log Γ(α)/∂α
    ∂ log p(τ)/∂β      = α/β − τ
    ∂ log p(μ | τ)/∂λ  = 1/(2λ) − (τ/2)(μ − μ_0)²
    ∂ log p(μ | τ)/∂μ_0 = λτ(μ − μ_0)

Setting each of these to zero, we get the following set of simultaneous equations:

    E_q[log τ] + log β − ψ(α)       = 0
    α/β − E_q[τ]                    = 0
    1/(2λ) − (1/2)E_q[τ(μ − μ_0)²]  = 0
    λ E_q[τ(μ − μ_0)]               = 0

and the theorem follows from the solution of these. □

The requirement that α > 1 + ε is to ensure that α > 1 so that the normal-gamma distribution is well defined. Although this theorem does not give a closed-form solution for α, we can find a numerical solution easily since g(y) is a monotonically decreasing function [1].

Another complication with this approach is that it requires us to compute E[τ_{s,a}], E[τ_{s,a} μ_{s,a}], E[τ_{s,a} μ_{s,a}²] and E[log τ_{s,a}] with respect to p^{mix}(μ_{s,a}, τ_{s,a}). These expectations do not have closed-form solutions, but can be approximated as follows. We start by noting that for normal-gamma q, these terms have closed form equations. In fact, this is an easy corollary of Theorem 4.

Corollary 5 Let q(μ, τ) ~ NG(μ_0, λ, α, β). Then E_q[τ] = α/β, E_q[μτ] = μ_0 α/β, E_q[μ²τ] = 1/λ + μ_0² α/β, and E_q[log τ] = ψ(α) − log β.

To compute the expectation of these terms with respect to p^{mix}, we note that f(μ, τ) has the property that:

    E_{p^{mix}}[f(μ, τ)] = E_{p(R_t)}[ E_{p(μ, τ | r + γR_t)}[f(μ, τ)] ].

Thus, we can evaluate these expectations by integrating over R_t, and using the updated values of μ_0, λ, α and β for (s, a) to express the inner expectation.

To summarize, in this section we discussed two possible ways of updating the estimate of the values. The first, moment updating, leads to an easy closed form update, but tends to become overly confident. The second, mixture updating, is more cautious, but requires numerical integration, making it considerably more expensive in terms of computation time.

5.2.3 Convergence

We are interested in knowing whether our algorithms converge to optimal policies in the limit. It suffices to show that the means μ_{s,a} converge to the true Q-values, and that the variance of the means converges to 0. If this is the case, then we will eventually execute an optimal policy.
As we said in Section 4.2, the standard convergence proof [114] for Q-learning requires that each action is tried infinitely often in each state in an infinite run, and that Σ_{n=0}^{∞} α(n) = ∞ and Σ_{n=0}^{∞} α(n)² < ∞, where α is the learning rate. If these conditions are met, then the theorem shows that Q-value estimates converge to the real Q-values. Using this theorem, we can show that when we use moment updating, our algorithm converges to the correct mean.

Theorem 6 If each action a is tried infinitely often in every state, and the algorithm uses moment updating, then the mean μ_{s,a} converges to the true Q-value for every state s and action a.

Proof We need to show that the updating rule for μ_{s,a} satisfies the equations given above. The updating rules are:

    μ_0' = (λμ_0 + M_1) / (λ + 1),    λ' = λ + 1.

Since M_1 is the newly observed value of the state, this is equivalent to Q-learning with a learning rate of 1/(n + n_0) where n_0 is the prior value of λ. Σ_{n=0}^{∞} 1/(n + n_0) = ∞ and Σ_{n=0}^{∞} 1/(n + n_0)² < ∞ as required. □

Unfortunately, the myopic VPI algorithm we describe above does not meet the criterion that every action will be tried infinitely often on an infinite run. Let a* be the current best action; some other action a will never again be executed if E[q_{s,a}] + EVPI(s, a) remains forever below E[q_{s,a*}] + EVPI(s, a*).

Generally, the asymptotic nature of standard convergence results renders them relatively unimportant in practice. When a guarantee of convergence is important for a particular problem, we propose two alternative action selection approaches. The simplest is to define a "noisy" version of the myopic VPI action selection strategy that occasionally tries actions other than the one with the highest VPI. For example, we might use a Boltzmann distribution (see Section 4.4) where the values for each action are the VPI for that action rather than its Q-value. Since all actions will now be selected with some (arbitrarily small) probability, this scheme guarantees convergence.

A more principled action selection scheme, Q-value sampling, was first described by Wyatt [116] for exploration in multi-armed bandit problems. The idea is to select actions stochastically based on our current subjective belief that they are optimal. That is, action a is performed with probability given by

    Pr(a = argmax_{a'} μ_{s,a'}) = Pr(∀a' ≠ a, μ_{s,a} > μ_{s,a'})
                                 = ∫_{−∞}^{∞} p(μ_{s,a} = x) Π_{a' ≠ a} Pr(μ_{s,a'} < x) dx        (5.4)

The last step in this derivation is justified by Assumption 4, which states that our posterior distribution over the values of separate actions is independent. To evaluate this expression, we use Theorem 3 (Equations 5.2 and 5.3), which gives the marginal density of μ given a normal-gamma distribution.

In practice, we can avoid the computation of (5.4). Instead, we sample a value from each p(μ_{s,a}), and execute the action with the highest sampled value. It is straightforward to show that this procedure selects a with probability given by (5.4). Of course, sampling from a distribution of the form of (5.2) is nontrivial and requires evaluation of the cumulative distribution P(μ < x). Fortunately, T(x : d) can be evaluated efficiently using standard statistical packages. In our experiments, we used the library routines of Brown et al. [21].

Q-value sampling resembles, to some extent, Boltzmann exploration. It is a stochastic exploration policy, where the probability of performing an action is related to the distribution of the associated Q-values.
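In code, this sampling scheme needs only one draw per action: first sample the precision τ from its gamma marginal, then sample the mean μ from the conditional normal, and act greedily on the draws. The Python sketch below is a minimal illustration under the normal-gamma parameterisation described above; the hyper-parameter table and its numbers are hypothetical.

import math
import random

def sample_mean(mu0, lam, alpha, beta):
    """Draw a value of the mean mu from an NG(mu0, lam, alpha, beta) belief:
    tau ~ Gamma(alpha, scale=1/beta), then mu ~ Normal(mu0, precision lam*tau)."""
    tau = random.gammavariate(alpha, 1.0 / beta)
    return random.gauss(mu0, 1.0 / math.sqrt(lam * tau))

def q_value_sampling(hypers):
    """Q-value sampling: draw one value per action and pick the largest.
    `hypers` maps each action to its (mu0, lam, alpha, beta) tuple."""
    return max(hypers, key=lambda a: sample_mean(*hypers[a]))

# Two actions with equal expected means but different uncertainty (made-up numbers):
hypers = {'left': (1.0, 10.0, 5.0, 4.0), 'right': (1.0, 1.0, 2.0, 4.0)}
print(q_value_sampling(hypers))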
One drawback of Q-value sampling is that it only considers the probability that a is the best action, and does not consider the amount by which choosing a might improve over the current policy.

Figure 5.5: Examples of Q-value distributions of two actions for which Q-value sampling has the same exploration policy even though the payoff of exploring action 2 in (b) is higher than in (a).

Figure 5.5 shows examples of two cases where Q-value sampling would generate the same exploration policy. In both cases, Pr(μ_{a_2} > μ_{a_1}) = 0.4. However, in case (b) exploration of action a_2 seems more useful than in case (a), since the potential for larger rewards is higher for the second action in this case. Because EVPI is computed using the magnitude of the change in the value function, it would select a_1 in Figure 5.5(a) since although a_2 is better than a_1 with probability 0.4, the value function will probably change very little. On the other hand, in Figure 5.5(b), EVPI will select action a_2 because the potential change in the value of the state is much greater.

As we shall see in Section 5.2.4, Q-value sampling generally performs worse than VPI in practice. However, because of its stochasticity, it does maintain the property that every action has a non-zero probability of being executed, and therefore it can be guaranteed to converge.

We do not have a counterpart to Theorem 6 for mixture updating. Our conjecture is that the estimated mean does converge to the true mean.

5.2.4 Results

We have examined the performance of the model-free Bayesian approach on several different domains and compared it with a number of different exploration techniques. The parameters of each algorithm were tuned as well as possible for each domain. The algorithms we have used are as follows:

Semi-Uniform: Q-learning with semi-uniform random exploration (see Section 4.4).
Boltzmann: Q-learning with Boltzmann exploration (Section 4.4).
Interval: Kaelbling's interval-estimation algorithm [63] (Section 4.4.1).
IEQL+: Meuleau and Bourgine's IEQL+ algorithm [74] (Section 4.4.1).
Bayes: Bayesian Q-learning as presented above, using either Q-value sampling or myopic-VPI to select actions, and either Moment updating or Mixture updating for value updates. These variants are denoted QS, VPI, Mom, Mix, respectively.

Thus, there are four possible variants of the Bayesian Q-learning algorithm, denoted, for example, as VPI+Mix.

As we said in Section 4.1, there are several ways of measuring the performance of learning algorithms. We measure the performance of the learning algorithms by the actual discounted total reward-to-go at each point in the run. More precisely, suppose the agent receives rewards r_1, r_2, ..., r_N in a run of length N. We define the reward-to-go at time t to be Σ_{t'>t} γ^{t'−t} r_{t'}. Of course, this estimate is reliable only for points that are far enough from the end of the run. We prefer this measure because it shows the actual performance of the algorithm in practice. This is important in evaluating exploration strategies because other measures such as the quality of the greedy policy at each stage are overly optimistic since exploration strategies typically do not follow the greedy policy.

In the graphs below we plot the average discounted future reward received as a function of time by averaging these values over 10 runs with different random seeds.
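For reference, the reward-to-go measure just defined can be computed from a logged reward sequence in a single backward pass, as in the following sketch (Python; the example reward list is illustrative only and not data from the experiments).

def reward_to_go(rewards, gamma):
    """Discounted reward-to-go at each time step: sum over t' > t of
    gamma^(t'-t) * r_{t'}. Values near the end of the run are unreliable,
    as noted in the text."""
    values = [0.0] * len(rewards)
    future = 0.0                     # r_{t+1} + gamma*r_{t+2} + ... for the current t
    for t in range(len(rewards) - 1, -1, -1):
        values[t] = gamma * future
        future = rewards[t] + gamma * future
    return values

print(reward_to_go([0, 0, 1, 0, 10], gamma=0.99))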
The tables show the total reward received in the first and second 1,000 147  Figure 5.6: The chain domain[74]. Each arc is labeled with an action and the corresponding reward. With probability 0.2, the other action is performed (i.e. if a was selected, the arc corresponding to b is followed). steps of learning (for the maze domain, the first and second 20,000 steps). We performed parameter adjustment to find the best-performing parameters for each method. Thus the results reported for each algorithm are probably somewhat optimistic. The graphs are also smoothed using Bezier smoothing, as are all the graphs in this and the following chapter. We do this to make the graphs more readable as the discounted total reward-to-go measure we use tends to produce quite large changes in value from one time-step to the next as individual rewards are received. We tested these learning algorithms on three domains. The first two are designed so that there are suboptimal strategies that can be exploited. Thus, if a learning algorithm converges too fast, then it will not discover the higher-scoring alternatives. The third domain is larger and less "tricky" although it also admits inferior policies. We use it to evaluate how the various exploration strategies scale up. In each case a discounting factor of 0.99 was used.  The Chain Domain This domain, taken from [74], is shown in Figure 5.6. It consists of six states and two actions a and b. When an action is performed, with probability 0.2, the agent "slips"and the effects are exactly those of the other action. The optimal policy for this domain (assuming a discount factor of 0.99) is to do action a everywhere. However, the domain is designed in such a way that learning algorithms that do insufficient exploration can get trapped at the initial state, preferring to follow the 148  Figure 5.7: Actual discounted reward as a function of number of steps. Results for the chain domain. 6-loop to obtain a series of smaller rewards. Figure 5.7 shows the performance of the algorithms on this problem. I E Q L + tends to dominate the other algorithms in this relatively simple domain. A more detailed analysis is given in Table 5.1 which shows the total reward received (undiscounted) over each of the first and second 1,000 steps.  Although the Bayesian  algorithms (particularly using mixture updating with V P I for action selection) approach the performance of I E Q L + during the second phase, the standard deviation of the performance of the Bayesian algorithms is much higher than for IEQL-f-, which indicates that they are still exploring excessively, even after 2,000 steps of learning.  149  Method Uniform Boltzmann Interval IEQL+ Bayes QS+Mom Bayes QS+Mix Bayes V P I + M o m Bayes V P I + M i x  Mean 1519.0 1605.8 1522.8 2343.6 1480.8 1210.0 1875.4 1697.4  1st Phase Std. Deviation 37.2 78.1 180.2 234.4 206.3 86.1 478.7 336.2  2nd Phase Mean Std. Deviation 1611.4 34.7 1623.4 67.1 1542.6 197.5 2557.4 271.3 1894.2 364.7 1306.6 102.0 2234.0 443.9 2417.2 650.1  Table 5.1: Average and standard deviation of accumulated rewards over 10 runs of the chain domain. Each phase consists of 1,000 steps, so the first phase is the average performance over the first 1,000 steps of the algorithm. In each of the Bayesian runs, the priors were set such that the expected value of the mean was zero, and the variance of the mean was 400 for both V P I algorithms, 100 for QS+Mom, and 800 for QS+Mix. 
The mean of the variance was 1 in all cases, and the variance of the variance was 0.005 for all cases except QS+Mix, where it was 0.05. In most cases, the most important parameter was the variance of the mean. The algorithms are relatively insensitive to the variance parameters.  Figure 5.8: The loop domain[113]. Many algorithms will converge before the lefthand loop is explored.  150  Method Uniform Boltzmann Interval IEQL+ Bayes QS+Mom Bayes QS+Mix Bayes V P I + M o m Bayes V P I + M i x  1st Phase Avg. Dev. 185.6 3.7 186.0 2.8 198.1 1.4 264.3 1.6 190.0 19.6 203.9 72.2 316.8 74.2 326.4 85.2  2nd Phase A v g . Dev. 198.3 1.4 200.0 0.0 200.0 0.0 292.8 1.3 262.9 51.4 236.5 84.1 340.0 91.7 340.0 91.7  Table 5.2: Average and standard deviation of accumulated rewards over 10 runs on the loop domain. A phase consists of 1,000 steps. The priors were set with the E[fi] = 0, Var[/i] = 50 except for QS+Mix where Var|>] = 200. The Loop Domain This domain, from [113], consists of two loops as shown in Figure 5.8.  Actions  are deterministic. The problem here is that a learning algorithm may have already converged on action a for state 0 before the larger reward available in state 8 has been backed up. Here the optimal policy is to do action b everywhere. Figure 5.9 shows the actual discounted future reward for this problem. Here selecting actions using V P I works extremely effectively in comparison with the other algorithms. However, as Table 5.2 shows, their high standard deviation indicates that these algorithms are still performing exploration. In comparison, all the runs of interval estimation converged on a single sub-optimal policy (perform action a everywhere), and no more exploration is being performed. I E Q L + is still exploring, and its performance is still improving, but it is much slower than the Bayesian approaches using V P I .  The Maze Domain This is a maze domain where the agent attempts to "collect" flags and get them to the goal. In the experiments we used the maze shown in Figure 5.10. In this figure,  151  2000 Figure 5.9: Actual discounted reward as a function of number of steps. Results for the loop domain.  Figure 5.10: Task 3. A navigation problem. S is the start state. The agent receives a reward upon reaching G based on the number of flags F collected.  152  S marks the start state, G marks the goal state, and F marks locations of flags that can be collected. The reward received on reaching G is based on the number of flags collected. Once the agent reaches the goal, the problem is reset. There are a total of 264 states in this M D P . The agent has four actions—up, down, left, and right. There is a small probability, 0.1, that the agent will slip and actually perform an action that goes in a perpendicular direction. If the agent attempts to move into a wall, its position does not change. The challenge is to do sufficient exploration to collect all three flags before reaching the goal. As Figure 5.11 shows, the Bayesian algorithms using V P I are far superior to the rest on this problem, and mixture updating shows considerably better performance than moment updating. Q-sampling proved particularly poor on this domain, in fact worse than any of the other algorithms. This seems to be because the values of all the states are very similar for a long time, and so the action selection is relatively uninformed. 
Table 5.3 shows the actual reward received over the first and second 20,000 steps, and once again the algorithms using VPI have much greater variance in performance than the others. This is one of the disadvantages of the Bayesian approach: different runs tend to (at least temporarily) converge on quite different policies. Their average performance tends to be better than other algorithms, particularly for large problems, but they have a great deal more variability.

Our results show that in all but the smallest of domains our methods are competitive with or superior to state of the art exploration techniques such as IEQL+. Our analysis suggests that this is due to our methods' more effective use of small numbers of data points. Results from the maze domain in particular show that our VPI-based methods begin directing the search towards promising states after making significantly fewer observations than IEQL+ and interval estimation. Overall, we have found that using mixture updating combined with VPI for action selection gives the best performance.

Figure 5.11: Actual discounted reward as a function of number of steps. Results for the maze domain.

Method: Uniform, Boltzmann, Interval, IEQL+, Bayes QS+Mom, Bayes QS+Mix, Bayes VPI+Mom, Bayes VPI+Mix.
1st Phase Avg. Dev.: 105.3 10.3 | 195.2 61.4 | 246.0 122.5 | 269.4 3.0 | 10.7 132.9 | 128.1 11.0 | 403.2 248.9 | 817.6 101.8
2nd Phase Avg. Dev.: 161.2 8.6 | 87.9 1024.3 | 506.1 315.1 | 7.3 253.1 | 12.2 176.1 | 9.9 121.9 | 660.0 487.5 | 1099.5 134.9

Table 5.3: Average and standard deviation of accumulated rewards over 10 runs of the maze domain. A phase consists of 20,000 steps. The priors were fixed with E[μ] = 0 and Var[μ] = 200 for all the Bayesian algorithms.

In terms of their computational requirements our algorithms are much more expensive than any of the algorithms we used for comparison. The actual complexity is no greater, but the integral solving we use in the computation of VPI and in mixture updating is a considerable computational task. In practice, our prototype implementations of these algorithms run between ten and 100 times slower than interval estimation, depending on which updating and action selection techniques are used. This could be reduced considerably by using more efficient numerical integration algorithms, or by performing the integrations to a lower degree of accuracy.

Figure 5.12: The effects of priors on the learning rate. Results are for the loop domain.

Figure 5.12 shows the effect of different priors on the learning rate for the loop domain with a discount factor of 0.95. We use Bayesian Q-learning using VPI to select actions and mixture updating. The graph shows the effects of "helpful" and "misleading" priors in state 0, helpful priors in states 0 and 5–8, and priors formed by running 200 steps of the algorithm starting with uniform priors. For the helpful priors, we made action b more attractive by increasing the μ and λ parameters of the distribution for action b, resulting in a distribution where the mean and precision of the mean are greater. For the misleading priors, we applied the same changes to action a.
As the figure shows, "helpful" priors, and particularly the partially learned priors considerably increased the speed of learning, while "misleading" priors slowed learning considerably. Even with misleading priors, the optimal policy was eventually discovered (in two of the ten runs, it was discovered in the first 200 steps, the other runs took around 2000 steps). One weakness of our algorithms is that they have significantly more parameters than I E Q L + or interval estimation. The main parameters that seem to effect the performance of our method is the variance of the initial prior, that is, the ratio j^zrpj-  Priors with larger variances usually lead to better performance. In the  following section, we examine the model-based Bayesian approach. Part of the motivation for using models is to considerably reduce the number of parameters that can be tuned.  5.3  Model-based Bayesian Exploration  The results in the previous section demonstrate that a Bayesian approach to exploration can have considerable advantages in terms of performance over ad-hoc exploration methods, and some performance gains over more sophisticated algorithms such as interval estimation and I E Q L + . Unfortunately, the Bayesian Q-learning approach has a number of weaknesses. It relies on the assumptions listed on page 134, some of which are reasonable, but others are rarely true in practice. It also has considerable computational requirements and a large number of parameters to be tuned. Also, when they can be applied, model-based algorithms (even the relatively unsophisticated ones in Section 4.3) tend to learn good policies far more quickly than any model-free exploration strategy, and model-based algorithms fit much better with our interest in exploiting problem structure for computational gains. With all these issues in mind, we now turn our attention to applying the 156  p=0.3  Figure 5.13: Outline of the model-based Bayesian reinforcement learning algorithm. Bayesian approach in model-based settings.  The idea here is that we will learn  a model of the environment that takes into account our uncertainty, and use this uncertain model to decide how to act. Figure 5.13 is an overview of our algorithm. We begin with an uncertain model of the world, translate our uncertainty about the effects of actions into uncertainty about the values of actions and states, and use the resulting Q-distributions to select an action a to perform in the current state s. By performing the action we move to a new state t, and receive a reward r. As before, we learn based on these observations (s,a,t,r),  but in model-learning we  update the parameters of our world model—the transition probabilities and reward function for s and a. To implement the algorithm outlined in Figure 5.13 we will need the following components: M o d e l representation  We need a way to represent a transition function for each  action and a reward function. This model must explicitly represent our uncertainty about transition and reward probabilities. We also need a way to update the model when we receive new observations of transitions and rewards. 157  Q-value uncertainty We will be using essentially the same action selection scheme (myopic VPI) as in the previous section to select real-world actions to perform. To use this, we need a representation of uncertainty about Q-values. 
A method to translate from model to Q-value uncertainty Myopic V P I operates on probability distributions over Q-values, but the uncertainty in the model-based case is in the form of probability distributions over the transition probabilities and reward function. We need a method for transforming this model uncertainty into uncertainty about the values of states and actions. In the following sections we will discuss each of these issues in detail. Again we are interested in a system that allows us to incorporate prior knowledge. In this case, however, we will be incorporating prior knowledge about transition probabilities and the reward function. For example, we might want to include knowledge that a particular state has a large negative reward, along with information about the transition functions of neighboring states in order to warn the agent. In some of the alternative approaches we describe below, it may be possible to include prior information about Q-values as welj, but in most cases initial Q-values are automatically generated from the model, so priors over Q-values would be overridden.  5.3.1  R e p r e s e n t i n g U n c e r t a i n t y i n the M o d e l  In this section we describe how to maintain a Bayesian posterior distribution over M D P s given our experiences in the environment. A t each step in the environment, we start at state s, choose an action o, and then observe a new state s' and a reward r. We summarize our experience by a sequence of experience tuples (s, a, r, s'). A Bayesian approach to this learning problem is to maintain a belief state over the possible M D P s . Thus, a belief state B defines a probability density over M D P s M, P(M | B).  Given an experience tuple (s,a,r,s')  158  we can compute the  posterior  belief state, which we denote B o (s, a, r, s'), by Bayes rule: P(M  oc  Bo(s,a,r,s ))  |  1  P{(s,a,r,s'}\  M)P(M  =  \ B)  P(sAs'\M)P{s-%r\M)P(M\B).  Thus, the Bayesian approach starts with some prior  probability distribution  over all possible M D P s (we assume that the sets of possible states, actions and rewards are delimited in advance). As we gain experience, the approach focuses the mass of the posterior  distribution on those M D P s in which the observed experience  tuples are most probable. A n immediate question is whether we can represent these prior and posterior distributions over an infinite number of M D P s . We show that this is possible by adopting results from Bayesian learning of probabilistic models, such as Bayesian networks. Under carefully chosen assumptions, we can represent such priors and posteriors in any of several compact manners. We discuss one such choice below. To formally represent our problem, we consider the parameterization MDPs.  of  The simplest parameterization is table based, where there are parame-  ters 0\ „ . and 0 , „ _ for the transition and reward models. Thus, for each choice of R  s and a, the parameters <9* = {#* , : s' € S} define a distribution over possible a s  a  successor states, and the parameters 8  R  over possible rewards.  = {6  R  SA  SAR  : r £ TZ) define a distribution  1  We say that our prior satisfies parameter  independence  if it has the product  form: Pr(0 I B) = [] J] P r < s  a  | B) P r ( ^  a  | B).  (5.5)  a  Thus, the prior distribution over the parameters of each local probability term in the M D P is independent of the prior over the others. It turns out that this form is The methods we describe are easily extend to other parameterizations. 
In particular, we can consider continuous distributions, e.g., Gaussians, over rewards. For clarity of discussion, we will use multinomial distributions here. J  159  maintained as we incorporate evidence. Theorem 7 If the belief state P(8 \ B) satisfies parameter independence, then P(9 \ B o (s, a, r, s')) also satisfies parameter independence.  Proof The observation (s, o, r, s') produces two changes in the belief state, namely to 0*  and 0  r s  . Regardless of the actual updating scheme, since P(Q \ B)  satisfies parameter independence, neither of these parameters has an effect on any others, so parameter independence is maintained.  •  As a consequence, the posterior after we incorporate an arbitrarily long number of experience tuples also has the product form of (5.5). Parameter independence allows us to reformulate the learning problem as a collection of unrelated local learning problems, one for each parameter 0\  and  9 . In each of these, we have to estimate a probability distribution over all states r  sa  or all rewards. The question is how to learn these distributions. We will use wellknown Bayesian methods for learning standard distributions such as multinomials or Gaussian distributions from data [34]. For the case of discrete multinomials, which we have assumed in our transition and reward models, we can use Dirichlet priors to represent Pr(0* ) and P (^s,a)r  These priors are conjugate, and thus the posterior after each observed  experience tuple will also be a Dirichlet distribution. In addition, Dirichlet distributions can be described using a small number of hyper-parameters.  5.3.2  The Dirichlet Distribution  Let X be a random variable that can take L possible values from a set £ . Without loss of generality, let £ = { 1 , . . .L}. We are given a training set D that contains the outcomes of N independent draws x ,.. .,x 1  N  of X from an unknown multinomial  distribution P*. The multinomial estimation problem is to find a good approximation for P*. 160  This problem can be stated as the problem of predicting the outcome a;^* given x ,.. .,x . 1  N  1  Given a prior distribution over the possible multinomial distribu-  tions, the Bayesian estimate is: Pr(x  N+1  = JP{x N+1  \x\...,x ,Z) N  | 0,Z)P(P | x\...,x ,Z)de  (5.6)  N  where 9 = {0±,.. .,0L) is a vector that describes possible values of the (unknown) probabilities P*(l),...,  P*(L),  and £ is the "context" variable that denote all other  assumptions about the domain. The posterior probability of 6 can be rewritten using Bayes law as: P(6\x\...,x ,£)  P(x ,....,  ex  N  x  l  = W O  | 0,£)P(0 | f).  N  114*'  (5-7)  i  where iV; is the number of occurrences of the symbol i in the training data.  Dirichlet  distributions are a parametric family that is conjugate to the multi-  nomial distribution. That is, if the prior distribution is from this family, so is the posterior. A Dirichlet prior for X is specified by hyper-parameters  « i , . . . ,O:L> and  has the form:  ( io«n  c  p e  i  - ° ° *)  6 i = 1 a n d 6i  i  f r ai1  where the proportion depends on a normalizing constant that ensures that this is a legal density function (i.e., integral of P{0 | £) over all parameter values is 1). Given a Dirichlet prior, the initial prediction for each value of X is p{x  l  = i\0=  [ me J  10^ = E,  «j  It is easy to see that, if the prior is a Dirichlet prior with hyper-parameters « i , . . . 
, aj,, then the posterior is a Dirichlet with hyper-parameters ati + N\,..., we get that the prediction for X p<x  N+1  is  N+1  = i ix  1  x  N  161  n =  *  a  +  N  t  + NL- Thus,  In some situations (see below) we would like to sample a vector 6 according to the distribution P(6 | £). This can be done using a simple procedure: Sample values yi,...,2/Z, such that each y - ~ Gamma(ai,l) 2  bility distribution, where Gamma(a,  and then normalize to get a proba-  (5) is the Gamma distribution. Procedures for  sampling from these distributions can be found in [88]. In the case of most M D P s studied in reinforcement learning, we expect the transition model to be sparse—there are only a few states that can result from a particular action at a particular state. Unfortunately, if the state space is large, learning with a Dirichlet prior can require many examples to recognize that most possible states are highly unlikely. This problem is addressed by a recent method of learning sparse-multinomial priors [44].  5.3.3  The Sparse-multinomial Distribution  Sparse-multinomial priors have the same general properties as Dirichlet priors, but assume that only some small subset of the possible outcomes will ever be observed. The sparse Dirichlet priors make predictions as though only the observed outcomes are possible, except that they also assign some of the probability mass to novel outcomes.  In the M D P setting, for a state s and action a, the set of possible  outcomes is the set of states (since a transition could be to any state). The observed outcomes is the set of states T C S such that if a transition has been observed to t, then t £ T. A novel outcome is a transition to a state t that has not been reached before by executing a in s. Friedman and Singer [44] introduce a structured prior that captures our uncertainty about the set of "feasible" values of X. that takes values from the set 2  s  Define a random variable V  of possible subsets of E . The intended semantics  for this variable, is that if we know the value of V, then 6{ > 0 iff i £ V. Clearly, the hypothesis V = £ ' (for E ' C E) is consistent with training data only if E ' contains all the indices i for which  162  > 0. We denote by E° the set of  observed symbols. That is, S ° = {i : A ; > 0}, and we let k° = | S ° | . Suppose we 7  know the value of V. Given this assumption, we can define a Dirichlet prior over possible multinomial distributions 9 if we use the same hyper-parameter a for each symbol in V. Formally, we define the prior: P[6\V) cc  Yl ®V C/Z = X  1 and  8i  =  9i  0 for a11 1  $ "> ( - ) V  5  8  Using E q . (5.7), we have that:  P(X  N+1  = i | x\...,x ,V) n  \V\a+N-  = {  0  11  1  fc  V  (5.9)  otherwise  Now consider the case where we are uncertain about the actual set of feasible outcomes. We construct a two tiered prior over the values of V. We start with a prior over the size of V, and assume that all sets of the same cardinality have the same prior probability. We let the random variable S denote the cardinality of V. We assume that we are given a distribution P(S = k) for A; = l,...,L. the prior over sets to be P(V \ S = k) = (^)  1  We define  . This prior is a sparse-multinomial  with parameters a and P r ( 5 — k). Friedman and Singer show that how we can efficiently predict using this prior. 
Theorem 8 [44] Given a sparse-multinomial prior, the probability of the next symbol is P(X  N+1  '  ifzez°  &%TC{D,L)  = i | D) = { [ ^(l-C(D,L))  */^£°  where k=k°  Moreover, P{S = k\D)  = =-^ 2-jk'>k°  163  , k  m  where m  k  and T(x) = f °° t ~ e~ dt x  0  1  t  k\ = P(S = k)(k-k°)l  T{ka) T(ka + N)  is the gamma function. y-i  °a+N  k  C(D,L)=^ = ° k  Thus,  ^  k  2-^k'>k°  m  k  .  k  m  We can think of C(D, L) as scaling factor that we apply to the Dirichlet prediction that assumes that we have seen all of the feasible symbols. The quantity 1 — C(D, L) is the probability mass assigned to novel (i.e. unseen) outcomes. In some of the methods discussed above we need to sample a parameter vector from a sparse-multinomial prior. Probable parameter vectors according to such a prior are sparse, i.e., contain few non-zero entries. The choice of the non-zero entries among the outcomes that were not observed is done with uniform probability. This presents a complication since each sample will depend on some unobserved states. To "smooth" this behaviour we sample from the distribution over V° combined with the novel event. We sample a value of k from P(S = k\D). We then, sample from the Dirichlet distribution of dimension k where the first k° elements are assigned hyper-parameter a + N{, and the rest are assigned hyper-parameter a. The sampled vector of probabilities describes the probability of outcomes in V° and additional k — k° events. We combine these latter probabilities to be the probability of the novel event. For both the Dirichlet and its sparse-multinomial extension, we need to maintain the number of times, N(s-^-t), state t is observed after executing action a at state s, and similarly, i V ( s A r ) for rewards. With the prior distributions over the parameters of the M D P , these counts define a posterior distribution over M D P s . This representation allows us to both predict the probability of the next transition and reward, and also to compute the probability of every possible M D P and to sample from the distribution of M D P s . To summarize, we assume parameter independence, and that for each transi164  tion probability 9\ and reward 8  r  a  sa  we have either a Dirichlet or sparse-multinomial  prior. The consequence is that the posterior at each stage in the learning can be represented compactly. This enables us to estimate a distribution over M D P s at each stage. It is easy to extend this discussion to more compact parameterizations of the transition and reward models. For example, we can use a 2 T B N representation of states and actions as described in Section 2.5. Such a structure requires fewer parameters and thus we can learn it with fewer examples. As we shall see in Chapter 6, much of the above discussion about parameter independence and Dirichlet priors apply to structured models as well [51]. 5.3.4  Translating from model uncertainty to Q-value uncertainty  In order to reason about the values of states, and in particular in order to use myopic V P I to select actions to perform, we need to translate our probability distribution over M D P s into a set of Q-distributions for each state and action. We now examine several methods of doing this translation, which have different complexities and biases. Naive Global Sampling Perhaps the simplest approach is to simulate the definition of a Q-value distribution. Since there are an infinite number of possible M D P s , we cannot afford to compute Q-values for each. 
Instead, we sample k M D P s : M ,..., 1  M from the distribution k  P r ( M | B). We can solve each M D P using standard techniques (e.g., value iteration or linear programming). For each state s and action a, we then have a sample solution ql ,..., a  q  k  , where q\ is the optimal Q-value, Q*(s, a), given the i'th M D P . A  From this sample we can estimate properties of the Q-distribution. For generality, we denote the weight of each sample, given belief state B, as w . For naive global B  sampling, all these weights are equal to 1, but we will use them again below.  165  Given these samples, we can estimate the mean Q-value as  Similarly, we can estimate the V P I by summing over the k M D P s : EVPI(s,  a) «  J2 « 4 V P l i ( * ) . i a  2Ui  B  9  i a  i  W  This approach is straightforward; however, it requires an efficient sampling procedure.  Here again the assumptions we made about the priors helps us. If  our prior has the form of (5.5), then we can sample each distribution (p#(sAr) or PR{S—>r)) independently of the rest. Thus, the sampling problem reduces to sampling from "simple" posterior distributions. Procedures for sampling from Dirichlet and sparse-multinomial priors are given in Sections 5.3.2 and 5.3.3 respectively.  Importance Sampling A n immediate problem with the naive sampling approach is that it requires a number of global computations (e.g., computing value functions for M D P s ) to evaluate each action made by the agent. This is generally too expensive. One possible way of avoiding these repeated computations is to re-use the same sampled M D P s for several steps. The idea is to sample some M D P s from the model, solve them, and then use them without resampling for a number of steps. To do so, we can use ideas from importance  sampling  [35].  In importance sampling, we want to sample from some distribution  f(x),  but for some reason we actually sample from another distribution g(x). To adjust the samples so they accurately represent f(x) we weight them based on the ratio of their likelihood of being sampled from f(x) to their likelihood of being sampled from g(x). That is, a sample X{ is given weight W{ where:  Wi  =  f(Xi)  -j—-  9(Xi)  166  In our case, the idea is to re-use samples from an older model rather than solving the new one, so f(x) and g(x) are the current model Pr(M \ B') and the old model Pr(M | B) respectively. Since we may re-use the samples several times before resampling, we assume they already have a weight associated with them (initially this weight will be 1), and adjust the weight of each sample to correct for the difference between the sampling distribution Pr(M \ B) and the target distribution Pr(M  | B') by: ?r(M>\B') W  b  '~  Pr(M<|5)  B  '  We now use the weighted sum of samples to estimate the mean and the V P I of different actions. It is easy to verify that the weighted sample leads to correct prediction when we have a large number of samples. In practice, the success of importance sampling depends on the difference between the two distributions. In our case, if an M D P M has low probability according to P r ( M | B), then the probability of sampling it is small, even if P r ( M | B') is high. Fortunately for us, the differences between the beliefs before and after observing an experience tuple are usually small. We can easily show that Theorem 9  Bo(s,a,r,t)  W  _ ~  Pr(M ' | B o {s,a,r,t)) P r ( M | B) ^ 4  i  Pr((s,a,r,t)  | AT)  B  •  Pv((s,a,r,t)\B)  .  
WB  P r o o f Using Bayes theorem, we have that: Pr(AT | Bo(s,a,r,t))  =  Pr((s, a, r, t) | M ' A B) P r ( M * A B) Pv((s,a,r,t)\B)  Pv(B)  so: Pv((s,a,r,t) Bo(s,a, ,t) r  \M  i  Pv{(s,a,r,t)\  AS) B)  The theorem follows from the fact that Pr((s, a, r, t) \ M AB) = Pr((s, a, r, t) 1  M) l  since (s, a,r, t) is independent of B given M \ 167  •  The term Pr((s,a,r,t)  \ M ) is easily extracted from M', and Pr({s,a,r,t) l  can be easily computed based on our posteriors.  \ B)  Thus, we can easily re-weight  the sampled models after each experience is recorded and use the weighted sum for choosing actions. Note that re-weighting of models is fast, and since we have already computed the Q-value for each pair (s, a) in each of the models, no additional computations are needed. Furthermore, the fact that the weight is computed incrementally as the target distribution changes does not lead to error or bias in the weights when the samples are re-used multiple times. Two Q-value estimates in sequence are obviously correlated, because they are constructed from the same samples, but the samples used to construct each estimate are unbiased despite the fact that they are re-used from one step to the next. This is because even though the weights are computed incrementally (using Theorem 9) as the belief state over M D P s changes, the resulting weight for each sample is the same as if it had been computed directly for the current belief state and the original proposal distribution. Of course, the original set of models we sampled becomes irrelevant as we learn more about the underlying M D P . We can use the total weight of the sampled M D P s to track how unlikely they are given the observations. Initially this weight is k. As we learn more it usually becomes smaller. When it becomes smaller than some threshold k , mm  we sample k — k { m  n  new M D P s from our current belief state,  assigning each one weight 1 and thus bringing the total weight of the sample to k again. We then need only to solve the newly sampled M D P s . To summarize, we sample k M D P s , solve them, and use the k Q-values to estimate properties of the Q-value distribution. We re-weight the samples at each step to reflect our newly gained knowledge. Finally, we have an automatic method for detecting when new samples are required.  168  Initially, sample k M D P s from our prior belief state. At each step we: 1. Observe an experience tuple (s,a,r,t) 2. Update Pr(0$ ) by t, and P r ( ^ J by r. >o  3. For each i= 1,..., k, sample 0^, respectively.  0 '* from the new Pr(0* ) and P r ( ^ ) , R  S  A  a  > 0  4. For each i = 1,..., k run a local instantiation of prioritized sweeping to update the Q-value function of M . 1  Figure 5.14: The global sampling with repair algorithm for translating from model uncertainty to Q-value uncertainty. Global Sampling with Repair The sampling approaches we have described so far have one serious deficiency. They involve computing global solutions to M D P s which can be very expensive. Although we can re-use M D P s from previous steps, this approach still requires us to sample new M D P s and solve them quite often. A n alternative idea is to keep updating each of the sampled M D P s . Recall that after observing an experience tuple (s,a,r,t), over 0\  and 6 . R  SA  we only change the posterior  Thus, instead of re-weighting the sample M\  or repair, it by re-sampling 0\  and 0 . 
R  SA  we can update,  If the original sample M was sampled l  from P r ( M | J5), then it easily follows that the repaired M is sampled from P r ( M | l  B o (s,a,r,t)). Of course, once we modify M ' its Q-value function changes. However, all of these changes are consequences of the new values of the dynamics at (s,a). Thus, we can use prioritized sweeping (see Section 4.3) to update the Q-value computed for M . 1  This sweeping performs several Bellman updates to correct the values of  states that are most affected by the change in the model. This suggests the algorithm shown in Figure 5.14. Thus, our approach is quite similar to standard model-based learning with prioritized sweeping, but in-  169  Figure 5.15: Mean and variance of the Q-value distribution for a state, plotted as a function of time. Note that the means of each method converge to the true value of the state at the same time that the variances approach zero. stead of running one instantiation of prioritized sweeping, we run k instantiations in parallel, one for each sampled M D P . The repair to the sampled M D P s ensures that they constitute a sample from the current belief state, and the local instantiations of prioritized sweeping ensure that the Q-values computed in each of these M D P s is a good approximation to the true value. As with the other approaches we have described, after we invoke the k prioritized sweeping instances, we use the k samples from each q  SA  to select the next  actions using V P I computations. Figure 5.15 shows a single run of learning on the "trap" domain we describe below where the actions selected were fixed and each of the three methods was used to estimate the Q-values of a state. Initially the means and variances are  170  very high, but as the agent gains more experience, the means converge on the true value of the state, and the variances tend towards zero. This behaviour is typical of the experiments we have performed, and suggests that the repair and importance sampling approaches both provide reasonable approximations to naive global sampling.  5.3.5  R e p r e s e n t i n g U n c e r t a i n t y i n the Q-values  As we said above, we can compute the expected Q-values of actions in states and their myopic V P I directly from the Q-values we found for the sampled M D P s using the following formulae:  and 1  Here we are using a collection of (potentially weighted) point values to represent the approximation to the Q-value distribution. A possible problem with this representation approach is that we use a fairly simplistic representation to describe a complex distribution. Instead, we may wish to generalize from the k samples by using standard generalization methods. This allows us to smooth the Q-value distributions. This should improve of the V P I calculations and also allow us to represent more complex hypotheses about our Q-value distributions while using fewer samples (hence solving fewer M D P s ) . Perhaps the simplest approach to generalize from the k samples is to assume that the Q-value distribution has a particular parametric form, and then to fit the parameters to the samples. One standard approach is to fit a Gaussian to the A: samples.  This captures the first two moments of the sample, and allows simple  generalization. Unfortunately, because of the max terms in the Bellman equations, we expect the Q-value distribution to be skewed in the positive direction. If this  171  skew is strong, then fitting a Gaussian would be a poor generalization from the sample. 
At the other end of the spectrum are non-parametric approaches. One of the simplest ones is Kernel  estimation  (see for example [10]). In this approach,  we approximate the distribution over Q(s,a) by a sum of Gaussians, one for each sample, and each with a fixed variance. This approach can be effective if we are careful in choosing the variance parameter.  Too small a variance will lead to a  spiky distribution, while too large a variance will lead to an overly smooth and flat distribution. We use a simple rule for estimating the kernel width as a function of leave-one-  the mean (squared) distance between points. This rule is motivated by a out cross-validation  estimate of the kernel widths. Let q ,...,q l  k  be the k samples.  We want to find the kernel width a that best predicts the value of each sample when used to build a distribution from all the other samples. That is, we want to maximizes the term J(a )  = £log£;/(i7* |  2  where f{q  l  q^a )) 2  \ qi,a) is the Gaussian P D F with mean qi and variance a . 2  Using  Jensen's inequality, we have that  •V)>EE Theorem  10  The value of a  2  l o  8/(?i^  that maximizes  d is the average distance among  f f 2  )  5~J,- Ylj^i l°g fil' I  °) 2  *'  s  \d, where  samples:  < = i<^i)  £ £ « ' - « ' > '  P r o o f We maximize the sum over all the samples of the log probability of each sample X,- given all the other samples X\,..., logP(X | * ! , . . . , * „ )  X{_\, Xi+\,...,  log-Y^f(x:X ,a)  =  t  172  X: n  n ^—' 2  =  Let c = log 1/\/2K. log P(Xi  | X  x  - \  l o g - = - f l o g - - -±  T  Then the total log probability is:  . . .  Let d = 1/n ^  ,  n  n  ~ ^j)  2  > [c-\oga-^Yl  . . . X )  2  D e  X>Y<'"  the average distance between any two  sample points. Then: VMogP(Xi|...)  a  2  =  =  ntc-logff-jLrf)  -d 4  So the Gaussian kernels should have variance d/4 as required.  •  The only question remaining to be answered is how to sample from a distribution constructed using kernel estimation. To do this, we take a random number between 0 and 1, and then use binary search on the cumulative probability distribution to find the corresponding value. Of course, there are many other generalization methods we might consider using here, such as mixture distributions. However, these two approaches provide us with initial ideas on the effect of generalization in this context. Figure 5.16 shows the effects of Gaussian approximation and kernel estimation smoothing (using the computed kernel width) on the sample values used to generate the Q-distributions in Figure 5.15 after performing 100, 300 and 700 actions respectively. In general, the distribution produced using Gaussian approximation is overly simplistic, and particularly early in learning is a very poor generalization of 173  -5  0  5  0.25  10  15  —i  1  20  25  1  GaussianSamples Approx. Kernel estimation  -  0.2  i  0.15  J  0.05  0  \  -5  0  i  ill 11L 5  il 10  Value  15  20  Samples Gaussian Approx. Kernel estimation  i  1  :  i  1 i !  / -6  -4  -2  0 Value  Illi  2  4  6  Figure 5.16: Samples, Gaussian approximation, and Kernel estimates of a Q-value distribution after 100, 300, and 700 steps of Naive global sampling on the same run as Figure 5.15. 174  the sampled values. This is because the sampled Q-values, initially being computed from significantly different M D P s , are widely spread and quite skewed. Kernel estimation produces a generalized distribution that is much more closely tied to the sampled values. 
For this reason, we expect kernel estimation to perform better than Gaussian approximation for computing V P I . We must also compute the V P I of a set of generalized distributions made up of Gaussians or kernel estimates. This is simply a matter of solving the integral given in Equation 5.1 where Pr(q  Sta  = x) is computed from the generalized probability  distribution for state s and action a. This integration can be simplified to a term where the main cost is an evaluation of the C D F of a Gaussian distribution (e.g., see [90]). This function, however, is implemented in most language libraries (e.g., using the er/() function in the C-library), and thus can be done quite efficiently.  5.3.6  Results  Figure 5.17 shows two domains of the type on which we have tested our algorithms. Each is a four action maze domain in which the agent begins at the point marked S and must collect the flag F and deliver it to the goal G. The agent receives a reward of 1 for each flag it collects and then moves to the goal state, and the problem is then reset. If the agent enters the square marked T (a trap) it receives a reward of -10. Each action (up, down, left, right) succeeds with probability 0.9 if that direction is clear, and with probability 0.1, moves the agent perpendicular to the desired direction. The "trap" domain has 18 states, the "maze" domain 56. We evaluate the algorithms by computing the average (over 10 runs) future discounted reward received by the agent.  For comparison purposes we use the  prioritized sweeping algorithm [79] with the T ^ j . ^ parameter optimized for each problem. Figure 5.18 shows the performance of a representative sample of our algorithms on the trap domain. Unless they are based on a very small number of samples,  175  Q s  G  F  o  c  F  G  T  a. Figure 5.17: The (a.) "trap" and (b.) larger maze domains.  -10  -15  -20  -25  Prioritized Sweeping Naive Global, Kernel estimation smoothing, Dirichlet priors Importance sampling, No smoothing, Sparse multinomial priors Repair sampling, Gaussian approximation smoothing, Dirichlet priors 50  100  150  200  250  300  350  400  Iterations  Figure 5.18: Discounted future reward received for the "trap" domain.  176  450  1000 Number of steps  2000  Figure 5.19: Comparison of Q-value estimation techniques on the larger maze domain. In all cases, kernel estimation was used to smooth the Q-distributions.  177  Prioritized Sweeping Kernel Estimation No smoothing Gaussian Approximation  500  1000  Number of steps  1500  2000  Figure 5.20: The effects of smoothing techniques on performance in the large maze domain. Naive global sampling was used to produce the samples for all the algorithms.  178  all of the Bayesian exploration methods outperform prioritized sweeping. This is due to their more cautious approach to the trap state. Although they are uncertain about it, they know that its value is probably bad, and hence do not explore it further after a small number of visits. In comparison, the T ^  o r e (  j parameter of  prioritized sweeping forces the agent to repeatedly visit the trap state until it has enough information to start learning. Figure 5.19 compares prioritized sweeping and the model-free Bayesian Qlearning algorithm from Section 5.2 (we used V P I for selecting actions and mixture updating) with our Q-value estimation techniques on the larger maze domain. As the graph shows, our techniques perform better than prioritized sweeping early in the learning process. 
They explore more widely initially, and do a better job of avoiding the trap state once they find it. Of the three techniques, global sampling performs best, although its computational requirements are considerable—about ten times as much as sampling with repair, or around 1000 times as much as prioritized sweeping. Importance sampling runs about twice as fast as global sampling but converges relatively late on this problem, and did not converge on all trials. Sampling with repair is the fastest of our algorithms, and requires approximately n times the computation time of prioritized sweeping, where n is the number of samples used. A l l these algorithms require considerably more computation time than prioritized sweeping, and in fact are only practical when large amounts of computation time is available between each action. If computation time is costly, or in short supply, the agent is generally better off to do less computation, and perform more real-world actions. For comparison purposes, global sampling with repair, which is the fastest of our algorithms, runs approximately n times slower than prioritized sweeping where n is the number of samples. Figure 5.20 shows the relative performance of the three smoothing methods, again on the larger domain. To exaggerate the effects of smoothing, only 20 samples  179  were used to produce this graph. Kernel estimation performs very well, while no smoothing failed to find the optimal (two flag) strategy on two out of ten runs. Gaussian approximation was slow to settle on a policy, it continued to make exploratory actions after 1500 steps while all the other algorithms had converged by then. In summary, we have presented both model-free and model-based reinforcement learning algorithms in this Chapter. Both algorithms were based on the idea that in order to perform efficient exploration we need to explicitly reason using our uncertainty about the values of different actions. The V P I measure we introduced for selecting actions does this, choosing actions based on the value of the information we might expect to gain by performing them. We have showed that both our approaches generally outperform competing algorithms, although both have considerably higher computational demands than the algorithms currently in use. However, in many domains the cost of acting far exceeds the cost of reasoning, and there is sufficient computation time available between actions, so that the higher computational overhead of these methods is relatively unimportant.  180  Chapter 6  Structure-based Reinforcement Learning We now have almost all the pieces we require to build a reinforcement learning system that works with structured representations of problems, and that makes very efficient use of the observations it gets by performing actions. The system we propose will make use of available prior knowledge of the domain, both in the form of a structured representation of a problem, and also any prior knowledge of the actual transition probabilities and other parameters.  By using the techniques we  describe in Chapter 5, the system will also use this available knowledge to guide exploration and make learning more efficient. Much of this work is preliminary and speculative in nature, although we will provide some experimental results to begin to validate the techniques. The architecture we envision is a model-based structured learning agent. 
In Section 5.3 we presented a model-based learning algorithm that kept a probability distribution over M D P s , and sampled a number of M D P s from the distribution to produce probability distributions over Q-values which could be used for action selection. We propose to use the same approach here, except that the M D P s will be structured, so the distribution will be over structured M D P s , and the M D P s  181  we sample from it will also be structured.  As before, we will use the value of  information-based measure of Section 5.1.2 to select actions, and we can use the same smoothing techniques when computing the V P I . We will use repair-based sampling to produce the samples from the model, so we will need a version of prioritized sweeping that works on structured M D P s in order to update the models without losing the structure we wish to exploit. The models we will be learning contain structure that will be represented by 2TBNs (see Section 2.5.1). We will consider two learning problems: Known Structure In Section 6.2, we examine the case where the structure of the model is given to the learning system in advance.  The agent is given  in advance the structure of the conditional probability tree for each actionvariable pair and the structure of the reward tree, and the model-learning part of the problem is to learn the probability distribution at each leaf of each conditional probability tree, and the reward for each leaf of the reward tree. These trees can then be used to compute the values of states by using any of the algorithms from Section 2.3 or Chapter 3. Obviously we would expect a structured algorithm such as SPI to be used, in order to take advantage of the structured representation. Unknown Structure We extend our investigation in Section 6.3 to include the case where the structure of the conditional probability and reward trees is not given in advance. Here we must combine the techniques we have developed for the known structure case with algorithms to learn the Bayesian network representations [23, 67, 106] for the actions, along with their conditional probability trees [45]. This leads to a number of interesting additional challenges. The first is that we need Bayesian network learning algorithms that only consider the very restrictive types of network structure that make up 2TBNs. A more important problem is that, as with all reinforcement learning, we are doing active learning — we have some control over the data from which we 182  learn. This means that the data we use to learn the structure of the network is biased by the method we use to select actions to perform. Quantifying the effects of this bias, and examining ways to account for it are very interesting challenges. There are several ways in which the existence of structure should influence a reinforcement learning algorithm. Some of these have been investigated previously, while others are new areas of research. We will discuss three here: Learning the Model Faster The most obvious advantage of structured representations is that they allow us to learn the model faster. Without structure, we only learn the transition probabilities for a state from direct observation of the action in that state, whereas in the structured case we may also learn about them from other "similar" states. This is used to speed learning in [107] (see Section 4.5.4). A second and more subtle advantage is that the parameters of the model are generally easier to learn. 
To see why this is so, consider an M D P defined in terms of n binary variables. Since there are 2" states, each probability distribution in the unstructured case is over 2" possible outcomes, while in comparison, each probability distribution in the conditional probability tree is only over two outcomes (the truth or falsehood of a single variable). Even if the transition function is sparse, there may still be a significant number of possible outcomes for an action. Learning a distribution with a single parameter requires much less data than learning a multinomial distribution. More Efficient Update of the Value Function Given an observation of the system, (s, a, r, t), the existence of structure allows us to learn more efficiently. In particular, we get to learn about the effects of action a for all the other states that are "similar" to s. This fact has been exploited by the generalized prioritized sweeping algorithm [3] to speed learning. The algorithm (see Section 4.5.4) works like standard prioritized sweeping [79], keeping a priority queue  183  of states that need to have their value functions recomputed, but as well as updating the priority of s when the observation is made, it also updates all "similar" states' priorities. Generalized prioritized sweeping keeps a flat statebased representation of the value function, and uses structural information only to tell it when to update it. Here we will use a local form of structured value iteration to update the values of a number of states simultaneously. M o r e Effective E x p l o r a t i o n A potential advantage of structure that hasn't yet been investigated is its influence on exploration. The idea here is that we should prefer execution of actions that let us learn about a large number of states to those actions that give us information about only a small number of states, all else being equal. In the case of conditional probability trees, this means that learning parameters that are closer to the root of the tree is preferred, since these parameters contain information about a larger number of states. In order to use an algorithm like Generalized Prioritized Sweeping when we have a structured representation of the value function (and Q-values for actions), we need to be able to perform local value function updates in a structured way. We investigate this in the next section and develop an algorithm which we call prioritized  structured  sweeping which uses these structured local updates. In Section 6.2 we  describe how structured prioritized sweeping can be used with the model-based Bayesian reinforcement learning algorithm of Section 5.3. We also discuss a number of issues to do with Bayesian approaches to learning structured models and selecting actions using V P I . Section 6.3 extends these ideas to briefly examine the case where the structure of the problem is initially unknown and the model structure must be discovered at the same time as the model parameters. Sections 6.2 and 6.3 are very preliminary. We will only sketch how such algorithms could be built from the components we have described in this thesis, rather than describing an actual working system. 184  1. Let the current state be s. 2. Select an action a to perform in the real world, and observe its outcome r and t. 3. Update the model to reflect the new observation. 4. Promote state s to the top of the priority queue. 5. While there is computation time remaining do (a) Pop the top state s' from the priority queue. 
(b) For each action, perform a Bellman backup to recompute its Q-value, set V(s') to be the maximum Q-value, and set A / to be the magnitude of the change in V(s'). s  (c) For each predecessors s" of s', push s" onto the priority queue with its priority set to the sum of its old priority (if it was on the queue already) and 7 max„ Pr(s", a, s')A >. s  Figure 6.1: The prioritized sweeping algorithm.  6.1  Structured Prioritized Sweeping  The SPI algorithm we presented in Chapter 3 uses D B N s along with a tree-based representation of the conditional probability tables for each post-action variable to produce a tree-structured optimal value function and policy. The basic operation used in SPI for constructing the optimal value function was decision-theoretic regression (see Figure 3.10). In this section we present aversion of prioritized sweeping in which the value function is represented in this tree-structured manner. In our structured prioritized sweeping algorithm, the decision-theoretic regression operator is applied locally to make updates to the value function. The advantage of this approach is that changes propagate across the state space much more quickly than with prioritized sweeping or generalized prioritized sweeping. This is because rather than only updating the value of the current state, similar states have their values updated, and parts of the state space where value function is constant can be updated in a single operation. A s we said above, we will assume that the structure is given in advance and only the model parameters need to be learned.  185  The standard prioritized sweeping algorithm is shown in Figure 6.1. To produce a structured version of the algorithm we will need to modify steps 3 and 4, and all the parts of step 5. The most significant changes will be to step 5(b) in which a Bellman backup must be performed on part of a tree-structured value function. We will describe this procedure in the next section. We will also need a structured model for step 3, and in Section 6.1.2 we will describe the model and how we update it. Finally, in Section 6.1.3, we will describe the structured priority queue we need to perform step 4, and steps 5(a) and (c), and put all the pieces together to produce the complete structured prioritized sweeping algorithm. In Section 6.1.4 we will validate our algorithm by presenting empirical results that compare its performance with that of generalized prioritized sweeping.  6.1.1  Local Decision-Theoretic Regression  In the SPI algorithm described in Chapter 3 we updated the value function for a fixed policy by applying the decision-theoretic regression operator shown in Figure 3.10. The operator takes a value function and action as input and produces as output the Q-value for the action given the value function. The local version of the operator acts in exactly the same way except that it operates only on a subset of the state space. It performs a single Bellman backup in this subset of the state space in much the same way as asynchronous value iteration algorithms (see for example [5]), and could easily be used to implement structured versions of these algorithms. The local decision-theoretic regression operator takes as its inputs a treestructured value function Tree(V),  an action a, a logical sentence (in practice, a con-  junction of negated and unnegated variables)  tp,  and a Q-tree for a,  Note  Tree(Q ). a  that unlike SPI and A S V I , Q-trees are stored as well as value trees. 
The operator produces as output a new Q-tree for  V,  a and  <p,  Tree(Qa' )• v  The crucial difference Tree(V)  between it and decision-theoretic regression is that for every sub-tree of that is inconsistent with y>, the value of the corresponding leaves of  186  Tree(Qa' ) v  are  I n p u t : Tree(V), an action a, a logical sentence tp, a Q-tree for a Tree(Q ), Tree(R). Output: Tree{Ql' ). a  and  v  1. Let PTree(V,a,tp) be the P Regress (Tree (V) ,a,tp, Tree (Q ) )•  tree  returned  by  Local-  a  2. For each branch b of PTree(V,  a, tp) with leaf l that is consistent with tp: 0  (a) Let Pr be the joint distribution formed from the product of the individual variable distributions at b  (b) Compute: v=  P^iP'Wib')  b  b'eTree(v) where b' are the branches of Tree(V), Pr (6') is the probability according to the distribution at /(, of the conditions labeling branch b', and V(b') is the value labeling the leaf at the end of branch b'. (c) Re-label leaf It, with v to produce the new "future value" tree FVTree {V, a, <p). 6  0  3. Discount every leaf of FVTree(V, a, tp) that is consistent with tp with discount factor 7. 4. Merge the leaves of FVTree (V, a, tp) that are consistent with tp with Tree(R) using addition to combine the values at the leaves. Simplify the resulting tree, which is Tree (Qa''*')• Figure 6.2: Local- Regress (Tree (V), a, ip, Tree (Q ), Tree (R)), theoretic regression algorithm. a  taken directly from Tree(Q ) a  the local decision-  (that is, the Q-value for a is unchanged).  In this  context, we say that a subtree t of some tree Tree(V) is inconsistent with a logical sentence (p if there is no state s consistent with the assignments from the root of Tree(V) to the root of t that is also consistent with <p. The local decision-theoretic regression operator is shown in Figures 6.2 and 6.3.  In analogy with the decision-theoretic regression operator shown in Figures  3.10 and 3.11 we refer to the tree produced by Local-PRegress  as PTree (V, a, tp) and  the corresponding future-value tree as FVTree (V, a, tp). In fact, only the subtrees that are consistent with tp contain probabilities or future values. The remainder of the trees is unchanged from the Q-tree they were constructed from. 187  I n p u t : Tree(V), an action a. a logical sentence (p, and a Q-tree for a Tree(Q ) O u t p u t : PTree(V,a). a  1. If Tree(V) is a leaf node, return an empty tree. 2. Let X be the variable at the root of Tree(V). Let Tx = Tree(a,X) be the conditional probability tree for X in action a. 3. For each leaf / of Tx that is inconsistent with <p, replace the leaf with Tree(Q ) (simplified by removing redundant variables). a  4. For each Xi £ val(X) that occurs with non-zero probability at some leaf of Tx that contains a state consistent with (p, let: (a) T% be the subtree of Tree(V) attached to the root by the arc labeled t  (b) T^. be the tree produced by calling Local-PRegress(T^.,a). 5. For each leaf / £ Tx that contains a state consistent with cp labeled with probability distribution i V : (a) Let vah(X) = a:,- £ val(X) : Pr (x ) > 0. i  i  (b) Let T\ = Merge (T : X{ £ vali(X)) using union to combine the labels (probability distributions) at the leaves. B  (c) Revise Tx by appending T\ to leaf /, again using union to combine the leaf labels. 6. Return PTree (V, a, <p) =  T. x  Figure 6.3: Local-PRegress(Tree(V),a,(p,Tree(Q )). PTree{V,a,<p). a  188  The algorithm for producing  z  Update Y X t i  z  '  '  '  \  z : i . o ;  \  Z  z  10  Z : 1.0 !  Y  Y  1  5  . 
, Value tree T  Y  \ \ \ s  :  10  /  ,. .. ,_ (a) Initial Q-tree  Z : 0.0 -1 Z:0.0  Z : 0.0 Y : 0.9  Z : 0.0' -1  (c) Unsimplified probability tree  (b) Partial probability tree  T  X  Z  / \  ! 10 ,  | Z : l.Oi  19  X |Z:0.0; 'Y:0.9!  Y X  X Z:0.0 -1  (d) Simplified probability tree  J 4.5 i  (  e) F u t u r e  v a  -1  i  u e  4.05  (f) Q .  -1  t r e e  Figure 6.4: Local decision-theoretic regression of a value tree through the action in Figure 6.5 to produce the Q-tree for the action given the value function.  189  Figure 6.5: A simple action represented using a D B N , and the reward tree for the MDP. An example application of the local decision-theoretic regression operator is shown in Figure 6.4, in which we apply the operator to the XY states. Not that we update all leaves of the Q-tree that contain any state where X is true and Y false. The initial value tree is shown in (a) (it will also be used to provide the default Q-values), and the action being regressed through is the one shown in Figure 6.5. We are making local changes whenever Y is false and X is true. First we determine the conditions under which the action will have different expected future value with respect to the value tree V. Since V depends on variable Z, the expected future value depends on the conditions that make Z true or false after the action is performed. These conditions are found in the conditional probability tree for Z (see Figure 6.5). Each branch of this tree tells us exactly the conditions we want—the conditions under which the action will lead with fixed probability to each partition of the state space induced by V. This tree is shown in Figure 6.4 (b), but since the subtree where Z is false and Y is true is outside the scope of the operation, it is not updated, and its default value is used instead. When Z could be false, V also depends on Y, so we add the conditional probability tree for Y wherever Z has a non-zero probability of being false, resulting in Figure 6.4 (c), which can then be simplified by removing the redundant branches to get (d). Again, we have  190  used default values for branches which are outside the scope of our local update. Since each leaf of this tree that we are updating is labeled with a distribution over the value of Z and Y, we can now compute the future value of performing the action, which is given in Figure 6.4 (e), and the leaves we are operating on are then discounted, and the rewards are added to get the final Q-tree for the action given V, which is shown in Figure 6.4 (f).  6.1.2  The Structured Model  Following [3], we update the model and compute transition probabilities by keeping for each leaf l ,Xi  m  a  the conditional probability tree for action a and variable X%  a count of the number of times the action results in Xi having each of its possible values x],.. .x™. We write Ni x j for the number of times in a leaf / that Xi has t  t  value x\ after the action is performed, and let Pr(l, a, Xi = x{) = J^'N^X  k' Similarly  we record for each leaf IR of the reward tree the average reward received. When we perform action a in state s = {Xi = ccj,.. .,X  = x }, and observe a reward s  n  n  of r, and a transition to a new state t = {X\ = x\,.. .,X  n  counts Ni  '.  iiXiiX  = x^}, we update the  for each variable Xi, where /; is the leaf that includes state s in the  conditional probability tree for Xi. We also update the expected reward at the leaf IR of the reward tree that includes state s. 
Figure 6.6 shows a partially learned model of the action in Figure 6.5. When we observe a transition from state XYZ  to state XYZ,  we update the count for  the indicated parameters, and as a result the probability of Y becoming true when the action is performed in a state where X is true and Y is false changes from 0.67 (4/6) to 0.71 (5/7).  6.1.3  T h e S t r u c t u r e d P r i o r i t i z e d Sweeping A l g o r i t h m  Now that we have a structured representation of the model and a procedure for performing structured local value function updates, we can construct a variant of  191  p(Y) = 1  N(Y) = 4 N(Y) = 2 p(Y) = .67  N(Z) = 6 N(Z) = 0 P(Z) = 1  N(Y =0 N(Y) = 3 P(Y) = 0  Y  N(Z) = 7 N(Z) = 1 p(Z) = .875  Update >  N(Z) = 0 N(Z) = 1 P(Z) = 0  Figure 6.6: Learning the simple action of Figure 6.5. prioritized sweeping that operates on a structured M D P represented using D B N s with tree-structured conditional probability tables. In this section we will assume that the structure of the D B N s and trees is given, and only the parameters (the probabilities at the leaves of the conditional probability trees) must be learned.  1  Generalized prioritized sweeping [3] also makes this assumption, and examines the implications of structure on the states that need to be updated, and on the priority queue.  However, they use a state-based value function, and update the values  of states individually.  As we shall see, keeping a structured value function can  considerably improve the speed of learning, but also raises many new issues that must be examined. To describe the structured prioritized sweeping algorithm, we need the following notation. As we have before, we use D B N s with conditional probability trees to represent the structure. Let l ^x be some leaf of the tree for variable X and aca  tion a, Tree (a, X). We write Assigns(l x) at  for the assignment of values to variables  We will examine what happens when we remove this assumption in Section 6.3, although in a slightly different context. 1  192  1. Let the current state be s = {X\ = x\,..., X = x }. s  n  n  2. Select an action a to perform in the real world, and observe its outcome r and t = {A'j = x\,...,  X  -x }. f  n  n  3. Update the parameters of the structured model to reflect the new observation. 4. For each variable Xi, find the leaf l ,x, Tree (a, Xi) that contains s, and add (Assigns(l Xi), «•) to the priority queue with priority computed from the magnitude of the change in value due to the parameter changes. m  a  at  5. While there is computation time remaining do (a) Pop the top item (<p, a) from the priority queue. (b) Perform local decision-theoretic regression on (p and a by calling Local-Regress(Tree(V),a,(p,Tree(Q ),Tree(R))  where Tree(V)  a  is the  current value tree, Tree(Q ) is the current Q-tree for a, and Tree(R) is the current model of the reward function. Let the new Q-tree that results from the regression be Tree (Q ' , a). a  V V  (c) Merge Tree (Q ' , a) with the Q-trees for the other actions to produce a new value tree Tree(V). V V  (d) For each leaf / of Tree(V) such that the change in the value from the corresponding leaf in Tree(V) to / is A / > 0, and each action a, find all the sets of states cp such that the probability p of reaching a state in / by performing a is constant. A d d each (p to the priority queue with priority fAip with the corresponding action a. 6. Set Tree(V)  = Tree(V)  and Tree{Q ) a  = Tree{Q < , v  v  a). Go to step 1.  
Figure 6.7: The structured prioritized sweeping algorithm. associated with l x—that a<  is, the variable assignments made in the path from the  root of the tree to the leaf. The structured prioritized sweeping algorithm is shown in Figure 6.7. It uses the following data structures: • For each action, the D B N representation for the action, and a conditional probability tree for each variable. • A value tree, which is an estimate of the expected value of following the optimal policy.  193  • For each action, a Q-tree which records the expected value of performing the action assuming that the future value received is given by the current value tree. • A priority queue of pairs that consist of a partial assignment of values to variables, and the action for which the set of states need updating. For example, the pair (XY,a)  indicates that the Q-tree for action a needs updating for all  states in which both X and Y are false. The priority of each pair in the queue is an estimate of the magnitude of the change in Q-value that would result for every state in the assignment if the update were to be performed. The major change to the algorithm in our structured version is that updates are action specific. In prioritized sweeping, an update consists of recomputing the value of all actions in a state, taking the maximum, and updating the value function appropriately.  In the structured algorithm, because updates are more expensive  (but act over more states), and because the sets of states whose value functions needs updating may differ for different actions, we update the Q-trees for each action independently. This ensures that computational effort is only spent where value updates are needed. Step 5(c) of the algorithm in Figure 6.7 is quite expensive to perform after each Q-value update. Even though we only have to perform the merge on a small part of each tree (the part that corresponds to the changes in the Q-tree for a), we do. have to merge subtrees of all \A\ Q-trees to produce the new value tree. We can avoid this by storing in each leaf of the value tree the action that is maximal in that leaf. The new Q-tree Tree(Qa^) to produce TreeiV').  can then be merged directly with  Tree(V)  The only added complexity occurs when a is the action that  was maximal at some leaf / of Tree (V), and the value at the corresponding leaf in Tree(QX'' ) p  is less than the value at / (that is, the update has decreased the Q-value  for a in the states corresponding to /). In this case, we no longer know which action is maximal, so we must again merge the corresponding subtrees of all the Q-trees 194  to find the new value at /. When we perform action a in state s = {X\ = x\,...,X  n  = x }, and observe s  n  a reward of r, and a transition to a new state t = {X\ = x\,..., X = x }, we update t  n  for each variable Xi, the probability at leaf l ,Xn where a  n  is the leaf that includes  state s in the conditional probability tree for X{. This update potentially affects the value of every state that agrees with s on the value of Assigns(l x) ai  f ° each r  variable X, so the priorities for all these states need to be adjusted (Step 4 of the algorithm in Figure 6.7). Using the M D P in Figure 6.5 as an example, if we perform the action in a state where X, Y and Z are all false, and the state doesn't change as a result of the action, we update the counts for the rightmost leaf in each tree. 
When we perform action a in state s = {X_1 = x_1^s, ..., X_n = x_n^s}, and observe a reward of r and a transition to a new state t = {X_1 = x_1^t, ..., X_n = x_n^t}, we update, for each variable X_i, the probability at leaf l_{a,X_i}, where l_{a,X_i} is the leaf that includes state s in the conditional probability tree for X_i. This update potentially affects the value of every state that agrees with s on the values in Assigns(l_{a,X}) for each variable X, so the priorities for all these states need to be adjusted (Step 4 of the algorithm in Figure 6.7). Using the MDP in Figure 6.5 as an example, if we perform the action in a state where X, Y and Z are all false, and the state doesn't change as a result of the action, we update the counts for the rightmost leaf in each tree. As a result of this, any state where X and Y are both false will potentially need updating, because its transition probabilities are partially determined by the rightmost leaf of the tree for Y, and similarly, any state where Y and Z are both false may require updating due to the change in the tree for Z. To make these changes, both XY and YZ will have their priorities increased (or will be added to the priority queue). We discuss the details of this (Step 5(d) of Figure 6.7) below.

In state-based prioritized sweeping, changes in model parameters affect only a single state, and hence it is not worth estimating the change's effect on value to decide priority. However, in the structured case, many states can be affected by a single changed parameter, so estimating the effects of this change on value is important. In particular, after many rounds of learning the effects of each model change become smaller and smaller, so it is advantageous not to update the value function automatically after each change. This is especially true as updates caused by model changes are in general more computationally demanding than value propagation updates. This is because model changes tend to affect many more states than value function changes, since the conditional probability trees tend to be simpler than the value tree.

In [3], Andre, Friedman and Parr provide a formula that estimates the change in value due to a parameter change in the model. Unfortunately, their formula involves summing over all states that are similar to the state whose parameters were updated. This summation could be over a significant subset of the state space and is therefore impractical for large problems, particularly ones that exhibit a lot of structure. The authors suggest approximating the formula by a quantity P_X that estimates the effect on the value function of a change in the model parameters for variable X. The idea is that a model update for a state s = {X_1 = x_1^s, X_2 = x_2^s, ...} consists of an update to the parameters of each X_i, and P_{X_i} estimates the change in value due to this change in the parameters for X_i. For an update in state s, the priority of a state t would be

priority(t) = Σ_{i : x_i^s = x_i^t} P_{X_i},

that is, the sum of the P_{X_i} estimates for all the variables for which s and t have the same value. It is desirable that P_X should overestimate the true change in value, to ensure that the corresponding states eventually get updated.

A number of possible approximations could be used to estimate the value change after a model update. In the results below, we compute the change in value for the actual state and action that was observed, and use this value for every parameter that was updated, as sketched below. In most cases, this value will considerably overestimate the actual change in value due to the update, because we expect the observed state to have the maximal change in value: its transition probabilities for every variable are potentially altered. Similarly, by using the state's value change as the priority for every variable, we ensure that we overestimate each variable's contribution to the value function change. A less conservative approach might be to split the value change among the variables, perhaps by measuring the value function's sensitivity to each variable, but we will not investigate this here.
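A minimal Python sketch of this priority computation for a single observed transition follows. The helper leaf_containing and the flattened representation are assumptions made for illustration, not the thesis implementation.

    def priorities_from_model_update(s, action, old_q, new_q, variables):
        """Return {(partial_assignment, action): priority} entries for one observation.

        old_q, new_q: Q-values of (s, action) before and after the parameter update.
        Their difference is used as a conservative P_X for every updated variable.
        """
        p_x = abs(new_q - old_q)              # one overestimate shared by all variables
        entries = {}
        for x in variables:
            leaf_assign = leaf_containing(action, x, s)   # Assigns(l_{a,X}) for state s
            key = (frozenset(leaf_assign.items()), action)
            # Entries that coincide are summed, mirroring how duplicate queue items are handled.
            entries[key] = entries.get(key, 0.0) + p_x
        return entries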
Like generalized prioritized sweeping, we sum the priorities when we add an item to the priority queue that is already present. Currently we make no attempt to merge queue elements that overlap. For example, if (X, a) is on the priority queue with priority p, and (XY, a) is subsequently added with priority q, we keep the entries separate. In a more advanced implementation, we might instead replace the two with an entry (XY, a) with priority p + q and an entry covering the remainder of (X, a) with priority p. Theoretically, overlapping queue items in this way could result in states that have significant changes in their value function not being updated, because their priority is spread over a number of queue entries. However, as our results below show, our simplistic approach performs well, and we have not yet observed a situation where this overlap of queue entries has prevented updates from being performed.

Value Propagation

To perform a value-propagation step (Step 5 of the algorithm), we pop the highest priority set of states off the queue to get a partial assignment of values to variables φ = {X_i = x_i, X_j = x_j, ...}, and an action a. We then use the local decision-theoretic regression operator to update the values of all the leaves l of the Q-tree for a such that Assigns(l) is consistent with φ. The updated Q-tree for a is then compared with the current value tree, and any changes to the value tree are made as described above.

When a change is made at a leaf l of the value tree, we need to identify all the aggregate states that may require updating (Step 5(d) of Figure 6.7). For each action a and variable X = x in Assigns(l), we build a set ψ_X of assignments that correspond to the possible ways of making X = x true by performing a. To do this we traverse Tree(a, X), and for each leaf l' in which X = x with non-zero probability, we add Assigns(l') to ψ_X. The cross-product of the ψ_X sets² is the set of regions of the state space (partial assignments of values to variables) that, when a is executed in them, have a non-zero probability of resulting in a state that corresponds to leaf l, and the product of the probabilities is the probability of reaching a state in l. For example, in Figure 6.5, if the value function was changed at the rightmost leaf (ZY), then ψ_Y = {XY, XY} and ψ_Z = {YZ, YZ}, and the cross product is {XYZ (p = 0.1), XYZ (p = 1.0)}. Each of these assignments has its priority increased by its probability of reaching l multiplied by the change in value at l. As with prioritized sweeping, the sparseness of the transition function (in our structured representation, the number of zeros and ones in the conditional probability trees) means that the number of entries added to the priority queue remains manageable.

²Not all combinations are possible, since for some leaves l_i and l_j, Assigns(l_i) and Assigns(l_j) may assign different values to the same variable.
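The following Python fragment sketches Step 5(d) under simplifying assumptions: binary variables, conditional probability trees flattened into lists of (assignment, probability-of-true) leaves, and illustrative helper names. The caller would invoke it once per action.

    from itertools import product

    def regions_to_enqueue(leaf_assign, action, cpts, delta):
        """Yield (partial_assignment, priority) pairs for a changed value-tree leaf.

        leaf_assign: dict {variable: value} for the changed leaf l.
        cpts[(action, var)]: list of (assigns, p_true) leaves of Tree(action, var).
        delta: the change in value at l.
        """
        psi = []    # per variable: the (assigns, prob of producing the required value) leaves
        for var, val in leaf_assign.items():
            options = []
            for assigns, p_true in cpts[(action, var)]:
                p = p_true if val else 1.0 - p_true
                if p > 0.0:
                    options.append((assigns, p))
            psi.append(options)
        for combo in product(*psi):
            merged, prob, consistent = {}, 1.0, True
            for assigns, p in combo:
                for v, x in assigns.items():
                    if merged.setdefault(v, x) != x:
                        consistent = False       # two leaves assign different values to v
                prob *= p
            if consistent:
                yield merged, prob * delta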
This gives us all the important pieces we need for the structured prioritized sweeping algorithm of Figure 6.7. The only thing left to describe is the action selection mechanism from Step 2. We will return to this in Section 6.2 when we sketch how the Bayesian exploration method of Chapter 5 operates in structured problems, but for now we will assume the same counter-based action selection method used in standard prioritized sweeping and described in Section 4.4.1. We use this method in the experiments below to directly compare the performance of our algorithm with that of generalized prioritized sweeping (see Section 4.5.4).

6.1.4 Results

We have conducted experiments in several domains to evaluate the effects of doing structured updates to the value function as opposed to the state-based value function used by generalized prioritized sweeping. The first domain we present here is the linear domain with eight variables from Section 3.1.6. This domain has 256 states and eight actions, and is an example of a problem with a great deal of structure. As was the case with SPI, it represents "best-case" performance for our algorithm. For our second problem, we used a domain based on the process-planning problems, also from Section 3.1.6. The problem is slightly different, as our current implementation of the structured prioritized sweeping algorithm only supports binary variables. It has 1024 states and eight actions, and is designed to be a more realistic problem (see Appendix A for details of this problem). The optimal value function consists of a tree with 142 leaves, so on average each leaf aggregates seven states together. In all cases, the results presented are averaged over ten runs, although the variance between runs was quite low for these problems. As before, the graphs are smoothed to improve their readability.

Figure 6.8: Structured prioritized sweeping compared with generalized prioritized sweeping on the 256 state linear (best-case) domain.

Figure 6.9: Structured prioritized sweeping compared with generalized prioritized sweeping on the process-planning domain.

Figures 6.8 and 6.9 compare the performance of structured prioritized sweeping with generalized prioritized sweeping (note that we didn't include prioritized sweeping in the comparison because generalized prioritized sweeping always performs at least as well as it). The graphs show the actual discounted future reward received for each step of the algorithm. For each algorithm, we run a series of trials in which the agent selects 20 real-world actions. At the end of each trial, the state of the system is randomly reset to a state with reward zero, thus ensuring that the algorithms must learn a policy that covers a large part of the state space. In both cases, a discount factor of 0.9 and counter-based (T_bored) exploration was used. In the linear domain, T_bored = 2 for both algorithms, while for the process-planning domain it was 3 for structured prioritized sweeping, and 5 for generalized prioritized sweeping. In all cases, this was the optimal value for the parameter. Generalized prioritized sweeping performs ten sweeping steps per real-world step, while we show structured prioritized sweeping with both ten and five backups per real-world step. Since structured backups are more costly, allowing only five backups per real action gives a fairer comparison with generalized prioritized sweeping in terms of computation time. In our implementation, structured prioritized sweeping actually runs somewhat faster than generalized prioritized sweeping on these problems because of the additional costs incurred by generalized prioritized sweeping to translate between feature-based and state-based representations when updating Q-values.
This effect is less pronounced in the larger domain because the number of Q-tree leaves that structured prioritized sweeping changes in each update increases as the problem complexity increases, so the number of Bellman backups performed per update increases.

We can see from the graphs that structured prioritized sweeping learns more quickly than generalized prioritized sweeping, even when only half as many value backups are performed. This demonstrates the strength of the structured approach—more generalization is done, and hence learning is improved. The effect is particularly pronounced in the linear domain because of the very large amount of structure present in the problem. The graphs also show that the difference in performance between five and ten sweeping steps per real action is relatively small for structured prioritized sweeping. This suggests that the most important updates are being performed first, which gives us confidence that the priorities being used are reasonable.

The similarity between the curves in Figure 6.9 is due to the rather artificial way that the trials were performed. Because the domain includes goal states which are never left once reached, the state was randomly reset to a zero-reward state at regular intervals. This resetting led to the regular falls and rises in performance seen in the graphs, because after each reset the future reward would be low, and then slowly increase once a good policy had been learned. The two runs of structured prioritized sweeping are in fact following optimal policies after approximately 2500 steps.

6.2 Bayesian Reinforcement Learning Where the Structure is Known

We now turn our attention to combining our structured representation techniques with the Bayesian approach to exploration we described in Chapter 5. As with structured prioritized sweeping, we will initially examine the case where the problem structure is given in advance and only the parameters of the model must be learned. We emphasize again that the rest of this chapter is very preliminary; we haven't implemented any of the ideas we will propose, but include them here to give a feel for how the work in this thesis fits together. Assuming that the problem structure is known in advance allows us to concentrate on a number of other important issues.

As we said above, we will use the same basic algorithm as we described in Section 5.3. The algorithm is shown in Figure 6.10:

1. Let the current state be s.

2. Select an action a using VPI on the current Q-distributions for s. Perform a, resulting in a new state t and a reward r.

3. Update the structured model to reflect the observed transition from s to t under a, and the reward r received.

4. Use the updated structured model to compute new Q-distributions using one of the methods described in Section 5.3.4.

5. Go to Step 1.

Figure 6.10: The structured Bayesian exploration algorithm for reinforcement learning when the structure is known in advance.

We will build an uncertain structured model of the world—a probability distribution over structured MDPs—use the model to compute our uncertainty about value functions, and use both the model and the Q-distributions to select a new action to perform. As with the model-based algorithm presented in the last chapter, there are three aspects to this algorithm. These are:

Model Uncertainty: How to represent our uncertainty about the underlying MDP, and how to update this uncertain model when a new observation is made. We examine this in Section 6.2.1.

Q-Value Uncertainty: How to represent our uncertainty about the Q-value of each state and action, and how to produce Q-value distributions from distributions over structured MDPs. We discuss the issues involved with this in Section 6.2.3.

Action Selection: How to select actions to perform in the real world. As we said above, we want to use an action selection scheme based on value of information (see Section 5.1.2). However, the use of VPI for selecting actions adds a number of additional complications to the algorithm, which we discuss in Section 6.2.4.
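A minimal Python rendering of the loop in Figure 6.10 follows. Every name here is an illustrative placeholder (the evpi term is the value-of-information bonus whose estimation is sketched in Section 6.2.4), so this is a sketch of how the three pieces fit together rather than a definitive implementation.

    from statistics import mean

    def structured_bayesian_rl(env, model, evpi, n_steps):
        """The loop of Figure 6.10: act by VPI, update the structured model,
        then refresh the Q-distributions from the updated model."""
        s = env.current_state()
        for _ in range(n_steps):
            q_dists = model.q_distributions(s)   # per-action Q-value samples from sampled MDPs
            a = max(q_dists, key=lambda act: mean(q_dists[act]) + evpi(q_dists, act))
            t, r = env.execute(a)
            model.update(s, a, t, r)             # Beta/Dirichlet counts at the CPT leaves
            model.refresh_value_estimates()      # e.g. by structured prioritized sweeping
            s = t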
6.2.1 Model Uncertainty

We represent a probability distribution μ over structured MDPs in a very similar way to the one we used for unstructured MDPs in Section 5.3.1. In the unstructured case, the parameters we set out to learn were the transition probabilities, so we had |S||A| parameters, each of which was a multinomial distribution with |S| hyper-parameters, or, if we used sparse multinomials, with B hyper-parameters, where B was the branching factor (the number of states with non-zero entries in the transition matrix). In the structured case, we have a parameter for every leaf of the conditional probability tree for each action and variable. This means we have nL|A| parameters, where n is the number of variables in the MDP (for an MDP composed of binary variables, n = log_2 |S|), and L is the number of leaves per tree. Each of these parameters is also much simpler than in the unstructured case, because it consists of a probability distribution over the values of the variable (and therefore has as many hyper-parameters as the variable has values), rather than over all the states of the MDP.

Figure 6.11 shows a structured action representation for an MDP that consists of three binary variables. To learn an unstructured model of this action would therefore require learning eight probability distributions, each of which is a multinomial distribution, potentially over all eight states. In comparison, learning this structured model still requires learning eight probability distributions (represented by the eight boxes in the figure), but each is a univariate distribution, since it is only over the probability of a single variable being true.

Figure 6.11: A structured representation of an action. The boxes represent parameters that must be learned.
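A sketch of how such a structured model might be stored and updated in Python: one pair of Beta counts per leaf of each conditional probability tree, with a uniform Beta(1, 1) prior assumed here for simplicity. The class and method names are illustrative, and the refinement discussed later, where a parameter is treated as deterministic until both outcomes have been observed, is omitted.

    class StructuredModel:
        """One (alpha, beta) Beta count pair per leaf of each Tree(a, X)."""

        def __init__(self, cpt_leaves):
            # cpt_leaves[(action, var)] is a list of partial assignments (dicts),
            # one per leaf of the conditional probability tree for var under action.
            self.leaves = cpt_leaves
            self.counts = {(a, x, i): [1.0, 1.0]
                           for (a, x), leaves in cpt_leaves.items()
                           for i in range(len(leaves))}

        def leaf_index(self, action, var, state):
            """Index of the leaf of Tree(action, var) whose assignment matches `state`."""
            for i, assigns in enumerate(self.leaves[(action, var)]):
                if all(state[v] == val for v, val in assigns.items()):
                    return i

        def update(self, s, action, t):
            """Record one observed transition s -> t under `action`."""
            for var in t:
                i = self.leaf_index(action, var, s)
                alpha, beta = self.counts[(action, var, i)]
                self.counts[(action, var, i)] = ([alpha + 1.0, beta] if t[var]
                                                 else [alpha, beta + 1.0])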
Let a be an action in the MDP being learned, and let x be one of the variables in the MDP. Let Assigns(l) be the partial assignment of values to variables that corresponds to the path to some leaf l in the conditional probability tree for x and a. We write θ_{a,x,l} for the parameter (that is, the probability distribution) at this leaf. Similarly, we write θ_l^r for the parameter at leaf l of the reward tree. As a convenient shorthand, if a state s agrees with the partial assignment Assigns(l) for some l in either a conditional probability tree or the reward tree, we will write θ_{a,x,s} (or θ_s^r) for the parameter at the leaf l which describes the effects of the action (or the reward) at s. As before, we will assume parameter independence. For any MDP θ:

Pr(θ | μ) = ∏_{l ∈ Tree(R)} Pr(θ_l^r | μ) · ∏_{a ∈ A} ∏_{x ∈ X} ∏_{l ∈ Tree(a,x)} Pr(θ_{a,x,l} | μ)

Note however that this assumption is a lot more reasonable than it was in the unstructured case we discussed in the previous chapter. There the problem was that many states can be expected to have quite similar transition functions, and so assuming they were independent was somewhat unrealistic. However, the structure that made the parameter independence assumption unreasonable in the previous chapter is exactly the structure that we are explicitly representing with our 2TBN model. Thus in the structured case, where there are states with correlated transition probabilities or rewards, these states will already be clustered together by the structure of the conditional probability and reward trees, and hence most of the dependencies present in the unstructured case will be captured by a single parameter in the structured case. Of course, there may still be dependencies that can't be captured by the structured representation we have chosen. For example, consider an MDP whose state space consists of two variables X and Y. If states behave the same whenever X ∨ Y is true, this cannot be represented compactly by our decision trees, because they can only represent conjunctive sentences. In cases like this, even the stronger parameter independence we get from our structured representation breaks down.

Parameter independence allows us to treat the learning problem as a collection of unrelated local learning problems, one for each parameter. For the θ_{a,x,l} parameters, we are learning a probability distribution over the values of x. For the θ_l^r parameters, the distributions are over the set of possible rewards. For ease of exposition we will assume a finite set of possible rewards that are known in advance, and that all the variables are binary. As before, we will represent the reward parameters as Dirichlet distributions. Since the variables are all binary, we will represent the transition parameters using beta distributions.

6.2.2 The Beta Distribution

The beta distribution is the univariate equivalent of the Dirichlet distribution from Section 5.3.2. As with the Dirichlet, the beta distribution is conjugate, so the posterior after some sequence of observations will also be a beta distribution. A random variable R has a beta distribution with parameters α and β (α > 0, β > 0) if the PDF of R is

f(x | α, β) = [Γ(α + β) / (Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1)  for 0 ≤ x ≤ 1,

and f(x | α, β) = 0 otherwise. Here Γ(x) is the gamma function as before. The beta distribution has the following properties:

E[R] = α / (α + β)

Var(R) = αβ / ((α + β)² (α + β + 1))

As with the Dirichlet, let p(X) ~ Beta(α, β) be a beta-distributed random variable over the possible outcomes of some binary variable X. If we observe i positive outcomes of X and j negative outcomes, then we update the distribution of p by α' = α + i and β' = β + j.
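In code, the update and the posterior summaries are a direct transcription of the formulas above (the worked example reads the counts of Figure 6.6 as Beta parameters, which is an illustrative assumption rather than something the thesis specifies):

    def beta_update(alpha, beta, positives, negatives):
        """Posterior Beta parameters after `positives` true and `negatives` false outcomes."""
        return alpha + positives, beta + negatives

    def beta_mean(alpha, beta):
        return alpha / (alpha + beta)

    def beta_variance(alpha, beta):
        return alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

    # Example: a Beta(4, 2) belief about p(Y), as in Figure 6.6, after one more positive outcome.
    alpha, beta = beta_update(4, 2, positives=1, negatives=0)
    print(beta_mean(alpha, beta))    # 5/7, roughly 0.71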
If we relax the assumption that all variables are binary, the model parameters become distributions over the possible values of the corresponding variables (that is, the parameters in Tree(a, X) are distributions over the possible values of X), and we use Dirichlet distributions rather than beta distributions for the parameters. In Chapter 5 we used sparse multinomials in preference to Dirichlets for the parameters because we expect most transition functions to be sparse. In the structured case, we run into a similar problem; namely, that we expect many of the probabilities being learned to be 0 or 1. In the case of sparse multinomials, we were happy to assume a novel outcome since the number of possible outcomes far exceeded the number of actual ones. However, this is not the case here. For a binary variable, if we have observed the variable being true, the only possible interpretation for the novel outcome is that the variable is false. If we make this assumption for every parameter, we end up with a transition function in which any state can be reached from any other—since every probability is non-zero, there is for each variable a non-zero probability that it is true, and a non-zero probability that it is false.

To overcome this problem, we represent a model parameter using a beta distribution only if the corresponding variable has been observed being both true and false. If only one or the other has so far been observed, we assume that the probability of that outcome is 1. While this is a rather inelegant hack, it is necessary to prevent an explosion in the complexity of the trees due to the connectedness of the transition function. It only has an effect on the algorithm when there are few observations for a particular leaf of a conditional probability tree and one possible value of the variable has not yet been observed. As soon as the variable has been observed to be both true and false, it will be represented using a beta distribution, but if no more observations are made, perhaps because the observations made so far indicate that the action is not worth performing again, this approach could lead to sub-optimal behaviour.

6.2.3 Q-Value Uncertainty

To represent our uncertainty about the Q-values of states and actions, we use the same Q-distributions as we described in Section 5.1.1. Again, we can use any of the sampling approaches described in Section 5.3.4, although the same comments we made there about the relative strengths and weaknesses of each method still hold. However, since we have developed a structured version of prioritized sweeping, we will concentrate here on global sampling with repair.

6.2.4 Structured Value of Information Calculation

When we select an action to perform in a structured problem, we obviously want to take the structure into account. For example, consider what we should do if we have a choice between two actions that have the same expected value, but one provides observations that give us information about a large number of states while the other gives us information about the current state only. All other things being equal, we should favor the first action because it gives us far more information. The question is how to take this into account when selecting actions. In an ad hoc action selection scheme like Boltzmann exploration, we could simply adjust the probability of selecting each action to reflect the amount of information we can gain from it. Action probabilities are initially based on their Q-values, and are then skewed so that the probability of an action that gives us relatively little information about other states is reduced, and the probability of an action that gives us information about model parameters that affect a lot of states is increased.

It turns out that the value of perfect information exploration method we described in Section 5.1.2 takes the value of learning about a greater number of states into account automatically when we apply it to structured problems. To see why this is, we have to look at the effects of structure on the sampled MDPs. The key idea is that a model parameter that affects a large number of states—we will refer to this as a "global" parameter—tends to have a correspondingly large influence on the optimal value function for the MDP.
Such a model parameter affects the transition probabilities for a large number of states, and changing the value of the parameter can greatly change the characteristics of the MDP and hence its optimal value function. In comparison, a parameter that only affects a single state or a small number of states—a "local" parameter—has relatively little effect on the optimal value function, because changing the parameter only affects transitions in a small subset of the state space. Unless the value function is highly dependent on the part of the state space the parameter affects, the change in the value function due to changing the parameter is likely to be small. What this means is that when we consider a set of MDPs sampled from the uncertain model, the variability in their Q-values is likely to be much greater for actions with more global parameters than for actions with predominantly local parameters. The VPI measure for selecting actions will therefore naturally prefer actions with these global parameters.

Since VPI takes structure into account automatically, we will use VPI exactly as in Section 5.1.2 to select actions. We will take the Q-distribution for each action in the current state induced by the sampled MDPs, and compute EVPI using Equation 5.1.
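Because the Q-distribution induced by the sampled MDPs is just a finite set of sampled Q-values per action, the expected VPI can be estimated directly from those samples. The sketch below assumes the gain definition from Chapter 5 (information about an action is worth the amount by which its true value beats the current best action, or, for the current best action, the amount by which it falls below the second best); it is an illustration, not the thesis code.

    import numpy as np

    def evpi(q_samples, action):
        """Estimate the expected value of perfect information about `action`.

        q_samples: dict mapping each action to an array of Q-values, one per sampled MDP.
        """
        means = {a: float(np.mean(qs)) for a, qs in q_samples.items()}
        if len(means) == 1:
            return 0.0
        best = max(means, key=means.get)
        second = max((a for a in means if a != best), key=means.get)
        qs = np.asarray(q_samples[action], dtype=float)
        if action == best:
            # Information is valuable when the supposedly best action is actually worse.
            gains = np.maximum(means[second] - qs, 0.0)
        else:
            # Information is valuable when this action actually beats the current best.
            gains = np.maximum(qs - means[best], 0.0)
        return float(gains.mean())

    # Action selection would then maximize  mean Q(s, a) + EVPI(a)  over actions a.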
As we said at the beginning of the chapter, this section and the next are highly speculative, and thus we do not have experimental results to demonstrate the efficacy of this algorithm, although we expect it to perform better than the standard Bayesian model-based exploration algorithm of Section 5.3 in problems where significant amounts of structure are present. One important issue is the speed at which the algorithm runs. We have already pointed out that global sampling with repair using n sampled MDPs runs n times as slowly as the version of prioritized sweeping it uses. Structured prioritized sweeping already has significant overhead compared with standard prioritized sweeping due to the complexity of the data structures, and in particular due to the large areas of the state space affected by model changes. Add to this the cost of solving integrals to compute EVPI, and of sampling from the models themselves, and this algorithm has serious computational requirements. In some domains, the cost of this computation may make these approaches impractical, but as we have said, in many real problems computation time is far cheaper than performing actions, so this approach will be useful in reducing the number of actions performed to find a good policy.

6.3 Unknown Structure

The last area we will look at in this chapter is what happens when the structure itself is not given in advance. In the last section we assumed that the parameters of the model were unknown, but that the Bayesian network for each action was given, as was the structure of the conditional probability trees. If we remove this assumption, the learning agent must learn the structure at the same time as it uses that structure to learn a model and optimal value function.

Since we are using two-step Bayesian networks for representing actions, the literature on learning Bayesian networks from data is of obvious interest. All of these algorithms take a database of cases as input, and produce a Bayesian network that models the data as output. Each case in the database is an assignment to each variable in the Bayesian network. We assume that the database is constructed by sampling the domain. Bayesian network learning algorithms generally prefer simpler networks over more complex ones, leading to a trade-off between network size and how well the network models the data. Some of the better known algorithms can be found in [23, 67, 106, 45]. They tend to differ in how the quality of a network structure is measured, but all are essentially hill-climbing algorithms that search for local maxima in the space of possible network structures. Learning the network structure for the 2TBNs we use follows exactly from these approaches, except that we do not require prior probability distributions for the nodes in the first time slice of the 2TBN, and hence the functions used to measure the quality of a network structure should ignore the goodness-of-fit of the pre-action variables. The only other potential problem is that these algorithms are not incremental. They expect a database of cases, and build a network structure from them. While a system that incrementally improves the network structure as more information becomes available would be preferable, we can overcome this difficulty by saving a certain number of experiences, and then running one of these batch algorithms to build a new network structure; a sketch of such a scoring step appears at the end of this section. Most of the Bayesian network learning algorithms learn conditional probability tables rather than trees. The exception to this is [45]. In that work, conditional probability trees are learned directly using techniques from [87] which, as the authors show, makes learning more efficient because of the reduction in parameters that need observations.

Unfortunately, there is one major problem with learning the structure in this way, and it is a problem we do not yet have a solution for. All the Bayesian network learning algorithms rely crucially on the assumption that the cases in the database they use to build the network are independent and identically distributed. In our case, however, they are anything but independent. Each observation is closely correlated with the previous one, because the resulting state of one observation is the initial state for the following one. Since we aren't concerned with the prior probabilities of the pre-action variables, this isn't too much of a problem. Unfortunately, we are also using the model we are learning as part of the action-selection mechanism. This means that the structure of the current model influences the actions we choose, and hence biases the observations we get, which are then used to refine the model. As far as we are aware, nobody has investigated learning from databases with systematically biased data of this kind. In practice, it may be easiest to assume that there is no bias, and just apply the standard Bayesian network learning algorithms. We anticipate that in most circumstances this will work fine. The problem comes when an unnecessarily complex model (for example, an overly large conditional probability tree) is learned for some action. Because VPI tends to prefer actions that gain information about larger numbers of states, the over-complex action will tend not to be performed, and so the algorithm may not get an opportunity to re-simplify the model. Fortunately, this situation should be relatively rare.
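As an illustration of the batch scoring step referred to above, the following sketch chooses a parent set for one post-action variable from a saved batch of transitions using a BIC-style score (maximum-likelihood log-likelihood penalized by the number of parameters). It is a generic scheme in the spirit of the algorithms cited above rather than the method of any one of them, and all names are illustrative.

    import math
    from collections import Counter

    def bic_parent_score(transitions, action, child, parents):
        """BIC score of P(child' | parents) for one action, from (s, a, t) transitions."""
        data = [(tuple(s[p] for p in parents), t[child])
                for s, a, t in transitions if a == action]
        if not data:
            return float('-inf')
        joint, marginal = Counter(data), Counter(ctx for ctx, _ in data)
        loglik = sum(n * math.log(n / marginal[ctx]) for (ctx, _), n in joint.items())
        n_params = len(marginal)     # one Bernoulli parameter per observed parent context
        return loglik - 0.5 * math.log(len(data)) * n_params

    def best_parents(transitions, action, child, candidates, max_parents=2):
        """Greedily grow the parent set for `child` while the BIC score improves."""
        parents = []
        score = bic_parent_score(transitions, action, child, parents)
        while len(parents) < max_parents:
            options = [(bic_parent_score(transitions, action, child, parents + [p]), p)
                       for p in candidates if p not in parents]
            if not options:
                break
            best_score, best_p = max(options)
            if best_score <= score:
                break
            parents, score = parents + [best_p], best_score
        return parents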
Chapter 7

Conclusions

Much work in artificial intelligence, particularly in planning, is concerned with solving very difficult problems by identifying restricted sub-domains in which they are easier. Good examples of this can be found in almost all classical planning algorithms: the general problem, finding a sequence of actions that will lead from the current state to a goal state, is difficult to solve, but by assuming that actions are relatively simple, and by representing the actions' effects in a structured way—for example using STRIPS rules [42]—we define a subclass of problems that are much easier to solve. This thesis has been an attempt to apply this idea in the fields of decision-theoretic planning and reinforcement learning. We have described the general problems of planning and learning to act in MDPs, but have mostly restricted ourselves to certain classes of MDP that are easier to find policies for, and have developed algorithms for these types of problems.

The leverage we apply to make the problems easier is structured representations. By representing planning and learning problems in a way that makes structure explicit we showed improved performance in both decision-theoretic planning and reinforcement learning tasks. The advantage that structure gives us is the ability to abstract. Rather than reasoning at the level of individual states, the structured representation allows us to aggregate states together and build abstractions in which aggregated states are treated as if they were a single state. One important contribution of this work is the decision-theoretic regression operator that we describe in Section 3.1.1. It is this operator that allows us to build an abstraction and reason with it, but more importantly, it automatically adjusts the level of abstraction, aggregating states when they need not be distinguished, and disaggregating them when additional distinctions become necessary to find a solution.

In Chapter 3, we presented a set of algorithms for decision-theoretic planning that use the decision-theoretic regression operator to take advantage of structure in a problem to find an optimal or close-to-optimal policy or plan more efficiently. The SPI and ASVI algorithms we described operate on structured representations of MDPs, with the structure in the form of two-step Bayesian networks with decision trees used to store the conditional probability tables. The two algorithms produce structured representations of an optimal policy (or an approximately optimal policy in the case of ASVI), with the policy structure again in the form of decision trees. The results we presented show that these are very powerful techniques when sufficient structure is present in the problem, and the algorithms outperform standard algorithms in many domains, in particular on large problems with lots of structure.

In Section 6.1 we applied many of these same ideas in reinforcement learning. We developed a local variant of the decision-theoretic regression operator and used it in a structured version of the prioritized sweeping algorithm. We showed that we could learn the parameters of a structured model from data, and use this model to maintain structured Q-values. As our results showed, using structure in this way considerably improves learning performance when compared with generalized prioritized sweeping, an algorithm that uses a structured model, but maintains its value functions in an unstructured way.

One critical component of any reinforcement learning algorithm is its exploration strategy—how efficiently it explores its environment and learns from the observations it makes.
Much of our work in reinforcement learning is focused closely on the problem of how to select actions so as to balance immediate performance with future performance, what is known as the exploration-exploitation tradeoff. In Chapter 5, we described two reinforcement learning algorithms. The general approach we took was Bayesian—we explicitly represented and reasoned using measures of how uncertain we were about the actual value of each state and action. By explicitly reasoning about our uncertainty, we created learning agents that explore their environment more efficiently and effectively than current approaches. The two algorithms differ in that one is model-based while the other does not learn a model of its environment.

The model-free algorithm presented in Section 5.2 makes use of prior information in the form of prior probability distributions over the values of particular actions in states. For example, if we think that in a particular state s, action a is better than action b, we can set the prior distribution we have over the value of a to be higher than that of b. Also, if we are fairly confident in our prior estimate of the value of b, we may also wish to reduce the variance of the distribution over the value of b in s to indicate this. As our results (Figure 5.12) showed, providing information in this form can considerably improve the speed of learning. In comparison, conventional learning algorithms that don't reason about uncertainty tend to "wash out" prior information by overwriting it with observations before the priors become useful. The Bayesian approach avoids this by allowing us to specify how strong or good the prior information is using the variance of the distribution—the prior still gets influenced by subsequent observations, but we can control the rate at which this happens so that the prior can still be useful to speed learning. Our other results showed that even with uniform priors, our Bayesian Q-learning algorithm still performs as well as or better than other approaches on many problems.

In Section 5.3 we presented the model-based version of our Bayesian reinforcement learning algorithm. This algorithm uses the same exploration techniques as the model-free algorithm, but learns the parameters of an MDP model of the problem. Although this algorithm makes use of no more prior information than other model-based learning algorithms, it is a stepping-stone towards a model-based algorithm in which a structured model is used. Again our results showed that the model-based Bayesian approach outperforms standard model-based algorithms. However, as with the model-free algorithm, the computational requirements of the algorithm are much higher than those of its competitors. For this reason we expect these algorithms to be useful only when computation time is much "cheaper" than taking actions in the world. Fortunately, there are many real problems that have this characteristic.

The main contribution from these algorithms is the exploration measure we use, myopic value of perfect information. The idea of this measure is to trade off future value received for immediate value. We use the value of information to estimate the future value, and thus can reason directly about the tradeoff.
Although the measure is expensive to compute, and contributes quite significantly to the overall computational requirements of these algorithms, we think that the idea of selecting actions in this way is important, and anticipate that computationally less demanding estimates of this value can be developed.

Chapter 6 was rather more speculative, as we described how all these pieces can be put together to produce a structured model-based reinforcement learning algorithm that takes a Bayesian approach to exploration. The approach is based on the model-based Bayesian exploration algorithm we described in Section 5.3, which uses the optimal value functions of a set of sampled MDPs to construct probability distributions over the values of states. When the model changes as a result of an observation, the agent updates the value functions of the sampled MDPs using prioritized sweeping. Structured prioritized sweeping allows us to do this update in a structured fashion. By using structured prioritized sweeping as the update method for our model-based reinforcement learning algorithm, we hope to build a complete system that can learn structured policies and value functions, select actions in a Bayesian way, and if necessary discover the structure inherent in the problem at the same time as it is learning how to act.

7.1 Future Work

As we have already said, much of the work in Chapter 6 is speculative. Obviously, the most important piece of future work that comes out of this thesis is to further develop and experimentally test the ideas described in Sections 6.2 and 6.3. The structured Bayesian model-based learning algorithm outlined in Section 6.2 requires relatively little new research, but needs an implementation and experimental evaluation. For the case where the structure of the problem must be learned at the same time that the value function is learned, there is considerably more research to be done, including the algorithm used to learn the structure—algorithms for learning Bayesian networks and techniques from decision-tree learning are possibilities—and how the learning algorithm interacts with the method for selecting actions.

As we have already said, another important area that would benefit from future work is the testing of these algorithms empirically on more realistic problems. The most important question we can hope to answer is how much and what kinds of problem structure exist in real-world domains. If realistic domains to test these ideas cannot be found (as is currently the case), then this also provides clear impetus to extend this work in ways that make it more applicable to real problems.

There are a number of other possible directions for future work, which we will discuss in the following sections. For ease of exposition, we divide them into decision-theoretic planning and reinforcement learning research, but there may be considerable overlap between these, with ideas from planning also being applicable in learning algorithms and vice versa.

7.1.1 Future Work in Planning

The most interesting direction for our work in planning is to examine how the SPI and ASVI algorithms can be integrated with other AI techniques for solving MDPs. As we have said, algorithms that use reachability analysis, other abstraction methods, and other structured value function representations (in fact, most of the work described in Section 2.6) are in some ways orthogonal to the ideas we present and could potentially be combined with SPI.
For reachability analysis, one example of this kind of composite approach has already been described in [14]. Combining SPI with the explanation-based reinforcement learning ideas of [38] is another fruitful area of research. The feature-based representation used by SPI increases the effectiveness of the "funnel actions" used by Dietterich and Flann by allowing generalization of a funnel action over states where particular features have different values. An example of this is the flag-collecting maze we described in Chapter 5, in which a funnel action such as "move north until you hit a wall" can be used regardless of which flags the agent is currently carrying.

Extending the SPI algorithm to work in other classes of MDPs is another interesting area of work. The most important of these is partially observable MDPs, on which some work has already appeared in [19]; however, the use of continuous state variables, and of semi-Markov processes in which the time that actions take to execute is modeled, are also of great interest. The use of continuous variables in problems has been extensively studied in the literature on decision-tree induction, and we hope to adapt those solutions for use in SPI. The structured representation can also be extended to allow abstraction over actions as well as states, such as the Pickup(X) actions used in classical planning representations such as STRIPS. Another way to think about this is as a first-order representation of a problem that allows relations over objects to be specified. Adding this extension will greatly increase the representational power of SPI and allow it to be applied in many more decision-making domains, such as scheduling problems in which the sheer number of possible actions overwhelms our current techniques.

7.1.2 Future Work in Learning

There are also many possible future extensions to our work in reinforcement learning. As with planning, of particular interest is the use of continuous state spaces and function approximators. The model-free algorithm is particularly amenable to this, as a set of observations of the value function in various parts of a continuous state space can be used as inputs to a Bayesian neural network [10, Ch. 10], which learns an approximation to the probability distribution throughout the continuous state space and allows posterior means and variances (and other parameters of the distribution) to be computed for arbitrary points in the space. The posterior distribution produced by the neural network can then be used directly by the model-free Bayesian algorithm. A similar approach may also be possible for the model-based algorithm, although there are additional complexities involved with sampling continuous state-space MDPs from the learned model, and then solving these MDPs to find sample state values that can be fed to the function approximator.

As we said in the introduction to Chapter 5, the myopic value of perfect information method we use to select actions is intended to be an approximate way to estimate a solution to the exploration MDP. This exploration MDP is a "meta"-MDP that models the exploration problem for some underlying MDP directly by including the state of that MDP and our current knowledge of it into a single process in which actions move us not only from state to state, but also from belief state to belief state. As we said, this MDP is extremely large (it is continuous in many dimensions), and we cannot hope to solve it exactly.
However, it is also quite structured, and with the right approximations, and by identifying structure we can exploit, we may be able to use it to select actions based on the true value of information for an action, rather than a myopic approximation to it. A locally optimal (or approximately optimal) solution to this MDP would constitute an optimal exploration strategy for the original MDP.

The structured prioritized sweeping algorithm we presented in Chapter 6 also has a number of directions for further research. The decision trees it learns for the policy and value function are very similar to those used by McCallum [73] for reinforcement learning in POMDPs, and it seems likely that the two approaches can be combined to learn a structured value function that not only ignores irrelevant state variables, but can do so in POMDPs. The result of such a union would be an algorithm that produces a minimal model of the POMDP that only distinguishes states with different value functions, regardless of whether those states can be distinguished directly through the most recent observation or whether they must be distinguished by examining the history of observations.

The local decision-theoretic regression operator we described in Section 6.1.1 can be applied to produce structured versions of other planning and learning algorithms. One such is a structured version of real-time dynamic programming [5], where the value function is treated as a heuristic function measuring distance from a goal state or states, and local value function updates are used to improve the performance of the heuristic over time. Doing these updates in a structured way widens the effect of each update, improving the heuristic for other states similar to the current one. A second possible application is as part of an approximation algorithm. Given an approximately optimal policy produced (for example) by the ASVI algorithm of Section 3.2, local decision-theoretic regression could be used to improve the approximation in places where the value function has been overly simplified, adding back value distinctions that significantly affect the policy, but that the approximation has obscured.

Bibliography

[1] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions. Dover, New York, 1964.

[2] S. K. Andersen, K. G. Olesen, F. V. Jensen, and F. Jensen. HUGIN—a shell for building Bayesian belief universes for expert systems. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 1080-1085, Detroit, Michigan, 1989. Morgan Kaufmann.

[3] David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998.

[4] R. Iris Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. In International Conference on Computer-Aided Design, pages 188-191. IEEE, 1993.

[5] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81-138, 1995.

[6] Andy G. Barto and Satinder P. Singh. On the computational economies of reinforcement learning. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E.
Hinton, editors, Connectionist Models: Proceedings of the 1990 Summer School, pages 35-44, San Mateo, CA, 1990. Morgan Kaufmann.

[7] Richard Bellman. Adaptive Control Processes. Princeton University Press, Princeton, New Jersey, 1961.

[8] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, UK, 1985.

[9] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[10] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.

[11] Marko Bohanic and Ivan Bratko. Trading accuracy for simplicity in decision trees. Machine Learning, 15:223-250, 1994.

[12] Craig Boutilier. Correlated action effects in decision theoretic regression. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 30-37, Providence, RI, 1997.

[13] Craig Boutilier, Ronen I. Brafman, and Christopher Geib. Prioritized goal decomposition of Markov decision processes: Toward a synthesis of classical and decision theoretic planning. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.

[14] Craig Boutilier, Ronen I. Brafman, and Christopher Geib. Structured reachability analysis for Markov decision processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 24-32, Madison, WI, 1998.

[15] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1:1-93, 1999.

[16] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1104-1111, Montreal, 1995.

[17] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 2000. To appear.

[18] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pages 115-123, Portland, OR, 1996.

[19] Craig Boutilier and David Poole. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1168-1175, Portland, OR, 1996.

[20] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems, 7, 1995.

[21] B. W. Brown, J. Lovato, and K. Russell. Library of routines for cumulative distribution functions, inverses, and other parameters, 1997. ftp://odin.mdacc.tmc.edu/pub/source/dcdflib.c-1.1.tar.gz.

[22] David Chapman and Leslie Pack Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 726-731, Sydney, 1991.

[23] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.

[24] T. M. Cover and J. A. Thomas. Elements of Information Theory.
John Wiley & Sons, New York, 1991.

[25] Adnan Darwiche and Moises Goldszmidt. Action networks: A framework for reasoning about actions and change under uncertainty. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 136-144, Seattle, 1994.

[26] Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, Providence, Rhode Island, 1997.

[27] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 574-579, Washington, D.C., 1993.

[28] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Planning under time constraints in stochastic domains. Artificial Intelligence, 76:35-74, 1995.

[29] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142-150, 1989.

[30] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.

[31] Richard Dearden. Abstraction and search for decision-theoretic planning. Master's thesis, University of British Columbia, Vancouver, BC, October 1994.

[32] Richard Dearden and Craig Boutilier. Integrating planning and execution in stochastic domains. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 162-169, Seattle, 1994.

[33] Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89:219-283, 1997.

[34] M. H. Degroot. Probability and Statistics. Addison-Wesley, Reading, Mass., 2nd edition, 1986.

[35] William Edwards Demming. Some Theory of Sampling. John Wiley and Sons, 1950.

[36] Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Fifteenth International Conference on Machine Learning, pages 118-126. Morgan Kaufmann, 1998.

[37] Thomas G. Dietterich and Nicholas S. Flann. Explanation-based learning and reinforcement learning: A unified approach. In Proceedings of the Twelfth International Conference on Machine Learning, pages 176-184, Lake Tahoe, 1995.

[38] Thomas G. Dietterich and Nicholas S. Flann. Explanation-based learning and reinforcement learning: A unified view. Machine Learning, 28(2):169-210, 1997.

[39] B. Drabble. Mission scheduling for spacecraft: Diaries of T-SCHED. In Expert Planning Systems, pages 76-81. Institute of Electrical Engineers, 1990.

[40] Denise Draper, Steve Hanks, and Daniel Weld. A probabilistic model of action for least commitment planning with information gathering. In Ramon Lopez de Mantaras and David Poole, editors, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 178-186. Morgan Kaufmann, 1994.

[41] Claude-Nicholas Fiechter. Design and Analysis of Efficient Reinforcement Learning Algorithms. PhD thesis, Department of Computer Science, University of Pittsburgh, 1997.

[42] Richard E. Fikes and Nils J. Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189-208, 1971.

[43] R. E. Frank, W. F. Massy, and Y. Wind. Market Segmentation. Prentice-Hall, New Jersey, 1972.

[44] N.
Friedman and Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. In Advances in Neural Information Processing Systems 11. MIT Press, Cambridge, Mass., 1999.

[45] Nir Friedman and Moises Goldszmidt. Learning Bayesian networks with local structure. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, 1996. Morgan Kaufmann.

[46] Robert Givan, Sonia Leach, and Thomas Dean. Bounded parameter Markov decision processes. Technical Report CS-97-05, Brown University, Providence, Rhode Island, 1997.

[47] Geoffrey J. Gordon. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pages 261-268, Lake Tahoe, 1995.

[48] Vu Ha and Peter Haddawy. Toward case-based preference elicitation: Similarity measures on preference structures. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 193-201, 1998.

[49] Peter Haddawy and AnHai Doan. Abstracting probabilistic actions. In Ramon Lopez de Mantaras and David Poole, editors, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 270-277, Seattle, 1994. Morgan Kaufmann.

[50] Peter Haddawy and Meliani Suwandi. Decision-theoretic refinement planning using inheritance abstraction. In Proceedings of the Second International Conference on AI Planning Systems, 1994.

[51] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, Dordrecht, Netherlands, 1998.

[52] David Heckerman. Probabilistic Similarity Networks. MIT Press, Cambridge, Massachusetts, 1991.

[53] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 1999.

[54] J. H. Holland. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. Michalski, J. Carbonell, and T. Mitchell, editors, Machine Learning II. Morgan Kaufmann, 1986.

[55] R. A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2(1):22-26, 1966.

[56] R. A. Howard and J. E. Matheson. Influence diagrams. In R. A. Howard and J. Matheson, editors, The Principles and Applications of Decision Analysis, pages 720-762. Strategic Decisions Group, CA, 1981.

[57] Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, 1960.

[58] Ronald A. Howard. Dynamic Probabilistic Systems. Wiley, New York, 1971.

[59] Ronald A. Howard and James E. Matheson, editors. Readings on the Principles and Applications of Decision Analysis. Strategic Decision Group, Menlo Park, CA, 1984.

[60] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5:15-17, 1976.

[61] M. D. Johnston and H.-M. Adorf. Scheduling with neural networks: the case of the Hubble space telescope. Computers and Operations Research, 19(3-4):209-240, 1992.

[62] Leslie Pack Kaelbling. Hierarchical reinforcement learning: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pages 167-173, San Francisco, CA, 1993. Morgan Kaufmann.

[63] Leslie Pack Kaelbling. Learning in Embedded Systems.
M I T Press, Cambridge,  1993. [64] Leslie Pack Kaelbling, Michael L . Littman, and Andrew W . Moore. Reinforcement learning: A survey. Journal  of Artificial  Intelligence  Research, 4:237-285,  1996. [65] Michael Kearns and Satinder Singh. Near-optimal performance for reinforcement learning in polynomial time. In Proceedings  Conference  on Machine  Learning,  of the 15th  International  pages 260-268, San Mateo, C A , 1998. Mor-  gan Kauffman. [66] Nicholas Kushmerick, Steve Hanks, and Daniel Weld. A n algorithm for probabilistic least-commitment planning. In Proceedings  Conference  on Artificial  Intelligence,  228  of the Twelfth  pages 1073-1078, Seattle, 1994.  National  [67] Wai Lam and Fahiem Bacchus. Using causal information and local measures to learn Bayesian networks. In Proceedings  tainty in Artificial  Intelligence,  of the Ninth  Conference on Uncer-  San Francisco, C A , 1993. Morgan Kaufmann.  [68] Long-Ji L i n . Programming robots using reinforcement learning and teaching. In Proceedings  of the Ninth  National  Conference  on Artificial  Intelligence,  1991. [69] Long-Ji L i n . Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine  Learning,  8:293-321, 1992.  [70] Michael Littman and Csaba Szepesvari. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the Thirteenth national  Conference on Machine  Learning,  Inter-  pages 310-318, San Francisco, C A ,  1996. Morgan Kauffman. [71] D . G . Luenberger.  Applications.  Introduction  to Dynamic  Systems:  Theory, Models and  Wiley, New York, 1979.  [72] David McAllester and David Rosenblitt. Systematic nonlinear planning. In Proceedings  of the Ninth National  Conference on Artificial  Intelligence,  pages  634-639, Anaheim, 1991. [73] R . Andrew McCallum.  Instance-based utile distinctions for reinforcement  learning with hidden state. In Proceedings of the Twelfth International ference on Machine  Learning.  Con-  Morgan Kaufmann, 1995.  [74] Nicolas Meuleau and Paul Bourgine.  Exploration of multi-state environ-  ments: Local measure and back-propagation of uncertainty. Machine  Learn-  ing, 35(2):117-154, 1999. [75] Nicolas Meuleau,  Milos  Hauskrecht,  Kee-Eung K i m , Leonid  Peshkin,  Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large 229  weakly coupled Markov decision processes. In Proceedings of the 15th  Conference  on Artificial  Intelligence,  [76] Andrew W . Moore. Efficient  National  pages 165-172, Madison, W I , 1998.  Memory-based  Learning  for Robot Control.  PhD  thesis, Trinity Hall, University of Cambridge, England, 1990. [77] Andrew W . Moore. Variable resolution dynamic programming:  Efficiently  learning action maps in multivariate real-valued spaces. In Proceedings International  Machine  Learning  Workshop,  Eighth  1991.  [78] Andrew W . Moore. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. Advances formation  Processing  in Neural In-  Systems, 6, 1994.  [79] Andrew W . Moore and Christopher G . Atkeson.  Prioritized sweeping—  reinforcement learning with less data and less time. Machine  Learning,  13:103-  130,1993. [80] Ann E . Nicholson and Leslie Pack Kaelbling. Toward approximate planning in very large stochastic domains. In AAAI  Theoretic  Planning,  Spring  Analysis.  on Decision  pages 190-196, Stanford, 1994.  [81] R. M . Oliver and J . Q. Smith, editors. 
Influence  Decision  Symposium  Diagrams,  Belief Nets and  Series in probability and mathematical statistics. Wiley,  Chichester, 1990. [82] Judea Pearl. Probabilistic  sible Inference.  Reasoning  in Intelligent  Systems:  Networks  of Plau-  Morgan Kaufmann, San Mateo, 1988.  [83] John L . Pollock. The logical foundations of goal-regression planning in autonomous agents. Artificial [84] Martin L . Puterman.  namic Programming.  Intelligence,  Markov  Decision  106:267-335, 1998. Processes:  Wiley, New York, 1994. 230  Discrete  Stochastic Dy-  [85] Martin L . Puterman and M . C . Shin. Modified policy iteration algorithms for discounted Markov decision problems.  Management  Science,  24:1127-1137,  1978. [86] J. R. Quinlan. C'4-5: Programs for Machine  Learning.  Morgan Kaufmann,  1993. [87] J . R. Quinlan and R. Rivest.  Inferring decision trees using the minimum  description length principle. Information [88] B . D . Ripley. Stochastic  Simulation.  and Computation,  Wiley, N Y , 1987.  [89] Ronald L . Rivest. Learning decision lists. Machine  Learning,  [90] Stuart J . Russell and Eric Wefald. Do the Right Thing: Rationality.  80:227-248, 1989.  2:229-246, 1987.  Studies in  Limited  M I T Press, Cambridge, 1991.  [91] R. A . Saleh, K . A . Gallivan, M . Chang, I. N . Hajj, D . Smart, and T . N . Trick. Parallel circuit simulation on supercomputers. Proceedings of the IEEE, 77(12):1915-1930, 1990. [92] M . J. Schoppers. Universal plans for reactive robots in unpredictable environments. In Proceedings of the Tenth International  Intelligence,  Joint Conference  on  Artificial  pages 1039-1046, Milan, 1987.  [93] Anton Schwartz.  A reinforcement learning method for maximizing undis-  counted rewards.  In Proceedings  Machine  pages 298-305, Amherst, M A , 1993. Morgan Kaufmann.  Learning,  [94] Ross D . Shachter.  of the Tenth International  Evaluating influence diagrams.  Conference  Operations  on  Research,  33(6):871-882, 1986. [95] Satinder Singh and David Cohn. How to dynamically merge Markov decision processes. In Michael I. Jordan, Michael J . Kearns, and Sara A . Solla, editors,  231  Advances  in Neural  Information  Processing  Systems, volume 10. The M I T  Press, 1998. [96] Satinder P. Singh and Richard C . Yee. A n upper bound on the loss from approximate optimal-value functions. Machine  Learning,  16:227-233, 1994.  [97] Satinder Pal Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine  Learning, 8:323, 1992.  [98] David E . Smith and Mark A . Peot. Postponing threats in partial-order planning. In Proceedings  of the Eleventh National  Conference  on Artificial  Intelli-  gence, pages 500-506, Washington, D . C . , 1993. [99] M . J . Stefik. Planning and meta-planning. Artificial  Intelligence,  16:141-169,  1981. [100] Richard S. Sutton. Temporal Credit Assignment  in Reinforcement  Learning.  PhD thesis, University of Massachusetts, Amherst, M A , 1984. [101] Richard S. Sutton. Learning to predict by the method of temporal differences. Machine  Learning,  3(l):9-44, 1988.  [102] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings Seventh International  Conference  on Machine  Learning,  of the  pages 216-224, 1990.  [103] Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART  Bulletin,  2:160-163, 1991.  [104] Richard S. Sutton. 
Appendix A

Example MDPs

This appendix includes the file format used by our implementations of SPI, ASVI, Bayesian Q-learning and structured prioritized sweeping to input structured MDPs, along with a number of the structured MDPs used in this thesis in that format. The file format is illustrated by the following small example:

features ((a t f)(b t f))

action doA
a ((t 1)(f 0))
b (b (t ((t 1)(f 0))) (f ((t 0)(f 1))))
endaction

reward (a (t 1) (f 0))

value 1

discount .9

The first line of the example consists of a list of the features or variables used to describe the problem, each one followed by its set of possible values. In this case there are two variables, a and b, and each can have two values, t and f.

The feature list is followed by a list of the actions; in the example there is only one. Each action consists of the keyword action, followed by the action name (doA) and a list of (variable, tree) pairs representing the action's effects on each variable. The leaves of the trees consist of a list of the values of the variable, each paired with the probability of the variable having that value after the action is performed. The interior nodes consist of a variable, followed by a list of its possible values and the subtree that defines the effect of the action when the variable has that particular value. In the example, the action doA has the effect of making a have value t with probability 1. Its effect on variable b is to leave it unchanged: if b has value t before the action is executed, the first branch of the tree for b is used, in which b has value t after the action with probability 1; if b has value f before the action, we follow the second branch, which makes b have value f with probability 1. Each action is ended by the keyword endaction.

After the list of actions, we define the reward function for the MDP, again as a tree. This tree is identical to the action trees above except that its leaves consist of single values, giving the reward received for being in any state that satisfies the assignments on the path above the leaf. In the example, there is a reward of 1 if a has value t and a reward of 0 if a has value f. The reward tree is followed by an initial value tree in the same format. In the example this tree has value 1 everywhere, but it could be more complex if, for example, you wish to seed the algorithm with an initial value tree to speed up convergence. Finally, the file concludes with the discount factor for the problem.
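To make the format concrete, here is a minimal Python sketch of one way such a file could be tokenized and one of its trees evaluated for a particular state. This is illustrative only, not the implementation used for the experiments in the thesis; the helper names (tokenize, parse, transition_dist) and the choice of nested Python lists as the tree representation are our own, and the sketch assumes that a leaf can be recognized as a list of (value, probability) pairs, as in the listings in this appendix.

def tokenize(text):
    # Pad parentheses with spaces so that split() separates every token.
    return text.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    # Turn the token stream into nested Python lists.
    token = tokens.pop(0)
    if token != '(':
        return token
    node = []
    while tokens[0] != ')':
        node.append(parse(tokens))
    tokens.pop(0)                      # discard the closing ')'
    return node

def transition_dist(tree, state):
    # Walk one variable's tree under the current state assignment and
    # return the distribution over that variable's post-action value.
    # Leaves look like [[value, prob], ...]; interior nodes look like
    # [test_variable, [branch_value, ..., subtree], ...].
    if isinstance(tree[0], list):      # leaf: list of (value, prob) pairs
        return {value: float(prob) for value, prob in tree}
    test_var, branches = tree[0], tree[1:]
    for branch in branches:
        *values, subtree = branch      # a branch may list several values
        if state[test_var] in values:
            return transition_dist(subtree, state)
    raise ValueError('no branch for %s = %s' % (test_var, state[test_var]))

# The tree for variable b under action doA in the example above:
b_tree = parse(tokenize('(b (t ((t 1)(f 0))) (f ((t 0)(f 1))))'))
print(transition_dist(b_tree, {'a': 't', 'b': 'f'}))   # {'t': 0.0, 'f': 1.0}

Because interior nodes test the pre-action value of a variable and leaves give a distribution over its post-action value, evaluating one such tree per variable in this way yields the post-action distribution for each variable without enumerating the state space.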
A.1  The Coffee Robot MDP

This is the example used throughout the thesis. It consists of 32 states and four actions.

features ((WC t f)(HC t f)(R t f)(W t f)(U t f))

action noop
WC (WC (t ((t 1))) (f ((f 1))))
HC (HC (t ((t 1))) (f ((f 1))))
R (R (t ((t 1))) (f ((f 1))))
W (W (t ((t 1))) (f ((f 1))))
U (U (t ((t 1))) (f ((f 1))))
endaction

action getum
WC (WC (t ((t 1))) (f ((f 1))))
HC (HC (t ((t 1))) (f ((f 1))))
R (R (t ((t 1))) (f ((f 1))))
W (W (t ((t 1))) (f ((f 1))))
U (U (t ((t 1))) (f ((t 0.9)(f 0.1))))
endaction

action delc
WC (WC (t (HC (t ((t 0.1)(f 0.9))) (f ((t 1))))) (f ((f 1))))
HC ((f 1))
R (R (t ((t 1))) (f ((f 1))))
W (W (t ((t 1))) (f ((f 1))))
U (U (t ((t 1))) (f ((f 1))))
endaction

action fetch
WC (WC (t ((t 1))) (f ((f 1))))
HC (HC (t ((t 1))) (f ((t 0.9)(f 0.1))))
R (R (t ((t 1))) (f ((f 1))))
W (W (t ((t 1))) (f (R (t (U (t ((f 1))) (f ((t 1.0))))) (f ((f 1.0))))))
U (U (t ((t 1))) (f ((f 1))))
endaction

reward (WC (t (W (t 0) (f 0.1))) (f (W (t 0.9) (f 1.0))))

value (WC (t (W (t 0) (f 0.1))) (f (W (t 0.9) (f 1.0))))

discount 0.9
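Continuing the same illustrative sketch (again, hypothetical helper functions rather than the thesis implementation), the effect of delc on WC can be read straight out of the listing above: if WC and HC are both true before the action, WC becomes false with probability 0.9.

# Effect of delc on WC, taken from the listing above, evaluated with the
# transition_dist sketch given earlier (hypothetical helper, not thesis code).
wc_tree = parse(tokenize(
    '(WC (t (HC (t ((t 0.1)(f 0.9))) (f ((t 1))))) (f ((f 1))))'))
state = {'WC': 't', 'HC': 't', 'R': 'f', 'W': 'f', 'U': 'f'}
print(transition_dist(wc_tree, state))   # {'t': 0.1, 'f': 0.9}

The five Boolean variables give the 2^5 = 32 states mentioned above, but each action is described by one small tree per variable, so the size of the file grows with the number of variables rather than with the number of states.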
A.2  The Worst-Case and Best-Case Examples

We only show the smallest version of each of these problems, but they can easily be scaled to create larger problems. The worst-case problem with 64 states and six actions, and with no noise present (all actions are deterministic) is given below:

features ((a t f)(b t f)(c t f)(d t f)(e t f)(f t f))

action doa
a ((t 1))
b (b (t ((t 1))) (f ((f 1))))
c (c (t ((t 1))) (f ((f 1))))
d (d (t ((t 1))) (f ((f 1))))
e (e (t ((t 1))) (f ((f 1))))
f (f (t ((t 1))) (f ((f 1))))
endaction

action dob
a ((f 1))
b (a (t ((t 1))) (f ((f 1))))
c (c (t ((t 1))) (f ((f 1))))
d (d (t ((t 1))) (f ((f 1))))
e (e (t ((t 1))) (f ((f 1))))
f (f (t ((t 1))) (f ((f 1))))
endaction

action doc
a ((f 1))
b ((f 1))
c (a (t (b (t ((t 1))) (f ((f 1))))) (f ((f 1))))
d (d (t ((t 1))) (f ((f 1))))
e (e (t ((t 1))) (f ((f 1))))
f (f (t ((t 1))) (f ((f 1))))
endaction

action dod
a ((f 1))
b ((f 1))
c ((f 1))
d (a (t (b (t (c (t ((t 1))) (f ((f 1))))) (f ((f 1))))) (f ((f 1))))
e (e (t ((t 1))) (f ((f 1))))
f (f (t ((t 1))) (f ((f 1))))
endaction

action doe
a ((f 1))
b ((f 1))
c ((f 1))
d ((f 1))
e (a (t (b (t (c (t (d (t ((t 1))) (f ((f 1))))) (f ((f 1))))) (f ((f 1))))) (f ((f 1))))
f (f (t ((t 1))) (f ((f 1))))
endaction

action dof
a ((f 1))
b ((f 1))
c ((f 1))
d ((f 1))
e ((f 1))
f (a (t (b (t (c (t (d (t (e (t ((t 1))) (f ((f 1))))) (f ((f 1))))) (f ((f 1))))) (f ((f 1))))) (f ((f 1))))
endaction

reward (a (t (b (t (c (t (d (t (e (t (f (t 10000000000000000) (f 0))) (f 0))) (f 0))) (f 0))) (f 0))) (f 0))

value (a (t (b (t (c (t (d (t (e (t (f (t 10000000000000000) (f 0))) (f 0))) (f 0))) (f 0))) (f 0))) (f 0))

discount 0.99

The best-case problem, also with 64 states and six actions, and with no noise present:

features ((a t f)(b t f)(c t f)(d t f)(e t f)(f t f))

action doa
a ((t 1))
b ((f 1))
c ((f 1))
d ((f 1))
e ((f 1))
f ((f 1))
endaction

action dob
a (a (t ((t 1))) (f ((f 1))))
b ((t 1))
c ((f 1))
d ((f 1))
e ((f 1))
f ((f 1))
endaction

action doc
a (a (t ((t 1))) (f ((f 1))))
b (b (t ((t 1))) (f ((f 1))))
c ((t 1))
d ((f 1))
e ((f 1))
f ((f 1))
endaction

action dod
a (a (t ((t 1))) (f ((f 1))))
b (b (t ((t 1))) (f ((f 1))))
c (c (t ((t 1))) (f ((f 1))))
d ((t 1))
e ((f 1))
f ((f 1))
endaction

action doe
a (a (t ((t 1))) (f ((f 1))))
b (b (t ((t 1))) (f ((f 1))))
c (c (t ((t 1))) (f ((f 1))))
d (d (t ((t 1))) (f ((f 1))))
e ((t 1))
f ((f 1))
endaction

action dof
a (a (t ((t 1))) (f ((f 1))))
b (b (t ((t 1))) (f ((f 1))))
c (c (t ((t 1))) (f ((f 1))))
d (d (t ((t 1))) (f ((f 1))))
e (e (t ((t 1))) (f ((f 1))))
f ((t 1))
endaction

reward (a (t (b (t (c (t (d (t (e (t (f (t 100) (f 0))) (f 0))) (f 0))) (f 0))) (f 0))) (f 0))

value 1

discount .9

A.3  Exogenous Events

Here we present the two larger coffee-robot domains that illustrate the effects of exogenous events. Each has 400 states and eight actions. The version with exogenous events is given first.
features  ((loc  off  (mailwaiting t  action  (lab  lab mail  f)(hascoffee  t  loc  (loc  (off  1)))  ((off  1)))  ((lab  (home  ((mail  1)))  (shop  ((home  1))))  ((t  O.lXf  hascoffee (f  ((f  ((t  1)))  (t  ((t  (t  ((t  1)))  1)))  1))))  0.2)(f hasmail  (f  (t  0.9))))  mailwaiting ((t  (wantscoffee  (hascoffee  ((f  ((shop  i)))  wantscoffee  (f  f)(hasmail t  goLeft  (mail  (f  home s h o p ) ( t i d y t O t l  (mailwaiting  0.8)))) (hasmail (t  ((t  1)))  1)))) tidy  (tidy  (tO  ((tO  (tl  ((tO 0 . 1 ) ( t l  0.9)))  (t2  ((tl  0.2)(t2  0.8)))  (t3  ((t2  0.3)(t3  0.7)))  1)))  245  f))  t2  t3  t4)(wantscoffee  t  f)  (t4 ((t3 0.4)(t4 0.6)))) endaction  action stay loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee (t ( ( t 1))) (f ((t 0.1)(f 0.9)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((t 0.2)(f 0.8)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ((tO 0.1)(tl 0.9))) (t2 ( ( t l 0.2)(t2 0.8))) (t3 ((t2 0.3)(t3 0.7))) (t4 ((t3 0.4)(t4 0.6)))) endaction  action goRight loc (loc (off ((lab 1))) (lab ((mail 1))) (mail ((home 1))) (home ((shop 1))) (shop ((off 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((t 0.1)(f 0.9))))  246  hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((t 0.2)(f 0.8)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ((tO 0.1)(tl 0.9))) (t2 ( ( t l 0.2)(t2 0.8))) (t3 ((t2 0.3)(t3 0.7))) (t4 ((t3 0.4)(t4 0.6)))) endaction  action pickUpMail loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((t 0.1)(f 0.9)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t (loc (mail ((t 0.2)(f 0.8))) (home off lab shop ((t 1))))) (f ((t 0.2)(f 0.8)))) hasmail (hasmail (t ((t 1))) (f (loc (mail (mailwaiting (t ((t 1))) (f ((f 1))))) (home off lab shop ((f 1)))))) tidy (tidy (tO ((tO 1))) ( t l ((tO 0.1)(tl 0.9)))  247  (t2 ( ( t l 0.2)(t2 0.8))) (t3 ((t2 0.3)(t3 0.7))) (t4 ((t3 0.4)(t4 0.6)))) endaction  action delMail loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((t 0.1)(f 0.9)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((t 0.2)'(f 0.8)))) hasmail (hasmail (t (loc (off ((f 1))) (lab mail home shop ((t 1))))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ((tO 0.1)(tl 0.9))) (t2 ( ( t l 0.2)(t2 0.8))) (t3 ((t2 0.3)(t3 0.7))) (t4 ((t3 0.4)(t4 0.6)))) endaction  action buyCoffee loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1)))  248  (shop ((shop 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((t 0.1)(f 0.9)))) hascoffee (hascoffee (t ((t 1))) (f (loc (off lab mail home ((f 1))) (shop ((t 1)))))) mailwaiting (mailwaiting (t ((t 1))) (f ((t 0.2)(f 0.8)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ((tO 0.1)(tl 0.9))) (t2 ( ( t l 0.2)(t2 0.8))) (t3 ((t2 0.3)(t3 0.7))) (t4 ((t3 0.4)(t4 0.6)))) endaction  action delCoffee loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee (t (hascoffee (t (loc (off ((f 1))) (shop lab mail home ((t 1))))) (f ((t 1))))) (f (hascoffee (t (loc (off ((f 1))) (shop lab mail home ((t . 
l ) ( f .9))))) (f ((t 0.1)(f 0.9)))))) hascoffee ((f 1)) mailwaiting (mailwaiting (t ((t 1))) (f ((t 0.2)(f 0.8))))  249  hasmail (f  <(f  (hasmail  (t  ((t  1)))  1)))) tidy  (tidy  (tO  ((tO  (tl  ((tO 0 . 1 ) ( t l  0.9)))  (t2  ((tl  0.2)(t2  0.8)))  (t3  ((t2  0.3)(t3  0.7)))  (t4  ((t3  0.4)(t4  0.6))))  1)))  endaction  action  tidy loc  (lab  ((lab  (loc (off  ((mail  1)))  (home  ((home  1)))  (shop  ((shop  1))))  wantscoffee ((t  0.1)(f  hascoffee (f  ((t  0.2)(f  ((f  (t  ((t  ((t  1)))  1)))  (mailwaiting  (t  0.8)))) (hasmail  (t  ((t  1)))  1)))) tidy  (loc  (lab  (tl  ((t2  1)))  (t2  ((t3  1)))  (t3  t4  (off  1)))  1))))  hasmail (f  ((t  0.9))))  mailwaiting (f  (wantscoffee (t  (hascoffee  ((f  1)))  1)))  (mail  (f  ((off  mail  ((t4  (tidy  (tO  ((tl  1)))  1)))))  home s h o p  (tidy  (tO  (tl  ((tO 0 . 1 ) ( t l  0.9)))  (t2  ((tl  0.2)(t2  0.8)))  (t3  ((t2  0.3)(t3  0.7)))  (t4  ((t3  0.4)(t4  0.6))))))  ((tO  1)))  250  endaction  reward (wantscoffee (t (mailwaiting (t (tidy (tO (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  (t4 -7.0))) (f (hasmail (t (tidy (tO (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  -9.0)  (t4 -7.0))) (f (tidy (tO (tl  -4.5)  (t2  -4.0)  (t3  -3.5)  -5.0)  (t4 -3.0))))))) (f (mailwaiting (t (tidy (tO (tl  -5.5)  (t2  -5.0)  (t3  -4.5)  -6.0)  (t4 -4.0))) (f (hasmail (t (tidy (tO (tl  -5.5)  (t2  -5.0)  (t3  -4.5)  -6.0)  (t4 -4.0))) (f (tidy (tO (tl  -1.5)  (t2  -1.0)  (t3  -0.5)  -2.0)  (t4 0.0))))))))  251  -9.0)  value (wantscoffee (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  (t (mailwaiting (t (tidy (tO  (t4 -7.0))) (f (hasmail (t (tidy (tO (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  -9.0)  (t4 -7.0))) (f (tidy (tO (tl  -4.5)  (t2  -4.0)  (t3  -3.5)  -5.0)  (t4 -3.0))))))) (f (mailwaiting (t (tidy (tO (tl  -5.5)  (t2  -5.0)  (t3  -4.5)  -6.0)  (t4 -4.0))) (f (hasmail (t (tidy (tO (tl  -5.5)  (t2  -5.0)  (t3  -4.5)  -6.0)  (t4 -4.0))) (f (tidy (tO (tl  -1.5)  (t2  -1.0)  (t3  -0.5)  -2.0)  (t4 0.0))))))))  discount .9  252  -9.0)  The same problem with the exogenous events removed is as follows: features ((loc off lab mail home shop)(tidy tO t l t2 t3 t4)(wantscoffee t f ) (mailwaiting t f)(hascoffee t f)(hasmail t f ) )  action goLeft loc (loc (off ((shop 1))) (lab ((off 1))) (mail ((lab 1))) (home ((mail 1))) (shop ((home 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((f 1)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))) endaction  action stay loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee  (t ((t 1)))  253  (f ((f 1)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))) endaction  action goRight loc (loc (off ((lab 1))) (lab ((mail 1))) (mail ((home 1))) (home ((shop 1))) (shop ((off 1)))) wantscoffee (wantscoffee  (t ((t 1)))  (f ((f 1)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1)))  254  (t4 ((t4 1)))) endaction  action pickUpMail loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((f 1)))) 
hascoffee (hascoffee  (t ( ( t 1)))  (f ((f i ) ) ) ) mailwaiting  (mailwaiting  (t (loc (mail ((f 1)))  (home off lab shop ((t 1))))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f (loc (mail (mailwaiting  (t ((t 1)))  (f ((f 1))))) (home off lab shop ((f 1)))))) tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))) endaction  action delMail loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1)))  255  (shop ((shop 1)))) wantscoffee (wantscoffee  (t ((t 1)))  (f ((f 1)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t (loc (off ((f 1))) (lab mail home shop ((t 1))))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))) endaction  action buyCoffee loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee  (t ((t 1)))  (f ((f 1)))) hascoffee (hascoffee (t ((t 1))) (f (loc (off lab mail home ((f 1))) (shop ((t 1)))))) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f ((f 1))))  256  tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))) endaction  action delCoffee loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee  (t (hascoffee (t (loc (off ((f 1)))  (shop lab mail home ((t 1))))) (f ((t 1))))) (f (hascoffee (t (loc (off ((f 1))) (shop lab mail home ((f 1))))) (f ((t 0.1)(f 0.9)))))) hascoffee ((f 1)) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))) endaction  action tidy  257  loc (loc (off ((off 1))) (lab ((lab 1))) (mail ((mail 1))) (home ((home 1))) (shop ((shop 1)))) wantscoffee (wantscoffee (t ((t 1))) (f ((f 1)))) hascoffee (hascoffee (t ((t 1))) (f ((f 1)))) mailwaiting (mailwaiting (t ((t 1))) (f ((f 1)))) hasmail (hasmail (t ((t 1))) (f ((f 1)))) tidy (loc (lab (tidy (tO ( ( t l 1))) ( t l ((t2 1))) (t2 ((t3 1))) (t3 t4 ((t4 1))))) (off mail home shop (tidy (tO ((tO 1))) ( t l ( ( t l 1))) (t2 ((t2 1))) (t3 ((t3 1))) (t4 ((t4 1)))))) endaction reward (wantscoffee (t (mailwaiting (t (tidy (tO -9.0) (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  (t4 -7.0))) (f (hasmail (t (tidy (tO (tl  -8.5)  (t2  -8.0)  -9.0)  258  (t3  -7.5)  (t4 -7.0))) (f (tidy (tO (tl  -5.0)  -4.5)  (t2. -4.0) (t3  -3.5)  (t4 -3.0))))))) (f (mailwaiting (t (tidy (tO (tl  -5.5)  (t2  -5.0)  (t3  -4.5)  -6.0)  (t4 -4.0))) (f (hasmail (t (tidy (tO (tl  -5.5)  (t2  -5.0)  (t3  -4.5)  -6.0)  (t4 -4.0))) (f (tidy (tO (tl  -1.5)  (t2  -1.0)  (t3  -0.5)  -2.0)  (t4 0.0)))))))) value (wantscoffee (t (mailwaiting (t (tidy (tO (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  (t4 -7.0))) (f (hasmail (t (tidy (tO (tl  -8.5)  (t2  -8.0)  (t3  -7.5)  -9.0)  (t4 -7.0)))  259  -9.0)  (f (tidy (tO -5.0) ( t l -4.5) (t2 -4.0) (t3 -3.5) (t4 -3.0))))))) (f (mailwaiting (t (tidy (tO -6.0) ( t l -5.5) (t2 -5.0) (t3 -4.5) (t4 -4.0))) (f (hasmail (t (tidy (tO -6.0) ( t l -5.5) (t2 -5.0) (t3 -4.5) (t4 -4.0))) (f (tidy (tO -2.0) ( t l -1.5) (t2 -1.0) (t3 -0.5) (t4 0.0))))))))  discount .9  A.4  The Process Planning Problem for Structured Prioritized Sweeping  Finally, we give the 1024 state version of the process-planning problem that we used for the experiment in Chapter 6 to test structured prioritized sweeping. This problem has eight actions. 
features ((ashaped t f)(bshaped t f)(typeneeded highq lowq)(glue t f ) (apainted t f)(bpainted t f)(connected t f )  260  (asmooth t f)(bsmooth t f)(bolts t f ) )  action shapea  ashaped (connected (t ((f 1))) (f ((t 0.8)(f 0.2)))) bshaped (connected (t ((f 1))) (f (bshaped (t ( ( t 1))) (f ((f 1)))))) apainted ((f 1)) bpainted (connected (t ((f 1))) (f (bpainted (t ((t 1))) (f ((f 1)))))) asmooth ((f 1)) bsmooth (connected (t ((f 1))) (f (bsmooth (t ((t 1))) (f ((f 1)))))) typeneeded (typeneeded (highq ((highq 1))) (lowq ((lowq 1)))) connected ((f 1)) bolts (bolts (t ( ( t 1))) (f ((f 1)))) glue (glue (t ((t 1))) (f ((f 1)))) endaction  action shapeb  bshaped (connected (t ((f 1))) (f ((t 0.8)(f 0.2)))) ashaped (connected (t ((f 1))) (f (ashaped (t ((t 1)))  261  (f ((f 1)))))) bpainted ((f 1)) apainted (connected (t ((f 1))) (f (apainted (t ((t 1))) (f ((f 1)))))) bsmooth ((f 1)) asmooth (connected (t ((f 1))) (f (asmooth (t ((t 1))) (f ((f 1)))))) typeneeded (typeneeded  (highq ((highq 1)))  (lowq ((lowq 1)))) connected ((f 1)) bolts (bolts (t ((t 1))) (f ((f 1)))) glue (glue (t ((t 1))) (f ((f 1)))) endaction  action painta  ashaped (ashaped  (t ((t 1)))  (f ((f 1)))) bshaped (bshaped (t ((t 1))) (f ((f 1)))) apainted (connected (t ( ( f 1))) (f (asmooth (t ((t 0.8)(f 0.2))) (f ((f 1)))))) bpainted (bpainted (t ((t 1))) (f ((f 1)))) asmooth (asmooth (t ((t 1))) (f ((f 1)))) bsmooth (bsmooth (t ((t 1)))  262  (f ((f 1)))) typeneeded (typeneeded  (highq ((highq 1)))  (lowq ((lowq 1)))) connected  (connected (t ((t 1)))  (f ((f 1)))) bolts (bolts (t ((t 1))) (f ((f 1)))) glue (glue (t ((t 1))) (f ((f 1)))) endaction  action paintb  ashaped (ashaped (t ((t 1))) (f ((f 1)))) bshaped (bshaped (t ((t 1))) (f ((f 1)))) bpainted (connected (t ((f 1))) (f (bsmooth (t ((t 0.8)(f 0.2))) (f ((f 1)))))) apainted (apainted (t ((t 1))) (f ((f 1)))) asmooth (asmooth (t ((t 1))) (f ((f 1)))) bsmooth (bsmooth (t ((t 1))) (f ((f 1)))) typeneeded (typeneeded  (highq ((highq 1)))  (lowq ((lowq 1)))) connected (connected (t ((t 1))) (f ((f 1)))) bolts (bolts (t ((t 1))) (f ((f 1))))  263  glue (glue (t ((t 1))) (f ((f 1)))) endaction  action bolt  ashaped (ashaped (t ((t 1))) (f ((f 1)))) bshaped (bshaped (t ((t 1))) (f ((f 1)))) apainted (apainted (t ((t 1))) (f ((f 1)))) bpainted (bpainted ,(t ((t 1))) (f ((f 1)))) asmooth (asmooth (t ((t 1))) (f ((f 1)))) bsmooth (bsmooth (t ((t 1))) (f ((f 1)))) typeneeded (typeneeded  (highq ((highq 1)))  (lowq ((lowq 1)))) connected  (connected (t ((t 1)))  (f (bolts (t ((t 0.9)(f 0.1))) (f ((f 1)))))) bolts (bolts (t ((t 1))) (f ((f 1)))) glue (glue (t ((t 1))) (f ((f 1)))) endaction  action glue  ashaped (ashaped (t ((t 1)))  264  (f ((f 1)))) bshaped (bshaped (t ((t 1))) (f ((f 1)))) apainted (apainted (t ((t 1))) (f ((f 1)))) bpainted (bpainted (t ((t 1))) (f ((f 1)))) asmooth (asmooth (t ((t 1))) (f ((f 1)))) bsmooth (bsmooth (t ((t 1))) (f ((f 1)))) typeneeded (typeneeded (highq ((highq 1))) (lowq ((lowq 1)))) connected (connected (t ((t 1))) (f (glue (t ((t 0.4)(f 0.6))) (f ((f  1))))))  bolts (bolts (t ((t 1))) (f (Cf 1)))) glue (glue (t ((t 1))) (f ((f 1)))) endaction  action polisha  ashaped (ashaped (t ((t 1))) (f ((f 1)))) bshaped (bshaped (t ((t 1))) (f ((f 1)))) apainted ((f 1)) bpainted (connected (t ((f 1))) (f (bpainted (t ((t 1))) (f ((f  1))))))  265  asmooth (asmooth ( t (f  (ashaped  (t  0 . 
8 ) ( f 0.2)))  ((t  ( ( f 1))))))  (f  bsmooth (bsmooth ( t (f  1)))  ((t  1)))  ((t  ( C f 1))))  typeneeded  (typeneeded  (lowq ((lowq connected (connected (f bolts  (bolts  (f  1)))) (t  1)))  ((t  (t  1)))  ((t  1))))  ((f  glue (glue  1)))  1))))  ((f  (f  (highq ((highq  (t  1)))  ((t  1))))  ((f  endaction  action polishb  ashaped (f  (ashaped  at  ((t  1)))  (t  ((t  1)))  i))))  bshaped (f  (t  (bshaped  1))))  ((f  bpainted ((f apainted  1))  (connected  (t  ((f  1)))  (t  ((t  1)))  bsmooth (bsmooth ( t  ((t  1)))  (f  (f  (f  (apainted  ( ( f 1))))))  (bshaped  (f  (t  ((f  0 . 8 ) ( f 0.2)))  ( ( f 1))))))  asmooth (asmooth (f  ((t  (t  ((t  1)))  1))))  typeneeded  (typeneeded  (lowq ((lowq  (highq ((highq  1)))  1))))  266  connected (connected (f bolts  ((f  glue  ((f  ((t  ((f  1)))  1))))  (glue (t ( ( t  (f  1)))  1))))  ( b o l t s (t (f  (t ( ( t  1)))  1))))  endaction  reward (typeneeded (f  (highq (connected  (t ( a p a i n t e d (t  (bpainted (t 10)  3))) (f  3))) (f  0)))  (lowq (connected (f (f  (t (apainted (t ( b p a i n t e d (t 1)  2)))  (bpainted (t 2) (f (f  3)))))  0))))  value (typeneeded (f  (highq (connected  (t ( a p a i n t e d (t (bpainted (t 10)  3))) (f  3))) (f  0)))  (lowq (connected (f (f  (t (apainted (t ( b p a i n t e d (t 1)  2)))  (bpainted (t 2) (f (f  3)))))  0))))  d i s c o u n t 0.9  267  