ABSTRACTION AND SEARCHFOR DECISION-THEORETICPLANNINGByRichard William DeardenB. Sc. (Computer Science) Victoria Universityof Wellington, 1990B. Sc. (Hons.) (Computer Science) VictoriaUniversity of Wellington, 1991A THESIS SUBMITTED IN PARTIAL FULFILLMENTOFTHE REQUIREMENTS FOR THE DEGREEOFMASTER OF SCIENCEinTIlE FACULTY OF GRADUATE STUDIESDepartment ofCOMPUTER SCIENCEWe acct this thesis as conformingthe required standardY I3ItITISII COLUMBIAOctober 1994©Richard William Dearden,1994Signature(s) removed to protect privacySignature(s) removed to protect privacyIn presenting this thesis in partial fulfilmentof the requirements for an advanceddegree at the University of British Columbia,I agree that the Library shallmake itfreely available for reference and study.I further agree that permission forextensivecopying of this thesis for scholarly purposesmay be granted by the head of mydepartment or by his or herrepresentatives. It is understoodthat copying orpublication of this thesis for financialgain shall not be allowed withoutmy writtenpermission.Department of c-\c)i--6LThe University of BritishColumbiaVancouver, CanadaDateER.I2) C\(Signature)DE.6 (2/88)Signature(s) removed to protect privacySignature(s) removed to protect privacyAbstractWe investigate the use Markov Decision Processes a.s a means of representing worldsin which actions have probabilistic effects. Markov Decision Processes provide manyrepresentational advantages over traditional planning representations. As well as beingable to represent actions with more than one possible result, they also provide a muchricher way to represent good and bad states of the world. Conventional approaches forfinding optimal plans for Markov Decision Processes are computationally expensive andgenerally impractical for the large domains and real-time requirements of many planningapplications. For this reason, we have concentrated on producing approximately optimalplans using a minimal amount of computation.We describe two complementary methods for planning. The first is to generate approximately optimal plans using abstraction. By ignoring certain features of a planningproblem, we can create a smaller problem for which an optimal plan can be efficientlyfound by conventional means. The plan for this smaller problem can be directly appliedto the original problem, and also provides an estimate of the value of each possiblestateof the world. Our second technique uses these estimates as a heuristic, and applies gametree search techniques to try to determine a better action to perform in the current stateof the system. By repeatedly choosing an action to perform by searching,and executingthe action, we provide a planning algorithm which has a complexity that is independentof the number of possible states of the world.11Table of ContentsAbstract iiList of Tables viList of Figures viiiAcknowledgement ix1 Introduction 11.1 Decision Theoretic Planning 41.2 Abstraction and Search in Markov Decision Process Planning 61.3 Organization of this Thesis 72 Markov Decision Processes for Planning 92.1 The MDP model 102.2 Propositional Domains 122.2.1 The extended STRIPS model 122.2.2 Using Bayesian networks 182.3 Goals and Rewards202.3.1 Why userewards?202.4 Policy Iteration and Planning 212.4.1 Calculating optimal policies 222.5 Related Work 252.5.1 MDP planning251112.5.2 Other decision-theoretic planning algorithms . . . 272.5.3 Abstraction in classical planning302.5.4 Markov decision processes and Decision Theory . .322.5.5 Search algorithms333 Creating Approximate Policies Using Abstraction363.1 Abstract MDPs and Policies373.2 Constructing Abstract Policies in Propositional Domains393.2.1 Choosing relevant atoms 393.2.2 Building abstract actions and rewards423.2.3 An example domain443.2.4 Properties of the abstract MDP463.3 Properties of the Abstract Policy473.4 Results534 Planning by Heuristic Search584.1 Interleaving Search and Execution . . .594.2 Planning in MDPs by Searching604.3 The Search Algorithm614.3.1 Action selection614.3.2 Pruning algorithms654.4 Results704.4.1 Execution and caching of values734.4.2 Effectiveness of pruning745 Conclusions785.1 Real World Domains80iv5.2 FutureWork 825.2.1 Abstraction 835.2.2 Search 85Bibliography 87Appendix A: Experimental Domains 90VList of Tables2.1 An example domain presented as STRIPS-style action descriptions. (a) isthe action description for the problem, while (b) is the reward function.Note that HUC and HRC are HasUserCoffee and HasRobotCoffee respectively.UserlsThirsty*is a random event 142.2 Translating action aspects and random events into single actions. (a) is asummary of both aspects of the Move action, while (b) is the combinationof the random eventUserlsThirsty*with the Get Umbrella action 152.3 Parts of two actions to be translated into a single action. A1 is a normalaction,A*is a random event, and A is the new action produced by combining them. Discriminants D1 and D2 are combined into a single newdiscriminant. The o operator is defined on two sets of literals E and FsuchthatEoFEU{F\{l:-ilEE}}162.4 The optimal policy for the sample domain 243.5 The STRIPS representation of the abstract MDP 453.6 The policy computed using the abstract MDP 463.7 Results of abstraction for the COFFEE domain 543.8 Results of using optimal abstract policies as initial policies when computing an optimal policy for the concrete MDP 553.9 Results of abstraction for the BUILDER domain 574.10 Comparison of the induced policy as search depth increases for three different search problems 71vi4.11 A comparison of time to search for ten actions both with and withoutcaching of previous best actions, and execution 744.12 Expectation pruning when only the top of the search tree is pruned. Valuesare percentage of value with no pruning 76viiList of Figures2.1 An example of an MDP with 5 states and 2 actions 112.2 The influence diagram representation of the actions for the coffee robotexample 193.3 Constructing an approximately optimal policy using Abstraction 394.4 The search algorithm. To begin search from state s to depth d, searchstate(s, d) is called. The action that is returned is the estimate of the bestaction to perform 634.5 An example of a two-level search for the best action from state s 644.6 Two kinds of pruning where V(s) 10 and is accurate to +1. In (a),utility pruning, the trees at U and V need not be searched, while in (b),expectation pruning, the trees below T and U are ignored, although thestates themselves are evaluated664.7 Graphs of (a) search time, and (b) policy quality against search depth forthe COFFEE domain 724.8 Pruning effectiveness for (a) the COFFEE, and (b) the BUILDER domains75viiiAcknowledgementThis thesis has been a true collaboration between my supervisorCraig Boutilier andmyself. Craig’s enthusiasm and interest have had a profound influence onthis work, hefirst pointed me at the work of Dean et al. which provided muchof our motivation towork with Markov Decision Processes. Indeed, some of the workin this thesis is at leastas much Craig’s research as it is my own. Craig has also been an ideal supervisor,alwaysprepared to talk, and usually full of suggestions on ways to improve things.I can’t thankCraig enough for his commitment to his students, and contributionsto this research.I should also thank my reader, David Poole. His knowledgeof decision theory andprobabilistic reasoning is encyclopaedic, and he hasbeen an invaluable resource throughout this research.Fiona Humphris also deserves thanks. She helped proofread thisthesis, and providedendless encouragement and understanding. Without her, thisthesis might never havecome into existence. My parents have also showed me nothingbut encouragement in thepursuit of this degree.Finally I would like to thank the faculty, staff and graduatestudents of the UBCComputer Science department. I have yet to find a better environmentfor performingresearch. In particular I would like to thank Michael Horsch,Roger Ford, and BrentBoerlage for comments on various parts of this work,and Paul Lalonde and GrahamDenham for providing a place where I could escape fromit.ixChapter 1IntroductionThe goal of research in artificial, or computational intelligence is to create an agentwhich,at least in some limited sense, can be said to behave intelligently. What wemean bythe word “behave” varies widely from problem to problem, but in its most generalsense,“behaving intelligently” means carrying out some sequence of actionswith a rationalintent based on information about the agent’s environment. The processof choosingthe sequence of actions to perform is planning. Informally, planning isthe problem ofdetermining, given a set of objectives and a set of actions, a sequence of actionsto performwhich will achieve some or all of the objectives.As an example of a planning problem that we might want to solve, imaginea robotthat tries to keep its user supplied with coffee. The robot has a goal thatit must achieve,that of ensuring the user has coffee, it has a set of actions such as buyingcoffee, deliveringcoffee, and moving to the user’s office that it can perform, andit has certain informationabout the current state of the world, for example that the user wantscoffee, that therobot is at the coffee shop, and that it is sunny outside. The problem forthe robot is toplan a sequence of actions that will make the goal true.Planning for problems of this type has been investigatedin artificial intelligence formany years. One of the earliest planning systems is STRIPS (Fikesand Nilsson 1971).STRIPS uses goal-directed search to plan. The planner beginswith the goal state, andsearches backwards through the state space until it reachesthe initial state. The planit produces is a linear series of actions which will accomplishthe goal. Although not all1Chapter 1. Introduction2classical planners fit precisely into this description for example, many producepartialorders over actions, rather than total orders, and may find plans that will succeedfroma set of starting states, rather than a single one — most make a number of assumptionsabout the problems they will be used for. In this thesis we are particularlyinterested inrelaxing the following assumptions, and building planning systems that canoperate indomains where they do not hold.Agents achieve goals The definition of a classical planning problemis in terms of 1)the initial state of the world (this need not be unique), 2) a set of actions theagentcan perform, and 3) a description of a state or states which constitute theagent’sgoal. The agent’s objective is to produce a sequence of actions which whenexecutedin the initial state (or any initial state if there could be morethan one), will leavethe world in a goal state. While this representation of states as either goalstatesor normal states is adequate for some domains, we would typically expecta muchricher way of describing how good or bad certain states are. For example,in thecoffee robot domain we described above, as well as its main objectiveof deliveringcoffee the robot may receive rewards based on how hot the coffee is,whether itdisturbed people on its way to get the coffee, etc. Although its goalis to bringthe user coffee that is as hot as possible while disturbingas few people as possible,the agent may find that these goals conflict, and hence must reasonabout which ismost important.Actions always work Classical planning assumes that theresults of all actions aredeterministic. For example, if an agent carries out the action Pickupblock A, theresulting state of the world is that in which the agent is holding blockA (assumingthat it is possible for the agent to pick up the block),and this fact is known withcertainty. In many real-world situations, this is a very unreasonableassumption,Chapter 1. Introduction3the block might slip from the agent’s grasp, the agent might make a sensingerrorand pick up block B instead of A, and so on. Although the most likely outcomemaybe that the agent is left holding the block, an agent that oniy considersthe mostprobable result of a real-world action would he courting disaster. Byconsideringthe probability and effects of other outcomes, and by making observationsof theworld to determine the outcome that actually occurred when an action iscarriedout, the agent can avoid actions with unlikely but catastrophic results,and canalso plan to achieve states which only occur rarely.The whole world is visible Another important assumptionof classical planning systems is that the state of the world is completely observable. The agentalwaysknows exactly what is true at any given time. While this assumption maywell betrue in some domains some scheduling problems are an exampleof this ingeneral it will not be. While planning systems which use this assumptioncan beuseful in a restricted set of problems, a general planner must stillbe able to operatewith partial information about the world.This thesis examines one approach to planning which addresses at leastthe first twoof these issues, and makes some progress on the third,We are particularly interested in computationally efficient planning.As we shallsee, there are already well-developed planning algorithms for the domainrepresentationswhich we are interested in, but their computational requirementsare often polynomialin the number of states in the system. Many AT planning problemsare represented interms of propositions, and hence have exponentially many possiblestates (n atoms resultin 2’ possible worlds). A real-time planning algorithm for such a problemmust performconsiderably better than exponentially in the number of atoms.In summary, the motivation for this work is to build a planning system whichoperatesChapter 1. Introduction 4in probabilistic worlds with a rich structure of rewards and penalties. The system shouldhave reasonable computational requirements, and should be designed for use in timecritical domains.1.1 Decision Theoretic PlanningDecision-theoretic planning is an attempt to combine the artificial intelligence field ofautomated planning with decision theory. Although researchers have been interested inthis idea for some time, only recently have the problems become well-posed. Decisiontheory allows us to reason not only about whether a plan satisfies some goal or not,but also about how good a plan is how quickly or cheaply it satisfies the goal, howprobable it is that the goal will have been satisfied by performing the plan, and howvaluable the particular goal achieved is, compared with other goals that can be achievedby other plans. The ability to compare two plans and decide which one does best is agreat asset that many classical planning algorithms cannot provide.Much of the research that has gone on in decision-theoretic planning has addressed theclass of problems associated with Markov Decision Processes (MDPs) (Howard 1971) (see(Haddawy and Hanks 1992) for an alternative approach). MDPs are a form of stochasticautomata where every state has an associated numerical value or reward. Most recentdecision-theoretic planning algorithms use an MDP model in which there are a finitenumber of states, a finite number of actions (each of which is a mapping from statesto probability distributions over the states) and a reward function (a mapping fromstates to real numbers). An agent accumulates rewards as it moves through the MDP byperforming actions. Although other classes of problems can be addressed using decisiontheoretic methods, MDPs provide a representation which is general enough to be used formany interesting problems. They have been extensively studied in Operations Research,Chapter 1. Introduction 5so a number of well understood algorithms for computing plans are available, at least forsimple forms of MDP.Since the problems of interest in MDP planning are generally stochastic dynamicalsystems, the standard approach to planning is to create a universal plan, referred toas a policy, which indicates which action to perform in every situation that could beencountered. For AT planning, computing an optimal policy is generally far too expensive,so most decision-theoretic planning algorithms sacrifice optimality (although some willconverge on an optimal policy if enough computation time is available), preferring toproduce either a nearly optimal policy for a subset of the state space, or a policy that isapplicable in any state, but may be far from optimal.Artificial intelligence research often concentrates on “tractable islands” — problemswhich have additional structure that can be used to reduce the computation requiredto find solutions to reasonable levels. The challenge of applying this type of approachto Markov decision processes is a major motivation for this thesis, and has motivated anumber of other researchers in this area (Dean et al. 1993b; Nicholson and Kaelbling1994). The following have been identified as one set of characteristics that make generating approximately optimal policies easier. Although these are from (Dean et al. 1993b),they also apply to our approach, and a number of other MDP algorithms.• There are relatively many policies which have high (although not necessarily optimal) value. Since most algorithms do not produce optimal policies, some less-thanoptimal policies must be reasonably good.• From any given state, there are few states to whichtransitions can be made. Inother words, there are relatively few valid actions for each state, and more importantly, actions have few possible outcomes.• The value of a state can be estimated by consideringthe values of nearby states.Chapter 1. Introduction 6Nearby states are those for which the expected number of steps to reach one fromthe other is small.1.2 Abstraction and Search in Markov Decision Process PlanningAs Chapter 2 shows, Markov processes are extremely general, and somewhat laborious tospecify. In practice we expect most MDP planning problems not to be specified directlyas the MDP itself, but in a more compact representation which makes the structure inthe problem explicit. In Section 2.2 we suggest one such representation, other researchershave proposed similar models. The representation we have chosen is in terms of logicalpropositions. Thus a state in the MDP represents a possible world in the logical model(an assignment of truth values to all the propositions).A second consideration when developing our approach was the problem of planningin real-time. Since well developed methods already exist for finding optimal policies forMDPs, any contribution from Artificial Intelligence must be directed at taking advantageof structure for computational gains, and at finding close-to-optimal plans efficiently.Designing “anytime” algorithms that always produce a plan, but produce better ones ifmore time is available has also motivated this thesis.Although we are very interested in the problem of how to plan if the agent does notknow the state of the world, this is an extremely difficult problem, and for the purposesof this thesis, we will concentrate instead on domains where the agent always knows whatthe current state of the world is, even though it cannot predict exactly the outcome of itsactions. In the terminology of decision theory, we say our agents operate in completelyobservable worlds.Our approach is twofold. Before any actions are performed, we use relatively smallamounts of computation to produce a policy which we will treat as a set of defaultChapter 1. Introduction 7reactions, actions that the agent will do if there is no time to find a better plan. To do thiswe use an idea from classical planning, namely abstraction. The idea of using abstractionis to solve a large problem by constructing smaller but closely related problems, and usingthe solution to a small problem as the basis for a solution to a larger problem. In the caseof our algorithm, we ignore some of the atoms in the propositional representation of theproblem to create a smaller MDP which we will refer to as the abstract MDP. Becauseof the way we choose the atoms to ignore, an optimal policy for this small MDP can alsobe used as a (usually sub-optimal) policy for the original problem.The second part of our approach is performed at run-time. Since the policy we foundusing abstraction is usually not optimal, we use search to try to find a better strategyfor the current state. By interleaving search with execution — search for the best actionto do now, perform it, and then search for the best action in the new state — we canreduce the size of the search considerably, and if there is no time for searching, we stillhave the default reactions we computed with the abstraction algorithm.We can compute theoretical bounds on the difference in value between an optimalpolicy for the original MDP, and a policy produced from any abstract MDP. With thesebounds a user can trade off computation time for policy accuracy by choosing an appropriate abstraction. Similarly, the depth of the search tree can be adjusted to fit theavailable time before the next action must be performed, and actions can be cached toavoid recalculation of the best action when revisiting a state. These properties contributeto making the algorithm ideally suited for time-critical domains.1.3 Organization of this ThesisThis chapter has introduced the area of decision-theoretic planning, described the motivation behind this work, and briefly outlined the contribution of this thesis. Chapter 2Chapter 1. Introduction 8describes the MDP model in considerably more detail, examines the propositional representation we use for domains, and provides a summary of related work in classicalplanning, search, and decision-theory as well as in decision-theoretic planning itself.Chapter 3 is a detailed description of the abstraction algorithm, including theoreticalresults concerning the value of abstract policies, and experimental results that demonstrate the algorithm in operation. Similarly, Chapter 4 describes the search algorithm.Chapter 5 discusses how the two algorithms combine to form a complete planning system,and describes some of the future research this thesis leads towards.Chapter 2Markov Decision Processes for PlanningIn this chapter we will describe the basic characteristics of Markov Decision Processes(MDPs), compact representations for them, and a standard approach from OperationsResearch for planning using them. For reasons of simplicity, we will only concern ourselveswith completely observable MDPs as this research has almost entirely concentrated onthis model. Although partially observable MDPs can accurately represent many morereal world problems, and a much larger class of situations, they are computationallymuch more difficult to deal with. We will leave our discussion of the partially observablecase to Chapter 5.Finite automata provide a general way of representing problems that include an explicit state space, and a set of actions each of which is a mapping from state to state. Ifthe actions are probabilistic mappings from a state to a probability distribution overall states — such a model is called a stochastic automata. A Markov Decision Processis a stochastic automata in which each state has a numerical reward or cost associatedwith it (we will use the term reward to describe both rewards and costs, costs can beconsidered as negative rewards). At each state, a decision is made about what action toperform, with the objective being to maximize the total accumulated reward over a seriesof such decisions. An MDP represents all possible worlds explicitly as states, representsactions as stochastic transitions from state to state, and represents objectivesor goals interms of rewards associated with states. Although rewards and costscan be associatedwith actions as well as states, we will not consider that case here.9Chapter 2. Markov Decision Processes for Planning10There are many advantages to using MDPs for decision theoretic planning. Theyare very general, and allow a wide variety of problems to be represented. Therewardstructure is much more flexible than traditional goal representationsfor planning, andMDPs are much better than classical planning at representing processes,so they can heeasily used to plan in systems that must keep some condition or conditions true asmuchas possible, or deal with an infinite sequence of objectives. MDPs havelong been used inOperations Research, and well developed algorithms exist for findingoptimal universalplans or policies.One difficulty with using MDPs for planning is that the Markov Property (seebelow)must hold in the domain. Although a model that conforms to theMarkov Property canbe designed for any domain, in practice this may result in a muchlarger state space.Another problem is that it is generally not possible to take advantageof the structureof a domain because it is hidden in the MDP model of it.By using more compactrepresentations of domains we can take advantage of certain structurewhen planning(see Chapter 3).2.1 The MDP modelFor our purposes, an MDP can be defined by the tuple< 5, A, T, R >, where S is a finiteset of world states that the agent can distinguish,A is a finite set of actions the agentcan perform, T is a state transition function thatdescribes the effects of each action, andR is the reward function. Although this need not be the case in general,we will assumethat the set of feasible actions for any states E S is A every action can be attemptedin any state. T is a mapping from S xA into discrete probability distributions overS. We write Pr(s2A,Si)for the probability that state s2 results ifthe agent performsaction A in state s1. As we would expect, 0Pr(tIA,s) 1 for all s,t, and for allChapter 2. Markov Decision Processes for Planning11C)Bo/to.:B 03\A0.7B0.8A0.2A0.6Figure 2.1: An example of an MDP with 5 states and 2 actions.s,Zs Pr(tIA,s) = 1. R is a mapping from S to which specifies the instantaneousreward the agent gains for entering a state. ‘We will write R(s) to signifythe reward forentering state s. By our definition of T, we ensure that the Markov Propertyholds.Definition 1 [Markov Property] A tuple <S,A, T, R > has the Markov Property if theresult of performing any action Ae A depends only on the current state s enot on the sequence of actions that left the world in states.Figure 2.1 shows an example MDP depicted as a directed graph. Each arrowrepresents a possible state-to-state transition, and is labeled with the action thatcould causethe transition to be used, and the probability that it will be used giventhat action isperformed. For example, from state u, if action B is performed,the transition from uto itself will take place with probability0.3. Here S is {s, t, u, v, w}, 1? is {A, B}. T foraction A is:Chapter 2. Markov Decision Processes for Planning 12From To s t u v ws 0 0.3 0 0.7 0t 0.10.90 0 0u 0.6 0 0 0.4 0v 0 00.40.60w 0 0 0 0.80.2One possible reward function for the MDP is: R(s) = R(t) = R(u) = R(v) =—1, R(w) = 0. This produces an MDP where the best actions to take are those thatput the world in state w with the minimum number of actions, and then keep it there.The optimal policy for this MDP is to perform action A in states u and w, and action Botherwise.2.2 Propositional DomainsIn many planning problems, we describe the state of the world using a set of propositionalatoms. For example, we might say that Raining is false in the current world, butthatCloudy is true. This section discusses the propositional model of actions that we havechosen to use. In translating such a scheme into a MDP, we make each state in thenetwork correspond to a possible world, so for a world represented by n propositionalatoms, the resulting MDP contains2states.2.2.1 The extended STRIPS modelThe representation described here is based on that used in BURIDAN (Kushmerick, Hanksand Weld 1993) with the addition of a method for representing independence betweendifferent aspects of the outcome of an action.Chapter 2. Markov Decision Processes for Planning 13In the classical STRIPS (Fikes and Nilsson 1971) representation, actions are represented with a list of preconditions and a list of effects made up of literals which willbecome true or false if the action is executed. If the preconditions are true, and theaction is performed, then the effects list is applied to the current world to determinethe new world. In BURIDAN, this is extended by adding multiple mutually exclusive andexhaustive preconditions which we will refer to as discriminants(,and by making theeffects probabilistic. For each action and discriminant, there is a set of effects lists withassociated probabilities. The probabilities sum to one, and represent the probabilitythat the corresponding effects actually occur given that the action is performedwith thediscriminant true. Table 2.1 shows a problem represented in this way. If we look atthe BuyCoffee action, we see that it has two discriminants. If Officeis true, the withprobability 1, there will be no changes to the state, while if Office is false, then withprobability 0.8 HRC will be true after executing the action, regardless of the previousvalue of HRC, and with probability 0.2, there will be no effect.If s is the current state(represented as the set of literals true in that state), and is the effect list to be applied,then the new state that results is:We have extended the BURIDAN representation slightly in two ways. Firstly, we haveadded what we call action aspects. These are designed to represent thefact that in someof the effects of an action are dependent on only some of the preconditions. Forexample,in Table 2.1, the Move action has two aspects. The first representsthe fact that whenthe agent performs a Move, the resulting location depends on the agent’s currentlocationonly. It is independent of the values of Rain and Umbrella. Thesecond aspect deals withwhether the agent becomes wet or not. Since this isindependent of where the agent is,the precondition only contains Rain and Umbrella. Actionswith multiple action aspectsChapter 2. Markov Decision Processes for Planning 14Table 2.1: An example domain presented as STRIPS-style action descriptions. (a) is theaction description for the problem, while (b) is the reward function. Note that HUC andHRC are HasUserCoffee and HasRobotCoffee respectively.UserlsThirsty*is a randomevent.Action Discriminant Effect Prob.Move Office -‘Office 0.90.1-‘Office Office 0.90.1Move Rain,—’Umb Wet 0.90.1BuyCoffee -‘Office HRC 0.80.2Office 1.0GetUmbrella Office Umbrella 0.90.1DelCoffee Office,HRC HUC,HRC 0.8HRC 0.10.1-‘Office,HRC -HRC 0.80.2-HRC 1.0UserlsThirsty*—HUC 0.010.99(a)Sentence Value Sentence ValueFIIUC,—‘Wet 1.0 -iHUC, -‘Wet 0.2HUC, Wet 0.8 -‘HUC, Wet 0.0(h)Chapter 2. Markov Decision Processes for Planning 15Table 2.2: Translating action aspects and random events into single actions. (a) is asummary of both aspects of the Move action, while (b) is the combination of the randomeventUserlsThirsty*with the Get Umbrella action.Action Discriminant Effect Prob.Move Office,Rain,—iUmb -‘Office,Wet 0.81Wet 0.09—iOffice 0.090.01Office,-i(Rain,-iUmb) -‘Office 0.90.1—iOffice,Rain,—iUmb Office,Wet 0.81Wet 0.09Office 0.090.01-Office,-(Rain,--iUmb) Office 0.90.1(a)Action Discriminant Effect Prob.GetUmbrella Office Umbrella 0.891Umbrella,-’HUC 0.009—HUC 0.0010.099-‘Office -HUC 0.010.99(b)can be translated into actions with a single aspect by forming the “cross-product” of theireffects. Table 2.2 (a) shows the translated form of the move action from the example.The second extension of the BURIDAN model is the addition of random events. Theseare actions over which the agent has no control and which (in our rather naive model)occur randomly at the same time as some action of the agent. If we imagine a situationwhere an agent must keep some process running smoothly, random events representthingsthat occur during the process that the agent must deal with. As with actionaspects,random events can be combined with normal actions by forming the “cross-product”ofChapter 2. Markov Decision Processes for Planning 16Table 2.3: Parts of two actions to be translated into a single action. A1 is a normalaction,A*is a random event, aridAis the new action produced by combining them.Discriminants D1 and D2 are combined into a single new discriminant. The o operatoris defined on two sets of literals E and F such that E F E U {F\{l : —11 E}}Action Discriminant Effect Prob.A1 D1 E1,P1,1E1,2P1,2E1,P1,nA*2 2,1 P2,1E2,P2,2E2,mP2,mA D1 A D2 E1, E,1Pi,i P2,1E1, E2,P1,1 P2,2E1,D’ E2,mP1,1 P2,mE1,2 E2,1P1,2 P2,1E1, E2,mP1,n P2,mtheir outcomes. If an action of the agent and one or more random events would haveaconflicting effect on some atom (for instance if the agent delivered coffee to the userjustas the user became thirsty), we arbitrarily assume that the action written first in therepresentation of the problem takes precedence. Table 2.2 (b) shows the result of addingthe random eventUserlsThirsty*to the Get Umbrella action. The complete translationof Table 2.1 can be found in the appendix.Table 2.3 gives a formal definition of the result of combining two discriminants in theChapter 2. Markov Decision Processes for Planning 17process of translating a regular action and a random event into the compiled form. Tocompile a sequence of n actions A1,A2,. .. , A, (one of which is assumed to be a regularaction, and the rest random events) into a single action, we proceed by compiling A1 andA2, then compile the result with A3, and so on. To compile two actions A and B, foreach discriminant of A and for each discrirninant of B we produce a new rule (here werefer to a discriminant and its set of effects and probabilities as a rule) as in Table 2.3.The new rule will have the conjunction of the two discrirninants as its discriminant, andwill contain a probability and effect for every pair of effects in the original rules. If Eand F are effects from A and B with probabilities p and q respectively, then the new rulewill contain an effect E U {F\{l : -11 E E}} which has probability p• q. Obviously, if theactions have discriminants D1 and D2 where D1 A D2 I— L then no new discriminant iscreated for their combination. Similarly, there may be a large number of identical effectsin the new rule that can be combined by summing their probabilities, and there may beeffects that include literals that appear in the corresponding discriminants. These literalswhich the action makes true, but which are already known to be true can be deleted fromany effects lists they appear in.We will use a slightly different way of writing the extended STRIPS rules when weare talking about effects. In our alternate representation, a problem consistsof a listof rules. Each rule is a tuple consisting of an effect E, the probability of it occurringp, the discrirninant D that makes it possible, and the action A it relates to. Thisrepresentation is identical to the standard representation given above theelements ofeach tuple are identical but is a more convenient way to talk about effects lists andtheir associated actions and discriminants. We will also occasionally want to talk aboutthe set of atoms that appear in the logical sentence that is a discriminant. We will writeDf for the set of all atoms that appear either negated or unnegated in discriminant D.For example, if D is p V -q, then D={p, q}.Chapter 2. Markov Decision Processes for Planning 18We have represented actions in terms of propositions, but we would also like a compact representation of the reward function. There are two simple representations forrewards that we have considered. In the first, each atom is assigned rewards for its truthor falsehood and rewards for states in the MDP are the sum of the rewards for eachatom’s value in that state. A more general approach (in fact any reward function can berepresented by this method) is to assign rewards to some set of mutually exclusive andexhaustive sentences so that the reward for any state in the MDP is the reward for thesentence that is true of that state. Our algorithms work equally well with either of these,so we will generally prefer the second as it can describe a strictly larger class of rewardfunctions.2.2.2 Using Bayesian networksNicholson and Kaelbling (1994) use a representation of actions as time dependent Bayesiannetworks. They use a two-slice network to represent each action. The first slice representsthe values of (possibly multi-valued) variables before the action is performed, while thesecond slice represents the value after the action. In addition there is a value node whichrepresents the instantaneous reward for the new state of the world. Arcs in the diagramrepresent probabilistic dependence between variables. As with conventional Bayesiannetworks, each node contains a table of conditional probabilities over all the nodes witharcs leading to it.The Bayesian Network representation for actions is more expressive than the extendedSTRIPS approach in that nodes in the network may have more than two possiblevalues.However, for propositional domains the two are equivalent. Figure 2.2 showsour exampledomain from Table 2.1 represented with Bayesian Networks.Chapter 2. Markov Decision Processes for Planning 19MoveBuyCoffeeDeliverCoffeeGetUmbrellaFigure 2.2: The influence diagram representation of the actions for the coffeerobotexample.Chapter 2. Markov Decision Processes for Planning 202.3 Goals and RewardsIn classical planning, the objective of the agent is to make some goal true. A planningproblem typically consists of a set of actions and a goal, with the agent’s task beingto find a sequence of actions that make the goal true. Note that for most classicalplanners, the objective is to find any plan that satisfies the goal, without consideringhow efficient the plan is. Decision theoretic planning allows a more general description ofthe aims of the agent. As we have already described, it allows us to think about agentsas continuing processes, but we can also use a decision theoretic framework to representdomains where there are multiple goals with different priorities, and goals with moreinteresting structures. As an example of this, the agent may want to achieve one of twothings, but achieving both would be undesirable.How can we represent classical goals in the framework of MDPs? The standardapproach used by (Dean et al. 1993b) is to create a reward function such thatR(s) = 0if s is a state in which the goal is satisfied and R(s) = —1 otherwise, and to make allstates in which the goal is satisfied absorbing (an absorbing state is one in whichtheresult of every action is to leave the state unchanged). This reward function will causethe agent to seek out goal states using as few actions as possible. In contrast with aclassical planner, it finds a minimal plan, not just any plan that will achieve the goal.2.3.1 Why use rewards?As we have already said, rewards provide a more flexible representation of an agentsaims, and also allow more complex goals to be represented, but there are other reasonsfor using the full power of a reward function in decision theoretic planning.In many situations, producing optimal plans in probabilistic domains is computationally infeasible. The cost of calculating an optimal plan in a time-critical domainmay beChapter 2. Markov Decision Processes for Planning 21greater than the improvement in performance gained. If approximate plans are useful,the reward function can be a very valuable aid in determining the most important tasksfor the agent. For example, consider the example domain in Table 2.1. The rewardfunction shows us that the most important goal to achieve is that the user has coffee.In making a plan for this domain, it may well be worth ignoring the small reward forkeeping the agent dry, and concentrate on delivering coffee. Classical goals are unable torepresent information of this kind.2.4 Policy Iteration and PlanningA control policy r is a mapping from S to A. If an agent adopts some policy r, then ir(s)is the action the agent will perform whenever it finds itself in state s. Thus r is a universalplan, an action to perform in every possible circumstance. An agent that adopts policyK can also be thought of as a reactive system. Given an MDP, an agent ought to adopta policy that maximizes the expected rewards for states visited. Such a policy is calledan optimal policy. We will concentrate on discounted infinite horizon problems where theactions of the agent continue infinitely, and the value of a reward received n actions in thefuture is discounted by some factor/3n(O</3 < 1). We do this for a number of reasons.Even though in many situations we can expect the agent to have a finite lifetime, werarely know in advance how long it will be. We can approximate a long finite horizonprocess with an infinite horizon one. Discounting is used in Operations Research MDPproblems where a dollar earned today is worth more than a dollar earnedtomorrow, orwhere the agent has some probability 1— /3of “dying” at each time step. While thismay not be the case in many planning problems, discounting encourages short plans,and is much easier to compute than alternative methods such as average reward criteria(Puterman 1994). As well as an action to perform in each state, we arealso interestedChapter 2. Markov Decision Processes for Planning 22in the value of that state. While the total reward received by an agent will diverge overan infinite lifetime, the discounted reward will converge on a finite value (assuming finiterewards). The agent’s objective is to maximize the expected accumulated discountedreward for all future states over an infinite time period. One way to look at a problemin decision theoretic planning is as finding a good — or preferably optimal — policy, atleast for all states that the agent actually visits.Given some policy ir and a reward function R, we can compute the value of any states e S (V.(s)) by the following formula due to Howard (1960):V.(s) = R(s) + Pr(tir(s), s)V(t)teSSince the value of each state is dependent on the values of all others, we can find thevalue of r for all states by solving the set of simultaneous linear equationsV(s)Vs E S.A policy r is optimal if:(Vs E S)(Vir’)(V(s) > V’(s))2.4.1 Calculating optimal policiesThere are two commonly used algorithms for calculating optimal policies in discountedinfinite horizon problems. Value iteration (Bellman 1961) converges on an optimal policyby considering the effects of states further and further from each state. It is guaranteedto converge on an optimal policy, but only approximates the value of states in the policy.The algorithm we have used is policy iteration, first presented by Howard (1960). Policyiteration converges on both an optimal policy and accurate values for states under thatpolicy. Althollgh its computation time is exponential in the number of states in the worstcase, policy iteration tends to converge quickly in practice.Chapter 2. Markov Decision Processes for Planning 23The policy iteration algorithm begins with some arbitrary policy, and repeatedlyimproves on it until no more improvement is possible. At that point itis guaranteed toproduce an optimal policy. The algorithm in detail is given below:1. Let r’ be any policy on S2. While K ‘‘r’ do(a) K =(b) For all s e 8, calculate V,,.(s) by solving the set ofSIlinear equations givenby the equation above.(c) For all s S, if there is some action a E A such thatR(s)+/3iesPr(tla,s)V7,.(t) >V(s) then ir’(s) = a;otherwise 7r’(s) = K(s)3. Return ‘irThe algorithm begins with an arbitrary policy (a good choice is a greedyapproachthat chooses the action with the greatest immediate reward),and repeatedly improveson the policy until no more improvement is possible. The improvementstep is performedby first constructing a vector containing the value of each stateby solving the set of181linear equations inSIunknowns given above in the equation forthe value of a state.Algorithms for solving linear equations of this kind are typically0(n3)where n is thenumber of unknowns. Once the value of every stateunder the current policy is known,the algorithm attempts to find a better action foreach state if one exists. To find abetter action for a given state, the algorithm holds therest of the policy and the valuevector constant, and checks every action to see ifit has a higher value than the currentone, using the same equation as before. Althoughthis won’t be the true value of thatChapter 2. Markov Decision Processes for Planning 24Table 2.4: The optimal policy for the sample domain.HUC,HRC HUC,HRC HUC,HRC I-IUC,HRCOffice BuyCoffee Move BuyCoffee MoveOffice,Rain, BuyCoffee Move BuyCoffee BuyCoffeeUmbrella, WetOffice Move DeliverCoffee Move BuyCoffeeOffice,Rain, GetUmbrella DeliverCoffee GetUmbrella GetUmbrellaUmbrella, Wetaction (the value of this state will differ from that in the vector), it is still an indicationthat the policy is not yet optimal. The algorithm stops performing improvement stepswhen no better action than the one in the current policy can be found for any state.For the example problem in Table 2.1, there are 64 states in the MDP. The optimalpolicy discovered by policy iteration (using = 0.95) is shown in Table 2.4. As the tableshows, if the robot doesn’t have coffee and is in the office, it will get the umbrella if thereis rain, the robot is dry, and it doesn’t have it already, and will then move to the coffeeshop. If the robot is at the coffee shop it will buy coffee if it doesn’t have any, and willthen move to the office unless the user has coffee and it is likely to get wet, in which caseit does nothing (by buying more coffee). If the user wants coffee and the robot has somein the office then it will deliver it, but if the user doesn’t want any, the robot will get theumbrella if that might be useful, and then does nothing (by trying to buy coffee in theoffice). The fact that the robot may repeatedly buy coffee while waiting until the userbecomes thirsty is a good example of why an MDP model with rewards (or in this casecosts) on actions as well as states would be useful.The values of states in this policy range from 12 to just under 20, with the lowestvalues (12—14) for states where the robot is (or will probably become) wet and theuserhas no coffee. States where the robot is dry, or the user has coffee but not both haveChapter 2. Markov Decision Processes for Planning 25intermediate values (15—18), while the highest values (19—20) are for states where therobot is dry and the user has coffee. Computing this policy required 0.48 seconds.2.5 Related WorkDecision-theoretic planning is a relatively new field. Consequently, there is little closelyrelated previous work. However, the abstraction procedure is related to other such algorithms from classical planning, and the search algorithm bears similarities to a numberof search procedures, especially from game playing, and decision analysis.2.5.1 MDP planningThis section describes a selection of important papers that are directly related to thisthesis in that they are performing decision-theoretic planning using explicit MDP modelsin domains with many of the same characteristics we are assuming.Dean, Kaelbling, Kirman and Nicholson — Planning Under Time Constraintsin Stochastic DomainsDean, Kaelbling, Kirman and Nicholson (1993b; 1993a) produced two papers whichprovided the motivation for a lot of the recent interest in using MDPs for decision-theoreticplanning, including this thesis. Although MDPs and algorithms for them were alreadywell known in operations research, these two papers first suggested examiningtimecritical applications where optimal policies cannot be computed from an ATperspective.The approach taken in these papers is to use information about the starting stateofthe system, the reward function, and the transition probabilities for actions in ordertorestrict the planner’s attention to only those states which are likely to be encounteredin the process of reaching the goal. By planning in this smaller envelope of states,theChapter 2. Markov Decision Processes for Planning 26agent can find a policy with much less computation than the original domain. The sizeof the envelope can be tailored to the amount of computation time available, althoughif the agent leaves the envelope, replanning will be required. These papers also considerthe question of when computation should be performed. In their precursor deliberationmodel all computation is performed before any actions are taken, while the recurrentdeliberation model plans and acts in parallel.Nicholson and Kaelbling Toward Approximate Planning in Very LargeStochastic DomainsNicholson and Kaelbling (1994) are also interested in automatically generatingabstractions. As with our approach, Nicholson and Kaelbling assume a compact representationof the domain in terms of domain attributes, although they allow theirs to have morethan two possible values, and represent actions as two-slice Bayesian networks. Theyalso perform abstraction by ignoring domain attributes, although their abstraction procedure is rather more general as abstractions can always be created. Unfortunately,withthis generality comes a great deal of computation, as the effects of actions must all berecomputed by assuming a uniform distribution over the values of the ignored attributes.To decide which domain attributes to ignore, Nicholson and Kaelbling use sensitivityanalysis. Their approach is to generate an abstraction and an envelope, and continuallyrefine these by extending the envelope, building a policy for it (in the abstractMDP),and expanding the abstraction to be more fine-grained if necessary, until theagent mustact.Tash and Russell Control Strategies for a Stochastic PlannerTash and Russell (1994) use a rather different form of the envelope approachof Dean etal. Their system begins with an heuristic which estimates the value of eachstate, andChapter 2. Markov Decision Processes for Planning 27through a number of iterations, improves this heuristic to more accurately represent thetrue value of each state. Tash and Russell describe an algorithm whereby an envelope iscreated around the current state, the states just outside the envelope (the fringe) havetheir values fixed to the heuristic values, and a local form of policy iteration is performedon the envelope and fringe. The result of the local policy iteration algorithm isan actionto perform for every state in the envelope, and new heuristic values for eachof thosestates. As with the search algorithm we present in Chapter 4, the agent inTash andRussell’s planner executes each action as soon as it has computed it, and then rebuildsits envelope to choose the next action. The relationship between our search algorithm,and the approach of Tash and Russell is analogous to that between value iterationandpolicy iteration except that we make no changes to our heuristic function, preferringtosave computation time when revisiting states.Tash and Russell also present an introspective planner which not onlytakes intoaccount the problem domain it is working in, but also its knowledge of the domainwhenplanning. The introspective planner is designed to expend extra computationaleffortin areas where it currently has little information, and to use well-known pathsthroughthe state space when little computation time is available. They presentresults whichshow the introspective planner finding optimal plans more quickly than theirstandardalgorithm when computation time is relatively cheap, but more slowlyif computationtime is expensive.2.5.2 Other decision-theoretic planning algorithmsIn this section we will examine a few other decision-theoreticplanning algorithms thatdo not directly use MDPs as their underlying model ofthe world. Unlike the MDPplanners which generally use dynamic programming techniques and find universalplans(at least for a restricted part of the state space), these systems are much closerto classicalChapter 2. Markov Decision Processes for Planning 28planners.Feldman and Sproull (1977) were among the first researchers to investigate the useof decision theory for AT. They suggested using bounds on utility to guide the planningprocess, but they left their utility model relatively undeveloped, focusing instead onrepresentational issues for reasoning with uncertainty.Haddawy and Hanks — Representations for Decision-Theoretic Planning:Utility Functions for Deadline GoalsHaddawy and Hanks (1992) made one of the earliest attempts to couple decision theorywith AT planning. In particular, they were interested in using representations of goalsto determine plan quality. Although they make no effort to model actions, either probabilistically or otherwise, they have a detailed model of planning goals, and examine waysthat the utility of a plan will vary depending on whether all goals are achieved, andon whether there is only partial achievement. Unlike our approach, they also examinetemporal aspects of planning. Their goal description language is rich enough to capturerequirements such as “Make G true by noon” or “Make sure that H is true betweenlOam and 1pm,” allowing them to reason about goals that are partially completed froma temporal point of view (for example if H is true only between 11am and noon).Kushmerick, Hanks and Weld — An Algorithm for Probabilistic PlanningKushmerick, Hanks and Weld (1993) developed the BURIDAN planner, a method forperforming probabilistic planning with uncertainty about both state and the effects ofactions in a classical planning framework. BURIDAN uses a classical partial-order planner,for example SNLP (McAllester and Rosenblitt 1991) for plan creation, and a probabilisticreasoning system to evaluate the partial plans. The system requires a set of probabilisticactions, a probability distribution over initial states of the world, agoal expression, andChapter 2. Markov Decision Processes for Planning 29a probability threshold. A solution consists of a partial ordering of actions which willaccomplish the goal with probability greater than the threshold given the distribution ofinitial states. Note that the notion of a goal here is of a set of states that must be reached.BURIDAN is unable to make use of the rich variety of goals that can be represented inMDP planners.The BURIDAN planner builds a complete plan to reach the goal based on its initialinformation about the initial state, without executing a single action. One characteristicof the plans it makes is that actions are frequently repeated in order to increase theirprobability of success. A modification of the BURIDAN algorithm, where the agent canperform actions and gather information to avoid this behaviour, is the C-BURIDAN plannerpresented in (Draper, Hanks and Weld 1994). Here, actions not only change the world,they also give the agent messages about the state of the world (some actions mayoniyobserve the state of the world), by building conditional plans based on these messages,C-BURIDAN can produce plans where an action is repeatedly performed until a messagethat shows it has succeeded is received.Haddawy and Doan Abstracting Probabilistic ActionsHaddawy and Doan (1994) use an action model very similar to that of the BURIDANplanner, although it can handle a much larger class of rewards and costs, including rewards that vary over time. Their DRIPS system attempts to use abstraction toreduce thenumber of possible plans the agent must generate and evaluate. While we areinterestedin abstracting states while leaving the actions unchanged, Haddawy and Doanbuild abstract actions by combining possible outcomes of one or more actions. Forexample, ifthe system contains two actions ‘drive on the mountain road” and “driveon the valleyroad”, the abstraction might abstract them into a single action which combinesthe outcomes and preconditions of both. The DRIPS system searches for a plan thatmaximizesChapter 2. Markov Decision Processes for Planning30expected utility by first building abstract plans, and then only refining those withthehighest expected utility. Although this technique greatly reduces the number ofplansthat DRIPS evaluates, it is not clear how much computation would be required tofindoptimal plans in real-world domains by this approach.2.5.3 Abstraction in classical planningIn this section we examine the use of abstraction in classical planning. Usingabstractversions of a problem in order to solve it efficiently is by nomeans new. The first formaldescription and application of this idea to planning is in the ABSTRIPSplanner describedbelow.Sacerdoti — Planning in a Hierarchy of Abstraction SpacesSacerdoti (1974) described the earliest use of abstraction in his ABSTRIPS planningsystem. Many of the ideas from this system have influenced our abstractionprocedure.Along with a description of the actions and the goal, ABSTRIPS must alsobe providedwith an abstraction hierarchy which is used to automaticallyassign criticalities (a measure of how difficult a literal is to achieve) to the preconditions ofeach action. Thesystem uses this to form a hierarchy of problems with the topmost levelcontaining onlythe literals in the goal that are the hardest to satisfy, since the actions thatmake themtrue are the most likely to be clobbered by the effects ofother actions, and the lowestlevel is the original problem. ABSTRIPS builds a planat the topmost level of the hierarchy, where only the most critical propositions are reasoned about,and then refinesthe plan level by level by including more preconditions, and operatorsto satisfy them,until a complete plan has been constructed. Our approachis closely related to this aswe form abstract problems by only including the most important propositions.Althoughwe generally don’t construct more refined plans from theabstract ones in the same wayChapter 2. Markov Decision Processes for Planning 31ABSTRIPS does, Chapter 3 mentions the possibility of using abstract policies to seed thepolicy iteration algorithm in the original domain to find optimal policies more efficiently.Knoblock — Generating Abstraction Hierarchies: An Automated Approachto Reducing Search in PlanningKnoblock (1993) extends the ABSTRIPS approach to abstraction with the ALPINEsystem, which automatically learns the abstraction hierarchy that ABSTRIPS requires.Theapproach is to construct a partial ordering of the literals in the system by findingconstraints between literals, building a graph of the constraints, and topologicallyorderingthe graph. A constraint from one literal to another represents the fact thatachieving thefirst might interfere with achieving the second, and hence the first shouldbe at least ashigh in the abstraction hierarchy. This procedure is very similar to and inspired —the abstraction algorithm we describe in Chapter 3.Smith and Peot — Postponing Threats in Partial-Order PlanningPartial-order planners, such as SNLP (McAllester and Rosenblitt 1991), operateby constructing plans where two actions are ordered (one must take place beforethe other)only if one has an effect which threatens the preconditions of the other.Ordering twoactions is known as resolving a threat. Smith and Peot (1993)describe an algorithmfor avoiding having to resolve some threats during the planning processby postponingthe threat until the plan is complete. In order to do this, they introducedthe conceptof operator graphs, a generalized form of the graphs constructedby the ALPINE systemfor total-order planning. Operator graphs represent the factthat certain actions makethe preconditions of other actions true: there is a path (sequenceof arcs) in the graphbetween two literals if performing the actions along thatpath will make the second literaltrue if the first one is. These graphs show the deterministic equivalentof the relationshipsChapter 2. Markov Decision Processes for Planning 32we discover for probabilistic actions when we automatically construct abstractions (seeChapter 3). Although the uses we put them to is rather different, the graphs Smith andPeot construct represent exactly the relationships we are interested in when we constructabstractions.2.5.4 Markov decision processes and Decision TheoryThere is a huge body of work in the area of Markov decision processes, mostof it inthe field of operations research. There are two well known algorithms for constructingoptimal policies for MDPs where rewards are discounted over time. We have alreadydescribed the most commonly used, policy iteration, described by Howard (1960).Theother commonly used algorithm is value iteration(Bellman 1961). Value iterationbeginswith a vector of estimated values for each state, and repeatedly updates the valueof eachstate until the vector converges on the optimal value. To perform each update,the valueiteration algorithm finds the action which gives the highest expectedimmediate value(using the values from the previous iteration), and sets the new value of this stateto bethe immediate reward plus the discounted expected value of performing thataction. IfV is the value vector at the ith iteration, the (i+1)th iteration value for state s is:= R(s)+3 max{ Pr(ta, s)V(t) : a E A}teSThis function is guaranteed to converge in the limit to the optimal policy, and inpractice,often converges quite quickly. The search procedure we describe inChapter 4 resemblesa local form of policy iteration in that deeper and deeper search willconverge on thecorrect value for a state in the same way that value iteration does.For a more recent survey of policy construction in MDPs, includingdiscussion ofpartially observable MDPs, semi-Markov processes, non-discounted MDPswhere theaverage reward earned per action must be maximized, and more efficient algorithmssuchChapter 2. Markov Decision Processes for Planning 33as modified policy iteration for computing optimal policies, see (Puterman 1994). Thisalso describes work on finding optimal policies by aggregating states. However, ratherthan taking advantage of the way the problem is represented as our abstraction algorithmdoes, these algorithms use the MDP directly when deciding when to aggregate states.Although we have yet to implement many of its suggestions, we were greatly influencedin this research by (Russell and Wefald 1991). Russell and Wefald discuss decisionmaking in general, and in particular, deciding whether to perform further sensing and/orcomputation in order to improve a decision, or whether to act now with the currentinformation. Adding the ability to reason introspectively about the value of furthercomputation is an important capability of any planner, and one we would very much liketo add to our system.2.5.5 Search algorithmsThe search algorithm described in Chapter 4 is closely related to a number of classic ATsearch algorithms. Although the algorithm is in essence a decision-tree search, its use ofheuristics and pruning techniques is based in game-tree searches.A decision tree is an explicit representation of all the possible scenarios that couldoccur following a decision. The root of the tree represents the initial situation, whileeach path through the tree represents a series of decisions, outcomes of actions, andevents that might occur, terminating in a consequence node which contains the utilityof the situations that would result from following that path. The search trees whichwe construct are very similar to this, they contain action nodes, wherethe agent mustdecide which action to perform, and chance nodes where all the possible outcomes of someaction are investigated. The main difference between our search procedure anddecisiontree analysis is that we cannot produce a complete tree due to the nature of our problems,so we produce a partial decision tree and use the heuristic to estimate the utility of theChapter 2. Markov Decision Processes for Planning 34final nodes in each path. However, the algorithm we use to determine which actionto perform is essentially identical to the exp-max algorithms used in analyzing decisiontrees. For more detail on decision trees and their uses, see (Pearl 1988) and (Howard,Matheson and Miller 1976).Ballard — The*...MinimaxSearch Procedure for Trees Containing ChanceNodes(Ballard 1983) describes an extension of alpha-beta search for game trees. The*minimaxsearch algorithm selects game moves in games which have probability nodes whose valueis defined to be the weighted average of their successors’ values. The class ofgamesthat can be played by this algorithm are those with random elements, but nohiddeninformation. The treatment of the probability nodes in*minimaxsearch is essentiallyidentical to that of the AVERAGE nodes of the search procedure we describe in Chapter4. In fact, the trees we construct during the heuristic search procedure are equivalentto*.minimaxtrees which contain no MIN nodes (moves by the opposing player). There isa similar relationship between the pruning techniques used in*1ninimsearch and ourutility pruning technique. The decision trees we construct during the searchprocess canbe seen as game trees for a game played against a randomizedopponent, so many of theresults shown for trees also hold for our problem.Korf — Real-time Heuristic SearchKorf (1990) presents a single-agent heuristic search procedure adaptedfrom game-treesearch. TheRealTimeA* (RTA*)algorithm combines a limited searchhorizon andcommitment to moves in a constant time to produce a search procedurewhich is thedeterministic equivalent of the one we present in Chapter4. Using the property thatthe heuristic evaluation of nodes is monotonic (that is, the heuristicvalue of any nodeChapter 2. Markov Decision Processes for Planning35is at least as high as that of its parents), Korf provides a form of pruning for these treeswhich he calls alpha pruning. Alpha pruning is similar to the expectation pruningthatwe perform, although the properties of the heuristic that we make use of are different.Paradoxically, Korf finds that as the branching factor of the domains he has examinedincreases, the search space actually reduces as a result of the effectiveness of alphapruning.Korf also presents a learning form ofRTA*which, over a series of runs, will improveits heuristic function, and will eventually converge on the true values of states.Some ofthe same ideas can be applied to the search procedure we present, but we havenot yethad the opportunity to investigate them.Chapter 3Creating Approximate Policies Using AbstractionAs we described in Chapter 2, computing the optimal policyfor a Markov DecisionProcess using policy iteration is usually polynomial in the size of thestate space. Whilethis might seem to be a reasonable computational cost, in practice thestate space maybe extremely large. For example, in a propositional domainmade up of n atoms, thesize of the state space is2.Given that the state space is often too largeto compute anoptimal policy for, we need a way to produce approximately optimalpolicies in less thanpolynomial time.To produce “good” policies efficiently, we will use a process ofabstraction. By abstraction, we mean the creation of a smaller problem which containsthe most significantdetails of the original problem while ignoring some of the less importantproperties. Asolution to the simpler problem can then serve as the outlinefor a solution to the originalproblem which will considerably reduce the total computationrequired. In the case ofthis work, abstraction provides a mechanism for reducing the sizeof the problem to besolved to manageable proportions, and allows us tocheaply compute a (probably lessthan optimal) solution to the original problemdirectly from an optimal solutionto thesmaller problem.The most obvious use for the abstraction processis to supply a close-to-optimalpolicy for use in the original domain, but thisis not the only possible use. Wealsouse the abstract policy to construct a heuristicestimate of the value of each state.Byperforming a heuristic search from states that arevisited in the course of carryingout36Chapter 3. Creating Approximate Policies Using Abstraction37actions, we can hope to find better actions than the ones provideddirectly by abstraction(if better actions exist). See Chapter 4 for the details of thisapproach. Since the policyiteration algorithm described in Chapter 2 requires an initialpolicy, we can also use theabstract policy as this “seed” for constructing an optimal policyin the original domain.By first computing a policy in an abstract state space, and thenusing it to compute atrue optimal policy, we can hope for computational savings becausefewer iterations ofpolicy iteration should be required before an optimal policyis reached.3.1 Abstract MDPs and PoliciesIn overview, our approach to abstraction is to buildan exponentially smaller abstractMDP which captures the most significant details ofthe original concrete MDP, use policy iteration to compute an optimal policy for the abstractMDP, which we call theabstract policy, and then apply the abstract policy to the concreteMDP to produce anapproximately optimal policy. The key to this approachis to generate the abstractionautomatically. Rather than requiring the user tospecify the abstraction, the algorithmwill provide a set of possible abstractions each witha bound on the difference betweenthe abstract and optimal policies, and willlet the user select an acceptable value.In thisway the user can trade off computation time forpolicy quality.The abstract MDP must have a number of properties inorder to be useful. Thefirst of these is that the structure we produce inthe process of abstraction must bea MDP. In particular, the Markov Property(see Section 2.1) must hold. Themostimportant property of the abstract MDP is thatthe cost of constructing it should bequite small. Producing an abstraction and then computing its optimalpolicy must requireconsiderably less computation than computingan optimal policy for the originalMDP.If this is not the case, the time is better spent computingthe optimal policy.Chapter 3. Creating Approximate Policies Using Abstraction38Not only must the construction of the abstract policy be inexpensive,the policydiscovered for the abstract domain should be easily applicablein the original MDP aswell. Since we want to use the abstract policy to tell the agentwhat to do in the concretedomain, each abstract action must correspond to some concreteaction or sequence ofconcrete actions. For example, in a navigation domain, theabstract action GoNorthmight translate to the sequence of concrete actions TurnRight, Advanceif the agenthad been facing west to begin with. If the abstract actions donot correspond directlyto concrete actions then they also must be computationallyinexpensive to produce ordiscover.When constructing the abstract MDP, we want to use as muchdomain information aspossible, whether or not that information is explicitly representedin the concrete MDP.This domain information may take a variety of forms. For example, wemay be able touse the fact that the domain is a navigation problem andthat certain states are geographically related to each other when producingthe abstract state space. As we saidinSection 2.2, domains will usually not be represented directly asMarkov Networks. Representations such as the extended STRIPS model(see Section 2.2.1) provide additionaldomain information that can be used to construct more usefulabstractions.The final requirement for the abstraction process is thatthe abstract policy generatedis actually useful in the concrete MDP. Time spenton abstraction is only of value if theresulting policy is fairly close to optimal. As wediscuss in Chapter 4, we can performlocal search — when sufficient computation time is available— to improve on the policygenerated by abstraction so we do not require anoptimal policy. However, we do needtheabstract policy to be a reasonable approximationto the optimal policy for local searchto provide any improvement.Chapter 3. Creating Approximate Policies Using Abstraction391. Using the STRIPS representation of the domain, decide which atoms aremostimportant for constructing a good policy.2. Build the abstract state space 8 by clustering together all the states in S whichagree on the values of the important atoms.3. For each action, build an abstract transition function T by deleting all referenceto unimportant atoms from the action description and translating the extendedSTRIPS representation of the action into an MDP transition function. Note thatan explicit transition matrix need not be built for each action as the extendedSTRIPS rules can be used to generate the linear equations required for policyiteration directly.4. Construct R, the reward function for the abstract problem.5. Use policy iteration to find the optimal policy for the MDP <8,A, T, R>.6. Construct the policy K such that for each state se S e 5, K(s) = i(S). K is anapproximately optimal policy for the original MDP.Figure 3.3: Constructing an approximately optimal policy using Abstraction3.2 Constructing Abstract Policies in Propositional DomainsWhile abstraction can be used in any domain (see (Nicholson andKaelbling 1994) foran example of a different approach to abstraction), we have concentratedon domainsrepreseilted in terms of propositions. In particular, we shall usethe extended STRIPSrepresentation described in Section 2.2. For these types of domain,the process of usingabstraction to produce an approximately optimal policy is describedin Figure 3.3.3.2.1 Choosing relevant atomsIn order to construct an abstract MDP, we need to select some subsetof the atoms whichwill form the basis of the abstraction. The quality of the policyand the effectiveness ofthe abstraction process are both closely dependent on the atomschosen. If too manyChapter 3. Creating Approximate Policies Using Abstraction40atoms are selected, the policy created may be very close to optimal,but the computationalsavings may not be large enough to justify the loss of optimality.On the other hand, if theset of atoms chosen is small, then the computation required toproduce the approximatepolicy will also be small, but the policy may be quite poor.As well as choosing an appropriate number of atoms to constructan abstraction, wemust consider which atoms should be selected. Obviously, if thereward for each statedepends solely on the value of a single atom in that state, it wouldbe foolish to ignorethat atom when constructing the abstract state space. However,this is not the onlyconsideration — atoms that have relatively little effecton the reward for a state may beable to be ignored, and an atom which indirectlyaffects reward should be included inthe set of relevant atoms.In order to construct a set of atoms which meets the criteriadescribed above, wefirst identify a set I7? of immediately relevant atoms. 17?. is formedby examining thepropositional model of the reward structure, and selecting onlythose atoms which havethe greatest impact on the reward for each state.The larger this set is, the more finegrained the abstraction will be, so by varying thesize of I??, we can find a balancebetween the quality of the abstraction, and the computation timerequired.Although the set 17?. contains some of the relevantatoms we would like to use forabstraction, it does not yet include all relevantatoms. For example, in a domainwherethe reward is large if atom A is true and small otherwise,17?. would be {A}. But if anaction that made A true required B to be true asa precondition, then clearly B isarelevant atom as well. To capture this formally,we define 7?., the set of relevantatomsas the smallest set such that 7?. contains everyatom in 17?., and if some atom AE 7?.appears in the effects list of some action aspect,then every atom that appears inthecorresponding discriminant is also in 7?.. The set7?. is defined as follows:Chapter 3. Creating Approximate Policies Using Abstraction411. 17?.C7Z.2. if q e 7?. and for some effects list E, either qE E1 or —‘q E E, then D C 7?..Notice that only the atoms in a discriminant that mightprobabilistically lead toa relevant effect are deemed relevant; we will call this a relevantdiscriminant. Otherconditions associated with the same action aspect areignored, unless these are also relevant. Furthermore, although 7?. is defined recursively,it can be easily and automaticallycomputed as a fixed point of the functionR.given below:7?. =17?=R.U{qI(j)(qD’andE fl 7?O)}The only decision required from the user of the systemis that of which atoms should beplaced in 17?.. As we shall see, this fact allows the userto specify the degree of accuracyrequired of the abstraction, and to havean abstract policy and heuristic functionforsearch calculated automatically.Having calculated 7?., we have completedstep 1 of the algorithm in Figure3.3. Toachieve step 2, we must use 7?. to createthe abstract state space 8. We do thisbyclustering together all the states in the original MDPwhich agree on the valuesof theatoms in 7?.. By treating each cluster as a statein the abstract MDP, we ignoretheirrelevant details of atoms that do not appearin 7?..Definition 2 The abstract state space generatedby 7? is S= {,... , },where:1.2.U{}= S.3.fl,=Oifij.Chapter 3. Creating Approximate Policies Using Abstraction424.For an example of how to construct an abstract MDP by this method, see Section3.2.3.3.2.2 Building abstract actions and rewardsIf an optimal policy is to be constructed over the abstract state space8, we require actionsand a reward function that are applicable to the new states. In general, computingthetransition probabilities for actions associated with an arbitraryclustering of states iscomputationally prohibitive. It requires that one consider the effectof each action oneach state in the cluster. Furthermore, computing theprobability of moving from onecluster to another when performing a given action requires someprior distribution overthe worlds in the initial cluster. With the information we have,it is impossible tocalculate such a distribution. To do so would require a distributionover initial states ofthe system, and a fixed policy.The abstraction mechanism we have described is designedto avoid exactly these problems. Our intention in creating 7? as described above isto make the problem of buildingabstract actions and rewards as computationally inexpensive aspossible. The followingtheorems demonstrate the significant characteristicsof the abstraction. Theorems 1 and2 ensure that all the states clustered into one abstractstate will behave the same when anabstract action is performed. This justifies using the abstractproblem to reason aboutthe concrete problem because it shows that the concrete statesin a cluster will behavethe same as the corresponding abstract state. Ifan action is performed on a cluster forwhich the action has a relevant discriminant, thenall the states in the cluster willbemapped to the same probability distribution over clusters, andif the cluster is containedin an irrelevant discriminant, then all the states willbe mapped into other states in theChapter 3. Creating Approximate Policies Using Abstraction43same cluster.Theorem 1 If is an abstract state, and s, tE s, then s satisfies a relevant discriminantfor some action aspect if t does.Proof: Assume not, then without loss of generality, there exists, t € and a relevantdiscriminant D, such that D, is true in s but notin t. Hence s and t must disagree onthe value of some atom in Df, but since Df C7? by the definition of 7?., s and t cannotbe in the same cluster , which contradicts our assumption.Theorem 2 If E is an effect of some action,and s, t e as above, then i) If E isassociated with an irrelevant discriminarit, then E(s)E;and ii) Eq(s)E i iffE(t) E ii.Proof: i) If E is associated with an irrelevant discriminant,then E fl 7?= 0 since if Econtained an atom in 7?, the discriminant wouldbe relevant. Since E contains no atomsin 7?., E(s) must agree with s on the value of allatoms in 7?., and hence E(s)e asrequired.ii) Since s and t agree on the values of all atoms in7?., Ei(s) and E(t) must also (becauseE changes the same atoms in both), so Ed(s)é ü iff E(t) E ii. UThe two theorems show that when any action is performed,the effects of that actionwill be to map all the states in a cluster to thesame new cluster. This allows ustoconstruct the abstract actions by simply deletingall reference to irrelevant atomsfromthe actions in the original problem. Thismay cause some simplification ofthe actionspecification as discriminants and effectswhich were different in the original problembecome the same in the abstract problem.For example, if an action has a discriminantwith several effects, none of which contains a relevantatom, they will all collapse intoasingle empty effects list. The probability of this effectoccurring will be the sum of theprobabilities of the concrete effects which combinedto form it.Chapter 3. Creating Approximate Policies Using Abstraction 44To associate a reward with each cluster, we use the midpoint of the range of rewardsfor the states in that cluster. For example if a cluster contained states with rewards0, 1and 10, we would select 5 as the reward for the cluster. Formally, if min() and max()denote the minimum and maximum values of the set {R(s) : s}respectively, thenthe abstract reward function is defined as:rnax()+min(’)2This choice of R() minimizes the possible difference between R(s) and(‘)for anyS E , and is adopted for reasons explained in Section 3.3. Although using the averageof the rewards in the cluster as the abstract reward for the cluster might result inbetteraverage-case behavior, it can lead to much worse bounds on the differencebetween theabstract and optimal policies.3.2.3 An example domainTo demonstrate the abstraction algorithm, we will use the example fromTable 2.1 (Page14). In this example, there are two atoms that affect the reward fora given state,Has UserCoffee, and Wet, but since the effect of Has UserCoffee is fargreater, we willset I??.= {Has Usercoffee}. Two actions affect the value of Has UserCoffee, the relevanteffect ofUserlsThirsty*has no preconditions, and can be ignored, whileDeliverCoffeehas discriminant {Office,HasRobotCoffee}, so these two atoms must be includedin 1?.Similarly, the Move action affects the value of Office, and the BuyCoffee andDeliverCoffeeactions affect the value of HasRobotCoffee, but no new atoms appear inthe relevantdiscriminants of these actions, so = {HasUserCoffee, HasRobotCoffee,Office}.Note that if we had chosen 17?. to include Wet, then7?. would also have includedRain and Umbrella from the second aspect of the Move action,so all the atoms from theoriginal problem appear in I??., and the abstract state space is identical tothe concreteChapter 3. Creating Approximate Policies Using Abstraction45Table 3.5: The STRIPS representation of the abstract MDP.one.Action Discriminant Effect Prob.Move Office -‘Office0.90.1-‘Office Office 0.90.1BuyCoffee Office HRC0.80.2Office 1.0GetUmbrella 1.0DelCoffee Office,HRC HUC,HRC0.8HRC 0.10.1-‘Office,HRC -‘HRC 0.80.2-‘HRC 1.0UserlsThirsty*-HUC 0.010.99Sentence ValueHUC 0.9HUC 0.1Having constructed 1?., we now have an abstract MDPwith a STRIPS representationas given in Table 3.5. We can then treat the abstract MDPas any other MDP andcompute an optimal policy for it using policy iteration. Thisgives us the policy shownin Table 3.6. When this policy is translated intoa policy in the original domain, itdiffers from the optimal policy for oniy threeof the 64 states (although in stateswhereOffice, Has UserCoffee, and HasRobotCoffee are alltrue, it substitutes Get UmbrellaforBuyCoffee, neither of which has any effect in thosestates) and — as we would expectgiven the method of its construction — is optimalwhenever Rain is false or Umbrellaistrue. The values for states in this abstract policyare around the midpoints of therangesChapter 3. Creating Approximate Policies Using Abstraction46Table 3.6: The policy computed using the abstractMDP.HUG Val. HUC Val.HRC, Office GetUmbrella 17.8 DelCoffee16.5HRC,—iOffice Move 17.8 Move 15.7-‘HRG,Office Move 17.7 Move 14.1-iHRC,-iOffice BuyCoffee17.7 BuyCoffee 14.8of values for the states in each cluster according to the optimalpolicy. Moreover, thetime required to compute the abstract policy was 0.01 seconds,only 2.1 percent of theoptimal policy (both using policy iteration).3.2.4 Properties of the abstract MDPThe most important property that the domain formed by theabstraction process musthave is the Markov property. The Markov property states thatthe probabilistic outcomeof every action depends only on the current state, not onany previously visited states,or previous actions.Theorem 3 The Markov property holds in the abstract domain(and hence it is anMDP).Proof: Since the Markov property holds inthe original MDP, it is sufficient toshowthat for a given abstract action, the result of performingthat action on two states in thesame cluster will be the same probability distributionover clusters. Formally, if S andT are clusters and 8 is the abstract state-space,we must show that:(VS,T e )(Vs,s’ e S)(Pr(tA,s) =Pr(tIA,s’) = Pr(TIA,S))teTConsider two such states s and s’, both in clusterS. Since they are in the samecluster, s and s’ must differ only on the valuesof atoms not in 7? (the set ofrelevantChapter 3. Creating Approximate Policies Using Abstraction47atoms).Now consider some action A such that Pr(tJA,s)$ teT Pr(tIA,s’). For thisto be the case there must be (at least) one atomp e 7?. which appears negated andunnegated respectively in the effects lists of the discriminants containings and s’ for A.Since p E 1?, the discriminants of S and s’ for A must also be in7?. (by the definition of7?.), 50 s and s’ must differ on the values of atoms in 7?. and thiscontradicts the statementmade in the previous paragraph, so no such action canexist. DSince the Markov property holds, we know that the abstract domainis an MDP, sopolicy iteration will produce an optimal policy (for the abstractdomain). If the Markovproperty does not hold, a policy produced using policy iteration isno longer guaranteedto be optimal. The proof shows that for any action and any two statesin a single cluster,the sum of the transition probabilities from the twostates to all the states in any othercluster is the same.3.3 Properties of the Abstract PolicyWe are interested in two properties of the abstractdomain. The time required to computea policy using abstraction compared with the timerequired to find an optimalpolicy,and how close to optimal the policy is.As to the first question, since the time requiredfor policy iteration is a functionofthe size of the state space, and the size ofthe state space is exponential in thenumberof underlying atoms, any reduction in thecardinality of 7?. will result in an exponentialreduction in the size of the state spaceand hence in computation time. Evenreducingthe domain by a single atom will halve the size ofthe state space, and producea largecomputational saving when performing policy iteration.This speed-up comes at the cost of generatingpossibly less-than-optimal policies.Chapter 3. Creating Approximate Policies Using Abstraction48However, we can at least bound the difference in valuebetween an optimal policy r*and the policy for the concrete domain r derivedfrom the optimal abstract policy asin step 6 of Figure 3.3. Along with , policyiteration will produce an abstract valuefunction V;. We can treat V; as an estimate of thetrue value of policy ; that is,V()approximates the value of Vs(s) wheres e is any state. The difference betweenV()and V.(s) is a measure of the accuracy of policyiteration over the abstract state spacein estimating the value of the induced policy in theconcrete state space. To find thisbound, we will need to use the discounting factorj3, and the reward span of each clusterS.Definition 3 The reward span of a cluster, span() is the maximum range of possiblerewards for that cluster. That is; span() = max()— min().The reward span for a cluster is the maximum degreeto which the estimate(‘)ofthe immediate reward associated witha state s E differs from the true rewardR(s) forthat state. Let 6 denote the maximum rewardspan over all the clusters inS.Theorem 4 For any se‘ cS,V)- V(s)2(1-Proof: Let p be the expected (undiscounted)reward the agent will receive forperformingits ith action from policy ir startingin state s(p is just the reward received at that singlestep). HenceV(s) = R(s)+Pi+2P2+Let s e , then R(s) — <R(s) R(s)+ , so6 6 6V(s) <R(s) ++/(p+ ) +32(p + ) +Chapter 3. Creating Approximate Policies Using Abstraction49andV) R(s) - +(pi - +2(p- +6 626V(s)—V(s)I+/3+/3 +...=____- 2(l-/3)DA more useful bound is on the difference between the generated policy rand anoptimal policy r*. As above, we will use the value functionV associated with theoptimal policy. The true measure of the optimality of the abstract policy is thedifferencebetween V71.(s) and V(s) for any state s.Theorem 5 For any s EIV(s)-V(s)Intuitively, this bound results in the same way as the previous one. Thedifference invalue is simply the sum of all the bounds at each step.In this case, the first state bound iszero since both values are in the concrete domain. The valueof the state one step aheadcould be in error by at most 6 since the true reward can be atmost 6/2 greater than theabstract reward for the optimal action, and at most 6/2 less for theaction selected bypolicy ir. This value must be discounted by /3. A similarargument shows that at eachfuture step the bound on the reward received at that stepis at most 6, so the sum isas required. To formally prove the theorem, we makeuse of the following lemmata:Chapter 3. Creating Approximate Policies Using Abstraction 50Lemma 6 Let V denote the (optimal) value function for the abstract MDP (Hence()is the value given to by any optimal policy for the abstract MDP). Let V denote thevalue function for the concrete MDP. Let sE S, e S, and s e , thenV(s)-2(1-Proof: Define Vo(s)Va(s) =tSThis is the discounted finite horizon formulation of V. Similarly, define V0()== R()tESThenV(s) = urn Va(s)n—= lim)n—*ooAnd since‘ _+3.+a2.+2(1—3)2 22We must show that:V(s)-Base Case: By the construction of R it follows thatVo(s)-= R(s)-E)Inductive Hypothesis: Assume for all s thatV1(s)- -i()Chapter 3. Creating Approximate Policies Using Abstraction51Lemma 7 Assuming the inductive hypothesis, we have for anyaction A,—‘6>Pr(tA, s)V_1(t) — Pr(A,)—()>tES i=OProof: By Theorem 3, if s E , then for all A we havePr(]A,) =>Pr(tjA, s)tEtSoPr(tA, s)V1(t) - Pr(A, =Pr(tIA,s)IVi(t) -tES tinSPr(tIA,s)tES i=O—‘6/3z_asrequiredi=O2Let A be the action that maximizes V(’), let B maximizeVa(s). ThenVa(s) = R(s)+4lPr(tB,s)V_i(t)teS=+ ,6 Pr(A,)_)tESLetf= ‘Z /36/2, letA= Pr(tIB,s)Vi(t)tESB = Pr(tA,s)V_i(t)tESC =tESD =tESThen by the definitions of A and B, we have AB, and C D, and by Lemma 7,— Cf,and A— DI f.Chapter 3. Creating Approximate Policies Using Abstraction52If C B, then C—Af.From C D and A—D fit follows that A—C<f,so— Al f.Similarly, if C <B, then A > C, and since CD and A — Df,wehave thatIA—CIf.In full,we havePr(tB, s)Vi (t) - Pr(A,tES_E’_i=OFrom the definitions given above,V(s)-= R(s)+Pr(tIB,s)Vi(t)-tEStESR(s)- +Pr(tB, s)Vi(t) - Pr(A,)V1()tES(5n.i=Onas required.i=O2CCorollary 8 The preceding lemma states thatV(s)-2(1-Since V(s) is the value of state s for any optimalpolicy in the concrete MDP,and()is the value of ‘ for any optimal policy in theabstract MDP, the same relationship appliesto the values of specific optimal policies, namelyfor all s,IV*(s)-V;(s)I2(1-a)Proof: (of Theorem5). From Corollary 8, we have that for all sV(s)-V;(s)I2(1-a)Chapter 3. Creating Approximate Policies Using Abstraction53Now, from Theorem 4, we have that for all state s,V(s)- V(s)I2(1-SoV*(s)-V(s)IThe first term in each series is 6/2. This is the termfor the difference between R(s)and but since V(s) and V(s) both haveR(s) as their first term, they must agreeon this value, soV*(s)—V(s)I1/3as required.3.4 ResultsWe examined two domains in our investigation of the abstractionalgorithms. Since wehave yet to apply these algorithms in any real-worlddomains, both have been createdby hand, and the first was designed specificallywith this abstraction scheme in mind.This should result in the first domain performing particularlywell as an example ofabstraction. The second domain was adaptedfrom (Smith and Peot 1993), and wasspecifically chosen as a typical planning problemfor other planners. It is not particularlywell suited for abstraction as there is onlya single possible set I?? that resultsin a statespace smaller than the concrete problem (a second possibleIi? exists, but has a largerstate space, and worse reward span than the onewe use here). See Appendix A forthedetails of both these domains and their abstractions.The COFFEE domain, is an extension of the example inChapter 1. It contains ninepropositions and nine actions, producing a 512 stateMDP. There are three possibleChapter 3. Creating Approximate Policies Using Abstraction 54Table 3.7: Results of abstraction for the COFFEE domain.abstractions for this domain, with five, six and eight propositions respectively.Policyiteration on the complete problem required 254.33 seconds on aSun 4/60. State valuesranged from -7.0 to 30.0. The results of the abstractions are summarizedin Table 3.7.As the table shows, the more fine-grained the abstraction, thebetter the resultingpolicy is. The number of errors in the action to select drops untilover 90 percent ofstates have their correct action in the eight proposition domain,and the errors in theestimates of state values drop as well. Requiring very littlecomputation time, the fiveproposition abstraction still produces quite a reasonablepolicy. Although state valuescan be anywhere from zero to twenty, the policy produced from fivepropositions willdiffer from the optimal policy by an average value ofjust 4.12, only an eleven percentAbstract value vs. 5 Proposition 6 Proposition8 Propositiontrue policy value domain domain domainNo. of errors in 512 512512value of stateAverage error 5.00 2.591.00Standard Deviation 2.71 0.890.00Maximum Error 8.5 3.51.0Predicted Bound 8.5 3.5 1.0Time required to 0.14s 0.89s 35.55scompute_policytrue abstract policyvalue vs. optimal policyNo. of errors in 187 8539action to choose 36.5% 16.6%7.6%No. of errors in 348 256 192value of state 68.0%50.0% 37.5%Average error 4.12 0.910.48Standard Deviation 3.80 1.390.76Maximum Error 14.17 5.93 1.89Predicted Bound 16.15 6.65 1.90Chapter 3. Creating Approximate Policies Using Abstraction55Table 3.8: Results of using optimal abstract policies as initial policies when computingan optimal policy for the concrete MDP.(a) Time to find optimal policyMDP to solve Initial policy Time Iterations32 state None 0.89s 664 state None 4.93s 732 state 5.59s 8256 state None 214.71s 832 state 214.68s 864 state 84.71s 3512 state None 1531.67s 832 state 1564.14s 864 state 643.58s 3256 state 472.53s 2Use 256 state Use 64 state Use 32 state Total timeYes Yes Yes 563.72Yes Yes No 562.17Yes No Yes 688.10Yes No No 687.24No Yes Yes 650.06No Yes No 648.51No No Yes 1565.03No No No 1531.67error for a 99 percent reduction in computation time. The eightproposition abstractionperforms even better, producing a 1.3 percent average error in the valueof a state for an86 percent reduction in computation time.We can also examine the possibility of computing an optimalpolicy by using the resultof an abstract policy as the initial policy for applying policyiteration to the concrete(or a less abstract) policy. The results for this experiment,again in the COFFEE domainare given in Table 3.8. As the table shows, there canbe considerable savings by using aseries of abstractions to calculate the optimal value for the concretedomain. The fastest(b) Total time to find optimal policy for 512 stateMDPChapter 3. Creating Approximate Policies Using Abstraction56way to compute the optimal value for the concrete domain is tocompute the optimalvalue for the 64 state domain, use that to compute the 256 state optimalvalue, and thenuse that to find the optimal policy. This requires only 37 percent ofthe computationtime of computing the optimal policy directly. Perhaps surprisingly,computing the 32state optimal policy is o help at all for computing more fine-grainedpolicies, and in thecase of the 64 state domain, actually slows the process of policy iteration. Atpresent,we have no way of predicting when this will occur.In order to use these results for efficient computation of the optimalpolicy, the resultssuggest that using as many levels of abstraction as possible isa good strategy. Althoughit isn’t the best possible series of abstractions in this case,it is very close, and informalarguments suggest that using all available abstractions should be thebest strategy in mostcases. If multiple processors are available, all the possible permutationsof abstractionscould be run in parallel, and computation halted as soon as anyprocessor computed anoptimal policy. This method allows us to guarantee that computingthe optimal policyusing abstractions will be no worse than computing it directly.The second domain, the BUILDER domain involves an agent thatmust join two objectstogether. For maximum reward, the objects must be machined to thecorrect shape, clean,painted, and joined together. The reward for any given state issimply the sum of theindividual rewards for all of these attributes. The state containsnine propositions (512states) and ten actions. Policy iteration on the entire state spacerequired 201.86 seconds.State values range from 0.0 to 20.0. The results are summarizedin Table 3.9.As the table shows, the abstraction was again good, especiallyconsidering the smallsize of the abstraction (only 32 states). The average errorin the value of a state is 5.99,which is quite large at 30 percent of the possible range of values,but the abstract policyrequired less than one percent of the computation timeof the optimal policy. For thisdomain, since the abstract state space is so much smaller thanthe concrete one, someChapter 3. Creating Approximate Policies Using Abstraction57Table 3.9: Results of abstraction for the BUILDERdomain.Abstract value vs true policy valueNo. of errors in value of state 478= 93.4%Average error 2.69Standard Deviation 1.86Maximum Error6.0Predicted Bound 6.0Time required to compute policy 0.lls= 0.05%true abstract policy value vs. optimal policyNo. of errors in action 309= 60.4%No. of errors in value 512Average error5.99Standard Deviation 2.16Maximum Error 10.00Predicted Bound 11.40local way of improving the policy, such as the search procedure describedin the nextchapter, may be very valuable. Since there is no abstraction thatis more fine-grainedthan this, we cannot choose another abstraction if the boundon the difference betweenthe abstract and optimal policies of 11.4 is unacceptable.Chapter 4Planning by Heuristic SearchThe abstraction scheme described in the previous chapterproduces an approximatelyoptimal policy for a relatively small investment ofcomputation time. This computationmust be performed before any actions arecarried out, but if time is available astheplan is executed, extra computations could be usedto make local improvements to thepolicy. This chapter introduces the idea of treatingthe abstract policy as a set of defaultreactions, and using heuristic search to try to improveon these at run time if computationtime is available. A major advantage to be gainedby using search is that its computationtime generally does not depend on the sizeof the state space, only on its connectivity(the number of states that can be reachedby performing a single action from somestate).Although the abstraction process of Chapter3 is an obvious place to obtain a heuristicfrom, the algorithm described in this chaptercan be (and has been) used with heuristicsobtained from other sources. In many domains,a heuristic may already be providedbythe user. In others, such as high-level robot navigation,a domain specific heuristic —for example, one based on geographicrelationships may be easy to constructby hand.We will assume throughout this chapterthat a heuristic functionV is available, withoutconcerning ourselves with its source.If a heuristic exists which is a fairlyreliable estimate of the value ofeach state, wecan use it to select a good action to performin each state in a way analogous to minimaxsearch. A simple search algorithmwill tend to be greedy, selectingactions which leadto the greatest immediate reward, butby using a heuristic we can hopeto increase the58Chapter 4. Planning by Heuristic Search59importance of long-term gains, and hence produce a better planner.In general, the more computation time is spent generating the heuristic,the moreaccurate it will be, and similarly, if deeper search is performed, we cantypically expecta better decision to he made — this may not always be the case (see (Pearl1984) fora proof of this for minimax search). In practice, searching deeper generally leadstobetter performance. There is an obvious tradeoff between initial computationtime whileconstructing the heuristic and computation time spent searching.At one extreme, wecan compute a perfect valuation function for states initially, andthen perform no search,while at the other, we have a greedy search with no heuristic informationto guide it.Classical planning makes use of search as well, but there are importantdifferencesbetween the classical approach, and our search algorithm. Classicalplanning is typicallygoal directed, which is impossible with the complex reward structures weare interestedin; but even when the search isn’t goal directed, classicalplanning algorithms searchuntil a goal state is found. Even ignoring the fact that, like game playingproblems, thereal-time nature of our approach prohibits exhaustive searches,the fact that we have noexplicit goal states, and are planning for the infinite horizoncase, means that there isno final state where search stops. Our use of partial decision tree searchand game-likeheuristic techniques is dictated by this inability to perform completesearches.4.1 Interleaving Search and ExecutionTime-critical domains provide a minimumof computation time in which to plan, henceit is important to restrict the space to be searched as muchas possible. To this end,we propose an algorithm where actions are executed as soonas they have been selected.Because of the stochastic nature of the domains we are interestedin, this results in aconsiderable saving in computation. Performingan action in a certain state can leave theChapter 4. Planning by Heuristic Search60system in a number of different states, so a planning algorithm that constructsa sequenceof actions would need to find an action to perform for each of the possibleoutcomes of theaction it selects first. By executing actions as soon as they are selected,we know (sincethe MDP is completely observable) which of the possible outcomesactually occurred, andonly need search for the next action to perform from a single state,rather than many.As we would expect, the saving in computation gained by interleavingexecution withplanning is considerable. Let b be the maximum numberof outcomes each action couldhave. In a search for a sequence of a actions,search without execution will requirean action to be selected forbzstates compared with only a states for searchwithexecution.4.2 Planning in MDPs by SearchingAs we described in the previous section, there is a great dealto be gained in termsof computational savings by interleaving planning with execution.In view of this, theplanning algorithm can be viewed at the highest level as follows:1. Calculate the best action to perform in the current state,using the heuristic function if necessary.2. Execute the selected action.3. Observe the new state of the system and returnto step 1.Steps 2 and 3 require little explanation. Althoughthe algorithm as it stands neverterminates, this is consistent with the process-likedomains that the MDP representationis ideally suited for. If the domain contains goalstates in which the agent has no moreto do or other self-absorbing states, the algorithmmight terminate when such a stateisreached. In general, however, the agent will continueplanning and acting indefinitely.Chapter 4. Planning by Heuristic Search614.3 The Search AlgorithmThis section deals with Step 1 of the algorithm described above.The first point to makeis that the algorithm we will present makes no changes tothe heuristic function — itdoesn’t learn more accurate estimates of the valuesof states — so for any given state, theaction chosen to execute for that state will be thesame every time. To take advantageof this, the algorithm caches the action chosen for each stateit computes so that shouldthe state be visited again, no extra computation needbe performed.The search algorithm constructs a partialdecision tree rooted at the current stateto determine the best action to perform. Thedecision tree is built to a fixed depth,and the heuristic function is used to estimate the valueof the leaf nodes. Althoughfixed-depth search is not necessary for the algorithmto function, it allows the use ofdepth-first search, which is computationallyefficient. Using breadth-first search(or oneof its variations) would make some ofthe pruning algorithms more efficient,but thesetechniques often require considerableextra book-keeping costs. If certaininformationabout the heuristic function is available,then pruning of the search tree ispossible, andas the results below show, can be quiteeffective. The cached values of previouslyselectedactions provide an even more effective formof pruning.4.3.1 Action selectionLet s and t be states, let/be the discounting factor as before,and let V(t) be the valueof the heuristic function at statet. Then the estimated expected utility ofactionAinstate s is:U(AIs) =Pr(tA, s)17(t)tESChapter 4. Planning by Heuristic Search62Where V(s), the value (estimated by the search process) of astate is:IV(s) if s is a leaf nodeV(s) =R(s) +7(max{U(AIs): A € A}) otherwiseAs we would expect, the utility of an action is simply the weightedaverage of the valueof every state reachable by performing the action. The valueV(s) of state s is computedas follows: if s is at the bottom of the search tree(s is a leaf node), then we use theheuristic to approximate its value. Otherwise weadd the immediate reward for statesto the discounted value of the best action to performin that state. Figure 4.4 gives thealgorithmic description of the basic search algorithm.Figure 4.5 illustrates the search process with a partial tree ofactions two levels deep,and a discounting factoriof 0.9. From the initial state s, if theagent performs action A,state t will result with probability 0.8 and state u with probability0.2. When the tree isexpanded again to include states reachablefrom these, the set of second states result.Atthis point, the heuristic is used to estimate the values ofthe second states as they are leafnodes in the tree. Given that V(x) = 2 andV(y) = 3, the utility of action A if the agentwere in state t, is 0.9 x 2+ 0.1 x 3 = 2.1, and since U(Bt) = 0.3, A is the better actionfrom statet, and V(t)= R(t)+,@xU(AIt) = 0.5+0.9x2.1 = 2.39. Inasimilarfashion,wecan compute the value of state u, whichis 1.58. Given this, we determine the estimatedutility of action A if the agent is in states. U(As)= 0.8 x 2.39+ 0.2 x 1.58 = 2.23, andsimilarly,U(BIs)= 2.94. Since the estimated utility ofaction B is higher than actionA, we select B as the best action to performin state s, record that fact in the cacheofpreviously calculated best actions, and executeaction B. By observing the outcomeofthe action, we now know which of statesv and w actually resulted, and canhence avoidsearching for a best action for theone that did not.In Figure 4.5, the tree is expanded to depth two,but this does not need to be thecase. As available computation time varies, the depthof the search tree can also vary.Chapter 4. Planning by Heuristic Search63search-state(s, depth) — Returns the bestaction and its value for states1. If depth > 0 (this isn’t a leaf node)(a) Set MaxVaiue to be lower than any possible valueof an action(b) Set BestAction to be null(c) for each action a e Ai. Let value be search-action(a, s, depth), the(estimated) value of actiona in state s.ii. If value > MaxValue (this is the best actionso far), let MaxValue bevalue, let BestAction be a(d) Return BestAction, R(s)+7 x MaxValue2. otherwise (s is a leaf node of the search tree)return V(s), the heuristic valueofstate ssearch-action(a, s, depth) Returnsthe value of action a in state s1. Let value be 02. For each state t that could result from performinga in s do(a) Let value be value+ Pr(tla, s) x search-state(t, depth — 1), the probabilityof reaching state t times the computedvalue of t.3. Return valueFigure 4.4: The search algorithm.To begin search from state sto depth d,search-state(s, d) is called. The action thatis returned is the estimate of the bestactionto perform.Chapter 4. Planning by Heuristic Search 64w (p=O.5)•BN•.State Valuex p=O.9, V=2y p=O.l,V=3p=O.7,V=Op=o.3,V=1p=o.7,V=op=O.3, V=4p=o.9,V=op=O.l,V=2p=O.l,V=4p=O.9, V=Op=O.5, V=3p=O.5, V=2Initial First First Second Second StateState Action State Action.A7•t(p=0.8)7•/%NUtility of 2nd act. Action and value Best actiongiven 1st state of first state and valueU=2.lAct.A V(t)=2.39U=0.3U=2.23U=l.2/Act.A V(u)=1.58U=0.2Act.B V(s)=2.64U=1.8Act.A V(v)=2.62/}U=1.4\U=2.94U=0.4/Act.B V(w)=3.25}U=2.5• p=0.9, V=2• p=0.1,V=0• p=0.6, V=1•p=O.4,V=2R(t) = R(u) = 0.5, R(v) = R(w) = 1, R(x) = 0, R(y) = 1Figure 4.5: An example of a two-level search for the best actionfrom state s.Chapter 4. Planning by Heuristic Search65In practice, an interruptible search using an iterative deepeningtechnique may well beused, so that at any time, the algorithm can be interrupted,and the current best actioncan be performed. We cannot guarantee that deeper search willproduce better results,in general however, the deeper the tree is expanded, the moreaccurate the estimates ofaction utility will tend to be, and the more confidence weshould have that the actionselected approaches optimality. One can view thissearch process as a form of directedvalue iteration (Bellman 1961). A search to depthn will result in exactly the sameaction for each state as performing n iterations of thevalue iteration algorithm with theheuristic as the initial vector of state values. Hencewe can use the result that valueiteration converges on an optimal policy to show that deep enoughsearch will also findoptimal actions, but such a search is too expensive tobe performed in practice.There is also a close relationship between thesearch algorithm and Ballard’s *-minimax search (Ballard 1983). Determiningthe value of a state is analogousto theMAX step for minimax search, while calculating the utilityof an action is an AVERAGE step and is analogous to a move made bya randomized opponent in Ballard’sgame-playing search.4.3.2 Pruning algorithmsAs we said above, there is a close relationship betweenour search algorithm andminimaxsearch. Just as alpha-beta search is a versionof minimax with pruning of irrelevantsubtrees, we can perform two different formsof pruning on our search trees.To makeour description of these pruning techniques clearer,we will treat a single plyof searchas consisting of two steps, MAX in whichall the possible actions are considered,and thebest is chosen, and AVERAGE where the outcomesof a particular action are combinedto determine the expected value of that action.Two type of pruning are possible. The first, utilitypruning, is very similar toa- andChapter 4. Planning by Heuristic Search66StatesMAX stepActionsAVERAGE stepStatesFigure 4.6: Two kinds of pruning whereV(s) 10 and is accurate to ±1. In (a), utilitypruning, the trees at U and V need not be searched,while in (b), expectation pruning,the trees below T and U are ignored, althoughthe states themselves are evaluated.3-cuts in minimax search, and requires knowledgeof the maximum and minimum valuesof the heuristic function. For heuristics producedby the abstraction algorithm describedin Chapter 3, ifR+and R are the maximum and minimum rewardsfor any state inthe MDP, then the maximum and minimum expectedvalues for any state are boundedby R/(1 — /3) and R/(1—/3) respectively. For the second type ofpruning, expectationpruning, we require bounds on the error associatedwith the heuristic function.Again,for heuristics based on our abstraction algorithm,these bounds can be computed usingthe result of Corollary 8 (see Section 3.3).Utility Pruning We can prune the searchat an AVERAGE step if weknow that nomatter what the value of the remainingoutcomes of this action, we can neverexceedthe utility of some other action at the precedingMAX step. For example, considerthe search tree in Figure 4.6 (a). We assumethat the maximum value that theheuristic function can take is 10. When evaluatingaction b, since we know thatVai=3(a) (b)Chapter 4. Planning by Heuristic Search67the value of the subtree rooted at T is 5, andthe best that the subtrees belowU and V could be is 0.1 x 10 + 0.2 x 10 = 3, the total cannotbe larger that3.5 + 3 = 6.5, so neither of nodes U and V need be expanded.This type of pruningrequires knowledge of the maximum value of theheuristic. Although we haven’timplemented it, we can use the minimum valuein a more restricted fashion. If,for example, the value of action a was3, and the minimum value of the heuristicwas 0, then the value of b must be at least0.7 x 5+ 0.3x 0 = 3.5, so we can tellthat b is the best action without searching nodesU and V. This process will onlywork if b is the last action to be evaluated, and weare at the topmost level of thesearch tree, so the value of b is unimportant. Forthe maximum amount of pruningto be performed, the possible outcomes of eachaction should be searched in orderof their probability of occurring, with outcomesof high probability searched first.To implement Utility pruning, we need to modifythe algorithm in Figure4.4 asfollows:We need a new constant, MaximumStateValue,which is the maximum expectedvalue of any state.search-state(s, depth)The only change needed is in the callto search-action in 1(a)i. The functioncallbecomes search-action(a, s, depth,MaxValue).search-action(a, s, depth, a) — Find the valueof action a in state sbysearching to depth depth. The valueis irrelevant if it is less thata.1. Let value be 02. Let RernainingProb be 13. For each state t that could result from performinga in s doChapter 4. Planning by Heuristic Search68(a) Let value bevalue+Pr(tla,s) x search-state(t, depth—i), the probabilityof reaching state t times the computed value oft.(b) Let RemainingProb be RemainingProb —Pr(ta, s)(c) If value+RemainingProbxMaximumStateValue<a then return a—i.‘We can stop searching, this cannot be the bestaction.4. Return valueExpectation Pruning For this type of pruningwe need to know the maximum errorassociated with the heuristic function.If the search is at a maximizing step,and,even taking into account the errorin the heuristic function, the actionwe areinvestigating cannot have as high autility as some other action for whicha utilityis already known, then we do not need to expandthis action further. For example,consider Figure 4.6 (b), where weassume that for all states s,V(s) is within ±1of the true (optimal) value of s. Havingdetermined that U(aS) =7, we knowthat any potentially better action must havea heuristic value of at least6. SincePr(Tb, S)V(T)+ Pr(UIb, S)V(U) <4, even if b is as good as possible, it cannotbebetter than a, so there is no need tosearch the subtrees below states Tand U.To implement Expectation pruning,we make the following further modificationstoFigure 4.4:Again, we need a new global variable HeuristicError,which is the maximumpossible difference between theheuristic value of a state and itsvalue in an optimalpolicy.search-action(a, s, depth, a)The following additional steps must beincluded at the beginning of thealgorithm(before step 1. in Figure 4.4).Chapter 4. Planning by Heuristic Search691. Let Val’ueEstimate be 02. For each state t that could result from performing a in s do(a) Let ValueEstimate beValueEstimate+Pr(tla, s) x V(t), the probabilityof state t times its heuristic value.3. If ValueEstimate+ HeuristicError <c then return cv —1, we need not lookany deeper into this action.As we have implemented it above, utility pruning is very simple to perform. It requiresvery little extra computation, and as our results below indicate, can resultin quite arespectable improvement in computation time. Expectation pruning requiresa moresignificant modification of the search algorithm to check all the outcomesof an action tosee if they justify searching the tree below that action, but in domains where theheuristicfunction is quite accurate, it can still offer a substantial performance improvement,asthe results in the next section will show. Expectation pruning is quite closelyrelatedto what Korf (1990) calls alpha-pruning. The difference is that whileKorf relies on aproperty of the heuristic that it is always increasing, we rely on an estimate of theactualerror in the heuristic.The comments in the previous paragraph are specific to the implementationof thesearch algorithm as a depth-first search. Using an iterative deepening searchseriouslylimits the applicability of utility pruning since the final valueof an action is only knownwhen the last round of deepening is performed. On the other hand,iterative deepeningremoves the additional computation requirements thatmake expectation pruning moreexpensive to perform.Chapter 4. Planning by Heuristic Search704.4 ResultsAs before, let n=I-4be the number of actions. Let b be the maximum numberofpossible outcomes for any action and state. We will assume thatwe are constructing asearch tree without pruning to a fixed depth d. The number of nodes(states) expandedwhile calculating the best action for a single state is 1+ bn + (bn)2 + ... +(bn)d =((bn)4— 1)/(bn — 1). Over a series of such calculations, the cost isslightly less than thisbecause we can reuse previous calculations, but the complexityisO((bn)d).The actualsize of the state space has no effect on the algorithm; rather itis the number of statesvisited that determines the cost. In most domainsthis number should be considerablyless than the total number of states. More importantly,the complexity of the algorithmis constant (with regard to the number of states), and executiontime per action can bebounded for a fixed search depth and branching factor.We have performed experiments to test the effectiveness of the searchingalgorithm ina number of different domains. Figure 4.7 shows how both the time requiredfor searchingand the value of the induced policy (the policy that results if searchingis conducted onevery state) tends to increase with the searchdepth. The experiments were performedon a Sun 4/60, with no pruning of the search tree.We also examined in detail the way the inducedpolicy changes as search depthincreases for three different problems. The firsttwo are the COFFEE domain, withaheuristic computed using the 32 state abstractMDP, and the 256 state abstractMDPrespectively. The third problem we examined wasthe BUILDER domain, using the32state abstract MDP to compute the heuristic. Theresults are presented in Table 4.10.As the table shows, the effects of searching varywidely from domain to domain.In the first problem, the heuristic is quite poor,and the policy produced by searchingconverges quite slowly to the optimal one. Eventhough the number of states for whichChapter 4. Planning by Heuristic Search71Table 4.10: Comparison of the induced policy as search depth increases forthree differentsearch problems.COFFEE domain, coarse-grained heuristicOptimal 1 step 2 step 3 step 4 steppolicy search search search searchAverage state value 22.607 18.686 19.96120.363 20.509Percentage of optimal 100.0 82.788.3 90.1 90.7Maximum error 0 14.169 10.60710.607 10.607Average error 0 3.921 2.6462.245 2.098No. of non-zero errors 0 320 288288 288Average non-zero error 0 6.274 4.7043.991 3.730COFFEE domain, fine-grained_heuristicOptimal 1 step 2 step 3 step 4 steppolicy search search search searchAverage state value 22.607 21.928 22.60722.607 22.607Percentage of optimal 100.0 97.0 100.0100.0 100.0Maximum error 0 1.890 00 0Average error 0 0.6790 0 0No. of non-zero errors 0 2240 0 0Average non-zero error 0 1.5530 0 0BUILDER domainOptimal 1 step 2 step 3 step 4 steppolicy search search search searchAverage state value 18.173 12.227 18.11218.008 18.021Percentage of optimal 100.0 67.399.7 99.1 99.2Maximum error 0 10.003 0.7025.050 5.050Average error 0 5.9470.062 0.166 0.152No. of non-zero errors 0 512 207141 91Average non-zero error 0 5.947 0.1530.602 0.857Chapter 4. Planning by Heuristic Search72Depth of search v. Time on an exponential scaleDepth of search v. Accuracy of induced policy100100%state abstractionPercentage of 32 state abstractionTime to searchoptimal value8010701600.10.0150 I I I I1 step 2 steps 3 steps 4 steps 5 steps 6 steps 1 step2 step 3 step 4 stepDepth of searchDepth of search(a) (b)Figure 4.7: Graphs of (a) search time, and (b) policy qualityagainst search depth forthe COFFEE domain.the optimal policy has not been found1 remains relatively constant,the expected valueof states continues to rise, which shows that theoptimal action is being selected for agreater and greater number of states, although a few pathologicalones with high errorsremain, even after searching four steps deep.In the second problem, the heuristic is already very accurate.Searching to one stepis essentially equivalent to selecting the same actions as theabstract policy, and evenone step search performs very well (better than fourstep search with the coarse-grainedabstraction). This illustrates the trade-off betweenabstraction complexity and searchdepth. By spending 35 seconds initially rather than0.14 seconds, we can spend only0.003 seconds computing each action, rather than five seconds,and still perform better.In fact, since the search procedure converges onthe optimal policy when searching todepth two in this domain, we can guaranteean optimal policy, and still only spend0.03‘A non-zero error means that some state that could be visitedby following the policy has a less-thanoptimal action.Chapter 4. Planning by Heuristic Search73seconds per action in the search process.The third search problem illustrates an important point; the searchprocedure doesn’talways perform better as search depth increases. Inthis domain, we once again havea coarse-grained abstraction, which makes searching todepth one perform quite poorly.However, a two-step search increases performancedramatically, and in fact is better thansearching to three or four steps, at least in terms of the average valueof a state. Moredetailed analysis of the policies produced by each depth of searchreveals that for almostall states the value of the policy continues to improve as search depthincreases, but thereare a small number of pathological states for whichthe search algorithm performs verybadly.4.4.1 Execution and caching of valuesWe have also performed experiments to investigate the valueof caching previously computed best actions for state, and the value of interleavingexecution with search. Table4.11 summarizes our results. The columns where executionis interleaved with searchshow the standard algorithm as we presented it. Forsearch without execution, the agentperforms the standard search, determines the best action, andthen rather than executing it, searches again to find the best action to performfor all possible outcomes of theaction.Unsurprisingly, cached search interleaved with executionis the most efficient method.The size of the domain will have a considerable effecton the value of caching. In thiscase, the domain contains 256 states but we are onlyperforming ten actions, so cachinghas relatively little effect. However, since in mostcases more searches will be performed,if the space is available to cache values, it is aworthwhile tradeoff against the timerequired to recompute previously computed values.One surprise from this table ishowwell the cached search without execution performs.This is due to the small numberChapter 4. Planning by Heuristic Search 74Table 4.11: A comparison of time to search for ten actions both with and without cachingof previous best actions, and execution.Search Execution Interleaved No Executiondepth Caching No cache Caching No cache1 0.01 0.02 5.19 26.72 0.04 0.06 5.41 2813 0.42 0.51 ‘7.14 27804 4.48 5.68 15.4 -5 55.9 56.9 102 -6 219 230 272 -of states. Because the algorithm caches the action whenever it computesfor a state,the interleaved execution algorithm will oniy cache values for at most ten states.Incomparison, if each action has n possible outcomes, the no-execution algorithm cancacheup to 1 + n + n2 + ... + n9. In practice this means that for a domain ofthis size,the algorithm quickly finds itself looking for actions for states it hasalready evaluated,and hence not performing any computation. The column for search withoutcaching orexecution gives an idea of how badly search without execution but withcaching wouldperform if the state space was big enough that it more rarely got to reuse its calculations.4.4.2 Effectiveness of pruningFigure 4.8 shows the performance of the pruning algorithm in both the COFFEEandBUILDER domains. The two graphs illustrate the large differences in theperformance ofthe pruning algorithms on different domains. For the COFFEE domain,utility pruning isineffective it requires more time than no pruningat all — while expectation pruningperforms remarkably well, saving over 60 percent ofthe computation time for deep searchtrees. For the BUILDER domain, utility pruning is extremely effective, pruningmore than40 percent of all states for depth five search, and required only half thecomputationChapter 4. Planning by Heuristic Search75COFFEE DomainI I I I—2 step 3 step 4 step 5 stepDepth of searchBUILDER Domain125percentage100-75502 step 3 step 4 step 5 stepDepth of search(a)Percentage ofcomputation timeKeyUtility Pruning- --. Expectatioz PruningBoth2step 3step 4step 5stepDepth of search(b)100-percentage755025100percentagePercentage of statesexpanded5025-I I I2step 3step 4step SstepDepth of search1251007550Figure 4.8: Pruning effectiveness for (a) the COFFEE,and (b) the BUILDER domains.Chapter 4. Planning by Heuristic Search76Table 4.12: Expectation pruning when only the top of thesearch tree is pruned. Valuesare percentage of value with no pruning.____________ ____________Search depth Prune depth Percentageof Percentage ofstates pruned search time3 3 89.3 80.62 92.6 89.04 4 85.797.33 88.3 88.92 92.4 94.55 5 82.3 89.84 84.3 84.83 87.4 89.52 91.2 95.6time of search without pruning. The coarse-grainedabstraction scheme of the BUILDERdomain results in quite large errors being possiblein the heuristic function, so expectationpruning is ineffective in this domain. We used theheuristic produced from the 256stateabstract MDP for the COFFEE domain, and theerror bound of +1 is low enough thatexpectation pruning works particularly well. Thissuggests that the pruning algorithmsapplied in various different problems shouldbe tailored to the quality of the heuristicfunction. The two pruning algorithmswill often prune the same nodes,so using utilitypruning when the heuristic is poor, and expectationpruning when it is good may proveto be a good strategy.In many domains, the gains due to expectationpruning in term of states expandedare not matched in terms of time to search becauseof the additional cost of pruning.Wehave run some experiments in a smaller(256 state) domain which show that formaximalsavings of computation time, expectationpruning should not be performed allthe wayto the bottom of the search tree. Theseresults are summarized in Table4.12.In this domain, expectation pruning is much less effectivethan in the COFFEE domainChapter 4. Planning by Heuristic Search77given above. This emphasizes the computational requirements.As we would expect, asthe depth to which pruning is performed decreases, the numberof states which must beexamined in the search increases, but the time requiredto perform the search tends todecrease at first as the cost of attempting to prune the largenumbers of states at bottomof the tree is reduced, and then increases againas the lack of pruning means that morenodes must be searched. At least for this searchproblem, the optimal pruning depthseems to be one level earlier than the bottom of thetree, but we have no way at presentof estimating this value in advance for a given domainand heuristic.Chapter 5ConclusionsThis thesis has presented two components of a planningsystem for domains representedas Markov decision processes. Chapter3 describes a way of creating close to optimalpolicies using abstraction, while Chapter4 demonstrates the use of heuristic searchtoplan and execute actions at run-time. Puttingthe two together to build acompleteplanning system produces the following high-levelalgorithm:1. Prior to run-time: Use abstraction to produce1) an abstract policy that can beused as a set of default reactions, and2) an estimate of the value of each state,tohe used as a heuristic function.2. At run-time: If there is sufficient computationtime, use the search algorithmtocompute the best action for the currentstate, and execute the action.Otherwise,execute the default reaction providedby the abstraction process.3. If a goal state has been reached (if suchstates exist), end. Otherwise, observethenew state, and return to step 2.This algorithm is designed for use indomains where time is critical.The accuracy ofthe heuristic can be tailored to the amountof start-up computation time available,andthe depth of the search can be variedfrom action to action in orderto take advantageof the time available before thenext action must be performed, oran anytime searchalgorithm such as iterative deepening canbe used. However, because theplans produced78Chapter 5. Conclusions 79are only approximately optimal, the domain should have the following characteristics ifthe algorithm is to work effectively.• Structure: The abstraction algorithm relies on agreat deal of structure existingin the domain. We expect other methods for creating heuristics to require similarstructure.• Time criticality: As we have said before, the approachis designed for use in domainswhere reaction time is significant. If there is no restriction on the computation timeavailable, then a better approach would be to use policy iteration (Howard1971)or a similar algorithm to find an optimal policy, and execute actions according tothe policy.• Recoverability: By recoverability, we mean thatperforming a poor action is generally not fatal. For example, if the connectivity of some parts of the state spaceisparticularly low (there are very few transitions between certain classes of states),the fact that actions are executed as planning proceeds may become a liability. Theagent may select an action and execute it, only to discover later that its choice hasmade reaching certain desirable states very difficult. An algorithm that plans morethoroughly may well be much more effective in such a domain.• Complete Observability: As we have observedbefore, we have only applied thesetechniques in completely observable domains. Unfortunately, many interesting realworld domains do not have this characteristic.• Stable goals: Over a series of planning tasks, the agent will work especiallywell ifthe reward function remains constant. If the start state changes,only the searchneed be redone, while making any significant change to rewards requires creatinganew abstraction.Chapter 5. Conclusions 80• Few actions: The abstraction and search algorithmsare both very sensitive tothe number of actions in the domain. Even a small reduction in the number ofactions to be considered can result in a significant reduction in the performance ofthe search algorithm. Action elimination techniques (see Section 6.7 of (Puterman1994)) may be a useful addition to the approach, as may the work of Dean andGreenwald (1994) on applying MDP techniques to scheduling problems.The tradeoff between the two parts of the algorithm is also important. If a largeamount of initial computation time is available, or actions must be executed very quicklyonce planning has begun, a more accurate heuristic can be constructed, and the searchcan be correspondingly less deep. There are a number of other factors that affect thisbalance.• Continuity: if actions have similar effects inlarge classes of states and most of thegoal states are fairly similar, we can use a less detailed heuristic function(a moreabstract policy).• Fan-out: if there are relatively fewactions, and each action has a small number ofoutcomes, we can afford to increase the depth of the search tree.• Extreme probabilities: with extreme probabilities it may be worth only expandingthe tree for the most probable outcomes of each action. This seemsto bear some relationship to the envelope reconstruction phase of the recurrent deliberation modelof (Dean et al. 1993a).5.1 Real World DomainsAs we said above, there are relatively few real world domains which are completelyobservable. This considerably restricts the real world applicabilityof the work as it nowChapter 5. Conclusions81stands. Although we have not investigated it as yet, one promisingclass of real problems,scheduling, has been proposed in which our techniques can beapplied. We describe twosuch domains.Airline Gate Scheduling The problem is, given a set of gatesat an airport, and aset of incoming and outgoing flights, to allocate gates to each flightin order tominimize deviation from the published schedule. The domainis stochastic sinceflights can be delayed, or canceled altogether, and gatesmay be unavailable forvarious reasons. Actions in this domain are assignmentsof gates to flights, andrewards (and penalties) are received depending on whether flightsleave on time,and are delayed on the ground because no gate is available.Rewards may also bereceived for allocating nearby gates to flights with alarge number of connectingpassengers.One difficulty with this domain is the very large number of actions— each assignment of a flight to a gate is a distinct action. The techniquesdiscussed above forreducing the action space will have a critical part toplay in applying our algorithmsin many scheduling domains, as will schematic descriptionsof actions, and theirexploitation.Oil Pipeline Management Here the problem isto supply a number of customers withvarious petroleum products. Actions are openingand closing valves, while rewardsare received for deliveringcorrect shipments on time, and minimizing oil spoilagedue to leaks and contamination of one productwith another. The system is stochastic since pumps may fail, pipes may leak,etc.One open research question is whether the abstractionprocedure described in Chapter3 will be applicable in real world domains. Although wehave not investigated any realChapter 5. Conclusions82domains to test this, we have reason to be optimistic, as wewould expect a number ofirrelevant or only slightly relevant propositions tobe present in any such domain. Thefact that Bayesian networks can be applied in real-world domainssuggests that there isa great deal of independence that can be exploited.5.2 Future WorkThere are a great many possible lines of research thatpresent themselves. A majorone, which we have already discussed, is extending the approachto partially observableMDPs. The major difficulty in the partially observable caseis that the current state of thesystem is a probability distribution over all states,rather than a single state. Althoughpartially observable MDPs can be translated into completelyobservable ones (Sondik1978), the resulting MDP has an uncountably infinitestate space. One naive approachto this problem is to perform heuristic searchas described in Chapter 4 from each statewith a non-zero probability (assuming that thereare relatively few such states) or withprobability above some threshold, and then calculatethe weighted average of the utility ofeach action over all such states. For complex probabilitydistributions, such a proceduremay be very expensive to perform, but there arecurrently no other computationallyefficient ways of computing even close-to-optimalplans in partially observable MDPs.From an Al perspective, it is important to considersensor models to determine what kindsof probability distribution are possible when considering planningin partially observabledomains.A fairly easy extension to this work would be to add rewardsand costs associated withactions as well as states. This would involvea relatively small change to the algorithms,and would provide an increase in representationalpower.In both the abstraction and planning algorithmsas we have presented them, the agentChapter 5. Conclusions83is aiming to maximize its discounted total reward over an infinite horizon.However, thisis not the only value the agent might be interested in. Without consideringthe finitehorizon case, the agent might also be interested in the undiscountedreward. In theMDP literature, this is referred to as average reward criteria, where theagent aims tomaximize its average reward per action (or per unit time). Inmany of the domains weare interested in, it is difficult to justify discounting,so an agent using average rewardcriteria may well perform better. Average reward criteria considerablycomplicates policyiteration, and other methods of finding optimal policies,and we have not yet investigatedthe difficulties involved in applying it with our algorithms.Extending our approach to operate in infinite domainsis another area we have yet toinvestigate. Nor have we lookedat the possibility of planning with semi-Markov processes(Howard 1971).The areas we have mentioned so far are all general problemsthat we would like tosee investigated. There are also a series of extensions thatcould be made to both theabstraction and planning algorithms individually.5.2.1 AbstractionAs it stands, the abstraction algorithmis rather restricted in some ways.We havetended to be very conservative whenproducing the abstract state space — everyatomthat could possibly affect the eventualreward is deemed relevant. There area numberof possible ways to relax this requirement.The first is to ignore atoms which makeonlya very small difference to the outcome of some action.For instance, in the examplepresented in Table 2.1, if carrying the umbrellamade a very slight difference to theprobability of delivering the coffee successfully, thenthe abstraction described in Section3.2.3 would have included Umbrella as a relevant atom.If the difference in the probabilityof DeliverCoffee was sufficiently small, we wouldprefer to ignore Umbrella, andperformChapter 5. Conclusions84the same abstraction as in the example. There are two difficulties withthis. The first isthat the Markov property no longer holds if these atoms are ignored, andthe second isdetermining which atoms should be ignored, and which needto be included.A second relaxation of our requirements is to use logical sentences inthe place ofatoms when constructing the abstract state space. This makes the constructionrathermore difficult, but could result in equally good abstract policiesproduced from evensmaller abstract state spaces.One area that would greatly benefit by further investigation is ways of expandingtheabstraction process to operate with variables that can take morethan two values. Forinstance, a robot’s position in some navigation problemmight be represented by twovariables, one giving the room that the robotis currently in, and the other the robot’sheading (north, south, etc.). An abstraction processthat could decide which values of agiven variable could be grouped together would be very valuable.Michael Horsch (Horsch1994) has been looking at this problem as a way of usingabstraction to evaluate influencediagrams.Another interesting way to improve the abstraction,although only in certain domains,is to take advantage of knowledge of the initial state when constructingthe abstract statespace. For example, if we knew that in our exampledomain from Table 2.1 that Rainwas false in the initial state, and since the value ofRain can never change, we would likethe abstraction procedure to treat Rain as an irrelevantatom even if it has a large effecton reward.An important research area that we have yetto explore is the question of howtheabstraction algorithm we describe can be combined withthe envelope method of findingpolicies described in (Dean et al. 1993b). Tosome extent, this is related to the ideaof using the start state to eliminate atoms from theabstraction, but Dean’s approachgoes further, ignoring states that are possible but veryunlikely to be reached from theChapter 5. Conclusions85initial state. Nicholson and Kaelbling (1994) have already investigatedone form of abstraction using sensitivity analysis. Combining their approachwith ours might providea very powerful approach which might well go some way towardssome of the other openproblems described in this section.5.2.2 SearchWe have tried to describe the search algorithm in a fairly general way,making littlemention of implementation details such as whether to use depthfirst search or iterativedeepening. As it stands, the algorithm searches toa fixed depth. However, this need notbe the case. Searching to variable depth dependingon the utility of the current action,or some other criteria, would be an interesting area to investigate.If there is little time to perform search, an agentmay want to plan to take advantageof previous computation. For example, an agentmight want to reject an apparentlysuperior action in favor of a slightly worse one whichleads to an area of the state spacewhere the agent has already cached many actions,thus saving the agent computationtime.Another fruitful area for research is the idea oflearning a better heuristic whilesearching to select actions. We have deliberately shiedaway from an investigation of thisarea because of our interest in producing as efficienta planning procedure as possible.The idea is that as the agent computes values for states in thesearch process, it updatesthe heuristic with these new values. Over time, the heuristicshould approach the truevalue of each visited state. However, if thevalue of the heuristic changes duringtheplanning procedure, then the agent needs to replan everytime it reaches a certain state,rather than using the previously computed action.As state spaces become larger, theprobability of reaching a particular state many times willin general decrease, so thiscaching of values will be of less value. In these situations,an agent that constantlyChapter 5. Conclusions86attempts to improve its heuristic may well perform better.REFERENCES87ReferencesBallard, B. W. 1983. The*.minimsearch procedure for trees containingchance nodes.Artificial Intelligence, 21:327—350.Bellman, R. 1961. Adaptive Control Processes.Princeton University Press, Princeton,New Jersey.Dean, T., Kaelbling, L. P., Kirman, J., and Nicholson,A. 1993a. Deliberation schedulingfor time-critical decision making. In Proc. of UAI-93, pages309—316, Washington,D.C.Dean, T., Kaelbling, L. P., Kirman,J., and Nicholson, A. 1993b. Planning with deadlinesin stochastic domains. In Proc. of AAAI-93, pages574—579, Washington, D.C.Draper, D., Hanks, S., and Weld, D. 1994. A probabilisticmodel of action for least commitment planning with information gathering.In de Mantaras, R. L. and Poole,D.,editors, Proceeding of the Tenth Conferenceon Uncertainty in Artificial Intelligence,pages 178—186. Morgan Kaufmann.Feldman, J. and Sproull, R. 1977. Decision theoryand artificial intelligence ii: Thehungry monkey. Cognitive Science,1:158—192.Fikes, R. E. and Nilsson, N.J. 1971. Strips: A new approach to the applicationoftheorem proving to problemsolving. Artificial Intelligence, 2:189—208.Greenwald, L. and Dean, T. 1994. Montecarlo simulation and bottleneck-centeredheuristics for time-critical scheduling in stochasticdomains. UnpublishedManuscript.Haddawy, P. and Doan, A. 1994.Abstracting probabilistic actions.In de Mantaras,R. L. and Poole, D., editors, Proceeding of the Tenth Conferenceon Uncertainty inREFERENCES88Artificial Intelligence, pages 270—277. Morgan Kaufmann.Haddawy, P. and Hanks, S. 1992. Representations fordecision-theoretic planning: Utilityfunctions for deadline goals. In Proc. ofKR-92, pages 71—82, Cambridge.Horsch, M. 1994. Personal communication.Howard, R. A. 1960. Dynamic Programming andMarkov Processes. MIT Press, Cambridge, MA.Howard, R. A. 1971. Dynamic Probabilistic Systems.Wiley, New York.Howard, R. A., Matheson,J. E., and Miller, K. L. 1976. Readingsin decision analysis.Stanford Research Institute, Menlo Park,California.Knoblock, C. A. 1993. Generating AbstractionHierarchies: An Automated ApproachtoReducing Search in Planning. Kluwer,Boston.Korf, R. E. 1990. Real-time heuristicsearch. Artificial Intelligence,42:189—211.Kushmerick, N., Hanks, S., and Weld,D. 1993. An algorithm for probabilisticplanning.Technical Report 93-06-04, University ofWashington, Seattle.McAllester, D. and Rosenblitt, D. 1991.Systematic non-linear planning.In Proceedingsof the Ninth National Conference on Artificial Intelligence,pages 634—639.Nicholson, A. E. and Kaelbling,L. P. 1994. Toward approximate planningin very largestochastic domains. In AAAI Spring Symposiumon Decision Theoretic Planning,pages 190—196, Stanford.Pearl, J. 1984. Heuristics: IntelligentSearch Strategies for Computer ProblemSolving.Addison-Wesley, Reading, Massachusetts.REFERENCES 89Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of PlausibleInference. Morgan Kaufmann, San Mateo.Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, New York.Russell, S. J. and Wefald, E. 1991. Do the Right Thing: Studies in Limited Rationality.MIT Press, Cambridge.Sacerdoti, E. D. 1974. Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5:115—135.Smith, D. E. and Peot, M. A. 1993. Postponing threats in partial-order planning.InProc. of AAAI-93, pages 500—506, Washington, D.C.Sondik, E. J. 1978. The optimal control of partially observable markov processes overthe infinite horizon: Discounted cost. Operations Research, 26(2):282—304.Tash, J. and Russell, S. 1994. Control strategies for a stochastic planner.In Proceeding ofthe Twelfth National Conference on Artificial Intelligence, pages 1079—1085.AAAIPress.Appendix AExperimental DomainsWet,-’Office,-’HUCWet,—iOfficeWet,HUCWet-‘Office,-’HUC-‘Office-‘HUC0.00810.80190.00090.08910.00090.08910.00010.0099The fully compiled form of the coffee-making domain fromChapter 2MoveAction Discriminant EffectProb.-iUmbrella,Rain,OfficeUmbrella,Rain,Office —iOffice,-’HUC0.0090-‘Office 0.8910-HUC 0.00100.0990-‘Rain,Office -‘Office,-’HUC 0.0090—‘Office 0.8910-‘HUC 0.00100.0990—‘TJmbrella,Rain,-’Office Wet,Office,-’HUC0.0081Wet,Office 0.8019Wet,—’HUC 0.0009Wet 0.0891Office,-’HUC 0.0009Office 0.0891-‘HUe 0.00010.009990Action Discriminant Effect Prob.Move Umbrella,Rain,-’Office Office,-iHUC 0.0090Office 0.8910-‘HUC 0.00100.0990-,Rain,-iOffice Office,-’HUC0.0090Office 0.8910-HUC 0.00100.0990BuyCoffee -‘Office HRC,-’HUC0.0080HRC 0.7920-HUC 0.00200.1980Office -iHUC 0.01000.9900GetUmbrella Office Umbrella 0.8910Umbrella,-’HUC 0.0090-1HTJC 0.00100.0990-‘Office -HUC 0.01000.9900DeliverCoffee Office,HRC -‘HRC,HUC0.8000-HRC,-iHUC 0.0010-HRC 0.0990-HUC 0.00100.0990-iOffice,HRC -HRC,-HUC 0.0080-,HRC 0.7920-HUC 0.00200.1980-HRC -HUC 0.01000.9900The value function for this domain is exactly the one presented on Page 14.91The COFFEE domain from the results sections of Chapters3 and 4Action Discrinilnant Effect Prob.GoBarn umb,-’la la,-’lb 0.90000.1000-‘umb,-’la wet,la,-’lb 0.8100wet 0.0900la,—ilb 0.09000.0100la 1.0000GoOffice —‘la la,lb 0.90000. 1000la 1.0000GoAlLab -‘umb,Ia,-’lb wet,dist,-’la,lb 0.8100dist,-ila,lb 0.0900wet 0.09000.0100umb,la,-’lb dist,—ila,lb 0.90000.1000la,lb dist,—ila,lb 0.90000.1000-ila 1.0000GoGraphicsLab -‘umb,la,-’lb wet,-ila,-’lb 0.8 100-‘la,-’lb 0.0900wet 0.09000.0100umb,la,-ilb —da,-ilb 0.90000. 1000la,lb —ila,-ilb 0.90000. 1000-‘la 1.0000BuyCoffee hrs,la,-’lb —ihrs,hrc 0.50000.5000—‘hrs,la,-’lb hrc 0.80000.2000la,lb 1.0000-‘la 1.0000DeliverCoffee hrc,la,lb huc,—ihrc 0.8000—‘hrc 0.10000. 1000-‘hrc,la,lb 1.0000la,—ilb 1.0000-‘la 1.000092Sentence Valuehuc,hus,wet,dist 1.15huc,hus,wet,—idist 1.25huc,hus,—iwet,dist 1.4huc,hus,—iwet,--dist 1.5huc,-ihus,wet,dist 0.65huc,—ihus,wet,--idist 0.75huc,—ihus,--iwet,dist 0.9huc,—ihus,--iwet,-,dist 1.0—ihuc,hus,wet,dist 0.15—ihuc,hus,wet,--idist 0.25—‘huc,hus,-’wet,dist 0.4-‘huc,hus,--’wet,-’dist 0.5—ihuc,—ihus,wet,dist -0.35—ihuc,-,hus,wet,-,dist -0.25-,huc,--ihus,-,wet,dist -0.1—ihuc,—ihus,-,wet,---idist 0.0Action Discriminant Effect Prob.BuySnack hrc,la,-ilb hrs,-’hrc 0.50000.5000—‘hrc,la,---’lb hrs 0.80000.2000la,lb 1.0000—ila 1.0000DeiSnack hrs,la,lb hus,—’hrs 0.8000—ihrs 0.10000.1000—‘hrs,la,lb 1.0000la,—’lb 1.0000—ila 1.0000GetUmbrella la,lb umb 0.90000.1000la,-’lb 1.0000—ila — 1.000093The BUILDER domain from Chapters 3 and4Action DiscriminantEffectProb.PaintA ACleanAPainted0.75-‘AClean 0.200.05—iAClean1.00PaintB BCleanBPainted0.75—‘BClean 0.200.05-iBClean1.00ShapeA -‘Joined-‘APainted,AShaped0.80—iAPainted,-’AClean,-iAShaped,-’ADrilled0.10—‘APainted0.10Joined -iBPainted,-,APainted1.00ShapeB —‘Joined —BPainted,BShaped0.80—iBPainted,-iBClean,-iBShaped,-iBDriIled0.10-‘BPainted0.10Joined -BPainted,-’APainted1.00DrillA —‘JoinedADrilled 0.900.10Joined1.00DriiB -‘JoinedBDrilled 0.900.10Joined1.00WashAAClean0.900.10WashBBClean0.900.10Bolt BShaped,AShaped,Joined0.80BDrilled,ADrilled0.20-‘ADrilled1.00—iBDrilled,ADrilled1.00-iAShaped,BDrilled,ADriled1.00-‘BShaped,AShaped,1.00BDrilIed,ADriIled94Action Discriminant EffectProb.Glue BShaped,AShaped -iBClean,-’AClean,Joined0.35Joined 0.35—‘BClean,-’AClean 0.150.15-‘AShaped -‘BClean,-’AClean0.500.50-‘BShaped,AShaped -iBClean,-’AClean0.500.50Sentence ValueAClean,BClean,APainted,BPainted,Joined 1—‘AClean,BClean,APainted,BPainted,Joined.9AClean,—iBClean,APainted,BPainted,Joined .9—iAClean,-iBClean,APainted,BPainted,Joined.8AClean,BClean,—iAPainted,BPainted,Joined.8-iAClean,BClean,-’APainted,BPainted,Joined .7AClean,-,BClean,-iAPainted,BPainted,Joined.7-‘AClean,-’BClean,-iAPainted,BPainted,Joined.6AClean,BClean,APainted,-’BPainted,Joined.8-iAClean,BClean,APainted,-’BPainted,Joined.7AClean,—’BClean,APainted,-iBPainted,Joined.7-‘AClean,--iBClean,APainted,-iBPainted,Joined.6AClean,BClean,-iAPainted,-’BPainted,Joined.6-iAClean,BClean,—iAPainted,-iBPainted,Joined.5AClean,-’BClean,-’APainted,-’BPainted,Joined.5—iAClean,-iBClean,-,APainted,-iBPainted,Joined.4ACIean,BClean,APainted,BPainted,-Joined.6—iAClean,BClean,APainted,BPainted,--iJoined.5ACIean,-iBClean,APainted,BPainted,-’Joined.5—‘AClean,-’BClean,APainted,BPainted,--’Joined .4AClean,BClean,-’APainted,BPainted,—’Joined .4-‘AClean,BClean,-’APainted,BPainted,-iJoined.3ACIean,—’BClean,-’APainted,BPainted,-’Joined.3-iAClean,-’BClean,-iAPainted,BPainted,-’Joined .2AClean,BClean,APainted,-iBPainted,-’Joined .4—iAClean,BClean,APainted,--iBPainted,--,Joined.3AClean,—’BClean,APainted,--iBPainted,-’Joined.3AClean,-iBC1ean,APainted,-iBPainted,-iJoined.2AClean,BClean,-iAPainted,-’BPainted,-iJoined .2-iAClean,BClean,--iAPainted,-iBPainted,-iJoined.1AClean,BC1ean,-’APainted,--’BPainted,Joined.1-‘AClean,-’BClean,-iAPainted,-’BPainted,-’Joined095
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Abstraction and search for decision-theoretic planning
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Abstraction and search for decision-theoretic planning Dearden, Richard William 1994
pdf
Page Metadata
Item Metadata
Title | Abstraction and search for decision-theoretic planning |
Creator |
Dearden, Richard William |
Date | 1994 |
Date Issued | 2009-03-04 |
Description | We investigate the use Markov Decision Processes a.s a means of representing worlds in which actions have probabilistic effects. Markov Decision Processes provide many representational advantages over traditional planning representations. As well as being able to represent actions with more than one possible result, they also provide a much richer way to represent good and bad states of the world. Conventional approaches for finding optimal plans for Markov Decision Processes are computationally expensive and generally impractical for the large domains and real-time requirements of many planning applications. For this reason, we have concentrated on producing approximately optimal plans using a minimal amount of computation. We describe two complementary methods for planning. The first is to generate ap proximately optimal plans using abstraction. By ignoring certain features of a planning problem, we can create a smaller problem for which an optimal plan can be efficiently found by conventional means. The plan for this smaller problem can be directly applied to the original problem, and also provides an estimate of the value of each possible state of the world. Our second technique uses these estimates as a heuristic, and applies game tree search techniques to try to determine a better action to perform in the current state of the system. By repeatedly choosing an action to perform by searching, and executing the action, we provide a planning algorithm which has a complexity that is independent of the number of possible states of the world. |
Extent | 1862595 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-03-04 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051232 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1994-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/5461 |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- ubc_1994-0513.pdf [ 1.78MB ]
- Metadata
- JSON: 1.0051232.json
- JSON-LD: 1.0051232+ld.json
- RDF/XML (Pretty): 1.0051232.xml
- RDF/JSON: 1.0051232+rdf.json
- Turtle: 1.0051232+rdf-turtle.txt
- N-Triples: 1.0051232+rdf-ntriples.txt
- Original Record: 1.0051232 +original-record.json
- Full Text
- 1.0051232.txt
- Citation
- 1.0051232.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 12 | 15 |
United States | 7 | 5 |
Germany | 6 | 0 |
France | 4 | 0 |
United Kingdom | 4 | 0 |
Japan | 4 | 0 |
Estonia | 1 | 0 |
City | Views | Downloads |
---|---|---|
Unknown | 11 | 1 |
Shenzhen | 8 | 14 |
Beijing | 4 | 0 |
London | 4 | 0 |
Tokyo | 4 | 0 |
Ashburn | 3 | 0 |
Redmond | 2 | 3 |
University Park | 2 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051232/manifest