Exploring Partially Observable Markov Decision Processes by Exploiting Structure and Heuristic Information (Leung, Siu-Ki, 1996)


Exploring Partially Observable Markov Decision Processes by Exploiting Structure and Heuristic Information

by Siu-Ki Leung
BSc(Hon), Memorial University of Newfoundland, 1994

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF GRADUATE STUDIES, COMPUTER SCIENCE

We accept this thesis as conforming to the required standard.

The University of British Columbia, November 1996
© Siu-Ki Leung, 1996

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Siu-Ki Leung, Computer Science, The University of British Columbia, 2366 Main Mall, Vancouver, BC, Canada V6T 1Z4

Abstract

This thesis is about chance and choice, or decisions under uncertainty. The desire to create an intelligent agent that performs rewarding tasks in a realistic world calls for working models of sequential decision making and planning. In response to this grand wish, decision-theoretic planning (DTP) has evolved from decision theory and control theory, and has been applied to planning in artificial intelligence. Recent interest has been directed toward Markov Decision Processes (MDPs), introduced from operations research. While fruitful results have been obtained from research in fully observable MDPs, partially observable MDPs (POMDPs) remain too difficult to solve once observation uncertainties are incorporated. Abstraction and approximation techniques have therefore become the focus.

This research attempts to enhance POMDPs by applying AI techniques. In particular, we transform the linear POMDP constructs into a structured representation based on binary decision trees and Bayesian networks to achieve compactness. A handful of tree-oriented operations is then developed to perform structural belief updates and value computation. Along with the structured representation, we explore the belief space with a heuristic online search approach, in which a best-first search strategy with heuristic pruning is employed.

Experiments with a structured testbed domain reveal the great potential of exploiting structure and heuristics to empower POMDPs for more practical applications.
Contents

Abstract
Contents
List of Figures
List of Tables
Abbreviations
Notations
Acknowledgement

1 Introduction
  1.1 Decision-Theoretic Planning
    1.1.1 Decision Theory
    1.1.2 Planning
  1.2 Markov Decision Processes
  1.3 Exploiting Structure
  1.4 Employing Heuristics
  1.5 Overview

2 POMDPs
  2.1 Conceptual Model
    2.1.1 Perceived World
    2.1.2 Rational Agent
  2.2 Computational Model
    2.2.1 Representation
    2.2.2 Computation
    2.2.3 Analysis
  2.3 Related Work
  2.4 Summary

3 Focus
  3.1 Bounded Scope
    3.1.1 Foresight: Belief Projection
    3.1.2 Hindsight: Backward Induction
  3.2 Infinite Horizon
    3.2.1 Discounting
    3.2.2 Focusing
    3.2.3 Pruning
  3.3 FOMDPs
    3.3.1 Representation
    3.3.2 Computation
  3.4 Online Search
  3.5 Related Work
  3.6 Summary

4 Structure
  4.1 Structured Representation
    4.1.1 Contexts as Abstract States
    4.1.2 Binary Decision Trees
    4.1.3 Belief State as b-Tree
    4.1.4 Reward Function as R-Tree
    4.1.5 Transition Function as T-Trees
    4.1.6 Observation Function as O-Trees
  4.2 Structured Computation
    4.2.1 Computing R(b)
    4.2.2 Computing P[o | b, a]
    4.2.3 Computing B[b, a, o]
    4.2.4 Computing V^k(b)
    4.2.5 Computing Π(b)
  4.3 Experiments
    4.3.1 Testbed Domain
    4.3.2 Results
  4.4 Related Work
  4.5 Summary

5 Approximation
  5.1 Pruning b-Trees and R-Trees
    5.1.1 Reordering Nodes
    5.1.2 Averaging Leaves
    5.1.3 Experimental Results
  5.2 Pruning Decision Search Tree
    5.2.1 Heuristic
    5.2.2 Algorithm
    5.2.3 Experiments
  5.3 Related Work
  5.4 Summary

6 Conclusion
  6.1 Hindsight
  6.2 Insight
  6.3 Foresight

A Algorithms
  A.0 Preliminary
  A.1 Average
  A.2 DotProduct
  A.3 Merge
  A.4 Multiply
  A.5 Normalize
  A.6 Product
  A.7 Sum
  A.8 Update

B Testbed

Bibliography

List of Figures

2.1 The Perception-Decision-Action Cycle
2.2 Notational Constructs of POMDPs
2.3 Computational Constructs of POMDPs
3.1 Reachable Belief States as a Decision Tree mapped on Time
4.1 Diagrammatic Representation of Binary Decision Trees (BDTs)
4.2 A b-Tree for X = {A, B, C}
4.3 An example R-Tree for X = {A, B, C}
4.4 T-Tree[B|a] for X = {A, B, C} in Evolution
4.5 O-Tree[a] for domain X = {A, B, C}
4.6 The Graft-Prune-Evaluate process for computing R(b)
4.7 Algorithm DotProduct
4.8 Notations in Algorithm DotProduct
4.9 The Concurrent Traversal process for computing R(b)
4.10 Notations in Algorithm Merge
4.11 Algorithm Merge
4.12 An example BT-Tree generated from the T-Trees
4.13 Algorithm Product
4.14 Algorithm Update
4.15 Algorithm Sum
4.16 Updating b-Tree structurally
4.17 Structured Specification for the Client-Server-Agent domain
5.1 Compacting a BDT by Reordering Nodes

List of Tables

4.1 Computing V: Linear vs Structured Representation
4.2 Computing V^k: Caching vs No Caching
5.1 Compacting b-Trees by Reordering Nodes and Averaging Leaves
5.2 Heuristic Online Search: Complete vs Best-First
5.3 The precomputed FOMDP Value Function

Abbreviations

The following abbreviations list the keywords and the related contexts of this thesis work.
2TBNS 2-stage Temporial Bayesian Networks A l Artificial Intelligence B D T s Binary Decision Trees CI Computational Intelligence CT Control Theory C P T s Conditional Probability Tables D P Dynamic Programming DSTs Decision Search Trees DT Decision Theory D T P Decision-Theoretic Planning F O M D P s Fully Observable Markov Decision Processes L P Linear Programming M D P s Markov Decision Processes OR Operations Research P D A Perception-Decision-Action P O M D P s Partially Observable Markov Decision Processes S D M Sequential Decision Making xii Notations For lacking of a systematic and intuitive notational construct, we have cus-tomized various conventions of notations used in the literature and introduced some new notations for the structured represenation. They are summarized as follows. General n Set of Real Numbers V Set of Probabilities P[\ Probability Function AO Probability Distribution k horizon; stages-to-go; search depth 5 discounting factor e error tolerance; threshold for convergence xiii NOTATIONS xiv Classical Model s Set of States s a generic state i current state 3 next state A Set of Actions a a generic action O Set of Observations o a generic observation R Reward Function: R(i) G 71 R Expected Reward Function: R(b) G 71 T Transition Function: T(i, a, j) G V 0 Observation Function: 0(i, a,o) G "P B Set of Belief States b current belief state: b(i) G V b' updated belief state B[b, a, o) Belief Update Transformation V F O M D P Value Function: V(i) G 71 V P O M D P Value Function: V(b) G 71 yk Value Function with fc-horizon v* Optimal Value Function n F O M D P Policy Function: U(i) G A n P O M D P Policy Function: 11(6) G A NOTATIONS X V Structured Representation X X X+ x-c c\=X+ c\=X~ B D T Specific b b' R T-Tree[X|a] BT-Tree[a] O-Tree [o| a] 0-Tree[a] BO-Tree[a] leaf(v) node(X, L, R) IsLeaf(T) value(T) reward (T) prob(T) effect(T) XT TL TR prefix(T) card{T) Set of prepositional Varaibles a generic propositional variable variable X assigned true variable X assigned false a generic context c subsumes X+ c subsumes X~ b-Tree for current belief state b-Tree for updated belief state R-Tree for Reward Function T-Tree for X e X given a e A BT-Tree for a G A O-Tree for o G O given a e A Combined O-Tree for a £ A b x 0-Tree[o|a] leaf node with value v internal node labeled X with left subtree L and right subtree R T is a single leaf tree installed value if IsLeaf(T) installed reward if IsLeaf(T) and T is a R-Tree installed probability if IsLeaf(T) and T is a b-Tree, T-Tree or O-Tree installed effects if IsLeaf(T) and T is a BT-Tree variable at root of T left subtree of T right subtree of T prefix context of T cardinality of T Acknowledgement This page should have been full of names, yet I choose to name the ones that you may know of by any chance. Craig Boutilier, my thesis supervisor, who not only introduced a whole new horizon in Artificial Intelligence to me, but also guided me through step by step under all kinds of uncertainties during the research process. From his class to his personal contacts with me, he has been an inspirative teacher and supportive mentor. Thanks Craig! : ) David Poole, my second reader, who has always been an admirable A l scholar. I am so glad that he could promise to read my thesis. Comments of his indepth knowledge and insightful experience are greatly appreciated. Thanks David! : ) Above all names, my grateful heart is owe to my dear heavenly Father, who has been enlightening the decision processes in my life journey with His Spirit. 
Thank You Lord! In Jesus' name. Siu-Ki Leung Thanksgiving, 1996 xvi 1 Introduction Life is a sequence of decisions (and acts). A better life involves planning, or simply put, thinking ahead. Creating a thinking machine, or intelligent agent, has been the grand wish of Artificial Intelligence. Computer scientists take this as a task of devising computational models to capture intelligence. With a practical computational model for decision making and planning, we can build an automated controller that drives the decision process in an intelligent agent's mind to perform rewarding tasks. This thesis is about an exploration in empowering the existing numerical models in decision-theoretic planning with structured and heuristic methods. 1 1. INTRODUCTION 2 1.1 Decision-Theoretic Planning Decision- Theoretic Planning (DTP) provides a more realistic view than clas-sical planning in Artificial Intelligence, in which a perfect world model is as-sumed. Inspired by decision theory, planning can be perceived as sequential decision making under uncertainties arising from a partially controlled world model. 1.1.1 Decision Theory Decision theory is based on belief, desire and expectation. In a sense, it is the result of probability theory and utility theory combined. Belief Knowledge is built on belief. And a more realistic world model should con-sider the uncertainties implied by belief. Decision theory employs probabili-ties as a measure of belief. Probabilities sit in the interval [0,1] and offer a more general decision logic than propositional logic, in which everything is true or false. In particular, the core of the kind of decision theory that we are referring to is Bayesian decision theory. Applied to a world with stochastic transitions and imperfect observations, the belief in a state is measured by a posterior probability, the belief in a transition is measured by a conditional probability, and the updated belief given an observation is obtained by the Bayes' Rule. 1. INTRODUCTION 3 Desire An intelligent agent has goals, or tasks to perform, implanted by its de-signer. With a goal, one situation may be more desirable than another. In utility theory, desirabilities, or preferences, are simply ranked by real num-bers. From the designer's point of view, the desirability of a particular state can be defined by a reward promised to the agent for being in that state. Expectation A rational decision is based on both belief and desire. Without belief, desire may encourage wishful thinking. Without desire, belief only supports aimless wandering. A realistic expectation takes both belief (probability) and desire (utility) into consideration. The principle of decision theory is to choose the alternative that has the maximum expected utility. 1.1.2 Planning In classical planning, a plan is a sequence of actions to be performed in order to achieve a defined goal. Planning is to find an effective plan for a given goal. However, there are situations in which a definite goal is unclear and the action effects cannot be predicted with certainty. Goal-Directed vs Process-Oriented Classical planning is goal-directed. Typically, a given goal is specified by a list of conditions, and the action effects by preconditions and postconditions. A 1. INTRODUCTION 4 qualified plan is a sequence of actions supposed to achieve the goal. It works fine in a fully observed and controlled world, but fails when action effects cannot be perfectly predicted. 
In addition, multiple goals induce redundant actions and conflicting actions, which are difficult to resolve. An alternate view on planning is moving the focus from the ultimate goal to the process of visiting good states. Instead of getting all or none, the successfulness is measured by the rewards obtained during the journey, possibly a recurring process. A plan is extended from a fixed agenda to a flexible policy. Basically, a policy is a function prescribing the action for each possible situation. With a policy function and a set of sensors observing the current state, an agent would act more optimally than performing a prescribed sequence of actions without sensing in a stochastic domain. In addition, multiple tasks can be handled implicitly by designing a reward function to define the goodness of being in each world state. Sequential Decision Making The link of planning and decision making is that a plan is a sequence of decisions. A "good" choice at the moment, which results in high immediate reward, may be a very bad one for the long run. A purpose of planning is to act optimally over a defined time period. This requires sequential deci-sion making, in which both the immediate reward and the expected future value are considered in each situation. The expected future value is typi-cally defined as the maximum total expected rewards accumulated over the 1. INTRODUCTION 5 subsequent decision stages. In general, the longer the decision sequence we consider, the better the policy we can optimize. However, the longer the horizon, exponentially more subsequent decisions need to be made. In many cases, to think ahead a few steps is already not easy. Much of the research interest in sequential decision making focus on enhancing the existing models to plan with longer finite horizons and infinite horizons. 1.2 Markov Decision Processes A family of formal models supporting decision-theoretic planning is Markov Decision Processes ( M D P s ) . It has been employed by Operations Research for a long time and recently utilized by A l for designing planning agents in stochastic domains. In particular, we will employ two main kinds of M D P s in this thesis work, namely Fully Observable Markov Decision Processes ( F O M D P s ) and Partially Observable Markov Decision Processes ( P O M D P s ) They differ in the assumption of the observability of the current state in the world being modeled. F O M D P s can be considered as a simplified and ap-proximate model of P O M D P s . In the literature, F O M D P s are simply referred as M D P s in the context where fully observability is assumed. An introduction to P O M D P s will appear in Chapter 2 and the background of F O M D P s is given in Chapter 3 (Section 3.3). 1. INTRODUCTION 6 1.3 Exploiting Structure Classical M D P models are represented by linear and numerical constructs. Typically, they are formulated as numerical optimization problems and solved by applying Linear Programming and Dynamic Programming methods. Since MDPs have been brought to A l , they are often enhanced by Bayesian Net-works and Influence Diagrams for natural and compact representation, in which conditional independence is exploited. This thesis takes this approach a step further. Contextual independence and persistence are introduced along with conditional independence as general domain structural properties to of-fer a structured representation. 
While more natural and concise specification becomes possible, the structured representation also provides computational leverage. In addition, it opens new possibilities for informed approximation. Chapter 4 presents the structured representation and Chapter 5 describes possible approximation methods with it. 1.4 Employing Heuristics While a classical M D P is typically formulated as a numerical optimization problem, it can be transformed to an online searching problem, where heuris-tic techniques can be applied. In particular, we employ FOMDPs as a heuristic to solve the more general POMDPs with online search. As op-posed to precomputing the optimal policy before the execution session, the online approach has execution and planning interleaved, and decision made 1. INTRODUCTION 7 for the current situation with a feedback provided by observation. Chapter 3 introduces the ideas of the online approach and heuristic searching. 1.5 Overview This thesis is organized as follows: • Introduction Chapter 1 puts this thesis work into the contexts of computational intelligence, decision-theoretic planning and P O M D P s . • POMDPs Chapter 2 introduces P O M D P s as a conceptual and computational model for decision-theoretic planning. • Focus Chapter 3 lays out the framework of the structural and heuristic ap-proach pursued in this thesis. • Structure Chapter 4 presents the structured representation transforming the nu-merical P O M D P constructs into a compact model. • Approximation Chapter 5 describes possible approximation schemes with structured representation and heuristic searching. 1. INTRODUCTION 8 • Conclusion Chapter 6 concludes the exploration with hindsights (summary), in-sights (discussion) and foresights (future research). • Algorithms Appendix A specifies the structural algorithms in PROLOG. • Testbed Appendix B specifies the testbed domain in PROLOG. 2 POMDPs Partially Observable Markov Decision Processes In a sense ... To think is to conceive. To compute is to model. P O M D P s provide a neat conceptual and computational model for decision-theoretic planning. Devising such a model is an attempt to model a piece of world without perfect knowledge about it, and to compute rational decisions that are much better than blank wishes or blind choices. With P O M D P s , we are trying to formalize the notions of belief and value, which drive the decision processes in an intelligent being's mind. 9 2. P O M D P S 10 2.1 Conceptual Model Conceptually, with a P O M D P , we are modeling a piece of the world per-ceived by a rational agent. Essentially, we are searching for the rationality for acting wisely. 2.1.1 Perceived World A simplistic yet practical view of the complex and dynamic world is to think of it as a state space. Each state in the state space represents a possible situation or state of the world. As time goes on, the world is changed, and moves from one state to another state continually. The dynamics of the world is revealed in the pattern of the state transitions caused by the performed actions or exogenous events. Imperfect Knowledge Without perfect knowledge, the effect of each possible action is uncertain. It depends on the situation in which the action is performed. It depends on the degree of success of the action. It depends on some external factors that we do not know yet. 
More uncertain is that we may even not be able to tell for sure what state the world is currently in, let alone the absolute certainty of the total history that determines the present and the future. 2. P O M D P S 11 Decision Processes Facing those uncertainties, we are challenged to act intelligently. Whenever we are prompted to act, we are not making an independent decision, but a de-cision whose consequence will affect the decisions for the subsequent actions. To make wise decisions then is to consider opportunities, i.e., more decisions in the future. This sequential decision problem calls for the understanding of decision processes. Based on the incomplete information that we can access and the imperfect knowledge that we have acquired, we should be able to do something much more meaningful than acting arbitrarily as if we knew nothing at all. The quest is what kind of wisdom is needed, and how it can be achieved by computation. Markov Property Often, the history of the changing world does not seem to be significant or relevant to predicting the effect of the current action. It is only the current state and the action being taken that would determine the change. So maybe we can focus on here and now, and take the irrelevance of the past for granted. That means we use the current state, instead of the total history, to predict the effect of the current action. The decision process in which this assumption is valid is said to have Markov property.1 Such decision processes are referred 1 This is named after the Russian statistician Andrei A . Markov. See RUSSELL-NORVIG [29, p.500] for a historical note. 2. P O M D P S 12 as MDPs , Markov Decision Processes. Partial Observability Taking a step back, rather than pretending that we can always perfectly determine the current state, we would admit that the actual state is only partially observable. By taking observations, we can identify the current state to some degree, but probably never completely. This initiates the studies of P O M D P s , Partially Observable Markov Decision Processes. Perception, Decision and Action Intuitively, decision making is based on what has been determined and what will be desirable. To act intelligently is to observe the current situation, to interpret the observation with the perceived world model, and to apply the action that has the greatest chance to produce the most desirable change. These intelligent activities form the Perception-Decision-Action cycle in a control loop as shown in Figure 2.1. This is an ongoing process driven by belief and value. In the computational intelligence context, the growing interest drawn from P O M D P s in operations research is to use it as a rational model to build the decision controller embedded in an intelligent agent. 2.1.2 Rational Agent Imagine that we are creating a rational agent and putting it in an interesting world. We make the agent perceive and act. More interestingly, we implant 2. P O M D P S 13 Environment O Controller sensor actor Perception Decision Action Figure 2.1: The Perception-Decision-Action Cycle in the agent some knowledge, or belief, about the world. And finally, we grant it the wisdom to evaluate the value of being in each possible world state. The notions of belief and value deserve some more thoughts. 
Probabilistically, the agent's belief would tell itself • how probable it is in a state, • how probable a state would be reached after an action is performed in some state, and • how probable an observation would be obtained when an action is per-formed in one state and resulted in another state. Belief 2. P O M D P S 14 In some sense, the belief system of the agent summarizes the past experience that drives the agent to the current state, the future expectation that predicts the next state, and the evaluation model that updates the agent's belief right after an action has produced the outcome observation. Value Rationally, the agent will choose an action believed to be leading to a state with the greatest value whenever possible. But the subtlety lies in how the states are evaluated, especially when the future is uncertain. One way to think about it is to introduce the concept of reward. Rewards are received along the journey that the agent travels in space-time. So the agent may evaluate each state with respect to the expectation of the overall reward that can be obtained eventually. Expectation depends on the agent's horizon, i.e., how far ahead in the future that the agent would consider. It also depends on whether the agent would weigh the value of a state differently according to how long the agent would take to gain the expected reward. Furthermore, there may be some objective constraints, such as the lifetime of the agent or the decision deadline that has to be met. After all, the decision is subject to the agent's will. Whether the agent tends to be conservative or risky would depend on the attitude built into its computational mind. 2. P O M D P S 15 2.2 Computational Model Here we come to the question how we put all those intriguing concepts into a formal representation for finite computation. As an attempt to answer part of the question, this section introduces the building blocks of P O M D P s . 2.2.1 Representation A P O M D P problem can be formulated as: Given a world model ( S, A, O, T,0,R) and the belief states space B, compute an optimal or a "good" policy function H, which depends on the value function V implicitly. The notations are explained as follows. States S is a finite set of states in which each state represents a possible world situation. Typically, states are indexed by natural numbers. As a conven-tion, we use i to denote the current state and j to denote the next state. Occassionally, s is also used to refer to a generic state in S. Actions A is a finite set of actions, one of which is chosen at each decision point in time. As with states, actions are normally indexed by natural numbers. We use a to refer to a generic action in A. \ 2. P O M D P S 16 Observations O is a finite set of observations that can be obtained. Again, observations are indexed by natural numbers. We use o to refer to a generic observation in O. Transition Function Define A(5) to be the set of mappings P : S —> [0,1] such that Ylses P(s) ~ 1. T is a state transition function mapping the current state and the current action to a discrete probability distribution over the possible next states: T : S x A —> A(5) We write T(i, a,j), or P[j | i, a], for the probability of making the transition from state i to state j by taking action a. In addition, the definition requires Y,jesT(i, a, j) = 1. Observation Function Define A(C) to be the set of mappings P : O —>• [0,1] such that Eoeo — 1. 
O is an observation function mapping the current state, the current action and the next state to a discrete probability distribution over the possible outcome observations:

O : S × A × S → Δ(O)

(The symbol O happens to clash with the Big-O notation used in complexity analysis; the context of discussion should resolve the confusion well enough.)

We write O(i, a, j, o), or P[o | i, a, j], for the probability of observing observation o from state j after action a has been taken in state i. In addition, the definition requires Σ_{o ∈ O} O(i, a, j, o) = 1. This gives us the most general observation model. However, for simplicity, we assume that O(i, a, j, o) = O(i, a, o), as in BOUTILIER-POOLE[5], i.e., the observation is independent of the resulting state. (An alternative simplification used in the POMDP literature makes the observation independent of the starting state instead, i.e., O(i, a, j, o) = O(a, j, o).) Note that no observation is possible when T(i, a, j) = 0, thus we only need to define the O(i, a, j, o), or O(i, a, o), probabilities for which T(i, a, j) > 0.

Reward Function

Let ℝ be the set of real numbers. R is a reward function mapping the current state to a real number specifying the immediate reward:

R : S → ℝ

We write R(s) for the immediate reward obtained by being in state s. Intuitively, the reward expresses the immediate benefit that the agent receives from being in a state. A penalty or cost is defined by a negative reward.

Belief States

B is the set of belief states, in which b is the current belief state, defined as a discrete probability distribution over the possible current states:

b ∈ B and B = Δ(S)

where Δ(S) is the set of mappings P : S → [0,1] such that Σ_{s ∈ S} P(s) = 1. We use b' to refer to the next belief state updated from b, and we write b(i) for the probability of being in state i under the belief state b. As required by the definition, Σ_{i ∈ S} b(i) = 1.

Although we have a finite number of states in S, B is an infinite set. With the constraint Σ_{i ∈ S} b(i) = 1, we have |S| - 1 degrees of freedom, hence an (|S| - 1)-dimensional space. As each dimension lies in [0,1], the space of belief states is an (|S| - 1)-dimensional continuously infinite space. But for uniformity, we specify belief states by |S|-vectors. Belief states are also referred to as information states or information vectors in the POMDP literature.

Policy Function

Π is a policy, a function mapping the current belief state to a chosen action:

Π : B → A

We write Π(b) for the action of choice. In general, a policy depends on the history, the full record of the trajectory of state transitions from the initial state to the current state. However, one can summarize the relevant information given by the history in a belief state. It is known that a belief state can serve as a sufficient statistic for making a decision at a decision stage (SONDIK[33]).

In decision-theoretic planning[16], a policy adopts a process-oriented[6] view, in contrast to the goal-directed view of a plan. (As an aside, Decision Theory = Probability Theory + Utility Theory; utility here refers to the quality of being useful.) A plan is a fixed sequence of actions meant to achieve a definite goal in a deterministic environment. In contrast, given the current belief state, a policy returns a decision, i.e., the action of choice that is believed to gain the most value during the process of acting in an uncertain world.
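Before moving on to the computations, it may help to see the linear representation above written out concretely. The following Python fragment is a minimal illustrative sketch, not code from the thesis (whose algorithms appear in Prolog in Appendix A): the two-state toy domain, the array layout T[i, a, j], O[i, a, o], R[i], and every number in it are assumptions made purely for illustration, using the simplified observation model.

```python
import numpy as np

# Toy linear POMDP: |S| = 2 states, |A| = 2 actions, |O| = 2 observations.
# All numbers below are made up purely for illustration.

R = np.array([1.0, 0.0])            # R(i): immediate reward for each state

T = np.zeros((2, 2, 2))             # T[i, a, j] = P[j | i, a]
T[:, 0, :] = [[0.9, 0.1],           # action 0
              [0.2, 0.8]]
T[:, 1, :] = [[0.5, 0.5],           # action 1
              [0.5, 0.5]]

O = np.zeros((2, 2, 2))             # O[i, a, o] = P[o | i, a]  (simplified observation model)
O[:, 0, :] = [[0.8, 0.2],
              [0.3, 0.7]]
O[:, 1, :] = [[0.5, 0.5],
              [0.5, 0.5]]

b = np.array([0.5, 0.5])            # belief state: distribution over the current states

# Sanity checks implied by the definitions: every distribution sums to 1.
assert np.allclose(T.sum(axis=2), 1.0)
assert np.allclose(O.sum(axis=2), 1.0)
assert np.isclose(b.sum(), 1.0)
```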
With the policy function, the rational agent knows how to act in different situations accordingly. Actually, an optimal policy is determined not only by the belief system of the agent, but also, implicitly, by the agent's value system.

Value Function

V is a value function mapping a belief state to the real number specifying the value of being in that belief state:

V : B → ℝ

We write V(b) for the value of being in the belief state b. As opposed to reward, which is a given measure of the defined immediate benefit, value is a measure of the overall desirability or utility to be determined. In the simplest case, the value can be the total expected reward obtained over a given time period. Under the POMDP model, most of the decision effort goes into computing this value function.

2.2.2 Computation

To live in a world full of uncertainties, perceiving, deciding and acting are the rational agent's life pattern. The essential computational activities in the agent's mind are correspondingly devoted to belief updating, value estimation and policy construction.

Belief Updating

After each action has been taken, the current belief state should be updated according to the observation obtained. Formally, the updated belief state b' is a function of the current belief state b, the performed action a, and the outcome observation o:

b' = B[b, a, o]

where B performs the belief update by applying Bayes' Rule. For all j ∈ S,

b'(j) = P[j | b, a, o]
      = Σ_i b(i) P[j | i, a] P[o | i, a, j] / Σ_k Σ_i b(i) P[k | i, a] P[o | i, a, k]
      = Σ_i b(i) T(i, a, j) O(i, a, j, o) / Σ_k Σ_i b(i) T(i, a, k) O(i, a, k, o)

When the observation model is simplified to O(i, a, o), the belief update transformation becomes

b'(j) = Σ_i b(i) T(i, a, j) O(i, a, o) / Σ_k Σ_i b(i) T(i, a, k) O(i, a, o)    (2.1)

Value Estimation

Estimating the value of being in a belief state requires the agent to look ahead into the future. Assuming that rewards are simply additive, the value of being in a belief state can be formulated as the maximum total expected reward. However, the total expected reward depends on the end of the world, which is unexpected; or the lifetime of the agent, which may be indefinite; or the deadline of the task, over which we may or may not have control. One practical approach to estimating the total expected reward is to restrict the agent's attention to a finite horizon. (We focus on the finite horizon in this chapter and discuss the case of infinite horizon in the next chapter.) That means the agent considers a fixed number of sequential decisions ahead. Let this number be k, and write V^k(b) for the value of being in the belief state b with k stages to go, or k actions left to perform. Then V^k(b) can be recursively formulated as

V^k(b) = R(b)                                                      if k = 0
V^k(b) = R(b) + max_a { Σ_o P[o | b, a] V^{k-1}(B[b, a, o]) }      if k > 0    (2.2)

where

R(b) = Σ_i b(i) R(i)    (2.3)

is the expected immediate reward given belief state b,

P[o | b, a] = Σ_i b(i) O(i, a, o)    (2.4)

is the probability of expecting observation o given that action a is performed in belief state b, and B[b, a, o] is the updated belief state defined by Equation 2.1.

Intuitively, the recursive step estimates the sum of the expected immediate reward and the maximum expected future value with respect to the selected action, based on the estimated values for the next decision stage with one less stage to go.

A great challenge in computing the value function in general is that the domain of the function, i.e., the belief state space, is continuously infinite. This implies that the value function cannot be represented and computed explicitly.
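As a concrete reading of Equations 2.1 to 2.4, here is a small sketch of the belief update, the expected immediate reward, and the exhaustive k-step lookahead value. It assumes the same array layout as the sketch above (T[i, a, j], O[i, a, o], R[i], with the simplified observation model); the function names are illustrative, and the recursion deliberately mirrors Equation 2.2, so it inherits the exponential cost discussed in Section 2.2.3 rather than the structured and heuristic methods developed later in the thesis.

```python
import numpy as np

def belief_update(b, T, O, a, o):
    """Equation 2.1: b'(j) is proportional to sum_i b(i) T(i,a,j) O(i,a,o)."""
    unnormalized = (b * O[:, a, o]) @ T[:, a, :]   # vector over next states j
    p_o = unnormalized.sum()                       # equals P[o | b, a] (Equation 2.4)
    if p_o == 0.0:
        raise ValueError("observation o is impossible under belief b and action a")
    return unnormalized / p_o

def expected_reward(b, R):
    """Equation 2.3: R(b) = sum_i b(i) R(i)."""
    return float(b @ R)

def value(b, k, T, O, R):
    """Equation 2.2: V^k(b) by exhaustive k-step lookahead over actions and observations."""
    if k == 0:
        return expected_reward(b, R)
    best = float("-inf")
    for a in range(T.shape[1]):                    # try every action
        future = 0.0
        for o in range(O.shape[2]):                # expectation over observations
            p_o = float(b @ O[:, a, o])            # P[o | b, a] (Equation 2.4)
            if p_o > 0.0:
                future += p_o * value(belief_update(b, T, O, a, o), k - 1, T, O, R)
        best = max(best, future)
    return expected_reward(b, R) + best
```

For example, with the toy model sketched earlier, value(b, 2, T, O, R) returns the two-stages-to-go value of the uniform belief.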
The significant breakthrough in getting around this difficulty was due to SONDIK's observation[33]: the value function for finite-horizon POMDPs is piecewise-linear and convex. (For a detailed illustration of this property, see CASSANDRA[10].) This property allows us to represent V^k by a finite set of |S|-vectors, referred to as α-vectors by SONDIK. Let this set of α-vectors be V^k. Then V^k(b) can be obtained by

V^k(b) = max_{α ∈ V^k} { b · α }

Geometrically, each of these α-vectors represents a hyperplane (a line for |S| = 2 and a plane for |S| = 3) composing the value function. Based on SONDIK's insight, MONAHAN[24] presented a simpler method than SONDIK's original algorithm for constructing V^k. In our notation, it is defined as follows.

V^0 = { R }
V^k = { α^k | α^k(i) = R(i) + Σ_{j ∈ S} T(i, a, j) Σ_{o ∈ O} O(i, a, o) α^{k-1}(j),  for i ∈ S, a ∈ A, α^{k-1} ∈ V^{k-1} }

where R is an |S|-vector representing the immediate reward function, and α^k(i) is the i-th component of the α-vector generated for a ∈ A and α^{k-1} ∈ V^{k-1}. That is, the value of being in state i is the sum of the immediate reward and the expected future value. The immediate reward is determined by the reward function, and the expected future value is estimated by the transition function, the observation function and V^{k-1}.

In this formulation, V^k is guaranteed to be finite. But the enumeration of α-vectors blows up very quickly, as the number of possible trajectories grows exponentially with k in |A| · |O|. Observe that some α-vectors may be dominated by others with respect to every belief state and hence can be removed. The development of more efficient algorithms since SONDIK[32] exploits this fact. Important landmark works include MONAHAN's Linear Programming method[24], CHENG's Relaxed Region and Linear Support Algorithms[12], and LITTMAN et al.'s Witness Algorithm[21][11].

As it appears, computing the value function is the essence of POMDPs, and substantial research has been devoted to developing methods to represent and construct value functions with abstraction and approximation techniques such as PARR-RUSSELL's Differentiable Approximation[25].

Policy Construction

Given the value function V, the choice of actions becomes clear: choose the one with the greatest expected value according to the possible outcome observations. With the current belief state b, that is

Π(b) = argmax_a { Σ_o P[o | b, a] V(B[b, a, o]) }

As the space of belief states is continuously infinite, an optimal policy is usually constructed from the value function partially or on demand.

2.2.3 Analysis

Traditionally, the basic constructs of POMDPs are represented linearly with vectors and matrices, while dynamic programming and linear programming are the usual techniques for computing the value function and policy. A brief complexity analysis gives us an appreciation of the difficulty of solving POMDPs.

Space Complexity

A belief state is a vector of size |S|, and so is the reward function. The transition function requires |S| × |A| × |S| probability entries and can be represented by |A| matrices of size |S| × |S|, one for each action. The general observation function requires |S| × |A| × |S| × |O| entries while the simplified
However, since B is continuously infinite, implicit methods are necessary. Time Complexity In a domain of size \S\, the time complexity for updating a belief state is 0(|<S|2) for there are \S\ x |«S| possible state transitions for each j £ S in Equation 2.1 under the given action and observation. For a P O M D P with finite horizon k, computing the value of a particular belief state with Equa-tion 2.2 requires considering the values of the |.A| x \0\ possible situations in each recursive step. And in each of the projected situations, one belief update is performed. Hence, the overall complexity is 0((|*4| • \0\ • \S\2)K), which is globally exponential in k. Given a computed value function, a decision still requires 0(|.4| • \0\ • |«S|2) to be returned by projecting the next possible belief states one step ahead. While the complexity looks bad, it is even worse when we consider that |«S| is typically exponential in the number of problem variables or features modeling the world. We will discuss this issue further in Chapter 4, in which a structured representation is introduced to reduce the complexity. 2. P O M D P S 2.3 Related Work 26 P O M D P s were introduced to the Artificial Intelligence (Al) community from Operations Research (OR). In particular, P O M D P s were general-ized from M D P s in Dynamic Programming (DP) after BELLMAN[2, 1957]. While much of the foundation has been built on SMALLWOODI-SONDIK[31, 1973], SONDIK[33, 1978], MONAHAN[24, 1982] and LOVEJOY[23, 1991] con-tributed the more readable surveys. In A l , P O M D P s are found particu-larly interesting in Decision-Theoretic Planning (DTP) and Reinforcement Learning (RL). Remarkable works have been done by BROWN'S research group, e.g. CASSANDRA-KAELBLING-LITTMAN [11, 1994], and BERKELEY'S A l researchers, e.g. PARR-RUSSELL[25, 1995]. A comprehensive study of P O M D P s algorithms are recently found in LITTMAN'S Ph.D. thesis [22, 1996]. CASSANDRA[10, 1994] provides an illustrative and relatively recent summary of P O M D P s including a brief development history of the algo-rithms. 2.4 Summary P O M D P s has been introduced as a conceptual and computational model for decision-theoretic planning. In particular, it adopts the process-oriented world view and the perceive-decide-act life pattern for designing a rational agent. More specifically, we focus on the rationality based on the notions of belief and value in the decision agent's mind. Figure 2.2 and Figure 2.3 2. P O M D P S 27 Notation • n Real Numbers V = [0,1] Probability A Distribution G S (finite) Set of States a G A (finite) Set of Actions 0 G O (finite) Set of Observations Transition Function T : S x A —• A(<S) Observation Function 0 : S x Ax S — ^ A ( O ) Reward Function R : S —> ft b,b' G B = A(5) Set of Belief States Policy Function n : B—>A Value Function V : B —> U Figure 2.2: Notational Constructs of P O M D P s summarize the notational and computational constructs of P O M D P s re-spectively. In search of a useful and usable formalism, the quest reveals the need for powerful abstraction and approximation techniques. 2. 
P O M D P S 28 P O M D P s : post-Perception, Decision and pre-Action Belief Updating b' = Vje s b'(j) = Value Estimation Vk(b) R(b) B[b, a,o] Ej b(i) T(i,a,j) Q(i,a,o) Efci^W T(i,a,k) 0(i,a,o) = < f R{b) < k = 0 R(b) + max { p[o\b,a] V f c _ 1 (B[&,a,o]) } < k > 0 o P[o |6 ,a ] = £ b (») ° ( * > a > ° ) Policy Construction n(6) = a r g m a x { £ P[o\b,a] V(B[b,a,o]) } Figure 2.3: Computational Constructs of P O M D P s 3 Focus An Online Search Approach to POMDPs In a sense ... To decide is to foresee. To see is to focus. When applying P O M D P s as a computational model for designing a ratio-nal agent to perform process-oriented tasks, the major difficulties of solv-ing P O M D P s are due to the continuously infinite belief state space and the infinite horizon. To avoid losing our focus in the infinity, we adopt an online search-oriented heuristic approach to our structured exploration of P O M D P s , in which we employ our seeing experience as a metaphor to visualize the decision processes in the mind's eyes. 29 3. FOCUS 30 3.1 Bounded Scope When we define the current belief state as a discrete probability distribu-tion over the finite set of states, the space of the belief states is continuously infinite. Explicit representation of the value function then becomes impos-sible. One breakthrough technique to approach this problem has been the result from SMALLWOOD-SONDIK[31], in which the value function is approx-imated deliberately well by a piecewise-linear and convex function. While this method inherits the linear representation and numerical computation of the classical model, another practical alternative is to perform online search instead of precomputing and storing the whole value function. Starting with a particular belief state, there are only finite number of reachable belief states in the projected belief space with finite horizon. And in the case of infinite horizon, there are countably infinite number of those reachable belief states. This can be compared to the limited scope of our eyesight. With this bounded scope of discrete vision, we are projecting from the current belief state into the possible future belief states, and estimating the expected value of taking an action at the moment. 3.1.1 Foresight: Belief Projection When we consider possibilities, we imagine. We foresee. We visualize. Es-sentially, we project our current belief into the future. More concretely, this belief projection can be viewed as a decision search tree rooted at the current 3. FOCUS 31 belief state and branching out from decision nodes and observation nodes alternately. The current belief state can then be compared to the viewpoint as we focus on a distant object. Decision Points A decision node in time, or decision point,1 is the point in time at which the decision has to be ready for implementing one of the alternative actions, including, possibly, doing nothing. Since a decision can be made well in advance before the action with planning, decision points actually represent the decision deadlines, but not the time at which the decision must be made. A decision node holds the updated belief state and branches out to the next possible actions. The initial decision point, from which the current belief is being projected, is the origin of the projection, or the viewpoint, whereas the most distant nodes merge into the horizon. 
Every 'decision node has one incoming observation and \A\ possible outgoing actions, except the origin does not requires an input observation, and the terminal nodes on the horizon have the output actions omitted. Observation Points An observation node in time, or observation point, is the point in time at which observation is being made. It comes right after an action has been taken and branches out to possible consequent observations. Without ex-1 We use node when we refer to the data structure primarily whereas point refers to a point in time. 3. FOCUS 32 Figure 3.1: Reachable Belief States as a Decision Tree mapped on Time ception, every observation point has one incoming action and at most \0\ possible outgoing observations.2 The alternate layers of decision points and observation points conform to a decision tree structure, which captures the bounded scope of the belief state space. Figure 3.1 shows an upright deci-sion tree corresponding to a decision process with 2 actions, 2 observations and 2 decision stages to go. The progressive pattern resembles a viewing perspective. 2 Technically, we need to include the feasible observations that have nonzero probability. Chances are the number of those observations is less than \0\ in many cases. 3. FOCUS 33 3.1.2 Hindsight: Backward Induction When we see things, we see the reflection of the light. When we make a decision, we do not merely project our belief into the future situations, but also reflect on the value in each of the possible final states back to the current states with a simulated hindsight. This backward process of evaluating the value of the current state is referred as backward induction by PUTERMAN[27]. In fact, the recursive definition introduced in Equation 2.2, repeated here, { R(b) < k = 0 R(b) + max { ]T P[o\b,a] Vk-\B[b,a,o]) } < k > 0 (3.1) Vk(b) captures the backward induction for a P O M D P with a finite horizon. 3.2 Infinite Horizon Even if we restrict our focus on a bounded scope of belief states, we are still challenged by an optimization problem requiring infinite computation when backward induction is applied to infinite horizon directly. When perfect optimization is out of sight, approximation techniques are developed to search for the decent or near-optimal solutions. Here we turn our attention to three of such ideas that can work together, namely discounting, focusing and pruning. 3. FOCUS 34 3.2.1 Discounting Discounting is to count less on the more distant rewards. It is a simple and intuitive technique well used to solve F O M D P s , Fully Obserable Markov Decision Processes, with infinite horizon, in which the value function is de-fined on finite state space as the resulting state is completely determined after each action.3 It was also an essential technique that SONDIK[33] incor-porated into his implicit representation of the belief state value function to solve P O M D P s with infinite horizon. Let 5 be the discount factor, where 0 < 5 < 1. This is the constant rate at which the future reward is discounted for each backup step in the backward induction. Typically, this discount factor is close to 1, e.g. 0.99 or 0.999. With the discount factor, the value function becomes f R(b) < k = 0 R(b) + 5 max { £ P[o\b,a] Vk-\B[b,a,o]) } < k > 0 (3.2) Vk(b) = The effect of the discount factor is that the expected total rewards for each action sequence would converge as the belief states are projected further and further. 
That means we do not need to project the belief state indefinitely before we can compare the values of different action paths. The discount rate is an adjustable parameter. The computation effort and solution quality can be tuned by varying this factor deliberately. A 3 W e shall revisit F O M D P s in more details in the Section 3.3. 3. FOCUS 35 discount factor close to 1 makes a farsighted but slow decision agent while a quick but shortsighed agent is built with a smaller discount factor. When interpreted visually, the effect of 5 corresponds to the vanishing pattern in a perspective diagram. Although the motivation of introducing the discount factor is intuitive, it is not clear how we should choose an appropriate value for 5. Finding the optimal discounting rate indeed is part of the decision problem. 3.2.2 Focusing P O M D P s with finite horizon can potentially be solved because the value of each belief state on the horizon is determined by the reward function immediately. Since there is no further action will be taken over there, we do not consider the future values beyond the horizon. However, while focusing on a finite depth, we would still like to get a gross picture of the future heading to infinity. In the finite horizon case, the reward function sets the base values of the recursive value function. If we can define the base values with a heuristic function that takes the future value beyond the finite depth into consideration, then we might approximate the infinite horizon value function more closely while cutting off the infinite computation sitting beyond our focus. A direct approach to obtain the base values is to use F O M D P s with infinite horizon as the heuristic. Symbolically, we are exploring the idea P O M D P 0 0 « P O M D P * + F O M D P 0 0 3. FOCUS where 36 P O M D P 0 0 :: P O M D P with infinite horizon P O M D P f c :: P O M D P with finite ^ -horizon F O M D P c F O M D P with infinite horizon Now the horizon defines the depth of view for the decision process and the values beyond the horizon are vaguely estimated by F O M D P 0 0 . To adopt this approximation, a slight refinement of the value function gives f £ b(i) V*(i) < k = 0 Vk{b) = R(b) + 5 max { £ P[o\b,a] Vk-\B[b,a,o}) } < k > 0 (3.3) where V*(i) is the optimal value function, defined on state, for the F O M D P 0 0 reduced from the target P O M D P 0 0 to be approximated. Just like the dis-count factor, k is also an adjustable parameter to manipulate the tradeoff between the solution quality and the computational complexity. In the ex-treme case, F O M D P s are crude approximation of P O M D P s . 3.2.3 Pruning While possibilities could be numerous, productive ones might be few. It would be wise to spend more time on exploring the better alternatives than the worse. Better still, the really bad choices should be ignored as soon as possible. The question here is how we can choose the better from the worse 3. FOCUS 37 and eliminate the really bad without considering every possibility towards the end. An answer to this is to use F O M D P s as a heuristic for pruning the decision search tree of the belief state space. Since F O M D P s assume complete observation, the maximum total expected reward evaluated by a F O M D P overestimates the one evaluated by the P O M D P modeling the same world. We will elaborate and exploit these ideas in Chapter 5. 
Since both focusing and pruning employ F O M D P s as a heuristic, an intuitive understanding on the traditional F O M D P formulation will be helpful in the subsequent discussions. 3.3 FOMDPs F O M D P s , Fully Observable Markov Decision Processes, were developed in operations research and have deep root in mathematics and statistics. The foundation work was due to BELLMAN[2], who introduced the dynamic programming approach to solve sequential decision problems modeled by F O M D P s , or simply M D P s in the context where complete observability is assumed. Since we have introduced the constructs of P O M D P s in Chapter 2, here we present F O M D P s as specialized models of P O M D P s . 3.3.1 Representation A F O M D P problem can be formulated as: 3. FOCUS 38 Given a world model ( S, A,T, R ) compute the policy function IT, which depends on the value function V implicitly. where S :: finite set of states A :: finite set of actions T :: transition function R :: reward function V :: value function (on states) n :: policy function (on states) Since we assume complete observability, i.e., the current state can be deter-mined with certainty, we have a more simplified model with F O M D P s . The value function and policy function required now is defined on a finite set of states, as opposed to the infinite set of belief states with P O M D P s . Assume the component functions are represented by vectors and matrices explicitly. Specifically, T is represented by \A\ matrices of size x \S\, in which each element specifies the transition probability of taking action a € A in state i 6 <S and resulted in state j G S, that is T(i,a,j). R is simply a |»S|-vector of real numbers specifying immediate rewards. V is an |<S|-vector of real numbers specifying maximum total expected reward over a defined horizon, finite or infinite. IT is a |5|-vectors of indices identifying the decided 3. FOCUS 39 actions for each state i 6 S. 3.3.2 Computation With the explicit representation, there are two main streams of approaches to solving F O M D P s , namely Value Iteration and Policy Iteration. We out-line the basic algorithms for the infinite horizon case with discounting in the following subsections. Corresponding methods for solving finite horizon F O M D P s follows the same structure. Value Iteration Value Iteration is a direct application of the following equation: ' R(i) < k = 0 R(i) + 5 max { £ T(i,a,j) Vk~l{j) } < k > 0 (3.4) where 0 < 8 < 1 is the discount factor forcing the value function to converge. Algori thm 3.1 Value Iteration 1. Initialize k = 0 and Vk = R. 2. Iterate on k until Vk ~ Vk~l (Apply Equation 3.4)-3. Return V* = Vk. 3. FOCUS 40 Commonly used stopping criteria for comparing Vk with Vk~l includes root mean square(rms) error and span difference.4 Once we obtain the optimal value function V*, we can construct the corresponding optimal policy by IT(i) = argmax { £ T(i,a,j) V*(j) } (3.5) a J Policy Iteration Under the F O M D P model, if the optimal policy function is the only final result that we want to obtain, we may take a slightly different view on the value function. Instead of using the value function as a measure of maximum total expected rewards, we can define the value function as a measure of total expected reward given a particular policy. Given a policy II, the value function with respect to II can be obtained by solving the linear system Vn(i) = R(i) + 5 £ T(i,a,j)Vu(j) (3.6) j directly. 
Or we can apply successive approximation similar to Value Iteration: f R(i) < k = 0 R(i) + 5 £ T(i, n(i),j) v i? - 1 ^)} < k > 0 (3.7) Vnfc(») = until Vk ~ Vk~l. Algorithm 3.2 Policy Iteration 4 Given vector v, span(w) = maxt>[i] —.minu[i]. 3. FOCUS 41 1. Initialize n = 0 and IP with Un(i) = argmaxa { Ylj T(i,a,j) R(j) }. 2. Iterate on n (a) Determine Vn« (Apply Equation 3.6 or 3.7). (b) Improve II if3(i € S) such that maxa { Zj T{i,a,j) Vh»0') } > E ; rP(;),j) Vh»0')-5. i2e*«rn H * = nn. Usually, Policy Iteration converges faster than Value Iteration because a policy function update in Policy Iteration is usually compared to a series of value function updates in Value Iteration. 3.4 Online Search Integrating the ideas dealing with the infinite belief states and infinite horizon problems, we employ an online search approach to approximate P O M D P s with long finite horizon or infinite horizon. As opposed to dynamic programming and linear programming approach, in which the value function or the policy function are usually fully precom-puted and installed in the decision agent, an online search approach would search for the best action with respect to the current belief state only. When discounting, focusing and pruning are employed, the search space becomes fi-nite and reduced. Given the current belief state, we are searching through the 3. FOCUS 42 projected decision search tree of the reachable belief states with a bounded depth, at which the future values are estimated by heuristics. While discount-ing future rewards ensures that the value function will converge, pruning can be applied at each level to reduce the search space dynamically. We will de-velop a heuristic online search algorithm to implement this idea in Chapter 5, where approximation methods building on the structured representation and the online search approach are discussed. The structure representation is presented in the next chapter in detail. 3.5 Related Work The focus of this chapter is to introduce a heuristic online search approach to approximate the optimal value function of P O M D P s with long finite horizon or infinite horizon. In particular, we apply F O M D P s as a heuristic, which is inspired by recent research in integrating Control Theory and A l techniques for Decision-Theoretic Planning based on F O M D P s . F O M D P s are more often referred as M D P s in the literature, where fully observability is assumed. The idea of applying online search using heuristic function for pruning was developed by DEARDEN-BOUTILIER[17], in which M D P s with large state space was the subject. A recent comparison and integration of A l techniques and Control Theory on real-time planning and learning with dy-namic programming are discussed in BARTO-BRADTKE-SINGH[1]. DEAN-3. FOCUS 43 WELLMAN[16] gives much of the background of bringing Control Theory to A l planning. For the fundamental results of M D P s with dynamic program-ming, PUTERMAN'S textbook[28] on M D P s has a comprehensive consolida-tion while RUSSELL-NORVIG[29, ch.17] provides an excellent intuitive intro-duction. Pioneer works in the field have been due to BELLMAN[2] and BERT-SEKAS[3]. DEAN ET AL[14, 15] contributes the work on time-critical issues in computing M D P s with a decision deadline. Further interesting inspira-tional works include DEAN-BODDY'S Anytime Algorithms[13] and RUSSELL-WEFALD'S Decision-Theoretic Metareasoning[30]. 3.6 Summary By drawing insights from our seeing experience, P O M D P s can be inter-preted visually. 
The vision metaphors not only help us to visualize the de-cision processes, but also guides us to an intuitive approach to simplify the infinite belief states and infinite horizon problems. Firstly, we restrict our attention to a bounded scope to reduce the contin-uously infinite belief state space into a discrete search space consisting of all the reachable belief states with respect to the current belief state. Secondly, we cut off countably infinite number of distant belief states beyond a finite depth of the decision search tree by applying heuristic value estimation based on F O M D P s . Finally, we adopt an online search approach with best-first search strategy and dynamic pruning of the seemingly unproductive branches 3. FOCUS 44 in the decision search tree to compute the best action for the current belief state. 4 Structure In Search of Structured Abstraction for POMDPs In a sense ... To know is to believe. To understand is to structure. In the usual linear representation of P O M D P s , using vectors and matri-ces, the state space is an enumeration of unrelated states. It captures no structure information of the domain. Even if the search-based approach is adopted, there are way too many states in the search space. Only domains with small number of states and horizon seem to be tractable, preventing the use of P O M D P methods for many practical applications. However, sys-tem states are naturally identified by a number of variables, whose values 45 4. STRUCTURE 46 reflect the preconditons and postconditions of the executed actions dynami-cally. For states sharing the same causal factors under the same action, they would share similar outcomes. From this intuition, we would like to design a structured representation for P O M D P s , not only for natural and compact specification, but also for computational leverage. The idea starts with tran-scending from the state level to a higher abstraction level, at which states are generalized into contexts. 4.1 Structured Representation In an unstructured state space, a state is identified by an arbitrarily assigned index. But more meaningfully, a state can be identified by a set of variable assignments. Each assignment represents a domain feature possessed by the state and describes one aspect of a possible situation. To specify the preconditions and postconditions of the action to be performed, we need to identify the assignments to the involved variables only while leaving the irrelevant variables unspecified. This action representation, where irrelevant variables are not mentioned, is typically found in STRIPS[19][20]. 4.1.1 Contexts as Abstract States While a state is a complete set of variable assignments, a context is a subset of that. It serves as a more general term to describe system dynamics. Instead of prescribing each state transition explicitly, it is more concise to say that an action taken in one context would result in another context. 4. STRUCTURE 47 In a sense, a context is an abstract state, or an aggregation of states shar-ing common variable assignments. In the special case of having all the domain variables assigned, the context specifies a state. On the other extreme, a null context with all the variables unbound refers to all states. By using context as a generalized unit, we can compactly specify the P O M D P components, namely the belief state, the reward function, the tran-sition function and the observation function. In particular, we can structure them as decision trees. 
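As a minimal illustration of this distinction (the dictionaries and the helper below are expository assumptions only, and the formal notation follows), a state can be coded as a complete truth assignment over the domain variables and a context as a partial one, so that falling under a context reduces to a subset test.

    # One concrete state assigns every variable; a context binds only the relevant ones.
    state   = {'A': True, 'B': False, 'C': True}
    context = {'A': True, 'B': False}          # the abstract state A+ B-

    def falls_under(state, context):
        # True if every variable bound in the context has the same value in the state.
        return all(state.get(v) == val for v, val in context.items())

    print(falls_under(state, context))         # True: the state is a member of the context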
For simplicity, we assume that the domain problem is specified by propositions, i.e., boolean variables. This allows us to focus on binary trees. Definition 4.1 Let X be a set of boolean variables with possible values in {true, false}, or { 0 , 1 } . A context of domain X is a truth assignment to a subset o /X. [A state is a truth assignment to all the variables in X. / g Notation 4.1 Let X be a propositional domain and X e X denotes a generic variable in X. We write X+ for X assigned true and X~ for X assigned false.1 Let c be a context of X. We write c f= X+, read as c sub-sumes X+, and c (= X~, read as c subsumes X~, for X is assigned true and X is assigned false in c respectively. To refer to assignment inclusion, we write c + X+ for cU{X+} and c + X~ for cU{X~}, whereasc — X refers to the exclusion of the assignment to X regardless of its truth value. The size 1 Traditionally, the truth assignment to a boolean variable X is written as X and ->X, or X and X. The intention of using X+ and X~ here is to distinguish the variable X and the true-assignment to X. 4. STRUCTURE 48 of c, or the number of variables being assigned in c, is referred by \c\, whereas the size o/X is referred by |X|. g Example 4.1 Suppose X = {A, B,C} and c = {A+,B~}, or A+B~ for short. Then c \= A+ and c f= B~ with |c| = 2. The assignment inclusion, c + C+ gives A+B~C+, while the assignment exclusion c — B = {A+}. g 4.1.2 Binary Decision Trees All the P O M D P components are functions of states in various form. Binary Decision Trees (BDTs) are structured representations of boolean functions. The marriage of the two is that a B D T can be used to partition the state space into complementary clusters of states, in each of which the member states share a common context. For our purpose, each non-leaf node in a B D T is associated with a boolean variable and each branch from the node represents a possible assignment. We adopt a convention that true branches on the left and false branches on the right. As a path from the root to a leaf represents a particular context, various kinds of information are installed at the leaf nodes for representing different P O M D P components. Definition 4.2 A B D T , Binary Decision Tree,2 is either a leaf node labeled with an associated value, or an internal node labeled by a boolean variable X 2 B D T has a more generic notion in Machine Learning. Here we define and refer to a special instance of B D T as a structured representation for P O M D P s . 4. STRUCTURE 49 along with a left subtree denoting X+ and a right subtree denoting X~. Both of the left and right subtrees are BDTs . In a B D T , a node with no subtree is a leal while the only node that is not a subtree of others is the root. The non-leaf nodes, including the root, are called internal nodes. g Notation 4.2 Let b be a B D T . When b is a leaf, we write value(b) for the value associated to the single leaf tree b. When b is rooted by an internal node, the variable labeling b is referred by X\>. We write b^ and 6# for the left subtree and right subtree ofb respectively. The parent node ofb is referred by bP. a Definition 4.3 Let n be a node in a B D T b. The prefix context of n, denoted by prefix(n), is the context corresponding to the truth assignment following the path from the root to n. pj Definition 4.4 Let n be a node in a B D T b. The depth of n inb is defined recursively as: where np is the parent node of n. When n is a most distant leaf to the root, depth(n) is the depth of the B D T b. 
• Proposition 4.1 Let n be a node in a B D T b and cn be the prefix context 0 depth(np) + 1 if b is the root otherwise of n. Then depth(n) = c , 4. STRUCTURE 50 Definition 4.5 Let n be a node in a B D T b for a propositional domain X. The cardinality of n, denoted by card(n), is the number of states that are consistent with prefix(n), the prefix context of n. We call a state falling under n a member state of n. g Proposition 4.2 Let n be a node in a B D T b for a propositional domain X and c^ be the prefix context of n. Then cardin) = 2l xl _lC nL B Definition 4.6 The size of a B D T b is defined recursively as: where br, and bR are the left and right subtrees of b respectively. That is the number of leaf nodes in b. g Proposition 4.3 Let b be a B D T and d be the depth of b. Then the max-imum total number of nodes in b is 2d+1 — 1, and the maximum number of leaf nodes, \b\, is 2d. | Proposition 4.4 Let d be the depth and I be the number of leaf nodes in a B D T b. Then the maximum total number of nodes in b is 21 — 1. g Example 4.2 Figure 4-1 illustrates the diagramatic representation and the basic properties of a BDT for the propositional domain X = {A, B,C}. g if b is a leaf otherwise 4. STRUCTURE 51 leaf node • abstract state (context) • concrete state Figure 4.1: Diagramatic Representation of Binary Decision Trees (BDTs) 4. STRUCTURE 52 [2 states] .24 .05 [4 states] 2(.24) + 2(.16) + 4(.05) = 1.0 Figure 4.2: A b-Tree for X = {A, B, C} 4.1.3 Belief State as b-Tree Recall that a belief state is a discrete probability distribution over states. To represent a belief state, probabilities are installed at the leaf nodes of a B D T . In particular, we store the per state probability at each leaf node. That is the probability for a member state of the leaf being the actual state. Such a B D T represents a set of of mutually exclusive and collectively exhaustive contexts covering the whole state space. We call this B D T a b-Tree. Formally, given a b-Tree, it determines a belief state as follows: P[s] = I, where I labels the unique leaf node whose prefix context subsumes state s. Example 4.3 Figure 4-2 shows an example b-Tree for X = {A, B,C}. • 4. STRUCTURE 53 3.0 -2.0 Figure 4.3: An example R-Tree for X = {A, B, C} 4.1.4 Reward Function as R-Tree To specify the reward function, installed at each leaf node of a B D T is the immediate reward obtained for being in a member state of the leaf. We call this an R-Tree. Example 4.4 Figure 4-3 gives an specification for the reward function of the domain X = {A,B,C} with an R-Tree. It categorizes all states into three different desirabilities: the states with A+B+ get 3 regardless of the value of C; the states with A~B+ get 1 regardless of the value of C; and the states with B~ get —2 regardless of the values of A and C. g 4. STRUCTURE 54 4.1.5 Transition Function as T-Trees To specify the transition function, we use one 2-stage Temporal Bayesian Network (2TBN) for each action. A 2 T B N is a Bayesian Network with two sets of variables, one for the pre-action stage, 'and the other for the post-action stage. The two sets of variables are called precondition variables and postcondition variables as they are used to specify the preconditions and postconditions respectively. 
We assume that links are only viable from a precondition variable to a postcondition variable, which indicate conditional dependence.3 Associated with each postcondition variable, there is one Con-ditional Probability Table (CPT), By using a B D T , which we call T-Tree, to represent each of the CPTs, we are able to capture additional dependence or independence based on contexts, which is referred as context-specific in-dependence^), or contextual independence. The semantics of a T-Tree for the postcondition variable X G X given action a G A is as follows. Each internal node represents a precondition variable that influences X under action a. Again, left branches indicate t r u e -assignments and right branches signify false-assignments. The probability installed at each leaf node specifies the probability that X becomes or remains t r u e when the action is taken in the prefix context of the leaf. For a full specification of the transition function, we then require \A\ x | X | T-Trees. The size of each T-Tree is bounded by 2fc, where k is the number of parents, 3 A more sophisticated formulation may allow arcs between two postcondition variables. 4. STRUCTURE 55 or conditions, that influence variable X. Notation 4.3 We refer to a T-Tree for variable I e X given action a G A as T-Tree[X\a] and the probability installed at the leaf node I as P[X+\ci,a], where Ci is the prefix context of I, i.e., Ci = prefix(l). g Example 4.5 Figure 4-4 shows the evolution of T-Tree[B \ a] for the domain X = {A,B,C} from the state-based representation through the traditional C P T to a B D T specification with the 2TBN framework. With T-Tree[B | a], we can interpret the effect on B under action a structurally. When B was true before the action has taken, it remains true under the action with certainty. When B was false, it would become true with probability 0.9 or remain false with probability 0.1, provided A was true. When both A and B were false, B remains false with certainty. This structured interpretation of the transition model is the kind of domain structure that we would like to exploit in our B D T representation. And we believe that it occurs naturally in many real-world domains. g 4.1.6 Observation Function as O-Trees We assume that the observation to be obtained depends on the preconditions and the action only. For example, pushing a door (action) to see (observation) if it is locked (precondition). Then we can add one observation variable O to the set of postcondition variables in each 2TBN that we use to specify the transition function. However, we do not restrict O to be boolean. For each 4. STRUCTURE 56 S t a t e - b a s e d Spec i f i ca t ion for B 5 A B C P[B\s, a] 0 T T T 1.0 1 T T F 1.0 2 T F T 0.9 3 T F F 0.9 4 F T T 1.0 5 F T F 1.0 6 F F T 0.0 7 F F F 0.0 8 ent r ies Figure 4.4: T-Tree[B\a] for X = {A, B, C} in Evolution 4. STRUCTURE 57 action, we introduce one B D T , which we call O-Tree, in the 2TBN. That means we require \A\ O-Trees to specify the observation function. This time each leaf node does not contain a single value, but a vector representing the probability distribution over the set of observations O. Notation 4.4 We refer to a O-Tree for an action a £ A as 0-Tree[a]. When only a particular observation o £ O is concerned, we write 0-Tree[o\a] to denote the B D T conforming to 0-Tree[aJ with only the probabilities for observing o extracted. pj Example 4.6 Figure 4-5 depicts an 0-Tree[aJ for the domain X from the previous example. 
From 0-Tree[aJ, we can tell that when B was true, we would observe o = O[0] with probability 0.8 and o = 0[1] with probability 0.2 under action a. When B was false, we would observe O[0] and 0[1] with probability 0.1 and 0.9 respectively. pg 4.2 Structured Computation Recall that the value of being in a belief state with ft-horizon was recursively defined in Equation 2.2 as: f R(b) < k = 0 R(b) + max { £ P[o \ b, a] Vk-\B[b, a, o]) } < k > 0 o In our structured representation, this implies that the computation of Vh(b) requires computing R(b), P[o\b,a] and B[b, a,o] with BDTs structurally. Vk{b) = 4. STRUCTURE 58 4. STRUCTURE 59 4.2.1 Computing R(b) When the belief state and the reward function are represented by vectors, the expected immediate reward R(b) is the dot product of the two vectors. For computing R(b) with a b-Tree and the R-Tree for a domain, we develop an algorithm to perform the equivalent dot product of two BDTs. Given a b-Tree b and the R-Tree R, let us consider computing DotProduct(6, R) specifically. Graft-Prune-Evaluate To each leaf I in b, we graft one copy of the R-Tree and multiply all the reward values in the R-Tree copy by value(l), that is the probability installed at I. For there may be inconsistent or redundant branches resulted in the product tree, we then prune them all to keep the tree consistent and compact. The resulting tree can be compared to an expression tree in which each internal node is an average operation. Now we evaluate the tree and obtain the expected immediate reward per state.4 To get R(b), we multiply it by 2 ' x l , that is the number of states in the domain. Figure 4.6 illustrates the process pictorially. Note that the whole subtree is removed along with an inconsistent branch while the subtree needs to be reconnected to the point of dissection when a redundant branch is pruned. This Graft-Prune-Evaluate strategy provides us an intuitive understanding of the operations to define the following recursive implementation of the algorithm. 4 This is due to the per state representation of the b-Tree. 4. STRUCTURE 60 G i v e n X ={A,B,C} .24 .16 b-Tree b R-Tree R -2 G r a f t . . . A / ^ V A, Z \ ,24(-2) y \ -]6(-2) -05<-2) .24(3) .24(1) .16(3) .16(1) .05(3) .05(1) P r u n e ^ inconsistent branch f redundant branch A f \ A \ A / \ -24<-2> / \ G ^ y \ ^ 2 j ) ( -24(3) ) .24(7) .76fj; .76(7) .05(3) ( .05(1) ) E v a l u a t e 7.20 +(-.025)7/2 = .0<S75 W = .0875(8) = .7 1.24(3) + .16(-2)]/2 = .20 S *\[.05(1) + .05(-2)]/2 = -.025 .24(3) .16(-2) .05(1) .05(-2) Figure 4.6: The Graft-Prune-Evaluate process for computing R(b) 4. STRUCTURE 61 Concurrent Traversal While the Graft-Prune-Evaluate process seems to be an intuitive and direct approach to compute the dot product of two BDTs , it requires three kinds of tree traversals — one for traversing b to visit all the leaf nodes, one for traversing the grafted R-Tree copies for pruning inconsistent and redundant branches, and one for traversing the resulting tree for evaluating the expected reward per state. However, the three kinds of traversals need not be done separately in three phases. We can avoid revisiting the tree nodes more than necessary by traversing the two trees concurrently with recursion. We implement this idea in the following algorithm. Algorithm The complete algorithm, DotProduct that implements concurrent traversal is specified in Figure 4.7 along with the notations summarized in Figure 4.8. 
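Before walking through the pseudocode, a compact Python sketch of the same concurrent traversal may help. The class and function names are illustrative assumptions (the thesis prototype itself is written in PROLOG), and the prefix context is threaded through the recursion explicitly, since a subtree does not store its ancestors.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class BDT:
        # Leaf: value is set and var is None.  Internal: var plus left (true) / right (false).
        value: Optional[float] = None
        var: Optional[str] = None
        left: Optional['BDT'] = None
        right: Optional['BDT'] = None

        def is_leaf(self):
            return self.var is None

    def average(R, ctx):
        # Average of R's leaf values over the states consistent with context ctx.
        if R.is_leaf():
            return R.value
        if ctx.get(R.var) is True:
            return average(R.left, ctx)
        if ctx.get(R.var) is False:
            return average(R.right, ctx)
        return (average(R.left, ctx) + average(R.right, ctx)) / 2.0

    def dot_product(b, R, n_vars, ctx=None):
        # Expected immediate reward R(b) by concurrent traversal of b-Tree b and R-Tree R.
        ctx = ctx or {}
        if b.is_leaf():
            card = 2 ** (n_vars - len(ctx))              # member states of this leaf
            return card * b.value * average(R, ctx)
        if not R.is_leaf() and b.var == R.var:           # split both trees on the same variable
            return (dot_product(b.left, R.left, n_vars, {**ctx, b.var: True}) +
                    dot_product(b.right, R.right, n_vars, {**ctx, b.var: False}))
        if not R.is_leaf() and ctx.get(R.var) is True:   # prefix(b) already fixes R's root variable
            return dot_product(b, R.left, n_vars, ctx)
        if not R.is_leaf() and ctx.get(R.var) is False:
            return dot_product(b, R.right, n_vars, ctx)
        return (dot_product(b.left, R, n_vars, {**ctx, b.var: True}) +
                dot_product(b.right, R, n_vars, {**ctx, b.var: False}))

    # One possible encoding of the b-Tree of Figure 4.2 and the R-Tree of Figure 4.3:
    b = BDT(var='A',
            left=BDT(var='B', left=BDT(value=0.24), right=BDT(value=0.16)),
            right=BDT(value=0.05))
    R = BDT(var='B',
            left=BDT(var='A', left=BDT(value=3.0), right=BDT(value=1.0)),
            right=BDT(value=-2.0))
    print(dot_product(b, R, n_vars=3))       # 0.7, as in the worked example of Figure 4.6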
For tree and recursion are a perfect match, the algorithm looks surprisingly simple when defined recursively. In the base case, where we reach a leaf node of the given b-Tree, we take the probability value installed at the leaf, multiply it by the number of member states (that is, the cardinality of the leaf node), and the average reward determined by another recursive procedure Average. The result is the product of the probability and the reward with respect to the prefix context of the leaf. Al l Average does is to collect the appropriate rewards 4. STRUCTURE 62 subsumed by the given context and average them recursively.5 In the case of Xf, = XR, we prune the inconsistent right branch along with its subtree in R and the redundant left branch for the left subtree of b by invoking DotProduct(&£,, RL)- Likewise, DotProduct(6^, RR) prunes the inconsistent left branch and the redundant right branch for bR. When prefix (b) (= X^, we discover that the right branch is inconsis-tent and the left branch is redundant for both bz and bR. Similarly, when prefix (b) \= XR, the left branch becomes inconsistent and the right branch becomes redundant for both &L and 6R. When no inconsistency or redundancy is found, we take the whole R-Tree for evaluating both bj, and bR in the next recursive calls. Note that the way we handle the per state notion of the B D T s is slightly different from the one in the Graft-Prune-Evaluate process. While in the Graft-Prune-Evaluate method we compute the dot product per state without worrying about determining the cardinality at each of the leaf nodes, the algorithm DotProduct determine the cardinality right at each leaf node without bothering to average the branch values at each internal node of the b-Tree and multiply the result by 2l xL Besides this, there should be no significant difference between the two interpretations of the B D T dot product. Figure 4.9 shows some snapshots of DotProduct with the b-Tree and 5 Note that Average assumes c is a context in which all states have the same probability. 4. STRUCTURE 63 R-Tree given in Figure 4.6. Analysis Let db and dR be the maximum depths of. the given b-Tree b and R-Tree R respectively. More intuitively, db and dR denote the numbers of distinct variables appear in the longest paths from the root to the maximum depths of b and R respectively. By Proposition 4.3, we have |b| = 2db and \R\ - 2dR. In the worst case, when b and R share no common variables, the size of the product tree of b and R is 0(2db+dR). However, it will almost always be the case that they would share some common variables in a natural domain. Let the number of common variables be dc. Then the size of the product tree becomes 0(2db+dR~dc), with 0{2dc) inconsistent and redundant branches pruned. Proposition 4.5 Let L and R be two B D T s and d^, and du be their depths respectively. Suppose there are dc common variables occuring in L and R. Then the complexity for performing a concurrent traversal on L with R is 0(2dL+dR~d°} n When comparing with the complexity of the counterpart dot product with vectors, which is of 0(2^), we would require db + dR — dc < |X| to pay off the B D T representation. However, in domains with rich structural regular-ities, we would get a better chance to have <4 <C |X| and dR <C |X|, and a signifcant dc. The depths db and dR essentially capture the description 4. STRUCTURE 64 Algorithm • A:2 DotProduct (6, R) :: Evaluate the dot product of B D T b and B D T R. 
if IsLeaf(b) then ;; evaluate the expected reward with respect to b return card(b) x value(b) x Average(i?, prefix(b)) else if (XB = XR) then ;; prune the inconsistent/redundant branches return DotProduct(bL, RL) + DotProduct(bR, RR) else if (prefix(b) \= XR) then ;; prune the inconsistent right branch return DotProduct(6, RL) else if (prefix(b) \= XR) then ;; prune the inconsistent left branch return DotProduct (b, RR) else ;; no inconsistent/redundant branch to prune return DotProduct(&t,, R) + DotProduct(bR, R) A . l Average(R, c) :: Evaluate the average value w.r.t. context c of B D T R. if IsLeaf(R) then return value(R) else if (c |= XR) then return Average(RL, c) else if (c |= XR) then return Average(RR, c) else return [Averagec) + Average(i?R, c)] / 2 Figure 4.7: Algorithm DotProduct 4. STRUCTURE 65 Notation • assignment X+ variable X is assigned true. X~ variable X is assigned false. context c (= X+ context c subsumes assignment X+. c\= X~ context c subsumes assignment X~. B D T IsLeaf(b) true if B D T b is a leaf. value(b) value associated to B D T b, where b is a leaf. Xb variable associated to the root of B D T b. bL left subtree of B D T b. bR right subtree of B D T b. card(b) cardinality of B D T b. prefix(b) prefix context of B D T b. Figure 4.8: Notations in Algorithm DotProduct lengths for representing the belief state and the reward function with the B D T representation respectively. The smaller the values for both rib and dR, the more the conditional independence can be exploited. In addition, there are many cases that we do not need the full depths for every leaf node in b and R, where contextual independence is another kind of structural property of the domain that we can exploit. In terms of the sizes of the b-tree and R-Tree, or the numbers of leaf nodes, the best case is that one tree subsumes the other, in which we have 0(max{ \R\ }) for DotProduct(b,R), where |6| and \R\ are the sizes of the b-Tree and R-Tree respectively. The worst case however is 0(\b\ • \R\) when the b-tree and R-Tree share no Common context. In general, the 4. STRUCTURE 66 Figure 4.9: The Concurrent Traversal process for computing R(b) 4. STRUCTURE 67 complexity is domain dependent. Proposition 4.6 Let L and R be two BDTs and \L\, and \R\ be their number of leaf nodes respectively. Then the complexity 0(Traverse(L, R)) for performing a concurrent traversal on L with R is (9(max{ \L\, \R\ }) < 0{Traverse(L,R)) < 0(\L\ • \R\). u 4.2.2 Computing P[o \ b, a] From Equation 2.4, P[o\b,a] = YI 6(») 0{i,a,o) i is the dot product of the belief state vector and the vector obtained by se-lecting action a and observation o in the observation function. The method for computing P[o\b, a] is essentially the same as the one used for com-puting R(b). In particular, P[o \ b, a] = DotProduct(6, 0-Tree[o|a]), where O-Tree [o | a] denotes the B D T that conforms to O-Tree [a] with the probabili-ties for observation o at its leaf nodes selected. The complexity for computing P[o | b, a] hence follows the complexity of the Algorithm DotProduct given previously. 4.2.3 Computing B[6,a,o] Given the current belief state 6, the performed action a and the obtained observation o, B[b,a,o] updates the belief state according to the predefined 4. STRUCTURE 68 transition function and observation function, which are specified by the T-Trees and O-Trees in our structured representation. 
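For contrast, the statewise computation that the structured algorithm is meant to avoid can be written directly over the vector and matrix representation. The sketch below is that flat baseline (the array names and shapes are assumptions), with the observation depending on the pre-action state and the action, as assumed in Section 4.1.6.

    import numpy as np

    def observation_prob(b, O, a, o):
        # P[o | b, a] = sum_i b(i) O(i, a, o)   (Equation 2.4)
        return float(b @ O[:, a, o])

    def belief_update(b, T, O, a, o):
        # Statewise Bayes' Rule: b'(j) is proportional to sum_i b(i) T(i, a, j) O(i, a, o),
        # where b is a probability vector over states and T[i, a, j], O[i, a, o] are dense arrays.
        unnormalized = (b * O[:, a, o]) @ T[:, a, :]
        return unnormalized / unnormalized.sum()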
As oppose to the lin-ear representation, which computes B[b,a,o] by applying Bayes' Rule in a statewise fashion, here we develop the counterpart algorithm that updates the b-Tree structurally. Precomputing BT-Trees While each of the T-Trees specifies the effect on each variable under a par-ticular action, we may merge all the |X| T-Trees to summarize the overall effect. We call this summary tree a Belief Transition Tree, or BT-Tree. We write BT-Tree[a] to denote a BT-Tree for action a e A. To generate a compact BT-Tree, a general domain property that we can exploit is persistence. A variable is persistent under action a if its value remains the same under a with certainty. In a T-Tree for variable X, persis-tence is observed by having the probability 1.0 under the true-branch from a internal node labeled X, and the probability 0.0 under the false-branch. As persistence occurs quite often, it would be a powerful domain property that we would like to exploit in our structured belief updates. Definition 4.7 Variable X is persistent under action a if for all leaf nodes I in T-Tree[X\a], {prefix{l) (= X+) (value(l) = 1.0) and [prefix(l) (= X~) => (value(l) = 0.0), where value(l) is the probability labeling I. g To capture persistence, a BT-Tree can be implemented as a B D T with 4. STRUCTURE 69 each of its leaf nodes installed a set of (X,p) pairs, where X is a nonpersistent variable with probability p to be assigned true in the postcondition. Notation 4.5 Let I be a leaf in a BT-Tree. We refer to the set of (X, p) pairs installed at I as effect(l). We write en for effect(l) = 0. The inclusion and exclusion of a (X,p) pair are denoted by e + (X,p) and e— (X,p) respectively, where e is a generic set of (X, p) pairs. g In essence, the BT-Tree represents effects in a manner somewhere in be-tween a conditional STRIPS representation and the Bayes Net representa-tion. As in STRIPS, the effects of the action under any condition are speci-fied together rather than distributed across different nodes as in a Bayes Net. But unlike STRIPS, BT-Tree still exploit independence of action effects by specifying the probability of individual effects separately. The idea is to con-vert the 2 T B N for each action into a BT-Tree which will be easier to use in the structural belief update process. To obtain the BT-Tree for a particular action, we initialize a BT-Tree as a leaf node labeled with e0 (the empty effect) and merge it with the T-Trees incrementally. Figure 4.11 gives the algorithm for merging a T-Tree with the evolving BT-Tree, which closely follows the concurrent traversal strategy used in DotProduct. Figure 4.10 summarizes the notations and Figure 4.12 shows an example BT-Tree generated from its 2 T B N specification of T-Trees. 4. STRUCTURE 70 Notation • assignment X+ variable X is assigned t r u e . X~ variable X is assigned f a l s e . context c j= X+ context c subsumes assignment X+. c j= X~ context c subsumes assignment X~. B D T IsLeafiT) t r u e if B D T T is a leaf. XT variable at the root of B D T T. TL left subtree of B D T T. TR right subtree of B D T T. prefix(T) prefix context of B D T T. leaf(v) leaf node with value v. node(X, L, R) internal node labeled X with left subtree L and right subtree R. T-Tree prob(Tx) probability at single leaf T-Tree Tx. persistent(c, X,p) t r u e if c |= X+ and p = 1.0, or c (= X~ and p = 0.0, where c is the prefix context of a leaf node in the T-Tree for variable X, and p is the probability installed at the leaf. BT-Tree eo empty set of {X,p) pairs. 
effect(T*) set of (X,p) pairs at a single leaf BT-Tree T*. Figure 4.10: Notations in Algorithm Merge 4. STRUCTURE 71 Algorithm • A.3 Merge(T*, Tx) :: Merge T-Tree Tx with the (evolving) BT-Tree T*. if IsLeaf(T*) then return SubMerge(pre/i2:(T*), effect(T*),Tx) else if (XT* = XTx) then return node(XT*, Merge(T£,T£), Merge(T£,T$)) else if (prefix(T*) (= X+ x ) then return Merge(T*,Tf) else if (prefix(T*) (= X ~ x ) then return Merge(T*,T£) else return node(XT., Merge(T£,Tx), Merge(TR,Tx)) SubMerge(c,e,Tx) :: Generate the subtree w.r.t. context c, effect e and T-Tree Tx. HIsLeaf(Tx) then if persistent(prefix(Tx), X, prob(Tx)) then return leaf(e) else return leaf(e+ (X, prob(Tx))) else if (c (= X+ x ) then return SubMerge(c, e, T*) else if (c (= X ^ x ) then return SubMerge(c, e, Tg) else if (SubMerge(c,e,Tf) = SubMerge(c,e,T^)) then return SubMerge(c, e, T*) else return node(XTx, SubMerge(c, e,Tx), SubMerge(c, e, Tjf)) Figure 4.11: Algorithm Merge 4. STRUCTURE 72 Figure 4.12: An example BT-Tree generated from the T-Trees 4. STRUCTURE 73 Updating b-Trees Let T* be the precomputed BT-Tree for action a, O be 0-Tree[o|a], and b be the current b-Tree. To update b with T* and O, we perform a product operation on b and O followed by a concurrent traversal on T* with the product tree to generate the updated b-Tree b'. The tree product, with a spirit much like the cross product, of b and O essentially captures the updated belief state based on the observation function and the obtained observation without considering the transition model.6 This is because the observation is a function of the original state. Let the product tree, or BO-Tree, be b x O. The algorithm for obtaining bxO from b and O is given in in Algorithm Product (Figure 4.13). Again, it employs the concurrent traversal strategy, but without the backup evaluation process. For simplicity, we use 0-Tree[o|a], with single value leaf nodes, in our algorithm description to refer to the probabilities relevant to observation o in 0-Tree[a], which has a probability distribution at each of its leaf nodes in general. Note that o G O is not restricted to boolean observations. Given the BT-Tree T* and the BO-Tree b x O, we can generate the up-dated b-tree b' by applying yet another concurrent traversal on bx O with T*. The forward process of the recursion generates partial trees that correspond to the possible postconditions with their probabilities, which is equivalent 6 When the observation function is defined on the action and the resulting state, this tree product will be performed on b' and O instead. 4. STRUCTURE 74 Algorithm • A.6 Product(6, O) :: Generate the product tree of B D T b and B D T O. if IsLeaf(b) then return SubProduct(pre/w;(&), prob(b), O) else if (Xb = Xo) then return node(Xb, Product^, OL), Product(6#, OR)) else if (prefix(b) \= XQ) then return Product(b, OL) else if (prefix(b) f= XQ) then return Product (b, OR) else return node(Xb, Product^, O), Product(&#, 0) SubProduct(c, p, O) :: Generate from the O-Tree 0 the subtree w.r.t. context c :: with each leaf multiplied by probability p. if IsLeaf(0) then return leaf(p x prob(0)) else if (c |= A ^ ) then return SubProduct(c, p, OL) else if (c (= A ^ ) then return SubProduct(c, p, OR) else return node(SubProduct(c, p, OL), SubProduct(c, p, OR)) Figure 4.13: Algorithm Product 4. STRUCTURE 75 to the term b(i) T(i,a,j) 0(i,a,o) in the statewise Bayes' Rule. 
The back-ward process merges all the partial trees to form a summation tree with each of its leaves abstracts the term VJi T(i,a,j) 0(i,a,o), representing the unnormalized updated belief state. Finally, b' is obtained by perfoming a normalization on this summation tree. See Figure 4.14, Algorithm Update, in which the procedure Sum, defined in Figure 4.15, performs a summa-tion of two partial trees. The procedure for normalizing a b-Tree is given in Appendix A.5. Figure 4.16 illustrates the update process. First, we take the b-Tree b (the current belief state) and 0-Tree[o|a] (the observation model for observation o under action a) from the upper left corner to generate the BO-Tree on the right by Algorithm Product. Then we combine the BO-Tree with the precomputed BT-Tree[a] to form the unnormalized b-Tree b' (the updated belief), using Algorithm Update. The dotted arrow lines trace the update process in part for the context A+B~, which captures the states A+B~C+ and A+B~C~. With repsect to A+B~, the leaf value is .16(.l) (0.16 from the b-Tree b and 0.1 from 0-Tree[o|a]) in the BO-Tree, whereas the leaf value in the BT-Tree is the set {(A, 0.8), (B, 0.9)}, capturing the action effect. Considering this action effect with the leaf value with prefix context A+B~ in the BO-Tree, we obtain one of the partial trees specifying the possible postconditions, which is highlighted in the middle of the figure. The associated probability for each 4. STRUCTURE 76 leaf in the partial tree is computed by multiplying .16(.l) by the probabilities with respect to the prefix context of that leaf in {(A, 0.8), (.B,0.9)}, which implies {(A+, 0.8), (A~, 0.2), (B+, 0.9), (B~, 0.1)}. Summing, superimposing visually, all the four partial trees together produces the unnormalized b-Tree b'. Note that, since only the highlighted partial tree has a leaf with prefix context A+B~, the probability (.0013) for the leaf node with prefix context A+B~ in b' is solely contributed by the corresponding leaf, with probability .16(.l)(.8)(.l), in the highlighed partial tree. In general, it would be the sum of the probabilities associated with the same prefix context from multiple partial trees as in the other three leaf nodes. ^ Algorithm Given the transition function defined by a 2 T B N specification of T-Trees, the observation function defined by a collection of O-Trees, the current b-Tree b, the performed action a, and the obtained observation o, the overall algorithm for updating b structurally is summarized as follows. Algorithm 4.1 Structured (Bayesian) Belief Update 0. Precompute BT-Trees f Algorithm Merge,). 1. Select BT-Tree[a] and 0-Tree[o\aJ. 2. Compute BO-Tree b x O (Algorithm Product/ 3. Generate the unnormalized updated b-Tree f Algorithm Update/ 4. STRUCTURE 77 Algorithm • A.8 U p d a t e d x 0, T*) :: Generate the updated b-Tree with BT-Tree T* and BO-Tree 0 x 6 . if IsLeaf(b x 0) then return SubUpdate(pre/u;(& x O), prob(b x 0), T*) else if (X(bxo) = XT*) then return Sum(Update((b x 0)L, Tl), Update((6 x 0)R T*R)) else if (prefix(b x 0) \= X^,,) then return Update((6 x 0)Ll T*) else if (prefix(b x 0) \= X^*) then return Update((6 x 0)R, T*) else return Sum(Update((6 x 0)L, T*), Update((& x 0)R T*)) SubUpdate(c , p, T*) :: Extract from the BT-Tree T* the partial tree w.r.t. context c :: and probability p. 
if IsLeaf(T*) then if IsEmpty(effect(T*)) then if IsEmpty(c) then return leaf(p) else choose a variable X £ c if X+ then return node(X, SubUpdate (c - X, p, T*) leaf (0.0)) else [ X- ] return node(X, leaf (0.0), SubUpdate(c— X,p, T*)) else extract an (X, q) £ effect(T*) return node(X, SubUpdate (c — X, pq, T*), SubUpdate (c - X, p(l - q), T*)) else if (c |= Xjt„) then return SubUpdate(c , p, T£ ) else if (c |= X^T) then return SubUpdate(c , p, TR) else return Sum(SubUpdate (c + XT*, p, T £ ) , SubUpdate (c + X T * , p, TR) Figure 4.14: Algorithm Update 4. STRUCTURE 78 Algorithm • A.7 S u m ( T \ T 2 ) :: Generate the summation Tree of the partial trees T 1 and T 2 . if IsLeafiT1) then return SubSum(pre^a;(T 1), prob(Tl), T2) else if (XTi = XT2) then return TreeUnion(X T i , Sum(TJ, Tf), Sum(TR,TR)) else if (prefix (T 1) |= X+2) then return S u m f T 1 , Tf) else if (prefix(Tl) \= XZ2) then return S u m f T 1 , T | ) else return TreeUnion(X T i , Sum(T2, T 2 ) , Sum(T^, T 2)) SubSum(c, p, T) :: Generate the summation subtree w.r.t. partial subtree T, :: context c and probability p. \HsLeaf(T) then return leaf (p + prob(T)) else if (c |= X£) then return SubSum(c, p, T/j,) else if (c |= Xp) then return SubSum(c, p, TR) else return TreeUnion(Xx, SubSum(c, p, TL) SubSum(c, p, TR)) TreeUnion(X, L, R) :: Return a union tree of BDTs L and R, :: using variable X for merging if L ^ R. if (L = R) then return L else return node(X, L, R) Figure 4.15: Algorithm Sum 4. STRUCTURE 79 b-Tree b A / \ v 0.05 0.24 0.76 0-Tree[o\a] B 0.8 0.1 BO-Tree / A .24(.8) \j6(.l.r .05{.8) .05(.l) .16(.1)(.8)(.9) O" .16(.1)(.8)(.J) ~a~ .16(.1)(.2)(.9) ^~ .16(.1)(.2)(.J) -a-' (unnormalized) .1651 \0013J .0813 .0053 \ Figure 4.16: Updating b-Tree structurally 4. STRUCTURE 80 4- Normalize the updated b-Tree. where Step 0 can be precomputed once for all execution sessions. g Analysis When the transition model is stationary, i.e., the same transition function applies to every decision stage, the \A\ BT-Trees can be precomputed from the \A\ x |X| T-Trees. With the persistent branches pruned, each BT-Tree is potentially more compact than the corresponding |X| T-Trees. While the BT-Trees can be precomputed off-line, the complexity for up-dating the b-Tree with the selected BT-Tree and O-Tree with respect to action a and observation o would be more crucial in our online approach. Since all the procedures developed for performing the structured belief state updates adopt the concurrent traversal strategy, which exploits the struc-tural properties of the domain, it would be more appropriate to evaluate its performance for each domain individually and experimentally. However, a rough estimate from Proposition 4.6 implies that the concur-rent tree traversal operations developed for computing B[b,a,o] are between 0(max{ \L\, \R\ }) and 0( \L\ • \R\), where \L\ and \R\ are the numbers of leaves in the two trees involved in each operation. The sizes of the trees depends on how much structural regularity is available in the domain. The best case is when the contexts overlap completely and the worst case is when they share no variables. Naturally, those common contexts would occur and provide a great computational advantage. 4. STRUCTURE 81 4.2.4 Computing Vk(b) Armed with the algorithms for computing R(b), P[o\b,a] and B[b,a,o] structurally, Vk(b) can be computed recursively with Equation 2.2, repeated here: f R(b) < k = 0 R(b) + max { £ P[o\b,a] Vk-\B[b,a,o)) } < k > 0 for the finite horizon case. 
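Written out, the recursion needs only a few lines once R(b), P[o | b, a] and B[b, a, o] are available as primitives. The sketch below uses placeholder names for those tree-based routines and is meant only to show the control structure; without caching, each call branches into |A| × |O| recursive calls.

    def value(b, k, actions, observations, R, P_obs, update):
        # V^k(b) per Equation 2.2, with
        #   R(b)            -> expected immediate reward of belief b
        #   P_obs(o, b, a)  -> probability of observing o after doing a in b
        #   update(b, a, o) -> the updated belief B[b, a, o]
        if k == 0:
            return R(b)
        best = max(
            sum(P_obs(o, b, a) * value(update(b, a, o), k - 1,
                                       actions, observations, R, P_obs, update)
                for o in observations)
            for a in actions)
        return R(b) + best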
With infinite horizon, approximate Algorithm 5.4 can be employed. We will focus on the finite case in this chapter and discuss approximation with the structured representation in the next chapter. 4.2.5 Computing 11(6) To compute, or search for, the best action with respect to the current belief state b, n(6), we project from b to each of the \A\ x \0\ possible belief states b' in the next stage and compute its value with Vk(b'). The action corresponding to the one with the maximum value will be the optimal choice. Once an action is performed and the observation is obtained, the belief state will be updated to one of the \A\ x \0\ possible belief states we have foreseen. Since the belief states projected from this belief state b' have already been determined when Vk(b') was computed, the subsequent decision stage can be speeded up by caching the search tree of belief states rooted at b'. Then in the next decision stage, we do not need to generate the depth-A; search tree from scratch. Instead, we only need to project one step further 4. STRUCTURE 82 at the frontier of the cached search tree and back up the values from there to determine the best action. 4.3 Experiments To test out the effectiveness of the structured representation, we have per-formed several experiments with a testbed domain. The main objective is to compare the compactness of the structured representation with the per-formance in traditional linear representation. The domain is designed with common structural properties in mind so that structural advantages can po-tentially be exploited. 4.3.1 Testbed Domain Consider a simplified client-server scenario. There are two servers, A and B, serving a client C. Server A is remote to client C, but it has all the resources that may provide C s needs. B is a local server to C, who has fewer resources to satisfy C. However, after a service has been provided by A, resources would normally be cached in B so that B would has a better chance to be able to satisfy client C in the next request for service. Since B is local to C, it is more likely to deliver service promptly whenever it is still keeping the resources in its temporary cache. With this general picture in mind, we want to design a decision agent D, helping client C to make decision on requesting service from server A or server B from time to time. The task of agent D hence is to keep client C satisfied in the service with minimal remote 4. STRUCTURE 83 communication. With agent D, server A and server B becomes transparent to C. Essentially, this is a process-oriented planning problem that can be nicely modeled by P O M D P . With the structured representation, we formulate the problem as a P O M D P with 3 variables, 3 actions and 3 observations. Figure 4.17 specify this testbed domain schematically. 4.3.2 Results The performance of the structured representation has been compared with the corresponding linear representation's test run. Two versions of comput-ing the value function with a fixed horizon are implemented in PROLOG. While both of them, especially the linear representation, are not optimally implemented, it would not be appropriate to compare the two representations having the same horizon directly. However, our interest is in seeing how the space and time complexities grow with a longer horizon in each case. For the purpose of our empirical studies, we initialize both systems with the same initial belief state and project it to various horizons. 
The number of proba-bility entries in the projected belief space are counted and the C P U time for the test runs are recorded. The results are summarized in Table 4.1. Note that the poor computation times for the linear representation is probably due to inefficient implementation as direct array representation of vectors is not available in PROLOG. The implication of the growth in sizes over the horizon is that for each 4. STRUCTURE 84 Server Agent Client Remote A B Local D C Action Observation Variables A :: true if Server A can meet Client Cs need. B :: true if Server B can meet Client Cs need. C :: true if Client C is satisfied. Actions a_A :: send request to Server A. a_B :: send request to Server B. a_C :: send reply to C. Observations o_A .: received reply from Server A. o_B :: received reply from Server B. o_C :: received request from Server C. Reward Function R Keep Client C served with minimal traffic. C 2.0 3.0 1.0 4.0 Action a A Action a B Action a_C AO „ 0. i.o c 0. 1.0 0.5 -0 o_A 0.49 0.09 o_B 0.01 0.01 o_C 0.50 0.90 o_A 0.10 0.10 o_B 0.60 0.40 o_C 0.30 0.50 o_A 0.49 0.29\ o_B 0.01 0.01 o C0.50 0.70\ Figure 4.17: Structured Specification for the Client-Server-Agent domain 4. STRUCTURE 85 k yk Linear Representation Structured Representation S/L Size L CPU Time/s Size S CPU Time/s 0 -5.00 8 0.00 4 0.00 50 % 1 -5.04 80 1.10 49 0.05 61 % 2 -3.93 728 66.22 499 1.00 69 % 3 -2.50 6560 6693.64 4828 60.70 74 % 4 -0.92 59048 > 36000.00 45787 5031.50 78 % Table 4.1: Computing Vk: Linear vs Structured Representation belief update during the belief projection, the complexity depends on how many probability entries need to be multiplied during the updates and this depends on the number of belief states in the stage currently considered. More precisely, this growth is in 0((|A||O|) f c). In the structured represen-tation, the number of probability entries is the total number of leaf nodes in the projected belief space, which is locally reduced in each of the belief states. As we projected the belief further, the S/L column in Figure 4.1 shows that the potential computational saving is decreasing while the hori-zon gets longer. This agrees with the intuition that belief becomes weakened and diversified as we attempt to project further into the future, where there are fewer structural properties that the b-Trees can capture. This motivates the research in approximation methods presented in the next chapter. Since both versions are implemented in PROLOG, in which numerical computation are not optimized, it would not be fair to compare the time across the table in Figure 4.1. However, the growth trends within their own columns indicate that structured representation does help in suppressing the 4. STRUCTURE 86 k Caching: CPU Time/s Tc No Caching: CPU Time/s Tnc I'c/Tnc 0 0.00 0.00 100 % 1 0.50 0.05 100 % 2 0.72 1.00 72 % 3 43.02 60.70 71 % 4 3764.77 5031.50 75 % Table 4.2: Computing Vk: Caching vs No Caching exponential blow up over the horizon. This agrees with the growth in the size columns. We also repeated the test run with caching installed in the structured computation, the results are given in Figure 4.2. Even though the caching strategy is not necessary optimized in our PROLOG implementation, the sig-nificant savings reveal that projecting with a scope or window of belief states is much more efficient than projecting the current belief states from scratch in each decision stage. 
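One way to realize the caching described above is to keep, from one decision stage to the next, the subtree of projected beliefs hanging under the action and observation branch that was actually taken, so that only the new frontier has to be expanded. The sketch below is schematic; the node structure and function names are assumptions, not the PROLOG test harness.

    from dataclasses import dataclass, field

    @dataclass
    class SearchNode:
        belief: object                                   # a b-Tree
        children: dict = field(default_factory=dict)     # (action, observation) -> SearchNode

    def expand(node, actions, observations, update):
        # Grow the decision search tree one level below a node, reusing cached children.
        for a in actions:
            for o in observations:
                if (a, o) not in node.children:
                    node.children[(a, o)] = SearchNode(update(node.belief, a, o))

    def advance(root, a, o):
        # After executing a and observing o, the cached subtree becomes the new root.
        return root.children[(a, o)]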
4.4 Related Work Tree-oriented structured representation in F O M D P s was introduced by BOUTILIER-DEARDEN-GOLDSZMIDT[7] and BOUTILIER-DEAN-HANKS[4]. The first ex-tension to P O M D P s has been investigating by BOUTILIER-POOLE[5]. Ap-plying Bayesian Networks and exploiting conditional independence have been widely discussed in the Probabilistic Reasoning. Discussions on contextual independence, or context-specific independence, in Bayesian Networks can 4. STRUCTURE 87 be found in BOUTILIER-FRIEDMAN-GOLDSZMIDT-KOLLER[8]. BOUTILIER-GOLDSZMIDT[9] examines persistence in Bayesian Networks in the Knowl-edge Representation context. Much of the spirit of the tree-oriented struc-tured belief updates introduced here is closely related to POOLE[26]. 4.5 Summary When the domain is specified by propositions, a P O M D P can be nicely structured in terms of contexts and represented by Binary Decision Trees (BDT) uniformly. Structural properties, such as conditional independence, contextual independence and variable persistence, can then be exploited nat-urally. Based on the B D T representation, a set of BDT-oriented algorithms have been developed to compute R(b), P[o\b,a] and B[b,a,o] structurally. Vk(b) and 11(6) can then be computed on demand with the online search approach. Caching the search tree of projected belief states from the current decision stage provides speedup for the subsequent decision stage. 5 Approximation Approximating POMDPs with Structure and Heuristic Information In a sense ... To imagine is to abstract. To realize is to approximate. Even with the structured representation and online search approach, our intuition as well as the experimental results suggest that exact methods for computing the optimal value function still cannot go very far. For making decisions with a longer horizon, we may trade off correctness for simplicity to compute a near-optimal value function. In this chapter, we examine two kinds 88 5. APPROXIMATION 89 of approximation methods: one exploits the structured framework presented in Chapter 4 to make the BDT-based P O M D P components more compact (b-Trees and R-Trees in particular); the other employs a heuristic search strategy to explore the belief state space more selectively. In a sense, both of them are pruning methods. 5.1 Pruning b-Trees and R-Trees When we think further into the future, our belief in each particular possibility becomes less certain and exact classification of states into contexts becomes less possible. The experimental results given in the last chapter confirm this intuition. The longer the horizon, the more diversified the probabilities in each of the projected belief states becomes. Eventually, each state probability in a belief state would take one leaf node in a fully-grown b-Tree containing all the variables. To keep the structured representation compact over a longer horizon, reducing the size of b-Trees becomes necessary. Here we develop a simple algorithm to reduce the size of a b-Tree by reordering nodes and averaging leaves in a B D T . Since the method is applicable to any B D T , it can be employed to approximate the R-Tree as well in order to speed up the value computation further. 5.1.1 Reordering Nodes In the structured belief update algorithm (Algorithm 4.1) presented in the last chapter, the variable ordering of each path from the root to a leaf in 5. APPROXIMATION 90 A B /\ / \ o - / 1.0 \ 1.0 2.9 1.0 3.0 2.9 3.0 Figure 5.1: Compacting a BDT by Reordering Nodes the updated b-Tree is quite arbitrary. 
Chances are that we may reduce the size of the b-Tree a bit by reordering the variables. Figure 5.1 illustrates a situation that we may reduce the size of a B D T , in which same values are installed at the two leaf nodes that can be reduced to one by reordering the variables in the tree. Procedurally, it can be done by the following algorithm. Algorithm 5.1 Compact(b,c,X.) returns b' b :: BDTto be compacted c :: prefix context of the compact B D T being constructed X initialized as set of variables; reduced in subsequent calls b' :: compacted B D T optimized in number of leaf nodes Average :: see Algorithm A.I Tree :: defined subsequently If (X = {X}) then APPROXIMATION 1. Get leaf value I w.r.tb and ( c + X+) [ I = Average(b, c + X+) ] 2. Get leaf value r w.r.tb and ( c + X~) [ r = Average(b, c + X") ] 3. Return Tree(X, leaf(l), leaf(r)) 1. Get and remove I e X such that there are the least number of distinct leaf values under X w.r.t. b and c. e 2. L := Compact(b, c + X+, X ) 3. R : = Compact(b, c + X , X ) Return Tree(X,L,R) 5. APPROXIMATION Algorithm 5.2 Tree(X, L, R) returns T 92 X L R T variable labeling the current node left subtree consistent with X+ right subtree consistent with X~ averaged tree w.r.t. 6 to be returned If (L = R) then Return L else Return node(X, L, R) Note that in Algorithm 5.2, the reduction of node(X, L, R) into L (or R) if (L = R) is generally happens when L and R are leaf nodes, but not often when they are subtrees. 5.1.2 Averaging Leaves It becomes less likely that states can be classified into clusters sharing the contexts with the same probabilties in the belief states as they are projected further and further into the future. Therefore optimizing the b-Trees by reordering nodes would become ineffective. However, although there may be few leaf values that are identical, we may classify leaves with common contexts into clusters if their values are close enough. We trade off correctness for simplicity. For example, given two leaf nodes leaf (I) and leaf(r) under the same internal node labeled by X, node(X, leaf (I), leaf(r)), if the difference of their leaf values jZ — 7™ | are within a defined tolerance 9, then we may 5. APPROXIMATION 93 reduce the internal node into a leaf node leaf((l + r)/2) by averaging the leaf values. To implement this idea, Algorithm Tree(5.2) can be replaced by the following algorithm. Algorithm 5.3 AverageTree(X, L, R,9) X :: variable labeling the current node L :: left subtree consistent with X+ R :: right subtree consistent with X~ 9 :: defined tolerance for averaging leaves T :: averaged tree w.r.t. 9 to be returned If (L = leafil) and R = leaf(r) and \l — r\ < 9) then Return leaf((l + r)/2) else Return node(X, L, R) More sophisticated ideas for approximating the b-Trees may look into Step 1 of the else case in Algorithm 5.1, where variable selection criteria can be defined approximately. 5.1.3 Experimental Results Applying to the same Client-Server-Agent testbed domain used in Chapter 4, we perform the b-Tree compaction algorithms, with and without averaging leaf values, after each belief state has been updated. Belief states visited are cached in each case. The results are summarized in Table 5.1. Results for structured representation with b-Tree caching given in Chapter 4 are 5. 
repeated in the table for comparisons. Values V^k are computed for a fixed horizon with an initial belief state setting P[A+B-C-] = 1.0. Note that the longer times required at horizons 0 to 2 reflect the overhead cost of doing b-Tree compaction. This overhead is easily seen to be worthwhile when the horizon is extended further. In general, b-Tree compaction during belief updates helps to reduce the number of probability entries, or total number of leaf nodes, significantly.

    k | Structured              | + Reordering Nodes      | + Averaging Leaves
      | V^k      Size   Time/s  | V^k      Size   Time/s  | V^k      Size   Time/s
    0 | -5.0000      4    0.00  | -5.0000      4    0.00  | -5.0000      4    0.00
    1 | -5.0400     49    0.05  | -5.0400     40    0.19  | -5.0521     34    0.15
    2 | -3.9344    499    0.72  | -3.9344    397    2.16  | -3.9748    314    2.02
    3 | -2.4983   4828   43.02  | -2.4983   3787   37.94  | -2.5721   2940   33.95
    4 | -0.9184  45787 3764.77  | -0.9184  35176 1718.10  | -1.0254  27132 1536.97

    Size: number of leaf nodes.  CPU Time: seconds.  Averaging tolerance θ = 0.01.
    Table 5.1: Compacting b-Trees by Reordering Nodes and Averaging Leaves

5.2 Pruning Decision Search Tree

In a sense, the b-Tree compaction with node reordering and leaf averaging prunes the local structures of the projected belief space. Viewing the global picture, however, we can think of the projected belief space as a decision search tree branching on actions and observations iteratively, with the horizon as the search depth. As described in Chapter 3, we can prune the decision search tree evenly at a bounded depth and estimate the values at that bound by the value function of the corresponding FOMDP model applied to the belief states sitting there. More aggressively, we can prune the decision search tree asymmetrically by using the FOMDP value function again as a heuristic. This heuristic search strategy saves the effort of further exploring seemingly unproductive belief states so that the search can go deeper in the promising ones.

5.2.1 Heuristic

Although a FOMDP is a less accurate model than a POMDP when the perfect observation assumption fails, it serves as a rough and quick (over)estimate of the problem. With the information given by the optimal value function of the FOMDP, we can explore the belief state space more intelligently. Specifically, given an optimal FOMDP value function V*, we define a heuristic function on a belief state, H(b), as follows:

    H(b) = Σ_i b(i) V*(i)        (5.1)

The applicability of this heuristic function is based on the following proposition and the subsequent conjecture.

Proposition 5.1 Given a POMDP and its counterpart FOMDP with finite horizon k,

    V^k(b) ≤ Σ_i b(i) V^k(i)

where V^k(b) is the POMDP value function, defined on belief states, V^k(i) is the FOMDP value function, defined on states, and b is a generic belief state in the belief state space.

Reasoning  When k = 0,

    V^0(b) = R(b) = Σ_i b(i) R(i)  and  V^0(i) = R(i),  so  V^0(b) = Σ_i b(i) V^0(i).

When k > 0, consider the FOMDP represented by the POMDP constructs. Since complete observability is assumed, the observation function O(i,a,o) conforms to the transition function T(i,a,j), with o and j co-dependent with certainty. The belief state would then have probability 1.0 in one and only one of the states under any belief update. With this extra certainty, the value function, formulated as the maximum total expected reward, is likely to be overestimated.
A more speculative conjecture is that this proposition can be generalized to the model with infinite horizon, given that V*(b) and V*(i) are the converged optimal value functions of the POMDP and the FOMDP respectively.

Proposition 5.2 Given a POMDP and its counterpart FOMDP with infinite horizon,

    V*(b) ≤ Σ_i b(i) V*(i)

where V*(b) is the POMDP optimal value function, defined on belief states, V*(i) is the FOMDP optimal value function, defined on states, and b is a generic belief state in the belief state space.

Reasoning  As V* = lim_{k→∞} V^k, if V*(b) > Σ_i b(i) V*(i), then there exists k such that V^k(b) > Σ_i b(i) V*(i) ≥ Σ_i b(i) V^k(i), which contradicts Proposition 5.1.

5.2.2 Algorithm

Assuming our conjecture is approximately correct and can be applied as a heuristic, we have developed an algorithm for exploring the belief state space in which the decision search tree is searched with a best-first strategy. The idea is to use Equation 5.1 as a heuristic to determine the value of being in a belief state, and hence the value of taking an action. Obviously, we would like to explore the seemingly "best" action first. When it turns out to be "not good enough" after a more detailed examination, we may want to continue with the "second best", and so on. However, if we believe that we have already explored the "best" or a "good enough" option, we do not bother to investigate the unexplored options on our list. In our algorithm, V_max, the best value obtained and known so far, sets the "good enough" standard.

Algorithm 5.4 Explore(b, k, δ, V*) returns (V_max, a_max)

    b     :: current belief state
    k     :: horizon / search depth
    δ     :: discounting factor
    V*    :: optimal FOMDP value function
    V_max :: the currently best value
    a_max :: the currently best action

If (k = 0) then
    1. V_max := Σ_i b(i) V*(i)
    2. a_max := none
    3. Return (V_max, a_max)
else [k > 0]
    1. Project b to B[b, a, o] one step ahead for each a ∈ A and o ∈ O.
    2. Order the items HQ(b, a) = Σ_o P[o | b, a] H(B[b, a, o]) in a queue with the greatest in front, one for each a ∈ A, where H(B[b, a, o]) = Σ_i B[b, a, o](i) V*(i).
    3. Set a_max to any a ∈ A arbitrarily.
    4. Set V_max := HQ(b, a_max).
    5. Get an item HQ(b, a) from the queue to explore.
       Repeat
           Q(b, a) := Σ_o P[o | b, a] V'_max,
               where (V'_max, a'_max) = Explore(B[b, a, o], k − 1, δ, V*)
           V_max := max { V_max, Q(b, a) }
           If (V_max = Q(b, a)) then a_max := a
           Get the next item HQ(b, a) from the queue
       until (HQ(b, a) < V_max) or the queue is empty.
    6. V_max := R(b) + δ V_max
    7. Return (V_max, a_max)

5.2.3 Experiments

To test the heuristic online search approach, we run it in the Client-Server-Agent testbed domain. This time we approximate the value function with infinite horizon by discounting and by focusing on a bounded depth, at which the future values are estimated by the FOMDP value function. The experiment is set up with a precomputed FOMDP value function that converged in 757 iterations of Value Iteration with discount factor 0.99 and convergence threshold 0.001. Both belief state caching and b-Tree compaction are employed. The control experiment is performed by pruning the belief state space at the bounded depth evenly. The results are summarized in Table 5.2. The control online search algorithm with discounting and focusing only is titled Complete Search, whereas the heuristic online search using Algorithm Explore is titled Best-First Search (to depth k).

The results look appealing. The belief state space is trimmed down tremendously while errors in the computed value are virtually unobserv-
      Complete Search                   Best-First Search
 k    V*          Size     Time/s      V*          Size     Time/s
 0    187.5531        4       0.00     187.5531        4       0.00
 1    188.5235       34       0.14     188.5235       34       0.17
 2    189.4419      314       1.86     189.4419      135       0.68
 3    190.3320     2940      38.96     190.3321      408       2.31
 4    191.1989    27132    2221.24     191.1989     1217      10.86
 5        —           —          —     192.0451     3645      72.50
 6        —           —          —     192.8718    10969     580.50

 Size / leaf nodes; (CPU) Time / seconds; Discount δ = 0.99

Table 5.2: Heuristic Online Search: Complete vs Best-First

 State i      V*(i)
 A+B+C+     199.9007
 A+B+C−     191.1725
 A+B−C+     195.0930
 A+B−C−     187.5531
 A−B+C+     369.5441
 A−B+C−     387.3797
 A−B−C+     399.8014
 A−B−C−     387.9797

 Value Iteration: 757 iterations; discount 0.99; threshold 0.001.

Table 5.3: The precomputed FOMDP Value Function

The results look appealing. The belief state space is trimmed down tremendously, while errors in the computed values are virtually unobservable with small horizons. The small errors imply that the heuristic function guides the search along the best path nicely. The effectiveness of the heuristic search in the Client-Server-Agent domain may also be attributed to its rich structural properties; a drop in performance is expected if more randomness is imposed in the testing domain. It is suspected that the selection of the first "currently best" action in Algorithm Explore may play a role; however, similar results are obtained even when the actions are shuffled. In short, applying heuristics in structured domains seems to be promising.

5.3 Related Work

The idea of applying pruning in a structured representation framework was introduced to decision-theoretic planning by Dearden and Boutilier [18]. A parallel line of work on approximation techniques with a functional model is found in Parr and Russell [25]. Utgoff [35] discusses methods for restructuring decision trees efficiently.

5.4 Summary

When exact computation fails to meet an acceptable decision deadline, approximation methods are applied. Within the structured representation framework and the online search approach, a near-optimal value for a belief state can be computed more efficiently with tree prunings.

Locally, a b-Tree or R-Tree can be restructured and/or pruned by averaging leaves with close enough values. Globally, the belief state space represented by a decision search tree can be pruned symmetrically at a bounded depth and/or asymmetrically by a heuristic-guided search. In particular, we may employ the counterpart FOMDP model as a heuristic for the original POMDP concerned. With a quick heuristic and rich structural properties, significant improvement in computing the value of the current belief state with the heuristic online search strategy is observed.

6 Conclusion

In a sense ...

    To act intelligently is to observe the current situation, to interpret the
    observation with the perceived world model, and to apply the action that
    has the greatest chance to produce the most desirable change.

Bringing Decision Theory, Control Theory and AI Planning together, Decision-Theoretic Planning has evolved as a distinct line of interesting research to be explored. In search of a working model, Partially Observable Markov Decision Processes (POMDPs), borrowed from Operations Research, have drawn much interest. However, like most interesting research problems, POMDPs are extremely difficult to solve in general. The state of the art can only deal with small domains with at most tens of states. Compact representations and approximation strategies are needed to empower the formal model for practical applications.
Recently, tree-oriented structured representations have emerged as a new attempt at solving both fully observable and partially observable MDPs. In this thesis, we have explored a structured representation based on binary decision trees, on which approximation strategies can be employed, along with a more AI-flavoured heuristic online search approach applied to pruning the decision search tree of the belief state space. It seems that trees have been the coherent theme of this thesis work.

6.1 Hindsight

To respond to the challenging problem of solving POMDPs more efficiently, we have designed a uniform structured representation transforming the POMDP component constructs into binary decision trees. A handful of binary tree-oriented algorithms have been developed to perform belief updates and value computation structurally.

Based on the structured framework, tree restructuring and pruning have been applied as optimization and approximation techniques to reduce the size of the belief state trees (b-Trees). Significant improvements in computation time have been observed, as fewer probability entries need to be computed in belief update operations.

Along with the structured representation, the belief state space has been structured as a decision search tree, forming the global tree with the b-Trees stored locally at its nodes. By applying pruning symmetrically, or asymmetrically with the optimal value function of the FOMDP as a quick heuristic measure, remarkable performance boosts have been observed in the testbed domain with its rich structural properties. Specifically, this computational advantage is attributed to exploiting the conditional independence, contextual independence and persistence abundantly found in the domain. So it is expected that the structured representation would be less appealing when applied to domains with scarce structural properties.

6.2 Insight

While the Markov property and partial observability are model assumptions built into POMDPs, several additional simplifying assumptions and general domain properties have been exploited in our structure and heuristic exploration in this thesis work.

First of all is the propositional domain assumption. For simplicity, we assume the domain is specified by propositions, or boolean variables, so that the binary decision tree representation is possible. This makes the algorithms for the structured representation a lot easier to understand and develop. In a sense, this is an integrated approach to bringing numerical planning models that handle uncertainty into classical AI planning with propositional representations. While this propositional domain assumption may restrict the usability of the structured representation, the idea of classifying states into clusters sharing the same contexts and similar probabilities (or values) should still be valid beyond propositional domains.

To make the structured representation compact, several structural properties are exploited: in particular, conditional independence, contextual independence and persistence. Speculatively, we strongly believe that these properties are frequently found in natural domains.

Yet another powerful idea that we have exploited is using the FOMDP value function as a search heuristic to keep the exploration of the belief state space on approximately the right track.
By applying best-first online search with heuristic pruning, the belief state space can be greatly reduced. In a sense, this heuristic search strategy encapsulates the heuristic function: when better heuristic functions are devised, we can simply replace the heuristic function within the same search framework.

6.3 Foresight

Following the train of thought explored in this thesis, several immediate, ongoing and future research directions are in focus. Immediately, further empirical and analytical studies of the developed algorithms are needed. To test the algorithms with larger domains and longer horizons, the current PROLOG programs may need to be ported to more efficient implementations. Meanwhile, more rigorous complexity analyses are always necessary. To justify the application of the FOMDP value function as a heuristic, more formal proofs showing that the optimal POMDP value function is bounded by the FOMDP value function need to be constructed. To enhance the usability of the structured representation, formal specification techniques need to be developed to model different kinds of domains effectively. This may require more experimental studies with a great variety of domains to discover the useful patterns.

Future research directions include: extending the propositional representation to handle numerical and continuous domains; devising better and more informed heuristic functions for the online search algorithm; adopting heuristic sensing strategies and plan execution; considering multi-agent environments; and the list goes on.

A Algorithms

The BDT-Oriented Algorithms in PROLOG

This appendix gives a catalogue of the main algorithms, along with the supporting data structures, specified in PROLOG for our BDT representation of POMDPs described in Chapter 4. In particular, SICStus Prolog 2.1 Release 9 [34] has been used as the platform for implementation.

A.0 Preliminary

Variables

Domain variables are specified as names in lower-case letters, e.g. a, b, c, is_raining, lights_on, etc.

Assignments

Assignments are represented by (X,0) or (X,1) pairs, where X is a domain variable. 0 and 1 denote false and true respectively.

Contexts

Contexts are represented by unordered lists of assignments, e.g. the context A+B−C+ may be specified as [(a,1), (b,0), (c,1)] or [(b,0), (a,1), (c,1)]. The essential operations defined on contexts are given by the following PROLOG predicates.

Subsumes

% subsumes(+C, ?A) :-
%   ?A is an assignment in context +C.
subsumes(C, A) :-
    member(A, C).

member(X, [X|_]).
member(X, [_|L]) :-
    member(X, L).

Exclude

% exclude(+C0, +X, ?C1) :-
%   ?C1 is context +C0 with the assignment to variable +X excluded,
%   regardless of the value of +X.
exclude(C0, X, C1) :-
    remove(C0, (X,_), C1).

remove(L0, X, L1) :-
    remove(L0, [], X, L1), !.
remove([], L, _X, L).
remove([X|L0], L, X, L1) :-
    remove(L0, L, X, L1).
remove([Y|L0], L, X, L1) :-
    remove(L0, [Y|L], X, L1).

Persistent

% persistent(+C, +X, +P) :-
%   True if context +C subsumes assignment (+X,1) and probability
%   +P = 1.0, or context +C subsumes assignment (+X,0) and probability
%   +P = 0.0.
persistent(C, X, 1.0) :-
    subsumes(C, (X,1)).
persistent(C, X, 0.0) :-
    subsumes(C, (X,0)).
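As a quick illustration of these context operations, a few hypothetical queries (examples added here for clarity, not part of the thesis listings) behave as follows:

    %   ?- subsumes([(a,1),(b,0),(c,1)], (b,0)).     % succeeds
    %   ?- exclude([(a,1),(b,0),(c,1)], b, C1).      % C1 keeps only the
    %                                                %   assignments to a and c
    %   ?- persistent([(a,1),(b,0)], a, 1.0).        % succeeds: a is known true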
A.1 Average

% average(+T, +C, ?V) :-
%   ?V is the average value of BDT +T w.r.t. context +C.
average(leaf(V), _C, V).
average(node(X, L, _R), C, V) :-
    subsumes(C, (X,1)),
    average(L, C, V).
average(node(X, _L, R), C, V) :-
    subsumes(C, (X,0)),
    average(R, C, V).
average(node(_X, L, R), C, V) :-
    average(L, C, VL),
    average(R, C, VR),
    V is (VL + VR) / 2.

A.2 DotProduct

% dotproduct(+T1, +T2, ?V) :-
%   ?V is the dot product of BDT +T1 and BDT +T2.
dotproduct(T1, T2, V) :-
    num_variables(NX),
    dotproduct(T1, [], NX, T2, V), !.

% dotproduct(+T1, +C1, +H1, +T2, ?V) :-
%   ?V is the dot product of BDT +T1 and BDT +T2,
%   where +T1 has prefix context +C1 and rank(+T1) = exp(2, +H1).
dotproduct(leaf(V1), C1, H1, T2, V) :-
    average(T2, C1, V2),
    V is exp(2, H1) * V1 * V2.
dotproduct(node(X, L1, R1), C1, H1, node(X, L2, R2), V) :-
    dotproduct(L1, [(X,1)|C1], H1-1, L2, VL),
    dotproduct(R1, [(X,0)|C1], H1-1, R2, VR),
    V is VL + VR.
dotproduct(T1, C1, H1, node(X2, L2, _R2), V) :-
    subsumes(C1, (X2,1)),
    dotproduct(T1, C1, H1, L2, V).
dotproduct(T1, C1, H1, node(X2, _L2, R2), V) :-
    subsumes(C1, (X2,0)),
    dotproduct(T1, C1, H1, R2, V).
dotproduct(node(X1, L1, R1), C1, H1, T2, V) :-
    dotproduct(L1, [(X1,1)|C1], H1-1, T2, VL),
    dotproduct(R1, [(X1,0)|C1], H1-1, T2, VR),
    V is VL + VR.
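As a usage example, the structured computation of R(b) described in Section 4.2.1 amounts to a single dot product of the b-Tree and the R-Tree. With the testbed predicates bel/1 and rew/1 of Appendix B, an illustrative query (added here, not part of the listings) would be

    %   ?- bel(B), rew(R), dotproduct(B, R, V).
    %   V = -5.0

which matches the horizon-0 value reported for the initial belief state in Table 5.1.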
A.3 Merge

% merge.pl
% Merge operation for precompiling BT-Tree.

% merge(+T1, +T2, +T2X, ?T) :-
%   ?T is the BT-Tree resulting from merging BT-Tree +T1 with T-Tree +T2
%   for postcondition variable +T2X.
merge(T1, T2, T2X, T) :-
    merge(T1, [], T2, T2X, T), !.

% merge(+T1, +C1, +T2, +T2X, ?T) :-
%   ?T is the BT-Tree resulting from merging BT-Tree +T1 with T-Tree +T2
%   for postcondition variable +T2X, where +T1 has prefix context +C1.
merge(leaf(E1), C1, T2, T2X, T) :-
    sub_merge(C1, E1, T2, T2X, T).
merge(node(X, L1, R1), C1, node(X, L2, R2), T2X, node(X, TL, TR)) :-
    merge(L1, [(X,1)|C1], L2, T2X, TL),
    merge(R1, [(X,0)|C1], R2, T2X, TR).
merge(T1, C1, node(X2, L2, _R2), T2X, T) :-
    subsumes(C1, (X2,1)),
    merge(T1, C1, L2, T2X, T).
merge(T1, C1, node(X2, _L2, R2), T2X, T) :-
    subsumes(C1, (X2,0)),
    merge(T1, C1, R2, T2X, T).
merge(node(X, L1, R1), C1, T2, T2X, node(X, TL, TR)) :-
    merge(L1, [(X,1)|C1], T2, T2X, TL),
    merge(R1, [(X,0)|C1], T2, T2X, TR).

% sub_merge(+C1, +E1, +T2, +T2X, ?T) :-
%   ?T is the subtree of type BT-Tree generated by extracting
%   the consistent branches from T-Tree +T2 w.r.t. context +C1,
%   depositing the effect +E1 into the consistent leaf nodes,
%   and pruning the persistent branches.
sub_merge(C1, E1, leaf(P2), T2X, leaf(E1)) :-
    persistent(C1, T2X, P2).
sub_merge(_C1, E1, leaf(P2), T2X, leaf([(T2X,P2)|E1])).
sub_merge(C1, E1, node(X2, L2, _R2), T2X, T) :-
    subsumes(C1, (X2,1)),
    sub_merge(C1, E1, L2, T2X, T).
sub_merge(C1, E1, node(X2, _L2, R2), T2X, T) :-
    subsumes(C1, (X2,0)),
    sub_merge(C1, E1, R2, T2X, T).
sub_merge(C1, E1, node(X2, L2, R2), T2X, T) :-
    sub_merge([(X2,1)|C1], E1, L2, T2X, T),
    sub_merge([(X2,0)|C1], E1, R2, T2X, T).
sub_merge(C1, E1, node(X2, L2, R2), T2X, node(X2, TL, TR)) :-
    sub_merge([(X2,1)|C1], E1, L2, T2X, TL),
    sub_merge([(X2,0)|C1], E1, R2, T2X, TR).

A.4 Multiply

% multiply(+T, +N, ?NT) :-
%   ?NT is BDT +T multiplied by value +N for each of the leaf values.
multiply(leaf(V), N, leaf(NV)) :-
    NV is N * V.
multiply(node(X, L, R), N, node(X, NL, NR)) :-
    multiply(L, N, NL),
    multiply(R, N, NR).

A.5 Normalize

% normalize(+T, ?NT) :-
%   ?NT is BDT +T normalized to 1.0.
normalize(T, NT) :-
    norm_factor(T, N),
    N_INV is 1 / N,
    multiply(T, N_INV, NT).

% norm_factor(+T, ?N) :-
%   ?N is the normalizing factor for BDT +T,
%   i.e. the sum of the values for all states.
norm_factor(T, N) :-
    num_variables(NX),
    norm_factor(T, NX, N), !.

% norm_factor(+T, +H, ?N) :-
%   ?N is the normalizing factor for BDT +T,
%   for which rank(+T) = exp(2, +H).
norm_factor(leaf(V), H, N) :-
    N is exp(2, H) * V.
norm_factor(node(_X, L, R), H, N) :-
    norm_factor(L, H-1, NL),
    norm_factor(R, H-1, NR),
    N is NL + NR.

A.6 Product

% product(+T1, +T2, ?T) :-
%   ?T is the tree product of BDT +T1 and BDT +T2.
product(T1, T2, T) :-
    product(T1, [], T2, T), !.

% product(+T1, +C1, +T2, ?T) :-
%   ?T is the tree product of BDT +T1 and BDT +T2,
%   where +T1 has prefix context +C1.
product(leaf(V1), C1, T2, T) :-
    sub_product(C1, V1, T2, T).
product(node(X, L1, R1), C1, node(X, L2, R2), node(X, L, R)) :-
    product(L1, [(X,1)|C1], L2, L),
    product(R1, [(X,0)|C1], R2, R).
product(T1, C1, node(X2, L2, _R2), T) :-
    subsumes(C1, (X2,1)),
    product(T1, C1, L2, T).
product(T1, C1, node(X2, _L2, R2), T) :-
    subsumes(C1, (X2,0)),
    product(T1, C1, R2, T).
product(node(X, L1, R1), C1, T2, node(X, L, R)) :-
    product(L1, [(X,1)|C1], T2, L),
    product(R1, [(X,0)|C1], T2, R).

% sub_product(+C1, +V1, +T2, ?T) :-
%   ?T is BDT +T2 pruned to make consistent with context +C1 and
%   with each of its leaf values multiplied by value +V1.
sub_product(_C1, V1, leaf(V2), leaf(V)) :-
    V is V1 * V2.
sub_product(C1, V1, node(X, L, _R), T) :-
    subsumes(C1, (X,1)),
    sub_product(C1, V1, L, T).
sub_product(C1, V1, node(X, _L, R), T) :-
    subsumes(C1, (X,0)),
    sub_product(C1, V1, R, T).
sub_product(C1, V1, node(X, L1, R1), node(X, L, R)) :-
    sub_product(C1, V1, L1, L),
    sub_product(C1, V1, R1, R).

A.7 Sum

% sum(+T1, +T2, ?T) :-
%   ?T is the summation tree of the partial BDTs +T1 and +T2.
sum(T1, T2, T) :-
    sum(T1, [], T2, T), !.

% sum(+T1, +C1, +T2, ?T) :-
%   ?T is the summation tree of the partial BDTs +T1 and +T2,
%   where +T1 has prefix context +C1.
sum(leaf(V1), C1, T2, T) :-
    sub_sum(C1, V1, T2, T).
sum(node(X, L1, R1), C1, node(X, L2, R2), T) :-
    sum(L1, [(X,1)|C1], L2, L),
    sum(R1, [(X,0)|C1], R2, R),
    treeUnion(X, L, R, T).
sum(T1, C1, node(X2, L2, _R2), T) :-
    subsumes(C1, (X2,1)),
    sum(T1, C1, L2, T).
sum(T1, C1, node(X2, _L2, R2), T) :-
    subsumes(C1, (X2,0)),
    sum(T1, C1, R2, T).
sum(node(X, L1, R1), C1, T2, T) :-
    sum(L1, [(X,1)|C1], T2, L),
    sum(R1, [(X,0)|C1], T2, R),
    treeUnion(X, L, R, T).

% sub_sum(+C1, +V1, +T2, ?T) :-
%   ?T is BDT +T2 pruned to make consistent with context +C1,
%   and with value +V1 added to each of the leaf values.
sub_sum(_C1, V1, leaf(V2), leaf(V)) :-
    V is V1 + V2.
sub_sum(C1, V1, node(X2, L2, _R2), T) :-
    subsumes(C1, (X2,1)),
    sub_sum(C1, V1, L2, T).
sub_sum(C1, V1, node(X2, _L2, R2), T) :-
    subsumes(C1, (X2,0)),
    sub_sum(C1, V1, R2, T).
sub_sum(C1, V1, node(X2, L2, R2), T) :-
    sub_sum(C1, V1, L2, L),
    sub_sum(C1, V1, R2, R),
    treeUnion(X2, L, R, T).

% treeUnion(+X, +T1, +T2, ?T) :-
%   BDT ?T is BDT +T1 (or BDT +T2) if +T1 == +T2;
%   otherwise ?T is +T1 and +T2 merged with variable +X at the root.
treeUnion(_X, T, T, T).
treeUnion(X, L, R, node(X, L, R)).
A.8 Update

% update.pl
% Update operation for performing updates with BO-Tree and BT-Tree.

% update(+B0, +BT, ?B) :-
%   ?B is the updated b-Tree obtained from BO-Tree +B0 and BT-Tree +BT.
update(B0, BT, B) :-
    update(B0, [], BT, B), !.

% update(+B0, +C, +BT, ?B) :-
%   ?B is the updated b-Tree obtained from BO-Tree +B0 and BT-Tree +BT,
%   where +B0 has prefix context +C.
update(leaf(P), C, BT, B) :-
    sub_update(C, P, BT, B).
update(node(X, B0_L, B0_R), C, node(X, BT_L, BT_R), B) :-
    update(B0_L, [(X,1)|C], BT_L, B_L),
    update(B0_R, [(X,0)|C], BT_R, B_R),
    sum(B_L, B_R, B).
update(B0, C, node(X_BT, BT_L, _BT_R), B) :-
    subsumes(C, (X_BT,1)),
    update(B0, C, BT_L, B).
update(B0, C, node(X_BT, _BT_L, BT_R), B) :-
    subsumes(C, (X_BT,0)),
    update(B0, C, BT_R, B).
update(node(X_B0, B0_L, B0_R), C, BT, B) :-
    update(B0_L, [(X_B0,1)|C], BT, B_L),
    update(B0_R, [(X_B0,0)|C], BT, B_R),
    sum(B_L, B_R, B).

% sub_update(+C, +P, +BT, ?T) :-
%   ?T is the partial tree generated from BT-Tree +BT w.r.t. context +C
%   and probability +P.
sub_update([], P, leaf([]), leaf(P)).
sub_update([(X,1)|C], P, leaf([]), node(X, L, leaf(0.0))) :-
    sub_update(C, P, leaf([]), L).
sub_update([(X,0)|C], P, leaf([]), node(X, leaf(0.0), R)) :-
    sub_update(C, P, leaf([]), R).
sub_update(C, P, leaf([(X,Q)|E]), node(X, L, R)) :-
    exclude(C, X, NC),
    P1 is P * Q,
    P0 is P * (1.0 - Q),
    sub_update(NC, P1, leaf(E), L),
    sub_update(NC, P0, leaf(E), R).
sub_update(C, P, node(X, L, _R), T) :-
    subsumes(C, (X,1)),
    sub_update(C, P, L, T).
sub_update(C, P, node(X, _L, R), T) :-
    subsumes(C, (X,0)),
    sub_update(C, P, R, T).
sub_update(C, P, node(X, L, R), T) :-
    sub_update([(X,1)|C], P, L, TL),
    sub_update([(X,0)|C], P, R, TR),
    sum(TL, TR, T).

B Testbed

The Client-Server-Agent Testbed Domain Specification

For experimental studies, a 3-variable, 3-action and 3-observation testbed domain is devised. It makes up an 8-state domain with a branching factor of 9 in the decision search tree projecting the belief states. Rich structural properties, such as conditional independence, contextual independence and persistence, are frequently found in the domain. The specification in PROLOG predicates is given below.
The Client-Server-Agent Testbed

variables([a, b, c]).
%   a :: TRUE if Server A (Remote) can provide Client C's need.
%   b :: TRUE if Server B (Local) can provide Client C's need.
%   c :: TRUE if Client C is satisfied.

actions([a_C, a_B, a_A]).
%   a_A :: Send request to A.
%   a_B :: Send request to B.
%   a_C :: Send reply to C.

observations([o_A, o_B, o_C]).
%   o_A :: Received reply from A.
%   o_B :: Received reply from B.
%   o_C :: Received request from C.

% Reward Function
%   - Try to keep C satisfied while using the least remote service from A.
rew(node(c, node(b, node(a, leaf(2.0), leaf(3.0)),
                    node(a, leaf(1.0), leaf(4.0))),
            leaf(-5.0))).

% Transition Function
act(a_A, a, node(a, leaf(1.0), leaf(0.0))).
act(a_A, b, node(b, leaf(1.0), leaf(0.2))).
act(a_A, c, node(c, leaf(1.0), node(a, leaf(0.8), leaf(0.0)))).

act(a_B, a, node(a, leaf(1.0), leaf(0.0))).
act(a_B, b, node(b, leaf(0.7), leaf(0.0))).
act(a_B, c, node(c, leaf(1.0), node(b, leaf(0.9), leaf(0.0)))).

act(a_C, a, node(a, leaf(1.0), leaf(0.0))).
act(a_C, b, node(b, leaf(1.0), leaf(0.0))).
act(a_C, c, node(c, leaf(1.0), leaf(0.5))).

% Observation Function
obs(a_A, o_A, node(c, leaf(0.49), leaf(0.09))).
obs(a_A, o_B, node(c, leaf(0.01), leaf(0.01))).
obs(a_A, o_C, node(c, leaf(0.50), leaf(0.90))).
obs(a_B, o_A, node(c, leaf(0.10), leaf(0.10))).
obs(a_B, o_B, node(c, leaf(0.60), leaf(0.40))).
obs(a_B, o_C, node(c, leaf(0.30), leaf(0.50))).
obs(a_C, o_A, node(c, leaf(0.49), leaf(0.29))).
obs(a_C, o_B, node(c, leaf(0.01), leaf(0.01))).
obs(a_C, o_C, node(c, leaf(0.50), leaf(0.70))).

% Initial Belief State
bel(node(c, leaf(0.0), node(b, leaf(0.0), node(a, leaf(1.0), leaf(0.0))))).

Bibliography

[1] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81-138, 1995.

[2] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[3] Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, 1987.

[4] Craig Boutilier, Thomas Dean, and Steve Hanks. Planning under uncertainty: Structural assumptions and computational leverage. In Proceedings of the 3rd European Workshop on Planning (EWSP'95), Assisi, ITALY, September 1995.

[5] Craig Boutilier and David Poole. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the 13th National Conference on Artificial Intelligence, Portland, OR, 1996. To appear.

[6] Craig Boutilier and Martin L. Puterman. Process-oriented planning and average-reward optimality. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), volume 2, pages 1096-1103, Montreal, Quebec, CANADA, August 1995.

[7] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Exploiting structure in policy construction. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), volume 2, pages 1104-1111, Montreal, CANADA, August 1995.
[8] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. To appear, UAI-96, 1996.

[9] Craig Boutilier and Moises Goldszmidt. The frame problem and Bayesian network action representation. 1996.

[10] Anthony R. Cassandra. Optimal policies for partially observable Markov decision processes. Technical Report CS-94-14, Department of Computer Science, Brown University, Providence, Rhode Island, August 1994.

[11] Anthony R. Cassandra, Leslie P. Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), Seattle, WA, 1994.

[12] Hsien-Te Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, The University of British Columbia, Vancouver, BC, CANADA, 1988.

[13] Thomas Dean and Mark Boddy. An analysis of time-dependent planning. In Proceedings of the 7th National Conference on Artificial Intelligence (AAAI-88), St. Paul, MN, 1988.

[14] Thomas Dean, Leslie Kaelbling, Jak Kirman, and Ann Nicholson. Deliberation scheduling for time-critical decision making. In Proceedings of the 9th Conference on Uncertainty in Artificial Intelligence (UAI-93), pages 309-316, Washington, DC, 1993.

[15] Thomas Dean, Leslie Kaelbling, Jak Kirman, and Ann Nicholson. Planning with deadlines in stochastic domains. In Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93), pages 574-579, Washington, DC, 1993.

[16] Thomas L. Dean and Michael P. Wellman. Planning and Control. Morgan Kaufmann, San Mateo, CA, 1991.

[17] Richard Dearden and Craig Boutilier. Integrating planning and execution in stochastic domains. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 162-169, Seattle, WA, 1994.

[18] Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. Unpublished manuscript (revised version to appear, Artificial Intelligence), 1995.

[19] Richard Fikes and Nils Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189-208, 1971.

[20] Richard Fikes and Nils Nilsson. STRIPS, a retrospective. Artificial Intelligence, 59(1-2):227-232, February 1993.

[21] Michael L. Littman. The witness algorithm: Solving partially observable Markov decision processes. Technical Report CS-94-40, Brown University, Department of Computer Science, Providence, RI, December 1994.

[22] Michael L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University, Providence, Rhode Island, May 1996. Technical Report CS-96-09.

[23] William S. Lovejoy. A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28(1-4):47-66, April 1991.

[24] George E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1-16, 1982.

[25] Ronald Parr and Stuart Russell. Approximating optimal policies for partially observable stochastic domains. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), volume 2, pages 1088-1094, Montreal, CANADA, August 1995.
[26] David Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64(1):81-129, November 1993.

[27] Martin L. Puterman. Dynamic programming. In Encyclopedia of Physical Science and Technology, volume 4, pages 438-463, Orlando, 1987. Academic Press.

[28] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

[29] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1995.

[30] Stuart Russell and Eric Wefald. Do the Right Thing: Studies in Limited Rationality. The MIT Press, Cambridge, MA, 1991.

[31] Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071-1088, 1973.

[32] Edward Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.

[33] Edward J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282-304, 1978.

[34] Swedish Institute of Computer Science, PO Box 1263, S-164 28 KISTA, Sweden. SICStus Prolog User's Manual, 2.1 r9 edition, April 1994.

[35] Paul E. Utgoff. Decision tree induction based on efficient tree restructuring. Technical Report 95-18, Department of Computer Science, University of Massachusetts, Amherst, MA, March 1995.
