Stochastic Simulation in Dynamic Probabilistic Networks Using Compact Representation by Adrian Y . W. Cheuk B. Sc., The Chinese University of Hong Kong, 1994 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF T H E REQUIREMENTS FOR T H E D E G R E E OF MASTER OF SCIENCE in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) We accept this thesis as conforming to the required standard T H E UNIVERSITY OF BRITISH C O L U M B I A August 26, 1996 © Adrian Y. W. Cheuk, 1996 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of C/)MPOT&-R- Sc/BAJCB-The University of British Columbia Vancouver, Canada Date 7?- ^ 3o DE-6 (2/88) A b s t r a c t In recent years, researchers in the A l domain have used Bayesian belief networks to build models of expert opinion. Though computationally expensive deterministic algorithms have been de-vised, it has been shown that exact probabilistic inference in belief networks, especially multiply con-nected ones, is intractable. In view of this, various approximation methods based on stochastic simu-lation appeared in an attempt to perform efficient approximate inference in large and richly intercon-nected models. However, due to convergence problems, approximation in dynamic probabilistic net-works has seemed unpromising. Reversing arcs into evidence nodes can improve convergence per-formance in simulation, but the resulting exponential increase in network complexity and, in particu-lar, the size of the conditional probability tables (CPTs) can often render this evidence reversal method computationally inefficient. In this thesis, we describe a structured simulation algorithm that uses the evidence reversal technique based on a tree-structured representation for CPTs. Most real systems exhibit a large amount of local structure. The tree can reduce network complexity by exploiting this structure to keep CPTs in a compact way even after arcs have been reversed. The tree also has a major impact on im-proving computational efficiency by capturing context-specific independence during simulation. Ex-perimental results show that in general our structured evidence reversal algorithm improves conver-gence performance significantly while being both spatially and computationally much more efficient than its unstructured counterpart. ii Contents ABSTRACT ii CONTENTS iii LIST OF TABLES v LIST OF FIGURES vi ACKNOWLEDGMENTS vii 1. INTRODUCTION 1 1.1 PROBABILISTIC REASONING AND BELIEF NETWORKS '. 1 1.2 STOCHASTIC SIMULATION AND DYNAMIC PROBABILISTIC NETWORKS 1 1.3 PROBLEMS AND SOLUTIONS 2 1.3.1 Accuracy 2 1.3.2 Compactness 3 1.3.3 Efficiency 3 1.4 OVERVIEW 3 2. BACKGROUND K N O W L E D G E 5 2.1 UNCERTAINTY 5 2.1.1 Types of Uncertainty 5 2.1.2 Basic Probability Notation 7 2.1.3 Bayes 's Rule 11 2.2 PROBABILISTIC REASONING SYSTEMS 12 2.2.7 Belief Network 13 2.2.2 Inference in Belief Networks 17 2.3 STOCHASTIC SIMULATION METHODS 18 2.3.1 Approaches 19 2.3.2 Algorithms 21 3. T E M P O R A L PROCESSES 30 3.1 DYNAMIC PROBABILISTIC NETWORK 30 3.2 EVIDENCE REVERSAL 31 3.3 ARC REVERSAL 35 4. C O M P A C T REPRESENTATION 38 4.1 TREE-STRUCTURED CPTs 38 4.2 STRUCTURED ARC REVERSAL 40 4.2.1 Example 41 4.2.2 Procedure 47 i i i 4.2.3 Analysis 55 5. EFFICIENT INFERENCE 59 5.1 QUERY-DIRECTED APPROACH 59 5.2 RELEVANT VARIABLES 60 5.2.7 The Network Level 61 5.2.2 The CPTLevel 63 5.3 SATISFIABLE QUERIES 67 5.4 ASSUMED LAST VALUES 68 6. IMPLEMENTATION AND RESULTS 69 6.1 EXPERIMENTS 69 6.2 RESULTS AND ANALYSIS 72 7. CONCLUSION AND EXTENSIONS 88 BIBLIOGRAPHY 90 iv L i s t o f T a b l e s TABLE 2.1: GLOBAL DEFINITIONS 22 TABLE 2.2: DEFINITIONS FOR LOGIC SAMPLING 22 TABLE 2.3: DEFINITIONS FOR LIKELIHOOD WEIGHTING 28 TABLE 6.1: SIMULATION PERFORMANCE UNDER DIFFERENT SETTINGS 70 TABLE 6.2: PERFORMANCE — CASE 1 74 TABLE 6.3: PERFORMANCE — CASE 2 77 TABLE 6.4: PERFORMANCE — CASE 3 80 TABLE 6.5: PERFORMANCE — CASE 4 83 TABLE 6.6: PERFORMANCE — CASE 5 86 v List of Figures FIGURE 2.1: A MULTIPLY CONNECTED NETWORK 14 FIGURE 2.2: MONTE CARLO SIMULATION APPROACHES TO INFERENCE IN BNS 20 FIGURE 3.1: GENERIC STRUCTURE OF A DPN 31 FIGURE 3.2: SCHEMATIC DIAGRAM OF EVIDENCE REVERSAL TRANSFORMATION FOR DPNS 32 FIGURE 3.3: REVERSING AN ARC BETWEEN ARBITRARY NODES I AND J 35 FIGURE 4.1: A SIMPLE TREE-STRUCTURED CPT 39 FIGURE 4.2: COMPACT REPRESENTATION OF CPT OF A 40 FIGURE 4.3: AN EXAMPLE SYSTEM 41 FIGURE 4.4: CONSTRUCTION OF NEW CPT FOR NODE J (VARIABLE O) 42 FIGURE 4.5: CONSTRUCTION OF NEW CPT FOR NODE I (VARIABLE A) 45 FIGURE 4.6: NETWORK AFTER REVERSING THE ARC BETWEEN A AND 0 47 FIGURE 4.7: BREAKDOWN OF TREE CHAINING 52 FIGURE 5.1: EXAMPLE NETWORK 63 FIGURE 6.1: NETWORK — CASE 1 73 FIGURE 6.2: CONVERGENCE BEHAVIOR — CASE 1 75 FIGURE6.3: NETWORK — CASE2 76 FIGURE 6.4: EXCERPT FROM EXECUTION RESULTS OF SLW — CASE 2 78 FIGURE 6.5: NETWORK AFTER SAR — CASE 3 79 FIGURE 6.6: EXCERPT FROM EXECUTION RESULTS OF SER — CASE 3 81 FIGURE 6.7: NETWORK — CASE 4 82 FIGURE 6.8: CONVERGENCE BEHAVIOR — CASE 4 84 FIGURE 6.9: NETWORK — CASE 5 85 FIGURE 6.10: CONVERGENCE BEHAVIOR — CASE5 87 vi Acknowledgments "I am the Alpha and the Omega, the First and the Last, the Beginning and the End. " — Revelation 22:13 In the beginning, the author, who was then more affiliated to the field of graphics, was less than ready to explore deeply into the field of (Al) simulation with his superficial knowledge of the field. During the research process of this thesis, there have been many struggles for the unseasoned author to try to understand various fundamental concepts. It is truly God's grace that the author could strive for his goal and in the end bring this thesis to its completion. The author hereby dedicates this thesis to the Lord, his God, acknowledging His unstinting help from beginning to end and bestowing the highest praise on Him. May His name be exalted and glorified, henceforth and forever! The author would also like to express his heartfelt thanks towards Professor Craig Boutilier, the supervisor for this thesis, for his precious advice and unfailing support throughout the research. Professor Boutilier has been providing the author with a handful of inspirational ideas that led to the development of this thesis to its full extent. He has also been very nice, (which, in the author's opin-ion, is quite remarkable for a professor and is highly appreciated), complying with his motto, "Be nice to people." May his benevolence continue to benefit students to come, and may his professorial ex-pertise further his career as well as research in the A l area. vii Special thanks also go to Professor David Poole. Although the author only knows him by name, Professor Poole kindly promised to be the second reader of this thesis. Without his generous help and professional comments this thesis would certainly not have been complete. One last note of thanks is reserved for the author's (high-school) alma mater, Wah Yan Col-lege, Hong Kong, where he learnt "how to treasure and keep for a lifetime all that's noble and true." Should chances arise that a Wahyanite will come across this thesis, he is reminded the following ex-cerpt from the School song: "In all that we do whether duty or pleasure, We count not the cost, but unselfishly strive; What's mean or unmanly we shun with displeasure; Come praise or come blame, we hold our heads high." With this "Wah Yan Spirit" in his heart, the author finally came through all difficulties and fulfilled his degree requirements. In the days to come, this Spirit will continue to live in his heart, and he will continue "to serve the Lord God in the love of our neighbour." May glory be with God, both now and forevermore. Amen! viii 1 . I n t r o d u c t i o n A central issue in A l has been to devise algorithms for an agent to make decisions under situations in the real world. In view of the presence of uncertainty in the decision process, a probabilistic reason-ing approach has evolved to represent uncertainty in reasoning systems. 1.1 Probabilistic Reasoning and Belief Networks The main advantage of probabilistic reasoning over traditional logical reasoning is in allowing the agent to reach rational decisions even when there is not enough information to prove that any action will "work" with certainty. Furthermore, the probabilistic approach has a clear theoretical basis com-bined with a theory of rational decision-making, an operational definition, a language for expressing uncertain dependence, and an ability to integrate diagnostic and predictive reasoning [HENR88]. The most promising current candidates for a coherent probabilistic representation appear to be the (Bayesian) belief network [PEAR88] and the closely related influence diagram [SHAC86]. In this thesis, we will use the belief network (BN) as a form of uncertainty representation. The interested reader may refer to Section 2.2.1 for an introduction to BNs. 1.2 Stochastic Simulation and Dynamic Probabilistic Networks The basic task for any probabilistic inference system is to compute the posterior probability distribu-tion for a set of query variables, given exact values for some evidence variables. In general, an agent gets values for evidence variables from its percepts (or from other reasoning), and asks about the 1 Chapter 1. Introduction 2 possible values of other variables so that it can decide what action to take [RUN095]. In order to an-swer such queries in a BN, various algorithms have been proposed and employed in practice. These inference algorithms, classified into three categories, either give an exact solution or an approximation of such. Since exact inference in BNs is known to be NP-hard [COOP90], for very large networks, and especially for dynamic probabilistic networks (DPNs), approximation using stochastic simula-tion (SS) is currently the method of choice. In Section 2.3 we will study the class of SS algorithms including Logic Sampling [HENR88], Markov Simulation [PEAR87], and Likelihood Weighting [SHPE90]. In Chapter 3 we will examine the application of SS in DPNs. 1.3 Problems and Solutions In view of the dynamic nature of decision processes we would like to use SS in DPNs so that queries can be answered in a timely fashion. In so doing we would also like to solve the following problems that are brought forth. 1.3.1 Accuracy The nature of simulation implies there is always an "error", or a deviation between simulation and the true behavior of the real-world process being simulated. While the DPN, being an extension of the BN, captures on the one hand the temporal aspect of real-world processes, it also magnifies the inher-ent errors of SS in BNs on the other, since "errors" in one time slice will be propagated to the follow-ing time slice and accumulated. Sequential error propagation will result in a significant deviation from the actual behavior of the process. We are interested in reducing this error to a minimum, thus Chapter 1. Introduction 3 bringing simulation "back on track". We will put our focus on the Evidence Reversal (ER) algorithm [KAKR95] in Section 3.2. 1.3.2 Compactness Associated with each node in a B N or DPN is a conditional probability table (CPT). In standard SS algorithms, a CPT is encoded using a tabular representation. This simple representation fails to cap-ture qualitative regularities in a CPT. In particular, there are independencies that hold only when spe-cific values are assigned to certain variables. We are interested in capturing this kind of context-based irrelevance and the advantages of so doing. In Chapter 4 we will focus on one particular qualitative representation scheme — tree-structured CPTs [BFGK96]. 1.3.3 Efficiency In order to perform probabilistic inference in BNs and DPNs, SS methods sample the values of vari-ables concerned. (Which variables are actually concerned, or relevant, will be discussed in Sections 5.1 and 5.2.) This requires the assignment of values to variables according to their probability distri-butions. As we would like to make inference in BNs and DPNs as efficient as possible, for any query we would like any SS method concerned to produce as accurate an answer using as few variable as-signments, or instantiations, as possible. In Chapter 5 we propose several ways to achieve this goal. 1.4 Overview Chapter 2 covers the background knowledge for BNs and inference in BNs, and surveys several infer-ence algorithms that have major significance. Readers familiar with BNs may proceed directly to Chapter 1. Introduction 4 Chapter 3, which focuses on DPNs and gives in detail the ER algorithm and the related Arc Reversal (AR) algorithm [SHAC86]. Chapter 4 introduces the Structured Arc Reversal (SAR) algorithm, de-tailing the construction of a CPT using a tree representation and the application of A R under such a representation. In Chapter 5 we present several techniques that, when combined with our Structured Evidence Reversal (SER) algorithm, will increase the efficiency of simulation and inference in DPNs. Chapter 6 features details of experiments conducted to examine the benefits that the tree-structured representation can bring to simulation and inference in DPNs. Conclusions and future work are pre-sented in Chapter 7. 2 . B a c k g r o u n d K n o w l e d g e The real world is complex and dynamic in nature. In order to model changes in the world, we can per-ceive the world as a state space where each state represents a possible situation or configuration of the world. Every state has a degree of usefulness, or utility, to an agent. In different states, the agent has to make different decisions on what actions to perform to maximize the expected utility. In view of this, substantial research has been conducted in the A l planning domain. 2.1 Uncertainty In the past, planning research has been unrealistic in that complete knowledge of both states and ac-tions have been assumed. The realization that problems are often associated with uncertain initial conditions and action effects leads to growing interest in uncertain or probabilistic reasoning. 2.1.1 Types of Uncertainty Uncertainty means that many of the simplifications that are possible with deductive inference are no longer valid. Below are three kinds of uncertainty described in [CHEU95]. Uncertainty of Action Effects Given a state, we would like to choose the action that will maximize the expected utility. However, there may be chances that the effect of a certain action is not what is actually wanted or predicted. This can be described by a matrix of transition probabilities — probabilities associated with the pos-5 Chapter 2. Background Knowledge 6 sible transitions between states after a given action. In the machine-maintenance example in [SMS073], if the alternative (i.e., action) "manufacture" is chosen, then the transition matrix for the machine is "0.81 0.18 0.0l" 0 0.9 0.1 . . 0 0 1 Given there is a 0.1 probability that any of the two internal components will break down during the manufacture of a product, this matrix represents all the transition probabilities under the action "manufacture" for a machine that begins the production cycle with zero, one, or two internal compo-nents that have failed and ends up in one of these states after the cycle. Partial Observability As mentioned before, an agent receives information about evidence variables through its percepts. These percepts, however, are not perfect in that they may not give complete, or even correct, informa-tion for the agent to determine the state in which it is. For example, an agent may have to decide whether to bring an umbrella by observing the weather, but its sensor may have a failure chance of 0.2 on rainy days. Thus, with a probability of 0.2 it may think it is sunny when it is actually raining. This partial observability is well captured in the partially observable Markov decision process (POMDP) [SMS073] where the maximum extent of uncertainty is taken into account during the deci-sion process. Chapter 2. Background Knowledge 7 Exogenous Events In systems where process-oriented problems [BOPU95][BOUT95][CHEU95] typically arise, a third source of uncertainty comes from the existence of exogenous events. They occur from time to time, changing the state of a system in certain ways independent of the agent's actions. An important class of such events is the class of user commands such as "coffee requests". (A user request can be thought of to cause facts like "there is an outstanding request to do X" to become true.) Often, these events are represented by evidence variables an agent has to observe to choose the next action, and usually these variables are not completely observable. 2.1.2 Basic Probability Notation Since an agent almost never has access to the whole truth about its environment, the agent's knowledge can at best provide only a degree of belief in relevant propositions. The main tool for dealing with degrees of belief will be probability theory, which assigns a numerical degree of belief between 0 and 1 to propositions. Prior Probability We use the notation P(A) for the unconditional or prior probability that the proposition A is true. For example, P(A) = 0.9 means that without any other information, the agent will assign a probability of 0.9 to the event of A being true. Chapter 2. Background Knowledge 8 A proposition is usually represented by an equality involving so-called random variables [PEAR88]. Each random variable X, has a domain Q, of possible values it may assume. For example, P{Xx=x{) = 0.7 tells us the prior probability of X\ having the value xu for some xx e Q.u is 0.7. We shall hereafter use the short notation P(xt) for the probabilities = xi), xt e Q, for some random variable Xh and we will write P(xj) for the probabilities P(Xj = xj), xj e Q.j for some set of variables Xj. In the latter case, Xj is a configuration [PEAR88] of Xj, and Qy is the cross product space of the domains of all variables in Xj. Conditional Probability In the presence of other information (evidence), prior probabilities are no longer applicable. Instead, we use conditional or posterior probabilities, with the notation P(A I E), meaning "the probability of A given all we know is B." For example, P(xi I x2) = 0.6 means that given only that X2 has a value x2, the probability of Xi having the value is 0.6. Probability Distribution When we talk about the probabilities of all the possible values of a random variable X„ we use the no-tation P(X,) to denote X,'s probability distribution — a vector of values for the probabilities of each state of Xt. For example, P(X,) = <0.5,0.3,0.1,0.1 > Chapter 2. Background Knowledge 9 tells us the probabilities of each of the four states of Xx. We may also write P(X a, ...,X,„) to denote the probabilities of all combinations of the values of a set of random variables, getting a value for each element in the cross-product space Q ( 1 x . . . x Q,„. In the case where il = 1 and in = n, we have a joint probability distribution (or "joint" for short) which completely specifies an agent's probability as-signments to all propositions in the domain. From the joint we can compute any probabilistic state-ment in the domain by expressing the statement as a disjunction of elementary events1 and adding up their probabilities. For example, if we have a domain comprising two random variables X and Y, and the joint for the domain can be represented by the table Y —>Y X 0.04 0.06 -nX 0.01 0.89 then P(X) can be computed by P( X, Y) + P( X,-,Y) = 0.04 + 0.06 = 0.1. In general, the probability of any variable Xt having a certain value x, can be computed by />(*,) = ! > ( * „ * > , ) , where w, ranges over the possible states of all variables except Xt. While the joint contains all information we need about the domain, it is in general not practical n to define all the entries for the joint over n variables. Representations that address this prob-1=1 lem will be discussed in the next section. Hereafter we will focus on discrete variables, but it should A n elementary event [ P E A R 8 8 ] is a conjunction i n which every atomic proposition appears once. Chapter 2. Background Knowledge 10 be noted that continuous variables can be described using representations such as Gaussian conditional densities [PEAR88]. Conditional Independence Conditional independence is crucial to making probabilistic systems work effectively. It is therefore beneficial to state its formal definition below [PEAR88]. Definition 2.1: Conditional Independence Let XN be a finite set of variables with discrete values. Let P (•) be a joint probability function over the variables in XN, and let Xi, Xj, and XK be any three subsets in XN. XI and Xj are said to be condi-tionally independent given XK if P{xi | Xj, XK) = P(X/\ XK) whenever P(xj, XK) > 0 forallx/G Q/, xj& Qj, andAXre QK-m Intuitively, what the definition says is that given three sets of variables Xh Xj, and XK, if P(X/1 Xy_ XK) = P(X/1 XK), then we say Xj and Xj are conditionally independent given XK. We will use the notation I(X,,XK,Xj)P, or simply I{XhXK,Xj), to denote the conditional independence of X[ and Xj given XK; thus, I(X,,Xk,XJ)P iff P(Xi I xj, xk) = P(xi I xk) for all values x-t, xj, and xk such that P(xj, xk) > 0.2 2 The notation /(X, Z, Y)P renders the probability distribution P a dependency model, i.e., a rule that determines a subset / of triplets (X, Z, Y) for which the assertion "X is independent of Y given Z" is true. Chapter 2. Background Knowledge 11 2.1.3 Bayes's Rule We would often like to compute unknown probabilities from known ones, especially when we have conditional probabilities based on causal relationships and want to derive a diagnosis. (See Section 2.2.2 for the nature of probabilistic inferences.) For example, when reversing the arc (Section 3.3) between two arbitrary nodes / and j in a B N (Section 2.2.1), we have to compute P(X, I Xy) from F(Xj I Xi), P(JQ and P(A}). Bayes's rule is employed for this purpose. We can derive Bayes's rule from the product rule: P(AAB) = P(A\B)P(B), ( 2 ] ) where A and B are propositions and A is the logical-AND operator. It stems from the fact that for A and B to be true, we need B to be true, and then A to be true given B. From ( 2.1 ) we have P(A A B) . - - , P(A\B) = — - , (2-2) P(B) which is just the definition of conditional probabilities in terms of unconditional probabilities. By interchanging the variables in the proposition in ( 2.2) and applying the product rule, we get P(B\ A)P(A) , - 3 . P(A\B) = — ,—^-L. (2.3) P(B) This equation is known as Bayes's rule. We can also use the P notation for probability distributions over random variables Xt and Xy. Chapter 2. Background Knowledge 12 P X , I X , P X , . 2.4 P(X,. IX,) = — . ^ ' >' P(X.) If we have some background evidence in addition to B, we will have to use a more general version of Bayes's rule. By replacing B with BAE in ( 2.2) we can derive the following: P(AABAE) P(BAAAE) P(B\AAE)P{AAE) P(A\BAE) = — - = — - = — — P(BAE) P(BAE) P(BAE) P(B\ A A E)P(A\ E)P(E) P(B\ A A E)P(A\ E) P(B\E)P(E) ~ P(B\E) Using the P notation for random variables X, and Xj and the set of evidence variables XE, we have P(X, IX. ,X £ )P(X. IX, ) V ' 1 E > P(X. IX £ ) 2.2 Probabilistic Reasoning Systems Although the joint probability distribution can answer any question about the domain, it becomes in-tractably large as the number of variables grows. In view of this, a number of inference algorithms have been proposed to make probabilistic inference efficient in many practical situations. The under-lying inference mechanism centers around the BN, a data structure used to represent the direct de-pendence between variables and to give a concise specification of the joint probability distribution. In the following section, we describe the B N based on the literature in [PEAR88]. Chapter 2. Background Knowledge 13 2.2.1 Belief Network A B N is a directed acyclic graph (DAG) 3 , G = (N,A), consisting of a set of nodes N and a set of di-rected arcs A . Each node i in N represents a variable X, in the system. X, may be binary or multi-valued. A directed arc exists between i and j for each node j that has a direct causal influence on node i. The strengths of these influences are quantified by conditional probabilities stored in the CPT, or link matrix, of /; the link matrix P(X, I XC(,)) describes the conditional distribution of X, given differ-ent configurations of XC(,> the set of variables perceived to be direct causes of X,. Dependence Semantics The semantics of a B N postulates a clear correspondence between the topology of a D A G and the de-pendence relationships it portrays. This correspondence is based on a separability criterion called d-separation. Definition 2.2: ^-separation If X, Y, and Z are three disjoint subsets of nodes in a DAG D, then Z is said to ^/-separate X from Y, denoted < X \ Z\ Y>D, if there is no path between a node in X and a node in Y along which the following two conditions hold: 1. Every node with both path arrows leading in is in Z or has a descendent in Z, and 2. Every other node is outside Z. 3 A DAG is a graph that contains directed arcs but no uni-directional cycles. This implies the total number of arcs in a DAG cannot exceed n{n-\)!2, where n is the number of nodes in the graph. Chapter 2. Background Knowledge 14 Figure 2.1: A multiply connected network A path satisfying the above conditions is said to be active; otherwise, it is said to be blocked by Z. [RUN095] puts this definition in a more direct way: a set of nodes Z ci-separates two sets of nodes X and Y if every undirected path4 from a node in X to a node in Y is blocked given Z. A path is blocked given a set of nodes Z if there is a node i on the path for which one of three conditions hold: 1 . i is in Z and i has one arrow on the path leading in and one leading out. 2 . / is in Z and i has both path arrows leading out. 3 . Neither i nor any of its descendents is in Z, and both path arrows lead in to /. Taking the example in [PEAR88], we can see in Figure 2.1 that X = {2} and Y= {3} are J-separated by Z = {1}; the path 2 <— 1 —> 3 is blocked by 1 e Z, and the path 2 —> 4 <— 3 is blocked because 4 and 4 An undirected path is a path through the network that ignores the direction of the arrows. Chapter 2. Background Knowledge 15 all its descendents are outside Z. X and Y are not ti-separated by Z' = {1, 5}, however, because the path 2 —> 4 <r- 3 is rendered active: learning the value of node 5 makes nodes 2 and 3 dependent. Together with the above definition, the following two definitions help define a B N from a de-pendence perspective. Definition 2.3:1-map A DAG D is said to be an /-map of a dependency model M if every ci-separation condition displayed in D corresponds to a valid conditional independence relationship in M , i.e., if for every three dis-joint sets of nodes X, Y, and Z we have <x \ z\ v>D^i(x,z, y)m. A D A G is a minimal /-map of M if none of its arrows can be deleted without destroying its /-mapness. • Definition 2.4: Bayesian Network Given a probability distribution P on a set of variables XN, a DAG D - (N, A) is called a Bayesian Network of P iff D is a minimal /-map of P. • Given the above definitions, we see that all conditional independencies portrayed in a B N (by way of d-separation) are valid in the underlying distribution P and no two independencies therein are redundant. Chapter 2. Background Knowledge 16 Joint Representation The usefulness of the BN, and indeed any inference network, rests on the assumption that knowledge is decomposable and can be represented by a relatively sparse graph, where each variable is directly influenced by only a few other variables [HENR88]. If a topological ordering of nodes is maintained, the B N serves as a correct decomposed representation of the joint for the domain. This can be shown by first considering a generic entry in the joint and rewriting it using the definition of conditional probability: By successively reducing each conjunctive probability to a conditional probability and a smaller con-junction, we obtain / >(x ],..., Xn ) — P(xn I X n _ | , . . . , Xj ) / 3 ( X n _ ] , . . . , Xj) . P(xl,..., xn) = P(xn I x„_ , , . . . , x, )P(xn_, I xn_2,..., X, )• • • P(x21 x, )P(x,) n = I l / > ( * i l * « - l . " - » * l ) 1=1 Given the conditional independencies encoded in the BN, we observe that P(x,.lx._1,...,x1) = P(x . lx C ( 0 ) , provided that C(i) c {1,2, ...,/-!} for all i 6 N. As a result, we have n P(x, 1' •••^n) = YlP(xi\xC(i)), (2.6) Chapter 2. Background Knowledge 17 provided that C(i) cz {1, 2, . . . , i-1} for all i e N. This shows that a B N represents each entry in the joint by the product of the appropriate elements of the CPTs in the network given a topological node ordering (which can easily be satisfied by labeling the nodes in any order that is consistent with the partial order implicit in the graph structure). In essence, the CPT provides a decomposed representation of the joint, thus making the B N more compact than the full joint especially in locally structured systems [RUN095], where each variable is directly influenced by at most k others, for some constant k « n. Assuming a variable has at most q values, the amount of information needed to specify the CPT for a node will be at most qk numbers, so the complete network can be specified by nqk numbers. The joint, on the other hand, contains q" numbers. 2.2.2 Inference in Belief Networks There are two major types of inference we would like to perform. Predictive or causal inference in-volves reasoning from evidence about root nodes down through the network in the direction of the arcs to the leaf nodes, e.g., from a disease to its symptoms. Diagnostic inference involves reasoning in the reverse direction, e.g., from observations of symptoms to diseases. A coherent probabilistic inference scheme can support both predictive and diagnostic inference and combinations of the two, according to what evidence is available and what hypotheses are of interest [HENR88]. While such algorithms do exist for singly connected networks (e.g., Pearl's polytree algorithm), our main concern is how we can make accurate and efficient inference in multiply connected networks. A multiply connected Chapter 2. Background Knowledge 18 graph is one in which two nodes are connected by more than one path. Figure 2.1 is an example of such a graph, where the parents of node 4 share a common ancestor (node 1). There are three classes of algorithms for evaluating multiply connected networks [PEAR88]: • Clustering • Conditioning • Stochastic simulation The first two yield exact solutions by transforming the network while the third gives an approximation of the exact solution by generating a large number of sample models consistent with the network dis-tribution. Since exact inference in belief networks is known to be NP-hard (and so is approximate in-ference), for very large networks, the appropriate choice of evaluation algorithm would be approxima-tion using SS. The next section will cover several SS algorithms of significance. 2.3 Stochastic Simulation Methods Currently there are quite a number of belief network inference algorithms that belong to this category. In this sampling-based simulation approach, conditional probabilities and other statistics are estimated by recording the fraction of times that events occur in a random series of sample instantiations of the network where each variable is assigned a particular deterministic value. The size of the sample set governs the accuracy of simulation: the more samples (or "scenarios" [HENR88]) we have, the more accurate the approximation. Chapter 2. Background Knowledge 19 2.3.1 Approaches [HENR90] classifies these algorithms into two major sub-categories: • forward propagation • Markov simulation The major difference between the two approaches lies in the way the nodes are updated. The first method sequentially updates all variables in each simulation run according to some topological order-ing of the nodes in the network. The instantiation of a variable then depends on the current instantia-tions of its parents. The second method, however, performs in each simulation cycle only local nu-merical computations for a variable; variables are updated in a random order in successive cycles and, in each cycle, depend on the instantiations of neighboring node variables in the previous cycle. Figure 2.2 shows a diagrammatic classification of various simulation algorithms. Forward Propagation In the forward propagation (or forward simulation) approach, each instantiation is created by follow-ing the influence arrows in the network starting from the root nodes. In other words, the sequence of instantiations follows a particular topological node order consistent with the graph structure. In this way, we can make sure all nodes are instantiated before any other nodes dependent on them. This criterion in turn ensures that the process of sampling will continue until all nodes in the network are instantiated. Chapter 2. Background Knowledge 20 Monte Carlo or Stochastic simulation Arc reversal and node elimination Shacter, 1986, 1988 Evidential Integration Forward propagation (Incidence calculus: Bundy, 1985) Probabilistic logic sampling Henrion, 1986 Markov simulation (Gibbs sampling) Stochastic simulation Pearl, 1987 Likelihood weighting Likelihood weighting with Markov blanket scoring Fung & Chang, 1989 Shacter & Peot, 1989 Stochastic simulation for DPNs Kanazawa, Koller, Russell, 1995 Figure 2.2: Monte Carlo simulation approaches to inference in BNs. (Adapted from [HENR90]) Italics refer to articles examined in detail in this thesis. Essentially we completely sweep through the network in each simulation run, instantiating all nodes from top to bottom according to their specified priors (if they are source nodes) or conditional probabilities (if otherwise). The sampled values of the nodes are then recorded. After a specified number of runs, diagnostic inference is performed by estimating the probability of a hypothesis as the fraction of simulations that lead to the observed set of evidence. In Section 2.3.2, we will examine the Chapter 2. Background Knowledge 21 simplest yet most fundamental forward propagation scheme — Henrion's probabilistic logic sampling [HENR88] — on which many later algorithms are based. Markov Simulation In Markov simulation (originally developed by Pearl [PEAR87]; see Section 2.3.2), propagation is in either direction along arcs, generally in a random sequence, and is localized within the Markov blanket of each variable. The Markov blanket consists of the variable's parents, children, and children's par-ents, and it shields the variable from the rest of the network. In each simulation cycle, a variable is chosen at random and its conditional distribution given all its Markov neighbors is first computed. (This means the instantiation of a variable is influenced by the previous instantiations of its neigh-bors.) The distribution is then sampled and the variable instantiated accordingly. As simulation con-tinues, the probability of each node can be estimated on demand as the fraction of simulation cycles for which it is true. 2.3.2 Algorithms We will examine some of the more influential SS algorithms below. The notations used are defined in Table 2.1. Probability notations follow those in Section 2.1.2 with short forms as in equation ( 2.6 ). For example, P(xj\\xJ2) is the short form for P{Xj\=xn\ XJ2-xJ2), for some 71,72 cz N. Logic Sampling [HENR88] Table 2.2 gives additional definitions with respect to arbitrary propositions A and B. Since the algo-rithm assumes the use of binary variables, the logic sample L(X,=1) automatically records both the Chapter 2. Background Knowledge 22 Table 2.1: Global definitions Symbol 1 Definition Description N = {1, •••,"} set of all n nodes in the system, for some n e N E c N set of evidence nodes C(0 CZ N set of conditional predecessors of node i for some i e N s(0 N set of direct successors of node i for some i e N Xi variable associated with node i for some i e N set of possible values X, may assume, for some / e N Xi G an instantiation of X„ for some i e N Xj = (Xji, X/„) set of variables associated with / for some J = fj'l, .. .,jn} c N = Q/l x ... x Q , „ cross-product space of J for some J = (j'l, . . . , jn} e N Xj e an instantiation of Xj, for some 7 c N Table 2.2: Definitions for Logic Sampling Symbol / Definition Description J1, if x is true in scenario k Lk{B) = j o , otherwise truth of B in scenario5 k for some k e [1, m], m G N Lk(B\A) = (Lk(A) => Lk{B)) truth of B conditioned on A in scenario k for some&e [l,m], me N L(B) = [L\(B), Lm(B)] logic sample6 of 73 over all m scenarios for some me N truth and falsehood of X, over all scenarios. For convenience, here we will use x, to represent the truth of variable X,; thus L(x,) denotes the logic sample of the proposition X, = 1. This scheme uses an uninstantiated belief network as a scenario (or possible world) generator which, in each scenario k, assigns random values to all system variables. Starting with priors specified 5 The term "scenario" can be related to the more commonly used term "sample" that reflects one of the many possible worlds in the domain. In Henrion's context, a scenario is a set of truth values for those propositions concerned. 6 Note the difference between the sample and the logic sample. The former contains the multi-valued states of all variables in one simulation run; the latter contains the binary states (truth or falsehood) of one proposition over all runs. Chapter 2. Background Knowledge 23 for all source variables and conditional distributions for all others, a random number generator is used to generate a sample truth value L^(x,) for each source variable X, and a conditional truth value Lic(xj\xc(j)) for all XCQ) e &C(j), f ° r each non-source variable Xj. Then by following the arrows from the source nodes, the truth of each variable is obtained using the following logical operation: repeats itself m times, after which a logic sample L(x,) is obtained for each variable Xt. Belief distri-butions are then calculated by averaging the frequency of events over those cases in which the evi-dence variables agree with the data observed. The algorithm is as shown in Algorithm 2.1. After logic samples are obtained, we can estimate the prior probability of any simple or com-pound event x by the truth fraction of its logic sample: In other words, T[L(x)] is the proportion of scenarios in which x is true. We can also estimate the conditional probability of any event x conditioned on any set of ob-served evidence y by (2.7) where v is the logical OR operator v defined over some finite set N. 7 The simulation process then 7 This is the deterministic counterpart of the probabilistic chain rule: P(Xj ) = 2~< P(xj' xc(j) ^^xC(j)) xC{j)eac(j) Chapter 2. Background Knowledge 24 Algorithm 2.1: Logic Sampling / * We s t a r t w i t h a b e l i e f n e t w o r k w i t h P(Xt) s p e c i f i e d V r o o t v a r i a b l e s Xi & P I Xc(j)) s p e c i f i e d V n o n - r o o t v a r i a b l e s Xr */ procedure Logic_Sampling( priors, CPTs ) / * R e p e a t m t i m e s t o o b t a i n HxL) f o r e a c h v a r i a b l e Xi */ for k = 1 to m for each X, e {root variables} / * Use a random-number g e n e r a t o r t o " f l i p t h e c o i n " * / Produce Lk(xi) according to P(X,); / * Sweep t h r u t h e n e t w o r k f o l l o w i n g t h e a r r o w s f r o m t h e r o o t n o d e s t o g e t a t r u t h v a l u e f o r e a c h n o n - r o o t v a r i a b l e Xt. */ for each Xj e {non-root variables} Lk(xj) = 0; for each xC(j) e Ocy) Produce Lk(xj I xc(j)) according to F(Xj I xcq)); Lk(xj) = Lk{xj) v Lk{xj I xC(j)) A Lk(xC(j)); Lk(x)ALk(y) P(x\y) = P(x, y) _ T[L(x A y)] _ T[L(x) A L(y)] P(y) T[L(y)] T[L(y)] £ An obvious advantage of logic sampling is that logic samples can be efficiently represented using bitstrings. Al l operations (i.e., equation ( 2.7 )) to compute logic samples for derived variables and compound events then involve simple Boolean operations on these bitstrings. Moreover, for a given level of precision, the complexity 0(mn) is linear in the number of nodes, irrespective of the degree of connectedness of the graph. Since all the sample instantiations are independent, it is also possible to estimate the precision of probability estimates as a function of the sample size using stan-Chapter 2. Background Knowledge 25 dard statistical methods: the standard deviation 0" in probability estimates is inversely proportional to [HENR88]. As the required precision increases, however, we experience a quadratic increase in m ( ° c l/o 2), and hence complexity. The main problem, however, arises when probability estimates are desired given some ob-served evidence. In logic sampling there is no way to account in advance for evidence known to have occurred until the corresponding variables are sampled. If the sampled values conflict with the ob-served evidence, then the sample has to be discarded. For rare combinations of evidence, this may result in an excess number of simulation runs. In fact, the complexity of logic sampling is exponential in the number of observed evidence variables [HENR88][HENR90]. Despite its drawbacks in the complexity aspect, logic sampling is of importance in its concept of sample generation which leads to many later algorithms. Stochastic Simulation [PEAR87] In view of the negligence of evidence in logic sampling, Pearl devised a two-phase Markov simulation algorithm which emphasizes local numerical computation and clamps the evidence variables to the values observed. It first computes the conditional distribution for some variable X given the states of all its neighboring variables. Then it samples the distribution computed in the first step and instanti-ates X to the value selected by the sampling. The cycle then repeats itself by sequentially scanning through all the variables in the system. The algorithm is described in pseudo-code in Algorithm 2.2. A variable superscripted by an asterisk (*) represents a given value of the variable; when a variable is topped by a caret (A), it means the current sampled value of the variable. Chapter 2. Background Knowledge 26 Algorithm 2.2: Pearl I* G i v e n : P(X,) V r o o t v a r i a b l e s X , , P (X, | X c l J ) , ) V n o n - r o o t v a r i a b l e s X}, Xc = x\ f o r some x\e ii, */ procedure Pearl( priors, CPTs, x*E ) Instantiate Xe to x*E; / * I n i t i a l i n s t a n t i a t i o n by f o r w a r d s i m u l a t i o n * / for each ie N \ E / * R e s e t a r r a y s f o r u s e w/ p r o b a b i l i t y e s t i m a t i o n * / for each x, e Q, SUmi(xi)= P(x / lx c ( ( . )); / * o r P U J i f X c l i ) = 0 * / count, = 1; / * Sweep t h r u t h e n e t u s i n g a r andom-number g e n e r a t o r t o " f l i p t h e c o i n " * / Instantiate X,- according to PCX",) or P(X, I Xc(,)); for ever Choose some i e N \ E; for each x, G Q, / * Compute pfXj) u s i n g M a r k o v b l a n k e t * / P(xt I x m i ) ) = P(x. I x m ) II p O , 1 *CU)) '» ;'sS(i) sumi(xi) = sumi(xi) + P (x , l x M ( l ) ) ; Normalize P(X. Ix m ] ) so that X^(XI*AA(,-))= 15 countj = counti + 1; Flip the coin for X, using P(X,. I xm[i]); if (query) / * E s t i m a t e P (xJx' E ) f o r some q e N , x, e £ \ * / Input x g ; /? = sumq{xq) I countq; Output p; The advantage of Markov simulation is that the algorithm can be implemented as a network of parallel processors simulating the happening of concurrent events. In general, however, if the network Chapter 2. Background Knowledge 27 contains conditional probabilities close to 0 or 1, convergence will be very slow [SHPE90] [HENR90]. Moreover, since successive cycles in Markov simulation are not independent, the simulation can get trapped in particular states or sets of states [HENR90]. Another limitation is that we can only estimate the conditional probabilities for single vari-ables but not the joint probabilities of sets of variables. In the algorithm we have already used as many as accumulators (i.e., the array sum) to refine P(xi\x*E) for all x, e Q„ i e N. Yet this only accounts for the conditional probabilities for each value of each variable considered separately. It is infeasible, however, to account for the joint probabilities of J for all J cz N using time and space proportional to ^ # Q y where k = ^ n C r is the number of possible subsets in N. 7=1 r=l Likelihood Weighting [SHPE90]8 Shacter and Peot extended logic sampling by assigning weights, or scores, to the samples generated. Instead of discarding samples that conflict with the observed evidence, each sample is weighted by the joint likelihood of the observations conditioned on their unobserved predecessors. Table 2.3 intro-duces some new definitions. Essentially we are assigning a score, Z, to any given sample sk selected from the joint distribu-tion of X N . The score is equal to the original probability of sk divided by the probability of selecting sk, i.e., generating sk by likelihood weighting. Thus, [FUCH90] proposed the same algorithm, which they called "evidence weighting". Chapter 2. Background Knowledge 28 Table 2.3: Definitions for Likelihood Weighting Symbol / Definition Description $k = [ %\) • • • > Xn ] the k-th sample Zk e <K score of the k-th sample Z(Xi=Xi) e Si cumulative score over those samples in which Xj=xi, for some i e N, xi e Qi P'sAxl) Z k P(selecting^lx;) f l ^ W ~ ie £ The algorithm, which they called Basic, is shown in Algorithm 2.3. If we have a set of query variables XQ and we want to find out the conditional probability for Xq - xq for some xQ e £lQ, Q e N, we can make an estimation by normalizing Z(Xq - xq): P(xq\x*e)s* Z(X0 — x0) 1: k=l (2.8) This step can also be incorporated into the Basic algorithm in a similar fashion as Markov simulation so that simulation can be interrupted at any time to return the best quality answer thus far. Just like Markov simulation, likelihood weighting (LW) is suitable for parallel processing ar-chitectures. It also avoids the problem with logic sampling where samples inconsistent with the evi-dence are discarded. Unlike Markov simulation, however, likelihood weighting can also account for the joint probabilities of sets of variables rather than just the conditional probabilities for single vari-ables. We can simply compare xQ with the values of the variables in XQ over all samples, calling for Chapter 2. Background Knowledge 29 Algorithm 2.3: Basic I* G i v e n : P I X , ) V r o o t v a r i a b l e s Xt, P ( X j X C U ) ) V n o n - r o o t v a r i a b l e s X]t XB = x\ f o r some x ' B 6 CiE * / procedure Basic( priors, CPTs, x*E ) Instantiate Xe to x*E; for k = 1 to ra / * sweep t h r o u g h by f o r w a r d s i m u l a t i o n * / for each i e N \ E Flip the coin for Xi according to P(X,-) or P(X, I XQ,)); Add x(. to 5^ ; / * Compute t h e s c o r e f o r t h e k-th s a m p l e * / Zk =llp(**'*<:<.•))'. only m#<2 comparisons. For example, if we want to estimate P(X2=true, X?=true I X5 = true) in Figure 2.1, assuming all variables are binary, then we can go through each sample (with X5 clamped to true) and compare the values of X2 and X 3 with the query. If the values match in sample k, then we add Zk to Z(X2=true, Xj=true); otherwise, the next sample is considered. Assuming there are m = 1000 samples, we only need 1000 x 2 = 2000 comparisons in total. There are cases where likelihood weighting performs quite poorly. When the likelihood prod-uct (score) varies greatly among samples most of the samples are effectively ignored, and therefore additional samples must be taken. [SHPE90] suggested we can reverse the arcs into those evidence nodes most responsible for the variation in likelihood to remedy the situation. [SHAC86] gave a de-tailed theoretical derivation for inference with arc reversals. We will examine its application in DPNs in the next chapter. 3. Temporal Processes While a B N is an attractive approach for representing uncertain expert knowledge in a coherent prob-abilistic form, it has seldom been focused on the temporal aspect of real-world processes. In order to capture this dynamic nature, we use a data structure called the dynamic probabilistic network9 (DPN). 3.1 Dynamic Probabilistic Network The DPN can be thought of as a chain of BNs joined by temporal links (Figure 3.1). Each unit of the chain is a time slice representing a snapshot of the evolving temporal process. If we have n time slices, we will have n BNs chained together, in a way such that nodes in time slice t are connected only to nodes in time slice t + 1 and other nodes in slice t. In other words, the history of the process can be ignored; or more precisely, information at time t summarizes all relevant features of the world. This property is called the Markov property. For temporal projection (i.e., prediction from the state at some time into the future), this property allows us to focus on only two time slices at a time, moving the "window of focus" in one-slice increments. 9 The idea first appeared in [DEKA89] and was reiterated in [KAKR95]. It was called dynamic belief network (DBN) in [RUN095]. 30 Chapter 3. Temporal Processes 31 STATE EVOLUTION MODEL / 4 SENSOR MODEL Figure 3.1: Generic structure of a DPN (adapted from [KAKR95]). In an actual network, there may be many state and sensor variables in each time slice. 3.2 Evidence Reversal Standard simulation algorithms such as those in Section 2.3.2 often give fast, accurate approximations to posterior probabilities in belief networks, and are the methods of choice for very large networks. However, when we come to temporal processes, these algorithms sometimes perform very poorly. In essence, the simulation trials diverge further and further from reality as the process is observed over time. In the case of logic sampling, because it discards trials whenever a variable instantiation con-flicts with observed evidence, it is likely to be ineffective in DPN-based tasks where evidence is ob-served throughout the temporal sequence.10 As for likelihood weighting, a straightforward application generates simulations that simply ignore the observed evidence and therefore become increasingly ir-relevant. This is especially true in systems where the state evolution model is weak (i.e., more or less random). Although uncertainty in observation may be very low, the samples are nonetheless evolved according to the state evolution model. A weak model (which is typical in systems where autonomous 1 0 On the contrary, logic sampling is very effective for projection because no evidence is observed in future slices. Chapter 3. Temporal Processes 32 Figure 3.2: Schematic diagram of evidence reversal transformation for DPNs: A is the state evolu-tion model; B is the sensor model; C and D are the new CPTs we have to compute. (Adapted from [KAKR95]) agents are monitored) means the sample distribution will get more random over time, unrelated to the observed evidence. As a result, the weighting process will assign extremely low weights to (or even discard) almost all the samples. Only a very small number of samples closest to the true state will ob-tain relatively high scores and thus dominate the estimated distribution. The effective number of samples therefore diminishes rapidly over time, resulting in large estimation errors. In general, forward simulation methods will ignore the actual observations made when de-termining the sample state trajectory and future observations. In Figure 3.1 this is due to the fact that the causal direction of influences does not permit our observations (percepts) to bias the state samples. As a remedy, [KAKR95] adopted Shacter's suggestion and added an arc reversal (AR) step to likeli-hood weighting for use with DPNs. However, instead of reversing all the arcs into the evidence over time, we simply reverse the arcs within slice t, so that the evidence at t and the state at t - 1 become the parents of the state at time t (Figure 3.2)." This is possible because each sample, once it has instanti-1 1 The idea is based on the evidential integration method proposed in [FUCH90]. In evidential integration only the arcs be-tween the evidence and its parents are reversed (in non-temporal BNs), thus "partially" integrating the evidence into the network. Chapter 3. Temporal Processes 33 ated variables in time slice f - l , d-separates all preceding time slices from the state at time f. After AR, the current evidence becomes a parent of the current state; therefore, it can influence the process of extending the samples to the state variables at f. Consequently, the sample population can be re-positioned closer to reality according to the observed evidence. In Chapter 6 we will show in an ex-perimental case how convergence may be improved by reversing arcs into evidence. Algorithm 3.1 shows the evidence reversal algorithm (which is modified from the likelihood weighting algorithm). Where necessary we have added time arguments ranging from 0 to T, for some T e N. If we want to know P(xQ(t)\x*E(t)) for some xe(f) e Q.Q, t e [0,7], we can apply equation ( 2.8 ) to time slice f. Note that we assume XE remains constant over time, i.e., the set of evidence vari-ables is fixed. This is in general the case for decision-theoretic agents since they base their actions on the same set of observable events over time (i.e., their perceptual inputs). If the opposite is true, how-ever, then the arc reversal step (marked with / * ( 1 ) * /) has to be placed inside the loop (at the place marked / * ( 2 ) * / ) . This means for each time slice we have to reverse arcs into the evi-dence variables in that time slice, thus making the simulation process much slower. From the pseudocode we see [KAKR95] laid out their algorithm in such a way that m partial samples are generated and recorded in one slice followed by another m partial samples in another slice. While this can be made to work (though there is insufficient detail in [KAKR95]), it is not clear what advantages this approach offers. On the contrary, a considerable amount of work has to be done to book-keep each sample since the temporal process is repeatedly sampled in its entirety. In Section 5.1 we will present the evidence reversal algorithm in a query-directed approach that will alleviate this problem. Chapter 3. Temporal Processes 34 Algorithm 3.1: Likelihood weighting with evidence reversal (adapted from [KAKR95]) /* Given: CPTs V variables; we assume the CPTs to remain constant over time */ /* Modified from BASIC */ procedure LIKELIHOOD_WEIGHTING{ CPTs) /* Time slice 0 * / Obtain x* (0); Instantiate XE to x*E(0); /*(!)*/ Reverse arcs into E; for k = 1 to m Sweep through the net, flipping the coin for each variable; Add xN (0) to Sk, I* Compute the score for the k-th sample at time 0 * / w*=nm(0)i*c(,)(°)); Zk(0) = wk; /* Time slices 1 to T */ for t = 1 to T Obtain xE(t); Instantiate XE to x*E(t); / * ( 2 ) * / for k = 1 to m Sweep through the net, flipping the coin for each variable; Add (t) to Sk', /* Compute the score for the k-th sample at time t */ wt = 11 (^01^0(0); ; < E £ Zk(t) = Zk(t-l) + wk; Chapter 3. Temporal Processes 35 Figure 3.3: Reversing an arc between arbitrary nodes i and j . (Adapted from [SHAC86]) 3.3 Arc Reversal The arc reversal step is critical in that it uses the current evidence to bring the simulation closer to reality. [SHAC86] proved that for all i,j e N, arc (/, j) can be replaced by arc (j, i) if there is one and only one arc (/', j) between nodes i and j. After the reversal, both nodes inherit each other's parents. More formally, C™(j) = Cold(/)uCold(j)\{/} ( 3 1 ) c „ e w ( . ) = c „ e w 0 . ) u { y } Figure 3.3 is a pictorial representation of the arc reversal step. Besides the parents, the CPTs for i and j are modified as follows: Chapter 3. Temporal Processes 36 P(x J .lx c o M ( , ))P(x,.lx c 0 | J ( , )) ( 3 J ) for all Xj G Q.,jc-e Q . , x n e w e Q. ^ . Putting this into our DPN, we are essentially using A and B to compute C and then D in Figure 3.2. Equation ( 3.2 ) lets us compute each entry in C while D is computed by equation ( 3.3 ), which is just the Bayesian inversion formula, or Bayes's rule. (The de-nominator can be seen as a normalizing constant if we substitute the first equation into the second.) To be more specific, let Xs', Xs, and XE be State(t-\), State(f) and Percept(t) respectively. Then we have, for each entry of C, P(xE\x's)= ^P(xE\xs)P(xs\x's), and, for each entry of D, / N . P(xE^XS)P(xS^XS) P(xs\xE,xs) = ^ - | - j r • The A R process is then complete. A major drawback of AR is the increase in size of CPTs, proportional to the size of the cross-product space of the domains of both nodes' parents. This is a direct consequence of equation ( 3.1 ), where the parent set of each node becomes the union of both parent sets. Although in our discussion above we only showed one state variable Xs and one percept XE, the expansion of D is trivial. In gen-eral, more than one state variable may be present, and there may be an n-to-1 interdependency between Chapter 3. Temporal Processes 37 the two time slices (i.e., a variable in time t may depend on more than one variable in time t - 1). Similarly, more than one percept may exist and may depend on one or more state variables. In such cases, a series of A R steps has to be taken to make the percept depend only on variables in t - 1. Nev-ertheless, the above simplified illustration suffices to show the evidence reversal algorithm takes into account the observed evidence rather than ignoring it. 4. Compact Representation In a BN, CPTs have often been represented in a "plain" fashion using arrays. For example, the four CPTs in Figure 3.2 would be represented as two-dimensional arrays. This representation, though sim-ple in nature, is not compact in that it requires an entry for every state of the system. A system is usually reasonably locally structured (Section 2.2.1) so that many states have the same probability val-ues and thus can be compressed. Furthermore, a structured representation allows irrelevant distinc-tions to be ignored, as we shall see in the following section. 4.1 Tree-Structured CPTs The tree-structured representation was first introduced in [BODG95]. The use of such representation for CPTs allows the capture of independence of variable assignments.12 It can identify relevant dis-tinctions in the belief space encoded in the CPT of a node. As a result, values associated with a col-lection of states may be represented once. Figure 4.1 shows a simple binary tree-structured CPT for a variable A. (For binary variables, our convention is to use left-arrows for "true" and right-arrows for "false".) Each branch determines a partial assignment to the parents of the "owner" variable of the tree (i.e., A in this case) with some parents not mentioned; for example, the left subtree of B is trav-ersed when B is true, without considering C. Each leaf denotes the probability distribution of the vari-able given conditions consistent with that branch.13 So in the figure, the rightmost leaf corresponds to 1 2 Other representations such as the use of rules representing conditional probability distributions also capture such independ-ence [POOL93]. 1 3 We only show the probability values for the variable being true because the probability of falsehood can always be com-puted by one minus the truth probability. This representation with assumed last values will be discussed in Section 5.4. 38 Chapter 4. Compact Representation 39 •' B (A)' 0.5 C \ 0.4 0.1 Figure 4.1: A simple tree-structured CPT the probability distribution of A given B and C are both false. Normally, if we use a plain CPT repre-sentation, we will have to enlist all 2 2 CPT entries. The tree-structured CPT, however, allows us to capture the independence of A on C when B is determined to be true. Thus, only one value, P(a\ b), is needed to represent both P{a\b,c) and P(a\b,c). As a result only three values in total need be speci-fied, spatially enhancing the system. In [BOG096], the tree-structured representation is extended by the inclusion of persistence subgraphs. This feature is especially useful in reasoning about actions where persistence relations of-ten arise. Figure 4.2a shows a tree-structured CPT with persistence subgraphs represented by broken arcs and marked nodes, generated automatically if unspecified by the user. From the figure we can observe that these subgraphs can actually be compressed into one. This leads to the graph representa-tion of the CPT as shown in Figure 4.2b. The "else" branch corresponds to all values of C other than c\. Although the graph representation is more compact, we would adopt the tree structure in this thesis as performing A R using graphs is more complicated. In Section 5.2 we will see how the tree-structured representation can help in identifying relevant variables to speed up simulation. Chapter 4. Compact Representation 40 c c B A * A * B else / tf'i i/\ / f'-t-i 0 5 A* 10* 0 0* 10* 0 0* (A)' °-5 A* ' • •"•• / '•. ' • ' " • • / 1.0* 0.0* '• 1.0* 0.0* (a) Decision Tree (b) Decision Graph Figure 4.2: Compact representation of CPT of A 4.2 Structured Arc Reversal Obviously, the tree structure is capable of representing regularities in the interdependence among vari-ables in the system. Since regularities may prove useful computationally, we would like to employ the tree structure in our simulation process. In Sections 3.2 and 3.3 we have seen how A R can be used to bring simulation "back on track" under plain CPT representation. For large CPTs, however, working with trees may be more efficient in the A R process. This requires some special thoughts, though: while we would like the A R steps to increase the accuracy of simulation, we also want to preserve the tree structures the CPTs had before the reversal so that after AR, regularities can be retained to gain certain computational efficiency. In the following sections we present the operations that will be used along with the structured arc reversal (SAR) algorithm. To avoid confusion, we use the terms "t-nodes" and "t-arcs" to represent nodes and arcs in a tree (as opposed to nodes and arcs in a BN) [BFGK96]. We will first give an example to illustrate. Chapter 4. Compact Representation 41 B A 0.7 0.2 0.5 B IAI 0.4 0.1 Figure 4.3: An example system 4.2.1 Example The SAR algorithm is perhaps best demonstrated with an example. Figure 4.3 shows a 2-slice DPN encoding an example system with four system variables (A through D) and one observation (evidence) variable (O). For simplicity, we write the names of the variables directly on the corresponding nodes and show only the CPTs for A and O, assuming B, C, and D have arbitrary distributions. Figure 4.4 and Figure 4.5 demonstrate how we reverse the arc between A and O (i.e., nodes i and j in Section 3.3). Construction ofO's CPT In Figure 4.4, we construct a new CPT for variable O as follows: Chapter 4. Compact Representation •. c--0.6 - . A B 1A1 0.7 0.2 A 1A1 0.5 B 0 4 (i 11 •,c%-0.6 \ A ' - -B " A l A i IAI 0.7 0.2 0.5 B 0.4 (I 1 A I A ' D 0.3 — 0.3 0.9 0.8 (a) (b) 0.6 C A ' B \ 0.7 0.2 0 5 B 0 4 0 1 A I 0.5 x 0.3 + , B x 0 7 D4 (I I (c) (d) Figure 4.4: Construction of new CPT for node ) (variable O) Chapter 4. Compact Representation 43 1 . First we search the old CPT of O for a subtree T with a root node labeled A (diagram (a)), since Twil l be the only part of the CPT for O that changes; all other parts of the tree remain the same because they do not depend on A and thus will not be affected by AR. In the search process we record all t-nodes and t-arcs traversed in a path (enclosed by a dotted line) starting from the root for use in the next step. 2 . In diagram (b), we duplicate the CPT of A and carry out tree reduction to the copy using in-formation stored in the path above. Intuitively, tree reduction is a process that removes redun-dant t-nodes in the tree, especially when we are appending a tree to another. The path then serves as an index of t-nodes that are already present in the original tree. In our case, O inher-its the parents of A on reversal, so we have to append the CPT of A to that of O to make O de-pend on the parents of A. By carrying out tree reduction we remove t-nodes considered redun-dant in the appending process. The path, tree reduction, and the appending of trees will be elaborated in more detail in the next section. 3 . Then we use the probabilistic chain rule to merge the child subtrees of T with the reduced copy (diagram (c)). In essence, we are removing the t-node A from the tree (which corresponds to deleting the arc from A to O in the network) and adjusting the probability distributions of A's children according to equation ( 3.2 ). The application of the chain rule in trees will be dis-cussed in the next section. 4 . The final tree is shown in diagram (d) where T has been saved and then replaced by T*, the re-sult from step 3 above. T is saved for later use, when we have to refer to values in the old CPT of O to update the CPT of A. T* is then marked "altered" to indicate an altered portion of the Chapter 4. Compact Representation 44 old CPT of O. (In general, there will be more than one T, and hence T*.) This will help reduce computational efforts when we apply Bayes's rule later. A Note: Reduction in Dependency Complexity At this point we are able to see that the new CPT for O does not depend on D. This would not have been true if we had used Shacter's algorithm [SHAC86], which states that both nodes inherit each other's parents. This reduction in dependency complexity actually occurred when we reduced the "appending" tree (diagram (b)) by pruning the branch containing the label D. Another place where dependency complexity may be reduced is in diagram (d), when we performed the tree merge. If we have deterministic values, say, P(a\ 7?) = 1, then when we perform the merge by the chain rule, a zero (instead of 0.7) will be multiplied to the subtree labeled B, thus removing fi's influence when C and A' are both false. If the label is not B but some variable other than A', B, and C, then the influence of that variable can be totally removed from the CPT. Although reduction in dependency complexity is not a necessary consequence of tree reduc-tion14, it is still a major saving in time and space when there is a possibility of making fewer steps in tree traversals, especially during simulation in large networks containing hundreds of nodes and tens of thousands of interdependencies. Construction ofA's CPT We go on to update the CPT of A in-place in Figure 4.5. 1 4 For example, if we swap the two branches of T°U so that the D branch is traversed when A' = false, then on reduction (diagram (b)) the D branch will remain, and we will not be able to remove D's influence on O. Chapter 4. Compact Representation 09 A I 0.8 A I 0.3 A I I I C i A i 0.6 B 0.7 0.6 c c 0.6 B o.6 B 0.7 0.6 0.43 0.22 \17 i n l i f t n/r< M T T 0.9 H i A A X X c c 06 • 06 < f ML% 07 0v 0 7 0.3 A I c , 0.6 B i A i 0.43 0.22 (a) (b) 0 3 B O "0 0 5 0_5 0.5 (15 x 0 ' { " • » Oji 0.1 (LO x07 ) O O 0.3488 0.2632 0.6818 0.1923 (0.6512 0.7368 0.3182 0.8077) 0 4? 0 57 0 22 II" (c) (d) Figure 4.5: Construction of new CPT for node i (variable A) Chapter 4. Compact Representation 46 1 . We first duplicate the new CPT of O and reduce the copy with respect to each path from root to leaf in A's CPT (diagram (a)). This is to remove the t-nodes considered redundant when we append the new CPT of O to each leaf of the C P T of A. Since there are three leaves in A's CPT, we have three reduced CPTs of O in the diagram. 2 . Then we search the reduced copies for subtrees marked "altered" and find that only the third one contains altered parts (T* in diagram (b)). T* is inherited from the computation of O's new CPT. It represents the part of the tree that will be extended by the application of Bayes's rule in the next step. The first two reduced copies are discarded, leaving the corresponding leaves in A's CPT unchanged, because the absence of altered parts means the terms P(x Ix ) and P(Xj\xcM(j)) in Bayes's rule equal and hence cancel out each other. We will come to more details in Section 4.2.3. 3 . In diagram (c), since the left branch of C is not in T*, the distribution at the right branch of A ' (< 0.3, 0.7 >) is propagated there. T*, on the other hand, is extended by one level at each leaf with the insertion of a new tree rooted at O. The distributions at the new leaves are computed using Bayes's rule. Note the correspondence between the shaded entries in the diagram: the darkest shade refers to entries in the subtree T in Figure 4.4, the lightest shade refers to entries in A's CPT, and the medium shade refers to entries in T*. The underlined numbers represent assumed last values (Section 5.4) while those in parentheses show the computations that would have taken place if we had not used the assumed-last-value representation for probabil-ity distributions. Chapter 4. Compact Representation 47 0.7 0.2 0.43 0.22 Figure 4.6: Network after reversing the arc between A and O 4 . The final tree is shown in diagram (d) where the rightmost leaf of A's CPT has been replaced by the result in step 3 above. The network now becomes as shown in Figure 4.6, where the effect of reduction in dependency com-plexity is shown by the missing arc between D and O. 4.2.2 Procedure Having seen our previous example, we are ready to examine the SAR algorithm in more detail. Let Tt and Tj denote the CPTs of arbitrary nodes i and j in G, a given BN. Then Algorithm 4.1 will reverse ( i, j) into (j, i) correctly while preserving the tree-structure of both 7, and 7}. For each node j corresponding to an evidence variable, we reverse the arc between j and each of its parents in C(j). In each such reversal, we build a new CPT for j and update the CPT of i in place. The operations involved are explained below. Chapter 4. Compact Representation 48 Algorithm 4.1: SAR /* SAR: reverse arcs i n t o evidence; each node i i n graph G has ass o c i a t e d with i t a tr e e -s t r u c t u r e d CPT, TL * I void SAR( Graph G, Evidence E) { Tree newTj, T, T , Tp; Path p; for (each j e E) for ( each i e C(j)) { newTj = Tf, /* B u i l d new CPT for j * I for ( each subtree T of newTj with root( T)== i) { Tp = treduce( T, path( newTj, T)); T = tchain( T,Tp); save T and replace with T ; mark T as altered; } /* B u i l d new CPT f o r i * I for (each leaf / of T,•) { Tp = treduce( newTj, path{ T, I)); T = tbayes{ Tj, T, Tp ); replace / with T ; } Tj = newTf, update C(j); Tree Reduction (treduce) In the SAR algorithm, the basic operation is the appending of a tree Tx to another tree T2, i.e., the re-placing of a leaf of T2 with Tx. Though conceptually straightforward, this operation may make the re-Chapter 4. Compact Representation 49 suiting tree have duplicate node labels. Therefore, a decision tree reduction algorithm has to be car-ried out to delete redundant nodes. Theoretically, the reduction can be done either before or after the appending, but since we have to perform arithmetic operations on trees (as we shall see below), it would be desirable to carry out the reduction early to reduce computational effort. (For example, we reduced the CPT of A in Figure 4.4b so that only the leaf node with distribution <0.3, 0.7> need be considered when we applied the chain rule in Figure 4.4c.) In view of this, whenever appending of trees is required, we choose to reduce the "appending" tree (TO before appending it to the "appended" tree (T2). As we have seen in the example in the previous section, a tree is reduced with respect to a path (of another tree). The path is an ordered list of label-value pairs. Each pair consists of a t-node label and a number specifying which child subtree to traverse next. Formally, a path is defined as follows: Definition 4.1: Path Let label( u) denote the node label of some t-node u. Let The a decision tree and path( T, u) be a path of T from root to a certain t-node u in T. Then, path( T, u) = (labeh, value], labek, valuei,labeln, value„), where labeh is the label of the /-th t-node traversed starting from the root of T, valuei is the value with which the t-arc between label, and labels is labeled, labeh = label( T ), and label„+\ - label( u). Chapter 4. Compact Representation 50 In essence, the path records t-nodes that should be pruned during tree reduction, the process of which can be described by the following definition of a reduced tree15: Definition 4.2: Reduced Tree Let p be a path of some decision tree V. Let chilaX i, T ) denote the 7-th child of an arbitrary tree T. Further let The a decision tree with root Kand immediate subtrees Tx, Tm where 7} = childij, T). Then the reduced tree T( p) is defined recursively as follows: Put in words, the reduced tree Tip) is a version of the tree T such that every path in T that is "inconsistent" with p is removed. With these definitions in mind, decision tree reduction is straightforward. The reduction al-gorithm is shown in Algorithm 4.2. Chain Rule (tchain) Recall in our example that O takes the role of variable Xj while A, Xt. In the following discussions we will use the more general terms X, and Xj to represent A and O respectively. [BFGK96] described how a reduced decision tree can be produced under a certain context of variable assignments. This is applicable in our SAR algorithm if increased computational efficiency is not a major concern. Since we would like to prune away all redundant t-nodes before applying arithmetic operations on trees, we modified the idea to be used with re-spect to paths instead of contexts. R with subtrees Tj(p) for all j e [l,m], if label(R) is not in p Tj{p), where j = value n if label(R) = labeli is in p Chapter 4. Compact Representation 51 Algorithm 4.2: Tree Reduction / * R e d u c e a t r e e T w i t h r e s p e c t t o a p a t h p . T h e r e s u l t i s r e t u r n e d a s newT. */ Tree treduce( Tree T, Path p ) { Tree newT; / * D e t e r m i n e i f t h e r o o t l a b e l o f T e x i s t s o n p . * / if (label( T) == labelj for some labelj e p ) newT= treduce( child( valuei, T),p); else { label( newT) = label( T); / * R e d u c e e a c h c h i l d o f T. */ for (i = 0; i < #children( T); /++ ) chila\ i, newT) = treduce( chila\ i, T),p); } return newT; } After we have found a subtree T with root node labeled i in 7}, the CPT of j, and obtained Tt( p) where T is the CPT of / and p = path{ 7}, T), we apply the tree version of the probabilistic chain rule to T and T)( p ): we replace each leaf / of T( p ) with the tree produced by "chaining" T with the distribution at /. Referring to Figure 4.4c, we see that this requires multiplying each value in the dis-tribution at / with the corresponding child subtree of Tand then "summing up the products". The products are probabilities given values of X,. Thus, summing up the products means summing up the terms P(X,\x t, X M . ) for all x, e Q,-. The multiplication step is straightforward because we only have to scale each leaf of the tree by the probability value specified. The summation step, however, is more complicated. If one of the addends is a scalar number, then we just add it to every leaf of the other (tree) addend, in much the same way as we perform scaling. If, however, both addends are trees, Chapter 4. Compact Representation 52 A 1) B 0 6 0 4 0 4 I) li chained with 0.3 06 0 1 + B xOJ 0.4 o.r B 0.18 0.12 0.28 0.07 \r B 0.28 0.07 B 0.46 0.19 (a) (b) (c) Figure 4.7: Breakdown of tree chaining (d) then we have to append, with possible reduction, one tree to every leaf of the other. For example, if T was a binary full tree, then the sequence of chaining would become as shown in Figure 4.7, where tree reduction is required in diagram (c). In any case, the correspondence between the normal chain rule and its tree version, Jt .e f i (4.1) can be readily observed: the term on the left corresponds to T, the first term in the summation corre-sponds to T, and the last term represents each probability value in 7,( p). By successively applying the tree chain rule to each subtree T found, a new CPT for j can be constructed. Chapter 4. Compact Representation 53 Bayes's Rule (loaves) In the construction of a new CPT for /, we apply Bayes's rule to update each leaf / of Tt using T°ld, I, and Tjnew( p ), where p - path{ Tit I). (Here we superscript 7} by "old" and "new" to denote the CPT of j before and after AR, respectively. This is not necessary for Tt because we are updating Tt in place.) Due to the complexity of the updating process, we choose to present the idea using an algorithmic way as follows: Traverse T"ew( p ) post-orderly. For each node k thus traversed, if k has been marked "altered", i.e., if the subtree rooted at k is part of the subtree T in Tjew, 1 . for each leaf/of k, replace/with a 2-level, n-leaf tree T with root( T') = j and n = #Qy. The distributions at the leaves of T' are determined by the P version of Bayes's rule, i.e., psv \ v ^ J ' V M ^ ^ ' V , , ) ) (4.2) 1 ' c » ( 0 j P(X.\X ) where P(X IX M ) is retrieved from the subtree T saved along with k, P(X,IX ld ) is the distribution at /, and P ( X y I X c „ e w o ) ) is the distribution at/. 2 . else if & is a leaf, replace the distribution at k with the distribution at /. 3 . else if no child of k is altered, replace the subtree rooted at k with the distribution at /. Chapter 4. Compact Representation 54 Basically, the entries in Tj are computed as in the normal Bayes's rule. However, since we would like to retain the CPTs' compactness after AR, we have to take special measures to prevent the original tree from being fully expanded. We have found that the distribution at / remains unchanged if there is no altered part in T"ew( p ) (Figure 4.5b), and that the unaltered portion of T"ew( p ) does not grow but merely inherits the distribution at / (Figure 4.5c). As we have mentioned before, this is due to the canceling out of the terms P ( X I X l d ) and P(XIX ) in equation ( 4.2 ), making the two ' C \J) J C \J) sides equal. Therefore, we need only expand the altered part of 7\ n w( p ) by applying Bayes's rule. The expansion is done in step 2 above, where T"m( p) is extended by one level with the distribution at the new leaves computed by Bayes's rule (Figure 4.5c). Now X/s influence is introduced into the tree and the result can be appended to T, at /. When all leaves of Tt are adjusted as above, the construction of a new CPT for i is complete. This corresponds to the addition of an arc from node j to node i in the network. Remarks Note that we have to update the parent set of j in each A R cycle. This is because during A R new arcs may be introduced into node j. If these arcs come from nodes in time t, then they in turn have to be reversed. For example, D was originally not a parent of O before A R in Figure 4.3. If we switched the two branches of A's CPT, then after reversing the arc between A and O, O would depend on D and there would be an arc from D to O in Figure 4.6. Since D is in time t, the new arc would have to be reversed also. Therefore, it is necessary to keep track of the parent set of O throughout the A R proc-ess. Chapter 4. Compact Representation 55 4.2.3 Analysis Regarding the SAR algorithm developed in the last section, we have made some observations and analysis of its operation. Constructing Tj In the construction of a new CPT for node j, we examined how the chain rule is applied in a "tree" context. We note that all parts of T°ld other than T remain the same (Figure 4.4), whether structurally or with respect to both structure and values. The two cases are described as follows: • If there is a path from root to leave not containing i, then both the structure of the internal nodes on the path and the distribution at the leaf will not change, since the path does not de-pend on i, and thus will not be affected by the reversal. • For all other paths, the portion from the root down to but not including / remains the same (i.e., the structure of that portion will not change). The subtrees below i are merged according to the method described, and it is easy to see that the distribution at each of the leaves is in fact a vector of sums of products as implied by equation (4.1 ), the tree version of the chain rule. We also observe that when appending a tree to another we have a choice over which one is the "appending" tree and which one the "appended" tree. If we carefully consider the sizes (number of t-nodes) of the trees to be merged, the merge can be done more efficiently. Consider two binary full trees Tm and Tn with the former having m leaves and the latter n. Then Tm has 2m-1 nodes while T„ has 2n-\. Assume m is greater than n. There are two cases: Chapter 4. Compact Representation 56 1 . If we append Tm to Tn, then we have to append 2m-1 nodes from Tm to each of the n leaves of T„, requiring a total of n(2m-\) = 2mn-n copies. 2. If we append Tn to Tm, then we have to append 2n-l nodes from Tn to each of the m leaves of Tm, requiring a total of m(2n-l) = 2mn-m copies. Since m> n, case 2 requires fewer copies than case 1. Hence, we conclude that we should append a smaller tree to a larger one. This conclusion is, nevertheless, more theoretical than practical since it takes 0( n ) time to determine the size of an n-node tree (unless the size is stored with each tree). A l -though the time spent in determining the initial tree sizes of CPTs can be absorbed as overhead cost by the system set-up phase when CPTs are built (through console or file input), the presence of tree re-ductions and appending keeps these sizes varying with time. As a result, just determining which tree should be the "appended" tree costs even more than the saving in copying it will bring. Therefore, in practice we will ignore the tree sizes and choose an arbitrary appending order. Constructing T ; In the construction of a new CPT for node i, we can see if T* and 7V"W are disjoint, then everything on p = path( Ti, I) remains the same (Figure 4.5). In particular, the distribution at /, i.e., P(Xt\x M ) for the x M observed at /, remains unchanged. This is justified because T* and p) are disjoint L. \l) J implies the latter is entirely contained in T°ld', and if we were to apply Bayes's rule (equation ( 4.2 )), then at every leaf node * of T.ew( p ) we would have P ( X ; l x c „ e W 0 ) ) = P (X J l x c 0 l d o ) ) for the xCCW(j) observed at k; corresponding individual values in these two distributions would cancel out each other, Chapter 4. Compact Representation 57 leaving P(X,lx ) = P(Xt\x M ) for the x ld observed at /. Since we are considering the k's with respect to the same /, all k's now have the same distribution and are compressed into one node. As a result, we have not changed anything with respect to /. For the non-T* portion of T"ew( p ), a similar argument applies. The difference is that it ap-plies not to the whole tree but only part of it, and thus node collapsing does not propagate to the root level. The T* portion, on the other hand, is extended by one level with the appending of a 7" at each leaf of T"ew( p ). This is necessary to bring the influence of O into the tree, and is the case where the corresponding entries of the first multiplication term and the denominator in the equation do not cancel out. Reversal Order In our example in Section 4.2.1 we chose to reverse the arc between A and O first. In general, how-ever, there is no fixed rule in reversal order. For example, given the same system our actual implemen-tation of SAR will reverse the arc between C and O first (see Case 3 in Section 6.2) based on a "first come first serve" basis: it reverses the arc between the first evidence node (in some topological order) and the first parent in time t encountered in its CPT. As it traverses the CPT down to the leaves, the arcs between the node and the parents encountered (if they are in time r) are successively reversed. The next evidence node is then considered, and the process continues until all evidence nodes have been considered. The difference in reversal order will lead to different resulting CPTs, and thus different net-work structures after AR. As a general rule, the sooner a node has its arc into the evidence node re-Chapter 4. Compact Representation 58 versed, the later it will be in topological order in the resulting network. The reason can be traced out as follows: the first reversal will make the first node depend on all (ignoring tree reduction) parents of the evidence. The arcs introduced will then remain unchanged during the rest of the AR process. In the reversal with the second node, the evidence node will have one parent (the first node) fewer than before. So the second node will depend on all parents of the evidence except the first node after AR. This makes the second node come before (or at least in the same position as) the first in topological order. In a similar way, the third node comes before the second, and so on. As a result, when the whole AR process has finished, the first node will be placed last in topological order, preceded by the second node, then the third node, and so on. The evidence node will eventually become the first in topological order since no incoming arc from nodes in time t remains, achieving our ultimate goal of making the evidence influence other nodes in time t. As an example, we can see in Figure 6.5 that C is placed last in topological order (with three incoming arcs from nodes in time r), preceded by A (three arcs), then B (one arc), then D (no arc), and then O (no arc). If we reverse the arc between A and O first, then A will be the last in topological order (with four arcs in Figure 4.6), and the rest of the node ordering can be inferred from the reversal sequence with the remaining nodes in the aforementioned fashion. Although different networks may arise from different AR orders, we do not anticipate any major impact imposed by this phenomenon on simulation. The different networks are merely different representations of the same underlying system. We have not established mathematical proofs of the correctness of these representations, but we believe that assuming the existence of a "truly random" number generator the degree of accuracy in probabilistic inference will be the same across representa-tions. 5. Efficient Inference With our SAR algorithm established, we can incorporate it into Algorithm 3.1 and carry out simula-tion using this structured ER (SER) algorithm. However, as practical BNs usually involve hundreds of nodes, performing probabilistic inference in such networks, especially dynamic ones, could be quite a cumbersome task without techniques to improve inference efficiency. We attempt to speed up infer-ence by the following methods. 5.1 Query-directed Approach Traditional simulation algorithms are straightforward in that they sequentially "flip the coin" for every variable in the system and then compute any posterior probabilities concerned. All variables have to be sampled even though some (probably many) of them may actually not be referenced at all. In the example shown in Figure 2.1, X2, X 3, and X 4 will never be sampled if we only want to know P( X\ I x2, xi ), assuming the variables are binary, since nothing influences X\. We only need to clamp X2 and X 3 to their given values and make coin flips for Xi. If we use a query-directed approach, however, we may be able to save on computation (the number of "flips"). In this approach, simulation starts only after a query has been given, and proceeds according to the given information. In particular, only those variables relevant to the query are sam-pled (see next section). In other words, variables that will not affect the query result are ignored. As most systems are locally structured this greatly reduces the number of random numbers generated 59 Chapter 5. Efficient Inference 60 Algorithm 5.1: Structured evidence reversal I* S E R : e v i d e n c e r e v e r s a l u s i n g a t r e e - s t r u c t u r e d C P T r e p r e s e n t a t i o n * / / * G i v e n : a g r a p h G w h i c h i s a B N ; we a s s u m e t h e C P T s i n G t o r e m a i n c o n s t a n t o v e r t i m e * / void SER( Graph G ) { / * O b t a i n e v i d e n c e a n d q u e r y * / Input E, x*E; Input Q, xQ\ I* R e v e r s e a r c s i n t o e v i d e n c e u s i n g t h e SAR a l g o r i t h m * / SAR( G, E); for ( k = 0; k < m; k++ ) { / * I n i t i a l i z e s c o r e f o r s a m p l e k */ Zk= 1; for (t = 0; t <= T\ t++ ) { Instantiate XE to x*E(t); Sweep through the net, flipping the coin for each variable; / * U p d a t e s c o r e f o r s a m p l e k */ Z t = Z»n^.-(0lWf)); } } } during simulation, thus increasing computational efficiency. Algorithm 5.1 shows our SER algorithm using a query-directed approach. 5.2 Relevant Variables The notion of relevant variables can be understood at two levels. In the last section we examined this notion based on a network level of view, which is adequate when a plain CPT representation is used with the network. When we come to the tree-structured representation, however, a more appropriate Chapter 5. Efficient Inference 61 view on relevance will be at the CP7Tevel. In the following sections we will examine these levels in turn. 5.2.1 The Network Level Relevance viewed at the network level means the extraction of variables relevant to a query based on the network structure. The following definition helps clarify this. For simplicity's sake, we have left out the time dimension, but it is conceptually straightforward to understand the definition in the time domain. Definition 5.1: Relevant Variables (Network Level) Let V„et( Q, E ) be the set of nodes relevant to the posterior probability P ( XQ \ xE ) at the network level, for some XQ e Q.Q, £?C N, given some xE e Qf, £ c N . Then, for all le N, le Vmt( Q, E ) iff 1. le Q,or 2. le C(y') for some Je Qyj E, or 3. le C( k) for some ke V„eA Q, E ). • This is a recursive definition that accounts for all ancestors of Q and E together with Q itself. The first two conditions can be treated as the base case while the third the recursive step. We do not in-clude the condition i e E because E is given and need not be "flipped".16 1 6 In logic sampling E has to be flipped also. Nevertheless, our description will centre around LW, where E is given and need not be flipped. Chapter 5. Efficient Inference 62 In LW, this set of nodes, i.e., Vnet( Q, E), corresponds to all variables that need be considered (and no others) regarding a particular query P( XQ \X*E). This is because to obtain xQ we have to in-stantiate XQ. Since XQ is conditioned on its parents, before instantiating XQ we have to instantiate XC(o for all i e Q, and to do so we in turn have to instantiate XCij) for all j e C(i), which requires the instan-tiation of Xc(k) for all k e C(j), and so on and so forth until we have at some point instantiated the set of "first generation" ancestors that are not conditioned on any other variable. Hence, Vnet( Q, E) must at least contain Q and the ancestors of each node in Q. This is expressed in a recursive way by condi-tions 1 and 3 and half of condition 2 in the above definition. We also have to account for E in a similar fashion because to compute P( xQ \x*E) we have to score the samples based on P{x*E\xCE) where CE = [JC(i). This requires the instantiation of XCE ieE and, in turn, its ancestors. So we see that Vne,( Q, E) should include also the ancestors of each node in E, and the other half of condition 2 above completes our definition. The set of nodes V„el( Q, E) helps us focus on only those variables relevant to the query, with-out wasting computational effort to instantiate irrelevant variables as in traditional simulation methods like logic sampling and LW. This reduction in computational complexity is especially significant when XN, the set of all variables in the system, is large. When applied in DPNs, the extraction of rele-vant variables may have an even more conspicuous effect in saving the number of flips as Q and E tend to spread out over time so that not all query and evidence nodes need be considered in all time slices. It is conjectured that the number of relevant variables per slice in a DPN is smaller than in a non-temporal BN where all query and evidence nodes are located in the same time slice. Chapter 5. Efficient Inference 63 0.4 0.1 Figure 5.1: Example net-work 5.2.2 The C P T Level Though it may be less complicated to analyze relevant variables at the network level, it may not be the most restrictive way possible. In Figure 4.1, for example, if we have instantiated B to the value true (or, b), then C will become irrelevant. More specifically, if the entire network is as shown in Figure 5.1, and we have the query P( a I b ), then in the instantiation of A we can directly retrieve the distri-bution <0.5, 0.5> from the CPT of A and flip the coin accordingly. The right subtree of the root, cor-responding to b, need not be considered at all. As a result, C is rendered irrelevant in this context. Furthermore, this irrelevance of C makes Cs parents E and F irrelevant also, since A does not depend on E or F. So we actually only need to flip for A (B is given). With our definition of relevance in the previous section, however, we will have to flip for C, E, and F because they are the ancestors of A, no matter to what value B is instantiated. In other words, this context-specific independence [BFGK96] cannot be recognized by the above definition. To do so we have to analyze relevant variables at the Chapter 5. Efficient Inference 64 CPT level (assuming the use of a tree structure), as the following definition will show. Again, we have omitted the time dimension for simplicity. Definition 5.2: Relevant Variables (CPTLevel) Let labeK T) be the root label of tree T, xt be the sampled value of X,-, and 7} denote the CPT of node i for some /e N. Let VCPA Q, E ) be the set of nodes relevant to the posterior probability P( XQ \ x*E ) at the CPT level, for some XQ e QQ, Q CZN, given some x*E e Q.s, £ c N. Define r( T ) such that for any arbitrary tree T, J label(T), if Thas more than one node r(T) (null), otherwise Further, for some Wcz N, let S( W) be a set such that for all ke N, ke S( W) iff 1. k= ti Tj ) for some ye W, or 2. k-ii T, ) for some ye Si W), or 3. k= ri chifd(xr(Tj), 7})) for some ye S( W). Then, VcMQ,E) = Q uS(Q uE). m This recursive definition captures the set of relevant variables in a dynamic fashion, as op-posed to the definition of relevance at the network level where the set of relevant variables is statically determined before simulation starts, when the network structure is already available. At the CPT level, relevant variables are determined "on the fly" as simulation proceeds and variables are sampled. Chapter 5. Efficient Inference 65 This dynamic nature is encoded in the third condition for membership in S: the sampled value xr(T) will be available only after X r ( r ) has been instantiated. Furthermore, in different simulation runs xr(J) may be different, resulting in different sets of relevant variables from simulation to simulation. In the following paragraphs we will give the intuition behind this definition and how its dynamic property helps us to minimize the set of relevant variables chosen. In order to obtain xQ, we have to instantiate XQ, so just like Vne,( Q, E), VCpi{ Q, E) must at least contain Q so that we can determine whether the query is satisfied in a sample for proper scoring. Since XQ is conditioned on XC(i) for all i e Q, we start to find out which node belongs to C(i) by look-ing at the root of Tt. This is reflected in the first condition for membership in S. Naturally, we will then try to traverse the whole tree in some order to find out all parents of i, but this is essentially what we were doing in the search for relevant variables at the network level. Considering that a variable can be instantiated to only one value at any point of time, a wiser way would be not to traverse the whole tree but wait till Xj, where j = r( Tt), has been instantiated and select the appropriate branch to traverse. This requires that Xaj) be instantiated beforehand. Thus we proceed to the root of Tj to search for the parents of j. This is reflected in the second condition for membership in 5. (Note that condition 1 has already made S(Q) contain r( 7,) for all i e Q.) Using the same argument, we then proceed to the root of Tk where k - r( Tj), and so on and so forth until we have reached a T„ where r( T0) = i . for some o e N. At this point, since X„ is not conditioned on any other variable, we can in-stantiate X0 and begin to "bounce back" the recursion. For simplicity, assume we reached Ta from Tk, i.e., r(Tk) = o. Because X0 is now instantiated to x0, we can select the corresponding subtree, namely Chapter 5. Efficient Inference 66 child(x0, Tk), of Tk to continue our search for relevant variables. This is described by condition 3 in the definition, where we proceed to look at r(child(x0, Tk)) and search for relevant variables using conditions 2 and 3 recursively. After a path from root to leaf is established in Tk (i.e., all relevant par-ents of k in the current context are instantiated), we can retrieve the distribution for k at the leaf and instantiate XK accordingly. Recursion then bounces back to 7}, where we select child( xk, 7}) to con-tinue the search. The process goes on until we have the whole set of XQ instantiated, at which point the set S(Q) is complete. For E, we follow the same sequence to build S(E) progressively, starting at r( T,) for all i e E until a path is formed from root to leaf in Tt. We can then compute the score of the sample according to the distribution at the leaf and the given evidence. After S(Q) and S(E) have been completely specified, our task in determining VCpj{ Q, E) has virtually finished. The only thing left is to prove S(Q) u S(E) = S(Q u E), but this is trivial if we hold on to our definition of S and the definition of the union operation in basic set theory. Summing up, the difference between relevance at the network level and that at the CPT level lies in the dynamic nature of the latter. To determine VCpi{ Q, E), we dynamically order variables to be sampled depending on previous sampled values of ancestral nodes so that there is one and only one path from root to leaf traversed in each CPT. Any node not on the path is ignored, and it is this prop-erty that makes VCP1{ Q,E)cz Vnel( Q,E). Chapter 5. Efficient Inference 67 5.3 Satisfiable Queries Another saving we can make is to stop sampling query variables whenever any value contradictory to the given query is sampled. This is justified because if any part of a query is not satisfied, then a score of zero will be assigned to the whole query. Therefore, there is no point in sampling further query variables once it has been determined that the sample does not satisfy the given query. More formally, given some set of query values x e e Q e for some Q cz N, if at any point during simulation a value xt, for some i e Q, is sampled such that x(. ^ JC„ then no sampling will be done for all j G Q' u S( Q') where Q' is the set of unsampled query nodes. Simulation proceeds for all k e S( E') where E' is the set of evidence nodes not taken into account, and the score for the query, Z(XQ = XQ), equals zero. This is yet another example of dynamically selecting nodes for sampling. It helps further re-stricting the set of relevant nodes chosen. Note that this technique is applicable only if a conjunctive query is posed. For disjunctive queries, we have to modify the technique in such a way that the sam-pling of query variables is stopped whenever a value x(., for some i e Q, is sampled such that x. = x,-, because the satisfaction of a disjunctive component means the whole query is satisfied. Our imple-mentation assumes the use of conjunctive queries, but in either case, provided it is not a combination of both, the more complicated the query, the more we can save through this technique. Chapter 5. Efficient Inference 68 5.4 Assumed Last Values By observing that all probability values in a distribution add up to one, we can omit the last value when representing a distribution. The last value can always be found by one minus the sum of the previous values, which is just one arithmetic operation if we use a cumulative distribution representa-tion. (For example, <0.1, 0.2, 0.3, ...> is represented as <0.1, 0.3, 0.6, ...>.) This is especially useful when computationally demanding operations such as arc reversals have to be performed: we can re-duce all operations devoted to computing the last entry of any distribution to just one subtraction. We have already seen an example of how computations are saved in Figure 4.5c, where the values and the multiplication in parenthesis represent assumed values and the operation saved.17 Another probability representation scheme would be one which relaxes the constraint that the probabilities in a distribution add up to one. We can then use the unnormalized numbers as probability ratios. Although it can let us avoid normaliza-tion, this scheme still requires arithmetic operations, which can become computationally intensive when tree components are present, to compute each entry in the distribution. In other words, the arithmetic operations devoted to computing the last entry cannot be saved as in the assumed-last-value representation. 6. Implementation and Results As we have seen in previous chapters, LW does not use observed evidence directly to constrain the generation of the sample. This leads to poor accuracy in the simulation of DPNs. On the other hand, ER takes into account the observed evidence by reversing arcs that go into evidence variables. Thus, answers to queries are really based on the given evidence, resulting in more accurate simulation. We have also seen that the usual "plain" CPT representation (using arrays) requires the speci-fication of probability values for all combinations of values of a node's parents. The tree-structured representation, however, not only helps reduce the number of CPT entries required but also saves on the number of "coin flips" (random number generation) to a large extent in practical systems. In order to confirm the aforementioned ideas, we implemented the "plain" version and the tree version of both the LW and the ER algorithms. A series of experiments was then carried out on a 486-DX66 machine. Hereafter we will use the abbreviations ER, LW, SER, and SLW to refer respectively to plain ER, plain LW, structured ER, and structured LW. 6.1 Experiments The aim of our experiments is to show that: • ER helps speed up convergence of answers to queries, as compared to LW; 69 Chapter 6. Implementation and Results 70 Table 6.1: Simulation performance under different settings Plain Tree Measurement LW ER LW ER #distributions avg #flips avg time 1,2, 4,5 1,3 2 3,4,5 • the tree-structured CPT representation can reduce the number of flips during simulation, thus increasing computational efficiency; • the tree-structured CPT representation can reduce the number of CPT entries by discarding ir-relevant distinctions that would be present in a "plain" representation. Accordingly, we arranged our experiments in different settings as shown in Table 6.1. The small numbers refer to the cases described below. For each case, we compared simulation performance un-der two of the settings using a sample size of 1000. (For example, in Case One we compared the per-formance of LW and ER, as shown by the two small "ones" in the "LW" and "ER" columns under "Plain" in the table.) Analysis was then made, assuming the existence of an unbiased, or "truly ran-dom" number generator that accurately produces states of the network according to their true prob-abilities, contingent on the given evidence. In the first row of the table, we measured the number of probability distributions required. For the plain representation, we obtained the number by summing up the total number of rows in all CPTs in the system. For the tree representation, the number was computed by counting the number of leaf nodes in all the CPTs. (For example, the table for variable O in Figure 6.3a contains four distri-Chapter 6. Implementation and Results 71 butions while the tree for O in Figure 6.3b has three.) In the ER and SER settings, this number reflects the number of distributions generated after AR had been performed. In the second row, the average number of flips over all samples was recorded. This shows the average number of random numbers generated in a sample and corresponds to the set of relevant vari-ables (Section 5.2) actually simulated. Again, in the ER and SER settings, the number was recorded after AR had been performed. In the last row we recorded the execution time averaged over all samples. We included this test because we observe that most languages are developed with arrays (rather than trees) as a central component; if we perform our simulation using a tree-structured representation, we might incur more runtime overhead than when using arrays. By examining the average execution time we can see whether the structured representation (or AR) has any influence on execution speed, and if so, to what extent it might be. In each of the five cases below, we ran the simulation 100 times using a query-directed ap-proach. For cases 1, 4, and 5 we analyzed convergence performance based on the variance of the probability values obtained across all runs and plotted variance against the number of samples in graphs. The variance computed is based on the total population, i.e., Variance where x, is the probability value computed for the query variable during the i-th run, and n is the total number of runs. Chapter 6. Implementation and Results 72 Convergence is reached when the variance falls below an epsilon value computed by Epsilon =10 2Llog(3c)J where x — — 6.2 Results and Analysis Case 1 For the network shown in Figure 6.1a, we made 100 simulation runs using the LW and ER algorithms to find the posterior probability P(A(0) = AQO) = A(20) = A(30) = A(40) = A(50) = a, I O(0) = O(10) = O(20) = O(30) = 0(40) = 0(50) = o,). (Underlined entries represent "assumed" values, i.e., the ones that can be computed by subtracting cor-responding cumulative probabilities from one.) Performance results are shown in Table 6.2. We see that in general ER doubled all the measurements we made as compared to LW. The increase in the number of distributions (and thus CPT entries) can be projected from our discussion in Section 3.3. In this case, the CPT of A increased from three rows to nine rows because after AR O becomes a parent of A and the value set of O has a cardinality of three. The average number of flips per sample also nearly doubled because A is now influenced by one more variable, namely O; for time slices where no Chapter 6. Implementation and Results 73 (a) Before AR A ' 0 Ol 0 2 0 3 ai .39 .305 .305 a 2 .305 .39 .305 a 3 .305 .305 39 A A ' O ai a 2 a 3 ai Ol 9231 .0385 .0384 0 2 .0656 XS52 .0492 0 3 .0656 .0492 i .$852 a 2 Ol .8852 .0656 .0492 0 2 .0385 .9231 .0384 0 3 .0492 .0656 .8852 Ol .8852 .0492 .0656 0 2 .0492 .8852 .0656 0 3 .0385 .0385 .9230 (b) After AR Figure 6.1: Network — Case 1 evidence value was given, we had to make two flips instead of one to obtain the value of A. In turn, this resulted in a nearly 100% increase in the average execution time per sample. As for convergence, we plotted in Figure 6.2 the variance of the probability values obtained versus the number of samples. An epsilon value of 0.01 was used. As shown, simulation using ER Chapter 6. Implementation and Results 74 Table 6.2: Performance — Case 1 Measurement LW ER #distributions avg #flips avg time 6 51 8.4e-4 12 96 1.53e-3 converged very early, at a sample size of about 50 (variance ~ 6e-3). Using LW, however, simulation did not converge even using 1000 samples (variance = 0.08). This shows that convergence is at least 20 times faster in the ER case where AR played a major role in bringing simulation back on track.18 The network used is a DPN where the state evolution model is weak, as can be seen from the CPT of variable A where states evolve quite randomly. Variable O can be thought of as a sensor of A. Although this sensor has an error of only 0.1 in tracking the current state of A, LW ignored this evi-dence and let the system proceed according to its weak state evolution model, disagreeing with the evidence most of the time. In each run, therefore, the weighting process assigned extremely low weights to almost all the samples. On the other hand, a very small number of samples closest to the true state were assigned relatively high scores. These samples imposed a disproportionately large in-fluence on the sample score distribution and dominated in the calculation of the posterior in concern. Most of the samples were therefore effectively ignored, resulting in large estimation errors. For this example, the score of a sample varies from0.056= 1.6e-8 (when all six A's have either a2 or a3 as val-ues) to 0.96 = 0.53 (when all six A's have the value a{). Since these scores fluctuate by factors of as large as 107, the probability computed in one run might be very different from that in another. The In [FUCH90], simulation results showed that by reversing arcs into evidence nodes convergence was at least 100 times faster than without evidential integration (the non-temporal counterpart of ER). Chapter 6. Implementation and Results 75 Figure 6.2: Convergence behavior— Case 1 resulting variance is so great that the posterior obtained cannot be considered as accurate. Unless significantly more samples are taken, LW fails in producing a reliable answer when systems with a weak state evolution model are simulated. ER helps bring simulation closer to reality by using the observed evidence to influence state variables. Figure 6.1b shows the network structure and the CPTs after AR. Now sample scores do not vary to a great extent: assuming random initial states, the score of a sample varies from 1/3 x 0.3055 = 8.8e-4 to 1/3 x 0.395 = 0.003 after arcs have been reversed (including time slice 0). Moreover, vari-able A now has a chance of about 0.9 of having the value at given O = ou as shown by the highlighted entries in the figure. As a result, not only is the degree of accuracy increased by the reduction in vari-Chapter 6. Implementation and Results 76 (a) Plain CPT (b) Tree-structured CPT Figure 6.3: Network — Case 2 ance of the scores obtained but the "degree of realness" of simulation is also enhanced by the genera-tion of more samples close to the true state of A given values of O. Case 2 We compared the performance of LW and SLW using the networks shown in Figure 6.3. (We have left out the CPT for D because D will never be sampled given our query below.) Performance results are shown in Table 6.3. First, we managed to save one probability distribution by using the tree-structured representa-tion. (Note the compression of the two shaded entries into one.) Then we started simulation using the query P(C(5) = 710(5) = J), assuming both variables are binary. The average number of flips per sample using SLW is only two-third that using LW. This is mainly due to two places where flips were saved. The first is the saving of flips for variable D; since it is not the ancestor of C or O, it is not a Chapter 6. Implementation and Results 77 Table 6.3: Performance — Case 2 Measurement LW SLW #distributions avg #flips avg time19 10 12 8.18e-4 9 8.174 7.27e-4 relevant variable and was not sampled at all in the simulation. The second saving lies in the CPT of O. We notice that once C is true, there is no need to flip for B. So, for this query, we can generate all values of C(0) to C(5) and then, if C(5) is true, generate 0(5) and these are all the variables we need. 6(0) to B(5) will only be generated if C(5) is false. Thus we see that if a variable depends on the pre-vious value of itself, and the variable need not be generated in a certain time slice t, then in all slices up to and including t no sampling for the variable is necessary. When traced from time t (i.e., five in this case) back to time zero, this chain of savings can increase computational efficiency significantly. Moreover, since a variable may depend on more than one variable in the previous time slice, a saving in time t will mean several chains of savings from time t back to time 0. In Figure 6.4 we show an excerpt from the execution results of SLW. The blank entries corre-spond to the B(t)'s skipped during simulation. As can be seen, as many as 50% of the total number of relevant variables (perceived at the network level) were skipped for most samples. For the case where a plain CPT representation was used, however, no such savings could be made. This explains why the average execution time per sample is longer using LW than using SLW, and proves that our worry that using a tree-structured representation may slow down execution is not necessary. Since we turned on the "verbose" option in this experiment in order to produce excerpts from execution results, these val-ues are actually greater than they should be when the verbose option is off. Nonetheless, as our comparison between LW and SLW is local to this experiment only, this increase in value does not matter in a relative sense. Chapter 6. Implementation and Results 78 Run 1 : V a r i a b l e ZI sample ) Z ( q u e r y ) Round CIO) C ( l ) C (2 ) C (3 ) C(4 ) C<5)* BIO) B I D B(2) BI3) B(4) BI5) 1 T T T T T T 0 7000 0 7000 2 T T T T T F T T T T T T 0 6000 0 0000 3 F F F F T T 0 7000 0 7000 4 F F F F F T 0 7000 0 7000 5 T T T T T T 0 7000 0 7000 6 T T T T T T 0 7000 0 7000 7 F F F F T T 0 7000 0 7000 8 T T T T T T 0 7000 0 7000 9 F F F F F F T F T T T T 0 6000 0 0000 10 T T T T T T 0 7000 0 7000 11 F F F F F T 0 7000 0 7000 12 F T T T F F T T T T T T 0 6000 0 0000 13 F F F F T T 0 7000 0 7000 14 T F F F F F T T T T T T 0 6000 0 0000 15 T T T T T T 0 7000 0 7000 16 F F F F T T 0 7000 0 7000 17 F F F F F T 0 7000 0 7000 18 F F F T T T 0 7000 0 7000 19 T T T T T T 0 7000 0 7000 20 T T T T T T 0 7000 0 7000 21 T T T F F F F F T T F T 0 6000 0 0000 22 T T T T F F F F F T T T 0 6000 0 0000 23 T T T T T T 0 7000 0 7000 24 T T T T T T 0 7000 0 7000 25 T T T T T T 0 7000 0 7000 Figure 6.4: Excerpt from execution results of SLW— Case 2 We do not examine convergence behavior in this experiment because the difference in CPT representations does not affect convergence and there will be no difference between LW and SLW in this aspect. Case 3 Using the system shown in Figure 4.3, we carried out simulation with both the ER and the SER algo-rithms. Since our implementation reverses arcs on a "first come first serve" basis, it chose to reverse the arc between variables O and C first because C is the parent of O first encountered. The resulting system is as shown in Figure 6.5. The query posed was P(C(5) = 71(9(5) = 7). Performance results are shown in Table 6.4. Again, we do not investigate convergence behavior for this case for the aforementioned reason. Chapter 6. Implementation and Results 79 A ' 0.619 0.3684 0.5568 0.4512 Figure 6.5: Network after SAR — Case 3 As expected, the plain ER algorithm led to full CPTs (with 62 rows in total) after AR while our SER algorithm managed to shrink the CPTs to contain merely 23 leaves. (Note that the CPT of D remained unchanged: it retained its original random distribution because the influence of D on O that would have been introduced during AR between O and A was removed in SAR (Section 4.2.1).) For large networks, this reduction in space complexity (even with internal nodes counted) is an invaluable asset. SER also generated a much lower average number of flips of 15.87 per sample20, reflecting the ability of the tree-structured representation to capture variable independence (or, more precisely, con-text-specific independence (CSI) [BFGK96]) during simulation. An excerpt from the execution results 2 0 The total number of relevant variables perceived in the network level was 24 after SAR. This is slightly different from that (27) in ER due to the difference in the orders of arcs reversed. Our ER implementation also reverses arcs on an FCFS ba-sis, but since the order of parent specification for a node does not matter in a plain CPT representation, we chose to specify the parents of a node in alphabetical order for consistency. Thus, variable A was encountered first in ER while SER first came across variable C. As a result, two different network structures were produced. Chapter 6. Implementation and Results 80 Table 6.4: Performance -— Case 3 Measurement ER SER #distributions avg #flips avg time21 62 27 1.55e-3 23 15.87 1.55e-3 of SER is shown in Figure 6.6. As many as 10 variables could be skipped before the requested poste-rior was computed in each sample, though the average execution time per sample was not reduced due to the overhead in AR and the relatively small size of the system. For a larger, more practical net-work, it is expected this AR overhead will have a much smaller impact on execution time. We also observe in the figure that variable D in time slice 5 never entered into consideration during simulation, as shown by the blank column under the underlined heading "D(5)" near the right end of the figure. The reason can be found by examining the set of variables relevant to the posed query. Starting (in time 5) at the root of the CPT of variable C we encounter A' (Figure 6.5), which means A(4) must be flipped before other parents of C(5). If A(4) turns out to be true, then C(5) only depends on B(5). From the CPT of B we know that variable B does not depend on D. Therefore, if A(4) is true, then D(5) will never be considered. On the other hand, if A(4) is false, then the next flip has to go for A(5). Referring to the CPT of A, given that A' is false, D is skipped. The remaining par-ents are B and O, neither of which depends on D. So, if A(4) is false, D(5) will also not be flipped. Hence, we conclude that C(5) in no circumstance depends on 7J>(5), explaining why the column in con-cern is blank. 2 1 We again turned on the verbose option in this experiment to obtain excerpts from execution results. Chapter 6. Implementation and Results 81 0(0} D(0] AIO) Oil) BID AIL] 0(21 DI2) BIZ) A (21 0(3) DI3) B(3) A (3) 0(4) DM I BUI A (4) D(5) BI5) A15) CISC impla) Z(i]uaiy) 5375 4GZ5 4750 5250 4625 5375 4750 5250 5250 4625 5250 4625 5250 5375 5250 4750 4750 4625 4750 5250 5250 5250 5375 4750 4625 4625 5250 5375 4750 4750 5250 5Z50 5375 4625 4750 5250 4750 5250 4625 4625 4750 5250 4625 5375 5250 4625 5250 4750 1625 4750 5250 5250 5250 5375 4750 4625 4625 5250 5250 5250 4750 5250 4625 0000 Figure 6.6: Excerpt from execution results of SER — Case 3 In this experiment (and the previous one) we can see the tree structure lends itself nicely to performing simulation in DPNs. Despite the relatively small size of our networks, the savings in the number of flips were tremendous. To a large extent it is this ability to capture CSI and the resulting efficiency enhancement that motivates the use of a tree-structured CPT representation, especially when hundreds of variables are present in practical systems. Chapter 6. Implementation and Results 82 0.7241 0.5385 0.1111 0.0526 (a) Before SAR (b) After SAR Figure 6.7: Network — Case 4 Case 4 Though AR and the tree-structured representation are useful, there are cases where neither of these techniques can help. One such system is shown in Figure 6.7a. We ran L W 2 2 and SER using this system with the query P(5(0) = 5(2) = 5(4) = 5(6) = 5(8) = 5(10) = 710(0) = 0(2) = 0(4) = 0(6) = 0(8) = O(10) = 7). The tree structure did not buy us anything before AR, since all CPTs were full trees. It did not make CPTs more compact after AR either (Figure 6.7b). In Table 6.5 we see that SER apparently managed to save some flips (0.96 out of 16). This saving, however, was due to the incorporation of our technique where sampling of query variables is stopped on insatisfiability of any component of the query23 (Section 5.3). In other words, the tree structure had no contribution to the saving of flips in this experiment. 2 2 We show only the tree version of the CPTs for convenience. The plain version can be obtained in a straightforward man-ner. 2 3 Query variables will still be sampled if they are relevant to the given evidence. Chapter 6. Implementation and Results 83 Table 6.5: Performance — Case 4 Measurement LW SER #distributions avg #flips avg time 4 11 1.82e-4 6 15.04 2.73e-4 In Figure 6.8, we can see AR did not speed up convergence significantly in this case because the sample score before AR was already quite stable (0.9 versus 0.8) across samples and only a little more stable (0.87 versus 0.81) after AR. In addition, ten time slices is not sufficient, given the strength of the state evolution model, to allow simulation to diverge "too far" from reality. Therefore, simulation using either of the algorithms converged to an epsilon of le-4 at about the same number of samples. (In fact, SER needed even more samples (670) than LW (420) in this case.) The average execution time per sample also increased after SAR due to the increase in the number of arcs and thus the average number of flips. So, in this case, instead of going into all the troubles to reverse the arcs, we would prefer using the non-AR version. Nonetheless, systems similar to this one are not common in practice. Real systems are usually much larger so that a larger extent of local structures can be ex-ploited. At the same time, the errors of the sensors are normally much smaller (e.g., P(0 = T\B = F) « 0.8) so that the effect of AR is much more conspicuous. Chapter 6. Implementation and Results 84 Figure 6.8: Convergence behavior— Case 4 Case 5 In an attempt to explore the effects of SAR on probabilistic inference in a larger network, we doubled the size of the network in Case 3 and compared the performance of LW and SER. Figure 6.9 shows the network used in the experiment. As before, variable A depends on its previous value. All other variables, i.e., B - E, O, W- Z, only depend on variables within the current time slice. Source variables (B - D, W- Z) are assumed to have random distributions and their CPTs are not shown. After posing the query P(L\0) = D(10) = D(20) = £>(30) = D(40) = D(50) = F\ O(0) = O(10) = 0(20) = 0(30) = 0(40) = 0(50) = F) Chapter 6. Implementation and Results 85 0.95 0.05 B A 0.7 0.2 0.5 B 0.4 0.1 Figure 6.9: Network — Case 5 we ran both the LW and the SER algorithms 100 times. The epsilon used was le-4, since the posterior probability was of the order 0.01. Performance results are shown in Table 6.6. From the table we notice that SER more than doubled the number of probability distributions in the resulting system after SAR. This can be explained by the vast number of parents of both the Chapter 6. Implementation and Results 86 Table 6.6: Performance — Case 5 Measurement LW SER #distributions avg #flips avg time 75 144 0.0031 192 363.1 0.00979 evidence variable O and the state variable E in the original network (where both variables had 5 par-ents). Since there was a directed arc from E to O, this arc had to be reversed, which means O had to inherit all five of £"s parents. This in turn called for further reversal of arcs between O and the par-ents of E. Adding to the AR of O's own parents, these AR sequences produced an intermingled result-ing network where the number of parents increased substantially for all variables except O. As a re-sult, the CPTs for these state variables expanded. We also observe in the table that the average number of flips per sample is considerably greater in SER than in LW. This is a direct consequence of the increase in CPT size. Although this is a 1.5-fold increase, the tree structure already helps reduce this number (as well as the size of CPTs) to a minimum by ignoring irrelevant variables — in a by-experiment we conducted, we had 366 distribu-tions in total and 500 flips per sample on average when the unenhanced ER algorithm was used. The average execution time per sample of SER is three times that of LW. This may seem un-promising at first sight, but given SER converged at 170 samples while LW still did not converge at 1000 samples (Figure 6.10), SER still beats LW in terms of accuracy and total execution time. The only way for LW to decrease its variance in the computed posterior seems to be at least a 5-fold in-crease in the sample size, demanding significant increase in total execution time. Chapter 6. Implementation and Results Figure 6.10: Convergence behavior— Case 5 7. Conclusion and Extensions In this thesis we investigated the problem of divergence from reality when LW is used to perform probabilistic inference in DPNs [KAKR95]. An experiment was conducted to demonstrate the unde-sirable effect of a straightforward application of LW where it generated simulations that ignored the observed evidence and therefore became increasingly irrelevant. Posterior probability values thus ob-tained induced great variance across runs and virtually did not converge even at a sample size of one thousand. As a remedy, we modified the ER algorithm in [KAKR95], which employs the AR algorithm in [SHAC86], with an intention to guide simulation back on track. Experimental results show that ER was indeed capable of bringing simulation closer to reality, providing accurate computed values that converged quickly. However, the additional space and time complexity required by AR were substan-tial. In view of these problems, we devised the SER algorithm using a tree-structured CPT repre-sentation [BODG95]. SER avoids the problem of divergence from reality by using SAR, which was presented in detail in Chapter 4, to keep simulation on track while reducing the size of CPTs with a compact representation. We also used a query-directed approach and presented the techniques of rele-vant variable extraction and satisfiable query sampling to speed up probabilistic inference. A series of experiments was conducted to test our SER algorithm incorporated with the aforementioned tech-niques. With the exception of an ad-hoc counterexample, results show that the SER algorithm outper-88 Chapter 7. Conclusion and Extensions 89 formed both LW and ER in all our experimental cases in terms of rate of convergence as well as com-pactness of CPT representation. From the performance of SER in our experiments, it is postulated that SER will be found practical in applications involving complex models with highly interdependent variables. We would like to see SER being applied to specific problem structures, especially in a partially observable envi-ronment where decisions have to be made based on some set of computed posterior probability values. As for refinement of the algorithm and its implementation, there is room for exploration in how determinism can be exploited effectively in performing SAR. We have seen that the presence of deterministic values in a tree-structured CPT may help remove the influence of some of the parents of the node in concern during SAR. The ability to detect and exploit determinism can certainly increase both computational and spatial efficiencies. Another challenge lies in the development of a graphical user interface to visualize the simu-lation process. A visual simulator of this kind was implemented in [CHEU95] where policies can be visually simulated. A reward structure was also built in for policy evaluation purposes. It is yet un-known whether the SER algorithm can find its place in this area where causal inference (or projection) rather than diagnostic inference plays a major role. Bibliography [BFGK96] Craig Boutilier, Nir Friedman, Moises Goldszmidt and Daphne Roller, "Context-Specific Independence in Bayesian Networks", in Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelli-gence, (to appear). Portland, OR, 1996. [BODG95] Craig Boutilier, Richard Dearden and Moises Goldszmidt, "Exploiting Structure in Policy Construction", in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1104-1111. Montreal, 1995. [BOG096] Craig Boutilier and Moises Goldszmidt, "The Frame Problem and Bayesian Network Action Representations", in Proceedings of the Eleventh Biennial Canadian Conference on Artificial Intelligence, pp. 69-83. Toronto, 1996. [B0P096] Craig Boutilier and David Poole, "Computing Optimal Policies for Partially Observable Decision Processes Using Compact Repre-sentations", in Proceedings of the Thirteenth National Conference on Artifi-cial Intelligence, (to appear). Portland, OR, 1996. [BOPU95] Craig Boutilier and Martin L. Puterman, "Process-Oriented Planning and Average-Reward Optimality", in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1096-1103. Montreal, 1995. [BOUT95] Craig Boutilier, "Some Considerations on Process-Oriented Planning", (manuscript), 1995. [CHCO90] R. Martin Chavez and Gregory F. Cooper, "An Empiri-cal Evaluation of a Randomized Algorithm for Probabilistic Inference", in Uncertainty in Artificial Intelligence, 5:191-207. Elsevier: North-Holland, 1990. 90 Bibliography 91 [CHEU95] Adrian Y. W. Cheuk, "Simulating Plan Execution in Stochastic Domains", (manuscript), 1995. [COOP90] Gregory F. Cooper, "The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks", in Artificial Intel-ligence Journal, 42:393-405, 1990. [DEKA89] Thomas Dean and Keiji Kanazawa, "A Model for Rea-soning about Persistence and Causation", in Computational Intelligence, 5(3):142-150, 1989. [FUCH90]24 Robert Fung and Kuo-Chu Chang, "Weighing and Inte-grating Evidence for Stochastic Simulation in Bayesian Networks", in Un-certainty in Artificial Intelligence, 5:209-219. Elsevier: North-Holland, 1990. [HENR88]25 Max Henrion, "Propagating Uncertainty in Bayesian Networks by Probabilistic Logic Sampling", in Uncertainty in Artificial In-telligence, 2:149-163. Elsevier: North-Holland, 1988. [HENR90] Max Henrion, "An Introduction to Algorithms for Infer-ence in Belief Nets", in Uncertainty in Artificial Intelligence, 5:129-138. Elsevier: North-Holland, 1990. [KAKR95] Keiji Kanazawa, Daphne Koller, and Stuart Russell, "Stochastic Simulation Algorithms for Dynamic Probabilistic Networks", in UAI95 Proceedings, pp. 346-351. Montreal, 1995. [PEAR87] Judea Pearl, "Evidential Reasoning Using Stochastic Simulation of Causal Models", in Artificial Intelligence, 32:245-257. El-sevier: North-Holland, 1987. Originally published in Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence (UAI-89), Windsor, Ontario, 1989. Originally published in Proceedings Workshop on Uncertainty in Artificial Intelligence, Philadelphia, PA, 1986, under the title "Propagating Uncertainty by Logic Sampling in Bayes' Networks". Bibliography 92 [PEAE88] Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. [POOL93] David Poole, "Probabilistic Horn Abduction and Baye-sian Networks", in Artificial Intelligence Journal, 64(1):81-129, 1993. [RUN095] Stuart Russell and Peter Norvig, Artif icial Intelli-gence: A Modern Approach. Prentice Hall, 1995. [SHAC86] Ross D. Shacter, "Evaluating Influence Diagrams", in Operations Research, 34(6):871-882, 1986. [SHPE90]26 Ross D. Shacter and Mark A. Peot, "Simulation Ap-proaches to General Probabilistic Inference on Belief Networks", in Uncer-tainty in Artificial Intelligence, 5:221-231. Elsevier: North-Holland, 1990. [SMS073] Richard D. Smallwood and Edward J. Sondik, "The Op-timal Control of Partially Observable Markov Processes over a Finite Hori-zon", in Operations Research, 21:1071-1088, 1973. Originally published in Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence (UAI-89), Windsor, Ontario, 1989.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Stochastic simulation in dynamic probabilistic networks...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Stochastic simulation in dynamic probabilistic networks using compact representation Cheuk, Adrian Y.W. 1996
pdf
Page Metadata
Item Metadata
Title | Stochastic simulation in dynamic probabilistic networks using compact representation |
Creator |
Cheuk, Adrian Y.W. |
Date Issued | 1996 |
Description | In recent years, researchers in the A l domain have used Bayesian belief networks to build models of expert opinion. Though computationally expensive deterministic algorithms have been devised, it has been shown that exact probabilistic inference in belief networks, especially multiply connected ones, is intractable. In view of this, various approximation methods based on stochastic simulation appeared in an attempt to perform efficient approximate inference in large and richly interconnected models. However, due to convergence problems, approximation in dynamic probabilistic networks has seemed unpromising. Reversing arcs into evidence nodes can improve convergence performance in simulation, but the resulting exponential increase in network complexity and, in particular, the size of the conditional probability tables (CPTs) can often render this evidence reversal method computationally inefficient. In this thesis, we describe a structured simulation algorithm that uses the evidence reversal technique based on a tree-structured representation for CPTs. Most real systems exhibit a large amount of local structure. The tree can reduce network complexity by exploiting this structure to keep CPTs in a compact way even after arcs have been reversed. The tree also has a major impact on improving computational efficiency by capturing context-specific independence during simulation. Experimental results show that in general our structured evidence reversal algorithm improves convergence performance significantly while being both spatially and computationally much more efficient than its unstructured counterpart. |
Extent | 3954175 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-03-13 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0051525 |
URI | http://hdl.handle.net/2429/6018 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1996-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1996-0468.pdf [ 3.77MB ]
- Metadata
- JSON: 831-1.0051525.json
- JSON-LD: 831-1.0051525-ld.json
- RDF/XML (Pretty): 831-1.0051525-rdf.xml
- RDF/JSON: 831-1.0051525-rdf.json
- Turtle: 831-1.0051525-turtle.txt
- N-Triples: 831-1.0051525-rdf-ntriples.txt
- Original Record: 831-1.0051525-source.json
- Full Text
- 831-1.0051525-fulltext.txt
- Citation
- 831-1.0051525.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051525/manifest