A Clique Tree Algorithm Exploiting Context Specific Independence

by Leslie Tung
B.Sc., University of British Columbia, 2000

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 2002
© Leslie Tung, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

Context specific independence can provide compact representation of the conditional probabilities in Bayesian networks when some variables are only relevant in specific contexts. We present cve-tree, an algorithm that exploits context specific independence in clique tree propagation. This algorithm is based on a query-based contextual variable elimination algorithm (cve) that eliminates in turn the variables not needed in an answer. We extend cve to produce the posterior probabilities of all variables efficiently and to allow the incremental addition of evidence. We perform experiments that compare cve-tree and Hugin using parameterized random networks that exhibit various amounts of context specific independence, as well as a standard network, the Insurance network.
Our empirical results show that cve-tree is efficient, both in time and in space, as compared to the Hugin architecture, on computing posterior probabilities for Bayesian networks that exhibit context specific independence.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Bayesian Network
  1.2 Probability Inference
  1.3 Organization
2 Context Specific Independence
  2.1 Definitions
  2.2 Past Work on Context Specific Independence
3 Algorithms for Probabilistic Inference
  3.1 Variable Elimination
  3.2 Variable Elimination with Clique Trees
  3.3 The Hugin Architecture
  3.4 Contextual Variable Elimination
4 Probability Inference Using cve-tree
  4.1 Contextual Clique Tree
  4.2 An Example
5 Empirical Evaluation of cve-tree
  5.1 Experiment Setup
  5.2 Space Comparisons
  5.3 Time Comparisons
  5.4 The Insurance Network
6 Conclusion and Future Work
Bibliography

List of Tables

Summary of the insurance network and its approximations in terms of context specific independence and rule and table sizes

List of Figures

1.1 Example Network #1
2.1 Rule base for Example Network #1
3.1 The Variable Elimination Algorithm
3.2 Example Network #2
3.3 The moral graph and one of the triangulated moral graphs for Example Network #2
3.4 A clique tree for Example Network #2
3.5 The initial clique tree of Example Network #2 using the ve algorithm
3.6 The ve-tree Algorithm
3.7 The consistent clique tree of Example Network #2 using the ve-tree algorithm
3.8 The initial clique tree of Example Network #2 using Hugin
3.9 A clique tree for Example Network #1 with Hugin's initial factors
3.10 A consistent clique tree for Example Network #1 (Hugin)
3.11 The Contextual Variable Elimination Algorithm
4.1 A clique tree of Example Network #1 with initial rules (cve-tree)
4.2 The consistent clique tree for Example Network #1 (cve-tree)
5.1 Comparison of table size of consistent clique trees
5.2 Comparison of table size of consistent clique trees from networks with splits = 0
5.3 Comparison of table size of consistent clique trees from networks with splits = 1
5.4 Comparison of table size of consistent clique trees from networks with splits = 5
5.5 Comparison of table size of consistent clique trees from networks with splits = 10
5.6 Distribution of time to set up the initial clique trees
5.7 Distribution of message propagation time
5.8 Distribution of the average time to get the probability of a query variable from a consistent clique tree
5.9 Distribution of the total time for calculating the probability of 5 unobserved variables
5.10 Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 0 (i.e., no context specific independence)
5.11 Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 1
5.12 Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 5
5.13 Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 10
5.14 Total time for calculating the probabilities of all of the 20 unobserved variables for random networks with no observed variables
5.15 Total time for calculating the probabilities of all of the 15 unobserved variables for random networks with 5 observed variables
5.16 Total time for calculating the probabilities of all of the 10 unobserved variables for random networks with 10 observed variables
5.17 Comparison of the total time required to calculate the probability of 5 unobserved variables for 4 approximations of the insurance network

Acknowledgements

I would like to thank my supervisor, David Poole, for his ideas, comments and guidance.
I would also like to thank my parents for their encouragement and support. Without them, my studies and research would not have been possible.

Leslie Tung
The University of British Columbia
August 2002

Chapter 1

Introduction

1.1 Bayesian Network

Reasoning under uncertainty is an important issue in artificial intelligence because human and computer agents must make decisions on what actions to take without knowing the exact state of the world and without being able to precisely predict the outcomes of actions.

In situations in which causality plays a role, Bayesian networks (Pearl, 1988) allow us to model the situations probabilistically. Bayesian networks provide a graphical representation of the dependencies and independencies among the variables. A Bayesian network N is a triplet (V, A, P) where

• V is a set of variables.

• A is a set of arcs, which together with V constitutes a directed acyclic graph G = (V, A).

• P = {P(v|πv) : v ∈ V}, where πv is the set of parents of v; the parents of v are a set of variables such that v is independent of all other variables given the parents of v. That is, P is the set of conditional probabilities of all the variables given their respective parents. If a variable has no parents, then πv is empty and P(v|πv) is simply the prior probability of v.

The arcs in a Bayesian network represent direct influence between the random variables. The independency encoded in a Bayesian network is that each variable is independent of its non-descendants given its parents. Figure 1.1 shows a Bayesian network of 6 variables, along with the conditional probability tables for each variable. The domain of all variables is Boolean.
By conditional independence, the prior joint probability of a Bayesian network can be obtained by

    P(V) = ∏_{v∈V} P(v|πv)

and for any subset X of V, the marginal probability P(X) is obtained by

    P(X) = Σ_{V−X} ∏_{v∈V} P(v|πv)

If Y ⊆ V is the set of variables observed to have specific values Y0 and X ⊆ V is the set of variables whose joint probability is of interest, then the posterior probability P(X|Y = Y0) can be obtained by

    P(X|Y = Y0) = P(X ∧ Y = Y0) / P(Y = Y0)

1.2 Probability Inference

The most common task we wish to solve using Bayesian networks is probabilistic inference. Probabilistic inference is the process of computing the posterior probability P(X|Y = Y0), where X is a set of one or more query variables and Y is a set of observed variables with corresponding observed values Y0.

Figure 1.1: Example Network #1. The conditional probability tables (probability of true / false) are:

    P(A): 0.4 / 0.6
    P(B): 0.2 / 0.8
    P(C): 0.3 / 0.7

    P(D | A, B, C):
      A=t, B=t, C=t: 0.7 / 0.3      A=f, B=t, C=t: 0.2 / 0.8
      A=t, B=t, C=f: 0.7 / 0.3      A=f, B=t, C=f: 0.4 / 0.6
      A=t, B=f, C=t: 0.7 / 0.3      A=f, B=f, C=t: 0.2 / 0.8
      A=t, B=f, C=f: 0.7 / 0.3      A=f, B=f, C=f: 0.9 / 0.1

    P(E | C, D):
      C=t, D=t: 0.8 / 0.2           C=f, D=t: 0.8 / 0.2
      C=t, D=f: 0.1 / 0.9           C=f, D=f: 0.3 / 0.7

    P(F | D, E):
      D=t, E=t: 0.4 / 0.6           D=f, E=t: 0.4 / 0.6
      D=t, E=f: 0.7 / 0.3           D=f, E=f: 0.8 / 0.2

The computation of a variable's posterior probability given evidence in a Bayesian network is in general NP-hard (Cooper, 1987). However, there exist many architectures and algorithms for computing all the exact posterior probabilities in a Bayesian network efficiently for many networks, including Lauritzen and Spiegelhalter (1988), Shafer and Shenoy (1990) and Jensen, Lauritzen and Olesen (1990). These algorithms are based on a secondary structure called a clique tree (or join tree or junction tree), which is built by triangulating a Bayesian network's moralized graph.
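The factored joint distribution above can be evaluated by brute-force enumeration when the network is small. The following sketch is our own illustration (not part of the thesis); it encodes the Figure 1.1 tables as dictionaries and computes a marginal by summing the joint, with the representation and helper names being our assumptions:

```python
from itertools import product

# CPTs of Example Network #1 (Figure 1.1). Each entry gives
# P(var = true | parents); P(false | ...) is the complement.
p_A = 0.4
p_B = 0.2
p_C = 0.3
p_D = {(True, b, c): 0.7 for b in (True, False) for c in (True, False)}
p_D.update({(False, True, True): 0.2, (False, True, False): 0.4,
            (False, False, True): 0.2, (False, False, False): 0.9})
p_E = {(True, True): 0.8, (True, False): 0.1,
       (False, True): 0.8, (False, False): 0.3}   # keyed by (C, D)
p_F = {(True, True): 0.4, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.8}   # keyed by (D, E)

def bern(p, v):
    """P(v) for a Boolean variable with P(true) = p."""
    return p if v else 1.0 - p

def joint(a, b, c, d, e, f):
    """P(V) = product over v of P(v | parents(v))."""
    return (bern(p_A, a) * bern(p_B, b) * bern(p_C, c)
            * bern(p_D[(a, b, c)], d)
            * bern(p_E[(c, d)], e)
            * bern(p_F[(d, e)], f))

# Marginal P(F = true): sum the joint over all other variables.
p_f_true = sum(joint(a, b, c, d, e, True)
               for a, b, c, d, e in product((True, False), repeat=5))
```

Enumeration is exponential in the number of variables; the variable elimination algorithm of Section 3.1 avoids this blow-up by exploiting the distributive law.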
Other algorithms, such as bucket elimination (Dechter, 1996) and variable elimination (Zhang and Poole, 1994), use a query-based approach to calculate the posterior probabilities of variables as they are demanded.

The definition of a Bayesian network only captures the independence relations among variables, that is, independencies that hold for any assignment of values to the variables. However, it does not constrain how a variable depends on its parents. There is often some structure in the conditional probability tables, such as causal independence (Heckerman and Breese, 1994; Zhang and Poole, 1996; Zhang and Yan, 1997) and context specific independence (Boutilier, Friedman, Goldszmidt and Koller, 1996; Geiger and Heckerman, 1996; Zhang, 1998; Zhang and Poole, 1999; Poole and Zhang, 2002), that can be exploited to improve the time and space required in probabilistic inference.

Causal independence refers to the situation where multiple causes contribute independently to a common effect. With causal independence, the probability function can be described using a binary operator that can be applied to values from each of the parent variables.

Context specific independence refers to variable dependencies that depend on particular values of parent variables. Intuitively, in the network in Figure 1.1, the regularities of the conditional probability table of D ensure that D is independent of B and C given the context A = true, but is dependent on B and C in the context A = false.

We propose an algorithm called cve-tree that exploits context specific independence using the clique tree structure to compute all the posterior probabilities in a Bayesian network. Our experiments show that cve-tree is more efficient both in time and space than Hugin (Lauritzen and Spiegelhalter, 1988) when some context specific independence is present in the Bayesian network.
1.3 Organization

We first present definitions of context specific independence and describe some related work in the research of exploiting context specific independence in Chapter 2. Chapter 3 presents some algorithms for probabilistic inference that are related to our proposed algorithm, cve-tree. The cve-tree algorithm is discussed in detail in Chapter 4. Chapter 5 gives some empirical evaluation of cve-tree using standard and random networks that contain various amounts of context specific independence. Chapter 6 presents conclusions and future work.

Chapter 2

Context Specific Independence

2.1 Definitions

In this section, we give some definitions for context specific independence. We also define generalized rules, which we will use to capture context specific independence and facilitate probabilistic inference exploiting context specific independence. These definitions are based on Boutilier et al. (1996) and Poole and Zhang (2002).

Definition 1 Given a set of variables C, a context on C is an assignment of one value to each variable in C. Two contexts are incompatible if there exists a variable that is assigned different values in them; otherwise, they are compatible. Context c1 on C1 is a subcontext of context c2 on C2 if C1 ⊆ C2 and c1 and c2 are compatible. An empty context, denoted TRUE, occurs when C is an empty set.

For example, the context A = true ∧ B = false is compatible with the context A = true ∧ C = true, and is incompatible with the context A = false ∧ C = true.

Definition 2 Let X, Y, and C be sets of variables. X and Y are contextually independent given context C = c, where c ∈ domain(C), if

    P(X | Y = y1 ∧ C = c) = P(X | Y = y2 ∧ C = c)

for all y1, y2 ∈ domain(Y) such that P(Y = y1 ∧ C = c) > 0 and P(Y = y2 ∧ C = c) > 0.
In the network in Figure 1.1, variable F is contextually independent of variable D given the context E = true, since P(F = true | E = true) = 0.4 and P(F = false | E = true) = 0.6 regardless of the value of D. On the other hand, if E = false, then the probability of F would depend on the value of D.

Definition 3 Suppose we have a total ordering of the variables X1, ..., Xn of a Bayesian network such that all parents of a variable precede the variable. Given variable Xi, we say that C = c, where C ⊆ {X1, ..., Xi−1} and c ∈ domain(C), is a parent context for Xi if Xi is contextually independent of the predecessors of Xi (namely, {X1, ..., Xi−1}) given C = c.

Definition 4 A mutually exclusive and exhaustive set of parent contexts for variable Xi is a set of parent contexts for Xi such that, for every context C on the parents of Xi, there is exactly one element of the mutually exclusive and exhaustive set of parent contexts that is compatible with C.

A contextual belief network has the same graphical structure as a belief network, but each node Xi in the network is associated with a mutually exclusive and exhaustive set of minimal parent contexts, Πi, and, for each π ∈ Πi, a probability distribution P(Xi|π) on Xi. Instead of having a conditional probability table for a node, each node has a set of generalized rules whose contexts form a mutually exclusive and exhaustive set of minimal parent contexts.
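The condition in Definition 2 can be checked mechanically on a conditional probability table. Below is a minimal sketch (our illustration, not from the thesis; the dict representation and function name are hypothetical), using the P(F|D, E) table from Figure 1.1:

```python
# P(F = true | D, E) from Figure 1.1, keyed by (D, E).
p_F = {(True, True): 0.4, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.8}

def contextually_independent(cpt, context_index, context_value):
    """Check whether the child is contextually independent of the
    remaining parent, given that the parent at position context_index
    in the key takes context_value (Definition 2, for Boolean parents)."""
    rows = [p for key, p in cpt.items() if key[context_index] == context_value]
    return all(abs(p - rows[0]) < 1e-12 for p in rows)

# F is contextually independent of D given E = true (both rows give 0.4) ...
assert contextually_independent(p_F, 1, True)
# ... but depends on D in the context E = false (0.7 vs 0.8).
assert not contextually_independent(p_F, 1, False)
```

The same check, applied per parent value, is what identifies the parent contexts of Definition 3 in a tabular CPT.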
Definition 5 A generalized rule is of the form ⟨c, t⟩, where c is a context, say, X1 = v1 ∧ ... ∧ Xk = vk, and t is a table that represents a function on variables Xk+1, ..., Xm. t[vk+1, ..., vm] is a number that represents the contribution of the context X1 = v1 ∧ ... ∧ Xk = vk ∧ Xk+1 = vk+1 ∧ ... ∧ Xm = vm.

For this thesis, generalized rules are simply called rules. The rule

    ⟨A = false ∧ C = false,
        B=t, D=t: 0.4
        B=t, D=f: 0.6
        B=f, D=t: 0.9
        B=f, D=f: 0.1 ⟩

represents the part of the conditional probability table of P(D|A, B, C) in Figure 1.1 with A = false and C = false. The probability table of P(D|A, B, C) in Figure 1.1 can be represented by rules with contexts A = true, A = false ∧ C = true, A = false ∧ C = false ∧ B = true, and A = false ∧ C = false ∧ B = false. These contexts form a mutually exclusive and exhaustive set of parent contexts for variable D because for any truth value assignment on the parent variables A, B, and C, the assignment is compatible with exactly one of the four contexts.

The rule base of the network is the union of the rules in all the nodes of the network. Figure 2.1 lists the rule base of the Example Network #1 in Figure 1.1. With context specific independence, some of the larger tables are collapsed into a number of rules with smaller tables. For example, the probability table P(D|A, B, C) without context specific independence encoded has 16 entries. This table is split into 3 rules of table size 2, 2, and 4, for a total of 8 entries. Notice that the two parent contexts for D, A = false ∧ C = false ∧ B = true and A = false ∧ C = false ∧ B = false, are combined into one rule in the rule base because no savings can be obtained by maintaining them as two separate rules.
The amount of space savings can be significant when a variable has many parents, but does not depend on all of them in some contexts.

Figure 2.1: Rule base for Example Network #1. Each rule pairs a context with a table of contributions (probability of true / false):

    ⟨TRUE, A: 0.4 / 0.6⟩
    ⟨TRUE, B: 0.2 / 0.8⟩
    ⟨TRUE, C: 0.3 / 0.7⟩

    ⟨A = true, D: 0.7 / 0.3⟩
    ⟨A = false ∧ C = true, D: 0.2 / 0.8⟩
    ⟨A = false ∧ C = false,
        B=t: D: 0.4 / 0.6
        B=f: D: 0.9 / 0.1 ⟩

    ⟨D = true, E: 0.8 / 0.2⟩
    ⟨D = false,
        C=t: E: 0.1 / 0.9
        C=f: E: 0.3 / 0.7 ⟩

    ⟨E = true, F: 0.4 / 0.6⟩
    ⟨E = false,
        D=t: F: 0.7 / 0.3
        D=f: F: 0.8 / 0.2 ⟩

2.2 Past Work on Context Specific Independence

There are a few different ways in the literature that make use of the additional structure of context specific independence to improve probabilistic inference.

Boutilier et al. (1996) presents two algorithms to exploit context specific independence in a Bayesian network. The first is network transformation and clustering. With this method, auxiliary variables are introduced into the original network such that many of the context specific independencies are qualitatively encoded within the network structure of the transformed network. The transformation of the Bayesian network makes use of the structure of CPT-trees, which are the conditional probability tables of the Bayesian network represented as decision trees. Computational savings can be achieved with transformations that reduce the overall number of values present in a family (that is, a node and its parents) using clustering algorithms such as Hugin (Lauritzen and Spiegelhalter, 1988).

The other algorithm suggested by Boutilier et al. (1996) is a form of cutset conditioning.
The cutset algorithm for probability inference works by selecting a set of variables (called a cutset) that, once instantiated, makes the network singly connected. Each variable in the cutset is instantiated with a value as evidence and inference is performed on the resulting network. This is done using reasoning by cases, where each case is a possible assignment to the variables in the cutset. The results of inference for all cases are combined to give the final answer to the query. The running time of the cutset algorithm is determined by the number of cases, that is, the total number of values for the cutset variables. With context specific independence, instantiating a particular variable in the cutset to a certain value in order to cut a loop may render other arcs vacuous. This may cut additional loops without instantiating additional variables in the cutset. As a result, the number of cases may be reduced and so the number of times that inference must be performed is reduced, speeding up the total inference time for queries.

Geiger and Heckerman (1996) presents another method to exploit context specific independence. With the notion of similarity networks, context specific independencies are made explicit in the graphical structure of a Bayesian network. In a similarity network, the edges between the variables can be thought of as a network whose edges can appear or disappear depending on the values of certain variables in the network. This allows for different Bayesian networks to perform inference for different contexts.

Poole and Zhang (2002) presents a rule-based contextual variable elimination algorithm. This algorithm will be discussed in detail in Section 3.4.
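The generalized rules of Definition 5 map naturally onto code. The following sketch is our own illustration (not from the thesis); it stores the three rules for variable D from Figure 2.1 and looks up P(D = true) for a full parent assignment, with the representation and function names being hypothetical:

```python
# Generalized rules for variable D (Figure 2.1): each rule pairs a context
# (a dict of parent assignments) with a table giving P(D = true). Tables on
# remaining parents are keyed by tuples of their values; in this example
# the only multi-entry table is indexed by B.
rules_D = [
    ({'A': True}, {(): 0.7}),                                    # A = true
    ({'A': False, 'C': True}, {(): 0.2}),                        # A=f, C=t
    ({'A': False, 'C': False}, {(True,): 0.4, (False,): 0.9}),   # table on B
]

def p_d_true(a, b, c):
    """Look up P(D = true | A=a, B=b, C=c) in the rule base. Exactly one
    rule's context is compatible with the assignment (Definition 4)."""
    assignment = {'A': a, 'B': b, 'C': c}
    for context, table in rules_D:
        if all(assignment[var] == val for var, val in context.items()):
            key = (b,) if len(table) > 1 else ()   # remaining dimension: B
            return table[key]

assert p_d_true(True, False, True) == 0.7    # rule with context A = true
assert p_d_true(False, True, False) == 0.4   # A=false, C=false, B=true
```

The three rule tables hold 4 distinct P(true) values (8 entries counting complements), versus 8 (16) for the full table, which is the space saving discussed above.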
Chapter 3

Algorithms for Probabilistic Inference

3.1 Variable Elimination

The variable elimination (ve) algorithm (Zhang and Poole, 1994) is a query-oriented algorithm for probabilistic inference which does not make use of any secondary structure and works on factors representing conditional probabilities.

The ve algorithm works as follows. Suppose a Bayesian network consists of variables X1, ..., Xn, the variables E1, ..., Es are observed to have values e1, ..., es, respectively, and the conditional probability of variable X is queried given the observations. We have

    P(X | E1 = e1 ∧ ... ∧ Es = es) = P(X ∧ E1 = e1 ∧ ... ∧ Es = es) / P(E1 = e1 ∧ ... ∧ Es = es)

where the denominator is a normalizing factor:

    P(E1 = e1 ∧ ... ∧ Es = es) = Σ_{v ∈ domain(X)} P(X = v ∧ E1 = e1 ∧ ... ∧ Es = es)

and the numerator can be calculated by summing out the non-query, non-observed variables Y, where Y = {Y1, ..., Yk} = {X1, ..., Xn} − {X} − {E1, ..., Es}. Thus:

    P(X ∧ E1 = e1 ∧ ... ∧ Es = es) = Σ_{Yk} ... Σ_{Y1} P(X1, ..., Xn)[E1=e1 ∧ ... ∧ Es=es]
                                    = Σ_{Yk} ... Σ_{Y1} ∏_{i=1}^{n} P(Xi | πXi)[E1=e1 ∧ ... ∧ Es=es]

where the subscripted probabilities mean that the associated variables are assigned the corresponding values.

The problem of probabilistic inference is thus reduced to the problem of summing out variables from a product of functions. This can be solved efficiently using the distributive law of algebra. To compute the sum of products xy + xz efficiently, we can distribute out the common factor and get x(y + z). This is the essence of the variable elimination algorithm.

A factor is a representation of a function from a tuple of variables into the real numbers. This can be represented as a d-dimensional table, where d is the number of variables in the factor. Initially, the factors represent the set of conditional probability tables in the Bayesian network.
The product of two factors f1 and f2, written f1 ⊗t f2, is a factor on the union of the variables in f1 and f2. To construct the product of factors f1 ⊗t f2 ⊗t ... ⊗t fk, whose union of variables is, say, X1, ..., Xr, we construct an r-dimensional table having an entry for each combination v1, ..., vr, where vi ∈ domain(Xi). The value for the entry corresponding to v1, ..., vr is obtained by multiplying the values obtained from each fi applied to the projection of v1, ..., vr onto the variables of fi.

The summing out of variable Y from a factor f with variables X1, ..., Xk, Y, written Σ_Y f, is the factor with variables X1, ..., Xk defined by:

    (Σ_Y f)(X1, ..., Xk) = Σ_{vi ∈ domain(Y)} f(X1, ..., Xk, Y = vi)

Summing out a variable reduces the dimensionality of a factor by one. The values of the resulting factor are obtained by adding the values of the factor for each value of the variable to be summed out.

The ve algorithm works by keeping a set of factors, originally obtained from the initial set of conditional probabilities in the Bayesian network, and then eliminating the non-query variables in the network one by one. The order in which the variables are eliminated is called the elimination order. To eliminate an unobserved variable Y, we replace the factors in the set that contain Y, say, fm+1, fm+2, ..., fr, by their product with Y summed out, namely, Σ_Y fm+1 ⊗t fm+2 ⊗t ... ⊗t fr. After all non-query variables are eliminated, the set only contains factors on the query variable(s). The posterior probability of the query variable(s) is obtained by multiplying the factors in the set and normalizing the resulting factor. The ve algorithm is summarized in Figure 3.1.
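The factor product and summing-out operations can be sketched directly. The code below is our own illustration (not from the thesis), using dict-based factors over Boolean variables with hypothetical function names; the example multiplies P(E|C, D) by P(C) from Figure 1.1 and then sums out C:

```python
from itertools import product

# A factor maps each assignment (a tuple of values, one per variable in
# its variable list) to a number. All variables here are Boolean.
def factor_product(f1, vars1, f2, vars2):
    """Pointwise product f1 ⊗t f2 on the union of the two variable sets."""
    union = vars1 + [v for v in vars2 if v not in vars1]
    def project(assignment, vs):
        return tuple(assignment[union.index(v)] for v in vs)
    table = {}
    for assignment in product((True, False), repeat=len(union)):
        table[assignment] = (f1[project(assignment, vars1)]
                             * f2[project(assignment, vars2)])
    return table, union

def sum_out(f, vars_, y):
    """Sum variable y out of factor f, reducing its dimension by one."""
    i = vars_.index(y)
    rest = vars_[:i] + vars_[i + 1:]
    table = {}
    for assignment, value in f.items():
        key = assignment[:i] + assignment[i + 1:]
        table[key] = table.get(key, 0.0) + value
    return table, rest

# P(E | C, D) on variables (C, D, E), and P(C), from Figure 1.1.
pE = {(True, True, True): 0.8, (True, True, False): 0.2,
      (True, False, True): 0.1, (True, False, False): 0.9,
      (False, True, True): 0.8, (False, True, False): 0.2,
      (False, False, True): 0.3, (False, False, False): 0.7}
pC = {(True,): 0.3, (False,): 0.7}

prod, pvars = factor_product(pE, ['C', 'D', 'E'], pC, ['C'])
marg, mvars = sum_out(prod, pvars, 'C')   # resulting factor on ['D', 'E']
```

Note that the result for D = true gives 0.8 for E = true regardless of how C was distributed, reflecting the context specific independence of E and C given D = true.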
To compute P(X | E1 = e1 ∧ ... ∧ Es = es):
    let F be the factors obtained from the original conditional probabilities
    replace each f ∈ F that involves some Ei with f[E1=e1, ..., Es=es]
    while there is a factor involving a non-query variable
        select non-query variable Y to eliminate
        F ← eliminate(Y, F)
    return normalize(F, X)

Procedure eliminate(Y, F)
    partition F into:
        {f1, ..., fm} that do not contain Y
        {fm+1, ..., fr} that do contain Y
    compute f = Σ_Y fm+1 ⊗t ... ⊗t fr
    return {f1, ..., fm, f}

Procedure normalize({f1, ..., fr}, X)
    compute f = f1 ⊗t ... ⊗t fr
    compute c = Σ_X f
    return f/c

Figure 3.1: The Variable Elimination Algorithm

3.2 Variable Elimination with Clique Trees

While Bayesian networks are used to encode the probabilities and conditional independencies, many architectures perform probabilistic inference on a secondary structure of the Bayesian network called a clique tree, instead of working on the entire set of conditional probabilities directly. When the posterior probability of more than one variable is wanted, the clique tree serves as a cache to save us from recalculating some of the probabilities already computed for other variables.

The graph of a Bayesian network is a directed acyclic graph (DAG). The moralized graph of a DAG is created by marrying the parents of each node and then dropping the directions of the arcs in the DAG. Marrying the parents of a node X means that we identify the set of nodes that represent the parents of X and connect each pair of nodes in the set with an undirected arc. An undirected graph is triangulated if and only if every cycle of length four or higher in the graph contains an arc that connects two non-adjacent nodes in the cycle. Kjaerulff (1990) describes a procedure to triangulate arbitrary undirected graphs. A clique in an undirected graph is a set of nodes that forms a complete and maximal subgraph of this undirected graph.
That is, every pair of distinct nodes in the clique is connected by an arc in the undirected graph and the clique is not properly contained in a larger complete subgraph. A clique tree is a tree whose nodes are cliques in the triangulated, moralized Bayesian network such that for the path connecting two cliques Ki and Kj, the intersection K = Ki ∩ Kj is a subset of every clique on the path.

The efficiency of clique tree algorithms depends largely on the size of the clique tree and the size of the largest clique. These sizes are determined by the triangulation of the moral graph. Finding good triangulations in the building of compact clique trees is related to finding good elimination orderings for the variable elimination algorithm, as triangulating a graph can be done by selecting an ordering in which the vertices of the graph (i.e., the variables in the network) should be eliminated (Kjaerulff, 1990).

Figure 3.2 shows the graph of a Bayesian network (Example Network #2) with 8 variables (the probability tables are omitted). Its moral graph and triangulated moral graph are shown in Figure 3.3, and the resulting clique tree is shown in Figure 3.4.

Figure 3.2: Example Network #2

Figure 3.3: The moral graph and one of the triangulated moral graphs for Example Network #2

Figure 3.4: A clique tree for Example Network #2

Even though the variable elimination algorithm as described in the previous section is a query-based algorithm, it can be extended to work on clique trees to obtain the posterior probabilities of all the variables in a Bayesian network simultaneously. We introduce the ve-tree algorithm as an extension of variable elimination on clique trees. The idea of ve-tree is initiated by other clique tree algorithms such as Hugin (Lauritzen and Spiegelhalter, 1988) that calculate the posterior probabilities of all variables.
Even though the ve-tree algorithm is less efficient than Hugin, the contribution of ve-tree is that it provides the concept and structure that are the foundation of the cve-tree algorithm we propose to exploit context specific independence in Chapter 4.

With ve-tree, after the clique tree for a Bayesian network is constructed, each conditional probability table in the Bayesian network is assigned to a clique that consists of all the variables for that probability table. The cliques simply keep the conditional probabilities in a set. Observations are incorporated into the factors in the same way as in the ve algorithm. Each factor that contains an observed variable can zero out the entries in the factor that do not correspond to the variables' observed values, or it can be shrunk by deleting the dimensions that do not correspond to the variables' observed values. Probabilistic inference proceeds as message passing among the cliques. We keep a sepset between each pair of connected cliques in the clique tree to hold the two messages being passed between the cliques, one for each direction. The clique tree for the Bayesian network in Figure 3.2 with the initial factors assigned is shown in Figure 3.5.

Figure 3.5: The initial clique tree of Example Network #2 using the ve algorithm

The clique tree can now be made consistent by performing global propagation, which consists of a series of message passes between adjacent cliques. In a consistent clique tree, each clique's factors represent the marginal probability of the variables in the clique given the observations.

In a single message pass from clique K1, with the set of variables X1, to a neighbouring clique K2, with the set of variables X2, we take the set of factors in clique K1, subtracted by the set of factors stored in the sepset between K1 and K2 in the direction to K1 (this set of factors corresponds to a previous message passed from K2 to K1; it would be an empty set if K2 has not passed any message to K1). Let this set of factors be F. The message passed from K1 to K2, M_K1K2, is obtained by:

    for each Y ∈ X1 − (X1 ∩ X2)
        F ← eliminate(Y, F)
    M_K1K2 ← F

where the eliminate procedure is as defined in Figure 3.1. The message M_K1K2 is stored in the sepset between K1 and K2 in the K2 direction, and is unioned to the set of factors in the clique K2.

For instance, if the clique DEFG is to pass a message to the clique CDE in Figure 3.5, and CDE has not passed a message to DEFG (so the sepset DE has no message stored), we need to eliminate variables F and G from the set of factors in clique DEFG. By calling the eliminate procedure on variables F and G on the factors {P(F|D,E), P(G|D,E,F)}, the message

    { f1(D, E) = Σ_G Σ_F P(F|D,E) · P(G|D,E,F) }

is obtained. This set of factors is stored in the sepset DE in the direction of clique CDE and unioned with the factors in clique CDE.

Given a clique tree with n cliques, global propagation starts with choosing an arbitrary clique, R, as the root clique, and then performing 2(n − 1) message passes, divided into two phases, to make the clique tree consistent. During the collect-evidence phase, each clique passes a message to its neighbour in R's direction, beginning with the cliques farthest away from R. During the distribute-evidence phase, each clique passes messages to its neighbours away from R's direction. Each
In a single message pass from clique K\, with the set of variables Xi, to a neighbouring clique Ki, with the set of variables X 2 , 20 A B C { P(A), P(B|A), P(C|A)} B C The initial clique tree of Example Network #2 using the ve algorithm 21 we take the set of factors in clique K\, subtracted by the set of factors stored in the sepset between K\ and K2 in the direction to K\ (this set of factors corresponds to a previous message passed from K2 to K\; it would be an empty set if K2 has not passed any message to K\). Let this set of factors be F. The message passed from K\ to K2, MK1K2, is obtained by: for each Y e Xi - (Xi n X2) F <— eliminate(Y, F) MKLK2 <- F where the eliminate procedure is as defined in Figure 3.1. The message- MKXK2 is stored in the sepset between K\ and K2 in the K2 direction, and is unioned to the set of factors in the clique K2. For instance, if the clique DEFG is to pass a message to the clique CDE in Figure 3.5, and CDE has not passed a message to DEFG (so the sepset DE has no message stored), we need to eliminate variables F and G from the set of factors in clique DEFG. By calling the eliminate procedure on variables F and G on the factors {P(F\DE),P(G\DEF)}, the message: {h(D,E) = YJlZp(F\DE)P{G\DEF)} G F is obtained. This set of factors is stored in the sepset DE in the direction of clique CDE and unioned with the factors in clique CDE. Given a clique tree with n cliques, global propagation starts with choosing an arbitrary clique, R, as the root clique, and then performing 2(n — 1) message passes, divided into two phases, to make the clique tree consistent. During the collect-evidence phase, each clique passes a message to its neighbour in i?'s direction, beginning with the cliques farthest away from R. During the distribute-evidence phase, each clique passes messages to its neighbours away from R's direction. Each 22 of the two phases causes ra — 1 message to be passed. The procedures are shown in Figure 3.6. 
The result of global propagation is that each clique passes its information to all other cliques in the clique tree. Note that a clique can only pass a message to a neighbour after the clique has received messages from all of its other neighbours. This ensures consistency of the clique tree when global propagation is completed. Figure 3.7 shows the clique tree of Figure 3.5 with the messages passed during global propagation (where clique BCD is chosen as the root clique) and the resulting set of factors in each clique after global propagation is completed. The order of message passed is shown by the circled numbers of the arrows in the direction of the messages. After global propagation is completed, the clique tree becomes consistent and we can compute the probability of each variable of interest, say X, using the calculate-probability procedure. The advantage of this extended algorithm of variable elimination is two-fold. First, the majority of the work for computing the posterior probabilities of all vari-ables in the Bayesian network is done only once by making the clique tree consistent with global propagation. When the probability of a variable is desired, it can be obtained from the set of factors in the clique that contains that variable. This set of factors is much smaller in size than the entire set of conditional probabilities that ve must use to calculate for each query variable, and so the required probabilities can be more efficiently computed. This efficiency is especially apparent when the probabilities of many variables are queried at the same time, as ve needs to perform the variable elimination process from the beginning for each query variable. Second, this algorithm is more efficient than ve in the case of incremental observations, that is, where observations do not all come in together in the beginning. 
With ve, we must redo all the calculations from the initial set of conditional probabilities for each query variable as each observation comes in.

    Procedure global-propagation
        select an arbitrary clique R as root clique
        unmark all cliques
        call collect-evidence(R, null)
        unmark all cliques
        call distribute-evidence(R)

    Procedure collect-evidence(K, Caller)
        mark K
        for each K' ∈ neighbours of K
            if K' unmarked
                call collect-evidence(K', K)
        if Caller ≠ null    // not processing root clique
            call pass-message(K, Caller)

    Procedure distribute-evidence(K)
        mark K
        for each K' ∈ neighbours of K
            if K' unmarked
                call pass-message(K, K')
                call distribute-evidence(K')

    Procedure pass-message(K1, K2)    // ve-tree
        let S = sepset between K1 and K2
        let M21 = message stored in S in the direction of K1
        let F = set of factors in K1
        F ← F − M21
        for each Y ∈ K1 − (K1 ∩ K2)
            F ← eliminate(Y, F)    // Note: eliminate is shown in Figure 3.1
        store F in S in the direction of K2
        let G = set of factors in K2
        G ← G ∪ F

    Procedure calculate-probability(X)    // ve-tree
        identify clique K such that X ∈ K
        let F = set of factors in K
        for each Y ∈ K − {X}
            F ← eliminate(Y, F)
        return normalize(F, X)    // Note: normalize is shown in Figure 3.1

Figure 3.6: The ve-tree Algorithm

[Figure 3.7 shows the consistent clique tree of Example Network #2 under ve-tree: each clique holds its initial factors together with the messages it received, e.g. clique DEFG holds {P(F|DE), P(G|DEF)} plus a received message, and the message f1(D, E) = Σ_F Σ_G P(F|DE) · P(G|DEF) is stored in sepset DE; the circled numbers on the arrows give the order of the message passes from the root clique BCD.]

Figure 3.7: The consistent clique tree of Example Network #2 using the ve-tree algorithm

But with ve-tree, observations can be incorporated into a consistent clique tree by a distribute-evidence call from the clique whose factors are changed by the observation.
If more than one observation comes in at a time, affecting more than one clique, then the clique tree can be made consistent again with a collect-evidence call followed by a distribute-evidence call. Therefore, in the case of incremental observations, instead of having to recalculate the posterior probabilities of all variables individually as in ve, ve-tree can maintain the clique tree's consistency with at most two phases of message passing, which is a significant saving if the posterior probabilities of many variables are required.

3.3 The Hugin Architecture

The Hugin architecture (Lauritzen and Spiegelhalter, 1988) is one of the most established and efficient algorithms for calculating exact posterior probabilities. The procedure for Hugin is very similar to ve-tree as described in the previous section. Whereas the cliques and sepsets in ve-tree keep sets of factors representing conditional probabilities, the cliques and sepsets in Hugin keep marginal probabilities, obtained by immediately multiplying the sets of factors when they are initialized and passed as messages. Moreover, each sepset in Hugin keeps only one factor instead of two separate messages, one for each direction. Clique tree construction is done in the same way as in ve-tree. When the conditional probability tables are assigned to the cliques, they are immediately multiplied to form a single factor. The sepsets, and the cliques with no conditional probability table assigned, are initialized with tables over their variables with all entries being one. Observations are incorporated by zeroing out the entries in the factors that do not correspond to the variables' observed values. Figure 3.8 shows the initial clique tree of the Hugin architecture for the network in Figure 3.2.
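The initialization step just described can be sketched as follows. Boolean variables and the dict-based factor representation are illustrative assumptions, not Hugin's actual data structures:

```python
from itertools import product

# A factor is (vars, table): vars a tuple of variable names, table a dict
# mapping full boolean assignments to numbers.

def ones_factor(vars_):
    return (vars_, {asg: 1.0 for asg in product([True, False], repeat=len(vars_))})

def multiply_in(clique_factor, cpt):
    cv, ct = clique_factor
    pv, pt = cpt
    new = {}
    for asg, val in ct.items():
        env = dict(zip(cv, asg))
        new[asg] = val * pt[tuple(env[v] for v in pv)]
    return (cv, new)

def initialize_clique(clique_vars, assigned_cpts):
    # Start from an all-ones table, then multiply in each assigned CPT.
    f = ones_factor(clique_vars)
    for cpt in assigned_cpts:
        f = multiply_in(f, cpt)
    return f

def enter_observation(factor, var, value):
    # Zero out entries inconsistent with the observed value.
    fv, ft = factor
    i = fv.index(var)
    return (fv, {asg: (val if asg[i] == value else 0.0) for asg, val in ft.items()})
```

With the conditional probabilities of Example Network #1, initializing clique ABCD with P(A), P(B), P(C) and the table form of P(D|ABC) reproduces the ABCD table of Figure 3.9, e.g. the all-true entry 0.4 · 0.2 · 0.3 · 0.7 = 0.0168.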
After initialization, global propagation is performed the same way as in ve-tree (Figure 3.6), except for the pass-message procedure, which is revised as follows:

    Procedure pass-message(K1, K2)    // Hugin
        let S = sepset between K1 and K2
        let φ1 = factor in K1
        let φ2 = factor in K2
        let φ_old = factor in S
        φ_new ← Σ_{K1 − (K1 ∩ K2)} φ1
        φ2 ← φ2 · (φ_new / φ_old)    // update the factor in clique K2
        φ_old ← φ_new

The division step (φ_new / φ_old) in the pass-message procedure is analogous to the set subtraction step in ve-tree. It is done to prevent information that a neighbour previously passed to a clique from being passed back to the same neighbour. φ_new and φ_old are factors on the same set of variables, and the division is done entry by entry in the factors.

As an example to demonstrate the Hugin algorithm, Figure 3.9 shows the clique tree of Example Network #1 in Figure 1.1, along with the initial factors for the cliques (without any observations). Suppose the clique ABCD is chosen as the root clique. Propagation starts with the collect-evidence phase by passing a message from the leaf clique DEF to the clique CDE. The message is the factor of DEF with F summed out, resulting in a factor on the variables D and E. This factor in the message is divided by the factor stored in sepset DE (which is all ones at this time). The resulting factor is then multiplied with the factor in clique CDE, and the clique CDE replaces its factor with this product. Finally, the factor in the message is saved in sepset DE. Clique CDE then passes a message to clique ABCD in the same fashion. It first sums out the variable E from its factor and passes the resulting factor as the message. This message is divided by the factor stored in sepset CD. The resulting factor is multiplied with the factor in clique ABCD, and the clique ABCD replaces its factor with this product.
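The revised pass-message with its entry-by-entry division can be sketched as follows, over a dict-based factor representation used here as an illustrative assumption:

```python
# A factor is (vars, table): vars a tuple of variable names, table a dict
# mapping assignments (tuples of booleans) to numbers.

def marginalize(f, keep):
    # Sum a factor down to the variables in `keep`, preserving keep's order.
    fv, ft = f
    kv = tuple(v for v in keep if v in fv)
    idx = [fv.index(v) for v in kv]
    out = {}
    for asg, val in ft.items():
        key = tuple(asg[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return (kv, out)

def pass_message(phi1, phi2, sep_old):
    sv, st = sep_old
    phi_new = marginalize(phi1, sv)
    # Entry-by-entry division of the new sepset factor by the old one
    # (a zero in phi_old can only occur where phi_new is also zero).
    ratio = {k: (phi_new[1][k] / st[k] if st[k] != 0.0 else 0.0) for k in st}
    v2, t2 = phi2
    idx = [v2.index(v) for v in sv]
    t2_new = {asg: val * ratio[tuple(asg[i] for i in idx)] for asg, val in t2.items()}
    # Return the updated clique factor and the new sepset factor.
    return (v2, t2_new), phi_new
```

For instance, passing the marginal P(C, D) of Figure 3.10 into clique CDE through an all-ones sepset CD multiplies the entry for C, D, E all true from 0.8 up to 0.8 · 0.12 = 0.096, matching the consistent CDE table of Figure 3.10.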
Finally, the factor of the sepset CD is replaced with the factor in the message. This concludes the first sweep. In the distribute-evidence phase, messages are passed from the root clique to the leaf cliques. The same message passing procedure is followed, first passing a message from clique ABCD to clique CDE, and then passing a message from clique CDE to clique DEF. After the message propagations are completed, the clique tree is consistent. The resulting factors of the entire clique tree are shown in Figure 3.10.

With a consistent clique tree, each clique's factor is the marginal probability P(X ∧ Y = Y_O), where X is the set of variables in the clique and Y is the set of observed variables with corresponding observed values Y_O. The posterior probability of a variable can be readily obtained from the factor of a clique that contains the variable, by summing out the other variables in that factor and normalizing. Jensen et al. (1990) and Huang and Darwiche (1996) provide more implementation details of the algorithm.

The advantage of Hugin over ve-tree is that the sets of factors in the cliques of ve-tree need to be multiplied for each message pass, whereas they are multiplied only once in Hugin (when the message is received). Moreover, in a consistent clique tree, each clique already has the posterior marginal probability on the variables of the clique, and so the posterior probability of a variable can be very efficiently computed by simply summing out the other variables. This makes Hugin more efficient than ve-tree in computation time.
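Reading a posterior off a consistent clique is then a marginalize-and-normalize step; a minimal sketch (the dict-based factor representation is an illustrative assumption):

```python
def posterior(factor, x):
    # Sum out all variables of the clique factor except x, then normalize.
    fv, ft = factor
    i = fv.index(x)
    marg = {}
    for asg, val in ft.items():
        marg[asg[i]] = marg.get(asg[i], 0.0) + val
    total = sum(marg.values())
    return {v: p / total for v, p in marg.items()}
```

Applied to the consistent DEF factor of Figure 3.10, this gives P(F = true) = 0.55096.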
[Figure 3.9 lists the initial Hugin factors for the clique tree of Example Network #1: the sepset tables CD and DE contain all ones; clique ABCD holds the product P(A) · P(B) · P(C) · P(D|ABC) (e.g. the entry for A, B, C, D all true is 0.4 · 0.2 · 0.3 · 0.7 = 0.0168); clique CDE holds P(E|CD); and clique DEF holds P(F|DE).]

Figure 3.9: A clique tree for Example Network #1 with Hugin's initial factors

[Figure 3.10 lists the factors after propagation: sepset CD holds the marginal P(C, D) (entries 0.12, 0.18, 0.532, 0.168), sepset DE holds P(D, E) (entries 0.5216, 0.1304, 0.0684, 0.2796), and each clique holds the joint marginal over its variables, e.g. clique DEF holds P(D, E, F).]
Figure 3.10: A consistent clique tree for Example Network #1 (Hugin)

3.4 Contextual Variable Elimination

The contextual variable elimination (eve) algorithm (Poole and Zhang, 2002) is an extension of variable elimination (Zhang and Poole, 1994) that exploits the context specific independence that may exist in the conditional probabilities of a Bayesian network. Instead of representing conditional probabilities in terms of factors, contextual variable elimination represents conditional probabilities in terms of generalized rules, which capture context specific independence in variables (Section 2.1). A factor in ve is analogous to a set of rules whose contexts form a mutually exclusive and exhaustive set of parent contexts. The savings in time and space for probabilistic inference can be substantial when there is some context specific independence in the conditional probabilities. On the other hand, if there is no context specific independence, eve degenerates into ve, but with the overhead of maintaining the contexts.

The eve algorithm consists of three primitive operations: summing out a variable, multiplying rules, and splitting rules. At each stage of the algorithm, these operations maintain the following program invariant: the probability of a context on the non-eliminated variables can be obtained by multiplying the probabilities associated with rules that are applicable in that context. The following descriptions and definitions of the operations are based on Poole and Zhang (2002).

There are two cases for summing out a variable, as the variable to be summed out can appear in the context or in the table.
If the variable to be eliminated, Y, with domain {y1, ..., ys}, is in the table of a rule ⟨c, t⟩, then the rule can simply be replaced by ⟨c, Σ_Y t⟩. If Y appears in the contexts of a set of rules R, we define

    R_i = {⟨c_i, t_i⟩ : ⟨c_i ∧ Y = y_i, t_i⟩ ∈ R}, for 1 ≤ i ≤ s,

where each R_i represents the set of rules in which Y = y_i is part of the context, with Y = y_i removed from the context. We define the operation ⊕_g (addition for generalized rules) to sum out Y in R as:

    R_i ⊕_g R_j = {⟨c_i ∪ c_j, set(t_i, c_j) ⊕_t set(t_j, c_i)⟩ : ⟨c_i, t_i⟩ ∈ R_i, ⟨c_j, t_j⟩ ∈ R_j and compatible(c_i, c_j)}

where ⊕_t is table addition, and set(t, X = x) is the table t if t does not involve any variable in X, and is the table t restricted to the entries complying with the values X = x if t involves some variable in X.

As an example, suppose we want to sum out the variable A from the rule collection

    R = { ⟨A = false ∧ C = false,
            B     D     Value
            true  true  0.24
            true  false 0.36
            false true  0.54
            false false 0.06 ⟩,
          ⟨A = false ∧ C = true,
            D     Value
            true  0.12
            false 0.48 ⟩,
          ⟨A = true,
            D     Value
            true  0.28
            false 0.12 ⟩ }

We have to partition the rules into rule collections R1 and R2, corresponding to the rules compatible with A = true and A = false, respectively, with the variable A removed from the contexts of the rules:

    R1 = { ⟨TRUE,
             D     Value
             true  0.28
             false 0.12 ⟩ }

    R2 = { ⟨C = false,
             B     D     Value
             true  true  0.24
             true  false 0.36
             false true  0.54
             false false 0.06 ⟩,
           ⟨C = true,
             D     Value
             true  0.12
             false 0.48 ⟩ }

The resulting rules of adding the rule sets are:

    R1 ⊕_g R2 = { ⟨C = false,
             B     D     Value
             true  true  0.28 + 0.24 = 0.52
             true  false 0.12 + 0.36 = 0.48
             false true  0.28 + 0.54 = 0.82
             false false 0.12 + 0.06 = 0.18 ⟩,
           ⟨C = true,
             D     Value
             true  0.28 + 0.12 = 0.4
             false 0.12 + 0.48 = 0.6 ⟩ }

These are the rules that result from summing out the variable A.

Multiplying rules in eve is much more complicated than multiplying factors in ve, as rules involve both tables and contexts.
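Before turning to multiplication, the summing-out operation just illustrated can be sketched in code. The rule representation (a context dict, a tuple of table variables, and a table dict) is an illustrative assumption, and every rule is assumed to carry the eliminated variable in its context, as in the example:

```python
from itertools import product

def compatible(c1, c2):
    # Contexts are compatible when no shared variable is assigned differently.
    return all(c1[v] == c2[v] for v in c1 if v in c2)

def table_add(v1, t1, v2, t2):
    # Table addition with broadcasting over the union of the tables' variables.
    ov = v1 + tuple(v for v in v2 if v not in v1)
    out = {}
    for asg in product([True, False], repeat=len(ov)):
        env = dict(zip(ov, asg))
        out[asg] = t1[tuple(env[v] for v in v1)] + t2[tuple(env[v] for v in v2)]
    return ov, out

def add_rule_sets(R1, R2):
    # The pairwise combination of rules with compatible contexts.
    out = []
    for c1, v1, t1 in R1:
        for c2, v2, t2 in R2:
            if compatible(c1, c2):
                ov, ot = table_add(v1, t1, v2, t2)
                out.append(({**c1, **c2}, ov, ot))
    return out

def sum_out_context(y, R):
    # Partition R by the value of y in the contexts, strip y, then add.
    parts = []
    for val in (True, False):
        parts.append([({k: v for k, v in c.items() if k != y}, vs, t)
                      for c, vs, t in R if c.get(y) == val])
    result = parts[0]
    for part in parts[1:]:
        result = add_rule_sets(result, part)
    return result
```

On the rule collection above, summing out A reproduces the two rules with contexts C = false (entries 0.52, 0.48, 0.82, 0.18) and C = true (entries 0.4, 0.6).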
If we want to multiply two rules whose contexts are identical, then the product is simply the rule with the same context and with the table that is the product of the two rules' tables. If we want to multiply two rules with different contexts, we cannot simply multiply their tables, because they may belong to different contexts. To multiply two rules with different contexts, we must change one or both rules so that their contexts become identical. This is done by rule splitting.

When we want to split a rule r = ⟨c, t⟩ on a variable X, with domain {x1, ..., xk}, where X is not in the context c, there are two possible cases. First, if X does not appear in the table t, then we can simply replace r with the k rules r_i = ⟨c ∧ X = x_i, t⟩, for 1 ≤ i ≤ k. If X appears in the table t, then r is replaced with the k rules r_i = ⟨c ∧ X = x_i, set(t, X = x_i)⟩, where set(t, X = x_i) is the part of the table t where X takes on the value x_i, with X removed from the table.

Definition 6  Given a rule r = ⟨c, t⟩ and a context c′ such that c and c′ are compatible, to split r on c′ means to split r on each of the variables that are assigned in c′ but not assigned in c.

When a rule r is split on a context c, we end up with a single rule whose context is compatible with c. All other rules we end up with, which are incompatible with c, are called the residual rules of splitting r on c. Let r = ⟨c′, t′⟩ and let c be a context that is compatible with c′; then residual(r, c) is defined as:

    if c ⊆ c′ then
        residual(r, c) = {}
    else
        select a variable X that is assigned in c but not in c′
        let x0 be the value assigned to X in c
        residual(r, c) = {⟨c′ ∧ X = x_i, set(t′, X = x_i)⟩ : x_i ∈ domain(X) and x_i ≠ x0}
                          ∪ residual(⟨c′ ∧ X = x0, set(t′, X = x0)⟩, c)

Note that the result of splitting depends on the order in which the variables are selected. However, the result satisfies the program invariant no matter which order the variables are selected.
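A sketch of splitting a rule on a context and collecting its residual rules (with the same caveats: the rule representation is an illustrative assumption, domains are boolean, and variables are processed in the order they appear in the context dict):

```python
def restrict(vars_, table, x, val):
    # set(t, X = val): fix X to val in the table and drop it; if X is not a
    # table variable, return the table unchanged.
    if x not in vars_:
        return vars_, table
    i = vars_.index(x)
    nv = vars_[:i] + vars_[i + 1:]
    nt = {asg[:i] + asg[i + 1:]: v for asg, v in table.items() if asg[i] == val}
    return nv, nt

def split(rule, ctx):
    # Returns (residual rules, the single rule compatible with ctx).
    c, vs, t = rule
    residuals = []
    cur = (c, vs, t)
    for x, x0 in ctx.items():
        cc, vv, tt = cur
        if x in cc:
            continue
        for val in (True, False):
            nv, nt = restrict(vv, tt, x, val)
            r = ({**cc, x: val}, nv, nt)
            if val == x0:
                cur = r
            else:
                residuals.append(r)
    return residuals, cur
```

Splitting ⟨X = true, t(Y, Z, W)⟩ on Y = false ∧ Z = false with this ordering splits first on Y and then on Z, yielding exactly the two residual contexts X = true ∧ Y = true and X = true ∧ Y = false ∧ Z = true of the example that follows.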
For example, consider splitting the rule

    ⟨X = true, [table on Y, Z, W]⟩

on the context Y = false ∧ Z = false (assuming all variables are boolean). If we split first on Y and then on Z, the residuals are the following set of rules:

    { ⟨X = true ∧ Y = true, [table on Z, W]⟩,
      ⟨X = true ∧ Y = false ∧ Z = true, [table on W]⟩ }

where the table of the first rule contains the values from the original table that correspond to Y = true, and the table of the second rule contains the values from the original table that correspond to Y = false and Z = true. The rule can also be split on the variable Z before the variable Y, which results in a different set of residuals:

    { ⟨X = true ∧ Z = true, [table on Y, W]⟩,
      ⟨X = true ∧ Z = false ∧ Y = true, [table on W]⟩ }

where the table of the first rule contains the values from the original table that correspond to Z = true, and the table of the second rule contains the values from the original table that correspond to Z = false and Y = true.

Once two rules are split on each other's contexts, their contexts become identical and their tables can be multiplied by table multiplication. However, if residuals appear in the splitting, the residual rules must also be multiplied as well. This involves more rule splitting and becomes very complicated. Fortunately, there are situations where we do not need to split.

Definition 7  A set of rules R is complete in a set C of contexts if in any complete context compatible with a member of C, there is exactly one member of R whose context is consistent with that complete context, and the context of every rule in R is consistent with at least one element of C. We say R is complete in a context c if R is complete in {c}.

For example,

• The rule set R1 = {⟨a ∧ x, t1⟩} is complete in context a ∧ x.

• The rule set R2 = {⟨a ∧ b ∧ x, t1⟩, ⟨a ∧ ¬b ∧ c ∧ x, t2⟩, ⟨a ∧ ¬b ∧ ¬c ∧ x, t3⟩} is complete in context a ∧ x.
• The rule set R3 = {⟨a ∧ b ∧ x, t1⟩, ⟨a ∧ ¬b ∧ c, t2⟩, ⟨a ∧ ¬b ∧ ¬c, t3⟩} is not complete in context a ∧ x, as x is not in every rule's context in R3, even though exactly one rule in R3 will be used in any context consistent with a ∧ x.

• The rule set R4 = {⟨a, t1⟩, ⟨¬a ∧ b, t2⟩} is complete in the set of contexts {b, a ∧ ¬b}.

• The set of rules obtained by considering the rows of a probability table is complete in the empty context.

If we have a set of rules that is complete in a context, we do not have to split other rules that contain this context when multiplying, because all the residual rules will also have a mate. We define the procedure absorb, which multiplies a single rule, r = ⟨c, t⟩, with a covering set of rules R (where r ∉ R) that is complete in the empty context, as follows:

    absorb(R, ⟨c, t⟩) = {⟨c_i, t_i⟩ ∈ R : incompatible(c, c_i)}
                        ∪ ⋃ { residual(⟨c_i, t_i⟩, c) ∪ {⟨c ∪ c_i, set(t, c_i) ⊗_t set(t_i, c)⟩}
                              : ⟨c_i, t_i⟩ ∈ R and compatible(c, c_i) }

where ⊗_t is table multiplication. For example, to absorb the rule

    r = ⟨A = false ∧ C = false,
          B     D     Value
          true  true  0.4
          true  false 0.6
          false true  0.9
          false false 0.1 ⟩

into the rule collection

    R = { ⟨A = false ∧ C = false, 0.6⟩,
          ⟨A = false ∧ C = true,
            D     Value
            true  0.12
            false 0.48 ⟩,
          ⟨A = true,
            D     Value
            true  0.28
            false 0.12 ⟩ }

only the first rule in R is compatible with the context of r (which creates no residual), so the resulting rule collection is:

    absorb(R, r) = { ⟨A = false ∧ C = false,
          B     D     Value
          true  true  0.4 · 0.6 = 0.24
          true  false 0.6 · 0.6 = 0.36
          false true  0.9 · 0.6 = 0.54
          false false 0.1 · 0.6 = 0.06 ⟩,
But these initial rules may not exist anymore after they are split, multiplied, or added with other rules. It turns out, however, that at any stage of the contextual variable elimination, we can extract a set of rules that all contain a variable X and the set is complete in the empty context. This set of rules is called the rules for X and is defined as follows: Definition 8 If X is a variable, the rules for X are a subset of the rule base consisting of the following rules: • The rules that define the conditional probability P(X\TTX) initially. • The rules created when splitting a rule for X. • The rules created when multiplying a rule for X with another rule. • The rules created when adding a rule for X with another rule. The set of rules for X forms the covering set of rules complete in the empty context that absorption bases on initially. When we eliminate X, we need to multiply 39 the rules that involve X. We start with the set of rules for X, and absorb all other rules that involve X into the set one by one. The result of the multiplication is the set of rules obtained after absorption is done for all the rules to be multiplied. Evidence absorption is a simplification of the rule base by removing the observed variables from the rules' contexts and tables. If the variables E\,... ,ES are observed with the values e i , . . . , es, respectively, then the rule base is modified as follows: • remove any rules that contain Ei — e\ in the context, where ^ e\ • remove any term Ei = ei in the context of a rule • replace each table t with tE1=eiA.../\Es=es • remove any rule that does not involve any variable (i.e. rules with an empty context and a single number as the table) Rules that do not involve any variable are considered redundant because the only purpose they serve is as a normalization constant. There is another kind of rules that is redundant: rules with no for variables and whose tables consist of all ones. 
These rules occur when eliminating a variable that has no children in the particular context. They are considered redundant rules because removing them is equivalent to recursively removing a non-observed, non-queried variable with no children in a Bayesian network, which does not change the conditional probability of the query variable. Redundant rules can be removed from the rule base because they do not affect the probability calculation for the query variable.

The posterior probability of a query variable X can be obtained after absorbing the evidence variables and eliminating the remaining variables. Only rules of the form ⟨{X = x_i}, p_i⟩ and ⟨TRUE, t_k[X]⟩ remain (p_i is a table of a single number). We have:

    P(X = x_i ∧ E = e) = K · ∏_{⟨{X = x_i}, p_i⟩} p_i · ∏_{⟨TRUE, t_k[X]⟩} t_k[x_i]

where K is a normalization constant, and we have:

    P(X = x_i | E = e) = P(X = x_i ∧ E = e) / Σ_{x_j} P(X = x_j ∧ E = e)

Notice that the normalization constants K in the numerator and the denominator cancel each other. A summary of the contextual variable elimination algorithm is given in Figure 3.11.

    To compute P(X | E1 = e1 ∧ ... ∧ Es = es)
        given a rule base R of generalized rules
        observe(E1 = e1 ∧ ... ∧ Es = es)
        while there is a rule in the rule base involving a non-query variable
            select a non-query variable Y
            R ← eliminate(Y, R)
        compute the posterior probability of X

    Procedure eliminate(Y, R)
        partition the rule base R into:
            R⁻ = {r ∈ R : r does not involve Y}
            R⁺ = {r ∈ R : r is for Y}
            R* = {r ∈ R : r involves Y and r is not for Y}
        for each r ∈ R* do
            R⁺ ← absorb(R⁺, r)
        partition R⁺ into:
            R_t = {r ∈ R⁺ : Y is in the table of r}
            R_i = {⟨c, t⟩ : ⟨c ∧ Y = y_i, t⟩ ∈ R⁺}, for each 1 ≤ i ≤ n
        return rule base R⁻ ∪ (R_1 ⊕_g …
⊕_g R_n) ∪ {⟨c′, Σ_Y t′⟩ : ⟨c′, t′⟩ ∈ R_t}

Figure 3.11: The Contextual Variable Elimination Algorithm

Chapter 4

Probability Inference Using eve-tree

4.1 Contextual Clique Tree

The eve algorithm as described in Section 3.4 is a query-based algorithm that calculates posterior probabilities from the initial rule base each time a query variable is requested. This is inefficient if the network is large, the rule base is complicated, and the number of query variables is large, as many operations must be repeated for each query. We can alleviate this problem by extending the eve algorithm to operate on a clique tree structure, similar to the extension of the ve algorithm to clique trees described in Section 3.2. Since each clique contains only a subset of the variables in the network, the number of rules applicable to a particular clique is much smaller than the entire rule base. As a result, after spending some overhead time to obtain a consistent clique tree, the time required to calculate the posterior probabilities with this new algorithm, eve-tree, can be much smaller than the time required by eve. Note that the eve algorithm is a specific case of the eve-tree algorithm, in which the clique tree consists of a single clique, namely, the clique with all variables in the network.

The eve-tree algorithm works as follows:

Rule Base Initialization

• The initial rule base is constructed and the observations are incorporated into the rule base the same way as in the query-based contextual variable elimination algorithm (Section 3.4).

Clique Tree Initialization

• The clique tree is constructed the same way as in the ve-tree and Hugin architectures (Section 3.2), but instead of associating each clique with a factor, each clique is associated with a set of rules.

• The rules in the initial rule base that are for variable X are assigned to a clique that is for X. A clique is for a variable X if it contains X and all the parents of X.
There may be multiple cliques for X, and any one of these cliques can be chosen to receive the initial rules without affecting the result.

• Sepsets are also created between neighbouring cliques to hold the messages (rules) passed between neighbours. Each sepset contains two rule collections to hold the messages passed in the two directions.

Message Propagation

• After initializing the clique tree, each clique contains a rule collection on the clique's variables. Clique tree propagation proceeds the same way as in ve-tree (Section 3.2), with the messages being rule collections instead of factors.

• To pass a message from clique K_i to its neighbour K_j, the message is the rule collection that results from calling the eliminate(Y, R) procedure (Figure 3.11) on each variable Y ∈ K_i − K_j, where R is the rule collection of K_i minus those rules that K_j has passed to K_i previously (if any). This subtraction can be done simply by taking the set difference of the rule collection of K_i and the rule collection stored in the sepset between K_i and K_j that holds the message passed from K_j to K_i previously.

• When K_j receives a message, its rule collection, R_j, is simply updated with the union of R_j and the rules in the received message. A copy of the message is also stored in the sepset between K_i and K_j. As opposed to Hugin, no rule multiplications are performed when a clique receives a message.

Posterior Probability Computation

• After each clique has sent and received a message from all its neighbours, the clique tree is consistent. The posterior probability of a variable X can be obtained from the rule collection of the clique that is for X by calling the eliminate procedure (Figure 3.11) on the other variables that the clique contains.

The major difference between Hugin and eve-tree is that Hugin updates the marginal probabilities of the clique by immediately multiplying in all messages received, whereas eve-tree only stores the rules received without multiplying them immediately.
Multiplication of the rules is delayed until the posterior probability computation stage. The reason for the delay is that once the rules are multiplied, any structure of context specific independence captured in the rules may be lost. If rules were multiplied as they are received by the cliques in the message propagation stage, then the cliques would revert to storing marginals, defeating the advantage of eve-tree of exploiting the context specific independence in the network.

4.2 An Example

Let us run the algorithm on Example Network #1 (Figure 1.1). Suppose we want to calculate the probability of variable F, with no observations. We start by constructing the clique tree, which has the same structure as that of the Hugin architecture, with the addition of recording which variables the cliques are for. The rules in the rule base (Figure 2.1) are assigned to the cliques according to the variable(s) that the rules are for. The clique tree constructed, along with the initial rules, is shown in Figure 4.1.

Suppose the clique ABCD is chosen as the root. Propagation begins with passing messages from the leaf clique (DEF) inwards. Eliminating F from the two rules of the clique DEF, we find that the resulting rules are both redundant, as both rules are for F (so after eliminating F, the rules are for no variable) and both tables consist of all ones. As a result, a message consisting of an empty rule collection is sent to the clique CDE. Eliminating the variable E from the two rules of the clique CDE also results in redundant rules, so a message of an empty rule collection is sent to clique ABCD.

After the root clique has received messages from all its neighbours, it can start propagating messages outwards to the leaf cliques. We need to eliminate the variables A and B from the rules of clique ABCD. We begin by eliminating A and partitioning the 6 rules (all of which are initially in clique ABCD, i.e.
not passed from clique CDE) into R⁻ (rules that do not involve A), R⁺ (rules that are for A), and R* (rules that involve A but are not for A):

[Figure 4.1 shows the clique tree with the initial rule assignment: clique ABCD (for A, B, C and D) holds ⟨TRUE, A: 0.4/0.6⟩, ⟨TRUE, B: 0.2/0.8⟩, ⟨TRUE, C: 0.3/0.7⟩ and the three contextual rules for D; clique CDE (for E) holds ⟨D = true, E: 0.8/0.2⟩ and ⟨D = false, (C, E): 0.1/0.9/0.3/0.7⟩; and clique DEF (for F) holds ⟨E = true, F: 0.4/0.6⟩ and ⟨E = false, (D, F): 0.7/0.3/0.8/0.2⟩.]

Figure 4.1: A clique tree of Example Network #1 with initial rules (eve-tree)

    R⁻ = { ⟨TRUE,
             B     Value
             true  0.2
             false 0.8 ⟩,
           ⟨TRUE,
             C     Value
             true  0.3
             false 0.7 ⟩ }

    R⁺ = { ⟨TRUE,
             A     Value
             true  0.4
             false 0.6 ⟩ }

    R* = { ⟨A = true,
             D     Value
             true  0.7
             false 0.3 ⟩,
           ⟨A = false ∧ C = true,
             D     Value
             true  0.2
             false 0.8 ⟩,
           ⟨A = false ∧ C = false,
             B     D     Value
             true  true  0.4
             true  false 0.6
             false true  0.9
             false false 0.1 ⟩ }

We absorb the rules in R* one by one into R⁺:

    R⁺₁ = absorb(R⁺, ⟨A = true,
             D     Value
             true  0.7
             false 0.3 ⟩)
        = { ⟨A = false, 0.6⟩,
            ⟨A = true,
              D     Value
              true  0.7 · 0.4 = 0.28
              false 0.3 · 0.4 = 0.12 ⟩ }

where the first rule is the residual rule created. Continuing the absorption,

    R⁺₂ = absorb(R⁺₁, ⟨A = false ∧ C = true,
             D     Value
             true  0.2
             false 0.8 ⟩)
        = { ⟨A = false ∧ C = false, 0.6⟩,
            ⟨A = false ∧ C = true,
              D     Value
              true  0.2 · 0.6 = 0.12
              false 0.8 · 0.6 = 0.48 ⟩,
            ⟨A = true,
              D     Value
              true  0.28
              false 0.12 ⟩ }

where the first rule is a residual rule, and the third rule remains the same as it is incompatible with the absorbing rule.
Absorbing the last rule, we have:

    R⁺₃ = absorb(R⁺₂, ⟨A = false ∧ C = false,
             B     D     Value
             true  true  0.4
             true  false 0.6
             false true  0.9
             false false 0.1 ⟩)
        = { ⟨A = false ∧ C = false,
              B     D     Value
              true  true  0.4 · 0.6 = 0.24
              true  false 0.6 · 0.6 = 0.36
              false true  0.9 · 0.6 = 0.54
              false false 0.1 · 0.6 = 0.06 ⟩,
            ⟨A = false ∧ C = true,
              D     Value
              true  0.12
              false 0.48 ⟩,
            ⟨A = true,
              D     Value
              true  0.28
              false 0.12 ⟩ }

After absorbing all the rules in R*, the 3 resulting rules in R⁺ all have the eliminating variable A in their contexts. So we split them into R1 and R2, corresponding to A = true and A = false, respectively, with A removed from the contexts:

    R1 = { ⟨TRUE,
             D     Value
             true  0.28
             false 0.12 ⟩ }

    R2 = { ⟨C = false,
             B     D     Value
             true  true  0.24
             true  false 0.36
             false true  0.54
             false false 0.06 ⟩,
           ⟨C = true,
             D     Value
             true  0.12
             false 0.48 ⟩ }

and we add the two sets of rules:

    R1 ⊕_g R2 = { ⟨C = false,
             B     D     Value
             true  true  0.28 + 0.24 = 0.52
             true  false 0.12 + 0.36 = 0.48
             false true  0.28 + 0.54 = 0.82
             false false 0.12 + 0.06 = 0.18 ⟩,
           ⟨C = true,
             D     Value
             true  0.28 + 0.12 = 0.4
             false 0.12 + 0.48 = 0.6 ⟩ }

Thus, the rule base becomes:

    R = R⁻ ∪ (R1 ⊕_g R2)
      = { ⟨TRUE,
            B     Value
            true  0.2
            false 0.8 ⟩,
          ⟨TRUE,
            C     Value
            true  0.3
            false 0.7 ⟩,
          ⟨C = false,
            B     D     Value
            true  true  0.52
            true  false 0.48
            false true  0.82
            false false 0.18 ⟩,
          ⟨C = true,
            D     Value
            true  0.4
            false 0.6 ⟩ }

The procedure for eliminating the variable B from R is similarly performed:

    R⁻ = { ⟨TRUE,
             C     Value
             true  0.3
             false 0.7 ⟩,
           ⟨C = true,
             D     Value
             true  0.4
             false 0.6 ⟩ }

    R⁺ = { ⟨TRUE,
             B     Value
             true  0.2
             false 0.8 ⟩ }

    R* = { ⟨C = false,
             B     D     Value
             true  true  0.52
             true  false 0.48
             false true  0.82
             false false 0.18 ⟩ }

We absorb the rules in R* into R⁺:

    R⁺₁ = absorb(R⁺, ⟨C = false,
             B     D     Value
             true  true  0.52
             true  false 0.48
             false true  0.82
             false false 0.18 ⟩)
        = { ⟨C = true,
              B     Value
              true  0.2
              false 0.8 ⟩,
            ⟨C = false,
              B     D     Value
              true  true  0.2 · 0.52 = 0.104
              true  false 0.2 · 0.48 = 0.096
              false true  0.8 · 0.82 = 0.656
              false false 0.8 · 0.18 = 0.144 ⟩ }

Since the variable B appears in the tables of both rules, we can
simply sum it out in the table. Note that after B is summed out, the first rule in R⁺₁ can be removed because it is redundant. The resulting rules to be passed as the message are:

    M = { ⟨TRUE,
            C     Value
            true  0.3
            false 0.7 ⟩,
          ⟨C = true,
            D     Value
            true  0.4
            false 0.6 ⟩,
          ⟨C = false,
            D     Value
            true  0.104 + 0.656 = 0.76
            false 0.096 + 0.144 = 0.24 ⟩ }

When clique CDE receives this message, it simply unions these rules with its rule base (for a total of 5 rules). It then eliminates the variable C from its rule base to pass a message to clique DEF, as follows:

    R⁻ = { ⟨D = true,
             E     Value
             true  0.8
             false 0.2 ⟩ }

    R⁺ = { ⟨TRUE,
             C     Value
             true  0.3
             false 0.7 ⟩ }

    R* = { ⟨C = true,
             D     Value
             true  0.4
             false 0.6 ⟩,
           ⟨C = false,
             D     Value
             true  0.76
             false 0.24 ⟩,
           ⟨D = false,
             C     E     Value
             true  true  0.1
             true  false 0.9
             false true  0.3
             false false 0.7 ⟩ }

Absorbing the rules in R* into R⁺, we have:

    R⁺₁ = absorb(R⁺, ⟨C = true,
             D     Value
             true  0.4
             false 0.6 ⟩)
        = { ⟨C = false, 0.7⟩,
            ⟨C = true,
              D     Value
              true  0.3 · 0.4 = 0.12
              false 0.3 · 0.6 = 0.18 ⟩ }

    R⁺₂ = absorb(R⁺₁, ⟨C = false,
             D     Value
             true  0.76
             false 0.24 ⟩)
        = { ⟨C = false,
              D     Value
              true  0.7 · 0.76 = 0.532
              false 0.7 · 0.24 = 0.168 ⟩,
            ⟨C = true,
              D     Value
              true  0.12
              false 0.18 ⟩ }

    R⁺₃ = absorb(R⁺₂, ⟨D = false,
             C     E     Value
             true  true  0.1
             true  false 0.9
             false true  0.3
             false false 0.7 ⟩)
        = { ⟨C = false ∧ D = true, 0.532⟩,
            ⟨C = false ∧ D = false,
              E     Value
              true  0.168 · 0.3 = 0.0504
              false 0.168 · 0.7 = 0.1176 ⟩,
            ⟨C = true ∧ D = true, 0.12⟩,
            ⟨C = true ∧ D = false,
              E     Value
              true  0.18 · 0.1 = 0.018
              false 0.18 · 0.9 = 0.162 ⟩ }

The variable C to be eliminated appears in the contexts of all 4 rules, so we partition the rules into R1 and R2, corresponding to C = true and C = false, respectively, with the variable C removed from the contexts, and add the two sets of rules.
R₁ = { ⟨D=t, 0.12⟩, ⟨D=f, [E: t 0.018, f 0.162]⟩ }
R₂ = { ⟨D=t, 0.532⟩, ⟨D=f, [E: t 0.0504, f 0.1176]⟩ }

R₁ ⊕g R₂ = { ⟨D=t, 0.12+0.532 = 0.652⟩,
             ⟨D=f, [E: t 0.018+0.0504 = 0.0684, f 0.162+0.1176 = 0.2796]⟩ }

So the message passed from clique CDE to clique DEF is:

R = R⁻ ∪ (R₁ ⊕g R₂)
  = { ⟨D=t, [E: t 0.8, f 0.2]⟩,
      ⟨D=t, 0.652⟩,
      ⟨D=f, [E: t 0.0684, f 0.2796]⟩ }

These rules are unioned with the rule base of clique DEF. At this point, clique tree propagation is completed and the clique tree is consistent. The consistent clique tree with the cliques' rule collections is shown in Figure 4.2 (the messages stored in the sepsets are omitted here).

[Figure 4.2: The consistent clique tree for Example Network #1 (eve-tree), showing the rule collection of each clique after propagation]

We can now calculate the probability of variable F from the rules in clique DEF (which is the clique for variable F), by eliminating the other variables, D and E. To eliminate D from the rule base of clique DEF, the rules are first split into:

R⁻ = { ⟨E=t, [F: t 0.4, f 0.6]⟩ }
R⁺ = { ⟨D=t, 0.652⟩ }
R* = { ⟨D=f, [E: t 0.0684, f 0.2796]⟩,
       ⟨D=t, [E: t 0.8, f 0.2]⟩,
       ⟨E=f, [D,F: tt 0.7, tf 0.3, ft 0.8, ff 0.2]⟩ }

The rules in R* are then absorbed into R⁺ one by one:

R⁺₁ = absorb(R⁺, ⟨D=f, [E: t 0.0684, f 0.2796]⟩)
    = { ⟨D=t, 0.652⟩, ⟨D=f, [E: t 0.0684, f 0.2796]⟩ }

R⁺₂ = absorb(R⁺₁, ⟨D=t, [E: t 0.8, f 0.2]⟩)
    = { ⟨D=t, [E: t 0.652·0.8 = 0.5216, f 0.652·0.2 = 0.1304]⟩,
        ⟨D=f, [E: t 0.0684, f 0.2796]⟩ }

R⁺₃ = absorb(R⁺₂, ⟨E=f, [D,F: tt 0.7, tf 0.3, ft 0.8, ff 0.2]⟩)
    = { ⟨D=t ∧ E=t, 0.5216⟩,
        ⟨D=t ∧ E=f, [F: t 0.1304·0.7 = 0.09128, f 0.1304·0.3 = 0.03912]⟩,
        ⟨D=f ∧ E=t, 0.0684⟩,
        ⟨D=f ∧ E=f, [F: t 0.2796·0.8 = 0.22368, f 0.2796·0.2 = 0.05592]⟩ }

We split the resulting rules into R₁ and R₂, corresponding to the contexts D = true and D = false respectively, and add the two sets of rules:

R₁ = { ⟨E=t, 0.5216⟩, ⟨E=f, [F: t 0.09128, f 0.03912]⟩ }
R₂ = { ⟨E=t, 0.0684⟩, ⟨E=f, [F: t 0.22368, f 0.05592]⟩ }

R₁ ⊕g R₂ = { ⟨E=t, 0.5216+0.0684 = 0.59⟩,
             ⟨E=f, [F: t 0.09128+0.22368 = 0.31496, f 0.03912+0.05592 = 0.09504]⟩ }

The rule base becomes:

R = R⁻ ∪ (R₁ ⊕g R₂)
  = { ⟨E=t, 0.59⟩,
      ⟨E=f, [F: t 0.31496, f 0.09504]⟩,
      ⟨E=t, [F: t 0.4, f 0.6]⟩ }

Eliminating the variable E, we have:

R⁻ = {}
R⁺ = { ⟨E=t, 0.59⟩ }
R* = { ⟨E=f, [F: t 0.31496, f 0.09504]⟩, ⟨E=t, [F: t 0.4, f 0.6]⟩ }

R⁺₁ = absorb(R⁺, ⟨E=f, [F: t 0.31496, f 0.09504]⟩)
    = { ⟨E=t, 0.59⟩, ⟨E=f, [F: t 0.31496, f 0.09504]⟩ }

R⁺₂ = absorb(R⁺₁, ⟨E=t, [F: t 0.4, f 0.6]⟩)
    = { ⟨E=t, [F: t 0.59·0.4 = 0.236, f 0.59·0.6 = 0.354]⟩,
        ⟨E=f, [F: t 0.31496, f 0.09504]⟩ }

R₁ = { ⟨TRUE, [F: t 0.236, f 0.354]⟩ }
R₂ = { ⟨TRUE, [F: t 0.31496, f 0.09504]⟩ }

R₁ ⊕g R₂ = { ⟨TRUE, [F: t 0.236+0.31496 = 0.55096, f 0.354+0.09504 = 0.44904]⟩ }

After elimination, there is only one rule left. The probabilities of F are calculated to be P(F = true) = 0.55096 and P(F = false) = 0.44904.

Chapter 5

Empirical Evaluation of eve-tree

5.1 Experiment Setup

In order to evaluate our proposed algorithm, we constructed a number of parameterized classes of networks that display context specific independence.¹ We compare eve-tree, in terms of the time and space required to do probability inference, with Hugin (Jensen et al., 1990) and with the query-based contextual variable elimination algorithm (Poole and Zhang, 2002). The following parameters are used for building the random networks:

• n: the number of variables in the network
• s: the number of contextual splits (so there are n + s generalized rules for the network in the initial rule base)
• p: the probability that a parent variable that is not in the context of a generalized rule will be included in the table of the generalized rule
• obs: the number of observed variables

¹ Note that we do not expect eve-tree to work well for purely random networks; eve-tree is designed for networks that display context specific independence. Thus, it is fair to compare the algorithms on these networks.

Given the n variables, a total ordering is imposed. For each variable, we build a decision tree whose leaves correspond to the contexts of the variable's generalized rules. We randomly choose a leaf and a variable. If the variable is a possible split for the leaf (i.e., that variable is a predecessor, in the total ordering, of the variable that the leaf is for, and the context corresponding to the leaf has not committed to a value of that variable), we split the leaf on that variable.
This process is repeated until we have a total of n + s leaves. Then, for each leaf, we build a generalized rule using the leaf as the context. For the table of the generalized rule, we include the variable that the leaf is for, and each of its predecessors is included with probability p. The probabilities in the tables are assigned random numbers. Thus, the number of rules in the initial rule base is controlled by the parameter s, and the size of the tables is controlled probabilistically by p. We also randomly choose variables to be observed for the networks. The number of observed variables is controlled by the parameter obs.

The following parameters were used to make the random networks in our experiments:

• number of variables in the network: n = 20
• probability of including a predecessor variable in the table of a rule: p = 0.2
• number of splits: 0 ≤ s ≤ 10
• number of observed variables: 0 ≤ obs ≤ 10

Each set of parameters (n, p, s) was used to make 10 different random networks, and each network was paired with the 11 different observation settings (determined by obs), for a total of 1210 random networks. Notice that s = 0 corresponds to networks that exhibit no context specific independence. All experiments were performed on an Intel Pentium 4 with a 2.0 GHz processor, 256 KB CPU cache, and 1 GB RAM, running Red Hat Linux 7.2.

5.2 Space Comparisons

Figure 5.1 gives a comparison of the space required by Hugin and eve-tree to do probabilistic inference for the random networks described above. The space is measured by the combined size (the number of floating point values) of all the tables in the clique tree after the tree is consistent. For eve-tree, this is calculated by summing up the table sizes of the rules in each clique. The Hugin architecture that we use in this experiment compacts the tables in the cliques if there are observed variables.
This is done to make the comparison fairer and to avoid confounding the savings from compacting with the savings from context. A major advantage of eve-tree can be seen in the much smaller total table sizes that eve-tree requires compared to those required by Hugin. The amount of savings that eve-tree provides over Hugin depends on the amount of context specific independence that a particular network exhibits, the number of observed variables, and whether the observed variables appear in the contexts or in the tables. This savings in space can be substantial, especially for large networks that exhibit some context specific independence. Figures 5.2 to 5.5 show the space comparisons broken down by the number of splits that the random networks were created with. The cause of the difference in space requirements between eve-tree and Hugin is that Hugin always multiplies the factors passed to a clique, while eve-tree may just keep the factors without multiplying them.

[Figure 5.1: Comparison of table size of consistent clique trees]
[Figure 5.2: Comparison of table size of consistent clique trees from networks with splits = 0]
[Figure 5.3: Comparison of table size of consistent clique trees from networks with splits = 1]
[Figure 5.4: Comparison of table size of consistent clique trees from networks with splits = 5]
[Figure 5.5: Comparison of table size of consistent clique trees from networks with splits = 10]

With no context specific independence present, Hugin may take less space than eve-tree, or the other way around, depending on the particular set of factors involved.
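The space effect of multiplying factors versus keeping them separate can be sketched numerically. The factor shapes below are hypothetical, not taken from the experiments; the point is only that a product of factors occupies a table over the union of their variables, while unmultiplied factors cost just the sum of their individual sizes:

```python
def table_size(vars_, domain=2):
    """Number of entries in a table over vars_ (all variables binary here)."""
    return domain ** len(vars_)

def eager_size(factors):
    """Hugin-style: multiplying factors yields one table over the union
    of their variables."""
    joint = set().union(*factors)
    return table_size(joint)

def lazy_size(factors):
    """eve-tree-style: factors are kept unmultiplied, so the total space
    is the sum of the individual table sizes."""
    return sum(table_size(f) for f in factors)

# A hypothetical clique over {A, B, C, D} that received three small factors.
factors = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
print(eager_size(factors))  # 2**4 = 16 entries in the multiplied table
print(lazy_size(factors))   # 4 + 4 + 4 = 12 entries kept separately
```

With more overlapping factors or larger cliques the gap widens quickly, which matches the trend seen in the space comparisons.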
For splits = 0 (Figure 5.2), even though the networks exhibit no context specific independence, the space savings of eve-tree over Hugin is statistically significant when analyzed with a matched-pairs t test. With splits = 1 (Figure 5.3), the space required by eve-tree is less than that required by Hugin for almost all networks. The difference in space is even more significant for networks with larger amounts of context specific independence. Figures 5.4 and 5.5 show the space difference for random networks with splits = 5 and splits = 10, respectively.

5.3 Time Comparisons

The entire process of clique tree propagation can be viewed as three stages: setting up the initial clique tree, propagating messages, and getting posterior probabilities. We compare the amount of time spent in these stages by the different algorithms, as well as the total time they take to query 5 unobserved variables. Figures 5.6 to 5.9 show the distribution of the times for 3 algorithms: Hugin, eve-tree, and the query-based contextual variable elimination (eve) (Poole and Zhang, 2002), over the 1210 randomized networks described above. Notice that these figures show the overall time distributions over the 1210 randomized networks, so the difference between the curves at a point may not correspond to the actual time difference for any particular network.

Figure 5.6 shows the time distribution for the Hugin and eve-tree algorithms to set up the initial clique tree for the 1210 random networks. Hugin takes a predominantly large amount of time at this stage because much of the work of this algorithm is done here: it involves initializing all the factors of the cliques by multiplying the conditional probabilities of the Bayesian network and the observations into the corresponding cliques.

[Figure 5.6: Distribution of time to set up the initial clique trees]
On the other hand, eve-tree uses minimal time at this stage because all it does is assign the rules in the initial rule base to the appropriate cliques and update the rules with the observed variables; no table multiplication is performed. Note that this set-up stage is not applicable to the query-based eve algorithm, as eve does not make use of a clique tree.

Figure 5.7 shows the time distribution for message propagation. The message propagation time is the time to do the two sweeps of message passing that make the clique tree consistent. Again, the message propagation time does not apply to the query-based eve algorithm. In general, for simpler networks, the message propagation time is higher for eve-tree than for Hugin, mainly due to the overhead of the rule multiplications. But for more complicated networks with context specific independence, eve-tree in general takes less time than Hugin in the message propagation stage.

Figure 5.8 shows the distribution of the average time to get the probability of a query variable from a consistent clique tree for Hugin and eve-tree, or from the initial rule base for the query-based eve algorithm. This average time is obtained by summing up the time to get the posterior probability of each of the unobserved variables in the Bayesian network and dividing by the number of unobserved variables. The time this stage takes for Hugin is comparatively very small, because all it needs to do is sum out the non-query variables from the marginal probability table contained in the clique. eve-tree takes a much longer time to get the probability, because it has to multiply the rules in a clique and eliminate the non-query variables from these rules. The overhead of maintaining variables in the contexts and of splitting rules also plays a part in the time it takes to get the probabilities.
Since the query-based eve algorithm does all the work at this stage, its average time to get the probability is higher than that of the other two algorithms.

[Figure 5.7: Distribution of message propagation time]

As each algorithm does most of its work at a different stage, it is difficult to say which one is the most efficient. It all depends on the application, such as how many query variables there are and how much space is available to do the inference. If only one variable is to be queried, then the query-based eve algorithm should be the choice, because there is no point in setting up a clique tree that does probability inference for all the variables in the network simultaneously. However, if more than two variables are to be queried, the algorithm of choice depends on the structure of the network and the amount of context specific independence (and other additional structure) that can be exploited, as well as on physical constraints such as memory size.

Figure 5.9 shows the distributions of the average time that the 3 algorithms take to calculate the posterior probabilities of 5 unobserved variables for the 1210 randomized networks. This time is obtained by adding the set-up time for the initial clique tree, the message propagation time, and 5 times the average get-probability time. It is very clear from the figure that for most of the randomized networks, eve-tree does much better than Hugin when context specific independence appears in a Bayesian network. This is especially apparent for the more complicated networks, for which Hugin takes a relatively long time to do inference. The distributions for eve-tree and eve are very close to each other. It appears that the query-based eve does better for simpler networks, which take less time to do inference, while eve-tree is better for more complicated networks.
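The stage-wise accounting described above can be made concrete with a toy cost model. All timings below are hypothetical, chosen only to show the crossover: the clique-tree algorithms pay set-up and propagation once and then a cheap per-query cost, while query-based eve repeats the full elimination work for every query:

```python
def total_time_tree(setup, propagate, per_query, k):
    """Clique-tree style (Hugin / eve-tree): one-time set-up and
    propagation, then k cheap queries against the consistent tree."""
    return setup + propagate + k * per_query

def total_time_query_based(per_full_query, k):
    """Query-based eve: no clique tree, so every query repeats the
    full variable elimination."""
    return k * per_full_query

# Hypothetical stage timings in msec (not measured values from the thesis).
setup, propagate, per_query = 40.0, 60.0, 5.0
per_full_query = 70.0

# With one query the query-based algorithm wins; as k grows, the one-time
# tree cost is amortized and the clique-tree approach wins.
print(total_time_tree(setup, propagate, per_query, 1))  # 105.0
print(total_time_query_based(per_full_query, 1))        # 70.0
print(total_time_tree(setup, propagate, per_query, 5))  # 125.0
print(total_time_query_based(per_full_query, 5))        # 350.0
```

This is exactly the pattern in the figures: for a single query the tree set-up is wasted effort, while for several queries the amortized per-query cost dominates.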
For simpler networks, the time taken by eve-tree to build the clique tree and make it consistent dominates. However, for the complicated networks, eve-tree only needs to do all the hard work once, in the clique tree set-up and message propagation stages, while eve must repeat the process for each query.

[Figure 5.8: Distribution of the average time to get the probability of a query variable from a consistent clique tree]

Figures 5.10 to 5.13 show the amount of time that the Hugin, eve-tree, and query-based eve algorithms take to query 5 unobserved variables for some individual random networks. In Figure 5.10, where the random networks do not exhibit context specific independence, Hugin does a little better than eve-tree for 5 of the networks, and worse for the other 5. It seems that for networks that require relatively more time for Hugin (more than 10³ msec), eve-tree is faster than Hugin, but for networks that require relatively less time for Hugin (less than 10² msec), eve-tree is a little slower. This is because the networks on which Hugin takes longer correspond to clique trees that contain cliques and factors over many variables, and multiplication of these large factors is costly in time. In eve-tree, on the other hand, the rules are not multiplied unless it is necessary. In addition, the time is reduced significantly for eve-tree and query-based eve when variables are observed, especially for those networks that take a long time to do inference. This is because both the number of rules and the size of the rules' tables are reduced with each variable that is observed. For Hugin, the time reduction is not as significant, if the time is reduced at all. The reason is that when variables are observed, the factors that contain those variables merely have the corresponding entries changed to 0, so the time it takes to multiply the factors is not reduced as significantly.

For the random networks with splits > 0, the graphs of the time that the algorithms take to query 5 unobserved variables look very similar. Figures 5.11, 5.12, and 5.13 show the results for splits = 1, 5, and 10, respectively. In most cases, eve-tree and query-based eve are a lot faster than Hugin. The time difference is in general larger for a larger number of splits. Nevertheless, even when only a small amount of context specific independence is present in the network, eve-tree can take advantage of it and win over Hugin.

[Figure 5.9: Distribution of the total time for calculating the probability of 5 unobserved variables]
[Figure 5.10: Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 0 (i.e. no context specific independence)]
[Figure 5.11: Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 1]
[Figure 5.12: Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 5]
[Figure 5.13: Total time for calculating the probability of 5 unobserved variables for 10 random networks with splits = 10]
[Figure 5.14: Total time for calculating the probabilities of all of the 20 unobserved variables for random networks with no observed variables]
[Figure 5.15: Total time for calculating the probabilities of all of the 15 unobserved variables for random networks with 5 observed variables]
[Figure 5.16: Total time for calculating the probabilities of all of the 10 unobserved variables for random networks with 10 observed variables]

Figures 5.14 to 5.16 show the CPU time for Hugin and eve-tree to query all the unobserved variables in the random networks. Figure 5.14 shows the result for networks with no observations, corresponding to querying all 20 variables in the networks. eve-tree wins over Hugin for a majority of the cases.
The networks for which eve-tree takes more time than Hugin are networks with little or no context specific independence. Figure 5.15 shows the result for networks with 5 variables observed, corresponding to querying the remaining 15 variables. Aside from a few networks that take a particularly small amount of time for Hugin (less than 10² msec), eve-tree is significantly faster than Hugin for all the other networks, including those that do not display context specific independence. The same result is seen for networks with 10 observed variables, as shown in Figure 5.16. As the number of observed variables increases, querying the posterior probabilities of all unobserved variables results in an even larger time difference between the two algorithms. Our experiments show that the eve-tree algorithm not only takes advantage of context specific independence in the networks, but also increases its efficiency as variables in the networks are observed.

5.4 The Insurance Network

In addition to the randomized networks, we experimented with the insurance network, a network for evaluating car insurance risks, from the Bayesian network repository at the Hebrew University of Jerusalem (http://www.cs.huji.ac.il/labs/compbio/Repository/). The insurance network consists of 27 variables, each with 2 to 5 domain values. Since the insurance network does not contain any context specific independence, we created modified versions of it by changing the numbers in the probability tables using precision values, as defined below. Two numbers in a probability table are considered equal if their difference is less than the precision value.
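A minimal sketch of this merging, assuming a greedy pairwise grouping of CPT rows whose distributions agree within the precision value (the thesis does not spell out its exact grouping procedure, so this is an illustration, not the actual implementation). Rows in the same group are replaced by their average distribution:

```python
def merge_rows(cpt, precision):
    """Greedily group CPT rows (parent config -> distribution) whose
    entries all differ by less than `precision`, replacing each group
    by the average distribution over its members."""
    groups = []  # each group: [list of configs, list of distributions]
    for config, dist in cpt.items():
        for g in groups:
            if all(abs(a - b) < precision
                   for d in g[1] for a, b in zip(dist, d)):
                g[0].append(config)
                g[1].append(dist)
                break
        else:
            groups.append([[config], [dist]])
    merged = {}
    for configs, dists in groups:
        avg = tuple(sum(col) / len(col) for col in zip(*dists))
        for config in configs:
            merged[config] = avg
    return merged

# The A = true half of the P(D | A, B, C) example from the text,
# keyed by the (A, B, C) truth values; values are (P(D=t), P(D=f)).
cpt = {
    ("t", "t", "t"): (0.75, 0.25),
    ("t", "t", "f"): (0.8, 0.2),
    ("t", "f", "t"): (0.6, 0.4),
    ("t", "f", "f"): (0.55, 0.45),
}
merged = merge_rows(cpt, 0.1)
print(tuple(round(x, 3) for x in merged[("t", "t", "t")]))  # (0.775, 0.225)
print(tuple(round(x, 3) for x in merged[("t", "f", "t")]))  # (0.575, 0.425)
```

With precision 0.1 the two B = true rows collapse into one distribution and the two B = false rows into another, which is exactly the context specific independence (D independent of C given A = true and B) exploited in the worked example below.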
For example, if the precision value 0.1 is used, the table P(D | A, B, C):

  A B C | D=true  D=false
  t t t | 0.75    0.25
  t t f | 0.8     0.2
  t f t | 0.6     0.4
  t f f | 0.55    0.45
  f t t | 0.1     0.9
  f t f | 0.2     0.8
  f f t | 0.35    0.65
  f f f | 0.6     0.4

is split into the three rules:

⟨A=t ∧ B=t, [D: t 0.775, f 0.225]⟩
⟨A=t ∧ B=f, [D: t 0.575, f 0.425]⟩
⟨A=f, [B,C,D: ttt 0.1, ttf 0.9, tft 0.2, tff 0.8, ftt 0.35, ftf 0.65, fft 0.6, fff 0.4]⟩

The use of precision values generates context specific independence within the network that we can use to experiment with the eve-tree algorithm. Precision values of 0.05, 0.1, 0.15, and 0.2 were applied to the insurance network in our experiment. Since no sensitivity analysis (Chan and Darwiche, 2001; Kjærulff and van der Gaag, 2000) has been done for these modifications, we do not expect any accuracy from the approximations. We are only interested in how this kind of approximation can be used with our algorithm to improve the efficiency of probability inference.

Table 5.1 summarizes the rule bases of the insurance network (corresponding to a precision value of 0) and its approximations using precision values 0.05, 0.1, 0.15, and 0.2. The total time it takes to do inference for these networks and query 5 unobserved variables with Hugin and eve-tree is shown in Figure 5.17, where the number of observed variables ranges from 0 to 13. Given the same observed variables, the amount of time Hugin takes for the four networks is very similar. On the other hand, eve-tree performs differently with different precisions, which correspond to different amounts of context specific independence. As the figure shows, eve-tree is most efficient with a precision of 0.15 for this particular network.
As Table 5.1 shows, the network with precision value 0.2 has 7 more rules than the network with precision value 0.15, but the saving in the total number of entries in the rules' tables is only 11. This makes the overhead of maintaining the extra rules larger than the benefit obtained from the smaller table sizes, so it takes more time to do inference with the network with precision value 0.2 than with the network with precision value 0.15. As this experiment shows, not all splittings of probability tables into rules are beneficial to the inference time; it depends on the reduction in table sizes compared to the increase in the number of rules in the network.

Table 5.1: Summary of the insurance network and its approximations in terms of context specific independence and rule and table sizes

  Precision value | Variables with context specific independence | Total rules in network | Total entries in the rules' tables
  0               | 1 | 30 | 1407
  0.05            | 2 | 34 | 1393
  0.1             | 5 | 42 | 1339
  0.15            | 6 | 43 | 1291
  0.2             | 8 | 50 | 1280

[Figure 5.17: Comparison of the total time required to calculate the probability of 5 unobserved variables for 4 approximations of the insurance network (precision values 0.05, 0.1, 0.15, and 0.2; number of observations from 0 to 13)]

Chapter 6

Conclusion and Future Work

We have presented eve-tree, a clique tree propagation algorithm that exploits context specific independence to compute posterior probabilities in Bayesian networks.
With our experiments on randomized networks that exhibit various amounts of context specific independence, as well as on the insurance network, our algorithm shows advantages both in time and in space over the Hugin architecture when some level of context specific independence appears in the Bayesian networks.

Although the eve-tree algorithm as described in this thesis uses the clique tree structure of the Hugin architecture, other clique tree structures such as the Shenoy-Shafer architecture (Shafer and Shenoy, 1990) could also be employed. The efficiency of eve-tree on other clique tree structures should be explored in future research, since the efficiency also depends on the clique tree used to carry out the inference.

The combination of eve-tree with sensitivity analysis (Chan and Darwiche, 2001; Kjærulff and van der Gaag, 2000) could also be explored in future research as an application to approximate inference. Sensitivity analysis techniques can be used to determine which probabilities in a Bayesian network are close enough to be considered independent in specific contexts while keeping the desired accuracy in the results. This combination would require a cost-benefit analysis to determine whether it is worthwhile to split a probability table into rules, trading off accuracy against time and space efficiency. The resulting Bayesian network would contain context specific independence that eve-tree can take advantage of in doing probability inference.

The main overhead of eve-tree, as compared to Hugin, is its more complicated methods for multiplying rules and eliminating variables. The amount of time needed for probabilistic inference directly depends on the number of rules in the network. Because of the overhead of rule maintenance, splitting a probability table into a set of rules that yields only a small reduction in table sizes may hinder the inference procedure rather than speed it up.
We would like to explore a heuristic that determines whether a probability table with context specific independence should be split into a set of rules with smaller tables or should be left as a single rule with the entire table. A reduction in table sizes decreases the space and time needed to do inference, whereas an increase in the number of rules increases the overhead costs and the time to do inference. The heuristic for splitting a rule into a set of rules with smaller tables would therefore be based on the reduction in table sizes as compared to the increase in the number of rules that results from the splitting.

Bibliography

Boutilier, C., Friedman, N., Goldszmidt, M. and Koller, D. (1996). Context-specific independence in Bayesian networks, in E. Horvitz and F. Jensen (eds), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96), Portland, OR, pp. 115-123.

Chan, H. and Darwiche, A. (2001). When do numbers really matter?, Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI-01), Morgan Kaufmann Publishers, San Francisco, CA, pp. 65-74.

Cooper, G. (1987). Probabilistic inference using belief networks is NP-hard, Technical report, Medical Computer Science Group, Stanford University.

Dechter, R. (1996). Bucket elimination: A unifying framework for probabilistic inference, in E. Horvitz and F. Jensen (eds), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96), Portland, OR, pp. 211-219.

Geiger, D. and Heckerman, D. (1996). Knowledge representation and inference in similarity networks and Bayesian multinets, Artificial Intelligence 82: 45-74.

Heckerman, D. and Breese, J. (1994). A new look at causal independence, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94), pp. 286-292.

Huang, C. and Darwiche, A. (1996). Inference in belief networks: A procedural guide, International Journal of Approximate Reasoning 15(3): 225-263.
Jensen, F. V., Lauritzen, S. L. and Olesen, K. G. (1990). Bayesian updating in causal probabilistic networks by local computations, Computational Statistics Quarterly 4: 269-282.

Kjærulff, U. (1990). Triangulation of graphs - algorithms giving small total state space, Technical Report R 90-09, Department of Mathematics and Computer Science, Strandvejen, DK 9000 Aalborg, Denmark.

Kjærulff, U. and van der Gaag, L. (2000). Making sensitivity analysis computationally efficient, Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00), Morgan Kaufmann Publishers, San Francisco, CA, pp. 317-325.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society, Series B 50(2): 157-224.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA.

Poole, D. and Zhang, N. (2002). Exploiting contextual independence in probabilistic inference, Technical report, Department of Computer Science, University of British Columbia.

Shafer, G. and Shenoy, P. (1990). Probability propagation, Annals of Mathematics and Artificial Intelligence 2: 327-352.

Zhang, N. (1998). Inference in Bayesian networks: The role of context-specific independence, Technical Report HKUST-CS98-09, Department of Computer Science, Hong Kong University of Science and Technology.

Zhang, N. and Poole, D. (1994). A simple approach to Bayesian network computations, Proceedings of the Tenth Canadian Conference on Artificial Intelligence, Banff, Alberta, Canada, pp. 171-178.

Zhang, N. and Poole, D. (1996). Exploiting causal independence in Bayesian network inference, Journal of Artificial Intelligence Research 5: 301-328.

Zhang, N. and Poole, D. (1999). On the role of context-specific independence in probabilistic reasoning, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, pp. 1288-1293.

Zhang, N. and Yan, L. (1997). Independence of causal influence and clique tree propagation, Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI-97), Morgan Kaufmann Publishers, San Francisco, CA, pp. 481-488.
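The splitting heuristic proposed in the future-work discussion could be prototyped along the following lines. This is a minimal sketch, not code from the thesis: the function names, the representation of a rule as a set of variables, and the `rule_overhead` constant modelling the fixed bookkeeping cost of an extra rule are all illustrative assumptions.

```python
# Hypothetical sketch of the proposed splitting heuristic: split a table
# with context specific independence into rules only when the saving in
# table entries outweighs an assumed per-rule overhead cost.

def table_size(variables, domain_sizes):
    """Number of entries in a table over the given variables."""
    size = 1
    for v in variables:
        size *= domain_sizes[v]
    return size

def should_split(full_vars, rule_var_sets, domain_sizes, rule_overhead=2):
    """Decide whether to split one table over full_vars into one rule per
    element of rule_var_sets (each a set of variables relevant in that
    rule's context).

    rule_overhead is an assumed constant modelling the bookkeeping cost
    of each additional rule; its value would have to be tuned.
    """
    full_cost = table_size(full_vars, domain_sizes)
    split_cost = sum(table_size(vs, domain_sizes) for vs in rule_var_sets)
    split_cost += rule_overhead * (len(rule_var_sets) - 1)
    return split_cost < full_cost
```

For example, with binary variables, a table for Y with parents A, B, C has 16 entries; if C is irrelevant when A = true, splitting yields one rule over {Y, B} (4 entries) and one over {Y, B, C} (8 entries), so the split pays off only when the total rule overhead stays below the 4 entries saved.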
Item Metadata

Title: A clique tree algorithm exploiting context specific independence
Creator: Tung, Leslie
Date Issued: 2002
Genre: Thesis/Dissertation
Language: English
Degree: Master of Science - MSc
Program: Computer Science
Degree Grantor: University of British Columbia
Graduation Date: 2002-11
Campus: UBCV
Provider: Vancouver : University of British Columbia Library
Rights: For non-commercial purposes only, such as research, private study and education. Additional conditions apply; see Terms of Use at https://open.library.ubc.ca/terms_of_use.
DOI: 10.14288/1.0051389
URI: http://hdl.handle.net/2429/13264