Probabilistic Inference with Large Discrete Domains by Rita Sharma M . S c , University of British Columbia, 2001 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F DOCTOR OF PHILOSOPHY in T H E F A C U L T Y OF G R A D U A T E STUDIES (Computer Science) T H E UNIVERSITY OF BRITISH C O L U M B I A October 2006 © Rita Sharma, 2006 Abstract The straightforward representation of many real world problems is in terms of discrete random variables with large or infinite domains. For example, in a do-main where we are trying to identify a person, we may have variables that have as domains, a set of all names, a set of all postal codes, and a set of all credit card num-bers. The task usually reduces to performing probabilistic inference, i.e., compute the probability of some values of some random variables given the values of some other variables. Bayesian networks are a compact way to represent joint probabil-ity distributions. This thesis is concerned with probabilistic inference in Bayesian networks that have discrete random variables with large or infinite domains. Carrying out inference in Bayesian networks that have variables with large domains is a difficult problem. For efficient inference, we consider cases where there is some structure that can be exploited to make inference efficient. In this the-sis we consider two kinds of structures that can be exploited for efficient inference. These structures allow us to partition the large number of values in equivalence classes. Rather than reasoning about every value of a variable individually, we can reason about a set of values in a single step. We first consider the case where there are intensional definitions of the con-ditional probability distributions. To represent these conditional probabilities, we introduce a CPD language that allows us to define the conditional probabilities pro-cedurally, in terms of predicates and functions. We present an inference algorithm, Large Domain VE, for the CPD language that uses this representation to partitions the domains of the variables dynamically. The partitions depend on what is ob-served and what is queried. We apply Large Domain V E to the person identification problem that has variables with large domains. The second case we consider where there is a priori internal structure on the values of the variables. In particular, we consider the case where the values of ii the variables are represented as tree hierarchies. We call such variables hierarchi-cally structured variables. We present a language for representing the conditional probabilities of Bayesian networks with hierarchically structured variables. To per-form inference in Bayesian networks with hierarchically structured variables we construct an abstract Bayesian network dynamically, given some evidence and a query, by collapsing the hierarchies to include only those values necessary to an-swer the query. We can answer the query from the abstract Bayesian network using any standard inference algorithm. Finally, we show how both intensional definitions of the conditional proba-bility distributions and hierarchically structured values can be put together to pro-duce a general framework that can be applied to a more general class of problems. iii Contents Abstract « Contents iv List of Tables viii List of Figures ' x Acknowledgements x ' n 1 Introduction 1 1.1 Probabilistic Reasoning 1 1.2 Thesis Overview 6 1.3 Summary of Thesis Contributions 9 1.3.1 Thesis Organization 10 2 Background 11 2.1 Introduction ' 11 2.2 Notation 12 2.3 Bayesian Networks 12 iv 2.4 Conditional Independence 16 2.5 Inference 16 2.5.1 Variable Elimination 19 2.6 Pruning Irrelevant Variables 29 2.7 Context Specific Independence 30 2.8 Conclusion 33 3 Exploiting Structure in Large Discrete Domains 34 3.1 Introduction 34 3.2 A Motivating Example: The Person Identification Problem 38 3.2.1 Decision Making in Person Identification 40 3.2.2 Modelling of Attribute Dependence/Independence 43 3.3 Structured Representation of Conditional Probabilities 49 3.3.1 Input Requirements for Inference 56 3.4 Inference algorithm: Large Domain V E 61 3.4.1 Constraint Elimination Ordering 66 3.4.2 Operations Over Trees 67 3.4.3 Conditioning on Observations 74 3.4.4 Eliminating Variables < 78 3.4.5 Computing the Posterior Distribution 88 3.4.6 Complexity of Large Domain V E 89 3.5 Application of Large Domain V E to Person Identification 91 3.5.1 Evaluation of p' 92 3.5.2 Experiment 97 v 3.6 Conclusion 100 Inference with Hierarchically Structured Variables 102 4.1 Introduction • 102 4.2 Related Work 106 4.3 Hierarchical Variables 108 4.3.1 Hierarchical Variable and Distribution 110 4.4 Bayesian Networks with Hierarchical Variables 112 4.5 Specifying the Conditional Probability Distribution 114 4.5.1 Specifying the Prior Probability Distribution for a Hierar-chical Variable 114 4.5.2 Hierarchical Variable Is the Parent of a Simple Variable . . .117 4.5.3 The General Case 121 4.6 Observation in Bayesian Networks with Hierarchical Variables . . . 126 4.7 Query in Bayesian Networks with Hierarchical Variables 129 4.8 Inference in Bayesian Network with Hierarchical Variables 130 4.8.1 Construction of an Abstract Bayesian Network 135 4.8.2 Abstract_Bayes 136 4.8.3 Complexity 151 4.9 Correctness of Abstract Bayesian Network 153 4.10 Empirical Results 157 4.10.1 Conference - Academics Domain 158 4.10.2 Results . 160 4.11 Conclusion 166 vi 5 Putting It A l l Together '. 167 5.1 Introduction 167 5.2 Generalised Hierarchical Variables 171 5.3 Representation 173 5.3.1 Specifying the Prior Probability Distribution for a Gener-alised Hierarchical Variable 173 5.3.2 Conditional Probability of a Simple Variable Conditioned on a Generalised-Hierarchical Variable 175 5.3.3 Arbitrary Probability Distribution 176 5.4 Inference in Bayesian networks with Generalised-Hierarchical Vari-ables 178 5.5 Conclusion 184 6 Conclusions 186 6.1 Summary 186 6.2 Future Work 188 Bibliography 194 vii List of Tables 4.1 Conditional probability table P{F\\Ha) 150 4.2 Conditional probability table P(Pa\H") 150 viii List of Figures 2.1 An example of a Bayesian network 14 2.2 Conditional probability tables for the example network 15 2.3 The Variable Elimination Algorithm 22 2.4 Initial set of factors for example 2.5.6 23 2.5 (a) A simple Bayesian network modeling whether the person is injured, (b) The tabular representation of conditional probability P(I\CAS). (c) The decision tree representation of conditional prob-ability P(I\C A S) 32 3.1 Bayesian Network representation of attribute dependency for case X ^ Y (shaded variables are observed) 44 3.2 Bayesian Network representation of attribute dependency for case X = Y (shaded nodes are observed) 46 3.3 A C P D language representation of P{FnameJC\Afname A Sex A EFJC) ->. 53 3.4 A decision tree representation of P(Fname-X\AfnameA Sex A EFJC). 54 ix 3.5 The decision tree representations TI, T2, and T3 of conditional probabilities P(Fname-X\AfnameASexAEFJC), P{FnameJt\Afname/\ Sex A EFJf), and P{Afname\Sex) respectively 59 3.6 A B N F grammar of factor language 65 3.7 A tree simplified by removal of redundant subtrees (triangles denote subtrees) 67 3.8 Algorithm for pruning the decision tree T . 69 3.9 Merging trees TI and T2. The leaf labels are combined using the plus function merge2(Tl, T2, +) 72 3.10 Merging trees 71 and 72. The leaf labels are combined using the multiplication function merge2(T\, T2, x ) 72 3.11 Algorithm for merging trees TI and 72 with operation Op 73 3.12 A decision tree representation, T, of a factor corresponding to C P D P(FnameJC\AJhame /\ Sex AEFJi). 75 3.13 Conditioned tree TFnane->c=>daN^\ t r e e T a f t e r conditioning on FnameJC = "david" 75 3.14 Algorithm for conditioning the decision tree T on observations. . . . 77 3.15 (a) A simple Bayesian network, (b) The decision tree representa-tions of conditional probabilities P(X), P(W), and P(Y\W AX). . . 80 3.16 The partitions of the values of W and X 83 3.17 Algorithm for summing out variables Y from tree T 86 3.18 Function Compute_Evi-Factor called by algorithm Sum 87 3.19 The Large Domain V E Algorithm. 90 x 3.20 The decision tree representations, Tl, T2, and T3, of factors/1,/2, and/3 respectively 93 3.21 A decision tree representation, T, of new factor/ after multiplying trees Tl, 72, and T3 94 3.22 A decision tree representation, T, of new factor/' after summing out Afriame from tree T 95 3.23 Recall versus precision for both attribute dependence and attribute independence 99 4.1 Part of a class hierarchy of living things (L) 110 4.2 (a) A Bayesian network with simple and hierarchical variables (b) A part of the class hierarchy of R 113 4.3 (a) A Bayesian network with simple and hierarchical variables, (b) Part of the class hierarchies for variables P and Q. (c) parent context hierarchy of the given parent contexts that define P(Z\P A Q). . . . 124 4.4 The class hierarchy of L is divided into three regions R,rue, Runknown, and Rfaiae by positive observation (animal) and negative observa-tions (bat, penguin) about L 129 4.5 A Bayesian network with simple and hierarchical variables 131 4.6 Algorithm AbstractJBayes for constructing abstract Bayesian net-work 137 4.7 Functions AbstractHier called by the algorithm Abstract JBayes . . 138 4.8 Functions AbstractObs called by the algorithm Abstract_Bayes . . . 139 4.9 Function Construct_Table called by the Abstract_Bayes 142 xi 4.10 Function Simple_Var_Table called by the Construct_Table 142 4.11 Function Hier.Var.Table called by the Construct.Table 143 4.12 A Bayesian network with simple and hierarchical variables (shaded variables are observed) 146 4.13 The Bayesian network of Figure 4.12 with trees hierarchies for the hierarchical variables. Thin lines show the class-subclass relation-ship and thick lines show the Bayesian network dependency. Ex-ceptional classes are marked by * 147 4.14 (a) Bayesian network representation of the conference - academics domain, (b) A part of the class hierarchy for location in the world. . 158 4.15 The average inference time of applying VE and Abstract Jiayes + - VE for k = 2 as a function of the depth of the hierarchy, and depth of the observations 163 4.16 The average inference time of applying VE and Abstract Jiayes -F VE for d - 8 as a function of the number of academics and the depth of the observations 164 5.1 (a) A Bayesian network modelling rock types in a region, (b) The class hierarchies of Location and RockType 170 xii Acknowledgements I am incredibly grateful to Dr. David Poole, my research supervisor, for the count-less hours of discussions throughout the course of my work. His advice, comments, and suggestions concerning this thesis were invaluable. Without his extensive sup-port (both financially and technically), patience and endless encouragement, my thesis would not have been possible. Sincere thanks to Dr. Alan Mack worth, Dr. Nando de Freitas, and Dr. Anne Condon, members of my supervisory committee for their valuable feedback on my thesis draft, and great recommendations on this research project. I would like to thank Dr. Marek J. Druzdzel (External Examiner), Dr. James J. Little (University Examiner), Dr. Paul Gutafson (University Examiner), and Dr. Bertrand Clarke (Examination Chair) for the time they spent reading and reviewing my work. I am very grateful to Valerie McRae, the LCI Ex-Secretary for her support during the course of this thesis. She was always ready for me, providing help, proofreading (a lot), and supportive comments on a regular basis. I thank my LCI friends Mark Crowley, Jacek Kisynski, and Michael Chi-ang, for their patience to listen my practice talks, providing feedback, and some valuable discussion. I would like to thank Craig Wilson for taking the time to proof read my thesis draft. Thanks to faculty members, staff, and fellow students in the Department for their help and encouragement. I am indebted to my mom, who helps me in a different way to finish this the-sis, in the course of this thesis she came to Vancouver three times, all the way from India to take care of my little one. Without her support, this work would not have been completed. I thank my father, father-in-law, and mother-in-law for their bless-ings and encouragement. I thank my friends and family for their encouragement and support. Lastly but not least, I thank my husband, Sudhir, and my loving sons, Ashu xiii and Amit, for their sacrifices, unconditional love, encouragement and support. RITA SHARMA The University of British Columbia October 2006 xiv Chapter 1 Introduction 1.1 Probabilistic Reasoning Reasoning under uncertainty is an important topic in artificial intelligence (AI), since any intelligent agent must be able to take action under uncertainty. Uncer-tainty is an unavoidable feature of the world. It arises in a variety of ways. An agent's sensors usually provide incomplete knowledge about the world, and the knowledge it does receive is often noisy. Also, its actions can have unpredictable effects. Probability theory provides a sound mathematical basis for reasoning under uncertainty. It allows us to maintain our beliefs with incomplete (partial) knowl-edge. It allows us to update our beliefs by incorporating new pieces of evidence obtained from various sensors in a coherent way. Despite this, for many years the AI researcher did not use probabilistic models. The most important reason for not using probability theory was the inability to represent and exploit the structure of 1 large probabilistic models. This situation changed in the late 1980s with the development of Bayesian networks [Pearl, 1988]. A Bayesian network is a directed acyclic graph of nodes, which represent random variables taking a set of values, and arcs, which represent direct probabilistic dependencies among these variables. Each node is associated with a Conditional Probability Distribution (CPD) defining the conditional distri-bution of the variable given its parents. The graph structure represents the con-ditional independence relationship that exists between different variables, and the parameters of conditional probability distributions represent the strength of prob-abilistic dependencies [Pearl, 1988]. Bayesian networks provide a compact rep-resentation for joint probability distributions (the probability distribution over the joint state space of all variables in the network) by a product of all conditional probabilities associated with the nodes in the network. Over the past fifteen years or so, Bayesian networks have emerged as the leading technology for probabilistic reasoning. The fundamental issue for Bayesian networks is the problem of probabilistic inference. Given a Bayesian network, we would like to compute the probability dis-tribution over the values of some random variables, given the fact that we know the values of some other variables. For example, in medical diagnosis we can query the probability of a particular disease given the results of a physical examination and tests performed; Probabilistic inference in a Bayesian network is computationally intensive and is known to be NP-hard in general [Cooper, 1990]. The computa-tion cost grows exponentially with the size of a Bayesian network. Note that even 2 approximate inference is computationally difficult in the worst cases [Dagum and Luby, 1993]. , Recently there has been interest in speeding up inference by extending the Bayesian networks representation through more structured representations of the conditional probability of a variable, given its parents (for example, in terms of causal independence [Zhang and Poole, 1996] or context specific independence (CSI) [Poole and Zhang, 2003; Boutilier, Friedman, Goldszmidt and Roller, 1996]). In all of these approaches, discrete random variables are considered to have a bounded number of values. However, in many real problems the discrete random variables could have very large or even unbounded domains. For example, in a Bayesian network we may have random variables that have as domains the set of all names, all postal codes, and all phone numbers. The large number of values of the variables are a problem for many reasons: • Specifying conditional probabilities in conditional probability tables (CPTs) requires a complete specification of CPT entries. The number of parameters required to describe the CPT of a variable X is |X| x \Pa(X)\, where Pa(X) denotes the parents of X. For CPTs with large \X\ or \Pa(X)\, specifying all the parameters is not feasible. • Even if one could specify all the parameters for big CPTs, a big tabular rep-resentation is not efficient in terms of memory. • The complexity of exact probabilistic inference in a Bayesian network is ex-ponential in tree width, where the base of the exponent is the domain size. 3 • One could potentially have a big tabular representation and use the previ-ous approaches for probabilistic inference. However, we cannot represent the conditional probability distribution of a variable conditioned on a vari-able that has an unbounded number of values in tabular form. For example, consider a random variable that has as its domain the set of all first names. This domain may never be known to the full extent because people can make up names. We need a compact representation for representing the large (or unbounded) CPTs so that we can reason in an efficient manner. In some problems we might have large-domain variables with a priori hi-erarchical structure on their values. Almost anything - objects, Geo-systems, uni-versity courses etc. - may be classified according to some taxonomic scheme. The classic example is the taxonomic classification of living things (i.e., the millions of species) according to Linnaean taxonomy hierarchy1. Living things are divided into kingdoms (e.g., plantae, animalia), classes (e.g., mammals, birds, fish), all the way down to species. The hierarchical structure over the values of a variable can be exploited to make the representation compact and reasoning efficient. This thesis is concerned with the representations and inference algorithms for probabilistic reasoning in complex domains that have discrete random variables with large domains, including the case where there are unboundedly many discrete values. To perform efficient inference in Bayesian networks that have discrete vari-ables with large domains, the approach we propose in this thesis is based on a lhttp : /lwww.wikipedia.org/wiki/LinnaeanJaxonomy 4 structured representation of the problem. The structured representation allows the exploitation of regularities and independencies in the problem that can reduce the "effective state space" of the variables. This reduction has an immediate effect on the computation of the query, and the storage required. We consider the following types of structures: • We consider that there is some structure in the conditional probability tables that, given an observation and query, allows us to partition the large number of values of the variables in equivalence classes that have the same conditional probability. We can take advantage of the structure by reasoning about these classes together. In particular, we do not need to reason about every value of the large-domain variables individually. Rather, we can reason about a block of values together. • We consider that there is a priori internal structure on the values of large domain variables. That is, the large number of values of the variables can be represented as specialization/feature hierarchies. We can exploit the structure provided by feature hierarchies in probabilistic inference. Feature hierarchies aggregate the values in meaningful ways. We can support the compact rep-resentation for the local probability model using inheritance, thus encoding constraints on the contexts in which the conditional probabilities are equal. The ideaof utilizing structure in the CFTs is standard [Boutilier et al., 1996; Zhang and Poole, 1996]. The difference is that we are applying it to a more general class of problems, where we can have discrete random variables with small, large 5 or unbounded domains. Similarly, feature hierarchies and semantic networks have been used in A l for a long time. These graph-based formalisms, are not easily amenable for representing uncertainty in an elegant and efficient manner. There has been some work that combines belief network and feature hierarchies [Pearl, 1986; Poh and Fehling, 1993; Heckerman, 1990]. But in these works, the Bayesian networks are restricted to a diagnostic network form (or naive Bayes classifier) with a single hypothesis variable. In this thesis, we utilize the feature hierarchies in a general Bayesian network that has both small and large-domain random variables, where any variable can have any type and any number of parents. In summary, this thesis is concerned primarily with the representation and the probabilistic inference for Bayesian networks that contain both small and large-domain (or unbounded) discrete random variables, where there is some structure. The structure could either be in CPDs, or be due to the feature hierarchies, or both. 1.2 Thesis Overview In this thesis, we present two structured probabilistic inference algorithms that ex-ploit the structure in the problem for efficient reasoning in Bayesian networks that have discrete random variables with large domains. To represent the conditional probabilities in a compact manner for Bayesian networks that have variables with large domain, we first consider that CPDs can be represented compactly using both intensional (in terms of functions and predicates) and extensional (by listing the values) definitions. The computation of intensional definitions may involve looking up the tables and computing the predicates and 6 functions. To capture such distributions, we propose a CPD language for repre-senting the CPDs in a compact manner. To deal with the complexity of inference, we exploit the fact that when there is no evidence or query to distinguish between all the values of a variable, we do not need to reason about each value separately. Rather, we can reason about a group of values together as a single entity. We develop a structured inference algorithm, Large Domain VE, that is based on the Variable Elimination algorithm, for making inference in Bayesian networks that have variables with large domain. Large Do-main VE partitions the domains of the variables dynamically during the inference based on the evidence and query. In this thesis, we motivate our work on large domain variables using the "per-son identification problem" [Gill, 1997; Bell and Sethi, 2001]. This is the problem of identifying whether two records that contain information about the demographic attributes of people (e.g., firstname, lastname, date of birth) refer to the same person. Standard methods [Fellegi and Sunter, 1969] treat the attributes as independent of each other, i.e., the matching on one attribute does not depend on other attributes. We have relaxed this assumption to model how the attributes are interdependent. Bayesian networks of attribute dependence contain many variables with large do-mains. For example, we have random variables that have as domains the set of all first names and the set of all postal codes. To represent the large CPTs of these networks, we need some compact representation. As we shall see in Chapter 3, we represent the large CPTs in a compact manner using intensional and extensional representation. We present the application of Large Domain V E to the person iden-7 tification problem. Chapter 4 considers the case where there is a priori hierarchical structure on the values of the variables. That is, the values of the variables can be represented as specialization / feature hierarchies. We call such variables hierarchically structured variables. To exploit the structure provided by feature hierarchies in probabilistic reasoning we need a representation language for representing the CPDs in a com-pact manner. We represent the distribution for the hierarchical variables by speci-fying, for each class, the probability distribution over its immediate subclasses. To represent the conditional probability distribution of any variable conditioned on a hierarchical variable, we use inheritance. To perform efficient inference in Bayesian networks with hierarchically struc-tured variables, we dynamically construct an abstract Bayesian network, given some evidence and a query, by collapsing the hierarchies to include only those values necessary to answer the query. We can answer the query from the abstract Bayesian network using any standard probabilistic inference algorithm. The do-main size of the variables in the abstract Bayesian network is independent of the size of the hierarchies; it depends on how many of the classes in the hierarchies are supported directly by the evidence or are relevant to the query. Thus, the proposed approach is applicable even when the hierarchies are conceptually infinite. Finally, we put both CPD language representation and feature hierarchies together to produce a general framework that can be applied to a more general class of problems that have both small and large-domain variables with or without feature hierarchies. 8 1.3 Summary of Thesis Contributions The contributions of this thesis are summarized as follows: • To represent the CPDs of a Bayesian network that has variables with large domains in a compact manner, we introduce a CPD language. The CPD language allows us to represent the large conditional probabilities compactly using intensional and extensional definitions. The CPD language generalizes the idea of context specific independence [Boutilier et al., 1996], because the contexts are not only given by expressions such as variablei = vdlj but also by expressions such as fooiyariablei, variablej) = true. • We provide a structured efficient inference algorithm, Large Domain VE, that exploits the CPD language representation. Large Domain V E dynamically partitions the values of large-domain variables, based on the evidence and query. • We demonstrate how to apply Large Domain V E to the person identification problem. • We provide a systematic way of dealing with hierarchically structured vari-ables in probabilistic reasoning. We provide the representation techniques for representing the CPDs in a compact manner. • We provide an algorithm to perform efficient inference in Bayesian networks with hierarchically structured variables. The proposed approach is applicable even when the hierarchies are conceptually infinite. 9 • We develop a general framework that exploits both intensional definition of the conditional probabilities and hierarchically structured values of the vari-ables. 1.3.1 Thesis Organization The rest of this thesis is organized as follows. In Chapter 2, we provide a formal def-inition of Bayesian networks and describe an inference method. The contributions of this thesis begin in Chapter 3, where we consider how to exploit the structure in large discrete domains. We discuss in brief the "person identification problem". We present the representation and inference algorithm for making inference in Bayesian networks that have variables with large domains. In Chapter 4, we introduce the notion of hierarchically structured variable. We present the representation and in-ference algorithm for exploiting the hierarchical structure in probabilistic inference. We present some empirical results that show that hierarchically structured values of a variable can make reasoning more efficient than reasoning over the set of all the values. In Chapter 5, we integrate both kinds of structures in a common framework so that it can be applied to a more general class of problems. Finally, in Chapter 6, we summarize the main points of this thesis, and future directions that could be taken with this work. Some of the ideas presented in the above chapters were previously published as technical papers in A l conferences [Sharma and Poole, 2003; Sharma, 2004; Sharma and Poole, 2005a; Sharma and Poole, 2005b]. 10 J Chapter 2 Background 2.1 Introduction Bayesian networks (also known as Belief networks, Bayes nets) are a dominant technique for representing uncertainty knowledge in AI [Pearl, 1988]. Bayesian networks exploit conditional independence between variables to compactly rep-resent the joint probability distribution. This chapter presents an introduction to Bayesian networks. We begin by denning Bayesian networks, then discuss the con-ditional independence relationships encoded in their structure, and finally an infer-ence algorithm for performing probabilistic inference in Bayesian networks. None of the material presented in this chapter is new; readers who are familiar with this background can skip to the next chapter. In this chapter, we concentrate on discrete models where all the variables have small domains. The techniques presented in this chapter do not work very well with variables that have large domains. In sub-sequent chapters we shall concern ourselves with networks that have variables with 11 very large (or infinite) domains. 2.2 Notation We denote a random variable by upper case letter (e.g., X, Y-„ Z), and the actual value of that variable by the same letter in lower case (e.g., x, yi, z). We use Val(X) to denote the set of all possible values that a random variable X can take. We instantiate a variable X by assigning it a value x G Val(X), denoted by X = x. We denote a set of random variables by bold-face upper case letter (e.g., X , Y j , Z); the corresponding bold-face lower case letter (e.g., x, y s , z) denotes an assignment for each variable in a given set. The function Val(X) is the cross product of the domains of the variables in X . We instantiate a set of variables X by assigning a value to each variable in the set, denoted by X = x. We use P(X) to denote the probability distribution for a variable X. We use P(X = x) (or P(x)) to denote the probability that X takes the value x. We use P ( X | Y ) to denote the conditional probability of X given Y , and P ( X | Y = y) denotes the conditional distribution of X given Y = y . We use Pa(X) to denote all the parents of a variable X in a Bayesian network. 2.3 Bayesian Networks A Bayesian network is a compact representation of a joint probability distribution over a set of random variables X = {X1}... ,Xn}. A Bayesian network is a di-rected, acyclic graph Q defining a probability distribution over X . Each node in Q 12 corresponds to a random variable, and edges in Q represent the direct dependence between the variables. We use the terms node and random variable interchangeably. The parents of node X, denoted by Pa(X), are all the nodes in Q that have directed edges going into X. Each variable X in a Bayesian network is associated with a conditional probability distribution (CPD), denoted by P (X\Pa (X)). If X has no parents, then it is a prior probability of X, denoted by P(X). The CPD for a variable X is often specified explicitly in tabular form by listing the probability distribution over the values of X given every possible instantiation of its parents Pa(X). In this case, the CPD is called a conditional probability table (CPT). The graph Q represents a set of conditional independencies: every node in the graph is independent of its non-descendants, given its parents. If we total order the variables Xi,..., Xn such that Pa(X() C {Xi,. . . , then P(Xi\Xi_u X i ) = P (Xi\Pa fr)) (2.1) The joint probability distribution induced by a Bayesian network is defined using the chain rule and the conditional independence assumption. Using the chain rule, the joint probability distribution over {Xx... Xn} can be computed as follows: n P(X1...Xn) = J J p p c - f c , . . . , * ! ) 1=1 After applying the conditional independence assumption as defined in equation (2.1), the joint probability distribution P(Xi... X„) can be computed as follows: n ! P(Xl...Xn) = Y[p(Xi\Pa(Xi)) (2.2) i= i More formally, we can define a Bayesian network as follows: 13 ( Figure 2.1: An example of a Bayesian network. Definition 2.3.1 A Bayesian network B over random variables X = {Xi}... ,X„} is a pair (Q, 9), where • Q is a directed acyclic graph over variables X , with one node for every vari-able Xi. • 6 is a set of conditional probability distributions: d=-{P (Xi\Pa (Xi)) | Xi e X , Pa(Xi) are the parents of X, in Q } The joint probability distribution represented by B is given by equation (2.2). Example 2.3.2 Figure 2.1 presents a medical Bayesian network that models the lung cancer problem. The domain is modeled here through six binary variables; Smoking (S), Pollution (P), Bronchitis (B), Lung Cancer (L), Fatigue (F), and Pos-itive chest X-Roy (X) are all Boolean {true, false) variables. This slightly modified Bayesian network model is taken from [Neapolitan, 2004]. The graph structure of a Bayesian network reflects the causal structure of the domain. Lung cancer and 14 s true false 0.2 0.8 S B ] true false true ' 0.25 0.75 false 0.05 0.95 B L F true false true true 0.75 0.25 true false 0.10 0.90 false true 0.5 0.5 false false 0.05 0.95 • P true false 0.8 0.2 S P L true false true true 0.05 0.95 true false 0.03 0.97 false true 0.02 0.98 false false 0.001 0.999 L X true false true 0.9 0.1 false 0.2 0.8 Figure 2.2: Conditional probability tables for the example network. bronchitis are more often encountered in smokers, and we indicate this by setting 5 to be a parent of both L and B. Pollution can increase the probability of lung cancer, and we indicate this by setting P to be a parent of L. Both lung cancer and bronchitis can cause fatigue, so the parents of F are L and B. A person with lung cancer usually has a positive chest X-ray, so the parent of X is L. Figure 2.2 shows a set of possible conditional probability tables for the Bayesian network shown in Figure 2.1. The entries in the CPT encode the pa-rameters of the probability distribution. For example, P[P — true) = 0.8 and P(B = true\S =' true) = 0.25. 15 2.4 Conditional Independence A Bayesian network is a compact representation of a joint probability distribution. A joint probability distribution over n binary random variables needs 2" — 1 in-dependent parameters. However, a Bayesian network representation over n binary random variables requires at most n x 2k independent parameters, where k is the maximum number of parents of a variable. This is because the graph structure of a Bayesian network forces certain conditional independencies to hold, irrespective of the parameters of the conditional probability distributions. Definition 2.4.1 Let X = {Xu ..., Xn}, Y = {Y1,..., Yn}, and Z = {Zu ..., Z„} be three disjoint sets of random variables, then X is conditionally independent of Y given Z, if Vx e Val(X), Vy e Val(Y), and Vz e Val(Z), P(X = x, Y = y |Z = z) = P(X = x|Z = z)P(Y = y |Z = z) When X and Y are conditionally independent given Z, we write as /(X, Y | Z ) . For example, in the Bayesian network shown in Figure 2.1, variable F is not independent of S; however, it is independent of S given its parents L and B, which means I(F,S\ {L,B}). 2.5 Inference Almost every system that uses Bayesian networks must be able to perform prob-abilistic inference, i.e., find the distribution over some variables given some evi-16 dence. Given a Bayes net B over the variables X , we typically need to compute the distribution P ((^evidence), where Q C X is called query variables. Very often, evidence is considered as an instantiation of some of the vari-ables in the network, i.e., evidence denotes E = e, E C X , e € Val(E). However, one may also be interested in computing the posterior distribution of Q, given dis-junctive evidence (more general evidence), e.g., Xi = xn V Xi = xu- That is, we only know that the value that a variable can be assigned belongs to a set of values, but we do not know which value. We therefore consider that E G e*, e s C Va/(E). The new joint probability distribution after inserting the evidence can be computed as follows: n \ P{XiX„)Eee. = J J P(Xi\Pa(Xi) )Eee, 1=1 That is, inserting evidence into a joint probability distribution means that ev-ery conditional probability F(X,|Pa(Z,)) is replaced by new conditional probability P(Xj\Pa(X())Eees, which can be computed as follows: • If ({Xi}UPa(Xi))nE = 0 PiXilPaiXi))^ = P(Xi\Pa(Xi)) • Otherwise, let Z = ({X,} U Pa (X,)) n E j P(Xi\Pa(Xi)) Vz e Val(Z) such that z € e v P(Xi\Pa(Xi))Eees = < I 0 otherwise P ( Q | E e e s) is the conditional probability distribution over the query vari-able Q, given the evidence E e e s. Let {Yl,..., Ys} be the non-query random 17 1 variables of B. The query P ( Q | E e e') can be computed as follows: P ( Q | E e e ' ? ) = ^ • • • ^ Y i P ( X i T - - > X « h e e s ( 2 3 ) The denominator in equation (2.3) does not depend on Q . Thus, we only need to compute Y^YS • • • i C n 7" ( X L , • • •, Xn)E&^, which is a function of Q: / ( Q ) = E - - - E p ( x i > - - - > x « ) E € e > We can consider P ( Q | E G ev) as a function that assigns each q £ Val(Q) a condi-tional probability P (Q = q |E e ev). The conditional probability P (Q = q |E e ev) can be computed as follows: 2_/QeVa/(Q) /VWJ where ^QeV<fl/(Q) / ( Q ) is a normalizing constant that normalizes the values of /(Q) so that they sum to 1. Thus, the problem of probabilistic inference reduces to the problem of com-put ing/^) . Theoretically,/(Q) can be computed from the joint probability distri-bution by summing out non-query variables one by one. For example, suppose we are interested in the query P(X\F = true) in the Bayesian network shown in Figure 2.1. To compute P(X\F = true), we need to compute f(X), which can be computed as follows: .f(X) = P(S,P,L,B,X,F)F€{lrue} S,P,L,B = J2 P(S)xP(P)xP(L\S,P)x S,P,L,B P(B\S) x P(F = true\B,L) x P(X\L) 18 The variables S, P, L, and B are Boolean variables. Thus, to compute/(X) we need to sum 16 different terms, and that in general the number of terms we need to sum might be exponential in the number of variables in the network. Thus, summing out a variable from a joint probability distribution is not practically viable. It has been proved that the probabilistic inference in Bayesian networks is NP-hard [Cooper, 1990]. However, in many cases we can exploit the structure of the Bayesian network to perform probabilistic inference efficiently. In the next section, we discuss an exact inference algorithm. 2.5.1 Variable Elimination The Variable Elimination (VE) algorithm [Zhang and Poole, 1994; Dechter, 1996] is a query oriented algorithm for probabilistic inference that exploits the conditional independence inherent in the Bayesian network for efficient inference. As discussed above, to compute the query P(Q|E £ ev) we need to compute the function/(Q). where Yi,..., Ys are the non-query variables of the Bayesian network and E 6 e4' is the evidence. The problem of probabilistic inference is thus reduced to the problem of summing out variables from a product of functions. Before showing in detail how V E computes the function/(Q), we introduce some terminology. / (Q) = J2---Y,P^--->X") {Eee-} n (2.4) 19 Definition 2.5.1 A factor over variables X is a function from Vfo/(X) into the real numbers. A factor can be represented as a d-dimensional table, where d is the number of variables in the factor. Each row of the table corresponds to a specific instantiation of the factor variables. Definition 2.5.2 Le t / be a factor over n variables and b be the maximum domain size of a variable i n / . The size of factor/ is 0(b"). Definition 2.5.3 Let fl be a factor over variables Y , and/2 be a factor over vari-ables Z . The product of factors/1 and/2, denoted by fl 0/2, is a factor over Y U Z defined by: ' / 1 ® / 2 ( Y U Z ) = / 1 ( Y ) X / 2 ( Z ) The product of factors is commutative and associative. To construct the product of factors fx ® fi ® ... ® fk we need to create a factor / over variables Y , where Y is the union of all the variables i n/i.. .fk. The table entry in / for any assignment, y of Y , y 6 Val(Y), is the multiplication of k table entries from fi,... Jk corresponding to the assignment y. Definition 2.5.4 Let / be a factor over variables Y , and Z some variable in Y . The summing out of variable Z from factor/, is a factor g over variables W = Y — {Z} such that S(W) = X>0O = / ( W A Z = z) Z z<EVal(Z) The table entry in g for any assignment w of W is the summation of all those entries from / that correspond to the assignment w. That is, the values of the resulting factor g are obtained by adding the values of factor/ for each value of Z. 20 Definition 2.5.5 Let / be a factor over variables X and Z e X . Suppose Z e z s, z s C Val(Z). The conditioning off on Z e zv is the factor/Zez.< over X defined as follows: 1 { / ( X ) if z e ^ c t a / ( z ) I 0 otherwise This definition includes the conditioning off on Z = z as special case because Z = z is an abbreviation of Z € {z}. We have defined a factor, along with some basic operations on factors. Now, we will discuss how V E computes the function/(Q) as a series of operations on factors. The V E algorithm works by keeping a set of factors. Initially, the factors are the conditioned CPDs of the Bayesian network. The non-query variables of the Bayesian network are eliminated one by one. The order in which the variables are eliminated is called the elimination ordering. To compute/(Q) efficiently, factors are rearranged in equation (2.4) and the summations are pushed inside so that inside the sum are only those factors that involve the summing variable. For example, suppose we want to sum out variable F„ and/ i , . . . ,fk are the factors that are multiplied together. Suppose/i,... ,/„ are those factors that do not involve T,, and f m + i . . .fk are those that do involve 7,. Then, y^/i • • •/* = / i • • -fm ^ / « + i • • - fk Yi Yi We explicitly construct a factor for X^/m+i • • •/*• After summing out all the non-query variables, we can compute the posterior distribution by multiplying the remaining factors and normalizing the remaining factor. The V E algorithm is summarized in Figure 2.3. 21 Function Variable Elimination^, Q, E, es) Input: B: Bayesian network, Q: query variables, E: observed variables, e5: set such that es C Val(E) Output: A normalized factor over variables Q Let F be the factors corresponding to the CPDs in B Replace each/ € F that involves some £, 6 E wi th / E e e » Let Z be the variables of B for each Y 6 Z - Q according to some elimination ordering do F <— Eliminate(Y, F) end for return Normalize(F, Q) Function Eliminate(F, F) partition F into: • • • ,fm} that do not involve Y {fm+i • • -fk} are those that do involve Y / ^ E y / L + i ' x • • • xfk return {/i,... ,/„,/} Function Normalize ({/,... ,fr} , X) / < - / ! X . . . Xfr return f/c Figure 2.3: The Variable Elimination Algorithm. 22 S P L true true true 0.05 true true false 0.95 s true false true 0.03 true 0.2 f3(S,P,L)= true false false 0.97 false 0.8 false true true 0.02 false true false 0.98 false false true 0.001 false false false 0.999 P L X f2(P) = true 0.9 f(L,X) = true true 0.8 false 0.1 false true 0.2 B L F S B true true false 0.25 true true 0.25 true false false 0.9 f (S, B) = true false 0.75 false true false 0.5 false true 0.05 false false false 0.95 false false 0.95 Figure 2.4: Initial set of factors for example 2.5.6. Example 2.5.6 We illustrate the V E algorithm by computing the answer to the query P{L = true\X = true, F = false) in the Bayesian network shown in Fig-ure 2.1. After conditioning on the evidence, we have the set of factors as shown in Figure 2.4. Al l entries that are 0 have been omitted from factors f(L,X) and ,f(B, L, F) We will eliminate the variables X, F, P, S, and B in that order. Variable X is involved in only one factor f(L,X), and the value of X is fixed by evidence. Eliminating X replaces f{L,X) by the factor f(L) defined as follows: 23 I I L true 0.8 false 0.2 Variable F is involved only in one factor f5(5, L,X), and its value is fixed by evi-dence. Eliminating F replaces f$(B, L, F) by the factorfs(B, L) defined as follows: B L true true 0.25 true false 0.9 false true 0.5 false false 0.95 To eliminate P we need to multiply factors/2(P) and/3(5, P, L) because P appears in these two factors. Multiplying these factors together results an interme-diate factor fg(S, L, P), defined as follows: s P L true true true 0.045 true true false 0.855 true false true 0.003 true false false 0.097 false true true 0.018 false true false 0.882 false false true 0.0001 false false false 0.0999 24 We then sum out P from/9(5, L, P), and replace/2(F) and f3(S,P,L) with / 1 0 (5, L), defined as follows: s L true true 0.048 true false 0.952 false true 0.0181 false false 0.9819 Next, we eliminates, which is involved in factors/i(5),/6(5,5), and fw(S, L). We multiply these together to obtain an intermediate factor fn(S,B,L) denned as follows: s B L true true true 0.0024 true true false 0.0476 true false true 0.0072 true false false 0.1428 false true true 0.000724 false true false 0.03927 false false true 0.013756 false false false 0.74624 We then sum out 5 from/ n(5,B,L), and replace/i(S),/ 6(S,g) and/ 1 0(5,L) with fu(B, L), defined as follows: \ 25 B L true true 0.003124 fi2(B,L) = true false 0.08687 false true 0.0209 false false 0.88904 To eliminate B, we multiply togetherfs(B, L) and/12^ defined as follows: B L true true 0.000781 fu{B,L) = true false 0.07818 false true 0.01045 false false 0.8445 We then sum out B f rom/ 1 3 ( f i , L), replacing / 8(B, L) and / 1 2 (5 , L) with a factor fu(L) over L defined as follows: L true 0.0113 false 0.9227 Now, we are left with factorsfu(L) and/7(L), which are factors over L alone. Multiplying these two factors together, we obtain a factor/i5(L) defined as follows: L f 15(E) = true 0.0102 false 0.1845 26 Now, we normalize/15 to obtain P(L = true\X = true, F = false) = 0.0523 P(L =false\X = true, F = false) = 0.9476 In summary, we have performed the following computation: f9(S,P,L) = f2(P) xf3(S,P,L) fw(S,L) = J2f9(S,P,L) p : fn(S,B,L) = MS)xMS,B)xf10(S,L) fl2(B,L) = J2f^(S,B,L) s fl3(B,L) = fL2(B,L) x / 8 ( 5 , L ) ML) - J2MB,L) B ML) = fu(L)xf7(L) As discussed above, eliminating a variable in V E involves a series of multi-plications to compute an intermediate factor, followed by a number of summations. The number of multiplications is proportional to the size of the intermediate factor times the number of factors to be multiplied. This is because for each assignment of the variables of intermediate factor we need to do k multiplications, where k is the 27 number of factors to be multiplied. Note that the number of summations is also pro-portional to the size of the intermediate factor, as each entry (row) of intermediate factor contributes to one summation term. Thus, the cost of eliminating a variable in V E is proportional to the size of the intermediate factors times the number of factors to be multiplied. Proposition 2.5.7 Let N be the number of variables in the network, b be the maxi-mum number of values that a variable can have, and M be the maximum number of variables involved in an intermediate factor produced during the computation. The complexity of variable elimination is 0(N x bM). It is evident that both M and b are crucial in determining the complexity of the V E algorithm. The number of variables involved in intermediate factors depends upon the ordering we choose to eliminate the variables. For example, if we first eliminate the variable S in Example 2.5.6, then we get an intermediate factor over variables S, B, L and P that is larger than any factor we actually had. Unfortunately, finding the optimal elimination ordering, i.e., the ordering that results in the smallest factors, is NP-complete [Arnborg, Corned and Proskurowski, 1987]. However, one can use some simple heuristics to produce good elimination ordering. One such simple heuristic is the minimum discrepancy heuristic. According to this heuristic, at each time we choose to eliminate the variable that adds the fewest number of new edges to Q [Kjaerulff, 1990]. Note that performing inference in networks that have variables with very large b, even if we can reduce M, could be very expensive. If exact inference is too expensive in a network, we can use approximate inference algorithms, which we 28 have not discussed here. Usually there is a good deal of structure in the networks that can be exploited for performing efficient inference. 2.6 Pruning Irrelevant Variables Variable Elimination is a query oriented algorithm that processes one query at a time. We can use the structure of the graph to simplify the task of computing the query as much as possible. We can simplify the task by pruning from the Bayesian network all variables that are irrelevant to the query. One of the properties of the Bayesian network that can be utilized by the V E algorithm is known as barren variables. Shachter [1986] introduced the concept of barren variables. Definition 2.6.1 Let Q be the query variables in a Bayesian network and E be the observed variables. A variable X is a barren variable with respect to query nodes Q, if X g- Q, X £ E , and all descendants of X are barren. Barren variables are a consequence of one of the basic axioms of probability theory: Axiom 2.6.2 $>(x|r) = i X When a barren variable is summed out, it always gives a unity factor equal to one. Hence, barren variables are computationally irrelevant and can be removed from a Bayesian network without affecting the posterior probability distribution of the query variables.; 29 2.7 Context Specifi c Independence Bayesian networks exploit conditional independence to represent joint probability distribution in a compact manner. They do not place any restriction on how a vari-able depends on its parents. This means that to specify the conditional probability table of a variable X, we need to specify \X\ x conditional probabilities. Often, however, there is much structure in the conditional probability tables that can be exploited for efficient inference. In particular, Bayesian networks cannot capture certain independence: independence that holds only in certain contexts, for example, given a specific assignment of values to certain variables. In the past, extensive work has been done to extend Bayesian network rep-resentation in order to capture this additional independence. Many different ap-proaches have been proposed for representing and reasoning with independence that are more general than conditional independence [Boutilier et al., 1996; Zhang and Poole, 1996; Poole and Zhang, 2003; Heckerman and Breese, 1994]. In this section, we discuss context specific independence [Boutilier et al., 1996], which we will apply in subsequent sections. Boutilier et al. [1996] formalized the notion of context specific independence (CSI). CSI refers to the fact that some random variables are independent of each other only in a certain context. Definition 2.7.1 Given a set of variables C, a context on C is an assignment of one value to each variable in C, denoted by C = c, where c 6 Val(G). We say that C are the variables of the context. Definition 2.7.2 Let X , Y , Z, and C be pairwise disjoint sets of variables. We say - • 30 that X and Y are contextually independent [Boutilier et al., 1996], given Z and the context C = c, c G Val(C), if P ( X | Z , C = c, Y ) = P ( X | Z , C = c) whenever P (Y, Z, C = c) > 0 Thus, the independence relation between X and Y need not hold for all values of C. For example, consider the Bayesian network shown in Figure 2.5 (a). The network represents the domain that if a person goes climbing (C) and the rocks are slippery (5), then that person will probably get injured (/). The domain is captured here through three Boolean random variables S, I, and C, with domain {true, false}. The tabular representation of P(I\CAS) is shown in Figure 2.5 (b). Note that the last two rows in the tabular representation of P(I\C A S) are the same. This represents that the probability of getting injured does not depend on whether the rocks are slippery or not, if a person does not go climbing. That is, variable / is independent of S in the context C = false. If there is context specific independence, the regularities in the conditional probability tables can be captured for representing the conditional probabilities in a compact manner. A very natural representation for capturing the common entries in the CPTs is by decision tree [Boutilier et al., 1996]. Definition 2.7.3 A decision tree representing the conditional probability distribu-tion of a variable X is a rooted tree Tx, where each internal node represents Xt G ({X} U Pa(X)) and each leaf node represents a real number between [0,1]. Each in-ternal node has a set of outgoing arcs to its children, and each arc in Tx is associated with a unique variable assignment Xt = xi} x, G Val(Xj). 31 s C I true false true true 0.3 0.7 false true 0.1 0.9 true false 0.05 0.95 false false 0.05 0,95 (a) (b) 0.95 ( 0 Figure 2.5: (a) A simple Bayesian network modeling whether the person is injured, (b) The tabular representation of conditional probability P(I\CAS): (c) The decision tree representation of conditional probability P(I\C A S) . 32 A branch in a decision tree represents a path beginning at the root and end-ing at a leaf node. The parent context defined by a path is the set of variable assign-ments Xi = Xi, Xi ^ X encountered on the arcs along the path. The decision tree representation of P(I\C A S) is shown in Figure 2.5 (c). Boutilier et al. [1996] present two inference algorithms that exploit the decision tree representation for the efficient inference in Bayesian networks that we have not discussed (see [Boutilier et al., 1996] for details). Poole and Zhang [2003] exploit context specific'independence in the variable elimination algorithm. 2.8 Conclusion Bayesian networks have been used with great success for modeling many real world applications. For example, Bayesian networks are used for troubleshooting in M i -crosoft Windows [Heckerman, Breese and Rommelse, 1994], for predicting the likelihood of meeting attendance [Horvitz, Koch, Kadie and Jacobs, 2002], and in molecular biology for analyzing expression data [Friedman, Linial, Nachman and Pe'er, 2000]. They are potentially applicable anywhere there is uncertainty about the state of the world from observations. The key to the success of Bayesian networks is the representation of conditional independence, which enables the com-pact representation of the joint probability distribution and efficient inference algo-rithms. It is our goal in subsequent chapters to extend Bayesian networks to more complex domains, where discrete random variables can have large (or unbounded) number of values. 33 Chapter 3 Exploiting Structure in Large Discrete Domains 3.1 Introduction As described in the previous chapter, Bayesian networks have emerged as a viable technology for representing and reasoning about probabilistic models. As we have seen, the complexity of probabilistic inference grows with the domain size of the variables. We now want to take the next step, to extend Bayesian networks to more complex problems, where discrete random variables can have a large (or possibly infinite) number of values. For example, in the person identification problem, where given a pair of records we are trying to identify whether two records refer to the same person, we may have variables that have as domains the set of all names, the set of all postal codes, and the set of all credit card numbers. The large number of values of the variables are a problem for the following 34 reasons: • Representing conditional probabilities in conditional probability tables (CPTs) requires a complete specification of CPT entries. The number of parameters required to describe the CPT of a variable X is |X| x \Pa(X)\. For the CPTs with large \X\ or |Pa(Z)|, specifying all the parameters is not feasible. • Even if one could specify all the parameters for the big CPTs, a big tabular representation is not efficient in terms of memory. • The complexity of exact probabilistic inference in a Bayesian network is ex-ponential in tree width, where the base of the exponent is domain size. • One could potentially have a big tabular representation and use the previ-ous approaches for probabilistic inference. However, we cannot represent the conditional probability distribution (CPD) of a variable conditioned on a variable that has an infinite number of values in a tabular form. For example, consider a random variable that has as its domain the set of all first names. This domain may never be known to the full extent because people can make up names. We need a compact representation for representing the large (or unbounded) CPTs so that we can reason in an efficient manner. Carrying out inference in Bayesian networks that have variables with large domain is a difficult problem. For efficient inference in Bayesian networks that have large-domain discrete random variables, we consider cases where there is some structure that can be exploited to make inference efficient. In many cases, there is some structure in the conditional probability tables that allows us to partition 35 I the large number of values of the variables into equivalence classes. Rather than reasoning about every value of these variables individually, we can reason about a set of values together. The fundamental idea is that in any conditional probability table, we partition the values of each large-domain variables into disjoint subsets (equivalence classes) that have the same conditional probability for particular values of other variables. We would like to represent the CPDs in a compact manner and an inference algorithm that can utilize this compact representation in probabilistic inference for computational gain. We assume that we have a procedural way of generating the prior probabil-ities of the large domain variables (perhaps conditioned on other variables). This may include looking up tables. For example, the U.S. Census Bureau1 publishes a list of all first names, conditioned by gender, together with probabilities that covers 90% of the probability mass for both male and female first names. Together with a method for estimating the probability of a new name, this can be used as the basis for P(FirstName\Sex). We may also have a model of how postal codes are generated to give a procedure that estimates the probability of a given postal code. To represent the conditional probability of a variable conditioned on vari-ables that have large domains we assume that we can write them compactly using predicates and functions. We also assume that we have the procedures for comput-ing the predicates (and functions) and to count the number of values for which they are true. For example, consider a binary variable CommonName that has only one parent FirstName. CommonName is true when the value taken by FirstName is a common male or female name. We may consider the top 20 names in the male and 1 http://www.census.gov/geneaIogy/names/ 36 female name file as the common male and female names. Now, we can write the the conditional probability P(CommonName\FirstName) compactly as follows: where, common(FirstName) is a predicate that is true if FirstName is a common male or female:name. Predicate common{FirstName) is true for the 40 values of FirstName. In this chapter, we propose an approach for making efficient inference in Bayesian networks that have discrete random variables with large domains. We be-gin in Section 3.2 by presenting a motivating problem - the person identification problem. In Section 3.3, we discuss an approach for representing the CPDs in a compact manner. We represent the CPDs using both intensional (i.e., in terms of procedures) and extensional (by listing the values) definitions. To deal with the complexity of inference, we consider an evidence and query oriented approach. Our approach is based on the fact that when there is no evidence to distinguish between all the values of a variable, we do not need to reason about each value sep-arately. Rather, we can reason about a group of values together as a single entity. For effective inference, we utilize a Bayesian network representation of conditional independence, while also taking advantage of the additional structure in the CPDs. In Section 3.4 we present an inference algorithm, Large Domain VE, that is based on the variable elimination algorithm for making inference in Bayesian networks that have variables with large domains. In Section 3.5, we present the application of Large Domain V E to the person identification problem that shows that by parti-0 otherwise 37 tioning the large number of values of a variable we can make the inference efficient. 3.2 A Motivating Example: The Person Identifica-tion Problem The person identification problem that we consider is to determine whether two records containing information about the demographic attributes of a person (e.g., name, age, phone number, etc.) refer to the same person. It is used for comparing records in one or more data files, removing duplications, and determining if a new record refers to a person already in the database, or to a new person. The person identification problem occurs in many person-centric applications. For example, it occurs in health care applications [Gill , 1997; Bell and Sethi, 2001] when iden-tifying patients^ in social services applications in relation to clients, in commerce applications with customers, even in criminal contexts when identifying suspects [Wang, Chen and Atabakhsh, 2004]. This problem has been studied independently under various names by dif-ferent user communities. In statistics, it has been studied as record linkage since at least 1969 [Fellegi and Sunter, 1969]. In the Fellegi-Sunter method, for each pair of records, agreement and disagreement probabilities (matching weights) for each attribute are computed using frequency counts and error rates. The values of these match weights are used to designate a pair of records as a match, a possible match or a nomatch. Possible matches are those pairs for which clerical review is needed to decide their final status. 38 In the computer science literature, this problem has been studied as du-plicate detection [Monge and Elkan, 1997], the merge/purge problem [Hernandez and Stolfo, 1995], or identity uncertainty [Pasula, Marthi, Milch, Russell and Sh-pitser, 2002]. Hernandez and Stolfo developed the sorted neighbourhood method for limiting the number of potential duplicate pairs requiring distance computation. Monge and Elkan proposed the use of the Smith-Waterman algorithm for comput-ing minimum edit-distance to recognize pairs of approximately duplicate records. Pasula et al. [2002] proposed the use of a relational probability model with MCMC for matching scientific citations. The standard methods for person identification [Fellegi and Sunter, 1969] consider that the attributes are independent: the matching of one attribute does not depend on other attributes. However, this assumption is often faulty as the following examples demonstrate: • People Hying in the same household have the same address, phone number and often the same last name. In this situation, if we assume that the last name, address and phone number are independent of each other, it becomes more likely that we have a false positive match. • Twins usually have the same date of birth and last name. In this situation, the independence assumption can again cause a false positive match. • If a person moves to a different city, his address, phone number, and postal code change together. In this situation, the independence assumption can cause a false negative match. 39 If two descriptions refer to the same person, we expect that the attribute val-ues should be the same for both descriptions. However, the attributes may be dif-ferent because of errors, for example: typographical errors, misspellings, swapping first and last names, moving and so on. Rather than considering the attribute as in-dependent of each other, we consider here that attributes depend on each other. We propose the use of Bayesian networks for modelling the dependence/independence among the attributes. 3.2.1 Decision Making in Person Identifi cation The core sub-problem of person identification is to determine whether two records refer to the same person or not. Let the set of field values (attribute values) for a record be called a description2. Let X and Y be two records, which refer to the people to be compared, with Descx and DescY denoting their corresponding de-scriptions. There are two hypotheses when we compare the descriptions Descx and DescY: • the records refer to the same person (X = Y) • the records refer to different people (X ^ Y) Let Psame be the posterior probability that records X and Y refer to the same person, given their descriptions and Pdiff be the posterior probability that records X and Y refer to different people, given their descriptions. That is, Psame = P (X = Y\Descx, DescY) 2We want to distinguish a record from its description because two records can have the same description. 40 P m = P{X^ Y\Descx, DescY) The odds, Odds, for hypotheses X = Y and X ^ Y: x p r\ , , 1 same Odds = —— _ P(X=Y) xP (Descx A DescY\X = Y) P(X^Y) xP (Descx A DescY\X ^ Y) . P(X=Y) x P(DescY\X = Y) x P [Descx\DescY A X = Y) (3.1) P(X^Y) x P(DescY\X ^ Y) x P (Descx\DescY A X ^ Y) We would expect that record K's attribute value is independent of X = Y or X ^ Y given no information about the other person's attribute value. That is, P{DescY\X^Y) = P(DescY\X=Y) (3.2) After applying equation (3.2) in equation (3.1), we have the following: PiX^Y) P(Descx\DescY AX^Y) K ' The ratio is a prior odds and the ratio ff^K^rj is a likelihood ratio. There are three possible actions (decisions) that we can take for records X and Y [Fellegi and Sunter, 1969]: • match (decide that X and Y refer to the same person) • possible match (hold for a clerical review) • nomatch (decide that X and Y refer to different people) The decision can be made using decision theory [Duda, Hart and Stork, 2000], given the cost of false positive, true positive, true negative and false neg-ative match. An action with minimum cost is selected. Suppose we have a cost 41 function E (a\u) that describes the cost of action a. when u> is true in the world. If the assignment is match, the expected cost Ematch is: Ematch = E (match\same) x Psame + E (match\diff) x Pdig If the assignment is possible match, the expected cost Eposmalci, is: Eposmatck = E {posmatch\dijf) x P d , j + E {posmatch\same) x P M m e If the assignment is nomatch the expected cost Enommc]X is: Enomatch = £ (nomatch\diff) x + £ (nomatch\same) x P r a„, e We select the assignment (action) for which the error cost is minimum. Let us assume the following: • E (match\same) < E (posmatch\same) < E (nomatch\same), and • E (nomatch\diff) < E (posmatch\diff) < E (match\diff) The conditions for actions match, possible match, and nomatch are the following: p • Action match is optimal if p g / M e > max ( C I , C 2 ) P • Action possible match is optimal if mm ( C 2 , C 3 ) < , f m e < max (CI, C 2 ) • Action nomatch i f p f l w e < mm ( C 2 , C 3 ) where, C I = E (match\diff) - E {posmatch\diff) E {posmatch\same) — E (match\same) £ 2 E (match\diff) — E (nomatch\diff) E (nomatch\same) — E (match\same) • E (posmatch\diff) — E (nomatch\diff) E (nomatch\same) — E (posmatch\same) 42 The constant prior odds can be merged with constants CI, C2, and C3. We then just need the likelihood ratio for making the decision. Traditional methods [Fellegi and Sunter, 1969] treat the attributes as inde-pendent of each other, whether the descriptions refer to the same person or not. In the Fellegi-Sunter method, for each pair of records, agreement and disagreement probabilities for each attribute are computed using frequency counts and error rates. We have relaxed this assumption, and model the dependence/independence between the attributes for both cases X = Y and X ^ Y using a similarity network representation [Heckerman, 1990]. This representation exploits the hypothesis-specific independence between variables. In particular, separate Bayesian networks are constructed for each hypothesis. 3.2.2 Modelling of Attribute Dependence/Independence We model the dependence/independence between the attributes. The Bayesian net-works of attribute dependence contain many variables that have very large domains. The Bayesian Network Model of Attribute Dependence for X ^ Y To identify a person, we consider the following seven attributes3: social security number (SSN), first name (Fname), last name (Lname), date of birth (DOB), gender (Gen), phone number (Phone), and postal code (Post). The statistical dependence4 among the attributes that we assume is shown in Figure 3.1. Dependence among the attributes is modelled using the hidden variables that are the unshaded nodes in 3The approach can be extended to any number of attributes. 4The dependence may be known by the domain experts, or potentially can be learned. • 4 3 Figure 3 .1: Bayesian Network representation of attribute dependency for case X ^ Y (shaded variables are observed). Figure 3.1. • The proposition twin is true if the people referred by X and Y are twins. • The proposition relative is true if the people referred by X and Y are relatives. • The proposition samehousehold is true if the people referred by X and Y live in the same household. • The proposition samelastname is true if the people referred by X and Y have the same last name. We assume here that the gender of two different people is independent of each other5. Attribute SSN does not depend on the considered relations. However, we cannot assume that the SSN of two different people is independent. Knowing 5 A more detailed model may specify that twins are more likely to be of the same gender. 44 one person's SSN changes our belief about another person's SSN, because we ex-pect that they should not be the same (as two persons are not allowed to have the same SSN). Hence, where r denotes the probability that two different persons have the same SSN recorded (due to some errors), which is very, very small. The Bayesian Network Model of Attribute Dependence for X = Y If records X and Y refer to the same person (the numerator of the Odds formula), we expect that the attribute values should be the same for both X and Y. However, there may be differences because of attribute errors: typing errors, phonetic errors, nicknames, swapping first and last names, change of address and so forth. As explained earlier, we do not want to assume that the attributes are inde-pendent. We model their dependence using the hidden variables that explain the dependence among the attributes. Suppose the attributes are dependent because the data-entry person was sloppy, or because the person moved to a new residence be-tween the times that the records were input. If the move is true, we consider that the record X was input before the move and that Y was input after the move. j We model the dependence among attributes using their true values6, the slop-piness of the data-entry person, and the possibility of moving. Here, we consider 6For some attributes, the true values change because of the move, e.g., phone number. For such attributes, we consider the true values with respect to Y. 45 Figure 3.2: Bayesian Network representation of attribute dependency for case X = Y (shaded nodes are observed). the change in phone number, and postal code because of the move. The depen-dence between attributes is shown in Figure 3.2. The unshaded nodes show the hidden variables, and we assume that the shaded nodes are observed. The variable Sloppy JC represents whether the person who recorded the attribute values of record X was sloppy of not. The variable EFJC represents which error was made in record-ing the first name for record X. To make this example more readable, we consider the following errors (values of EFJC): • copy error (ce): an error where a person copies a correct name, but from the wrong row of a table. • single digit/letter error (sle): an error where a person copies all the letters (digits) correctly except one letter (digit). • no error (noerr): no errors. The variable move represents whether the person moved to a different ad-dress between the two records. The variable AphoneJf represents the true phone number after the move, i f move is true. The variable Afname represents the actual 46 first name. The random variables FnameJC, FnameJY, and Afname have, as do-mains, all first names. The random variables Phone JC, Phone A7, and AphoneJY have, as domains, all phone numbers. For the probability P (Afname\Sex), we can use the first name lists avail-able from the U.S. Census Bureau7. There are two first name lists with associated probabilities: one for female names, and the other for male names. The probabil-ity P (Afname\Sex — male) is computed using the male name file. The probability P (Afname\Sex = female) is computed using the female name file. We need a differ-ent mechanism for names that do not appear in these lists, as we cannot assign zero probabilities to such names. A number of techniques have been proposed to esti-mate the probabilities of new events [Chen and Goodman, 1998; Good, 1953; Fried-man and Singer, 1998] that can be used to compute the probability of a new name (names that did;not appear in the database). The data available from the U.S. Cen-sus Bureau is too noisy and incomplete to apply any of these techniques. In our implementation, we use a very small probability as the estimate of the probability of a new name. To compute the probability P (Aphone.Y) a model for generating phone numbers can be used. There are rules to generate the valid phone numbers for a city, province, and so forth. We use a simple procedure to compute the probability P (Aphone.Y). Let P be the total number of legal phone numbers, then { IIP if AphoneJY is a legal phone number 0 otherwise To make the decision whether records X and Y refer to the same person, we 7http://www.census.gov/genealogy/names/ 47 need to compute the likelihood ratio (LR) as given in equation 3.3. Suppose B\ and B2 denote the Bayesian networks shown in Figures 3.1 and 3.2, respectively. Let M and f\f be the variables of Hi and B2, respectively. Since B\ and B2 refer to a similarity network, Bi encodes the conditional distribution P(M. \ X ^ Y), and B2 encodes P(f\f \ X = Y) [Geiger and Heckerman, 1996]. To compute the likelihood ratio we need to condition on the observations and marginalize over the unobserved variables in the networks shown in Figures 3.1 and 3.2. After conditioning on the observed variables and summing out the hidden variables from the network shown in Figure 3.1, we get the likelihood of the observed data, given the hypothesis X ^ Y [Geiger and Heckerman, 1996]. Note that this is different from the standard set-up of probabilistic inference where we want to query some variable(s); here, we want to compute the probability of the data given the hypothesis that is not explicitly represented in the network. For the Bayesian network shown in Figure 3.2, we cannot represent the con-ditional probabilities in tabular form because some of the variables have very large domains. The conditional probability P (FnameJX\Afname A Sex A EFJC) cannot be represented in tabular form, as we do not know all names, and even if we could, the domains of Afiiame and FnameJC would be very large. The tabular representa-tion of P (Afiiame\Sex) is also very large. To represent these large CFTs we need a compact representation. The previously existing Bayesian network inference al-gorithms do not work on this network. In the following sections, we develop a methodology for representing large CFTs and detail the probabilistic inference al-gorithm that exploits the compact representation of the large CFTs. 48 3.3 Structured Representation of Conditional Prob-abilities To compactly represent the conditional probabilities for Bayesian networks that have large domain variables, the approach we propose is based on a structured rep-resentation of the problem. The structured representation allows exploitation of regularities and independencies in the problem that can reduce the effective state space of the variables. Our idea for specifying the conditional probabilities in a compact manner is that: the conditional probabilities can be represented compactly using both intensional (in terms of functions and predicates) and extensional (by listing the values) definitions. The computation of intensional definitions may in-volve looking up the tables, and computing the predicates and functions. Example 3.3.1 Consider the conditional probability P(Fname JC\Afname A Sex A EFJC) from the Bayesian network shown in Figure 3.2. Suppose we observed FnameJC = "dave". If the data-entry person makes a single letter error (EFJC = sle), the conditional probability P(FnameJC = "dave"|A/name ASexAEFJC = sle) has the same value for all those values of Afiiame that are a single letter apart from "dave". Note that there are 100 words that are single letter apart from "dave", as each letter in "dave" can be replaced by 25 other possible letters. Let S be the set of names that are single letter apart from "dave": , S = {x | x is a single letter apart from "dave"} 49 Then, { 1 / 1 0 0 if Afname eS 0 otherwise This above assumes that given EFJC = sle, FnameJC is independent of Sex. Thus, given EFJC = sle, we can represent P (FnameJC = "dave" | Afname A Sex A EFJC = sle) compactly using the inten-sional definition: Afname G S, S = {x | x is a single letter apart from "dave"} In a Bayesian network that has variables with large domains, we can repre-sent the conditional probabilities in a compact manner using intensional (using the predicates and functions) and extensional representations (by listing the values). To capture such distributions, we introduce a CPD language that allows us to define the CPDs procedurally, in terms of predicates and functions. Let Y be a variable and Z i , . . . , Zk be the parents of Y. In C P D language, to describe the conditional probability of Y, we have two cases: Casel (Y does not have large domain): We represent P(Y\ZX,... ,Zk) as a sen-tence as follows: P(Y\Zl,...,Zk) = (Cpl) where ( Cpl ) ::= ( number ), \ ( i f C(ZU Zn, Y) then ( Cpl ) else (Cpl)) 50 i i Here, number is a real constant in the closed interval [0,1]; it represents the probability of a single value of Y. C ( Z i , . . . ,Zn,Y) is a condition that de-scribes the subsets of the variables. The condition C ( Z i , . . . , Z„, Y) can be described either: • intensionally, as a rule (predicate) which implicitly describes the sub-sets'for the variables. We assume that there is a procedure to effi-ciently compute each predicate. For example, in Example 3.3.1 the condition Afiiame & {x \ x is single letter apart from "dave"} is an in-tensional definition. • extensionally, by listing the values of the variables. For example, Xj £ { v i , • • •, vk] orX, = v 1 2 . Note that this is a particular form of intensional definition that needs to be distinguished because it allows more efficient inference. case2 (Y has large domain): We represent P ( K | Z i , . . . ,Zk) as a sentence as fol-lows: P(Y\Zx,...,Zk) = (Cp) where ( Cp) ::= ( number ) \ , ( \pY, pm) ) | ( i f C(ZX,... ,Z„, Y) then ( Cp ) else ( Cp ) ) Here, number and C ( Z i , . . . , Z„, Y) have the same meaning as discussed above in Casel. In pair [ pY, pm ], pY represents the probability of a single value of 51 Y and pm, a real constant between [0,1], represents the probability mass of a set of values of Y. The value py could be defined extensionally (by listing the value) or intensionally using a function, e.g.,/(F), that returns a real con-stant between [0,1] - we assume that we have procedures for computing the functions. The conditional probability of Y can be defined using both intensional and extensional definitions. The CPD language generalizes the notion of context spe-cific independence [Boutilier et al., 1996], because the contexts are not only given by the assignment of the values to the variables such as X,- G { v i , . . . , v^}, but also by expressions such asfoo(Xi,... , X*) = v,-. Example 3.3.2 Consider the representation of conditional probability distribution P (Fname JC\Afname A Sex A EFJC) of the Bayesian network shown in Figure 3.2. As mentioned in Section 3.2.2, variable FnameJC represents the first name for record X, Aftiame represents the actual first name of the person and EFJC repre-sents the type of error made in first name for record X. The conditional probability P(FnameJC\Afriamef\Sexf\EFJC) can be represented compactly using the CPD lan-guage, as shown in Figure 3.3. It can also be seen as a decision tree [Quinlan, 1986] as shown in Figure 3.4, where the internal nodes represent the conditions and the leaf nodes represent the real numbers or the pairs of functions and numbers, e.g., [ prsing (FnameJC), 1]. As shown in Figures 3.3 and 3.4 P(FnameJC\Afname/\Sexf\EFJC) is repre-sented compactly using the intensional and extensional definitions, by conditioning on the values of EFJC. Figure 3.4 shows that there are three main cases. The 52. P(FnameJC\Afname A Sex A EFJC) = .{ if (EFJC = sle) then Cx else C 2 ) where Cx = ( if singlet (Afname, FnameJC) then \prsing (FnameJC), 1] else 0) c2 = ( if (EFJC = ce) then C 3 else C 4 ) c 3 - ( if (Sex = ma/e) then C 5 else C 6 ) c 5 = ( if intable(FnameJC, male), then C 7 else \fnew(FnameJC), 0.1]) c 7 = [ lookup(FnameJC,male), 0.9] c 6 = ( if intable(FnameJC,female) then Cs else \fnew(FnameJC), 0.1] ) C8 = [ lookup(FnameJC, female), 0.9 ] Q = ( if equal(Afname, FnameJC) then 1 else 0) Figure 3.3: A CPD language representation of P(FnameJC\Afname A Sex A EFJC). conditional probabilities for these cases are computed as follows: Case 1: E F _ X = noerr If the data-entry person who entered the attribute-values for record X did not make an error in the first name, the first name that appears in record X must be the actual first name. The predicate equal(Nl,N2) as shown in Figure 3.3 is true when NI and N2 have the same values. We assume that given EFJC = noerr, FnameJC is independent of Sex. Thus, { 1 if equal(Aftiame, FnameJC) 0 otherwise If FnameJC (Afname) is observed, predicate equal implicitly partitions the values of Afname (FnameJC) into the observed value of FnameJC (Afname) and the other values. 53 E F _ X [lookup(Fname_X,female), 0.9] Figure 3.4: A decision tree representation of P(FnameJC]Afname A Sex A EFJC). Case 2: E F _ X = sle If the data-entry person who entered the values for record X mistyped a single letter in the first name (EFJC = sle), the first name that appears in record X must be one letter different from the actual name, but has the same number of letters. The predicate singlet(Nl,N2) as shown in Figure 3.3 is true when |iVl| = \N2\ and AQ is a single letter different than N2. The function prsing(FnameJC) computes the probability when EFJC = sle and singlet(FnameJC, Afname) = true. Function prsing (FnameJC) is computed as follows: prsing (FnameJC) = r— r F s y ' 25 x \FnameJC\ If FnameJC = "dave", prsing (FnameJC) = ^ We again assume that given EFJC = sle, FnameJC is independent of Sex, as this typing error does not depend on the sex. Given EFJC = sle, the probability mass of all the values of FnameJC that have the same number of letters as in Afname 5 4 and are a single letter different from Afname is 1. Thus, P (FnameJC\Afname A Sex A EFJC = sle) = [ prsing (FnameJC), 1 ] if singlet(Fname JC, Afname) 0 otherwise If FnameJC is observed, predicate singlet implicitly partitions the values of Afriame into two sets: one consists of those values of Afname that have the same number of letters as in the observed value of FnameJC and are a single letter differ-ent from the value of FnameJC; and the second consists of the rest of the values of Afriame. Case 3: E F _ X = ce If the data-entry person makes a copying error, we can consider that the value of FnameJC is distributed according to the distribution over names8. That is, FnameJC is independent of Afname given EFJC = ce. We can use male and female name files for the distribution over names. The predicate intable (FnameJC, male) is true when the value of FnameJC exists in the male name file. The function lookup(FnameJC, male) computes the probability when EFJC = ce and intable (FnameJC, male) = true. The func-tion lookup(FnameJC, male) computes the probability P(FnameJC) by looking in the male name file. If FnameJC is not in the male name file, we use function fnew(FnameJC), which returns the probability of a new name. The probability mass 8When a person makes a copying error there are more chances for a common name to be typed than an uncommon name, because a common name appears more frequently in the database than an uncommon name. 55 of all the male names that are in the male name file available from U.S. Census Bureau is 0.9. Also, the probability mass of all the female names that are in the female name file available from U.S. Census Bureau is 0.9. Thus, Sex = male P (Fname JC\Afname A Sex — male A EFJC = ce) = { [ lookup(FnameJC,female), 0.9 ] if intable(FnameJC, male) [fnew(FnameJC), 0.1 ] otherwise Sex = female P (FnameJC\Afname A Sex = female A EFJC = ce) — [ lookup(FnameJC, female), 0.9 ] if intable (Fname JC, female) [fnew(FnameJC),0A] otherwise 3.3.1 Input Requirements for Inference Let B be a Bayesian network whose CPDs are represented compactly using inten-sional and extensional definitions in the CPD language form discussed in Section 3.3. To perform inference in B, we assume that the "Input Condition" defined below is true (the reason for this condition is explained later in Section 3.4.4): Input Condition: Let Y be a variable with large domain. Suppose TY denotes the tree representation of P(Y\Pa(Y)). If 7> contains condition C(X, Y) such that X / { } , X C Pa(Y), Y must be observed. 56 To perform inference in B, user needs to provide some procedures (the rea-son for these procedures is also explained later in Section 3.4.4). Let Y be a min-imal non-empty, non-query, non-observed, set of large-domain variables of B that do not share a predicate with any non-observed variable outside of Y . Suppose X y is the set of children of elements of Y , i.e., X Y = {X | 3Y E Y, Y G Pa{X)} Let Y i , . , . , Yk be the variables in Y and Xx,..., X,„ be the variables in X Y -We define the following: involve(T, a,, Z): a function that returns a set of conditions of path a, of T that contain variable Z 6 Z, . notinvolve(T, a,, Z): a function that returns a set of conditions of path a, of T that do not contain variable Z G Z. SY'- a set of contexts computed as follows: SY = {c | c is CY U Cx, c is consistent } where, CV = notinvolve(jYXiaii Y ) U . . . U notinvolve(TYk, ak, Y) Cx ^= notinvolve(TXl, ft, Y ) U . .. U notinvolve(TXl„, /?,„, Y ) For each Y , for each context c G SY, user needs to provide the procedures defined as follows: . Let Pc be a set of predicates defined as follows: Pc = involve(TXl,pi,Y) U . . . U involve(Tx,„,pm,Y) 57 where pt is a path in TXj that is consistent with context c. . If Pc contains m predicates 5 i ( Y i , X i ) . . . Bm(Ym,Xm), where Y , - C Y , Xi € X y . Suppose B(Zi,..., Zj)0bs denotes the predicate fi(Zi,..., Zj) after each observed variable Z , is replaced by its value. Then, for each j = l,m there exist procedures for computing the following conditional probabilities: P {BJ OTj>Xj)obs = true\B^ (X/-!,*/-!)^ A . . . A Bj.x {YuX{)obs A Cond) where Cond = invoh>e(jYxiPn Y ) U . . . U involve(Tyk,Pk, Y) ,p,- is a path in 7Y(. that is consistent with context c. However, if P(Xi\Pa(Xj) A c A involve(TXnpi, Y ) ) is 0 for each value of X,-that satisfy condition involve(TXnpi,Y) (pt is consistent with c), then user does not need to provide the procedures for conditional probabilities that involve B 6 involve(TXi,Pi,Y). Example 3.3.3 Consider the Bayesian network shown in Figure 3.2. The variable Afriame has large domain (all first names) and does not share predicate with any non-observed variables. The variables FnameJC and FnameJ are observed vari-ables. The decision tree representations 71, 72, and T3 of P(FnameJC\Afname A EFJC A Sex), P(FnameJC\Afriame A EFJC A Sex), and P(Afname\Sex) respectively are shown in Figure 3.5. Suppose we have observed FnameJC = "david" and FnameJf = "davig". To perform inference in the network shown in Figure 3.2, user needs to provide some procedures. The set of contexts S for variable Afname has more than one elements. The context as given below is one of them : con = (Sex — male) A (EFJC = sle) A (EF.Y — sle) 58 EF X singlet(Afname,FnameJ<) S e ' x equal(Afname,Fname_X) [prsing(Fname_X), 1 (pathl) m a l e / \female intable(Fname_X,male) yes yes (path2) intable(Fname_X,female) ^ ' Q no [lookup(Fname_X,male),0.9] [fnew(Fname_X),0.1] [fnew(Fname_X),0.1] [lookup(Fname_X,female),0.9] T l EF_Y [prsing(Fname_Y),l] (pathl) singlet(Afname,Fname_Y) g* x y e > y n 0 mal intable(Fname_Y,male) . yes female (path2) yes intable(Fname_Y,female) no equal(Afname,Fname_Y) no [lookup(Fname_Y,male),0.9] [fnew(Fname_Y),0.1]^ T2 [fnew(Fname_Y),0.1] [lookup(Fname_Y,female),0.9] intable(Aftiame,male) >ntable(Afname,female) no [lookup(Fname_Y,male),0.9] (pathl) [fnew(Fname_Y),0.1] (path2) T3 [fnew(Fname_Y),0.1] [lookup(Fname_Y,male),0.9] Figure 3.5: The decision tree representations 71, 72, and 73 of conditional prob-abilities P(FnameJC\Afname A Sex A EFJC), P(FnameJ/\Afname A Sex A EFJS), and P(Afname\Sex) respectively. 59 Let us consider the procedures that need to be provided for variable Afname, for context con. The paths pathl, path2 in the tree representations TI, T2, and T3 as shown in Figure 3.5 are consistent with context con. involve(TAfname,pathl, Afname) : (intable(Afname,male) = yes) involve(TAfitame,path2, Afname) : (intable(Afname, male) = no) involve(Tpmme_x,pathl, Afname) : (singlet(Afname, Fname JC) = yes) involve(Tpmme_x,path2, Afname) : (singlet (Afname, Fname JC) = no) involve(TFname_y, pathl, Afriame) : (singlet (Afname, Fname JS) = yes) involve(TFname_y, path2, Afname) : (singlet (Afname, Fname JS) = no) For context con, set Pc contains two predicates as shown below (after conditioning on observations): Pc = {singlet(Afname,david),singlet(Afname,davig)} From tree representations TI and T2 we have the following: P(FnameJC\Aftiame A con A singlet(Afname, FnameJC) = no) = 0 P(Fname _7| Afname A con A singlet(Afname, Fname JS) = no) — 0 Thus, for context con, and condition Cond = (intable(Afname, male) — yes) user needs to provide the procedures for computing the conditional probabilities pi and p2 defined as follows: p i : P (sing let (Afname, david) = yes\Cond), and p 2 : P(singlet(Aftiame, davig) = yes\(singlet(Afname,david) = yes) A Cond) 60 The conditional probability pi can be computed as follows: The predicate singlet(Afname, david) is true for 125 values of Afriame. Out of these 125 values, only one value "davis" exists in male name file. The percentage frequency of name "davis" from male name file is 0 , 0 0 0 1 . As discussed before, the probability mass of all the names that exist in male name file is 0 .9 . Thus, P(singlet(Afname, david) — true\intable (Afname, male) = yes) = 0 . 0 0 0 1 / 0 . 9 The probability p2 can be computed as follows: As discussed in the computation of PI, the conditions (singlet(Afname, "david") = true) and (intable(Afriame, male) = true) hold only when Afname is "davis". This means Afriame can have only one value "davis". Now, the predicate singlet("davis", "davig") is true as "davis" is single letter apart from "davig". Thus, P(singlet(Afname, "davig") = true\singlet(Afname, "david") = yes A Cond) = 1 In the next section, we present an inference algorithm that using the above mentioned "Input Condition" and procedures can perform efficient inference in B. 3.4 Inference algorithm: Large Domain V E Let B be a Bayesian network with n discrete random variables, X — \XX,... ,Xn}. Suppose we want to answer some probabilistic query, P ( Q | E = e), where Q , E C X. Q denotes the query variables. We observed that E = e, e £ domain(E), E = e is the evidence. As discussed in Chapter 2, from B we can recursively prune any variable that has no children, and is not observed or queried [Geiger, Verma and Pearl, 1990; Pearl, 1988]. 61 In a Bayesian network that has variables with large domain if a query vari-able has only a few values, we consider computing the probability distribution over the values of the query variable. If a query variable has a large (or unbounded) number of values, we consider computing the probability of a particular subset (or value) from the domain of the query variable, and the subset can be described ex-* tensionally or iritensionally. If a query asks for the probability of a particular subset S of query variable Xb, we create an extra Boolean child Xa of Xb, which is true exactly when the query is true. That is, P(Xq = true\Xb) = lifXbeS = 0 otherwise The variable Xq has the same probability as the probability of Xb 6 S. We thus reduce the problem to one of computing the probability distribution over the values of a variable that has only a few values. Let Xi,:.. ,XS be the non-query random variables of B, and suppose Z,'s are ordered according to some elimination ordering. As discussed in Chapter 2, we can compute the query from a function/(Q), which can be computed from the conditioned joint probability distribution by summing out the variables X,'s in order. / ( Q ) = E - - - E n / , ( x ' i f l B W ) { E = . } X. A', i= l The subscripted probabilities have the same meaning as discussed in Chapter 2. As discussed in Section 3.1, we want to exploit the structure of the prob-lem to perform efficient inference in Bayesian networks that have variables with large domains. To compute the function / (Q) efficiently, our approach is based 62 on the fact that if the evidence or query does not need to distinguish between each value of a variable, we do not need to reason about each value of the variable sep-arately. Rather, for efficient inference we can treat those values that do not need to be distinguished as a single value. This will reduce the effective domain size of the variables. \ Definition 3.4.1 A partition of the domain of a variable X , denoted by part(X), is a collection of b1,..., bk of nonempty, mutually disjoint subsets of Val(X) such that Val(X) = uf=1&'. The sets b' are called blocks of the partition. Lemma 3.4.2 Let B be a Bayesian network with n discrete random variables, X = { X i , . . . , X „ } . Suppose there exists a partition of each non-observed variable X of B such that: V7 if parents of Y are X 1 ; . . . , X * and 6 part(Xi) for each i . If for context C = c and Vi v,-, w,- € bt, P{Y\X1 = vl...Xk = vkAC = c) = P (Y\Xi = w1...Xk = wk/\C = c) (3.4) then, n p( bs\Pa{Xx) A C = c) JJ P(X\Pa{Xi) A C = c) bs'epart(Xs) 1=1, i^.v Proof Summing over all the values of X , is equivalent to summing over the partition of the values of X v . Thus, n n £ n w M * / )AC = c ) = E i I p ( x ' i p a ( z ' ) A C = c ) Xs 1=1 • bs€part(Xs) Xs€bs i = l 63 To compute 2~lx,eb, i l"=i P{Xi\Pa(Xi) A C = c), from the product TJ"=i P {Xi\Pa (X,) A C = c) we can distribute out all of the conditional probabil-ities that do not involve Xs out of the sum. After distributing out the conditional probabilities, inside the sum ~52xsebs w e n a v e conditional probability of Xs given its parents and the conditional probabilities of X/s children. Suppose Xx,... ,Xp are the children of Xs. Then, inside the sum 2~2Xsebs w e n a v e t n e following terms: p J2 P(Xs\Pa(Xs) A C = c)Y[P(Xj\Pa(Xj) A-C = c) x,ci>, 7=i If equation (3.4) is true, all the children X/s of Xs have the same conditional probabilities for each valuer € bs, we can distribute the product \~[p=l P(Xj\Pa(Xj)A C = c) out of the sum, leaving the term 2~2x eb-P(Xs\Pa(Xs) A C = c), which is equal to P (bs\Pa (X,) A C = c). Thus, n J2l[P(Xi\Pa(Xi)AC = c) xs i = l n = J2 P(bs\Pa(Xs) AC = c) P(Xi\Pa(Xi) A C = c) bsepart(Xs) I'=1,I'^.S' When there is structure in the problem that allows us to partition the domain of variables in subsets, the above lemma provides a way for summing such vari-ables efficiently. If there is structure, given the observation and query, we need to construct such partitions dynamically. To perform inference in Bayesian networks whose CPDs are represented using the CPD language representation discussed in Section 3.3, we introduce an inference algorithm based on the variable elimination (VE) algorithm [Zhang and 64 Poole, 1994], which we call Large Domain VE. Large Domain V E partitions the domains of the large-domain variables dynamically based on the contexts. In V E , a factor is the unit of data used during computation. A factor is a function over a set of variables. The core of Large domain V E is the same as the V E algorithm. The V E algorithm requires three basic operations over factors: • conditioning on observations • multiplying factors • summing out a variable or variables from a factor We adapt the above three operations for Large Domain V E so that they can be applied to the richer representation of the factors. Initially, the factors represent the CPDs in CPD language form. After conditioning on the observations the factors are represented in factor language form. Suppose Xi,... ,Xn are the variables of a factor. The B N F of the factor language is shown in Figure 3.6. In the factor language representation number is a real constant between [0,1]; C(XX,... ,Xn) is a condition can be described intensionally or extensionally as discussed in Section 3.3. ( Cp ) ::= ( number ) | ( i f C(XU .. .,Xn) then ( Cp ) else (Cp)) Figure 3.6: A B N F grammar of factor language. A Large Domain V E factor is more complicated than a V E factor repre-sented as a table. In Large Domain V E , we represent the factors as sentences in 65 factor language. As discussed in Section 3.3, a sentence in CPD language can be seen as a decision tree shown in Figure 3.4. Thus, a sentence in factor language rep-resentation can also be seen as a decision tree, where a leaf has only one probability value. Definition 3.4.3 A factor that involves variables with large domain is termed a big factor if the number of conditions in its factor language representation is linear in the domain size of its any variable with large domain. For the intermediate factors that are created by marginalizing and multiply-ing factors, we partition the domains of large-domain variables dynamically. 3.4.1 Constraint Elimination Ordering The main message of the Lemma 3.4.2 is that to sum out a variable X with large domain, we need to compute the probability masses for the values of X that are in some sets. To compute the probability mass efficiently (without enumerating the values of X), we need procedures that can compute the probability of a set of values of X, conditioned on other sets of values of X (this is explained in Section 3.4.4 how we use these.). These procedures are provided by the user as part of the input as discussed in Section 3.3.1. When the parents of X have been summed out before X, a new partition of X will be created. In this case, when we sum out X, we need different procedures to compute the probability of a set of values of X conditioned on the newly created set of values of X. Since we cannot expect a user to anticipate these intermediate 66 Figure 3.7: A tree simplified by removal of redundant subtrees (triangles denote subtrees). partitions (new conditions), we assume that a user provides us with the local infor-mation. Because of this locality constraint, in Large Domain V E we assume that large-domain variables are summed out before their parents. 3.4.2 Operations Over Trees The operations of multiplying factors together and summing out a variable (or vari-ables) from a factor are based on two tree operations. In this section, we describe these two operations. Tree Pruning (simplification) Tree pruning is used to remove redundant interior nodes or redundant subtrees of the interior nodes of a tree. We prune branches that are incompatible with the ancestors in the tree. If the condition is of the form X € S and the ancestor specifies X 6 S', we can. carry out an intersection, S" = S D S', to determine the effective constraints. We can then prune any branch where the effective constraint is that a variable is a member of the empty set (which means S" is empty). For example, 67 consider the subtree of X G {1, 2} of the tree as shown in Figure 3.7 (a). Here, the ancestor specifies X £ {1,2} and one descendant in this subtree specifies X G {3}, the descendant can be pruned. Similarly, for the "else" case in this subtree, we can do set difference to determine the effective constraints. The effective constraint for the "else" case is X G ({1, 2} - {1} - {3}) = X G {2}. • The tree shown in Figure 3.7 (a) contains multiple interior nodes labeled X along a single branch. We can simplify it to produce a new tree in which the subtrees of the subsequent occurrence of X that are not feasible are removed. The new simplified tree is shown in Figure 3.7 (b). The algorithm for pruning a tree is shown in Figure 3.8. T is the tree to be pruned, and A,-„. and A,!0„>, denote the set of (var, set) pairs. A var-set pair (X, S) in Ain denotes that X G S, while (X, S) in Anotin denotes that X ^ S, X is an ancestor variable of T. We start with Prune(T, { } , { } ) . We traverse the tree T in a top-down manner. At each node, we determine the type of T. We test if T is of the form: (If C then Tl else T2 ). If so, we test if the condition C is of the form X G S. If C is of the form X G S, there are three cases: • If ( X, S' ) G A,„, we compute the intersection S" = S n S'. There are three cases: - If S" is empty, there is no "then" case, so the result is Prune(T2,Ain, Anotin). - If S'[ C S, there is no "else" case, so the result is Prune(Tl,Ain,Ano,in). - Otherwise, the result is a tree with condition C, if branch of the tree is Prune(Tl,Ain - {(X,S')}'\J {(X, S")} , Anolin) and the else branch is Prune(Tl, Ain - {(X,S')} U{(X,S'-S)}, Anotin). 68 Function Prune (T, A i n , A n o t i n ) returns a decision tree T ' Input: T is the sentence, A,„ and Anotin are sets of (var, set) pairs (for Ain var-set pair (var, set) denotes var € set. For A„„„„ var-set pair (var, set) denotes var 0 set). l : if T is of form (if C then TI else T2 ) then 2: if C is of form X £ S then 3: if (X,S') £ 4 then 4: Let S" = S n 5' 5: ifS" = {}then 6: return Prune(T2,AiluAnotin) 7: else if S' C S then 8: return Prune(T\,Ain,Anotin) 9: else 10: TV *- Prune(Tl,Ain - {(X, S')} U {(X, S")} ,Anotin) 11: T2' <- Prune(T2, Ain - {(X, S')} U {(X, 5' - 5)} , A/!0(/„) 12: return (if C then TV else 72' ) 13: end if 14: else if (X, S') 6 Ano„-„ then 15: Let 5" = 5 - S' 16: ifS" = {}then 17: return Prune(T2,Ain,Anolin) 18: else 19: TV <- / V i m e ( n , A w U {(X, 5")} , A)lor,-rt - {(X, S')}) 20: T2' ^ Prune(T2,Ain,Amnin - {(X,S')} U {(X,5' U S")} ) 21: return ( i f C then 7T else T2! ) 22: end if 23: else 24: return ( i f C then Prune ( TI, A,„ U {(X, 5)} , Anolin ) 25: : elsePrM^(7;2,A, ; !,A, 1„ (,„U{(X )S)}) ) 26: endif 27: else 28: return ( i f C then Prune(T\, Ain, A,mil,) else Prune(T2, A,„, Anotin) ) 29: end if 30: else % T is not of the form: ( i f C then TI else T2 ) 31: return T 32: end if Figure 3.8: Algorithm for pruning the decision tree T. 69 • If ( X, S') e Anotin, we compute the set difference S" — S — S'. There are two cases: - If'S" is empty, the result is Prune(T2, Ain, Ano,in). - Otherwise, the result is a tree with condition C, if branch of the tree is Prune(Tl,Ain U {(X,S")},Alwlin - {(X,S'}}) and the else branch is Prune{Tl, Ain, Anotm - {(X, S>)} U {(X, S' U S")}). • If X is not assigned in Ain and Anmin (which means that X didn't appear exten-sionally in the ancestors of T in the tree), the result is a tree with condition C, if branch of the tree is Prune(Tl,Ain U {(X, S)} , Ano,in)) and the else branch \sPrune(Tl,Ain,AnotinU{(X,S)}). For any variable X, (X, S) appears at most once in Ain and Anotin, Note that X itself only appears either in set A m or in Aiwlin, and only once (see line 10,11,19,20, 24,25 and 28 in Figure 3.8, where a var-set pair (X, S) is assigned either to Ain or Anotin\ If C is not of the form X € S (which means that C is an intensional defini-tion), the result is a tree with condition C, if branch of the tree is Prune(Tl, Ain, A„„„-„) and the else branch is Prune(Tl, Ain, Ann,in). The correctness of the algorithm does not depend on whether we do com-plete pruning. We haven't consider checking for compatibility of intensional repre-sentations (which may require some theorem proving); whether the algorithm can be more efficient with such operations is still an open question. 70 Merging Trees In VE, we need to multiply factors and sum out a variable from a factor. Both of these operations in Large Domain V E are built upon the merging trees operation. Two trees TI and T2 can be merged using operation Op to form a single tree that makes all the distinctions made in any of TI and T2, and with Op applied to the leaves. When we merge TI and T2, we replace the leaves of tree TI by the structure of tree T2. The new leaves of the merged tree are labeled with the function, Op(pl,p2), where pi and p2 are the labels of the leaf in TI and-T2, respectively. We write merge2 (TI, 72, Op) to denote the resulting tree. If the labels of the leaves are numbers, the leaf value of the new merged tree is evaluated while merging the trees. If the leaf labels are intensional functions, one of the choices is when to evaluate the intensional function. When to evaluate the intentional functions can be considered as a secondary optimization problem. We always apply the pruning operation to the merged tree. For example, Figure 3.9 shows tree T2 being merged to tree TI with the addition (+) operator being applied. When we merge two trees and the Op is a multiplication operator (x) then if the value at any leaf of TI is zero, we keep that leaf of TI unchanged in the merged tree. We do not put the structure of T2 at that leaf (as shown in Figure 3.10). The algorithm merge2(Tl, T2, Op) for merging two trees TI and T2 is shown in Figure 3.11. merge2(Tl, T2, Op) calls mergeT(Tl, 72, op) for merging trees 71 and 72. After merging two trees 71 and 72, merge2 calls Prune function to remove the redundant nodes or subtrees from the merged tree. 71 Figure 3.9: Merging trees 71 and TI. The leaf labels are combined using the plus function merge2(Tl, 72, +). 4 10 14 Figure 3.10: Merging trees TI and T2. The leaf labels are combined using the multiplication function merge2(Tl, T2, x ) . In mergeT we traverse the tree TI in a top-down manner. At each node we determine the type of TI. We test if TI is of the form: ( i f C then 7 ' else T"). If so, the result is a tree with condition C, i f branch of the tree is merge!\T'', 72, Op) and the else branch is mergeT(T", T2, Op). If TI is a leaf node (which means that TI is a number or function), there are two cases: • If TI is zero and Op is x , the result is T I . • Otherwise, we call ReplaceLeaves(T2, TI, Op) to replace each leaf p of T2 with op{p, TI). The result is the updated tree 72. The function ReplaceLeaves(T\, T2, Op) as shown in Figure 3.11 replaces each leaf p of TI with Op(p, 72), where 72 is a number or function. We traverse 72 F u n c t i o n merge2 ( T l , T2, Op) returns a merged decision tree T ' Input: 7 1 , 7 2 , the root of the decision tree, Op, the operation Tp «- mergeT(Tl,T2,Op) 7 «— Prune(Tp, {} , {}) return 7 F u n c t i o n m e r g e T ( T l , T2, Op) returns a merged decision tree T ' Input: 7 1 , 7 2 , the root of the decision tree, Op, the operation if 7 1 is of form: ( i f C then 7 ' else T" ) then return ( i f C then merge!\T', 7 2 , Op) else mergeT(T", 7 2 , Op) ) else if ( 7 1 = = . 0 ) A (Op == x) then return T l else return ReplaceLeaves(T2,Tl,Op) end if F u n c t i o n Rep laceLeaves ( T l , T2, Op) returns a decision tree T ' Input: 7 1 , 72,'the root of the decision tree, Op, the operation if 7 1 is number then N <- simplify 7 1 Op 7 2 return N else % 7 1 is of form: ( i f C then T else T" ) return ( i f C then ReplaceLeaves(V, 7 2 , Op) else ReplaceLeaves(T", 7 2 , 0/?) ) end if Figure 3 .11 : Algorithm for merging trees 7 1 and 7 2 with operation Op. 73 the tree 71 in a top-down manner. At each node we determine the type of TI. If TI is a number or function, the result is Op(T\, T2) (if both TI and T2 are numbers, Op(Tl, T2) is evaluated). Otherwise, TI is of the form: (if C then T' else T"), the result is a tree with condition C, if branch of the tree is ReplaceLeaves(T', T2, Op) and the else branch is ReplaceLeaves(T", T2, Op). We can extend the merge2 operator to a set of trees. Let Ts be a set of trees, Ts = {TQ} ..., Tn}. We can define merge(Ts, Op), where Op is an associative and commutative operator, as follows. We choose a total order of the set, and carry out the following recursive procedure: merge({T0,... ,Tn},Op) = To ifra = 0 merge2(merge({T0,r„_i}, Op), 7,,, Op) if n > 0 3.4.3 Conditioning on Observations When we observe the values taken by certain variables, we need to incorporate the observation into the factors. Given a tree T, a set of observed variables Xj, Xj € Vh/(Xj), let r X J = X J denotes the conditioned tree, i.e., tree T after conditioning onXj = X J . To compute the conditioned tree T X J = X J we traverse T in a top-down man-ner. If a node in T splits on a function of observed variables X k € Xj, the observed values Xk of Xk are incorporated by replacing the occurrences of Xk by their ob-served values xk. To simplify the tree, we compute those conditions (or predicates) that do not contain any variables. However, the computation of the intensional rep-74 resentations that involve variables is delayed until we sum out those variables. Example 3.4.4 Consider a factor/ corresponds to CPD P(FnameJC\Afnamef\Sexf\ EFJC) as shown in Figure 3.12. Suppose we observe FnameJC - "david". Tree r/vM,»e_x="david" i s s h o w n i n Figure 3.13. EF_X [ lookup(Fname_X,female), 0.9J Figure 3.12: A decision tree representation, T, of a factor corresponding to CPD P(FnameJC]Afname A Sex A EFJC). E F _ X (sle) ^ - ^ ^ T ^ - ^ ^ s l s e sii 1/125 0 0.02363 Pnew Figure 3.13: Conditioned tree TF'"""e-x= david ^ j e ^ t r e e j a f t e r conditioning on FnameJC — "david". The computation of predicates equal and singlet is delayed until we sum out the variable Afiiame. However, to simplify the tree after conditioning, the predicates intable("david", male) and intable("dav\d",female) are computed. The subtree at 75 node intable("dav\d", male) is replaced by the value of lookup("dav\d", male), which is 0.02363, as david appears in the male name file. The subtree at node intable("david", female) is replaced by the probability of a new name, Pnew, as david does not appear in the female name file. The function prsing(FnameJC) is computed, and replaced by 1/125. The predicate equal gives us the possible value for Afname which is equal to david. That is, in the context of EFJC = noerr, we are implicitly partitioning Afname into {david} and all of the other names. Similarly, for EFJC = sle, we are implicitly partitioning the values of Afname into those names that are a single letter apart from david, and all of the other names. The algorithm for conditioning on observations is shown in Figure 3.14. T is the tree to be conditioned, A is the set of {var, set) pairs, Z is the random variable such that T represents the conditional probability of Z. A var-set pair (X, s) in A denotes that X is an observed variable with value s. We traverse the tree T in a top-down manner. At each node, we determine the type of T. We test i fT is of the form: ( i f C then TI else T2 ). If so, there are three cases: • C is of the form Y € S and (Y, s') G A. We compute the intersection 5 " = S n {Y}. There are three cases: - if S". = {}, there is no "then" case, so the result is CondObs(T2, A). - if {s1} C S, there is no "else" case, so the result is CondObs(Tl,A). - Otherwise, we replace 5 in C by S" and the result is a tree with new condition C, if branch of the tree is CondObs(Tl,A) and the else branch 76 F u n c t i o n C o n d O b s (T,A, Z) r e t u r n s a conditioned decision tree T ' I n p u t : T is the sentence, A is a set of (var, s) pairs ( var is an observed variable and s is its observed value.) Z is a random variable and T represents the conditional probability of Z . 1: i f T is of form: ( i f C then Tl else T2 ) t h e n 2: i f C is of form Y G S and (Y, s') G A t h e n 3: Let S" = S n {s!} 4 : i f S " = = { } t h e n 5: r e t u r n CondObs(T2, A) 6: • e l s e i f {s'} C 5 t h e n . 7: r e t u r n CondObs(Tl, A) 8: e l s e 9: C <— replace 5 in C by 5 " 10: r e t u r n ( i f C then CondObs(Tl, A) else CondObs(T2, A) ) 11: e n d i f 12: e l s e i f C is of intensional form, C(Xi,..., Xn) t h e n 13: C <— instantiate each X, of C with s, such that (X,, 5,) G A 14: i f C does not have any variable t h e n 15: Val <— compute predicate C 16: i f Viz/ is true t h e n 17: r e t u r n CondObs(Tl,A) 18: e l s e 19: r e t u r n CondObs(T2,A) 20: e n d i f 21: e n d i f 22: e n d i f 23: r e t u r n ( i f C then CondObs(Tl,A) else CondObs{T2,A) ) 24: e l s e i f T is of the form ( number ) t h e n 25: r e t u r n T 26: e l s e % T is of the form (p x, number) 27: i f Z is not an observed variable t h e n 28: r e t u r n number 29: e l s e 30: p'x <— instantiate X of with s such that (X, s) G A 31: N <— call procedure to compute p'x 32: r e t u r n N 33: e n d i f 34: e n d i f Figure 3.14: Algorithm for conditioning the decision tree T on observations. 77 is CondObs(T2,A). • i f C is a predicate (intensional), we instantiate each observed variable of C with its observed value. If C does not have any variables, we compute C ; i f C is true, the result is CondObs(Tl,A); otherwise the result is CondObs(T2,A). • Otherwise, the result is a tree with condition C , i f branch of the tree is CondObs(Tl,A) and the else branch is CondObs(T2,A). If T is ainumber, we return T. Otherwise, T is of the form (px, number). If Z is not an observed variable, we return number. Otherwise, we have two cases: • If px is not a function, we return px. • Otherwise, we instantiate the observed variables of px with their observed values, compute the function and return it. 3.4.4 E l i m i n a t i n g V a r i a b l e s In variable elimination, to eliminate a variable Y, we multiply all of the factors that contain Y, then sum out Y from the resulting factor. When a variable has a very large domain, we want to sum out all of those values that are in one partition in a single step. Let fY be the factor that involves Y and its ancestors, a n d / i , . . . ,fk be the other factors that involve Y. Suppose part(Y) denotes the partition of the values of Y such that the children of Y have the same conditional probability for each value in block b' € part(Y). Suppose pj denotes factor fj under condition b'. L e t / be the 78 resulting factor after summing out Y. Then, / = ^ / y x/! x ... xfk Y H€part{Y) Yeb> bjepart(Y) Y£b! To compute the resulting factor/ in an efficient manner, we consider that the sum J2Y£bifY should be computed without enumerating the values of Y. To make this possible, the input condition as discussed in Section 3.3.1 should be true, and is why it is included. ; Before discussing how the domain of Y is partitioned dynamically, the fol-lowing example illustrates why we need to sum out more than one variable at a time. Motivation for Summing Out Multiple Variables Together Example 3.4.5 Consider the Bayesian network B shown in Figure 3.15 (a). The variables Y and Z are Boolean variables. The variables X and W have large do-mains, suppose X and W both have discrete integer values 1... 1000. The tree representations of P(X), P(W), and P(Y\X A W) are shown in Figure 3.15 (b). The predicate lesseq300(X) means X < 300, lesseq500(W) means W < 500, and the predicate lessthan(W,X) means W < X. Suppose X and W are uniformly dis-tributed in regions X < 300, X > 300, W < 500, and W > 500. The leaf values in the tree representation of P(X) and P(W) represent the probability masses of all those values of X and W that reach to that leaf. Suppose 79 0.8 0.2 0.3 0.7 Tree representation of P(Y | X , W) fb) Figure 3.15: (a) A simple Bayesian network, (b) The decision tree representations of conditional probabilities P'X), P(W), and P{Y\W A X). we have observed the value of Z and suppose Y is the query variable. To compute the query, suppose we first want to sum out X from B. To sum out X, we need to multiply P(Y\X.A W) and P(X). Let/(W, Y) be the resulting factor after summing out X. Then, f(W,Y) = ] T P ( X ) x P(Y\WAX) x Note that P(Y\W A X) has same distribution over the values of Y for both cases: W < X and W > X. Thus, f(W,Y) = P(Y\WAX)W<X £ P(X) ' X:W<X + P(Y\WAX)W>X P(X) (3.5) X:W>X where P (Y\W A.X)C denotes P(Y\W A X) under condition C. 80 To compute the f a c t o r / ( W , Y), we need to compute the terms YLxw<xP(X) and 2~2x-w>x P(X) as shown in equation (3.5). Note that to compute 2~2x-w<xp(^) we need to enumerate the values o f W because the set o f values o f X that satisfy W < X is different for each value value o f W. We do not want to enumerate the values o f W, and we therefore want to delay the computation o f terms 2~2w<x P(X) unt i l we sum out W, in Which case we are effectively summing out W and X simultaneously. L e t / ( y ) be the resulting factor after we sum out together X and W. Then, f(Y) = J^J2P(Y\WAX)xP(W)xP(X) x w = P(Y\WAX)W<X J2 P(W)xP(X) X,W:W<X + P(Y\WAX)W>X P(W)xP(X) X,W:W>X From the tree representations o f X and W, i t is evident that X and W do not have uni form distribution over their values. This means to compute the terms such as Y^x,w-.w<xP(x) x • p ( ^ ) > t n e values of X and W need to be further parti t ioned. Thus, / ( F ) = P(Y\WAX)W<X J2 P(X)xP(W) , X,W:X<300,W<50Q,W<X + P(Y\WAX)W>X Y P(X)xP{W) X,W:A'<300,W<500,H'>A' + P(Y\WAX)W<X J2 P(X)xP(W) X,W:X<300,W>50O,W<X + P(Y\WAX)W>X Y P(X)xP(W) X,W:X<300;W>5O0,W<X + P(Y\WAX)W<X Y P(X)xP(W) A',W:X>300,U'<500,H/<A' + P{Y\WAX)W>X Y • P(X)xP(W) I X,W:X>3Q0,W<W0,W>X 81 + P{Y\W/\X)W<X P{X)xP(W) A-,W:X>300,W>500,W<X + P(Y\WAX)W>X £ P(X)xP(W) (3.6) X,W:X>300,W>500,W>X To compute factor f(Y), we need the probability mass of X and W under various conditions as shown in equation 3.6. For example, consider the computation of probability mass of X and W under condition X < 300, W < 500, W < X, which can be calculated as follows: E P(X)xP(W) X,W:X<300,W<500,W<X = P{W < X\X < 300 AW< 500) x P(X) x P ( W ) A',U':X<300,W<500 = P{W < X\X- < 300 /\W< 500) x ^ P ( X ) x ^ P ( W ) ' • X:X<300 W:W<500 From the tree representation of P(X) and P(W) shown in Figure 3.15 (b), we have the following: J2x:x<3oop(x) = ° - 8 ' a n d J2w-.w<50op(w) = ° - 7 - N o w > knowing the conditional probability P(W < X\X < 300 A W < 500), we can compute 2~2x,W:X<300,W<500,W<XP(X) X P{W). Figure 3.16 shows the partitions of the values of X and W. The area of the triangle ABC excluding the values that are on line AC (W — X) represents the number of values of W and X that satisfy the conditions W < X, X < 300, and W < 500. As discussed before, X and W both have uniform distribution in the region X < 300 and W < 500. Thus, P(W < X\X < 300 A W < 500) = (0.5 x 300 x 300 - 300)/(300 x 500)) As discussed in Section 3.3.1, we assume that there exists procedures that provide us the probability such as P(W < X\X < 300 A W < 500) for computing : 82 I-1000 X 300 B 1 500 1000 i w Figure 3.16: The partitions of the values of W and X. the new factor/(7). By summing out X and W together we can avoid creating a big factor. To sum out the variables efficiently (without enumerating the values), we sum out together all those variables whose partitions depend on each other. Algorithm Let Y be the set of variables to be summed out. Y is a minimal non-empty set of variables that do not share a predicate with a variable outside of Y . Suppose Y i , . . . , Yk are the variables in Y . Since the parents of Y have not been summed out yet, there will be k factors fYl,... ,fYk, where fYj corresponds to P(Yi\Pa(Y^)), and some other factors, s ay / i , . . . , /„ that involve Y € Y . Thus, there are total n + k factors that involve variables Y . Suppose TYl,..., TYk denote the tree representation o f . . .fYk and Ti,..., Tn denote the tree representations o f / i , . . . ,/„. In Large Domain V E to multiply the factors we can use the "Merging Trees" operation to 83 multiply the corresponding trees. Suppose T represents the product of TYl,..., TYk, 7 i , . . . , Tn. A path in tree T corresponds to the blocks (partitions) of the summing variables and their parents. Suppose CondP'denotes all the conditions of path P in T. The conditions in CondP can be divided into two sets: CondPp: conditions that also appear in trees TYl,..., TYk Condf"w": conditions that also appear in trees Ti,...,T„ Note that he conditions in Condf'w" represent the partitions of the domains of only summing variables, while Cond"Pp represents the partitions of the domains of both summing variables and their parents. Suppose (fj)c denotes factor fj under condition C. Then, the leaf value lP of path P is a number which represents: n Y\ifj)condft»" X fYl X ' ' ' X ^ j=l Y-.Cond';1,' The values of Y represented by CondP are the subset of the values repre-sented by ConddPow". Thus, (Jj)CondP = ifj)c<mdf«"-To sum out Y from T, for each path P we need to compute the term Uj=i(fi)c<mdr YtY-.condrfri x • • • x A* t hat we call the evidenced leaf value of path P and can be computed as follows: u Ylifj)condr>< 53 X • • • X / r t y'=l Y-.Condp n = n(fj)condp«» * fYiX...XfYk J=l Y-.Concf^ACondp"" n = P(Condfwn\ConduPp) x IJOSOc^ x ^ fYl x . . . x fYk ] V'=l Y-.Condf 8 4 = P(ConddPown\Cond"Pp) x lP where lP is the leaf value of path P. Thus, we can compute the term lY}=1(fj)cond,. Y^Y-.condJn x • • • x /r* b y m u l " tiplying the leaf value lp of path P by the probability P(ConddPown\CondPp). We term the probability P^ondf^CondJ) evidence factor of path P. Suppose ConddPown has n (condition, value) pairs, b\, .. • ,bn, in some order. Then, P(ConddPown\CondPp) = P(b\ A . . . A bn\ConduP) n = JJ P(bi\bi-i A . . . A 61 A Conr fp P ) 1=1 We assume that when bt is of the form (predi, true), there exists procedures that compute the probabilities A . . . A bx A Cond"P) (this is the input re-quirement as discussed in Section 3.3.1). If bi is of the form (predi, false), suppose b'i denotes (predi, true). Then, P(bi\bi-i A . . . A bi A Condf) = 1 - P ( # | & / _ i A . . . A by A C o « 4 P ) To sum out Y from tree T, at each leaf we need to compute the evidenced leaf value. Once we have the evidenced leaf value at each leaf, we need to sum the subtrees that correspond to different blocks for a partition of some K, 6 Y . We need to do this for every context (i.e., for every assignment of ancestors). Thus, summing out Y from T involves two steps: 1. computing the evidenced leaf value; and 2. summing the subtrees that corresponds to different partition of T, e Y . The algorithm Sum(T, Y , Contextup, Contextdown) that combines both com-puting the evidenced leaf value and summing the subtrees steps is shown in Figure 3.17. Y is the set of variables that needs to be summed out from T, Contextup and 85 / Func t i on Sum(7, Y , Context up, Contextdown) re turns a tree I n p u t : 7, a node of the decision tree, Y , the summing variables Contextup,Context(irm,„ : set of (condition, value) pairs. 1: i f 7 is of the form: ( i f C then 71 else T2) then 2: i f C contains some Y e Y then 3: i f C is in P(Y\Pa(Y)), F G Y then 4: TV <— iSwm.(Tl, Y , Contextup + { ( C , true)} , Contextdown) 5: 72' <— Sum(T2, Y , Contextup + {(Cj/a/^e)} , Contextdown) 6: else 7: 71' <— Swm (71, Y , Contextup, Contextdown + { ( C , frwe)}) 8: 72' <— Sum(T2, Y, Contextup, Contextdown + {(C,false)}) 9: e n d i f 10: r e t u r n merge ({TV, 72'} ,+) 11: else 12: 71' Sum (71, Y , ContextUp, Contextdown) 13: 72' <— Sum (72, Y , Contextup, Contextdown) 14: r e t u r n ( i f C then TV else 72' ) 15: end i f 16: else % 7 is a leaf node 17: Let LT be the value at leaf node 7 18: i f LT == 0 then 19: r e t u r n LT 20: else 21: evfactor <— Compute-EviJFactor(Contextup, Contextdown) 22: r e t u r n LT x evfactor 23: end i f 24: end i f Figure 3.17: Algorithm for summing out variables Y from tree 7. 8 6 I Function Compute_Evi_Factor(Contort^, Contextd(mn) returns a number • Input: Contextup, Contextdown- set of (condition, value) pairs. Preconditions: There exists procedures for computing the probabilities P(bi\bi-\ A ... b\ A ContextUp). 1: Let by... b'n be the (condition, value) in Contextdown in some order 2: evfactor <—] 1 3: for all 6,- do % we need to compute A . . . A b\ A Contextup) 4: if bi is of the form (cond, false) then 5: <— (cond, true) 6: p,„ < — Call procedure to compute P(£J|Z?,_i A . . . A &i A Contextup) 1: evfactor <— evfactor x (1 — p m ) 8: else 9: Pm Call procedure to compute A . . . b\ A Contextup) 10: evfactor <— evfactor x p,„ 11: end if 12: end for 13: return evfactor Figure 3.18: Function Compute_Evi-Factor called by algorithm Sum. Contextdown are set of (condition, value) pairs that are above T in the tree. Contextup records those conditions that appear in P(Yj\Pa(Yj)) Yi € Y , while Contextdown records those conditions that do not appear in P(T,|Pa(F,)) T, € Y . In particular, ContextUp and Contextdown track the predicates (or conditions) and their values for computing the evidence factors of paths in T. We start with Sum(T, Y , { } , { }). We traverse T in a top-down manner. At each node, we determine the type of T. If T is of the form: ( i f C then TI else T2 ), there are two cases: • If condition C contains some Y e Y , we sum out Y from TI and T2. Let TI ' and T2' be the trees after summing out Y from TI and T2 respectively. There are two cases: 87 - If the condition C also appears in P(Yj\Pa(Yj)), F, £ Y . Then, T l ' = sum(Tl, Y, Contextlip + {(C, true)} , Contextdown) T2' — sum(T2, Y, Contextup + {(C,false)} , Contextdown) - Otherwise, T l ' = sum(Tl,Y, ContextUp, Contextdown + {(C, true)}) T2' = sum(T2, Y, Contextup, Contextdown + {(C,false)}) We merge TV and T2' using plus operator, and the merged tree is the resulting tree after eliminating Y . • Otherwise, we return ( i f C then Tx else T2 ) where, Ti = Sum(T\, Y, Contextltp, Contextdown) T2 = Suni(T2, Y, Contextup, Contextdown) Otherwise, T is a leaf node. Let LT be the leaf value at T. If LT is zero, we return LT. Otherwise, we compute the evidence factor at T. We call function Compute-Evi-Factor(Contextup, Context0,iier) for computing the evidence factor at T. Let evfactor be the evidence factor return by the function Compute-Evi factor. Then, the evidenced leaf value of T is LT x evfactor. 3.4.5 Computing the Posterior Distribution The procedure for computing the posterior distribution in Large Domain V E is the same as in V E . To compute the posterior distribution, we first condition on the 88 observed variables and then sum out all non-query variables one by one. We can compute the posterior distribution by multiplying the remaining factors and nor-malizing the remaining factor. The Large Domain V E algorithm is summarized in Figure 3.19. 3.4.6 Complexity of Large Domain VE Let B be a Bayesian network with large-domain variables made up of N discrete random variables. Let d be the maximum domain size of a variable in B. Large Domain V E partitions the domains of large-domain variables dynamically during the inference and sum out all of the values that are in one block in a single step. Definition 3.4.6 Let X be a random variable in B. The induced domain size of X is the number of blocks (disjoint subsets) in the partition of the values of X, created by Large Domain V E during the inference. If X is a small-domain variable, the induced domain size of X is same as its domain size. Definition 3.4.7 The induced treewidth of a graph for a specified elimination or-dering is the maximum number of variables in an intermediate factor (minus 1). Definition 3.4.8 The treewidth of a graph is the minimum induced treewidth over "all elimination ordering. Definition 3.4.9 The constrained treewidth of a graph is the minimum induced treewidth over all elimination ordering that satisfy the constraint. As discussed in Chapter 2, the complexity of V E on B is 0(Ndw), where W is the treewidth of the graph. Let b be the maximum induced domain size of 89 Function LargeDomainVE(H, Q, E, e) Input: B: Bayesian network, Q: query variables, E : observed variables, e: values of E , e G Val(E) Output: A factor over variables Q 1: Let F be the factors corresponding to original CPDs 2: Let Ts be the trees corresponding to F 3: Replace each T e Ts that involves some E ; e E with r E i = e i . 4: Let Z be the variables of B 5: Z ' <- Z - Q 6: while Z' is not empty do 7: select Y G Z ' such that Y is a minimal non-empty set of variables that do not share a predicate with a variable outside of Y 8: Ts <— Eliminate(Y, Ts) 9: Z ' < - Z ' - Y 10: end while 11: return Normalize(Ts, Q) Function Eliminate(Y, 7$) 1: partition 7$ into: 2: { 7 i , r „ , } that do not involve Y 3: {T,„+i... Tk} are those that do involve Y 4: TM *- merge({Tm+1, ...,Tk},x) 5: Ts^Sum(TM,Y, {},{}) 6: return { r 1 ; . . . , 7 m , Ts} Function Normalize ({Ti,..., Tr} , X) l : T < - m e r g e d , . . . , 7 V } , x ) 2: ^ ^ S u m ( r , X , {} ,{}) 3: rt-ReplaceLeaves(T',JV,-r) 4: return 7 Figure 3.19: The Large Domain V E Algorithm. 90 a variable in B. Suppose Wc is the constrained treewidth of the graph. Then, the complexity of Large Domain V E on B is 0(NbWc). Because of the constrained elimination ordering, the constrained graph treewidth Wc could be larger than graph treewidth W. However, the induced domain size of a large-domain variable can be significantly smaller than its domain size. 3.5 Application of Large Domain V E to Person Iden-tifi cation In this section we discuss how to apply Large Domain V E to the Person Identifica-tion problem. Let X and Y be two records, which refer to the people to be compared. As discussed in Section 3.2.2, to make the decision we need to condition on the ob-servations and marginalize over the unobserved variables in the Bayesian networks shown in Figures 3.1 and 3.2. Consider the elimination of variable Afname from Bayesian network as shown in Figure 3.2. To eliminate variable Afname we need to multiply all the factors that contain variable Afname. Factors that contain variable Afname are the following: • fl (FnameJC, EFJC, Sex, Afname) • f2 (FnameJY, EF JY, Sex, Afname) and • / 3 (Afname, Sex) As shown in Figure 3.20, T l , T2 and T3 are the decision tree representa-tions of factors/1, /2, and/3 respectively. Suppose for records X and Y we have 91 observed FnameJC = "david" and Fname JS = "davig". After conditioning on the observations 71 ^ « '»^="dav id" a n d r 2 fn«m e _ r = " d a v i g " ^ s n o w n m F i g u r e 3.21. After multiplying T \ F n a m e - x = W l d " , T2Fmme-Y="i*sw'l%", and T3 we get a new fac-tor/ (EFJC, EFJS, Sex, Afname). A part of the tree representation T off is shown in Figure 3.21. After multiplying the factors we want to sum out the variable Afname from factor/. That is, we need to sum out Afname from tree T. After we sum out the variable Afname f rom/ we get a new factor/' (EFJC, EFJS, Sex). A part of the tree representation V of new factor/' is shown in Figure 3.22. In the next section, we discuss how the leaf p' of V as shown in Figure 3.22 is computed by Large Domain V E . 3.5.1 Evaluation of p' The leaf value p' of V as shown in Figure 3.22 is the sum of the the evidenced leaf values of p i and p2 of T as shown in Figure 3.22. Let pi' and p2' be the evidenced leaf values corresponding to leaf valuespl andp2 respectively. Then,/?' = pl'+p2'. Let us first consider the computation of the evidenced leaf value pi'. The evidenced leaf value pi' is computed by multiplying the leaf value pi with the evidence factor evfactorpi corresponding to pi. As shown in Figure 3.22, there are six conditions in the path from root to leaf pi: CI : (Sex = male), C2 : (intable(Afname, male) = true), C3 : (EFJC = sle), C4 : (singlet(Afname, "david") = true), C5 : (EFJS = sle), and C6 : (sing let (Afname, "davig") = true) 92 EF X singlet(Afname,Fname_X) s * x equal(Afname,Fname_X) [prsing(Fname_X), I ] yes i ntable(Fname_X .female) no [lookup(Fname_X,male),0.9] [fnew(Fname_X),0.1 ] T l [fnew(Fname_X),0.1] [lookup(Fname_X,female),0.9] EF Y singlet(Afname,Fname_Y) g* x [prsing(Fname_Y), 1 ] , female equa!(Afname,Fname_Y) yes intable(Fname_Y,female) yes [lookup(Fname_Y,male),0.9] [fnew(Fname_Y),0.1 ] T 2 [fnew(Fname_Y),0.1] [lookup(Fname_Y,female),0.9] male female [lookup(Fname_Y,male),0.9] intable(Afname,male) inlable(A'fname,female) yes / \ n ° yes\ [fnew(Fname_Y),0.1] [lookup(Fname_Y,male),0.9] T3 [fnew(Fname_Y),0.1] Figure 3.20: The decision tree representations, 71, T2, and 73, of f ac to r s / l , / 2 , and/3 respectively. 93 I s l e I EF_X Sex .else ( m a l e ) / \ { female) singlet(Afname,"david") equal(Afname,"david") intable(Afname,male) jntable(Afname,female) ^ \ n o 1/125 Fnamex = david T l Sex e q u a l (Af i ) a m e , " d a v i g " ) 1 1 0 I m a l e } / " ^ \ { f e m a l e ) y e s / singlet(Afname,"davig") yes / \ / o P l ^ e w Pnew 1/125 Fnamey = davig T2 T3 merge({Tl,T2,T3),*) intable(Afname,male) no EF_X , i s l e ) / ^ singlet(Afname,"david") \ ' singlet(Afname,"david") ' . Isle)/ \ singlet(Afname,"davig") \ l c e | 0 . 0 0 0 0 5 7 6 \ Q '. singlet(Afname,"davig") y^/A no 0 . 0 0 0 0 0 6 4 \ Figure 3.21: A decision tree representation, T, of new factor / after multiplying trees T l , T2, and 7*3. 94 S e x pi =0.0000576 p2 = 0.0000064 Figure 3.22: A decision tree representation, V, of new factor/' after summing out Afname from tree T. Thus , evfactorp\ = P(singlet(Afname, "david") = true\intable(Afname,male) = true) xP(singlet(Afname, "davig") = true\C4 A C2) We call the procedures that compute the conditional probabilities required in the computation of evfactorpX. These procedures are provided by the user as a part of the input. As discussed in Section 3.3.1: P(singlet(Afname, "david") = true\intable(Afiiame, male) = true) — 0.0001/0.9 and P(singlet(Afname, "davig") = true\singlet(Afname, "david") = true A C2 = 1 Thus, pi' =pl x 0.0001/0.9 95 Let us now consider the computation of evidenced leaf value, p2'. As shown in Figure 3 .22 , there are six conditions in the path from root to leaf p2: CI' : (Sex = male), C2' : (intable(Afname, male) = false), C 3 ' : (EFJC = sle), CA' : (sing let (Afname, "david") = true), Ch' : (EF.Y = sle), and C& : (singlet(Afname, "davig") = true). Thus, evfactorp2 — P (sing let (Afname, "david") = true\intable(Afname, male) = false) xP (sing let (Afname, "davig") = true\C4' A C2') The probability P(smg/e/(A/hame, "david") = true\intable(Afname,male) = false) can be computed as follows: The predicate singlet(Afriame, "david") is true for 125 values of Afname. As discussed before only one value "davis" out of these values exist in male name file. That is, there are 124 values that are single letter apart from david and are not in male name file. As discussed before, the probability mass of all the names that are not in male name file is 0 . 1 . Suppose Pnew is the probability (a very small number) of a new name that is not in male name file and single letter apart from "david". Then, 124 x P P(singlet(Afname, "david") = true\intable(Afname, male) = false) = n e w The probability P(singlet(Afriame, "davig") = true\C4' A C 2 ' ) can be com-puted as follows: To count efficiently the number of values of Afriame that are single letter apart from both david and davig, we can first generate the patterns of names 96 that are a single letter apart from david. For example, lavid, where ? is any letter except d. After generating these patterns we test which of these patterns makes the predicate singlet(Afname, davig) = yes. Here, the pattern davil satisfies both pred-icates singlet (Afname, david) and singlet(Afname, davig) i f (? 7^ dAl 7^ g). Thus, the possible number of values for Afname is 24 that are a single letter apart from both david and davig9. Out of these 24 values of Afname we have already found that one value exist in male name file (during the computation of pi'). Thus, there are only 23 values of Afname that are single letter apart from david and davig and are not in male name file. Thus, P(singlet(Afiiame, "davig") = true\singlet(C4' A C2') = 2 3 / 1 2 4 and p2' = P2 x ( 2 3 / 1 2 4 ) x (124 x Pnew/0.l) 3.5.2 Experiment In this section we describe an experiment which we designed to test that by mod-elling attribute dependence gives better results over the traditional approach which consider the attributes independence. To test the proposed approach, as real databases are confidential, we model a reasonably realistic distribution of attribute values by modelling the people in a set of households and model, for example, how twins are born. We model a small town of 1500 households. We generate the population of the town. The generated population was intended to be a good model of real world population. Persons living in the same household have the same address and phone 9As there are 26 letters. 97 number. The probability that a single person lives in a house is 0.4. The probability that a person is living with a partner is 0.6. For a single person there is a 30% chance of having one child. The chances for the subsequent child is 10%. For each birth there is a 3% chance that twins will be born. The probability that partners have the same last name is 0.5. For partners there is a 70% chance of having one child. The chances for a subsequent child is 30%. When both partners have different last names then the probability that the child will have mother's or father's last name is the same. For each birth there is a 3% chance that twins wil l be born. Each record of the population contain seven fields: social security number, first name, last name, gender, date of birth, phone , number, and postal code. Personal first names and last names are chosen according to the distribution from U.S. census file10. After creating the true population, we made two datasets, DA and DB. To create DA we randomly took 600 records from the true population. We corrupt the records using the database generator of Hernandez and Stolfo [Hernandez and Stolfo, 1995], using typographical errors and movement into the true record. The ty-pographical errors introduced by the generator occur with relative frequency known from previous research on spelling correction algorithms [Peterson, 1986; Pollock and Zamora, 1987]. We place these corrupted records in dataset DA. Similarly, we made the database DB but we took 1500 records from the true population. We compared each record of dataset DA with each record of dataset DB. There were 900, 000 comparisons in total. In these comparisons there were only 227 duplicate cases. We compute the likelihood ratio considering both attribute dependence and lohttp://www.census.gov/genealogy/names/ 98 independence. 0.95 0.75 0.6 -#- with dependence with independence Recall Figure 3.23: Recall versus precision for both attribute dependence and attribute independence. After computing the likelihood ratio between all pairs of records, we set the deciding threshold equal to the maximum of maximum likelihood ratio from both cases. The pair of records with likelihood ratio greater than the deciding threshold were taken as duplicates. We compute the precision and recall. j^of Correctly Identified Duplicate Pairs Precision = Recall — j^of Identified Duplicate pairs #of Correctly Identified Duplicate Pairs # of True Duplicate pairs We reduce the deciding threshold with a step of 1 until the deciding threshold is equal to the minimum of minimum likelihood ratio from both cases. For each value of threshold we compute the precision and recall for both cases. 99 As more pairs with lower similarity are labeled as duplicates, recall in-creases, while precision begins to decrease. Figure 3.23 shows the precision versus recall for both cases. The resulting recall/precision curve shows that with attribute dependence the precision of the prediction is 95% with 100% recall, while with attribute independence precision is 70% for 100% recall. Also, with attribute de-pendence the 100% accuracy is achieved with more coverage than attribute inde-pendence. Results of the experiment show that the performance is improved when attribute dependence is considered over attribute independence for person identifi-cation problem. These experiments are not possible without the inference algorithm in this chapter. 3.6 Conclusion In this chapter we have argued for making inference in Bayesian networks that con-tain random variables with large or even unbounded domains. To make the infer-ence efficient we exploit the structure in the problem. The main idea is to partition the domain of variables in equivalence classes that have the same conditional prob-ability for particular values (or particular subset) of other random variables. These equivalence classes can be described extensionally and intensionally. To represent the conditional probability distributions in a compact manner we propose a CPD language that allows us to represent the CPDs using intensional and extensional definitions. We present an inference algorithm, Large Domain VE, which is based on the variable elimination algorithm, that exploit the structure for efficient infer-100 ence. Large Domain V E partitions the domains of the variables dynamically during the inference based on what is observed and what is queried. 101 Chapter 4 Inference with Hierarchically Structured Variables 4.1 Introduction In the previous chapter, we provide a representation and an inference algorithm for Bayesian networks that have discrete random variables with large domains. To handle the large number of values, we consider that the conditional probabilities can be represented compactly using both intensional (in terms of procedures) and extensional (by listing the values) definitions. The domains of the variables are par-titioned at run time depending on what is observed and what is queried. In many problem domains there is a priori structure on the values of a random variable. In particular, the large number of values of a variable is represented a priori as an abstraction hierarchy (or an is-a hierarchy or tree hierarchy) [Pearl, 1986; Mack-worth, Mulder and Havens, 1985]. An abstraction hierarchy is a tree where the 102 nodes represent concepts and the nodes at the top are less specific than the nodes at the bottom. In abstraction hierarchies, information from high-level concepts is inherited by more specific concepts. We hope that the large number of values of a variable can be managed efficiently by considering a hierarchical structure on its values. The large number of values of many discrete random variables can be repre-sented naturally in a tree hierarchy form. For example, consider a random variable living Jhings that describes the types of species on the earth. The values of the living Jhings can be represented into a tree hierarchy according to Linnaean taxon-omy 1 . Living things are divided into kingdoms (e.g., plantae, animalia), classes (e.g., mammals, birds, fish), all the way down to species. We call such discrete ran-dom variables hierarchically structured variables (or hierarchical variables). We call discrete random variables that are not hierarchical simple variables. The reason we consider hierarchically structured variables is they typically have a good deal of structure; there are some features that are common to higher-level concepts in the hierarchy. For example, birds have feathers and lay eggs. The feature information needs to be stored only at the highest possible level of abstraction, thus achieving i economy of storage. There is very limited work on combining abstraction hierarchies and prob-abilistic reasoning. Pearl [1986] considered the problem of evidential reasoning in a taxonomic hierarchy of hypotheses using probability theory. Pearl presents an approach for updating probabilities in a taxonomic hierarchy given the observa-tion. However, he considered a very restricted, naive Bayes classifier network form, xhttp : I/www.wikipedia.org/wiki/Linnaeanlaxonomy 103 where there is only one hierarchical variable representing the hypotheses. Pearl's approach cannot be used for making inference in a general Bayesian network set-ting. We can use existing inference algorithm for making inference in Bayesian networks with hierarchical variables. However, the basic difficulty is that existing algorithms [Lauritzen and Spiegelhalter, 1988; Zhang and Poole, 1994] flatten the domain of such variables, removing the hierarchical structure and treating them as variables that have as domains all the leaf values. This flattening has several important adverse implications. From a representation point of view, i f the domain of the variable is very large, the tabular representation of the conditional probability distribution is very big. For example, consider a hierarchical variable that represents the living things on the earth. After flattening, this variable has millions of species as its values. As an another example, consider a hierarchical variable location that represents the location of an object on the earth. After flattening, the domain of location is unmanageable because it can, for example, take each square centimeter of the earth as a value. When there is a lot of structure because of the hierarchies (e.g., the fea-tures depend on the classes, not on leaves) flattening introduces duplicate entries in the tabular representation and we lose the good properties that we were hoping to achieve from the hierarchically structured values. In addition to the representational inadequacies of Bayesian networks for such variables, there is also the problem of inference. As discussed in Chapter 2, the cost of exact inference in Bayesian net-works is exponential in tree width, where the domain size is the base of the expo-104 nent. The hierarchical structure between the values of a variable can be exploited to make the representation compact and reasoning efficient. In this chapter, we propose an approach for making efficient inference in Bayesian networks that have both simple and hierarchical variables. We begin in Section 4.3 by defining hierarchically structured variables. To represent the hi-erarchical variables we try to exploit any structure they may have. The structure can be represented in tree hierarchies by considering that there are some features that are common to higher-level concepts. In Section 4.5, we discuss our approach for representing the CPDs of a Bayesian network that has hierarchical variables in a compact manner. We represent the distribution of the hierarchical variables by specifying, for each class, the probability distribution over its immediate subclasses. To represent the conditional probability of a variable conditioned on a hierarchical variable, we use inheritance. That is, we associate the probabilities with the classes (group of values) rather than with each value of the variable. In Section 4.8, we present an approach for reasoning in Bayesian networks with hierarchically structured variables that dynamically constructs an abstract Bayes network, given some evidence and a query, by collapsing the hierarchies to include only those values necessary to answer the query. This can be done with a single pass over the network. We can answer the query from the abstract Bayesian network us-ing any standard probabilistic inference algorithm, such as variable elimination or stochastic simulation. The domain size of the variables in the abstract Bayesian net-work is independent of the size of the hierarchies; rather it depends on how many of the classes in the hierarchies are supported directly by the evidence or relevant 105 to the query. The proposed approach is therefore applicable even when the hierar-chies are conceptually infinite. In Section 4.10, we present some empirical results that show that hierarchically structured values of a variable can make the reasoning more efficient than reasoning over the set of all values. 4.2 Related Work Abstraction hierarchies and semantic nets have been used in artificial intelligence since at least 1968 [Quillian, 1968] for representing concepts and categorical knowl-edge. The early work that combined tree hierarchies and uncertainty was motivated by the need for diagnosing medical domains that are hierarchical in nature. In a hi-erarchical hypothesis space, the evidence can appear at any level of abstraction. The earlier rule-based expert systems (e.g., M Y C I N ) were not capable of combining the evidence that are at different levels of abstraction in a consistent way [Buchanan and Shortliffe, 1984]. Motivated by the inadequacy of expert systems for handling evidence that is at different levels of abstraction, Gordon and Shortliffe [1985] first studied the prob-lem of combining evidence in a hierarchical hypothesis space using the Dempster-Shafer theory. They developed an approximate algorithm for computing the D-S belief functions on tree hierarchies. Shenoy and Shafer [1986] improved the algo-rithm proposed by Gordon and Shortliffe and showed that belief functions can be propagated using local computations. • Pearl [1986] proposed a Bayesian approach for combining evidence in a hi-erarchical hypothesis space. Let H be a finite set of mutually exclusive and exhaus-106 tive hypotheses, H = {hi,h2,..., hn}. The subsets of the values of H are organized in a tree hierarchy. That is, each node in the tree represents a set of values of H. H is the root of the hierarchy and /z,'s are the leaves in the tree hierarchy. Each leaf hi is assigned with a measure of belief BEL(hi), denoting the prior probability of hi. The belief of the intermediate nodes in the hierarchy can be computed from the sum of the belief of its children in a recursive manner. Suppose we have a new evidence, e, that affects one of the node /V in the hierarchy but does not say anything about the descendants of N in the hierarchy. Pearl's algorithm [Pearl, 1986] for computing the updated belief of every hypothesis in the tree hierarchy consists of three main steps: • Estimation: Let S be the hypothesis set, represented by N, that is affected by evidence e. We compute the likelihood ratio Xs that determines the degree to which the evidence confirms or disconfirms S. • Weight distribution: Each hypothesis h: EH is assigned a weight Wi, as follows: • Belief updating : In this step, the updated belief of each hypothesis h-, G H, denoted by BEL'(hi), is computed as follows: A . , = P{e\S) P(e\^S) BEL'(hi) = P(hi\e ) - asWi BEL(hj) where as is a normalizing factor: as = BEL(ln) 107 The updated belief of the intermediate nodes in the tree can be computed from the sum of the belief of its children in a recursive way. Pearl considered a naive Bayes classifier network form where there is only one hierarchical variable. We consider a general Bayesian network setting that can have both hierarchical and simple variables. We consider both parents and children of hierarchical variables, which can themselves be simple or hierarchical variables. In more recent work, Koller and Pfeffer [1998] described a representation language for combining frame-representation systems and Bayesian networks to represent complex structured domains. Their framework uses is-a and part-of hier-archies in an object-oriented way. The work by Koller and Pfeffer is different than ours in that they look at different aspects of the problem, concentrate on multiple objects, and combine first order logic representation with probabilities. In contrast, we are working with propositional logic. We consider the case where the domain of a random variable is hierarchically structured. Our work wil l allow inheritance in a single node of a Bayesian network. We will exploit inheritance to represent the conditional probabilities in a compact manner and in the probabilistic inference algorithm. Future work can combine these two ideas. 4.3 Hierarchical Variables In this section, we introduce the notion of hierarchical variables. The hierarchi-cally structured values for a variable define an abstraction hierarchy for mutually exclusive subsets of the values of the variable [Pearl, 1986]. The mutual exclusivity implies that the abstraction hierarchy forms a tree - we do not allow the situation 108 where a node is a direct descendant of more than one node. Note that there is also no need for the leaves - the tree could be infinitely big (i.e., no need to a priori discretize an hypothesis space). Thus, it is possible to model hierarchies that are infinitely detailed (e.g., spatial hierarchies). Definition 4.3.1 In a hierarchical variable, the subsets of the values of the vari-able are represented hierarchically in a tree. The nodes in the tree represent subsets of the values of the variable, and satisfy the following conditions: 1. The root of the tree represents the set of all the values of the variable. 2. The children of a node correspond to a partitioning of the set of values of the node. That is, the subsets represented by the children of a node in the tree are mutually disjoint and any node in the tree represents the union of its children in the tree. In the field of Semantic web, the ontologies representations (eg., OWL, RDFs) are used to represent hierarchical knowledge. Classes in ontologies are hierarchically organised through is-a (part-of or subclass-oj) relationship, with a partial-order relation. Using the same terminology as in OWL, we refer to the tree hierarchy of a hierarchical variable as a class hierarchy and the nodes in the tree as a class. A class represents the set of values of the variable. If a class Cj in the class hierarchy is a child of another class C, in the hier-archy, written as child(Cj, C,), we say that Cj is an immediate subclass of C,. The subclass relation, written as <, is the transitive closure of the immediate subclass relation; if C, < C,, we say that C, is a subclass of C,. The inverse of subclass is superclass. 109 l i v ing t l i i ng platypus reptile cat b ; ' ' A / \ * penguin sparrow K A Figure 4.1: Part of a class hierarchy of living things (L). The conjunction of two classes C, and Q , written as Cj A Ck, denotes the set intersection of sets represented by C, and Q . This is empty unless one is a superclass of the other. If Cj A C* is empty, we say that C, and Ck are disjoint. The set difference of two classes Q and Ck, denoted by Q — Ck, represents the set of all the values that are in the set represented by Q but not in the set represented by Ck. For example, suppose L is a hierarchical variable that represents the living things on the earth. In the class hierarchy of L we may have the root class livingthing that represents all the values of L. We can have plant, and animal as the immediate subclasses of livingthing. The class animal in turn has the immediate subclasses mammal, bird, fish, etc. A part of the class hierarchy of L is shown in Figure 4.1. 4.3.1 Hierarchical Variable and Distribution Intuitively, a hierarchical variable can be considered as a simple variable that has a large number of values. It may seem natural to define the probability distribution for a hierarchical variable as it is defined for a simple variable. Unfortunately, if there 110 are infinitely many values, the probability of any value is typically zero. Instead, we need to define a probability measure over the set of values of a hierarchical variable. A probability measure assigns a probability to a set of values rather than to individual values. Let X be a random variable and S be the set of all possible values that X can take. Let A be a a algebra over the subset of 5. A cr algebra over a set S, denoted by A, is a non-empty collection of subsets of S that is closed under countable unions and formation of complements. Each set E G A is a measurable event. Definition 4.3.2 A probability measure P on (S, A) is a function that satifies the following axioms: • P(A) > 0 for all A G A . P(S) = 1 • P(A U B) = P(A) + P(B) for all A, B G A and A n B = 0 Let H be a hierarchical variable and S be the set of values of H. If, for example, S is finite the power set of S, V(5), is a a algebra over 5: Then, from definition 4.3.2, the probability of each element E G V(S) is between [0,1], 0 < P(E) < 1. The sets denoted by the classes in the hierarchy, closed under countable unions and formation of complements, can form a sigma algebra over which we can have a probability distribution. I l l 4.4 Bayesian Networks with Hierarchical Variables In a Bayesian network that has hierarchical variables, we consider that a hierarchical variable can have both hierarchical and simple variables as its parents and children. We assume that the class hierarchies of the hierarchical variables are fixed, i.e., do not change with the context. Figure 4.2 shows a simple example of a Bayesian network that has both sim-ple and hierarchical variables. The variables Flying and Lay-Eggs are binary vari-ables with values true and false. The simple variable Season represents the seasons on earth. We consider that Season has only two values: (northern hemisphere) sum-mer and winter. The variables L and R are hierarchical variables. The hierarchical variable L represents living things on the earth. The class hierarchy of L is shown in Figure 4.1. The hierarchical variable R represents regions on the earth. Figure 4.2 (b) shows part of the hierarchy of R. At the root, all possible values of R are grouped together into the class earth. Below this there are two classes - land and ocean. The class land is further subdivided into northern hemisphere, equatorial, and southern hemisphere. Further, each hemisphere can be divided into temperate and polar and so on. The distribution of living things on the earth changes with the season, for example in summer there are more insects than in winter. Similarly, the distribution of living things is different in different regions of the earth (e.g., in the ocean we mostly have fish), so the parents of L are Season and R. An important aspect of any probabilistic representation is the support for inference; having made some observations, how to condition on these observations and compute the posterior probability distributions. As discussed in Section 4.2, 112 (a) Part of tree hierarchy of the variable R (b) Figure 4.2: (a) A Bayesian network with simple and hierarchical variables (b) A part of the class hierarchy of R. we cannot use Pearl's algorithm [Pearl, 1986] for making inference in Bayesian networks with hierarchical variables. Existing inference algorithms are not capable of exploiting hierarchical structure; we need to "flatten" the hierarchical variables by removing the hierarchical structure, and treat them as simple variables that have as their domain all of the leaf values. That is, we lose all of the good properties that we were hoping to achieve from the hierarchically structured values. In this chapter, we look at two related problems with hierarchical variables in Bayesian networks. The first is the compact representation of conditional proba-bility distributions. The second is how to exploit that representation in probabilistic inference for computational gain. 113 4.5 Specifying the Conditional Probability Distribu-tion In this section, we discuss how to represent the conditional probability distribution in a compact manner that utilises the structure provided by the class hierarchy. There are two main issues to be considered to represent the CPDs of Bayesian networks with hierarchical variables: CI: specifying the conditional probability for hierarchical variables; C2: specifying how variables are conditioned on hierarchical parents. The simplest case of CI is how to specify the prior probability of a hierarchi-cal variable. The simplest case of C2 is how to specify the conditional probability of a simple variable conditioned on a hierarchical variable. The more complicated cases are built from these two cases. First, we discuss how we can represent prob-ability distribution for these two cases. Then, we present the general case using the representations developed for these two cases. 4.5.1 Specifying the Prior Probability Distribution for a Hierar-chical Variable The prior probability distribution for a hierarchical variable can be represented in a bottom-up manner, as discussed in the work of Pearl [Pearl, 1986]. In this approach the leaves in the class hierarchy are assigned with the probabilities. Knowing this, 114 we can compute the probability of any class ( Q ) in the class hierarchy by sum-ming the probabilities of the immediate subclasses of that class (C*) in a recursive manner. Thus, CJ: child(Cj,Ci,) This approach assumes that we can enumerate the probabilities for the leaves. However, if the hierarchies are very big (or even infinite) it may not be practically or theoretically possible to enumerate the leaves. For example, the class hierarchy of a hierarchical variable that describes the location of a person in the world could be infinitely detailed; we may not a priori choose a lowest level resolution. To al-low for the compact specification of any class that does not depend on the lowest level details, we consider a top-down approach for specifying the prior probability distribution of a hierarchical variable. In our approach, for each class in the class hierarchy of the variable that has subclasses, we specify a probability distribution over its immediate subclasses; this is the probability of a class given its immediate superclass in the class hierarchy. The top-down approach is a more natural and elegant way of defining the prior J probability distribution, and does not assume that we can enumerate the leaves. More formally, Definition 4.5.1 Let H be a hierarchical variable. The prior probability distribution for H is defined as follows: each link representing child(Cj, Q ) (which means Q is the child of Q ) in the class hierarchy of H is associated with a value that gives the conditional probability P ( Q | Q ) (i.e., the conditional probability that value Cj 115 holds, given that value Ck holds) such that: Cy. child(Cj,Ck) Example 4.5.2 Suppose we want to specify the prior probability distribution over the hierarchical variable L, the class hierarchy of L is shown in Figure 4.1. The root of the hierarchy is the class livingthing. Suppose the class livingthing has only two children. We specify the probability distribution of livingthing over its immediate subclasses: P(animal\livingthing) = 0.4 P(plant\livingthing) = 0.6 We can specify the probability distribution over the immediate subclasses of animal. The class animal has many other children that are not shown in Figure 4.1. Suppose we have as part of this distribution: P(mammal\animal) = 0.3 P(bird\animal) = 0.2, etc. We can specify the probability of an immediate subclass of mammal given mammal, with probabilities such as: P(bat\mammal) = 0.1 In this approach, the prior probability of any class can be computed in a recursive manner by multiplying the probability of class given its immediate superclass and the probability of its immediate superclass. The probability of the root class is 1, since it represents all the values. 116 Example 4.5.3 Given the probabilities as in Example 4.5.2, P(bat) can be com-puted as follows: P(bat) = P(bat\mammal) x P(mammal\animal) x P(animal\livingthing) x P(livingthing) = 0.1 x 0.3 x 0.4 x 1 = 0.012 Computing the probability of a class involves a series of multiplications. The num-ber of multiplications is proportional to the depth of the class, so the cost of comput-ing the probability of a class is proportional to the depth of the class. We therefore have the following: Proposition 4.5.4 The time to compute the probability of any class Ck in the class hierarchy is 0(d), where d is the depth of Q in the hierarchy. 4.5.2 Hierarchical Variable I s the Parent of a Simple Variable Suppose that H is a hierarchical variable, F is a simple variable, and H is the only parent of F. To exploit the hierarchical structure, we assume that the influence on F is local in the hierarchy of H, so that, generally, related values of H have the same distribution over F. We assume that the probability distribution over F given a class in the hierarchy of H is either specified or can be inherited from a superclass of that class in the hierarchy. Using this convention, we do not need to specify the probability distribution over F for each class in the hierarchy. 117 To specify the conditional probability distribution P(F\H) in a compact man-ner, we use the notion of a default probability distribution. The underlying idea is that the default probability distribution of F for class Cj, written as Pd(F\Cj),.\s assumed to be inherited by all subclasses of Cj, unless overridden by a more specific default probability distribution. More formally, Definition 4.5.5 The conditional probability distribution P(F\H) is specified by specifying the default distribution Pd(F\Cj) for some classes Cj in the class hier-archy of H such that there is at least one Cj along every path from the root. That is, every path down the tree must contain a default (Pd) distribution. Note that this condition holds trivially if the Pd value is defined for the root class. These classes Cj are called exceptional classes of H with respect to F, denoted by e". Note that different Fs may have different exceptional classes. The conditional probability of F for any class Ck of H, P(F\Ck), can be computed as follows: • if there is no C G €p such that C is a strict subclass of Ck, P(F\Ck) is inherited from the default probability distribution of F for the class Cj G e", such that Cj is the lowest exceptional superclass of Ck, P{F\Ck) = Pd(F\Cj) • otherwise, P(F\Ck) can be derived from Q's immediate subclasses: P(F\Ck) = ^ P(F\Q) x P(C-IQ) C,-: cliild(Ci,Ck) 118 Associating the default probability distributions with exceptional classes is a compact representation over defining the distribution of F for each value of H, when many of the subclasses share the same conditional probability distribution. Example 4.5.6 Consider representing the conditional probability distribution P(Flying in the Bayesian network as shown in Figure 4.2 (a). To represent P(Flying\L), we need to define the default probability distributions over Flying for the exceptional classes of L. For example, we can state that living things have a low probability of flying by default, but bird and insect are exceptional because they have a high probability of flying. From the children of class mammal, a bat is exceptional be-cause it has a high probability of flying. From the children of class bird, a penguin is exceptional because it has a very low probability of flying. Thus, to represent P(flying\L) we could have 2 : Pci(Flying = true\livingthing) = 0.00001 P<i(Flying = true\bird) = 0.5 P<i(Flying = true\bat) = 0.3 Pd(Flying = true\penguin) = 0.00001 Pci(Flying = true\insect) = 0.4 From this we can infer: • P[Flying — true\sparrow) = 0.5, as it inherits this value from bird • P(Flying — true\platypus) = 0.00001, as it inherits this value from living-thing 2 We use a very small probability rather than zero for specifying the default probability of Flying for living things that do not fty. This is because 0 means that ifying is impossible. Similarly, we do not have 1.0 for living things that fy. This is to indicate that individuals could be exceptional. 119 • P(Flying = true\penguin) = 0:00001, as it inherits this value from penguin We can compute P(Flying\bird) from the probabilities of its exceptional sub-classes as follows: P{Flying\bird) = £ P{Flying\C) x P(C\bird) Ce { -*penguinkbird,penguin) = P(Flying\-^penguink.bird) x P[^penguinhbird\bird) +P(Flying\penguin) x P(penguin\bird) = P(i(Flying\bird) x (1 — P (penguin\bird)) +Pd{Flying\penguin) x P{penguin\bird) = 0.5 x (1 - P(penguin\bird)) + 0.00001 x P{penguin\bird) = 0.5 x (1 - 0.01)+ 0.00001 x 0.01 = 0.495 Note that P(Flying = true\bird) ^ 0.5 as the class bird contains penguin, which have a much lower probability of flying. The use of "inheritance" with abstraction hierarchies is similar to the use of "context specific independence" (CSI) for the compact representation of the CPTs in Bayesian networks [Boutilier et al., 1996]. In CSI, the CPTs are partitioned into contexts in which the conditional probabilities are equal, and can be represented compactly using decision trees. With inheritance, the CPDs are represented by specifying the default probability distributions for some of the classes. The default probability distribution for a class is inherited by all its subclasses, unless overrid-den by a more specific default probability distribution. 120 4.5.3 The General Case The general case is when a random variable (hierarchical or simple) can have any number and any kind of variables as parents. To represent the conditional probabil-ity distribution of a hierarchical variable H conditioned on its parents, we assume that each class in the hierarchy of H has a probability distribution over its immedi-ate subclasses that is conditioned on (some of) the parents of H. We can treat each of these classes as a simple variable. Thus, the problem of representing a (condi-tional) probability distribution over a hierarchical variable reduces to the problem of representing a (conditional) probability distribution over simple variables. Let F be a simple variable that has both simple and hierarchical parents. Suppose Pa"(F) denotes its simple parents and Pah(F) denotes its hierarchical par-ents. To represent the conditional probability distribution of F given its parents, we introduce the notion of parent context. D e f i n i t i o n 4.5.7 Let Fbe a simple variable that has n hierarchical parents Hx,... ,Hn A p a r e n t c o n t e x t , denoted by X, is an assignment of a class to each hierarchical parent of F. Thus, X=(cl,...,4) where c'x is a class in the class hierarchy of //,. Note that if a hierarchical variable is not relevant for some context, this is equivalent to having the root of that hierarchical variable in the parent context. D e f i n i t i o n 4.5.8 A parent context X is a s u p e r p a r e n t c o n t e x t of parent context y, written as y < X, if Vi c'y < c'x. X is a p r o p e r s u p e r p a r e n t c o n t e x t of parent context y, written as y < X, iff y < X and y ^ X (i.e., 3i c\ < c'x). 121 The conditional probability of F given its parents, P (F\pas (F) Apa h (F)), is specified by specifying the default probability distribution over F, Pd(F\Pas(F) — x A X), for each assignment x of Pas(F), for some given parent contexts X. These parent contexts X are called default parent contexts of F. The parent contexts X are such that there is at least one default parent context along every path from the root in tree(Hx) x . . . x tree(Hn), where tree(Hi) represents the class hierarchy of Hi. Definition 4.5.9 A parent context hierarchy, denoted by Cp, is a partial ordered set of default parent contexts, Cp = ( Zp,< ), where Zp is the set of default parent contexts and < is a partial order over Zp. Example 4.5.10 Consider the Bayesian network as shown in Figure 4.2. The prob-ability distribution of class animal of L over its immediate subclasses can be con-ditioned on some of its parents. The distribution of class animal changes with the values of Season and R, because in ocean we mostly have fish; in the northern hemisphere we have more insects in summer than in winter; and, in the polar region birds migrate in winter to a warmer area. Suppose the distribution of class animal in the equatorial region is indepen-dent of season, then for parent context X = (equatorial) we could have a default distribution such as: Pd (reptile \ animal A equatorial) = 0.1 Pd(insect\animal A equatorial) = 0.4 Pd(mammal\animal A equatorial) = 0.2 122 Pd{fish\animal A equatorial) = 0.1 Pd(bird\animal A equatorial) = 0.2 In the polar regions, the distribution of class animal over its children de-pends on the season, so for parent context X = (N-polar) we could have a default distribution such as: Pd(reptile\animal A Season = summer A N-polar) = 0.01 Pd(mammal\animal A Season = summer A N-polar) = 0.04 Pd(insect\animal A Season = summer A N-polar) = 0.6 Pd{fish\animal A Season = summer A N-polar) = 0.05 Pd(birds\animal A Season = summer A N-polar) = 0.2 When a simple variable F has more than one hierarchical parent, we cannot always compute the conditional probability distribution of F given a parent context X in the same way as discussed in Section 4.5.2, because there could be a problem of multiple inheritance. This arises when conditional distribution over F for parent context X can be inherited from more than one superparent contexts of X in context hierarchy Cp of F. Example 4.5.11 Suppose simple variable Z has two hierarchical parents P and Q, as shown in Figure 4.3 (a). The class hierarchies for P and Q are shown in Figure 4.3 (b). Suppose that in order to define P(Z\P A Q) we have defined the default dis-tribution over Z for the parent contexts (p, q), (pn, q2\) and (p2i, <7ii)(- The parent context hierarchy Cp of the given parent contexts is shown in Figure 4.3 (c). In this case, for the context (p2i, 921) the default distribution over Z is not defined, it can be 123 (pll ,q21) (p21 ,qll) (c) Figure 4.3: (a) A Bayesian network with simple and hierarchical variables, (b) Part of the class hierarchies for variables P and Q. (c) parent context hierarchy of the given parent contexts that define P(Z\P A Q). 124 inherited from the two most specific superparent contexts, (pn, q2i) and (p2i, qn), which might specify different distributions. The above example illustrates the situation where the problem of multiple inheritance can arise. Rather than trying to define a way of combining different distributions, we explicitly disallow this. We assume that there is only one most specific superparent context. If there is more than one most specific compatible superparent context, with different default probability distributions, there must be a default distribution that disambiguates the two default distributions. We are thus forcing the user to disambiguate cases in which there would otherwise be a problem of multiple inheritance. The conditional distribution over F for a parent context X is computed as follows: • i f X does not have any strict subparent contexts in the parent context hierar-chy Cp of F, P(F\X) is inherited from the default probability distribution of F for the superparent context Z of X in Cp that is not overridden by a more specific default probability distribution of F for parent context y such that X < y < Z. Thus, P(F\X) = Pd(F\Z) • otherwise, P(F\X) can be derived from the children of X in Cp n P(F\X) = P(F|Z)n^(<l4) Z : child{Z,X) 1=1 125 We have shown how to represent the conditional probability distribution for a Bayesian network that has hierarchical variables in a compact manner. In order to make the inference in a hierarchical Bayesian network, we first discuss what is an observation and query in a Bayesian network that has hierarchical variables. 4.6 Observation in Bayesian Networks with Hierar-chical Variables A n observation of a simple variable is an instantiation of the variable. That is, the variable is assigned one value from its domain. However, for a hierarchical variable an observation might support a single value or a set of the values. For example, i f a hierarchical variable denotes the possible species of living things, the evidence might support a single species emperor penguin, or might support a set of species such as penguin or ^penguin hbird, which corresponds to disjunction of its element. We assume that the observation for a hierarchical variable is specified in terms of the classes in its hierarchy. In order to support a set of values that is not defined as a class, evidence can support a class and refute particular subclasses of the class. We term evidence that supports a class a positive observation and evidence that refutes particular values a negative observation. More formally, Definition 4.6.1 Let H be a hierarchical variable. An observation about H is pos-itive when we observe that H is in C„bs, where C()bs is a class in the hierarchy of H. This means that H has a value vh such that v,, e V, where V i s a set of values referred by C0bs. 126 Definition 4.6.2 Let H be a hierarchical variable. An observation about H is neg-ative when we observe that H is ->C0bs- This means that H has a value vh such that v/, 0 V,where V is a set of values referred by Cabs. Without loss of generality, we can assume that there is always one positive observation about H. For example, suppose we have two positive observations C I and C2 about H. This means that v/,'. £ Vc, where Vc is a set of values referred by C I A C2. We have two cases: • If C I and C2 are disjoint, C I A C2 = 0 . This means vh G 0 . That is, the observation about H is not consistent. • Classes C I and C2 are not disjoint, i.e., one is a superclass of another. Sup-pose C I is a subclass of C2, C I A C2 = C I . Thus, vh € Vc i , where Va is a set of values referred by C I . It is sufficient to assume that there is always one positive observation about H, denoted by Cpns. If there are only negative observations about H, we assume root class is the positive observation. There can be multiple negative observations about H that are descendants of Cpos. Definition 4.6.3 A n allowable observation in a Bayesian network that has both simple and hierarchical variables is a conjuction of sentences of the form: • X = xi, where X is a simple variable and jc,- € Val(X). • H is in Cpos and not in { C \ e g , Cknes }, where Cpos is the positive observation about H and C 1 , . . . , C* are the negative observations about H. 127 A class C in the hierarchy of H is true for the given observations, if the value v„ that H can have is in the set represented by C. A class C in the hierarchy of H is false for the given observations, if the value v/, that H can have is not in the set represented by C. Given the observations about a hierarchical variable H, we can divide its class hierarchy into three regions: Rtrue- Consists of those classes of H that are true for the given observation (i.e., set of superclasses of Cpos). Thus, Rtrue = { C t | Cpos < Ck} Rfaise' Consists of those classes of H that are false for the given observation. Thus, Rjhi.se = {Ck | Ck {. Cpos, Cpos •£ Ck} U {Ck | Ck < (negative observations) } Runknown- Consists of those classes of H that are not in Rtrue and R/aise- We know for certain that the classes along one path in Runknown are true, but we do not know which path this is. Example 4.6.4 Suppose for the hierarchical variable L we have one positive obser-vation, animal, and two negative observations, (bat, penguin). That is, L has a value v such that v e (Vanimal ~ Vbal - Vpe„guin), where Vanimal, Vba„ and Vpenguin are sets corresponding to the classes animal, bat, and penguin. We know that v is a member of all the sets corresponding to the superclasses of animal, and is not a member of the sets corresponding to the classes that are neither an ancestor nor a descendant of the class animal. The descendants bat and penguin of class animal are also false. The regions Rtrue and RfaiSe are shown by the bold lines in the class hierarchy of L, 128 Figure 4.4: The class hierarchy of L is divided into three regions R,rUe, Runknowm and Rjaise by positive observation {animal) and negative observations (bat, penguin) about L. as shown in Figure 4.4. The classes that are outside of the regions R,rue and Rfaise are denoted by R u n k n o w n . 4.7 Q u e r y i n B a y e s i a n N e t w o r k s w i t h H i e r a r c h i c a l V a r i a b l e s Let B be a Bayesian network made up of n discrete random variables, X = {X\,... ,Xn}. The inference task is to compute P (Q\e), where Q is a simple variable or a class in the hierarchy of hierarchical variable and e is an allowable observations as defined in Section 4.6. If the query is a simple variable, we compute the probability distribution over the values of the simple variable. For a hierarchical variable, we consider comput-129 ing the probability of a particular class that can be at any level of abstraction in the class hierarchy of hierarchical variable. We gain here because we can compute the probability of a particular class without computing the probability distribution over all the values. Example 4.7.1 Consider the Bayesian network as shown in Figure 4.2. Suppose we have observed Flying — true and Lay-Eggs = true. Given the observations, one query could be computing the probability that a living thing is a bird. P(L = bird\Flying = true A Lay-Eggs = true) =111 Another query could be computing the probability that a living thing is a particular species of the bird. For example, P(L = royal penguin \Flying = true A Lay-Eggs — true) =111 4.8 Inference in Bayesian Network with Hierarchical Variables As discussed in the introduction to this chapter, we wish to exploit the hierarchical structure on the values of the variables to perform efficient inference in Bayesian networks with hierarchical variables. In particular, we wish to exploit the fact that for a given problem, only certain classes in the class hierarchies of the hierarchical variables are supported directly by the evidence or are relevant to the query. Example 4.8.1 Consider the Bayesian network shown in Figure 4.5. The hierar-chical variable L represents living things on the earth. The hierarchical variable R 130 Figure 4.5: A Bayesian network with simple and hierarchical variables. represents regions on the earth. The variables Flying and LayJZggs are Boolean variables with values true and false. Suppose that the distribution of class animal is defined for only two classes in the hierarchy of R, ocean and land. Consider the computation of the likelihood P(Flying = true). We can compute the likelihood P{Flying = true) as follows: P(Flying = true) J2 p(R) P(L\R) P{Flying = true\L) P(lay-eggs\L) L,R,Lay-Eggs = YPWzZP(FlyinS = tme\L) P ( L W zC P(LayJEggs\L) (4.1) R L Lay-Eggs ; /' Note that 2~2Lay_ESgs P(LayJZggs\L) — 1. Next, consider the computation of the term 2~2LP(P^US — true\L) P(L\R) in equation (4.1). As discussed in Section 4.5.2, the conditional probability P(Flying\L) can be computed by partitioning the values of L into five disjoint and exclusive subsets, as follows: Vv e Vu P(Flying\L = v) = Pd(Flying\bat) Vv £ V2, P{Flying\L = v) = Pd(Flying\bird) 131 Vv £ V 3 , P(Flying\L = v) = P' d(Flying\penguin) Vv € V 4 , P(Flying\L = v) = Pd(Flying\insect) Vv € V 5 P(Flying\L = v) = P d{Flying\livingthing) where Vi is a set of values corresponding to class £>a?, V2 is a set of values corre-sponding to class difference bird - penguin, V 3 is a set of values corresponding to class penguin, V4 is a set of values corresponding to class insect and V5 is a set of values corresponding to class difference livingthing - bat - bird - insect. Now, the term 2~ZL P(Flying = true\L) x P(L\R) can be computed as follows: P(Flying = true\L) x P(L\R) = Pd(Flying = true\bat) £ P(L = v\R) Note that ^ V ( e V P ( L = vt\R) = P(V\R). Thus, we can replace the hierarchi-cal variable L by a simple variable X such that: • Val{X) = { "insect", "bat", "penguin", "bird -penguin", vx] where vx denotes "livingthing - bat - bird - insect". Note that the values v 6 Val(X) are meaningless names that correspond to sets. • The conditional probability distribution P(Flying\X) is given by: + Pd(Flying true\bird) £ P(L = v\R) + Pd(Flying v e v 4 true\livingthing) P(L = v\R) v e v 5 132 P(Flying\X = "insect") = Pd(Flying\insect) P(Flying\X = " bat") = P d(Flying\bat) P(Flying\X = "penguin") = Pd(Flying\penguin) P(Flying\X = "bird - penguin") = Pd(Flying\bird) P(Flying\X = vx) = P d(Flying\livingthing) Now, consider the computation of the term P(X\R). For example, P(X = "insect"\R) can be computed as follows3 : P(X = "insect"|7?) = P(insect\animal AR) x P[animal\livingthing AR) Note that only the distribution of class animal over its children changes with the locations on the earth and is the same at any location on land and ocean. Let Vocean a n d Viand be the sets corresponding to classes ocean and land, respectively. Then, Vv G Vocean, P(X = "insect"|/? = v) = Pd(insect\animal A ocean) x Pd(animal\livingthing) Vv e Viand, P{X = "insect"\R = v) = Pd(insect\animal A land) x Pd(animal\livingthing) Similarly, we can compute P(X = v\R) for any other value v G Val(X). Thus, we can replace the hierarchical variable R by a simple variable Y, such that: • Val(Y) = {"ocean", "land"} • and, 3The value insect" corresponds to the set of values represented by class insect. 133 P(Y = "ocean") = Pd{ocean\earth) P{Y = "land") = P d{land\earth) The hierarchical variables L and R can therefore be replaced by the simple variables X and Y. Note that the domains of X and Y depend on those classes of L and R that are exceptional for their relevant children. The above example illustrates that the partitioning of a hierarchical variable that has relevant hierarchical children is governed by those classes that affect the distribution of the superclasses of those exceptional classes of its relevant hierar-chical children for which we have evidence. That is, the domain of a hierarchical variable can be partitioned efficiently after partitioning the domain of its relevant hierarchical children. The idea of the structured inference algorithm is, given some evidence and a query, to construct an abstract Bayesian network by replacing the hierarchical variables with simple variables that only include those values necessary to answer the query. We can then answer the query from the abstract Bayesian network using any standard probabilistic inference algorithm. Definition 4.8.2 Let H be a hierarchical variable with values Val(H). An abstrac-tion of the domain of H, denoted by part(H), is a partition of Val(H). That is, part(H) is a set of subsets of Val(H) such that all sets in part(H) are mutually disjoint and the union of all the sets equals Val(H). Definition 4.8.3 Let H be a hierarchical variable and part(H) be a partition of its domain. An abstraction of hierarchical variable H for partition part(H) is a 134 simple variable Ha with the following domain: Val(H") = {v" | v" is a meaningless name corresponds to set b e part(H)} We refer to v" as an abstract value of H. Definition 4.8.4 A n abstract Bayesian network is a Bayesian network that is con-structed from a Bayesian network with hierarchical variables by replacing each hi-erarchical variable by its abstraction. Definition 4.8.5 A flattened Bayesian network is a Bayesian network that is con-structed from a Bayesian network with hierarchical variables by replacing each hi-erarchical variable with a simple variable that has all the leaf values as its domain. Definition 4.8.6 A n exceptional class is relevant exceptional class i f there is evi-dence about it. 4.8.1 Construction of an Abstract Bayesian Network To construct an abstract Bayesian network from a Bayesian network that has hierar-chical variables given some evidence and a query, we traverse the Bayesian network from the leaves upwards, prune those variables irrelevant to answer the query, and abstract the hierarchical variables. We abstract the hierarchical variables as part of a pass to remove variables that are irrelevant to the query. From the bottom-up, we can recursively prune any variable that has no children and is not observed or queried [Shachter, 1986; Geiger et al., 1990; Pearl, 1988] as well as abstract hierar-chical variables when their children have been abstracted. 135 Definition 4.8.7 An abstract Bayesian network constructed from a Bayesian net-work B is safe if it has the same posterior distribution over the query as by the flattened Bayesian network constructed from B. After constructing the abstract Bayesian network, we can use variable elim-ination or stochastic simulation algorithm to solve the query. The algorithm for constructing an abstract Bayesian network, called Abstract J3ayes, is discussed in the next section. 4.8.2 Abstract_Bayes As discussed before, to compute the efficient abstraction of hierarchical variables we need to abstract the hierarchical variables in a bottom-up manner. That is, when we abstract a hierarchical variable H, it has only simple children. The hierarchical variable can be a parent of several simple variables. The abstraction of a hierarchical variable H, denoted by H", depends upon the exceptional classes Ck of H with respect to each of these children (initially these are the classes associated with the evidence or the query). In order to develop a safe abstraction, we should follow the following constraints when computing the abstraction for the hierarchical variables: 1. For every evidence or query, there must be an abstract value that directly associates with that evidence/query and conveys its effect. 2. All abstract values must be mutually disjoint and exhaustive. The algorithm Abstract J3ayes is shown in Figure 4.6. It consists of three phases: 136 A l g o r i t h m A b s t r a c t _ B a y e s ( B , <?, Q) I n p u t B: A Bayesian network, e: evidence, Q: query O u t p u t Ba: An Abstract Bayesian network with respect to e and Q P h a s e O : i f <2 is of form H = C q t h e n f o r a l l H G H d o Cq <— query class of H in C q Cmot <~ r o o t c l a s s in the hierarchy of H Xq <— new Boolean child of H Pd{Xq = true\Cq) <- 1, Pd(Xq =false\Cmot) *- 1 e n d f o r e n d i f P h a s e l : A b s t r a c t Iterate over variables H of B in a bottom-up manner If H is barren variable, Prune H i f / / is hierarchical variable t h e n Vex <— set of exceptional classes of H for its relevant children i f H is observed t h e n Cpos *— positive observation about H . C„eg <— set of negative observations about H Val <- AbstractObs(H, V^, B, Cpos, Cneg) e l s e Vc/ <- AbstractHier(//, Va, B) e n d i f Ha <— new simple variable with domain Val e l s e Ha <— H % H is a simple variable e n d i f Phase2: C o n s t r u c t T a b l e s f o r a l l variables H" d o P(Ha\Pa(Ha)) <- ConstructTable(//a,fi) e n d f o r fi" < — abstract Bayesian network with nodes representing variables Ha and CPTs P(Ha\Pa{Ha)) 33: Return Ba Figure 4.6: Algorithm Abstract JBayes for constructing abstract Bayesian network. 137 F u n c t i o n A b s t r a c t H i e r ( / / , Vex, B) I n p u t H: A Hierarchical variable, Vex: A set of exceptional classes of H B: A Bayesian network. O u t p u t Val: A set of abstract values of H 1: Vaffect(H) <- {} , Val«- [ ], num « - 0 2: f o r a l l Ck G V 7 ^ do 3: C i , C , „ <— highest strict subclasses of Ck that are in Vex 4: i f (Ck — C\ — ... — Cm) does not represent an empty set then 5: Val[num + +] <— "C* — Ci — . . . — C„," 6: V f l j r e c , ( / / ) <- Vuffea(H) U {C | C, < C} 7: end i f 8: end f o r 9: Pah(H) <— hierarchical parents of H in B 10: f o r a l l Y G Pa / !(tf) do 11: Mark those classes of Y exceptional that affect the probability distribution of C G Vaffec,(H) over its children. % this is for the next step 12: end f o r 13: Return Val Figure 4.7: Functions AbstractHier called by the algorithm Abstract JBayes Phase 0: If a query of a Bayesian network asks for the probability of a particular class in hierarchical variable H, we create an extra binary child Xq of H, which is true exactly when the query is true (i.e., Pd(Xq = false\Cr00t) = 1 and P<i{Xa = true\Cq) = 1, where C„)0t is the root class and Cq is the query class in the class hierarchy of H). The variable Xq has the same probability as the query class. We thus reduce the problem to one of computing the probability of a simple variable. Phase 1: Abs t rac t In this phase, we traverse Bayesian network B from the leaves upwards. We prune a variable that is not queried or observed and does not have any children [Geiger et al., 1990; Pearl, 1988]. For each unpruned hierarchical variable H, we compute its abstraction H". Let Vex be the set of exceptional classes of H 138 Function AbstractObs (H, Vex, B, Cpos, Cneg) Input H: A hierarchical variable,Vex: A set o f exceptional classes o f H B: A Bayesian network, Cpos: A positive observation about H CneH: A set o f negative observations about H Output Val : A set of abstract values o f H 1: Croot <— root class o f H, Pa!'(H) <— hierarchical parents o f H in B 2: Vaffect(H) <- {} , Val *- [], num < - 0 3: Runknown < {^-'1^-' ^ Cpos} Cneg 4: V £ - {C\C G Vex and C G Runknown} U {Cpos} 5: for all Ck G V*x do 6: C i , . . . , C,„ < — highest strict subclasses o f Q that are in V * U C „ e g 7: if ( Q — C i — . . . — C,„) does not represent an empty set then 8: Val[num + +] <- " C * - C i - . . . - Cm" 9: ^ ( H ) ^ V V e c , ( H ) U { C | Ck < C} 10: end if 11: end for 12: for all Y G Pa h(H) do 13: Mark those classes o f Y exceptional that affect the distr ibution o f C G Vqffect(H) % this is for the next step 14: end for 15: Val[num + + ] " ( C T O ( - Cpos) U Cneg" 16: Return Vh/ Figure 4.8: Functions AbstractObs called by the algori thm Abstract_Bayes 139 with respect to its relevant children. The domain for Ha is computed as follows: Case 1: H is not observed The domain of Ha is the set of non-empty abstract values for each exceptional class. We compute the abstract value, va, for each class Q G Vex as follows: • va = "Ck", if Ck does not have any exceptional strict subclasses. • Otherwise, v" = "Ck — C\ — . . . — C,„", where C i , . . . , Cm are the highest strict subclasses of Ck that are in Vex. The abstract value va represents the set of all of the values that are in class Ck and are not covered by other abstract values. The abstract values that denote empty sets are removed from the domain of H". To abstract the hierarchical parent of H, we find their exceptional classes because of H. Let Vajject(H) be the set of strict super classes of C*'s for which we have the abstract values. Let Pah(H) be the set of hierarchical parents of H. The classes of Y G Pah(H) that affect the distribution of C G Vaffect(H) are marked exceptional because of H. Case 2: H is observed Let Cpos be the positive observation about H and Cneg be the set of negative observations about H. The observations about H divide its class hierarchy into three regions Rlrue, Runknown, and Rfaise as discussed in Section 4.6. We do not need to distinguish between the values of H that we know are false. Thus, one abstract value of H, denoted by Vfalse, corresponds to all 140 those values of H that are false. vfalse — (Croot CpOS) U Cneg vpise denotes the set of values represented by {Cwot — Cpos) U Cneg. To compute other abstract values of H, we need to consider exceptional classes from Vex that are in region RUnhwwn in the hierarchy of H. Let V* denotes the set of exceptional classes from Vex that are in region Runknown- In order to have an abstract value corresponding to evidence, Cpos is incorpo-rated in set Vfx. We compute the abstract value, va, for each class Ck € V* as follows: va = "Ck — C\ — . . . — c„r where C 1 ; . . . , C,„ are the highest strict subclasses of Ck that are in V*. The abstract value v" represents the set of all of the values that are in class Ck and are not covered by other abstract values. The abstract values that denote empty sets are removed from the domain of H". To abstract the parents of H, for each hierarchical parent Y of H we find F's exceptional classes because of H in the same way as discussed in Casel. Phase 2: Construct tables In this phase, we construct the conditional proba-bility table for each variable Ha. Suppose P<f(H") denotes the abstracted par-ents of Ha, and Pana(H") denotes the non-abstracted parents of H". We compute P(Ha = va\Pana(Ha) = Vm A Paa(Ha) = va) for each value v" of Ha, for each assignment Vna of Pana(Ha) and Va of Paa(Ha) for the following three cases: 141 Function Construct_Table(#", B) Input: Ha abstracted variable, B Bayesian network Output: P(Ha\Pa(Ha)), conditional probability table of Ha 1: if Ha is not abstracted and has no abstracted parents then 2: P (Ha\pa {Ha)) <- conditional probability of Ha in B 3: else if Ha is not abstracted but it has abstracted parents then 4: P{Ha\Pa(Ha)) <- Simple.Var•J,able(Ha,B) 5: else 6: P{H"\Pa(H")) <- Hier_Var.Table(Ha,B) 7: end if 8: ReturnP(Ha\Pa(Ha)) Figure 4.9: Function Construct_Table called by the Abstract_Bayes. Function Simple_Var_Table(//', B) Input: H" abstracted variable, B Bayesian network Output: P(Ha\Pa{Ha)), conditional probability table of X? 1: Paa(Ha) <— abstracted parents of H" 2: Pana(Ha) <— non-abstracted parents of Ha 3: Val±-[] 4: for all assignment Vm of Pam(Ha) do 5: for all assignmnet Va of Pam(Ha) do 6: context <— [ ] 7: for all Y G Paa(Ha) do 8: y <— value of Y in Va 9: Cyex — C{ — ... — C'm <— class difference corresponds to y 10: context <— add Cyex in context 11: end for 12: Vs <— inherit P(Ha\Vna A context) % the distribution over values of/ /" 13: Val <— add the values Vs in Vh/ 14: end for 15: end for 16: P(Ha\Pa(Ha)) <- construct CPT of H" with values in Val 17: Return P(Ha\Pa(Ha)) Figure 4.10: Function Simple_Var_Table called by the Construct_Table. 142 Function Hier_Var_Table(//a, B) Input: Ha abstracted variable, B Bayesian network Output: P(Ha\Pa{Ha)), conditional probability table of Xf 1: X <— variable corresponds to H" in B 2: Pa"(H") <— abstracted parents of Ha 3: Pana(H") <— non-abstracted parents of H" 4: Val <— [ ] ; num. < — 0 5: for all assignment Vna of Pa""(Ha) do 6: for all assignmnet Va of Pana (Ha) do 7: context <— [ ] 8: for all 7 6 Paa(Ha) do 9: j <— value of Y in V a 10: Cyex — C\ — . . . — C'„ <— class difference corresponds to y 11: context <— add Cyex in context 12: end for 13: for all v e Va/(77a) do 14: Ck — C i — . . . — C„, < — class difference corresponds to v 15: pi <— compute P(C,-| Vna A context) 16: Val[num + +] *— pc — Pi — • • • — pm 17: end for 18: end for 19: end for 20: P(Ha\Pa(Ha) « - construct CPT of H a from values in Val 21: Return P(Ha\Pa(Ha) Figure 4.11: Function Hier_Var_Table called by the Construct.Table. 143 Case 0: H" is not abstracted and has no abstracted parents: Let H be the vari-able in B that corresponds to H". Then, P (Ha\Pa (Ha)) = P (H\Pa (//)) Case 1: H" is not abstracted but has abstracted parents: Suppose H" has k ab-stracted parents. The assignment Pa"(H") = Va denotes the abstracted value for each abstracted parent of H", Va = (v„, . . . , va). As discussed in Phase 1 of Abstract JBayes, each abstract value v'a corresponds to set of values represented by an exceptional class and are not covered by other abstract values. Let v'a corresponds to set of values represented by C'ex — C\ — ... — C'm, where C'ex is an exceptional classes in the hierarchy of i"' hierarchical parent of H" and C[,..., C'm are the highest exceptional strict subclasses of Oex. Suppose V denotes the parent context, V = [C]x,..., Ckex). Then, P (Ha\Pa"a (Ha) = V„a A Pa" (Ha) = Va) = P {Ha\Pam {H") = Vna A V) As discussed in Section 4.5.3, P (Ha = va\Pana (Ha) = Vm A V) is inherited from the default probability distribution of Ha for the super parent context Z of V in Cp of X that is not overridden by a more specific default probability distribution of H" for parent context y, such that V < y < Z. P (Ha\Pana (Ha) = V„a A Pa" [H") = Va) = Pd (Ha\Palia [H") = Vna A Z) Case 2: H" is an abstracted variable: Suppose H" has k abstracted parents. As discussed in Case 1: P (Ha = v"\Pana (Ha) = V„a A Pa" (Ha) = Va) 144 = P (Ha = va\Pa"u (Ha) = Vna A V) Let A denotes the assignment: A = (Pana (Ha) = Vna A V ) . As discussed in Phase 1, suppose the abstract value v" represents a set of values that corre-sponds to Ck — C\ — ... — C,„, where Q , C i , . . . , C„, are the classes in the hierarchy of hierarchical variable corresponding to H". Thus, P(Ha = vu\A) = P ( Q | A ) - P ( C 1 | A ) - . . . . - / J ( C , „ | A ) As discussed in Section 4.5.1, the probability of a class can be computed by multiplying the probabilities up in the class hierarchy. Thus, we can compute F ( Q | A ) as follows: P{Ck\A) = P{Ck\Ck-i A A ) x P ( Q _ i | A ) where C*_i is the immediate superclass of Q . Therefore, to compute P ( Q | A ) we need the probability distribution of all the strict super classes of Q over their immediate subclasses for parent context V. The probability distribution of a class over its immediate subclasses for parent context V can be computed in the same way as described in Casel . We have presented the AbstractJBayes algorithm for constructing an ab-stract Bayesian network from a Bayesian network with hierarchical variables. Ab-stract_Bayes has a desirable property: it is independent of the particulars of the inference algorithm that we use for computing the posterior distribution. Thus, it can be applied as preprocessing step before computing the posterior distribution (but note that the abstraction depends on both the observation and the query). 145 © CD Figure 4.12: A Bayesian network with simple and hierarchical variables (shaded variables are observed) Example 4.8.8 We illustrate the Abstract_Bayes algorithm by applying it on a Bayes network that has four simple variables, C, F l , F2, and A, and three hierarchical vari-ables, Q, H, and P as shown in Figure 4.12. The tree hierarchies for the hierarchical variables are shown in Figure 4.13. The thin lines in Figure 4.13 show the class-subclass relationships and the thick lines with arrows show the Bayesian network dependency. In the tree hierarchies, classes that are exceptional (i.e., those classes that have defined default distributions for at least one of the children of hierarchical variable) are marked by "*". In this example, we consider the following: • The conditional probability P(F1\H) is specified by the default probability distributions Pd(Fl\h), Pd(Fl\hl), and Pd(Fl\h7). • The conditional probability P(F2\H) is specified by the default probability distributions Pd{F2\h) and Pd(F2\h3). • The default probability distributions of all the classes of P are defined for the 146 I Figure 4.13: The Bayesian network of Figure 4.12 with trees hierarchies for the hierarchical variables. Thin lines show the class-subclass relationship and thick lines show the Bayesian network dependency. Exceptional classes are marked by *. 147 root class h of H. As shown in Figure 4.13, class pi is affected by class h3 of H so the default probability distribution of class pi over its children is also defined for class h3. The default probability distribution of classes p3 and p6 of P is also defined for class h5 of H. • The default distribution of all the classes of variable H are defined for the root class q of Q. As shown in Figure 4.13, the probability distribution of class /z3 of H is affected by class q5 of Q, so the default probability distribution of class h3 is also defined for class q5. The default probability distributions of class h8 of H is also defined for class qlO. • Suppose the query is P(H = hlO\A = al A P = p3). Construction of abstract Bayesian network: Since the query is on a hierarchical variable, the algorithm first creates a simple child Xq of H, such that: Pd(Xq = true\hW) = 1, ar\dPd(Xq = true\h) = 0 To construct the abstract Bayesian network from the Bayesian network B as shown in Figure 4.13, we traverse B from the leaves upwards. We leave the observed simple variable A as it is. The simple variable F2 is pruned because it is irrelevant. We abstract the hierarchical variable P. Abstraction of P (observed hierarchical variable): We have observed P = pS. P is abstracted by a simple variable P" with domain Val(Pa)- {"p3", "p - p3"}. After abstracting P, we abstract H. 148 Abstraction of H: After pruning F2 and abstracting P, to abstract H we find those classes of H that are exceptional for P, XCj, and F l . Variable H has five exceptional classes4: h, hi, hZ, and hi, and MO. H is abstracted by a simple variable H" that has the following domain: Val{Ha) = {"h3", "h7", "hlO", " h i - h3 - h7 - hlO", "h - hi"} Abstraction of Q Finally, Q is abstracted. As discussed before, to compute the abstraction of Q, only those values are needed that are necessary to compute the conditional probability distribution of Ha. Thus, Val(Q") = {"q5", "qlO", "q - q5 - qlO"} After abstracting the hierarchical variables, we compute the conditional prob-abilities for the abstract Bayesian network. The first case we consider is how F l depends on H". Construction of P(Fl\Ha): To compute P(F\\Ha), we compute P(Fl\Ha = vf l) for each value v" of Ha. The conditional probability table P(Fl\Ha) is shown in Table 4.1. Note that the domain of H" was created such that we can have simple conditional probability tables for its children. There are some abstract values that represent the set of values that are equal to the set of values represented by excep-tional classes for F l (those classes that have different distribution over F l in the above table). 4Note that h5 is not exceptional. h5 tells us the distribution over the children of p3 or p6, but we have no evidence about either of these. 149 Ha P{Fl\Ha) "h3" "h7" "hlO" "hi -h3-h7-h lO" " h - h l " Pd(Fl\hl) Pd(Fl\h7) Pd(Fl\hl) Pd(Fl\hl) Pd(Fl\h) Table 4.1: Conditional probability table P{F\\Ha). Ha P(Pa = p3\Ha) "h3" "h7" "hlO" "hl-h3-h7-hl0" " h - h l " Pd(p3\pl A h3) x Pd{pl\p A h) Pd{p3\plAh) x Pd(pl\pAh) Pd(p3\pl A h) x Pd(pl\pAh) Pd(p3\plAh) x Pd(pl\pAh) Pd{p3\p\Ah) x Pd{pl\pAh) Table 4.2: Conditional probability table P{Pa\Ha). Construction of P{Pa\Ha): To compute P(F'\Ha), we compute P(Pa = v£|tfa = v") for each value v£ G Val(Pa) and for each assignment v" of H". The conditional probability of Pa given / / a is shown in Table 4.2. Note that P" makes different distinctions in Ha than Fl. This reflects the different exceptional classes in H for P and F l . The domain oiHa is the minimal set of values that preserve the distinctions needed and are adequate to answer the query. The evidence for the observed hierarchical variable H is translated into the corresponding abstract variable Ha of the abstract Bayesian network. The evidence for hierarchical variable H in abstract Bayesian network is the disjunction of all those abstract values of H" that are not false for the observation. After constructing the abstract Bayesian network, we can answer the query using any standard probabilistic inference algorithm. However, it is not viable to use 150 the Junction Tree algorithm, as the construction of the abstract Bayesian network is query oriented. We can use the Variable Elimination algorithm [Zhang and Poole, 1994] or stochastic simulation to sum out all non-observed, non-query variables from the abstract Bayesian network. The abstraction is done as part of the standard elimination of irrelevant variables. 3 • -4.8.3 Complexity To construct an abstract Bayesian network, Abstract Jiayes traverses a Bayesian network from the leaves upwards, prunes those variables that are irrelevant to the query and abstracts the relevant hierarchical variables. As discussed in the algo-rithm, the number of abstract values of a hierarchical variable is less than or equal to the number of classes in its hierarchy that are exceptional with respect to its relevant children. Let B be a Bayesian network with hierarchical variables and made up N discrete random variables. Let Nn be the number of hierarchical variables in B that need to be abstracted. Let e be the maximum number of exceptional classes of a relevant hierarchical variable in B, and bs be the maximum domain size of a simple variable in B. Proposition 4.8.9 Let p be the maximum number of parents of a variable in B. Let / be the maximum domain size of a variable in the abstract Bayesian network. The number of abstract values for a hierarchical variable is less than or equal to e. Thus, I = max(e, bs). The size of the abstract Bayesian network Ba constructed from B by Abstract Jiayes is 0(N x 151 Computing the conditional probability table of a hierarchical variable in-volves a series of multiplications. The number of multiplications is proportional to the depth of the class corresponding to an abstract value, which in turn is the depth of the relevant exceptional class. We can have the following: Proposition 4.8.10 The cost of traversing B is 0(N). The cost of constructing B" from B, given the observation and query, is 0(N+Nn xexd), where d is the largest depth of a relevant exceptional class in the hierarchy of a relevant hierarchical vari-able. Suppose after constructing an abstract Bayesian network from a Bayesian network we use Variable Elimination to answer the query. Then the cost of solving a query is the sum of the cost of constructing an abstract Bayesian network and the cost of running Variable Elimination on the abstract Bayesian network. As discussed in Chapter 2, the cost of Variable Elimination is 0(NbM), where N is the number of variables in the Bayesian network, b is maximum domain size of a variable in the Bayesian network, and M is the largest number of variables in an intermediate factor. We can have the following: Proposition 4.8.11 Let / be the maximum domain size of a variable in the abstract Bayesian network. As discussed above, / = max(e, bs). The cost of running Vari-able Elimination on the abstract Bayesian network is 0(NIM). 152 4.9 Correctness of Abstract Bayesian Network The abstract Bayesian network is constructed from a Bayesian network with hierar-chical variables using the representation and algorithm as described in the previous Sections. The construction of an abstract Bayesian network would of course be use-less, if the posterior distribution over the query in the abstract Bayesian network is not equal to the posterior distribution computed from the flattened Bayesian net-work. Let B be a Bayesian network with hierarchical variables made up of n dis-crete random variables, X = {Xx,..., Xn}, and we want to answer some probabilis-tic query, P (Q\E), where Q, E C X , Q denotes the query variables. We observed that E is in the set e, and e C domain(E), E G e is the evidence. From B we can recursively prune any variable that has no children, and is not observed or queried [Geiger et al., 1990; Pearl, 1988]. Let Xi,..., Xs be the non-query random variables of B and suppose X,'s are ordered according to some elimination ordering. As discussed in Chapter 2, we can compute the query from a function / (Q) , which is proportional to Q's posterior distribution. n / (Q) = ^ . . . ^ n n ^ w W ) • Xs Xi 1=1 The subscripted probabilities have the same meaning as discussed in Chapter 2. Lemma 4.9.1 Let B be a Bayesian network with n discrete random variables, X = {Xi,... ,Xn}. Suppose a partition of the domain of each non-observed variable X of B exists such that: VY if the parents of Y are Xx,..., Xk and bt 6 part(Xi) for 153 each i. If Vz v,;, w,- € bj, P(Y\X1 = v1...Xk = vk) = P(Y\X1=w1...Xk = wk) (4.2) then, Y,f[P{Xi\P*(Xi)) = E P(bs\Pa(Xs)) f[ Xv 1=1 bsepart(Xs) . (=1, i^i' Proof Summing over all the values of Xs is equivalent to summing over the partition of the values of Xs. Thus, E [ [ f ( W ) ) = E E l l ^ w ) ) X.v i=l bs£part(Xs) Xs€bs i=l To compute ^ X j 6 f c l ECU P(X,|Pa(X,)), from the product JTjLi ^ (X (|Pa (X,)) we can distribute out all of the conditional probabilities that do not involve Xs out of the sum. After distributing out the conditional probabilities, inside the sum 2~2xsebs we have conditional probability of Xs given its parents and the conditional proba-bilities of X v 's children. Suppose Xx,..., Xp are the children of Xs. Then, inside the sum 2~2x-eb- w e n a v e t n e following term: p J2PWa(Xs))HP(Xj\Pa(Xj)) xsebs j - \ If equation (4.2) is true, all the children X/s of Xs have the same condi-tional probabilities for each value xs E bs of Xs. We can distribute the product 11/= i P(Xj\Pa(Xj)) o u t °f t n e s u m ' leaving the term 2~2xsebs P{Xs\Pa(Xs)), which is equal to P (bs\Pa (Xs)). Thus, En^ i^ (^ ) ) X, 1=1 E P(bs\Pa(Xs)) J ] P(X,|Pa(X,.)) bs€part{Xs) '=l,'V.v 154 The above lemma provides us with a way of proving the correctness of the abstract Bayesian network. We can do so by proving that Abstract Jiayes constructs the abstraction of the hierarchical variables Xk of B from partition part(Xk) such that the children of Xk satisfy equation (4.2). Lemma 4.9.2 The Abstract Jiayes algorithm partitions the domain of each hierar-chical variable Xh of B such that all of Xh's children satisfy the equation (4.2). Proof Suppose part(Xil) denotes the partition of the domain of hierarchical vari-able Xh computed by Abstract Jiayes. Let Xb be the abstraction of Xh. If Xh is a simple variable, part(Xi,) = Val(Xf,). Abstract Jiayes constructs the conditional probability tables for each abstracted variable X"t of an abstract Bayesian network using "inheritance" for the following two cases: Case 1: Xft is not abstracted but some of its parents are: As discussed in Phase 2 (Construct Sable) of Abstract Jiayes the conditional probability table of X\ is constructed as follows: Suppose XI has k abstracted parents. The assignment Paa(X"l) = Va denotes the abstracted value for each abstracted parent of X"v Va — (va,..., vka). As discussed in Phase 1, each abstract value y'a corresponds to set of values represented by an exceptional class and are not covered by other abstract values. Let v'a corresponds to set of values represented by C'ex — C[ — ... — C'm, where Cex is an exceptional classes in the hierarchy of /''' hierarchical parent of X% and C[,..., Cm are the highest exceptional strict subclasses of Cex. Suppose V denotes 155 the parent context, V = (C]x,..., Ckex). Then, P {Xah\Pam {XI) = Vna A Pa" (X°n) = Va) = P {Xt\Pana (X°h) = Vna A V) As discussed in Section 4.5.3, P(X% = v u |Pa'M {XI) = Vna A V) is inherited from the default probability distribution of Xft for the super parent context Z of V in parent context hierarchy Cp of Xh that is not overridden by a more specific default probability distribution of Xft for parent context y, such that V < y < Z. P {Xl\Pana {Xah) = V,m A Paa {X°h) = Va) = Pd {Xl\Pana {Xf) = Vna A Z) Let Vx be a set defined as follows: Vs = { j * 1 , . . . , x"} \x' G set of values represented by v'a} Note that C'ex is the lowest exceptional class of x'. Suppose Z are the hierarchical variables in B corresponding to Pa"{X"l). Then, Vv G V,, P(Z^|Pa™(^) = Vna A Z = v) = P ^ l P a " " ^ , ) = V„a A Z) Thus, the values of Z are partitioned such that Xb has the same probability distri-bution for all values v,- G bh where bi G part{Zi),Zi G Z. This implies equation (4.2). Case 2: is abstracted: As discussed in Phase2, to compute the conditional probability P{X^ = v a|Pa'M(X£) = Vna A Paa{Xl) .= V) we need the conditional distribution of some classes Cj, from the tree hierarchy of Xh, over their immediate subclasses. We can treat the classes Cj as simple variables. Thus, the proof follows from Case 1. 156 Theorem 4.9.3 The posterior probability P(Q|E) computed from the abstract Bayes network B" of B as constructed by Abstract JBayes is equal to the posterior proba-bility computed from the flattened Bayesian network of B. Proof The proof follows from Lemmas 4.9.1 and 4.9.2. 4.10 Empirical Results We present some results testing whether the hierarchically structured values of the variables can make reasoning more efficient and to determine when the overhead of the abstraction is worth it. To evaluate our approach, we compare the cost of answering a query from a Bayesian network using the proposed approach, applying first AbstracLBayes then Variable Elimination (Abstract J3ayes+VE), to applying Variable Elimination on a flattened Bayesian network (VE). To test the proposed approach, we could not find any realistic networks. We notice that finding the real problem is like the chicken-and-egg problem, since in order to solve a Bayesian network that has hierarchical variables, one really needs a method for solving it. This problem is resolved by developing a method that can solve a Bayesian network with hierarchical variables. To test whether the hierar-chically structured values of the variables can make reasoning more efficient, we model a simple conference - academics domain. 157 Figure 4.14: (a) Bayesian network representation of the conference - academics domain, (b) A part of the class hierarchy for location in the world. 4.10.1 Conference - Academics Domain Suppose the location of an academic in the world is influenced by the location of a particular conference, and the location of the spouse of the academic is influenced by the location of the academic. Suppose the location of the conference in the world does not depend on anything we are modelling. Let there be k academics. We assume that the locations of the different academics are not independent; they depend on the conference location. We consider that all location variables (con-ference location or person locations) have the same domain and are hierarchical variables. The Bayesian network representation of this domain is shown in Fig-ure 4.14 (a). The part of the class hierarchy for the location is shown in Figure 4.14 (b). Here we have 2k + 1 hierarchical variables. The hierarchical variable c o n f J o c denotes the conference location in the world. The hierarchical vari-ables a c a m _ l o c _ l , . . . , acam_loc_k denote the locations of the academics, and s p o u _ l o c _ l , . . . , spou_loc_k denote the locations of their spouses. We can specify the prior probability P (conf _loc) by specifying the distribi-158 tion of each class in the class hierarchy of conf Joe over its immediate subclasses. To specify the conditional probability P(acam_loc_k|conf _loc), we need to spec-ify the default distribution of each class in the hierarchy of acam_loc_k over its immediate subclasses for some parent contexts (for some classes in the hierarchy of conf _loc). We assume that the conference location influences each academic's location very locally; we need to specify the default distribution of the classes over their immediate subclasses for very few parent contexts. Example 4.10.1 Let us consider the distribution of class uk in the hierarchy of academic location over its immediate subclasses. That is, we need to specify the conditional probabilities such as P(scotland\uk A conf Joe). This means we know that the academic is in the U K but do not know whether he is in Scotland, Wales or England. Now, knowing that the conference is taking place in the U K but not where in the U K does not provide us with any information about the location of the academic in the UK. However, if we know that the conference is taking place somewhere in Scotland, we can state that there is a high probability that the academic is in Scot-land. That is, the conference location influences the academics location in the U K if the conference is taking place in the UK and we know where in the UK, other-wise it does not influence the location of the academic in the UK. We assume that knowing the conference location in Scotland (e.g., in Edinburgh), does not provide us with any extra information about where they may be if not in Scotland. That is, the distribution of class uk in the hierarchy of acam_loc_k over its immediate subclasses conditioned on conf Joe needs to be defined for four parent contexts: 159 (world), (scotland), (england), and (wales). We could have a default probability distribution of the class uk in the hier-archy of acam_loc_k over its immediate subclasses conditioned on conf J o e for parent context X = (scotland) such as: Pa(scotland\uk A conf _loc = Scotland) = 0.8 Pd(england\uk A conf_loc = Scotland) = 0.1 Pd(wales\uk A c o n f J o c = Scotland) = 0.1 Similarly, for parent context X = (wales) or X = (england) such as: P d(scotland\uk f\ c o n f J o c = wales) = 0.1 Pd(england\uk l\ conf_loc = wales) = 0.1 Fv(wa/e.s'|M& A conf _loc = wa/es) = 0.8 For parent context = (world) such as: Pd(scotland\uk A conf _loc = world) = 0.3 P'd(england\uk A conf_loc = world) = 0.4 P(/(wa/e5|«A: A conf_loc = world) = 0.3 4.10.2 Results To get an idea of the performance of Abstract Jiayes + V£ compared to VE, we investigate how the performance of both algorithms varies with the number of hier-archical variables in the Bayesian network, with the depth of the tree hierarchies of 160 the hierarchical variables, and with the depth of the exceptional classes. We repre-sent the abstraction hierarchy of all location variables by a binary tree of depth d, where d is a parameter. The probability distribution of any class a in the hierarchy of academic acam_loc_k need to be specified for three parent contexts c0, cn, and C i 2 , where C o is the root class in the hierarchy of conf Joe, and c\\ and c\2 are the children of class c in the hierarchy of conf Joe, where class c corresponds to class a in the hierarchy of acam Joe J<. We define the default distribution of class a as follows: Pci(au\a A conf Joe = c0) = 0.5 Pd(ai2\a A conf Joe = CQ) = 0.5 Pd(aii\a A conf Joe = cn) = 0.8 Pd(ai2\a A conf Joe = cn) = 0.2 Pci(an\a A conf Joe = c\2) — 0.2 Pd(ai2\a A conf Joe = c\2) = 0.8 In the same way, we define the distribution of classes in the hierarchy of spou J o c k conditioned on acam Jock. For the conference location variable, conf Joe, for each class c in its hier-archy, we consider a uniform distribution over its children. Suppose cn and c\2 are the children of c. Then, Pd(cu\c) = 0.5 Pd(c12\c) = 0.5 161 Suppose k denotes the number of academics, d denotes the depth of location hierarchies, and doos denotes the depth of observation. For the experiments, we consider computing the probability of a class in the class hierarchy of conf Joe . We consider s p o u J o c . l , . . . , spou J o c J c as the observed variables. We do so because we want some hierarchical variables to sum out. If the academic locations were observed, there are no hierarchical variables to sum out. For the experiments, we consider computing the probability of a class that is at level d = 1 in the class hierarchy of conf Joe , given the evidence, by applying only VE on the flattened network and Abstract Jiayes + VE. To determine how the inference time depends on the number of hierarchical variables, we vary k from 1 to 8. To determine how inference time depends on the depth of the hierarchy (d), we vary d from 1 to 10 with a step of 1. To determine how the inference time depends on the depth of the exceptional classes, for each value of d we vary the depth of the observation (doos) for the observed variables from 1 to d. We consider only the positive observations about the hierarchical variables. For all hierarchical variables we keep the observations at the same level, but we randomly pick the observed classes at that level. Figure 4.15 shows the average inference time of both Abstract J3ayes+VE and V E for the belief network that has only two academics as a function of the depth of the class hierarchies, as well as the depth of the observations. The error bars on Abstract Jiayes + VE, (dobs — d) curve represent the standard deviations obtained from randomly picking the positions of the positive observation classes. Figure 4.16 shows the average inference time of both Abstract Jiayes + VE 162 800i i i 0 I 1 1 1 L I I 1 I I 1 2 3 4 5 6 7 8 9 10 depth of abstraction hierarchies (d) Figure 4.15: The average inference time of applying VE and Abstract J3ayes + VE for k = 2 as a function of the depth of the hierarchy, and depth of the observations. 163 501 1 1 1 1 1 i i i i i I 1 2 3 4 5 6 7 8 . 9 10 11 12 Number of academics (k) Figure 4.16: The average inference time of applying VE and Abstract J3ayes + VE for d - 8 as a function of the number of academics and the depth of the observations. 164 and VE as a function of the number of academics and the depth of observations; here we consider the depth of the hierarchy is 8 (d = 8). The error bars on Abstract Jiayes + VE, (dobs = 8) curve again represent the standard deviations ob-tained from by randomly picking the location of the postive observation classes at level 8 in the hierarchies of the spouse variables. The error (standard deviation) arises because every time different observed classes are selected for the spouse variables. Sometimes the observed classes are very apart and make different exceptional classes in the hierarchies of the academics variables. If academics have different exceptional classes, they make different dis-tinctions (different exceptional classes) in the hierarchy of conf Joe . If there are many exceptional classes in the hierarchy of conf Joe , the number of abstract val-ues5 for variable conf J o e is also more that increases the processing time. Figure 4.15 shows that the inference time for Abstract J3ayes+VE does not depend on the depth of the class hierarchy but increases as the depth of the ob-servations increases. This increase is because the depth of the observations for spou J o c _ l , . . . , s pou J o e Jc variables increases the depth and the number of ex-ceptional classes for a c a m J o c J , . . . , a c a m J o c J c and conf J o e variables. No-tice that VE takes less time as the depth of observation increases. This is because the domain sizes of the spou J o e l , . . . , s pou J o e Jc variables reduce as the depth of the observation increases; after incorporating the observations in the flattened net-work the domain of each spou J o e J variable is the disjunction of all leaf values that are descendants of the observed class (i.e., 2 ^ - r f » ' « ) ) 6 . 5The number of abstract values is proportional to the number of exceptional classes. 6In a binary tree of depth d, the number of values represented by a node that is at depth n from root is 2<l~n. 165 The results shown in Figures 4.15 and 4.16 give us some indication of the overhead involved in constructing an abstract Bayesian network and cases in which it may be more effective to use it rather than applying V E on a flattened network. 4.11 Conclusion In this chapter, we argue that considering hierarchical structure on the values of a variable can make reasoning more efficient than reasoning over the set of all values. We show how, given a query and observations, we can dynamically construct an abstract Bayesian network from the given hierarchical Bayesian network that can be used in any standard probabilistic inference algorithm. The domain size of the variable in an abstract Bayesian network is independent of the size of the hierar-chies; it depends on how many of the classes in the hierarchy are exceptional with respect to the children that are observed or have observed descendants in a Bayesian network. The running time of Abstract Jiayes to abstract a hierarchical variable de-pends on the depth of the classes that are exceptional with respect to children that are observed or have observed descendants, as well as the depth of the query in the hierarchy, and is otherwise independent of the size of the hierarchy. Thus, it is possible to have hierarchical variables with unbounded or infinite domains. For example, with a spatial hierarchy, as long as we have a way to compute the proba-bility distribution over subclasses, there is no reason not to have a hierarchy that is infinitely detailed. It is only when we ask a query of a class or make observations that we need to restrict the size of the hierarchy. 166 Chapter 5 Putting It All Together 5.1 Introduction In Chapters 3 and 4, we consider the problem of performing inference in Bayesian networks that have variables with large domains. In Chapter 3, for handling the large number of values of the variables efficiently we consider that the conditional probabilities can be represented in a compact manner using both intensional and extensional definitions. We represent the conditional probabilities in CPD language form. We present an inference algorithm, Large Domain VE, that dynamically partitions the large number of values of the variables in equivalence classes, given the observation and query. In Chapter 4, to handle the large number of values efficiently we consider the case where there is a priori hierarchical structure on the values of the variables, and called such variables hierarchical variables. To represent the CPDs compactly, we use inheritance. We present an approach for reasoning in Bayesian networks with 167 hierarchically structured variables that dynamically constructs an abstract Bayesian network, given some evidence and a query, by collapsing the hierarchies to include only those values necessary to answer the query. The query can be answered from the abstract Bayesian network using any standard probabilistic inference algorithm, such as variable elimination or stochastic simulation. Given the close relationship1 between these two approaches, in this chapter we examine how they can be put together in a common framework; that is, how to perform inference in Bayesian networks that exploit both kinds of structures. To exploit both kinds of structures, one obvious setting is: consider a Bayesian net-work that contains both hierarchical and large-domain variables, and large-domain variables are represented using intensional and extensional definitions. Let us consider another more general setting: consider a hierarchical vari-able such that for some of its children the hierarchical structure on its values does not need to be defined completely. That is, some children depend on each value rep-resented by some of the nodes (classes) in the hierarchy, which means we need to consider the large number of values represented by some of the classes in the hier-archy. These large number of values can be represented efficiently using intensional definitions. For example, consider the problem of identifying rock types in a region. There are a large number of rock types. The British Geological Survey provides a rock classification scheme2 that classifies rocks in a hierarchical manner. At the very high level, all types of rocks are grouped together as "allrocks" and then par-1 Essentially, both of these approaches partition the large number of values of the variables into equivalence classes. 2http://www.bgs.ac.uk/bgsrcs/ 168 titioned into four groups: "igneous rocks", "metamorphic rocks", "sediments and sedimentary rocks", and "artificial man-made ground and natural superficial de-posits". The rocks in each group are further partitioned. The types of rocks for some locations could be known to us because we have some prior knowledge about these locations. For locations that we know nothing about, we can find out the rock type by knowing the geological type of these locations, which can be determined from a geological map. The properties acidity and grain size of the rocks depend on rock type. We can model this rock-identifying problem using the Bayesian network shown in Figure 5.1 (a). The variable Location is a hierarchical variable and rep-resents all of the locations in Western Canada. The tree hierarchies of Location is shown in Figure 5.1 (b). At the very high level, all of the (X ,Y) locations rep-resented by Location are grouped together as western Canada. Suppose all of the locations in Western Canada can be partitioned into two disjoint classes, be and al-berta. The values represented by the classes be and alberta are further partitioned, and so on. The variable Legend is a simple variable representing the geological type of a location. The variable RockType is a hierarchical variable and represents the rock type of a location. The tree hierarchy of RockType is shown in Figure 5.1 (b). The hierarchical structure on the values of Location is exploited by its child RockType but not by its other child Legend. This is because to find the geologi-cal type of a location we need to consult maps, which means we need to use the (X, Y) value of a location. We can use B C maps to find the geological type of any location in B C and Alberta maps to find the geological type of any location in 169 western Canada (a) Tree hierarchy of Location a l l r o c k s a r t i f i c i a l and natural s u p e r f i c i a l deposi ts c rys ta l l ine igneous r o c k Tree hierarchy of RockType Figure 5.1: (a) A Bayesian network modelling rock types in a region, (b) The class hierarchies of Location and RockType. 170 Alberta. The conditional probability of Legend, P[Legend\Location) can be repre-sented compactly, using the intensional definitions (using predicates and functions), which may involve looking up the locations in the B C map or Alberta map, in C P D language form. We term those hierarchical variables whose classes can be represented exten-sionally or intensionally (using predicates and functions) generalised-hierarchical variables. Note that both large-domain and hierarchical variables are instances of generalised-hierarchical variables. In this chapter, we consider the problem of performing inference in Bayesian networks that have generalised-hierarchical variables. We begin in Section 5.2 by defining the generalised-hierarchical variables. In Section 5.3, we discuss how to represent the CPDs of Bayesian networks with generalised-hierarchical variables in a compact manner using the representations that we developed in Chapters 3 and 4. In Section 5.4, we show how to perform inference in Bayesian networks with generalised-hierarchical variables using the inference algorithms that we developed i I in Chapters 3 and 4. 5.2 Generalised Hierarchical Variables In Chapter 4, we introduced the notion of hierarchical variables. There we consid-ered that the large number of values of a variable can be organized a priori in a class hierarchy form, where the large number of values are grouped together in a meaningful way. However, in some problems for some children, the large number of values need to be grouped together up to some level (depth) in the class hierar-171 chy, after that they need not be grouped together. That is, for some children, some internal nodes in the class hierarchy are considered as leaf-nodes. The large number of values of these nodes (or classes) are represented using intensional definitions. We term such hierarchical variables generalised-hierarchical variables. Definition 5.2.1 A generalised-hierarchical variable is a hierarchical variable with classes that can be represented extensionally or intensionally. A large-domain variable can also be seen as a generalised-hierarchical vari-able with only one class (root class) represented intensionally. Thus, both hierarchi-cal and large-domain variables are instances of generalised-hierarchical variables. For example, consider the hierarchical variable Location shown in Figure 5.1 (a). The tree hierarchy of Location is shown in Figure 5.1 (b). At the very high level, all of the values of Location are grouped together as western canada. Then, western canada is partitioned into two disjoint classes, be and alberta, and so on. As discussed before, Location's one child Legend depends on each value repre-sented by the classes be and alberta. However, Location's other child, RockType, exploits hierarchical structure on Location's values. In this case, to define the con-ditional probability of Legend we need to consider all of the values represented by classes be and alberta, i.e., we may need to use intensional definitions. However, to define the conditional probability of RockType, we do not consider the values repre-sented by the classes in the hierarchy of Location. Thus, Location is a generalised-hierarchical variable. 172 5.3 Representation In this section, we discuss how to represent the conditional probability distributions of Bayesian networks with generalised-hierarchical variables. There are two main issues that need to be considered: C I : specifying the conditional probability for generalised-hierarchical variables C2: specifying how variables are conditioned on generalised-hierarchical variables The simplest case of CI is how to specify the prior probability distribution for a generalised-hierarchical variable. The simplest case of C2 is how to spec-ify the conditional probability of a simple variable conditioned on a generalised-hierarchical variable. The more complicated cases can be built from these two cases. First, we discuss how to represent the probability distributions for these two cases. Next, we present the general case using the representations developed for these two cases. 5.3.1 Specifying the Prior Probability Distribution for a Gener-alised Hierarchical Variable In Chapter 4, we discussed a top-down approach for defining the prior probability distribution of a hierarchical variable. In the top-down approach, we consider that each class in the hierarchy defines a probability distribution over its children. To represent the prior probability distribution of the generalised-hierarchical variables, we can use the same approach. Note that for classes that have only few children, we 173 can define the probability distribution over their children extensionally (by listing the values). For the leaf-classes that represent a large number of values, we can define the probability distribution using the intensional definitions. We can define the distribution of a class over its children extensionally or intensionally using the s CPD language representation as developed in Chapter 3. For example, suppose we want to specify the prior probability distribution over the generalised-hierarchical variable Location as discussed in the previous sec-tion. Note that the domain of Location is all (X, Y) locations in Western Canada. Suppose the class western canada has only two children. We specify the probability distribution of western canada over its immediate subclasses: P(bc| western canada) = 0.6 P(alberta\western canada) = 0.4 We can specify the probability distribution over the immediate subclasses of be. Suppose we have as part of this distribution: P( north bc\bc) = 0.4 P(south bc\bc) = 0.3 Suppose the distribution of classes north be and south be over their values need to be represented intensionally. Suppose both north be and south be define a uniform distribution over their values. Then we can represent the distribution of north be and south be over their values as follows: P[Location\Location G north be) = uniform("north be") P(Location\Location G south be) — uniform("south be") 174 The function uniform("north be") defines an uniform distribution over all the lo-cations in northern BC. Notice that how the representations that we developed in Chapters 3 and 4 can be combined to represent the prior probability distribution of a generalised-hierarchical variable. 5.3.2 Conditional Probability of a Simple Variable Conditioned on a Generalised-Hierarchical Variable Suppose that H is a generalised-hierarchical variable, F is a simple variable, and H is the only parent of F. To represent the conditional probability of F in a compact manner, we use the approach discussed in Section 4.5.2. The conditional probability distribution P(F\H) is specified by specifying the default conditional probability distribution over F for some classes Cj in the hierarchy of H. We have two cases: • If the distribution over F does not depend on the set of values represented by Cj, we define Pd{F\Cj). • Otherwise, we specify Pd(F\H A H 6 Cj). The default distribution Pd(F\H A H e Ck) can be inherited by the subclasses of Ck in the same way as Pd(F\Cj) is inherited by the subclasses of C, as discussed in Section 4.5.2. The default distributions Pds can be specified extensionally or intensionally using the CPD language. The conditional probability of F for any class Ck of H, P(F\Ck) (or P(F\H AH e Ck)), can be computed using inheritance, as discussed in Section 4.5.2. Example 5.3.1 Consider the problem of representing the conditional probability of 175 simple variable Legend conditioned on Location in the Bayesian network shown in Figure 5.1. Suppose Legend has four values: {"mountain", "hill", "water", "plain"}; Legend is a deterministic function of Location that is obtained by consulting the ap-propriate map. To represent P(Legend\Location) we could have: Pd(Legend = v\Location A Location € be) = ( if (consult Jjcjnap(Location) = v) then 1 else 0 ) Pj(Legend = v\Location A Location G alberta) = ( if (consult jedbertajnap(Location) = v) then 1 else 0 ) The function consult Jjcjnap returns the legend value of a location that is in British Columbia. We assume here that we have procedures that compute the functions consult jcdbertajnap(Location) and consultJjcjnap(Location). 5.3.3 Arb i t rary Probability Distribution The general case is when a random variable (generalised-hierarchical, hierarchical or simple) can have any number and any kind of variables as its parents. Let H be a generalised hierarchical variable. Suppose Pah(H) denotes H's hierarchical (and generalised-hierarchical) parents and Pa"(H) denotes its simple parents. To represent the conditional probability of H conditioned on its parents, we can assume the following: • Each non-leaf (internal) class in the hierarchy of H has a probability distribu-tion over its immediate subclasses that is conditioned on (some of) the parents 176 of H. Let F be a variable corresponding to an internal class Ck in the hierar-chy of H. The values of F are the immediate subclasses of Ck. For each F we specify the conditional probability P [F\Pah (H) A Pas (H)). • Each leaf-class that represents a large number of values has a distribution over its values that is conditioned on (some of) the parents of H, which is specified intensionally using the CPD language. Let Q be a leaf class that represents a large number of values. For each Q we specify the conditional probability P (H\H G C, A Pa'1 (H) A Pa? (H)). The conditional probability of F given its parents, P (F\Pah (H) A Pas (H)) (or P (H\H G Q A Pa'1 (H) A Pas (//))), is specified by specifying the default prob-ability distribution over F, Pd(F\Pas(H) A Pah(H) = X) using the CPD language representation, for some parent contexts X as discussed in Section 4.5.3. The parent contexts X are such that there is at least one default parent context along every path from the root in tree(Hx) x . . . x tree(Hn), where Hx,.. ., Hn are the generalised-hierarchical parents of F, and tree(Hi) denotes the tree hierarchy of//,. Note that the problem of multiple inheritance can arise with generalised-hierarchical variables, as discussed in Section 4.5.3; we explicitly disallow this to occur. We are thus forcing the user to disambiguate the cases in which there would otherwise be a problem of multiple inheritance. The conditional probability of F (or (H\H G Cj)) for a parent context X can be computed in the same way we computed it for hierarchical variables in Section 4.5.3. 177 In this section, we discuss how to represent the conditional probability dis-tributions for Bayesian networks with generalised-hierarchical variables. In the next section, we discuss how to perform inference in Bayesian networks with generalised-hierarchical variables using the inference algorithms that we developed in Chapters 3 and 4. 5.4 Inference in Bayesian networks with Generalised-Hierarchical Variables Similar to a hierarchical variable, observation about a generalised-hierarchical vari-able can also be positive and negative and involves classes from its tree hierarchy. Observation about a generalised-hierarchical variable can be interpreted in the same way we interpreted for hierarchical variables in Section 4.6. Let B be a Bayesian network with generalised-hierarchical variables. The inference task is to compute P(Q\e), where Q is a simple variable or a class in the hierarchy of hierarchical (generalised-hierarchical) variable and e is the allowable observations, as defined in Section 4.6. For performing inference in Bayesian networks with generalised-hierarchical variables, we can use the same approach that we used with hierarchical variables. That is, given some evidence and a query, construct an abstract Bayesian net-work by replacing the generalised-hierarchical variables with large-domain vari-ables whose values are partitioned into a minimum number of blocks to answer the query. We can then answer the query from the abstract Bayesian network. Note that 178 hierarchical variables are replaced by simple variables.. The abstract Bayesian network, constructed from the Bayesian network with generalised-hierarchical variables, involves variables with large domains; its con-ditional probabilities in CPD language form. We can use the Large Domain V E algorithm developed in Chapter 3 for making inference in the constructed abstract Bayesian network with large-domain variables. The algorithm for constructing an abstract Bayesian network from a Bayesian network with generalised-hierarchical variables is essentially the same as the algorithm Abstract_Bayes, except for the following step: Abstraction of generalised-hierarchical variables: Let H be a generalised-hierarchical variable that needs to be abstracted. Let Vex be the set of exceptional classes of H with respect to its relevant children. Suppose Ha denotes the abstrac-tion of H, a large-domain variable. The domain of Ha is partitioned into disjoint subsets (blocks). During the abstraction of H, we compute the blocks for Ha. We compute the set of values (a block) v for Ha, for each class Q € Vex as follows: • v = Vck, if Q does not have any exceptional strict subclasses. The set VQ denotes the set of values represented by class Ck. • Otherwise, v = VCk — VCl — ... — Vc,„, where C\,... ,Cm are the highest strict subclasses of Q that are in Vex. The set v contains all the values that are in Q and are not covered by other blocks. The sets that are empty can be removed from the partition of H" and H is replaced by H". The conditional probabilities for the abstract Bayesian network are 179 constructed as discussed in Section 4.8.1 for hierarchical variables. Example 5.4.1 Consider constructing an abstract Bayesian network from the Bayes network with generalised-hierarchical variables shown in Figure 5.1 (a). The vari-able Location is a generalised-hierarchical variable. The class hierarchies of Location and RockType are shown in Figure 5.1 (b). Suppose we have observed RockType G crystalline igneous rock. The conditional probability P(Legend\Location) is spec-ified using the intensional definitions discussed in Example 5.3.2. The query is: P(Location G south bc\RockType G crystalline igneous rock ). The conditional probability P(RockType\Legend A Location) is specified by specifying the default distribution of each class in the hierarchy of RockType over their immediate subclasses for some of the classes in the hierarchy of Location. The default distribution of each class in the hierarchy of RockType is specified for the classes be and alberta in the hierarchy oi Location. The default distribution of each class in the hierarchy of RockType for the contexts be and alberta depends on the values of Legend. Suppose for the context be, Legend — "hill" the class allrocks defines the following distribution: context: (be) Pd(metamorphic rock\ allrocks A be A Legend = "hill") = 0.4' Pd(igneous rock\ allrocks A be A Legend — "hill") = 0.3 Pd(sedimentary rock\allrocks Abe A Legend = "hill") = 0.2 P d(deposits\allrocks Abe A Legend = "hill") = 0.1 The class east squamish in the hierarchy of Location affects the distribution of class allrocks in the hierarchy of RockType. The distribution of class allrocks for 180 the context east squamish does not depend on the values of Legend (we have some prior knowledge about east squamish). For the context east squamish the class allrocks defines the following distribution: context : (east squamish) Pa-(metamorphic rock\allrocks A east squamish) = 0.2 Pa-(igneous rock\allrocks A east squamish) = 0.6 P ti (sedimentary rock\allrocks A east squamish) = 0.1 Pci(deposits\allrocks A east squamish) = 0.1 Construction of an abstract Bayesian network: To construct an abstract Bayesian network from the Bayesian network shown in Figure 5.1 (a), we traverse the net-work from the leaves upwards. The simple variables Acidity and GrainSize are pruned because they are irrelevant. We abstract the hierarchical and generalised-hierarchical variables RockType and Location. Abstraction of RockType (observed hierarchical variable): We have observed RockType € crystalline igneous rock. The variable RockType does not have any children, so RockType is abstracted by a simple variable RockType" with domain: Val(RockType") = {" igneous rock type", "allrocks - igneous rock type"} After abstracting RockType we abstract Location. Abstraction of Location: To abstract Location, we find those classes of Location that are exceptional for Legend and RockType. The variable Location has four ex-ceptional classes: east squamish, south be, be , and alberta. We abstract Location 181 by large-domain variable Location". The values of Location" are partitioned into four disjoint subsets: Val(Location") = {VCl, VC,, VC,, Vc4} where Vci is a set of values represented by class east squamish, Vc2 is a set of values represented by class difference: (south be — east squamish), Vc3 is a set of values represented by class difference: (be — south be), and Vc4 is a set of values represented by class alberta. 1 After abstracting the hierarchical variables the algorithm compute the con-ditional probabilities for the abstract Bayesian network. The first case we consider is the prior probability of Location". Construction of P'(Location"): The prior probability P(Location") is computed as follows: P(Location". EVQ) = P(east squamish\southbc) x P(south bc\bc) xP(bc\western canadq) P(Location" 6 Vc2) — P(south bc\bc) x P(bc\western Canada) -P(Location" e VCl) P(Location" € Vc3) = P(bc\western Canada) — P(Location" G Vc2) P(Location" G Vc4) = P(alberta\westernCanada) Note that P(Location" G VQ) represents the probability mass of all the values of Location" that are in set VQ. 182 Construction of P(Legend[Location"): To compute the conditional probability P(Legend\Location"), the algorithm compute P(Legend[Location" G V) for each block V of Location" using inheritance. The conditional probability P(Legend\Location") is shown below: P(Legend = v\Location" G V c J = ( i f (consult Jbc jnap(Location") = v) then 1 else 0 ) P(Legend = v\Location" G Vc2) = ( i f (consultJbcjnap(Location") = v) then 1 else 0 ) P(Legend = v\Location" G Vc3) = ( i f (consult Jbcjnap(Location") = v) then 1 else 0 ) P(Legend = v\Location" G Vc4) = ( i f (consultjalbertajnap(Locationa) = v) then 1 else 0 ) Construction of P(RockType" [Location" A Legend): To compute the conditional probability P(RockType" \Location"ALegend), the algorithm compute P(RockType" = va\Locationa G V" A Legend) for each value va of RockType", for each block V" of Location". The conditional probability P(RockType" [Location" A Legend) is shown below: P(RockType" = " crystalline igneous rock "[Legend = V A Location" G VCl) — P (crystalline igneous rock\igneous rock Abe A Legend = V) X P(igneous rock[allrocks A east squamish) P(RockType" — " crystalline igneous rock "[Legend — V A Location" G VC2) 183 = P(crystalline igneous rock\igneous rock Abe A Legend = V) x P(igneous rock\allrocks Abe A Legend = V) P(RockType" = " crystalline igneous rock "\Legend = V A Location" £ Vc3) = P(crystalline igneous rock\igneous rock Abe A Legend = V) x P(igneous rock\allrocks A be A Legend = V) P(RockType" = " crystalline igneous rock "\Legend = V A Location" £ Vc4) = Picrystalline igneous rock\igneous rock A alberta A Legend = V) x P(igneous rock\allrocks A alberta A Legend = V) P(RockType" = "allrocks - crystalline igneous rock"\Legend A Location") = 1 — P(RockType" = " crystalline igneous rock "\Legend A Location") Note that variable RockType" is an observed variable, with assignment RockType" = "crystalline igneous rock". After constructing the abstract Bayesian network we can answer the query using Large Domain V E algorithm to sum out all non-observed, non-query variables from the abstract Bayesian network. The abstraction is done as part of the standard elimination of irrelevant variables. 5.5 Conclusion In this chapter, we discuss how the structures that we considered in Chapters 3 and 4 can be put together in a common framework that can be applied to a more 184 general class of problems. We extend the definition of hierarchical variables by considering that classes can be represented extensionally or intensionally. We show here that the representations that we have developed in Chapters 3 and 4 can be combined together to represent the CPDs of Bayesian networks with generalised-hierarchical variables. To perform inference in Bayesian networks that exploit both kinds of structures we can first construct an abstract Bayesian network that may have variables with large domains, and then use Large Domain V E to answer the query from the abstract Bayesian network. 185 Chapter 6 Conclusions 6.1 Summary In this thesis, we have presented two approaches for probabilistic inference in com-plex systems that have discrete random variables with very large or infinite do-mains 1. To perform efficient inference in Bayesian networks that have discrete variables with large domains, the structure we exploit is the ability to group the values. Rather than reasoning at the level of individual values, the structured rep-resentation allows us to partition the values into subsets and the values in these subsets are treated as if they were a single value. We first consider the case where the conditional probabilities can be rep-resented in a compact manner using both intensional (in terms of functions and predicates) and extensional (by listing the values) definitions. To represent such conditional probabilities, we developed a C P D language. To deal with the com-'The complexity of exact probabilistic inference in a Bayesian network is exponential in tree width, where the base of the exponent is the domain size. 186 plexity of inference, we exploit the fact, when there is no evidence or query to dis-tinguish between all the values of a variable, we do not need to reason about each value separately. Rather, we can reason about a group of values together as a single entity. We developed a structured inference algorithm, Large Domain VE, for mak-ing inference in Bayesian networks that have discrete variables with large domain. In this case the partitions are inferred at run time and depend on what is observed and what is queried. We describe a motivating example application: person identification prob-lem. We present the application of Large Domain V E to the person identification problem. Large Domain V E dynamically partitions the domains of large-domain variables for efficient inference. The point is that there is a natural Bayesian net-work representation for the person identification problem for which inference was not possible before. In Chapter 4, we consider the case where there is a priori hierarchical struc-ture on the values of the variables. That is, the large number of values of the variables can be represented as tree hierarchies. We call such variables hierar-chically structured variables. To exploit the structure provided by tree hierarchies in probabilistic reasoning we developed a representation language for representing the conditional probability distributions in a compact manner. We represent the dis-tribution of the hierarchical variables by specifying, for each class, the probability distribution over its immediate subclasses. To represent the conditional probability distribution of any variable conditioned on a hierarchical variable we use inheri-tance. 187 To perform efficient inference in Bayesian networks that have hierarchically structured variables, we construct an abstract Bayesian network dynamically, given some evidence and a query, by collapsing the hierarchies to include only those val-ues necessary to answer the query. We can answer the query from the abstract Bayesian network using any standard probabilistic inference algorithm. The do-main size of the variables in the abstract Bayesian network is independent of the size of the hierarchies; it depends on how many of the classes in the hierarchies are supported directly by the evidence or relevant to the query. Thus, the proposed approach is applicable even when the hierarchy of the variable is conceptually infi-nite. For example, with a spatial hierarchy, as long as we have a way to compute the probability distribution over subclasses, there is no reason not to have a hierarchy that is infinitely detailed. It is only when we ask a query given observations that we need to restrict the size of the hierarchy. We presented experimental results that show that hierarchically structured values can make the reasoning more efficient than reasoning over the set of all values. Finally, we described a framework that combines both intensional definitions of the conditional probability distributions and hierarchically structured values to-gether so that it can be applied to a more general class of problems. 6.2 Future Work This thesis has created many new opportunities for future work. This section sug-gests some possible directions future research may take. 188 The proposed approach for person identification involves pairwise match-ing. This approach works in two cases: first, when given a pair of records deter-mining if they refer to the same person; and second, given two databases of records, where each record corresponds to a different person, determining matching records between them. However, this approach is not directly applicable when given a database we need to find which records refer to same people. One way our method can be extended is to build on the work of Pasula et al. [2002]. In our work with hierarchically structured variables in-Chapter 4, we con-sider that the values of a hierarchically structured variable are arranged in a tree hierarchy form. We consider that each class in a tree hierarchy has only one parent. In the future, we would like to consider that the values of the variables can be ar-ranged in a lattice hierarchy form. That is, we want to allow a class to have multiple parents. In this case, we will need a different mechanism for defining the prior (and conditional) distribution of a hierarchical variable. We also have to deal with the problem of multiple inheritance, since a class has more than one parent. We need techniques for handling multiple inheritance. Recently, Poole [2003] presented first order probabilistic inference algo-rithms based on variable elimination and unification for reasoning about multiple individuals on the first-order level. He exploits the fact that all the individuals that we do not have any fact about can be consider as a group. In this thesis we consider how to combine all those values of a variable that are the same, given the evidence and query, as a single abstract value. We did not consider here how to combine all those individuals that are the same, given the evidence and query, as a group. In 189 the future, we would like to put these two works together in a common framework where we have large number of individuals and variables with large domains. Many real problems require spatial hierarchies for modelling large domains. For example, consider the problem of decision making.in forestry: which trees should be cut down in B C (Canada) given preferences and uncertainty. In this case we can represent the forest area in a tree hierarchy form. To make the decision, we may not be able to model this spatial hierarchy by a hierarchically structured vari-able. This is because in this case as an observation we can have number of locations (trees in that location) in the hierarchy tliat are infected with pine beetle, i.e., we have more than one positive observation for a hierarchical variable. However, for a hierarchically structured variable more than one value cannot be true at the same time. The proposed approach for handling a hierarchy as a hierarchical variable is not applicable with hierarchies where more than value is true. One idea for handling such hierarchies is to consider each class in the hierarchy as a node of the Bayesian network. In the future, we would like to extend our work to handle such hierarchies. Many practical systems that are based on Bayesian networks are used in environments where the evidence arrives incrementally rather than coming all at once and is interleaved with belief updating. In the future, we would like to ex-tend our work with hierarchically structured variables to incorporate evidence that arrives incrementally. To perform inference in a Bayesian network with hierarchi-cally structured variables, we construct an abstract Bayesian network, given some evidence and query, by pruning the irrelevant variables and abstracting the hierar-chical variables. In the case of incremental evidence we may not be able to prune 190 such variables. This is because the future observations can make these variables relevant. One idea is to consider these irrelevant variables as unit value variables (variables with only one value) and lazily revise the domain of these variables based on the new evidence. Thus, each new evidence can be viewed as invalidating some of the previously computed abstractions and conditional probability tables of the abstract Bayesian network. In this framework we can potentially use the Junction Tree algorithm. We can build the structure of Junction tree offline and we can re-vise the domain of the variables and conditional probability tables online as new evidence arrives. One reason exact inference methods are developed is as basis of approxi-mate inference methods. For example, both Rao-Blackwellization [G.Casella and Robert, 1996] and Mini-Buckets [Dechter, 1997] are approximate methods and use exact inference. In future one could explore new approximate algorithms, based on the algorithm proposed in this thesis, for solving networks that are too big for exact inference. We would like to apply our work with hierarchical variables on more realis-tic problems. During the course of this thesis we were unable to find any realistic networks. We notice that finding a real network is like the chicken-and-egg prob-lem. There is no point in building a network, if there is no method to solve it. However, there are domains that involve hierarchies where people need to make decisions. For example, the medical and geological domains often involve big hi-erarchies. Using the tools proposed in this thesis, people are now able to solve probabilistic models that involves hierarchies and so may start to build them. 191 With the development of the semantic web, people are using ontologies to describe the data (e.g., O W L , RDF). Our work with hierarchically structured vari-ables, where the domain of a variable can be represented in a tree hierarchy form (e.g., in a taxonomy as part of the ontology), shows some progress in the direction of mixing ontologies and probabilities. Finally, we would like to extend our work so that it can be applied to the problems that involve ontologies and uncertainties. 192 Index abstract Bayesian network , 135 abstract value, 135 abstraction of hierarchical variable, 134 allowable observation, 127 barren variable, 29 big factor, 66 blocks, 63 class, 109 class hierarchy, 109 constrained treewidth, 89 contextually independent, 31 default parent contexts, 122 default probability distribution, 118 elimination ordering, 21 evidence factor, 85 evidenced leaf value, 84 exceptional classes, 118 flattened Bayesian network , 135 193 generalised-hierarchical variable, 172 hierarchical variable, 109 induced domain size, 89 induced treewidth, 89 negative observation, 126 parent context, 121 parent context hierarchy, 122 partition, 63 path, 33 positive observation, 126 proper superparent context, 121 relevant exceptional class, 135 superparent context, 121 treewidth, 89 Bibliography Arnborg, S., Corneil, D . and Proskurowski, A . [1987]. Complexity of finding em-beddings in a k-tree, SIAM Journal of Algebraic and Discrete Methods, Vol. 8(2), pp. 177-284. Bell , G. B . and Sethi, A . [2001]. Matching records in a national medical patient index, Communication of the ACM, Vol. 44, pp. 83-88. Boutilier, C , Friedman, N . , Goldszmidt, M . and Koller, D . [1996]. Context-specific independence in Bayesian networks, In Proceedings of Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96), pp. 115-123. Buchanan, B . G . and Shortliffe, E . H . (eds) [1984]. Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Addison-Wesley Series in Artificial Intelligence, Addison-Wesley, Reading, M A . Chen, S. F. and Goodman, J. T. [1998]. An empirical study of smoothing techniques for language modeling, Technical Report TR-10-98, Computer Science Group, Harvard University. 194 Cooper, G. [1990]. Probabilistic inference using belief networks is np-hard, Artifi-cial Intelligence, Vol. 42, pp. 393-405. Dagum, P. and Luby, M . [1993]. Approximating probabilistic inference in Bayesian belief networks is np-hard, Artificial Intelligence, Vol. 60(1), pp. 141-153. Dechter, R. [1996]. Bucket elimination: A unifying framework for probabilistic in-ference, In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96), pp. 211-219. Dechter, R. [1997]. Mini-buckets: A general scheme for generating approximations in automated reasoning, In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 1297-1303. Duda, R. O., Hart, P. E . and Stork, D. G. [2000]. Pattern Classification, second edn, Wiley-Interscience Publication John Wiley and Sons Inc. Fellegi, I. and Sunter, A . [1969]. A theory for record linkage, Journal of the Amer-ican Statistical Association, pp. 1183-1210. Friedman, N . , Linial, M . , Nachman, I. and Pe'er, D . [2000]. Using Bayesian net-works to analyze expression data, In Proceedings of the fourth annual in-ternational conference on Computational molecular biology, Tokyo, Japan, pp.127-135. Friedman, N . and Singer, Y. [1998]. Efficient Bayesian parameter estimation in large discrete domains, Advances in Neural information Processing Systems (NIPS-98). G.Casella and Robert, C. [1996]. Rao-blackewellization of sampling schemes, Biometrika, Vol. 47(3), pp. 81-94. Geiger, D. and Heckerman, D. [1996]. Knowledge representation and inference in similarity networks and Bayesian multinets, Journal of the Artificial Intelli-gence, Vol. 82, pp. 45-74. Geiger, D., Verma, T. and Pearl, J. [1990]. d-separation: From theorems to al-gorithms, In Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence (UAI-90), pp. 139-148. Gill , L. [1997]. Ox-link: The Oxford medical record linkage system, Record Link-age Techniques, National Academy Press, pp. 15-33. Good, I. [1953]. The population frequencies of species and the estimation of pop-ulation parameters, Journal of the American Statistical Association, pp. 237-264. Gordon, J. and Shortliffe, E. [1985]. A method of managing evidential reasoning in a hierarchical hypothesis space, Artificial Intelligence 26: 323-57. Heckerman, D. [1990]. Probabilistic Similarity Networks. Ph.D. thesis, Stanford University. Heckerman, D. and Breese, J. [1994]. A new look at causal independence, In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94), pp. 286-292. 196 Heckerman, D., Breese, J. and Rommelse, K . [1994]. Troubleshooting Under Un-certainty, Technical Report MSR-TR-94-07. Hernandez, M . and Stolfo, S. [1995]. The merge/purge problem for large databases, Proceedings of the SIGMOD Conference, San Jose, pp. 127-138. Horvitz, E. , Koch, P., Kadie, C M . and Jacobs, A . [2002]. Coordinate: Probabilis-tic forecasting of presence and availability, In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02), pp. 224-233. Kjaerulff, U . [1990]. Triangulation of graphs-algorithms giving small total state space. Technical Report T R R 90-09, Department of Mathematics and Com-puter science, Strandvejen, Aalborg, Denmark. Koller, D . and Pfeffer, A . [1998]. Probabilistic frame-based systems, In Proceed-ings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 580-587. Lauritzen, S. and Spiegelhalter, D . [1988]. Local computations with probabilities on graphical structures and their application to expert systems, Journal of Royal Statistics Society. Series B, Vol. 50(2), pp. 157-224. Mackworth, A . K. , Mulder, J. and Havens, W. [1985]. Hierarchical arc consistency: exploiting structured domains in constraint satisfaction problems, Computa-tional Intelligence, Vol. 1(3), pp. 118-126. Monge, A . E . and Elkan, C. [1997]. An efficient domain-independent algorithm for detecting approximately duplicate database records, In Proceedings of the 197 SIGMOD-1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 23-29. Neapolitan, R. E. [2004]. Learning Bayesian Networks, Prentice Hall. Pasula, H., Marthi, B., Milch, B., Russell, S. and Shpitser, I. [2002]. Identity un-certainty and citation matching, Advances in Neural Information Processing Systems (NIPS-02), pp. 1401-1408. Pearl, J. [1986]. On evidential reasoning in a hierarchy of hypotheses, Artificial Intelligence 28: 9-15. Pearl, J. [1988]. Probabilistic reasoning in intelligent systems: networks of plausi-ble inference, Morgan Kaufmann Publishers Inc. Peterson, J. L. [1986]. A note on undetected typing errors, Communication of the ACM, Vol. 29(9), pp. 633-637. Poh, K. and Fehling, M. [1993]. Probabilistic conceptual network: A belief repre-sentation scheme for utility-based categorization, In Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence (UAI-1993), pp. 166-173. Pollock, J. and Zamora, A. [1987]. Automatic spelling correction in scientific and scholarly text, ACM Computing Surveys, Vol. 27(4), pp. 358-368. Poole, D. [2003]. First order probabilistic inference, In Proceedings of the Eigh-teenth International Joint Conference on Artificial Intelligence (IJCAI-03), pp. 985-991. 198 Poole, D . and Zhang, N . L . [2003]. Exploiting contextual independence in proba-bilistic inference, Journal of Artificial Intelligence Research, Vol . 18, pp. 263-313. Quillian, M . [1968]. Semantic memory, in M . Minsky (ed.), Semantic information processing, Cambridge M A : MIT Press, pp. 227-270. Quinlan, J. [1986]. Induction of decision trees, Machine Learning, Vol . 1, pp. 81 -106. Shachter, R. [1986]. Evaluating influence diagrams, Operation Research, Vol. 6, pp. 871-882. Sharma, R. [2004]. Representation and reasoning with hierarchically structured variables, In AAAI Technical Reports of the KR-04 Doctoral Consortium (KRDC2004). Sharma, R. and Poole, D. [2003]. Efficient inference in large discrete domains, In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelli-gence (UAI-03), pp. 535-542. Sharma, R. and Poole, D. [2005a]. Probabilistic reasoning with hierarchically struc-tured variables, In Proceedings of the Nineteenth International Joint Confer-ence on Artificial Intelligence (IJCAI-05), pp. 1391-1397. Sharma, R. and Poole, D. [2005b]. Probability and equality: A probabilistic model of identity uncertainty, In Proceedings of the Eighteenth Canadian Conference on Artificial Intelligence (AI 05), pp. 227-231. 199 Shenoy, P. and Shafer, G. [1986]. Propagating belief functions with local computa-tions, IEEE Expert 1: 43-52. Wang, G , Chen, H. and Atabakhsh, H. [2004]. Automatically detecting deceptive criminal identities, Communications of the ACM, Vol. 47(3), pp. 70-76. Zhang, N. L. and Poole, D. [1994]. A simple approach to Bayesian network com-putation, In Proceedings of the Tenth Canadian Conference on Artificial Intel-ligence, pp. 171-178. Zhang, N. L. and Poole, D. [1996]. Exploiting causal independence in Bayesian network inference, Journal of Artificial Intelligence Research, Vol. 5, pp. 301-328. 200
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Probabilistic inference with large discrete domains
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Probabilistic inference with large discrete domains Sharma, Rita 2006
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Probabilistic inference with large discrete domains |
Creator |
Sharma, Rita |
Publisher | University of British Columbia |
Date Issued | 2006 |
Description | The straightforward representation of many real world problems is in terms of discrete random variables with large or infinite domains. For example, in a domain where we are trying to identify a person, we may have variables that have as domains, a set of all names, a set of all postal codes, and a set of all credit card numbers. The task usually reduces to performing probabilistic inference, i.e., compute the probability of some values of some random variables given the values of some other variables. Bayesian networks are a compact way to represent joint probability distributions. This thesis is concerned with probabilistic inference in Bayesian networks that have discrete random variables with large or infinite domains. Carrying out inference in Bayesian networks that have variables with large domains is a difficult problem. For efficient inference, we consider cases where there is some structure that can be exploited to make inference efficient. In this thesis we consider two kinds of structures that can be exploited for efficient inference. These structures allow us to partition the large number of values in equivalence classes. Rather than reasoning about every value of a variable individually, we can reason about a set of values in a single step. We first consider the case where there are intensional definitions of the conditional probability distributions. To represent these conditional probabilities, we introduce a CPD language that allows us to define the conditional probabilities procedurally, in terms of predicates and functions. We present an inference algorithm, Large Domain VE, for the CPD language that uses this representation to partitions the domains of the variables dynamically. The partitions depend on what is observed and what is queried. We apply Large Domain VE to the person identification problem that has variables with large domains. The second case we consider where there is a priori internal structure on the values of the variables. In particular, we consider the case where the values of the variables are represented as tree hierarchies. We call such variables hierarchically structured variables. We present a language for representing the conditional probabilities of Bayesian networks with hierarchically structured variables. To perform inference in Bayesian networks with hierarchically structured variables we construct an abstract Bayesian network dynamically, given some evidence and a query, by collapsing the hierarchies to include only those values necessary to answer the query. We can answer the query from the abstract Bayesian network using any standard inference algorithm. Finally, we show how both intensional definitions of the conditional probability distributions and hierarchically structured values can be put together to produce a general framework that can be applied to a more general class of problems. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-01-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051316 |
URI | http://hdl.handle.net/2429/18561 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2006-200865.pdf [ 6.89MB ]
- Metadata
- JSON: 831-1.0051316.json
- JSON-LD: 831-1.0051316-ld.json
- RDF/XML (Pretty): 831-1.0051316-rdf.xml
- RDF/JSON: 831-1.0051316-rdf.json
- Turtle: 831-1.0051316-turtle.txt
- N-Triples: 831-1.0051316-rdf-ntriples.txt
- Original Record: 831-1.0051316-source.json
- Full Text
- 831-1.0051316-fulltext.txt
- Citation
- 831-1.0051316.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051316/manifest