LINK STRENGTH IN BAYESIAN NETWORKS

by

BRENT BOERLAGE

B.A.Sc., The University of British Columbia, 1983

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1992
© Brent Boerlage, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

This thesis introduces the concept of a connection strength (CS) between two nodes of a propositional Bayesian network (BN). Connection strength is a generalization of node independence, from a binary property to a graded measure. The connection strength from node A to node B is a measure of the maximum amount that the belief in B will change when the truth value of A is learned. If the belief in B does not change, they are independent, and if it changes a great deal, they are strongly connected. It also introduces the link strength (LS) between two adjacent nodes, which is an upper bound on that part of the connection strength between them which is due only to the link between them (and not other paths which may connect them). Calculating connection strengths is computationally expensive, while calculating link strengths is not.
An algorithm is provided which finds a bound on the connection strength between any two nodes by combining link strengths along the paths connecting them (which is of complexity linear in the number of links). Such an algorithm lends substance to notions of an "effect" flowing along paths, and "effect" being attenuated by "weak" links, which is terminology that has appeared often in the literature, but only as an intuitive idea. An algorithm for faster, approximate BN inference is presented, and connection strengths are used to provide bounds for its error. A system is proposed for BN diagrams to be drawn with strong links represented by heavy lines and weak links by fine lines, as a visualization aid for humans. Another visualization aid which is explored is the iso-CS contour map, in which the amount that one particular node can affect each of the other nodes is shown as contour lines superimposed on a regular BN diagram. A non-trivial example BN is presented, some of its connection strengths are calculated, CS contour maps are constructed for it, and it is displayed with link strength indicated by line width.

Contents

Abstract
Contents
Acknowledgments
1 Introduction
2 Bayesian Networks
2.1 When do we Reason With Uncertainty?
2.2 Using Probability for Uncertain Reasoning
2.3 Axioms of Probability Theory
2.4 Representing Independencies
2.4.1 d-Separation Algorithm
2.5 Bayesian Networks
2.6 Bayesian Network Inference
2.6.1 Virtual Evidence
2.6.2 Inference on BN Example
3 Connection and Link Strengths
3.1 Connection Strength Definition
3.2 ΔP Connection Strength
3.2.1 Single Link Example
3.2.2 Range of ΔP Connection Strengths
3.3 ΔO Connection Strength
3.3.1 Single Link Example
3.3.2 Range of ΔO Connection Strengths
3.4 An Alternate Definition of Connection Strength
3.5 Conditional Strength and Potential Strength
3.6 Connection Strength in Complex BNs
3.7 Commutativity of Connection Strength
3.8 Link Strength Definition
3.9 Comparing Link and Connection Strengths
4 Using Link Strength to Bound Connection Strength
4.1 ΔP Serial Combination
4.2 ΔO Serial Combination
4.3 Fundamental Sensitivity Equation
4.4 Example of Finding CS by Fundamental Equation
4.5 Path Based Methods
4.5.1 The CS Path Algorithm
4.6 Complexity of Path Algorithm
4.7 Dealing With Evidence
5 Applications
5.1 BN Link Display
6 Conclusion
6.1 Further Work
6.1.1 Qualitative Probabilistic Networks
6.1.2 Greater Computation for Tighter Bounds
6.1.3 Multistate Nodes
7 Bibliography
A Notation and Nomenclature
Abbreviations
B Conventional Statistical Measures of Association
C Proofs
Theorem 3.1 - Equivalence of CS Definitions
Theorem 3.4 - Alternate CS Definition
Theorem 4.1 - ΔP Serial Chaining
Theorem 4.2 - ΔO Serial Chaining
Theorem 4.3 - Fundamental Equation
Theorem 4.7:1 - Intercausal Link Strength
Theorem 5.3:2 - Approx. Inference Error Bound

Acknowledgments

I would like to thank:

David Lowe - My supervisor, who allowed enough flexibility in my program to really get the education I wanted and find the field that interested and benefited me most.

Judea Pearl - For writing Probabilistic Reasoning in Intelligent Systems.

David Poole - Who introduced me to probabilistic reasoning, and who created a group of people studying Bayesian networks at UBC.

The Bayesian network people at UBC - David Poole, Mike Horsch, Nevin (Lianwen) Zhang, Yang Xiang, Keiji Kanazawa, and Greg Provan for useful discussions, and David Lowe, David Poole and Mike Horsch for useful comments on my thesis.

The many graduate students at UBC who I had the pleasure of knowing during my long stay there. If I named them I would have a long list, and I'm sure I would miss some, so I'll resist the temptation to put even one name.

My wife, Christine - For patience and for support.

1 Introduction

Bayesian networks (BNs), also called belief networks or probabilistic causal networks, consist of graphs in which each node represents a variable of interest, and the links between the nodes indicate the probabilistic relationships between them. This thesis is restricted to BNs composed only of binary nodes (which are nodes representing variables that take on one of two values), and generally speaking we will interpret their value to mean that some proposition is TRUE or FALSE. Using a BN we can capture the relationships between our uncertain beliefs in the propositions, and then if we find out the truth value of one or more of the propositions, we can use BN inference algorithms to find updated beliefs for each of the other propositions, and updated relationships between the propositions. Uncertain reasoning is very common for humans, and will conceivably be common in future machines. Chapter 2 provides some examples of situations requiring uncertain reasoning.
Then it makes an argument in favor of using probabilities for such reasoning (in machines), and provides a well-known set of axioms for doing so. However, probabilistic reasoning can be extremely computationally expensive, and even quite small problems can be outside the range of practicality, unless we have some technology for taking advantage of the fact that generally our beliefs in some propositions are independent of our beliefs in others (Pearl88, Cooper90). This is the primary purpose of the BN graph. Chapter 2 goes on to show how BNs represent these independencies, and provides a non-trivial example of a BN. Some BN inference algorithms are also discussed, and it is pointed out that the BN inference problem is NP-hard (Cooper90).

Connection strength (CS) is a generalization of independence. Instead of simply indicating whether one proposition is independent of another, it provides a graded measure of how much our belief in one proposition can change when we learn the truth or falsity of another. Chapter 3 defines connection strength and explores some of its properties. It also introduces the concept of link strength (LS), which is a sort of "local connection strength" between two adjacent nodes of a BN. Link strength measures that part of the connection strength between two nodes that is due to the single link between them, and not due to any other paths from one of them to the other. Computing the connection strength between two nodes is generally even more computationally expensive than regular BN inference, but a link strength can be found very quickly using only the conditional probabilities stored at a single node. Chapter 4 presents an algorithm which uses link strength values to find a bound for the connection strength between any two nodes in time linear in the number of links in the BN.
The algorithm can be viewed as a summation over alternative paths between the nodes, which lends substance to notions of an "effect" flowing along paths, and "effect" being attenuated by "weak" links, which is terminology that has often appeared in the literature as an intuitive idea, but which has never been substantially formalized.

Using independence information allows BN inference algorithms to solve medium-sized BN problems in a reasonable amount of time. Using connection strengths we can determine which nodes are nearly independent, and then by assuming that they are independent, we can solve larger-sized BN problems in reasonable time, while obtaining approximate results. In Chapter 5 an algorithm is given which quickly provides bounds on the maximum error made during such an approximation.

Another application of link strengths explored in Chapter 5 is to display them on BN diagrams as a visualization aid for humans. For example, the width of the line representing a link can be used to indicate its link strength, with finer lines for weaker links, and thicker lines for stronger links. The BN of Chapter 2 is redrawn in such a manner as an example. BNs have been praised as a great tool for humans to visualize probabilistic relations, and this technique extends that tool by providing graded rather than binary independence information. Another visualization aid for BNs is the CS contour map, which indicates how much one node can affect each of the other nodes in the network, and is created by drawing iso-CS lines over the BN diagram. Chapter 5 contains a CS contour map for the example BN of Chapter 2. It also contains a contour map based on CS bounds calculated by the algorithm developed in Chapter 4. By comparing the two contour maps, the bounds may be compared with the true values.
Wellman90 introduces the concept of qualitative probabilistic networks (QPNs), which are networks with the same underlying topology as BNs, and whose purpose is to determine the direction of change in belief of one proposition when we learn the truth of another. We can consider connection strength as determining the maximum magnitude, and QPNs as determining the sign, of the same quantity. In fact, many of Wellman's results can be obtained from the connection strength equations, simply by modifying them to retain sign information (e.g. removing absolute value functions). This is briefly discussed in the "Further Work" section at the end of the thesis. Another possibility explored in the "Further Work" section is the development of a continuum of algorithms for determining CS bounds, which trade off tightness of bound for speed, from very fast algorithms that provide loose bounds, to very slow algorithms that produce exact values. The two endpoint algorithms (fastest and slowest) are the algorithms used in the body of this thesis, and a method is suggested for generating the in-between algorithms.

Notation is explained as it is introduced, but it is also summarized in Appendix A. Readers who are already familiar with BNs can skip the next chapter and go straight to chapter 3, using Appendix A as a guide for notation definitions they may have missed.

2 Bayesian Networks

This chapter provides a brief introduction to probabilistic reasoning and Bayesian networks (BNs), and states a number of well-known results which are used later in this thesis. It does not contain any original results, and so it may be skipped by the knowledgeable reader. Two good introductory books for Bayesian networks and related topics are Pearl88 (a "must read" for a thorough introduction) and Neapolitan90 (easier to read and has more recent results, but generally covers less material).

2.1 When do we Reason With Uncertainty?
Reasoning with uncertainty is the process of combining items of uncertain knowledge to obtain uncertain conclusions. Uncertain knowledge is knowledge which one would be willing to retract, or consider less certain, upon receiving knowledge (certain or uncertain) to the contrary. There are not many things we know that we wouldn't be willing to retract given enough evidence to the contrary, so much of our reasoning can be considered reasoning with uncertainty. In this thesis, the main purpose of studying uncertain reasoning will be to produce computer-based automated systems, although some of the results may apply to other intelligent agents.

In constructing an automated system, we must decide which of its information it should treat as uncertain. We may want it to treat some information as certain, even though it doesn't hold in all cases (or we don't know if it does), just to make the system simpler, or to give it a more predictable behavior. On the other hand, we may want it to consider much of what it learns, based on limited or imperfect observations, as uncertain information. Generally speaking, as the sophistication of our automated systems increases, a larger percentage of their knowledge should be treated as uncertain, since many of these systems will be more adaptable, will be learning more, and will be working in less well-defined domains in an autonomous manner.

There are numerous particular situations where an automated system may need to reason with uncertainty. Any approximate measurement is information with uncertainty. So combining uncertain or approximate observations, such as a set of physical measurements with some redundancy intended to increase accuracy or to detect a totally erroneous observation, requires some form of reasoning with uncertainty. The reasoning may be to simply take the mean or median of the set of measurements, or it may involve a complex analysis.
Any situation in which we have information coming from multiple sources, which may agree or conflict to varying degrees, requires reasoning with uncertainty. Examples are sensor fusion in a robot (i.e. combining sensory data), or merging news reports from different agencies.

In some reasoning situations, much of the knowledge involved is nearly certain, and we can gain huge computational savings by treating it the same way we treat knowledge that is certain. However, when we learn something that casts doubt on some piece of it, we may have to revert the status of that piece back to "uncertain", or even to "false", and suitably modify the status of related pieces. The methods of default reasoning have traditionally been used to do this.

Some "inverse" problems (such as diagnosis, machine vision, machine hearing, and other recognition problems) don't have a unique solution. Reasoning with uncertainty can help to find the most probable solutions to these problems.

Problems in which an agent learns generalizations from case data are examples of reasoning with uncertainty. Presumably, the more cases the agent sees, the more certain he becomes about the generalizations. When the generalizations are applied to predict unknown values in a new case, more reasoning with uncertainty is required, both because the generalization is uncertain, and because its applicability to the new case may be uncertain.

We may even use reasoning with uncertainty for problems that can be stated in purely logical (i.e. certain) terms, such as theorem proving. Often, these types of problems can be considered to involve some kind of search, and they can be solved far more quickly if we use suitable heuristics to guide the search to examine the most probable candidates first. But as the use of heuristics becomes more sophisticated, it becomes difficult to combine conflicting heuristics, and so it is useful to think of the heuristics as uncertain knowledge.
Then, we can use reasoning with uncertainty to direct the search of the theorem prover (or other strictly logical reasoning-with-certainty system).

2.2 Using Probability for Uncertain Reasoning

There are a number of mathematical systems available for reasoning with uncertainty, and there has been considerable controversy over which is "best". These systems include subjective probability, fuzzy logic, belief functions (e.g. Dempster-Shafer), certainty factors, non-numerical probabilities, and default logic. This thesis uses a system based on subjective probability that is often called "Bayesian probabilistic reasoning", although it involves far more than Rev. Thomas Bayes was ever concerned with. Some approximations to Bayesian reasoning will also be considered.

de Finetti provides an argument in favor of subjective probabilities based on the notion of coherence (F. P. Ramsey and L. J. Savage have also done similar work). An agent is offered a number of betting options and his choices are analyzed. If he acts as though his beliefs were governed by rules other than those of probability, it is possible to arrange a series of bets with him (called a Dutch book) in which he is guaranteed to lose money regardless of how events unfold. Cox46 derives the rules of probability without any of the machinery of gambling or decisions, based only on meeting a list of desirable properties for an ideal agent to reason with uncertainty. The proof is more complex than the de Finetti proof, but seems to have broader applicability. Below is the list of desired properties from which all the axioms of probability theory can be derived. Since this thesis is built upon the theory of probability, these may be considered the assumptions about reasoning with uncertainty made by this thesis.

1. Clarity: Propositions must be defined precisely enough so that it would be theoretically possible to determine whether they are indeed true or false.
2. Scalar Continuity: A single real number is both necessary and sufficient for representing a degree of belief.
3. Completeness: Some degree of belief can be meaningfully assigned to any well-defined proposition.
4. Context dependency: The belief assigned to a proposition can depend on the belief in other propositions.
5. Hypothetical conditioning: There exists some function that allows the belief in a conjunction of two propositions to be calculated from the belief in the first proposition, and the belief in the second proposition given that the first is true.
6. Complementarity: The belief in the negation of a proposition is a monotonically decreasing function of the belief in the proposition itself.
7. Consistency: There will be equal belief in propositions that are logically equivalent.

Cox46 originally proved that the probability axioms follow from these properties, but Tribus69 weakened the hypothetical conditioning requirement, and the solution of the "associativity equation" in Aczel66 may be used to remove the differentiability assumptions required by Cox. The list 1-7 was taken (with some modifications) from HorvitzHeckermanLanglotz86, who clearly state and name the desired properties, and use the results to compare existing uncertainty systems.

Some people have argued for uncertainty systems that violate one of the properties 1-7, or that have internal inconsistencies. For example, the Dempster-Shafer theory violates the combination of 2, 3, and 6. One justification for doing this is that some way of representing total ignorance about a proposition is required. Property 3 requires that we specify a belief for every proposition and property 2 requires that it be only a single number, which appears to preclude any representation of ignorance. Bayesians have pointed out that for decision theory a representation of ignorance is not required, although for learning and communication it may be.
Sometimes using beliefs about real-world frequencies, instead of just beliefs for the occurrence of a single event, will handle these situations. Otherwise, true Bayesian probabilities can be augmented with a confidence measure (such as an "equivalent sample size", which is equivalent to the number of relevant cases used for a learned BN), which can be combined with the probability when necessary to provide enough information for someone with different priors to update their priors. The issue remains controversial, but without doubt the Bayesian method is suitable for a very large class of uncertainty problems.

Fuzzy logic appears to violate property 1, but the primary complaint of the fuzzy logic community is that probabilities cannot represent all the different types of uncertainty that arise. They distinguish between vagueness (fuzziness, haziness, etc.) and ambiguity (nonspecificity, variety, etc.). Bayesians do claim to be able to handle these types of uncertainty using only probabilities (for example, see Cheeseman86).

Expert systems based on simple evidential support may be misled when they use transitivity to chain rules. For example, if B provides support for C, and C provides support for D, a support based system may increase support for D upon observing B, which may be inappropriate. An example from Pearl88 is that "wet grass" may provide support for "rained last night", and "sprinkler on" may provide support for "wet grass". Each of these two rules works fine by itself, but chaining them suggests that "sprinkler on" supports "wet grass", which in turn supports "rained last night". However, if we see the sprinkler on we should probably decrease our belief that it rained last night, instead of increasing it. A good example of the ad hoc nature of certainty factors, and the confusion resulting in trying to apply them, is the transcript of email exchanges of the MYCIN developers (BuchananS84, pp. 221-232).
Later, Heckerman86 analyzed certainty factors and showed that (when the inconsistencies were removed) they were equivalent to using probabilities, but with independence assumptions being implicitly made. Since those implicit independence assumptions are often inappropriate, using certainty factors may produce misleading results. By comparing the system to a probabilistic one, it became clearer where the deficiencies were, and how serious they were.

Proponents of default reasoning are often concerned with the numbers normally used in probabilistic reasoning, and ask "Where do these numbers come from?" The numbers represent subjective beliefs, not exact real-world frequencies, so the reasoning agent is simply summarizing his beliefs with the numbers. In fact, given a series of observations from nature, it seems easier to find probabilities (either as frequencies or doing Bayesian reasoning over priors on frequencies) than to decide on a set of default rules. Nevertheless, default reasoning can be a very valuable way of reasoning with uncertainty, since it has significant computational advantages, and can reduce the information rate considerably during communication.

It is dangerous to assume that a proof like the one by Cox completely defines the "best" system to use. It is always difficult to foresee what types of systems will be successful in the future, partly because the nature of the problems being solved keeps changing. However, the proof is useful for understanding the fundamental differences between uncertainty systems, and if one is willing to accept the properties 1-7, it indicates that probability is the system to use.

Doing complete probabilistic reasoning is generally very computationally expensive (see section 2.6), so it is reasonable to look for alternatives. Sometimes the most appropriate algorithm is as simple as taking the average of a few values, or using a majority vote scheme.
The approach advocated in this thesis is to start with a normative theory for combining uncertain information, and then use a simplified scheme, when appropriate, to improve the computational speed or overall complexity of the reasoning. However, such a scheme should be considered an approximation to using probabilities, and should be judged based on how similar its results are to those of full scale Bayesian reasoning. One of the applications of "link strength" is to indicate when it is appropriate to take a certain type of computational shortcut in probabilistic reasoning, and provide a bound on the resulting inaccuracy.

2.3 Axioms of Probability Theory

Here is a set of axioms for probability theory equivalent to those derived by Cox (and also compatible with other axiomatizations of probability, such as the Kolmogorov axioms):

1. P(a|a) = 1
2. P(¬a|b) = 1 − P(a|b)                                    2.3:1
3. P(a,b|c) = P(a|b,c) P(b|c)

where P(x,y|z) means the probability that propositions x and y are both true, given that proposition z is true. For notational convenience we use P(x) to mean P(x|T), where T is a tautology that is always true. Upper case letters refer to propositional variables, and lower case to their values. +b stands for b=TRUE, ¬b stands for b=FALSE, and sometimes +b is written simply as b if that does not result in confusion. For more notational details see Appendix A. From these few axioms, together with propositional logic and arithmetic, we can generate the entire mathematical structure of probability theory. Below is a list of a few theorems that I will use later. For their proofs, or alternate (but equivalent) axiomatizations of probability theory, see an elementary text such as Lindley65.
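The three axioms can be checked mechanically against any concrete distribution. Below is a minimal Python sketch (the joint probabilities are invented purely for illustration) that verifies each axiom by brute-force summation over a small full joint distribution:

```python
from itertools import product

# A small invented full joint distribution (FJD) over three binary
# propositions A, B, C -- the numbers are for illustration only.
worlds = list(product([True, False], repeat=3))   # world = (a, b, c)
fjd = dict(zip(worlds, [0.06, 0.04, 0.10, 0.05, 0.15, 0.10, 0.20, 0.30]))

def prob(event, given=lambda w: True):
    """P(event | given), with event and given as predicates over worlds."""
    den = sum(p for w, p in fjd.items() if given(w))
    num = sum(p for w, p in fjd.items() if event(w) and given(w))
    return num / den

A = lambda w: w[0]
B = lambda w: w[1]
C = lambda w: w[2]
notA = lambda w: not w[0]

# Axiom 1: P(a|a) = 1
assert abs(prob(A, A) - 1.0) < 1e-12
# Axiom 2: P(~a|b) = 1 - P(a|b)
assert abs(prob(notA, B) - (1 - prob(A, B))) < 1e-12
# Axiom 3: P(a,b|c) = P(a|b,c) P(b|c)
lhs = prob(lambda w: A(w) and B(w), C)
rhs = prob(A, lambda w: B(w) and C(w)) * prob(B, C)
assert abs(lhs - rhs) < 1e-12
print("all three axioms hold")
```

The assertions hold for any FJD, not just these numbers, since each axiom follows directly from the definition of conditional probability as a ratio of sums.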
This is Bayes theorem:

P(a|b,c) = P(b|a,c) P(a|c) / P(b|c)                        2.3:2

This is the reasoning by cases theorem:

P(a|c) = P(a|b,c) P(b|c) + P(a|¬b,c) P(¬b|c)               2.3:3

This is the independence theorem:

P(a,b|c) = P(a|c) P(b|c) iff A is independent of B given C    2.3:4

where independence is defined as:

P(a|b,c) = P(a|c) iff A is independent of B given C, providing b and c are consistent    2.3:5

Theorem 2.3:6: In each of the above theorems, all the probabilities are conditioned on the proposition c. Since c could be equivalent to any logical formula, they all hold with c replaced by any vector of truth values c.

If we are willing to accept using the 3 simple axioms 1-3 in 2.3:1 for a system to reason with uncertainty, then we are accepting using probability theory. The full joint probability distribution (FJD) specifies a probability for every possible conjunction involving every proposition of interest (or its negation). For example, if an agent knew only about the propositions A, B, and C, his FJD would consist of the probabilities P(abc), P(ab¬c), P(a¬bc), P(a¬b¬c), P(¬abc), P(¬ab¬c), P(¬a¬bc), and P(¬a¬b¬c). Obviously, if the FJD were stored in a table, the size of the table would be exponential in the number of propositions. We must take advantage of independencies between propositions to more efficiently represent an FJD. Supplying an FJD for a problem completely specifies the problem probabilistically. Sometimes I will refer to the FJD of a problem or an agent, when I want to indicate the complete probabilistic specification, although it may not be represented in any table.

2.4 Representing Independencies

Using just the axioms and theorems of the previous section we can do probabilistic reasoning.
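For instance, theorems 2.3:2 and 2.3:3 alone suffice to update a belief. The fragment below uses invented numbers for a hypothetical disease proposition d and test-result proposition t:

```python
# Invented numbers for illustration: d = "disease present", t = "test positive".
p_d = 0.01                # prior P(d)
p_t_given_d = 0.95        # P(t | d)
p_t_given_not_d = 0.05    # P(t | ~d)

# Reasoning by cases (2.3:3): P(t) = P(t|d)P(d) + P(t|~d)P(~d)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes theorem (2.3:2): P(d|t) = P(t|d) P(d) / P(t)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 3))   # -> 0.161
```

Note that despite a seemingly accurate test, the posterior belief is only about 16% because the prior is so low; automating exactly this kind of belief revision is the job of BN inference.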
That is, given beliefs for a set of propositions and their conjunctions (or relations), we can combine them with new items of certain or uncertain knowledge about the propositions or their relations, to obtain new beliefs for any of the propositions, or any logical formula of the propositions. But unless we exploit the fact that under some conditions our belief in some of the propositions will be independent of others, then for even a moderately sized problem, the computational cost will be astronomical. We could represent all the independencies as a list of triples, each one of the form I(X,Y|Z), where X, Y, and Z are sets of propositions, and I(X,Y|Z) means that if Z is precisely the set of propositions we know the truth value of, then X is independent of Y (i.e. further knowledge of whether the propositions in X are true won't change our beliefs in any of the propositions in Y). However, this list would be impossibly large and awkward to deal with, since the number of possible sets for each of X, Y, and Z is exponential (the power sets) in the number of base propositions. Once some independencies are known, generally others must follow to be consistent with the axioms of probability. So we could keep a partial list of independencies and generate the others as needed. We can generate them using the graphoid axioms (Pearl88):

Symmetry:      I(X,Y|Z) ⇒ I(Y,X|Z)
Decomposition: I(X, Y∪W |Z) ⇒ I(X,Y|Z)
Weak Union:    I(X, Y∪W |Z) ⇒ I(X,Y|Z∪W)
Contraction:   I(X,Y|Z) & I(X,W|Z∪Y) ⇒ I(X, Y∪W |Z)

However, using these axioms to generate required independence information would generally turn out to be a major computational task in itself. Instead we use a dag (directed acyclic graph) to represent the independencies, and by using a simple and very fast algorithm called d-separation, we can use the dag to quickly find independence information. We require that the dag not represent any independency that isn't in the original FJD.
Unfortunately, due to the limited representational power of dags, the result is that sometimes not all the independencies of the FJD can be represented by a dag. Since we are using the independence information to speed computation, the fact that it sometimes misses an independency means that sometimes we will do a little more computation than necessary, but since it never reports an independency when there isn't one, the computation will never be in error.

In the dag representation, each proposition of interest is represented by a node in a graph. Directed arcs (called links in this thesis) connect the nodes. The graph may be constructed by choosing any total ordering for the nodes, starting with no links, and then stepping through the ordering and adding links to each node N as it is encountered. The links to add to node N are determined by examining each node preceding node N in the ordering, and adding a link from it to N if N is not independent of that node given all the other nodes preceding N. This process will be examined again in section 3.8. The dag that results can be used to regenerate the independence information via the d-separation algorithm. In general, its structure will depend on the total ordering of the nodes that we started with. In order to create the simplest model, and to represent as many independencies as possible, it is desirable to choose an initial total ordering that will minimize the number of links in the final dag. In models of physical causation, generally placing propositions about causes before propositions about their effects will result in a smaller dag. An example of a dag to represent independence appears in figure 2.5:1. The graph created by the above algorithm never has any cycles (i.e. paths that return to their starting point following the direction of the links), but it may have loops (paths that return to their starting point ignoring the direction of the links).
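The ordering-based construction just described can be sketched directly: for each node in the ordering, test (here by brute force against the FJD, using definition 2.3:5) whether it is independent of each predecessor given all the other predecessors, and add a link when it is not. The code below is only a minimal illustration of the procedure; the chain distribution and all its numbers are invented:

```python
from itertools import product

def cond_prob(fjd, names, target, t_val, fixed):
    """P(target = t_val | fixed), where fixed maps node name -> value,
    computed by brute force from the full joint distribution."""
    def match(w):
        return all(w[names.index(n)] == v for n, v in fixed.items())
    den = sum(p for w, p in fjd.items() if match(w))
    num = sum(p for w, p in fjd.items()
              if match(w) and w[names.index(target)] == t_val)
    return num / den if den > 0 else None

def independent(fjd, names, x, y, given):
    """Is X independent of Y given the node set `given` (2.3:5)?"""
    for g_vals in product([True, False], repeat=len(given)):
        g = dict(zip(given, g_vals))
        for y_val in (True, False):
            p1 = cond_prob(fjd, names, x, True, {**g, y: y_val})
            p2 = cond_prob(fjd, names, x, True, g)
            if p1 is not None and p2 is not None and abs(p1 - p2) > 1e-9:
                return False
    return True

def build_links(fjd, ordering):
    """Add a link U -> N iff N is not independent of the predecessor U
    given all of N's other predecessors in the ordering."""
    links = []
    for i, n in enumerate(ordering):
        for u in ordering[:i]:
            others = [v for v in ordering[:i] if v != u]
            if not independent(fjd, ordering, n, u, others):
                links.append((u, n))
    return links

# Invented example: a chain A -> B -> C.
names = ['A', 'B', 'C']
fjd = {}
for a, b, c in product([True, False], repeat=3):
    pb = 0.8 if a else 0.2          # P(b | a)
    pc = 0.7 if b else 0.1          # P(c | b)
    fjd[(a, b, c)] = 0.5 * (pb if b else 1 - pb) * (pc if c else 1 - pc)

print(build_links(fjd, names))      # -> [('A', 'B'), ('B', 'C')]
```

The link A → C is correctly omitted because C is independent of A given B, which is exactly the kind of saving the dag representation is meant to capture. (Testing independence against an explicit FJD defeats the purpose in practice; in real modeling the links come from the modeler's knowledge.)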
A graph without loops is called singly-connected, and one with loops is called multiply-connected. The nodes with links going to a node N are called the parents of N, and the nodes at the end of links leaving N are called the children of N. The definitions of ancestors, descendants, siblings, etc. follow in the natural way. The recent heightened interest in using normative probabilistic systems to reason with uncertainty, and the construction of a number of practical systems, is due in a large part to exploiting the independencies represented by a dag description of the FJD. One of the contributions of this thesis is to provide a criterion for when it is appropriate to exploit "near-independencies" as well, and to generalize the d-separation algorithm to discover them.

2.4.1 d-Separation Algorithm

The d-separation algorithm is used to determine the independencies represented by a dag. It is an algorithm which allows dags to represent independencies in a manner consistent with the graphoid axioms (and therefore the axioms of probability). Say X, Y, and Z are disjoint subsets of nodes in the dag. We will use the d-separation algorithm to determine if X and Y are independent given Z ("given Z" means the same thing as "having evidence for Z", which means the truth values of all the propositions in Z are known by the reasoning agent). A node is termed converging with respect to a path iff both of the path's links at that node are directed toward the node (e.g. the third node B from the left in figure 2.4). A path is blocked by Z iff:

1. It has a non-converging node in Z, or
2. It has a converging node N that is not in Z, and N has no descendants in Z.

A path that is not blocked is called active. Figure 2.4 illustrates some blocked and active paths. In each case A is a node from X, C is a node from Y, and B is in Z iff it is shaded. Z d-separates X from Y iff all paths from X to Y are blocked by Z.
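The blocking rules can be turned into a direct (if naive) implementation: enumerate the simple undirected paths between two nodes and apply the two blocking conditions to each inner node. This sketch uses a hypothetical dag encoded as a child-list dictionary; for large networks one would use the standard linear-time reachability formulation instead of path enumeration:

```python
def descendants(dag, n):
    """All descendants of node n in dag (dict: node -> list of children)."""
    seen, stack = set(), list(dag.get(n, []))
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(dag.get(c, []))
    return seen

def undirected_paths(dag, a, b):
    """All simple paths from a to b, ignoring link direction."""
    adj = {}
    for u, children in dag.items():
        for v in children:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    paths, stack = [], [[a]]
    while stack:
        path = stack.pop()
        if path[-1] == b:
            paths.append(path)
            continue
        for nxt in adj.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def d_separated(dag, x, y, z):
    """True iff the evidence set z blocks every path between x and y."""
    z = set(z)
    for path in undirected_paths(dag, x, y):
        blocked = False
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            converging = (node in dag.get(prev, [])
                          and node in dag.get(nxt, []))
            if converging:
                # Rule 2: blocked unless node or a descendant is in z.
                if node not in z and not (descendants(dag, node) & z):
                    blocked = True
                    break
            elif node in z:
                # Rule 1: a non-converging evidence node blocks the path.
                blocked = True
                break
        if not blocked:
            return False
    return True

# A small invented example: A -> C <- B (C is a converging node).
dag = {'A': ['C'], 'B': ['C'], 'C': []}
print(d_separated(dag, 'A', 'B', set()))    # -> True  (blocked at C)
print(d_separated(dag, 'A', 'B', {'C'}))    # -> False (evidence at C activates the path)
```

The example reproduces the characteristic converging-node behavior: A and B are independent a priori, but learning C renders them dependent.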
If there is any active path from X to Y, then X and Y are not d-separated by Z. If Z d-separates X from Y, then X and Y are independent given Z. That is, if you know Z, further knowledge of X will not shed any light on Y, and vice versa. It is important that Z include all the nodes for which you have knowledge, since some of the nodes in Z may block all the paths from a node in X to a node in Y, but others may form new paths. You can check your ability to apply the d-separation algorithm by trying to find all the active paths from node V to node Q in the BN of figure 2.6:2, where the shaded nodes are the ones for which you have knowledge. The only two active paths are the marked ones.

Figure 2.4 - Each case indicates whether the displayed path between A and C is active or blocked. A node with "evidence" (drawn shaded) is one whose truth value is known. In each case there may be other links connected to A, B, C, or D, so there may be other paths between A and C which may be active or blocked.

2.5 Bayesian Networks

We can use the dag representation of independence introduced in the previous section, together with a set of conditional probabilities at each node, to provide a complete representation of any particular FJD. We call the resulting structure a Bayesian network (BN). In the common definition of a BN, nodes represent any kind of variable of interest, but since this thesis is restricted to the study of binary nodes, our nodes represent propositions (or variables that can take on one of two states). To construct a BN, a dag may be determined as described in the previous section. Then a number of conditional probabilities, called the node conditional probabilities (NCPs), are attached to each node. These are the probabilities that the proposition of the node is TRUE given each of the different TRUE/FALSE combinations of its parents.
For example, if the node B had the single parent A, its NCPs would be {P(b|+a), P(b|¬a)}; if it had parents A and W, its NCPs would be {P(b|+a,+w), P(b|+a,¬w), P(b|¬a,+w), P(b|¬a,¬w)}; and if it had no parents its NCP would be {P(b)}. These probabilities are subjective probabilities, which are unique to the reasoning agent constructing and using the BN. They are measures of how strongly the agent would believe the child proposition if he knew the truth about the parent propositions. When appropriate, they correspond somewhat to frequencies in the real world, but that is not required. The purpose of BN inference is simply to tell the agent how to change beliefs when he observes certain evidence. The structure of the links is often called the topology of the BN, to distinguish it from the conditional probability information. The FJD is easily reconstructed from the BN representation via the equation:

P(v) = Π_{X∈V} P(x|π(X))    2.5

where V is the set of nodes in the BN, P(v) is a probability from the FJD for the vector of truth values v (one value for each node in V), X is a node, π(X) is a vector of values (consistent with v) for the parents of X, and P(x|π(X)) is an NCP for X, where both x and π(X) are consistent with v. In other words, the joint probability of a setting of TRUE/FALSE values for all the nodes is simply the product over the nodes of the NCP from each node which is consistent with the setting. A single FJD can generally be represented by several different BNs, but if the BN must satisfy a given total ordering for the nodes, then it will be unique. Any propositional FJD can be represented by some BN having one node for each propositional variable of the FJD. A BN uniquely (and therefore unambiguously) determines an FJD. Every possible BN determines some FJD, so it is impossible to construct an inconsistent BN, no matter what its topology (provided it is acyclic) or its NCPs. The direction of each link in a BN is significant.
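Equation 2.5 can be illustrated with a short program. The three-node chain A → B → C and its numbers below are hypothetical, chosen only to show the product of NCPs at work:

```python
# Equation 2.5 in code: the joint probability of a complete TRUE/FALSE
# assignment is the product of one NCP per node. The three-node chain
# A -> B -> C and its numbers are hypothetical, for illustration only.
from itertools import product

# Each NCP table maps a tuple of parent values to P(node = TRUE | parents).
ncp = {
    'A': {(): 0.3},
    'B': {(True,): 0.5, (False,): 0.75},
    'C': {(True,): 0.9, (False,): 0.1},
}
parents = {'A': [], 'B': ['A'], 'C': ['B']}

def joint(v):
    """P(v) for a dict v of truth values, one per node (equation 2.5)."""
    p = 1.0
    for node, table in ncp.items():
        p_true = table[tuple(v[par] for par in parents[node])]
        p *= p_true if v[node] else 1.0 - p_true
    return p

# Sanity check: the joint over all 8 assignments sums to 1.
total = sum(joint(dict(zip('ABC', vals)))
            for vals in product([True, False], repeat=3))
print(round(total, 10))  # 1.0
```

Because every BN topology and every choice of NCPs yields a product that sums to 1 in this way, no BN can be inconsistent, as noted above.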
A link from node A to node B may always be reversed in direction, but if it was already in its optimal orientation, then the reversal usually requires the addition of extra links from the parents of A going to B, and from the parents of B going to A, to avoid indicating independencies that don't exist. Once the extra links have been added, the BN doesn't represent all the independencies it did before the link reversal. Also, with the links in a non-optimal direction our knowledge is less modular, in that adding new variables (nodes) will generally require adding a great many new links. In those problems where the variables are causally related, the optimal direction for links is generally the direction of causality (e.g., the direction of time). An example BN is shown in figure 2.5:1, and its NCPs are shown in figure 2.5:2. This BN represents the relationships between the beliefs of Jim, who is a fictitious character that lives in a small community which also contains Hank, Tom, Molly, and Gale. The NCPs are the subjective probabilities of Jim only. For example, the NCPs of node MT represent Jim's belief of what Molly thinks (notice that in this example it mirrors what Jim himself believes, since the NCPs of node MT are the same as the NCPs of node TD). Each node should specify a proposition precisely to the user of the BN, in order to satisfy the clarity condition (see section 2.2) of probabilistic reasoning. For example, "Tom" must refer to a particular person, so including a last name may help, "a big donation" must refer to a particular range of donation sizes, so including dollar amounts may help, and "lots of cars" must refer to a particular range in number of cars, so providing numbers may help. 
Perfectly describing each node to someone completely unfamiliar with the situation may require an endlessly long description, but any description is adequate as long as it produces, in the mind of the BN builder and the BN users, a proposition of adequate precision for the task at hand. It may appear that most of the important causes for many of the nodes have not been included. For example, "Park is approved" has only "Molly elected mayor" as a parent. Surely there are many more important factors, such as the need for a park, the existence of necessary funds, the availability of land, etc. However, the purpose of using probabilities and reasoning with uncertainty is to be able to reason without explicitly accounting for all these factors. The probabilities themselves summarize the missing factors (if all factors were accounted for, perhaps the NCPs would consist of only 0s and 1s). In building a BN we need include only those factors we know of and suspect are relevant, and the uncertainty in the inference results will reflect our lack of knowledge. If Jim were an actual person he would probably know of many more factors that were relevant to his real-life questions involving these nodes, so he could add nodes and links for them to expand our example BN, and thereby generally increase the expected certainty of his conclusions. This BN can be used to find Jim's initial beliefs in each proposition, and what those beliefs become if he finds out the truth value of one or more of the nodes. For example, if Jim found out that Tom's hardware store did well last year, we can use it to find his new beliefs in each proposition (for example, whether Hank is going to move away in the next 5 years). Then, if he later read in the newspaper that Molly was elected mayor, we could obtain a new set of beliefs for each node (for example, whether Tom told Molly he will donate campaign funds). The actual beliefs calculated for this example are given at the end of the next section.
"V: Tom's HT: Lots of traffic on 5th street last year TR: Tom told CP: Lots of cars ar e usually parked in front of Tom's store , hardware store did well last year TW: Tom is currently fairly wealthy cousin is siting himj SW: Tom's ( 7T- C: Tom just - Molly he is rich bought a new BMW car /^ .NN TF: Tom told Molly he will donate campaign funds _) CD: A new BMW' is parked in Tom's driveway ^} CGT: Gale told Molly that lots of cars are usually parked in nt of Tom's store/ wo MR: Molly thinks Tom is rich TA: Tom can afford to move to an expensive neighborhood next year TD: Tom makes big donation to Molly's campaign (MT: Molly thinks Tor n\ is going to make big campaign donation elected mayor TM: Tom moves next door to Hank next year 2 (M MM: Molly gets ^ ` . Neighborhood ark is approved D: Molly decides to run for mayor FW: 5th street is widened next year (F;C: Neighborhood park is constructed 1ext year . V (HM: Hank moves Hank's property value goes up more than 20% in two year away sometime in ,the next 5 years Figure 2.5 - Example BN. 
DT: Traffic more than doubles on Hank's street next year

P(ht) = 0.7
P(cv) = 0.01
P(sw|+ht) = 0.7    P(sw|¬ht) = 0.6
P(tw|+sw) = 0.7    P(tw|¬sw) = 0.3
P(tc|+tw) = 0.3    P(tc|¬tw) = 0.1
P(tr|+tw) = 0.1    P(tr|¬tw) = 0.05
P(cp|+sw,+ht) = 0.8    P(cp|+sw,¬ht) = 0.7    P(cp|¬sw,+ht) = 0.7    P(cp|¬sw,¬ht) = 0.2
P(gt|+cp) = 0.1    P(gt|¬cp) = 0.002
P(cd|+cv,+tc) = 0.95    P(cd|+cv,¬tc) = 0.90    P(cd|¬cv,+tc) = 0.90    P(cd|¬cv,¬tc) = 0.05
P(tf|+tr,+tw) = 0.60    P(tf|+tr,¬tw) = 0.15    P(tf|¬tr,+tw) = 0.20    P(tf|¬tr,¬tw) = 0.05
P(td|+tf,+tw) = 0.8    P(td|+tf,¬tw) = 0.5    P(td|¬tf,+tw) = 0.1    P(td|¬tf,¬tw) = 0.02
P(ta|+tw,+tc,+td) = 0.52    P(ta|+tw,+tc,¬td) = 0.50    P(ta|+tw,¬tc,+td) = 0.42    P(ta|+tw,¬tc,¬td) = 0.40
P(ta|¬tw,+tc,+td) = 0.15    P(ta|¬tw,+tc,¬td) = 0.15    P(ta|¬tw,¬tc,+td) = 0.12    P(ta|¬tw,¬tc,¬td) = 0.10
P(tm|+ta) = 0.3    P(tm|¬ta) = 0.05
P(mr|+tr,+gt) = 0.71    P(mr|+tr,¬gt) = 0.70    P(mr|¬tr,+gt) = 0.31    P(mr|¬tr,¬gt) = 0.30
P(mt|+tf,+mr) = 0.8    P(mt|+tf,¬mr) = 0.5    P(mt|¬tf,+mr) = 0.1    P(mt|¬tf,¬mr) = 0.02
P(md|+mt) = 0.7    P(md|¬mt) = 0.5
P(mm|+md,+td) = 0.5    P(mm|+md,¬td) = 0.3    P(mm|¬md,+td) = 1e-7    P(mm|¬md,¬td) = 1e-7
P(pa|+mm) = 0.7    P(pa|¬mm) = 0.4
P(pc|+pa) = 0.9    P(pc|¬pa) = 1e-5
P(fw|+ht,+mm,+pc) = 0.800    P(fw|+ht,+mm,¬pc) = 0.802    P(fw|+ht,¬mm,+pc) = 0.810    P(fw|+ht,¬mm,¬pc) = 0.812
P(fw|¬ht,+mm,+pc) = 0.300    P(fw|¬ht,+mm,¬pc) = 0.302    P(fw|¬ht,¬mm,+pc) = 0.310    P(fw|¬ht,¬mm,¬pc) = 0.312
P(dt|+fw) = 0.8    P(dt|¬fw) = 0.2
P(vu|+pc,+dt) = 0.80    P(vu|+pc,¬dt) = 0.82    P(vu|¬pc,+dt) = 0.50    P(vu|¬pc,¬dt) = 0.51
P(hm|pc,vu,dt,tm): [sixteen entries, illegible in this copy]

Figure 2.5:2 - Node conditional probabilities (NCPs) for the example BN in figure 2.5:1.
2.6 Bayesian Network Inference

The most studied BN inference problem is: given a BN and evidence for some of its nodes, what are the posterior probabilities of its other nodes? Evidence is some ideal observation, or the receiving of some certain information, on the truth of one or more node propositions (uncertain evidence is dealt with in section 2.7). With respect to a particular BN, evidence items may be inconsistent with each other. For example, if B is a child of A with P(b|a) = 1, and we obtain evidence a=TRUE and b=FALSE, we say the evidence is inconsistent. This indicates that the evidence is not possible given the BN model (which probably indicates a fault with the model, not the evidence). Inconsistent evidence is not allowed in BN inference, and throughout this thesis, whether it is explicitly stated or not, we assume evidence is consistent. Occasionally in this thesis I will refer to BN inference in a dynamic sense, which implies a situation where a stream of evidence items constantly arrives at a BN, and we update the beliefs of each node as the items arrive. So we may speak of the belief at a node as rising and falling through time. A primary feature of this situation is that evidence always accumulates; no item is ever retracted. As the evidence monotonically increases, some quantities of interest will monotonically increase or decrease, while others will vary up and down. Once we have received evidence there are two different systems for taking it into account in a BN. We can do evidence propagation to modify the BN to one specific for that evidence state, by modifying the network topology and NCPs (often extensively), so that the nodes for which we have evidence no longer appear in the network, and the new network represents the original BN conditioned on the evidence.
Or we can leave the BN structure unchanged, but mark the nodes for which we have evidence with the state of the evidence (these become known as evidence nodes and are usually drawn shaded on a BN diagram), and then use algorithms designed to handle the evidence "in place" to find posterior probabilities. These two systems of dealing with evidence are entirely equivalent in semantics; the only differences are ones of representation and computation. The way evidence propagation to create a new BN is normally done is through link reversals and node absorption (Shachter86 and Shachter88). First, all links pointing to evidence nodes are reversed, one by one. Each reversal may result in new links pointing to an evidence node, since during a reversal the nodes at each end of the link gain all the parents the other has. Eventually though, each evidence node is guaranteed to have links only leaving it (links between evidence nodes may simply be deleted), and then that evidence node is absorbed. That is, it is removed from the network and all the NCPs of its child nodes are collapsed by one dimension to the value of the evidence. Later a process called probability propagation may be used on this new BN to calculate the posterior probabilities. There are a number of methods for computing the new beliefs after receiving evidence which do not create a new BN. Some of them use a compiled secondary structure for efficiency, and some of them operate on just the original BN. Some of them produce approximate results, and some exact. Pearl developed a fast algorithm for BN updating when the network is singly-connected, called belief propagation (see Pearl88), which can find posteriors in O(N) time, where N is the number of nodes in the network, and as a parallel algorithm with a processor for each node, in O(d) time, in which d is the diameter of the network. Multiply-connected networks pose a much greater computational problem.
The reasoning by assumptions algorithm finds a set of nodes (called cut nodes), such that if they were instantiated with evidence some active paths would become blocked and the network would become singly-connected. Then it solves the singly-connected problem multiple times, once for each possible instantiation of the cut nodes. The final beliefs are found as a weighted sum of the beliefs in each of these subcalculations, with the weighting for each instantiation being the probability that the cut nodes would be instantiated that way (using equation 2.3:3). Since all combinations of evidence at the cut nodes must be considered, the algorithm is exponential in the number of cut nodes. Cooper has shown general BN inference to be NP-hard by reducing the 3SAT problem to a BN inference problem (Cooper90). This is a worst-case result, but generally even the average case requires exponential time to find exact results. So for large BN applications, some sort of approximation algorithm is often necessary for BN inference. Although the BN inference algorithms described above will find the posterior probabilities for any node, given a set of evidence at any other nodes, it is useful to dissect BN inference into predictive reasoning, diagnostic reasoning, and intercausal reasoning. Predictive reasoning finds the belief at a node which is a descendant of a node with evidence, and diagnostic reasoning finds the belief at an ancestor of an evidence node. Intercausal reasoning propagates the effect of evidence between the parents of a common child node with evidence. Figure 2.6:1 illustrates the three types.

Figure 2.6:1 - The three types of reasoning in BN inference: predictive, diagnostic, and intercausal. The shaded nodes are nodes with evidence. In each case we wish to find the change in belief at node Q due to evidence at node E. Any of these three types of reasoning may be combined for more complex inference.
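The weighted-sum step of reasoning by assumptions can be sketched as follows. The toy multiply-connected network and its numbers are hypothetical, and brute-force enumeration over the joint stands in for the singly-connected solver a real implementation would use; the point is only that summing over instantiations of the cut node, weighted by its posterior probability, reproduces the direct answer:

```python
# Sketch of reasoning by assumptions (cutset conditioning) on a toy
# multiply-connected BN: A -> B, A -> C, B -> D, C -> D. All numbers are
# hypothetical. Brute-force enumeration of the joint (equation 2.5)
# stands in for the singly-connected solver a real implementation uses.
from itertools import product

P_a = 0.3
P_b = {True: 0.8, False: 0.2}                  # P(b | a)
P_c = {True: 0.6, False: 0.1}                  # P(c | a)
P_d = {(True, True): 0.9, (True, False): 0.5,  # P(d | b, c)
       (False, True): 0.4, (False, False): 0.1}

def joint(a, b, c, d):
    p = P_a if a else 1 - P_a
    p *= P_b[a] if b else 1 - P_b[a]
    p *= P_c[a] if c else 1 - P_c[a]
    p *= P_d[b, c] if d else 1 - P_d[b, c]
    return p

def posterior(query, evidence):
    """P(query is TRUE | evidence), by enumeration over all worlds."""
    num = den = 0.0
    for world in product([True, False], repeat=4):
        w = dict(zip('ABCD', world))
        if all(w[k] == v for k, v in evidence.items()):
            p = joint(*world)
            den += p
            if w[query]:
                num += p
    return num / den

# Instantiating the cut node A blocks the loop A-B-D-C-A. The belief in D
# is then a weighted sum over the two instantiations of A:
evidence = {'C': True}
w_a = posterior('A', evidence)
cutset = (w_a * posterior('D', {**evidence, 'A': True})
          + (1 - w_a) * posterior('D', {**evidence, 'A': False}))
print(abs(cutset - posterior('D', evidence)) < 1e-9)  # True: they agree
```

With k cut nodes the sum runs over 2^k instantiations, which is the source of the exponential cost noted above.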
Although BN inference is not normally subdivided along the paths from an evidence node to a query node (i.e. a node whose updated belief we wish to find), it is sometimes useful to do so. Suermondt92 does this to generate explanations of BN inference, and I will do it to study BN sensitivity. Figure 2.6:2 shows the paths of reasoning from node V to query node Q, given that some previous evidence has arrived at the network at the shaded nodes.

Figure 2.6:2 - Combination of different types of reasoning. There are only two active paths from V to Q: the path V,A,B,C,Q and the path V,D,F,Q. The first path contains diagnostic, intercausal, and predictive reasoning, while the second contains only predictive and intercausal reasoning. The shaded nodes are nodes with evidence, V is a varying node (or node with new evidence), and Q is a query node.

2.6.1 Virtual Evidence

In cases where we obtain evidence for a node, but the evidence is not certain, we can make use of the concept of virtual evidence to handle it using the regular machinery of BN updating (Pearl88). For instance, suppose we wanted to process uncertain information that node A is true. We make a new node that is a child of node A and call it Av. To add this node we must supply two new probabilities: the probability that Av is TRUE given that A is TRUE, and the probability that Av is TRUE given that A is FALSE. These two probabilities measure the degree of uncertainty of the evidence. If one of them is 0, then we have the limiting case of the evidence being certain; we don't allow them both to be 0, which would be inconsistent. To do updating with the virtual evidence, Av is marked as a node with certain evidence of TRUE, and regular BN updating is used to adjust the beliefs of the rest of the nodes in the network. We may have several independent pieces of uncertain evidence for the node A, which may conflict or agree with each other.
In this case we simply add a new child node, and its two probabilities, for each piece of evidence (see figure 2.7). Then regular BN updating will handle finding the belief for node A and the rest of the nodes in the network, whether the evidence items conflict with each other or support each other.

Figure 2.7 - Three items of virtual evidence for node A are represented as three child nodes, with parameters P(av1|a), P(av1|¬a); P(av2|a), P(av2|¬a); and P(av3|a), P(av3|¬a). The two parameters for each child specify the certainty of the evidence. The subnetworks C and D simply illustrate that A is part of a larger BN.

2.6.2 Inference on BN Example

We can use any one of the algorithms described earlier to find beliefs for the example BN given in the last section. Initially Jim does not know the truth value of any of the nodes in the example. By doing probability propagation we can find his belief in any node:

P(sw) = 0.670    P(mm) = 0.179    P(tf) = 0.160    P(gt) = 0.0695    P(pc) = 0.40821    P(hm) = 0.2304

If he finds out that Tom's hardware store did well last year, then we can find his new beliefs for each node by doing evidence updating:

P(sw|sw) = 1    P(mm|sw) = 0.183    P(tf|sw) = 0.184    P(gt|sw) = 0.0777    P(pc|sw) = 0.409    P(hm|sw) = 0.2302

If he reads in the newspaper that Molly was elected mayor (and believes it with certainty), then once again we can find his new beliefs for each node by doing evidence updating:

P(sw|sw,mm) = 1    P(mm|sw,mm) = 1    P(tf|sw,mm) = 0.286    P(gt|sw,mm) = 0.0778    P(pc|sw,mm) = 0.630    P(hm|sw,mm) = 0.188

It may seem strange that finding out Tom's hardware store did well would change Jim's belief in something so distantly related as whether Hank is going to move or not (even though it only changed from 0.2304 to 0.2302). Whenever there is an active path from one node of a BN to another, evidence at one of the nodes can change the belief at the other node.
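Returning to the virtual-evidence machinery of section 2.6.1: for a single node A, adding the child nodes of figure 2.7 and asserting them TRUE amounts to multiplying the prior odds of A by one likelihood ratio per evidence item. A minimal sketch, with hypothetical numbers:

```python
# Sketch of combining several items of virtual evidence for one node A,
# as in section 2.6.1 and figure 2.7: each item asserted TRUE multiplies
# the prior odds of A by its likelihood ratio P(av|+a) / P(av|¬a).
# The prior and the likelihood pairs below are hypothetical.

def belief_after_virtual_evidence(prior, items):
    """items: a list of (P(av | a), P(av | not a)) pairs for node A."""
    odds = prior / (1.0 - prior)
    for p_given_a, p_given_not_a in items:
        odds *= p_given_a / p_given_not_a
    return odds / (1.0 + odds)

# One confirming report alone, then together with a weaker
# disconfirming report; the items may conflict, as noted in the text:
print(belief_after_virtual_evidence(0.5, [(0.8, 0.2)]))             # 0.8
print(belief_after_virtual_evidence(0.5, [(0.8, 0.2), (0.3, 0.6)]))
```

The product form relies on the evidence items being independent given A, which is exactly the structure figure 2.7 encodes.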
In a very large BN, it is quite reasonable for every node to be connected to every other node, with many of the connections being active paths. Knowledge of different subject areas may be connected along their boundaries. In the example BN, the links between the nodes were quite natural, and it would be natural for there to be a link from "Tom's hardware store did well" to a node called "Tom opens grocery store" to "Tom eats more lettuce" to, etc., providing nodes ever more weakly connected to "Hank moves", yet if Jim received evidence for any one of these nodes, his belief in all of them would change somewhat. The point is that the beliefs at very distant nodes will change almost imperceptibly, and if we have a way of measuring what that change will be, or of guaranteeing that it is less than some bound, we may want to forgo finding new beliefs for those distant nodes, thereby greatly reducing the computational burden.

3 Connection and Link Strengths

3.1 Connection Strength Definition

Given two nodes, A and B, in a Bayesian network, the connection strength (CS) from node A to node B is defined as the difference in the resulting belief at node B, between the case where A receives evidence TRUE, and the case where A receives evidence FALSE. Formally:

CS(A,B) = d(P(b|+a), P(b|¬a))    3.1:1

where d is a distance measure to determine the amount of change or degree of difference between two probabilities, and in later proofs we assume it meets the following criteria:

1. Zero: d(x, x) = 0    3.1:2
2. Symmetry: d(x, y) = d(y, x)
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z)
4. Continuity: d(x, y) is continuous with respect to x and y.
5. Monotonicity: For any given x, d(x, y) increases monotonically as y moves away from x. Formally: x ≤ y ≤ z implies d(x, y) ≤ d(x, z) and d(y, z) ≤ d(x, z).
6. Invertibility: d(x, y) = d(1 − x, 1 − y)
7. Convexity: d(ax + b(1−x), cx + d(1−x)) ≤ max(d(a,c), d(b,d)) for all a, b, c, d, x ∈ [0,1].
The first three conditions are the well-known Hausdorff postulates of many distance semi-metrics, and are the requirements of a probability metric as defined in Zolotarev83. There has been some research on probability metrics (see Zolotarev83), but it appears to be mostly concerned with distributions defined over continuous variables (and much of it hasn't been translated from Russian). Specific distance measures will be discussed later. CS(A,B) was defined as the change in belief at B as the belief in A switched from true to false, due to evidence at A. But what if the belief at A changed from somewhere between certainly true and certainly false to somewhere else in between? How much would the belief at B change? If the change in belief at A was due to evidence at nodes that aren't independent of B given A, then we can't say how much the belief at B would change. But if it occurs at nodes that are independent of B given A, we can guarantee that the change in belief at B will be less than (or equal to) CS(A,B) as defined by equation 3.1:1. An example of evidence that is independent of B given A is virtual evidence for the node A. So as various items of virtual evidence arrive for node A, the belief in A will vary between true and false, and this will cause the belief in B to vary as well, but the maximum variation at B will always be less than (or equal to) CS(A,B). Formalizing this results in two equations equivalent to definition 3.1:1:

Theorem 3.1: Equation 3.1:1 defining connection strength is equivalent to:

CS(A,B) = max_{av1, av2} d(P(b|av1), P(b|av2))    3.1:3

and:

CS(A,B) = sup_{av1, av2} d(P(b|av1), P(b|av1, av2))    3.1:4

where av1 is some virtual (or nonexistent) evidence for A, av2 is other consistent evidence (possibly virtual) for A, and "sup" means the least upper bound. The equivalence of 3.1:1, 3.1:3, and 3.1:4 is proved in Appendix C. These equations allow a slightly different defining statement for connection strength.
CS(A,B) is a measure of the most the belief in B could be changed by receiving new (possibly virtual) evidence at A, whatever that evidence is, and whatever other (possibly virtual) evidence has already been received at A.

3.2 ΔP Connection Strength

To completely define connection strength we must define a probability distance measure d. One possibility which satisfies the distance measure requirements (3.1:2) is simply the difference function:

d_P(P1, P2) = |P1 − P2|    3.2:1

This definition is a degenerate case (applied to a binary variable, rather than a multistate or continuous variable) of what is called the Kolmogorov metric, also known as the uniform metric or variation norm (see Zolotarev83). Connection strength defined using this distance measure for probabilities will be called ΔP connection strength, and denoted CS_P.

3.2.1 Single Link Example

First we consider a very simple BN consisting of only two binary nodes and a single link between them:

Figure 3.2 - Simple BN example: a link from node A to node B, with parameters P(a), P(b|+a), and P(b|¬a).

Three numerical parameters are needed to define this BN: P(a), P(b|+a) and P(b|¬a). As an example we take P(a) = 0.3, P(b|+a) = 0.5 and P(b|¬a) = 0.75. If evidence TRUE is observed at A, then the belief at B (by this we mean the probability that B is TRUE given the evidence) will be P(b|+a), and if evidence FALSE is observed then it will be P(b|¬a). Using equation 3.1:4 for connection strength, and the distance measure of equation 3.2:1, for this simple BN we obtain:

CS_P(A,B) = |P(b|+a) − P(b|¬a)|    3.2:2

This is a measure of "the most node A can affect the belief at node B". As a visualization aid one can imagine a situation in which a steady stream of virtual evidence is arriving for A, but no evidence arrives for B. As each piece of evidence arrives, our belief in A will change, and as a consequence our belief in B will also change. In general, the belief at B will rise and fall, but will be restricted within a certain envelope.
CS_P measures the maximum width of this envelope for any series of virtual evidence items at A. For the example parameters, CS_P(A, B) = |0.75 − 0.5| = 0.25, so we would say the connection strength from A to B is 0.25.

3.2.2 Range of ΔP Connection Strengths

Consider a BN in which P(b|+a) = P(b|¬a). In that case, whether evidence of TRUE, evidence of FALSE, or no evidence is observed at A, the belief at B remains constant at P(b|+a) = P(b|¬a) = P(b), providing no evidence is observed for B. Also, whatever evidence is observed at B, providing there is no evidence observed for A, the belief at A will remain constant at P(a). So B is actually independent of A, and normally the BN would be drawn without a link connecting the two nodes. The connection strength in this case is CS_P(A,B) = 0. ΔP connection strength between two nodes is zero if and only if the two nodes are independent of each other. Consider a BN in which P(b|+a) = 1, and P(b|¬a) = 0. Then, observing evidence TRUE at A results in a belief of 1 at B (i.e., the knowledge that B is TRUE). Observing evidence FALSE at A results in the knowledge that B is FALSE. In this case the connection strength is 1. ΔP connection strength is 1 if and only if direct evidence for the first node deterministically defines the value of the second. ΔP connection strength varies from 0 for the weakest "links" (actually independence) to 1 for the strongest links (deterministic dependence):

0 ≤ CS_P(X, Y) ≤ 1    3.2:3

3.3 ΔO Connection Strength

We have defined a connection strength based on the absolute difference of probabilities, d(P1, P2) = |P1 − P2|. But sometimes an absolute difference of probabilities does not capture what we want as a measure of the "distance" between two probabilities. For example, suppose we are calculating a probability approximately, and we want to measure how close our estimate, P*, is to the value calculated exactly, P.
We will call the distance between these two probabilities "the error" of the estimate, e = d(P*, P). We may want to specify an upper bound e_m on this error, and look for an algorithm which is guaranteed to calculate the estimate with an error less than this bound. This is an application of connection strength that we will be considering. We can use the absolute difference function as a distance measure for calculating an error in probabilities, but sometimes this will lead to unsatisfactory results. For example, suppose an approximate algorithm estimated the probability of rain tomorrow as 0.61, when an exact algorithm with the same information would have calculated the probability to be 0.60. We would likely consider the approximate algorithm to have performed well, and the action that we take based on its estimate will not be misguided. However, if the approximate algorithm estimated the probability of severe flood each month in some area as 0.0001, when the exact algorithm yielded 0.01, we would say the approximate algorithm failed, and the action we took based on its result (building a house with an expected lifetime of less than 10 years) was misguided. In both cases the absolute difference between the estimate and the exact value is about 0.01, but that difference is more significant in the case of the smaller probabilities. It seems that what we need for this problem is a relative measure of error. One might imagine using a "percent difference" distance measure, such as d(P1, P2) = |P1 − P2| / P1. Then the error of the approximate algorithm estimating the probability of rain would be only 1.7%, whereas that of the flood probability would be 99%, which nicely distinguishes between the two cases. However, it suffers from a couple of problems.
The proposition whose probability we are estimating may just as well have been defined in the logical inverse, Q' = ¬Q, so that the probability associated with it becomes one minus what it would otherwise be, P(Q') = 1 − P(Q). Therefore, the distance measure should treat probabilities which are close to 1 in the same way that it treats probabilities close to 0, i.e. d(P1, P2) = d(1 − P1, 1 − P2). For example, the approximate algorithm estimated the probability of no flood at 0.9999, while the exact algorithm yielded 0.9900, which is a percent difference of only 1%. One way to deal with this would be to use the percent difference in P for probabilities less than 0.5, and the percent difference in 1 − P for probabilities greater than 0.5. However, this is somewhat messy, especially when measuring differences between probabilities which straddle 0.5. A more elegant solution is to use the percent difference of odds ratios (there are also other important reasons to use odds ratios, which will be discussed later). The odds ratio of proposition A is defined as the probability that A is TRUE, divided by the probability that A is FALSE:

O(a) = P(a) / P(¬a) = P(a) / (1 − P(a))    3.3:1
O(b|a) = P(b|a) / P(¬b|a) = P(b|a) / (1 − P(b|a))

Odds ratios apply only to propositions, but occasionally we will loosely refer to the odds ratio of a probability P, by which we mean the quantity P / (1 − P). As two probabilities, P1 and P2, approach 0, the percent difference in the odds ratios of P1 and P2 approaches the percent difference of P1 and P2, and as P1 and P2 approach 1, the percent difference in the odds ratios of P1 and P2 approaches the percent difference of 1 − P1 and 1 − P2. Percent difference of odds ratios satisfies d(P1, P2) = d(1 − P1, 1 − P2), and is numerically close to percent difference of probabilities for small probabilities. We need to make another refinement to our relative distance measure.
Instead of a percent difference we use a factor difference, which is the ratio of the two quantities. For example, a 5% difference corresponds to a factor difference of 1.05. The advantage of this is symmetry. If quantity X1 changes by 5% in going to X2, it doesn't change by −5% going from X2 to X1 (it changes by 1/1.05 − 1 = −4.76%), while if X1 changes by a factor of 1.05 in going to X2, then X2 changes by a factor of 1/1.05 going to X1. To define a distance measure we are interested only in the magnitude of the change in going from one quantity to the other, not its direction, so if a factor difference is less than 1, we use its inverse instead. We denote the factor difference between P1 and P2 as ‖P1/P2‖, where the double vertical bars are similar to the vertical bars of absolute value (since ‖x‖ = max(x, 1/x) is to multiplication what |x| = max(x, −x) is to addition). This gives us a proposed relative distance measure for two probabilities P1 and P2 as:

d_C(P1, P2) = ‖ (P1 / (1 − P1)) / (P2 / (1 − P2)) ‖ = ‖ P1(1 − P2) / (P2(1 − P1)) ‖    3.3:2

Finally, we take the logarithm of d_C to provide a true distance measure, which we will denote as d_O. d_O and d_C values can always be recovered from each other, and d_O satisfies 3.1:2, so it is a suitable distance measure (d_C does not satisfy the zero condition or the triangle inequality condition). d_O turns out to be the absolute difference of log odds ratios:

d_O(P1, P2) = log d_C(P1, P2) = |log(P1 / (1 − P1)) − log(P2 / (1 − P2))|    3.3:3

Absolute difference of log odds ratios has been in widespread use for a variety of purposes. The reason for the long discussion in getting here was to show some of the reasons it is better than some of the alternatives. Any base of logarithm may be used, as long as the usage is consistent. This thesis assumes that natural logarithms are used, but makes a note on how an equation or result should be modified to accommodate some other logarithm base in the few cases where this is necessary.
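The rain and flood estimates discussed earlier in this section make the case for this measure numerically. A small sketch (the function name `d_o` is mine, not the thesis's):

```python
# The rain estimate (0.61 vs 0.60) and the flood estimate (0.0001 vs
# 0.01) from section 3.3 have nearly the same absolute difference, but
# the log-odds measure d_O of equation 3.3:3 separates them sharply.
import math

def d_o(p1, p2):
    """Absolute difference of log odds ratios (equation 3.3:3)."""
    return abs(math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2)))

print(round(abs(0.61 - 0.60), 4), round(abs(0.0001 - 0.01), 4))  # 0.01 0.0099
print(round(d_o(0.61, 0.60), 3))    # 0.042 -- a small error
print(round(d_o(0.0001, 0.01), 3))  # 4.615 -- a large error
```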
To get a feel for natural logarithm d_o values: when they are small, they are close to percent values. This is because for small x, e^x ≈ 1 + x. Thus, a d_o of 0.05 corresponds to approximately a 5.1% difference in odds ratio. If the probabilities are small, this in turn corresponds to approximately a 5.1% difference in probability. If one of the probabilities P1 or P2 is 0 or 1, the denominator of a fraction in 3.3:3 will become 0, or we will end up with log 0, which is also undefined. Throughout this thesis I use arithmetic defined on the real numbers augmented by infinity, i.e. the set ℜ ∪ {∞}. The following rules are observed: x/0 = ∞ for all x > 0; x/∞ = 0 for all x ≠ ∞; 0/0, ∞/∞, ∞ − ∞, and ∞ · 0 are undefined; and in general f(∞) = lim_{x→∞} f(x). Furthermore, the d_o measure between equal probabilities is defined to always be 0, even if the probabilities are both 0 or both 1. With these refinements we can express d_o as:

    d_o(P1, P2) = 0    if P1 = P2
                = ∞    if P1 = 0, P2 = 1  or  P1 = 1, P2 = 0
                = | log (P1 / (1 − P1)) − log (P2 / (1 − P2)) |    otherwise      3.3:4

This definition of d_o also satisfies the requirements of a probability metric as described at the beginning of section 3.2. Connection strength defined using the d_o distance measure for probabilities (i.e. the absolute difference of log odds ratios) will be called ΔO connection strength, and denoted CS_o.

3.3.1 Single Link Example

We return to the example of figure 3.2. In section 3.2 we found CS_p(A, B) = 0.25. By equations 3.3:4 and 3.1:4 we find:

    CS_o(A, B) = | log (P(b|a) / (1 − P(b|a))) − log (P(b|¬a) / (1 − P(b|¬a))) |      3.3:5

    CS_o(A, B) = | log (0.75 / 0.25) − log (0.5 / 0.5) | = 1.10

3.3.2 Range of ΔO Connection Strengths

ΔO connection strength varies from 0 for the weakest "links" (actually independence) to ∞
for the strongest links (deterministic dependence):

    0 ≤ CS_o(X, Y) ≤ ∞      3.3:6

Note that if CS_o(X, Y) = ∞, Y may be deterministic for only one value of X (such as in the case: given that X is true, Y is true, but given X is false, Y is uncertain). This differs from ΔP connection strength, where a strength of 1 indicates complete deterministic dependency. Although log odds ratio is a monotonic function of probability, ΔO connection strength is not a monotonic function of ΔP connection strength. Two nodes with a weak ΔP connection strength (CS_p close to 0) may have a strong ΔO connection strength (CS_o very large). More precisely, for any two nodes X and Y, in any BN:

    2 log [ (1 + CS_p(X, Y)) / (1 − CS_p(X, Y)) ] ≤ CS_o(X, Y) ≤ ∞      3.3:7

where the CS_o values can be anywhere in the range, depending on the particular BN. For small values of CS_p we can make the approximation:

    CS_o ≥ 4 CS_p      3.3:8

which actually holds for all values of CS_p, but doesn't provide a tight bound when CS_p is large. Conversely:

    0 ≤ CS_p(X, Y) ≤ (e^(CS_o(X, Y)/2) − 1) / (e^(CS_o(X, Y)/2) + 1)      3.3:9

If CS_o was defined using a logarithm of some base other than e, then in the equation above, e should be substituted with the actual base used.

3.4 An Alternate Definition of Connection Strength

The connection strength from node A to node B could have been defined as the maximum change in belief at B as we go from a state of no evidence at A to a state with some evidence at A, that is:

    CS'(A, B) = max ( d(P(b), P(b|+a)), d(P(b), P(b|¬a)) )      3.4:1

which is equivalent to:

    CS'(A, B) = max_{a_v} d(P(b), P(b|a_v))      3.4:2

where a_v is virtual evidence for A. This quantity can be bounded above and below using the original definition of connection strength, and the original definition is more workable in most situations, so it is preferred.
Theorem 3.4: For any two propositional variables A and B, and for any evidence e (or no evidence e), connection strength defined by equation 3.4:1 can be bounded above and below as:

    (1/2) CS(A, B|e) ≤ CS'(A, B|e) ≤ CS(A, B|e)      3.4:3

This is proved in Appendix C.

3.5 Conditional Strength and Potential Strength

In a situation where a stream of evidence arrives for a BN (items of evidence arrive sequentially through time), it is sometimes useful to consider connection strength dynamically. The conditional connection strength CS(A, B|e) is a measure of the maximum effect on node B of evidence at A, given the (possibly virtual) evidence e that has already arrived at other nodes of the network, and no other evidence. More formally:

    CS(A, B|e) = d(P(b|+a, e), P(b|¬a, e))      3.5:1

    CS(A, B|e) = sup_{a_v1, a_v2} d(P(b|a_v1, e), P(b|a_v1, a_v2, e))      3.5:2

where e is the evidence seen thus far, and, as with the CS definition, a_v1 is some (possibly virtual) evidence for A, a_v2 is some other consistent evidence (possibly virtual) for A, and d is a distance measure satisfying 3.1:2. Equation 3.5:2 is equivalent to 3.5:1 (the proof is similar to that of theorem 3.1). We define potential CS as the maximum value conditional CS may take upon receiving further evidence consistent with the evidence already received, and denote it as CSM:

    CSM(A, B|e) = max_{e+ ∼ e} CS(A, B|e, e+)      3.5:4

where e is the evidence seen thus far and e+ is evidence consistent with e (the symbol ∼ means "consistent with"). Conditional strength may increase or decrease upon gathering more evidence, but potential strength always remains constant or decreases. At all points in time conditional CS is always less than or equal to potential CS.
The following are some easily proved relationships, which are true for all nodes A and B, and consistent evidence e and e+, in any BN:

    CS(A, B|e) ≤ CSM(A, B|e) ≤ CSM(A, B)      3.5:5

    CSM(A, B|e, e+) ≤ CSM(A, B|e)      3.5:6

    max_e CS(A, B|e) = max_e CSM(A, B|e) = CSM(A, B)      3.5:7

    CSM(A, B|e) = sup_{a_v1, a_v2, e+ ∼ e} d(P(b|a_v1, e, e+), P(b|a_v1, a_v2, e, e+))      3.5:8

3.6 Connection Strength in Complex BNs

I gave an example of calculating connection strength in a simple two node BN, but how do we find the connection strength between the nodes A and B in a complex BN, where there are multiple paths between A and B, and each path consists of multiple links? Connection strength may be computed from equation 3.5:1 by the methods of BN inference:

    CS(A, B|e) = d(P(b|+a, e), P(b|¬a, e))

We instantiate the BN with evidence e and +a or ¬a, then use the techniques of BN updating discussed in section 2.6 to find the posterior probabilities P(b|+a, e) and P(b|¬a, e). Since BN inference is NP-hard (Cooper90), calculating CS in this manner can be quite expensive. Alternately, we can find bounds on CS quickly by local calculations using the concept of link strength. This will be explored in Chapter 5.

3.7 Commutativity of Connection Strength

Given a value for the ΔP connection strength from node A to node B, can we use only this information to determine the connection strength from node B to node A? It turns out that the ΔP CS in one direction does not even constrain the ΔP CS in the reverse direction (unless it is 0 or 1, in which case the CS in the reverse direction will be the same). More precisely, for any value of CS_p(A, B) in (0,1), and any value of CS_p(B, A) in (0,1), there is some BN model with these connection strengths. However, the ΔO connection strength in one direction always equals the strength in the reverse direction.
In other words, ΔO connection strength is commutative:

    CS_o(A, B) = CS_o(B, A)      3.7:2

This means that for any two nodes A and B, in any BN, the maximum amount that evidence at node B can affect the belief at node A is the same as the maximum amount that evidence at node A can affect the belief at node B, provided "effect" is measured by d_o or d_c. This is a very useful result, and later we will see how it gives CS_o and CS_c significant advantages over CS_p. Conditional and potential connection strength based on odds ratio is also commutative:

    CS_o(A, B|e) = CS_o(B, A|e)      3.7:3

    CSM_o(A, B|e) = CSM_o(B, A|e)      3.7:4

As further evidence is obtained for the network, CS_o(B, A|e) may change, but CS_o(A, B|e) will also change so that they remain equal. In extreme cases the new evidence may d-separate A and B (so CS_o goes from some number to 0), or it may connect A and B with an active path when they were d-separated (so it goes from 0 to some larger number, possibly even infinity). In any case CS_o will remain commutative.

Proof of 3.7:3 and 3.7:2: By definition of CS (equations 3.5:1 and 3.3:3):

    CS_o(A, B|e) = | log [ O(+b|+a, e) / O(+b|¬a, e) ] |

By the definition of odds ratio (equation 3.3:1):

    CS_o(A, B|e) = | log [ P(+b|+a, e) P(¬b|¬a, e) / ( P(¬b|+a, e) P(+b|¬a, e) ) ] |

By 4 applications of Bayes rule (equation 2.3:2):

    CS_o(A, B|e) = | log [ (P(+a|+b, e) P(+b|e) / P(+a|e)) (P(¬a|¬b, e) P(¬b|e) / P(¬a|e)) /
                         ( (P(+a|¬b, e) P(¬b|e) / P(+a|e)) (P(¬a|+b, e) P(+b|e) / P(¬a|e)) ) ] |

Canceling common factors:

    CS_o(A, B|e) = | log [ P(+a|+b, e) P(¬a|¬b, e) / ( P(+a|¬b, e) P(¬a|+b, e) ) ] |

By the definition of odds ratio (equation 3.3:1):

    CS_o(A, B|e) = | log [ O(+a|+b, e) / O(+a|¬b, e) ] |

By definition of CS (equation 3.5:1):

    CS_o(A, B|e) = CS_o(B, A|e)

Which proves 3.7:3. Note that when CS_o(A, B|e) is infinity (using infinity as described in section 3.3), the result still holds. There is a problem when P(+a|e) = 0 or P(¬a|e) = 0 in the second step of the proof, since then these factors cannot be canceled from the numerator and denominator.
However, in that case we simply define the CS_o values in both directions as 0, since the node A is independent of all other nodes in the network (in one direction this follows from d_o(0, 0) = 0, and in the other it is equivalent to saying P(b|+a) = P(b|¬a) when one of them is undefined). Equation 3.7:2 follows from the simple case when the evidence e is absent or irrelevant. ∎

Proof of 3.7:4: By equation 3.7:3, for any e+ ∼ e:

    CS_o(A, B|e, e+) = CS_o(B, A|e, e+)

So:

    max_{e+ ∼ e} CS_o(A, B|e, e+) = max_{e+ ∼ e} CS_o(B, A|e, e+)

But by equation 3.5:4:

    max_{e+ ∼ e} CS_o(A, B|e, e+) = CSM_o(A, B|e)

So:

    CSM_o(A, B|e) = CSM_o(B, A|e) ∎

3.8 Link Strength Definition

For each link in a BN we can supply a single number, called the link strength (LS), which is a measure of the maximum amount that evidence at the parent node of the link can affect the belief at the child, given that all the other parents of the child have some evidence. So the link strength of A→B is defined as:

    LS(A→B) = max_{c ∈ Π(C(B) − {A})} CS(A, B|c)      3.8:1

where C(B) is the set of parents of B. The strength of the A→B link in figure 3.8:1 is the maximum value of connection strength from A to B as all the parents C1, C2, ..., Cn take on different evidence. Computing the LS from equation 3.8:1 can be done completely locally; it requires knowledge only of the NCP at B.

Figure 3.8:1 - These nodes may be embedded in a larger BN. The only nodes shown are B and all its parents, since these are the nodes required to define the link strength of the link from A to B.

Given its parents, B is independent of the rest of its ancestors, so equation 3.8:1 is equivalent to:

    LS(A→B) = max_{a ∈ Π(C*(B) − {A})} CS(A, B|a)      3.8:2

where C*(B) is the set of all ancestors of B. We can still speak of an A→B link strength even if there is no link from A to B (that is, A is not a parent of B). If A is a descendant of B, then adding a link from A to B would create a cycle, and LS(A→B) is undefined.
But if A is not a descendant of B, and is not a parent of B, it will be independent of B given B's parents, so by equation 3.8:1, LS(A→B) is zero. In this case we sometimes say there is a null link from A to B, although of course it wouldn't count as a link to the d-separation algorithm, and normally we wouldn't draw it on a BN diagram. The range of LS values is the same as that of CS, so LS_p values vary from 0 for null links to 1 for the strongest links (potential deterministic dependence), and LS_o values vary from 0 for null links to infinity for the strongest links:

    0 ≤ LS_p ≤ 1        0 ≤ LS_o ≤ ∞      3.8:3

Recalling the algorithm described in section 2.4 used to construct the dag of a BN, we can view it slightly differently now. We are given a FJD and we wish to construct a BN to represent it. First we choose a total ordering for the nodes. Using equation 3.8:1 on the FJD and the ordering, we can find the LS between any pair of nodes. We put a link between those nodes (in the direction from the node earlier in the ordering to the later one) if the LS is greater than zero. After doing this for every pair of nodes we have the completed dag.

3.9 Comparing Link and Connection Strengths

A fundamental difference between CS and LS is that CS is defined with respect to a FJD, whereas LS is defined with respect to a FJD and an ordering on the nodes. For a given probabilistic specification, the CS between two nodes will be the same regardless of what BN is used to represent that specification. But LS values between two nodes will generally be different depending on the particular BN used to represent the FJD. Figure 3.9:1 shows an example of the invariance of CS, and the dependence of LS, on the particular BN ordering. Both BNs represent the same FJD, but the total order used in (a) is A, B, C while the order in (b) is A, C, B.
The ΔO link strengths, which are written along the links, are different between (a) and (b), but the connection strengths, which appear to the right, are the same for (a) and (b). Notice that link B→C has reversed in going from (a) to (b), and its link strength remains the same. If any BN is modified by reversing a link, say B→C, the only link strengths that will change are those of links going to B or to C, and if the ΔO measure for link strength is used, the strength of the link being reversed won't change.

Figure 3.9:1 - These two BNs represent the same FJD. Link strengths are different in each, but connection strengths are the same. The link strengths (LS_o) are displayed alongside the links, and the connection strengths (CS_o) are to the right. The total order in (a) is A, B, C, while in (b) the B→C link has been reversed, so it is A, C, B.
(a) P(a) = 0.5; P(b|+a) = 0.7, P(b|¬a) = 0.2; P(c|+a+b) = 0.5, P(c|+a¬b) = 0.8, P(c|¬a+b) = 0.3, P(c|¬a¬b) = 0.1. Link strengths: A→B 2.233, A→C 3.584, B→C 1.386.
(b) P(a) = 0.5; P(c|+a) = 0.59, P(c|¬a) = 0.14; P(b|+a+c) = 0.5932, P(b|+a¬c) = 0.8537, P(b|¬a+c) = 0.4286, P(b|¬a¬c) = 0.1628.
Both: CS(A, B) = 2.233, CS(B, C) = 0.714, CS(A, C) = 2.179.

Another fundamental difference between CS and LS is that LS is a local measure which can be computed very quickly, while CS is a global measure which may be very difficult to compute. Consider the BN in figure 3.9:2.

Figure 3.9:2 - A BN to illustrate the global nature of CS and the local nature of LS.

Calculating an exact value for CS(A, B) involves all the nodes in the network, since evidence at A can change the belief at B by the active path through Z. However, calculating LS(A→B) involves only the nodes A, B, and C, since taking the maximum with C having evidence blocks this active path. In fact, it really only involves the NCP at B.
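The locality of LS can be seen concretely using the NCP for node C in figure 3.9:1(a). A sketch in Python (the helper names are illustrative; it implements equation 3.8:1 by maximizing d_o over the joint configurations of the other parents, consulting nothing but the child's NCP, and recovers the link strengths shown in the figure):

```python
import math
from itertools import product

def d_o(p1, p2):
    """Absolute difference of log odds ratios (equation 3.3:3)."""
    return abs(math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2)))

# NCP for node C of figure 3.9:1(a): P(+c | a, b), keyed by (a, b).
ncp_c = {(True, True): 0.5, (True, False): 0.8,
         (False, True): 0.3, (False, False): 0.1}

def link_strength(ncp, parent, n_parents):
    """LS_o of the link from parent number `parent` into the child
    (equation 3.8:1): maximize d_o over all joint configurations of
    the child's other parents.  Only the child's NCP is consulted."""
    others = [i for i in range(n_parents) if i != parent]
    best = 0.0
    for config in product([True, False], repeat=len(others)):
        fixed = dict(zip(others, config))
        row = lambda v: tuple({**fixed, parent: v}[i] for i in range(n_parents))
        best = max(best, d_o(ncp[row(True)], ncp[row(False)]))
    return best

print(round(link_strength(ncp_c, 0, 2), 3))   # 3.584 = LS_o(A->C) in (a)
print(round(link_strength(ncp_c, 1, 2), 3))   # 1.386 = LS_o(B->C) in (a)
```

Computing CS(A, C), by contrast, would require instantiating A and performing full inference over the whole network, as described in section 3.6.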
For any BN, calculating a precise value for the unconditional CS between two nodes involves all their ancestors, and calculating a precise value for the conditional CS between two nodes involves all their ancestors, and all the ancestors of every node with evidence. Also, in general this calculation is of exponential complexity. Calculating the LS between two nodes involves only the NCP of the child node, and the complexity is linear in the number of NCP probabilities (or better than this, depending on how the NCP probabilities are represented).

4 Using Link Strength to Bound Connection Strength

4.1 ΔP Serial Combination

Link strengths may be combined by local operations to provide bounds on the connection strength of widely separated nodes. This allows for fast algorithms to find a limit on the degree to which evidence at one node can affect the belief at another (although in some cases the bound is not very tight). For a simple example, consider the following BN with two links in series:

Figure 4.1 - A three node BN (A → B → C, with NCPs P(a), P(b|a), and P(c|b)) in which we want to find the CS from A to C given the link strengths of A→B and B→C.

We wish to find a bound on the ΔP connection strength from A to C given only the ΔP strengths of the two links A→B and B→C. In this case we can actually find an exact relation (the proof is in Appendix C):

Theorem 4.1: For any three propositional variables A, B, and C, where A and C are independent given B:

    CS_p(A, C) = CS_p(A, B) CS_p(B, C)      4.1

If node A is the only parent of B, and B is the only parent of C (as in figure 4.1), then the definition of link strength matches that of connection strength and we obtain:

    CS_p(A, C) = LS_p(A→B) LS_p(B→C)

This rule was reported in Henrion89 in a different form. It also corresponds to the "shrinkage factor" of Markov net analysis (Howard71, p. 20). It can be chained for longer paths as long as the arrows all point in the same direction.
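Theorem 4.1 can be checked numerically by brute force. A sketch in Python, assuming the ΔP connection strength of a single link is |P(b|+a) − P(b|¬a)| as in the single link example of section 3.2 (the variable names are illustrative):

```python
import random

random.seed(0)
for _ in range(1000):
    # Random NCPs for the chain A -> B -> C of figure 4.1.
    p_b_a, p_b_na = random.random(), random.random()   # P(b|+a), P(b|~a)
    p_c_b, p_c_nb = random.random(), random.random()   # P(c|+b), P(c|~b)

    # Marginalize B out to get P(c|a) for each value of A.
    p_c_a  = p_c_b * p_b_a  + p_c_nb * (1 - p_b_a)
    p_c_na = p_c_b * p_b_na + p_c_nb * (1 - p_b_na)

    cs_ab = abs(p_b_a - p_b_na)   # CS_p(A,B) = LS_p(A->B)
    cs_bc = abs(p_c_b - p_c_nb)   # CS_p(B,C) = LS_p(B->C)
    cs_ac = abs(p_c_a - p_c_na)   # CS_p(A,C)

    assert abs(cs_ac - cs_ab * cs_bc) < 1e-12   # theorem 4.1 holds exactly
```

The identity falls out of the marginalization step: P(c|+a) − P(c|¬a) factors as (P(c|+b) − P(c|¬b))(P(b|+a) − P(b|¬a)).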
Since each ΔP link strength is less than one, the effect of evidence can only be attenuated (or unchanged) by intermediate links; it can never be amplified.

4.2 ΔO Serial Combination

Consider the BN in figure 4.1 once again. This time we want to find a bound on the ΔO connection strength from A to C, given only information on the ΔO strengths from A to B, and from B to C. Below is the tightest bound that can be placed with only this information (see Appendix C for the derivation):

Theorem 4.2: For any three propositional variables A, B, and C, where A and C are independent given B:

    tanh((1/4) CS_o(A, C)) ≤ tanh((1/4) CS_o(A, B)) tanh((1/4) CS_o(B, C))      4.2:1

If node A is the only parent of B, and B is the only parent of C (as in figure 4.1), then the definition of link strength matches that of connection strength and we obtain:

    tanh((1/4) CS_o(A, C)) ≤ tanh((1/4) LS_o(A→B)) tanh((1/4) LS_o(B→C))

where tanh is the hyperbolic tangent function:

    tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))      4.2:2

When x ≥ 0, which is the domain of ΔO connection and link strengths, tanh(x) stays within the range [0, 1], and increases strictly monotonically as x increases (see figure 4.2:1). If the logarithms used to define CS_o and LS_o are not natural logarithms, the multiplying constant 1/4 in the equation above changes to another constant dependent on the logarithm base, but otherwise the equation remains the same.

Figure 4.2:1 - Graph of tanh(x/4). It starts out quite linear and then asymptotically approaches 1.

First, consider a case in which LS_o(B→C) = 0 in the BN of figure 4.1, so B is independent of C. Then, by equation 4.2:1, tanh((1/4) CS_o(A, C)) = 0, which implies CS_o(A, C) = 0, so C is independent of A. Second, consider a case in which LS_o(B→C) = ∞, so B and C are deterministically related. Then tanh((1/4) LS_o(B→C)) = 1, and so CS_o(A, C) ≤ LS_o(A→B). In general, when links A→B and B→C are in series, we can consider the link B→C as attenuating the effect of the link A→B on C.
The degree of attenuation is always between that of the first case (independence leading to complete attenuation) and that of the second case (determinism leading to no attenuation). Equation 4.2:1 can be chained for longer paths, resulting in a product with one term for each link. Since tanh(x) stays within the range [0, 1], the effect of evidence can only be attenuated (or unchanged) by intermediate links. For small x, tanh(x) ≈ x, and for all x ≥ 0, tanh(x) ≤ x, so from equation 4.2:1 we can form the following bound:

    CS_o(A, C) ≤ (1/4) LS_o(A→B) LS_o(B→C)      4.2:3

which is valid for all values of LS_o, but only forms a reasonably tight bound when both LS_o values are small (say LS_o ≤ 2). Using equation 3.3:8, we can see that the equation above, and therefore the ΔO serial link strength equation (4.2:1), approaches the ΔP serial link strength equation (4.1) when the links are weak (small LS_o values).

Some bounds provided in the field of computer science are notorious for being so far above values actually obtained in practice that the bound is almost useless. A simulation was done in which 200000 BNs of the topology in figure 4.1 were generated, with NCP values drawn from a uniform distribution on [0, 1]. In each case the bound on CS(A, C) given by equation 4.2:1 was compared to the true value of CS(A, C), and frequency histograms of the results appear in figures 4.2:2 and 4.2:3. Of course, in some applications the actual distribution of NCPs may be very different, which could lead to very different results on the tightness of the bound, but at least we can see that for this example application, the bound for CS given by equation 4.2:1 is usually not far above the true value of CS.

Figure 4.2:2 - Frequency histogram of (CS Bound) − (Actual CS): the amount that the bound calculated from LS(A→B) and LS(B→C) exceeded the actual value of CS(A, C) in 200000 cases of random BNs with the structure of figure 4.1 and uniformly generated NCPs.
Notice the point on the vertical axis at a height above 5, and the point to the left of the axis at a height of zero. The same graph is drawn with different scales below.

Figure 4.2:3 - The graph of figure 4.2:2 with expanded scales for (CS Bound) − (Actual CS).

4.3 Fundamental Sensitivity Equation

Now we wish to find a method we can use to find bounds on connection strength in a BN of arbitrary complexity. Sometimes we can break a probability problem down into two simpler problems by assuming that some proposition is true, solving the problem, then assuming it is false, and re-solving the problem. Then we say the solution for the case where we don't know the value of the proposition is somewhere between the solutions for the proposition being true and being false. For example, this method will work to find posterior probabilities. The posterior probability for a proposition Q in the case that we don't know the value of proposition Z will be between the values it would have if Z were true or Z were false. In fact the weighted average of these two bounds (weighted by the probabilities of Z being true or false) is exactly the probability of Q for the case where we don't know Z, which is the basis for the reasoning by assumptions method described in section 2.6. So we always have the following two relations:

    P(q|+v,¬z) ≤ P(q|+v) ≤ P(q|+v,+z)   or   P(q|+v,+z) ≤ P(q|+v) ≤ P(q|+v,¬z)      4.3:1

    P(q|¬v,¬z) ≤ P(q|¬v) ≤ P(q|¬v,+z)   or   P(q|¬v,+z) ≤ P(q|¬v) ≤ P(q|¬v,¬z)      4.3:2

We might be tempted to think connection strength will behave in the same way, since it is just the distance between the two posterior probabilities bounded in the two expressions above.
That is, since 4.3:1 and 4.3:2 hold, we might expect the following to hold:

    d(P(q|+v), P(q|¬v)) ≤ d(P(q|+v,+z), P(q|¬v,+z))
      or   d(P(q|+v), P(q|¬v)) ≤ d(P(q|+v,¬z), P(q|¬v,¬z))      [false]  4.3:3

or equivalently:

    d(P(q|+v), P(q|¬v)) ≤ max_z d(P(q|+v,z), P(q|¬v,z))      [false]  4.3:4

    CS(V, Q) ≤ max_z CS(V, Q|z)      [false]  4.3:5

But these equations do not hold. The reason for following this line of thought is to warn that the connection strength for the case when a proposition Z is unknown is not bounded by the strengths for the cases when Z is true and when Z is false, as it is for a number of other probabilistic quantities (like belief, odds ratio belief, expected utility, most probable explanation, etc.). A trivial example to illustrate is shown in figure 4.3:1. When Z is unknown, CS(V, Q) may be nonzero, but when Z is known, whether it is true or false, it d-separates V from Q, so CS(V, Q|+z) and CS(V, Q|¬z) are zero.

Figure 4.3:1 - CS(V, Q) may be nonzero, but CS(V, Q|+z) and CS(V, Q|¬z) are both zero, so they don't provide bounds for CS(V, Q).

However, there is an equation similar to equation 4.3:5 that we can use to decompose connection strength problems.

Theorem 4.3: For any three propositional variables V, Q, and Z, we can decompose the connection strength from V to Q on cases of Z as follows:

    CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * min( CS(Z, Q|+v), CS(Z, Q|¬v), CS(Z, Q) )      4.3:6

where * is a generalized multiplication corresponding to the serial combination rule. If the ΔP measure is used for connection strength, it is regular multiplication, but for the ΔO measure it is the ΔO serial combination rule:

    x * y = 4 tanh⁻¹( tanh(x/4) tanh(y/4) )      4.3:8

Since all connection strength problems can be decomposed using equation 4.3:6, it is called the fundamental sensitivity equation in this thesis. It is proved in Appendix C. Notice that it is similar to the more traditional style of decomposition equation given by 4.3:5, but with the addition of an extra term.
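The generalized multiplication is a one-liner in code. A sketch in Python (ao_mult is an illustrative name) that also checks the attenuation behavior described in section 4.2:

```python
import math

def ao_mult(x, y):
    """Generalized multiplication for the delta-O measure (equation 4.3:8):
    x * y = 4 atanh(tanh(x/4) tanh(y/4))."""
    return 4.0 * math.atanh(math.tanh(x / 4.0) * math.tanh(y / 4.0))

# A strength-0 (independent) term annihilates the product, and an
# infinite (deterministic) term passes the other strength through:
assert ao_mult(1.7, 0.0) == 0.0
assert abs(ao_mult(1.7, math.inf) - 1.7) < 1e-9
# Serial combination can only attenuate: x * y <= min(x, y),
# and for small strengths it approaches (1/4) x y (equation 4.2:3).
assert ao_mult(2.0, 3.0) <= min(2.0, 3.0)
assert abs(ao_mult(0.1, 0.2) - 0.1 * 0.2 / 4.0) < 1e-4
```

Three links in series combine as ao_mult(ao_mult(l1, l2), l3); the operation is associative and commutative, since inside the atanh it is just a product of tanh terms.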
To help visualize the fundamental equation, consider the BN of figure 4.3:2. Varying evidence at node V will create a varying belief at the query node Q. We are interested in decomposing the effect on Q by cases of Z. We can consider equation 4.3:6 as composed of two parts, each corresponding to one of the active paths from V to Q. The first part is max_z CS(V, Q|z), and this corresponds to the link from V to Q, since Z is given and therefore the other path is blocked. The second part is CS(V, Z) * CS(Z, Q), which corresponds to the serial connection of the two links V→Z and Z→Q. Together they provide a limit on how much the node V can affect the belief at Q.

Figure 4.3:2 - A BN to help visualize the terms of the fundamental equation (nodes V, Z, and Q, with NCPs P(v), P(z|v), and P(q|v, z)). The equation can be considered to consist of a term for the V→Q path in parallel with a serial combination of terms for V→Z and Z→Q.

The fundamental equation applies to any 3 variables of any BN. They may be connected like the BN in figure 4.3:2, or they may be scattered through a large network with no links between them.

4.4 Example of Finding CS by Fundamental Equation

Suppose we have the BN of figure 4.4:1 and we wish to find the connection strength from node X1 to node X3.

Figure 4.4:1 - A BN for which we wish to find the CS from node X1 to X3.

The fundamental equation implies both of the following two equations, which we will use to solve this problem:

    CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * CS(Z, Q)      4.4:1

    CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * max_v CS(Z, Q|v)      4.4:2

Using equation 4.4:2 with X1 as V, X3 as Q, and X0 as Z:

    CS(X1, X3) ≤ max_{x0} CS(X1, X3|x0) + CS(X1, X0) * max_{x1} CS(X0, X3|x1)      4.4:4

Now we find bounds on the three terms of the equation above, one by one, by using the fundamental equation repeatedly.
For the first term:

    max_{x0} CS(X1, X3|x0) ≤ max_{x0,x2} CS(X1, X3|x0,x2) + max_{x0} CS(X1, X2|x0) * max_{x0,x1} CS(X2, X3|x0,x1)      4.4:5

The first term in the equation above (4.4:5) matches the definition of link strength for the X1→X3 link:

    LS(X1→X3) = max_{c ∈ Π(C(X3) − {X1})} CS(X1, X3|c) = max_{x0,x2} CS(X1, X3|x0,x2)      4.4:6

The other terms in equation 4.4:5 are also link strengths, so we may rewrite equation 4.4:5 as:

    max_{x0} CS(X1, X3|x0) ≤ LS(X1→X3) + LS(X1→X2) * LS(X2→X3)      4.4:7

The second term of equation 4.4:4 is CS(X1, X0). If we are using the ΔO measure of CS, then this is equivalent to CS(X0, X1), which is the link strength from X0 to X1. For the last term of equation 4.4:4 we get:

    max_{x1} CS(X0, X3|x1) ≤ max_{x1,x2} CS(X0, X3|x1,x2) + max_{x1} CS(X0, X2|x1) * max_{x0,x1} CS(X2, X3|x0,x1)      4.4:8

Each of the terms in the above equation is a link strength, so we may write it as:

    max_{x1} CS(X0, X3|x1) ≤ LS(X0→X3) + LS(X0→X2) * LS(X2→X3)      4.4:9

Combining all of this we finally get our bound for CS(X1, X3):

    CS(X1, X3) ≤ LS(X1→X3)
                + LS(X1→X2) * LS(X2→X3)
                + LS(X0→X1) * ( LS(X0→X3) + LS(X0→X2) * LS(X2→X3) )      4.4:10

By examining the BN in figure 4.4:1, we can see how the equation above can be considered as having terms for all of the paths from X1 to X3. The first line corresponds to the link straight from X1 to X3. The second line corresponds to the path from X1, through X2, to X3 (i.e. the X1→X2 and X2→X3 links in series). The last line corresponds to the paths from X1 through X0, then either straight to X3, or from X0 to X2, then to X3. Notice that X4 was not involved at all in finding CS(X1, X3). For any BN, finding CS(Xi, Xj), where i and j indicate the positions of the nodes in the total order, does not involve any nodes Xk where k > i and k > j, since there are no active paths from Xi to Xj through Xk (they all have at least one converging node).
Because of this, equation 4.4:10 can be used to find the connection strength from node 1 to node 3 in any BN. The nodes that come after node 3 are irrelevant. Equation 4.4:10 was developed for a fully connected network (i.e. every two nodes are connected by a link). If we wish to use it for a network that is not fully connected, then we just use a link strength of 0 between nodes with no link connecting them, as described in section 3.9. For example, we can use equation 4.4:10 to find a bound on CS(X1, X3) for the BN in figure 4.4:2. Setting LS(X1→X2) = 0 and LS(X0→X3) = 0, we obtain:

    CS(X1, X3) ≤ LS(X1→X3) + LS(X0→X1) * LS(X0→X2) * LS(X2→X3)      4.4:11

This consists of two parallel paths, one straight from X1 to X3, and the other consisting of 3 links in series: X1 to X0, then to X2, and finally to X3. The path from X1 through X4 to X3 does not appear because it is blocked by the converging node X4.

Figure 4.4:2 - A BN like that in figure 4.4:1, but with fewer links. We can use the equation developed for bounding CS(X1, X3) of the BN in 4.4:1 to bound CS(X1, X3) of this BN. Its NCPs include: P(x0) = 0.3; P(x3|+x1+x2) = 0.3, P(x3|+x1¬x2) = 0.05, P(x3|¬x1+x2) = 0.7, P(x3|¬x1¬x2) = 0.2; P(x4|+x1+x3) = 0.9, P(x4|+x1¬x3) = 0.1, P(x4|¬x1+x3) = 0.2, P(x4|¬x1¬x3) = 0.8.
Using the numbers provided we obtain:

    LS_o(X1→X0) = LS_o(X0→X1) = d_o(P(x1|+x0), P(x1|¬x0)) = d_o(0.45, 0.64) = 0.776

    LS_o(X0→X2) = d_o(P(x2|+x0), P(x2|¬x0)) = d_o(0.92, 0.30) = 3.29

    LS_o(X1→X3) = max( d_o(P(x3|+x1+x2), P(x3|¬x1+x2)), d_o(P(x3|+x1¬x2), P(x3|¬x1¬x2)) )
                = max( d_o(0.3, 0.7), d_o(0.05, 0.2) ) = 1.69

    LS_o(X2→X3) = max( d_o(P(x3|+x2+x1), P(x3|¬x2+x1)), d_o(P(x3|+x2¬x1), P(x3|¬x2¬x1)) )
                = max( d_o(0.3, 0.05), d_o(0.7, 0.2) ) = 2.23

    CS_o(X1, X3) ≤ LS_o(X1→X3) + LS_o(X0→X1) * LS_o(X0→X2) * LS_o(X2→X3)
                 = 1.69 + 0.776 * 3.29 * 2.23
                 = 1.69 + 4 tanh⁻¹( tanh(0.776/4) tanh(3.29/4) tanh(2.23/4) )
                 = 1.95

So the bound we calculate is CS_o(X1, X3) ≤ 1.95. In actual fact CS_o(X1, X3) = 1.543. We can repeat the calculation to find a bound on ΔP connection strength, but the calculation of the "backwards" connection strength CS_p(X1, X0) is more difficult than in the ΔO case, since there it was the same as CS_o(X0, X1), which is LS_o(X0→X1). Actually, we can easily find CS_p(X1, X0) using Bayes rule, but in the general case finding "backwards" CS_p link strength involves nonlocal calculations.

4.5 Path Based Methods

Most researchers have been reluctant to attach formal significance to the paths of a BN. It seems that on an intuitive level they make use of ideas like "effect" or "influence" "flowing" along the paths of a BN, and will speak of "weak paths", "strong links", etc., but do not formally define these concepts. One reason for this may be that paths are not intrinsic to the probabilistic model (the FJD), but are an artifact of the BN factoring process. So if a BN had been constructed with a different total ordering for the nodes, its path structures could have turned out completely different. The same effect is created by the operation of link reversal, which does not change the FJD represented by a BN, but can add or remove links, thereby changing paths.
Another reason for avoiding paths could be due to concerns that they lead to an over-simplistic view of the BN. There is a tendency for beginners to think of BNs as constraint networks, with the belief of a child node given as a function of the beliefs of its parents. Or, if they are a bit more sophisticated, they may think the belief in a node can be given as a function of the beliefs in the nodes of its Markov boundary (i.e. its parents, children, and parents of its children). Actually, to express the belief in a node as a function, it must be expressed as a function of the joint beliefs of its Markov boundary nodes (i.e. the beliefs in the Cartesian product of their values). Thinking in terms of paths can obscure this. For example, consider the BN of figure 4.5:1. When there is no evidence, the beliefs at A, B, C, and D are all 1/2. If we get evidence TRUE for A, the beliefs at B and C remain at 1/2, but the belief at D changes to 3/4. Thinking in terms of a constraint network, or "flow of influence along paths", it is hard to see how a change at A can create a change at D without changing the beliefs at B or C. Of course, it is the joint belief in B and C which has changed (BEL(+b+c) changes from 1/4 to 3/8, BEL(+b¬c) changes from 1/4 to 1/8, etc.). So we must be careful with the path concept.

Figure 4.5:1 - Evidence at A changes the belief at D, but not at B or C. The NCPs are: P(a) = 1/2; P(b|+a) = P(b|¬a) = 1/2; P(c|+a+b) = 3/4, P(c|+a¬b) = 1/4, P(c|¬a+b) = 1/4, P(c|¬a¬b) = 3/4; P(d|+b+c) = 1, P(d|+b¬c) = 0, P(d|¬b+c) = 0, P(d|¬b¬c) = 1. With no evidence P(b) = P(c) = P(d) = 1/2 and P(c|+a) = P(c|¬a) = 1/2, but P(d|+a) = 3/4 and P(d|¬a) = 1/4.

There are only a few examples of previous research that make extensive use of paths in BNs. The most obvious example is the d-separation algorithm itself. It finds independence information by tracing active and blocked paths.
Since connection strength can be considered the "degree of independence", it is not surprising that bounds for it can be found using a path based method as well. Suermondt92 uses paths to generate explanations of BN inference for a human user. He considers some paths more significant than others if breaking a link along one of them results in a significant change in the inference result. That way he can prioritize "chains of reasoning" in presenting the explanation. Intuitively he considers something "flowing" along paths, although he is wary of going too far with this line of thought, as is evident from the quote: "Such an image, in which probabilistic updates are treated analogously to electrical currents, is simplistic, and is invalid in many cases; we cannot predict in a definitive manner the combined effects of evidence transmission along multiple chains by analyzing the chains separately, since there are often unpredictable synergistic effects." He doesn't arrive at any of the central results of this thesis, since he measures changes in belief due to breaking a link by doing full Bayesian inference with the link in place, then again with the link broken, and then compares the beliefs produced in each case (similar to the method mentioned in section 3.6, but breaking links instead of instantiating nodes). This can be very expensive when there are many links to try breaking, and worse when one considers combinations of links. Of course his method is much more accurate than the bounds calculated in this thesis, but the methods of this thesis could be used as a pre-screening phase to eliminate obviously weak paths from consideration (since the main purpose is to leave out very weak paths from the explanation). Wellman90 uses the paths in a BN to do qualitative probabilistic inference. The purpose is to answer the question, "if the belief in this node increases (say through virtual evidence), will the belief in this other node increase or decrease?"
He traces "influence" along paths to arrive at the answer. This is discussed in greater detail in section 6.1.1.

4.5.1 The CS Path Algorithm

We can simplify using the fundamental equation to find connection strength bounds, by using it to develop a path-based algorithm. The resulting formulation is also intuitively more appealing. We will find an expression for a bound on the AO connection strength from node X_v to node X_q, that is CS_O(X_v, X_q), for a fully connected BN. The expression will be entirely in terms of link strengths. Later the solution for any BN can be generated simply by setting the link strengths of missing links to 0 in that expression. We start by providing a total ordering for the nodes of the BN consistent with its dag, and we label the nodes X_0, X_1, ..., X_n, where a lower index corresponds to earlier in the total order. Since we are assuming the BN is fully connected, the parents of node X_i will be {X_j | 0 ≤ j < i}. To find CS_O(X_v, X_q) we first apply the fundamental equation (version 4.4:1) to it, with Z = X_0, and obtain:

CS(X_v, X_q) ≤ CS(X_v, X_0) * CS(X_0, X_q) + max_{x0} CS(X_v, X_q | x0)

Now we apply version 4.4:1 of the fundamental equation again, this time with Z = X_1, to the last term of the above equation. Then we repeat the process until it has been done with Z = X_j, for j = 0 to j = v−1. The result is the expansion shown below. Each line corresponds to one application of the fundamental equation (used with Z = X_j, j having the value shown in the rightmost column) to the last term of the line above it. The resulting bound is the sum of products in the left hand column.

CS(X_v, X_q) ≤
  CS(X_v, X_0) * CS(X_0, X_q) + max_{x0} CS(X_v, X_q | x0)                                        j=0
  max_{x0} CS(X_v, X_1 | x0) * max_{x0} CS(X_1, X_q | x0) + max_{x0,x1} CS(X_v, X_q | x0,x1)      j=1
  max_{x=(x0,x1)} CS(X_v, X_2 | x) * max_{x=(x0,x1)} CS(X_2, X_q | x)
      + max_{x=(x0,x1,x2)} CS(X_v, X_q | x)                                                       j=2
  ...
  max_{x=(x0,..,x_{v-2})} CS(X_v, X_{v-1} | x) * max_{x=(x0,..,x_{v-2})} CS(X_{v-1}, X_q | x)
      + max_{x=(x0,..,x_{v-1})} CS(X_v, X_q | x)                                                  j=v−1

We can express the above sum of products as:

CS(X_v, X_q) ≤ Σ_{j=0}^{v-1} [ max_{x=(x0,..,x_{j-1})} CS(X_v, X_j | x) * max_{x=(x0,..,x_{j-1})} CS(X_j, X_q | x) ]
               + max_{x=(x0,..,x_{v-1})} CS(X_v, X_q | x)                                         4.5:1

Since this derivation is for AO connection strength, which is commutative, we can reverse the order of the CS arguments in the first factor of the first line. Also, we can include the second line in the sum by increasing the range of its index, so we get:

CS(X_v, X_q) ≤ Σ_{j=0}^{v} max_{x=(x0,..,x_{j-1})} CS(X_j, X_v | x) * max_{x=(x0,..,x_{j-1})} CS(X_j, X_q | x)     4.5:2

The CS expressions in the above equation are all of the form max_{x=(x0,..,x_{j-1})} CS(X_j, X_q | x), some with j or q replaced by v. We can generate an expansion to solve for these expressions in the same way that we did for CS(X_v, X_q). It appears below, with each line formed by expanding the last term of the line above it using version 4.4:2 of the fundamental equation (with Z = X_k, k having the value shown in the rightmost column), to form the sum of products shown in the left hand column.

max_{x=(x0,..,x_{j-1})} CS(X_j, X_q | x) ≤
  max_{x=(x0,..,x_{j-1})} CS(X_j, X_{j+1} | x) * max_{x=(x0,..,x_j)} CS(X_{j+1}, X_q | x)
      + max_{x=(x0,..,x_{j-1},x_{j+1})} CS(X_j, X_q | x)                                          k=j+1
  max_{x=(x0,..,x_{j-1},x_{j+1})} CS(X_j, X_{j+2} | x) * max_{x=(x0,..,x_{j+1})} CS(X_{j+2}, X_q | x)
      + max_{x=(x0,..,x_{j+1},x_{j+2})} CS(X_j, X_q | x)                                          k=j+2
  ...
  max_{x=(x0,..,x_{j-1},x_{j+1},..,x_{q-2})} CS(X_j, X_{q-1} | x) * max_{x=(x0,..,x_{q-2})} CS(X_{q-1}, X_q | x)
      + max_{x=(x0,..,x_{j-1},x_{j+1},..,x_{q-1})} CS(X_j, X_q | x)                               k=q−1

We can express the above sum of products as:

max_{x=(x0,..,x_{j-1})} CS(X_j, X_q | x) ≤
  Σ_{k=j+1}^{q-1} [ max_{x=(x0,..,x_{j-1},x_{j+1},..,x_{k-1})} CS(X_j, X_k | x) * max_{x=(x0,..,x_{k-1})} CS(X_k, X_q | x) ]
  + max_{x=(x0,..,x_{j-1},x_{j+1},..,x_{q-1})} CS(X_j, X_q | x)

Once again we must evaluate the two factors in the sum of products above. The first factor of the product matches the link strength definition. The second factor is of the same form as the expression being bounded, but with the j index at least one larger, so it may be bounded recursively using the same equation.
Folding the second line into the sum, we obtain the following recursive equation:

max_{x=(x0,..,x_{j-1})} CS(X_j, X_q | x) ≤
  { Σ_{k=j+1}^{q} LS(X_j→X_k) * max_{x=(x0,..,x_{k-1})} CS(X_k, X_q | x)      j < q              4.5:3
  { ∞                                                                          j = q

This equation can be used to bound each of the CS expressions of equation 4.5:2. Now we modify equations 4.5:2 and 4.5:3 to suit the case of a network that is not fully connected, by setting the appropriate link strengths to zero. The range of the summation is reduced to only include nonzero terms. Also, we use a notation in which C+(X) are the ancestors of X, C*(X) are the ancestors of X together with the node X itself, and S(X) are the successors (children) of X.

CS(X_v, X_q) ≤ Σ_{X_j ∈ C*(X_v) ∩ C+(X_q)} max_{x ∈ ΠC+(X_j)} CS(X_j, X_v | x) * max_{x ∈ ΠC+(X_j)} CS(X_j, X_q | x)    4.5:4

max_{x ∈ ΠC+(X_j)} CS(X_j, X_q | x) ≤
  { Σ_{X_k ∈ S(X_j) ∩ C*(X_q)} LS(X_j→X_k) * max_{x ∈ ΠC+(X_k)} CS(X_k, X_q | x)    X_j ≠ X_q    4.5:5
  { ∞                                                                                 X_j = X_q

The expression max_{x ∈ ΠC+(X_j)} CS(X_j, X_q | x) appears repeatedly in the above equations, so we define a quantity termed "the strength of all forward paths from X_j to X_q", and denote it with the letter F, as follows:

F(X_j, X_q) = max_{x ∈ ΠC+(X_j)} CS(X_j, X_q | x)    4.5:6

We substitute this into our two CS bounding equations. Also, since they no longer rely on the total order, and we don't need integer indexes for the nodes, we can rename all the nodes from X_a style to A style for notational aesthetics. Keeping in mind Q ∉ C*(V), we have:

CS(V, Q) ≤ Σ_{J ∈ C*(V) ∩ C+(Q)} F(J, V) * F(J, Q)        Q ∉ C*(V)    4.5:7

F(X, Y) = { Σ_{K ∈ S(X) ∩ C*(Y)} LS(X→K) * F(K, Y)        X ≠ Y        4.5:8
          { ∞                                              X = Y

With the two equations above we can now bound the connection strength between any two nodes of any BN, using only link strength values. The following written description may help to make the above equations more intuitive.
Specification 4.5: A bound on the AO connection strength from node V to node Q, when Q is not an ancestor of V, is given by: the strength of all forward paths from V to Q, plus the sum, over every node J which is an ancestor of both V and Q, of the strength of all backwards paths from V to J multiplied by the strength of all forward paths from J to Q. The "strength of all forward paths from X to Y" is bounded by the sum, over all of X's children K which are ancestors of Y (or Y itself), of the strength of the link from X to K multiplied by the strength of all forward paths from K to Y. The "strength of all backward paths from V to J" is the same as the "strength of all forward paths from J to V". In the above, the term "multiply" is used in its generalized sense to mean serial combination as described in section 4.2. To find a bound on CS_O(V,Q) when Q is an ancestor of V, we use the above algorithm to find a bound on CS_O(Q,V), which is equal to CS_O(V,Q).

4.6 Complexity of Path Algorithm

Using equations 4.5:7 and 4.5:8 directly in a recursive manner to find a bound for connection strength results in an algorithm of exponential worst case complexity. However, the same values are being repeatedly calculated, and so by doing a "bottom up" evaluation instead, the complexity becomes linear in the number of links. The bottom up algorithm to find a bound for CS(V,Q) involves calculating F(J,V) and F(J,Q) values for a number of nodes J, and storing those values with the nodes to aid in calculating further F(J,V) and F(J,Q) values. Once these values have been calculated for all the required nodes, equation 4.5:7 is used to combine them, yielding the desired bound. This yields the following algorithm, in which the descendants of a set of elements is defined as the set of all the descendants of the elements (and C* represents all ancestors, and S* all descendants):

Algorithm 4.6 - To find a bound on CS_O(V,Q):

1.
If Q is an ancestor of V, switch V and Q, and find the equivalent connection strength CS(Q,V).

2. Starting with J = Q, and working J backwards through a total order on the nodes consistent with the dag, calculate F(J,Q) values for all J falling in the set S*(C*(V)) ∩ C*(Q), using equation 4.5:8.

3. Starting with J = V, and working J backwards through the total order, calculate F(J,V) values for all J falling in the set S*(C*(Q)) ∩ C*(V), using equation 4.5:8.

4. Use equation 4.5:7 to sum up products of F(J,V) and F(J,Q) values, yielding the bound on CS(V,Q).

The following property holds for the above algorithm (proved further below):

Lemma 4.6: When calculating each new F(J,Q) in step 2, all the subcalculations of F(K,Q) that are required will already have been calculated. The same holds for step 3.

To determine the complexity of algorithm 4.6, a good estimate is provided by the number of generalized multiplications required. The CS(V,Q) bound is formed only by generalized multiplications and additions, and the number of additions will be slightly less than the number of multiplications, since the only additions that are required are those to add one product to another. The following theorem supplies the number of multiplications needed:

Theorem 4.6: The number of generalized multiplications required to find a bound on CS_O(V,Q) using algorithm 4.6 is the number of links between the ancestors of Q which are also descendants of ancestors of V, plus the number of links between the ancestors of V which are also descendants of ancestors of Q, that is:

Number of multiplies = Number of links between (S*(C*(Q)) ∩ C*(V)) + Number of links between (S*(C*(V)) ∩ C*(Q))

Clearly this will be less than the number of links between ancestors of V, plus the number of links between ancestors of Q.
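The algorithm above can be sketched as follows. This is a minimal reconstruction of equations 4.5:7 and 4.5:8, assuming the link strengths are already given (the dict encoding and helper names are my own); it uses memoized recursion, which realizes the same reuse of F values as the bottom-up ordering in steps 2 and 3. Note that ∞ is the identity of the serial combination, which is why F(X,X) = ∞ terminates the recursion.

```python
import math
from functools import lru_cache

# Hypothetical encoding: the dag as {child: {parent: link strength}},
# here the four-node example of section 4.5 with its AO link strengths.
LS = {
    "X1": {"X0": 0.776},
    "X2": {"X0": 3.29},
    "X3": {"X1": 1.69, "X2": 2.23},
}

def serial(x, y):
    """Generalized multiplication (section 4.2); inf is its identity."""
    return 4.0 * math.atanh(math.tanh(x / 4.0) * math.tanh(y / 4.0))

def parents(n):  return set(LS.get(n, {}))
def children(n): return {c for c, ps in LS.items() if n in ps}

def c_star(n):
    """C*(n): node n together with all of its ancestors."""
    anc = {n}
    for p in parents(n):
        anc |= c_star(p)
    return anc

@lru_cache(maxsize=None)
def F(x, y):
    """Strength of all forward paths from x to y (equation 4.5:8)."""
    if x == y:
        return math.inf
    return sum(serial(LS[k][x], F(k, y))
               for k in children(x) if k in c_star(y))

def cs_bound(v, q):
    """Bound on CS_O(v,q) via equation 4.5:7 (algorithm 4.6)."""
    if q in c_star(v):                       # step 1: swap if q is an ancestor of v
        v, q = q, v
    common = c_star(v) & (c_star(q) - {q})   # C*(V) with C+(Q)
    return sum(serial(F(j, v), F(j, q)) for j in common)

# ~2.0; a little looser than the hand calculation's 1.95, because the
# J = X0 term also counts the non-simple walk X1 <- X0 -> X1 -> X3.
print(round(cs_bound("X1", "X3"), 2))
```

Under this reconstruction the path sum counts all walks, not just simple paths, so its bound can exceed one obtained by combining only the simple paths by hand; it is still a valid upper bound on the true CS of 1.543.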
Depending on the representation of the BN, extra computation may be required to determine which nodes are the required descendants, ancestors, etc., but even when doing this, the overall complexity is linear in the number of links between ancestors of V and Q. The space requirements are minimal. On top of what is required for the BN and control of the algorithm, only |S*(C*(V)) ∩ C*(Q)| + |S*(C*(Q)) ∩ C*(V)| real numbers must be stored (i.e. linear in the number of ancestor nodes of V and Q).

Proof of Lemma 4.6: Each time we use equation 4.5:8 to find a value for F(X,Y), we end up finding values for F(X,Y), and each F(K,Y) where K ∈ S(X) ∩ C*(Y) (whether or not X=Y). If we apply the equation recursively, we end up finding F(K,Y) values for all K ∈ K, where K is given by:

K = {X} ∪ (S(X) ∩ C*(Y)) ∪ (S(S(X) ∩ C*(Y)) ∩ C*(Y)) ∪ ...    4.6:3

It is a property of ancestor/descendant relations that if a node W isn't an ancestor of another node Z, it can't have a successor or descendant which is an ancestor of Z. That is:

W ∉ C*(Z) implies S(W) ∩ C*(Z) = ∅ and S*(W) ∩ C*(Z) = ∅    4.6:4

Using 4.6:4, we can simplify 4.6:3 to:

K = S*(X) ∩ C*(Y)    4.6:5

When we use equation 4.5:7 to find a bound on CS(V,Q), we have to find a value of F(J,Q) for all J ∈ C*(V) ∩ C+(Q). Since we will find these values using equation 4.5:8, that requires using equation 4.5:8 to find F(K,Q) for all K ∈ K_Q, where K_Q is given by 4.6:5, substituting C*(V) ∩ C+(Q) for X, and Q for Y:

K_Q = S*(C*(V) ∩ C+(Q)) ∩ C*(Q)    4.6:6

We can simplify the above using 4.6:4 to get:

K_Q = S*(C*(V)) ∩ C*(Q)    4.6:7

So when finding all the F(J,Q) values in step 2, all the recursive F(K,Q) values that need to be found will be in the set above. Since step 2 specifies that we find the F(J,Q) values in reverse order of J, and equation 4.5:8 finds F(J,Q) values using only F(K,Q) values for which K succeeds J, completion of step 2 will require only F(K,Q) values from the set above which have already been found.
The same type of argument, but with V substituted for Q, and Q for V, serves to show that step 3 requires finding only F(K,V) values in which K is an element of the set K_V below:

K_V = S*(C*(Q)) ∩ C*(V)    4.6:8

Furthermore, step 3 also specifies that we find the F(J,V) values in reverse order of J, so by the same reasoning as the step 2 case, completion of step 3 will require only F(K,V) values from the set above which have already been found. ■

Proof of Theorem 4.6: Equation 4.5:8 is invoked to find each value of F(X,Q) where X ∈ S*(C*(V)) ∩ C*(Q), as stated by equation 4.6:7. Each time it is invoked it performs one multiply for each link leaving X and going to a node in C*(Q). Since the node that the link goes to will be in S*(C*(V)) as well (because by definition it must go to a successor), we can say that one multiply is performed for each link leaving a node in S*(C*(V)) ∩ C*(Q) and going to a node in S*(C*(V)) ∩ C*(Q). Or, in other words, the number of multiplies required is the number of links between the nodes in S*(C*(V)) ∩ C*(Q). Equation 4.5:8 must also be invoked to find each value of F(X,V) where X ∈ S*(C*(Q)) ∩ C*(V), as stated by equation 4.6:8. By reasoning similar to the last paragraph, we can say that the number of multiplies required to do this is the number of links between the nodes in S*(C*(Q)) ∩ C*(V). Finally, we must add on the number of multiplies required by equation 4.5:7 in step 4. This will be |C*(V) ∩ C+(Q)|, which is one multiply for each term of the sum.
However, if we don't count all those multiplications that are with a connection strength from a node to itself, done in steps 2 and 3 above (which don't really need to be done, since they yield the original number, and are present simply to terminate the recursion), then this quantity will be absorbed, leaving us with the sum of the two quantities from the two paragraphs above:

Number of multiplies = Number of links between (S*(C*(Q)) ∩ C*(V)) + Number of links between (S*(C*(V)) ∩ C*(Q)) ■

4.7 Dealing With Evidence

We may wish to find a bound for a conditional connection strength, that is, a bound for a connection strength after the BN has received some evidence. If we are automatically incorporating the evidence into the network as it arrives using Shachter's evidence propagation, then we can simply use the methods of the previous section on the new BN. However, if we want to find a CS bound knowing the evidence is present, but not propagated, we need some new techniques. Link strengths were defined by maximizing CS over all states of the other parents of the child in the link. If the evidence received at the BN is for one of those parents, that maximum may be reduced. Consider the example BN in figure 4.7:1:

Figure 4.7:1 - Evidence at node E restricts the CS(A,C) bound.

Say we are finding a bound on CS(A,C) with no evidence at node E. This is provided by:

CS(A,C) ≤ LS(A→B) * LS(B→C) = [max_e CS(A,B|e)] * CS(B,C)

where * is the serial combination operator. However, if we receive evidence that E is true, then a bound for the conditional connection strength CS(A,C|+e) is provided by:

CS(A,C|+e) ≤ LS(A→B|+e) * LS(B→C|+e) = CS(A,B|+e) * CS(B,C)

This new bound for CS will be less than or the same as the original bound, since CS(A,B|+e) ≤ max(CS(A,B|+e), CS(A,B|¬e)). Often when evidence is received it will improve CS bounds in this manner.
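The tightening effect can be illustrated numerically (a sketch; the conditional probabilities for B given its parents A and E are invented for illustration):

```python
import math

def d_o(p, q):
    """AO distance: absolute difference of log odds."""
    logit = lambda r: math.log(r / (1.0 - r))
    return abs(logit(p) - logit(q))

# Invented NCP for B with parents A and E: P(b | a, e)
p_b = {(True, True): 0.9, (False, True): 0.6,
       (True, False): 0.7, (False, False): 0.2}

# Conditional connection strengths CS(A,B|e) for each state of E
cs_given_e    = d_o(p_b[(True, True)],  p_b[(False, True)])   # e = TRUE
cs_given_note = d_o(p_b[(True, False)], p_b[(False, False)])  # e = FALSE

# LS(A->B) maximizes over the states of the other parent, E
ls = max(cs_given_e, cs_given_note)

# Evidence +e replaces the max by the single relevant term, so the
# conditional bound can only shrink (or stay the same).
assert cs_given_e <= ls
print(round(cs_given_e, 2), round(ls, 2))  # 1.79 2.23
```

Here evidence +e happens to select the weaker of the two conditional strengths, so the serial bound through B improves accordingly.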
When the evidence is for a nonconverging node which is right on an active path, that node blocks the active path, so the CS contribution from that path drops to zero, giving a lower overall CS value, and a lower value for the CS bound. Although receiving evidence often lowers CS values, it may increase CS values by creating new active paths through converging nodes which have received the evidence. Consider the example BN in figure 4.7:2:

Figure 4.7:2 - BN with evidence at E, for which we wish to find the CS from A to C via the intercausal path through E.

Without any evidence at E, CS(A,C) = 0, and the CS(A,C) bound calculated by active paths is also zero. But once evidence TRUE arrives for E, an active path from A to C is created. The strength of this path is d(P(c|+a+e), P(c|¬a+e)), and it is termed an intercausal link strength, since reasoning from A to C is intercausal reasoning, as defined in section 2.6. For AO connection strength it comes out to be (proved in Appendix C):

CS_O(A,C|+e) = LS_O(A —+e→ C) = | log [ P(+e|+a+c) P(+e|¬a¬c) ] / [ P(+e|+a¬c) P(+e|¬a+c) ] |    4.7:1

where LS_O(A —+e→ C) is a notation invented to denote an intercausal link strength from A to C when E has evidence TRUE. It is defined by the equation above. As an example, we can put all these techniques together to find a bound on CS(A,D|e,f) in the BN of figure 4.7:3:

Figure 4.7:3 - BN with evidence at E and F, for which we wish to find a bound on the CS from A to D.

The active paths from A to D are A,B,C,D and A,F,C,D. Each one of them forms one of the two terms in the bounding equation below:

CS(A,D|e,f) ≤ ( LS(A→B|e) * LS(B→C) + LS(A —f→ C) ) * LS(C→D|f)

5 Applications

5.1 BN Link Display

The first application of link strength that we consider is as a visualization aid for humans.
BN diagrams have been praised as a great tool for people to visualize probabilistic relationships, and by marking the links with their strengths this tool may be improved, because then someone viewing the BN doesn't just get information about node independence, but also about a "degree of independence". An extremely weak link is very nearly an independence, but this information is lost if it is drawn exactly the same as the rest of the links. One possibility is to draw stronger links with thicker lines and weaker links with finer lines. Viewing a BN can be much more meaningful if one sees a skeleton of very dark lines corresponding to definitions and constraints, followed by slightly lighter lines for less certain rules, and so on, down to faint lines corresponding to very weak dependencies, and of course the absence of links indicating independencies. When using the AO link strength measure, we must represent the 0 to ∞ scale of LS_O with a finite line width. A mapping that is commonly used to do this for graphical display is W = x / (x + a), where W is proportional to the width of the line, and a is an adjustable scale parameter. This is approximately linear in x for x « a, and approaches 1 as x approaches infinity. Another possibility is to use W = tanh(x / a), which is also linear in x for x « a, and approaches 1 as x approaches infinity (see graph of figure 4.2:1). This mapping is recommended (with a = 4), since it corresponds closely to serial combinations of links. That way the human viewer can easily imagine the minimum attenuation of a chain of links as the product of the attenuations of each of the links. So two 50% width lines in series correspond to a 25% width line (or smaller) connection. Unfortunately, it takes a bit more work to mentally combine parallel paths. For example two 50% width lines would combine to form, at most, an 80% width connection. However, for the finer lines, addition of line width can be used to combine parallel paths.
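This width algebra can be checked directly: with W = tanh(LS/4), the serial combination 4 tanh⁻¹(tanh(x/4) tanh(y/4)) has width W_x · W_y, and the parallel combination x + y has width (W_x + W_y)/(1 + W_x W_y), by the tanh addition formula. A quick sketch (helper names are my own):

```python
import math

def width(ls, a=4.0):
    """Line width (0..1) for a link of AO strength ls."""
    return math.tanh(ls / a)

def serial(x, y):
    return 4.0 * math.atanh(math.tanh(x / 4.0) * math.tanh(y / 4.0))

def strength(w, a=4.0):
    """Inverse of width(): the AO strength drawn at width w."""
    return a * math.atanh(w)

half = strength(0.5)              # an LS drawn at 50% width

# Two 50% links in series -> at most a 25% width connection
print(round(width(serial(half, half)), 2))   # 0.25

# Two 50% links in parallel (strengths add) -> at most 80% width
print(round(width(half + half), 2))          # 0.8

# Two 25% links in parallel -> about 47%, close to simple addition
q = strength(0.25)
print(round(width(q + q), 2))                # 0.47
```

For fine lines the cross term W_x W_y is negligible, which is why simply adding line widths is a good mental shortcut for parallel paths.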
For example two 25% width lines combine in parallel to form, at most, a 47% width line, which is very nearly 50%. This measure used for the width of the line turns out to be the absolute value of the statistical measure of association known as the coefficient of colligation, or Yule's Y (not to be confused with Yule's Q, which is in more common usage). The original invention of this measure had nothing to do with the chaining between variables that we use it for here, or so it appears from reading the paper in which it was introduced. For a description of Yule's Y see Appendix B, and for the original paper, see Yule1912. Figure 5.1 shows the same BN as figure 2.5, but with the link strengths printed beside each link, and the links drawn in different widths to show their strengths according to the formula:

Width = { 0                                      LS_O = 0
        { max(0.2mm, (2mm) × tanh(LS_O / 4))     otherwise

At a glance one can get an idea of which dependencies are always of minor importance. It must be remembered that each link strength represents the amount that the parent node can affect the child, maximized over all possible beliefs for the other parents. Furthermore, AO link strength was used, so any effects that bring a belief close to 0 or 1 are considered very significant.

Figure 5.1 - Example BN showing link strengths. This is the same BN as in figure 2.5, except the links have been drawn thicker to indicate LS as: width = (2mm) * tanh(LS/4), with a 0.2mm minimum. The LS values also appear beside the link.

5.2 Connection Strength Contours

If there is a particular node for which evidence may arrive (called the origin node), we may draw an iso-CS contour map over the BN to indicate the maximum effect which that evidence can have on each of the other nodes. Each contour represents a particular value of CS, and is drawn so as to separate nodes on its one side which are more strongly connected to the origin node (i.e. have a greater CS), from those on its other side which are less strongly connected to the origin node (i.e. have a lesser CS). The purpose is for a human to quickly assess which nodes will be affected by the evidence, and by how much. It also helps to visualize the "neighborhood" of a node, and provide a sense of locality. Figure 5.2:1 shows an iso-CS contour map with the node SW as the origin node. Of course, the map would be different if some other node was the origin. If AO connection strength is used to draw the contours, then the contour map may also be interpreted in another way. Instead of measuring how much evidence at the origin node affects each of the other nodes, it can be interpreted to measure how much evidence at each of the other nodes can affect the origin node.
This is due to the commutativity of AO connection strength. One possible application of this is the following. We have a query node and we want to know which nodes to gather evidence at to best form a belief for the query node. Gathering evidence at a node that has very little effect on the query node is generally useless. So we can use the contours as indicators of levels of desirability for gathering evidence at each of the nodes (and trade that off with the cost of gathering evidence at that node). Using the methods of the previous chapter, a contour map may be drawn that is based on the CS bounds calculated from link strengths, instead of the actual CS values. Such a contour map may be constructed very quickly (of complexity linear in the product of the number of links and the number of nodes). Figure 5.2:2 shows such a contour map calculated for the node SW, using algorithm 4.6. It is interesting to compare it with figure 5.2:1. For each node the actual CS value is less than the bound, as we would expect. For nodes close to SW the bound is very close to the actual value (equal for TW and HT), whereas for distant nodes the bound is less tight (greater by a factor of about 12 for the most distant node, HM). It is interesting to notice that, at least for this example, even in areas where the bound is loose, the shapes of the contours for the bound are quite similar to the contours for the actual value. This example does not show it well, but if the BN is composed of a dendritic skeleton of strong links with the "flesh" filled in by a network of weak links, then the contours will tend to follow the skeleton in a manner similar in appearance to the contours of a topographic map following the valleys of a dendritic river system, with the origin node at the point where the river meets the ocean.
Figure 5.2:1 - Example BN showing the connection strength contours for the node SW. This is the same BN as in figure 5.1, with contours drawn to show the CS between each node and the node SW. The CS values appear below each node.

Figure 5.2:2 - The example BN of figure 5.1, with contours showing the connection strength bound for the node SW. The contours are drawn for the calculated CS bound between each node and the node SW. The CS bound values appear below each node.

5.3 Approximate Inference

Recall the "reasoning by assumptions" algorithm for BN inference, introduced in section 2.6. It is based on instantiating some node (say node Z) to one of its values in order to simplify the network (generally to reduce the active path connectivity), doing the BN inference with the simplified network, repeating the process with Z instantiated to its other value, and then combining the two solutions obtained by taking their weighted average (weighted by the probabilities that Z would take on each of its two values). But now suppose that node Z is distant enough from the nodes that we wish to find beliefs for (i.e. the query nodes), that instantiating Z to some value has almost no effect on their beliefs. In that case the two solutions will be almost the same, and their weighted average won't be that different from either one of them. So we could save some time by only computing one of the solutions, and recognizing that the beliefs that it provides are approximate.
This suggests the following algorithm:

Algorithm 5.3:1: To compute the posterior probabilities of the nodes in the set Q given evidence e for the nodes in E, instantiate the node Z to one of its values, then use some standard BN inference algorithm to find P(q_i|z,e) and consider it an approximation for P(q_i|e), for all Q_i ∈ Q. Node Z can be any node not in E, and will normally be chosen so that the particular BN inference algorithm to be used can find P(q_i|z,e) more quickly than P(q_i|e) (which is often the case if, for example, Z blocks active paths from nodes in Q to their ancestors or nodes in E).

When we use an algorithm that produces approximate results, we usually need some kind of bound on how accurate we can expect the approximation to be. That is provided by the following theorem (which follows directly from theorem 3.4):

Theorem 5.3:1: When using algorithm 5.3:1, a bound on the error of the approximation, ε = d(P(q|z,e), P(q|e)), is:

ε ≤ CS(Z, Q|e)

for any node Q ∈ Q, where d is the distance measure used for the definition of CS.

Although instantiating one node may result in faster BN inference, usually we want to instantiate a set of nodes to block many active paths. There are two different advantages we could gain from this. We could make the network singly connected (or more singly connected), and/or we could prune off large parts of the network. For example, consider the BN in figure 5.3:

Figure 5.3 - Z1, Z2, ..., Zn block all active paths from subnetwork G to subnetwork H.

Instantiating the nodes Z1, Z2, ..., Zn cuts off the subnetwork G from Q and its subnetwork H. So a BN inference algorithm finding the belief of Q can ignore the whole subnetwork G. Clearly, this may result in a major computational saving, depending on the size of G. An approximation algorithm which uses this strategy is algorithm 5.3:2, which is essentially the same as algorithm 5.3:1, except Z is now a set of nodes.
Algorithm 5.3:2: To compute the posterior probabilities of the nodes in the set Q given evidence e for the nodes in E, instantiate all the nodes in Z to one of their values, creating the tuple of values z, then use some standard BN inference algorithm to find P(qi|z,e) and consider it an approximation for P(qi|e), for all Qi ∈ Q.

The nodes in Z can be any nodes not in E, and will normally be chosen so that the particular BN inference algorithm to be used can find P(qi|z,e) more quickly than P(qi|e).

Theorem 5.3:2: When using algorithm 5.3:2, a bound on the error of the approximation, ε = d(P(q|z1,z2,...,zn,e), P(q|e)), is given by:

ε ≤ CS(Z1, Q|e) + CS(Z2, Q|z1,e) + ... + CS(Zn, Q|z1,z2,...,zn−1,e)

for any node Q ∈ Q, where d is the distance measure used for the definition of CS. This theorem is proved in Appendix C.

6 Conclusion

6.1 Further Work

6.1.1 Qualitative Probabilistic Networks

Algorithm 4.6 provides a way to calculate a bound on CS values; that is, it calculates the maximum magnitude of the change of belief at one node due to evidence at another. But we may also be interested in the direction of the change: do the beliefs increase or decrease? By removing the absolute value signs in the definition of CS_O and CS_P we can retain the information on the direction of the change. Equations 4.5:7 and 4.5:8, used by algorithm 4.6, must be modified to handle the signs of LS and CS values separately from the magnitudes. They must produce a number of the same magnitude as they do now, but the combination of signs must be as follows. For the parallel combination rule: a positive plus a positive is a positive, a negative plus a negative is a negative, a positive plus a negative is an unknown sign, and an unknown sign plus anything is an unknown sign. For the serial combination rule: a positive times a positive is a positive, a positive times a negative is a negative, a negative times a negative is a positive, and an unknown times anything is an unknown.
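The sign-combination rules just described fit in a few lines; here is a sketch with '+', '-', and '?' standing for positive, negative, and unknown sign (the function names are mine, not the thesis's):

```python
# Sign algebra for qualitative CS propagation, as described in 6.1.1.
# '+' = belief increases, '-' = belief decreases, '?' = unknown sign.

def parallel_sign(a, b):
    """Sign of the sum of two signed CS values (parallel combination)."""
    if '?' in (a, b):
        return '?'
    return a if a == b else '?'   # like signs keep the sign; mixed signs are unknown

def serial_sign(a, b):
    """Sign of the product of two signed CS values (serial combination)."""
    if '?' in (a, b):
        return '?'
    return '+' if a == b else '-' # like signs multiply to '+', unlike to '-'

assert parallel_sign('+', '+') == '+'
assert parallel_sign('+', '-') == '?'
assert serial_sign('-', '-') == '+'
assert serial_sign('?', '+') == '?'
```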
The reason that the sign must be handled separately is that interval arithmetic is effectively being performed, where the intervals always run from 0 to the positive or negative CS value. Exploiting this, we could modify equations 4.5:7 and 4.5:8 to produce a tighter bound in cases which involve the sum of a positive CS with a negative CS, since the magnitude of their sum is bounded by the maximum of their magnitudes, which is less than the sum of their magnitudes (which is what must be used in the absence of sign information).

Wellman90 introduces qualitative probabilistic networks, which have the same DAG structure as BNs, but contain only sign information along the links instead of NCPs (and may also contain hyperedges providing the sign of synergies, etc.). Their purpose is simply to predict in which direction the belief at one node will change given evidence at another node. It appears that adding sign information to CS as described above will produce the same qualitative results as those of Wellman (while also providing magnitudes), but more work remains to account for synergies in the way Wellman does.

6.1.2 Greater Computation for Tighter Bounds

This thesis presented a way to find a bound on CS by combining quantities that can be calculated locally, at the scale of a single link. But if we separate the nodes of a BN into small disjoint groups, then the interactions between two groups may be expressed solely in terms of the links between the nodes in the Markov boundaries of the two groups. For each group, we can do full BN reasoning to find exact connection strengths between the nodes of the group, and then use the methods of this thesis to find bounds on CS between the nodes in the Markov boundaries of the different groups. That way we can find a bound on the CS between any two nodes of the original BN.
The larger we make the groups, the tighter the bound will be, but the longer it will take to compute (because full BN inference will be required on larger groups). If the groups are so small that they are just single nodes, the CS bounds calculated will be those of algorithm 4.6. If a group is so large that it includes the whole BN, then the CS values will be exact. Varying the group size thus provides a continuum of CS bounding algorithms, from completely global and slow algorithms which produce exact results, to completely local and fast algorithms which produce loose bounds, and we can adjust the group size depending on how tight a bound we require. If we get really sophisticated, we can use a range of scales to find a single connection strength, CS(V,Q), in which we use large-scale groups close to the nodes V and Q, and smaller-scale groups further away.

6.1.3 Multistate Nodes

An obvious next step would be to try to extend the results of this thesis to BNs composed of nodes that can take on more than two values. Any multistate BN can be modeled by a binary one by replacing each multistate node with a number of binary nodes, each binary node representing the proposition that the multistate node takes on one of its values. So many of the proofs in this thesis will extend to BNs composed of multistate nodes, but whether practical algorithms and reasonably tight bounds can be produced has yet to be determined. Also, it may be desirable to generalize the definition of CS for multistate nodes in some other way.

7 Bibliography

Aczel, J. (1966) Lectures on Functional Equations and Their Applications, Academic Press, New York.

Agosta, John M. (1991) "'Conditional inter-causal independent' node distributions, a property of 'noisy-or' models" in Uncertainty in Artificial Intelligence, Proc. of the Seventh Conf. (July, UCLA), Bruce D'Ambrosio et al. (eds.), Morgan Kaufmann, San Mateo, CA.
Boerlage, Brent (1992) "Link strength in Bayesian networks" in First Canadian Workshop on Uncertainty Management: Theory and Practice (May 12, Vancouver, Canada), Mary Deutsch-McLeish (ed.).

Buchanan, Bruce G. and Edward H. Shortliffe (1984) Rule-Based Expert Systems: The MYCIN Experiments, Addison-Wesley, Reading, MA.

Cheeseman, Peter (1986) "Probabilistic versus fuzzy reasoning" in Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lemmer (eds.), North-Holland, Amsterdam, pp. 85-102.

Cooper, Gregory F. (1990) "The computational complexity of probabilistic inference using belief networks" in Artificial Intelligence, 42, 393-405.

Heckerman, David E. (1986) "Probabilistic interpretations for MYCIN's certainty factors" in Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lemmer (eds.), North-Holland, Amsterdam, pp. 167-196.

Henrion, Max (1989) "Some practical issues in constructing belief networks" in Uncertainty in Artificial Intelligence 3, Laveen N. Kanal, T. S. Levitt and J. F. Lemmer (eds.), North-Holland, Amsterdam.

Horvitz, Eric J. (1989) "Reasoning about beliefs and actions under computational resource constraints" in Uncertainty in Artificial Intelligence 3, Laveen N. Kanal, T. S. Levitt and J. F. Lemmer (eds.), North-Holland, Amsterdam.

Howard, Ron A. (1971) Dynamic Probabilistic Systems, John Wiley & Sons, New York.

Lauritzen, Steffen L. and David J. Spiegelhalter (1988) "Local computations with probabilities on graphical structures and their application to expert systems" in J. Royal Statistical Society B, 50(2), 157-194.

Lindley, D. V. (1965) Introduction to Probability and Statistics, Cambridge University Press, Cambridge.

Neapolitan, Richard E. and James R. Kenevan (1990) "Computation of variances in causal networks" in Proc. of the Sixth Conf. on Uncertainty in Artificial Intelligence (July 27-29, 1990, MIT), GE Corporate Research and Development.

Neapolitan, Richard E.
(1990) Probabilistic Reasoning in Expert Systems: Theory and Algorithms, John Wiley & Sons, New York.

Neapolitan, R. and J. Kenevan (1991) "Investigation of variances in belief networks" in Uncertainty in Artificial Intelligence, Proc. of the Seventh Conf. (July, UCLA), Bruce D'Ambrosio et al. (eds.), Morgan Kaufmann, San Mateo, CA.

Pearl, Judea (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA.

Suermondt, H. Jacques (1992) Explanation in Bayesian Belief Networks, PhD thesis (Report No. STAN-CS-92-1417), Departments of Computer Science and Medicine, Stanford Univ., CA.

Shachter, Ross D. (1986) "Evaluating influence diagrams" in Operations Research, 34(6), 871-882.

Shachter, Ross D. (1988) "Probabilistic inference and influence diagrams" in Operations Research, 36(4), 589-604.

Tribus, M. (1969) "What do we mean by rational?" in Rational Descriptions, Decisions and Designs, Pergamon Press, New York.

Wellman, Michael P. (1990) "Fundamental concepts of qualitative probabilistic networks" in Artificial Intelligence, 44, 257-303.

Yule, G. Udny (1912) "On the methods of measuring association between two attributes" in J. of the Royal Statistical Society, 75, 579-642.

Zolotarev, V. M. (1983) "Probability metrics" in Theory of Probability and its Applications, 28(2), 278-302.

A Notation and Nomenclature

When naming nodes in a Bayesian net, upper case letters, such as "A", refer to single nodes, and bold upper case, such as "A", to a set of zero or more nodes. Since each node represents a propositional random variable, the names of the random variables are also denoted upper case, and the values a variable can take on are labeled TRUE and FALSE. "+a" denotes that the value of node A is TRUE, "¬a" denotes that the value of node A is FALSE, and "a" stands for the value of node A (TRUE or FALSE). Sometimes "+a" is written simply as "a", if that does not result in confusion.
A vector of values for all the nodes in the set E is written bold lower case, as "e". Conditional probabilities are written in the form "P(+b|¬a,+c)", which means "the probability that B is TRUE, given that A is FALSE and C is TRUE". "O(+b|¬a,+c)" is the odds ratio that B is TRUE, given that A is FALSE and C is TRUE, i.e. O(+b|¬a,+c) = P(+b|¬a,+c) / P(¬b|¬a,+c). If we say "the belief at node B is x" we mean P(b=TRUE|e) = x, where e is the evidence seen so far.

"C(B)" denotes the set of parents (conditional predecessors) of node B, "C+(B)" is the set of all ancestors of B, and "C*(B)" is the set of all ancestors of B, including B. Likewise "S(B)" is the set of B's children (successors), and "S*(B)" the set of all descendents. "C*(B)", where B is a set of nodes, is defined as the set of all ancestors of all the nodes in B, and likewise for C+(B), S*(B), etc.

A BN link from node A to node B is denoted "A→B" or "B←A". If A is not a parent of B, the preceding link notation may still be used, providing that adding a link from A to B would not create a cycle in the BN; the link represented is then considered a null link from A to B.

The Cartesian product is formed with the "Π" symbol, which takes a set of variables as its argument, and represents the set of all vectors formed with each of the variables taking on one of its possible values. As an example, for propositional variables A and B: Π{A, B} = {+a+b, +a¬b, ¬a+b, ¬a¬b}.

Abbreviations

BN - Bayesian net.
CS - Connection strength.
FJD - Full joint (probability) distribution. Consists of a probability for every conjunction consisting of all the primitive propositions (nodes). The term is also used to indicate the complete probabilistic model.
LS - Link strength.
NCP - Node conditional probability(ies). Also called the "link matrix" in Pearl88.

B Conventional Statistical Measures of Association

There is a broad array of standard statistical measures of association.
Historically they have been defined simply by searching through equations to find one that meets certain desiderata (although in the last few decades there has been a move to define measures according to some optimality criterion). The following have been mentioned by the statistical community as desirable qualities for a measure of association:

1. Range: The measure of association should range from 0 to 1, or from -1 to 1.
2. Endpoints: A measure of association of 0 should correspond to independence, while 1 should indicate full association (and -1 reverse full association, if that value can be obtained). Full association is defined by some to mean deterministic dependence (all conditional probabilities of the contingency table are 0 or 1), and by others to mean that at least one of the conditional probabilities of the contingency table is 0 or 1.
3. Monotonicity: The measure of association should vary monotonically and continuously with P(xy) - P(x)P(y).
4. Symmetry: For two binary variables X and Y, the measure of association given as a function of P(x|y) and P(x|¬y) should be the same as the same function of P(y|x) and P(y|¬x).

Here are the most common measures of association as they would be applied to the case of two binary variables, X and Y, written in the probabilistic notation used in this thesis.

Cross ratio:  C = [P(y|x) P(¬y|¬x)] / [P(¬y|x) P(y|¬x)] = {P(y|x) [1 - P(y|¬x)]} / {[1 - P(y|x)] P(y|¬x)}

Log cross ratio:  L = log(C)

Coefficient of association (Yule's Q):  Q = (C - 1) / (C + 1)

Coefficient of colligation (Yule's Y):  Y = (√C - 1) / (√C + 1)

Root mean square contingency:  r = √([P(x|y) - P(x|¬y)] [P(y|x) - P(y|¬x)])

Coefficient of contingency (Pearson):  c = r / √(r² + 1)

Difference coefficient (J. H. Edwards):  E = P(y|x) - P(y|¬x)

Ratio coefficient (J. H. Edwards):  F = P(y|x) / P(y|¬x)

C Proofs

Theorem 3.1 - Equivalence of CS Definitions

The following three definitions of CS are equivalent:

CS(A,B) = d(P(b|+a), P(b|¬a))   (3.1:1)

CS(A,B) = max over av1, av2 of d(P(b|av1), P(b|av2))   (3.1:3)

CS(A,B) = sup over av1, av2 of d(P(b|av1), P(b|av1,av2))   (3.1:4)

where av1 is some virtual (or nonexistent) evidence for A, av2 is other consistent evidence (possibly virtual) for A, and "sup" means the least upper bound.

To prove that 3.1:3 and 3.1:4 are equivalent to 3.1:1, it is first useful to prove the following property of the distance measure:

Lemma 3.1:5: For the distance measure defined in 3.1:2 and any a, c, x, and y: if a ≤ x ≤ c and a ≤ y ≤ c, then d(x,y) ≤ d(a,c).

Proof of lemma 3.1:5: By the monotonicity requirement on d:

a ≤ x ≤ c  implies  d(a,x) ≤ d(a,c)
a ≤ y ≤ c  implies  d(a,y) ≤ d(a,c)

Either a ≤ x ≤ y or a ≤ y ≤ x, so by monotonicity and symmetry of d, either d(x,y) ≤ d(a,y) or d(x,y) ≤ d(a,x). But both d(a,y) and d(a,x) are ≤ d(a,c), so d(x,y) ≤ d(a,c). ■

Proof that 3.1:3 is equivalent to 3.1:1: Decomposing P(b|av1) on cases of A (by 2.3:3) we get:

P(b|av1) = P(b|a,av1) P(a|av1) + P(b|¬a,av1) P(¬a|av1)

Since av1 is virtual evidence for A, B is independent of av1 given A, so P(b|a,av1) = P(b|a). Also substitute α = P(a|av1), to get:

P(b|av1) = P(b|a) α + P(b|¬a) (1 - α)

α = P(a|av1) is restricted to [0,1], and as it varies from 0 to 1, P(b|av1) varies linearly from P(b|¬a) to P(b|+a). So at all times it is bounded by these limits:

P(b|¬a) ≤ P(b|av1) ≤ P(b|+a)  or  P(b|+a) ≤ P(b|av1) ≤ P(b|¬a)

By an identical argument for av2:

P(b|¬a) ≤ P(b|av2) ≤ P(b|+a)  or  P(b|+a) ≤ P(b|av2) ≤ P(b|¬a)

By lemma 3.1:5:

d(P(b|av1), P(b|av2)) ≤ d(P(b|+a), P(b|¬a))

So CS(A,B) defined by 3.1:3 is always less than or equal to CS(A,B) defined by 3.1:1.
But in 3.1:3 the max runs over all values of av1 and av2, which includes the possibility av1 = +a and av2 = ¬a, which gives it the value of CS(A,B) defined by 3.1:1, and therefore its maximum value. So each equation assigns the same value to CS(A,B). ■

Proof that 3.1:4 is equivalent to 3.1:1: This proof is the same as the one above, with (av1,av2) substituted for av2, except for the last paragraph, which becomes:

CS(A,B) defined by 3.1:4 is always less than or equal to CS(A,B) defined by 3.1:1. In 3.1:4 the max runs over all values of av1 and av2, which doesn't include the possibility av1 = +a and (av1,av2) = ¬a, because that would be inconsistent evidence. However, av1 and av2 can come arbitrarily close to this, and since there are no discontinuities in the system (the probability equations are linear and the distance measure is required to be continuous), d(P(b|av1), P(b|av1,av2)) can come arbitrarily close to CS(A,B) defined by 3.1:1, with the appropriate choice of av1 and av2. By replacing "max" by "sup", meaning the least upper bound, we can write the expression as an equality, and have CS(A,B) defined by 3.1:4 exactly equivalent to CS(A,B) defined by 3.1:1.
■

Theorem 3.4 - Alternate CS Definition

If an alternate CS is defined as:

CS'(A,B|e) = max(d(P(b|e), P(b|+a,e)), d(P(b|e), P(b|¬a,e)))   (3.4:1)

then for any two propositional variables A and B, and for any evidence e (or no evidence e), connection strength defined by equation 3.4:1 can be bounded above and below as:

(1/2) CS(A,B|e) ≤ CS'(A,B|e) ≤ CS(A,B|e)   (3.4:2)

To prove the lower bound the following lemma is useful:

Lemma 3.4: For a distance measure, d, satisfying the triangle inequality (of 3.1:2):

max(d(x,y), d(y,z)) ≥ (1/2) d(x,z)

Proof of lemma 3.4: Whichever of d(x,y) or d(y,z) is greater:

max(d(x,y), d(y,z)) ≥ min(d(x,y), d(y,z))

Adding max(d(x,y), d(y,z)) to each side:

2 max(d(x,y), d(y,z)) ≥ max(d(x,y), d(y,z)) + min(d(x,y), d(y,z))

But:

max(d(x,y), d(y,z)) + min(d(x,y), d(y,z)) = d(x,y) + d(y,z)

By the triangle inequality of 3.1:2:

d(x,y) + d(y,z) ≥ d(x,z)

Combining the above and dividing each side by 2:

max(d(x,y), d(y,z)) ≥ (1/2) d(x,z)   ■

Proof of lower bound in 3.4:2: If we substitute P(b|+a,e) for x, P(b|e) for y, and P(b|¬a,e) for z in lemma 3.4, we obtain:

max(d(P(b|+a,e), P(b|e)), d(P(b|e), P(b|¬a,e))) ≥ (1/2) d(P(b|+a,e), P(b|¬a,e))

By the symmetry of d (required by 3.1:2):

max(d(P(b|e), P(b|+a,e)), d(P(b|e), P(b|¬a,e))) ≥ (1/2) d(P(b|+a,e), P(b|¬a,e))

Substituting, by the definition of CS (i.e. 3.1:1), and the definition of CS' (i.e. 3.4:1):

CS'(A,B|e) ≥ (1/2) CS(A,B|e)   ■

Proof of upper bound in 3.4:2: Decomposing P(b|e) on cases of A (by 2.3:3) we get:

P(b|e) = P(b|+a,e) P(+a|e) + P(b|¬a,e) (1 - P(+a|e))

P(+a|e) is restricted to [0,1], and as it varies from 0 to 1, P(b|e) varies linearly between P(b|+a,e) and P(b|¬a,e).
So at all times it is bounded by these limits:

P(b|+a,e) ≤ P(b|e) ≤ P(b|¬a,e)  or  P(b|¬a,e) ≤ P(b|e) ≤ P(b|+a,e)

By the monotonicity requirement on d (3.1:2):

d(P(b|e), P(b|+a,e)) ≤ d(P(b|+a,e), P(b|¬a,e))  and  d(P(b|e), P(b|¬a,e)) ≤ d(P(b|+a,e), P(b|¬a,e))

Since both left hand sides in the above are less than or equal to d(P(b|+a,e), P(b|¬a,e)), the maximum of them must also be:

max(d(P(b|+a,e), P(b|e)), d(P(b|e), P(b|¬a,e))) ≤ d(P(b|+a,e), P(b|¬a,e))

Substituting, by the definition of CS (i.e. 3.1:1), and the definition of CS' (i.e. 3.4:1):

CS'(A,B|e) ≤ CS(A,B|e)   ■

Theorem 4.1 - ΔP Serial Chaining

Theorem 4.1: For any three propositional variables A, B, and C, where A and C are independent given B:

CS_P(A, C) = CS_P(A, B) CS_P(B, C)   (4.1)

Proof of Theorem 4.1: Using the reasoning-by-cases theorem (2.3:3):

P(c|a) = P(c|a,+b) P(+b|a) + P(c|a,¬b) P(¬b|a)

Since I(A,C|b), we know that P(c|a,b) = P(c|b):

P(c|a) = P(c|+b) P(+b|a) + P(c|¬b) P(¬b|a)

If we subtract a version of this equation with a = FALSE from a version of it with a = TRUE we obtain:

P(c|+a) - P(c|¬a) = P(c|+b) (P(+b|+a) - P(+b|¬a)) + P(c|¬b) (P(¬b|+a) - P(¬b|¬a))

Replacing P(¬b|a) with 1 - P(+b|a), and simplifying, we obtain:

P(c|+a) - P(c|¬a) = (P(c|+b) - P(c|¬b)) (P(+b|+a) - P(+b|¬a))

Taking the absolute value of each side:

|P(c|+a) - P(c|¬a)| = |P(c|+b) - P(c|¬b)| |P(+b|+a) - P(+b|¬a)|

By the definition of ΔP connection strength this equation is equivalent to:

CS_P(A, C) = CS_P(A, B) CS_P(B, C)   ■

Theorem 4.2 - ΔO Serial Chaining

For any three propositional variables A, B, and C, where A and C are independent given B:

tanh((1/4) CS_O(A, C)) ≤ tanh((1/4) CS_O(A, B)) tanh((1/4) CS_O(B, C))   (4.2:1)

Proof of Theorem 4.2: Using the reasoning-by-cases theorem (2.3:3):

P(c|a) = P(c|a,+b) P(+b|a) + P(c|a,¬b) P(¬b|a)

Since I(A,C|b), we know that P(c|a,b) = P(c|b):

P(c|a) = P(c|+b) P(+b|a) + P(c|¬b) P(¬b|a)

Now we convert from probabilities
to odds ratios, using O = P/(1-P), divide an A=TRUE version of the resulting equation by an A=FALSE version, and then simplify. Defining C(x|y) as O(x|+y)/O(x|¬y), and substituting where appropriate, the resulting expression for C(c|a) can be broken down into the composition of two functions:

C(c|a) = [(C(b|a) C(c|b) f1 + 1) (f1 + 1)] / [(C(b|a) f1 + 1) (C(c|b) f1 + 1)]   (C4.2:1)

where

f1 = (1 + O(c|b)) / ((C(c|b) + O(c|b)) O(b|a))   (C4.2:2)

Now we must find the maximum value that C(c|a) can take on for given C(c|b) and C(b|a), so we maximize it with respect to O(b|a) and O(c|b). We are really interested in the maxima of CS_O, which is the absolute value of the log of C(c|a), so we want to find all maxima of C(c|a) which are greater than 1, and all minima less than 1. We assume C(b|a) and C(c|b) are finite and nonzero, and we can check these extreme cases for maxima and minima at the end. First we maximize with respect to O(b|a), then with respect to O(c|b).

First we check the boundaries of O(b|a). At O(b|a) = 0 and O(b|a) = ∞, we get C(c|a) = 1, so there are no maxima or minima of interest at the boundaries. Next we check for discontinuities (where the numerator or denominator equals zero). All the quantities in f1 are greater than or equal to 0, so f1 = (1 + O(c|b)) / ((C(c|b) + O(c|b)) O(b|a)) is greater than or equal to 0. The denominator of C(c|a) is (C(b|a) f1 + 1) (C(c|b) f1 + 1), which therefore must be 1 or greater.
The numerator of C(c|a) is (C(b|a) C(c|b) f1 + 1) (f1 + 1), which must also be 1 or greater. So there are no discontinuities in our domain of interest. Also, since it is a rational function, the derivative won't have discontinuities in places where the function doesn't.

Next we take the derivative with respect to O(b|a), and check the zeros of the derivative for maxima. We do this in two steps using the chain rule for derivatives, writing C(c|a) as a function of f1:

d[C(c|a)] / d[O(b|a)] = (d[C(c|a)] / d[f1]) (d[f1] / d[O(b|a)])

So the maxima/minima will occur when either derivative equals zero.

d[f1] / d[O(b|a)] = -(1 + O(c|b)) / ((O(b|a))² (C(c|b) + O(c|b)))

This is zero when O(b|a) is infinite (which is at the border and has already been checked). It is not zero when O(c|b) is infinite. It is zero when O(c|b) = -1, but that is not an allowed value for O(c|b), so there are no maxima of interest when this derivative is zero. We try the other derivative in the product:

d[C(c|a)] / d[f1] = (C(b|a) - 1) (C(c|b) - 1) (C(b|a) C(c|b) f1² - 1) / ((1 + C(b|a) f1)² (1 + C(c|b) f1)²)

The zeros of this derivative are at C(b|a) = 1, C(c|b) = 1, f1 = ∞, and C(b|a) C(c|b) f1² = 1. When C(b|a) = 1 or C(c|b) = 1, we find that C(c|a) = 1, so they do not correspond to maxima of interest. f1 is infinite only if O(b|a) = 0 (which we have already considered). So the only interesting roots are at C(b|a) C(c|b) f1² = 1. One of the roots of this equation is always negative, so it doesn't correspond to a valid solution. The other root is:

f1 = 1 / √(C(b|a) C(c|b)),  i.e.  O(b|a) = √(C(b|a) C(c|b)) (1 + O(c|b)) / (C(c|b) + O(c|b))

Substituting this back into equation C4.2:1 yields:

C(c|a) = [(1 + √(C(b|a) C(c|b))) / (√C(b|a) + √C(c|b))]²   (C4.2:4)

We have left to the end the task of checking for possible maxima or minima at C(b|a) or C(c|b) taking a value of 0 or ∞. When we substitute C(b|a) = 0 into the formula for C(c|a) we get C(c|a) = 1/C(c|b), and for C(b|a) = ∞ we get C(c|a) = C(c|b).
These are the same values we get from C4.2:4, so that formula holds in all cases. So the value of C(c|a) given by equation C4.2:4 truly is the maximum value C(c|a) can take given values of C(b|a) and C(c|b), and we may rewrite it as an inequality which always holds:

C(c|a) ≤ [(1 + √(C(b|a) C(c|b))) / (√C(b|a) + √C(c|b))]²   (C4.2:5)

We can use equation 3.3:3 to express things in terms of CS_O:

CS_O(A, C) = |log C(c|a)|
CS_O(A, B) = |log C(b|a)|
CS_O(B, C) = |log C(c|b)|

If we make the above substitutions in equation C4.2:5, and then simplify, we obtain:

tanh((1/4) CS_O(A, C)) ≤ tanh((1/4) CS_O(A, B)) tanh((1/4) CS_O(B, C))

which is the equation to be proved. ■

Theorem 4.3 - Fundamental Equation

For any three propositional variables V, Q, and Z, we can decompose the connection strength from V to Q on cases of Z as follows:

CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * min_v CS(Z, Q|v)   (4.3:6)

where * is a generalized multiplication corresponding to the serial combination rule for the particular distance measure, d, used to define CS.

Lemma 4.3: For any real numbers x1, x2, y1, y2, and z: if z ≤ x1 + y1 and z ≤ x2 + y2, then z ≤ max(x1, x2) + min(y1, y2).

Proof of Lemma 4.3: There are four possible cases. If x1 ≥ x2 and y1 ≤ y2, then max(x1, x2) + min(y1, y2) = x1 + y1 ≥ z, by the first given inequality. If x1 ≥ x2 and y2 ≤ y1, then max(x1, x2) + min(y1, y2) = x1 + y2 ≥ x2 + y2 ≥ z, by the second given inequality. The other two cases are similar. ■

Proof of Theorem 4.3: We can analyze the propositional variables V, Q, and Z by constructing a BN for them, which appears in figure C.1. Since this is a fully connected BN, it can represent any probabilistic relationship between the three variables, with the right choice of NCP values. So even if these three variables are originally from a different BN where they are connected up in a different way, and perhaps with many other nodes involved, the BN of figure C.1 can represent their relationships within the other BN with no loss of generality.
So we need only prove equation 4.3:6 for the BN of figure C.1.

Figure C.1 - Fully connected BN representing the probabilistic relationship between V, Z, and Q with no loss of generality.

Figure C.2 - A new BN which is the same as the one in figure C.1, except the V node has been split into U and W. This BN is defined to have the same NCPs as the one in figure C.1.

The BN in figure C.2 is the same as C.1 (it has the same connections and the same NCPs), except that the node V has been replaced with two nodes: U and W. In general C.2 will produce different inference results from those produced by C.1, because the active path from Q to Z through V has been broken. However, in those cases where U and W both have evidence, and that evidence is the same for both of them, inference using C.2 will produce the same beliefs as C.1 (with V receiving the same evidence as U and W). Since in the calculation of CS(V,Q) the only values of interest are the beliefs at Q when V has evidence, we will obtain the same values from C.2 (giving both U and W the same evidence as V). Even though some of the intermediate calculations may be different, the results will be the same because P(q|+v) = P(q|+u,+w) and P(q|¬v) = P(q|¬u,¬w). So we need only prove equation 4.3:6 for the BN of figure C.2.

By the definition of CS (3.1:1):

CS(V,Q) = d(P(q|+v), P(q|¬v)) = d(P(q|+u,+w), P(q|¬u,¬w))

By the triangle inequality of d (3.1:2):

d(P(q|+u,+w), P(q|¬u,¬w)) ≤ d(P(q|+u,+w), P(q|+u,¬w)) + d(P(q|+u,¬w), P(q|¬u,¬w))

Substituting in the above the definition of conditional CS (3.5:1):

CS(V,Q) ≤ CS(W,Q|+u) + CS(U,Q|¬w)

Similarly we can derive:

CS(V,Q) ≤ CS(W,Q|¬u) + CS(U,Q|+w)

Invoking lemma 4.3 on the two equations above:

CS(V,Q) ≤ max_u CS(W,Q|u) + min_w CS(U,Q|w)   (C4.3:1)

The second term in the above equation can be evaluated as a three node BN consisting of two links in serial, similar in form to the BN of figure 4.1.
The techniques of section 4.1 or section 4.2 can be used (or similar techniques for other distance measures) to provide a bound for it. We express that bound in the general form:

min_w CS(U,Q|w) ≤ min_w [CS(U,Z|w) * CS(Z,Q|w)]

where * is the serial combination rule, which depends on the particular distance measure being used. Since W is independent of U and Z, CS(U,Z|w) = CS(U,Z), and we may move the min operator in:

min_w CS(U,Q|w) ≤ CS(U,Z) * min_w CS(Z,Q|w)

Returning to notation using V instead of U and W:

min_w CS(U,Q|w) ≤ CS(V,Z) * min_v CS(Z,Q|v)   (C4.3:2)

Now we examine the first term of C4.3:1. Expressing it in terms of the d measure:

max_u CS(W,Q|u) = max_u d(P(q|+w,u), P(q|¬w,u))   (C4.3:3)

We evaluate d(P(q|+w,u), P(q|¬w,u)) by reasoning by cases on Z (2.3:3):

d(P(q|+w,u), P(q|¬w,u)) = d(P(q|+w,+z,u) P(+z|+w,u) + P(q|+w,¬z,u) P(¬z|+w,u), P(q|¬w,+z,u) P(+z|¬w,u) + P(q|¬w,¬z,u) P(¬z|¬w,u))

Q is independent of U given Z, so P(q|w,z,u) = P(q|w,z). Also, Z is independent of W given U, so P(z|w,u) = P(z|u).
Substituting these in:

d(P(q|+w,u), P(q|¬w,u)) = d(P(q|+w,+z) P(+z|u) + P(q|+w,¬z) P(¬z|u), P(q|¬w,+z) P(+z|u) + P(q|¬w,¬z) P(¬z|u))

Substituting x for P(+z|u), and 1 - x for P(¬z|u), we obtain:

d(P(q|+w,u), P(q|¬w,u)) = d(P(q|+w,+z) x + P(q|+w,¬z) (1-x), P(q|¬w,+z) x + P(q|¬w,¬z) (1-x))

By the convexity requirement on d (3.1:2):

d(P(q|+w,u), P(q|¬w,u)) ≤ max(d(P(q|+w,+z), P(q|¬w,+z)), d(P(q|+w,¬z), P(q|¬w,¬z)))

By the definition of conditional CS (3.5:1):

d(P(q|+w,u), P(q|¬w,u)) ≤ max_z CS(W,Q|z)

Substituting this back into C4.3:3, and switching from W to V notation, we obtain:

max_u CS(W,Q|u) ≤ max_z CS(V,Q|z)   (C4.3:4)

Combining C4.3:1, C4.3:2, and C4.3:4 gives us the result to be proved:

CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * min_v CS(Z, Q|v)   ■

Theorem 4.7:1 - Intercausal Link Strength

If A and C are both parents of E, and I(A,C|), then if E receives evidence TRUE, the CS from A to C is given by:

CS_O(A,C|+e) = |log [P(+e|+a,+c) P(+e|¬a,¬c) / (P(+e|+a,¬c) P(+e|¬a,+c))]|   (4.7:1)

Proof: By Bayes' rule (2.3:2):

P(c|a,e) = P(e|a,c) P(c|a) / P(e|a)

If we form the ratio of this equation with c = TRUE to it with c = FALSE, we obtain:

P(+c|a,e) / P(¬c|a,e) = [P(e|a,+c) P(+c|a) P(e|a)] / [P(e|a,¬c) P(¬c|a) P(e|a)]

Canceling P(e|a) and using the definition of odds ratio, we obtain the following well known (e.g. Pearl88) equation, which holds for either value of A and either value of E:

O(+c|a,e) = [P(e|a,+c) / P(e|a,¬c)] O(+c|a)

If we form the ratio of this equation with a = TRUE to it with a = FALSE, we obtain:

O(+c|+a,e) / O(+c|¬a,e) = [P(e|+a,+c) P(e|¬a,¬c) / (P(e|+a,¬c) P(e|¬a,+c))] [O(+c|+a) / O(+c|¬a)]

Since I(A,C|), we know that O(+c|+a) = O(+c|¬a), so we can cancel these factors:

O(+c|+a,e) / O(+c|¬a,e) = P(e|+a,+c) P(e|¬a,¬c) / (P(e|+a,¬c) P(e|¬a,+c))

Next we take the absolute value of the logarithm of both sides.
  | log [ O(+c|+a,e) / O(+c|¬a,e) ] | = | log [ P(e|+a,+c) P(e|¬a,¬c) / ( P(e|+a,¬c) P(e|¬a,+c) ) ] |

By the definition of CS using the d_O measure, the left hand side is CS(A,C|e):

  CS(A,C|e) = | log [ P(e|+a,+c) P(e|¬a,¬c) / ( P(e|+a,¬c) P(e|¬a,+c) ) ] |

The equation we set out to prove is just a special case of the above with e=TRUE. ∎

Theorem 5.3:2 - Approx. Inference Error Bound

When using algorithm 5.3:2, a bound on the error of the approximation:

  e = d(P(q|z1,z2,...,zn,e), P(q|e))

is given by:

  e ≤ CS(Z1,Q|e) + CS(Z2,Q|z1,e) + ... + CS(Zn,Q|z1,z2,...,z_{n-1},e)

for any node Q ∈ Q, where d is the distance measure used for the definition of CS.

Lemma 5.3: For any two propositional variables, Z and Q, and any evidence e:

  CS(Z,Q|e) ≥ d(P(q|e), P(q|z,e))

Proof of Lemma 5.3: By the definition for an alternate CS (3.4:1):

  CS'(Z,Q|e) = max (d(P(q|e), P(q|+z,e)), d(P(q|e), P(q|¬z,e)))

The result of a "max" function is greater than or equal to either of its arguments, so:

  CS'(Z,Q|e) ≥ d(P(q|e), P(q|z,e))

By theorem 3.4, CS(Z,Q|e) ≥ CS'(Z,Q|e), so:

  CS(Z,Q|e) ≥ d(P(q|e), P(q|z,e))                                          ∎

Proof of Theorem 5.3:2: By the triangle inequality on d (required by 3.1:2):

  d(P(q|e), P(q|z1,...,z_{n-1},e)) + d(P(q|z1,...,z_{n-1},e), P(q|z1,...,zn,e))
      ≥ d(P(q|e), P(q|z1,...,zn,e))

The first term of the above can be substituted with an upper bound (i.e. a number guaranteed to be greater or equal) provided by the same equation with an index of n that is one lower, to yield:

  d(P(q|e), P(q|z1,...,z_{n-2},e)) + d(P(q|z1,...,z_{n-2},e), P(q|z1,...,z_{n-1},e))
      + d(P(q|z1,...,z_{n-1},e), P(q|z1,...,zn,e)) ≥ d(P(q|e), P(q|z1,...,zn,e))

This process can be repeated n−2 times to yield:

  Σ_{i=1..n} d(P(q|z1,...,z_{i-1},e), P(q|z1,...,zi,e)) ≥ d(P(q|e), P(q|z1,...,zn,e))

Each term of the sum can be bounded by lemma 5.3, by substituting Zi for Z and z1 &...& z_{i-1} & e for e, to yield:

  Σ_{i=1..n} CS(Zi,Q|z1,...,z_{i-1},e) ≥ d(P(q|e), P(q|z1,...,zn,e))

If we define e = d(P(q|z1,z2,...,zn,e), P(q|e)), then we obtain the bound:

  e ≤ CS(Z1,Q|e) + CS(Z2,Q|z1,e) + ... + CS(Zn,Q|z1,z2,...,z_{n-1},e)      ∎
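The additive bound of theorem 5.3:2 can be demonstrated on a toy example. The sketch below (Python; the joint distribution and variable names are invented purely for illustration, and the background evidence e is taken to be empty) uses d(p1, p2) = |p1 − p2| as the convex distance measure, and takes CS(Z,Q) = d(P(q|+z), P(q|¬z)); since P(q) is a convex combination of the two conditionals, this dominates d(P(q), P(q|z)) as lemma 5.3 requires.

```python
from itertools import product

# A made-up joint P(q, z1, z2) over three booleans (sums to 1); the
# numbers are chosen arbitrarily, just to illustrate the bound.
vals = [0.10, 0.05, 0.15, 0.02, 0.20, 0.08, 0.25, 0.15]
joint = dict(zip(product([True, False], repeat=3), vals))

def p_q_given(**ev):
    """P(+q | ev) by brute-force enumeration over the joint."""
    num = den = 0.0
    for (q, z1, z2), p in joint.items():
        if all({"q": q, "z1": z1, "z2": z2}[k] == v for k, v in ev.items()):
            den += p
            num += p if q else 0.0
    return num / den

def d(p1, p2):
    return abs(p1 - p2)  # a convex distance measure on probabilities

# Connection strengths under d (with empty background evidence e):
cs_z1   = d(p_q_given(z1=True), p_q_given(z1=False))                    # CS(Z1, Q)
cs_z2_1 = d(p_q_given(z1=True, z2=True), p_q_given(z1=True, z2=False))  # CS(Z2, Q | +z1)

# Actual error of approximating P(q | z1, z2) by P(q), vs. the additive bound:
err = d(p_q_given(z1=True, z2=True), p_q_given())
assert err <= cs_z1 + cs_z2_1 + 1e-12
```

The same chaining works for any number of evidence nodes: each new term in the sum bounds one step of the triangle-inequality chain in the proof above.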
Item Metadata
Title | Link strength in Bayesian networks
Creator | Boerlage, Brent
Date Issued | 1992
Extent | 4149050 bytes
Genre | Thesis/Dissertation
Type | Text
File Format | application/pdf
Language | eng
Date Available | 2008-09-16
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0051344
URI | http://hdl.handle.net/2429/2041
Degree | Master of Science - MSc
Program | Computer Science
Affiliation | Faculty of Science; Department of Computer Science
Degree Grantor | University of British Columbia
Graduation Date | 1992-11
Campus | UBCV
Scholarly Level | Graduate
Aggregated Source Repository | DSpace