Link strength in Bayesian networks Boerlage, Brent 1992

LINK STRENGTH IN BAYESIAN NETWORKS

by

BRENT BOERLAGE
B.A.Sc., The University of British Columbia, 1983

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1992
© Brent Boerlage, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

(Signature) Department of Computer Science, The University of British Columbia, Vancouver, Canada. Date: October 15, 1992.

Abstract

This thesis introduces the concept of a connection strength (CS) between two nodes of a propositional Bayesian network (BN). Connection strength is a generalization of node independence, from a binary property to a graded measure. The connection strength from node A to node B is a measure of the maximum amount that the belief in B will change when the truth value of A is learned. If the belief in B does not change, they are independent, and if it changes a great deal, they are strongly connected.

It also introduces the link strength (LS) between two adjacent nodes, which is an upper bound on that part of the connection strength between them which is due only to the link between them (and not other paths which may connect them). Calculating connection strengths is computationally expensive, while calculating link strengths is not.
An algorithm is provided which finds a bound on the connection strength between any two nodes by combining link strengths along the paths connecting them (which is of complexity linear in the number of links). Such an algorithm lends substance to notions of an "effect" flowing along paths, and "effect" being attenuated by "weak" links, which is terminology that has appeared often in the literature, but only as an intuitive idea.

An algorithm for faster, approximate BN inference is presented, and connection strengths are used to provide bounds for its error. A system is proposed for BN diagrams to be drawn with strong links represented by heavy lines and weak links by fine lines, as a visualization aid for humans. Another visualization aid which is explored is the iso-CS contour map, in which the amount that one particular node can affect each of the other nodes is shown as contour lines superimposed on a regular BN diagram. A non-trivial example BN is presented, some of its connection strengths are calculated, CS contour maps are constructed for it, and it is displayed with link strength indicated by line width.

Contents

Abstract
Contents
Acknowledgments
1 Introduction
2 Bayesian Networks
  2.1 When do we Reason With Uncertainty?
  2.2 Using Probability for Uncertain Reasoning
  2.3 Axioms of Probability Theory
  2.4 Representing Independencies
    2.4.1 d-Separation Algorithm
  2.5 Bayesian Networks
  2.6 Bayesian Network Inference
    2.6.1 Virtual Evidence
    2.6.2 Inference on BN Example
3 Connection and Link Strengths
  3.1 Connection Strength Definition
  3.2 ΔP Connection Strength
    3.2.1 Single Link Example
    3.2.2 Range of ΔP Connection Strengths
  3.3 ΔO Connection Strength
    3.3.1 Single Link Example
    3.3.2 Range of ΔO Connection Strengths
  3.4 An Alternate Definition of Connection Strength
  3.5 Conditional Strength and Potential Strength
  3.6 Connection Strength in Complex BNs
  3.7 Commutativity of Connection Strength
  3.8 Link Strength Definition
  3.9 Comparing Link and Connection Strengths
4 Using Link Strength to Bound Connection Strength
  4.1 ΔP Serial Combination
  4.2 ΔO Serial Combination
  4.3 Fundamental Sensitivity Equation
  4.4 Example of Finding CS by Fundamental Equation
  4.5 Path Based Methods
    4.5.1 The CS Path Algorithm
  4.6 Complexity of Path Algorithm
  4.7 Dealing With Evidence
5 Applications
  5.1 BN Link Display
6 Conclusion
  6.1 Further Work
    6.1.1 Qualitative Probabilistic Networks
    6.1.2 Greater Computation for Tighter Bounds
    6.1.3 Multistate Nodes
7 Bibliography
A Notation and Nomenclature
  Abbreviations
B Conventional Statistical Measures of Association
C Proofs
  Theorem 3.1 - Equivalence of CS Definitions
  Theorem 3.4 - Alternate CS Definition
  Theorem 4.1 - ΔP Serial Chaining
  Theorem 4.2 - ΔO Serial Chaining
  Theorem 4.3 - Fundamental Equation
  Theorem 4.7:1 - Intercausal Link Strength
  Theorem 5.3:2 - Approx. Inference Error Bound

Acknowledgments

I would like to thank:

David Lowe - My supervisor, who allowed enough flexibility in my program to really get the education I wanted and find the field that interested and benefited me most.

Judea Pearl - For writing Probabilistic Reasoning in Intelligent Systems.

David Poole - Who introduced me to probabilistic reasoning, and who created a group of people studying Bayesian networks at UBC.

The Bayesian network people at UBC - David Poole, Mike Horsch, Nevin (Lianwen) Zhang, Yang Xiang, Keiji Kanazawa, and Greg Provan for useful discussions, and David Lowe, David Poole and Mike Horsch for useful comments on my thesis.

The many graduate students at UBC who I had the pleasure of knowing during my long stay there.
If I named them I would have a long list, and I'm sure I would miss some, so I'll resist the temptation to put even one name.

My wife, Christine - For patience and for support.

1 Introduction

Bayesian networks (BNs), also called belief networks or probabilistic causal networks, consist of graphs in which each node represents a variable of interest, and the links between the nodes indicate the probabilistic relationships between them. This thesis is restricted to BNs composed only of binary nodes (which are nodes representing variables that take on one of two values), and generally speaking we will interpret their value to mean that some proposition is TRUE or FALSE. Using a BN we can capture the relationships between our uncertain beliefs in the propositions, and then if we find out the truth value of one or more of the propositions, we can use BN inference algorithms to find updated beliefs for each of the other propositions, and updated relationships between the propositions.

Uncertain reasoning is very common for humans, and will conceivably be common in future machines. Chapter 2 provides some examples of situations requiring uncertain reasoning. Then it makes an argument in favor of using probabilities for such reasoning (in machines), and provides a well-known set of axioms for doing so. However, probabilistic reasoning can be extremely computationally expensive, and even quite small problems can be outside the range of practicality, unless we have some technology for taking advantage of the fact that generally our beliefs in some propositions are independent of our beliefs in others (Pearl88, Cooper90). This is the primary purpose of the BN graph. Chapter 2 goes on to show how BNs represent these independencies, and provides a non-trivial example of a BN. Some BN inference algorithms are also discussed, and it is pointed out that the BN inference problem is NP-hard (Cooper90).

Connection strength (CS) is a generalization of independence.
Instead of simply indicating whether one proposition is independent of another, it provides a graded measure of how much our belief in one proposition can change when we learn the truth or falsity of another. Chapter 3 defines connection strength and explores some of its properties. It also introduces the concept of link strength (LS), which is a sort of "local connection strength" between two adjacent nodes of a BN. Link strength measures that part of the connection strength between two nodes that is due to the single link between them, and not due to any other paths from one of them to the other.

Computing the connection strength between two nodes is generally even more computationally expensive than regular BN inference, but a link strength can be found very quickly using only the conditional probabilities stored at a single node. Chapter 4 presents an algorithm which uses link strength values to find a bound for the connection strength between any two nodes in time linear in the number of links in the BN. The algorithm can be viewed as a summation over alternative paths between the nodes, which lends substance to notions of an "effect" flowing along paths, and "effect" being attenuated by "weak" links, which is terminology that has often appeared in the literature as an intuitive idea, but which has never been substantially formalized.

Using independence information allows BN inference algorithms to solve medium-sized BN problems in a reasonable amount of time. Using connection strengths we can determine which nodes are nearly independent, and then by assuming that they are independent, we can solve larger-sized BN problems in reasonable time, while obtaining approximate results. In Chapter 5 an algorithm is given which quickly provides bounds on the maximum error made during such an approximation.

Another application of link strengths explored in Chapter 5 is to display them on BN diagrams as a visualization aid for humans.
For example, the width of the line representing a link can be used to indicate its link strength, with finer lines for weaker links, and thicker lines for stronger links. The BN of Chapter 2 is redrawn in such a manner as an example. BNs have been praised as a great tool for humans to visualize probabilistic relations, and this technique extends that tool by providing graded rather than binary independence information.

Another visualization aid for BNs is the CS contour map, which indicates how much one node can affect each of the other nodes in the network, and is created by drawing iso-CS lines over the BN diagram. Chapter 5 contains a CS contour map for the example BN of Chapter 2. It also contains a contour map based on CS bounds calculated by the algorithm developed in Chapter 4. By comparing the two contour maps, the bounds may be compared with the true values.

Wellman90 introduces the concept of qualitative probabilistic networks (QPNs), which are networks with the same underlying topology as BNs, and whose purpose is to determine the direction of change in belief of one proposition when we learn the truth of another. We can consider connection strength as determining the maximum magnitude, and QPNs as determining the sign, of the same quantity. In fact, many of Wellman's results can be obtained from the connection strength equations, simply by modifying them to retain sign information (e.g. removing absolute value functions). This is briefly discussed in the "Further Work" section at the end of the thesis.

Another possibility explored in the "Further Work" section is the development of a continuum of algorithms for determining CS bounds, which trade off tightness of bound for speed, from very fast algorithms that provide loose bounds, to very slow algorithms that produce exact values.
The two endpoint algorithms (fastest and slowest) are the algorithms used in the body of this thesis, and a method is suggested for generating the in-between algorithms.

Notation is explained as it is introduced, but it is also summarized in Appendix A. Readers who are already familiar with BNs can skip the next chapter and go straight to Chapter 3, using Appendix A as a guide for notation definitions they may have missed.

2 Bayesian Networks

This chapter provides a brief introduction to probabilistic reasoning and Bayesian networks (BNs), and states a number of well-known results which are used later in this thesis. It does not contain any original results, and so it may be skipped by the knowledgeable reader. Two good introductory books for Bayesian networks and related topics are Pearl88 (a "must read" for a thorough introduction) and Neapolitan90 (easier to read and has more recent results, but generally covers less material).

2.1 When do we Reason With Uncertainty?

Reasoning with uncertainty is the process of combining items of uncertain knowledge to obtain uncertain conclusions. Uncertain knowledge is knowledge which one would be willing to retract, or consider less certain, upon receiving knowledge (certain or uncertain) to the contrary. There are not many things we know that we wouldn't be willing to retract given enough evidence to the contrary, so much of our reasoning can be considered reasoning with uncertainty.

In this thesis, the main purpose of studying uncertain reasoning will be to produce computer-based automated systems, although some of the results may apply to other intelligent agents. In constructing an automated system, we must decide which of its information it should treat as uncertain. We may want it to treat some information as certain, even though it doesn't hold in all cases (or we don't know if it does), just to make the system simpler, or to give it a more predictable behavior.
On the other hand, we may want it to consider much of what it learns, based on limited or imperfect observations, as uncertain information. Generally speaking, as the sophistication of our automated systems increases, a larger percentage of their knowledge should be treated as uncertain, since many of these systems will be more adaptable, will be learning more, and will be working in less well-defined domains in an autonomous manner.

There are numerous particular situations where an automated system may need to reason with uncertainty. Any approximate measurement is information with uncertainty. So combining uncertain or approximate observations, such as physical measurements, which have a redundancy in the observations for the purposes of increasing the accuracy or detecting a totally erroneous observation, requires some form of reasoning with uncertainty. The reasoning may be to simply take the mean or median of the set of measurements, or it may involve a complex analysis. Any situation in which we have information coming from multiple sources, which may agree or conflict to varying degrees, requires reasoning with uncertainty. Examples are sensor fusion in a robot (i.e. combining sensory data), or merging news reports from different agencies.

In some reasoning situations, much of the knowledge involved is nearly certain, and we can gain huge computational savings by treating it the same way we treat knowledge that is certain. However, when we learn something that casts doubt on some piece of it, we may have to revert the status of that piece back to "uncertain", or even to "false", and suitably modify the status of related pieces. The methods of default reasoning have traditionally been used to do this.

Some "inverse" problems (such as diagnosis, machine vision, machine hearing, and other recognition problems) don't have a unique solution.
Reasoning with uncertainty can help to find the most probable solutions to these problems.

Problems in which an agent learns generalizations from case data are examples of reasoning with uncertainty. Presumably, the more cases the agent sees, the more certain he becomes about the generalizations. When the generalizations are applied to predict unknown values in a new case, more reasoning with uncertainty is required, both because the generalization is uncertain, and because its applicability to the new case may be uncertain.

We may even use reasoning with uncertainty for problems that can be stated in purely logical (i.e. certain) terms, such as theorem proving. Often, these types of problems can be considered to involve some kind of search, and they can be solved far more quickly if we use suitable heuristics to guide the search to examine the most probable candidates first. But as the use of heuristics becomes more sophisticated, it becomes difficult to combine conflicting heuristics, and so it is useful to think of the heuristics as uncertain knowledge. Then, we can use reasoning with uncertainty to direct the search of the theorem prover (or other strictly logical reasoning-with-certainty system).

2.2 Using Probability for Uncertain Reasoning

There are a number of mathematical systems available for reasoning with uncertainty, and there has been considerable controversy over which is "best". These systems include subjective probability, fuzzy logic, belief functions (e.g. Dempster-Shafer), certainty factors, non-numerical probabilities, and default logic. This thesis uses a system based on subjective probability that is often called "Bayesian probabilistic reasoning", although it involves far more than Rev. Thomas Bayes was ever concerned with. Some approximations to Bayesian reasoning will also be considered.

de Finetti provides an argument in favor of subjective probabilities based on the notion of coherence (F. P. Ramsey and L. J. Savage have also done similar work).
An agent is offered a number of betting options and his choices are analyzed. If he acts as though his beliefs were governed by rules other than those of probability, it is possible to arrange a series of bets with him (called a Dutch book) in which he is guaranteed to lose money regardless of how events unfold.

Cox46 derives the rules of probability without any of the machinery of gambling or decisions, based only on meeting a list of desirable properties for an ideal agent to reason with uncertainty. The proof is more complex than the de Finetti proof, but seems to have broader applicability. Below is the list of desired properties from which all the axioms of probability theory can be derived. Since this thesis is built upon the theory of probability, these may be considered the assumptions about reasoning with uncertainty made by this thesis.

1. Clarity: Propositions must be defined precisely enough so that it would be theoretically possible to determine whether they are indeed true or false.

2. Scalar Continuity: A single real number is both necessary and sufficient for representing a degree of belief.

3. Completeness: Some degree of belief can be meaningfully assigned to any well-defined proposition.

4. Context dependency: The belief assigned to a proposition can depend on the belief in other propositions.

5. Hypothetical conditioning: There exists some function that allows the belief in a conjunction of two propositions to be calculated from the belief in the first proposition, and the belief in the second proposition given that the first is true.

6. Complementarity: The belief in the negation of a proposition is a monotonically decreasing function of the belief in the proposition itself.
7. Consistency: There will be equal belief in propositions that are logically equivalent.

Cox46 originally proved that the probability axioms follow from these properties, but Tribus69 weakened the hypothetical conditioning requirement, and the solution of the "associativity equation" in Aczel66 may be used to remove the differentiability assumptions required by Cox. The list 1-7 was taken (with some modifications) from HorvitzHeckermanLanglotz86, who clearly state and name the desired properties, and use the results to compare existing uncertainty systems.

Some people have argued for uncertainty systems that violate one of the properties 1-7, or that have internal inconsistencies. For example, the Dempster-Shafer theory violates the combination of 2, 3, and 6. One justification for doing this is that some way of representing total ignorance about a proposition is required. Property 3 requires that we specify a belief for every proposition, and property 2 requires that it be only a single number, which appears to preclude any representation of ignorance. Bayesians have pointed out that for decision theory a representation of ignorance is not required, although for learning and communication it may be. Sometimes using beliefs about real-world frequencies, instead of just beliefs for the occurrence of a single event, will handle these situations. Otherwise, true Bayesian probabilities can be augmented with a confidence measure (such as an "equivalent sample size", which is equivalent to the number of relevant cases used for a learned BN), which can be combined with the probability when necessary to provide enough information for someone with different priors to update their priors.
The issue remains controversial, but without doubt the Bayesian method is suitable for a very large class of uncertainty problems.

Fuzzy logic appears to violate property 1, but the primary complaint of the fuzzy logic community is that probabilities cannot represent all the different types of uncertainty that arise. They distinguish between vagueness (fuzziness, haziness, etc.) and ambiguity (nonspecificity, variety, etc.). Bayesians do claim to be able to handle these types of uncertainty using only probabilities (for example, see Cheeseman86).

Expert systems based on simple evidential support may be misled when they use transitivity to chain rules. For example, if B provides support for C, and C provides support for D, a support-based system may increase support for D upon observing B, which may be inappropriate. An example from Pearl88 is that "wet grass" may provide support for "rained last night", and "sprinkler on" may provide support for "wet grass". Each of these two rules works fine by itself, but chaining them suggests that "sprinkler on" supports "wet grass", which in turn supports "rained last night". However, if we see the sprinkler on we should probably decrease our belief that it rained last night, instead of increasing it.

A good example of the ad hoc nature of certainty factors, and the confusion resulting in trying to apply them, is the transcript of email exchanges of the MYCIN developers (BuchananS84, pp. 221-232). Later, Heckerman (1986) analyzed certainty factors and showed that (when the inconsistencies were removed) they were equivalent to using probabilities, but with independence assumptions being implicitly made. Since certainty factors are often inappropriate, using them may produce misleading results.
By comparing the system to a probabilistic one, it became clearer where the deficiencies were, and how serious they were.

Proponents of default reasoning are often concerned with the numbers normally used in probabilistic reasoning, and ask "Where do these numbers come from?" The numbers represent subjective beliefs, not exact real-world frequencies, so the reasoning agent is simply summarizing his beliefs with the numbers. In fact, given a series of observations from nature, it seems easier to find probabilities (either as frequencies, or by doing Bayesian reasoning over priors on frequencies) than to decide on a set of default rules. Nevertheless, default reasoning can be a very valuable way of reasoning with uncertainty, since it has significant computational advantages, and can reduce the information rate considerably during communication.

It is dangerous to assume that a proof like the one by Cox completely defines the "best" system to use. It is always difficult to foresee what types of systems will be successful in the future, partly because the nature of the problems being solved keeps changing. However, the proof is useful for understanding the fundamental differences between uncertainty systems, and if one is willing to accept the properties 1-7, it indicates that probability is the system to use.

Doing complete probabilistic reasoning is generally very computationally expensive (see section 2.6), so it is reasonable to look for alternatives. Sometimes the most appropriate algorithm is as simple as taking the average of a few values, or using a majority vote scheme. The approach advocated in this thesis is to start with a normative theory for combining uncertain information, and then use a simplified scheme, when appropriate, to improve the computational speed or overall complexity of the reasoning. However, it should be considered an approximation to using probabilities, and should be judged based on how similar its results are to those of full-scale Bayesian reasoning.
One of the applications of "link strength" is to indicate when it is appropriate to take a certain type of computational shortcut in probabilistic reasoning, and to provide a bound on the resulting inaccuracy.

2.3 Axioms of Probability Theory

Here is a set of axioms for probability theory equivalent to those derived by Cox (and also compatible with other axiomatizations of probability, such as the Kolmogorov axioms):

  1. P(a|a) = 1
  2. P(¬a|b) = 1 − P(a|b)                                          (2.3:1)
  3. P(a, b|c) = P(a|b, c) P(b|c)

where P(x, y|z) means the probability that propositions x and y are both true, given that proposition z is true. For notational convenience we use P(x) to mean P(x|T), where T is a tautology that is always true. Upper case letters refer to propositional variables, and lower case to their values. +b stands for b=TRUE, ¬b stands for b=FALSE, and sometimes +b is written simply as b if that does not result in confusion. For more notational details see Appendix A.

From these few axioms, together with propositional logic and arithmetic, we can generate the entire mathematical structure of probability theory. Below is a list of a few theorems that I will use later. For their proofs, or alternate (but equivalent) axiomatizations of probability theory, see an elementary text such as Lindley65.
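The axioms can be checked numerically on a small example. The following is an illustrative sketch (not from the thesis): it stores a full joint distribution over three propositions as a table (the values are invented for the example) and verifies axioms 2 and 3 against the definition of conditional probability.

```python
# Illustrative sketch (not from the thesis): a full joint distribution over three
# propositions A, B, C stored as a table, used to check axioms 2 and 3 numerically.
# The table values below are invented for the example.
from itertools import product

FJD = {v: p for v, p in zip(
    product([True, False], repeat=3),
    [0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.15, 0.10])}

def prob(pred):
    """P(pred): sum the joint table over assignments where pred holds."""
    return sum(p for v, p in FJD.items() if pred(v))

def cond(pred, given):
    """P(pred | given) by the definition of conditional probability."""
    return prob(lambda v: pred(v) and given(v)) / prob(given)

a, b, c = (lambda v: v[0]), (lambda v: v[1]), (lambda v: v[2])

# Axiom 3 (product rule): P(a, b | c) = P(a | b, c) P(b | c)
lhs = cond(lambda v: a(v) and b(v), c)
rhs = cond(a, lambda v: b(v) and c(v)) * cond(b, c)
assert abs(lhs - rhs) < 1e-12

# Axiom 2 (complement): P(not-a | b) = 1 - P(a | b)
assert abs(cond(lambda v: not a(v), b) - (1 - cond(a, b))) < 1e-12
```

Both identities hold for any table whose entries sum to one, which is the point: the axioms constrain the conditional beliefs, not the particular numbers chosen.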
This is Bayes theorem:

  P(a|b, c) = P(b|a, c) P(a|c) / P(b|c)                            (2.3:2)

This is the reasoning by cases theorem:

  P(a|c) = P(a|b, c) P(b|c) + P(a|¬b, c) P(¬b|c)                   (2.3:3)

This is the independence theorem:

  P(a, b|c) = P(a|c) P(b|c) iff A is independent of B given C      (2.3:4)

where independence is defined as:

  P(a|b, c) = P(a|c) iff A is independent of B given C,
  providing b and c are consistent                                 (2.3:5)

Theorem 2.3:6: In each of the above theorems, all the probabilities are conditioned on the proposition c. Since it could be equivalent to any logical formula, they all hold with c replaced by any vector of truth values, c.

If we are willing to accept using the 3 simple axioms 1-3 in 2.3:1 for a system to reason with uncertainty, then we are accepting using probability theory.

The full joint probability distribution (FJD) specifies a probability for every possible conjunction involving every proposition of interest (or its negation). For example, if an agent knew only about the propositions A, B, and C, his FJD would consist of the probabilities P(abc), P(ab¬c), P(a¬bc), P(a¬b¬c), P(¬abc), P(¬ab¬c), P(¬a¬bc), and P(¬a¬b¬c). Obviously, if the FJD were stored in a table, the size of the table would be exponential in the number of propositions. We must take advantage of independencies between propositions to represent an FJD more efficiently. Supplying an FJD for a problem completely specifies the problem probabilistically. Sometimes I will refer to the FJD of a problem or an agent when I want to indicate the complete probabilistic specification, although it may not be represented in any table.

2.4 Representing Independencies

Using just the axioms and theorems of the previous section we can do probabilistic reasoning.
That is, given beliefs for a set of propositions and their conjunctions (or relations), we can combine them with new items of certain or uncertain knowledge about the propositions or their relations, to obtain new beliefs for any of the propositions, or any logical formula of the propositions. But unless we exploit the fact that under some conditions our belief in some of the propositions will be independent of others, then for even a moderately sized problem, the computational cost will be astronomical. We could represent all the independencies as a list of triples, each one of the form I(X, Y|Z), where X, Y, and Z are sets of propositions, and I(X, Y|Z) means that if Z is precisely the set of propositions we know the truth value of, then X is independent of Y (i.e. further knowledge of whether the propositions in X are true won't change our beliefs in any of the propositions in Y). However, this list would be impossibly large and awkward to deal with, since the number of possible sets for each of X, Y, and Z is exponential (the power sets) in the number of base propositions.

Once some independencies are known, generally others must follow to be consistent with the axioms of probability. So we could keep a partial list of independencies and generate the others as needed. We can generate them using the graphoid axioms (Pearl88):

  Symmetry:       I(X, Y|Z)  ⇒  I(Y, X|Z)
  Decomposition:  I(X, Y ∪ W|Z)  ⇒  I(X, Y|Z)
  Weak Union:     I(X, Y ∪ W|Z)  ⇒  I(X, Y|Z ∪ W)
  Contraction:    I(X, Y|Z) and I(X, W|Z ∪ Y)  ⇒  I(X, Y ∪ W|Z)

However, using these axioms to generate required independence information would generally turn out to be a major computational task in itself. Instead we use a dag (directed acyclic graph) to represent the independencies, and by using a simple and very fast algorithm called d-separation, we can use the dag to quickly find independence information.

We require that the dag not represent any independency that isn't in the original FJD.
Unfortunately, due to the limited representational power of dags, the result is that sometimes not all the independencies of the FJD can be represented by a dag. Since we are using the independence information to speed computation, the fact that it sometimes misses an independency means that sometimes we will do a little more computation than necessary, but since it never reports an independency when there isn't one, the computation will never be in error.

In the dag representation, each proposition of interest is represented by a node in a graph. Directed arcs (called links in this thesis) connect the nodes. The graph may be constructed by choosing any total ordering for the nodes, starting with no links, and then stepping through the ordering and adding links to each node N as it is encountered. The links to add to node N are determined by examining each node preceding node N in the ordering, and adding a link from it to N if N is not independent of that node given all the other nodes preceding N. This process will be examined again in section 3.8.

The dag that results can be used to regenerate the independence information via the d-separation algorithm. In general, its structure will depend on the total ordering of the nodes that we started with. In order to create the simplest model, and to represent as many independencies as possible, it is desirable to choose an initial total ordering that will minimize the number of links in the final dag. In models of physical causation, generally placing propositions about causes before propositions about their effects will result in a smaller dag. An example of a dag to represent independence appears in figure 2.5:1.

The graph created by the above algorithm never has any cycles (i.e. paths that return to their starting point following the direction of the links), but it may have loops (paths that return to their starting point ignoring the direction of the links).
A graph without loops is called singly-connected, and one with loops is called multiply-connected. The nodes with links going to a node N are called the parents of N, and the nodes at the end of links leaving N are called the children of N. The definitions of ancestors, descendants, siblings, etc. follow in the natural way.

The recent heightened interest in using normative probabilistic systems to reason with uncertainty, and the construction of a number of practical systems, is due in large part to exploiting the independencies represented by a dag description of the FJD. One of the contributions of this thesis is to provide a criterion for when it is appropriate to exploit "near-independencies" as well, and to generalize the d-separation algorithm to discover them.

2.4.1 d-Separation Algorithm

The d-separation algorithm is used to determine the independencies represented by a dag. It is an algorithm which allows dags to represent independencies in a manner consistent with the graphoid axioms (and therefore the axioms of probability).

Say X, Y, and Z are disjoint subsets of nodes in the dag. We will use the d-separation algorithm to determine if X and Y are independent given Z ("given Z" means the same thing as "having evidence for Z", which means the truth values of all the propositions in Z are known by the reasoning agent).

A node is termed converging with respect to a path iff both the link used to enter the node and the link used to leave it along that path are directed toward the node (e.g. the third node B from the left in figure 2.4).

A path is blocked by Z iff:
1. It has a non-converging node in Z, or
2. It has a converging node N that is not in Z, and N has no descendants in Z.

A path that is not blocked is called active. Figure 2.4 illustrates some blocked and active paths. In each case A is a node from X, C is a node from Y, and B is in Z iff it is shaded.

Z d-separates X from Y iff all paths from X to Y are blocked by Z.
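The blocking criterion can be stated compactly in code. The following is an illustrative sketch (not the thesis's code); the representation of the dag as a parents map, the node names, and the path-as-list convention are assumptions made for the example.

```python
# Illustrative sketch (not the thesis's code) of the path-blocking test above.
# Assumed representation: a dag as a dict mapping each node to the set of its
# parents; a path as a list of node names; Z as the set of evidence nodes.

def descendants(dag, node):
    """All descendants of node: its children, their children, and so on."""
    kids = {n for n, ps in dag.items() if node in ps}
    out = set(kids)
    for k in kids:
        out |= descendants(dag, k)
    return out

def is_blocked(dag, path, Z):
    """True iff the given (undirected) path is blocked by evidence set Z."""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        converging = prev in dag[node] and nxt in dag[node]  # both links point at node
        if not converging and node in Z:
            return True  # rule 1: non-converging node with evidence
        if converging and node not in Z and not (descendants(dag, node) & Z):
            return True  # rule 2: converging node, no evidence at it or below it
    return False

# A -> B <- C: node B is converging on the path A-B-C.
dag = {"A": set(), "B": {"A", "C"}, "C": set()}
assert is_blocked(dag, ["A", "B", "C"], set())       # blocked: no evidence at B
assert not is_blocked(dag, ["A", "B", "C"], {"B"})   # active: evidence at B opens it
```

The two assertions illustrate the characteristic behavior of converging nodes: the path A-B-C is blocked when nothing is known, and becomes active once B (or a descendant of B) receives evidence.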
If there is any active path from X to Y, then X and Y are not d-separated by Z.

If Z d-separates X from Y, then X and Y are independent given Z. That is, if you know Z, further knowledge of X will not shed any light on Y, and vice versa. It is important that Z include all the nodes for which you have knowledge, since some of the nodes in Z may block all the paths from a node in X to a node in Y, but others may form new paths.

You can check your ability to apply the d-separation algorithm by trying to find all the active paths from node V to node Q in the BN of figure 2.6:2, where the shaded nodes are the ones for which you have knowledge. The only two active paths are the marked ones.

Figure 2.4 - Each case indicates whether the displayed path between A and C is active or blocked (blocked paths at left, active paths at right; an open node is one without evidence, a shaded node is one with evidence, i.e. one whose truth value is known). In each case there may be other links connected to A, B, C, or D, so there may be other paths between A and C which may be active or blocked.

2.5 Bayesian Networks

We can use the dag representation of independence introduced in the previous section, together with a set of conditional probabilities at each node, to provide a complete representation of any particular FJD. We call the resulting structure a Bayesian network (BN).

In the common definition of a BN, nodes represent any kind of variable of interest, but since this thesis is restricted to the study of binary nodes, our nodes represent propositions (or variables that can take on one of two states).

To construct a BN, a dag may be determined as described in the previous section. Then a number of conditional probabilities, called the node conditional probabilities (NCPs), are attached to each node. These are the probabilities that the proposition of the node is TRUE given each of the different TRUE/FALSE combinations of its parents.
For example, if the node B had the single parent A, its NCPs would be {P(b|a), P(b|¬a)}; if it had parents A and W its NCPs would be {P(b|aw), P(b|a¬w), P(b|¬aw), P(b|¬a¬w)}; and if it had no parents its NCP would be {P(b)}. These probabilities are subjective probabilities, which are unique for the reasoning agent constructing and using the BN. They are measures of how strongly the agent would believe the child proposition if he knew the truth about the parent propositions. When appropriate, they correspond somewhat to frequencies in the real world, but that is not required. The purpose of BN inference is simply to tell the agent how to change beliefs when he observes certain evidence. The structure of the links is often called the topology of the BN, to distinguish it from the conditional probability information.

The FJD is easily reconstructed from the BN representation via the equation:

P(v) = ∏_{X ∈ V} P(x | π(X))    2.5

where V is the set of nodes in the BN, P(v) is a probability from the FJD for the vector of truth values v (one value for each node in V), X is a node, π(X) is a vector of values (consistent with v) for the parents of X, and P(x | π(X)) is an NCP for X, where both x and π(X) are consistent with v. In other words, the joint probability of a setting of TRUE/FALSE values for all the nodes is simply the product, over the nodes, of the NCP from each node which is consistent with the setting.

A single FJD can generally be represented by several different BNs, but if the BN must satisfy a given total ordering for the nodes, then it will be unique. Any propositional FJD can be represented by some BN having one node for each propositional variable of the FJD. A BN uniquely (and therefore unambiguously) determines an FJD. Every possible BN determines some FJD, so it is impossible to construct an inconsistent BN, no matter what its topology (providing it is acyclic) or its NCPs.

The direction of each link in a BN is significant.
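Equation 2.5 can be checked numerically. A minimal sketch (the dict-based encoding of NCPs, keyed by tuples of parent truth values, is an illustrative assumption):

```python
# Sketch of equation 2.5: the joint probability of a full truth assignment is
# the product of each node's NCP entry consistent with that assignment.

def joint_probability(assignment, parents, ncp):
    """assignment: {node: True/False}. ncp[node] maps a tuple of parent
    truth values to P(node is TRUE | those parent values)."""
    p = 1.0
    for node, value in assignment.items():
        parent_vals = tuple(assignment[pa] for pa in parents[node])
        p_true = ncp[node][parent_vals]
        p *= p_true if value else 1.0 - p_true
    return p

# Two-node chain A -> B with P(a)=0.3, P(b|+a)=0.5, P(b|~a)=0.75
parents = {"A": [], "B": ["A"]}
ncp = {"A": {(): 0.3}, "B": {(True,): 0.5, (False,): 0.75}}
print(joint_probability({"A": True, "B": True}, parents, ncp))   # 0.3 * 0.5 = 0.15
```

Summing this product over all 2^N truth assignments consistent with a partial setting recovers any marginal of the FJD, though of course that brute-force sum is exactly the exponential computation that exploiting independencies is meant to avoid.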
A link from node A to node B may always be reversed in direction, but if it was already in its optimal orientation, then the reversal usually requires the addition of extra links from the parents of A going to B, and from the parents of B going to A, to avoid indicating independencies that don't exist. Once the extra links have been added, the BN doesn't represent all the independencies it did before the link reversal. Also, with the links in a non-optimal direction our knowledge is less modular, in that adding new variables (nodes) will generally require adding a great many new links. In those problems where the variables are causally related, the optimal direction for links is generally the direction of causality (e.g., the direction of time).

An example BN is shown in figure 2.5:1, and its NCPs are shown in figure 2.5:2. This BN represents the relationships between the beliefs of Jim, who is a fictitious character that lives in a small community which also contains Hank, Tom, Molly, and Gale. The NCPs are the subjective probabilities of Jim only. For example, the NCPs of node MT represent Jim's belief of what Molly thinks (notice that in this example it mirrors what Jim himself believes, since the NCPs of node MT are the same as the NCPs of node TD).

Each node should specify a proposition precisely to the user of the BN, in order to satisfy the clarity condition (see section 2.2) of probabilistic reasoning. For example, "Tom" must refer to a particular person, so including a last name may help; "a big donation" must refer to a particular range of donation sizes, so including dollar amounts may help; and "lots of cars" must refer to a particular range in number of cars, so providing numbers may help.
Perfectly describing each node to someone completely unfamiliar with the situation may require an endlessly long description, but any description is adequate as long as it produces, in the minds of the BN builder and the BN users, a proposition of adequate preciseness for the task at hand.

It may appear that most of the important causes for many of the nodes have not been included. For example, "Park is approved" has only "Molly elected mayor" as a parent. Surely there are many more important factors, such as the need for a park, the existence of necessary funds, the availability of land, etc. However, the purpose of using probabilities and reasoning with uncertainty is to be able to reason without explicitly accounting for all these factors. The probabilities themselves summarize the missing factors (if all factors were accounted for, perhaps the NCPs would consist of only 0s and 1s). In building a BN we need include only those factors we know of and suspect are relevant, and the uncertainty in the inference results will reflect our lack of knowledge. If Jim were an actual person he would probably know of many more factors that were relevant to his real-life questions involving these nodes, so he could add nodes and links for them to expand our example BN, and thereby generally increase the expected certainty of his conclusions.

This BN can be used to find Jim's initial beliefs in each proposition, and what those beliefs become if he finds out the truth value of one or more of the nodes. For example, if Jim found out that Tom's hardware store did well last year, we can use it to find his new beliefs in each proposition (for example, whether Hank is going to move away in the next 5 years). Then, if he later read in the newspaper that Molly was elected mayor, we could obtain a new set of beliefs for each node (for example, whether Tom told Molly he will donate campaign funds).
The actual beliefs calculated for this example are given at the end of the next section.

Figure 2.5:1 - Example BN. The nodes and their propositions are:
HT: Lots of traffic on 5th street last year
SW: Tom's hardware store did well last year
CP: Lots of cars are usually parked in front of Tom's store
GT: Gale told Molly that lots of cars are usually parked in front of Tom's store
TW: Tom is currently fairly wealthy
TR: Tom told Molly he is rich
TF: Tom told Molly he will donate campaign funds
MR: Molly thinks Tom is rich
MT: Molly thinks Tom is going to make big campaign donation
MD: Molly decides to run for mayor
TD: Tom makes big donation to Molly's campaign
MM: Molly gets elected mayor
PA: Neighborhood park is approved
PC: Neighborhood park is constructed next year
CV: Tom's cousin is visiting him
TC: Tom just bought a new BMW car
CD: A new BMW is parked in Tom's driveway
TA: Tom can afford to move to an expensive neighborhood next year
TM: Tom moves next door to Hank next year
FW: 5th street is widened next year
DT: Traffic more than doubles on Hank's street next year
VU: Hank's property value goes up more than 20% in two years
HM: Hank moves away sometime in the next 5 years

Figure 2.5:2 - Node conditional probabilities (NCPs) for the example BN in figure 2.5:1:

P(ht) = 0.7
P(sw|+ht) = 0.7, P(sw|¬ht) = 0.6
P(cp|+sw,+ht) = 0.8, P(cp|+sw,¬ht) = 0.7, P(cp|¬sw,+ht) = 0.7, P(cp|¬sw,¬ht) = 0.2
P(gt|+cp) = 0.1, P(gt|¬cp) = 0.002
P(tw|+sw) = 0.7, P(tw|¬sw) = 0.3
P(tr|+tw) = 0.1, P(tr|¬tw) = 0.05
P(tf|+tr,+tw) = 0.60, P(tf|+tr,¬tw) = 0.15, P(tf|¬tr,+tw) = 0.20, P(tf|¬tr,¬tw) = 0.05
P(mr|+tr,+gt) = 0.71, P(mr|+tr,¬gt) = 0.70, P(mr|¬tr,+gt) = 0.31, P(mr|¬tr,¬gt) = 0.30
P(mt|+tf,+mr) = 0.8, P(mt|+tf,¬mr) = 0.5, P(mt|¬tf,+mr) = 0.1, P(mt|¬tf,¬mr) = 0.02
P(td|+tf,+tw) = 0.8, P(td|+tf,¬tw) = 0.5, P(td|¬tf,+tw) = 0.1, P(td|¬tf,¬tw) = 0.02
P(tc|+tw) = 0.3, P(tc|¬tw) = 0.1
P(cv) = 0.01
P(cd|+cv,+tc) = 0.95, P(cd|+cv,¬tc) = 0.90, P(cd|¬cv,+tc) = 0.90, P(cd|¬cv,¬tc) = 0.05
P(ta|+tw,+tc,+td) = 0.800, P(ta|+tw,+tc,¬td) = 0.802, P(ta|+tw,¬tc,+td) = 0.810, P(ta|+tw,¬tc,¬td) = 0.812, P(ta|¬tw,+tc,+td) = 0.300, P(ta|¬tw,+tc,¬td) = 0.302, P(ta|¬tw,¬tc,+td) = 0.310, P(ta|¬tw,¬tc,¬td) = 0.312
P(tm|+ta) = 0.3, P(tm|¬ta) = 0.05
P(md|+mt) = 0.7, P(md|¬mt) = 0.5
P(mm|+md,+td) = 0.5, P(mm|+md,¬td) = 0.3, P(mm|¬md,+td) = 1e-7, P(mm|¬md,¬td) = 1e-7
P(pa|+mm) = 0.7, P(pa|¬mm) = 0.4
P(pc|+pa) = 0.9, P(pc|¬pa) = 1e-5
P(fw|+ht,+mm,+pc) = 0.52, P(fw|+ht,+mm,¬pc) = 0.50, P(fw|+ht,¬mm,+pc) = 0.42, P(fw|+ht,¬mm,¬pc) = 0.40, P(fw|¬ht,+mm,+pc) = 0.15, P(fw|¬ht,+mm,¬pc) = 0.15, P(fw|¬ht,¬mm,+pc) = 0.12, P(fw|¬ht,¬mm,¬pc) = 0.10
P(dt|+fw) = 0.8, P(dt|¬fw) = 0.2
P(vu|+pc,+dt) = 0.80, P(vu|+pc,¬dt) = 0.82, P(vu|¬pc,+dt) = 0.50, P(vu|¬pc,¬dt) = 0.51
P(hm|pc,vu,dt,tm): sixteen entries conditioning HM on PC, VU, DT, and TM [the trailing digits of these values are illegible in this copy; they lie near 0.1 when PC is true and near 0.3 when PC is false]

2.6 Bayesian Network Inference

The most studied BN inference problem is: given a BN and evidence for some of its nodes, what are the posterior probabilities of its other nodes? Evidence is some ideal observation, or the receiving of some certain information, on the truth of one or more node propositions (uncertain evidence is dealt with in section 2.7). With respect to a particular BN, evidence items may be inconsistent with each other. For example, if B is a child of A with P(b|a) = 1, and we obtain evidence a = TRUE and b = FALSE, we say the evidence is inconsistent. This indicates that the evidence is not possible given the BN model (which probably indicates a fault with the model, not the evidence).
Inconsistent evidence is not allowed in BN inference, and throughout this thesis, whether it is explicitly stated or not, we assume evidence is consistent.

Occasionally in this thesis I will refer to BN inference in a dynamic sense, which implies a situation where a stream of evidence items constantly arrives at a BN, and we update the beliefs of each node as they arrive. So we may speak of the belief at a node as rising and falling through time. A primary feature of this situation is that evidence always accumulates; no item is ever retracted. As the evidence monotonically increases, some quantities of interest will monotonically increase or decrease, while others will vary up and down.

Once we have received evidence there are two different systems for taking it into account in a BN. We can do evidence propagation to modify the BN to one specific for that evidence state, by modifying the network topology and NCPs (often extensively), so that the nodes for which we have evidence no longer appear in the network, and the new network represents the original BN conditioned on the evidence. Or we can leave the BN structure unchanged, but mark the nodes for which we have evidence with the state of the evidence (these become known as evidence nodes and are usually drawn shaded on a BN diagram), and then use algorithms designed to handle the evidence "in place" to find posterior probabilities. These two systems of dealing with evidence are entirely equivalent in semantics; the only differences are ones of representation and computation.

The way evidence propagation to create a new BN is normally done is through link reversals and node absorption (Shachter86 and Shachter88). First, all links pointing to evidence nodes are reversed, one by one. Each reversal may result in new links pointing to an evidence node, since during a reversal the nodes at each end of the link gain all the parents the other has.
Eventually though, each evidence node is guaranteed to have links only leaving it (links between evidence nodes may simply be deleted), and then that evidence node is absorbed. That is, it is removed from the network and all the NCPs of its child nodes are collapsed by one dimension to the value of the evidence. Later a process called probability propagation may be used on this new BN to calculate the posterior probabilities.

There are a number of methods for computing the new beliefs after receiving evidence which do not create a new BN. Some of them use a compiled secondary structure for efficiency, and some of them operate on just the original BN. Some of them produce approximate results, and some exact. Pearl developed a fast algorithm for BN updating when the network is singly-connected, called belief propagation (see Pearl88), which can find posteriors in O(N) time, where N is the number of nodes in the network, and as a parallel algorithm with a processor for each node, in O(d) time, where d is the diameter of the network. Multiply-connected networks pose a much greater computational problem.

The reasoning by assumptions algorithm finds a set of nodes (called cut nodes), such that if they were instantiated with evidence some active paths would become blocked and the network would become singly-connected. Then it solves the singly-connected problem multiple times, once for each possible instantiation of the cut nodes. The final beliefs are found as a weighted sum of the beliefs in each of these sub-calculations, with the weighting for each instantiation being the probability that the cut nodes would be instantiated that way (using equation 2.3:3). Since all combinations of evidence at the cut nodes must be considered, the algorithm is exponential in the number of cut nodes.

Cooper has shown general BN inference to be NP-hard by reducing the 3SAT problem to a BN inference problem (Cooper90).
This is a worst-case result, but generally even the average case requires exponential time to find exact results. So for large BN applications, some sort of approximation algorithm is often necessary for BN inference.

Although the BN inference algorithms described above will find the posterior probabilities for any node, given a set of evidence at any other nodes, it is useful to dissect BN inference into predictive reasoning, diagnostic reasoning, and intercausal reasoning. Predictive reasoning finds the belief at a node which is a descendant of a node with evidence, and diagnostic reasoning finds the belief at an ancestor of an evidence node. Intercausal reasoning propagates the effect of evidence between the parents of a common child node which has evidence. Figure 2.6:1 illustrates the three types.

Figure 2.6:1 - The three types of reasoning in BN inference (predictive, diagnostic, and intercausal). The shaded nodes are nodes with evidence. In each case we wish to find the change in belief at node Q due to evidence at node E.

Any of these three types of reasoning may be combined for more complex inference. Although BN inference is not normally subdivided along the paths from an evidence node to a query node (i.e. a node whose updated belief we wish to find), it is sometimes useful to do so. Suermondt92 does this to generate explanations of BN inference, and I will do it to study BN sensitivity. Figure 2.6:2 shows the paths of reasoning from node V to query node Q, given that some previous evidence has arrived at the network at the shaded nodes.

Figure 2.6:2 - Combination of different types of reasoning. There are only two active paths from V to Q: the path V,A,B,C,Q and the path V,D,F,Q. The first path contains diagnostic, intercausal, and predictive reasoning, while the second contains only predictive and intercausal reasoning.
The shaded nodes are nodes with evidence, V is a varying node (or node with new evidence), and Q is a query node.

2.6.1 Virtual Evidence

In cases where we obtain evidence for a node, but the evidence is not certain, we can make use of the concept of virtual evidence to handle it using the regular machinery of BN updating (Pearl88). For instance, suppose we wanted to process uncertain information that node A is true. We make a new node that is a child of node A and call it Av. To add this node we must supply two new probabilities: the probability that Av is TRUE given that A is TRUE, and the probability that Av is TRUE given that A is FALSE. These two probabilities measure the degree of uncertainty of the evidence. If one of them is 0, then we have the limiting case of the evidence being certain; we don't allow them both to be 0, which would be inconsistent. To do updating with the virtual evidence, Av is marked as a node with certain evidence of TRUE and regular BN updating is used to adjust the beliefs of the rest of the nodes in the network.

We may have several independent pieces of uncertain evidence for the node A, which may conflict or agree with each other. In this case we simply add a new child node, and its two probabilities, for each piece of evidence (see figure 2.7). Then regular BN updating will handle finding the belief for node A and the rest of the nodes in the network, whether the evidence items conflict with each other or support each other.

Figure 2.7 - Three items of virtual evidence for node A are represented as three child nodes Av1, Av2, and Av3, with parameters P(av1|a), P(av1|¬a), P(av2|a), P(av2|¬a), P(av3|a), and P(av3|¬a). The two parameters for each child specify the certainty of the evidence. The subnetworks C and D simply illustrate that A is part of a larger BN.

2.6.2 Inference on BN Example

We can use any one of the algorithms described earlier to find beliefs for the example BN given in the last section. Initially Jim does not know the truth value of any of the nodes in the example.
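For the node A itself, absorbing one virtual-evidence child reduces to Bayes' rule with the two NCPs of Av acting as likelihoods. A minimal sketch (the function name and numbers are illustrative, and propagating the change to the rest of the network would still require regular BN updating):

```python
# Sketch of a single virtual-evidence update at node A (section 2.6.1).

def virtual_evidence_update(p_a, lik_true, lik_false):
    """Posterior P(a | Av=TRUE) from prior P(a) and the NCPs of the
    virtual-evidence child: lik_true = P(av|+a), lik_false = P(av|~a)."""
    num = p_a * lik_true
    return num / (num + (1.0 - p_a) * lik_false)

# An uncertain report that A is true: P(av|+a) = 0.9, P(av|~a) = 0.2
print(round(virtual_evidence_update(0.5, 0.9, 0.2), 4))   # 0.8182

# lik_false = 0 is the limiting case of certain evidence
print(virtual_evidence_update(0.5, 0.9, 0.0))             # 1.0
```

Several virtual-evidence children for the same node simply apply this update repeatedly, since the items are independent given A.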
By doing probability propagation we can find his belief in any node:

P(sw) = 0.670    P(tf) = 0.160    P(gt) = 0.0695
P(mm) = 0.179    P(pc) = 0.40821    P(hm) = 0.2304

If he finds out that Tom's hardware store did well last year, then we can find his new beliefs for each node by doing evidence updating:

P(sw|sw) = 1    P(tf|sw) = 0.184    P(gt|sw) = 0.0777
P(mm|sw) = 0.183    P(pc|sw) = 0.409    P(hm|sw) = 0.2302

If he reads in the newspaper that Molly was elected mayor (and believes it with certainty), then once again we can find his new beliefs for each node by doing evidence updating:

P(sw|sw,mm) = 1    P(tf|sw,mm) = 0.286    P(gt|sw,mm) = 0.0778
P(mm|sw,mm) = 1    P(pc|sw,mm) = 0.630    P(hm|sw,mm) = 0.188

It may seem strange that finding out Tom's hardware store did well would change Jim's belief in something so distantly related as whether Hank was going to move or not (even though it only changed from 0.2304 to 0.2302). Whenever there is an active path from one node of a BN to another, evidence at one of the nodes can change the belief at the other node. In a very large BN, it is quite reasonable for every node to be connected to every other node, with many of the connections being active paths. Knowledge of different subject areas may be connected along their boundaries. In the example BN, the links between the nodes were quite natural, and it would be natural for there to be a link from "Tom's hardware store did well" to a node called "Tom opens grocery store" to "Tom eats more lettuce" to, etc., providing nodes ever more weakly connected to "Hank moves"; yet if Jim received evidence for any one of these nodes, his belief in all of them will change somewhat.
The point is that the beliefs at very distant nodes will change almost imperceptibly, and if we have a way of measuring what that change will be, or of guaranteeing that it is less than some bound, we may want to forgo finding new beliefs for those distant nodes, thereby greatly reducing the computational burden.

3 Connection and Link Strengths

3.1 Connection Strength Definition

Given two nodes, A and B, in a Bayesian network, the connection strength (CS) from node A to node B is defined as the difference in the resulting belief at node B between the case where A receives evidence TRUE and the case where A receives evidence FALSE. Formally:

CS (A,B) = d (P(b|+a), P(b|¬a))    3.1:1

where d is a distance measure to determine the amount of change or degree of difference between two probabilities, and in later proofs we assume it meets the following criteria:

1. Zero: d (x, x) = 0    3.1:2
2. Symmetry: d (x, y) = d (y, x)
3. Triangle inequality: d (x, z) ≤ d (x, y) + d (y, z)
4. Continuity: d (x, y) is continuous with respect to x and y.
5. Monotonicity: for any given x, d (x, y) increases monotonically as y moves away from x. Formally: x ≤ y ≤ z implies d (x, y) ≤ d (x, z) and d (y, z) ≤ d (x, z).
6. Inevitability: d (x, y) = d (1 − x, 1 − y)
7. Convexity: d (ax + b(1−x), cx + d(1−x)) ≤ max (d (a,c), d (b,d)) for all a, b, c, d, x ∈ [0,1].

The first three conditions are the well known Hausdorff postulates of many distance semi-metrics, and are the requirements of a probability metric as defined in Zolotarev83. There has been some research on probability metrics (see Zolotarev83), but it appears to be mostly concerned with distributions defined over continuous variables (and much of it hasn't been translated from Russian). Specific distance measures will be discussed later.

CS(A,B) was defined as the change in belief at B as the belief in A switched from true to false, due to evidence at A. But what if the belief at A changed from somewhere between certainly true and certainly false to somewhere else between true and false?
How much would the belief at B change? If the change in belief at A was due to evidence at nodes that aren't independent of B given A, then we can't say how much the belief at B would change. But if it occurs at nodes that are independent of B given A, we can guarantee that the change in belief at B will be less than (or equal to) CS(A,B) as defined by equation 3.1:1. An example of evidence that is independent of B given A is virtual evidence for the node A. So as various items of virtual evidence arrive for node A, the belief in A will vary between true and false, and this will cause the belief in B to vary as well, but the maximum variation at B will always be less than (or equal to) CS(A,B). Formalizing this results in two equations equivalent to definition 3.1:1.

Theorem 3.1: Equation 3.1:1 defining connection strength is equivalent to:

CS (A,B) = max over av1, av2 of d (P(b|av1), P(b|av2))    3.1:3

and:

CS (A,B) = sup over av1, av2 of d (P(b|av1), P(b|av1, av2))    3.1:4

where av1 is some virtual (or nonexistent) evidence for A, av2 is other consistent evidence (possibly virtual) for A, and "sup" means the least upper bound. The equivalence of 3.1:1, 3.1:3, and 3.1:4 is proved in Appendix C.

These equations allow a slightly different defining statement for connection strength. CS(A,B) is a measure of the most the belief in B could be changed by receiving new (possibly virtual) evidence at A, whatever that evidence is, and whatever other (possibly virtual) evidence has already been received at A.

3.2 ΔP Connection Strength

To completely define connection strength we must define a probability distance measure d.
One possibility which satisfies the distance measure requirements (3.1:2) is simply the difference function:

dp (P1, P2) = | P1 − P2 |    3.2:1

This definition is a degenerate case (applied to a binary variable, rather than a multistate or continuous variable) of what is called the Kolmogorov metric, also known as the uniform metric or variation norm (see Zolotarev83).

Connection strength defined using this distance measure for probabilities will be called ΔP connection strength, and denoted CSp.

3.2.1 Single Link Example

First we consider a very simple BN consisting of only two binary nodes and a single link between them:

Figure 3.2 - Simple BN example: a node A with prior P(a), and a child node B with NCPs P(b|+a) and P(b|¬a).

Three numerical parameters are needed to define this BN: P(a), P(b|+a) and P(b|¬a). As an example we take P(a) = 0.3, P(b|+a) = 0.5 and P(b|¬a) = 0.75.

If evidence TRUE is observed at A, then the belief at B (by this we mean the probability that B is TRUE given the evidence) will be P(b|+a), and if evidence FALSE is observed then it will be P(b|¬a). Using equation 3.1:4 for connection strength, and the distance measure of equation 3.2:1, for this simple BN we obtain:

CSp (A,B) = | P(b|+a) − P(b|¬a) |    3.2:2

This is a measure of "the most node A can affect the belief at node B". As a visualization aid one can imagine a situation in which a steady stream of virtual evidence is arriving for A, but no evidence arrives for B. As each piece of evidence arrives, our belief in A will change, and as a consequence our belief in B will also change. In general, the belief at B will rise and fall, but will be restricted within a certain envelope. CSp measures the maximum width of this envelope for any series of virtual evidence items at A. For the example parameters, CSp (A,B) = | 0.75 − 0.5 | = 0.25, so we would say the connection strength from A to B is 0.25.

3.2.2 Range of ΔP Connection Strengths

Consider a BN in which P(b|+a) = P(b|¬a).
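The single-link envelope can be checked numerically. A minimal sketch (function names are illustrative):

```python
# Sketch of the single-link example of figure 3.2: delta-P connection
# strength (equation 3.2:2) is the width of the envelope of possible
# beliefs at B as virtual evidence streams into A.

def cs_p(p_b_given_a, p_b_given_not_a):
    """Delta-P connection strength for a single link A -> B."""
    return abs(p_b_given_a - p_b_given_not_a)

def belief_b(p_a, p_b_given_a, p_b_given_not_a):
    """P(b) for any current belief P(a); always lies inside the envelope."""
    return p_a * p_b_given_a + (1.0 - p_a) * p_b_given_not_a

print(cs_p(0.5, 0.75))              # 0.25, as in the text
# As the belief in A sweeps from 0 to 1, P(b) stays within [0.5, 0.75]
print(belief_b(0.3, 0.5, 0.75))     # 0.675
```

Note that when the two conditional probabilities are equal, cs_p returns 0, the degenerate case where A and B are independent.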
In that case, whether evidence of TRUE, evidence of FALSE, or no evidence is observed at A, the belief at B remains constant at P(b|+a) = P(b|¬a) = P(b), providing no evidence is observed for B. Also, whatever evidence is observed at B, providing there is no evidence observed for A, the belief at A will remain constant at P(a). So B is actually independent of A, and normally the BN would be drawn without a link connecting the two nodes. The connection strength in this case is CSp(A,B) = 0. ΔP connection strength between two nodes is zero if and only if the two nodes are independent of each other.

Consider a BN in which P(b|+a) = 1 and P(b|¬a) = 0. Then observing evidence TRUE at A results in a belief of 1 at B (i.e., the knowledge that B is TRUE). Observing evidence FALSE at A results in the knowledge that B is FALSE. In this case the connection strength is 1. ΔP connection strength is 1 if and only if direct evidence for the first node deterministically defines the value of the second.

ΔP connection strength varies from 0 for the weakest "links" (actually independence) to 1 for the strongest links (deterministic dependence):

0 ≤ CSp (X, Y) ≤ 1    3.2:3

3.3 ΔO Connection Strength

We have defined a connection strength based on the absolute difference of probabilities, d (P1, P2) = | P1 − P2 |. But sometimes an absolute difference of probabilities does not capture what we want as a measure of the "distance" between two probabilities. For example, suppose we are calculating a probability approximately, and we want to measure how close our estimate, P*, is to the value calculated exactly, P. We will call the distance between these two probabilities "the error" of the estimate, e = d (P*, P). We may want to specify an upper bound em on this error, and look for an algorithm which is guaranteed to calculate the estimate with an error less than this bound.
This is an application of connection strength that we will be considering.

We can use the absolute difference function as a distance measure for calculating an error in probabilities, but sometimes this will lead to unsatisfactory results. For example, suppose an approximate algorithm estimated the probability of rain tomorrow as 0.61, when an exact algorithm with the same information would have calculated the probability to be 0.60. We would likely consider the approximate algorithm to have performed well, and the action that we take based on its estimate will not be misguided. However, if the approximate algorithm estimated the probability of severe flood each month in some area as 0.0001, when the exact algorithm yielded 0.01, we would say the approximate algorithm failed, and the action we took based on its result (building a house with an expected lifetime of less than 10 years) was misguided. In both cases the absolute difference between the estimate and the exact value is about 0.01, but that difference is more significant in the case of the smaller probabilities. It seems that what we need for this problem is a relative measure of error.

One might imagine using a "percent difference" distance measure, such as d (P1, P2) = | P1 − P2 | / P1. Then the error of the approximate algorithm estimating the probability of rain would be only 1.7%, whereas that of the flood probability would be 99%, which nicely distinguishes between the two cases. However, it suffers from a couple of problems. The proposition whose probability we are estimating may just as well have been defined in the logical inverse, Q' = ¬Q, so that the probability associated with it becomes one minus what it would otherwise be, P(Q') = 1 − P(Q). Therefore, the distance measure should treat probabilities which are close to 1 in the same way that it treats probabilities close to 0, i.e. d (P1, P2) = d (1 − P1, 1 − P2).
For example, the approximate algorithm estimated the probability of no flood at 0.9999, while the exact algorithm yielded 0.9900, which is a percent difference of only 1%.

One way to deal with this would be to use the percent difference in P for probabilities less than 0.5, and the percent difference in 1 − P for probabilities greater than 0.5. However, this is somewhat messy, especially when measuring differences between probabilities which straddle 0.5. A more elegant solution is to use the percent difference of odds ratios (there are also other important reasons to use odds ratios which will be discussed later). The odds ratio of proposition A is defined as the probability that A is TRUE divided by the probability that A is FALSE:

O (a) = P(a) / P(¬a) = P(a) / (1 − P(a))    3.3:1
O (b|a) = P(b|a) / P(¬b|a) = P(b|a) / (1 − P(b|a))

Odds ratios apply only to propositions, but occasionally we will loosely refer to the odds ratio of a probability P, by which we mean the quantity P / (1 − P).

As two probabilities, P1 and P2, approach 0, the percent difference in the odds ratios of P1 and P2 approaches the percent difference of P1 and P2, and as P1 and P2 approach 1, the percent difference in the odds ratios of P1 and P2 approaches the percent difference of 1 − P1 and 1 − P2. Percent difference of odds ratios satisfies d (P1, P2) = d (1 − P1, 1 − P2), and is numerically close to percent difference of probabilities for small probabilities.

We need to make another refinement to our relative distance measure. Instead of a percent difference we use a factor difference, which is the ratio of the two quantities. For example, a 5% difference corresponds to a factor difference of 1.05. The advantage of this is symmetry. If quantity X1 changes by 5% in going to X2, it doesn't change by −5% going from X2 to X1 (it changes by 1/1.05 − 1 = −4.76%), while if X1 changes by a factor of 1.05 in going to X2, then X2 changes by a factor of 1/1.05 in going to X1.
To define a distance measure we are interested only in the magnitude of the change in going from one quantity to the other, not its direction, so if a factor difference is less than 1, we use its inverse instead. We denote the factor difference between P1 and P2 as ‖P1/P2‖, where the double vertical bars are similar to the vertical bars of absolute value (since ‖x‖ = max (x, 1/x) is to multiplication what |x| = max (x, −x) is to addition).

This gives us a proposed relative distance measure for two probabilities P1 and P2 as:

dc (P1, P2) = ‖ (P1 / (1 − P1)) / (P2 / (1 − P2)) ‖ = ‖ P1 (1 − P2) / (P2 (1 − P1)) ‖    3.3:2

Finally, we take the logarithm of dc to provide a true distance measure, which we will denote as do. do and dc values can always be recovered from each other, and do satisfies 3.1:2, so it is a suitable distance measure (dc does not satisfy the zero condition or the triangle inequality condition). do turns out to be the absolute difference of log odds ratios:

do (P1, P2) = | log dc (P1, P2) | = | log (P1 / (1 − P1)) − log (P2 / (1 − P2)) |    3.3:3

Absolute difference of log odds ratio has been in widespread use for a variety of purposes. The reason for the long discussion in getting here was to show some of the reasons it is better than some of the alternatives.

Any base of logarithm may be used, as long as the usage is consistent. This thesis assumes that natural logarithms are used, but makes a note on how an equation or result should be modified to accommodate some other logarithm base in the few cases where this is necessary. To get a feel for natural logarithm do values: when they are small, they are close to percent values. This is because for small x, log_e (1 + x) ≈ x. Thus, a do of 0.05 corresponds to approximately a 5.1% difference in odds ratio.
If the probabilities are small, this in turn corresponds to approximately a 5.1% difference in probability.

If one of the probabilities P1 or P2 is 0 or 1, the denominator of a fraction in 3.3:3 will become 0, or we will end up with log 0, which is also undefined. Throughout this thesis I use arithmetic defined on the real numbers augmented by infinity, i.e. the set ℜ ∪ {∞}. The following rules are observed:

x / 0 = ∞, for all x > 0,
x / ∞ = 0, for all x ≠ ∞,
0 / 0, ∞ / ∞, ∞ − ∞, and ∞ · 0 are undefined,
and in general f(∞) = lim_{x→∞} f(x)

Furthermore, the d_o measure between equal probabilities is defined to always be 0, even if the probabilities are both 0 or both 1. With these refinements we can express d_o as:

d_o(P1, P2) =  0                                              if P1 = P2
               ∞                                              if P1 = 0, P2 = 1 or P1 = 1, P2 = 0
               |log (P1/(1 − P1)) − log (P2/(1 − P2))|        otherwise        3.3:4

The definition of d_o also satisfies the requirements of a probability metric as described at the beginning of section 3.2. Connection strength defined using the d_o distance measure for probabilities (i.e. the absolute difference of log odds ratios) will be called ΔO connection strength, and denoted CS_o.

Single Link Example

We return to the example of figure 3.2. In section 3.2 we found CS_p(A, B) = 0.25. By equations 3.3:4 and 3.1:4 we find:

CS_o(A, B) = |log (P(b|a) / (1 − P(b|a))) − log (P(b|¬a) / (1 − P(b|¬a)))|        3.3:5
           = |log (0.75 / 0.25) − log (0.5 / 0.5)| = log 3 ≈ 1.10

Range of ΔO Connection Strengths

ΔO connection strength varies from 0 for the weakest "links" (actually independence) to ∞ for the strongest links (deterministic dependence):

0 ≤ CS_o(X, Y) ≤ ∞        3.3:6

Note that if CS_o(X, Y) = ∞, Y may be deterministic for only one value of X (such as in the case: given that X is true, Y is true, but given X is false, Y is uncertain).
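To make the definition concrete, here is a minimal Python sketch of d_o following the piecewise definition of equation 3.3:4, including the infinity conventions above (the helper names `odds` and `d_o` are mine, not the thesis's):

```python
import math

def odds(p):
    # Odds ratio of a probability: O(p) = p / (1 - p), with 1/0 = infinity
    # as in the extended arithmetic described above.
    return math.inf if p == 1.0 else p / (1.0 - p)

def d_o(p1, p2):
    # Absolute difference of log odds ratios (equation 3.3:4).
    if p1 == p2:
        return 0.0                      # equal probabilities, even both 0 or both 1
    if p1 in (0.0, 1.0) or p2 in (0.0, 1.0):
        return math.inf                 # one logit is infinite, the other is not
    return abs(math.log(odds(p1)) - math.log(odds(p2)))

# Single-link example: P(b|a) = 0.75, P(b|-a) = 0.5 gives CS_o(A,B) = log 3.
print(d_o(0.75, 0.5))        # 1.0986...
# Small d_o values behave like percent differences of odds ratio:
print(math.exp(0.05))        # 1.0512..., so d_o = 0.05 is about a 5.1% odds change
```

Note that the symmetry property d(P1, P2) = d(1 − P1, 1 − P2) from the discussion above holds by construction, since negating both propositions just flips the sign of both logits.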
This differs from ΔP connection strength, where a strength of 1 indicates complete deterministic dependency.

Although log odds ratio is a monotonic function of probability, ΔO connection strength is not a monotonic function of ΔP connection strength. Two nodes with a weak ΔP connection strength (CS_p close to 0) may have a strong ΔO connection strength (CS_o very large). More precisely, for any two nodes X and Y, in any BN:

2 log ((1 + CS_p(X, Y)) / (1 − CS_p(X, Y))) ≤ CS_o(X, Y) ≤ ∞        3.3:7

where the CS_o values can be anywhere in the range, depending on the particular BN. For small values of CS_p we can make the approximation:

CS_o ≥ 4 CS_p        3.3:8

which actually holds for all values of CS_p, but doesn't provide a tight bound when CS_p is large. Conversely:

0 ≤ CS_p(X, Y) ≤ (e^(CS_o(X, Y)/2) − 1) / (e^(CS_o(X, Y)/2) + 1)        3.3:9

If CS_o was defined using a logarithm of some base other than e, then in the equation above, e should be substituted with the actual base used.

3.4 An Alternate Definition of Connection Strength

The connection strength from node A to node B could have been defined as: the maximum change in belief at B as we go from a state of no evidence at A to a state with some evidence at A, that is:

CS′(A, B) = max (d(P(b), P(b|+a)), d(P(b), P(b|¬a)))        3.4:1

which is equivalent to:

CS′(A, B) = max_{a_v1} d(P(b), P(b|a_v1))        3.4:2

where a_v1 is virtual evidence for A.

This quantity can be bounded above and below using the original definition of connection strength, and the original definition is more workable in most situations, so it is preferred.

Theorem 3.4: For any two propositional variables A and B, and for any evidence e (or no evidence e), connection strength defined by equation 3.4:1 can be bounded above and below as:

(1/2) CS(A, B|e) ≤ CS′(A, B|e) ≤ CS(A, B|e)        3.4:3

This is proved in Appendix C.

3.5 Conditional Strength and Potential Strength

In a situation where a stream of evidence arrives for a BN (items of evidence arrive sequentially through time), it is sometimes useful to consider connection strength dynamically.
The conditional connection strength CS(A, B|e) is a measure of the maximum effect on node B of evidence at A, given the (possibly virtual) evidence e that has already arrived at other nodes of the network, and no other evidence. More formally:

CS(A, B|e) = d(P(b|+a, e), P(b|¬a, e))        3.5:1

CS(A, B|e) = sup_{a_v1, a_v2} d(P(b|a_v1, e), P(b|a_v1, a_v2, e))        3.5:2

where e is the evidence seen thus far and, as with the CS definition, a_v1 is some (possibly virtual) evidence for A, a_v2 is some other consistent evidence (possibly virtual) for A, and d is a distance measure satisfying 3.1:2. Equation 3.5:2 is equivalent to 3.5:1 (the proof is similar to that of theorem 3.1).

We define potential CS as the maximum value conditional CS may take upon receiving further evidence consistent with the evidence already received, and denote it as CSM:

CSM(A, B|e) = max_{e+ ~ e} CS(A, B|e, e+)        3.5:4

where e is the evidence seen thus far and e+ is evidence consistent with e (the symbol ~ means "consistent with").

Conditional strength may increase or decrease upon gathering more evidence, but potential strength always remains constant or decreases. At all points in time conditional CS is always less than or equal to potential CS. The following are some easily proved relationships, which are true for all nodes A and B, and consistent evidence e and e+, in any BN:

CS(A, B|e) ≤ CSM(A, B|e) ≤ CSM(A, B)        3.5:5
CSM(A, B|e, e+) ≤ CSM(A, B|e)        3.5:6
max_e CS(A, B|e) = max_e CSM(A, B|e) = CSM(A, B)        3.5:7
CSM(A, B|e) = sup_{a_v1, a_v2, e+ ~ e} d(P(b|a_v1, e, e+), P(b|a_v1, a_v2, e, e+))        3.5:8

3.6 Connection Strength in Complex BNs

I gave an example of calculating connection strength in a simple two-node BN, but how do we find the connection strength between the nodes A and B in a complex BN, where there are multiple paths between A and B, and each path consists of multiple links?
Connection strength may be computed from equation 3.5:1 by the methods of BN inference:

CS(A, B|e) = d(P(b|+a, e), P(b|¬a, e))

We instantiate the BN with evidence e and +a or ¬a, then use the techniques of BN updating discussed in section 2.6 to find the posterior probabilities P(b|+a, e) and P(b|¬a, e).

Since BN inference is NP-hard (Cooper90), calculating CS in this manner can be quite expensive. Alternately, we can find bounds on CS quickly by local calculations using the concept of link strength. This will be explored in Chapter 5.

3.7 Commutativity of Connection Strength

Given a value for the ΔP connection strength from node A to node B, can we use only this information to determine what the connection strength from node B to node A is? It turns out that the ΔP CS in one direction does not even constrain the ΔP CS in the reverse direction (unless it is 0 or 1, in which case the CS in the reverse direction will be the same). More precisely, for any value of CS_p(A, B) in (0,1), and any value of CS_p(B, A) in (0,1), there is some BN model with these connection strengths.

However, the ΔO connection strength in one direction always equals the strength in the reverse direction. In other words, ΔO connection strength is commutative:

CS_o(A, B) = CS_o(B, A)        3.7:2

This means that for any two nodes A and B, in any BN, the maximum amount that evidence at node B can affect the belief at node A is the same as the maximum amount that evidence at node A can affect the belief at node B, provided "effect" is measured by d_o or d_c. This is a very useful result, and later we will see how it gives CS_o and CS_c significant advantages over CS_p. Conditional and potential connection strength based on odds ratio is also commutative:

CS_o(A, B|e) = CS_o(B, A|e)        3.7:3
CSM_o(A, B|e) = CSM_o(B, A|e)        3.7:4

As further evidence is obtained for the network, CS_o(B, A|e) may change, but CS_o(A, B|e) will also change so that they will remain equal.
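The commutativity claim 3.7:2 can be spot-checked numerically on random two-node networks, computing CS_o in one direction from the NCP and in the other direction from the Bayes-rule posteriors (a minimal sketch; variable names are mine):

```python
import math
import random

def d_o(p1, p2):
    # Absolute difference of log odds ratios (equation 3.3:4), finite cases only.
    logit = lambda p: math.log(p / (1.0 - p))
    return abs(logit(p1) - logit(p2))

random.seed(0)
for _ in range(1000):
    pa = random.uniform(0.01, 0.99)        # P(+a)
    pb_a = random.uniform(0.01, 0.99)      # P(+b|+a)
    pb_na = random.uniform(0.01, 0.99)     # P(+b|-a)
    # Posteriors of A given B, by Bayes rule.
    pa_b = pa * pb_a / (pa * pb_a + (1 - pa) * pb_na)
    pa_nb = pa * (1 - pb_a) / (pa * (1 - pb_a) + (1 - pa) * (1 - pb_na))
    # CS_o(A,B) = d_o(P(b|+a), P(b|-a));  CS_o(B,A) = d_o(P(a|+b), P(a|-b)).
    assert abs(d_o(pb_a, pb_na) - d_o(pa_b, pa_nb)) < 1e-9
print("CS_o(A,B) = CS_o(B,A) held on 1000 random two-node BNs")
```

Note that the prior P(+a) drops out of the comparison, just as the common factors cancel in the proof that follows.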
In extreme cases the new evidence may d-separate A and B (so CS_o goes from some number to 0), or it may connect A and B with an active path when they were d-separated (so it goes from 0 to some larger number, possibly even infinity). In any case CS_o will remain commutative.

Proof of 3.7:3 and 3.7:2: By definition of CS (equations 3.5:1 and 3.3:3):

CS_o(A, B|e) = |log (O(+b|+a, e) / O(+b|¬a, e))|

By the definition of odds ratio (equation 3.3:1):

CS_o(A, B|e) = |log (P(+b|+a, e) P(¬b|¬a, e)) / (P(¬b|+a, e) P(+b|¬a, e))|

By 4 applications of Bayes rule (equation 2.3:2):

CS_o(A, B|e) = |log [P(+a|+b, e) P(+b|e) / P(+a|e)] [P(¬a|¬b, e) P(¬b|e) / P(¬a|e)]
                 / ([P(+a|¬b, e) P(¬b|e) / P(+a|e)] [P(¬a|+b, e) P(+b|e) / P(¬a|e)])|

Canceling common factors:

CS_o(A, B|e) = |log (P(+a|+b, e) P(¬a|¬b, e)) / (P(+a|¬b, e) P(¬a|+b, e))|

By the definition of odds ratio (equation 3.3:1):

CS_o(A, B|e) = |log (O(+a|+b, e) / O(+a|¬b, e))|

By definition of CS (equation 3.5:1):

CS_o(A, B|e) = CS_o(B, A|e)

Which proves 3.7:3. Note that when CS_o(A, B|e) is infinity (using infinity as described in section 3.3), the result still holds. There is a problem when P(+a|e) = 0 or P(¬a|e) = 0 in the second step of the proof, since then these factors cannot be canceled from the numerator and denominator. However, in that case we simply define the CS_o values in both directions as 0, since the node A is independent of all other nodes in the network (in one direction this follows from d_o(0, 0) = 0, and in the other it is equivalent to saying P(b|+a) = P(b|¬a) when one of them is undefined).

Equation 3.7:2 follows from the simple case when the evidence e is absent or irrelevant.
■

Proof of 3.7:4: By equation 3.7:3, for any e+ ~ e:

CS_o(A, B|e, e+) = CS_o(B, A|e, e+)

So:

max_{e+ ~ e} CS_o(A, B|e, e+) = max_{e+ ~ e} CS_o(B, A|e, e+)

But by equation 3.5:4:

max_{e+ ~ e} CS_o(A, B|e, e+) = CSM_o(A, B|e)

So:

CSM_o(A, B|e) = CSM_o(B, A|e) ■

3.8 Link Strength Definition

For each link in a BN we can supply a single number, called the link strength (LS), which is a measure of the maximum amount that evidence at the parent node of the link can affect the belief at the child, given that all the other parents of the child have some evidence. So the link strength of A→B is defined as:

LS(A→B) = max_{c ∈ Π(C(B)−{A})} CS(A, B|c)        3.8:1

where C(B) is the set of parents of B, and Π denotes the set of configurations of those nodes. The strength of the A→B link in figure 3.8:1 is the maximum value of connection strength from A to B as all the other parents C1, C2, ..., Cn take on different evidence. Computing the LS from equation 3.8:1 can be done completely locally; it requires knowledge only of the NCP at B.

Figure 3.8:1 - These nodes may be embedded in a larger BN. The only nodes shown are B and all its parents, since these are the nodes required to define the link strength of the link from A to B.

Given its parents, B is independent of the rest of its ancestors, so equation 3.8:1 is equivalent to:

LS(A→B) = max_{a ∈ Π(C*(B)−{A})} CS(A, B|a)        3.8:2

where C*(B) is the set of all ancestors of B.

We can still speak of an A→B link strength even if there is no link from A to B (that is, A is not a parent of B). If A is a descendent of B, then adding a link from A to B would create a cycle, and LS(A→B) is undefined. But if A is not a descendent of B, and is not a parent of B, it will be independent of B given B's parents, so by equation 3.8:1, LS(A→B) is zero.
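Equation 3.8:1 translates directly into a local computation over the child's NCP: maximize d_o over the value of A while every other parent configuration is held fixed. A minimal sketch (the helper names and NCP encoding are mine; the numbers are the NCP of X3 from figure 4.4:2, with parents X1 and X2):

```python
import math
from itertools import product

def d_o(p1, p2):
    logit = lambda p: math.log(p / (1.0 - p))
    return abs(logit(p1) - logit(p2))

def link_strength(ncp, parents, a):
    # LS(A->B) per equation 3.8:1: maximize d_o(P(b|+a,c), P(b|-a,c)) over
    # configurations c of B's other parents. `ncp` maps a tuple of parent
    # truth values (in `parents` order) to P(+b | that configuration).
    others = [p for p in parents if p != a]
    best = 0.0
    for conf in product([True, False], repeat=len(others)):
        def prob(a_val):
            vals = dict(zip(others, conf))
            vals[a] = a_val
            return ncp[tuple(vals[p] for p in parents)]
        best = max(best, d_o(prob(True), prob(False)))
    return best

# NCP of X3 with parents (X1, X2), from figure 4.4:2.
ncp_x3 = {(True, True): 0.3, (True, False): 0.05,
          (False, True): 0.7, (False, False): 0.2}
print(link_strength(ncp_x3, ["X1", "X2"], "X1"))   # about 1.69
print(link_strength(ncp_x3, ["X1", "X2"], "X2"))   # about 2.23
```

The cost is linear in the number of NCP entries, in line with the locality claim of section 3.9.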
In this case we sometimes say there is a null link from A to B, although of course it wouldn't count as a link to the d-separation algorithm, and normally we wouldn't draw it on a BN diagram.

The range of LS values is the same as that of CS, so LS_p values vary from 0 for null links to 1 for the strongest links (potential deterministic dependence), and LS_o values vary from 0 for null links to infinity for the strongest links:

0 ≤ LS_p ≤ 1        0 ≤ LS_o ≤ ∞        3.8:3

Recalling the algorithm described in section 2.4 used to construct the dag of a BN, we can view it slightly differently now. We are given a FJD and we wish to construct a BN to represent it. First we choose a total ordering for the nodes. Using equation 3.8:1 on the FJD and the ordering, we can find the LS between any pair of nodes. We put a link between those nodes (in the direction from the node earlier in the ordering to the later one) if the LS is greater than zero. After doing this for every pair of nodes we have the completed dag.

3.9 Comparing Link and Connection Strengths

A fundamental difference between CS and LS is that CS is defined with respect to a FJD, whereas LS is defined with respect to a FJD and an ordering on the nodes. For a given probabilistic specification, the CS between two nodes will be the same regardless of what BN is used to represent that specification. But LS values between two nodes will generally be different depending on the particular BN used to represent the FJD.

Figure 3.9:1 shows an example of the invariance of CS, and the dependence of LS, on the particular BN ordering. Both BNs represent the same FJD, but the total order used in (a) is A, B, C while the order in (b) is A, C, B. The ΔO link strengths, which are written along the links, are different between (a) and (b), but the connection strengths, which appear to the right, are the same for (a) and (b).

Notice that link B→C has reversed in going from (a) to (b) and its link strength remains the same.
If any BN is modified by reversing a link, say B→C, the only link strengths that will change are those of links going to B or to C, and if the ΔO measure for link strength is used, the strength of the link being reversed won't change.

Figure 3.9:1 - These two BNs represent the same FJD. Link strengths are different in each, but connection strengths are the same. The link strengths (LS_o) are displayed alongside the links, and the connection strengths (CS_o) are to the right. The total order in (a) is A, B, C, while in (b) the B→C link has been reversed, so it is A, C, B.
(a): P(a) = 0.5; P(b|+a) = 0.7, P(b|¬a) = 0.2; P(c|+a+b) = 0.5, P(c|+a¬b) = 0.8, P(c|¬a+b) = 0.3, P(c|¬a¬b) = 0.1. Link strengths: LS(A→B) = 2.233, LS(A→C) = 3.584, LS(B→C) = 1.386.
(b): P(a) = 0.5; P(c|+a) = 0.59, P(c|¬a) = 0.14; P(b|+a+c) = 0.5932, P(b|+a¬c) = 0.8537, P(b|¬a+c) = 0.4286, P(b|¬a¬c) = 0.1628.
Both: CS(A, B) = 2.233, CS(B, C) = 0.714, CS(A, C) = 2.179.

Another fundamental difference between CS and LS is that LS is a local measure which can be computed very quickly, while CS is a global measure which may be very difficult to compute. Consider the BN in figure 3.9:2.

Figure 3.9:2 - A BN to illustrate the global nature of CS and the local nature of LS.

Calculating an exact value for CS(A, B) involves all the nodes in the network, since evidence at A can change the belief at B by the active path through Z. However, calculating LS(A→B) involves only the nodes A, B, and C, since taking the maximum with C having evidence blocks this active path. In fact, it really only involves the NCP at B.

For any BN, calculating a precise value for the unconditional CS between two nodes involves all their ancestors, and calculating a precise value for the conditional CS between two nodes involves all their ancestors and all the ancestors of every node with evidence. Also, in general this calculation is of exponential complexity.
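Such a brute-force computation is feasible for tiny networks. The following sketch enumerates the joint distribution of BN (a) from figure 3.9:1 and confirms the connection strengths listed there (function names are mine; CS(B, C) comes out to about 0.713, matching the figure's 0.714 up to rounding):

```python
import math
from itertools import product

def d_o(p1, p2):
    logit = lambda p: math.log(p / (1.0 - p))
    return abs(logit(p1) - logit(p2))

# FJD of figure 3.9:1(a): links A->B, A->C, B->C.
P_a = 0.5
P_b = {True: 0.7, False: 0.2}                      # P(+b | a)
P_c = {(True, True): 0.5, (True, False): 0.8,      # P(+c | a, b)
       (False, True): 0.3, (False, False): 0.1}

def joint(a, b, c):
    pa = P_a if a else 1 - P_a
    pb = P_b[a] if b else 1 - P_b[a]
    pc = P_c[(a, b)] if c else 1 - P_c[(a, b)]
    return pa * pb * pc

def cond(query, given):
    # P(query node is TRUE | given), by brute-force enumeration of the joint.
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in given.items()):
            w = joint(a, b, c)
            den += w
            if world[query]:
                num += w
    return num / den

cs_ab = d_o(cond("b", {"a": True}), cond("b", {"a": False}))
cs_bc = d_o(cond("c", {"b": True}), cond("c", {"b": False}))
cs_ac = d_o(cond("c", {"a": True}), cond("c", {"a": False}))
print(cs_ab, cs_bc, cs_ac)   # about 2.233, 0.713, 2.179
```

Running the same computation on the FJD factored in the (b) ordering would give the same three CS values, illustrating the ordering-invariance of CS.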
Calculating the LS between two nodes involves only the NCP of the child node, and the complexity is linear in the number of NCP probabilities (or better than this, depending on how the NCP probabilities are represented).

4 Using Link Strength to Bound Connection Strength

4.1 ΔP Serial Combination

Link strengths may be combined by local operations to provide bounds on the connection strength of widely separated nodes. This allows for fast algorithms to find a limit on the degree to which evidence at one node can affect the belief at another (although in some cases the bound is not very tight). For a simple example, consider the following BN with two links in series:

Figure 4.1 - A three node BN, A → B → C, in which we want to find the CS from A to C given the link strengths of A→B and B→C. Its parameters are P(a), P(b|a), and P(c|b).

We wish to find a bound on the ΔP connection strength from A to C given only the ΔP strengths of the two links A→B and B→C. In this case we can actually find an exact relation (the proof is in Appendix C):

Theorem 4.1: For any three propositional variables A, B, and C, where A and C are independent given B:

CS_p(A, C) = CS_p(A, B) CS_p(B, C)        4.1

If node A is the only parent of B, and B is the only parent of C (as in figure 4.1), then the definition of link strength matches that of connection strength and we obtain:

CS_p(A, C) = LS_p(A→B) LS_p(B→C)

This rule was reported in Henrion89 in a different form. It also corresponds to the "shrinkage factor" of Markov net analysis (Howard71, p. 20). It can be chained for longer paths as long as the arrows all point in the same direction. Since each ΔP link strength is less than one, the effect of evidence can only be attenuated (or unchanged) by intermediate links; it can never be amplified.

4.2 ΔO Serial Combination

Consider the BN in figure 4.1 once again. This time we want to find a bound on the ΔO connection strength from A to C, given only information on the ΔO strengths from A to B, and from B to C.
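Theorem 4.1 can be verified numerically by marginalizing B out of random chains; for a two-link chain the ΔP product rule is exact, not just a bound (a quick check; variable names are mine):

```python
import random

random.seed(1)
for _ in range(1000):
    pb_a, pb_na = random.random(), random.random()   # P(+b|+a), P(+b|-a)
    pc_b, pc_nb = random.random(), random.random()   # P(+c|+b), P(+c|-b)
    # Marginalize B out of the chain A -> B -> C.
    pc_a = pb_a * pc_b + (1 - pb_a) * pc_nb          # P(+c|+a)
    pc_na = pb_na * pc_b + (1 - pb_na) * pc_nb       # P(+c|-a)
    # CS_p(A,C) equals the product of the two link strengths (equation 4.1).
    assert abs(abs(pc_a - pc_na) -
               abs(pb_a - pb_na) * abs(pc_b - pc_nb)) < 1e-12
print("CS_p(A,C) = CS_p(A,B) * CS_p(B,C) held on 1000 random chains")
```

The identity falls out of the algebra: P(+c|+a) − P(+c|¬a) factors as (P(+b|+a) − P(+b|¬a))(P(+c|+b) − P(+c|¬b)).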
Below is the tightest bound that can be placed with only this information (see Appendix C for the derivation):

Theorem 4.2: For any three propositional variables A, B, and C, where A and C are independent given B:

tanh((1/4) CS_o(A, C)) ≤ tanh((1/4) CS_o(A, B)) tanh((1/4) CS_o(B, C))        4.2:1

If node A is the only parent of B, and B is the only parent of C (as in figure 4.1), then the definition of link strength matches that of connection strength and we obtain:

tanh((1/4) CS_o(A, C)) ≤ tanh((1/4) LS_o(A→B)) tanh((1/4) LS_o(B→C))

where tanh is the hyperbolic tangent function:

tanh(x) = (e^x − e^−x) / (e^x + e^−x)        4.2:2

When x ≥ 0, which is the domain of ΔO connection and link strengths, tanh(x) stays within the range [0, 1], and increases strictly monotonically as x increases (see figure 4.2:1). If the logarithms used to define CS_o and LS_o are not natural logarithms, the multiplying constant 1/4 in the equation above changes to another constant dependent on the logarithm base, but otherwise the equation remains the same.

Figure 4.2:1 - Graph of tanh(x/4). It starts out quite linear and then asymptotically approaches 1.

First, consider a case in which LS_o(B→C) = 0 in the BN of figure 4.1, so B is independent of C. Then, by equation 4.2:1, tanh((1/4) CS_o(A, C)) = 0, which implies CS_o(A, C) = 0, so C is independent of A. Second, consider a case in which LS_o(B→C) = ∞, so B and C are deterministically related. Then tanh((1/4) LS_o(B→C)) = 1, and so CS_o(A, C) ≤ LS_o(A→B). In general, when links A→B and B→C are in series, we can consider the link B→C as attenuating the effect of the link A→B on C. The degree of attenuation is always between that of the first case (independence leading to complete attenuation) and that of the second case (determinism leading to no attenuation).

Equation 4.2:1 can be chained for longer paths, resulting in a product with one term for each link.
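The ΔO serial bound of equation 4.2:1 can be exercised on random chains, in the spirit of the simulation described below (a sketch; names are mine):

```python
import math
import random

def d_o(p1, p2):
    logit = lambda p: math.log(p / (1.0 - p))
    return abs(logit(p1) - logit(p2))

random.seed(2)
worst_slack = 0.0
for _ in range(10000):
    pb_a = random.uniform(0.01, 0.99)   # P(+b|+a)
    pb_na = random.uniform(0.01, 0.99)  # P(+b|-a)
    pc_b = random.uniform(0.01, 0.99)   # P(+c|+b)
    pc_nb = random.uniform(0.01, 0.99)  # P(+c|-b)
    ls_ab = d_o(pb_a, pb_na)            # LS_o(A->B)
    ls_bc = d_o(pc_b, pc_nb)            # LS_o(B->C)
    pc_a = pb_a * pc_b + (1 - pb_a) * pc_nb
    pc_na = pb_na * pc_b + (1 - pb_na) * pc_nb
    cs_ac = d_o(pc_a, pc_na)            # true CS_o(A,C)
    rhs = math.tanh(ls_ab / 4) * math.tanh(ls_bc / 4)
    assert math.tanh(cs_ac / 4) <= rhs + 1e-9        # equation 4.2:1
    worst_slack = max(worst_slack, 4 * math.atanh(rhs) - cs_ac)
print("bound held in all cases; worst slack =", worst_slack)
```

The slack statistic plays the role of the (CS Bound) − (Actual CS) quantity histogrammed in figures 4.2:2 and 4.2:3.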
Since tanh(x) stays within the range [0, 1], the effect of evidence can only be attenuated (or unchanged) by intermediate links.

For small x, tanh(x) ≈ x, and for all x ≥ 0, tanh(x) ≤ x, so from equation 4.2:1 we can form the following bound:

CS_o(A, C) ≤ (1/4) LS_o(A→B) LS_o(B→C)        4.2:3

which is valid for all values of LS_o, but only forms a reasonably tight bound when both LS_o values are small (say LS_o ≤ 2). Using equation 3.3:8, we can see that the equation above, and therefore the ΔO serial link strength equation (4.2:1), approaches the ΔP serial link strength equation (4.1) when the links are weak (small LS_o values).

Some bounds provided in the field of computer science are notorious for being so far above values actually obtained in practice that the bound is almost useless. A simulation was done in which 200000 BNs of the topology in figure 4.1 were generated, with NCP values drawn from a uniform distribution on [0,1]. In each case the bound on CS(A,C) given by equation 4.2:1 was compared to the true value of CS(A,C), and frequency histograms of the results appear in figures 4.2:2 and 4.2:3. Of course, in some applications the actual distribution of NCPs may be very different, which could lead to very different results on the tightness of the bound, but at least we can see that for this example application, the bound for CS given by equation 4.2:1 is usually not far above the true value of CS.

Figure 4.2:2 - Frequency histogram of the amount that the bound calculated from LS(A→B) and LS(B→C) exceeded the actual value of CS(A,C) in 200000 cases of random BNs with the structure of figure 4.1 and uniformly generated NCPs. The horizontal axis is (CS Bound) − (Actual CS). Notice the point on the vertical axis at a height above 5, and the point to the left of the axis at a height of zero.
The same graph is drawn with different scales below.

Figure 4.2:3 - Expanded scales of the graph in figure 4.2:2. The horizontal axis is (CS Bound) − (Actual CS).

4.3 Fundamental Sensitivity Equation

Now we wish to find a method we can use to find bounds on connection strength in a BN of arbitrary complexity. Sometimes we can break a probability problem down into two simpler problems by assuming that some proposition is true, solving the problem, then assuming it is false, and re-solving the problem. Then we say the solution for the case where we don't know the value of the proposition is somewhere between the solutions for the proposition being true and being false.

For example, this method will work to find posterior probabilities. The posterior probability for a proposition Q in the case that we don't know the value of proposition Z will be between the values it would have if Z were true or Z were false. In fact, the weighted average of these two bounds (weighted by the probabilities of Z being true or false) is actually the probability of Q for the case where we don't know Z, which is the basis for the reasoning-by-assumptions method described in section 2.6. So we always have the following two relations:

P(q|+v, ¬z) ≤ P(q|+v) ≤ P(q|+v, +z)   or   P(q|+v, +z) ≤ P(q|+v) ≤ P(q|+v, ¬z)        4.3:1
P(q|¬v, ¬z) ≤ P(q|¬v) ≤ P(q|¬v, +z)   or   P(q|¬v, +z) ≤ P(q|¬v) ≤ P(q|¬v, ¬z)        4.3:2

We might be tempted to think connection strength will behave in the same way, since it is just the distance between the two posterior probabilities bounded in the two expressions above. That is, since 4.3:1 and 4.3:2 hold, we might expect the following to hold:

d(P(q|+v), P(q|¬v)) ≤ d(P(q|+v, +z), P(q|¬v, +z))        [false]        4.3:3
d(P(q|+v), P(q|¬v)) ≤ d(P(q|+v, ¬z), P(q|¬v, ¬z))        [false]        4.3:4

or equivalently:

d(P(q|+v), P(q|¬v)) ≤ max_z d(P(q|+v, z), P(q|¬v, z)),
i.e.  CS(V, Q) ≤ max_z CS(V, Q|z)        [false]        4.3:5

But these equations do not hold.
The reason for following this line of thought is to warn that the connection strength for the case when a proposition Z is unknown is not bounded by the strengths for the cases when Z is true and when Z is false, as it is for a number of other probabilistic quantities (like belief, odds ratio belief, expected utility, most probable explanation, etc.). A trivial example to illustrate is shown in figure 4.3:1. When Z is unknown, CS(V,Q) may be nonzero, but when Z is known, whether it is true or false, it d-separates V from Q, so CS(V,Q|+z) and CS(V,Q|¬z) are zero.

Figure 4.3:1 - CS(V,Q) may be nonzero, but CS(V,Q|+z) and CS(V,Q|¬z) are both zero, so they don't provide bounds for CS(V,Q).

However, there is an equation similar to equation 4.3:5 that we can use to decompose connection strength problems.

Theorem 4.3: For any three propositional variables V, Q, and Z, we can decompose the connection strength from V to Q on cases of Z as follows:

CS(V, Q) ≤ max_z CS(V, Q|z) +
           CS(V, Z) * min(CS(Z, Q|+v), CS(Z, Q|¬v), CS(Z, Q))        4.3:6

where * is a generalized multiplication corresponding to the serial combination rule. If the ΔP measure is used for connection strength, it is regular multiplication, but for the ΔO measure it is the ΔO serial combination rule:

x * y = 4 tanh⁻¹(tanh(x/4) tanh(y/4))        4.3:8

Since all connection strength problems can be decomposed using equation 4.3:6, it is called the fundamental sensitivity equation in this thesis. It is proved in Appendix C. Notice that it is similar to the more traditional style of decomposition equation given by 4.3:5, but with the addition of an extra term.

To help visualize the fundamental equation, consider the BN of figure 4.3:2. Varying evidence at node V will create a varying belief at the query node Q. We are interested in decomposing the effect on Q by cases of Z. We can consider equation 4.3:6 as composed of two parts, each corresponding to one of the active paths from V to Q.
The first part is max_z CS(V,Q|z), and this corresponds to the link from V to Q, since Z is given and therefore the other path is blocked. The second part is CS(V,Z) * CS(Z,Q), which corresponds to the serial connection of the two links V→Z and Z→Q. Together they provide a limit on how much the node V can affect the belief at Q.

Figure 4.3:2 - A BN to help visualize the terms of the fundamental equation. This equation can be considered to consist of a term for the V→Q path in parallel with a serial combination of terms for V→Z and Z→Q.

The fundamental equation applies to any 3 variables of any BN. They may be connected like the BN in figure 4.3:2, or they may be scattered through a large network with no links between them.

4.4 Example of Finding CS by Fundamental Equation

Suppose we have the BN of figure 4.4:1 and we wish to find the connection strength from node X1 to node X3.

Figure 4.4:1 - A BN for which we wish to find the CS from node X1 to X3.

The fundamental equation implies both of the following two equations, which we will use to solve this problem:

CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * CS(Z, Q)        4.4:1
CS(V, Q) ≤ max_z CS(V, Q|z) + CS(V, Z) * max_v CS(Z, Q|v)        4.4:2

Using equation 4.4:2 with X1 as V, X3 as Q, and X0 as Z:

CS(X1, X3) ≤ max_{x0} CS(X1, X3|x0) + CS(X1, X0) * max_{x1} CS(X0, X3|x1)        4.4:4

Now we find bounds on the three terms of the equation above, one by one, by using the fundamental equation repeatedly. For the first term:

max_{x0} CS(X1, X3|x0) ≤ max_{x0,x2} CS(X1, X3|x0,x2) +
                         max_{x0} CS(X1, X2|x0) * max_{x0,x1} CS(X2, X3|x0,x1)        4.4:5

The first term in the equation above (4.4:5) matches the definition of link strength for the X1→X3 link:

LS(X1→X3) = max_{c ∈ Π(C(X3)−{X1})} CS(X1, X3|c) = max_{x0,x2} CS(X1, X3|x0,x2)        4.4:6

The other terms in equation 4.4:5 are also link strengths, so we may rewrite equation 4.4:5 as:

max_{x0} CS(X1, X3|x0) ≤ LS(X1→X3) + LS(X1→X2) * LS(X2→X3)        4.4:7

The second term of equation 4.4:4 is CS(X1, X0).
If we are using the ΔO measure of CS, then this is equivalent to CS(X0, X1), which is the link strength from X0 to X1. For the last term of equation 4.4:4 we get:

max_{x1} CS(X0, X3|x1) ≤ max_{x1,x2} CS(X0, X3|x1,x2) +
                         max_{x1} CS(X0, X2|x1) * max_{x0,x1} CS(X2, X3|x0,x1)        4.4:8

Each of the terms in the above equation is a link strength, so we may write it as:

max_{x1} CS(X0, X3|x1) ≤ LS(X0→X3) + LS(X0→X2) * LS(X2→X3)        4.4:9

Combining all of this, we finally get our bound for CS(X1, X3):

CS(X1, X3) ≤ LS(X1→X3) +
             LS(X1→X2) * LS(X2→X3) +
             LS(X0→X1) * (LS(X0→X3) + LS(X0→X2) * LS(X2→X3))        4.4:10

By examining the BN in figure 4.4:1, we can see how the equation above can be considered as having terms for all of the paths from X1 to X3. The first line corresponds to the link straight from X1 to X3. The second line corresponds to the path from X1, through X2, to X3 (i.e. the X1→X2 and X2→X3 links in series). The last line corresponds to the paths from X1 through X0, then either straight to X3, or from X0 to X2 and then to X3.

Notice that X4 was not involved at all in finding CS(X1, X3). For any BN, finding CS(Xi, Xj), where i and j indicate the position of the nodes in the total order, does not involve any nodes Xk where k>i and k>j, since there are no active paths from Xi to Xj through Xk (they all have at least one converging node).

Because of this, equation 4.4:10 can be used to find the connection strength from node 1 to node 3 in any BN. The nodes that come after node 3 are irrelevant. Equation 4.4:10 was developed for a fully connected network (i.e. every two nodes are connected by a link). If we wish to use it for a network that is not fully connected, then we just use a link strength of 0 between nodes with no link connecting them, as described in section 3.9.

For example, we can use equation 4.4:10 to find a bound on CS(X1,X3) for the BN in figure 4.4:2.
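That computation can be scripted directly from the NCPs of figure 4.4:2 (e.g. P(x1|+x0) = 0.45, P(x1|¬x0) = 0.64, P(x2|+x0) = 0.92, P(x2|¬x0) = 0.30). A sketch, with the ΔO serial combination implemented as in equation 4.3:8 (function names are mine):

```python
import math

def d_o(p1, p2):
    logit = lambda p: math.log(p / (1.0 - p))
    return abs(logit(p1) - logit(p2))

def star(x, y):
    # AO serial combination (equation 4.3:8); associative, so chains fold left.
    return 4 * math.atanh(math.tanh(x / 4) * math.tanh(y / 4))

ls_x0_x1 = d_o(0.45, 0.64)                       # LS_o(X0->X1), about 0.776
ls_x0_x2 = d_o(0.92, 0.30)                       # LS_o(X0->X2), about 3.29
ls_x1_x3 = max(d_o(0.3, 0.7), d_o(0.05, 0.2))    # LS_o(X1->X3), about 1.69
ls_x2_x3 = max(d_o(0.3, 0.05), d_o(0.7, 0.2))    # LS_o(X2->X3), about 2.23

# CS(X1,X3) <= LS(X1->X3) + LS(X0->X1) * LS(X0->X2) * LS(X2->X3),
# where * is the AO serial combination.
bound = ls_x1_x3 + star(star(ls_x0_x1, ls_x0_x2), ls_x2_x3)
print(bound)   # about 1.96 (the hand calculation below gets 1.95 from rounded intermediates)
```

Note how star(x, 0) = 0 (an independent link blocks everything) and star(x, ∞) = x (a deterministic link passes strength through unattenuated), matching the attenuation discussion of section 4.2.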
Setting LS(X1→X2) = 0 and LS(X0→X3) = 0, we obtain:

CS(X1, X3) ≤ LS(X1→X3) + LS(X0→X1) * LS(X0→X2) * LS(X2→X3)        4.4:11

This consists of two parallel paths, one straight from X1 to X3, and the other consisting of 3 links in series: X1 to X0, then to X2, and finally to X3. The path from X1 through X4 to X3 does not appear because it is blocked by the converging node X4.

Figure 4.4:2 - A BN like that in figure 4.4:1, but with fewer links. We can use the equation developed for bounding CS(X1,X3) of the BN in 4.4:1 to bound CS(X1,X3) of this BN. Its NCPs include: P(x0) = 0.3; P(x3|+x1+x2) = 0.3, P(x3|+x1¬x2) = 0.05, P(x3|¬x1+x2) = 0.7, P(x3|¬x1¬x2) = 0.2; P(x4|+x1+x3) = 0.9, P(x4|+x1¬x3) = 0.1, P(x4|¬x1+x3) = 0.2, P(x4|¬x1¬x3) = 0.8.

Using the numbers provided we obtain:

CS_o(X1, X0) = LS_o(X0→X1) = d_o(P(x1|+x0), P(x1|¬x0)) = d_o(0.45, 0.64) = 0.776
LS_o(X0→X2) = d_o(P(x2|+x0), P(x2|¬x0)) = d_o(0.92, 0.30) = 3.29
LS_o(X1→X3) = max (d_o(P(x3|+x1+x2), P(x3|¬x1+x2)), d_o(P(x3|+x1¬x2), P(x3|¬x1¬x2)))
            = max (d_o(0.3, 0.7), d_o(0.05, 0.2)) = 1.69
LS_o(X2→X3) = max (d_o(P(x3|+x1+x2), P(x3|+x1¬x2)), d_o(P(x3|¬x1+x2), P(x3|¬x1¬x2)))
            = max (d_o(0.3, 0.05), d_o(0.7, 0.2)) = 2.23

CS_o(X1, X3) ≤ LS_o(X1→X3) + LS_o(X0→X1) * LS_o(X0→X2) * LS_o(X2→X3)
             = 1.69 + 0.776 * 3.29 * 2.23
             = 1.69 + 4 tanh⁻¹(tanh(0.776/4) tanh(3.29/4) tanh(2.23/4))
             = 1.95

So the bound we calculate is CS_o(X1, X3) ≤ 1.95. In actual fact, CS_o(X1, X3) = 1.543.

We can repeat the calculation to find a bound on ΔP connection strength, but the calculation of the "backwards" connection strength CS_p(X1, X0) is more difficult than in the ΔO case, since there it was the same as CS_o(X0, X1), which is LS_o(X0→X1). Actually, we can easily find CS_p(X1, X0) using Bayes rule, but in the general case finding "backwards" CS_p link strength involves nonlocal calculations.

4.5 Path Based Methods

Most researchers have been reluctant to attach formal significance to the paths of a BN.
It seems that on an intuitive level they make use of ideas like "effect" or "influence" "flowing" along the paths of a BN, and will speak of "weak paths", "strong links", etc., but do not formally define these concepts.

One reason for this may be that paths are not intrinsic to the probabilistic model (the FJD), but are an artifact of the BN factoring process. So if a BN had been constructed with a different total ordering for the nodes, its path structures could have turned out completely different. The same effect is created by the operation of link reversal, which does not change the FJD represented by a BN, but can add or remove links, thereby changing paths.

Another reason for avoiding paths could be due to concerns that they lead to an over-simplistic view of the BN. There is a tendency for beginners to think of BNs as constraint networks, with the belief of a child node given as a function of the beliefs of its parents. Or, if they are a bit more sophisticated, they may think the belief in a node can be given as a function of the beliefs in the nodes of its Markov boundary (i.e. its parents, children, and parents of its children).

Actually, to express the belief in a node as a function, it must be expressed as a function of the joint beliefs of its Markov boundary nodes (i.e. the beliefs in the Cartesian product of their values). Thinking in terms of paths can obscure this. For example, consider the BN of figure 4.5:1. When there is no evidence, the beliefs at A, B, C, and D are all 1/2. If we get evidence TRUE for A, the beliefs at B and C remain at 1/2, but the belief at D changes to 3/4. Thinking in terms of a constraint network, or "flow of influence along paths", it is hard to see how a change at A can create a change at D without changing the beliefs at B or C. Of course, it is the joint belief in B and C which has changed (BEL(+b+c) changes from 1/4 to 3/8, BEL(+b¬c) changes from 1/4 to 1/8, etc.).
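The figure 4.5:1 behavior can be confirmed by enumerating the joint distribution (a sketch using the figure's NCPs; function names are mine):

```python
from itertools import product

# NCPs of figure 4.5:1: A and B are fair coins, C depends on A and B,
# and D is TRUE exactly when B and C agree.
P_root = {"a": 0.5, "b": 0.5}
P_c = {(True, True): 0.75, (True, False): 0.25,    # P(+c | a, b)
       (False, True): 0.25, (False, False): 0.75}
P_d = {(True, True): 1.0, (True, False): 0.0,      # P(+d | b, c)
       (False, True): 0.0, (False, False): 1.0}

def joint(a, b, c, d):
    pa = P_root["a"] if a else 1 - P_root["a"]
    pb = P_root["b"] if b else 1 - P_root["b"]
    pc = P_c[(a, b)] if c else 1 - P_c[(a, b)]
    pd = P_d[(b, c)] if d else 1 - P_d[(b, c)]
    return pa * pb * pc * pd

def belief(node, a_val):
    # P(node is TRUE | A = a_val), by brute-force enumeration.
    num = den = 0.0
    for a, b, c, d in product([True, False], repeat=4):
        if a != a_val:
            continue
        w = joint(a, b, c, d)
        den += w
        if {"b": b, "c": c, "d": d}[node]:
            num += w
    return num / den

print(belief("b", True), belief("c", True))   # both 0.5: B and C are unmoved
print(belief("d", True), belief("d", False))  # 0.75 and 0.25: D shifts
```

Evidence at A leaves the marginal beliefs at B and C untouched while moving D, because only the joint belief over (B, C) changes.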
So we must be careful with the path concept.

Figure 4.5:1 - Evidence at A changes the belief at D, but not at B or C. The NCPs are: P(a) = 1/2; P(b) = 1/2; P(c|+a+b) = 3/4, P(c|+a¬b) = 1/4, P(c|¬a+b) = 1/4, P(c|¬a¬b) = 3/4; P(d|+b+c) = 1, P(d|+b¬c) = 0, P(d|¬b+c) = 0, P(d|¬b¬c) = 1. With no evidence, P(b) = P(c) = P(d) = 1/2, and P(b|+a) = P(b|¬a) = 1/2, P(c|+a) = P(c|¬a) = 1/2, but P(d|+a) = 3/4 and P(d|¬a) = 1/4.

There are only a few examples of previous research that make extensive use of paths in BNs. The most obvious example is the d-separation algorithm itself. It finds independence information by tracing active and blocked paths. Since connection strength can be considered the "degree of independence", it is not surprising that bounds for it can be found using a path-based method as well.

Suermondt92 uses paths to generate explanations of BN inference for a human user. He considers some paths more significant than others if breaking a link along one of them results in a significant change in the inference result. That way he can prioritize "chains of reasoning" in presenting the explanation. Intuitively he considers something "flowing" along paths, although he is wary of going too far with this line of thought, as is evident from the quote, "Such an image, in which probabilistic updates are treated analogously to electrical currents, is simplistic, and is invalid in many cases; we cannot predict in a definitive manner the combined effects of evidence transmission along multiple chains by analyzing the chains separately, since there are often unpredictable synergistic effects." He doesn't arrive at any of the central results of this thesis, since he measures changes in belief due to breaking a link by doing full Bayesian inference with the link in place, then again with the link broken, and then compares the beliefs produced in each case (similar to the method mentioned in section 3.6, but breaking links instead of instantiating nodes).
This can be very expensive when there are many links to try breaking, and worse when one considers combinations of links. Of course his method is much more accurate than the bounds calculated in this thesis, but the methods of this thesis could be used as a pre-screening phase to eliminate obviously weak paths from consideration (since the main purpose is to leave out very weak paths from the explanation).

Wellman90 uses the paths in a BN to do qualitative probabilistic inference. The purpose is to answer the question, "if the belief in this node increases (say through virtual evidence), will the belief in this other node increase or decrease?" He traces "influence" along paths to arrive at the answer. This is discussed in greater detail in section 6.1.

The CS Path Algorithm

We can simplify using the fundamental equation to find connection strength bounds, by using it to develop a path-based algorithm. The resulting formulation is also intuitively more appealing.

We will find an expression for a bound on the AO connection strength from node Xv to node Xq, that is CS₀(Xv,Xq), for a fully connected BN. The expression will be entirely in terms of link strengths. Later the solution for any BN can be generated simply by setting the link strengths of missing links to 0 in that expression.

We start by providing a total ordering for the nodes of the BN consistent with its dag, and we label the nodes X0, X1, ..., Xn, where a lower index corresponds to earlier in the total order. Since we are assuming the BN is fully connected, the parents of node Xi will be {Xj | 0 ≤ j < i}. To find CS₀(Xv,Xq) we first apply the fundamental equation (version 4.4:1) to it, with Z = X0, and obtain:

$$CS(X_v, X_q) \;\le\; CS(X_v, X_0) \ast CS(X_0, X_q) \;+\; \max_{x_0}\, CS(X_v, X_q \mid x_0)$$

Now we apply version 4.4:1 of the fundamental equation again, this time with Z = X1, to the last term of the above equation. Then we repeat the process until it has been done with Z = Xj, for j = 0 to j = v−1.
The result is the expansion shown below. Each line corresponds to one application of the fundamental equation (used with Z = Xj, j having the value shown in the rightmost column) to the last term of the line above it. The resulting bound is the sum of products in the left hand column.

$$
\begin{aligned}
CS(X_v,X_q)\ \le\ \ & CS(X_v,X_0)\ast CS(X_0,X_q)\ + && j=0\\
& \max_{x_0} CS(X_v,X_1\mid x)\ast\max_{x_0} CS(X_1,X_q\mid x)\ + && j=1\\
& \max_{x=(x_0,x_1)} CS(X_v,X_2\mid x)\ast\max_{x=(x_0,x_1)} CS(X_2,X_q\mid x)\ + && j=2\\
& \qquad\vdots &&\\
& \max_{x=(x_0,\dots,x_{v-2})} CS(X_v,X_{v-1}\mid x)\ast\max_{x=(x_0,\dots,x_{v-2})} CS(X_{v-1},X_q\mid x)\ + && j=v-1\\
& \max_{x=(x_0,\dots,x_{v-1})} CS(X_v,X_q\mid x) &&
\end{aligned}
$$

We can express the above sum of products as:

$$CS(X_v,X_q)\ \le\ \sum_{j=0}^{v-1}\ \max_{x=(x_0,\dots,x_{j-1})} CS(X_v,X_j\mid x)\ \ast \max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_q\mid x)\ +\ \max_{x=(x_0,\dots,x_{v-1})} CS(X_v,X_q\mid x) \qquad (4.5{:}1)$$

Since this derivation is for AO connection strength, which is commutative, we can reverse the order of the CS arguments in the first factor of each line. Also, we can include the last line in the sum by increasing the range of its index, so we get:

$$CS(X_v,X_q)\ \le\ \sum_{j=0}^{v}\ \max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_v\mid x)\ \ast \max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_q\mid x) \qquad (4.5{:}2)$$

The CS expressions in the above equation are all of the form max over x=(x0,...,xj−1) of CS(Xj,Xq|x), some with j or q replaced by v. We can generate an expansion to solve for these expressions in the same way that we did for CS(Xv,Xq).
It appears below, with each line formed by expanding the last term of the line above it using version 4.4:2 of the fundamental equation (with Z = Xk, k having the value shown in the rightmost column), to form the sum of products shown in the left hand column.

$$
\begin{aligned}
\max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_q\mid x)\ \le\ \ & \max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_{j+1}\mid x)\ast\max_{x=(x_0,\dots,x_j)} CS(X_{j+1},X_q\mid x)\ + && k=j+1\\
& \max_{x=(x_0,\dots,x_{j-1},x_{j+1})} CS(X_j,X_{j+2}\mid x)\ast\max_{x=(x_0,\dots,x_{j+1})} CS(X_{j+2},X_q\mid x)\ + && k=j+2\\
& \qquad\vdots &&\\
& \max_{x=(x_0,\dots,x_{j-1},x_{j+1},\dots,x_{q-2})} CS(X_j,X_{q-1}\mid x)\ast\max_{x=(x_0,\dots,x_{q-2})} CS(X_{q-1},X_q\mid x)\ + && k=q-1\\
& \max_{x=(x_0,\dots,x_{j-1},x_{j+1},\dots,x_{q-1})} CS(X_j,X_q\mid x) &&
\end{aligned}
$$

We can express the above sum of products as:

$$\max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_q\mid x)\ \le\ \sum_{k=j+1}^{q-1}\ \max_{x=(x_0,\dots,x_{j-1},x_{j+1},\dots,x_{k-1})} CS(X_j,X_k\mid x)\ \ast \max_{x=(x_0,\dots,x_{k-1})} CS(X_k,X_q\mid x)\ +\ \max_{x=(x_0,\dots,x_{j-1},x_{j+1},\dots,x_{q-1})} CS(X_j,X_q\mid x)$$

Once again we must evaluate the two factors in the sum of products above. The first factor of the product matches the link strength definition. The second factor is of the same form as the expression being bounded, but with the j index at least one larger, so it may be bounded recursively using the same equation. Folding the last term into the sum, we obtain the following recursive equation:

$$\max_{x=(x_0,\dots,x_{j-1})} CS(X_j,X_q\mid x)\ \le\ \begin{cases} \displaystyle\sum_{k=j+1}^{q} LS(X_j{\to}X_k)\ \ast \max_{x=(x_0,\dots,x_{k-1})} CS(X_k,X_q\mid x) & j<q\\[1ex] \infty & j=q \end{cases} \qquad (4.5{:}3)$$

This equation can be used to bound each of the CS expressions of equation 4.5:2.

Now we modify equations 4.5:2 and 4.5:3 to suit the case of a network that is not fully connected, by setting the appropriate link strengths to zero. The range of the summation is reduced to only include nonzero terms.
Also, we use a notation in which C+(X) are the ancestors of X, C*(X) are the ancestors of X together with the node X itself, and S(X) are the successors (children) of X; writing x ∈ ΠC+(Xj) means that x ranges over the Cartesian product of the values of Xj's ancestors.

$$CS(X_v,X_q)\ \le \sum_{X_j\,\in\, C^*(X_v)\cap C^+(X_q)} \max_{x\,\in\,\Pi C^+(X_j)} CS(X_j,X_v\mid x)\ \ast \max_{x\,\in\,\Pi C^+(X_j)} CS(X_j,X_q\mid x) \qquad (4.5{:}4)$$

$$\max_{x\,\in\,\Pi C^+(X_j)} CS(X_j,X_q\mid x)\ \le\ \begin{cases} \displaystyle\sum_{X_k\,\in\, S(X_j)\cap C^*(X_q)} LS(X_j{\to}X_k)\ \ast \max_{x\,\in\,\Pi C^+(X_k)} CS(X_k,X_q\mid x) & X_j\in C^+(X_q)\\[1ex] \infty & X_j=X_q \end{cases} \qquad (4.5{:}5)$$

The expression max over x ∈ ΠC+(Xj) of CS(Xj,Xq|x) appears repeatedly in the above equations, so we define a quantity termed "the strength of all forward paths from Xj to Xq", and denote it with the letter F, as follows:

$$F(X_j,X_q)\ =\ \max_{x\,\in\,\Pi C^+(X_j)} CS(X_j,X_q\mid x) \qquad (4.5{:}6)$$

We substitute this into our two CS bounding equations. Also, since they no longer rely on the total order, and we don't need integer indexes for the nodes, we can rename all the nodes from Xa style to A style for notational aesthetics. Keeping in mind Q ∉ C*(V), we have:

$$CS(V,Q)\ \le \sum_{J\,\in\, C^*(V)\cap C^+(Q)} F(J,V)\ast F(J,Q) \qquad Q\notin C^*(V) \qquad (4.5{:}7)$$

$$F(X,Y)\ \le\ \begin{cases} \displaystyle\sum_{K\,\in\, S(X)\cap C^*(Y)} LS(X{\to}K)\ast F(K,Y) & X\ne Y\\[1ex] \infty & X=Y \end{cases} \qquad (4.5{:}8)$$

With the two equations above we can now bound the connection strength between any two nodes of any BN, using only link strength values. The following written description may help to make the above equations more intuitive.

Specification 4.5: A bound on the AO connection strength from node V to node Q, when Q is not an ancestor of V, is given by:

The strength of all forward paths from V to Q, plus the sum over every node, J, which is an ancestor of both V and Q, of the strength of all backwards paths from V to J, multiplied by the strength of all forward paths from J to Q.

The "strength of all forward paths from X to Y" is bounded by the sum over all X's children, K, which are ancestors of Y (or Y itself), of: the strength of the link from X to K, multiplied by the strength of all forward paths from K to Y.

The "strength of all backward paths from V to J" is the same as the "strength of all forward paths from J to V".
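As an illustration of how equations 4.5:7 and 4.5:8 might be realized, here is a minimal memoized sketch. The network and its link strengths are made up for this example, and the code assumes the tanh-based serial combination operator of section 4.2 (under which two links of strength 2 chain to about 0.87, and ∞ acts as the identity, matching F(X,X) = ∞):

```python
import math

def serial(x, y):
    """Generalized multiplication (serial combination) on the 0-to-infinity
    AO scale: tanh(z/4) = tanh(x/4) * tanh(y/4).  Infinity is its identity
    element, which is why F(X, X) = infinity below."""
    t = math.tanh(x / 4) * math.tanh(y / 4)
    return math.inf if t >= 1.0 else 4 * math.atanh(t)

# Hypothetical BN: LS[parent][child] = link strength.
# Links: A -> B,  B -> C,  B -> D,  C -> D.
LS = {"A": {"B": 2.0}, "B": {"C": 2.0, "D": 1.0}, "C": {"D": 2.0}}

parents = {}
for p, kids in LS.items():
    for k in kids:
        parents.setdefault(k, set()).add(p)

def ancestors(x):
    """C+(x): the strict ancestors of node x."""
    out = set()
    for p in parents.get(x, set()):
        out |= {p} | ancestors(p)
    return out

def F(x, y, memo=None):
    """Equation 4.5:8 -- strength of all forward paths from x to y."""
    if memo is None:
        memo = {}
    if x == y:
        return math.inf
    if (x, y) not in memo:
        memo[(x, y)] = sum(serial(ls, F(k, y, memo))
                           for k, ls in LS.get(x, {}).items()
                           if k == y or k in ancestors(y))   # S(x) and C*(y)
    return memo[(x, y)]

def cs_bound(v, q):
    """Equation 4.5:7 -- bound on CS(v, q), for q not an ancestor of v."""
    return sum(serial(F(j, v), F(j, q))
               for j in (ancestors(v) | {v}) & ancestors(q))  # C*(v) and C+(q)

# Two serial links attenuate, and the parallel B-to-D paths add before
# combining with LS(A->B).
print(cs_bound("A", "C"))   # about 0.87, i.e. 4*atanh(tanh(0.5)**2)
print(cs_bound("A", "D"))
```

Because each F value is cached, each relevant link contributes only one serial combination, which is the bottom-up evaluation order the complexity argument relies on.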
In the above, the term "multiply" is used in its generalized sense to mean serial combination as described in section 4.2. To find a bound on CS₀(V,Q) when Q is an ancestor of V, we use the above algorithm to find a bound on CS₀(Q,V), which is equal to CS₀(V,Q).

4.6 Complexity of Path Algorithm

Using equations 4.5:7 and 4.5:8 directly in a recursive manner to find a bound for connection strength results in an algorithm of exponential worst case complexity. However, the same values are being repeatedly calculated, and so by doing a "bottom up" evaluation instead, the complexity becomes linear in the number of links.

The bottom up algorithm to find a bound for CS(V,Q) involves calculating F(J,V) and F(J,Q) values for a number of nodes J, and storing those values with the nodes to aid in calculating further F(J,V) and F(J,Q) values. Once these values have been calculated for all the required nodes, equation 4.5:7 is used to combine them, yielding the desired bound. This yields the following algorithm, in which the descendants of a set of elements are defined as the set of all the descendants of the elements (and C* represents all ancestors, and S* all descendants):

Algorithm 4.6 - To find a bound on CS₀(V,Q):

1. If Q is an ancestor of V, switch V and Q, and find the equivalent connection strength CS(Q,V).

2. Starting with J = Q, and working J backwards through a total order on the nodes consistent with the dag, calculate F(J,Q) values for all J falling in the set S*(C*(V)) ∩ C*(Q), using equation 4.5:8.

3. Starting with J = V, and working J backwards through the total order, calculate F(J,V) values for all the nodes falling in the set S*(C*(Q)) ∩ C*(V), using equation 4.5:8.

4. Use equation 4.5:7 to sum up products of F(J,V) and F(J,Q) values, yielding the bound on CS(V,Q).

The following property holds for the above algorithm (proved further below):

Lemma 4.6: When calculating each new F(J,Q) in step 2, all the subcalculations of F(K,Q) that are required will already be calculated.
The same holds for step 3.

To determine the complexity of algorithm 4.6, a good estimate is provided by the number of generalized multiplications required. The CS(V,Q) bound is formed only by generalized multiplications and additions, and the number of additions will be slightly less than the number of multiplications, since the only additions that are required are those to add one product to another. The following theorem supplies the number of multiplications needed:

Theorem 4.6: The number of generalized multiplications required to find a bound on CS₀(V,Q) using algorithm 4.6, is the number of links between the ancestors of Q which are also descendants of ancestors of V, plus the number of links between the ancestors of V which are also descendants of ancestors of Q, that is:

Number multiplies = Number links between (S*(C*(Q)) ∩ C*(V)) + Number links between (S*(C*(V)) ∩ C*(Q))

Clearly this will be less than the number of links between ancestors of V, plus the number of links between ancestors of Q.

Depending on the representation of the BN, extra computation may be required to determine which nodes are the required descendants, ancestors, etc., but even when doing this, the overall complexity is linear in the number of links between ancestors of V and Q.

The space requirements are minimal. On top of what is required for the BN and control of the algorithm, only |S*(C*(V)) ∩ C*(Q)| + |S*(C*(Q)) ∩ C*(V)| real numbers must be stored (i.e. linear in the number of ancestor nodes of V and Q).

Proof of Lemma 4.6: Each time we use equation 4.5:8 to find a value for F(X,Y), we end up finding values for F(X,Y), and each F(K,Y) where K ∈ S(X) ∩ C*(Y) (whether or not X = Y). If we apply the equation recursively, we end up finding F(K,Y) values for all K ∈ K, where K is given by:

$$\mathbf{K} = X \,\cup\, (S(X)\cap C^*(Y)) \,\cup\, (S(S(X)\cap C^*(Y))\cap C^*(Y)) \,\cup\, \dots \qquad (4.6{:}3)$$

It is a property of ancestor/descendant relations that if a node W isn't an ancestor of another node Z (and isn't Z itself), it can't have a successor or descendant which is an ancestor of Z. That is:

$$W \notin C^*(Z) \text{ implies } S(W)\cap C^*(Z) = \emptyset \text{ and } S^*(W)\cap C^*(Z) = \emptyset \qquad (4.6{:}4)$$

Using 4.6:4, we can simplify 4.6:3 to:

$$\mathbf{K} = S^*(X)\cap C^*(Y) \qquad (4.6{:}5)$$

When we use equation 4.5:7 to find a bound on CS(V,Q), we have to find a value of F(J,Q) for all J ∈ C*(V) ∩ C+(Q). Since we will find these values using equation 4.5:8, that requires using equation 4.5:8 to find F(K,Q) for all K ∈ K_Q, where K_Q is given by 4.6:5, substituting C*(V) ∩ C+(Q) for X, and Q for Y:

$$\mathbf{K}_Q = S^*(C^*(V)\cap C^+(Q))\cap C^*(Q) \qquad (4.6{:}6)$$

We can simplify the above using 4.6:4 to get:

$$\mathbf{K}_Q = S^*(C^*(V))\cap C^*(Q) \qquad (4.6{:}7)$$

So when finding all the F(J,Q) values in step 2, all the recursive F(K,Q) values that need to be found will be in the set above. Since step 2 specifies that we find the F(J,Q) values in reverse order of J, and equation 4.5:8 finds F(J,Q) values using only F(K,Q) values for which K succeeds J, completion of step 2 will require only F(K,Q) values from the set above which have already been found.

The same type of argument, but with V substituted for Q, and Q for V, serves to show that step 3 requires finding only F(K,V) values in which K is an element of the set K_V below:

$$\mathbf{K}_V = S^*(C^*(Q))\cap C^*(V) \qquad (4.6{:}8)$$

Furthermore, step 3 also specifies that we find the F(J,V) values in reverse order of J, so by the same reasoning as the step 2 case, completion of step 3 will require only F(K,V) values from the set above which have already been found. ■

Proof of Theorem 4.6: Equation 4.5:8 is invoked to find each value of F(X,Q) where X ∈ S*(C*(V)) ∩ C*(Q), as stated by equation 4.6:7. Each time it is invoked it performs one multiply for each link leaving X and going to a node in C*(Q).
Since the node that the link goes to will be in S*(C*(V)) as well (because by definition it must go to a successor), we can say that one multiply is performed for each link leaving a node in S*(C*(V)) ∩ C*(Q) and going to a node in S*(C*(V)) ∩ C*(Q). Or, in other words, the number of multiplies required is the number of links between the nodes in S*(C*(V)) ∩ C*(Q).

Equation 4.5:8 must also be invoked to find each value of F(X,V) where X ∈ S*(C*(Q)) ∩ C*(V), as stated by equation 4.6:8. By reasoning similar to the last paragraph, we can say that the number of multiplies required to do this is the number of links between the nodes in S*(C*(Q)) ∩ C*(V).

Finally, we must add on the number of multiplies required by equation 4.5:7 in step 4. This will be |C*(V) ∩ C+(Q)|, which is one multiply for each term of the sum. However, if we don't count all those multiplications that are with a connection strength from a node to itself, done in steps 2 and 3 above (which don't really need to be done, since they yield the original number, and are present simply to terminate the recursion), then this quantity will be absorbed, leaving us with the sum of the two quantities from the two paragraphs above:

Number multiplies = Number links between (S*(C*(Q)) ∩ C*(V)) + Number links between (S*(C*(V)) ∩ C*(Q)) ■

4.7 Dealing With Evidence

We may wish to find a bound for a conditional connection strength, that is, a bound for a connection strength after the BN has received some evidence. If we are automatically incorporating the evidence into the network as it arrives using Shachter's evidence propagation, then we can simply use the methods of the previous section on the new BN. However, if we want to find a CS bound knowing the evidence is present, but not propagated, we need some new techniques.

Link strengths were defined by maximizing CS over all states of the other parents of the child in the link. If the evidence received at the BN is for one of those parents, that maximum may be reduced.
Consider the example BN in figure 4.7:1:

Figure 4.7:1 - Evidence at node E restricts the CS(A,C) bound.

Say we are finding a bound on CS(A,C) with no evidence at node E. This is provided by

$$CS(A,C)\ \le\ LS(A{\to}B)\ast LS(B{\to}C)\ =\ \max_{e}\, CS(A,B\mid e)\ast CS(B,C)$$

where * is the serial combination operator. However, if we receive evidence that E is true, then a bound for the conditional connection strength, CS(A,C|+e), is provided by:

$$CS(A,C\mid{+}e)\ \le\ LS(A{\to}B\mid{+}e)\ast LS(B{\to}C\mid{+}e)\ =\ CS(A,B\mid{+}e)\ast CS(B,C)$$

This new bound for CS will be less than or the same as the original bound, since CS(A,B|+e) ≤ max(CS(A,B|+e), CS(A,B|¬e)). Often when evidence is received it will improve CS bounds in this manner.

When the evidence is for a nonconverging node which is right on an active path, that node blocks the active path, so the CS contribution from that path drops to zero, giving a lower overall CS value, and a lower value for the CS bound.

Although receiving evidence often lowers CS values, it may increase CS values by creating new active paths through converging nodes which have received the evidence. Consider the example BN in figure 4.7:2:

Figure 4.7:2 - BN with evidence at E, for which we wish to find the CS from A to C via the intercausal path through E.

Without any evidence at E, CS(A,C) = 0, and the CS(A,C) bound calculated by active paths is also zero. But once evidence TRUE arrives for E, an active path from A to C is created. The strength of this path is d(P(c|+a+e), P(c|¬a+e)), and it is termed an intercausal link strength, since reasoning from A to C is intercausal reasoning, as defined in section 2.6. For AO connection strength it comes out to be (proved in Appendix C):

$$CS_0(A,C\mid{+}e)\ =\ LS_0(A{\to}{+}e{\leftarrow}C)\ =\ \left|\,\log\frac{P({+}e\mid{+}a{+}c)\;P({+}e\mid{\neg}a{\neg}c)}{P({+}e\mid{+}a{\neg}c)\;P({+}e\mid{\neg}a{+}c)}\,\right| \qquad (4.7{:}1)$$

where LS(A→+e←C) is a notation invented to denote an intercausal link strength from A to C when E has evidence TRUE.
It is defined by the equation above.

As an example, we can put all these techniques together to find a bound on CS(A,D|e,f) in the BN of figure 4.7:3:

Figure 4.7:3 - BN with evidence at E and F, for which we wish to find a bound on the CS from A to D.

The active paths from A to D are A,B,C,D and A,F,C,D. Each one of them forms one of the two terms in the bounding equation below:

$$CS(A,D\mid e,f)\ \le\ \big(LS(A{\to}B\mid e)\ast LS(B{\to}C)\ +\ LS(A{\to}f{\leftarrow}C)\big)\ast LS(C{\to}D\mid f)$$

5 Applications

5.1 BN Link Display

The first application of link strength that we consider is as a visualization aid for humans. BN diagrams have been praised as a great tool for people to visualize probabilistic relationships, and by marking the links with their strengths this tool may be improved, because then someone viewing the BN doesn't just get information about node independence, but also about a "degree of independence". An extremely weak link is very nearly an independence, but this information is lost if it is drawn exactly the same as the rest of the links. One possibility is to draw stronger links with thicker lines and weaker links with finer lines. Viewing a BN can be much more meaningful if one sees a skeleton of very dark lines corresponding to definitions and constraints, followed by slightly lighter lines for less certain rules, and so on, down to faint lines corresponding to very weak dependencies, and of course the absence of links indicating independencies.

When using the AO link strength measure, we must represent the 0 to ∞ scale of LS₀ with a finite width line. A mapping that is commonly used to do this for graphical display is W = x / (x + a), where W is proportional to the width of the line, and a is an adjustable scale parameter. This is approximately linear in x for x ≪ a, and approaches 1 as x approaches infinity.

Another possibility is to use W = tanh(x / a), which is also linear in x for x ≪ a, and approaches 1 as x approaches infinity (see graph of figure 4.2:1).
This mapping is recommended (with a = 4), since it corresponds closely to serial combinations of links. That way the human viewer can easily imagine the minimum attenuation of a chain of links as the product of the attenuations of each of the links. So two 50% width lines in series correspond to a 25% width line (or smaller) connection.

Unfortunately, it takes a bit more work to mentally combine parallel paths. For example, two 50% width lines would combine to form, at most, an 80% width connection. However, for the finer lines, addition of line width can be used to combine parallel paths. For example, two 25% width lines combine in parallel to form, at most, a 47% width line, which is very nearly 50%.

This measure used for the width of the line turns out to be the absolute value of the statistical measure of association known as the coefficient of colligation, or Yule's Y (not to be confused with Yule's Q, which is in more common usage). The original invention of this measure had nothing to do with the chaining between variables that we use it for here, or so it appears from reading the paper in which it was introduced. For a description of Yule's Y see Appendix B, and for the original paper, see Yule1912.

Figure 5.1 shows the same BN as figure 2.5, but with the link strengths printed beside each link, and the links drawn in different widths to show their strengths according to the formula:

$$\text{Width} = \begin{cases} 0 & LS_0 = 0\\ \max\big(0.2\,\text{mm},\ 2\,\text{mm}\times\tanh(LS_0/4)\big) & \text{otherwise} \end{cases}$$

At a glance one can get an idea of which dependencies are always of minor importance. It must be remembered that each link strength represents the amount that the parent node can affect the child, maximized over all possible beliefs for the other parents.
Furthermore, AO link strength was used, so any effects that bring a belief close to 0 or 1 are considered very significant.

Figure 5.1 - Example BN showing link strengths. This is the same BN as in figure 2.5, except the links have been drawn thicker to indicate LS as: width = (2mm) × tanh(LS/4), with a 0.2mm minimum. The LS values also appear beside the link.

5.2 Connection Strength Contours

If there is a particular node for which evidence may arrive (called the origin node), we may draw an iso-CS contour map over the BN to indicate the maximum effect which that evidence can have on each of the other nodes. Each contour represents a particular value of CS, and is drawn so as to separate nodes on its one side which are more strongly connected to the origin node (i.e.
have a greater CS), from those on its other side which are less strongly connected to the origin node (i.e. have a lesser CS). The purpose is for a human to quickly assess which nodes will be affected by the evidence, and by how much. It also helps to visualize the "neighborhood" of a node, and provide a sense of locality. Figure 5.2:1 shows an iso-CS contour map with the node SW as the origin node. Of course, the map would be different if some other node was the origin.

If AO connection strength is used to draw the contours, then the contour map may also be interpreted in another way. Instead of measuring how much evidence at the origin node affects each of the other nodes, it can be interpreted to measure how much evidence at each of the other nodes can affect the origin node. This is due to the commutativity of AO connection strength. One possible application of this is the following. We have a query node and we want to know which nodes to gather evidence at to best form a belief for the query node. Gathering evidence at a node that has very little effect on the query node is generally useless. So we can use the contours as indicators of levels of desirability for gathering evidence at each of the nodes (and trade that off with the cost of gathering evidence at that node).

Using the methods of the previous chapter, a contour map may be drawn that is based on the CS bounds calculated from link strengths, instead of the actual CS values. Such a contour map may be constructed very quickly (of complexity linear in the product of number of links and number of nodes). Figure 5.2:2 shows such a contour map calculated for the node SW, using algorithm 4.6. It is interesting to compare it with figure 5.2:1. For each node the actual CS value is less than the bound, as we would expect. For nodes close to SW the bound is very close to the actual value (equal for TW and HT), whereas for distant nodes the bound is less tight (greater by a factor of about 12 for the most distant node, HM).
It is interesting to notice that, at least for this example, even in areas where the bound is loose, the shapes of the contours for the bound are quite similar to the contours for the actual value.

This example does not show it well, but if the BN is composed of a dendritic skeleton of strong links with the "flesh" filled in by a network of weak links, then the contours will tend to follow the skeleton in a manner similar in appearance to the contours of a topographic map following the valleys of a dendritic river system, with the origin node at the point where the river meets the ocean.

Figure 5.2:1 - Example BN showing the connection strength contours for the node SW. This is the same BN as in figure 5.1, with contours drawn to show the CS between each node and the node SW.
The CS values appear below each node.

Figure 5.2:2 - The example BN of figure 5.1, with contours showing the connection strength bound for the node SW. The contours are drawn for the calculated CS bound between each node and the node SW. The CS bound values appear below each node.

5.3 Approximate Inference

Recall the "reasoning by assumptions" algorithm for BN inference, introduced in section 2.6.
It is based on instantiating some node (say node Z) to one of its values in order to simplify the network (generally to reduce the active path connectivity), doing the BN inference with the simplified network, repeating the process with Z instantiated to its other value, and then combining the two solutions obtained by taking their weighted average (weighted by the probabilities that Z would take on each of its two values).

But now suppose that node Z is distant enough from the nodes that we wish to find beliefs for (i.e. the query nodes), that instantiating Z to some value has almost no effect on their beliefs. In that case the two solutions will be almost the same, and their weighted average won't be that different from either one of them. So we could save some time by only computing one of the solutions, and recognizing that the beliefs that it provides are approximate. This suggests the following algorithm:

Algorithm 5.3:1: To compute the posterior probabilities of the nodes in the set Q given evidence e for the nodes in E, instantiate the node Z to one of its values, then use some standard BN inference algorithm to find P(qi|z,e) and consider it an approximation for P(qi|e), for all Qi ∈ Q. Node Z can be any node not in E, and will normally be chosen so that the particular BN inference algorithm to be used can find P(qi|z,e) more quickly than P(qi|e) (which is often the case if, for example, Z blocks active paths from nodes in Q to their ancestors or nodes in E).

When we use an algorithm that produces approximate results, we usually need some kind of bound on how accurate we can expect the approximation to be.
That is provided by the following theorem (which follows directly from theorem 3.4):

Theorem 5.3:1: When using algorithm 5.3:1, a bound on the error of the approximation, ε = d(P(q|z,e), P(q|e)), is:

$$\varepsilon\ \le\ CS(Z,Q\mid e)$$

for any node Q ∈ Q, where d is the distance measure used for the definition of CS.

Although instantiating one node may result in faster BN inference, usually we want to instantiate a set of nodes to block many active paths. There are two different advantages we could gain from this. We could make the network singly connected (or more singly connected), and/or we could prune off large parts of the network. For example, consider the BN in figure 5.3:

Figure 5.3 - Z1, Z2, ..., Zn block all active paths from subnetwork G to subnetwork H.

Instantiating the nodes Z1, Z2, ..., Zn cuts off the subnetwork G from Q and its subnetwork H. So a BN inference algorithm finding the belief of Q can ignore the whole subnetwork G. Clearly, this may result in a major computational saving, depending on the size of G. An approximation algorithm which uses this strategy is algorithm 5.3:2, which is essentially the same as algorithm 5.3:1, except Z is now a set of nodes.

Algorithm 5.3:2: To compute the posterior probabilities of the nodes in the set Q given evidence e for the nodes in E, instantiate all the nodes in Z to one of their values, creating the tuple of values z, then use some standard BN inference algorithm to find P(qi|z,e) and consider it an approximation for P(qi|e), for all Qi ∈ Q. The nodes in Z can be any nodes not in E, and will normally be chosen so that the particular BN inference algorithm to be used can find P(qi|z,e) more quickly than P(qi|e).

Theorem 5.3:2: When using algorithm 5.3:2, a bound on the error of the approximation:

ε = d(P(q|z1,z2,...,zn,e), P(q|e))

is given by:

ε ≤ CS(Z1, Q|e) + CS(Z2, Q|z1,e) + ...
+ CS(Zn, Q|z1,z2,...,zn−1,e)

for any node Q ∈ Q, where d is the distance measure used for the definition of CS. This theorem is proved in Appendix C.

6 Conclusion

6.1 Further Work

6.1.1 Qualitative Probabilistic Networks

Algorithm 4.6 provides a way to calculate a bound on CS values; that is, it calculates the maximum magnitude of the change of belief at one node due to evidence at another. But we may also be interested in the direction of the change; do the beliefs increase or decrease? By removing the absolute value signs in the definition of CS₀ and CSₚ we can retain the information on the direction of the change. Equations 4.5:7 and 4.5:8, used by algorithm 4.6, must be modified to handle the signs of LS and CS values separately from the magnitudes. They must produce a number of the same magnitude as they do now, but the combination of signs must be as follows: a positive plus a positive is a positive, a negative plus a negative is a negative, a positive plus a negative is an unknown sign, and an unknown sign plus anything is an unknown sign. For the serial combination rule: a positive times a positive is a positive, a positive times a negative is a negative, a negative times a negative is a positive, and an unknown times anything is an unknown.

The reason that the sign must be handled separately is that actually interval arithmetic is being performed, where the intervals are always from 0 to the positive or negative CS values.
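The sign bookkeeping described above can be sketched directly (illustrative code; the sign tokens are an encoding chosen here, not notation from the thesis):

```python
# Signs ride along with the magnitude arithmetic of equations 4.5:7 and
# 4.5:8: "+" and "-" are known directions of belief change, "?" is unknown.

def sign_add(a, b):
    """Parallel (additive) combination of signed strengths."""
    if a == "?" or b == "?":
        return "?"                 # unknown plus anything is unknown
    return a if a == b else "?"    # + plus - gives an unknown sign

def sign_mul(a, b):
    """Serial (multiplicative) combination of signed strengths."""
    if a == "?" or b == "?":
        return "?"                 # unknown times anything is unknown
    return "+" if a == b else "-"  # like signs give +, unlike give -

# Two parallel paths of opposite sign leave the net direction unknown,
# while a chain of two negative links is positive.
print(sign_add("+", "-"))   # ?
print(sign_mul("-", "-"))   # +
```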
Exploiting this, we could modify equations 4.5:7 and 4.5:8 to produce a tighter bound in cases which involve the sum of a positive CS with a negative CS, since the magnitude of their sum will be bounded by the maximum of their magnitudes, which is less than the sum of their magnitudes (which is what must be used in the absence of sign information).

Wellman90 introduces qualitative probabilistic networks, which have the same dag structure as BNs, but contain only sign information along the links instead of NCPs (and may also contain hyperedges providing the sign of synergies, etc.). Their purpose is simply to predict in which direction the belief at one node will change given evidence at another node. It appears that adding sign information to CS as described above will produce the same qualitative results as those of Wellman (while also providing magnitudes), but more work remains to account for synergies in the way Wellman does.

6.1.2 Greater Computation for Tighter Bounds

This thesis presented a way to find a bound on CS by combining quantities that can be calculated locally, at the scale of a single link. But if we separate the nodes of a BN into small disjoint groups, then the interactions between two groups may be expressed solely in terms of the links between the nodes in the Markov boundaries of the two groups. For each group, we can do full BN reasoning to find exact connection strengths between the nodes of the group, then use the methods of this thesis to find bounds on CS between the nodes in the Markov boundaries of the different groups. That way we can find a bound on the CS between any two nodes of the original BN. The larger we make the groups, the tighter the bound will be, but the longer it will take to compute (because full BN inference will be required on larger groups). If the groups are so small that they are just single nodes, the CS bounds calculated will be those of algorithm 4.6.
If a group is so large that it includes the whole BN, then the CS values will be exact. By varying the group size we obtain a continuum of CS bounding algorithms, from completely global and slow algorithms which produce exact results, to completely local and fast algorithms which produce loose bounds. Then we can adjust the scale of the group size depending on how tight a bound we require. If we get really sophisticated, we can use a range of scales to find a single connection strength, CS(V,Q), in which we use large-scale groups close to the nodes V and Q, and smaller-scale groups further away.

6.1.3 Multistate Nodes

An obvious next step would be to try to extend the results of this thesis to BNs composed of nodes that can take on more than two values. Any multistate BN can be modeled by a binary one by replacing each multistate node with a number of binary nodes, each binary node representing the proposition that the multistate node takes on one of its values. So many of the proofs in this thesis will extend to BNs composed of multistate nodes, but whether practical algorithms and reasonably tight bounds can be produced has yet to be determined. Also, it may be desirable to generalize the definition of CS for multistate nodes in some other way.

7 Bibliography

Aczel, J. (1966) Lectures on Functional Equations and Their Applications, Academic Press, New York.

Agosta, John M. (1991) "'Conditional inter-causal independent' node distributions, a property of 'noisy-or' models" in Uncertainty in Artificial Intelligence, Proc. of the Seventh Conf. (July, UCLA), Bruce D'Ambrosio, et al., eds., Morgan Kaufmann, San Mateo, CA.

Boerlage, Brent (1992) "Link strength in Bayesian networks" in First Canadian Workshop on Uncertainty Management: Theory and Practise (May 12, Vancouver, Canada), Mary Deutsch-McLeish (ed.).

Buchanan, Bruce G. and Edward H.
Shortliffe (1984) Rule-Based Expert Systems: The MYCIN Experiments, Addison-Wesley, Reading, MA.

Cheeseman, Peter (1986) "Probabilistic versus fuzzy reasoning" in Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lemmer (Eds.), North-Holland, Amsterdam, pp. 85-102.

Cooper, Gregory F. (1990) "The computational complexity of probabilistic inference using belief networks" in Artificial Intelligence, 42, 393-405.

Heckerman, David E. (1986) "Probabilistic interpretations for MYCIN's certainty factors" in Uncertainty in Artificial Intelligence, L. N. Kanal and J. F. Lemmer (Eds.), North-Holland, Amsterdam, pp. 167-196.

Henrion, Max (1989) "Some practical issues in constructing belief networks" in Uncertainty in Artificial Intelligence 3, Laveen N. Kanal, T. S. Levitt and J. F. Lemmer (Eds.), North-Holland, Amsterdam.

Horvitz, Eric J. (1989) "Reasoning about beliefs and actions under computational resource constraints" in Uncertainty in Artificial Intelligence 3, Laveen N. Kanal, T. S. Levitt and J. F. Lemmer (Eds.), North-Holland, Amsterdam.

Howard, Ron A. (1971) Dynamic Probabilistic Systems, John Wiley & Sons, New York.

Lauritzen, Steffen L. and David J. Spiegelhalter (1988) "Local computations with probabilities on graphical structures and their application to expert systems" in J. Royal Statistical Society B, 50(2), 157-194.

Lindley, D. V. (1965) Introduction to Probability and Statistics, Cambridge University Press, Cambridge, MA.

Neapolitan, Richard E. and James R. Kenevan (1990) "Computation of variances in causal networks" in Proc. of the Sixth Conf. on Uncertainty in Artificial Intelligence (July 27-29, 1990, MIT), GE Corporate Research and Development.

Neapolitan, Richard E. (1990) Probabilistic Reasoning in Expert Systems: Theory and Algorithms, John Wiley & Sons, New York.

Neapolitan, R. and J. Kenevan (1991) "Investigation of variances in belief networks" in Uncertainty in Artificial Intelligence, Proc. of the Seventh Conf.
(July, UCLA), Bruce D'Ambrosio, et al., eds., Morgan Kaufmann, San Mateo, CA.

Pearl, Judea (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA.

Suermondt, H. Jacques (1992) Explanation in Bayesian Belief Networks, PhD thesis (Report No. STAN-CS-92-1417), Departments of Computer Science and Medicine, Stanford Univ., CA.

Shachter, Ross D. (1986) "Evaluating influence diagrams" in Operations Research, 34(6), 871-882.

Shachter, Ross D. (1988) "Probabilistic inference and influence diagrams" in Operations Research, 36(4), 589-604.

Tribus, M. (1969) "What do we mean by rational?" in Rational Descriptions, Decisions and Designs, Pergamon Press, New York.

Wellman, Michael P. (1990) "Fundamental concepts of qualitative probabilistic networks" in Artificial Intelligence, 44, 257-303.

Yule, G. Udny (1912) "On the methods of measuring association between two attributes" in J. of the Royal Statistical Society, 75, 579-642.

Zolotarev, V. M. (1983) "Probability metrics" in Theory of Probability and its Applications, 28(2), 278-302.

A Notation and Nomenclature

When naming nodes in a Bayesian net, upper case letters, such as "A", refer to single nodes, and bold upper case, such as "A", to sets of zero or more nodes. Since each node represents a propositional random variable, the names of the random variables are also denoted upper case, and the values that they can take on are labeled "TRUE" and "FALSE". "+a" denotes that the value of node A is TRUE, "¬a" denotes that the value of node A is FALSE, and "a" stands for the value of node A (TRUE or FALSE). Sometimes "+a" is written simply as "a", if that does not result in confusion.
A vector of values for all the nodes in the set E is written bold lower case, as "e".

Conditional probabilities are written in the form "P(+b|¬a,+c)", which means "the probability that B is TRUE, given that A is FALSE and C is TRUE". "O(+b|¬a,+c)" is the odds ratio that B is TRUE, given that A is FALSE and C is TRUE, i.e. O(+b|¬a,+c) = P(+b|¬a,+c) / P(¬b|¬a,+c). If we say "the belief at node B is x" we mean P(b=TRUE|e) = x, where e is the evidence seen so far.

"C(B)" denotes the set of parents (conditional predecessors) of node B, "C+(B)" is the set of all ancestors of B, and "C*(B)" is the set of all ancestors of B, including B. Likewise, "S(B)" is the set of B's children (successors), and "S*(B)" the set of all descendants. "C*(B)", where B is a set of nodes, is defined as the set of all ancestors of all the nodes in B, and likewise for C+(B), S*(B), etc.

A BN link from node A to node B is denoted as "A→B" or "B←A". If A is not a parent of B, the preceding link notation may still be used, providing that adding a link from A to B does not create a cycle in the BN; the link represented is then considered a null link from A to B.

The Cartesian product is formed with the "Π" symbol, which takes a set of variables as its argument, and represents the set of all vectors formed with each of the variables taking on one of its possible values. As an example, for propositional variables A and B:

Π{A, B} = {+a+b, +a¬b, ¬a+b, ¬a¬b}

Abbreviations

BN - Bayesian net.

CS - Connection strength.

FJD - Full joint (probability) distribution. Consists of a probability for every conjunction of all the primitive propositions (nodes). The term is also used to indicate the complete probabilistic model.

LS - Link strength.

NCP - Node conditional probability(ies). Also called the "link matrix" in Pearl88.

B Conventional Statistical Measures of Association

There is a broad array of standard statistical measures of association.
Historically they have been defined simply by searching through equations to find one that meets certain desiderata (although in the last few decades there has been a move to define measures according to some optimality criterion). The following have been mentioned by the statistical community as desirable qualities for a measure of association:

1. Range: The measure of association should range from 0 to 1, or -1 to 1.

2. Endpoints: A measure of association of 0 should correspond to independence, while 1 should indicate full association (and -1 indicate reverse full association, if that value can be obtained). Full association is defined by some to mean deterministic dependence (all conditional probabilities of the contingency table are 0 or 1), and by others to mean that at least one of the conditional probabilities of the contingency table is 0 or 1.

3. Monotonicity: The measure of association should vary monotonically and continuously with P(xy) - P(x)P(y).

4. Symmetry: For two binary variables X and Y, the measure of association given as a function of P(x|y) and P(x|¬y) should be the same as the same function of P(y|x) and P(y|¬x).

Here are the most common measures of association as they would be applied to the case of two binary variables, X and Y, written in the probabilistic notation used in this thesis.

Cross ratio: C = [P(y|x) P(¬y|¬x)] / [P(¬y|x) P(y|¬x)] = [P(y|x) (1 − P(y|¬x))] / [(1 − P(y|x)) P(y|¬x)]

Log cross ratio: L = log(C)

Coefficient of association (Yule's Q): Q = (C − 1) / (C + 1)

Coefficient of colligation (Yule's Y): Y = (√C − 1) / (√C + 1)

Root mean square contingency: r = √([P(x|y) − P(x|¬y)] [P(y|x) − P(y|¬x)])

Coefficient of contingency (Pearson): c = r / √(r² + 1)

Difference coefficient (J. H. Edwards): E = P(y|x) − P(y|¬x)

Ratio coefficient (J. H.
Edwards): F = P(y|x) / P(y|¬x)

C Proofs

Theorem 3.1 - Equivalence of CS Definitions

The following three definitions of CS are equivalent:

CS(A,B) = d(P(b|+a), P(b|¬a))    (3.1:1)

CS(A,B) = max_{av1,av2} d(P(b|av1), P(b|av2))    (3.1:3)

CS(A,B) = sup_{av1,av2} d(P(b|av1), P(b|av1,av2))    (3.1:4)

where av1 is some virtual (or nonexistent) evidence for A, av2 is other consistent evidence (possibly virtual) for A, and "sup" means the least upper bound.

To prove that 3.1:3 and 3.1:4 are equivalent to 3.1:1, it is first useful to prove the following property of the distance measure:

Lemma 3.1:5: For the distance measure defined in 3.1:2 and any a, c, x, and y:

If a ≤ x ≤ c and a ≤ y ≤ c then d(x,y) ≤ d(a,c)    (3.1:5)

Proof of lemma 3.1:5: By the monotonicity requirement on d:

a ≤ x ≤ c implies d(a,x) ≤ d(a,c)

a ≤ y ≤ c implies d(a,y) ≤ d(a,c)

Either a ≤ x ≤ y or a ≤ y ≤ x. So, by monotonicity and symmetry of d:

Either d(x,y) ≤ d(a,y) or d(y,x) = d(x,y) ≤ d(a,x)

But both d(a,y) and d(a,x) are ≤ d(a,c), so d(x,y) ≤ d(a,c). ■

Proof that 3.1:3 is equivalent to 3.1:1: Decomposing P(b|av1) on cases of A (by 2.3:3) we get:

P(b|av1) = P(b|+a,av1) P(+a|av1) + P(b|¬a,av1) P(¬a|av1)

Since av1 is virtual evidence for A, B is independent of av1 given A, so P(b|a,av1) = P(b|a). Also substitute α = P(+a|av1), to get:

P(b|av1) = P(b|+a) α + P(b|¬a) (1 − α)

α = P(+a|av1) is restricted to [0,1], and as it varies from 0 to 1, P(b|av1) will vary linearly from P(b|¬a) to P(b|+a). So at all times it is bounded by these limits:

P(b|¬a) ≤ P(b|av1) ≤ P(b|+a) or P(b|+a) ≤ P(b|av1) ≤ P(b|¬a)

By an identical argument for av2:

P(b|¬a) ≤ P(b|av2) ≤ P(b|+a) or P(b|+a) ≤ P(b|av2) ≤ P(b|¬a)

By lemma 3.1:5:

d(P(b|av1), P(b|av2)) ≤ d(P(b|+a), P(b|¬a))

So CS(A,B) defined by 3.1:3 is always less than or equal to CS(A,B) defined by 3.1:1. But in 3.1:3 the max runs over all values of av1 and av2, which includes the possibility av1 = +a and av2 = ¬a, which will give it the value of CS(A,B) defined by 3.1:1, and therefore its maximum value.
So each equation assigns the same value to CS(A,B). ■

Proof that 3.1:4 is equivalent to 3.1:1: This proof is the same as the one above, with av1, av2 substituted for av2, except for the last paragraph, which becomes: CS(A,B) defined by 3.1:4 is always less than or equal to CS(A,B) defined by 3.1:1. In 3.1:4 the max runs over all values of av1 and av2, which doesn't include the possibility av1 = +a and av1,av2 = ¬a, because that would be inconsistent evidence.

However, av1 and av2 can come arbitrarily close to this, and since there are no discontinuities in the system (the probability equations are linear and the distance measure is required to be continuous), d(P(b|av1), P(b|av1,av2)) can come arbitrarily close to CS(A,B) defined by 3.1:1, with the appropriate choice of av1 and av2. By replacing "max" by "sup" to mean the least upper bound, we can write the expression as an equality, and have CS(A,B) defined by 3.1:4 exactly equivalent to CS(A,B) defined by 3.1:1. ■

Theorem 3.4 - Alternate CS Definition

If an alternate CS is defined as:

CS'(A,B|e) = max(d(P(b|e), P(b|+a,e)), d(P(b|e), P(b|¬a,e)))    (3.4:1)

then for any two propositional variables A and B, and for any evidence e (or no evidence e), connection strength defined by equation 3.4:1 can be bounded above and below as:

(1/2) CS(A,B|e) ≤ CS'(A,B|e) ≤ CS(A,B|e)    (3.4:3)

To prove the lower bound the following lemma is useful:

Lemma 3.4: For a distance measure, d, satisfying the triangle inequality (of 3.1:2):

max(d(x,y), d(y,z)) ≥ (1/2) d(x,z)

Proof of lemma 3.4: Whether d(x,y) or d(y,z) is greater:

max(d(x,y), d(y,z)) ≥ min(d(x,y), d(y,z))

Adding max(d(x,y), d(y,z)) to each side:

2 max(d(x,y), d(y,z)) ≥ max(d(x,y), d(y,z)) + min(d(x,y), d(y,z))

But:

max(d(x,y), d(y,z)) + min(d(x,y), d(y,z)) = d(x,y) + d(y,z)

By the triangle inequality of 3.1:2:

d(x,y) + d(y,z) ≥ d(x,z)

Combining the above:

2 max(d(x,y), d(y,z)) ≥ d(x,y) + d(y,z) ≥ d(x,z)

Dividing each side by 2:

max(d(x,y), d(y,z)) ≥ (1/2) d(x,z) ■

Proof of lower bound in 3.4:3: If we substitute P(b|+a,e)
for x, P(b|e) for y, and P(b|¬a,e) for z in lemma 3.4, we obtain:

max(d(P(b|+a,e), P(b|e)), d(P(b|e), P(b|¬a,e))) ≥ (1/2) d(P(b|+a,e), P(b|¬a,e))

By the symmetry of d (required by 3.1:2):

max(d(P(b|e), P(b|+a,e)), d(P(b|e), P(b|¬a,e))) ≥ (1/2) d(P(b|+a,e), P(b|¬a,e))

Substituting, by the definition of CS (i.e. 3.1:1), and the definition of CS' (i.e. 3.4:1):

CS'(A,B|e) ≥ (1/2) CS(A,B|e) ■

Proof of upper bound in 3.4:3: Decomposing P(b|e) on cases of A (by 2.3:3) we get:

P(b|e) = P(b|+a,e) P(+a|e) + P(b|¬a,e) (1 − P(+a|e))

P(+a|e) is restricted to [0,1], and as it varies from 0 to 1, P(b|e) will vary linearly from P(b|+a,e) to P(b|¬a,e). So at all times it is bounded by these limits:

P(b|+a,e) ≤ P(b|e) ≤ P(b|¬a,e) or P(b|¬a,e) ≤ P(b|e) ≤ P(b|+a,e)

By the monotonicity requirement on d (3.1:2):

d(P(b|e), P(b|+a,e)) ≤ d(P(b|+a,e), P(b|¬a,e)) and
d(P(b|e), P(b|¬a,e)) ≤ d(P(b|+a,e), P(b|¬a,e))

Since both left hand sides in the above are less than or equal to d(P(b|+a,e), P(b|¬a,e)), the maximum of them must also be less than or equal to it:

max(d(P(b|+a,e), P(b|e)), d(P(b|e), P(b|¬a,e))) ≤ d(P(b|+a,e), P(b|¬a,e))

Substituting, by the definition of CS (i.e. 3.1:1), and the definition of CS' (i.e.
3.4:1):

CS'(A,B|e) ≤ CS(A,B|e) ■

Theorem 4.1 - ΔP Serial Chaining

Theorem 4.1: For any three propositional variables A, B, and C, where A and C are independent given B:

CSp(A,C) = CSp(A,B) CSp(B,C)    (4.1)

Proof of Theorem 4.1: Using the reasoning-by-cases theorem (2.3:3):

P(c|a) = P(c|a,+b) P(+b|a) + P(c|a,¬b) P(¬b|a)

Since I(A,C|b), we know that P(c|a,b) = P(c|b):

P(c|a) = P(c|+b) P(+b|a) + P(c|¬b) P(¬b|a)

If we subtract a version of this equation with a=FALSE from a version of it with a=TRUE, we obtain:

P(c|+a) − P(c|¬a) = P(c|+b) (P(+b|+a) − P(+b|¬a)) + P(c|¬b) (P(¬b|+a) − P(¬b|¬a))

Replacing P(¬b|a) with 1 − P(+b|a), and simplifying, we obtain:

P(c|+a) − P(c|¬a) = (P(c|+b) − P(c|¬b)) (P(+b|+a) − P(+b|¬a))

Taking the absolute value of each side:

|P(c|+a) − P(c|¬a)| = |P(c|+b) − P(c|¬b)| |P(+b|+a) − P(+b|¬a)|

By the definition of ΔP connection strength this equation is equivalent to:

CSp(A,C) = CSp(A,B) CSp(B,C) ■

Theorem 4.2 - ΔO Serial Chaining

For any three propositional variables A, B, and C, where A and C are independent given B:

tanh((1/4) CSo(A,C)) ≤ tanh((1/4) CSo(A,B)) tanh((1/4) CSo(B,C))    (4.2:1)

Proof of Theorem 4.2: Using the reasoning-by-cases theorem (2.3:3):

P(c|a) = P(c|a,+b) P(+b|a) + P(c|a,¬b) P(¬b|a)

Since I(A,C|b), we know that P(c|a,b) = P(c|b):

P(c|a) = P(c|+b) P(+b|a) + P(c|¬b) P(¬b|a)

Now we convert from probabilities to odds ratios, using O = P/(1−P), divide an A=TRUE version of the resulting equation by an A=FALSE version, and then simplify:

O(c|+a) / O(c|¬a) =
[(1 + O(b|¬a) + O(c|b) + O(b|¬a) O(c|¬b)) (O(b|+a) O(c|b) + O(c|¬b) + O(c|b) O(c|¬b) + O(b|+a) O(c|b) O(c|¬b))] /
[(1 + O(b|+a) + O(c|b) + O(b|+a) O(c|¬b)) (O(b|¬a) O(c|b) + O(c|¬b) + O(c|b) O(c|¬b) + O(b|¬a) O(c|b) O(c|¬b))]

We define O(x|+y)/O(x|¬y) as C(x|y), substitute where appropriate in the above equation, and simplify:

C(c|a) = [(1 + C(c|b) O(b|a) + O(c|b) + O(b|a) O(c|b)) (C(b|a) C(c|b) + C(c|b) O(b|a) + C(b|a) C(c|b) O(c|b) + O(b|a) O(c|b))] / [(C(b|a) + C(c|b) O(b|a)
+ C(b|a) O(c|b) + O(b|a) O(c|b)) (C(c|b) + C(c|b) O(b|a) + C(c|b) O(c|b) + O(b|a) O(c|b))]

The above expression for C(c|a) can be broken down into the composition of two functions:

C(c|a) = [(C(b|a) C(c|b) f1 + 1) (f1 + 1)] / [(C(b|a) f1 + 1) (C(c|b) f1 + 1)]    (C4.2:1)

where

f1 = (1 + O(c|b)) / ((C(c|b) + O(c|b)) O(b|a))    (C4.2:2)

Now we must find the maximum value that C(c|a) can take on for a given C(c|b) and C(b|a), so we maximize it with respect to O(b|a) and O(c|b). We are really interested in the maxima of CSo, which is the absolute value of the log of C(c|a), so we want to find all maxima of C(c|a) which are greater than 1, and all minima less than 1. We assume C(b|a) and C(c|b) are finite and nonzero, and we can check these extreme cases for maxima and minima at the end. First we maximize with respect to O(b|a), then with respect to O(c|b).

First we check the boundaries of O(b|a). At O(b|a) = 0 and O(b|a) = ∞, we get C(c|a) = 1, so there are no maxima or minima of interest at the boundaries. Next we check for discontinuities (where the numerator or denominator equals zero). All the quantities in f1 are greater than or equal to 0, so f1 = (1 + O(c|b)) / ((C(c|b) + O(c|b)) O(b|a)) is greater than or equal to 0. The denominator of C(c|a) is (C(b|a) f1 + 1) (C(c|b) f1 + 1), which therefore must be 1 or greater. The numerator of C(c|a) is (C(b|a) C(c|b) f1 + 1) (f1 + 1), which must also be 1 or greater. So there are no discontinuities in our domain of interest. Also, since it is a rational function, the derivative won't have discontinuities in places where the function doesn't.

Next we take the derivative with respect to O(b|a), and check the zeros of the derivative for maxima.
We do this in two steps using the chain rule for derivatives:

C(c|a) = g1(C(b|a), C(c|b), f1(C(c|b), O(c|b), O(b|a)))

d[C(c|a)] / d[O(b|a)] = (d[C(c|a)] / d[f1]) (d[f1] / d[O(b|a)])

So the maxima/minima will occur when either derivative equals zero.

d[f1] / d[O(b|a)] = − (1 + O(c|b)) / ((O(b|a))² (C(c|b) + O(c|b)))

This is zero when O(b|a) is infinity (which is at the border and has already been checked). It is not zero when O(c|b) is infinity (there it is −1/(O(b|a))²). It is zero when O(c|b) = −1, but that is not an allowed value for O(c|b), so there are no maxima of interest when this derivative is zero. We try the other derivative in the product:

d[C(c|a)] / d[f1] = − [(C(b|a) − 1) (C(c|b) − 1) (C(b|a) C(c|b) f1² − 1)] / [(1 + C(b|a) f1)² (1 + C(c|b) f1)²]

Zeros of this derivative are at C(b|a) = 1, C(c|b) = 1, f1 = ∞, and C(b|a) C(c|b) f1² = 1. When C(b|a) = 1 or C(c|b) = 1, we find that C(c|a) = 1, so they do not correspond to maxima of interest. f1 is infinite only if O(b|a) = 0 (which we have already considered). So the only interesting roots are at C(b|a) C(c|b) f1² = 1. One of the roots of this equation is always negative, so it doesn't correspond to a valid solution. The other root is:

f1 = 1 / √(C(b|a) C(c|b)), which occurs at O(b|a) = √(C(b|a) C(c|b)) (1 + O(c|b)) / (C(c|b) + O(c|b))

Substituting this back in equation C4.2:1 yields:

C(c|a) = (1 + √(C(b|a) C(c|b)))² / (√(C(b|a)) + √(C(c|b)))²    (C4.2:4)

We have left to the end the task of checking for possible maxima or minima at C(b|a) or C(c|b) taking a value of 0 or ∞. When we substitute C(b|a) = 0 into the formula for C(c|a) we get C(c|a) = 1/C(c|b), and for C(b|a) = ∞ we get C(c|a) = C(c|b). These are the same values we get from C4.2:4, so that formula will do in all cases.
So the value of C(c|a) given by equation C4.2:4 truly is the maximum value C(c|a) can take given values of C(b|a) and C(c|b), and we may rewrite it as an inequality which always holds:

C(c|a) ≤ (1 + √(C(b|a) C(c|b)))² / (√(C(b|a)) + √(C(c|b)))²    (C4.2:5)

We can use equation 3.3:3 to express things in terms of CSo:

CSo(A,C) = |log C(c|a)|

CSo(A,B) = |log C(b|a)|

CSo(B,C) = |log C(c|b)|

If we make the above substitutions in equation C4.2:5, and then simplify, we obtain:

tanh((1/4) CSo(A,C)) ≤ tanh((1/4) CSo(A,B)) tanh((1/4) CSo(B,C))

which is the equation to be proved. ■

Theorem 4.3 - Fundamental Equation

For any three propositional variables V, Q, and Z, we can decompose the connection strength from V to Q on cases of Z as follows:

CS(V,Q) ≤ max_z CS(V,Q|z) + CS(V,Z) * min_v CS(Z,Q|v)    (4.3:6)

where * is a generalized multiplication corresponding to the serial combination rule for the particular distance measure, d, used to define CS.

Lemma 4.3: For any real numbers x1, x2, y1, y2, and z, if

z ≤ x1 + y1 and z ≤ x2 + y2

then

z ≤ max(x1, x2) + min(y1, y2)

Proof of Lemma 4.3: There are four possible cases. If x1 ≥ x2 and y1 ≤ y2, then

max(x1, x2) + min(y1, y2) = x1 + y1 ≥ z by the first given equation.

If x1 ≥ x2 and y2 ≤ y1, then

max(x1, x2) + min(y1, y2) = x1 + y2 ≥ x2 + y2 ≥ z by the second given equation,

and similarly for the other two cases. ■

Proof of Theorem 4.3: We can analyze the propositional variables V, Q, and Z by constructing a BN for them, which appears in figure C.1. Since this is a fully connected BN, it can represent any probabilistic relationship between the 3 variables, with the right choice of NCP values. So even if these three variables are originally from a different BN where they are connected up in a different way, and perhaps with many other nodes involved, the BN of figure C.1 can represent their relationships within the other BN with no loss of generality.
So we need only prove equation 4.3:6 for the BN of figure C.1.

Figure C.1 - Fully connected BN representing the probabilistic relationship between V, Z, and Q with no loss of generality.

Figure C.2 - A new BN which is the same as the one in figure C.1, except the V node has been split into U and W. This BN is defined to have the same NCPs as the one in figure C.1.

The BN in figure C.2 is the same as C.1 (has the same connections and the same NCPs), except the node V has been replaced with two nodes: U and W. In general C.2 will produce different inference results from those produced by C.1, because the active path from Q to Z through V has been broken. However, in those cases where U and W both have evidence, and that evidence is the same for both of them, inference using C.2 will produce the same beliefs as C.1 (with V receiving the same evidence as U and W). Since in the calculation of CS(V,Q) the only values of interest are the beliefs at Q when V has evidence, we will obtain the same values from C.2 (giving both U and W the same evidence as V). Even though some of the intermediate calculations may be different, the results will be the same because P(q|+v) = P(q|+u,+w) and P(q|¬v) = P(q|¬u,¬w). So we need only prove equation 4.3:6 for the BN of figure C.2.

By the definition of CS (3.1:1):

CS(V,Q) = d(P(q|+v), P(q|¬v)) = d(P(q|+u,+w), P(q|¬u,¬w))

By the triangle inequality of d (3.1:2):

d(P(q|+u,+w), P(q|¬u,¬w)) ≤ d(P(q|+u,+w), P(q|+u,¬w)) + d(P(q|+u,¬w), P(q|¬u,¬w))

Substituting in the above the definition of conditional CS (3.5:1):

CS(V,Q) ≤ CS(W,Q|+u) + CS(U,Q|¬w)

Similarly we can derive:

CS(V,Q) ≤ CS(W,Q|¬u) + CS(U,Q|+w)

Invoking lemma 4.3 on the two equations above:

CS(V,Q) ≤ max_u CS(W,Q|u) + min_w CS(U,Q|w)    (C4.3:1)

The second term in the above equation can be evaluated as a three node BN consisting of two links in serial, similar in form to the BN of figure 4.1.
The techniques of section 4.1 or section 4.2 can be used (or similar techniques for other distance measures) to provide a bound for it. We express that bound in the general form:

min_w CS(U,Q|w) ≤ min_w [CS(U,Z|w) * CS(Z,Q|w)]

where * is the serial combination rule, which depends on the particular distance measure being used.

Since W is independent of U and Z, CS(U,Z|w) = CS(U,Z), and we may move the min operator in:

min_w CS(U,Q|w) ≤ CS(U,Z) * min_w CS(Z,Q|w)

Returning to notation using V instead of U and W:

min_w CS(U,Q|w) ≤ CS(V,Z) * min_v CS(Z,Q|v)    (C4.3:2)

Now we examine the first term of C4.3:1. Expressing it in terms of the d measure:

max_u CS(W,Q|u) = max_u d(P(q|+w,u), P(q|¬w,u))    (C4.3:3)

We evaluate d(P(q|+w,u), P(q|¬w,u)) by reasoning by cases on Z (2.3:3):

d(P(q|+w,u), P(q|¬w,u)) = d(P(q|+w,+z,u) P(+z|+w,u) + P(q|+w,¬z,u) P(¬z|+w,u), P(q|¬w,+z,u) P(+z|¬w,u) + P(q|¬w,¬z,u) P(¬z|¬w,u))

Q is independent of U given Z, so P(q|w,z,u) = P(q|w,z). Also, Z is independent of W given U, so P(z|w,u) = P(z|u).
Substituting these in:

d(P(q|+w,u), P(q|¬w,u)) = d(P(q|+w,+z) P(+z|u) + P(q|+w,¬z) P(¬z|u), P(q|¬w,+z) P(+z|u) + P(q|¬w,¬z) P(¬z|u))

Substituting x for P(+z|u), and 1−x for P(¬z|u), we obtain:

d(P(q|+w,u), P(q|¬w,u)) = d(P(q|+w,+z) x + P(q|+w,¬z) (1−x), P(q|¬w,+z) x + P(q|¬w,¬z) (1−x))

By the convexity requirement on d (3.1:2):

d(P(q|+w,u), P(q|¬w,u)) ≤ max(d(P(q|+w,+z), P(q|¬w,+z)), d(P(q|+w,¬z), P(q|¬w,¬z)))

By the definition of conditional CS (3.5:1):

d(P(q|+w,u), P(q|¬w,u)) ≤ max_z CS(W,Q|z)

Substituting this back into C4.3:3, and switching from W to V notation, we obtain:

max_u CS(W,Q|u) ≤ max_z CS(V,Q|z)    (C4.3:4)

Combining C4.3:1, C4.3:2, and C4.3:4 gives us the result to be proved:

CS(V,Q) ≤ max_z CS(V,Q|z) + CS(V,Z) * min_v CS(Z,Q|v) ■

Theorem 4.7:1 - Intercausal Link Strength

If A and C are both parents of E, and I(A,C|), then if E receives evidence TRUE, the CS from A to C is bounded by:

CSo(A,C|+e) = |log [P(+e|+a,+c) P(+e|¬a,¬c)] / [P(+e|+a,¬c) P(+e|¬a,+c)]|    (4.7:1)

Proof: By Bayes' rule (2.3:2):

P(c|a,e) = P(e|a,c) P(c|a) / P(e|a)

If we form the ratio of this equation with c=TRUE to it with c=FALSE, we obtain:

P(+c|a,e) / P(¬c|a,e) = [P(e|a,+c) P(+c|a) / P(e|a)] / [P(e|a,¬c) P(¬c|a) / P(e|a)]

Canceling P(e|a) and using the definition of odds ratio, we obtain the following well known (e.g.
Pearl88) equation, which holds for either value of A and either value of E:

O(+c|a,e) = [P(e|a,+c) / P(e|a,¬c)] O(+c|a)

If we form the ratio of this equation with a=TRUE to it with a=FALSE, we obtain:

O(+c|+a,e) / O(+c|¬a,e) = [P(e|+a,+c) P(e|¬a,¬c)] / [P(e|+a,¬c) P(e|¬a,+c)] · [O(+c|+a) / O(+c|¬a)]

Since I(A,C|), we know that O(+c|+a) = O(+c|¬a), so we can cancel these factors:

O(+c|+a,e) / O(+c|¬a,e) = [P(e|+a,+c) P(e|¬a,¬c)] / [P(e|+a,¬c) P(e|¬a,+c)]

Next we take the absolute value of the logarithm of both sides:

|log [O(+c|+a,e) / O(+c|¬a,e)]| = |log [P(e|+a,+c) P(e|¬a,¬c)] / [P(e|+a,¬c) P(e|¬a,+c)]|

By the definition of CS using the dO measure, the left hand side is CS(A,C|e):

CS(A,C|e) = |log [P(e|+a,+c) P(e|¬a,¬c)] / [P(e|+a,¬c) P(e|¬a,+c)]|

The equation we set out to prove is just a special case of the above with e=TRUE. ■

Theorem 5.3:2 - Approximate Inference Error Bound

When using algorithm 5.3:2, a bound on the error of the approximation:

e = d(P(q|z1,z2,...,zn,e), P(q|e))

is given by:

e ≤ CS(Z1, Q|e) + CS(Z2, Q|z1,e) + ... + CS(Zn, Q|z1,z2,...,zn−1,e)

for any node Q ∈ Q, where d is the distance measure used for the definition of CS.

Lemma 5.3: For any two propositional variables, Z and Q, and any evidence e:

CS(Z,Q|e) ≥ d(P(q|e), P(q|z,e))

Proof of Lemma 5.3: By the definition for an alternate CS (3.4:1):

CS'(Z,Q|e) = max(d(P(q|e), P(q|+z,e)), d(P(q|e), P(q|¬z,e)))

The result of a "max" function is greater than or equal to either of its arguments, so:

CS'(Z,Q|e) ≥ d(P(q|e), P(q|z,e))

By theorem 3.4, CS(Z,Q|e) ≥ CS'(Z,Q|e), so:

CS(Z,Q|e) ≥ d(P(q|e), P(q|z,e)) ■

Proof of Theorem 5.3:2: By the triangle inequality on d (required by 3.1:2):

d(P(q|e), P(q|z1,...,zn−1,e)) + d(P(q|z1,...,zn−1,e), P(q|z1,...,zn,e)) ≥ d(P(q|e), P(q|z1,...,zn,e))

The first term of the above can be substituted with an upper bound (i.e.
a number guaranteed to be greater than or equal to it) provided by the same equation with an index of n that is one lower, to yield:

d(P(q|e), P(q|z1,...,zn−2,e)) + d(P(q|z1,...,zn−2,e), P(q|z1,...,zn−1,e)) + d(P(q|z1,...,zn−1,e), P(q|z1,...,zn,e)) ≥ d(P(q|e), P(q|z1,...,zn,e))

This process can be repeated n−2 times to yield:

Σ_{i=1..n} d(P(q|z1,...,zi−1,e), P(q|z1,...,zi,e)) ≥ d(P(q|e), P(q|z1,...,zn,e))

Each term of the sum can be bounded by lemma 5.3, by substituting Zi for Z and z1 & ... & zi−1 & e for e, to yield:

Σ_{i=1..n} CS(Zi, Q|z1,...,zi−1,e) ≥ d(P(q|e), P(q|z1,...,zn,e))

If we define e = d(P(q|z1,z2,...,zn,e), P(q|e)), then we obtain the bound:

e ≤ CS(Z1, Q|e) + CS(Z2, Q|z1,e) + ... + CS(Zn, Q|z1,z2,...,zn−1,e) ■
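Theorem 4.1 (ΔP serial chaining) lends itself to a direct numerical check. The sketch below is my own illustration, not part of the thesis: it builds a serial chain A → B → C with arbitrary (assumed) conditional probabilities, computes CSp(X,Y) = |P(+y|+x) − P(+y|¬x)| directly, and confirms that CSp(A,C) equals the product CSp(A,B)·CSp(B,C):

```python
# Numerical check of CSp serial chaining (theorem 4.1) on a chain A -> B -> C,
# where A and C are independent given B. All variables are binary.

# Arbitrary conditional probabilities for the chain (assumed values for this check).
p_b_given_a = {True: 0.9, False: 0.3}   # P(+b | a)
p_c_given_b = {True: 0.8, False: 0.1}   # P(+c | b)

def p_c_given_a(a):
    """P(+c | a) by reasoning by cases on B (equation 2.3:3 in the text)."""
    pb = p_b_given_a[a]
    return p_c_given_b[True] * pb + p_c_given_b[False] * (1.0 - pb)

# Delta-P connection strengths: CSp(X,Y) = |P(+y|+x) - P(+y|-x)|.
cs_ab = abs(p_b_given_a[True] - p_b_given_a[False])
cs_bc = abs(p_c_given_b[True] - p_c_given_b[False])
cs_ac = abs(p_c_given_a(True) - p_c_given_a(False))

# The serial chaining equality holds up to floating-point rounding.
print(abs(cs_ac - cs_ab * cs_bc) < 1e-12)  # True
```

Repeating this with other conditional probabilities gives the same agreement, which is what theorem 4.1 predicts for any serial chain with I(A,C|b).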

