MULTIPLY SECTIONED BAYESIAN BELIEF NETWORKS FOR
LARGE KNOWLEDGE-BASED SYSTEMS:
AN APPLICATION TO NEUROMUSCULAR DIAGNOSIS

By
Yang Xiang

B.A.Sc., Beijing Institute of Aeronautics and Astronautics, 1983
M.A.Sc., Beijing Institute of Aeronautics and Astronautics, 1985

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
THE FACULTY OF GRADUATE STUDIES
ELECTRICAL ENGINEERING

We accept this thesis as conforming
to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
1991
© Yang Xiang, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Electrical Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada
V6T 1W5

Abstract

In medical diagnosis a proper uncertainty calculus is crucial in knowledge representation. Finite calculus is close to human language and should facilitate knowledge acquisition. An investigation into the feasibility of finite totally ordered probability models has been conducted. It shows that a finite model is of limited usage, which highlights the importance of infinite totally ordered models including probability theory.

Representing the qualitative domain structure is another important issue. Bayesian networks, combining graphical representation of domain dependency and probability theory, provide a concise representation and a consistent inference formalism. An expert system QUALICON for quality control in electromyography has been implemented as a pilot study of Bayesian nets.
The performance is comparable to that of human professionals.

Extending the research into a large system PAINULIM in neuromuscular diagnosis shows that the computation using homogeneous net representation is unnecessarily complex. At any one time a user’s attention is directed to only part of a large net, i.e., there is ‘localization’ of queries and evidence. The homogeneous net is inefficient since the overall net has to be updated each time. Multiply Sectioned Bayesian Networks (MSBNs) have been developed to exploit localization. Reasonable constraints are derived such that a localization preserving partition of a domain and its representation by a set of subnets are possible. Reasoning takes place at only one of them due to localization. Marginal probabilities obtained are identical to those obtained when the entire net is globally consistent. When the user’s attention shifts, a new subnet is swapped in and previously acquired evidence absorbed. Thus, with $\beta$ subnets, the complexity is reduced approximately to $1/\beta$.

With the complexity reduced by MSBNs, the knowledge acquisition for PAINULIM was conducted using normal hospital computers. This resulted in efficient cooperation with medical staff. PAINULIM was thus constructed in less than one year. An evaluation shows very good performance.

Coding the probability distribution of Bayesian nets in the causal direction has several advantages. Initially the distribution is elicited from the expert in terms of probabilities of a symptom given causing diseases. Since disease-to-symptom is not the direction of daily practice, the elicited probabilities may be inaccurate. An algorithm has been derived for sequentially updating probabilities in Bayesian nets, making use of the expert’s symptom-to-disease probabilities.
Simulation shows better performance than Spiegelhalter’s {0, 1} distribution learning.

Table of Contents

Abstract
List of Tables
List of Figures
A Guide for the Reader
Acknowledgement

1 INTRODUCTION
  1.1 Neuromuscular Diagnosis and Expert Systems
  1.2 Representation Formalisms Other Than Bayesian Networks
    1.2.1 Rule-Based Systems
    1.2.2 The Method of Odds Likelihood Ratios
    1.2.3 MYCIN certainty factor
    1.2.4 Dempster-Shafer Theory
    1.2.5 Fuzzy Sets
    1.2.6 Limitations of Rule-Based Systems
  1.3 Representing Probable Reasoning In Bayesian Networks
    1.3.1 Basic Concepts of Probability Theory
    1.3.2 Inference Patterns Embedded In Probability Theory
    1.3.3 Bayesian Networks
    1.3.4 Propagating Belief in Bayesian Networks

2 FINITE TOTALLY ORDERED PROBABILITY ALGEBRA
  2.1 Motivation
  2.2 Finite Totally Ordered Probability Algebras
    2.2.1 Characterization
    2.2.2 Mathematical Structure
    2.2.3 Solution and Range
  2.3 Bayes Theorem and Reasoning by Case
  2.4 Problems With Legal Finite Totally Ordered Probability Models
    2.4.1 Ambiguity-generation and Denominator-indifference
    2.4.2 Quantitative Analysis of the Problems
    2.4.3 Can the Changes in Probability Assignment Help?
  2.5 An Experiment
  2.6 Conclusion

3 QUALICON: A COUPLED EXPERT SYSTEM
  3.1 The Problem: Quality Control in Nerve Conduction Studies
  3.2 QUALICON: a Coupled Expert System in Quality Control
  3.3 Development of QUALICON
    3.3.1 Feature Extraction and Partition
    3.3.2 Probabilistic Reasoning
  3.4 Assessment of QUALICON
    3.4.1 Assessment Procedure
    3.4.2 Assessment Results
  3.5 Remarks

4 MULTIPLY SECTIONED BAYESIAN NETWORKS
  4.1 Localization
  4.2 Background
    4.2.1 Operations on Belief Tables
    4.2.2 Transform a Bayesian Net into a Junction Tree
    4.2.3 d-separation
  4.3 ‘Obvious’ Ways to Explore Localization
  4.4 Overview of MSBN and Junction Forest Technique
  4.5 The d-sepset and the Junction Tree
    4.5.1 The d-sepset
    4.5.2 Implication of d-sepset in Junction Trees
  4.6 Multiply Sectioned Bayesian Nets
    4.6.1 Definition of MSBN
    4.6.2 Soundness of Sectioning
  4.7 Transform MSBN into Junction Forest
    4.7.1 Transform SubDAGs into Junction Trees by Local Computation
    4.7.2 Linkages between Junction Trees
    4.7.3 Joint System Belief of Junction Forest
  4.8 Consistency and Separability of Junction Forest
    4.8.1 Consistency of Junction Forest
    4.8.2 Separability of Junction Forests
  4.9 Belief Propagation in Junction Forests
    4.9.1 Supportiveness
    4.9.2 Basic Operations
    4.9.3 Belief Initialization
    4.9.4 Evidential Reasoning
    4.9.5 Computational Complexity
  4.10 Remarks

5 PAINULIM: A NEUROMUSCULAR DIAGNOSTIC AID
  5.1 PAINULIM Application Domain and Design Criteria
  5.2 Exploiting Localization in PAINULIM Using MSBN Technique
    5.2.1 Resolve Conflict Demands by Exploiting Localization
    5.2.2 Using MSBN Technique in PAINULIM
  5.3 Other Issues in Knowledge Acquisition and Representation
    5.3.1 Multiple Diseases
    5.3.2 Acquisition of Probability Distribution
  5.4 Shell Implementation
  5.5 A Query Session with PAINULIM
  5.6 An Evaluation of PAINULIM
    5.6.1 Case Selection
    5.6.2 Performance Rating
    5.6.3 Evaluation Results
  5.7 Remarks

6 ALGORITHM FOR LEARNING BY POSTERIOR PROBABILITIES
  6.1 Background
  6.2 Learning from Posterior Distributions
  6.3 The Algorithm for Learning by Posterior Probability (ALPP)
  6.4 Convergence of the Algorithm
  6.5 Simulation Results
  6.6 Remarks

7 SUMMARY

Bibliography

Appendices

A Background on Graph Theory
  A.1 Graphs
  A.2 Hypergraphs
B Reference for Theory of Probabilistic Logic (TPL)
  B.1 Axioms of TPL
  B.2 Probability Algebra Theorem
C Examples on Finite Totally Ordered Probability Algebras
  C.1 Examples of Legal FTOPAs
  C.2 Derivation of p(fire|smoke&alarm) under TPL
  C.3 Evaluation of Legal FTOPAs with Size 8
D Proofs for corollary and propositions
E Reference for Statistic Evaluation of Bernoulli Trials
  E.1 Estimation for Confidence Intervals
  E.2 Hypothesis Testing
F Glossary of PAINULIM Terms
  F.1 Diseases Considered in PAINULIM
  F.2 Clinical Features in PAINULIM
  F.3 EMG Features in PAINULIM
  F.4 Nerve Conduction Features in PAINULIM
G Acronyms

List of Tables

2.1 Probabilities for smoke-alarm example
3.2 Potential recording conditions
4.3 Probability distribution associated with DAG 0 in Figure 4.17
4.4 Probability distribution associated with subDAGs of Figure 4.20
4.5 Belief tables for the junction forest in Figure 4.20
4.6 Belief tables after CollectBelief during BeliefInitialization
4.7 Belief tables after the completion of BeliefInitialization
4.8 Prior probabilities after the completion of BeliefInitialization
4.9 Belief tables in evidential reasoning
4.10 Posterior probabilities in evidential reasoning
5.11 Number of cases involved for each disease considered in PAINULIM
6.12 Simulation 1 summary
6.13 Simulation 2 summary
6.14 Simulation 3 summary
6.15 Simulation 4 summary

List of Figures

1.1 An inference network
1.2 An inference net using the method of odds likelihood ratios
1.3 An inference net using the MYCIN certainty factors
1.4 The DAG of a singly connected Bayesian network
1.5 The DAG of a Bayesian network with diameter 1
1.6 The DAG of a general multiply connected Bayesian network
2.7 Smoke-alarm example
3.8 QUALICON system structure
3.9 QUALICON report
3.10 QUALICON report
3.11 Feature extraction strategy
3.12 SNAP produced by reversing stimulation polarity
3.13 Illustration of Guided Filtering
3.14 Bayesian subset for CMAP
3.15 A DAG to illustrate the inference algorithm in QUALICON
3.16 Results of statistical assessment of QUALICON
4.17 Transformation of a Bayesian net into a junction tree
4.18 A triangulated graph and its junction tree
4.19 Comparison of junction tree technique with MSBN technique
4.20 Illustration of MSBN and junction forest technique
4.21 An example of unsound sectioning
4.22 A MSBN with a hypertree structure
4.23 A MSBN with sound sectioning but without a covering sect
4.24 Examples for violation of host composition condition
4.25 A partial host tree violating the host composition condition
4.26 A junction tree from a junction forest
4.27 An example illustrating the operation UpdateBelief
5.28 The 3 sects of PAINULIM: CLINICAL, EMG, and NCV
5.29 The linked junction forest of PAINULIM
5.30 Create a MSBN using WEBWEAVR shell
5.31 CLINICAL sect in a query session
5.32 NCV sect in a query session
5.33 EMG sect in a query session
5.34 Feature expectation for a patient with both Cts and Radnn
6.35 An example of D(1)
6.36 Fire alarm example for simulation
6.37 Simulation set-up
A.38 Examples of graphs
A.39 A junction tree
C.40 Evaluation result of smoke-alarm example by 32 legal FTOPAs of size 8

A Guide for the Reader

This thesis has been written for readers from various backgrounds including artificial intelligence, medical informatics, and biomedical engineering.

Chapter 1 contains the background on Bayesian belief networks and related uncertainty management formalisms. Bayesian belief networks combine probability theory and graphical representation of domain models. Appendix A contains an introduction about concepts from graph theory which are relevant to this thesis. Readers unfamiliar with graph theory should read this appendix before reading the main body of the thesis.

Chapter 2 contains results of an investigation into the feasibility of using finite totally ordered probability models for uncertainty management in expert systems. Readers who are mainly interested in positive results and applicable techniques may skip this chapter.

Chapter 3 describes the construction and evaluation of QUALICON, an expert system coupling digital signal processing and Bayesian networks for technical quality control in nerve conduction studies. Researchers in medical informatics and biomedical engineering who are interested in learning about the practical application of Bayesian networks and coupled expert systems will find this chapter useful.

Chapter 4 contains the theory for Multiply Sectioned Bayesian Networks (MSBNs) and junction forests.
Its implementation in the WEBWEAVR shell and application in PAINULIM are described in Chapter 5.

Chapter 5 describes the construction and evaluation of PAINULIM, an expert neuromuscular diagnostic system for patients presenting a painful or impaired upper limb. Researchers in medical informatics and biomedical engineering who are interested in the practical application of the MSBN technique will find this chapter particularly relevant.

Chapter 6 presents a learning algorithm for sequentially updating conditional probabilities in Bayesian networks using the expert’s subjective posterior probabilities.

Acknowledgement

At University of British Columbia, my deepest thanks go to David Poole, Michael Beddoes and Andrew Eisen. David Poole introduced me to the concepts of probabilistic reasoning and Bayesian networks. His critical eye on my work has been extremely beneficial. Michael Beddoes directed me into the field of application of artificial intelligence (AI) in neuromuscular diagnosis, and has been my advisor both in and out of academics. His guidance has been critical to this interdisciplinary research. Andrew Eisen has guided me in choosing appropriate subdomains in neuromuscular diagnosis for AI applications, and has given me the necessary support to conduct the research towards the two expert systems, for which he also acted as one of the domain experts.

At Vancouver General Hospital, special thanks are due to Bhanu Pant and Maureen MacNeil, the domain experts for PAINULIM and QUALICON. The technique of multiply sectioned Bayesian networks and junction forests grew out of the discussions with Andrew Eisen and Bhanu Pant. Bhanu Pant also devoted months of painstaking work and much overtime to the construction and evaluation of PAINULIM. Without his enthusiastic cooperation, it would not have been possible for PAINULIM to reach its clinical significance in less than a year.
Maureen MacNeil devoted months of work to the construction of QUALICON besides her hospital duty.

I thank Romas Aleliunas for helping me to gain an understanding of his theory of probabilistic logic. I also thank Finn Jensen for providing several reprints of his papers; Steen Andreassen for providing publications regarding the progress in MUNIN; and David Heckerman for providing his Ph.D. thesis on similarity networks and his papers on PATHFINDER. Lianwen Zhang suggested the relation between the d-sepset and the graph separator. Nick Newell provided useful suggestions for programming the WEBWEAVR shell.

In addition I thank Sandy Trueman and Andrew Travios for participating in the assessment of QUALICON, and Tish Mallinson, Karen Burns, and Khadija Amlani for help in nerve conduction potential acquisition.

The National Science and Engineering Research Council, under Grant A3290, 583200, and University of British Columbia, under University Graduate Fellowship, provided the financial support that made this thesis possible.

Finally, I owe a great debt to my mother and father for a lifetime of love, caring and support, and especially to my wife Jingyu Zhu for sheltering me from the travails of the real world and surrounding me with so much love and support.

Chapter 1

INTRODUCTION

In this chapter, the status of computers in medicine is reviewed in Section 1.1: the review covers the status of expert systems in neuromuscular diagnosis. Section 1.2 reviews rule-based expert systems and uncertainty management formalisms other than Bayesian networks. Section 1.3 reviews Bayesian networks as a natural and concise representation for uncertain knowledge.

1.1 Neuromuscular Diagnosis and Expert Systems

The thesis research addresses practical issues in building medical expert systems.
The medical area is neuromuscular diagnosis.

The computer is now assisting doctors and technicians in neuromuscular diagnosis by performing the following functions: data acquisition/analysis; database management; and documentation. Computers have, however, failed to assist doctors directly in their diagnostic inference and clinical decision making [Desmedt 89].

In general medicine, the same failure was true until the mid 70’s. The failure was partly due to the limitations of conventional techniques for medical decision making (the clinical algorithm or flow chart, and the matching of cases to large data bases of previous cases [Szolovits 82]). The failure was also partly due to a lack of awareness of proper ways to apply normative probability and decision theories.

On the other hand, as Szolovits [1982] put it: “Modern medicine has become technically complex, the standards set for it are very high, conceptual and practical advances are rapid, yet the cognitive capabilities of physicians are pretty much fixed. As more and more data become available to the practicing doctor, as more becomes known about the processes of disease and possible interventions available to alter them, practitioners are called on to know more, to reason better, and to achieve better outcomes for their patients.” In response to the need for more sophisticated diagnostic aids, medical expert systems appeared in the field of AI in medicine.

Expert systems are computer programs capable of making judgments or giving assistance in a complex area. The tasks they perform are ordinarily performed only by humans. They are commonly separated into 2 major components: the knowledge base, which includes assumptions about the particular domain, and the inference engine, which controls how the knowledge base is to be used in solving a particular problem.
Since the mid 70’s, researchers in AI have built expert systems in many areas of medicine to assist doctors in diagnosis or therapy (e.g., MYCIN for the diagnosis and treatment of bacterial infections [Buchanan and Shortliffe 84], and INTERNIST for the diagnosis in internal medicine [Pople 82]). New expert system technologies are still evolving as limitations to existing technologies are recognized. Limited successes in different aspects of performance have been achieved.

Returning to the area of neuromuscular diagnosis, several (prototype) expert systems have appeared since the mid 80’s: LOCALIZE [First et al. 82] for localization of peripheral nerve lesions; MYOSYS [Vila et al. 85] for diagnosing mono- and polyneuropathies; MYOLOG [Gallardo et al. 87] for diagnosing plexus and root lesions; Blinowska and Verroust’s system [1987] for diagnosing carpal tunnel syndrome; ELECTRODIAGNOSTIC ASSISTANT [Jamieson 90] for diagnosing entrapment neuropathies, plexopathies, and radiculopathies; KANDID [Fuglsang-Frederiksen and Jeppesen 89] and MUNIN [Andreassen et al. 89] aiming at diagnosing the complete range of neuromuscular disorders. Satisfactory results in system testing with constructed cases have been reported, while only one of them (ELECTRODIAGNOSTIC ASSISTANT) reported a clinical evaluation, in which 15 cases gave a 78% agreement rate with electromyographers. Most of the systems are rule-based. The limitations of rule-based systems are reviewed in Section 1.2. MUNIN uses Bayesian networks for representation of uncertain knowledge, and is still under development. Just as the expert system techniques are evolving, their application to neuromuscular diagnosis is still a subject of ongoing research.

1.2 Representation Formalisms Other Than Bayesian Networks

Construction of an expert system faces an immediate question: What is a proper knowledge representation and inference formalism?
The answer depends largely on the characteristics of the task to be performed by the target system.

Medical diagnosis is characterized by probable reasoning, or more formally, reasoning under uncertainty. The uncertainty comes from the incomplete knowledge about biological processes within the human body, from the incomplete observation and monitoring of patient conditions, and from dynamically changing relations between the many factors entering the diagnostic process. A proper knowledge representation for reasoning under uncertainty is required, which includes the representation of qualitative domain structure and an uncertainty calculus.

Perhaps the most widely used structure in expert systems is a set of rules in rule-based systems. This section reviews rule-based systems, their advantages and limitations. A number of uncertainty calculuses have been used in expert systems. This section also reviews those calculuses notable in AI which associate single real numbers with uncertainty. An investigation of the feasibility of using a finite number of symbols to denote degrees of uncertainty will be presented in Chapter 2. Those calculuses are reviewed together with rule-based systems because some of them (i.e., MYCIN certainty factor and odds likelihood ratios) can only be applied in conjunction with rule-based systems.

Bayesian networks have been chosen as the knowledge representation in this thesis research. The networks combine graphic representation of domain dependence and probability theory. This combination overcomes many limitations of rule-based systems. Bayesian network techniques will be reviewed in Section 1.3.

There has been controversy in the artificial intelligence field about the adequacy of using probability for representation of beliefs [Cheeseman 88a, Cheeseman 88b]. I will not attempt to enter into this controversy.
Rather, my review discusses why Bayesian networks seem appropriate for building medical expert systems and thus are adopted in this thesis research.

1.2.1 Rule-Based Systems

An expert system which deals with probable reasoning will have a knowledge base which consists of a qualitative structure of the domain model and a quantitative uncertainty calculus. The same qualitative structure can usually accommodate different uncertainty calculuses. Perhaps the most widely used structure in expert systems is a set of rules in rule-based systems. All the methods for representing uncertainty reviewed in this section can be incorporated with rule-based systems. Rules in such systems consist of antecedent-conclusion pairs, “if X, then Y” with the reading “if antecedent X is true, then conclusion Y is true”. A rule encodes a piece of knowledge.

An inference network is a directed acyclic graph (DAG) in which each node is labeled with a proposition. The set of rules in a rule-based system, which does not contain cyclic rules, can be represented by an inference network in which each arc corresponds to a rule with direction from antecedent to conclusion. The ‘roots’ of an inference network are labeled with variables about which the user is expected to supply information. The ‘leaves’ are variables of interest. Figure 1.1 is an inference network representing 4 rules: “if B, then A”, “if C, then A”, “if D, then A” and “if E, then D”. B, C, E are roots and A is a leaf. Below, rule-based systems are considered in terms of inference networks.

1.2.2 The Method of Odds Likelihood Ratios

This subsection reviews the representation of uncertainty in rule-based expert systems by odds and likelihood ratios, and the propagation of evidence under this representation. Such an approach is used in the expert system PROSPECTOR [Duda et al. 76] which helps geologists evaluate the mineral potential of exploration sites.
The formulation of Neapolitan [1990] is followed.

Figure 1.1: An inference network

With the method of odds likelihood ratios, the uncertainty of the rules is represented by conditional probabilities in the form $p(X|Y)$ where X is the antecedent and Y is the conclusion¹. Suppose for each rule “if X, then Y”, the conditional probabilities $p(X|Y)$ and $p(X|\bar{Y})$ are determined; and for each node Y in the corresponding inference network, except for the roots, the prior probability $p(Y)$ is determined. Then the prior odds on Y is defined as $O(Y) = p(Y)/p(\bar{Y})$. The posterior odds on Y upon learning that X is true is defined as $O(Y|X) = p(Y|X)/p(\bar{Y}|X)$. The likelihood ratio of X is defined as $L(X|Y) = p(X|Y)/p(X|\bar{Y})$. Notice that $p(Y) = O(Y)/(1 + O(Y))$.

¹ In Bayesian networks the order of X and Y will be reversed.

With the above definitions, the evidence propagation can be illustrated with the example in Figure 1.2. As in the figure, likelihood ratios are stored at each arc, and prior probabilities are stored at each node except for roots. If one has evidence that E is true, then $O(D|E) = L(E|D)\,O(D)$. To propagate the evidence to A in a simple way, it is assumed that A and E are conditionally independent given D. Then $p(A|E) = p(A|D)\,p(D|E) + p(A|\bar{D})\,p(\bar{D}|E)$. Note the assumption is not valid if multiple paths exist from E to A. If in another case, one has evidence that B and C are true, one would like to know $O(A|BC)$ where the concatenation implies ‘AND’. Assume that B and C are conditionally independent given A, then $O(A|BC) = L(B|A)\,L(C|A)\,O(A)$. Again, the assumption is not valid in a multiply connected network.

Figure 1.2: An inference net using the method of odds likelihood ratios

Several limitations of the method of odds likelihood ratios have been reviewed by Neapolitan [1990]:

1. The method is restricted to singly connected networks. Many application domains cannot be represented.

2. An inference network with odds and likelihood ratios associated can only be used for reasoning in the designed direction but not in the opposite direction.

3. For all the nodes except roots, prior probabilities are to be specified. Since priors are population specific, they are difficult to ascertain. A medical expert system constructed using data from one clinic might not be deployable in another clinic because the population there is different.

1.2.3 MYCIN certainty factor

The most widely used uncertainty calculus in rule-based systems is the certainty factor, which was originally used in MYCIN [Buchanan and Shortliffe 84]. General probabilistic reasoning based on probability theory required the specification of an exponentially large number of data, which led to intractable computation associated with belief updating [Szolovits and Pauker 78, Buchanan and Shortliffe 84]. For example, a full joint probability distribution over a domain with $n$ binary variables requires specification of $2^n - 1$ parameters. The parameters had to be updated when new evidence became available. The primary goal in creating the MYCIN certainty factor was to provide a method to avoid this difficulty.

In a rule-based system using certainty factors for reasoning under uncertainty, a rule “if X, then Y” is associated with a certainty factor $CF(Y, X) \in [-1, 1]$ to represent the change in belief about Y given the verity of X. $CF(Y, X) > 0$ corresponds to increase in belief, $CF(Y, X) < 0$ corresponds to decrease in belief, and $CF(Y, X) = 0$ corresponds to no change in belief. Representing a rule-based system by an inference network, the certainty factors are stored at arcs as in Figure 1.3.

Figure 1.3: An inference net using the MYCIN certainty factors

The MYCIN certainty factor model provides a set of rules for propagating uncertainty through an inference network. Figure 1.3 illustrates sequential and parallel combination in an inference network.
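As a rough numerical illustration of the sequential and parallel combination rules of this subsection, the following Python sketch evaluates both. It assumes the commonly cited two-branch form of each rule. The function names, the argument `cf_a_not_d` (the factor attached to the rule with the negated antecedent), the treatment of a zero factor, and all numeric values are assumptions made for this example and are not taken from MYCIN or from any existing library.

```python
def sequential_cf(cf_d_e: float, cf_a_d: float, cf_a_not_d: float) -> float:
    """Combine CF(D, E) with the factors on the arc from D to A to get CF(A, E)."""
    if cf_d_e > 0:
        # Evidence E increases belief in D: use the rule "if D, then A".
        return cf_d_e * cf_a_d
    if cf_d_e < 0:
        # Evidence E decreases belief in D: use the rule "if not D, then A".
        return -cf_d_e * cf_a_not_d
    # Assumption: no change in belief about D gives no change in belief about A.
    return 0.0


def parallel_cf(cf_a_b: float, cf_a_c: float) -> float:
    """Combine CF(A, B) and CF(A, C) into CF(A, BC)."""
    if cf_a_b * cf_a_c >= 0:
        # Both factors point the same way (or one of them is zero).
        return cf_a_b + cf_a_c * (1 - abs(cf_a_b))
    # Conflicting factors; undefined when one is +1 and the other is -1.
    return (cf_a_b + cf_a_c) / (1 - min(abs(cf_a_b), abs(cf_a_c)))


if __name__ == "__main__":
    # Hypothetical factors for the network of Figure 1.3.
    print(sequential_cf(0.8, 0.5, -0.3))
    print(parallel_cf(0.6, 0.5))
    print(parallel_cf(0.6, -0.4))
```

Two supporting factors reinforce each other without exceeding 1, while conflicting factors partly cancel.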
If one has evidence that E is true, then the certainty factors $CF(D, E)$ and $CF(A, D)$ can be combined to give the certainty factor $CF(A, E)$ by sequential combination

$$CF(A, E) = \begin{cases} CF(D, E)\, CF(A, D) & CF(D, E) > 0 \\ -CF(D, E)\, CF(A, \bar{D}) & CF(D, E) < 0 \end{cases}$$

If instead one has evidence that B and C are true, then the certainty factor $CF(A, BC)$ is given by parallel combination

$$CF(A, BC) = \begin{cases} CF(A, B) + CF(A, C)\,(1 - |CF(A, B)|) & CF(A, B)\, CF(A, C) \geq 0 \\ \dfrac{CF(A, B) + CF(A, C)}{1 - \min(|CF(A, B)|, |CF(A, C)|)} & CF(A, B)\, CF(A, C) < 0 \end{cases}$$

The calculus of the MYCIN certainty factor has many desirable properties which would be expected from an uncertain reasoning technique. However, in the original work on the MYCIN certainty factor, there was no operational definition of a certainty factor [Heckerman 86, Neapolitan 90]. That is, the definition of a certainty factor does not prescribe a method for determining a certainty factor. Without an operational definition there is no way of knowing whether 2 experts mean different things when they assign different certainty factors to the same rule. Heckerman [1986] gives probabilistic interpretations of the MYCIN certainty factor. He shows that these interpretations are monotonic transformations of the likelihood ratio reviewed in Section 1.2.2. Therefore the MYCIN certainty factor makes the same assumptions as the method of odds likelihood ratios and shares the same limitations [Neapolitan 90].

1.2.4 Dempster-Shafer Theory

Unlike probability theory, which assigns probabilities to every member of a set $\Psi$ of mutually exclusive and exhaustive alternatives and requires that the probabilities sum to unity, the Dempster-Shafer (D-S) theory [Shafer 76] assigns basic probability assignments (bpa) to every subset of $\Psi$ (member of $2^{\Psi}$) and requires that the bpas sum to unity. For a proposition C, the bpa assigned to $\{C\}$ and the bpa assigned to $\{\bar{C}\}$ do not have to sum to unity.
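To make the last point concrete, a basic probability assignment can be sketched as a mapping from subsets of the set of alternatives to masses. In this Python illustration the two-element set, the masses, and the helper name `is_valid_bpa` are hypothetical choices for the example; they do not come from the thesis or from any D-S library.

```python
from fractions import Fraction

# Mutually exclusive and exhaustive alternatives: C holds or it does not.
FRAME = frozenset({"C", "not C"})

# Hypothetical bpa: mass is placed on subsets, not only on single alternatives.
bpa = {
    frozenset({"C"}): Fraction(6, 10),
    frozenset({"not C"}): Fraction(1, 10),
    FRAME: Fraction(3, 10),  # mass committed to neither alternative
}


def is_valid_bpa(masses, frame):
    """Check that every set is a subset of the frame and the masses sum to one."""
    return all(s <= frame for s in masses) and sum(masses.values()) == 1


# Mass given to {C} and to {not C} taken together.
assigned_to_singletons = bpa[frozenset({"C"})] + bpa[frozenset({"not C"})]

print(is_valid_bpa(bpa, FRAME))   # True
print(assigned_to_singletons)     # 7/10
```

Exact fractions are used only to avoid floating-point rounding in the sum check.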
Thus D-S theory allows some of the probability to be unassigned, that is, it accepts an incomplete probabilistic model when some parameters (either prior probabilities or conditional probabilities) are missing [Pearl 88]. In this sense, D-S theory is an extension of probability theory [Neapolitan 90]. Medical diagnosis involves repetition and it is possible to obtain a complete probability model. The model may not be accurate and further refinement may be required (Chapter 6).

Unlike probability theory, which calculates the conditional probability that Y is true given evidence X, the D-S approach calculates the probability that the proposition Y is provable given the evidence X and given that X is consistent. Pearl [1988] argues that D-S theory offers a better representation when the task is one of synthesis (e.g., the class scheduling problem) where external constraints are imposed and the concern centers on issues of possibility and necessity; he also argues that probability theory is more suitable for diagnosis.

A pragmatic advantage of probability theory over D-S theory is that it is well founded and well known. An expert system based on a well known theory will certainly benefit in the knowledge acquisition and in communicating with users.

1.2.5 Fuzzy Sets

Fuzzy set theory [Zadeh 65] deals with propositions which have vague meaning such as “the symptom is severe” or “the median nerve conduction velocity is normal”. It associates a real number from [0, 1] with the membership of a particular element in a set. For example, if a patient has a median nerve conduction velocity of 58 m/sec, the result has membership 1 in the set of ‘normal’ results. If the velocity is 44 m/sec, the result has membership 0 in the ‘normal’ set.
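A membership function of this kind can be sketched in Python as follows. The piecewise-linear shape, the breakpoint table, and the name `normal_membership` are assumptions made for illustration; only the example values quoted in this subsection are taken from the text, and a real grading of conduction velocities would have to come from the domain expert.

```python
# Hypothetical breakpoints (velocity in m/sec, membership in the 'normal' set).
BREAKPOINTS = [(44.0, 0.0), (50.0, 0.7), (58.0, 1.0)]


def normal_membership(velocity: float) -> float:
    """Degree, in [0, 1], to which a median nerve conduction velocity is 'normal'."""
    if velocity <= BREAKPOINTS[0][0]:
        return BREAKPOINTS[0][1]
    if velocity >= BREAKPOINTS[-1][0]:
        return BREAKPOINTS[-1][1]
    # Interpolate linearly between the two neighbouring breakpoints.
    for (x0, y0), (x1, y1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if x0 <= velocity <= x1:
            t = (velocity - x0) / (x1 - x0)
            return y0 + (y1 - y0) * t
    raise ValueError("velocity outside the breakpoint table")


if __name__ == "__main__":
    for v in (44, 47, 50, 58):
        print(v, normal_membership(v))
```

Unlike a probability, the returned value expresses how well the result fits the vague label ‘normal’, not how likely the result is.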
When the velocity value is 50 m/sec, the result has a partial membership, say, 0.7 in the ‘normal’ set.

Some researchers [Pearl 88, Neapolitan 90] view fuzzy set theory as addressing a fundamentally different class of problems than those addressed by probability theory, certainty factors and the Dempster-Shafer theory. All but fuzzy set theory deal with well defined propositions which are definitely either true or false. One is simply uncertain as to the outcome. Cheeseman [1986, 1988a, 1988b] makes a strong claim that fuzzy sets are unnecessary (for representing and reasoning about uncertainty, including vagueness), and that probability theory is all that is required. He shows how probability theory can solve the problems that the fuzzy approaches claim probability cannot solve.

1.2.6 Limitations of Rule-Based Systems

Sections 1.2.2 and 1.2.3 reviewed 2 methods for reasoning under uncertainty in rule-based systems. This subsection concentrates on the questions: Are rule-based systems suitable for probable reasoning? In which situations are they an appropriate knowledge representation?

The basic principle of rule-based systems is modularity. A rule “if X, then Y” has the procedural interpretation: “If X is in the knowledge base, then regardless of what other things the knowledge base contains and regardless of how X was derived, add Y to the knowledge base” [Pearl 88]. The attractiveness of rule-based systems is their high efficiency in inference. This stems from modularity. However, the principle is valid only if the domain is certain.
That is, rule-based systems are appropriate for applications involving only categorical (as defined by Szolovits and Pauker [1978]) reasoning [Neapolitan 90].

When reasoning under uncertainty, the rule “if X, then Y with uncertainty w” reads as: “If the certainty of X undergoes a change $\delta x$, then regardless of what other things the knowledge base contains and regardless of how $\delta x$ was triggered, modify the current certainty of Y by some amount $\delta y$, which may depend on w, on $\delta x$, and on the current certainty of Y” [Pearl 88]. Pearl discusses three major problems resulting from the principle of modularity in probable reasoning.

• Improper handling of bidirectional inferences. This is a restatement of the third limitation in section 1.2.2. If X implies Y, then verification of Y makes X more credible. Thus the rule “X implies Y” should be used in two different directions: deductive, from antecedent to conclusion, and abductive, from conclusion to antecedent. However, rule-based systems require that the abductive inference be stated explicitly by another rule and, even worse, that the original rule be removed. Otherwise, a cycle would be created.

• Difficulties in retracting conclusions. Two rules “If the ground is wet then it rained” and “If the sprinkler was on then the ground is wet” may be contained in a system. Suppose “the ground is wet” is found. The system will conclude “it rained” by the first rule. But if later “the sprinkler was on” is found, the certainty of “it rained” should be decreased significantly. However, this can not be implemented naturally in a rule-based system.

• Improper treatment of correlated sources of evidence. Recall that both the odds likelihood ratio method and the certainty factor method assume singly connected inference networks. Rule-based systems respond only to the degrees of certainty of antecedents and not to the origins of these degrees.
As a result the same conclusions will be made whether the degree of certainty originates from identical or independent sources of information.

Heckerman and Horvitz [1987] analyze the inexpressiveness of rule-based systems.

• Only binary variables (proposition variables) allowed. When multi-valued random variables are needed, they have to be broken down into several binary ones and the naturalness of representation is lost.

• Multiple causes. When multiple causes exist, they are represented by an exponential number of binary variables, and the representation can not accommodate different evidence patterns.

With these problems in mind, one must conclude that generally rule-based systems are not appropriate for applications that require probable reasoning.

1.3 Representing Probable Reasoning In Bayesian Networks

This section briefly introduces Bayesian networks as a natural, concise knowledge representation method and a consistent inference formalism for building expert systems. The substantial advances of Bayesian network technologies in recent years are reviewed.

1.3.1 Basic Concepts of Probability Theory

Two major interpretations of probability exist: the objective or frequentist interpretation, and the subjective or Bayesian interpretation [Savage 61, Shafer 90, Neapolitan 90]. The former defines the probability of an event E as the limiting frequency of E in repeated experiments. The latter interprets probability as the degree of belief held by a person. In building a medical expert system based on probability theory, one usually has to take a medical expert as the major source of the probabilistic information. Thus the Bayesian interpretation is naturally adopted.

Although philosophically the two interpretations represent two different camps, practically, they are not significantly different as far as physicians’ belief in uncertain relations in medicine is concerned.
The medical literature substantiates the fact that many physicians believe that probabilities, according to the frequentist’s definition, exist, and that the likelihoods which they assign are estimates of these probabilities based on their experiences with frequencies [Neapolitan 90]. My personal experience with neurologists in building the PAINULIM expert system is also in agreement with this viewpoint. The Bayesian formulation of probability will be adopted in this thesis. In Chapter 6, an interaction between the two interpretations is utilized to improve the accuracy of knowledge representation.

The framework of Bayesian probability theory consists of a set of axioms which describe constraints among a collection of probabilities provided by a given person [Pearl 88, Heckerman 90b].

$0 \le p(A|C) \le 1$
$p(C|C) = 1$
If A and B are mutually exclusive, then $p(A \text{ or } B|C) = p(A|C) + p(B|C)$ (sum rule)
$p(AB|C) = p(A|BC)\,p(B|C)$ (product rule)

where A, B are arbitrary events and C represents the background knowledge of the person who provides the probability. The probability $p(A|C)$ represents a person’s belief in A given his background knowledge C. The following rules, to be used in the thesis, can be proved from the above axioms.

$p(\bar{A}|C) = 1 - p(A|C)$ (negation rule)
If $B_1, \ldots, B_n$ are mutually exclusive and exhaustive then
$p(AB_1|C) + \cdots + p(AB_n|C) = p(A|C)$ (marginalization)
$p(A|BC) = p(B|AC)\,p(A|C)/p(B|C)$ (Bayes theorem)

1.3.2 Inference Patterns Embedded In Probability Theory

As a well founded theory, probability theory embeds many intuitive inference patterns, which renders it a suitable uncertainty calculus for probable reasoning. The following briefly discusses some patterns used in medical diagnosis.

Bidirectional reasoning. Probability allows both predictive and abductive (diagnostic) reasoning. Carpal tunnel syndrome (cts) is a common cause of pain in forearm.
If one has a probability p(painful forearm | cts) = 0.75 and also knows John has cts, then one would predict with high confidence that John would suffer from pain in forearm. On the other hand, if one also has p(cts), p(painful forearm), and knows Mary does suffer from pain in forearm, then one can compute p(cts | painful forearm) by applying Bayes theorem. The diagnosis will be based on p(cts | painful forearm).

Context sensitivity. Both cts and thoracic outlet syndrome (tos) cause pain in forearm. The pain in the forearm is equally likely from either cause. Cts has much higher incidence than tos, say p(cts) = 0.25 and p(tos) = 0.02 in a clinic. A patient with painful forearm is about 12 times more likely to have cts than tos. But if later through other means the patient is found to have tos, then the painful forearm can be explained by tos. Cts becomes much less likely. The probability for this case will have p(cts | painful forearm) = HIGH, and p(cts | painful forearm & tos) = LOW. That is, the original belief in cts is retracted in the new context.

Explaining away. In the above example, the confirmation of tos makes the alternative explanation cts for painful forearm less credible. That is, tos explains painful forearm away from cts.

Dynamic dependence. Cts and tos are independent diseases, i.e., knowing only whether a patient has cts tells one nothing about whether he has tos. However, knowing a patient has a painful forearm will render the two diseases related in the diagnostic process. Further evidence supporting one of them will decrease the likelihood of the other.

Readers are referred to Pearl [1988] for an elaborate discussion of the relationship between probability theory and probable reasoning.

1.3.3 Bayesian Networks

Bayesian networks are known in the literature as Bayesian belief networks, belief networks, causal networks, influence diagrams, etc. They have a history in decision analysis [Miller et al. 76, Howard and Matheson 84].
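The context sensitivity and explaining away patterns of section 1.3.2 can be checked numerically. The following is a minimal sketch, not part of the systems described in this thesis. The priors p(cts) = 0.25 and p(tos) = 0.02 and the value p(painful forearm | cts) = 0.75 are taken from the text; reusing 0.75 for tos, combining the two causes by a noisy-OR with no other cause of pain, and all function names are assumptions made only for illustration.

```python
from itertools import product

P_CTS, P_TOS = 0.25, 0.02      # priors quoted in the text
P_PAIN_GIVEN_CAUSE = 0.75      # p(painful forearm | cts) from the text; reused for tos (assumption)

def p_pain(cts, tos):
    """Noisy-OR (assumption): pain is absent only if every present cause fails to produce it."""
    p_absent = 1.0
    for present in (cts, tos):
        if present:
            p_absent *= 1.0 - P_PAIN_GIVEN_CAUSE
    return 1.0 - p_absent

def joint(cts, tos, pain):
    """p(cts, tos, pain) with cts and tos marginally independent."""
    p = (P_CTS if cts else 1 - P_CTS) * (P_TOS if tos else 1 - P_TOS)
    return p * (p_pain(cts, tos) if pain else 1 - p_pain(cts, tos))

def posterior_cts(**evidence):
    """p(cts | evidence), obtained by summing the joint over all consistent worlds."""
    num = den = 0.0
    for cts, tos, pain in product((True, False), repeat=3):
        world = {"cts": cts, "tos": tos, "pain": pain}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        den += joint(cts, tos, pain)
        if cts:
            num += joint(cts, tos, pain)
    return num / den

p_given_tos = posterior_cts(tos=True)                      # no pain observed yet
p_given_pain = posterior_cts(pain=True)                    # context: painful forearm
p_given_pain_and_tos = posterior_cts(pain=True, tos=True)  # context: painful forearm & tos
print(p_given_tos, p_given_pain, p_given_pain_and_tos)
```

Under these assumptions the belief in cts given a painful forearm is about 0.94, and it drops to about 0.29 once tos is confirmed, while knowing tos alone leaves the belief in cts at its prior. The exact numbers depend on the assumed noisy-OR interaction; the qualitative pattern does not.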
Bayesian networks have been actively studied for probable reasoning in AI for about a decade. Formally a Bayesian network [Pearl 88] is a triplet (N, E, P).

• The domain N is a set of nodes each of which is labeled with a random variable characterized by a set of mutually exclusive and exhaustive outcomes. ‘Node’ and ‘variable’ are used interchangeably in the context of Bayesian nets.

• E is a set of arcs such that (N, E) is a DAG. The arcs signify the existence of direct causal influences between the linked variables. The basic dependence assumption embedded in Bayesian nets is that a variable is independent of its non-descendants given its parents.

Uppercase letters (possibly subscripted) in the beginning of the alphabet are used to denote variables, corresponding script letters are used to denote their sample spaces, and corresponding lowercase letters with subscripts are used to denote their outcomes. For example, in the binary case, a variable A has its sample space $\mathcal{A} = \{a_1, a_2\}$, and H has its sample space $\mathcal{H} = \{h_1, h_2\}$. Uppercase letters towards the end of the alphabet are used to denote a set of variables. If $X \subseteq N$ is a set of variables, the space $\Psi(X)$ of X is the cross product of the sample spaces of the variables, $\times_{A \in X} \mathcal{A}$. $\pi_i$ is used to denote the set of parent variables of $A_i \in N$.

• P is a joint probability distribution quantifying the strengths of the causal influences signified by the arcs. P is specified by, for each $A_i \in N$, the distribution of the random variable labeled at $A_i$ conditioned by the values of $A_i$’s parents $\pi_i$ in the form of a conditional probability table $p(A_i|\pi_i)$. $p(A_i|\pi_i)$ is a normalized function mapping $\Psi(\{A_i\} \cup \pi_i)$ to [0, 1]. The joint probability distribution P is

$P = p(A_1 \ldots A_n) = \prod_i p(A_i|\pi_i)$

For medical application, nodes in a Bayesian net represent disease hypotheses, symptoms, laboratory results, etc. Arcs signify the causal relations between them. The probability distribution quantifies the strengths of these relations.
For example, “cts (C) often causes pain in forearm (F)” can be represented by an arc from C to F and a probability $p(f_1|c_1) = 0.75$.

The term ‘causal’ above is to be interpreted in a broad sense. Correspondingly, arcs in a Bayesian net could go in either direction. But in medical applications, it is usual to consider diseases as causes and symptoms as effects. Shachter and Heckerman [1987] show that if arcs in Bayesian nets are directed from diseases to symptoms, the DAG construction is usually easier and the resultant net topologies are usually simpler. Furthermore, conditional probabilities of symptoms given diseases are related to the symptom causing mechanisms of diseases, and are usually independent of the patient population. Thus only priors of diseases need to be changed when an expert system is built at one clinic and used at another location with a different patient population. In contrast, both the priors for symptoms and the probabilities of diseases given symptoms are patient population dependent. Directing arcs from symptoms to diseases will require total revision of probability values in a Bayesian net when the system is to be used with a different population.

As stated above, Bayesian networks combine probability theory with graphical representation of domain models. The necessity of this combination is twofold. For one thing, encoding a domain model with a DAG conveys directly the dependence and independence assumptions made of the domain. The DAG facilitates knowledge acquisition and makes the representation transparent. For another, graphical models allow quick identification of dependencies by examining DAGs locally. Therefore efficient belief propagations are possible [Pearl 88] and the difficulty associated with general probabilistic reasoning (section 1.2.3) can be avoided. The study of inference in Bayesian networks heavily depends on the study of the DAG topologies.
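The triplet (N, E, P) and the factorization of the joint distribution into conditional probability tables can be made concrete with the two-node net C → F used as the example above. This is only an illustrative sketch: the value $p(f_1|c_1) = 0.75$ and the prior $p(c_1) = 0.25$ come from the text, whereas $p(f_1|c_2) = 0.05$ and the dictionary layout are hypothetical choices.

```python
from itertools import product

# Each node maps to (parents, CPT). A CPT maps a tuple of parent outcomes
# to a distribution over the node's own outcomes.
net = {
    "C": ((), {(): {"c1": 0.25, "c2": 0.75}}),
    "F": (("C",), {("c1",): {"f1": 0.75, "f2": 0.25},
                   ("c2",): {"f1": 0.05, "f2": 0.95}}),   # 0.05 is hypothetical
}

def outcomes(node):
    """Outcomes of a node, read off any row of its CPT."""
    _, cpt = net[node]
    return list(next(iter(cpt.values())))

def joint(assignment):
    """p(A_1 ... A_n) as the product of p(A_i | pi_i) over all nodes."""
    p = 1.0
    for node, (parents, cpt) in net.items():
        parent_values = tuple(assignment[q] for q in parents)
        p *= cpt[parent_values][assignment[node]]
    return p

def all_assignments():
    nodes = list(net)
    for values in product(*(outcomes(v) for v in nodes)):
        yield dict(zip(nodes, values))

def marginal(node, value):
    """Marginal probability by brute-force summation of the joint."""
    return sum(joint(a) for a in all_assignments() if a[node] == value)

print(joint({"C": "c1", "F": "f1"}), marginal("F", "f1"))
```

Brute-force summation over all assignments is exponential in the number of variables, which is the difficulty that the propagation algorithms reviewed in section 1.3.4 are designed to avoid.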
In the thesis, when the topology of a Bayesian network is mentioned, it always means the topology of the corresponding DAG.

Probabilities associated with Bayesian networks have different meanings depending on their location of storage in the networks and the stage of inference. Some convention is appropriate to avoid confusion. Before any inference takes place, the probabilities associated with root nodes (with zero in-degree) are called prior probabilities or priors. The probabilities associated with non-root nodes are called conditional probabilities. After the evidence is available, the updated probabilities conditioned on evidence are called posterior probabilities. When it is clear from the context, they are just called probabilities.

1.3.4 Propagating Belief in Bayesian Networks

Inference in a Bayesian network is essentially a problem of propagating changes in belief and computing posterior probabilities as new evidence is obtained. Since the appeal of Bayesian networks for representing uncertain knowledge has become increasingly apparent over the last few years [Howard and Matheson 84, Pearl 86], there has been a proliferation of research seeking to develop new and more efficient inference algorithms. Two classes of approaches can be identified. One class of approaches explores approximation using stochastic simulation and Monte Carlo schemes. A good review on this class is given by Henrion [1990]. Another class of approaches explores specificity in computing exact probabilities. Since this thesis research adopts the exact method, the major advances in this class are reviewed below.

The first breakthrough in efficient probabilistic reasoning in Bayesian networks was made by Kim and Pearl [1983]. They develop an algorithm, applicable to singly connected Bayesian networks (Figure 1.4), for propagating the effect of new observations. Each node in the network obtains messages from its parents ($\pi$ messages) and its children ($\lambda$ messages).
These messages represent all the evidence from the portion of the network lying beyond these parents and children. The single-connectedness guarantees that the information in each message to a node is independent and so local updating can be employed. The algorithm’s complexity is linear in the number of variables.

Figure 1.4: The DAG of a singly connected Bayesian network

Heckerman [1990] provides the QUICKSCORE algorithm which addresses networks of diameter 1 such as the one in Figure 1.5. The upper level consists of disease variables and the lower level consists of ‘finding’ variables. Note the net is multiply connected (Appendix A). The algorithm makes three assumptions. All variables are binary. Diseases are marginally independent and findings are conditionally independent given diseases.² Diseases interact to produce findings via a noisy-OR-gate. The time complexity of QUICKSCORE is $O(n m_1 2^{m_2})$ where n is the number of diseases, $m_1$ is the number of negative findings, and $m_2$ is the number of positive findings.

Figure 1.5: The DAG of a Bayesian network with diameter 1 (upper level: DISEASES; lower level: FINDINGS)

Unfortunately, many applications can not be represented properly by a singly connected net or a net of diameter 1. An example of a general multiply connected Bayesian network is given in Figure 1.6. Pearl [1986] presents loop cutset conditioning as an indirect method for inference in multiply connected networks. Selected variables (the loop cutset) are instantiated to cut open all loops such that the resultant singly connected networks can be solved by $\lambda$-$\pi$ message passing, and the results from the instantiations are combined.

Shachter [1986,1988a] applies a sequence of operations called arc-reversal and barren node reduction to an arbitrary multiply connected Bayesian net.
The process continues until the network contains only those nodes whose posterior distributions are desired with the evidence nodes as immediate predecessors.

Baker and Boult [1990] extend the barren node reduction method to prune a Bayesian network relative to each query instance such that saving in computational complexity is obtained when evidence comes in a batch.

² This implies that the net topology is of diameter 1.

Figure 1.6: The DAG of a general multiply connected Bayesian network

Lauritzen and Spiegelhalter [1988] describe an algorithm based on a reformulation of a multiply connected network. The DAG is moralized and triangulated. Then cliques are identified to form a clique hypergraph of the DAG. The cliques are finally organized into a directed tree of cliques³ which satisfies the running intersection property. They provide an algorithm for the propagation of evidence within this secondary representation. The complexity of the algorithm is $O(p r^m)$ where p is the number of cliques in the clique list, r is the maximum number of alternatives for a variable in the network, and m is the maximum number of variables in a clique. Pearl [1988] proposes a clustering method with many similarities but propagates evidence in a secondary directed tree by $\lambda$-$\pi$ message passing. The directed tree approach and the following junction tree approach all have the advantage of trading compile time for run time in ‘reusable systems’. Here reusable systems mean those systems where the domain knowledge is captured once and is used for multiple cases, as opposed to the decision systems which are created for decision making in a non-repeatable situation.

Jensen, Lauritzen, and Olesen [1990] further improve the directed clique tree approach by organizing the clique hypergraph into an (undirected) junction tree to allow more flexible computation. Similar work was done by Shafer and Shenoy [1988].
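The first step of the secondary-structure methods above, moralization, is simple enough to sketch: the parents of every node are pairwise connected and arc directions are dropped. The function name and the four-node DAG used to exercise it are hypothetical; triangulation, clique identification and tree construction are not shown.

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edges of the moral graph of a DAG given as {node: [parents]}."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))   # keep each arc, dropping its direction
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))       # connect ("marry") parents of a common child
    return edges

# Hypothetical DAG: A -> C <- B, C -> D
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
moral_edges = moralize(dag)
print(sorted(tuple(sorted(e)) for e in moral_edges))
```

In this example the only edge added by moralization is A-B, because A and B share the child C.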
A close look at the junction tree approach is included in Chapter 4.

³ The ‘directed’ property is indicated by Henrion [1990], Neapolitan [1990], and Shachter [1988b] since the cliques are ordered and this order is to be followed in belief propagation.

Chapter 2

THE FEASIBILITY OF FINITE TOTALLY ORDERED PROBABILITY ALGEBRA FOR PROBABLE REASONING

Medical diagnosis is characterized by probable reasoning and a proper uncertainty calculus is thus crucial in knowledge representation for medical expert systems. A finite calculus is plausible since it is close to human language in communicating uncertainty. An investigation into the feasibility of using finite totally ordered probability models for probable reasoning was conducted under Aleliunas’s Theory of Probabilistic Logic [Aleliunas 88]. In this investigation, the general form of the probability algebra of these models and the number of possible algebras given the size of the probability set are derived. Based on this analysis, the problems of denominator-indifference and ambiguity-generation that arise in reasoning by cases and abductive reasoning are identified. The investigation shows that a finite probability model will be of very limited usage. This highlights the importance of infinite totally ordered probability algebras, including probability theory, as uncertainty management formalisms for probable reasoning.

This chapter presents the major results of the investigation. The results are mainly taken from Xiang et al. [1991a]. Section 2.1 discusses the motivation and criteria of the investigation. Section 2.2 presents the mathematical structure of finite totally ordered probability models. Section 2.3 derives the inference rules under these models. Section 2.4 identifies the problems of these models both intuitively and quantitatively. Section 2.5 presents an experiment which exhaustively investigates models of a given size with an example.

2.1 Motivation

The investigation started in the early stage of this thesis research when two engineering applications of AI in neurology, EEG analysis and neuromuscular diagnosis, were under consideration. Probable reasoning is the common feature of both domains. When consulted about the formalism of representing uncertainty in EEG analysis, the experts claimed that they did not use numbers, but rather used a small number of terms to describe uncertainty. This motivated a desire for a formal finite non-numerical uncertainty calculus. In such a calculus, the domain expert’s vocabulary about uncertainty could be used directly in encoding knowledge and in reasoning about uncertain information. This would facilitate knowledge acquisition and make the system’s diagnostic suggestion and explanation more understandable.

There were few known finite calculi for general uncertainty management [Pearl 89, Halpern and Rabin 87], but Aleliunas’ probabilistic logic [Aleliunas 88] was explored, because it seemed to be based on clear intuitions, and to allow measures of belief (probability values) to be summarized by values other than just real numbers.

Aleliunas [1988] presents an axiomatization for a theory of rational belief, the Theory of Probabilistic Logic (TPL). It generalizes classical probability theory to accommodate a variety of probability values rather than just [0, 1]. According to the theory, probabilistic logic is a scheme for relating a body of evidence to a potential conclusion (a hypothesis) in a rational way, using probabilities as degrees of belief. ‘$p(P|Q)$’ stands for the conditional probability of proposition P given the evidence Q, where P and Q are sentences of some formal language L consisting of boolean combinations of propositions. TPL is chiefly concerned with identifying the characteristics of a family F of functions from $L \times L$ to the set of probabilities P.
The probability values P are not constrained to be just [0, 1], but can be any values that conform to a set of reasonably intuitive axioms (Appendix B.1).

The semantics of TPL is given by ‘possible worlds’. Each proposition P is associated with a set of situations or possible worlds S(P) in which P holds. Given Q as evidence, the conditional probability $p(P|Q)$, whose value ranges over the set P, is some measure of the fraction of the set S(Q) that is occupied by the subset S(P&Q).

TPL provides minimum constraints for a rational belief model. For the application domain in question the following criteria are thought desirable:

R1 The domain experts do not express and communicate uncertainty using numerical values in their practice. Their language consists of a small set of terms ‘likely’, ‘possibly’, etc., used to describe the uncertainty in their domain. Thus a finite set of probability values is required.

R2 Any two probability values in a chosen model should be comparable. An essential task of a medical diagnostic system is to differentiate between a set of competing diagnoses given a patient’s symptoms and history. Totally ordered probabilities are felt to be needed in order to allow for totally ordered decisions when one has to act on the results of the diagnoses.

R3 Inference based on a TPL model should generate empirically intuitive results. That is, the inference outcomes generated with such a model should reflect, as far as possible, the reasonable outcomes reached by a human expert.

Although these criteria are formed from the point of view of the application in question, it is believed that they are shared by many automated reasoning systems making decisions under uncertainty. Based on the first two criteria, the focus is placed on finite totally ordered probability models.

2.2 Finite Totally Ordered Probability Algebras

2.2.1 Characterization

To investigate the mathematical structure (probability algebra) of the probability space, the characterization of any finite totally ordered probability algebra under TPL axioms [Aleliunas 88] is given in the proposition below. For more about universal algebra, see Burris and Sankappanavar [1981] and Kuczkowski and Gersting [1977]. This proposition is a restriction of the Probability Algebra Theorem [Aleliunas 86] (Appendix B.2) to finite totally ordered sets.

The smallest element of P is denoted as 0, and the largest element of P as 1. There are a finite number of other values between 0 and 1.

Proposition 1 A probability algebra defined on a totally ordered finite set P with ordering relation ‘$\le$’ satisfies TPL axioms if

Cond1 An order preserving binary operation ‘*’ (product) is well defined and closed on P.

Cond2 ‘*’ is commutative, i.e., $(\forall p, q \in P)\ p * q = q * p$.

Cond3 ‘*’ is associative, i.e., $(\forall p, q, r \in P)\ p * (q * r) = (p * q) * r$.

Cond4 $(\forall p, q, r \in P)\ (p * q = r) \Rightarrow (r \le \min(p, q))$.

Cond5 No non-trivial zero, i.e., $(\forall p, q \in P)\ p * q = 0 \Rightarrow (p = 0 \lor q = 0)$.

Cond6 $(\forall p, q \in P)\ p \le q \Rightarrow (\exists r \in P)\ p = r * q$. The solution will be denoted as $r = p/q$.

Cond7 $(\forall p \in P)\ 0 \le p \le 1$.

Cond8 $(\forall p \in P)\ p * 1 = p$.

Cond9 A monotone decreasing inverse function i[.] is well defined and closed on P, i.e., $(\forall p < q \in P)\ i[p] > i[q]$.

Cond10 $(\forall p \in P)\ i[i[p]] = p$.

Proof:
One needs to show the equivalence of Cond1, ..., Cond10 to T1, ..., T7 of the Probability Algebra Theorem (Appendix B.2). Cond4 and Cond10 are not obvious in the theorem and they are included in proposition 1 for the convenience of later use. They are proved here from TPL axioms AX1, ..., AX12 (Appendix B.1) directly.

Cond1 and Cond3 are equivalent to T1. Cond2 is equivalent to T5.

Proof of Cond4. With Cond2, it suffices to prove $r \le p$. By AX10, let $p = f(B|A)$ and $q = f(A|i)$.
Then

$p * q = f(B|A) * f(A|i)$
$\le f(B|A) * f(A|A)$ (since $f(A|i) \le f(A|A)$)
$= f(B\&A|A)$ (AX8)
$= f(B|A)$ (AX5)

Cond5 is equivalent to T6. Cond6 is equivalent to T7. Cond7 is equivalent to T4. Cond8 and Cond4 are equivalent to T2. Cond9 is equivalent to T3. Cond10 is implied by AX3. □

From now on any Finite Totally Ordered Probability Algebra satisfying proposition 1 is referred to as a legal FTOPA. The general form of all legal FTOPAs is derived in section 2.2.2.

2.2.2 Mathematical Structure

Here only those probability algebras with at least 3 elements are considered.¹ A finite totally ordered probability set with size n is denoted as $P = \{e_1, e_2, \ldots, e_{n-1}, e_n\}$, where $1 = e_1 > e_2 > \cdots > e_{n-1} > e_n = 0$. For example, $P = \{e_1, e_2, e_3, e_4\}$ could stand for {certain, likely, unlikely, impossible}. This linguistic interpretation is left open.

The uniqueness of the inverse function i[.] of any legal FTOPA is given by the following lemma.

Lemma 1 For a legal FTOPA with size n, the inverse is uniquely defined as
$i[e_k] = e_{n+1-k}$ $(1 \le k \le n)$.

Proof:
By Cond10, the inverse is symmetric and it suffices to prove the lemma for $k \le n/2$.

For k = 1, $i[e_1] \ge e_n$ by Cond7. Assume $i[e_1] > e_n$. Then $i[e_n] > i[i[e_1]] = e_1$, which is contradictory to Cond7. Hence $i[e_1] = e_n$.

Suppose $i[e_j] = e_{n+1-j}$ for $j < n/2$. For $k = j + 1 < n/2 + 1$, assume $i[e_{j+1}] < e_{n-j}$, which implies $i[e_{j+1}] \le e_{n+1-j}$. By Cond9, this further implies $e_{j+1} \ge i[e_{n+1-j}] = e_j$, which is contradictory to the ordering $e_{j+1} < e_j$. Thus $i[e_{j+1}] \ge e_{n-j}$.

Assume $i[e_{j+1}] > e_{n-j}$. Then $e_{j+1} < i[e_{n-j}]$, i.e., $i[e_{n-j}] \ge e_j$. By Cond9, $e_{n-j} \le i[e_j] = e_{n+1-j}$, which is contradictory to $e_{n-j} > e_{n-j+1}$. □

Thus given the size of a legal FTOPA, only the choice of the product function is left.

A probability $p \in P$ is idempotent if $p * p = p$. Idempotent elements play important roles in defining probability algebras as will be shown in a moment. The following two lemmas describe properties of idempotent elements. Aleliunas [1986] gives a statement similar to that in lemma 3.

¹ Probability algebra with 2 elements is equivalent to propositional logic [Aleliunas 87].

Lemma 2 Any legal FTOPA has at least 3 idempotent elements, namely $e_1$, $e_{n-1}$ and $e_n$.

Proof:
The proof is trivial for $e_1$ and $e_n$. $e_{n-1} * e_{n-1} \le e_{n-1}$ by Cond4 and $e_{n-1} * e_{n-1} > e_n$ by Cond5. Therefore $e_{n-1} * e_{n-1} = e_{n-1}$. □

Lemma 3 For any legal FTOPA, if $p \in P$ is idempotent, then $(\forall q \in P)\ p * q = \min(p, q)$.

Proof:
Assume $q > p$. Then $(\exists r \in P)\ r * q = p$. Therefore $p * p = r * (q * p) = p$ implies $q * p \ge p$. But by Cond4, $q * p \le p$, which proves $q * p = p$.

If $q < p$, then $(\exists r \in P)\ r * p = q$. Then $q * p = r * (p * p) = r * p = q$. □

Proposition 2 For a finite totally ordered set with size $n \ge 3$, there exists only one legal FTOPA with 3 idempotent elements. The ‘*’ operation on it is defined as

$e_i * e_j = e_n$ if i or j = n
$e_i * e_j = e_{\min(i+j-1,\, n-1)}$ otherwise.

Proof:
Let $M_{n,k}$ denote a legal FTOPA with size n and k idempotent elements.² Let $a_{i,j}$ denote $e_i * e_j$. The following proves the proposition constructively.

(1) In case of i or j = n, the proposition holds due to lemma 3. By non-trivial zero, the zero part of the product table is entirely covered within this case.

² In general, for a pair of n and k, there may be more than one legal FTOPA. Thus $M_{n,k}$ does not necessarily stand for a unique model characterized by n and k.

Product table of M3,3:

    *    e1  e2  e3
    e1   e1  e2  e3
    e2   e2  e2  e3
    e3   e3  e3  e3

Product table of M4,3:

    *    e1  e2  e3  e4
    e1   e1  e2  e3  e4
    e2   e2  e3  e3  e4
    e3   e3  e3  e3  e4
    e4   e4  e4  e4  e4

(2) What is left is to prove the non-zero part of the product table (the second half of the product formula) which is bounded by two idempotent elements $e_1$ and $e_{n-1}$. For the completeness of the product table, the zero parts are still included in the following tables although they are not relevant to the remaining proof.

For $M_{3,3}$ and $M_{4,3}$ the proposition holds (see the product tables).
It is not difficult to check that they satisfy proposition 1 and any change to these product tables will violate proposition 1 in one way or another.

Suppose a unique legal FTOPA $M_{m,3}$ exists with product defined as in the proposition. As for $M_{m+1,3}$ (table below), the product $a_{i,j}$ $(i + j \le m)$ should be constructed in the same way as in $M_{m,3}$, i.e., the second half of the product formula

$a_{i,j} = e_{\min(i+j-1,\, m)} = e_{i+j-1}$

applies within this portion as it does in $M_{m,3}$. If this portion could be changed without violating proposition 1, the corresponding portion in $M_{m,3}$ could also be changed, which is contradictory to the uniqueness assumption for $M_{m,3}$.

Further one can show the uniqueness of $a_{i,j}$ for all $(i, j < m < i + j)$.

Product table of Mm+1,3 (rows and columns indexed by e1, ..., em+1):

    row e1:      e1      e2   e3   ...  e_{m-1}  e_m  e_{m+1}
    row e2:      e2      e3   ...  e_{m-1}  a_{2,m-1}  e_m  e_{m+1}
    row e3:      e3      ...  e_{m-1}  a_{3,m-2}  a_{3,m-1}  e_m  e_{m+1}
    ...          ...     a_{j,m-j+1}  ...
    ...          ...     a_{j+1,m-j}  ...
    row e_{m-1}: e_{m-1} ...  ?
    row e_m:     e_m     ...  e_m  e_{m+1}
    row e_{m+1}: e_{m+1} ...

Note: ‘?’ stands for product items to be chosen.

By associativity, one has

$(e_j * e_2) * e_{m-j} = e_{j+1} * e_{m-j} = a_{j+1,m-j}$
$= e_j * (e_2 * e_{m-j}) = e_j * e_{m-j+1} = a_{j,m-j+1}$ $(2 \le j \le m - 2)$ (a)

Also one has

$e_2 * (e_i * e_{m-1}) = e_2 * a_{i,m-1}$
$= (e_2 * e_i) * e_{m-1} = e_{i+1} * e_{m-1} = a_{i+1,m-1}$ $(2 \le i \le m - 2)$ (b)

From the order preserving property of ‘*’, one knows

$a_{i,j} = e_{m-1} \lor a_{i,j} = e_m$ $(i, j < m < i + j)$.

Suppose $a_{2,m-1} = e_{m-1}$. Then from (b),

$e_2 * a_{2,m-1} = e_2 * e_{m-1} = a_{2,m-1} = e_{m-1} = a_{3,m-1}$.

Similarly, and from commutativity and order preserving, one has

$a_{i,j} = e_{m-1}$ $(i, j < m < i + j)$.

This means that $e_{m-1}$ is also an idempotent element, which is contradictory to the 3 idempotent elements assumption. Therefore, $a_{2,m-1} = e_m$. Then from (a) and order preserving, one ends up with $a_{i,j} = e_m$ $(i, j < m < i + j)$. □

The second part of the above proof for the product bounded by $e_1$ and $e_{n-1}$ does not involve the 0 element at all as already stated. Thus for any legal FTOPA with more than 3 idempotent elements, the proposition holds for each diagonal block of its product table bounded by two adjacent idempotent elements.
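The product of Proposition 2 and the inverse of Lemma 1 can be checked against the conditions of Proposition 1 by brute force for small n. The sketch below is an illustration, not the program used in the thesis. Elements are represented by their indices 1..n, so a larger index means a smaller probability value, and all function names are hypothetical.

```python
def times(i, j, n):
    """Index of e_i * e_j under Proposition 2 (index 1 is the value 1, index n is the value 0)."""
    return n if n in (i, j) else min(i + j - 1, n - 1)

def inverse(k, n):
    """Index of i[e_k] according to Lemma 1."""
    return n + 1 - k

def is_legal(n):
    """Brute-force check of Cond1-Cond6 and Cond8-Cond10 on indices 1..n.
    A larger index means a smaller probability, so value comparisons are flipped."""
    idx = range(1, n + 1)
    for p in idx:
        if times(p, 1, n) != p:                        # Cond8: p * 1 = p
            return False
        if inverse(inverse(p, n), n) != p:             # Cond10: i[i[p]] = p
            return False
        for q in idx:
            r = times(p, q, n)
            if r != times(q, p, n):                    # Cond2: commutative
                return False
            if r < max(p, q):                          # Cond4: p * q <= min(p, q)
                return False
            if r == n and n not in (p, q):             # Cond5: no non-trivial zero
                return False
            if p >= q and all(times(s, q, n) != p for s in idx):
                return False                           # Cond6: a solution p / q exists
            if p > q and inverse(p, n) >= inverse(q, n):
                return False                           # Cond9: inverse is decreasing
            for s in idx:
                if times(r, s, n) != times(p, times(q, s, n), n):
                    return False                       # Cond3: associative
                if p >= q and times(p, s, n) < times(q, s, n):
                    return False                       # Cond1: order preserving
    return True

def idempotents(n):
    """Indices k with e_k * e_k = e_k."""
    return [k for k in range(1, n + 1) if times(k, k, n) == k]

table_4 = [[times(i, j, 4) for j in range(1, 5)] for i in range(1, 5)]
print(table_4, idempotents(8), all(is_legal(n) for n in range(3, 9)))
```

Cond7 holds by construction of the index set, so it is not tested. The check confirms legality and the three idempotent elements for the sizes tried; it does not by itself establish the uniqueness claimed in Proposition 2.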
The non-diagonal part of the product table is totally determined by lemma 3, the order preserving property and the solution existence property. Thus one has the following theorem 1. Given proposition 2 and the above description, the proof is trivial.

Theorem 1 Given a finite totally ordered set $P = \{e_1, e_2, \ldots, e_n\}$ with ordering relation $e_1 > e_2 > \cdots > e_n$ and a set I of indexes of all the idempotent elements on P, $I = \{i_1, i_2, \ldots, i_m\}$ where $i_1 < i_2 < \cdots < i_m$, there exists a unique legal FTOPA whose product function is defined as

$e_j * e_k = \min(e_j, e_k)$ if $j = i_l$
$e_j * e_k = e_{\min(j+k-i_l,\, i_{l+1})}$ if $i_l < j, k < i_{l+1}$
$e_j * e_k = e_k$ if $j < i_l < k$

and whose inverse function is defined as

$i[e_k] = e_{n+1-k}$.

Theorem 1 says that, given the set of idempotent elements, a legal FTOPA is totally defined. From theorem 1 and lemma 2 one can easily derive the following corollary.

Corollary 1 The number of all the possible legal FTOPAs of size $n \ge 3$ is

$N(n) = \sum_{i=0}^{n-3} C_{n-3}^{i} = 2^{n-3}$

where $C_m^i$ is the number of combinations taking i elements out of m.

Theorem 1 and corollary 1 provide the possibility of exhaustive investigation for any legal FTOPA of a given size.

2.2.3 Solution and Range

Once a legal FTOPA is defined, its solution table is forced. Inverses to the operation $* : P \times P \to P$ will not be unique. For this reason, it is necessary to introduce a probability range denoted by [l, u] representing all the probability values between lower bound l and upper bound u.

$[l, u] = \{v \in P \mid l \le v \le u\}$

[v, v] is written as just v. One has the following corollary on the solution of single valued probabilities. Its proof can be found in Appendix D.

Corollary 2 Given a finite totally ordered set $P = \{e_1, e_2, \ldots, e_n\}$ with ordering relation $e_1 > e_2 > \cdots > e_n$ and a set I of indexes of all the idempotent elements on P, $I = \{i_1, i_2, \ldots, i_m\}$ where $i_1 < i_2 < \cdots < i_m$, the solution function (multiple value) of a legal FTOPA is forced to be:

$e_k / e_j = e_k$ if $j = 1$
$e_k / e_j = e_n$ if $k = n,\ j \ne n$
$e_k / e_j = e_{k-j+i_l}$ if $k > j,\ i_l + 1 \le k \le i_{l+1} - 1,\ i_l + 1 \le j \le i_{l+1} - 1$
$e_k / e_j = [e_{i_l}, e_1]$ if $k = j,\ i_l < k < i_{l+1}$
$e_k / e_j = [e_{i_{l+1}}, e_{i_l + i_{l+1} - j}]$ if $k = i_{l+1},\ i_l < j < i_{l+1}$
$e_k / e_j = [e_k, e_1]$ if $k = j = i_l,\ k \ne 1, n$
$e_k / e_j = e_k$ if $j \le i_l < k < i_{l+1}$

In Appendix C.1, the product and solution tables for three legal FTOPAs of size 8 are presented.

The solution of two single valued probabilities may become a range which will participate in further manipulation. Thus the product and solution of ranges should be considered before we can manipulate uncertainty in an inference chain.

Definition 1 For any legal FTOPA, the product of two ranges [a, b] $(a \le b)$ and [c, d] $(c \le d)$ is defined as

$[a, b] * [c, d] = \{z \mid \exists x \in [a, b]\ \exists y \in [c, d],\ z = x * y\}$.

And the solution of the above two ranges with additional constraint $a \le d$ is defined as

$[a, b]/[c, d] = \{z \mid \exists x \in [a, b]\ \exists y \in [c, d],\ x = y * z\}$.

One can prove the following proposition.

Proposition 3 For any legal FTOPA, the product of two ranges [a, b] $(a \le b)$ and [c, d] $(c \le d)$ is

$[a, b] * [c, d] = [a * c,\ b * d]$.

And the solution of the above two ranges with additional constraint $a \le d$ is

$[a, b]/[c, d] = [LB(a/d),\ UB(b/c)]$ if $b \le c$
$[a, b]/[c, d] = [LB(a/d),\ e_1]$ if $b > c$

where LB and UB are lower and upper bounds of ranges.

It should be noted that, in general, product and solution of legal FTOPAs do not commute with each other. For example, in a model $M_{8,k}$ of size 8,

$(e_2 * e_5)/e_5 = [e_5, e_1] \ne e_2 * (e_5/e_5) = [e_5, e_2]$.

Thus the order of product and solution in the evaluation of the conditional probability

$p(A|B\&C) = p(A\&B|C)/p(B|C) = (p(B|A\&C) * p(A|C))/p(B|C)$

can not be changed arbitrarily.

2.3 Bayes Theorem and Reasoning by Case

Having derived the mathematical structure of legal finite totally ordered probability models, one needs deductive rules. In this investigation, the DAG and the conditional independence assumption made in Bayesian nets are adopted as the qualitative part of the knowledge representation. Instead of using probability theory to represent uncertainty as in Bayesian nets, the legal FTOPAs are used.
Call the resulting overall representation a quasi-Bayesian net. The quasi-Bayesian nets are implemented by PROLOG programs with the approach described by Poole and Neufeld [1988]. The inference rules thus required are Bayes theorem and reasoning by cases.

Bayes theorem provides a way of determining the possibility of certain causes from the observation of effects.3 It takes the form:

$p(P|Q\&C) = p(Q|P\&C) * p(P|C) / p(Q|C)$

which is the same as the one introduced in section 1.3.1 for probability theory. Here, however, the calculation must follow the order in which the operations appear, as discussed in section 2.2.3.

3 Cause and effect are used here in a broad sense.

Reasoning by cases is an inference rule to compute a conditional probability by partitioning the condition into several mutually exclusive situations such that the estimation under each of them is more manageable. The simplest form considers the cases where B is true and where B is false:

$p(A|C) = p((A\&B) \lor (A\&\bar{B}) \mid C)$

Under probability theory, it can be derived from the marginalization rule in section 1.3.1:

$p(A|C) = p(A|B\&C) \cdot p(B|C) + p(A|\bar{B}\&C) \cdot p(\bar{B}|C)$

Using TPL, the $\cdot$ becomes $*$, and one does not have the $+$. This can, however, be simulated using product and inverse. The corresponding formula under TPL is given by the following proposition.

Proposition 4 Let A, B, and C be three sentences. $p(A|C)$ can be computed using the following:

$f_1 = i[\,p(A|\bar{B}\&C) * i[p(B|C)]\,]$
$f_2 = i[\,p(A|B\&C) * p(B|C) / f_1\,]$
$p(A|C) = i[f_1 * f_2]$.
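As a sanity check on Proposition 4, the three steps can be instantiated in ordinary probability theory, where the product is multiplication, the solution is division and the inverse is $i[x] = 1 - x$. The following Python sketch is an illustration only and is not the PROLOG implementation used in this investigation; the function names are hypothetical, and the check assumes $f_1 \neq 0$.

```python
def inverse(x):
    """Inverse i[x] of ordinary probability theory (an assumption of this sketch)."""
    return 1.0 - x


def reasoning_by_cases(p_a_given_b_c, p_a_given_notb_c, p_b_given_c):
    """Compute p(A|C) with product, solution and inverse only (Proposition 4)."""
    # f1 = i[ p(A|~B&C) * i[p(B|C)] ]
    f1 = inverse(p_a_given_notb_c * inverse(p_b_given_c))
    # f2 = i[ p(A|B&C) * p(B|C) / f1 ]  (product first, then solution)
    f2 = inverse((p_a_given_b_c * p_b_given_c) / f1)
    # p(A|C) = i[ f1 * f2 ]
    return inverse(f1 * f2)


def marginalization(p_a_given_b_c, p_a_given_notb_c, p_b_given_c):
    """Reference value from the usual sum over the two cases."""
    return (p_a_given_b_c * p_b_given_c
            + p_a_given_notb_c * (1.0 - p_b_given_c))
```

For the inputs 0.9, 0.2 and 0.3 both routes give 0.41 up to floating point rounding, which shows how the missing ‘+’ is simulated by two inverses around a product.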
Using quasi-Bayesian nets, the probability of a hypothesis given some set of evidence can be computed by applying the two inference rules, namely, Bayes theorem and reasoning by cases [Poole and Neufeld 88].

2.4 Problems With Legal Finite Totally Ordered Probability Models

2.4.1 Ambiguity-generation and Denominator-indifference

With the mathematical structure of legal finite totally ordered probability models and the form of the relevant deductive rules derived, one can assess how well these probability models fit in with intuition.

To begin with, examine the solution of the legal FTOPA which has all its elements idempotent. The solution takes the form of (compare to Appendix C.1)

$e_k / e_j = \begin{cases} e_k & \text{if } j < k \\ [e_j, e_1] & \text{if } k = j \end{cases}$

Note that $e_j$ does not have direct influence on the result of the first case of the solution. Name this phenomenon denominator-indifference. Also, name the emergence of a range in the second case of the solution operation ambiguity-generation.

To analyze the effect of denominator-indifference and ambiguity-generation on the application of Bayes theorem, apply Bayes theorem to

$p(A|B\&C) = p(A\&B|C)/p(B|C) = \begin{cases} p(A\&B|C) & \text{if } p(A\&B|C) \ne p(B|C) \\ [p(A\&B|C), e_1] & \text{if } p(A\&B|C) = p(B|C) \end{cases}$

In the first case, the probability $p(B|C)$ does not affect the estimation of $p(A|B\&C)$ due to denominator-indifference. In the second case, ambiguity-generation produces a disjunct of all the probabilities no smaller than $p(B|C)$, which is a very rough estimation. Neither satisfies the requirement for empirically satisfactory probability estimates.

To analyze the effect of denominator-indifference and ambiguity-generation on reasoning by cases, consider applying proposition 4 to $p(A|C)$:

$p(A|C) = \begin{cases} \max(p(A\&B|C),\, p(A\&\bar{B}|C)) & \text{if } p(A\&B|C) \ne p(A\&\bar{B}|C) \\ [\max(p(A\&B|C),\, p(A\&\bar{B}|C)),\, e_1] & \text{if } p(A\&B|C) = p(A\&\bar{B}|C) \end{cases}$

Here again, in the first situation, denominator-indifference forces a choice of outcome from one case or another instead of giving some combination of the two outcomes. One does not get an estimation larger than both, which is contrary to intuition. In the second situation, a very rough estimation appears because of ambiguity-generation. Note that, when $\max(p(A\&B|C), p(A\&\bar{B}|C))$ is small, $p(A|C)$ can span almost the whole range of the probability set P.

The analysis here is in terms of a model that has all of its values idempotent. The other case to consider is what happens at the values between the idempotent values. Consider $M_{n,3}$, which has the minimal number of idempotent elements. By proposition 2, its product is

$e_i * e_j = \begin{cases} e_{\min(i+j-1,\,n-1)} & \text{if } i, j < n \\ e_n & \text{otherwise.} \end{cases}$

Its solution simplifies to (compare to Appendix C.1)

$e_k / e_j = \begin{cases} e_n & \text{if } k = n > j \\ e_{k-j+1} & \text{if } j \le k < n-1 \\ [e_{n-1}, e_{n-j}] & \text{if } k = n-1 \end{cases}$

In this algebra, it is quite easy for a manipulation to reach the probability value $e_{n-1}$:

1. Whenever one of the factors of a product is $e_{n-1}$, the product will be $e_{n-1}$ unless the other factor is $e_n$.
2. Whatever takes the value $e_2$, its inverse will be $e_{n-1}$.
3. Products of low or moderate probability tend to reach $e_{n-1}$ due to the quick decrease of the product.
4. $e_{j+1}/e_j = e_2$ for all $2 \le j < n-2$.

Once $e_{n-1}$ is reached, any solution will be ambiguous. This ambiguity will be propagated and amplified during further inference in Bayesian analysis or case analysis. Although $e_{n-1}$ is a value one should try to avoid, there is no means to avoid it.
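The behaviour of $M_{n,3}$ described above can be explored with a small sketch. The code below is an illustration in Python, not the PROLOG programs of this investigation: it encodes the product of proposition 2 on the indexes 1..n (index 1 for the largest value, index n for the 0 element) and obtains solutions by exhaustive search rather than from a solution table. The function names are hypothetical.

```python
def product_mn3(i, j, n):
    """Index of e_i * e_j in M_{n,3}.

    e_i * e_j = e_min(i+j-1, n-1) if i, j < n, and e_n otherwise.
    """
    if i < n and j < n:
        return min(i + j - 1, n - 1)
    return n


def solutions_mn3(k, j, n):
    """Indexes z such that e_j * e_z = e_k, i.e. all candidates for e_k / e_j.

    Found by brute force over the n elements; more than one index
    means the solution is a range (ambiguity-generation).
    """
    return [z for z in range(1, n + 1) if product_mn3(j, z, n) == k]
```

With n = 8, `product_mn3(4, 5, 8)` already yields index 7, i.e. $e_{n-1}$, and `solutions_mn3(7, 3, 8)` returns the indexes 5, 6 and 7, i.e. the range $[e_7, e_5]$, whereas `solutions_mn3(5, 3, 8)` returns the single index 3.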
Here one sees an interesting trade-off between the two problems. In $M_{n,3}$, the denominator-indifference disappears. But, since manipulations under this model move probability values quickly, they tend to produce $e_{n-1}$ more frequently and thus one suffers more from the ambiguity-generation.

As all finite totally ordered probability algebras can be seen as combinations of the above two cases, they must all suffer from denominator-indifference and ambiguity-generation. The question is how serious the problems are in an arbitrary model. This is to be answered in the next section.

2.4.2 Quantitative Analysis of the Problems

Given the constraint of legal FTOPA in choosing a probability model, one is free to select the model size n and to select among alternative legal FTOPAs once n is fixed. A few straightforward measurements are introduced to quantify the degree of suffering in a randomly chosen model.

The number of ranges in a model’s solution table and the number of elements covered by each range mirror the problem of ambiguity-generation of the model. Define a measurement of the amount of ambiguity in a model as the number of elements covered by ranges in its solution table minus the number of ranges.

Definition 2 Let $S = \{r_1, r_2, \ldots\}$ be the set of ranges in the solution table of a legal FTOPA. Let $w_j$ be the number of values covered by range $r_j$. Let M be the number of different solution pairs in the solution table.

The amount of ambiguity of the algebra is defined as

$A = \sum_j (w_j - 1)$.

The relative ambiguity of the algebra is defined as

$R = A/M$.

Example 1 The three legal FTOPAs with size 8 in Appendix C.1 all have $A = 21$ and $R = 0.6$.

One has the following proposition.
The proof can be found in Appendix D.

Proposition 5 The amount of ambiguity of any legal FTOPA with size n is

$A = (n-1)(n-2)/2$.

The relative ambiguity of the algebra is

$R = (n-2)/(n+2)$.

The number of solution pairs satisfying $e_j/e_k = e_j$ reflects the seriousness of denominator-indifference of the model. Define the order of denominator-indifference as this number minus the number of such $e_j$s.

Definition 3 Let $d_j$ be the number of times $e_j/e_k = e_j$ for $1 \le k < j$ in a legal FTOPA of size n. The order of denominator-indifference of the algebra is defined as

$O_d = \sum_j (d_j - 1)$.

Example 2 The three legal FTOPAs $M_{8,3}$, $M_{8,8}$ and $M_{8,4}$ in Appendix C.1 have $O_d$ values 0, 15 and 9 respectively.

Define the order of mobility of a model to express the chance with which a product or a solution transfers an operand to a different value. The higher this order, the more likely it is for a manipulation to generate an idempotent element and produce ambiguity afterwards.

Definition 4 The order of mobility $O_m$ of a legal FTOPA is defined as the number of distinct product pairs $a * b$ in its product table such that $a * b < \min[a, b]$.

Example 3 The three legal FTOPAs $M_{8,3}$, $M_{8,8}$ and $M_{8,4}$ in Appendix C.1 have $O_m$ values 15, 0 and 6 respectively.

One has the following proposition. The proof can be found in Appendix D.

Proposition 6 For any legal FTOPA with size n and a set I of indexes of all its idempotent elements, $I = \{i_1, i_2, \ldots, i_k\}$ where $i_1 < i_2 < \cdots < i_k$, its order of denominator-indifference is

$O_d = \sum_{m=1}^{k-2} (i_m - 1)(i_{m+1} - i_m)$,

its order of mobility is

$O_m = \sum_{m=1}^{k-2} \sum_{j=1}^{i_{m+1} - i_m - 1} j$,

and

$O_d + O_m = (n-2)(n-3)/2$.

Example 4 The three legal FTOPAs with size 8 in Appendix C.1 all have $O_d + O_m = 15$.

Proposition 5 tells us that all the legal FTOPAs of the same size have the same amount of ambiguity. Increasing size increases R, which approaches 1 as n approaches infinity.

Proposition 6 says that,

1. among legal FTOPAs of the same size n, the order of denominator-indifference $O_d$ changes from lower bound 0 at $M_{n,3}$ to upper bound $(n-2)(n-3)/2$ at $M_{n,n}$;
2. the upper bound of $O_d$ as well as $O_m$ increases with model size n;
3. given n, the sum $O_d + O_m$ remains constant and thus if a model suffers less from denominator-indifference, it must suffer more frequently from ambiguity-generation due to the increase in its mobility.

2.4.3 Can the Changes in Probability Assignment Help?

After having explored model size and alternative models of a given size, the final freedom that remains is the assignment of probability values. From Corollary 2, it is apparent that, in general, denominator-indifference and ambiguity-generation happen only in certain regions of the solution table. So, is it possible, by choosing a certain set of probability values as prior knowledge, to avoid intermediate results falling onto those unfavorable regions?

To help answer this question, a derivation of the conditional probability p(fire | smoke & alarm)4 for a smoke-alarm problem in Figure 2.7 is given in Appendix C.2. The calculation involves 2 applications of Bayes theorem, and 3 of reasoning by cases. It requires 19 products, 9 solutions, and 14 inverses.

4 The four nodes involved form a minimum set which has alternative hypotheses (fire and tampering) and allows accumulation of evidence (smoke + alarm).

In general,

1. a product tends to decrease the probability value until an idempotent value is reached.
2. a solution tends to increase the probability value or cause a large range to occur (especially for idempotent values).
3. an inverse tends to transfer a small value into a big one and vice versa.

Since many operations are required even in a small problem and each operation tends to move the intermediate value around the probability set, the compound effect of the operations is not generally controllable.

To summarize, in the context of legal FTOPA, there seems to be no way to escape the problems of denominator-indifference and ambiguity-generation by means of clever assignment of probability values; increasing model size does no good in reducing the difficulty; selecting among different models trades one trouble for another.

In the next section, these problems are demonstrated by an experiment.

[Figure 2.7: Smoke-alarm example]

2.5 An Experiment

All the 32 legal FTOPAs with size 8 were implemented in PROLOG and their performance was tested with the smoke-alarm example (Figure 2.7) from Poole and Neufeld [1988]. The PROLOG program has basically the same structure, but inverse, product, solution, as well as Bayes theorem and reasoning by cases are redefined.

Table 2.1 lists the probabilities in the knowledge base together with the numerical values used by Poole and Neufeld [1988] for comparison.

Probability    FTOPA value    Numerical value
$p(\text{fire})$    $e_6$    0.01
$p(\text{tampering})$    $e_6$    0.02
$p(\text{alarm} \mid \text{fire} \& \text{tampering})$    $e_4$    0.5
$p(\text{alarm} \mid \text{fire} \& \overline{\text{tampering}})$    $e_2$    0.99
$p(\text{alarm} \mid \overline{\text{fire}} \& \text{tampering})$    $e_2$    0.85
$p(\text{alarm} \mid \overline{\text{fire}} \& \overline{\text{tampering}})$    $e_7$    0.0001
$p(\text{smoke} \mid \text{fire})$    $e_2$    0.9
$p(\text{smoke} \mid \overline{\text{fire}})$    $e_6$    0.01
$p(\text{leaving} \mid \text{alarm})$    $e_2$    0.88
$p(\text{leaving} \mid \overline{\text{alarm}})$    $e_7$    0.001
$p(\text{report} \mid \text{leaving})$    $e_3$    0.75
$p(\text{report} \mid \overline{\text{leaving}})$    $e_6$    0.01

Table 2.1: Probabilities for smoke-alarm example

The following posterior probabilities are calculated in all 32 possible legal FTOPAs with size 8 (using quasi-Bayesian nets) and in the [0, 1] real number probability model (using Bayesian nets) as a comparison.

$p(s|f),\ p(a|f),\ p(s|t),\ p(a|t),\ p(f|s),\ p(f|a),\ p(f|s\&a),\ p(t|s),\ p(t|a),\ p(t|s\&a)$

The first four probabilities are deductive which, given cause acting, estimate the probabilities of effects
appearing. The remaining six are abductive which, given effects observed, estimate the probability of each conceivable cause. The results are included in Appendix C.3.

• Among the 32 legal FTOPAs, eight produce an identical value for the abductive cases:

$p(f|s) = p(f|a) = p(f|s\&a) = e_6$,

and 16 others produce identical ranges for all the abductive cases about fire. According to Table 2.1, smoke does not necessarily relate to fire ($p(s|\bar{f}) = e_6$). Nor does alarm ($p(a|\bar{f}\&t) = e_2$). As a result, observing only one of smoke and alarm, one is not quite sure about fire. Intuitively, adding the positive evidence alarm to smoke should increase one’s belief in fire. As well, adding to alarm the evidence smoke, which is independent of tampering, indicates a higher chance of fire causing the alarm. Thus this intuitive inference arrives at

$p(f|s\&a) > p(f|s) \ \&\ p(f|s\&a) > p(f|a)$

which the results obtained from the above mentioned 24 legal FTOPAs do not fit in with.

To illustrate how this happens, evaluate $p(f|s\&a)$ in model $M_{8,4}$ with idempotent elements $\{e_1, e_5, e_7, e_8\}$.

$p(f|s\&a) = p(s|f\&a) * p(f|a)/p(s|a) = e_2 * e_6/e_4 = e_6/e_4 = e_6$

Pay attention to the solution in the last step. The result is no larger than $p(f|a) = e_6$ due to denominator-indifference. One does not get extra evidence accumulating.

• One of the very useful results provided by [0, 1] numerical probability is that although $p(f|s) = 0.48$ and $p(f|a) = 0.37$ are moderate, when both smoke and alarm are observed $p(f|s\&a) = 0.98$ is quite high, which is more intuitive than the case above. According to Table 2.1, fire is the only event which can cause both smoke and alarm with high certainty ($p(s|f) = p(a|f) = e_2$). Thus observing both simultaneously one would expect a higher probability.
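The [0, 1] reference values quoted above can be reproduced by brute-force enumeration. The sketch below is an illustration in Python and not the Bayesian net program used in the experiment. It assumes the dependency structure suggested by Table 2.1 (fire and tampering are independent a priori, smoke depends on fire only, alarm depends on fire and tampering) and omits leaving and report, which are unobserved descendants of alarm and so do not change these posteriors. All identifiers are hypothetical.

```python
from itertools import product

# Numerical values from Table 2.1.
P_FIRE = 0.01
P_TAMPERING = 0.02
P_SMOKE = {True: 0.9, False: 0.01}             # keyed by fire
P_ALARM = {(True, True): 0.5,                  # keyed by (fire, tampering)
           (True, False): 0.99,
           (False, True): 0.85,
           (False, False): 0.0001}


def joint(fire, tampering, smoke, alarm):
    """Joint probability of one full assignment of the four variables."""
    p = P_FIRE if fire else 1.0 - P_FIRE
    p *= P_TAMPERING if tampering else 1.0 - P_TAMPERING
    p *= P_SMOKE[fire] if smoke else 1.0 - P_SMOKE[fire]
    p *= P_ALARM[(fire, tampering)] if alarm else 1.0 - P_ALARM[(fire, tampering)]
    return p


def posterior_fire(smoke=None, alarm=None):
    """p(fire | evidence); None means the variable is unobserved."""
    numerator = denominator = 0.0
    for f, t, s, a in product([True, False], repeat=4):
        if smoke is not None and s != smoke:
            continue
        if alarm is not None and a != alarm:
            continue
        p = joint(f, t, s, a)
        denominator += p
        if f:
            numerator += p
    return numerator / denominator
```

Rounded to two decimals, `posterior_fire(smoke=True)`, `posterior_fire(alarm=True)` and `posterior_fire(smoke=True, alarm=True)` give 0.48, 0.37 and 0.98, showing the accumulation of evidence that the finite models fail to reproduce.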
But the remaining 8 legal FTOPAs give only an ambiguous $p(f|s\&a)$ spanning at least half of the total probability range.

Consider the evaluation of $p(f|s\&a)$ in model $M_{8,4}$ with idempotent elements $\{e_1, e_4, e_7, e_8\}$.

$p(f|s\&a) = p(s|f\&a) * p(f|a)/p(s|a) = e_2 * e_5/e_5 = e_5/e_5 = [e_4, e_1]$

Notice the solution in the last step.

• In the deductive case, the situation is slightly better. Some models achieve the same tendency as [0, 1] probability in deduction (e.g., $p(s|t) < p(a|t)$). Some achieve the same tendency with increased ambiguity. Others either produce identical ranges for different probabilities or do not reflect the correct trend. The slight improvement is attributable to the fewer operations required in deduction (only reasoning by cases, but not Bayes theorem, is involved). Since reasoning by cases needs the solution operation, it still creates denominator-indifference and generates ambiguity.

Our experiment is systematic with respect to legal FTOPAs of a particular size 8. Although one set of arbitrarily chosen probabilities is used in this presentation, varying them in a non-systematic way has also been tried, and the outcomes are basically the same.

2.6 Conclusion

The motivation of this investigation is to find finite totally ordered probability models to automate reasoning under uncertainty and to facilitate knowledge acquisition and explanation in expert systems.

Under the theory of probabilistic logic [Aleliunas 88], the general form of finite totally ordered probability algebras is derived and the number of different models is deduced such that all the possible models can be explored systematically.

Two major problems of those models are analyzed: denominator-indifference and ambiguity-generation. They are manifested during the processes of applying Bayes theorem and reasoning by cases.
Changes in size, model and assignment of priors do not seem to solve the problems.

All the models with size 8 have been implemented in a PROLOG program and tested against a simple example. The results are consistent with the analysis.

The investigation reveals that under the TPL axioms, finite probability models will have limited usefulness. The premise of legal FTOPA is {TPL axioms, finite, totally ordered}. It is believed that TPL axioms represent the necessity of general inference under uncertainty. ‘Totally ordered’ seems to be necessary for a probability model to be useful. Thus it is conjectured that a useful uncertainty management mechanism cannot be realized in a finite setting.

The result of the investigation does not leave probability theory as the only choice for representation of probable reasoning. But it does highlight infinite totally ordered probability algebras, including probability theory. Since the appeal of Bayesian networks for representing uncertain knowledge has become increasingly apparent, and substantial advances have been made in recent years on reasoning algorithms in Bayesian nets, Bayesian networks are adopted as the framework of this research.

Chapter 3

QUALICON: QUALITY CONTROL IN NERVE CONDUCTION STUDIES BY COUPLING BAYESIAN NETWORKS WITH SIGNAL PROCESSING PROCEDURES

Before a more complicated diagnostic problem was tackled (Chapter 5), a pilot study on Bayesian networks was done with a problem of quality control in nerve conduction studies. The resultant system is QUALICON. The corresponding network has a small size (less than 20 variables and singly connected) and thus is a good testbed. At the same time, the solution to the problem contributes to the information gathering stage of neuromuscular diagnosis.

This chapter presents major issues in the implementation of QUALICON. The results are mainly taken from Xiang et al. [1992]. Section 3.1 introduces the QUALICON domain. Section 3.2 describes the overall structure of QUALICON.
Section 3.3 discusses the knowledge representation of QUALICON. Section 3.4 presents a preliminary evaluation of QUALICON.

3.1 The Problem: Quality Control in Nerve Conduction Studies

Nerve conduction studies are now used routinely as part of the electrodiagnostic examination for neuromuscular diagnosis. Their clinical value is demonstrated in the examination of diseases or injuries which might be difficult to diagnose with (needle) EMG alone. The clinical procedures involve the placement of surface electrodes, delivery of a stimulating impulse to nerve or muscle, and recording of nerve or muscle responses (action potentials).

The procedures are fairly simple to perform, are easily tolerated, and require little cooperation from the patient. Interpretation of the results, however, requires knowledge of the range of normal values, the morphology of normal potentials and, of equal importance, the sources of technical error which may affect the finding [Goodgold and Eberstein 83]. Since abnormalities can be due to technical error or disease, identification of technical error is a major element of quality control in nerve conduction studies.

Contemporary equipment used for nerve conduction studies is usually capable of computerized measurement of latency, amplitude, duration and area of nerve and muscle action potentials and resulting conduction velocities. However, computerized measurements assume that stimulating and recording characteristics and electrode placements are correct; they do not take cognizance of technical acceptability. The development of an appropriate technique for automating the quality control is addressed in this chapter.

3.2 QUALICON: a Coupled Expert System in Quality Control

Since identification of technical errors is based on observations of recorded nerve and muscle responses, the problem is of the nature of diagnosis, i.e., given the abnormal outcomes, explaining the cause.
Since stimuli are applied to and responses are generated through complex biological systems, namely, human bodies, much uncertainty is involved in the interpretation of abnormal responses in terms of possible technical errors. Thus the problem is in many aspects similar to a medical diagnostic problem, and the limitations of conventional techniques and the promise of AI techniques, as discussed at the beginning of this thesis, apply. Since the recorded responses take the form of continuous signals, it is essential to ‘couple’ the expert system technique with the numerical procedures for signal processing in order to develop a feasible technique as a solution.

The term ‘couple’ needs some explanation. The concept of a coupled expert system emerged, when rule-based systems were dominant, from a need to apply expert system techniques to domains where numerical information is largely involved and is processed by conventional numerical procedures. At that time, expert systems were considered ‘symbolic’ since rule-based systems use rules for inference, which differs from conventional ‘numeric’ computation. Therefore, the coupled expert system was defined as a system which links numeric and symbolic computing processes; has some knowledge of the numerical processes embedded; and reasons about the application or results of these numerical processes [Kowalik 86]. In light of newer expert system methodology for probable reasoning, namely Bayesian networks, which compute the (numerical) posterior probability extensively using algorithms, it is not appropriate to characterize expert systems which do not deal with numerical information other than measurement of uncertainty as simply ‘symbolic’.
Thus, my suggestion is to define a coupled expert system as a system which links numerical processes not directly involving inference with computing processes (symbolic or numeric) for inference; has some knowledge of the (non-inferential) numerical processes embedded; and reasons about the application or results of these numerical processes.

A prototype expert system QUALICON, coupling Bayesian networks with numerical procedures for signal processing, has been developed in the thesis research to automate quality control in nerve conduction studies. The system was developed in cooperation with the Neuromuscular Disease Unit (NDU) of Vancouver General Hospital (VGH). Utilizing information on stimulating and recording parameters for a given nerve or muscle action potential, QUALICON compares specific characteristics with values from a normal database, determining the qualitative nature of each feature. Probabilistic reasoning allows QUALICON to provide the user with recommendations on the acceptability of the potential. If it is not technically acceptable, the most likely explanation(s) of the problem is/are provided with information on how to correct it.

Figure 3.8 illustrates the 3 modules comprising the framework of QUALICON, namely: the Feature Extraction (FE) module, the Feature Partition (FP) module and the Probabilistic Inference Engine (PIE) module. There are also 2 knowledge bases: a Normal Database (ND) and a Bayesian Net Specification (BNS).

[Figure 3.8: QUALICON system structure]

FE consists of a set of numerical procedures extracting numeric or symbolic features from an action potential. These routines consult ND in determining their best heuristic strategy. ND contains the normal value ranges (in terms of means and standard deviations) of all the numeric features to be examined. Normal value ranges for distal and proximal are distinguished.
It combines the statistics derived from several hundred normal studies in the NDU of VGH, and the electromyographic “expertise” from neuromuscular specialist A. Eisen and registered EMG technician M. MacNeil. FP partitions the numeric features against ND, translating them into symbolic descriptions of the potential. When all the features are available in symbolic form they are fed into PIE for probabilistic inference. PIE reasons with BNS as the background knowledge and with the symbolic features as new evidence to generate recommendations with respect to any technical error in the test setup.

[Figure 3.8 panel labels: USER; nerve studied, stimulus position, recording position; numeric features, evidence summary, recommendations]

The FE and FP modules are programmed in C, and the PIE module is programmed in PROLOG.

3.3 Development of QUALICON

Important technical issues involved in the development of QUALICON are presented in this section. They concern knowledge representation, robust feature extraction, coupling, and an algorithm for probabilistic reasoning.

3.3.1 Feature Extraction and Partition

Feature Selection

A set of features of numerical data provides a basic interface between the signal processing and Bayesian net components. Two criteria are used in the feature selection. (1) Utilization of human expertise as a resource in building QUALICON requires that the selected features be mainly composed of those used in daily practice. (2) The set of features should have sufficient differential power, i.e., they should enable different potentials to be characterized mostly by different sets of feature values.

Based on the criteria, 6 variables are chosen: LATENCY, DURATION, AMPLITUDE (peak to peak), WAVE SEQUENCE, RATIO, and STIMULATION ARTIFACT. The first 3 are routinely used features. WAVE SEQUENCE is selected to characterize the basic morphology of the potentials.
This feature can take 4 symbolic values: bnpb (baseline negative positive baseline), bpnb, bpnpb, and ab_seq (abnormal sequence) for CMAPs (Compound Muscle Action Potentials) and pnp, npn, np, and ab_seq for SNAPs (Sensory Nerve Action Potentials). bnpb (Figure 3.9(a)) and pnp are the wave sequences seen in normal potentials; bpnb and npn (Figure 3.10(a)) are most often created by reversed polarity of recording electrodes; bpnpb (Figure 3.10(b)) and np usually signify the misplacement of recording electrodes; and ab_seq encapsulates any wave sequences different from the above 3.

The feature RATIO is selected to complement WAVE SEQUENCE. It is defined as the ratio of negative peak amplitude over positive peak amplitude. Although not explicitly used in routine electromyography, this feature is employed implicitly in combination with the basic morphology. In Figure 3.10(b), the ratio is 3.75mV/(4.7−3.75)mV = 3.95. This is larger than normal and therefore the entry in ‘summary’: “Positive dip too shallow”.

STIMULATION ARTEFACT takes 3 symbolic values: negative (upwards), positive (downward), and isoelectric. An otherwise normal potential except STIMULATION ARTEFACT = positive could well be produced by reversing the stimulating polarity. The latency will be prolonged but could still be within normal range. Without using this feature, an abnormal setup might be overlooked (Figure 3.9(b)).

Determine values for feature variables

Some features like STIMULATION ARTEFACT and WAVE SEQUENCE are intrinsically symbolic. Others like LATENCY could be extracted in symbolic form from rough estimates of numeric values. The quality control task does not logically require any features in numeric form since all the features must be symbolic when entering the final symbolic computation stage, and computation savings could be achieved by abandoning accurate numerical measurement. The choice of accurate measurement of LATENCY (and the other variables) is made in order to present the user with values with which they are familiar. The partition into symbolic values follows normally.

[Figure 3.9: QUALICON report: (a) Normal CMAP evoked by tibial nerve stimulation at ankle. Recorded at abd. hallucis; (b) Same but with a reversed stimulus polarity. Each report shows the recorded signal, numerical features, summary and recommendations.]

[Figure 3.10: QUALICON report: (a) Median nerve SNAP recorded at wrist, with stimulation at finger 2 (normal subject). Recording polarity was reversed. (b) The CMAP is recorded from median nerve of a healthy person with stimulation at wrist and with recording electrode misplaced. Each report shows the recorded signal, numerical features, summary and recommendations.]

Two methods determining the onset and end of a CMAP are presented by Meyer and Hilfiker [1983]. ‘Slope threshold’ identifies the 2 critical points when 2% of the maximal slope is reached. ‘Power threshold’ signals the points corresponding to 0.1 and 0.999 of the integrated squared potential. Even though the methods perform well in ‘semiautomatic’ condition, they are sensitive to artifacts without human supervision. For the quality control task, a robust strategy is required.

The problem is treated by developing a technique named guided filtering which (1) introduces the knowledge about the wave morphology and guides the search for critical points; and (2) adopts noise rejection and suppression filtering. Figure 3.11 depicts the feature extraction strategy used by the FE module.

[Figure 3.11: Feature extraction strategy. Flowchart steps: determine stim artefact; consult ND; find all turning points; find global extrema for CMAP; determine wave sequence; guided filtering for onset and end; measure lat., dur., amp., ratio.]

The following are the important steps employed.

• Detecting stimulation artefacts in distal CMAP. The first few samples are averaged and compared with 2 thresholds to determine the symbolic value of the variable. This approach cannot be applied to SNAPs since, when the stimulation polarity is reversed, only a short positive component appears in the stimulation artefact, reflecting the high gain used (Figure 3.12). Thus the following rules are applied to the first T ms of samples: (1) if successive positive samples of TP ms are found with their average greater than N1 µV, STIMULATION ARTEFACT takes the value downward; (2) if the average over T ms is lower than −N2 µV, the value is upwards; (3) otherwise the value is none. The parameters T, TP, N1 and N2 depend on the electromyograph used and the sampling interval.

• ND contains the means and 2 standard deviations for LATENCY and DURATION. The linear combination of these defines a time period P = [l, u] (Figure 3.13). The most significant part of the nerve response lies within P with a probability close to 1. Most of the noise and artefact before the onset or after the end are outside P. The values for the remaining features are searched only through the samples in P.

• In determining turning points within P, another level of noise suppression is adopted to catch the turning points which define the basic patterns of wave sequences. Turning points corresponding to small perturbations are ignored. This technique is similar to that developed for EEG processing [Gotman and Gloor 76]. Two successive turning points (one a local maximum and the other a local minimum) whose amplitudes differ by less than a preset threshold are ignored.

• Once turning points are found, pattern matching suffices to determine the value of WAVE SEQUENCE.

• The onset and end of SNAP coincide with turning points. Estimation of the onset and end of CMAP requires additional search. A digital filter (Figure 3.13) is used in the form:

$y_i = \dfrac{a + \sum_{k=-(w-1)}^{0} (x_{i+k} - x_{i+k-1})^2}{a + \sum_{k=0}^{w-1} (x_{i+k} - x_{i+k+1})^2}$

where x is the sample data, i the time index, 2w + 1 the window width and a a preset constant. The minimum (maximum) of $y_i$ indicates the onset (end) of CMAP. a prevents the denominator from being 0 when the window slides over flat baseline. The filtering over the window width smooths any small perturbation, and provides another level of noise suppression. Increasing the window width strengthens the noise suppression but decreases the sensitivity of the filter. The filtering is only applied to the period $[l, f(y_i)]$ for onset estimation. $f(y_i)$ is initialized to j, the time index of the turning point corresponding to the negative peak of CMAP.
When y drops below a threshold which signifies that the onset is nearby, f(y) takes the value i + β where β is a preset constant. For estimation of the end, the filtering is applied to [d, d + η] (d + η ≤ u) where d is the time index of the turning point corresponding to the positive peak of CMAP, and η is a preset constant determined by the estimation of the maximum possible length between d and the end.

Figure 3.12: SNAP produced by reversing stimulation polarity

Figure 3.13: Illustration of Guided Filtering

In summary, this technique, named guided filtering, attempts to utilize fully the available knowledge and information up to the moment during its search process. Initially, the general knowledge about the nerve-stimulation pair guides the search. As the search progresses, new information about turning points and wave sequence is absorbed to localize further search for the onset and end. It simulates the human heuristic which starts with some expectation and in which the search is guided by information obtained along the way. In this way, not only is the overall processing very efficient, but since in each step the search is kept local, the effect of artefact and noise is minimized.

Partitioning Numeric Features

Partition is also a feature interpretation process. It involves (1) straightforward mapping of numeric features against the normal database ND, for example, into less_than_normal, normal and greater_than_normal; and (2) capturing the dependency in feature interpretation. Two kinds of dependency require consideration in partition. The first is technically obvious. For instance, if WAVE SEQUENCE = ab_seq, no values can be assigned to the variables AMPLITUDE, DURATION etc. In this case, these variables are called inapplicable. Another kind of dependency reflects the experts' context dependent view of numeric features.
For example, in the case of CMAP, if the stimulus is applied proximally, the possibility of LATENCY = less_than_normal will not be considered. This is because individual variation in limb length precludes a standard interstimulus distance. Likewise when AMPLITUDE = normal, even if the DURATION is below the normal range, it is still interpreted as normal. This is probably due to the inability to accurately determine the end of the potential (where it returns to baseline). In both examples, the dimension of certain variables changes depending on the context.

3.3.2 Probabilistic Reasoning

Construction of Bayesian Networks

Due to the difference in nerve conduction studies with respect to CMAPs and SNAPs, two separate Bayesian nets with similar topology are constructed for CMAPs and SNAPs respectively, and only one of them needs to be evaluated in a particular session. The net for CMAP is shown in Figure 3.14. Each box (a node) in the network represents a variable with its name underlined and its alternative values listed. The root is STIMULUS POSITION, which affects the distribution of OPERATION, which in turn determines the likelihood of observing different feature values.

Figure 3.14: Bayesian subnet for CMAP. The node OPERATION takes the values normal, rev_stim_polarity, rev_rec_polarity, misplace_stim_elect, misplace_rec_elect, submax_stim and supermax_stim.

A system dealing with technical errors must make assumptions about the knowledge and skill levels of potential users. The intended users of QUALICON are residents, fellows and technologists learning in a hospital EMG laboratory. It is further assumed that the inadequate operations are mutually exclusive since it is unlikely that 2 or more could occur in combination given the intended user level. Assuming mutual exclusion, different technical errors can be represented by a single node OPERATION.
In this way simplicity in the network structure is gained and an efficient evaluation algorithm (section 3.3.2) can be developed.

After the topology of the net is decided, the conditional probabilities for all links are acquired from the subjective judgement of a domain expert in a form like p(WAVE SEQUENCE = bpnb | OPERATION = reversed recording polarity).

Although not constructed for diagnostic purposes, QUALICON recognizes that abnormal potentials could result from disease rather than a technical error. When an abnormal potential is encountered, QUALICON attempts to interpret the abnormality in terms of a technical error but also raises the possibility of disease (e.g. recommendation 2 in Figure 3.10). If an unacceptable recording cannot be overcome by correcting the identified technical error, then disease becomes the likely cause.

The net for CMAP shown in Figure 3.14 reflects the influence among relevant variables in the most general case. When the network is used in a particular situation, it must adapt to the reality. For instance, when STIMULUS POSITION = proximal, STIM ARTEFACT usually can no longer be detected. QUALICON will remove the node STIM ARTEFACT together with its relation with OPERATION in this case. Likewise, when WAVE SEQUENCE is found to be ab_seq, DURATION, AMPLITUDE, and RATIO will be removed. This is a higher level parallel of the inapplicability raised in the previous subsection. Heckerman, Horvitz, and Nathwani [1989] discuss a similar issue.

Inference Algorithm in QUALICON

In a QUALICON session, after the feature extraction and partition are finished, QUALICON is able to reason at the Bayesian net level where STIMULUS POSITION (Figure 3.14) and all other applicable feature variables are known. At this time, only the value of OPERATION is to be estimated (i.e., given the evidence, only the probabilities of OPERATION are to be evaluated).
A simple and efficient algorithm to take advantage of this net structure is developed. With the algorithm, only the probability of one alternative value needs to be calculated in full; the rest can be obtained with a negligible amount of computation. Figure 3.15 depicts the DAG used in QUALICON, slightly simplified for illustration of the algorithm.

Figure 3.15: A DAG to illustrate the inference algorithm in QUALICON

Algorithm 1 Suppose a set of evidence is given as {a, c, d} with the value of E unknown. Assuming $B \in \{b_1, \ldots, b_n\}$, the probability $p(b_i \mid a \& c \& d)$ $(i = 1, \ldots, n)$ can be computed in the following way.

step 1 For $p(b_1 \mid a \& c \& d) = p(b_1 \& c \& d \mid a) / p(c \& d \mid a)$, retrieve the knowledge base to calculate

$$p(b_i \& c \& d \mid a) = p(c \mid b_i)\, p(d \mid b_i)\, p(b_i \mid a) \quad (i = 1, \ldots, n)$$

$$p(c \& d \mid a) = \sum_{i=1}^{n} p(b_i \& c \& d \mid a)$$

and cache the intermediate results.

step 2 For the rest of the probabilities to be evaluated, no knowledge base retrieval is needed at all. Fetch the cache content to obtain

$$p(b_i \mid a \& c \& d) = p(b_i \& c \& d \mid a) / p(c \& d \mid a)$$

It can be easily derived that in the case where there are m known children of B (m = 2 is illustrated above), the algorithm requires mn multiplications, n divisions, and n - 1 additions. That is, the time complexity of the algorithm is linear in the number of features and the number of technical errors considered.

Out of the 7 alternative values of OPERATION, those with posterior probabilities above 1/7 (above equal likelihood) are chosen as the basis of the recommendations for the user.
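As a rough illustration of Algorithm 1, the following Python sketch computes the posterior over the values of OPERATION from cached joint terms and then applies the above-equal-likelihood rule. The function names, the dictionary layout and every probability value are assumptions made for this example; none of them is taken from the QUALICON knowledge base, and only 3 alternative values are used instead of 7.

```python
from math import prod


def operation_posterior(prior_given_a, likelihoods):
    """Posterior p(b_i | a & evidence) for a single parent node B.

    prior_given_a: dict mapping each value b_i to p(b_i | a).
    likelihoods:   list of dicts, one per observed child of B,
                   each mapping b_i to p(observed child value | b_i).
    """
    # Step 1: joint terms p(b_i & evidence | a), cached in a dict.
    joint = {
        b: p_b * prod(table[b] for table in likelihoods)
        for b, p_b in prior_given_a.items()
    }
    # Normalizing constant p(evidence | a): the sum of the cached terms.
    norm = sum(joint.values())
    if norm == 0.0:
        raise ValueError("evidence has zero probability under the model")
    # Step 2: every posterior is one division on cached values.
    return {b: j / norm for b, j in joint.items()}


def recommend(posterior):
    """Values whose posterior exceeds 1/n, most probable first."""
    threshold = 1.0 / len(posterior)
    chosen = [(b, p) for b, p in posterior.items() if p > threshold]
    return sorted(chosen, key=lambda pair: pair[1], reverse=True)


# Hypothetical numbers, for illustration only.
prior = {"normal": 0.5, "rev_rec_polarity": 0.25, "submax_stim": 0.25}
p_c_given_b = {"normal": 0.1, "rev_rec_polarity": 0.8, "submax_stim": 0.2}
p_d_given_b = {"normal": 0.5, "rev_rec_polarity": 0.5, "submax_stim": 0.5}

posterior = operation_posterior(prior, [p_c_given_b, p_d_given_b])
for value, probability in recommend(posterior):
    print(f"{value}: {probability:.2f}")
```

With m observed children and n values of B the sketch performs mn multiplications, n - 1 additions and n divisions, matching the operation count stated above.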
The probabilities are attached to the corresponding recommendations (Figures 3.9, 3.10).

3.4 Assessment of QUALICON

3.4.1 Assessment Procedure

In order to test the validity of QUALICON, 84 muscle and nerve action potentials are elicited and recorded from normal volunteers using standard procedures (see Table 3.2). For recording compound muscle and nerve action potentials, evoked by either proximal or distal stimulation, filter settings are fixed at 2 Hz to 10 kHz and 20 Hz to 2 kHz, respectively.

Table 3.2: Potential recording conditions. Technical errors introduced include reversal of stimulating (sti_rev) or recording (rec_rev) polarity, misplacement of stimulating (sti_mis) or recording (rec_mis) electrodes, and use of submaximal stimulating current (sub_max). Empty entry: no potential recorded under corresponding condition.

muscle or nerve | recording condition: normal sti_rev rec_rev sti_mis rec_mis sub_max
median motor distal | 2 1 2 2 2 1
median motor proximal | 2 2 1
median sensory | 2 6 4 3 1
ulnar motor distal | 2 1 2 2 2 1
ulnar motor proximal | 2 1 2 1
ulnar sensory | 3 1 2 4 2
post. tib. motor distal | 2 2 2 3 1
post. tib. motor proximal | 1 1 2
sural sensory | 2 2 2 1 2 2
subtotal | 17 17 14 9 16 11

Five different types of technical error are introduced: reversal of stimulating (Sti_rev) or recording (Rec_rev) polarity, misplacement of stimulating (Sti_mis) or recording (Rec_mis) electrodes, and use of submaximal stimulating current (Sub_max). Only 1 error is introduced at a time. No distribution over the errors introduced is assumed. In analyzing the potentials QUALICON is blinded as to whether an error has been introduced or not. Two physicians, 1 technician and 1 resident also manually analyze the data. In doing so they are given information as to the nerve stimulated and the stimulating and recording sites, and are allowed to select from Normal, Sti_rev, Rec_rev, Sti_mis, Rec_mis, and Sub_max in making their interpretations.
Multiple options of up to 3 are allowed when it is difficult to make a single choice. If an option matches the actual technical error, the interpretation is recognized as a success. The success rate is defined as the number of successes divided by the total number of interpretations made.

3.4.2 Assessment Results

The assessment is treated as Bernoulli trials (Appendix E). The success rate of QUALICON is 73% compared to an average success rate of 57% for manual assessment (excluding assessor 4, who is a resident in training) (Figure 3.16(a)). Given the result, it can be derived, using the standard statistical technique (Appendix E), that the 95% confidence intervals of success probabilities for QUALICON (pQ) and the average of the 3 human assessors (ph) are (0.62, 0.82) and (0.46, 0.68), respectively. Furthermore, the hypothesis H0 : pQ = ph (as opposed to H1 : pQ > ph) is accepted at the 0.01 level of significance but rejected at the 0.05 level of significance. Thus it can be safely concluded that QUALICON performed as well as, or better than, human assessment did on the task.

One would have noticed that the success rate (both QUALICON and human) is not high. The average inter-human agreement rate for the different imposed errors is depicted in Figure 3.16(b). Agreement is high for recognition of a normal potential and of one in which the recording electrode is inverted. There is much poorer agreement between individuals for the other technical errors, reflecting that less information is provided than would have been necessary. In daily practice, a series of potentials is usually recorded on a nerve or muscle. Comparison among them is an important clue for detecting technical errors. This suggests using multiple potentials as a group for detecting errors in a future extension of QUALICON.

3.5 Remarks

The experienced electromyographer (EMGer) or technologist can usually detect a technical error as a cause of an abnormal potential rapidly and with ease.
Doing so is a vital element in the quality control of electromyography. Contemporary equipment automates many of the requisite measurements needed, for example, in performing nerve conduction studies but does not take cognizance of error as a cause of abnormality. Artificial intelligence offers a possible solution to this problem as demonstrated through the coupled knowledge based prototype system QUALICON.

Figure 3.16: (a) Manual success rates of human assessors compared to QUALICON. (b) Average agreement rate for manual assessment (this excludes assessor 4, who is a resident in training).

QUALICON is not the first attempt at computerized quality control in nerve conduction studies. With conventional signal processing and database techniques, deviations of features from reference values are used [Stalberg and Stalberg 89] to turn on a warning display. MUNIN [Andreassen et al. 89], currently under development by large research groups, is one system which gives a test guide in electromyography. MUNIN displays where the electrodes should be placed and other test setup information but does not check for technical errors. QUALICON is unique in that it tries to pinpoint the most likely technical error given the recorded abnormal potential.

The intended users of QUALICON are residents, fellows and technologists learning in a hospital EMG laboratory. Given the level of knowledge and experience in nerve conduction studies, QUALICON only considers the most common technical errors in this context. Also, development of appropriate technique rather than a complete system is emphasized in this work.
Thus 6 kinds of technical errors are considered in QUALICON, including reversal of stimulating or recording polarity, misplacement of stimulating or recording electrodes, and use of submaximal or supermaximal stimulating current.

QUALICON's task is to recognize abnormal potentials due to technical errors. An abnormal potential due to disease is defined, from QUALICON's point of view, as an abnormal potential which is technically correction-resistant. No attempt is made for QUALICON to differentiate among disease states. With the success in applying the Bayesian network technique to recognition of technical errors, the next step of my thesis research is to extend the technique to disease diagnosis.

Currently, QUALICON supports quality control for nerve conduction studies derived from median and ulnar, motor and sensory nerve, posterior tibial motor nerve and sural sensory nerve conductions. The CMAPs and SNAPs used by QUALICON can be acquired from any electromyograph capable of producing digitized potentials. With efficient signal processing and inference algorithms implemented for QUALICON, it takes 6 seconds to evaluate a potential on an IBM AT compatible. The short processing time with such a small computer means that QUALICON can be easily included in a computerized electromyograph.

In an assessment using 84 potentials, QUALICON's performance is compared with that of human professionals. QUALICON's 95% confidence interval of success probability is assessed as (0.62, 0.82), with the corresponding interval for the average human professional as (0.46, 0.68). The assessment suggests QUALICON performs as well as human professionals on the given task. It is currently used as a teaching instrument at NDU of VGH.

Future extension to QUALICON can be expected in supporting other nerves routinely used in nerve conduction studies, and in strengthening the knowledge base to include such factors as the effect of age.
Although QUALICON detects only 6 types of technical errors, its success suggests that a more sophisticated system can be built using the same approach to deal with a broader range of technical errors.

Chapter 4

MULTIPLY SECTIONED BAYESIAN NETWORKS AND JUNCTION FORESTS FOR LARGE EXPERT SYSTEMS

With QUALICON's performance being satisfactory, the neuromuscular diagnosis involving a painful or impaired upper limb was tackled using Bayesian networks. This is the PAINULIM project. Two major problems arose.

First, the PAINULIM project has a large domain. The Bayesian network representation contains 83 variables including 14 diseases and 69 features, each of which has up to three possible outcomes. The network is multiply connected and has 271 arcs and 6795 probability values. When transformed into a junction tree representation (to be detailed below), the system contains 10608 belief values. To process a problem of this complexity with a reasonable system response time demands powerful computing equipment.

On the other hand, the tight schedule of the medical staff demands that the knowledge acquisition (including initial network construction, subsequent system testing and refinement) be conducted within the hospital environment, where most computing equipment consists of small personal computers.

These two conflicting demands motivate the development of a technique to reduce the computational complexity. The result is the general technique of Multiply Sectioned Bayesian Networks (MSBNs) and junction forests presented in this chapter. The MSBN technique is an extension to the d-separation concept [Pearl 88] and the junction tree technique [Andersen et al. 89, Jensen, Lauritzen and Olesen 90]. The application of MSBN to PAINULIM is presented in chapter 5.

Section 4.1 introduces an important observation called localization which is to be associated with large application domains. MSBN exploits localization.
Section 4.2 reviews background research, particularly the d-separation concept [Pearl 88] and the junction tree technique [Jensen, Olesen and Andersen 90, Jensen, Lauritzen and Olesen 90] [Andersen et al. 89] on which the MSBN technique is based. Section 4.3 explains why 'obvious' solutions to exploit localization do not work, and Section 4.4 gives an overview of the MSBN technique. These two sections serve to motivate and guide readers into the subsequent sections which present the mathematical theory necessary for the technique.

4.1 Localization

As Cooper (1990) has shown, probabilistic inference in a general Bayesian net is NP-hard. Several approaches have been pursued to reduce the computational complexity of inference in Bayesian nets; these are reviewed in section 1.3.4. Here the exact methods in section 1.3.4 are restated briefly and the problems that remain are discussed.

Efficient algorithms have been developed for inference in Bayesian nets with special topologies [Pearl 86, Heckerman 90a]. Unfortunately many domain models cannot be represented by these special types of Bayesian nets. For sparse nets, Lauritzen and Spiegelhalter [1988] have created a secondary directed tree structure to achieve efficient computation. Jensen, Lauritzen and Olesen [1990] and Shafer and Shenoy [1988] have created an undirected tree structure. Creating secondary structures also offers the advantage of trading compile time (the computation time spent in transforming a Bayesian net into a secondary structure) with run time (the computation time spent in inference). This advantage is particularly relevant to reusable systems (e.g., expert systems). However, for large applications, the run time overhead (both space and time) is still forbidding. Pruning Bayesian nets with respect to each query instance is yet another exact method with savings in computational complexity [Baker and Boult 90].
A portion S of a Bayesian net may not be relevant given a set of evidence and a set of queries. It can be pruned away before computation. In light of a piece of new evidence, S may become relevant but cannot be restored within the pruning algorithm. If networks have to be pruned for each set of queries, the advantage of trading compile time with run time will also be lost. The MSBN technique can be viewed as a way to retain the advantage of the pruning algorithm, but to overcome its limitations.

This chapter addresses domains representable by general but sparse networks and characterized by incremental evidence. It addresses reusable systems where the probabilistic knowledge can be captured once and be used for multiple cases. Current Bayesian net representations do not consider structure in the domain and lump all variables into a homogeneous network. For small applications, this is appropriate. But for a large application domain where evidence arrives incrementally, in practice one often directs attention to only part of the network within a period of time, i.e., there is 'localization' of queries and evidence. More precisely, 'localization' means 2 things. For an average query session, first, only certain parts of a large network are interesting (1); second, new evidence and queries are directed to a small part of a large network repeatedly within a period of time. When this is the case, the homogeneous network is inefficient since newly arrived evidence has to be propagated to the overall network before queries can be answered.

(1) 'Interesting' is more restrictive than 'relevant'. One may not be interested in something even though it is relevant.

The following observation of the PAINULIM domain, based on the practice in the Neuromuscular Disease Unit (NDU), Vancouver General Hospital (VGH), illustrates localization.

A neurologist, examining a patient with a painful impaired upper limb, may temporarily consider only his findings' implication on a set of disease candidates. He may not consider the diagnostic significance of each available laboratory test until he has finished the clinical examination. After each clinical finding (about 5 findings on an average patient), he dynamically changes his choice of the most likely disease candidates. Based on this he chooses the best question to ask the patient next or the best examination to perform on the patient next. After the clinical examination of the patient, findings highlight certain disease candidates and make others less likely, which may suggest that further nerve conduction studies are of no help at all, while (needle) EMG tests are diagnostically beneficial (about 60% of patients have only EMG tests, and about 27% of patients have only nerve conduction studies). Since EMG tests are usually not comfortable for the patients, the neurologist would not perform a test unless it is diagnostically necessary. Thus he would perform each test depending on the results of previous ones (about 6 EMG tests are performed on an average patient, and 4 nerve conduction studies).

The above scenario and statistics illustrate both aspects of localization. During the clinical examination, only clinical findings and disease candidates are of current interest to the neurologist; and during EMG tests, only EMG test results and their implications on a subset of the diseases are under the neurologist's attention; furthermore, for a large percentage (87%) of patients, only one of EMG or nerve conduction studies is required.
If the neurologist is assisted by a Bayesian network based system, the evidence and queries would repeatedly (about 5 times) involve a 'small' part of the network during each diagnostic period (the clinical or the EMG). For 87% of patients certain parts of the network (either the EMG or the nerve conduction) may not be of interest at all. If the Bayesian net representation is homogeneous, each batch of clinical findings has to be propagated to all the EMG and nerve conduction variables which are not relevant at the moment. Likewise, after each batch of EMG tests, the overall net has to be updated even though the neurologist is only interested in planning the next EMG test.

A large application domain can often be partitioned naturally in terms of localization. In the PAINULIM domain, a natural subdomain can be formed from the knowledge about the clinical symptoms and a set of diseases. Another natural subdomain can be formed from the knowledge about EMG results and a subset of diseases. The third natural subdomain can be formed from the knowledge about nerve conduction study results and a different subset of diseases. One problem of current Bayesian net representations is that they do not provide means to distinguish variables according to natural subdomains. Heckerman [1990b] partitions Bayesian nets into small groups of naturally related variables to ease the construction of large networks. But once the construction is finished, the run time representation is still homogeneous.

If groups of naturally related variables in a domain can be identified and represented, the run time computation can be restricted to one group at any given stage of a query session due to localization. In particular, one may not need to propagate new evidence beyond the current group. Along with the arrival of new evidence, attention can be shifted from one group to another.
Chunks of knowledge not required with respect to current attention remain inactive (without being thrown away) until they are activated. This way, the run time overhead is governed by the size of the group of naturally related variables, not the size of the application domain. Computational savings can be achieved. As demonstrated by Heckerman [1990b], grouping of variables can also help in ease and accuracy in construction of Bayesian networks.

Partitioning a large domain into separate knowledge bases and coordinating them in problem solving have a history in rule-based expert systems, termed blackboard architectures [Nii 86a, Nii 86b]. However a proper parallel for Bayesian network technology has not yet appeared.

This chapter derives constraints which can often be satisfied such that a natural (localization preserving) partition of a domain and its representation by separate Bayesian subnets are possible. Such a representation is termed a multiply sectioned Bayesian network (MSBN). In order to perform efficient evidential reasoning in a sparse network, the set of subnets is transformed into a set of junction trees as a secondary representation which is termed a junction forest. The junction forest becomes the permanent representation for the reusable system in which incremental evidential reasoning takes place. Since the junction trees preserve localization, each of them stands as a computational object which can be used alone during reasoning. Multiple linkages between the junction trees are introduced to allow evidence acquired from previously active junction trees to be absorbed into a newly active junction tree of current interest.
In this way, localization naturally existing in the domain can be exploited and the idea illustrated above is realized.

4.2 Background

This section reviews the background research particularly related to the MSBN technique which is not included in section 1.3.

4.2.1 Operations on Belief Tables

The following introduces the belief table representation of a probability distribution on a set of variables, and the mathematical operations on belief tables.

A belief table [Andersen et al. 89, Jensen, Olesen and Andersen 90], or potential [Lauritzen and Spiegelhalter 88], denoted as B(), is a non-normalized probability distribution. It can be viewed as a function from the space of a set of variables to the reals. For example, the belief table B(X) of a set X of variables maps $\Psi(X)$ (the space of X defined in section 1.3.3) to the reals. If $x \in \Psi(X)$, the belief value of x is denoted by B(x). Denote a set X of variables and the corresponding belief table B(X) with an ordered pair (X, B(X)) and call the pair a world.

For $Y \subseteq X$, the projection $y \in \Psi(Y)$ of $x \in \Psi(X)$ to the space $\Psi(Y)$ is denoted as $Prj_{\Psi(Y)}(x)$. Denote the marginalization of B(X) to $Y \subseteq X$ by $\sum_{X \setminus Y} B(X)$, which specifies a belief table on Y. The operation is defined as follows. If $B(Y) = \sum_{X \setminus Y} B(X)$, then for all $y \in \Psi(Y)$,

$$B(y) = \sum_{Prj_{\Psi(Y)}(x) = y} B(x).$$

Similarly denote the multiplication of B(X) and B(Y) by B(X) * B(Y), which specifies a belief table on $X \cup Y$. If $B(X \cup Y) = B(X) * B(Y)$, then for all $z \in \Psi(X \cup Y)$, $B(z) = B(x) * B(y)$ where $x = Prj_{\Psi(X)}(z)$ and $y = Prj_{\Psi(Y)}(z)$. Denote the division of B(X) over B(Y) by B(X)/B(Y), which specifies a belief table on $X \cup Y$. If $B(X \cup Y) = B(X)/B(Y)$, then for all $z \in \Psi(X \cup Y)$, $B(z) = B(x)/B(y)$ if $B(y) \neq 0$ where $x = Prj_{\Psi(X)}(z)$ and $y = Prj_{\Psi(Y)}(z)$.

4.2.2 Transform a Bayesian Net into a Junction Tree

The MSBN technique is an extension to the junction tree technique [Andersen et al. 89, Jensen, Lauritzen and Olesen 90] which transforms a Bayesian net into an equivalent secondary structure where inference is conducted (Figure 4.19). Because of this restructuring, belief propagation in multiply connected Bayesian nets can be performed in a manner similar to that in singly connected nets. The following briefly summarizes the junction tree technique. Readers not familiar with the terminology should refer to Appendix A.

Moralization Transform the DAG into its moral graph.

Triangulation Triangulate the moral graph. Call the resultant graph a morali-triangulated graph.

Clique hypergraph formation Identify cliques of the morali-triangulated graph and obtain a clique hypergraph.

Junction tree construction Organize the clique hypergraph into a junction tree of cliques.

Node assignment Assign each node in the DAG to a clique in the junction tree of cliques.

Belief universes formation For each clique $C_i$ in the junction tree of cliques, obtain its belief table $B(C_i)$ by multiplication of the conditional probability tables of its assigned nodes. Call the pair $(C_i, B(C_i))$ a belief universe. When it is clear from the context, no distinction is made between a junction tree of cliques and a junction tree of belief universes.

Inference is conducted through the junction tree representation. An absorption operation is defined for local belief propagation. Global belief propagation is achieved by a forward propagation procedure DistributeEvidence and a backward propagation procedure CollectEvidence.

Belief initialization Before any evidence can be entered to the junction tree, the belief tables are made consistent by CollectEvidence and DistributeEvidence such that the prior marginal probability for a variable (of the original Bayesian net) can be obtained by marginalization of the belief table in any universe which contains it.

Evidential reasoning When evidence about a set of variables (of the original Bayesian net) is available, the evidence is entered to universes which contain the variables. Then the belief tables of the junction tree are made consistent again by CollectEvidence and DistributeEvidence such that the posterior marginal probability for a variable can be obtained from any universe containing the variable.

The computational complexity of evidential reasoning in junction trees is about the same as that of the reasoning method by Lauritzen and Spiegelhalter [1988], which can be viewed as performed on a (secondary) directed tree structure [Shachter 88, Neapolitan 90]. But junction trees are undirected and allow more flexible computation. The junction tree representation is explored in this chapter since its flexibility is of crucial importance to the MSBN extension.

4.2.3 d-separation

The concept of d-separation introduced by Pearl (1988, pages 116-118) is fundamental in probabilistic reasoning in Bayesian networks. It permits easy determination, by inspection, of which sets of variables are considered independent of each other given a third set, thus making any DAG an unambiguous representation of dependence and independence. It plays an important role in our partitioning of Bayesian networks.

Definition 5 (d-separate [Pearl 88]) If X, Y, and Z are 3 disjoint subsets of nodes in a DAG, then Z is said to d-separate X from Y if there is no path between a node in X and a node in Y along which 2 conditions hold: (1) every node with converging arcs (head-to-head node) is in Z or has a descendant in Z and (2) every other node (non-head-to-head node) is outside Z.

A path satisfying the conditions above is said to be active; otherwise it is said to be blocked by Z.

For example, in the DAG Θ of Figure 4.17, {F1} d-separates {F2} from {H1, H2}. {H2, H3, H4} d-separates {E1, E2, E3, E4} from the rest.
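Definition 5 can also be checked mechanically. The Python sketch below does not enumerate paths; it uses the equivalent test on the moral graph of the smallest ancestral set containing X, Y and Z, which is a swapped-in technique rather than a procedure described in this thesis. The function name d_separated and the small DAG are hypothetical, and the DAG is not the one in Figure 4.17.

```python
from collections import deque


def d_separated(parents, xs, ys, zs):
    """Return True if zs d-separates xs from ys in a DAG.

    parents: dict mapping every node to the set of its parents.
    xs, ys, zs: disjoint collections of nodes.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Smallest ancestral set containing X, Y and Z.
    relevant = set()
    stack = list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moral graph of that ancestral subgraph: drop arc directions
    #    and connect every pair of parents sharing a child.
    neighbours = {node: set() for node in relevant}
    for child in relevant:
        ps = list(parents.get(child, ()))
        for p in ps:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                neighbours[p].add(q)
                neighbours[q].add(p)

    # 3. X and Y are d-separated by Z iff no path avoiding Z joins them.
    seen = set(xs)
    queue = deque(xs)
    while queue:
        node = queue.popleft()
        if node in ys:
            return False
        for nxt in neighbours[node]:
            if nxt not in seen and nxt not in zs:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical DAG: A -> C, B -> C (C is head-to-head), C -> D.
dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
print(d_separated(dag, {"A"}, {"B"}, set()))   # head-to-head node C blocks
print(d_separated(dag, {"A"}, {"B"}, {"D"}))   # descendant of C observed
print(d_separated(dag, {"A"}, {"D"}, {"C"}))   # chain blocked by C
```

In the hypothetical DAG, A and B are separated when nothing is observed, become dependent once C or its descendant D is observed, and A is separated from D by C.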
The path between A3 and E4 is blocked by H4.

The importance of d-separation lies in that in a Bayesian network X and Y are conditionally independent given Z if Z d-separates X from Y [Geiger, Verma and Pearl 90].

4.3 ‘Obvious’ Ways to Explore Localization

Recall that ‘localization’ means (1) for an average query session, only certain parts of a large network are interesting; and (2) new evidence and queries are directed to a small part of a large network repeatedly within a period of time. An obvious way to explore localization in multiply connected networks is to preserve localization within subtrees of a junction tree by clever choice in triangulation and junction tree construction. If this can be done, the junction tree can be split and each subtree can be used as a separate computational object. The following example shows that this is not always a workable solution.

Figure 4.17: A DAG Θ, its moral graph T, one of T's triangulated graphs Λ, and the corresponding junction tree Γ. Each clique in Γ is numbered (the number is separated from the clique members by a separator symbol).

Consider the DAG Θ in Figure 4.17. Suppose the variables in the DAG form three naturally related groups which satisfy localization:
G1 = {A1, A2, A3, A4, H1, H2, H3, H4}
G2 = {F1, F2, H1, H2}
G3 = {E1, E2, E3, E4, H2, H3, H4}
One would like to construct a junction tree of it which preserves localization within three subtrees. The graph T in Figure 4.17 is the moral graph of Θ. Only the cycle A3 - H3 - E3 - E4 - H4 - A3 needs to be triangulated. There are six distinct ways of triangulation, out of which only two do not mix nodes in different groups. The two triangulations have a link (H3, H4) in common and they do not make a significant difference in the following analysis. The Λ in Figure 4.17 shows one of the two triangulations.
The nodes of graph Γ are all the cliques in Λ.

The junction tree Γ does not preserve localization since cliques 7, 8, 9, 12 and 1 correspond to group G1 but are connected via cliques 10 and 11, which contain E3 from group G3. Examine the morali-triangulated graph Λ to see why this is unavoidable. When there is evidence towards A1 or A2 in Λ, updating the belief in group G3 requires passing the joint distribution of H2 and H3. But updating the belief in A3 and A4 requires passing only the marginal distribution of H3. That is to say, updating the belief in A3 and A4 needs less information than updating group G3. In the junction tree representation, which insists on a single information channel between any two cliques, this becomes a path from cliques 7, 8, and 9 to clique 12 or 1 via cliques 10 and 11.

In general, let X and Y be two sets of variables in the same natural group, and let Z be a set of variables in a distinct neighbor group. Suppose the information exchange between pairs of them requires the exchange of distributions on sets I_XY, I_XZ and I_YZ of variables respectively. Sometimes I_XY is a subset of both I_XZ and I_YZ. When this is the case, a junction tree representation will always indirectly connect the cliques corresponding to X and Y through cliques corresponding to Z if the method of Andersen et al. [1989] and Jensen, Lauritzen and Olesen [1990] is followed.

However, there is a way around the problem with a brute force method. In the above example, when there is evidence towards A1 or A2, the brute force method pretends that updating the belief in A3 and A4 needs as much information as G3. What one does is to add a dummy link (H2, A3) to the moral graph T in Figure 4.17. Then triangulating the augmented graph gives the graph Λ' in Figure 4.18. The resultant junction tree Γ' in Figure 4.18 does have 3 subtrees which correspond to the 3 groups desired. However, the largest cliques now have size 4 instead of 3 as before.
In the binary case, the size of the total state space is 92 instead of 84 as before.

In general, the brute force method preserves natural localization by congregation of the set of interfacing nodes (nodes H2, H3, H4 above) between natural groups. In this way, the joint distribution on the interfacing nodes [footnote 2] can be passed between groups through a single channel, and preservation of localization and preservation of tree structure can be compatible. However, in a large application domain where the original network is sparse, this will greatly increase the amount of computation in each group due to the exponential enlargement of the clique state space. The added computation could outweigh the savings gained by exploring localization in general.

Footnote 2: It will be shown later that when the set of interfacing nodes possesses a certain property, the joint distribution on the set is the minimum information to be exchanged.

Figure 4.18: Λ' is a triangulated graph. Γ' is a junction tree of Λ'.

The trouble illustrated in the above 2 situations can be traced to the tree structure of the junction tree representation, which insists on a single path between any 2 cliques in the tree. In the normal triangulation case, one has small cliques but loses localization. In the brute force case, one preserves localization but does not have small cliques. To summarize, the preservation of natural localization and small cliques cannot coexist under the method of Andersen et al. [1989] and Jensen, Lauritzen and Olesen [1990]. It is claimed here that this is due to a single information channel between local groups of variables.
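The cost of enlarging cliques can be made concrete with a small calculation. The Python sketch below is an illustration added here, not the thesis's computation. It assumes that the total state space of a junction tree is the sum, over cliques, of the product of the numbers of states of the clique members. The clique lists are hypothetical and do not reproduce the cliques of Figures 4.17 and 4.18.

```python
from math import prod


def total_state_space(cliques, cardinality):
    """Sum over cliques of the product of the members' state counts."""
    return sum(prod(cardinality[v] for v in clique) for clique in cliques)


# All variables binary, as in the example above.
card = {v: 2 for v in ["H2", "H3", "H4", "A3", "E3"]}

# Two hypothetical cliques of size 3, and the same cliques after an
# interfacing node (H2) has been forced into each of them.
small = [{"H3", "H4", "A3"}, {"H3", "H4", "E3"}]
enlarged = [{"H2", "H3", "H4", "A3"}, {"H2", "H3", "H4", "E3"}]

print(total_state_space(small, card))     # 16
print(total_state_space(enlarged, card))  # 32
```

Each node added to a clique multiplies that clique's table by the node's number of states, which is the exponential enlargement referred to above.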
In the following, it is shown that by introducing multiple information channels between groups and by exploring conditional independence, one can pass the joint distribution on a set of interfacing variables between groups by passing only marginal distributions on subsets of the set.

4.4 Overview of MSBN and Junction Forest Technique

As demonstrated in the previous section, in order to explore localization, the tree structure and single channel requirements have to be relaxed. Since the computational advantage offered by tree structure has also been demonstrated repeatedly, it is not desirable to abandon tree structure entirely. Rather, it is desirable to keep tree structure within each natural group, but allow multiple channels between groups. The MSBN and junction forest representations extend the d-separation concept and the junction tree technique to implement this idea. This section outlines the development of these representations. Each major step involved is described in terms of its functionality. The problems that may be encountered and hints for their solutions are discussed. The details are presented in the subsequent sections. Since the technique extends the junction tree technique reviewed in section 4.2.2, the parallels and the differences are indicated. Figure 4.19 illustrates the major steps in the transformation of the original representation into the secondary representation for both techniques.

The d-sepset. The purpose is to partition a large domain according to natural localization into subdomains such that each can be represented separately by a Bayesian subnet;
and that these subnets can cooperate with each other during inference by exchanging a minimum amount of information between them. In order to do that, one needs to find the technical constraints which have to be followed during the partition. This can be formulated conceptually in the opposite direction. Suppose the domain has been represented with a homogeneous network. The task is to find the necessary technical constraints to be followed when the net is partitioned into subnets according to natural localization. Section 4.5 defines d-sepsets, whose joint distribution is the minimum information to be exchanged to keep neighbor subnets informed. It is shown that in the junction tree representation of the homogeneous net, the nodes in d-sepsets actually serve as information passageways between nodes in different subnets. Thus the d-sepset is a constraint on the interface between each pair of subnets.

Figure 4.19: Left: major steps in transformation of a USBN (UnSectioned Bayesian Network) into a junction tree. Right: major steps in transformation of a MSBN into a junction forest.
USBN (left): moralize (a moral graph); triangulate (a morali-triangulated graph); identify cliques (a clique hypergraph); organize into a tree (a junction tree of cliques); assign belief tables (a junction tree of belief universes); belief initialization (a consistent junction tree of belief universes).
MSBN (right): morali-triangulate by local computation (a set of morali-triangulated graphs); identify cliques (a set of clique hypergraphs); organize each into a tree (a junction forest of cliques); create linkages (a linked junction forest of cliques); assign belief tables (a junction forest of belief universes); belief initialization (a consistent junction forest of belief universes).

Sectioning. Continuing in the conceptual direction, section 4.6 describes how to section a homogeneous Bayesian net into subnets called sects, based on d-sepsets.
The collection of these sects forms a MSBN. It is described how probability distributions should be assigned to sects relative to the distribution in the homogeneous network. In particular, it is necessary to assign the original probability table of a d-sepnode to a unique sect which contains the d-sepnode and all its parent nodes, and to assign a uniform table to the same d-sepnode in the other sects. This is necessary, for one thing, because if a sect does not contain all of a d-sepnode's parents as in the homogeneous net, the size of the probability table of the d-sepnode must be decreased; for another, because this assignment guarantees that the joint system belief constructed later is proportional to the joint probability distribution of the homogeneous net.

In order to perform efficient inference in a general but sparse network, one wants to transform each sect into a separate junction tree which will stand as an inference entity. When doing so, it is necessary to preserve the intactness of the clique hypergraph resulting from the corresponding homogeneous net. That is, one has to ensure that each clique in the original hypergraph will find at least one host sect. This imposes another constraint, termed soundness of sectioning, on the overall organization of sects. Actually, the soundness of sectioning plus the d-sepset interface imposes conditional independence constraints at a macro level, i.e., at the level of sects (as opposed to conditional independence at the level of nodes). In addition to a necessary and sufficient condition for soundness, a guiding rule called covering subDAG is provided to ensure soundness. Its repeated application forms a second rule called hypertree, which can be used for creating sophisticated MSBNs whose sectioning is sound. Although there exist MSBNs of sound sectioning which do not follow the 2 rules, it is shown that computational advantages are obtained in MSBNs sectioned according to the rules.
Further discussion will therefore be directed only to MSBNs satisfying the 2 rules.

Moralization and triangulation. To transform a MSBN into a set of junction trees requires moralization and triangulation as reviewed in section 4.2.2. However, in the MSBN context an operational option is available, i.e., the transformation can be performed globally or by local computation at the level of sects. The global computation performs moralization and triangulation in the same way as in the junction tree technique, with care taken not to mix nodes in distinct sects into one clique. An additional mapping of the resultant morali-triangulated graph into subgraphs corresponding to sects is needed. But when space saving is a concern, local computation is desired. The pitfalls and procedures in moralization and triangulation by local computation are discussed.

Since the number of parents for a d-sepnode may be different in different sects, the moralization in a MSBN cannot be achieved by ‘pure’ local computation in each sect. Communication between sects is required to ensure that parent d-sepnodes are moralized identically in different sects.

The criterion of triangulation in the MSBN is to ensure the ‘intactness’ of the hypergraph resulting from the corresponding homogeneous net. Problems arise if one insists on triangulation by local computation at the level of sects. One problem is that an inter-sect cycle will be triangulated in the homogeneous net, but the cycle cannot be recognized by viewing each sect involved locally. Another problem is that cycles involving d-sepnodes may be triangulated differently in different sects. The solution is to let sects communicate during triangulation. Since moralization and triangulation both involve adding links and both require communication between sects, the corresponding local operations in each sect can be performed together and messages to other sects can be sent together.
Therefore, operationally, moralization and triangulation in a MSBN are not separate steps as in the junction tree technique. The corresponding integrated operation is termed morali-triangulation to reflect this reality conceptually.

In section 4.7.1, the above concept of ‘intactness’ of the hypergraph is formalized in terms of invertibility of morali-triangulation. It is shown that if the sectioning of a MSBN is sound, then there exists an invertible morali-triangulation such that the ‘intactness’ of the hypergraph is preserved. Section 4.7.1 provides an algorithm for an invertible morali-triangulation assuming a covering subDAG.

Next steps in the junction tree technique. In the junction tree technique, after triangulation, the further steps of transformation are identification of cliques in the morali-triangulated graph (clique hypergraph formation) and junction tree construction. In MSBNs, these steps are performed for each sect in the same way as in the junction tree technique. A MSBN is thus transformed into a set of junction trees called a junction forest of cliques. See Andersen et al. [1989] and Jensen, Lauritzen and Olesen [1990] for the technical details of these steps.

Linkage formation. An important extension that MSBNs and junction forests make to the junction tree technique is the formation of multiple information channels between junction trees (in a junction forest) such that a joint distribution on a d-sepset can be passed between a pair of junction trees by passing marginal distributions on subsets of the d-sepset. In this way, the exponential enlargement of the clique state space caused by the brute force method (section 4.3) can be avoided. These channels are termed linkages (section 4.7.2). Each linkage is a set of d-sepnodes which links 2 cliques, each in one of the 2 junction trees involved.
During inference, if evidence is obtained from a previously active junction tree, it can be propagated to a newly active neighbor junction tree through the linkages between them.

As can be imagined, multiple linkages can cause redundant information passage or confuse the information receiver. The problem can be avoided by coordination among linkages during information passing. Since the problem manifests differently during belief initialization and evidential reasoning, it has to be treated differently. In both cases, information passing is performed one linkage at a time. During initialization, (redundant) information already passed through other linkages is removed from the linkage belief table before the latter is passed over. Operationally, one orders linkages. The intersection of a linkage with linkages ordered before it is defined as the redundancy set of the linkage. The redundancy set tells a linkage what portion of information has to be removed during information passing. During evidential reasoning, a DistributeEvidence is performed after each information passing to avoid confusion in the receiving junction tree. With linkages and redundancy sets created, one has a linked junction forest of cliques.

Formation of joint system belief of junction forest. The joint system belief of the junction forest is defined (section 4.7.3) in terms of the belief on each junction tree, the belief on linkages and redundancy sets, such that it is proportional to the joint probability distribution of the homogeneous net. With the joint system belief defined, one has a junction forest of belief universes.
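The redundancy sets described under linkage formation above can be computed once the linkages are ordered. The Python sketch below is an illustration only, not the procedure of section 4.7.2. It assumes that "the intersection of a linkage with linkages ordered before it" means the intersection with the union of all earlier linkages. The function name and the three linkages are hypothetical.

```python
def redundancy_sets(ordered_linkages):
    """For each linkage, return the d-sepnodes already covered by the
    linkages ordered before it (assumed reading of the definition)."""
    covered = set()
    result = []
    for linkage in ordered_linkages:
        result.append(set(linkage) & covered)
        covered |= set(linkage)
    return result


# Hypothetical linkages between one pair of junction trees.
linkages = [{"H2", "H3"}, {"H3", "H4"}, {"H2", "H4"}]
for linkage, redundant in zip(linkages, redundancy_sets(linkages)):
    print(sorted(linkage), sorted(redundant))
```

Under this reading the first linkage always has an empty redundancy set, so its belief is passed unchanged, and a later linkage whose nodes are all covered passes no new information during initialization.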
When it is clear from the context, only ‘junction forest’ is used without differentiating between its different stages.

Consistency and separability of junction forest. As in the case of the junction tree technique, one would like to obtain the marginal probability of a variable by marginalization of the belief in any belief universe of any junction tree which contains the variable, i.e., by local computation. In the case of the junction tree technique, this requires the consistency property, which can be satisfied by performing DistributeEvidence and CollectEvidence as reviewed in section 4.2.2. In the context of a junction forest, an additional property called separability is required (section 4.8) due to the multiple linkages between junction trees. It imposes a host composition constraint on the composition of linkage host cliques. The function of linkages is to pass the joint belief of the corresponding d-sepset. When linkage hosts are ill-composed, what is passed over is not a correct version of the joint belief on the d-sepset. The marginal probabilities thus obtained by local computation will also be incorrect. It is shown that if all the junction trees in a junction forest satisfy the host composition condition, then separability is guaranteed. Why these conditions usually hold naturally is explained. The remedy when the condition does not hold is also discussed. With a junction forest structure satisfying separability, and with a set of operations performed to bring the forest into consistency, one is guaranteed to obtain marginal probabilities by local computation.

Belief initialization. Belief initialization (section 4.9.3) in a junction forest is achieved by first bringing the belief universes in each junction tree into consistency, and then exchanging prior belief between junction trees to bring the junction forest into global consistency. When exchanging beliefs, care is to be taken that (1) non-trivial information
(recall that d-sepnodes in some sects are assigned uniform tables during sectioning) could be contained in either of the 2 junction trees involved; and (2) redundant information is not passed through multiple linkages. Section 4.9 defines several levels of operations to initialize the belief of a junction forest by local computation.

Evidential reasoning. Only 1 junction tree in a junction forest needs to be active due to localization, and great computational savings are possible when repeated computation on the junction tree is required. Whenever new evidence becomes available to the currently active junction tree, it is entered and the tree is made consistent such that queries can be answered. Thus the computational complexity with a MSBN/junction forest is reduced to approximately 1/β, where β is the number of sects in the MSBN. When the user shifts attention, a new junction tree replaces the currently active tree and all previously acquired evidence is absorbed through the operation ShiftAttention. The operation requires only a chain of neighbor junction trees to be updated. During the inter-junction tree updating, one needs to ensure that no confusion results from multi-linkage information passing.

4.5 The d-sepset and the Junction Tree

4.5.1 The d-sepset

As discussed in section 4.4, the problem of partitioning a Bayesian net by natural localization can be conceptually formulated as though the domain has been represented with a homogeneous network. The task is to find the technical constraint for partitioning the net into subnets such that the subnets can be used separately and cooperatively during inference with a minimum amount of information exchange. This section defines the most important concept for partitioning, namely, the d-sepset. Then some insights are provided into its implication in the secondary structure of DAGs.

Definition 6 (d-sepset) Let D = D^1 ∪ D^2 be a DAG.
The set of nodes I = N^1 ∩ N^2 is a d-sepset between subDAGs D^1 and D^2 if the following condition holds.

For every A_i ∈ I with its parents π_i in D, either π_i ⊆ N^1 or π_i ⊆ N^2.

Elements of a d-sepset are called d-sepnodes. When the above condition holds, D is said to be sectioned into {D^1, D^2}.

Note that in general a DAG D = D^1 ∪ D^2 does not imply that D is sectioned into {D^1, D^2}, since the intersection of the corresponding 2 sets of nodes may not be a d-sepset.

Lemma 4 Let a DAG D be sectioned into {D^1, D^2} and I = N^1 ∩ N^2 be a d-sepset. I d-separates N^1 \ I from N^2 \ I.

Proof:
It suffices to prove that every path between N^1 \ I and N^2 \ I is blocked by I. Every path between N^1 \ I and N^2 \ I has at least one d-sepnode. From the definition of d-separate, if one of the d-sepnodes in a path is a non-head-to-head node, the path is blocked.
In the case of a path with a single d-sepnode, by definition, the d-sepnode must be a non-head-to-head node. In the case of a path with multiple d-sepnodes, for any 2 adjacent d-sepnodes on the path, one of them must be a non-head-to-head node. □

The lemma can be generalized into the following theorem, which states that d-sepsets d-separate a subDAG from the rest of the DAG. Note that when a d-sepset is indexed with 2 superscripts, their order is immaterial.

Theorem 2 (completeness of d-sepset) Let a DAG D be sectioned into {D^1, ..., D^β} and I^{ij} = N^i ∩ N^j be the d-sepset between D^i and D^j. ∪_j I^{ij} d-separates N^i \ ∪_j I^{ij} from N \ N^i.

The theorem implies that the joint distribution on d-sepsets is the minimum information to be exchanged between a Bayesian subnet and the rest of the network. Consider a DAG (N, E) sectioned into β > 1 subDAGs. Let A denote the set of all the non-d-sepnodes in one of the subDAGs, C denote the union of all the d-sepsets between the chosen subDAG and its neighbor subDAGs, and B denote the set N \ A \ C. From the above theorem, C d-separates A from B. Thus p(ABC) = p(AC)p(BC)/p(C).
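The factorization p(ABC) = p(AC)p(BC)/p(C) can be verified numerically on a small case. The Python sketch below is an added illustration with made-up numbers. It assumes three binary variables A, B and C whose joint distribution is built so that A and B are conditionally independent given C, which is what d-separation by C guarantees. All names and probability values are hypothetical.

```python
from itertools import product

# Hypothetical tables: p(c), p(a | c) indexed [c][a], p(b | c) indexed [c][b].
p_c = {0: 0.3, 1: 0.7}
p_a_given_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_b_given_c = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}

# Joint p(a, b, c) with A and B conditionally independent given C.
joint = {
    (a, b, c): p_a_given_c[c][a] * p_b_given_c[c][b] * p_c[c]
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(table, keep):
    """Sum the joint table down to the positions listed in keep."""
    out = {}
    for states, p in table.items():
        key = tuple(states[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


p_ac = marginal(joint, (0, 2))
p_bc = marginal(joint, (1, 2))
p_c_marg = marginal(joint, (2,))

# Largest deviation between p(abc) and p(ac) p(bc) / p(c).
max_error = max(
    abs(p - p_ac[(a, c)] * p_bc[(b, c)] / p_c_marg[(c,)])
    for (a, b, c), p in joint.items()
)
print(max_error < 1e-12)  # True
```

The check passes only because of the conditional independence built into the joint table. For an arbitrary joint distribution the right-hand side would in general differ from p(ABC).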
When evidence is available such that some of the variables in A are instantiated, one can update p(AC) into p'(AC) = p'(A)p(C|A). One can update p(C) into p'(C) by marginalization on p'(AC). Then it follows that p'(BC) = p(B|C)p'(C), and that the updated joint distribution is p'(ABC) = p'(AC)p'(BC)/p'(C). Note that updating p'(BC) is through replacing p(C) by p'(C), the posterior joint distribution on the d-sepset union.

Example 5 The DAG Θ in Figure 4.17 is sectioned into {Θ^1, Θ^2, Θ^3} in Figure 4.20. I^{12} = {H1, H2} is the d-sepset between Θ^1 and Θ^2; I^{13} = {H2, H3, H4} is the d-sepset between Θ^1 and Θ^3; and I^{23} = {H2} is the d-sepset between Θ^2 and Θ^3. I^{12} ∪ I^{13} = {H1, H2, H3, H4} d-separates the rest of Θ^1 from the rest of Θ^2 and Θ^3.

Figure 4.20: The set {Θ^1, Θ^2, Θ^3} of 3 subDAGs (top) forms a sectioning of Θ in Figure 4.17. {Λ^1, Λ^2, Λ^3} (middle) is the set of morali-triangulated graphs of {Θ^1, Θ^2, Θ^3}, and {Γ^1, Γ^2, Γ^3} (bottom) is the corresponding junction forest obtained by Algorithm 2. The ribbed bands indicate linkages.

There is a close relation between the d-sepset and the usual graph separator, given in Proposition 7. If Z is the graph separator of X and Y, then the removal of the set Z of nodes from the graph (together with their associated links) would render the nodes in X disconnected from those in Y.

Proposition 7 Let a DAG D be sectioned into {D^1, D^2}. The set of nodes I = N^1 ∩ N^2 is a d-sepset between D^1 and D^2 if I is a graph separator in the moral graph of D.

Proof:
Suppose I = N^1 ∩ N^2 is a d-sepset. For any A1 ∈ N^1 \ I and A2 ∈ N^2 \ I, moralization will not create links between A1 and A2. Hence I is a graph separator in the moral graph of D.
On the other hand, suppose I separates N^1 \ I from N^2 \ I in the moral graph M of D, and I is not a d-sepset. Then there exists A ∈ I such that A has a parent A1 ∈ N^1 \ I and a parent A2 ∈ N^2 \ I. But then there would have been a link between A1 and A2 by moralization.
Thus I is not a separator in M. Contradiction. □

The properties of d-separation in the DAG representation of Bayesian networks have been studied extensively [Pearl 88, Geiger, Verma and Pearl 90]. It can be used to derive Pearl's propagation algorithm in singly connected Bayesian nets [Neapolitan 90]. But to my knowledge, its implication in the secondary structure has not been examined. The definition of the d-sepset now allows one to do so.

4.5.2 Implication of d-sepset in Junction Trees

By representing a multiply connected Bayesian network in its secondary structure, the junction tree, flexible and efficient belief propagation can be achieved. With the d-sepset concept defined, one would like to know how information is passed in the junction tree between nodes separated by the d-sepset in the original Bayesian network.

Lemma 5 Let a DAG D be sectioned into {D^1, D^2} and I = N^1 ∩ N^2 be the d-sepset. A junction tree T can be constructed from D, such that the following statement is true.

For all pairs of nodes A1 ∈ N^1 \ I and A2 ∈ N^2 \ I, if A1 is contained in clique C1 and A2 in C2, then on the unique path between C1 and C2 in T, there exists a clique sepset Q containing only d-sepnodes.

Proof:
[Step 1] First, prove that T can be constructed such that C1 ≠ C2. By definition of the d-sepset, A1 and A2 are not adjacent in D. Cliques are formed through moralization and triangulation. If A1 and A2 are both adjacent to a d-sepnode H, then by definition of the d-sepset, A1 and A2 cannot both be H's parents. Thus moralization does not create a link between A1 and A2.
Enforce the following rule for triangulation: if a cycle contains 2 or more non-adjacent d-sepnodes, add a link between 2 such d-sepnodes first.
After moralization, if there is only 1 path between A1 and A2, triangulation will not add a link between them. If there is more than 1 path, at least 1 cycle is formed. For any such cycle including A1 and A2, 2 paths between A1 and A2 can be identified.
On each of the 2 paths, there must be at least 1 d-sepnode. With the above rule followed, a link between 2 such d-sepnodes is always added such that there is no need to link A1 and A2 to triangulate D.
Since A1 and A2 are not linked initially, and are not linked during moralization and triangulation, they will not be contained in the same clique in T.
[Step 2] Proceed from C1 to C2 along the unique path in T connecting them to find Q. Examine C1's neighbor clique C_x.
Case 1: C_x ⊆ I. Then Q = C1 ∩ C_x.
Case 2: C_x contains nodes in N^2 \ I. By step 1, C1 does not contain nodes in N^2 \ I and C_x does not contain nodes in N^1 \ I. However, C1 ∩ C_x ≠ ∅. Therefore Q = C1 ∩ C_x.
Otherwise C_x contains nodes in N^1 \ I. In this case, proceed to C_x's neighbor C_y, and the above examination is repeated. Since the other end of the path is C2, which contains no node in N^1 \ I, at some point along the path one will eventually hit either case 1 or case 2. □

The lemma can be generalized to the case of any finite number of subDAGs. This is the following proposition. Its proof is similar to that of the lemma.

Proposition 8 (belief relay) Let a DAG D be sectioned into {D^1, ..., D^β} and I^i = ∪_j I^{ij} be the union of the d-sepsets between D^i and its neighbor subDAGs. A junction tree T can be constructed from D, such that the following statement is true.

For all pairs of nodes A1 ∈ N^i \ I^i and A2 ∈ N \ N^i, if A1 is contained in clique C1 and A2 in C2, then on the unique path between C1 and C2 in T, there exists a clique sepset Q containing only d-sepnodes in I^i.

Example 6 Recall the DAG Θ in Figure 4.17, which is sectioned into {Θ^1, Θ^2, Θ^3} in Figure 4.20 with I^{13} = {H2, H3, H4} being the d-sepset between Θ^1 and Θ^3. Consider the node A3 in clique {H3, H4, A3} and the node E2 in clique {E4, E3, E2} in the junction tree Γ in Figure 4.17.
In the path between the 2 cliques, the sepset {H3, H4} between cliques {H3, H4, A3} and {H3, H4, E3} contains only d-sepnodes.

When new evidence is available, it can be propagated to the overall junction tree through sepsets between cliques [Jensen, Lauritzen and Olesen 90]. Therefore, the above proposition means that a junction tree can be constructed such that evidence in N^i \ I^i must pass through at least 1 sepset containing only nodes in I^i in order to be propagated to nodes in N \ N^i.

Theorem 2 and Proposition 8 suggest that one can organize the clique hypergraph such that the cliques corresponding to different subDAGs separated by d-sepsets can be organized into different junction trees. Communication between them can be accomplished through d-sepsets. This idea is formalized below.

4.6 Multiply Sectioned Bayesian Nets

4.6.1 Definition of MSBN

Definition 7 (MSBN) Let S = (N, E, P) be a Bayesian network; D = (N, E) be sectioned into {D^1, ..., D^β} where D^i = (N^i, E^i); and I^{ij} = N^i ∩ N^j be the d-sepset between D^i and D^j (1 ≤ i, j ≤ β; i ≠ j). Assign d-sepnodes to subDAGs in the following way.

For each d-sepnode A_k, if the in-degree δ_i of A_k in subDAG D^i satisfies δ_i ≥ δ_j (j = 1, ..., β), then assign A_k to subDAG D^i and break ties arbitrarily.

Assign probability distribution P^i for subDAG D^i (i = 1, ..., β) in the following way.

For every node A_k ∈ N^i, if A_k is a d-sepnode and A_k is not assigned to D^i, assign to A_k a uniform probability table [footnote 3]. Otherwise assign to A_k a probability table identical to that in (N, E, P).

Call S^i = (D^i, P^i) = (N^i, E^i, P^i) a sect and call the set of sects {S^1, ..., S^β} a Multiply Sectioned Bayesian Network (MSBN).

The original Bayesian net S is called an ‘UnSectioned Bayesian Network (USBN)’. Note that the sectioning of a Bayesian network is essentially determined by the sectioning of the corresponding DAG D. It does not matter which way ties are broken.
There will be no significant difference in further processing.

Example 7 Suppose the variables in the DAG Θ in Figure 4.17 are all binary. Associate the probability distribution P given in Table 4.3 with Θ. (Θ, P) is a USBN.

Given the USBN (Θ, P) and the corresponding 3 subDAGs Θ^1, Θ^2 and Θ^3, a 3-sect MSBN {(Θ^1, P^1), (Θ^2, P^2), (Θ^3, P^3)} can be constructed. First assign d-sepnodes H1, ..., H4 to the subDAGs. H2 and H4 must be assigned to Θ^1. H1 can be assigned to either Θ^1 or Θ^2, and H3 can be assigned to either Θ^1 or Θ^3. Here it is chosen to assign all 4 d-sepnodes to Θ^1. Based on this assignment and the P given, the probability distribution for each sect can be determined (Table 4.4).

Footnote 3: This is necessary, for one thing, because if a sect does not contain all of a d-sepnode's parents as in the homogeneous net, the size of the probability table of the d-sepnode must be decreased; for another, because this assignment will guarantee that the joint system belief constructed in section 4.7.3 is proportional to the joint probability distribution P.

p(h11)= .15p(h2iIaaii)= .86962iIa2iii2)= .7P(h2122G11) .6p(hao)= .08p(h31)= .3p(h411a3) .25p(hiIa32)= .4p(ai,Ihii) = .8p(a11h2)= .1P(G2lih3) .8P(a21132)= .1
Table 4.3: Probability distribution associated with DAG Θ in Figure 4.17.
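The bookkeeping behind Definitions 6 and 7 can be sketched in a few lines. The Python code below is an added illustration, not an algorithm from the thesis. It assumes that a DAG is given as a map from each node to its parents, and that the in-degree of a d-sepnode in a subDAG is the number of its parents lying in that subDAG's node set. The function names and the small two-sect example are hypothetical.

```python
def is_dsepset(parents, n1, n2):
    """Definition 6: every node in the intersection of n1 and n2 must
    have all of its parents inside n1 or all of them inside n2."""
    interface = n1 & n2
    return all(
        set(parents.get(a, ())) <= n1 or set(parents.get(a, ())) <= n2
        for a in interface
    )


def assign_dsepnodes(parents, node_sets, dsepnodes):
    """Definition 7: give each d-sepnode to a subDAG where its in-degree
    is largest (max keeps the first such subDAG, i.e. an arbitrary tie-break)."""
    owner = {}
    for a in dsepnodes:
        hosts = [name for name, nodes in node_sets.items() if a in nodes]
        owner[a] = max(
            hosts,
            key=lambda name: len(set(parents.get(a, ())) & node_sets[name]),
        )
    return owner


def uniform_table(n_states):
    """Table given to a d-sepnode in every sect that does not own it."""
    return [1.0 / n_states] * n_states


# Hypothetical DAG: A1 -> H <- A2, H -> F, cut into two subDAGs at H.
parents = {"H": ["A1", "A2"], "F": ["H"]}
node_sets = {"D1": {"A1", "A2", "H"}, "D2": {"H", "F"}}

print(is_dsepset(parents, node_sets["D1"], node_sets["D2"]))  # True
print(assign_dsepnodes(parents, node_sets, ["H"]))            # {'H': 'D1'}
print(uniform_table(2))                                       # [0.5, 0.5]
```

In this toy case H keeps its original table in D1, where both of its parents are present, and receives the uniform table in D2.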
Note the uniform probability tables assigned to d-sepnodes in Θ^2 and Θ^3.

Table 4.4: Probability distribution associated with subDAGs Θ^1, Θ^2 and Θ^3 in Figure 4.20.
p(a31 h31) = .3 p(e21 eMe41) = .9789p(a311h2)= .8 211e342)= .8p(ei134 = .9p(a411h)= .9 211e32e4) .05p(a411h2 = .2p(eajlhjhi)= .7702p(f11Jhh2)= .7895 (e3122)= .35p(fjjh2) .5 p(e1h) .65P(11h2 .6 (e3122h)= .01p(f2h)= .05p(e411h)= .8P(f2lIflI) .4 (411h2)= .15P(f211f12) .75(eiiIe3i)= .2P(e11e32) = .71
p(h11)= .15 p(h11)= .5 p(hjj)= .5p(h21jaa1)= .8696 p(h21) .5 p(h21) = .5p(ha2= .7p(h211 = .6 p(f1iIh,h21)= .7895 p(h31) = .5p(ha2a)= .08 p(fiiIhiih2)= .5p(fiiIhii)= .6 p(e11[e3)= .2p(h31)= .3 (f112h2)= .05 P(eill32)= .7p(h411a3)= .25 P(f2lIfll) = .4 p(e2113e4)= .9789p(h41j32 = .4 P(f21 1112) = .75 211e342 = .82i1e32e4i) .9p(aiiIhii) .8 p(ejI3242)= .05p(a112)= .1p(e311h2h= .7702P(a2lIhal) = .8 p(e2)= .35p(a211h32)= .1 p(e311h2z= .65p(e2h)= .01p(a311h)= .3p(a311h2 = .8 p(e411h)= .8P(el1h42 = .15p(a411h)= .9(iIh42 = .2

4.6.2 Soundness of Sectioning

In order to perform efficient inference computation in a multiply connected Bayesian net, the junction tree technique transforms the Bayesian net into a clique hypergraph through moralization and triangulation. Then the hypergraph is organized into a junction tree, and efficient inference can take place. Because of the computational advantage of junction trees, in the context of a MSBN, one would like to transform each sect into a junction tree representation. The immediate question is: for an arbitrary MSBN, what is the condition on the transformation such that correct inference can take place in the resultant set of junction trees? The following reviews the major theoretical results related to this question.

Lauritzen et al. [1984] show that the clique hypergraph of a graph is decomposable if the graph is triangulated. Jensen [1988] proves that a hypergraph has a junction tree if it is decomposable.
Maier [1983] proves the same in the context of relational databases. Jensen, Lauritzen and Olesen [1990] and Pearl [1988] show that a junction tree representation of a Bayesian net is an equivalent representation in the sense that the information about the joint probability distribution is preserved. Finally, a more flexible algorithm (compared to [Lauritzen and Spiegelhalter 88]) is devised on the junction tree representation of multiply connected Bayesian nets [Jensen, Lauritzen and Olesen 90].

The above results highlight the importance of the clique hypergraphs resulting from triangulation of the original graphs. Thus, when one tries to transform each sect in a MSBN into a junction tree, it is necessary to preserve the intactness of the clique hypergraph resulting from the corresponding USBN. This is possible only if the sectioning of the DAG $D$ of the original USBN is sound, as defined formally below.

Definition 8 (soundness of sectioning) Let a DAG $D$ be sectioned into $\{D^1, \ldots, D^\beta\}$. If there exists a clique hypergraph from $D$ such that for every clique $C_k$ in the hypergraph there is at least 1 subDAG $D^i$ satisfying $C_k \subseteq N^i$, then the sectioning is sound. Call $D^i$ the host subDAG of clique $C_k$.

Although the soundness of sectioning is defined on DAGs, the concept is useful only in the context of MSBNs. Therefore, when the sectioning of a DAG is sound, it is said that the sectioning of the corresponding USBN into the MSBN is sound.

If the sectioning of a DAG $D$ is unsound, then one will find no host subDAG for at least 1 clique in every possible hypergraph from $D$. If a MSBN is based on this sectioning of the DAG, it is impossible to maintain the autonomous status of sects in the secondary representation.

Example 8 In Figure 4.21, $\{D^1, D^2, D^3\}$ is an unsound sectioning of $D$. The clique hypergraph for $D$ must have the clique $\{A, B, C\}$, which finds no host subDAG among $D^1$, $D^2$ and $D^3$.

Figure 4.21: Top left: A DAG D. Top right: The set of subDAGs from an unsound sectioning of D.
Bottom left: The junction tree T from D.

The following develops a necessary and sufficient condition for soundness of sectioning.

Lemma 6 Let $A_1 - \cdots - A_{i-1} - B_1 - \cdots - B_{j-1} - C_1 - \cdots - C_{k-1} - \cdots - A_1$ be a cycle consisting of nodes from 3 or more sets: $X = \{A_1, \ldots, A_{i-1}, B_1\}$, $Y = \{B_1, \ldots, B_{j-1}, C_1\}$, and so on. The nodes from the same set are adjacent in the cycle, and 2 adjacent sets have a node in common. Then triangulation of this cycle must create a triangle whose 3 nodes do not all belong to any single set.

Proof:
The proof is inductive. The lemma is trivially true if the cycle involves only 3 nodes. Assume the lemma is true when the cycle involves $n = 3$ sets and the cycle $\Theta$ is $A_1 - \cdots - A_{i-1} - B_1 - \cdots - B_{j-1} - C_1 - \cdots - C_{k-1} - A_1$, i.e., there are $i$ nodes in set 1, $j$ in set 2, and $k$ in set 3. If 1 more node is added to set 3 ($k' = k + 1$) such that a node $C_k$ is added between $C_{k-1}$ and $A_1$ of cycle $\Theta$, then one can first add a chord between $C_{k-1}$ and $A_1$, and triangulate the rest of the cycle. Since the remaining untriangulated cycle is exactly the cycle $\Theta$, the lemma is true by assumption. Since the 3 sets are symmetric and the nodes in any set above can be augmented by nodes from any sets other than the above 3, the proof is valid in general. □

If a MSBN has only 2 sects, the sectioning is always sound. Unsoundness can arise only when there are 3 or more sects. The following shows exactly the case where a sectioning is unsound.

Theorem 3 (inter-subDAG cycle) A sectioning of a DAG $D$ into a set of 3 or more subDAGs is sound if there exists no (undirected) cycle in $D$ which consists of nodes from 3 or more distinct subDAGs such that the nodes from each subDAG are adjacent on the cycle.

Proof:
Let $D$ be a DAG sectioned into $\{D^1, \ldots, D^\beta\}$ ($\beta \geq 3$). Suppose, relative to this sectioning, $D$ has no cycle which consists of nodes from 3 or more distinct subDAGs such that the nodes from each subDAG are adjacent on the cycle.
By definition of d-sepset, moralization of $D$ may triangulate existing cycles but will not create new cycles. Thus by assumption, in the moral graph of $D$, all cycles consisting of nodes from more than 1 subDAG such that the nodes from each subDAG are adjacent on the cycle can involve nodes from at most 2 subDAGs. By the argument in Step 1 of the proof for Lemma 5, all the 2-subDAG crossing cycles can be triangulated without creating cliques containing non-d-sepnodes in both subDAGs. Thus the sectioning is sound.

On the other hand, suppose, relative to the sectioning, there is a cycle in $D$ which consists of nodes from 3 or more distinct subDAGs such that the nodes from each subDAG are adjacent on the cycle. Then, by Lemma 6, the triangulation of this cycle must create a triangle whose 3 nodes do not all belong to any single subDAG. Hence the sectioning is unsound. □

Given a DAG and a sectioning, searching for inter-subDAG cycles relative to the sectioning is expensive, especially by local computation when space is a concern. Even if a sectioning is found to be sound, it may not be computationally desirable at later reasoning stages, as will be discussed in the corresponding sections. Furthermore, the sects of a large system (MSBN) are constructed one at a time. If a sectioning is unsound and this can only be discovered after all sects have been constructed, the overall revision would be disastrous. Thus one would like to develop simple guidelines for sound sectioning which can be followed during incremental construction of MSBNs. The following covering subDAG rule is one such guideline.

Theorem 4 (covering subDAG) Let a DAG $D$ be sectioned into $\{D^1, \ldots, D^\beta\}$. Let $I^{jk} = N^j \cap N^k$ be the d-sepset between $D^j$ and $D^k$. If there is a subDAG $D^i$ such that $N^i \supseteq \bigcup_{j \neq k} I^{jk}$, then the sectioning is sound. The subDAG $D^i$ is called the covering subDAG relative to the sectioning.

In the context of a MSBN, call the sect corresponding to the covering subDAG the covering sect.
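The condition of Theorem 4 only involves the domains of the subDAGs, so it can be checked with set operations. The following is a minimal sketch, not part of the thesis: the function name `covering_subdags` is hypothetical, the first list of domains is read off the node names in Table 4.3 for the 3-sect example, and the second is an invented ring of three subDAGs.

```python
from itertools import combinations

def covering_subdags(domains):
    """Return the indices i whose domain N^i contains every d-sepset I^{jk}, j != k."""
    dsep_union = set()
    for n_j, n_k in combinations(domains, 2):
        dsep_union |= n_j & n_k          # I^{jk} = N^j ∩ N^k
    return [i for i, n_i in enumerate(domains) if dsep_union <= n_i]

# Domains assumed for the 3-sect example (node names as in Table 4.3).
theta = [
    {"H1", "H2", "H3", "H4", "A1", "A2", "A3", "A4"},
    {"H1", "H2", "F1", "F2"},
    {"H2", "H3", "H4", "E1", "E2", "E3", "E4"},
]
# Hypothetical ring: each pair of subDAGs shares exactly one node.
ring = [{"A", "B", "X"}, {"A", "C", "Y"}, {"B", "C", "Z"}]

print(covering_subdags(theta))  # [0]
print(covering_subdags(ring))   # []
```

An empty result does not by itself mean that a sectioning is unsound, since the covering subDAG rule is only a sufficient condition.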
Note that the covering sect rule actually imposes a conditional independence constraint at a macro level.

Proposition 9 Let $S^j$ and $S^k$ be any 2 sects in a MSBN with a covering sect $S^i$ ($j \neq k$). The 2 sets of variables $N^j$ and $N^k$ are conditionally independent given $N^i$.

Example 9 Consider the 3-sect MSBN $\{(\Theta^1, P^1), (\Theta^2, P^2), (\Theta^3, P^3)\}$ constructed in Example 7. $(\Theta^1, P^1)$ is the covering sect.

Note that, in general, the covering sect of a MSBN may not be unique. As far as soundness is concerned, one is as good as another. In practice, the one consulted most often or the one of smallest size is preferred for the sake of computational efficiency, as will become clear later.

The covering sect usually arises naturally. For example (Chapter 5), in a neuromuscular diagnosis system, the sect containing knowledge about clinical examination contains all the disease hypotheses considered by the system. The EMG sect or nerve conduction sect contains only a subset of the disease hypotheses, based on the diagnostic importance of these tests to each disease. Thus the clinical sect is a natural covering sect, with all the disease hypotheses as d-sepnodes interfacing the sect with the other sects.

The covering subDAG rule can be used repeatedly to create sophisticated MSBNs which are sound. When doing so, the global covering subDAG requirement is replaced by a local covering subDAG requirement. A local covering subDAG is a subDAG interfacing two or more subDAGs and including all d-sepsets among them in its domain.

Theorem 5 (hypertree) Any MSBN created by the following procedure is sound.

Start with any single subDAG. Recursively add a new subDAG $D^i$ to the previous subDAGs such that, if $D^i$ has nonempty d-sepsets with more than one previous subDAG, then one of the previous interfacing subDAGs must be a local covering subDAG which covers all the d-sepsets resulting from the addition.

The following example illustrates the hypertree rule.
It also explains why the sectioning determined by the procedure is sound.

Example 10 Figure 4.22 depicts part of a MSBN constructed by the hypertree rule. Each box represents a subDAG, with boundaries between boxes representing d-sepsets. The superscripts of subDAGs represent the order of their creation. $D^1$, $D^4$ and $D^5$ are local covering subDAGs.

Figure 4.22: A MSBN with a hypertree structure.

It is easy to see that the inter-subDAG cycle described in Theorem 3 cannot occur in this MSBN due to its hypertree structure, and hence the sectioning is sound.

Note that the hypertree rule also imposes a conditional independence constraint at a macro level.

Proposition 10 Let $S^i$ and $S^j$ be any 2 sects with an empty d-sepset in a MSBN sectioned by the hypertree rule. Let $S^k$ be any sect on the unique route mediating $S^i$ and $S^j$ on the hypertree. The 2 sets of variables $N^i$ and $N^j$ are conditionally independent given $N^k$.

It should be noted that the covering subDAG rule and the hypertree rule do not cover every case where the sectioning is sound.

Example 11 The 3-sect MSBN $\{D^1, D^2, D^3\}$ in Figure 4.23 has no covering subDAG, but the sectioning is sound.

Note that, although the sectioning of the MSBN in Figure 4.23 is sound, this kind of structure is restricted. For example, one can add arcs between A and B in $D^1$ and between A and C in $D^2$, but as soon as one adds 1 more arc between B and C in $D^3$, Theorem 3 is violated and the sectioning becomes unsound.
That is, when $n$ subDAGs ($n \geq 3$) are interfaced in this style, at most $n - 1$ of them can be multiply connected. Further computational problems with such structures will be discussed in the appropriate later sections.

Since MSBNs constructed by the covering subDAG rule or the hypertree rule have sound sectioning, are less restricted, and have extra computational advantages (to be discussed in later sections) over MSBNs which do not follow these rules, the following study is directed only to the former MSBNs.

Conceptually, all MSBNs constructed by the hypertree rule can be viewed as MSBNs with covering subDAGs when attention is directed to local structures. For example, if one pays attention to $D^1$ in Figure 4.22 and its surrounding subDAGs, one can view $D^2, D^4, D^6, D^7, D^8$ as 1 subDAG, $D^3, D^5, D^9, D^{10}, D^{11}$ as another, and $D^{12}, D^{13}$ and $D^{14}, D^{15}$ as 2 others. Thus the MSBN is viewed as one with a global covering subDAG $D^1$. Likewise, when one is concerned with the relation between $D^{14}$ and $D^{15}$, the MSBN can be viewed as one satisfying the covering subDAG rule with $\beta = 2$. Therefore, the computation required for a MSBN with a hypertree structure is just the repetition of the computation required for a MSBN with a global covering subDAG. On the other hand, a MSBN with a global covering subDAG is a special case of the hypertree structure. Hence, the following study is often simplified by considering only one of the 2 cases.

Figure 4.23: Top left: A DAG D. Top right: A junction tree T from D. Bottom left: $\{D^1, D^2, D^3\}$ forms a sound sectioning of D. Bottom right: The junction trees from the MSBN in Bottom left.

4.7 Transform MSBN into Junction Forest

In order to perform efficient inference in a general but sparse network, it is desirable to transform each sect of a MSBN into a junction tree which will stand as an inference entity (Section 4.4).
The transformation takes several steps, discussed in this section. The set of subDAGs of the MSBN is morali-triangulated into a set of morali-triangulated graphs, from which a set of clique hypergraphs is formed. Then the clique hypergraphs are organized into a set of junction trees of cliques. Afterwards, the linkages between the junction trees are created. Finally, belief tables are assigned to cliques and linkages, and a junction forest of belief universes is constructed.

4.7.1 Transform SubDAGs into Junction Trees by Local Computation

The key issue is morali-triangulating the subDAGs of a MSBN into a set of morali-triangulated graphs. Once this is done, the formation of the clique hypergraph and the organization of the junction tree for each subDAG are performed in the same way as in the case of a USBN and a single junction tree [Andersen et al. 89, Jensen, Lauritzen and Olesen 90]. As mentioned before, the criterion in morali-triangulating a set of subDAGs of a MSBN into a set of clique hypergraphs is to preserve the 'intactness' of the clique hypergraph resulting from the corresponding USBN. The concept of 'intactness' is formalized below.

Definition 9 (invertible morali-triangulation) Let $D$ be a DAG sectioned into $\{D^1, \ldots, D^\beta\}$, where $D^i$ has domain $N^i$. If there exists a morali-triangulated graph $G$ of $D$, with clique hypergraph $H$, such that $G = \bigcup_{i=1}^{\beta} G^i$, where $G^i$ is the subgraph of $G$ induced by $N^i$, and $H = \bigcup_{i=1}^{\beta} H^i$, where $H^i$ is the clique hypergraph of $G^i$, then the set of morali-triangulated graphs $\{G^1, \ldots, G^\beta\}$ is invertible. Also, the transformation of $\{D^1, \ldots, D^\beta\}$ into $\{G^1, \ldots, G^\beta\}$ is said to be an invertible morali-triangulation.

A morali-triangulated graph $G$ is equivalent to the corresponding clique hypergraph $H$ in that, given one of them, the other is completely determined.
By Definition 8, if the sectioning of a MSBN is sound, then there exists a clique hypergraph $H$ whose every clique is a subset of the domain of at least 1 subDAG of the MSBN. Thus one has the following theorem.

Theorem 6 (existence of invertible morali-triangulation) There exists an invertible morali-triangulation for $\{D^1, \ldots, D^\beta\}$ sectioned from a DAG $D$ if the sectioning is sound.

One could construct a set of invertible morali-triangulated graphs of a MSBN by first performing a global computation (moralization and triangulation) on $D$ to find $G$, and then determining its subgraphs relative to the sectioning of the MSBN. The moralization and triangulation would be the same as in the junction tree technique, with care taken not to mix nodes from different subDAGs into one clique. However, when space requirement is of concern, MSBNs offer the possibility of morali-triangulation by local computation at the level of the subDAGs of sects. Each subDAG in a MSBN is morali-triangulated separately (message passing may be involved) such that the collection of them is invertible. The following discusses how this can be achieved.

Example 12 In the example depicted in Figures 4.17 and 4.20, $\Theta$ is sectioned into $\{\Theta^1, \Theta^2, \Theta^3\}$ by a sound sectioning, and $\{\Lambda^1, \Lambda^2, \Lambda^3\}$ is a set of invertible morali-triangulated graphs relative to the sectioning. The desire is to find $\Lambda^i$ ($i = 1, 2, 3$) from $\Theta^i$ ($i = 1, 2, 3$) by local computation.

Since subDAGs of a MSBN are interfaced through d-sepsets, the focus of finding a set of invertible morali-triangulated graphs by local computation is to decide whether each pair of d-sepnodes is to be linked. Coordination between neighbor subDAGs is necessary to ensure correct decisions. The following considers this systematically.

Call a link between 2 d-sepnodes a d-link. Call a simple path $(A_1, A_2, \ldots, A_k)$ a d-path if $A_1, \ldots, A_i$ and $A_j, \ldots, A_k$ ($1 \leq i$, $i + 1 \leq j$, $j \leq k$) are all d-sepnodes, while all the other nodes on the path are non-d-sepnodes.
A d-link is a trivial d-path. Each morali-triangulated graph $G$ from an invertible morali-triangulation may generally contain 6 types of d-links.

Arc type: inherited from the subDAG. That is, if 2 d-sepnodes are connected originally in the subDAG, there is a d-link between them in $G$. The decision on this type of d-link is trivial.

ML type: created by local moralization. For example, the d-links $(H_1, H_2)$ in $\Lambda^2$ and $(H_2, H_3)$ in $\Lambda^3$.

ME type: created by moralization in neighbor subDAGs. For example, the d-links $(H_1, H_2)$ and $(H_2, H_3)$ in $\Lambda^1$. The decision on this type of d-link requires communication between neighbor subDAGs.

Cy type: created to triangulate inter-subDAG cycles. For example, the d-link $(H_3, H_4)$ in $\Lambda^1$ and $\Lambda^3$. The decision on this type of d-link requires communication between neighbor subDAGs.

TL type: created during local triangulation. After the above 4 types of d-links have been introduced to the moral graph of a subDAG, there may still be untriangulated cycles within the moral graph involving 4 or more d-sepnodes. The example used above is too simple to illustrate this and the next type.

TE type: created by local triangulation in neighbor subDAGs. The triangulation of a cycle of length > 3 involving only d-sepnodes is not unique. If 2 neighbor subDAGs triangulate such a cycle by local computation without coordination, they may triangulate in different ways and end up with different sets of cliques for the nodes in the d-sepset. Therefore communication is required such that a subDAG may adopt the d-links introduced by triangulation in neighbor subDAGs. The argument also applies to the case of triangulating cycles consisting of general d-paths.

An algorithm for morali-triangulation of the subDAGs of a MSBN into a set of invertible triangulated graphs under the covering subDAG assumption is given below.

Algorithm 2 (morali-triangulation with a covering subDAG) Let $D^1$ be the covering subDAG in the MSBN.

1. For each subDAG $D^i$, do the following: (1) moralize $D^i$, and add the d-links due to moralization to $ML^i$ (a set of node pairs); (2) search for pairs of d-sepnodes connected by a d-path in the moral graph of $D^i$, and add the pairs found to $Cy^i$ (a set of node pairs).

2. For $D^1$, do the following: (1) for each pair of nodes of $D^1$ contained in one of $ML^i$ ($i > 1$), connect the pair in the moral graph of $D^1$; (2) for each pair of d-sepnodes contained in both $Cy^1$ and one of $Cy^j$ ($j > 1$), connect the 2 nodes in the moral graph of $D^1$; (3) triangulate the augmented moral graph of $D^1$ (the morali-triangulation of $D^1$ is completed); and (4) add the d-links to DLINK (a set of node pairs).

3. For $D^i$ ($i = 2, \ldots, \beta$), do the following: (1) for each pair of nodes of $D^i$ contained in DLINK, connect the pair in the moral graph of $D^i$; and (2) triangulate the augmented moral graph of $D^i$. The morali-triangulation of $D^i$ is completed.

Note that the above process makes 2 passes through all the subDAGs. The following theorem shows the invertibility of the morali-triangulation.

Theorem 7 (invertibility of Algorithm 2) The morali-triangulation by Algorithm 2 is invertible.

Proof:
The key issue is to decide on d-links. The algorithm considers only neighbor relations between the covering subDAG and other subDAGs. The first step is to show that this is sufficient.

Let $d_a$ and $d_b$ be 2 d-sepnodes between 2 subDAGs $D^a$ and $D^b$. Let $D^1$ be the covering subDAG.
It is claimed that if $d_a$ and $d_b$ join 2 simple paths in $D^a$ and $D^b$ respectively to form an inter-subDAG cycle, then they must join 2 simple paths in $D^a$ and $D^1$ as well. Since $d_a$ and $d_b$ are both contained in $D^1$, which is connected, this is certainly true.

Similarly, if there is a cycle involving d-sepnodes in the d-sepset between $D^a$ and $D^b$, these d-sepnodes are also in the d-sepset between $D^a$ and $D^1$.

Now it is a simple matter to show: the ME type d-links are introduced by step 2 (1) and step 3 (1); the Cy type d-links are introduced by step 2 (2) and step 3 (1); and the TE type d-links are introduced by step 2 (2) and step 3 (1). □

Example 13 The following recounts the morali-triangulation of $\bigcup_{i=1}^{3} \Theta^i$ in Figure 4.20 by Algorithm 2.

1. After step 1 of the algorithm, $ML^1 = \emptyset$, $ML^2 = \{(H_1, H_2)\}$, and $ML^3 = \{(H_2, H_3)\}$; $Cy^1 = \{(H_1, H_2), (H_2, H_3), (H_3, H_4)\}$, $Cy^2 = \{(H_1, H_2)\}$, and $Cy^3 = \{(H_2, H_3), (H_2, H_4), (H_3, H_4)\}$.

2. After step 2, the morali-triangulated graph $\Lambda^1$ of $\Theta^1$ is completed by adding the d-links $(H_1, H_2)$, $(H_2, H_3)$, $(H_3, H_4)$ to the moral graph of $\Theta^1$, and then triangulating (nothing is added). DLINK will contain $\{(H_1, H_2), (H_2, H_3), (H_3, H_4)\}$.

3. After step 3, the morali-triangulated graph $\Lambda^2$ of $\Theta^2$ is completed without change to its moral graph; the morali-triangulated graph $\Lambda^3$ of $\Theta^3$ is completed by adding the d-link $(H_3, H_4)$ to its moral graph, and then triangulating (with the link $(E_3, H_4)$ added).

As mentioned in section 4.4, after the morali-triangulation, the other steps in the transformation of a MSBN into a set of junction trees of cliques are: identifying the cliques of the morali-triangulated graphs to form a set of clique hypergraphs, and then organizing each hypergraph into a junction tree. These steps are performed in the same way as in the junction tree technique.
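The ML sets recounted in Example 13 can be reproduced with a short sketch of step 1 (1) of Algorithm 2. This is an illustration only: the function names are hypothetical, and the parent lists are an assumption read off the conditional probabilities in Table 4.3 (for instance, p(a11|h11) is taken to mean an arc from H1 to A1), since Figure 4.17 is not reproduced in the text. The d-path search of step 1 (2) and the triangulation steps are not covered.

```python
from itertools import combinations

def moralize(parents):
    """Undirected edge set of the moral graph; `parents` maps node -> list of parents."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))   # drop the direction of each arc
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))       # link parents with a common child
    return edges

def ml_links(parents, dsepnodes):
    """ML type d-links: pairs of d-sepnodes linked by moralization only."""
    arcs = {frozenset((p, c)) for c, pa in parents.items() for p in pa}
    return {tuple(sorted(e)) for e in moralize(parents) - arcs if e <= dsepnodes}

# Parent lists assumed from Table 4.3.
theta1 = {"H1": [], "H3": [], "A1": ["H1"], "A2": ["H3"], "A3": ["H3"],
          "H2": ["A1", "A2"], "H4": ["A3"], "A4": ["H4"]}
theta2 = {"H1": [], "H2": [], "F1": ["H1", "H2"], "F2": ["F1"]}
theta3 = {"H2": [], "H3": [], "H4": [], "E3": ["H2", "H3"], "E4": ["H4"],
          "E1": ["E3"], "E2": ["E3", "E4"]}
dsep = {"H1", "H2", "H3", "H4"}

print(ml_links(theta1, dsep))  # set()
print(ml_links(theta2, dsep))  # {('H1', 'H2')}
print(ml_links(theta3, dsep))  # {('H2', 'H3')}
```

Under these assumed arcs, moralizing the first subDAG links only A1 and A2, which are not d-sepnodes, so its ML set is empty, in agreement with Example 13.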
Throughout the rest of the chapter, it is assumed that junction trees are obtained through a set of invertible triangulated graphs, and it is said that the junction trees are obtained by invertible transformation.

Call a set of junction trees of cliques from an invertible transformation of the subDAGs of a MSBN a junction forest of cliques, denoted by $F = \{T^1, \ldots, T^\beta\}$, where $T^i$ is the junction tree from the subDAG $D^i$.

4.7.2 Linkages between Junction Trees

Just as d-sepsets interface subDAGs, linkages interface junction trees transformed from subDAGs and serve as information channels between junction trees during inference. The extension of the MSBN technique to the junction tree technique is to allow multiple linkages between pairs of junction trees in a junction forest, such that localization can be preserved within junction trees and the exponential explosion of the sizes of clique state spaces associated with the brute force method (section 4.3) can be avoided.

Definition 10 (linkage set) Let $I$ be the d-sepset between 2 subDAGs $D^a$ and $D^b$. Let $T^a$ and $T^b$ be the junction trees transformed from $D^a$ and $D^b$ respectively. A linkage of $T^a$ relative to $T^b$ is a set $l$ of nodes such that the following 2 conditions hold.

1. Boundary: there exists a clique $C_i \in T^a$ such that $l = C_i \cap I$. $C_i$ is called a host clique of $l$;

2. Maximum: there is no subset of $I$ properly containing $l$ that also satisfies the boundary condition.

In general there may be more than one linkage between a pair of junction trees. Define $L^{ab}$ to be the set of all linkages of $T^a$ relative to $T^b$.

Proposition 11 (identity of linkages) Let $T^a$ and $T^b$ be the junction trees from subDAGs $D^a$ and $D^b$ respectively. If $L^{ab}$ is the set of linkages of $T^a$ relative to $T^b$ and $L^{ba}$ is the set of linkages of $T^b$ relative to $T^a$, then $L^{ab} = L^{ba}$.

Proof:
A linkage consists of d-sepnodes which are pairwise connected.
In an invertible morali-triangulation, d-sepnodes are connected identically in both morali-triangulated graphs involved. □

Example 14 In Figure 4.20, linkages between junction trees are indicated with ribbed bands connecting the corresponding host cliques. The 2 linkages between $\Gamma^1$ and $\Gamma^3$ are $\{H_3, H_2\}$ and $\{H_3, H_4\}$.

Given a set of linkages between a pair of junction trees, the concept of a redundancy set can be defined. As mentioned in section 4.4, redundancy sets provide structures which allow redundant information to be removed during inter-junction tree information passing. The concept will be used for defining the joint system belief in section 4.7.3 and the operation NonRedundancyAbsorption in section 4.9.2. To define the redundancy set, one needs to index the linkages such that the redundancy sets defined on the basis of the indexing possess a certain desirable property, which will become clear below. Index a set $L^{ab}$ of linkages by the following procedure.

Procedure 1
1. Pick one of the junction trees in the pair, say $T^a$. Create a tree $G$ with nodes labeled by the linkages in $L^{ab}$. Connect 2 nodes in $G$ by a link if either the hosts of the corresponding linkages are directly connected in $T^a$, or the hosts of the corresponding linkages are (indirectly) connected in $T^a$ by a path on which all intermediate cliques are not linkage hosts. Call this tree a linkage tree.

2. Index the nodes (linkages in $L^{ab}$) of $G$ as $L_1, L_2, \ldots$ in any order that is consistent with $G$, i.e., for every $i > 1$ there is a unique predecessor $j(i) < i$ such that $L_{j(i)}$ is adjacent to $L_i$ in $G$.

With linkages indexed this way, the redundancy set can be defined as follows.

Definition 11 (redundancy set) Let a set of linkages $L^{ab} = \{L_1, \ldots, L_g\}$ be indexed by Procedure 1.
Then for this set of indexed linkages, a redundancy set $R_i$ for index $i$ is defined as

$$R_i = \begin{cases} \emptyset & \text{if } i = 1 \\ L_i \cap L_{j(i)} & \text{if } i > 1, \text{ where } j(i) < i \text{ and } L_{j(i)} \text{ is adjacent to } L_i \text{ in the linkage tree } G \end{cases}$$

Note that a linkage tree is itself a junction tree, and redundancy sets are sepsets of the linkage tree.

Example 15 There are 2 linkages between $\Gamma^1$ and $\Gamma^3$ in Figure 4.20. Consider junction tree $\Gamma^3$. The linkage tree $G$ has 2 connected nodes, one labeled by the linkage $\{H_3, H_2\}$ and the other by $\{H_3, H_4\}$. An indexing $L_1 = \{H_3, H_2\}$ and $L_2 = \{H_3, H_4\}$ defines 2 redundancy sets $R_1 = \emptyset$ and $R_2 = \{H_3\}$.

With linkages and redundancy sets constructed, one has a linked junction forest of cliques.

4.7.3 Joint System Belief of Junction Forest

Let $(D, P)$ be a USBN, $S = \{S^1, \ldots, S^\beta\}$ be a corresponding MSBN with a covering sect $S^1$, and $F = \{T^1, \ldots, T^\beta\}$ be the junction forest from an invertible transformation. Let $T^i$ be the junction tree of $D^i$ with cliques $C^i$ and sepsets $Q^i$. Let $I^i$ ($i > 1$) be the d-sepset between $S^i$ and $S^1$. Let $L^i$ ($i > 1$) be the set of linkages between $T^i$ and $T^1$, and $R^i$ ($i > 1$) be the corresponding set of redundancy sets.

Construct a joint system belief for the junction forest through an assignment of belief tables to each clique, clique sepset, linkage and redundancy set in the junction forest.

First, for each junction tree $T^i$ in $F$, do the following.

• Assign each node $n_k \in N^i$ to a unique clique $C_x \in C^i$ such that $C_x$ contains $n_k$ and its parents $\pi_k$. Break ties arbitrarily.
• Let $P_k$ denote the probability table associated with node $n_k$.
• For each clique $C_x$ that has assigned nodes $n_k, \ldots, n_l$, associate it with belief table $B(C_x) = P_k * \cdots * P_l$.
• For each clique sepset $Q_y \in Q^i$, associate it with a constant belief table $B(Q_y)$.

Then for each set of linkages $L^i$, do the following.

• Associate each linkage $L_z \in L^i$ with a constant belief table $B(L_z)$.

Define the belief table for redundancy set $R_z \in R^i$ as

$$B(R_z) = \sum_{L_z \setminus R_z} B(L_z)$$

Define the belief table for d-sepset $I^i$ as

$$B(I^i) = \frac{\prod_{L_z \in L^i} B(L_z)}{\prod_{R_z \in R^i} B(R_z)}$$

Define the belief table for each junction tree $T^i$ as

$$B(T^i) = \frac{\prod_{C_x \in C^i} B(C_x)}{\prod_{Q_y \in Q^i} B(Q_y)}$$

Here the notation $B(T^i)$ is used instead of $B(N^i)$ to emphasize that it is related to the junction tree. Note that $B(I^i)$ and $B(T^i)$ are mathematical objects which do not have corresponding data structures in the knowledge base. Comparing the form of the joint probability distribution for a USBN (section 1.3.3) and the assignment of probability tables for nodes in a sect (Definition 7), it is easy to see that $B(T^i)$ is proportional to the joint probability distribution of $S^i$ relative to that assignment.

Define the joint system belief for the junction forest $F$ as

$$B(F) = \frac{\prod_{i=1}^{\beta} B(T^i)}{\prod_{i=2}^{\beta} B(I^i)}$$

The notation $B(F)$ is used instead of $B(N)$ for the same reason as above. With this definition, one has the following lemma.

Lemma 7 The joint belief $B(F)$ of a MSBN is proportional to the joint probability distribution $P$ of the corresponding USBN.

To see that this is true, it suffices to note that each d-sepnode, appearing in at least 2 sects, carries its original probability table as in $(N, E, P)$ exactly once by Definition 7, and carries a uniform table in the rest.

Example 16 Table 4.5 lists the constructed belief tables for the belief universes of junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20. The belief tables for sepsets, linkages, and redundancy sets are all constant tables at this stage.

Table 4.5: Constructed belief tables for belief universes of junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20. Config: Configuration. Node Ass: Nodes Assigned. Each entry gives a configuration followed by its belief value B().

B($\Gamma^1$)
Clique {H2,H1,A1}   Node Ass: H1, A1
  (h21,h11,a11) .12   (h21,h11,a12) .03   (h21,h12,a11) .085   (h21,h12,a12) .765
  (h22,h11,a11) .12   (h22,h11,a12) .03   (h22,h12,a11) .085   (h22,h12,a12) .765
Clique {H2,A2,A1}   Node Ass: H2
  (h21,a21,a11) .8696   (h21,a21,a12) .7   (h21,a22,a11) .6   (h21,a22,a12) .08
  (h22,a21,a11) .1304   (h22,a21,a12) .3   (h22,a22,a11) .4   (h22,a22,a12) .92
Clique {H3,H2,A2}   Node Ass: H3, A2
  (h31,h21,a21) .24   (h31,h21,a22) .06   (h31,h22,a21) .24   (h31,h22,a22) .06
  (h32,h21,a21) .07   (h32,h21,a22) .63   (h32,h22,a21) .07   (h32,h22,a22) .63
Clique {H3,A3,H4}   Node Ass: A3, H4
  (h31,a31,h41) .075   (h31,a31,h42) .225   (h31,a32,h41) .28   (h31,a32,h42) .42
  (h32,a31,h41) .2   (h32,a31,h42) .6   (h32,a32,h41) .08   (h32,a32,h42) .12
Clique {H4,A4}   Node Ass: A4
  (h41,a41) .9   (h41,a42) .1   (h42,a41) .2   (h42,a42) .8

B($\Gamma^2$)
Clique {F2,F1}   Node Ass: F2
  (f21,f11) .4   (f21,f12) .75   (f22,f11) .6   (f22,f12) .25
Clique {H2,H1,F1}   Node Ass: H2, F1, H1
  (h21,h11,f11) .7895   (h21,h12,f11) .6   (h21,h11,f12) .2105   (h21,h12,f12) .4
  (h22,h11,f11) .5   (h22,h12,f11) .05   (h22,h11,f12) .5   (h22,h12,f12) .95

B($\Gamma^3$)
Clique {E1,E3}   Node Ass: E1
  (e11,e31) .2   (e11,e32) .7   (e12,e31) .8   (e12,e32) .3
Clique {H3,H2,E3}   Node Ass: H3, H2, E3
  (h31,h21,e31) .7702   (h31,h21,e32) .2298   (h31,h22,e31) .65   (h31,h22,e32) .35
  (h32,h21,e31) .35   (h32,h21,e32) .65   (h32,h22,e31) .01   (h32,h22,e32) .99
Clique {E2,E3,E4}   Node Ass: E2
  (e21,e31,e41) .9789   (e21,e31,e42) .8   (e21,e32,e41) .9   (e21,e32,e42) .05
  (e22,e31,e41) .0211   (e22,e31,e42) .2   (e22,e32,e41) .1   (e22,e32,e42) .95
Clique {E3,E4,H4}   Node Ass: E4, H4
  (e31,e41,h41) .8   (e31,e41,h42) .15   (e31,e42,h41) .2   (e31,e42,h42) .85
  (e32,e41,h41) .8   (e32,e41,h42) .15   (e32,e42,h41) .2   (e32,e42,h42) .85
Clique {H3,E3,H4}   Node Ass: none (no belief values are listed for its 8 configurations)

Having constructed belief tables for cliques, sepsets, linkages, redundancy sets and the junction forest, using the definition of world in section 4.2.1, one can talk about belief universes, sepset worlds, linkage worlds, redundancy worlds, and the junction forest of belief universes. These terms will be used below.

The preceding assumes a covering sect. The joint system belief of a junction forest with a hypertree structure can be defined in a similar way. Just as there is no need to consider the d-sepsets/linkages between non-covering sects above, in the hypertree case there is no need to consider the d-sepsets/linkages between neighbor sects covered by a local covering sect.
Therefore in practice, these linkages are never created. One sees another computational advantage of the covering sect rule and the hypertree rule.

4.8 Consistency and Separability of Junction Forest

Part of the goal is to propagate the information stored in different belief universes in different junction trees of a junction forest to the whole system, such that the marginal probabilities of variables can be obtained from any universe containing them with local computation (footnote 4). The information to be propagated can be the prior knowledge in the form of products of original probability tables from the corresponding USBN. The information to be propagated can also be evidence entered from a set of universes, possibly in different junction trees. The following defines consistency and separability, the properties of junction forests which guarantee the achievement of this goal.

Footnote 4: Obtaining marginals by local computation is what the junction tree technique was developed for. More is obtained from junction forests, namely, exploiting localization.

4.8.1 Consistency of Junction Forest

The property of consistency partly guarantees the validity of obtaining marginal probabilities by local computation. In the context of the junction forest, 3 levels of consistency can be identified.

Definition 12 (local consistency) Neighbor universes $(C_i, B(C_i))$ and $(C_j, B(C_j))$ in a junction tree $T$ with sepset world $(Q_k, B(Q_k))$ are consistent if

$$\sum_{C_i \setminus C_j} B(C_i) \propto B(Q_k) \propto \sum_{C_j \setminus C_i} B(C_j)$$

where '$\propto$' reads 'proportional to'. When the relation holds among all neighbor universes, the junction tree $T$ is said to be consistent.
When all junction trees are consistent, a junction forest $F$ is said to be locally consistent.

Definition 13 (boundary consistency) Host universes $(C_i^a, B(C_i^a))$ and $(C_j^b, B(C_j^b))$ of linkage world $(L_k, B(L_k))$ are consistent if

$$\sum_{C_i^a \setminus L_k} B(C_i^a) \propto B(L_k) \propto \sum_{C_j^b \setminus L_k} B(C_j^b)$$

When the relation holds among all linkage host universes, a junction forest is said to have reached boundary consistency.

Definition 14 (global consistency) A junction forest is said to be globally consistent if for any 2 belief universes (possibly in different junction trees) $(C_i, B(C_i))$ and $(C_j, B(C_j))$

$$\sum_{C_i \setminus C_j} B(C_i) \propto \sum_{C_j \setminus C_i} B(C_j)$$

Theorem 8 (consistent junction forest) A junction forest is globally consistent if it reaches both local consistency and boundary consistency.

Proof:
The necessity is obvious. The sufficiency is proven below.

Let $(C_i^a, B(C_i^a))$ and $(C_j^b, B(C_j^b))$ be 2 universes in junction trees $T^a$ and $T^b$ of junction forest $F$ respectively. Let $I$ be the d-sepset between the 2 corresponding sects. One has $Z = C_i^a \cap C_j^b \subseteq I$. By definition, linkages have the maximum property. Therefore, there exists a linkage $L \supseteq Z$ with its hosts being $(C_x^a, B(C_x^a))$ and $(C_y^b, B(C_y^b))$.

Since $C_x^a \supseteq L \supseteq Z$, $Z$ is contained in all cliques on the unique path between $C_i^a$ and $C_x^a$. Because $F$ is locally consistent,

$$\sum_{C_i^a \setminus Z} B(C_i^a) \propto \sum_{C_x^a \setminus Z} B(C_x^a)$$

Similarly,

$$\sum_{C_j^b \setminus Z} B(C_j^b) \propto \sum_{C_y^b \setminus Z} B(C_y^b)$$

Because $F$ has also reached boundary consistency,

$$\sum_{C_x^a \setminus Z} B(C_x^a) \propto \sum_{C_y^b \setminus Z} B(C_y^b)$$

Therefore

$$\sum_{C_i^a \setminus Z} B(C_i^a) \propto \sum_{C_j^b \setminus Z} B(C_j^b)$$ □

4.8.2 Separability of Junction Forests

In the junction tree technique, consistency is all that is required in order to obtain marginals by local computation. In junction forests, this is not sufficient due to the existence of multiple linkages. The function of multiple linkages between a pair of junction trees is to pass the joint distribution on the d-sepset by passing marginal distributions on subsets of the d-sepset. By doing so, one avoids the exponential increase in clique state space sizes outlined in section 4.3.
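Passing a joint over a d-sepset as marginals on its linkages can be illustrated numerically. The sketch below uses the d-sepset {H2, H3, H4} with linkages {H3, H2} and {H3, H4} and redundancy set {H3}, and rebuilds the joint as the product of the linkage tables divided by the redundancy table. All numbers and function names are hypothetical and are not taken from the thesis tables. The rebuilt table matches the original only when H2 and H4 are conditionally independent given H3.

```python
from itertools import product

def marginal(joint, keep):
    """Sum a table (dict: state tuple -> value) onto the positions listed in `keep`."""
    out = {}
    for states, value in joint.items():
        key = tuple(states[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out

def reassemble(joint):
    """B(L1) * B(L2) / B(R2) for states ordered as (h2, h3, h4)."""
    b_l1 = marginal(joint, (0, 1))    # linkage {H3, H2}
    b_l2 = marginal(joint, (1, 2))    # linkage {H3, H4}
    b_r2 = marginal(joint, (1,))      # redundancy set {H3}
    return {s: b_l1[s[:2]] * b_l2[s[1:]] / b_r2[s[1:2]] for s in joint}

def max_gap(joint):
    """Largest absolute difference between a joint and its reassembled version."""
    rebuilt = reassemble(joint)
    return max(abs(joint[s] - rebuilt[s]) for s in joint)

states = list(product((0, 1), repeat=3))
p3 = (0.3, 0.7)                            # hypothetical p(h3)
p2_given3 = ((0.6, 0.4), (0.2, 0.8))       # hypothetical p(h2 | h3), indexed [h3][h2]
p4_given3 = ((0.25, 0.75), (0.4, 0.6))     # hypothetical p(h4 | h3), indexed [h3][h4]

# H2 and H4 independent given H3: the linkage marginals carry the whole joint.
separable = {(h2, h3, h4): p3[h3] * p2_given3[h3][h2] * p4_given3[h3][h4]
             for h2, h3, h4 in states}
# H2 and H4 always equal: the linkage marginals lose this dependence.
coupled = {(h2, h3, h4): 0.25 if h2 == h4 else 0.0 for h2, h3, h4 in states}

print(max_gap(separable) < 1e-12)  # True
print(max_gap(coupled))            # 0.125
```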
When breaking down the joint into the marginals, one must ensure the joint can be reassembled from the marginals, i.e., one must pass a correct version of the joint. Otherwise, the correctness of local computation is not guaranteed. Since passing the marginals is achieved by passing the belief tables on linkages and redundancy sets, the structure of linkage hosts is the key factor. The following defines separability of junction forests in terms of the correctness of local computation. Then the structural condition of linkage hosts is given under which the separability holds.

Definition 15 (separability) Let $F = \{T^i \mid 1 \le i \le \beta\}$ be a junction forest with domain $N$ and joint system belief $B(F)$. $F$ is said to be separable if, when it is globally consistent, for any $T^i$ over subdomain $N^i$

$$\sum_{N \setminus N^i} B(F) \propto B(T^i)$$

The following lemma is quoted from Jensen for the later proof of proposition 12.

Lemma 8 [Jensen 88]
Let $T$ be a junction tree from clique hypergraph $(N, C)$. Let $C_1$ and $C_2$ be 2 adjacent cliques in $T$. Let $T'$ be the graph resulting from $T$ by union of $C_1$ and $C_2$ into one clique, and by keeping the original sepsets. Then $T'$ is a junction tree for clique hypergraph $(N, (C \setminus \{C_1, C_2\}) \cup \{C_1 \cup C_2\})$.

The following is the structural condition for separability to be proved below.

Definition 16 (host composition) Let a MSBN by sound sectioning be transformed into a junction forest. Let $S^i$ be a sect in the MSBN; $T^i$ be the junction tree of $S^i$; $I$ be the d-sepset between $S^i$ and any distinct sect; and $L$ be the set of linkages between $T^i$ and the junction tree for the other sect.

Recursively remove from $T^i$ every leaf clique which is not a linkage host relative to $L$. Call the resultant junction tree a host tree.

$T^i$ satisfies a host composition condition relative to $L$ whenever the following is true.

1. If a non-d-sepnode is contained in any linkage host in $T^i$, then it is contained in exactly 1 linkage host; and

2. if a set of non-d-sepnodes is contained in any non-host clique in the host tree, and each element appears in some host clique, then the set must be contained in exactly 1 linkage host.

Example 17 The host composition condition is violated in the host trees of Figure 4.24. The following shows the violation and the resultant problem. Assume both trees are consistent.

First consider the top tree. Let $L$ consist of linkages $L_1 = \{A, D\}$ and $L_2 = \{A, E\}$. Let their hosts be $C_1 = \{A, B, D\}$ and $C_2 = \{A, B, E\}$, which are adjacent in the tree. $B$ is a common non-d-sepnode, a violation of part 1 of the host composition condition. Even if all the belief tables are consistent, in general,

$$\sum_{B} \frac{B(ABD)\,B(ABE)}{B(AB)} \neq \frac{B(AD)\,B(AE)}{B(A)}$$

That is, the joint distribution on the d-sepset $\{A, D, E\}$ constructed from belief tables on linkages and redundancy sets is inconsistent in general.

Consider the bottom tree. Let $L$ and $C_1$ be the same. Let the host $C_2 = \{A, E, G\}$, which is connected to $C_1$ through a non-host $C_3 = \{A, B, G\}$. $\{B, G\}$ is a set of non-d-sepnodes violating part 2 of the host composition condition. If $C_1$ and $C_3$ are united as described in lemma 8, the resultant graph is still a junction tree. If one lets

$$B(C_{13}) = B(C_1)\,B(C_3)/B(Q_{13})$$

where $B(Q_{13})$ is the original belief for the sepset between $C_1$ and $C_3$, the joint belief for the new tree is exactly the same as before and the new tree is consistent. Now the common node $G$ in $C_{13}$ and $C_2$ creates the same problem illustrated above.

[Figure 4.24: Two example trees for violation of the host composition condition. I: the d-sepset. The ribbed bands indicate linkages.]

Proposition 12 Let a MSBN by sound sectioning be transformed into a junction forest $F$. Let $S^i$ be a sect in the MSBN; $T^i$ be the junction tree of $S^i$ in $F$; $I$ be the d-sepset between $S^i$ and any distinct sect; and $L$ be the set of linkages between $T^i$ and the junction tree for the other sect.
Let all belief tables be defined as in section 4.7.3. When $F$ is globally consistent, $B(I)$ satisfies

$$B(I) \propto \sum_{N^i \setminus I} B(T^i)$$

iff $T^i$ satisfies the host composition condition relative to $L$.

Proof:
[Sufficiency] The proof is constructive. Denote the host tree of $T^i$ relative to $L$ by $T'$ and denote the domain of $T'$ by $N'$. First, marginalize $B(T^i)$ with respect to $N^i \setminus N'$. This is done by marginalization of $B(T^i)$ recursively with respect to unique variables in each leaf clique not in $T'$. This results in

$$\sum_{N^i \setminus N'} B(T^i) = B(T')$$

where $B(T')$ is the joint belief for the consistent junction tree $T'$. Note $N^i \setminus N'$ contains no d-sepnodes.

Second, unite non-host cliques into linkage hosts in $T'$. Suppose $C_i$ is a non-linkage-host with host clique neighbor(s). Choose the neighbor host $C_j$ containing all the non-d-sepnodes of $C_i$ which appear in host cliques. Since $T'$ is a junction tree and the host composition condition holds, one is guaranteed to find such a $C_j$. By lemma 8, the graph resulting from $T'$ by union of $C_i$, $C_j$ into a new clique $C_k$, and by keeping the original sepsets, is still a junction tree on domain $N'$. The host composition condition still holds in the new graph. If one lets

$$B(C_k) = B(C_i)\,B(C_j)/B(Q_{ij})$$

where $B(Q_{ij})$ is the original belief for the sepset between $C_i$ and $C_j$, the joint belief for the new junction tree $T^{(1)}$ is exactly the same as $B(T')$ and $T^{(1)}$ is consistent. Repeating the union operation for all non-hosts, one ends up with a consistent junction tree $T''$ with only linkage hosts (possibly of new composition), in which every non-d-sepnode in $N'$ appears in exactly 1 host. If, for every clique in $T''$, one marginalizes the clique belief with respect to its (unique) non-d-sepnodes, removes these non-d-sepnodes from the clique, assigns the marginalized belief to the new clique and keeps the sepset belief invariant, one ends up with a new consistent junction tree $T'''$ with its joint belief

$$B(T''') \propto \sum_{N' \setminus I} B(T').$$

Note that $T'''$ is just the linkage tree in procedure 1. That is, all the cliques in $T'''$ are linkages, and all the sepsets are redundancy sets. Therefore,

$$B(I) \propto B(T''').$$

[Necessity] Suppose there are 2 or more linkages in $L$, and the host composition condition does not hold in $T^i$. Obtain the host tree $T'$ in the same way as in the sufficiency proof. Recursively unite each non-host clique with a neighbor host clique if there is one, and assign the belief for the new clique in the same way as in the sufficiency proof. The result is a consistent junction tree $T''$. Since the host composition condition does not hold in $T'$, and the uniting process does not remove non-d-sepnodes from any host, there are at least 2 neighboring cliques $C_1$ and $C_2$ in $T''$ that have a set of common non-d-sepnodes. Denote the 2 cliques as $C_1 = X \cup Y \cup W$ and $C_2 = X \cup Z \cup W$ (Figure 4.25) with $L_1 = C_1 \cap I = X \cup Y$ and $L_2 = C_2 \cap I = X \cup Z$. That is, $C_1$ and $C_2$ share a set of d-sepnodes $X$ and a set of non-d-sepnodes $W$. Even if all the belief tables are consistent, in general,

$$\sum_{W} \frac{B(C_1)\,B(C_2)}{B(Q_{12})} = \sum_{W} \frac{B(X \cup Y \cup W)\,B(X \cup Z \cup W)}{B(X \cup W)} \neq \frac{B(X \cup Y)\,B(X \cup Z)}{B(X)} = \frac{B(L_1)\,B(L_2)}{B(R_1)}$$

where $R_1$ is the intersection of $L_1$ and $L_2$ (a redundancy set). That is, the distribution on $I \cap (C_1 \cup C_2)$ is in general not consistent with the distribution constructed from the belief tables on the corresponding linkages and redundancy set. The above $C_1$ and $C_2$ are chosen not to include unique non-d-sepnodes in either of them. If this is not the case, these non-d-sepnodes can always be removed by marginalization at each of the 2 cliques, and the result is the same. If $L = \{L_1, L_2\}$, the proof is complete.

Consider the case where $L$ contains more than 2 linkages. In this case, the left side of the above equation represents a correct version of the marginal distribution on a subset of d-sepset $I$, while the right side is an inconsistent version of the same distribution defined in terms of belief of linkages and redundancy sets (section 4.7.3).
Now, it suffices to indicate that, in general, if the marginal distribution is inconsistent, the joint is also inconsistent. □

The proposition shows that the belief table $B(I)$ defined in section 4.7.3 is indeed the joint belief on d-sepset $I$ when the host composition condition is satisfied and the forest is consistent. The following result on separability can now be stated.

[Figure 4.25: Part of a host tree violating the host composition condition, with $L_1 = C_1 \cap I = X \cup Y$ and $L_2 = C_2 \cap I = X \cup Z$. I: the d-sepset. The ribbed bands indicate linkages.]

Theorem 9 (host composition) Let $\{S^1, \ldots, S^\beta\}$ be a MSBN satisfying the hypertree condition. Let $F = \{T^i \mid 1 \le i \le \beta\}$ be a corresponding junction forest with joint system belief $B(F)$. $F$ is separable if the host composition condition is satisfied in all pairs of junction trees with linkages constructed.⁵

⁵ Recall that when a MSBN has a hypertree structure, no linkage is created between junction trees whose corresponding sects are covered by a local covering sect.

Proof:
Assume $F$ has reached global consistency. First, unite all the cliques in each junction tree into a huge clique with the belief table for the junction tree as its belief. Connect the huge cliques as they are connected in the hypertree (that is, if the linkages between 2 trees are not created as discussed in section 4.7.3, do not connect the 2 corresponding huge cliques), and assign the original $B(I)$ to their sepset. The resultant graph is a junction tree due to the hypertree structure of the MSBN. By proposition 12, the tree is consistent with joint belief being $B(F)$, and for any $T^i$ over subdomain $N^i$

$$\sum_{N \setminus N^i} B(F) \propto B(T^i)$$

iff the host composition condition is satisfied. □

The host composition condition can usually be satisfied naturally in an application system. Since d-sepsets are the only media for information exchange between sects, d-sepnodes usually involve many inter-subDAG cycles.
The consequence is that they will be heavily connected during morali-triangulation and form several large cliques in the clique hypergraph as well as some small ones. On the other hand, non-d-sepnodes rarely form connections with so many d-sepnodes simultaneously and hence will rarely be elements of these large cliques. To be an element of more than 1 such large clique is even more unlikely. Because linkages are defined to be maximal, these large cliques will become linkage hosts.

For example, in the PAINULIM application system (Chapter 5), there are 3 sects and correspondingly 3 junction trees. The host composition condition is satisfied naturally in all 3 trees. Figure 4.26 gives one of them. The 4 linkage hosts contain no non-d-sepnode at all.

When the host composition condition cannot be satisfied naturally, one can add dummy links between d-sepnodes in the moral graph before triangulation such that linkage hosts will be enlarged and the condition is satisfied. Hence, given a MSBN, a separable junction forest can always be realized. The penalty of added links is an increased amount of computation during belief propagation due to increased sizes of cliques and linkages. In the worst case, one resorts to the brute force method discussed in section 4.3 in order to satisfy the host composition condition for certain pairs of junction trees. If the system is large, sectioning may still yield computational savings on the whole even if cliques are enlarged at a few junction trees.

One of the key results now follows.

[Figure 4.26: $T$ is a junction tree in a junction forest taken from an application system PAINULIM, with variable names revised for simplicity. An upper case letter in a clique represents a d-sepnode member, and a lower case letter represents a non-d-sepnode member. The cliques $C_1, C_2, C_3, C_4$ are linkage hosts.]

Theorem 10 (local computation) Let $F$ be a consistent and separable junction forest with domain $N$ and joint system belief $B(F)$.
Let $(C^x, B(C^x))$ be any universe in $F$. Then

$$\sum_{N \setminus C^x} B(F) \propto B(C^x)$$

Proof:
Let $(C^x, B(C^x))$ be a universe in one of the junction trees $T^x$ of $F$. Let the subdomain of $T^x$ be $N^x$ and its belief be $B(T^x)$. Since $F$ is consistent and separable, by definition of separability one has

$$\sum_{N \setminus N^x} B(F) \propto B(T^x)$$

and $T^x$ is a consistent junction tree. Jensen, Lauritzen and Olesen [1990] have proved that the belief of a consistent junction tree marginalized to any of its universes is proportional to the belief of that universe. Hence

$$\sum_{N \setminus C^x} B(F) \propto \sum_{N^x \setminus C^x} \Big( \sum_{N \setminus N^x} B(F) \Big) \propto B(C^x)$$

□

With the above theorem, the marginal belief of any variable in a consistent and separable junction forest can be computed by marginalization of the belief table of any universe which contains the variable. In this respect, a consistent and separable junction forest behaves the same as a consistent junction tree [Jensen, Lauritzen and Olesen 90]. It will be seen that, in the context of junction forests, an additional computational advantage is available, i.e., global consistency is not necessary for obtaining marginal belief by local computation.

4.9 Belief Propagation in Junction Forests

Given the importance of consistency of junction forests, a set of operations is introduced which bring a junction forest into consistency. Since the purpose is to exploit localization, only operations for local computation at the junction tree level are considered. That is, at any moment, there is only 1 junction tree resident in memory. This junction tree is said to be active.

4.9.1 Supportiveness

Jensen, Lauritzen and Olesen [1990] introduced the concept of supportiveness. Let $(Z, B(Z))$ be a world. The support of $B(Z)$ is defined as

$$\Delta(B(Z)) = \{ z \in \Psi(Z) \mid \text{belief of } z > 0 \}.$$

A junction tree is supportive if, for any universe $(C_i, B(C_i))$ and for any neighboring sepset world $(Q_j, B(Q_j))$, $\Delta(B(C_i)) \subseteq \Delta(B(Q_j))$. The underlying intuition is that, when beliefs are propagated in a supportive junction tree, non-zero belief values will not be turned into zeros.

Here the concept is extended to junction forests. A junction forest is supportive if all its junction trees are supportive, and if, for any linkage host $(C_i, B(C_i))$ and corresponding linkage world $(L_j, B(L_j))$, $\Delta(B(C_i)) \subseteq \Delta(B(L_j))$.

The construction in section 4.7.3 results in a supportive junction forest.

4.9.2 Basic Operations

Operations for Consistency within a Junction Tree

The following operation brings a pair of belief universes into consistency.

Operation 1 (AbsorbThroughSepset) [Jensen, Lauritzen and Olesen 90]
Let $(C_0, B(C_0))$ and its neighbors $(C_1, B(C_1)), \ldots, (C_k, B(C_k))$ be belief universes in a junction tree. Let $(Q_1, B(Q_1)), \ldots, (Q_k, B(Q_k))$ be the corresponding sepset worlds. Suppose $\Delta(B(C_0)) \subseteq \Delta(B(Q_i))$ $(i = 1, \ldots, k)$. Then the AbsorbThroughSepset of $(C_0, B(C_0))$ from $(C_i, B(C_i))$ $(i = 1, \ldots, k)$ changes $B(Q_i)$ and $B(C_0)$ into $B'(Q_i)$ and $B'(C_0)$.

$$B'(Q_i) = \sum_{C_i \setminus Q_i} B(C_i), \quad i = 1, \ldots, k$$

$$B'(C_0) = B(C_0) \prod_{i=1}^{k} B'(Q_i) \Big/ \prod_{i=1}^{k} B(Q_i)$$

This operation is performed at the level of belief universes. Jensen, Lauritzen and Olesen [1990] show that the operation changes neither the supportiveness of a junction tree nor the joint system belief in the context of a junction tree. In the context of a junction forest, the supportiveness is also invariant after AbsorbThroughSepset. This is because the operation does not increase the support of any linkage host and does not change linkage beliefs directly. The invariance of joint system belief for the junction forest is obvious given the definition of joint system belief and the invariance of beliefs for junction trees.

The following are three high level operations which bring a junction tree into consistency.

Operation 2 (DistributeEvidence) [Jensen, Lauritzen and Olesen 90]
Let $(C_0, B(C_0))$ be a universe in a junction tree.
When DistributeEvidence is called in $(C_0, B(C_0))$, it performs AbsorbThroughSepset to absorb from the caller if the caller is a neighbor and calls DistributeEvidence in all its neighbors except the caller.

Suppose a junction tree is originally consistent. If evidence is entered⁶ in one of its belief universes and DistributeEvidence is initiated from that universe, then the resulting junction tree is still consistent.

⁶ The concept of evidence entering is defined later.

Operation 3 (CollectEvidence) [Jensen, Lauritzen and Olesen 90]
Let $(C_0, B(C_0))$ be a universe in a junction tree. When CollectEvidence is called in $(C_0, B(C_0))$, it calls CollectEvidence in all its neighbors except the caller, and when they have finished their CollectEvidence, $(C_0, B(C_0))$ performs AbsorbThroughSepset to absorb from them.

DistributeEvidence and CollectEvidence are operations performed at the level of belief universes. Since they are composed of AbsorbThroughSepset, they do not change the supportiveness and joint system belief of the junction forest. The combination of these 2 operations yields the operation UnifyBelief, which brings a supportive junction tree into consistency and is performed at the level of junction trees.

Operation 4 (UnifyBelief) UnifyBelief can be initiated at any universe $(C_0, B(C_0))$ in a junction tree. $(C_0, B(C_0))$ calls CollectEvidence in all its neighbors and, when they have finished their CollectEvidence, $(C_0, B(C_0))$ calls DistributeEvidence in them.

Operations for Belief Exchange in Belief Initialization

Belief initialization brings a junction forest into global consistency before any evidence is available. One problem arises when there are multiple linkages between junction trees. Care is to be taken not to count the same information passed on different linkages multiple times. The following two operations pass information through multiple linkages during belief initialization.
NonRedundancyAbsorption is performed at the level of linkage hosts. ExchangeBelief calls NonRedundancyAbsorption and is performed at the level of junction trees. ExchangeBelief ensures the exchange between junction trees of the prior distribution on d-sepsets without redundant information passing.

Operation 5 (NonRedundancyAbsorption) Let $(C_x^a, B(C_x^a))$ and $(C_x^b, B(C_x^b))$ be 2 linkage host universes in junction trees $T^a$ and $T^b$ respectively. Let $(L_x, B(L_x))$ and $(R_x, B(R_x))$ be the worlds for the corresponding linkage and redundancy set. Suppose $\Delta(B(C_x^a)) \subseteq \Delta(B(L_x))$.

The NonRedundancyAbsorption of $(C_x^a, B(C_x^a))$ from $(C_x^b, B(C_x^b))$ through linkage $L_x$ changes $B(L_x)$, $B(R_x)$ and $B(C_x^a)$ into $B'(L_x)$, $B'(R_x)$ and $B'(C_x^a)$ respectively.

$$B'(L_x) = \sum_{C_x^b \setminus L_x} B(C_x^b)$$

$$B'(R_x) = \sum_{L_x \setminus R_x} B'(L_x)$$

$$B'(C_x^a) = B(C_x^a) * \frac{B'(L_x)/B'(R_x)}{B(L_x)/B(R_x)}$$

The factor $1/B'(R_x)$ above has the function of redundancy removal.

At initialization, the belief tables for linkages and redundancy sets are in the state of construction, i.e., constant. Hence, $B(L_x)$ and $B(R_x)$ above are constant tables. If $B'(L_x)$ is constant, which is possible because constant probability tables are assigned to d-sepnodes in some sects in Definition 7, then after the operation

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) \propto \sum_{C_x^a \setminus L_x} B(C_x^a).$$

That is, if $C_x^b$ has no information to offer, then $C_x^a$ will not change its belief. If $\sum_{C_x^a \setminus L_x} B(C_x^a)$ is constant, then after the operation

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) \propto \Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) \Big/ \Big( \sum_{C_x^b \setminus R_x} B(C_x^b) \Big).$$

That is, if $C_x^b$ has new information and $C_x^a$ contains no non-trivial information, then the belief of $C_x^b$ will be copied with redundancy removed. If neither $B'(L_x)$ nor $\sum_{C_x^a \setminus L_x} B(C_x^a)$ is constant, then after the operation

$$B'(C_x^a) \propto B(C_x^a) * \Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) \Big/ \Big( \sum_{C_x^b \setminus R_x} B(C_x^b) \Big).$$

That is, if none of the above cases is true, the belief from both sides will be combined with redundancy removed. Since

$$\Delta(B(C_x^a)) \subseteq \Delta\Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) = \Delta(B'(L_x))$$

the supportiveness of the junction forest is invariant under NonRedundancyAbsorption. Since

$$\frac{B'(C_x^a)}{B'(L_x)/B'(R_x)} = \frac{B(C_x^a)}{B(L_x)/B(R_x)}$$

the joint system belief is invariant under NonRedundancyAbsorption.
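The three update formulas of NonRedundancyAbsorption can be sketched on small tables as follows. This is only an illustration under stated assumptions: the table encoding (a tuple of variable names plus a dict from value tuples to numbers), the function names `marginal` and `non_redundancy_absorption`, the clique contents and all numbers are hypothetical and not taken from the thesis, and 0/0 is treated as 0.

```python
from itertools import product

def marginal(variables, table, keep):
    """Sum a belief table down to the variables listed in `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out

def non_redundancy_absorption(ca_vars, b_ca, cb_vars, b_cb,
                              l_vars, b_l, r_vars, b_r):
    """Return (B'(L), B'(R), B'(C^a)) for host C^a absorbing from host C^b."""
    new_l = marginal(cb_vars, b_cb, l_vars)   # B'(L) = sum over C^b \ L of B(C^b)
    new_r = marginal(l_vars, new_l, r_vars)   # B'(R) = sum over L \ R of B'(L)
    l_idx = [ca_vars.index(v) for v in l_vars]
    r_idx = [ca_vars.index(v) for v in r_vars]
    new_ca = {}
    for config, value in b_ca.items():
        l_key = tuple(config[i] for i in l_idx)
        r_key = tuple(config[i] for i in r_idx)
        old = b_l[l_key] / b_r[r_key]                               # B(L)/B(R)
        new = new_l[l_key] / new_r[r_key] if new_r[r_key] else 0.0  # B'(L)/B'(R)
        new_ca[config] = value * new / old if old else 0.0
    return new_l, new_r, new_ca

# Hypothetical hosts C^a = {A, D, U} and C^b = {A, D, V}, linkage L = {A, D},
# redundancy set R = {A}; all variables binary.
vals = (0, 1)
ca_vars, cb_vars = ("A", "D", "U"), ("A", "D", "V")
l_vars, r_vars = ("A", "D"), ("A",)
b_ca = {c: 1.0 + i for i, c in enumerate(product(vals, repeat=3))}
b_cb = {c: 0.5 * (i + 1) for i, c in enumerate(product(vals, repeat=3))}
b_l = {c: 1.0 for c in product(vals, repeat=2)}   # constant at initialization
b_r = {c: 1.0 for c in product(vals, repeat=1)}   # constant at initialization

new_l, new_r, new_ca = non_redundancy_absorption(
    ca_vars, b_ca, cb_vars, b_cb, l_vars, b_l, r_vars, b_r)
```

In this sketch the ratio of the host belief to the linkage-over-redundancy belief is the same before and after the call, which mirrors the invariance of the joint system belief stated above.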
This operation is equipped at each linkage host.

Operation 6 (ExchangeBelief) Let $L$ be the set of linkages between junction trees $T^a$ and $T^b$. When $T^a$ initiates ExchangeBelief with $T^b$, the NonRedundancyAbsorption is performed at all linkage host universes in $T^a$.

Since ExchangeBelief is composed of NonRedundancyAbsorption, the supportiveness of the junction tree and its joint system belief are invariant under ExchangeBelief. The operation is equipped at the level of junction trees. After ExchangeBelief, the non-trivial content of the joint distribution on the d-sepset at $T^b$ will be passed onto $T^a$ without redundancy.

Operations for Belief Update in Evidential Reasoning

Evidential reasoning propagates evidence obtained in one junction tree to the junction forest. A junction tree receiving updated belief on the d-sepset from a neighbor junction tree may be confused due to evidence passing over multiple linkages. The following 2 operations handle the evidence propagation between junction trees. AbsorbThroughLinkage propagates evidence through one linkage and is performed at the level of linkage hosts. UpdateBelief calls AbsorbThroughLinkage to propagate evidence from one junction tree to another, and is performed at the level of junction trees. It is used during evidential reasoning when both junction trees are consistent themselves but may not have reached boundary consistency between them.

Operation 7 (AbsorbThroughLinkage) Let $(C_x^a, B(C_x^a))$ and $(C_x^b, B(C_x^b))$ be 2 linkage host universes in junction trees $T^a$ and $T^b$ respectively. Let $(L_x, B(L_x))$ be the corresponding linkage world. Suppose $\Delta(B(C_x^a)) \subseteq \Delta(B(L_x))$. The AbsorbThroughLinkage of $(C_x^a, B(C_x^a))$ from $(C_x^b, B(C_x^b))$ changes $B(C_x^a)$ and $B(L_x)$ into $B'(C_x^a)$ and $B'(L_x)$ as follows.

$$B'(L_x) = \sum_{C_x^b \setminus L_x} B(C_x^b)$$

$$B'(C_x^a) = B(C_x^a) * B'(L_x)/B(L_x)$$

After AbsorbThroughLinkage,

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) = \sum_{C_x^b \setminus L_x} B(C_x^b)$$

Since

$$\Delta(B(C_x^a)) \subseteq \Delta\Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) = \Delta(B'(L_x))$$

the supportiveness of a junction forest is invariant under AbsorbThroughLinkage.
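A minimal sketch of AbsorbThroughLinkage on small tables is given below. The table encoding, the function names, the clique contents and the numbers are hypothetical choices made for illustration only. The old linkage belief is set to the marginal of the receiving host, and 0/0 is treated as 0.

```python
from itertools import product

def marginal(variables, table, keep):
    """Sum a belief table down to the variables listed in `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out

def absorb_through_linkage(ca_vars, b_ca, cb_vars, b_cb, l_vars, b_l):
    """Return (B'(L), B'(C^a)) for host C^a absorbing from host C^b."""
    new_l = marginal(cb_vars, b_cb, l_vars)   # B'(L) = sum over C^b \ L of B(C^b)
    l_idx = [ca_vars.index(v) for v in l_vars]
    new_ca = {}
    for config, value in b_ca.items():
        key = tuple(config[i] for i in l_idx)
        # B'(C^a) = B(C^a) * B'(L) / B(L)
        new_ca[config] = value * new_l[key] / b_l[key] if b_l[key] else 0.0
    return new_l, new_ca

# Hypothetical hosts C^a = {A, D, U} and C^b = {A, D, V} with linkage L = {A, D}.
vals = (0, 1)
ca_vars, cb_vars, l_vars = ("A", "D", "U"), ("A", "D", "V"), ("A", "D")
b_ca = {c: 1.0 + i for i, c in enumerate(product(vals, repeat=3))}
b_l = marginal(ca_vars, b_ca, l_vars)          # old linkage belief agrees with C^a
# Evidence received in the other tree has ruled out D = 1.
b_cb = {c: (0.0 if c[1] == 1 else 2.0 * (i + 1))
        for i, c in enumerate(product(vals, repeat=3))}

new_l, new_ca = absorb_through_linkage(ca_vars, b_ca, cb_vars, b_cb, l_vars, b_l)
after = marginal(ca_vars, new_ca, l_vars)      # marginal of B'(C^a) on the linkage
```

Here `after` coincides with `new_l`, and the configurations with D = 1 receive zero belief in the receiving host.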
The operation makes the belief of $C_x^a$ up-to-date with respect to the belief of $C_x^b$ on their common variables.

Operation 8 (UpdateBelief) Let $L = \{L_1, \ldots, L_k\}$ be the set of linkages between junction trees $T^a$ and $T^b$ with corresponding linkage hosts $C_1^a, \ldots, C_k^a$. When $T^a$ initiates UpdateBelief with $T^b$, the operation pair AbsorbThroughLinkage and then DistributeEvidence is performed at $C_1^a, \ldots, C_k^a$.

Since the operation is composed of AbsorbThroughLinkage and DistributeEvidence, the supportiveness of the junction forest is invariant under the operation.

Since after UpdateBelief,

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) = \sum_{C_x^b \setminus L_x} B(C_x^b), \quad x = 1, \ldots, k$$

and $T^a$ is consistent, the effect of the operation is

$$B'(T^a) = B(T^a) * B'(I)/B(I)$$

where $I$ is the d-sepset between $S^a$ and $S^b$, or equivalently

$$B'(T^a)/B'(I) = B(T^a)/B(I)$$

which implies the joint system belief is invariant under the operation.

Note that after each AbsorbThroughLinkage in operation UpdateBelief, a DistributeEvidence is performed. This avoids confusion in the receiving junction tree that could otherwise result from information passing over multiple linkages. The following simple example illustrates this necessity.

[Figure 4.27: An example illustrating the operation UpdateBelief, with hosts $C_1 = L_1 = X \cup Z$ and $C_2 = L_2 = Y \cup Z$.]

Example 18 Let junction tree $T^a$ (Figure 4.27) have 2 linkage hosts $C_1 = L_1 = X \cup Z$ and $C_2 = L_2 = Y \cup Z$ where $X$, $Y$, $Z$ are 3 disjoint sets of nodes. Let $B(C_1)$, $B(C_2)$ and $B(Z)$ be the belief tables of the 2 hosts and their sepset respectively. Suppose new information is passed over to $T^a$ through the 2 linkages from its neighbor junction tree. If AbsorbThroughLinkage is performed at $L_1$ and then $L_2$ without a DistributeEvidence between the 2 operations, then the belief on the 2 host cliques will be updated to $B'(C_1)$, $B'(C_2)$, while $B(Z)$ is unchanged. If $C_1$ initiates an AbsorbThroughSepset from $C_2$ in the process of propagating the new information to the rest of $T^a$, the belief on $C_1$ will become

$$B''(C_1) = B'(C_1) \Big( \sum_{Y} B'(C_2) \Big/ B(Z) \Big) \not\propto B'(C_1)$$

which is not expected.
This is because $\sum_{Y} B'(C_2) \not\propto B(Z)$.

With a DistributeEvidence between the 2 AbsorbThroughLinkage operations, one has $B'(Z) = \sum_{Y} B'(C_2)$. The result of AbsorbThroughSepset becomes

$$B''(C_1) = B'(C_1) \Big( \sum_{Y} B'(C_2) \Big/ B'(Z) \Big) \propto B'(C_1)$$

which is correct.

4.9.3 Belief Initialization

Before any evidence is available, an internal representation of beliefs is to be established. The establishment of this representation is termed initialization by Lauritzen and Spiegelhalter [1988] for their method. The function of initialization in the context of junction forests is to propagate the prior knowledge stored in different belief universes of different junction trees to the rest of the forest such that (1) the prior marginal probability distribution for any variable can be obtained in any universe containing the variable, and (2) subsequent evidential reasoning can be performed.

The following defines operations DistributeBelief and CollectBelief, which are analogous to DistributeEvidence and CollectEvidence but at the junction tree level. The operation BeliefInitialization is then defined in terms of DistributeBelief and CollectBelief, just as UnifyBelief is defined in terms of DistributeEvidence and CollectEvidence, but at the junction forest level.

Two junction trees in a junction forest are called neighbors if the d-sepset between the 2 corresponding sects is nonempty.

Operation 9 (DistributeBelief) Let $T^i$ be a junction tree in a junction forest. When DistributeBelief is called in $T^i$, it performs UpdateBelief with respect to the caller if the caller is a junction tree and then calls DistributeBelief in all its neighbors except the caller and the caller's neighbors.

Operation 10 (CollectBelief) Let $T^i$ be a junction tree in a junction forest. When CollectBelief is called in $T^i$, it calls CollectBelief in all its neighbors except the caller and the caller's neighbors.
When they have finished, $T^i$ performs ExchangeBelief with respect to each of them followed by a UnifyBelief on $T^i$.

Operation 11 (BeliefInitialization) BeliefInitialization can be initiated at any junction tree $T^i$ in a junction forest if $T^i$ is transformed from a local covering sect. $T^i$ calls CollectBelief in all its neighbors, and when they have finished $T^i$ calls DistributeBelief in them.

None of the 3 operations changes the supportiveness or the joint system belief. Thus one has the following theorem.

Theorem 11 (belief initialization with hypertree) Let $\{S^1, \ldots, S^\beta\}$ be a MSBN with a hypertree structure. Let $F = \{T^1, \ldots, T^\beta\}$ be a junction forest with $T^i$ being the junction tree of $S^i$. Let $B(F)$ be the joint system belief constructed as in section 4.7.3. After BeliefInitialization, the junction forest is globally consistent.

Note that BeliefInitialization does not involve direct information passing between neighbors of a local covering sect. This is also the case during evidential reasoning, to be discussed later. As assumed in section 4.7.3, the linkages between these neighbors are not created in the first place. This saving is possible only if the MSBN has a hypertree structure.

Example 19 BeliefInitialization is initiated at $\Gamma^1$ in Figure 4.20. It calls CollectBelief in $\Gamma^2$ and $\Gamma^3$. Since the latter do not have neighbors other than the caller and the caller's neighbor, only UnifyBelief is performed in $\Gamma^2$ and $\Gamma^3$. Table 4.6 lists the belief tables for belief universes in junction trees $\Gamma^2$ and $\Gamma^3$ after their UnifyBeliefs.

Table 4.7 and Table 4.8 list, respectively, the belief tables for belief universes of the junction forest and the (prior) marginal probabilities for all variables of the corresponding MSBN after the completion of BeliefInitialization. The marginal probabilities are identical to what would be derived from the USBN (F, P) with F in Figure 4.17 and P in Table 4.3.
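As a small check of how a prior marginal is read off a single belief universe, the sketch below normalizes one belief table and sums out the other variable. The four belief values are those read from Table 4.7 for the universe {H4, A4}; the pairing of values with configurations is inferred from the table layout, and the dict encoding and the function name `marginal_probability` are assumptions made for this illustration.

```python
def marginal_probability(variables, table, var, value):
    """Normalize a belief table and sum it down to P(var = value)."""
    i = variables.index(var)
    total = sum(table.values())
    return sum(b for config, b in table.items() if config[i] == value) / total

# Belief universe {H4, A4} after BeliefInitialization (values from Table 4.7).
variables = ("H4", "A4")
belief = {
    ("h41", "a41"): 2.723,
    ("h41", "a42"): 0.3025,
    ("h42", "a41"): 1.395,
    ("h42", "a42"): 5.58,
}

p_h41 = marginal_probability(variables, belief, "H4", "h41")
p_a41 = marginal_probability(variables, belief, "A4", "a41")
```

Up to rounding, the two results agree with the entries p(h41) = .3025 and p(a41) = .4118 of Table 4.8. Belief tables are only proportional to probabilities, so the division by the table total is needed.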
The marginals are obtained by marginalization of belief universes which contain the corresponding variables.

Once belief initialization is completed, the junction forest becomes the permanent representation which will be reused for each query session.

[Table 4.6: Belief tables for belief universes in junction trees $\Gamma^2$ and $\Gamma^3$ in Figure 4.20 after CollectBelief during BeliefInitialization.]

[Table 4.7: Belief tables for belief universes of junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20 obtained after the completion of BeliefInitialization.]

Table 4.8: Prior probabilities from junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20 obtained after the completion of BeliefInitialization.

p(h11) = .15     p(a11) = .205    p(f11) = .2901   p(e11) = .559
p(h21) = .3565   p(a21) = .31     p(f21) = .6485   p(e21) = .4862
p(h31) = .3                                        p(e31) = .282
p(h41) = .3025   p(a41) = .4118                    p(e41) = .3466

4.9.4 Evidential Reasoning

The joint system belief defined in section 4.7.3 is proportional to the prior joint distribution representing the background domain knowledge. Initialization allows one to obtain prior marginal probabilities with efficient local computation. When evidence about a particular case becomes available, one wants the prior distribution to change into the posterior distribution.

Evidence is represented in terms of evidence functions. Two types of evidence are considered here, as by Jensen, Olesen and Andersen [1990]. The first type has a value range of {0, 1} where '0' means that the corresponding outcome is impossible and '1' means that the corresponding outcome is still possible with its relative belief strength remaining the same. The second type is a restriction. It has the same function value range, but the function assigns '1' to only one outcome. This type of evidence arises when the corresponding evidential variables are directly observable. Both types of evidence functions can be entered into junction forests by multiplying the prior distribution with the evidence function.

Call the overall process of entering evidence and propagating evidence evidential reasoning. After a batch of evidence is entered into a junction tree, UnifyBelief can be performed to bring the junction tree into consistency. This is the same in the context of junction forests as in the junction tree technique. However, in order to obtain posterior marginal distributions on variables in the currently active junction tree, the global consistency of the junction forest is not necessary. Before this is formally treated, several concepts are to be defined.

Here only junction forests transformed from MSBNs with hypertree structures are considered. When a user wants to obtain marginal distributions on variables not contained in the currently active junction tree, it is said that there is an attention shift.
The junction tree which contains the desired variables is called the destination tree.

Definition 17 (intermediate tree) Let $S^i$, $S^j$, $S^k$ be 3 sects in a MSBN with a hypertree structure, and $T^i$, $T^j$, $T^k$ be their junction trees respectively in the corresponding junction forest. $T^j$ is an intermediate tree between $T^i$ and $T^k$ if the removal of $S^j$ from the MSBN would render the hypertree disconnected with $S^i$ and $S^k$ in different parts.

Due to the hypertree structure, one has the following lemma.

Lemma 9 Let a junction forest be transformed from a MSBN with a hypertree structure. Let $T^i$ and $T^k$ be 2 junction trees in the forest. The set of intermediate junction trees between $T^i$ and $T^k$ is unique.

The following defines an operation ShiftAttention at the junction forest level. It is performed when the user’s attention shifts.

Operation 12 (ShiftAttention) Let F be a junction forest whose corresponding MSBN has a hypertree structure. Let $T^{i_0}$ and $T^{i_{m+1}}$ be the currently active tree and destination tree in F respectively. Let $\{T^{i_1}, \ldots, T^{i_m}\}$ be the set of m intermediate trees between $T^{i_0}$ and $T^{i_{m+1}}$ such that $T^{i_0}, T^{i_1}, \ldots, T^{i_m}, T^{i_{m+1}}$ form a chain of neighbors.

For j = 1 to m + 1, $T^{i_j}$ performs UpdateBelief with respect to $T^{i_{j-1}}$.

Before each attention shift, several batches of evidence can be entered to the currently active tree. When an attention shift happens, ShiftAttention swaps in and out of memory sequentially only the intermediate trees between the currently active tree and the destination tree, without the participation of the rest of the forest. The following theorem shows that this is sufficient in order to obtain the marginal distributions in the destination tree.

Theorem 12 (attention shift) Let F be a consistent junction forest whose corresponding MSBN has a hypertree structure. Start with any active junction tree. Repeat the following cycle a finite number of times:

1. repeatedly enter evidence to the currently active tree followed by UnifyBelief a finite number of times;

2. 
use ShiftAttention to shift attention to any destination tree.

The marginal distributions obtained in the final active tree are identical to those that would be obtained when the forest is globally consistent.

Proof:
Before any evidence is entered, the forest is consistent by assumption. Before each ShiftAttention, the currently active tree is consistent due to UnifyBelief. The supportiveness of the forest is invariant under the execution of the cycle. The joint system belief changes (correctly) only under evidence entering and remains invariant under the other operations of the cycle.

Transform the initial consistent forest into a junction tree F’ of huge universes as in the proof of Theorem 9. Each huge universe in F’ corresponds to a tree in F. The active tree and the destination tree correspond to an ‘active’ universe and a destination universe. The evidence entering in F and the subsequent UnifyBelief correspond to the evidence entering in F’. The ShiftAttention in F corresponds to performing a series of AbsorbThroughSepset operations starting at the neighbor of the ‘active’ universe down to the destination universe in F’.

Note that after each series of AbsorbThroughSepset operations in F’, the belief table of each sepset of F’ is the marginalization of the belief in the neighbor more distant from the destination universe in the junction tree F’. Therefore, at the end of the cycles in F, if a DistributeEvidence is performed in F’, F’ will be consistent and the belief in the destination universe does not undergo further change. 
That is, the belief on the destination tree at the end of the cycles is identical to what would be obtained when the forest is globally consistent. □

4.9.5 Computational Complexity

Theorem 12 shows the most important characterization of MSBNs and junction forests, namely, the capability of exploiting localization to reduce the computational complexity.

Due to localization, the user’s interest and new evidence will remain in the sphere of one junction tree for a period of time. The judgments obtained are at the knowledge level of the overall junction forest while the computation required is at the level of the currently active tree. Compared to the USBN and single junction tree representation, where the evidence has to be propagated to the overall system, this leads to great savings in time and space requirements when localization is valid.

When the user subsequently shifts interest to another set of variables contained in a destination tree, only the intermediate trees need to be updated. The amount of computation required is linear in the number of intermediate trees and in the number of linkages between each pair of neighbors. However large the overall junction forest is, the amount of computation for an attention shift is fixed once the destination tree and the mediating trees are fixed. For example, in a MSBN with a covering sect, no matter how many sects are in the MSBN, an attention shift updates at most 2 sects.

Given the analysis, the computational complexity of a MSBN with β sects is about 1/β of that of the corresponding USBN system when localization is valid. The actual time and space requirements are a little more than 1/β due to the repetition of d-sepnodes and the computation required for attention shift. The computational savings obtained in the PAINULIM system are discussed in Chapter 5.

Example 20 Complete the example on junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ with evidential reasoning. 
Suppose the outcome of variable E3 in $\Gamma^3$ is found to be e31. Table 4.9 lists $B'(\Gamma^3)$, which is obtained after evidence entering and UnifyBelief, and $B'(\Gamma^1)$ and $B'(\Gamma^2)$, both of which are obtained after a ShiftAttention with destination $\Gamma^2$. Table 4.10 lists the posterior marginal probabilities after the ShiftAttention.

Table 4.9: Belief tables for $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20 in evidential reasoning. $B'(\Gamma^3)$ is obtained first after E3 = e31 is entered to $\Gamma^3$ and the evidence is propagated to the junction tree. $B'(\Gamma^1)$ and $B'(\Gamma^2)$ are obtained afterwards by ShiftAttention.

Table 4.10: Posterior probabilities from junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20 after evidence E3 = e31 is propagated by ShiftAttention.

p(h11) = .1893   p(a11) = .2767   p(f11) = .4897   p(e11) = .2
p(h21) = .7219   p(a21) = .6928   p(f21) = .5786   p(e21) = .8661
p(h31) = .7714   p(a31) = .4143                    p(e31) = 1
p(h41) = .3379   p(a41) = .4365                    p(e41) = .3696

Before closing this section, the following example demonstrates the computational advantage, during attention shift, provided by covering sects.

Example 21 In Figure 4.23, D is sectioned into $\{D^1, D^2, D^3\}$ by sound sectioning without a covering sect. The MSBN is transformed into the junction forest $\{T^1, T^2, T^3\}$ by an invertible transformation. If evidence about E, and then about G, comes, the first piece of evidence will be entered to $T^1$, and then $T^1$ will send a message to $T^2$. After entering the second piece of evidence, $T^2$ will be the only one up-to-date. Now if one is interested in the belief on H, the belief tables on {B, F} and {C, F} in $T^3$ have to be updated. However, $T^2$ cannot provide the distribution on {B, F}. Thus, $T^2$ has to send a message to $T^3$ about {C, F}, then send a message to $T^1$. Then $T^1$ can become up-to-date and send the distribution on {B, F} to $T^3$. One sees that 3 message passings are necessary, and linkages between each pair of junction trees have to be created and maintained. More message passings and more linkages are needed when there are more sects organized in this structure. 
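For contrast with the structure in this example, the chain of trees visited by ShiftAttention (Operation 12) can be sketched as a path search over the linkage structure. This is a minimal sketch, not the thesis implementation: it assumes the linkage graph is itself a tree (as with a hypertree or covering-sect structure), and the names `attention_chain`, `shift_attention`, `neighbors` and `update_belief` are hypothetical.

```python
from collections import deque

def attention_chain(neighbors, active, destination):
    """Return the chain of junction trees from `active` to `destination`.

    `neighbors` maps each tree name to the trees it is linked with.
    The linkage graph is assumed to be a tree, so the chain is unique.
    """
    parent = {active: None}
    queue = deque([active])
    while queue:
        tree = queue.popleft()
        if tree == destination:
            break
        for nxt in neighbors[tree]:
            if nxt not in parent:
                parent[nxt] = tree
                queue.append(nxt)
    if destination not in parent:
        raise ValueError("destination is not reachable from the active tree")
    chain = []
    node = destination
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain[::-1]

def shift_attention(neighbors, active, destination, update_belief):
    """Call update_belief(tree, previous_tree) along the chain.

    `update_belief` is a hypothetical stand-in for UpdateBelief.
    Returns the number of message passings performed.
    """
    chain = attention_chain(neighbors, active, destination)
    for prev, cur in zip(chain, chain[1:]):
        update_belief(cur, prev)
    return len(chain) - 1

# Covering-sect (star) structure: T1 is linked to every other tree.
star = {"T1": ["T2", "T3", "T4"], "T2": ["T1"], "T3": ["T1"], "T4": ["T1"]}
print(attention_chain(star, "T2", "T4"))  # ['T2', 'T1', 'T4']
```

In the star structure any attention shift passes through at most the covering tree, so at most 2 calls to the update function are made, whatever the number of trees.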
When n sects are interconnected and there is a covering sect, only n - 1 sets of linkages need to be created, and at most 2 message passings are needed to update the belief in any destination tree.

4.10 Remarks

This chapter presents MSBNs and junction forests as flexible and efficient knowledge representation and inference formalisms that exploit the localization naturally existing in large knowledge-based systems. The systems which can benefit from the technique are those that are reusable, representable by general but sparse networks, and characterized by incremental evidential reasoning.

The MSBNs allow the partition of a large application domain into smaller natural subdomains such that each of them can be represented as a Bayesian subnetwork (a sect), and can be tested and refined individually. This makes the representation of a complex domain easier and more precise for knowledge engineers and makes the resultant system more natural and more understandable to system users. The resultant modularity facilitates implementation of large systems in an incremental fashion. The constraints technically imposed by MSBNs on the partition are found to be d-sepsets and soundness of sectioning.

Two important guidelines for sound sectioning are derived. The covering subDAG rule is suitable for partition according to categories at the same abstraction level, while the hypertree structure is suitable for partition of a domain with a hierarchical nature. It provides a formalism to represent and reason at different levels of abstraction. 
MSBNs following the rules allow multiply connected sects, do not require expensive computation for validation of soundness of sectioning, and have an additional computational advantage during attention shift in evidential reasoning.

Each sect in the MSBN is transformed into a junction tree such that the MSBN is transformed into a junction forest representation where evidential reasoning takes place. The constraints on transformation are found to be the invertibility of morali-triangulation and separability.

Each sect/junction tree in the MSBN/junction forest stands as a separate computational object. Since the technique allows transformation of sects into junction trees through local computation at the sect level, and allows reasoning to be conducted with junction trees as units, the space requirement is governed by the size of 1 sect/junction tree. Hence large applications can be built and run on relatively small computers whenever hardware resources are of concern.

For a large application domain, an average case may involve only a portion of the total knowledge encoded in a system, and one portion may be used repeatedly over a period of time. A MSBN and junction forest representation allows the ‘interesting’ or ‘relevant’ sect/junction tree to be loaded while the rest of the junction forest remains inactive and stays in secondary storage. However, the judgments made on variables in the active junction tree are consistent with all the knowledge available, including both prior knowledge and new evidence, embedded in the overall junction forest. When the user’s attention shifts, inactive junction trees can be made active and the previous accumulation of evidence is preserved. This is achieved through transfer of joint belief on d-sepsets. 
The amount of computation needed for this shift of attention is that required by the operation DistributeEvidence in the newly active junction tree times the number of linkages between the 2 junction trees when there exists a covering sect. Therefore, when localization is valid during consultation, the time and space requirements of a MSBN system of β sects with a covering sect or with a hypertree structure are a little more than 1/β of those required by a USBN system, assuming equal size of sects. The larger a MSBN system, the more computational savings can be obtained.

Chapter 5

PAINULIM: AN EXPERT SYSTEM FOR NEUROMUSCULAR DIAGNOSIS

This chapter discusses the implementation of a prototype expert system called PAINULIM. The system assists EMGers in neuromuscular diagnosis involving the painful or impaired upper limb. Section 5.1 introduces the PAINULIM domain, and compares PAINULIM with MUNIN. Section 5.2 discusses how localization is exploited in PAINULIM using the MSBN technique (Chapter 4). Section 5.3 discusses other issues in knowledge representation of PAINULIM. Section 5.4 introduces an expert system shell, WEBWEAVR, which is used in the development of PAINULIM. Section 5.5 provides a query session with PAINULIM to show its capability. Section 5.6 presents an evaluation of PAINULIM.

5.1 PAINULIM Application Domain and Design Criteria

As is reviewed in section 1.1, several expert systems in the neuromuscular diagnosis field have appeared since the mid 80’s. Satisfaction in system testing with constructed cases has been reported, while only one of them (ELECTRODIAGNOSTIC ASSISTANT [Jamieson 90]) reported clinical evaluation, using 15 cases, with a 78% agreement rate with electromyographers (EMGers). Most of these systems are rule-based systems, which are reviewed in Chapter 1. MUNIN [Andreassen et al. 
89] as a large system has attracted much attention and it can be used as a yardstick against which to compare PAINULIM.

The MUNIN project started in the mid 80’s in Denmark as part of the European ESPRIT program. MUNIN was to be a ‘full expert system’ for neuromuscular diagnosis. Functionalities to be included would be test-planning, test-guide, test-set-up, signal processing of test results, diagnosis, and treatment recommendation. The intended users of MUNIN would range from novice to experienced practitioners. The knowledge base would ultimately include full human neuroanatomy. MUNIN adopts Bayesian networks to represent probabilistic knowledge. Substantial contributions to Bayesian network techniques were made (e.g., [Lauritzen and Spiegelhalter 88, Jensen, Lauritzen and Olesen 90]).

The MUNIN system is to be developed in 3 stages. In the first stage a ‘nanohuman’ model with 1 muscle and 3 possible diseases has been developed. In the second stage a ‘microhuman’ system with 6 muscles and corresponding nerves is to be developed. The last stage would correspond to a model of the full human neuroanatomy. The only clinical experience with MUNIN is contained in the phrase: “we expect semi-clinical trials of the ‘microhuman’ to give the first indication of clinical usefulness” [Andreassen et al. 89].

PAINULIM started in 1990 at the University of British Columbia in the Department of Electrical Engineering with cooperation from the Department of Computer Science and the Neuromuscular Disease Unit (NDU) of Vancouver General Hospital (VGH). Expertise needed to build PAINULIM’s knowledge base was provided by experienced electromyographers Andrew Eisen and Bhanu Pant. Rather than attempting to cover the full range of diagnosis as does MUNIN, PAINULIM sets out to cover the more modest, but realizable, goal of diagnosis for patients suffering from a painful or impaired upper limb due to diseases of the spinal cord and/or peripheral nervous system. 
About 50% of the patients entering the NDU of VGH can be diagnosed with PAINULIM. The 14 most common diseases considered by PAINULIM include: Amyotrophic Lateral Sclerosis, Parkinson’s disease, Anterior horn cell disease, Root diseases, Intrinsic cord disease, Carpal tunnel syndrome, Plexus lesions, and Median, Ulnar and Radial nerve lesions. PAINULIM uses features from clinical examination, (needle) EMG studies and nerve conduction studies to make diagnostic recommendations.

PAINULIM requires from the users specific knowledge and experience:

• minimum competence in clinical medicine, especially in neuromuscular diseases;
• basic knowledge of nerve conduction study techniques; and
• minimum experience of EMG patterns in common neuromuscular diseases.

PAINULIM will benefit the following users:

• students and residents in neurology, physical medicine and neuromuscular diseases;
• doctors who are practicing EMG and nerve conduction in their offices;
• experienced EMGers as a formal peer review (self evaluation); and
• hopefully, different labs, in adopting uniform procedures and criteria for diagnosis.

Clinical diagnosis is performed in steps: anatomical, pathological and etiological. PAINULIM currently works at the anatomical level only. Extension to other levels is considered as one of the future research topics.

Given the level of knowledge and experience of the intended users of PAINULIM, the system does not attempt to represent explicitly the neuroanatomy involved in the painful or impaired upper limb. Rather, PAINULIM chooses to represent explicitly the clinically significant disease-feature relations, which are one of the most important parts of the expertise of experienced neurologists. This choice allows rapid development of PAINULIM, which can work directly at the clinical level.

PAINULIM is based on Bayesian belief networks as is MUNIN. 
The PAINULIM project has benefited from the research results of MUNIN, especially the junction tree technique (section 4.2.2). However, PAINULIM is not limited to duplicating the techniques developed in MUNIN; rather, the Multiply Sectioned Bayesian Networks and Junction Forests technique presented in chapter 4 extends the junction tree technique in an effort to achieve more flexible knowledge representation and more efficient inference computation.

5.2 Exploiting Localization in PAINULIM Using MSBN Technique

5.2.1 Resolve Conflict Demands by Exploiting Localization

The PAINULIM project has a large domain. The Bayesian network representation contains 83 variables including 14 diseases and 69 features, each of which has up to 3 possible outcomes. The network is multiply connected and has 271 arcs and 6795 probability values. When transformed into a junction tree representation, the system contains 10608 belief values. During system development, the tight schedule of the medical staff demands (1) knowledge acquisition and system testing within the hospital environment, where most computing equipment consists of personal computers; and (2) short response time in system testing and refinement. Implementation on hospital equipment will also facilitate the adoption of the system when it is completed. On the other hand, the space and time complexity of the PAINULIM system tends to slow down the response and to demand more powerful computing equipment not available in the hospital lab.

These conflicting demands motivated the improvement of the current Bayesian network representation in order to reduce the computational complexity. The ‘localization’ naturally existing in the PAINULIM domain (section 4.1) inspired the development of the MSBN technique (chapter 4). 
The following briefly summarizes the description of localization in the PAINULIM domain.

Localization means that, during a particular consultation session in a large application domain represented by a Bayesian net, (1) new evidence and queries are directed to a small part of a large network repeatedly within a period of time; and (2) certain parts of the network may not be of interest to users at all. An EMGer bases his diagnosis on 3 major information sources: clinical examination, (needle) EMG studies and nerve conduction studies. The examination/studies are performed in sequence. Based on the practice in the NDU of VGH, about 60% of patients have only EMG studies, and about 27% of patients have only nerve conduction studies. The number of clinical findings on an average patient is about 5. The numbers of EMG and nerve conduction studies performed on an average patient are about 6 and 4 respectively. All findings are obtained incrementally.

5.2.2 Using MSBN Technique in PAINULIM

Based on localization, PAINULIM uses the MSBN representation. The domain is partitioned into 3 natural localization preserving subdomains (clinical, EMG and nerve conduction) which are separately represented by 3 sects (CLINICAL, EMG, and NCV) in PAINULIM (Figure 5.28). The (sub)domain of each sect contains the corresponding feature variables and relevant disease variables. Using this representation, the 3 sects are created and tested separately. This modularity allows the EMGer to concentrate on one natural subdomain at a time during network construction rather than to manage the overall complexity at the same time. This eases the task of knowledge acquisition on the part of both the EMGer and the knowledge engineer. Problems in each sect can thus be isolated and corrected quickly and easily.

The 3 sects are interfaced by common disease variables. 
The d-sepset between CLINICAL and EMG contains 12 diseases, and the d-sepset between CLINICAL and NCV contains 10 diseases. All the disease variables have no parent variables and thus the d-sepset constraint (section 4.5) is satisfied. The CLINICAL sect has all the 14 disease variables considered in PAINULIM, and it becomes the covering sect (section 4.6.2). Thus the soundness of sectioning is guaranteed in PAINULIM. This representation is natural because clinical examination is the stage where all the disease candidates are subject to consideration.

After the 3 sects are created, they are transformed into a junction forest (Figure 5.29). The MSBN technique allows the transformation to be conducted by local computation at the level of sects (section 4.7.1). Thus the space requirement for transformation is governed by the size of only 1 sect.

Multiple linkages are created between junction trees such that evidence can be propagated from one to another during evidential reasoning. There are 3 linkages between the CLINICAL tree and the EMG tree (thick lines in Figure 5.29). The sizes of their state spaces are 768, 1536, and 1536 respectively. Without introducing multiple linkages (using the brute force method in section 4.3), the d-sepset would form a clique with state space size 6144, which would greatly increase the computational complexity in each sect. This is because this large clique would have all the other cliques in the same junction tree as neighbors. During evidential reasoning, the belief table of this large clique would be marginalized and updated as many times as the number of neighbor cliques.

With the 3 junction trees linked and joint system belief initialized, the resultant consistent junction forest becomes the permanent representation of the PAINULIM domain where evidential reasoning takes place. 
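The state space sizes quoted above for the CLINICAL and EMG interface can be checked with a short calculation. The sketch below is illustrative only: it assumes that the 12 d-sepset diseases consist of 11 binary variables plus one 3-outcome variable, a composition that is consistent with the quoted size 6144 but is not stated explicitly in the text, and the helper name `state_space_size` is hypothetical.

```python
from math import prod

def state_space_size(outcome_counts):
    """Number of joint configurations of a set of discrete variables."""
    return prod(outcome_counts)

# Assumed composition of the CLINICAL/EMG d-sepset:
# one 3-outcome disease variable plus 11 binary disease variables.
dsepset_outcomes = [3] + [2] * 11
single_clique = state_space_size(dsepset_outcomes)

# Sizes of the 3 linkages as quoted in the text.
linkage_sizes = [768, 1536, 1536]
total_linkage = sum(linkage_sizes)

print(single_clique)   # 6144
print(total_linkage)   # 3840
```

Under this assumption, the three linkages together hold 3840 belief values, compared with 6144 for a single clique over the whole d-sepset.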
The original MSBN still serves as the user interface while the computation for inference is solely performed in the junction forest.

Since the junction forest representation preserves localization, the run time computation of PAINULIM can be restricted to only one junction tree. Thus the time and space requirements are governed by the size of one sect, not the size of the forest. If PAINULIM were represented by a USBN, the size of the total state space of the junction tree would be 10608. This overall junction tree would have to be updated for each batch of evidence. Using the MSBN technique, the sizes of the total state space of the junction trees are 7308, 6498 and 2178 (in order of CLINICAL, EMG and NCV) respectively. The largest size, 7308, will govern the space requirement and will put an upper bound on the time requirement. The mean size, 5328, gives an average time requirement which is about half of the amount required by a USBN system with size 10608. Due to localization, each tree will have to be computed about 5 times (recall that findings in the 3 subdomains come in 5, 6 and 4 batches respectively on an average patient) before an attention shift. Thus the overall time saving is about half compared to a USBN system.

When a user’s attention is shifted from the current active tree to another tree, the latter is swapped from secondary storage into main memory and all previously acquired evidence is absorbed. The posterior distributions obtained are always based on all the available knowledge and evidence embedded in the overall system. Thus the savings in space and time do not sacrifice accuracy. The computational savings thus obtained translate immediately to smaller hardware requirements and quicker response time. 
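The size comparison above reduces to simple arithmetic on the quoted state space sizes. The following sketch reproduces it; the variable names are illustrative, and the numbers are those given in the text.

```python
# Total state space sizes quoted in the text.
usbn_size = 10608
tree_sizes = {"CLINICAL": 7308, "EMG": 6498, "NCV": 2178}

largest = max(tree_sizes.values())                      # governs space
mean_size = sum(tree_sizes.values()) / len(tree_sizes)  # average per-batch cost

space_ratio = largest / usbn_size
time_ratio = mean_size / usbn_size

print(mean_size)              # 5328.0
print(round(space_ratio, 3))  # 0.689
print(round(time_ratio, 3))   # 0.502
```

The ratio of the mean tree size to the USBN size is about 0.5, which matches the "about half" time requirement stated above, while the largest tree needs roughly 69% of the USBN space.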
With the MSBN technique, it has been possible to use hospital equipment (IBM AT compatible computers) to construct, refine and run PAINULIM interactively with EMGers right in the VGH lab. The EMGer’s before-, inter-, and after-patient time can be utilized for knowledge acquisition and system refinement. The inference time for each batch of evidence ranges from 12 sec in NCV to 28 sec in CLINICAL. An attention shift takes from 24 sec to 106 sec depending on the currently active tree and the destination tree. The efficiency gained in cooperation with EMGers greatly sped up the development of PAINULIM (less than a year).

5.3 Other Issues in Knowledge Acquisition and Representation

5.3.1 Multiple Diseases

Many probability-based medical expert systems have assumed that diseases are mutually exclusive, for example, PATHFINDER [Heckerman, Horvitz and Nathwani 89] and MUNIN’s nanohuman system [Andreassen et al. 89]. A few did not, for example, QMR [Heckerman 90a]. When this assumption is valid, diseases can be lumped into 1 variable in the Bayesian network, which simplifies the network topology.

PAINULIM considers the 14 most common diseases in patients presenting with a painful or impaired upper limb. Since a patient could suffer from multiple neuromuscular diseases, the assumption of mutually exclusive diseases is not valid in the PAINULIM domain. PAINULIM has therefore represented each disease by a separate node.

Although this representation is acceptable for most of the 14 diseases, there is an exception: Amyotrophic lateral sclerosis (Als) and Anterior horn cell diseases (Ahcd). Both are disorders of the motor system. Ahcd involves only the lower motor neuronal system (between the spinal cord and the muscle), but Als additionally involves the upper motor neuron (between the brain and the spinal cord). However, when one speaks of Als, it is not considered as an Ahcd plus disease, but an entity by itself. 
Therefore, conceptually, an EMGer would never diagnose a patient to have both Als and Ahcd. This conceptual exclusion is represented by combining the 2 into 1 variable, Motor Neuron Disease (Mnd). The variable has 3 exclusive and exhaustive outcomes: Als, Ahcd, Neither.

All the disease variables and most of the feature variables are represented as binary. That is, only ‘positive’ or ‘negative’ outcomes are allowed. ‘Positive’ corresponds to a ‘severe’ or ‘moderate’ degree of severity, and ‘negative’ corresponds to ‘mild’ or ‘absent’. This choice is made for simplicity in both knowledge acquisition and inference computation. A more refined representation is planned when the system’s performance is satisfactory at the current grade level.

5.3.2 Acquisition of Probability Distribution

In PAINULIM, a feature variable can have up to 7 parent disease nodes. For example, wk_wst_ex can be caused by Mnd, Rc67, Rc81, Pxutk, Pxltk, Pxpcd, and Radnn. It requires 384 numbers to fully specify the conditional probability distribution at wk_wst_ex. It would be frustrating if all these numbers had to be elicited from a human EMGer.

The leaky noisy OR gate model [Pearl 88, Henrion 89] is found to be a powerful tool for distribution acquisition in PAINULIM. When a symptom can be caused by n explicitly represented diseases, the model assumes (1) each disease has a probability to produce the symptom in the absence of all other diseases; (2) the symptom-causing probability of each disease is independent of the presence of other diseases; and (3) due to the possibility of unrepresented diseases there is a non-zero probability that the symptom will manifest in the absence of any of the diseases represented explicitly.

In discussion with EMGers, it is found that the above assumptions are quite valid in the PAINULIM domain. A symptom will occur in any given disease with a unique frequency. 
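The three assumptions of the leaky noisy OR gate model can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WEBWEAVR code: all parent diseases are treated as binary, each elicited single-disease probability is taken to include the leak, and the function name `leaky_noisy_or` as well as the numerical values are hypothetical.

```python
def leaky_noisy_or(single_cause_probs, leak, present):
    """P(symptom = yes | diseases in `present` are present, all others absent).

    single_cause_probs[d] is p(symptom = yes | only disease d present),
    assumed to already include the leak contribution.
    leak is p(symptom = yes | no represented disease present).
    """
    if not 0.0 <= leak < 1.0:
        raise ValueError("leak must be in [0, 1)")
    p_no_symptom = 1.0 - leak
    for d in present:
        # Each present disease independently fails to cause the symptom;
        # dividing by (1 - leak) avoids counting the leak more than once.
        p_no_symptom *= (1.0 - single_cause_probs[d]) / (1.0 - leak)
    return 1.0 - p_no_symptom

# Illustrative (made-up) numbers for two of the parent diseases.
probs = {"Radnn": 0.8, "Rc67": 0.5}
leak = 0.05

print(leaky_noisy_or(probs, leak, []))                  # the leak itself
print(leaky_noisy_or(probs, leak, ["Radnn"]))           # single-disease value
print(leaky_noisy_or(probs, leak, ["Radnn", "Rc67"]))   # heightened by 2 causes
```

With this parameterization, one number per parent disease plus one leak value is enough to generate every entry of the conditional probability table, and the probability of the symptom increases whenever an additional causing disease is present.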
Should there exist more than one disease that could cause the same symptom, the frequency of occurrence of this particular symptom will be heightened. Using the leaky noisy OR gate model, the above distribution for wk_wst_ex is assessed by eliciting only 8 numbers. Seven of them take the form

p(wk_wst_ex = yes | Radnn = yes and every other disease = no)

and one of them takes the form

p(wk_wst_ex = yes | all 7 parent diseases = no)

5.4 Shell Implementation

Figure 5.30: Top left: drawing a sect with mouse operation; top right: naming a variable and specifying its outcomes; middle left: specifying the conditional probability distribution for a variable; middle right: specifying the sects composing the MSBN of PAINULIM; bottom left: specifying the d-sepset to be entered next; bottom right: specifying a d-sepset.

An expert system shell, WEBWEAVR, is implemented which incorporates the MSBN technique and the leaky noisy OR gate model. This shell is in turn used to construct the PAINULIM expert system.

The WEBWEAVR shell is written in C and is implemented on an IBM PC to suit the computing environment at the NDU, VGH, where PAINULIM is constructed. It can be run on an XT, AT or 386, although an AT or above is recommended.

The shell consists of a graphical editor (EDITOR), a structure transformer (TRANSNET), and a consultation inference engine (DOCTR). EDITOR allows users to construct MSBNs in a visually intuitive manner. TRANSNET transforms constructed MSBNs into junction forests.

DOCTR does the evidence entering and evidential reasoning. It can work in 2 different modes: interactive or batch processing. The interactive mode allows the user to enter evidence and obtain posterior distributions in an interactive and incremental way. This mode is used during a consultation session. The batch processing mode allows the user to enter all the evidence to a file for a patient case. 
Then DOCTR makes a diagnosis based on the file. This mode is mainly used for system evaluation such that a large number of cases can be processed without human supervision.

DOCTR provides 2 modes for screen layout: network and user-friendly. The network mode (Figure 5.28) displays the full sect topology such that parent-child relations can be traced along with the marginal distributions. The user-friendly mode (Figure 5.31) does not display arcs of the network but labels variables with names closer to medical terminology. The former layout provides richer information while the latter gives a neater screen.

Figure 5.30 illustrates the WEBWEAVR shell with 6 screen dumps. In the upper left screen, a sect is drawn using a mouse. In the upper right screen, the name of a variable and its possible outcomes are entered. In the middle left screen, the conditional probability distribution for a child variable is entered. Each sect can be constructed in this way separately. In the middle right screen, the composition of the MSBN for PAINULIM is specified. In the bottom left screen, a menu is displayed which allows a user to specify the d-sepset to be entered next. In the bottom right screen, the d-sepset between the CLINICAL and EMG sects is specified.

WEBWEAVR supports the construction of any MSBN which has a covering sect. Porting it to SUN workstations and X-Window is under consideration.

5.5 A Query Session with PAINULIM

In this section, a query session with PAINULIM in the diagnosis of a particular patient is illustrated with snapshots of major steps.

Figure 5.31: Top: CLINICAL sect with prior distributions; bottom: posterior distributions after clinical evidence is entered.

Since clinical examination is always the first stage in the diagnosis of a patient, PAINULIM begins with the CLINICAL subnet. Before any evidence is available, the prior distributions for all the diseases and symptoms can be obtained, which reflect the background knowledge about the patient population in the NDU of VGH. The top screen in Figure 5.31 shows the CLINICAL sect with prior distributions displayed in histograms. The user-friendly display mode is used here for better illustration.

The patient presents with tingling in the hand, weakness of thumb abduction, weakness of wrist extension, and loss of sensation in the back of the hand. After entering the evidence into the CLINICAL sect, the impression is Cts (0.808) and Radial nerve lesion (0.505). 
The bottom screen in Figure 5.31 highlights the diseases and features with posterior probability greater than 0.1.

After attention is shifted to the NCV sect, PAINULIM suggests (Figure 5.32 top) that the most likely abnormalities on NCV are from the Median sensory study (0.785), the Median to ulnar palmar latency difference (0.777), the Median motor distal latency (0.58) and the Radial motor and sensory studies (0.43 and 0.51 respectively).

After values for the Median, Radial and Ulnar nerves are entered, the revised impression (Figure 5.32 bottom) is Cts (0.912) and Radial nerve lesion (0.918).

Figure 5.32: Top: NCV sect with evidence from CLINICAL and EMG sects absorbed; bottom: final diagnosis after nerve conduction studies are finished.

With attention further shifted to the EMG sect, PAINULIM (Figure 5.33 top) prompts that the most likely EMG abnormalities are in the Edc (0.791), the Triceps (0.789), the Apb (0.827) and the Brachioradialis (0.840) muscles.

The data entered are that the Apb and the Edc are abnormal, while the Fdi is normal. The final impression (Figure 5.33 bottom) reads as Cts (0.992) and Radial lesion (0.922).

The above snapshots illustrate the diagnostic capability of PAINULIM. Another important use is education. Given a patient with Cts and Radnn, Figure 5.34 displays the highly expected (above 80% likelihood) positive features for the patient, which can be used in training.

For the patient case in the above diagnosis, out of 6 highly expected CLINICAL features 4 are positive and 2 negative; out of 4 EMG features 2 are positive and 2 unchecked; out of 5 NCV features 1 is positive, 3 unchecked and 1 negative.
Thus the majority of checked features match the expectation for the multiple diseases Cts and Radnn, which explains the diagnosis reached above from one perspective.

Figure 5.33: Top: EMG sect with evidence from CLINICAL sect absorbed; bottom: posterior distributions after EMG findings are entered.

Figure 5.34: Highly expected feature presentation of a patient with both Cts and Radnn. Top: CLINICAL sect. Middle: EMG sect. Bottom: NCV sect.

5.6 An Evaluation of PAINULIM

This section presents the procedure and results of a preliminary evaluation of PAINULIM.

5.6.1 Case Selection

76 patient cases in NDU of VGH are selected. They had been diagnosed by EMGers before being used in the evaluation. The selection is conducted such that there is a balanced distribution among the diseases considered by PAINULIM. Table 5.11 lists the number of cases involved for each disease. If a case involves multiple diseases, that is, either the patient was diagnosed as suffering from multiple diseases or differentiation among several competing disease hypotheses could not be made at the time of diagnosis, then the count of each disease involved is increased by 1. The total count is 124. Thus the ratio 124/76 serves as an indication of the multiple-disease-ness of the case population. 3 cases not considered by PAINULIM but presenting with a painful or impaired upper limb are also included in the evaluation to test PAINULIM's performance at its limitation.

disease   count    disease   count    disease   count    disease   count
Als         5      Inspcd      2      Cts        16      Pd          3
Ahcd        1      Pxutk       6      Mednn       5      Normal     12
Rc56        8      Pxltk       8      Ulrnn      16      Other       3
Rc67       19      Pxpcd       8      Radnn       6
Rc81        6
subtotal   39                 24                 43                 18

Table 5.11: Number of cases involved for each disease considered in PAINULIM

5.6.2 Performance Rating

Unlike the evaluation for QUALICON (chapter 3), in the PAINULIM domain there is no absolutely certain way to know the 'real' disease(s) given a patient case. The best one can do is to compare the expert system with the human expert and to take the human judgment as the gold standard. Since no human expert is perfect, an evaluation conducted this way has its limitations. Both the system and the human may make the same error, or the system may be correct while the human makes an error.

In evaluating PATHFINDER, Ng and Abramson [1990] use 'classic' cases with known diagnoses. The agreement between the known diagnosis and the disease with top probability is used as an indicator of PATHFINDER's performance.
Heckerman [1990b] asks the expert to compare PATHFINDER's posterior distributions with his own and to give a rating between 0 and 10.

In the PAINULIM domain, a human diagnosis can be a single disease or multiple diseases, and can have different degrees of severity. Sometimes even the human diagnosis is unsure because the test studies are not well designed or are incomplete. Thus a more sophisticated rating method than the one used by Ng and Abramson [1990] is desired for the evaluation of PAINULIM. A more objective rating than the one used by Heckerman [1990b] is also targeted, such that the rating can be verified by persons other than the evaluator.

For each case, the patient documentation is examined and the values for feature variables are entered into PAINULIM to make a diagnosis. The posterior marginal probabilities of diseases produced by PAINULIM are compared with the EMGer's original diagnosis. 5 rating scales (EX (excellent), GD (good), FR (fair), PR (poor), and WR (wrong)) are informally defined. The definition is not exhaustive but serves as a guideline for the evaluation.

For each disease involved (possibly a disease not considered by PAINULIM, as will be clear below), PAINULIM's performance is rated by the following rules.

1. For a disease judged by the EMGer as severe or moderate, the rating is EX if its posterior probability (PP) by PAINULIM falls in [0.8, 1] (GD: [0.6, 0.8); FR: [0.4, 0.6); PR: [0.2, 0.4); WR: [0, 0.2)).

2. For a disease judged by the EMGer as mild, the rating is EX if its PP falls in [0, 0.4] (GD: (0.4, 0.6]; FR: (0.6, 0.7]; PR: (0.7, 0.8]; WR: (0.8, 1]).

3. For a disease judged by the EMGer as absent, the rating is EX if its PP falls in [0, 0.2] (GD: (0.2, 0.4]; FR: (0.4, 0.6]; PR: (0.6, 0.8]; WR: (0.8, 1]).

4.
For a disease judged by the EMGer as uncertain as to its likelihood because of evidence which cannot be easily explained, the rating is EX if PAINULIM suggests the same disease or other disease(s) whose anatomical site(s) is/are close to that suspected by the EMGer, with PP falling in [0, 0.6] (GD: (0.6, 0.7]; FR: (0.7, 0.8]; PR: (0.8, 0.9]; WR: (0.9, 1]).

5. For a disease not represented by PAINULIM but judged by the EMGer as the single disease diagnosis of the case in question, the rating is EX if the PPs of all represented diseases fall in [0, 0.2] (GD: (0.2, 0.4]; FR: (0.4, 0.6]; PR: (0.6, 0.8]; WR: (0.8, 1]).

6. If PAINULIM provides an alternative diagnosis with PP in [0.70, 1], and the alternative is agreed upon by a second EMGer to whom only the evidence as presented to PAINULIM is available, then the rating is EX. The rating is also EX if PAINULIM provides a diagnosis with PP in [0.3, 0.7) based on evidence ignored by the original EMGer, but which the second EMGer considers significant and agrees with PAINULIM.

Rules 1, 2, and 3 above are used when the EMGer has a confident diagnosis. Rule 4 is used when the EMGer has an unsure diagnosis. Rule 5 is used to rate PAINULIM's performance at its limitation. Rule 6 is used when the human diagnosis is biased by nonanatomical evidence not available to PAINULIM, or when PAINULIM behaves better than the human diagnostician. After the rating for each disease is assigned, the lowest rating is given as the rating of PAINULIM's performance for the case in question.

5.6.3 Evaluation Results

The evaluation of 76 patient cases has the following outcomes: EX: 65, GD: 6, FR: 2, PR: 1, WR: 2. The excellent rate is 0.86 with the 95% confidence interval being (0.756, 0.925) using the standard statistical technique (Appendix E). The good or excellent rate is 0.93 with the 95% confidence interval being (0.853, 0.978). A closer look at the cases evaluated is worthwhile.
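The rates quoted above follow directly from the outcome tallies, and the intervals can be approximated with a short calculation. The sketch below is illustrative only. Appendix E is not reproduced in this section, so the choice of the Clopper-Pearson (exact binomial) interval is an assumption, and the helper names binom_cdf and exact_interval are hypothetical.

```python
from math import comb

# Outcome tallies for the 76 evaluated cases, as reported in the text.
counts = {"EX": 65, "GD": 6, "FR": 2, "PR": 1, "WR": 2}
n_cases = sum(counts.values())


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_interval(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials (assumed method)."""

    def solve(k_cdf, target):
        # binom_cdf(k_cdf, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k_cdf, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound solves P(X >= k) = alpha/2; upper bound solves P(X <= k) = alpha/2.
    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper


excellent = counts["EX"]
good_or_excellent = counts["EX"] + counts["GD"]
for label, k in [("EX", excellent), ("EX or GD", good_or_excellent)]:
    low, high = exact_interval(k, n_cases)
    print(f"{label}: rate {k / n_cases:.2f}, 95% interval ({low:.3f}, {high:.3f})")
```

Under this assumption the computed bounds should lie close to those quoted in the text; a different interval method, such as a normal approximation, would give slightly different bounds.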

Among the 76 cases, there are 38 cases where the EMGer is confident about the diagnosis. The performance of PAINULIM is EX for 36 of them, and GD for the other 2. Thus for cases where evidence is complete, PAINULIM performs as well as the EMGer.

There are 8 cases where the EMGer is either confident about the diagnosis of a single mild disease, or is unsure about a single mild disease. For these 8 cases, PAINULIM's posterior probabilities for all disease hypotheses are below 0.2, which says that none of the diseases is severe or moderate. Thus PAINULIM performs as well as the EMGer in the case of a single mild disease.

For at least 7 cases, PAINULIM indicates a second disease which is not mentioned by the EMGer. Careful examination of the feature presentation shows that the corresponding features are indeed not explained by the EMGer's diagnosis. PAINULIM's behavior in these cases is considered superior to that of the EMGer. For several other cases where the EMGer's diagnosis is unsure about the possible diseases, PAINULIM indicates several candidates (with probability greater than 0.3) best matching the available evidence, which provide hints to human users for a more complete test study.

For the 2 cases diagnosed as a disease in the spinal cord or peripheral nervous system but not represented in PAINULIM, PAINULIM's performance is EX for one and GD for the other. The posterior probabilities of all disease variables are below 0.3. This is interpreted as saying that the patient is not in a severe or moderate disease state for any represented disease (not to be interpreted as normal). As stated earlier, PAINULIM represents the 14 most common diseases of the spinal cord and/or peripheral nervous system presenting with a painful or impaired upper limb.
The diseases in this category which are not represented in PAINULIM include: Posterior interosseous nerve lesions, Anterior interosseous nerve lesions, Axillary nerve lesions, Musculocutaneous nerve lesions, Suprascapular nerve lesions, and Long thoracic nerve lesions. They are not represented in the current version of PAINULIM because their combined incidence is less than 2%. The evaluation shows that outside its representation domain, PAINULIM does not confuse such a disease with represented diseases whenever the unrepresented disease has its own unique feature pattern.

On the other hand, the evaluation reveals several limitations of PAINULIM. One limitation relates to unrepresented diseases. One case in the evaluation is diagnosed by the EMGer as Central Sensory Loss, which is a disorder of the central nervous system. It is outside the PAINULIM domain but is also characterized by a painful impaired upper limb. When restricted to the PAINULIM domain, the disease presentation is similar to Radnn. PAINULIM could not differentiate the two and gives probability 0.62 to Radnn (the rating is WR).

PAINULIM works with limited variables. For example, in the clinical sect, the variable 'lslathnd', which represents loss of sensation in the lateral hand and/or fingers, does not allow distinguishing between the front of the hand (Median nerve, Cts or Rc67) and the back of the hand (Radial nerve, Plexus Posterior cord, Rc67). When the evidence points to one of them but not the other, instantiating lslathnd will also (incorrectly) reinforce the group of diseases associated with the other.

Another important lesson concerns the assessment of numerical data in the PAINULIM representation. PAINULIM represents each disease with 2 states. 'Positive' corresponds to a 'severe' or 'moderate' degree of severity, and 'negative' corresponds to 'mild' or 'absent'. This choice is made for the reason explained in section 5.6.1.
When assessing conditional probabilities for PAINULIM, the above mapping has to be applied consistently. When a probability

p(wk_wst_ex = yes | Radnn = yes and every other disease = no)

is assessed, the condition 'Radnn = yes' means either severe or moderate Radnn. Unless the mapping is emphasized, the EMGer could include mild Radnn in his mind, which enlarges the population considered and lowers the assessed probability.

Another source of inaccuracy in the numerical probabilities comes from limited experience with certain diseases. When a disease is very rare (for example, Inspcd), the EMGer giving the probabilities may have little experience with the disease and thus provide an inaccurate assessment. A few cases rated as FR, PR or WR are due to inappropriate assessment of conditional probabilities. Further fine tuning is planned for the next version of PAINULIM.

5.7 Remarks

The development of PAINULIM has shown that the MSBN and junction forest technique provides a natural representation and an efficient inference formalism. Using the technique, the computational complexity of PAINULIM is reduced by half with no reduction of accuracy. The development of PAINULIM has thus benefited from efficient cooperation with medical staff, and rapid system construction and refinement.

The evaluation of PAINULIM's performance using 76 patient cases shows that the good or excellent rate is 0.93 with the 95% confidence interval being (0.853, 0.978). The case population is not selected sequentially (taking whatever patient cases come in sequence), in order to have a balanced distribution among the diseases considered by PAINULIM. If a sequential population is used, even better performance can be expected.
This is because normal or mild cases, or Cts cases, will occupy an even larger percentage than in the population used for the evaluation, and PAINULIM has performed excellently for these cases.

The deficiencies of the current PAINULIM are recognized in (1) insufficiently elaborated feature variables; (2) room for further fine tuning of numerical parameters; and (3) limitations with central nervous system disorders presenting with a painful or impaired upper limb. The improvement calls for further refinement and extension of PAINULIM's representation.

An important application of PAINULIM would be assistance in test design. A correct and efficient neuromuscular diagnosis depends largely on good test design, which allows the gathering of a minimum amount of, but sufficient, diagnostic information. However, when faced with difficult cases (initial evidence pointing in different directions) and many test alternatives, an EMGer may not be able to come up with an optimal test design. This has been true in several cases used in the evaluation. Without good test design, and constrained by time and patient, an EMGer would have to make a diagnosis (after the test) with limited and incomplete information.

There has been evidence that human thinking is often biased in making judgments under uncertainty [Tversky and Kahneman 74]. Thus an expert system like PAINULIM, capable of reasoning rationally under uncertainty under complex conditions (multiply connected network and evidence pointing in different directions with different degrees of support for various hypotheses), would be a very helpful peer in test design. The benefit would be a better test design and an improved diagnostic quality. As demonstrated in section 5.5, given the currently available evidence, PAINULIM can prompt the EMGer with the most likely disease(s) and most likely features.
The confirmation of these features would lend further support to the disease(s), and negative test results for these features would tend to exclude the disease(s) from further investigation. Patil et al. [1982] discuss several strategies in test design. This functionality has not been fully explored and made sufficiently explicit in the current version of PAINULIM. It is a future research topic. Since the likelihood of features depends on the current likelihood of diseases, the performance of PAINULIM in diagnosis suggests promising potential for its extension to test design.

Explanation capability is important in communication between the system and its user, both in diagnosis and in test design, and can add great usefulness to PAINULIM. Section 5.5 demonstrates certain explanation functions available in the current version of PAINULIM. A sophisticated explanation facility would give reasons why a particular disease is likely, and it would give further reasons why this disease is more likely than another one. Such a facility would be a future research topic.

Chapter 6

A LEARNING ALGORITHM FOR SEQUENTIALLY UPDATING PROBABILITIES IN BAYESIAN NETWORKS

As described in section 1.3.1, in building medical expert systems, the probabilities necessary for the construction of Bayesian networks usually have to be elicited from medical experts. The values thus obtained are liable to be inaccurate. This has also been the experience with PAINULIM (Chapter 5). The sensitivity analysis of the medical expert system PATHFINDER shows that probabilities play an important role in PATHFINDER's performance and that minor changes in all of its probabilities have a substantial impact on performance [Ng and Abramson 90].
Therefore, methodologies which allow the representation of uncertainty of probabilities and the improvement of their accuracy are needed.

This chapter presents an Algorithm for Learning by Posterior Probabilities (ALPP) for sequential updating of probabilities in Bayesian networks. The results are mainly taken from Xiang, Beddoes and Poole [1990b]. Section 6.1 reviews 3 representations of uncertainty of probabilities (interval, auxiliary variable, and ratio) and the corresponding methods for updating probabilities. Section 6.2 presents the idea of learning from the expert's posterior probabilities in order to overcome the limitation of the existing updating method for the ratio representation. Section 6.3 presents ALPP. Section 6.4 proves the convergence of ALPP. Section 6.5 presents the performance of ALPP in simulation.

6.1 Background

Several formalisms for the representation of uncertainty of probabilities in Bayesian networks have been proposed. Some of them have corresponding methodologies incorporated to improve the accuracy of probabilities. One formalism represents probabilities by intervals [Fertig and Breese 90]. The size of an interval signifies the degree of uncertainty of the probability it represents. No known method directly uses this representation for the improvement of accuracy of probabilities.

Another formalism represents the uncertainty of probabilities by probabilities of probabilities [Spiegelhalter 86, Neapolitan 90, Spiegelhalter and Lauritzen 90]. Auxiliary variables (nodes) are added to Bayesian networks. Their outcomes are the probabilities of variables in the original Bayesian net. The distributions of these auxiliary variables, therefore, are the probabilities of probabilities.
With this representation, the updating of probabilities can be performed within the probabilistic inference process.

Yet another formalism represents probabilities by ratios of imaginary sample sizes [Cheeseman 88a], [Spiegelhalter, Franklin and Bull 89]. A probability p(symptomA | diseaseB) is represented as x/y, where y is an imaginary patient population with disease B and x of these patients show symptom A. The less certain the probability p(symptomA | diseaseB) is, the smaller the integer y. Spiegelhalter, Franklin and Bull [1989] present a procedure for sequentially updating probabilities represented as ratios. The procedure is presented in a form directly applicable to Bayesian networks of diameter 1 with a single parent node (possibly multiple children). Here, a single child case (p(symptomA | diseaseB)) is used to illustrate the idea. When a new patient with disease B is observed, the sample size y is increased by 1, and the sample size x is increased by 1 or 0 depending on whether the patient shows symptom A. With more and more patient cases processed, the probability approaches the correct value. The improvement of this method is the focus of this chapter.

The limitation of the updating method for the ratio representation is the underlying assumption that, when updating p(symptomA | diseaseB), whether disease B is true or false is known with certainty (thus the method will be referred to as {0, 1} distribution learning). The assumption is not realistic in many applications. A doctor would not always be 100% confident about a diagnosis he or she made of a patient. This argument is expanded in the section below.

6.2 Learning from Posterior Distributions

The spirit of {0, 1} distribution learning is to improve the precision of probabilities elicited from the human expert by learning from available data. What else does one really have in medical practice in addition to patients' symptoms?
It may be possible, in some medical domains, that diagnoses can be confirmed with certainty. But this is not commonplace. A successful treatment is not always an indication of a correct diagnosis. A disease can be cured by a patient's internal immunity or by a drug with a wide disease spectrum. One subtlety of medical diagnosis comes from the unconfirmability of each individual patient case.

For most medical domains, the available data besides patients' symptoms are the physician's subjective posterior probabilities (PPs) of disease hypotheses given the overall pattern of the patient's symptoms. They are not distributions with values from {0, 1}, but rather distributions from [0, 1]¹. The diagnoses appearing in patients' files are typically not diagnoses that have been concluded definitely; they are only the top ranking diseases with the physician's subjective PP omitted. The assumption of a {0, 1} posterior disease distribution may, naively, be interpreted as an approximation to the [0, 1] distribution with 1 substituting for the top ranking PP, and 0 substituting for the rest. This approximation loses useful information. Thus a way of learning directly from the [0, 1] posterior distribution seems more natural and can be expected to perform better.

¹ Note that {0, 1} denotes a set containing only elements 0 and 1, and [0, 1] is a domain of real numbers between 0 and 1 inclusive.

In dealing with the learning problem in a Bayesian network setting, three 'agents' are concerned: the real world (D_r, P_r), the human expert (D_e, P_e), and the artificial system (D_s, P_s). It is assumed that all 3 can be modeled by Bayesian networks. As the building of an expert system involves specifying both the topology of the DAG D and the probability distribution P, the improvement can also be separated into the two aspects.
For the purpose of this chapter, D_r, D_e, and D_s are assumed identical, leaving to be improved only the accuracy of the quantitative assignment of P_s.

As reviewed in section 1.3.3, a medical expert system based on Bayesian nets usually directs its arcs from disease nodes to symptom nodes, thus encoding quantitative knowledge by priors of diseases and conditional probabilities (CPs) of symptoms given diseases. This results in ease of DAG construction, simplicity of net topology, and portability of the system. Given that a Bayesian net is constructed with such directionality, and the desire to improve the accuracy of CPs by PPs, a question which arises is whether PPs are any better in quality than CPs also supplied by the human expert. To avoid possible confusion, it is emphasized that the CP here is the conditional probability of a symptom variable given its disease parent variables, and the PP is the joint probability of a set of (relevant) disease variables given the outcomes of a set of symptom variables. The former is stored explicitly in Bayesian networks as a parameter, while the latter would not be an explicit parameter even if the arcs in the network were directed from symptoms to diseases.

In my cooperation with medical staff, it was found that the causal network is a natural model for viewing the domain; however, the task of estimating CPs is more artificial than natural to them. Forming posterior judgments is their daily practice. An expert is an expert in that he/she is skilled at making diagnoses (posterior judgments), not necessarily skilled at estimating CPs. It is the expert's posterior judgment that is the behavior one wants the expert system to simulate.

An excellent argument supporting the idea of using PPs to improve CPs has been published by Neapolitan [1990]:

    For example, sometimes it is easier for a person to ascertain the probability of a cause given an effect than that of an effect given a cause.
    Consider the following situation: If a physician had worked at the same clinic for a number of years, he would have seen a large population of similar people with certain symptoms. Since his job is to reason from the symptoms to determine the likelihood of diseases, through the years he may have become adept at judging the probabilities of diseases given symptoms for the population of people who attend this clinic. On the other hand, a physician does not have the task of looking at a person with a known disease and judging whether a symptom is present. Hence it is not likely that the physician would have acquired the ability from his experience to judge the probabilities of symptoms given diseases. He does ordinarily learn something about this probability in medical school. However, if a particular disease were rare in the population, whereas a particular symptom of the disease were common, and the physician had not studied the disease in medical school, he would certainly not be able to determine the probability of the symptom given the disease. On the other hand, he could readily determine the probability of the disease given the symptom for this particular population. We see then that in this case it is easier to ascertain the probabilities of causes given effects. Notice that these conditional probabilities are only valid for the population of people who attend that particular clinic, and thus we would not want to incorporate them into an expert system. We could, however, use the definition of conditional probability to determine the probabilities of symptoms given diseases from these conditional probabilities obtained from the physician. These latter probabilities could then be used in an expert system which would be applicable at any clinic.

The task imposed on the expert by the method presented here is actually more natural than that argued by Neapolitan.
Experts will be asked for PPs given a set of symptoms (not just one, as will be seen in a moment), and thus the task will be close to their daily practice.

Furthermore, Kuipers and Kassirer [1984] have been cited by Shachter and Heckerman [1987] to show that "experts are more comfortable when their beliefs are elicited in the causal direction"; Tversky and Kahneman [1980] have been cited by Pearl [1988] to show that "people often prefer to encode experiential knowledge in causal schemata, and as a consequence, rules expressed in causal forms are assessed more reliably". Neither Shachter and Heckerman nor Pearl made an explicit distinction between 'qualitative' and 'quantitative' knowledge. Careful examination of the studies by Kuipers and Kassirer and by Tversky and Kahneman reveals the following two points. Both experts and ordinary people prefer to encode their qualitative knowledge in terms of causal schemata. People often give higher probabilities in the causal direction than those dictated by probability theory. These studies support the reliability of causal elicitation of qualitative knowledge. They do not support the reliability of causal elicitation of quantitative (probabilistic) knowledge. The reliability issue is left open.

Finally, Spiegelhalter et al. [1989] have reported that the assessment of CPs by human experts is generally reliable, but has a tendency to be too extreme (too close to 0 or 1).

Figure 6.35: An example of D(1)

If one could assume that the human expert carries a mental Bayesian network and that PPs are produced by the network, it is postulated that the CPs the expert articulates, which constitute P_s of the system, could be a distorted version of those in P_e. Also, P_e may differ from P_r in general. Thus, 4 categories of probability distributions are distinguished: P_r, P_e, P_s, and the PPs produced by P_e (written as p_e). It is assumed that one has access only to P_s and p_e(hypotheses | evidence).
The goal is to use the latter to improveP3 such that the system’s behavior will approach that of expert.How can PP be utilized in the updating? The basic idea is: instead of updatingimaginary sample sizes by 1 or 0, increase them by the measure of certainty of thecorresponding diseases. The expert’s PP is just such a measure. Formal treatment isgiven below.6.3 The Algorithm for Learning by Posterior Probability (ALPP)The following notation is used:D(1) DAGs of diameter 1 (The diameter is the length of the longest directed path in theDAG. An example of D(1) is given in Figure 6.35.);(D(1), F) Bayesian net with diameter 1 and underlying distribution F;Chapter 6. ALGORITHM FOR LEARNING BY POSTERIOR PROBABILITIES 182B,€B1 = {bi,.. . , b1,.} the ith parent variable in D(1) with sample space B1;A3€A3 = {a31,.. . , ai,} the jth child variable in D(l) with sample space A3;A e ‘I’(A) the set of all child variables in D(l) with spacea31 element of ‘I’(A) with A3 instantiated asYk1k2... the imaginary sample size for joint event bIk12k . being true;xl3k12...k the imaginary sample size for joint event alblk1 . .. being true;6, impulse function which equals 1 if for the cth fresh case As’s outcome is ai,, andequals 0 otherwise (superscripts denote the orders of fresh cases);PrO, PeO, Ps() probabilities contained or generated by (Dr(1), Fr), (De(1), Fe) and (D31),F3) respectively.A Bayesian net (D(1), F) 2 is considered where the underlying distribution is composed viaN Mp(Bi . .. BNA1 . . . AM) II p(B1) II p(AI7r)i=1 j=1where 7r is the set of parents of A.Each of the CPs is internally represented in the system as a ratio of 2 imaginary samplesizes. For child node A1 having its parent nodes B1 ... BQ (Q 1), a corresponding CPisp(all1Iblk . . .bQkQ) = Xkikq/Ykikqwhere the superscript c signifies the cth updating. Only the real numbers Xki kq andare stored. The prior probabilities for A1’s parents can be derived as2Whether it is a subnet or a net by itself is irrelevant.Chapter 6. 

$$p^c(b_{1k_1} \ldots b_{Qk_Q}) = \frac{y^c_{k_1 \ldots k_Q}}{\sum_{k_1, \ldots, k_Q} y^c_{k_1 \ldots k_Q}}$$

For a $(D(1), P)$ with $M$ children and with all variables binary, the number of values to be stored in this way is upper bounded by $2 \sum_{i=1}^{M} 2^{d_i}$, where $d_i$ is the in-degree of child node $i$. Storage saving can be achieved when different child nodes share a common set of parents.

Updating $P$ is done one child node at a time through updating the $x$s and $y$s associated with the node as illustrated above. Once the $x$s and $y$s are updated, the updated CPs and priors can be derived. The order in which child nodes are selected for updating is irrelevant.

Without loss of generality, we describe the updating with respect to the above-mentioned child node $A_1$. For the $c$th fresh case, where $\mathbf{a}^c$ is the set of symptoms observed, the expert provides the PP distribution $p_e(b_{1k_1} \ldots b_{Nk_N} \mid \mathbf{a}^c)$. This is transformed into

$$p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^c) = \sum_{b_{Q+1}, \ldots, b_N} p_e(b_{1k_1} \ldots b_{Nk_N} \mid \mathbf{a}^c)$$

The sample sizes are updated by

$$x^c_{l_1 k_1 \ldots k_Q} = x^{c-1}_{l_1 k_1 \ldots k_Q} + \delta^c_{l_1} \, p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^c)$$

$$y^c_{k_1 \ldots k_Q} = y^{c-1}_{k_1 \ldots k_Q} + p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^c)$$

6.4 Convergence of the Algorithm

An expert is called perfect if $(D_e(1), P_e)$ is identical to $(D_r(1), P_r)$.

Theorem 13 Let a Bayesian network $(D_s(1), P_s)$ be supported by a perfect expert equipped with $(D_e(1), P_e)$. No matter what initial state $P_s$ is in, it will converge to $P_e$ by ALPP.

Proof:
Without loss of generality, consider the updating with respect to $A_1$ in section 6.3.

(1) Priors. Let $\{\mathbf{a}(1), \mathbf{a}(2), \ldots\}$ be the set of all possible conjuncts of evidence. Let $u(t)$ be the number of times at which event $\mathbf{a}(t)$ is true in $c$ cases, so that $\sum_t u(t) = c$. From the prior updating formula of ALPP,

$$\begin{aligned}
\lim_{c \to \infty} p_s^c(b_{1k_1} \ldots b_{Qk_Q})
&= \lim_{c \to \infty} \frac{y^0_{k_1 \ldots k_Q} + \sum_{\omega=1}^{c} p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)}{c + \sum_{k_1, \ldots, k_Q} y^0_{k_1 \ldots k_Q}} \\
&= \lim_{c \to \infty} \frac{1}{c} \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, u(t) \\
&= \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, p_r(\mathbf{a}(t)) \\
&= \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, p_e(\mathbf{a}(t)) \quad \text{(perfect expert)} \\
&= p_e(b_{1k_1} \ldots b_{Qk_Q})
\end{aligned}$$

and the prior of a single parent follows by marginalization:

$$p_e(b_{ik_i}) = \sum_{k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_Q} p_e(b_{1k_1} \ldots b_{Qk_Q})$$

(2) CPs.
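Let $u_{1l_1}(t)$ be the number of times at which event $\mathbf{a}_{1l_1}(t)$ is true in $c$ cases. Following ALPP, one has

$$\begin{aligned}
\lim_{c \to \infty} p_s^c(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q})
&= \lim_{c \to \infty} \frac{x^0_{l_1 k_1 \ldots k_Q} + \sum_{\omega=1}^{c} \delta^\omega_{l_1} \, p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)}{y^0_{k_1 \ldots k_Q} + \sum_{\omega=1}^{c} p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)} \\
&= \lim_{c \to \infty} \frac{\sum_{\omega=1}^{c} \delta^\omega_{l_1} \, p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)}{\sum_{\omega=1}^{c} p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)} \\
&= \lim_{c \to \infty} \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, u_{1l_1}(t)}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, u(z)} \\
&= \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, p_r(\mathbf{a}_{1l_1}(t))}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, p_r(\mathbf{a}(z))} \\
&= \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, p_e(\mathbf{a}_{1l_1}(t))}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, p_e(\mathbf{a}(z))} \quad \text{(perfect expert)} \\
&= \frac{p_e(b_{1k_1} \ldots b_{Qk_Q} \, a_{1l_1})}{p_e(b_{1k_1} \ldots b_{Qk_Q})} = p_e(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q})
\end{aligned}$$

□

A perfect expert is never available. One needs to know the behavior of ALPP when supported by an imperfect expert. This leads to the following theorem.

Theorem 14 Let $p^c$ be any resultant probability in $(D_s(1), P_s)$ after $c$ updatings by ALPP. $p^c$ converges to a continuous function of $P_e$. [3]

[3] By "$f$ is a function of $P$", it is meant that $f$ takes the probability variables in $P$ as its independent variables, which in turn have [0,1] as their domain.

Proof:
(1) Continuity of priors.
Following the proof of theorem 13, the prior $p_s^c(b_{1k_1} \ldots b_{Qk_Q})$ converges to

$$f = \lim_{c \to \infty} \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, \frac{u(t)}{c}$$

where $p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t))$ is an elementary function of $P_e$, and so is $f$. Therefore, $p_s^c(b_{1k_1} \ldots b_{Qk_Q})$ converges to a continuous function of $P_e$.

(2) Continuity of CP.
From theorem 13, $p_s^c(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q})$ converges to

$$\lim_{c \to \infty} \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, u_{1l_1}(t)}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, u(z)}$$

where $p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z))$ is an elementary function of $P_e$. □

Theorem 14, together with Theorem 13, says that when the discrepancy between $P_e$ and $P_r$ is small, the discrepancy between $P_s$ and $P_r$ (and $P_e$ as well) will be small after enough learning trials. The specific form of the discrepancy is left open.

In many applications the absolute values of the PPs are not really important, but the posterior ordering of diseases is. A set of PPs defines such a posterior ordering. Claim a 100% behavior match between $(D, P_1)$ and $(D, P_2)$ if for any possible set of symptoms the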
two give the same ordering. The minimum difference between successive PPs of $(D, P_1)$ defines a threshold. Unless the maximum difference between corresponding PPs from the two $(D, P)$s exceeds the threshold, a 100% behavior match is guaranteed. Thus as long as the discrepancy between $P_e$ and $P_r$ is within some $(D_r(1), P_r)$ dependent threshold, a 100% match between the behavior of $P_s$ and that of $P_e$ is anticipated.

6.5 Simulation Results

Several simulations have been run using the example in Figure 6.36. It is a revised version of the smoke-alarm example in Poole and Neufeld [1988]. Here heat alarm, smoke alarm and report are used as evidence to estimate the likelihood of the joint event tampering and fire. Each variable, denoted by an uppercase letter, takes binary values. For example, $F$ has value $f$ or $\bar{f}$, which signifies the event fire being true or false.

[Figure 6.36: Fire alarm example for simulation. Nodes: tampering (T), fire (F), heat alarm (H), smoke alarm (S), report (R).]

The simulation set-up is illustrated in Figure 6.37. Logical sampling [Henrion 88] is used in the real world model $(D_r(1), P_r)$ to generate scenarios $\{T_r, F_r, H_r, S_r, R_r\}$. The observed evidence $\{H_r, S_r, R_r\}$ is fed into $(D_e(1), P_e)$. The posterior distribution $p_e(TF \mid H_r S_r R_r)$ is computed by the expert model and is forwarded to update the system model $(D_s(1), P_s)$.

[Figure 6.37: Simulation set-up]

To compare the performance of ALPP and {0, 1} distribution learning, a control model $(D_c(1), P_c)$ is constructed in the set-up. It has the same DAG structure and initial probability distribution as $(D_s(1), P_s)$ but is updated by {0, 1} distribution learning. [4]

Two different sets of diagnoses are utilized in different simulation runs by $(D_c(1), P_c)$ for the purpose of comparison. In simulations 1, 2 and 3 to be described below, the top diagnosis $\{T_e, F_e\}$ made by $(D_e(1), P_e)$ is used.
In simulation 4, the scenario $\{T_r, F_r\}$ is used. The former simulates the situation where posterior judgments could not be fully justified. The latter simulates the case where such justification is indeed available.

For all the simulations $P_r$ is the following distribution.

$p(h \mid f t) = 0.50$                  $p(s \mid f t) = 0.60$
$p(h \mid f \bar{t}) = 0.90$            $p(s \mid f \bar{t}) = 0.92$
$p(h \mid \bar{f} t) = 0.85$            $p(s \mid \bar{f} t) = 0.75$
$p(h \mid \bar{f} \bar{t}) = 0.11$      $p(s \mid \bar{f} \bar{t}) = 0.09$
$p(r \mid f) = 0.70$                    $p(f) = 0.25$
$p(r \mid \bar{f}) = 0.06$              $p(t) = 0.20$

$P_s$ and $P_c$ are distributions with a maximal error of 0.3 relative to $P_r$. The initial imaginary sample size for each joint event $FT$ is set to 1. This setting is mainly for the purpose of demonstrating the convergence of ALPP under poor initial conditions. In practical applications the distribution error should generally be smaller and the initial sample sizes much larger, so that convergence will be a slowly evolving process.

[4] The original form of {0, 1} distribution learning is extended to the form applicable to $D(1)$.

             (D_e(1), P_e)   (D_s(1), P_s)                    (D_c(1), P_c)
trial No.    diag. rate      behv. mat. rate   max. err. S-E  behv. mat. rate   max. err. C-E
0                                              0.30                             0.30
1-25         68%             60%               0.14           48%               0.21
26-50        76%             96%               0.10           12%               0.25
51-100       80%             100%              0.06           36%               0.27
101-200      76%             100%              0.03           33%               0.28

Table 6.12: Summary for simulation 1. $|P_e - P_r| = 0$.

Simulation 1 is run with $P_e$ being the same as $P_r$, which assumes a perfect expert. The results are depicted in Table 6.12. The diagnostic rate of $(D_e(1), P_e)$ is defined as $A/N$, where $N$ is the base number of trials and $A$ is the number of trials where the top diagnosis agrees with the $\{T_r, F_r\}$ simulated by $(D_r(1), P_r)$. The behavior matching rate of $(D_s(1), P_s)$ relative to $(D_e(1), P_e)$ is defined as $B/N$, where $B$ is the number of trials in which $(D_s(1), P_s)$'s diagnoses have the same ordering as those of $(D_e(1), P_e)$. The behavior matching rate of $(D_c(1), P_c)$ relative to $(D_e(1), P_e)$ is similarly defined.

The results show convergence of the probability values in $P_s$ to those in $P_e$ (maximum error (S-E) $\to 0$).
The behavior matching rate of $(D_s(1), P_s)$ increases along with the convergence of probabilities, and finally $(D_s(1), P_s)$ behaves exactly the same as $(D_e(1), P_e)$.

An interesting phenomenon is that, despite $P_e = P_r$, the diagnostic rate of $(D_e(1), P_e)$ is only 76% over the total 200 trials. Though the rate is dependent on the particular $(D, P)$, it is expected to be less than 100% in general. In terms of medical diagnosis, this is because some diseases may manifest through unlikely symptoms, making other diseases appear more likely. In an uncertain world with limited evidence, mistakes in diagnoses are unavoidable. More importantly, $P_s$ converges to $P_e$ under the guidance of these 76% correct diagnoses while $P_c$ does not. The maximum error of $P_c$ remains about the same throughout the 200 trials and the behavior matching rate of $(D_c(1), P_c)$ is low. Similar performance of $(D_c(1), P_c)$ is seen in the next two simulations. This shows that under circumstances where good experts are available but confirmations of diagnoses are not, ALPP is robust while {0,1} distribution learning will be misled by the errors in diagnoses. This is not surprising since the assumption underlying {0,1} distribution learning is violated. One will gain more insight into this from the results of simulation 4 below.

An imperfect expert is assumed in simulation 2 (Table 6.13). The distribution $P_e$ differs from $P_r$ by up to 0.05. Because of this error, $P_s$ converges to neither $P_e$ (as shown in Table 6.13) nor $P_r$. But the error between $P_s$ and $P_e$ approaches a small value (about 0.07) such that after 200 trials the behavior of $P_s$ matches that of $P_e$ perfectly.

             (D_e(1), P_e)   (D_s(1), P_s)                    (D_c(1), P_c)
trial No.    diag. rate      behv. mat. rate   max. err. S-E  behv. mat. rate   max. err. C-E
0                                              .300                             .300
1-100        84%             82%               .058           32%               .272
101-200      86%             92%               .122           43%               .287
201-300      80%             100%              .067           32%               .290
301-400      83%             100%              .076           36%               .292

Table 6.13: Summary of simulation 2. $|P_e - P_r| = 0.05$.

If the discrepancy between $P_e$ and $P_r$
is further increased so that the threshold discussed in the last section is crossed, $(D_s(1), P_s)$ will no longer converge to $(D_e(1), P_e)$. This is the case in simulation 3 (Table 6.14), where the maximum error and root mean square (rms) error between $P_e$ and $P_r$ are 0.15 and 0.098 respectively. The rms error is calculated over all the priors and conditional probabilities of $P_e$ and $P_r$. The rms error is introduced for the interpretation of simulation 3 because the maximum error by itself, when not approaching 0, does not give a good indication of the distance between the two.

             (D_e(1), P_e)   (D_s(1), P_s)
trial No.    diag. rate      diag. rate   behv. mat. rate   rms err. S-E   rms err. S-R   max. err. S-E
0                                                           .170           .169           .39
1-25         80%             84%          20%               .086           .079           .17
26-75        74%             74%          40%               .071           .068           .11
76-175       73%             73%          53%               .050           .083           .087
176-375      79%             79%          46%               .059           .072           .095
376-475      78%             78%          43%               .061           .071           .119

             (D_e(1), P_e)   (D_c(1), P_c)
trial No.    diag. rate      diag. rate   behv. mat. rate   rms err. C-E   rms err. C-R   max. err. C-E
0                                                           .170           .169           .39
1-25         80%             80%          32%               .110           .091           .20
26-75        74%             74%          38%               .110           .092           .16
76-175       73%             73%          38%               .098           .092           .15
176-375      79%             79%          23%               .100           .090           .15
376-475      78%             78%          26%               .096           .084           .15

Table 6.14: Summary of simulation 3. $|P_e - P_r| = 0.15$.

The simulation shows that the behavior matching rate between $P_s$ and $P_e$ is quite low (43% after 475 trials). Since the diagnostic rate of $P_e$ is also lower (77%), one could ask which one is better. One way of viewing this is to compare the diagnostic rates. It is observed that, among $P_s$, $P_c$ and $P_e$, none is superior to the others if only the top diagnosis is concerned. A more careful examination can be made by comparing the distances among models. It turns out that the distance (S-E) and the distance (S-R) are smaller than the distance (E-R), with corresponding rms errors 0.061, 0.071 and 0.098 respectively.

The above 3 simulations assume that only the subjective posterior judgments are available. In simulation 4, it is assumed that the correct diagnosis is also accessible.
This time, $(D_c(1), P_c)$ is supplied with the scenario generated by $(D_r(1), P_r)$. $P_e$ is the same as $P_r$.

The results (Table 6.15) show that ALPP converges much more quickly than {0,1} distribution learning even though the latter has access to the 'true' answers to the diagnostic problem. After 1500 trials, $(D_s(1), P_s)$ reduces its maximum error from $(D_e(1), P_e)$ to 0.041 and matches the latter's behavior perfectly, while $(D_c(1), P_c)$ is still on its way to convergence, with its error about 2 times larger and its behavior matching rate at 80%.

             (D_e(1), P_e)   (D_s(1), P_s)                    (D_c(1), P_c)
trial No.    diag. rate      behv. mat. rate   max. err. S-E  behv. mat. rate   max. err. C-E
0                                              .300                             .300
1-100        88%             95%               .130           60%               .375
101-600      78%             98%               .048           72%               .045
601-1100     78%             93%               .052           61%               .075
1101-1500    79%             100%              .041           80%               .079
1501-1700    81%             100%              .025           85%               .093

Table 6.15: Summary of simulation 4. $|P_e - P_r| = 0$ and the 'true' scenario is accessible to {0, 1} distribution learning ($P_c$).

Real world scenarios could be distinguished as being common or exceptional. An expert with knowledge about the real world tends to catch the common and to ignore the exceptional. Thus the diagnostic rate will never be 100%. This is the best one could do given the limited evidence. The PPs provided by the expert contain information about the entire domain, while a scenario contains only information about one particular scene. Thus, although both $(D_s(1), P_s)$ and $(D_c(1), P_c)$ converge, the former converges more quickly. This difference in convergence speed is expected to emerge wherever the diagnosis is difficult and the diagnostic rate of the expert is low (for example, in an area where the disease mechanism is not well understood and diagnostic criteria are not well established).
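
The ALPP update of Section 6.3 and the simulation loop described in this section can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the implementation used for the reported simulations: it assumes a perfect expert ($P_e = P_r$), updates only the child node H with parents (F, T), starts every conditional probability at 0.5 with imaginary sample size 1, and all function names (`expert_posterior`, `sample_case`, `run_alpp`) are hypothetical.

```python
import itertools
import random

# Real-world model P_r as listed in Section 6.5 (1 = true, 0 = false).
P_F, P_T = 0.25, 0.20
P_H = {(1, 1): 0.50, (1, 0): 0.90, (0, 1): 0.85, (0, 0): 0.11}  # p(h | f, t)
P_S = {(1, 1): 0.60, (1, 0): 0.92, (0, 1): 0.75, (0, 0): 0.09}  # p(s | f, t)
P_R = {1: 0.70, 0: 0.06}                                        # p(r | f)

PARENT_STATES = list(itertools.product((0, 1), repeat=2))       # (f, t) pairs


def bern(p, value):
    """Probability of a binary outcome `value` when p is the chance of 1."""
    return p if value else 1.0 - p


def joint(f, t, h, s, r):
    """Joint probability: product of parent priors and child CPs."""
    return (bern(P_F, f) * bern(P_T, t) * bern(P_H[f, t], h)
            * bern(P_S[f, t], s) * bern(P_R[f], r))


def expert_posterior(h, s, r):
    """p_e(f, t | h, s, r) by enumeration (assumption: expert model = P_r)."""
    weights = {ft: joint(*ft, h, s, r) for ft in PARENT_STATES}
    total = sum(weights.values())
    return {ft: w / total for ft, w in weights.items()}


def sample_case(rng):
    """Draw one scenario by sampling parents first, then children."""
    f = int(rng.random() < P_F)
    t = int(rng.random() < P_T)
    h = int(rng.random() < P_H[f, t])
    s = int(rng.random() < P_S[f, t])
    r = int(rng.random() < P_R[f])
    return f, t, h, s, r


def run_alpp(n_cases, seed=0):
    """Run ALPP for child H; return learned p(h | f, t) and prior p(f, t)."""
    rng = random.Random(seed)
    y = {ft: 1.0 for ft in PARENT_STATES}   # imaginary sample sizes y_k
    x = {ft: 0.5 for ft in PARENT_STATES}   # x_k, so x/y starts at 0.5
    for _ in range(n_cases):
        _, _, h, s, r = sample_case(rng)    # only the evidence is observed
        pp = expert_posterior(h, s, r)      # expert's posterior judgment
        for ft in PARENT_STATES:
            x[ft] += h * pp[ft]             # delta = 1 iff H came out true
            y[ft] += pp[ft]
    cp = {ft: x[ft] / y[ft] for ft in PARENT_STATES}
    total = sum(y.values())
    prior = {ft: y[ft] / total for ft in PARENT_STATES}
    return cp, prior


if __name__ == "__main__":
    cp, prior = run_alpp(50000, seed=1)
    for ft in PARENT_STATES:
        print(ft, round(cp[ft], 3), "target", P_H[ft])
```

Replacing `pp[ft]` by an indicator of a single diagnosed parent state would give the {0, 1} style of update used for the control model; the fractional increment is what lets every case contribute to all parent configurations.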

6.6 Remarks

Several features of ALPP can be appreciated through the theoretical analysis and simulation results presented.

• ALPP provides an alternative way of sequentially updating conditional probabilities in Bayesian nets when confirmed diagnosis is not available.

• Under ideal conditions (a perfect expert for ALPP and confirmed diagnosis for {0,1} distribution learning), both ALPP and {0,1} distribution learning converge to the real world model. However, ALPP converges faster than {0,1} distribution learning due to the richer information contained in the expert's posterior judgments.

• When the human expert's judgment is the only available source and the expert is not perfect but fairly good (his mental model is different from but close to the real world model), ALPP still converges to the expert's posterior behavior and improves the system model towards the real world model up to a small error. On the other hand, {0,1} distribution learning will be misled by unavoidable mistakes made in the expert's diagnoses due to the violation of its underlying assumption. Consequently, it can only simulate the expert's behavior up to the top diagnosis, while both the posterior ordering and the model parameters (probability values) are far off.

• As is argued at the beginning of this chapter, the expert's diagnoses are indeed the only available source in many applications. Thus to use {0,1} distribution learning in these domains one must simplify the expert's posterior distribution to a {0,1} distribution. On the other hand, to use ALPP one has to obtain from the expert the overall real-valued distribution, which may not be practical. A proper compromise might be to ask the expert to provide a few top posterior probabilities and assign the remaining value uniformly to the other probabilities.

• ALPP is directly applicable to Bayesian networks of diameter 1, as {0,1} distribution learning would be.
Although the topology is not feasible for many applications, there are domains where the topology is adequate, for example, the Bayesian net version of QMR (its precursor is INTERNIST, one of the first expert systems in internal medicine) [Heckerman 90a, Heckerman 90b] and PAINULIM, among others.

Chapter 7

SUMMARY

The thesis research aims to construct medical expert systems which will be of practical use. The attitude adopted is to identify practical problems in the engineering practice and to solve the problems scientifically (as opposed to ad hoc approaches). The accomplishments of this research include an engineering part and a scientific part, which are closely related.

Contributions to theoretical knowledge

• The limitation of finite totally ordered probability algebras has been shown theoretically. This highlights the use of infinite totally ordered probability algebras including probability theory.

• The technique of multiply sectioned Bayesian networks (MSBN) and junction forests has been developed. This technique allows the exploitation of localization naturally existing in a large domain such that the construction and deployment of large expert systems using Bayesian networks can be practical.

• An algorithm for learning by posterior probabilities (ALPP) has been developed. This algorithm is useful for sequentially updating conditional probabilities in Bayesian networks in order to improve the accuracy of (quantitative) knowledge representation. The algorithm removes the assumption that diagnoses must be confirmed.

Accomplishments in engineering

• WEBWEAVR, an expert system shell which embeds the MSBN technique, has been implemented on IBM compatibles.

• PAINULIM, an expert neuromuscular diagnostic system for patients with a painful or impaired upper limb, has been constructed. The utilization of the MSBN technique has made it possible for the knowledge acquisition and system refinement of PAINULIM to be conducted within the hospital environment.
This has resulted in more efficient cooperation with medical experts and greatly sped up the development of PAINULIM.

• QUALICON, a coupled expert system for technical quality control of nerve conduction studies, has been constructed.

Limitations of the work

• The MSBN technique has only been used in a single domain. Only the sectioning with a covering sect is used in the application. The applicability of the technique to many other domains requires future experience.

• The performance of ALPP has been shown through simulation but it has not been evaluated through real application.

Topics for future research

• Test design in a multiply sectioned Bayesian network. System prompting for the acquisition of evidence most beneficial to a diagnosis, given the current state of belief, has been a topic of expert system research. In chapter 4, the attention shift is described as originated by the user. In the context of an MSBN, a system prompt for the acquisition of evidence from a neighbor sect can be another reason for an attention shift. How to generate the prompt for the target sect of an attention shift is an open question. The test design within a sect also deserves further exploration.

• Explanation of inference in a multiply sectioned Bayesian network. Qualitative and intuitive explanation of inference is important for the acceptance of expert systems. The MSBN technique allows the representation of categorical and hierarchical knowledge. How to frame an explanation in terms of evidence distributed in different categories or hierarchies requires future research.

• Only the marginal distribution is obtained in the MSBN technique (and in the junction tree technique as well). Is there a way to obtain the most likely combination of hypotheses?

• The representation incorporating decision analysis with Bayesian nets is usually termed an influence diagram. Incorporating decision analysis with an MSBN (e.g., for treatment recommendation) introduces new problems and possibilities.

Bibliography

[Aleliunas 86] R.
Aleliunas, “Models of reasoning based on formal deductive probability theories”, Draft unpublished, 1986.
[Aleliunas 87] R. Aleliunas, “Mathematical models of reasoning - competence models of reasoning about propositions in English and their relationship to the concept of probability”, Research Report CS-87-S1, Univ. of Waterloo, 1987.
[Aleliunas 88] R. Aleliunas, “A new normative theory of probabilistic logic”, Proc. CSCSI-88, 67-74, 1988.
[Andersen et al. 89] S.K. Andersen, K.G. Olesen, F.V. Jensen and F. Jensen, “HUGIN - a shell for building Bayesian belief universes for expert systems”, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, Michigan, Vol. 2, 1080-1085, 1989.
[Andreassen, Andersen and Woldbye 86] S. Andreassen, S.K. Andersen and M. Woldbye, “An expert system for electromyography”, Proc. of Novel Approaches to the Study of Motor Systems, Banff, Alberta, Canada, 2-3, 1986.
[Andreassen et al. 89] S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjerulff, M. Woldbye, A.R. Sorensen, A. Rosenfalck and F. Jensen, “MUNIN - an expert EMG assistant”, J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 255-277, 1989.
[Baker and Boult 90] M. Baker and T.E. Boult, “Pruning Bayesian networks for efficient computation”, Proc. of the Sixth Conference on Uncertainty in Artificial Intelligence, Cambridge, Mass., 257-264, 1990.
[Blinowska and Verroust 87] A. Blinowska and J. Verroust, “Building an expert system in electrodiagnosis of neuromuscular diseases”, Electroenceph. Clin. Neurophysiol., 66: 510, 1987.
[Buchanan and Shortliffe 84] B.G. Buchanan and E.H. Shortliffe, (Edt), Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Addison-Wesley, 1984.
[Burris and Sankappanavar 81] S. Burris and H.P. Sankappanavar, A Course in Universal Algebra, Springer-Verlag, 1981.
[Cheeseman 86] P. Cheeseman, “Probabilistic versus fuzzy reasoning”, L.N. Kanal and J.F.
Lemmer, (Edt), Uncertainty in Artificial Intelligence, Elsevier Science, 85-102, 1986.
[Cheeseman 88a] P. Cheeseman, “An inquiry into computer understanding”, Computational Intelligence, 4: 58-66, 1988.
[Cheeseman 88b] P. Cheeseman, “In defense of An inquiry into computer understanding”, Computational Intelligence, 4: 129-142, 1988.
[Cooper 84] G.F. Cooper, NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge, Ph.D. diss., Stanford University, 1984.
[Cooper 90] G.F. Cooper, “The computational complexity of probabilistic inference using Bayesian belief networks”, Artificial Intelligence, 42: 393-405, 1990.
[Desmedt 89] J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 1989.
[Duda et al. 76] R.O. Duda, P.E. Hart and N.J. Nilsson, “Subjective Bayesian methods for rule-based inference systems”, Technical Report 124, Stanford Research Institute, Menlo Park, California, 1976.
[Fertig and Breese 90] K.W. Fertig and J.S. Breese, “Interval Influence Diagrams”, M. Henrion et al., (Edt), Uncertainty in Artificial Intelligence 5, 149-161, 1990.
[First et al. 82] M.B. First, B.J. Weimer, S. McLinden and R.A. Miller, “LOCALIZE: Computer assisted localization of peripheral nervous system lesions”, Comput. Biomed. Res., 15: 525-543, 1982.
[Fuglsang-Frederiksen and Jeppesen 89] A. Fuglsang-Frederiksen and S.M. Jeppesen, “A rule-based EMG expert system for diagnosing neuromuscular disorders”, J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 289-296, 1989.
[Gallardo et al. 87] R. Gallardo, M. Gallardo, A. Nodarse, S. Luis, R. Estrada, L. Garcia and O. Padron, “Artificial intelligence in EMG diagnosis of cervical roots and brachial plexus lesions”, Electroenceph. Clin. Neurophysiol., 66: S37, 1987.
[Geiger, Verma and Pearl 90] D. Geiger, T. Verma and J. Pearl, “d-separation: from theorems to algorithms”, in Uncertainty in Artificial Intelligence 5, edited by M. Henrion, R.D. Shachter, L.N.
Kanal and J.F. Lemmer, Elsevier Science Publishers, 139-148, 1990.
[Gibbons 85] A. Gibbons, Algorithmic Graph Theory, Cambridge University Press, 1985.
[Golumbic 80] M.C. Golumbic, Algorithmic graph theory and perfect graphs, Academic Press, NY, 1980.
[Gotman and Gloor 76] J. Gotman and P. Gloor, “Automatic recognition and quantification of interictal epileptic activity in the human scalp EEG”, Electroenceph. Clin. Neurophysiol., 41: 513-529, 1976.
[Goodgold and Eberstein 83] J. Goodgold and A. Eberstein, Electrodiagnosis of Neuromuscular Diseases, Williams and Wilkins, 1983.
[Halpern and Rabin 87] J.Y. Halpern and M.O. Rabin, “A logic to reason about likelihood”, Artificial Intelligence, 32: 379-405, 1987.
[Heckerman 86] D. Heckerman, “Probabilistic interpretations for MYCIN’s certainty factors”, L.N. Kanal and J.F. Lemmer, (Edt), Uncertainty in Artificial Intelligence, Elsevier Science, 167-196, 1986.
[Heckerman and Horvitz 87] D.E. Heckerman and E.J. Horvitz, “On the expressiveness of rule-based systems for reasoning with uncertainty”, Proc. AAAI, Seattle, Washington, 1987.
[Heckerman, Horvitz and Nathwani 89] D.E. Heckerman, E.J. Horvitz and B.N. Nathwani, “Update on the Pathfinder project”, 13th Symposium on computer applications in medical care, Baltimore, Maryland, U.S.A., 1989.
[Heckerman 90a] D. Heckerman, “A tractable inference algorithm for diagnosing multiple diseases”, M. Henrion, R.D. Shachter, L.N. Kanal and J.F. Lemmer, (Edt), Uncertainty in Artificial Intelligence 5, Elsevier Science Publishers, 163-171, 1990.
[Heckerman 90b] D. Heckerman, Probabilistic Similarity Networks, Ph.D. Thesis, Stanford University, 1990.
[Henrion 88] M. Henrion, “Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, J.F. Lemmer and L.N. Kanal, (Edt), Uncertainty in Artificial Intelligence 2, Elsevier Science Publishers, 149-163, 1988.
[Henrion 89] M. Henrion, “Some practical issues in constructing belief networks”, L.N. Kanal, T.S. Levitt and J.F.
Lemmer, (Edt), Uncertainty in Artificial Intelligence 3, Elsevier Science Publishers, 161-173, 1989.
[Henrion 90] M. Henrion, “An introduction to algorithms for inference in belief nets”, M. Henrion et al., (Edt), Uncertainty in Artificial Intelligence 5, Elsevier Science, 129-138, 1990.
[Howard and Matheson 84] R.A. Howard and J.E. Matheson, “Influence diagrams”, Readings in Decision Analysis, R.A. Howard and J.E. Matheson, (Edt), Strategic Decisions Group, Menlo Park, CA, Ch 38, 763-771, 1984.
[Jamieson 90] P.W. Jamieson, “Computerized interpretation of electromyographic data”, Electroencephalogr. Clin. Neurophysiol., 75: 390-400, 1990.
[Jensen 88] F.V. Jensen, “Junction tree and decomposable hypergraphs”, JUDEX Research Report, Aalborg, Denmark, 1988.
[Jensen, Lauritzen and Olesen 90] F.V. Jensen, S.L. Lauritzen and K.G. Olesen, “Bayesian updating in causal probabilistic networks by local computations”, Computational Statistics Quarterly, 4: 269-282, 1990.
[Jensen, Olesen and Andersen 90] F.V. Jensen, K.G. Olesen and S.K. Andersen, “An algebra of Bayesian belief universes for knowledge based systems”, Networks, Vol. 20, 637-659, 1990.
[Kim and Pearl 83] J.H. Kim and J. Pearl, “A computational model for causal and diagnostic reasoning in inference engines”, Proc. of 8th IJCAI, Karlsruhe, West Germany, 190-193, 1983.
[Kowalik 86] J.S. Kowalik, (Edt), Coupling Symbolic and Numerical Computing in Expert Systems, Elsevier, 1986.
[Kuczkowski and Gersting 77] J.E. Kuczkowski and J.L. Gersting, Abstract Algebra, Marcel Dekker, 1977.
[Kuipers and Kassirer 84] B. Kuipers and J. Kassirer, “Causal reasoning in medicine: Analysis of a protocol”, Cognitive Science, 8: 363-385, 1984.
[Larsen and Marx 81] R.J. Larsen and M.L. Marx, An Introduction to Mathematical Statistics and Its Applications, Prentice-Hall, NJ, 1981.
[Lauritzen et al. 84] S.L. Lauritzen, T.P. Speed and K.
Vijayan, “Decomposable graphs and hypergraphs”, Journal of Australian Mathematical Society, Series A, 36: 12-29, 1984.
[Lauritzen and Spiegelhalter 88] S.L. Lauritzen and D.J. Spiegelhalter, “Local computation with probabilities on graphical structures and their application to expert systems”, Journal of the Royal Statistical Society, Series B, 50: 157-244, 1988.
[Maier 83] D. Maier, The Theory of Relational Databases, Computer Science Press, 1983.
[Meyer and Hilfiker 83] M. Meyer and P. Hilfiker, “Computerized motor neurography”, J.E. Desmedt, (Edt), Computer Aided Electromyography, Karger, 1983.
[Miller et al. 76] A.C. Miller, M.M. Merkhofer, R.A. Howard, J.E. Matheson and T.R. Rice, Development of Automated Aids for Decision Analysis, Stanford Research Institute, Menlo Park, California, 1976.
[Neapolitan 90] R.E. Neapolitan, Probabilistic Reasoning in Expert Systems, John Wiley and Sons, 1990.
[Ng and Abramson 90] Keung-Chi Ng and B. Abramson, “A sensitivity analysis of PATHFINDER”, Proc. of the Sixth Conference on Uncertainty in Artificial Intelligence, Cambridge, Mass., 204-212, 1990.
[Nii 86a] H.P. Nii, “Blackboard systems: the blackboard model of problem solving and the evolution of blackboard architectures”, AI Magazine, Vol. 7, No. 2, 38-53, 1986.
[Nii 86b] H.P. Nii, “Blackboard systems: blackboard application systems, blackboard systems from a knowledge engineering perspective”, AI Magazine, Vol. 7, No. 3, 82-106, 1986.
[Patil, Szolovits and Schwartz 82] R.S. Patil, P. Szolovits and W.B. Schwartz, “Modeling knowledge of the patient in acid-base and electrolyte disorders”, P. Szolovits, (Edt), Artificial Intelligence in Medicine, Westview, 191-225, 1982.
[Pearl 86] J. Pearl, “Fusion, propagation, and structuring in belief networks”, Artificial Intelligence, 29: 241-288, 1986.
[Pearl 88] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[Pearl 89] J.
Pearl, “Probabilistic semantics for nonmonotonic reasoning: A survey”, Proc. First Intl. Conf. on Principles of Knowledge Representation and Reasoning, 1989.
[Poole and Neufeld 88] D. Poole and E. Neufeld, “Sound probabilistic inference in Prolog: an executable specification of influence diagrams”, I SIMPOSIUM INTERNACIONAL DE INTELIGENCIA ARTIFICIAL, 1988.
[Pople 82] H.E. Pople, Jr., “Heuristic methods for imposing structure on ill-structured problems: the structuring of medical diagnostics”, P. Szolovits, (Edt), Artificial Intelligence in Medicine, Westview, 119-190, 1982.
[Rose et al. 76] D.J. Rose, R.E. Tarjan and G.S. Lueker, “Algorithmic aspects of vertex elimination on graphs”, SIAM J. Comput., 5: 266-283, 1976.
[Savage 61] L.J. Savage, “The foundations of statistics reconsidered”, G. Shafer and J. Pearl, (Edt), Readings in Uncertainty Reasoning, Morgan Kaufmann, 1990.
[Shafer 76] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, New Jersey, 1976.
[Shafer 90] G. Shafer, “Introduction for Chapter 2: The meaning of probability”, G. Shafer and J. Pearl, (Edt), Readings in Uncertainty Reasoning, Morgan Kaufmann, 1990.
[Shafer and Shenoy 88] G. Shafer and P.P. Shenoy, Local computation in hypertrees, School of Business Working Paper No. 201, University of Kansas, U.S.A., 1988.
[Shachter 86] R.D. Shachter, “Evaluating influence diagrams”, Operations Research, 34, No. 6, 871-882, 1986.
[Shachter 88a] R.D. Shachter, “Probabilistic inference and influence diagrams”, Operations Research, 36, No. 4, 589-604, 1988.
[Shachter 88] R.D. Shachter, Discussion published in [Lauritzen and Spiegelhalter 88], 1988.
[Shachter and Heckerman 87] R.D. Shachter and D.E. Heckerman, “Thinking backward for knowledge acquisition”, AI Magazine, 8: 55-62, 1987.
[Spiegelhalter 86] D.J. Spiegelhalter, “Probabilistic reasoning in predictive expert systems”, L.N. Kanal and J.F.
Lemmer, (Edt), Uncertainty in Artificial Intelligence, Elsevier Science, 47-67, 1986.
[Spiegelhalter, Franklin and Bull 89] D.J. Spiegelhalter, R.C.G. Franklin and K. Bull, “Assessment, criticism and improvement of imprecise subjective probabilities for a medical expert system”, Proc. Fifth Workshop on Uncertainty in Artificial Intelligence, 335-342, Windsor, Ontario, 1989.
[Spiegelhalter and Lauritzen 90] D.J. Spiegelhalter and S.L. Lauritzen, “Sequential updating of conditional probabilities on directed graphical structures”, Networks, Vol. 20, 5: 579-606, 1990.
[Stalberg and Stalberg 89] E. Stalberg and S. Stalberg, “The use of small computers in the EMG lab”, J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 1-32, 1989.
[Szolovits and Pauker 78] P. Szolovits and S.G. Pauker, “Categorical and probabilistic reasoning in medical diagnosis”, Artificial Intelligence, 11: 115-144, 1978.
[Szolovits 82] P. Szolovits, “Artificial intelligence and medicine”, P. Szolovits, (Edt), Artificial Intelligence in Medicine, Westview, 1-19, 1982.
[Tarjan and Yannakakis 84] R.E. Tarjan and M. Yannakakis, “Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs”, SIAM J. Comput., 13: 566-579, 1984.
[Tversky and Kahneman 74] A. Tversky and D. Kahneman, “Judgment under uncertainty: heuristics and biases”, Science, 185: 1124-1131, 1974.
[Tversky and Kahneman 80] A. Tversky and D. Kahneman, “Causal schemata in judgments under uncertainty”, M. Fishbein, (Edt), Progress in Social Psychology, Lawrence Erlbaum, Vol. 1, 49-72, 1980.
[Vila et al. 85] A. Vila, D. Ziebelin and F. Reymond, “Experimental EMG expert system as an aid in diagnosis”, Electroenceph. Clin. Neurophysiol., 61: S240, 1985.
[Xiang, Beddoes and Poole 90a] Y. Xiang, M.P. Beddoes and D.
Poole, “Can uncertainty management be realized in a finite totally ordered probability algebra?”, M. Henrion et al., (Edt), Uncertainty in Artificial Intelligence 5, 41-57, 1990.

[Xiang, Beddoes and Poole 90b] Y. Xiang, M.P. Beddoes and D. Poole, “Sequential updating conditional probability in Bayesian networks by posterior probability”, Proc. 8th Biennial Conference of the Canadian Society for Computational Studies of Intelligence, Ottawa, 21-27, 1990.

[Xiang et al. 91b] Y. Xiang, B. Pant, A. Eisen, M.P. Beddoes, and D. Poole, “PAINULIM: An expert neuromuscular diagnostic system based on multiply sectioned Bayesian belief networks”, Proc. ISMM International Conference on Mini and Microcomputers in Medicine and Healthcare, Long Beach, CA, 64-69, 1991.

[Xiang, Poole and Beddoes 91] Y. Xiang, D. Poole and M.P. Beddoes, “Multiply sectioned Bayesian networks and junction forests for large knowledge-based systems”, submitted to Computational Intelligence, 1991.

[Xiang et al. 92] Y. Xiang, A. Eisen, M. MacNeil and M.P. Beddoes, “Quality control in nerve conduction studies with coupled knowledge based system approach”, Muscle and Nerve, Vol. 15, No. 2, 180-187, 1992.

[Zadeh 65] L. Zadeh, “Fuzzy sets”, Information and Control, 8: 338-353, 1965.

Appendix A
Background on Graph Theory

Graph theory plays an important role in characterizing Bayesian networks and in developing efficient inference algorithms for them. This appendix introduces the basic concepts of graph theory relevant to this thesis. For a formal treatment of these concepts, see [Golumbic 80, Gibbons 85, Jensen 88, Lauritzen et al. 84].

A.1 Graphs

A graph $G$ is a pair $(N, E)$ where $N = \{A_1, \ldots, A_n\}$ is a set of nodes and $E = \{(A_i, A_j) \mid A_i, A_j \in N;\ i \neq j\}$ is a set of links between pairs of nodes in $N$. A directed graph is a graph where links in $E$ are ordered pairs and an undirected graph is a graph where links in $E$ are unordered pairs.
Links in a directed graph are called arcs when their direction is of concern. A subgraph of a graph $(N, E)$ is any graph $(N^k, E^k)$ satisfying $N^k \subseteq N$ and $E^k \subseteq E$. Given a subset of nodes $N' \subseteq N$ of a graph $(N, E)$, the subgraph induced by $N'$ is $(N', E')$ where $E' = \{(A_i, A_j) \in E \mid A_i \in N' \ \&\ A_j \in N'\}$. The union graph of subgraphs $G^1 = (N^1, E^1)$ and $G^2 = (N^2, E^2)$ is the graph $(N^1 \cup N^2, E^1 \cup E^2)$, denoted $G^1 \cup G^2$.

Example 22 Figure A.38 shows eight examples. $G^1 = (N^1, E^1)$ is an undirected graph where $N^1 = \{A_1, A_2, \ldots, A_6\}$ and $E^1 = \{(A_1, A_2), (A_1, A_4), (A_2, A_3), (A_3, A_4), (A_4, A_5), (A_4, A_6)\}$ is a set of unordered pairs. $G^4 = (N^4, E^4)$ is a directed graph where $N^4 = N^1$ and $E^4 = \{(A_1, A_2), (A_1, A_4), (A_3, A_2), (A_4, A_3), (A_5, A_4), (A_4, A_6)\}$ is a set of ordered pairs. $G^1$ is the union graph of subgraphs $G^2$ and $G^3$ ($G^1 = G^2 \cup G^3$). Likewise, $G^4 = G^5 \cup G^6$. Both $G^2$ and $G^3$ are also subgraphs of $G^7$, but $G^7 \neq G^2 \cup G^3$ since the link $(A_1, A_5)$ in $G^7$ is not contained in either subgraph.

Figure A.38: Examples of graphs

A path in a graph $(N, E)$ is a sequence of nodes $A_1, A_2, \ldots, A_k$ ($k > 1$) such that $(A_i, A_{i+1}) \in E$. A path in a directed graph can be directed or undirected (i.e., each arc is considered undirected). A simple path is a path with no repeated node except that $A_1$ is allowed to equal $A_k$. A cycle is a simple path with $A_1 = A_k$. Directed graphs without directed cycles are called DAGs (directed acyclic graphs). Define the diameter of a DAG as the number of arcs on the longest directed path in the DAG. A graph $(N, E)$ is connected if for any pair of nodes in $N$ there is an undirected path between them. A graph is singly connected, or is a tree, if there is a unique undirected path between any pair of nodes. If a graph consists of several unconnected trees, call the graph a forest. A graph is multiply connected if more than one undirected path exists between a pair of nodes.
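The definitions of DAG, diameter and connectedness can be checked mechanically. The following is a minimal Python sketch, not part of the thesis software; the function names are hypothetical, and the arc list is the set $E^4$ of Example 22, with the second arc read as $(A_1, A_4)$ (an assumption drawn from the surrounding examples).

```python
from collections import defaultdict
from functools import lru_cache

# Arcs of the example DAG G4 (Example 22). Reading the second arc as
# (A1, A4) is an assumption based on the surrounding examples.
ARCS = [("A1", "A2"), ("A1", "A4"), ("A3", "A2"),
        ("A4", "A3"), ("A5", "A4"), ("A4", "A6")]


def children_map(arcs):
    """Map every node to the list of its children."""
    kids = defaultdict(list)
    for parent, child in arcs:
        kids[parent].append(child)
        kids.setdefault(child, [])
    return dict(kids)


def is_dag(arcs):
    """True if there is no directed cycle (Kahn's topological sort)."""
    kids = children_map(arcs)
    indegree = {node: 0 for node in kids}
    for _, child in arcs:
        indegree[child] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in kids[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return removed == len(kids)


def diameter(arcs):
    """Number of arcs on the longest directed path of a DAG."""
    kids = children_map(arcs)

    @lru_cache(maxsize=None)
    def longest_from(node):
        return max((1 + longest_from(c) for c in kids[node]), default=0)

    return max(longest_from(n) for n in kids)


def is_connected(arcs):
    """True if every pair of nodes is joined by an undirected path."""
    neighbours = defaultdict(set)
    for a, b in arcs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(neighbours)


print(is_dag(ARCS), diameter(ARCS), is_connected(ARCS))  # True 3 True
```

Reversing the arc $(A_1, A_2)$ in `ARCS` makes `is_dag` return `False`, because the directed cycle $A_2 \to A_1 \to A_4 \to A_3 \to A_2$ appears.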
Note that, under these definitions, a graph can be multiply connected without being connected.

Example 23 In Figure A.38, $A_1 - A_2 - A_3 - A_4 - A_6$ is a path in $G^1$. It is also an undirected path in $G^4$. $A_5 \to A_4 \to A_3 \to A_2$ is a directed path in $G^4$. $A_6 - A_4 - A_1 - A_2 - A_3 - A_4 - A_5$ is not a simple path in $G^1$, but $A_1 - A_2 - A_3 - A_4 - A_6$ is. $A_1 - A_2 - A_3 - A_4 - A_1$ is an undirected cycle in $G^4$. There is no directed cycle in $G^4$, so $G^4$ is a DAG; there would be one if the arc $(A_1, A_2)$ were reversed. $G^3$ is a singly connected graph, i.e., a tree. $G^5$ is a singly connected DAG, i.e., a tree. $G^4$ is a multiply connected DAG. The diameters of DAGs $G^5$, $G^6$, and $G^4$ are 1, 2, and 3 respectively.

Only connected DAGs are considered in this thesis, since an unconnected DAG can always be treated as several connected ones. Define a subDAG of a DAG $D = (N, E)$ as any connected subgraph of $D$. A DAG $D$ is the union DAG of subDAGs $D^1$ and $D^2$ if $D = D^1 \cup D^2$.

Example 24 In Figure A.38, $G^4$ is the union DAG of subDAGs $G^5$ and $G^6$ ($G^4 = G^5 \cup G^6$).

If there is an arc $(A_1, A_2)$ from node $A_1$ to $A_2$, $A_1$ is called a parent of $A_2$, and $A_2$ a child of $A_1$. Similarly, if there is a directed path from $A_1$ to $A_k$, the two nodes are called, respectively, ancestor and descendant, relative to each other. The in-degree of a node (denoted by $i$) is defined as the number of its parents.

Example 25 In $G^4$ of Figure A.38, $A_1$ and $A_5$ are the parents of $A_4$ and $A_4$ is their child. $A_2$ has 4 ancestors, namely $A_1$, $A_3$, $A_4$ and $A_5$. The in-degree of $A_4$ is 2 and the in-degree of $A_1$ is 0.

For each node in a DAG, if links are added between all its parents and the directions on arcs are dropped, the graph thus formed is the moral graph of the DAG. A graph is triangulated if every cycle of length > 3 has a chord. A chord is a link connecting 2 nodes that are not adjacent in the cycle. A maximal set of nodes all of which are pairwise linked is called a clique. Algorithms for triangulating graphs have been developed, for example, the Lexicographic search [Rose et al.
76] with time complexity $O(ne)$, where $n$ is the number of nodes and $e$ the number of links, and the maximum cardinality search [Tarjan and Yannakakis 84] with time complexity $O(n + e)$.

Example 26 In Figure A.38, $G^7$ is the moral graph of $G^4$. $G^8$ is a triangulated graph of $G^7$.

A.2 Hypergraphs

A hypergraph is a pair $(N, C)$ where $N$ is a set and $C \subseteq 2^N$ is a set of subsets of $N$. Define the union of hypergraphs similarly to the union of graphs. The union hypergraph of $(N^1, C^1)$ and $(N^2, C^2)$ is $(N^1 \cup N^2, C^1 \cup C^2)$, denoted $(N^1, C^1) \cup (N^2, C^2)$. Let $(N, E)$ be a graph, and $C$ be the set of cliques of $(N, E)$. Then $(N, C)$ is a clique hypergraph of graph $(N, E)$.

If a clique hypergraph is organized into a tree where the nodes of the tree are labeled with cliques such that, for any pair of cliques, their intersection is contained in each of the cliques on the unique path between them, then the tree is called a junction tree or join tree. The intersection of 2 adjacent cliques in a junction tree is called the sepset of the 2 cliques. If a hypergraph has junction trees, they can be characterized as maximum-weight spanning trees [Jensen 88]. The algorithm for constructing maximum-weight spanning trees can be found in, e.g., [Gibbons 85].

Example 27 Consider the graph $G^8$ in Figure A.38 where $N^8 = \{A_1, A_2, \ldots, A_6\}$. The set of cliques of $G^8$ is $C = \{\{A_1, A_2, A_4\}, \{A_2, A_3, A_4\}, \{A_1, A_4, A_5\}, \{A_4, A_6\}\}$. Therefore $(N^8, C)$ is a clique hypergraph of $G^8$.
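The maximum-weight spanning tree characterization can be tried on the four cliques of Example 27. The sketch below is illustrative only and is not the thesis implementation: it assumes the weight of a link between two cliques is the size of their intersection, uses Kruskal's algorithm as one possible spanning-tree method, and then checks the junction tree condition directly. All function names are hypothetical.

```python
from itertools import combinations

# Cliques of G8 as listed in Example 27.
CLIQUES = [frozenset({"A1", "A2", "A4"}), frozenset({"A2", "A3", "A4"}),
           frozenset({"A1", "A4", "A5"}), frozenset({"A4", "A6"})]


def max_weight_spanning_tree(cliques):
    """Kruskal's algorithm; link weight = size of the clique intersection."""
    edges = sorted(combinations(range(len(cliques)), 2),
                   key=lambda e: len(cliques[e[0]] & cliques[e[1]]),
                   reverse=True)
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


def tree_path(tree, start, goal):
    """Unique path between two tree nodes, found by depth-first search."""
    adjacent = {}
    for a, b in tree:
        adjacent.setdefault(a, []).append(b)
        adjacent.setdefault(b, []).append(a)
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adjacent.get(node, []):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


def is_junction_tree(cliques, tree):
    """Each pairwise intersection must lie in every clique on the path."""
    for a, b in combinations(range(len(cliques)), 2):
        path = tree_path(tree, a, b)
        if path is None:
            return False
        shared = cliques[a] & cliques[b]
        if any(not shared <= cliques[c] for c in path):
            return False
    return True


tree = max_weight_spanning_tree(CLIQUES)
sepsets = [sorted(CLIQUES[a] & CLIQUES[b]) for a, b in tree]
print(tree, sepsets, is_junction_tree(CLIQUES, tree))
```

Several spanning trees share the maximum weight here, because $\{A_4, A_6\}$ intersects every other clique in $\{A_4\}$ only, so the tree returned by the sketch need not have the same shape as the one drawn in the thesis figure.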
Figure A.39 is a junction tree of the hypergraph $(N^8, C)$, where nodes are labeled with cliques (in ovals) and links are labeled with sepsets of cliques (in squares).

Figure A.39: A junction tree with nodes (ovals) labeled with cliques and links labeled with sepsets of cliques (squares).

Appendix B
Reference for Theory of Probabilistic Logic (TPL)

B.1 Axioms of TPL

The content of this section is taken from Aleliunas [1988] with minor changes to simplify the presentation.

Axiom 1 (TPL axioms) Axioms about the domain and range of each $f$ in $F$.

AX1 The set of probabilities, $P$, is a partially ordered set. The ordering relation is '$\le$'.

AX2 The set of sentences, $L$, is a free boolean algebra with operations $\&$, $\lor$, and $\neg$, and it is equipped with the usual equivalence relation '$\equiv$'. The generators of the algebra are a countable set of primitive propositions. Every sentence in $L$ is either a primitive proposition or a finite combination of them.

AX3 If $P \equiv X$ and $Q \equiv Y$, then $f(P \mid Q) = f(X \mid Y)$.

Axioms that hold for all $f$ in $F$, and for any $P$, $Q$, $R$ in $L$.

AX4 If $Q$ is absurd (i.e., $Q \equiv R \,\&\, \neg R$), then $f(P \mid Q) = f(P \mid P)$.

AX5 $f(P \& Q \mid Q) = f(P \mid Q) \le f(Q \mid Q)$.

AX6 For any other $g$ in $F$, $f(P \mid P) = g(P \mid P) = 1$.

AX7 There is a monotone non-increasing total function, $i$, from $P$ into $P$ such that $f(\neg P \mid R) = i(f(P \mid R))$.

AX8 There is an order-preserving total function, $h$, from $P \times P$ into $P$ such that $f(P \& Q \mid R) = h(f(P \mid Q \& R), f(Q \mid R))$. Moreover, if $f(P \& Q \mid R) = 0$, then $f(P \mid Q \& R) = 0$ or $f(Q \mid R) = 0$, where $0 = f(\neg R \mid R)$ is defined as a function of $f$ and $R$.

AX9 If $f(P \mid R) \le f(P \mid \neg R)$ then $f(P \mid R) \le f(P \mid R \lor \neg R) \le f(P \mid \neg R)$.

Axioms about the richness of the set $F$.

Let $1 = P \lor \neg P$.
For any distinct primitive propositions $A$, $B$ and $C$ in $L$, and for any arbitrary probabilities $a$, $b$ and $c$ in $P$, there is a probability assignment $f$ in $F$ (not necessarily the same one in each case) for which

AX10 $f(A \mid 1) = a$, $f(B \mid A) = b$, and $f(C \mid A \& B) = c$.

AX11 $f(A \mid B) = f(A \mid \neg B) = a$ and $f(B \mid A) = f(B \mid \neg A) = b$.

AX12 $f(A \mid 1) = a$, and $f(A \& B \mid 1) = b$, whenever $b \le a$.

B.2 Probability Algebra Theorem

The content of this section is taken from Aleliunas [1986].

Theorem 15 (Probability algebra theorem) Any probability algebra (under TPL axioms) defined on the partially ordered set (poset) $P$ satisfies the following conditions:

T1 $P$ is a partially ordered semigroup with an order preserving operation '$*$'.

T2 $0 * x = 0$ and $1 * x = x$ for any $x$ in $P$.

T3 $P$ is a self-dual poset equipped with an isomorphism, $i$, from $P$ to its dual.

T4 $P$ is a poset bounded by 0 and 1, i.e., $0 \le x \le 1$ for any $x$ in $P$.

T5 $P$ is commutative, i.e., $x * y = y * x$.

T6 $P$ has non-trivial zero, i.e., $x * y = 0$ implies that $x = 0$ or $y = 0$.

T7 $P$ is a naturally ordered semigroup, i.e., $x \le y$ implies $\exists z\,(x = y * z)$.

Conversely, the above 7 conditions are also sufficient to characterize a probability algebra.
Appendix C
Examples on Finite Totally Ordered Probability Algebras

One of $M_{8,4}$ with idempotent elements e1, e4, e7 and e8

Product table
 *    e1  e2  e3  e4  e5  e6  e7  e8
 e1   e1  e2  e3  e4  e5  e6  e7  e8
 e2   e2  e3  e4  e4  e5  e6  e7  e8
 e3   e3  e4  e4  e4  e5  e6  e7  e8
 e4   e4  e4  e4  e4  e5  e6  e7  e8
 e5   e5  e5  e5  e5  e6  e7  e7  e8
 e6   e6  e6  e6  e6  e7  e7  e7  e8
 e7   e7  e7  e7  e7  e7  e7  e7  e8
 e8   e8  e8  e8  e8  e8  e8  e8  e8

Solution table p/q
 q \ p  e1       e2       e3       e4       e5       e6       e7       e8
 e1     e1       e2       e3       e4       e5       e6       e7       e8
 e2     -        e1       e2       [e4,e3]  e5       e6       e7       e8
 e3     -        -        e1       [e4,e2]  e5       e6       e7       e8
 e4     -        -        -        [e4,e1]  e5       e6       e7       e8
 e5     -        -        -        -        [e4,e1]  e5       [e7,e6]  e8
 e6     -        -        -        -        -        [e4,e1]  [e7,e5]  e8
 e7     -        -        -        -        -        -        [e7,e1]  e8

C.2 Derivation of p(fire|smoke&alarm) under TPL

$p(f \mid s \& a) = p(s \mid f \& a) * p(f \mid a) / p(s \mid a)$

where

$p(s \mid f \& a) = p(s \mid f)$;

$p(s \mid a) = p(s \& (f \lor \neg f) \mid a) = i[\, i[p(s \mid \neg f) * p(\neg f \mid a)] * i[\, p(s \mid f) * p(f \mid a) / i[p(s \mid \neg f) * p(\neg f \mid a)] \,] \,]$;

and

$p(f \mid a) = p(a \mid f) * p(f) / p(a)$

where

$p(a \mid f) = p(a \& ((f \& t) \lor (f \& \neg t) \lor (\neg f \& t) \lor (\neg f \& \neg t)) \mid f) = i[\, i[p(a \mid f \& \neg t) * p(\neg t)] * i[\, p(a \mid f \& t) * p(t) / i[p(a \mid f \& \neg t) * p(\neg t)] \,] \,]$

and

$p(a) = p(a \& ((f \& t) \lor (f \& \neg t) \lor (\neg f \& t) \lor (\neg f \& \neg t))) = i[f_1 * f_2 * f_3 * f_4]$

where

$f_1 = i[\, p(a \mid \neg f \& \neg t) * p(\neg f) * p(\neg t) \,]$
$f_2 = i[\, p(a \mid \neg f \& t) * p(\neg f) * p(t) / f_1 \,]$
$f_3 = i[\, p(a \mid f \& \neg t) * p(f) * p(\neg t) / (f_1 * f_2) \,]$
$f_4 = i[\, p(a \mid f \& t) * p(f) * p(t) / (f_1 * f_2 * f_3) \,]$.

C.3 Evaluation of Legal FTOPAs with Size 8

Figure C.40 plots the evaluation results of the following 14 posterior probabilities from the smoke-alarm example by Poole and Neufeld [1988] using all the 32 legal FTOPAs of size 8. The conventional probability model (M0) is included for comparison.

$p(s \mid f)$, $p(a \mid f)$, $p(l \mid f)$, $p(r \mid f)$; $p(s \mid t)$, $p(a \mid t)$, $p(l \mid t)$, $p(r \mid t)$;
$p(f \mid s)$, $p(f \mid a)$, $p(f \mid s \& a)$; $p(t \mid s)$, $p(t \mid a)$, $p(t \mid s \& a)$

The following table lists model labels (L) used in Figure C.40 and their corresponding idempotent element subscripts (S). 'M0' labels the conventional probability model.

 L    S            L    S            L    S              L    S
 M1   1,7,8        M9   1,2,5,7,8    M17  1,2,3,4,7,8    M25  1,3,5,6,7,8
 M2   1,2,7,8      M10  1,2,6,7,8    M18  1,2,3,5,7,8    M26  1,4,5,6,7,8
 M3   1,3,7,8      M11  1,3,4,7,8    M19  1,2,3,6,7,8    M27  1,2,3,4,5,7,8
 M4   1,4,7,8      M12  1,3,5,7,8    M20  1,2,4,5,7,8    M28  1,2,3,4,6,7,8
 M5   1,5,7,8      M13  1,3,6,7,8    M21  1,2,4,6,7,8    M29  1,2,3,5,6,7,8
 M6   1,6,7,8      M14  1,4,5,7,8    M22  1,2,5,6,7,8    M30  1,2,4,5,6,7,8
 M7   1,2,3,7,8    M15  1,4,6,7,8    M23  1,3,4,5,7,8    M31  1,3,4,5,6,7,8
 M8   1,2,4,7,8    M16  1,5,6,7,8    M24  1,3,4,6,7,8    M32  1,2,3,4,5,6,7,8

In each sub-graph, the vertical scale represents the probability range with 0 = e8 and 1 = e1. The probabilities are arranged horizontally in the same order as listed above. The predictive probabilities with identical condition are shown in the first 8 positions in 2 groups. The diagnostic probabilities with a fixed hypothesis are shown in the last 6 positions in 2 groups. As is analyzed in Section 2.5, none of the models 1 through 32 gives satisfactory results.

Figure C.40: Evaluation result of smoke-alarm example by 32 legal FTOPAs of size 8. f: fire; t: tampering; a: alarm; s: smoke; l: leaving; r: report.

Appendix D
Proofs for corollary and propositions

Proof of Corollary 2

Proof:
To prove p/q = r, it suffices to show r * q = p.
Below the 7 cases are shown in the order of their appearance in the Corollary.

(Case 1) Equivalent to $e_k * e_1 = e_k$, which is in turn equivalent to Cond8 of Proposition 1.

(Case 2) Equivalent to e * = e which is in turn equivalent to Cond4 of Proposition 1.

(Case 3) Equivalent to e * e = ek where e = ek_+ and e+1_2 e, e1+,. First, show this is the second case of Theorem 1. Clearly, ij < y < i11 - 1 i1. Also x = - j + i1 > i1, and k - j + i1 < ii+1 - 1 - (i1 + 1) + ii = i1+ - 2 i+i. Second, show emjfl(÷j_ji,j÷i) = ek. This is true because x - j - i1 = k < i1.

(Case 4) Equivalent to e, * ek = ek where x = i1 and i1 < k < i1. It is true by the third case of Theorem 1.

(Case 5) Equivalent to e, * = e+1 where e < and < e÷i. First, show this is the second case of Theorem 1, that is, e, e e [e11+, e1+i]. This is obviously true for e and the lower bound of e. For the upper bound, from i11 > j, one has ij - j 1, and therefore 1+ e1. Second, show emin(z+y_ii,ii+i) = e+1 which is equivalent to x + y - ii ii+'. Since the lower bound is i1 + 1 for x, and i11 for y, this is certainly true.

(Case 6) Equivalent to er * e1 = e1 where e e1 which is true by Lemma 3.

(Case 7) Covered by the third case of Theorem 1. □

Proof of Proposition 5

Proof:
A solution table (see Appendix C.1) can be viewed as a different layout of a product table. An entry in the column labeled 'q' is one factor. A table entry in the same line as the first factor is another factor. The entry in the line labeled 'p' and in the same column as the second factor is the product.

Since the product operation '$*$' is well defined, all the probabilities will appear in each line of the table entries (possibly several of them appear together in a range). Therefore, there are $n(n - 1)$ table entries. There are $0.5\,n(n + 1) - 1$ distinct solution pairs.
Their difference is just the amount of ambiguity by definition.

$A = [n(n - 1)] - [0.5\,n(n + 1) - 1] = (n - 1)(n - 2)/2$ □

Proof of Proposition 6

Proof:
($O_d$) From Corollary 2, all the incidences of e/e = er to be counted are covered by case 7 of the corollary. Among $\{e_2, \ldots, e_{n-1}\}$, there are $k - 2$ idempotent elements which determine $k - 3$ intervals. For each interval, case 7 dictates $i_{m+1} - i_m$ columns in the solution table with $i_m - 1$ entries in each column to be counted.

($O_m$) From Theorem 1, $e_x * e_y < \min(e_x, e_y)$ happens only in the second case. Since idempotent elements do not follow the relation by Lemma 3, only $i_m < x, y < i_{m+1}$ needs to be considered, and there are $k - 2$ such intervals ($(i_{k-1}, i_k)$ does not contribute to $O_m$). For each interval, if $x$ is fixed, $y$ goes through $i_{m+1} - i_m - 1$ values. Since only distinct product pairs are to be counted, each interval contributes

$\sum_{j=1}^{i_{m+1} - i_m - 1} j$

Now it suffices to show $\min(x + y - i_m,\ i_{m+1}) > x, y$. Trivially, $i_{m+1} > x, y$. Also $x + (y - i_m) \ge x + 1 > x$. Similarly $x + y - i_m > y$.

($O_d + O_m$) Rewrite $O_d$ as

$O_d = \sum_{m=2}^{k-2} \sum_{j=0}^{i_{m+1} - i_m - 1} (i_m - 1)$

Thus

$O_d + O_m = \sum_{j=1}^{i_2 - i_1 - 1} (i_1 + j - 1) + \sum_{j=0}^{i_3 - i_2 - 1} (i_2 + j - 1) + \cdots + \sum_{j=0}^{i_{k-1} - i_{k-2} - 1} (i_{k-2} + j - 1)$

Note that the last addend of each sum is 1 less than the first addend of the next sum, and there are $i_{k-1} - 2$ addends in total. Thus,

$O_d + O_m = \sum_{j=1}^{i_{k-1} - 2} j = \sum_{j=1}^{n-3} j = (n - 3)(n - 2)/2$ □

Appendix E
Reference for Statistic Evaluation of Bernoulli Trials

The content of this appendix is taken from Larsen and Marx [1981].

Any set of repeated independent trials in which each trial has just 2 possible outcomes (success and failure) is called a set of Bernoulli trials. The probability $p$ associated with success is called the success probability.

E.1 Estimation for Confidence Intervals

Definition 18 Suppose that $y$ successes are observed in $n$ independent Bernoulli trials.
The probability interval $[p_1, p_2]$ is the $100(1 - \alpha)\%$ confidence interval for success probability $p$ if

$P(p_1 < p < p_2 \mid Y = y) = 1 - \alpha$

where $Y$ is the variable for number of successes.

The above defined lower and upper confidence limits $p_1$ and $p_2$ for $p$ are the solutions of

$\sum_{j=y}^{n} C_n^j\, p_1^j (1 - p_1)^{n-j} = \sum_{j=0}^{y} C_n^j\, p_2^j (1 - p_2)^{n-j} = \alpha/2$

where

$C_n^j = \frac{n!}{j!\,(n - j)!}$

is the number of combinations taking $j$ elements out of $n$.

E.2 Hypothesis Testing

Let $x$ and $y$ denote the numbers of successes observed in 2 independent sets of $n$ and $m$ Bernoulli trials, respectively. Let $p_x$ and $p_y$ denote the true success probabilities associated with each set of trials. An approximate generalized likelihood ratio test at the $\alpha$ level of significance for hypothesis $H_0: p_x = p_y$ versus hypothesis $H_1: p_x \neq p_y$ is obtained by rejecting $H_0$ whenever

$\dfrac{\frac{x}{n} - \frac{y}{m}}{\sqrt{\frac{x+y}{n+m}\left(1 - \frac{x+y}{n+m}\right)\left(\frac{1}{n} + \frac{1}{m}\right)}}$

is either $\le -z_{\alpha/2}$ or $\ge +z_{\alpha/2}$, where

$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-z_{\alpha/2}} e^{-x^2/2}\, dx = \alpha/2.$

Appendix F
Glossary of PAINULIM Terms

F.1 Diseases Considered in PAINULIM

Ahcd: Anterior horn cell disease
Als: Amyotrophic Lateral Sclerosis
Cts: Carpal tunnel syndrome
Inspcd: Intrinsic cord disease
Mednn: Median nerve lesion
Mnd: Motor neuron diseases (including Ahcd and Als)
Pd: Parkinson's disease
Pxltk: Plexus lower trunk
Pxpcd: Plexus post cord
Pxutk: Plexus upper trunk
Radnn: Radial nerve lesion
Rc56: C56 Root disease
Rc67: C67 Root disease
Rc81: C8T1 Root disease
Ulrnn: Ulnar nerve lesion
F.2 Clinical Features in PAINULIM

bcpex: Biceps reflex exaggerated
bcpls: Biceps reflex lost/diminished
lsbkhdfrm: Loss of sensation in back of hand/forearm
lsdisoc: Dissociated sensory loss
lslatfrm: Loss of sensation in lateral forearm
lslathnd: Loss of sensation in lateral hand/fingers
lsmedfrm: Loss of sensation in medial forearm
lsmedhnd: Loss of sensation in medial hand/fingers
lsuparm: Loss of sensation in upper arm
radex: Radial/Supinator reflex exaggerated
radls: Radial/Supinator reflex lost/diminished
rad_pn: Radicular pain
rigid: Rigidity
pnirm: Pain or tingling in the forearm
pniind: Pain or tingling in the hand
pn_shd: Pain or tingling in the shoulder
spstc: Spasticity
tcpex: Triceps reflex exaggerated
tcpls: Triceps reflex lost/diminished
tremor: Tremor of limbs/hands
twitch: Muscle twitch or fasciculations
wk_arm: Weakness of the arm
wking±c: Weakness of finger flexion
wk_shld: Weakness of the shoulder
wk_th_abd: Weakness of thumb abduction
wk_th_ext: Weakness of thumb extension
wk_th_fx: Weakness of thumb flexion
wk_wst_ex: Weakness of wrist extension

F.3 EMG Features in PAINULIM

adm: Abductor digiti minimi
apb: Abductor pollicis brevis
bchrd: Brachioradialis
bcps: Biceps brachii
edc: Extensor digitorum communis
deltd: Deltoid
fasc: Fasciculation
fcu: Flexor carpi ulnaris
fdi: First dorsal interosseous
fdp23: Flexor digitorum profundus 2,3
fdp45: Flexor digitorum profundus 4,5
fpl: Flexor pollicis longus
latdrs: Latissimus dorsi
lvscp: Levator scapulae
other: Muscle in other limb/trunk
pmcv: Pectoralis major clavicular
pmsc: Pectoralis major sterno-clavicular
prt: Pronator teres
ps56: Paraspinals C5,6
ps67: Paraspinals C6,7
ps81: Paraspinals C8,T1
rhmj: Rhomboideus major
spspn: Supraspinatus
srant: Serratus anterior
trcps: Triceps

F.4 Nerve Conduction Features in PAINULIM

mcmp: MEDIAN COMPOUND MUSCLE ACTION POTENTIAL
mf2l: MEDIAN finger2 SNAP latency
mf2a: MEDIAN finger2 SNAP amplitude
mmcb: MEDIAN motor CONDUCTION BLOCK
mmcv: MEDIAN motor CONDUCTION VELOCITY
mmdl: MEDIAN motor DISTAL LATENCY
mmfw: Median "F" response
mupl: MEDIAN to ULNAR palmar latency difference
radm: Radial motor study
rads: RADIAL sensory study
ucmp: ULNAR COMPOUND MUSCLE ACTION POTENTIAL
uf5a: ULNAR finger5 SNAP amplitude
uf5l: ULNAR finger5 SNAP latency
umcb: ULNAR motor CONDUCTION BLOCK
umcv: ULNAR motor CONDUCTION VELOCITY
umfw: Ulnar "F" response

Appendix G
Acronyms

AI: Artificial Intelligence
ALPP: Algorithm of Learning by Posterior Probability
BNS: Bayesian Network Specification (in QUALICON)
CLINICAL: clinical sect of PAINULIM
CMAP: Compound Muscle Action Potential
CP: Conditional Probability
DAG: Directed Acyclic Graph
DOCTR: DOCToR - a module in WEBWEAVR
EDITOR: a module in WEBWEAVR
EEG: ElectroEncephaloGraphy
EMG: ElectroMyoGraphy
EMGer: ElectroMyoGrapher
FE: Feature Extraction (in QUALICON)
FP: Feature Partition (in QUALICON)
FTOPA: Finite Totally Ordered Probability Algebra
MSBN: Multiply Sectioned Bayesian Network
NCV: Nerve Conduction Velocity
ND: Normal Data (in QUALICON)
NDU: Neuromuscular Diseases Unit
PAINULIM: PAINful or impaired Upper LIMb
PIE: Probabilistic Inference Engine (in QUALICON)
PP: Posterior Probability
QUALICON: QUALIty CONtrol
SNAP: Sensory Nerve Action Potential
TPL: Theory of Probabilistic Logic
TRANSNET: TRANSform NETwork - a module in WEBWEAVR
UBC: University of British Columbia
USBN: UnSectioned Bayesian Network
VGH: Vancouver General Hospital
WEBWEAVR: WEB WEAVeR
Multiply sectioned Bayesian belief networks for large knowledge-based systems: an application to neuromuscular… Xiang, Yang 1992
Title | Multiply sectioned Bayesian belief networks for large knowledge-based systems: an application to neuromuscular diagnosis |
Creator |
Xiang, Yang |
Date Issued | 1992 |
Description | In medical diagnosis a proper uncertainty calculus is crucial in knowledge representation. Finite calculus is close to human language and should facilitate knowledge acquisition. An investigation into the feasibility of finite totally ordered probability models has been conducted. It shows that a finite model is of limited usage, which highlights the importance of infinite totally ordered models including probability theory. Representing the qualitative domain structure is another important issue. Bayesian networks, combining graphical representation of domain dependency and probability theory, provide a concise representation and a consistent inference formalism. An expert system QUALICON for quality control in electromyography has been implemented as a pilot study of Bayesian nets. The performance is comparable to that from human professionals. Extending the research into a large system PAINULIM in neuromuscular diagnosis shows that the computation using homogeneous net representation is unnecessarily complex. At any one time a user’s attention is directed to only part of a large net, i.e., there is ‘localization’ of queries and evidence. The homogeneous net is inefficient since the overall net has to be updated each time. Multiply Sectioned Bayesian Networks (MSBNs) have been developed to exploit localization. Reasonable constraints are derived such that a localization preserving partition of a domain and its representation by a set of subnets are possible. Reasoning takes place at only one of them due to localization. Marginal probabilities obtained are identical to those obtained when the entire net is globally consistent. When the user’s attention shifts, a new subnet is swapped in and previously acquired evidence absorbed. Thus, with β subnets, the complexity is reduced approximately to 1/β. Reducing the complexity with MSBN, the knowledge acquisition of PAINULIM has been conducted using normal hospital computers. 
This results in efficient cooperation with medical staff. PAINULIM was thus constructed in less than one year. An evaluation shows very good performance. Coding probability distribution of Bayesian nets in causal direction has several advantages. Initially the distribution is elicited from the expert in terms of probabilities of a symptom given causing diseases. Since disease-to-symptom is not the direction of daily practice, the elicited probabilities may be inaccurate. An algorithm has been derived for sequentially updating probabilities in Bayesian nets, making use of the expert’s symptom-to-disease probabilities. Simulation shows better performance than Spiegelhalter’s {0, 1} distribution learning. |
Extent | 4387824 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2008-12-16 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0064986 |
URI | http://hdl.handle.net/2429/2995 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1992-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |