MULTIPLY SECTIONED BAYESIAN BELIEF NETWORKS FOR LARGE KNOWLEDGE-BASED SYSTEMS: AN APPLICATION TO NEUROMUSCULAR DIAGNOSIS

By Yang Xiang

B.A.Sc., Beijing Institute of Aeronautics and Astronautics, 1983
M.A.Sc., Beijing Institute of Aeronautics and Astronautics, 1985

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in
THE FACULTY OF GRADUATE STUDIES
ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
1991
© Yang Xiang, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Electrical Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Date:

Abstract

In medical diagnosis a proper uncertainty calculus is crucial in knowledge representation. A finite calculus is close to human language and should facilitate knowledge acquisition. An investigation into the feasibility of finite totally ordered probability models has been conducted. It shows that a finite model is of limited use, which highlights the importance of infinite totally ordered models, including probability theory.

Representing the qualitative domain structure is another important issue. Bayesian networks, combining graphical representation of domain dependency and probability theory, provide a concise representation and a consistent inference formalism. An expert system, QUALICON, for quality control in electromyography has been implemented as a pilot study of Bayesian nets.
The performance is comparable to that of human professionals.

Extending the research into a large system, PAINULIM, in neuromuscular diagnosis shows that the computation using a homogeneous net representation is unnecessarily complex. At any one time a user's attention is directed to only part of a large net, i.e., there is 'localization' of queries and evidence. The homogeneous net is inefficient since the overall net has to be updated each time. Multiply Sectioned Bayesian Networks (MSBNs) have been developed to exploit localization. Reasonable constraints are derived such that a localization-preserving partition of a domain and its representation by a set of subnets are possible. Reasoning takes place at only one of them due to localization. Marginal probabilities obtained are identical to those obtained when the entire net is globally consistent. When the user's attention shifts, a new subnet is swapped in and previously acquired evidence is absorbed. Thus, with β subnets, the complexity is reduced approximately to 1/β.

With the complexity reduced by the MSBN technique, the knowledge acquisition for PAINULIM has been conducted using normal hospital computers. This results in efficient cooperation with medical staff. PAINULIM was thus constructed in less than one year. An evaluation shows very good performance.

Coding the probability distribution of Bayesian nets in the causal direction has several advantages. Initially the distribution is elicited from the expert in terms of probabilities of a symptom given causing diseases. Since disease-to-symptom is not the direction of daily practice, the elicited probabilities may be inaccurate. An algorithm has been derived for sequentially updating probabilities in Bayesian nets, making use of the expert's symptom-to-disease probabilities. Simulation shows better performance than Spiegelhalter's {0, 1} distribution learning.
Table of Contents

Abstract
List of Tables
List of Figures
A Guide for the Reader
Acknowledgement

1 INTRODUCTION
  1.1 Neuromuscular Diagnosis and Expert Systems
  1.2 Representation Formalisms Other Than Bayesian Networks
    1.2.1 Rule-Based Systems
    1.2.2 The Method of Odds Likelihood Ratios
    1.2.3 MYCIN certainty factor
    1.2.4 Dempster-Shafer Theory
    1.2.5 Fuzzy Sets
    1.2.6 Limitations of Rule-Based Systems
  1.3 Representing Probable Reasoning In Bayesian Networks
    1.3.1 Basic Concepts of Probability Theory
    1.3.2 Inference Patterns Embedded In Probability Theory
    1.3.3 Bayesian Networks
    1.3.4 Propagating Belief in Bayesian Networks

2 FINITE TOTALLY ORDERED PROBABILITY ALGEBRA
  2.1 Motivation
  2.2 Finite Totally Ordered Probability Algebras
    2.2.1 Characterization
    2.2.2 Mathematical Structure
    2.2.3 Solution and Range
  2.3 Bayes Theorem and Reasoning by Case
  2.4 Problems With Legal Finite Totally Ordered Probability Models
    2.4.1 Ambiguity-generation and Denominator-indifference
    2.4.2 Quantitative Analysis of the Problems
    2.4.3 Can the Changes in Probability Assignment Help?
  2.5 An Experiment
  2.6 Conclusion

3 QUALICON: A COUPLED EXPERT SYSTEM
  3.1 The Problem: Quality Control in Nerve Conduction Studies
  3.2 QUALICON: a Coupled Expert System in Quality Control
  3.3 Development of QUALICON
    3.3.1 Feature Extraction and Partition
    3.3.2 Probabilistic Reasoning
  3.4 Assessment of QUALICON
    3.4.1 Assessment Procedure
    3.4.2 Assessment Results
  3.5 Remarks

4 MULTIPLY SECTIONED BAYESIAN NETWORKS
  4.1 Localization
  4.2 Background
    4.2.1 Operations on Belief Tables
    4.2.2 Transform a Bayesian Net into a Junction Tree
    4.2.3 d-separation
  4.3 'Obvious' Ways to Explore Localization
  4.4 Overview of MSBN and Junction Forest Technique
  4.5 The d-sepset and the Junction Tree
    4.5.1 The d-sepset
    4.5.2 Implication of d-sepset in Junction Trees
  4.6 Multiply Sectioned Bayesian Nets
    4.6.1 Definition of MSBN
    4.6.2 Soundness of Sectioning
  4.7 Transform MSBN into Junction Forest
    4.7.1 Transform SubDAGs into Junction Trees by Local Computation
    4.7.2 Linkages between Junction Trees
    4.7.3 Joint System Belief of Junction Forest
  4.8 Consistency and Separability of Junction Forest
    4.8.1 Consistency of Junction Forest
    4.8.2 Separability of Junction Forests
  4.9 Belief Propagation in Junction Forests
    4.9.1 Supportiveness
    4.9.2 Basic Operations
    4.9.3 Belief Initialization
    4.9.4 Evidential Reasoning
    4.9.5 Computational Complexity
  4.10 Remarks

5 PAINULIM: A NEUROMUSCULAR DIAGNOSTIC AID
  5.1 PAINULIM Application Domain and Design Criteria
  5.2 Exploiting Localization in PAINULIM Using MSBN Technique
    5.2.1 Resolve Conflict Demands by Exploiting Localization
    5.2.2 Using MSBN Technique in PAINULIM
  5.3 Other Issues in Knowledge Acquisition and Representation
    5.3.1 Multiple Diseases
    5.3.2 Acquisition of Probability Distribution
  5.4 Shell Implementation
  5.5 A Query Session with PAINULIM
  5.6 An Evaluation of PAINULIM
    5.6.1 Case Selection
    5.6.2 Performance Rating
    5.6.3 Evaluation Results
  5.7 Remarks

6 ALGORITHM FOR LEARNING BY POSTERIOR PROBABILITIES
  6.1 Background
  6.2 Learning from Posterior Distributions
  6.3 The Algorithm for Learning by Posterior Probability (ALPP)
  6.4 Convergence of the Algorithm
  6.5 Simulation Results
  6.6 Remarks

7 SUMMARY

Bibliography

Appendices

A Background on Graph Theory
  A.1 Graphs
  A.2 Hypergraphs
B Reference for Theory of Probabilistic Logic (TPL)
  B.1 Axioms of TPL
  B.2 Probability Algebra Theorem
C Examples on Finite Totally Ordered Probability Algebras
  C.1 Examples of Legal FTOPAs
  C.2 Derivation of p(fire|smoke&alarm) under TPL
  C.3 Evaluation of Legal FTOPAs with Size 8
D Proofs for corollary and propositions
E Reference for Statistic Evaluation of Bernoulli Trials
  E.1 Estimation for Confidence Intervals
  E.2 Hypothesis Testing
F Glossary of PAINULIM Terms
  F.1 Diseases Considered in PAINULIM
  F.2 Clinical Features in PAINULIM
  F.3 EMG Features in PAINULIM
  F.4 Nerve Conduction Features in PAINULIM
G Acronyms

List of Tables

2.1 Probabilities for smoke-alarm example
3.2 Potential recording conditions
4.3 Probability distribution associated with DAG 0 in Figure 4.17
4.4 Probability distribution associated with subDAGs of Figure 4.20
4.5 Belief tables for the junction forest in Figure 4.20
4.6 Belief tables after CollectBelief during BeliefInitialization
4.7 Belief tables after the completion of BeliefInitialization
4.8 Prior probabilities after the completion of BeliefInitialization
4.9 Belief tables in evidential reasoning
4.10 Posterior probabilities in evidential reasoning
5.11 Number of cases involved for each disease considered in PAINULIM
6.12 Simulation 1 summary
6.13 Simulation 2 summary
6.14 Simulation 3 summary
6.15 Simulation 4 summary

List of Figures

1.1 An inference network
1.2 An inference net using the method of odds likelihood ratios
1.3 An inference net using the MYCIN certainty factors
1.4 The DAG of a singly connected Bayesian network
1.5 The DAG of a Bayesian network with diameter 1
1.6 The DAG of a general multiply connected Bayesian network
2.7 Smoke-alarm example
3.8 QUALICON system structure
3.9 QUALICON report
3.10 QUALICON report
3.11 Feature extraction strategy
3.12 SNAP produced by reversing stimulation polarity
3.13 Illustration of Guided Filtering
3.14 Bayesian subset for CMAP
3.15 A DAG to illustrate the inference algorithm in QUALICON
3.16 Results of statistical assessment of QUALICON
4.17 Transformation of a Bayesian net into a junction tree
4.18 A triangulated graph and its junction tree
4.19 Comparison of junction tree technique with MSBN technique
4.20 Illustration of MSBN and junction forest technique
4.21 An example of unsound sectioning
4.22 A MSBN with a hypertree structure
4.23 A MSBN with sound sectioning but without a covering sect
4.24 Examples for violation of host composition condition
4.25 A partial host tree violating the host composition condition
4.26 A junction tree from a junction forest
4.27 An example illustrating the operation UpdateBelief
5.28 The 3 sects of PAINULIM: CLINICAL, EMG, and NCV
5.29 The linked junction forest of PAINULIM
5.30 Create a MSBN using WEBWEAVR shell
5.31 CLINICAL sect in a query session
5.32 NCV sect in a query session
5.33 EMG sect in a query session
5.34 Feature expectation for a patient with both Cts and Radnn
6.35 An example of D(1)
6.36 Fire alarm example for simulation
6.37 Simulation set-up
A.38 Examples of graphs
A.39 A junction tree
C.40 Evaluation result of smoke-alarm example by 32 legal FTOPAs of size 8

A Guide for the Reader

This thesis has been written for readers from various backgrounds including artificial intelligence, medical informatics, and biomedical engineering.

Chapter 1 contains the background on Bayesian belief networks and related uncertainty management formalisms. Bayesian belief networks combine probability theory and graphical representation of domain models. Appendix A contains an introduction to concepts from graph theory which are relevant to this thesis. Readers unfamiliar with graph theory should read this appendix before reading the main body of the thesis.

Chapter 2 contains results of an investigation into the feasibility of using finite totally ordered probability models for uncertainty management in expert systems. Readers who are mainly interested in positive results and applicable techniques may skip this chapter.

Chapter 3 describes the construction and evaluation of QUALICON, an expert system coupling digital signal processing and Bayesian networks for technical quality control in nerve conduction studies. Researchers in medical informatics and biomedical engineering who are interested in learning about the practical application of Bayesian networks and coupled expert systems will find this chapter useful.

Chapter 4 contains the theory for Multiply Sectioned Bayesian Networks (MSBNs) and junction forests. Its implementation in the WEBWEAVR shell and application in PAINULIM are described in Chapter 5.
Chapter 5 describes the construction and evaluation of PAINULIM, an expert neuromuscular diagnostic system for patients presenting a painful or impaired upper limb. Researchers in medical informatics and biomedical engineering who are interested in the practical application of the MSBN technique will find this chapter particularly relevant.

Chapter 6 presents the learning algorithm for sequentially updating conditional probabilities in Bayesian networks using the expert's subjective posterior probabilities.

Acknowledgement

At the University of British Columbia, my deepest thanks go to David Poole, Michael Beddoes and Andrew Eisen. David Poole introduced me to the concepts of probabilistic reasoning and Bayesian networks. His critical eye on my work has been extremely beneficial. Michael Beddoes directed me into the field of application of artificial intelligence (AI) in neuromuscular diagnosis, and has been my advisor both in and out of academics. His guidance has been critical to this interdisciplinary research. Andrew Eisen has guided me in choosing appropriate subdomains in neuromuscular diagnosis for AI applications, and has given me the support necessary to conduct the research towards the two expert systems, for which he also acted as one of the domain experts.

At Vancouver General Hospital, special thanks are due to Bhanu Pant and Maureen MacNeil, the domain experts for PAINULIM and QUALICON. The technique of multiply sectioned Bayesian networks and junction forests grew out of discussions with Andrew Eisen and Bhanu Pant. Bhanu Pant also devoted months of painstaking work and much overtime to the construction and evaluation of PAINULIM. Without his enthusiastic cooperation, it would not have been possible for PAINULIM to reach its clinical significance in less than a year. Maureen MacNeil devoted months of work to the construction of QUALICON in addition to her hospital duties.
I thank Romas Aleliunas for helping me to gain an understanding of his theory of probabilistic logic. I also thank Finn Jensen for providing several reprints of his papers; Steen Andreassen for providing publications regarding the progress in MUNIN; and David Heckerman for providing his Ph.D. thesis on similarity networks and his papers on PATHFINDER. Lianwen Zhang suggested the relation between the d-sepset and the graph separator. Nick Newell provided useful suggestions for programming the WEBWEAVR shell. In addition I thank Sandy Trueman and Andrew Travios for participating in the assessment of QUALICON, and Tish Mallinson, Karen Burns, and Khadija Amlani for help in nerve conduction potential acquisition.

The National Science and Engineering Research Council, under Grant A3290, 583200, and the University of British Columbia, under a University Graduate Fellowship, provided the financial support that made this thesis possible.

Finally, I owe a great debt to my mother and father for a lifetime of love, caring and support, and especially to my wife Jingyu Zhu for sheltering me from the travails of the real world and surrounding me with so much love and support.

Chapter 1

INTRODUCTION

Section 1.1 of this chapter reviews the status of computers in medicine, including the status of expert systems in neuromuscular diagnosis. Section 1.2 reviews rule-based expert systems and uncertainty management formalisms other than Bayesian networks. Section 1.3 reviews Bayesian networks as a natural and concise representation for uncertain knowledge.

1.1 Neuromuscular Diagnosis and Expert Systems

The thesis research addresses practical issues in building medical expert systems. The medical area is neuromuscular diagnosis. The computer now assists doctors and technicians in neuromuscular diagnosis by performing the following functions: data acquisition/analysis; database management; and documentation.
Computers have failed to assist doctors directly in their diagnostic inference and clinical decision making [Desmedt 89]. In general medicine, the same failure was true until the mid 70's. The failure was partly due to the limitations of conventional techniques for medical decision making (the clinical algorithm or flow chart, and the matching of cases to large data bases of previous cases [Szolovits 82]). The failure was also partly due to a lack of awareness of proper ways to apply normative probability and decision theories.

On the other hand, as Szolovits [1982] put it: "Modern medicine has become technically complex, the standards set for it are very high, conceptual and practical advances are rapid, yet the cognitive capabilities of physicians are pretty much fixed. As more and more data become available to the practicing doctor, as more becomes known about the processes of disease and possible interventions available to alter them, practitioners are called on to know more, to reason better, and to achieve better outcomes for their patients." In response to the need for more sophisticated diagnostic aids, medical expert systems appeared in the field of AI in medicine.

Expert systems are computer programs capable of making judgments or giving assistance in a complex area. The tasks they perform are ordinarily performed only by humans. They are commonly separated into 2 major components: the knowledge base, which includes assumptions about the particular domain, and the inference engine, which controls how the knowledge base is to be used in solving a particular problem. Since the mid 70's, researchers in AI have built expert systems in many areas of medicine to assist doctors in diagnosis or therapy (e.g., MYCIN for the diagnosis and treatment of bacterial infections [Buchanan and Shortliffe 84], and INTERNIST for diagnosis in internal medicine [Pople 82]).
New expert system technologies are still evolving as limitations of existing technologies are recognized. Limited successes in different aspects of performance have been achieved.

In the area of neuromuscular diagnosis, several (prototype) expert systems have appeared since the mid 80's: LOCALIZE [First et al. 82] for localization of peripheral nerve lesions; MYOSYS [Vila et al. 85] for diagnosing mono- and polyneuropathies; MYOLOG [Gallardo et al. 87] for diagnosing plexus and root lesions; Blinowska and Verroust's system [1987] for diagnosing carpal tunnel syndrome; ELECTRODIAGNOSTIC ASSISTANT [Jamieson 90] for diagnosing entrapment neuropathies, plexopathies, and radiculopathies; KANDID [Fuglsang-Frederiksen and Jeppesen 89] and MUNIN [Andreassen et al. 89] aiming at diagnosing the complete range of neuromuscular disorders. Satisfactory results in system testing with constructed cases have been reported, while only one of them (ELECTRODIAGNOSTIC ASSISTANT) reported a clinical evaluation, in which 15 cases yielded a 78% agreement rate with electromyographers. Most of the systems are rule-based. The limitations of rule-based systems are reviewed in Section 1.2. MUNIN uses Bayesian networks for representation of uncertain knowledge, and is still under development. Just as expert system techniques are evolving, their application to neuromuscular diagnosis is still a subject of ongoing research.

1.2 Representation Formalisms Other Than Bayesian Networks

Construction of an expert system faces an immediate question: What is a proper knowledge representation and inference formalism? The answer depends largely on the characteristics of the task to be performed by the target system.

Medical diagnosis is characterized by probable reasoning, or more formally, reasoning under uncertainty.
The uncertainty comes from incomplete knowledge about biological processes within the human body, from incomplete observation and monitoring of patient conditions, and from dynamically changing relations between the many factors entering the diagnostic process. A proper knowledge representation for reasoning under uncertainty is required, which includes the representation of the qualitative domain structure and an uncertainty calculus.

Perhaps the most widely used structure in expert systems is a set of rules in rule-based systems. This section reviews rule-based systems, their advantages and limitations.

A number of uncertainty calculuses have been used in expert systems. This section also reviews those calculuses notable in AI which associate single real numbers with uncertainty. An investigation of the feasibility of using a finite number of symbols to denote degrees of uncertainty will be presented in Chapter 2. Those calculuses are reviewed together with rule-based systems because some of them (i.e., the MYCIN certainty factor and odds likelihood ratios) can only be applied in conjunction with rule-based systems.

Bayesian networks have been chosen as the knowledge representation in this thesis research. The networks combine graphic representation of domain dependence and probability theory. This combination overcomes many limitations of rule-based systems. Bayesian network techniques will be reviewed in Section 1.3.

There has been controversy in the artificial intelligence field about the adequacy of using probability for representation of beliefs [Cheeseman 88a, Cheeseman 88b]. I will not attempt to enter into this controversy. Rather, my review discusses why Bayesian networks seem appropriate for building medical expert systems and thus are adopted in this thesis research.
1.2.1 Rule-Based Systems

An expert system which deals with probable reasoning will have a knowledge base which consists of a qualitative structure of the domain model and a quantitative uncertainty calculus. The same qualitative structure can usually accommodate different uncertainty calculuses. Perhaps the most widely used structure in expert systems is a set of rules in rule-based systems. All the methods for representing uncertainty reviewed in this section can be incorporated into rule-based systems.

Rules in such systems consist of antecedent-conclusion pairs, "if X, then Y", with the reading "if antecedent X is true, then conclusion Y is true". A rule encodes a piece of knowledge.

An inference network is a directed acyclic graph (DAG) in which each node is labeled with a proposition. The set of rules in a rule-based system, provided it contains no cyclic rules, can be represented by an inference network in which each arc corresponds to a rule with direction from antecedent to conclusion. The 'roots' of an inference network are labeled with variables about which the user is expected to supply information. The 'leaves' are variables of interest. Figure 1.1 is an inference network representing 4 rules: "if B, then A", "if C, then A", "if D, then A" and "if E, then D". B, C, E are roots and A is a leaf. Below, rule-based systems are considered in terms of inference networks.

1.2.2 The Method of Odds Likelihood Ratios

This subsection reviews the representation of uncertainty in rule-based expert systems by odds and likelihood ratios, and the propagation of evidence under this representation. Such an approach is used in the expert system PROSPECTOR [Duda et al. 76], which helps geologists evaluate the mineral potential of exploration sites. The formulation of Neapolitan [1990] is followed.
With the method of odds likelihood ratios, the uncertainty of a rule is represented by conditional probabilities of the form $p(X|Y)$, where X is the antecedent and Y is the conclusion. (In Bayesian networks the order of X and Y will be reversed.) Suppose for each rule "if X, then Y", the conditional probabilities $p(X|Y)$ and $p(X|\bar{Y})$ are determined; and for each node Y in the corresponding inference network, except for the roots, the prior probability $p(Y)$ is determined. Then the prior odds on Y is defined as $O(Y) = p(Y)/p(\bar{Y})$. The posterior odds on Y upon learning that X is true is defined as $O(Y|X) = p(Y|X)/p(\bar{Y}|X)$. The likelihood ratio of X is defined as $L(X|Y) = p(X|Y)/p(X|\bar{Y})$. Notice that $p(Y) = O(Y)/(1 + O(Y))$.

Figure 1.1: An inference network

With the above definitions, evidence propagation can be illustrated with the example in Figure 1.2. As in the figure, likelihood ratios are stored at each arc, and prior probabilities are stored at each node except for roots. If one has evidence that E is true, then $O(D|E) = L(E|D)\,O(D)$. To propagate the evidence to A in a simple way, it is assumed that A and E are conditionally independent given D. Then $p(A|E) = p(A|D)\,p(D|E) + p(A|\bar{D})\,p(\bar{D}|E)$. Note that the assumption is not valid if multiple paths exist from E to A.

If, in another case, one has evidence that B and C are true, one would like to know $O(A|BC)$, where the concatenation implies 'AND'. Assume that B and C are conditionally independent given A; then $O(A|BC) = L(B|A)\,L(C|A)\,O(A)$. Again, the assumption is not valid in a multiply connected network.

Figure 1.2: An inference net using the method of odds likelihood ratios

Several limitations of the method of odds likelihood ratios have been reviewed by Neapolitan [1990]:

1. The method is restricted to singly connected networks. Many application domains cannot be represented.

2. An inference network with odds and likelihood ratios associated can only be used for reasoning in the designed direction but not in the opposite direction.

3. For all the nodes except roots, prior probabilities are to be specified. Since priors are population specific, they are difficult to ascertain. A medical expert system constructed using data from one clinic might not be deployable in another clinic because the population there is different.

1.2.3 MYCIN certainty factor

The most widely used uncertainty calculus in rule-based systems is called the certainty factor, and it was originally used in MYCIN [Buchanan and Shortliffe 84]. General probabilistic reasoning based on probability theory required the specification of an exponentially large amount of data, which led to intractable computation associated with belief updating [Szolovits and Pauker 78, Buchanan and Shortliffe 84]. For example, a full joint probability distribution over a domain with n binary variables requires specification of $2^n - 1$ parameters. The parameters had to be updated when new evidence became available. The primary goal in creating the MYCIN certainty factor was to provide a method to avoid this difficulty.

In a rule-based system using certainty factors for reasoning under uncertainty, a rule "if X, then Y" is associated with a certainty factor $CF(Y, X) \in [-1, 1]$ to represent the change in belief about Y given the verity of X. $CF(Y, X) > 0$ corresponds to an increase in belief, $CF(Y, X) < 0$ corresponds to a decrease in belief, and $CF(Y, X) = 0$ corresponds to no change in belief. Representing a rule-based system by an inference network, the certainty factors are stored at arcs as in Figure 1.3.

Figure 1.3: An inference net using the MYCIN certainty factors

The MYCIN certainty factor model provides a set of rules for propagating uncertainty through an inference network. Figure 1.3 illustrates sequential and parallel combination in an inference network.
If one has evidence that E is true, then the certainty factors CF(D,E) and CF(A,D) can be combined to give the certainty factor CF(A,E) by sequential combination

$$CF(A,E) = \begin{cases} CF(D,E)\,CF(A,D) & CF(D,E) > 0 \\ -CF(D,E)\,CF(A,\bar{D}) & CF(D,E) < 0 \end{cases}$$

If instead one has evidence that B and C are true, then the certainty factor CF(A,BC) is given by parallel combination

$$CF(A,BC) = \begin{cases} CF(A,B) + CF(A,C)\,(1 - CF(A,B)) & CF(A,B)\,CF(A,C) \ge 0 \\ \dfrac{CF(A,B) + CF(A,C)}{1 - \min(|CF(A,B)|,\,|CF(A,C)|)} & CF(A,B)\,CF(A,C) < 0 \end{cases}$$

The calculus of the MYCIN certainty factor has many desirable properties which would be expected from an uncertain reasoning technique. However, in the original work on the MYCIN certainty factor, there was no operational definition of a certainty factor [Heckerman 86, Neapolitan 90]. That is, the definition of a certainty factor does not prescribe a method for determining a certainty factor. Without an operational definition there is no way of knowing whether two experts mean different things when they assign different certainty factors to the same rule. Heckerman [1986] gives probabilistic interpretations of the MYCIN certainty factor. He shows that these interpretations are monotonic transformations of the likelihood ratio reviewed in Section 1.2.2. Therefore the MYCIN certainty factor makes the same assumptions as the method of odds likelihood ratios and shares the same limitations [Neapolitan 90].

1.2.4 Dempster-Shafer Theory

Unlike probability theory, which assigns probabilities to every member of a set $\Psi$ of mutually exclusive and exhaustive alternatives and requires that the probabilities sum to unity, the Dempster-Shafer (D-S) theory [Shafer 76] assigns basic probability assignments (bpa) to every subset of $\Psi$ (member of $2^{\Psi}$) and requires that the bpas sum to unity. For a proposition C, the bpa assigned to $\{C\}$ and the bpa assigned to $\{\bar{C}\}$ do not have to sum to unity.
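The point about bpas can be made concrete with a minimal sketch. The frame labels, the mass values, and the helper name `is_valid_bpa` are all illustrative assumptions, not taken from the thesis or from any D-S library. The sketch only checks that masses over subsets sum to one while the mass on a proposition and on its complement sums to less than one.

```python
# Frame of two mutually exclusive, exhaustive alternatives for a proposition C.
# The labels and the masses below are hypothetical, chosen only for illustration.
FRAME = frozenset({"C", "not C"})


def is_valid_bpa(m, frame, tol=1e-9):
    """Return True if m assigns non-negative mass to subsets of frame, summing to 1."""
    for subset, mass in m.items():
        if not subset <= frame or mass < 0:
            return False
    # Zero mass on the empty set is the usual D-S convention (an assumption here,
    # since the text above does not state it).
    if m.get(frozenset(), 0.0) > tol:
        return False
    return abs(sum(m.values()) - 1.0) <= tol


bpa = {
    frozenset({"C"}): 0.6,      # mass committed to C
    frozenset({"not C"}): 0.1,  # mass committed to the complement of C
    FRAME: 0.3,                 # mass left on the whole frame
}

committed = bpa[frozenset({"C"})] + bpa[frozenset({"not C"})]
uncommitted = bpa[FRAME]
```

Here `committed` is about 0.7 rather than 1, and the remaining mass of about 0.3 sits on the whole frame instead of on either alternative.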
Thus D-S theory allows some of the probability to be unassigned, that is, it accepts an incomplete probabilistic model when some parameters (either prior probabilities or conditional probabilities) are missing [Pearl 88]. In this sense, D-S theory is an extension of probability theory [Neapolitan 90]. Medical diagnosis involves repetition, and it is possible to obtain a complete probability model. The model may not be accurate and further refinement may be required (Chapter 6).

Unlike probability theory, which calculates the conditional probability that Y is true given evidence X, the D-S approach calculates the probability that the proposition Y is provable given the evidence X and given that X is consistent. Pearl [1988] argues that D-S theory offers a better representation when the task is one of synthesis (e.g., the class scheduling problem) where external constraints are imposed and the concern centers on issues of possibility and necessity; he also argues that probability theory is more suitable for diagnosis.

A pragmatic advantage of probability theory over D-S theory is that it is well founded and well known. An expert system based on a well known theory will certainly benefit in knowledge acquisition and in communicating with users.

1.2.5 Fuzzy Sets

Fuzzy set theory [Zadeh 65] deals with propositions which have vague meaning, such as "the symptom is severe" or "the median nerve conduction velocity is normal". It associates a real number from [0, 1] with the membership of a particular element in a set. For example, if a patient has a median nerve conduction velocity of 58 m/sec, the result has membership 1 in the set of 'normal' results. If the velocity is 44 m/sec, the result has membership 0 in the 'normal' set. When the velocity value is 50 m/sec, the result has a partial membership, say, 0.7 in the 'normal' set.
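A membership function matching this example can be sketched as follows. The text fixes only three values (membership 0 at 44 m/sec, "say" 0.7 at 50 m/sec, and 1 at 58 m/sec); the piecewise-linear interpolation between them, the clamping outside that range, and the name `normal_membership` are assumptions made purely for illustration.

```python
from bisect import bisect_right

# (velocity in m/sec, membership in the 'normal' set), taken from the example above.
# Straight-line interpolation between these points is an assumption of this sketch.
ANCHORS = [(44.0, 0.0), (50.0, 0.7), (58.0, 1.0)]


def normal_membership(velocity, anchors=ANCHORS):
    """Degree of membership of a conduction velocity in the fuzzy set 'normal'."""
    xs = [x for x, _ in anchors]
    if velocity <= xs[0]:
        return anchors[0][1]   # at or below the lowest anchor: not normal
    if velocity >= xs[-1]:
        return anchors[-1][1]  # at or above the highest anchor: fully normal
    i = bisect_right(xs, velocity)  # index of the first anchor above velocity
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    return y0 + (y1 - y0) * (velocity - x0) / (x1 - x0)
```

Under these assumptions a velocity of 47 m/sec would receive a membership between 0 and 0.7. The number expresses how well the value fits the vague term 'normal', not a probability that the result is normal.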
Some researchers [Pearl 88, Neapolitan 90] view fuzzy set theory as addressing a fundamentally different class of problems than those addressed by probability theory, certainty factors and the Dempster-Shafer theory. All but fuzzy set theory deal with well defined propositions which are definitely either true or false. One is simply uncertain as to the outcome. Cheeseman [1986, 1988a, 1988b] makes a strong claim that fuzzy sets are unnecessary (for representing and reasoning about uncertainty, including vagueness) and that probability theory is all that is required. He shows how probability theory can solve the problems that the fuzzy approaches claim probability cannot solve.

1.2.6 Limitations of Rule-Based Systems

Sections 1.2.2 and 1.2.3 reviewed 2 methods for reasoning under uncertainty in rule-based systems. This subsection concentrates on the questions: Are rule-based systems suitable for probable reasoning? In which situations are they an appropriate knowledge representation?

The basic principle of rule-based systems is modularity. A rule "if X, then Y" has the procedural interpretation: "If X is in the knowledge base, then regardless of what other things the knowledge base contains and regardless of how X was derived, add Y to the knowledge base" [Pearl 88]. The attractiveness of rule-based systems is high efficiency in inference. This stems from modularity. However, the principle is valid only if the domain is certain. That is, rule-based systems are appropriate for applications involving only categorical (as defined by Szolovits and Pauker [1978]) reasoning [Neapolitan 90].

When reasoning under uncertainty, the rule "if X, then Y with uncertainty w" reads as: "If the certainty of X undergoes a change $\delta x$, then regardless of what other things the knowledge base contains and regardless of how $\delta x$ was triggered, modify the current certainty of Y by some amount $\delta y$, which may depend on w, on $\delta x$, and on the current certainty of Y" [Pearl 88].
Pearl discusses three major problems resulting from the principle of modularity in probable reasoning.

• Improper handling of bidirectional inferences. This is a restatement of the third limitation in section 1.2.2. If X implies Y, then verification of Y makes X more credible. Thus the rule “X implies Y” should be used in two different directions: deductive - from antecedent to conclusion, and abductive - from conclusion to antecedent. However, rule-based systems require that the abductive inference be stated explicitly by another rule and, even worse, that the original rule be removed. Otherwise, a cycle would be created.

• Difficulties in retracting conclusions. Two rules “If the ground is wet then it rained” and “If the sprinkler was on then the ground is wet” may be contained in a system. Suppose “the ground is wet” is found. The system will conclude “it rained” by the first rule. But if later “the sprinkler was on” is found, the certainty of “it rained” should be decreased significantly. However, this cannot be implemented naturally in a rule-based system.

• Improper treatment of correlated sources of evidence. Recall that both the odds likelihood ratio method and the certainty factor method assume singly connected inference networks. Rule-based systems respond only to the degrees of certainty of antecedents and not to the origins of these degrees. As a result, the same conclusions will be made whether the degree of certainty originates from identical or independent sources of information.

Heckerman and Horvitz [1987] analyze the inexpressiveness of rule-based systems.

• Only binary variables (proposition variables) are allowed. When multi-valued random variables are needed, they have to be broken down into several binary ones and the naturalness of representation is lost.

• Multiple causes. When multiple causes exist, they are represented by an exponential number of binary variables, and the representation cannot accommodate different evidence patterns.

With these problems in mind, one must conclude that generally rule-based systems are not appropriate for applications that require probable reasoning.

1.3 Representing Probable Reasoning In Bayesian Networks

This section briefly introduces Bayesian networks as a natural, concise knowledge representation method and a consistent inference formalism for building expert systems. The substantial advances of Bayesian network technologies in recent years are reviewed.

1.3.1 Basic Concepts of Probability Theory

Two major interpretations of probability exist: the objective or frequentist interpretation, and the subjective or Bayesian interpretation [Savage 61, Shafer 90, Neapolitan 90]. The former defines the probability of an event E as the limiting frequency of E in repeated experiments. The latter interprets probability as the degree of belief held by a person. In building a medical expert system based on probability theory, one usually has to take a medical expert as the major source of the probabilistic information. Thus the Bayesian interpretation is naturally adopted.

Although philosophically the two interpretations represent two different camps, practically they are not significantly different as far as physicians’ belief in uncertain relations in medicine is concerned. The medical literature substantiates the fact that many physicians believe that probabilities, according to the frequentist’s definition, exist, and that the likelihoods which they assign are estimates of these probabilities based on their experiences with frequencies [Neapolitan 90]. My personal experience with neurologists in building the PAINULIM expert system is also in agreement with this viewpoint. The Bayesian formulation of probability will be adopted in this thesis.
In Chapter 6, an interaction between the two interpretations is utilized to improve the accuracy of knowledge representation.

The framework of Bayesian probability theory consists of a set of axioms which describes constraints among a collection of probabilities provided by a given person [Pearl 88, Heckerman 90b].

\[ 0 \le p(A \mid C) \le 1 \]
\[ p(C \mid C) = 1 \]
If A and B are mutually exclusive, then
\[ p(A \text{ or } B \mid C) = p(A \mid C) + p(B \mid C) \quad \text{(sum rule)} \]
\[ p(AB \mid C) = p(A \mid BC)\, p(B \mid C) \quad \text{(product rule)} \]

where A, B are arbitrary events and C represents the background knowledge of the person who provides the probability. The probability $p(A \mid C)$ represents a person’s belief in A given his background knowledge C. The following rules, to be used in the thesis, can be proved from the above axioms.

\[ p(\neg A \mid C) = 1 - p(A \mid C) \quad \text{(negation rule)} \]
If $B_1, \ldots, B_n$ are mutually exclusive and exhaustive, then
\[ p(AB_1 \mid C) + \cdots + p(AB_n \mid C) = p(A \mid C) \quad \text{(marginalization)} \]
\[ p(A \mid BC) = p(B \mid AC)\, p(A \mid C) / p(B \mid C) \quad \text{(Bayes theorem)} \]

1.3.2 Inference Patterns Embedded In Probability Theory

As a well founded theory, probability theory embeds many intuitive inference patterns, which renders it a suitable uncertainty calculus for probable reasoning. The following briefly discusses some patterns used in medical diagnosis.

Bidirectional reasoning. Probability allows both predictive and abductive (diagnostic) reasoning. Carpal tunnel syndrome (cts) is a common cause of pain in the forearm. If one has a probability p(painful forearm | cts) = 0.75 and also knows John has cts, then one would predict with high confidence that John suffers from pain in the forearm. On the other hand, if one also has p(cts) and p(painful forearm), and knows Mary does suffer from pain in the forearm, then one can compute p(cts | painful forearm) by applying Bayes theorem. The diagnosis will be based on p(cts | painful forearm).

Context sensitivity. Both cts and thoracic outlet syndrome (tos) cause pain in the forearm. The pain in the forearm is equally likely from either cause.
Cts has a much higher incidence than tos, say p(cts) = 0.25 and p(tos) = 0.02 in a clinic. A patient with a painful forearm is about 12 times more likely to have cts than tos. But if later, through other means, the patient is found to have tos, then the painful forearm can be explained by tos and cts becomes much less likely. Probability theory yields p(cts | painful forearm) = HIGH and p(cts | painful forearm & tos) = LOW. That is, the original belief in cts is retracted in the new context.

Explaining away. In the above example, the confirmation of tos makes the alternative explanation cts for the painful forearm less credible. That is, tos explains the painful forearm away from cts.

Dynamic dependence. Cts and tos are independent diseases, i.e., knowing only whether a patient has cts tells one nothing about whether he has tos. However, knowing that a patient has a painful forearm renders the two diseases related in the diagnostic process. Further evidence supporting one of them will decrease the likelihood of the other.

Readers are referred to Pearl [1988] for an elaborate discussion of the relationship between probability theory and probable reasoning.

1.3.3 Bayesian Networks

Bayesian networks are known in the literature as Bayesian belief networks, belief networks, causal networks, influence diagrams, etc. They have a history in decision analysis [Miller et al. 76, Howard and Matheson 84]. They have been actively studied for probable reasoning in AI for about a decade.

Formally, a Bayesian network [Pearl 88] is a triplet (N, E, P).

• The domain N is a set of nodes, each of which is labeled with a random variable characterized by a set of mutually exclusive and exhaustive outcomes. ‘Node’ and ‘variable’ are used interchangeably in the context of Bayesian nets.

• E is a set of arcs such that (N, E) is a DAG. The arcs signify the existence of direct causal influences between the linked variables. The basic dependence assumption embedded in Bayesian nets is that a variable is independent of its non-descendants given its parents.

Uppercase letters (possibly subscripted) at the beginning of the alphabet are used to denote variables, corresponding script letters are used to denote their sample spaces, and corresponding lowercase letters with subscripts are used to denote their outcomes. For example, in the binary case, a variable $A$ has its sample space $\mathcal{A} = \{a_1, a_2\}$, and $H$ has its sample space $\mathcal{H} = \{h_1, h_2\}$. Uppercase letters towards the end of the alphabet are used to denote a set of variables. If $X \subseteq N$ is a set of variables, the space $\Psi(X)$ of $X$ is the cross product of the sample spaces of the variables, $\times_{A \in X} \mathcal{A}$. $\pi_i$ is used to denote the set of parent variables of $A_i \in N$.

• P is a joint probability distribution quantifying the strengths of the causal influences signified by the arcs. P is specified by, for each $A_i \in N$, the distribution of the random variable labeled at $A_i$ conditioned on the values of $A_i$’s parents $\pi_i$, in the form of a conditional probability table $p(A_i \mid \pi_i)$. $p(A_i \mid \pi_i)$ is a normalized function mapping $\Psi(\{A_i\} \cup \pi_i)$ to [0, 1]. The joint probability distribution P is

\[ P = p(A_1 \ldots A_n) = \prod_i p(A_i \mid \pi_i) \]

For medical applications, nodes in a Bayesian net represent disease hypotheses, symptoms, laboratory results, etc. Arcs signify the causal relations between them. The probability distribution quantifies the strengths of these relations. For example, “cts (C) often causes pain in forearm (F)” can be represented by an arc from C to F and a probability $p(f_1 \mid c_1) = 0.75$.

The term ‘causal’ above is to be interpreted in a broad sense. Correspondingly, arcs in a Bayesian net could go in either direction. But in medical applications, it is usual to consider diseases as causes and symptoms as effects. Shachter and Heckerman [1987] show that if arcs in Bayesian nets are directed from diseases to symptoms, the DAG construction is usually easier and the resultant net topologies are usually simpler. Furthermore, conditional probabilities of symptoms given diseases are related to the symptom-causing mechanisms of diseases, and are usually independent of the patient population. Thus only the priors of diseases need to be changed when an expert system is built at one clinic and used at another location with a different patient population. In contrast, both the priors for symptoms and the probabilities of diseases given symptoms are patient population dependent. Directing arcs from symptoms to diseases will require total revision of the probability values in a Bayesian net when the system is to be used with a different population.

As stated above, Bayesian networks combine probability theory with graphical representation of domain models. The necessity of this combination is two-fold. For one thing, encoding a domain model with a DAG conveys directly the dependence and independence assumptions made about the domain. The DAG facilitates knowledge acquisition and makes the representation transparent. For another, graphical models allow quick identification of dependencies by examining DAGs locally. Therefore efficient belief propagation is possible [Pearl 88] and the difficulty associated with general probabilistic reasoning (section 1.2.3) can be avoided.

The study of inference in Bayesian networks depends heavily on the study of DAG topologies. In the thesis, when the topology of a Bayesian network is mentioned, it always means the topology of the corresponding DAG.

Probabilities associated with Bayesian networks have different meanings depending on where they are stored in the network and on the stage of inference. Some convention is appropriate to avoid confusion.
Before any inference takes place, the probabilities associated with root nodes (with zero in-degree) are called prior probabilities or priors. The probabilities associated with non-root nodes are called conditional probabilities. After the evidence is available, the updated probabilities conditioned on evidence are called posterior probabilities. When it is clear from the context, they are just called probabilities.

1.3.4 Propagating Belief in Bayesian Networks

Inference in a Bayesian network is essentially a problem of propagating changes in belief and computing posterior probabilities as new evidence is obtained. Since the appeal of Bayesian networks for representing uncertain knowledge has become increasingly apparent over the last few years [Howard and Matheson 84, Pearl 86], there has been a proliferation of research seeking to develop new and more efficient inference algorithms.

Two classes of approaches can be identified. One class explores approximation using stochastic simulation and Monte Carlo schemes. A good review of this class is given by Henrion [1990]. The other class explores specificity in computing exact probabilities. Since this thesis research adopts the exact method, the major advances in this class are reviewed below.

The first breakthrough in efficient probabilistic reasoning in Bayesian networks was made by Kim and Pearl [1983]. They develop an algorithm, applicable to singly connected Bayesian networks (Figure 1.4), for propagating the effect of new observations. Each node in the network obtains messages from its parents (π messages) and its children (λ messages). These messages represent all the evidence from the portion of the network lying beyond these parents and children. The single-connectedness guarantees that the information in each message to a node is independent, and so local updating can be employed. The algorithm’s complexity is linear in the number of variables.
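To make "computing posterior probabilities as new evidence is obtained" concrete, the following is a minimal sketch that does so by brute-force enumeration over the factored joint distribution of section 1.3.3, using the cts/tos example of section 1.3.2. It is not any of the algorithms reviewed in this section, and it is exponential in the number of variables. The priors 0.25 and 0.02 and the value 0.75 come from the text; the way the two diseases combine (a noisy-OR with no other cause of pain) and all function names are assumptions made here for illustration.

```python
from itertools import product

P_CTS = 0.25    # prior of carpal tunnel syndrome (from the text)
P_TOS = 0.02    # prior of thoracic outlet syndrome (from the text)
P_CAUSE = 0.75  # p(painful forearm | one disease alone); same for both (assumed)


def p_pain(cts, tos):
    """p(painful forearm | cts, tos) under an assumed noisy-OR with no leak."""
    p_no_pain = 1.0
    if cts:
        p_no_pain *= 1.0 - P_CAUSE
    if tos:
        p_no_pain *= 1.0 - P_CAUSE
    return 1.0 - p_no_pain


def joint(cts, tos, pain):
    """Joint probability as the product of each node's table given its parents."""
    p = (P_CTS if cts else 1.0 - P_CTS) * (P_TOS if tos else 1.0 - P_TOS)
    pf = p_pain(cts, tos)
    return p * (pf if pain else 1.0 - pf)


def posterior_cts(**evidence):
    """p(cts | evidence), summing the joint over all consistent outcomes."""
    num = den = 0.0
    for cts, tos, pain in product([True, False], repeat=3):
        world = {"cts": cts, "tos": tos, "pain": pain}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # outcome contradicts the evidence
        p = joint(cts, tos, pain)
        den += p
        if cts:
            num += p
    return num / den


print(posterior_cts())                      # prior belief in cts
print(posterior_cts(pain=True))             # belief after observing the pain
print(posterior_cts(pain=True, tos=True))   # belief once tos is confirmed
```

Under these assumptions the belief in cts rises sharply once the painful forearm is observed and falls back towards its prior once tos is confirmed, which is the explaining-away pattern. The exact propagation methods reviewed in this section aim at the same posteriors without enumerating the whole joint distribution.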
Heckerman [1990] provides the QUICKSCORE algorithm, which addresses networks of diameter 1 such as the one in Figure 1.5. The upper level consists of disease variables and the lower level consists of ‘finding’ variables. Note that the net is multiply connected (Appendix A). The algorithm makes three assumptions. All variables are binary. Diseases are marginally independent and findings are conditionally independent given diseases². Diseases interact to produce findings via a noisy-OR-gate. The time complexity of QUICKSCORE is $O(n m_1 2^{m_2})$, where $n$ is the number of diseases, $m_1$ is the number of negative findings, and $m_2$ is the number of positive findings.

Figure 1.4: The DAG of a singly connected Bayesian network

Figure 1.5: The DAG of a Bayesian network with diameter 1

² This implies that the net topology is of diameter 1.

Unfortunately, many applications cannot be represented properly by a singly connected net or by a net of diameter 1. An example of a general multiply connected Bayesian network is given in Figure 1.6. Pearl [1986] presents loop cutset conditioning as an indirect method for inference in multiply connected networks. Selected variables (the loop cutset) are instantiated to cut open all loops such that the resultant singly connected networks can be solved by λ-π message passing, and the results from the instantiations are combined.

Figure 1.6: The DAG of a general multiply connected Bayesian network

Shachter [1986, 1988a] applies a sequence of operations called arc reversal and barren node reduction to an arbitrary multiply connected Bayesian net. The process continues until the network contains only those nodes whose posterior distributions are desired, with the evidence nodes as immediate predecessors. Baker and Boult [1990] extend the barren node reduction method to prune a Bayesian network relative to each query instance such that a saving in computational complexity is obtained when evidence comes in a batch.

Lauritzen and Spiegelhalter [1988] describe an algorithm based on a reformulation of a multiply connected network. The DAG is moralized and triangulated. Then cliques are identified to form a clique hypergraph of the DAG. The cliques are finally organized into a directed tree of cliques³ which satisfies the running intersection property. They provide an algorithm for the propagation of evidence within this secondary representation. The complexity of the algorithm is $O(p\,r^m)$, where $p$ is the number of cliques in the clique list, $r$ is the maximum number of alternatives for a variable in the network, and $m$ is the maximum number of variables in a clique. Pearl [1988] proposes a clustering method with many similarities, but propagates evidence in a secondary directed tree by λ-π message passing.

³ The ‘directed’ property is indicated by Henrion [1990], Neapolitan [1990], and Shachter [1988b], since the cliques are ordered and this order is to be followed in belief propagation.

The directed tree approach and the following junction tree approach both have the advantage of trading compile time for run time in ‘reusable systems’. Here reusable systems mean those systems where the domain knowledge is captured once and is used for multiple cases, as opposed to decision systems which are created for decision making in a non-repeatable situation.

Jensen, Lauritzen, and Olesen [1990] further improve the directed clique tree approach by organizing the clique hypergraph into an (undirected) junction tree to allow more flexible computation. Similar work was done by Shafer and Shenoy [1988]. A close look at the junction tree approach is included in Chapter 4.

Chapter 2

THE FEASIBILITY OF FINITE TOTALLY ORDERED PROBABILITY ALGEBRA FOR PROBABLE REASONING

Medical diagnosis is characterized by probable reasoning, and a proper uncertainty calculus is thus crucial in knowledge representation for medical expert systems.
A finite calculus is plausible since it is close to human language in communicating uncertainty. An investigation into the feasibility of using finite totally ordered probability models for probable reasoning was conducted under Aleliunas’s Theory of Probabilistic Logic [Aleliunas 88]. In this investigation, the general form of the probability algebra of these models and the number of possible algebras given the size of the probability set are derived. Based on this analysis, the problems of denominator-indifference and ambiguity-generation that arise in reasoning by cases and abductive reasoning are identified. The investigation shows that a finite probability model will be of very limited usage. This highlights the importance of infinite totally ordered probability algebras, including probability theory, as uncertainty management formalisms for probable reasoning.

This chapter presents the major results of the investigation. The results are mainly taken from Xiang et al. [1991a]. Section 2.1 discusses the motivation and criteria of the investigation. Section 2.2 presents the mathematical structure of finite totally ordered probability models. Section 2.3 derives the inference rules under these models. Section 2.4 identifies the problems of these models both intuitively and quantitatively. Section 2.5 presents an experiment which exhaustively investigates models of a given size with an example.

2.1 Motivation

The investigation started in the early stage of this thesis research, when two engineering applications of AI in neurology, EEG analysis and neuromuscular diagnosis, were under consideration. Probable reasoning is the common feature of both domains. When consulted about the formalism for representing uncertainty in EEG analysis, the experts claimed that they did not use numbers, but rather used a small number of terms to describe uncertainty. This motivated a desire for a formal finite non-numerical uncertainty calculus.
In such a calculus, the domain expert’s vocabulary about uncertainty could be used directly in encoding knowledge and in reasoning about uncertain information. This would facilitate knowledge acquisition and make the system’s diagnostic suggestions and explanations more understandable.

There were few known finite calculi for general uncertainty management [Pearl 89, Halpern and Rabin 87], but Aleliunas’s probabilistic logic [Aleliunas 88] was explored because it seemed to be based on clear intuitions, and to allow measures of belief (probability values) to be summarized by values other than just real numbers.

Aleliunas [1988] presents an axiomatization for a theory of rational belief, the Theory of Probabilistic Logic (TPL). It generalizes classical probability theory to accommodate a variety of probability values rather than just [0, 1]. According to the theory, probabilistic logic is a scheme for relating a body of evidence to a potential conclusion (a hypothesis) in a rational way, using probabilities as degrees of belief. ‘$p(P \mid Q)$’ stands for the conditional probability of proposition P given the evidence Q, where P and Q are sentences of some formal language L consisting of boolean combinations of propositions. TPL is chiefly concerned with identifying the characteristics of a family F of functions from L × L to the set of probabilities P. The probability values P are not constrained to be just [0, 1], but can be any values that conform to a set of reasonably intuitive axioms (Appendix B.1).

The semantics of TPL is given by ‘possible worlds’. Each proposition P is associated with a set of situations or possible worlds S(P) in which P holds. Given Q as evidence, the conditional probability $p(P \mid Q)$, whose value ranges over the set P, is some measure of the fraction of the set S(Q) that is occupied by the subset S(P&Q). TPL provides minimum constraints for a rational belief model.
For the application domain in question the following criteria are thought desirable:

R1 The domain experts do not express and communicate uncertainty using numerical values in their practice. Their language consists of a small set of terms such as ‘likely’ and ‘possibly’, used to describe the uncertainty in their domain. Thus a finite set of probability values is required.

R2 Any two probability values in a chosen model should be comparable. An essential task of a medical diagnostic system is to differentiate between a set of competing diagnoses given a patient’s symptoms and history. Totally ordered probabilities are needed in order to allow for totally ordered decisions when one has to act on the results of the diagnoses.

R3 Inference based on a TPL model should generate empirically intuitive results. That is, the inference outcomes generated with such a model should reflect, as far as possible, the reasonable outcomes reached by a human expert.

Although these criteria are formed from the point of view of the application in question, it is believed that they are shared by many automated reasoning systems making decisions under uncertainty. Based on the first two criteria, the focus is placed on finite totally ordered probability models.

2.2 Finite Totally Ordered Probability Algebras

2.2.1 Characterization

To investigate the mathematical structure (probability algebra) of the probability space, the characterization of any finite totally ordered probability algebra under TPL axioms [Aleliunas 88] is given in the proposition below. For more about universal algebra, see Burris and Sankappanavar [1981] and Kuczkowski and Gersting [1977]. This proposition is a restriction of the Probability Algebra Theorem [Aleliunas 86] (Appendix B.2) to finite totally ordered sets. The smallest element of P is denoted as 0, and the largest element of P as 1. There are a finite number of other values between 0 and 1.
Proposition 1 A probability algebra defined on a totally ordered finite set P with ordering relation ‘<’ satisfies TPL axioms if

Cond1 An order preserving binary operation ‘*’ (product) is well defined and closed on P.

Cond2 ‘*’ is commutative, i.e., $(\forall p, q \in P)\; p * q = q * p$.

Cond3 ‘*’ is associative, i.e., $(\forall p, q, r \in P)\; p * (q * r) = (p * q) * r$.

Cond4 $(\forall p, q, r \in P)\; (p * q = r) \Rightarrow (r \le \min(p, q))$.

Cond5 No non-trivial zero, i.e., $(\forall p, q \in P)\; p * q = 0 \Rightarrow (p = 0 \lor q = 0)$.

Cond6 $(\forall p, q \in P)\; p < q \Rightarrow (\exists r \in P)\; p = r * q$. The solution will be denoted as $r = p/q$.

Cond7 $(\forall p \in P)\; 0 \le p \le 1$.

Cond8 $(\forall p \in P)\; p * 1 = p$.

Cond9 A monotone decreasing inverse function $i[\cdot]$ is well defined and closed on P, i.e., $(\forall p < q \in P)\; i[p] > i[q]$.

Cond10 $(\forall p \in P)\; i[i[p]] = p$.

Proof: One needs to show the equivalence of Cond1, ..., Cond10 to T1, ..., T7 of the Probability Algebra Theorem (Appendix B.2). Cond4 and Cond10 are not obvious in the theorem and they are included in proposition 1 for the convenience of later use. They are proved here from TPL axioms AX1, ..., AX12 (Appendix B.1) directly.

Cond1 and Cond3 are equivalent to T1. Cond2 is equivalent to T5.

Proof of Cond4. With Cond2, it suffices to prove $r \le p$. By AX10, let $p = f(B \mid A)$ and $q = f(A \mid i)$. Then

\[
\begin{aligned}
p * q &= f(B \mid A) * f(A \mid i) \\
      &\le f(B \mid A) * f(A \mid A) && (f(A \mid i) \le f(A \mid A)) \\
      &= f(B \& A \mid A) && \text{(AX8)} \\
      &= f(B \mid A) && \text{(AX5)}
\end{aligned}
\]

Cond5 is equivalent to T6. Cond6 is equivalent to T7. Cond7 is equivalent to T4. Cond8 and Cond4 are equivalent to T2. Cond9 is equivalent to T3. Cond10 is implied by AX3. □

From now on, any Finite Totally Ordered Probability Algebra satisfying proposition 1 is referred to as a legal FTOPA. The general form of all legal FTOPAs is derived in section 2.2.2.

2.2.2 Mathematical Structure

Here only those probability algebras with at least 3 elements are considered¹. A finite totally ordered probability set with size $n$ is denoted as $P = \{e_1, e_2, \ldots, e_{n-1}, e_n\}$, where $1 = e_1 > e_2 > \cdots > e_{n-1} > e_n = 0$. For example, $P = \{e_1, e_2, e_3, e_4\}$ could stand for {certain, likely, unlikely, impossible}. This linguistic interpretation is left open.

¹ A probability algebra with 2 elements is equivalent to propositional logic [Aleliunas 87].

The uniqueness of the inverse function $i[\cdot]$ of any legal FTOPA is given by the following lemma.

Lemma 1 For a legal FTOPA with size $n$, the inverse is uniquely defined as $i[e_k] = e_{n+1-k}$ $(1 \le k \le n)$.

Proof: By Cond10, the inverse is symmetric and it suffices to prove the lemma for $k \le n/2$.

For $k = 1$, $i[e_1] \ge e_n$ by Cond7. Assume $i[e_1] > e_n$. Then $i[e_n] > i[i[e_1]] = e_1$, which is contradictory to Cond7. Hence $i[e_1] = e_n$.

Suppose $i[e_j] = e_{n+1-j}$ for $j \le n/2$. For $k = j + 1 < n/2 + 1$, assume $i[e_{j+1}] < e_{n-j}$, which implies $i[e_{j+1}] \le e_{n-j+1}$. By Cond9, this further implies $e_{j+1} \ge i[e_{n-j+1}] = e_j$, which is contradictory to the ordering $e_{j+1} < e_j$. Thus $i[e_{j+1}] \ge e_{n-j}$.

Assume $i[e_{j+1}] > e_{n-j}$. Then $e_{j+1} < i[e_{n-j}]$, i.e., $i[e_{n-j}] \ge e_j$. By Cond9, $e_{n-j} \le i[e_j] = e_{n-j+1}$, which is contradictory to $e_{n-j} > e_{n-j+1}$. □

Thus, given the size of a legal FTOPA, only the choice of the product function is left.

A probability $p \in P$ is idempotent if $p * p = p$. Idempotent elements play important roles in defining probability algebras, as will be shown in a moment. The following two lemmas describe properties of idempotent elements. Aleliunas [1986] gives a statement similar to that in lemma 3.

Lemma 2 Any legal FTOPA has at least 3 idempotent elements, namely $e_1$, $e_{n-1}$ and $e_n$.

Proof: The proof is trivial for $e_1$ and $e_n$. $e_{n-1} * e_{n-1} \le e_{n-1}$ by Cond4 and $e_{n-1} * e_{n-1} > e_n$ by Cond5. Therefore $e_{n-1} * e_{n-1} = e_{n-1}$. □

Lemma 3 For any legal FTOPA, if $p \in P$ is idempotent, then $(\forall q \in P)\; p * q = \min(p, q)$.

Proof: Assume $q > p$. Then $(\exists r \in P)\; r * q = p$. Therefore $p * p = r * (q * p) = p$, which implies $q * p \ge p$. But by Cond4, $q * p \le p$, which proves $q * p = p$.

If $q < p$, then $(\exists r \in P)\; r * p = q$. Then $q * p = r * (p * p) = r * p = q$. □

Proposition 2 For a finite totally ordered set with size $n \ge 3$, there exists only one legal FTOPA with 3 idempotent elements. The ‘*’ operation on it is defined as

\[
e_i * e_j =
\begin{cases}
e_n & \text{if } i \text{ or } j = n \\
e_{\min(i+j-1,\, n-1)} & \text{otherwise.}
\end{cases}
\]

Proof: Let $M_{n,k}$ denote a legal FTOPA with size $n$ and $k$ idempotent elements.² Let $a_{i,j}$ denote $e_i * e_j$. The following proves the proposition constructively.

² In general, for a pair of $n$ and $k$, there may be more than one legal FTOPA. Thus $M_{n,k}$ does not necessarily stand for a unique model characterized by $n$ and $k$.

(1) In the case of $i$ or $j = n$, the proposition holds due to lemma 3. By non-trivial zero, the zero part of the product table is entirely covered within this case.

(2) What is left is to prove the non-zero part of the product table (the second half of the product formula), which is bounded by the two idempotent elements $e_1$ and $e_{n-1}$. For the completeness of the product table, the zero parts are still included in the following tables although they are not relevant to the remaining proof.

\[
M_{3,3}:\quad
\begin{array}{c|ccc}
* & e_1 & e_2 & e_3 \\ \hline
e_1 & e_1 & e_2 & e_3 \\
e_2 & e_2 & e_2 & e_3 \\
e_3 & e_3 & e_3 & e_3
\end{array}
\qquad
M_{4,3}:\quad
\begin{array}{c|cccc}
* & e_1 & e_2 & e_3 & e_4 \\ \hline
e_1 & e_1 & e_2 & e_3 & e_4 \\
e_2 & e_2 & e_3 & e_3 & e_4 \\
e_3 & e_3 & e_3 & e_3 & e_4 \\
e_4 & e_4 & e_4 & e_4 & e_4
\end{array}
\]

For $M_{3,3}$ and $M_{4,3}$ the proposition holds (see the product tables). It is not difficult to check that they satisfy proposition 1 and that any change to these product tables will violate proposition 1 in one way or another.

Suppose a unique legal FTOPA $M_{m,3}$ exists with product defined as in the proposition. As for $M_{m+1,3}$ (table below), the product $a_{i,j}$ $(i + j \le m)$ should be constructed in the same way as in $M_{m,3}$, i.e., the second half of the product formula $e_i * e_j = e_{\min(i+j-1,\, m)} = e_{i+j-1}$ applies within this portion as it does in $M_{m,3}$. If this portion could be changed without violating proposition 1, the corresponding portion in $M_{m,3}$ could also be changed, which is contradictory to the uniqueness assumption for $M_{m,3}$. Further, one can show the uniqueness of $a_{i,j}$ for all $(i, j < m < i + j)$.

\[
M_{m+1,3}:\quad
\begin{array}{c|cccccccc}
* & e_1 & e_2 & e_3 & \cdots & e_{m-2} & e_{m-1} & e_m & e_{m+1} \\ \hline
e_1 & e_1 & e_2 & e_3 & \cdots & e_{m-2} & e_{m-1} & e_m & e_{m+1} \\
e_2 & e_2 & e_3 & e_4 & \cdots & e_{m-1} & a_{2,m-1} & e_m & e_{m+1} \\
e_3 & e_3 & e_4 & e_5 & \cdots & a_{3,m-2} & a_{3,m-1} & e_m & e_{m+1} \\
\vdots & \vdots & & & & & \vdots & \vdots & \vdots \\
e_{m-1} & e_{m-1} & ? & ? & \cdots & ? & ? & e_m & e_{m+1} \\
e_m & e_m & e_m & e_m & \cdots & e_m & e_m & e_m & e_{m+1} \\
e_{m+1} & e_{m+1} & e_{m+1} & e_{m+1} & \cdots & e_{m+1} & e_{m+1} & e_{m+1} & e_{m+1}
\end{array}
\]

Note: ‘?’ stands for product items to be chosen.

By associativity, one has

\[ (e_j * e_2) * e_{m-j} = e_{j+1} * e_{m-j} = a_{j+1,m-j} = e_j * (e_2 * e_{m-j}) = e_j * e_{m-j+1} = a_{j,m-j+1} \quad (2 \le j \le m-2) \tag{a} \]

Also one has

\[ e_2 * (e_i * e_{m-1}) = e_2 * a_{i,m-1} = (e_2 * e_i) * e_{m-1} = e_{i+1} * e_{m-1} = a_{i+1,m-1} \quad (2 \le i \le m-2) \tag{b} \]

From the order preserving property of ‘*’, one knows

\[ a_{i,j} = e_{m-1} \;\lor\; a_{i,j} = e_m \quad (i, j < m < i + j). \]

Suppose $a_{2,m-1} = e_{m-1}$. Then from (b), $e_2 * a_{2,m-1} = e_2 * e_{m-1} = a_{2,m-1} = e_{m-1} = a_{3,m-1}$. Similarly, and from commutativity and order preserving, one has

\[ a_{i,j} = e_{m-1} \quad (i, j < m < i + j). \]

This means that $e_{m-1}$ is also an idempotent element, which is contradictory to the assumption of 3 idempotent elements. Therefore $a_{2,m-1} = e_m$. Then from (a) and order preserving, one ends up with

\[ a_{i,j} = e_m \quad (i, j < m < i + j). \]
□

The second part of the above proof, for the product bounded by $e_1$ and $e_{n-1}$, does not involve the 0 element at all, as already stated. Thus for any legal FTOPA with more than 3 idempotent elements, the proposition holds for each diagonal block of its product table bounded by two adjacent idempotent elements. The non-diagonal part of the product table is totally determined by lemma 3 and the order preserving and solution existence properties. Thus one has the following theorem 1. Given proposition 2 and the above description, the proof is trivial.

Theorem 1 Given a finite totally ordered set $P = \{e_1, e_2, \ldots, e_n\}$ with ordering relation $e_1 > e_2 > \cdots > e_n$ and a set $I$ of indexes of all the idempotent elements on $P$, $I = \{i_1, i_2, \ldots, i_m\}$ where $i_1 < i_2 < \cdots < i_m$, there exists a unique legal FTOPA whose product function is defined as

\[
e_j * e_k =
\begin{cases}
\min(e_j, e_k) & \text{if } j \in I \text{ or } k \in I \\
e_{\min(j+k-i_l,\; i_{l+1})} & \text{if } i_l < j, k < i_{l+1} \\
e_k & \text{if } j < i_l < k
\end{cases}
\]

and whose inverse function is defined as

\[ i[e_k] = e_{n+1-k}. \]

Theorem 1 says that, given the set of idempotent elements, a legal FTOPA is totally defined. From theorem 1 and lemma 2 one can easily derive the following corollary.
Corollary 1 The number of all the possible legal FTOPA of size n 3 is Chapter 2. FINITE TOTALLY ORDERED PROBABILITY ALGEBRA 32 = where C is the number of combinations taking i elements out of m. Theorem 1 and corollary 1 provide the possibility of exhaustive investigation for any legal FTOPA of a given size. 2.2.3 Solution and Range Once a legal FTOPA is defined, its solution table is forced. Inverses to the operation * : P x P —* P will not be unique. For this reason, it is necessary to introduce a probability range denoted by [1, u] representing all the probability values between lower bound 1 and upper bound u. {l,uJ = {v Pjl v u} [v, v] is written as just v. One has the following corollary on single value probability solution. Its proof can be found in Appendix D. Corollary 2 Given a finite totally ordered set P = {e1,e2,. . . , e} with ordering relation e1 > e2 > ... > e and a set I of indexes of all the idempotent elements on P, I = {i1, 2, .. . , im} where i1 < j2 < •.. < m the solution function (multiple value) of a legal FTOPA is forced to be: Chapter 2. FINITE TOTALLY ORDERED PROBABILITY ALGEBRA 33 ek e.j if if if [ek if product and solution j=1 k = n,j n k > j, ii + 1 <k i1+ — 1, ii + 1 <ii+1 — 1 k = j, i1 < k < i1÷’ k = ii+1, i1 <j <l+l Ic =j = i1,k 1,n j i1 < Ic < i1+ tables for three legal FTOPAs of size 8 Definition 1 For any legal FTOPA, the product of two ranges [a, b] (a b) and [c, ci] (c < d) is defined as [a,bJ * [c,d] = {zIx E [a,b]&y [c,d]&z = x * y}. And the solution of above two ranges with additional constraint a d is defined as [a, b]/[c, d} = {zjdx [a, b]3y E [c, d]x = y * z}. One can prove the following proposition. Proposition 3 For any legal FTOPA, the product of two ranges [a, b} (a b) and [c, ci] (c < d) is if if if ek/e = [e1, ei] [ej1+,,e1+j_j] [ek, ei] In Appendix C.1, the are presented. The solution of two single valued probabilities may become a range which will par ticipate in further manipulation. 
Thus the product and solution of ranges should be considered before we can manipulate uncertainty in an inference chain. By Proposition 3, the product of two ranges $[a,b]$ and $[c,d]$ is

$$[a,b] * [c,d] = [a*c,\; b*d].$$

And the solution of the above two ranges, with the additional constraint $a \le d$, is

$$[a,b]/[c,d] = \begin{cases} [LB(a/d),\; UB(b/c)] & \text{if } b \le c \\ [LB(a/d),\; e_1] & \text{if } b > c \end{cases}$$

where $LB$ and $UB$ are the lower and upper bounds of ranges.

It should be noted that, in general, the product and solution of legal FTOPAs do not commute with each other. For example, in model $M_{8,8}$,

$$(e_2 * e_5)/e_5 = [e_5, e_1] \ne e_2 * (e_5/e_5) = [e_5, e_2].$$

Thus the order of product and solution in the evaluation of the conditional probability

$$p(A \mid B \& C) = p(A \& B \mid C)/p(B \mid C) = (p(B \mid A \& C) * p(A \mid C))/p(B \mid C)$$

cannot be changed arbitrarily.

2.3 Bayes Theorem and Reasoning by Case

Having derived the mathematical structure of legal finite totally ordered probability models, one needs deductive rules. In this investigation, the DAG and the conditional independence assumption made in Bayesian nets are adopted as the qualitative part of the knowledge representation. Instead of using probability theory to represent uncertainty as Bayesian nets do, the legal FTOPAs are used. Call the resulting overall representation a quasi-Bayesian net. The quasi-Bayesian nets are implemented by PROLOG programs with the approach described by Poole and Neufeld [1988]. The inference rules thus required are Bayes theorem and reasoning by cases.

Bayes theorem provides a way of determining the possibility of certain causes from the observation of effects.³ It takes the form:

$$p(P \mid Q \& C) = p(Q \mid P \& C) * p(P \mid C)/p(Q \mid C)$$

which is the same as the one introduced in section 1.3.1 for probability theory. Here, however, the calculation must be carried out in the order in which it is written, as discussed in section 2.2.3.
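The non-commutativity example of section 2.2.3, $(e_2 * e_5)/e_5$ versus $e_2 * (e_5/e_5)$, shows why this order matters, and it can be checked mechanically. The following is only a minimal sketch, assuming that the model in that example is the all-idempotent algebra of size 8, whose product is min. The helper names (`product`, `solve`, `range_product`, `show`) are invented for the illustration, and the solution is found by brute force from its definition as the set of all $z$ with $e_j * z = e_k$.

```python
N = 8  # indices 1..N stand for e_1 > e_2 > ... > e_N


def product(j, k):
    # Assumption: every element is idempotent, so the product is
    # min(e_j, e_k), i.e. the element with the larger index.
    return max(j, k)


def solve(k, j):
    # The solution e_k / e_j: every z with e_j * e_z = e_k.
    return {z for z in range(1, N + 1) if product(j, z) == k}


def range_product(xs, ys):
    # Product of two sets of values, taken element by element.
    return {product(x, y) for x in xs for y in ys}


def show(zs):
    lo, hi = max(zs), min(zs)  # larger index means smaller probability
    return f"e{lo}" if lo == hi else f"[e{lo}, e{hi}]"


left = solve(product(2, 5), 5)            # (e2 * e5) / e5
right = range_product({2}, solve(5, 5))   # e2 * (e5 / e5)
print(show(left), show(right))
```

Under these assumptions the two evaluation orders give `[e5, e1]` and `[e5, e2]` respectively, which agrees with the example.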
Reasoning by cases is an inference rule to compute a conditional probability by partitioning the condition into several mutually exclusive situations such that the estimation under each of them is more manageable. The simplest form considers the cases where $B$ is true and where $B$ is false:

$$p(A \mid C) = p((A \& B) \vee (A \& \bar{B}) \mid C).$$

Under probability theory, it can be derived from the marginalization rule in section 1.3.1:

$$p(A \mid C) = p(A \mid B \& C) \cdot p(B \mid C) + p(A \mid \bar{B} \& C) \cdot p(\bar{B} \mid C).$$

Using TPL, the $\cdot$ becomes $*$, and one does not have the $+$. This can, however, be simulated using product and inverse. The corresponding formula under TPL is given by the following proposition.

Proposition 4 Let $A$, $B$, and $C$ be three sentences. $p(A \mid C)$ can be computed using the following:

$$f_1 = i[\,p(A \mid \bar{B} \& C) * i[p(B \mid C)]\,]$$
$$f_2 = i[\,p(A \mid B \& C) * p(B \mid C)/f_1\,]$$
$$p(A \mid C) = i[f_1 * f_2].$$

³Cause and effect are used here in a broad sense.

Using quasi-Bayesian nets, the probability of a hypothesis given some set of evidence can be computed by applying the two inference rules, namely, Bayes theorem and reasoning by cases [Poole and Neufeld 88].

2.4 Problems With Legal Finite Totally Ordered Probability Models

2.4.1 Ambiguity-generation and Denominator-indifference

With the mathematical structure of legal finite totally ordered probability models and the form of the relevant deductive rules derived, one can assess how well these probability models fit intuition.

To begin with, examine the solution of the legal FTOPA which has all its elements idempotent. The solution takes the form (compare Appendix C.1)

$$e_k/e_j = \begin{cases} e_k & \text{if } j < k \\ [e_j, e_1] & \text{if } k = j \end{cases}$$

Note that $e_j$ does not have direct influence on the result of the first case of the solution. Name this phenomenon denominator-indifference. Also, name the emergence of a range in the second case of the solution operation ambiguity-generation.
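Both phenomena can be seen by listing one row of the solution table. The sketch below is an illustration only: it assumes the all-idempotent algebra of size 8 (product = min), and the names `product`, `solution` and `label` are hypothetical helpers rather than part of the PROLOG implementation. It divides $e_6$ by every admissible denominator.

```python
N = 8  # indices 1..N stand for e_1 > e_2 > ... > e_N


def product(j, k):
    # All elements idempotent: e_j * e_k = min(e_j, e_k).
    return max(j, k)


def solution(k, j):
    # Brute-force e_k / e_j as the set of z with e_j * e_z = e_k,
    # returned as (index of lower bound, index of upper bound).
    zs = [z for z in range(1, N + 1) if product(j, z) == k]
    return max(zs), min(zs)


def label(bounds):
    lo, hi = bounds
    return f"e{lo}" if lo == hi else f"[e{lo}, e{hi}]"


# Divide e_6 by each denominator e_1 .. e_6.
row = {j: label(solution(6, j)) for j in range(1, 7)}
for j, value in row.items():
    print(f"e6/e{j} = {value}")
```

In this sketch the first five quotients are all `e6`, whatever the denominator (denominator-indifference), while `e6/e6` is the range `[e6, e1]` (ambiguity-generation).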
To analyze the effect of denominator-indifference and ambiguity-generation on the application of Bayes theorem, apply it to

$$p(A \mid B \& C) = p(A \& B \mid C)/p(B \mid C) = \begin{cases} p(A \& B \mid C) & \text{if } p(A \& B \mid C) < p(B \mid C) \\ [\,p(A \& B \mid C),\; e_1\,] & \text{if } p(A \& B \mid C) = p(B \mid C) \end{cases}$$

In the first case, the probability $p(B \mid C)$ does not affect the estimation of $p(A \mid B \& C)$ due to denominator-indifference. In the second case, ambiguity-generation produces a disjunct of all the probabilities larger than $p(B \mid C)$, which is a very rough estimation. Neither satisfies the requirement for empirically satisfactory probability estimates.

To analyze the effect of denominator-indifference and ambiguity-generation on reasoning by cases, consider applying proposition 4 to

$$p(A \mid C) = \begin{cases} \max(p(A \& B \mid C),\, p(A \& \bar{B} \mid C)) & \text{if } p(A \& B \mid C) \ne p(A \& \bar{B} \mid C) \\ [\,\max(p(A \& B \mid C),\, p(A \& \bar{B} \mid C)),\; e_1\,] & \text{if } p(A \& B \mid C) = p(A \& \bar{B} \mid C) \end{cases}$$

Here again, in the first situation, denominator-indifference forces a choice of outcome from one case or the other instead of giving some combination of the two outcomes. One does not get an estimate larger than both, which is contrary to intuition. In the second situation, a very rough estimate appears because of ambiguity-generation. Note that, when $\max(p(A \& B \mid C),\, p(A \& \bar{B} \mid C))$ is small, $p(A \mid C)$ can span almost the whole range of the probability set $P$.

The analysis here is in terms of a model that has all of its values idempotent. The other case to consider is what happens at the values between the idempotent values. Consider $M_{n,3}$, which has the minimal number of idempotent elements. By proposition 2, its product is

$$e_i * e_j = \begin{cases} e_{\min(i+j-1,\; n-1)} & \text{if } i, j < n \\ e_n & \text{otherwise.} \end{cases}$$

Its solution simplifies to (compare Appendix C.1)

$$e_k/e_j = \begin{cases} e_n & \text{if } k = n > j \\ e_{k-j+1} & \text{if } j \le k < n-1 \\ [e_{n-1},\; e_{n-j}] & \text{if } k = n-1 \end{cases}$$

In this algebra, it is quite easy for a manipulation to reach the probability value $e_{n-1}$:

1. Whenever one of the factors of a product is $e_{n-1}$, the product will be $e_{n-1}$ unless the other factor is $e_n$.

2. Whatever takes the value $e_2$, its inverse will be $e_{n-1}$.

3. Products of low or moderate probability tend to reach $e_{n-1}$ due to the quick decrease of the product.

4. e/e_ 2 for all 2 j <n 2.

Once $e_{n-1}$ is reached, any solution will be ambiguous. This ambiguity will be propagated and amplified during further inference in Bayesian analysis or case analysis. Although $e_{n-1}$ is a value one should try to avoid, there is no means to avoid it.

Here one sees an interesting trade-off between the two problems. In $M_{n,3}$, the denominator-indifference disappears. But, since manipulations under this model move probability values quickly, they tend to produce $e_{n-1}$ more frequently, and thus one suffers more from ambiguity-generation.

As all finite totally ordered probability algebras can be seen as combinations of the above two cases, they must all suffer from denominator-indifference and ambiguity-generation. The question is how serious the problems are in an arbitrary model. This is answered in the next section.

2.4.2 Quantitative Analysis of the Problems

Given the constraint of legal FTOPA in choosing a probability model, one is free to select the model size $n$ and to select among alternative legal FTOPAs once $n$ is fixed. A few straightforward measurements are introduced to quantify the degree of suffering in a randomly chosen model.

The number of ranges in a model's solution table and the number of elements covered by each range mirror the problem of ambiguity-generation of the model. Define a measurement of the amount of ambiguity in a model as the number of elements covered by ranges in its solution table minus the number of ranges.

Definition 2 Let $S = \{r_1, r_2, \ldots\}$ be the set of ranges in the solution table of a legal FTOPA. Let $w_j$ be the number of values covered by range $r_j$. Let $M$ be the number of different solution pairs in the solution table. The amount of ambiguity of the algebra is defined as $A = \sum_{r_j \in S} (w_j - 1)$.
The relative ambiguity of the algebra is defined as $R = A/M$.

Example 1 The three legal FTOPAs with size 8 in Appendix C.1 all have $A = 21$ and $R = 0.6$.

One has the following proposition. The proof can be found in Appendix D.

Proposition 5 The amount of ambiguity of any legal FTOPA with size $n$ is $A = (n-1)(n-2)/2$. The relative ambiguity of the algebra is $R = (n-2)/(n+2)$.

The number of solution pairs satisfying $e_j/e_k = e_j$ reflects the seriousness of denominator-indifference of the model. Define the order of denominator-indifference as this number minus the number of such $e_j$s.

Definition 3 Let $d_j$ be the number of times $e_j/e_k = e_j$ for $1 \le k < j$ in a legal FTOPA of size $n$. The order of denominator-indifference of the algebra is defined as $O_d = \sum_j (d_j - 1)$.

Example 2 The three legal FTOPAs $M_{8,3}$, $M_{8,8}$ and $M_{8,4}$ in Appendix C.1 have $O_d$ values 0, 15 and 9 respectively.

Define the order of mobility of a model to express the chance with which a product or a solution transfers an operand to a different value. The higher this order, the more likely it is for a manipulation to generate an idempotent element and produce ambiguity afterwards.

Definition 4 The order of mobility $O_m$ of a legal FTOPA is defined as the number of distinct product pairs $a * b$ in its product table such that $a * b < \min[a, b]$.

Example 3 The three legal FTOPAs $M_{8,3}$, $M_{8,8}$ and $M_{8,4}$ in Appendix C.1 have $O_m$ values 15, 0 and 6 respectively.

One has the following proposition. The proof can be found in Appendix D.

Proposition 6 For any legal FTOPA with size $n$ and a set $I$ of indexes of all its idempotent elements, $I = \{i_1, i_2, \ldots, i_k\}$ where $i_1 < i_2 < \cdots < i_k$, its order of denominator-indifference is

$$O_d = \sum_{m=1}^{k-2} (i_m - 1)(i_{m+1} - i_m),$$

its order of mobility is

$$O_m = \sum_{m=1}^{k-2} \; \sum_{j=1}^{i_{m+1} - i_m - 1} j,$$

and

$$O_d + O_m = (n-2)(n-3)/2.$$

Example 4 The three legal FTOPAs with size 8 in Appendix C.1 all have $O_d + O_m = 15$.
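The measures of Definitions 2 to 4 can be recomputed by brute force from a product table, which gives an independent check on Examples 1 to 4. The sketch below rests on several assumptions that are not quoted from the thesis: the product rule is a block-wise reading of Theorem 1 that generalizes the $M_{n,3}$ product of proposition 2; $M_{8,4}$ is taken to have idempotent elements $\{e_1, e_4, e_7, e_8\}$; the pair $e_n/e_n$ is left out of the solution table; and $e_n$ is left out of the count $d_j$. These choices are made because they reproduce the values stated in the examples. The function names are invented for the illustration.

```python
from itertools import combinations_with_replacement


def make_product(idempotents, n):
    """Product on indices 1..n (e_1 > ... > e_n) for a given idempotent set.

    ASSUMPTION: block rule inferred from Theorem 1 / proposition 2.
    """
    idem = sorted(idempotents)

    def product(j, k):
        a, b = min(j, k), max(j, k)
        if a == n:                            # e_n * e_n
            return n
        lo = max(i for i in idem if i <= a)   # idempotent at or above e_a
        hi = min(i for i in idem if i > lo)   # next idempotent below it
        # Same block: shift by the block start, saturating at the block end.
        # Different blocks: the smaller element (larger index) wins.
        return min(a + b - lo, hi) if b <= hi else b

    return product


def measures(idempotents, n):
    product = make_product(idempotents, n)
    elems = range(1, n + 1)
    # e_k / e_j = all z with e_j * e_z = e_k, for e_k <= e_j; e_n / e_n omitted.
    sol = {(k, j): [z for z in elems if product(j, z) == k]
           for j in elems for k in range(j, n + 1) if (k, j) != (n, n)}
    M = len(sol)
    A = sum(len(zs) - 1 for zs in sol.values())   # single values add 0
    # d_k: number of denominators e_j (j < k) with e_k / e_j = e_k; e_n left out.
    d = [sum(sol[(k, j)] == [k] for j in range(1, k)) for k in range(2, n)]
    Od = sum(x - 1 for x in d)
    # Pairs whose product is strictly below the smaller factor.
    Om = sum(product(a, b) > b
             for a, b in combinations_with_replacement(elems, 2))
    return A, A / M, Od, Om


MODELS = {"M8,3": {1, 7, 8}, "M8,8": set(range(1, 9)), "M8,4": {1, 4, 7, 8}}
for name, idem in MODELS.items():
    print(name, measures(idem, 8))
```

Under these assumptions each model yields $A = 21$ and $R = 0.6$, with $(O_d, O_m)$ equal to (0, 15), (15, 0) and (9, 6), in line with the closed forms $(n-1)(n-2)/2 = 21$, $(n-2)/(n+2) = 0.6$ and $(n-2)(n-3)/2 = 15$ for $n = 8$.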
Proposition 5 tells us that all the legal FTOPAs of the same size have the same amount of ambiguity. Increasing the size increases $R$, which approaches 1 as $n$ approaches infinity. Proposition 6 says that,

1. among legal FTOPAs of the same size $n$, the order of denominator-indifference $O_d$ changes from the lower bound 0 at $M_{n,3}$ to the upper bound $(n-2)(n-3)/2$ at $M_{n,n}$;

2. the upper bound of $O_d$ as well as $O_m$ increases with model size $n$;

3. given $n$, the sum $O_d + O_m$ remains constant, and thus if a model suffers less from denominator-indifference, it must suffer more frequently from ambiguity-generation due to the increase in its mobility.

2.4.3 Can the Changes in Probability Assignment Help?

Having explored model size and alternative models of a given size, the final freedom that remains is the assignment of probability values. From Corollary 2, it is apparent that, in general, denominator-indifference and ambiguity-generation happen only in certain regions of the solution table. So, is it possible, by choosing a certain set of probability values as prior knowledge, to avoid intermediate results falling onto those unfavorable regions?

To help answer this question, a derivation of the conditional probability $p(\text{fire} \mid \text{smoke} \& \text{alarm})$⁴ for a smoke-alarm problem in Figure 2.7 is given in Appendix C.2. The calculation involves 2 applications of Bayes theorem and 3 of reasoning by cases. It requires 19 products, 9 solutions, and 14 inverses.

⁴The four nodes involved form a minimum set which has alternative hypotheses (fire and tampering) and allows accumulation of evidence (smoke + alarm).

In general,

1. a product tends to decrease the probability value until an idempotent value is reached.

2. a solution tends to increase the probability value or cause a large range to occur (especially for idempotent values).

3. an inverse tends to transfer a small value into a big one and vice versa.
Since many operations are required even in a small problem and each operation tends to move the intermediate value around the probability set, the compound effect of the operations is not generally controllable.

To summarize, in the context of legal FTOPA, there seems to be no way to escape the problems of denominator-indifference and ambiguity-generation by means of clever assignment of probability values; increasing model size does no good in reducing the difficulty; selecting among different models trades one trouble for another. In the next section, these problems are demonstrated by an experiment.

Figure 2.7: Smoke-alarm example

2.5 An Experiment

All the 32 legal FTOPAs with size 8 were implemented in PROLOG and their performance was tested on the smoke-alarm example (Figure 2.7) from Poole and Neufeld [1988]. The PROLOG program has basically the same structure, but inverse, product, solution, as well as Bayes theorem and reasoning by cases are redefined. Table 2.1 lists the probabilities in the knowledge base together with the numerical values used by Poole and Neufeld [1988] for comparison.

Probability                                                                        FTOPA    [0, 1]
$p(\text{fire})$                                                                   $e_6$    0.01
$p(\text{tampering})$                                                              $e_6$    0.02
$p(\text{smoke} \mid \text{fire})$                                                 $e_2$    0.9
$p(\text{smoke} \mid \overline{\text{fire}})$                                      $e_6$    0.01
$p(\text{alarm} \mid \text{fire} \& \text{tampering})$                             $e_4$    0.5
$p(\text{alarm} \mid \text{fire} \& \overline{\text{tampering}})$                  $e_2$    0.99
$p(\text{alarm} \mid \overline{\text{fire}} \& \text{tampering})$                  $e_2$    0.85
$p(\text{alarm} \mid \overline{\text{fire}} \& \overline{\text{tampering}})$       $e_7$    0.0001
$p(\text{leaving} \mid \text{alarm})$                                              $e_2$    0.88
$p(\text{leaving} \mid \overline{\text{alarm}})$                                   $e_7$    0.001
$p(\text{report} \mid \text{leaving})$                                             $e_3$    0.75
$p(\text{report} \mid \overline{\text{leaving}})$                                  $e_6$    0.01

Table 2.1: Probabilities for smoke-alarm example

The following posterior probabilities are calculated in all 32 possible legal FTOPAs with size 8 (using quasi-Bayesian nets) and in the [0, 1] real number probability model (using Bayesian nets) as a comparison.

$p(s \mid f)$, $p(a \mid f)$, $p(s \mid t)$, $p(a \mid t)$, $p(f \mid s)$, $p(f \mid a)$, $p(f \mid s \& a)$, $p(t \mid s)$, $p(t \mid a)$, $p(t \mid s \& a)$

The first four probabilities are deductive which, given cause acting, estimate the probabilities of effects appearing.
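For orientation, the [0, 1] baseline for the posteriors about fire can be recomputed directly from the numerical column of Table 2.1. The sketch below is a plain enumeration of the joint distribution, not the Bayesian-net algorithm used in the experiment. It assumes the structure that the conditional probabilities suggest: fire and tampering are independent root causes, smoke depends only on fire, and alarm depends on fire and tampering. Negated conditions are read as in the Poole and Neufeld example. The names `joint` and `posterior` are hypothetical.

```python
from itertools import product as cartesian

P_FIRE, P_TAMPER = 0.01, 0.02
P_SMOKE = {True: 0.9, False: 0.01}                  # p(smoke | fire)
P_ALARM = {(True, True): 0.5, (True, False): 0.99,  # p(alarm | fire, tampering)
           (False, True): 0.85, (False, False): 0.0001}


def joint(fire, tampering, smoke, alarm):
    # Assumed factorisation: p(f) p(t) p(s | f) p(a | f, t).
    p = (P_FIRE if fire else 1 - P_FIRE)
    p *= (P_TAMPER if tampering else 1 - P_TAMPER)
    ps, pa = P_SMOKE[fire], P_ALARM[(fire, tampering)]
    p *= ps if smoke else 1 - ps
    p *= pa if alarm else 1 - pa
    return p


def posterior(query, **evidence):
    # p(query | evidence) by summing the joint over all consistent worlds.
    num = den = 0.0
    for f, t, s, a in cartesian([True, False], repeat=4):
        world = dict(fire=f, tampering=t, smoke=s, alarm=a)
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(f, t, s, a)
            den += p
            if world[query]:
                num += p
    return num / den


for ev in ({"smoke": True}, {"alarm": True}, {"smoke": True, "alarm": True}):
    print(ev, round(posterior("fire", **ev), 2))
```

With these inputs the three posteriors for fire come out at about 0.48, 0.37 and 0.98, so in the real-number model the two pieces of evidence do accumulate.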
The remaining six are abductive which, given effects observed, estimate the probability of each conceivable cause. The results are included in Appendix C.3.

• Among the 32 legal FTOPAs, eight produced an identical value for the abductive cases, $p(f \mid s) = p(f \mid a) = p(f \mid s \& a) = e_6$, and 16 others produced identical ranges for all the abductive cases about fire.

According to Table 2.1, smoke does not necessarily relate to fire ($p(s \mid \bar{f}) = e_6$). Nor does alarm ($p(a \mid \bar{f} \& t) = e_2$). As a result, observing only one of smoke and alarm, one is not quite sure about fire. Intuitively, adding the positive evidence alarm to smoke should increase one's belief in fire. As well, adding to alarm the evidence smoke, which is independent of tampering, indicates a higher chance of fire causing the alarm. Thus this intuitive inference arrives at

$$p(f \mid s \& a) > p(f \mid s) \quad \& \quad p(f \mid s \& a) > p(f \mid a)$$

which the results obtained from the above mentioned 24 legal FTOPAs do not fit.

To illustrate how this happens, evaluate $p(f \mid s \& a)$ in model $M_{8,4}$ with idempotent elements $\{e_1, e_5, e_7, e_8\}$.

$$p(f \mid s \& a) = p(s \mid f \& a) * p(f \mid a)/p(s \mid a) = e_2 * e_6/e_4 = e_6/e_4 = e_6$$

Pay attention to the solution in the last step. The result is no larger than $p(f \mid a) = e_6$ due to denominator-indifference. One does not get extra evidence accumulating.

• One of the very useful results provided by [0, 1] numerical probability is that although $p(f \mid s) = 0.48$ and $p(f \mid a) = 0.37$ are moderate, when both smoke and alarm are observed $p(f \mid s \& a) = 0.98$ is quite high, which is more intuitive than the case above. According to Table 2.1, fire is the only event which can cause both smoke and alarm with high certainty ($p(s \mid f) = p(a \mid f) = e_2$). Thus observing both simultaneously one would expect a higher probability. But the remaining 8 legal FTOPAs give only an ambiguous $p(f \mid s \& a)$ spanning at least half of the total probability range.

Consider the evaluation of $p(f \mid s \& a)$ in model $M_{8,4}$ with idempotent elements $\{e_1, e_4, e_7, e_8\}$.

$$p(f \mid s \& a) = p(s \mid f \& a) * p(f \mid a)/p(s \mid a) = e_2 * e_5/e_5 = e_5/e_5 = [e_4, e_1]$$

Notice the solution in the last step.

• In the deductive case, the situation is slightly better. Some models achieve the same tendency as [0, 1] probability in deduction (e.g., $p(s \mid t) < p(a \mid t)$). Some achieve the same tendency with increased ambiguity. Others either produce identical ranges for different probabilities or do not reflect the correct trend. The slight improvement is attributable to the fewer operations required in deduction (only reasoning by cases, but not Bayes theorem, is involved). Since reasoning by cases needs the solution operation, it still creates denominator-indifference and generates ambiguity.

Our experiment is systematic with respect to legal FTOPAs of a particular size 8. Although a set of arbitrarily chosen probabilities is used in this presentation, they have also been varied in a non-systematic way, and the outcomes are basically the same.

2.6 Conclusion

The motivation of this investigation is to find finite totally ordered probability models to automate reasoning under uncertainty and to facilitate knowledge acquisition and explanation in expert systems.

Under the theory of probabilistic logic [Aleliunas 88], the general form of finite totally ordered probability algebras is derived and the number of different models is deduced such that all the possible models can be explored systematically.

Two major problems of those models are analyzed: denominator-indifference and ambiguity-generation. They are manifested during the processes of applying Bayes theorem and reasoning by cases. Changes in size, model and assignment of priors do not seem to solve the problems. All the models with size 8 have been implemented in a PROLOG program and tested against a simple example. The results are consistent with the analysis.
The investigation reveals that under the TPL axioms, finite probability models will have limited usefulness. The premise of legal FTOPA is {TPL axioms, finite, totally ordered}. It is believed that the TPL axioms represent the necessity of general inference under uncertainty. ‘Totally ordered’ seems to be necessary for a probability model to be useful. Thus it is conjectured that a useful uncertainty management mechanism cannot be realized in a finite setting.

The result of the investigation does not leave probability theory as the only choice for the representation of probable reasoning. But it does highlight infinite totally ordered probability algebras, including probability theory. Since the appeal of Bayesian networks for representing uncertain knowledge has become increasingly apparent, and substantial advances have been made in recent years on reasoning algorithms in Bayesian nets, Bayesian networks are adopted as the framework of this research.

Chapter 3

QUALICON: QUALITY CONTROL IN NERVE CONDUCTION STUDIES BY COUPLING BAYESIAN NETWORKS WITH SIGNAL PROCESSING PROCEDURES

Before the more complicated diagnostic problem was tackled (Chapter 5), a pilot study on Bayesian networks was done with a problem of quality control in nerve conduction studies. The resultant system is QUALICON. The corresponding network is small (fewer than 20 variables and singly connected) and is thus a good testbed, while the solution to the problem contributes to the information gathering stage of neuromuscular diagnosis.

This chapter presents major issues in the implementation of QUALICON. The results are mainly taken from Xiang et al. [1992]. Section 3.1 introduces the QUALICON domain. Section 3.2 describes the overall structure of QUALICON. Section 3.3 discusses the knowledge representation of QUALICON. Section 3.4 presents a preliminary evaluation of QUALICON.
3.1 The Problem: Quality Control in Nerve Conduction Studies

Nerve conduction studies are now used routinely as part of the electrodiagnostic examination for neuromuscular diagnosis. Their clinical value is demonstrated in the examination of diseases or injuries which might be difficult to diagnose with (needle) EMG alone. The clinical procedures involve the placement of surface electrodes, the delivery of a stimulating impulse to nerve or muscle, and the recording of nerve or muscle responses (action potentials). The procedures are fairly simple to perform, are easily tolerated, and require little cooperation from the patient. Interpretation of the results, however, requires knowledge of the range of normal values, the morphology of normal potentials and, of equal importance, the sources of technical error which may affect the finding [Goodgold and Eberstein 83]. Since abnormalities can be due to technical error or disease, identification of technical error is a major element of quality control in nerve conduction studies.

Contemporary equipment used for nerve conduction studies is usually capable of computerized measurement of latency, amplitude, duration and area of nerve and muscle action potentials and the resulting conduction velocities. However, computerized measurements assume that stimulating and recording characteristics and electrode placements are correct; they do not take cognizance of technical acceptability. The development of an appropriate technique for automating the quality control is addressed in this chapter.

3.2 QUALICON: a Coupled Expert System in Quality Control

Since identification of technical errors is based on observations of recorded nerve and muscle responses, the problem is of the nature of diagnosis, i.e., given the abnormal outcomes, explaining the cause.
Since stimuli are applied to and responses are generated through complex biological systems, namely human bodies, much uncertainty is involved in the interpretation of abnormal responses in terms of possible technical errors. Thus the problem is in many aspects similar to a medical diagnostic problem, and the limitations of conventional techniques and the promise of AI techniques apply, as discussed at the beginning of this thesis. Since the recorded responses take the form of continuous signals, it is essential to ‘couple’ the expert system technique with the numerical procedures for signal processing in order to develop a feasible technique as a solution.

The term ‘couple’ needs some explanation. The concept of a coupled expert system emerged, when rule-based systems were dominant, from a need to apply expert system techniques to domains where numerical information is largely involved and is processed by conventional numerical procedures. At that time, expert systems were considered ‘symbolic’, since rule-based systems use rules for inference, which differs from conventional ‘numeric’ computation. Therefore, the coupled expert system was defined as a system which links numeric and symbolic computing processes; has some knowledge of the numerical processes embedded; and reasons about the application or results of these numerical processes [Kowalik 86]. In light of the newer expert system methodology for probable reasoning, namely Bayesian networks, which compute the (numerical) posterior probability extensively using algorithms, it is not appropriate to characterize expert systems which do not deal with numerical information other than the measurement of uncertainty as simply ‘symbolic’.
Thus, my suggestion is to define a coupled expert system as a system which links numerical processes not directly involving inference with computing processes (symbolic or numeric) for inference; has some knowledge of the (non-inferential) numerical processes embedded; and reasons about the application or results of these numerical processes.

A prototype expert system, QUALICON, coupling Bayesian networks with numerical procedures for signal processing has been developed in the thesis research. It automates quality control in nerve conduction studies. The system was developed in cooperation with the Neuromuscular Disease Unit (NDU) of Vancouver General Hospital (VGH). Utilizing information on the stimulating and recording parameters for a given nerve or muscle action potential, QUALICON compares specific characteristics with values from a normal database, determining the qualitative nature of each feature. Probabilistic reasoning allows QUALICON to provide the user with recommendations on the acceptability of the potential. If it is not technically acceptable, the most likely explanation(s) of the problem is/are provided with information on how to correct it.

Figure 3.8 illustrates the 3 modules comprising the framework of QUALICON, namely the Feature Extraction (FE) module, the Feature Partition (FP) module and the Probabilistic Inference Engine (PIE) module. There are also 2 knowledge bases: a Normal Database (ND) and a Bayesian Net Specification (BNS).

Figure 3.8: QUALICON system structure

FE consists of a set of numerical procedures extracting numeric or symbolic features from an action potential. These routines consult ND in determining their best heuristic strategy. ND contains the normal value ranges (in terms of means and standard deviations) of all the numeric features to be examined. Normal value ranges for distal and proximal are distinguished.
It combines the statistics derived from several hundred normal studies in the NDU of VGH and the electromyographic “expertise” of neuromuscular specialist A. Eisen and registered EMG technician M. MacNeil. FP partitions the numeric features against ND, translating them into symbolic descriptions of the potential. When all the features are available in symbolic form they are fed into PIE for probabilistic inference. PIE reasons with BNS as the background knowledge and with the symbolic features as new evidence to generate recommendations with respect to any technical error in the test setup. The FE and FP modules are programmed in C, and the PIE module is programmed in PROLOG.

3.3 Development of QUALICON

Important technical issues involved in the development of QUALICON are presented in this section. They concern knowledge representation, robust feature extraction, coupling, and an algorithm for probabilistic reasoning.

3.3.1 Feature Extraction and Partition

Feature Selection

A set of features of numerical data provides a basic interface between the signal processing and Bayesian net components. Two criteria are used in the feature selection. (1) Utilization of human expertise as a resource in building QUALICON requires that the selected features be mainly composed of those used in daily practice. (2) The set of features should have sufficient differential power, i.e., they should enable different potentials to be characterized mostly by different sets of feature values.

Based on the criteria, 6 variables are chosen: LATENCY, DURATION, AMPLITUDE (peak to peak), WAVE SEQUENCE, RATIO, and STIMULATION ARTIFACT. The first 3 are routinely used features. WAVE SEQUENCE is selected to characterize the basic morphology of the potentials.
This feature can take 4 symbolic values: bnpb (baseline negative positive baseline), bpnb, bpnpb, and ab_seq (abnormal sequence) for CMAPs (Compound Muscle Action Potentials) and pnp, npn, np, and ab_seq for SNAPs (Sensory Nerve Action Potentials). bnpb (Figure 3.9(a)) and pnp are the wave sequences seen in normal potentials; bpnb and npn (Figure 3.10(a)) are most often created by reversed polarity of the recording electrodes; bpnpb (Figure 3.10(b)) and np usually signify the misplacement of recording electrodes; and ab_seq encapsulates any wave sequences different from the above 3.

The feature RATIO is selected to complement WAVE SEQUENCE. It is defined as the ratio of negative peak amplitude over positive peak amplitude. Although not explicitly used in routine electromyography, this feature is employed implicitly in combination with the basic morphology. In Figure 3.10(b), the ratio is 3.75 mV/(4.7 − 3.75) mV = 3.95. This is larger than normal and therefore produces the entry in ‘summary’: “Positive dip too shallow”.

STIMULATION ARTEFACT takes 3 symbolic values: negative (upwards), positive (downward), and isoelectric. An otherwise normal potential with STIMULATION ARTEFACT = positive could well be produced by reversing the stimulating polarity. The latency will be prolonged but could still be within the normal range. Without using this feature, an abnormal setup might be overlooked (Figure 3.9(b)).

Determine values for feature variables

Some features, like STIMULATION ARTEFACT and WAVE SEQUENCE, are intrinsically symbolic. Others, like LATENCY, could be extracted in symbolic form from rough estimates of numeric values. The quality control task does not logically require any features in numeric form, since all the features must be symbolic when entering the final symbolic computation stage, and computation savings could be achieved by abandoning accurate numerical measurement. The choice of accurate measurement of LATENCY (and the other variables) is made in order to present the user with values with which they are familiar. The partition into symbolic values follows normally.

Figure 3.9: QUALICON report: (a) Normal CMAP evoked by tibial nerve stimulation at ankle. Recorded at abd. hallucis; (b) Same but with a reversed stimulus polarity. (Report panels: RECORDED SIGNAL, NUMERICAL FEATURES, SUMMARY, RECOMMENDATIONS.)

Figure 3.10: QUALICON report: (a) Median nerve SNAP recorded at wrist, with stimulation at finger 2 (normal subject). Recording polarity was reversed. (b) The CMAP is recorded from median nerve of a healthy person with stimulation at wrist and with recording electrode misplaced. (Report panels: RECORDED SIGNAL, NUMERICAL FEATURES, SUMMARY, RECOMMENDATIONS.)

Two methods for determining the onset and end of a CMAP are presented by Meyer and Hilfiker [1983]. ‘Slope threshold’ identifies the 2 critical points where 2% of the maximal slope is reached. ‘Power threshold’ signals the points corresponding to 0.1 and 0.999 of the integrated squared potential. Even though the methods perform well in ‘semiautomatic’ conditions, they are sensitive to artifacts without human supervision. For the quality control task, a robust strategy is required.

The problem is treated by developing a technique named guided filtering which (1) introduces knowledge about the wave morphology and uses it to guide the search for critical points; and (2) adopts noise rejection and suppression filtering. Figure 3.11 depicts the feature extraction strategy used by the FE module. The following are the important steps employed.

• Detecting stimulation artifacts in distal CMAP. The first few samples are averaged and compared with 2 thresholds to determine the symbolic value of the variable. This approach cannot be applied to SNAPs since, when the stimulation polarity is reversed, only a short positive component appears in the stimulation artefact, reflecting the high gain used (Figure 3.12). Thus the following rules are applied to the samples of the first T ms: (1) if successive positive samples spanning TP ms are found with their average greater than N1 µV, the STIMULATION ARTEFACT takes the value downward; (2) if the average over T ms is lower than −N2 µV, the value is upwards; (3) otherwise the value is none. The parameters T, TP, N1 and N2 depend on the electromyograph used and the sampling interval.
• ND contains the means and 2 standard deviations for LATENCY and DURATION. The linear combination of these defines a time period P = [l, u] (Figure 3.13). The most significant part of the nerve response lies within P with a probability close to 1. Most of the noise and artefact before the onset or after the end are outside P. The values for the remaining features are searched only through the samples in P.

Figure 3.11: Feature extraction strategy
[Flowchart steps: determine stim artefact; consult ND; find all turning points; find global extrema for CMAP; determine wave sequence; guided filtering for onset & end; measure lat., dur., amp., ratio.]

• In determining turning points within P, another level of noise suppression is adopted to catch the turning points which define the basic patterns of wave sequences. Turning points corresponding to small perturbations are ignored. This technique is similar to that developed for EEG processing [Gotman and Gloor 76]. Two successive turning points (one a local maximum and the other a local minimum) whose amplitudes differ by less than a preset threshold are ignored.

• Once turning points are found, pattern matching suffices to determine the value of WAVE SEQUENCE.

• The onset and end of a SNAP coincide with turning points. Estimation of the onset and end of a CMAP requires additional search. A digital filter (Figure 3.13) is used in the form:

- a + Zk_(W_l)(xi+k - Ui- 2+ (x+ - X+k+1)

where x is the sample data, i the time index, 2w + 1 the window width and a a preset constant. The minimum (maximum) of y indicates the onset (end) of the CMAP. The constant a prevents the denominator from being 0 when the window slides over a flat baseline. The filtering over the window width smooths any small perturbation, and provides another level of noise suppression.
Increasing the window width strengthens the noise suppression but decreases the sensitivity of the filter. The filtering is only applied to the period [l, f(y_i)] for onset estimation. f(y_i) is initialized to j, the time index of the turning point corresponding to the negative peak of the CMAP. When y_i drops below a threshold which signifies that the onset is nearby, f(y_i) takes the value i + β where β is a preset constant.

Figure 3.12: SNAP produced by reversing stimulation polarity

Figure 3.13: Illustration of Guided Filtering

For estimation of the end, the filtering is applied to a period that starts at d, the time index of the turning point corresponding to the positive peak of the CMAP, and whose length is bounded by a preset constant determined by the estimated maximum possible length between d and the end.

In summary, this technique, named guided filtering, attempts to utilize fully the available knowledge and information up to the moment during its search process. Initially, the general knowledge about the nerve-stimulation pair guides the search. As the search progresses, new information about turning points and wave sequence is absorbed to localize further search for the onset and end. It simulates the human heuristic, which starts with some expectation and lets the search be guided by information obtained along the way. In this way, not only is the overall processing very efficient, but since in each step the search is kept local, the effect of artefact and noise is minimized.

Partitioning Numeric Features

Partition is also a feature interpretation process. It involves (1) straightforward mapping of numeric features against the normal database ND, for example, into less_than_normal, normal and greater_than_normal; and (2) capturing the dependency in feature interpretation. Two kinds of dependency require consideration in partition. The first is technically obvious.
For instance, if WAVE SEQUENCE = ab_seq, no values can be assigned to the variables AMPLITUDE, DURATION etc. In this case, these variables are called inapplicable. Another kind of dependency reflects the experts’ context-dependent view of numeric features. For example, in the case of CMAP, if the stimulus is applied proximally, the possibility of LATENCY = less_than_normal will not be considered. This is because individual variation in limb length precludes a standard interstimulus distance. Likewise when AMPLITUDE = normal, even if the DURATION is below the normal range, it is still interpreted as normal. This is probably due to the inability to determine accurately the end of the potential (where it returns to baseline). In both examples, the dimension of certain variables changes depending on the context.

3.3.2 Probabilistic Reasoning

Construction of Bayesian Networks

Because nerve conduction studies differ for CMAPs and SNAPs, two separate Bayesian nets with similar topology are constructed for CMAPs and SNAPs respectively, and only one of them needs to be evaluated in a particular session. The net for CMAP is shown in Figure 3.14. Each box (a node) in the network represents a variable with its name underlined and its alternative values listed. The root is STIMULUS POSITION, which affects the distribution of OPERATION, which in turn determines the likelihood of observing different feature values.

Figure 3.14: Bayesian subset for CMAP
[Node OPERATION with values: normal, rev_stim_polarity, rev_rec_polarity, misplace_stim_elect, misplace_rec_elect, submax_stim, supermax_stim.]

A system dealing with technical errors must make assumptions about the knowledge and skill levels of potential users. The intended users of QUALICON are residents, fellows and technologists learning in a hospital EMG laboratory.
It is further assumed that the inadequate operations are mutually exclusive since it is unlikely that 2 or more would occur in combination given the intended user level. Assuming mutual exclusion, different technical errors can be represented by a single node OPERATION. In this way simplicity in the network structure is gained and an efficient evaluation algorithm (section 3.3.2) can be developed.

After the topology of the net is decided, the conditional probabilities for all links are acquired from the subjective judgement of a domain expert in a form like p(WAVE SEQUENCE = bpnb | OPERATION = reversed recording polarity).

Although not constructed for diagnostic purposes, QUALICON recognizes that abnormal potentials could result from disease rather than a technical error. When an abnormal potential is encountered QUALICON attempts to interpret the abnormality in terms of a technical error but also raises the possibility of disease (e.g. recommendation 2 in Figure 3.10). If an unacceptable recording cannot be overcome by correcting the identified technical error then disease becomes the likely cause.

The net for CMAP shown in Figure 3.14 reflects the influence among relevant variables in the most general case. When the network is used in a particular situation, it must adapt to the reality. For instance, when STIMULUS POSITION = proximal, STIM ARTEFACT usually can no longer be detected. In this case QUALICON removes the node STIM ARTEFACT together with its relation with OPERATION. Likewise, when WAVE SEQUENCE is found to be ab_seq, DURATION, AMPLITUDE, and RATIO are removed. This is a higher level parallel of the inapplicability raised in the previous subsection. Heckerman, Horvitz, and Nathwani [1989] discuss a similar issue.

Inference Algorithm in QUALICON

In a QUALICON session, after the feature extraction and partition are finished, QUALICON is able to reason at the Bayesian net level where STIMULUS POSITION (Figure 3.14) and all other applicable feature variables are known. At this time, only the value of OPERATION is to be estimated (i.e., given the evidence, only the probabilities of OPERATION are to be evaluated). A simple and efficient algorithm that takes advantage of this net structure is developed. With the algorithm, only the probability of one alternative value needs to be calculated in full length; the rest can be obtained by a negligible amount of computation. Figure 3.15 depicts the DAG used in QUALICON, which is slightly simplified for illustration of the algorithm.

Figure 3.15: A DAG to illustrate the inference algorithm in QUALICON

Algorithm 1 Suppose a set of evidence is given as {a, c, d} with the value of E unknown. Assuming $B \in \{b_1, \ldots, b_n\}$, the probability $p(b_i \mid a \& c \& d)$ $(i = 1, \ldots, n)$ can be computed in the following way.

step 1 For $p(b_1 \mid a \& c \& d) = p(b_1 \& c \& d \mid a) / p(c \& d \mid a)$, retrieve the knowledge base to calculate

$p(b_i \& c \& d \mid a) = p(c \mid b_i)\, p(d \mid b_i)\, p(b_i \mid a) \quad (i = 1, \ldots, n)$

$p(c \& d \mid a) = \sum_{i=1}^{n} p(b_i \& c \& d \mid a)$

and cache the intermediate results.

step 2 For the rest of the probabilities to be evaluated, no knowledge base retrieval is needed at all. Fetch the cache content to obtain

$p(b_i \mid a \& c \& d) = p(b_i \& c \& d \mid a) / p(c \& d \mid a)$

It can be easily derived that in the case there are m known children of B (m = 2 is illustrated above), the algorithm requires mn multiplications, n divisions, and n - 1 additions. That is, the time complexity of the algorithm is linear in the number of features and in the number of technical errors considered.

Out of the 7 alternative values of OPERATION, those with posterior probabilities above 1/7 (above-equal-likelihood) are chosen as the basis of the recommendations for the user.
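The two steps of Algorithm 1 and the above-equal-likelihood selection can be sketched in a few lines of Python. This is only an illustrative sketch, not the QUALICON implementation: the function names, the list-based layout of the probability tables and the numbers in the usage example are all assumptions made here for illustration.

```python
def posterior_over_b(prior_b_given_a, child_likelihoods):
    """Return p(b_i | a & evidence on the known children) for i = 1..n.

    prior_b_given_a   -- list of p(b_i | a), one entry per value b_i
    child_likelihoods -- one list per known child of B; entry i of each list is
                         p(observed child value | b_i)
    (Both layouts are hypothetical; the thesis does not prescribe a data structure.)
    """
    # Step 1: unnormalized terms p(b_i & c & d | a), using m*n multiplications.
    cached = []
    for i, p_b in enumerate(prior_b_given_a):
        term = p_b
        for likelihood in child_likelihoods:
            term *= likelihood[i]
        cached.append(term)
    # Normalizing constant p(c & d | a), using n - 1 additions.
    evidence = sum(cached)
    if evidence == 0:
        raise ValueError("the evidence has zero probability under the model")
    # Step 2: reuse the cached terms; only n divisions are needed.
    return [term / evidence for term in cached]


def above_equal_likelihood(labels, posterior):
    """Keep the values whose posterior exceeds 1/n (1/7 for OPERATION)."""
    threshold = 1.0 / len(labels)
    return [(label, p) for label, p in zip(labels, posterior) if p > threshold]


# Usage with made-up numbers: three values of B and two observed children.
example_posterior = posterior_over_b(
    [0.5, 0.3, 0.2],                       # p(b_i | a)
    [[0.9, 0.1, 0.2], [0.8, 0.5, 0.1]],    # p(c | b_i) and p(d | b_i)
)
example_choice = above_equal_likelihood(["b1", "b2", "b3"], example_posterior)
```

Because B is the only unobserved variable of interest and its observed children are conditionally independent given B, a single pass over the values of B suffices, which is where the linear cost comes from.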
The probabilities are attached to the corresponding recommendations (Figures 3.9 and 3.10).

3.4 Assessment of QUALICON

3.4.1 Assessment Procedure

Table 3.2: Potential recording conditions. Technical errors introduced include reversal of stimulating (sti_rev) or recording (rec_rev) polarity, misplacement of stimulating (sti_mis) or recording (rec_mis) electrodes, and use of submaximal stimulating current (sub_max). Empty entry: no potential recorded under corresponding condition.

muscle or nerve recording condition: normal sti_rev rec_rev sti_mis rec_mis sub_max
median motor distal: 2 1 2 2 2 1
median motor proximal: 2 2 1
median sensory: 2 6 4 3 1
ulnar motor distal: 2 1 2 2 2 1
ulnar motor proximal: 2 1 2 1
ulnar sensory: 3 1 2 4 2
post. tib. motor distal: 2 2 2 3 1
post. tib. motor proximal: 1 1 2
sural sensory: 2 2 2 1 2 2
subtotal: 17 17 14 9 16 11

In order to test the validity of QUALICON, 84 muscle and nerve action potentials are elicited and recorded from normal volunteers using standard procedures (see Table 3.2). For recording compound muscle and nerve action potentials, evoked by either proximal or distal stimulation, filter settings are fixed at 2 Hz to 10 kHz and 20 Hz to 2 kHz, respectively. Five different types of technical error are introduced: reversal of stimulating (Sti_rev) or recording (Rec_rev) polarity, misplacement of stimulating (Sti_mis) or recording (Rec_mis) electrodes, and use of submaximal stimulating current (Sub_max). Only 1 error is introduced at a time. No distribution about errors introduced is assumed. In analyzing the potentials QUALICON is blinded as to whether an error has been introduced or not. Two physicians, 1 technician and 1 resident also manually analyze the data.
In doing so they are given information as to the nerve stimulated and the stimulating and recording sites, and are allowed to select from Normal, Sti_rev, Rec_rev, Sti_mis, Rec_mis, and Sub_max in making their interpretations. Multiple options of up to 3 are allowed when it is difficult to make a single choice. If an option matches the actual technical error, the interpretation is recognized as a success. The success rate is defined as the number of successes divided by the total number of interpretations made.

3.4.2 Assessment Results

The assessment is treated as Bernoulli trials (Appendix E). The success rate of QUALICON is 73% compared to an average success rate of 57% for manual assessment (excluding assessor 4 who is a resident in training) (Figure 3.16(a)). Given the result, it can be derived, using the standard statistical technique (Appendix E), that the 95% confidence intervals of success probabilities for QUALICON ($p_Q$) and the 3 human average ($p_h$) are (0.62, 0.82) and (0.46, 0.68), respectively. Furthermore, the hypothesis $H_0: p_Q = p_h$ (as opposed to $H_1: p_Q > p_h$) is accepted at the 0.01 level of significance but rejected at the 0.05 level of significance. Thus it can be safely concluded that QUALICON performed as well as, or better than, human assessment did on the task.

One would have noticed that the success rate (both QUALICON and human) is not high. The average inter-human agreement rate for the different imposed errors is depicted in Figure 3.16(b). Agreement is high for recognition of a normal potential and one in which the recording electrode is inverted. There is much poorer agreement between individuals for the other technical errors, reflecting that less information was provided than would have been necessary. In daily practice, a series of potentials are usually recorded on a nerve or muscle. Comparison among them is an important clue for detecting technical errors.
This suggests the use of multiple potentials in a group for detecting errors in a future extension of QUALICON.

Figure 3.16: (a) Manual success rates of human assessors compared to QUALICON. (b) Average agreement rate for manual assessment (this excludes assessor 4, who is a resident in training).

3.5 Remarks

The experienced electromyographer (EMGer) or technologist can usually detect a technical error as a cause of an abnormal potential rapidly and with ease. Doing so is a vital element in the quality control of electromyography. Contemporary equipment automates much of the requisite measurement needed, for example, in performing nerve conduction studies but does not take cognizance of error as a cause of abnormality. Artificial intelligence offers a possible solution to this problem as demonstrated through the coupled knowledge based prototype system QUALICON.

QUALICON is not the first attempt at computerized quality control in nerve conduction studies. With conventional signal processing and database techniques, deviations of features from reference values are used [Stalberg and Stalberg 89] to turn on a warning display. MUNIN [Andreassen et al. 89], currently under development by large research groups, is one system which gives a test guide in electromyography. MUNIN displays where the electrodes should be placed and other test setup information but does not check for technical errors. QUALICON is unique in that it tries to pinpoint the most likely technical error given the recorded abnormal potential.

The intended users of QUALICON are residents, fellows and technologists learning in a hospital EMG laboratory. Given the level of knowledge and experience in nerve conduction studies, QUALICON only considers the most common technical errors in this context.
Also, development of appropriate technique rather than a complete system is emphasized in this work. Thus 6 kinds of technical errors are considered in QUALICON, including reversal of stimulating or recording polarity, misplacement of stimulating or recording electrodes, and use of submaximal or supermaximal stimulating current.

QUALICON’s task is to recognize abnormal potentials due to technical errors. An abnormal potential due to disease is defined, from QUALICON’s point of view, as an abnormal potential which is technically correction-resistant. No attempt is made for QUALICON to differentiate among disease states. With the success in applying the Bayesian network technique to the recognition of technical errors, the next step of my thesis research is to extend the technique to disease diagnosis.

Currently, QUALICON supports quality control for nerve conduction studies derived from median and ulnar, motor and sensory nerve, posterior tibial motor nerve and sural sensory nerve conductions. The CMAPs and SNAPs used by QUALICON can be acquired from any electromyograph capable of producing digitized potentials. With efficient signal processing and inference algorithms implemented for QUALICON, it takes 6 seconds to evaluate a potential on an IBM AT compatible. The short processing time with such a small computer means that QUALICON can be easily included in a computerized electromyograph.

In an assessment using 84 potentials, QUALICON’s performance is compared with that of human professionals. QUALICON’s 95% confidence interval of success probability is assessed as (0.62, 0.82), with the corresponding interval for the average human professional being (0.46, 0.68). The assessment suggests QUALICON performs as well as human professionals on the given task. It is currently used as a teaching instrument at NDU of VGH.
Future extension to QUALICON can be expected in supporting other nerves routinely used in nerve conduction studies, and in strengthening the knowledge base to include such factors as the effect of age. Although QUALICON detects only 6 types of technical errors, its success suggests that a more sophisticated system can be built using the same approach to deal with a broader range of technical errors.

Chapter 4

MULTIPLY SECTIONED BAYESIAN NETWORKS AND JUNCTION FORESTS FOR LARGE EXPERT SYSTEMS

With QUALICON’s performance being satisfactory, the neuromuscular diagnosis involving a painful or impaired upper limb was tackled using Bayesian networks. This is the PAINULIM project. Two major problems arose.

First, the PAINULIM project has a large domain. The Bayesian network representation contains 83 variables including 14 diseases and 69 features, each of which has up to three possible outcomes. The network is multiply connected and has 271 arcs and 6795 probability values. When transformed into a junction tree representation (to be detailed below), the system contains 10608 belief values. To process a problem of this complexity with a reasonable system response time demands powerful computing equipment.

On the other hand, the tight schedule of the medical staff demands that the knowledge acquisition (including initial network construction, subsequent system testing and refinement) be conducted within the hospital environment, where most computing equipment consists of small personal computers.

These two conflicting demands motivate the development of a technique to reduce the computational complexity. The result is the general technique of Multiply Sectioned Bayesian Networks (MSBNs) and junction forests presented in this chapter. The MSBN technique is an extension to the d-separation concept [Pearl 88] and the junction tree technique [Andersen et al. 89, Jensen, Lauritzen and Olesen 90]. The application of MSBN to PAINULIM is presented in chapter 5.

Section 4.1 introduces an important observation called localization which is to be associated with large application domains. MSBN exploits localization. Section 4.2 reviews background research, particularly the d-separation concept [Pearl 88] and the junction tree technique [Jensen, Olesen and Andersen 90, Jensen, Lauritzen and Olesen 90, Andersen et al. 89] on which the MSBN technique is based. Section 4.3 explains why ‘obvious’ solutions to exploit localization do not work, and Section 4.4 gives an overview of the MSBN technique. These two sections serve to motivate and guide readers into the subsequent sections, which present the mathematical theory necessary for the technique.

4.1 Localization

As Cooper (1990) has shown, probabilistic inference in a general Bayesian net is NP-hard. Several approaches have been pursued to reduce the computational complexity of inference in Bayesian nets; these are reviewed in section 1.3.4. Here the exact methods in section 1.3.4 are restated briefly and the problems that remain are discussed.

Efficient algorithms have been developed for inference in Bayesian nets with special topologies [Pearl 86, Heckerman 90a]. Unfortunately many domain models cannot be represented by these special types of Bayesian nets. For sparse nets, Lauritzen and Spiegelhalter [1988] have created a secondary directed tree structure to achieve efficient computation. Jensen, Lauritzen and Olesen [1990] and Shafer and Shenoy [1988] have created an undirected tree structure. Creating secondary structures also offers the advantage of trading compile time (the computation time spent in transforming a Bayesian net into a secondary structure) with run time (the computation time spent in inference). This advantage is particularly relevant to reusable systems (e.g., expert systems). However, for large applications, the run time overhead (both space and time) is still forbidding.

Pruning Bayesian nets with respect to each query instance is yet another exact method with savings in computational complexity [Baker and Boult 90]. A portion S of a Bayesian net may not be relevant given a set of evidence and a set of queries. It can be pruned away before computation. In light of a piece of new evidence, S may become relevant but cannot be restored within the pruning algorithm. If networks have to be pruned for each set of queries, the advantage of trading compile time with run time will also be lost. The MSBN technique can be viewed as a way to retain the advantage of the pruning algorithm while overcoming its limitations.

This chapter addresses domains representable by general but sparse networks and characterized by incremental evidence. It addresses reusable systems where the probabilistic knowledge can be captured once and be used for multiple cases. Current Bayesian net representations do not consider structure in the domain and lump all variables into a homogeneous network. For small applications, this is appropriate. But for a large application domain where evidence arrives incrementally, in practice one often directs attention to only part of the network within a period of time, i.e., there is ‘localization’ of queries and evidence. More precisely, ‘localization’ means 2 things. For an average query session, first, only certain parts of a large network are interesting¹; second, new evidence and queries are directed to a small part of a large network repeatedly within a period of time. When this is the case, the homogeneous network is inefficient since newly arrived evidence has to be propagated to the overall network before queries can be answered.

¹ ‘Interesting’ is more restrictive than ‘relevant’. One may not be interested in something even though it is relevant.

The following observation of the PAINULIM domain, based on the practice in the Neuromuscular Disease Unit (NDU), Vancouver General Hospital (VGH), illustrates localization. A neurologist, examining a patient with a painful impaired upper limb, may temporarily consider only the implication of his findings on a set of disease candidates. He may not consider the diagnostic significance of each available laboratory test until he has finished the clinical examination. After each clinical finding (about 5 findings on an average patient), he dynamically changes his choice of the most likely disease candidates. Based on this he chooses the best question to ask the patient next or the best examination to perform on the patient next. After the clinical examination of the patient, findings highlight certain disease candidates and make others less likely, which may suggest that further nerve conduction studies are of no help at all, while (needle) EMG tests are diagnostically beneficial (about 60% of patients have only EMG tests, and about 27% of patients have only nerve conduction studies). Since EMG tests are usually not comfortable for the patients, the neurologist would not perform a test unless it is diagnostically necessary. Thus he would perform each test depending on the results of previous ones (about 6 EMG tests are performed on an average patient, and 4 nerve conduction studies).

The above scenario and statistics illustrate both aspects of localization. During the clinical examination, only clinical findings and disease candidates are of current interest to the neurologist; and during EMG tests, only EMG test results and their implications on a subset of the diseases are under the neurologist’s attention; furthermore, for a large percentage (87%) of patients, only one of EMG or nerve conduction studies is required.
If the neurologist is assisted by a Bayesian network based system, the evidence and queries would repeatedly (about 5 times) involve a ‘small’ part of the network during each diagnostic period (the clinical or the EMG). For 87% of patients certain parts of the network (either the EMG or the nerve conduction) may not be of interest at all. If the Bayesian net representation is homogeneous, each batch of clinical findings has to be propagated to all the EMG and nerve conduction variables, which are not relevant at the moment. Likewise, after each batch of EMG tests, the overall net has to be updated even though the neurologist is only interested in planning the next EMG test.

A large application domain can often be partitioned naturally in terms of localization. In the PAINULIM domain, a natural subdomain can be formed from the knowledge about the clinical symptoms and a set of diseases. Another natural subdomain can be formed from the knowledge about EMG results and a subset of diseases. The third natural subdomain can be formed from the knowledge about nerve conduction study results and a different subset of diseases. One problem of current Bayesian net representations is that they do not provide means to distinguish variables according to natural subdomains. Heckerman [1990b] partitions Bayesian nets into small groups of naturally related variables to ease the construction of large networks. But once the construction is finished, the run time representation is still homogeneous.

If groups of naturally related variables in a domain can be identified and represented, the run time computation can be restricted to one group at any given stage of a query session due to localization. In particular, one may not need to propagate new evidence beyond the current group. Along with the arrival of new evidence, attention can be shifted from one group to another.
Chunks of knowledge not required with respect to the current attention remain inactive (without being thrown away) until they are activated. This way, the run time overhead is governed by the size of the group of naturally related variables, not the size of the application domain. Computational savings can be achieved. As demonstrated by Heckerman [1990b], grouping of variables can also help in ease and accuracy in the construction of Bayesian networks.

Partitioning a large domain into separate knowledge bases and coordinating them in problem solving have a history in rule-based expert systems, termed blackboard architectures [Nii 86a, Nii 86b]. However a proper parallel for Bayesian network technology has not yet appeared.

This chapter derives constraints which can often be satisfied such that a natural (localization preserving) partition of a domain and its representation by separate Bayesian subnets are possible. Such a representation is termed a multiply sectioned Bayesian network (MSBN). In order to perform efficient evidential reasoning in a sparse network, the set of subnets is transformed into a set of junction trees as a secondary representation, which is termed a junction forest. The junction forest becomes the permanent representation for the reusable system in which incremental evidential reasoning takes place. Since the junction trees preserve localization, each of them stands as a computational object which can be used alone during reasoning. Multiple linkages between the junction trees are introduced to allow evidence acquired from previously active junction trees to be absorbed into a newly active junction tree of current interest. In this way, localization naturally existing in the domain can be exploited and the idea illustrated above is realized.

4.2 Background

This section reviews the background research particularly related to the MSBN technique which is not included in section 1.3.
4.2.1 Operations on Belief Tables

The following introduces the belief table representation of a probability distribution on a set of variables, and the mathematical operations on belief tables.

A belief table [Andersen et al. 89, Jensen, Olesen and Andersen 90], or potential [Lauritzen and Spiegelhalter 88], denoted as $B(\cdot)$, is a non-normalized probability distribution. It can be viewed as a function from the space of a set of variables to the reals. For example, the belief table $B(X)$ of a set $X$ of variables maps $\Psi(X)$ (the space of $X$ defined in section 1.3.3) to the reals. If $x \in \Psi(X)$, the belief value of $x$ is denoted by $B(x)$. Denote a set $X$ of variables and the corresponding belief table $B(X)$ with an ordered pair $(X, B(X))$ and call the pair a world.

For $Y \subseteq X$, the projection $y \in \Psi(Y)$ of $x \in \Psi(X)$ to the space $\Psi(Y)$ is denoted as $Prj_{\Psi(Y)}(x)$. Denote the marginalization of $B(X)$ to $Y \subseteq X$ by $\sum_{X \setminus Y} B(X)$, which specifies a belief table on $Y$. The operation is defined as follows. If $B(Y) = \sum_{X \setminus Y} B(X)$, then for all $y \in \Psi(Y)$,

$B(y) = \sum_{Prj_{\Psi(Y)}(x) = y} B(x).$

Similarly denote the multiplication of $B(X)$ and $B(Y)$ by $B(X) * B(Y)$, which specifies a belief table on $X \cup Y$. If $B(X \cup Y) = B(X) * B(Y)$, then for all $z \in \Psi(X \cup Y)$, $B(z) = B(x) * B(y)$ where $x = Prj_{\Psi(X)}(z)$ and $y = Prj_{\Psi(Y)}(z)$.

Denote the division of $B(X)$ over $B(Y)$ by $B(X)/B(Y)$, which specifies a belief table on $X \cup Y$. If $B(X \cup Y) = B(X)/B(Y)$, then for all $z \in \Psi(X \cup Y)$, $B(z) = B(x)/B(y)$ if $B(y) \neq 0$, where $x = Prj_{\Psi(X)}(z)$ and $y = Prj_{\Psi(Y)}(z)$.

4.2.2 Transform a Bayesian Net into a Junction Tree

The MSBN technique is an extension to the junction tree technique [Andersen et al. 89, Jensen, Lauritzen and Olesen 90], which transforms a Bayesian net into an equivalent secondary structure where inference is conducted (Figure 4.19). Because of this restructuring, belief propagation in multiply connected Bayesian nets can be performed in a manner similar to that in singly connected nets.
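Since the junction tree technique rests on the belief-table operations of Section 4.2.1, a small sketch of marginalization, multiplication and division may help fix the definitions. It is a minimal illustration under assumptions made here: the class name `BeliefTable`, the dictionary layout (value tuples mapped to belief values) and the example numbers are hypothetical, and division by a zero belief value is taken to yield 0, a convention the definition above leaves unspecified.

```python
from itertools import product


class BeliefTable:
    """A belief table B(X): variables X plus a map from configurations to reals."""

    def __init__(self, variables, values):
        self.variables = tuple(variables)   # ordered variable names
        self.values = dict(values)          # configuration tuple -> belief value


def project(config, from_vars, to_vars):
    """Projection of a configuration of from_vars onto the subset to_vars."""
    return tuple(config[from_vars.index(v)] for v in to_vars)


def marginalize(table, keep):
    """Sum B(X) down to the variables in keep (a subset of X)."""
    kept = tuple(v for v in table.variables if v in keep)
    out = {}
    for x, belief in table.values.items():
        y = project(x, table.variables, kept)
        out[y] = out.get(y, 0.0) + belief
    return BeliefTable(kept, out)


def _combine(bx, by, op):
    """Apply op to B(x) and B(y) for every configuration z of X union Y."""
    union = bx.variables + tuple(v for v in by.variables if v not in bx.variables)
    domains = {}
    for table in (bx, by):
        for config in table.values:
            for var, val in zip(table.variables, config):
                seen = domains.setdefault(var, [])
                if val not in seen:
                    seen.append(val)
    out = {}
    for z in product(*(domains[v] for v in union)):
        x = project(z, union, bx.variables)
        y = project(z, union, by.variables)
        out[z] = op(bx.values.get(x, 0.0), by.values.get(y, 0.0))
    return BeliefTable(union, out)


def multiply(bx, by):
    return _combine(bx, by, lambda a, b: a * b)


def divide(bx, by):
    # Assumption: a zero divisor gives 0 instead of raising an error.
    return _combine(bx, by, lambda a, b: a / b if b != 0 else 0.0)


# Usage with made-up numbers: B(A) * B(A, B), then marginalize back to B.
p_a = BeliefTable(("A",), {(0,): 0.4, (1,): 0.6})
p_b_given_a = BeliefTable(("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
joint = multiply(p_a, p_b_given_a)
marginal_b = marginalize(joint, {"B"})
```

Storing a table as a plain dictionary keeps the three operations close to their definitions; a practical system would more likely use dense arrays indexed by variable.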
The following briefly summarizes the junction tree technique. Readers not familiar with the terminologies should refer to Appendix A.

Moralization. Transform the DAG into its moral graph.

Triangulation. Triangulate the moral graph. Call the resultant graph a morali-triangulated graph.

Clique hypergraph formation. Identify cliques of the morali-triangulated graph and obtain a clique hypergraph.

Junction tree construction. Organize the clique hypergraph into a junction tree of cliques.

Node assignment. Assign each node in the DAG to a clique in the junction tree of cliques.

Belief universes formation. For each clique $C_i$ in the junction tree of cliques, obtain its belief table $B(C_i)$ by multiplication of the conditional probability tables of its assigned nodes. Call the pair $(C_i, B(C_i))$ a belief universe. When it is clear from the context, no distinction is made between a junction tree of cliques and a junction tree of belief universes.

Inference is conducted through the junction tree representation. An absorption operation is defined for local belief propagation. Global belief propagation is achieved by a forward propagation procedure DistributeEvidence and a backward propagation procedure CollectEvidence.

Belief initialization. Before any evidence can be entered to the junction tree, the belief tables are made consistent by CollectEvidence and DistributeEvidence such that the prior marginal probability for a variable (of the original Bayesian net) can be obtained by marginalization of the belief table in any universe which contains it.

Evidential reasoning. When evidence about a set of variables (of the original Bayesian net) is available, the evidence is entered to universes which contain the variables.
Then the belief tables of the junction tree are made consistent again by CollectEvidence and DistributeEvidence such that the posterior marginal probability for a variable can be obtained from any universe containing the variable.

The computational complexity of evidential reasoning in junction trees is about the same as that of the reasoning method by Lauritzen and Spiegelhalter [1988], which can be viewed as performed on a (secondary) directed tree structure [Shachter 88, Neapolitan 90]. But junction trees are undirected and allow more flexible computation. The junction tree representation is explored in this chapter since its flexibility is of crucial importance to the MSBN extension.

4.2.3 d-separation

The concept of d-separation introduced by Pearl [1988, pages 116-118] is fundamental in probabilistic reasoning in Bayesian networks. It permits easy determination, by inspection, of which sets of variables are considered independent of each other given a third set, thus making any DAG an unambiguous representation of dependence and independence. It plays an important role in our partitioning of Bayesian networks.

Definition 5 (d-separate [Pearl 88]). If $X$, $Y$, and $Z$ are 3 disjoint subsets of nodes in a DAG, then $Z$ is said to d-separate $X$ from $Y$ if there is no path between a node in $X$ and a node in $Y$ along which 2 conditions hold: (1) every node with converging arcs (head-to-head node) is in $Z$ or has a descendant in $Z$, and (2) every other node (non-head-to-head node) is outside $Z$. A path satisfying the conditions above is said to be active; otherwise it is said to be blocked by $Z$.

For example, in the DAG $\Theta$ of Figure 4.17, $\{F_1\}$ d-separates $\{F_2\}$ from $\{H_1, H_2\}$. $\{H_2, H_3, H_4\}$ d-separates $\{E_1, E_2, E_3, E_4\}$ from the rest. The path between $A_3$ and $E_4$ is blocked by $H_4$.

The importance of d-separation lies in the fact that, in a Bayesian network, $X$ and $Y$ are conditionally independent given $Z$ if $Z$ d-separates $X$ from $Y$ [Geiger, Verma and Pearl 90].
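The examples above can be checked mechanically. The sketch below does not enumerate paths as Definition 5 does; it uses the equivalent, well-known test that $Z$ d-separates $X$ from $Y$ exactly when $Z$ separates them in the moral graph of the ancestral set of $X \cup Y \cup Z$. The edge list for $\Theta$ is an assumption: Figure 4.17 is not reproduced here, so the arcs were read off the conditional probabilities of Example 7 and the cycle discussed in section 4.3. The names `PARENTS`, `ancestral_set` and `d_separates` are hypothetical.

```python
# Assumed parent sets for the DAG Theta of Figure 4.17 (roots H1, H3 omitted).
PARENTS = {
    "A1": ["H1"], "A2": ["H3"], "H2": ["A1", "A2"],
    "A3": ["H3"], "H4": ["A3"], "A4": ["H4"],
    "F1": ["H1", "H2"], "F2": ["F1"],
    "E3": ["H2", "H3"], "E4": ["H4"], "E1": ["E3"], "E2": ["E3", "E4"],
}

def ancestral_set(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def d_separates(parents, X, Y, Z):
    """True if Z d-separates X from Y (moralized ancestral graph test)."""
    X, Y, Z = set(X), set(Y), set(Z)
    keep = ancestral_set(parents, X | Y | Z)
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:                      # drop arc directions
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):        # marry parents of a common child
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    reached = set(X - Z)
    frontier = list(reached)
    while frontier:                       # search that never enters Z
        n = frontier.pop()
        for m in adj[n] - Z:
            if m not in reached:
                reached.add(m)
                frontier.append(m)
    return not (reached & Y)

E_GROUP = {"E1", "E2", "E3", "E4"}
REST = {"A1", "A2", "A3", "A4", "H1", "F1", "F2"}
checks = [
    d_separates(PARENTS, {"F2"}, {"H1", "H2"}, {"F1"}),
    d_separates(PARENTS, E_GROUP, REST, {"H2", "H3", "H4"}),
    d_separates(PARENTS, {"A3"}, {"E4"}, {"H4"}),
]
```

Under the assumed edge list all three entries of `checks` come out true, matching the three statements in the text. Instantiating a head-to-head node has the opposite effect: with `Z = {"H2"}` the function reports that `A1` and `A2` are no longer separated.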
4.3 ‘Obvious’ Ways to Explore Localization

Recall that ‘localization’ means that (1) for an average query session, only certain parts of a large network are of interest; and (2) new evidence and queries are directed to a small part of a large network repeatedly within a period of time.

An obvious way to exploit localization in multiply connected networks is to preserve localization within subtrees of a junction tree by a clever choice in triangulation and junction tree construction. If this can be done, the junction tree can be split and each subtree can be used as a separate computational object. The following example shows that this is not always a workable solution.

Consider the DAG $\Theta$ in Figure 4.17. Suppose the variables in the DAG form three naturally related groups which satisfy localization:

$G_1 = \{A_1, A_2, A_3, A_4, H_1, H_2, H_3, H_4\}$
$G_2 = \{F_1, F_2, H_1, H_2\}$
$G_3 = \{E_1, E_2, E_3, E_4, H_2, H_3, H_4\}$

One would like to construct a junction tree of it which preserves localization within three subtrees. The graph $\Phi$ in Figure 4.17 is the moral graph of $\Theta$. Only the cycle $A_3 - H_3 - E_3 - E_4 - H_4 - A_3$ needs to be triangulated. There are six distinct ways of triangulation, out of which only two do not mix nodes in different groups. The two triangulations have a link $(H_3, H_4)$ in common and they do not make a significant difference in the following analysis. The graph $\Lambda$ in Figure 4.17 shows one of the two triangulations. The nodes of graph $\Gamma$ are all the cliques in $\Lambda$. The junction tree $\Gamma$ does not preserve localization, since cliques 7, 8, 9, 12 and 1 correspond to group $G_1$ but are connected via cliques 10 and 11, which contain $E_3$ from group $G_3$. Examine the morali-triangulated graph $\Lambda$ to see why this is unavoidable.

Figure 4.17: A DAG $\Theta$, its moral graph $\Phi$, one of $\Phi$'s triangulated graphs $\Lambda$, and the corresponding junction tree $\Gamma$. Each clique in $\Gamma$ is numbered, with the number separated from the clique members.
When there is evidence towards $A_1$ or $A_2$ in $\Lambda$, updating the belief in group $G_3$ requires passing the joint distribution of $H_2$ and $H_3$. But updating the belief in $A_3$ and $A_4$ requires passing only the marginal distribution of $H_3$. That is to say, updating the belief in $A_3$ and $A_4$ needs less information than updating group $G_3$. In the junction tree representation, which insists on a single information channel between any two cliques, this becomes a path from cliques 7, 8, and 9 to clique 12 or 1 via cliques 10 and 11.

In general, let $X$ and $Y$ be two sets of variables in the same natural group, and let $Z$ be a set of variables in a distinct neighbor group. Suppose the information exchange between pairs of them requires the exchange of distributions on sets $I_{XY}$, $I_{XZ}$ and $I_{YZ}$ of variables respectively. Sometimes $I_{XY}$ is a subset of both $I_{XZ}$ and $I_{YZ}$. When this is the case, a junction tree representation will always indirectly connect cliques corresponding to $X$ and $Y$ through cliques corresponding to $Z$ if the method by Andersen et al. [1989] and Jensen, Lauritzen and Olesen [1990] is followed.

However, there is a way around the problem with a brute force method. In the above example, when there is evidence towards $A_1$ or $A_2$, the brute force method pretends that updating the belief in $A_3$ and $A_4$ needs as much information as $G_3$. What one does is to add a dummy link $(H_2, A_3)$ to the moral graph $\Phi$ in Figure 4.17. Then triangulating the augmented graph gives the graph $\Lambda'$ in Figure 4.18. The resultant junction tree $\Gamma'$ in Figure 4.18 does have 3 subtrees which correspond to the 3 groups desired. However, the largest cliques now have size 4 instead of 3 as before. In the binary case, the size of the total state space is 92 instead of 84 as before.

In general, the brute force method preserves natural localization by congregation of the set of interfacing nodes (nodes $H_2$, $H_3$, $H_4$ above) between natural groups.
In this way, the joint distribution on interfacing nodes² can be passed between groups through a single channel, and preservation of localization and preservation of tree structure can be compatible. However, in a large application domain where the original network is sparse, this will greatly increase the amount of computation in each group due to the exponential enlargement of the clique state space. In general, the added computation could outweigh the savings gained by exploiting localization.

² It will be shown later that when the set of interfacing nodes possesses a certain property, the joint distribution on the set is the minimum information to be exchanged.

Figure 4.18: $\Lambda'$ is a triangulated graph. $\Gamma'$ is a junction tree of $\Lambda'$.

The trouble illustrated in the above 2 situations can be traced to the tree structure of the junction tree representation, which insists on a single path between any 2 cliques in the tree. In the normal triangulation case, one has small cliques but loses localization. In the brute force case, one preserves localization but does not have small cliques. To summarize, the preservation of natural localization and small cliques cannot coexist under the method of Andersen et al. [1989] and Jensen, Lauritzen and Olesen [1990]. It is claimed here that this is due to a single information channel between local groups of variables.

In the following, it is shown that by introducing multiple information channels between groups and by exploiting conditional independence, one can pass the joint distribution on a set of interfacing variables between groups by passing only marginal distributions on subsets of the set.

4.4 Overview of MSBN and Junction Forest Technique

As demonstrated in the previous section, in order to exploit localization, the tree structure and single channel requirement have to be relaxed.
Since the computational advantage offered by tree structure has also been demonstrated repeatedly, it is not desirable to abandon tree structure totally. Rather, it is desirable to keep tree structure within each natural group, but allow multiple channels between groups. The MSBN and junction forest representations extend the d-separation concept and the junction tree technique to implement this idea.

This section outlines the development of these representations. Each major step involved is described in terms of its functionality. The problems possibly encountered and the hints for solutions are discussed. The details are presented in the subsequent sections. Since the technique extends the junction tree technique reviewed in section 4.2.2, the parallels and the differences are indicated. Figure 4.19 illustrates the major steps in transformation of the original representation into the secondary representation for both techniques.

Figure 4.19: Left: major steps in transformation of a USBN (UnSectioned Bayesian Network) into a junction tree. Right: major steps in transformation of a MSBN into a junction forest.
USBN (left): moralize (a moral graph); triangulate (a morali-triangulated graph); identify cliques (a clique hypergraph); organize into a tree (a junction tree of cliques); assign belief tables (a junction tree of belief universes); belief initialization (a consistent junction tree of belief universes).
MSBN (right): morali-triangulate by local computation (a set of morali-triangulated graphs); identify cliques (a set of clique hypergraphs); organize each into a tree (a junction forest of cliques); create linkages (a linked junction forest of cliques); assign belief tables (a junction forest of belief universes); belief initialization (a consistent junction forest of belief universes).

The d-sepset. The purpose is to partition a large domain according to natural localization into subdomains such that each can be represented separately by a Bayesian subnet, and such that these subnets can cooperate with each other during inference by exchanging a minimum amount of information between them. In order to do that, one needs to find the technical constraints which have to be followed during the partition. This can be formulated conceptually in the opposite direction. Suppose the domain has been represented with a homogeneous network. The task is to find the necessary technical constraints to be followed when the net is partitioned into subnets according to natural localization.

Section 4.5 defines d-sepsets, whose joint distribution is the minimum information to be exchanged to keep neighbor subnets informed. It is shown that in the junction tree representation of the homogeneous net, the nodes in d-sepsets actually serve as information passageways between nodes in different subnets. Thus the d-sepset is a constraint on the interface between each pair of subnets.

Sectioning. Continuing in the conceptual direction, section 4.6 describes how to section a homogeneous Bayesian net into subnets called sects, based on d-sepsets. The collection of these sects forms a MSBN. It is described how probability distributions should be assigned to sects relative to the distribution in the homogeneous network. In particular, it is necessary to assign the original probability table of a d-sepnode to a unique sect which contains the d-sepnode and all its parent nodes, and to assign a uniform table to the same d-sepnode in other sects. This is necessary for two reasons. First, if a sect does not contain all of a d-sepnode's parents in the homogeneous net, the probability table of the d-sepnode must be reduced in size. Second, this assignment guarantees that the joint system belief constructed later is proportional to the joint probability distribution of the homogeneous net.
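The effect of the table assignment can be seen on a toy case. The sketch below is a minimal illustration, assuming a hypothetical three-node chain A -> H -> B sectioned into {A, H} and {H, B}, with made-up probabilities. The d-sepnode H keeps its original table in the sect that contains its parent A and receives a uniform table in the other sect. The plain product of the two sect distributions is used here as a simplified stand-in for the joint system belief, which is defined differently in section 4.7.3.

```python
from itertools import product

# Hypothetical binary chain A -> H -> B (all numbers are illustrative).
p_a = {0: 0.3, 1: 0.7}
p_h_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key (h, a)
p_b_given_h = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.25, (1, 1): 0.75}  # key (b, h)
uniform_h = {0: 0.5, 1: 0.5}  # table given to H in the sect without its parent

def joint(a, h, b):
    """Joint distribution of the unsectioned (homogeneous) net."""
    return p_a[a] * p_h_given_a[(h, a)] * p_b_given_h[(b, h)]

def sect1(a, h):
    """Sect {A, H}: H is assigned here, so it keeps p(H | A)."""
    return p_a[a] * p_h_given_a[(h, a)]

def sect2(h, b):
    """Sect {H, B}: H is not assigned here, so it gets the uniform table."""
    return uniform_h[h] * p_b_given_h[(b, h)]

ratios = {
    round(sect1(a, h) * sect2(h, b) / joint(a, h, b), 12)
    for a, h, b in product((0, 1), repeat=3)
}
```

Every configuration yields the same ratio (0.5, the uniform value), so the product of the sect distributions differs from the joint only by a constant factor. Had H kept an informative table in both sects, the ratio would vary with the state of H.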
In order to perform efficient inference in a general but sparse network, one wants to transform each sect into a separate junction tree which will stand as an inference entity. When doing so, it is necessary to preserve the intactness of the clique hypergraph resulting from the corresponding homogeneous net. That is, one has to ensure that each clique in the original hypergraph will find at least one host sect. This imposes another constraint, termed soundness of sectioning, on the overall organization of sects. In fact, the soundness of sectioning together with the d-sepset interface imposes conditional independence constraints at a macro level, i.e., at the level of sects (as opposed to conditional independence at the level of nodes).

In addition to a necessary and sufficient condition for soundness, a guiding rule called covering subDAG is provided to ensure soundness. Its repeated application forms a second rule called hypertree, which can be used for creating sophisticated MSBNs whose sectioning is sound. Although there exist MSBNs of sound sectioning which do not follow the 2 rules, it is shown that computational advantages are obtained in MSBNs sectioned according to the rules. Further discussion will therefore be directed only to MSBNs satisfying the 2 rules.

Moralization and triangulation. To transform a MSBN into a set of junction trees requires moralization and triangulation as reviewed in section 4.2.2. However, in the MSBN context an operational option is available, i.e., the transformation can be performed globally or by local computation at the level of sects. The global computation performs moralization and triangulation in the same way as in the junction tree technique, with care taken not to mix nodes in distinct sects into one clique. An additional mapping of the resultant morali-triangulated graph into subgraphs corresponding to sects is needed. But when space saving is a concern, local computation is desired.
The pitfalls and procedures in moralization and triangulation by local computation are discussed. Since the number of parents of a d-sepnode may be different in different sects, moralization in a MSBN cannot be achieved by ‘pure’ local computation in each sect. Communication between sects is required to ensure that parent d-sepnodes are moralized identically in different sects.

The criterion of triangulation in the MSBN is to ensure the ‘intactness’ of the hypergraph resulting from the corresponding homogeneous net. Problems arise if one insists on triangulation by local computation at the level of sects. One problem is that an inter-sect cycle will be triangulated in the homogeneous net, but the cycle cannot be recognized by viewing locally in each sect involved. Another problem is that cycles involving d-sepnodes may be triangulated differently in different sects. The solution is to let sects communicate during triangulation. Since moralization and triangulation both involve adding links and both require communication between sects, the corresponding local operations in each sect can be performed together and messages to other sects can be sent together. Therefore, operationally, moralization and triangulation in a MSBN are not separate steps as in the junction tree technique. The corresponding integrated operation is termed morali-triangulation to reflect this reality conceptually.

In section 4.7.1, the above concept of ‘intactness’ of the hypergraph is formalized in terms of invertibility of morali-triangulation. It is shown that if the sectioning of a MSBN is sound, then there exists an invertible morali-triangulation such that the ‘intactness’ of the hypergraph is preserved. Section 4.7.1 provides an algorithm for an invertible morali-triangulation assuming a covering subDAG.
Next steps in the junction tree technique. In the junction tree technique, after triangulation, the further steps of transformation are identification of cliques in the morali-triangulated graph (clique hypergraph formation) and junction tree construction. In MSBNs, these steps are performed for each sect in the same way as in the junction tree technique. A MSBN is thus transformed into a set of junction trees called a junction forest of cliques. See Andersen et al. [1989] and Jensen, Lauritzen and Olesen [1990] for technical details of these steps.

Linkage formation. An important extension of MSBNs and junction forests over the junction tree technique is the formation of multiple information channels between junction trees (in a junction forest) such that a joint distribution on a d-sepset can be passed between a pair of junction trees by passing marginal distributions on subsets of the d-sepset. In this way, the exponential enlargement of the clique state space caused by the brute force method (section 4.3) can be avoided. These channels are termed linkages (section 4.7.2). Each linkage is a set of d-sepnodes which links 2 cliques, each in one of the 2 junction trees involved. During inference, if evidence is obtained from a previously active junction tree, it can be propagated to a newly active neighbor junction tree through the linkages between them.

As can be imagined, multiple linkages can cause redundant information passage or confuse the information receiver. The problem can be avoided by coordination among linkages during information passing. Since the problem manifests itself differently during belief initialization and evidential reasoning, it has to be treated differently. In both cases, information passing is performed one linkage at a time. During initialization, (redundant) information already passed through other linkages is removed from the linkage belief table before the latter is passed over.
Operationally, one orders linkages. The intersection of a linkage with the linkages ordered before it is defined as the redundancy set of the linkage. The redundancy set tells a linkage what portion of information has to be removed during information passing. During evidential reasoning, a DistributeEvidence is performed after each information passing to avoid confusion in the receiving junction tree. With linkages and redundancy sets created, one has a linked junction forest of cliques.

Formation of joint system belief of junction forest. The joint system belief of the junction forest is defined (section 4.7.3) in terms of the belief on each junction tree, the belief on linkages and the redundancy sets, such that it is proportional to the joint probability distribution of the homogeneous net. With the joint system belief defined, one has a junction forest of belief universes. When it is clear from the context, only ‘junction forest’ is used without differentiating between its different stages.

Consistency and separability of junction forest. As in the case of the junction tree technique, one would like to obtain the marginal probability of a variable by marginalization of the belief in any belief universe of any junction tree which contains the variable, i.e., by local computation. In the case of the junction tree technique, this requires the consistency property, which can be satisfied by performing DistributeEvidence and CollectEvidence as reviewed in section 4.2.2. In the context of a junction forest, an additional property called separability is required (section 4.8) due to multiple linkages between junction trees. It imposes a host composition constraint on the composition of linkage host cliques. The function of linkages is to pass the joint belief of the corresponding d-sepset. When linkage hosts are ill-composed, what is passed over is not a correct version of the joint belief on the d-sepset.
The marginal probabilities thus obtained by local computation will also be incorrect. It is shown that if all the junction trees in a junction forest satisfy the host composition condition, then separability is guaranteed. Why this condition usually holds naturally is explained. The remedy when the condition does not hold is also discussed. With a junction forest structure satisfying separability, and with a set of operations performed to bring the forest into consistency, one is guaranteed to obtain marginal probabilities by local computation.

Belief initialization. Belief initialization (section 4.9.3) in a junction forest is achieved by first bringing the belief universes in each junction tree into consistency, and then exchanging prior beliefs between junction trees to bring the junction forest into global consistency. When exchanging beliefs, care is to be taken that (1) non-trivial information could be contained in either of the 2 junction trees involved (recall that d-sepnodes in some sects are assigned uniform tables during sectioning); and (2) redundant information is not passed through multiple linkages. Section 4.9 defines several levels of operations to initialize the belief of a junction forest by local computation.

Evidential reasoning. Only 1 junction tree in a junction forest needs to be active due to localization, and great computational savings are possible when repeated computation of the junction tree is required. Whenever new evidence becomes available to the currently active junction tree, it is entered and the tree is made consistent such that queries can be answered. Thus the computational complexity of a MSBN/junction forest is roughly $1/\beta$ of that of the corresponding homogeneous net, where $\beta$ is the number of sects in the MSBN. When the user shifts attention, a new junction tree replaces the currently active tree and all previously acquired evidence is absorbed through the operation ShiftAttention.
The operation requires only a chain of neighbor junction trees to be updated. During the inter-junction tree updating, one needs to ensure that no confusion results from multi-linkage information passing.

4.5 The d-sepset and the Junction Tree

4.5.1 The d-sepset

As discussed in section 4.4, the problem of partitioning a Bayesian net by natural localization can be conceptually formulated as though the domain has been represented with a homogeneous network. The task is to find the technical constraint for partitioning the net into subnets such that the subnets can be used separately and cooperatively during inference with a minimum amount of information exchange. This section defines the most important concept for partitioning, namely, the d-sepset. Then some insights are provided into its implication in the secondary structure of DAGs.

Definition 6 (d-sepset). Let $D = D^1 \cup D^2$ be a DAG. The set of nodes $I = N^1 \cap N^2$ is a d-sepset between subDAGs $D^1$ and $D^2$ if the following condition holds. For every $A_i \in I$ with its parents $\pi_i$ in $D$, either $\pi_i \subseteq N^1$ or $\pi_i \subseteq N^2$. Elements of a d-sepset are called d-sepnodes. When the above condition holds, $D$ is said to be sectioned into $\{D^1, D^2\}$.

Note that in general a DAG $D = D^1 \cup D^2$ does not imply that $D$ is sectioned into $\{D^1, D^2\}$, since the intersection of the corresponding 2 sets of nodes may not be a d-sepset.

Lemma 4. Let a DAG $D$ be sectioned into $\{D^1, D^2\}$ and $I = N^1 \cap N^2$ be a d-sepset. $I$ d-separates $N^1 \setminus I$ from $N^2 \setminus I$.

Proof: It suffices to prove that every path between $N^1 \setminus I$ and $N^2 \setminus I$ is blocked by $I$. Every path between $N^1 \setminus I$ and $N^2 \setminus I$ has at least one d-sepnode. From the definition of d-separation, if one of the d-sepnodes on a path is a non-head-to-head node, the path is blocked. In the case of a path with a single d-sepnode, by definition, the d-sepnode must be a non-head-to-head node. In the case of a path with multiple d-sepnodes, for any 2 adjacent d-sepnodes on the path, one of them must be a non-head-to-head node.
□

The lemma can be generalized into the following theorem, which states that d-sepsets d-separate a subDAG from the rest of the DAG. Note that when a d-sepset is indexed with 2 superscripts, their order is immaterial.

Theorem 2 (completeness of d-sepset). Let a DAG $D$ be sectioned into $\{D^1, \ldots, D^\beta\}$ and $I^{ij} = N^i \cap N^j$ be the d-sepset between $D^i$ and $D^j$. $\bigcup_j I^{ij}$ d-separates $N^i \setminus \bigcup_j I^{ij}$ from $N \setminus N^i$.

The theorem implies that the joint distribution on d-sepsets is the minimum information to be exchanged between a Bayesian subnet and the rest of the network. Consider a DAG $(N, E)$ sectioned into $\beta > 1$ subDAGs. Let $A$ denote the set of all the non-d-sepnodes in one of the subDAGs, $C$ denote the union of all the d-sepsets between the chosen subDAG and its neighbor subDAGs, and $B$ denote the set $N \setminus A \setminus C$. From the above theorem, $C$ d-separates $A$ from $B$. Thus $p(ABC) = p(AC)\,p(BC)/p(C)$. When evidence is available such that some of the variables in $A$ are instantiated, one can update $p(AC)$ into $p'(AC) = p'(A)\,p(C|A)$. One can update $p(C)$ into $p'(C)$ by marginalization of $p'(AC)$. Then it follows that $p'(BC) = p(B|C)\,p'(C)$, and that the updated joint distribution is $p'(ABC) = p'(AC)\,p'(BC)/p'(C)$. Note that $p'(BC)$ is obtained by replacing $p(C)$ with $p'(C)$, the posterior joint distribution on the d-sepset union.

Example 5. The DAG $\Theta$ in Figure 4.17 is sectioned into $\{\Theta^1, \Theta^2, \Theta^3\}$ in Figure 4.20. $I^{12} = \{H_1, H_2\}$ is the d-sepset between $\Theta^1$ and $\Theta^2$; $I^{13} = \{H_2, H_3, H_4\}$ is the d-sepset between $\Theta^1$ and $\Theta^3$; and $I^{23} = \{H_2\}$ is the d-sepset between $\Theta^2$ and $\Theta^3$. $I^{12} \cup I^{13} = \{H_1, H_2, H_3, H_4\}$ d-separates the rest of $\Theta^1$ from the rest of $\Theta^2$ and $\Theta^3$.

There is a close relation between the d-sepset and the usual graph separator, given in Proposition 7. If $Z$ is a graph separator of $X$ and $Y$, then the removal of the set $Z$ of nodes from the graph (together with their associated links) would render the nodes in $X$ disconnected from those in $Y$.

Proposition 7. Let a DAG $D$ be sectioned into $\{D^1, D^2\}$.
The set of nodes $I = N^1 \cap N^2$ is a d-sepset between $D^1$ and $D^2$ if and only if $I$ is a graph separator in the moral graph of $D$.

Proof: Suppose $I = N^1 \cap N^2$ is a d-sepset. For any $A_1 \in N^1 \setminus I$ and $A_2 \in N^2 \setminus I$, moralization will not create a link between $A_1$ and $A_2$. Hence $I$ is a graph separator in the moral graph of $D$.

On the other hand, suppose $I$ separates $N^1 \setminus I$ from $N^2 \setminus I$ in the moral graph $M$ of $D$, and $I$ is not a d-sepset. Then there exists $A \in I$ such that $A$ has a parent $A_1 \in N^1 \setminus I$ and a parent $A_2 \in N^2 \setminus I$. But then there would have been a link between $A_1$ and $A_2$ created by moralization. Thus $I$ is not a separator in $M$. Contradiction. □

Figure 4.20: The set $\{\Theta^1, \Theta^2, \Theta^3\}$ of 3 subDAGs (top) forms a sectioning of $\Theta$ in Figure 4.17. $\{\Lambda^1, \Lambda^2, \Lambda^3\}$ (middle) is the set of morali-triangulated graphs of $\{\Theta^1, \Theta^2, \Theta^3\}$, and $\{\Gamma^1, \Gamma^2, \Gamma^3\}$ (bottom) is the corresponding junction forest obtained by Algorithm 2. The ribbed bands indicate linkages.

The properties of d-separation in the DAG representation of Bayesian networks have been studied extensively [Pearl 88, Geiger, Verma and Pearl 90]. They can be used to derive Pearl's propagation algorithm in singly connected Bayesian nets [Neapolitan 90]. But to my knowledge, the implication of d-separation in the secondary structure has not been examined. The definition of the d-sepset now allows one to do so.

4.5.2 Implication of d-sepset in Junction Trees

By representing a multiply connected Bayesian network in its secondary structure, the junction tree, flexible and efficient belief propagation can be achieved. With the d-sepset concept defined, one would like to know how information is passed in the junction tree between nodes separated by the d-sepset in the original Bayesian network.

Lemma 5. Let a DAG $D$ be sectioned into $\{D^1, D^2\}$ and $I = N^1 \cap N^2$ be the d-sepset. A junction tree $T$ can be constructed from $D$ such that the following statement is true.
For all pairs of nodes $A_1 \in N^1 \setminus I$ and $A_2 \in N^2 \setminus I$, if $A_1$ is contained in clique $C_1$ and $A_2$ in $C_2$, then on the unique path between $C_1$ and $C_2$ in $T$ there exists a clique sepset $Q$ containing only d-sepnodes.

Proof:

[Step 1] First, prove that $T$ can be constructed such that $C_1 \neq C_2$. By definition of the d-sepset, $A_1$ and $A_2$ are not adjacent in $D$. Cliques are formed through moralization and triangulation. If $A_1$ and $A_2$ are both adjacent to a d-sepnode $H$, then by definition of the d-sepset, $A_1$ and $A_2$ cannot both be $H$'s parents. Thus moralization does not create a link between $A_1$ and $A_2$.

Enforce the following rule for triangulation: if a cycle contains 2 or more non-adjacent d-sepnodes, add a link between 2 such d-sepnodes first. After moralization, if there is only 1 path between $A_1$ and $A_2$, triangulation will not add a link between them. If there is more than 1 path, at least 1 cycle is formed. For any such cycle including $A_1$ and $A_2$, 2 paths between $A_1$ and $A_2$ can be identified. On each of the 2 paths, there must be at least 1 d-sepnode. With the above rule followed, a link between 2 such d-sepnodes is always added, such that there is no need to link $A_1$ and $A_2$ to triangulate $D$.

Since $A_1$ and $A_2$ are not linked initially, and are not linked during moralization and triangulation, they will not be contained in the same clique in $T$.

[Step 2] Proceed from $C_1$ to $C_2$ along the unique path in $T$ connecting them to find $Q$. Examine $C_1$'s neighbor clique $C_x$.

Case 1: $C_x \subseteq I$. Then $Q = C_1 \cap C_x$.

Case 2: $C_x$ contains nodes in $N^2 \setminus I$. By step 1, $C_1$ does not contain nodes in $N^2 \setminus I$ and $C_x$ does not contain nodes in $N^1 \setminus I$. However, $C_1 \cap C_x \neq \emptyset$. Therefore $Q = C_1 \cap C_x$.

Otherwise $C_x$ contains nodes in $N^1 \setminus I$. In this case, proceed to $C_x$'s neighbor $C_y$ and repeat the above examination. Since the other end of the path is $C_2$, which contains no node in $N^1 \setminus I$, at some point along the path one will eventually hit either case 1 or case 2.
□

The lemma can be generalized to the case of any finite number of subDAGs. This is the following proposition. Its proof is similar to that of the lemma.

Proposition 8 (belief relay). Let a DAG $D$ be sectioned into $\{D^1, \ldots, D^\beta\}$ and $I^i = \bigcup_j I^{ij}$ be the union of d-sepsets between $D^i$ and its neighbor subDAGs. A junction tree $T$ can be constructed from $D$ such that the following statement is true.

For all pairs of nodes $A_1 \in N^i \setminus I^i$ and $A_2 \in N \setminus N^i$, if $A_1$ is contained in clique $C_1$ and $A_2$ in $C_2$, then on the unique path between $C_1$ and $C_2$ in $T$ there exists a clique sepset $Q$ containing only d-sepnodes in $I^i$.

Example 6. Recall the DAG $\Theta$ in Figure 4.17, which is sectioned into $\{\Theta^1, \Theta^2, \Theta^3\}$ in Figure 4.20 with $I = \{H_2, H_3, H_4\}$ being the d-sepset between $\Theta^1$ and $\Theta^3$. Consider the node $A_3$ in clique $\{H_3, H_4, A_3\}$ and the node $E_2$ in clique $\{E_4, E_3, E_2\}$ in the junction tree $\Gamma$ in Figure 4.17. On the path between the 2 cliques, the sepset $\{H_3, H_4\}$ between cliques $\{H_3, H_4, A_3\}$ and $\{H_3, H_4, E_3\}$ contains only d-sepnodes.

When new evidence is available, it can be propagated to the overall junction tree through sepsets between cliques [Jensen, Lauritzen and Olesen 90]. Therefore, the above proposition means that a junction tree can be constructed such that evidence in $N^i \setminus I^i$ must pass through at least 1 sepset containing only nodes in $I^i$ in order to be propagated to nodes in $N \setminus N^i$.

Theorem 2 and Proposition 8 suggest that one can organize the clique hypergraph such that the cliques corresponding to different subDAGs separated by d-sepsets are organized into different junction trees. Communication between them can be accomplished through d-sepsets. This idea is formalized below.

4.6 Multiply Sectioned Bayesian Nets

4.6.1 Definition of MSBN

Definition 7 (MSBN). Let $S = (N, E, P)$ be a Bayesian network; $D = (N, E)$ be sectioned into $\{D^1, \ldots, D^\beta\}$ where $D^i = (N^i, E^i)$; and $I^{ij} = N^i \cap N^j$ be the d-sepset between $D^i$ and $D^j$ ($1 \le i, j \le \beta$; $i \neq j$).

Assign d-sepnodes to subDAGs in the following way. For each d-sepnode $A$, if the in-degree of $A$ in subDAG $D^i$ is no smaller than its in-degree in every subDAG $D^j$ ($j = 1, \ldots, \beta$), then assign $A$ to subDAG $D^i$, breaking ties arbitrarily.

Assign a probability distribution $P^i$ to subDAG $D^i$ ($i = 1, \ldots, \beta$) in the following way. For each node $A \in N^i$, if $A$ is a d-sepnode and $A$ is not assigned to $D^i$, assign to $A$ a uniform probability table.³ Otherwise assign to $A$ a probability table identical to that in $(N, E, P)$.

Call $S^i = (D^i, P^i) = (N^i, E^i, P^i)$ a sect and call the set of sects $\{S^1, \ldots, S^\beta\}$ a Multiply Sectioned Bayesian Network (MSBN).

³ This is necessary for two reasons. First, if a sect does not contain all of a d-sepnode's parents in the homogeneous net, the probability table of the d-sepnode must be reduced in size. Second, this assignment guarantees that the joint system belief constructed in section 4.7.3 is proportional to the joint probability distribution $P$.

The original Bayesian net $S$ is called an ‘UnSectioned Bayesian Network (USBN)’. Note that the sectioning of a Bayesian network is essentially determined by the sectioning of the corresponding DAG $D$. It does not matter how ties are broken; there will be no significant difference in further processing.

Example 7. Suppose the variables in DAG $\Theta$ in Figure 4.17 are all binary. Associate the probability distribution $P$ given in Table 4.3 with $\Theta$. $(\Theta, P)$ is a USBN.

Table 4.3: Probability distribution associated with DAG $\Theta$ in Figure 4.17.
H1: p(h11) = .15
H2: p(h21 | a21 a11) = .8696, p(h21 | a21 a12) = .7, p(h21 | a22 a11) = .6, p(h21 | a22 a12) = .08
H3: p(h31) = .3
H4: p(h41 | a31) = .25, p(h41 | a32) = .4
A1: p(a11 | h11) = .8, p(a11 | h12) = .1
A2: p(a21 | h31) = .8, p(a21 | h32) = .1
A3: p(a31 | h31) = .3, p(a31 | h32) = .8
A4: p(a41 | h41) = .9, p(a41 | h42) = .2
F1: p(f11 | h11 h21) = .7895, p(f11 | h11 h22) = .5, p(f11 | h12 h21) = .6, p(f11 | h12 h22) = .05
F2: p(f21 | f11) = .4, p(f21 | f12) = .75
E1: p(e11 | e31) = .2, p(e11 | e32) = .7
E2: p(e21 | e31 e41) = .9789, p(e21 | e31 e42) = .8, p(e21 | e32 e41) = .9, p(e21 | e32 e42) = .05
E3: p(e31 | h21 h31) = .7702, p(e31 | h21 h32) = .35, p(e31 | h22 h31) = .65, p(e31 | h22 h32) = .01
E4: p(e41 | h41) = .8, p(e41 | h42) = .15

Given the USBN $(\Theta, P)$ and the corresponding 3 subDAGs $\Theta^1$, $\Theta^2$ and $\Theta^3$, a 3-sect MSBN $\{(\Theta^1, P^1), (\Theta^2, P^2), (\Theta^3, P^3)\}$ can be constructed. First assign the d-sepnodes $H_1, \ldots, H_4$ to the subDAGs. $H_2$ and $H_4$ must be assigned to $\Theta^1$. $H_1$ can be assigned to either $\Theta^1$ or $\Theta^2$, and $H_3$ can be assigned to either $\Theta^1$ or $\Theta^3$. Here it is chosen to assign all 4 d-sepnodes to $\Theta^1$. Based on this assignment and the given $P$, the probability distribution for each sect can be determined (Table 4.4). Note the uniform probability tables assigned to d-sepnodes in $\Theta^2$ and $\Theta^3$.

Table 4.4: Probability distribution associated with subDAGs $\Theta^1$, $\Theta^2$ and $\Theta^3$ in Figure 4.20.
Sect 1 ($\Theta^1$): the tables of H1, H2, H3, H4, A1, A2, A3 and A4, identical to those in Table 4.3.
Sect 2 ($\Theta^2$): p(h11) = .5 and p(h21) = .5 (uniform); the tables of F1 and F2, identical to those in Table 4.3.
Sect 3 ($\Theta^3$): p(h21) = .5, p(h31) = .5 and p(h41) = .5 (uniform); the tables of E1, E2, E3 and E4, identical to those in Table 4.3.

4.6.2 Soundness of Sectioning

In order to perform efficient inference computation in a multiply connected Bayesian net, the junction tree technique transforms the Bayesian net into a clique hypergraph through moralization and triangulation. Then the hypergraph is organized into a junction tree, and efficient inference can take place.
Because of the computational advantage of junction trees, in the context of a MSBN, one would like to transform each sect into a junction tree representation. The immediate question is: for an arbitrary MSBN, under what condition on the transformation can correct inference take place in the resultant set of junction trees? The following reviews the major theoretical results related to this question.

Lauritzen et al. [1984] show that the clique hypergraph of a graph is decomposable if the graph is triangulated. Jensen [1988] proves that a hypergraph has a junction tree if it is decomposable. Maier [1983] proves the same in the context of relational databases. Jensen, Lauritzen and Olesen [1990] and Pearl [1988] show that a junction tree representation of a Bayesian net is an equivalent representation in the sense that the information about the joint probability distribution is preserved. Finally, a more flexible algorithm (compared to [Lauritzen and Spiegelhalter 88]) is devised on the junction tree representation of multiply connected Bayesian nets [Jensen, Lauritzen and Olesen 90].

The above results highlight the importance of the clique hypergraphs resulting from triangulation of the original graphs. Thus, when one tries to transform each sect in a MSBN into a junction tree, it is necessary to preserve the intactness of the clique hypergraph resulting from the corresponding USBN. This is possible only if the sectioning of the DAG $D$ of the original USBN is sound, as defined formally below.

Definition 8 (soundness of sectioning) Let a DAG $D$ be sectioned into $\{D^1, \ldots, D^\beta\}$. If there exists a clique hypergraph from $D$ such that for every clique $C_k$ in the hypergraph there is at least 1 subDAG $D^i$ satisfying $C_k \subseteq N^i$, then the sectioning is sound. Call $D^i$ the host subDAG of clique $C_k$.

Although the soundness of sectioning is defined in DAGs, the concept is useful only in the context of MSBNs.
Therefore, when the sectioning of a DAG is sound, it is said that the sectioning of the corresponding USBN into the MSBN is sound. If the sectioning of a DAG $D$ is unsound, then at least 1 clique has no host subDAG in every possible clique hypergraph from $D$. If a MSBN is based on such a sectioning, it is impossible to maintain the autonomous status of sects in the secondary representation.

Example 8 In Figure 4.21, $\{D^1, D^2, D^3\}$ is an unsound sectioning of $D$. The clique hypergraph for $D$ must have the clique $\{A, B, C\}$, which finds no host subDAG among $D^1$, $D^2$ and $D^3$.

Figure 4.21: Top left: A DAG $D$. Top right: The set of subDAGs from an unsound sectioning of $D$. Bottom left: The junction tree $T$ from $D$.

The following develops a necessary and sufficient condition for soundness of sectioning.

Lemma 6 Let $A_1 - \cdots - A_{i-1} - B_1 - \cdots - B_{j-1} - C_1 - \cdots - C_{k-1} - \cdots - A_1$ be a cycle consisting of nodes from 3 or more sets: $X = \{A_1, \ldots, A_{i-1}, B_1\}$, $Y = \{B_1, \ldots, B_{j-1}, C_1\}$, and so on. The nodes from the same set are adjacent in the cycle, and 2 adjacent sets have a node in common. Then triangulation of this cycle must create a triangle whose 3 nodes do not all belong to any single set.

Proof: The proof is inductive. The lemma is trivially true if the cycle involves only 3 nodes. Assume the lemma is true when the cycle involves $n = 3$ sets and the cycle $O$ is $A_1 - \cdots - A_{i-1} - B_1 - \cdots - B_{j-1} - C_1 - \cdots - C_{k-1} - A_1$, i.e., there are $i$ nodes in set 1, $j$ in set 2, and $k$ in set 3. If 1 more node is added to set 3 ($k' = k + 1$) such that a node $C_k$ is added between $C_{k-1}$ and $A_1$ of cycle $O$, then one can first add a chord between $C_{k-1}$ and $A_1$, and triangulate the rest of the cycle. Since the remaining untriangulated cycle is exactly the cycle $O$, the lemma is true by assumption. Since the 3 sets are symmetric and the nodes in any set above can be augmented by nodes from any sets other than the above 3, the proof is valid in general.
□

If a MSBN has only 2 sects, the sectioning is always sound. Unsoundness can arise only when there are 3 or more sects. The following shows exactly the case where a sectioning is unsound.

Theorem 3 (inter-subDAG cycle) A sectioning of a DAG $D$ into a set of 3 or more subDAGs is sound if and only if there exists no (undirected) cycle in $D$ which consists of nodes from 3 or more distinct subDAGs such that the nodes from each subDAG are adjacent on the cycle.

Proof: Let $D$ be a DAG sectioned into $\{D^1, \ldots, D^\beta\}$ ($\beta \geq 3$). Suppose, relative to this sectioning, $D$ has no cycle which consists of nodes from 3 or more distinct subDAGs such that the nodes from each subDAG are adjacent on the cycle. By the definition of d-sepset, moralization of $D$ may triangulate existing cycles but will not create new cycles. Thus, by assumption, in the moral graph of $D$, every cycle consisting of nodes from more than 1 subDAG, such that the nodes from each subDAG are adjacent on the cycle, involves nodes from at most 2 subDAGs. By the argument in Step 1 of the proof of Lemma 5, all the cycles crossing 2 subDAGs can be triangulated without creating cliques containing non-d-sepnodes from both subDAGs. Thus the sectioning is sound.

On the other hand, suppose, relative to the sectioning, there is a cycle in $D$ which consists of nodes from 3 or more distinct subDAGs such that the nodes from each subDAG are adjacent on the cycle. Then, by Lemma 6, the triangulation of this cycle must create a triangle whose 3 nodes do not all belong to any single subDAG. Hence the sectioning is unsound. □

Given a DAG and a sectioning, searching for inter-subDAG cycles relative to the sectioning is expensive, especially when it must be done by local computation because space is a concern. Even if a sectioning is found to be sound, it may not be computationally desirable at later reasoning stages, as will be discussed in the corresponding sections. Furthermore, each sect in a large system (MSBN) is constructed one at a time.
If a sectioning is not sound and this can only be discovered after all sects have been constructed, the overall revision would be disastrous. Thus one would like to develop simple guidelines for sound sectioning which can be followed during incremental construction of MSBNs. The following covering subDAG rule is one such guideline.

Theorem 4 (covering subDAG) Let a DAG $D$ be sectioned into $\{D^1, \ldots, D^\beta\}$. Let $I^{jk} = N^j \cap N^k$ be the d-sepset between $D^j$ and $D^k$. If there is a subDAG $D^i$ such that

$$N^i \supseteq \bigcup_{j \neq k} I^{jk}$$

then the sectioning is sound. The subDAG $D^i$ is called the covering subDAG relative to the sectioning.

In the context of a MSBN, call the sect corresponding to the covering subDAG the covering sect. Note that the covering sect rule actually imposes a conditional independence constraint at a macro level.

Proposition 9 Let $S^j$ and $S^k$ be any 2 sects in a MSBN with a covering sect $S^i$ ($i \neq j, k$). The 2 sets of variables $N^j$ and $N^k$ are conditionally independent given $N^i$.

Example 9 Consider the 3-sect MSBN $\{(\Theta^1, P^1), (\Theta^2, P^2), (\Theta^3, P^3)\}$ constructed above. $(\Theta^1, P^1)$ is the covering sect.

Note that, in general, the covering sect of a MSBN may not be unique. As far as soundness is concerned, one is as good as another. In practice, the one to be consulted most often or the one with the smallest size is preferred for the sake of computational efficiency, which will become clear later.

The covering sect is usually formed naturally. For example (Chapter 5), in a neuromuscular diagnosis system, the sect containing knowledge about clinical examination contains all the disease hypotheses considered by the system. The EMG sect or nerve conduction sect contains only a subset of the disease hypotheses, based on the diagnostic importance of these tests to each disease. Thus the clinical sect is a natural covering sect with all the disease hypotheses as d-sepnodes interfacing the sect with other sects.
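The covering subDAG condition of Theorem 4 only needs the node sets of the subDAGs, so it is cheap to test while sects are being added. The following is a minimal sketch, not the thesis's implementation: the function names are hypothetical, and the domains used for the running example are an assumption read off the variables appearing in Tables 4.3 and 4.4.

```python
from itertools import combinations


def d_sepsets(domains):
    """Pairwise intersections N^j & N^k of the subDAG domains (Theorem 4)."""
    return {(j, k): domains[j] & domains[k]
            for j, k in combinations(sorted(domains), 2)}


def covering_subdags(domains):
    """Indices i whose domain N^i contains the union of all d-sepsets."""
    interface = set().union(*d_sepsets(domains).values())
    return [i for i in sorted(domains) if interface <= domains[i]]


# Assumed domains of the three subDAGs in the running example.
example = {
    1: {"H1", "H2", "H3", "H4", "A1", "A2", "A3", "A4"},
    2: {"H1", "H2", "F1", "F2"},
    3: {"H2", "H3", "H4", "E1", "E2", "E3", "E4"},
}

# A hypothetical 3-way sectioning in which no subDAG covers all interfaces.
no_cover = {1: {"A", "B"}, 2: {"A", "C"}, 3: {"B", "C"}}

print(covering_subdags(example))
print(covering_subdags(no_cover))
```

An empty result does not prove the sectioning unsound, since the covering subDAG rule is only a sufficient condition.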
The covering subDAG rule can be used repeatedly to create sophisticated MSBNs which are sound. When doing so, the global covering subDAG requirement is replaced by a local covering subDAG requirement. A local covering subDAG is a subDAG interfacing two or more subDAGs and including all d-sepsets among them in its domain.

Theorem 5 (hypertree) Any MSBN created by the following procedure is sound.

Start with any single subDAG. Recursively add a new subDAG $D^i$ to the previous subDAGs such that if $D^i$ has nonempty d-sepsets with more than one previous subDAG, then one of the previous interfacing subDAGs must be a local covering subDAG which covers all the d-sepsets resulting from the addition.

The following example illustrates the hypertree rule. It also explains why the sectioning determined by the procedure is sound.

Example 10 Figure 4.22 depicts part of a MSBN constructed by the hypertree rule. Each box represents a subDAG, with boundaries between boxes representing d-sepsets. The superscripts of subDAGs represent the order of their creation. $D^1$, $D^4$ and $D^5$ are local covering subDAGs.

Figure 4.22: A MSBN with a hypertree structure.

It is easy to see that the inter-subDAG cycle described in Theorem 3 cannot occur in this MSBN due to its hypertree structure, and hence the sectioning is sound.

Note that the hypertree rule also imposes a conditional independence constraint at a macro level.

Proposition 10 Let $S^i$ and $S^j$ be any 2 sects with an empty d-sepset in a MSBN sectioned by the hypertree rule. Let $S^k$ be any sect on the unique route mediating $S^i$ and $S^j$ on the hypertree. The 2 sets of variables $N^i$ and $N^j$ are conditionally independent given $N^k$.

It should be noted that the covering subDAG rule and the hypertree rule do not cover every case where sectioning is sound.

Example 11 The 3-sect MSBN $\{D^1, D^2, D^3\}$ in Figure 4.23 has no covering subDAG. But the sectioning is sound.
Note that, although the sectioning of the MSBN in Figure 4.23 is sound, this kind of structure is restricted. For example, one can add arcs between $A$ and $B$ in $D^1$ and between $A$ and $C$ in $D^2$, but as soon as one adds 1 more arc between $B$ and $C$ in $D^3$, Theorem 3 is violated and the sectioning becomes unsound. That is, when $n$ subDAGs ($n \geq 3$) are interfaced in this style, at most $n - 1$ of them can be multiply connected. Further computational problems with such structures will be discussed in the appropriate later sections.

Figure 4.23: Top left: A DAG $D$. Top right: A junction tree $T$ from $D$. Bottom left: $\{D^1, D^2, D^3\}$ forms a sound sectioning of $D$. Bottom right: The junction trees from the MSBN in Bottom left.

Since MSBNs constructed by the covering subDAG rule or the hypertree rule have sound sectioning, are less restricted, and have extra computational advantages (to be discussed in later sections) over the MSBNs which do not follow these rules, the following study is directed only to the former MSBNs.

Conceptually, all MSBNs constructed by the hypertree rule can be viewed as MSBNs with covering subDAGs when attention is directed to local structures. For example, if one pays attention to $D^1$ in Figure 4.22 and its surrounding subDAGs, one can view $D^2, D^4, D^6, D^7, D^8$ as 1 subDAG, $D^3, D^5, D^9, D^{10}, D^{11}$ as another, and $D^{12}, D^{13}$ and $D^{14}, D^{15}$ as 2 others. Thus the MSBN is viewed as one with a global covering subDAG $D^1$. Likewise, when one is concerned with the relation between $D^{14}$ and $D^{15}$, the MSBN can be viewed as one satisfying the covering subDAG rule with $\beta = 2$. Therefore, the computation required for a MSBN of a hypertree structure is just the repetition of the computation required for a MSBN with a global covering subDAG. On the other hand, a MSBN with a global covering subDAG is a special case of the hypertree structure. Hence, the following study is often simplified by considering only one of the 2 cases.

4.7 Transform MSBN into Junction Forest

In order to perform efficient inference in a general but sparse network, it is desirable to transform each sect of a MSBN into a junction tree which will stand as an inference entity (Section 4.4). The transformation takes several steps, which are discussed in this section. The set of subDAGs of the MSBN is morali-triangulated into a set of morali-triangulated graphs, from which a set of clique hypergraphs is formed. Then the set of clique hypergraphs is organized into a set of junction trees of cliques. Afterwards, the linkages between the junction trees are created. Finally, belief tables are assigned to cliques and linkages, and a junction forest of belief universes is constructed.

4.7.1 Transform SubDAGs into Junction Trees by Local Computation

The key issue is morali-triangulating the subDAGs of a MSBN into a set of morali-triangulated graphs. Once this is done, the formation of the clique hypergraph and the organization of the junction tree for each subDAG are performed in the same way as in the case of a USBN and a single junction tree [Andersen et al. 89, Jensen, Lauritzen and Olesen 90]. As mentioned before, the criterion in morali-triangulation of a set of subDAGs of a MSBN into a set of clique hypergraphs is to preserve the 'intactness' of the clique hypergraph resulting from the corresponding USBN. The concept of 'intactness' is formalized below.

Definition 9 (invertible morali-triangulation) Let $D$ be a DAG sectioned into $\{D^1, \ldots, D^\beta\}$ where $D^i$ has domain $N^i$. If there exists a morali-triangulated graph $G$ of $D$, with the clique hypergraph $H$, such that $G = \bigcup_i G^i$ where $G^i$ is the subgraph of $G$ induced by $N^i$, and $H = \bigcup_i H^i$ where $H^i$ is the clique hypergraph of $G^i$, then the set of morali-triangulated graphs $\{G^1, \ldots, G^\beta\}$ is invertible. The transformation of $\{D^1, \ldots, D^\beta\}$ into $\{G^1, \ldots, G^\beta\}$ is also said to be an invertible morali-triangulation.

A morali-triangulated graph $G$ is equivalent to the corresponding clique hypergraph $H$ in that, given one of them, the other is completely determined. By Definition 8, if the sectioning of a MSBN is sound, then there exists a clique hypergraph $H$ whose every clique is a subset of the domain of at least 1 subDAG of the MSBN. Thus one has the following theorem.

Theorem 6 (existence of invertible morali-triangulation) There exists an invertible morali-triangulation for $\{D^1, \ldots, D^\beta\}$ sectioned from a DAG $D$, if the sectioning is sound.

One could construct a set of invertible morali-triangulated graphs of a MSBN by first performing a global computation (moralization and triangulation) on $D$ to find $G$, and then determining its subgraphs relative to the sectioning of the MSBN. The moralization and triangulation would be the same as in the junction tree technique, with care taken not to mix nodes of different subDAGs into one clique. However, when the space requirement is of concern, MSBNs offer the possibility of morali-triangulation by local computation at the level of the subDAGs of sects. Each subDAG in a MSBN is morali-triangulated separately (message passing may be involved) such that the collection of them is invertible. The following discusses how this can be achieved.

Example 12 In the example depicted in Figures 4.17 and 4.20, $\Theta$ is sectioned into $\{\Theta^1, \Theta^2, \Theta^3\}$ by a sound sectioning and $\{\Lambda^1, \Lambda^2, \Lambda^3\}$ is a set of invertible morali-triangulated graphs relative to the sectioning. The desire is to find $\Lambda^i$ ($i = 1, 2, 3$) from $\Theta^i$ ($i = 1, 2, 3$) by local computation.

Since subDAGs of a MSBN are interfaced through d-sepsets, the focus of finding a set of invertible morali-triangulated graphs by local computation is to decide whether each pair of d-sepnodes is to be linked. Coordination between neighbor subDAGs is necessary to ensure correct decisions. The following considers this systematically.
Call a link between 2 d-sepnodes a d-link. Call a simple path $(A_1, A_2, \ldots, A_k)$ a d-path if $A_1, \ldots, A_i$ and $A_j, \ldots, A_k$ ($1 \leq i$, $i + 1 \leq j$, $j \leq k$) are all d-sepnodes, while all the other nodes on the path are non-d-sepnodes. A d-link is a trivial d-path. Each morali-triangulated graph $G^i$ from an invertible morali-triangulation may generally contain 6 types of d-links.

Arc type: inherited from the subDAG. That is, if 2 d-sepnodes are connected originally in the subDAG, there is a d-link between them in $G^i$. The decision on this type of d-link is trivial.

ML type: created by local moralization. For example, the d-links $(H_1, H_2)$ in $\Lambda^2$ and $(H_2, H_3)$ in $\Lambda^3$.

ME type: created by moralization in neighbor subDAGs. For example, the d-links $(H_1, H_2)$ and $(H_2, H_3)$ in $\Lambda^1$. The decision on this type of d-link requires communication between neighbor subDAGs.

Cy type: created to triangulate inter-subDAG cycles. For example, the d-link $(H_3, H_4)$ in $\Lambda^1$ and $\Lambda^3$. The decision on this type of d-link requires communication between neighbor subDAGs.

TL type: created during local triangulation. After the above 4 types of d-links have been introduced to the moral graph of a subDAG, there may still be untriangulated cycles within the moral graph involving 4 or more d-sepnodes. The example used above is too simple to illustrate this and the next type.

TE type: created by local triangulation in neighbor subDAGs. The triangulation of a cycle of length > 3 involving only d-sepnodes is not unique. If 2 neighbor subDAGs triangulate such a cycle by local computation without coordination, they may triangulate it in different ways and end up with different sets of cliques for the nodes in the d-sepset. Therefore communication is required so that a subDAG can adopt the d-links introduced by triangulation in neighbor subDAGs. The argument also applies to the case of triangulating cycles consisting of general d-paths.
An algorithm for morali-triangulation of the subDAGs of a MSBN into a set of invertible triangulated graphs under the covering subDAG assumption is given below.

Algorithm 2 (morali-triangulation with a covering subDAG) Let $D^1$ be the covering subDAG in the MSBN.

1. For each subDAG $D^i$, do the following: (1) moralize $D^i$, and add the d-links due to moralization to $ML^i$ (a set of node pairs); (2) search for pairs of d-sepnodes connected by a d-path in the moral graph of $D^i$, and add the pairs found to $Cy^i$ (a set of node pairs).

2. For $D^1$, do the following: (1) for each pair of nodes of $D^1$ contained in one of the $ML^i$ ($i > 1$), connect the pair in the moral graph of $D^1$; (2) for each pair of d-sepnodes contained in both $Cy^1$ and one of the $Cy^j$ ($j > 1$), connect the 2 nodes in the moral graph of $D^1$; (3) triangulate the augmented moral graph of $D^1$ (the morali-triangulation of $D^1$ is completed); and (4) add the d-links to DLINK (a set of node pairs).

3. For $D^i$ ($i = 2, \ldots, \beta$), do the following: (1) for each pair of nodes of $D^i$ contained in DLINK, connect the pair in the moral graph of $D^i$; and (2) triangulate the augmented moral graph of $D^i$. The morali-triangulation of $D^i$ is completed.

Note that the above process makes 2 passes through all the subDAGs. The following theorem shows the invertibility of the morali-triangulation.

Theorem 7 (invertibility of Algorithm 2) The morali-triangulation by Algorithm 2 is invertible.

Proof: The key issue is to decide on d-links. The algorithm considers only neighbor relations between the covering subDAG and the other subDAGs. The first step is to show that this is sufficient.

Let $d_a$ and $d_b$ be 2 d-sepnodes between 2 subDAGs $D^a$ and $D^b$. Let $D^1$ be the covering subDAG. It is claimed that if $d_a$ and $d_b$ join 2 simple paths in $D^a$ and $D^b$ respectively to form an inter-subDAG cycle, then they must join 2 simple paths in $D^a$ and $D^1$ as well. Since $d_a$ and $d_b$ are both contained in $D^1$, which is connected, this is certainly true.
Similarly, if there is a cycle involving d-sepnodes in the d-sepset between $D^a$ and $D^b$, these d-sepnodes are also in the d-sepset between $D^a$ and $D^1$.

Now it is a simple matter to show: the ME type d-links are introduced by step 2 (1) and step 3 (1); the Cy type d-links are introduced by step 2 (2) and step 3 (1); and the TE type d-links are introduced by step 2 (2) and step 3 (1). □

Example 13 The following recounts the morali-triangulation of $\bigcup_i \Theta^i$ in Figure 4.20 by Algorithm 2.

1. After step 1 of the algorithm, $ML^1 = \emptyset$, $ML^2 = \{(H_1, H_2)\}$, and $ML^3 = \{(H_2, H_3)\}$; $Cy^1 = \{(H_1, H_2), (H_2, H_3), (H_3, H_4)\}$, $Cy^2 = \{(H_1, H_2)\}$, and $Cy^3 = \{(H_2, H_3), (H_2, H_4), (H_3, H_4)\}$.

2. After step 2, the morali-triangulated graph $\Lambda^1$ of $\Theta^1$ is completed by adding the d-links $(H_1, H_2)$, $(H_2, H_3)$, $(H_3, H_4)$ to the moral graph of $\Theta^1$, and then triangulating (nothing is added). DLINK will contain $\{(H_1, H_2), (H_2, H_3), (H_3, H_4)\}$.

3. After step 3, the morali-triangulated graph $\Lambda^2$ of $\Theta^2$ is completed without change to its moral graph; the morali-triangulated graph $\Lambda^3$ of $\Theta^3$ is completed by adding the d-link $(H_3, H_4)$ to its moral graph, and then triangulating (with the link $(E_3, H_4)$ added).

As mentioned in section 4.4, after the morali-triangulation, the other steps in the transformation of a MSBN into a set of junction trees of cliques are: identifying the cliques of the morali-triangulated graphs to form a set of clique hypergraphs, and then organizing each hypergraph into a junction tree. These steps are performed in the same way as in the junction tree technique.

Throughout the rest of the chapter, it is assumed that junction trees are obtained through a set of invertible triangulated graphs, and it is said that the junction trees are obtained by invertible transformation. Call a set of junction trees of cliques from an invertible transformation of the subDAGs of a MSBN a junction forest of cliques, denoted by $F = \{T^1, \ldots, T^\beta\}$ where $T^i$ is the junction tree from the subDAG $D^i$.
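Step 1 (1) of Algorithm 2 can be illustrated with a short sketch that moralizes each subDAG locally and records the d-links it creates. This is only an illustration under stated assumptions: the function name `moral_d_links` is hypothetical, and the parent lists are an assumption read off the conditioning variables of the probability tables of the running example.

```python
from itertools import combinations


def moral_d_links(parents, d_sepnodes):
    """Pairs of d-sepnodes that local moralization would marry (the ML set).

    parents maps each child to the list of its parents in one subDAG.
    Pairs already joined by an arc are skipped: those are Arc type d-links.
    """
    links = set()
    for pa in parents.values():
        for u, v in combinations(sorted(pa), 2):
            joined = u in parents.get(v, ()) or v in parents.get(u, ())
            if u in d_sepnodes and v in d_sepnodes and not joined:
                links.add((u, v))
    return links


D_SEPNODES = {"H1", "H2", "H3", "H4"}

# Assumed structure of the three subDAGs (child -> parents).
theta1 = {"A1": ["H1"], "H2": ["A1", "A2"], "A2": ["H3"],
          "A3": ["H3"], "H4": ["A3"], "A4": ["H4"]}
theta2 = {"F1": ["H1", "H2"], "F2": ["F1"]}
theta3 = {"E3": ["H2", "H3"], "E1": ["E3"], "E4": ["H4"], "E2": ["E3", "E4"]}

ml = [moral_d_links(t, D_SEPNODES) for t in (theta1, theta2, theta3)]
print(ml)
```

Under these assumed structures the three sets agree with $ML^1$, $ML^2$ and $ML^3$ in Example 13. The d-path search of step 1 (2) and the triangulation steps are not covered by this sketch.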
4.7.2 Linkages between Junction Trees

Just as d-sepsets interface subDAGs, linkages interface the junction trees transformed from subDAGs and serve as information channels between junction trees during inference. The extension of the MSBN technique to the junction tree technique is to allow multiple linkages between pairs of junction trees in a junction forest, such that localization can be preserved within junction trees and the exponential explosion of the sizes of clique state spaces associated with the brute force method (section 4.3) can be avoided.

Definition 10 (linkage set) Let $I$ be the d-sepset between 2 subDAGs $D^a$ and $D^b$. Let $T^a$ and $T^b$ be the junction trees transformed from $D^a$ and $D^b$ respectively. A linkage of $T^a$ relative to $T^b$ is a set $l$ of nodes such that the following 2 conditions hold.

1. Boundary: there exists a clique $C_i \in T^a$ such that $l = C_i \cap I$. $C_i$ is called a host clique of $l$;

2. Maximum: there is no subset of $I$ properly containing $l$ that is also a linkage.

In general there may be more than one linkage between a pair of junction trees. Define $L^{ab}$ to be the set of all linkages of $T^a$ relative to $T^b$.

Proposition 11 (identity of linkages) Let $T^a$ and $T^b$ be the junction trees from subDAGs $D^a$ and $D^b$ respectively. If $L^{ab}$ is the set of linkages of $T^a$ relative to $T^b$ and $L^{ba}$ is the set of linkages of $T^b$ relative to $T^a$, then $L^{ab} = L^{ba}$.

Proof: A linkage consists of d-sepnodes which are pairwise connected. In an invertible morali-triangulation, d-sepnodes are connected identically in both morali-triangulated graphs involved. □

Example 14 In Figure 4.20, linkages between junction trees are indicated with ribbed bands connecting the corresponding host cliques. The 2 linkages between $\Gamma^1$ and $\Gamma^3$ are $\{H_3, H_2\}$ and $\{H_3, H_4\}$.

Given a set of linkages between a pair of junction trees, the concept of a redundancy set can be defined.
As mentioned in section 4.4, redundancy sets provide structures which allow redundant information to be removed during information passing between junction trees. The concept will be used for defining the joint system belief in section 4.7.3 and for defining the operation NonRedundancyAbsorption in section 4.9.2. To define the redundancy set, one needs to index linkages such that the redundancy sets defined on the basis of the indexing possess a certain desirable property, which will become clear below. Index a set $L^{ab}$ of linkages by the following procedure.

Procedure 1

1. Pick one of the junction trees in the pair, say $T^a$. Create a tree $G$ with nodes labeled by the linkages in $L^{ab}$. Connect 2 nodes in $G$ by a link if either the hosts of the corresponding linkages are directly connected in $T^a$, or the hosts of the corresponding linkages are (indirectly) connected in $T^a$ by a path on which no intermediate clique is a linkage host. Call this tree a linkage tree.

2. Index the nodes (linkages in $L^{ab}$) of $G$ as $L_1, L_2, \ldots$ in any order that is consistent with $G$, i.e., for every $i > 1$ there is a unique predecessor $j(i) < i$ such that $L_{j(i)}$ is adjacent to $L_i$ in $G$.

With linkages indexed this way, the redundancy set can be defined as follows.

Definition 11 (redundancy set) Let a set of linkages $L^{ab} = \{L_1, \ldots, L_g\}$ be indexed by Procedure 1. Then, for this set of indexed linkages, a redundancy set $R_i$ for index $i$ is defined as

$$R_i = \begin{cases} \emptyset & \text{if } i = 1 \\ L_i \cap L_{j(i)} & \text{if } i > 1, \text{ where } j(i) < i \text{ and } L_{j(i)} \text{ is adjacent to } L_i \text{ in the linkage tree } G \end{cases}$$

Note that a linkage tree is itself a junction tree, and redundancy sets are sepsets of the linkage tree.

Example 15 There are 2 linkages between $\Gamma^1$ and $\Gamma^3$ in Figure 4.20. Consider the junction tree $\Gamma^3$. The linkage tree $G$ has 2 connected nodes, one labeled by the linkage $\{H_3, H_2\}$ and the other by $\{H_3, H_4\}$. An indexing $L_1 = \{H_3, H_2\}$ and $L_2 = \{H_3, H_4\}$ defines 2 redundancy sets $R_1 = \emptyset$ and $R_2 = \{H_3\}$.
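Step 2 of Procedure 1 and Definition 11 can be sketched in a few lines once a linkage tree is available. The sketch below assumes the linkage tree is already given as an adjacency mapping (building it from host cliques, step 1, is not shown), and uses a breadth-first traversal as one of many orders consistent with the tree. The function names are hypothetical.

```python
from collections import deque


def index_linkage_tree(adjacency, root):
    """Return (order, pred): linkages in an order consistent with the tree,
    and each linkage's unique predecessor (None for the first one)."""
    order, pred = [root], {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in sorted(adjacency[node], key=sorted):
            if nb not in pred:          # first visit fixes the predecessor
                pred[nb] = node
                order.append(nb)
                queue.append(nb)
    return order, pred


def redundancy_sets(order, pred):
    """R_1 is empty; R_i is L_i intersected with its predecessor L_j(i)."""
    return [frozenset() if pred[l] is None else l & pred[l] for l in order]


# The two linkages of Example 15.
l1 = frozenset({"H3", "H2"})
l2 = frozenset({"H3", "H4"})
tree = {l1: [l2], l2: [l1]}

order, pred = index_linkage_tree(tree, root=l1)
r = redundancy_sets(order, pred)
print(r)
```

For the two linkages of Example 15 this yields an empty first redundancy set and $\{H_3\}$ as the second, matching the example.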
With linkages and redundancy sets constructed, one has a linked junction forest of cliques.

4.7.3 Joint System Belief of Junction Forest

Let $(D, P)$ be a USBN, $S = \{S^1, \ldots, S^\beta\}$ be a corresponding MSBN with a covering sect $S^1$, and $F = \{T^1, \ldots, T^\beta\}$ be the junction forest from an invertible transformation. Let $T^i$ be the junction tree of $D^i$ with cliques $C^i$ and sepsets $Q^i$. Let $I^i$ ($i > 1$) be the d-sepset between $S^i$ and $S^1$. Let $L^i$ ($i > 1$) be the set of linkages between $T^i$ and $T^1$, and $R^i$ ($i > 1$) be the corresponding set of redundancy sets.

Construct a joint system belief for the junction forest through an assignment of belief tables to each clique, clique sepset, linkage and redundancy set in the junction forest. First, for each junction tree $T^i$ in $F$, do the following.

• Assign each node $n_k \in N^i$ to a unique clique $C_x \in C^i$ such that $C_x$ contains $n_k$ and its parents $\pi_k$. Break ties arbitrarily.

• Let $P_k$ denote the probability table associated with node $n_k$. For each clique $C_x$ that has assigned nodes $n_k, \ldots, n_l$, associate it with the belief table $B(C_x) = P_k * \cdots * P_l$.

• For each clique sepset $Q_y \in Q^i$, associate it with a constant belief table $B(Q_y)$.

Then, for each set of linkages $L^i$, do the following.

• Associate each linkage $L_z \in L^i$ with a constant belief table $B(L_z)$.

Define the belief table for a redundancy set $R_z \in R^i$ as

$$B(R_z) = \sum_{L_z \setminus R_z} B(L_z)$$

Define the belief table for the d-sepset $I^i$ as

$$B(I^i) = \frac{\prod_{L_z \in L^i} B(L_z)}{\prod_{R_z \in R^i} B(R_z)}$$

Define the belief table for each junction tree $T^i$ as

$$B(T^i) = \frac{\prod_{C_x \in C^i} B(C_x)}{\prod_{Q_y \in Q^i} B(Q_y)}$$

Here the notation $B(T^i)$ is used instead of $B(N^i)$ to emphasize that it is related to the junction tree. Note that $B(I^i)$ and $B(T^i)$ are mathematical objects which do not have corresponding data structures in the knowledge base.
Comparing the form of the joint probability distribution for a USBN (section 1.3.3) and the assignment of probability tables for nodes in a sect (Definition 7), it is easy to see that $B(T^i)$ is proportional to the joint probability distribution of $S^i$ relative to that assignment.

Define the joint system belief for the junction forest $F$ as

$$B(F) = \frac{\prod_{i} B(T^i)}{\prod_{i \geq 2} B(I^i)}$$

The notation $B(F)$ instead of $B(N)$ is used for the same reason as above. With this definition, one has the following lemma.

Lemma 7 The joint belief $B(F)$ of a MSBN is proportional to the joint probability distribution $P$ of the corresponding USBN.

To see that this is true, it suffices to note that each d-sepnode, appearing in at least 2 sects, carries its original probability table as in $(N, E, P)$ exactly once by Definition 7, and carries a uniform table in the rest.

Example 16 Table 4.5 lists the constructed belief tables for the belief universes of the junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20. The belief tables for sepsets, linkages, and redundancy sets are all constant tables at this stage.

Table 4.5: Constructed belief tables for belief universes of junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20. Config: Configuration. Node Ass: Nodes Assigned.

B(Γ¹)

Clique {H2,H1,A1}; Node Ass: H1, A1
  (h21,h11,a11) .12     (h21,h11,a12) .03     (h21,h12,a11) .085    (h21,h12,a12) .765
  (h22,h11,a11) .12     (h22,h11,a12) .03     (h22,h12,a11) .085    (h22,h12,a12) .765
Clique {H2,A2,A1}; Node Ass: H2
  (h21,a21,a11) .8696   (h21,a21,a12) .7      (h21,a22,a11) .6      (h21,a22,a12) .08
  (h22,a21,a11) .1304   (h22,a21,a12) .3      (h22,a22,a11) .4      (h22,a22,a12) .92
Clique {H3,H2,A2}; Node Ass: H3, A2
  (h31,h21,a21) .24     (h31,h21,a22) .06     (h31,h22,a21) .24     (h31,h22,a22) .06
  (h32,h21,a21) .07     (h32,h21,a22) .63     (h32,h22,a21) .07     (h32,h22,a22) .63
Clique {H3,A3,H4}; Node Ass: A3, H4
  (h31,a31,h41) .075    (h31,a31,h42) .225    (h31,a32,h41) .28     (h31,a32,h42) .42
  (h32,a31,h41) .2      (h32,a31,h42) .6      (h32,a32,h41) .08     (h32,a32,h42) .12
Clique {H4,A4}; Node Ass: A4
  (h41,a41) .9          (h41,a42) .1          (h42,a41) .2          (h42,a42) .8

B(Γ²)

Clique {F2,F1}; Node Ass: F2
  (f21,f11) .4          (f21,f12) .75         (f22,f11) .6          (f22,f12) .25
Clique {H2,F1,H1}; Node Ass: H2, F1, H1
  (h21,f11,h11) .7895   (h21,f11,h12) .6      (h21,f12,h11) .2105   (h21,f12,h12) .4
  (h22,f11,h11) .5      (h22,f11,h12) .05     (h22,f12,h11) .5      (h22,f12,h12) .95

B(Γ³)

Clique {E1,E3}; Node Ass: E1
  (e11,e31) .2          (e11,e32) .7          (e12,e31) .8          (e12,e32) .3
Clique {H3,H2,E3}; Node Ass: H3, H2, E3
  (h31,h21,e31) .7702   (h31,h21,e32) .2298   (h31,h22,e31) .65     (h31,h22,e32) .35
  (h32,h21,e31) .35     (h32,h21,e32) .65     (h32,h22,e31) .01     (h32,h22,e32) .99
Clique {E2,E3,E4}; Node Ass: E2
  (e21,e31,e41) .9789   (e21,e31,e42) .8      (e21,e32,e41) .9      (e21,e32,e42) .05
  (e22,e31,e41) .0211   (e22,e31,e42) .2      (e22,e32,e41) .1      (e22,e32,e42) .95
Clique {E3,E4,H4}; Node Ass: E4, H4
  (e31,e41,h41) .8      (e31,e41,h42) .15     (e31,e42,h41) .2      (e31,e42,h42) .85
  (e32,e41,h41) .8      (e32,e41,h42) .15     (e32,e42,h41) .2      (e32,e42,h42) .85
Clique {H3,E3,H4}; Node Ass: none
  Configurations (h31,e31,h41), (h31,e31,h42), (h31,e32,h41), (h31,e32,h42), (h32,e31,h41), (h32,e31,h42), (h32,e32,h41), (h32,e32,h42)

Having constructed belief tables for cliques, sepsets, linkages, redundancy sets and the junction forest, and using the definition of world in section 4.2.1, one can talk about belief universes, sepset worlds, linkage worlds, redundancy worlds, and a junction forest of belief universes. These terms will be used below.

The preceding assumes a covering sect. The joint system belief of a junction forest with a hypertree structure can be defined in a similar way. Just as there is no need to consider the d-sepsets/linkages between non-covering sects above, in the hypertree case there is no need to consider the d-sepsets/linkages between neighbor sects covered by a local covering sect. Therefore, in practice, these linkages are never created. This is another computational advantage of the covering sect rule and the hypertree rule.
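The clique belief $B(C_x) = P_k * \cdots * P_l$ is simply a pointwise product of the tables of the nodes assigned to the clique. As a worked illustration, the sketch below recomputes the entry of Table 4.5 for the clique $\{H_3, A_3, H_4\}$ of $\Gamma^1$, to which $A_3$ and $H_4$ are assigned. The dictionary layout and the function name are assumptions made for this example; the conditional probabilities and their complements come from Table 4.3.

```python
from itertools import product

# p(a3 | h3) and p(h4 | a3), keyed as (child state, parent state).
p_a3 = {("a31", "h31"): .3, ("a32", "h31"): .7,
        ("a31", "h32"): .8, ("a32", "h32"): .2}
p_h4 = {("h41", "a31"): .25, ("h42", "a31"): .75,
        ("h41", "a32"): .4, ("h42", "a32"): .6}


def clique_belief():
    """B({H3,A3,H4}) = p(a3|h3) * p(h4|a3) for every configuration."""
    belief = {}
    for h3, a3, h4 in product(("h31", "h32"), ("a31", "a32"), ("h41", "h42")):
        belief[(h3, a3, h4)] = p_a3[(a3, h3)] * p_h4[(h4, a3)]
    return belief


belief = clique_belief()
for config, value in belief.items():
    print(config, round(value, 4))
```

$H_3$ contributes no factor here because its table is assigned to a different clique, so the belief does not sum to 1 over the clique; it is a potential rather than a marginal.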
4.8 Consistency and Separability of Junction Forest

Part of the goal is to propagate the information stored in different belief universes in different junction trees of a junction forest to the whole system, such that marginal probabilities of variables can be obtained from any universe containing them by local computation (footnote 4). The information to be propagated can be the prior knowledge in the form of products of original probability tables from the corresponding USBN. It can also be evidence entered from a set of universes, possibly in different junction trees. The following defines consistency and separability, the properties of junction forests which guarantee the achievement of this goal.

Footnote 4: Obtaining marginals by local computation is what the junction tree technique is developed for. More is obtained from junction forests, namely, exploiting localization.

4.8.1 Consistency of Junction Forest

The property of consistency partly guarantees the validity of obtaining marginal probabilities by local computation. In the context of the junction forest, 3 levels of consistency can be identified.

Definition 12 (local consistency) Neighbor universes $(C_i, B(C_i))$ and $(C_j, B(C_j))$ in a junction tree $T$ with sepset world $(Q_k, B(Q_k))$ are consistent if

$$\sum_{C_i \setminus C_j} B(C_i) \propto B(Q_k) \propto \sum_{C_j \setminus C_i} B(C_j)$$

where '$\propto$' reads 'proportional to'. When the relation holds among all neighbor universes, the junction tree $T$ is said to be consistent. When all junction trees are consistent, a junction forest $F$ is said to be locally consistent.

Definition 13 (boundary consistency) Host universes $(C_i^a, B(C_i^a))$ and $(C_j^b, B(C_j^b))$ of linkage world $(L_k, B(L_k))$ are consistent if

$$\sum_{C_i^a \setminus L_k} B(C_i^a) \propto B(L_k) \propto \sum_{C_j^b \setminus L_k} B(C_j^b).$$

When the relation holds among all linkage host universes, a junction forest is said to have reached boundary consistency.
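Definitions 12 and 13 both reduce to the same test: two marginalized belief tables must each be proportional to a third table on the shared variables. A minimal sketch of that test follows, assuming belief tables are stored as Python dictionaries from configuration tuples to numbers. The function names (`marginalize`, `proportional`, `consistent`) and the numeric example are hypothetical illustrations, not part of the thesis.

```python
def marginalize(variables, table, keep):
    """Sum a belief table (dict: configuration tuple -> value) down to `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def proportional(t1, t2, tol=1e-9):
    """True if t1 = c * t2 for one constant c > 0 (both tables share their keys)."""
    ref = max(t2, key=t2.get)  # a configuration where t2 is largest
    if t2[ref] <= 0 or t1[ref] <= 0:
        return False
    # Cross-multiplying avoids dividing by zero entries.
    return all(abs(t1[k] * t2[ref] - t2[k] * t1[ref]) <= tol for k in t2)


def consistent(vars_i, b_i, vars_j, b_j, vars_q, b_q):
    """Check that both marginals onto the shared variables are proportional to B(Q)."""
    m_i = marginalize(vars_i, b_i, vars_q)
    m_j = marginalize(vars_j, b_j, vars_q)
    return proportional(m_i, b_q) and proportional(m_j, b_q)


# Hypothetical binary example: C_i = {A, B}, C_j = {B, C}, sepset Q = {B}.
b_i = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # over (A, B)
b_j = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.9, (1, 1): 0.3}  # over (B, C), unnormalized
b_q = {(0,): 0.4, (1,): 0.6}                                 # over (B,)

ok = consistent(("A", "B"), b_i, ("B", "C"), b_j, ("B",), b_q)

b_bad = dict(b_j)
b_bad[(0, 0)] = 5.0  # perturb one entry so the marginal on B changes shape
bad = consistent(("A", "B"), b_i, ("B", "C"), b_bad, ("B",), b_q)
print(ok, bad)
```

The same `consistent` call can be read as a boundary consistency check by passing two linkage hosts and their linkage table in place of two neighbor cliques and their sepset table. Proportionality rather than equality is tested because the belief tables need not be normalized.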
Definition 14 (global consistency) A junction forest is said to be globally consistent if for any 2 belief universes (possibly in different junction trees) $(C_i^a, B(C_i^a))$ and $(C_j^b, B(C_j^b))$

$$\sum_{C_i^a \setminus C_j^b} B(C_i^a) \propto \sum_{C_j^b \setminus C_i^a} B(C_j^b).$$

Theorem 8 (consistent junction forest) A junction forest is globally consistent iff it reaches both local consistency and boundary consistency.

Proof: The necessity is obvious. The sufficiency is proven below.

Let $(C_i^a, B(C_i^a))$ and $(C_j^b, B(C_j^b))$ be 2 universes in junction trees $T^a$ and $T^b$ of junction forest $F$ respectively. Let $I$ be the d-sepset between the 2 corresponding sects. One has $Z = C_i^a \cap C_j^b \subseteq I$. By definition, linkages are maximal. Therefore, there exists a linkage $L \supseteq Z$ with its hosts being $(C_x^a, B(C_x^a))$ and $(C_y^b, B(C_y^b))$. Since $C_x^a \supseteq L \supseteq Z$, $Z$ is contained in all cliques on the unique path between $C_i^a$ and $C_x^a$. Because $F$ is locally consistent,

$$\sum_{C_i^a \setminus Z} B(C_i^a) \propto \sum_{C_x^a \setminus Z} B(C_x^a).$$

Similarly,

$$\sum_{C_j^b \setminus Z} B(C_j^b) \propto \sum_{C_y^b \setminus Z} B(C_y^b).$$

Because $F$ has also reached boundary consistency,

$$\sum_{C_x^a \setminus Z} B(C_x^a) \propto \sum_{C_y^b \setminus Z} B(C_y^b).$$

Therefore

$$\sum_{C_i^a \setminus Z} B(C_i^a) \propto \sum_{C_j^b \setminus Z} B(C_j^b). \qquad \Box$$

4.8.2 Separability of Junction Forests

In the junction tree technique, consistency is all that is required in order to obtain marginals by local computation. In junction forests, this is not sufficient due to the existence of multiple linkages. The function of multiple linkages between a pair of junction trees is to pass the joint distribution on the d-sepset by passing marginal distributions on subsets of the d-sepset. By doing so, one avoids the exponential increase in clique state space sizes outlined in section 4.3. When breaking down the joint into the marginals, one must ensure that the joint can be reassembled from the marginals, i.e., one must pass a correct version of the joint. Otherwise, the correctness of local computation is not guaranteed.
Since passing the marginals is achieved by passing the belief tables on linkages and redundancy sets, the structure of linkage hosts is the key factor. The following defines separability of junction forests in terms of the correctness of local computation. Then the structural condition on linkage hosts under which separability holds is given.

Definition 15 (separability) Let $F = \{T^i \mid 1 \le i \le \beta\}$ be a junction forest with domain $N$ and joint system belief $B(F)$. $F$ is said to be separable if, when it is globally consistent, for any $T^i$ over subdomain $N^i$

$$\sum_{N \setminus N^i} B(F) \propto B(T^i).$$

The following lemma is quoted from Jensen for the later proof of Proposition 12.

Lemma 8 [Jensen 88] Let $T$ be a junction tree from clique hypergraph $(N, C)$. Let $C_1$ and $C_2$ be 2 adjacent cliques in $T$. Let $T'$ be the graph resulting from $T$ by union of $C_1$ and $C_2$ into one clique, and by keeping the original sepsets. Then $T'$ is a junction tree for clique hypergraph $(N, (C \setminus \{C_1, C_2\}) \cup \{C_1 \cup C_2\})$.

The following is the structural condition for separability, to be proved below.

Definition 16 (host composition) Let a MSBN by sound sectioning be transformed into a junction forest. Let $S^i$ be a sect in the MSBN; $T^i$ be the junction tree of $S^i$; $I$ be the d-sepset between $S^i$ and any distinct sect; and $L$ be the set of linkages between $T^i$ and the junction tree for the other sect. Recursively remove from $T^i$ every leaf clique which is not a linkage host relative to $L$. Call the resultant junction tree a host tree. $T^i$ satisfies the host composition condition relative to $L$ whenever the following is true.

1. If a non-d-sepnode is contained in any linkage host in $T^i$, then it is contained in exactly 1 linkage host; and

2. if a set of non-d-sepnodes is contained in any non-host clique in the host tree, and each element appears in some host clique, then the set must be contained in exactly 1 linkage host.
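The two parts of Definition 16 can be checked mechanically once the host tree is known. The sketch below is one possible reading, not the thesis's own procedure: it assumes cliques are given as Python sets, that the host tree has already been obtained by pruning non-host leaves, and that for part 2 it is enough to test the largest set of host-shared non-d-sepnodes in each non-host clique. The name `host_composition_ok` and its arguments are hypothetical.

```python
def host_composition_ok(d_sepset, hosts, non_hosts):
    """Check the two parts of the host composition condition.

    d_sepset  : set of d-sepnodes I
    hosts     : list of sets, the linkage hosts in the host tree
    non_hosts : list of sets, the remaining cliques of the host tree
    Returns a pair (part1_holds, part2_holds).
    """
    def count_hosts_containing(nodes):
        return sum(1 for h in hosts if nodes <= h)

    # Non-d-sepnodes that appear in at least one linkage host.
    in_hosts = set().union(*hosts) - d_sepset

    # Part 1: each such node lies in exactly 1 linkage host.
    part1 = all(count_hosts_containing({v}) == 1 for v in in_hosts)

    # Part 2: in each non-host clique, the non-d-sepnodes shared with hosts
    # must sit together in exactly 1 linkage host.
    part2 = True
    for clique in non_hosts:
        shared = clique & in_hosts
        if shared and count_hosts_containing(shared) != 1:
            part2 = False
    return part1, part2


d_sepset = {"A", "D", "E"}

# Two adjacent hosts sharing the non-d-sepnode B.
top = host_composition_ok(d_sepset, [{"A", "B", "D"}, {"A", "B", "E"}], [])

# Hosts joined through a non-host clique {A, B, G}.
bottom = host_composition_ok(
    d_sepset, [{"A", "B", "D"}, {"A", "E", "G"}], [{"A", "B", "G"}]
)

# Hosts that contain d-sepnodes only.
clean = host_composition_ok(d_sepset, [{"A", "D"}, {"A", "E"}], [])
print(top, bottom, clean)
```

The first configuration fails part 1 because B lies in two hosts; the second passes part 1 but fails part 2 because {B, G} is not contained in any single host; the third satisfies both parts trivially.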
Example 17 The host composition condition is violated in the host trees of Figure 4.24. The following shows the violation and the resultant problem. Assume both trees are consistent.

First consider the top tree. Let $L$ consist of linkages $L_1 = \{A, D\}$ and $L_2 = \{A, E\}$. Let their hosts be $C_1 = \{A, B, D\}$ and $C_2 = \{A, B, E\}$, which are adjacent in the tree. $B$ is a common non-d-sepnode, a violation of part 1 of the host composition condition. Even if all the belief tables are consistent, in general,

$$\sum_{B} \frac{B(ABD)\,B(ABE)}{B(AB)} \neq \frac{B(AD)\,B(AE)}{B(A)}.$$

That is, the joint distribution on the d-sepset $\{A, D, E\}$ constructed from belief tables on linkages and redundancy sets is inconsistent in general.

Consider the bottom tree. Let $L$ and $C_1$ be the same. Let the host $C_2 = \{A, E, G\}$, which is connected to $C_1$ through a non-host $C_3 = \{A, B, G\}$. $\{B, G\}$ is a set of non-d-sepnodes violating part 2 of the host composition condition. If $C_1$ and $C_3$ are united as described in Lemma 8, the resultant graph is still a junction tree. If one lets $B(C_{13}) = B(C_1)B(C_3)/B(Q_{13})$, where $B(Q_{13})$ is the original belief for the sepset between $C_1$ and $C_3$, the joint belief for the new tree is exactly the same as before and the new tree is consistent. Now the common node $G$ in $C_{13}$ and $C_2$ creates the same problem illustrated above.

Figure 4.24: Two example trees for violation of the host composition condition. $I$: the d-sepset. The ribbed bands indicate linkages.

Proposition 12 Let a MSBN by sound sectioning be transformed into a junction forest $F$. Let $S^i$ be a sect in the MSBN; $T^i$ be the junction tree of $S^i$ in $F$; $I$ be the d-sepset between $S^i$ and any distinct sect; and $L$ be the set of linkages between $T^i$ and the junction tree for the other sect. Let all belief tables be defined as in section 4.7.3. When $F$ is globally consistent, $B(I)$ satisfies

$$B(I) \propto \sum_{N^i \setminus I} B(T^i)$$

iff $T^i$ satisfies the host composition condition relative to $L$.

Proof: [Sufficiency] The proof is constructive.
Denote the host tree of $T^i$ relative to $L$ by $T'$ and denote the domain of $T'$ by $N'$. Marginalize $B(T^i)$ with respect to $N^i \setminus N'$. This is done by marginalizing $B(T^i)$ recursively with respect to the unique variables in each leaf clique not in $T'$. This results in

$$\sum_{N^i \setminus N'} B(T^i) = B(T')$$

where $B(T')$ is the joint belief for the consistent junction tree $T'$. Note that $N^i \setminus N'$ contains no d-sepnodes.

Second, unite non-host cliques into linkage hosts in $T'$. Suppose $C_i$ is a non-linkage-host with host clique neighbor(s). Choose the neighbor host $C_j$ containing all the non-d-sepnodes of $C_i$ which appear in host cliques. Since $T'$ is a junction tree and the host composition condition holds, one is guaranteed to find such a $C_j$. By Lemma 8, the graph resulting from $T'$ by union of $C_i$ and $C_j$ into a new clique $C_k$, keeping the original sepsets, is still a junction tree on domain $N'$. The host composition condition still holds in the new graph. If one lets

$$B(C_k) = B(C_i)\,B(C_j)/B(Q_{ij})$$

where $B(Q_{ij})$ is the original belief for the sepset between $C_i$ and $C_j$, the joint belief for the new junction tree $T^{(1)}$ is exactly the same as $B(T')$ and $T^{(1)}$ is consistent. Repeating the union operation for all non-hosts, one ends up with a consistent junction tree $T''$ with only linkage hosts (possibly of new composition), in which every non-d-sepnode in $N'$ appears in exactly 1 host.

If, for every clique in $T''$, one marginalizes the clique belief with respect to its (unique) non-d-sepnodes, removes these non-d-sepnodes from the clique, assigns the marginalized belief to the new clique and keeps the sepset belief invariant, one ends up with a new consistent junction tree $T'''$ with its joint belief

$$B(T''') \propto \sum_{N' \setminus I} B(T').$$

Note that $T'''$ is just the linkage tree in Procedure 1. That is, all the cliques in $T'''$ are linkages, and all the sepsets are redundancy sets. Therefore, $B(I) \propto B(T''')$.

[Necessity] Suppose there are 2 or more linkages in $L$, and the host composition condition does not hold in $T^i$.
Obtain the host tree $T'$ in the same way as in the sufficiency proof. Recursively unite each non-host clique with a neighbor host clique if there is one, and assign the belief for the new clique in the same way as in the sufficiency proof. The result is a consistent junction tree $T''$. Since the host composition condition does not hold in $T'$, and the uniting process does not remove non-d-sepnodes from any host, there are at least 2 neighboring cliques $C_1$ and $C_2$ in $T''$ that have a set of common non-d-sepnodes.

Denote the 2 cliques as $C_1 = X \cup Y \cup W$ and $C_2 = X \cup Z \cup W$ (Figure 4.25) with $L_1 = C_1 \cap I = X \cup Y$ and $L_2 = C_2 \cap I = X \cup Z$. That is, $C_1$ and $C_2$ share a set of d-sepnodes $X$ and a set of non-d-sepnodes $W$. Even if all the belief tables are consistent, in general,

$$\sum_{W} \frac{B(C_1)\,B(C_2)}{B(Q_{12})} = \sum_{W} \frac{B(X \cup Y \cup W)\,B(X \cup Z \cup W)}{B(X \cup W)} \neq \frac{B(X \cup Y)\,B(X \cup Z)}{B(X)} = \frac{B(L_1)\,B(L_2)}{B(R_1)}$$

where $R_1$ is the intersection of $L_1$ and $L_2$ (a redundancy set). That is, the distribution on $I \cap (C_1 \cup C_2)$ is in general not consistent with the distribution constructed from the belief tables on the corresponding linkages and redundancy set.

$C_1$ and $C_2$ above are chosen so that neither contains non-d-sepnodes unique to it. If this is not the case, these non-d-sepnodes can always be removed by marginalization at each of the 2 cliques, and the result is the same.

If $L = \{L_1, L_2\}$, the proof is complete. Consider the case in which $L$ contains more than 2 linkages. In this case, the left side of the above relation represents a correct version of the marginal distribution on a subset of d-sepset $I$, while the right side is an inconsistent version of the same distribution defined in terms of the beliefs of linkages and redundancy sets (section 4.7.3). Now it suffices to note that, in general, if the marginal distribution is inconsistent, the joint is also inconsistent.
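The inequality used in the necessity argument can be seen numerically in the smallest case, with $X = \{A\}$, $Y = \{D\}$, $Z = \{E\}$ and $W = \{B\}$. The sketch below uses a hypothetical distribution, chosen only for illustration, in which $D$ and $E$ both depend on the shared non-d-sepnode $B$. It compares the marginal obtained from the two host cliques with the one assembled from the two linkages and their redundancy set.

```python
from itertools import product

# Hypothetical distribution: P(a, b, d, e) = P(a) P(b) P(d | b) P(e | b).
p_a = {0: 0.3, 1: 0.7}
p_b = {0: 0.5, 1: 0.5}
p_on = {0: 0.1, 1: 0.9}  # P(d = 1 | b) and P(e = 1 | b)


def cond(x, b):
    """P(x | b) for a binary child of B."""
    return p_on[b] if x == 1 else 1.0 - p_on[b]


joint = {
    (a, b, d, e): p_a[a] * p_b[b] * cond(d, b) * cond(e, b)
    for a, b, d, e in product((0, 1), repeat=4)
}


def marg(keep):
    """Marginal of `joint` on the positions in `keep` (0 = A, 1 = B, 2 = D, 3 = E)."""
    out = {}
    for config, value in joint.items():
        key = tuple(config[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out


b_abd, b_abe, b_ab = marg((0, 1, 2)), marg((0, 1, 3)), marg((0, 1))
b_ad, b_ae, b_a = marg((0, 2)), marg((0, 3)), marg((0,))
b_ade = marg((0, 2, 3))  # the true marginal on the d-sepset {A, D, E}

lhs, rhs = {}, {}
for a, d, e in product((0, 1), repeat=3):
    # Hosts {A,B,D} and {A,B,E} with sepset {A,B}, marginalized over B.
    lhs[(a, d, e)] = sum(
        b_abd[(a, b, d)] * b_abe[(a, b, e)] / b_ab[(a, b)] for b in (0, 1)
    )
    # Linkages {A,D} and {A,E} with redundancy set {A}.
    rhs[(a, d, e)] = b_ad[(a, d)] * b_ae[(a, e)] / b_a[(a,)]

print(round(lhs[(1, 1, 1)], 4), round(rhs[(1, 1, 1)], 4), round(b_ade[(1, 1, 1)], 4))
```

For $a = d = e = 1$ the host-based value is about 0.287 and agrees with the true marginal, while the linkage-based value is about 0.175. The gap arises because the linkage construction implicitly treats $D$ and $E$ as independent given $A$, which the shared node $B$ contradicts.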
The proposition shows that the belief table $B(I)$ defined in section 4.7.3 is indeed the joint belief on d-sepset $I$ when the host composition condition is satisfied and the forest is consistent. We are now ready for the following result on separability.

Figure 4.25: Part of a host tree violating the host composition condition, with $L_1 = C_1 \cap I = X \cup Y$ and $L_2 = C_2 \cap I = X \cup Z$. $I$: the d-sepset. The ribbed bands indicate linkages.

Theorem 9 (host composition) Let $\{S^1, \ldots, S^\beta\}$ be a MSBN satisfying the hypertree condition. Let $F = \{T^i \mid 1 \le i \le \beta\}$ be a corresponding junction forest with joint system belief $B(F)$. $F$ is separable if the host composition condition is satisfied in all pairs of junction trees with linkages constructed (footnote 5).

Proof: Assume $F$ has reached global consistency. First, unite all the cliques in each junction tree into a huge clique with the belief table for the junction tree as its belief. Connect the huge cliques as they are connected in the hypertree (that is, if the linkages between 2 trees are not created as discussed in section 4.7.3, do not connect the 2 corresponding huge cliques), and assign the original $B(I)$ to their sepset. The resultant graph is a junction tree due to the hypertree structure of the MSBN. By Proposition 12, the tree is consistent with joint belief being $B(F)$, and for any $T^i$ over subdomain $N^i$

$$\sum_{N \setminus N^i} B(F) \propto B(T^i)$$

iff the host composition condition is satisfied. $\Box$

Footnote 5: Recall that when a MSBN has a hypertree structure, no linkage is created between junction trees whose corresponding sects are covered by a local covering sect.

The host composition condition can usually be satisfied naturally in an application system. Since d-sepsets are the only media for information exchange between sects, d-sepnodes usually involve many inter-subDAG cycles.
The consequence is that they will be heavily connected during morali-triangulation and form several large cliques in the clique hypergraph as well as some small ones. On the other hand, non-d-sepnodes rarely form connections with so many d-sepnodes simultaneously and hence will rarely be elements of these large cliques. To be an element of more than 1 such large clique is even more unlikely. Because linkages are defined to be maximal, these large cliques will become linkage hosts. For example, in the PAINULIM application system (Chapter 5), there are 3 sects and correspondingly 3 junction trees. The host composition condition is satisfied naturally in all 3 trees. Figure 4.26 gives one of them. The 4 linkage hosts contain no non-d-sepnode at all.

When the host composition condition cannot be satisfied naturally, one can add dummy links between d-sepnodes in the moral graph before triangulation such that linkage hosts will be enlarged and the condition is satisfied. Hence, given a MSBN, a separable junction forest can always be realized. The penalty of added links is an increased amount of computation during belief propagation due to the increased sizes of cliques and linkages. In the worst case, one resorts to the brute force method discussed in section 4.3 in order to satisfy the host composition condition for certain pairs of junction trees. If the system is large, sectioning may still yield computational savings on the whole even if cliques are enlarged at a few junction trees.

One of the key results now follows.

Figure 4.26: $T$ is a junction tree in a junction forest taken from the application system PAINULIM, with variable names revised for simplicity. An upper case letter in a clique represents a d-sepnode member, and a lower case letter represents a non-d-sepnode member. The cliques $C_1, C_2, C_3, C_4$ are linkage hosts.
Theorem 10 (local computation) Let $F$ be a consistent and separable junction forest with domain $N$ and joint system belief $B(F)$. Let $(C^x, B(C^x))$ be any universe in $F$. Then

$$\sum_{N \setminus C^x} B(F) \propto B(C^x).$$

Proof: Let $(C^x, B(C^x))$ be a universe in one of the junction trees, $T^x$, of $F$. Let the subdomain of $T^x$ be $N^x$ and its belief be $B(T^x)$. Since $F$ is consistent and separable, by the definition of separability one has

$$\sum_{N \setminus N^x} B(F) \propto B(T^x)$$

and $T^x$ is a consistent junction tree. Jensen, Lauritzen and Olesen [1990] have proved that the belief of a consistent junction tree marginalized to any of its universes is proportional to the belief of that universe. Hence

$$\sum_{N \setminus C^x} B(F) \propto \sum_{N^x \setminus C^x} \Big( \sum_{N \setminus N^x} B(F) \Big) \propto B(C^x). \qquad \Box$$

With the above theorem, the marginal belief of any variable in a consistent and separable junction forest can be computed by marginalization of the belief table of any universe which contains the variable. In this respect, a consistent and separable junction forest behaves the same as a consistent junction tree [Jensen, Lauritzen and Olesen 90]. It will be seen that, in the context of junction forests, an additional computational advantage is available, i.e., global consistency is not necessary for obtaining marginal beliefs by local computation.

4.9 Belief Propagation in Junction Forests

Given the importance of consistency of junction forests, a set of operations is introduced which brings a junction forest into consistency. Since the purpose is to exploit localization, only operations for local computation at the junction tree level are considered. That is, at any moment, there is only 1 junction tree resident in memory. This junction tree is said to be active.

4.9.1 Supportiveness

Jensen, Lauritzen and Olesen [1990] introduced the concept of supportiveness. Let $(Z, B(Z))$ be a world. The support of $B(Z)$ is defined as

$$\Delta(B(Z)) = \{ z \in \Psi(Z) \mid \text{belief of } z > 0 \}.$$
A junction tree is supportive if, for any universe $(C_i, B(C_i))$ and for any neighboring sepset world $(Q_j, B(Q_j))$, $\Delta(B(C_i)) \subseteq \Delta(B(Q_j))$. The underlying intuition is that, when beliefs are propagated in a supportive junction tree, non-zero belief values will not be turned into zeros.

Here the concept is extended to junction forests. A junction forest is supportive if all its junction trees are supportive and if, for any linkage host $(C_i, B(C_i))$ and corresponding linkage world $(L_j, B(L_j))$, $\Delta(B(C_i)) \subseteq \Delta(B(L_j))$. The construction in section 4.7.3 results in a supportive junction forest.

4.9.2 Basic Operations

Operations for Consistency within a Junction Tree

The following operation brings a pair of belief universes into consistency.

Operation 1 (AbsorbThroughSepset) [Jensen, Lauritzen and Olesen 90] Let $(C_0, B(C_0))$ and its neighbors $(C_1, B(C_1)), \ldots, (C_k, B(C_k))$ be belief universes in a junction tree. Let $(Q_1, B(Q_1)), \ldots, (Q_k, B(Q_k))$ be the corresponding sepset worlds. Suppose $\Delta(B(C_0)) \subseteq \Delta(B(Q_i))$ $(i = 1, \ldots, k)$. Then the AbsorbThroughSepset of $(C_0, B(C_0))$ from $(C_i, B(C_i))$ $(i = 1, \ldots, k)$ changes $B(Q_i)$ and $B(C_0)$ into $B'(Q_i)$ and $B'(C_0)$:

$$B'(Q_i) = \sum_{C_i \setminus Q_i} B(C_i), \qquad i = 1, \ldots, k$$

$$B'(C_0) = B(C_0) \prod_{i=1}^{k} B'(Q_i)/B(Q_i)$$

This operation is performed at the level of belief universes. Jensen, Lauritzen and Olesen [1990] show that the operation changes neither the supportiveness of a junction tree nor the joint system belief in the context of a junction tree. In the context of a junction forest, the supportiveness is also invariant after AbsorbThroughSepset. This is because the operation does not increase the support of any linkage host and does not change linkage beliefs directly. The invariance of the joint system belief for the junction forest is obvious given the definition of joint system belief and the invariance of beliefs for junction trees.
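The update performed by AbsorbThroughSepset can be sketched in a few lines. The code below is an illustrative sketch rather than the thesis's implementation: it assumes belief tables are dictionaries from configuration tuples to numbers, treats 0/0 as 0 (a convention assumed here so that zero beliefs stay zero), and uses hypothetical names (`marginalize`, `absorb_through_sepset`) and a hypothetical numeric example.

```python
def marginalize(variables, table, keep):
    """Sum a belief table (dict: configuration tuple -> value) down to `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def absorb_through_sepset(c0_vars, b_c0, neighbors):
    """Return (B'(C_0), [B'(Q_1), ..., B'(Q_k)]).

    neighbors: list of (c_i_vars, b_c_i, q_i_vars, b_q_i) tuples, one per
    neighbor universe C_i with its sepset world Q_i.
    """
    new_c0 = dict(b_c0)
    new_sepsets = []
    for ci_vars, b_ci, q_vars, b_q in neighbors:
        # B'(Q_i): marginal of the neighbor's belief on the sepset.
        new_q = marginalize(ci_vars, b_ci, q_vars)
        idx = [c0_vars.index(v) for v in q_vars]
        for config in new_c0:
            key = tuple(config[i] for i in idx)
            # Assumed convention: 0/0 = 0.
            ratio = new_q[key] / b_q[key] if b_q[key] > 0 else 0.0
            new_c0[config] *= ratio
        new_sepsets.append(new_q)
    return new_c0, new_sepsets


# Hypothetical example: C_0 = {A, B} holds P(A | B), its neighbor C_1 = {B, C}
# holds P(B, C), and the sepset {B} still holds a constant table.
b_c0 = {(0, 0): 0.75, (1, 0): 0.25, (0, 1): 0.5, (1, 1): 0.5}  # over (A, B)
b_c1 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}    # over (B, C)
b_q = {(0,): 1.0, (1,): 1.0}                                    # over (B,)

new_c0, new_qs = absorb_through_sepset(
    ("A", "B"), b_c0, [(("B", "C"), b_c1, ("B",), b_q)]
)
print(new_c0, new_qs[0])
```

In this example the absorbing universe ends up holding the joint $P(A, B)$ and the sepset holds $P(B)$, so the pair is consistent in the sense of Definition 12 after a single absorption.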
The following are three high level operations which bring a junction tree into consistency.

Operation 2 (DistributeEvidence) [Jensen, Lauritzen and Olesen 90] Let $(C_0, B(C_0))$ be a universe in a junction tree. When DistributeEvidence is called in $(C_0, B(C_0))$, it performs AbsorbThroughSepset to absorb from the caller if the caller is a neighbor, and calls DistributeEvidence in all its neighbors except the caller.

Suppose a junction tree is originally consistent. If evidence is entered (footnote 6) in one of its belief universes and DistributeEvidence is initiated from that universe, then the resulting junction tree is still consistent.

Footnote 6: The concept of evidence entering is defined later.

Operation 3 (CollectEvidence) [Jensen, Lauritzen and Olesen 90] Let $(C_0, B(C_0))$ be a universe in a junction tree. When CollectEvidence is called in $(C_0, B(C_0))$, it calls CollectEvidence in all its neighbors except the caller, and when they have finished their CollectEvidence, $(C_0, B(C_0))$ performs AbsorbThroughSepset to absorb from them.

DistributeEvidence and CollectEvidence are operations performed at the level of belief universes. Since they are composed of AbsorbThroughSepset, they do not change the supportiveness and joint system belief of a junction forest.

The combination of these 2 operations yields the operation UnifyBelief, which brings a supportive junction tree into consistency and is performed at the level of junction trees.

Operation 4 (UnifyBelief) UnifyBelief can be initiated at any universe $(C_0, B(C_0))$ in a junction tree. $(C_0, B(C_0))$ calls CollectEvidence in all its neighbors and, when they have finished their CollectEvidence, $(C_0, B(C_0))$ calls DistributeEvidence in them.

Operations for Belief Exchange in Belief Initialization

Belief initialization brings a junction forest into global consistency before any evidence is available. One problem arises when there are multiple linkages between junction trees.
Care must be taken not to count the same information passed on different linkages multiple times. The following two operations pass information through multiple linkages during belief initialization. NonRedundancyAbsorption is performed at the level of linkage hosts. ExchangeBelief calls NonRedundancyAbsorption and is performed at the level of junction trees. ExchangeBelief ensures the exchange between junction trees of the prior distribution on d-sepsets without redundant information passing.

Operation 5 (NonRedundancyAbsorption) Let $(C_x^a, B(C_x^a))$ and $(C_x^b, B(C_x^b))$ be 2 linkage host universes in junction trees $T^a$ and $T^b$ respectively. Let $(L_x, B(L_x))$ and $(R_x, B(R_x))$ be the worlds for the corresponding linkage and redundancy set. Suppose $\Delta(B(C_x^a)) \subseteq \Delta(B(L_x))$. The NonRedundancyAbsorption of $(C_x^a, B(C_x^a))$ from $(C_x^b, B(C_x^b))$ through linkage $L_x$ changes $B(L_x)$, $B(R_x)$ and $B(C_x^a)$ into $B'(L_x)$, $B'(R_x)$ and $B'(C_x^a)$ respectively:

$$B'(L_x) = \sum_{C_x^b \setminus L_x} B(C_x^b)$$

$$B'(R_x) = \sum_{L_x \setminus R_x} B'(L_x)$$

$$B'(C_x^a) = B(C_x^a) \cdot \frac{B'(L_x)/B'(R_x)}{B(L_x)/B(R_x)}$$

The factor $1/B'(R_x)$ above has the function of redundancy removal.

At initialization, the belief tables for linkages and redundancy sets are in the state of construction, i.e., constant. Hence, $B(L_x)$ and $B(R_x)$ above are constant tables. If $B'(L_x)$ is constant, which is possible because constant probability tables are assigned to d-sepnodes in some sects in Definition 7, then after the operation

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) \propto \sum_{C_x^a \setminus L_x} B(C_x^a).$$

That is, if $C_x^b$ has no information to offer, then $C_x^a$ will not change its belief. If $\sum_{C_x^a \setminus L_x} B(C_x^a)$ is constant, then after the operation

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) \propto \Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) \Big/ \Big( \sum_{C_x^b \setminus R_x} B(C_x^b) \Big).$$

That is, if $C_x^b$ has new information and $C_x^a$ contains no non-trivial information, then the belief of $C_x^b$ will be copied with redundancy removed. If neither $B'(L_x)$ nor $\sum_{C_x^a \setminus L_x} B(C_x^a)$ is constant, then after the operation

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) \propto \Big( \sum_{C_x^a \setminus L_x} B(C_x^a) \Big) \cdot \Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) \Big/ \Big( \sum_{C_x^b \setminus R_x} B(C_x^b) \Big).$$

That is, if none of the above cases is true, the belief from both sides will be combined with redundancy removed.
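The three assignments of NonRedundancyAbsorption translate directly into table operations. The following sketch assumes dictionary belief tables and strictly positive old linkage and redundancy tables, as holds at initialization when they are constant. The function names and the numeric example are hypothetical, and the sketch handles a single linkage only.

```python
def marginalize(variables, table, keep):
    """Sum a belief table (dict: configuration tuple -> value) down to `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def non_redundancy_absorption(ca_vars, b_ca, cb_vars, b_cb,
                              l_vars, b_l, r_vars, b_r):
    """Return (B'(C^a), B'(L), B'(R)) for one linkage L with redundancy set R.

    Assumes the old tables b_l and b_r are strictly positive.
    """
    new_l = marginalize(cb_vars, b_cb, l_vars)  # marginal of B(C^b) on L
    new_r = marginalize(l_vars, new_l, r_vars)  # marginal of B'(L) on R
    l_idx = [ca_vars.index(v) for v in l_vars]
    r_idx = [ca_vars.index(v) for v in r_vars]
    new_ca = {}
    for config, value in b_ca.items():
        l_key = tuple(config[i] for i in l_idx)
        r_key = tuple(config[i] for i in r_idx)
        old = b_l[l_key] / b_r[r_key]
        # Dividing by B'(R) removes what is also passed on another linkage.
        new = new_l[l_key] / new_r[r_key] if new_r[r_key] > 0 else 0.0
        new_ca[config] = value * new / old
    return new_ca, new_l, new_r


# Hypothetical example: linkage L = {X, Z}, redundancy set R = {Z}.
# Host C^b holds P(X, Z); host C^a and the old L, R tables are constant.
b_cb = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # over (X, Z)
ones_xz = {k: 1.0 for k in b_cb}
ones_z = {(0,): 1.0, (1,): 1.0}

new_ca, new_l, new_r = non_redundancy_absorption(
    ("X", "Z"), ones_xz, ("X", "Z"), b_cb,
    ("X", "Z"), ones_xz, ("Z",), ones_z,
)
print(new_ca)
```

Here the receiving host ends up with $P(X \mid Z)$ rather than $P(X, Z)$: the marginal on the redundancy set $Z$ has been divided out, on the understanding that it reaches the receiving tree through the neighboring linkage that shares $Z$.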
Since

$$\Delta(B'(C_x^a)) \subseteq \Delta\Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) = \Delta(B'(L_x))$$

the supportiveness of the junction forest is invariant under NonRedundancyAbsorption. Since

$$\frac{B'(C_x^a)}{B'(L_x)/B'(R_x)} = \frac{B(C_x^a)}{B(L_x)/B(R_x)}$$

the joint system belief is invariant under NonRedundancyAbsorption. This operation is equipped at each linkage host.

Operation 6 (ExchangeBelief) Let $L$ be the set of linkages between junction trees $T^a$ and $T^b$. When $T^a$ initiates ExchangeBelief with $T^b$, NonRedundancyAbsorption is performed at all linkage host universes in $T^a$.

Since ExchangeBelief is composed of NonRedundancyAbsorption, the supportiveness of the junction forest and its joint system belief are invariant under ExchangeBelief. The operation is equipped at the level of junction trees. After ExchangeBelief, the non-trivial content of the joint distribution on the d-sepset at $T^b$ will be passed on to $T^a$ without redundancy.

Operations for Belief Update in Evidential Reasoning

Evidential reasoning propagates evidence obtained in one junction tree to the junction forest. A junction tree receiving updated belief on the d-sepset from a neighbor junction tree may be confused due to evidence passing over multiple linkages. The following 2 operations handle the evidence propagation between junction trees. AbsorbThroughLinkage propagates evidence through one linkage and is performed at the level of linkage hosts. UpdateBelief calls AbsorbThroughLinkage to propagate evidence from one junction tree to another, and is performed at the level of junction trees. It is used during evidential reasoning when both junction trees are themselves consistent but may not have reached boundary consistency between them.

Operation 7 (AbsorbThroughLinkage) Let $(C_x^a, B(C_x^a))$ and $(C_x^b, B(C_x^b))$ be 2 linkage host universes in junction trees $T^a$ and $T^b$ respectively. Let $(L_x, B(L_x))$ be the corresponding linkage world. Suppose $\Delta(B(C_x^a)) \subseteq \Delta(B(L_x))$. The AbsorbThroughLinkage of $(C_x^a, B(C_x^a))$ from $(C_x^b, B(C_x^b))$ changes $B(C_x^a)$ and $B(L_x)$ into $B'(C_x^a)$ and $B'(L_x)$ as follows:

$$B'(L_x) = \sum_{C_x^b \setminus L_x} B(C_x^b)$$

$$B'(C_x^a) = B(C_x^a) \cdot B'(L_x)/B(L_x)$$

After AbsorbThroughLinkage,

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) = \sum_{C_x^b \setminus L_x} B(C_x^b).$$

Since

$$\Delta(B'(C_x^a)) \subseteq \Delta\Big( \sum_{C_x^b \setminus L_x} B(C_x^b) \Big) = \Delta(B'(L_x))$$

the supportiveness of a junction forest is invariant under AbsorbThroughLinkage. The operation makes the belief of $C_x^a$ up-to-date with respect to the belief of $C_x^b$ on their common variables.

Operation 8 (UpdateBelief) Let $L = \{L_1, \ldots, L_k\}$ be the set of linkages between junction trees $T^a$ and $T^b$ with corresponding linkage hosts $C_1^a, \ldots, C_k^a$. When $T^a$ initiates UpdateBelief with $T^b$, the operation pair AbsorbThroughLinkage and then DistributeEvidence is performed at $C_1^a, \ldots, C_k^a$.

Since the operation is composed of AbsorbThroughLinkage and DistributeEvidence, the supportiveness of the junction forest is invariant under the operation. Since after UpdateBelief

$$\sum_{C_x^a \setminus L_x} B'(C_x^a) = \sum_{C_x^b \setminus L_x} B(C_x^b), \qquad x = 1, \ldots, k$$

and $T^a$ is consistent, the effect of the operation is

$$B'(T^a) = B(T^a) \cdot B'(I)/B(I)$$

where $I$ is the d-sepset between $S^a$ and $S^b$, or equivalently

$$B'(T^a)/B'(I) = B(T^a)/B(I)$$

which implies that the joint system belief is invariant under the operation.

Note that after each AbsorbThroughLinkage in operation UpdateBelief, a DistributeEvidence is performed. This avoids confusion in the information-receiving junction tree that could result from information passing over multiple linkages. The following simple example illustrates this necessity.

Figure 4.27: An example illustrating the operation UpdateBelief.

Example 18 Let junction tree $T^a$ (Figure 4.27) have 2 linkage hosts $C_1 = L_1 = X \cup Z$ and $C_2 = L_2 = Y \cup Z$, where $X$, $Y$, $Z$ are 3 disjoint sets of nodes. Let $B(C_1)$, $B(C_2)$ and $B(Z)$ be the belief tables of the 2 hosts and their sepset respectively. Suppose new information is passed over to $T^a$ through the 2 linkages from its neighbor junction tree.
If AbsorbThroughLinkage is performed at $L_1$ and then $L_2$ without a DistributeEvidence between the 2 operations, then the belief on the 2 host cliques will be updated to $B'(C_1)$, $B'(C_2)$, while $B(Z)$ is unchanged. If $C_1$ initiates an AbsorbThroughSepset from $C_2$ in the process of propagating the new information to the rest of $T^a$, the belief on $C_1$ will become

$$B''(C_1) = B'(C_1) \cdot \Big( \sum_{Y} B'(C_2) \Big) \Big/ B(Z) \not\propto B'(C_1)$$

which is not expected. This is because $\sum_{Y} B'(C_2) \not\propto B(Z)$.

With a DistributeEvidence between the 2 AbsorbThroughLinkage operations, there is $B'(Z) = \sum_{Y} B'(C_2)$. The result of AbsorbThroughSepset becomes

$$B''(C_1) = B'(C_1) \cdot \Big( \sum_{Y} B'(C_2) \Big) \Big/ B'(Z) \propto B'(C_1)$$

which is correct.

4.9.3 Belief Initialization

Before any evidence is available, an internal representation of beliefs is to be established. The establishment of this representation is termed initialization by Lauritzen and Spiegelhalter [1988] for their method. The function of initialization in the context of junction forests is to propagate the prior knowledge stored in different belief universes of different junction trees to the rest of the forest such that (1) the prior marginal probability distribution for any variable can be obtained in any universe containing the variable, and (2) subsequent evidential reasoning can be performed.

The following defines operations DistributeBelief and CollectBelief, which are analogous to DistributeEvidence and CollectEvidence but at the junction tree level. The operation BeliefInitialization is then defined in terms of DistributeBelief and CollectBelief, just as UnifyBelief is defined in terms of DistributeEvidence and CollectEvidence, but at the junction forest level. Two junction trees in a junction forest are called neighbors if the d-sepset between the 2 corresponding sects is nonempty.

Operation 9 (DistributeBelief) Let $T^i$ be a junction tree in a junction forest.
When DistributeBelief is called in $T^i$, it performs UpdateBelief with respect to the caller if the caller is a junction tree, and then calls DistributeBelief in all its neighbors except the caller and the caller's neighbors.

Operation 10 (CollectBelief) Let $T^i$ be a junction tree in a junction forest. When CollectBelief is called in $T^i$, it calls CollectBelief in all its neighbors except the caller and the caller's neighbors. When they have finished, $T^i$ performs ExchangeBelief with respect to each of them, followed by a UnifyBelief on $T^i$.

Operation 11 (BeliefInitialization) BeliefInitialization can be initiated at any junction tree $T^i$ in a junction forest if $T^i$ is transformed from a local covering sect. $T^i$ calls CollectBelief in all its neighbors, and when they have finished $T^i$ calls DistributeBelief in them.

None of the 3 operations changes the supportiveness or the joint system belief. Thus one has the following theorem.

Theorem 11 (belief initialization with hypertree) Let $\{S^1, \ldots, S^\beta\}$ be a MSBN with a hypertree structure. Let $F = \{T^1, \ldots, T^\beta\}$ be a junction forest with $T^i$ being the junction tree of $S^i$. Let $B(F)$ be the joint system belief constructed as in section 4.7.3. After BeliefInitialization, the junction forest is globally consistent.

Note that BeliefInitialization does not involve direct information passing between neighbors of a local covering sect. This is also the case during evidential reasoning, to be discussed later. As assumed in section 4.7.3, the linkages between these neighbors are not created in the first place. This saving is possible only if the MSBN has a hypertree structure.

Example 19 BeliefInitialization is initiated at $\Gamma^1$ in Figure 4.20. It calls CollectBelief in $\Gamma^2$ and $\Gamma^3$. Since the latter do not have neighbors other than the caller and the caller's neighbor, only UnifyBelief is performed in $\Gamma^2$ and $\Gamma^3$. Table 4.6 lists the belief tables for belief universes in junction trees $\Gamma^2$ and $\Gamma^3$ after their UnifyBeliefs.
Table 4.7 and Table 4.8 list, respectively, the belief tables for belief universes of the junction forest and the (prior) marginal probabilities for all variables of the corresponding MSBN after the completion of BeliefInitialization. The marginal probabilities are identical to what would be derived from the USBN (F, P) with F in Figure 4.17 and P in Table 4.3. The marginals are obtained by marginalization of belief universes which contain the corresponding variables.

Once belief initialization is completed, the junction forest becomes the permanent representation which will be reused for each query session.

4.9.4 Evidential Reasoning

The joint system belief defined in section 4.7.3 is proportional to the prior joint distribution representing the background domain knowledge. Initialization allows one to obtain prior marginal probabilities with efficient local computation. When evidence about a particular case becomes available, one wants the prior distribution to change into the posterior distribution.

Evidence is represented in terms of evidence functions. Two types of evidence are considered here, as by Jensen, Olesen and Andersen [1990]. The first type has a value range of {0, 1}, where '0' means that the corresponding outcome is impossible and '1' means that the corresponding outcome is still possible, with relative belief strength remaining the same. The second type is a restriction. It has the same function value range, but the function assigns '1' to only one outcome. This type of evidence arises when the corresponding evidential variables are directly observable. Both types of evidence functions can be entered into junction forests by multiplying the prior distribution with the evidence function.

Call the overall process of entering evidence and propagating evidence as evidential

B(r2) 2(r3) Clique NodeA33. Clique NodeAsa. {F2,1} F2 {E1,3} E1 Config.
B() Config. {f21,fii} .7758 {e11,3} .7121 (f21,12} 1.545 {e11,32 3.108 (122,fll} 1.164 {e12,31} 2.848 (122112) .5151 {e12,3 1.332 Clique NodeAss. Clique NodeA3s. {H,F} H2,F1H {H,2E} H3,H2E Conf 19. 20 Config. 20 {h21 fii h11} .7895 {h31,h21,e3) 1.54 h1,2} .6 {h31,21,e32} .4596 {h21,j12h11} .2105 {h31,h22,e31) 1.3 h1f2, .4 1,2e} .7 (2,1} .5 {h32,) .7 {h22,f11h12} .05 {h32,h21 ,e32) 1.3 {h22, 112 h11} .5 ,2e} .02 {h22,f1 h12} .95 {h32h22,e3} 1.98 Clique NodeAss. {EE,4} P22 Cool ig. 20 {e21,34 1.656 , 2} 1.495 {e21,34 1.898 e2, .1165 {e231,4} .0356 2,) .3738 {e2c3e41 .2109 2,} 2.214 Clique NodeAss. {E3,4H E4,H Conf 1g. 20 {e3e4h} 1.424 {e31 e41 h42) .2670 42,h .3560 {e31,} 1.513 2,4h 1.776 {e3, 1 .3330 2,4h} .4440 {e3,h 1.887 Clique NodeAss. {H,E4} Config. B() {h31,e4 1.42 ,h2} 1.42 {h31,e4 .5798 , 2 .5798 {h3,e14} .36 2,) .36 {h3,eh41 1.64 2, 1.64 Table 4.6: Belief tables for belief universes in junction trees P2 and P3 in Figure 4.20 after Collect Belief during Beliefiniti alization. Table 4.7: Belief tables for belief universes of junction forest F = 4.20 obtained after the completion of Beliefinitialization. {f’,f2,3}in Figure Chapter 4. MULTIPLY SECTIONED BAYESIAN NETWORKS 140 B(r1) 8(F2) B(r3) Clique NadeAsa. Clique NodeAss. Clique NodeAss. {H2,1A} H1,A {F2,1} F2 {E1,3) E1 Config. 80 Config. 80 Config. 80 {h21,1 .8203 {f21,fll} 1.160 {e11,3} .564 , 2} .08166 (121,112) 5.324 {e11,32 5.026 (h21,1 .5810 {f22,flI} 1.741 (e12,31} 2.256(, 2 2.082 {f22112} 1.775 {12,e3) 2.154 {h22 , h11 } .3797 Clique NodeAss. , Clique NodeAss.h2,11) .2183 {H,F1} H2,F1H {HH2E} H3,H2E {h22,12,h1 } .2690 Config. 80 Config. B() ,1 5.568 {h21,1 .7121 {h31h2,e 1.444 Clique NodeA3s. f, 2} 1.598 , 2} .4310 {H2,A} H2 {h211, .1899 {h31,2e .731 Config. B() f2, 1.065 , 2e .3936 {h21,a1 .5526 {h211,} .2990 {h3,21e} .5915 , 2} 1.725 2f, .2918 2, 1.098 {h21,a1 .8487 (h21, 1 .2990 {h3,2e1 .0531 , 2 .4388 (2f,} 5.545 (2,) 5.257 {h2,a11) .08289 Clique NodeAss. 2,} .7394 {E2,34 E2 {h2,a11 .5658 Config. 
B0 2, 5.047 {e21,3 e41} 1.020 Clique NodeAss. {e21 , e31 e42 } 1.422 {H3,2A} H3,A2 (,324) 2.182 Config. B() {e21,} .2378 {h31,h21 a21} 1.763 {e22e31e41} .02194 {h31 h21 ,a22} .1120 {e22,31e42} .3555 ,2a} .6366 e2,41} .2424 {h31,h22,a} .4880 {e22,3 e42} 4.518 {h32,h21 a21) .5143 Clique NodeA3s. ,2a} 1.176 (E,4H) E4,H {h32, 1 .1857 Config. 80 {h32,22,a} 5.124 {e31 ,e41 ,h41} .7622 Clique NodeAss. {e31 e41 , h42} .2801 (H3,A3,H4} A3,H4 {e31,e421} .1906 Config. EQ (4,h 1.587 (h31,a31,h41) .225 {e32,41,h41} 1.658 {h,42) .675 e2, 1h} .7662 {h31,a32,h41 } .84 {3,4 .4145 {h31,a32,h42} 1.26 e2,h 4.342 h, 14} 1.4 Clique NodeAss. {32,a 4.2 {H3,E4} h,41 .56 Config. B() {32,a} .84 {h31,e4 .7723 Clique NodeAss. {h31 , e, h42 } 1.403 {H4,A} A4 h1, 24} .2927 Conjig. EQ {3,e .5319 {h41,a) 2.723 h2, 14 .1805 {h41 a42) .3025 {h32,e1h42} .4641 {h42, 1} 1.395 h2,45} 1.780 {h42,a 5.58 {3,e 4.576 Chapter 4. MULTIPLY SECTIONED BAYESIAN NETWORKS 141 .15 p(a11)= .205 p(fij)= .2901 i’(ii)= 559(h21)= .3565 p(a21)= .31 p(f21)= .6485 p(e21)= .4862 .3 p(e31)= .282 p(h41)= .3025 (a41)= .4118 p(e41)= .3466 Table 4.8: Prior probabilities from junction forest F = {I”, F2, F} in Figure 4.20 ob tained after the completion of Belieflnitialization. reasoning. After a batch of evidence is entered to a junction tree, UnifyBelief can be performed to bring the junction tree into consistency. This is the same in the context of junction forests as in the junction tree technique. However, in order to obtain posterior marginal distributions on variables in the current active junction tree, the global con sistency of the junction forest is not necessary. Before this is formally treated, several concepts are to be defined. Here only junction forests transformed from MSBNs with hypertree structures are considered. When a user wants to obtain marginal distributions on variables not con tained in the currently active junction tree, it is said that there is an attention shift. 
The junction tree which contains the desired variables is called the destination tree.

Definition 17 (intermediate tree) Let $S^i$, $S^j$, $S^k$ be 3 sects in a MSBN with a hypertree structure, and $T^i$, $T^j$, $T^k$ be their junction trees respectively in the corresponding junction forest. $T^j$ is the intermediate tree between $T^i$ and $T^k$ if the removal of $S^j$ from the MSBN would render the hypertree disconnected with $S^i$ and $S^k$ in different parts.

Due to the hypertree structure, one has the following lemma.

Lemma 9 Let a junction forest be transformed from a MSBN with a hypertree structure. Let $T^i$ and $T^k$ be 2 junction trees in the forest. The set of intermediate junction trees between $T^i$ and $T^k$ is unique.

The following defines an operation ShiftAttention at the junction forest level. It is performed when the user’s attention shifts.

Operation 12 (ShiftAttention) Let F be a junction forest whose corresponding MSBN has a hypertree structure. Let $T^{i_0}$ and $T^{i_{m+1}}$ be the currently active tree and the destination tree in F respectively. Let $\{T^{i_1}, \ldots, T^{i_m}\}$ be the set of m intermediate trees between $T^{i_0}$ and $T^{i_{m+1}}$ such that $T^{i_0}, T^{i_1}, \ldots, T^{i_m}, T^{i_{m+1}}$ form a chain of neighbors. For j = 1 to m + 1, $T^{i_j}$ performs UpdateBelief with respect to $T^{i_{j-1}}$.

Before each attention shift, several batches of evidence can be entered to the currently active tree. When an attention shift happens, ShiftAttention sequentially swaps in and out of memory only the intermediate trees between the currently active tree and the destination tree, without the participation of the rest of the forest. The following theorem shows that this is sufficient to obtain the marginal distributions in the destination tree.

Theorem 12 (attention shift) Let F be a consistent junction forest whose corresponding MSBN has a hypertree structure. Start with any active junction tree. Repeat the following cycle a finite number of times:

1.
repeatedly enter evidence to the currently active tree, each time followed by UnifyBelief, a finite number of times;
2. use ShiftAttention to shift attention to any destination tree.

The marginal distributions obtained in the final active tree are identical to those that would be obtained if the forest were globally consistent.

Proof: Before any evidence is entered, the forest is consistent by assumption. Before each ShiftAttention, the currently active tree is consistent due to UnifyBelief. The supportiveness of the forest is invariant under the execution of the cycle. The joint system belief changes (correctly) only under evidence entering and remains invariant under the other operations of the cycle.

Transform the initial consistent forest into a junction tree F’ of huge universes as in the proof of theorem 9. Each huge universe in F’ corresponds to a tree in F. The active tree and the destination tree correspond to an ‘active’ universe and a destination universe. The evidence entering in F and the subsequent UnifyBelief correspond to the evidence entering in F’. The ShiftAttention in F corresponds to performing a series of AbsorbThroughSepset operations, starting at the neighbor of the ‘active’ universe and proceeding down to the destination universe in F’.

Note that after each series of AbsorbThroughSepset operations in F’, the belief table of each sepset of F’ is the marginalization of the belief in the neighbor more distant from the destination universe in the junction tree F’. Therefore, at the end of the cycles in F, if a DistributeEvidence is performed in F’, F’ will be consistent and the belief in the destination universe will not undergo further change. That is, the belief on the destination tree at the end of the cycles is identical to what would be obtained if the forest were globally consistent.
□

4.9.5 Computational Complexity

Theorem 12 shows the most important characteristic of MSBNs and junction forests, namely, the capability of exploiting localization to reduce the computational complexity. Due to localization, the user’s interest and new evidence will remain in the sphere of one junction tree for a period of time. The judgments obtained are at the knowledge level of the overall junction forest, while the computation required is at the level of the currently active tree. Compared to the USBN and single junction tree representation, where the evidence has to be propagated to the overall system, this leads to great savings in time and space requirements when localization is valid.

When the user subsequently shifts interest to another set of variables contained in a destination tree, only the intermediate trees need to be updated. The amount of computation required is linear in the number of intermediate trees and in the number of linkages between each pair of neighbors. However large the overall junction forest is, the amount of computation for an attention shift is fixed once the destination tree and the intermediate trees are fixed. For example, in a MSBN with a covering sect, no matter how many sects are in the MSBN, the attention shift updates at most 2 sects.

Given this analysis, the computational complexity of a MSBN with $\beta$ sects is about $1/\beta$ of that of the corresponding USBN system when localization is valid. The actual time and space requirements are a little more than $1/\beta$ due to the repetition of d-sepnodes and the computation required for attention shift. The computational savings obtained in the PAINULIM system are discussed in chapter 5.

Example 20 Complete the example on junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ with evidential reasoning. Suppose the outcome of variable $E_3$ in $\Gamma^3$ is found to be $e_{31}$.
Table 4.9 lists $B'(\Gamma^3)$, which is obtained after evidence entering and UnifyBelief, and $B'(\Gamma^1)$ and $B'(\Gamma^2)$, both of which are obtained after a ShiftAttention with destination $\Gamma^2$. Table 4.10 lists the posterior marginal probabilities after the ShiftAttention.

Before closing this section, the following example demonstrates the computational advantage, during attention shift, provided by covering sects.

Example 21 In figure 4.23, D is sectioned into $\{D^1, D^2, D^3\}$ by sound sectioning without a covering sect. The MSBN is transformed into the junction forest $\{T^1, T^2, T^3\}$ by an invertible transformation. If evidence about E arrives, followed by evidence about G, the first piece of evidence is entered to $T^1$, and $T^1$ then sends a message to $T^2$. After the second piece of evidence is entered, $T^2$ is the only tree that is up-to-date. Now if one

Table 4.9: Belief tables for $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20 in evidential reasoning. $B'(\Gamma^3)$ is obtained first after $E_3 = e_{31}$ is entered to $\Gamma^3$ and the evidence is propagated to the

NodeAss. F2 BQ 5.523 10.79 8.285 3.598 Node A as. H2, F1 H1 B() 3.638 9.450 .9702 6.300 .3644 .3556 .3644 6.757 B’(F3) B’(F1) B’(r2) Clique NodeAss. Clique NodeAss. Clique (E1,3} {H2,1A} H1,A (F2,1} Config. B() Config. B() Conf ig. (e11,3} .5640 (h21,hll,hll) 4.105 (121,111) (e11,32) 0 {21,12} .5036 {121,f12} {e12,31} 2.256 (h,) 2.908 (122,111) {e12,3) 0 (21,12 12.84 {f22,f12} Clique NodeAss. (h22 , h1 } .4627 Clique {H,2E} H3,H2E 2,1} .2661 (H,F1} Conjig. B() (h2, 1) .3278 Config. {h31,2e 14.44 {2,1 6.785 (h21,jll,hll} , 2} 0 Clique NodeAs,. {h21,fll,h} {h31,2e 7.31 (H2,A} H2 (j12,} , 2 0 Config. B() (h21, (h3,21e) 5.915 {h21,a1 3.732 (2j1,) (2, 0 a,2) 11.65 (h21,} {h3,2e1 .5310 {h21,a1} 3.281 (2j1, (2,) 0 ,cz2 1.696 {h2f, Clique NodeAss. {h2,a11) .4190 (E2,34} E2 2, 3.737 Config. B() {h2,a11 .3715 (e21 , e31 ,e41) 1.020 {h22,a22,a12) 3.313 (,342} 1.422 Clique NodeAss. e21,) 0 {H3,A) H3,A2 (,324 0 Config.
B() (e22,31,e41} .02194 (h31,2 °21) 13.58 (e22,31,e42} .3555 (h31 ,h21 “22) .8623 (2,41} 0 (,2a) 4.138 e2,3} 0 (h31 ,h22,022} 3.172 Clique NodeAss. (,21a} 1.800 {E,4H E4,H {h32,h21,022} 4.115 Config. B() ,2a) .01857 {e31 ,e41 ,h41} 7.622 {h32,} .5124 {e31 , e41 , h42 } 2.801 Clique NodeAss. (e31 c42 h41} 1.906 {H,A4 A3,H4 e1,4h} 15.87 Conjig. B() {e32,41,h} 0 (h31 ‘“31 ,h41) 1.632 (e32,e41 , h42} 0 (hi a31 ,h42} 4.895 (e32,42,h1} 0 {hi ,a32h41} 6.091 (a32,e42,h 0 (“31 ,a32,h42} 9.137 Clique NodeAss. {h32, a31, h41) 1.289 {H,E4} h2,4) 3.867 Config. B() {3,a1 .5157 {h31,e4 7.723 h2,4 .7735 (h31 , e31 , h42) 14.03 Clique NodeAss. (h, 24) 0 {H4,A} A4 (31,e} 0 Config. (h32, e31 ,h41 } 1.805 {h41, “41) 8.575 {h2,4 4.641 {h41,a2) .9528 (3,e1} 0 {h42,1 3.734 (h2,4 0 {h42,a42) 14.94

junction tree. $B'(\Gamma^1)$ and $B'(\Gamma^2)$ are obtained afterwards by ShiftAttention.

p(h11) = .1893   p(a11) = .2767   p(f11) = .4897   p(e11) = .2
p(h21) = .7219   p(a21) = .6928   p(f21) = .5786   p(e21) = .8661
p(h31) = .7714   p(a31) = .4143                    p(e31) = 1
p(h41) = .3379   p(a41) = .4365                    p(e41) = .3696

Table 4.10: Posterior probabilities from junction forest $F = \{\Gamma^1, \Gamma^2, \Gamma^3\}$ in Figure 4.20 after evidence $E_3 = e_{31}$ is propagated by ShiftAttention.

is interested in the belief on H, the belief tables on {B, F} and {C, F} in $T^3$ have to be updated. However, $T^2$ cannot provide the distribution on {B, F}. Thus $T^2$ has to send a message to $T^3$ about {C, F} and then send a message to $T^1$. $T^1$ then becomes up-to-date and can send the distribution on {B, F} to $T^3$. One sees that 3 message passings are necessary, and that linkages between each pair of junction trees have to be created and maintained. More message passings and more linkages are needed when more sects are organized in this structure. When n sects are inter-connected and there is a covering sect, only n - 1 sets of linkages need to be created, and at most 2 message passings are needed to update the belief in any destination tree.
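The bookkeeping behind Operation 12 and the message counts in Example 21 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the forest is given as a hypothetical adjacency dictionary over tree names, the tree names `cover`, `A`, `B`, `C` are invented, and `update_belief` is a placeholder callback standing in for UpdateBelief (no belief tables are actually computed). Because the hypertree is tree-structured, the chain between two trees is unique (Lemma 9), so a breadth-first search recovers it.

```python
from collections import deque


def chain_between(neighbors, active, destination):
    """Return the unique chain of junction trees from `active` to
    `destination` in a tree-structured forest (adjacency dict)."""
    parent = {active: None}
    queue = deque([active])
    while queue:
        tree = queue.popleft()
        if tree == destination:
            break
        for nxt in neighbors[tree]:
            if nxt not in parent:
                parent[nxt] = tree
                queue.append(nxt)
    if destination not in parent:
        raise ValueError("destination is not connected to the active tree")
    # Walk back from the destination to the active tree, then reverse.
    chain = []
    node = destination
    while node is not None:
        chain.append(node)
        node = parent[node]
    chain.reverse()
    return chain


def shift_attention(neighbors, active, destination, update_belief):
    """Call update_belief(tree, previous_tree) along the chain, in order.
    `update_belief` is a hypothetical stand-in for UpdateBelief."""
    chain = chain_between(neighbors, active, destination)
    for previous, current in zip(chain, chain[1:]):
        update_belief(current, previous)
    return chain


# Hypothetical forest with a covering sect: every other tree is a
# neighbor of "cover" only, so any shift needs at most 2 updates.
star = {"cover": ["A", "B", "C"], "A": ["cover"], "B": ["cover"], "C": ["cover"]}
log = []
chain = shift_attention(star, "A", "B", lambda cur, prev: log.append((cur, prev)))
print(chain)     # ['A', 'cover', 'B']
print(len(log))  # 2
```

With a covering sect the adjacency structure is a star, so the longest possible chain has three trees and two updates, whatever the number of sects; only the trees on the chain are touched.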
4.10 Remarks

This chapter presents MSBNs and junction forests as flexible and efficient knowledge representation and inference formalisms that exploit the localization naturally existing in large knowledge-based systems. The systems which can benefit from the technique are those that are reusable, representable by general but sparse networks, and characterized by incremental evidential reasoning.

MSBNs allow the partition of a large application domain into smaller natural subdomains such that each of them can be represented as a Bayesian subnetwork (a sect) and can be tested and refined individually. This makes the representation of a complex domain easier and more precise for knowledge engineers and makes the resultant system more natural and more understandable to system users. The resultant modularity facilitates implementation of large systems in an incremental fashion.

The constraints technically imposed by MSBNs on the partition are found to be d-sepsets and soundness of sectioning. Two important guidelines for sound sectioning are derived. The covering subDAG rule is suitable for partition according to categories at the same abstraction level, while the hypertree structure is suitable for partition of a domain with a hierarchical nature. The latter provides a formalism to represent and reason at different levels of abstraction. MSBNs following the rules allow multiply connected sects, do not require expensive computation for validation of soundness of sectioning, and have an additional computational advantage during attention shift in evidential reasoning.

Each sect in the MSBN is transformed into a junction tree, such that the MSBN is transformed into a junction forest representation where evidential reasoning takes place. The constraints on the transformation are found to be the invertibility of morali-triangulation and separability.

Each sect/junction tree in the MSBN/junction forest stands as a separate computational object.
Since the technique allows transformation of sects into junction trees through local computation at the sect level, and allows reasoning to be conducted with junction trees as units, the space requirement is governed by the size of 1 sect/junction tree. Hence large applications can be built and run on relatively small computers whenever hardware resources are of concern.

For a large application domain, an average case may involve only a portion of the total knowledge encoded in a system, and one portion may be used repeatedly over a period of time. A MSBN and junction forest representation allows the ‘interesting’ or ‘relevant’ sect/junction tree to be loaded while the rest of the junction forest remains inactive and stays in secondary storage. Nevertheless, the judgments made on variables in the active junction tree are consistent with all the knowledge available, including both prior knowledge and new evidence, embedded in the overall junction forest.

When the user’s attention shifts, inactive junction trees can be made active and the previous accumulation of evidence is preserved. This is achieved through the transfer of joint belief on d-sepsets. When there exists a covering sect, the amount of computation needed for this shift of attention is that required by the operation DistributeEvidence in the newly active junction tree times the number of linkages between the 2 junction trees. Therefore, when localization is valid during consultation, the time and space requirements of a MSBN system of $\beta$ sects with a covering sect or with a hypertree structure are a little more than $1/\beta$ of those required by a USBN system, assuming sects of equal size. The larger a MSBN system, the more computational savings can be obtained.

Chapter 5

PAINULIM: AN EXPERT SYSTEM FOR NEUROMUSCULAR DIAGNOSIS

This chapter discusses the implementation of a prototype expert system called PAINULIM.
The system assists EMGers in neuromuscular diagnosis involving the painful or impaired upper limb. Section 5.1 introduces the PAINULIM domain and compares PAINULIM with MUNIN. Section 5.2 discusses how localization is exploited in PAINULIM using the MSBN technique (Chapter 4). Section 5.3 discusses other issues in knowledge representation of PAINULIM. Section 5.4 introduces an expert system shell, WEBWEAVR, which is used in the development of PAINULIM. Section 5.5 provides a query session with PAINULIM to show its capability. Section 5.6 presents an evaluation of PAINULIM.

5.1 PAINULIM Application Domain and Design Criteria

As reviewed in section 1.1, several expert systems in the neuromuscular diagnosis field have appeared since the mid 80’s. Satisfactory results in system testing with constructed cases have been reported, while only one of them (ELECTRODIAGNOSTIC ASSISTANT [Jamieson 90]) reported a clinical evaluation, in which 15 cases yielded a 78% agreement rate with electromyographers (EMGers). Most of these systems are rule-based systems, which are reviewed in Chapter 1. MUNIN [Andreassen et al. 89], as a large system, has attracted much attention and can be used as a yardstick against which to compare PAINULIM.

The MUNIN project started in the mid 80’s in Denmark as part of the European ESPRIT program. MUNIN was to be a ‘full expert system’ for neuromuscular diagnosis. Functionalities to be included would be test-planning, test-guide, test-set-up, signal processing of test results, diagnosis, and treatment recommendation. The intended users of MUNIN would range from novice to experienced practitioners. The knowledge base would ultimately include the full human neuroanatomy. MUNIN adopts Bayesian networks to represent probabilistic knowledge. Substantial contributions to Bayesian network techniques were made (e.g., [Lauritzen and Spiegelhalter 88, Jensen, Lauritzen and Olesen 90]).
The MUNIN system is to be developed in 3 stages. In the first stage a ‘nanohuman’ model with 1 muscle and 3 possible diseases has been developed. In the second stage a ‘microhuman’ system with 6 muscles and corresponding nerves is to be developed. The last stage would correspond to a model of the full human neuroanatomy. The only clinical experience with MUNIN is contained in the phrase: “we expect semi-clinical trials of the ‘microhuman’ to give the first indication of clinical usefulness” [Andreassen et al. 89].

PAINULIM started in 1990 at the University of British Columbia in the Department of Electrical Engineering, with cooperation from the Department of Computer Science and from the Neuromuscular Disease Unit (NDU) of Vancouver General Hospital (VGH). The expertise needed to build PAINULIM’s knowledge base was provided by experienced electromyographers Andrew Eisen and Bhanu Pant. Rather than attempting to cover the full range of diagnosis as does MUNIN, PAINULIM sets the more modest, but realizable, goal of diagnosis for patients suffering from a painful or impaired upper limb due to diseases of the spinal cord and/or peripheral nervous system. About 50% of the patients entering the NDU of VGH can be diagnosed with PAINULIM. The 14 most common diseases considered by PAINULIM include: Amyotrophic Lateral Sclerosis, Parkinson’s disease, Anterior horn cell disease, Root diseases, Intrinsic cord disease, Carpal tunnel syndrome, Plexus lesions, and Median, Ulnar and Radial nerve lesions. PAINULIM uses features from clinical examination, (needle) EMG studies and nerve conduction studies to make diagnostic recommendations.

PAINULIM requires from its users specific knowledge and experience:

• minimum competence in clinical medicine, especially in neuromuscular diseases;
• basic knowledge of nerve conduction study techniques; and
• minimum experience of EMG patterns in common neuromuscular diseases.
PAINULIM will benefit the following users:

• students and residents in neurology, physical medicine and neuromuscular diseases;
• doctors who are practicing EMG and nerve conduction in their offices;
• experienced EMGers, as a formal peer review (self evaluation); and
• hopefully, different labs, by helping them adopt uniform procedures and criteria for diagnosis.

Clinical diagnosis is performed in steps: anatomical, pathological and etiological. PAINULIM currently works at the anatomical level only. Extension to the other levels is considered one of the future research topics.

Given the level of knowledge and experience of the intended users of PAINULIM, the system does not attempt to represent explicitly the neuroanatomy involved in the painful or impaired upper limb. Rather, PAINULIM chooses to represent explicitly the clinically significant disease-feature relations, which are one of the most important parts of the expertise of experienced neurologists. This choice allows rapid development of PAINULIM, which can work directly at the clinical level.

PAINULIM is based on Bayesian belief networks, as is MUNIN. The PAINULIM project has benefited from the research results of MUNIN, especially the junction tree technique (section 4.2.2). PAINULIM is not, however, limited to duplicating the techniques developed in MUNIN; rather, the Multiply Sectioned Bayesian Networks and Junction Forests technique presented in chapter 4 extends the junction tree technique in an effort to achieve more flexible knowledge representation and more efficient inference computation.

5.2 Exploiting Localization in PAINULIM Using MSBN Technique

5.2.1 Resolve Conflict Demands by Exploiting Localization

The PAINULIM project has a large domain. The Bayesian network representation contains 83 variables, including 14 diseases and 69 features, each of which has up to 3 possible outcomes.
The network is multiply connected and has 271 arcs and 6795 probability values. When transformed into a junction tree representation, the system contains 10608 belief values.

During system development, the tight schedule of the medical staff demands (1) knowledge acquisition and system testing within the hospital environment, where most computing equipment consists of personal computers; and (2) short response times in system testing and refinement. Implementation on hospital equipment will also facilitate the adoption of the system when it is completed. On the other hand, the space and time complexity of the PAINULIM system tends to slow down the response and to demand more powerful computing equipment not available in the hospital lab.

These conflicting demands motivated the improvement of the current Bayesian network representation in order to reduce the computational complexity. The ‘localization’ naturally existing in the PAINULIM domain (section 4.1) inspired the development of the MSBN technique (chapter 4). The following briefly summarizes the description of localization in the PAINULIM domain.

Localization means that, during a particular consultation session in a large application domain represented by a Bayesian net, (1) new evidence and queries are directed to a small part of a large network repeatedly within a period of time; and (2) certain parts of the network may not be of interest to users at all.

An EMGer bases his diagnosis on 3 major information sources: clinical examination, (needle) EMG studies and nerve conduction studies. The examination and studies are performed in sequence. Based on the practice in the NDU of VGH, about 60% of patients have only EMG studies, and about 27% of patients have only nerve conduction studies. The number of clinical findings on an average patient is about 5. The numbers of EMG and nerve conduction studies performed on an average patient are about 6 and 4 respectively.
All findings are obtained incrementally.

5.2.2 Using MSBN Technique in PAINULIM

Based on localization, PAINULIM uses the MSBN representation. The domain is partitioned into 3 natural localization preserving subdomains (clinical, EMG and nerve conduction), which are separately represented by 3 sects (CLINICAL, EMG, and NCV) in PAINULIM (Figure 5.28). The (sub)domain of each sect contains the corresponding feature variables and the relevant disease variables.

Using this representation, the 3 sects are created and tested separately. This modularity allows the EMGer to concentrate on one natural subdomain at a time during network construction rather than to manage the overall complexity all at once. This eases the task of knowledge acquisition on the part of both the EMGer and the knowledge engineer. Problems in each sect can thus be isolated and corrected quickly and easily.

The 3 sects are interfaced by common disease variables. The d-sepset between CLINICAL and EMG contains 12 diseases, and the d-sepset between CLINICAL and NCV contains 10 diseases. None of the disease variables has parent variables, and thus the d-sepset constraint (section 4.5) is satisfied. The CLINICAL sect has all the 14 disease variables considered in PAINULIM, and it becomes the covering sect (section 4.6.2). Thus the soundness of sectioning is guaranteed in PAINULIM. This representation is natural because clinical examination is the stage where all the disease candidates are subject to consideration.

After the 3 sects are created, they are transformed into a junction forest (Figure 5.29). The MSBN technique allows the transformation to be conducted by local computation at the level of sects (section 4.7.1). Thus the space requirement for transformation is governed by the size of only 1 sect.
Multiple linkages are created between junction trees such that evidence can be propagated from one to another during evidential reasoning. There are 3 linkages between the CLINICAL tree and the EMG tree (thick lines in Figure 5.29). The sizes of their state spaces are 768, 1536, and 1536 respectively. Without introducing multiple linkages (using the brute force method in section 4.3), the d-sepset would form a clique with state space size 6144, which would greatly increase the computational complexity in each sect. This is because this large clique would have all the other cliques in the same junction tree as neighbors. During evidential reasoning, the belief table of this large clique would be marginalized and updated as many times as the number of neighbor cliques.

With the 3 junction trees linked and the joint system belief initialized, the resultant consistent junction forest becomes the permanent representation of the PAINULIM domain where evidential reasoning takes place. The original MSBN still serves as the user interface, while the computation for inference is performed solely in the junction forest.

Since the junction forest representation preserves localization, the run time computation of PAINULIM can be restricted to only one junction tree. Thus the time and space requirements are governed by the size of one sect, not the size of the forest. If PAINULIM were represented by a USBN, the total state space size of the junction tree would be 10608. This overall junction tree would have to be updated for each batch of evidence. Using the MSBN technique, the total state space sizes of the junction trees are 7308, 6498 and 2178 (in the order CLINICAL, EMG and NCV). The largest size, 7308, will govern the space requirement and will put an upper bound on the time requirement.
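The figures quoted above can be cross-checked with a short calculation. The sketch below is only illustrative: all sizes are copied from the text, but the outcome counts assumed for the 12 d-sepset variables (one ternary variable and eleven binary ones) are an inference from the quoted clique size 6144 and are not stated explicitly in this section.

```python
from math import prod

# Assumption: the CLINICAL-EMG d-sepset has 1 ternary and 11 binary diseases.
dsepset_outcomes = [3] + [2] * 11
brute_force_clique = prod(dsepset_outcomes)

# Sizes quoted in the text.
linkage_sizes = [768, 1536, 1536]
tree_sizes = {"CLINICAL": 7308, "EMG": 6498, "NCV": 2178}
usbn_size = 10608

largest_tree = max(tree_sizes.values())
mean_tree = sum(tree_sizes.values()) / len(tree_sizes)

print(brute_force_clique)                       # 6144
print(brute_force_clique / max(linkage_sizes))  # 4.0
print(largest_tree)                             # 7308
print(mean_tree)                                # 5328.0
print(round(mean_tree / usbn_size, 3))          # 0.502
```

Under these numbers the largest linkage is a quarter of the size of the single brute force clique, and an average junction tree has about half the state space of the single USBN junction tree.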
The mean size, 5328, gives an average time requirement which is about half of that required by a USBN system with size 10608. Due to localization, each tree will have to be computed about 5 times (recall that findings in the 3 subdomains come in 5, 6 and 4 batches respectively on an average patient) before an attention shift. Thus the overall time saving is about half compared to a USBN system.

When a user’s attention shifts from the currently active tree to another tree, the latter is swapped from secondary storage into main memory and all previously acquired evidence is absorbed. The posterior distributions obtained are always based on all the available knowledge and evidence embedded in the overall system. Thus the savings in space and time do not sacrifice accuracy.

The computational savings thus obtained translate immediately into smaller hardware requirements and quicker response times. With the MSBN technique, it has been possible to use hospital equipment (IBM AT compatible computers) to construct, refine and run PAINULIM interactively with EMGers right in the VGH lab. The EMGer’s before-, inter-, and after-patient time can be utilized for knowledge acquisition and system refinement. The inference time for each batch of evidence ranges from 12 sec in NCV to 28 sec in CLINICAL. The attention shift takes from 24 sec to 106 sec, depending on the currently active tree and the destination tree. The efficiency gained in cooperation with EMGers greatly speeded up the development of PAINULIM (less than a year).

5.3 Other Issues in Knowledge Acquisition and Representation

5.3.1 Multiple Diseases

Many probability-based medical expert systems have assumed that diseases are mutually exclusive, for example, PATHFINDER [Heckerman, Horvitz and Nathwani 89] and MUNIN’s nanohuman system [Andreassen et al. 89]. A few did not, for example, QMR [Heckerman 90a].
When this assumption is valid, diseases can be lumped into 1 variable in the Bayesian network, which simplifies the network topology.

PAINULIM considers the 14 most common diseases in patients presenting with a painful or impaired upper limb. Since a patient could suffer from multiple neuromuscular diseases, the assumption of mutually exclusive diseases is not valid in the PAINULIM domain. PAINULIM has therefore represented each disease by a separate node.

Although this representation is acceptable for most of the 14 diseases, there is an exception: Amyotrophic lateral sclerosis (Als) and Anterior horn cell disease (Ahcd). Both are disorders of the motor system. Ahcd involves only the lower motor neuronal system (between the spinal cord and the muscle), but Als additionally involves the upper motor neuron (between the brain and the spinal cord). However, when one speaks of Als it is not considered as an ‘Ahcd plus’ disease, but as an entity by itself. Therefore, conceptually, an EMGer would never diagnose a patient as having both Als and Ahcd. This conceptual exclusion is represented by combining the 2 into 1 variable, Motor Neuron Disease (Mnd). The variable has 3 exclusive and exhaustive outcomes: Als, Ahcd, Neither.

All the disease variables and most of the feature variables are represented as binary. That is, only ‘positive’ or ‘negative’ outcomes are allowed. ‘Positive’ corresponds to a ‘severe’ or ‘moderate’ degree of severity, and ‘negative’ corresponds to ‘mild’ or ‘absent’. This choice is made for simplicity in both knowledge acquisition and inference computation. A more refined representation is planned once the system’s performance is satisfactory at the current grade level.

5.3.2 Acquisition of Probability Distribution

In PAINULIM, a feature variable can have up to 7 parent disease nodes. For example, wk_wst_ex can be caused by Mnd, Rc67, Rc81, Pxutk, Pxltk, Pxpcd, and Radnn.
It requires 384 numbers to fully specify the conditional probability distribution at wk_wst_ex. It would be frustrating if all these numbers had to be elicited from a human EMGer.

The leaky noisy OR gate model [Pearl 88, Henrion 89] is found to be a powerful tool for distribution acquisition in PAINULIM. When a symptom can be caused by n explicitly represented diseases, the model assumes that (1) each disease has a probability of producing the symptom in the absence of all other diseases; (2) the symptom-causing probability of each disease is independent of the presence of other diseases; and (3) due to the possibility of unrepresented diseases, there is a non-zero probability that the symptom will manifest in the absence of all of the diseases represented explicitly.

In discussion with EMGers, it is found that the above assumptions are quite valid in the PAINULIM domain. A symptom will occur in any given disease with a unique frequency. Should there exist more than one disease that could cause the same symptom, the frequency of occurrence of this particular symptom will be heightened.

Using the leaky noisy OR gate model, the above distribution for wk_wst_ex is assessed by eliciting only 8 numbers. Seven of them take the form

p(wk_wst_ex = yes | Radnn = yes and every other disease = no)

and one of them takes the form

p(wk_wst_ex = yes | all 7 parent diseases = no)

5.4 Shell Implementation

Figure 5.30: Top left: drawing a sect with mouse operation; top right: naming a variable and specifying its outcomes; middle left: specifying the conditional probability distribution for a variable; middle right: specifying the sects composing the MSBN of PAINULIM; bottom left: specifying the d-sepset to be entered next; bottom right: specifying a d-sepset.

An expert system shell WEBWEAVR is implemented which incorporates the MSBN technique and the leaky noisy OR gate model.
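The leaky noisy OR gate model of section 5.3.2 can be illustrated with a minimal Python sketch that expands a few elicited numbers into a full conditional table. It assumes one common parameterization (in the style of Henrion), in which each elicited number p(symptom = yes | only disease i present) already includes the leak; it also treats every parent as binary, so the 3-outcome Mnd variable is not modelled. The function name, the choice of three parents and all numeric values are hypothetical and are not taken from PAINULIM or WEBWEAVR.

```python
from itertools import combinations


def leaky_noisy_or(p_single, p_leak):
    """Expand leaky noisy-OR parameters into a full conditional table.

    p_single[d] is the elicited p(symptom = yes | only disease d present);
    p_leak is the elicited p(symptom = yes | no represented disease present).
    Returns a dict mapping each frozenset of present diseases to
    p(symptom = yes | exactly those diseases present).
    """
    names = sorted(p_single)
    table = {}
    for r in range(len(names) + 1):
        for present in combinations(names, r):
            # Probability that neither the leak nor any present disease
            # produces the symptom. Dividing by (1 - p_leak) removes the
            # leak that is already contained in each elicited number.
            p_no = 1.0 - p_leak
            for d in present:
                p_no *= (1.0 - p_single[d]) / (1.0 - p_leak)
            table[frozenset(present)] = 1.0 - p_no
    return table


# Hypothetical numbers for illustration only.
elicited = {"Radnn": 0.8, "Rc67": 0.5, "Pxpcd": 0.4}
cpt = leaky_noisy_or(elicited, p_leak=0.05)

for present, p_yes in sorted(cpt.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(present), round(p_yes, 3))
```

With n binary parents, n + 1 elicited numbers generate all 2^n rows of the table, and adding a second present disease always raises the probability of the symptom, which matches the "heightened frequency" described by the EMGers.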
This shell is in turn used to construct the PAINULIM expert system. The WEBWEAVR shell is written in C and implemented on an IBM PC to suit the computing environment at the NDU, VGH, where PAINULIM is constructed. It can be run on an XT, AT or 386, although an AT or above is recommended.

The shell consists of a graphical editor (EDITOR), a structure transformer (TRANSNET), and a consultation inference engine (DOCTR). EDITOR allows users to construct MSBNs in a visually intuitive manner. TRANSNET transforms constructed MSBNs into junction forests. DOCTR does the evidence entering and evidential reasoning. It can work in 2 different modes: interactive or batch processing. The interactive mode allows the user to enter evidence and obtain posterior distributions in an interactive and incremental way. This mode is used during a consultation session. The batch processing mode allows the user to enter all the evidence for a patient case into a file. DOCTR then makes a diagnosis based on the file. This mode is mainly used for system evaluation, so that a large number of cases can be processed without human supervision.

DOCTR provides 2 modes for screen layout: network and user-friendly. The network mode (Figure 5.28) displays the full sect topology such that parent-child relations can be traced along with the marginal distributions. The user-friendly mode (Figure 5.31) does not display the arcs of the network but labels variables with names closer to medical terminology. The former layout provides richer information while the latter gives a neater screen.

Figure 5.30 illustrates the WEBWEAVR shell with 6 screen dumps. In the upper left screen, a sect is drawn using a mouse. In the upper right screen, the name of a variable and its possible outcomes are entered. In the middle left screen, the conditional probability distribution for a child variable is entered. Each sect can be constructed in this way separately.
In the middle right screen, the composition of the MSBN for PAINULIM is specified. In the bottom left screen, a menu is displayed which allows a user to specify the d-sepset to be entered next. In the bottom right screen, the d-sepset between the CLINICAL and EMG sects is specified.

WEBWEAVR supports the construction of any MSBN which has a covering sect. Porting it to SUN workstations and X-Window is under consideration.

5.5 A Query Session with PAINULIM

In this section, a query session with PAINULIM in the diagnosis of a particular patient is illustrated with snapshots of major steps.

Figure 5.31: Top: CLINICAL sect with prior distributions; bottom: posterior distributions after clinical evidence is entered.

Since clinical examination is always the first stage in the diagnosis of a patient, PAINULIM begins with the CLINICAL subnet. Before any evidence is available, the prior distributions for all the diseases and symptoms can be obtained, which reflect the background knowledge about the patient population in the NDU of VGH. The top screen in Figure 5.31 shows the CLINICAL sect with prior distributions displayed in histograms. The user-friendly display mode is used here for better illustration.

The patient presents with tingling in the hand, weakness of thumb abduction, weakness of wrist extension, and loss of sensation in the back of the hand. After entering the evidence into the CLINICAL sect, the impression is Cts (0.808) and Radial nerve lesion (0.505). The bottom screen in Figure 5.31 highlights the diseases and features with posterior probability greater than 0.1.

After attention is shifted to the NCV sect, PAINULIM suggests (Figure 5.32 top) that the most likely abnormalities on NCV are from the Median sensory study (0.785), the Median to ulnar palmar latency difference (0.777), the Median motor distal latency (0.58) and the Radial motor and sensory studies (0.43 and 0.51 respectively). After values for the Median, Radial and Ulnar nerves are entered, the revised impression (Figure 5.32 bottom) is Cts (0.912) and Radial nerve lesion (0.918).

When attention further shifts to the EMG sect, PAINULIM (Figure 5.33 top) prompts that the most likely EMG abnormalities are in the Edc (0.791), the Triceps (0.789), the Apb (0.827) and the Brachioradialis (0.840) muscles.
The data entered are that the Apb and the Edc are abnormal, while the Fdi is normal. The final impression (Figure 5.33 bottom) reads as Cts (0.992) and Radial lesion (0.922).

Figure 5.32: Top: NCV sect with evidence from CLINICAL and EMG sects absorbed; bottom: final diagnosis after nerve conduction studies are finished.

Figure 5.33: Top: EMG sect with evidence from CLINICAL sect absorbed; bottom: posterior distributions after EMG findings are entered.

The above snapshots illustrate the diagnostic capability of PAINULIM. Another important usage is education. Given a patient with Cts and Radnn, Figure 5.34 displays highly expected (above 80% likelihood) positive features for the patient, which can be used in training. For the patient case in the above diagnosis, out of 6 highly expected CLINICAL features 4 are positive and 2 negative; out of 4 EMG features 2 are positive and 2 unchecked; out of 5 NCV features 1 is positive, 3 unchecked and 1 negative. Thus the majority of checked features match the expectation for the multiple diseases Cts and Radnn, which explains the diagnosis reached above from one perspective.

Figure 5.34: Highly expected feature presentation of a patient with both Cts and Radnn. Top: CLINICAL sect. Middle: EMG sect. Bottom: NCV sect.

5.6 An Evaluation of PAINULIM

This section presents the procedure and results of a preliminary evaluation of PAINULIM.

5.6.1 Case Selection

76 patient cases in the NDU of VGH are selected. They had been diagnosed by EMGers before being used in the evaluation. The selection is conducted such that there is a balanced distribution among the diseases considered by PAINULIM. Table 5.11 lists the numbers of cases involved for each disease. If a case involves multiple diseases, that is, either the patient was diagnosed as suffering from multiple diseases or differentiation among several competing disease hypotheses could not be made at the time of diagnosis, then the count of each disease involved is increased by 1. The total count is 124. Thus the ratio 124/76 serves as an indication of the multiple-disease-ness of the case population. 3 cases of diseases not considered by PAINULIM but presenting with a painful or impaired upper limb are also included in the evaluation to test PAINULIM’s performance at its limitation.
disease  count    disease  count    disease  count    disease  count
Als        5      Inspcd     2      Cts       16      Pd         3
Ahcd       1      Pxutk      6      Mednn      5      Normal    12
Rc56       8      Pxltk      8      Ulrnn     16      Other      3
Rc67      19      Pxpcd      8      Radnn      6
Rc81       6
subtotal  39                24                43                18

Table 5.11: Number of cases involved for each disease considered in PAINULIM

5.6.2 Performance Rating

Unlike the evaluation for QUALICON (chapter 3), in the PAINULIM domain there is no absolutely certain way to know the ‘real’ disease(s) given a patient case. The best one can do is to compare the expert system with the human expert and to take the human judgment as the gold standard. Since no human expert is perfect, an evaluation conducted this way has its limitations. Both the system and the human may make the same error, or the system may be correct while the human makes an error.

In evaluating PATHFINDER, Ng and Abramson [1990] use ‘classic’ cases with known diagnoses. The agreement between the known diagnosis and the disease with top probability is used as an indicator of PATHFINDER’s performance. Heckerman [1990b] asks the expert to compare PATHFINDER’s posterior distributions with his own and to give a rating between 0 and 10.

In the PAINULIM domain, a human diagnosis can be a single disease or multiple diseases, and can have different degrees of severity. Sometimes even the human diagnosis is unsure because the test studies are not well designed or are incomplete. Thus a more sophisticated rating method than the one used by Ng and Abramson [1990] is desired for the evaluation of PAINULIM. A more objective rating than the one used by Heckerman [1990b] is also targeted, such that the rating can be verified by persons other than the evaluator.

For each case, the patient documentation is examined and the values for feature variables are entered into PAINULIM to make a diagnosis. The posterior marginal probabilities of diseases produced by PAINULIM are compared with the EMGer’s original diagnosis.
5 rating scales (EX (excellent), GD (good), FR (fair), PR (poor), and WR (wrong)) are informally defined. The definition is not exhaustive but serves as a guideline for the evaluation. For each disease involved (possibly a disease not considered by PAINULIM, as will be clear below), PAINULIM’s performance is rated by the following rules.

1. For a disease judged by the EMGer as severe or moderate, the rating is EX if its posterior probability (PP) by PAINULIM falls in [0.8, 1] (GD: [0.6, 0.8); FR: [0.4, 0.6); PR: [0.2, 0.4); WR: [0, 0.2)).

2. For a disease judged by the EMGer as mild, the rating is EX if its PP falls in [0, 0.4] (GD: (0.4, 0.6]; FR: (0.6, 0.7]; PR: (0.7, 0.8]; WR: (0.8, 1]).

3. For a disease judged by the EMGer as absent, the rating is EX if its PP falls in [0, 0.2] (GD: (0.2, 0.4]; FR: (0.4, 0.6]; PR: (0.6, 0.8]; WR: (0.8, 1]).

4. For a disease judged by the EMGer as uncertain as to its likelihood because of evidence which cannot be easily explained, the rating is EX if PAINULIM suggests the same disease or other disease(s) whose anatomical site(s) is/are close to that suspected by the EMGer, with PP falling in [0, 0.6] (GD: (0.6, 0.7]; FR: (0.7, 0.8]; PR: (0.8, 0.9]; WR: (0.9, 1]).

5. For a disease not represented by PAINULIM but judged by the EMGer as the single disease diagnosis of the case in question, the rating is EX if the PPs of all represented diseases fall in [0, 0.2] (GD: (0.2, 0.4]; FR: (0.4, 0.6]; PR: (0.6, 0.8]; WR: (0.8, 1]).

6. If PAINULIM provides an alternative diagnosis with PP in [0.70, 1], and the alternative is agreed upon by a second EMGer to whom only the evidence as presented to PAINULIM is available, then the rating is EX. The rating is also EX if PAINULIM provides a diagnosis with PP in [0.3, 0.7) based on evidence ignored by the original EMGer, but which the second EMGer considers significant and agrees with PAINULIM.
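Rules 1 to 3, which depend only on the EMGer's severity judgment and PAINULIM's PP, can be sketched as a small Python function. This is an illustration of the interval boundaries listed above, not the procedure actually used in the evaluation; the function names and the string labels for the judgments are assumptions. Rules 4 to 6 need anatomical knowledge or a second EMGer and are not covered. The helper `case_rating` reflects the thesis's convention that a case is rated by the lowest of its per-disease ratings.

```python
RATINGS = ["EX", "GD", "FR", "PR", "WR"]  # ordered from best to worst

# Upper bounds (inclusive) of the PP intervals for rules 2 and 3.
UPPER_BOUNDS = {
    "mild": [0.4, 0.6, 0.7, 0.8, 1.0],
    "absent": [0.2, 0.4, 0.6, 0.8, 1.0],
}
# Lower bounds (inclusive) of the PP intervals for rule 1.
LOWER_BOUNDS_POSITIVE = [0.8, 0.6, 0.4, 0.2, 0.0]


def rate_disease(judgment, pp):
    """Rate one disease given the EMGer's judgment and PAINULIM's PP."""
    if not 0.0 <= pp <= 1.0:
        raise ValueError("posterior probability must lie in [0, 1]")
    if judgment in ("severe", "moderate"):  # rule 1
        for rating, low in zip(RATINGS, LOWER_BOUNDS_POSITIVE):
            if pp >= low:
                return rating
    if judgment in UPPER_BOUNDS:  # rules 2 and 3
        for rating, high in zip(RATINGS, UPPER_BOUNDS[judgment]):
            if pp <= high:
                return rating
    raise ValueError("judgment not covered by rules 1 to 3: %r" % judgment)


def case_rating(disease_ratings):
    """The rating of a case is the lowest (worst) of its disease ratings."""
    return max(disease_ratings, key=RATINGS.index)


print(rate_disease("severe", 0.912), rate_disease("absent", 0.1))
print(case_rating(["EX", "GD", "EX"]))
```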
Rules 1, 2, and 3 above are used when the EMGer has a confident diagnosis. Rule 4 is used when the EMGer has an unsure diagnosis. Rule 5 is used to rate PAINULIM’s performance at its limitation. Rule 6 is used when the human diagnosis is biased by nonanatomical evidence not available to PAINULIM, or when PAINULIM behaves in a manner superior to the human diagnostician. After the rating for each disease is assigned, the lowest rating is given as the rating of PAINULIM’s performance for the case in question.

5.6.3 Evaluation Results

The evaluation of 76 patient cases has the following outcomes: EX: 65, GD: 6, FR: 2, PR: 1, WR: 2. The excellent rate is 0.86, with a 95% confidence interval of (0.756, 0.925) using the standard statistical technique (Appendix E). The good or excellent rate is 0.93, with a 95% confidence interval of (0.853, 0.978). A closer look at the cases evaluated is worthwhile.

Among the 76 cases, there are 38 cases where the EMGer is confident about the diagnosis. The performance of PAINULIM is EX for 36 of them, and GD for the other 2. Thus for cases where evidence is complete, PAINULIM performs as well as the EMGer.

There are 8 cases where the EMGer is either confident about the diagnosis of a single mild disease, or is unsure about a single mild disease. For these 8 cases, PAINULIM’s posterior probabilities for all disease hypotheses are below 0.2, which says that none of the diseases is severe or moderate. Thus PAINULIM performs equally well as the EMGer in the case of a single mild disease.

For at least 7 cases, PAINULIM indicates a second disease which is not mentioned by the EMGer. Careful examination of the feature presentation shows that the corresponding features are indeed not explained by the EMGer’s diagnosis. PAINULIM’s behavior in these cases is considered superior to that of the EMGer.
For several other cases where the EMGer’s diagnosis is unsure about the possible diseases, PAINULIM indicates several candidates (with probability greater than 0.3) best matching the available evidence, which provide hints to human users for a more complete test study.

For the 2 cases diagnosed as a disease in the spinal cord or peripheral nervous system but not represented in PAINULIM, PAINULIM’s performance is EX for one and GD for the other. The posterior probabilities of all disease variables are below 0.3. This is interpreted as saying that the patient is not in a severe or moderate disease state for any represented disease (not to be interpreted as normal). As stated earlier, PAINULIM represents the 14 most common diseases of the spinal cord and/or peripheral nervous system presenting with a painful or impaired upper limb. The diseases in this category which are not represented in PAINULIM include: Posterior interosseous nerve lesions, Anterior interosseous nerve lesions, Axillary nerve lesions, Musculocutaneous nerve lesions, Suprascapular nerve lesions, and Long thoracic nerve lesions. They are not represented in the current version of PAINULIM because their combined incidence is less than 2%. The evaluation shows that outside its representation domain, PAINULIM does not confuse such a disease with represented diseases whenever the unrepresented disease has its unique feature pattern.

On the other hand, the evaluation reveals several limitations of PAINULIM. One limitation relates to unrepresented diseases. One case in the evaluation is diagnosed by the EMGer as Central Sensory Loss, which is a disorder of the central nervous system. It is outside the PAINULIM domain but is also characterized by a painful or impaired upper limb. When restricted to the PAINULIM domain, the disease presentation is similar to Radnn. PAINULIM could not differentiate and gives probability 0.62 to Radnn (the rating is WR).
PAINULIM works with limited variables. For example, in the CLINICAL sect, the variable ‘lslathnd’, which represents loss of sensation in the lateral hand and/or fingers, does not allow distinguishing between the front of the hand (Median nerve, Cts or Rc67) and the back of the hand (Radial nerve, Plexus Posterior cord, Rc67). When evidence points towards one of them but not the other, instantiation of lslathnd will (incorrectly) support both groups of diseases.

Another important lesson concerns the assessment of numerical data in the PAINULIM representation. PAINULIM represents each disease with 2 states. ‘Positive’ corresponds to a ‘severe’ or ‘moderate’ degree of severity, and ‘negative’ corresponds to ‘mild’ or ‘absent’. This choice is made for the reason explained in section 5.3.1. When assessing conditional probabilities for PAINULIM, the above mapping has to be applied consistently. When a probability

p(wk_wst_ex = yes | Radnn = yes and every other disease = no)

is assessed, the condition ‘Radnn = yes’ means either severe or moderate Radnn. Without emphasizing the mapping, the EMGer could have mild Radnn in mind as well, which enlarges the population considered and lowers the assessed probability.

Another source of inaccuracy in the numerical probabilities comes from limited experience with certain diseases. When a disease is very rare (for example, Inspcd), the EMGer giving the probabilities may have little experience with the disease and thus provide an inaccurate assessment. A few cases rated as FR, PR or WR are due to inappropriate assessment of conditional probabilities. Further fine tuning is planned for the next version of PAINULIM.

5.7 Remarks

The development of PAINULIM has shown that the MSBN and junction forest technique provides a natural representation and an efficient inference formalism.
Using the technique, the computational complexity of PAINULIM is reduced by half with no reduction of accuracy. The development of PAINULIM has thus benefited from efficient cooperation with medical staff, and from rapid system construction and refinement.

The evaluation of PAINULIM’s performance using 76 patient cases shows that the good or excellent rate is 0.93, with a 95% confidence interval of (0.853, 0.978). The case population is not selected sequentially (taking whatever patient cases come in, in sequence), in order to have a balanced distribution among the diseases considered by PAINULIM. If a sequential population is used, even better performance can be expected. This is because normal or mild cases, or Cts cases, will occupy an even larger percentage than in the population used for the evaluation, and PAINULIM has performed excellently for these cases.

The deficiencies of the current PAINULIM are recognized as (1) insufficiently elaborated feature variables; (2) room for further fine tuning of numerical parameters; and (3) limitations with central nervous system disorders presenting with a painful or impaired upper limb. The improvement calls for further refinement and extension of PAINULIM’s representation.

An important application of PAINULIM would be assistance in test design. A correct and efficient neuromuscular diagnosis depends largely on good test design, which allows the gathering of a minimal but sufficient amount of diagnostic information. However, when faced with difficult cases (initial evidence points in different directions) and many test alternatives, an EMGer may not be able to come up with an optimal test design. This has been true in several cases used in the evaluation. Without good test design, and constrained by time and the patient, an EMGer would have to make a diagnosis (after the test) with limited and incomplete information.
There has been evidence that human thinking is often biased in making judgments under uncertainty [Tversky and Kahneman 74]. Thus an expert system like PAINULIM, capable of reasoning rationally under uncertainty in complex conditions (a multiply connected network and evidence pointing in different directions with different degrees of support for various hypotheses), would be a very helpful peer in test design. The benefit would be a better test design and an improved diagnostic quality.

As demonstrated in section 5.5, given the currently available evidence, PAINULIM can prompt the EMGer with the most likely disease(s) and the most likely features. The confirmation of these features would lend further support to the disease(s), and negative test results for these features would tend to exclude the disease(s) from further investigation. Patil et al. [1982] discuss several strategies in test design. This functionality has not been fully explored and made sufficiently explicit in the current version of PAINULIM. It is a future research topic. Since the likelihood of features depends on the current likelihood of diseases, the performance of PAINULIM in diagnosis suggests promising potential for its extension to test design.

Explanation capability is important in communication between the system and its user, both in diagnosis and in test design, and can add great usefulness to PAINULIM. Section 5.5 demonstrates certain explanation functions available in the current version of PAINULIM. A sophisticated explanation facility would give reasons why a particular disease is likely, and it would give further reasons why this disease is more likely than another one. Such a facility would be a future research topic.
Chapter 6

A LEARNING ALGORITHM FOR SEQUENTIALLY UPDATING PROBABILITIES IN BAYESIAN NETWORKS

As described in section 1.3.1, in building medical expert systems, the probabilities necessary for the construction of Bayesian networks usually have to be elicited from medical experts. The values thus obtained are liable to be inaccurate. This has also been the experience with PAINULIM (Chapter 5). The sensitivity analysis of the medical expert system PATHFINDER shows that probabilities play an important role in PATHFINDER’s performance and that minor changes in all of its probabilities have a substantial impact on performance [Ng and Abramson 90]. Therefore, methodologies which allow the representation of the uncertainty of probabilities and the improvement of their accuracy are needed.

This chapter presents an Algorithm for Learning by Posterior Probabilities (ALPP) for sequential updating of probabilities in Bayesian networks. The results are mainly taken from Xiang, Beddoes and Poole [1990b].

Section 6.1 reviews 3 representations of the uncertainty of probabilities (interval, auxiliary variable, and ratio) and the corresponding methods for updating probabilities. Section 6.2 presents the idea of learning from the expert’s posterior probabilities in order to overcome the limitation of the existing updating method for the ratio representation. Section 6.3 presents ALPP. Section 6.4 proves the convergence of ALPP. Section 6.5 presents the performance of ALPP in simulation.

6.1 Background

Several formalisms for the representation of the uncertainty of probabilities in Bayesian networks have been proposed. Some of them have corresponding methodologies incorporated to improve the accuracy of probabilities.

One formalism represents probabilities by intervals [Fertig and Breese 90]. The size of an interval signifies the degree of uncertainty of the probability it represents.
No known method directly uses this representation for improvement of the accuracy of probabilities.

Another formalism represents the uncertainty of probabilities by probabilities of probabilities [Spiegelhalter 86, Neapolitan 90, Spiegelhalter and Lauritzen 90]. Auxiliary variables (nodes) are added to Bayesian networks. Their outcomes are the probabilities of variables in the original Bayesian net. The distributions of these auxiliary variables, therefore, are the probabilities of probabilities. With this representation, the updating of probabilities can be performed within the probabilistic inference process.

Yet another formalism represents probabilities by ratios of imaginary sample sizes [Cheeseman 88a], [Spiegelhalter, Franklin and Bull 89]. A probability p(symptomA | diseaseB) is represented as x/y, where y is an imaginary patient population with disease B and x of these patients show symptom A. The less certain the probability p(symptomA | diseaseB) is, the smaller the integer y.

Spiegelhalter, Franklin and Bull [1989] present a procedure for sequentially updating probabilities represented as ratios. The procedure is presented in a form directly applicable to Bayesian networks of diameter 1 with a single parent node (possibly multiple children). Here, a single child case (p(symptomA | diseaseB)) is used to illustrate the idea. When a new patient with disease B is observed, the sample size y is increased by 1, and the sample size x is increased by 1 or 0 depending on whether the patient shows symptom A. As more and more patient cases are processed, the probability approaches the correct value. The improvement of this method is the focus of this chapter.
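The single-child update just described can be written out as a short Python sketch. It covers only the ratio update for one p(symptomA | diseaseB) under the assumption that disease B is known with certainty for every new patient; the class name, method names and starting numbers are hypothetical and chosen purely for illustration.

```python
class RatioProbability:
    """p(symptomA | diseaseB) stored as a ratio x / y of imaginary sample sizes."""

    def __init__(self, x, y):
        if y <= 0 or not 0 <= x <= y:
            raise ValueError("need y > 0 and 0 <= x <= y")
        self.x = x  # imaginary patients with disease B who show symptom A
        self.y = y  # imaginary patients with disease B

    def value(self):
        return self.x / self.y

    def observe(self, shows_symptom):
        """Process one new patient who is known to have disease B."""
        self.y += 1
        if shows_symptom:
            self.x += 1


# Illustrative numbers only: an elicited 0.75 held with little certainty (y = 4).
p = RatioProbability(x=3, y=4)
for shows_symptom in (True, False, True):
    p.observe(shows_symptom)
print(p.x, p.y, p.value())
```

After the three observations the ratio is 5/7. A larger initial y (for example 30/40 instead of 3/4) encodes the same elicited value but moves much less under the same three cases, which is how the representation expresses the expert's certainty.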
The limitation of the updating method for the ratio representation is the underlying assumption that, when updating p(symptomA | diseaseB), whether disease B is true or false is known with certainty (thus the method will be referred to as {0, 1} distribution learning). The assumption is not realistic in many applications. A doctor would not always be 100% confident about a diagnosis he or she made of a patient. This argument is expanded in the section below.

6.2 Learning from Posterior Distributions

The spirit of {0, 1} distribution learning is to improve the precision of probabilities elicited from the human expert by learning from available data. What else does one really have in medical practice in addition to patients’ symptoms? It may be possible, in some medical domains, that diagnoses can be confirmed with certainty. But this is not commonplace. A successful treatment is not always an indication of a correct diagnosis. A disease can be cured by a patient’s internal immunity or by a drug with a wide disease spectrum. One subtlety of medical diagnosis comes from the unconfirmability of each individual patient case.

For most medical domains, the available data besides patients’ symptoms are the physician’s subjective posterior probabilities (PPs) of disease hypotheses given the overall pattern of the patient’s symptoms. They are not distributions with values from {0, 1}, but rather distributions with values from [0, 1]. (Note that {0, 1} denotes a set containing only the elements 0 and 1, and [0, 1] is the domain of real numbers between 0 and 1 inclusive.) The diagnoses appearing in patients’ files are typically not diagnoses that have been concluded definitely; they are only the top ranking diseases with the physician’s subjective PP omitted. The assumption of a {0, 1} posterior disease distribution may, naively, be interpreted as an approximation to the [0, 1] distribution with 1 substituting for the top ranking PP, and 0 substituting for the rest. This approximation loses useful information. Thus a way of learning directly from the [0, 1] posterior distribution seems more natural and can be expected to perform better.

In dealing with the learning problem in a Bayesian network setting, three ‘agents’ are concerned: the real world (D_r, P_r), the human expert (D_e, P_e), and the artificial system (D_s, P_s). It is assumed that all 3 can be modeled by Bayesian networks. As the building of an expert system involves specifying both the topology of the DAG D and the probability distribution P, the improvement can also be separated into the two aspects. For the purpose of this chapter, D_r, D_e, and D_s are assumed identical, leaving only the accuracy of the quantitative assignment of P_s to be improved.

As reviewed in section 1.3.3, a medical expert system based on Bayesian nets usually directs its arcs from disease nodes to symptom nodes, thus encoding quantitative knowledge by priors of diseases and conditional probabilities (CPs) of symptoms given diseases. This results in ease of DAG construction, simplicity of net topology, and portability of the system.

Given that a Bayesian net is constructed with such directionality, and the desire to improve the accuracy of CPs by PPs, a question which arises is whether PPs are any better in quality than CPs also supplied by the human expert. To avoid possible confusion, it is emphasized that the CP here is the conditional probability of a symptom variable given its disease parent variables, and the PP is the joint probability of a set of (relevant) disease variables given the outcomes of a set of symptom variables. The former is stored explicitly in Bayesian networks as a parameter, while the latter would not be an explicit parameter even if the arcs in the network were directed from symptoms to diseases.
In my cooperation with medical staff, it was found that the causal network is a natural model with which to view the domain; however, the task of estimating CPs is more artificial than natural to them. Forming posterior judgments is their daily practice. An expert is an expert in that he/she is skilled at making diagnoses (posterior judgments), not necessarily skilled at estimating CPs. It is the expert’s posterior judgment that is the behavior one wants the expert system to simulate.

An excellent argument supporting the idea of using PPs to improve CPs has been published by Neapolitan [1990]:

“For example, sometimes it is easier for a person to ascertain the probability of a cause given an effect than that of an effect given a cause. Consider the following situation: If a physician had worked at the same clinic for a number of years, he would have seen a large population of similar people with certain symptoms. Since his job is to reason from the symptoms to determine the likelihood of diseases, through the years he may have become adept at judging the probabilities of diseases given symptoms for the population of people who attend this clinic. On the other hand, a physician does not have the task of looking at a person with a known disease and judging whether a symptom is present. Hence it is not likely that the physician would have acquired the ability from his experience to judge the probabilities of symptoms given diseases. He does ordinarily learn something about this probability in medical school. However, if a particular disease were rare in the population, whereas a particular symptom of the disease were common, and the physician had not studied the disease in medical school, he would certainly not be able to determine the probability of the symptom given the disease. On the other hand, he could readily determine the probability of the disease given the symptom for this particular population. We see then that in this case it is easier to ascertain the probabilities of causes given effects. Notice that these conditional probabilities are only valid for the population of people who attend that particular clinic, and thus we would not want to incorporate them into an expert system. We could, however, use the definition of conditional probability to determine the probabilities of symptoms given diseases from these conditional probabilities obtained from the physician. These latter probabilities could then be used in an expert system which would be applicable at any clinic.”

The task imposed on the expert by the method presented here is actually more natural than that argued by Neapolitan. Experts will be asked for PPs given a set of symptoms (not just one, as will be seen in a moment), and thus the task will be close to their daily practice.

Furthermore, Kuipers and Kassirer [1984] has been cited by Shachter and Heckerman [1987] to show that “experts are more comfortable when their beliefs are elicited in the causal direction”; Tversky and Kahneman [1980] has been cited by Pearl [1988] to show that “people often prefer to encode experiential knowledge in causal schemata, and as a consequence, rules expressed in causal forms are assessed more reliably”. Neither Shachter and Heckerman nor Pearl made an explicit distinction between ‘qualitative’ and ‘quantitative’ knowledge. Careful examination of the studies by Kuipers and Kassirer and by Tversky and Kahneman reveals the following two points: (1) both experts and ordinary people prefer to encode their qualitative knowledge in terms of causal schemata; (2) people often give higher probabilities in the causal direction than those dictated by probability theory. These studies support the reliability of causal elicitation of qualitative knowledge. They do not support the reliability of causal elicitation of quantitative (probabilistic) knowledge.
The reliability issue is left open. Finally, Spiegelhalter et al. [1989] have reported that the assessment of CPs by human experts is generally reliable, but has a tendency to be too extreme (too close to 0 or 1).

[Figure 6.35: An example of D(1)]

If one could assume that the human expert carries a mental Bayesian network and that PPs are produced by this network, it is postulated that the CPs the expert articulates, which constitute $P_s$ of the system, could be a distorted version of those in $P_e$. Also, $P_e$ may differ from $P_r$ in general. Thus, four categories of probability distributions are distinguished: $P_r$, $P_e$, $P_s$, and the PPs produced by $P_e$ (written as $p_e$). Access to only $P_s$ and $p_e(\text{hypotheses} \mid \text{evidence})$ is assumed. The goal is to use the latter to improve $P_s$ such that the system’s behavior will approach that of the expert.

How can PPs be utilized in the updating? The basic idea is: instead of updating imaginary sample sizes by 1 or 0, increase them by the measure of certainty of the corresponding diseases. The expert’s PP is just such a measure. A formal treatment is given below.

6.3 The Algorithm for Learning by Posterior Probability (ALPP)

The following notation is used:

$D(1)$: DAGs of diameter 1 (the diameter is the length of the longest directed path in the DAG; an example of $D(1)$ is given in Figure 6.35);
$(D(1), P)$: a Bayesian net with diameter 1 and underlying distribution $P$;
$B_i \in \mathcal{B}_i = \{b_{i1}, \ldots, b_{in_i}\}$: the $i$th parent variable in $D(1)$, with sample space $\mathcal{B}_i$;
$A_j \in \mathcal{A}_j = \{a_{j1}, \ldots, a_{jm_j}\}$: the $j$th child variable in $D(1)$, with sample space $\mathcal{A}_j$;
$\mathbf{A} \in \Psi(\mathbf{A})$: the set of all child variables in $D(1)$, with space $\Psi(\mathbf{A})$;
$\mathbf{a}_{jl}$: an element of $\Psi(\mathbf{A})$ with $A_j$ instantiated as $a_{jl}$;
$y^c_{k_1 k_2 \ldots}$: the imaginary sample size for the joint event $b_{1k_1} b_{2k_2} \ldots$ being true;
$x^c_{l_j k_1 k_2 \ldots}$: the imaginary sample size for the joint event $a_{jl_j} b_{1k_1} b_{2k_2} \ldots$ being true;
$\delta^c_{l_j}$: the impulse function, which equals 1 if for the $c$th fresh case $A_j$’s outcome is $a_{jl_j}$, and equals 0 otherwise (superscripts denote the order of fresh cases);
$p_r()$, $p_e()$, $p_s()$: probabilities contained in or generated by $(D_r(1), P_r)$, $(D_e(1), P_e)$ and $(D_s(1), P_s)$ respectively.

A Bayesian net $(D(1), P)$ (footnote 2) is considered where the underlying distribution is composed via

\[
p(B_1 \ldots B_N A_1 \ldots A_M) = \prod_{i=1}^{N} p(B_i) \prod_{j=1}^{M} p(A_j \mid \pi_j)
\]

where $\pi_j$ is the set of parents of $A_j$.

Footnote 2: Whether it is a subnet or a net by itself is irrelevant.

Each of the CPs is internally represented in the system as a ratio of two imaginary sample sizes. For child node $A_1$ having parent nodes $B_1, \ldots, B_Q$ ($Q \geq 1$), a corresponding CP is

\[
p^c(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q}) = \frac{x^c_{l_1 k_1 \ldots k_Q}}{y^c_{k_1 \ldots k_Q}}
\]

where the superscript $c$ signifies the $c$th updating. Only the real numbers $x^c_{l_1 k_1 \ldots k_Q}$ and $y^c_{k_1 \ldots k_Q}$ are stored. The prior probabilities for $A_1$’s parents can be derived as

\[
p^c(b_{1k_1} \ldots b_{Qk_Q}) = \frac{y^c_{k_1 \ldots k_Q}}{\sum_{k_1, \ldots, k_Q} y^c_{k_1 \ldots k_Q}}.
\]

For a $(D(1), P)$ with $M$ children and with all variables binary, the number of values to be stored in this way is upper bounded by $2 \sum_{i=1}^{M} 2^{d_i}$, where $d_i$ is the in-degree of child node $i$. Storage savings can be achieved when different child nodes share a common set of parents.

Updating $P$ is done one child node at a time by updating the $x$s and $y$s associated with the node. Once the $x$s and $y$s are updated, the updated CPs and priors can be derived as illustrated above. The order in which child nodes are selected for updating is irrelevant.

Without loss of generality, the updating is described with respect to the above-mentioned child node $A_1$. For the $c$th fresh case, where $\mathbf{a}^c$ is the set of symptoms observed, the expert provides the PP distribution $p_e(b_{1k_1} \ldots b_{Nk_N} \mid \mathbf{a}^c)$. This is transformed into

\[
p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^c) = \sum_{b_{Q+1}, \ldots, b_N} p_e(b_{1k_1} \ldots b_{Nk_N} \mid \mathbf{a}^c).
\]

The sample sizes are updated by

\begin{align*}
x^c_{l_1 k_1 \ldots k_Q} &= x^{c-1}_{l_1 k_1 \ldots k_Q} + \delta^c_{l_1} \, p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^c), \\
y^c_{k_1 \ldots k_Q} &= y^{c-1}_{k_1 \ldots k_Q} + p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^c).
\end{align*}

6.4 Convergence of the Algorithm

An expert is called perfect if $(D_e(1), P_e)$ is identical to $(D_r(1), P_r)$.

Theorem 13 Let a Bayesian network $(D_s(1), P_s)$ be supported by a perfect expert equipped with $(D_e(1), P_e)$. No matter what initial state $P_s$ is in, it will converge to $P_e$ by ALPP.

Proof: Without loss of generality, consider the updating with respect to $A_1$ in section 6.3.

(1) Priors. Let $\{\mathbf{a}(1), \mathbf{a}(2), \ldots\}$ be the set of all possible conjuncts of evidence. Let $u(t)$ be the number of times at which event $\mathbf{a}(t)$ is true in $c$ cases, so that $\sum_t u(t) = c$. From the prior updating formula of ALPP,

\begin{align*}
\lim_{c \to \infty} p_s^c(b_{1k_1} \ldots b_{Qk_Q})
&= \lim_{c \to \infty} \frac{y^0_{k_1 \ldots k_Q} + \sum_{\omega=1}^{c} p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)}{c + \sum_{k_1, \ldots, k_Q} y^0_{k_1 \ldots k_Q}} \\
&= \lim_{c \to \infty} \frac{1}{c} \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, u(t) \\
&= \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, p_r(\mathbf{a}(t)) \\
&= \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, p_e(\mathbf{a}(t)) \quad \text{(perfect expert)} \\
&= p_e(b_{1k_1} \ldots b_{Qk_Q}),
\end{align*}

and hence, for each parent $B_i$,

\[
\lim_{c \to \infty} p_s^c(b_{ik_i}) = \sum_{k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_Q} p_e(b_{1k_1} \ldots b_{Qk_Q}) = p_e(b_{ik_i}).
\]

(2) CPs. Let $u_{1l_1}(t)$ be the number of times at which event $\mathbf{a}_{1l_1}(t)$ is true in $c$ cases. Following ALPP, one has

\begin{align*}
\lim_{c \to \infty} p_s^c(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q})
&= \lim_{c \to \infty} \frac{x^0_{l_1 k_1 \ldots k_Q} + \sum_{\omega=1}^{c} \delta^\omega_{l_1} \, p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)}{y^0_{k_1 \ldots k_Q} + \sum_{\omega=1}^{c} p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)} \\
&= \lim_{c \to \infty} \frac{\sum_{\omega=1}^{c} \delta^\omega_{l_1} \, p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)}{\sum_{\omega=1}^{c} p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}^\omega)} \\
&= \lim_{c \to \infty} \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, u_{1l_1}(t)}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, u(z)} \\
&= \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, p_r(\mathbf{a}_{1l_1}(t))}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, p_r(\mathbf{a}(z))} \\
&= \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, p_e(\mathbf{a}_{1l_1}(t))}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, p_e(\mathbf{a}(z))} \quad \text{(perfect expert)} \\
&= \frac{p_e(b_{1k_1} \ldots b_{Qk_Q} \, a_{1l_1})}{p_e(b_{1k_1} \ldots b_{Qk_Q})}
 = p_e(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q}). \qquad \Box
\end{align*}

A perfect expert is never available. One needs to know the behavior of ALPP when supported by an imperfect expert. This leads to the following theorem.

Theorem 14 Let $p$ be any resultant probability in $(D_s(1), P_s)$ after $c$ updatings by ALPP. Then $p$ converges to a continuous function of $P_e$ (footnote 3).

Footnote 3: By ‘a function of $P_e$’ is meant a function that takes the probability variables in $P_e$ as its independent variables, which in turn have [0, 1] as their domain.

Proof:

(1) Continuity of priors. Following the proof of Theorem 13, the prior $p_s^c(b_{1k_1} \ldots b_{Qk_Q})$ converges to

\[
f = \lim_{c \to \infty} \frac{1}{c} \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, u(t)
  = \sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t)) \, p_r(\mathbf{a}(t))
\]

where $p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(t))$ is an elementary function of $P_e$, and so is $f$. Therefore, $p$ converges to a continuous function of $P_e$.

(2) Continuity of CPs. From Theorem 13, $p_s^c(a_{1l_1} \mid b_{1k_1} \ldots b_{Qk_Q})$ converges to

\[
f = \lim_{c \to \infty} \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, u_{1l_1}(t)}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, u(z)}
  = \frac{\sum_t p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}_{1l_1}(t)) \, p_r(\mathbf{a}_{1l_1}(t))}{\sum_z p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z)) \, p_r(\mathbf{a}(z))}
\]

where $p_e(b_{1k_1} \ldots b_{Qk_Q} \mid \mathbf{a}(z))$ is an elementary function of $P_e$, and so is $f$.

Theorem 14, together with Theorem 13, says that when the discrepancy between $P_e$ and $P_r$ is small, the discrepancy between $P_s$ and $P_r$ (and $P_e$ as well) will be small after enough learning trials. The specific form of the discrepancy is left open.

In many applications the absolute values of PPs are not really important, but the posterior ordering of diseases is. A set of PPs defines such a posterior ordering. A 100% behavior match between $(D, P_1)$ and $(D, P_2)$ is claimed if, for any possible set of symptoms, the two give the same ordering. The minimum difference between successive PPs of $(D, P_1)$ defines a threshold. Unless the maximum difference between corresponding PPs from the two $(D, P)$s exceeds the threshold, a 100% behavior match is guaranteed. Thus, as long as the discrepancy between $P_e$ and $P_r$ is within some $(D_r(1), P_r)$-dependent threshold, a 100% match between the behavior of $P_s$ and that of $P_e$ is anticipated.

[Figure 6.36: Fire alarm example for simulation. Nodes: tampering (T), fire (F), heat alarm (H), smoke alarm (S), report (R).]

6.5 Simulation Results

Several simulations have been run using the example in Figure 6.36. It is a revised version of the smoke-alarm example in Poole and Neufeld [1988]. Here heat alarm, smoke alarm and report are used as evidence to estimate the likelihood of the joint event of tampering and fire. Each variable, denoted by an uppercase letter, takes binary values. For example, $F$ has value $f$ or $\bar{f}$, which signify the event fire being true or false.
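The behavior-match criterion introduced at the end of section 6.4, which serves as a performance measure in the simulations that follow, can be sketched in Python. This is a minimal illustration rather than the thesis implementation: all function names and the example posteriors are hypothetical, only a single evidence set is compared, and taking half of the minimum successive gap as the guaranteeing threshold is an assumption made here because it is a sufficient (not necessary) condition for identical ordering.

```python
def ordering(pp):
    """Hypotheses ranked from most to least probable (ties broken by label)."""
    return sorted(pp, key=lambda h: (-pp[h], h))


def behavior_match(pp_1, pp_2):
    """True if both posteriors rank the joint hypotheses identically."""
    return ordering(pp_1) == ordering(pp_2)


def min_successive_gap(pp):
    """Smallest difference between successive PPs in the ranked list."""
    ranked = sorted(pp.values(), reverse=True)
    return min(hi - lo for hi, lo in zip(ranked, ranked[1:]))


def max_difference(pp_1, pp_2):
    """Largest difference between corresponding PPs of the two models."""
    return max(abs(pp_1[h] - pp_2[h]) for h in pp_1)


def match_guaranteed(pp_1, pp_2):
    """Sufficient (not necessary) test; the factor 1/2 is an assumption."""
    return max_difference(pp_1, pp_2) < min_successive_gap(pp_1) / 2


# Hypothetical posteriors over the joint hypotheses of fire and tampering
# for one evidence set ("~" marks a false event).
pp_expert = {"f,t": 0.10, "f,~t": 0.55, "~f,t": 0.30, "~f,~t": 0.05}
pp_system = {"f,t": 0.12, "f,~t": 0.50, "~f,t": 0.31, "~f,~t": 0.07}
pp_close = {"f,t": 0.11, "f,~t": 0.54, "~f,t": 0.30, "~f,~t": 0.05}
pp_other = {"f,t": 0.35, "f,~t": 0.40, "~f,t": 0.20, "~f,~t": 0.05}

if __name__ == "__main__":
    for name, pp in [("system", pp_system), ("close", pp_close), ("other", pp_other)]:
        print(name, behavior_match(pp_expert, pp), match_guaranteed(pp_expert, pp))
```

A behavior matching rate over a batch of trials would then be the fraction of trials for which `behavior_match` holds on that trial's evidence.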
The simulation set-up is illustrated in Figure 6.37. Logical sampling [Henrion 88] is used in the real world model $(D_r(1), P_r)$ to generate scenarios $\{T_r, F_r, H_r, S_r, R_r\}$. The observed evidence $\{H_r, S_r, R_r\}$ is fed into $(D_e(1), P_e)$. The posterior distribution $p_e(TF \mid H_r S_r R_r)$ is computed by the expert model and is forwarded to update the system model $(D_s(1), P_s)$.

[Figure 6.37: Simulation set-up]

To compare the performance of ALPP and {0, 1} distribution learning, a control model $(D_c(1), P_c)$ is constructed in the set-up. It has the same DAG structure and initial probability distribution as $(D_s(1), P_s)$ but is updated by {0, 1} distribution learning (footnote 4). Two different sets of diagnoses are utilized by $(D_c(1), P_c)$ in different simulation runs for the purpose of comparison. In simulations 1, 2 and 3, described below, the top diagnosis $\{T_e, F_e\}$ made by $(D_e(1), P_e)$ is used. In simulation 4, the scenario $\{T_r, F_r\}$ is used. The former simulates the situation where posterior judgments cannot be fully justified. The latter simulates the case where such justification is indeed available.

For all the simulations $P_r$ is the following distribution.

\[
\begin{array}{ll}
p(h \mid f\,t) = 0.50 & p(s \mid f\,t) = 0.60 \\
p(h \mid f\,\bar{t}) = 0.90 & p(s \mid f\,\bar{t}) = 0.92 \\
p(h \mid \bar{f}\,t) = 0.85 & p(s \mid \bar{f}\,t) = 0.75 \\
p(h \mid \bar{f}\,\bar{t}) = 0.11 & p(s \mid \bar{f}\,\bar{t}) = 0.09 \\
p(r \mid f) = 0.70 & p(f) = 0.25 \\
p(r \mid \bar{f}) = 0.06 & p(t) = 0.20
\end{array}
\]

$P_s$ and $P_c$ are distributions with a maximal error of 0.3 relative to $P_r$. The initial imaginary sample size for each joint event $FT$ is set to 1. This setting is chosen mainly to demonstrate the convergence of ALPP under poor initial conditions. In a practical application the distribution error should generally be smaller and the initial sample sizes much larger, so that convergence will be a slowly evolving process.

Footnote 4: The original form of {0, 1} distribution learning is extended to the form applicable to $D(1)$.
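The set-up described above can be sketched in a few lines of Python for the perfect-expert case. This is an illustrative sketch, not the code used for the reported simulations: the function names, the particular way the initial distribution is perturbed by 0.3 (`perturb`), the seed and the number of trials are all assumptions, the control model with {0, 1} distribution learning is omitted, and simple forward sampling stands in for logical sampling.

```python
import itertools
import random

# Real-world distribution P_r as listed above (True means the event holds).
P_F, P_T = 0.25, 0.20
P_H = {(True, True): 0.50, (True, False): 0.90,
       (False, True): 0.85, (False, False): 0.11}   # p(h | F, T)
P_S = {(True, True): 0.60, (True, False): 0.92,
       (False, True): 0.75, (False, False): 0.09}   # p(s | F, T)
P_R = {True: 0.70, False: 0.06}                      # p(r | F)
HYPOTHESES = list(itertools.product([True, False], repeat=2))  # joint (F, T)


def bern(p, value):
    """Probability that a binary variable with p(true) = p takes `value`."""
    return p if value else 1.0 - p


def expert_posterior(h, s, r):
    """PP p_e(F, T | h, s, r) of a perfect expert (P_e = P_r), by enumeration."""
    joint = {(f, t): bern(P_F, f) * bern(P_T, t) * bern(P_H[f, t], h)
                     * bern(P_S[f, t], s) * bern(P_R[f], r)
             for f, t in HYPOTHESES}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}


def perturb(p):
    """Hypothetical initial error: move p by 0.3 while staying inside [0, 1]."""
    return p - 0.3 if p > 0.5 else p + 0.3


def max_error(x_h, x_s, x_r, y):
    """Largest absolute gap between system (P_s) and expert (P_e) parameters."""
    total = sum(y.values())
    errs = [abs(x_h[k] / y[k] - P_H[k]) for k in HYPOTHESES]
    errs += [abs(x_s[k] / y[k] - P_S[k]) for k in HYPOTHESES]
    for f in (True, False):
        y_f = y[f, True] + y[f, False]            # sample size for F alone
        errs.append(abs(x_r[f] / y_f - P_R[f]))
    errs.append(abs((y[True, True] + y[True, False]) / total - P_F))
    errs.append(abs((y[True, True] + y[False, True]) / total - P_T))
    return max(errs)


def alpp_run(n_trials, seed=0):
    """Run ALPP for n_trials fresh cases; return (initial, final) max error."""
    rng = random.Random(seed)
    y = {k: 1.0 for k in HYPOTHESES}              # imaginary sample size 1 per (F, T)
    x_h = {k: perturb(P_H[k]) * y[k] for k in HYPOTHESES}
    x_s = {k: perturb(P_S[k]) * y[k] for k in HYPOTHESES}
    x_r = {f: perturb(P_R[f]) * 2.0 for f in (True, False)}
    initial = max_error(x_h, x_s, x_r, y)
    for _ in range(n_trials):
        # Forward-sample one scenario from the real-world model.
        f = rng.random() < P_F
        t = rng.random() < P_T
        h = rng.random() < P_H[f, t]
        s = rng.random() < P_S[f, t]
        r = rng.random() < P_R[f]
        pp = expert_posterior(h, s, r)            # the expert sees only the evidence
        for k in HYPOTHESES:
            y[k] += pp[k]                         # y update: add the PP
            x_h[k] += pp[k] if h else 0.0         # x update: add the PP if outcome seen
            x_s[k] += pp[k] if s else 0.0
        if r:
            for fv in (True, False):              # PP marginalized over T for child R
                x_r[fv] += pp[fv, True] + pp[fv, False]
    return initial, max_error(x_h, x_s, x_r, y)


if __name__ == "__main__":
    before, after = alpp_run(20000)
    print(f"max error S-E: {before:.3f} -> {after:.3f}")
```

Because the heat alarm and the smoke alarm share the parent set {F, T}, the sketch stores a single `y` table for both, in line with the storage saving noted in section 6.3; the sample sizes for the single parent of the report node are obtained by summing `y` over T.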
             (D_e(1), P_e)   (D_s(1), P_s)               (D_c(1), P_c)
trial No.    diag. rate      behv. mat. rate  max. err.  behv. mat. rate  max. err.
                                              S-E                         C-E
0            -               -                0.30       -                0.30
1-25         68%             60%              0.14       48%              0.21
26-50        76%             96%              0.10       12%              0.25
51-100       80%             100%             0.06       36%              0.27
101-200      76%             100%             0.03       33%              0.28

Table 6.12: Summary for simulation 1. $|P_e - P_r| = 0$.

Simulation 1 is run with $P_e$ being the same as $P_r$, which assumes a perfect expert. The results are shown in Table 6.12. The diagnostic rate of $(D_e(1), P_e)$ is defined as $A/N$, where $N$ is the base number of trials and $A$ is the number of trials in which the top diagnosis agrees with $\{T_r, F_r\}$ simulated by $(D_r(1), P_r)$. The behavior matching rate of $(D_s(1), P_s)$ relative to $(D_e(1), P_e)$ is defined as $B/N$, where $B$ is the number of trials in which the diagnoses of $(D_s(1), P_s)$ have the same ordering as those of $(D_e(1), P_e)$. The behavior matching rate of $(D_c(1), P_c)$ relative to $(D_e(1), P_e)$ is similarly defined.

The results show convergence of the probability values in $P_s$ to those in $P_e$ (maximum error (S-E) $\to 0$). The behavior matching rate of $(D_s(1), P_s)$ increases along with the convergence of probabilities, and finally $(D_s(1), P_s)$ behaves exactly the same as $(D_e(1), P_e)$.

An interesting phenomenon is that, despite $P_e = P_r$, the diagnostic rate of $(D_e(1), P_e)$ is only 76% over the total 200 trials. Though the rate depends on the particular $(D, P)$, it is expected to be less than 100% in general. In terms of medical diagnosis, this is because some diseases may manifest through unlikely symptoms, making other diseases appear more likely. In an uncertain world with limited evidence, mistakes in diagnoses are unavoidable.

More importantly, $P_s$ converges to $P_e$ under the guidance of these 76% correct diagnoses while $P_c$ does not. The maximum error of $P_c$ remains about the same throughout the 200 trials and the behavior matching rate of $(D_c(1), P_c)$ is low. Similar performance of $(D_c(1), P_c)$ is seen in the next two simulations. This shows that, under circumstances where good experts are available but confirmations of diagnoses are not, ALPP is robust while {0, 1} distribution learning will be misled by the errors in diagnoses. This is not surprising since the assumption underlying {0, 1} distribution learning is violated. More insight into this can be gained from the results of simulation 4 below.

An imperfect expert is assumed in simulation 2 (Table 6.13). The distribution $P_e$ differs from $P_r$ by up to 0.05. Because of this error, $P_s$ converges to neither $P_e$ (as shown in Table 6.13) nor $P_r$. But the error between $P_s$ and $P_e$ approaches a small value (about 0.07) such that after 200 trials the behavior of $P_s$ matches that of $P_e$ perfectly.

             (D_e(1), P_e)   (D_s(1), P_s)               (D_c(1), P_c)
trial No.    diag. rate      behv. mat. rate  max. err.  behv. mat. rate  max. err.
                                              S-E                         C-E
0            -               -                .300       -                .300
1-100        84%             82%              .058       32%              .272
101-200      86%             92%              .122       43%              .287
201-300      80%             100%             .067       32%              .290
301-400      83%             100%             .076       36%              .292

Table 6.13: Summary of simulation 2. $|P_e - P_r| = 0.05$.

If the discrepancy between $P_e$ and $P_r$ is further increased so that the threshold discussed in the last section is crossed, $(D_s(1), P_s)$ will no longer converge to $(D_e(1), P_e)$. This is the case in simulation 3 (Table 6.14), where the maximum error and the root mean square (rms) error between $P_e$ and $P_r$ are 0.15 and 0.098 respectively. The rms error is calculated over all the priors and conditional probabilities of the two distributions. The rms error is introduced for the interpretation of simulation 3 because the maximum error alone, when it does not approach 0, does not give a good indication of the distance between the two.

             (D_e(1), P_e)   (D_s(1), P_s)
trial No.    diag. rate      diag. rate  behv. mat. rate  rms err.  rms err.  max. err.
                                                          S-E       S-R       S-E
0            -               -           -                .170      .169      .39
1-25         80%             84%         20%              .086      .079      .17
26-75        74%             74%         40%              .071      .068      .11
76-175       73%             73%         53%              .050      .083      .087
176-375      79%             79%         46%              .059      .072      .095
376-475      78%             78%         43%              .061      .071      .119

             (D_e(1), P_e)   (D_c(1), P_c)
trial No.    diag. rate      diag. rate  behv. mat. rate  rms err.  rms err.  max. err.
                                                          C-E       C-R       C-E
0            -               -           -                .170      .169      .39
1-25         80%             80%         32%              .110      .091      .20
26-75        74%             74%         38%              .110      .092      .16
76-175       73%             73%         38%              .098      .092      .15
176-375      79%             79%         23%              .100      .090      .15
376-475      78%             78%         26%              .096      .084      .15

Table 6.14: Summary of simulation 3. $|P_e - P_r| = 0.15$.

The simulation shows that the behavior matching rate between $P_s$ and $P_e$ is quite low (43% after 475 trials). Since the diagnostic rate of $P_e$ is also lower (77%), one could ask which one is better. One way of viewing this is to compare the diagnostic rates. It is observed that, among $P_s$, $P_c$ and $P_e$, none is superior to the others if only the top diagnosis is concerned. A more careful examination is obtained by comparing distances among models. It turns out that the distance (S-E) and the distance (S-R) are smaller than the distance (E-R), with corresponding rms errors 0.061, 0.071 and 0.098 respectively.

The above three simulations assume that only the subjective posterior judgments are available. In simulation 4, it is assumed that the correct diagnosis is also accessible. This time, $(D_c(1), P_c)$ is supplied with the scenario generated by $(D_r(1), P_r)$. $P_e$ is the same as $P_r$.

The results (Table 6.15) show that ALPP converges much more quickly than {0, 1} distribution learning even though the latter has access to the ‘true’ answers to the diagnostic problem. After 1500 trials, $(D_s(1), P_s)$ reduces its maximum error from $(D_e(1), P_e)$ to 0.041 and matches the latter’s behavior perfectly, while $(D_c(1), P_c)$ is still on its way to convergence, with its error about twice as large and its behavior matching rate at 80%.

             (D_e(1), P_e)   (D_s(1), P_s)               (D_c(1), P_c)
trial No.    diag. rate      behv. mat. rate  max. err.  behv. mat. rate  max. err.
                                              S-E                         C-E
0            -               -                .300       -                .300
1-100        88%             95%              .130       60%              .375
101-600      78%             98%              .048       72%              .045
601-1100     78%             93%              .052       61%              .075
1101-1500    79%             100%             .041       80%              .079
1501-1700    81%             100%             .025       85%              .093

Table 6.15: Summary of simulation 4. $|P_e - P_r| = 0$ and the ‘true’ scenario is accessible to {0, 1} distribution learning ($P_c$).

Real world scenarios can be distinguished as being common or exceptional. An expert with knowledge about the real world tends to catch the common and to ignore the exceptional. Thus the diagnostic rate will never be 100%. This is the best one can do given the limited evidence. The PPs provided by the expert contain information about the entire domain, while a scenario contains only the information about that particular scene. Thus, although both $(D_s(1), P_s)$ and $(D_c(1), P_c)$ converge, the former converges more quickly. This difference in convergence speed is expected to emerge wherever the diagnosis is difficult and the diagnostic rate of the expert is low (for example, in areas where the disease mechanism is not well understood and diagnostic criteria are not well established).

6.6 Remarks

Several features of ALPP can be appreciated through the theoretical analysis and simulation results presented.

• ALPP provides an alternative way of sequentially updating conditional probabilities in Bayesian nets when confirmed diagnoses are not available.

• Under ideal conditions (a perfect expert for ALPP and confirmed diagnoses for {0, 1} distribution learning), both ALPP and {0, 1} distribution learning converge to the real world model. However, ALPP converges faster than {0, 1} distribution learning due to the richer information contained in the expert’s posterior judgments.
• When the human expert’s judgment is the only available source and the expert is not perfect but fairly good (his or her mental model is different from but close to the real world model), ALPP still converges to the expert’s posterior behavior and improves the system model towards the real world model up to a small error. On the other hand, {0, 1} distribution learning will be misled by the unavoidable mistakes in the expert’s diagnoses, because its underlying assumption is violated. Consequently, it can only simulate the expert’s behavior up to the top diagnosis; both the posterior ordering and the model parameters (probability values) are far off.

• As argued at the beginning of this chapter, the expert’s diagnoses are indeed the only available source in many applications. Thus, to use {0, 1} distribution learning in these domains, one must simplify the expert’s posterior distribution to a {0, 1} distribution. On the other hand, to use ALPP one has to obtain from the expert the full real-valued distribution, which may not be practical. A proper compromise might be to ask the expert to provide a few top posterior probabilities and to distribute the remaining probability mass uniformly over the other hypotheses.

• ALPP is directly applicable to Bayesian networks of diameter 1, as is {0, 1} distribution learning. Although this topology is not adequate for many applications, there are domains in which it is, for example, the Bayesian net version of QMR (its precursor is INTERNIST, one of the first expert systems in internal medicine) [Heckerman 90a, Heckerman 90b] and PAINULIM, among others.

Chapter 7

SUMMARY

The thesis research aims to construct medical expert systems which will be of practical use. The approach adopted is to identify practical problems in engineering practice and to solve them scientifically (as opposed to by ad hoc approaches).
The accomplishments of this research include an engineering part and a scientific part, which are closely related.

Contributions to theoretical knowledge

• The limitation of finite totally ordered probability algebras has been shown theoretically. This highlights the use of infinite totally ordered probability algebras, including probability theory.

• The technique of multiply sectioned Bayesian networks (MSBN) and junction forests has been developed. This technique allows the exploitation of the localization naturally existing in a large domain, such that the construction and deployment of large expert systems using Bayesian networks can be practical.

• An algorithm for learning by posterior probabilities (ALPP) has been developed. This algorithm is useful for sequentially updating conditional probabilities in Bayesian networks in order to improve the accuracy of (quantitative) knowledge representation. The algorithm removes the assumption that diagnoses must be confirmed.

Accomplishments in engineering

• WEBWEAVR, an expert system shell which embeds the MSBN technique, has been implemented on IBM compatibles.

• PAINULIM, an expert neuromuscular diagnostic system for patients with a painful or impaired upper limb, has been constructed. The use of the MSBN technique has made it possible to conduct the knowledge acquisition and system refinement of PAINULIM within the hospital environment. This has resulted in more efficient cooperation with medical experts and has greatly sped up the development of PAINULIM.

• QUALICON, a coupled expert system for technical quality control of nerve conduction studies, has been constructed.

Limitations of the work

• The MSBN technique has been used in only a single domain. Only the sectioning with a covering sect is used in the application. The applicability of the technique to many other domains requires future experience.
• The performance of ALPP has been shown through simulation, but it has not been evaluated in a real application.

Topics for future research

• Test design in a multiply sectioned Bayesian network. System prompting for the acquisition of the evidence most beneficial to a diagnosis, given the current state of belief, has been a topic of expert system research. In chapter 4, the attention shift is described as originated by the user. In the context of an MSBN, a system prompt for the acquisition of evidence from a neighbor sect can be another reason for attention shift. How to generate the prompt for the target sect of an attention shift is an open question. Test design within a sect also deserves further exploration.

• Explanation of inference in a multiply sectioned Bayesian network. Qualitative and intuitive explanation of inference is important for the acceptance of expert systems. The MSBN technique allows the representation of categorical and hierarchical knowledge. How to frame an explanation in terms of evidence distributed in different categories or hierarchies requires future research.

• Only the marginal distribution is obtained in the MSBN technique (in the junction tree technique as well). Is there a way to obtain the most likely combination of hypotheses?

• The representation incorporating decision analysis with Bayesian nets is usually termed an influence diagram. Incorporating decision analysis with an MSBN (e.g., for treatment recommendation) introduces new problems and possibilities.

Bibliography

[Aleliunas 86] R. Aleliunas, “Models of reasoning based on formal deductive probability theories,” Draft unpublished, 1986.
[Aleliunas 87] R. Aleliunas, “Mathematical models of reasoning - competence models of reasoning about propositions in English and their relationship to the concept of probability,” Research Report CS-87-S1, Univ. of Waterloo, 1987.
[Aleliunas 88] R. Aleliunas, “A new normative theory of probabilistic logic,” Proc.
CSCSI-88, 67-74, 1988.
[Andersen et al. 89] S.K. Andersen, K.G. Olesen, F.V. Jensen and F. Jensen, “HUGIN - a shell for building Bayesian belief universes for expert systems”, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, Michigan, Vol. 2, 1080-1085, 1989.
[Andreassen, Andersen and Woldbye 86] S. Andreassen, S.K. Andersen and M. Woldbye, “An expert system for electromyography”, Proc. of Novel Approaches to the Study of Motor Systems, Banff, Alberta, Canada, 2-3, 1986.
[Andreassen et al. 89] S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjerulff, M. Woldbye, A.R. Sorensen, A. Rosenfalck and F. Jensen, “MUNIN - an expert EMG assistant”, J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 255-277, 1989.
[Baker and Boult 90] M. Baker and T.E. Boult, “Pruning Bayesian networks for efficient computation”, Proc. of the Sixth Conference on Uncertainty in Artificial Intelligence, Cambridge, Mass., 257-264, 1990.
[Blinowska and Verroust 87] A. Blinowska and J. Verroust, “Building an expert system in electrodiagnosis of neuromuscular diseases”, Electroenceph. Clin. Neurophysiol., 66: 510, 1987.
[Buchanan and Shortliffe 84] B.G. Buchanan and E.H. Shortliffe, (Edt), Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Addison-Wesley, 1984.
[Burns and Sankappannvar 81] S. Burns and H. P. Sankappannvar, A course in universal algebra, Springer-Verlag, 1981.
[Cheeseman 86] P. Cheeseman, “Probabilistic versus fuzzy reasoning”, L.N. Kanal and J.F. Lemmer, (Edt), Uncertainty in Artificial Intelligence, Elsevier Science, 85-102, 1986.
[Cheeseman 88a] P. Cheeseman, “An inquiry into computer understanding”, Computational Intelligence, 4: 58-66, 1988.
[Cheeseman 88b] P. Cheeseman, “In defense of An inquiry into computer understanding”, Computational Intelligence, 4: 129-142, 1988.
[Cooper 84] G.F.
Cooper, NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge, Ph.D. diss., Stanford University, 1984.
[Cooper 90] G.F. Cooper, “The computational complexity of probabilistic inference using Bayesian belief networks”, Artificial Intelligence, 42: 393-405, 1990.
[Desmedt 89] J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 1989.
[Duda et al. 76] R.O. Duda, P.E. Hart and N.J. Nilsson, “Subjective Bayesian methods for rule-based inference systems”, Technical Report 124, Stanford Research Institute, Menlo Park, California, 1976.
[Fertig and Breese 90] K.W. Fertig and J.S. Breese, “Interval Influence Diagrams”, M. Henrion et al., (Edt), Uncertainty In Artificial Intelligence 5, 149-161, 1990.
[First et al. 82] M.B. First, B.J. Weimer, S. McLinden and R.A. Miller, “LOCALIZE: Computer assisted localization of peripheral nervous system lesions”, Comput. Biomed. Res., 15: 525-543, 1982.
[Fuglsang-Frederiksen and Jeppesen 89] A. Fuglsang-Frederiksen and S.M. Jeppesen, “A rule-based EMG expert system for diagnosing neuromuscular disorders”, J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 289-296, 1989.
[Gallardo et al. 87] R. Gallardo, M. Gallardo, A. Nodarse, S. Luis, R. Estrada, L. Garcia and O. Padron, “Artificial intelligence in EMG diagnosis of cervical roots and brachial plexus lesions”, Electroenceph. Clin. Neurophysiol., 66: S37, 1987.
[Geiger, Verma and Pearl 90] D. Geiger, T. Verma and J. Pearl, “d-separation: from theorems to algorithms”, M. Henrion, R.D. Shachter, L.N. Kanal and J.F. Lemmer (Edt), Uncertainty in Artificial Intelligence 5, Elsevier Science Publishers, 139-148, 1990.
[Gibbons 85] A. Gibbons, Algorithmic Graph Theory, Cambridge University Press, 1985.
[Golumbic 80] M.C. Golumbic, Algorithmic graph theory and perfect graphs, Academic Press, NY, 1980.
[Gotman and Gloor 76] J. Gotman and P.
Gloor, “Automatic recognition and quantification of interictal epileptic activity in the human scalp EEG”, Electroenceph. Clin. Neurophysiol., 41: 513-529, 1976.
[Goodgold and Eberstein 83] J. Goodgold and A. Eberstein, Electrodiagnosis of Neuromuscular Diseases, Williams and Wilkins, 1983.
[Halpern and Rabin 87] J.Y. Halpern and M.O. Rabin, “A logic to reason about likelihood,” Artificial Intelligence, 32: 379-405, 1987.
[Heckerman 86] D. Heckerman, “Probabilistic interpretations for MYCIN’s certainty factors”, L.N. Kanal and J.F. Lemmer, (Edt), Uncertainty in Artificial Intelligence, Elsevier Science, 167-196, 1986.
[Heckerman and Horvitz 87] D.E. Heckerman and E.J. Horvitz, “On the expressiveness of rule-based systems for reasoning with uncertainty”, Proc. AAAI, Seattle, Washington, 1987.
[Heckerman, Horvitz and Nathwani 89] D.E. Heckerman, E.J. Horvitz and B.N. Nathwani, “Update on the Pathfinder project”, 13th Symposium on computer applications in medical care, Baltimore, Maryland, U.S.A., 1989.
[Heckerman 90a] D. Heckerman, “A tractable inference algorithm for diagnosing multiple diseases”, M. Henrion, R.D. Shachter, L.N. Kanal and J.F. Lemmer (Edt), Uncertainty in Artificial Intelligence 5, Elsevier Science Publishers, 163-171, 1990.
[Heckerman 90b] D. Heckerman, Probabilistic Similarity Networks, Ph.D. Thesis, Stanford University, 1990.
[Henrion 88] M. Henrion, “Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, J.F. Lemmer and L.N. Kanal (Edt), Uncertainty in Artificial Intelligence 2, Elsevier Science Publishers, 149-163, 1988.
[Henrion 89] M. Henrion, “Some practical issues in constructing belief networks”, L.N. Kanal, T.S. Levitt and J.F. Lemmer (Eds), Uncertainty in Artificial Intelligence 3, Elsevier Science Publishers, 161-173, 1989.
[Henrion 90] M. Henrion, “An introduction to algorithms for inference in belief nets”, M. Henrion, et al.
(Edt), Uncertainty in Artificial Intelligence 5, Elsevier Science, 129-138, 1990.
[Howard and Matheson 84] R.A. Howard and J.E. Matheson, “Influence diagrams”, Readings in Decision Analysis, R.A. Howard and J.E. Matheson (Edt), Strategic Decisions Group, Menlo Park, CA, Ch 38, 763-771, 1984.
[Jamieson 90] P.W. Jamieson, “Computerized interpretation of electromyographic data”, Electroencephalogr Clin Neurophysiol, 75: 390-400, 1990.
[Jensen 88] F.V. Jensen, “Junction tree and decomposable hypergraphs”, JUDEX Research Report, Aalborg, Denmark, 1988.
[Jensen, Lauritzen and Olesen 90] F.V. Jensen, S.L. Lauritzen and K.G. Olesen, “Bayesian updating in causal probabilistic networks by local computations”, Computational Statistics Quarterly, 4: 269-282, 1990.
[Jensen, Olesen and Andersen 90] F.V. Jensen, K.G. Olesen and S.K. Andersen, “An algebra of Bayesian belief universes for knowledge based systems”, Networks, Vol. 20, 637-659, 1990.
[Kim and Pearl 83] J.H. Kim and J. Pearl, “A computational model for causal and diagnostic reasoning in inference engines”, Proc. of 8th IJCAI, Karlsruhe, West Germany, 190-193, 1983.
[Kowalik 86] J.S. Kowalik, (Edt.), Coupling Symbolic and Numerical Computing in Expert Systems, Elsevier, 1986.
[Kuczkowski and Gersting 77] J.E. Kuczkowski and J.L. Gersting, Abstract Algebra, Marcel Dekker, 1977.
[Kuipers and Kassier 84] B. Kuipers and J. Kassier, “Causal reasoning in medicine: Analysis of a protocol”, Cognitive Science, 8: 363-385, 1984.
[Larsen and Marx 81] R.J. Larsen and M.L. Marx, An Introduction to Mathematical Statistics and Its Applications, Prentice-Hall, NJ, 1981.
[Lauritzen et al. 84] S.L. Lauritzen, T.P. Speed and K. Vijayan, “Decomposable graphs and hypergraphs”, Journal of Australian Mathematical Society, Series A, 36: 12-29, 1984.
[Lauritzen and Spiegelhalter 88] S.L. Lauritzen and D.J.
Spiegelhalter, “Local computation with probabilities on graphical structures and their application to expert systems”, Journal of the Royal Statistical Society, Series B, 50: 157-244, 1988.
[Maier 83] D. Maier, The Theory of Relational Databases, Computer Science Press, 1983.
[Meyer and Hilfiker 83] M. Meyer and P. Hilfiker, “Computerized motor neurography”, J.E. Desmedt, (Edt), Computer Aided Electromyography, Karger, 1983.
[Miller et al. 76] A.C. Miller, M.M. Merkhofer, R.A. Howard, J.E. Matheson and T.R. Rice, Development of Automated Aids for Decision Analysis, Stanford Research Institute, Menlo Park, California, 1976.
[Neapolitan 90] R.E. Neapolitan, Probabilistic Reasoning in Expert Systems, John Wiley and Sons, 1990.
[Ng and Abramson 90] Keung-Chi Ng and B. Abramson, “A sensitivity analysis of PATHFINDER”, Proc. of the Sixth Conference on Uncertainty in Artificial Intelligence, Cambridge, Mass., 204-212, 1990.
[Nii 86a] H.P. Nii, “Blackboard systems: the blackboard model of problem solving and the evolution of blackboard architectures”, AI Magazine, Vol. 7, No. 2, 38-53, 1986.
[Nii 86b] H.P. Nii, “Blackboard systems: blackboard application systems, blackboard systems from a knowledge engineering perspective”, AI Magazine, Vol. 7, No. 3, 82-106, 1986.
[Patil, Szolovits and Schwartz 82] R.S. Patil, P. Szolovits and W.B. Schwartz, “Modeling knowledge of the patient in acid-base and electrolyte disorders”, P. Szolovits, (Edt), Artificial Intelligence in Medicine, Westview, 191-225, 1982.
[Pearl 86] J. Pearl, “Fusion, propagation, and structuring in belief networks”, Artificial Intelligence, 29: 241-288, 1986.
[Pearl 88] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[Pearl 89] J. Pearl, “Probabilistic semantics for nonmonotonic reasoning: A survey”, Proc. First Intl. Conf. on Principles of Knowledge Representation and Reasoning, 1989.
[Poole and Neufeld 88] D. Poole and E.
Neufeld, “Sound probabilistic inference in Prolog: an executable specification of influence diagrams,” I SIMPOSIUM INTERNACIONAL DE INTELIGENCIA ARTIFICIAL, 1988.
[Pople 82] H.E. Pople, Jr., “Heuristic methods for imposing structure on ill-structured problems: the structuring of medical diagnostics”, P. Szolovits, (Edt), Artificial Intelligence in Medicine, Westview, 119-190, 1982.
[Rose et al. 76] D.J. Rose, R.E. Tarjan and G.S. Lueker, “Algorithmic aspects of vertex elimination on graphs”, SIAM J. Comput., 5: 266-283, 1976.
[Savage 61] L.J. Savage, “The foundations of statistics reconsidered”, G. Shafer and J. Pearl, (Edt), Readings in Uncertainty Reasoning, Morgan Kaufmann, 1990.
[Shafer 76] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, New Jersey, 1976.
[Shafer 90] G. Shafer, “Introduction for Chapter 2: The meaning of probability”, G. Shafer and J. Pearl, (Edt), Readings in Uncertainty Reasoning, Morgan Kaufmann, 1990.
[Shafer and Shenoy 88] G. Shafer and P.P. Shenoy, Local computation in hypertrees, School of Business Working Paper No. 201, University of Kansas, U.S.A., 1988.
[Shachter 86] R.D. Shachter, “Evaluating influence diagrams”, Operations Research, 34, No. 6, 871-882, 1986.
[Shachter 88a] R.D. Shachter, “Probabilistic inference and influence diagrams”, Operations Research, 36, No. 4, 589-604, 1988.
[Shachter 88] R.D. Shachter, Discussion published in (Lauritzen and Spiegelhalter 88), 1988.
[Shachter and Heckerman 87] R.D. Shachter and D.E. Heckerman, “Thinking backward for knowledge acquisition”, AI Magazine, 8: 55-62, 1987.
[Spiegelhalter 86] D.J. Spiegelhalter, “Probabilistic reasoning in predictive expert systems”, L.N. Kanal and J.F. Lemmer, (Edt), Uncertainty in Artificial Intelligence, Elsevier Science, 47-67, 1986.
[Spiegelhalter, Franklin and Bull 89] D.J. Spiegelhalter, R.C.G. Franklin, and K.
Bull, “Assessment, criticism and improvement of imprecise subjective probabilities for a medical expert system,” Proc. Fifth Workshop on Uncertainty in Artificial Intelligence, 335-342, Windsor, Ontario, 1989.
[Spiegelhalter and Lauritzen 90] D.J. Spiegelhalter and S.L. Lauritzen, “Sequential updating of conditional probabilities on directed graphical structures”, Networks, Vol. 20, 5: 579-606, 1990.
[Stalberg and Stalberg 89] E. Stalberg and S. Stalberg, “The use of small computers in the EMG lab”, J.E. Desmedt, (Edt), Computer-Aided Electromyography and Expert Systems, Elsevier, 1-32, 1989.
[Szolovits and Pauker 78] P. Szolovits and S.G. Pauker, “Categorical and probabilistic reasoning in medical diagnosis”, Artificial Intelligence, 11: 115-144, 1978.
[Szolovits 82] P. Szolovits, “Artificial intelligence and medicine”, P. Szolovits, (Edt), Artificial Intelligence in Medicine, Westview, 1-19, 1982.
[Tarjan and Yannakakis 84] R.E. Tarjan and M. Yannakakis, “Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs”, SIAM J. Comput., 13: 566-579, 1984.
[Tversky and Kahneman 74] A. Tversky and D. Kahneman, “Judgment under uncertainty: heuristics and biases”, Science, 185: 1124-1131, 1974.
[Tversky and Kahneman 80] A. Tversky and D. Kahneman, “Causal schemata in judgments under uncertainty”, M. Fishbein, (Edt), Progress in Social Psychology, Lawrence Erlbaum, Vol. 1, 49-72, 1980.
[Vila et al. 85] A. Vila, D. Ziebelin and F. Reymond, “Experimental EMG expert system as an aid in diagnosis”, Electroenceph. Clin. Neurophysiol., 61: S240, 1985.
[Xiang, Beddoes and Poole 90a] Y. Xiang, M.P. Beddoes and D. Poole, “Can uncertainty management be realized in a finite totally ordered probability algebra?”, M. Henrion et al., (Edt), Uncertainty in Artificial Intelligence 5, 41-57, 1990.
[Xiang, Beddoes and Poole 90b] Y. Xiang, M.P. Beddoes and D.
Poole, “Sequential updating conditional probability in Bayesian networks by posterior probability”, Proc. 8th Biennial Conference of the Canadian Society for Computational Studies of Intelligence, Ottawa, 21-27, 1990.
[Xiang et al. 91b] Y. Xiang, B. Pant, A. Eisen, M.P. Beddoes, and D. Poole, “PAINULIM: An expert neuromuscular diagnostic system based on multiply sectioned Bayesian belief networks”, Proc. ISMM International Conference on Mini and Microcomputers in Medicine and Healthcare, Long Beach, CA, 64-69, 1991.
[Xiang, Poole and Beddoes 91] Y. Xiang, D. Poole and M.P. Beddoes, “Multiply sectioned Bayesian networks and junction forests for large knowledge-based systems”, submitted to Computational Intelligence, 1991.
[Xiang et al. 92] Y. Xiang, A. Eisen, M. MacNeil and M.P. Beddoes, “Quality control in nerve conduction studies with coupled knowledge based system approach”, Muscle and Nerve, Vol. 15, No. 2, 180-187, 1992.
[Zadeh 65] L. Zadeh, “Fuzzy sets”, Information and Control, 8: 338-353, 1965.

Appendix A Background on Graph Theory

Graph theory plays an important role in characterizing Bayesian networks and developing efficient inference algorithms for them. In this Appendix, the basic concepts of graph theory relevant to this thesis are introduced. For formal treatment of these concepts, see [Golumbic 80, Gibbons 85, Jensen 88, Lauritzen et al. 84].

A.1 Graphs

A graph $G$ is a pair $(N, E)$ where $N = \{A_1, \ldots, A_n\}$ is a set of nodes and $E = \{(A_i, A_j) \mid A_i, A_j \in N;\ i \neq j\}$ is a set of links between pairs of nodes in $N$. A directed graph is a graph where links in $E$ are ordered pairs and an undirected graph is a graph where links in $E$ are unordered pairs. Links in directed graphs are called arcs when their directions are of concern. A subgraph of a graph $(N, E)$ is any graph $(N^k, E^k)$ satisfying $N^k \subseteq N$ and $E^k \subseteq E$.
Given a subset of nodes $N^k \subseteq N$ of a graph $(N, E)$, the subgraph induced by $N^k$ is $(N^k, E^k)$ where $E^k = \{(A_i, A_j) \in E \mid A_i \in N^k \ \&\ A_j \in N^k\}$. The union graph of subgraphs $G^1 = (N^1, E^1)$ and $G^2 = (N^2, E^2)$ is the graph $(N^1 \cup N^2, E^1 \cup E^2)$, denoted $G^1 \cup G^2$.

Example 22 Figure A.38 shows eight examples. $G^1 = (N^1, E^1)$ is an undirected graph where $N^1 = \{A_1, A_2, \ldots, A_6\}$ and $E^1 = \{(A_1, A_2), \ldots\}$ is a set of unordered pairs. $G^4 = (N^4, E^4)$ is a directed graph where $N^4 = N^1$ and $E^4 = \{(A_1, A_2), (A_1, A_4), (A_3, A_2), (A_4, A_3), (A_5, A_4), (A_4, A_6)\}$ is a set of ordered pairs. $G^1$ is the union graph of subgraphs $G^2$ and $G^3$ ($G^1 = G^2 \cup G^3$). Likewise, $G^4 = G^5 \cup G^6$. Both $G^2$ and $G^3$ are also subgraphs of $G^7$, but $G^7 \neq G^2 \cup G^3$ since the link $(A_1, A_5)$ in $G^7$ is not contained in either subgraph.

Figure A.38: Examples of graphs

A path in graph $(N, E)$ is a sequence of nodes $A_1, A_2, \ldots, A_k$ ($k > 1$) such that $(A_i, A_{i+1}) \in E$. A path in a directed graph can be directed or undirected (i.e., each arc is considered undirected). A simple path is a path with no repeated node except that $A_1$ is allowed to equal $A_k$. A cycle is a simple path with $A_1 = A_k$. Directed graphs without directed cycles are called DAGs (directed acyclic graphs). Define the diameter of a DAG as the number of arcs of the longest directed path in the DAG.

A graph $(N, E)$ is connected if for any pair of nodes in $N$ there is an undirected path between them. A graph is singly connected, or is a tree, if there is a unique undirected path between any pair of nodes. If a graph consists of several unconnected trees, call the graph a forest. A graph is multiply connected if more than one undirected path exists between a pair of nodes. Note that a graph can be multiply connected but not connected.

Example 23 In Figure A.38, $A_1 - A_2 - A_3 - A_4 - A_6$ is a path in $G^1$. It is also an undirected path in $G^4$. $A_5 \to A_4 \to A_3 \to A_2$ is a directed path in $G^4$.
$A_6 - A_4 - A_1 - A_2 - A_3 - A_4 - A_5$ is not a simple path in $G^1$, but $A_1 - A_2 - A_3 - A_4 - A_6$ is. $A_1 - A_2 - A_3 - A_4 - A_1$ is an undirected cycle in $G^4$. There is no directed cycle in $G^4$, so $G^4$ is a DAG; there would be a directed cycle if the arc $(A_1, A_2)$ were reversed. $G^3$ is a singly connected graph, i.e., a tree. $G^5$ is a singly connected DAG, i.e., a tree. $G^4$ is a multiply connected DAG. The diameters of DAGs $G^5$, $G^6$ and $G^4$ are 1, 2 and 3 respectively.

Only connected DAGs are considered in this thesis since an unconnected DAG can always be treated as several connected ones. Define a subDAG of a DAG $D = (N, E)$ as any connected subgraph of $D$. A DAG $D$ is the union DAG of subDAGs $D^1$ and $D^2$ if $D = D^1 \cup D^2$.

Example 24 In Figure A.38, $G^4$ is the union DAG of subDAGs $G^5$ and $G^6$ ($G^4 = G^5 \cup G^6$).

If there is an arc $(A_1, A_2)$ from node $A_1$ to $A_2$, $A_1$ is called a parent of $A_2$, and $A_2$ a child of $A_1$. Similarly, if there is a directed path from $A_1$ to $A_k$, the two nodes are called, respectively, ancestor and descendant relative to each other. The in-degree of a node (denoted by i) is defined as the number of its parents.

Example 25 In $G^4$ of Figure A.38, $A_1$ and $A_5$ are the parents of $A_4$ and $A_4$ is their child. $A_2$ has 4 ancestors, namely $A_1$, $A_3$, $A_4$ and $A_5$. The in-degree of $A_4$ is 2 and the in-degree of $A_1$ is 0.

For each node in a DAG, if links are added between all its parents and the directions on arcs are dropped, the graph thus formed is the moral graph of the DAG. A graph is triangulated if every cycle of length > 3 has a chord. A chord is a link connecting 2 nonadjacent nodes. A maximal set of nodes all of which are pairwise linked is called a clique. Algorithms for triangulating graphs have been developed, for example, the lexicographic search [Rose et al. 76] with time complexity $O(ne)$, where $n$ is the number of nodes and $e$ the number of links, and the maximum cardinality search [Tarjan and Yannakakis 84] with time complexity $O(n + e)$.
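The definitions of moral graph and triangulated graph above can be illustrated with a minimal Python sketch. The small DAG, the function names and the naive test (repeatedly removing a node whose remaining neighbours are pairwise linked) are assumptions made for illustration only; this is not the lexicographic search or the maximum cardinality search cited above, and it tests for triangulation rather than performing it.

```python
from itertools import combinations


def moralize(nodes, arcs):
    """Return the moral graph of a DAG as a set of undirected links (frozensets)."""
    parents = {n: set() for n in nodes}
    for parent, child in arcs:
        parents[child].add(parent)
    # Drop the directions on the existing arcs.
    links = {frozenset(arc) for arc in arcs}
    # Add a link between every pair of parents sharing a child.
    for child in nodes:
        for p, q in combinations(sorted(parents[child]), 2):
            links.add(frozenset((p, q)))
    return links


def is_triangulated(nodes, links):
    """Naive test: a graph is triangulated iff nodes whose remaining
    neighbours are pairwise linked can be removed one by one until none is left."""
    adjacent = {n: set() for n in nodes}
    for link in links:
        a, b = tuple(link)
        adjacent[a].add(b)
        adjacent[b].add(a)
    remaining = set(nodes)
    while remaining:
        for n in sorted(remaining):
            neighbours = adjacent[n] & remaining
            if all(frozenset(pair) in links for pair in combinations(neighbours, 2)):
                remaining.remove(n)
                break
        else:
            return False  # no removable node: a chordless cycle of length > 3 exists
    return True


if __name__ == "__main__":
    # Hypothetical DAG (not one of the graphs in Figure A.38): a -> c <- b, c -> d.
    dag_nodes = ["a", "b", "c", "d"]
    dag_arcs = [("a", "c"), ("b", "c"), ("c", "d")]
    moral = moralize(dag_nodes, dag_arcs)
    print(sorted(sorted(link) for link in moral))
    print(is_triangulated(dag_nodes, moral))
```

In this hypothetical DAG the only link added by moralization joins the two parents of c. The naive test runs in polynomial rather than linear time, which is adequate only for small illustrative graphs.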
Example 26 In Figure A.38, $G^7$ is the moral graph of $G^4$. $G^8$ is a triangulated graph of $G^7$.

A.2 Hypergraphs

A hypergraph is a pair $(N, C)$ where $N$ is a set and $C \subseteq 2^N$ is a set of subsets of $N$. Define the union of hypergraphs similarly to the union of graphs. The union hypergraph of $(N^1, C^1)$ and $(N^2, C^2)$ is $(N^1 \cup N^2, C^1 \cup C^2)$, denoted $(N^1, C^1) \cup (N^2, C^2)$.

Let $(N, E)$ be a graph, and $C$ be the set of cliques of $(N, E)$. Then $(N, C)$ is a clique hypergraph of graph $(N, E)$. If a clique hypergraph is organized into a tree where the nodes of the tree are labeled with cliques such that for any pair of cliques, their intersection is contained in each of the cliques on the unique path between them, then the tree is called a junction tree or join tree. The intersection of 2 adjacent cliques in a junction tree is called the sepset of the 2 cliques.

Given a hypergraph, its junction trees, if it has any, can be characterized as maximum-weight spanning trees [Jensen 88]. The algorithm for constructing maximum-weight spanning trees can be found in, e.g., [Gibbons 85].

Example 27 Consider the graph $G^8$ in Figure A.38 where $N^8 = \{A_1, A_2, \ldots, A_6\}$. The set of cliques of $G^8$ is $C = \{\{A_1, A_2, A_4\}, \{A_2, A_3, A_4\}, \{A_1, A_4, A_5\}, \{A_4, A_6\}\}$. Therefore $(N^8, C)$ is a clique hypergraph of $G^8$. Figure A.39 is a junction tree of the hypergraph $(N^8, C)$, where nodes are labeled with cliques (in ovals) and links are labeled with sepsets of cliques (in squares).

Figure A.39: A junction tree with nodes (ovals) labeled with cliques and links labeled with sepsets of cliques (squares).

Appendix B Reference for Theory of Probabilistic Logic (TPL)

B.1 Axioms of TPL

The content of this section is taken from Aleliunas [1988] with minor changes to simplify the presentation.

Axiom 1 (TPL axioms)

Axioms about the domain and range of each $f$ in $F$.

AX1 The set of probabilities, $P$, is a partially ordered set. The ordering relation is ‘$\le$’.
AX2 The set of sentences, $L$, is a free boolean algebra with operations $\&$, $\vee$, and $\neg$, and it is equipped with the usual equivalence relation ‘$\equiv$’. The generators of the algebra are a countable set of primitive propositions. Every sentence in $L$ is either a primitive proposition or a finite combination of them.

AX3 If $P \equiv X$ and $Q \equiv Y$, then $f(P \mid Q) = f(X \mid Y)$.

Axioms that hold for all $f$ in $F$, and for any $P$, $Q$, $R$ in $L$.

AX4 If $Q$ is absurd (i.e., $Q \equiv R \& \neg R$), then $f(P \mid Q) = f(P \mid P)$.

AX5 $f(P \& Q \mid Q) = f(P \mid Q) \le f(Q \mid Q)$.

AX6 For any other $g$ in $F$, $f(P \mid P) = g(P \mid P) = 1$.

AX7 There is a monotone non-increasing total function, $i$, from $P$ into $P$ such that $f(\neg P \mid R) = i(f(P \mid R))$.

AX8 There is an order-preserving total function, $h$, from $P \times P$ into $P$ such that $f(P \& Q \mid R) = h(f(P \mid Q \& R), f(Q \mid R))$. Moreover, if $f(P \& Q \mid R) = 0$, then $f(P \mid Q \& R) = 0$ or $f(Q \mid R) = 0$, where $0 = f(\neg R \mid R)$ is defined as a function of $f$ and $R$.

AX9 If $f(P \mid R) \le f(P \mid \neg R)$ then $f(P \mid R) \le f(P \mid R \vee \neg R) \le f(P \mid \neg R)$.

Axioms about the richness of the set $F$. Let $1 = P \vee \neg P$. For any distinct primitive propositions $A$, $B$ and $C$ in $L$, and for any arbitrary probabilities $a$, $b$ and $c$ in $P$, there is a probability assignment $f$ in $F$ (not necessarily the same one in each case) for which

AX10 $f(A \mid 1) = a$, $f(B \mid A) = b$, and $f(C \mid A \& B) = c$.

AX11 $f(A \mid B) = f(A \mid \neg B) = a$ and $f(B \mid A) = f(B \mid \neg A) = b$.

AX12 $f(A \mid 1) = a$, and $f(A \& B \mid 1) = b$, whenever $b \le a$.

B.2 Probability Algebra Theorem

The content of this section is taken from Aleliunas [1986].

Theorem 15 (Probability algebra theorem) Any probability algebra (under TPL axioms) defined on the partially ordered set (poset) $P$ satisfies the following conditions:

T1 $P$ is a partially ordered semigroup with an order preserving operation ‘$*$’.

T2 $0 * x = 0$ and $1 * x = x$ for any $x$ in $P$.

T3 $P$ is a self-dual poset equipped with an isomorphism, $i$, from $P$ to its dual.

T4 $P$ is a poset bounded by 0 and 1, i.e., $0 \le x \le 1$ for any $x$ in $P$.

T5 $P$ is commutative, i.e., $x * y = y * x$.
T6 $P$ has non-trivial zero, i.e., $x * y = 0$ implies that $x = 0$ or $y = 0$.

T7 $P$ is a naturally ordered semigroup, i.e., $x \le y$ implies $\exists z\,(x = y * z)$.

Conversely, the above 7 conditions are also sufficient to characterize a probability algebra.

Appendix C.
Examples on Finite Totally Ordered Probability Algebras

One of $M_{8,4}$ with idempotent elements $e_1$, $e_4$, $e_7$ and $e_8$

Product table p * q

* | e1 | e2 | e3 | e4 | e5 | e6 | e7 | e8 |
e1 | e1 | e2 | e3 | e4 | e5 | e6 | e7 | e8 |
e2 | e2 | e3 | e4 | e4 | e5 | e6 | e7 | e8 |
e3 | e3 | e4 | e4 | e4 | e5 | e6 | e7 | e8 |
e4 | e4 | e4 | e4 | e4 | e5 | e6 | e7 | e8 |
e5 | e5 | e5 | e5 | e5 | e6 | e7 | e7 | e8 |
e6 | e6 | e6 | e6 | e6 | e7 | e7 | e7 | e8 |
e7 | e7 | e7 | e7 | e7 | e7 | e7 | e7 | e8 |
e8 | e8 | e8 | e8 | e8 | e8 | e8 | e8 | e8 |

Solution table p/q

q \ p | e1 | e2 | e3 | e4 | e5 | e6 | e7 | e8 |
e1 | e1 | e2 | e3 | e4 | e5 | e6 | e7 | e8 |
e2 | | e1 | e2 | [e4,e3] | e5 | e6 | e7 | e8 |
e3 | | | e1 | [e4,e2] | e5 | e6 | e7 | e8 |
e4 | | | | [e4,e1] | e5 | e6 | e7 | e8 |
e5 | | | | | [e4,e1] | e5 | [e7,e6] | e8 |
e6 | | | | | | [e4,e1] | [e7,e5] | e8 |
e7 | | | | | | | [e7,e1] | e8 |

C.2 Derivation of p(fire|smoke&alarm) under TPL

$p(f \mid s \& a) = p(s \mid f \& a) * p(f \mid a) / p(s \mid a)$

where

$p(s \mid f \& a) = p(s \mid f)$;

$p(s \mid a) = p(s \& (f \vee \neg f) \mid a) = i[\,i[p(s \mid \neg f) * p(\neg f \mid a)] * i[p(s \mid f) * p(f \mid a) / i[p(s \mid \neg f) * p(\neg f \mid a)]]\,]$;

and

$p(f \mid a) = p(a \mid f) * p(f) / p(a)$

where

$p(a \mid f) = p(a \& ((f \& t) \vee (f \& \neg t) \vee (\neg f \& t) \vee (\neg f \& \neg t)) \mid f) = i[\,i[p(a \mid f \& \neg t) * p(\neg t)] * i[p(a \mid f \& t) * p(t) / i[p(a \mid f \& \neg t) * p(\neg t)]]\,]$

and

$p(a) = p(a \& ((f \& t) \vee (f \& \neg t) \vee (\neg f \& t) \vee (\neg f \& \neg t))) = i[f_1 * f_2 * f_3 * f_4]$

where

$f_1 = i[p(a \mid \neg f \& \neg t) * p(\neg f) * p(\neg t)]$

$f_2 = i[p(a \mid \neg f \& t) * p(\neg f) * p(t) / f_1]$

$f_3 = i[p(a \mid f \& \neg t) * p(f) * p(\neg t) / (f_1 * f_2)]$

$f_4 = i[p(a \mid f \& t) * p(f) * p(t) / (f_1 * f_2 * f_3)]$.

C.3 Evaluation of Legal FTOPAs with Size 8

Figure C.40 plots the evaluation results of the following 14 posterior probabilities from the smoke-alarm example by Poole and Neufeld [1988] using all the 32 legal FTOPAs of size 8. The conventional probability model (M0) is included for comparison.

$p(s \mid f)$, $p(a \mid f)$, $p(l \mid f)$, $p(r \mid f)$; $p(s \mid t)$, $p(a \mid t)$, $p(l \mid t)$, $p(r \mid t)$; $p(f \mid s)$, $p(f \mid a)$, $p(f \mid s \& a)$; $p(t \mid s)$, $p(t \mid a)$, $p(t \mid s \& a)$

The following table lists model labels (L) used in Figure C.40 and their corresponding idempotent element subscripts (S). ‘M0’ labels the conventional probability model.
L | S | L | S | L | S | L | S |
M1 | 1,7,8 | M9 | 1,2,5,7,8 | M17 | 1,2,3,4,7,8 | M25 | 1,3,5,6,7,8 |
M2 | 1,2,7,8 | M10 | 1,2,6,7,8 | M18 | 1,2,3,5,7,8 | M26 | 1,4,5,6,7,8 |
M3 | 1,3,7,8 | M11 | 1,3,4,7,8 | M19 | 1,2,3,6,7,8 | M27 | 1,2,3,4,5,7,8 |
M4 | 1,4,7,8 | M12 | 1,3,5,7,8 | M20 | 1,2,4,5,7,8 | M28 | 1,2,3,4,6,7,8 |
M5 | 1,5,7,8 | M13 | 1,3,6,7,8 | M21 | 1,2,4,6,7,8 | M29 | 1,2,3,5,6,7,8 |
M6 | 1,6,7,8 | M14 | 1,4,5,7,8 | M22 | 1,2,5,6,7,8 | M30 | 1,2,4,5,6,7,8 |
M7 | 1,2,3,7,8 | M15 | 1,4,6,7,8 | M23 | 1,3,4,5,7,8 | M31 | 1,3,4,5,6,7,8 |
M8 | 1,2,4,7,8 | M16 | 1,5,6,7,8 | M24 | 1,3,4,6,7,8 | M32 | 1,2,3,4,5,6,7,8 |

In each sub-graph, the vertical scale represents the probability range with $0 = e_8$ and $1 = e_1$. The probabilities are arranged horizontally in the same order as above. The predictive probabilities with identical condition are shown in the first 8 positions in 2 groups. The diagnostic probabilities with a fixed hypothesis are shown in the last 6 positions in 2 groups. As is analyzed in Section 2.5, none of the models 1 through 32 gives satisfactory results.

Figure C.40: Evaluation result of smoke-alarm example by 32 legal FTOPAs of size 8. f: fire; t: tampering; a: alarm; s: smoke; l: leaving; r: report.

Appendix D Proofs for corollary and propositions

Proof of Corollary 2

Proof: To prove p/q = r, it suffices to show r * q = p. Below the 7 cases are shown in the order of their appearance in the Corollary.

(Case 1) Equivalent to $e_k * e_1 = e_k$, which is in turn equivalent to Cond8 of proposition 1.

(Case 2) Equivalent to e * = e which is in turn equivalent to Cond4 of proposition 1.

(Case 3) Equivalent to e * e = ek where e = ek_+ and e+1_2 e, e1+,.
First, show this is the second case of Theorem 1. Clearly, ij < y < i11 - 1 i1. Also x = - j + i1 > i1, and k - j + i1 <ii+1 - 1 - (i1 + 1) + ii = i1+ - 2 i+i. Second, show $e_{\min(x+y-i_l,\, i_{l+1})} = e_k$. This is true because x - j - i1 = k < i1..

(Case 4) Equivalent to e, * ek = ek where x = i1 and i1 <k < i1. It is true by the third case of Theorem 1.

(Case 5) Equivalent to e, * = e+1 where e < and < e÷i. First, show this is the second case of Theorem 1, that is, e, e e [e11+,e1+i]. This is obviously true for e and the lower bound of e. For the upper bound, from i11 > j, one has ij - j 1, and therefore1+ e1. Second, show $e_{\min(x+y-i_l,\, i_{l+1})}$ = e+1 which is equivalent to x + y - ii ii+’. Since the lower bound is i1 + 1 for x, and i11 for y, this is certainly true.

(Case 6) Equivalent to er * e1 = e1 where e e1 which is true by Lemma 3.

(Case 7) Covered by the third case of Theorem 1. □

Proof of Proposition 5

Proof: A solution table (see appendix C.1) can be viewed as a different layout of a product table. An entry in the column labeled ‘q’ is one factor. A table entry in the same line as the first factor is another factor. The entry in the line labeled ‘p’ and in the same column as the second factor is the product. Since the product operation ‘*’ is well defined, all the probabilities will appear in each line of the table entries (possibly several of them appear together in a range). Therefore, there are $n(n - 1)$ table entries. There are $0.5n(n + 1) - 1$ distinct solution pairs. Their difference is just the amount of ambiguity by definition.

$A = [n(n - 1)] - [0.5n(n + 1) - 1] = (n - 1)(n - 2)/2$

Proof of Proposition 6

Proof: (Od) From Corollary 2, all the incidences of e/e = er to be counted are covered by the case 7 of the corollary. Among {e2, .. . , e,}, there are k - 2 idempotent elements which determine k - 3 intervals.
For each interval, the case 7 dictates m+1 - m columns in the solution table with m - 1 entries in each column to be counted.

(Om) From Theorem 1, e, * e < min(e1,,e) happens only in the second case. Since idempotent elements do not follow the relation by Lemma 3, only m <X, y <m+1 needs to be considered, and there are k -2 such intervals ((ik..i, ik) does not contribute to Om). For each interval, if x is fixed, y goes through m+l - - 1 values. Since only distinct product pair is to be counted, each interval contributesE:1_tm_l Now it suffices to show min(x + y - m+i) > x, y. Trivially, m+l > X, y. Also X+(y_im)>X+1 >. Similarly X+Ym >Y

(Od + Om) Rewrite Od as k.-2 m+lm1 0d > (im+iim) “=2 j=O Thus i22 321 4131 k-1k-21 Od+Om = (i+j-1)+ (i+j-1)+... (ik_2+j-1) j=1 j=O 3=0 j=0 Note the last addend of each sum is 1 less than the 1st addend of the next sum; and there are $i_{k-1} - 2$ addends in total. Thus,

$O_d + O_m = \sum_{j=1}^{i_{k-1}-2} j = \sum_{j=1}^{n-3} j = (n - 3)(n - 2)/2$ □

Appendix E Reference for Statistic Evaluation of Bernoulli Trials

The content of this appendix is taken from Larsen and Marx [1981]. Any set of repeated independent trials with each trial having just 2 possible outcomes (success and failure) is called a set of Bernoulli trials. The probability $p$ associated with success is called the success probability.

E.1 Estimation for Confidence Intervals

Definition 18 Suppose that $y$ successes are observed in $n$ independent Bernoulli trials. The probability interval $[p_1, p_2]$ is the $100(1 - \alpha)\%$ confidence interval for success probability $p$ if

$P(p_1 < p < p_2 \mid Y = y) = 1 - \alpha$

where $Y$ is the variable for number of successes.

The above defined lower and upper confidence limits $p_1$ and $p_2$ for $p$ are the solutions of

$\sum_{j=y}^{n} C_n^j\, p_1^j (1 - p_1)^{n-j} = \alpha/2$

$\sum_{j=0}^{y} C_n^j\, p_2^j (1 - p_2)^{n-j} = \alpha/2$

where $C_n^j = \frac{n!}{j!\,(n - j)!}$ is the number of combinations taking $j$ elements out of $n$.
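The two tail equations above generally have no closed-form solution, so the limits have to be found numerically. The following Python sketch is illustrative only and is not part of the original evaluation. It assumes the standard exact form of the equations, in which each tail probability is set to $\alpha/2$, and solves them by bisection; all function names are hypothetical.

```python
from math import comb


def binom_pmf(n, j, p):
    """Probability of exactly j successes in n Bernoulli trials."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)


def upper_tail(n, y, p):
    """P(Y >= y) when the success probability is p."""
    return sum(binom_pmf(n, j, p) for j in range(y, n + 1))


def lower_tail(n, y, p):
    """P(Y <= y) when the success probability is p."""
    return sum(binom_pmf(n, j, p) for j in range(0, y + 1))


def solve_by_bisection(func, target, increasing, iterations=100):
    """Find p in [0, 1] with func(p) == target for a monotone func."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies to the left of mid
    return (lo + hi) / 2


def confidence_interval(y, n, alpha=0.05):
    """Return (p1, p2) for y successes in n trials under the assumed equations."""
    # Lower limit: P(Y >= y | p1) = alpha / 2 (taken as 0 when y == 0).
    p1 = 0.0 if y == 0 else solve_by_bisection(
        lambda p: upper_tail(n, y, p), alpha / 2, increasing=True)
    # Upper limit: P(Y <= y | p2) = alpha / 2 (taken as 1 when y == n).
    p2 = 1.0 if y == n else solve_by_bisection(
        lambda p: lower_tail(n, y, p), alpha / 2, increasing=False)
    return p1, p2


if __name__ == "__main__":
    # Hypothetical data: 7 successes observed in 10 trials, 95% interval.
    print(confidence_interval(7, 10))
```

Bisection is used here because both tail probabilities are monotone in $p$, so no derivative or external library is needed.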
E.2 Hypothesis Testing

Let $x$ and $y$ denote the numbers of successes observed in 2 independent sets of $n$ and $m$ Bernoulli trials, respectively. Let $p_x$ and $p_y$ denote the true success probabilities associated with each set of trials. An approximate generalized likelihood ratio test at the $\alpha$ level of significance for hypothesis $H_0: p_x = p_y$ versus hypothesis $H_1: p_x \neq p_y$ is gotten by rejecting $H_0$ whenever

$\frac{\frac{x}{n} - \frac{y}{m}}{\sqrt{\frac{x+y}{n+m}\left(1 - \frac{x+y}{n+m}\right)\left(\frac{1}{n} + \frac{1}{m}\right)}}$

is either $\le -z_{\alpha/2}$ or $\ge +z_{\alpha/2}$, where

$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-z_{\alpha/2}} e^{-x^2/2}\,dx = \alpha/2.$

Appendix F Glossary of PAINULIM Terms

F.1 Diseases Considered in PAINULIM

Ahcd: Anterior horn cell disease
Als: Amyotrophic Lateral Sclerosis
Cts: Carpal tunnel syndrome
Inspcd: Intrinsic cord disease
Mednn: Median nerve lesion
Mnd: Motor neuron diseases (including Ahcd and Als)
Pd: Parkinson's disease
Pxltk: Plexus lower trunk
Pxpcd: Plexus post cord
Pxutk: Plexus upper trunk
Radnn: Radial nerve lesion
Rc56: C56 Root disease
Rc67: C67 Root disease
Rc81: C8T1 Root disease
Ulrnn: Ulnar nerve lesion
F.2 Clinical Features in PAINULIM

bcpex: Biceps reflex exaggerated
bcpls: Biceps reflex lost/diminished
lsbkhdfrm: Loss of sensation in back of hand/forearm
lsdisoc: Dissociated sensory loss
lslatfrm: Loss of sensation in lateral forearm
lslathnd: Loss of sensation in lateral hand/fingers
lsmedfrm: Loss of sensation in medial forearm
lsmedhnd: Loss of sensation in medial hand/fingers
lsuparm: Loss of sensation in upper arm
radex: Radial/Supinator reflex exaggerated
radls: Radial/Supinator reflex lost/diminished
rad_pn: Radicular pain
rigid: Rigidity
pnirm: Pain or tingling in the forearm
pniind: Pain or tingling in the hand
pn_shd: Pain or tingling in the shoulder
spstc: Spasticity
tcpex: Triceps reflex exaggerated
tcpls: Triceps reflex lost/diminished
tremor: Tremor of limbs/hands
twitch: Muscle twitch or fasciculations
wk_arm: Weakness of the arm
wking±c: Weakness of finger flexion
wk_shld: Weakness of the shoulder
wk_th_abd: Weakness of thumb abduction
wk_th_ext: Weakness of thumb extension
wk_th.Jx: Weakness of thumb flexion
wk_wst_ex: Weakness of wrist extension

F.3 EMG Features in PAINULIM

adm: Abductor digiti minimi
apb: Abductor pollicis brevis
bchrd: Brachioradialis
bcps: Biceps brachii
edc: Extensor digitorum communis
deltd: Deltoid
fasc: Fasciculation
fcu: Flexor carpi ulnaris
fdi: First dorsal interosseous
fdp23: Flexor digitorum profundus 2,3
fdp45: Flexor digitorum profundus 4,5
fpl: Flexor pollicis longus
latdrs: Latissimus dorsi
lvscp: Levator scapulae
other: Muscle in other limb/trunk
pmcv: Pectoralis major clavicular
pmsc: Pectoralis major sterno-clavicular
prt: Pronator teres
ps56: Paraspinals C5,6
ps67: Paraspinals C6,7
ps81: Paraspinals C8,T1
rhmj: Rhomboideus major
spspn: Supraspinatus
srant: Serratus anterior
trcps: Triceps

F.4 Nerve Conduction Features in PAINULIM

mcmp: MEDIAN COMPOUND MUSCLE ACTION POTENTIAL
mf2l: MEDIAN finger2 SNAP latency
mf2a: MEDIAN finger2 SNAP amplitude
mmcb: MEDIAN motor CONDUCTION BLOCK
mmcv: MEDIAN motor CONDUCTION VELOCITY
mmdl: MEDIAN motor DISTAL LATENCY
mmfw: Median “F” response
mupl: MEDIAN to ULNAR palmar latency difference
radm: Radial motor study
rads: RADIAL sensory study
ucmp: ULNAR COMPOUND MUSCLE ACTION POTENTIAL
uf5a: ULNAR finger5 SNAP amplitude
uf5l: ULNAR finger5 SNAP latency
umcb: ULNAR motor CONDUCTION BLOCK
umcv: ULNAR motor CONDUCTION VELOCITY
umfw: Ulnar “F” response

Appendix G Acronyms

AI: Artificial Intelligence
ALPP: Algorithm of Learning by Posterior Probability
BNS: Bayesian Network Specification (in QUALICON)
CLINICAL: clinical sect of PAINULIM
CMAP: Compound Muscle Action Potential
CP: Conditional Probability
DAG: Directed Acyclic Graph
DOCTR: DOCToR - a module in WEBWEAVR
EDITOR: a module in WEBWEAVR
EEG: ElectroEncephaloGraphy
EMG: ElectroMyoGraphy
EMGer: ElectroMyoGrapher
FE: Feature Extraction (in QUALICON)
FP: Feature Partition (in QUALICON)
FTOPA: Finite Totally Ordered Probability Algebra
MSBN: Multiply Sectioned Bayesian Network
NCV: Nerve Conduction Velocity
ND: Normal Data (in QUALICON)
NDU: Neuromuscular Diseases Unit
PAINULIM: PAINful or impaired Upper LIMb
PIE: Probabilistic Inference Engine (in QUALICON)
PP: Posterior Probability
QUALICON: QUALIty CONtrol
SNAP: Sensory Nerve Action Potential
TPL: Theory of Probabilistic Logic
TRANSNET: TRANSform NETwork - a module in WEBWEAVR
UBC: University of British Columbia
USBN: UnSectioned Bayesian Network
VGH: Vancouver General Hospital
WEBWEAVR: WEB WEAVeR
Multiply sectioned Bayesian belief networks for large knowledge-based systems: an application to neuromuscular.. Xiang, Yang 1992
Item Metadata
Title | Multiply sectioned Bayesian belief networks for large knowledge-based systems: an application to neuromuscular diagnosis |
Creator |
Xiang, Yang |
Date | 1992 |
Date Created | 2008-12-16 |
Date Issued | 2008-12-16 |
Description | In medical diagnosis a proper uncertainty calculus is crucial in knowledge representation. Finite calculus is close to human language and should facilitate knowledge acquisition. An investigation into the feasibility of finite totally ordered probability models has been conducted. It shows that a finite model is of limited usage, which highlights the importance of infinite totally ordered models including probability theory. Representing the qualitative domain structure is another important issue. Bayesian networks, combining graphical representation of domain dependency and probability theory, provide a concise representation and a consistent inference formalism. An expert system QUALICON for quality control in electromyography has been implemented as a pilot study of Bayesian nets. The performance is comparable to that from human professionals. Extending the research into a large system PAINULIM in neuromuscular diagnosis shows that the computation using homogeneous net representation is unnecessarily complex. At any one time a user’s attention is directed to only part of a large net, i.e., there is ‘localization’ of queries and evidence. The homogeneous net is inefficient since the overall net has to be updated each time. Multiply Sectioned Bayesian Networks (MSBNs) have been developed to exploit localization. Reasonable constraints are derived such that a localization preserving partition of a domain and its representation by a set of subnets are possible. Reasoning takes place at only one of them due to localization. Marginal probabilities obtained are identical to those obtained when the entire net is globally consistent. When the user’s attention shifts, a new subnet is swapped in and previously acquired evidence absorbed. Thus, with β subnets, the complexity is reduced approximately to 1/β. Reducing the complexity with MSBN, the knowledge acquisition of PAINULIM has been conducted using normal hospital computers. 
This results in efficient cooperation with medical staff. PAINULIM was thus constructed in less than one year. An evaluation shows very good performance. Coding probability distribution of Bayesian nets in causal direction has several advantages. Initially the distribution is elicited from the expert in terms of probabilities of a symptom given causing diseases. Since disease-to-symptom is not the direction of daily practice, the elicited probabilities may be inaccurate. An algorithm has been derived for sequentially updating probabilities in Bayesian nets, making use of the expert’s symptom-to-disease probabilities. Simulation shows better performance than Spiegelhalter’s {0, 1} distribution learning. |
Extent | 4387824 bytes |
Genre | Thesis/Dissertation |
Type | Text |
File Format | application/pdf |
Language | Eng |
Collection | Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2008-12-16 |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0064986 |
Degree | Doctor of Philosophy - PhD |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 1992-05 |
Campus | UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/2995 |
Aggregated Source Repository | DSpace |