B A Y E S I A N M O D E L S E L E C T I O N A N D CLASS IF ICAT ION : A P P L I C A T I O N TO B R A I N TISSUES T H R O U G H T 2 D I S T R I B U T I O N By Iftah Gideoni B. Sc. (Physics) Tel Aviv University, 1992 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES DEPARTMENT OF PHYSICS We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA March 1995 © Iftah Gideoni, 1995 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of The University of British Columbia Vancouver, Canada Date -i JciM DE-6 (2/88) Abst rac t A Bayesian procedure for model selection, parameter estimation and classification, using models of non-orthogonal basis functions, is applied to the problem of T2 decay rate distributions in brain tissues. The feasibility of generating reliable synthetic images of tissue-classified pixels is examined. The work determines, for the first time, the Bayesian probability of existence of short (5-15ms) T2 component in the brain tissues, and found it to be higher than 99% for all white matter tissues and higher than 80% for all gray matter tissues except Cortical Gray . The probability of having no more than three components of decaying exponents in the Ti distributions of the brain tissues, is found to be higher than 90% for all the tissues. We arrive to these findings through the use of models which are parameterized by highly coupled parameters, and the use of multi-dimensional search in the space of these models. 11 Table of Contents Abs t rac t i i Table of Contents i i i L is t of Tables v i L is t of Figures v i i Acknowledgements x 1 In t roduct ion 1 1.1 Bayesian Inference 4 1.1.1 Manipulation Rules • 5 1.1.2 Direct Probabilities 12 1.2 M R I - T2 Distribution 14 1.2.1 The Data 14 1.2.2 Background Knowledge 16 2 Bayesian Classification 20 2.1 Classification Onto 'Old 'Da ta Sets 21 2.2 Model Comparison 24 3 M o d e l Selection 28 3.1 The Set of Models . . . . . . . . . . . . . 28 3.1.1 The Base Functions, Gja 29 m 3.1.2 Search Path 32 3.2 The probability of the Model p(Ma/DI) 34 3.2.1 The Prior Probability Distributions 36 3.2.2 The Likelihood 38 3.3 Results 48 3.3.1 Simulated Data 48 3.3.2 Experimental data 56 4 Es t ima t ion of the Parameters 58 4.1 Noise Variance 60 4.2 Nonlinear Parameters 62 4.3 Linear parameters . . . ' 63 4.4 Results 65 4.4.1 Simulated Data 65 4.4.2 Experimental Data 66 5 Classification 77 5.1 Procedure . 77 5.1.1 The Total Amplitudes Integral 79 5.1.2 The T2 Distribution Integral . . 82 5.1.3 Final Expressions . 91 5.2 Results . . 92 5.2.1 Feasibility . 92 5.2.2 Tissue Classification 94 5.2.3 Images 95 6 Summary , Discussion and Conclus ion 99 i v 6.1 Summary . . . . . . . .• , . 99 6.2 Discussion • : 100 6.2.1 Multiple Measurements: cr Vs. as 101 6.2.2 Mariginalization in the Neighbourhood of No Data . 102 6.2.3 Multi-Dimensional Search for Models 104 6.3 Conclusions . 105 6.3.1 Evidence of a Short T2 Component . . . . . . . . . . . . . . . . . 105 6.3.2 Evidence of a Long T2 Component 106 6.3.3 Components Width 106 6.3.4 Alternating Echo Refocusing . . 107 Bib l iography 108 A Safety Valve Rout ine 110 B A r e Two Signals Equal? 113 B . l Signal Detection , 114 B . l . l M0 . . 115 B.1.2 Mi . . . 115 B. l .3 Reduction of the Integral . . . . ' 116 B.1.4 The Odds Ratio . 118, B.2 Are Two Signals Equal? 119 B.2.1 M i . . . . . 119 B.2.2 M2 121 B.2.3 Reduction of the Integrals 122 B.2.4 The Odds Ratio 124 v Lis t of Tables 1.1 The Database. 15 3.1 Generating Parameters of Simulations 49 3.2 Search Path and Odds Ratio for the Continuous T2 Simulated Distribution. 51 3.3 Search Path and Odds Ratio for the Discrete T2 Simulated Distribution. 51 3.4 Experimental Data - model selection results. 57 4.1 Generating T2 Distributions and Simulation Results. . 66 4.2 Uncertainty of Linear Combinations of Amplitudes. 68 4.3 Estimated Parameters for White Matter Tissues. 69 4.4 Estimated Parameters for Gray Matter Tissues. -70 vi List of Figures 1.1 Raw Data: Decay Curves of Pixels taken from Major Forceps 16 3.1 Search Path Flow Chart • . . 33 3.2 Generating Parameters for the Discrete and Continuous Simulated Distri-butions. . . . . . . . . . . . . . . . 50 3.3 Model Selection Failure 53 4.1 Probability Distribution of the Noise Level for a typical tissue 60 4.2 Simulation Parameters and Estimated Parameters for the Discrete and Continuous Distributions 67 4.3 The Estimated T2 Distributions and Residuals for the Estimated Param-eters of White Matter Tissues . . 72 4.4 The Estimated T2 Distributions and Residuals for the Estimated Param-eters of Gray Matter Tissues. 73 4.5 The Estimated T2 Distributions and Residuals for the Estimated Param- . eters of Gray Matter Tissues (cont.). 74 4.6 Joint Probability Distribution for the Low Component of a Typical Tissue. 75 4.7 Joint Probability Distribution for the tissue Head Caudate Nucleus. . . 76' 5.1 Typical Spread of the Measurements of one Tissue 86 5.2 Models/Data Averages for the Tissues and Individual Pixels' Data From One Tissue 93 5.3 Classification Results for White Matter Tissues 96 vn 5.4 Classification Results for Gray Matter Tissues 97 5.5 Synthetic Classification Image of Patient Eleven 98 6.1 Marginal izat ion Decisions in Light of Likel ihood Widths. 103 A . l Normal Behavior and Abnormal Behavior of the Likel ihood along Principle Axes 112 v m Acknowledgements I would like to thank my supervisor, Dr. Phil Gregory. His course "Theory of Mea-surement" , changed both my perception of the scientific method and the direction of my studies. Any experimental scientific work is composed of the design of the experiment, per-formance of the measurements, and analysis of the results.. The author of this thesis did only an analysis of data. If this thesis contains any meaningful conclusions regard-ing the distribution of the T 2 decay rates, it is mainly the result of work done by Dr. Alex MacKay and Dr. Ken Wriittall. They developed the M R I sequence, performed the measurements, and collected the data. Their kindness in sharing this data and their knowledge, and the time spent by them in rearranging the data into an accessible form, are hereby gratefully acknowledged. Last but never least, I am delighted to thank Elana Brief for reading and editing the thesis. Without her efforts, this thesis would look much worst. ix Chapter 1 Introduction The field of Bayesian inference as a scientific tool is young; even in the last years, the Bayesian inference has not been put to wide use outside the Astrophysics community. The Bayesian inference scheme enables, in principle, direct and consistent application of basic ideas of the scientific method regarding the way we infer about natural phenomena and put our hypotheses to test. Nevertheless, the utilization of this theory to the analysis of scientific data and background knowledge remains mostly, unexplored. In this work, we will try to share the exploration of the theory in three main directions: first, in a demonstration of multi-dimensional search in a space of models. We try to search and test automatically and efficiently many models, different in more than one attribute. The general problem of the multi-dimensional search for the 'right' model is important but will not be treated in this work. A degenerate demonstration will be presented here. Second, we will solve a model selection and parameter estimation problem using models of decaying exponents as basis functions. This case, in which the basis function are non-orthogonal, is known to be one of the more pathological cases, both from the traditional, 'frequentist', and from the Bayesian point of view. The general approach we adopt here is similar to that developed by Bretthorstfl] for time series analysis. The ability to solve such problems using models of non-orthogonal basis functions in a computable manner, is crucial for many applications. Third, we discuss the general solution for a classification problem: we are given a new data set and a number of old data sets, and assume that each of the old data sets was produced by different mechanism. 1 Chapter 1. Introduction 2 We then infer which of these mechanisms produced the new data set. We approximate the general solution for this problem and apply it in the limit case where much more information is carried by the old data sets than by the new data set. The above will be applied to the problem of inferring about the T2. distributions of in vivo human brain tissues, where the data is taken from. Carr-Purcell-Meiboom-Gill (CPMG) Magnetic Resonance Imaging (MRI) sequences. The analyzed data is considered to be produced by decaying exponents with some distribution of the decay rates (the T2 distribution). We model this distribution by a finite number of models, each of which has a finite number of parameters. The T2 distribution problem is also dealt with in an elaborate frequentist literature,using inverse theory, and non-linear approaches similar to ours. The results of the application of probability theory may help to understand better the results of the traditional solutions. The determination of the distribution of the T2 coefficients in brain tissues and lesions is important for. a number of reasons. First, it may serve as classification criterion to be used for imaging, based on sub-voxel structure differences,that can not be seen in M R I [2]. Second, once the underlying physical mechanisms of the various components in the distribution have.been identified, determination of the distribution may serve to trace variations of the mechanisms among patients, tissues, or as a function of time. For example, t h e T 2 distributions are sensitive.to processes in lesions (as demyelination. in aging multiple sclerosis lesions [2]) and thus enable measurements of these processes which cannot be measured by other means. ,. Recently,'a new multi-echo M R I sequence for quantification of the T2 coefficients was developed in U.B.C. [2]. This sequence enabled, for the first time, determination of short T2 components (less than 20ms.) in in vivo brain tissues measurements. We use the data acquired by this new sequence. Partial analysis.of this data is given in [3]. Chapter 1. Introduction 3 The assignment of components in the T2 distribution to the underlying physical mech-anisms will not be treated in this work, beside commenting about the ability to make such assignments. In this chapter we wi l l discuss the basic tools used by probability theory in answering well-posed problems. The reader who is unfamiliar with the Bayesian inference should be able to follow the rest of the work based on this introduction. We then discuss our state of knowledge about the T2 distribution problem. There is no intention of giving background on M R I or the physics behind the T2 distribution, as it is of no importance for the quantitative solution, once our background knowledge has been defined. After introducing our initial state of knowledge, the work is confined to the pure inference problem. In chapter 2, we discuss the Bayesian classification procedure. It is a generalization of the Bayesian model selection procedure for cases where the models are replaced by known probability distributions for parameters of the model. In chapter 3, we define a set of candidate models for the T2 distributions and apply the Bayesian model selection procedure for choosing one of them for each tissue. The ability to consistently compare models of different degrees of complexity is a feature of the Bayesian inference not existing in the traditional, frequentist, methods. Also in that chapter, we will discuss the problems arising in the analysis of exponential decay curves; the non-orthogonality makes mistakes in choosing a model very costly. Lacking any solid grounds for choosing a model, this has led traditional workers to regard the non-linear methods (specifying a small number of decaying components) as disadvantageous [4],[5]; the price paid by the Bayesian procedure for being non-linear is merely the speed of computation. In chapter 4 we estimate the parameters of the model we chose in chapter 3 for each Chapter 1. Introduction 4 tissue. This is not a part of the classification procedure. Nevertheless, it has a stand-alone importance for the interpretation of the underlying physical mechanisms. Our parameters are highly coupled and poorly resolved and thus the estimations, both of the parameters values and the accuracies of these estimations should be done with care. We discuss the estimation of the noise in case of multiple measurements and systematic effects not accounted for by the model. In chapter 5 we apply the classification procedure introduced in chapter 2 to pix-els of an M R image. We discuss the meeting of necessary conditions for classification: having different T2 distributions for different tissues, and the variance of samples within one tissue. The classification procedure will demonstrate the merging of two sources of information: the results of classification based on the M R I image amplitudes, and the results of the classification based on the normalized T2 distribution. 1.1 Bayesian Inference In this section, we will give the background needed for understanding the inference method used throughout this work. For the sake of completeness, we will first outline the basic ideas behind the method. . Fundamentals The goal of Bayesian inference (or better, "Probability Theory as Logic") is the rigorous development of a procedure [6] for assignment of probability (a real number) to encode the information regarding the truth of a proposition which is a member of a well defined set of propositions. The starting point for this process is a set of desiderata[6], [7]: 1. Representation of plausibilities (abstracts) by probabilities (real numbers). Chapter 1. Introduction 5 2. Qualitative correspondence of these assignments with common sense; i.e. mono-tonicity (the bigger our confidence in the proposition the larger the probability attached to it) and continuity (infinitesimal change in the state of knowledge can result only in infinitesimal change in the probability). 3. Consistency: if a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result. 4. No use of information that is not available. 5. Equivalent states of knowledge will lead to equivalent assignments of probability. This set of desiderata leads to a unique procedure for assignments of the probabilities in any well posed inference problem. By ''well "posed? we mean both the description of our state of knowledge by real numbers, on the one hand, and well posedness of the question asked in the Hadamard sense on the other hand. That is: uniqueness, existence, and continuous dependence of the answer on our state of knowledge. The notion of probability here is freed from the need to correspond to any frequency. In the natural sciences, we use probability theory to encode (by the probability) our state of knowledge regarding the truth of a mechanism. The rigorous development of the theory can be found in [6]. A concise discussion of Bayesian inference and the difference from traditional, frequentist, statistics can be found in [8]. 1.1.1 Manipulation Rules The beginning of the treatment of any problem is the unambiguous statement of our state of knowledge and the question for which we are seeking an answer; the propositions' probabilities (or ratio of probabilities) we want to find. We then use probability theory to Chapter 1. Introduction 6 manipulate the desired probabilities to a form consisting of probabilities we can directly assign (direct probabilities). We give the manipulation rules in this section, and discuss the assignment of the direct probabilities in section 1.1.2. This section is based on [9]. We use the notation introduced by Jeffreys [10]: p(A/B) is the probability that A is true given that B is true where the arguments are propositions (statements), which are either true or false. The probability p(A/B) describes our state of knowledge about the truth of A assuming B is true. The proposition AB (or A • B) is true if both A and B are true and the proposition A.+ B is true if either A, B or AB is true. The proposition A signifies the negation of A; it is true if A is false. A form that appears many times in Bayesian works as the probability to be found (and will be used extensively in this work) is p(Hi/Dl). Hi is a member of a set of propositions (hypotheses) {Hi}, whose probability we want to assess in light of some data, D, and some background information / . The background information must include the statistical model (the relationship between the data and a true hypothesis), the specification of the hypotheses set and the probability of each of the members of {Hi} before having knowledge of the data (prior probabilities - see below). The basic rules emerging from the theory are the sum rule, p(Hi) + p(Hi) = 1 and the product rule p(HiD/I)=p(Ht/I)-p(D/HJ)=p(D/I)-p(Ht/DI) We use exclusive propositions, and thus, from the sum and product rules we have: p(Hi + Hi/I) =.p{HiII)+p(HJII) (1.1) Chapter 1. Introduction 7 As demonstrated in [1], once we state a set of propositions, we can treat them as exhaustive without influencing any of our quantitative solutions. That is: 5 > W ) . = 1 , (1-2) i a normalization rule. A n important result of the product rule is Bayes theorem: p(Ht/DI)=p(HjI).P^j^ We usually use Bayes theorem to describe the process of learning about the probability of the proposition Hi in light of the truth of a new proposition (usually data) D and other known information (background knowledge) I. In such a case: • p{HijI) is the prior probability. It is the description of our knowledge about the truth of Hi before we saw the current data, D. • p(D/HiI) is the probability of collecting the set of data £), given a proposition Hi. It is the likelihood of Hi in light of the data D: • p{D/I) is the global likelihood. By eq.(1.2) the global likelihood is p(D/I)= £ p{HdI) • V{DIHd) HiC{H} ensuring a normalized posterior probability, given a normalized prior. • p(Hi/DI) is the posterior probability. It is the description of our knowledge about the probability of Hi after we saw the data, D. Chapter 1. Introduction 8 A l l the rules introduced so far hold when our set of models are continuous and the probabilities are replaced by probability density functions. Assuming our• proposition Rk is parameterized by the parameter 0 , then, H^Q = uHk is true and the value of its parameter is 0 " . Then, by repetitive application of eq. (1.1) and taking the limit of the sum to an integral we have: p(Hk/I)=p( J HkQdQ / / ) = J.p'HkQ/I)-dQ . (1.3) Where the integrals are over all the values available to 0 ; jp(Q/I)dQ = 1. The rigorous justification and discussion of the passage of the discrete propositions to continuous ones can be found in [6]. M o d e l Selection Usually, the propositions {Hi} are models describing the phenomenon under investiga-tion. Often we want to compare a number of parameterized models, which may differ in form or number of parameters. We do so by comparing the posterior probability p(Hk/DI) for each of them. Bayesian model selection has a built-in penalty for com-plicated models: their posterior probability is larger than the posterior probability of simpler models only if their additional complexity is justified by the data. In model selection, we often speak about the ratio of the probabilities instead of the probabilities directly. The ratio Oij — p{HijDI)/p(Hj/Dl) is the odds in favour of model Hi over model Hj. By Bayes theorem we have: Pin/pi) _ mm • = 13 M/Dl) piH3/i).^l = p(Ht/I) pjD/Hil) piHj/I) p{DlH3I) • Chapter 1. Introduction 9 That is, the odds is the ratio of the prior probabilities multiplied by the ratio of the likelihoods. Note that the Bayesian. model selection relies on the global likelihoods of the models and not on the maximum likelihood as is usually done in the frequentist statistics. The maximum likelihood (i.e., the maximum of p(D/H{{Q}I) where {0} is the set of pa-rameters characterizing the model), is achieved by assigning the parameters values that maximize the likelihood. By relying on the global likelihoods of the model, p(D/HiI), the Bayesian procedure calculates the odds regardless of the parameters that enter each model. The transition from p(D/Hi{Q}I), being the assigned direct probabilities to p(D/HiI) is done by marginalizatiou] the application of eq. (1.3) : p(D/HJ) = JP(D{e}/HJ)d{e} = Ip({0}/^,-7) .p(/J/{0}/J t-/)ci{0} (1.4) It is here where the more complex model (the model that has more parameters) is penalized for its' extra complexity. We will demonstrate it by considering the following simple case [9]: Consider two models, M\ with a single parameter 6 and M0 with 0 fixed - at some default value #05Thus M0 has no free parameters. By application of eq. (1.4), the odds is, = p(Mo/I) p(D/M0I) 01 piMJI)' p(D/M1I) • -= P(M0/I) p(D/90M0I) -p{M1II)' Sv{6II)-p(DI0MiI)d0 Where the integration is over all possible values of 6. Now, in most cases, the likeli-hood 1(6) = p(DjOI) is much more peaked than the prior p(9jl). Let & be the value of 9 at the maximum likelihood, and approximate the prior by the constant p(9/I). Thus, Chapter 1. Introduction 10 Coi ct — p(O/I)-fl(0)dO Next, we replace the likelihood integral by its maximum amplitude 1(0), multiplied by the characteristic width 80. 80 — f l(0)d0 / 1(0). Finally, approximating p(0/I) by the reciprocal of a characteristic width of the prior, AO, we have: The likelihood ratio in the first term can never be greater than one, since 1(0) is the maximum of 1(0); it can never favour the simpler model. The data will always be fit better by the model with additional parameters. Alas, the second ratio penalizes the ' complex model if the width of the likelihood, 80, is smaller than the width of the prior, AO, and this term in general will be larger than one. The odds , assuming no prior preference to one of the models ( p(M\/I) = p(Mo/I) ), will favour the complex model only if the data have sufficient evidence in favour of the more complicated explanation. Probability theory both quantifies such evidence and determines how much additional evidence is "sufficient" through the calculation of global likelihoods. As demonstrated by Bretthorst [1], the additional goodness-of-fit (measured by the difference of the data from the model predictions) needed in order to overcome the penalty for the additional complexity of one additional parameter can be calculated, and depends very weakly on the width of the prior. The quantitative results of Bretthorst are not directly applicable to our case, since any additional parameter will change substantially the probability distribution of other parameters. Nevertheless, the qualitative results regarding this weak dependence on the width of the prior remains true. Chapter 1. Introduction 11 Parameter Estimation Many times we are interested in the values of the parameters that enter our model. By application of Bayes' theorem we usually arrive at the determination of the joint probability: P({Q}/DI) That is, the probability as a function of all the parameters. When coming to estimate the value of one of the parameters, the rest of them are nuisance parameters. We can get rid of the nuisance parameters by marginalization. Assuming we have a model with two parameters, 0 and $, and we want to concentrate on the estimation of 0 . By eq.(1.3), we have:' p{e/DI) = jp{m/DI) • al® where p(Q/DI) is the one-dimensional probability distribution as the function of the parameter under investigation. It is called the marginal posterior distribution. Strictly speaking, the Bayesian procedure brought us so far, but not further. It does not determine the value of the parameter at any single point in the parameter space, but rather the full marginal posterior distribution. Nonetheless, we are often interested in the representation of our results in the form of single number to be taken as our final estimate for the value of the parameter. Possible estimators for the real value of the parameter are the posterior mode (most probable value of 0) and the posterior mean < 0 >= J Q-p{e/DI)-dQ If the mode and the mean differ substantially, the distribution is too asymmetric to be adequately represented by a single number. The allowed range for a parameter with Chapter 1. Introduction 12 a probability content C is given by a credible region, i?, defined by: / p(Q/DI)-dQ = C JR with probability density everywhere inside R greater than everywhere outside it. Due to computational difficulties, not all the nuisance parameters in our work will be eliminated by marginalization. we will then remain with a joined probability distribution of some of the parameters, and the task of estimating the accuracies of their estimates will have to be done.with care. 1.1.2 Direct Probabilities The Bayesian procedure gives us tools for manipulating known probabilities into other probabilities. We are left with the task of describing our initial knowledge in form of probabilities to be assigned. We call these probabilities, which have not been a result of manipulation of other probabilities, direct probabilities. The assignment of direct probabilities involves in general the reduction of knowledge from other forms into real numbers. It is thus an open problem. We will treat direct probabilities of two groups: prior probabilities and likelihoods. Prior Probabilities The prior probabilities are the description of our knowledge prior to seeing the data of our present experiment. It may be the posterior probability of some previous experiment, knowledge arising from the same experiment but not used in any other place (see chapter 5 for example), or any other knowledge regarding the propositions/parameters we have that may be relevant to our problem. An important subclass is the representation of ignorance; the probability we should assign to each proposition (or probability density Chapter 1. Introduction 13 for the value of a parameter, when our models are continuous), when we have no prior information that will prefer one over the others. We use this kind of prior in numerous occasions through this work. The assignment of prior probabilities that represent ignorance usually can be helped by the invariances that the propositions/parameters should obey; i.e., the transformations that can not change our state of knowledge about the problem. Two important cases used in our work are: • Location parameters: these parameters obey the symmetry of translation. The state : of knowledge cannot be different if the origin is changed. In the discrete case'it amounts to the invariance of state of knowledge under permutations of the propositions' labels. We assign location parameters uniform prior distributions in the case of continuous parameters and equal probability for all propositions in case of discrete parameters. That is, the normalized prior of a location parameter is for the case of continuous parameter, and = NumL of H <L6' for discrete models. Scale parameters: these parameters obey the symmetry of scaling. The state of knowledge cannot be different if the unit of measurement is changed. The prior distribution that obeys this symmetry is the Jeffreys' prior[10] which has the nor-malized form : Chapter. 1. Introduction 14 Where 0# and 0^ are the upper and lower limits of 0 . Note that the logarithm of a scale parameter is a location parameter. We will make use of this fact in chapter 3. A broad treatment of assignments of prior information and limiting processes can be found in [11], [12] and [6]. A -concise treatment that is more adequate for this work is in the opening chapter of [1]. , - . L ike l ihoods The likelihood p(D/HI) is the probability of collecting the data given the model is true. In case we have all the information about the mechanisms producing the data, we can incorporate it into the model. Then, the likelihood is either 1 or 0/ probability theory reduces to deductive reasoning. •/• In most of our real life problems (i.e., the problems where inference tools are required) we allow the possibility of data different from the model given the model is true, due to noise. The noise, in the Bayesian interpretation, is not a 'random' process, but merely a process where our ignorance regarding the producing mechanism is sucli that its effect on the data is not reproducible. Any known behavior of the noise can, in principle, be extracted from the noise and incorporated in the model. The noise will include any effects, systematic or not, that are present in the data but not in the model. Thus, the form of the likelihood is the statistical model - the probability of material-ization of the noise. , ... . 1.2 M R I - T2 D i s t r i bu t i on 1.2.1 T h e D a t a The data used through this work was supplied by Dr.Kenneth Whittalland Dr. Alexander MacKay from the department of radiology at U . B . C . The apparatus and the acquisition Chapter 1. Introduction 15 Tissue Tota l number of pixels Genu Corpus Callosum ,1908 Splenium Corpus Callosum . 2204 Minor Forceps 2886 Major Forceps 5041 Internal Capsules 1936 Head Caudate Nucleus 1656 Put amen 3234 Globus Palidus 831 Thalamus . 2157 Insular Cortex 5755 Cingulate Gyrus 3414 •Cortical Gray 4313 Table 1.1: The Database. method is described by them in [3]. The data consist of decay curves acquired from brains of 11 normal volunteers. The data from ten of these volunteers serve as 'learning data' for choosing a model and producing the probability distribution of the parameters for each tissue, and the data from the eleventh volunteer serve as test data for the classification process. For each volunteer, volumes were outlined on M R I image for 5 white and 7 gray matter structures. The data numbers are the spin-echo amplitudes of each pixel inside the outlined tissue for 32 spin-echoes with echo-spacing of 10ms. Typical decay curves for twenty pixels taken from different locations and volunteers but from the same tissue are displayed in Figure 1.1. The learning database is given in Table 1.1. We want to investigate the shape of the T 2 distribution, regardlessdts total amplitude. Thus, the total amplitude is a nuisance parameter and we wish to remove it in this step. We multiply each data set by a factor that will bring the linear extrapolation of the origin to 1000 amplitude units. The resulting data set is equivalent to the original as no relevant information has been changed. This 'normalization' replaces the more rigorous and proper marginalization due to computational consideration. Chapter 1. Introduction 16 Time [ms] Figure 1.1: Raw Data: Decay Curves of Pixels taken from Major Forceps. The variance of the amlitudes is evident. 1.2.2 Background Knowledge Our state of knowledge prior to seeing the data will determine our set of models, assign-ment of noise distribution (the statistical model) and assignment of prior distributions (both the prior distribution of parameters within models and the prior probability of the models). We wish to incorporate any known systematic effect into our set of tested models and thus reduce the resulting noise. We discuss the background knowledge at hand. Each T 2 distribution is expected to be composed of decaying exponents. The decaying exponents may have different decay rates and may have a distribution of decay rates. We assume that all samples of a particular tissue will exhibit decays of the spin-echo signal according to the same T 2 distribution; That is, we do not take into account the variance of the T2 distribution between different volunteers and between different locations in the same volunteer. Chapter 1. Introduction 17 Here we enumerate effects expected to show in the data in superposition to the T% distribution. The numbers given in the data are the M R I amplitudes of pixels in the image for 32 sequential images generated from the maximum of the 32 spin-echoes in a C P M G sequence. This process includes the following known contributions: • Fourier transform artifacts. The D F T is a sufficient statistic for optimal frequency estimation when looking for one Sine/Cosine in the data [13]. The evenly sampled Sine/Cosine functions are almost orthogonal and thus the D F T is still an adequate estimator in case of well-separated frequenciesfl]. In the case of M R I images, the frequencies are close to each other and the amplitudes suffer from artifacts, both from the neighboring frequencies and the finite dimension of the sampling window (256 pixels). No publication of a practical Bayesian solution for Fourier-transformed image reconstruction is known to the author. The contribution of this effect is assumed to be small compared to other noise sources and spatially dependent. We will not try to incorporate any knowledge about this effect beside its finite variance. • Imperfect 180° pulses and imperfect corrections artifacts. The presence of imperfect 180° pulses in the sequence give rise to both ghost image artifacts and Ti artifacts [14]. A descending gradient crusher sequence had been applied in order to eliminate these contributions [3] and is expected to eliminate most of the stimulated echoes. Nevertheless, no weighting had been applied in compensation to the main spin-echo degradation by the stimulated echoes. The sum of these effects is finite variance. A systematic component, if any, will be alternating due to the nature of the ghost image artifacts and 7\ artifacts. Its magnitude is assumed to be roughly proportional to the total signal along the decay. We do not attempt to incorporate any knowledge about initial magnitude or sign Chapter 1. Introduction 18 of this effect. • • Flow artifacts. A laminar fluids flow will give rise to even-echo-rephasing and thus we expect an alternating effect with negative amplitude for the odd echoes, and finite (yet unknown) magnitude. Again, we assume that the magnitude of this effect is proportional to the total signal along the decay. • Magnetic field inhomogeneities. The effect is unknown but assumed to have finite variance. • Variance among volunteers. In trying to estimate the parameters of each of the tissues we assume that same tissues in different patients will exhibit exactly the same T 2 distribution. The extent of the error of this statement is unknown, but finite. • Variance inside tissues. Not all the tissue elements are identical, even within one volunteer. Again, the extent of the error in this assumption is unknown, but finite. • Region OfTnterest (ROI) contaminations. The data pixels were collected by assign-ing ROI volumes over the M R I image. The extent of the error in wrongly labeling pixels is unknown. Due to the finite difference between the tissues this error is finite. ' . • The data are expressed by integers. The contribution of this rounding to the noise has a symmetric uniform distribution with unit width. Of this knowledge, we incorporate into our knowledge only that this contribution is finite and unknown in sign. We expect the effect of this loss of information to be negligible, as the other contributions to the noise are expected to be much larger. Chapter 1. Introduction 19 • There may be other effects we do not know about. These effects assumed to have finite magnitudes, but are otherwise unknown. • Unknown effects in the measurement process may give rise to a DC level. This contribution is unknown in sign and magnitude. The magnitude assumed finite. The background knowledge can be summarized as follow: • Possible systematic effects: 1. Decaying exponents of unknown amplitude and decay rate. The expected decay rates are in the range of 1-3000 ms. The amplitude of each such exponent is in the range 0-1000 amplitude units. 2. Decaying exponents of unknown amplitude and with distribution of decay rates of unknown center, shape and width. The decay rates and amplitude of these components have the same ranges as in 1. The T2 values of these components can range from half to twice the main T2 value. 3. Alternating signal of unknown initial amplitude or sign, but proportional to the total signal. 4. DC signal of unknown amplitude and sign. • The non-systematic effects are expected to be of finite variance but are otherwise not known. Different states of knowledge may be introduced. For example, one may understand the measurement process better and thus come up with different models for the systematic effects. Chapter 2 Bayesian Classification In this chapter we discuss the Bayesian procedure for classification of data. Section 2.1 describes the classification of a new data set into classes defined by old data sets. This is a generalization of the model selection procedure to cases where the models are given in form of probability distributions. We use this procedure in Chapter 5 when classifying data of image pixels into one of several tissues, using the data acquired from other collections of samples from these tissues. In fact, our implementation in Chapter 5 approximates the probability distributions of the parameters (of the models that have been selected to describe the old data sets) by their values that maximize the likelihood; it thus reduces the problem solved to the model selection problem. In Section 2.2 we describe the test for deciding whether two data sets came from the same mechanism. In many problems we are not sure that our assumption about the completeness of the mechanisms (that produced the old data sets) is true; the new data may have come from some other mechanism that produced none of the old data sets. For example, when classifying pixels to one of the gray and white matter brain tissues, some of the pixels come from the skull; not having the skull bone as one of the classes results in an erroneous classification of these pixels to some other tissue. This test, being based on the bayesian ability to consistently compare models with different complexity, has no traditional, frequentist, analogue. It can serve as a test of the completeness of our classes and thus enables the automated definition of new classes [15]. In the special case, when one of the models is 'zero signal', the procedure is reduced 20 Chapter 2. Bayesian Classification 21 to the Bayesian signal detection. The solutions to both procedures, signal detection and whether two data sets are produced by the same mechanism, are given in Appendix B for the case of On/Off measurements of Poisson distributed data. For that case, the solutions are given in a closed form. The solutions to the problem of T2 distribution of the brain tissues will not be given in this work, due to computational difficulties. 2.1 Classification Onto 'Old' Data Sets We assume that each of the 'old' data sets was produced by a different mechanism and that the mechanism that produced the new data is one (and only one) of these mechanisms. This assumption is part of our background knowledge, I. The question we are asking is: For which M ; the posterior probability p(Mi/dnewd0idI) is the highest, where: d n e w = "The new data set in question", and Mi = . "the I 1 6 W . d c l t c L dnew was produced by the same mechanism that produced the ith old data set liow,": This Mechanism is described by a choice of model (a functional form) M., and model parameters values 0 . By Eqn. (1.1) and (1.3), the posterior probability of M,- is given by p{Mi/dnewdoldI) = oc | MaC{M] (2-1) where: • Ma is the ath model. Chapter 2. Bayesian Classification 22 • The summation is over the set of possible models, . • 0Q. is the set of model parameters values, parameterizing A4a. • tta is the parameter space of model a. In most cases, in the classification phase where the functional form.is well determined by the old data set, we will constrain the model to its most probable form: p(MaMax/dnewd0idI) = 1 and p(Ma/dnewd0idI) = 0 V M a ^ M a M a x Where A4aMax is the most probable model, describing the old data set i: . p(MaMax/doldlI) = Max [{p(M-a/doldi I)} I M a C {M}] . Thus, we can omit from now on the symbols a and A4ai denoting the model. We constrain a to aMax, and assume M.aMax is the true model, describing both dnew and doidi • So, in terms of the parameters of MaMax' Mi = = 0", where \I> is the set of parameters that describes dnew, and Eqn. (2.1) becomes p(Mi/dnewd0idI) - / p(Mi<2)/dnewd0iI)dO By Bayes theorem we have: = p(MiQIdoidi) • —— ——-—dQ •hi • p{dn,„,/dMl) And using the product rule, p(Mi/doldI) p{dnewId0idl) JQ f p{®lMidoldI) p(dnew/MiQdoldI) dQ (2.2) Chapter 2. Bayesian Classification 23 Now, • p(Mi/d0idI) is independent of dold ; p(Mi/doldI) = p(Mi/I). • p(dnew/d0idI) is constant as a function of the index i, and will be cancelled in the comparison among the posterior probabilities of the Mi. • The first term in the integral is the posterior probability of the parameters given the • ^ = Mi • 0 and thus the second term in the integral is equal to p(dnew/^I) with \f/ numerically equal to 0; the likelihood the parameter set 0 in light of the new data set. This quantity is independent of the old data set. p(Mi/I) = # Qf old1 data sets = cons^-)i the posterior probability of M , is just the summa-tion (over all the possible sets of parameters) of the probability of the new data set to be produced by a mechanism having a particular parameter set, weighted by the probability of this parameter set to be the one that produced the old data set i. The distinction between the roles of the new and old data sets vanish if dnew is not constrained to be produced by one of the mechanisms of d0id. By Bayes theorem and using the known independencies, Eqn.(2.2) then becomes Therefore: (2.3) Thus, beside normalization constants and assuming no preference to particular i (i.e. p(Mi/dnewd0idI) Chapter 2. Bayesian Classification 24 (A P / ( r w P m I ^ l 1 } P f e / M , G / ) P ( C / M , 0 / ) dQ-. (2.4) p{d0id/I)p{dnew/I) Jn In such a case, as common sense may already suggested, the proposition M,- is sym-metric in regard to the old and new data sets. It does not matter which data set had been sampled later in time. 2.2 Model Comparison In Section 2.1 we chose one 'old' data set, for which the probability of the new data set to come from the same mechanism as this set, was the highest among all the old data sets. We want to test the model of choice (from Section 2.1): Did the same mechanism produce both our old data set of choice, and the new data set? In this Section we give the general solution for this problem. We describe the test for deciding whether two data sets are produced by the same mechanism. As demonstrated by Bretthorst [1], This question in its general form is meaningless. We cannot compare our model to an ill-defined model of "all the other mechanisms that are not the one which produced the old data set". Nevertheless, the proposition becomes well-defined when we restrict ourselves to the functional form of one. (say, the old) of the data sets. The proposition "the models describing the mechanisms that produced the old and new data sets are of the same functional form but with different values of the parameters" is a well defined and computable proposition. From now on, we will omit the subscript i. We will do everything inside our model and functional form of choice. We want to find the odds ratio of: • Mo = "The parameters of the mechanism responsible for the new data set are the r1 Chapter 2. Bayesian Classification 25 same as those of the mechanism responsible for the old data set; *P = 0 " O V E R • Mi = "The parameters of the mechanism responsible for the new data set are different from those of the mechanism responsible for the old data set; \t ^ 0 " Where VP and 0 , as in Section 2.1, are all the parameters of the model but excluding the functional form. Now, the posterior for M0 had been already calculated in Section 2.1 for every member of the old data sets (Eqn. (2.4)). So, similarly to Eqn. (2.4): p(M0/dnewdoldI) = . P(M0/I) f p(e/I) P(dold/M0QI) p(dnew/M0QI) dB (2.5) J $"2 p(d0id/I) p(dnew/I) Jn and Q _ p{M0/dnewdMI) 1 - p(M0/dnewdotdI) 1 • is the odds ratio in favour of both data sets (new and old) being produced by the same parameters. Note that the global likelihoods p(d0id/I) and p(dnew/I) now stands for the summation over the two possibilities, Mo and M j , instead of over the set of old data sets. This point would not matter since the global likelihoods are independent of Mo and Mi, and will cancel in any comparison of these propositions. Again, p(dnew/MoQI) is the likelihood of the parameters in light of the new data set since Mo© = In order to arrive at a form consisting of assignable direct probabilities, we will manipulate explicitly the probability of M i and introduce the odds ratio as (~) p(M0/dnewdoldI) W o 1 - p(M1/dnewdoldI)-Chapter 2. Bayesian Classification 26 In a similar manner to M 0 , the posterior of M i is given by p(M1/dnewd0idI) = = (A P / r w P ,T\ I P^l1) P(doid/Mi®I) Pid^/M.BI) d© . (2.6) P[doid/I) p{dnew/I) Ju Now, the third term in the'integral is no longer the likelihood of the parameters that produced the new data set, since M i 9 , unlike M o 0 , is not equal to We still have to span this term: p ( < W M i 0 / ) = / p K e ^ / M a O / ) d^ = J^/M^pid^/^M^d^ and Eqn.(2.6) becomes p ( M i / d n e w d o l d I ) = and the resulting odds ratio is: = p{M0/dnewdoidI) = p(M0/I) 01 p{Mxldn&wdoldI) piMJI) Igpje/i) P{dold/M0ei) P(dnew/M0ei) dQ fnp(e/I)p{dold/M1QI)dQ.SnP{*/I)p(dr^/*M1I)d*. ' [ • ) Eqn.(2.7) is the expression of the odds ratio as function of direct probabilities. Chapter 2. Bayesian Classification 27 From inspecting Eqn.(2.7), we see that having the same likelihood of the parameters in light of both data sets, p(dnew/QMoI) = p{d0id/QI), wil l not lead to the same odds for all shapes of the likelihood. Situations where Ooi ^ 1 ( i-e. approaching certainty that the data sets came.from the same mechanism) will be achieved only when the likelihoods are equal and are both very peaked. That is, the parameters are well estimated by each data set and these estimations are equal. Another interesting limit is the neighborhood of no new data. In this, case p(dnew/QM0I) p(dnew 1^M\I) = Const, and Eqn.(2.7) becomes: = p(M0/dnewdoldI) = p(M0/I) 0 1 ~ p(MildnewdoldI) ~ piMjI)' p(cWM)©/) • J n p ( 0 / / ) p(dold/M0QI) dQ p(dnewimMJ) • fnP(Q/I) pid^/MrQ^dQ • fnpWI) dV 1 ' } P(Mo/I) p(Mi/I) As may already be clear from common sense, in a case where we know very little about a mechanism, the chances of having this mechanism equal to any other are hardly influenced by any information we have about that other mechanism. No matter what are the other mechanism parameters, or how accurately they can be determined, the probability of our 'foggy' mechanism to have these parameters, or any others, in the parameter space fl will be solely determined by our prior knowledge. We will not implement this procedure to the problem of tissue classification. Dealing with the computation problems involved in this implementation is beyond the scope of this work. In Appendix B we bring implementation of the procedure for the case of Pois-son distributed data ( a set of On/Off measurements in the astrophysics nomenclature). In that case, the solutions are reduced to expressions with closed form. Chapter 3 M o d e l Selection We now turn to applying the principles discussed so far to the brain tissue samples. In the following two chapters we will find the probability distribution of the parameters that enter the models describing the data set for each of the tissues. This is a two step process: First, we have to define the set of models and choose a model (a functional form) for each tissue. This will be done in the current chapter, applying the principles of model selection described in Section 1.1.1 . Second, in the next chapter, we will use the probability distribution of the parameters of this model to estimate their values and the accuracy of these estimates. 3.1 The Set of Mode l s In order to choose the functional form of the T2 distribution of a given tissue, we should address the following question: • Given a finite set of models, which one best describes the data and our prior information? The set of models to be examined will be defined according to our background knowl-edge about the systematic effects expected to appear in the data, as defined in Section 1.2.2. 28 Chapter 3. Model Selection 29 We use the notation and derivation introduced by Bretthorst [1]. Each of our models is a linear combination of base functions, that is, the model Jvia for tissue a as a function of the time t, is of the form: Ma(t) = ,£iBjaGia(t,T£) (3.1) j=i Where: • TJ^ is the set of the nonlinear parameters of the base function Gja. • ma is the number of base functions appearing in model a. • Bja is the amplitude of base function Gja. Our question will be answered by comparing the posterior probability p(j\4a/Dl) among the models. 3.1.1 The Base Functions, Gja The set of models is specified as follow: • 1 < ma < 7 . Our model is a superposition of no more than 7 base functions. The model functions Gja represent the various systematic effects expected to appear in the data, according to our background knowledge. Gja can take one of the following forms: Delta Function in the Laplace Space Described by the exponential decay: Gja(t) = e-*'**-Chapter 3. Model Selection 30 This base function has one nonlinear parameter: the location of the delta function in the Laplace space =the decay rate T2. Gaussian in the Laplace Space Described by a series of exponents in the form: Cjn(!) = 0 . 4 - e x p ( - ^ - ) + 0 - 2 • ( e X P ( - T 2 , - , - ( l + 1 . 0 7 . ^ o ) ) + e X P ( - T 2 j a / ( l + 1 . 0 7 . W j a ) ) ) + 0.1 • ( ^ ( - ^ . a . ( 1 + l . 6 7 ^ This base function has two nonlinear parameters: the location (T2ja) a n d the width-fraction Wja of the gaussian in the Laplace space. It approximates a gaussian having width of T2ja • Wja and unit integral . The gaussian is symmetric when plotted as function of Log[T2]. This form was chosen to represent a peak with added structure (width) over the delta function, but no other information regarding the distribution of decay times. Al ternate-Echo-Refocusing ( A E R ) Described by: Gja(t) = (-l)k.exp(-^-) Where k is the index of the measurement point; k=l,2,...-32. The alternating signal is multiplied by a decaying exponent having decay rate of 70ms as a first order approx-imation to the decaying of the total signal. This base function does not have nonlinear parameters. Chapter 3. Model Selection 31 A Background ( D C ) Level in the T i m e D o m a i n . Described by: Gja(t) = l This base function does not have nonlinear parameters. A Linear Component in the T i m e D o m a i n . Described by: Gja(t) = 0.19146(—1.65 + 0.01 • t) This base function is added in order to improve the algorithm stability and to reduce computation time. Exponential decays with long decay rates (T2 ~ 1000ms) will appear in our data as almost straight lines. Allowing such a component in the data to be repre-sented by a combination of DC level and linear component will reduce substantially the computation time: We are trading a linear (amplitude) and non-linear parameter (the decay rate) for two amplitudes (the DC and the linear component). This is important since our computation time is polynomial in the number of linear parameters but expo-nential in the number of the nonlinear parameter (see 3.1.2). The contribution to the algorithm stability is achieved by the elimination of an insensitive parameter: the decay rate T2 when having values over T2 ~ 1000ms. The coefficients in the form of this base function were chosen so the base function will be normalized over the sampling points (j>2T Gja(10i) • Gja(10i) — l ) and orthogonal to-the DC level component(^f 2 Gja(10i) • 1 = OJin order to reduce computation time. This base function does not have nonlinear parameters. . Chapter 3. Model Selection 32 3.1.2 Search Path The previous Section defined the set of models to be tested: it is the set of possible combinations of the base functions Gja not exceeding 7 in their total number1. Alas, we do not wish to calculate the needed posterior probability p(A4a/Dl) for all of these models. Though their number is finite, the computation time needed for computing all of them will render our task impractical. Our algorithm computation time is of the order of ^ computation — *± ' rn *. Cl Where: • K is.a computation time coefficient of the order of seconds. - -• m is the number of linear parameters. • 7' is the number of nonlinear parameters. • a is a search-sampling coefficient: a ~ 25. That is, our computation time is polynomial in the number of linear parameters and exponential in the number of the nonlinear parameters. We need to find the most probable model without calculating all the models, and in particular we want to avoid unnecessary computations of models with many nonlinear parameters. We describe our model space as 3D discrete space, using the following coordinates: 1. The number of decaying exponents (with or without width). We will refer to this coordinate as 'Components' (C). C=l,2,...7 2Two constraints reduce the number of combinations: the AER component must always be present and the linear component cannot be present without the DC component. Chapter 3. Model Selection 33 Evaluate^ p>(Mref/r>ij Define M; Find new model not been tried yet Eliminate one Component/Width/Poly.comp. WNo - nllready tried A d d C o m p o n e n t and el iminate al l W id ths a n d Po lv .comp. ± No - allready tried E l i m i n a t e C o m p o n e n t a n d a d d W i d t h No - allready tried A d d Poly .comp. / W id th 3 No - ftllrendy tried A d d one Componen t and eliminate one Width/Poly .comp. No - allready tried No Y e p(M, /Dl) > p(M„f/DI) I Evaluate Figure 3.1: Search Path Flow Chart. The algorithm is looking for the maximum proba-bility in the space of models. 2. The number of decaying exponents with width as additional parameter. We will refer to this coordinate as 'Widths' (W).' W=0,1,2,...C . 3. The number of 'Polynomial Components' (P). P=0,l,2. P=0 When no polynomial component is present. P = l when a DC level is added. P=2 when both the DC and the linear component are added. A flow chart of the search path is given in figure 3.1. The algorithm is looking for the maximum probability in the space of models. In each step, the resulting likelihood is compared to the best model so far and replaces it if its likelihood is higher. For computation time efficiency, the algorithm first tries to Chapter 3. Model Selection 34 simplify the model. In the case where all possibilities to simplify the model exhausted, the algorithm will try more complex models. The algorithm stops once all possible steps from the model having the highest probability so far have been exhausted. We assume that the starting model of the algorithm has no particular importance. Nonetheless, for the sake of computation efficiency, our search path will always start from the solutions given by A.L.MacKay and K.P.Whittal l in their analysis of the data of one volunteer[3] . We assume that the maximum found is the global maximum in the defined space of models. In Section. 3.3.1 we will demonstrate the search path for two simulated data sets. In one of. them the starting model will be a simple model, one decaying exponent, and the search path will add parameters and components in order to reach an adequate description of the data, and thus higher probability of the. model. In the second simulation we will start from the 'right' model. The search will stop after finding that the initial model was in the maximum of the models space. 3.2 The probability of the Model p(Ma/DI) We compare the posterior probability of the model A4a to other competing models. By Bayes theorem .we have: *M.,Di)=«Mji).«y$p. Where: ' . • p(.Ma/I) is the prior for the model Ma. Lacking any prior information about the plausibility of the model comparing to other models, we will assign the same prior to all of them: • Chapter 3. Model Selection 35 for all a. • p(D/I) is a normalization constant. It is the same for all the models and thus will cancel out when we compare models within our set. Thus, the odds of each of the models compared to any other model is given by: _ p(Mk/DI) _ p'D/Mkl) h l p{Mi/DI) p(D/MiI) [ ' The odds is the ratio of the models likelihoods. We now have to express the likelihood in term of assignable direct probabilities. By Eqn. (1.3) and the product rule, we have: p{D/MaI) = J p(DQl/MaI)dec = Jp(E&/MaI)- Jp(^/XMaI)-p(D/X^MaI)d^dX Where: • Ba is the linear parameters set for the model a (amplitudes). • To* is the nonlinear parameters set for the model a (decay rates and width-fractions). • Qa is the parameter set for the model a (linear and nonlinear). • p(Ba/A4aI) is the prior for the linear parameters of model A4a. • p ( ^ a / B a A 4 a I ) is the prior for the nonlinear parameters. It depends, in general, on the values of the linear parameters Ba of model M.a. Chapter 3. Model Selection 36 • p(D/BaTaj\4aI) is the likelihood of the model with a specific set of parameters (both linear and nonlinear). It is a direct probability we have to compute. 3.2.1 The P r i o r P robab i l i t y Dis t r ibu t ions We will not include any information regarding the values of the parameters. One may think of incorporating information from previous experiments about the anticipated val-ues of the parameters or dependencies between them. Now, previous inVivo determina-tions of T2 decay rates and amplitudes [16, 17, 18, 19] do not agree2. Moreover, it is hard to ascertain any reliable credible regions from their reported results. Thus, any responsi-ble and conservative use of the information contained in these reports for assignment of prior distribution will take the form of very broad distributions for all the parameters. If the data will determine the parameters much better than the prior, the prior proba-bility distribution of the parameters will be essentially constant in the region of substan-tial contribution of the posterior probability distribution. Thus, the details of our hardly informative priors would not change much and do not merit the additional computational effort. The main feature of the prior that may survive in the final results are the scale of the prior distribution; the range in which the prior has a substantial magnitude. Therefore, we will use priors that represent this state of ignorance and according to their known ranges, as discussed in Section 1.1.2. Nonl inear Parameters Our nonlinear parameters, decay rates and widths, are scale parameters. We will express our ignorance regarding their values by assigning them a normalized Jeffreys' prior. We do not incorporate any information about possible coupling among them or their dependence 2 0 f course, we cannot.use the results of [3] as part of our prior information. They use the same data discussed in this work. Chapter 3. Model Selection 37 on the values of the amplitudes; that is, —> / a p{T^IBaMaI) = 1>(W MaI) k=l , ^ W ^ . / ) = L O G ( ^ ) . T R I (3.3) Where: • la is the number of nonlinear parameters in model a. • TakH = 3000ms and = 1ms whenever rafc is a decay rate. • rakH = 1 and rafcZ/ = 0.01 whenever jay. is a width-fraction. L inear Parameters Again, we do not incorporate any information about coupling of the amplitudes. That p(BJMaI)=~[[p{Bja/MaI) where ma is the number of base function in model a. Our linear parameters, the amplitudes are location parameters. We will express our ignorance regarding their values by assigning them a uniform prior: P(Bja/MaI) = — (3.4) BjaMAX Where: • BjaMAX = 1000 whenever BJ0MAX is the amplitude of a main component. Chapter 3. Model Selection 38 • BjaMAX = 60 whenever BjaMAX is the amplitude of a D C level. • BjaMAX = 100 whenever BjaMAX is the amplitude of a linear component. 3.2.2 The Likelihood We now turn to calculate the likelihood p(D/BaTaA4aI). The data produced in the experiment is described by: di = Ma(ti) + ei where e/ is the noise realized in measurement I. Given the model and a parameter set, the difference between a datum value and the value (of this datum) predicted by the model is assigned to the noise. Of course, we assign different values to the noise associated with each of the data points when we choose different model. In case the noise variance is known in advance, this gives us a tool for testing whether our model gives a proper description of the data [20]. We will follow the derivation and notation used by Bretthorst [1] with the extra care to our special case in which: • The signal to noise ratio is poor and thus not all of the approximations used in [1] can be directly applied. • Our data consist of multiple measurements. According to our background information, as discussed in Section 1.2.2, we assume the noise has a finite variance which is not known. Due to the lack of any other infor-mation about the noise (like correlations), we assign the noise a Gaussian probability distribution. This probability distribution arises from maximum entropy consideration Chapter 3.' Model Selection 39 [13]; the Gaussian is the distribution that maximizes the entropy under the constraint of finite variance. The assignment of a Gaussian distribution to the noise guarantees us that we made a conservative assignment by declaring total ignorance about the behaviour of the noise, beside its finite variance. If, indeed, the noise is correlated, or non-stationary, it will not lead us to erroneous results as we already took these possibilities into account. The probability of getting the value of a datum point, given the Gaussian probability distribution of the noise, its variance crs, and all the other parameters, is: —» 1 1 m p(di/B al^aMaasI) = e x p ( - — ( c i i - ^ BjaGjafa, r j a ) ) 2 ) V27TO-2 j = l For clarity, from now on we omit the index a that signified the model A4a and the explicit symbol of the model M.a. We do everything inside the model under consideration. The noise is assumed uncorrelated, and thus the probability of getting this data is the product of the probability of getting each of the data points: n ni p(D/^ BasI)=J[Hp(du/^ BaJ) i i . i . n ni m = ^ r ^ e x p t - — - E ^ G ^ i ^ ) ) 2 = Z ( J « 1=1 1=1 j=l = ( 2 7 r ( 7 s 2 ) - ^ e x p ( - ^ ) Where m n ni m m - . Q = nnid2 - 2 ^ ^ 2 ?jGj{ti, Tj) • ^ du +-nt J2 9jhBjBk j = l i = l 1=1 j=l k=l n 9jk — X^^i(^' ri) ' Gk(U,Tj) i=l • Chapter 3. Model Selection 40 • m = the number.of model functions. • n = the number of sampling points •= 32. • rti= the number of measurements of each sampling point = number of pixels for each tissue. In order to understand the implications of having multiple measurements and the relationship between the part of noise arising from the misfits of the model and the part of noise arising from the unsystematic effects, we let: a2 = — • » • / and use this quantity from now on. So, p(D/^~BaI) = (27Tnla2y^ exp( - -^ - ) (3.5) 2o~ Q = nd> - 2 £ E BjGjiU, Tj)d{ + ^ E 9ikBjBk j=l i = l j=l k=l Where: 2 « i di = — 52 du We see that d,, the average of the data points for every sampling time i , is a sufficient statistic for finding the maximum of the likelihood. It is not sufficient to establish the absolute value of the likelihood at any point of the parameter space. Chapter 3. Model Selection 41 Nevertheless, in case <r is known,.d{ is a sufficient statistic for any calculation of the parameters' values, since the likelihood dependence on the more detailed distribution of the data is through a constant factor. ' As common sense suggests, if a is unknown, the extent to which the data points are distributed will determine the probability distribution of a. The probability distribution of <T, in turn, will not influence the values of the parameters at maximum likelihood, but will determine the width of the likelihood and of the probability distribution for the parameters. In order to compute the global likelihood P(M. /Dl) for the model comparison and estimation of the parameters, one must integrate the likelihood given in eq.(3.5) over all the parameters. Having the base functions orthogonal, in the sense that: N 9jk — E Gjiti) • Gk{U) = 6jk i=l will enable us to analytically integrate the likelihood over the linear parameters. More-over, in case the base functions are orthonormal, the expectation values of their ampli-tudes are given directly by the projection of the data on the base functions [1]. In spite of the general dependency of the base functions on nonlinear parameters, the base functions will be, in many important problems, almost orthogonal. For example, in the problem of evenly sampled multiple stationary frequencies, with cosines and sines as the base functions, the off diagonal terms of gjk will be negligible as long as the frequencies are well separated. As indicated in [20], the problem of exponential decays as base functions is about the most pathological case one can encounter. Under no conditions will the g^ matrix be orthogonal. Presence of an additional exponent will always interfere with the estimation Chapter 3. Model Selection 42 of the existing ones. One of the consequences is that any estimation of the decay rate constants using a model with a wrong number of exponents will give rise to incorrect results even in the limit of zero noise. At this point, one may want to resort to numerical integration. This is a very difficult thing to do. Beside the obvious problem of long computation time of integration in a space of high dimensionality, the likelihood in this space is highly peaked in many ridges and maxima. The task of finding the maxima that have the substantial contribution to the integral is beyond the scope of this work. Any set of non-orthogonal base functions can be transformed to an orthonormal set; gjk is a real symmetric matrix and thus can be always diagonalized. This diagonalization itself (i.e., the eigenvalues and eigenvectors of,gjk) depends on the nonlinear parameters and cannot be done analytically. We will need to compute gjk, diagonalize it, and perform the transformation of the base functions with every computation of the likelihood. So, we transform the base functions to bring the gjk matrix into diagonal form. Let A/t be the kth eigenvalue of gjk and let Ckj be the jth component of the associated eigenvector/ The transformation of the base functions is: 2 m Hjit, ^ ) = —7= E eJkGk(t, Tjfc) . y Aj k=i and the amplitudes transform according to: Ak = \f*~k Bj.ekj i = i The Jacobian of this transformation is i = i V A i Chapter 3. Model Selection 43 The transformation depends on the nonlinear parameters and so does its Jacobian. The new base functions are orthonormal in the sense that: i=l and thus, our likelihood become: 1 m m p(D/~? A al) = ^ T r r w 2 ) - ^ e x p ( - — • {nd? - 2 £ Ajhj + £ A))) (3.7) J = I i = 1 where n 4 = 1 is the projection of the averaged data onto the orthonormal base functions Hj. The amplitudes have been separated and now the integration over them can be easily done. We multiply our likelihood by the prior for the amplitudes (given in Section 3.2.1) and integrate over them: p(D/^aI) = Jp{A/o-I) • p(D/^r*~£aI) • dr = = P ( A / / ) . n A - i . n r ^ ( 2 ^ ) ^ . a x P ( - ^ 5 ) (3,8) i=l ^ Where: m p(A/I) = Hp(B3/I) i=i is a different constant for each model, and Chapter 3. Model Selection 44 is the mean of the squares of the projections of data onto the orthonormal base functions H. mh2 is the scalar product of the data vector and the vector of values predicted by the orthonormal model functions. The nonlinear parameters that will have the highest probability are those that will maximize this scalar product. We are now left with the integration of the nonlinear parameters. This can not be done analytically. Moreover, we cannot apply gaussian integration or similar methods directly, since our signal to noise ratio is not good enough to give rise to narrow and elliptic peaks of the likelihood. On the other hand; the dependence of h2 on the nonlinear parameters makes any attempt to integrate the likelihood numerically very costly: any evaluation of h2 deserve new diagonalization and transformation of the base functions. We first transform the nonlinear parameters to a set in which the likelihood is almost elliptic in a sufficiently wide range that includes the substantial contributions to the integrals, and then use gaussian approximation to integrate them. We later use a 'safety valve routine' to ensure that our integral is approximated up to accuracy level of 20%. The transformation that will bring the likelihood bubble close to elliptic shape, is: Ui — Log(ri) . diii = — • d(ri) As discussed in Section 1.1.2, the priors of the time-scaled nonlinear parameters (the decay rates and width-fractions) are normalized Jeffreys' priors: ' P(n/I) = J \ T, nun / ' When coming to integrate over the nonlinear parameters the Jacobian of the last transformation will cancel with the priors. That is, r .p(~T*/MaI) d~r* = "[[ (p(ui/I) dui) t=l Chapter 3. Model Selection 45 where: p{ui/I) = —. Ui Range(Ui) We are left with: p(D/aI) = Jp(D/llT) pCu/l) d u -n-nt nl " "' • (2-KO" ) 2 1HL exp / mh2\ ,_, exp I „ „ I a u 2a2 J J ~~r \2a2 / and no additional Jacobian to take care of. What the last transformation did was merely transform the nonlinear parameters from scale parameters,to location parameters. Now that the likelihood is approximately elliptic in its variables over the region of substantial contribution, we can expand mh2m Taylor series to get: ' mh2 2^ m h2 2a2 2a2^ j,k=i where U are the values of the u's at maximum likelihood, Afc = uk - uk and (3.9) m d2h2 2 dujduk is the 2nd derivatives matrix obtained numerically.. The integral becomes Chapter 3. Model Selection 46 J e x p ( 2 ^ dTt mh2 f exp 2o2 • / exp u J 3,k=l L G dA In order to perform the integral we change variables again. This change of variables is not essential for the analytical integration. Nonetheless, it will separate the variables, give a more intuitive look to the likelihood, and will help us in the next Section to estimate the values of the parameters. So, let Vj be the jth eigenvalue of the bjk matrix, and let Wjk be the kth component of the jth associated eigenvector. The transformation that will rotate our ellipsoid in the nonlinear parameter space to principle axes, and thus eliminate off-diagonal terms in the 2nd derivatives matrix bjk, is given by: Sj = £ A , H / - , k=l SkWjk JVk with volume element: i=i and the integrand becomes: r S? e x p ( E ^ A ^ A = e x p ( E ^ \i,k=i — / \j=i Performing the r integrals, the likelihood becomes: (3.10) p(D/*I) = p{AjI) -HP^II) • n r n i • (2^a2):±:^ (3.11) Chapter 3. Model Selection 47 where ur=i A: is the Jacobian of the transformation of the base functions G to u the orthonormal base functions H using the values of the nonlinear parameters that will maximize the likelihood. The eigenvalues of the last transformation are treated in this stage by a 'safety valve routine' described in Appendix A. Our measure of the noise a, as discussed in the introduction, is not known. We can now treat o as a nuisance parameter: we will assign a a Jeffreys' prior and integrate over it. The normalized Jeffreys' prior for a is: p(a/I) = where an and a^ are the upper and lower prior bounds for a respectively. So, p'D/I) = / p'a/I) p(D/aI) do r+m — nnr i = i U n i = i V u ar+m-n.ni-l e x p nd2—mh2 2a2 da u. In this integration over the probability distribution of a, we assume we chose o.r, and an to be well outside the substantial contribution to the integral, so we can let a^ —» 0 and an —> oo without substantially change the value of the integral. Changing variables, Chapter 3. Model Selection 48 V the integral become: dV=~dc VL = — a2 O~JJ VH= — O-L L 'H-HX> V n-ni —1—771 — r V,-*0 • exp n d2 - mh2 •V2 n-rtj—r — m nd2 — mh2 u As indicated by the notation, p(cr/I) is independent of the model. We do not have any information that will differ the prior noise assignment for different models. Thus, the term Log (^) will appear in all the compared likelihoods and will cancel. Omitting the term Log {^) a n < ^ s o m e O I" the constants that will appear in all the models, p(D/M*I) a p( A//) •!!*(«///)• ( 2 * ) ^ IIK U L-II * i = i (3.13) (7 71-71; — r — m ^ nc?2 — mh2 ( 2 ) is the likelihood we evaluate for each model and compare among the models. 3.3 Results 3.3.1 Simulated Data We produced two sets of simulated data, one of 6 functions in the Laplace domain and one of Gaussians in the Laplace domain. Both produced with noise level of 0.3 units, Chapter 3. Model Selection 49 Discrete Component A m p l i t u d e T2 [ms] W i d t h [ms] Low T2 100.0 9.0 -Medium T2 600.0 85.0 • High T2 300.0 180.0 -Alter. Echo R. 0.0 - -Continuous Component A m p l i t u d e T2 [ms] W i d t h ms] Low T2 100.0 9.0 2.0 Medium T2 900.0 85.0 20.0 High T2 70.0 380.0 30.0 Alter. Echo R. 0.0 - -Table 3.1: Generating Parameters of Simulations, cr = 0.3[Amp. units]. The given amplitudes of the continuous components are the integrals over these components. . which, as we shall see in the next chapter, is typical to our experimental results. The simulated noise is Gaussian and uncorrelated. The generating parameters are tabulated in Table 3.1 and the resulting generated distributions are displayed in Figure 3.2. In order to demonstrate the search path (described in Section 3.1.2) in the 'complexity' space of models, we give the algorithm a very simple model, one decaying exponent, as a first guess. This is done in the continuous distribution case, and the tested models and the results are tabulated in Table 3.2. For the discrete distribution we start from a model having three decaying exponents. The tested models and the results are tabulated in Table 3.3 . For each of the tested models, we show the odds ratio of that model and the model having the highest probability. For each model, we also show the Bayesian estimation of the noise level. This estimation will be discussed in the next chapter; for now it is enough to regard the noise estimation as a measure of the misfit of the data and the model. As we see from Table 3.3, the algorithm detected the 'right' model for the discrete distribution. Chapter 3. Model Selection 50" 600 500 T J "5. E 400 < 300 200 100 0 - | | I 10 26 ' '5o'"i 'oo 260 ' fedo'M600 2060 T2[ms] T2 [ms] Figure 3.2: Generating Parameters for the Discrete (upper figure) and Continuous (lower figure) Simulated Distributions. The plotted amplitudes of the gaussian components are equal to the Gaussian integral and not to their actual amplitudes. Chapter 3. Model Selection 51 Model Results Components Widths Poly. Comp. ^ 3i3max a 1 0 0 1.7 • i r r 4 2 14.4 ± 2 . 0 1 o 1 1.6 • I O - 3 1 5.0 ±0.7 . 1 1 1 2.9 • Hr 2 2 2.9 ± 0 . 3 • ! • ' . 1 0 3.3 • IO"2 3 2.1 ± 0 . 3 ' 1 ; 0 2 4.2 • l f r 2 4 3.0 ± 0 . 4 1 1 2 1.7 • l f r 1 4 1.25 ± 0 . 2 2 0 0 1.3 • l f r 2 1 2.3 ± 0 . 3 2 1 0 3.9 • l f r 2 9 2.3 ± 0 . 3 2 . 0 1 3.9 • i r r 1 3 .91 ± . 1 3 •2 • .0 2 3.2 • i<r2 ; 37±0 .05 2 1 2 4.1 • 10\ 3 .36 ± 0.05 3 . 0 0 1 .31 ± 0.04 3 0 1 7.3 • l f r 3 .31 ± 0.05 3 1 0 7.3 • 1(T2 .31 ± 0 . 0 5 4 0 0 1.3 • l fr 1 .31 ± 0 . 0 4 Table 3.2: Search Path and Odds Ratio for the Continuous T2 Simulated Distribution, cr is a measure of the misfit between the data and the model. Model Results Components widths Poly. Comp. a 3 0 0 1 0.28 ± 0 . 0 4 4 0 0 0.0054 0.29 ± 0 . 0 4 2 0 0 6.210" 2 2 2.09 ± 0 . 3 2 1 0 5.210" 2 2 2.0 ± 0 . 3 3 0 1 0.0056 0.28 ± 0 . 0 4 3 1 0 0.07 0.28 ±0 .04 Table 3.3: Search Path and Odds Ratio for the Discrete T2 Simulated Distribution, a is a measure of the misfit between the data and the model. Chapter 3. Model Selection 52 For the continuous distribution, Table 3.2 , the algorithm found a discrete model with 3 components and not a continuous model of 3 components, as is the generating mechanism. To see what has happened, we plot the residuals produced by fitting the noisy data by the 'winning' discrete model. We also plot the difference between the predictions of that model with the fitted parameters, and predictions of a continuous model having the generating parameters (Figure 3.3). We see that the residuals are a full order of magnitude larger than the difference between predictions of continuous and discrete models. Thus, fitting the data with the continuous model will hardly decrease the residuals and will not be able to justify its additional parameters. In other words, the difference between the continuous and the discrete models is not relevant to our data; under the given noise level, they merely produce the same data. The Bayesian procedure finds no reason to prefer a more complex model if a simpler model can describe the data and the prior information just as well. Compar i son to E x i s t i n g Methods The closest frequentist method for comparing several models is the multiple-model method The models are compared based on some measure of "Goodness-Of-Fit" (GOF) of their solutions, where each model uses the set of parameters that give rise to the best G O F within the model. The most common measure of G O F for these model comparisons is the x2 • From the Bayesian point of view, the set of parameters that will minimize x2 is, in the case of Gaussian noise, the set of parameters that will maximize the likelihood. We can now see what question is answered by the multiple-model method: • "Given a set of models, which model, contains the set of parameters that has the highest likelihood in light of the data?" Chapter 3. Model Selection 53 Amp. 0.4 0.2 -0.2 -0.4 Time[ms] Figure 3.3: Model Selection Failure. The solid line represents the residuals produced by fitting the noisy data by the discrete model. The dashed line is the difference between the predictions of the fitted discrete model and the predictions of a continuous model having the generating parameters. The continuous model failed to reduce the noise level enough to overcome the 'penalty' it carries due to its complexity. Chapter 3. Model Selection 54 That is not what we are looking for. Our question is rather: • "Given a set of models, which model lias the highest probability in light of the data and any prior information that we have." . These are different questions and in general they will result in different answers. The difference between the Bayesian and the frequentist multiple-model method can be summarized in the following points: 1. No prior information about preferred models can be consistently incorporated into the multiple-model method. The result of the frequentist method is the likelihood of the parameter set and not the posterior probability. 2. In the model selection process we want to compare the models regardless of the values of their parameters. While the frequentist method is constrained to compare the models while choosing a particular set of parameters for every model (namely, the parameters that maximize the likelihood) the Bayesian procedure marginal-izes over all the parameters that enter the model, including the noise variance. The marginalization is equivalent to considering all the possible parameter sets, weighted by their probability to materialize. 3. There is no consistent way to penalize complex models in the frequentist approach (quantitatively implementation of the so-called Ockhams Razor). The quantita-tive implementation of Ockhams Razor arises naturally in the Bayesian procedure during the marginalization over the parameters. In this work we used flat prior distributions for both the models and the parameters inside each model to express ignorance about their values prior to seeing the data. As a result, in our case the first statement (Point 1) has no substantial contribution to the Chapter 3. Model Selection 55 difference between the frequentist and the bayesian methods. On the other hand, the issues mentioned in Points 2 and 3 are of great importance in a problem like ours. Several models exhibit sets of parameters for which the noise is reduced to its 'true' value. The model that will be picked by the multiple-model methods thus depends on the particular materialization of the noise in the data set and not on the systematic effects in the data; as all these models describe the systematic effects equally well, x2 will be determined by the non-systematic effects. The traditional frequentist answer to the problem of point 3 is to draw the x2 a s a function of the number of parameters, looking for the 'knee' in the results. In the case of few measurements (like in a 32 points sets ) this knee may be very vague. In view of the above, the difficulties in the frequentist approach to model comparison in our class of problems are such that they are hardly used. No implementation of multiple-model method for the problem of decaying exponents is known to the author. These difficulties, combined with the strong coupling between the parameters, that result in erroneous estimation of the parameters in case of wrong model, lead to strong resistance in the frequentist camp to the usage of non-linear methods altogether, considering the need to state an initial model as the major disadvantage of these method. In fact, the frequentist linear methods are not free from introducing a model either. A common linear model [4, 5, 21] is a set of exponents (100-200 is a typical number) with fixed decay rates. These models result in underdetermined solutions, which are then constrained by adding conditions like Non-Negativity and Continuity. These constraints, and in particular their quantitative implementation, are in general not justified by our state of knowledge; they represent information we in general, do not have.' Chapter 3. Model Selection 56 3.3.2 Experimental data We applied the procedure described in this chapter to the learning data, described in Section 1.2.1. For each set of measurements, the collection of data from all the pixels from the ten volunteers assigned to each tissue, we calculated the probability given by Eqn. (3.13). We calculated the probability for different models, using the search path described in section 3.1.2. In order to ensure that the starting model did not influence the results, we ran the program with two different starting models for each tissue: the first is the model given by MacKay and Whittall [3] (two to three decaying exponents) and the second is a model of one decaying exponents. For a l l tissues, the two runs ended with the same 'winning' model, supporting our belief that this model is the model of global maximum probability in the model space. As can be seen from Eqn. (3.5), the mean of the measurements and the mean of the measurements squares for each sampling time are sufficient for the calculation of the probability for any model. For each tissue, we indicate the models having the highest and second highest prob-ability, along with the Bayesian estimation of their noise (Table 3.4). The noise is a measure of the misfit of the model and the data; its estimation and the accuracy of the estimation will be discussed in the next chapter. Most of the models found for the tissues are discrete. We are reminded that continuous distributions, as the distribution used in the continuous simulation, will produce similar data to discrete distribution of the same number of components and thus may not be detected. Chapter 3. Model Selection Tissue Model 1 Model 2 Oh2 C W p C W P a Genu Corpus Callosum 3 o 0 0.3632± 0.001 3 1 0 0.3632± , 0.001 12.6 : 1 Splenium Corpus Callosum 3 0 0, 0.4004± 0.001 3 1 0 0.4004± 0.001 11.1 : 1 Minor Forceps .3 0 0 0.2916± 0.0007 2 0 2 0.2916± 0.0007 2.04 : 1 Major Forceps 3 0 1 0.2149± 0.0004 4 0 0 0.2149± 0.0004 9.33 : 1 Internal Capsules 3 0 0 0.5506± 0.002 3 1 0 0.5506± 0.002 8.09 : 1 Head Caudate Nucleus 3 0 .0 . 0.4722± 0.001 2 0 2 0.4728± 0.001 2.23 : 1 Putamen 3 0 0 0.2209± 0.0004 3 0 1 0.2209± 0.0004 2.1 : 1' Globus Palidus 3 0 0 0.4611± 0.002 3 0 1 0.4611± 0.002 1.42 : 1 Thalamus 3 0 0 0.2787±, 0.0008 2 0 2 0:2788± 0.0008 2.96 : 1 Insular Cortex 1 1- 2 0.4049± 0.0007 1 1 1 0.4049± 0.0007 28.1 : 1 Cingulate Gyrus 2 0 1 0.6493± 0.001 3 0 0 0.6493± 0.001 4.13 : 1 Cortical Gray 3 0 0 0.6471± 0.001 2 1 2 0.6471± 0.001 2.73 : 1 Table 3.4: Experimental Data - model selection results. Chapter 4 Estimation of the Parameters In this chapter we will estimate the parameters of our models for the tissues, and find the accuracies of these estimates. " The main purpose of our work is assessing the feasibility of classification of new data sets to various tissues. The estimation of the parameters is not optimal: we use the joint posterior distribution of all the parameters from all the measurements and then use the marginal posterior of each of the parameters for the estimations. This is the analogue of the traditional practice of averaging the data before the analysis, and can damage the. estimation of a parameter in case there is a systematic variation, not accounted for by the model, of other parameters between measurements. The optimal estimation is achieved by first calculating the marginal distribution for each of the parameters from every single measurement set, and then multiplying the marginal distributions. That is, every set is considered as new data, and then, by Bayes theorem, the marginal probability distribu-tion it produces is multiplied by the distribution produced by previous measurements1. Elaborate discussion and examples of this point can be found in [1, pp. 161-175]. The above does not imply.that our estimations are wrong. We may lose information that was in the data by using averages, but the estimations are valid and the accuracy estimates are conservative; we do not incorporate knowledge we do not have. A quantita-tive rule of thumb is to look at the Bayesian estimation of the noise variance: successful reduction of the noise level (by the. averaging) by a factor of y/n. will indicate that not fact, our normalization of the data sets in the beginning of the analysis is an approximation to the marginalization needed in the rigorous process. 58 Chapter 4. Estimation of the Parameters 59 much had been lost in the averaging. In our case the reduction does achieve a reduction of noise level that is not far from-y/n. The author believes that the loss of accuracy of the estimates due to the use of averaging in this work is not substantial. A l l calculations of the parameters value estimates and the accuracies of these estimates are for the chosen model. They cannot be taken as estimates of any parameters of a different model. For example, in Chapter 3 we generated a set of data by a continuous T2 distribution. By the model selection process in Chapter 3, a discrete model was chosen. We cannot expect the estimation of parameters (e.g., a decay rate of a discrete component). to be valid for other parameters in different model (e.g., the center of a gaussian distributed decay rates in a continuous model). These parameters are in general dissimilar. As discussed in the introduction, the marginal posterior probability of a parameter is given by P(Qk/Di) = JP(e/Di)-de[k] (4.1) Where: m i = l We will use the expectation value of the distribution p(Qk/DI) as an estimator for the value of the parameter: ;.et • p(ek/Di) • aek < 0 T > = I^/DD-M, • (4'2) Combining eq. (4.1) and (4.2) we have: fQk-P(e/Di)-de < Wfc > = — — — = fp(Q/DI)-d& Chapter 4. Estimation of the Parameters. 60 Noise Probability Distribution Genu Corpus Callosum Figure 4.1: Probability Distribution of the Noise Level for a typical tissue. JQk-p{Q/I)-p(D/QI) -dS (4.3). jp(e/i)-p(D/Qi)-de The Bayesian credible region will usually be estimated by the variance of the one dimensional probability distribution of our parameter of interest. This estimate is valid as long as the distribution is well approximated by a Gaussian in the region of high probability. In such a case, the probability of the parameter to be inside the credible region will be about 0.68 . • 4.1 Noise Variance As any other parameter, the noise variance can be estimated using Eqn. (3.12), which gives the likelihood as a function of the noise variance cr2, regardless the other parameters. We plot the likelihood as a function of a (figure 4.1) for a typical tissue. The prior distribution for the noise is the Jeffreys prior, p(cr./1) a K Following [1], we Chapter 4. Estimation of the Parameters 61 integrate Eqn. (3.12): < a' > = fcrs-±-p(D/aI)-da f±-p(D/aI)-do-_ • ni - m - s^ • nt - m^_ 1 nd2 — mh2 (4.4) 2 ' v 2 Where we approximate "r* (hidden in /i2) by their values that maximize the likelihood. Our estimation for the noise variance is: {<r2)est =< cr2 > ±V< a*> - <a2>2 and substituting (4.4) with 5 = 2 and s = 4, we have, after some algebra, (V) est n • ni — m nd2 — mh2 1 ± n • rii — m — 4 (4.5) The estimation of the noise depends on the number of base functions m and total number of measurements n • n/. Larger models, having more base functions, will reduce the value of id2 —mh? , since they fit the data better. This will result in reduction of the noise level if this increase is larger then the expected by fitting the noise, which is taken care of by the first term ra-nj— m—2 Now, in all our models the number of total measurements n • n\ > 20,000 , m < 7 and thus the term is almost constant n-n{— m — 2 * for all possible m; we have much more measurements than base functions and we are not in a position to fit the noise. As a result, we could expect a reduction in the noise estimation as we add base functions. Alas, as can be seen for Major Forceps in Table 3.4, this does not happen. Any addition of base functions above the model given under "Model 1" in Table 3.4, did not reduce the estimated noise level. Our additional base Chapter 4. Estimation of the Parameters:. 62 functions are almost orthogonal to the data; they hardly improve the fit of the model to the data. • • < > . • • • ' . • " ' The high accuracy of the noise estimation (see Figure 4.1 and Table 3.4) is due to the large number of measurements compared to the number of base functions. . 4.2 Nonlinear Parameters As demonstrated in [1], having different values of <r cannot affect our estimation of the parameters but will affect the accuracy of these estimates, the assignment of the Bayesian credible region for the values of the parameters. It is •possible to give the credible region of the parameters when cr is unknown, but it will render-the resulting equations info a non-intuitive forms. We will rather take cr to be constrained to the value found by Eqn. (4.5). The large number of measurements make the estimation of cr accurate enough to justify this approximation (see Figure 4.1). The likelihood of the model as a function of the nonlinear parameters (regardless of the linear parameters) was approximated in Chapter 3 by a Taylor expansion of the argument of the exponent (Eqn.(3:9) to Eqn.(3.10)). We can.now use this approximation and notation to estimate the nonlinear parameters and the accuracy of these estimates. The values of all < Sj > will be zero by symmetry. The covariance matrix of the 5j is given by: < SjSk > - < Sj >< Sk>= < SjSk >= • I-oo d~S • I T 1' V 2 =1 vn • SjSk-exp [ E L i 2 ^ 5 > ) •exp( . E f = i j ^ > ) = < cr2 > 8kj Chapter 4. Estimation of the Parameters 63 The covariance matrix elements of the logarithm of the nonlinear parameters is ob-tained by: .< UjUk > — < Uj >< uk >=< a2 > /=i UijUik Vi So, our estimation for the logarithm of the nonlinear parameters is: r U2 {u3)est = u]± . <cr2 > E TT where u] are the logarithm of the nonlinear parameters that maximize the likelihood. Under the approximation of the likelihood by the first order Taylor expansion, these values (modes for the nonlinear parameters) are the expectation values for the logarithms. The estimations for the nonlinear parameters are given by: (dj)est = exp (UJ) ± (exp (u-j) \ r TT2 (4.6) 4.3 Linear parameters The likelihood as a function of all the parameters is given in Eqn. (3.7). The results of calculating < Aj > and the posterior coyariances < AjAk > — < Aj >< Ak > according to (4.3) are given in [1]: ' • and < AjAk > -.< Aj >< Ak >= cr26jk These results are to be expected. The base functions H are orthonormal and thus the expectation values of the expansion coefficients of the data over them (the A) will be the projection of the data onto these orthonormal base functions. Chapter 4. Estimation of the Parameters 64 The covariance matrix diagonality is obvious since the H functions are orthogonal. The expectation values of the original amplitudes < Bj > are given (using eq.(3.6)) by: m he-i. <BK>=Y^ , i=i V *>j and the elements of the covariance matrix are given by: <BkBl>-<BkxBl>=<*2>-jr^ (4.7) 3 = 1 A J where we again approximated cr2 by its expectation value. Thus, our estimates of the amplitudes are • (Bk)est =< Bk > ±\J< B2k > - < Bk >2 = = Y ^ ± <<r2>-f:-^. . (4.8) These estimates are still functions of the nonlinear parameters, through the depen-dence of hj, ejk and Xj on them. In many problems the estimates of the parameters are good enough and thus the nonlinear parameters can be approximated by their value at maximum likelihood. Our case is different; the uncertainty of the nonlinear parameters is large and some of the amplitudes are strongly coupled to them. In order to make the covariance matrix free of the nonlinear parameters we need the likelihood as a function of the amplitudes alone; i.e., marginalize over the nonlin-ear parameters in Eqn.(3.7). This cannot .be done analytically and cannot be easily done numerically. Due to the strong coupling between the amplitudes themselves this marginalization is still short of giving definite values for the accuracies. In view of the above, we will summarize our estimates for the amplitude as follows: Chapter 4. Estimation of the Parameters 65 1. Results of Eqn.(4.8). These tabulated results should be consulted whenever there is no strong coupling with other amplitudes. The nonlinear parameters are approx-imated by their values at maximum likelihood. 2. In case there is strong coupling between the amplitudes we give the coupled esti-mates. These are variances (which we signify by S) of linear combinations of the, coupled amplitudes. The variance estimate is the largest eigenvalue of the covari-ance matrix (Eqn. (4.7)) and the relative weights are taken from the associated eigenvector. 3. For most of the tissues, our model contains a component with a decaying exponent with T 2 of less than 15ms. In that component the decay rate and the amplitude are strongly coupled to each other, but relatively independent of the other parameters. Typical contour plots of the likelihood as a function of the decay rate and the amplitude of the lower components are given. One should consult these sample plots when using the tabulated values. Unlike the nonlinear parameters, where the accuracy estimates are free of the amplitudes, and thus conservative, the tabulated results for the amplitudes are under the approximation of the nonlinear parameters by their maximum likelihood values. In case where the amplitudes are strongly coupled to the decay rates (as is the case of the lower component), the tables will give an oyer optimistic accuracy estimates. 4.4 Results 4.4.1 Simulated Data We use here the same simulated data as used for the model selection. After the model have been selected, we turn to estimate its parameters. Chapter 4. Estimation of the Parameters 66 Discrete T2 Distribution Generating Parameters Estimated Paramet ers Component A m p . T2 [ms] W id th A m p . T2 [ms] W id th Low 100 9.0 - . 99.9 ± 1 . 2 9.31 ±0 .45 -Med. 600 85.0 - 579.5 ± 1 . 4 84.3 ± 1 . 9 -High 300 180.0 - 318.8 ± 0 . 8 . 175.2 ± 5 . 7 -A E R 0.0 - -0.04 ±0 .10 - -0.3 - - 0.28 ±0 .04 -Continuous T2 Distribution Generating Parameters Estimated Paramet ers Component A m p . T2 [ms] W id th A m p . . T2 [ms] W id th Low 100 9.0 2.0 100.1 ± 1.3 9.7 ± 0 . 5 - . Med. 900 85.0 .20.0 819 ± 3 . 2 • • 80.0 ± 1.0 -High 70. . ' 380.0 30.0 148 ± 1.6 246. ± 12. -A E R 0.0 - - 0.00 ±0 .20 -<j 0.3 - - 0.31 ± 0.04 • - - • Table 4.1: Generating T2 Distributions and Simulation Results. The generating parameters and the results are described in Table 4.1 and 4.2. They are displayed in Figure 4.2. As discussed in the beginning of the chapter, the estimations of the parameters are for the chosen model. We cannot expect these estimations to serve as estimations of parameters of other mechanisms. 4.4.2 Experimental Data The estimations of the parameters of the T2 distribution for each tissue were done using the collection of pixels taken from ten volunteers. For each of the tissues we compute the estimated parameters along with their accuracies. Our results are given in Tables 4.3 and 4.4 and Figures 4.3, 4.4 and 4.5. As mentioned in Section 4.3, the summary of the probability distribution of the parameters in the form of expectation values and variances will not give, by itself, an adequate description of the results for our case. Estimating the linear parameters having Chapter 4. Estimation of the Parameters 67 finn Q. E < 500 400 300 200 100 600 500 400 300 200 100 0 83 85 87 170~~ 175 180" T85 T O " "2tr ' 5b ' ' Mdo—ato—'—'sdo' ' 'ib6o—Tutw T2[ms] E < 800 600 400 200 —I 1 1 1 I I A Tt5 2(5 ' ' 5b ' ' 'UO 20^5 '—'5do' ' '1660 2T50TT T2 [ms] Figure 4.2: Simulation Parameters and Estimated Parameters for the Discrete (upper figure) and Continuous (lower figure) Distributions. The height of each components of the continuous distribution signifies the integral over that component. Chapter 4. Estimation of the Parameters 68 Discrete Dis t . Component weight High 0.51 Med. 0.37 8 1.9 Continuous Dis t . Component weight High 0.47 Med. 0.28 8 1.9 Table 4.2: Uncertainty of Linear Combinations of Amplitudes. The uncertainties, 8, are the eigenvalues of the covariance matrices and the weights are taken from the associated eigenvectors. The lower eigenvalues (not shown) are meaningless due to the dependence of the amplitudes estimation on the nonlinear parameters. the nonlinear parameters constrained to their values at maximum likelihood, a common and proper method when we have good estimates for the nonlinear parameters, give overoptimistic estimates for the accuracies. These values for the amplitudes are tabulated in Tables 4.3 and 4.4. Therefore, in addition to the above estimates, we give two samples of 2D contour-plots of the probability distribution as a function of amplitude and decay rate of the lower T2 component. These parameters are strongly coupled and are interesting to the study of the physical system. The first contour-plot, for a typical tissue, is shown in Figure 4.6. For the tissue Head Caudate Nucleus, the lower component is justified only by the measurements of the first spin-echo in the sequence, at 10ms (Figure 4.7); the signature of the lower component on the rest of the points is below the noise level. This is an example of a case where the parameters are underdetermined, and give us another opportunity to understand the estimations made in this Chapter and tabulated in Table 4.4: the estimated value for the amplitude of this component is 82 ± 20.; This estimate assumes the decay rate is constrained to its value at maximum likelihood. On the other hand, the Chapter 4. Estimation of the Parameters 69 model C o m p . Tissue C W p a C o m p . A m p . T 2 [ms] W g h t . Genu 3 ' 0 0 0,363± Low 87.9 ± 2 . 8.9 ± 0 . 8 -Corpus 0.001 Med. 906. ± 0 . 6 66.6 ± 0.6 -0.39 Callosum High 52.7 ± 0 . 3 207. ± 20. 0.38 A E R -4.6 ± 0 . 2 - 5=2.4 Splenium 3 0 0 0.401± Low 99.9 ± 2. 9.9 ± 0 . 9 -Corpus 0.001 Med. 661. ± 0 . 9 65.5 ± 2 . 0.492 Callosum ' High 281. ± 0 . 5 150. ± 5 . -0.401 A E R -5.0 ± 0 . 3 - 5=2.9 Minor 3 0 0 0.2916± Low 82.1 ± 2 . 8.12 ± 0 . 7 -Forceps 0.0007 Med. 935. ± 0 . 5 66.8 ± 0 . 5 0.37 High 32.8 ± 0 . 2 214. ± 3 0 . -0.36 A E R -3.6 ± 0 . 2 - 5=2.1 Major 3 0 1 0.2149±' Low 79.8 ± 2 . 8.29 ± 1 . -Forceps 0.0004 Med. 175. ± 1 . 43.7 ± 6 . 0.65 High 784. ± 0 . 7 87.3 ± 2 . -0.30 A E R -3.24 ± 0 . 1 - 5=3.2 DC 8.7.2 ± 0.08 -Internal 3 .0 0 0.551± Low 113. ± 2 . 13.8 ± 1 . -Capsules 0.002 Med. 633. ± 1 . 67.8 ± 3, -0.44 High 284. ± 0 . 6 172. ± 10. -0.50 A E R -5.12 ± 0 . 4 - 5=3.6 Table 4.3: Estimated Parameters for White Matter Tissues. 5 is the uncertainty for the linear combination of the amplitudes weighted by the given Component Weights. Chapter 4. Estimation of the Parameters 70 model Comp. Tissue C W p a Comp. A m p . T2[ms] Wght. Head 3 0 0 0.472± Low 81.9 ± 2 0 . . 3.8/* 3 Caudate 0.001 Med. 943 .± 1. 77. ± 1. 0.13 Nucleus High 43.4 ± 0.5 572. ± 300. ,0.18 A E R -6.37 ± 0 . 9 - 6=25. Putamen 3. 0 0 0.2209± Low 54.2 ± 2. 5.65 ± 1. 0.0005 Med. 948. ± 0.3 70.7 ± 0.3 0.28 High 37.4 ± 0.1 ' 257. ± 20. -0.27 A E R -5.7 ± 0.2 - 6=2.2 Globus 3 0 0 0.461± Low 69.2 ± 2 . 11.3 ± 2 . Palidus 0.002 Med. 897. ± 0.8 67.2 ± 1 . 0.36 High 60.6 ± 0 . 3 234. ± 30. 0.46 A E R -5.67 ± 0 . 3 - 8=2.7 Thalamus 3 0 0 0.2787± Low 49.4 ± 0.9 14.2 ± 2 . 0.0008 Med. 923. ± 0.5 74. ± 0 . 9 -0.32 High 41.9 ± 0 . 2 266. ± 50. -0.52 A E R -5.35 ± 0 . 2 8=1.5 Insular 1 1 1 0.4049± Med. 970. ± 0.3 86.3 ±0 .09 Cortex 0.0007 A E R -6.61 ± 0 . 2 wdth.fr ac= DC 22.7 ± 0 . 1 0.30 ± 0.02 Cingulate 2 0 1 0.649± Low 27.2 ± 3 . 7.44 ± 3 . -0.91 Gyrus 0.001 Med. 968. ± 0 . 6 85.2 ± 0 . 2 - -0.14 A E R - 4 . ± 0.4 - 6=4.0 DC 22. ± 0 . 2 -Cortical 3 0 0 0.647± Low 27. ± 3 . 7.5 ± 3.0 Gray 0.001 Med. 949. ± 0 . 7 76.4 ± 0.6 -0.20 High 47.8 ± 0 . 2 760. ± 200. -0.36 A E R -2.23 ± 0 . 4 - '6=4.2 Table 4.4: Estimated Parameters for Gray Matter Tissues. 8 is the uncertainty for the linear combination of the amplitudes, weighted by the given Component Weights. Chapter 4. Estimation of the Parameters '•" 71 estimation of the decay rate is after marginalization over, the amplitude and thus takes into account all possible values of the amplitude, as reflected by the estimated value, 3.8^3,(That is, 3.8 multiplied or divided by factor 3; the estimation of the logarithm is 1.34 ±.1.2). Chapter 4. Estimation of the Parameters Amplitude 800 600 400 200 0 AER=-4.56 No DC 1 ' • i 10 00 1000 • Genu Corpus Callosum .0.75 0.5 0.25 -0.25 -0.5 -0.75 T2[ms] Residuals time[10*ms] Amplitude 600 500 400 300 200 100 0 AER=-5.01 No DC Amplitude 800 600 400 200 0 AER=-3.62 No DC Amplitude 800 600 400 •200 0 AER=-3.24 DC =8.72 4 r Spleriium Corpus Callosum 1000 T2[ms] Minor Forceps tiu - 1 TOoTT T2[ms] Major Forceps 100 1000 T2[ms] Residuals 1 time[10*ms] Residuals Residuals time[10*ms] time[10*ms] Amplitude AER=-5.12 600 500 400 300 200 100 0 No DC "TTT Intemal Capsules ',' 1000 T2[ms] Residuals time[10*ms] Figure 4.3: The Estimated T2 Distributions (left) and Residuals for the Estimated P; rameters (right) of White Matter Tissues. Chapter 4. Estimation of the Parameters 73 Head Caudate Nucleus Amplitude 800 600 400 200 0 AER=-6.4 No DC T 0 ~ 100 1000 T2[ms] Residuals time[10*ms] Putamen Amplitude 800 600 400 200 0 AER=-5.7 No DC i . i . . . . . 10 1 00 1000 T2[ms] Residuals time[10*ms] Globus Palidus Amplitude 800 600 400 200 0 AER=-5.7 No DC I ... • i . 10 1 00 1000 T2[ms] Residuals time[10*ms] Amplitude 800 600 400 200 0 AER=-5.3 No DC Thalamus Residuals 10" 100 1000 T2[ms] time[10*ms] Figure 4.4: The Estimated T2 Distributions (left) and Residuals for the Estimated Pa-rameters (right) of Gray Matter Tissues. Chapter 4. Estimation of the Parameters 74 Insular Cortex Amplitude 800 600 400 200 0 AER=-6.6 DC =23. 10 00 1000 T2[ms] Residuals time[10*ms] Amplitude 800 600 400 200 0 AER=-4. DC =22. Cingulate Gyrus T2[ms] 10 100 1000 Residuals time[10*ms] Amplitude 800 600 400 200 0 AER=-2.2 No DC 10 Cortical Gray 0.5 -0.5 T2[ms] 100 1000 Residuals time[10*ms] Figure 4.5: The Estimated T 2 Distributions (left) and Residuals for the Estimated Pa-rameters (right) of Gray Matter Tissues (cont.). Chapter 4. Estimation of the Parameters 75 Figure 4.6: Joint Probability Distribution for the Low Component of a Typical Tissue. The second line from the middle enclose 99% of the probability. Chapter 4. Estimation of the Parameters 76 Head Caudate Nucleus 9 r1—•—•—1—•—•—1—•—1—•—1—•—1—1—•—1—"—•—•—'—1—•—•—•—•—•—•—1—1—r 6 ' ' ' 5b ' ' ' 1 CiO' ' '. 1 5 0 ' ' ' 2CJ0' ' ' 2 5 0 ' ' ' 36(5 Amplitude Figure 4.7: Joint Probability Distribution for the tissue Head Caudate Nucleus. The second line from the middle enclose 99% of the probability. Chapter 5 Classification In this chapter we will devise a procedure for the classification of a new data set to one of several brain tissues. The principles of the Bayesian classification have been discussed in Chapter 2, and we are now left with the task of applying them to our special case in a computable manner. 5.1 Procedure As in Chapter 2, we denote the old data sets, the learning data, by daid and the set of data to be classified to one of the tissues by dnew: From Eqn. (2.3), assuming we have no prior information classifying dnew to one of the tissues, we have: p(Mi/d0iddnewI) a jp(6i/dotdI) p(dnew/0il) d6{ . (5.1) The T 2 distributions found in Chapter 3 do not contain any information about the total magnitude of the signal; the data were normalized prior to their analysis. However, we want to use the information carried by the amplitudes and include it in d0\d and dnew. To that end, we present the parameters and the data as the following conjunctions: @i = 9iA ' QiDist d0id = d0idA • d0idr)ist -dnew — dnewA ' dnew£){sf , where: , . • 77 Chapter 5. Classification 78 • BiA is the amplitude of tissue i. • ®iDist are the T2 distribution parameters of tissue i. • dgidA are the total amplitudes of the old data set. So far, they have not been used. • d0idDist are the normalized old data sets we used in our work so far. • d n e w A is the total amplitude of the new data set. ' ist is the normalized new data set. Using our new nomenclature, Eqn.(5.1) become: p(Mi/d0iddnewI), a 16iAQiDistMiI\d9iA d9iDist — fp{GiA/doldAI) p(dnewA/e,AMJ) d9iA- - (5-2) Ip(9iDist/doldDistf) p{dnewDist/9wistMiI) d9iDist That is, p ( M t / d o l d d n e w I ) a p ( M i / d 0 i d A d n e w A l ) • p(Mi/d0idDistdnewDistf) , where we used the independencies achieved by the normalization performed upon doidDist and dnewDist- The probability of Mi = "The new data set come from the same mechanism as old data set i " is a product of two integrals: the first is the probability of M{ where only the total amplitudes are known (the total amplitudes integral) and the second is the probability of Mi where only the normalized decay curves are known (the T 2 distribution integral). Chapter 5. Classification 79 We now have to assign the probability distributions in Eqn.(5.2), and to approximate the integrals. We will treat the first integral in section 5.1.1 and the second integral in section 5.1.2. 5.1.1 The Total Amplitudes Integral $p(QiAldoidAl) p { d n e w A l ' Q - i A M i l ) dOiA . The probability distribution of the total amplitude of a tissue, p(0iA/doidAl), is wide due to the difficult estimation of the amplitudes of low T2 components. The signature of such components on the data is limited to the first spin-echos measurements. In extreme cases where the signature of the lower component on the data is limited to the first spin-echo alone, the amplitude of the lower component is not determined. We have a realization of such a case in our data, and.the effect of this on the probability distribution of the amplitude of the lower component can be seen in Figure 4.7. We will not try to estimate the value of the total amplitudes. Rather, we will estimate the value of the amplitudes in the time of the first spin-echo, 10ms after the beginning of the decay, and we will approximate this amplitude by the amplitude of the first mea-surement. From now on the subscript A denotes the value of the measurement/model.at time 10ms. . . . We now turn to evaluate the terms in the integral. First Term p(6iA/d0idAf) By Bayes theorem: ' . p(0iA/doldAl) = p(0iA/I) • pid°ldA/6;Ap a . (5.3) P(<'UdA/l) p(doidA/0iAI) Chapter 5. Classification 80 since we set p(9iA/I) = const. V i ( i.e., we do not incorporate any prior information regarding the value of the amplitude for the tissue i and we treat this value as a location parameter). We assume that the data amplitudes are not correlated, and that the amplitude noise' is unknown, but finite in variance; the likelihood factors and the noise will be assigned a gaussian distribution. So, p(dMA/eiAcrI) = ( 2 7 T < 7 2 ) - ^ • exp ^ _ ^ i ( ^ ~ M a j ( 5 4 ) where: • nli is the number of amplitudes measured for tissue i. • d0idAij is the j data amplitude measurement (10ms point) of tissue i. Now, a is unknown and we mariginalize over it using Jeffreys' prior: p(doidA/0iAl) ct f — • p(doidA/0iAcrI) • da J a a Tfnk\ / E g i K ^ - M y 2 ( 5 5 ) Our estimation of the value of $iA is given by: ^ A ) e s t =< eiA > ± yj[< Q\A > - < 6lA >'] where ^ ^ > — / 9JA • PJdpidA/QJAI) d9jA _ Y!j=\ doldAij _ -j-/p{d0idA/6iAl) d6iA nk M A l and Chapter 5. Classification 81 < ^ = J ^ A P(d>oldA/6iAl) d9iA ^ T,f=l{doldAij - dpldAj)2 ^ -j , j 4 /p(d0ldA/9iAl) d9iA nli(nli-^l) °'dAl => y/[< 9 2 A > - < 9 l A > 2 } = ^ That is, YTj=i{d0idAij — d0idAi)2 nli(nli - 1) {9iA)est = d0idM± STDoSM{dMAi} (5.6) where ST DoS M {d0idAi} is the standard deviation of the sample mean of d0idAi- We have arrived at results familiar from frequentist calculations. 1 Second T e r m p(dnewA/9iAMiI) p{dnewA 19iAMiI) is the likelihood of an amplitude in face of the new data amplitude. We assume that the noise level of the measurement of the new data is similar to the noise level in the old data which is estimated by the standard deviation of the old data set. So, p(dnewA/9iAMicriI) = Ji— • exp (_(DNEWA ^A) where ai = STD{doldAi). . 1 Thi s happened because the prior is not informative, the noise distribution is gaussian and the problem is one-dimensional (the likelihood is a function of only one parameter - the parameter we want to estimate). One may be tempted to conclude that the Bayesian procedure is a generalization of the frequentist approach. This in general is not true, and the interpretation of the results is always different. The frequentist variance defines a confidence interval, an estimate of the interval in which new data will be materialized in known frequency (say 0.68). The Bayesian result is a credible region, in which the probability of the parameter to be is known (here 0.68). The credible region boundaries are approximated using the variance of the parameter, as in chapter 4. Chapter 5. Classification 82 The Integral The number of samples composing our d0\d for each of the tissues under consideration is in the range of nl =800 to nl =5000 (see Table 1.1). As a result, the width of the first term in the integral (~STD of the sample mean) will be 30-70 times smaller than the width of the likelihood in the second term (~STD of the sample). We approximate the second term by a constant in the region of substantial magnitude of the first term, the probability distribution of the amplitude. That is: Jp(9iA/doidAI) p{dnewAl9iAMiI) d0iA « p { d n e w A j < 6lA > Mil) • Jp(9iA/d0idAI) d9iA = p(dnew/ < 9iA > Mil) = 1 / {dnewA- < 9iA >)2\ • 2 ^ r e x p ( , — M — J ^ where Gi = STD{d0idAi}. 5.1.2 The T2 Distribution Integral f p(9iDist/d0idDistI) p{dnewDist/9iDistMiI) d9iDist The first term in the T2 Distribution integral, p(9iDistId0idDistI), w a s found in Chapters 3, and the maximum and width of the probability distributions of the parameters were found in Chapter 4. We now turn to evaluate the second term p(dnewDist/9i£>istMiI). This term can be evaluated in a straight-forward manner in case we assume that the noise is not correlated. This is equivalent to the assumption: "If dnewDist came from tissue i, it was produced by the mechanism with parameters given by p(0iDist/doidDistI), and thus any deviation of a datum point from the value predicted by the parameter set p(0%rjistld0idpistl) ( i - e - 5 the Chapter 5. Classification 83 noise) is due to non_systematic effects". We call this assumption the unique mechanism and. evaluate p(dnewDist/OiDistMil) under it in the next subsection. Alas, observation of the old data reveals correlations in the noise that call for relax-ation of the unique mechanism assumption. We discuss these observations and our crude treatment of the problem in the subsection that follows the next. There we will assume that the parameters of the T2 distribution in each imaged pixel (dnewmst) are perturbed, from the parameters of the T2 distribution of the tissue. Second Term p(dnewDist/6ipistMiI)—Unique Mechan i sm Case The expression for-the likelihood of a parameter set •in light of a data set, assuming uncorrelated noise, is given by Eqn.(3.7): . ' ' p(d/ A^al) = (2-nia2)-^ exp f - — • (nW - 2 ' £ Ajhj +.£ A))\ . (5.8) We want to free our results from the absolute magnitude of the data. To that end, we introducer . " m A • i= i A .. • m ' .7=1 . . . Note that A and the /_,• are not necessarily positive, and the values of fj are not necessarily smaller than one. So,' ' • . p(dnewDist/AfiTtMiO-lewI).a Chapter 5.., Classification 84 •^--{nW- •• (5-9) LGnew j=l 1 / where cr n e u) is the unknown variance of the noise of the new data set, and. and are approximated by their estimates, as found in Chapter 4. Assuming a flat, and equal for all the models, prior for A, we can analytically marginalize over it: . P{dnewDist/ fi ~Ti Mi<TnewI) Q> J'p(AdnewDist/ fi ~rf MiOnewI) dA = p(A/ fiT*MiO-newI)-J p(dnewDist/Afi~TtMi(TnewI)dA a J p(dnewDist/A 7?' M i O n e w J ) dA Which is of the form J.-oo V a and result in p{dnewDistl fi^ftMiCT^I) Ct-.aJ^ X) e X P ••( ,2 ~ n ( j 2 ) v^m /2 ^ V o 2 v V^m f2 t-^j=l J ji' \ new . L^ij—\Jji and marginalize over anew, using Jeffreys' prior, to get (5.10.) P i d n e w D i s t I f i ~T{ M{I) = J p(o-newdnewDist/ f i ~Ti Mil) dan / • p ( d n e w D i s t I f i Ti(jnewMiI) da new a Chapter 5. Classification 85 Em y2 j=l J ji nd2) dV = Where V = and dV = — V2do~new. This expression is of the form: r Jo V m~~aV2dV = r [ ( m + l)/2] 10 2a( m + 1 ) / 2 with m = 30, and thus the desired likelihood is p(dnewDist/ fi ~TiMi'I) a T(15.5) -15.5 V Second Te rm p(dnewDist/'9iDisiMiI)—Non-Unique Mechan ism Case . (5.11) A typical behaviour of normalized pixels' data (subtracted from their values predicted by OiDist, the parameters estimated in Chapter 4) is shown in Figure 5.1. Obviously, successive values of these deviations are highly correlated. The systematic deviation of the individual pixels from their values predicted by 6iDist seems to be primarily characterized by the patient2 (upper figure). Nonetheless, even within the context of one patient (lower figure), the deviations of the individual pixels' data from the average data of this tissue and patient seem to be correlated, though to less extent. Similar figures had been generated for the rest of the tissues and found to exhibit similar behavior. We here assume a non-unique mechanism: The new data set dnewr)jst was produced by a parameter set 9new that differs from 9iv%st- We can view this difference as a deviation 2 W e denote our normal volunteers by 'patients'. We do not expect any pathology to exist in these volunteers. Thus, when we talk about difference in the results as function of the patient, we do not assign this difference to any difference in a pathology. Rather, we refer to it as a consequence of variance found among normal humans. Chapter 5. Classification 86 Time[ms*10] Figure 5.1: Typical Spread of the Measurements of one Tissue. Upper figure: Normalized averages for each patient subtracted from the prediction of the model (thick lines), and collection of normalized pixels measurements from patient 6 (thin lines). Lower figure: The same pixels as in the upper figure but now subtracted from their average. Chapter 5. Classification 87 of the particular mechanism that produced the new data set from the 'base' mecha-nism (described by OiDist) due to additional factors which depends on the volunteer, the measurement session, location in the image, etc. The unique mechanism is a special case of the non-unique mechanism, where the probability distribution of Q n e w is given by p(®new/@iDistI) = 8(Qnew — QiDist)- The non-unique mechanism admits that the probability distribution p(Qnew/®iDist!) is, in general, wider than a 8 function. We here run the risk of loosing the distinction between expected data from different tissues, and thus loosing the ability to perform a meaningful classification. This will happen in case the width of the distribution p(Qnew/@iDistI) is of similar magnitude to the difference between the parameters values expected from different tissues. In fact, our problem is an example of such a case. The only knowledge we have about the distribution of Q n e w around Qioist is its width, derived indirectly from the Standard Deviation of the data points, and this width is indeed in the order of the difference between the parameters values of different tissues. We cannot expect much from the classification process, starting from our current state of knowledge. Nevertheless, we should not resort to the use of unique mechanism, or any other practices that seems to produce 'better' results. The answers given by probability theory are to the questions asked, given a state of knowledge; the unique mechanism assumption, according to our current state of knowledge, is false. So, we have: PidnewDistIQiDistMil) = J p{6newdnewDistI'QiDistMil) d6new = Jp(&new/6iDistMiI) p(dnewr)isi j6new6iDistMiI) d$new (5.12) Chapter 5. Classification 88 Now, p(dnewDist/6new6ir)jstMiI) is independent of QiDist and can be assigned assuming uncorrelated noise, that is: where we apply steps similar to the steps led to Eqn. (5.10) but now the amplitude-fractions /,• and the nonlinear parameters 1\ (hidden in hji) are determined by 6new. pi&new/GiDistMil) is the probability distribution of the parameters that produced dnewDist given the 'base' parameters of the tissue OiDist- I failed to characterize this distribution in a quantitative manner within the time frame of this work. We will approximate this term using the following steps: 1. Lacking any other knowledge about the distribution (but assuming finite deviation of 0new from OiDist), we will assign a gaussian distribution to all the parameters. By proper scaling of individual parameters we have: We do not assume any knowledge about the value of 6, the scale of the deviation of 9new from 0iDist. 2. We assume that for any set of measurements, there exists a set of parameters 0new, that will describe this set well, down to the level of the non_systematic effects. 3. We turn from the parameter domain to the measurements domain. The model P(dnewDist/QnewOiDistMil) = p(dnewDist / 6newMiI) a (5.13) Chapter 5. Classification 89 function m g(9,t)=g(f ,r,t) = J2fiHArj,t) 3=1 is a continuous function of the parameters 6. Assuming the deviation of 9 n e w from 9iDist is small and thus will be well approximated by a first order Taylor expansion, we have W^new — 9iDist\\L2 diPnew i t) — g{9iDist,t) \dg_ [d9 and therefore, ta io u n 1 ( (9{Qnew,t) - g{9iDist,t)f\ te-iA\ P{9new 9 l D i s t M i I ) = - — • exp — V t (5.14) where jt = 6 • [^f]e t -7< is the scale of the difference between the predictions of model having parameters 9i£>ist and one having parameters 9 n e w . From Figure 5.1 we can see that ^ t is at least of the order of cr n e w , the level of non_systematic effects. 4. As a result, the integral in Eqn. (5.12) will be peaked at the maximum of the expression given in Eqn.(5.13). Thus, 9 n e w can be approximated by the values that will minimize zZl=i {9{9new, h) - 4 ) 2 • 5. Therefore, Eqn. (5.13) does not carry information relevant to the classification problem; p{dnewDistI'9newMiI) describes the probability of materialization of a par-ticular set of non_systematic effects. We replace it by a constant and approximate g(9new,tk) by dk. 6. Under the above, jt is well approximated by the STD of the measurement points for each tissue. From Figure 5.1 we can see that 7 t remains in the same order of magnitude for all t. Approximating 7* = 7 = const. V t, Eqn. 5.14 becomes P(enew/0^MJ) = * • exp (_^^(^^7(^^^))2) = Chapter 5. Classification 90 p(dnewDist/OiDistMil) 1 ,2 • exp Efcii (dk — g(0jDist,tk)) 32 • 2 7 t 2 (5.15) where 7? is the variance of the measurement points for tissue i. We are now in a position to apply similar steps to those performed in the previous section (Eqn. (5.8) to (5.10) ) to marginalize over the total amplitudes. The resulting likelihood is: where again, /,• and are approximated by their estimates, as found in Chapter 4. We approximate 7 by the STD of the measurement points for tissue i, and do not marginalize over it. The reason will be discussed in Chapter 6. The Integral The expression to be used for p(dnewr)ist/OiDistMil) is given by Eqn.(5.11) in case of the unique mechanism assumption and by Eqn. (5.16) in case of relaxation of the unique mechanism assumption. Using these results, we now have to approximate the second integral in Eqn. (5.2). As shown in Chapter 4, the width of the probability distribution of every parameter is proportional to the noise cr. Our new data come from individual pixels and we assume they have similar noise levels to the individual pixels in the corresponding tissues of the old data set. This noise level can be approximated by the average of the S T D { d o i ^ D i s t i } for all the measurement points of tissue i. It is in the range of 10-25 amplitude units. P(dnewDist I OiDistMil) OL (5.16) Chapter 5. Classification 91 The Bayesian estimation of the noise level cr that enter the first term in the integral had been found in Chapter 4 and is in the range of 0.22 to 0.64 amplitude units. As a result, the likelihood p{d0idDistl&i) as function of the parameters 0,-, is 30-60 times narrower then p{dnewDist I Thus, we can approximate the likelihood in face of the new data set as a constant in the region of substantial weight of the probability distribution of the parameters given the old data set. where Owist are the values of the parameters that maximize the likelihood in face of the old data set. These values had been determined in Chapter 4. 5.1.3 Final Expressions That is, p(Mi I' d0idDistdnewDistI) = Ip{QiDist/doidDistl) p{dnewDistl&iDistMiI) d6i£)ist P(dnewDistI'OiDistMi) • f p{9iDist/d0idDist) ' dd{ = P{dnewDist / QiDistMi) (5.17) The final expression to be evaluated for each pixel is given, according to Eqn. (5.2), by the product of the expressions given in Eqn. (5.7) and Eqn. (5.17). That is, assuming a non-unique mechanism, the expression to be evaluated is: p{Mi I d0iddnew) a (5.18) Chapter 5. Classification 92 where crt- = STD{d0idAi}, Ii — ^1 1 —32~ °ldD'st'3} , and /,• and ~rt are approximated by their estimates, as found in Chapter 4. 5.2 Results 5.2.1 Feasibility At this point, we try to present the data in a way that will help us set out our qualitative expectations from the classification process derived in the previous section. The test of the ability to classify the tissues can be divided into two parts: 1. The basic assumption of the classification process is that the various mechanisms are different in their amplitudes or T2 distributions, leading to a significant difference in the data produced. If this is not true, the classification will fail since our data is not relevant to the difference between the tissues. The rigorous procedure for deciding whether two sets of data are produced by the same mechanism was discussed in Section (2.2) but has not been applied to the tissue classification problem. Looking at the models and the average data of the tissues (thick lines in Figure 5.2) can set our qualitative impression. Our observation from Figure 5.2 is that the mechanisms are indeed different; each tissue seems to exhibit substantially different average behavior of the T2 distribution. 2. Each of the different mechanisms, or in the general case - probability distribution of mechanisms, defines measurement values expected from the new data set in order to be classified to that mechanism. Reliable classification depends on the deviation of the new data set from their predicted values by the producing mechanism, in comparison to the scale of the difference between the predictions of the various mechanism. A typical spread of the data of individual pixels is given by the thin Chapter 5. Classification 93 <D 13 -60 0 5 10 15 20 25 30 Time[ms*10] Figure 5.2: Models/Data Averages for the Tissues and Individual Pixels' Data From One Tissue. The models (the data predicted by the parameter estimation from Chapter 4), the average of the data and the la of the models predictions are all within the width of the thick lines (one line per tissue). The thin lines are samples of individual pixels data from one tissue but different patients. Chapter 5. Classification 94 lines in Figure 5.2. As can be seen from the figure, a substantial percentage of the pixels of this tissue will be erroneously classified. As said in Section 5.1.2, I failed to track down any systematic behaviour of this deviation; Figure 5.1 suggests that most of the deviation is to be associated with the deivation of the T2 distributions among patients. 5.2.2 Tissue Classification We will not try to distinguish between the Cortical Gray, Insular Cortex and Cingulate Gyrus. Though appearing under different data sets, the Insular Cortex and the Cingulate Gyrus are both parts of the Cortical Gray [23]. The three data sets exhibit very large variances among patients. It seems possible that the assignment of Regions Of Interest (ROI) was in different parts of the brain for each patient, as the Cortical Gray can be found in the circumference of the brain in every image. In order to set our 'degree of belief in the classification of each of the tissues, we classify the pixels from the known ROI in ten patients: each of these pixels serves now as a new data set, dnew. The classification is done by applying Eqn. (5.18) to every pixel data and every mechanism M ; (i = 1..10), and choosing the mechanism (tissue) with the highest posterior probability p(Mi / d0iddnew). The results of the classification are given in Figure 5.3 for the white matter ROI, and in Figure 5.4 for the gray matter ROI. In these Figures, each Bar Chart is the result of classifying all the pixels from the ten 'learning' volunteers, that where assigned (by the ROI) to the tissue given in the Chart Label. Each bar denotes the number of these pixels, classified to the tissue given in the bar base. The sub-divisions of the bars denote the number of pixels classified to the tissue from each volunteer.3 3 Naturally, one would test the ability to classify the tissues on the data taken from the eleventh volunteer; this data were not a part of the 'learning' data. Alas, this set of data is suspected to contain a substantial number of erronously ROI assigned pixels [23]; The resulting classification performance Chapter 5. Classification 95 As we can see from both Figures 5.3 and 5.4, the classification of white matter from gray matter is relatively successful; very few pixels of white matter were classified as Grey matter and vice versa. The success of the classification among white matter tissues or among gray matter tissues is much smaller. 5.2.3 Images The results of applying Eqn. (5.18) to an image of the eleventh patient are given in Figure 5.5. We did not apply any image processing in the generation of our synthetic images. Nevertheless, it seems as the images could be easily improved by incorporation of addi-tional information in the spatial level (e.g., continuity). based on these ROI assignment is suspected to be misleading. Therefore, we choose to present the classification of the data of the eleventh volunteer through a synthetic image (see next section) and to use here a much larger database, the data from the ten 'learning' volunteers. Chapter 5. Classification 96 Genu Corpus Callosum 800 600 n 400 200 Minor Forceps 2000 1500 m 1000 in 500 H Splenium Corpus Callosum 1000 800 HCPt SPTI CS Major Forceps Internal Capsules 1000 800 600 400 200 Figure 5.3: Classification Results for White Matter Tissues. Each Bar Chart contains the results of the classification of pixels known to be associated with one tissue (by the assignment of ROI). The bars denote the number of these pixels classified to each of the tissues. The subdivisions of the bars denote the different patients. Chapter 5. Classification 97 Head Caudate Nucleus Putamen 1 2 0 0 1 0 0 0 8 0 0 6 0 0 4 0 0 2 0 0 2 0 0 0 1 5 0 0 1 0 0 0 5 0 0 GCSCMi MalC Globus Palidus 7 0 0 6 0 0 5 0 0 4 0 0 3 0 0 2 0 0 1 0 0 'GC'St? Mi Ma IC HCPt Thalamus 1 0 0 0 8 0 0 6 0 0 4 0 0 2 0 0 7 0 0 0 6 0 0 0 5 0 0 0 4 0 0 0 3 0 0 0 2 0 0 0 1 0 0 0 Cortical Gray Figure 5.4: Classification Results for Gray Matter Tissues. Each Bar Chart contains the results of the classification of pixels known to be associated with one tissue (by the assignment of ROI ) . The bars denote the number of these pixels classified to each of the tissues. The subdivisions of the bars denote the different patients. Figure 5.5: Synthetic Classification Image of Patient Eleven. Each gray level signifies one tissue. Chapter 6 Summary, Discussion and Conclus ion 6.1 S u m m a r y One of the main goals of this work, as defined when started, was the classification of brain tissues by their T 2 distribution. This goal was not achieved in a satisfactory man-ner. The variance of the T 2 distribution of each tissue was found to be in the order of the characteristic variance of all the tissues. As a consequence, the reliability of the classification was substantially reduced. As discussed in chapter 5, the variance of the distributions within the tissues seems to be mainly due to variance of the T 2 distribu-tion parameters among different patients 1 2. That is, classification may be enabled by determining the probability distribution of these parameters as function of patient. I failed to characterize this distribution in a quantitative manner within the time frame of this work. Nonetheless, based on unpublished work of the author, this task seems to be feasible for future work, and could carry important information about the underlying physical mechanisms. The estimation of the parameters of the tissues is also substantially influenced by : E a c h image includes a water phantom, intended to be used as a reference (having mono-exponent ia l decay curve w i th known parameters) , enabl ing detection and correction of systematic effects ar is ing f rom the par t i cu lar imag ing session. The da ta f rom the water phan t om pixels i n the images used for this work, exhib i t large variance and a mul t i - exponent ia l behaviour. A s a result, the water phantoms reading were ignored. 2 B y 'pat ient ' we mean here bo th the h u m a n and the par t i cu lar imag ing session. A new MRI sequence, i n wh i ch the water phantoms pixels exh ib i t re l iable mono-exponent ia l decay curves, has been late ly developed i n the U . B . C . Us ing the da ta f rom these water phantoms, removal of systematic effects of the par t i cu lar imag ing session may be enabled. 99 Chapter 6. Summary, Discussion and Conclusion 100 the variations among the patients. These variations render the T2 distributions based on the analysis of the tissues of one patient to be unrepresentative of the distribution of other tissues. This may explain part of the difference between the resulting distributions of this work and the analysis given in [3]. The parameter estimations in this thesis are under the assumption that there is a unique T2 distribution (functional form and parameters values for this functional form) for each tissue regardless of the patient from which the data were taken. If the T2 distribution is determined by both the tissue and the patient, our resulting parameters can be taken as an average of the parameters for each tissue over the patients; assuming small deviations of the parameters as function of patient, the data is linear with each of them. The work determined for the first time the probability of existance of short T2 component in the brain tissues, and found it to be higher than 99% for all white matter tissues and higher than 80% for all gray matter tissues except Cortical Gray (see section 6.3.1). The probability of having no more than three components of decaying exponents in the T2 distributions of the brain tissues, were found to be higher than 90% for all the tissues. This work analyzed a problem in which the signal-to-noise ratio is poor, the data sets consist of multiple-measurements and the systematic effects are only partially accounted for by the models. The analysis used a multi-dimensional search in the space of models and classification of data into (conceptually, if not practically) probability distributions generated by previous sets of data. The multi-dimensional search in the space of models has not yet been treated in the Bayesian literature. 6.2 D i s c u s s i o n Probability theory determines the unique procedure for transformation of state of knowl-edge (expressed in real numbers) into answers to well-posed questions. The answers Chapter 6. Summary, Discussion and Conclusion 101 may be changed substantially when we change our questions or state of knowledge. The Bayesian inference procedure forces us to explicitly introduce our questions and state of knowledge; it will thus prevent us from solving problems other than those we wanted to solve. We discuss here an example encountered in our work. 6.2.1 Multiple Measurements: a Vs. as In chapter 3 we defined a quantity a = as/\fn~i, where crs is the Bayesian estimation of the noise of the individual data points and n) is the number of measurement sets (number of pixels constituting the data base, in our case). At a glance, it seems that cr is the bayesian parallel to the Standard Deviation of the sample mean. As the STD of the sample mean, assuming the noise level as (the deviation of the individual data points from the model) is finite and constant, we expect the estimation of o~s to stay constant and the estimation of a to decrease as we accumulate more data. In case of systematic effects in the data, not accounted for by the model, this does not happen. As we add more data sets, the estimation of <r stops decreasing (as a result of increasing estimate of crs). What happened to our estimation of the noise level crs? Obviously, the difference between the model and the individual data points remain the same regardless the number of data sets acquired; the answer must lie in the meaning of o~s. crs is the width we assign to the distribution of the difference between the data points and the model predictions. In case of a single measurement set, it does not matter whether the difference between the model and the data is due to non systematic effects or systematic effects not accounted for by the model. To see what happened in case of multiple-measurements, consider the average-data:. as we accumulate more data, the systematic effects will manifest themselves in the averaged data more until dominant. At that point, adding more data sets will not reduce the deviation of the averaged data from the model. Alas, assignment of the Gaussian distribution to the noise, as is our case, is Chapter 6. Summary, Discussion and Conclusion 102 equivalent to the assumption that all the difference is non systematic, and thus adding data sets should reduce the deviation of the average data from the model predictions. Encountering a collection of data sets whose average is of certain deviation from the model, the Bayesian procedure estimates a crs that will justify such a deviation by non systematic deviation of the individual points. More data sets, failing to reduce the average deviation, will thus result in increasing estimate for o~s. In the limit of very many measurements, the deviation of the averaged data from the model will remain constant (this is the difference between the model and the sytematic effects in the data), and will be characterized by constant cr. The estimate of the noise from each point needed to achieve such deviation of the averaged data will increase as yjn. o~ is a measure of fit of our collective data to the model; all the estimation of the accuracies of parameters in the models are proportional to cr. The estimation of as, on the other hand, answers the question: "Given the model, and assuming all the deviations of the data from the model are non. systematic, what width should be assigned to the distribution of the individual deviations of the data points from the model in order to account for the collected data?" 6.2.2 Mar ig ina l i za t ion in the Neighbourhood of N o D a t a As illustrated by Bretthorst [1, Appendix C], the marginalization of nuisance parame-ters is very much like estimating the nuisance parameter from the data and using this estimate, given we have enough data to determine that nuisance parameter well. The marginalization does not assume any particular value to the nuisance parameter, but rather considers all the possible values this parameter might have. Each of the possi-ble values is weighted by its prior probability multiplied by its likelihood in light of the data. In case the likelihood is peaked, the practical result is equivalent to estimating the parameter from the data. Chapter 6. Summary, Discussion and Conclusion 103 JO © PL, • Likelihood: Well determined parameter 1 prior 5 function<=> No marginalization \ Prior state,of knowledge V Likelihood: Ill-determined \ ^ ^ ^ p a r a m e t e r Ignorance prior \—^^^^^^^ • Nuisance parameter ()) Figure 6.1: Marginalization Decisions in Light of Likelihood Widths. Broad likelihood: Representing our state of knowledge as ignorance will result in losing information. Peaked likelihood: The data is capable of determining the value of the parameter; not marginal-izing will prevent it from doing so. In case the data does not determine the nuisance parameter well, the shape of the prior become important. Our prior information in hand, in most of the cases, is hard to express in terms of real Probability Distribution Function (PDF), or expressed in P D F which is difficult to marginalize. The common practice in these cases, is either to assign an uninformative prior (expressing ignorance regarding the value of the parameter) or not to marginalize at all, and to use some value (best representing our prior information) for that nuisance parameter. The latter practice is equivalent to assigning the prior a 8 function, expressing certainty about the value of the nuisance parameter. In many cases, both practices represent states of knowledge different from the one in hand: we know more than mere ignorance, but we are not as certain as 8 function. Which one is less harmful? We illustrate the situation and the various cases in figure 6.1. Chapter 6. Summary, Discussion and Conclusion 104 If the data determine the value of the parameter well (peaked likelihood), we should let it do so, by assigning prior ignorance and marginalizing. As discussed in Chapter 3, in case the likelihood is much more peaked than the prior, the exact assignment of the prior does not matter, as long as the data and the prior are not 'surprising' in face of each other. The data will surprise us if the prior rules out the value-range in which the likelihood is peaked. On the other hand, if the data do not determine the value of the parameter well (broad likelihood), we should put our prior information to the best use. If marginalization using our true prior, is computationally not possible we should approximate our prior. A crude, but in many cases acceptable approximation is the 8 function - approximate the value of the parameter from our prior knowledge and use this value instead of marginalization. In several cases during the classification process (chapter 5), we chose not to marginal-ize. This was our choice in case of one datum point; the likelihood was broader than the prior in hand. 6.2.3 M u l t i - D i m e n s i o n a l Search for Mode l s When performing a multi-dimensional search in the models space, we should remember that the space is discrete. In general, standing at one model in the models space, it is not clear which are the models we should compare our current model to. It is possible that they are not the neighbouring models; neither when deciding what is the next model to test, nor when deciding that we arrived at a maximum after finding that all neighbouring models are of lower probability. The general topology of such spaces as a function of the number of parameters in each model, degree of orthogonality to other models, the priors of the models parameters, and perhaps structure of the data deserve further research. The search in this work was designed for the special case of the parameters of the T2 distributions and was fine-tuned ad-hock; thus, no conclusions regarding the general Chapter 6. Summary, Discussion and Conclusion 105 model search are made. 6.3 Conclusions The results of the analysis of the T 2 distribution of the brain tissues are given in Table 3.4 for the model selection and in Tables 4.3 to 4.4 for the parameter estimation. We here indicate some of the findings that had not been explicitly posted in those tables. 6.3.1 Evidence of a Short T 2 Component ^ A l l the tissues, except Insular cortex, showed significant evidence in favour of having a component of short T2 in the range of 5-15ms. The odds ratio in favour of model having the lower component over a model lacking the lower component range from 6:1 for Head Caudate Nucleus to 1015:1 for Splenium Corpus Callosum. The gray matter tissue exhibit, in general, lower evidence for the lower component, along with lower estimated amplitude of that component. Recall that the evidence for the lower component in the Head Caudate Nucleus comes only from the data of the first measurements (at time 10ms). This result in under .determination of the parameters of this component, as shown in Figure 4.7. The Insular cortex showed an odds ratio of 40:1 in favor of not having a lower com-ponent. According to [2], there seems to be no apparent reason to expect inherent differences in the T 2 of the lower component in the white matter tissues. The results of our work indicate no evidence for such differences but for the Internal Capsules, that have a slightly higher T 2 time. The lower component is usually assigned to water confined within myelin bilayers. A number of independent supportive indications to this assignment are given in [2]. The results of this work do not weaken or contradict any of these indications. Chapter 6. Summary, Discussion and Conclusion 106 6.3.2 Evidence of a Long T 2 Component No evidence was found for long T 2 components, with T 2 of more than 1000ms, in the brain tissues except for the Cortical Gray tissues: In Insular Cortex and Cingulate Gyrus in the form of DC level, and in the data set labelled ' Cortical Gray1 as a component of T 2 just below lsec. Most of the tissues exhibit a component of T 2 in the range of 150-300ms. It is not clear to the author to what source this component should be assigned. 6.3.3 Components Width No evidence for spread of the T 2 times in the components of the T 2 distributions were found. As discussed and demonstrated in Chapter 3, this is not to say that there is no width, but that there is no justification in the data and our prior knowledge for the incorporation of width in our models. The data is not sensitive to the width of the T 2 distribution components; a model containing widths (of the order of magnitude of those demonstrated in Chapter 3) will explain the data as well as a discrete model, whether the data were produced by a discrete or a continuous T 2 distribution. Thus, lacking any prior information that will strongly support the continuous and more complicated model, the Bayesian procedure chooses the discrete models as the simplest models that adequately explain the data and our prior information. One may introduce strong prior information supporting the continuous distribution. This will result, under the current noise level, in an under_determination of the param-eters of the distribution. The joint posterior probability distribution of the parameters will have ridges (in case of one redundant parameter), surfaces (in case of two redundant parameters) or hyper-surfaces (in case of three redundant parameters) of high probability. Chapter 6. Summary, Discussion and Conclusion 107 6.3.4 Alternating Echo Refocusing The Alternating Echo Refocusing (AER) component in our models is almost orthogonal, over the measurement points, to the other base function.' As a result, its estimation is almost free from coupling to the estimations of other parameters and is accurate to the level of 5-8%. The highest A E R component was found in Head Caudate Nucleus, as anticipated by MacKay [22] associated with the extent of laminar flow of fluids in this tissue. -Bibl iography [1] G. Larry Bretthorst. Bayesian Spectrum Analysis and Parameter Estimation. Lec-ture Notes in Statistics:48. Springer-Verlag, 1988. [2] A . MacKay, K.Whit tal l , J . Adler, D. L i , D. Paty, D. Graeb/n Vivo visualization of myelin water in brain by magnetic resonance. Mag. Res. in Med., 31:673-677, 1994. [3] A . L. MacKay, K . P. Whittall, R. A. Nugent, D. Graeb, I. Lees, D . K . B . L i In vivo T2 relaxation in normal human brain. Mag. Res. in Med., Aug. 1994. [4] Kenneth P. Whittall and Alexander L. MacKay. Quantitative interpretation of NMR relaxation data. J . of Mag. Res., 84:134-152, 1989. [5] R.S.Menson and P.S. Allen. Application of continuous relaxation time distributions to the fitting of data from model systens and excited tissue. Mag. Res. in Med., 20:214-227, 1991. [6] E.T.Jaynes. Probability theory - the Logic of Science. 1993. [7] M . Tribus. Rational Descriptions, Decisions and Design. Pergamon Press, Oxford, 1969. [8] T. J . Loredo. The promise of bayesian inference for astrophysics. Modern Astron-' omy, 1992. [9] P. C. Gregory and T.-J. Loredo. A new method for the detection of a periodic signal of unknown shape and period. The Astrophysical Journal, 398:146-168, Oct. 1992. [10] H . Jeffreys. Theory of Probability. Oxford Uni. Press, London, ed 3, 1961. [11] E.T.Jaynes. Prior probabilities. IEEE Transactions on Systems science and Cyber-netics, SSC-4:227-24L, .1968.' • • [12] E.T. Jaynes. Marginalization and Prior Probabilities. Bayesian Analysis in Econo-metrics and Statistics. North-Holland Publishing Company, 1980. [13] E . T. Jaynes. Bayesian Spectrum and Chirp Analysis. Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems. D. Reidel ed., Dordrecht-Holland, 1987. 108 Bibliography 109 [14] A . P. Crawley and R. M . Henkelman. Errors in T2 estimation using multislice multiple-echo imaging. Mag. Res. in Med., 4:34-47, March 1987. [15] Peter Cheeseman et al. Bayesian classification. Proceedings on the seventh National Conference on Artificial Intelligence (AAAI-88), pages 607-611, 1988. [16] H .B .W. Larsson, J.Frederiksen, L. Kjaer, O. Henriksen, J . Olesen/n vivo determi-nation o / T i and T2 in the brain of patients with severe but stable multiple sclerosis. Mag. Res. in Med., 7:43-55, 1988. [17] R.V.Mulkern, S.T.S. Wong, P. Jakab, A . R. Bleier, T. Sandor, F . A . Jolesz CPMG imaging sequences for high field in vivo transverse relaxation studies. Mag. Res. in Med., 16:67-79, 1990. [18] J . P. Armspach, D. Gounot, L. Rumbach, J . Chambron In vivo determination of multiexponential T2 relaxation in the brain of patients with multiple sclerosis. Mag. Res. Imag., 9:107-113, 1991. [19] C.S.Poon and M.R. Henkelman Practical T2 quantization for clinical applications. J. of Mag. Res. Imag., 2/5:541-553, 1992. [20] G. Larry Bretthorst. Bayesian analysis II: Signal detection and model selection. J. of Mag. Res, 88:552-570, 1990. [21] Wendy A . Stewart, Alex.L. MacKay, Kenneth.P.Whittall, G. R. Wayne Moore, Don-ald W. Paty Spin-spin relaxation in experimental allergic encephalomyelitis, analy-sis of CPMG data using a nonlinear least squares method and linear inverse theory. Mag. Res. in Med., 29:767-775, 1993. [22] Alex MacKay, Dept. of Radiology/U.B.C. - Private Conversations. [23] Alister Prout, Dept. of Neurology/St. Paul Hospital, Vancouver, B .C. - Private Con-versations. Appendix A Safety Valve Routine The ability to approximate the integral over the nonlinear parameters is crucial to the feasibility of model selection in the general case, where such an integral cannot be done analytically or by usual numerical methods. The case of gaussian distributed noise is of general interest. We brought the likelihood to an almost symmetric shape (as function of the non-linear parameters) by transformation of the nonlinear parameters. We then try to ap-proximate its peak. In most of the cases, the expansion of the log(likelihood) function in Taylor series will be adequate; the distribution assigned to the noise is Gaussian and thus log(likelihood) will behave in a quadratic manner near its peak. This is not al-ways the case; if the problem is not well posed down to the noise level, the behaviour of log(likelihood) will not be quadratic in all principle directions. The (pseudo) non-uniqueness of the solution seems to give rise to behaviour of — x2n where n=2,3...(see figure A . l ) . This will cause a very flat log(likelihood) at the point of maximum likeli-hood and small coefficients of the Taylor expansion in some directions, which in turn will lead to over estimation of the integral that enters the likelihood of the model. In order to estimate the integral we will assign 'artificial' eigenvalues to the 2nd deriva-tives matrix b. Let Vj (j = l. .r) be the eigenvalues of matrix b. We correct any eigenvalue smaller than a certain threshold. The threshold is set as a function of a reasonable ex-tension of the logarithm of the nonlinear parameters under the given noise level. An 110 Appendix A. Safety Valve Routine 111 eigenvalue that indicates a variance of (a linear combination of) the nonlinear parame-ters of more than a factor of 2, for instance, is suspected to be a-result of non-quadratic behaviour of the log[likelihood]. That is, we correct eigenvalues satisfy 2 < * 2 > > 2 \! Vi > where < <r2 > is approximated by 0.1. Our results depend weakly on this approxi-mation. We carry out the following steps: 1. Each eigenvalue describes the curvature of the likelihood in one of the principle axes. We find the position, along this principle axis in which the log(likelihood) • drops by 2< o2 >. We solve mK*(U) - mh?(U + t • Wj) = 2 < a2 > (A. l ) for t, where Wj is the eigenvector associated with the eigenvalue Vj . 2. Averaging the absolute values of the two solutions of (A. l ) will give the width of the likelihood in the direction of this principle axis.: tl + |<2| w — -3. Now, what is the influence of the non-quadratic shape of the log(likelihood) on the value of the integral? The integral of a gaussian of unit amplitude and unit width is Jexp(—x2)dx = yjv « 1.77 , / exp(—x6)dx w 1.85 and / exp(—x2™)cir—> 2.0.. That is, assuming log(likelihood) does behave as — x2n near its peak, the resulting decrease in the integral from erroneously assuming it is a gaussian of the same width Appendix A. Safety Valve Routine 112 lr1 0 • iQ i—i lr1 H-w I—1 H-o o pi 0.5 0 •0.5 -1 5 -2 2 .5 0.5 -1 \ v hi V 1 I t"1 0 l 1 H-H H-O O •0.5 -1 •1.5 0 . 3 - 0 . 2 - 0 . 1 ) . 0 . 1 0 . 2 0 . Figure A . l : Normal Behavior (right) and Abnormal Behavior (left) of the Likelihood along Principle Axes. In case of abnormal behavior we search for characteristic width w and use it for the integration of the likelihood. The figure was generated by subtracting the maximum of Log [Likelihood] from the LogfLikelihood]. The data were taken from the simulation of the continuous distribution; the model contains 4 discrete exponents. will not exceed 12%. We multiply w by a factor of 1.06 to confine the resulting error to 6%. 4. Thus, the substitution to the eigenvalue is given by: v. 2 < a 2 > ' j S u b ' (1.06 w)2 The approximation described above enable us to estimate the integral to accuracy of 20-30%. Models that will have likelihood difference smaller then that will be treated 'manually'; in the range of 20-30% difference of the likelihoods, the exact assignment of the prior ranges become important. As can be seen from Table 3.4 , no such small differences had been encountered during the analysis of the brain tissues data. A p p e n d i x B A r e Two Signals Equal? In this appendix We will treat the problem of signal detection, and the problem of comparing two sets of measurements to find whether their signal source rate, of Poisson distributed events, was the same. The problem of signal detection can be regarded as a special case of the problem of deciding whether two signals are equal, for a case where we know that one of the signals is zero. We treat this problem in Section B . l . We then proceed to treat the more general problem (of deciding whether two signals are equal) in Section B.2. We are interested in the signal rate s of Poisson distributed events.of some source. The problem of estimating this rate is complicated by an, unavoidable background of. unknown rate b. In order to estimate the rate s regardless the background rate b, a set of. On/Off measurements is performed. We assume that the source s contributes only to the On measurements, and that the rate b of the background is the same in the On and Off measurements. The solution for the estimation of s was given by Loredo [8]: We denote by T the duration of the measurement and by N the number of events in that duration. The posterior distribution of the background rate is given by P(b/NoffToffIb) = I o f f ^ ' f , . . = (B.2) assuming nonJnformative prior for b. In the following, we will denote this probability by- p(b/1); we regard the results of the Off measurements as part of our background information. 113 Appendix B. Are Two Signals Equal? 114 The likelihood of s • b in light of The results of the On measurement is given by the Poisson distribution for s + b : \(s + b)T }Non e-(s+b)T™ • P(Non/sbI)=^S + b)lonl , 6 — , (B.3) The posterior probability of the rate s is given by ' Non rp I rp U _ -sTon p(s/NonI) = 53 d • 1 ^ S l o n l . (B.4) =o *• where U | T o f f y (Nm+Noff-i)l . Q _ \ ^ Tm ) (Non-i)l v ^ W i , TOJfy {Non+Noff-j)\ • "; . 2^j=o y t- Tonj • (Non-jy. • We are going to work out the detailed algebra; The final solutions can be found in Sections B. l .4 (for signal detection) and Section B.2.4 (for comparing two signals). B . l Signal Detec t ion * Given a set of On/Off measurements, is there evidence for the existence of a signal? That is, find the odds ratio of: ' . • Mo = The rate of the signal is zero (there is no additional signal on top of the background). O V E R • M i = There is a signal with rate s. Appendix B. Are Two Signals Equal? 115 B.l.l M0 • The posterior probability of Mo is given by: p(W„„/M„6) P ( % „ / / ) where: • p(b/I) is the probability of b given the background information / , where / includes the results of the Off measurements. p(b/I) is given by Eqn. (B.2). • p(M0/bI) = p(M0/I) is the prior of the model. We set• p(M0/bI) = ,p(Mi/W) =.§,.. indicating prior ignorance regarding the plausibility of the signal existence. • p{NOn/M0b} is the likelihood and is given by Eqn. (B.3) for s o. • p(Non/I) is the global likelihood. We will not have to calculate it for this model comparison. . > B.l.2 Mi The posterior probability of M\ is given by: p(M1/NonI) = JJdbds p(MlSb/Non:I) = JJ db ds p(M±sb/I) • ^ N ^ M ^ "= JJdbds P(b/I) P(MjbI) pis/M.bl) P ^ M ^ "(B,6) p(Non/MlSbI) p(Non/I) where: Appendix B. Are Two Signals Equal? 116 • p(b/I) is the prior for b. as for Mo, it is given by Eqn. (B.2). • p(Mi/bI) is the prior for Mx. As for M 0 „ p(Mi/bI) -\. • p{sIM\bl) is the prior for s (= j^—).' • p(Non/MisbI) is the likelihood. It is given by Eqn. (B.3). Odds Ra t io The odds ratio 0 O i is the ratio of the probability p(M0/NonI) to p(Mi/NonI) (given by Eqn. B.5 and B.6, respectively). Since p(M0/bI) = p(M0/I) and p{M1jbI) = p{M1/I) we can write: a _ PJMolKJ) = fdb p(b/I) p{MQlbI) E ^ f f : " p{M,INonI) ffdbds pib/I) P(M1/bI) P(s/Mxbl) p{N;(rio%bI) = PJMo/I) fdbp(b/I)p(Non/M0b) p(Mi/I) ffdbd p'b/I) p{slM1bI) p{NonlMxsbI) { • J B.1 .3 Reduc t ion of the Integral Integration Over b The integral / db p(b/I) p(Non/sbI) appears both in the numerator and the denominator: By equations (B.2) and (B.3): 1 r f Tu- (bT fAN°ist~bT-=i} \(s + b)T ]N°" e _ ( s + 6 ) T o " / db p(b/I) P(Non/sbI) = db i o / / { °"} y l[s + o)lon\ e _ = 1 E q n . (B.2) gives the posterior probability of the background for large 6 m a x - The general form differs only by the multiplicative constant: ^N ^^Tb y This constant (appearing in both the numerator and the denominator) is independent of s and 6 and thus will not alter the odds ratio. Appendix B. Are Two Signals Equal? 117 (Toff)N°ff+* e-°T™.(Ton)N°-Noff\Nonl Expanding (5 + b)Non : J db e-h{T°ff+T°n)bN°"{s + b) N0; ( r o / / )^ / /+ 1 e-T°» (Ton)N° N0jj\Non\ The integral to be solved is of the form , Non / db e-b(T°ff+T°^bN°ff y s"™-^-N 1 1 nn. • (B.8) / e~axxbdx = erax b xb'j bl and Eqn. (B.8) becomes (B.9) (Toff) ° f f + 1 e-'T" (Ton) Noff\Non\ N°N I ,r N ' (Ton + Toff)N°u+'+l (B.10) where: l(bmax) —• 0. . I(Lax) will be given with the final expression for the odds ratio. The numerator of the odds ratio (Eqn. (B.7)), becomes (by taking the limit s —> 0 in Eqn. B.10 ): • . • . numer(odas) = - • - -— V ; 2 Noff\Non\ Integration Over s (Noff + Non)\ (Ton + Toff)N°ff+N°^ ( B . l l ) In order to compute the denominator of the odds ratio, we have to integrate Eqn. (B.10) over 5: . Appendix B. Are Two Signals Equal? 118 • r r r (T ,AN°ff+1 e'sTon (T )N° JJds db p(b/I) P(Non/sM) = J d s [ l o f f ) N l N m l 1 °n) E i=0 N ' 1 ' o n • %\{Nm-i)\ {Ton + Toff)Noff+^ (^ ^ max) (B.12) The integral to be solved is of the same form as Eqn. (B.9) and results in: off • i v o n -JV 0 : E t=0 [Non-i)\ j+l) 9 ( ^ max ) i\(Non-i)\ (Ton + Toff)Noff+^ where g(smax) —>• 0. <7(smax) will be given with the final expression for the odds ratio. The denominator of the odds ratio (Eqn. (B.7)) becomes: ^max) (B.13) deno(odds) = 1 1 (Toff)N°ff^ (Ton) N0 2 -Sr N N ' off- 4 ' o r i -son E t=0 (N0 ( T o n ) ( ^ n - , + l ) AT 1 * o n • ; ! ( i V o n - z ) ! (Ton + Toff)N°ff+*+i • ^ ( ^ m a x ) (B.14) B.1.4 The Odds Ratio Combining Eqn. B . l l and B.14, the odds ratio is: n. (Noff+Nony ] ( h v (Ton+To//)JV°// + J V°" + 1 H°maxJ f"R 1 en Appendix B. Are Two Signals Equal? 119 where e Non-i (Tnji (Non-i-j)\ • E on and as a function of T 0 / / , T o n , A ^ / / , -/V o n , 5 m a a ; and bmax). It should be multiplied by the prior Note that the odds ratio depends on the range of the priors for b and 5. The depen-dence on bmax is very weak. On the other hand, the odds ratio is linear in smax. B . 2 A r e Two Signals Equal? Given two data sets A and B of On/Off measurements, find the odds ratio of: • • M\ = the source rates sa and Sb are identical regardless of differences in the background intensity b). B . 2 . 1 M i The probability of the model M i is given-by: 2 2IA{IB) signify all the background information regarding A (B), including the results of the 'Off' measurements. of the odds ratio p|^°7{| if the latter is different from 1. O V E R • M 2 = the source rates.sa and si> are different (regardless of differences in the background intensity 6). Appendix B. Are Two Signals Equal? 120 (B.16) p(M1/NonANonBlAlB) = J J dsa dsb p(M1SaSh/NonANonBlAh) = I I j 7 / A/r IT T \ P I o n A NonBlMiSaSbIAIB) / / dsa dsb piMxSaSbflAlB) , • = J J ' ' P(NonAJSonBl1a1B) J J dsa dsbp(M1IIAIB) • p(sa/M1IAIB)- p{sblMlSaIAIB)-p{N0nANonB-lM1SdSbIAIB) p(NonANonB ITAIB) Now, • piMJlAls) is. the prior for M i . We set p(MiJTAIB) = p(M2/TAIB) = |. • p(salMiIAIB) — p(sa/IA) is the prior for sa. We assume ignorance priors for sa and 5 6 . That"is, p(sa/IA) = p{sb/IB) ='^-• p(sb/M1saIAIB) = S(sb - sa). • p{NonANonBlMisasbIAIB) = p(NonA/M1sasbIAIB) • p(NonB/NonAM1sasbIAIB) = p{NonA/saIA) • p(NonB/sbIB}.3 And thus Eqn. (B.16) becomes: - p(M1lNonANonBIAlB) = 1 • — • / fdsa dsb 6(sb - sa) PW^/*M)-P(N^/S>IB) = 2 W i i > ' P(NonANonB/IAIB) 3 I n general, p(NONB/N0„AMIIAIB) ^ p(NONB/M-^IAIB)- The independence in our case is due to the knowledge of sj . Appendix B. Are Two Signals Equal? 121 •1 1 r d s p{NonA/sbIA) • p{NonB/sbaIB) ( B 17) ^ °max J a p(NonANonB/IAIB) where sba is the rate of source B numerically identical to the rate of source A , sba = sb -Mi. B.2.2 M2 The probability of model M2 is given by: p(M2/NonANonBIAIB) = J J dsa dsb p{M2sasb/NonANonBIAIB) = 111 7 f a JT ' IT T \ P\1Von A NonB/M2sasbIAIB) / / dsa dsb p(M2sasb/IAIB) — . = J J P{IyonA-NonB/lAlB) J J dsadsbp(M2/IAIB)-p(sa/M2IAIB)-p(sb/M2saIAIB)- . p(NonANonB/M2sasbIAIB) p[NpnA NonB IIAIB ) Substituting the known probabilities, we have: _ 1 1 1 f / , , p{NonAlsaIA)-p(NonBlsbIB) (B .18 ) / / dSa d S b . M i V ^ : : a ^ : M ; r f 7 B ) ( B . I 9 ) / J p(NonANONB/IAIB\ The Odds R a t i o Combining Eqn. B . 1 7 and B . 1 9 , the expression for the odds ratio is: Q _ . / d s a p(NonA/sJA) • p{NonB/sBAIB) 1 2 j2—• 11 dsa dsb p(NonA/sAIA) • p(NonB/sBIB) Appendix B. Are Two Signals Equal? 122 Using Bayes theorem, we have: P(NonA/sJA) = p(NonA/IA). ti**!^1*) " where: • p(sa/NonAIA) is given by eq.(B.4), assuming b m a x , the range of the prior of the background rate, is sufficiently large. • p(sa/IA) is the prior for s {=j^)- It is independent of s. • p(NonA/I) is a constant. Then, 0 1 2 = fdsa p(NonA/I) • -p(NonB/I) • ^ J S ) B ) ^•Ifdsa dsb P(NonA/I) • • P(NonA/I) • «*J?X)A) • Idsa • p{sa/NonAIA) • p(sba/NonBIB) _ j2— • f jdsa dsb p{sa/NonAIA) • p(sb/NonBIB) (B.21; °max / dsa • p(sa/NonAIA) • p{sba/NonBIB) I dsa p(sa/NonAIA) / dsb p(sb/NonBIB) B.2.3 Reduction of the Integrals Integration of the Denominator The denominator is a multiplication of two independent integral. Substitute Eqn.(B.4) in Eqn. (B.21), these integral are of the form: Appendix B. Are Two Signals Equal? 123 £ C . ' - ^ r - / d sie-aTm t=0 il 1-where: and Non (rp Q. \±on) . »'=0 Z ! E C -i=0 (r o n) Z ! " s T ° " ^ z'! 5D(5max) +i n 4- Toff\i(M°»+Moff-iy. V ^ T o n / (A/on \^Non(-\ I T c f f (No„+Nof}-jy. l^j=0\L "I" T o n j (Nm-j)l (^ max) - J z! So the denominator is: (B.22) deno(odds) onA (rp 'max \ , = 0 Z ! z! +1 — 9DA(sma.x) 'NonB (rp C j S z^ i=o l -(TonBy+i gDB(sma.x) (B.23) Integration of the Numera to r Substituting Eqn.- (B.4) in Eqn. (B.21), the numerator is: / , " f i * („ TmA(sTcnA)>e-T-->\ ' f a f ( TmB(sTmB)>e-T-*-Appendix B. Are Two Signals Equal? 124 NonA NonB frp frp . E E ciAc3B[lonA) { l o n B ; dsi+i e - ^ + r — > A—n „'—n I. • 7 . J i=0 j=0 NonA NonB frp frp 2^ GiACjB J] J] »=o i = o - • J-Where: (TonA + TonBy+^ <?Af(-Smax) e-Sma.x(TonA+TonB) t+j ( s m a x ) * + J - n (*+j)' 3 f f ( s m a x ) = - 7 7 7 ; 1 rp \ ' E {TonA +-TonB) ^Zo {TonA + T o n B ) n {i +j-n)l B . 2 . 4 T h e O d d s Ratio Combining Eqn. (B.23) and Eqn. (B.24), the odds ratio is (B.24) 9N{smax) —> 0 Smax ^OO (B.25) Under the assumptions that bmax is large enough to justify the use of Eqn. (B.4). Eq. (B.25) is the odds ratio in favour of equality of signals sa and s0, as function of the results of the On/Off measurements of sets A and B. Again, the dependence on the range of b is very weak and we gave here the solution assuming bmax —•» 0 0 . As in the solution to the signal detection odds ratio, the odds ratio is linear in sm„r.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Bayesian model selection and classification: application...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Bayesian model selection and classification: application to brain tissues through T distribution Gideoni, Iftah 1995
pdf
Page Metadata
Item Metadata
Title | Bayesian model selection and classification: application to brain tissues through T distribution |
Creator |
Gideoni, Iftah |
Date Issued | 1995 |
Description | A Bayesian procedure for model selection, parameter estimation and classification, using models of non-orthogonal basis functions, is applied to the problem of T2 decay rate distributions in brain tissues. The feasibility of generating reliable synthetic images of tissue-classified pixels is examined. The work determines, for the first time, the Bayesian probability of existence of short (5-15ms) T2 component in the brain tissues, and found it to be higher than 99% for all white matter tissues and higher than 80% for all gray matter tissues except Cortical Gray . The probability of having no more than three components of decaying exponents in the Ti distributions of the brain tissues, is found to be higher than 90% for all the tissues. We arrive to these findings through the use of models which are parameterized by highly coupled parameters, and the use of multi-dimensional search in the space of these models. |
Extent | 6538546 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-01-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0086784 |
URI | http://hdl.handle.net/2429/3796 |
Degree |
Master of Science - MSc |
Program |
Physics |
Affiliation |
Science, Faculty of Physics and Astronomy, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1995-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1995-0346.pdf [ 6.24MB ]
- Metadata
- JSON: 831-1.0086784.json
- JSON-LD: 831-1.0086784-ld.json
- RDF/XML (Pretty): 831-1.0086784-rdf.xml
- RDF/JSON: 831-1.0086784-rdf.json
- Turtle: 831-1.0086784-turtle.txt
- N-Triples: 831-1.0086784-rdf-ntriples.txt
- Original Record: 831-1.0086784-source.json
- Full Text
- 831-1.0086784-fulltext.txt
- Citation
- 831-1.0086784.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0086784/manifest