Bayesian Model Selection and Classification: Application to Brain Tissues through the T2 Distribution
Iftah Gideoni, 1995


BAYESIAN M O D E L SELECTION A N D CLASSIFICATION: A P P L I C A T I O N TO B R A I N TISSUES T H R O U G H T D I S T R I B U T I O N 2  By Iftah Gideoni B . Sc. (Physics) Tel Aviv University, 1992  A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER  OF SCIENCE in  THE FACULTY OF GRADUATE STUDIES DEPARTMENT OF PHYSICS We accept this thesis as conforming to the required standard  THE UNIVERSITY OF BRITISH COLUMBIA March 1995 © Iftah Gideoni, 1995  In  presenting this  degree at the  thesis  in  University of  partial  fulfilment  of  of  department  this thesis for or  publication of  by  his  or  her  The University of British Columbia Vancouver, Canada  DE-6 (2/88)  It  this thesis for financial gain shall not  Department of  -i  representatives.  JciM  for  an advanced  Library shall make  it  agree that permission for extensive  scholarly purposes may be granted  permission.  Date  requirements  British Columbia, I agree that the  freely available for reference and study. I further copying  the  is  by the  understood  that  head of copying  my or  be allowed without my written  Abstract  A Bayesian procedure for model selection, parameter estimation and classification, using models of non-orthogonal basis functions, is applied to the problem of T2 decay rate distributions in brain tissues. The feasibility of generating reliable synthetic images of tissue-classified pixels is examined. The work determines, for the first time, the Bayesian probability of existence of short (5-15ms) T2 component in the brain tissues, and found it to be higher than 99% for all white matter tissues and higher than 80% for all gray matter tissues except Cortical  Gray . The probability of having no more than three components  of decaying exponents in the Ti distributions of the brain tissues, is found to be higher than 90% for all the tissues. We arrive to these findings through the use of models which are parameterized by highly coupled parameters, and the use of multi-dimensional search in the space of these models.  11  Table of Contents  Abstract  ii  Table of Contents  iii  L i s t of Tables  vi  L i s t of Figures  vii  Acknowledgements  x  1 Introduction  1  1.1  1.2  2  3  Bayesian Inference  4  1.1.1  Manipulation Rules  1.1.2  Direct Probabilities  •  5 12  M R I - T Distribution  14  1.2.1  The Data  14  1.2.2  Background Knowledge  16  2  Bayesian Classification  20  2.1  Classification O n t o ' O l d ' D a t a Sets  21  2.2  Model Comparison  24  M o d e l Selection 3.1  The Set of Models 3.1.1  28 . . .  . . . . . . . . . .  The Base Functions, Gj  28 29  a  m  3.1.2 3.2  3.3  4  5  32  The probability of the Model p(M /DI)  34  3.2.1  The Prior Probability Distributions  36  3.2.2  The Likelihood  38  a  Results  48  3.3.1  Simulated Data  48  3.3.2  Experimental data  56  E s t i m a t i o n of the P a r a m e t e r s  58  4.1  Noise Variance  60  4.2  Nonlinear Parameters  62  4.3  Linear parameters  4.4  Results  . . .'  63 65  4.4.1  Simulated Data  65  4.4.2  Experimental Data  66  Classification 5.1  5.2  6  Search Path  77  Procedure  .  5.1.1  The Total Amplitudes Integral  5.1.2  The T Distribution Integral  5.1.3  Final Expressions .  2  Results  77 79  . .  82 91  . .  92  5.2.1  Feasibility  .  5.2.2  Tissue Classification  94  5.2.3  Images  95  S u m m a r y , D i s c u s s i o n and C o n c l u s i o n  iv  92  99  6.1  Summary  6.2  Discussion  6.3  . . . . . . . .•  ,  . •:  6.2.1  Multiple Measurements: cr Vs. 
a  6.2.2  Mariginalization in the Neighbourhood of No Data  6.2.3  Multi-Dimensional Search for Models  99 100 101  s  .  102 104  Conclusions .  105  6.3.1  Evidence of a Short T Component  6.3.2  Evidence of a Long T Component  106  6.3.3  Components Width  106  6.3.4  Alternating Echo Refocusing  2  . . . . . . . . . . . . . . . . .  2  . .  105  107  Bibliography  108  A  Safety V a l v e R o u t i n e  110  B  A r e T w o Signals E q u a l ?  113  B.l  B.2  Signal Detection  ,  114  B.l.l  M  . .  115  B.1.2  Mi  ...  115  B.l.3  Reduction of the Integral . . . . '  116  B.1.4  The Odds Ratio  118,  0  .  Are Two Signals Equal?  119  B.2.1  Mi . . . .  .  B.2.2  M  121  B.2.3  Reduction of the Integrals  122  B.2.4  The Odds Ratio  124  2  v  119  L i s t of Tables  1.1  The Database.  15  3.1  Generating Parameters of Simulations  49  3.2  Search Path and Odds Ratio for the Continuous T Simulated Distribution. 51  3.3  Search Path and Odds Ratio for the Discrete T Simulated Distribution.  51  3.4  Experimental Data - model selection results.  57  4.1  Generating T Distributions and Simulation Results.  4.2  Uncertainty of Linear Combinations of Amplitudes.  68  4.3  Estimated Parameters for White Matter Tissues.  69  4.4  Estimated Parameters for Gray Matter Tissues.  -70  2  2  2  vi  .  66  List of Figures  1.1  Raw Data: Decay Curves of Pixels taken from Major Forceps  3.1  Search Path Flow Chart  3.2  Generating Parameters for the Discrete and Continuous Simulated Distributions.  16 •. .  . . . . . . . . . .  . . . . .  33  50  3.3  Model Selection Failure  53  4.1  Probability Distribution of the Noise Level for a typical tissue  60  4.2  Simulation Parameters and Estimated Parameters for the Discrete and Continuous Distributions  4.3  67  The Estimated T Distributions and Residuals for the Estimated Param2  eters of White Matter Tissues 4.4  . .  72  The Estimated T Distributions and Residuals for the Estimated Param2  eters of Gray Matter Tissues. 4.5  73  The Estimated T Distributions and Residuals for the Estimated Param- . 2  eters of Gray Matter Tissues (cont.).  74  4.6  Joint Probability Distribution for the Low Component of a Typical Tissue.  75  4.7  Joint Probability Distribution for the tissue Head Caudate Nucleus.  76'  5.1  Typical Spread of the Measurements of one Tissue  5.2  Models/Data Averages for the Tissues and Individual Pixels' Data From  5.3  . .  86  One Tissue  93  Classification Results for White Matter Tissues  96  vn  5.4  Classification Results for Gray M a t t e r Tissues  97  5.5  Synthetic Classification Image of Patient Eleven  98  6.1  M a r g i n a l i z a t i o n Decisions i n Light of Likelihood W i d t h s .  A.l  N o r m a l Behavior and A b n o r m a l Behavior of the Likelihood along P r i n c i p l e Axes  103  112  vm  Acknowledgements  I would like to thank my supervisor, Dr. Phil Gregory. His course "Theory of Measurement" , changed both my perception of the scientific method and the direction of my studies. Any experimental scientific work is composed of the design of the experiment, performance of the measurements, and analysis of the results.. The author of this thesis did only an analysis of data. If this thesis contains any meaningful conclusions regarding the distribution of the T decay rates, it is mainly the result of work done by Dr. 2  Alex MacKay and Dr. Ken Wriittall. They developed the M R I sequence, performed the measurements, and collected the data. 
Their kindness in sharing this data and their knowledge, and the time spent by them in rearranging the data into an accessible form, are hereby gratefully acknowledged. Last but never least, I am delighted to thank Elana Brief for reading and editing the thesis. Without her efforts, this thesis would look much worst.  ix  Chapter 1  Introduction  The field of Bayesian inference as a scientific tool is young; even in the last years, the Bayesian inference has not been put to wide use outside the Astrophysics community. The Bayesian inference scheme enables, in principle, direct and consistent application of basic ideas of the scientific method regarding the way we infer about natural phenomena and put our hypotheses to test. Nevertheless, the utilization of this theory to the analysis of scientific data and background knowledge remains mostly, unexplored. In this work, we will try to share the exploration of the theory in three main directions: first, in a demonstration of multi-dimensional search in a space of models. We try to search and test automatically and efficiently many models, different in more than one attribute.  The general problem of the multi-dimensional search for the 'right' model  is important but will not be treated in this work. A degenerate demonstration will be presented here. Second, we will solve a model selection and parameter estimation problem using models of decaying exponents as basis functions. This case, in which the basis function are non-orthogonal, is known to be one of the more pathological cases, both from the traditional, 'frequentist', and from the Bayesian point of view. The general approach we adopt here is similar to that developed by Bretthorstfl] for time series analysis. The ability to solve such problems using models of non-orthogonal basis functions in a computable manner, is crucial for many applications. Third, we discuss the general solution for a classification problem: we are given a new data set and a number of old data sets, and assume that each of the old data sets was produced by different mechanism.  1  Chapter  1.  Introduction  2  We then infer which of these mechanisms produced the new data set. We approximate the general solution for this problem and apply it in the limit case where much more information is carried by the old data sets than by the new data set. The above will be applied to the problem of inferring about the T . distributions 2  of in vivo human brain tissues, where the data is taken from. Carr-Purcell-Meiboom-Gill ( C P M G ) Magnetic Resonance Imaging (MRI) sequences. The analyzed data is considered to be produced by decaying exponents with some distribution of the decay rates (the T  2  distribution).  We model this distribution by a finite number of models, each of which  has a finite number of parameters. The T distribution problem is also dealt with in an 2  elaborate frequentist literature,using inverse theory, and non-linear approaches similar to ours. The results of the application of probability theory may help to understand better the results of the traditional solutions. The determination of the distribution of the T coefficients in brain tissues and lesions 2  is important for. a number of reasons. First, it may serve as classification criterion to be used for imaging, based on sub-voxel structure differences,that can not be seen in M R I [2]. 
Second, once the underlying physical mechanisms of the various components in the distribution have.been identified, determination of the distribution may serve to trace variations of the mechanisms among patients, tissues, or as a function of time. For example, t h e T 2 distributions are sensitive.to processes in lesions (as demyelination. in aging multiple sclerosis lesions [2]) and thus enable measurements of these processes which cannot be measured by other means.  ,.  Recently,'a new multi-echo M R I sequence for quantification of the T coefficients was 2  developed in U.B.C. [2]. This sequence enabled, for the first time, determination of short T components (less than 20ms.) in in vivo brain tissues measurements. We use the data 2  acquired by this new sequence. Partial analysis.of this data is given in [3].  Chapter  1.  3  Introduction  The assignment of components in the T distribution to the underlying physical mech2  anisms will not be treated in this work, beside commenting about the ability to make such assignments. In this chapter we w i l l discuss the basic tools used by probability theory in answering well-posed problems. The reader who is unfamiliar with the Bayesian inference should be able to follow the rest of the work based on this introduction. We then discuss our state of knowledge about the T distribution problem. There is no intention of giving 2  background on M R I or the physics behind the T distribution, as it is of no importance 2  for the quantitative solution, once our background knowledge has been defined. After introducing our initial state of knowledge, the work is confined to the pure inference problem. In chapter 2, we discuss the Bayesian classification procedure. It is a generalization of the Bayesian model selection procedure for cases where the models are replaced by known probability distributions for parameters of the model. In chapter 3, we define a set of candidate models for the T distributions and apply 2  the Bayesian model selection procedure for choosing one of them for each tissue. The ability to consistently compare models of different degrees of complexity is a feature of the Bayesian inference not existing in the traditional, frequentist, methods. Also in that chapter, we will discuss the problems arising in the analysis of exponential decay curves; the non-orthogonality makes mistakes in choosing a model very costly. Lacking any solid grounds for choosing a model, this has led traditional workers to regard the non-linear methods (specifying a small number of decaying components) as disadvantageous [4],[5]; the price paid by the Bayesian procedure for being non-linear is merely the speed of computation. In chapter 4 we estimate the parameters of the model we chose in chapter 3 for each  Chapter 1.  Introduction  4  tissue. This is not a part of the classification procedure. Nevertheless, it has a standalone importance for the interpretation of the underlying physical mechanisms. Our parameters are highly coupled and poorly resolved and thus the estimations, both of the parameters values and the accuracies of these estimations should be done with care. We discuss the estimation of the noise in case of multiple measurements and systematic effects not accounted for by the model. In chapter 5 we apply the classification procedure introduced in chapter 2 to pixels of an M R image. 
We discuss the meeting of necessary conditions for classification: having different T distributions for different tissues, and the variance of samples within 2  one tissue. The classification procedure will demonstrate the merging of two sources of information: the results of classification based on the M R I image amplitudes, and the results of the classification based on the normalized T distribution. 2  1.1  Bayesian Inference  In this section, we will give the background needed for understanding the inference method used throughout this work. For the sake of completeness, we will first outline the basic ideas behind the method.  .  Fundamentals The goal of Bayesian inference (or better, "Probability  Theory as Logic") is the rigorous  development of a procedure [6] for assignment of probability (a real number) to encode the information regarding the truth of a proposition which is a member of a well defined set of propositions. The starting point for this process is a set of desiderata[6], [7]: 1. Representation of plausibilities  (abstracts) by probabilities (real numbers).  Chapter 1.  Introduction  5  2. Qualitative correspondence of these assignments with common sense; i.e. monotonicity (the bigger our confidence in the proposition the larger the probability attached to it) and continuity (infinitesimal change in the state of knowledge can result only in infinitesimal change in the probability). 3. Consistency: if a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result. 4. No use of information that is not available. 5. Equivalent states of knowledge will lead to equivalent assignments of probability. This set of desiderata leads to a unique procedure for assignments of the probabilities in any well posed inference problem. B y ''well"posed?we mean both the description of our state of knowledge by real numbers, on the one hand, and well posedness of the question asked in the Hadamard sense on the other hand. That is: uniqueness, existence, and continuous dependence of the answer on our state of knowledge. The notion of probability here is freed from the need to correspond to any frequency. In the natural sciences, we use probability theory to encode (by the probability) our state of knowledge regarding the truth of a mechanism. The rigorous development of the theory can be found in [6]. A concise discussion of Bayesian inference and the difference from traditional, frequentist, statistics can be found in [8].  1.1.1  Manipulation Rules  The beginning of the treatment of any problem is the unambiguous statement of our state of knowledge and the question for which we are seeking an answer; the propositions' probabilities (or ratio of probabilities) we want to find. We then use probability theory to  Chapter  1. Introduction  6  manipulate the desired probabilities to a form consisting of probabilities we can directly assign (direct probabilities).  We give the manipulation rules in this section, and discuss  the assignment of the direct probabilities in section 1.1.2. This section is based on [9]. We use the notation introduced by Jeffreys [10]: p(A/B)  is the probability that A is  true given that B is true where the arguments are propositions (statements), which are either true or false. The probability p(A/B)  describes our state of knowledge about the  truth of A assuming B is true. The proposition AB (or A • B) is true if both A and B are true and the proposition A.+ B is true if either A, B or AB is true. 
The proposition Ā (the overbar denoting negation) is true if A is false.

A form that appears many times in Bayesian work as the probability to be found (and will be used extensively in this work) is p(H_i/DI). H_i is a member of a set of propositions (hypotheses) {H_i}, whose probability we want to assess in light of some data, D, and some background information I. The background information must include the statistical model (the relationship between the data and a true hypothesis), the specification of the hypotheses set {H_i}, and the probability of each of the members of {H_i} before having knowledge of the data (the prior probabilities; see below).

The basic rules emerging from the theory are the sum rule,

p(H_i/I) + p(H̄_i/I) = 1,

and the product rule,

p(H_i D/I) = p(H_i/I) p(D/H_i I) = p(D/I) p(H_i/DI).

We use exclusive propositions, and thus, from the sum and product rules, we have

p(H_i + H_j/I) = p(H_i/I) + p(H_j/I).   (1.1)

As demonstrated in [1], once we state a set of propositions, we can treat them as exhaustive without influencing any of our quantitative solutions. That is,

Σ_i p(H_i/I) = 1,   (1.2)

a normalization rule. An important result of the product rule is Bayes' theorem:

p(H_i/DI) = p(H_i/I) p(D/H_i I) / p(D/I).

We usually use Bayes' theorem to describe the process of learning about the probability of the proposition H_i in light of the truth of a new proposition (usually data) D and other known information (background knowledge) I. In such a case:

• p(H_i/I) is the prior probability. It describes our knowledge about the truth of H_i before we saw the current data, D.

• p(D/H_i I) is the probability of collecting the data set D given the proposition H_i. It is the likelihood of H_i in light of the data D.

• p(D/I) is the global likelihood. By eq. (1.2) the global likelihood is

  p(D/I) = Σ_{H_i ∈ {H}} p(H_i/I) p(D/H_i I),

  ensuring a normalized posterior probability, given a normalized prior.

• p(H_i/DI) is the posterior probability. It describes our knowledge about the probability of H_i after we saw the data, D.

All the rules introduced so far hold when our set of models is continuous and the probabilities are replaced by probability density functions. Assume our proposition H_k is parameterized by the parameter Θ, so that H_k Θ means "H_k is true and the value of its parameter is Θ". Then, by repetitive application of eq. (1.1) and taking the limit of the sum to an integral, we have

p(H_k/I) = p(∫ H_k Θ dΘ / I) = ∫ p(H_k Θ/I) dΘ,   (1.3)

where the integrals are over all the values available to Θ, and ∫ p(Θ/I) dΘ = 1.

The rigorous justification and discussion of the passage from discrete propositions to continuous ones can be found in [6].

Model Selection

Usually, the propositions {H_i} are models describing the phenomenon under investigation. Often we want to compare a number of parameterized models, which may differ in form or number of parameters. We do so by comparing the posterior probability p(H_k/DI) for each of them. Bayesian model selection has a built-in penalty for complicated models: their posterior probability is larger than the posterior probability of simpler models only if their additional complexity is justified by the data.

In model selection, we often speak about the ratio of the probabilities instead of the probabilities directly. The ratio O_ij = p(H_i/DI)/p(H_j/DI) is the odds in favour of model H_i over model H_j.
B y Bayes theorem we have:  Pin/pi) _ mm • 13  M/Dl) =  =  H /i).^l  pi  p(H /I) t  piHj/I)  3  pjD/Hil) p{DlH I) 3  •  Chapter 1.  Introduction  9  That is, the odds is the ratio of the prior probabilities multiplied by the ratio of the likelihoods. Note that the Bayesian. model selection relies on the global likelihoods of the models and not on the maximum likelihood as is usually done in the frequentist statistics. The maximum likelihood (i.e., the maximum of p(D/H{{Q}I)  where {0} is the set of pa-  rameters characterizing the model), is achieved by assigning the parameters values that maximize the likelihood. B y relying on the global likelihoods of the model,  p(D/HiI),  the Bayesian procedure calculates the odds regardless of the parameters that enter each model. The transition from p(D/Hi{Q}I), p(D/HiI)  is done by marginalizatiou]  p(D/HJ)  = J (D{e}/HJ)d{e}  being the assigned direct probabilities to  the application of eq. (1.3) :  = Ip({0}/^,-7)  P  .p(/J/{0}/J -/)ci{0} t  (1.4)  It is here where the more complex model (the model that has more parameters) is penalized for its' extra complexity. We will demonstrate it by considering the following simple case [9]: Consider two models, M\ with a single parameter 6 and M with 0 fixed 0  - at some default value # Thus M has no free parameters. B y application of eq. (1.4), 0  05  the odds is, = 01  =  p(Mo/I) piMJI)'  P(M /I) 1  0  1  • -  p(D/9 M I)  0  p{M II)'  p(D/M I) p(D/M I)  0  0  -  S {6II)-p(DI0MiI)d0 v  Where the integration is over all possible values of 6. Now, in most cases, the likelihood 1(6) = p(DjOI)  is much more peaked than the prior p(9jl).  Let & be the value of  9 at the maximum likelihood, and approximate the prior by the constant p(9/I). Thus,  Chapter  1.  10  Introduction  Coi  ct —  p(O/I)-fl(0)dO  Next, we replace the likelihood integral by its maximum amplitude 1(0), multiplied by the characteristic width 80. 80 — f l(0)d0 / 1(0). Finally, approximating p(0/I) by the reciprocal of a characteristic width of the prior, AO, we have:  The likelihood ratio in the first term can never be greater than one, since 1(0) is the maximum of 1(0); it can never favour the simpler model. The data will always be fit better by the model with additional parameters. Alas, the second ratio penalizes the ' complex model if the width of the likelihood, 80, is smaller than the width of the prior, AO, and this term in general will be larger than one. The odds , assuming no prior preference to one of the models ( p(M\/I) = p(Mo/I)  ), will favour the complex model  only if the data have sufficient evidence in favour of the more complicated explanation. Probability theory both quantifies such evidence and determines how much additional evidence is "sufficient" through the calculation of global likelihoods. As demonstrated by Bretthorst [1], the additional goodness-of-fit (measured by the difference of the data from the model predictions) needed in order to overcome the penalty for the additional complexity of one additional parameter can be calculated, and depends very weakly on the width of the prior. The quantitative results of Bretthorst are not directly applicable to our case, since any additional parameter will change substantially the probability distribution of other parameters. Nevertheless, the qualitative results regarding this weak dependence on the width of the prior remains true.  Chapter  1.  
Parameter Estimation

Many times we are interested in the values of the parameters that enter our model. By application of Bayes' theorem we usually arrive at the determination of the joint probability

p({Θ}/DI),

that is, the probability as a function of all the parameters. When coming to estimate the value of one of the parameters, the rest of them are nuisance parameters. We can get rid of the nuisance parameters by marginalization. Assume we have a model with two parameters, Θ and Φ, and we want to concentrate on the estimation of Θ. By eq. (1.3), we have

p(Θ/DI) = ∫ p(ΘΦ/DI) dΦ,

where p(Θ/DI) is the one-dimensional probability distribution as a function of the parameter under investigation. It is called the marginal posterior distribution.

Strictly speaking, the Bayesian procedure brought us this far, but no further. It does not determine the value of the parameter at any single point in the parameter space, but rather the full marginal posterior distribution. Nonetheless, we are often interested in representing our results in the form of a single number to be taken as our final estimate of the value of the parameter. Possible estimators for the real value of the parameter are the posterior mode (the most probable value of Θ) and the posterior mean

<Θ> = ∫ Θ p(Θ/DI) dΘ.

If the mode and the mean differ substantially, the distribution is too asymmetric to be adequately represented by a single number. The allowed range for a parameter with a probability content C is given by a credible region R, defined by

∫_R p(Θ/DI) dΘ = C,

with probability density everywhere inside R greater than everywhere outside it.

Due to computational difficulties, not all the nuisance parameters in our work will be eliminated by marginalization. We will then remain with a joint probability distribution of some of the parameters, and the task of estimating the accuracies of their estimates will have to be done with care.

1.1.2 Direct Probabilities

The Bayesian procedure gives us tools for manipulating known probabilities into other probabilities. We are left with the task of describing our initial knowledge in the form of probabilities to be assigned. We call these probabilities, which are not the result of manipulation of other probabilities, direct probabilities. The assignment of direct probabilities involves in general the reduction of knowledge from other forms into real numbers. It is thus an open problem. We will treat direct probabilities of two groups: prior probabilities and likelihoods.

Prior Probabilities

The prior probabilities are the description of our knowledge prior to seeing the data of our present experiment. They may be the posterior probability of some previous experiment, knowledge arising from the same experiment but not used in any other place (see chapter 5 for an example), or any other knowledge regarding the propositions/parameters that may be relevant to our problem. An important subclass is the representation of ignorance: the probability we should assign to each proposition (or the probability density for the value of a parameter, when our models are continuous) when we have no prior information that prefers one over the others. We use this kind of prior on numerous occasions throughout this work.
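Before turning to how such ignorance priors are assigned, the parameter-estimation operations just described, marginalization of a nuisance parameter, the posterior mean and mode, and a credible region, can be illustrated with a short numerical sketch on a gridded two-parameter posterior. The grid, the parameter names, and the toy joint posterior below are assumptions for illustration only; in this work the joint posterior comes from the actual likelihood of Chapter 3.

```python
import numpy as np

# Sketch of marginalization and a credible region on a gridded joint posterior
# p(theta, phi / D I).  The joint posterior here is a toy, correlated-Gaussian stand-in.

theta = np.linspace(0.0, 10.0, 401)   # parameter of interest
phi   = np.linspace(0.0, 10.0, 401)   # nuisance parameter
TH, PH = np.meshgrid(theta, phi, indexing="ij")

# Toy joint posterior (unnormalized), with some theta-phi coupling.
joint = np.exp(-0.5 * ((TH - 4.0)**2 + (PH - 6.0)**2 - 1.2*(TH - 4.0)*(PH - 6.0)) / 0.8)
joint /= np.trapz(np.trapz(joint, phi, axis=1), theta)      # normalize on the grid

# Marginal posterior of theta: integrate the nuisance parameter out (eq. 1.3).
marg = np.trapz(joint, phi, axis=1)

# Point estimates: posterior mean and posterior mode.
post_mean = np.trapz(theta * marg, theta)
post_mode = theta[np.argmax(marg)]

# Highest-density credible region with probability content C = 0.95:
# include grid points in order of decreasing density until the mass reaches C.
C = 0.95
order = np.argsort(marg)[::-1]
dtheta = theta[1] - theta[0]
mass = np.cumsum(marg[order]) * dtheta
inside = order[:np.searchsorted(mass, C) + 1]
lo, hi = theta[inside].min(), theta[inside].max()

print(f"mean={post_mean:.2f}  mode={post_mode:.2f}  95% region=[{lo:.2f}, {hi:.2f}]")
```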
The assignment of prior probabilities that represent ignorance can usually be helped by the invariances that the propositions/parameters should obey, i.e., the transformations that cannot change our state of knowledge about the problem. Two important cases used in our work are:

• Location parameters: these parameters obey the symmetry of translation. The state of knowledge cannot be different if the origin is changed. In the discrete case this amounts to the invariance of the state of knowledge under permutations of the propositions' labels. We assign location parameters uniform prior distributions in the case of continuous parameters, and equal probability for all propositions in the case of discrete parameters. That is, the normalized prior of a location parameter is

  p(Θ/I) = 1/(Θ_H - Θ_L)   (1.5)

  for the case of a continuous parameter, and

  p(H_i/I) = 1/(number of H_i)   (1.6)

  for discrete models.

• Scale parameters: these parameters obey the symmetry of scaling. The state of knowledge cannot be different if the unit of measurement is changed. The prior distribution that obeys this symmetry is the Jeffreys prior [10], which has the normalized form

  p(Θ/I) = 1 / [Θ ln(Θ_H/Θ_L)],   (1.7)

  where Θ_H and Θ_L are the upper and lower limits of Θ. Note that the logarithm of a scale parameter is a location parameter. We will make use of this fact in chapter 3.

A broad treatment of assignments of prior information and limiting processes can be found in [11], [12] and [6]. A concise treatment that is more adequate for this work is in the opening chapter of [1].

Likelihoods

The likelihood p(D/HI) is the probability of collecting the data given that the model is true. In case we have all the information about the mechanisms producing the data, we can incorporate it into the model. Then the likelihood is either 1 or 0; probability theory reduces to deductive reasoning.

In most of our real-life problems (i.e., the problems where inference tools are required) we allow the possibility of data different from the model, given the model is true, due to noise. The noise, in the Bayesian interpretation, is not a 'random' process, but merely a process where our ignorance regarding the producing mechanism is such that its effect on the data is not reproducible. Any known behavior of the noise can, in principle, be extracted from the noise and incorporated in the model. The noise will include any effects, systematic or not, that are present in the data but not in the model. Thus, the form of the likelihood is the statistical model: the probability of materialization of the noise.

1.2 MRI - T2 Distribution

1.2.1 The Data

The data used throughout this work were supplied by Dr. Kenneth Whittall and Dr. Alexander MacKay from the department of radiology at U.B.C. The apparatus and the acquisition method are described by them in [3].

Tissue                       Total number of pixels
Genu Corpus Callosum         1908
Splenium Corpus Callosum     2204
Minor Forceps                2886
Major Forceps                5041
Internal Capsules            1936
Head Caudate Nucleus         1656
Putamen                      3234
Globus Palidus               831
Thalamus                     2157
Insular Cortex               5755
Cingulate Gyrus              3414
Cortical Gray                4313

Table 1.1: The Database.

The data consist of decay curves acquired from the brains of 11 normal volunteers. The data from ten of these volunteers serve as 'learning data' for choosing a model and producing the probability distribution of the parameters for each tissue, and the data from the eleventh volunteer serve as test data for the classification process.
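As a brief aside before the details of the data set, the two ignorance priors introduced in Section 1.1.2, the uniform prior of eq. (1.5) and the Jeffreys prior of eq. (1.7), can be written down and checked numerically in a few lines. This is only an illustrative sketch; the decay-rate limits used (1 ms to 3000 ms) are the ranges quoted later in Chapter 3, and everything else is assumed for illustration.

```python
import numpy as np

# Sketch of the two ignorance priors of Section 1.1.2 (eqs. 1.5 and 1.7).
T2_lo, T2_hi = 1.0, 3000.0            # ms, decay-rate range quoted in Chapter 3

def uniform_prior(theta):
    """Location parameter: p(theta/I) = 1 / (theta_H - theta_L)."""
    return np.where((theta >= T2_lo) & (theta <= T2_hi), 1.0 / (T2_hi - T2_lo), 0.0)

def jeffreys_prior(theta):
    """Scale parameter: p(theta/I) = 1 / (theta * ln(theta_H / theta_L))."""
    return np.where((theta >= T2_lo) & (theta <= T2_hi),
                    1.0 / (theta * np.log(T2_hi / T2_lo)), 0.0)

# Both priors are normalized over [T2_lo, T2_hi].
grid = np.linspace(T2_lo, T2_hi, 200001)
print("uniform integrates to ", np.trapz(uniform_prior(grid), grid))
print("Jeffreys integrates to", np.trapz(jeffreys_prior(grid), grid))

# The Jeffreys prior assigns equal probability per decade, i.e. it is uniform in
# log(theta); this is the sense in which the log of a scale parameter is a
# location parameter.
for lo, hi in [(1.0, 10.0), (10.0, 100.0), (100.0, 1000.0)]:
    mass = np.log(hi / lo) / np.log(T2_hi / T2_lo)
    print(f"Jeffreys mass in [{lo}, {hi}] ms: {mass:.3f}")
```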
For each volunteer, volumes were outlined on M R I image for 5 white and 7 gray matter structures. The data numbers are the spin-echo amplitudes of each pixel inside the outlined tissue for 32 spin-echoes with echo-spacing of 10ms. Typical decay curves for twenty pixels taken from different locations and volunteers but from the same tissue are displayed in Figure 1.1. The learning database is given in Table 1.1. We want to investigate the shape of the T distribution, regardlessdts total amplitude. 2  Thus, the total amplitude is a nuisance parameter and we wish to remove it in this step. We multiply each data set by a factor that will bring the linear extrapolation of the origin to 1000 amplitude units. The resulting data set is equivalent to the original as no relevant information has been changed. This 'normalization' replaces the more rigorous and proper marginalization due to computational consideration.  Chapter  1.  Introduction  16  Time [ms]  Figure 1.1: Raw Data: Decay Curves of Pixels taken from Major Forceps. The variance of the amlitudes is evident. 1.2.2  Background Knowledge  Our state of knowledge prior to seeing the data will determine our set of models, assignment of noise distribution (the statistical model) and assignment of prior distributions (both the prior distribution of parameters within models and the prior probability of the models). We wish to incorporate any known systematic effect into our set of tested models and thus reduce the resulting noise. We discuss the background knowledge at hand. Each T distribution is expected to be composed of decaying exponents. The decaying 2  exponents may have different decay rates and may have a distribution of decay rates. We assume that all samples of a particular tissue will exhibit decays of the spin-echo signal according to the same T distribution; That is, we do not take into account the variance 2  of the T distribution between different volunteers and between different locations in the 2  same volunteer.  Chapter  1.  17  Introduction  Here we enumerate effects expected to show in the data in superposition to the T% distribution. The numbers given in the data are the M R I amplitudes of pixels in the image for 32 sequential images generated from the maximum of the 32 spin-echoes in a C P M G sequence. This process includes the following known contributions: • Fourier transform artifacts. The D F T is a sufficient statistic for optimal frequency estimation when looking for one Sine/Cosine in the data [13]. The evenly sampled Sine/Cosine functions are almost orthogonal and thus the D F T is still an adequate estimator in case of well-separated frequenciesfl]. In the case of M R I images, the frequencies are close to each other and the amplitudes suffer from artifacts, both from the neighboring frequencies and the finite dimension of the sampling window (256 pixels). No publication of a practical Bayesian solution for Fourier-transformed image reconstruction is known to the author.  The contribution of this effect is  assumed to be small compared to other noise sources and spatially dependent. We will not try to incorporate any knowledge about this effect beside its finite variance. • Imperfect 180° pulses and imperfect corrections artifacts. The presence of imperfect 180° pulses in the sequence give rise to both ghost image artifacts and T i artifacts [14]. A descending gradient crusher sequence had been applied in order to eliminate these contributions [3] and is expected to eliminate most of the stimulated echoes. 
Nevertheless, no weighting had been applied in compensation to the main spin-echo degradation by the stimulated echoes. The sum of these effects is finite variance. A systematic component, if any, will be alternating due to the nature of the ghost image artifacts and 7\ artifacts. Its magnitude is assumed to be roughly proportional to the total signal along the decay. We do not attempt to incorporate any knowledge about initial magnitude or sign  Chapter  1.  Introduction  of this effect.  18  •  • Flow artifacts. A laminar fluids flow will give rise to even-echo-rephasing and thus we expect an alternating effect with negative amplitude for the odd echoes, and finite (yet unknown) magnitude.  Again, we assume that the magnitude of this  effect is proportional to the total signal along the decay. • Magnetic field inhomogeneities. The effect is unknown but assumed to have finite variance. • Variance among volunteers. In trying to estimate the parameters of each of the tissues we assume that same tissues in different patients will exhibit exactly the same T distribution. The extent of the error of this statement is unknown, but 2  finite. • Variance inside tissues. Not all the tissue elements are identical, even within one volunteer. Again, the extent of the error in this assumption is unknown, but finite. • Region OfTnterest (ROI) contaminations. The data pixels were collected by assigning ROI volumes over the M R I image. The extent of the error in wrongly labeling pixels is unknown. Due to the finite difference between the tissues this error is finite. ' . • The data are expressed by integers. The contribution of this rounding to the noise has a symmetric uniform distribution with unit width. Of this knowledge, we incorporate into our knowledge only that this contribution is finite and unknown in sign. We expect the effect of this loss of information to be negligible, as the other contributions to the noise are expected to be much larger.  Chapter 1.  Introduction  19  • There may be other effects we do not know about. These effects assumed to have finite magnitudes, but are otherwise unknown. • Unknown effects in the measurement process may give rise to a D C level. This contribution is unknown in sign and magnitude. The magnitude assumed finite. The background knowledge can be summarized as follow: • Possible systematic effects: 1. Decaying exponents of unknown amplitude and decay rate. The expected decay rates are in the range of 1-3000 ms. The amplitude of each such exponent is in the range 0-1000 amplitude units. 2. Decaying exponents of unknown amplitude and with distribution of decay rates of unknown center, shape and width. The decay rates and amplitude of these components have the same ranges as in 1. The T values of these 2  components can range from half to twice the main T value. 2  3. Alternating signal of unknown initial amplitude or sign, but proportional to the total signal. 4. D C signal of unknown amplitude and sign. • The non-systematic effects are expected to be of finite variance but are otherwise not known. Different states of knowledge may be introduced. For example, one may understand the measurement process better and thus come up with different models for the systematic effects.  Chapter 2  Bayesian Classification  In this chapter we discuss the Bayesian procedure for classification of data.  Section  2.1 describes the classification of a new data set into classes defined by old data sets. 
This is a generalization of the model selection procedure to cases where the models are given in form of probability distributions. We use this procedure in Chapter 5 when classifying data of image pixels into one of several tissues, using the data acquired from other collections of samples from these tissues. In fact, our implementation in Chapter 5 approximates the probability distributions of the parameters (of the models that have been selected to describe the old data sets) by their values that maximize the likelihood; it thus reduces the problem solved to the model selection problem. In Section 2.2 we describe the test for deciding whether two data sets came from the same mechanism. In many problems we are not sure that our assumption about the completeness of the mechanisms (that produced the old data sets) is true; the new data may have come from some other mechanism that produced none of the old data sets. For example, when classifying pixels to one of the gray and white matter brain tissues, some of the pixels come from the skull; not having the skull bone as one of the classes results in an erroneous classification of these pixels to some other tissue. This test, being based on the bayesian ability to consistently compare models with different complexity, has no traditional, frequentist, analogue. It can serve as a test of the completeness of our classes and thus enables the automated definition of new classes [15]. In the special case, when one of the models is 'zero signal', the procedure is reduced  20  Chapter 2. Bayesian Classification  21  to the Bayesian signal detection. The solutions to both procedures, signal detection and whether two data sets are produced by the same mechanism, are given in Appendix B for the case of On/Off measurements of Poisson distributed data.  For that case, the  solutions are given in a closed form. The solutions to the problem of T distribution of 2  the brain tissues will not be given in this work, due to computational difficulties.  2.1  Classification Onto 'Old' Data Sets  We assume that each of the 'old' data sets was produced by a different mechanism and that the mechanism that produced the new data is one (and only one) of these mechanisms. This assumption is part of our background knowledge, I. The question we are asking is: For which M ; the posterior probability p(Mi/d d idI) new  0  is the highest,  where: d  =  n e w  "The new data set in question", and  Mi = . "the i  th  I16W. dcltcL  d ew n  was produced by the same mechanism that produced the  old data set liow,": This Mechanism is described by a choice of model (a functional  form) M., and model parameters values 0 . By Eqn. (1.1) and (1.3), the posterior probability of M,- is given by  p{Mi/d d I) new  old  = oc |  M C{M] a  (2-1) where: • M  a  is the a  th  model.  Chapter  2. Bayesian  22  Classification  • The summation is over the set of possible models, . • 0Q. is the set of model parameters values, parameterizing A4 . a  • tta is the parameter space of model a. In most cases, in the classification phase where the functional form.is well determined by the old data set, we will constrain the model to its most probable form:  = 1  p(M Max/d ewd idI) a  Where  n  0  = 0 VM  and p(M /d d i I) a  new  0 d  a  ^  M  a  M  a x  is the most probable model, describing the old data set i: .  A4 Max a  p(M ax/d I) aM  = Max [{p(M- /d  oldl  a  oldi  I M  I)}  Thus, we can omit from now on the symbols a and A4  ai  constrain a to aMax,  and assume  a  C {M}]  .  
denoting the model. We  is the true model, describing both d  M. Max  new  a  and  doidi •  So, in terms of the parameters of Mi =  M Max' a  = 0 " , where \I> is the set of parameters that describes d , and Eqn. (2.1) new  becomes  p(Mi/d d i I) new  -  0 d  /  p(Mi<2)/d d iI)dO new  0  By Bayes theorem we have:  =  p(MiQIdoidi) •hi  • —— •  ——-—dQ p{d ,„,/d l) n  M  And using the product rule, p(Mi/d I) old  p{d Id idl) new  0  f p{®lMid I) JQ old  p(d /MiQd I) new  old  dQ  (2.2)  Chapter 2. Bayesian  Classification  23  Now, • p(Mi/d i I) 0 d  is independent of d  old  old  =  p(Mi/I).  is constant as a function of the index i, and will be cancelled in the  • p(d /d i I) new  ; p(Mi/d I)  0 d  comparison among the posterior probabilities of the Mi. • The first term in the integral is the posterior probability of the parameters given the  •  ^ = Mi  • 0 and thus the second term in the integral is equal to p(d /^I)  with  new  \f/ numerically equal to 0 ; the likelihood the parameter set 0 in light of the new data set. This quantity is independent of the old data set. Therefore:  (2.3) Thus, beside normalization constants and assuming no preference to particular i (i.e. p(Mi/I)  = # f old data sets 1  Q  =  ^-)i  cons  the posterior probability of M , is just the summa-  tion (over all the possible sets of parameters) of the probability of the new data set to be produced by a mechanism having a particular parameter set, weighted by the probability of this parameter set to be the one that produced the old data set i. The distinction between the roles of the new and old data sets vanish if d  new  constrained to be produced by one of the mechanisms of d i . 0 d  using the known independencies, Eqn.(2.2) then becomes  p(Mi/d d i I) new  0 d  is not  B y Bayes theorem and  Chapter 2.  Bayesian  P (A  Classification  / rwP (  p{d id/I)p{d /I) 0  m  new  I  Jn  ^l  24  1  }  P f e / M , G / ) ( C / M , 0 / ) dQ-. P  (2.4)  In such a case, as common sense may already suggested, the proposition M,- is symmetric in regard to the old and new data sets. It does not matter which data set had been sampled later in time.  2.2  Model Comparison  In Section 2.1 we chose one 'old' data set, for which the probability of the new data set to come from the same mechanism as this set, was the highest among all the old data sets. We want to test the model of choice (from Section 2.1): Did the same mechanism produce both our old data set of choice, and the new data set? In this Section we give the general solution for this problem. We describe the test for deciding whether two data sets are produced by the same mechanism. As demonstrated by Bretthorst [1], This question in its general form is meaningless. We cannot compare our model to an ill-defined model of "all the other mechanisms that are not the one which produced the old data set". Nevertheless, the proposition becomes well-defined when we restrict ourselves to the functional form of one. (say, the old) of the data sets. The proposition "the models describing the mechanisms that produced the old and new data sets are of the same functional form but with different values of the parameters" is a well defined and computable proposition. From now on, we will omit the subscript i. We will do everything inside our model and functional form of choice. We want to find the odds ratio of: • Mo = "The parameters of the mechanism responsible for the new data set are the  r  1  Chapter 2. 
Bayesian  Classification  25  same as those of the mechanism responsible for the old data set; *P = 0 " OVER • Mi = "The parameters of the mechanism responsible for the new data set are different from those of the mechanism responsible for the old data set; \t ^ 0 " Where VP and 0 , as in Section 2.1, are all the parameters of the model but excluding the functional form. Now, the posterior for M had been already calculated in Section 0  2.1 for every member of the old data sets (Eqn. (2.4)). So, similarly to Eqn. (2.4):  p(M /d d I) 0  . P(M /I)  =  old  f p(e/I) (d /M QI)  0  p(d i /I) p(d /I) Jn J$ "2 0 d  new  P  old  0  p(d /M QI) dB new  0  (2.5)  new  and Q  _  p{M /d d I) 1 - p(M /d d I) 0  0  new  new  M  •  1  otd  is the odds ratio in favour of both data sets (new and old) being produced by the same parameters. Note that the global likelihoods p(d i /I) and p(d /I) now stands 0 d  new  for the summation over the two possibilities, Mo and M j , instead of over the set of old data sets. This point would not matter since the global likelihoods are independent of Mo and Mi, and will cancel in any comparison of these propositions. Again, p(d /MoQI) new  since Mo© =  is the likelihood of the parameters in light of the new data set  In order to arrive at a form consisting of assignable direct probabilities,  we will manipulate explicitly the probability of M i and introduce the odds ratio as (~) W  o  p(M /dnewd ldI) p(M /d d I)0  1  -  o  1  new  old  Chapter 2. Bayesian  Classification  26  In a similar manner to M , the posterior of M i is given by 0  p(M /d d i I) 1  =  (A  P  ,T\ I /I) Ju  /rwP  P[doid/I) p{d  new  new  =  0 d  P^l ) P(doid/Mi®I) 1  d© .  Pid^/M.BI)  (2.6)  Now, the third term in the'integral is no longer the likelihood of the parameters that produced the new data set, since M i 9 , unlike M 0 , is not equal to  We still have  o  to span this term:  p ( < W M i 0 / ) = / p K e ^ / M a O / ) d^ =  J^/M^pid^/^M^d^ and Eqn.(2.6) becomes  p(Mi/d  n e w  d  o l d  I) =  and the resulting odds ratio is:  = 01  p{M /d d i I) p{M ld d I) 0  new  o d  x  n&w  old  p(M /I)  =  0  piMJI)  Igpje/i) {d /M ei) (d /M ei) dQ P  old  0  new  P  0  f p(e/I)p{d /M QI)dQ.S {*/I)p(d ^/*M I)d*. n  old  1  nP  r  1  '  Eqn.(2.7) is the expression of the odds ratio as function of direct probabilities.  [  •  )  Chapter 2. Bayesian Classification  27  From inspecting Eqn.(2.7), we see that having the same likelihood of the parameters in light of both data sets, p(d /QMoI)  = p{d i /QI),  new  will not lead to the same odds for  0 d  all shapes of the likelihood. Situations where Ooi ^ 1 ( i-e. approaching certainty that the data sets came.from the same mechanism) will be achieved only when the likelihoods are equal and are both very peaked. That is, the parameters are well estimated by each data set and these estimations are equal. Another interesting limit is the neighborhood of no new data. In this, case p(d  new  p(d /QM I) new  0  1^M\I) = Const, and Eqn.(2.7) becomes: p(M /d d I) ~ p(Mild d I)  = 0 1  0  new  old  new  old  =  p(M /I) piMjI)' 0  ~  p(cWM)©/) • J p ( 0 / / ) p(d /M QI) dQ p(d imMJ) • f (Q/I) pid^/MrQ^dQ • fnpWI) n  new  old  0  nP  dV  1  '  }  P(Mo/I) p(Mi/I) As may already be clear from common sense, in a case where we know very little about a mechanism, the chances of having this mechanism equal to any other are hardly influenced by any information we have about that other mechanism. 
No matter what are the other mechanism parameters, or how accurately they can be determined, the probability of our 'foggy' mechanism to have these parameters, or any others, in the parameter space fl will be solely determined by our prior knowledge. We will not implement this procedure to the problem of tissue classification. Dealing with the computation problems involved in this implementation is beyond the scope of this work. In Appendix B we bring implementation of the procedure for the case of Poisson distributed data ( a set of On/Off measurements in the astrophysics nomenclature). In that case, the solutions are reduced to expressions with closed form.  Chapter 3  M o d e l Selection  We now turn to applying the principles discussed so far to the brain tissue samples. In the following two chapters we will find the probability distribution of the parameters that enter the models describing the data set for each of the tissues. This is a two step process: First, we have to define the set of models and choose a model (a functional form) for each tissue. This will be done in the current chapter, applying the principles of model selection described in Section 1.1.1 . Second, in the next chapter, we will use the probability distribution of the parameters of this model to estimate their values and the accuracy of these estimates.  3.1  T h e Set of M o d e l s  In order to choose the functional form of the T distribution of a given tissue, we should 2  address the following question: • Given a finite set of models, which one best describes the data and our prior information? The set of models to be examined will be defined according to our background knowledge about the systematic effects expected to appear in the data, as defined in Section 1.2.2.  28  Chapter 3. Model Selection  29  We use the notation and derivation introduced by Bretthorst [1]. Each of our models is a linear combination of base functions, that is, the model Jvi for tissue a as a function a  of the time t, is of the form:  M (t)  = £ B G (t,T£)  (3.1)  ,  a  i  ja  ia  j=i  Where: • TJ^ is the set of the nonlinear parameters of the base function Gj . a  • m is the number of base functions appearing in model a. a  • Bj  a  is the amplitude of base function Gj . a  Our question will be answered by comparing the posterior probability  p(j\4 /Dl) a  among the models.  3.1.1  T h e Base Functions,  Gj  a  The set of models is specified as follow: • 1< m  a  < 7 . Our model is a superposition of no more than 7 base functions.  The model functions Gj represent the various systematic effects expected to appear a  in the data, according to our background knowledge. Gj  a  forms:  D e l t a F u n c t i o n i n the Laplace Space  Described by the exponential decay:  G (t) = e-*'**ja  can take one of the following  Chapter 3. Model  30  Selection  This base function has one nonlinear parameter: the location of the delta function in the Laplace space =the decay rate T . 2  G a u s s i a n i n the L a p l a c e Space Described by a series of exponents in the form:  Cjn(!) 0  0  -  2  • (  e X  .4-exp(-^-)+  P(-T ,-,-(l+1.07.^ 2  0.1 • ( ^ ( - ^ . . a  =  ( 1 +  l.  o )  )  6 7  +  e X  P(-T  2 j a  /(l+1.07.  W j a  )))  +  ^  This base function has two nonlinear parameters: the location (T j ) 2 a  a n  d the width-  fraction Wj of the gaussian in the Laplace space. It approximates a gaussian having width a  of T j • Wja and unit integral . The gaussian is symmetric when plotted as function of 2  a  Log[T ]. 
This form was chosen to represent a peak with added structure (width) over the 2  delta function, but no other information regarding the distribution of decay times.  Alternate-Echo-Refocusing ( A E R ) Described by:  (t)  Gja  =  (-l) .exp(-^-) k  Where k is the index of the measurement point; k=l,2,...-32. The alternating signal is multiplied by a decaying exponent having decay rate of 70ms as a first order approximation to the decaying of the total signal. This base function does not have nonlinear parameters.  Chapter 3. Model  Selection  31  A B a c k g r o u n d ( D C ) L e v e l i n the T i m e D o m a i n . Described by:  G (t) ja  = l  This base function does not have nonlinear parameters. A L i n e a r C o m p o n e n t i n the T i m e D o m a i n . Described by:  = 0.19146(—1.65 + 0.01 • t)  G (t) ja  This base function is added in order to improve the algorithm stability and to reduce computation time. Exponential decays with long decay rates (T ~ 1000ms) will appear 2  in our data as almost straight lines. Allowing such a component in the data to be represented by a combination of D C level and linear component will reduce substantially the computation time: We are trading a linear (amplitude) and non-linear parameter (the decay rate) for two amplitudes (the D C and the linear component). This is important since our computation time is polynomial in the number of linear parameters but exponential in the number of the nonlinear parameter (see 3.1.2). The contribution to the algorithm stability is achieved by the elimination of an insensitive parameter: the decay rate T when having values over T ~ 1000ms. 2  2  The coefficients in the form of this base function were chosen so the base function will be normalized over the sampling points (j>2T Gj (10i) a  the D C level component(^f Gj (10i) 2  a  • Gj (10i) a  —  l ) and orthogonal to-  • 1 = OJin order to reduce computation time.  This base function does not have nonlinear parameters. .  Chapter 3. Model  3.1.2  Selection  32  Search Path  The previous Section defined the set of models to be tested: it is the set of possible combinations of the base functions Gj not exceeding 7 in their total number . 1  a  Alas, we do not wish to calculate the needed posterior probability p(A4 /Dl) a  for  all of these models. Though their number is finite, the computation time needed for computing all of them will render our task impractical. Our algorithm computation time is of the order of  ^ computation — *± ' rn *. Cl  Where: • K is.a computation time coefficient of the order of seconds. -  -  • m is the number of linear parameters. • 7' is the number of nonlinear parameters. • a is a search-sampling coefficient: a ~ 25. That is, our computation time is polynomial in the number of linear parameters and exponential in the number of the nonlinear parameters. We need to find the most probable model without calculating all the models, and in particular we want to avoid unnecessary computations of models with many nonlinear parameters. We describe our model space as 3D discrete space, using the following coordinates: 1. The number of decaying exponents (with or without width). We will refer to this coordinate as 'Components' (C). C=l,2,...7 Two constraints reduce the number of combinations: the AER component must always be present and the linear component cannot be present without the DC component. 2  Chapter 3. 
2. The number of decaying exponents with width as an additional parameter. We will refer to this coordinate as 'Widths' (W). W = 0, 1, 2, ..., C.

3. The number of 'Polynomial Components' (P). P = 0, 1, 2. P = 0 when no polynomial component is present, P = 1 when a DC level is added, and P = 2 when both the DC and the linear component are added.

A flow chart of the search path is given in Figure 3.1. The algorithm is looking for the maximum probability in the space of models.

[Figure 3.1: Search Path Flow Chart. The algorithm is looking for the maximum probability in the space of models.]

In each step, the resulting likelihood is compared to that of the best model so far, and the new model replaces it if its likelihood is higher. For computation-time efficiency, the algorithm first tries to simplify the model. In the case where all possibilities to simplify the model have been exhausted, the algorithm will try more complex models. The algorithm stops once all possible steps from the model having the highest probability so far have been exhausted.

We assume that the starting model of the algorithm has no particular importance. Nonetheless, for the sake of computational efficiency, our search path will always start from the solutions given by A.L. MacKay and K.P. Whittall in their analysis of the data of one volunteer [3]. We assume that the maximum found is the global maximum in the defined space of models.

In Section 3.3.1 we will demonstrate the search path for two simulated data sets. For one of them the starting model will be a simple model, one decaying exponent, and the search path will add parameters and components in order to reach an adequate description of the data, and thus a higher probability of the model. In the second simulation we will start from the 'right' model. The search will stop after finding that the initial model was at the maximum of the model space.

3.2 The Probability of the Model p(M_a/DI)

We compare the posterior probability of the model M_a to other competing models. By Bayes' theorem we have

p(M_a/DI) = p(M_a/I) p(D/M_a I) / p(D/I),

where:

• p(M_a/I) is the prior for the model M_a. Lacking any prior information about the plausibility of the model compared to other models, we will assign the same prior to all of them: p(M_a/I) = const for all a.

• p(D/I) is a normalization constant. It is the same for all the models and thus will cancel out when we compare models within our set.

Thus, the odds of each of the models compared to any other model is given by

O_ki = p(M_k/DI)/p(M_i/DI) = p(D/M_k I)/p(D/M_i I).   (3.2)

The odds is the ratio of the models' likelihoods. We now have to express the likelihood in terms of assignable direct probabilities. By eq. (1.3) and the product rule, we have

p(D/M_a I) = ∫ p(D Θ_a/M_a I) dΘ_a = ∫∫ p(B_a/M_a I) p(τ_a/B_a M_a I) p(D/B_a τ_a M_a I) dB_a dτ_a,

where:

• B_a is the set of linear parameters (amplitudes) of model a.

• τ_a is the set of nonlinear parameters (decay rates and width-fractions) of model a.

• Θ_a is the full parameter set of model a (linear and nonlinear).

• p(B_a/M_a I) is the prior for the linear parameters of model M_a.

• p(τ_a/B_a M_a I) is the prior for the nonlinear parameters. It depends, in general, on the values of the linear parameters B_a of model M_a.

• p(D/B_a τ_a M_a I) is the likelihood of the model with a specific set of parameters (both linear and nonlinear). It is a direct probability we have to compute.

3.2.1 The Prior Probability Distributions

We will not include any information regarding the values of the parameters. One may think of incorporating information from previous experiments about the anticipated values of the parameters or dependencies between them. However, previous in vivo determinations of T2 decay rates and amplitudes [16, 17, 18, 19] do not agree. (Of course, we cannot use the results of [3] as part of our prior information; they use the same data discussed in this work.) Moreover, it is hard to ascertain any reliable credible regions from their reported results. Thus, any responsible and conservative use of the information contained in these reports for the assignment of prior distributions will take the form of very broad distributions for all the parameters. If the data determine the parameters much better than the prior, the prior probability distribution of the parameters will be essentially constant in the region of substantial contribution of the posterior probability distribution. Thus, the details of our hardly informative priors would not change much and do not merit the additional computational effort. The main feature of the prior that may survive in the final results is the scale of the prior distribution: the range in which the prior has a substantial magnitude. Therefore, we will use priors that represent this state of ignorance, in accordance with the known ranges, as discussed in Section 1.1.2.

Nonlinear Parameters

Our nonlinear parameters, decay rates and widths, are scale parameters. We will express our ignorance regarding their values by assigning them a normalized Jeffreys prior. We do not incorporate any information about possible coupling among them or their dependence on the values of the amplitudes; that is,

p(τ_a/B_a M_a I) = p(τ_a/M_a I) = Π_{k=1}^{l_a} 1 / [τ_ak Log(τ_akH/τ_akL)],   (3.3)

where:

• l_a is the number of nonlinear parameters in model a.

• τ_akH = 3000 ms and τ_akL = 1 ms whenever τ_ak is a decay rate.

• τ_akH = 1 and τ_akL = 0.01 whenever τ_ak is a width-fraction.

Linear Parameters

Again, we do not incorporate any information about coupling of the amplitudes. That is,

p(B_a/M_a I) = Π_{j=1}^{m_a} p(B_ja/M_a I),

where m_a is the number of base functions in model a. Our linear parameters, the amplitudes, are location parameters. We will express our ignorance regarding their values by assigning them a uniform prior:

p(B_ja/M_a I) = 1/B_jaMAX,   (3.4)

where:

• B_jaMAX = 1000 whenever B_ja is the amplitude of a main component.

• B_jaMAX = 60 whenever B_ja is the amplitude of a DC level.

• B_jaMAX = 100 whenever B_ja is the amplitude of a linear component.

3.2.2 The Likelihood

We now turn to calculating the likelihood p(D/B_a τ_a M_a I). The data produced in the experiment are described by

d_i = M_a(t_i) + e_i,

where e_i is the noise realized in measurement i. Given the model and a parameter set, the difference between a datum value and the value (of this datum) predicted by the model is assigned to the noise. Of course, we assign different values to the noise associated with each of the data points when we choose a different model. In case the noise variance is known in advance, this gives us a tool for testing whether our model gives a proper description of the data [20]. We will follow the derivation and notation used by Bretthorst [1], with extra care for our special case in which:

• The signal-to-noise ratio is poor, and thus not all of the approximations used in [1] can be directly applied.

• Our data consist of multiple measurements.

According to our background information, as discussed in Section 1.2.2, we assume the noise has a finite variance which is not known. Due to the lack of any other information about the noise (such as correlations), we assign the noise a Gaussian probability distribution. This probability distribution arises from maximum entropy considerations [13]; the Gaussian is the distribution that maximizes the entropy under the constraint of finite variance. The assignment of a Gaussian distribution to the noise guarantees that we made a conservative assignment by declaring total ignorance about the behaviour of the noise, beside its finite variance. If, indeed, the noise is correlated, or non-stationary, it will not lead us to erroneous results, as we already took these possibilities into account.

The probability of getting the value of a datum point, given the Gaussian probability distribution of the noise, its variance σ_s², and all the other parameters, is

p(d_i/B_a τ_a σ_s M_a I) = (2πσ_s²)^(-1/2) exp( -(1/(2σ_s²)) (d_i - Σ_{j=1}^{m} B_ja G_ja(t_i, τ_ja))² ).

For clarity, from now on we omit the index a that signified the model M_a and the explicit symbol of the model M_a; we do everything inside the model under consideration. The noise is assumed uncorrelated, and thus the probability of getting this data set is the product of the probabilities of getting each of the data points:

p(D/τ B σ_s I) = Π_{i=1}^{n} Π_{l=1}^{n_i} p(d_il/τ B σ_s I) = (2πσ_s²)^(-n n_i/2) exp( -(1/(2σ_s²)) [ n n_i d² - 2 n_i Σ_{j=1}^{m} Σ_{i=1}^{n} B_j G_j(t_i, τ_j) d̄_i + n_i Σ_{j=1}^{m} Σ_{k=1}^{m} g_jk B_j B_k ] ),

with

g_jk = Σ_{i=1}^{n} G_j(t_i, τ_j) G_k(t_i, τ_k),   d² = (1/(n n_i)) Σ_{i=1}^{n} Σ_{l=1}^{n_i} d_il²,

where:

• m = the number of model functions.

• n = the number of sampling points = 32.

• n_i = the number of measurements of each sampling point = the number of pixels for each tissue.

In order to understand the implications of having multiple measurements, and the relationship between the part of the noise arising from the misfits of the model and the part arising from the unsystematic effects, we let

σ² = σ_s²/n_i

and use this quantity from now on. So,

p(D/τ B σ I) = (2π n_i σ²)^(-n n_i/2) exp( -Q/(2σ²) ),   (3.5)

Q = n d² - 2 Σ_{j=1}^{m} Σ_{i=1}^{n} B_j G_j(t_i, τ_j) d̄_i + Σ_{j=1}^{m} Σ_{k=1}^{m} g_jk B_j B_k,

where

d̄_i = (1/n_i) Σ_{l=1}^{n_i} d_il.

We see that d̄_i, the average of the data points for every sampling time i, is a sufficient statistic for finding the maximum of the likelihood. It is not sufficient to establish the absolute value of the likelihood at any point of the parameter space.
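The bookkeeping behind eq. (3.5) can be illustrated with a short sketch that builds a small model matrix from base functions in the spirit of Section 3.1.1 and evaluates Q from the averaged decay curves. This is a simplified stand-in, not the thesis's implementation: the two single-exponential components, their amplitudes, the noise level, and the number of pixels are hypothetical, while the 32 echoes at 10 ms spacing and the AER and DC terms follow the text above.

```python
import numpy as np

# Sketch of the likelihood bookkeeping of Section 3.2.2 for one tissue.
n, n_i = 32, 500                       # 32 echoes; n_i pixels (measurements) per echo
t = 10.0 * np.arange(1, n + 1)         # echo times, 10 ms spacing
k = np.arange(1, n + 1)                # echo index for the AER term

def model_matrix(T2s):
    """Columns G_j(t_i): one decaying exponential per T2, plus AER and DC terms."""
    cols = [np.exp(-t / T2) for T2 in T2s]
    cols.append((-1.0)**k * np.exp(-t / 70.0))   # alternating-echo-refocusing term
    cols.append(np.ones(n))                      # DC level
    return np.column_stack(cols)

# Hypothetical 'true' tissue: short (10 ms) and main (80 ms) components.
G_true = model_matrix([10.0, 80.0])
B_true = np.array([120.0, 880.0, 5.0, 10.0])
rng = np.random.default_rng(1)
pixels = G_true @ B_true + rng.normal(0.0, 30.0, size=(n_i, n))   # d_il, l = 1..n_i

# Sufficient statistics: per-echo averages d_bar_i, plus the overall mean square d2
# that only shifts the absolute value of the likelihood.
d_bar = pixels.mean(axis=0)
d2 = np.mean(pixels**2)

def Q(G, B):
    """Q of eq. (3.5): n*d2 - 2*B.(G^T d_bar) + B.(g B), with g_jk = sum_i G_j G_k."""
    g = G.T @ G
    return n * d2 - 2.0 * B @ (G.T @ d_bar) + B @ g @ B

def log_likelihood(G, B, sigma):
    """Eq. (3.5), with sigma**2 = sigma_s**2 / n_i."""
    return -0.5 * n * n_i * np.log(2.0 * np.pi * n_i * sigma**2) - Q(G, B) / (2.0 * sigma**2)

# For fixed nonlinear parameters, the amplitudes maximizing the likelihood solve g B = G^T d_bar.
G_trial = model_matrix([12.0, 85.0])
B_hat = np.linalg.solve(G_trial.T @ G_trial, G_trial.T @ d_bar)
sigma_hat = np.sqrt(Q(G_trial, B_hat) / (n * n_i))   # rough, illustrative scale for sigma
print("B_hat =", np.round(B_hat, 1), " Q =", round(Q(G_trial, B_hat), 1))
```

Note how only the averages d̄_i enter the location of the maximum (through G^T d̄), while d² merely shifts the overall value of Q, which is the point made at the end of the section above.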
• \Theta_a is the parameter set for the model a (linear and nonlinear).

• p(B_a/\mathcal{M}_a I) is the prior for the linear parameters of model \mathcal{M}_a.

• p(\vec{T}_a/B_a \mathcal{M}_a I) is the prior for the nonlinear parameters. It depends, in general, on the values of the linear parameters B_a of model \mathcal{M}_a.

• p(D/B_a \vec{T}_a \mathcal{M}_a I) is the likelihood of the model with a specific set of parameters (both linear and nonlinear). It is a direct probability we have to compute.

3.2.1 The Prior Probability Distributions

We will not include any information regarding the values of the parameters. One may think of incorporating information from previous experiments about the anticipated values of the parameters or dependencies between them. However, previous in vivo determinations of T_2 decay rates and amplitudes [16, 17, 18, 19] do not agree (of course, we cannot use the results of [3] as part of our prior information, since they use the same data discussed in this work). Moreover, it is hard to ascertain any reliable credible regions from their reported results. Thus, any responsible and conservative use of the information contained in these reports for the assignment of a prior distribution will take the form of very broad distributions for all the parameters. If the data determine the parameters much better than the prior, the prior probability distribution of the parameters will be essentially constant over the region of substantial contribution to the posterior probability distribution. Thus, the details of our hardly informative priors would not change the results much and do not merit the additional computational effort. The main feature of the prior that may survive in the final results is its scale: the range over which the prior has a substantial magnitude. Therefore, we will use priors that represent this state of ignorance over the known ranges of the parameters, as discussed in Section 1.1.2.

Nonlinear Parameters

Our nonlinear parameters, decay rates and widths, are scale parameters. We will express our ignorance regarding their values by assigning them a normalized Jeffreys' prior. We do not incorporate any information about possible coupling among them or their dependence on the values of the amplitudes; that is,

p(\vec{T}_a/B_a \mathcal{M}_a I) = p(\vec{T}_a/\mathcal{M}_a I) = \prod_{k=1}^{l_a} \frac{1}{\log(\tau_{akH}/\tau_{akL}) \cdot \tau_{ak}}    (3.3)

Where:

• l_a is the number of nonlinear parameters in model a.

• \tau_{akH} = 3000 ms and \tau_{akL} = 1 ms whenever \tau_{ak} is a decay rate.

• \tau_{akH} = 1 and \tau_{akL} = 0.01 whenever \tau_{ak} is a width-fraction.

Linear Parameters

Again, we do not incorporate any information about coupling of the amplitudes. That is,

p(B_a/\mathcal{M}_a I) = \prod_{j=1}^{m_a} p(B_{ja}/\mathcal{M}_a I)

where m_a is the number of base functions in model a. Our linear parameters, the amplitudes, are location parameters. We will express our ignorance regarding their values by assigning them a uniform prior:

p(B_{ja}/\mathcal{M}_a I) = \frac{1}{B_{jaMAX}}    (3.4)

Where:

• B_{jaMAX} = 1000 whenever B_{ja} is the amplitude of a main component.

• B_{jaMAX} = 60 whenever B_{ja} is the amplitude of a DC level.

• B_{jaMAX} = 100 whenever B_{ja} is the amplitude of a linear component.

3.2.2 The Likelihood p(D/B_a \vec{T}_a \mathcal{M}_a I)

We now turn to calculate the likelihood p(D/B_a \vec{T}_a \mathcal{M}_a I). The data produced in the experiment are described by:

d_i = \mathcal{M}_a(t_i) + e_i

where e_i is the noise realized in measurement i.
Given the model and a parameter set, the difference between a datum value and the value (of this datum) predicted by the model is assigned to the noise. Of course, we assign different values to the noise associated with each of the data points when we choose different model. In case the noise variance is known in advance, this gives us a tool for testing whether our model gives a proper description of the data [20]. We will follow the derivation and notation used by Bretthorst [1] with the extra care to our special case in which: • The signal to noise ratio is poor and thus not all of the approximations used in [1] can be directly applied. • Our data consist of multiple measurements. According to our background information, as discussed in Section 1.2.2, we assume the noise has a finite variance which is not known. Due to the lack of any other information about the noise (like correlations), we assign the noise a Gaussian probability distribution. This probability distribution arises from maximum entropy consideration  Chapter 3.' Model Selection  39  [13]; the Gaussian is the distribution that maximizes the entropy under the constraint of finite variance. The assignment of a Gaussian distribution to the noise guarantees us that we made a conservative assignment by declaring total ignorance about the behaviour of the noise, beside its finite variance. If, indeed, the noise is correlated, or non-stationary, it will not lead us to erroneous results as we already took these possibilities into account. The probability of getting the value of a datum point, given the Gaussian probability distribution of the noise, its variance cr , and all the other parameters, is: s  p(di/B  —»  l^ M a I)  a  a  a  1  =  s  1 exp(-—(cii - ^  m  r )) ) 2  BjaGjafa,  ja  j=l  V27TO-2  For clarity, from now on we omit the index a that signified the model A4 and the a  explicit symbol of the model M. . We do everything inside the model under consideration. a  The noise is assumed uncorrelated, and thus the probability of getting this data is the product of the probability of getting each of the data points: n  ni  Ba I)=J[Hp(du/^  p(D/^  BaJ)  s  i  =^r^expt-— i  Z  (  J  n  .  «  i  .  ni  m  - E ^ G ^ i ^ ) ) 1=1  1=1  2  =  j=l  = (2 r 7 )-^exp(-^) 2  7  (  s  Where m  Q  = nnid  2  n  - 2^ ^ 2  ni  ?jGj{ti,  Tj)  j=l i=l  •  m  j=l  1=1  — X^^i(^' i) r  i=l  '  Gk(U,Tj) •  -  .  9jhBjB  t  n  9jk  m  ^ du +-n J2  k  k=l  Chapter 3. Model  Selection  40  • m = the number.of model functions. • n = the number of sampling points •= 32.  • rti= the number of measurements of each sampling point = number of pixels for each tissue. In order to understand the implications of having multiple measurements and the relationship between the part of noise arising from the misfits of the model and the part of noise arising from the unsystematic effects, we let: a = — 2  •  »•/  and use this quantity from now on. So,  p(D/^~BaI)  =  (27Tn a y^ e x p ( - - ^ - ) 2  (3.5)  l  2o~  Tj)d{ + ^  Q = nd> - 2 £ E BjGjiU, j=l  E  9ikBjB  k  j=l k=l  i=l  Where:  di =  2 —  «i  52 du  We see that d,, the average of the data points for every sampling time i , is a sufficient statistic for finding the maximum of the likelihood. It is not sufficient to establish the absolute value of the likelihood at any point of the parameter space.  Chapter  3. 
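The role of the averaged data as a sufficient statistic can be made concrete with a small numerical check. The sketch below (Python/NumPy; the array shapes and function names are illustrative assumptions, not the thesis code) builds the per-echo mean and the mean square from a stack of pixel curves and verifies that, for any trial model curve, the total squared misfit entering Eqn. (3.5) depends on the data only through these two quantities.

```python
import numpy as np

def per_echo_stats(pixel_curves):
    """pixel_curves: (n_l, n) array, one normalized decay curve per pixel."""
    d_bar = pixel_curves.mean(axis=0)        # per-echo average: fixes the maximum-likelihood point
    d2_bar = np.mean(pixel_curves ** 2)      # overall mean square: fixes the absolute misfit scale
    return d_bar, d2_bar

def misfit(pixel_curves, model_curve):
    """Total squared misfit, written two ways to show the sufficiency of the statistics."""
    n_l, n = pixel_curves.shape
    direct = np.sum((pixel_curves - model_curve) ** 2)
    d_bar, d2_bar = per_echo_stats(pixel_curves)
    via_stats = n_l * (n * d2_bar - 2.0 * model_curve @ d_bar + model_curve @ model_curve)
    assert np.isclose(direct, via_stats)
    return direct

# tiny demonstration with synthetic curves
rng = np.random.default_rng(0)
t = 10.0 * np.arange(1, 33)                          # echo times, 10 ms spacing
model_curve = 900.0 * np.exp(-t / 80.0)              # a trial single-exponential prediction
pixels = model_curve + rng.normal(0.0, 0.3, (50, 32))
print(misfit(pixels, model_curve))
```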
Model  Selection  41  Nevertheless, in case <r is known,.d{ is a sufficient statistic for any calculation of the parameters' values, since the likelihood dependence on the more detailed distribution of the data is through a constant factor.  '  As common sense suggests, if a is unknown, the extent to which the data points are distributed will determine the probability distribution of a. The probability distribution of <T, in turn, will not influence the values of the parameters at maximum likelihood, but will determine the width of the likelihood and of the probability distribution for the parameters. In order to compute the global likelihood P(M. /Dl) for the model comparison and estimation of the parameters, one must integrate the likelihood given in eq.(3.5) over all the parameters. Having the base functions orthogonal, in the sense that: N  9jk — E  Gjiti)  • Gk{U)  = 6jk  i=l  will enable us to analytically integrate the likelihood over the linear parameters. Moreover, in case the base functions are orthonormal, the expectation values of their amplitudes are given directly by the projection of the data on the base functions [1]. In spite of the general dependency of the base functions on nonlinear parameters, the base functions will be, in many important problems, almost orthogonal. For example, in the problem of evenly sampled multiple stationary frequencies, with cosines and sines as the base functions, the off diagonal terms of gjk will be negligible as long as the frequencies are well separated. As indicated in [20], the problem of exponential decays as base functions is about the most pathological case one can encounter. Under no conditions will the g^ matrix be orthogonal. Presence of an additional exponent will always interfere with the estimation  Chapter  3. Model  Selection  42  of the existing ones. One of the consequences is that any estimation of the decay rate constants using a model with a wrong number of exponents will give rise to incorrect results even in the limit of zero noise. At this point, one may want to resort to numerical integration. This is a very difficult thing to do. Beside the obvious problem of long computation time of integration in a space of high dimensionality, the likelihood in this space is highly peaked in many ridges and maxima. The task of finding the maxima that have the substantial contribution to the integral is beyond the scope of this work. Any set of non-orthogonal base functions can be transformed to an orthonormal set; gjk is a real symmetric matrix and thus can be always diagonalized. This diagonalization itself (i.e., the eigenvalues and eigenvectors of,gjk) depends on the nonlinear parameters and cannot be done analytically. We will need to compute gjk, diagonalize it, and perform the transformation of the base functions with every computation of the likelihood. So, we transform the base functions to bring the gjk matrix into diagonal form. Let A/t be the k eigenvalue of gjk and let Ckj be the j th  th  component of the associated eigenvector/  The transformation of the base functions is: 2 Hjit, ^ ) =  m  —7= E  y Aj  JkG (t,  e  k  Tjfc) .  i=i  V  k=i  and the amplitudes transform according to:  Ak = \f*~k  Bj.e j k  i=i  The Jacobian of this transformation is  A  i  Chapter 3. Model  Selection  43  The transformation depends on the nonlinear parameters and so does its Jacobian. The new base functions are orthonormal in the sense that:  i=l  and thus, our likelihood become:  m  1  m  p(D/~? 
A al) = ^ T r r w ) - ^ e x p ( - — • {nd? - 2 £ Ajhj + £ A))) 2  i  J = I  =  (3.7)  1  where n 4  =1  is the projection of the averaged data onto the orthonormal base functions Hj. The amplitudes have been separated and now the integration over them can be easily done. We multiply our likelihood by the prior for the amplitudes (given in Section 3.2.1) and integrate over them:  p(D/^aI)  = ( A / / ) . P  = Jp{A/o-I)  n  • p(D/^r*~£aI)  A - i . n r ^ ( 2 ^ ) ^ . a x  i=l  • dr =  P  ( - ^ 5 ) ^  Where: m  p(A/I)  =  Hp(B /I) 3  i=i  is a different constant for each model, and  (3,8)  Chapter 3. Model  Selection  44  is the mean of the squares of the projections of data onto the orthonormal base functions H.  mh is the scalar product of the data vector and the vector of values 2  predicted by the orthonormal model functions. The nonlinear parameters that will have the highest probability are those that will maximize this scalar product. We are now left with the integration of the nonlinear parameters. This can not be done analytically. Moreover, we cannot apply gaussian integration or similar methods directly, since our signal to noise ratio is not good enough to give rise to narrow and elliptic peaks of the likelihood. On the other hand; the dependence of h on the nonlinear parameters 2  makes any attempt to integrate the likelihood numerically very costly: any evaluation of h deserve new diagonalization and transformation of the base functions. 2  We first transform the nonlinear parameters to a set in which the likelihood is almost elliptic in a sufficiently wide range that includes the substantial contributions to the integrals, and then use gaussian approximation to integrate them. We later use a 'safety valve routine' to ensure that our integral is approximated up to accuracy level of 20%. The transformation that will bring the likelihood bubble close to elliptic shape, is:  Ui — Log(ri)  . diii = — • d(ri)  As discussed in Section 1.1.2, the priors of the time-scaled nonlinear parameters (the decay rates and width-fractions) are normalized Jeffreys' priors:  '  (n/I) =  P  J  \  T, nun /  '  When coming to integrate over the nonlinear parameters the Jacobian of the last transformation will cancel with the priors. That is, r  .p(~T*/M I) a  d~r* = "[[ (p(ui/I) t=l  dui)  Chapter 3. Model  Selection  45  where:  p{ui/I) =  Range(Ui)  —. Ui  We are left with:  p(D/aI) = Jp(D/llT)  pCu/l) d u  n -n-n " "' • (2-KO" ) t  l  exp  2a  J  2  2  1HL  mh \ ,_, exp I „ „ I a u J ~~ \2a 2  /  r  2  /  and no additional Jacobian to take care of. What the last transformation did was merely transform the nonlinear parameters from scale parameters,to location parameters. Now that the likelihood is approximately elliptic in its variables over the region of substantial contribution, we can expand mh m 2  mh 2^ 2  mh 2a  Taylor series to get:  2  2  2a ^ 2  j,k=i  where U are the values of the u's at maximum likelihood,  Afc = u - u k  k  and  m 2 is the 2  nd  '  dh 2  2  dujdu  k  derivatives matrix obtained numerically.. The integral becomes  (3.9)  Chapter 3. Model  J  e x p  46  Selection  (2^  mh  f  2  dTt  exp  2o  2  dA  • / exp u  3,k=l  J  L  G  In order to perform the integral we change variables again. This change of variables is not essential for the analytical integration. Nonetheless, it will separate the variables, give a more intuitive look to the likelihood, and will help us in the next Section to estimate the values of the parameters. 
So, let Vj be the j  th  of the j  th  eigenvalue of the bjk matrix, and let Wjk be the k  th  component  associated eigenvector. The transformation that will rotate our ellipsoid in the  nonlinear parameter space to principle axes, and thus eliminate off-diagonal terms in the 2  nd  derivatives matrix bjk, is given by:  Sj =  £  SkWjk  A,H/-,  k=l  JVk  with volume element:  i=i  and the integrand becomes:  e x  p ( \i,k=i E^ —  A  ^  A  /  =  S? p ( E^ \j=i r  e x  (3.10)  Performing the r integrals, the likelihood becomes:  p(D/*I)  = p{AjI)  -HP^II)  •n r  n i  •  (2^a ) ^ 2 :±:  (3.11)  Chapter 3. Model  where  Selection  47  ur=i A:  is the Jacobian of the transformation of the base functions G to u the orthonormal base functions H using the values of the nonlinear parameters that will maximize the likelihood. The eigenvalues of the last transformation are treated in this stage by a 'safety valve routine' described in Appendix A . Our measure of the noise a, as discussed in the introduction, is not known. We can now treat o as a nuisance parameter: we will assign a a Jeffreys' prior and integrate over it. The normalized Jeffreys' prior for a is:  p(a/I)  =  where an and a^ are the upper and lower prior bounds for a respectively. So,  p'D/I)  = /  p'a/I)  p(D/aI)  r+m — nnr  nV  i=i  r+ -n. -l  a  m  ni  i=i  U  nd —mh 2  e  x  p  do  u  2  2a  da  2  u. In this integration over the probability distribution of a, we assume we chose o.r, and an to be well outside the substantial contribution to the integral, so we can let a^ —» 0 and an —> oo without substantially change the value of the integral. Changing variables,  Chapter 3. Model  Selection  48  V  dV=~dc  V =  —  L  a  V= H  O~JJ  2  — O-L  the integral become:  L  ' -HX>  V n-ni  V,-*0  —1—771 — r  n d - mh •V 2  H  • exp  nd  2  2  — mh  2  n-rtj—r — m  2  u  As indicated by the notation, p(cr/I) is independent of the model. We do not have any information that will differ the prior noise assignment for different models. Thus, the term Log  (^) will appear in all the compared likelihoods and will cancel. Omitting  the term Log {^)  a n <  ^  s  o  m  e  OI  " the constants that will appear in all the models, p(D/M*I)  p( A//) •!!*(«///)• ( 2 * ) ^  a  IIK  II *  U  71-71; — r — m ^  nc? — 2  mh  2  (  2  L-i = i  (3.13) (7  )  is the likelihood we evaluate for each model and compare among the models.  3.3 3.3.1  Results Simulated D a t a  We produced two sets of simulated data, one of 6 functions in the Laplace domain and one of Gaussians in the Laplace domain. Both produced with noise level of 0.3 units,  Chapter 3. Model  Selection  49  Component Low T Medium T High T Alter. Echo R.  Discrete A m p l i t u d e T [ms] W i d t h [ms] 100.0 9.0 600.0 85.0 300.0 180.0 0.0 -  Component Low T Medium T High T Alter. Echo R.  Continuous A m p l i t u d e T [ms] W i d t h ms] 100.0 9.0 2.0 900.0 85.0 20.0 70.0 380.0 30.0 0.0 -  2  2  2  2  2  2  2  •  2  Table 3.1: Generating Parameters of Simulations, cr = 0.3[Amp. units]. The given amplitudes of the continuous components are the integrals over these components. . which, as we shall see in the next chapter, is typical to our experimental results. The simulated noise is Gaussian and uncorrelated. The generating parameters are tabulated in Table 3.1 and the resulting generated distributions are displayed in Figure 3.2. 
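For reference, the core numerical step used in evaluating every trial model in the runs below can be sketched in a few lines. This is a schematic illustration (Python/NumPy; the function and variable names are assumptions made here for readability): for a trial set of decay times it builds the non-orthogonal base functions, diagonalizes g_{jk}, projects the averaged data onto the orthonormal functions H_j to obtain the quantity m\bar{h}^2 of Eqn. (3.8) — which the most probable nonlinear parameters maximize — and returns a simplified data-dependent log factor of the Student-t form appearing in Eqn. (3.13). The prior constants and the exact exponent of Eqn. (3.13) are omitted here.

```python
import numpy as np

t = 10.0 * np.arange(1, 33)                      # echo times [ms], 10 ms spacing

def base_matrix(decay_times, with_dc=False):
    """Columns are the non-orthogonal base functions G_j(t_i) for one trial model."""
    cols = [np.exp(-t / T2) for T2 in decay_times]
    cols.append((-1.0) ** np.arange(1, 33) * np.exp(-t / 70.0))   # AER component
    if with_dc:
        cols.append(np.ones_like(t))
    return np.column_stack(cols)

def projection_statistic(d_bar, G):
    """Diagonalize g_jk = G^T G, build orthonormal H_j, project the averaged data."""
    g = G.T @ G
    lam, e = np.linalg.eigh(g)                   # eigenvalues lambda_j, eigenvectors e_kj
    H = (G @ e) / np.sqrt(lam)                   # orthonormal base functions over the echoes
    h = H.T @ d_bar                              # h_j: projections of the averaged data
    return np.sum(h ** 2), lam                   # m * h_bar^2, and the eigenvalues (Occam factor)

def log_model_score(d_bar, G, n_total):
    """Schematic log factor ~ -0.5*sum(log lam) - 0.5*(n_total - m)*log(n d_bar^2 - m h_bar^2)."""
    mh2, lam = projection_statistic(d_bar, G)
    n, m = G.shape
    resid = n * np.mean(d_bar ** 2) - mh2
    return -0.5 * np.sum(np.log(lam)) - 0.5 * (n_total - m) * np.log(resid)
```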
In order to demonstrate the search path (described in Section 3.1.2) in the 'complexity' space of models, we give the algorithm a very simple model, one decaying exponent, as a first guess. This is done in the continuous distribution case, and the tested models and the results are tabulated in Table 3.2. For the discrete distribution we start from a model having three decaying exponents. The tested models and the results are tabulated in Table 3.3 . For each of the tested models, we show the odds ratio of that model and the model having the highest probability. For each model, we also show the Bayesian estimation of the noise level. This estimation will be discussed in the next chapter; for now it is enough to regard the noise estimation as a measure of the misfit of the data and the model. As we see from Table 3.3, the algorithm detected the 'right' model for the discrete distribution.  Chapter 3. Model  Selection  50"  600  TJ  500  "5. E 400  <  300 200 100  0  -  | 10  | 26 '  '5o'"i'oo  I 260 '  fedo'M600  2060  T2[ms]  T2 [ms]  Figure 3.2: Generating Parameters for the Discrete (upper figure) and Continuous (lower figure) Simulated Distributions. The plotted amplitudes of the gaussian components are equal to the Gaussian integral and not to their actual amplitudes.  Chapter 3. Model  Selection  Components 1 1 1 •  !•  '  51  Model Widths 0 o  1 1 0 1 0 1  .  1 1 2 2 2 •2 • 2 3 3 3 4  ;  . 0 .0 1 .0 0 1 0  Results Poly. Comp. 0 1 1 0 2 2 0 0 1 2 2 0 1 0 0  ^  a  3i3max  1.7 • i r r 1.6 • I O  42  - 3 1  2.9 • H r  22  3.3 • I O "  23  4.2 • l f r 1.7 • l f r 1.3 • l f r 3.9 • l f r 3.9 • i r r 3.2 • i<r  24  14 21  29 13 2  4.1 • 1 0 \ 1  7.3 • l f r  3  3  7.3 • 1(T 1.3 • l f r 1 2  14.4 ± 2 . 0 5.0 ± 0 . 7 . 2.9 ± 0 . 3 2.1 ± 0 . 3 ' 3.0 ± 0 . 4 1.25 ± 0 . 2 2.3 ± 0 . 3 2.3 ± 0 . 3 .91 ± . 1 3 ;37±0.05 .36 ± 0.05 .31 ± 0.04 .31 ± 0.05 .31 ± 0 . 0 5 .31 ± 0 . 0 4  Table 3.2: Search Path and Odds Ratio for the Continuous T Simulated Distribution, cr is a measure of the misfit between the data and the model. 2  Components 3 4 2 2 3 3  Model widths 0 0 0 1 0 1  Results Poly. Comp. 0 0 0 0 1 0  a  1 0.0054 6.210" 5.210" 0.0056 0.07  22  22  0.28 ± 0 . 0 4 0.29 ± 0 . 0 4 2.09 ± 0 . 3 2.0 ± 0 . 3 0.28 ± 0 . 0 4 0.28 ± 0 . 0 4  Table 3.3: Search Path and Odds Ratio for the Discrete T Simulated Distribution, a is a measure of the misfit between the data and the model. 2  Chapter  3. Model  52  Selection  For the continuous distribution, Table 3.2 , the algorithm found a discrete model with 3 components and not a continuous model of 3 components, as is the generating mechanism. To see what has happened, we plot the residuals produced by fitting the noisy data by the 'winning' discrete model. We also plot the difference between the predictions of that model with the fitted parameters, and predictions of a continuous model having the generating parameters (Figure 3.3). We see that the residuals are a full order of magnitude larger than the difference between predictions of continuous and discrete models. Thus, fitting the data with the continuous model will hardly decrease the residuals and will not be able to justify its additional parameters. In other words, the difference between the continuous and the discrete models is not relevant to our data; under the given noise level, they merely produce the same data. The Bayesian procedure finds no reason to prefer a more complex model if a simpler model can describe the data and the prior information just as well.  
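The moves through the (C, W, P) model space used in these runs follow the spirit of the flow chart of Figure 3.1: first try to simplify the current best model, then try more complex neighbours, and stop when no untried move improves on the best model found so far. A minimal sketch of such a greedy search (Python; the move set is simplified relative to the full flow chart, and the scoring function `log_prob` is an assumed callable, not the thesis code) is:

```python
def neighbours(model):
    """Candidate moves from a model (C, W, P): simplifications first, then complications."""
    C, W, P = model
    moves = []
    if C > 1: moves.append((C - 1, min(W, C - 1), P))   # eliminate one component
    if W > 0: moves.append((C, W - 1, P))               # eliminate one width
    if P > 0: moves.append((C, W, P - 1))               # eliminate one polynomial component
    if C < 7: moves.append((C + 1, W, P))               # add a component
    if W < C: moves.append((C, W + 1, P))               # add a width
    if P < 2: moves.append((C, W, P + 1))               # add a polynomial component
    return moves

def greedy_search(start, log_prob):
    """Hill-climb on log p(M_a/D I); stops when every move from the best model has been tried."""
    best, best_score = start, log_prob(start)
    tried = {start}
    improved = True
    while improved:
        improved = False
        for cand in neighbours(best):
            if cand in tried:
                continue
            tried.add(cand)
            score = log_prob(cand)
            if score > best_score:
                best, best_score = cand, score
                improved = True
                break                      # restart the moves from the new best model
    return best, best_score
```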
[Figure 3.3: Model Selection Failure. The solid line represents the residuals produced by fitting the noisy data by the discrete model. The dashed line is the difference between the predictions of the fitted discrete model and the predictions of a continuous model having the generating parameters. The continuous model failed to reduce the noise level enough to overcome the 'penalty' it carries due to its complexity.]

Comparison to Existing Methods

The closest frequentist method for comparing several models is the multiple-model method. The models are compared based on some measure of "Goodness-Of-Fit" (GOF) of their solutions, where each model uses the set of parameters that gives rise to the best GOF within that model. The most common measure of GOF for these model comparisons is the \chi^2. From the Bayesian point of view, the set of parameters that minimizes \chi^2 is, in the case of Gaussian noise, the set of parameters that maximizes the likelihood. We can now see what question is answered by the multiple-model method:

• "Given a set of models, which model contains the set of parameters that has the highest likelihood in light of the data?"

That is not what we are looking for. Our question is rather:

• "Given a set of models, which model has the highest probability in light of the data and any prior information that we have?"

These are different questions and in general they will result in different answers. The difference between the Bayesian and the frequentist multiple-model methods can be summarized in the following points:

1. No prior information about preferred models can be consistently incorporated into the multiple-model method. The result of the frequentist method is the likelihood of the parameter set and not the posterior probability.

2. In the model selection process we want to compare the models regardless of the values of their parameters. While the frequentist method is constrained to compare the models while choosing a particular set of parameters for every model (namely, the parameters that maximize the likelihood), the Bayesian procedure marginalizes over all the parameters that enter the model, including the noise variance. The marginalization is equivalent to considering all the possible parameter sets, weighted by their probability of materializing.

3. There is no consistent way to penalize complex models in the frequentist approach (a quantitative implementation of the so-called Ockham's Razor). The quantitative implementation of Ockham's Razor arises naturally in the Bayesian procedure during the marginalization over the parameters.

In this work we used flat prior distributions for both the models and the parameters inside each model to express ignorance about their values prior to seeing the data. As a result, in our case the first point makes no substantial contribution to the difference between the frequentist and the Bayesian methods. On the other hand, the issues mentioned in Points 2 and 3 are of great importance in a problem like ours. Several models exhibit sets of parameters for which the noise is reduced to its 'true' value.
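That several models can drive the best-fit residuals down to the 'true' noise level, so that \chi^2 alone cannot arbitrate between them, is easy to reproduce numerically. The sketch below (Python/NumPy; the fixed decay-time grid and the two-component 'truth' are illustrative assumptions) fits nested linear models of increasing size by least squares and shows that the best-fit \chi^2 can only decrease as base functions are added, which is why a separate Ockham penalty is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 10.0 * np.arange(1, 33)
truth = 100.0 * np.exp(-t / 9.0) + 600.0 * np.exp(-t / 85.0)       # two-component 'truth'
data = truth + rng.normal(0.0, 0.3, t.size)

candidate_T2 = [9.0, 40.0, 85.0, 180.0, 400.0, 1000.0]             # growing pool of decay times
for m in range(1, len(candidate_T2) + 1):
    G = np.column_stack([np.exp(-t / T2) for T2 in candidate_T2[:m]])
    amps, *_ = np.linalg.lstsq(G, data, rcond=None)
    chi2 = np.sum((data - G @ amps) ** 2)
    print(f"{m} base functions: best-fit chi^2 = {chi2:.3f}")
```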
The model that will be picked by the multiple-model methods thus depends on the particular materialization  of the noise in the data set and not on the systematic  effects in the data;  as all these models describe the systematic effects equally well, x will be determined by 2  the non-systematic effects. The traditional frequentist answer to the problem of point 3 is to draw the x  2  a s  a  function of the number of parameters, looking for the 'knee' in the  results. In the case of few measurements (like in a 32 points sets ) this knee may be very vague. In view of the above, the difficulties in the frequentist approach to model comparison in our class of problems are such that they are hardly used.  No implementation of  multiple-model method for the problem of decaying exponents is known to the author. These difficulties, combined with the strong coupling between the parameters, that result in erroneous estimation of the parameters in case of wrong model, lead to strong resistance in the frequentist camp to the usage of non-linear methods altogether, considering the need to state an initial model as the major disadvantage of these method. In fact, the frequentist linear methods are not free from introducing a model either. A common linear model [4, 5, 21] is a set of exponents (100-200 is a typical number) with fixed decay rates. These models result in underdetermined solutions, which are then constrained by adding conditions like Non-Negativity and Continuity. These constraints, and in particular their quantitative implementation, are in general not justified by our state of knowledge; they represent information we in general, do not have.'  Chapter  3.3.2  3. Model  Selection  56  Experimental data  We applied the procedure described in this chapter to the learning data, described in Section 1.2.1. For each set of measurements, the collection of data from all the pixels from the ten volunteers assigned to each tissue, we calculated the probability given by Eqn. (3.13). We calculated the probability for different models, using the search path described in section 3.1.2. In order to ensure that the starting model did not influence the results, we ran the program with two different starting models for each tissue: the first is the model given by MacKay and Whittall [3] (two to three decaying exponents) and the second is a model of one decaying exponents. For all tissues, the two runs ended with the same 'winning' model, supporting our belief that this model is the model of global maximum probability in the model space. As can be seen from Eqn. (3.5), the mean of the measurements and the mean of the measurements squares for each sampling time are sufficient for the calculation of the probability for any model. For each tissue, we indicate the models having the highest and second highest probability, along with the Bayesian estimation of their noise (Table 3.4). The noise is a measure of the misfit of the model and the data; its estimation and the accuracy of the estimation will be discussed in the next chapter. Most of the models found for the tissues are discrete. We are reminded that continuous distributions, as the distribution used in the continuous simulation, will produce similar data to discrete distribution of the same number of components and thus may not be detected.  Chapter  3. Model  Tissue Genu Corpus Callosum  Selection  C 3  Model 1 W p 0 0.3632± o 0.001  W 1  0.4004± 0.001  3  1  0  0.4004± 0.001  11.1 : 1  0.2916± 0.0007 1 0.2149± 0.0004 0 0.5506± 0.002 .0 . 
0.4722± 0.001  2  0  2  2.04 : 1  4  0  0  3  1  0  2  0  2  0.2916± 0.0007 0.2149± 0.0004 0.5506± 0.002 0.4728± 0.001  3  0  1  3  0  1  2  0  2  1  1  1  3  0  0  2  1  2  3  0  0,  .3  0  0  3  0  3  0  3  0  3  0  0  Globus  3  0  0  Palidus Thalamus  3  0  0  1  1-  2  2  0  1  3  0  0  Splenium Corpus Callosum Minor Forceps Major Forceps Internal Capsules Head Caudate Nucleus Putamen  Insular Cortex Cingulate Gyrus Cortical Gray  Model 2 P a 0 0.3632± , 0.001  C 3  0.2209± 0.0004 0.4611± 0.002 0.2787±, 0.0008 0.4049± 0.0007 0.6493± 0.001 0.6471± 0.001  0.2209± 0.0004 0.4611± 0.002 0:2788± 0.0008 0.4049± 0.0007 0.6493± 0.001 0.6471± 0.001  O  h2  12.6 : 1  9.33 : 1 8.09 : 1 2.23 : 1  2.1 : 1' 1.42 : 1 2.96 : 1 28.1 : 1 4.13 : 1 2.73 : 1  Table 3.4: Experimental Data - model selection results.  Chapter 4  Estimation of the Parameters  In this chapter we will estimate the parameters of our models for the tissues, and find the accuracies of these estimates.  "  The main purpose of our work is assessing the feasibility of classification of new data sets to various tissues. The estimation of the parameters is not optimal: we use the joint posterior distribution of all the parameters from all the measurements and then use the marginal posterior of each of the parameters for the estimations. This is the analogue of the traditional practice of averaging the data before the analysis, and can damage the. estimation of a parameter in case there is a systematic variation, not accounted for by the model, of other parameters between measurements. The optimal estimation is achieved by first calculating the marginal distribution for each of the parameters from every single measurement set, and then multiplying the marginal distributions. That is, every set is considered as new data, and then, by Bayes theorem, the marginal probability distribution it produces is multiplied by the distribution produced by previous measurements . 1  Elaborate discussion and examples of this point can be found in [1, pp. 161-175]. The above does not imply.that our estimations are wrong. We may lose information that was in the data by using averages, but the estimations are valid and the accuracy estimates are conservative; we do not incorporate knowledge we do not have. A quantitative rule of thumb is to look at the Bayesian estimation of the noise variance: successful reduction of the noise level (by the. averaging) by a factor of y/n. will indicate that not fact, our normalization of the data sets in the beginning of the analysis is an approximation to the marginalization needed i n the rigorous process.  58  Chapter  4. Estimation  of the  59  Parameters  much had been lost in the averaging. In our case the reduction does achieve a reduction of noise level that is not far from-y/n. The author believes that the loss of accuracy of the estimates due to the use of averaging in this work is not substantial. A l l calculations of the parameters value estimates and the accuracies of these estimates are for the chosen model. They cannot be taken as estimates of any parameters of a different model. For example, in Chapter 3 we generated a set of data by a continuous T  2  distribution. B y the model selection process in Chapter 3, a discrete model was  chosen. We cannot expect the estimation of parameters (e.g., a decay rate of a discrete component). to be valid for other parameters in different model (e.g., the center of a gaussian distributed decay rates in a continuous model). These parameters are in general dissimilar. 
As discussed in the introduction, the marginal posterior probability of a parameter is given by

p(\Theta_k/DI) = \int p(\Theta/DI)\, d\Theta_{[k]}    (4.1)

where d\Theta_{[k]} = \prod_{i=1,\, i \neq k}^{m} d\Theta_i.

We will use the expectation value of the distribution p(\Theta_k/DI) as an estimator for the value of the parameter:

\langle \Theta_k \rangle = \frac{\int \Theta_k \cdot p(\Theta_k/DI)\, d\Theta_k}{\int p(\Theta_k/DI)\, d\Theta_k}    (4.2)

Combining Eqns. (4.1) and (4.2) we have:

\langle \Theta_k \rangle = \frac{\int \Theta_k \cdot p(\Theta/DI)\, d\Theta}{\int p(\Theta/DI)\, d\Theta} = \frac{\int \Theta_k \cdot p(\Theta/I) \cdot p(D/\Theta I)\, d\Theta}{\int p(\Theta/I) \cdot p(D/\Theta I)\, d\Theta}    (4.3)

[Figure 4.1: Probability Distribution of the Noise Level for a typical tissue (Genu Corpus Callosum).]

The Bayesian credible region will usually be estimated by the variance of the one-dimensional probability distribution of our parameter of interest. This estimate is valid as long as the distribution is well approximated by a Gaussian in the region of high probability. In such a case, the probability of the parameter being inside the credible region will be about 0.68.

4.1 Noise Variance

Like any other parameter, the noise variance can be estimated using Eqn. (3.12), which gives the likelihood as a function of the noise variance \sigma^2, regardless of the other parameters. We plot the likelihood as a function of \sigma (Figure 4.1) for a typical tissue.

The prior distribution for the noise is the Jeffreys prior, p(\sigma/I) \propto 1/\sigma. Following [1], we
It is •possible to give the credible region of the parameters when cr is unknown, but it will render-the resulting equations info a non-intuitive forms. We will rather take cr to be constrained to the value found by Eqn. (4.5). The large number of measurements make the estimation of cr accurate enough to justify this approximation (see Figure 4.1). The likelihood of the model as a function of the nonlinear parameters (regardless of the linear parameters) was approximated in Chapter 3 by a Taylor expansion of the argument of the exponent (Eqn.(3:9) to Eqn.(3.10)). We can.now use this approximation and notation to estimate the nonlinear parameters and the accuracy of these estimates. The values of all < Sj > will be zero by symmetry. The covariance matrix of the 5j is given by:  < SjS > - < Sj >< k  I-oo d~S • I T < SjSk >= •  S >= k  1'  V =1 vn  2  • SjS -exp k  •exp( = < cr > 8 j 2  k  [  ELi2^5>)  .Ef=ij^>)  Chapter  4. Estimation  of the  63  Parameters  The covariance matrix elements of the logarithm of the nonlinear parameters is obtained by:  .<  UjUk  >  — <  Uj ><  u  k  >=<  a  UijUik  >  2  Vi  /=i  So, our estimation for the logarithm of the nonlinear parameters is: U  r  {u )  3 est  = u]±  .  2  <cr > E 2  TT  where u] are the logarithm of the nonlinear parameters that maximize the likelihood. Under the approximation of the likelihood by the first order Taylor expansion, these values (modes for the nonlinear parameters) are the expectation values for the logarithms. The estimations for the nonlinear parameters are given by: r  (dj)est = exp (UJ) ± (exp (u-j)  4.3  TT2  (4.6)  \  Linear parameters  The likelihood as a function of all the parameters is given in Eqn. (3.7). The results of calculating < Aj > and the posterior coyariances < AjA  k  > — < Aj >< A > according k  to (4.3) are given in [1]:  '  •  and < AjA  k  > -.<  Aj >< A  k  >=  cr 6 2  jk  These results are to be expected. The base functions H are orthonormal and thus the expectation values of the expansion coefficients of the data over them (the A) will be the projection of the data onto these orthonormal base functions.  Chapter 4. Estimation  64  of the Parameters  The covariance matrix diagonality is obvious since the H functions are orthogonal. The expectation values of the original amplitudes < Bj > are given (using eq.(3.6)) by:  he-i.  >=Y^ m  <BK  i=i  ,  V *>j  and the elements of the covariance matrix are given by:  (4.7)  B B >-<B xB >=<* >-jr^ 2  <  k  l  k  l  3=1  A  J  where we again approximated cr by its expectation value. Thus, our estimates of the 2  amplitudes are  • (B )  k est  =< B > ±\J< B > - < B > = 2  k  = Y ^ ±  k  <<r >-f:-^. 2  k  2  .  (4.8)  These estimates are still functions of the nonlinear parameters, through the dependence of hj, ej and Xj on them. In many problems the estimates of the parameters are k  good enough and thus the nonlinear parameters can be approximated by their value at maximum likelihood. Our case is different; the uncertainty of the nonlinear parameters is large and some of the amplitudes are strongly coupled to them. In order to make the covariance matrix free of the nonlinear parameters we need the likelihood as a function of the amplitudes alone; i.e., marginalize over the nonlinear parameters in Eqn.(3.7). This cannot .be done analytically and cannot be easily done numerically. Due to the strong coupling between the amplitudes themselves this marginalization is still short of giving definite values for the accuracies. 
In view of the above, we will summarize our estimates for the amplitude as follows:  Chapter  4. Estimation  of the  Parameters  65  1. Results of Eqn.(4.8). These tabulated results should be consulted whenever there is no strong coupling with other amplitudes. The nonlinear parameters are approximated by their values at maximum likelihood. 2. In case there is strong coupling between the amplitudes we give the coupled estimates. These are variances (which we signify by S) of linear combinations of the, coupled amplitudes. The variance estimate is the largest eigenvalue of the covariance matrix (Eqn. (4.7)) and the relative weights are taken from the associated eigenvector. 3. For most of the tissues, our model contains a component with a decaying exponent with T of less than 15ms. In that component the decay rate and the amplitude are 2  strongly coupled to each other, but relatively independent of the other parameters. Typical contour plots of the likelihood as a function of the decay rate and the amplitude of the lower components are given. One should consult these sample plots when using the tabulated values. Unlike the nonlinear parameters, where the accuracy estimates are free of the amplitudes, and thus conservative, the tabulated results for the amplitudes are under the approximation of the nonlinear parameters by their maximum likelihood values. In case where the amplitudes are strongly coupled to the decay rates (as is the case of the lower component), the tables will give an oyer optimistic accuracy estimates.  4.4 4.4.1  Results Simulated Data  We use here the same simulated data as used for the model selection. After the model have been selected, we turn to estimate its parameters.  Chapter 4. Estimation  of the  66  Parameters  Discrete T D i s t r i b u t i o n Generating Parameters E s t i m a t e d Paramet ers A m p . T [ms] W i d t h Amp. T [ms] Width 2  Component  Low Med. High AER  2  100 600 300 0.0 0.3  2  9.0 85.0 180.0  - . -  -  -  99.9 ± 1 . 2 9.31 ± 0 . 4 5 579.5 ± 1 . 4 84.3 ± 1 . 9 318.8 ± 0 . 8 . 175.2 ± 5 . 7 -0.04 ± 0 . 1 0 0.28 ± 0 . 0 4  -  Continuous T D i s t r i b u t i o n Generating Parameters E s t i m a t e d Paramet ers A m p . T [ms] W i d t h Amp. . T [ms] Width 2  Component  Low Med. High AER <j  2  2  100 9.0 900 85.0 70. . ' 380.0 0.0 0.3 -  2.0 .20.0 30.0 -  100.1 ± 1.3 819 ± 3 . 2 • 148 ± 1.6 0.00 ± 0 . 2 0 0.31 ± 0.04  9.7 ± 0 . 5 • 80.0 ± 1.0 246. ± 12.  - . -  • -  -  •  Table 4.1: Generating T Distributions and Simulation Results. 2  The generating parameters and the results are described in Table 4.1 and 4.2. They are displayed in Figure 4.2. As discussed in the beginning of the chapter, the estimations of the parameters are for the chosen model. We cannot expect these estimations to serve as estimations of parameters of other mechanisms.  4.4.2  Experimental Data  The estimations of the parameters of the T distribution for each tissue were done using 2  the collection of pixels taken from ten volunteers. For each of the tissues we compute the estimated parameters along with their accuracies. Our results are given in Tables 4.3 and 4.4 and Figures 4.3, 4.4 and 4.5. As mentioned in Section 4.3, the summary of the probability distribution of the parameters in the form of expectation values and variances will not give, by itself, an adequate description of the results for our case. Estimating the linear parameters having  Chapter  4. Estimation  of the  67  Parameters  finn  500  Q.  
E <  400 300  600 500 400 300 200 100 0  83  85  87  170~~ 175  180" T85  200 100  TO"  ' 5b ' ' Mdo—ato—'—'sdo' ' 'ib6o—Tutw  "2tr  T2[ms]  —I  1  1 1 I  I  800  E <  600  400  200  A Tt5  2(5  '  ' 5b '  ' 'UO  20^5  '—'5do' ' '1660  2T50TT  T2 [ms]  Figure 4.2: Simulation Parameters and Estimated Parameters for the Discrete (upper figure) and Continuous (lower figure) Distributions. The height of each components of the continuous distribution signifies the integral over that component.  Chapter  4. Estimation  of the  68  Parameters  Discrete Dist. Component weight High 0.51 Med. 0.37 8 1.9 Continuous Component High Med. 8  Dist. weight 0.47 0.28 1.9  Table 4.2: Uncertainty of Linear Combinations of Amplitudes. The uncertainties, 8, are the eigenvalues of the covariance matrices and the weights are taken from the associated eigenvectors. The lower eigenvalues (not shown) are meaningless due to the dependence of the amplitudes estimation on the nonlinear parameters. the nonlinear parameters constrained to their values at maximum likelihood, a common and proper method when we have good estimates for the nonlinear parameters, give overoptimistic estimates for the accuracies. These values for the amplitudes are tabulated in Tables 4.3 and 4.4. Therefore, in addition to the above estimates, we give two samples of 2D contour-plots of the probability distribution as a function of amplitude and decay rate of the lower T  2  component. These parameters are strongly coupled and are interesting to the study of the physical system. The first contour-plot, for a typical tissue, is shown in Figure 4.6. For the tissue Head Caudate Nucleus, the lower component is justified only by the measurements of the first spin-echo in the sequence, at 10ms (Figure 4.7); the signature of the lower component on the rest of the points is below the noise level. This is an example of a case where the parameters are underdetermined, and give us another opportunity to understand the estimations made in this Chapter and tabulated in Table 4.4: the estimated value for the amplitude of this component is 82 ± 20.; This estimate assumes the decay rate is constrained to its value at maximum likelihood. On the other hand, the  Chapter 4. Estimation  Tissue Genu Corpus Callosum Splenium Corpus Callosum  model C W p 3 ' 0 0  3  Minor Forceps  3  Major Forceps  3  Internal Capsules  3  0  0  0  .0  69  of the Parameters  0  0  1  0  a  0,363± 0.001  0.401± 0.001  0.2916± 0.0007  0.2149±' 0.0004  0.551± 0.002  Comp. Low Med. High AER Low Med. ' High AER Low Med. High AER Low Med. High AER DC Low Med. High AER  Amp. 87.9 ± 2 . 906. ± 0 . 6 52.7 ± 0 . 3 -4.6 ± 0 . 2 99.9 ± 2. 661. ± 0 . 9 281. ± 0 . 5 -5.0 ± 0 . 3 82.1 ± 2 . 935. ± 0 . 5 32.8 ± 0 . 2 -3.6 ± 0 . 2 79.8 ± 2 . 175. ± 1 . 784. ± 0 . 7 -3.24 ± 0 . 1 8.7.2 ± 0.08 113. ± 2 . 633. ± 1 . 284. ± 0 . 6 -5.12 ± 0 . 4  T [ms] 8.9 ± 0 . 8 66.6 ± 0.6 207. ± 20. 2  -  9.9 ± 0 . 9 65.5 ± 2 . 150. ± 5 . -  8.12 ± 0 . 7 66.8 ± 0 . 5 214. ± 3 0 . -  8.29 ± 1 . 43.7 ± 6 . 87.3 ± 2 . -  Comp. Wght. -  -0.39 0.38 5=2.4 -  0.492 -0.401 5=2.9 -  0.37 -0.36 5=2.1 -  0.65 -0.30 5=3.2 -  13.8 ± 1 . 67.8 ± 3, 172. ± 10. -  -  -0.44 -0.50 5=3.6  Table 4.3: Estimated Parameters for White Matter Tissues. 5 is the uncertainty for the linear combination of the amplitudes weighted by the given Component Weights.  Chapter  4. Estimation  Tissue Head Caudate Nucleus  of the  model C W p  70  Parameters  a  Comp.  Low Med. High AER Low Med. High AER Low Med. High AER Low Med. High AER Med. 
AER DC Low Med. AER DC Low Med. High AER  3  0  0  0.472± 0.001  Putamen  3.  0  0  0.2209± 0.0005  Globus Palidus  3  0  0  0.461± 0.002  Thalamus  3  0  0  0.2787± 0.0008  Insular Cortex  1  1  1  0.4049± 0.0007  Cingulate Gyrus  2  0  1  0.649± 0.001  Cortical Gray  3  0  0  0.647± 0.001  Amp.  T [ms] 2  Comp. Wght.  81.9 ± 2 0 . . 3.8/* 3 9 4 3 . ± 1. 77. ± 1. 0.13 43.4 ± 0.5 572. ± 300. ,0.18 -6.37 ± 0 . 9 6=25. 54.2 ± 2. 5.65 ± 1. 948. ± 0.3 70.7 ± 0.3 0.28 37.4 ± 0.1 ' 257. ± 20. -0.27 -5.7 ± 0.2 6=2.2 69.2 ± 2 . 11.3 ± 2 . 897. ± 0.8 67.2 ± 1 . 0.36 60.6 ± 0 . 3 234. ± 30. 0.46 -5.67 ± 0 . 3 8=2.7 49.4 ± 0.9 14.2 ± 2 . 923. ± 0.5 -0.32 74. ± 0 . 9 41.9 ± 0 . 2 266. ± 50. -0.52 -5.35 ± 0 . 2 8=1.5 970. ± 0.3 86.3 ± 0 . 0 9 -6.61 ± 0 . 2 wdth.fr ac= 22.7 ± 0 . 1 0.30 ± 0.02 27.2 ± 3 . 7.44 ± 3 . -0.91 968. ± 0 . 6 85.2 ± 0 . 2 - -0.14 - 4 . ± 0.4 6=4.0 22. ± 0 . 2 27. ± 3 . 7.5 ± 3.0 76.4 ± 0.6 949. ± 0 . 7 -0.20 47.8 ± 0 . 2 760. ± 200. -0.36 -2.23 ± 0 . 4 '6=4.2  Table 4.4: Estimated Parameters for Gray Matter Tissues. 8 is the uncertainty for the linear combination of the amplitudes, weighted by the given Component Weights.  Chapter  4. Estimation  of the Parameters  '•"  71  estimation of the decay rate is after marginalization over, the amplitude and thus takes into account all possible values of the amplitude, as reflected by the estimated value, 3.8^3,(That is, 3.8 multiplied or divided by factor 3; the estimation of the logarithm is 1.34 ±.1.2).  Chapter  4. Estimation  of the  Parameters  • Genu Corpus Callosum Amplitude Residuals 800  .0.75 0.5 0.25  AER=-4.56 600 No D C 400 200 0  1 ' 10  00  •i  T2[ms]  time[10*ms]  -0.25 -0.5 -0.75  1000 Spleriium Corpus Callosum  Amplitude Residuals  600 AER=-5.01 500 No D C 400 300 200 100 0  1 time[10*ms] T2[ms]  1000  Minor Forceps Amplitude Residuals 800 AER=-3.62 600 No D C 400 200 0  time[10*ms]  tiu  -1  TOoTT  T2[ms]  Major Forceps Amplitude 800  Residuals  600 AER=-3.24 D C =8.72 400  time[10*ms]  •200 0  4 r  100  1000  T2[ms]  Intemal Capsules Amplitude AER=-5.12 600 500 No D C 400 300 200 100 0 "TTT  Residuals  time[10*ms]  ',' 1000  T2[ms]  Figure 4.3: The Estimated T Distributions (left) and Residuals for the Estimated P; rameters (right) of White Matter Tissues. 2  4.  Chapter  Estimation  of the  73  Parameters  Head Caudate Nucleus Amplitude Residuals 800 600 400  AER=-6.4 No D C time[10*ms]  200 0  T2[ms]  T0~  100  1000 Putamen  Amplitude Residuals 800 600 400  AER=-5.7 No D C  time[10*ms]  200 0  i  10  100  T2[ms]  . i . . . . .  1000 Globus Palidus  Amplitude Residuals  800 600 400  AER=-5.7 No D C  time[10*ms]  200 0  I 10  T2[ms]  ... • i . 100 1000  Thalamus Amplitude Residuals 800 600 400  AER=-5.3 No D C time[10*ms]  200 0  T2[ms] 10"  100  1000  Figure 4.4: The Estimated T Distributions (left) and Residuals for the Estimated Parameters (right) of Gray Matter Tissues. 2  Chapter  4. Estimation  74  of the Parameters  Insular Cortex Amplitude Residuals 800 600 400  AER=-6.6 DC =23.  time[10*ms]  200 0  T2[ms] 10  00  1000 Cingulate Gyrus  Amplitude Residuals 800 600 400  AER=-4. time[10*ms]  DC =22.  200 0  T2[ms] 10  100  1000 Cortical Gray  Amplitude Residuals 800 600 400  AER=-2.2 No D C  0.5 time[10*ms] -0.5  200 0  T2[ms] 10  100  1000  Figure 4.5: The Estimated T Distributions (left) and Residuals for the Estimated Parameters (right) of Gray Matter Tissues (cont.). 2  Chapter  4. Estimation  of the Parameters  75  Figure 4.6: Joint Probability Distribution for the Low Component of a Typical Tissue. 
The second line from the middle enclose 99% of the probability.  Chapter  4. Estimation  of the  Parameters  76  Head Caudate Nucleus 9 r —•—•— —•—•— —•— —•— —•— — —•— —"—•—•—'— —•—•—•—•—•—•— — —r 1  6 '  1  1  1  1  1  ' ' 5b ' ' '  1  1  1 CiO'  1  ' '. 1 5 0 '  1  ' '  1  2CJ0'  ' ' 250' ' '  36(5  Amplitude Figure 4.7: Joint Probability Distribution for the tissue Head Caudate Nucleus. The second line from the middle enclose 99% of the probability.  Chapter 5  Classification  In this chapter we will devise a procedure for the classification of a new data set to one of several brain tissues. The principles of the Bayesian classification have been discussed in Chapter 2, and we are now left with the task of applying them to our special case in a computable manner.  5.1  Procedure  As in Chapter 2, we denote the old data sets, the learning data, by  d id a  and the set of  data to be classified to one of the tissues by d : From Eqn. (2.3), assuming we have no new  prior information classifying d  new  p(Mi/d i d I) 0 d  new  to one of the tissues, we have:  a jp(6i/d I)  p(d /0il)  otd  new  d6  {  .  (5.1)  The T distributions found in Chapter 3 do not contain any information about the 2  total magnitude of the signal; the data were normalized prior to their analysis. However, we want to use the information carried by the amplitudes and include it in d \ and d . 0  d  To that end, we present the parameters and the data as the following conjunctions:  @i = 9iA ' d id 0  = d idA 0  dnew — d A  • d i r)i t 0  d  new  ,  -  s  ' d £){ f  new  where:  QiDist  s  ,  . 77  •  new  Chapter 5.  Classification  78  • BiA is the amplitude of tissue i. •  ®iDist  are the T distribution parameters of tissue i.  •  dgidA  are the total amplitudes of the old data set. So far, they have not been used.  •  d idDist  •  d  2  are the normalized old data sets we used in our work so far.  0  n  e  w  is the total amplitude of the new data set.  A  i  st  '  is the normalized new data set.  Using our new nomenclature, Eqn.(5.1) become:  p(Mi/d i d I),  a  new  0 d  16iAQiDistMiI\d9iA  p(d /e, MJ)  fp{GiA/d I)  newA  oldA  Ip(9iDist/doldDistf)  d9iDi t — s  d9 -  A  - (5-2)  iA  p{dnewDist/9wistMiI)  d9  iDist  That is, p(M /d t  p(Mi/d i 0  d  A  d  n  e  w  Al)  o l d  d  n e w  I)  a  • p(Mi/d i Distd ewDistf) 0 d  n  ,  where we used the independencies achieved by the normalization performed upon doidDist  and  d Distnew  The probability of Mi = "The new data set come from the same  mechanism as old data set i " is a product of two integrals: the first is the probability of M{ where only the total amplitudes are known (the total amplitudes integral) and the second is the probability of Mi where only the normalized decay curves are known (the T distribution integral). 2  Chapter  5.  79  Classification  We now have to assign the probability distributions in Eqn.(5.2), and to approximate the integrals. We will treat the first integral in section 5.1.1 and the second integral in section 5.1.2.  5.1.1  T h e Total A m p l i t u d e s Integral  $p(QiAldoidAl)  p{d  The probability distribution of the total amplitude of a tissue,  n e w A  l'Q-iAMil)  dOiA  .  is wide due  p(0iA/d idAl), o  to the difficult estimation of the amplitudes of low T components. The signature of such 2  components on the data is limited to the first spin-echos measurements. 
In extreme cases where the signature of the lower component on the data is limited to the first spin-echo alone, the amplitude of the lower component is not determined. We have a realization of such a case in our data, and.the effect of this on the probability distribution of the amplitude of the lower component can be seen in Figure 4.7. We will not try to estimate the value of the total amplitudes. Rather, we will estimate the value of the amplitudes in the time of the first spin-echo, 10ms after the beginning of the decay, and we will approximate this amplitude by the amplitude of the first measurement. From now on the subscript A denotes the value of the measurement/model.at time 10ms.  . . .  We now turn to evaluate the terms in the integral.  First T e r m  p(6i /d idAf) A  0  By Bayes theorem:  .  ' p(0iA/d Al) old  =  •pid°ldA/6;Ap  p(0iA/I)  P(<'UdA/l)  p(doidA/0i I) A  a  .  (5.3)  Chapter 5.  80  Classification  since we set p(9i /I)  = const. V i ( i.e., we do not incorporate any prior information  A  regarding the value of the amplitude for the tissue i and we treat this value as a location parameter). We assume that the data amplitudes are not correlated, and that the amplitude noise' is unknown, but finite in variance; the likelihood factors and the noise will be assigned a gaussian distribution. So, p(d /e crI) MA  =  iA  • exp ^ _ ^ i ( ^ ~ M  (27T<7 )-^ 2  j  a  ( 5 4 )  where: • nli is the number of amplitudes measured for tissue i. •  d idAij 0  is the j data amplitude measurement (10ms point) of tissue i.  Now, a is unknown and we mariginalize over it using Jeffreys' prior:  p(d idA/0iAl)  ct  o  f — • p(d i /0 crI) o dA  J a  • da  iA  / E g i K ^ - M y  fnk\  T  a  2  ( 5 5 )  Our estimation of the value of $i is given by: A  ^A)  e  s  t  e  =<  iA  > ±  yj[<  Q\ > A  < 6  lA  >']  where  ^ ^  >  —  / 9JA •  /p{d idA/6iAl) 0  and  d9j _  PJdpidA/QJAI)  A  d6  iA  Y!j=\ doldAij  nk  _ -jM  A  l  Chapter  5.  Classification  ^  <  =  81  d9  J ^ A P(d>oldA/6iAl)  , j 4  iA  d9  /p(d ldA/9iAl)  2  A  >-<9  l A  -  ^  dpldAj)  2  nli(nli-^l)  iA  0  => y/[< 9  ^ T,f=l{doldAij  > } = ^ 2  YTj=i{d id ij 0  A  nli(nli  -j  °'  dAl  d id i)  —  0  A  2  - 1)  That is,  {9 ) iA  where ST DoS M {d id i} 0  A  est  = d idM± 0  MA  is the standard deviation of the sample mean of d id i0  have arrived at results familiar from frequentist calculations.  p(d  Second T e r m p{dnew 19i MiI) A  A  (5.6)  STDoSM{d i}  A  We  1  /9 MiI)  newA  iA  is the likelihood of an amplitude i n face of the new data amplitude. We  assume that the noise level of the measurement of the new data is similar to the noise level in the old data which is estimated by the standard deviation of the old data set. So,  p(d /9i MicriI) newA  where ai = STD{d i). oldA  =  A  Ji— • exp (_(  DNEWA  ^A)  .  T h i s happened because the prior is not informative, the noise distribution is gaussian and the problem is one-dimensional (the likelihood is a function of only one parameter - the parameter we want to estimate). One may be tempted to conclude that the Bayesian procedure is a generalization of the frequentist approach. This in general is not true, and the interpretation of the results is always different. The frequentist variance defines a confidence interval, an estimate of the interval in which new data will be materialized i n known frequency (say 0.68). The Bayesian result is a credible region, in which the probability of the parameter to be is known (here 0.68). 
The credible region boundaries are approximated using the variance of the parameter, as in chapter 4. 1  Chapter 5.  Classification  82  The Integral The number of samples composing our d \ for each of the tissues under consideration is 0  d  in the range of nl =800 to nl =5000 (see Table 1.1). As a result, the width of the first term in the integral ( ~ S T D of the sample mean) will be 30-70 times smaller than the width of the likelihood in the second term ( ~ S T D of the sample). We approximate the second term by a constant in the region of substantial magnitude of the first term, the probability distribution of the amplitude. That is:  Jp(9i /d I) A  p{d  n e w A  j  p{d l9 MiI)  oidA  newA  > Mil) • Jp(9 /d i I)  < 6  lA  iA  1  • 2 ^ r where Gi =  5.1.2  e  x  p  = p(d /  iA  {dnewA-  «  iA  d9  0 dA  /  d0  iA  <  new  9  < 9  iA  > Mil) =  >) \ 2  iA  ( , — M — J  ^  STD{d i }. 0 dAi  The T Distribution Integral f p(9 /d i i I) 2  iDist  The first term in the T Distribution integral, 2  p{d i /9 i M I)  0 dD st  newD st  p(9iDistId i DistI), 0 d  w a  iD st  d9  i  iDist  s found in Chapters  3, and the maximum and width of the probability distributions of the parameters were found in Chapter 4. We now turn to evaluate the second term  This term can be  p(d Di /9i£>i MiI). new  st  st  evaluated in a straight-forward manner in case we assume that the noise is not correlated. This is equivalent to the assumption: "If d  new  the mechanism with parameters given by  came from tissue i, it was produced by  Dist  p(0iDist/d idDistI), o  datum point from the value predicted by the parameter set  and thus any deviation of a p(0%rjistld idpistl) 0  (i- - the e  5  Chapter 5.  Classification  83  noise) is due to non_systematic effects". We call this assumption the unique mechanism and. evaluate  under it in the next subsection.  p(d Di t/OiDistMil) new  s  Alas, observation of the old data reveals correlations in the noise that call for relaxation of the unique mechanism assumption. We discuss these observations and our crude treatment of the problem in the subsection that follows the next. There we will assume that the parameters of the T distribution in each imaged pixel (d m t) 2  new  s  are perturbed,  from the parameters of the T distribution of the tissue. 2  Second  Term  M e c h a n i s m Case  p(d i /6ipi MiI)—Unique newD st  st  The expression for-the likelihood of a parameter set •in light of a data set, assuming uncorrelated noise, is given by Eqn.(3.7): .  '  p(d/ A^al)  '  exp f - — • (nW - 2' £ Ajhj +.£ A))\ .  = (2-nia )-^ 2  (5.8)  We want to free our results from the absolute magnitude of the data. To that end, we introducer  .  "  A •  m  i=i  ..  A  m ' .7=1  .  .  .  Note that A and the /_,• are not necessarily positive, and the values of fj are not necessarily smaller than one. So,'  '  •  .  p(d ewDist/AfiTtMiO-l I).a n  ew  •  Chapter 5..,  Classification  84  ••  •^--{nWnew  j=l  LG  where  cr u) ne  1  (5-9)  /  is the unknown variance of the noise of the new data set, and.  and  are approximated by their estimates, as found in Chapter 4. Assuming a flat, and equal for all the models, prior for A, we can analytically marginalize over it:  P{dnewDist/  p(A/  .  fi ~Ti Mi<T I)  fiT*MiO- I)-J new  p(d  fi ~rf MiO I)  Q> J'p(Ad Di t/ new  new  /Afi~TtMi(T I)dA  newDist  dA  new  s  a J p(d  new  =  7?' 
M i O  /A  newDist  n e w  dA  J)  Which is of the form  V  J.-oo  a  and result in  p{dnewDistl  Ct-.aJ^  fi^ftMiCT^I)  X)  v^m  /2  e X  t-^j=l J ji'  P^ V o 2 \  ••( V^m f2 ,2 v  new .  L^ij—\Jji  ~  n ( j 2  )  (5.10.) and marginalize over a , new  PidnewDistI  using Jeffreys' prior, to get  f i ~T{ M{I)  = J p(o- d  /  n e w  •p(d  new  DistI  newD  fi  ist/  f i ~Ti Mil)  Ti(j MiI) new  da  new  da  n  a  Chapter  5.  85  Classification  m E j=l  Where V =  and dV = — V do~ . 2  r  10  new  2  dV =  This expression is of the form:  V ~~ dV m  nd )  y2  J ji  =  aV2  Jo  r [ ( m + l)/2] 2a( )/ m+1  2  with m = 30, and thus the desired likelihood is  -15.5  p(d  i t/  newD s  fi ~TiMi'I)  a  T(15.5)  . (5.11)  V Second T e r m p(d  M e c h a n i s m Case  /'9iD MiI)—Non-Unique  newDist  isi  A typical behaviour of normalized pixels' data (subtracted from their values predicted by  OiDist,  the parameters estimated in Chapter 4) is shown in Figure 5.1. Obviously,  successive values of these deviations are highly correlated. The systematic deviation of the individual pixels from their values predicted by  6iDi t s  seems to be primarily characterized  by the patient (upper figure). Nonetheless, even within the context of one patient 2  (lower figure), the deviations of the individual pixels' data from the average data of this tissue and patient seem to be correlated, though to less extent. Similar figures had been generated for the rest of the tissues and found to exhibit similar behavior. We here assume a non-unique mechanism: a parameter set 9  new  that differs from  9iv%st-  The new data set  d r)j new  st  was produced by  We can view this difference as a deviation  W e denote our normal volunteers by 'patients'. We do not expect any pathology to exist i n these volunteers. Thus, when we talk about difference in the results as function of the patient, we do not assign this difference to any difference in a pathology. Rather, we refer to it as a consequence of variance found among normal humans. 2  Chapter  5.  86  Classification  Time[ms*10]  Figure 5.1: Typical Spread of the Measurements of one Tissue. Upper figure: Normalized averages for each patient subtracted from the prediction of the model (thick lines), and collection of normalized pixels measurements from patient 6 (thin lines). Lower figure: The same pixels as in the upper figure but now subtracted from their average.  Chapter  5.  Classification  87  of the particular mechanism that produced the new data set from the 'base' mechanism (described by  due to additional factors which depends on the volunteer, the  OiDist)  measurement session, location in the image, etc. The unique  mechanism  probability distribution of non-unique  is a special case of the non-unique Q  n  e  is given by  w  p(® w/@iDistI) ne  mechanism,  =  8(Q  new  admits that the probability distribution  mechanism  where the  — QiDist)-  p(Q ew/®iDist!) n  The is, in  general, wider than a 8 function. We here run the risk of loosing the distinction between expected data from different tissues, and thus loosing the ability to perform a meaningful classification. This will happen in case the width of the distribution p(Q ew/@iDistI) n  is of  similar magnitude to the difference between the parameters values expected from different tissues. In fact, our problem is an example of such a case. 
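As a computational aside, the unique-mechanism likelihood of Eqn. (5.11) above reduces to a simple projection of the pixel data onto the normalized model shape once the total amplitude (flat prior) and the noise scale (Jeffreys prior) have been marginalized analytically. The sketch below is a minimal illustration of that calculation only; the tissue shapes, echo times, amplitude and noise level in the example are hypothetical placeholders and are not values taken from this work.

```python
import numpy as np

def log_marginal_likelihood(d, shape):
    """Log-likelihood of a normalized decay shape for one pixel, with the overall
    amplitude (flat prior) and the noise scale (Jeffreys prior) marginalized
    analytically, as in the derivation leading to Eqn. (5.11).
    d     : measured decay curve of one pixel (length m).
    shape : model prediction up to an overall amplitude, e.g.
            sum_j f_j * exp(-t / T2_j) evaluated at the echo times.
    Returns the log-likelihood up to an additive constant common to all tissues."""
    d = np.asarray(d, dtype=float)
    g = np.asarray(shape, dtype=float)
    m = d.size
    gg = np.dot(g, g)
    # Residual sum of squares after projecting out the best-fitting amplitude.
    q = np.dot(d, d) - np.dot(g, d) ** 2 / gg
    return -0.5 * np.log(gg) - 0.5 * (m - 1) * np.log(q)

# Hypothetical example: two tissues with different T2 mixtures, 32 echoes at 10 ms spacing.
t = 10.0 * np.arange(1, 33)                                   # echo times in ms
shape_a = 0.15 * np.exp(-t / 10.0) + 0.85 * np.exp(-t / 80.0)
shape_b = 0.05 * np.exp(-t / 10.0) + 0.95 * np.exp(-t / 95.0)
rng = np.random.default_rng(0)
pixel = 800.0 * shape_a + rng.normal(0.0, 15.0, t.size)       # noisy pixel drawn from shape A
print(log_marginal_likelihood(pixel, shape_a),
      log_marginal_likelihood(pixel, shape_b))
```

Because the overall amplitude is integrated out, only the shape of the decay curve (and not its absolute scale) enters this comparison between tissues.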
The only knowledge we have about the distribution of  Q  n  e  w  around  Qioist  is its width, derived indirectly from the Standard  Deviation of the data points, and this width is indeed in the order of the difference between the parameters values of different tissues. We cannot expect much from the classification process, starting from our current state of knowledge. Nevertheless, we should not resort to the use of unique mechanism, or any other practices that seems to produce 'better' results. The answers given by probability theory are to the questions asked, given a state of knowledge; the unique mechanism assumption, according to our current state of knowledge, is false. So, we have:  PidnewDistIQiDistMil)  Jp(&new/6iDistMiI)  =J  d6  p{6 ewd ewDistI'QiDistMil) n  p(d r)i new  new  n  si  j6 6iDistMiI) new  d$  new  =  (5.12)  Chapter 5.  Now,  Classification  is independent of  p(d Di /6 6ir)j MiI) new  st  new  88  st  QiDist  and can be assigned assuming  uncorrelated noise, that is:  P(d ewDist/QnewOiDistMil) n  =  p(d  new  Di  st  / 6 MiI) new  a  (5.13) where we apply steps similar to the steps led to Eqn. (5.10) but now the amplitudefractions /,• and the nonlinear parameters 1\ (hidden in hji) are determined by 6 . new  is the probability distribution of the parameters that produced  pi&new/GiDistMil)  dnewDist  given the 'base' parameters of the tissue  OiDist-  I  failed to characterize this  distribution in a quantitative manner within the time frame of this work. We will approximate this term using the following steps: 1. Lacking any other knowledge about the distribution (but assuming finite deviation of 0  new  from  we will assign a gaussian distribution to all the parameters. B y  OiDist),  proper scaling of individual parameters we have:  We do not assume any knowledge about the value of 6, the scale of the deviation of 9  new  from  0  .  iDist  2. We assume that for any set of measurements, there exists a set of parameters 0 , new  that will describe this set well, down to the level of the non_systematic effects. 3. We turn from the parameter domain to the measurements domain. The model  Chapter 5.  Classification  89  function m g(9,t)=g(f  ,r,t)  =  J2fi Arj,t) H  3=1  is a continuous function of the parameters  6.  Assuming the deviation of  9  from  n e w  is small and thus will be well approximated by a first order Taylor expansion,  9iDist  we have W^new  diPnew i t) — 9iDist\\L  —  g{9iDist,t)  \dg_  2  [d9  and therefore, ta  u  io  P{9new  9  l  D  i  s  t  n  MiI)  1  where j = 6 • [^f]  e  t  - — • exp  =  t  (  (9{Qnew,t)  - g{9iDist,t)f\  —  Vt  te- \  (5.14) iA  -7< is the scale of the difference between the predictions  of model having parameters  9i£>ist  and one having parameters  5.1 we can see that ^ is at least of the order of c r t  new  9  n e w  .  From Figure  , the level of non_systematic  effects. 4. As a result, the integral in Eqn. (5.12) will be peaked at the maximum of the expression given in Eqn.(5.13). Thus, will minimize zZl=i  {9{9new,  h)  -  4 )  2  9  n e w  can be approximated by the values that  •  5. Therefore, Eqn. (5.13) does not carry information relevant to the classification problem;  describes the probability of materialization of a par-  p{d Di tI'9 MiI) new  s  new  ticular set of non_systematic effects. We replace it by a constant and approximate by d .  g(9 ,t ) new  k  k  6. Under the above, j is well approximated by the STD of the measurement points t  for each tissue. 
From Figure 5.1 we can see that 7 t remains in the same order of magnitude for all t. Approximating 7* = 7 = const. V t, Eqn. 5.14 becomes P(e /0^MJ) new  =  *  • exp  (_^^(^^7(^^^)) ) 2  =  Chapter 5.  Classification  90  1  p(d ewDist/OiDistMil) n  ,2 • exp  Efcii  (dk  —  32 • 2  g(0jDist,tk)) 2  (5.15)  7t  where 7? is the variance of the measurement points for tissue i. We are now in a position to apply similar steps to those performed in the previous section (Eqn. (5.8) to (5.10) ) to marginalize over the total amplitudes. The resulting likelihood is:  P(d ewDist n  I OiDistMil)  OL  (5.16) where again, /,• and  are approximated by their estimates, as found in Chapter  4. We approximate 7 by the S T D of the measurement points for tissue i, and do not marginalize over it. The reason will be discussed in Chapter 6.  The Integral The expression to be used for  p(d r)ist/OiDistMil) new  is given by Eqn.(5.11) in case of the  unique mechanism assumption and by Eqn. (5.16) in case of relaxation of the unique mechanism assumption. Using these results, we now have to approximate the second integral in Eqn. (5.2). As shown in Chapter 4, the width of the probability distribution of every parameter is proportional to the noise cr. Our new data come from individual pixels and we assume they have similar noise levels to the individual pixels in the corresponding tissues of the old data set. This noise level can be approximated by the average of the  STD{doi^Disti}  for all the measurement points of tissue i. It is in the range of 10-25 amplitude units.  Chapter 5.  Classification  91  The Bayesian estimation of the noise level cr that enter the first term in the integral had been found in Chapter 4 and is in the range of 0.22 to 0.64 amplitude units. As a result, the likelihood narrower then  I  p{d Di t new  s  as function of the parameters 0,-, is 30-60 times  p{d idDistl&i) 0  Thus, we can approximate the likelihood in face of the new  data set as a constant in the region of substantial weight of the probability distribution of the parameters given the old data set. That is,  p(Mi  I' d idDistd DistI) 0  Ip{QiDist/doidDistl)  p{d ewDistl&iDistMiI)  st  • f p{9iDist/d idDist)  ne  0  P{d ewDist  /  n  Owist  d6i£)i  n  P(d wDistI'OiDistMi)  where  =  new  ' dd{  (5.17)  =  QiDistMi)  are the values of the parameters that maximize the likelihood in face of  the old data set. These values had been determined in Chapter 4.  5.1.3  Final Expressions  The final expression to be evaluated for each pixel is given, according to Eqn. (5.2), by the product of the expressions given in Eqn. (5.7) and Eqn. (5.17). That is, assuming a non-unique  mechanism,  p{Mi I d i d ) 0  d  new  the expression to be evaluated is:  a  (5.18)  Chapter 5.  92  Classification  where cr- = t  STD{d idAi}, 0  Ii — ^  1  1  —~ ° 32  ' '  ldD  st  3}  , and /,• and ~rt are approximated  by their estimates, as found in Chapter 4.  5.2 5.2.1  Results Feasibility  At this point, we try to present the data in a way that will help us set out our qualitative expectations from the classification process derived in the previous section. The test of the ability to classify the tissues can be divided into two parts: 1. The basic assumption of the classification process is that the various mechanisms are different in their amplitudes or T distributions, leading to a significant difference in 2  the data produced. If this is not true, the classification will fail since our data is not relevant to the difference between the tissues. 
The rigorous procedure for deciding whether two sets of data are produced by the same mechanism was discussed in Section (2.2) but has not been applied to the tissue classification problem. Looking at the models and the average data of the tissues (thick lines in Figure 5.2) can set our qualitative impression. Our observation from Figure 5.2 is that the mechanisms are indeed different; each tissue seems to exhibit substantially different average behavior of the T distribution. 2  2. Each of the different mechanisms, or in the general case - probability distribution of mechanisms, defines measurement values expected from the new data set in order to be classified to that mechanism. Reliable classification depends on the deviation of the new data set from their predicted values by the producing mechanism, in comparison to the scale of the difference between the predictions of the various mechanism. A typical spread of the data of individual pixels is given by the thin  Chapter  5.  93  Classification  <D 13  -60 0  5  10  15  20  25  30  Time[ms*10] Figure 5.2: Models/Data Averages for the Tissues and Individual Pixels' Data From One Tissue. The models (the data predicted by the parameter estimation from Chapter 4), the average of the data and the la of the models predictions are all within the width of the thick lines (one line per tissue). The thin lines are samples of individual pixels data from one tissue but different patients.  Chapter 5.  Classification  94  lines in Figure 5.2. As can be seen from the figure, a substantial percentage of the pixels of this tissue will be erroneously classified. As said in Section 5.1.2, I failed to track down any systematic behaviour of this deviation; Figure 5.1 suggests that most of the deviation is to be associated with the deivation of the T distributions 2  among patients. 5.2.2  T i s s u e Classification  We will not try to distinguish between the Cortical  Gray, Insular  Cortex and  Cingulate  Gyrus. Though appearing under different data sets, the Insular Cortex and the Cingulate Gyrus are both parts of the Cortical  Gray [23]. The three data sets exhibit very large  variances among patients. It seems possible that the assignment of Regions Of Interest (ROI)  was in different parts of the brain for each patient, as the Cortical  Gray can be  found in the circumference of the brain in every image. In order to set our 'degree of belief in the classification of each of the tissues, we classify the pixels from the known ROI in ten patients: each of these pixels serves now as a new data set, d . The classification is done by applying Eqn. (5.18) to every pixel new  data and every mechanism M ; (i = 1..10), and choosing the mechanism (tissue) with the highest posterior probability p(Mi / d idd ). 0  new  The results of the classification are given  in Figure 5.3 for the white matter ROI, and in Figure 5.4 for the gray matter ROI. In these Figures, each Bar Chart is the result of classifying all the pixels from the ten 'learning' volunteers, that where assigned (by the ROI) to the tissue given in the Chart Label. Each bar denotes the number of these pixels, classified to the tissue given in the bar base. The sub-divisions of the bars denote the number of pixels classified to the tissue from each volunteer.  3  Naturally, one would test the ability to classify the tissues on the data taken from the eleventh volunteer; this data were not a part of the 'learning' data. 
Alas, this set of data is suspected to contain a substantial number of erronously R O I assigned pixels [23]; The resulting classification performance 3  Chapter  5.  Classification  95  As we can see from both Figures 5.3 and 5.4, the classification of white matter from gray matter is relatively successful; very few pixels of white matter were classified as Grey matter and vice versa. The success of the classification among white matter tissues or among gray matter tissues is much smaller. 5.2.3  Images  The results of applying Eqn. (5.18) to an image of the eleventh patient are given in Figure 5.5. We did not apply any image processing in the generation of our synthetic images. Nevertheless, it seems as the images could be easily improved by incorporation of additional information in the spatial level (e.g., continuity).  based on these R O I assignment is suspected to be misleading. Therefore, we choose to present the classification of the data of the eleventh volunteer through a synthetic image (see next section) and to use here a much larger database, the data from the ten 'learning' volunteers.  Chapter  5.  Classification  Genu Corpus Callosum 800  96  Splenium Corpus Callosum 1000 800  600 n 400 200  HCPt SPTI CS  Minor Forceps  Major Forceps  2000 1500 1000 500  m in H  Internal Capsules 1000 800 600 400 200  Figure 5.3: Classification Results for White Matter Tissues. Each Bar Chart contains the results of the classification of pixels known to be associated with one tissue (by the assignment of ROI). The bars denote the number of these pixels classified to each of the tissues. The subdivisions of the bars denote the different patients.  Chapter 5.  Classification  97  Head Caudate Nucleus 1200  Putamen 2 0 0 0  1000 1500  800 600  1000  400 500 200  GCSCMi MalC  Globus Palidus  Thalamus  700 1000  600 500  800  400  600  300  400  200 200  100  'GC'St? Mi Ma IC HCPt  Cortical Gray 7000 6000 5000 4000 3000 2000 1000  Figure 5.4: Classification Results for Gray M a t t e r Tissues. E a c h B a r Chart contains the results of the classification of pixels known to be associated w i t h one tissue (by the assignment of R O I ) . T h e bars denote the number of these pixels classified to each of the tissues. The subdivisions of the bars denote the different patients.  Figure 5.5: Synthetic Classification Image of Patient Eleven. Each gray level signifies one tissue.  Chapter 6  S u m m a r y , Discussion and C o n c l u s i o n  6.1  Summary  One of the main goals of this work, as defined when started, was the classification of brain tissues by their T distribution. This goal was not achieved in a satisfactory man2  ner. The variance of the T distribution of each tissue was found to be in the order 2  of the characteristic variance of all the tissues. As a consequence, the reliability of the classification was substantially reduced. As discussed in chapter 5, the variance of the distributions within the tissues seems to be mainly due to variance of the T distribu2  tion parameters among different patients . That is, classification may be enabled by 12  determining the probability distribution of these parameters as function of patient. I failed to characterize this distribution in a quantitative manner within the time frame of this work. Nonetheless, based on unpublished work of the author, this task seems to be feasible for future work, and could carry important information about the underlying physical mechanisms. 
The estimation of the parameters of the tissues is also substantially influenced by the variations among the patients. These variations render the T2 distributions based on the analysis of the tissues of one patient unrepresentative of the distributions of those tissues in other patients. This may explain part of the difference between the resulting distributions of this work and the analysis given in [3]. The parameter estimations in this thesis were made under the assumption that there is a unique T2 distribution (functional form and parameter values for this functional form) for each tissue, regardless of the patient from which the data were taken. If the T2 distribution is determined by both the tissue and the patient, our resulting parameters can be taken as an average of the parameters for each tissue over the patients; assuming small deviations of the parameters as a function of patient, the data are linear in each of them.

The work determined for the first time the probability of existence of a short T2 component in the brain tissues, and found it to be higher than 99% for all white matter tissues and higher than 80% for all gray matter tissues except Cortical Gray (see Section 6.3.1). The probability of having no more than three components of decaying exponents in the T2 distributions of the brain tissues was found to be higher than 90% for all the tissues.

This work analyzed a problem in which the signal-to-noise ratio is poor, the data sets consist of multiple measurements and the systematic effects are only partially accounted for by the models. The analysis used a multi-dimensional search in the space of models and classification of data into (conceptually, if not practically) probability distributions generated by previous sets of data. The multi-dimensional search in the space of models has not yet been treated in the Bayesian literature.

Footnote 1: Each image includes a water phantom, intended to be used as a reference (having a mono-exponential decay curve with known parameters), enabling detection and correction of systematic effects arising from the particular imaging session. The data from the water phantom pixels in the images used for this work exhibit large variance and a multi-exponential behaviour. As a result, the water phantom readings were ignored.

Footnote 2: By 'patient' we mean here both the human and the particular imaging session. A new MRI sequence, in which the water phantom pixels exhibit reliable mono-exponential decay curves, has lately been developed at U.B.C. Using the data from these water phantoms, removal of systematic effects of the particular imaging session may be enabled.

6.2  Discussion

Probability theory determines the unique procedure for transformation of a state of knowledge (expressed in real numbers) into answers to well-posed questions. The answers may change substantially when we change our questions or state of knowledge. The Bayesian inference procedure forces us to explicitly introduce our questions and state of knowledge; it will thus prevent us from solving problems other than those we wanted to solve.
We discuss here an example encountered in our work.

6.2.1  Multiple Measurements: σ vs. σ_s

In Chapter 3 we defined a quantity σ = σ_s/√nl, where σ_s is the Bayesian estimate of the noise of the individual data points and nl is the number of measurement sets (the number of pixels constituting the data base, in our case). At a glance, it seems that σ is the Bayesian parallel of the Standard Deviation of the sample mean. As for the STD of the sample mean, assuming the noise level σ_s (the deviation of the individual data points from the model) is finite and constant, we expect the estimate of σ_s to stay constant and the estimate of σ to decrease as we accumulate more data. In the presence of systematic effects in the data that are not accounted for by the model, this does not happen. As we add more data sets, the estimate of σ stops decreasing (as a result of an increasing estimate of σ_s). What happened to our estimate of the noise level σ_s? Obviously, the difference between the model and the individual data points remains the same regardless of the number of data sets acquired; the answer must lie in the meaning of σ_s.

σ_s is the width we assign to the distribution of the difference between the data points and the model predictions. In the case of a single measurement set, it does not matter whether the difference between the model and the data is due to non-systematic effects or to systematic effects not accounted for by the model. To see what happens in the case of multiple measurements, consider the averaged data: as we accumulate more data, the systematic effects manifest themselves more and more in the averaged data, until they dominate. At that point, adding more data sets will not reduce the deviation of the averaged data from the model. Alas, assigning a Gaussian distribution to the noise, as in our case, is equivalent to the assumption that all the difference is non-systematic, and thus adding data sets should reduce the deviation of the averaged data from the model predictions. Encountering a collection of data sets whose average shows a certain deviation from the model, the Bayesian procedure estimates a σ_s that will justify such a deviation by non-systematic deviations of the individual points. More data sets, failing to reduce the average deviation, will thus result in an increasing estimate of σ_s. In the limit of very many measurements, the deviation of the averaged data from the model will remain constant (this is the difference between the model and the systematic effects in the data), and will be characterized by a constant σ. The estimate of the noise from each point needed to achieve such a deviation of the averaged data will increase as √nl.

σ is a measure of fit of our collective data to the model; all the estimates of the accuracies of parameters in the models are proportional to σ. The estimate of σ_s, on the other hand, answers the question: "Given the model, and assuming all the deviations of the data from the model are non-systematic, what width should be assigned to the distribution of the individual deviations of the data points from the model in order to account for the collected data?"
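A small numerical sketch of this behaviour, using an arbitrary toy model, an arbitrary unmodelled component and an arbitrary noise level (none of them taken from this work), is given below: the averaged data are fitted with a deliberately incomplete model, the misfit of the average stalls once the systematic effect dominates, and the implied single-point width σ_s = σ·√nl then grows roughly as √nl.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 10.0 * np.arange(1, 33)                  # echo times (ms), illustrative
model = np.exp(-t / 80.0)                    # the (incomplete) model shape that is fitted
truth = 0.9 * np.exp(-t / 80.0) + 0.1 * np.exp(-t / 10.0)   # mechanism with an extra, unmodelled component
sigma_true = 0.05                            # per-point noise of an individual pixel

for nl in (10, 100, 1000, 10000):
    pixels = truth + rng.normal(0.0, sigma_true, (nl, t.size))
    avg = pixels.mean(axis=0)
    # Fit only the overall amplitude of the (wrong) model to the averaged data.
    amp = np.dot(model, avg) / np.dot(model, model)
    resid = avg - amp * model
    sigma_hat = np.sqrt(np.mean(resid ** 2))   # width appropriate to the averaged data (sigma)
    sigma_s_hat = sigma_hat * np.sqrt(nl)      # implied single-point width (sigma_s)
    print(f"nl={nl:6d}  sigma={sigma_hat:.4f}  sigma_s={sigma_s_hat:.3f}")
```

Removing the unmodelled component from `truth` reverses the picture: σ then falls as 1/√nl while the implied σ_s stays near the true per-point noise, which is exactly the behaviour expected when the model accounts for all systematic structure in the data.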
6.2.2  Marginalization in the Neighbourhood of No Data

As illustrated by Bretthorst [1, Appendix C], the marginalization of a nuisance parameter is very much like estimating the nuisance parameter from the data and using this estimate, given that we have enough data to determine that nuisance parameter well. The marginalization does not assume any particular value for the nuisance parameter, but rather considers all the possible values this parameter might have. Each of the possible values is weighted by its prior probability multiplied by its likelihood in light of the data. In case the likelihood is peaked, the practical result is equivalent to estimating the parameter from the data.

[Figure 6.1 shows probability distributions over a nuisance parameter φ: a δ-function prior (equivalent to no marginalization), an ignorance prior, the prior state of knowledge, and likelihoods for a well-determined and an ill-determined parameter.]

Figure 6.1: Marginalization Decisions in Light of Likelihood Widths. Broad likelihood: representing our state of knowledge as ignorance will result in losing information. Peaked likelihood: the data are capable of determining the value of the parameter; not marginalizing will prevent them from doing so.

In case the data do not determine the nuisance parameter well, the shape of the prior becomes important. The prior information in hand is, in most cases, hard to express as an actual Probability Distribution Function (PDF), or is expressed in a PDF which is difficult to marginalize. The common practice in these cases is either to assign an uninformative prior (expressing ignorance regarding the value of the parameter) or not to marginalize at all, and to use some value (best representing our prior information) for that nuisance parameter. The latter practice is equivalent to assigning the prior a δ function, expressing certainty about the value of the nuisance parameter. In many cases, both practices represent states of knowledge different from the one in hand: we know more than mere ignorance, but we are not as certain as a δ function. Which one is less harmful? We illustrate the situation and the various cases in Figure 6.1.

If the data determine the value of the parameter well (peaked likelihood), we should let them do so, by assigning prior ignorance and marginalizing. As discussed in Chapter 3, in case the likelihood is much more peaked than the prior, the exact assignment of the prior does not matter, as long as the data and the prior are not 'surprising' in the face of each other. The data will surprise us if the prior rules out the value range in which the likelihood is peaked. On the other hand, if the data do not determine the value of the parameter well (broad likelihood), we should put our prior information to the best use. If marginalization using our true prior is computationally not possible, we should approximate our prior. A crude, but in many cases acceptable, approximation is the δ function: approximate the value of the parameter from our prior knowledge and use this value instead of marginalizing.

In several cases during the classification process (Chapter 5), we chose not to marginalize. This was our choice in the case of one datum point; the likelihood was broader than the prior in hand.
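The trade-off can be made concrete with a one-dimensional toy problem (all numbers below are hypothetical and unrelated to the thesis data): a signal level is inferred from a few measurements whose noise width φ is a nuisance parameter, once by marginalizing φ numerically over a broad 1/φ prior and once by plugging in a fixed guess for φ (the δ-function prior).

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(3.0, 2.0, size=5)          # few data points: the noise level is ill-determined

a_grid = np.linspace(-5.0, 10.0, 601)        # parameter of interest (signal level), flat prior
phi_grid = np.linspace(0.2, 10.0, 400)       # nuisance parameter: noise width

def log_like(a, phi):
    # Gaussian likelihood of the data for a given signal level and noise width.
    return -data.size * np.log(phi) - np.sum((data - a) ** 2) / (2.0 * phi ** 2)

# (a) Marginalize the nuisance parameter over a Jeffreys-like 1/phi prior.
post_marg = np.array([
    np.trapz(np.exp([log_like(a, p) for p in phi_grid]) / phi_grid, phi_grid)
    for a in a_grid])
post_marg /= np.trapz(post_marg, a_grid)

# (b) Plug in a fixed value for the nuisance parameter (delta-function prior).
phi_fixed = 1.0                               # a guess that happens to be too small
post_plug = np.exp([log_like(a, phi_fixed) for a in a_grid])
post_plug /= np.trapz(post_plug, a_grid)

for name, post in (("marginalized", post_marg), ("plug-in", post_plug)):
    mean = np.trapz(a_grid * post, a_grid)
    width = np.sqrt(np.trapz((a_grid - mean) ** 2 * post, a_grid))
    print(f"{name:12s}  mean={mean:.2f}  width={width:.2f}")
```

With so few data the marginalized posterior is noticeably broader than the plug-in one; the δ-function shortcut understates the uncertainty whenever the guessed value of the nuisance parameter is overconfident, which is the situation sketched in the lower half of Figure 6.1.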
6.2.3  Multi-Dimensional Search for Models

When performing a multi-dimensional search in the space of models, we should remember that the space is discrete. In general, standing at one model in the models space, it is not clear which models we should compare our current model to. It is possible that they are not the neighbouring models; neither when deciding what is the next model to test, nor when deciding that we have arrived at a maximum after finding that all neighbouring models are of lower probability. The general topology of such spaces as a function of the number of parameters in each model, the degree of orthogonality to other models, the priors of the models' parameters, and perhaps the structure of the data deserves further research. The search in this work was designed for the special case of the parameters of the T2 distributions and was fine-tuned ad hoc; thus, no conclusions regarding the general model search are made.

6.3  Conclusions

The results of the analysis of the T2 distribution of the brain tissues are given in Table 3.4 for the model selection and in Tables 4.3 to 4.4 for the parameter estimation. We here indicate some of the findings that had not been explicitly posted in those tables.

6.3.1  Evidence of a Short T2 Component

All the tissues, except Insular Cortex, showed significant evidence in favour of having a component of short T2 in the range of 5-15ms. The odds ratio in favour of the model having the lower component over a model lacking the lower component ranges from 6:1 for Head Caudate Nucleus to 10^15:1 for Splenium Corpus Callosum. The gray matter tissues exhibit, in general, lower evidence for the lower component, along with a lower estimated amplitude of that component. Recall that the evidence for the lower component in the Head Caudate Nucleus comes only from the data of the first measurements (at time 10ms). This results in under-determination of the parameters of this component, as shown in Figure 4.7. The Insular Cortex showed an odds ratio of 40:1 in favour of not having a lower component.

According to [2], there seems to be no apparent reason to expect inherent differences in the T2 of the lower component in the white matter tissues. The results of our work indicate no evidence for such differences, except for the Internal Capsules, which have a slightly higher T2 time.

The lower component is usually assigned to water confined within myelin bilayers. A number of independent supportive indications for this assignment are given in [2]. The results of this work do not weaken or contradict any of these indications.

6.3.2  Evidence of a Long T2 Component

No evidence was found for long T2 components, with T2 of more than 1000ms, in the brain tissues except for the Cortical Gray tissues: in Insular Cortex and Cingulate Gyrus in the form of a DC level, and in the data set labelled 'Cortical Gray' as a component of T2 just below 1sec.

Most of the tissues exhibit a component of T2 in the range of 150-300ms. It is not clear to the author to what source this component should be assigned.

6.3.3  Components Width

No evidence for spread of the T2 times in the components of the T2 distributions was found.
2  As discussed and demonstrated in Chapter 3, this is not to say that there is  no width, but that there is no justification in the data and our prior knowledge for the incorporation of width in our models. The data is not sensitive to the width of the T  2  distribution components; a model containing widths (of the order of magnitude of those demonstrated in Chapter 3) will explain the data as well as a discrete model, whether the data were produced by a discrete or a continuous T distribution. Thus, lacking any prior 2  information that will strongly support the continuous and more complicated model, the Bayesian procedure chooses the discrete models as the simplest models that adequately explain the data and our prior information. One may introduce strong prior information supporting the continuous distribution. This will result, under the current noise level, in an under_determination of the parameters of the distribution. The joint posterior probability distribution of the parameters will have ridges (in case of one redundant parameter), surfaces (in case of two redundant parameters) or hyper-surfaces (in case of three redundant parameters) of high probability.  Chapter 6. Summary, Discussion and Conclusion  6.3.4  107  Alternating Echo Refocusing  The Alternating Echo Refocusing ( A E R ) component in our models is almost orthogonal, over the measurement points, to the other base function.' As a result, its estimation is almost free from coupling to the estimations of other parameters and is accurate to the level of 5-8%. The highest A E R component was found in Head Caudate Nucleus, as anticipated by MacKay [22] associated with the extent of laminar flow of fluids in this tissue.  -  Bibliography  [1] G. Larry Bretthorst. Bayesian Spectrum Analysis and Parameter ture Notes in Statistics:48. Springer-Verlag, 1988.  Lec-  Estimation.  [2] A . MacKay, K.Whittall, J . Adler, D. L i , D. Paty, D. Graeb/n Vivo visualization myelin  water  in brain  by magnetic  of  Mag. Res. in Med., 31:673-677, 1994.  resonance.  [3] A . L. MacKay, K . P. Whittall, R. A . Nugent, D. Graeb, I. Lees, D . K . B . L i In vivo T  2  relaxation  in normal  human  Mag. Res. i n Med., Aug. 1994.  brain.  [4] Kenneth P. Whittall and Alexander L. MacKay. Quantitative relaxation  interpretation  of NMR  data. J . of Mag. Res., 84:134-152, 1989.  [5] R.S.Menson and P.S. Allen. Application to the fitting of data from  of continuous  model systens  relaxation  and excited  tissue.  time  distributions  Mag. Res. in Med.,  20:214-227, 1991. [6] E.T.Jaynes. Probability  [7] M . Tribus. Rational 1969.  theory - the Logic  Descriptions,  [8] T. J . Loredo. The promise ' omy, 1992.  Decisions  of bayesian  of Science.  1993.  and Design.  inference  Pergamon Press, Oxford,  for astrophysics.  [9] P. C. Gregory and T.-J. Loredo. A new method for the detection  of unknown  Modern Astron-  of a periodic  signal  shape and period. The Astrophysical Journal, 398:146-168, Oct. 1992.  [10] H . Jeffreys. Theory of Probability. [11] E.T.Jaynes. Prior probabilities. netics, SSC-4:227-24L, .1968.'  Oxford Uni. Press, London, ed 3, 1961. I E E E Transactions on Systems science and Cyber• •  [12] E.T. Jaynes. Marginalization and Prior Probabilities. Bayesian Analysis in Econometrics and Statistics. North-Holland Publishing Company, 1980. [13]  E . T. Jaynes.  Bayesian  Spectrum  and Chirp  Analysis.  Maximum Entropy and  Bayesian Spectral Analysis and Estimation Problems. D . Reidel ed., DordrechtHolland, 1987.  
108  Bibliography  109  [14] A . P. Crawley and R. M . Henkelman.  multiple-echo  Errors in T estimation imaging. Mag. Res. in Med., 4:34-47, March 1987. 2  using  multislice  [15] Peter Cheeseman et al. Bayesian classification. Proceedings on the seventh National Conference on Artificial Intelligence (AAAI-88), pages 607-611, 1988.  [16] H . B . W . Larsson, J.Frederiksen, L . Kjaer, O. Henriksen, J . Olesen/n vivo determination o / T i and T in the brain of patients with severe but stable multiple 2  sclerosis.  Mag. Res. in Med., 7:43-55, 1988. [17] R.V.Mulkern, S.T.S. Wong, P. Jakab, A . R. Bleier, T. Sandor, F . A . Jolesz imaging sequences for high field in vivo transverse relaxation  CPMG  studies. Mag. Res. in  Med., 16:67-79, 1990. [18] J . P. Armspach, D. Gounot, L . Rumbach, J . Chambron In vivo determination of multiexponential T relaxation in the brain of patients with multiple sclerosis. Mag. Res. Imag., 9:107-113, 1991. 2  [19] C.S.Poon and M . R . Henkelman Practical  T  2  quantization  for clinical  applications.  J. of Mag. Res. Imag., 2/5:541-553, 1992. [20] G. Larry Bretthorst. Bayesian  analysis II: Signal detection  and model selection.  J.  of Mag. Res, 88:552-570, 1990. [21] Wendy A . Stewart, Alex.L. MacKay, Kenneth.P.Whittall, G. R. Wayne Moore, Donald W . Paty Spin-spin relaxation in experimental allergic encephalomyelitis, analysis of CPMG data using a nonlinear least squares method and linear inverse theory.  Mag. Res. in Med., 29:767-775, 1993. [22] Alex MacKay, Dept. of Radiology/U.B.C. - Private Conversations. [23] Alister Prout, Dept. of Neurology/St. Paul Hospital, Vancouver, B . C . - Private Conversations.  Appendix A  Safety Valve Routine  The ability to approximate the integral over the nonlinear parameters is crucial to the feasibility of model selection in the general case, where such an integral cannot be done analytically or by usual numerical methods. The case of gaussian distributed noise is of general interest. We brought the likelihood to an almost symmetric shape (as function of the nonlinear parameters) by transformation of the nonlinear parameters. We then try to approximate its peak. In most of the cases, the expansion of the log(likelihood) function in Taylor series will be adequate; the distribution assigned to the noise is Gaussian and thus log(likelihood) will behave in a quadratic manner near its peak.  This is not al-  ways the case; if the problem is not well posed down to the noise level, the behaviour of log(likelihood) will not be quadratic in all principle directions. The (pseudo) nonuniqueness of the solution seems to give rise to behaviour of — x  2n  where n=2,3...(see  figure A . l ) . This will cause a very flat log(likelihood) at the point of maximum likelihood and small coefficients of the Taylor expansion in some directions, which in turn will lead to over estimation of the integral that enters the likelihood of the model. In order to estimate the integral we will assign 'artificial' eigenvalues to the 2 derivand  tives matrix b. Let Vj (j = l..r) be the eigenvalues of matrix b. We correct any eigenvalue smaller than a certain threshold. The threshold is set as a function of a reasonable extension of the logarithm of the nonlinear parameters under the given noise level. A n  110  Appendix  A. 
Safety Valve  111  Routine  eigenvalue that indicates a variance of (a linear combination of) the nonlinear parameters of more than a factor of 2, for instance, is suspected to be a-result of non-quadratic behaviour of the log[likelihood]. That is, we correct eigenvalues satisfy  <* > \! Vi 2  2  > 2  >  where < <r > is approximated by 0.1. Our results depend weakly on this approxi2  mation. We carry out the following steps: 1. Each eigenvalue describes the curvature of the likelihood in one of the principle axes. We find the position, along this principle axis in which the log(likelihood) • drops by 2< o >. We solve 2  mK*(U) - mh?(U + t • Wj) = 2 < a >  (A.l)  2  for t, where Wj is the eigenvector associated with the eigenvalue Vj . 2. Averaging the absolute values of the two solutions of ( A . l ) will give the width of the likelihood in the direction of this principle axis.:  w —  tl + |<2| -  3. Now, what is the influence of the non-quadratic shape of the log(likelihood) on the value of the integral? The integral of a gaussian of unit amplitude and unit width is Jexp(—x )dx = yjv « 1.77 , / exp(—x )dx w 1.85 and / exp(—x ™)cir—> 2.0.. 2  6  That is, assuming log(likelihood) does behave as — x  2  2n  near its peak, the resulting  decrease in the integral from erroneously assuming it is a gaussian of the same width  Appendix A. Safety Valve Routine  112  0.5  0.5  0 lr 0 • iQ •0.5 i—i  lr  1  H-  w  I— H-  -1  1  o o pi  -1 5 -2  t"  1  \  1  v  1  hi  0  V  l  •0.5 1  H-  -1  H H-  •1.5  O O  I  2.5 0.3-0.2-0.1  )  .0.10.20.  Figure A . l : Normal Behavior (right) and Abnormal Behavior (left) of the Likelihood along Principle Axes. In case of abnormal behavior we search for characteristic width w and use it for the integration of the likelihood. The figure was generated by subtracting the maximum of Log [Likelihood] from the LogfLikelihood]. The data were taken from the simulation of the continuous distribution; the model contains 4 discrete exponents. will not exceed 12%. We multiply w by a factor of 1.06 to confine the resulting error to 6%. 4. Thus, the substitution to the eigenvalue is given by: '  v.  j S u b  2  '  <  a  2  >  (1.06 w)  2  The approximation described above enable us to estimate the integral to accuracy of 20-30%. Models that will have likelihood difference smaller then that will be treated 'manually'; in the range of 20-30% difference of the likelihoods, the exact assignment of the prior ranges become important. As can be seen from Table 3.4 , no such small differences had been encountered during the analysis of the brain tissues data.  Appendix B  A r e T w o Signals E q u a l ?  In this appendix We will treat the problem of signal detection, and the problem of comparing two sets of measurements to find whether their signal source rate, of Poisson distributed events, was the same. The problem of signal detection can be regarded as a special case of the problem of deciding whether two signals are equal, for a case where we know that one of the signals is zero. We treat this problem in Section B . l . We then proceed to treat the more general problem (of deciding whether two signals are equal) in Section B.2. We are interested in the signal rate s of Poisson distributed events.of some source. The problem of estimating this rate is complicated by an, unavoidable background of. unknown rate b. In order to estimate the rate s regardless the background rate b, a set of. On/Off measurements is performed. 
We assume that the source s contributes only to the On measurements, and that the rate b of the background is the same in the On and Off measurements. The solution for the estimation of s was given by Loredo [8]: We denote by T the duration of the measurement and by N the number of events in that duration. The posterior distribution of the background rate is given by  (b/N T I )  P  off  off  b  =  I o f f  ^ ' f  ,  ..  =  (B.2)  assuming nonJnformative prior for b. In the following, we will denote this probability by- p(b/1); we regard the results of the Off measurements as part of our background information. 113  Appendix  B. Are Two Signals  114  Equal?  The likelihood of s • b in light of The results of the On measurement is given by the Poisson distribution for s + b : \(s + b)T }  (N /sbI)=^  P  S  on  +  l  -( + ) ™  Non  ,  b)lon  s  e  6  —  b T  •  ,  (B.3)  The posterior probability of the rate s is given by ' N  rp  on  = 53 d • ^  p(s/N I)  1  on  I rp U _ S  l  o  =o  n  -sT  on  l  .  *•  (B.4)  where U Q  _  \  | T ^  v ^ W i 2^j=o y  o f f  T  m  y  (N +N -i)l m  , T y tj  on  OJf  Ton  .  off  (N -i)l  )  {Non+N -j)\ (N -jy.  •  off  •  ";  .  •  on  We are going to work out the detailed algebra; The final solutions can be found in Sections B . l . 4 (for signal detection) and Section B.2.4 (for comparing two signals). B.l  Signal D e t e c t i o n  *  Given a set of On/Off measurements, is there evidence for the existence of a signal? That is, find the odds ratio of:  '  .  • Mo = The rate of the signal is zero (there is no additional signal on top of the background). OVER • M i = There is a signal with rate s.  Appendix B. Are Two Signals Equal?  B.l.l  115  M  •  0  The posterior probability of Mo is given by:  p(W„„/M„6) P(%„//) where: • p(b/I) is the probability of b given the background information / , where / includes the results of the Off measurements. p(b/I) is given by Eqn. (B.2). • p(M /bI) = p(M /I) 0  0  is the prior of the model. We set• p(M /bI) = ,p(Mi/W) =.§,.. 0  indicating prior ignorance regarding the plausibility of the signal existence. • p{N n/M b} O  0  • p(N /I) on  o.  is the global likelihood. We will not have to calculate it for this model  comparison.  B.l.2  is the likelihood and is given by Eqn. (B.3) for s  .  >  Mi  The posterior probability of M\ is given by:  p(M /N I) 1  on  = JJdbds  JJdbds  p(M b/N I) :  lS  on  (b/I) (MjbI)  P  P  = JJ  db ds p(M±sb/I) • ^  pis/M.bl)  p(N /M bI) ^ ^ p(N /I) on  P  lS  M  on  where:  N  ^  M  ^ "=  "(B,6)  Appendix B. Are Two Signals Equal?  116  • p(b/I) is the prior for b. as for Mo, it is given by Eqn. (B.2). is the prior for M . As for M „ p(Mi/bI)  • p(Mi/bI)  x  • p{sIM\bl)  -\.  0  is the prior for s (= j^—).' is the likelihood. It is given by Eqn. (B.3).  • p(N /MisbI) on  Odds R a t i o The odds ratio 0 i is the ratio of the probability p(M /N I) O  0  to p(Mi/N I)  on  (given by  on  Eqn. B.5 and B.6, respectively). 
Since p(M /bI)  = p(M /I)  0  a  0  _ PJMolKJ)  1  fdb p(b/I) p{M lbI) Q  ff ds  on  pib/I) (M /bI)  db  P  PJMo/I)  1  E ^ f f  (s/M bl)  P  :  ; i %  p{N  x  bI)  (r  o  fdbp(b/I)p(N /M b) on  p(Mi/I) B.1.3  we can write:  = p{M /I)  1  =  " p{M,IN I)  =  and p{M jbI)  ffdbd  p'b/I) p{slM bI) 1  0  p{N lM sbI) on  •  {  x  J  R e d u c t i o n of the Integral  Integration Over b The integral / db p(b/I) p(N /sbI) appears both in the numerator and the denominator: on  By equations (B.2) and (B.3):  r / db p(b/I) (N /sbI) P  on  =  1  f Tu- (bT A °is ~ -=i} db °" i o / /  {  }  f  N  y  t  bT  \( + b)T ] °" l[s + o)l \ e N  s  on  e ( _  s + 6  )  T o  "  _  =  E q n . (B.2) gives the posterior probability of the background for large 6 - The general form differs only by the multiplicative constant: ^ ^^ y This constant (appearing in both the numerator 1  m a x  N  Tb  and the denominator) is independent of s and 6 and thus will not alter the odds ratio.  Appendix B. Are Two Signals Equal?  (T ) °ff+*  e-° ™.(T ) °-  N  T  off  N  on  Expanding (5 + b)  Non  (r  o / /  )^//+  1  h{T  +T  n)  N  0  e  on  :  e- °»  N  ,  (T ) °  T  N  on  y  / db - ( °ff °^b °ff b T  +T  N  e  N jj\N \ 0  + b) N ;  J db - °ff ° b °"{s  N \N l off  117  on  s"™-^-  N 1  1  nn. •  (B.8)  on  The integral to be solved is of the form  /  er  ax  e~ x dx = ax  b  x'  b  b  bl  j  (B.9)  and Eqn. (B.8) becomes  °  (Toff)  e-' " N \N \ T  ff+1  off  °  N  I  N  ,r  (T ) on  on  N '  (T  on  +  (B.10)  T ) °u+'+ off  where: l(b ) —• 0.  N  l  .  max  I(Lax) will be given with the final expression for the odds ratio. The numerator of the odds ratio (Eqn. (B.7)), becomes (by taking the limit s —> 0 in Eqn. B.10 ):  •  numer(odas) = - • 2 V  .  N \N \  ;  off  on  -—  •  .  (N  off  (T + on  + N )\ on  T ) °ff+ °^ N  N  (B.ll)  off  Integration Over s In order to compute the denominator of the odds ratio, we have to integrate Eqn. (B.10) over 5:  .  Appendix  B. Are Two Signals  •r r JJds  118  Equal?  r (T ,A °ff = Jds N  db p(b/I)  N  E  1  (N /sM)  P  i=0  N  on  m  l  N  m  l  1  (T °  )° N  n)  (B.12)  ^( ^max)  T ) off+^  {T +  %\{N -i)\  sTon  N  '  ' on •  e'  +1  [ l o f f )  on  off  The integral to be solved is of the same form as Eqn. (B.9) and results in: JV  0:  off • o n i v  E  [N -i)\ on  j+l)  t=0  9 ( ^max )  (T +  i\(N -i)\ on  T ) off+^ N  on  ^max)  off  (B.13)  where g(s ) max  —>• 0.  <7(smax) will be given with the final expression for the odds ratio. The denominator of the odds ratio (Eqn. (B.7)) becomes:  deno(odds) =  son  E  1  2  (T  o n  off  N  N  r  AT  0  ;!(iV  0  on  N '  off- 4 ' o r i -  1  * on •  )(^n-,+l)  (T )N  (T ) °ff^  -S  (N  t=0  B.1.4  1  o n  (T  -z)!  +  on  T ) °ff+*+i off  N  • ^(^max)  (B.14)  The Odds Ratio  Combining Eqn. B . l l and B.14, the odds ratio is:  (N +No y n  off  JV  n.  ]  (T +T ) °// °" on  o//  + JV  + 1  (  h  H°maxJ  v f"R 1 en  Appendix  B. Are Two Signals  119  Equal?  where Non-i  e  • E  on  (Tnji  (N -i-j)\ on  and  as a function of T / / , T , A ^ / / , -/V , 0  o n  on  5  m a  a;  and b ). It should be multiplied by the prior max  of the odds ratio |^°7{| if the latter is different from 1. p  Note that the odds ratio depends on the range of the priors for b and 5. The dependence on b  max  B.2  is very weak. On the other hand, the odds ratio is linear in s . max  A r e T w o Signals E q u a l ?  
Given two data sets A and B of On/Off measurements, find the odds ratio of: • • M\ = the source rates s and Sb are identical a  regardless of differences in the background intensity b). OVER • M = the source rates.s and si> are different 2  a  (regardless of differences in the background intensity 6).  B.2.1  M i  The probability of the model M i is given-by:  2  IA{IB) signify all the background information regarding A (B), including the results of the 'Off' measurements. 2  Appendix  120  B. Are Two Signals Equal?  p(M /N N lAlB) 1  onA  = J J ds ds p(M S S /NonANonBlAh)  onB  a  b  1  a  I I j 7 / A/r IT T \ P I o n A / / ds ds piMxSaSbflAlB) , a  ' '  J J ds ds p(M II I ) b  a  1  N BlMiS S I I ) on  •  b  J J  a  b A B  a  =  P(N JS l1 1B) onA  onB  • p(s /M I I )-  A B  =  h  a  p{s lM I I )-  1 A B  b  lSa A B  p{N nAN B-lM SdS I I ) 0  on  p(N N onA  (B.16)  b A B  1  IT I )  onB  A  B  Now, is. the prior for M i . We set p(MiJT I )  • piMJlAls) • p(s lMiI I ) a  — p(s /I )  A B  a  A  a  6  • p(s /M s I I ) 1  A  b  • p{N N lMis s I I ) onA  onB  a  p{N /s I ) onA  ='^-  B  s ). a  onA  1  a  • p(N /N M s s I I )  b A B  onB  onA  1  a  =  b A B  p(N /s I }.  •  a A  a  = p(N /M s s I I )  b A B  = |.  A B  a  b  = S(s -  a A B  2  is the prior for s . We assume ignorance priors for s = p{s /I )  and 5 . That"is, p(s /I )  b  = p(M /T I )  A B  onB  3  b B  And thus Eqn. (B.16) becomes:  -  1• — 2  a  I n general, p(N /N „ MII IB) knowledge of s j . 3  0  1  • / fds i i  W ONB  p(M lN N I lB)  A  A  onA  onB A  =  ds 6(s - s ) PW^/*M)-P(N^/S>IB) > ' (N N /I I ) b  b  a  P  ^ p(N /M-^IAIB)ONB  onA  onB  A  =  B  The independence in our case is due to the  Appendix  B. Are Two Signals  •1  1  ^  °max  Equal?  r  d  p{N /s I )  s  J  121  onA  ( B 17)  • p{N B/s I )  b A  on  ba B  p(N N /I I )  a  onA  onB  A B  is the rate of source B numerically identical to the rate of source A , s  where s  ba  ba  =  s -Mi. b  B.2.2  M  2  The probability of model M is given by: 2  = J J ds ds p{M s s /N N I I )  p(M /N N I I ) 2  onA  onB A B  111  a  f a JT '  7  b  T \  IT  2  P\1Von  b  2  a  b  b  onA  =  onB A B  A  N /M s s I I ) — . P{IyonA-N /l l ) onB  / / ds ds p(M s s /I I ) J J a  a  A B  2  a  b A B  onB  =  A B  J J ds ds p(M /I I )-p(s /M I I )-p(s /M s I I )a  b  2  A B  a  2 A B  b  2  .  a A B  p(N N /M s s I I ) onA  onB  2  a  b  A  B  p[NpnA N  onB  II I A  B  (B.18)  )  Substituting the known probabilities, we have:  _ 1  1  1  f /, / /d / J  , Sa  d S b  .M  i V  ls I )-p(N ls I ) ^ : p{N : ^p(N : N; r/If I7\ a  onA  a  M  A  onA  onB  ONB  A B  The Odds Ratio Combining Eqn. B . 1 7 and B . 1 9 , the expression for the odds ratio is:  Q  . / d s p(N /sJ )  _ 1 2  a  onA  •  A  j —• 11 ds ds p(N /s I ) 2  a  b  onA  A A  p{N /s I ) onB  •  BA B  p(N /s I ) onB  B B  b  B )  B  (B.I  9 )  Appendix  B. Are Two Signals  122  Equal?  Using Bayes theorem, we have: (N /sJ )  P  onA  = p(N /I ).  A  onA  ti**!^ *)  "  1  A  where: is given by eq.(B.4), assuming b  • p(s /N I ) a  onA A  m a x  , the range of the prior of the  background rate, is sufficiently large. • p(s /I ) a  A  • p(N /I) onA  is the prior for s {=j^)-  It is independent of s.  is a constant.  
Then,  fds  a  012  =  ^•Ifds  p(N /I)  -p(N /I)  •  onA  ds (N /I)  a  b P  •  Ids  •  onA  • (N /I) P  • p{s /N I )  a  a  j — • f jds 2  onA  ba  b  a  onA  •  A  • p(s /N I )  a  a  I ds p(s /N I ) a  B.2.3  a  onA A  •  onA A  / ds  b  )  A)  _  onB B  p(s /N I ) b  onB B  °max / ds  B  «*J?X)  •  onA  • p(s /N I )  A  ds p{s /N I )  a  ^JS)  •  onB  p{s /N I ) ba  (B.21;  onB B  p(s /N I ) b  onB B  Reduction of the Integrals  Integration of the Denominator The denominator is a multiplication of two independent integral. Substitute Eqn.(B.4) in Eqn. (B.21), these integral are of the form:  Appendix  B. Are Two Signals  Equal?  123  £ C . ' - ^ r t=0 N  (rp  on  Q. \ on)  "  .  ±  / d 1-  il  °"  s T  s ei  aTm  ^  z'!  Z!  »'=0  (r ) on  ECi=0  (B.22)  5D( max) 5  +i  Z!  where:  n  4-  off\i(M°»+Mof -iy.  T  f  V ^ T  (A/on  I T ff "I" j  \^ on(-\ l^j=0\L N  and  o n  /  (No„+No -jy. (N -j)l  c  T  o  f}  n  (^max) -  m  z!  J  So the denominator is: onA  (rp  z!  deno(odds) 'max 'N  \ ,  +1  Z!  —  9DA(s . ) ma x  = 0  (rp  onB  C  j  S  i=o  l  z^ -  onB  (B.23)  gDB(s .x) ma  (T y  +i  Integration of t h e N u m e r a t o r Substituting Eqn.- (B.4) in Eqn. (B.21), the numerator is:  /  , " f i * („  T (sT )>e- -->\ mA  cnA  T  'faf(  T (sT )>e- -*mB  mB  T  Appendix  B. Are Two Signals  N  N  onA  E  i=0 A—n  N  E c c iA  j=0 „'—n  2^ G CjB i=o  ;  ds  i+i  e  - ^  +  r  — >  J  7.  frp  J] - •  iA  .  { l o n B  3B  frp  onB  »=o  frp  [lonA)  I. •  N  onA  frp  onB  124  Equal?  J]  (T  J-  onA  +  (B.24)  <?Af(-Smax)  T y+^ onB  Where:  3ff( max) = s  B.2.4  (s  )*  -Sma.x(T +T )  t+j  - 7{TonA 77;  E {TonA + T ^Zo  e  The Odds  onA  onB  1 rp ) +-T onB  \  '  m a x  (*+j)'  + J - n  o n B  )  n  {i  9N{s ) max  +j-n)l  —> 0  S m a x ^OO  Ratio  Combining Eqn. (B.23) and Eqn. (B.24), the odds ratio is  (B.25) Under the assumptions that b  max  is large enough to justify the use of Eqn. (B.4).  Eq. (B.25) is the odds ratio in favour of equality of signals s and s , as function of a  0  the results of the On/Off measurements of sets A and B. Again, the dependence on the range of b is very weak and we gave here the solution assuming b  max  —•» 0 0 . As in the solution to the signal detection odds ratio, the odds ratio  is linear in s „r. m  
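The closed forms above can be cross-checked numerically. The sketch below (with arbitrary example counts, measurement durations and prior ranges, chosen only for illustration) computes the signal-detection odds ratio of Section B.1 by direct numerical integration of the Poisson likelihoods over the background rate b and, for M1, the signal rate s, using flat priors on [0, b_max] and [0, s_max] and equal prior probabilities for the two models.

```python
import numpy as np
from scipy.stats import poisson

def odds_no_signal(n_on, t_on, n_off, t_off, s_max, b_max, nb=2000, ns=2000):
    """Odds ratio O_01 = p(M0|data)/p(M1|data) for the On/Off problem of Section B.1,
    computed by numerical integration with flat priors on b in [0, b_max] and
    s in [0, s_max], and equal prior probabilities for the two models."""
    b = np.linspace(0.0, b_max, nb)
    s = np.linspace(0.0, s_max, ns)
    like_off = poisson.pmf(n_off, b * t_off)                 # background-only Off likelihood
    # M0: no source, so the On counts also come from the background alone.
    m0 = np.trapz(like_off * poisson.pmf(n_on, b * t_on), b) / b_max
    # M1: On counts come from source plus background; integrate over both rates.
    on_given_sb = poisson.pmf(n_on, np.add.outer(s, b) * t_on)   # shape (ns, nb)
    m1 = np.trapz(np.trapz(on_given_sb * like_off, b, axis=1), s) / (b_max * s_max)
    return m0 / m1

# Hypothetical counts: 25 events in a 10 s "On" run, 12 events in a 10 s "Off" run.
print(odds_no_signal(n_on=25, t_on=10.0, n_off=12, t_off=10.0, s_max=10.0, b_max=10.0))
```

As in the analytic result, the computed odds ratio scales roughly linearly with s_max while being almost insensitive to b_max, which is a useful consistency check on any implementation of Eqn. (B.15).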
