A MINIMALLY INFORMATIVE LIKELIHOOD APPROACH TO BAYESIAN INFERENCE AND DECISION ANALYSIS

by

AO YUAN

B.Sc., Sichuan University, China, 1982
M.Sc., Sichuan University, China, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF STATISTICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
1997
© Ao Yuan, 1997

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

For a given prior density, we minimize the Shannon Mutual Information between a parameter and the data, over a class of likelihoods defined by bounding a Bayes risk by a 'distortion parameter'. This gives a conditional distribution for the data given the parameter which provides optimal data compression, or equivalently, is minimally informative for a type of location parameter. These optimal likelihoods cannot, in general, be obtained in closed form. However, they can be found numerically.
Moreover, we give two statistical senses in which the optimal likelihoods form parametric families which make the weakest possible assumptions on the data generating mechanism. In addition, we establish properties of this parametric family that characterize its behavior as the distortion parameter varies.

We argue that the parametric families identified here may lead to a default technique for some settings in initial data analysis. We partially characterize the settings in which our techniques may be expected to provide useful answers. In particular, we argue that if one is interested in performing certain Bayesian hypothesis tests on a parameter that locates a typical region for the response, then our technique may provide weak but nevertheless useful inferences. We also investigate the robustness of inferences to modeling strategies for paired, blocked data.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Notations and Definitions
Acknowledgements
Chapter 1. Introduction
  1.1 The Minimally Informative Likelihood Problem
  1.2 Formulation of the MIL Problem
    1.2.1 Definition of the Minimally Informative Likelihood
    1.2.2 Minimally Informative Distributions
    1.2.3 The Quantities that Determine the MIL
  1.3 The Blahut-Arimoto Iterative Procedure
  1.4 Some Closed Form Examples
  1.5 Dependence in the MIL
  1.6 Computational Aspects
Chapter 2. Information Theory and Other Background
  2.1 Information Theory
    2.1.1 Entropy, Relative Entropy and Source Coding
    2.1.2 Channel Capacity
    2.1.3 Data Compression and the Rate Distortion Function
    2.1.4 Comparison with the ME Formulation
    2.1.5 Interpretation of the MIL
  2.2 Relation to Reference Priors
  2.3 Other Background
Chapter 3. Main Results on the MILs
  3.1 Large Sample Properties of MIL
  3.2 Small Sample Properties of MIL
  3.3 Behavior of the MIL for Large and Small Values of $\lambda$
  3.4 Hypothesis Testing Using the MILs
  3.5 Remarks
Chapter 4. Application
  4.1 Introduction
  4.2 Application to a Real Data Set
    4.2.1 Description of the Data
    4.2.2 Models for the Data and Results
Chapter 5. Robustness of Modeling Strategies for Paired Data
  5.1 Introduction and Definition of Models
  5.2 Equivalence of Models
  5.3 Robustness of Modeling Strategies for Paired Data
  5.4 More Considerations for the Robustness Issue
Chapter 6. Discussion and Further Research
  6.1 Discussion
  6.2 Further Topics Regarding the MIL
References

List of Tables

Table 1
Table 2

List of Figures

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5

Basic Notations and Definitions

The following notations and definitions are used throughout the thesis.

1. $x^n$ stands for $(x_1, \ldots, x_n)$.

Acknowledgements

I would like to thank my supervisor Bertrand S. Clarke for guiding me to the unexplored territory, for his inspiration, for his constant encouragement and the financial support. I am indebted to Professor Harry Joe for his invaluable advice and suggestions, and to Nancy Heckman, Paul Gustafson and all the members of my Supervisory Committee for their comments and help. I would like to thank Christine Graham, our Department secretary, for her constant help. Finally, I would like to thank the University of British Columbia and the Department of Statistics for their financial support.

Chapter 1

Introduction

1.1 The Minimally Informative Likelihood Problem

In this thesis we investigate an information theoretic criterion for likelihood selection.
It is based on minimizing information: the information being minimized is the information implicit in the likelihood. This is counter-intuitive because usually one wants a likelihood which is as informative as possible. However, it must be remembered that fundamentally the likelihood is as arbitrary, at least initially, as any other statistical construct. More to the point, for the sake of being conservative, one wants to assume as little as possible because it is hard to assess whether the assumptions one has made are acceptable for the application. Indeed, it is an empirical fact that no models are demonstrably exact for any real phenomenon.

In practice, if one has a genuinely valid parametric family then inferences made using it will likely be stronger than those made using a likelihood representing weaker assumptions. We formalize this by obtaining minimally informative likelihoods. Even though they provide weaker inferences, their conservatism is useful. For instance, rejecting a hypothesis using a minimally informative likelihood is a stronger form of rejection than rejecting under a true likelihood. Estimation using a minimally informative likelihood bypasses much of the argumentation essential to justifying a model - which is generally done cursorily at best, that is, without reference to the detailed physical basis of the phenomenon, which is often unknown.

The actual criterion we study is a minimization over a set of 'good' likelihoods defined by a Bayes risk bound. Thus, to use this criterion one must identify a quantity or parameter, choose a prior density for it, and choose a loss function. In addition, one must choose a bound for the Bayes risk. Specifically, the criterion we are to optimize is the Shannon mutual information (SMI) between the likelihood and the prior, i.e. the likelihood we seek is

$$p^* = \arg\min_{p \in \mathcal{P}} I(\Theta, X^n)$$

where

$$I(\Theta, X^n) = \int\!\!\int m(x^n)\, w(\theta \mid x^n) \log \frac{w(\theta \mid x^n)}{w(\theta)}\, d\theta\, dx^n$$

is the Shannon mutual information between the random variable $X^n$ and the parameter $\Theta$,

$$m(x^n) = \int p(x^n \mid \theta)\, w(\theta)\, d\theta$$

is the data marginal density and

$$\mathcal{P} = \Big\{ p(\cdot \mid \theta) : \int\!\!\int p(x^n \mid \theta)\, L(x^n, \theta)\, w(\theta)\, dx^n\, d\theta \le l \Big\}.$$

Here $L(\cdot,\cdot)$ is the loss function and $l$ is the specified risk tolerance bound. Note that the SMI is the expected Kullback-Leibler divergence between the posterior and the prior,

$$I(\Theta, X^n) = E_m\, D\big(w(\cdot \mid X^n) \,\|\, w(\cdot)\big),$$

where for any pair of densities $p(\cdot)$ and $q(\cdot)$,

$$D\big(p(\cdot) \,\|\, q(\cdot)\big) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$

is the Kullback-Leibler divergence between $p(\cdot)$ and $q(\cdot)$. It is not a distance, but has some distance-like properties. It measures the discrepancy between the two densities. Thus, we are seeking the likelihood, in the given class, which updates the prior the least, in that it gives a posterior as close to the prior as possible. In this sense, it is minimally informative for the parameter among the class of likelihoods. We call such a likelihood the Minimally Informative Likelihood (MIL). It is the most conservative, and in case no knowledge of the data distribution is available, it can be used for an initial data analysis.

Since $I(\Theta, X^n) = H(\Theta) - H(\Theta \mid X^n)$, where $H$ denotes the entropy, our method has some connections with the maximum entropy method. In fact, minimizing the SMI over likelihoods, which we do here, is the same as maximizing the second term, which is the conditional entropy, where the conditioning is on the data.

1.2 Formulation of the MIL Problem

As we noted earlier, in Bayesian analysis, usually one assumes a known likelihood and a known prior density, so that all inferences are obtained from the posterior distribution of the parameter.
Sometimes, however, we have pre-experimental knowledge about the parameter that we can quantify in a prior distribution, but little knowledge about the likelihood or the distribution of the data conditionally on the parameter. For example, past experience with phenomena similar to a phenomenon under investigation, and expert opinion, may be used to suggest a prior, but do not generally provide enough information to suggest a likelihood. In addition, modeling a physical problem may suggest a particular loss function, or it can be chosen based on convenience. We also require a bound on the allowable Bayes risk under that loss function; this can be set by the experimenter as we will discuss later. This bound controls the Bayes risk of using the data itself, say $X$, as an estimator for the unknown parameter $\theta$. The likelihood our optimization produces therefore depends on the prior, the loss function and the bound; its robustness to these inputs can also be assessed. We will discuss the choice of these quantities in more detail in Section 1.2.3.

As a specific example, we might want to estimate the mean precipitation in a given month over the long term in British Columbia based on the daily observations in that month. Past records in B.C. can be used to help us to formulate the prior distribution. (If no such records were available, a Bayesian might just specify a set of basic beliefs about the parameter, say its location, dispersion, etc., and choose some standard distribution to fit these beliefs.) For a moderately small number of observations of a phenomenon which, like the weather, is not well understood, it may be hard to identify a reasonable family of distributions. So the MIL method, while it may not be sensitive to the physical details of a phenomenon, will at least provide an option for the user who has limited knowledge.
The loss function can be partially specified by the cost of underestimation and overestimation. For instance, excessive rainfall might lead to flooding or other damage to crops which can be assessed financially. Too little rainfall might necessitate irrigation which has an approximately known cost. The Bayes risk bound can be chosen by the experimenter according to the practical precision requirements; the larger the bound, the less accurate, but the more flexible the model will be. Or it can be a suitable positive number no greater than the minimum average loss over the parameter, which in the case of squared error loss is just the prior variance, since by Proposition 3.1 there is a unique MIL for each Bayes risk bound in that range.

For another example, consider the lifetimes $x_i$ of light bulbs, or any lifetime data. Let $\theta$ be the mean lifetime. Suppose we have some historical data that can be used to specify the prior. A reasonable loss function may take the form

$$L(x, \theta) = L_1(x, \theta)\, I(x > \theta) + L_2(x, \theta)\, I(x < \theta)$$

where $L_1$ is a non-negative function which is non-decreasing in $|x - \theta|$ and $L_2$ is a non-positive function which is non-increasing in $|x - \theta|$, and such that the Bayes risk is positive for a class of distributions. It is known that if $L_1$ and $L_2$ are linear in $|x - \theta|$, which may be reasonable choices in some practical settings, then the Bayes estimators are percentiles.

The parameter occupies an unorthodox role in this strategy. Conventionally, one specifies a prior fully and seeks a parametric family conditional on it. Of course this is hard. Instead, we specify the parameter partially: we regard it as a sort of location parameter in the sense that, because we optimize over $\mathcal{P}$, $\theta$ must be estimable by $X$. (This, of course, is in addition to the information theoretic criterion that $\theta$ must be decodable from $X$.) As such, $\theta$ is partially specified and is only fully specified by the optimization procedure.
In Example 1.4.3, for instance, we choose a $N(0,1)$ prior for $\theta$ that we think of as being the data mean. In fact, it is not exactly the data mean: once we have the MIL, we see that $\theta$ is interpreted as a shift of the mean. Examining the form of the MIL, we see that the loss function is in the exponent. Thus, the nature of the loss function also strongly affects the detailed interpretation of the parameter $\theta$.

This imprecision in the interpretation of $\theta$ will not in general hamper the assignment or elicitation of a prior. In practice, it is difficult to explain the difference between a mean and a median, and in our method this degree of exactitude is usually glossed over anyway. Moreover, it is rare to be able to assign a prior to one sort of location parameter without having obvious implications for similar but different location parameters.

In applications, one is concerned with what the parameter represents. Consider a hypothetical problem in which one person wants to estimate the median and another wants to estimate the 99th percentile using the same loss function and the same prior. Noting that the procedure gives the same MIL for both cases, one is concerned that the basic set up doesn't make sense.

The answer to this criticism is that the set up does indeed make sense but that no uniform statistical interpretation of the parameter exists. That is, you get to choose a loss function, a measure of distortion and a prior but do not get to choose the statistical interpretation for the parameter; the exact statistical interpretation of the parameter arises from the optimization procedure. All that one can say in general is that, as a consequence of the choice of loss and allowed distortion, one will optimize over a class of likelihoods for which the random variable as an estimator for $\theta$ has Bayes risk bounded by the distortion.
The parameter $\theta$ is a location parameter in the sense that it can be estimated by $X$ with Bayes risk bounded by $l$ and the loss appears in the exponent of expression (1.2.1.3). For instance, if one chooses the squared error loss and a normal prior one does get a normal MIL with an identified mean that is a function of $\lambda$, $\theta$, and the parameters in the normal prior. In this case, the interpretation of $\theta$ as a percentile depends on the values of the parameters in the prior and on $\lambda$. As noted in Section 2, if $\mu = 0$, then only in the limit of $\lambda\sigma^2$ going to infinity does one get $\theta$ as the mean.

A limitation arises if one insists on using a certain parameter, loss and prior. In general one cannot link the choice of the loss function and the interpretation of the parameter. For example, suppose one insists on estimating the 99th percentile under the squared error loss. Even if one uses a prior appropriate for the 99th percentile, the exact interpretation for the parameter from our method depends on the loss function through (1.3.1). Since the 99th percentile is usually far from the posterior mean (the Bayes estimator under squared error loss), one expects the 99th percentile to differ substantially from the $\theta$ in (1.3.1). We conjecture the only way to remedy this is to change the loss function so its Bayes estimator is close to the interpretation one wants.

Thus, the range of interpretations that can emerge from our method is narrow. In practice, one should put a prior on a vaguely defined location parameter after choosing a loss function. (This gives an idea of what type of $\theta$'s can be estimated well by $X$.) Fundamentally, we have a way to choose $\lambda$, the prior and the loss function to get a likelihood, i.e., we have a hyperplane in the space of likelihoods parametrized by $(\lambda, w(\cdot), L(\cdot,\cdot))$.
This hyperplane is the result of an information theoretic optimization - a "universally" optimal reduction of the information in the sense of the data compression described in Chapter 2. This formulation reverses the usual decision theory approach. In the usual approach one specifies a loss function and a parametric family, optimizing to find an estimator. Here, we have a loss function and an estimator ($X$) but we optimize to find a parametric family.

1.2.1 Definition of the Minimally Informative Likelihood

To choose a likelihood, we note that for a given prior $w(\theta)$ and a parameter $\theta \in R^d$, $I(\Theta, X)$ can be minimized over certain classes of likelihoods. This minimization gives a likelihood for which the posterior is least changed from the prior, in an average sense. This is one sense in which the optimal likelihood can be regarded as minimally informative, so we denote it by $p_{MIL}(x \mid \theta)$. Consequently, a product of optimal univariate likelihoods is an independence model which is minimally informative apart from the assumption of independence.

For simplicity, we assume that $X$ and $\Theta$ are continuous and unidimensional. When either is discrete it will be enough to replace the integration with a summation; the properties we use continue to hold. Let $L_n(x^n, \theta) = \sum_{i=1}^n L(x_i, \theta)$ be the cumulative empirical loss for estimating $\theta$ based on the sample $x^n$, where $L(\cdot,\cdot)$ is the loss function. We minimize the SMI over the class $\mathcal{P}_l$ of likelihoods, defined to be the set of parametric families of densities on a measure space $(\mathcal{X}^n, \Theta)$ which satisfy

$$\int\!\!\int p(x^n \mid \theta)\, w(\theta)\, L_n(x^n, \theta)\, dx^n\, d\theta \le l_n. \tag{1.2.1.1}$$

Here $l_n > 0$ bounds the amount of Bayes risk we will tolerate for estimating $\theta$ by $X^n$. In information theory, the minimal value of the SMI over $\mathcal{P}_l$, for the $n = 1$ case,

$$R(l) = \inf_{p \in \mathcal{P}_l} I(\Theta, X), \tag{1.2.1.2}$$

is the rate distortion function, see Cover and Thomas (1991).
It is shown in Blahut (1972b) that the minimum in (1.2.1.2) is achieved by

$$p^*_\lambda(x \mid \theta) = \frac{m^*(x)\, e^{-\lambda L(x,\theta)}}{\int m^*(y)\, e^{-\lambda L(y,\theta)}\, dy}, \tag{1.2.1.3}$$

where $\lambda$ and $m^*(x)$ are determined by the conditions

$$\int\!\!\int p^*_\lambda(x \mid \theta)\, w(\theta)\, L(x, \theta)\, dx\, d\theta = l \tag{1.2.1.4}$$

and

$$\frac{1}{m^*(x)} \int p^*_\lambda(x \mid \theta)\, w(\theta)\, d\theta = \int \frac{e^{-\lambda L(x,\theta)}\, w(\theta)}{\int m^*(y)\, e^{-\lambda L(y,\theta)}\, dy}\, d\theta \le 1 \tag{1.2.1.5}$$

with equality in (1.2.1.5) for those $x$ such that $m^*(x) > 0$.

The general approach of using MILs as a default suggests models of the form

$$p(x \mid \theta, \lambda) = C(\theta, \lambda)\, m(x)\, e^{-\lambda L(x,\theta)},$$

where $m(x) \ge 0$ and $C(\theta, \lambda)$ is the normalizing constant. We may in turn ask the question: given the loss function $L(\cdot,\cdot)$ and $m(\cdot)$, does there exist $w(\cdot)$ such that $m(x)$ is the mixture of $p(x \mid \theta, \lambda)$ and $w(\theta)$? This is an integral equation problem.

Note that $\lambda = 0$ in (1.2.1.3) is associated with $p^*_\lambda(x \mid \theta) = m^*(x)$, which is independent of $\theta$. We will see later that the corresponding $l$, under suitable conditions, is infinity. In this sense, the constraint (1.2.1.1) vanishes and the SMI assumes its minimum of zero for any distribution that is independent of $\theta$.

When $n > 1$, the foregoing holds with $x$ and $L(x, \theta)$ replaced by $x^n$ and $L_n(x^n, \theta)$ respectively. Although we have taken $x$ to be real valued, the procedure is valid more generally. In particular it is valid for $x$ taking any values in $R^k$. It is this generality which will help permit the formulation of diverse models in Chapter 4.

In the definition of the MIL, $1/\lambda$ or $l$ behaves like a dispersion parameter for $p^*_\lambda(\cdot \mid \theta)$ in addition to its role in defining $\mathcal{P}_l$. This will be discussed in an example in Section 1.4.

Apart from a few special cases, one cannot solve for the optimal $p_{MIL}(x \mid \theta) = p^*_\lambda(x \mid \theta)$ explicitly. However, one can obtain $p^*_\lambda(x \mid \theta)$ numerically by the Blahut-Arimoto algorithm, see Section 1.3. This information theoretic technique produces a likelihood $p^*_\lambda(x \mid \theta)$ which is optimal within the class $\mathcal{P}_l$ of parametric families.
It is optimal in the sense of making the weakest assumptions consistent with estimating $\theta$ by $X$ with a Bayes risk bounded by $l$. Thus, in general $p^*_\lambda(x \mid \theta)$ is not a "true" likelihood. In particular, we require only that $X$ be not a bad estimator for $\theta$, where $\theta$ is a location-type parameter. It is a location only in the sense that it summarizes $X$ in a data compression context or permits the effective decoding of $X$ in a channel transmission context.

1.2.2 Minimally Informative Distributions

When considering $\lambda$ as a parameter, the MIL is an optimal parametric family within a class of parametric families. Sometimes it is interesting and practical, for a given prior density $w(\theta)$, to ask for an optimal distribution within a given parametric family $p(\cdot \mid \theta, \eta)$, where $\theta$ is the parameter of interest and $\eta$ is an additional parameter, over which we are optimizing. In this case, we still assume the data distribution is iid. The SMI is a function of the additional parameter $\eta$:

$$I(\eta) = \int\!\!\int \prod_{i=1}^n p(x_i \mid \theta, \eta)\, w(\theta) \log \frac{\prod_{i=1}^n p(x_i \mid \theta, \eta)}{m(x^n \mid \eta)}\, d\theta\, dx^n = \int D\big(p^n_{\theta,\eta}(\cdot) \,\|\, m_n(\cdot \mid \eta)\big)\, w(\theta)\, d\theta = I(\Theta; X^n \mid \eta),$$

and $I(\Theta; X^n \mid \eta)$ is the conditional SMI, where

$$m(x^n \mid \eta) = \int \prod_{i=1}^n p(x_i \mid \theta, \eta)\, w(\theta)\, d\theta.$$

We are to minimize $I(\eta)$ over $\eta$. This will lead to more understanding of the behavior of the minimum information approach. For large sample size $n$ and any fixed value of $\eta$, we can use the approximation by B. Clarke and A. Barron (1990)

$$D\big(p^n(\cdot \mid \theta, \eta) \,\|\, m_n(\cdot \mid \eta)\big) = \frac{d}{2} \log \frac{n}{2\pi e} + \log \frac{\sqrt{\det I(\theta \mid \eta)}}{w(\theta)} + o(1), \tag{1.2.2.1}$$

where $d$ is the dimension of the parameter $\theta$ and $I(\theta \mid \eta)$ is the Fisher information of the likelihood $p(x \mid \theta, \eta)$. So, by this formula, we can get an approximate minimally informative distribution by asymptotically minimizing $I(\theta \mid \eta)$ over $\eta$.

We denote the minimizer of $I(\eta)$ by $\eta^*$ and call the corresponding distribution $p(\cdot \mid \theta, \eta^*)$ the minimally informative distribution (MID).
We will give some examples of MIDs in Section 1.4.

1.2.3 The Quantities that Determine the MIL

Since the MIL requires a fixed prior $w(\cdot)$, a specified loss function $L(\cdot,\cdot)$ and a given Bayes risk bound $l$, we must specify these quantities before the construction of the MIL. There are numerous methods for selecting the prior. If we have historical data, it may be used to suggest a prior. If we have some vague knowledge of the likelihood, for example the Fisher information $I(\theta)$, one can use Bernardo's reference prior which is based on $I(\theta)$ (see Bernardo, 1979). In practice, to specify a prior distribution, one usually first specifies some basic beliefs about the prior, for example its range, location and dispersion, and then chooses some standard distribution to meet these constraints.

Also, if the belief in the prior distribution is not strong, we may choose priors sequentially. That is, let $w_0(\cdot)$ be the initial prior, which may be very flat, and get $p^*_0(\cdot \mid \cdot)$ from it by the MIL procedure; use $w_1(\cdot) \propto p^*_0(\text{data} \mid \cdot)\, w_0(\cdot)$ as the next stage prior, and so on. Or divide the data into two parts $x = (y, z)$, use $y$ as a training set to get $w(\theta \mid y)$, treating it as the prior and $z$ as the data for inference.

The loss function can be chosen in several ways. The best way is to use an experimenter's understanding to formulate $L$ so that the loss is a matter of modeling. In practice, however, one often chooses a loss function subjectively or for mathematical convenience. For a single continuous observation, the usual choice is the squared error loss or absolute error loss. For the binary case one may choose the 0-1 loss. For a sample of size $n$, we may use the average or weighted average loss for each observation.

As for the Bayes risk bound $l$, it may be chosen based on how much risk is tolerable for the specific problem, or chosen informally in the same way as an optimal smoothing parameter.
(We will see later that $l$ behaves much like a smoothing parameter in the nonparametric context.) A practical and simple way to choose a proper value for the parameter $\lambda$ is to find, possibly by grid search, the $\lambda_0$ for which the corresponding posterior updated by $p^*_{\lambda_0}$ is closest, in the Kullback-Leibler sense, for instance, to the prior. This is consistent with the idea of the MIL.

Since $l$ and $\lambda$, in general, determine each other (see (ii) of Theorem 3.3.2), sometimes choosing $\lambda$ is more convenient. Since $\lambda$ and $l$ have a sort of reciprocal relationship, we may roughly choose $\lambda \propto 1/l$. From the structure of the MIL, we see that $\lambda^{-1}$ behaves somewhat like the dispersion of the distribution; thus, as another alternative we may choose $\lambda \approx 1/(Q(0.75) - Q(0.25))$, where $Q(0.25)$ and $Q(0.75)$ are the first and third sample quartiles respectively. In the example computed in Chapter 4 we choose $\lambda$ on the basis of how the posteriors look - a heuristic approach which we find more convincing than the automatic techniques we have listed.

1.3 The Blahut-Arimoto Iterative Procedure

In Section 1.2.1 we stated that in general there is no closed form solution for the MILs, but it is computationally possible to obtain MILs through an iterative procedure, the Blahut-Arimoto algorithm. Recall that the MIL is the minimizer of the rate-distortion function $R(l)$ for given distortion or Bayes risk $l$. It is shown in Blahut (1972) that the minimum in $R(l)$ is achieved uniquely by

$$p^*(x \mid \theta) = \frac{m^*(x)\, e^{-\lambda L(x,\theta)}}{\int m^*(y)\, e^{-\lambda L(y,\theta)}\, dy} \tag{1.3.1}$$

where $m^*(x)$ is determined by the equation

$$\int \frac{e^{-\lambda L(x,\theta)}\, w(\theta)}{\int m^*(y)\, e^{-\lambda L(y,\theta)}\, dy}\, d\theta \le 1 \tag{1.3.2}$$

with equality for those $x$ such that $m^*(x) > 0$, and $\lambda > 0$ is determined by $l$ through the equality in the constraint (1.2.1.1). Note that $m^*(x)$ is just the marginal density for the data from $p^*(x \mid \theta)$:

$$m^*(x) = \int p^*(x \mid \theta)\, w(\theta)\, d\theta.$$

The following three theorems are given in Blahut (1987).
For self containment, we state them here and give an outline of the proof. The technique of proof is valid when $X$ is discrete or continuous. Let $\mathcal{Q} = \{q : q(\cdot) \text{ is a probability density on } \mathcal{X}\}$, and $\mathcal{R} = \{r(\cdot \mid \cdot) : \forall x,\ r(\theta \mid x) \text{ is a posterior density on } \Theta\}$. We have

Theorem 1.3.1 (Blahut, 1987).

(i) $$I(\Theta, X) = \inf_{q \in \mathcal{Q}} \int\!\!\int w(\theta)\, p(x \mid \theta) \log \frac{p(x \mid \theta)}{q(x)}\, d\theta\, dx.$$

(ii) $$I(\Theta, X) = \sup_{r \in \mathcal{R}} \int\!\!\int w(\theta)\, p(x \mid \theta) \log \frac{r(\theta \mid x)}{w(\theta)}\, d\theta\, dx.$$

Theorem 1.3.2 (Blahut, 1987). $R(l)$ is decreasing on $[0, \infty)$ and is convex and hence continuous on $[0, r]$, where $r = \inf_x \int w(\theta)\, L(x, \theta)\, d\theta$.

Theorem 1.3.2 says that the inequality constraints in the definition of $R(l)$ can be replaced by an equality constraint, since $R(l)$ is decreasing and continuous. This is significant because it means that we can use the equality constraint to introduce the Lagrange multiplier. That is, consider

$$\inf_{p} \Big[ \int\!\!\int w(\theta)\, p(x \mid \theta) \log \frac{p(x \mid \theta)}{m(x)}\, dx\, d\theta + \lambda \Big( \int\!\!\int w(\theta)\, p(x \mid \theta)\, L(x, \theta)\, dx\, d\theta - l \Big) \Big]$$

for each $\lambda$, where $m(x) = \int w(\theta)\, p(x \mid \theta)\, d\theta$. The minimum of this expression will be achieved by some $p_\lambda$, which depends on $\lambda$, and $\lambda$ is chosen so that $\int\!\!\int w(\theta)\, p_\lambda(x \mid \theta)\, L(x, \theta)\, dx\, d\theta = l$. Thus we have the following:

Theorem 1.3.3 (Blahut, 1987).

$$R(l) = -\lambda l + \inf_{p} \inf_{m} \Big[ \int\!\!\int w(\theta)\, p(x \mid \theta) \log \frac{p(x \mid \theta)}{m(x)}\, d\theta\, dx + \lambda \int\!\!\int w(\theta)\, p(x \mid \theta)\, L(x, \theta)\, dx\, d\theta \Big]$$

where $m(\cdot)$ is a probability density. For fixed $p(\cdot \mid \cdot)$, the expression in the square brackets is minimized by choosing

$$m(x) = \int w(\theta)\, p(x \mid \theta)\, d\theta.$$

For fixed $m(\cdot)$, the expression in the square brackets is minimized by choosing

$$p(x \mid \theta) = \frac{m(x)\, e^{-\lambda L(x,\theta)}}{\int m(y)\, e^{-\lambda L(y,\theta)}\, dy}.$$

For more details of proof, see Blahut (1972a, b).

Based on the above double minimization, the MIL can be evaluated by the following iterative procedure, see Blahut (1972a) and Arimoto (1972): choose an arbitrary density function $m_0(\cdot)$ and form $p_0(\cdot \mid \cdot)$ by setting

$$p_0(x \mid \theta) = \frac{m_0(x)\, e^{-\lambda L(x,\theta)}}{\int m_0(y)\, e^{-\lambda L(y,\theta)}\, dy} \tag{1.3.3}$$

where $\lambda$ is chosen to achieve the equality in (1.2.1.1).
Then, form the next step marginal $m_1(\cdot)$ by

$$m_1(x) = \int w(\theta)\, p_0(x \mid \theta)\, d\theta, \tag{1.3.4}$$

and the next step $p_1(\cdot \mid \cdot)$ by replacing $m_0(\cdot)$ by $m_1(\cdot)$ in (1.3.3), and continue in this fashion. After $n$ iterations, we get $p_n(\cdot \mid \cdot)$. It was shown by Csiszár (1974) that

$$\lim_{n \to \infty} p_n(x \mid \theta) = p^*(x \mid \theta), \quad \forall x \text{ and } \theta. \tag{1.3.5}$$

1.4 Some Closed Form Examples

Except for a few special cases, the rate distortion function $R(l)$ and its corresponding minimizer $p^*(x \mid \theta)$ cannot be evaluated in closed form, but can, in general, be computed by the Blahut-Arimoto iterative procedure. Here we show several examples, some for the discrete case and some for the continuous case, in which the MIL $p^*(x \mid \theta)$ can be obtained in closed form.

Example 1.4.1. For a discrete one-dimensional example, we take the binary symmetric source, that is, the prior $w(\theta)$ takes the values $\alpha$ and $1 - \alpha$, for some $\alpha \in (0,1)$, at 0 and 1 respectively, and zero at any other point. We take the loss to be the probability-of-error loss, that is, $L(x, \theta) = 0$ for $x = \theta$, and 1 for $x \ne \theta$. For $\theta = 0, 1$, $p(x \mid \theta)$ is a probability mass function for $x$ on $\{0, 1\}$. This example is used in information theory to illustrate how the rate-distortion bound can be achieved by the corresponding channel, the MIL in our context. It also shows that even in this simple situation, a closed form MIL solution is relatively difficult to obtain. For $l \in [0, \min(\alpha, 1 - \alpha)]$ the constraint is

$$l = \sum_{x=0}^{1} \sum_{\theta=0}^{1} w(\theta)\, p(x \mid \theta)\, L(x, \theta) = (1 - \alpha)\, p(0 \mid 1) + \alpha\, p(1 \mid 0). \tag{1.4.1}$$

The MIL in this case is (see Cover and Thomas, 1991)

$$p^*(x \mid 0) = \begin{cases} \dfrac{(1 - l)(\alpha - l)}{\alpha (1 - 2l)} & \text{if } x = 0 \\[2mm] \dfrac{l\,(1 - \alpha - l)}{\alpha (1 - 2l)} & \text{if } x = 1 \end{cases} \qquad p^*(x \mid 1) = \begin{cases} \dfrac{l\,(\alpha - l)}{(1 - \alpha)(1 - 2l)} & \text{if } x = 0 \\[2mm] \dfrac{(1 - l)(1 - \alpha - l)}{(1 - \alpha)(1 - 2l)} & \text{if } x = 1 \end{cases}$$

and the corresponding $m^*(\cdot)$ is

$$m^*(0) = \frac{\alpha - l}{1 - 2l}, \qquad m^*(1) = \frac{1 - \alpha - l}{1 - 2l},$$

so we can determine, for each $l \in [0, \min(\alpha, 1 - \alpha)]$, the corresponding $\lambda$ in the formula for $p^*(x \mid \theta)$ by the relationship

$$p^*(0 \mid 0) = \frac{m^*(0)}{m^*(0) + m^*(1)\, e^{-\lambda}},$$

which gives $\lambda = \ln \frac{1 - l}{l}$.
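The iteration (1.3.3)-(1.3.5) is easy to try on finite alphabets, and the binary source above gives a case where the answer can be checked by hand. The following is a minimal sketch, not the implementation used in this thesis: the function name `blahut_arimoto` and its arguments are hypothetical, and $\lambda$ is held fixed throughout instead of being re-solved from the risk constraint at each step, so the Bayes risk $l$ is an output. The check uses the marginal $m^*(0) = (\alpha - l)/(1 - 2l)$, which follows from (1.4.1) with $w(0) = \alpha$ and $\lambda = \ln((1-l)/l)$.

```python
import math

def blahut_arimoto(prior, loss, lam, n_iter=200):
    """Fixed-lambda Blahut-Arimoto iteration on finite alphabets (a sketch).

    prior[t]   : prior mass w(theta_t)
    loss[x][t] : loss L(x, theta_t)
    lam        : multiplier lambda > 0, held fixed (a simplifying assumption)
    Returns (p, m, risk) with p[x][t] = p(x | theta_t), m[x] the data
    marginal, and risk = sum over x, t of w(t) p(x|t) L(x,t).
    """
    n_x, n_t = len(loss), len(prior)
    m = [1.0 / n_x] * n_x                      # arbitrary starting marginal m_0
    p = [[0.0] * n_t for _ in range(n_x)]
    for _ in range(n_iter):
        # Likelihood update, the discrete analogue of (1.3.3).
        for t in range(n_t):
            weights = [m[x] * math.exp(-lam * loss[x][t]) for x in range(n_x)]
            total = sum(weights)
            for x in range(n_x):
                p[x][t] = weights[x] / total
        # Marginal update, the discrete analogue of (1.3.4).
        m = [sum(prior[t] * p[x][t] for t in range(n_t)) for x in range(n_x)]
    risk = sum(prior[t] * p[x][t] * loss[x][t]
               for x in range(n_x) for t in range(n_t))
    return p, m, risk

if __name__ == "__main__":
    alpha, l = 0.3, 0.1                        # illustrative values only
    lam = math.log((1 - l) / l)
    hamming = [[0, 1], [1, 0]]                 # loss[x][theta]
    p, m, risk = blahut_arimoto([alpha, 1 - alpha], hamming, lam)
    print("m* from iteration :", m)
    print("m* in closed form :", [(alpha - l) / (1 - 2 * l),
                                  (1 - alpha - l) / (1 - 2 * l)])
    print("Bayes risk        :", risk)
```

Holding $\lambda$ fixed keeps the sketch short; to target a prescribed $l$ one would wrap the call in a one-dimensional search over $\lambda$, since the risk decreases as $\lambda$ grows.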
The SMI $I_{p^*}(X, \Theta)$ is a decreasing function of $l$, since for larger $l$ the class $\mathcal{P}_l$ is larger, and hence the infimum over the class is smaller. For $l = 0$, the corresponding $I_{p^*}(X, \Theta)$ achieves its maximum value $H(\Theta)$, the entropy of $\Theta$. For $l$ larger than $\min(\alpha, 1 - \alpha)$, the corresponding SMI is zero and is achieved by any distribution which is independent of $\Theta$ (see Cover and Thomas, 1991).

Example 1.4.2. This is a discrete two-dimensional example. We use it to examine dependence in an MIL based on generalizing the last example. We will see that here, the dependence is not so high that the two random variables represent the same information.

The prior $w(\theta)$ is the same as in Example 1.4.1. Let the loss function be $L_2(x_1, x_2, \theta) = L(x_1, \theta) + L(x_2, \theta)$, where $L(\cdot,\cdot)$ is the same as in Example 1.4.1. By (ii) of Proposition 3.1, the MIL is permutation symmetric in $x_1$ and $x_2$, so $p^*_\lambda(0,1 \mid 0) = p^*_\lambda(1,0 \mid 0)$ and $p^*_\lambda(1,0 \mid 1) = p^*_\lambda(0,1 \mid 1)$. Now, constraint (1.2.1.4) is

$$l = 2\big( \alpha\, p^*_\lambda(0,1 \mid 0) + \alpha\, p^*_\lambda(1,1 \mid 0) + (1 - \alpha)\, p^*_\lambda(0,0 \mid 1) + (1 - \alpha)\, p^*_\lambda(0,1 \mid 1) \big). \tag{1.4.2}$$

To get the MIL, we first find the corresponding $m^*$. For this, first we note that $m^*(x_1, x_2)$ is also permutation symmetric in its two arguments. Let $\beta_1 = m^*(0,0)$, $\beta_2 = m^*(0,1)$ and $\beta_3 = m^*(1,1)$. In (1.2.1.5) we take $x_1 = x_2 = 0$, $x_1 = x_2 = 1$ and $x_1 = 1, x_2 = 0$ (or $x_1 = 0, x_2 = 1$) in turn to get the following three equations

$$\frac{\alpha}{\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3} + \frac{(1 - \alpha)\, e^{-2\lambda}}{e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3} = 1, \tag{1.4.3}$$

$$\frac{\alpha\, e^{-2\lambda}}{\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3} + \frac{1 - \alpha}{e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3} = 1 \tag{1.4.4}$$

and

$$\frac{\alpha\, e^{-\lambda}}{\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3} + \frac{(1 - \alpha)\, e^{-\lambda}}{e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3} = 1. \tag{1.4.5}$$

Multiplying both sides of each of these equations by $(\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3)(e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3)$ we get

$$(\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3)(e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3) = e^{-2\lambda}\beta_1 + 2\big[\alpha e^{-\lambda} + (1 - \alpha) e^{-3\lambda}\big]\beta_2 + \big[\alpha + (1 - \alpha) e^{-4\lambda}\big]\beta_3, \tag{1.4.6}$$

$$(\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3)(e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3) = \big[(1 - \alpha) + \alpha e^{-4\lambda}\big]\beta_1 + 2\big[\alpha e^{-3\lambda} + (1 - \alpha) e^{-\lambda}\big]\beta_2 + e^{-2\lambda}\beta_3 \tag{1.4.7}$$

and

$$(\beta_1 + 2e^{-\lambda}\beta_2 + e^{-2\lambda}\beta_3)(e^{-2\lambda}\beta_1 + 2e^{-\lambda}\beta_2 + \beta_3) = \big[(1 - \alpha) e^{-\lambda} + \alpha e^{-3\lambda}\big]\beta_1 + 2e^{-2\lambda}\beta_2 + \big[\alpha e^{-\lambda} + (1 - \alpha) e^{-3\lambda}\big]\beta_3. \tag{1.4.8}$$

Let

$$a_1 = 1 - \alpha - e^{-2\lambda} + \alpha e^{-4\lambda}, \quad a_2 = 2\big[(1 - 2\alpha) e^{-\lambda} + (2\alpha - 1) e^{-3\lambda}\big], \quad a_3 = \alpha - e^{-2\lambda} + (1 - \alpha) e^{-4\lambda},$$

$$b_1 = -(1 - \alpha) + (1 - \alpha) e^{-\lambda} + \alpha e^{-3\lambda} - \alpha e^{-4\lambda}, \quad b_2 = 2\big[-(1 - \alpha) e^{-\lambda} - \alpha e^{-3\lambda} + e^{-2\lambda}\big]$$

and

$$b_3 = -\alpha e^{-\lambda} + e^{-2\lambda} - (1 - \alpha) e^{-3\lambda}.$$

Subtracting (1.4.6) from (1.4.7) and (1.4.7) from (1.4.8), we get respectively

$$a_1\beta_1 + a_2\beta_2 = a_3\beta_3, \qquad b_1\beta_1 + b_2\beta_2 = b_3\beta_3.$$

Thus we get

$$\beta_1 = \frac{a_3 b_2 - a_2 b_3}{a_1 b_2 - a_2 b_1}\, \beta_3, \qquad \beta_2 = \frac{a_1 b_3 - a_3 b_1}{a_1 b_2 - a_2 b_1}\, \beta_3.$$

Also, by the relationship $\beta_1 + 2\beta_2 + \beta_3 = 1$, we get

$$\beta_3 = \frac{a_1 b_2 - a_2 b_1}{a_1 (b_2 + 2 b_3) - a_2 (b_1 + b_3) + a_3 (b_2 - 2 b_1)}.$$

Plugging the values of $\beta_1$, $\beta_2$ and $\beta_3$ into (1.4.2), we choose the value of $\lambda$ to satisfy the constraint for some values of $l$. Thus we can specify the $\beta_i$'s completely and by (1.2.1.3) get the MIL $p^*_\lambda(x_1, x_2 \mid \theta)$ as

$$p^*_\lambda(0,0 \mid 0) = \beta_1 / C(0), \quad p^*_\lambda(0,1 \mid 0) = p^*_\lambda(1,0 \mid 0) = \beta_2 e^{-\lambda} / C(0), \quad p^*_\lambda(1,1 \mid 0) = \beta_3 e^{-2\lambda} / C(0),$$

$$p^*_\lambda(0,0 \mid 1) = \beta_1 e^{-2\lambda} / C(1), \quad p^*_\lambda(0,1 \mid 1) = p^*_\lambda(1,0 \mid 1) = \beta_2 e^{-\lambda} / C(1), \quad p^*_\lambda(1,1 \mid 1) = \beta_3 / C(1),$$

where

$$C(0) = \beta_1 + 2\beta_2 e^{-\lambda} + \beta_3 e^{-2\lambda}, \qquad C(1) = \beta_1 e^{-2\lambda} + 2\beta_2 e^{-\lambda} + \beta_3.$$

To investigate the dependence in $p^*_\lambda(x_1, x_2 \mid \theta)$, we can calculate the Pearson correlation coefficient between $X_1$ and $X_2$. Let $p_{\lambda,1}(x_1 \mid \theta)$ and $p_{\lambda,2}(x_2 \mid \theta)$ be the marginals of $p^*_\lambda(x_1, x_2 \mid \theta)$.
We get
$$p^*_{\lambda,1}(0|0) = p^*_{\lambda,2}(0|0) = \frac{\beta_1 + \beta_2e^{-\lambda}}{C(0)}, \qquad p^*_{\lambda,1}(1|0) = p^*_{\lambda,2}(1|0) = \frac{\beta_2e^{-\lambda} + \beta_3e^{-2\lambda}}{C(0)},$$
$$\mathrm{Var}_{\theta=0}(X_1) = \mathrm{Var}_{\theta=0}(X_2) = p^*_{\lambda,1}(1|0)\big(1 - p^*_{\lambda,1}(1|0)\big) = \frac{(\beta_2e^{-\lambda} + \beta_3e^{-2\lambda})(\beta_1 + \beta_2e^{-\lambda})}{C^2(0)}$$
and
$$\mathrm{Cov}_{\theta=0}(X_1,X_2) = p^*_\lambda(1,1|0) - p^*_{\lambda,1}(1|0)\,p^*_{\lambda,2}(1|0).$$
So
$$\mathrm{Corr}_{\theta=0}(X_1,X_2) = \frac{e^{-2\lambda}(\beta_1\beta_3 - \beta_2^2)(\beta_1 + 2\beta_2e^{-\lambda} + \beta_3e^{-2\lambda})^2}{(\beta_2e^{-\lambda} + \beta_3e^{-2\lambda})^2(\beta_1 + \beta_2e^{-\lambda})^2}.$$
Similarly,
$$\mathrm{Corr}_{\theta=1}(X_1,X_2) = \frac{e^{-2\lambda}(\beta_1\beta_3 - \beta_2^2)(\beta_1e^{-2\lambda} + 2\beta_2e^{-\lambda} + \beta_3)^2}{(\beta_2e^{-\lambda} + \beta_3)^2(\beta_1e^{-2\lambda} + \beta_2e^{-\lambda})^2}.$$
We see that as $\lambda \to 0$,
$$\mathrm{Corr}_{\theta=0}(X_1,X_2) \text{ and } \mathrm{Corr}_{\theta=1}(X_1,X_2) \to \frac{(\beta_1\beta_3 - \beta_2^2)(\beta_1 + 2\beta_2 + \beta_3)^2}{(\beta_2 + \beta_3)^2(\beta_1 + \beta_2)^2},$$
and that as $\lambda \to \infty$,
$$\mathrm{Corr}_{\theta=0}(X_1,X_2) \text{ and } \mathrm{Corr}_{\theta=1}(X_1,X_2) \to \frac{\beta_1\beta_3 - \beta_2^2}{\beta_2^2}.$$

Example 1.4.3. For a continuous one-dimensional MIL example, choose $w(\cdot)$ to be a $N(\mu,\sigma^2)$ density and $L$ to be squared error loss. From the form of $p^*$, one expects that the minimally informative likelihood will be normal. Indeed, the maximum entropy distribution under a second moment constraint is a normal, which is similar to the $p^*$ here. This turns out to be the case subject to the restriction $l < \sigma^2$, i.e., the amount of Bayes risk that can be tolerated must be less than the variance of the source distribution. For $l > \sigma^2$ the rate distortion function is zero, see Cover and Thomas (1991, Chapter 13), so no unique solution exists. We see also that for this range of $l$, $l(\lambda) = 1/(2\lambda)$, so we get that $\lambda$ must be greater than $1/(2\sigma^2)$. It will be seen that $m^*(\cdot)$ is $N\big(\mu, \sigma^2 - \frac{1}{2\lambda}\big)$, that $p^*(\cdot|\theta)$ is
$$N\Big(\big(1 - \tfrac{1}{2\lambda\sigma^2}\big)\theta + \tfrac{1}{2\lambda\sigma^2}\mu,\ \tfrac{1}{2\lambda}\big(1 - \tfrac{1}{2\lambda\sigma^2}\big)\Big),$$
and that $l(\lambda) = \frac{1}{2\lambda}$. Clearly, if $\mu = 0$ then, in the limit as $\lambda\sigma^2$ goes to infinity, $\theta$ can be interpreted as the mean. More generally, any interpretation of $\theta$ will depend on the prior and on the loss $L$, which determines $p^*$.
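To identify $p^*(\cdot|\theta)$ and the relationship between the tolerable risk bound $l$ and the Lagrange multiplier $\lambda$ in this case, we use three steps.

Step 1. We identify the $m^*(\cdot)$ which satisfies the constraint. Note that $m^*(\cdot)$ must satisfy
$$\int \frac{w(\theta)\,e^{-\lambda(x-\theta)^2}}{\int m^*(y)\,e^{-\lambda(y-\theta)^2}dy}\,d\theta \le 1, \qquad (1.4.9)$$
with equality in (1.4.9) for those $x$ with $m^*(x) > 0$. With some foresight, set $m^*(y) = C\exp\{-(ay-b)^2\}$ for some real constants $a$ and $b$ such that the ratio of $w(\theta) = C\exp\{-\frac{(\theta-\mu)^2}{2\sigma^2}\}$ and $\int m^*(y)e^{-\lambda(y-\theta)^2}dy$ is a constant. Now, the exponent of $m^*(y)e^{-\lambda(y-\theta)^2}$ is
$$-[(ay-b)^2 + \lambda(y-\theta)^2] = -[(a^2+\lambda)y^2 - 2(ab+\lambda\theta)y + b^2 + \lambda\theta^2],$$
which is
$$-\Big[(a^2+\lambda)\Big(y - \frac{ab+\lambda\theta}{a^2+\lambda}\Big)^2 + b^2 + \lambda\theta^2 - \frac{(ab+\lambda\theta)^2}{a^2+\lambda}\Big].$$
Requiring that
$$b^2 + \lambda\theta^2 - \frac{(ab+\lambda\theta)^2}{a^2+\lambda} = \frac{(\theta-\mu)^2}{2\sigma^2}$$
holds for all $\theta$ gives
$$a^2 = \frac{\lambda}{2\lambda\sigma^2 - 1}, \qquad b = a\mu.$$
Thus we have
$$m^*(x) = \frac{a}{\sqrt{\pi}}\,e^{-(ax-b)^2},$$
which is recognized as a $N\big(\mu, \sigma^2 - \frac{1}{2\lambda}\big)$ density, and it satisfies (1.4.9).

Step 2. We identify the MIL $p^*(\cdot|\theta)$ in this case. Now, the expression for $p^*$ gives
$$p^*_\lambda(x|\theta) = \frac{e^{-(ax-b)^2 - \lambda(x-\theta)^2}}{\int e^{-(ay-b)^2 - \lambda(y-\theta)^2}dy} = \sqrt{\frac{a^2+\lambda}{\pi}}\,\exp\Big\{-(a^2+\lambda)\Big(x - \frac{ab+\lambda\theta}{a^2+\lambda}\Big)^2\Big\}.$$
After substituting for $a$ and $b$, the last expression is seen to be a
$$N\Big(\big(1 - \tfrac{1}{2\lambda\sigma^2}\big)\theta + \tfrac{1}{2\lambda\sigma^2}\mu,\ \tfrac{1}{2\lambda}\big(1 - \tfrac{1}{2\lambda\sigma^2}\big)\Big)$$
density. Note that $E_{p^*}(X|\theta)$ is not $\theta$; it is a weighted average of $\mu$ and $\theta$. For fixed $\theta$, $\mu$, $\lambda$, as $\sigma^2 \to \infty$, $p^*_\lambda(\cdot|\theta) \to N\big(\theta, \frac{1}{2\lambda}\big)$, and hence its variance increases to $\frac{1}{2\lambda} = l(\lambda)$. For fixed $\theta$, $\mu$, $\sigma^2$, as $\lambda \to \infty$ (or $l \to 0$), the family $\mathcal{P}_l$ shrinks to a single member, the degenerate distribution at $\theta$. We see that $p^*_\lambda(\cdot|\theta)$ converges to this degenerate distribution, which is consistent with the above reasoning. This provides a sense in which $\lambda$ is also a smoothing parameter, ensuring that a minimally informative density does not just concentrate at the data points.

Step 3. We investigate the relationship between $l$ and $\lambda$. From the constraint for the Bayes risk, we have
$$l(\lambda) = \int\!\!\int p^*_\lambda(x|\theta)\,w(\theta)\,L(x,\theta)\,dx\,d\theta = \frac{1}{2\lambda}\Big(1 - \frac{1}{2\lambda\sigma^2}\Big) + \Big(\frac{1}{2\lambda\sigma^2}\Big)^2\sigma^2 = \frac{1}{2\lambda}.$$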
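The three steps above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration under the formulas of this example; the function names `gaussian_mil`, `bayes_risk` and `marginal_variance` are hypothetical and not part of the thesis software. It returns the weight on $\theta$ and the variance of the normal MIL, and recomputes the Bayes risk and the variance of $m^*$ from them.

```python
def gaussian_mil(sigma2, lam):
    """Weight on theta and variance of the normal MIL of Example 1.4.3.

    The MIL is N(weight*theta + (1 - weight)*mu, variance); it is only
    defined for lam > 1/(2*sigma2), i.e. for l = 1/(2*lam) < sigma2.
    """
    if lam <= 1.0 / (2.0 * sigma2):
        raise ValueError("need lam > 1/(2*sigma2)")
    shrink = 1.0 / (2.0 * lam * sigma2)      # weight placed on the prior mean mu
    weight = 1.0 - shrink                     # weight placed on theta
    variance = (1.0 - shrink) / (2.0 * lam)   # conditional variance of X given theta
    return weight, variance


def bayes_risk(sigma2, lam):
    """E(X - theta)^2 = Var(X|theta) + E[(E(X|theta) - theta)^2] under the prior."""
    weight, variance = gaussian_mil(sigma2, lam)
    # The conditional bias is (1 - weight)*(mu - theta); its mean square is (1 - weight)^2 * sigma2.
    return variance + (1.0 - weight) ** 2 * sigma2


def marginal_variance(sigma2, lam):
    """Variance of X under m*: Var(X|theta) + weight^2 * Var(theta)."""
    weight, variance = gaussian_mil(sigma2, lam)
    return variance + weight ** 2 * sigma2
```

For instance, with $\sigma^2 = 2$ and $\lambda = 1.5$ the sketch gives a Bayes risk of $1/(2\lambda) = 1/3$ and a marginal variance of $\sigma^2 - 1/(2\lambda) = 5/3$, in agreement with Steps 1 and 3. Neither quantity depends on $\mu$, which is why $\mu$ is not an argument.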
Lastly, simple computation gives the corresponding posterior $w^*(\theta|x)$, which is $N(x, 1/(2\lambda)) = N(x, l)$. It is still in the same normal family as the prior, but with the prior mean and variance $(\mu, \sigma^2)$ updated to $(x, l)$; any other likelihood in the class $\mathcal{P}_l$ will update the prior more, as measured by the expected Kullback-Leibler divergence.

Example 1.4.4. This example is also for a one-dimensional parameter; however, we consider the general $n$-dimensional data case to understand more about the structure of the MILs. If no data summarization is possible one can, in principle, use the dependence model $p^*(\cdot|\theta)$ to be derived shortly to form a posterior. It is seen that this $p^*(\cdot|\theta)$ bears a superficial resemblance to the intuition behind shrinkage estimators. As in the last example we choose a normal prior, here a standard normal $w(\cdot)$, and comment that our calculations can be extended to an arbitrary $N(\mu,\sigma^2)$. Consider the loss function
$$\tilde{L}(x^n,\theta) = \Big(\frac{1}{n}\sum_{i=1}^n x_i - \theta\Big)^2;$$
arguably $L(x^n,\theta) = \frac{1}{n}\sum_{i=1}^n (x_i - \theta)^2$ is a more natural choice. However, it is difficult to obtain closed form results for $L(\cdot,\cdot)$. As before, even for $\tilde{L}(\cdot,\cdot)$, we can only obtain closed form expressions for selected values of $\lambda$. For ease of calculation, choose $\lambda = \lambda_n = (n+1)/2$ (this choice of $\lambda$ makes the computation simpler and results in a closed form for the MIL; other choices of $\lambda$ may not give a closed form expression for the MIL). We show that the marginal density for the data is an $n$-dimensional normal
$$m^*(\cdot) \sim N_n(\mathbf{0}, \Lambda_n^{-1}),$$
where $\mathbf{0}$ is the $n$-dimensional zero vector, and $\Lambda_n$ is the inverse of the variance-covariance matrix, given by
$$\Lambda_n = \frac{2\lambda}{n^2}\begin{pmatrix} n & -1 & \cdots & -1\\ -1 & n & \cdots & -1\\ \vdots & \vdots & \ddots & \vdots\\ -1 & -1 & \cdots & n \end{pmatrix}.$$
It is seen that $\Lambda_n$ is positive definite with determinant
$$|\Lambda_n| = \Big(\frac{2\lambda}{n^2}\Big)^n (n+1)^{n-1}.$$
The corresponding MIL is an $n$-dimensional independent multivariate normal
$$p^*(\cdot|\theta) \sim N_n\Big(\frac{n}{n+1}\theta\,\mathbf{1}_n,\ \frac{n^2}{(n+1)^2} I_n\Big),$$
where $\mathbf{1}_n$ is an $n$-vector of 1's and $I_n$ is the $n$-dimensional identity matrix.
Here symmetry is also achieved, as predicted by Proposition 3.1. The bound on the Bayes risk in the constraint is
$$l = \frac{n}{(n+1)^2} + \frac{1}{(n+1)^2} = \frac{1}{n+1}.$$
In this case, the marginal density is a dependence model and the MIL is an independence model. We regard this case as unusual. The corresponding posterior $w^*(\theta|x^n)$ is $N\big(\bar{x}, \frac{1}{n+1}\big)$; it does not follow the result of Theorem 3.1.1, since the loss function is not of the form required there. From this example, we see that the form of the loss function affects the MIL a lot.

To verify the forms of $m^*(\cdot)$ and $p^*(\cdot|\theta)$, first note that the support of $m^*(\cdot)$ is the entire $n$-dimensional Euclidean space, so we must check that the equality in (1.2.1.5) holds for all vectors $(x_1,\ldots,x_n)$. Using the conjectured form of $m^*(\cdot)$ we have
$$\int m^*(y^n)e^{-\lambda\tilde{L}(y^n,\theta)}dy^n = \frac{(\sqrt{2\lambda})^n(n+1)^{\frac{n-1}{2}}}{n^n(\sqrt{2\pi})^n}\int\!\cdots\!\int \exp\Big\{-\frac{\lambda}{n^2}\Big[(n+1)\sum_{i=1}^n y_i^2 - \Big(\sum_{i=1}^n y_i\Big)^2\Big] - \lambda\Big(\frac{1}{n}\sum_{i=1}^n y_i - \theta\Big)^2\Big\}dy_1\cdots dy_n$$
$$= \frac{(\sqrt{2\lambda})^n(n+1)^{\frac{n-1}{2}}}{n^n(\sqrt{2\pi})^n}\,e^{-\frac{\lambda\theta^2}{n+1}}\int\!\cdots\!\int \exp\Big\{-\frac{\lambda(n+1)}{n^2}\sum_{i=1}^n\Big(y_i - \frac{n}{n+1}\theta\Big)^2\Big\}dy_1\cdots dy_n = \frac{1}{\sqrt{n+1}}\,e^{-\frac{\lambda\theta^2}{n+1}}.$$
Choosing $\lambda = (n+1)/2$, we get
$$\int \frac{w(\theta)\,e^{-\lambda(\frac{1}{n}\sum_{i=1}^n x_i - \theta)^2}}{\int m^*(y^n)e^{-\lambda\tilde{L}(y^n,\theta)}dy^n}\,d\theta = \frac{\sqrt{n+1}}{\sqrt{2\pi}}\int e^{-\frac{n+1}{2}(\bar{x}-\theta)^2}d\theta = 1,$$
because the prior density $w(\theta)$ cancels out. Thus (1.2.1.5) is satisfied. Now, by (1.2.1.3),
$$p^*(x^n|\theta) = \frac{m^*(x^n)\exp\{-\lambda(\frac{1}{n}\sum_{i=1}^n x_i - \theta)^2\}}{\int\!\cdots\!\int m^*(y_1,\ldots,y_n)\exp\{-\lambda(\frac{1}{n}\sum_{i=1}^n y_i - \theta)^2\}dy_1\cdots dy_n} = \frac{(n+1)^n}{n^n(\sqrt{2\pi})^n}\exp\Big\{-\frac{(n+1)^2}{2n^2}\sum_{i=1}^n\Big(x_i - \frac{n}{n+1}\theta\Big)^2\Big\},$$
which is the claimed multivariate normal distribution. Finally, we derive an expression for the bound on the Bayes risk. It is
$$l_n\big((n+1)/2\big) = \int\!\!\int p^*(x^n|\theta)\,\tilde{L}(x^n,\theta)\,w(\theta)\,dx^n\,d\theta = \int\Big[\frac{n}{(n+1)^2} + \frac{\theta^2}{(n+1)^2}\Big]\frac{1}{\sqrt{2\pi}}e^{-\theta^2/2}d\theta = \frac{n}{(n+1)^2} + \frac{1}{(n+1)^2},$$
where the two terms in the bracket are the conditional variance of $\bar{x}$ given $\theta$ and the squared difference between its conditional mean $\frac{n}{n+1}\theta$ and $\theta$.

Examples of MID. We discussed the minimally informative distributions (MID) in Section 1.2.2. They can be considered as MILs from a different point of view, in which we restrict the class of distributions to be optimized over to be some specified two-parameter distribution (the parameters can be vectors). One is a parameter of interest; the other, while not a nuisance parameter, serves only as an index for optimization. The optimization gives the member of the parametric family closest to the prior in the Kullback-Leibler distance. In the examples following, we will see that in most cases MIDs can be solved in closed form, and the computation is usually easier than that of the MILs. The parameter value which achieves the minimum in the SMI can be viewed as the minimally informative estimate of the additional parameter; it is the most conservative initial guess of the additional parameter value.

Example 1.4.5. $w(\cdot) \sim N(0,1)$, $p(\cdot|\theta,\eta) \sim N(\theta,\eta^2)$. In this example, we can get a closed form solution. We have
$$m(x^n|\eta) = \frac{1}{(\sqrt{2\pi}\,\eta)^n}\frac{1}{\sqrt{2\pi}}\int \exp\Big\{-\frac{1}{2\eta^2}\sum_{i=1}^n(x_i-\theta)^2 - \frac{\theta^2}{2}\Big\}d\theta = \frac{1}{(\sqrt{2\pi}\,\eta)^n}\frac{\eta}{\sqrt{\eta^2+n}}\exp\Big\{-\frac{1}{2\eta^2(\eta^2+n)}\Big[(\eta^2+n)\sum_{i=1}^n x_i^2 - \Big(\sum_{i=1}^n x_i\Big)^2\Big]\Big\},$$
so
$$\log\frac{\prod_{i=1}^n p(x_i|\theta,\eta)}{m(x^n|\eta)} = \frac{1}{2}\log\frac{\eta^2+n}{\eta^2} + \frac{1}{2\eta^2(\eta^2+n)}\Big[(\eta^2+n)\sum_{i=1}^n x_i^2 - \Big(\sum_{i=1}^n x_i\Big)^2\Big] - \frac{1}{2\eta^2}\sum_{i=1}^n(x_i-\theta)^2,$$
and thus, taking expectations with respect to $\prod_i p(x_i|\theta,\eta)w(\theta)$ and using $E\big[(\eta^2+n)\sum X_i^2 - (\sum X_i)^2\big] = n\eta^2(\eta^2+n)$ and $E\sum(X_i-\Theta)^2 = n\eta^2$, so that the last two terms cancel,
$$I(\eta) = \frac{1}{2}\log\Big(1 + \frac{n}{\eta^2}\Big).$$
It is decreasing in $\eta^2$, so the minimum is achieved at $\eta^* = +\infty$, and $I(\eta^*) = 0$. The corresponding MID degenerates to a uniform distribution on $(-\infty,\infty)$. If we add the constraint (1.2.1.1) with $L(x^n,\theta) = \sum_{i=1}^n(x_i-\theta)^2$, we have
$$l \ge \int\!\!\int \prod_{i=1}^n p(x_i|\theta,\eta)\sum_{i=1}^n(x_i-\theta)^2\,w(\theta)\,dx^n\,d\theta = n\eta^2,$$
and the corresponding $\eta^{*2} = l/n$, with $I(\eta^*) = \frac{1}{2}\log\big(1 + \frac{n^2}{l}\big)$. The corresponding posterior $w(\theta|x^n,\eta^{*2})$ is
$$N\Big(\frac{n\bar{x}}{n+\eta^{*2}}, \frac{\eta^{*2}}{n+\eta^{*2}}\Big) = N\Big(\frac{n^2\bar{x}}{n^2+l}, \frac{l}{n^2+l}\Big).$$
Note that here $\eta^{*2}$ corresponds to $1/\lambda$ for the $\lambda$ in the MIL. As $n$ tends to infinity, $\eta^{*2}$ tends to zero. (This corresponds to $\lambda$ tending to infinity for the MIL.) The corresponding $p(\cdot|\theta,\eta^{*2})$ converges to the degenerate distribution at $\theta$, and the corresponding posterior $w(\theta|x^n,\eta^{*2})$ converges to the degenerate distribution at $\bar{x}$; this result is a parallel to (iii) of Theorem 3.3.1.

The choice of the loss function is problem dependent. In some cases, the average squared error loss may be more reasonable. If we take the loss to be $\frac{1}{n}\sum_{i=1}^n(x_i-\theta)^2$ in the constraint (1.2.1.1), then $\eta^{*2} = l$, $I(\eta^*) = \frac{1}{2}\log\big(1 + \frac{n}{l}\big)$, and the corresponding MID is $N(\theta,l)$. This has the least concentration around $\theta$ and hence is the least informative for $\theta$. The corresponding posterior $w(\theta|x^n,\eta^{*2})$ is $N\big(\frac{n\bar{x}}{n+l}, \frac{l}{n+l}\big)$. This is the posterior which has the least mean Kullback-Leibler divergence from the prior $N(0,1)$ as the likelihoods vary in the class $\mathcal{P}$. In fact, the posterior $w(\theta|x^n,\eta^2)$ updated by any other likelihood in the class has the form $N\big(\frac{n\bar{x}}{n+\eta^2}, \frac{\eta^2}{n+\eta^2}\big)$, with $\eta^2 < l$. These have bigger Kullback-Leibler divergence from $N(0,1)$ than that of $w(\theta|x^n,\eta^{*2})$, since they are more concentrated around, roughly, $\bar{x}$. Also, we see that as $l$ increases to infinity, the constraint vanishes. In this case, the MID tends to the uniform distribution on $(-\infty,\infty)$, as in the case of no constraint, and the corresponding $I(\eta^*)$ tends to zero. The corresponding posterior tends to $N(0,1)$, which is the same as the prior, i.e.
the MID did not in fact update the prior in forming the posterior.

Example 1.4.6. Again, let $w(\cdot) \sim N(0,1)$ and suppose $p(x|\theta,\eta)$ is the logistic density
$$p(x|\theta,\eta) = \frac{\exp\{-(x-\theta)/\eta\}}{\eta\,(1 + \exp\{-(x-\theta)/\eta\})^2}, \qquad -\infty < x < \infty.$$
In this example it is hard to get a closed form solution for $\eta^*$, so we use grid search. That is, for each fixed $\eta$ and $n$, we use Monte Carlo simulation to calculate $I(\eta)$, then find the $\eta^*$ corresponding to the minimal $I(\eta^*)$. Specifically, note that the SMI can be written as
$$I(\eta) = E_{\Theta,X^n}\Big[-2\sum_{i=1}^n\log\big(1 + \exp\{-(X_i-\Theta)/\eta\}\big)\Big] - E_{\Theta,X^n}\big[\log m(X^n)\big] + \frac{1}{2}\log(2\pi), \qquad (1.4.13)$$
where $m(x^n)$ is computed without the normalizing constants $\eta^{-n}$ of the likelihood, which cancels in the ratio of $p$ to $m$, and $(2\pi)^{-1/2}$ of the prior, which gives the last term. We use $10^6$ iterations for the Monte Carlo simulation. In each iteration, we generate $\theta$ from $N(0,1)$, then generate $x_1,\ldots,x_n$ iid from the logistic density $p(x|\theta,\eta)$ corresponding to this $\theta$. We use the inverse distribution function method: generate a random sample $u_1,\ldots,u_n$ from a uniform$(0,1)$ distribution, then get the logistic sample by $x_i = F^{-1}(u_i|\theta,\eta) = -\eta\log(1/u_i - 1) + \theta$, where $F^{-1}(u|\theta,\eta)$ is the inverse cdf for the logistic density $p(x|\theta,\eta)$. We calculated $I(\eta)$ for $\eta = 1,2,\ldots,10$ and found a roughly decreasing pattern for the corresponding values of $I(\eta)$.

Example 1.4.7. In this example, we want to investigate the dependence in the MID. Consider $w(\cdot) \sim N(0,1)$ and
$$p(x_1,x_2|\theta,\eta) \sim N_2\Big(\begin{pmatrix}\theta\\ \theta\end{pmatrix}, \begin{pmatrix}1 & \eta\\ \eta & 1\end{pmatrix}\Big);$$
here the additional parameter $\eta$ is the correlation coefficient between the two variables in the distribution. For a sample of size $n$, $(\mathbf{x}_1,\mathbf{x}_2)$, where $\mathbf{x}_1 = (x_{1,1},\ldots,x_{1,n})$ and $\mathbf{x}_2 = (x_{2,1},\ldots,x_{2,n})$, the joint density is
$$p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta) = \frac{1}{(2\pi)^n}\frac{1}{(\sqrt{1-\eta^2})^n}\exp\Big\{-\frac{1}{2(1-\eta^2)}\Big(\sum_{i=1}^n(x_{1i}-\theta)^2 - 2\eta\sum_{i=1}^n(x_{1i}-\theta)(x_{2i}-\theta) + \sum_{i=1}^n(x_{2i}-\theta)^2\Big)\Big\}.$$
The marginal density is
$$m(\mathbf{x}_1,\mathbf{x}_2|\eta) = \frac{1}{(2\pi)^n}\frac{1}{(\sqrt{1-\eta^2})^n}\frac{1}{\sqrt{2\pi}}\int\exp\Big\{-\frac{1}{2(1-\eta^2)}\Big(\sum_{i=1}^n(x_{1i}-\theta)^2 - 2\eta\sum_{i=1}^n(x_{1i}-\theta)(x_{2i}-\theta) + \sum_{i=1}^n(x_{2i}-\theta)^2\Big) - \frac{\theta^2}{2}\Big\}d\theta$$
$$= \frac{1}{(2\pi)^n}\frac{1}{(\sqrt{1-\eta^2})^n}\sqrt{\frac{1+\eta}{2n+1+\eta}}\exp\Big\{-\frac{1}{2(1-\eta^2)}\Big(\sum_{i=1}^n x_{1i}^2 - 2\eta\sum_{i=1}^n x_{1i}x_{2i} + \sum_{i=1}^n x_{2i}^2 - \frac{n^2(1-\eta)(\bar{x}_1+\bar{x}_2)^2}{2n+1+\eta}\Big)\Big\},$$
so
$$\log\frac{p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)}{m(\mathbf{x}_1,\mathbf{x}_2|\eta)} = \frac{1}{2}\log\frac{2n+1+\eta}{1+\eta} + \frac{1}{2(1-\eta^2)}\Big(\sum_{i=1}^n x_{1i}^2 - 2\eta\sum_{i=1}^n x_{1i}x_{2i} + \sum_{i=1}^n x_{2i}^2 - \frac{n^2(1-\eta)(\bar{x}_1+\bar{x}_2)^2}{2n+1+\eta}\Big) - \frac{1}{2(1-\eta^2)}\Big(\sum_{i=1}^n(x_{1i}-\theta)^2 - 2\eta\sum_{i=1}^n(x_{1i}-\theta)(x_{2i}-\theta) + \sum_{i=1}^n(x_{2i}-\theta)^2\Big)$$
and
$$I(\eta) = \int\!\!\int\!\!\int \log\frac{p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)}{m(\mathbf{x}_1,\mathbf{x}_2|\eta)}\,p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)\,w(\theta)\,d\mathbf{x}_1d\mathbf{x}_2d\theta$$
$$= \frac{1}{2}\log\frac{2n+1+\eta}{1+\eta} + \frac{n(2 - \eta(\eta+1))}{1-\eta^2} - n - \frac{1}{2(1+\eta)(2n+1+\eta)}\int\!\!\int\!\!\int\Big(\sum_{i=1}^n x_{1i} + \sum_{i=1}^n x_{2i}\Big)^2 p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)\,w(\theta)\,d\mathbf{x}_1d\mathbf{x}_2d\theta$$
$$= \frac{1}{2}\log\frac{2n+1+\eta}{1+\eta} + \frac{n(2+\eta)}{1+\eta} - n - \frac{1}{2(1+\eta)(2n+1+\eta)}\big(2n + n(n-1) + 2n(1+\eta) + 2n(n-1) + 2n + n(n-1)\big)$$
$$= \frac{1}{2}\log\Big(1 + \frac{2n}{1+\eta}\Big),$$
which is minimized by $\eta^* = 1$, with $I(\eta^*) = \frac{1}{2}\log(1+n)$. We see that the "optimal" correlation coefficient corresponding to the MID is just the highest dependence. In the calculations above, we have used the facts that, for $k = 1,2$ and $i \ne j$,
$$\int\!\!\int\!\!\int x_{k,i}^2\,p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)\,w(\theta)\,d\mathbf{x}_1d\mathbf{x}_2d\theta = E_\Theta\big(E(X_{k,i}^2)\big) = E_\Theta\big(\mathrm{Var}(X_{k,i}) + E^2(X_{k,i})\big) = E_\Theta(1+\Theta^2) = 2,$$
$$\int\!\!\int\!\!\int x_{k,i}x_{k,j}\,p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)\,w(\theta)\,d\mathbf{x}_1d\mathbf{x}_2d\theta = E_\Theta\big(E(X_{k,i})E(X_{k,j})\big) = E_\Theta(\Theta^2) = 1,$$
and
$$\int\!\!\int\!\!\int x_{1,i}x_{2,i}\,p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)\,w(\theta)\,d\mathbf{x}_1d\mathbf{x}_2d\theta = E_\Theta\big(\mathrm{Cov}(X_{1,i},X_{2,i}) + \Theta^2\big) = E_\Theta(\eta + \Theta^2) = \eta + 1.$$
The corresponding MID degenerates to a uni-dimensional $N(\theta,1)$. This likelihood updates the prior the least, since it basically produces one data point, and less data updates the prior less.
Recall that a large sample size will dominate the posterior and overwhelm the prior distribution. The corresponding posterior is
$$w(\cdot|\mathbf{x}_1,\mathbf{x}_2,\eta^*) \sim N\Big(\frac{n(\bar{x}_1+\bar{x}_2)}{2n+1+\eta^*}, \frac{1+\eta^*}{2n+1+\eta^*}\Big) = N\Big(\frac{n(\bar{x}_1+\bar{x}_2)}{2(n+1)}, \frac{1}{n+1}\Big).$$
If we add the constraint (1.2.1.1) with $L(x^n,\theta) = \frac{1}{n}\sum_{i=1}^n(x_{1i} + x_{2i} - 2\theta)^2$, we have (take $0 < l < 4$)
$$l \ge \int\!\!\int\!\!\int \frac{1}{n}\sum_{i=1}^n(x_{1i} + x_{2i} - 2\theta)^2\,p(\mathbf{x}_1,\mathbf{x}_2|\theta,\eta)\,w(\theta)\,d\mathbf{x}_1d\mathbf{x}_2d\theta = 2(1+\eta).$$
Now, $\eta^* = \frac{l}{2} - 1$, $I(\eta^*) = \frac{1}{2}\log\big(1 + \frac{4n}{l}\big)$, and the corresponding MID is still a bivariate normal with mean $\theta$, variance 1 and covariance $\frac{l}{2} - 1$. This is the distribution in the class $\mathcal{P}$ which has the highest dependence between its two variables. In this way the effect of two data values will reduce to some extent to that of a single data value and thus, for the same reason as in the unconstrained case, updates the prior the least. The corresponding posterior is
$$N\Big(\frac{n(\bar{x}_1+\bar{x}_2)}{2n+l/2}, \frac{l/2}{2n+l/2}\Big).$$
Since any other posterior in the class is $N\big(\frac{n(\bar{x}_1+\bar{x}_2)}{2n+1+\eta}, \frac{1+\eta}{2n+1+\eta}\big)$ with $\eta < \eta^*$, it has a bigger Kullback-Leibler distance from $N(0,1)$ than the posterior from the MID does.

Example 1.4.8. Let $\mathcal{P} = \{t_\nu(\theta,\eta) : \theta \in R^1, \eta \in R^+, \nu > 2\}$, where $t_\nu(\theta,\eta)$ is the $t$ distribution with $\nu$ degrees of freedom, location parameter $\theta$ and scale parameter $\eta$, that is, $(X-\theta)/\eta \sim t_\nu$. In this example, the parameters to be optimized are the degrees of freedom of the $t$-distribution and the dispersion, so we are seeking the $t$-distribution which, under the bounded Bayes risk constraint, updates the $N(0,1)$ prior the least. Since the prior is normal, intuitively we expect that the MID is a normal distribution. Assume $p(x^n|\theta) = \prod_{i=1}^n p(x_i|\theta)$; the constraint (1.2.1.1) is
$$\int\!\!\int p(x^n|\theta)\,L_n(x^n,\theta)\,w(\theta)\,dx^n\,d\theta = \frac{n\eta^2\nu}{\nu-2} \le l.$$
The Fisher information for $\theta$ for any member of $\mathcal{P}$ is
$$I(\theta|\eta,\nu) = E\Big(\frac{\partial\log p(X|\theta,\eta,\nu)}{\partial\theta}\Big)^2 = \frac{\nu+1}{(\nu+3)\,\eta^2},$$
which is independent of $\theta$, so by the same reasoning as in the previous examples, minimizing the SMI subject to constraint (1.2.1.1) is equivalent to minimizing $I(\theta|\eta,\nu)$ subject to $n\eta^2\nu/(\nu-2) \le l$. Since $I(\theta|\eta,\nu)$ is decreasing in $\eta^2$, this leads to $\eta^{*2} = l(\nu-2)/(n\nu)$. Plugging this value into $I(\theta|\eta,\nu)$, we are now to minimize
$$g(\nu) = \frac{n\,\nu(\nu+1)}{l\,(\nu-2)(\nu+3)},$$
which is positive for all finite $\nu$ and decreases to $n/l$ as $\nu \to \infty$. So $g(\nu)$ is minimized as $\nu \to \infty$, or the minimizer for the SMI is $(\eta^*,\nu^*) = (\sqrt{l/n}, \infty)$; this is in conformity with our intuition.

In the last two examples, we demonstrate how to use formula (1.2.2.1) efficiently to calculate the MID approximately for large sample size $n$.

Example 1.4.9. Let $Y = (X-\theta)/\eta$ have a logistic distribution with density function $f(y) = e^{-y}/(1+e^{-y})^2$, $-\infty < y < \infty$. Assume $p(x^n|\theta) = \prod_{i=1}^n p(x_i|\theta)$; then constraint (1.2.1.1) is
$$\int\!\!\int p(x^n|\theta,\eta)\,L_n(x^n,\theta)\,w(\theta)\,dx^n\,d\theta = Cn\eta^2 \le l,$$
for some constant $C$ which is independent of $n$, $\eta$ and $\theta$. The Fisher information is
$$I(\theta|\eta) = E\Big(\frac{\partial\log p(X|\theta,\eta)}{\partial\theta}\Big)^2 = \frac{C}{\eta^2},$$
where $C$ is a generic constant. Again for large $n$, by (1.2.2.1), minimizing the SMI over $\eta > 0$ subject to (1.2.1.1) is equivalent to minimizing $I(\theta|\eta)$ over $\eta > 0$ subject to $Cn\eta^2 \le l$. This leads to the unique solution $\eta^{*2} = l/(nC)$ asymptotically.

Example 1.4.10. Let $\mathcal{P} = \{N(\theta,\eta^2) : \theta \in R^1, \eta \in R^+\}$. Constraint (1.2.1.1) is
$$\int\!\!\int p(x^n|\theta)\,L_n(x^n,\theta)\,w(\theta)\,dx^n\,d\theta = n\eta^2,$$
so the corresponding Bayes risk bound $l$ should be no smaller than $n\eta^2$. Minimizing the SMI over $\eta$ for large $n$, using (1.2.2.1), is equivalent to minimizing $I(\theta|\eta) = 1/\eta^2$ subject to $n\eta^2 \le l$. The minimizer is asymptotically $\eta^{*2} = l/n$.
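The sampling step of the Monte Carlo scheme in Example 1.4.6 can be sketched as follows. This is a minimal illustration, not the program used for the thesis: the function names are hypothetical, and only the inverse-cdf formula $x = \theta - \eta\log(1/u - 1)$ and the $N(0,1)$ prior are taken from the example.

```python
import math
import random


def logistic_cdf(x, theta, eta):
    """Distribution function of the logistic density with location theta and scale eta."""
    return 1.0 / (1.0 + math.exp(-(x - theta) / eta))


def logistic_inverse_cdf(u, theta, eta):
    """Inverse cdf used in Example 1.4.6: x = theta - eta * log(1/u - 1), for 0 < u < 1."""
    return theta - eta * math.log(1.0 / u - 1.0)


def draw_theta_and_sample(n, eta, rng):
    """One Monte Carlo iteration: theta ~ N(0,1), then x_1..x_n iid logistic(theta, eta)."""
    theta = rng.gauss(0.0, 1.0)
    xs = []
    while len(xs) < n:
        u = rng.random()              # uniform on [0, 1)
        if u > 0.0:                   # skip u == 0, where the inverse cdf is undefined
            xs.append(logistic_inverse_cdf(u, theta, eta))
    return theta, xs
```

A call such as `draw_theta_and_sample(5, 2.0, random.Random(0))` returns one parameter value and five observations; averaging the integrand of (1.4.13) over many such draws would give the Monte Carlo estimate of $I(\eta)$ for that $\eta$.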
1.5 Dependence in the MIL

We see from the previous sections that the $n$-dimensional MILs are usually dependence models. It is natural to investigate the amount of dependence among the variables in the MIL. It is well known that for large iid data sets $x^n$, the posterior will be dominated by the data. Since the MIL is the likelihood which updates the prior the least, it is natural that the $n$-dimensional MIL will have high dependence among its variables, to make the large data set behave like a small data set. This is also suggested by Theorem 3.1.1 below. If this high dependence seems undesirable, we may model the multi-dimensional data by a product of uni-dimensional MILs. This may be appropriate in the data compression context (see Section 2.1.3).

To assess the dependence, we use a transformation of the Kullback-Leibler distance into the $[0,1]$ scale proposed by Joe (1989). We calculate
$$\delta^* = [1 - \exp(-2\delta)]^{1/2},$$
where
$$\delta = \int\!\cdots\!\int f(x_1,\ldots,x_n)\log\frac{f(x_1,\ldots,x_n)}{f_1(x_1)\cdots f_n(x_n)}\,dx_1\cdots dx_n$$
is the relative entropy between a joint density $f(x_1,\ldots,x_n)$ and the product of its marginals. Since the joint MILs we use are indexed by a parameter $\theta$, we actually have a function $\delta^*_{MIL}(\theta)$. Integrating out $\theta$ to obtain an averaged measure $\delta_{MIL}$ of dependence amongst the variables in the joint MIL distribution gives
$$\delta_{MIL} = \int \delta^*_{MIL}(\theta)\,w(\theta)\,d\theta.$$
Indeed, sampling procedures often permit the assumption of independence, perhaps for a set of summary statistics. When this is possible, it simplifies computation. To get an independence MIL, we should add the constraint that the density for the data given the parameter factors, i.e.
$$p(x_1,\ldots,x_n|\theta) = \prod_{i=1}^n p(x_i|\theta). \qquad (1.5.1)$$
Currently, we do not know how to perform the minimization of $I(X^n;\Theta)$ to get the desired independent $p^*(x^n|\theta)$ under constraint (1.5.1).
Instead, as a simplification to understand the problem, what we may do is to select the loss functions and priors so that the Blahut-Arimoto algorithm will give MILs in the form of univariate products. In Chapter 4, we will form two-dimensional independent MILs by a product of two unidimensional MILs for the data analysis. This may be somewhat artificial, since it is not the result of the optimization procedure for the two-dimensional likelihood problem. Nevertheless, it provides a model which is not implausible and can be compared to other models we identify.

A compromise between the two methods above is to choose the independent likelihood which is closest, in the Kullback-Leibler distance, to the MIL $p^*(x^n|\theta)$, which is assumed not to be an independent model. That is, let $\mathcal{P}_0$ be the class of likelihoods which are independent among all their variables. We choose $\tilde{p}$ as our "independent minimally informative likelihood", i.e.
$$\tilde{p} = \arg\min_{p \in \mathcal{P}_0} D\big(p^*(\cdot,\ldots,\cdot|\theta)\,\|\,p(\cdot|\theta)\cdots p(\cdot|\theta)\big),$$
where
$$D\big(p^*(\cdot,\ldots,\cdot|\theta)\,\|\,p(\cdot|\theta)\cdots p(\cdot|\theta)\big) = \int p^*(x^n|\theta)\log\frac{p^*(x^n|\theta)}{\prod_{i=1}^n p(x_i|\theta)}\,dx^n.$$
We have

Proposition 1.5.1
$$\tilde{p}(x^n|\theta) = \prod_{i=1}^n p^*_1(x_i|\theta),$$
where $p^*_1(\cdot|\theta)$ is the one-dimensional marginal of $p^*(x^n|\theta)$.

Proof: We see that
$$D\big(p^*(\cdot,\ldots,\cdot|\theta)\,\|\,p(\cdot|\theta)\cdots p(\cdot|\theta)\big) = \int p^*(x^n|\theta)\log\frac{p^*(x^n|\theta)}{\prod_{i=1}^n p^*_1(x_i|\theta)}\,dx^n + \int p^*(x^n|\theta)\log\frac{\prod_{i=1}^n p^*_1(x_i|\theta)}{\prod_{i=1}^n p(x_i|\theta)}\,dx^n,$$
and that the first term above does not involve $p(\cdot|\theta)$. The second term is a sum of relative entropies between $p^*_1(\cdot|\theta)$ and $p(\cdot|\theta)$, so it is non-negative and is minimized by setting $p(\cdot|\theta) = \tilde{p}(\cdot|\theta) = p^*_1(\cdot|\theta)$. So the "independent minimally informative likelihood" we seek is
$$\tilde{p}(x^n|\theta) = \prod_{i=1}^n p^*_1(x_i|\theta). \qquad \Box$$

1.6 Computational Aspects

In cases where no closed form is available for the MILs, we use the Blahut-Arimoto iterative procedure, as in Section 1.3, to evaluate the MILs numerically. Our current C program for the MIL is effective for the case of one-dimensional data and a one-dimensional parameter, as a demonstration.
The structure of the Blahut-Arimoto iterative procedure, as in (1.3.3) and (1.3.4), makes it difficult to use the current C routines for integration. Instead, we used summation over 100 to 500 grid points to approximate integration. The convergence of the procedure depends on the choices of priors and loss functions. Roughly, in the one-dimensional case, it needs 10 to $10^2$ iterations to reach a uniform absolute accuracy of the order $10^{-4}$ to $10^{-6}$. The computational limit for MILs with multi-dimensional data or multi-dimensional parameter(s) is routine and machine dependent.

In the MIL we need to choose the $\lambda$ so that equality in the constraint (1.2.1.1) is satisfied for the given Bayes risk bound $l$. For this, we can use (iv), (v) and (vi) of Theorem 3.3.2 as a guide to search for the value of $\lambda$ corresponding to a given $l$; they state roughly that $l$ is a decreasing function of $\lambda$. This suggests the bisection rule for choosing the $\lambda$ that corresponds to a given $l$. Indeed, Blahut (1972a) shows that the minimum in the rate distortion function is achieved for this $\lambda$.

It is expected that as the dimension increases, the amount of computation will increase exponentially, as the amount of computation involved in the integration does. The number of iterations may increase linearly, as the number of comparisons for accuracy does.

We also wrote a C program for the corresponding posteriors for Models I and II, as in Section 4.2.2. They are for 4-dimensional data and require more CPU time than the MILs, since one must produce the MIL first, then get the corresponding posterior. It can be extended to higher dimensional cases, which will cause similar increases in the amount of computation.

To assess convergence of $p_{k,\lambda}(x|\theta)$ to the limit $p^*_\lambda(x|\theta)$ we used the supremum norm. The computation terminates when $\sup_x |p_{k,\lambda}(x|\theta) - p_{k-1,\lambda}(x|\theta)| < \epsilon$ for a given value of $\theta$ and prespecified $\epsilon > 0$.
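For a discrete parameter and discrete data the iteration and its stopping rule can be written in a few lines. The sketch below is in Python rather than C and is only an illustration of the scheme described above, not the thesis program: the names `blahut_arimoto` and `bayes_risk` are hypothetical, the prior is a list of masses, the loss is a table `loss[t][x]`, and the supremum in the stopping rule is taken over all $(x,\theta)$ pairs rather than for one fixed $\theta$.

```python
import math


def blahut_arimoto(prior, loss, lam, tol=1e-12, max_iter=10000):
    """Iterate p_k(x|t) = m_k(x) e^{-lam*L(x,t)} / sum_y m_k(y) e^{-lam*L(y,t)},
    m_{k+1}(x) = sum_t w(t) p_k(x|t), until the sup-norm change in p is below tol.

    prior[t] is w(theta_t); loss[t][x] is L(x, theta_t). Returns (p, m).
    """
    n_theta, n_x = len(loss), len(loss[0])
    kernel = [[math.exp(-lam * loss[t][x]) for x in range(n_x)] for t in range(n_theta)]
    m = [1.0 / n_x] * n_x                      # any strictly positive starting marginal
    p_old = None
    for _ in range(max_iter):
        p = []
        for t in range(n_theta):
            row = [m[x] * kernel[t][x] for x in range(n_x)]
            norm = sum(row)
            p.append([v / norm for v in row])
        m = [sum(prior[t] * p[t][x] for t in range(n_theta)) for x in range(n_x)]
        if p_old is not None:
            gap = max(abs(p[t][x] - p_old[t][x])
                      for t in range(n_theta) for x in range(n_x))
            if gap < tol:
                break
        p_old = p
    return p, m


def bayes_risk(prior, loss, p):
    """sum_t sum_x w(t) p(x|t) L(x,t), the left side of the risk constraint."""
    return sum(prior[t] * p[t][x] * loss[t][x]
               for t in range(len(loss)) for x in range(len(loss[0])))
```

As a usage example, take the binary source of Example 1.4.1 with $\alpha = 0.3$ and probability-of-error loss, i.e. `prior = [0.3, 0.7]` and `loss = [[0, 1], [1, 0]]`. With $\lambda = \ln 9$, which corresponds to $l = 0.1$ through $\lambda = \ln\frac{1-l}{l}$, the iteration should return a marginal close to $m^*(0) = (\alpha - l)/(1-2l) = 0.25$ and a Bayes risk close to $0.1$. A bisection search over $\lambda$, as suggested above, would wrap `bayes_risk` around repeated calls to `blahut_arimoto`.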
We note that the sequence $p_{k,\lambda}(x|\theta)$ tends to the limit $p^*_\lambda(x|\theta)$ independently of the initial density $m_0(\cdot)$ chosen. Indeed, one can verify in closed form that if $L$ is squared error, $w$ is a standard normal, $\theta = 0$ and $\lambda = 1$ then $p^*_\lambda(x|\theta = 0)$ is a standard normal. Our program gave this, and $p_{k,\lambda}(x|\theta)$ was observed to converge numerically to a standard normal for a wide range of choices of $m_0$. Thus, the program matched what we knew had to be the case from manual calculation.

A second test of the program was to replicate numerically the results of Theorem 3.3.1, when $\lambda$ tends to infinity. Specifically, we verified expressions (i) and (ii) of Theorem 3.3.1 computationally. They show that as $\lambda$ increases the MIL converges to unit mass at a parameter value and the posterior from the MIL converges to unit mass at a data point. Figure 1.a shows this for the MIL: it is seen that as $\lambda$ increases, $p^*$ concentrates at the parameter value. Figure 1.b shows that as $\lambda$ increases, the posterior $w_{p^*}$ concentrates at the data value. $L$ is squared error loss.

A third test of the program was to replicate numerically the results of Theorem 3.3.2, when $\lambda$ tends to zero. Consider part (vii) of Theorem 3.3.2, and write $x_0 = \arg\min_x \int L(x,\theta)\,w(\theta)\,d\theta$, assuming it is well defined, i.e. $|x_0| < \infty$. Then Theorem 3.3.2 gives conditions under which $p^*_\lambda(x|\theta)$ will concentrate at $x_0$ independently of $\theta$ as $\lambda$ tends to zero. For squared error loss and priors with finite variances, $x_0$ is just the prior mean. So, if $w$ is $N(0,1)$ we find $x_0 = 0$ and that $p^*_\lambda(x|\theta)$ concentrates at zero. For $w$ proportional to $\exp(-(x+10))$, $(x > -10)$, we found $x_0 = -9$ and $p^*_\lambda(x|\theta)$ concentrates at $-9$. For $w$ proportional to $\exp(x-15)$, $(x < 15)$, $x_0 = 14$ and $p^*_\lambda(x|\theta)$ concentrates at 14. In all these cases, the concentration was pronounced by the time $\lambda$ had decreased to .01 and was independent of $\theta$, see Figure 2. This confirms (vii) of Theorem 3.3.2.
One can verify the other conclusion of Theorem 3.3.2 computationally as well, i.e., we observed that the posterior formed from $p^*_\lambda(\cdot|\cdot)$ converges to the prior as $\lambda$ decreases to zero; we omit showing figures for this case since the meaning is clear from Theorem 3.3.2.

[Figure 1, panels a) and b); horizontal axis: theta]

Figure 1: Effect of Increasing $\lambda$ on the MIL and Posterior Density. Figure 1.a shows how $p^*_\lambda(x|\theta)$ concentrates as $\lambda$ increases. Plotted are the MILs for $\lambda = 1$ (dots), $\lambda = 10$ (dashes) and $\lambda = 20$ (solid) when $w$ is $N(0,1)$, $\theta = 5.98$. Figure 1.b shows how the posterior based on a single observation changes as $\lambda$ increases. Plotted are the posteriors for $\lambda = 1$ (dots), $\lambda = 10$ (dashes) and $\lambda = 20$ (solid) when $w$ is $N(0,1)$ and $x = 5$.

[Figure 2]

Figure 2: Effect of Decreasing $\lambda$ on the MIL. Graphs of $p^*_\lambda(x|\theta)$ for $\lambda = .01$. The three strongly peaked curves correspond to different priors: $N(0,1)$ with $x_0 = 0$, a prior proportional to $\exp(-(x+10))$ with $x_0 = -9$, and a prior proportional to $\exp(x-15)$ with $x_0 = 14$. The values of $\theta$ used were $\theta = 1, 5, 10$ respectively, but the convergence to $x_0$ is independent of $\theta$. The more dispersed density is $p^*_\lambda(x|\theta)$ for $\lambda = .001$, $w$ given by $U(-10,15)$, and $\theta = 5, 10$. In this case $x_0$ does not exist and $p^*$ does not concentrate. $L$ is squared error loss.

Chapter 2
Information Theory and Other Background

The likelihood is the link between what we observed and what we seek to know. Consequently, the information tacitly assumed by choosing a likelihood largely determines the results of the analysis. Nevertheless, in practice, a researcher sometimes chooses a likelihood for convenience. Sometimes a diagnostic check is used to assess the adequacy of a model. Alternatively, the statistician may choose the likelihood according to one of a large number of model selection principles or use nonparametric techniques.
However, once the model is obtained, it is a means to doing statistical inference. It represents the statistician's understanding of the linkage between the data observed and the values of the parameter that might specify the data generating mechanism. Here we focus on parametric families but recognize that nonparametrics and model selection provide, at least in large sample cases, alternatives to the technique we propose here.

Since information theoretic considerations underlie most of the key results we have to present, we now turn to the relevant background in information theory. It will be seen that the above summary of the statistical problem translates into the information theoretical setting.

2.1 Information Theory

2.1.1 Entropy, Relative Entropy and Source Coding

The concept of entropy was developed by Shannon in 1948. In his attempt to quantify the uncertainty of a random variable $\Theta$ by a functional satisfying a set of reasonable axioms, he showed that the unique functional of the probability mass function $w(\cdot)$ of a discrete random variable that satisfies these axioms is
$$H(\Theta) = -\sum_i w(\theta_i)\log w(\theta_i).$$
Because this quantity is similar to the entropy in thermodynamics, the name "entropy" was adopted. Here the base of the log is $e$, and the entropy is measured in "nats". If another base $b \ne e$ is chosen for the logarithm, we write the entropy as $H_b(\Theta)$. If one chooses $b = 2$, the corresponding entropy is measured in "bits". For a continuous random variable $\Theta$ with density $w(\cdot)$, its entropy is defined as
$$H(\Theta) = -\int w(\theta)\log w(\theta)\,d\theta.$$
For a finite discrete distribution $w(\cdot)$, $H(\Theta)$ is non-negative, but this is not so in general.
Later, Kullback and Leibler (1951) extended the definition of entropy to measure the discrepancy between two density functions $p$ and $q$ as the relative entropy (or the Kullback-Leibler distance)

$$D(p\|q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$

It is not a metric, but it has some metric-like properties, such as non-negativity and a Pythagorean relationship, and it is zero if and only if $p = q$. It is stronger than the $L_1$ distance, see Csiszar (1975). Similarly, the conditional entropy of $\Theta$ given $X$ is

$$H(\Theta|X) = \int m(x) H(\Theta|X = x)\, dx,$$

where $m(\cdot)$ is the marginal density for $X$ and

$$H(\Theta|X = x) = -\int w(\theta|X = x) \log w(\theta|X = x)\, d\theta.$$

Entropy characterizes some natural phenomena. Consider a discrete random variable $\Theta$ with mass function $w(\cdot)$. Suppose messages are drawn from $w(\cdot)$ and sent to a receiver. Before they are sent, these messages are coded into codewords over a $b$-ary alphabet $B$ (usually binary, i.e. $b = 2$). There are many coding methods. A code is said to be non-singular if different messages correspond to different codewords; a code is called instantaneous if no codeword is a prefix of any other codeword. An instantaneous code is preferred because any given codeword can be decoded without reference to any other codeword. For a value $\theta$, let $l(\theta)$ be the code length of $\theta$. For any instantaneous code, we want a small average code length $E\,l(\Theta) = \sum_\theta l(\theta) w(\theta)$ to describe a given source. Instantaneous codes are not unique for a random variable, but the set of codeword lengths for instantaneous codes is limited by the following result:

Kraft inequality: For any instantaneous code over an alphabet $B$ of size $b$, the code lengths $l_1, \dots, l_m$ must satisfy the inequality

$$\sum_{i=1}^m b^{-l_i} \le 1.$$

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these lengths.

An instantaneous code is said to be optimal if it satisfies the Kraft inequality and its expected code length is the smallest among all such codes.
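The Kraft inequality, in its standard form $\sum_i b^{-l_i} \le 1$ for an alphabet of size $b$, is easy to check numerically. The sketch below is illustrative only: the helper names `kraft_sum` and `satisfies_kraft` and the two sets of lengths are our own choices.

```python
def kraft_sum(lengths, b=2):
    """Left-hand side of the Kraft inequality: sum_i b**(-l_i)."""
    return sum(b ** (-l) for l in lengths)

def satisfies_kraft(lengths, b=2):
    """True when some instantaneous b-ary code with these lengths exists."""
    return kraft_sum(lengths, b) <= 1.0

# Lengths of the binary prefix-free code {0, 10, 110, 111}.
prefix_free_lengths = [1, 2, 3, 3]

# No binary instantaneous code can have two codewords of length 1
# and a third codeword of length 2.
impossible_lengths = [1, 1, 2]
```

The first set of lengths meets the bound with equality, so no codeword could be shortened without breaking the prefix condition.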
The following theorem states the implication of entropy for the average length of the shortest description of a random variable.

Theorem 2.1.1.1 (Cover and Thomas, 1991). Let $l^*(\theta)$ be the optimal code length assignment for the source $w(\cdot)$ and a $b$-ary alphabet. Then

$$H_b(\Theta) \le E\,l^*(\Theta) < H_b(\Theta) + 1.$$

So, roughly speaking, entropy is the average length of the shortest description of a random variable.

The Shannon code-length assignment for the random variable $\Theta$ is

$$l(\theta) = \left\lceil \log \frac{1}{w(\theta)} \right\rceil$$

(here $\lceil r \rceil$ denotes the smallest integer greater than or equal to $r$). This is not necessarily optimal, but the Shannon code is operationally simple and is within 1 bit of optimal. Indeed, the following result is well known.

Theorem 2.1.1.2 (Cover and Thomas, 1991).

$$0 \le E\,l(\Theta) - H_b(\Theta) < 1.$$

For any source $w(\cdot)$, the optimal code length assignment can be obtained by Huffman coding, see Cover and Thomas (1991).

Suppose we use the Shannon code, with code length assignment $l(\theta) = \lceil \log \frac{1}{v(\theta)} \rceil$ based on the mass function $v(\cdot)$, while the true mass function for $\Theta$ is $w(\cdot)$. Then we will not achieve the optimal expected length $H_b(\Theta)$. The following theorem verifies that the increase in description length due to using the wrong mass function is the relative entropy $D(w\|v)$.

Theorem 2.1.1.3 (Cover and Thomas, 1991). The expected length under $w(\cdot)$ of the code length assignment $l(\theta) = \lceil \log \frac{1}{v(\theta)} \rceil$ satisfies

$$D(w\|v) \le E\,l(\Theta) - H(\Theta) < D(w\|v) + 1.$$

This theorem provides the main information theoretic interpretation of the relative entropy. It is the average number of extra bits of information that one would have to send in general, but that one wouldn't have to send if one knew the true density. In other words, the relative entropy is the redundancy of a coding scheme.
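The two bounds on the Shannon code can be checked on a small example. The following Python sketch assumes a binary alphabet and uses two hypothetical mass functions `w` (true) and `v` (wrongly assumed); all function names are ours. It computes the gap between expected code length and entropy when the code is built from `w`, and again when it is built from `v`.

```python
import math

def entropy_bits(w):
    """Entropy in bits of a discrete mass function."""
    return -sum(p * math.log2(p) for p in w if p > 0)

def kl_bits(w, v):
    """Relative entropy D(w || v) in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(w, v) if p > 0)

def shannon_lengths(v):
    """Binary Shannon code lengths ceil(log2(1 / v_i))."""
    return [math.ceil(-math.log2(q)) for q in v]

def expected_length(lengths, w):
    """Average code length under the mass function w."""
    return sum(l * p for l, p in zip(lengths, w))

w = [0.5, 0.3, 0.2]   # true mass function (illustrative)
v = [0.4, 0.4, 0.2]   # mass function wrongly assumed by the coder

# Excess over entropy when the code is matched to the true source.
gap_matched = expected_length(shannon_lengths(w), w) - entropy_bits(w)

# Excess over entropy when the code is built from the wrong mass function.
gap_mismatched = expected_length(shannon_lengths(v), w) - entropy_bits(w)
redundancy = kl_bits(w, v)
```

In this example the matched gap lies in $[0, 1)$ and the mismatched gap lies in $[D(w\|v), D(w\|v) + 1)$, as the theorems require.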
Thus, we shall see that Bernardo's reference prior (see Section 2.2) is the source distribution which yields the worst case Bayes redundancy, and what we have called an MIL is the parametric family of densities which produces, for a given prior, the least Bayes redundancy within a class of good parametric families.

2.1.2 Channel Capacity

Consider a random variable $\Theta$ with distribution $w(\cdot)$. In the present context $\Theta$ is often called a "source" because it is presumed to supply a string of data we want to encode for some purpose. For simplicity we assume $\Theta$ is a random draw from $\{1, 2, \dots, M\}$. A sender wants to send a message $\theta$ (a realization of $\Theta$). Before sending, the message is coded into a $b$-ary alphabet with code length $n$. This is called an $(M, n)$ code. When the codeword reaches the receiver, it is translated back into a message. Due to background noise, there may be transmission error, so the codeword that reaches the receiver may be corrupted. The conditional probability $p(x|\theta)$ that the received message is $x$ given that the message sent is $\theta$ is called a channel. It describes the distribution of messages that might be received given the sent message. One wants $p(x|\theta)$ to be high for those $x$'s near $\theta$, i.e., relative to $x$, $\theta$ is a location parameter.

Usually, larger values of $M$ require larger code lengths $n$ to guarantee the distinguishability of different codewords, but this costs more. The proportion $R = \log M / n$ is called the rate of a code. A rate $R$ is said to be achievable if there exists a sequence of $(M, n)$ codes with arbitrarily small probability of error, i.e., the maximum probability of error $e^{(n)}$ over codewords tends to zero as $n$ tends to infinity, where $e^{(n)} = \max_{i \in \{1, \dots, M\}} e_i$ and $e_i = P(X \neq i \mid \Theta = i)$. That is, $e_i$ is the probability that the received message is not $i$ when the sent message is $i$.
A basic question in data transmission is: what is the maximum number of bits per unit time we can send, or equivalently bits per transmission, through a given channel with arbitrarily small probability of error? The capacity of a channel is defined as the maximum of all achievable rates for this channel. Shannon's channel coding theorem establishes that the channel capacity is the supremum of the Shannon mutual information over all the input marginals, i.e.,

$$\sup_{w(\cdot)} I(X; \Theta), \qquad (2.1.2.1)$$

see Cover and Thomas (1991). So, roughly speaking, we can send at most $2^{nI(X;\Theta)}$ distinguishable sequences of length $n$ across the given channel in a single transmission. The $w^*(\cdot)$ which achieves the supremum in (2.1.2.1) is the source distribution which permits the fastest data transmission over the given channel.

We have seen the definition of the Kullback-Leibler distance (or relative entropy) between two density functions $p(\cdot)$ and $q(\cdot)$. The mutual information of two random variables $X$ and $Y$ with joint density function $p(x, y)$ and marginal densities $p_1(x)$ and $p_2(y)$ is defined as the relative entropy between the joint distribution and the product of the marginals, i.e.,

$$I(X;Y) = D\big(p(x,y)\,\|\,p_1(x)p_2(y)\big) = E_{p_1} D\big(p(\cdot|X)\,\|\,p_2(\cdot)\big). \qquad (2.1.2.2)$$

A Jensen's inequality argument shows that $I(X;Y) \ge 0$. $I(X;Y)$ is a measure of dependence and so arises naturally in channel coding as a rate, because the message received should depend strongly on the message sent. Regarding $I(X;Y)$ as a measure of dependence, we see that another interpretation of Bernardo's reference prior (Section 2.2) is that it is the distribution of the parameter which depends most, asymptotically, on the data distribution, in that it changes the most upon receipt of the data. It is immediate from the definition that the larger $I(X;Y)$ is, the more dependence there is between $X$ and $Y$. Indeed, $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.
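To make the dependence interpretation concrete, the following Python sketch computes the mutual information of a discrete joint mass function as the relative entropy between the joint and the product of its marginals. The function name and the two small joint tables are our own illustrative choices, not taken from the thesis.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats for a joint mass function given as a list of rows."""
    px = [sum(row) for row in joint]            # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Product of the marginals (0.2, 0.8) and (0.3, 0.7): X and Y independent.
independent = [[0.06, 0.14], [0.24, 0.56]]

# Mass concentrated on the diagonal: X and Y dependent.
dependent = [[0.4, 0.1], [0.1, 0.4]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives a value that is zero up to rounding, while the diagonal-heavy table gives a strictly positive value.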
Thus, minimizing $I(X; \Theta)$ with no constraint will yield a trivial result, since this will result in a "likelihood" independent of the parameter, as in our first consideration of the minimization.

2.1.3 Data Compression and the Rate Distortion Function

This concept is motivated by the discretization of continuous random variables into discrete ones, or data compression. Since outcomes of a continuous random variable require infinitely many bits to represent, the only practical way to represent one is to "compress" it by representing it with finitely many bits. Consider dividing the support of a one-dimensional continuous random variable $X$ into $2^{nR}$ intervals. Let $\hat{X}(X)$ be a discrete random variable assuming values $1, 2, \dots, 2^{nR}$, depending on which cell $X$ lands in, where $R$ is the code rate described in section 1.3.2. A distortion function $L(x, \hat{x})$ is a measure of the loss in representing $x$ by $\hat{x}$. For instance, $\hat{x}$ may be the midpoint of the interval $x$ lies in and $L(x, \hat{x}) = |x - \hat{x}|$. We want the compressed data $\hat{X}(X)$ to represent the true data $X$ with small "distortion", i.e. small expected loss or Bayes risk $E\,L(X, \hat{X}(X))$.

Intuitively, the larger the rate $R$ is, the more accurate the representation and the smaller the distortion, but the higher the cost in operation. In the present case, more intervals give a more accurate representation. However, we want to use as few intervals as possible. Assigning more $x$ values into larger intervals means we are throwing out information. If we fix a level of distortion we are willing to tolerate, we are led to minimizing $R$ since we want to compress as much as possible, i.e., throw out information by permitting less accurate representations of $x$, subject to the distortion constraint. We are interested in: For a given source, what is the minimum rate to achieve a distortion no greater than a given tolerable distortion $l$? And what is the corresponding channel?
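The trade-off between the number of cells and the distortion can be seen in a toy case. The sketch below assumes, purely for illustration, that $X$ is uniform on $[0, 1]$, that the cells are equal-width intervals represented by their midpoints, and that the loss is absolute error as in the example above. The function names and the grid approximation of the expectation are ours.

```python
def quantize(x, cells):
    """Midpoint of the equal-width cell of [0, 1] that contains x."""
    width = 1.0 / cells
    idx = min(int(x / width), cells - 1)   # cell index, clipped at the right edge
    return (idx + 0.5) * width

def mean_abs_distortion(cells, grid=10000):
    """Approximate E|X - Xhat(X)| for X uniform on [0, 1] using a fine grid."""
    xs = [(i + 0.5) / grid for i in range(grid)]
    return sum(abs(x - quantize(x, cells)) for x in xs) / grid

d4 = mean_abs_distortion(4)    # 4 cells, i.e. 2 bits per observation
d8 = mean_abs_distortion(8)    # 8 cells, i.e. 3 bits per observation
```

For this uniform source the expected distortion with $k$ cells is $1/(4k)$, so each extra bit of rate halves the distortion.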
For a given positive number $l$, the rate distortion function $R(l)$ is defined as the minimum rate to achieve the distortion $l$. It is the minimum amount of information needed for representing the source with average loss bounded by $l$. The rate distortion theorem establishes that the rate distortion function valued at $l$ is the minimum of the SMI over conditional densities with distortion (Bayes risk) bounded by $l$:

$$R(l) = \min_{p(\hat{x}|x) \in \mathcal{P}_l} I(X; \hat{X}),$$

where $\mathcal{P}_l = \{p(\hat{x}|x) : \int\!\!\int p(x) p(\hat{x}|x) L(x, \hat{x})\, dx\, d\hat{x} \le l\}$ is the class of channels with distortion no greater than $l$. This is the quantity we investigated in choosing the MIL. Note that the above concepts can be expressed in terms of channels as well. The conditional density achieving the rate $R(l)$ is the channel with the slowest transmission for the given source, with tolerable distortion $l$. It is what we have called the MIL. It is the conditional density providing optimal data compression, in the sense that it provides the greatest compression within the allowed distortion.

In practice, one uses $R(l)$ as a theoretical lower bound, seeking discretizations of $X$ into regions which provide optimal compression. Usually, one wants as few regions as possible provided they do not cause excessive distortion. Current work on this problem is often called vector quantization.

2.1.4 Comparison with the ME Formulation

The principle of maximum entropy and our method are similar, since minimizing the SMI is equivalent to maximizing the conditional entropy. However, there are some differences also.

The maximum entropy (ME) method is used for selecting an optimal likelihood based on incomplete information about the likelihood. The information available is incorporated into a set of known constraint(s), and the least informative likelihood subject to these constraints is found in an entropy sense.
The ME likelihoods are in the exponential family

$$p(x|\theta) = a(\beta) \exp[\beta_1 T_1(x) + \cdots + \beta_k T_k(x)],$$

where the parameter $\beta = (\beta_1, \dots, \beta_k)'$ is chosen so that the likelihood satisfies the constraint(s). Its exponent part has a fixed form; the corresponding sufficient statistics $T_1(x), \dots, T_k(x)$ are determined by the form of the constraints.

Our method is aimed at selecting an optimal likelihood in a Bayesian setting with a known prior and some incomplete information about the likelihood: it has bounded Bayes risk. Here we assume less than the ME method and many other known methods, and we have incorporated the prior information about the parameter. The MIL is a function of an exponential family and the marginal of the data,

$$p^*(x|\theta) = \frac{m(x) e^{-\lambda L(x,\theta)}}{\int m(y) e^{-\lambda L(y,\theta)}\, dy},$$

where $m(x) = \int p^*(x|\theta) w(\theta)\, d\theta$ is the marginal of the data, and $\lambda$ is chosen so that equality holds in the Bayes risk constraint. Its exponent structure is determined by the loss function $L(\cdot, \cdot)$ rather than by moments, as in the ME likelihood. Our method produces a likelihood, i.e., a parametric family, as a functional of the prior information. The parametric family can be used in frequentist techniques, or (even with a different prior) in Bayesian techniques. In general the MIL is not an exponential family, but we conjecture that the set of MILs contains the collection of exponential families. The MIL defines the channel which transmits information as slowly as possible subject to a distortion constraint that ensures data transmission actually occurs.

2.1.5 Interpretation of the MIL

From the information theory point of view, the MIL is the conditional density which achieves the rate distortion function lower bound. The rate distortion function plays a central role in data compression and has an interpretation in transmitting data across a channel.
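The form of the MIL given in Section 2.1.4, a marginal $m(x)$ tilted by $e^{-\lambda L(x,\theta)}$ and renormalized for each $\theta$, suggests a fixed-point iteration of Blahut-Arimoto type. The Python sketch below is a minimal discrete version under our own assumptions: finite parameter and data spaces, a fixed $\lambda$ rather than a fixed risk bound $l$, a uniform starting marginal, and a fixed number of iterations. The function name `blahut_arimoto` and the binary example are ours; this is not the thesis's implementation.

```python
import math

def blahut_arimoto(w, loss, lam, iters=200):
    """Iterate p(x|theta) proportional to m(x) * exp(-lam * L(x, theta)).

    w[t]       : prior mass on theta = t
    loss[t][x] : loss L(x, theta) for theta = t
    Returns (p, m, distortion, rate) with the rate in nats.
    """
    n_t, n_x = len(loss), len(loss[0])
    m = [1.0 / n_x] * n_x                       # initial marginal of the data
    for _ in range(iters):
        # Update the conditional for each theta given the current marginal.
        p = []
        for t in range(n_t):
            row = [m[x] * math.exp(-lam * loss[t][x]) for x in range(n_x)]
            z = sum(row)
            p.append([r / z for r in row])
        # Update the marginal given the new conditional.
        m = [sum(w[t] * p[t][x] for t in range(n_t)) for x in range(n_x)]
    distortion = sum(w[t] * p[t][x] * loss[t][x]
                     for t in range(n_t) for x in range(n_x))
    rate = sum(w[t] * p[t][x] * math.log(p[t][x] / m[x])
               for t in range(n_t) for x in range(n_x) if p[t][x] > 0)
    return p, m, distortion, rate

# Uniform binary parameter with 0-1 (Hamming) loss, an illustrative choice.
w = [0.5, 0.5]
hamming = [[0.0, 1.0], [1.0, 0.0]]
_, _, d1, r1 = blahut_arimoto(w, hamming, lam=1.0)
_, _, d4, r4 = blahut_arimoto(w, hamming, lam=4.0)
```

In this symmetric example the Bayes risk is $e^{-\lambda}/(1 + e^{-\lambda})$ and the rate is $\log 2$ minus the binary entropy of that risk, so increasing $\lambda$ lowers the distortion and raises the rate.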
Usually in data transmission, a large amount of source message is compressed into a relatively smaller number of representatives for practical purposes. For example, continuous variables must be represented by finitely many representatives for transmission for economical or operational reasons.

There are many ways to do so. For a given source, the possible representatives constitute a codebook. We want a code that is optimal in that it has the fewest codewords $(x^n)$ (fewest representatives or greatest compression), where $n$ is the code length. This is to be accomplished subject to not losing too much information about the original source $(\theta)$. The loss is quantified by the distortion $L(x^n, \theta)$; the Bayes risk bound $l$ in (1.2.1.1) constrains the average distortion. So, the MIL is just the conditional distribution of the optimal code given the source, subject to the average distortion bound $l$. This causes high dependence between $x^n$ and $\theta$ and amongst the entries of $x^n$. For more details about data compression, see Blahut (1987) and Cover and Thomas (1991). Both provide information-theoretic justification based on data compression for calling $p^*_\lambda$ minimally informative. Here, we only describe the channel-based interpretation, which we argue is more appropriate to the statistical context.

An information-theoretic channel is a conditional distribution which specifies the distribution of the output received given the input sent. The input is a coded version of the message. The output received has a probabilistic description because even though we transmit a specific message it may be corrupted by background noise. For instance, a conditional density such as $p(x|\theta)$ defines a channel: if $\theta$ is the input then the channel gives output $x$ with probability $p(x|\theta)$. If $\theta$ is drawn from a source distribution with density $w$ then the SMI can be interpreted as a rate of transmission in bits per unit time.
Therefore, minimizing the SMI over a constrained set of channels defined by conditional densities yields the channel in that set with the slowest rate of transmission, which we have called minimally informative. Here, the set we have used is the collection of densities for which the Bayes risk of estimating $\theta$ with $X$ is bounded by a number $l$. Information theoretically, this means that the average discrepancy, or distortion, between the output $X$ and the input $\theta$ is bounded. That is, all the channels we are considering must transmit at least some information related to the input $\theta$. The minimal SMI is the slowest rate for this transmission and we have numerically found the conditional density achieving this rate. We regard it as an "optimal" likelihood with the desired Bayesian loss based on the incomplete information, and propose to use it as a default likelihood in certain settings we identify in Chapter 4.

Now we look at the parameter $\lambda$ in the MIL. The inverse of the parameter $\lambda$ in $p^*_\lambda(x|\theta)$ behaves like a dispersion parameter. Under reasonable conditions, $\lambda$ is a decreasing function of $l$, which controls the amount of risk (distortion). For the $p^*_\lambda(\cdot|\theta)$ in Example 1.4.3, for fixed $\theta, \mu, \lambda$, as $\sigma^2 \to \infty$, $p^*_\lambda(\cdot|\theta) \to N(\theta, \frac{1}{2\lambda})$, and hence its variance increases to $\frac{1}{2\lambda} = l(\lambda)$. For fixed $\theta, \mu, \sigma^2$, as $\lambda \to \infty$, $p^*_\lambda(\cdot|\theta)$ converges to the degenerate distribution at $\theta$, consistent with (iii) of Theorem 3.3.1. More generally, our computations show that $p^*_\lambda(\cdot|\theta)$ spreads out as $\lambda$ shrinks and concentrates at $\theta$ as $\lambda$ grows. Alternatively, one can regard $\lambda$ as a smoothing parameter ensuring that a minimally informative density does not just concentrate at the data points.

2.2 Relation to Reference Priors

Since the method we used for selecting the optimal likelihood has some connection with that used for noninformative priors, we also review some background on prior selection.
In a Bayesian setting, the pre-experimental knowledge about the parameters of interest is incorporated into the prior distribution. When such experience is available, the Bayesian is expected to be more efficient in statistical inference about the parameter than the non-Bayesian, in the sense that the class of all Bayes rules is a complete class, or non-Bayes rules are inadmissible (see Wald, 1950). However, when such pre-experimental information is far from enough to establish a prior distribution, how to choose a prior for inference remains an important issue for a Bayesian. Much work has been done on this. For example, conjugate priors are often based on mathematical convenience. They require that the posterior and prior be in the same distribution family. Another criterion is invariance. Jeffreys' non-informative prior was originally proposed to satisfy an invariance principle. In 1979, Bernardo proposed the reference prior, which is based on an information theoretic optimality criterion: one selects the prior for which the posterior is updated the most, asymptotically in the expected Kullback-Leibler measure, i.e. it is

$$\arg\max_{w} \lim_{n \to \infty} E_m D\big(w(\cdot|X^n)\,\|\,w(\cdot)\big),$$

so it is the prior that permits the posterior to change from it the most upon receipt of the data, on average, in an asymptotic sense.

In a fully Bayes setting, one has some pre-experimental beliefs about the parameter encapsulated in a prior density. Often, in practice, we do not have as many data points as desirable for many known methods, and the reasoning behind their basic assumptions is unclear. In this case, choosing any known likelihood to model the data seems inappropriate, and how to model the data reasonably becomes a basic and practical problem. In the fully Bayes setting, the prior represents partial information in parallel to that specified by the constraints of the ME method.
We want a likelihood which is "unbiased" in that it is reasonable based on this partial information.

The MIL method here is, in some sense, the reverse of the reference prior method of Bernardo (1979). Our task is to choose a likelihood given the prior, while Bernardo identified a prior given the likelihood. Specifically, Bernardo found a way to choose a prior in the absence of information about the parameter. He used the Shannon mutual information (SMI), or the expected Kullback-Leibler distance between the posterior and the prior,

$$I(\Theta, X^n) = E_m D\big(w(\cdot|X^n)\,\|\,w(\cdot)\big), \qquad (2.2.1)$$

where $m(x^n) = \int p(x^n|\theta) w(\theta)\, d\theta$ is the marginal density of the data $X^n$ and $Q$ is the class of all the $n$ dimensional densities. It measures, on average, the discrepancy between the posterior and the prior. He maximized, asymptotically, the SMI over all priors. Recognizing that $m$ is the Bayes estimator for $p(\cdot|\theta)$, Bernardo examined

$$\max_{w} \min_{q \in Q} \int w(\theta) D\big(p(\cdot|\theta)\,\|\,q(\cdot)\big)\, d\theta.$$

The asymptotic maximizer $w^*(\cdot)$ is his reference prior. It differs most, on average, from the posterior in an asymptotic sense. It is the prior that contains the least information about the parameter, since the posterior based on it is furthest away from the prior. Under some regularity conditions, Jeffreys' non-informative prior is a special case of Bernardo's reference prior, see Clarke and Barron (1994).

We see that prior selection forces the prior to be far from the posterior, but if we are selecting a likelihood, we want the posterior to differ not too much from the prior. Thus, we have minimized the SMI under a constraint to get an optimal likelihood, while Bernardo maximized the SMI (asymptotically) to get an optimal prior. Operationally, Bernardo's method is a max-min procedure, while our method is a double minimization

$$\min_{p \in A} \min_{q \in B} D\big(w(\theta)p(x^n|\theta)\,\|\,q(\theta, x^n)\big),$$

where $A$ is the set of all the likelihoods that satisfy the Bayes risk constraint and $B$ is the set of product distributions $w(\theta)r(x^n)$ with arbitrary densities $w(\theta)$ and $r(x^n)$, see Cover and Thomas (1991).

Likelihoods are less informative when the posterior is close to the prior. Priors are less informative when they give a posterior far from the prior.

Our initial efforts to find a minimally informative likelihood and reverse Bernardo's approach originally led us to consider minimizing the expected Kullback-Leibler distance between the posterior and the "contaminated" prior over likelihoods. That is, we used $(1 - \alpha)w(\theta) + \alpha\phi(x, \theta)/m(x)$ in the SMI, where $0 < \alpha < 1$ is fixed and $\phi(x, \theta)$ is a given non-negative function. In the case of a single outcome, our functional can be written as

$$E_m D\left(w(\cdot|X)\,\Big\|\,(1 - \alpha)w(\cdot) + \alpha\frac{\phi(X, \cdot)}{m(X)}\right),$$

where $D(p\|q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$ is the relative entropy between two densities $p(\cdot)$ and $q(\cdot)$. The standard method of calculus of variations gives a type of Fredholm equation for the likelihood, see for example Kondo (1991), [equation illegible in the source], which results in [equation illegible in the source], where $c(\theta)$ is the normalizing constant for each $\theta$. We verified minimality for this solution, and for some choices of $\alpha$ and $\phi$ this $p(\cdot|\theta)$ is non-negative. However, it is not clear that this solution admits any physical interpretation. Moreover, it appeared mathematically intractable.

Later, instead of modifying the functional to be optimized, we tried restricting the class of likelihoods over which we conducted the optimization. When we sought meaningful quantities to optimize over large classes of likelihoods, it seemed natural to start with the SMI. Finally, decision theory led us to consider minimizing the SMI over likelihoods in the class with a bounded Bayes risk, and gave a constraint of the form

$$\int\!\!\int w(\theta)p(x|\theta)L(x, \theta)\, dx\, d\theta \le l. \qquad (2.2.2)$$

From this one can recognize that the optimization is the same as that in the definition of the rate distortion function in an information theory context.
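The SMI can be written either as the expected relative entropy between posterior and prior, averaged over the marginal of the data, or as the expected relative entropy between the likelihood and the marginal, averaged over the prior. The Python sketch below checks numerically that the two forms agree on a small discrete example. It also illustrates the standard fact behind the double minimization: replacing the marginal $m$ by any other density $r$ in the second form can only increase the value. The prior, likelihood table and alternative density `r` are hypothetical values chosen for illustration.

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in nats for discrete mass functions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

w = [0.3, 0.7]                      # prior on theta in {0, 1}
lik = [[0.8, 0.2], [0.1, 0.9]]      # lik[t][x] = p(x | theta = t)

# Marginal of the data and posterior for each x.
m = [sum(w[t] * lik[t][x] for t in range(2)) for x in range(2)]
post = [[w[t] * lik[t][x] / m[x] for t in range(2)] for x in range(2)]

# E_m D(w(.|X) || w): posterior versus prior, averaged over the marginal.
smi_posterior_form = sum(m[x] * kl(post[x], w) for x in range(2))

# E_w D(p(.|theta) || m): likelihood versus marginal, averaged over the prior.
smi_channel_form = sum(w[t] * kl(lik[t], m) for t in range(2))

# Using any r other than the marginal m gives a larger value.
r = [0.5, 0.5]
value_with_r = sum(w[t] * kl(lik[t], r) for t in range(2))
```

The excess of `value_with_r` over the SMI equals $D(m\|r)$, which is why the inner minimization over product distributions is attained at the true marginal.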
Statistically, the MIL is the likelihood which gives a posterior updated from the prior the least on average. In the next Chapter, we will see formally that the Bayes risk bound $l$ behaves like a dispersion parameter in the MIL.

In practice, the Bayes risk bound $l$ may be chosen subjectively according to the experimenter's tolerance for risk. However, how to choose $l$ in a general setting is still a question for further work. We address this heuristically in the application of Chapter 4.

2.3 Other Background

There are numerous methods in the literature regarding likelihood selection. Here we give a partial review of those that have some relevance to our methods.

Many authors have used and developed the maximum entropy (ME) method for choosing a likelihood based on incomplete information. Usually one assumes the "partial information" may be incorporated into a set of moment constraint(s)

$$E[T_k(X)] = \theta_k, \quad k = 0, 1, \dots, m.$$

It can be shown, see Jaynes (1957), that the maximum entropy distribution under these constraints is of the form

$$p(x|\theta) = a(\beta) \exp[\beta_1 T_1(x) + \cdots + \beta_k T_k(x)], \qquad (2.3.1)$$

where the $\beta$'s are chosen so that the moment constraints are satisfied. The family (2.3.1) is the "least informative" distribution in the absence of adequate knowledge about the data generating mechanism. In practice, usually little is known about the data generating mechanism, so the ME method plays an important role in data modeling in these situations.

Like other data modeling strategies, the ME is not a perfect principle. The concern is how closely the ME distribution approximates the data generating distribution for given data. It is reasonable to ask if the data generating distribution is well approximated by the information specified by the constraints.
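As an illustration of how the parameters of a maximum entropy distribution are fitted to a moment constraint, the following Python sketch treats a single constraint $E[X] = 4.5$ on the support $\{1, \dots, 6\}$, so that the ME mass function is proportional to $\exp(\beta x)$. The support, the target mean, the bisection bracket and all function names are our own illustrative assumptions.

```python
import math

def me_distribution(support, beta):
    """Mass function proportional to exp(beta * x) on a finite support."""
    weights = [math.exp(beta * x) for x in support]
    z = sum(weights)
    return [wt / z for wt in weights]

def mean(support, probs):
    """Expected value of a discrete distribution."""
    return sum(x * p for x, p in zip(support, probs))

def solve_beta(support, target_mean, lo=-10.0, hi=10.0, tol=1e-12):
    """Bisection on beta; the mean of this family is increasing in beta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(support, me_distribution(support, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = [1, 2, 3, 4, 5, 6]
beta = solve_beta(support, 4.5)
p_me = me_distribution(support, beta)
```

Because the target mean exceeds the mean 3.5 of the uniform distribution, the fitted $\beta$ is positive and the resulting mass function increases along the support.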
If this is the case, then the entropy of the data distribution is expected to be somewhat close to the maximum entropy $H_{\max}$. "When the constraints do not reflect the information content of the underlying random mechanism of data generating process, then a non-parametric estimate of the entropy solely based on the data would generally yield an unacceptable lower value than $H_{\max}$ estimated by the data. In such a case, the use of the ME distribution would be inadequate because it will fail to predict the future outcomes correctly", see Soofi (1994).

The minimum complexity or minimum description length criterion developed by Kolmogorov (1965) is another information theoretic model selection criterion. Assume $X_1, \dots, X_n$ are iid random variables with common density $p(x)$. Denote $p(x^n) = \prod_{i=1}^n p(x_i)$. Let $\Gamma_n$ be a countable collection of density functions. For each $p(\cdot) \in \Gamma_n$, there is a non-negative number $L_n(p)$ which is the description length of $p$. In Kolmogorov's original formulation it is the length of the shortest computer program that can calculate $p$. Barron and Cover (1989) modified this idea so as to interpret $L_n$ as a code length from a codebook which provides a code for each member of $\Gamma_n$. In this case, the minimum complexity or minimum description length criterion is to choose the $\hat{p}_n \in \Gamma_n$ which minimizes the complexity of the data $X^n$ relative to $L_n$ and $\Gamma_n$, i.e. the $\hat{p}_n \in \Gamma_n$ defined by

$$\hat{p}_n = \arg\min_{p \in \Gamma_n} B(X^n) = \arg\min_{p \in \Gamma_n} \big(L_n(p) - \log p(X^n)\big).$$

In information theory, the terms $L_n(p)$ and $-\log p(X^n)$ are, respectively, the description length of $p$ and the Shannon code length of $X^n$ based on $p$. This minimum complexity estimator has many useful properties, see Barron and Cover (1989). They also provided a Bayesian interpretation based on using the Kraft inequality to regard $L_n(p)$ as a prior.
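The criterion of minimizing $L_n(p) - \log p(X^n)$ over a countable collection can be illustrated with a toy collection of Bernoulli models. In the Python sketch below, the three candidate parameters, their description lengths in bits (chosen to satisfy the Kraft inequality), and the two data sets are hypothetical; they are not taken from Barron and Cover (1989).

```python
import math

def two_part_length(theta, desc_bits, data):
    """L_n(p) - log2 p(X^n) for an iid Bernoulli(theta) model, in bits."""
    heads = sum(data)
    tails = len(data) - heads
    log_lik = heads * math.log2(theta) + tails * math.log2(1 - theta)
    return desc_bits - log_lik

# Hypothetical countable collection: theta -> description length in bits.
# The lengths satisfy the Kraft inequality: 2**-1 + 2**-2 + 2**-2 = 1.
candidates = {0.5: 1, 0.25: 2, 0.75: 2}

def mdl_choice(data):
    """Candidate minimizing the total two-part description length."""
    return min(candidates,
               key=lambda th: two_part_length(th, candidates[th], data))

balanced = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # 6 ones in 10 draws
skewed = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]     # 9 ones in 10 draws

choice_balanced = mdl_choice(balanced)
choice_skewed = mdl_choice(skewed)
```

With nearly balanced data the short description of the fair-coin model wins, while for the skewed data the better fit of the model with parameter 0.75 outweighs its longer description.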
The Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) are also well known methods for model selection (see Akaike, 1977). Let $\mathcal{P}$ be a class of iid likelihoods with a $k$-dimensional parameter $\theta$, let $\hat{\theta}$ be the maximum likelihood estimate of $\theta$ based on a sample of size $n$, and assume the density for the data is in $\mathcal{P}$. The AIC criterion is to choose the model in $\mathcal{P}$ which minimizes

$$AIC(k) = -2\log(p(x^n|\hat{\theta})) + 2k.$$

It is argued that the AIC has a maximum entropy interpretation. An alternative to the AIC is the BIC. It chooses the optimal $k$ for the dimensionality of the parameter. The BIC is

$$BIC(k) = AIC + A(\log(n) - 1) + \log(Q(k)/A),$$

where $Q(k)$ is the projection of the $n$-dimensional observed data into a $k$-dimensional space, its functional form depending on the method of estimation. The BIC method is to choose the $k$ which minimizes the BIC (see Akaike, 1977).

These usual methods are only asymptotically optimal under regularity conditions. By contrast, the method we propose has some small sample optimality and provides a flexible class of models for consideration. We note that the AIC is rarely consistent and the BIC rests on Bayes testing for its optimality. See Schwartz (1978) and Haughton (1988).

Chapter 3

Main Results on The MILs

In this chapter we establish our main results on the MILs. Let

$$r = \inf_x \int w(\theta)L(x, \theta)\, d\theta.$$

This value $r$ is achieved at the point $x$ which is closest to the center of the distribution of $\Theta$. Our first result is that the parametric family we identified is unique.

Proposition 3.1. (i) For each $l \in [0, r)$, $R(l)$ has a unique minimizer in $\mathcal{P}_l$.

(ii) For $l \ge r$, $R(l) = 0$ and it is achieved by any $p(\cdot)$ which is independent of $\theta$.

(iii) Assume the parameter $\theta$ is a permutation symmetric functional of the distribution of $X^n$, i.e.
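let $F_{X_1, \dots, X_n}$ be the joint distribution of $(X_1, \dots, X_n)$; there is a functional $G(\cdot)$ such that for any permutation $(i_1, \dots, i_n)$ of $(1, \dots, n)$

$$\theta = G(F_{X_1, \dots, X_n}) = G(F_{X_{i_1}, \dots, X_{i_n}}),$$

and $L_n(x^n, \theta)$ is permutation symmetric in $x_1, \dots, x_n$. Then $p^*(x^n|\cdot)$ is permutation symmetric in $x_1, \dots, x_n$.

Proof: (i) First note that any $p(\cdot)$ which is independent of $\theta$ is excluded from $\mathcal{P}_l$. In fact, for any such $p(\cdot)$,

$$\int\!\!\int p(x)w(\theta)L(x, \theta)\, dx\, d\theta \ge \int p(x) \inf_t \int w(\theta)L(t, \theta)\, d\theta\, dx = \int p(x)\, r\, dx = r,$$

so $p(\cdot)$ is not in $\mathcal{P}_l$. Next note that $\mathcal{P}_l$ is a convex set of probability densities. In fact, for all $p_1(\cdot|\cdot) \in \mathcal{P}_l$, $p_2(\cdot|\cdot) \in \mathcal{P}_l$ and $0 < \alpha < 1$,

$$\int\!\!\int \big(\alpha p_1(x|\theta) + (1 - \alpha)p_2(x|\theta)\big)L(x, \theta)w(\theta)\, dx\, d\theta$$
$$= \alpha \int\!\!\int p_1(x|\theta)L(x, \theta)w(\theta)\, dx\, d\theta + (1 - \alpha) \int\!\!\int p_2(x|\theta)L(x, \theta)w(\theta)\, dx\, d\theta \le \alpha l + (1 - \alpha)l = l,$$

that is, $\alpha p_1(\cdot|\cdot) + (1 - \alpha)p_2(\cdot|\cdot) \in \mathcal{P}_l$.

Now it is enough to show that $I(\Theta, X)$ is strictly convex on $\mathcal{P}_l$ as a functional of $p(\cdot|\cdot)$. Write $I(\Theta, X)$ as $I(p, \Theta)$ to indicate its relationship with $p(\cdot|\cdot)$, and write $m_i(x) = \int w(\theta)p_i(x|\theta)\, d\theta$ for the marginal corresponding to $p_i$. Now for all $0 < \lambda < 1$ and $p_1(\cdot|\cdot), p_2(\cdot|\cdot) \in \mathcal{P}_l$ with $p_1(\cdot|\cdot) \neq p_2(\cdot|\cdot)$, we have

$$I(\lambda p_1 + (1 - \lambda)p_2, \Theta) = \int\!\!\int w(\theta)\big(\lambda p_1(x|\theta) + (1 - \lambda)p_2(x|\theta)\big) \log \frac{\lambda p_1(x|\theta) + (1 - \lambda)p_2(x|\theta)}{\lambda m_1(x) + (1 - \lambda)m_2(x)}\, d\theta\, dx.$$

The log-sum inequality states that for any integer $n$ and any non-negative numbers $a_1, \dots, a_n$ and $b_1, \dots, b_n$,

$$\Big(\sum_i a_i\Big) \log \frac{\sum_i a_i}{\sum_i b_i} \le \sum_i a_i \log \frac{a_i}{b_i},$$

with equality if and only if $a_i/b_i$ is a constant over all $i$. Now, $I(\lambda p_1 + (1 - \lambda)p_2, \Theta)$ is bounded from above by

$$\lambda \int\!\!\int w(\theta)p_1(x|\theta) \log \frac{p_1(x|\theta)}{m_1(x)}\, d\theta\, dx + (1 - \lambda) \int\!\!\int w(\theta)p_2(x|\theta) \log \frac{p_2(x|\theta)}{m_2(x)}\, d\theta\, dx = \lambda I(p_1, \Theta) + (1 - \lambda)I(p_2, \Theta).$$

(ii) Let $x_0 = \arg\min_x \int w(\theta)L(x, \theta)\, d\theta$, and let $p_0(\cdot)$ be the density which is independent of $\theta$ and is concentrated in a small neighbourhood of $x_0$. Then apparently $p_0(\cdot) \in \mathcal{P}_l$, and $I(p_0, \Theta) = 0$, since $p_0(\cdot)$ is independent of $\theta$.

(iii) By the Blahut-Arimoto iterative procedure in Section 1.6.1, we can choose $m_0(\cdot, \dots, \cdot)$ to be a permutation symmetric density, so in each step $k$ of the iteration, $p_k(\cdot, \dots, \cdot|\theta)$ is permutation symmetric.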
Thus

$$p^*(x_1, \dots, x_i, \dots, x_j, \dots, x_n|\theta) = \lim_{k \to \infty} p_k(x_1, \dots, x_i, \dots, x_j, \dots, x_n|\theta)$$
$$= \lim_{k \to \infty} p_k(x_1, \dots, x_j, \dots, x_i, \dots, x_n|\theta) = p^*(x_1, \dots, x_j, \dots, x_i, \dots, x_n|\theta),$$

that is, $p^*(x_1, \dots, x_n|\theta)$ is permutation symmetric in its arguments. □

We comment that MILs can be used to form a posterior density or can be used to obtain frequentist estimators. The prior has thus far only been used to get a likelihood. One need not use it again to form a posterior. This is a frequentist usage (getting a point estimator) of a Bayesian quantity (a prior).

3.1 Large Sample Properties of the MIL

Consider the collection of parametric families of the same form as $\mathcal{P}_l$, but for a random variable $X^n$ in place of the univariate $X$. That is, let

$$\mathcal{P}_n = \left\{p_n(x^n|\theta) : \int\!\!\int p_n(x^n|\theta)w(\theta)L_n(x^n, \theta)\, dx^n\, d\theta \le l_n\right\}. \qquad (3.1.1)$$

Denote the minimally informative likelihood for $X^n$ by $p_{MIL}(x^n|\theta)$, that is, write

$$p_{MIL}(x^n|\theta) = \arg\min_{p_n \in \mathcal{P}_n} I(\Theta, X^n).$$

Similar to the univariate case handled in Blahut (1972a), one can obtain a form for the MIL based on the loss function $L$. For given prior $w$, this is

$$p_{MIL}(x^n|\theta) = \frac{m^*_n(x^n)e^{-\lambda_n L_n(x^n, \theta)}}{\int m^*_n(y^n)e^{-\lambda_n L_n(y^n, \theta)}\, dy^n}, \qquad (3.1.2)$$

where $m^*_n(x^n)$ is determined by

$$\int \frac{e^{-\lambda_n L_n(x^n, \theta)}w(\theta)}{\int m^*_n(y^n)e^{-\lambda_n L_n(y^n, \theta)}\, dy^n}\, d\theta \le 1 \qquad (3.1.3)$$

with equality for $x^n$'s such that $m^*_n(x^n) > 0$, and $\lambda_n > 0$ is determined by $l_n$.

We will see that a posterior formed from the parametric family (3.1.2) and the prior $w$ is asymptotically the same as $w$ in a relative entropy sense when convergence is assessed in the mixture distribution. That is, the data update $w$ trivially. In addition, we will see that use of $p_{MIL}$, or $p^*_\lambda(x|\theta)$ to denote its dependence on $\lambda$ explicitly, gives the weakest inferences possible amongst the elements of $\mathcal{P}_n$.

To establish the asymptotic equivalence of $w$ and the posterior based on $w$ and (3.1.2) we note that $p_{MIL}$ is typically a dependence model, in which the dependence structure depends on $n$ and $\lambda_n$. Because $p_{MIL}$ typically cannot be given in closed form, the proof of our first theorem requires a carefully chosen independence density $p_n(\cdot|\theta)$ in $\mathcal{P}_n$. This
Because $p_{\mathrm{MIL}}$ typically cannot be given in closed form, the proof of our first theorem requires a carefully chosen independence density $p_n(\cdot|\theta)$ in $\mathcal{P}_n$. This $p_n(\cdot|\theta)$ is chosen so that the expected relative entropy between the posterior based on $p_{\mathrm{MIL}}$ and $w$ and the prior is bounded by the relative entropy between the posterior based on $p_n(\cdot|\theta)$ and $w$ and the prior. Then we prove the latter tends to zero as $n$ goes to infinity.

In the definition of $\mathcal{P}_n$, if we take $L_n(x^n,\theta) = a_n\sum_{i=1}^n L(x_i,\theta)$ for given $L(\cdot,\cdot)$, then we can absorb the $l_n$ into $a_n$ and assume $l_n = 1$ for all $n$. Thus our set $\mathcal{P}_n$ is the same as used in the usual formulation of the rate distortion problem, see Cover and Thomas (1991).

To state the theorem, we define the average loss for fixed $x$ as $r(x) = \int w(\theta)L(x,\theta)\,d\theta$, and denote its infimum and supremum by
$$\underline{r} = \inf_x r(x), \qquad \bar{r} = \sup_x r(x).$$
Now we show that a posterior based on $p_{\mathrm{MIL}}(x^n|\theta)$ and the prior $w$ which generated it updates $w$ trivially, in an asymptotically average sense.

Theorem 3.1.1. Assume that for all $x$, $r(x) = \int w(\theta)L(x,\theta)\,d\theta < \infty$, and that, for all $n$, $l_n = 1$. Let $L_n(x^n,\theta) = a_n\sum_{i=1}^n L(x_i,\theta)$ where $L(\cdot,\cdot)$ is continuous in both arguments, and assume that the limit of $na_n$ exists and is $s$. Now, if $\underline{r}s < 1$ we have that
$$E_{m_{p^*_n}} D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \to 0.$$

Proof: Step 1: First we prove that there exists a probability density $q(\cdot)$ such that the new parametric family $p_n$ for $X^n$ defined by
$$p_n(x^n|\theta) = \frac{\prod_{i=1}^n q(x_i)\,e^{-L_n(x^n,\theta)}}{\int \prod_{i=1}^n q(y_i)\,e^{-L_n(y^n,\theta)}\,dy^n} \tag{3.1.4}$$
is an element of $\mathcal{P}_n$ for $n$ large enough. Indeed, take a constant $r$ with $\underline{r} < r < \bar{r}$ and a constant $b\in(\underline{r},\bar{r})$ such that $bs < 1$. Choose a probability density $q(\cdot)$ such that for all $\theta$
$$\int q(x)L(x,\theta)\,dx < \infty,$$
and
$$\int\!\!\int w(\theta)q(x)L(x,\theta)\,dx\,d\theta = b.$$
That is, we have $\int q(x)(r(x) - b)\,dx = 0$. Such a probability density $q(\cdot)$ exists because $\underline{r} < b < \bar{r}$. By the symmetry of $p_n(x^n|\theta)$, the sum of integrals from $L_n$ can be reduced to a univariate integral.
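Now, the density $p_n(x^n|\theta)$ satisfies
$$\int\!\!\int p_n(x^n|\theta)w(\theta)L_n(x^n,\theta)\,dx^n\,d\theta = na_n\int\!\!\int \frac{w(\theta)q(x)e^{-a_nL(x,\theta)}L(x,\theta)}{\int q(y)e^{-a_nL(y,\theta)}\,dy}\,dx\,d\theta. \tag{3.1.5}$$
Denote the double integral in (3.1.5) by $I(a_n)$. Since $na_n\to s$ and $sb<1$, and since $l_n = 1$ in the definition of $\mathcal{P}_n$, to see that $p_n(x^n|\theta)\in\mathcal{P}_n$ for all large $n$ it is enough to show $I(a_n)\to b$. By standard inequalities we have that
$$|I(a_n) - b| = \left|\int\!\!\int \frac{w(\theta)q(x)L(x,\theta)e^{-a_nL(x,\theta)}}{\int q(y)e^{-a_nL(y,\theta)}\,dy}\,dx\,d\theta - \int\!\!\int w(\theta)q(x)L(x,\theta)\,dx\,d\theta\right| \le \int w(\theta)\,\frac{\int\!\!\int q(x)q(y)L(x,\theta)\,\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt}\,d\theta. \tag{3.1.6}$$
Let $A_\theta = \{(x,y)\,|\,L(x,\theta)\ge L(y,\theta)\}$. We have
$$\frac{\int\!\!\int q(x)q(y)L(x,\theta)\,\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt} \le \frac{\int\!\!\int_{A_\theta} q(x)q(y)L(x,\theta)\,\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt} + \frac{\int\!\!\int_{A_\theta^c} q(x)q(y)L(y,\theta)\,\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt}$$
$$= \frac{\int\!\!\int_{A_\theta} q(x)L(x,\theta)q(y)e^{-a_nL(y,\theta)}\,\big|e^{-a_n(L(x,\theta)-L(y,\theta))} - 1\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt} + \frac{\int\!\!\int_{A_\theta^c} q(y)L(y,\theta)q(x)e^{-a_nL(x,\theta)}\,\big|1 - e^{-a_n(L(y,\theta)-L(x,\theta))}\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt}$$
$$\le \frac{\int\!\!\int_{A_\theta} q(x)L(x,\theta)q(y)e^{-a_nL(y,\theta)}\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt} + \frac{\int\!\!\int_{A_\theta^c} q(y)L(y,\theta)q(x)e^{-a_nL(x,\theta)}\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt}$$
$$\le \frac{\int\!\!\int q(x)L(x,\theta)q(y)e^{-a_nL(y,\theta)}\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt} + \frac{\int\!\!\int q(y)L(y,\theta)q(x)e^{-a_nL(x,\theta)}\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt} = 2\int q(x)L(x,\theta)\,dx,$$
which is integrable w.r.t. $W(\theta)$, since
$$2\int w(\theta)\int q(x)L(x,\theta)\,dx\,d\theta = 2\int\!\!\int w(\theta)q(x)L(x,\theta)\,dx\,d\theta = 2b < \infty.$$
By the Dominated Convergence Theorem, the limit of the left hand side of (3.1.6) is bounded by
$$\int w(\theta)\,\lim_n \frac{\int\!\!\int q(x)q(y)L(x,\theta)\,\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy}{\int q(t)e^{-a_nL(t,\theta)}\,dt}\,d\theta, \tag{3.1.7}$$
and for any fixed $\theta$, the limit in (3.1.7) is
$$\frac{\lim_n \int\!\!\int q(x)q(y)L(x,\theta)\,\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy}{\lim_n \int q(t)e^{-a_nL(t,\theta)}\,dt} \tag{3.1.8}$$
provided the numerator of (3.1.8) exists and the denominator of (3.1.8) exists and is non-zero. In the numerator, the integrand is bounded by $2q(x)q(y)L(x,\theta)$, which is integrable w.r.t. $x, y$ for any fixed $\theta$ by our choice of $q(\cdot)$.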
So by Dominated Convergence, the numerator of (3.1.8) is
$$\int\!\!\int q(x)q(y)L(x,\theta)\,\lim_n\big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big|\,dx\,dy = 0, \tag{3.1.9}$$
since $a_n\to 0$, so for fixed $x$, $y$ and $\theta$, $\lim_n \big|e^{-a_nL(x,\theta)} - e^{-a_nL(y,\theta)}\big| = 0$. For the denominator in (3.1.8), the integrand is upper bounded by $q(t)$, which is integrable w.r.t. $t$, so by Dominated Convergence again,
$$\lim_n \int q(t)e^{-a_nL(t,\theta)}\,dt = \int \lim_n q(t)e^{-a_nL(t,\theta)}\,dt = 1. \tag{3.1.10}$$
Now, by (3.1.7), (3.1.8), (3.1.9) and (3.1.10), the limit in (3.1.7) is zero, and hence so is the limit of the left hand side of (3.1.6), i.e. $p_n(x^n|\theta)\in\mathcal{P}_n$ for all large $n$.

Step 2: Now we prove the assertion of the theorem. Let
$$m_{p_n}(x^n) = \int p_n(x^n|\theta)w(\theta)\,d\theta$$
be the mixture of $p_n(x^n|\theta)$ with respect to the prior $w(\theta)$ and write $q(x^n) = \prod_{i=1}^n q(x_i)$. By the definition of $p^*_n$, its posterior is the closest to $w(\theta)$ in the expected Kullback-Leibler distance among all the posteriors based on any other probability densities in $\mathcal{P}_n$. We have
$$0 \le E_{m_{p^*_n}} D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \le E_{m_{p_n}} D\big(w_{p_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \tag{3.1.11}$$
$$= \int\!\!\int w(\theta)p_n(x^n|\theta)\log\frac{p_n(x^n|\theta)}{m_{p_n}(x^n)}\,dx^n\,d\theta = \int\!\!\int w(\theta)p_n(x^n|\theta)\log\left(\frac{q(x^n)e^{-L_n(x^n,\theta)}}{m_{p_n}(x^n)\int q(y^n)e^{-L_n(y^n,\theta)}\,dy^n}\right)dx^n\,d\theta$$
$$= -\int\!\!\int w(\theta)p_n(x^n|\theta)L_n(x^n,\theta)\,dx^n\,d\theta \tag{3.1.12}$$
$$\quad - \int w(\theta)\log\left(\int q(y^n)e^{-L_n(y^n,\theta)}\,dy^n\right)d\theta \tag{3.1.13}$$
$$\quad - \int\!\!\int w(\theta)p_n(x^n|\theta)\log\left(\frac{m_{p_n}(x^n)}{q(x^n)}\right)dx^n\,d\theta. \tag{3.1.14}$$
Term (3.1.12) is $-na_nI(a_n)\to -sb$, and we shall show (3.1.13) $\to sb$ and (3.1.14) $\to 0$. For (3.1.13), since $-\log(\cdot)$ is convex, we have that for any $\theta$ and $n$,
$$0 \le -\log\left(\int q(y^n)e^{-L_n(y^n,\theta)}\,dy^n\right) = -\log\big[E_q\big(e^{-L_n(Y^n,\theta)}\big)\big] \le E_q\big[-\log\big(e^{-L_n(Y^n,\theta)}\big)\big] = E_q\big(L_n(Y^n,\theta)\big) = na_n\int q(y)L(y,\theta)\,dy < \infty. \tag{3.1.15}$$
Denote the integral in the right hand side of (3.1.15) by $a(\theta)$. Now, $a(\theta)$ is integrable w.r.t. $W(\cdot)$. Indeed,
$$\int a(\theta)\,W(d\theta) = \int\!\!\int w(\theta)q(y)L(y,\theta)\,dy\,d\theta = b < \infty.$$
By the strong law of large numbers, we have that for all $\theta$,
$$L_n(Y^n,\theta) \to s\,a(\theta),$$
almost surely with respect to $q$. So, for any fixed $\theta$ and $\epsilon > 0$, when $n$ is large enough, we have that
$$e^{-(sa(\theta)+\epsilon)} \le e^{-L_n(Y^n,\theta)} \le e^{-(sa(\theta)-\epsilon)} \tag{3.1.16}$$
with high $q(\cdot)$ probability.
Let $U$ be the set of $Y^n$'s such that (3.1.16) holds. For $n$ large, we have $E_q\chi_{U^c} \le \epsilon$ and
$$E_q\big(e^{-L_n(Y^n,\theta)}\big) = E_q\big(e^{-L_n(Y^n,\theta)}\chi_U\big) + E_q\big(e^{-L_n(Y^n,\theta)}\chi_{U^c}\big).$$
Because $e^{-L_n(Y^n,\theta)}$ is bounded, we now have
$$e^{-(sa(\theta)+\epsilon)} - \epsilon \le E_q\big[e^{-L_n(Y^n,\theta)}\big] \le e^{-(sa(\theta)-\epsilon)} + \epsilon.$$
Since $\epsilon > 0$ can be arbitrarily small, we get
$$\int q(y^n)e^{-L_n(y^n,\theta)}\,dy^n \to e^{-sa(\theta)} \tag{3.1.17}$$
for each $\theta$. Hence by (3.1.17) and the Dominated Convergence Theorem, expression (3.1.13) converges to
$$-\int \log\big(e^{-sa(\theta)}\big)\,W(d\theta) = sb. \tag{3.1.18}$$
So, as $n$ goes to infinity, (3.1.12) and (3.1.13) will cancel each other. To complete the proof we only need to prove (3.1.14) tends to zero as $n$ goes to infinity, and by the non-negativity of the Kullback-Leibler distance, we only need to show that (3.1.14) is non-positive. Recall
$$-\log x \le x^{-1} - 1 \tag{3.1.19}$$
and that expectations can be written w.r.t. $q(\cdot)$ rather than $p_n(\cdot|\theta)$. We have that, for all $n$, (3.1.14) is bounded above by
$$\int w(\theta)\,E_q\left[\frac{e^{-L_n(X^n,\theta)}}{\int q(y^n)e^{-L_n(y^n,\theta)}\,dy^n}\left\{\left(\int \frac{w(\xi)e^{-L_n(X^n,\xi)}}{\int q(y^n)e^{-L_n(y^n,\xi)}\,dy^n}\,d\xi\right)^{-1} - 1\right\}\right]d\theta = 0.$$
Thus we have
$$0 \le \lim_{n\to\infty} E_{m_{p^*_n}} D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \le \lim_{n\to\infty} E_{m_{p_n}} D\big(w_{p_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \le 0. \qquad\Box$$

Comment: This theorem shows that, asymptotically, the n-dimensional MIL does not in fact update the prior at all. In this sense the MIL is minimally informative. Also, one may use the product of n-fold 1-dimensional MILs and do regular Bayesian updating given the data for independent observations. We have seen in Proposition 1.5.1 that the product of marginals is the product density closest in Kullback-Leibler distance to a joint density. Accordingly, we may use a product of unidimensional MILs when the dependence is believed to be slight or absent. If the dependence cannot be ignored we have nevertheless done the best possible subject to dependence.

In cases where the data may be assumed iid, we expect to get consistency results for the MIL parallel to those for usual likelihoods.
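The key limit in Step 1 of the proof of Theorem 3.1.1, $I(a_n)\to b$ as $a_n\to 0$, can be checked numerically on a small discrete example. The sketch below is illustrative only: the grids, the density `q`, the prior `w`, the squared-error loss and the value `s = 0.5` are assumptions chosen so that $sb < 1$, and the helper name `tilted_average_loss` is hypothetical.

```python
import math

def tilted_average_loss(a, xs, thetas, q, w, loss):
    """Discrete analogue of the double integral in (3.1.5):
    I(a) = sum_i w_i * [sum_j q_j e^{-a L_ij} L_ij / sum_j q_j e^{-a L_ij}]."""
    total = 0.0
    for wi, t in zip(w, thetas):
        weights = [qj * math.exp(-a * loss(x, t)) for qj, x in zip(q, xs)]
        num = sum(wt * loss(x, t) for wt, x in zip(weights, xs))
        total += wi * num / sum(weights)
    return total

# Illustrative (assumed) discrete setup.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
thetas = [-1.0, 0.0, 1.0]
q = [0.1, 0.2, 0.4, 0.2, 0.1]
w = [0.25, 0.5, 0.25]
sq = lambda x, t: (x - t) ** 2

b = tilted_average_loss(0.0, xs, thetas, q, w, sq)  # a = 0 gives b itself
s = 0.5                                             # assumed limit of n * a_n
for n in (1, 10, 100, 1000):
    a_n = s / n
    constraint = n * a_n * tilted_average_loss(a_n, xs, thetas, q, w, sq)
    print(n, round(constraint, 6), "limit s*b =", round(s * b, 6))
```

In this discrete setting the exponential tilting can only lower the $q$-average of the loss for each $\theta$, so the printed constraint values stay at or below $sb$, which is the mechanism that places the product family in $\mathcal{P}_n$ when $sb<1$.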
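Here we only consider these questions heuristically from a Bayesian standpoint and may investigate them in detail in our future studies.

Specifically, assume the data are iid and model their common distribution by the MIL, i.e. choose
$$p(x^n|\theta) = \prod_{i=1}^n C(\theta)m(x_i)e^{-\lambda L(x_i,\theta)},$$
where $C(\theta) = \big(\int m(x)e^{-\lambda L(x,\theta)}\,dx\big)^{-1}$ is the normalizing constant. Now, the log likelihood is
$$G(\theta|\mathbf{x}) = n\log C(\theta) - \lambda\sum_{i=1}^n L(x_i,\theta) + \sum_{i=1}^n \log m(x_i).$$
So the m.l.e. $\hat\theta_n$ of $\theta$ based on the MIL can be obtained by solving the equation
$$\Psi(\mathbf{x},\theta) = \frac{\partial G(\theta|\mathbf{x})}{\partial\theta} = 0.$$
Denote the MIL given $\theta$ by $p^*_\theta$ and the true density of $X$ by $p_{\theta_0}$; here we assume the same parametrization for both $p^*_\theta$ and $p_{\theta_0}$. Essentially, we are using the likelihood equation from the MIL as an estimating equation whose solution is the wrong model m.l.e. (under $P_\theta$). Such estimators are typically consistent and asymptotically normal, even if their asymptotic variance exceeds the bound given by the Fisher information. Let
$$\xi_{\theta_0}(\theta) = E_{p^*_\theta}\Big(\frac{\partial}{\partial\theta}L(X,\theta)\Big) - E_{p_{\theta_0}}\Big(\frac{\partial}{\partial\theta}L(X,\theta)\Big).$$
By modifying the proof of Jørgensen and Labouriau (1994), we have the following consistency result for the m.l.e. based on the MIL.

Theorem 3.1.2. Assume $C(\theta)$ and $\frac{\partial}{\partial\theta}L(X,\theta)$ exist and are continuous in $\theta$ almost everywhere with respect to $P_{\theta_0}$, and that there exists a $\delta_0 > 0$ such that for all $\theta\in(\theta_0-\delta_0,\theta_0)$, $\xi_{\theta_0}(\theta) > 0$, and for all $\theta\in(\theta_0,\theta_0+\delta_0)$, $\xi_{\theta_0}(\theta) < 0$. Then there exists a sequence of roots $\{\hat\theta_n\}$ of $\Psi(\mathbf{x},\theta)$ such that
$$\hat\theta_n \xrightarrow{P_{\theta_0}} \theta_0, \quad\text{as } n\to\infty.$$

Proof: First note
$$\frac{\partial}{\partial\theta}\log C(\theta) = \lambda\,\frac{\int \frac{\partial}{\partial\theta}L(x,\theta)\,m(x)e^{-\lambda L(x,\theta)}\,dx}{\int m(x)e^{-\lambda L(x,\theta)}\,dx} = \lambda E_{p^*_\theta}\Big(\frac{\partial}{\partial\theta}L(X,\theta)\Big),$$
so, for a single observation $X$,
$$E_{p_{\theta_0}}\Psi(X,\theta) = \frac{\partial}{\partial\theta}\log C(\theta) - \lambda\int \frac{\partial}{\partial\theta}L(x,\theta)\,p(x|\theta_0)\,dx = \lambda\xi_{\theta_0}(\theta).$$
Thus, taking $\delta\in(0,\delta_0)$, by the strong law of large numbers
$$\frac{1}{n}\Psi(\mathbf{X},\theta_0-\delta) \xrightarrow{P_{\theta_0}} \lambda\xi_{\theta_0}(\theta_0-\delta) > 0,$$
and
$$\frac{1}{n}\Psi(\mathbf{X},\theta_0+\delta) \xrightarrow{P_{\theta_0}} \lambda\xi_{\theta_0}(\theta_0+\delta) < 0,$$
as $n\to\infty$. Hence for large $n$ we have
$$\Psi(\mathbf{X},\theta_0-\delta) > 0 \quad\text{and}\quad \Psi(\mathbf{X},\theta_0+\delta) < 0.$$
By the continuity of $\Psi(\mathbf{X},\theta)$, there exists a root $\hat\theta_n(\delta)$ of $\Psi(\mathbf{X},\theta) = 0$ in the interval $(\theta_0-\delta,\theta_0+\delta)$ such that
$$P_{\theta_0}\big(|\hat\theta_n(\delta)-\theta_0| < \delta\big) \to 1, \quad\text{as } n\to\infty.$$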
Now, instead of $\hat\theta_n(\delta)$, we take the root $\hat\theta_n$ which is closest to $\theta_0$; this root does not depend on $\delta$ and also satisfies
$$P_{\theta_0}\big(|\hat\theta_n-\theta_0| < \delta\big) \to 1, \quad\text{as } n\to\infty. \qquad\Box$$

Next we state a well known result for the asymptotic normality of the solution of an estimating equation. We first recall the definition of a regular inference function. An inference function $\Psi(X,\theta)$ is regular if and only if for all $\theta$
i) $E_{P_\theta}\Psi(X,\theta) = 0$;
ii) $\partial\Psi(x,\theta)/\partial\theta$ exists for $\mu$-almost all $x$, where $\mu$ is the common $\sigma$-finite dominating measure for the likelihood: $p(\cdot|\theta) = dP(\cdot|\theta)/d\mu$;
iii) The order of integration and differentiation may be changed:
$$\frac{d}{d\theta}\int \Psi(x,\theta)p(x|\theta)\,\mu(dx) = \int \frac{\partial}{\partial\theta}\big[\Psi(x,\theta)p(x|\theta)\big]\,\mu(dx);$$
iv) $0 < E_{P_\theta}\{\Psi^2(X,\theta)\} < \infty$;
v) $0 < E^2_{P_\theta}\{\partial\Psi(X,\theta)/\partial\theta\} < \infty$.

Within the context of Estimating Equations (see, for example, Godambe, 1960 or Jørgensen and Labouriau, 1994), we know that under the above regularity conditions
$$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{D} N(0,\sigma^2(\theta)) \quad\text{as } n\to\infty,$$
where the asymptotic variance is given through the Godambe information by
$$\sigma^2(\theta) = \frac{E_{P_\theta}\Psi^2(X,\theta)}{E^2_{P_\theta}\{\partial\Psi(X,\theta)/\partial\theta\}}.$$

Now, we have used an MIL to generate an estimator; its m.l.e. is consistent and asymptotically normal. In principle, we can examine the optimality of this estimator in terms of the Godambe information. However, for the present, we note that the above results on consistency and asymptotic normality suggest, but do not prove, that the posterior density formed from an MIL concentrates asymptotically at the true value of the parameter in a mode of convergence defined by the true model, i.e.,
$$w(\theta|X^n) \xrightarrow{P_{\theta_0}} \theta_0.$$
This conjecture is supported by Strasser (1981), who demonstrates that Bayes posterior consistency is weaker than frequentist m.l.e. consistency. Indeed, Strasser showed that any set of conditions ensuring consistency of the m.l.e. also ensures posterior concentration. In the present context, we would want to use Laplace's method of integration on $m(x^n)$ at $\hat\theta_n$ to extend Walker's proof.
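Indeed, a modification of Walker's (1969) proof should give the desired consistency and asymptotic normality of the posterior formed from an MIL.

3.2 Small Sample Properties of MIL

Now, we turn to a non-asymptotic sense in which the MIL as we have defined it is minimally informative. Let $p^*_n(x^n|\theta)$ be the MIL from $\mathcal{P}_n$ based on $w$ and let $w_{p^*_n}(\theta|x^n)$ be the posterior formed from $w(\theta)$ and $p^*_n(x^n|\theta)$. Following Csiszar (1975), the tangent hyperplane determined by $w(\theta)$ and $w_{p^*_n}(\theta|x^n)$ is given by
$$H(x^n,w,w_{p^*_n}) = \Big\{w' : \int w'(\theta)\log\frac{w_{p^*_n}(\theta|x^n)}{w(\theta)}\,d\theta = D\big(w_{p^*_n}(\cdot|x^n)\,\|\,w(\cdot)\big)\Big\}.$$
Let $p\in\mathcal{P}_n$ be any given density. The tangent hyperplane determined by $w(\theta)$ and $w_p(\theta|x^n)$ is
$$H(x^n,w,w_p) = \Big\{w' : \int w'(\theta)\log\frac{w_p(\theta|x^n)}{w(\theta)}\,d\theta = D\big(w_p(\cdot|x^n)\,\|\,w(\cdot)\big)\Big\}.$$
The two tangent hyperplanes divide the whole space of priors into subspaces, one of which, denoted by $S(x^n,w,w_{p^*_n},w_p)$, is
$$\Big\{w' : \int w'(\theta)\log\frac{w_{p^*_n}(\theta|x^n)}{w(\theta)}\,d\theta \le D\big(w_{p^*_n}(\cdot|x^n)\,\|\,w(\cdot)\big),\ \int w'(\theta)\log\frac{w_p(\theta|x^n)}{w(\theta)}\,d\theta \ge D\big(w_p(\cdot|x^n)\,\|\,w(\cdot)\big)\Big\}.$$
Let $S_n(w,w_{p^*_n},w_p) = \bigcap_{x^n} S(x^n,w,w_{p^*_n},w_p)$, which is a subspace in the prior space independent of the data. Let $w_0$ be a member of $S_n(w,w_{p^*_n},w_p)$. We show that, on average, using the MIL likelihood $p^*_n(x^n|\theta)$ to update $w(\cdot)$ gives a posterior $w_{p^*_n}(\theta|x^n)$ further from $w_0$ in Kullback-Leibler distance than any other likelihood $p(x^n|\theta)$ in $\mathcal{P}_n$ does, i.e., $w_{p^*_n}$ is further away from any untrue $w_0$ than any other $w_p$.

To get a pointwise result in the above sense, let
$$U(w,w_{p^*_n},w_p) = \big\{x^n : D\big(w_{p^*_n}(\cdot|x^n)\,\|\,w(\cdot)\big) \le D\big(w_p(\cdot|x^n)\,\|\,w(\cdot)\big)\big\}.$$
Since $E_{m_{p^*_n}}D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \le E_{m_p}D\big(w_p(\cdot|X^n)\,\|\,w(\cdot)\big)$, it is likely that $U(w,w_{p^*_n},w_p)\ne\emptyset$.

Theorem 3.2.1. (i) If $x^n\in U(w,w_{p^*_n},w_p)$, and $w_0\in S(x^n,w,w_{p^*_n},w_p)$, then
$$D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|x^n)\big) \ge D\big(w_0(\cdot)\,\|\,w_p(\cdot|x^n)\big). \tag{3.2.1}$$
(ii) If for some $n$, $w_0\in S_n(w,w_{p^*_n},w_p)$, then
$$E_{m_{p^*_n}}D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|X^n)\big) \ge E_{m_p}D\big(w_0(\cdot)\,\|\,w_p(\cdot|X^n)\big). \tag{3.2.2}$$

Proof: (i) Since
$$\int w'(\theta)\log\frac{w_{p^*_n}(\theta|x^n)}{w(\theta)}\,d\theta = D\big(w'(\cdot)\,\|\,w(\cdot)\big) - D\big(w'(\cdot)\,\|\,w_{p^*_n}(\cdot|x^n)\big)$$
and
$$\int w'(\theta)\log\frac{w_p(\theta|x^n)}{w(\theta)}\,d\theta = D\big(w'(\cdot)\,\|\,w(\cdot)\big) - D\big(w'(\cdot)\,\|\,w_p(\cdot|x^n)\big),$$
we see that $w_0\in S(x^n,w,w_{p^*_n},w_p)$ implies that
$$D\big(w_0(\cdot)\,\|\,w(\cdot)\big) \le D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|x^n)\big) + D\big(w_{p^*_n}(\cdot|x^n)\,\|\,w(\cdot)\big),$$
and that
$$D\big(w_0(\cdot)\,\|\,w(\cdot)\big) \ge D\big(w_0(\cdot)\,\|\,w_p(\cdot|x^n)\big) + D\big(w_p(\cdot|x^n)\,\|\,w(\cdot)\big).$$
Since $D\big(w_{p^*_n}(\cdot|x^n)\,\|\,w(\cdot)\big) \le D\big(w_p(\cdot|x^n)\,\|\,w(\cdot)\big)$ for $x^n\in U(w,w_{p^*_n},w_p)$, by the above two inequalities we have
$$D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|x^n)\big) \ge D\big(w_0(\cdot)\,\|\,w_p(\cdot|x^n)\big).$$
(ii) Since $w_0\in S_n(w,w_{p^*_n},w_p)$, we have that
$$D\big(w_0(\cdot)\,\|\,w(\cdot)\big) \le D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|X^n)\big) + D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big),$$
and that
$$D\big(w_0(\cdot)\,\|\,w(\cdot)\big) \ge D\big(w_0(\cdot)\,\|\,w_p(\cdot|X^n)\big) + D\big(w_p(\cdot|X^n)\,\|\,w(\cdot)\big).$$
Taking expectations we have
$$D\big(w_0(\cdot)\,\|\,w(\cdot)\big) \le E_{m_{p^*_n}}D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|X^n)\big) + E_{m_{p^*_n}}D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big),$$
and
$$D\big(w_0(\cdot)\,\|\,w(\cdot)\big) \ge E_{m_p}D\big(w_0(\cdot)\,\|\,w_p(\cdot|X^n)\big) + E_{m_p}D\big(w_p(\cdot|X^n)\,\|\,w(\cdot)\big).$$
By definition of $p^*_n(x^n|\theta)$ we have
$$E_{m_{p^*_n}}D\big(w_{p^*_n}(\cdot|X^n)\,\|\,w(\cdot)\big) \le E_{m_p}D\big(w_p(\cdot|X^n)\,\|\,w(\cdot)\big),$$
so we have
$$E_{m_{p^*_n}}D\big(w_0(\cdot)\,\|\,w_{p^*_n}(\cdot|X^n)\big) \ge E_{m_p}D\big(w_0(\cdot)\,\|\,w_p(\cdot|X^n)\big). \qquad\Box$$

3.3 Behavior of the MIL for large and small values of $\lambda$

Clearly, the MIL depends on the choice of $\lambda$ (or equivalently $l$) used to define $\mathcal{P}_l$. In this section, we prove two theorems that show how the size of $\lambda$ affects the behavior of the MIL. To emphasize the dependence of the MIL on $\lambda$, we write $p^*_\lambda(x|\theta)$ for the MIL, and we denote the corresponding marginal density by $m^*_\lambda(x)$, and the corresponding posterior density by $w^*_\lambda(\theta|x)$. Let $\zeta(\theta)$ be the degenerate probability mass function at $\theta$, let $\xrightarrow{D}$ denote convergence in distribution, and let $\mu(\cdot)$ be the Lebesgue measure on $R^1$. First, we characterize the behavior of the MIL for $\lambda$ large. For simplicity, we only prove the results for the one dimensional data case; the proofs are also valid for the n-dimensional data case.

Theorem 3.3.1. (i) The marginal density for $X$ from $p^*_\lambda(x|\theta)$ is $m^*_\lambda(x)$, i.e.,
$$m^*_\lambda(x) = \int p^*_\lambda(x|\theta)w(\theta)\,d\theta.$$
Let $S$ be the support of $w(\cdot)$, with interior $S^\circ$, and let $C$ be the set of points in $S$ at which $w$ is continuous.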
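Assume $L(x,\theta) = r(|x-\theta|)$ is strictly increasing in $|x-\theta|$, with $r(0) = 0$, and $r(s+t)\ge r(s)+r(t)$ for all $s\ge 0$, $t\ge 0$. Then as $\lambda\to\infty$, we have the following.
(ii) The marginal density for the data satisfies
$$\lim_{\lambda\to\infty} m^*_\lambda(x) = w(x), \quad\text{a.e. } \mu(\cdot). \tag{3.3.1}$$
(iii) The MIL densities satisfy
$$p^*_\lambda(\cdot|\theta) \xrightarrow{D} \zeta(\theta), \quad \forall\theta\in S^\circ\cap C, \tag{3.3.2}$$
and
$$w^*_\lambda(\cdot|x) \xrightarrow{D} \zeta(x), \quad \forall x\in S^\circ\cap C. \tag{3.3.3}$$

Proof: (i) Since
$$p^*_\lambda(x|\theta) = \frac{m^*_\lambda(x)e^{-\lambda L(x,\theta)}}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy},$$
if $p^*_\lambda(x|\theta) > 0$, then $m^*_\lambda(x) > 0$, so
$$\int p^*_\lambda(x|\theta)w(\theta)\,d\theta = m^*_\lambda(x)\int \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta = m^*_\lambda(x),$$
by (1.2.1.5). If $p^*_\lambda(x|\theta) = 0$, then $m^*_\lambda(x) = 0$, and we still have $m^*_\lambda(x) = \int p^*_\lambda(x|\theta)w(\theta)\,d\theta$.

(ii) To prove the result, recall that $m^*_\lambda(x)$ is determined by
$$\int \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \le 1, \tag{3.3.4}$$
where equality holds for $x\in S_\lambda$, where $S_\lambda$ is the support of $m^*_\lambda(\cdot)$. Since $w(\cdot)$ and $m^*_\lambda(\cdot)$ are integrable, they are continuous almost everywhere; without loss of generality we restrict to the continuity points of $w(\cdot)$. Take $\delta > 0$ small; then for all $x\in S\cap C$, by (3.3.4) we have
$$1 \ge \int_{[x-\delta,x+\delta]} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy\,\big(1+h(\lambda,\theta)\big)}\,d\theta + \int_{[x-\delta,x+\delta]^c} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta, \tag{3.3.5}$$
where
$$h(\lambda,\theta) = \frac{\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}{\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}.$$
We show that the second term on the right hand side of (3.3.5) tends to zero as $\lambda$ tends to infinity, and that $h(\lambda,\theta)$ is negligibly small for large $\lambda$, so the remaining part of (3.3.5) gives a ratio which is approximately $w(x)/m^*_\lambda(x)$, and equals 1. There are six steps in the proof.

Step 1: Show that the second term in (3.3.5) goes to zero as $\lambda$ increases to infinity, i.e.
$$\int_{[x-\delta,x+\delta]^c} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \to 0, \quad\text{as } \lambda\to\infty.$$
Indeed, the second term in (3.3.5) equals
$$\int_{-\infty}^{x-\delta} \frac{e^{-\lambda r(x-\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta + \int_{x+\delta}^{\infty} \frac{e^{-\lambda r(\theta-x)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta.$$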
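Since for all $\theta\in(-\infty,x-\delta]$, $r(x-\theta) = r(\delta + x-\delta-\theta) \ge r(\delta) + r(x-\delta-\theta) = r(\delta) + L(x-\delta,\theta)$, and for all $\theta\in[x+\delta,\infty)$, $r(\theta-x) = r(\delta+\theta-(x+\delta)) \ge r(\delta) + r(\theta-(x+\delta)) = r(\delta) + L(x+\delta,\theta)$, the second term in (3.3.5) is bounded from above by
$$e^{-\lambda r(\delta)}\left(\int_{-\infty}^{x-\delta} \frac{e^{-\lambda L(x-\delta,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta + \int_{x+\delta}^{\infty} \frac{e^{-\lambda L(x+\delta,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta\right)$$
$$\le e^{-\lambda r(\delta)}\left(\int \frac{e^{-\lambda L(x-\delta,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta + \int \frac{e^{-\lambda L(x+\delta,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta\right) \le 2e^{-\lambda r(\delta)} \to 0,$$
as $\lambda\to\infty$, since by (3.3.4)
$$\int \frac{e^{-\lambda L(x\pm\delta,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \le 1.$$
This completes the proof of Step 1.

Now, from Step 1 and (3.3.5), we have
$$1 \ge \int_{[x-\delta,x+\delta]} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy\,\big(1+h(\lambda,\theta)\big)}\,d\theta + o(1), \tag{3.3.6}$$
where $o(1)$ goes to zero as $\lambda\to\infty$. For fixed $\theta$, it is easy to show $h(\lambda,\theta)\to 0$, but this may not hold uniformly for $\theta\in[x-\delta,x+\delta]$, the domain of integration in (3.3.6). So, we split $[x-\delta,x+\delta]$ into a "good" set on which $h(\lambda,\theta)$ is uniformly small, and a "bad" set on which $h(\lambda,\theta)$ is not small. We show the "bad" set is negligible in Lebesgue measure for large $\lambda$. Formally, let $\epsilon > 0$, and let
$$A_\lambda = \{\theta\in[x-\delta,x+\delta]\ |\ h(\lambda,\theta)\ge\epsilon\}.$$
If $A_\lambda$ is contained in a sub-interval of $[x-\delta,x+\delta]$ which excludes $x$, we can reduce $\delta$ and there is nothing to prove; otherwise the Lebesgue measure of $A_\lambda$ is controlled as follows.

Step 2: We show that as $\lambda\to\infty$,
$$\mu\big(A_\lambda\cap[x-\delta/2,x+\delta/2]\big) = O\big(e^{-\lambda(r(\delta)-r(\delta/2))}\big).$$
By reducing the domain of integration in (3.3.5) we have
$$1 \ge \int_{[x-\delta,x+\delta]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy + \int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \ge \int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{(\epsilon^{-1}+1)\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta, \tag{3.3.7}$$
since $h(\lambda,\theta)\ge\epsilon$ on $A_\lambda$, i.e. we have
$$\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy \le \frac{1}{\epsilon}\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy.$$
We can bound the $e^{-\lambda L(x,\theta)}$ in the numerator of (3.3.7) from below by $e^{-\lambda r(\delta/2)}$, and bound the $e^{-\lambda L(y,\theta)}$ in the denominator of (3.3.7) from above by $e^{-\lambda r(\delta)}$.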
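This means that (3.3.7) is bounded below by
$$\frac{e^{-\lambda r(\delta/2)}}{(\epsilon^{-1}+1)e^{-\lambda r(\delta)}}\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{w(\theta)}{\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(y)\,dy}\,d\theta \ge \frac{e^{\lambda(r(\delta)-r(\delta/2))}}{\epsilon^{-1}+1}\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} w(\theta)\,d\theta \ge \frac{e^{\lambda(r(\delta)-r(\delta/2))}}{\epsilon^{-1}+1}\,\frac{w(x)}{2}\,\mu\big([x-\delta/2,x+\delta/2]\cap A_\lambda\big),$$
since the continuity of $w$ at $x$ guarantees that for $\delta$ small we have $w(\theta)\ge w(x)/2$ when $\theta\in[x-\delta/2,x+\delta/2]$. Step 2 now follows.

Now by Step 1, and the definition of $A_\lambda$, we get
$$1 \ge \int_{[x-\delta/2,x+\delta/2]\cap A_\lambda^c} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy\,\big(1+o(1)\big)}\,d\theta + \int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \tag{3.3.8}$$
as $\lambda$ tends to infinity. We will see that the second term in the right hand side of (3.3.8) tends to zero as $\lambda$ tends to infinity. Also, by the mean value theorem for integrals, we will see that the first term of the right hand side of (3.3.8) becomes $w(\xi)$ over $m^*_\lambda(\eta)$ times an integral tending to 1.

Step 3: As $\lambda\to\infty$,
$$\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda^c} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy}\,d\theta \to 1.$$
We start by showing that as $\lambda\to\infty$,
$$\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy}\,d\theta \to 0. \tag{3.3.9}$$
Indeed, let $0<\delta'<\delta$ satisfy $r(\delta') < r(\delta) - r(\delta/2)$. Now, the left hand side of (3.3.9) is bounded from above by
$$\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta',\theta+\delta']} e^{-\lambda L(y,\theta)}\,dy}\,d\theta.$$
Since the numerator is bounded above by 1, and for $y\in[\theta-\delta',\theta+\delta']$ we have $L(y,\theta)\le r(\delta')$, the last expression is bounded above by
$$\frac{\mu\big([x-\delta/2,x+\delta/2]\cap A_\lambda\big)}{2\delta' e^{-\lambda r(\delta')}} \le \frac{K e^{-\lambda[r(\delta)-r(\delta/2)-r(\delta')]}}{2\delta'} \to 0,$$
as $\lambda\to\infty$, for some constant $K$.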
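Now by adding and subtracting the left hand side of (3.3.9) to the left hand side of Step 3, the integral in Step 3 becomes
$$\int_{[x-\delta/2,x+\delta/2]} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy}\,d\theta + o(1)$$
$$= \int_{[x-\delta,x+\delta]} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy}\,d\theta - \int_{\delta/2<|x-\theta|\le\delta} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy}\,d\theta + o(1)$$
$$= \frac{\int_0^\delta e^{-\lambda r(t)}\,dt}{\int_0^\delta e^{-\lambda r(s)}\,ds} - \frac{\int_{\delta/2}^\delta e^{-\lambda r(t)}\,dt}{\int_0^\delta e^{-\lambda r(s)}\,ds} + o(1).$$
Since the absolute value of the second term is not greater than
$$\frac{(\delta/2)\,e^{-\lambda r(\delta/2)}}{\int_0^\delta e^{-\lambda r(s)}\,ds} \to 0, \quad\text{as } \lambda\to\infty,$$
Step 3 is complete.

To ensure that the equality is achieved in (3.3.5), we first verify that $m^*_\lambda(x)$ is positive in a stronger sense.

Step 4: We show that for all $x\in S^\circ$,
$$\liminf_{\lambda\to\infty} m^*_\lambda(x) > 0. \tag{3.3.10}$$
To prove Step 4, note that by (3.3.8) we have (we will show later that the second term in (3.3.8) tends to zero)
$$1 \ge \int_{[x-\delta/2,x+\delta/2]\cap A_\lambda^c} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int_{[\theta-\delta,\theta+\delta]} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy\,\big(1+o(1)\big)}\,d\theta + o(1)$$
$$= w(\xi)\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda^c} \frac{e^{-\lambda L(x,\theta)}}{m^*_\lambda(\eta)\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy\,\big(1+o(1)\big)}\,d\theta + o(1)$$
$$= \frac{w(\xi)}{m^*_\lambda(\eta)\big(1+o(1)\big)}\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda^c} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta,\theta+\delta]} e^{-\lambda L(y,\theta)}\,dy}\,d\theta + o(1)$$
by using the mean value theorem of integration twice, where $\xi\in[x-\delta/2,x+\delta/2]\cap A_\lambda^c$, and $\eta\in[\theta-\delta,\theta+\delta]\subset[x-\tfrac{3}{2}\delta,x+\tfrac{3}{2}\delta]$, since both $w(\cdot)$ and $m^*_\lambda(\cdot)$ are continuous at $x$. By Step 3, the last expression is
$$\frac{w(\xi)}{m^*_\lambda(\eta)\big(1+o(1)\big)} + o(1). \tag{3.3.11}$$
Now, Step 4 follows by way of contradiction: Suppose $m^*_\lambda(x)\to 0$ as $\lambda\to\infty$. Then there exist $\delta > 0$ and $\lambda$ so large that $w(\xi)/m^*_\lambda(\eta) > 1$, which is impossible by the above inequality. This means we must have $\liminf_{\lambda\to\infty} m^*_\lambda(x) > 0$, i.e. Step 4 is completed. Note that the result of Step 4 applies to each $y\in[x-\delta,x+\delta]\subset S$.

Now we prove that the second term in the right hand side of (3.3.8) is small as $\lambda$ increases.

Step 5: As $\lambda\to\infty$, we have that
$$\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \to 0. \tag{3.3.12}$$
To see this, let $\epsilon > 0$ and $\delta'$ be as in Step 3, and let
$$B_{\lambda,\epsilon} = \{y\in[x-\delta/2-\delta',x+\delta/2+\delta']\ |\ m^*_\lambda(y)\ge\epsilon\}.$$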
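By choosing $\epsilon$ small enough and $\lambda$ large enough, $B_{\lambda,\epsilon}$ can be made as close to $[x-\delta/2-\delta',x+\delta/2+\delta']$ in $\mu(\cdot)$ measure as we want, i.e. for small $\epsilon$ and large $\lambda$ we have
$$\mu\big([x-\delta/2-\delta',x+\delta/2+\delta']\big) - \mu(B_{\lambda,\epsilon}) \le \delta'/2.$$
By intersecting the interval with $B_{\lambda,\epsilon}$ and $B_{\lambda,\epsilon}^c$ we get
$$\mu\big(B_{\lambda,\epsilon}\cap[\theta-\delta',\theta+\delta']\big) \ge \frac{\delta'}{2}, \quad \forall\theta\in[x-\delta/2,x+\delta/2].$$
Hence the left hand side of (3.3.12) is bounded above by
$$\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int_{[\theta-\delta',\theta+\delta']\cap B_{\lambda,\epsilon}} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}\,d\theta \le \frac{2w(x)}{\epsilon}\int_{[x-\delta/2,x+\delta/2]\cap A_\lambda} \frac{e^{-\lambda L(x,\theta)}}{\int_{[\theta-\delta',\theta+\delta']\cap B_{\lambda,\epsilon}} e^{-\lambda L(y,\theta)}\,dy}\,d\theta,$$
since $w(\theta)\le 2w(x)$ for all $\theta\in[x-\delta/2,x+\delta/2]$, and $m^*_\lambda(y)\ge\epsilon$ for all $y\in B_{\lambda,\epsilon}$. Noting that $e^{-\lambda L(x,\theta)}\le 1$, and $e^{-\lambda L(y,\theta)}\ge e^{-\lambda r(\delta')}$ for $y\in[\theta-\delta',\theta+\delta']\cap B_{\lambda,\epsilon}$, the last expression can be bounded above by
$$\frac{2w(x)}{\epsilon}\,\frac{\mu\big([x-\delta/2,x+\delta/2]\cap A_\lambda\big)}{e^{-\lambda r(\delta')}\inf_{\theta\in[x-\delta/2,x+\delta/2]}\mu\big(B_{\lambda,\epsilon}\cap[\theta-\delta',\theta+\delta']\big)} \le \frac{4w(x)K}{\epsilon}\,\frac{e^{-\lambda(r(\delta)-r(\delta/2)-r(\delta'))}}{\delta'} \to 0, \quad\text{as } \lambda\to\infty,$$
with $K$ the constant of Step 3, by Step 4. Now, Step 5 is complete.

Now let $\xi$ and $\eta$ be as in (3.3.11), and write $w(\xi) = w(x) + o(1)$, and $m^*_\lambda(\eta) = m^*_\lambda(x) + o(1)$. By definition of $m^*_\lambda(\cdot)$, equality is achieved in (3.3.5) for $x\in S_\lambda$, thus (3.3.11) achieves the upper bound 1, i.e.
$$1 = \frac{w(\xi)}{m^*_\lambda(\eta)\big(1+o(1)\big)} + o(1).$$
Letting $\delta\to 0$, we get
$$1 = \frac{w(x)}{\lim_{\lambda\to\infty} m^*_\lambda(x)},$$
proving (3.3.1) for $x\in S\cap C$.

For the final step in the proof, let $x\in S^c$. If for such an $x$ we have $m^*_\lambda(x) = 0$ for all $\lambda$, then the conclusion follows. Otherwise, we must have $m^*_{\lambda_k}(x) > 0$ for some sequence $\{\lambda_k\}$ which satisfies $\lim_{k\to\infty}\lambda_k = \infty$ and $\liminf_{k\to\infty} m^*_{\lambda_k}(x) > 0$. For this case, it is enough to show $\liminf_{k\to\infty} m^*_{\lambda_k}(\cdot) = 0$ a.e. on a small neighborhood of $x$. Without loss of generality, assume $x$ lies on the left of $S$.

Step 6: There is a $\delta$ such that $0<\delta<d$, where $d = \inf_{y\in S}|x-y|$ is the distance between $x$ and $S$, so that $m^*_\lambda(\cdot)\to 0$, a.e. $\mu(\cdot)$ on $(x,x+\delta]$. We prove this by way of contradiction.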
If this is not true, there is a least 6 satisfying such that 0 < 6 < d, and l i m ^ ^ inf m*Xk{x) > 0 a.e. on (x,x + 6]. N o w , let a > 0 and wri te D\k = {y £ [x + 6/2,x + 6] | m*Xk(y) > a}. Choose a so smal l and k so large that »(DXk) > 6/4. Since mXk(x) > 0, (3.3.4) implies e - \ k L { x , 9 ) w ^ -I. -de. Is f m*Xj{y)e-xkL{y,e)dy Let M > 0, and let SM be a closed set satisfying s u p 0 e S M w(e) < M < oo, and 1-<I This gives that 2 - is fm*Xk(y)e-x^(y,e)dy e-\kL(x,e)w^ d6. V rn*Xk(y)e-x*L(y,e)dy de 68 Since L(x,9) > L(y,9) + r(6/2),Vy £ D\k,V9 £ SM^ the last expression can be bounded by / 7 w r d 0 - (3.3.13) Jsu fn. e-x*L(y'e)dy v ' lsM IDx e-x*L(y><>)dy B y an argument s imilar to that used i n Step 3, we have that e - \ k L ( y , 6 ) e - \ k L ( y , 6 ) d y d6 —> 1, as k —* oo. In part icular , for large k, the last expression is bounded from above by 2, and the right hand side of (3.3.13) is bounded by l ^ e - A ^ / 2 ) _> o, which is a contradict ion thereby establishing Step 6. Th i s completes the proof of part (i) because we have shown that Vx £ Sc, i f l i m i n f A->OO m*x{x) > 0, then l i m A ^ o o mx(') — 0, a.e. on ( x , x + 6], for some 0 < 6 < d. i.e. l i m raA(x) = 0, a.e. /i(-) on Sc. X—t-oo This last statement is equivalent to (3.3.1) for x £ S°. ( i i i) To prove (3.3.2), let 4>\(-) be the characteristic function of p*x(-\9). We have Jm*x(x)e-XL(x'ehixtdx Im*x(y)e-XL(y'e)dy _ I[e-s,e+6] m*x(x)e-XL(x>%*x<dx + J[9_s<e+S]c m\(x)e-XL^e)e^dx f[e-6,e+6\ m*x{y)e-x^)dy + J [ e _ w ] c m\{y)e-xW) dy ' To simplify the expression, we first prove I[e-s,e+S]c m\{y)e-XL^)dy he-SMS] ml(y)e-x^)dy ~* ' as A —• oo. In fact f[e-8,e+6]<™x(y)e-XL(y'e)dy < I[g-s,e+s]c m*\(y)e~XL(y'e^ dy J[e-6,6+s]™*x(y)e-XL{y'e)dy ~ J[e-s/2,e+s/2]mx(y)e-XL(y'e)dy' Since 9 £ S°, Step 4 gives that l i m A ^ o o i n f m*x(y) > 0, pointwise for y £ [9-6/2, 9+6/2]nS°. For a > 0, let D A = {y £ [0 - 5/2, <9 + <S/2] | m A ( y ) > a}. 
N o w , for a smal l enough 69 and A large enough we have fi(Dx) > f. Since Vy e [9 - 6,9 + S)c, e~XL^'6) < e~x^s\ Vy € D\, e~XL(y'e) > e~Xz-(5l2\ the right hand side i n the above inequali ty is bounded by ke-SMSY m\{y)e-XL^)dy JDx ml(y)e-xL{yfi)dy e-Ar(g) J[e_Sd+s]c m\{y)dy - e-xL(s/2) fDxml(y)dy 2 c -A[r(S)-r(fi/2)] < ; > 0, as A —• co. Now by this inequali ty we have \S[e-s,e+S]^t(x)e-^e)e^dx\ I[e-s,e+s] ml(y)e-XLMdy + ^ _ M + 5 ] c m\{y)e-^vfi)dy < -7 — — — . r , , >• 0, as A —¥ 0 0 . " /[*-«,*+«] mj;(y)e->W)dy thus we have , J [ g - g , g + 5 ] ^ A ( ^ ) ^ A L ( 3 ; ' e ) ^ ^ m W J " ( V W ] m j ( y ) e - ^ ) d y ) ( l + o(l)) + ° W _ f[9-6,e+6] m\(x)e-XL^)e^dx 119-6,6+6] ml(y)e-W)dy + °[ j = I[9-6,9+6] m*x(x)e-XL^s) cos(xt)dx J[G_ w ] m ^ a Q e - ^ M ) sm(xt)dx he-SM6] m\{y)e-x^)dy + % J [ 8 _ w ] mJ(y)e-^M)dy + ° { } = J1(X,t) + iJ2(X,t) + o(l). Obviously, for a l l A inf cos(a;i) < J\(X,t) < sup cos(a;i) xe[9-8,9+6] x6[9-6,9+6] and l i m inf cos(xi) = l i m sup cos(xt) = cos(9t), 6^0xe[9-6,9+8] 6-M xe[g_Sto+Si so, l i m ^ o Ji(^, t) — cos(9t) holds for a l l A. Similar ly , for a l l A, l i m ^ o J2(X, t) = sm(9t). Thus , i f we first let 6 —> 0 and then let A —»• 0 0 , we get l i m <f>x(t) = em, A—too which is the characteristic function of £ (# ) . 70 To prove (3.3.3), let tp\(-) be the characteristic function of wx(-\x), then fe(»=/?;(,"w,)'i"^ / f m\(y)e-XL(yf)dy = / J\x-w(6)e-XL(x>ehiet d9 l*-6,x+6] S[e-&,e+S] m*x(y)e-x^)dy + S[e_5fi+S]c mx{y)e-xLMdyl j w(0)e-XL(x<e)eiet J[x-s,x+SY f m*x(y)e-XL(y'e)dy The absolute value of the second term i n the last expression is bounded by w(6)e-XL(-x<e) J\x -de -5,x+8Y f m*x(y)e-XL(y>e)dy which tends to zero as A tends to infinity, by Step 1. B y the transformation r = e — x, for large A we have where ( G [x - 8, x + 6], n G [6 - 8, e + 6] C [x - 28, x + 28]. B y the same reasoning as i n Step 4, £ and n tend to x as 8 tends to zero. Therefore, the rat io w(()/m*x(T]) tends to 1 by (3.3.1). 
Thus we have
$$\lim_{\lambda\to\infty}\psi_\lambda(t) = e^{ixt}, \quad\text{a.e. } \mu(\cdot) \text{ on } x\in S,$$
and part (iii) is proved. $\Box$

Our next result characterizes the behavior of the MIL for the range of the parameter $\lambda$ between zero and infinity and relates $\lambda$ to the $l$ used to define $\mathcal{P}_l$. Let $\underline{r} = \inf_x\int w(\theta)L(x,\theta)\,d\theta$ and $x_0 = \arg\inf_x\int w(\theta)L(x,\theta)\,d\theta$. It will be seen that when $l\ge\underline{r}$ the method breaks down because there is no necessary relationship between the data and the estimand $\theta$. The following theorem is proved for one dimensional data and parameter; the results and proof should be the same for random vectors and multi-dimensional parameters.

Theorem 3.3.2. Assume $L(\cdot,\theta)$ is not constant. Then we have:
(i) For $l\in(0,\underline{r})$, $p^*_\lambda(x|\theta)$ exists uniquely, $\inf_{p(\cdot|\theta)\in\mathcal{P}_l} I_p(\Theta,X) = I_{p^*_\lambda}(\Theta,X) > 0$, and it is a continuous, decreasing function of $l$.
(ii) For $l\in(0,\underline{r})$, $\lambda$ and $l$ determine each other uniquely. We can therefore write $l = l(\lambda)$, or $\lambda = \lambda(l)$.
(iii) For $l\in[\underline{r},\infty]$, $\inf_{p(\cdot|\theta)\in\mathcal{P}_l} I_p(\Theta,X) = 0$, and the infimum is achieved by any $p(x)\in\mathcal{P}_l$ which is independent of $\theta$.
(iv) Assume $D(m^*_{\lambda_2}\,\|\,m^*_{\lambda_1}) + D(m^*_{\lambda_1}\,\|\,m^*_{\lambda_2}) < \infty$ for $0<\lambda_1<\lambda_2$; then $l(\lambda_2)\le l(\lambda_1)$, i.e. $l(\cdot)$ is a decreasing function.
Under the conditions of Theorem 3.3.1, we have the following:
(v) $l(\lambda)\to 0$, as $\lambda\to\infty$.
(vi) $l(\lambda)\to\underline{r}$, as $\lambda\to 0$.
(vii) Let $P^*_\lambda(\cdot|\theta)$, $M^*_\lambda(\cdot)$, $W(\cdot)$ and $W^*_\lambda(\cdot|x)$ be the probability measures corresponding to $p^*_\lambda(x|\theta)$, $m^*_\lambda(x)$, $w(\theta)$ and $w^*_\lambda(\theta|x)$ respectively. If
$$M^*_\lambda(\cdot) \xrightarrow{D} M_0(\cdot) \tag{3.3.14}$$
for some probability measure $M_0(\cdot)$ as $l\to\underline{r}$ (or $\lambda\to 0$), then
$$P^*_\lambda(\cdot|\theta) \xrightarrow{D} M_0(\cdot), \tag{3.3.15}$$
$$W^*_\lambda(\cdot|x) \xrightarrow{D} W(\cdot). \tag{3.3.16}$$
(viii) Under the conditions of (vii), if $\underline{r}<\infty$, then $M_0(\cdot) = \zeta(x_0)$.

We comment that $l(\cdot)$ is usually continuous in examples, but we have not established a general result showing this.

Proof: (i) By Proposition 3.1, for given $l\in(0,\underline{r})$, $p^*_\lambda(\cdot|\theta)$ exists uniquely.
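As a purely numerical companion to parts (iv) to (vi) of Theorem 3.3.2, the sketch below evaluates the average loss $l(\lambda)$ on a small grid for several values of $\lambda$. It is an illustration under stated assumptions, not part of the proof: the grid, the uniform prior, the squared-error loss, the number of iterations and the function name `mil_distortion` are all hypothetical, and the marginal $m^*_\lambda$ is approximated by a Blahut-Arimoto style fixed-point iteration started from the uniform distribution.

```python
import math

def mil_distortion(xs, thetas, w, loss, lam, iters=2000):
    """Approximate m*_lam by the update m(x) <- m(x) * c(x), then return
    l(lam) = sum_i w_i sum_j p(x_j | theta_i) L(x_j, theta_i)."""
    # A[i][j] = exp(-lam * L(x_j, theta_i))
    A = [[math.exp(-lam * loss(x, t)) for x in xs] for t in thetas]
    m = [1.0 / len(xs)] * len(xs)  # uniform start (an assumption)
    for _ in range(iters):
        Z = [sum(mj * a for mj, a in zip(m, row)) for row in A]
        m = [m[j] * sum(wi * row[j] / Zi for wi, row, Zi in zip(w, A, Z))
             for j in range(len(xs))]
    Z = [sum(mj * a for mj, a in zip(m, row)) for row in A]
    return sum(wi * m[j] * A[i][j] / Z[i] * loss(xs[j], thetas[i])
               for i, wi in enumerate(w) for j in range(len(xs)))

# Illustrative setup.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
prior = [0.2] * 5
sq = lambda x, t: (x - t) ** 2
lams = [0.1, 0.5, 1.0, 2.0, 5.0]
l_values = [mil_distortion(grid, grid, prior, sq, lam) for lam in lams]

# Discrete analogue of the infimum of the average loss r(x) over x.
r_low = min(sum(wi * sq(x, t) for wi, t in zip(prior, grid)) for x in grid)

for lam, l in zip(lams, l_values):
    print("lambda =", lam, " l(lambda) ~", round(l, 4))
print("inf_x r(x) =", r_low)
```

One would expect the printed values to fall as $\lambda$ grows and to become small for large $\lambda$, in line with (iv) and (v); how closely the small-$\lambda$ values approach the infimum of $r(x)$ depends on how far the iteration has converged.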
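From Theorem 6.3.2 of Blahut (1987), we know that for $l\in(0,\underline{r})$, the rate distortion function $\inf_{p(\cdot|\theta)\in\mathcal{P}_l} I_p(\Theta,X)$ is strictly positive and is a convex (hence continuous) decreasing function of $l$.

(ii) By (i), $l$ is determined uniquely by $\lambda$. On the other hand, for $l\in(0,\underline{r})$, we know that there is a unique $p^*_\lambda(\cdot|\theta)$. If there is another $\lambda'\ne\lambda$ such that $p^*_{\lambda'}(\cdot|\theta) = p^*_\lambda(\cdot|\theta)$, then by (i) of Theorem 3.3.1, $m^*_{\lambda'}(\cdot) = m^*_\lambda(\cdot)$, so by (1.2.1.3) we have, for all $x,\theta$,
$$\frac{e^{-\lambda' L(x,\theta)}}{\int m^*_\lambda(y)e^{-\lambda' L(y,\theta)}\,dy} = \frac{e^{-\lambda L(x,\theta)}}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy},$$
or
$$e^{-(\lambda'-\lambda)L(x,\theta)} = \frac{\int m^*_\lambda(y)e^{-\lambda' L(y,\theta)}\,dy}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy},$$
which is impossible, since the left hand side of the above is a function of $x$ and the right hand side is independent of $x$.

(iii) In the proof of Theorem 6.3.2 of Blahut (1987), it was shown that
$$\inf_{p(\cdot|\theta)\in\mathcal{P}_{\underline{r}}} I_p(\Theta,X) = 0.$$
By (i), it is decreasing in $l$. Clearly, if $p(\cdot)$ is independent of $\theta$, then $I_p(\Theta,X) = 0$.

(iv) Consider
$$p^*_\lambda(x|\theta) = \frac{m^*_\lambda(x)e^{-\lambda L(x,\theta)}}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}$$
for two values $\lambda_1$ and $\lambda_2$. For fixed $\theta$, we have
$$D(p^*_{\lambda_2}\,\|\,p^*_{\lambda_1})(\theta) = \int p^*_{\lambda_2}(x|\theta)\log\left(\frac{m^*_{\lambda_2}(x)e^{-\lambda_2L(x,\theta)}\int m^*_{\lambda_1}(y)e^{-\lambda_1L(y,\theta)}\,dy}{m^*_{\lambda_1}(x)e^{-\lambda_1L(x,\theta)}\int m^*_{\lambda_2}(y)e^{-\lambda_2L(y,\theta)}\,dy}\right)dx$$
$$= \log\left(\frac{\int m^*_{\lambda_1}(y)e^{-\lambda_1L(y,\theta)}\,dy}{\int m^*_{\lambda_2}(y)e^{-\lambda_2L(y,\theta)}\,dy}\right) + (\lambda_1-\lambda_2)\int L(x,\theta)p^*_{\lambda_2}(x|\theta)\,dx + \int p^*_{\lambda_2}(x|\theta)\log\frac{m^*_{\lambda_2}(x)}{m^*_{\lambda_1}(x)}\,dx. \tag{3.3.17}$$
Similarly,
$$D(p^*_{\lambda_1}\,\|\,p^*_{\lambda_2})(\theta) = \log\left(\frac{\int m^*_{\lambda_2}(y)e^{-\lambda_2L(y,\theta)}\,dy}{\int m^*_{\lambda_1}(y)e^{-\lambda_1L(y,\theta)}\,dy}\right) + (\lambda_2-\lambda_1)\int L(x,\theta)p^*_{\lambda_1}(x|\theta)\,dx + \int p^*_{\lambda_1}(x|\theta)\log\frac{m^*_{\lambda_1}(x)}{m^*_{\lambda_2}(x)}\,dx. \tag{3.3.18}$$
Adding (3.3.17) and (3.3.18) gives
$$D(p^*_{\lambda_2}\,\|\,p^*_{\lambda_1})(\theta) + D(p^*_{\lambda_1}\,\|\,p^*_{\lambda_2})(\theta) = (\lambda_1-\lambda_2)\Big(\int L(x,\theta)p^*_{\lambda_2}(x|\theta)\,dx - \int L(x,\theta)p^*_{\lambda_1}(x|\theta)\,dx\Big) + \int \big(p^*_{\lambda_2}(x|\theta) - p^*_{\lambda_1}(x|\theta)\big)\log\frac{m^*_{\lambda_2}(x)}{m^*_{\lambda_1}(x)}\,dx. \tag{3.3.19}$$
Averaging over $\theta$ in (3.3.19), we get
$$0 \le E_wD(p^*_{\lambda_2}\,\|\,p^*_{\lambda_1})(\theta) + E_wD(p^*_{\lambda_1}\,\|\,p^*_{\lambda_2})(\theta) = \int \big(D(p^*_{\lambda_2}\,\|\,p^*_{\lambda_1})(\theta) + D(p^*_{\lambda_1}\,\|\,p^*_{\lambda_2})(\theta)\big)w(\theta)\,d\theta$$
$$= (\lambda_1-\lambda_2)\big(l(\lambda_2) - l(\lambda_1)\big) + D(m^*_{\lambda_2}\,\|\,m^*_{\lambda_1}) + D(m^*_{\lambda_1}\,\|\,m^*_{\lambda_2}). \tag{3.3.20}$$
By the same technique as in the proof of Proposition 3.1, we can prove that $D(p_2\,\|\,p_1)$ is convex in $(p_2,p_1)$:
$$D\big(ap_2' + (1-a)p_2''\,\big\|\,ap_1' + (1-a)p_1''\big) \le aD(p_2'\,\|\,p_1') + (1-a)D(p_2''\,\|\,p_1''), \tag{3.3.21}$$
with equality if and only if $p_2'(\cdot) = p_1'(\cdot)$ and $p_2''(\cdot) = p_1''(\cdot)$.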
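So, we have
$$E_w D(p^*_{\lambda_2}\|p^*_{\lambda_1}) \ge D(E_w p^*_{\lambda_2}\|E_w p^*_{\lambda_1}), \qquad E_w D(p^*_{\lambda_1}\|p^*_{\lambda_2}) \ge D(E_w p^*_{\lambda_1}\|E_w p^*_{\lambda_2}), \qquad (3.3.22)$$
with equality iff $p^*_{\lambda_2}(\cdot|\theta) = p^*_{\lambda_1}(\cdot|\theta)$ a.s. $\theta$. Since $E_w p^*_\lambda = m^*_\lambda$, (3.3.20) and (3.3.22) give $(\lambda_1-\lambda_2)(l(\lambda_2)-l(\lambda_1)) \ge 0$, with equality excluded by (ii). Thus we have $l(\lambda_2) < l(\lambda_1)$, for $\lambda_1 < \lambda_2$.

(v) For all $\lambda$ we have
$$l(\lambda) = \int w(\theta)\int p^*_\lambda(x|\theta)L(x,\theta)\,dx\,d\theta.$$
Taking the limit superior gives
$$\limsup_{\lambda\to\infty} l(\lambda) \le \int w(\theta)\limsup_{\lambda\to\infty}\int p^*_\lambda(x|\theta)L(x,\theta)\,dx\,d\theta.$$
So, it is enough to show
$$\lim_{\lambda\to\infty}\int p^*_\lambda(x|\theta)L(x,\theta)\,dx = 0. \qquad (3.3.23)$$
Indeed, by continuity we have that $\forall \epsilon > 0$, $\exists \delta > 0$ such that $L(\delta) < \epsilon$. This gives
$$\int p^*_\lambda(x|\theta)L(x,\theta)\,dx = \int_{[\theta-\delta,\theta+\delta]} p^*_\lambda(x|\theta)L(x,\theta)\,dx + \int_{[\theta-\delta,\theta+\delta]^c} p^*_\lambda(x|\theta)L(x,\theta)\,dx \le \epsilon + \frac{\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(x)e^{-\lambda L(x,\theta)}L(x,\theta)\,dx}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}. \qquad (3.3.24)$$
Since $\exists T$ such that for $t > T$, $e^{-t}t < e^{-t/2}$, let $\lambda$ be so large that $\lambda r(\delta) > T$. Now for such $\lambda$ $(> 1)$,
$$e^{-\lambda L(x,\theta)}L(x,\theta) \le e^{-\lambda L(x,\theta)}\lambda L(x,\theta) \le e^{-\lambda L(x,\theta)/2}, \qquad \forall x \in [\theta-\delta,\theta+\delta]^c.$$
Let $\delta' > 0$ satisfy $r(\delta') < r(\delta)/2$ and $[\theta-\delta',\theta+\delta'] \subset S^\circ$. For such $\delta'$, the second term on the right hand side of (3.3.24) is bounded by
$$\frac{\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(x)e^{-\lambda L(x,\theta)/2}\,dx}{\int_{[\theta-\delta',\theta+\delta']} m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy} \le \frac{e^{-\lambda r(\delta)/2}\int_{[\theta-\delta,\theta+\delta]^c} m^*_\lambda(x)\,dx}{e^{-\lambda L(\delta')}\int_{[\theta-\delta',\theta+\delta']} m^*_\lambda(y)\,dy} \le e^{-\lambda\left(\frac{r(\delta)}{2}-r(\delta')\right)}\frac{1}{\int_{[\theta-\delta',\theta+\delta']} m^*_\lambda(y)\,dy}.$$
Let $b = \inf_{y\in[\theta-\delta',\theta+\delta']} w(y)$, and $B_\lambda = \{y \in [\theta-\delta',\theta+\delta'] \mid m^*_\lambda(y) \ge b/2\}$. By (ii) of Theorem 3.3.1, we know that for large $\lambda$ we have $\mu(B_\lambda) > \delta'$. Now, the last upper bound is bounded by
$$e^{-\lambda\left(\frac{r(\delta)}{2}-r(\delta')\right)}\frac{1}{\frac{b}{2}\mu(B_\lambda)} \le \frac{2}{b\delta'}e^{-\lambda\left(\frac{r(\delta)}{2}-r(\delta')\right)} \to 0,$$
establishing (v).

(vi) Let $\lambda(0) = \lim_{l\to r}\lambda(l)$; then from (iii) we know that $I_{p^*_{\lambda(0)}}(\Theta,X) = 0$, and so $p^*_{\lambda(0)}$ is independent of $\theta$. That is, $\lambda(0) = 0$; otherwise it cannot be independent of $\theta$.

(vii) For (3.3.15), it is enough to prove that for all compact $A \in \mathcal{B}$, the Borel algebra on $\mathbb{R}^1$, as $\lambda \to 0$,
$$P^*_\lambda(A|\theta) \to M_0(A).$$
Indeed, we have
$$p^*_\lambda(x|\theta) = \frac{m^*_\lambda(x)e^{-\lambda L(x,\theta)}}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}. \qquad (3.3.25)$$
For any pre-assigned $\epsilon > 0$, we can choose $a$, $b$, $\lambda_0$ such that $-\infty < a < b < \infty$, $0 < \lambda_0 < \infty$ and
$$\int_{[a,b]} e^{-\lambda_0 L(y,\theta)}M_0(dy) > 1 - \frac{\epsilon}{2}.$$
Since $e^{-\lambda L(x,\theta)} \le 1$, we have that, for $\lambda < \lambda_0$,
$$P^*_\lambda(A|\theta) \le \frac{M^*_\lambda(A)}{\int_{[a,b]} e^{-\lambda_0 L(y,\theta)}M^*_\lambda(dy)}. \qquad (3.3.26)$$
Note that $a$, $b$, $\lambda_0$ are independent of $\lambda$.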
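Since $e^{-\lambda_0 L(y,\theta)}$ is bounded and continuous on $[a,b]$ and $M^*_\lambda(\cdot) \xrightarrow{\mathcal{L}} M_0(\cdot)$, we have, for $\lambda$ small,
$$\int_{[a,b]} e^{-\lambda_0 L(y,\theta)}M^*_\lambda(dy) > 1 - \epsilon$$
and $M^*_\lambda(A) < M_0(A) + \epsilon$. Now (3.3.26) is bounded from above by
$$\frac{M_0(A)+\epsilon}{1-\epsilon}.$$
Since $\epsilon$ was arbitrary, we have
$$\limsup_{\lambda\to 0} P^*_\lambda(A|\theta) \le M_0(A).$$
On the other hand, since the denominator in (3.3.25) is no greater than 1, we have, for $\lambda < \lambda_0$,
$$P^*_\lambda(A|\theta) \ge \int_A e^{-\lambda L(x,\theta)}M^*_\lambda(dx) \ge \int_{A\cap[a,b]} e^{-\lambda_0 L(x,\theta)}M^*_\lambda(dx),$$
where $a$, $b$ and $\lambda_0$ are chosen so that
$$\int_{A\cap[a,b]} e^{-\lambda_0 L(x,\theta)}M_0(dx) \ge M_0(A) - \epsilon.$$
So, for $\lambda$ small, we have
$$P^*_\lambda(A|\theta) \ge M_0(A) - \epsilon.$$
Thus
$$\liminf_{\lambda\to 0} P^*_\lambda(A|\theta) \ge M_0(A),$$
establishing (3.3.15).

For (3.3.16), note
$$w^*_\lambda(\theta|x) = \frac{p^*_\lambda(x|\theta)w(\theta)}{m^*_\lambda(x)} = \frac{e^{-\lambda L(x,\theta)}w(\theta)}{\int m^*_\lambda(y)e^{-\lambda L(y,\theta)}\,dy}.$$
So, for small $\lambda$,
$$W^*_\lambda(A|x) \le \int_A \frac{W(d\theta)}{\int_{[a,b]} e^{-\lambda_0 L(y,\theta)}M^*_\lambda(dy)}. \qquad (3.3.27)$$
As before, choose finite numbers $a$, $b$ so large and $\lambda_0$ so small that, uniformly for $\theta \in A$ (recall that $A$ is compact),
$$\int_{[a,b]} e^{-\lambda_0 L(y,\theta)}M^*_\lambda(dy) > 1 - \epsilon.$$
Now, (3.3.27) is bounded above by $W(A)/(1-\epsilon)$. On the other hand, for small $\lambda$,
$$W^*_\lambda(A|x) \ge \int_A e^{-\lambda L(x,\theta)}W(d\theta) \ge \int_{A\cap[a,b]} e^{-\lambda_0 L(x,\theta)}W(d\theta).$$
As before, choose $a$, $b$ and $\lambda_0 > 0$ small enough that
$$\int_{A\cap[a,b]} e^{-\lambda_0 L(x,\theta)}W(d\theta) \ge W(A) - \epsilon.$$
Since $\epsilon > 0$ was arbitrary, we get (3.3.16).

(viii) Now we prove $M_0(\cdot) = \zeta(x_0)$. By way of contradiction, suppose $M_0(\cdot) \ne \zeta(x_0)$. By the constraint (1.2.1) we have
$$l(\lambda) = \int\!\!\int L(x,\theta)P^*_\lambda(dx|\theta)W(d\theta) \ge \int_{[a,b]}\int_{[c,d]} L(x,\theta)P^*_\lambda(dx|\theta)W(d\theta). \qquad (3.3.28)$$
Since $M_0(\cdot)$ does not concentrate at $x_0 = \arg\inf_x \int L(x,\theta)W(d\theta)$, there is $\epsilon > 0$ such that
$$\int\!\!\int L(x,\theta)M_0(dx)W(d\theta) > r + \epsilon. \qquad (3.3.29)$$
Strictness of the inequality follows from $M_0(\cdot) \ne \zeta(x_0)$, since this implies that $M_0(\cdot)$ assigns positive mass away from $x_0$. We can choose $a$, $b$, $c$ and $d$ large and independent of $\lambda$ so that
$$\int_{[a,b]}\int_{[c,d]} L(x,\theta)M_0(dx)W(d\theta) \ge \int\!\!\int L(x,\theta)M_0(dx)W(d\theta) - \frac{\epsilon}{2}. \qquad (3.3.30)$$
Now, using (3.3.28), (3.3.29), (3.3.30) and (3.3.14), the fact that $L(x,\theta)$ is bounded and continuous in $x$ on $[a,b]$, and the fact that $l(\lambda) \to r$ as $\lambda \to 0$, we get
$$r = \lim_{\lambda\to 0} l(\lambda) \ge \lim_{\lambda\to 0}\int_{[a,b]}\int_{[c,d]} L(x,\theta)P^*_\lambda(dx|\theta)W(d\theta) = \int_{[a,b]}\int_{[c,d]} L(x,\theta)M_0(dx)W(d\theta) \ge r + \frac{\epsilon}{2},$$
which is impossible. Thus, (viii) follows. □

3.4 Hypothesis testing Using MILs

In this subsection, we demonstrate a parallel result for the exponential rate of the type II error when the product form of the MIL is used for testing a simple versus simple hypothesis. Specifically, assume the independence model for the MIL, i.e., for $x^n = (x_1, \ldots, x_n)$, $p^*_\lambda(x^n|\theta) = \prod_{i=1}^n p^*_\lambda(x_i|\theta)$. For fixed $\theta$, the family is parameterized by $\lambda$, and we may be interested in testing the hypotheses $H_1: \lambda = \lambda_1$ vs $H_2: \lambda = \lambda_2$. For simplicity, we denote $p^*_{\lambda_1}(\cdot|\theta)$ and $p^*_{\lambda_2}(\cdot|\theta)$ just by $p_1(\cdot)$ and $p_2(\cdot)$ respectively. Suppose we use the Neyman-Pearson level $\alpha$ test with acceptance region $A_n$ based on an iid sample $X_1, \ldots, X_n$. Let $\beta_n$ denote the type II error, i.e. $\beta_n = P_2(A_n)$. Then we identify the exponential rate of $\beta_n$ in the following:

Theorem 3.4.1.
$$\lim_{n\to\infty}\frac{1}{n}\log\beta_n = -\frac{1}{2}\,\frac{[l(\lambda_2)-l(\lambda_1)]^2}{\mathrm{Var}_{p_2}(L(X,\theta))}. \qquad (3.4.1)$$

Proof: We first prove
$$\lim_{n\to\infty}\frac{1}{n}\log\beta_n = -\frac{1}{2}\,\frac{\left(D(p_1\|p_2)+D(p_2\|p_1)\right)^2}{\int p_2\left(\log\frac{p_2}{p_1}\right)^2 - D^2(p_2\|p_1)}. \qquad (3.4.2)$$
$A_n$ has the form
$$A_n = \left\{x^n : \frac{p_1(x^n)}{p_2(x^n)} > c_n\right\},$$
where $c_n$ is determined by $P_1(A_n) = 1-\alpha$. So we have
$$1-\alpha = P_1\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n \log\frac{p_1(X_i)}{p_2(X_i)} > \frac{1}{\sqrt{n}}\log c_n\right] = P_1\left[\frac{\sum_{i=1}^n\left[\log\frac{p_1(X_i)}{p_2(X_i)} - D(p_1\|p_2)\right]}{\sqrt{n}\sqrt{\int p_1\left(\log\frac{p_1}{p_2}\right)^2 - D^2(p_1\|p_2)}} > \frac{\frac{1}{\sqrt{n}}\log c_n - \sqrt{n}D(p_1\|p_2)}{\sqrt{\int p_1\left(\log\frac{p_1}{p_2}\right)^2 - D^2(p_1\|p_2)}}\right],$$
and, by the central limit theorem,
$$\frac{1}{\sqrt{n}}\log c_n \sim \sqrt{\int p_1\left(\log\frac{p_1}{p_2}\right)^2 - D^2(p_1\|p_2)}\;\Phi^{-1}(\alpha) + \sqrt{n}D(p_1\|p_2).$$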
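Hence
$$\beta_n = P_2(A_n) = P_2\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n \log\frac{p_2(X_i)}{p_1(X_i)} < -\frac{1}{\sqrt{n}}\log c_n\right)$$
$$= P_2\left(\frac{\sum_{i=1}^n\left[\log\frac{p_2(X_i)}{p_1(X_i)} - D(p_2\|p_1)\right]}{\sqrt{n}\sqrt{\int p_2\left(\log\frac{p_2}{p_1}\right)^2 - D^2(p_2\|p_1)}} < -\frac{\frac{1}{\sqrt{n}}\log c_n + \sqrt{n}D(p_2\|p_1)}{\sqrt{\int p_2\left(\log\frac{p_2}{p_1}\right)^2 - D^2(p_2\|p_1)}}\right)$$
$$\sim \Phi\left(-\frac{\frac{1}{\sqrt{n}}\log c_n + \sqrt{n}D(p_2\|p_1)}{\sqrt{\int p_2\left(\log\frac{p_2}{p_1}\right)^2 - D^2(p_2\|p_1)}}\right) \sim \Phi\left(-\frac{\Phi^{-1}(\alpha)\sqrt{\int p_1\left(\log\frac{p_1}{p_2}\right)^2 - D^2(p_1\|p_2)} + \sqrt{n}D(p_1\|p_2) + \sqrt{n}D(p_2\|p_1)}{\sqrt{\int p_2\left(\log\frac{p_2}{p_1}\right)^2 - D^2(p_2\|p_1)}}\right) = \Phi(a_n),$$
where $a_n \to -\infty$ as $n \to \infty$. So by using the L'Hospital rule we get
$$\lim_{n\to\infty}\frac{1}{n}\log\beta_n = \lim_{n\to\infty}\frac{1}{n}\log\Phi(a_n) = \lim_{n\to\infty}\frac{1}{n}\log\left(\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{a_n} e^{-x^2/2}\,dx\right) = -\frac{1}{2}\lim_{n\to\infty}\frac{a_n^2}{n} = -\frac{1}{2}\,\frac{\left(D(p_1\|p_2)+D(p_2\|p_1)\right)^2}{\int p_2\left(\log\frac{p_2}{p_1}\right)^2 - D^2(p_2\|p_1)},$$
i.e. (3.4.2) is true.

Now since
$$D(p_2\|p_1) + D(p_1\|p_2) = (\lambda_1-\lambda_2)\left(\int L(x,\theta)p^*_{\lambda_2}(x|\theta)\,dx - \int L(x,\theta)p^*_{\lambda_1}(x|\theta)\,dx\right),$$
and since $\theta$ has a degenerate distribution $w(\cdot)$, for $i = 1, 2$ we have
$$\int L(x,\theta)p_i(x|\theta)\,dx = \int\!\!\int L(x,\theta)p_i(x|\theta)w(\theta)\,dx\,d\theta = l(\lambda_i),$$
and hence
$$D(p_2\|p_1) + D(p_1\|p_2) = (\lambda_2-\lambda_1)\left(l(\lambda_1) - l(\lambda_2)\right). \qquad (3.4.3)$$
Also,
$$\int p_2\left(\log\frac{p_2}{p_1}\right)^2 dx = \left(\log\frac{\int m^*_{\lambda_1}(y)e^{-\lambda_1 L(y,\theta)}\,dy}{\int m^*_{\lambda_2}(y)e^{-\lambda_2 L(y,\theta)}\,dy}\right)^2 + 2(\lambda_1-\lambda_2)\,l(\lambda_2)\log\frac{\int m^*_{\lambda_1}(y)e^{-\lambda_1 L(y,\theta)}\,dy}{\int m^*_{\lambda_2}(y)e^{-\lambda_2 L(y,\theta)}\,dy} + (\lambda_1-\lambda_2)^2\int L(x,\theta)^2 p_2(x|\theta)\,dx. \qquad (3.4.4)$$
Since
$$D(p_2\|p_1) = \log\left(\frac{\int m^*_{\lambda_1}(y)e^{-\lambda_1 L(y,\theta)}\,dy}{\int m^*_{\lambda_2}(y)e^{-\lambda_2 L(y,\theta)}\,dy}\right) + (\lambda_1-\lambda_2)\int L(x,\theta)p_2(x|\theta)\,dx, \qquad (3.4.5)$$
we have
$$D^2(p_2\|p_1) = \left(\log\frac{\int m^*_{\lambda_1}(y)e^{-\lambda_1 L(y,\theta)}\,dy}{\int m^*_{\lambda_2}(y)e^{-\lambda_2 L(y,\theta)}\,dy}\right)^2 + 2(\lambda_1-\lambda_2)\,l(\lambda_2)\log\frac{\int m^*_{\lambda_1}(y)e^{-\lambda_1 L(y,\theta)}\,dy}{\int m^*_{\lambda_2}(y)e^{-\lambda_2 L(y,\theta)}\,dy} + (\lambda_1-\lambda_2)^2\,l^2(\lambda_2). \qquad (3.4.6)$$
Now by (3.4.3), (3.4.4) and (3.4.6), the RHS of (3.4.2) is
$$-\frac{1}{2}\,\frac{[l(\lambda_2)-l(\lambda_1)]^2}{\mathrm{Var}_{p_2}(L(X,\theta))},$$
thus completing the proof. □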
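As an informal numerical check of the last limit in the proof (not part of the original argument), the sketch below evaluates $(1/n)\log\Phi(-c\sqrt{n})$ for increasing $n$ and compares it with $-c^2/2$. Here $c$ stands for the ratio of $D(p_1\|p_2)+D(p_2\|p_1)$ to the standard deviation of $\log(p_2/p_1)$ under $p_2$; the value $c = 1$ and the function name `log_rate` are illustrative assumptions, and the bounded term involving $\Phi^{-1}(\alpha)$ is ignored.

```python
import math


def log_rate(c, n):
    """Return (1/n) * log Phi(-c * sqrt(n)), with Phi the standard normal CDF."""
    a_n = -c * math.sqrt(n)
    # Phi(a) = 0.5 * erfc(-a / sqrt(2)); erfc keeps precision in the far lower tail.
    phi = 0.5 * math.erfc(-a_n / math.sqrt(2.0))
    return math.log(phi) / n


c = 1.0                      # hypothetical value of the standardized divergence ratio
limit = -0.5 * c ** 2        # the claimed exponential rate
rates = {n: log_rate(c, n) for n in (25, 100, 400)}

for n, value in rates.items():
    print(n, round(value, 4), "limit:", limit)
```

The gap to the limit shrinks slowly (roughly like $\log n / n$), which is why moderate $n$ still shows a visible discrepancy.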
3.5 Remarks

The result of Theorem 3.3.1 is in striking contrast to the fact that for iid data distributed according to a density $p(x|\theta)$, where $\theta$ is a $d$-dimensional parameter, we have
$$E_m D(w(\cdot|X^n)\|w(\cdot)) = \frac{d}{2}\ln n + O(1).$$
Asymptotically maximizing this latter expression over priors $w$, as is done to find a reference prior (see Bernardo (1979)), leads to Jeffreys prior; see Clarke and Barron (1994).

Let $x_0 = \arg\inf_x \int w(\theta)L(x,\theta)\,d\theta$ as in the condition of Theorem 3.3.1. It is necessary for $w(\cdot)$ to have a finite second moment if $x_0$ is to be finite for $L(x,\theta) = (x-\theta)^2$. In particular, if the prior $w(\cdot)$ is $N(\mu,\sigma^2)$, then $\int w(\theta)L(x,\theta)\,d\theta = \sigma^2 + (\mu-x)^2$, so $\inf_x \int w(\theta)L(x,\theta)\,d\theta = \sigma^2$ and the infimum is achieved at $x_0 = \mu$. If $w(\cdot)$ is $\mathrm{Exp}(a,\mu)$, where $\mu$ is the location parameter, then $\int w(\theta)L(x,\theta)\,d\theta = (x-\mu-1/a)^2 + 1/a^2$, so $\inf_x \int w(\theta)L(x,\theta)\,d\theta = 1/a^2$ and the infimum is achieved at $x_0 = \mu + 1/a$. In general, $x_0 = \mathrm{mean}(\theta)$ if $L(x,\theta) = (x-\theta)^2$.

Chapter 4

Application

4.1 Introduction

Here we give a practical example in which MIL's can be used to provide answers to questions of interest that do not seem amenable to other techniques. In the present case, it appears that MIL's do better than a conventional analysis because they can be applied to summary statistics.

In general we suggest that MIL's may prove useful in settings satisfying the following criteria. First, a true parametric family cannot be proposed. That is, a general form for the relationship between outcomes and a parameter is not apparent. Second, the unknown parameter must be location-like; that is, it is not necessarily a location parameter in the strict sense of the term but does nevertheless track the typical range for the outcomes. The interpretation for $\theta$ is pre-set, and the prior must be formulated according to this.
Here, we use a Bayes hypothesis test; Theorem 3.1.2 suggests that some point estimation problems from a frequentist perspective may also be feasible.

We assume that a finite dimensional parametrization for the parameter has been chosen and that it permits estimation of the parameter of interest, possibly as a function of the coordinates of the parameter. Let $\theta = (\theta_1, \ldots, \theta_k)$ denote the finite dimensional parametrization. In general, the data can be written as a random vector $X^n = (X_1, \ldots, X_n)$ with outcomes denoted $x^n = (x_1, \ldots, x_n)$. Assume that it is possible to associate to the distribution of $X^n$ a parametric family that has a density that can be written as $p(x^n|\theta)$ for $\theta \in \mathbb{R}^k$, but that the form of $p(\cdot|\theta)$ is unknown. In particular, suppose that there is no basis for any assumptions about $p(\cdot|\theta)$, i.e., we know nothing about how values of the parameter affect the probabilities of the outcomes, only that the two are related in some way.

Even in such cases where little is known, an experimenter may have some idea about
The l ikel ihood can be used i n i n i t i a l data analysis where one cannot make detailed modeling assumptions and one must rely chiefly on arguments from robustness: If one has robustness against model ing strategy i n the sense that the same results obta in for several different modeling strategies (none of which make strong assumptions i.e., are min ima l ly informative) , and the results are insensitive to the choice of prior , loss function and bound on the Bayes risk then one has more confidence i n the val idi ty of the conclusions. 4.2 Appl ica t ion to A Real Data Set Here we demonstrate the use of M I L ' s by re-analyzing data from Nader and Reboussin (1994). To model the data, we used the M I L s i n two different ways. Tha t is , the models we use make weak assumptions i n an information theoretic sense. However, for large sample sizes, Theorem 3.3.1 tells us that the n-dimensional M I L may unsuitable for practice i n some situations, especially for large data sets. Other models and assumptions may be more appropriate for this data set, we are not a iming to search through for the best model for this data, just s imply want a demonstration of possible uses w i t h the M I L approach. The par t icular model by the M I L s may also not appropriate, however, as a study of our method for an i n i t i a l use to a real data set, we made our best effort i n the model ing and data assumptions, and welcome any cr i t ic ism for our future improvement. In the formulat ion of the M I L s , we used the Bayes risk bound /, but i n most of our examples and applications, we used the parameter A to determine the M I L s . The two pa-rameters are equivalent as described i n (ii) of Theorem 3.3.2. Use of A is direct and more convenient i n inferences. 
4.2.1 Description of the Data

The experimental data studied in Nader and Reboussin (1994) were collected to investigate whether two different training methods produce different effects on the behavior of the monkeys. In this experiment, eight monkeys were initially trained to respond under a fixed interval 5-minute schedule of intravenous cocaine presentation, FI5. In this training, the first time that a monkey pulled a lever after having waited at least five minutes produced a cocaine injection. The injection lasted ten seconds. (Responses during the injection were not counted.)

Prior to the second phase of training, the monkeys were rated from one to eight based on response rates under the FI5 schedule. Based on this rating, the monkeys were paired so that two monkeys who formed a pair would have similar average FI5 response rates before two different cocaine self-administration reinforcement schedules were applied. The two highest ratings gave the first block; the third and fourth highest gave the next, and so on. Within each pair, members were randomly assigned to one of two schedules. Thus, four monkeys were trained under an FR50 ("fixed response 50") schedule: that is, for every fifty responses (lever pulls) the monkey got an injection of cocaine. The other four monkeys were trained under an IRT30 ("inter-response time 30 seconds") schedule: the monkeys were reinforced by a cocaine dose for lever presses at least thirty seconds apart. A lever press before 30 seconds elapsed reset the IRT30 timer. For all monkeys, each cocaine injection was followed by a two minute timeout, and a sixty minute timeout followed the tenth and twentieth cocaine injections.

Following the 65th session under FR50 or IRT30, availability of cocaine was again scheduled under FI5 for sixty consecutive sessions. For each of these sixty sessions three variables were measured.
The primary variable of interest was the response rate, which was the total number of responses during the session divided by the session length in minutes. Here, session length was the actual session length less the timeouts. Secondary variables were cocaine intake (in mg/kg per session) and average quarter life. Quarter life values are the proportion of the fixed interval elapsed when 25% of the responses in that interval had occurred. The average is taken over the five minute intervals that occurred during a session. The intake measures how quickly the monkey made a response after the five minute intervals in a session.

Figure 3 shows plots of the response rate data over the sixty sessions for the eight monkeys. The left 4 plots are the data from the 4 monkeys of the FR50 group; the right 4 plots are the data from the 4 monkeys of the IRT30 group. The monkeys are paired row-wise.

First, label the monkeys in pairs as (1,2), (3,4), (5,6) and (7,8). The odd label in each pair is the monkey trained under FR50 (the left column of plots in Figure 3) and the even label is the monkey trained under IRT30 (the right column). From Figure 3, we see a few data points in the early sessions of monkeys I and III that are obviously larger than the rest of the corresponding observations. We treat these as outliers and delete the first three observations from all eight monkeys in our data analysis. For simplicity of notation, we just relabel the remaining observations for each monkey as 1 to 57. Let $y_{ij}$ be the datum on rate for the $i$-th monkey on the $j$-th day, where $i = 1, \ldots, 8$ and $j = 1, \ldots, 57$. For each $i$, let $\bar{y}_i = \frac{1}{57}\sum_{j=1}^{57} y_{ij}$ be the sample mean of the rate data from the $i$-th monkey. Now, take differences of the means within each pair.
We write these differences as $x_1 = \bar{y}_1 - \bar{y}_2$, $x_2 = \bar{y}_3 - \bar{y}_4$, $x_3 = \bar{y}_5 - \bar{y}_6$, and $x_4 = \bar{y}_7 - \bar{y}_8$.

From the paired data we can obtain a few simple descriptive statistics. For the first pair, (1,2), we can find the mean, variance and lag-1 autocorrelation for the vector $(y_{1,1} - y_{2,1}, \ldots, y_{1,60} - y_{2,60})$ of sessional differences. Doing this for the other 3 pairs, and for the vector of sessional differences with the first 3 entries deleted, gives the summary statistics in Table 1. Note that following the rule of thumb which gives $2/\sqrt{n} \approx 0.26$ as a threshold for assessing the presence of serial correlation (see Farnum and Stanton, 1989, p. 78) leads us to suspect that most of the vectors of differences do exhibit dependence. In addition, we see that all of the means are positive, consistent with IRT30 and FR50 having different effects. The range of the sample variances is too large to permit meaningful assertions. Deleting the outliers does not appear to affect the summary statistics uniformly, apart from reducing the variances.

Table 1. Data Summary for Observed Differences in the Two Groups

               Full Data                    1st 3 Obs. Deleted
Diff.    Mean     Var.    Auto Corr.    Mean     Var.    Auto Corr.
1-2      0.80    16.31      0.25        0.59     1.03      0.31
3-4      4.10    10.46      0.35        3.65     5.25      0.22
5-6      9.42    19.15      0.43        9.27    19.69      0.44
7-8      1.79     3.59      0.44        2.27     2.33      0.40

[Figure 3 appears here: eight panels, Monkey #1 to Monkey #8, each plotting rate (0 to 30) against session (0 to 60).]

Figure 3: Plots of Rate Over the Sixty Sessions. This sequence of figures shows a plot of rate over the sessions for each of the eight monkeys.
Each row corresponds to a pair formed by matching baseline lever pressing rates; each column corresponds to a treatment, either FR-50 (left) or IRT-30 (right).

Here, the data vector for monkey I is $(y_{1,1}, \ldots, y_{1,60})$, with mean, variance and lag-1 autocorrelation defined as before for the vector of differences. The corresponding summary statistics are displayed in Table 2. Note that, again, deleting the outliers does not affect the summary statistics uniformly, and the rule of thumb suggests that each monkey's data vector exhibits dependence, although for Monkey III with the 3 outliers deleted one may be skeptical.

Table 2. Data Summary for the Eight Monkeys

               Full Data                    1st 3 Obs. Deleted
Monkey   Mean     Var.    Auto Corr.    Mean     Var.    Auto Corr.
1        2.65    14.76      0.23        2.47     0.53      0.31
2        1.85     0.46      0.27        1.88     0.44      0.24
3        5.00     9.73      0.33        4.57     4.76      0.18
4        0.90     0.26      0.24        0.91     0.24      0.22
5       19.03     9.54      0.31       19.06    10.03      0.31
6        9.61     8.89      0.45        9.78     8.71      0.48
7        3.39     2.49      0.34        3.19     1.45      0.31
8        1.59     0.57      0.47        1.63     0.57      0.43

Remark: The summary statistics $x^4$ and $y^4$ do not reflect the role of the sample size $n$. To address this criticism, one could use standardized summary statistics. Here, for example, one might use $\sqrt{n}\,x^4$ in place of $x^4$, and the same for $y^4$. If the numbers of observations for the two monkeys in a pair are different, say $n_1$ and $n_2$ respectively, we may use weighted standardized summary statistics, for example a correspondingly weighted multiple of $x^4$ in place of $x^4$, and similarly for $y^4$. There are other ways to address this criticism also. In view of Theorem 3.1.1, we do not want to lump together too many data points. So, we might replace a string $x_1, \ldots, x_n$ by a sequence of summary statistics, the number of summary statistics to be taken as independent being chosen by the experimenter to reflect the number of independent data points the data are equivalent to.
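The descriptive statistics reported in Tables 1 and 2 (mean, variance, lag-1 autocorrelation) and the $2/\sqrt{n}$ rule of thumb can be sketched as follows. The session-level data are not reproduced in this text, so the series below is synthetic and purely illustrative. The divisor conventions (sample variance with $n-1$, autocorrelation normalized by the total sum of squares) and the function names are assumptions, since the text does not state which conventions were used.

```python
import math
import statistics


def lag1_autocorr(series):
    """Lag-1 sample autocorrelation, normalized by the total sum of squares."""
    mean = statistics.mean(series)
    dev = [y - mean for y in series]
    num = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    den = sum(d * d for d in dev)
    return num / den


def summarize(series):
    """Return (mean, sample variance, lag-1 autocorrelation, 2/sqrt(n) threshold)."""
    n = len(series)
    return (statistics.mean(series), statistics.variance(series),
            lag1_autocorr(series), 2.0 / math.sqrt(n))


# Synthetic stand-in for a vector of 57 sessional differences (hypothetical values):
# a slowly varying trend plus a small periodic perturbation.
synthetic = [math.sin(j / 6.0) + 0.1 * ((j * 7) % 5) for j in range(57)]
mean, var, r1, threshold = summarize(synthetic)

# The rule of thumb flags serial dependence when |r1| exceeds 2/sqrt(n).
flag_dependence = abs(r1) > threshold
print(round(mean, 3), round(var, 3), round(r1, 3), round(threshold, 3), flag_dependence)
```

With $n = 57$ the threshold is close to the value 0.26 quoted above.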
One may also be interested in how sample size affects the width of the HPD region based on the MIL model. At the two extremes, the MILs can be used to form a product of i.i.d. densities or to get a single $n$-fold dependent density. In the former case, the result is the same as for i.i.d. distributions. As sample size increases, the accuracy of inference increases, so the width of the HPD region decreases roughly as $1/\sqrt{n}$ times the root inverse of the Fisher information. For the latter case, since the $n$-fold MIL in general has high dependency among its arguments, the sample size effect on the width of the HPD regions is not as evident as in the former case, since dependence causes large data sets to look like smaller ones.

4.2.2 Models for the Data and Results

Part of the data analysis presented in Nader and Reboussin (1994) was a repeated measures analysis of variance for the rate data. Pairs and treatment group are between subject effects; session is a within subject effect. Nader and Reboussin (1994) looked for linear trends in the rate over the 60 sessions and asserted that the apparent nonlinearity did not affect the conclusions substantially. This model did not reject the hypothesis that the mean rates are the same in both groups, though the p-value is in the suggestive range. However, it did reveal a highly significant difference between the mean linear trends in the two groups over the sixty sessions. The other variables, intake and quarter life, were analyzed separately and gave conclusions compatible with the rate analysis.

The main virtue of this modeling approach is its simplicity. However, other models examined by Nader and Reboussin (1994) gave similar conclusions. These included one model with an AR(1) component over the 60 sessions, one allowing some curvature in the trend over sessions, and several excluding the early sessions.
In these cases, the conclusions were much the same: a not quite significant mean difference between groups, and a highly significant difference in linear trends.

Here, we fit two models and consider a third. We model the data differences for the paired monkeys in the two groups and look for training differences in the mean response rates. We find a significant difference in the mean rates. We only analyze the rate data since it is regarded as the most important index. Our methods can be applied to either the average quarter life or the cocaine intake data as well.

Model I

The data on rate for the eight monkeys over the 60 sessions are plotted in Figure 3. Neither these plots nor the form of the experiment suggests an obvious parametric family. Moreover, there are not enough data to perform a nonparametric analysis. A key problem is that for each monkey one cannot assume the data from the sixty sessions are independent, even conditionally on the monkey. In addition, it is not clear how the differences from pair to pair can be modeled.

As a consequence of the absence of strong modeling assumptions, it is unreasonable to use any standard likelihood to model the data. However, the MIL method gives a likelihood which makes relatively weak assumptions on the data distribution. We use it to extract initial conclusions.

We begin our analysis by using the MILs. There are other modeling strategies that are feasible with MILs, but they are more elaborate and require much more programming. In the first modeling strategy, we take differences within each pair in the data and average over the fifty seven sessions. (Recall we deleted the first three observations as outliers from each monkey. Also, one may consider different models, for example, use the CLT for a normal approximation of the average. However, if the number of observations is small, the CLT and other models may not be practical to use.
Here our only intent is to demonstrate the MIL approach.) Thus, we estimate a single parameter using, effectively, four data points, which are the mean differences. In the second model, we again average over the 57 data points for each monkey but do not take differences in the data. Note that the eight monkeys are assumed to be independent of each other, although for each fixed monkey the 57 data points may not be independent. Thus, the eight mean data points are independent, though they may not be identically distributed. In this second model, we use two parameters, one for the IRT30 group and one for the FR50 group, and then obtain a posterior for the difference in the parameters.

For Model I, we suppose that the expected values of the $X_i$ are the same and treat this common value as the parameter of interest $\theta$, with the same units as the observations. Now, the problem reduces to finding a posterior for this parameter given the data. If the posterior assigns most of its mass around a positive value, we infer that the expectation of $X_i$ is strictly positive and therefore the IRT30 rate is lower than the FR50 rate.

For a given prior $w(\theta)$, a given loss function $L$, and a given value of $\lambda$, we can get an MIL $p^*(x|\theta)$ by the procedure in Chapter 1. For simplicity, we treat the four pair differences as approximately independent and identically distributed. Although dependence may be present between blocks, we may assume it is canceled in the difference, leaving only the effects of the training. Now, we can form the posterior density
$$w^*(\theta|x^4) = \frac{p^*(x_1|\theta)p^*(x_2|\theta)p^*(x_3|\theta)p^*(x_4|\theta)w(\theta)}{\int p^*(x_1|\xi)p^*(x_2|\xi)p^*(x_3|\xi)p^*(x_4|\xi)w(\xi)\,d\xi}. \qquad (4.2.2.1)$$
Given $\alpha > 0$, one can see whether the $(1-\alpha)$ highest posterior density (H.P.D.) region for $\theta$ contains 0.

We obtained graphs of the posterior in (4.2.2.1) for a range of values of $\lambda$, several choices of prior and two choices of $L$. In practice, the priors are chosen according to pre-experimental knowledge about the parameter distribution.
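The H.P.D. check described above can be sketched for a posterior tabulated on a grid. This is a minimal illustration only: the Gaussian posterior used here is a hypothetical stand-in rather than a MIL-based posterior, and the function name `hpd_region` and the grid are assumptions introduced for the example.

```python
import math


def hpd_region(grid, density, alpha):
    """Approximate (1 - alpha) H.P.D. region of a gridded unimodal density.

    Grid points are added in order of decreasing density until their total
    mass reaches 1 - alpha; for a unimodal density the result is an interval,
    returned as its smallest and largest grid points.
    """
    h = grid[1] - grid[0]
    order = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(i)
        mass += density[i] * h
        if mass >= 1.0 - alpha:
            break
    return grid[min(kept)], grid[max(kept)]


# Hypothetical stand-in posterior: N(5, 1) evaluated on a grid and renormalized.
grid = [-5.0 + 0.01 * k for k in range(2001)]
raw = [math.exp(-0.5 * (t - 5.0) ** 2) for t in grid]
total = 0.01 * sum(raw)
posterior = [v / total for v in raw]

lower, upper = hpd_region(grid, posterior, alpha=0.05)
contains_zero = lower <= 0.0 <= upper
print(round(lower, 2), round(upper, 2), contains_zero)
```

If the region excludes 0, as it does for this stand-in, one would infer that the parameter is positive at the chosen level.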
Here we choose a few priors for convenience. In particular, we chose $w(\cdot)$ to be $U(-15,20)$ or any of $N(-2,1)$, $N(0,1)$, $N(2,1)$ and $N(0,10)$ to reflect a variety of priors with a range of reasonable means and variances; we choose $L(x,\theta) = (x-\theta)^2$ or $L(x,\theta) = |x-\theta|$; and we choose $\lambda$ to range from .5 to 5.

We recall that in the context of $p^*(x|\theta)$, $\lambda$ has two meanings. First, it behaves like a scale or dispersion parameter. For larger $\lambda$'s, $p^*(x|\theta)$ is more flat; for smaller $\lambda$, $p^*(x|\theta)$ is more concentrated or more sharply peaked. The value of $\lambda$ also affects the set of likelihoods over which we have optimized. Larger $\lambda$ values correspond to a smaller set, and vice versa. We require $\lambda$ to be in $(0,\infty)$; otherwise $p^*(x|\theta)$ is independent of $\theta$ and so is meaningless. In our work we generally found that values of $\lambda$ in [.1, 10] gave reasonable results.

We used the iteration method described in Section 1.3 to find the MIL $p^*(x|\theta)$. That is, we chose an initial distribution $m^{(0)}(\cdot)$, and got $p^{(1)}(x|\theta)$ by (1.3.1). Then, we plugged $p^{(1)}(x|\theta)$ into (1.3.2) to get $m^{(1)}(\cdot)$. We continued this cycling to get the $n$-step likelihood $p^{(n)}(x|\theta)$ until the absolute difference of two consecutive approximations to the MIL was no greater than a prespecified $\epsilon > 0$. Here, we found that a reasonable choice for $\epsilon$ ranged from $10^{-6}$ to $10^{-4}$, and that the number of iterations required in our calculations for a fixed $x$ and $\theta$ was of the order 10 to $10^2$.

Our results for the cases listed above were generally consistent. For all the above choices of the prior distribution, $\lambda$ and the loss function, the posterior density was unimodal and concentrated on the positive half line, with mode between 4 and 8.5 and relatively small posterior variance, which can be controlled by the parameter $\lambda$.
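The iteration and the posterior for Model I can be sketched on a grid as follows. This is a minimal illustration under stated assumptions, not the program used for the results reported here. We assume that (1.3.1) and (1.3.2) are the usual Blahut-Arimoto updates, $p(x|\theta) \propto m(x)e^{-\lambda L(x,\theta)}$ followed by $m(x) = \int p(x|\theta)w(\theta)\,d\theta$; we monitor the change in $m$ rather than in $p$; and the grid, the iteration cap and all function names are hypothetical. The data are the four mean differences from Table 1 with the first three observations deleted; the prior is $N(0,1)$, the loss is squared error and $\lambda = 1.5$.

```python
import math


def blahut_arimoto(grid, prior, lam, eps=1e-6, max_iter=200):
    """Alternate the two updates on a common grid for x and theta.

    Squared error loss is assumed. Returns the approximate optimal marginal m
    and the normalizer Z(theta) = sum_y m(y) exp(-lam * (y - theta)^2) * h.
    """
    h = grid[1] - grid[0]
    n = len(grid)
    # kernel[i][j] = exp(-lam * (x_i - theta_j)^2); symmetric because x and
    # theta share the same grid and the loss is symmetric in its arguments.
    kernel = [[math.exp(-lam * (x - t) ** 2) for t in grid] for x in grid]
    m = [1.0 / (n * h)] * n  # uniform starting marginal m^(0)

    def normalizer(marginal):
        return [h * sum(mi * k for mi, k in zip(marginal, row)) for row in kernel]

    for _ in range(max_iter):  # the cap is a safeguard for this sketch
        z = normalizer(m)
        ratio = [p / zj for p, zj in zip(prior, z)]
        # m_new(x_i) = m(x_i) * sum_j w(theta_j) K(x_i, theta_j) / Z(theta_j) * h
        m_new = [mi * h * sum(r * k for r, k in zip(ratio, row))
                 for mi, row in zip(m, kernel)]
        change = max(abs(a - b) for a, b in zip(m, m_new))
        m = m_new
        if change <= eps:
            break
    return m, normalizer(m)


def posterior_from_product(grid, prior, z, data, lam):
    """Posterior for a product of univariate MILs, computed in log space.

    The factors m*(x_i) do not depend on theta, so they cancel on normalizing.
    """
    h = grid[1] - grid[0]
    logp = [math.log(p) - lam * sum((x - t) ** 2 for x in data)
            - len(data) * math.log(zj)
            for t, p, zj in zip(grid, prior, z)]
    top = max(logp)
    unnorm = [math.exp(v - top) for v in logp]
    total = h * sum(unnorm)
    return [u / total for u in unnorm]


h = 0.2
grid = [h * (k - 60) for k in range(121)]        # symmetric grid on [-12, 12]
raw = [math.exp(-0.5 * t * t) for t in grid]     # N(0, 1) prior, renormalized on the grid
prior = [v / (h * sum(raw)) for v in raw]

data = [0.59, 3.65, 9.27, 2.27]                  # mean differences, Table 1 (3 obs. deleted)
lam = 1.5
m_star, z_star = blahut_arimoto(grid, prior, lam)
post = posterior_from_product(grid, prior, z_star, data, lam)

mode = grid[max(range(len(grid)), key=post.__getitem__)]
mass_pos = h * sum(p for t, p in zip(grid, post) if t > 0)
print("posterior mode:", round(mode, 2), "mass on theta > 0:", round(mass_pos, 4))
```

Working with the logarithm of the posterior avoids the underflow that a direct product of several small density values would cause.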
Thus, from these posteriors we infer that there is a significant effect from FR50 and IRT30 training, with the rate for the IRT30 group being much less than the rate for the FR50 group. Figure 4.a shows some of the posteriors we obtained. Note that the posteriors assign essentially all their mass to the positive half-line.

Also, for Model I we tried some priors with larger variance (from the data plot, we find variance 10 is reasonable), and we deleted the first three observations from each monkey, since it seems that there are some outliers in these observations. The results are plotted in Figure 4.b. We see a similar skewness toward the right of zero on the $\theta$ axis.

Note that Theorems 3.3.1 and 3.3.2 do not apply directly to (4.2.2.1) because it uses a product of univariate MIL's rather than a single MIL for a 4-variate outcome. However, the conclusions of those two theorems are qualitatively consistent with the results here. The posterior in (4.2.2.1) is seen to concentrate at a point as $\lambda$ increases, which is consistent with (iii) of Theorem 3.3.1. In addition, as $\lambda$ decreases, the posterior is seen to converge to a dispersed distribution that is similar to the prior used, as suggested by (iii) of Theorem 3.3.1.

Model II

We investigate an alternative model based on MIL's to show how one might examine robustness against modeling strategy for paired data. As an alternative model, instead of using four differences in the data to estimate one parameter, we considered using the sample means for the four first entries of the pairs and the sample means for the four second entries of the pairs to estimate two parameters reflecting the means of the two groups of monkeys. Again, we assume independence between the two groups to simplify the modeling. This may not seem reasonable for the practical data set.
Modeling the exact dependence structure in the data seems difficult; here we only intend to do another initial analysis using the MIL in a different way and compare the conclusions. Then, we can marginalize the posterior to get credible regions for the difference in the two parameters.

Since there are two parameters, we use a two-dimensional prior which for the present we assume factors, that is, we assume $w(\cdot,\cdot) = w_1(\cdot)\,w_2(\cdot)$. Also, we assume the components of $X$ are independent and the components of $Y$ are independent. From $w_1(\cdot)$, we get the MIL $p^*_1(x^4|\theta_1) = \prod_{i=1}^4 p^*_1(x_i|\theta_1)$, and from $w_2(\cdot)$ we get the MIL $p^*_2(y^4|\theta_2) = \prod_{i=1}^4 p^*_2(y_i|\theta_2)$. Now, we can form the two-dimensional posterior
$$w^*(\theta_1,\theta_2|x^4,y^4) = \frac{p^*_1(x^4|\theta_1)\,p^*_2(y^4|\theta_2)\,w(\theta_1,\theta_2)}{m^*(x^4,y^4)} = w_1(\theta_1|x^4)\,w_2(\theta_2|y^4), \qquad (4.2.2.2)$$
where $m^*(x^4,y^4)$ is the marginal from $p(x^4,y^4|\theta_1,\theta_2) = p_1(x^4|\theta_1)\,p_2(y^4|\theta_2)$ and the prior $w(\theta_1,\theta_2) = w_1(\theta_1)\,w_2(\theta_2)$, and $w_1(\theta_1|x^4)$ and $w_2(\theta_2|y^4)$ are the corresponding one-dimensional marginals. Now, we can apply the transformation
$$\psi = \theta_1 + \theta_2, \qquad \phi = \theta_1 - \theta_2$$
in the bivariate posterior. After integrating out $\psi$, we get a posterior for $\phi$. (In our C program, we used discretization summation to approximate the integration.)

Figure 4: Posteriors from Model I. In 4 (a), the posteriors plotted here were formed from MIL's based on choosing $w$ to be $N(0,1)$, $L$ to be squared error loss, and $\lambda = .7$ (points) or $\lambda = 1.5$ (solid). In 4 (b), the posteriors plotted here were formed from MIL's based on choosing $w$ to be $N(0,10)$, $L$ to be squared error loss, and $\lambda = .7$ (dots) or $\lambda = 1.5$ (solid).

For this model, we also tried several priors (with variances ranging from about 1 to 10), losses (squared error and absolute difference) and $\lambda$s (from around 0.0001 to around 0.05). We found that when $\lambda$ is small, around .0001, and the prior is $N(0,1)$, the posterior is nearly $N(0,1)$. In view of Theorem 3.3.1 this is not a surprise. As $\lambda$ increases, the posterior shifts so as to concentrate on positive values of $\theta$.
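The change of variables and the discretized summation mentioned above can be sketched as follows. Because the bivariate posterior factors, integrating out $\psi$ amounts to a discrete cross-correlation of the two one-dimensional posteriors, $f(\phi) = \int w_1(\theta_2 + \phi)\,w_2(\theta_2)\,d\theta_2$. The two Gaussian posteriors used here are hypothetical stand-ins rather than MIL-based posteriors, and the grid and the function names are illustrative assumptions.

```python
import math


def difference_density(w1, w2, h):
    """Density of phi = theta_1 - theta_2 on the lattice phi_k = k * h.

    w1 and w2 are independent densities tabulated on the same uniform grid
    with spacing h; k runs from -(n - 1) to n - 1.
    """
    n = len(w1)
    dens = []
    for k in range(-(n - 1), n):
        lo, hi = max(0, -k), min(n, n - k)  # keep both j and j + k inside the grid
        dens.append(h * sum(w1[j + k] * w2[j] for j in range(lo, hi)))
    return dens


def gridded_normal(grid, mean, sd, h):
    """Normal density evaluated on the grid and renormalized to sum to 1/h."""
    raw = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in grid]
    total = h * sum(raw)
    return [v / total for v in raw]


h = 0.1
grid = [h * (k - 100) for k in range(301)]      # grid on [-10, 20]
w1 = gridded_normal(grid, 5.0, 1.0, h)          # hypothetical posterior for theta_1
w2 = gridded_normal(grid, 2.0, 0.5, h)          # hypothetical posterior for theta_2

n = len(grid)
phi = [k * h for k in range(-(n - 1), n)]
dens = difference_density(w1, w2, h)

total = h * sum(dens)
mean_phi = h * sum(p * d for p, d in zip(phi, dens))
prob_positive = h * sum(d for p, d in zip(phi, dens) if p > 0)
print(round(total, 6), round(mean_phi, 4), round(prob_positive, 4))
```

The quantity `prob_positive` plays the role of the posterior mass assigned to a positive difference between the two group parameters.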
However, when $\lambda$ is much above .09 or much below .0001, our implementation of the Blahut-Arimoto algorithm is numerically unstable because the integrand is close to a product of delta functions. This problem did not occur with Model I because the posterior there is based on a product of four densities, whereas in Model II the product has 8 densities. The problem seems to be that as $\lambda$ increases, $w_1$ and $w_2$ concentrate at different points, so that the product is too small for the computer to store. One consequence of this is that we cannot observe the convergence of the posterior to unit mass at a point that is suggested by Theorem 3.3.1. Moreover, in addition to having used a product of MIL's, we have marginalized (4.2.2.2), making the conclusion of Theorem 3.3.1 more distant.

Despite our being unable to observe the concentration of the model at a point with increasing $\lambda$, Figure 5 shows the posteriors we obtained for two values of $\lambda$, .0001 and .09. Intermediate values of $\lambda$ give posteriors roughly between these two posteriors. Note that for $\lambda = .0001$ the posterior reverts to a dispersed distribution resembling the prior. As the common value of the $\lambda$'s increases, $w^*$ shifts away from being centered at zero and again assigns essentially all its mass to the positive half-line (see Figure 5.a).

The point is to note that if $\lambda$ is chosen in Model II to be as close as possible to the values we used for Model I (without exceeding the limit of .09, so we can still compute), the inferences we make from the two models are qualitatively the same: namely, we have evidence that the difference in rates for the FR50 and IRT30 groups is positive. Thus the two modeling strategies confirm each other. We
We note that the conclusions from Model II do not seem as strong as those from Model I. We attribute this to the choice of $\lambda$ here being much smaller than the choice of $\lambda$ in Model I; that is, the Bayes risk bound $l$ used here is larger than that used in Model I, so the set $\mathcal{P}_l$ here is much larger. Currently we do not have a good formal technique for choosing $\lambda$ (or $l$). We will discuss this later in Chapter 6. We also used $N(0,10)$ as the prior for Model I and as the priors for $\theta_1$ and $\theta_2$ in Model II, and did the same analysis. In this case, the posteriors from the two models are more spread out (see Figure 5 (b)), especially for small values of $\lambda_1$ and $\lambda_2$, since the corresponding Bayes risk bound is large, which makes the allowable distortion large and hence the inferences less accurate. However, for the moderate value of 0.05 for $\lambda_1$ and $\lambda_2$, much of the posterior mass lies to the right of zero, which leads to a conclusion similar to that obtained with the $N(0,1)$ prior.

More Alternatives

Having recourse to MIL's permits the elaboration of other models that do not require the extreme data summarization used in Models I and II. This summarization is used here only to make it easier to get computational results, and is justifiable chiefly on the basis that it is not far wrong. One of the ways in which Models I and II can be criticized is that they, unlike Nader and Reboussin's original model, are insensitive to the sample size used to form the summary statistics. Other models can be considered. For example, assuming independence between the 57 sessions for each monkey, we can model each of the 57 observations from the two groups by the same MIL and take the product of the 57 MIL's for the whole data set; or, not assuming independence, we can generate a 57-dimensional MIL to model the whole data. In this latter case, we get a dependence model. Intermediate dependence structures can also be used.
Unfortunately, there remains the problem of how to get the right dependence structure from the data. We will discuss this question partially in Section 6.2.1. It is this plethora of equally plausible modeling strategies that motivated the work in the next Chapter.

Figure 5: Posteriors from Model II. In 5 (a), the posteriors were formed from MIL's based on choosing $w_1 = w_2$ to be $N(0,1)$, $L$ to be squared error loss, and $\lambda_1 = \lambda_2 = .0001$ (dots) or $\lambda_1 = \lambda_2 = .05$ (solid). In 5 (b), the posteriors were formed from MIL's based on choosing $w_1 = w_2$ to be $N(0,10)$, $L$ to be squared error loss, and $\lambda_1 = \lambda_2 = .0001$ (bold) or $\lambda_1 = \lambda_2 = .05$ (solid).

Chapter 5

Robustness of Modeling Strategies for Paired Data

Motivated by the inferential similarity of two different modeling strategies, we have tried to investigate formally the degree to which three modeling strategies applicable in the problems of the previous chapters would agree in general. We begin by defining general cases of the two models we have used and a general case of a third model that is equally plausible but that we did not use. Then we seek conditions under which these three models will be equivalent, and we present results which partially characterize how discrepant the inferences from these models will be. In this Chapter, some of our results are for $n$-fold likelihoods, some are for products of univariate likelihoods and some are for other special cases. We indicate the applicability of each result.

5.1 Introduction and Definition of Models

Recall that in the example in the previous chapter, we assumed that we have two independent data sets $X^n$ and $Y^n$, and that we are interested in modeling the data with various likelihoods so as to make inference about the parameter $\theta$. The parameter $\theta$ is a quantification of some population trait of interest. We have used several different models.
In Model I from Section 4.2.2, we generate the MIL for $Z^n = X^n - Y^n$ directly to get $p_{(1)}(z^n|\theta) = p^*(z^n|\theta)$ and the corresponding posterior $w_{(1)}(\theta|z^n)$. In Model II, by contrast, we model the two sets of data $X^n$ and $Y^n$ by $p_1(x^n|\theta_1)$ and $p_2(y^n|\theta_2)$, which use the two marginal priors $w_1(\theta_1)$ and $w_2(\theta_2)$ from a joint prior $w(\theta_1,\theta_2)$. We get the posterior $w(\theta_1,\theta_2|X^n,Y^n)$ and then apply the transformation $\theta = \theta_1 - \theta_2$, $\phi = \theta_1 + \theta_2$. Marginalizing out $\phi$ gives the posterior $w_{(2)}(\theta|X^n,Y^n)$ of $\theta$.

In Chapter 4, we used the product of uni-dimensional MILs to form the model. In this Chapter, we present a robustness analysis for paired data from general likelihoods; they may be dependent or independent among their variables, and they often include the various MIL's as special cases. Also, some results are only for $n$-dimensional MILs. The point here is to compare different modeling strategies, assuming the same likelihoods have been used in each of these models.

We can consider other models. In particular, we define a third model, Model III: just as in Model II, we use $p_1(x^n|\theta_1)$ and $p_2(y^n|\theta_2)$ for $X^n$ and $Y^n$ respectively, then use the transformation $Z^n = X^n - Y^n$, $S^n = X^n + Y^n$ to get the density for $(Z^n,S^n)$. Integrating out $S^n$ then gives the density $p_{(3)}(z^n|\theta_1,\theta_2)$ for $Z^n$. Using the transformation $\theta = \theta_1 - \theta_2$, $\phi = \theta_1 + \theta_2$ and marginalizing out $\phi$ gives the posterior $w_{(3)}(\theta|z^n)$ of $\theta$. In some cases $p_{(3)}(z^n|\theta_1,\theta_2)$ will reduce to the form $p_{(3)}(z^n|\theta)$, where $\theta = \theta_1 - \theta_2$, without further transformation. Later in this chapter we will deal with these conditions. This model is different from Model II in general. It uses the transformation of parameters to get the likelihood in $\theta$ first and then gets the posterior, whereas in Model II we get the two-dimensional posterior first, then use the transformation on parameters and marginalize to get the posterior for $\theta$.
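To make the difference between the two routes concrete, here is a minimal sketch for a case where everything is available in closed form. It assumes normal likelihoods $N(\theta_1,1)$ and $N(\theta_2,1)$ and independent $N(0,1)$ priors, which are illustrative choices and not MIL's; the helper names are introduced here. Model II updates each parameter and then takes the difference; Model III first forms $z_i = x_i - y_i$, whose likelihood is $N(\theta, 2)$, and updates the induced $N(0,2)$ prior on $\theta$.

```python
def normal_update(observations, lik_var, prior_mean, prior_var):
    """Conjugate normal update: returns (posterior mean, posterior variance)."""
    n = len(observations)
    post_precision = 1.0 / prior_var + n / lik_var
    post_mean = (prior_mean / prior_var + sum(observations) / lik_var) / post_precision
    return post_mean, 1.0 / post_precision

def model_two(xs, ys):
    """Posterior of theta = theta1 - theta2: update each parameter, then transform."""
    m1, v1 = normal_update(xs, 1.0, 0.0, 1.0)
    m2, v2 = normal_update(ys, 1.0, 0.0, 1.0)
    # theta1 and theta2 are independent a posteriori, so variances add
    return m1 - m2, v1 + v2

def model_three(xs, ys):
    """Posterior of theta: transform the data first, then update."""
    zs = [x - y for x, y in zip(xs, ys)]
    # Z_i given theta is N(theta, 2); theta = theta1 - theta2 has prior N(0, 2)
    return normal_update(zs, 2.0, 0.0, 2.0)

xs = [1.2, 0.4, 2.0]   # hypothetical paired data
ys = [0.1, -0.3, 0.9]
mean2, var2 = model_two(xs, ys)
mean3, var3 = model_three(xs, ys)
```

In this normal case the two posteriors coincide (both are normal with mean 0.725 and variance 0.5 for the data above). For likelihoods that do not factor in this way the two routes generally give different answers, which is the situation the rest of the chapter studies.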
There are numerous reasonable models for consideration. Here, as an attempt to do some robustness analysis for paired data, we only consider the above three commonly used models. In practice, we can consider more general transformations in the models:

$$z^n = f_1(x^n,y^n), \qquad s^n = f_2(x^n,y^n). \qquad (5.1.1)$$

Also, we have used

$$\theta = g_1(\theta_1,\theta_2), \qquad \phi = g_2(\theta_1,\theta_2). \qquad (5.1.2)$$

Note that in Model I we model the data transformation, while in Model II we model the parameter transformation. Since these two modeling strategies are widely used in practice, it is natural to investigate the robustness of these methods for paired data. If there were only two ways, Model I and Model II, to analyze a data set, we could do both. If they agree, as we have shown in Chapter 4, then we could be content and stop. However, there are many alternative techniques.

Consider the general form of Model III, which uses both a transformation of the data and a transformation of the parameters. Model $X_1,\ldots,X_n$ as iid $p(x|\theta_1)$ and $Y_1,\ldots,Y_n$ as iid $p(y|\theta_2)$. Here we assume the $X_i$'s and the $Y_i$'s are from the same parametric family, only with different parameters. Assume $X^n$ and $Y^n$ are independent. Now we can use $Z^n = f(X^n,Y^n)$ so as to derive, by convolution, a density for $Z^n$. The density for $Z_i$ is, in general,

$$p(z_i|\theta_1,\theta_2) = \int p_1(g_1(s_i,z_i)|\theta_1)\, p_2(g_2(s_i,z_i)|\theta_2)\, J(s_i,z_i)\, ds_i, \qquad (5.1.3)$$

where $x_i = g_1(s_i,z_i)$, $y_i = g_2(s_i,z_i)$ is the inverse transformation, and $J(s^n,z^n) = \prod_{i=1}^n J(s_i,z_i)$ is the Jacobian of the transformation. For compatibility with the data transformation, let $\theta = h(\theta_1,\theta_2)$ be the parameter of interest. We can use the MIL for $p_1(\cdot|\theta_1)$ and for $p_2(\cdot|\theta_2)$ so as to obtain a minimally informative convolution in (5.1.3). In some cases, which we identify presently, the left hand side of (5.1.3) reduces to a density of the form $p(z^n|\theta)$, where $\theta = f(\theta_1,\theta_2)$. More generally, however, this reduction does not occur.
Whether or not the reduction occurs, the parameter of interest is $\theta = f(\theta_1,\theta_2)$: we would therefore use a prior for $\theta$ (or for $\theta$ and $\phi$) and make inferences as before. If we want to relax the assumption of independence between $X_i$ and $Y_i$, (5.1.3) changes. We would use

$$\int p(g_1(s,z), g_2(s,z)|\theta_1,\theta_2)\, J(s,z)\, ds$$

in place of (5.1.3) and would have to find a MIL for a bivariate random variable with two parameters. For the present, we note some further alternatives. One could use a location family; one could use extra data; one could use a likelihood that was not minimally informative, perhaps based on physical modeling. One could avoid the extreme data summarization used here by modeling the day-to-day dependence through a time series approach.

The class of all models is enormous. Even after restriction to the subclass of all statistically plausible models, there remain too many to enumerate and evaluate in every particular instance. Moreover, there is no guarantee that all models in this subclass will give the same inferences for the parameter of interest. It is worthwhile therefore to have some theoretical guidelines for when to expect two modeling strategies to agree and when to expect them to disagree.

In short, the task of this chapter is to begin an investigation into the robustness of inferences to a change in modeling strategy for paired data. Many have investigated sensitivity to prior selection. Sensitivity to small changes in the likelihood has also been studied, although not much. Sensitivity to outliers or, more generally, to the data has been studied extensively. However, in all cases, the modeling strategy (by which we mean the transformation of data, the transformation of parameters and the nature of the link between the likelihood and the data, including the loss function if there is one) has never been the focus of a robustness study. Here we undertake to begin this in several cases.
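The convolution in (5.1.3) can be evaluated numerically for a given pair of likelihoods. The sketch below assumes the transformation $z = x - y$, $s = x + y$ used throughout this chapter, so that $x = (s+z)/2$, $y = (s-z)/2$ and the Jacobian is $1/2$. It uses unit-variance normal densities as illustrative stand-ins for $p_1$ and $p_2$ (not MIL's), and the function name `difference_density` is introduced here.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def difference_density(z, theta1, theta2, s_lo=-30.0, s_hi=30.0, steps=6000):
    """Density of Z = X - Y at z, obtained by integrating out S = X + Y.

    Trapezoidal rule applied to p1((s+z)/2 | theta1) * p2((s-z)/2 | theta2) * (1/2).
    """
    h = (s_hi - s_lo) / steps
    total = 0.0
    for k in range(steps + 1):
        s = s_lo + k * h
        x = 0.5 * (s + z)          # inverse transformation
        y = 0.5 * (s - z)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(x, theta1, 1.0) * normal_pdf(y, theta2, 1.0) * 0.5
    return total * h

numeric = difference_density(0.7, 1.0, -0.5)
closed_form = normal_pdf(0.7, 1.0 - (-0.5), 2.0)   # Z is N(theta1 - theta2, 2) here
```

For these normal stand-ins the result depends on $(\theta_1,\theta_2)$ only through $\theta_1 - \theta_2$, which is an instance of the reduction to the form $p(z^n|\theta)$ mentioned above.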
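5.2 Equivalence of Models

For simplicity, we consider the models in the specific form given at the beginning of this chapter. The results in this subsection are general: they are true for any $n$-dimensional likelihoods, independent or not, of MIL form or not, including all the models with the product of 1-dimensional MILs we considered in Chapter 4. Let the prior for Model I be

$$w_{(1)}(\theta) = \frac{1}{2}\int w_1\Big(\frac{\phi+\theta}{2}\Big)\, w_2\Big(\frac{\phi-\theta}{2}\Big)\, d\phi,$$

and choose the likelihood for Model I to be

$$p_{(1)}(z|\theta) = \frac{1}{2}\int\!\!\int p_1\Big(\frac{s+z}{2}\Big|\frac{\phi+\theta}{2}\Big)\, p_2\Big(\frac{s-z}{2}\Big|\frac{\phi-\theta}{2}\Big)\, \pi(\phi)\, ds\, d\phi,$$

for some density $\pi(\cdot)$. We see that if the joint likelihood

$$p(s,z|\phi,\theta) = \frac{1}{2}\, p_1\Big(\frac{s+z}{2}\Big|\frac{\phi+\theta}{2}\Big)\, p_2\Big(\frac{s-z}{2}\Big|\frac{\phi-\theta}{2}\Big)$$

for $(S,Z)$ satisfies a sufficiency-like condition between $Z$ and $\theta$, then the three models are equivalent. The sufficiency-like condition is that the joint density of $Z$ and $S$, obtained from the joint density of $X$ and $Y$ by transformation, can be factored into two parts. One part is a function of $Z$ and $\theta$ only; the other part is independent of $\theta$. Specifically, we have the following.

Proposition 5.2.1 Suppose the joint likelihood for $(S,Z)$ satisfies

$$p(s,z|\phi,\theta) = g(z,\theta)\, h(z,s,\phi)$$

for some functions $g(\cdot,\cdot)$ and $h(\cdot,\cdot,\cdot)$, and the prior satisfies

$$w\Big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\Big) = w_1(\theta)\, w_2(\phi)$$

for some $w_1(\cdot)$ and $w_2(\cdot)$. Then

$$w_{(1)}(\cdot|Z) = w_{(2)}(\cdot|X,Y) = w_{(3)}(\cdot|Z).$$

Proof: Since the likelihood for Model I is

$$p_{(1)}(z|\theta) = \frac{1}{2}\, g(z,\theta)\int\!\!\int h(z,s,\phi)\,\pi(\phi)\, ds\, d\phi,$$

the posterior density for Model I is

$$w_{(1)}(\theta|Z) = \frac{g(Z,\theta)\int\!\!\int h(Z,s,\phi)\pi(\phi)\,ds\,d\phi\; w_1(\theta)}{\int g(Z,\xi)\int\!\!\int h(Z,s,\phi)\pi(\phi)\,ds\,d\phi\; w_1(\xi)\,d\xi} \propto g(Z,\theta)\, w_1(\theta). \qquad (5.2.1)$$

Similarly, the likelihood for Model II factors as

$$p_{(2)}(x,y|\theta_1,\theta_2) = g(z,\theta)\, h(z,s,\phi),$$

giving that the posterior density for Model II is

$$w_{(2)}(\theta|X,Y) = \frac{\int g(Z,\theta)\, h(Z,S,\phi)\, w\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)\, d\phi}{\int\!\!\int g(Z,\xi)\, h(Z,S,\phi)\, w\big(\frac{\phi+\xi}{2},\frac{\phi-\xi}{2}\big)\, d\phi\, d\xi} \propto g(Z,\theta)\, w_1(\theta). \qquad (5.2.2)$$

The likelihood for Model III is

$$p_{(3)}(z|\theta_1,\theta_2) = g(z,\theta)\int h(z,s,\phi)\, ds,$$

thus

$$w_{(3)}(\theta|Z) = \frac{g(Z,\theta)\int\!\!\int h(Z,s,\phi)\, w\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)\, ds\, d\phi}{\int g(Z,\xi)\int\!\!\int h(Z,s,\phi)\, w\big(\frac{\phi+\xi}{2},\frac{\phi-\xi}{2}\big)\, ds\, d\phi\, d\xi} \propto g(Z,\theta)\, w_1(\theta). \qquad (5.2.3)$$

Now (5.2.1), (5.2.2) and (5.2.3) together complete the proof. $\Box$

Remark.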
We comment that if $w_1(\cdot) = w_2(\cdot)$ is the standard normal, $p_1(\cdot|\theta_1)$ is $N(\theta_1,\sigma^2)$ and $p_2(\cdot|\theta_2)$ is $N(\theta_2,\sigma^2)$, then the condition in Proposition 5.2.1 is satisfied.

5.3 Robustness against Modelling Strategies for Paired Data

The previous bounds on the differences between Models I and II were bounds in an average sense, useful for comparing whole models. For practical purposes, bounds that are pointwise in the data are more useful: they permit comparison of inferences given a particular data set. We first consider the simple case of the transformation of data and parameters:

$$S^n = X^n + Y^n, \quad Z^n = X^n - Y^n; \qquad \phi = \theta_1 + \theta_2, \quad \theta = \theta_1 - \theta_2.$$

If we use priors $w_2(\theta_1,\theta_2)$, $w_3(\theta_1,\theta_2)$ in Models II and III respectively, we can get an upper bound on the $L_1$ distance between the two posteriors without averaging over the data.

In Proposition 5.3.1 and Proposition 5.3.3, the likelihoods involved are general in the sense described at the beginning of this Chapter; they include all the models we considered in Chapter 4. Corollary 5.3.1, Proposition 5.3.2 and Theorem 5.3.1 are only for the $n$-dimensional MILs.

Proposition 5.3.1. If the priors of Models II and III are respectively $w_2(\theta_1,\theta_2)$, $w_3(\theta_1,\theta_2)$, then

(i) For any data $x^n$ and $y^n$,

$$\int |w_{(2)}(\theta|x^n,y^n) - w_{(3)}(\theta|z^n)|\, d\theta \le 2 M_{(2,3)}(x^n,y^n),$$

where

$$M_{(2,3)}(x^n,y^n) = M_{(2,3)}(s^n,z^n) = \sup_{\phi,\theta} M_{(2,3)}(s^n,z^n,\phi,\theta),$$

$$M_{(2,3)}(s^n,z^n,\phi,\theta) = \min\big\{|1 - R_{(2,3)}(s^n,z^n,\phi,\theta)|,\ |1 - [R_{(2,3)}(s^n,z^n,\phi,\theta)]^{-1}|\big\}$$

and

$$R_{(2,3)}(s^n,z^n,\phi,\theta) = \frac{\int p_1\big(\frac{v^n+z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{v^n-z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, dv^n\; w_3\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}{p_1\big(\frac{s^n+z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{s^n-z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, w_2\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}.$$

If $w_2(\cdot,\cdot) = w_3(\cdot,\cdot) = w(\cdot,\cdot)$, then the factor $M_{(2,3)}(s^n,z^n)$ in the upper bound is independent of $w(\cdot,\cdot)$, so we have

$$\sup_{w\in\mathcal{W}} \int |w_{(2)}(\theta|x^n,y^n) - w_{(3)}(\theta|z^n)|\, d\theta \le 2 M_{(2,3)}(x^n,y^n),$$

where $\mathcal{W}$ is the collection of all the two-dimensional priors.

(ii) The posterior means under Models II and III satisfy

$$|E_{(2)}(\theta|x^n,y^n) - E_{(3)}(\theta|z^n)| \le \big(E_{(2)}(|\theta|\,|x^n,y^n) + E_{(3)}(|\theta|\,|z^n)\big)\, M_{(2,3)}(x^n,y^n),$$

for any data set $x^n, y^n$.
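(iii) The posterior variances under Models II and III satisfy

$$|\mathrm{Var}_{(2)}(\theta|x^n,y^n) - \mathrm{Var}_{(3)}(\theta|z^n)| \le h_{(2,3)}(x^n,y^n)\, M_{(2,3)}(x^n,y^n),$$

where

$$h_{(2,3)}(x^n,y^n) = h_{(2,3)}(s^n,z^n) = E_{(2)}(\theta^2|x^n,y^n) + E_{(3)}(\theta^2|z^n) + 2\big(E_{(2)}(|\theta|\,|x^n,y^n) + E_{(3)}(|\theta|\,|z^n)\big).$$

Proof: (i) Let

$$g_2(s^n,z^n|\theta) = \int p_1\Big(\frac{s^n+z^n}{2}\Big|\frac{\phi+\theta}{2}\Big)\, p_2\Big(\frac{s^n-z^n}{2}\Big|\frac{\phi-\theta}{2}\Big)\, w_2\Big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\Big)\, d\phi,$$

$$g_3(z^n|\theta) = \int\!\!\int p_1\Big(\frac{s^n+z^n}{2}\Big|\frac{\phi+\theta}{2}\Big)\, p_2\Big(\frac{s^n-z^n}{2}\Big|\frac{\phi-\theta}{2}\Big)\, w_3\Big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\Big)\, d\phi\, ds^n.$$

Now, changing variables from $(\theta_1,\theta_2)$ to $(\phi,\theta)$,

$$w_{(2)}(\theta|x^n,y^n) = \frac{\int p_1 p_2 w_2\, d\phi}{\int\!\!\int p_1(x^n|\theta_1)\, p_2(y^n|\theta_2)\, w_2(\theta_1,\theta_2)\, d\theta_1\, d\theta_2} = \frac{g_2(s^n,z^n|\theta)}{\int g_2(s^n,z^n|\xi)\, d\xi},$$

and likewise, for Model III we have

$$w_{(3)}(\theta|z^n) = \frac{\int\!\!\int p_1 p_2 w_3\, d\phi\, ds^n}{\int\!\!\int\!\!\int p_1 p_2 w_3\, d\phi\, ds^n\, d\xi} = \frac{g_3(z^n|\theta)}{\int g_3(z^n|\xi)\, d\xi}.$$

Writing $g_2(\theta)$ for $g_2(s^n,z^n|\theta)$ and $g_3(\theta)$ for $g_3(z^n|\theta)$, the difference in the posteriors is

$$\int |w_{(2)}(\theta|x^n,y^n) - w_{(3)}(\theta|z^n)|\, d\theta = \int \Big|\frac{g_2(\theta)}{\int g_2(\xi)d\xi} - \frac{g_3(\theta)}{\int g_3(\xi)d\xi}\Big|\, d\theta$$

$$\le \int \Big|\frac{g_2(\theta)}{\int g_2(\xi)d\xi} - \frac{g_3(\theta)}{\int g_2(\xi)d\xi}\Big|\, d\theta + \int \Big|\frac{g_3(\theta)}{\int g_2(\xi)d\xi} - \frac{g_3(\theta)}{\int g_3(\xi)d\xi}\Big|\, d\theta$$

$$= \frac{\int |g_2(\theta) - g_3(\theta)|\, d\theta}{\int g_2(\xi)d\xi} + \frac{\big|\int g_2(\xi)d\xi - \int g_3(\xi)d\xi\big|}{\int g_2(\xi)d\xi} \le \frac{2\int |g_2(\theta) - g_3(\theta)|\, d\theta}{\int g_2(\xi)d\xi}$$

$$\le \frac{2\int\!\!\int p_1 p_2 w_2\, |1 - R_{(2,3)}(s^n,z^n,\phi,\theta)|\, d\phi\, d\theta}{\int\!\!\int p_1 p_2 w_2\, d\phi\, d\theta} = 2E_1|1 - R_{(2,3)}(s^n,z^n,\phi,\theta)|, \qquad (5.3.1)$$

where the expectation $E_1$ is taken over $(\phi,\theta)$ with respect to the density

$$\frac{p_1\big(\frac{s^n+z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{s^n-z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, w_2\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}{\int\!\!\int p_1\big(\frac{s^n+z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{s^n-z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, w_2\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)\, d\phi\, d\theta}.$$

Similarly, by adding and subtracting $g_2(s^n,z^n|\theta)/\int g_3(z^n|\xi)\,d\xi$, we have

$$\int |w_{(2)}(\theta|x^n,y^n) - w_{(3)}(\theta|z^n)|\, d\theta \le \int \Big|\frac{g_2(\theta)}{\int g_2(\xi)d\xi} - \frac{g_2(\theta)}{\int g_3(\xi)d\xi}\Big|\, d\theta + \int \Big|\frac{g_2(\theta)}{\int g_3(\xi)d\xi} - \frac{g_3(\theta)}{\int g_3(\xi)d\xi}\Big|\, d\theta$$

$$= \frac{\big|\int [g_2(\xi) - g_3(\xi)]\, d\xi\big|}{\int g_3(\xi)d\xi} + \frac{\int |g_2(\theta) - g_3(\theta)|\, d\theta}{\int g_3(\xi)d\xi} \le \frac{2\int |g_2(\theta) - g_3(\theta)|\, d\theta}{\int g_3(\xi)d\xi}$$

$$\le 2E_2\big|[R_{(2,3)}(s^n,z^n,\phi,\theta)]^{-1} - 1\big|, \qquad (5.3.2)$$

where the expectation $E_2$ is taken over $(\phi,\theta)$ with respect to the density

$$\frac{\int p_1\big(\frac{s^n+z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{s^n-z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, ds^n\; w_3\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}{\int\!\!\int\!\!\int p_1\big(\frac{s^n+z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{s^n-z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, w_3\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)\, ds^n\, d\phi\, d\theta}.$$

Now by (5.3.1) and (5.3.2) we get the desired conclusion.

(ii) Similarly,

$$\Big|\int \theta\, w_{(2)}(\theta|x^n,y^n)\, d\theta - \int \theta\, w_{(3)}(\theta|z^n)\, d\theta\Big| = \Big|\int \Big(\frac{\theta g_2(\theta)}{\int g_2(\xi)d\xi} - \frac{\theta g_3(\theta)}{\int g_3(\xi)d\xi}\Big)\, d\theta\Big|$$

$$\le \frac{\int |\xi|\, g_2(\xi)\, d\xi}{\int g_2(\xi)\, d\xi}\cdot\frac{\big|\int (\theta g_2(\theta) - \theta g_3(\theta))\, d\theta\big|}{\int |\xi|\, g_2(\xi)\, d\xi} + \frac{\int |\xi|\, g_3(\xi)\, d\xi}{\int g_3(\xi)\, d\xi}\cdot\frac{\big|\int g_2(\xi)\, d\xi - \int g_3(\xi)\, d\xi\big|}{\int g_2(\xi)\, d\xi}$$

$$\le \big(E_{(2)}(|\theta|\,|x^n,y^n) + E_{(3)}(|\theta|\,|z^n)\big)\, M_{(2,3)}(x^n,y^n),$$

since, as in the proof of (i), we have

$$\frac{\big|\int (\theta g_2(\theta) - \theta g_3(\theta))\, d\theta\big|}{\int |\xi|\, g_2(\xi)\, d\xi} \le M_{(2,3)}(x^n,y^n), \qquad \frac{\big|\int g_2(\xi)\, d\xi - \int g_3(\xi)\, d\xi\big|}{\int g_2(\xi)\, d\xi} \le M_{(2,3)}(x^n,y^n),$$

and

$$\frac{\int |\xi|\, g_2(\xi)\, d\xi}{\int g_2(\xi)\, d\xi} = E_{(2)}(|\theta|\,|x^n,y^n), \qquad \frac{\int |\xi|\, g_3(\xi)\, d\xi}{\int g_3(\xi)\, d\xi} = E_{(3)}(|\theta|\,|z^n).$$

(iii) We have

$$|\mathrm{Var}_{(2)}(\theta|x^n,y^n) - \mathrm{Var}_{(3)}(\theta|z^n)| \le \Big|\int \big(\theta^2 w_{(2)}(\theta|x^n,y^n) - \theta^2 w_{(3)}(\theta|z^n)\big)\, d\theta\Big| + \Big|\Big(\int \theta\, w_{(2)}(\theta|x^n,y^n)\, d\theta\Big)^2 - \Big(\int \theta\, w_{(3)}(\theta|z^n)\, d\theta\Big)^2\Big|.$$

Note that the second term on the right hand side of the above is

$$\Big|\int \big(\theta w_{(2)}(\theta|x^n,y^n) - \theta w_{(3)}(\theta|z^n)\big)\, d\theta\Big| \cdot \Big|\int \big(\theta w_{(2)}(\theta|x^n,y^n) + \theta w_{(3)}(\theta|z^n)\big)\, d\theta\Big|,$$

so, as in the proof of (ii), the above is bounded by

$$\Big[E_{(2)}(\theta^2|x^n,y^n) + E_{(3)}(\theta^2|z^n) + 2\big(E_{(2)}(|\theta|\,|x^n,y^n) + E_{(3)}(|\theta|\,|z^n)\big)\Big] \times M_{(2,3)}(s^n,z^n). \qquad \Box$$

Remarks:

1. Since $R_{(2,3)}(s^n,z^n,\phi,\theta) > 0$, $|1 - R_{(2,3)}(s^n,z^n,\phi,\theta)| \ge 1$ implies $|1 - [R_{(2,3)}(s^n,z^n,\phi,\theta)]^{-1}| < 1$, so for all $x^n, y^n$ we have $M_{(2,3)}(x^n,y^n) \le 1$.

2. Note that in the above Proposition, if $w_2(\cdot,\cdot) = w_3(\cdot,\cdot)$, then $R_{(2,3)}(s^n,z^n,\phi,\theta) = p(z^n|s^n)^{-1}$, the inverse of the conditional density of $Z^n$ given $S^n$. So, if $Z^n$ is tightly distributed given $S^n$, then $R_{(2,3)}(s^n,z^n,\phi,\theta) \approx 1$, and so $M_{(2,3)}(x^n,y^n) \approx 0$. Also, for given priors $w_2(\cdot,\cdot)$, $w_3(\cdot,\cdot)$, we can choose the data set $(x^n,y^n)$ such that $R_{(2,3)}(s^n,z^n,\phi,\theta) \approx 1$.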
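For a given data set $(x^n,y^n)$, we can choose priors $w_2(\cdot,\cdot)$, $w_3(\cdot,\cdot)$ such that $R_{(2,3)}(s^n,z^n,\phi,\theta) \approx 1$. In other words, we can identify data for which the models are indistinguishable for given priors, and we can find priors which make the models indistinguishable for a given data set.

The upper bound in Proposition 5.3.1 is not sharp. If we take $w_2(\cdot,\cdot) = w_3(\cdot,\cdot)$ and $p_1(\cdot|\theta) = p_2(\cdot|\theta) = N(\theta,1)$, then by Proposition 5.2.1 we have $w_{(2)}(\theta|x^n,y^n) = w_{(3)}(\theta|z^n)$. However,

$$R_{(2,3)}(s^n,z^n,\phi,\theta) = \prod_{i=1}^n \sqrt{4\pi}\, \exp\{(s_i - \phi)^2/4\},$$

so

$$M_{(2,3)}(s^n,z^n,\phi,\theta) = 1 - \prod_{i=1}^n \big(\sqrt{4\pi}\, \exp\{(s_i-\phi)^2/4\}\big)^{-1} > 0,$$

and as $n$ tends to infinity, $M_{(2,3)}(s^n,z^n,\phi,\theta)$ tends to 1, giving a trivial result. On the other hand, this reduction to a trivial limiting case makes sense because posteriors concentrate at the true value if the prior assigns mass to a neighborhood around it.

Suppose we use the MIL for the likelihoods, i.e. $p_1(x^n|\theta_1) = p_1^*(x^n|\theta_1)$, $p_2(y^n|\theta_2) = p_2^*(y^n|\theta_2)$, and take $w_2(\cdot,\cdot) = w_3(\cdot,\cdot)$. Then for the independent case (i.e. $p_k^*(x^n|\theta) = \prod_{i=1}^n p_k^*(x_i|\theta)$ for $k = 1,2$), we have

$$R_{(2,3)}(s^n,z^n,\phi,\theta) = \prod_{i=1}^n \frac{\int m_1^*\big(\frac{v_i+z_i}{2}\big)\, m_2^*\big(\frac{v_i-z_i}{2}\big) \exp\big\{-\lambda_1 L_1\big(\frac{v_i+z_i}{2},\frac{\phi+\theta}{2}\big) - \lambda_2 L_2\big(\frac{v_i-z_i}{2},\frac{\phi-\theta}{2}\big)\big\}\, dv_i}{m_1^*\big(\frac{s_i+z_i}{2}\big)\, m_2^*\big(\frac{s_i-z_i}{2}\big) \exp\big\{-\lambda_1 L_1\big(\frac{s_i+z_i}{2},\frac{\phi+\theta}{2}\big) - \lambda_2 L_2\big(\frac{s_i-z_i}{2},\frac{\phi-\theta}{2}\big)\big\}}.$$

In some cases, we can calculate $R_{(2,3)}(s^n,z^n,\phi,\theta)$ in closed form. For example, if we further specify $L_k(x,y) = (x-y)^2$ for $k = 1,2$, $\lambda_1 = \lambda_2 = \lambda$, and $w(\cdot,\cdot) = N(0,I_2)$, then from Example 1.4.3 we know that $m_1^*(x) = m_2^*(x) \propto e^{-x^2}$. By the facts that

$$\Big(\frac{s+z}{2} - \frac{\phi+\theta}{2}\Big)^2 + \Big(\frac{s-z}{2} - \frac{\phi-\theta}{2}\Big)^2 = \frac{1}{2}\big[(s-\phi)^2 + (z-\theta)^2\big] \quad\text{and}\quad \Big(\frac{s+z}{2}\Big)^2 + \Big(\frac{s-z}{2}\Big)^2 = \frac{s^2+z^2}{2},$$

we have

$$R_{(2,3)}(s^n,z^n,\phi,\theta) = \frac{\prod_{i=1}^n \int \exp\{-v_i^2/2 - \lambda(v_i-\phi)^2/2\}\, dv_i}{\prod_{i=1}^n \exp\{-s_i^2/2 - \lambda(s_i-\phi)^2/2\}} = \prod_{i=1}^n \sqrt{\frac{2\pi}{\lambda+1}}\, \exp\Big\{\frac{\lambda+1}{2}\Big(s_i - \frac{\lambda\phi}{\lambda+1}\Big)^2\Big\}.$$

We can state the corresponding results for Model I vs Model II and Model I vs Model III as in the following corollary.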
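Corollary 5.3.1 Let

$$p_1^*(x^n|\theta_1) = \frac{m_1^*(x^n)\exp\{-\lambda_1 L_1(x^n,\theta_1)\}}{\int m_1^*(t^n)\exp\{-\lambda_1 L_1(t^n,\theta_1)\}\, dt^n}$$

and

$$p_2^*(y^n|\theta_2) = \frac{m_2^*(y^n)\exp\{-\lambda_2 L_2(y^n,\theta_2)\}}{\int m_2^*(t^n)\exp\{-\lambda_2 L_2(t^n,\theta_2)\}\, dt^n}$$

be MIL's, and write the likelihood for Model I as

$$p_{(1)}^*(z^n|\theta) = \frac{m^*(z^n)\exp\{-\lambda L(z^n,\theta)\}}{\int m^*(t^n)\exp\{-\lambda L(t^n,\theta)\}\, dt^n}.$$

For Models I, II and III, choose the priors to be $w_1(\theta_1,\theta_2)$, $w_2(\theta_1,\theta_2)$ and $w_3(\theta_1,\theta_2)$, where $w_1(\theta) = \frac{1}{2}\int w_1\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)\, d\phi$. We have the following.

(i) For comparing Model I to Model II we have

$$\int |w_{(1)}(\theta|z^n) - w_{(2)}(\theta|x^n,y^n)|\, d\theta \le 2M_{(1,2)}(x^n,y^n),$$

$$|E_{(1)}(\theta|z^n) - E_{(2)}(\theta|x^n,y^n)| \le \big(E_{(1)}(|\theta|\,|z^n) + E_{(2)}(|\theta|\,|x^n,y^n)\big)\, M_{(1,2)}(x^n,y^n),$$

$$|\mathrm{Var}_{(1)}(\theta|z^n) - \mathrm{Var}_{(2)}(\theta|x^n,y^n)| \le h_{(1,2)}(x^n,y^n)\, M_{(1,2)}(x^n,y^n),$$

where

$$M_{(1,2)}(x^n,y^n) = M_{(1,2)}(s^n,z^n) = \sup_{\phi,\theta} M_{(1,2)}(s^n,z^n,\phi,\theta),$$

$$M_{(1,2)}(s^n,z^n,\phi,\theta) = \min\big\{|1 - R_{(1,2)}(s^n,z^n,\phi,\theta)|,\ |1 - [R_{(1,2)}(s^n,z^n,\phi,\theta)]^{-1}|\big\},$$

$$R_{(1,2)}(s^n,z^n,\phi,\theta) = \frac{m_1^*\big(\frac{s^n+z^n}{2}\big)\exp\{-\lambda_1 L_1\big(\frac{s^n+z^n}{2},\frac{\phi+\theta}{2}\big)\}}{\int m_1^*(t^n)\exp\{-\lambda_1 L_1\big(t^n,\frac{\phi+\theta}{2}\big)\}\, dt^n} \times \frac{m_2^*\big(\frac{s^n-z^n}{2}\big)\exp\{-\lambda_2 L_2\big(\frac{s^n-z^n}{2},\frac{\phi-\theta}{2}\big)\}\, \int m^*(t^n)\exp\{-\lambda L(t^n,\theta)\}\, dt^n\; w_2\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}{\int m_2^*(t^n)\exp\{-\lambda_2 L_2\big(t^n,\frac{\phi-\theta}{2}\big)\}\, dt^n\; m^*(z^n)\exp\{-\lambda L(z^n,\theta)\}\; w_1\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}$$

and

$$h_{(1,2)}(x^n,y^n) = h_{(1,2)}(s^n,z^n) = E_{(1)}(\theta^2|z^n) + E_{(2)}(\theta^2|x^n,y^n) + 2\big(E_{(1)}(|\theta|\,|z^n) + E_{(2)}(|\theta|\,|x^n,y^n)\big).$$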
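(ii) For comparing Model I and Model III we have

$$\int |w_{(1)}(\theta|z^n) - w_{(3)}(\theta|z^n)|\, d\theta \le 2M_{(1,3)}(x^n,y^n),$$

$$|E_{(1)}(\theta|z^n) - E_{(3)}(\theta|z^n)| \le \big(E_{(1)}(|\theta|\,|z^n) + E_{(3)}(|\theta|\,|z^n)\big)\, M_{(1,3)}(x^n,y^n),$$

$$|\mathrm{Var}_{(1)}(\theta|z^n) - \mathrm{Var}_{(3)}(\theta|z^n)| \le h_{(1,3)}(x^n,y^n)\, M_{(1,3)}(x^n,y^n),$$

where

$$M_{(1,3)}(x^n,y^n) = M_{(1,3)}(s^n,z^n) = \sup_{\phi,\theta} M_{(1,3)}(s^n,z^n,\phi,\theta),$$

$$M_{(1,3)}(s^n,z^n,\phi,\theta) = \min\big\{|1 - R_{(1,3)}(s^n,z^n,\phi,\theta)|,\ |1 - [R_{(1,3)}(s^n,z^n,\phi,\theta)]^{-1}|\big\}$$

and

$$R_{(1,3)}(s^n,z^n,\phi,\theta) = \frac{A\; w_3\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)}{B\; w_1\big(\frac{\phi+\theta}{2},\frac{\phi-\theta}{2}\big)},$$

where

$$A = \int m_1^*\Big(\frac{s^n+z^n}{2}\Big)\exp\Big\{-\lambda_1 L_1\Big(\frac{s^n+z^n}{2},\frac{\phi+\theta}{2}\Big)\Big\}\, m_2^*\Big(\frac{s^n-z^n}{2}\Big)\exp\Big\{-\lambda_2 L_2\Big(\frac{s^n-z^n}{2},\frac{\phi-\theta}{2}\Big)\Big\}\, ds^n \int m^*(t^n)\exp\{-\lambda L(t^n,\theta)\}\, dt^n,$$

$$B = \int m_1^*(t^n)\exp\Big\{-\lambda_1 L_1\Big(t^n,\frac{\phi+\theta}{2}\Big)\Big\}\, dt^n \times \int m_2^*(t^n)\exp\Big\{-\lambda_2 L_2\Big(t^n,\frac{\phi-\theta}{2}\Big)\Big\}\, dt^n\; m^*(z^n)\exp\{-\lambda L(z^n,\theta)\},$$

and

$$h_{(1,3)}(x^n,y^n) = h_{(1,3)}(s^n,z^n) = E_{(1)}(\theta^2|z^n) + E_{(3)}(\theta^2|z^n) + 2\big(E_{(1)}(|\theta|\,|z^n) + E_{(3)}(|\theta|\,|z^n)\big). \qquad \Box$$

Let us now examine the robustness of the three models with respect to sets of possible data points that are likely when MIL's are used. For this we define a notion of typicality based on Theorem 3.1.1. That result suggests sets on which posteriors should be close to their respective priors under the mixture density. First let

$$I_1^*(n) = E_{m_1^*} D(w_1^*(\cdot|X^n)\,\|\,w_1(\cdot)), \qquad I_2^*(n) = E_{m_2^*} D(w_2^*(\cdot|Y^n)\,\|\,w_2(\cdot)),$$

and

$$I_0^*(n) = E_{m^*} D(w^*(\cdot|Z^n)\,\|\,w(\cdot)).$$

Now, Theorem 3.1.1 gives conditions under which $I^*(n) \to 0$ as $n \to \infty$. To use this fact, let $C_i(n)$, for $i = 0,1,2$, be sequences of constants such that as $n \to \infty$ we have

$$C_i(n) \to \infty, \qquad C_i(n)\, I_i^*(n) \to 0.$$

Next, for $i = 0,1,2$, let $S_i$ be subsets of the sample space defined by

$$S_0 = \{Z^n : D(w^*(\cdot|Z^n)\,\|\,w(\cdot)) \le C_0(n) I_0^*(n)\},$$

$$S_1 = \{X^n : D(w_1^*(\cdot|X^n)\,\|\,w_1(\cdot)) \le C_1(n) I_1^*(n)\},$$

and

$$S_2 = \{Y^n : D(w_2^*(\cdot|Y^n)\,\|\,w_2(\cdot)) \le C_2(n) I_2^*(n)\}.$$

We call such data sets canonical. Let $P_i^*$ be the probability measure corresponding to $p_i^*$ ($i = 0,1,2$). The following proposition gives a sense in which the MIL probability of these canonical sets is large.

Proposition 5.3.2 For any pre-assigned $\epsilon > 0$, there exist subsets $A_{i,n}(\epsilon)$, for $i = 0,1,2$, in the domain of $\theta$, so that as $n \to \infty$,

$$P_i^*(S_i^c|\theta) \to 0, \quad \forall \theta \in A_{i,n},$$

and

$$W_i(A_{i,n}^c) < \epsilon, \quad \forall n.$$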
That is, for large sample sizes, the canonical sets have large $P_i^*(\cdot|\theta)$ probabilities for all values of $\theta$ in a set of arbitrarily large probability under the prior distribution $W_i(\cdot)$.

Proof: We only prove the statement for $i = 1$; the other cases are similar. We will omit the $\theta$ in $P_\theta^*$ when it is notationally convenient and causes no confusion. Note that

$$P_1^*(S_1^c|\theta) = P_1^*\big(D(w_1^*(\cdot|X^n)\,\|\,w_1(\cdot)) > C_1(n) I_1^*(n)\big) = E_{P_1^*}\, \chi\big(D(w_1^*(\cdot|X^n)\,\|\,w_1(\cdot)) > C_1(n) I_1^*(n)\big)$$

$$\le \frac{1}{C_1(n) I_1^*(n)}\, E_{P_1^*} D(w_1^*(\cdot|X^n)\,\|\,w_1(\cdot)), \qquad (5.3.7)$$

where $\chi(A)$ is the indicator function for the set $A$. Recall that

$$E_{P_1^*} D(w_1^*(\cdot|X^n)\,\|\,w_1(\cdot)) = \int\!\!\int w_1^*(\xi|x^n) \log\frac{w_1^*(\xi|x^n)}{w_1(\xi)}\, d\xi\; p_1^*(x^n|\theta)\, dx^n. \qquad (5.3.8)$$

Since

$$p_1^*(x^n|\theta) = \frac{m^*(x^n)\, e^{-L_n(x^n,\theta)}}{\int m^*(t^n)\exp\{-L_n(t^n,\theta)\}\, dt^n},$$

we denote the inverse of the denominator of the right hand side of the last expression by $h(n,\theta)$. By definition we require

$$\int \frac{e^{-L_n(x^n,\theta)}\, w(\theta)}{\int m^*(t^n)\exp\{-L_n(t^n,\theta)\}\, dt^n}\, d\theta \le 1$$

for all $x^n$. That is, for all $x^n$ we have

$$\int e^{-L(x^n,\theta)}\, h(n,\theta)\, w(\theta)\, d\theta \le 1. \qquad (5.3.9)$$

We show that for any pre-assigned $\epsilon > 0$ there exists $A_{1,n}$ with $W_1(A_{1,n}^c) < \epsilon$ such that $e^{-L(x^n,\theta)} h(n,\theta)$ is bounded by a number $N$ on $A_{1,n}$, uniformly in $n$ and $x^n$. By way of contradiction, suppose there is an $\epsilon_0$ and $x^n$ so that for some sequence $N(n) \to \infty$ as $n \to \infty$, there is a sequence of sets $A'_{1,n}$ such that

$$e^{-L(x^n,\theta)}\, h(n,\theta) > N(n) \ \text{ on } A'_{1,n}, \quad\text{and}\quad W_1(A'_{1,n}) > \epsilon_0.$$

Then

$$\int e^{-L(x^n,\theta)} h(n,\theta)\, w(\theta)\, d\theta \ge \int_{A'_{1,n}} e^{-L(x^n,\theta)} h(n,\theta)\, w(\theta)\, d\theta \ge N(n)\,\epsilon_0 \to \infty,$$

contradicting (5.3.9). Now we have

$$\frac{p_1^*(x^n|\theta)}{m^*(x^n)} = e^{-L(x^n,\theta)}\, h(n,\theta) \le N, \quad \forall \theta \in A_{1,n},\ \forall n,\ \forall x^n. \qquad (5.3.10)$$

By (5.3.8), (5.3.9) and (5.3.10), for all $\theta \in A_{1,n}$ and all $n$ we have

$$P_1^*(S_1^c|\theta) \le \frac{1}{C_1(n) I_1^*(n)}\, N\, I_1^*(n) = \frac{N}{C_1(n)} \to 0$$

as $n \to \infty$, since $C_1(n) \to \infty$ by assumption. $\Box$

Let $\mathcal{B}$ be the Borel field on the uni-dimensional parameter space $\Theta$, and let $W_{(i)}(\cdot|Data)$ be the posterior probability measure corresponding to $w_{(i)}(\cdot|Data)$. Here, $Data$ means the full data set or whatever summary statistics are being used to form the posterior.
If we use MIL's (here we mean the multi-dimensional MIL's, not the product of one-dimensional MIL's) and assume that the data came from a canonical set, then the posterior probabilities converge to a common limit in variation distance. We have the following.

Theorem 5.3.1 Let Model I be obtained from the prior $w(\cdot)$ as in (ii) of Proposition 5.2.1, and suppose Models II and III are obtained by using the priors $w_1(\cdot)$ and $w_2(\cdot)$ as described in Section 5.2. Under the conditions of Theorem 3.1.1, if Models I, II, and III are formed from MIL's, then the posteriors from these models conditional on canonical data satisfy

$$\sup_{B\in\mathcal{B}} |W^*_{(i)}(B|X^n,Y^n) - W^*_{(j)}(B|X^n,Y^n)| \le C_{i,j}(n) + o_{P_1^*+P_2^*,i,j}(1),$$

for $1 \le i,j \le 3$, where $C_{i,j}(n)$ tends to zero as $n$ tends to infinity. The error terms are

$$o_{P_1^*+P_2^*,1,2}(1) = 0, \qquad o_{P_1^*+P_2^*,1,3}(1) = o_{P_1^*+P_2^*,2,3}(1) = o_{P_1^*(\cdot|\theta_1)}(1) + o_{P_2^*(\cdot|\theta_2)}(1),$$

for $\theta_1 \in A_{1,n}$, $\theta_2 \in A_{2,n}$, where $A_{1,n}$ and $A_{2,n}$ are as in Proposition 5.3.2. Convergences in probability are assessed with the appropriate mixture density.

Remark 1: Theorem 5.3.1 gives a sense in which Bayesian inferences using the MIL's and canonical data sets are insensitive to which of the three models is used, at least for large sample sizes. In particular, this means that the conclusions of Bayesian hypothesis tests based on posteriors using MIL's are also robust against modeling strategy for canonical data. This follows from recalling that a Bayes hypothesis test is based on the posterior odds ratio or, equivalently, the posterior probability of the null. This is a limited form of robustness against modeling strategy.

Remark 2: Our strategy of proof is to expand the $W^*_{(i)}(B|Data)$'s as $W(B)$ plus the corresponding error terms. For example, we write

$$W^*_{(1)}(B|Z^n) = \int_B w(\theta)\, d\theta + \int_B \big(w^*(\theta|Z^n) - w(\theta)\big)\, d\theta$$

and prove the second term is negligible. For $W^*_{(2)}(B|X^n,Y^n)$ and $W^*_{(3)}(B|Z^n)$, the treatments are similar.
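Note that this holds because convergences are assessed in a mixture density, not in an iid density. We use the mixture density so that canonical data can be defined; in practice, however, one would assume that there is a true value of the parameter.

Proof: First we prove the conclusion for Models I and II. We first expand $W^*_{(1)}(B|Z^n)$ and $W^*_{(2)}(B|X^n,Y^n)$ as $W(B)$ plus negligible error terms. For any $B \in \mathcal{B}$, we have

$$W^*_{(1)}(B|Z^n) = \int_B w(\theta)\, d\theta + \int_B \big(w^*(\theta|Z^n) - w(\theta)\big)\, d\theta.$$

Now,

$$|W^*_{(1)}(B|Z^n) - W(B)| \le \sqrt{2\ln 2}\,\sqrt{C_0(n) I_0^*(n)}, \qquad (5.3.11)$$

where the error term $\sqrt{2\ln 2}\,\sqrt{C_0(n) I_0^*(n)}$ is $o(1)$ as $n \to \infty$, for any data $X^n, Y^n$ giving $z^n \in S_0$. Similarly,

$$W^*_{(2)}(B|X^n,Y^n) = \int_B \frac{1}{2}\int w_1^*\Big(\frac{\phi+\theta}{2}\Big|X^n\Big)\, w_2^*\Big(\frac{\phi-\theta}{2}\Big|Y^n\Big)\, d\phi\, d\theta = W(B) + J_{2,1}(X^n) + J_{2,2}(X^n,Y^n),$$

where the first term above is $W(B)$ and the error terms arise by adding and subtracting $W(B)$ and $\int_B \frac{1}{2}\int w_1^*\big(\frac{\phi+\theta}{2}\big|X^n\big)\, w_2\big(\frac{\phi-\theta}{2}\big)\, d\phi\, d\theta$. The mean value theorem for integrals and the canonicality of the data give bounds on the error terms. We have

$$|J_{2,1}(X^n)| \le \frac{1}{2}\int\!\!\int \Big|w_1^*\Big(\frac{\phi+\theta}{2}\Big|X^n\Big) - w_1\Big(\frac{\phi+\theta}{2}\Big)\Big|\, w_2\Big(\frac{\phi-\theta}{2}\Big)\, d\phi\, d\theta = \int\!\!\int |w_1^*(\theta_1|X^n) - w_1(\theta_1)|\, w_2(\theta_2)\, d\theta_1\, d\theta_2$$

$$\le \sqrt{2\ln 2}\,\sqrt{C_1(n) I_1^*(n)} \int w_2(\theta_2)\, d\theta_2 = \sqrt{2\ln 2}\,\sqrt{C_1(n) I_1^*(n)}, \qquad (5.3.12)$$

thus $J_{2,1}(X^n)$ tends to zero as $n$ tends to infinity. Similarly, for canonical data $X^n, Y^n$,

$$|J_{2,2}(X^n,Y^n)| \le \frac{1}{2}\int\!\!\int \Big|w_2^*\Big(\frac{\phi-\theta}{2}\Big|Y^n\Big) - w_2\Big(\frac{\phi-\theta}{2}\Big)\Big|\, w_1^*\Big(\frac{\phi+\theta}{2}\Big|X^n\Big)\, d\phi\, d\theta = \int\!\!\int |w_2^*(\theta_2|Y^n) - w_2(\theta_2)|\, w_1^*(\theta_1|X^n)\, d\theta_2\, d\theta_1$$

$$\le \sqrt{2\ln 2}\,\sqrt{C_2(n) I_2^*(n)}, \qquad (5.3.13)$$

so $J_{2,2}(X^n,Y^n)$ tends to zero as $n$ tends to infinity. Thus, by (5.3.12) and (5.3.13),

$$|W^*_{(2)}(B|X^n,Y^n) - W(B)| \le b_2(n), \qquad (5.3.14)$$

where $b_2(n) = \sqrt{2\ln 2}\,\big[\sqrt{C_1(n) I_1^*(n)} + \sqrt{C_2(n) I_2^*(n)}\big]$. Now by (5.3.11) and (5.3.14), the conclusion is true with $C_{1,2}(n) = \sqrt{2\ln 2}\,\sqrt{C_0(n) I_0^*(n)} + b_2(n)$, which tends to zero as $n$ tends to infinity.

Now we prove the conclusion for Models I and III, and for Models II and III.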
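We first express $W^*_{(3)}(B|Z^n)$ as $W(B)$ plus an error term. We have

$$w_{(3)}(\theta|Z^n) = \frac{1}{4\, m_{(3)}(Z^n)}\int\!\!\int p_1^*\Big(\frac{s^n+Z^n}{2}\Big|\frac{\phi+\theta}{2}\Big)\, p_2^*\Big(\frac{s^n-Z^n}{2}\Big|\frac{\phi-\theta}{2}\Big)\, w_1\Big(\frac{\phi+\theta}{2}\Big)\, w_2\Big(\frac{\phi-\theta}{2}\Big)\, ds^n\, d\phi,$$

where

$$p_1^*\Big(\frac{s^n+Z^n}{2}\Big|\theta_1\Big)\, w_1(\theta_1) = w_1^*\Big(\theta_1\Big|\frac{s^n+Z^n}{2}\Big)\, m_1\Big(\frac{s^n+Z^n}{2}\Big), \qquad p_2^*\Big(\frac{s^n-Z^n}{2}\Big|\theta_2\Big)\, w_2(\theta_2) = w_2^*\Big(\theta_2\Big|\frac{s^n-Z^n}{2}\Big)\, m_2\Big(\frac{s^n-Z^n}{2}\Big),$$

and

$$m_{(3)}(z^n) = \frac{1}{2}\int\!\!\int\!\!\int p_1^*\Big(\frac{s^n+z^n}{2}\Big|\xi_1\Big)\, p_2^*\Big(\frac{s^n-z^n}{2}\Big|\xi_2\Big)\, w_1(\xi_1)\, w_2(\xi_2)\, d\xi_1\, d\xi_2\, ds^n = \frac{1}{2}\int m_1\Big(\frac{s^n+z^n}{2}\Big)\, m_2\Big(\frac{s^n-z^n}{2}\Big)\, ds^n. \qquad (5.3.15)$$

So, for all $B \in \mathcal{B}$, $W^*_{(3)}(B|Z^n)$ can be written as

$$\int_B \frac{1}{2}\int\!\!\int w_1^*\Big(\frac{\phi+\theta}{2}\Big|\frac{s^n+Z^n}{2}\Big)\, w_2^*\Big(\frac{\phi-\theta}{2}\Big|\frac{s^n-Z^n}{2}\Big)\, \frac{m_1\big(\frac{s^n+Z^n}{2}\big)\, m_2\big(\frac{s^n-Z^n}{2}\big)}{2\, m_{(3)}(Z^n)}\, ds^n\, d\phi\, d\theta.$$

Let

$$h(s^n,Z^n) = \frac{m_1\big(\frac{s^n+Z^n}{2}\big)\, m_2\big(\frac{s^n-Z^n}{2}\big)}{2\, m_{(3)}(Z^n)};$$

then $\int h(s^n,Z^n)\, ds^n = 1$. So, for fixed $Z^n = z^n$, $h(\cdot,Z^n)$ is a probability density. Now, by adding and subtracting the same terms as for Model II, we have

$$W^*_{(3)}(B|Z^n) - W(B) = \int J_{2,1}\Big(\frac{s^n+Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n + \int J_{2,2}\Big(\frac{s^n+Z^n}{2},\frac{s^n-Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n.$$

Recall that by the definition of $S_1$ in Proposition 5.3.2, we have

$$\Big|\int J_{2,1}\Big(\frac{s^n+Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n\Big| \le \Big|\int_{2S_1 - Z^n} J_{2,1}\Big(\frac{s^n+Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n\Big| + \Big|\int_{(2S_1 - Z^n)^c} J_{2,1}\Big(\frac{s^n+Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n\Big|,$$

where $2S_1 - Z^n$ is the set of all $s^n$'s, for fixed $Z^n = z^n$, such that $\frac{s^n+z^n}{2} \in S_1$. By (5.3.12), the first term on the right hand side above is bounded by $\sqrt{2\ln 2}\,\sqrt{C_1(n) I_1^*(n)}$. For the second term on the right hand side, note that $J_{2,1}$ is bounded and $h(\cdot,Z^n)$ is a density, so we may apply Proposition 5.3.2 to assert that $P_1^*((2S_1 - Z^n)^c)$ tends to zero as $n$ tends to infinity. Thus, the second term on the right hand side converges to zero in $P_1^*$ probability, i.e. it is $o_{P_1^*}(1)$. Similarly,

$$\Big|\int J_{2,2}\Big(\frac{s^n+Z^n}{2},\frac{s^n-Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n\Big| \le \Big|\int_{2S_2+Z^n} J_{2,2}\Big(\frac{s^n+Z^n}{2},\frac{s^n-Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n\Big| + \Big|\int_{(2S_2+Z^n)^c} J_{2,2}\Big(\frac{s^n+Z^n}{2},\frac{s^n-Z^n}{2}\Big)\, h(s^n,Z^n)\, ds^n\Big|.$$

By (5.3.13), the first term above is bounded by $\sqrt{2\ln 2}\,\sqrt{C_2(n) I_2^*(n)}$. For the second term, the argument is similar to the one before: $J_{2,2}(\cdot,\cdot)$ is bounded and $h(\cdot,Z^n)$ is a density, and Proposition 5.3.2 asserts that the $P_2^*$ probability of the set $(2S_2+Z^n)^c$ is $o(1)$, so the second term on the right hand side above is $o_{P_2^*}(1)$. Thus we have

$$|W^*_{(3)}(B|Z^n) - W(B)| \le b_2(n) + o_{P_1^*}(1) + o_{P_2^*}(1). \qquad (5.3.16)$$

Now since $b_2(n) \to 0$ as $n \to \infty$, by (5.3.11) and (5.3.16) we get the conclusion for Models I and III. By (5.3.14) and (5.3.16), we get the conclusion for Models II and III.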
$\Box$

Since the three models rarely coincide but are similar in a general sense, we are interested in which pairs of them are closer together or further apart. The following proposition tells us that, roughly speaking, Models II and III are the closest, and Models I and III differ most. This is consistent with intuition. Models II and III start from the same likelihoods; the difference is that Model II is transformed once, in the parameters, and Model III is transformed twice, in both data and parameters. This additional transformation makes Model III differ most from Model I. For Model I, the likelihood is different from that of Models II and III, and it differs more from Models II and III than they do from each other.

Proposition 5.3.3 Assume the general likelihoods as described at the beginning of Chapter 5 (so the results include all the models we considered in Chapter 4). We have

(i)
$$E_{(X^n,Y^n)}\Big(D(w_{(2)}(\cdot|X^n,Y^n)\,\|\,w_{(1)}(\cdot|Z^n)) - D(w_{(2)}(\cdot|X^n,Y^n)\,\|\,w_{(3)}(\cdot|Z^n))\Big) \ge 0.$$

(ii)
$$E_{(X^n,Y^n)}\Big(D(w_{(3)}(\cdot|Z^n)\,\|\,w_{(1)}(\cdot|Z^n)) - D(w_{(3)}(\cdot|Z^n)\,\|\,w_{(2)}(\cdot|X^n,Y^n))\Big) \ge 0.$$

(iii)
$$E_{(X^n,Y^n)}\Big(D(w_{(1)}(\cdot|Z^n)\,\|\,w_{(3)}(\cdot|Z^n)) - D(w_{(1)}(\cdot|Z^n)\,\|\,w_{(2)}(\cdot|X^n,Y^n))\Big) \ge 0,$$

where all the expectations are taken with respect to the marginal distribution of $(X^n,Y^n)$.

Proof: (i) Let the marginal density for $Z^n$ be $m(z^n)$, obtained from the joint marginal $m_1(x^n)\, m_2(y^n)$ by the transformation $Z^n = X^n - Y^n$, $S^n = X^n + Y^n$ and integrating out $s^n$. Now

$$m(z^n) = \frac{1}{2}\int m_1\Big(\frac{s^n+z^n}{2}\Big)\, m_2\Big(\frac{s^n-z^n}{2}\Big)\, ds^n,$$

which is the same as $m_{(3)}(z^n)$, the marginal density for Model III,

$$m_{(3)}(z^n) = \frac{1}{2}\int\!\!\int\!\!\int p_1\Big(\frac{s^n+z^n}{2}\Big|\eta_1\Big)\, p_2\Big(\frac{s^n-z^n}{2}\Big|\eta_2\Big)\, w_1(\eta_1)\, w_2(\eta_2)\, d\eta_1\, d\eta_2\, ds^n = \frac{1}{2}\int m_1\Big(\frac{s^n+z^n}{2}\Big)\, m_2\Big(\frac{s^n-z^n}{2}\Big)\, ds^n.$$

So,

$$E_{(X^n,Y^n)}\Big(D(w_{(2)}(\cdot|X^n,Y^n)\,\|\,w_{(1)}(\cdot|Z^n)) - D(w_{(2)}(\cdot|X^n,Y^n)\,\|\,w_{(3)}(\cdot|Z^n))\Big)$$

$$= E_{(X^n,Y^n)} \int \frac{1}{2}\int w_1\Big(\frac{\phi+\theta}{2}\Big|X^n\Big)\, w_2\Big(\frac{\phi-\theta}{2}\Big|Y^n\Big)\, d\phi\; \log\frac{\int\!\!\int p_1\big(\frac{s^n+Z^n}{2}\big|\frac{\phi+\theta}{2}\big)\, p_2\big(\frac{s^n-Z^n}{2}\big|\frac{\phi-\theta}{2}\big)\, w_1\big(\frac{\phi+\theta}{2}\big)\, w_2\big(\frac{\phi-\theta}{2}\big)\, ds^n\, d\phi}{4\, m_{(3)}(Z^n)\, w_{(1)}(\theta|Z^n)}\, d\theta. \qquad (5.3.17)$$

The log-sum inequality states that for any integer $n$ and any non-negative numbers $a_1,\ldots,a_n$ and $b_1,\ldots,b_n$, we have

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge \Big(\sum_{i=1}^n a_i\Big) \log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i},$$
Us ing this inequality, the right hand side of (5.3.17) is greater than / E[Xnyn) J l W l { t t l l X n ) M ^ l l Y n ) d ( f > g 4m{3)(Zn)EZnW{1)(8\Z") = /(// J lM^\^M^\ynM^Myn)dxndynd^j 1 / / / W l ( ^ | ^ ) ^ ( ^ | ^ ) m i ( ^ ) m 2 ( ^ ) ^ " ^ S EZnW{1)(0\Z») . | / / / u > 1 ( ^ | 3 ; " ) t i ; 2 ( ^ l y " ) m 1 ( a " ) m 2 ( y " ) d 3 ! " d y " # „ ^ ft l 0 g £ z » t « , ( 1 ) ( * | Z « ) ^ - ° ' since i t is the relative entropy between two densities of 9. (i i) The proof is s imilar to that of ( i) . (i i i) It is enough to note that ^ ( X « J K " ) ( ^ ( ^ ( l ) ( - | ^ " ) l l ^ ( 3 ) ) ( - | ^ ) ) - ^ ( ^ ( 1 ) ( - | ^ ) l l ^ ( 2 ) ( - | ^ T i , ^ ) 1 f f ,c/>+6.sn + Z n , 0 , 5 n - Z n N - l 0 g 4 ^ ^ ) J J — I " a — M — I " a — ) <j" _l_ 7 i s n _ Vn \ m1{—^—)m2^-1^-)dct>dsnye > J J EZnW{1)(0\Zn) y j f w1(^\xn)w2(^\yn)m1(xn)m2(yn)dxndynd<l> = J J EZnW{1)(e\zn) HI t P i ( ^ | a ; " ) W 2 ( ^ | y n ) m 1 ( a ; " ) m 2 ( y ' t ) r f x " d y " d 0 = Q q IIIwi(^\xn)w2(^-\yn)m1(xn)m2(yn)dxndynd(j) 117 C o m m e n t : In Chapter 4, the data analysis used Models I and II and we found that they gave similar conclusions. Accord ing to Propos i t ion 5.3.3, we can expect that the conclusions from M o d e l III to be close to those of M o d e l II. Models II and III are closer because both use pairs of data points and integrate out a function of the data. Tha t they give weaker results may be at t r ibuted to the marginal izat ion procedure. In marginal iz ing, we take an average over a l l possible data points. Th i s is different from fixing a part icular data point , and so impl i c i t l y includes var ia t ion from two random variables whereas i n M o d e l I we have included variat ion from only one random variable. Since two random variables generally have more var iabi l i ty than one does, i t seems Models II and III require more data to achieve inferences of the same strength as we got from M o d e l I. 
Physically, one should ask if there are two random variables that are reasonable to model, or if we have only one random variable (as for Model I). This is a scientific question brought to light by our statistical analysis. However, statistical analysis alone cannot provide an answer. The results from Models II and III are similar to those of Nader and Reboussin, Section 4.2, in which the conclusion is not clear-cut. Model I gives stronger inferences and clear answers. This leaves a question for the experimenter: are the model assumptions in Model I reasonable? That is, do we have one random variable, or must we assume there are more of them?

By the above Proposition, Models I and III differ most on average. So the average discrepancy between Models I and III can be taken as a measure of robustness of the likelihood-prior quadruple $\{p_1(\cdot|\theta_1), p_2(\cdot|\theta_2), w_1(\theta_1), w_2(\theta_2)\}$ against the three modeling strategies. To simplify the problem, we may fix the priors, and for Model I we take the prior

\[ w_{(1)}(\theta) = \int \frac{1}{2}\, w_1\Big(\frac{\phi+\theta}{2}\Big)\, w_2\Big(\frac{\phi-\theta}{2}\Big)\, d\phi \]

and the likelihood to be

\[ p_{(1)}(z|\theta) = \frac{1}{2}\iint p_1\Big(\frac{s+z}{2}\,\Big|\,\frac{\phi+\theta}{2}\Big)\, p_2\Big(\frac{s-z}{2}\,\Big|\,\frac{\phi-\theta}{2}\Big)\, ds\, d\phi. \]

Then the question becomes the robustness of the likelihood pair $\{p_1(\cdot|\theta_1), p_2(\cdot|\theta_2)\}$ against the three modeling strategies. Specifically, we can use

\[ R = \exp\{-E_{X,Y}\, D(w_{(3)}(\cdot|Z)\,\|\,w_{(1)}(\cdot|Z))\} = \exp\{-E_{m_{(3)}(z)}\, D(w_{(3)}(\cdot|Z)\,\|\,w_{(1)}(\cdot|Z))\} \]

to measure the robustness, where $Z = X - Y$. Or, to simplify the calculation, we may use

\[ R' = \exp\{-D(E_{X,Y}[w_{(3)}(\cdot|Z)]\,\|\,E_{X,Y}[w_{(1)}(\cdot|Z)])\} = \exp\{-D(w_{(3)}(\cdot)\,\|\,E_{m_{(3)}(z)}[w_{(1)}(\cdot|Z)])\} \]

as the measure. Note that $0 < R \le R' \le 1$; the middle inequality follows from the convexity of the relative entropy. Roughly speaking, the larger $R'$ is, the more robust the likelihood pair is against the three models, and $R'$ can be made arbitrarily close to 1 by choosing appropriate likelihoods and priors.
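The two measures $R$ and $R'$ can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the posteriors are tabulated on a finite grid of $\theta$ values, the distribution of $Z$ is replaced by a few weighted points, and the toy normal posteriors (including their means and standard deviations) are hypothetical numbers chosen for illustration, not posteriors computed from any of the three models. The function names are likewise hypothetical.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) for two probability vectors on the same theta grid."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def robustness(post3, post1, weights):
    """
    post3[j], post1[j]: posteriors of theta under Models III and I given z_j,
    stored as probability vectors on a common theta grid.
    weights[j]: probability attached to z_j (a discretised m_(3)(z)), summing to 1.
    Returns (R, R').
    """
    # R averages the relative entropies over z.
    avg_kl = sum(w * kl(p3, p1) for w, p3, p1 in zip(weights, post3, post1))
    # R' compares the z-averaged posteriors instead.
    mix3 = weights @ post3
    mix1 = weights @ post1
    return float(np.exp(-avg_kl)), float(np.exp(-kl(mix3, mix1)))

def normal_on_grid(grid, mean, sd):
    """Normal density evaluated on the grid and renormalised to sum to 1."""
    p = np.exp(-0.5 * ((grid - mean) / sd) ** 2)
    return p / p.sum()

theta = np.linspace(-8.0, 8.0, 801)
z_values = np.array([-1.0, 0.0, 0.5, 2.0])       # hypothetical values of Z
weights = np.full(len(z_values), 0.25)           # hypothetical weights for them

# Hypothetical stand-ins for w_(1)(.|z) and w_(3)(.|z).
post1 = np.array([normal_on_grid(theta, 0.6 * z, 0.6) for z in z_values])
post3 = np.array([normal_on_grid(theta, 0.5 * z, 0.8) for z in z_values])

R, R_prime = robustness(post3, post1, weights)
print(R, R_prime)
```

Because the relative entropy is jointly convex in its two arguments, the averaged posteriors are never further apart than the posteriors are on average, which is why $R \le R'$ in this sketch as in the text.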
To see this, note where and If we take the priors for Models I and III to be

\[ w_{(1)}(\theta) = \int \frac{1}{2}\, w_1\Big(\frac{\phi+\theta}{2}\Big)\, w_2\Big(\frac{\phi-\theta}{2}\Big)\, d\phi = w_{(3)}(\theta) \]

and the likelihoods for Models I and III, as $\{p_1(\cdot|\theta_1), p_2(\cdot|\theta_2)\}$, to be uniform densities on finite intervals, then the interval bounds may depend on $\theta_1$, $\theta_2$ respectively. If the intervals are large, then $m_{(1)}(\cdot) \approx m_{(3)}(\cdot)$, so

\[ E_{m_{(3)}(z)}[w_{(1)}(\cdot|z)] \approx \int p_{(1)}(z|\theta)\, w_{(1)}(\theta)\, dz = w_{(1)}(\cdot) = w_{(3)}(\cdot), \]

thus $R' \approx 1$.

5.4 More Considerations for the Robustness Issue

Here we discuss some more considerations for the robustness issue addressed in the previous sections. These considerations are at an initial stage: the conditions imposed are too strong in practice, and the results are comparatively weak. However, we list these results here for possible further improvement. We have compared Models I and II computationally in an example. Here we compare Models II and III theoretically.

We begin by stating and proving sufficient conditions for the left hand side of (5.1.3) to reduce to a function $p(z^n|\theta)$ with $\theta = f(\theta_1,\theta_2)$. Assume that $X_1,\ldots,X_n$ are iid $p_1(x|\theta_1)$, and that $Y_1,\ldots,Y_n$ are iid $p_2(y|\theta_2)$. Assume $Z^n = f(X^n,Y^n)$ has density of the form

\[ p(z^n|\theta) = \prod_{i=1}^{n} p(z_i|\theta), \]

where

\[ p(z|\theta) = \int p_1(g_1(s,z)|\theta_1)\, p_2(g_2(s,z)|\theta_2)\, J(s,z)\, ds, \]

and $\theta = f_1(\theta_1,\theta_2)$. Because we have assumed independence, it is enough to examine the $n = 1$ case. Assume the distributions of $X$ and $Y$ are in a one-parameter exponential family, i.e.

\[ p(x|\theta_1) = \exp\{\eta_1(\theta_1)T_1(x) - B_1(\theta_1)\}\,h_1(x), \qquad (5.4.1) \]

\[ p(y|\theta_2) = \exp\{\eta_2(\theta_2)T_2(y) - B_2(\theta_2)\}\,h_2(y). \qquad (5.4.2) \]

Let $z = f_1(x,y)$, $s = f_2(x,y)$, $\theta = f_1(\theta_1,\theta_2)$, $\phi = f_2(\theta_1,\theta_2)$, or $x = g_1(s,z)$, $y = g_2(s,z)$ and $\theta_1 = g_1(\theta,\phi)$, $\theta_2 = g_2(\theta,\phi)$, and let $J(s,z)$ be the Jacobian of the transformation. In the following Proposition, we give a sufficient condition for $p(z|\theta_1,\theta_2)$ to reduce to the form $p_{(3)}(z|\theta)$ for the exponential family.
It basically says that if the exponents of $p_1$ and $p_2$ can be rearranged into a sum of exponents from two exponential families, one for $Z^n$ and one for $S^n$, then we get the reduction we want. In this section, Proposition 5.4.1 and Theorem 5.4.2 are only for exponential families; Proposition 5.4.2, Proposition 5.4.4 and Theorem 5.4.1 are for the MILs in both the product form and the n-dimensional dependent form; Proposition 5.4.3 is for any likelihoods as long as they satisfy the stated hypotheses.

Proposition 5.4.1 (Reduction in parameters for exponential families) Suppose the distributions $p_1$ and $p_2$ are given by (5.4.1) and (5.4.2) respectively. If

\[ \eta_1(\theta_1)T_1(g_1(s,z)) - B_1(\theta_1) + \log h_1(g_1(s,z)) + \eta_2(\theta_2)T_2(g_2(s,z)) - B_2(\theta_2) + \log h_2(g_2(s,z)) + \log J(s,z) \]
\[ = \tilde\eta_1(\phi)\tilde T_1(s) - \tilde B_1(\phi) + \log \tilde h_1(s) + \tilde\eta_2(\theta)\tilde T_2(z) - \tilde B_2(\theta) + \log \tilde h_2(z) \qquad (5.4.3) \]

for some functions $\tilde\eta_k(\cdot)$, $\tilde T_k(\cdot)$, $\tilde B_k(\cdot)$, $\tilde h_k(\cdot)$ $(k = 1,2)$, then

\[ p(z|\theta_1,\theta_2) = p(z|\theta) = \exp\{\tilde\eta_2(\theta)\tilde T_2(z) - \tilde B_2(\theta)\}\,\tilde h_2(z), \]

i.e. $p(z|\theta_1,\theta_2)$ reduces to a density that depends on the parameters only through $\theta$.

Proof:

\[ p(z|\theta_1,\theta_2) = \int p_1(g_1(s,z)|\theta_1)\, p_2(g_2(s,z)|\theta_2)\, J(s,z)\, ds \]
\[ = \int \exp\{\eta_1(\theta_1)T_1(g_1(s,z)) - B_1(\theta_1) + \log h_1(g_1(s,z))\}\, \exp\{\eta_2(\theta_2)T_2(g_2(s,z)) - B_2(\theta_2) + \log h_2(g_2(s,z)) + \log J(s,z)\}\, ds \]
\[ = \int \exp\{\tilde\eta_1(\phi)\tilde T_1(s) - \tilde B_1(\phi) + \log \tilde h_1(s) + \tilde\eta_2(\theta)\tilde T_2(z) - \tilde B_2(\theta) + \log \tilde h_2(z)\}\, ds \]
\[ = \exp\{\tilde\eta_2(\theta)\tilde T_2(z) - \tilde B_2(\theta)\}\,\tilde h_2(z), \]

where the last step uses the fact that the factor in $s$ is an exponential family density and so integrates to one. •

Remark: If $p_1(x|\theta_1)$ and $p_2(y|\theta_2)$ are normal densities with $\theta_1$, $\theta_2$ their respective location parameters (without loss of generality assume their variance is 1), then $\eta_i(\theta_i) = \theta_i$, $T_i(x) = x$, $B_i(\theta_i) = \theta_i^2/2$, $h_i(x) = (2\pi)^{-1/2}e^{-x^2/2}$ $(i = 1,2)$. Assume $f_1(x,y) = x + y$, $f_2(x,y) = x - y$. Then $J(s,z) = \frac{1}{2}$ and (5.4.3) is satisfied with $\tilde\eta_i(\theta) = \frac{1}{2}\theta$, $\tilde T_i(s) = s$, $\tilde B_i(\theta) = \frac{1}{4}\theta^2$ and $\tilde h_i(t) = (4\pi)^{-1/2}e^{-t^2/4}$ $(i = 1,2)$, the factor $J(s,z)$ being absorbed into the $\tilde h_i$.

Next we show that the conclusion of Proposition 5.4.1 holds for certain MILs.
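The normal example in the Remark above can be checked numerically. The sketch below is an illustration only: it assumes unit-variance normal likelihoods, $z = x + y$, $s = x - y$ and $J(s,z) = 1/2$, integrates $s$ out with a simple Riemann sum, and compares the result with the $N(\theta_1+\theta_2,\,2)$ density. The function names, the grid for $s$ and the parameter values are hypothetical choices.

```python
import numpy as np

def norm_pdf(x, mean=0.0, var=1.0):
    """Normal density with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def density_of_z(z, theta1, theta2, s_grid):
    """
    p(z | theta1, theta2) = integral of p1(g1(s,z)|theta1) p2(g2(s,z)|theta2) J(s,z) ds
    with x = g1(s,z) = (s + z)/2, y = g2(s,z) = (z - s)/2 and J(s,z) = 1/2.
    """
    x = (s_grid + z) / 2.0
    y = (z - s_grid) / 2.0
    ds = s_grid[1] - s_grid[0]
    return float(np.sum(norm_pdf(x, theta1) * norm_pdf(y, theta2) * 0.5) * ds)

s_grid = np.linspace(-30.0, 30.0, 6001)
z = 0.7

# Two different (theta1, theta2) pairs sharing the same theta = theta1 + theta2 = 1.5.
p_a = density_of_z(z, 1.0, 0.5, s_grid)
p_b = density_of_z(z, -0.3, 1.8, s_grid)

# The reduced form: a normal density in z with mean theta and variance 2.
target = float(norm_pdf(z, mean=1.5, var=2.0))
print(p_a, p_b, target)
```

The two parameter pairs give the same value because the density of $z$ depends on $(\theta_1,\theta_2)$ only through their sum, which is the reduction asserted by Proposition 5.4.1 in this special case.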
Note that M I L ' s have a form similar to that assumed i n Propos i t ion 5.4.1, but they are different. P r o p o s i t i o n 5.4.2 Let pi(xn\0i) and p2(yn\62) be M I L ' s denoted by p j ( x n | 0 i ) and p*2{yn\92). (i) If the margina l priors W\{-) and w2(-) and the marginal densities ml(-) and fn2(-) are the uniform densities on Rn, and the Lk(-, •) i n pk (k = 1,2) satisfies 0i) + i 2 ( y , 0 2 ) = r i ( / i ( a ; , » ) , fi(9u92)) + r2(f2(x, j , ) , / 2 ( 0 i , 0 2 ) ) , Vx,y,91,92e R\_ (5.4.4) 121 wi th rj.it, 0) = rki\t-0\) > 0,(A;= 1,2), then again pizn\01,02) reduces to pizn\0) and where c(n) is the normal iz ing constant. (ii) For n = 1, i f the marginal priors u>i(-) and w 2 ( - ) a r e the uniform densities on R1, then m^(-) and m ^ - ) are the uniform densities on R1. R e m a r k 1: If Z i ( t , 0 ) = L2(t,0) = (t - 6>)2, / i ( s , y ) = x + y, then (5.4.4) is true w i t h Llit,0) = r_2it,0) = \it-0f. R e m a r k 2: If mj(-) or m ^ - ) is not uniform, p ( 2 n | 0 i , 0 2 ) w i l l not necessarily reduce to the f o r m p ( z " | 0 ) . For example, i f we take w a ( - ) = w2{-) = N(0,1), Li(t, 0) = L2{t,0) = ( * - 6 » ) 2 , Xi = X2 = 1, n = 1, zn — xn + yn, then form Example 1.4.3, we know that m*kix) = •^e-x\k = 1,2), and so X ^ i , ^ ) oc e x p { - l ( 0 i - z ) 2 - i ( 0 2 + z)2} which cannot be reduced to the form piz\0). P r o o f : (i) B y definition, the density for Zn is jK(<7i(A znM)P*2(92(sn, zn)\02)Jisn,zn)ds f m^jg1jsn, zn))m*2ig2jsn, z " ) ) e - A i M a i ( ^ n ) A ) - A 2 L 2 ( 3 2 ( s V " ) , g 2 ) ~ J / TO*(*«)e-AiM*nA)<ft« / m*itn)e-x^(tn^)dtn B y the assumptions on the priors and marginals we get J(sn,z .n )dsn. e x p { - A i r 1 ( a " , 0 ) - X2r2jzn,0)} Jisn,zn)ds , 1 J jexp{-AiZi(in, 0 x ) } d t « /(tn) exv{-X2L2itn,92)}dt> which reduces, by (5.4.3), to c ( n ) e x p ( - A 2 r 2 ( 2 n , 0 ) ) / e x p ( - A 1 r 1 ( 5 N , 0))J(sn, zn)dsn. 
(ii) In the case n — 1, since mj(-) is uniquely determined by the inequali ty 122 w i t h equality for a l l x i n the support of rn\{-). We see that i f w\(-) is uniform on R1, then the uniform density for m^(-) satisfies the above inequality. The same holds for m^-). • A s we w i l l see later, the problem of the equivalence of models is much harder than that of the mean of the model , the conditions imposed are so stringent that only i n very rare cases the equivalence can be guaranteed. Now we show that for the exponential families we discussed i n Propos i t ion 5.4.1, i f the density for Z = fi(X,Y) reduces to a l ikel ihood i n 0 alone, then M o d e l II and III are identical i n a formal sense, namely the posteriors are the same. Proposition 5.4.3 (i) Suppose the joint prior w(-,-) satisfies and for some non-negative q\, q2 where q\ satisfies f qi(sn,zn)dsn = C, then w(2)(6\xn,yn) = w(3)(0\zn). (i i) Suppose pi(a; |0i) and p2(x\02) are given by (5.4.1) and (5.4.2) respectively, and (5.4.3) is satisfied. Assume further that J(s, z) is constant, that w(0i, 92) = ^ 1(^1)^ 2(^ 2), and that for some integrable tf>i(-) and w2(-). Then w{2)(e\xn,yn) = w{3)(e\zn). 123 P r o o f : ( i ) Let us abuse notat ion to write tf n n I - m /S n + Znl<f> + 6^ . S n - Zn .<f>-0. and, for simplici ty, denote A + e <t>-e ™ ( — 2 ~ » — — ) = ™(<M)-Since ^ ( 2 ) ( 0 | x " , 2 / n ) « y f(sn,znM,0)w(<t>,0)d<l>, and by the definition of M o d e l III, w(3)(6\zn)<x j jf{sn,zn,\<f>,e)d<i>dsn J' w(<f>,6)d<l>, we have W(2)(0\xn, yn) = w^(6\zn) i f and only i f j f(sn,zn,\<t>,e)w(<j>,9)d<j>=qi(Sn,zn) J J f(sn,zn,\<t>,9)d<t>dsn j w(cj>,0)d4>, for some non-negative <7i, the last expression is equivalent to J[f(sn, zn, <j>, 0)-J^ML _ qi(sn,zn) J f(sn, zn, </>, 6)dsn]d(j> = 0. (5.4.5) A sufficient condit ion for (5.4.5) is that the integrand itself be zero, i.e. 
Vzn,4>,6, If we omit the fixed variables zn,(f>,0 for s implici ty, we get q(sn)f(sn) = \ J f(sn)dsn, (5.4.6) which is a Fredholm equation of the second type (See, for example, K o n d o , 1991), where q(sn) = (qi(sn, z*1))-1, and A = f w(£,0)d£/w(<j>,6). To solve this equation, divide by \/q(sn) on both sides of (5.4.6) and let Y(sn) = y/q(sn)f(sn). Now (5.4.6) becomes Y^=xl7wm¥(ndtn- (5-4-7) 124 Expression (5.4.7) has a solution i f and only i f dsn Since this is guaranteed by the assumption on q\{-, •), the solution is given by f(sn, zn, \<f>, 0) = qi(Sn,zn)q2(zn, <f>, 0) (5.4.8) where q2{zn,<t>,0) = J f(sn,zn,\<f>,0)dsn is non-negative, and for any non-negative q2(zn,(f>, 0), by subst i tut ing (5.4.8) into (5.4.6), i t is seen that (5.4.8) is a solution for (5.4.6). (i i) Since p\ and p2 are exponential families, we have (2)(0\xn,yn) oc f e x p { » & ( 5 l ( 0 , 0 ) ) £ T j f c , y , - ) + r?2(52(</»,0)) £ T 2 ( x i i y i ) i=l i=l -nB1(g1(<f>,e)) - nB2(g2((f>,0))}w(g1(<f>,0),g2(4>,0))J((t),0)d<l), Therefore Vi(9i(<t>,0))£Ti(xi>Vi)) + *)) Y,T2(xi,y,-)) - nBMt,0)) - nB2(g2(cj>,0)) » ' = 1 i = l n n = rji{<t>) £ Ti{gi(xi,yi)) + f,2{0) £ f2{g2{xu y,-)) - nB^) - nB2(9), i=l i=l so we have w{2)(0\xn, yn) oc exp{7?2(0) £ f2(<fe(:ct-, y,-)) " n5 2 (0 ) tB 2 (0 )} . (5.4.9) O n the other hand, the density of Zn has the reduced form P(3){zn\0) = exp{f)2(6)T2(zn) - nB2(0)}h2(zn), where f2(zn) = Z?=i ffa) and h2(zn) = f l L i Hzi). So w(3)(0\zn) cx e x p { f ) 2 ( 0 ) f 2 ( z n ) - n /J 2 (0)} Jw{gi{^6),g2{<i>,9))J{<t>,9)d<t> 125 a exp{JJ2(0)r2(«n) - nB2(9)}w2(9). (5.4.10) For w(-, •) satisfying the given conditions, the right hand side of (5.4.9) and (5.4.10) are proport ional to exp{fh(8)f2(zn) - nB2(9)}w2(9), i.e. w{2)(9\xn,yn) = w(3)(9\zn), • . R e m a r k : Let Wl(-) = w2(-) be the N(0,1) density, f1(01,92) = 91 + 92, f2(9lt92) = 61 -62, then (ii) is satisfied wi th J(<f>,9) = \, u>i(-) and tZ>2(-) be the i V ( 0 , 1 ) density. 
Note the sufficient conditions i n (i) of Propos i t ion 5.4.3 are not necessary. For ex-ample, let n = 1, pi(a:|0) = p2(x\9) = p*(x\9) = m*(x)e~L(x'ey J m*(t)e-L(t'eUt w i th L(x,y) = (x - y)2 and the prior u ; (0 i ,0 2 ) to be N(0,I2). B y Propos i t ion 5.4.4 i n the following we know that w^(9\xn,yn) = w^{6\zn), but the con-di t ion i n Propos i t ion 5.4.3 (i) is not satisfied. In fact from Example 1.4.3 we know that p*{-\6) - i V ( 0 , l / 4 ) , and .s + z.cb + 9. .s — z.6 — 6. . , ,., , „n9i P i ( — I ^ W — 1 ^ - ) = e x P { " ( 5 - <!>) + - ° ) } , which does not satisfy the conditions i n (i) of Propos i t ion 5.4.3. We assume the uniform priors and marginals for M I L ' s as i n Propos i t ion 5.4.2 and give conditions to ensure the same results as i n Propos i t ion 5.4.3. Note that even i f we use the uniform prior to generate the M I L ' s , we can s t i l l choose proper priors to form posteriors for inference. Para l le l to Propos i t ion 5.4.3 for exponential families, we give conditions for M o d e l II and III to give the same posterior when M I L ' s are used. P r o p o s i t i o n 5.4.4 Assume pl{xn\61) = Piix^O^^p^y71^) = p2(yn\92). (i) Assume (i) of Propos i t ion 5.4.2 and w(0i,62) = 101(^ 1)^ 2(^ 2)• If we take the prior for inference on 9 to be 126 then «;(0) = J w(gi(<f>,0),g2(<(>,e))j(<i>,e)d<f>, w{2){0\xn,yn) = w(3){0\zn). (i i) Let n = l,f-i(x,y) = x + y,f2(x,y) = x - y, A a = A 2 = 1, Lk(x,0) = (x-0)2, (k = 1, i f w(-, •) factors, i.e. w(-, •) = Wi(-)w2(-) and wi(-) = w 2 ( - ) is the N(0,1) density, then v>(2)(0\x,y) = W(3)(6\z) P r o o f : (i) B y Propos i t ion 5.2.1, the density for Zn is p(zn\0u02) = p(zn\0) ex e x p ( - A 2 r 2 ( z " , 0 ) ) , so „, r 0 L - ^ = ™P{-*2L2(zn,0)}w(0) ^ 1 ] Jexp{-A 2r 2(^,0}^(0^ = e x p { - A 2 r 3 ( z " , 0)} /n;( g l(0, 0), ff2(0,0)) J ( 0 , 0 ) # JexP{-A2r2(^,0}/^(ffl(ei,6)J(ei,6)^i^2 a e x p { - A 2 r : 2 ( z n , 0 ) } | w{gx{4>,0\g2{<\>,0))J(4>,0)d4>. 
O n the other hand, / P i ( ^ l 6)P2 ( 2 / n l 6 ) ^ (e i , 6 ) ^ i^2' S O W( 2)(0 |a; n ,y n) = I w(gi(<l>,6),g2(<t>,0)\xn,yn)J(<l>,6)d<l> _ ml(xn)m*2(yn) f e - ^ M ^ C ^ - ^ M ^ ^ . g ) ) ^ ^ ^ fl)) j(<^ fl) m*(x n,2/") 7 J m * ( i " ) e - A i i i ( < n - s i ( ^ . e ) ) ^ " / m ^ e - A 2 L 2 ( i n ' 3 2 ( ^ ) ) ^ n ^ Since u>(0i,0 2) = W\{0\)w2(02), m*{xn,yn) factors into ml(xn)m2(yn), so the above is t e x p { - A 1 r 1 ( a " , <f>)} exy{-\2r2(zn, 0)}w(9l(cf>, 0),g2(<f>, 6))J(<f>, 0) J / e x p { - A 1 i i ( < » , f f i ( 0 , 0 ) ) } r f « » / e x p { - A 2 X 2 ( i » , ^(0,0))}^ 9 127 o c e x p { - A 2 r 2 ( z " , 0 ) } J w(g1(cf>, 6), g2(<fi, 0))J{<j>, 0 ) # . Thus , w{3)(e\zn) = w{2)(6\xn,yn). (ii) We only assumed the M I L families, before marginal iz ing to get a posterior for 0 given z, the posterior for (#i,# 2) given z is given by w{3)(0u62\z) = c{z)a(6u62\z) where and j f / m ^ f f 1 ( 3 , z ) ) m ^ f f 2 ( ^ ^ ) ) e - ^ M g i ( ^ ^ ) , 6 ) - A 2 L 2 ( 3 2 ( , , z ) , 6 ) w ( 6 ^ 2 ) j ( 5 ; 2 ) ^ 1 J 7 7 / m J ( * ) e - A ^ i ( « i ) d t / m 5 ( * ) e - ^ i a ( * . & ) d < 4 l Now, recall the posterior for 6 under M o d e l II is I w(g1(4>,e),g2(<t>,e))e-XlL^x'3i('t>>d))-x2L2(yM<l>,mj^0) X y / m ^ ( t ) e - A i L i ( < , 5 1 ( ^ ) ) ^ j m * ( Z ) e - A 2 L 2 ( t , a 2 ( ^ , 0 ) ) ^ ^ Since to(-,-) = wi(-)t/; 2(-) w i t h u>i = w2 = i V ( 0 , 1 ) , and we have chosen n = 1, Lk{x,9) = (x - 0)2, Xk = l for A; = 1,2, #i(a:,y) = 2±lt a n ( i ff2(a;,y) = then we know from Example 1.4.3 that m*k{x) = ^ e ~ * 2 for k = 1,2, and J ( - , •) = 1/2. So u ; ( 3 ) (* i , W oc e x p { - ( * - °-^-f - \{e\ + 02)}. Thus , "(3)(*l*) = \ J " ( 3 ) ( ^ , ^ W cx e x p { - ± ( 0 - z ) 2 } . B y the expression for W(2)(0\xn, yn) we get W ( 2 ) ( 0 | z n , j/™) a e - ^ - 2 ) 2 and so, W( 3 ) (# i , 92\z) W(2)(9\x,y). a 128 If the conditions i n Propos i t ion 5.4.2 are not satisfied, we might s t i l l want to know how far away the posteriors from the different models are. 
In the following we use 1 • 1 to denote the L\ norm, and U(-), U(-,-) to denote the uniform density on R},R2 respectively. Let 0(i) be the Bayes estimator of 9 under model i using the common convex loss •) (i.e. under a posterior w^(0\xn, yn) for i = 1,2,3. Note that h can be different from the Z^ 's used to get the M I L ' s p£ 's for k = 1,2,3. Let E{i)(9\Xn, Yn), V a r ( i ) ( 0 | X n , Yn) denote the posterior expectation and variance of 6 under M o d e l i. T h e o r e m 5.4.1 Assume the M I L ' s for the l ikelihoods, then Ve > 0, 3<5t- > 0 , ( i = 1,2 ,3 ,4) , such that i f | K - , - ) - U(;-)\\ < Slt | |m;(-) - U(-)\\< 62, \\m*2(-) - U(-)\\ < 63 and Z*((-,-) 's satisfy sup \Ltix,6,) + L2(y,92) - r^x,y),h(9u92)) - r 2 ( / 2 (x ,y) , f2(9u62))\ < S4, w i t h Tj.(t, 9) = r ^d r - 9\) >0,{k= 1,2), then we have (i) \\w{2)(9\xn,yn)-w{3)(9\zn)\\<e, and hence \E(2)(9\X\Yn)-E{3)(9\Zn)\<e, | V a r ( 2 ) ( 0 | X " , y " ) - V a r ( 3 ) ( 0 | Z n ) | < e . (ii) For any e > 0, the Bayes estimators from the two models satisfy 1 (^2) - 0(3)1 < e-P r o o f : (i) Clear ly , as a functional of to(-,-), xni(-) and T U 2 ( - ) , the posteriors w^(9\xn, yn} (j = 2,3) are continuous ( in L\ norm). Let ui(9\xn, yn) (j = 1,2,3) be the posterior density corresponding to w(-,-) = U(-,-), m^(-) = U(-) and some loss function rj. satisfies the assumption of Theorem 5.4.1 (k = 1,2). w^ ^(9\xn, yn) (j — 1,2,3) be the quan-t i ty corresponding to w(-, •) = £/"(-, •). B y Proposit ions 5.4.2 and 5.4.3, w^tUtT(9\xn, yn) = 129 W(3),uA6\Zn)- S° < J\w(2)(0\xn,yn)-w(3)(9\zn)\d0 J \ww(0\xn,yn)-w{2)tr(9\xn,yn)\d9 + j\w(3)(d\zn)-w{3WL(6\zn)\d0. (5.4.11) For s implici ty, i n the following we only discuss the first term i n the right hand side of (5.4.11). It is / H2)(O\xn,yn)-w{2)tUiL(9\xntyn)\d0 < J H2){e\xn,yn)-w(2U{6\xn,yn)\d0 + J H2)Ae\x^yn) - w(2),u,L(0\x^yn)\d0-The first term above can be made as small as we want by the continuity of w^(0\xn, yn) as a functional of £&(•, -)'s. 
the second term i n the above is / m * 1 ( x n ) m * ( y n ) e - X 2 d ^ n ' S ) J i m*(x", yn) e-^-^^wjg^, 0),g2{4>, 0))J(cf>, 0) J m * ( i " ) e - A l M < S 3 l ( W ) ) ^ n / m * ( i « ) e - A 2 L 2 ( < n - 3 2 ( 0 , f i ) ) ^ n U(xn)U(yn)e-x^2^n^) d<f> U(xn,yn) e-^l'n>+)U(gi(<l>,9),g2(<i>, 9))J{cf>, 0) -d<f> d0. J U{tn)e-^L^tnM<t>fi))dtn f U(tn)e-x2L2(tn,g2{<l>,()))dtn B y adding and subtracting appropriate terms, the above can be bounded by a i | | r a* - U\\ + a 211 "*2 _ U\ \ + a,3\\w — U\ \ for some constants ai,a2 and a3, except for 0,xn,yn i n a set of smal l Lebesgue measure. The second term i n the right-hand side of (5.4.11) can be bounded i n a similar manner. Thus the right-hand side of (5.4.11) can be made smaller than e for suitable choices of the #'s. A l s o , since E^)(0\xn, yn) and Vax^(9\xn, yn) are continuous f u n c t i o n a l of w^, the last two conclusions of (i) follow. 130 (ii) Since the loss is convex i n a, the Bayes solution 8^ exists and is the unique min i -mizer of the posterior risk: where RW(t){a\xn,yn) = J h(a,8)w(,)(8\xn,yn)d8, and A is the act ion space. Since Rw^(a\xn, yn) is a continuous functional of Wk(-), {k = 1,2), so is 0( t ) , (i = 2, 3). A l s o , RW{i)(a\xn,yn) = RWU)(a\xn,yn) for = 2,3) under the conditions of Proposi t ion 5.4.4, so when these conditions are deviated a l i t t le ( in the sense given i n the L\ conditions ), the Rw^(a\xn,y")'s w i l l also change a l i t t le , thus the conclusion true. • Next , we establish a version of Theorem 5.4.1 for model II and III when exponential families are used. 
T h e o r e m 5.4.2 Assume P i ( z | 0 i ) = exp{77 1(0 1)Ti(a;) - B^O^h^x), p(y\82) = exp{r,2(92)T2(x) - B2{92)}h2{x), i f we assume (5.4.3) is satisfied, and take w(0) = j w(gi(<f>,9), g2(<j),9))J((l>,9)d<f>, then (i) For any prespecified e, we can choose 6 such that \E{2)(9\Xn,Yn) - E{3)(9\Zn)\ < e, \Var{2)(9\X\Yn)-Var{3)(9\Zn)\<e, whenever w(gi(<f), 8)) can be approximated, i n the L\ sense, by the product of two indepen-dent densities, i.e. \\w(gi(<f>,8),g2(<j>,9)) — wi(4>)w2(9)\\ < 6 for some integrable wi(-) and (ii) The Bayes estimators #(2) and 9^ satisfy |0(2) - 0(3)1 < 131 Proof: (i) Since E(i)(0\Xn, Yn) and V a r ( i ) ( 0 | X n , Yn) are continuous functional of w(-,-). T h e E^s are equal , and the V a r y ' s are equal for w(gi(<j),0), g2((f>,0)) = wi((f>)w2(0), so for any prespecified e, we can choose 6 such that \E(2){0\X\Yn)-E(z){0\Zn)\<e, \Var(2){0\Xn,Yn) - Var{3)(0\Zn)\ < e, whenever \\w{g\{<j>,6),g2{<f>,9)) - Wi{<j>)w2(0)\\ < 6. For the second conclusion, the proof is s imilar to that i n Theorem 5.4.1. • We may also investigate the inference range for the three different Models . For a function h(-) integrable w i th respect to w^(0\xn, yn), i = 1,2,3, let W be the collection of the three posteriors w^(0\xn, yn), i = 1,2,3 based on the three l ikelihoods from the three models. Consider the interval ( - E m j n u , € W / i ( 0 ) , Ema.Xwewh(Q) and length of i t . Since g(xn + yn,xn-yn\0) and we have and So the difference w(2)(0\x ,y ) j g ( x n + y n ^ n _ y n ^ d ^ „, (0\r" „"\ - f9(sn,xn-yn\0)dsn W { 3 ) { ° l X , V ] - fjg(3»,x«-y»\t)ds»d{' zp l h ( M ] _ fh(0)g(xn + yn,xn-yn\0)d0 _ ffh(e)g(xn + yn,xn-yn\0)dOdsn w ^ [ { JJ~ ffg(s»,x»-y»\t)dtda" EW(2)[h(Q)}-EW(3)[h(Q)} = fg(xn + yn, xn - yn\0)[h(0) - l]d0 Id(xn + yn, xn - yn\0)dO JJh(0)[g(sn,xn-yn\0)-l)d0dsn ffg(s»,x»-yn\Odtds" • ( * A - i Z ) 132 Similar ly , assume w(9) = f w ( ^ , ^)d<j>. 
Since W Jfp(x--yn\9)W(i±l^)d4>d9-we get EW(1)[h(Q))-EW(2)[h(Q)] _ / h(9)p(xn - yn\9)w(9)d9 ~ Jp(xn ~ yn\ZMt)dt f g(xn + yn,xn - yn\t)d£ X / J[Vi{xn\^^-)p2{yn\^-^-) - p{xn - yn\9)]w(^^,^^-)d<f>d9 ffh(0Mx--y-\9)-pl(x^)p2(y^)]w(^,^)d^ fg(xn + yn,xn-yn\Z)dt ' { • • ) Likewise, we have, EW(1)[h(Q)}-EWi3)[h(Q)} fh(9)p(xn - yn\9)w(9)d9 fp(xn - yn\Od£f Jg(sn,zn\Z)dtdsn ft, ,n + zn ,6 + 9^ ,TI - zn ,d> - 9, , „ „, „ ,6 + 9 6 - 9 s J J b l ( 2 ~ 1 2 ) M ~ 2 l 2 } " P { X ~ V l " ) ] ^ 2 ' 2 ) d H d UmWn - yn\<>) - P^nmi)P2(yn\^i)}w(^, ^)d4>de Jg(xn + yn,xn-yn\Odt ' 1 ' ' ) Likewise, we have, EW(1)[h(Q)}-EW{3)[h(Q)} fh(9)p(xn - yn\9)w(9)d9 Ip(xn - y n / / g ( * n , zn\Od£dsn J / M ^ I ^ ( ^ H ^ ) - p(xn - ,»H)M*±* +-^w> Ug(s»,z"\i)d(-ds" • y • • } B y using (5.4.13), (5.4.14) and (5.4.15) we may find the interval range as the posteriors vary among the three models. 133 Chapter 6 Discussion and Further Research 6.1 Discussion Here we have proposed a technique for choosing a l ikel ihood based on a given pr ior , a loss function and a distort ion parameter. G iven those three quantities, one can opt imize the Shannon mutua l information over a class of l ikelihoods to find the l ike l ihood which makes the weakest possible assumptions i n a precise information theoretic sense. The assumptions impl ic i t i n this l ikel ihood are also weak i n two other stat ist ical senses, formalized by the first two theorems. Theorem 3.1.1 shows that i n the l imi t of large sample sizes, the expected relative entropy distance between a prior and a posterior formed from the min ima l ly informative l ikel ihood tends to zero. Tha t is , the Shannon mutua l information goes to zero. Theorem 3.2 states a smal l sample sense i n which the M I L is min ima l ly infor-mative. Theorems 3.3.1 and 3.3.2 show how posteriors formed from min ima l ly informative l ikelihoods depend on the distort ion parameter. W h e n the allowed Bayes risk increases, the posterior tends to the prior . 
When the allowed Bayes risk decreases to zero, the posterior tends to concentrate at the data. We may consider generalizing these theorems to the case of the n-fold product of univariate MILs.

The main drawback to the use of these likelihoods is that at this point we have not compared the inferences they give to the inferences one would get from a true parametric family. In particular, one can conjecture that highest posterior density sets from the present likelihoods would be wider than those from the true parametric family, and that there would be further deformation because the parameter introduced here might not be exactly the same as the actual location parameter of interest. This would appear to follow from the estimating equation context in Chapter 3. Another drawback is that the computing required to find these likelihoods for several parameters or more than one outcome may be onerous.

The formulation of the minimally informative likelihood permits a robustness analysis against the choice of loss, $\lambda$ and prior. In Figures 1 and 2, Section 1.6, we observed how the shape of $p^*(\cdot|\theta)$ varies as the prior and the parameter $\lambda$ vary. For relatively small values of $\lambda$, the $p^*(\cdot|\theta)$'s are located approximately at $\theta$; this may be due to the loss function penalizing data points for being far away from the parameter, together with the exponential component of the MIL. For larger values of $\lambda$, the location of $p^*(\cdot|\theta)$ moves, independently of the prior, around the point where the minimal average loss is achieved, which in the squared error case is the mean of the prior. This is predicted by Theorem 3.3.1. However, the shape does not seem to change significantly. For the left skewed $\exp(-(x+10))$, $(x > -10)$ prior, the non-skewed $N(0,1)$ prior and the right skewed $\exp(-(15-x))$, $(x < 15)$ prior, the corresponding $p^*(\cdot|\theta)$ all have a roughly symmetric form around their locations.
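The dependence of the MIL on $\lambda$ discussed above can be explored with a small grid-based version of the Blahut-Arimoto iteration of Section 1.3. What follows is a rough sketch, not the computations behind Figures 1 and 2: it assumes that $\theta$ and $x$ live on the same finite grid, a discretised $N(0,1)$ prior, squared error loss, a uniform starting marginal and a fixed number of iterations. The function names, grid and $\lambda$ values are hypothetical choices.

```python
import numpy as np

def blahut_arimoto_mil(prior, loss, lam, n_iter=200):
    """
    Alternate between p(x|theta) proportional to m(x) exp(-lam * L(x, theta))
    and m(x) = sum_theta w(theta) p(x|theta).
    prior: shape (n_theta,), probabilities w(theta) on the theta grid.
    loss:  shape (n_theta, n_x), loss[i, j] = L(x_j, theta_i).
    Returns (cond, marg): rows of cond are p(.|theta_i); marg is m(.).
    """
    kernel = np.exp(-lam * loss)
    marg = np.full(loss.shape[1], 1.0 / loss.shape[1])   # uniform start for m(x)
    for _ in range(n_iter):
        cond = marg * kernel
        cond /= cond.sum(axis=1, keepdims=True)          # normalise over x for each theta
        marg = prior @ cond                              # marginal of x under the prior
    return cond, marg

def mutual_information(prior, cond, marg):
    """Shannon mutual information (in nats) between theta and x."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(cond > 0, cond * np.log(cond / marg), 0.0)
    return float(prior @ terms.sum(axis=1))

grid = np.linspace(-5.0, 5.0, 101)
prior = np.exp(-0.5 * grid ** 2)
prior /= prior.sum()                                     # discretised N(0, 1) prior
loss = (grid[None, :] - grid[:, None]) ** 2              # squared error loss

results = {}
for lam in (0.2, 1.0, 5.0):
    cond, marg = blahut_arimoto_mil(prior, loss, lam)
    results[lam] = (cond, mutual_information(prior, cond, marg))
    print(lam, results[lam][1])

# Mean of p*(.|theta) at theta = 1 for lam = 1.
i = int(np.argmin(np.abs(grid - 1.0)))
print(results[1.0][0][i] @ grid)
```

In this sketch the mutual information grows with $\lambda$, that is, a tighter distortion constraint forces a more informative likelihood. For $\lambda = 1$ the mean of $p^*(\cdot|\theta)$ at $\theta = 1$ should come out close to $1/2$ once the iteration has settled, which is in line with the remark in Section 6.2 that $E_{p^*}(X|\theta)$ is a weighted average of the prior mean and $\theta$.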
We may also investigate how the MIL varies if we fix the mean of a prior and increase its variance, or if we fix the mean and vary the skewness of a prior.

In addition, one must choose how many parameters to include and how to condition on the data (in a posterior). Often, there are several ways one can do this. For instance, in a very simple case one might have an independent sequence of paired data. One can marginalize a bivariate likelihood to get a model for the difference in each pair, as in Model I. This gives a univariate posterior for a single parameter, generating credibility sets for the difference in means in one sense. Alternatively, one can condition on all the data in a two-parameter likelihood to get a bivariate posterior. Now, one can marginalize in the posterior to get a credibility set for the difference in means in a different sense, as in Models II and III. It is not clear in general when taking differences in the data will be equivalent to taking differences in the parameters, so that the two strategies give compatible inferences. In the usual case, one does not have a plausible likelihood which can be used in either approach. Our method generates one, which also permits a sensitivity analysis of the modeling strategy for paired data.

In general, the n-dimensional MIL $p^*(x^n|\theta)$ and the n-fold MIL $\prod_{i=1}^{n} p^*(x_i|\theta)$ are very different. The former may be used for dependent data. In information theory, $\theta$ represents the sent message and $x^n$ may be interpreted as the messages received by $n$ receivers; there should be reasonably high dependence among those messages. In this setting, $p^*(x^n|\theta)$ is the channel which permits the slowest data transmission subject to a distortion constraint. The latter may be used for independent data.

Intermediate between an n-fold product and an n-dimensional density we may define the following. Let $\{x_{n_i}^{n_{i+1}}\}$ be a partition of the data $x^n$, where $x_{n_i}^{n_{i+1}}$ stands for $(x_{n_i}, x_{n_i+1}, \ldots, x_{n_{i+1}-1})$. If there is dependence among the data in each substring but it is reasonable to model different substrings as independent, then we can model the data by $\prod_i p^*(x_{n_i}^{n_{i+1}}|\theta)$. If there is reason to believe the data are independent, one should use the product of MILs. If there is an obvious dependence structure, for example pairwise dependence, one may use a bivariate MIL.

It is unclear how generally applicable the assumptions of Theorem 3.3.1 are. In case this result is applicable, it is an argument for modeling of dependent data from a sample of size $n$ by an n-dimensional MIL when $n$ is too large. If the result is not applicable, then an n-dimensional MIL may be a candidate model.

In the data analysis, we used the MIL for a location parameter. We can also use it for different types of parameters. For example, for a scale parameter, we may take the loss to be $L(x,\theta) = C(x/\theta)^a$ for some positive constant $C$ and real number $a$. If one still wishes to use a summary statistic, one may take the sufficient statistic $\sum_{i=1}^{n} x_i^a$ in this case. For the location-scale parameter $(\mu,\sigma)$, we may take the loss to be $L(x,\mu,\sigma) = (x-\mu)^2/\sigma^2$ or some other suitable alternative.

In our examples, inferences do not appear to be particularly dependent on priors. This robustness may be due in part to the imprecision in the specification of the parameter (pre-optimization) on which the prior is used. That is, when the optimization procedure identifies the exact interpretation of the parameter, it may also, as a concomitant, reduce the influence of the prior. This is a natural conjecture if the requirement of non-informativity tends to make MILs more similar than the priors which produced them. In this sense, the SMI may be a contraction mapping, and this contraction may be the main way minimal informativity is being achieved.
The MIL method seems to work well for the initial data analysis in Chapter 4, as it does not require detailed physical modeling. In addition, it makes relatively few assumptions. These assumptions are the inputs $w(\cdot)$, $L(\cdot,\cdot)$ and the dispersion parameter $\lambda$. Our method assigns likelihoods based on minimizing the strength of assumptions, so it is easy to get likelihoods which can be applied to summary statistics. (Earlier, we discussed how to handle the seeming independence of our method from the sample size on which a summary statistic was based; see Section 4.2.1.) Partially as a consequence of optimization, we obtain some robustness against the choice of prior, since we are minimizing the strength of the assumptions going into the formulation of the likelihood. In addition, robustness against local perturbations of the inputs is relatively straightforward to evaluate.

Frequentist and Bayesian evaluations of robustness usually assume perturbations are local. We do not have to do this here. It is scientifically more important to compare models that are close in physical motivation but not close mathematically (i.e., one is not just a perturbed form of the other). This can be examined by evaluating the compatibility of their inferences. For example, Models I, II and III in Chapter 4 are not close in mathematical formulation in any quantified sense, but being formed from pairs and differences they are "close" interpretationally. Classical frequentist robustness results do not appear to handle this case even though it may be of more importance to scientists. Conventional Bayesian robustness does not either. Only certain model selection techniques permit this sort of comparison, indirectly, when they choose the mode of a posterior distribution over a class of models. Here, we are not concerned with model selection so much as with corroboration of inferences by similar yet distinct modeling strategies.
Another potential use of the MIL for statistical problems is that it provides a reference model for initial data analysis in a Bayesian framework when little data are available and it is difficult to model them. There are also potential uses of the MIL in the information theory context. In fact, we see from Section 2.1.5 that the MIL provides the optimal code for data compression. It is also the channel over which the slowest transmission of the source is achieved.

6.2 Further Topics Regarding The MIL

We see in Section 2.2 that there is the problem of choosing the Bayes risk bound $l$, and there are other formulations of the minimization problem, for example using penalty term(s) instead of a constraint. In case there is little knowledge about both the likelihood and the prior, we may consider a joint optimization procedure for selecting both the likelihood and the prior. Also, as we noted, $l$ and $\lambda$ determine each other and they have a sort of reciprocal relationship. In many cases, choosing $\lambda$ is more convenient, and $\lambda$ behaves like a smoothing parameter. So we can consider some smoothing technique for choosing $\lambda$.

In cases without an exact interpretation of $\theta$, it is hard to get a prior distribution for $\theta$. But if we have some vague knowledge about the data distribution, for example its Fisher information $I(\theta)$, then we can form Jeffreys' non-informative prior for the problem, and add the constraint that the likelihood has the Fisher information $I(\theta)$. Or we may choose an initial prior $w_0(\cdot)$ which can reasonably describe the parameter, get $p_{MIL,0}$ from this prior, then based on its Fisher information $I_0(\theta)$ get Jeffreys' non-informative prior $w_1(\cdot)$ as the next-stage prior, and, continuing this iteratively, obtain a minimally informative likelihood-prior pair. A modification of this search for joint minimal information is sequential. Consider the following procedure.
Start with a prior $w_0(\theta)$ which is non-informative and is not derived from a likelihood. (We note that Jeffreys' prior is non-informative but depends on a likelihood through its Fisher information.) For instance, suppose $w_0(\theta)$ is such a prior. From it, form the MIL $p_0(\cdot \mid \theta)$ for the first data point $x_1$. From this get the posterior density $w_1(\theta \mid x_1) = p_0(x_1 \mid \theta)\, w_0(\theta) / m_0(x_1)$ and use it, for fixed $x_1$, as a prior so that optimization of the SMI gives a likelihood $p_1(\cdot \mid \theta)$ for use on $x_2$. (Note that $p_1(\cdot \mid \theta)$ also depends on $x_1$, but we have suppressed this.) In this sequential fashion we can develop an adaptive formulation for a joint likelihood by taking a product of these sequential likelihoods. At this point we cannot even conjecture how this procedure will perform. We mention it as a further possibility to explore.

It is also interesting to investigate how much will be lost if we use the MIL instead of the true likelihood for inference. From Example 1.4.3, we see that $E_{p^*}(X \mid \theta)$ is biased for $\theta$ because it is a weighted average of $\mu$ and $\theta$. We may investigate whether this is the general case, and try methods to reduce the bias. In principle there may be cases where it is possible to ensure that $E_{p^*}(X \mid \theta) = \theta$ is satisfied. We may also consider this as a constraint in the optimization procedure.

In Chapter 5, we studied the robustness of modeling strategies for paired data. There we investigated the use of general likelihoods, in some cases specialized to both the dependent and the independent MILs, and in other cases only to the $n$-dimensional MILs. We may investigate some special cases only for the independent MILs, since in practice this form may be of wider use than the $n$-dimensional MILs, although our theoretical work is mostly for the multivariate MILs.
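The sequential procedure described earlier in this section (start from $w_0$, form $p_0$, update to $w_1$, form $p_1$, and so on) can be sketched on a grid as follows. This is a minimal illustration under stated assumptions: $\theta$ and $x$ are discretized, the loss is squared error, $\lambda$ is held fixed in place of the bound $l$, each MIL is approximated by a Blahut-Arimoto-style iteration, and each observation is snapped to its nearest grid point. The names `mil_rows` and `sequential_mil`, the grids and the toy data are all hypothetical.

```python
import math


def mil_rows(prior, thetas, xs, lam, n_iter=200):
    """Approximate p*(x | theta) on a grid for squared-error loss."""
    kernel = [[math.exp(-lam * (x - t) ** 2) for x in xs] for t in thetas]
    m = [1.0 / len(xs)] * len(xs)  # starting guess for the marginal of x
    rows = []
    for _ in range(n_iter):
        rows = []
        for k_row in kernel:
            unnorm = [mj * kj for mj, kj in zip(m, k_row)]
            z = sum(unnorm)
            rows.append([u / z for u in unnorm])
        m = [sum(w * row[j] for w, row in zip(prior, rows))
             for j in range(len(xs))]
    return rows


def sequential_mil(data, thetas, xs, lam):
    """Return the final posterior weights and the joint log-likelihood in theta."""
    w = [1.0 / len(thetas)] * len(thetas)  # w_0: flat starting prior (assumed)
    log_lik = [0.0] * len(thetas)
    for x in data:
        # Snap the observation to the nearest grid point.
        j = min(range(len(xs)), key=lambda idx: abs(xs[idx] - x))
        rows = mil_rows(w, thetas, xs, lam)  # likelihood built from the current prior
        col = [row[j] for row in rows]       # its value at x, as a function of theta
        # The floor guards against log(0) if a grid cell loses all mass numerically.
        log_lik = [ll + math.log(max(c, 1e-300)) for ll, c in zip(log_lik, col)]
        post = [wi * ci for wi, ci in zip(w, col)]  # Bayes update
        total = sum(post)
        w = [v / total for v in post]
    return w, log_lik


thetas = [-2.0 + 0.25 * i for i in range(17)]
xs = [-3.0 + 0.25 * j for j in range(25)]
posterior, log_lik = sequential_mil([0.5, 0.75, 0.25], thetas, xs, lam=2.0)
```

Because every step reruns the optimization with the current posterior as its prior, the cost grows linearly in the number of observations. How the resulting product of likelihoods behaves is exactly the open question raised above, so the sketch is a starting point for experiments and not an answer.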
Also, in Chapter 5, some bounds on the differences among the three modeling strategies are not sharp, so we are not sure how robust the modeling strategies are in such cases. To assess this, we may do some numerical comparisons for some practical problems to get an idea of how these models differ.

Theorem 3.1.1 asserts that, under suitable conditions, $I(\Theta, X^n) \to 0$ as $n \to \infty$. That is, Theorem 3.1.1 says that the MIL for a lot of data must be vacuous, given all the dependence structures infinitely much data might have. So, it may be useful to associate to a data string of length $n$, say $x_1, \ldots, x_n$, an integer $k$ between 1 and $n$. This integer $k$ is to be regarded as the information content of $x_1, \ldots, x_n$ measured in terms of $k$ independent data points. That is, because a sequence of dependent data points behaves like a smaller sequence of independent data points, we convert $x^n$ to $k$ smaller sequences, each representing information equivalent to one data point. In this way we group the data set $x_1, \ldots, x_n$ into $k$ subsets of equal size. Naturally one would want to use several plausible values for $k$ to see how they affect the conclusions. In our example in Chapter 4, we summarized $x_1, \ldots, x_{57}$ into one summary statistic, essentially thereby regarding the information content of $x_1, \ldots, x_{57}$ as being equal to that of one data point.

In general, the data may rarely be independent, but the generally high dependence in the multivariate MIL may not be appropriate. To control the dependence structure to some extent in the multivariate MILs, we may consider using a copula to construct them. Using a copula to construct multivariate models has been popular in recent years. The copula method is an attempt to construct multivariate models with arbitrary given marginals and, to some extent, to control the dependence structure (see Sklar, 1959; Joe, 1993).
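To make the copula idea concrete, the following sketch draws pairs with prescribed marginals whose dependence is governed by a bivariate normal copula. It assumes two exponential marginals with hypothetical rates and a hypothetical correlation $\rho$; none of these values come from the analyses in this thesis, and the function names are illustrative only.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for its cdf


def normal_copula_pair(rho, rng):
    """Draw (u1, u2) with uniform marginals and normal-copula dependence."""
    z1 = rng.gauss(0.0, 1.0)
    # Correlated standard normal built from two independent draws.
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)


def exponential_quantile(u, rate):
    """Inverse cdf of an exponential distribution with the given rate."""
    return -math.log1p(-u) / rate


def sample_with_margins(n, rho, rates, seed=0):
    """Draw n pairs with exponential margins and normal-copula dependence."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        u1, u2 = normal_copula_pair(rho, rng)
        draws.append((exponential_quantile(u1, rates[0]),
                      exponential_quantile(u2, rates[1])))
    return draws


pairs = sample_with_margins(n=2000, rho=0.8, rates=(1.0, 2.0))
mean_x1 = sum(p[0] for p in pairs) / len(pairs)
mean_x2 = sum(p[1] for p in pairs) / len(pairs)
```

By construction the marginal distributions do not depend on $\rho$, while $\rho$ alone changes how strongly the two coordinates move together. This separation of margins from dependence is what motivates using a copula to build multivariate MILs from given univariate ones.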
Definition: A mapping $C : (0,1)^m \to (0,1)$ is called a copula if (1) it is a continuous distribution function; (2) each of its univariate marginals is a uniform distribution.

Let $F(x)$ be a multivariate distribution function with marginal distributions $F_1(x_1), \ldots, F_m(x_m)$, and let $u_1, \ldots, u_m$ be random variables from $U(0,1)$. Then the multivariate distribution $C_F(\cdot, \ldots, \cdot)$ defined by

$$C_F(u_1, \ldots, u_m) = F\big(F_1^{-1}(u_1), \ldots, F_m^{-1}(u_m)\big)$$

is a copula by the above definition.

One of the most commonly used copulas is the multivariate normal copula. It is constructed from the multivariate normal distribution $N(0, \Gamma)$, where $\Gamma = (\gamma_{ij})$ is its covariance matrix and, for convenience, all the diagonal elements are 1. We denote its distribution function by $\Phi_\Gamma$ and its marginal cdfs by $\Phi_{\Gamma,1}, \ldots, \Phi_{\Gamma,m}$. Then the $m$-dimensional copula is defined as

$$C_\Phi(u \mid \Gamma) = \Phi_\Gamma\big(\Phi_{\Gamma,1}^{-1}(u_1), \ldots, \Phi_{\Gamma,m}^{-1}(u_m)\big), \quad u \in (0,1)^m.$$

Its density function is

$$c_\Phi(u \mid \Gamma) = |\Gamma|^{-1/2} \exp\Big\{ -\tfrac{1}{2}\, x^T (\Gamma^{-1} - I)\, x \Big\},$$

where $x = (x_1, \ldots, x_m)^T$ with $x_i = \Phi_{\Gamma,i}^{-1}(u_i)$, $i = 1, \ldots, m$, and $I$ is the identity matrix of dimension $m$ (see Xu, 1996). In this way, we can construct a multivariate minimally informative distribution $F^*(\cdot, \ldots, \cdot)$ with given marginals $F_1^*(\cdot), \ldots, F_m^*(\cdot)$ and partially known dependence structure by means of the $m$-dimensional normal copula; i.e., if

$$F^*(x_1, \ldots, x_m \mid \Gamma) = C_\Phi\big(F_1^*(x_1), \ldots, F_m^*(x_m) \mid \Gamma\big),$$

then its marginals are $F_1^*(\cdot), \ldots, F_m^*(\cdot)$ and its dependence structure can be controlled by the chosen $\Gamma$.

We may also consider a non-parametric MIL, by treating the unknown data distribution $F(\cdot)$ as the parameter of interest and using the Dirichlet process as the prior for distributions. Such a method was first proposed by Ferguson (1973) for the non-parametric Bayes method. Let $(R, \mathcal{B})$ be a measurable space, where $R$ is the real line and $\mathcal{B}$ is the $\sigma$-algebra of Borel subsets of $R$. Let $\alpha(\cdot)$ be a finite non-null measure on $(R, \mathcal{B})$.
A stochastic process $P(\cdot)$ is a Dirichlet process with parameter $\alpha$, and we write $P \in \mathcal{D}(\alpha)$, if for any finite partition $\{B_1, \ldots, B_m\}$ of $R$, the random vector $(P(B_1), \ldots, P(B_m))$ has a Dirichlet distribution with parameter $(\alpha(B_1), \ldots, \alpha(B_m))$. It has the property that if $F \in \mathcal{D}(\alpha)$, then the posterior of $F$ given $X_1, \ldots, X_n$ is $\mathcal{D}(\alpha + \sum_{i=1}^{n} \delta_{X_i})$, where $\delta_x$ is the measure giving mass one to the point $x$ (see Ferguson, 1973). Thus, if one can find the Bayes rule for the no-sample problem ($n = 0$), then the Bayes rule for the $n$-sample problem is given by replacing $\alpha$ with $\alpha + \sum_{i=1}^{n} \delta_{X_i}$. For a given loss function $L(\cdot,\cdot)$ (for example, we may take it as the $L_r$ norm, the variational distance, or the Kullback-Leibler divergence) and $l > 0$, the Bayes risk bound for the no-sample problem is $E\,L(F, \hat{F}) \le l$, where $F(t) = P((-\infty, t])$ and $\hat{F}$ is chosen to minimize the SMI in this setting subject to the above constraint. Here, we need to formulate the SMI in a meaningful way, so that the optimization is feasible.

References

[1] Akaike, H. (1977). On entropy maximization principle. Applications of Statistics, North-Holland Publishing Company.
[2] Arimoto, S. (1972). An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inform. Theory, IT-18, No. 1, 14-20.
[3] Bernardo, J. M. (1979). Reference posterior distribution for Bayesian inference. J. Roy. Statist. Soc. Ser. B, No. 2, 113-147.
[4] Barron, A. R. and Cover, T. M. (1989). Minimum complexity density estimation. Technical Report #28, Department of Statistics, University of Illinois.
[5] Blahut, R. E. (1972a). Computation of channel capacity and rate-distortion functions. IEEE Trans. Inform. Theory, IT-18, No. 4, 460-473.
[6] Blahut, R. E. (1972b). An hypothesis testing approach to information theory. Ph.D. Thesis, Cornell Univ.
[7] Blahut, R. E. (1987).
Principles and Practice of Information Theory. Addison-Wesley, Reading, MA.
[8] Clarke, B. S. and Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, Vol. 36, No. 3, 453-471.
[9] Clarke, B. S. and Barron, A. R. (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statist. Planning and Inference, Vol. 41, 37-60.
[10] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley and Sons Inc., New York.
[11] Csiszar, I. (1974). On the computation of rate distortion functions. IEEE Trans. Inform. Theory, IT-20, 122-124.
[12] Csiszar, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, Vol. 3, No. 1, 146-158.
[13] Farnum, N. R. and Stanton, L. W. (1989). Quantitative Forecasting Methods. PWS-KENT.
[14] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., Vol. 1, No. 2, 209-230.
[15] Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1212.
[16] Haughton, M. A. (1988). On the choice of a model to fit data from an exponential family. Ann. Statist., Vol. 16, No. 1, 342-355.
[17] Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620-630.
[18] Joe, Harry. (1989). Relative entropy measures of multivariate dependence. J. Amer. Statist. Assoc.
[19] Joe, Harry. (1993). Parametric family of multivariate distribution with given margins. J. Mult. Anal. 46, 262-282.
[20] Jørgensen, B. and Labouriau, R. S. (1994). Exponential families and theoretical inference, private communication.
[21] Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problemy Peredachi Informatsii, Vol. 1, 3-11.
[22] Kondo, J. (1991). Integral Equations. Oxford Univ. Press.
[23] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Stat. 22, 79-86.
[24] Nader, M. A. and Reboussin, D. M. (1994). The effect of behavioral history on cocaine self-administration by rhesus monkeys. Psychopharmacology 115, 53-58.
[25] Schwartz, G. (1978). Estimating the dimension of a model. Ann. Statist., Vol. 6, No. 2, 461-464.
[26] Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8, 229-231.
[27] Soofi, E. S. (1994). Capturing the intangible concept of information. J. Amer. Statist. Assoc., Vol. 89, No. 428, 1243-1254.
[28] Strasser, H. (1981). Consistency of maximum likelihood and Bayes estimates. Ann. Statist., Vol. 9, No. 5, 1107-1113.
[29] Wald, A. (1950). Statistical Decision Functions. Wiley.
[30] Walker, A. M. (1967). On the asymptotic behavior of posterior distributions. Journal of the Royal Statistical Society, Series B (31), 80-88.
[31] Xu, J. J. (1996). Ph.D. Thesis, Department of Statistics, University of British Columbia, Canada.
A minimally informative likelihood approach to Bayesian inference and decision analysis Yuan, Ao 1997
Item Metadata
Title | A minimally informative likelihood approach to Bayesian inference and decision analysis |
Creator |
Yuan, Ao |
Date Issued | 1997 |
Description | For a given prior density, we minimize the Shannon Mutual Information between a parameter and the data, over a class of likelihoods defined by bounding a Bayes risk by a 'distortion parameter'. This gives a conditional distribution for the data given the parameter which provides optimal data compression, or equivalently, is minimally informative for a type of location parameter. These optimal likelihoods cannot, in general, be obtained in closed form. However, they can be found numerically. Moreover, we give two statistical senses in which the optimal likelihoods form parametric families which make the weakest possible assumptions on the data generating mechanism. In addition, we establish properties of this parametric family that characterize its behavior as the distortion parameter varies. We argue that the parametric families identified here may lead to a default technique for some settings in initial data analysis. We partially characterize the settings in which our techniques may be expected to provide useful answers. In particular, we argue that if one is interested in performing certain Bayesian hypothesis tests on a parameter that locates a typical region for the response, then our technique may provide weak but nevertheless useful inferences. We also investigated the robustness of inferences to modeling strategies for paired, blocked data. |
Extent | 6225119 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-04-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0088242 |
URI | http://hdl.handle.net/2429/7313 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1997-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |