UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Bayesian cross-validation choice and assessment of statistical models Alqallaf, Fatemah Ali 1999

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-ubc_1999-0455.pdf [ 2.85MB ]
JSON: 831-1.0080055.json
JSON-LD: 831-1.0080055-ld.json
RDF/XML (Pretty): 831-1.0080055-rdf.xml
RDF/JSON: 831-1.0080055-rdf.json
Turtle: 831-1.0080055-turtle.txt
N-Triples: 831-1.0080055-rdf-ntriples.txt
Original Record: 831-1.0080055-source.json
Full Text

Full Text

Bayesian Cross-Validation Choice and Assessment of Statistical Models by Fatemah A l i Alqal laf B . A . , Kuwait University, 1994 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in T H E F A C U L T Y O F G R A D U A T E S T U D I E S (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia September 1999 © Fatemah A l i Alqallaf, 1999 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of The University of British Columbia Vancouver, Canada Date 06 k 6 , <fC\ DE-6 (2/88) A b s t r a c t This thesis wi l l be concerned with application of a cross-validation crite-rion to the choice and assessment of statistical models, in which observed data are partitioned, wi th one part of the data compared to predictions conditional on the model and the rest of the data. We develop three methods, gold, silver, and bronze based on the idea of split t ing data in the context of measuring prediction error; however, they can also be adapted for model checking. The gold method uses analytic cal-culations for the posterior predictive distribution; however, the silver method avoids this mathematical intensity, instead simulating many posterior sam-ples, and the bronze method reduces the amount of sampling to speed up computation. We also consider the Bayesian p-value in which the posterior distribu-tion can be used to check model adequacy, in the context of cross-validation with repeated data splitting. Appl icat ion to examples is detailed, using the discussed methodologies of estimation and prediction. i i Contents Abstract ii Contents iii List of Figures vi Acknowledgements viii Dedication ix 1 Introduction 1 2 Methodology for Measuring Prediction Error 8 2.1 Reference Distr ibut ion 9 2.1.1 Notations 9 2.2 Description of the Methods 11 2.2.1 Go ld Method 11 2.2.2 Silver Method 13 2.2.3 Bronze Method 14 i i i 2.3 Theory Pertaining to the Silver Method 17 3 Examples of Measuring Prediction Error 21 3.1 Introduction 21 3.2 Normal Regression Mode l 22 3.2.1 The Go ld Method 26 3.2.2 The Silver Method 26 3.2.3 The Bronze Method 28 3.3 Weibul l and Extreme Value Regression Model 31 3.3.1 The Metropolis Algor i thm 33 3.3.2 The Silver Method 36 3.3.3 The Bronze Method 38 4 Bayesian p-Value 47 4.1 Introduction 47 4.2 Posterior Predictive p-value 49 4.3 Example: Comparing speed of light measurements to the pos-terior predictive distribution 52 4.3.1 The Gelman et al . Approach 53 4.3.2 The Split-Averaged p-value 54 4.3.3 A Model Check Based on a Test Quantity Sensitive to Asymmetry in the Centre of the Distr ibut ion 57 5 Discussion and Concluding Remarks 65 Bibliography 70 iv Appendix A Bayesian Analysis of the Classical Regression M o d e l 76 Appendix B Inverse Chi-square 80 List of Figures 3.1 Scatterplot of the mammals data for the average brain and body weights 41 3.2 Histograms of 100 W's 42 3.3 MCMC of uncentered voltage for insulating fluid failure data. . 43 3.4 MCMC of centered voltage for insulating fluid failure data. . . 44 3.5 Silver and bronze estimates prediction error 45 3.6 Histograms of 50 W's 46 4.1 Histogram of Simon Newcomb's measurements for estimating the speed of light 60 4.2 Scatterplot showing the sample variances of the actual validation sample and the replicated validation sample 61 4.3 Histograms of the estimate split-specific p-values for 50 splits. . 62 4.4 Scatterplot showing prior and posterior simulations of a test quantity T(y, 9) = \y^i) — 9\ — 12/(6)— 0\, based on 200 simulations from the posterior distribution of (9, yrep) 63 vi Scatterplot showing a test quantity T(yv, 0) = |yv ( 3 1 ) — 0\ — |yv ( 3 ) — 9\ for the actual observed validation sample and the replicated validation sample, based on 200 splits v i i Acknowledgements I would like to thank my supervisor Pau l Gustafson for his guidance, help and support throughout the development of this thesis. Paul was always wi l l ing to give his advice and intuitive ideas which were greatly appreciated. I would also like to thank Bertrand Clarke for his careful reading of the manuscript and for his neverending encouragement and advice. Without these two people I may never have succeeded in finishing this degree. It is a pleasure to thank all fellow graduate students, especially Matias for his invaluable help wi th computing, and various other problems along the way. Many thanks also go to Nathan for his encouragement, and help. Finally, I would like to thank my husband Mohammed, for his support and continued patience. F A T E M A H A L I A L Q A L L A F The University of British Columbia September 1999 v i i i To my parents, and to Mohammed ix Chapter 1 Introduction Statistical procedures can often adequately predict the data that generated them, however they do not always perform so well when used to predict new data. Cross-validation is a data oriented method which seeks to improve re-liabil i ty in this context. This thesis is concerned with application of a cross-validation criterion to the choice and assessment of statistical models. The concept of such assessment is an old one but nevertheless useful. It consists of the division of the data sample into two subsamples, the training sample and the validation sample. The estimation of a statistical model is based on the training sample, and then the assessment of its performance is made by measuring its predictions on the the validation sample. Herzberg (1969) made a detailed theoretical and numerical study of predictor construction methods, using cross-validatory assessment. Stone (1974) applied a generalized form of the cross-validation criterion to the choice and assessment of models using the data-analytic concept. Other examples include variable selection in linear 1 regression using cross-validated predictive squared error (Hjorth, 1994). Model selection has attracted the attention of many researchers. Cross-validation is one of the most well-known methods. There are other methods for model selection, such as the Akaike information criterion (AIC) (Akaike 1974; Shibata 1981), the Cp (Mallows 1973), the jackknife, and the bootstrap (Efron 1983, 1986). A l l these methods are asymptotically equivalent to the cross-validation with validation sample size nv = 1 (Stone 1977a; Efron 1983), however, and thus they share the same deficiency; that is they are inconsis-tent. In the problem of selecting models, this deficiency of the cross-validation can be rectified by using a cross-validation with a large nv depending on n. Shao (1993) showed that this inconsistency can be rectified by choosing nv such that nv/n —> 1 as n —» oo. Herzberg and Tsukanov (1986) d id some simulation comparisons between the cross-validation procedures with nv = 1 and nv = 2. They found that the leave-two-out cross-validation is sometimes better than the leave-one-out cross-validation, although the two procedures are asymptotically equivalent in theory. See also Geisser (1975), Burman (1989), and Zhang (1993). There are a number of different cross-validation methodologies, and they largely differ in how the partitions are chosen. Much work has been done on the ordinary cross-validation, for example Stone (1974, 1977), Bowman (1984) and Hardle and Marron (1985). The approach taken here is to delete a single data point (xi,y{), fit the model to the reduced data set, and using the estimated model parameters, obtain a prediction yi for the missing ob-2 servation. This is repeated for each point (xi,yi). Then the cross-validatory score C — Yl,L(yi,yi)/n is computed where L is an appropriate loss function; typically L(y,y) = (y — y)2. Clearly for large data sets the computational requirements may become excessive and alternative strategies are adopted, rather than omitt ing points singly. Burman (1989) considered two techniques in a study of the optimal transformations of variables, "w-fold" cross-validation and repeated learning-testing methods. The -u-fold cross-validation uses v dis-joint validation partitions by dividing the data randomly in v groups so that their sizes are as nearly equal as possible. This method wi th v = n is the leave-one-out cross-validation. For model selection in linear regression, Bur-man (1989), Shao (1993), and Zhang (1993) have each investigated a particular cross-validation procedure where M partitions are generated at random inde-pendently wi th a fixed fraction f3 being used as validation samples, and 1 — /? being used for parameter estimation in each case, Burman calls this repeated-learning-testing, and Shao calls it "Monte Carlo cross-validation". The main difference between the repeated-learning-testing method and the i>-fold method is that with the former each data point may be used as a validation point more than once. These methods can be used in the place of ordinary cross-validation whenever the latter is computationally very expensive and asymptotically in-consistent. A comparative study of these methods has been carried out in detail by Burman (1989). Consider the problem of selecting a model having the best predictive ability among a class of models. Cross-validation can be a method for model 3 selection according to the predictive ability of the models (Shao, 1993). P icard and Cook (1984) examined the cross-validation assessments of the predictive ability of a fitted multiple-regression model. Smyth (1998) applied the cross-validation approach to model selection in the sense that models are judged directly on their out-of-sample predictive performance. The approach and methods we wi l l introduce later are designed to fit wi th Bayesian statistics, which uses probability theory to describe uncertainty about the parameters and the observables. In the Bayesian approach, infor-mation and beliefs that are available before data are observed contribute to the specification of a prior distribution. After the data are observed the prior distribution is updated to the posterior distribution which is proportional to the product of the likelihood and the prior distribution. Inferences are made on the basis of the posterior distribution. The recent revolution in Bayesian statistics is due to Markov Chain Monte Carlo ( M C M C ) algorithms. These al-gorithms allow a user to draw inferences from a complex posterior distribution on a high-dimensional parameter space, by simulating a Markov chain wi th the posterior distribution as the stationary distribution. Then the posterior quantities are estimated from the simulation output, as the chain converges to its stationary distribution. Review material can be found in Neal (1993), Smith and Roberts (1994), Tierney (1994), and Besag et al . (1995). Checking the model is cri t ical in the statistical analysis. Therefore good Bayesian analysis should check how well the model fits the data and how good the posterior inferences are. In Bayesian statistics, a model can be checked 4 in at least three ways: (1) examining sensitivity of inferences to reasonable changes in the prior distribution and the likelihood; (2) checking that the pos-terior inferences are reasonable given the substantive context of the model; and (3) checking that the model fits the data. We address the third of these concerns using the posterior predictive distribution for a discrepancy, which extend classical test statistics to allow dependence on unknown parameters. Posterior predictive assessment was introduced by Gut tman (1967), applied by Rubin (1981), and given a formal Bayesian definition by Rub in (1984). Gelman, Meng, and Stern (1996), G M S , contributed to further develop R u -bin's (1984) work on posterior predictive assessment, as one possible way of measuring any discrepancies that may exist between the observables and their predictive distributions under a given model specification. This methodolog-ical contribution is the adoption of more general discrepancies, which allows more direct assessment of the discrepancy between data and the model. That there is a pressing need to perform such calibrative work wi th any model has become clearer at a time when Markov Chain Monte Carlo ( M C M C ) methods permit the realization of modeling in much more compli-cated settings than ever before. In fact, Bayesians are dwelling so much these days on diagnostics for M C M C convergence-monitoring that a strong reminder of the value of model-checking is al l the more welcome. The G M S approach appears to be simple conceptually and computa-tionally, and connected well to the classical goodness-of-fit methods with which most researchers are familiar. It is also very general, applicable for comparing 5 observations with model predictions in any form. Their applied work has ben-efited from the application of this method, as documented in their paper and also in many examples of application of this method in Gelman, Car l in , Stern and Rub in (1995) (also see Bel in and Rubin (1995) and Upadhyay and Smith (1993)). Meng (1994) discusses a similar method for testing parameters values within a model rather than for the entire model. West (1986) and Gelfand, Dey, and Chang (1992) also present posterior predictive approaches to model evaluation, in the contexts of sequential data and cross-validation of the exist-ing data set, respectively, rather than comparing to hypothetical replications of the entire data set. We consider the prediction error by randomly split t ing the data many times and averaging the squared prediction errors over the splits. Shao (1993) also averaged the squared prediction errors over the splits. We also consider the Bayesian p-value in which the posterior distribution can be used to check a statistical model, in the same context of cross-validation with repeated data splitting. In Smyth (1998), cross-validated likelihood is investigated as an appropriate score function for model selection in probabilistic clustering, in particular for choosing the number of component densities in finite mixture models. He splits the data x into x' e s i and x' r o m , for each split compute the likelihood / (x t e s ' | f? 4 r a ; n ) , where 9 t r a i n generated given the train sample. In our approach, we also split the data into samples, which we call validation and training samples. Then compute the prediction function, in the sense of Bayesian approach, / (y t e s t |x t e s f , 0 t r a i n ) . 6 The thesis is organized as follows. Chapter 2 consists of definitions, a discussion of computational issues, and an analytical il lustration of the cross-validatory methodology for assessing prediction error. Chapter 3 presents a detailed illustration of measuring the prediction error in two regression models. Chapter 4 exhibits the posterior predictive assessment of model fit through the Bayesian p-value, including a brief comparison to other versions of p-values. Chapter 5 provides discussion of various issues related to the method of measuring the prediction error and the posterior predictive p-value. 7 Chapter 2 Methodology for Measuring Prediction Error We propose methods for measuring prediction error in the context of cross-validation analysis. In this chapter we present our three methods, gold, silver and bronze, and the basic methodology of our approach in the context of mea-suring prediction error; however, they can also be adapted for other purposes, such as model checking, as wi l l be seen in the later chapter on Bayesian p-values. We introduce the analytical theory of the gold method, basic method-ology for the silver and the bronze methods, and the theory pertaining to the silver method. 8 2.1 Reference D i s t r i b u t i o n In this section we introduce definitions and notations which wi l l be used in the theoretical il lustration of the posterior predictive approaches to prediction error. We use predictive distributions in a somewhat different way, by cross-validating the modeling process, in which we set aside some of the data, fit the current model to the rest, and then locate the observed values of the set-aside data in their respective predictive distributions given the chosen model. 2.1.1 Notations Let y be the observed data, and 0 be the vector of parameters. To distinguish between the observed data y and the replicated data we define yrep as the replicated data that could have been observed. That is, yrep is the data we would see if the experiment that produced y were replicated with the same model and the same value of 6 that produced the observed data. Since the cross-validation methodology wi l l be used, we introduce a split indicator s. This splits the data into training and validation sets. Specifically s is a vector of zeros and ones, the zeros indicating observations that go into training sample and the ones indicating observations that go into the validation sample. We consider the joint distribution of yrep, 9, and the split s, P(yrep,e,s)=p(yrep\e,s)p(9\s)p(s), (2.1) as a reference distribution in our methodological approach. The following ex-planation may be helpful. 9 The conditional distribution p(yrep\9, s) is the probability distribution of the replicated data yrep given 6 and the training sample indicated by s, which is equivalent to the distribution of the replicated data set yrep given 6, expressed as p(yrep\9, s) = p(yrep\9). In our approach we typically use only the validation part of yrep, which we compare wi th the realized validation set. So we denote the distribution of the replicated data in the case of validation sample by yryP\s which is distributed as the predictive distribution for the validation sample given the training sample. A n d the distribution p(9\s) is the posterior distribution of 9 based on the training set yT. The probability distribution over the split s = (T,V) is p(s). We chose to split the data in two wi th half being the training set and half being the validation set. A l l such 50-50 splits are equally likely under p(s). V i a cross-validation, we can estimate the predicted error. Bu t instead of split t ing the data once we split the data many times. Then by averaging over the splits can estimate the expected sum of squared predicted error wi th respect to the joint distribution p(yrep,9,s), which is given by That is, we view W as a summary of predictive performance. There are ways to estimate W by using the idea of split t ing the data set. In the next section we introduce the three methods used to estimate W. (2.2) 10 2.2 D e s c r i p t i o n of the M e t h o d s In principle we are interested in various expectations wi th respect to the joint distribution p(yTep,Q,s). We develop our methodologies based on the idea of split t ing data to estimating the expected prediction error, but they may also be extended to include other posterior predictive checks. In this section we briefly discuss the three methods of estimating W. The gold method uses analytic calculations for the posterior predictive distribution; this requires computa-tional effort that most of the time can not be handled by hand. However, the silver method avoids this mathematical intensity, instead simulating many pos-terior samples. The bronze method reduces the amount of sampling to speed up computation. In the description of the techniques used for each method, we show how to conduct these methods, and then one can decide when it is appropriate to implement one rather than another. 2.2.1 Gold Method To estimate the expected sum of squared predicted error, W = E {£(V7 - Wt)2} , we do some analytic calculations on W to get, = Es J£ (yff - E{y^\sf) | + Es | £ £ ^ P | s (E(y^\8) - yv^j 11 where the cross-term Eyrep]s(y™p - E(y^\s))(E(y^\s) - yVi) = 0, since EyreP\a{yTyP — E(yTyP\s)) can be written as E y r ^ p ) - EyrePls(E(y^p\s)) = EyreP\s{yyP) ~ Eyrep \ s (f/yf ) . Now, W = Es J E Var{y™p\s)} + Es {E (E(y^\s) - yVi)^ . To estimate W, we need to perform the following. First , sample inde-pendent splits Si,S2,...,sr from p(s). Then estimate W by Wo = 11 J E Var(y^\8i)\ + \ ± {\\E(yrvep\Sj) - W | | 2 } . (2.3) ' j=i I j ) ' j=i To compute F F G (G stands for Gold) we need to compute the mean E(ylep\sj) and the variance Var(ylep\sj) of the predicted distribution of the i-th observation in the validation sample, given the training data set based on split Sj. Computat ion of the predicted distributions can be performed analyti-cally for some simple models such as normal linear regression, as wi l l be seen in the next chapter on examples of measuring prediction error. Bu t in compli-cated models which may arise in practical applications, such as a hierarchical mixture model, we can not compute E{yTyP\sj) and Var(yryP\sj) by hand. Therefore it is more easily accomplished v ia simulation. This is some extra 12 computational burden, but simulation is a standard tool for Bayesian analysis with complex models, and often it is the only option. The usual Bayesian simulation provides a set of draws of 9 from the posterior distribution, p(9\y). In the normal linear model the posterior distribution can be easily simulated using exact sampling; However, when the posterior distribution cannot be simulated directly, an indirect simulation method such as the Metropolis algo-r i thm ( M A ) is often used. O n the basis of this fact, we introduce the silver and the bronze methods, using simulation methods to sample from the posterior distribution so that the mathematical computation effort is lessen. 2.2.2 Silver Method In this method we sample from posterior predictive distributions to estimate the predicted error. We consider the following algorithm for comparing the realized validation data vector yv to yTyP: 1. Generate random splits s i , . . . , sr, from the p(s) that we introduced be-fore. 2. For each s, construct a sample of the parameter 9 from the posterior distribution p(9\s) based on the training sample. • Given 9, draw a replication data set y™p from the sampling distri-bution p{yrep\9, s) given the training sample. • Having obtained yyP, we can estimate the expected squared of pre-dieted error E(\\yrvep - yvf\Si) by w(Sl) = £ _ yVjf, 13 where m is the number of f?'s generated using the i-th split. 3. Averaging over the multiple splits, the silver estimate of the predicted error is Ws = -J2w(Si). (2.4) r i=i In the silver method as we have seen, for every split there is a new simulation from the posterior distribution, since we simulate draws from the posterior distributions for as many as splits we have. We thought that we should reduce the number of simulations and that the number of simulations should not depend on the number of splits, because by simulating every time we split the data we slow down the estimating procedure, especially when using complicated models. In the next section we.introduce the method that reduces the number of simulations. 2.2.3 Bronze Method In this section, we propose the use of importance sampling to estimate the predicted error and improve upon the silver method in terms of computation time. The bronze method uses a similar concept as the silver method; how-ever, we try to improve on the idea of drawing samples from the posterior distribution of 6 based on one portion or split of the data set. Instead, we simulate a sequence of draws for 6 based on the entire data set. Using impor-tance sampling we make a correction so that the draws look as though they are drawn from the posterior distribution based on one portion of the data set, the training set. This approach is shown in mathematical terms below. 14 The basic computational strategy Combining the densities f(yi\9) wi th a prior density, p(9), yields the posterior density P(e\s)<x{l[f(yi\e))p(o), which is based on a specific training set. Here we roughly approximate p(9\s)-by a "squashed" version of the posterior based on al l the data, 9 ( % ) « { n / ( y i l * ) } P(*)> where n is the size of the data set. It is intuitively sensible to choose a = | for an effective sample size of | , to match the training sample size, as we split the data into two halves. A sample of m > 1 draws for a better approximation of posterior sam-pling can be simulated as follows. 1. Sample values 9i,...,9m from q(9\y), then sample y*ly..., y*n from the/(y|0j) . 2. Draw different splits S\,... ,sr from p(s). For each split Sj the probability of sampling each 9i is proportional to the importance weight w*(9j) where [ l ) q{0i) US=i{fiVi\ma' These weights u>y which depend on 9i and split j make the 9's look like a sample from p(6\sj) based on the training set. 15 3. For every different split, use the split indicator to split the replicated data into validation and training samples, y* and y* . Also using the same split, the observed data are split into validation yv, and training J/T samples. Now for each split, we have y*y which is equivalent to yTyP, the replicated observation of the validation sample wi th probability w. Therefore, to estimate the expected sum of squared predicted error W = E{\\y^-yv\\2}, we should use these weights in our estimator. If n is the number of data points, then the bronze estimator is r m n WB = - £ £ £ ^ ) ( y v l - y k ) ' j=i i=i k=i m n vfWi,3 = E E i=l fc=l 1 r - E - M ^ M j ' J=I (2.5) (2.6) where ISj (k) is an indicator, taking the value 1 is yk is in the validation sample for split Sj, and 0 otherwise. Analogously, for each s we calculate the difference error i=i E 0 4 - Wu? k=l W. h3m Then the bronze estimate of the predicted error is ' i=l (2.7) 16 2.3 T h e o r y P e r t a i n i n g to the S i lve r M e t h o d In this section we show the theoretical considerations for the silver method, regarding the number of splits versus the size of the posterior samples for each split. We discuss the question of whether a few big samples or many small samples is optimal. We aim to analyze this concept theoretically under conditions that make our choice of the size of the posterior samples and the number of splits optimal. Let xu, Xi2,..., xu be draws of a posterior quantity for ith split, where i = 1 , . . . , I correspond to different splits. To estimate the value 9 = E{x) we may use the estimator §-4Mx") (2-8) Now how big should I and J be relative to one another so that 9 is a good estimator wi th small variance ? Suppose that we use Markov Chain Monte Carlo ( M C M C ) to simulate draws from the posterior distributions. Then for each split there wi l l be some burn-in time of say w observations, so we can think of the cost (c) of sampling as c = I(J + w). (2.9) We may reformulate the problem as one of minimizing the variance of 9 subject to a fixed cost under the following reasonable assumptions. 17 A s an A N O V A - s t y l e set-up, let •^ij ~~ Pi ~\~ ^-ij where pi ~ N(p,r2), reflects split-to-split variation, and e^ - ~ N(0,a2). A l l Xifs have the same distribution, wi th x^ ' s independent for different splits, corr{tij,ek{) = 0, for i ^ k, but dependent for the same split, corr(eij,eik) = p 1- 7 -* 1, for % ^ k. Now to minimize the variance of the estimator (2.8), an analysis proceeds as follows. For simplicity let x^, = j S /= i so that and 1 1 var(d) = -rY^var(xi,), 1 i=i since the covariance between different splits is zero, i.e. covfa., £j) — 0 for al l i ^ j. Now, var(xi) = var{pi + Cj.} 1 J • J j=l j<k = T2 + {Jo 2 + 2sa2} , 18 where s = (J - l)p + ( J - 2)p2 + Using the large J approximation J - I 3=1 P J gives .2 . 1 J T 2 , 2 T P Var(5i.) « r 2 + — { Ja 2 + 2a J J 2 — 1 _ p r 2 + _ ! / 1 + 2 : » 1 Now J 1 - p i2U\ J\ i - P , To find the minimum of the variance we differentiate with respect to J SJvar(«) ~ c — Thus solving ^jvar(6) = 0, and clearly the second derivative yields J — cr2u>(l + p) N ^ 2 ( 1 - P ) as the optimal posterior sample size, with / then determined by the cost constraint. 19 In light of these results, we can conclude that the optimal length of the chain of the parameters drawn from a posterior distribution is proportional to the square root burn-in time chain length oc \ /burn- in time. For the silver method we need to draw posterior samples of the param-eters every time we split the data set. Based on the mathematical results the optimal length of the chain depends on the square root of the burn-in time. Whether or not this chain length should be used when using the silver method is entirely dependent on the user's confidence, we just want to show how long the chain should be from a mathematical point of view. For the bronze method we simulate al l the parameters once before split t ing the data set, so maybe people would prefer one long chain at once as in the bronze method and toler-ate the computational intensity in computing the importance weights for each split. 20 Chapter 3 Examples of Measuring Prediction Error 3.1 I n t r o d u c t i o n Prediction errors for a new set of data can be used to assess the quality of model-based predictions. The approach taken in this chapter is to use standard output from regression analysis to get the prediction function. We illustrate the three methods of estimating expected prediction error wi th examples of constructing regression models. This chapter introduces Bayesian model build-ing and inference for normal and Weibul l regression models. Using a Bayesian approach we chose a noninformative prior distribution, wi th the understanding that this is no more than a convenient assumption for the purposes of expo-sition and can be extended to informative prior distributions. The Bayesian approach to prediction is potentially very rich, as one can focus attention on 21 the whole predictive distribution of unobserved future values, given predictors, prior data, and information. This can be more informative than just looking at point predictors. We compute the posterior distributions for these models, interpret the results, and compare the estimates of the prediction error for the different methods. Throughout, we describe computations used in each of the methods for estimating expected prediction error. In particular, we show how simple simulation methods can be used to draw samples from posterior and predictive distributions, to incorporate uncertainty in the model parameters, and to draw samples for posterior predictive checks. We also check the sampling variabili ty of the prediction error estimates of each method. 3.2 N o r m a l Regress ion M o d e l In this section we investigate the three methods using an example of normal linear regression with a noninformative prior distribution. It is interesting both theoretically because of the elegance of the underlying theory, and from an applied point of view because of the wide variety of uses of linear regression. The normal model had been chosen basically to be a test case in which it is convenient to compare the three methods. Data set The data set used in this analysis was obtained from Weisberg (1995, pp. 144-5). The mammals data contains two variables, the average brain 22 weights and the average body weights for 62 species of mammals. These data were taken from a larger study and were collected for another purpose (All ison and Cicchetti , 1976). Once the parameters of a regression model have been estimated, they can be used to predict future observations of the brain weight from the explanatory variable, the body weight. So we thought this was a good example to test our procedures for estimating the prediction error by split t ing the data into two halves. Given one half we predict the other half. Before we start, we check if any transformation is needed for the linear fit. We shall consider the problem of modeling brain weight y as a function of body weight x. A n ini t ia l attempt to graph brain weight (in grams) versus body weight (in kilograms), as given in Figure (3.1), indicates that some transformation is required. The plot shows that the relationship between the body weight and the brain weight is not linear, wi th most of the points in the plot being jammed into the lower left corner and only a few stragglers elsewhere. Because of the wide variation of both variables, an exponential fit seems to be a good candidate. Assume that the correct functional relationship is of the form brain weight = cv 0(body weight)" 1 . The scatter plot in the log scale given in Figure (3.1), suggests that there is a strong linear relationship in the log scale. The transformation seems to achieve linearity and constant variance. Taking logarithms wi l l reduce the model to a normal linear model with additive errors, log(Y) = log(a0) + ailog(X) + e, where e have zero expectation and constant variance a2. To simplify the notations we consider v = log(y), u = log(x), (3Q = log(a0), and p?i = o^. So 23 we may rewrite the normal linear model as V = Po + PiU + e. (3.1) The data points v follow a normal distribution wi th mean p = (30 + /5i u and variance a 2 . For a vector v = ( v i , . . . , v n ) of i id observations, the likelihood is p{v\po,Pi,v2)=n 7 ^ e x p { - ^ V i - & - • ( 3 - 2 ) Since we are conducting a Bayesian analysis, we need a prior distribu-tion in addition to the likelihood based on (3.2). This wi l l yield a posterior distribution on the unknown parameters 9 = (f30, (3i,a2) of the normal linear model and enable us to estimate them. The prior distribution on which our inference wi l l be based is a standard noninformative prior, 7r(/?,a2)oc^ . Now we illustrate the three methods through this example, and try to conduct a fair comparison between them. We thought if we could draw a his-togram of repeated realizations W (the estimate of the expected prediction error) we would gain some knowledge about the sampling distribution. The better the estimate the less variability there is in the sampling distribution. This can be done by repeating the process of getting W's many times for dif-ferent splits. Then by drawing histograms of the W's we can visually compare the sample variances between them in each method. We decided to have 100 W's for each method. However, for comparison purposes we fix the split indicators that we use to split the data to be the 24 same for the three methods. Before we discuss each method individually we may roughly explain the procedure of fixing the splits to draw histograms of 100 W's. We agreed on using 50 splits to compute each W, and we have 62 data points. To generate al l split indicators for the 100 W's, we create an array with dimensions 100 x 50 x 62. To compute W we need to split the data 50 times and get the same samples of the validation and the training for each split in each method. Using that array properly we wi l l make sure that each W in different methods is using the same splits of the data set. For the gold method, we need to compute the mean and the variance of the predictive distribution of each observation in the validation sample, given the training data based on the split. In the ordinary linear model, working through this mathematical computation is quite easy. The silver method uses many posterior samples, we should simulate 9 per split, then based on 9 generate the replicated data set. We first t ry to sample one 9 per split. Bu t this may lead to some crit icism so we decide to sample 100 9's per split for a more confident results as sampling one 9 only is less reliable. Final ly, for the bronze method we generate 100 9's at once and simulate replicated data for each of them. Then based on the same splits we estimate the 100 predicted errors W by split t ing the replicated data set and estimating the expected sum of squared error. Also we simulate 50 9's for the bronze method, so as to have a fair comparison to the silver method in the case of sampling one 9 per split. 25 3.2.1 The Gold Method The gold method is easy to apply for our ordinary linear model because the full conditional posterior distributions p(yrep\0,s) and p(8\s) have standard forms wi th the noninformative prior distribution, and the mean and variance of the predictive distribution can be easily computed. (For further details see Appendix A. ) We computed a set of 100 estimates which estimate the expected sum of predicted error W by ( G O L D W ) . The simulation is set up as follows. • For each iteration I = 1 , . . . , 100, we split the data set 50 times by sam-pling the splits indicator S j , i = 1 , . . . , 50. For each split we compute a predictive mean and variance for the validation sample given the training sample, E(y^p\yTi), var{yrvf\yTi) using (A.7) and (A.8), so the expected predicted error is tfi = Hi* - sWWII2 + zZvar(yv\p\yTi)-i • Average over splits W = ^ X)f£i Wi-• Repeat this procedure 100 times to obtain 100 W ' s , and draw the his-togram as shown in Figure (3.2). The histogram shows little variability amongst the 100 estimates. 3.2.2 The Silver Method We illustrate the silver method in estimating the predictive error with the mammals example, and we consider the same ordinary linear regression model. 26 A s wi th the normal distribution with unknown mean and variance, we sim-ply use the result derived by Gelman et al. (1995) to determine the posterior distribution for ft, conditioning on cr2, and then the marginal posterior distri-bution for a2. That is we factor the joint posterior distribution for ft and cr2 as p(f3,o2\y)=p(P\o2,y)p(o2\y), following the factorization of the posterior distribution given by (A.2) and (A.5). Using exact sampling to estimate the posterior parameters of the linear regression on the mammals data set, we first draw a random value of a2 ~ Inv—x2(29, s 2 ) , as 29s 2 divided by a random draw from the x 2(29) distribution (see Appendix B ) . Then given this value of cr2, we draw ft from its conditional posterior distribution, N((3, Vpa2), where ft is from (A.3). Based on 50 simulated values of (ft, a2), we estimate the expected sum of predicted error W. We carried out the idea of split t ing the data as follows. For each iteration 1 = 1,..., 100: 1. Sample multiple splits S j , i = 1 , . . . , 50. 2. For a single split, compute t u ( s j ) = HiLiiVvT ~ VVi)2-3. Having performed the step 50 times, we obtain W by averaging over the 50 w(s)'s, -. 50 We combine the results from the 100 iterations, and draw a histogram of the 100 W's as shown in Figure (3.2). The silver standard estimates are more 27 spread out than the gold standard estimates, as expected. However, we could get less sampling variability among the estimates by increasing the number of simulated values cr2) to 100 for each split as shown in Figure (3.2). Even though using larger posterior samples takes more computing time, it is worth trying for better results. 3.2.3 The Bronze Method In this section we demonstrate the use of importance sampling through the example of the body and brain weights of the mammals to check and im-prove upon the procedure of split t ing the data set and generating the poste-rior parameters for each split. We present the same ordinary linear model on the mammals wi th parameters 9 = (p? 0,Pi,cr 2), wi th the use of importance weights in drawing the posterior parameters to look as though they were gen-erated from the posterior distribution given the training set. For the bronze method we start first by generating the marginal and the conditional posterior distributions for 9 and the replicated data set based on the whole data set. This suggests the following simulation algorithm: • Simulate a sample of size m = 100 from the flattened posterior distribu-tion given by Sample a2 ~ Inv — x 2 (a (62) — 2, s 2 ) , and conditioning on a2 sample P~N0,Vp£). 28 • Given di, i = 1 , . . . , 100, generate a sample of m — 100 from p(y\9i). • Sample splits S j , i = 1 , . . . , 50. • For each split, compute the importance weights wf, i = 1 , . . . , 100. The estimate of the predictive error is m i=i • Average over splits to obtain W — ^ YL\=\ w(si). Repeating the whole procedure 100 times using the same split indica-tors that the gold and silver methods used, we have 100 W ' s shown in the histogram in Figure (3.2). Also for a fair comparison wi th the silver method of 50 simulated values of (/?, a2), one value per split, we generated a sample of size m = 50 from the flattened posterior distribution before spli t t ing the data and carrying out the same procedure to draw the histogram of the 100 W as in Figure (3.2). This is more spread out than the histograms of the W ' s from the 100 9's generated for the 50 splits. B y looking at the histograms in Figure (3.2), we can conclude that the gold method has the smallest variance, as expected from the gold estimator which has the bias correction term, the variances of the replicates, added to the prediction errors to provide an adequate estimate of W . Therefore the gold W ' s have less sampling variability. B y sampling 50 #'s at once for the bronze method and one 6 per split for the 50 splits in the silver method, we wanted to conduct a fair comparison between the two methods. Based on the results from this example they seem to have the same sample variability. Also, from a 29 practical point of view sampling one 8 it is less reliable so we needed to sample more than one 9 every time for each split. The 100 0's histogram of the silver sample estimates shows less spread than the bronze sample estimates, which was expected. For the silver method for each different split there is a new sample of the parameter 8, so we expect to have less sample variabili ty than the bronze. However, we should mention that the silver method takes more computing time than the bronze for sampling from the posterior distribution in each split. O n the other hand, for the bronze method we need to calculate the likelihood for each 8 and in each split we need to calculate those weights. In this example, the gold method turned to be the best but for complicated models it w i l l not be feasible. For us, we see the bronze method works better than the silver, and takes less computing time. We can not judge these methods based on one example only, so we decided to try another model, which we discuss in the next section. 30 3.3 Weibull and Extreme Value Regression Model Unt i l this point we have dealt exclusively wi th the ordinary linear model. In practice many situations involve heterogeneous populations, and it is im-portant to consider the relationship between lifetime and other factors. We thought we would try the methods on regression models, where the depen-dence of lifetime on regressor variables is explicitly recognized. The methods for estimating expected prediction error are treated in detail in this section v ia a regression method using data on the time to breakdown of a type of elec-trical insulting fluid subject to a constant voltage stress. The silver and the bronze methods are implemented for estimating the predictive error, whereas the gold method is not involved in this example as it is more mathematically intense to figure out the predictive distribution, and this is not the purpose of this research. Data set The data set used in this analysis were obtained from Lawless (1982). Nelson (1970a) presents the data, which are breakdown times for seven groups of specimens, each group involving a different voltage level. The data are uncensored, and times to breakdown are given in minutes. The results of an accelerated life test experiment on a type of electrical insulation were pre-sented. In the experiment, specimens of insulation were subjected to seven voltage levels, 26, 28, 30, 32, 34, 36, and 38 kilovolts ( K V ) . Engineers thought the appropriate model was that for a fixed voltage level X{, the lifetime distri-bution for the items is a Weibul l distribution wi th shape 5 and scale parameter 31 a, = exp{[3o + P\x{\. Note that distributions corresponding to different volt-age levels are considered to differ only with respect to their scale parameters ojj, the shape parameter 5 being the same for different levels. It is shown that there is no evidence against the hypothesis of equality of shape parameters 8 (for more details see Lawless (1982)). The Weibul l distribution can be extended to include regressor variables in different ways. However, the most commonly used is that for which the p.d.f. of lifetime, given the vector x of regressor variables, is We may often work wi th log lifetime, Y = log(T), wi th the p.d.f. given x being /(VW = i e x p - exp } , - c o < „ < oo, (3.4) where //(x) = loga(x) and a = 8'1. So the linear model can be written as V = Po + Pix + oz (3.5) wi th fj,(x) = Po+Pix, where z has a standard extreme value distribution, —oo < z < o o . Prediction To try out the prediction error in our Weibul l regression model, we car-ried out the procedures from the silver and the bronze methods. In this case it was not possible to sample directly from the posterior distribution. Therefore, 32 random numbers were generated from a Markov chain with the posterior as the stationary distribution. This method is well known as the Markov Chain Monte Carlo ( M C M C ) method. We used the Metropolis-Hastings algorithm with independent increments. In the next section we give a brief review of the .Metropolis algorithm. 3.3.1 The Metropolis Algorithm The Metropolis algorithm was originally introduced by Metropolis, Roseen-buth, Teller and Teller (1953) for computing properties of substances composed of interacting individual molecules. This algorithm has been used extensively in statistical physics. A variety of these kinds of algorithms were proposed by many researchers in the past. The Hastings version of the algorithm constructs a Markov chain wi th n as its stationary distribution as follows. In the Bayesian approach n is the posterior distribution which would be the target distribution. If the chain is currently at a point Xn = x, then it generates a candidate value Y for the next location Xn+i from a transition density q(x, •). W i t h probability a(x,y) this candidate is accepted and the chain moves to Xn+.x = y. Otherwise, the step is rejected and the chain remains at Xn+i = x wi th probability 1- a(x, y), where is the acceptance probability. Note that this algorithm depends on 7r only 33 through the ratios of the form ir(y)/Tr(x); thus TT only needs to be known up to a normalizing constant (Hastings, 1970). If n is a continuous univariate distribution on JR, the random walk Metropolis algorithm (Tierney, 1994) proceeds as follows. To update from Xn = xn to Xn+i = xn+\, we add noise (usually taken to be normal with mean zero and variance cr2) to the current state. A s such, y = xn + z, where z ~ 7V(0, a 2 ) and Implementing M C M C We thought we would try out the M C M C algorithms before conducting them in the methods, so we could make sure the chains converge and have an ap-propriate acceptance rate. As for the Bayesian approach, we must specify a prior distribution in addition to the likelihood based (3.4), so that samples of the unknown parameters 0 = (f30,3i,a) can be simulated from the posterior distribution. Since we do not have enough information to construct a prior distribution, we chose a standard noninformative prior, TX(8,U) OC ^, so that inferences are unaffected by information external to the current data. We used the random walk Metropolis algorithm with independent increments for ex-ploring the posterior distribution. The candidate state is obtained by adding Then, y wi th probability a xn+\ = < xn wi th probability 1 — a. 34 noise to the current state, the noise is generated from the normal distribution Z ~ N(0,Tj), where Tj is the tuning parameter set up by t r ia l and error to adjust how high the acceptance rate of the chain should be. To run the random walk (RW) algorithm we should start with in i t ia l values for the parameters that we want to sample from their posterior distri-butions. The simple way to do that is to fit a simple linear model and let the least squares estimates of the parameters be the in i t ia l values for p0 and P\. A n d we let a2 equal the variance of the log (lifetime). W i t h these settings R W chains of length n — 500 are simulated in Splus. We obtained unsatisfactory results of chains moving slowly through out of the support of the target dis-tribution and others having high jumps wi th high range of acceptance rates between the parameters as shown in Figure (3.3), which indicates that the chains wi l l never converge. So we thought we should use some methods which are useful for improving lack of convergence or slow convergence due to bad starting values, or high posterior correlations. The posterior parameters are correlated which may slow down the mixing in the chains. A simple remedy is to work with centered covariates (Gilks and Roberts, 1996). Doing that with the same first settings, and adjusting the tuning parameters, we achieve adequate mixing of a Markov chain and a range of acceptance rates for updating the parameters between 50% and 75% which are reasonable rates based on Gelman et al . (1995). Plots of these are given in Figure (3.4). In the next section we discuss the silver and the bronze methods through this same example. A n d for a fair comparison we generate the whole split 35 indicator by creating arrays so we use the same splits for both methods at the same time. We are using M C M C to sample from the posterior distributions. Based on some tests done to check how fast the chains converge, we decided to run the chains for a total of 500 iterations, with 300 burn-in iteration so we have a sample of size m = 200 of #'s to be our posterior sample. 3.3.2 The Silver Method We tried out the silver method for estimating the predictive error in the volt-age data, and we consider the extreme value model. Using the random walk Metropolis-Hastings algorithm to sample from the posterior distribution of 6 = (j3o, /3i,cr), we carried out the procedure for estimating the expected error as follows. • Sample splits s,, i = 1 , . . . , I. • For each split: 1. Set up the ini t ia l value for the M C M C method which are (30 = E(I/T), the mean of the training sample of the independent vari-ables; /?i = 0. A n d <r2 = var(yT), the variance of the sample of the independent variables. 2. R u n M C M C to estimate a sample of size m = 200 9's. 3. Based on the 200 0's, draw a hypothetical replication yTyV. 4. Compute the squared predictive error for the 200 sets of the repli-36 cation, s = \\yyp-yv\\2-5. The estimate of the predictive error would be wt = mean(s). • After multiple splits, average over the Wi's to obtain the estimate of the expected squared error 1 i=l We first tried the silver method wi th number of splits i = 50, and sample size m = 100 9's using M C M C over 1000 iterations. The estimate of the expected squared error was W = 212.28, which is a sensible estimate based on the results of the mean of the squared predictive error which have small variance for each split. The problem in for this setting was the computing time that took very long to get one W , which is not practical as we need many of them so we could compare wi th the bronze method estimates. Therefore, we cut down the number of splits to i = 25 and the number of iterations to 500. This time the silver method procedure took less time so we could use it for many estimates of W ' s . In the next section we present the bronze method, then we plot graphs of the estimates for both methods for comparison purposes and checking the bias of the estimates. 37 3.3.3 The Bronze Method As we discussed previously for the bronze method, we first sample al l 9's needed for the replication values, then split the data many times for each to get the estimate of the predicted error. We tried the following algorithm for the voltage data. 1. Simulate a sample of m = 200 9's. B y using the random walk Metropolis M C M C algorithm, wi th /30 = E(y), /3i = 0, and a — var(y) as the ini t ia l settings for the M C M C chain. We use the likelihood function where a = \ as explained in the previous chapter. 2. Given f?'s, compute the log densities for the regressor variables (voltages) that are needed to figure out the importance weights for each 9, given by MiW) = -nto»W+E ( W " ( A g + f t j ' ) ) - £ « P ( K " ( A g + / i l I - ) ) 3. For each 9, sample replicated data yrep from the extreme value distribu-tion. 4. Sample splits i = 1 , . . . , I. For each split s • For each 0;, i = 1 , . . . , m , compute the importance weight m ^• = E ( ( S - « ) M ^ ) ) ) -38 • Compute the squared difference between the replicated validation sample yTyP and the realized validation sample yv \\yryP — yv\\2-• Compute the split-specific estimate of the expected squared pre-dicted error m 5. Average over splits, W = j J2i=i Based on the same first settings of the silver method with i = 50 splits and sample size m = 100 0's, using M C M C over 1000 iterations, for the first set of splits we obtained W = 214.53 which is approximately the same as the silver estimate, with much less computing time. Then based on the ultimate setting of the silver method, we wanted to check i f there is any bias in the estimates of the silver and the bronze. We drew a plot shown in Figure (3.5), which obtained by repeating the process of computing W 20 times for the silver and the bronze methods. The plot shows no sign of bias in either of the methods, which makes us more confident wi th our results. Also we compute 50 W ' s for each of the silver and the bronze methods to draw histograms and test the sampling variability for each of them as shown in Figure (3.6). From this histogram, we see the silver method estimates have less sample variability than the bronze method estimates. These results would give us an idea which method to choose, Each of them has some advantage over the other. For example, the bronze method has less computing time and we do not have to fit the model for each split to sample 9 as is the case in the silver method. However, for the bronze method there are a lot of calculations such us the 39 log-likelihood for every 9 and the importance weights. As we wi l l see in the next chapter, prediction methods, while useful in their own right, can also be used as a tool in model checking. 40 Figure 3.1: Scatterplot of the mammals data for the average brain and body weights. o o o m igh o o o CO c re ii o o o t _ o Body and brain weight of 62 mammals. 2000 4000 6000 body weight (kg) The Log of body and brain weight of 62 mammals. CO . c O) CO CD c •* 2 .a M OJ o Lo o CM • . •• • • • • •• • - • 0 2 4 Log of body weight 41 Figure 3.2: Histograms of 100 W's. W.hats gold W.hats silver for one theta W.hats silver for 100 thetas 28 30 32 34 36 38 28 30 32 34 36 38 28 30 32 34 W.hats bronze for 50 thetas W.hats bronze for 100 thetas 28 30 32 34 36 38 28 30 32 34 36 38 W.hatb50 w.halblOO 42 Figure 3.3: MCMC of uncentered voltage for insulating fluid failure data. 43 Figure 3.4: MCMC of centered voltage for insulating fluid failure data. 0 100 200 300 400 500 It 0 100 200 300 400 500 It _ 9 & 9 0 100 200 300 400 500 44 Figure 3.5: Silver and bronze estimates prediction error. I Figure 3.6: Histograms of 50 W's. W.hats silver for 100 thetas 190 200 210 w.hats 220 W.hats bronze for 100 thetas 230 190 200 210 w.hatb 220 230 46 C h a p t e r 4 B a y e s i a n V a l u e 4.1 Introduction The p-value is a statistical notion that has been widely used and yet seri-ously criticized for a long time. The main controversy is whether a p-value provides adequate evidence against a null hypothesis. Most cri t icism leveled at the p-value has come from the Bayesian school, as mentioned in Hwang, Casella, Robert, Wel l and Farrel (1992), because the calculation of a p-value involves averaging over sample values that have not occurred, which is a clear violation of the likelihood principle. This has led to several formulations of a "Bayesian p-value." In particular, Rubin (1984) uses the posterior predictive distribution of a statistic to calculate the tail-area probability corresponding to the observed value of the statistic. A s Rubin pointed out, such a frequency calculation is relevant because it helps the process of model diagnosis, a fun-damental part of any Bayesian analysis. Meng (1994) intended to illustrate 47 the ut i l i ty of the posterior predictive p-value. The Bayesian formulation, using posterior predictive replications of the data, allows a test statistic to depend on both data and unknown parameters and thus permits a direct measure of the discrepancy between sample and population quantities. The extension to include generalized discrepancies D(y; 9) rather than just statistics T(y), as in Rubin (1984), may be practically quite important because of the potentially tremendous computational advantages when averaging rather than maximiz-ing over 9, especially with multi-modal likelihoods. This posterior predictive p-value can also be viewed as a generalization of a classical p-value, averaging over the posterior distribution of parameters under the null hypothesis, and thus it provides a general method for dealing with the unknown parameters. In Bayesian statistics, a model can be checked in at least three ways. One way is checking that the model fits the data using the posterior pre-dictive distribution, in the presence of unknown parameters. Gelman, Meng and Stern (1996), and Gelman, Car l in , Stern and Rubin (1995) discussed the approach of comparing the posterior predictive distribution of future observa-tions to the data that have actually occurred. It is simple, both conceptually and computationally, and is applicable for comparing observations wi th model predictions in any form, without requiring any more substantive information than is in the existing data and model. In this chapter our purpose is to in-troduce a different notion of Bayesian p-value which includes averaging over different training/validating splits of the data. We illustrate this notion of p-value with an example used in Gelman, Car l in , Stern and Rub in (1995). 48 Draper (1996) also proposed cross-validating in the modeling process as an alternative to posterior predictive assessment of x2 goodness-of-fit measures described by Gelman, Meng and Stern (1996). Our approach is different in that we split the data many times rather than once. We compare our approach to Gelman et al . (1995) and investigate how the idea of split t ing the data may effect the interpretation of the posterior predictive p-value. Also we provide a formal definition of a posterior predictive p-value. 4.2 Posterior Predictive p-value The p-value of the test quantity as defined in the Bayesian context can be used to measure lack of fit of the data with respect to the posterior predictive distribution. Before we introduce our notion of the p-value, we briefly present the Gelman et al . notion of the p-value and define the classical p-value. Also we define Draper's notion of the p-value, as it is somehow related to our approach by split t ing the data once rather than many times. Using the same notation as in Chapter 2, the Gelman et al . work wi th the joint posterior distribution of yrep and 9 given the existing data y, First we define the classical setting of the p-value for the test statistic T(y) to p(yrep,9\y)=p(yrep\9)p(9\y). Then the posterior predictive distribution is 49 be Cp - value = Pr{T{yrep) > T(y) where the probability is over the distribution of yrep with fixed 9. In the Bayesian approach, the test quantity T(y, 9) is a function of the un-known parameters and data as it is evaluated over draws from the posterior distribution of the unknown parameters. So the Gelman et al . definition of the posterior predictive p-value is the probability that the replicated data could be more extreme than the observed data, Bp - value = Pr{T(yrep, 9) > T(y, 9)\y}, where the probability is over the posterior predictive distribution of yrep and the posterior distribution of 9. The Gelman et al . approach to measuring the p-value by using posterior simulations of (9, yrep) is as follows. • Draw 9i,... ,9L independently from the posterior distribution of 9. • For each simulated 9l, draw a replicated observation, y r e p ' ( from the distribution, y\6 = 9l. • The estimated p-value is the proportion of the L test quantities for which the predictive test quantities T(yrep'1,9l) equal or exceed the realized test quantity T{y,9l). Draper suggested in his comment on The Gelman, Meng and Stern (1996) paper proposed using predictive distributions in a somewhat different 50 way-by cross-validation. Draper's p-value is "split-specific" which means spe-cific to one split s. The split-specific p-value is defined as follows. Draper thinks that this discrepancy measure can better highlight which part of the data set fits the model badly, and thus it prevents using the same data twice problem. We choose to use a different approach for measuring the p-value, follow-ing the methodology of split t ing the data. So we define the "split-averaged" p-value as with respect to the "reference distribution" (2.1). Note that p = E{ps} wi th respect to the reference distribution (2.1). To compute the p-value, the silver method can be used in the following way. • Sample 9\ from the posterior distribution p(9li\yTi), I = 1,..., L. • Sample yTyP'1 from the probability distribution p(yVj\9\). • Compare the realized validation test quantities T(yVi, 9\) and the predic-tive validation test quantities T(yv:p, #•), that is, test whether T{yryP, 9) > ps = Pr{T(yrvep,9) >T(yv,9)\S = s} (4.1) p = Pr{T(yrvep,9)>T(yy,9)}, (4.2) T(yv,9). To estimate the p-value (4.2), note that Pr(T(yrvep,9)>T(yv,9)\yT) = E { i ^ ^ ^ ^ y ? } 51 and we sample from the joint distribution on (s, 6) given by p(s,6)=p(s)p(6\yT). This suggests the following estimator, for a given split 1 L PS = T^2 ^\T(yTyP,&i)>T(yv fii)} (4-3) where 0\,..., f?L ~ 0\s. A n d thus j m P = — J2Psi where s\,..., sm ~ p{s). Thus we are averaging over independent splits. 4.3 Example: Comparing speed of light mea-surements to the posterior predictive dis-tribution. We illustrate the split-averaged p-value through the example used by Gelman, Car l in , Stern and Rubin (1995). We calculate the p-value using the Gelman et al . approach and ours wi th two different testing functions. We compare the adequacy of the two methods for model checking. Data set The data set used to illustrate the concept of posterior predictive p-value is from an experiment set up by Simon Newcomb in 1882 to measure the speed of light. Newcomb measured the amount of time required for light 52 to travel a distance of 7442 meters. The 66 measurements are given in a histogram as shown in Figure (4.1). The histogram shows that there are two unusually low measurements and then a cluster of measurements that are approximately symmetrically distributed. It is inappropriate to assume that all 66 measurements are independent draws from a normal distribution wi th mean p and variance cr2, wi th a noninformative uniform prior distribution 7r(n,o2) OC ^ J . It is obvious that the outlying measurements do not fit the normal model. We discuss the Bayesian p-value for measuring the lack of fit for these data by comparing the observed data to what we expect to be observed under the posterior distribution. A n y systematic differences between the simulations and the data indicate potential failings of the model. 4.3.1 The Gelman et al. Approach The Gelman et al . approach uses graphical comparisons of summaries of the data to summaries from posterior predictive simulations, and the notion of posterior predictive p-value to measure the statistical significance of the lack of fit. They use the sample variance as the test quantity. The histogram of the observed sample variance and the 200 simulated variances from the posterior predictive distribution, and the estimated p-value close to | , indicate that the model fit the data well, which is not true. They justify this result by noting that the sample variance is not a good test statistic because it is a sufficient statistic of the model, which wi l l be well fit in the posterior distribution. 53 4.3.2 The Split-Averaged p-value We tried our approach of split t ing the data with the speed of light measure-ments using the sample variance as a test quantity. We adapt the silver method to estimate the posterior predictive p-value as follows. • Sample Si, i = 1 , . . . , 50, to be used for splitt ing the data into equal halves. • Given the training sample yr, we first draw a random value of a2 ~ Inv — X 2 (32 , s2) as 32s 2 divided by a random draw from the X32 distribution (see Appendix B ) . Having obtained a2, we draw fj, from the conditional posterior distribution, N(yT, cr 2/33). • Simulate a sample of posterior predictive observations yrep from the pos-terior predictive distribution N(fi,<72). • Compute the sample variance of the validation predictive observations Vv • For each split we have the sample variances of the actual validation sample and the replicated validation sample. The scatterplot in Figure (4.2) for the two test quantities shows two clusters, but the estimated p-value is close to | , similar to the Gelman et al . result. However, the scatterplot implies that the model is inadequate for the data, so we should try to justify this. To investigate further, for each split we sampled 200 replicated validation samples, leading to 200 replicated validation variances. These are compared to the sample variance of the actual validation sample. In particular, for each split 54 we can locate the actual validation variance in the histogram of the replicated variances. The histograms show the actual validation variance is either to the far right or the far left of the distribution of the simulated variances, which was expected in the way that the outliers may be split between the training and the validation samples. A n d this implies that the model is inadequate for the data. Rather than looking over al l those histograms, we could try to summarize them in one plot and then in a single number corresponding to the posterior predictive p-value. For each split s, we estimate the split-specific p-value p s (4.1) by p s(4.3). In this example, ps is either zero or one. The histogram of p s ' s shown in Figure (4.3) should look uniformly dis-tributed between zero and one i f the model is adequate. For comparison pur-poses, we generate 66 data points from the standard normal and use the silver method of measuring the split-specific p-value. Based on 50 splits each wi th 200 simulated validation sample variances, the split-specific p-values look uni-formly distributed between zero and one as the histogram shows in Figure (4.3). For the sake of a single p-value which people always look for, we used the x2 test to check how uniform the p\,'s are. To conduct the x2 test, we set up the null hypothesis to be H0:ps~U{0,l). Each histogram is divided into bins (for our purposes, 5 bins to match the histograms), then we compute the frequency in each cell and the expected frequency. For the speed light measurements the p-value was less than 0.01 55 which means reject the null hypothesis, as expected. For the generated stan-dard normal, the p-value was 0.9384 which implies that the p-values follow a uniform distribution. So we could get a single p-value to present the fit of the model. Based on results from this example we showed that the model is inad-equate for the data; however, the Gelman et al . could not prove that without data spli t t ing from the same test quantity that we used. The Gelman et al. approach for calculating the p-value has been cri t i -cized for using the data twice, once to determine the posterior distribution of 6, and then again for replication. For the split-specific p-value, we overcome this problem by split t ing the data into two, so that one part is used in determining the posterior distribution and the other for prediction. The point of model checking is to see i f the current model is good enough. How do you know if a particular predictive p-value is precise and correct? The satisfying answer to this question must certainly include an attempt to quantify the ut i l i ty of the p-value. The speed of light is a simple example so the model inadequacy was obvious, but in more complex models things may not be so clear. In particular, the Gelman et al . method may be very sensitive to how close the test quantity T(-) is to a sufficient statistic. Our method to model checking does not have that problem. It might be best applying this interesting notion of p-value with more complex models where its message fits into the overall picture of Bayesian model checking, by comparing the observed data to what we expect to be observed under the specified posterior distribution. 56 In the next section we discuss the same example with different test function that depends on the parameter 9. Gelman et al. could demonstrate the poor fit of the normal model to the speed of light data, using min(j/j) as the test quantity. They suggested that the model can be inadequate for some purposes but adequate for others. They assessed the adequacy of the model except for the extreme tails, as the normal model clearly does not capture the variation that Newcomb observed. Using the same test quantity, we want to see how our method works and compare wi th the Gelman et al. result. 4.3.3 A Model Check Based on a Test Quantity Sensi-tive to Asymmetry in the Centre of the Distribu-tion Since the model looks adequate except for the extreme tails, Gelman et al . chose a test quantity sensitive to asymmetry in the centre of the distribution, T(y,e) = \ y m - e \ - \ m - e \ , where j/( 6i) and j/(6) are the 61st and 6th order statistics representing approxi-mately the 90% and 10% points of the distribution. We tried the Gelman et al. method, and the resulting scatterplot in Figure (4.4) shows the test quantity for the observed data and the test quantity evaluated for the simulated data for 200 simulations from the posterior distribution of {9,a2). The estimated p-value is 0.24 which is computed as the proportion of points in the upper-left half of the plot. The test quantity should be scattered about zero for a 57 symmetric distribution. As shown in Figure (4.4), the test quantities are more skewed to the positive side. Gelman et al. explain any observed asymmetry in the middle of the distribution by sampling variability. T h e S p l i t - A v e r a g e d p - V a l u e Next, we tried the split-averaged p-value. We chose to split the data 200 times, for each split sampling a replicated validation sample from the posterior distribution of (9, a2), and calculating the test quantity T(yv,9) = \yviai)-9\-\yvw-e\, for the actual observed validation sample and the replicated validation sample. Figure (4.5) shows the scatterplot and the estimated p-value is 0.38, implying the symmetry in the middle of the distribution. We expected to get clusters in the scatterplot to show the inadequacy of the model, as in the case of the sample variance test quantity. We interpreted the results as follows. There are clusters but they may not be visible. Whi l e spli t t ing the data the outliers either go into the validation sample or into the training sample. When the outliers are in the training samples, then the test quantities are more positive as in the Gelman et al. case. This is because the posterior distribution is more effected by the outliers to the left. However, when the outliers are in the validation sample the test quantities are more about zero, since the pos-terior distribution is more in the middle. We tried for each split sampling 200 replicated validation samples, leading to 200 replicated validation and actual validation test quantities. We did not see any clusters in the scatterplots of 10 splits. We thought also that test quantity is quite sensitive to asymmetry 58 in the middle of the distribution which was obvious from the histogram of the speed of light data. The split-averaged p-value shows the symmetry in the centre of the data. 59 Figure 4.1: Histogram of Simon Newcomb's measurements for estimating the speed of light. i 1 1 1 — — — 1 —i -60 -40 -20 0 20 40 speedlight 60 Figure 4.2: Scatterplot showing the sample variances of the actual validation sample and the replicated validation sample. ure 4.3: Histograms of the estimate split-specific p-values for 50 splits. Simon Newcomb's measurements of the speed of light. 0.0 0.2 0.4 0.6 ps-value 0.8 1.0 The generated normal data set. mmm 0.0 0.2 0.4 0.6 ps-value 0.8 1.0 62 Figure 4.4: Scatterplot showing prior and posterior simulations of a test quan-tity T(y, 9) = |y(6i) —9\ — \y^) —9\, based on 200 simulations from the posterior distribution of (9, yrep). Figure 4.5: Scatterplot showing a test quantity T(yv, 0) = | yy ( 3 1 ) — 0\ — \yv(3) ~Q\ for the actual observed validation sample and the replicated validation sample, based on 200 splits. -20 0 20 40 T(yv, theta) 64 C h a p t e r 5 D i s c u s s i o n a n d C o n c l u d i n g R e m a r k s In principle cross-validation can be used in a great variety of situations, be-cause of the lack of assumptions behind the cross-validation method. How-ever, it seems to work best for unstructured data, or for insufficiently specified models. Cross-validation appears to have disadvantages in some classical sit-uations. Da ta from designed experiments are usually highly structured. Dele-tion of a single observation destroys that structure, and therefore increases the computing effort. The use of cross-validation in regression is perhaps more dangerous. For simple models, the values of y corresponding to extreme values of x have disproportionate influence on the fitted parameters (Atk in-son, 1985). When the data point (xi,yi) is omitted, the difference between the predicted value y [ e p and the actual value yi is at least as big as the i-th residual from the model fitted to the full data set. For extreme values Xi the 65 predictions y [ e p are extrapolations, and so the difference can be expected to be larger than those for interior points Xi. Thus the disproportionate influence of the extreme data is increased by cross-validation. To overcome this problem of cross-validation we implemented the re-peated learning-testing methods shown by Burman (1989) to work very well for model selection purposes. We repeatedly split the data randomly into two parts. For each split, estimates are developed based on the data in the training set and then these are tested on the data in the validation set. So the extreme values, if any, are split between the two sets. It seems to be more natural to use predictive distributions by cross-validating the modeling process (for an example see Draper (1995b)). This prescription uses a non-omnibus discrepancy measure that can better highlight which parts of the data set are badly fit by the current model. A n d it avoids the inherent diagnostic overfitting problem. Chapter 2 presented the methodology for measuring the prediction error and called for the Bayesian approach in the context of cross-validation, which was revealed to be a fairly natural way to investigate the prediction error. The examples and comparisons between the proposed methods carried out in Chapter 3 found the bronze method to perform reasonably, in terms of computing and the posterior sampling comparing to the silver method. How-ever, the gold method gives the best estimates with the least sampling variabil-ity. To use the gold standard, the user must be able to compute the posterior expectation E(yrep\s) and variance var(yrep\s) wi th respect to the reference 66 distribution analytically, and that is not always possible depending on the complexity of data set. As for more realistic application, the silver method becomes more practical with the rapid computing developments. Therefore the reliability of the simulated result is important and requires much atten-tion. When the posterior distribution cannot be simulated directly by exact sampling, an indirect simulation method (Markov Chain Monte Carlo) such as the Metropolis algorithm is often used. In the silver methodology, we sam-ple from the posterior distribution and then sample the replicated data set from the posterior replicated distribution given the training set to compute the squared prediction error | | y r e p — y\\- We could try the silver method to es-timate the expectation E(yrep\s) and the var(yrep\s) by sampling many times from the posterior distribution. We thought we may try this procedure in the future. However, this methodology apparently wi l l not differ much from the current silver method in terms of posterior sampling for each split which we try to reduce by implementing the bronze method. The recent rapid development of Bayesian computation allows us to fit more realistic and sophisticated models than previously possible, and thus there is a corresponding need for general methods to assess the fit of these model when classical tests are not applicable. A n example is demonstrated by Gelman, Meng, and Stern (1996). Model checking is always a v i ta l compo-nent of model building. The frequentist approach relies on the clever choice of discrepancy measures that are pivotal and whose distribution under the hypothesized model is known, at least approximately. The Bayesian approach 67 described by Gelman, Meng, and Stern (1996) is more general. The discrep-ancy measures used do not have to be pivotal or to have a known distribution, and so can be chosen to check aspects of the model that cause concern. Gelman, Car l in , Stern, and Rub in (1995) introduced the posterior p-value method to check goodness-of-fit for parameteric Bayesian models. In a sense their method can be criticized for using the data twice, as their method checks how well the data predict themselves under the model. In Chapter 4 we introduced the cross-validation methodology with the split-averaged p-value using the silver method. This involves seeing how one portion of the data predicts another portion, which avoids using the data set twice. In Chapter 4 we illustrated the split-averaged p-value with the example used by Gelman et al. , showing our procedure is simple yet efficient. Gelman et al . used the sample variance to be the discrepancy function, and thus they could not show the poor fit of the model. They accounted for this by noting that this discrepancy function is a sufficient statistic of the model and thus is automatically fitted in the posterior distribution. However, using the same discrepancy function, the split-averaged p-value could succeed in proving the inadequacy of the model, as was obvious from the histogram of the data showing the two clusters in Figure (4.2), resulting from the two outliers. In light of these encouraging results, we would like to try the split-averaged p-value on a more complicated data structures, such as hierarchical models. Finally, although we could not try it in the scope of this thesis, the idea of implementing the Markov Chain Monte Carlo methods ( M C M C ) in the 68 procedure of split t ing the data for each posterior sampling seems promising in the silver method. Referring to the same reference distribution in Chapter 2, we have s = (T, V), yrep, and 9, then the joint distribution p(yrep, 9, T, V) = p(yrep\9, T, V)p(0\T, V)p(T, V), where p(yrep\9,T, V) = p(yrep\9) in principle. A n d p(9\T,V)~c(T,V)L(9\T,V)p(9), where c(T, V) is a normalizing constant. To update (T, V), the target distribution is proportional to c(T, V)p(T, V). Then we face the problem of calculating the constant that normalize the target distribution, making this approach difficult to apply. 69 B i b l i o g r a p h y [1] Akaike, H . (1974). A New Look at Statistical Mode l Identification. IEEE Transactions on Automatic Control, 19, 716-732. [2] Al l i son , T . and D . V . Cicchetti (1976). Sleep in Mammals: Ecological and Constiutional Correlates. Science, 194, 732-734. [3] Atkinson, A . C . (1985). Plots, Transformations and Regression. Oxford University Press, Oxford. [4] Bel in , T . R. and Rubin , D . B . (1995). The Analysis of Repeated-Measures Da ta on Schizophrenic Reaction Times Using Mixture Models. Statistics in Medicine, to Appear. [5] Besag, J . , Green, P., Higdon, D . and Mengersen, K . (1995). Bayesian Computat ion and Stochastic System (Wi th Discussion). Statistical Sci-ence, 10, 10-36. [6] Bowman, A . W . (1984). Jakknife Approximat ion to Bootstrap Estimate. Annals of Statistics, 12, 101-18. 70 [7] Burman, P. (1989). A comparative Study of Ordinary Cross-Validation w-Fold Cross-Validation, and the Repeated Learning-Testing Methods. Biometrika, 76, 503-514. [8] Burman, P. (1989). Est imation of Opt imal Transformations Using u-Fold Cross Validat ion and Repeated Learning-Testing Methods. Sankhya, A 51. [9] Draper, D . (1995b). Discussion of "Model Uncertainty, Da ta Min ing , and Statistical Inference," by C . Chatfield. Journal of Royal Statistical Society, Ser. A, 158, 419-466. [10] Draper, D . (1996). Comment: Ut i l i ty , Sensitivity Analysis, and Cross-Validat ion in Bayesian Model-Checking. Statistica Sinica, 6, 29-35. [11] Efron, B . (1983). Est imating the Error Rate of a Prediction Rule: Im-provment on Cross-Validation. Journal of the American Statistical Asso-ciation, 78, 316-331. [12] Efron, B . (1986). How Biased is the Apparent Error Rate of a Prediction Rule?. Journal of the American Statistical Association, 81, 461-470. [13] Geisser, S. (1975). The Predictive Sample Reuse Method W i t h Appl ica -tions. Journal of the American Statistical Association, 70, 320-328. [14] Gelfand, A . E . , Dey, D . K . and Chang, H . (1992). Model Determination Using Predictive Distributions, W i t h Implementing V i a Sampling-Based Methods. In Bayesian Statistics, 4, 147-167. 71 [15] Gelman, A . , Car l in , J . B . , Stern, H . S. and Rubin , D . B . (1995). Bayesian Data Analysis. New York: Chapman and Ha l l . [16] Gelman, A . , Meng, X . L . and Stern, H . S. (1996). Posterior Predictive As -sessment of Model Fitness V i a Realized Discrepancies (Wi th Discussion). Statistica Sinica, 6. [17] Gi lks , W . R., Richardson, S., Spiegelhalter, D . J . (1996). Strategies for Improving MCMC. In Markov Chain Monte Carlo in Practice. New York: Chapman and Ha l l . [18] Gut tman, I. (1967). The Use of the Concept of a Future Observation in Goodness-of-Fit Problems. Journal of Royal Statistical Society, Ser. B, 29, 83-100. [19] Hardle, W . , Marron, J . S. (1985). Opt imal Bandwidth Selection in Non-parametric Regression Function Est imation. Annals of Statistics, 12, 76-86. [20] Hastings, W . K . (1970). Monte Carlo Sampling Methods Using Markov Chains and their Applications. Biometrika, 57, 97-109. [21] Herzberg, P. A . (1969). The Parameters of Cross-Validation. Monograph Supplement to Psychometrika, 34. [22] Herzberg, G . , and Tsukanov, S. (1986). A Note on Modifications of the Jackknife Cri terion on Model Selection. Utilitas Mathematics, 29, 209-216. 72 [23] Hjorth, J . S. U . (1994). Computer Intensive Statistical Methods: Valida-tion Model Selection and Bootstrap. U K : Chapman and Ha l l . [24] Hwang, J . T. , Casell, G . , Robert, C , Wells, M . T . and Farrell , R. H . (1992). Est imation of Accuracy in Testing. Annals of Statistics, 20, 490-509. [25] Lawless, J . F . (1982). Satatistical Models and Methods for Lifetime Data. New York: John Wi ley & Sons. [26] Mallows, C . L . (1973). Some Comments on Cv. Technometrics, 15, 661-675. [27] Meng, X . L . (1994). Posterior Predictive p-values. Annals of Statistics, 22, 1142-1160. [28] Metroplis, Roseenbuth, Teller and Teller (1953). Equations of State C a l -culations by Fast Computing Machines. Journal of Chemical Physics, 21 , 1087-1092. [29] Neal, M . R . (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report C R G - T R - 9 3 - 1 , Department of C o m -puter Science, University of Toronto. [30] Nelson, W . B . (1970a). Statistical Methods for Accelerated Lifetest Data-the Inverse Power Law Model . General Electric Co. Technical Report 7 1 - C - 0 1 1 . New York: Schenectady. 73 [31] Picard, R . R. , and Cook, R. D . (1984). Cross-Validation of Regression Models. Journal of American Statistical Association, 79, 575-583. [32] Rubin , D . B . (1981). Est imation in Parallel Randomized Experiments. Journal of Educational Statistics, 6, 377-400. [33] Rubin , D . B . (1984). Bayesianly Justifiable and Relevant Frequency C a l -culations for the Appl ied Statisticain. Annals of Statistics, 12, 1151-1172. [34] Shibata, R . (1981). A n Opt imal Selection of Regression Variables. Biometrika, 68, 45-54. [35] Smith, A . F . M . and Roberts, G . 0 . (1994). Bayesian Computat ion V i a the Gibbs Sampler and Related Markov Chain Monte Carlo Methods (Wi th Discussion). Journal of the Royal Statistical Society, Ser. B, 55, 3-23. [36] Smyth, P. (1998). Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood. Technical Report U C I - I C S 98-09, Informa-tion and Computer Science, University of California, Irvine. [37] Stone, M . (1974). Cross-Validatory Choice and Assessment of Statistical Prediction. Journal of Royal Statistical Society, Ser. B, 36, 111-133. [38] Stone, M.(1977). Cross Validation: A Review. Math. Oper. Statist., Ser. Statist, 9, 127-39. 74 [39] Stone, M . (1977a). A n Asymptot ic Equivalence of Choice and Models by Cross-Validation and Akaike's Criterion. Journal of the Royal Statistical Society, Ser. B, 39, 44-47. [40] Shao, J . (1993). Linear Model Selection by Cross-Validation. Journal of American Association, 88, 486-494. [41] Tierney, L . (1994). Markov Chains for Explor ing Posterior Distributions (with Discussion). Annals of Statistics, 22, 1701-1762. [42] Upadhyay, S. K . and Smith, A . F . M . (1993). A Bayesian Approach to Model Comparison in Reliability Via Predictive Simulation. Technical Re-port, Department of Mathematics, Imperial College, London. [43] Weisberg, S. (1985). Applied Linear Regression. Second Edition. New York: John Wi ley & Sons. [44] West, M . (1986). Bayesain Model Monitoring. Journal of Royal Statistical Society, Ser. B, 48, 70-78. [45] Zhang, P. (1993). Mode l Selection V i a Mul t i fo ld Cross-Validation. Annals of Statistics, 21(1) , 299-313. 75 A p p e n d i x A B a y e s i a n A n a l y s i s o f t h e C l a s s i c a l R e g r e s s i o n M o d e l The normal linear model is simple enough that we can determine the poste-rior predictive distribution analytically. In the ordinary linear regression, the observation errors are independent and have equal variance; y\(3,o2~N(Xft,o2I), where X is the n x k matrix of explanatory variables and I is the nxn identity matrix. Prior distribution It is convenient in the normal regression model, to choose the noninfor-mative prior distribution to be uniform on (ft, logo) or, equivalently, p(ft,o2)<xo-2. 76 It gives acceptable results and requires less effort than specifying prior infor-mation in probabilistic terms. However, this is only true when we have a large sample size and a few parameters, but i f we have a small sample size or a large number of parameters, then it is important to specify prior knowledge or perhaps use hierarchical models. The posterior distribution To characterize the joint posterior distribution for 3 and o2 as it is convenient to draw simulations of o and then f3\o2. Firs t we introduce the conditional distribution of the (vector) parameter (3 given cr2 p(P^2\y)=p(PW2,y)P(o2\y) (AA) (3\o2,y~N0,Vp). (A.2) B y completing the square we have 3 = (X'Xy'X'y, (A.3) Vp = (X'X) -1 (A.4) The posterior distribution of a2 from ( A . l ) can be written as which has a scaled inverse-\ 2 a2\y ~ Inv — x2(n — k, s2), (A.5) where 1 (y-XP)'(y-XP). n — k (A.6) 77 The density function of a2 is p(cr2) = Inv — x 2 ( c 2 | n — k,s2), which is useful for variance parameters in normal models. Note that the scaled Inverse-^ 2 is a special case of the inverse-gamma distribution with a = and 0 = ^ s 2 . The posterior predictive distribution Based on the standard results by Gelman et al . (1995), we determine the posterior predictive distribution for the linear regression model. Their approach is to consider first the conditional posterior predictive distribution p(y\cr2,y) where y is the future observable, then to average over the posterior a2\y. The predicted outcome y has a normal distribution given a2 and by averaging over 8 its mean is given by E(y\y) = E(E(y\3,o2,y)\o2,y) = E(X3\o2,y) = x p where X is the matrix of explanatory variables. Similarly, the variance of the future observation y is Var(y\a2,y) = E[var(y\P,a2,y)\a2,y] + var[E(\P,a2,y)\a2,y] = E[a2I\a2,y] + var[XP\a2,y] = (I + XVpX')o2. We determine the posterior predictive distribution by averaging over the posterior predictive distribution of a2. Then the posterior predictive dis-tr ibution p(y\y) is multivariate t wi th center P,n — k degrees of freedom, mean 78 E(y\y), and variance var(y\a2, y). We may extend these results to fit into the case of spli t t ing the data into training and validation samples. Based on the training sample we first compute (3 and Vp, which in practice is easy using any standard linear regression software. Then we derive the mean and the vari-ance of the predictive distribution of the validation sample given the training sample by E(y^\yT) = Xvp, (A.7) Var(yre?\yT) = E(a2\yT){I + XvVpX'v}, (A.8) where E(a2\yx) based on the training set is equal to s2 (A.6). 79 A p p e n d i x B I n v e r s e C h i - s q u a r e The inverse-x 2 is a special case of the gamma distribution, wi th a = v/2 and /3 = | , where v is the degrees of freedom. In addition to the standard parameterization, we also define the scaled inverse chi-square distribution. To obtain a simulation draw 9 from the Inv-x2(v, s2) distribution, first draw X from the xt distribution and then let 9 = vs2/X. 80 


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items