Approximation of the Formal Bayesian ModelComparison using the Extended Conditional PredictiveOrdinate CriterionbyMd Rashedul HoqueM.S. in Applied Statistics, University of Dhaka, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)August 2017c©Md Rashedul Hoque, 2017AbstractThe optimal method for Bayesian model comparison is the formal Bayes factor(BF), according to decision theory. The formal BF is computationally troublesomefor more complex models. If predictive distributions under the competing modelsdo not have a closed form, a cross-validation idea, called the conditional predic-tive ordinate (CPO) criterion can be used. In the cross-validation sense, this is a“leave-out one” approach. CPO can be calculated directly from the Monte Carlo(MC) outputs, and the resulting Bayesian model comparison is called the pseudoBayes factor (PBF). We can get closer to the formal Bayesian model comparisonby increasing the “leave-out size”, and at “leave-out all” we recover the formalBF. But, the MC error increases with increasing “leave-out size”. In this study, weexamine this for linear and logistic regression models.Our study reveals that the Bayesian model comparison can favour a differentmodel for PBF compared to BF when comparing two close linear models. So,larger “leave-out sizes” are preferred which provide result close to the optimal BF.On the other hand, MC samples based formal Bayesian model comparisons arecomputed with more MC error for increasing “leave-out sizes”; this is observedby comparing with the available closed form results. Still, considering a reason-able error, we can use “leave-out size” more than one instead of fixing it at one.These findings can be extended to logistic models where a closed form solution isunavailable.iiLay SummaryThe purpose of the model comparison is to find a useful model among some possi-ble models. There are different model selection methods available in the literature,developed by the frequentist and Bayesian schools. The main goal of this thesisis to examine a model selection method based on cross-validation as an alternativeto the formal Bayesian model comparison. We demonstrate the behavior of thismodel comparison tool for both simple and complex models in the Bayesian con-text. The major finding of the thesis suggests using a general version of the widelyused cross-validation approach, but with a larger “leave-out size” than one, to getcloser to the formal Bayesian model comparison.iiiPrefaceI did the analyses in this thesis (Chapter 3 and Chapter 4) under the supervisionof my supervisor, Professor Paul Gustafson. Professor Gustafson instructed me towork according to the objectives set for the thesis and made corrections to my sim-ulation setup and choice of statistical tool. We have not submitted any manuscriptfor publication from this yet, but we are planning to do soon. The manuscriptwill summarize the major findings of the thesis incorporating concise content fromChapter 2, Chapter 3, and Chapter 4.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Organization of the Thesis Report . . . . . . . . . . . . . . . . . 32 Bayesian Model Comparison . . . . . . . . . . . . . . . . . . . . . . 42.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Classical View of Model Selection . . . . . . . . . . . . . . . . . 42.3 Bayesian View of Model Selection . . . . . . . . . . . . . . . . . 62.3.1 General Version of Bayes Factor . . . . . . . . . . . . . . 62.3.2 Criticisms of Bayes Factor . . . . . . . . . . . . . . . . . 82.3.3 Other Versions of Bayes Factor . . . . . . . . . . . . . . 9v2.4 Predictive Distribution as Comparative Tool . . . . . . . . . . . . 112.4.1 Different Predictive Distribution Based Approaches . . . . 122.4.2 Conditional Predictive Ordinate Criterion and Pseudo BayesFactor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.4.3 Extending the CPO Criterion . . . . . . . . . . . . . . . . 152.4.4 Computation of the Extended CPO Criterion . . . . . . . 162.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Bayesian Model Comparison for Linear Regression Models . . . . . 193.1 Bayesian Linear Regression Models . . . . . . . . . . . . . . . . 203.2 Normal Model with Unknown Mean and Known Variance . . . . 223.2.1 Extended CPO Criterion using Closed Form Posterior Pre-dictive Distributions . . . . . . . . . . . . . . . . . . . . 243.2.2 Extended CPO Criterion using Monte Carlo Samples . . . 253.2.3 Comparative Behavior of Extended CPO Criterion: ClosedForm Versus Monte Carlo Samples . . . . . . . . . . . . 253.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.2.5 Summarizing the Computations . . . . . . . . . . . . . . 333.3 Normal Model with Unknown Mean and Variance . . . . . . . . . 363.3.1 Extended CPO Criterion using Closed Form Posterior Pre-dictive Distributions: Unknown Variance Situation . . . . 393.3.2 Extended CPO Criterion using Monte Carlo Samples: Un-known Variance Situation . . . . . . . . . . . . . . . . . 403.3.3 Comparative Behavior of Extended CPO Criterion for Un-known Variance Situation: Closed Form Versus Monte CarloSamples . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.3.5 Summarizing the Computations for Unknown Variance Sit-uation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514 Bayesian Model Comparison for the Generalized Linear Models . . 544.1 Bayesian Logistic Regression Models . . . . . . . . . . . . . . . 55vi4.2 Extended CPO Criterion using MCMC Samples for Logistic Re-gression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.3 Examining the Behavior of Extended CPO criterion: A PracticalExample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.3.1 Model Specification . . . . . . . . . . . . . . . . . . . . 594.3.2 Prior Specification: Weakly Informative Prior . . . . . . . 604.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.3.4 Summarizing the Computations . . . . . . . . . . . . . . 634.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 Discussions and Conclusions . . . . . . . . . . . . . . . . . . . . . . 675.1 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675.2 Further Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74viiList of TablesTable 2.1 Interpretation of the Bayes factor (BF) values . . . . . . . . . . 8Table 3.1 Root mean squared errors of the estimated log EPBFs for allthree model comparisons . . . . . . . . . . . . . . . . . . . . 34Table 3.2 Root mean squared errors of the estimated log EPBFs for allthree model comparisons in unknown variance situation . . . . 49viiiList of FiguresFigure 3.1 Comparison of closed form results for three models (unknownmean but known variance) . . . . . . . . . . . . . . . . . . . 28Figure 3.2 Comparison of closed form results and MC based results forModel 1 (unknown mean but known variance) . . . . . . . . . 29Figure 3.3 Comparison of Model 1 and Model 2 (closed form results andMC based results for “unknown mean and known variance”situation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Figure 3.4 Comparison of Model 1 and Model 3 (closed form results andMC based results for “unknown mean and known variance”situation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 3.5 Comparison of Model 2 and Model 3 (closed form results andMC based results for “unknown mean and known variance”situation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 3.6 Root mean squared error of the log EPBFs from 2500 MC sam-ples with three cut-offs (unknown mean but known variance) . 35Figure 3.7 Comparison of closed form results for three models (unknownmean and variance) . . . . . . . . . . . . . . . . . . . . . . . 43Figure 3.8 Comparison of closed form results and MC based results forModel 1 (unknown mean and variance) . . . . . . . . . . . . 44Figure 3.9 Comparison of Model 1 and Model 2 (closed form results andMC based results for “unknown mean and variance” situation) 45Figure 3.10 Comparison of Model 1 and Model 3 (closed form results andMC based results for “unknown mean and variance” situation) 46ixFigure 3.11 Comparison of Model 2 and Model 3 (closed form results andMC based results for “unknown mean and variance” situation) 47Figure 3.12 Root mean squared error of the log EPBFs from 2500 MC sam-ples with three cut-offs (unknown mean and variance) . . . . 50Figure 3.13 Root mean squared error of the log EPBFs from 25000 MCsamples with three cut-offs (unknown mean and variance) . . 51Figure 4.1 Comparison of big Model vs. small Model in Bayesian logisticregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62Figure 4.2 Error bar plot of the estimated log EPBFs for Bayesian logisticregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64xAcknowledgmentsAt first, I would like to thank my supervisor Professor Paul Gustafson for his kindsupervision. He not only supervised in this academic work but also guided mein some ups and downs of my study during my stay in UBC. That boosted myinspiration and helped me a lot to concentrate on the research. The faculty membersin the Department of Statistics are very nice and friendly. Especially, I want tothank Professor Lang Wu for his valuable inputs as a second reader.Also, I would like to thank my wife, Fatema Tuz Jhohura for her constantsupport during my writing. She is also a graduate student here in the Departmentof Statistics, UBC. The staff and graduate students are friendly and helped me alot during this thesis. I discussed many ideas and possibilities for this thesis outputwith Joe Watson and Qiong Zhang. Both of them are graduate students in theDepartment of Statistics, and my office mates too. I am grateful to them. I wouldlike to thank all of the graduate students in the Department who shared their ideasabout my research.Finally, I think I am lucky to be a part of the Department of Statistics, UBC.My stay is brief here, but I feel this is my second home. I am grateful to everyonein the Department for their kind and friendly support during my study and research.xiChapter 1IntroductionStatisticians build models to establish some relationship from the available data,and that can be used to predict future data. There might be multiple possibili-ties for building models from an available existing data set. Then, the next taskis to find the most useful model among the possible models using a model selec-tion method. Model selection is very important in Statistics. Both frequentist andBayesian schools have developed many selection methods for model comparisons.This thesis focuses on the Bayesian model selection methods.1.1 Problem StatementSome Bayesian model selection methods are available in the literature for com-paring models (Gelfand and Dey, 1994). Among these, a popular choice is theformal Bayesian comparison which simply uses the Bayes factor. As an optimalmodel comparison tool, the Bayes factor is the desired model selection method inBayesian paradigm. But, calculation of the Bayes factor is not easy for complexmodels. In that case, one can use alternatives to the Bayes factor that are eas-ily computable. One of these alternatives is to use the cross-validation approachbased on the so-called conditional predictive ordinate criterion (discussed in detailin Chapter 2); this can be computed easily from Monte Carlo posterior samples.The cross-validation approach with “leave-out one” is widely used as an approxi-mation of the formal Bayes factor (Geisser and Eddy, 1979), whereas the formal1Bayes factor can be computed mathematically using this cross-validation approachwith “leave-out all”. Mathematically, increasing “leave-out sizes” will allow usto compute more closely to formal Bayesian comparison than the commonly used”leave-out one”. This mathematical closeness comes with a price, the Monte Carloerror.The cross-validation approach requires posterior samples to compute the (ap-proximate) Bayes factors. As the cross-validation approach with “leave-out all”requires more computations, the results come with more Monte Carlo error thanthe “leave-out one”. In general, the Monte Carlo error increases with increasing“leave-out sizes”. So, we have an optimization problem here. We want to exam-ine how the cross-validation approach works to approximate the optimal formalBayesian comparison for some “leave-out sizes” from “leave-out one” to “leave-out all”. Also, if possible we want to find a “leave-out size” at which we have closerto formal Bayesian comparison than the “leave-out one” with only a little increasein Monte Carlo error. According to this problem, we formulate the objectives ofour study in the next section.1.2 ObjectivesAccording to the problem statement, we can break down the objectives of this studyin several stages. We list those below:1. At first, we want to find relevant pieces of literatures if any regarding thisissue.2. Our second objective is to examine the model comparisons for simple models(say based on normal distributions) where Monte Carlo is not needed. Then,our objective is to examine how closely the Monte Carlo samples computethe closed form results for the model comparison tools considered. For thesimple models, we also want to examine how rapidly does Monte Carlo error“kills us” as “leave-out size” increases.3. Our next objective is to examine how bad it is to not do an optimal formalBayesian model comparison. Here our interest is to examine whether it is2common/rare to see a switch in the winning model as “leave-out size” in-creases.4. Our final objective is to find a practical advice for more complex modelswhere we really need to use Markov chain Monte Carlo output. Particularly,we want to examine whether we can look at different “leave-out sizes” andreport Monte Carlo error.In general, our over-arching goal is to examine the cross-validation approach as analternative to the formal Bayesian model comparison. Also, we want to get someinteresting insight about ”leave-out size” (other than 1) for achieving a closer resultto the optimal solution with small Monte Carlo error. Especially, we want to exam-ine this for more complex models when the formal Bayesian model comparisonsare hard to compute.1.3 Organization of the Thesis ReportThe general idea of the problem we are interested in and the objectives of our studyhave been discussed in this Chapter. In Chapter 2 we formulate the problems moremathematically and describe the model selection methods we use in this study.A literature review of the model comparison tools is also given there. Chapter 3focuses on the Bayesian model comparison of linear regression models; the closedform and the Monte Carlo samples based results are documented, and we examinethose according to the objectives. We discuss the Bayesian model comparisons forthe generalized linear models, in particular, logistic regression models, in Chapter4. Overall findings and some concluding remarks are discussed in brief in Chapter5.3Chapter 2Bayesian Model Comparison2.1 IntroductionResearchers collect data on some variables to study the effect of these variableson some outcome of interest. Then the question arises of which variables are im-portant to explain the variation in the outcome. Also, inclusion of interactions be-tween the variables might be an interesting question to the researchers. These aremodel selection problems and statisticians have proposed many approaches to dealwith the issue of model selection. Some popular and well-known model selectionmethods are Akaike information criterion (AIC), Mallows CP, likelihood ratio testsfor nested models, stepwise selection procedures (backward or forward selection),cross-validation, different types of Bayes factors (intrinsic, partial, pseudo, poste-rior), Bayesian information criterion (BIC), and Bayesian model averaging. Thesemethods work in different ways. For example, some of these methods are just al-gorithms for choosing a useful model (e.g. stepwise selection). Other methods arebased on the criteria to judge the quality of a model (e.g. AIC, BIC).2.2 Classical View of Model SelectionWe discuss the classical approach to model selection here. Suppose we want tochoose a model between two parametric models Mi, i= 1,2. These two models aredenoted by the joint density f (y|θ i;Mi) or likelihood L(θ i;y,Mi), i = 1,2. Here,4θ i is a pi×1 parameter vector and y = (y1,y2, . . . ,yn) is a n×1 outcome vector.Classical Neyman-Pearson theory is applied to nested models where modelsare compared pairwise for model selection. Suppose, our hypotheses are Hi: data ycorrespond to the model Mi, i= 1,2. For example, we set H1 as the null hypothesis.Then, the likelihood ratio test can be used to compare these models by specifyingM1 and M2 as the reduced and full model, respectively. The reduced model isnested within the full model. With the estimated parameter vectors θˆ 1 and θˆ 2 fromModels M1 and M2, the test statistic of the likelihood ratio test has the followingform:λn =L(θˆ 1;y,M1)L(θˆ 2;y,M2). (2.1)The null hypothesis H1 can be rejected if λn < c < 1, where 0 < c < 1 is aconstant. Also, under H1,−2logλn has an approximate χ2p2−p1 distribution. Some-times the reduced models are rejected though they are actually true, especiallywhen λn tends to be very small, i.e.,limn→∞Pr(selectM2|M1 True) = Pr(χ2p2−p1 >−2logc)> 0.However, how small turns out to be too small is dependent on the significance levelof the test, that is how much tolerance is considered for the probability of Type Ierror. In general, the likelihood ratio test has a preference on the full model thanthe reduced model at a smaller level of significance.To deal with this problem, many penalization techniques of the log-likelihoodin the form of logL(θˆ i;y,Mi)− k(n, pi) have been proposed so that the largest pe-nalized log-likelihood wins. Here, k(n, pi)> 0 and this is a increasing function ofn and p implying more penalization for the big model than the nested small model.Incorporating this, λn in equation 2.1 is extended to log(λn)+ k(n, p2) − k(n, p1).Many model selection procedures, including AIC and BIC are different versionsof this expression. We need k(n, p2)− k(n, p1)→ ∞ as n→ ∞ for the selection ofthe true underlying model consistently with increasing n. The most common formfor k(n, p) found in the literature is k(n, p) = α p. Akaike (1973) uses values ofα in the interval 1 ≤ α ≤ 2.5 whereas Aitkin (1991) suggests α = log2. How-ever, these approaches produce some inconsistent result. Schwarz et al. (1978)5suggests k(n, p) = (p/2) logn where k depends on n and eliminates inconsistency.Some other versions are suggested by Nelder and Baker (1972), Hannan and Quinn(1979) and Shibata (1980).2.3 Bayesian View of Model SelectionWe describe the choice of a model among a possible set of models in Bayesianparadigm.We begin with the Bayes factor, the formal Bayes approach to compare twomodels. Different versions of the Bayes factor are also available in the literature.One can consider reading Gelfand et al. (1992), Kadane and Lazar (2004), and theirattendant discussions. In addition, implementation of the asymptotic and exactmethods are described with examples in Gelfand and Dey (1994).In a Bayesian model, we need to specify prior p(θ ) in addition to the likeli-hood specification. All the inference is made only from the posterior distributionp(θ |y) ∝ L(θ ;y)× p(θ ). Here θ are the parameters of interest, and L(θ ;y) =f (y|θ ) stands for the likelihood function. In the Bayesian paradigm, different ap-proaches have been proposed for selection of models. Some of these are based onposterior distributions and some others instead focus on predictive distributions.The two model components might be not fixed in a Bayesian model selectionproblem. Sometimes the likelihood L is held fixed, and only the prior p(θ ) is var-ied; this is used to check Bayesian robustness (Berger, 2013) by assessing howsensitive the posterior is due to prior variation. Sometimes, the likelihood L is var-ied. Now, the formal Bayesian model selection procedure follows in the followingsubsection using Bayes factor.2.3.1 General Version of Bayes FactorWe use the same notation as discussed in the classical approaches in section 2.2.The sampling density for model M1 is f (y|θ 1,M1) and the competing model is M2with sampling density f (y|θ 2,M2). There is no need to specify something commonbetween θ 1 and θ 2. For example, f (y|θ 1,M1) might be the density of a gammawith parameters (θ11,θ12) and f (y|θ 2,M2) might be the density of a log-normalwith parameters (θ21,θ22).6Suppose, the prior distributions p1(θ 1) and p2(θ 2) are specified for θ 1 and θ 2.Also, let the prior probability of Mi is wi, i= 1,2, with w2 = 1−w1. We considerthe comparison of M1 versus the alternative M2. Now, a Bernoulli random variableM for the models taking values 0 and 1 is defined and the joint density for thiscomparison can be written as:p(y,θ 1,θ 2,M) = f (y|θ 1,M1) p1(θ 1)w1 I0(M)+ f (y|θ 2,M2) p2(θ 2)w2 I1(M).We compute f (y|Mi), the predictive density for model, Mi, asf (y|Mi) =∫f (y|θ i,Mi) pi(θ i)dθ i. (2.2)Suppose yobs denotes the observed data. Using Bayes theorem, the posterior prob-ability for model M1 can be written asPr(M = 0|yobs) =w1 f (yobs|M1)w1 f (yobs|M1)+w2 f (yobs|M2).Now, the posterior odds of model M1 with respect to model M2 arePr(M = 0|yobs)Pr(M = 1|yobs)=w1 f (yobs|M1)w1 f (yobs|M1)+w2 f (yobs|M2)w2 f (yobs|M2)w1 f (yobs|M1)+w2 f (yobs|M2)(2.3)=w1 f (yobs|M1)w2 f (yobs|M2)=w1w2× f (yobs|M1)f (yobs|M2)= prior odds×BF,where the Bayes factor (of model M1 with respect to model M2) denoted by BF , isexpressed asBF =f (yobs|M1)f (yobs|M2). (2.4)Thus, equation (2.3) demonstrates a relationship between the posterior odds, priorodds w1/w2 and the Bayes factor. Jeffreys (1961) and Pettit and Young (1990) givea scale for interpretation of the Bayes factor (of model M1 with respect to model7Table 2.1: Interpretation of the Bayes factor (BF) valuesValue of Bayes factor Strength of evidence< 100 Negative (supports M2)100 to 101/2 Barely worth mentioning101/2 to 101 Substantial101 to 103/2 Strong103/2 to 102 Very strong> 102 DecisiveM2)[see Table 2.1].The interpretation of the Bayes factor (BF) is straightforward and easily un-derstandable for choosing between two models. Bayes factor does not require thatthe two models being compared are nested. Also, model fitting is not required forcomputing Bayes factor.2.3.2 Criticisms of Bayes FactorThe Bayes factor has some limitations as a model comparison tool. One limitationis related to the specification of a prior distribution. Depending on the locationsof the priors for θ 1 and θ 2, the Bayes factor may lead to change the decision onwhich model is favoured. Even when the priors are proper, the Bayes factors havenon-robustness issues for the prior specification. Also, if the prior distributionp(θ ) is improper as a non-informative specification, then the density function f (y)is improper as well. So, f (y|Mi) cannot be interpreted as the densities of thesemodels which in turn imply the non-interpretability of the Bayes factor ratio.Another limitation is that the well-known Lindley’s paradox (Lindley, 1957)may appear in the presence of an improper prior. Then, according to Lindley’sparadox, it is unlikely that Model M2 will be chosen as the sample size n growslarge. Thus, the Bayes factor has a contradiction with the likelihood ratio testwhich provides way too much support for the model under alternative (i.e. ModelM2). Smith and Spiegelhalter (1980) incorporate the idea of local Bayes factorto overcome the ‘Lindley’s Paradox’ where non-decreasing prior probabilities areassigned to an appropriate local neighbourhood of the parameters θ .82.3.3 Other Versions of Bayes FactorThe automatic Bayesian methods are suggested by some authors for model selec-tion to cope with the improper prior related problems such as Lindley’s paradox,Bayes factor dependency on prior specification and complexities in calculation andinterpretation of the Bayes factor. Note that in the automatic Bayesian methods,users are not required to specify the hyperparameters; rather, there is an algorithmto set the hyperparameters. Berger and Pericchi (1996) and Laud and Ibrahim(1995) argue that the automatic methods are essential as in practice, proper (or sub-jective) prior specification wouldn’t be feasible for a wide range of models that areinitially considered. On the other hand, Lindley (1997) argues that objective priors(reference or non-informative priors are often improper) are not commonly used inpractice; the author also mentions the absence of sensible interpretation for modelselection in the presence of improper priors. However, this controversy of priorspecification continues. To overcome this critical activity, different methodologiesare proposed. In particular, the main reason to do this is to avoid the difficulties ofBayes factor with improper or vague priors.The intrinsic Bayes factor, a version of the Bayes factor is proposed by Bergerand Pericchi (1996). To construct this, the data is needed to divide into two parts.Those parts are regarded as training and test data. Then, to compute the Bayesfactor, one can consider the testing data as the data and the posterior distributionsusing the training data as the prior. With y(l) and y(−l) as a training sample and atest sample respectively, an intrinsic Bayes factor denoted by BFint , for the trainingsample y(l) is defined asBFint(l) =f (y(−l)|y(l),M1)f (y(−l)|y(l),M2). (2.5)Here f (y(−l)|y(l)),Mi, i = 1,2 represents the marginal density of the testing sam-ple. As a popular choice, a minimal training sample is used as a training sample.But, a given data set usually has more than one minimal training sample. In thatcase, one possible option might be to use the arithmetic or geometric averages ofthe intrinsic Bayes factors that are computed using the available minimal trainingsamples of the data. Berger and Pericchi (1996) also discuss some versions of the9intrinsic Bayes factor.The average version of the intrinsic Bayes factor is a bad choice to use fora large data set as we might ended up averaging over many minimal training setsavailable for that data. Sometimes, due to difficulties with minimal training sample,intrinsic Bayes factors fail to discriminate between competing models (O’Hagan,1997). An alternative approach is the fractional Bayes factor (O’Hagan, 1995).To illustrate this approach, let us denote a fraction b as the ratio of the size of thetraining sample (u) to the size of the entire data set (v). The fractional Bayes factor,denoted by BFf rac in this case, can be defined asBFf rac =M1(b,y)M2(b,y), (2.6)whereMi(b,y) =∫f (y|θ i) pi(θ i)dθ i∫f (y|θ i)b pi(θ i)dθ i .It is important to remember that the motivation for the fractional Bayes factor issimply asymptotic (in u and v). O’Hagan (1997) shows that fractional Bayes fac-tors have many similar properties to ordinary Bayes factors, like adherence to thelikelihood principle, and invariance to data transformation, which are not enjoyedby intrinsic Bayes factors.Berger and Pericchi (1998) introduces median intrinsic Bayes factor with twodifferent versions. In the first version, the authors use the median instead of themean (arithmetic or geometric) over the training data. Now, the median intrinsicBayes factor has the formBFMint = median[BFint(l)], (2.7)where BFint(l) is defined in equation (2.5). The second version of the medianintrinsic Bayes factor has the formBFRMint =median[ f (y(−l)|y(l),M1)]median[ f (y(−l)|y(l),M2)]. (2.8)Berger and Pericchi (1998) argues that BFMint and BFRMint are stable compared to the10intrinsic BF, they proposed earlier.The posterior Bayes factor is defined by Aitkin (1991) which uses the the pos-terior distribution pi(θ i|y) as a replacement of the prior distribution pi(θ i) in theBayes factor formulation (equation 2.4). However, this approach has some criti-cisms related to double use of the data and use of posterior as prior. These criti-cisms are discussed by many authors in the discussion of Aitkin (1991).2.4 Predictive Distribution as Comparative ToolIn model selection, some particular forms of predictive distributions have been usedwithin the Bayesian approach for a long time. Box (1980) argues that conditionalon the model adequacy, the posterior distribution is utilized to estimate the modelparameters. On the other hand, the criticisms of the model given the existing datacan be obtained using the predictive distribution (Box, 1980). In addition, thepredictive distributions of two models will be comparable, not the posteriors whileexamining those two models.Many approaches to model selection have been suggested by using predictivecriteria instead of Bayes factors over the years since the 1970’s. The idea of apseudo Bayes factor arises by using cross-validation ideas (Geisser, 1975; Stone,1974). Some predictive ideas are already incorporated in the discussed intrinsicBayes factors and posterior Bayes factors.A predictive density emerges by averaging a likelihood defined in the samplespace with respect to the updated prior based on the data (that is the posterior).Suppose the data y is a collection of conditionally (given θ ) independent univari-ate observations y j, j = 1, . . . ,n. Also, suppose under Model Mi, y j has densityf (y j|θ i,Mi), i = 1,2, and let Jn denote the set {1, . . . ,n}, with S as an arbitrarysubset of Jn. We define the likelihood as:L(θ i;ys,Mi) =n∏j=1f (y j|θ i,Mi)d j ,where indicator function d j = 1 if j∈ S or d j = 0 if j 6∈ S. Similarly as section 2.3.1,let pi(θ i), i = 1,2, be the prior density under Model Mi. Now, the formal condi-tional density can be considered as11f (ys1 |ys2 ,Mi) =∫L(θ i;ys1 ,Mi) pi(θ i|ys2)dθ i (2.9)=∫L(θ i;ys1 ,Mi)L(θ i;ys2 ,Mi) pi(θ i)dθ i∫L(θ i;ys2 ,Mi) pi(θ i)dθ i,where S1 and S2 are arbitrary subsets of Jn. Equation (2.9) represents a predic-tive density; here the joint density of ys1 is averaged over the prior distribution ofθ i updated by the portion the data ys2 . Equation (2.9) represents a general for-mulation of predictive approach for Bayesian model selection. Using differentspecifications for S1 and S2, we can obtain different predictive distributions used inBayesian model selection approaches found in the literature. We discuss some ofthese approaches in next subsections.2.4.1 Different Predictive Distribution Based ApproachesSeveral examples of density (2.9) can be obtained in the literature. All of theseexamples vary in specification of the subsets S1 and S2 of Jn. We list some of theseexamples here, which are discussed by Gelfand and Dey (1994).(i) If S1 = Jn and S2 = φ , then f (ys1 |ys2 ,Mi) becomes the standard marginal densityof the data. In this case, the denominator integral is not considered. Clearly,this produces the general Bayes factor given in equation (2.4).(ii) If S1 = {r} and S2 = Jn−{r}, then a cross-validation density results. Thisleads to the conditional predictive ordinate criterion which eventually pro-duces the pseudo Bayes factor. We discuss this approach in detail in the nextsubsection.(iii) If S1 is considered as a subset of Jn, usually some (> 1) elements of Jn andS2 = Jn−S1, then we have an extended version of (ii) (Pena and Tiao, 1992).For our study, we focus on this extension.(iv) The choice S1 = Jn and S2 = Jn indicate posterior predictive density (Aitkin,1991). Posterior Bayes factor is produced directly by using (iv).12(v) Another choice is to specify S1 = Jn−S2 and S2 = {1, . . . , [ρn]}, where [] rep-resents the greatest integer function. A proportion ρ of the observations n isused for updating prior distribution whereas the observations (1−ρ)×n areused for model selection.(vi) If S1 = Jn− S2 and S2 is nothing but a minimal subset (Berger and Pericchi,1996) with a proper density pi(θ i|ys2), then f (ys1 |ys2 ,Mi) becomes proper.Several versions of intrinsic Bayes factor can be developed from (vi).Among the six different specifications of S1 and S2, (i) and (vi) are quite dif-ferent in terms of asymptotic behavior than the rest of the specifications (ii)− (v).The cardinality of S2 approaches infinity as sample size n→ ∞. We discuss condi-tional predictive ordinate criterion noted in (ii) in detail in the next subsection asthis is our primary interest.2.4.2 Conditional Predictive Ordinate Criterion and Pseudo BayesFactorThe conditional predictive ordinate criterion is obtained by specifying the subsetsof Jn as S1 = {r} and S2 = Jn−{r}. This specification yields the cross-validationdensity f (yr|y−r,Mi) by putting the values of S1 and S2 in density (2.9) wherey−r = (y1,y2, . . . ,yr−1,yr+1, . . . ,yn) (Stone, 1974; Geisser, 1975). According toGeisser (1980), this cross-validation density f (yr|y−r,Mi) is popularly known asthe conditional predictive ordinate (CPO) when evaluated at the observed yobs.Also, Geisser and Eddy (1979) propose the product of these cross-validation den-sities (or Bayesian predictive densities) ∏nr=1 f (yr|y−r,Mi) as a proxy for samplingdensity f (y).It should be noted that ∏nr=1 f (yr|y−r,Mi) is built by treating the yr’s as predic-tively independent conditional on the parameters. This product is a compromisebetween Bayesian and non-Bayesian methods in some ways. Firstly, the productof the conditional predictive densities f (yr|y−r,Mi), r = 1, . . . ,n is used instead ofjoint predictive density for the ease of the computational complexity. Secondly, theconditional predictive density of yr depends on both the y−r and prior distributionof θ i whereas the joint predictive density depends only on prior distribution. In ad-dition, as we discussed earlier, the joint predictive density of y becomes improper if13an improper prior distribution is used. These problems are addressed when insteadwe use the joint cross-validation densities ∏nr=1 f (yr|y−r,Mi).Using the idea of CPO, we can construct the Bayesian model selection tool,pseudo Bayes factor of Model M1 with respect to Model M2 (denoted by PBFhere) as suggested by Geisser and Eddy (1979):PBF =∏nr=1 f (yr|y−r,M1)∏nr=1 f (yr|y−r,M2). (2.10)The log version of the equation (2.10) is used commonly for computational sim-plicity which follows:logPBF =n∑r=1log f (yr|y−r,M1)−n∑r=1log f (yr|y−r,M2). (2.11)The cross-validation densities used to form the pseudo Bayes factor lead to an in-teresting asymptotic approximation. Log pseudo Bayes factor can be approximatedto a quantity that is the summation of the logarithm of likelihood ratio test statisticand a function of the parameters in two competing models for a large sample size(n). We can write as n→ ∞, thenlogPBF ≈ logλn+ p2− p12 .Here λn represents the likelihood ratio test statistic and p1 and p2 are numberof parameters for Model M1 and M2. This asymptotic approximation leads to abridge between the Bayesian (pseudo Bayes factor) and non-Bayesian (likelihoodratio test statistic) methods which strengthen the motivation of using pseudo Bayesfactor as a model comparison tool.CPO criterion can be called as a criterion based on “leave-out one” in thecross-validation sense. Hence, the pseudo Bayes factor is also based on “leave-out one”. We tag these “leave-out one” as in the cross-validation (or predictive)density f (yr|y−r,Mi), yr is conditional on all elements of y except yr (implyingleaving out yr). Only one element of y is considered as leave-out in this approachfor which the conditional cross-validation density is formed. From examples (i)and (ii) of Bayesian predictive distribution based approaches discussed in subsec-14tion 2.4.1, it is clear that, if we consider r = Jn, then Jn− r = φ which leads (ii)to be turned into (i); this implies that with “leave-out all”, pseudo Bayes factor isnothing but the formal Bayes factor. Now what happens in between, i.e., between“leave-out one” and “leave-out all”? To give insight into this matter, we need toextend the idea of CPO criterion for different “leave-out sizes” which is discussedin the next subsection.2.4.3 Extending the CPO CriterionCPO criterion is defined for “leave-out one” case in the cross-validation sensewhich is already discussed in the previous subsection. Now, we discuss the ex-tension of CPO criterion with different leave-out options. That will enable us toexamine the model selection between the formal Bayes factor and the pseudo Bayesfactor.The motivation for examining this is to explore that how well the Bayesianmodel comparison using Bayes factor can be approximated by the extended CPOcriterion with different leave-out options. Suppose we write the log-likelihood ofthe cross-validation densities for CPO for “leave-out one” with data y= (y1, . . . ,yn)and Model Mi, i= 1,2 asm(1) =1nn∑r=1log f (yr|y−r,Mi). (2.12)For two leave-out points r and t, we can definem(2) =12(n2)−1 n∑r<tlog f (yr,t |y−(r,t),Mi), (2.13)where yr,t =(yr,yt) and y−(r,t)=(y1, . . . ,yr−1,yr+1, . . . ,yt−1,yt+1, . . . ,yn). The mul-tiplier(n2)−1 is added to take average over all possible combinations of yr,t andmultiplier 12 is used for adjustment due to two leave-out points. Equation 2.13 canbe re-written as,m(2) =(n2)−1 n∑r<t12{log f (yr|y−(r,t),Mi)+ log f (yt |y−(r),Mi)}.15In general, with k leave-out, the adjustment factor will be 1/k. Following thepattern, for three leave-out points r, t and s we can definem(3) =13(n3)−1 n∑r<t<slog f (yr,t,s|y−(r,t,s),Mi), (2.14)and so on for increasing number of leave-out points up to the sample size n. Ifwe want to compare two models (1 and 2), the difference γk = m(k)1 −m(k)2 has asimilar form to the log Bayes factor; in a cross-validation sense of “leave-out k”,this difference represents how much better/worse model 1 predicts than model 2.We know that k = 1 corresponds to CPO criterion and k = n corresponds toformal Bayesian model comparison using Bayes factors; both of these are alreadydiscussed. The marginal density of the data is obtained from exp(nm(n)). Hence,the log Bayes factor of M1 with respect to M2 is represented by n(m(n)1 −m(n)2 )whereas the log pseudo Bayes factor of M1 with respect to M2 is represented byn(m(1)1 −m(1)2 ). In this study, I wish to examine whether the log of extended pseudoBayes factor (PBF), n(m(k)1 −m(k)2 ), 1 ≤ k < n can be treated as approximation oflog Bayes factor when the real Bayes factor comparisons are desired but hard tocompute.2.4.4 Computation of the Extended CPO CriterionThe CPO criterion is well used as it is easy to compute directly from Monte Carlo(MC) or Markov chain Monte Carlo (MCMC) output. Suppose we know that theelements of the data y are conditionally independent given the parameter vectorθ . Then using MCMC technique, Monte Carlo samples can be obtained fromthe posterior distribution p(θ |y). After that, one can express the cross-validationdensity (for “leave-out one” with k = 1) f (yr|y−r), r = 1, . . . ,n as a function ofposterior mean that is computed from the posterior with all data points (Newtonand Raftery, 1994). Then, we can express f (yr|y−r) asf (yr|y−r) = E{ f (yr|θ )−1|y}−1. (2.15)The right hand side of the equation (2.15) shows the posterior harmonic mean ofthe likelihood, so Raftery et al. (2006) term it the ‘harmonic mean identity’. The16authors suggest to approximate the cross-validation density f (yr|y−r) using thesample harmonic mean of the likelihoods[1BB∑q=11f (yr|θ (q))]−1, (2.16)where θ (1),θ (2), . . . ,θ (B) are the B draws from the posterior distribution f (θ |y).These sample draws may come directly from the output of a standard MC orMCMC implementation.Equation (2.15) is very straightforward and easy to compute which leads us tothe computation of m(1) discussed in subsection 2.4.3. Also, this approach can beextended for k > 1, i.e., for k = 2, k = 3 and so on. For example, for k = 2 we canwrite:f (yr,t |y−(r,t)) = E{[ f (yr|θ ) f (yt |θ )]−1|y}−1, (2.17)and this can be approximated as per equation (2.16):[1BB∑q=1(1f (yr|θ q).1f (yt |θ q))]−1. (2.18)We know that with increasing leave-out size, the model comparison using CPOcriterion (hence the pseudo Bayes factor) becomes closer to the model comparisonfrom formal Bayes factor. Moreover, Raftery et al. (2006) argue that finding formalBayes factor using this approach is hard due to Monte Carlo error. This implies thatthe CPO criterion with a larger leave-out size is more vulnerable to Monte Carloerror than with a smaller leave-out size.2.5 SummaryThere are many approaches to model comparison in both the frequentist and Bayesiancontexts. The Bayes factor is a well-known tool for Bayesian model comparison.But, there are some criticisms for Bayes factor that arise due to prior specifica-tion and other issues. Different versions of Bayes factors have been suggested toaddress these criticisms. Predictive distribution based approaches such as pseudo17Bayes factor based on CPO criterion are also popular in Bayesian model compar-ison setting. From the cross-validation viewpoint, CPO criterion corresponds to“leave-out one” and the extension of CPO criterion (“leave-out all”) leads to for-mal Bayesian model comparison.18Chapter 3Bayesian Model Comparison forLinear Regression ModelsLinear regression models are very common in studying the effect of one or morevariables (explanatory variables) on a variable of interest (response variable). Dif-ferent combinations of explanatory variables lead to different specifications of thelinear regression models. These models can be compared using the model compar-ison techniques described in the previous chapter. Both the classical and Bayesianapproaches of model comparison can be applied. For example, one can use likeli-hood ratio test which is a classical approach. Also, a Bayesian approach, say Bayesfactor can be applied for model comparison after specifying the linear regressionmodel in Bayesian context. In this chapter, we discuss the extended pseudo Bayesfactor (EPBF) utilizing the extended CPO criterion, a compromise between thepseudo Bayes factor and the formal Bayes factor, as a model selection tool for lin-ear regression models. Particularly our interest is to examine how the model selec-tion behaves when we change the “leave-out size” in the extended CPO criterion. Itis well-known that different model comparison tools, for example, likelihood ratiotest, AIC, BIC, Mallow’s CP may not select the same model among a set of candi-date models. For this reason, we hope to observe whether there is any agreementor disagreement in between the pseudo Bayes factors and the formal Bayes factorsas model comparison tool.193.1 Bayesian Linear Regression ModelsThe Bayesian approach can be applied to estimate the parameters of the linearregression model. We start our discussion with defining a linear regression model.Suppose our response variable is y and we have a set of explanatory variables:x1,x2, . . . ,xp. Here, y = (y1, . . . ,yn) is a vector of length n representing responsesfor n observations and each of the xk’s, k = 1, . . . , p are vectors of length n aswell. Hence, the design matrix [of dimension n× (p+ 1)] in this case is X =(1,x1,x2, . . . ,xp). The mean value of the response for the ith individual yi can bedescribed as,E(Yi |β , X ) = β0+β1 xi1+ · · ·+βp xip, i= 1, . . . ,n, (3.1)where xi1,xi2, . . . ,xip are the explanatory values for the ith individual and β =(β0,β1, . . . ,βp)ᵀ are unknown regression parameters. Suppose we denote xᵀi =(1,xi1,xi2, . . . ,xip) as the ith individual’s row vector of explanatory variables. Then,the mean value in equation (3.1) can be re-expressed asE(Yi |β , X ) = xᵀi β .In linear regression setting, one assumption made is that the responses {Yi} areindependent conditional on the values of the parameters and the explanatory vari-ables. Another assumption of equal variance is also made, that is, var(Yi |β , X ) =σ2. Now, the vector of all unknown parameters in this linear regression settingbecomes θ = {β ,σ2}. Also, we assume that the errors εi = yi−E(yi |β , X ), i =1, . . . ,n are independent of one another. The errors εi’s are distributed as normalwith mean 0 and variance σ2.For a vector of n observations y, we can write (in matrix notation):y |β , σ2, X ∼ Nn(Xβ ,σ2I), (3.2)with I as an n× n identity matrix. The vector of responses y has the multivariatenormal distribution of dimension n with mean vector Xβ and variance-covariancematrix σ2I.20From equation (3.2), the likelihood (joint density of y) as a function of β andσ2 is given by,L(θ ;y, X ) = L(β , σ2;y, X ) ∝n∏i=1σ−1 exp{− 12σ2(yi− xᵀi β )2}. (3.3)The likelihood L(θ ;y, X ) is used for the estimation of the parameters θ in classicalsetting. In Bayesian context, we need to specify prior distribution, say g(θ ) for θand then make inference on the θ from the posterior distribution p(θ |y, X ) whichhas the following general formp(θ |y, X ) ∝ L(θ ;y, X ) × g(θ ). (3.4)Different prior specifications, for example, a reference prior or a conjugateprior lead to different posterior distributions. The posterior distribution of the pa-rameters can be broken down into a marginal distribution of σ2 and a conditionaldistribution of β given σ2; this intuition is helpful to make inference for β and σ2respectively.In the Bayesian linear regression setting, one might be interested in predict-ing a future observation y˜ corresponding to a vector of values of the explana-tory variables say x˜. From the equation (3.2) we can say that conditional onθ = (β , σ2), y˜ is distributed as N(x˜ ′θ , σ2). Then, averaging the conditional den-sity of y˜, p(y˜ |θ , x˜) over the posterior distribution of the parameters θ we canobtain the posterior predictive density of y˜, p(y˜ |y) follows:p(y˜ |y, x˜, X ) =∫θp(y˜ |θ , x˜) p(θ |y, X )dθ . (3.5)Having the posterior, and the posterior predictive distribution, the next thing isto compute the extended CPO criterion discussed in the subsection 2.4.3 as a modelselection tool. The posterior predictive distribution formulated in equation (3.5)can be used to obtain the cross-validation densities, and hence to compute the CPOcriterion.Two types of unknown parameters to be estimated are included in θ : parame-ters for the location (i.e., β to calculate mean of the responses) and parameter σ2 to21calculate the variance of the responses. For simplicity, we initially assume that thevariance parameter σ2 is known. Now, θ reduces to β . We refer to this situationas “Normal model with unknown mean and known variance”. At first, we examinethe behavior of model comparisons of such models using the extended CPO crite-rion with different “leave-out sizes” which is discussed in section 3.2. We discussthis as a building block to understand the general situation with both the β and σ2unknown; this situation can be termed as “Normal model with unknown mean andvariance.” Behavior of model comparisons using the extended CPO criterion withboth the β and σ2 unknown is discussed later in section 3.3.3.2 Normal Model with Unknown Mean and KnownVarianceIn this section, we discuss the extended CPO criterion applied to linear regressionmodels with known variance of the responses. The likelihood of the responsesgiven in equation (3.3) can be re-expressed in matrix notation asL(β ;y, X , σ2) ∝ exp{− 12σ2(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )}. (3.6)Now, a prior distribution for β is needed to commence the Bayesian inferencethrough constructing the posterior distribution of β . The distribution of y is multi-variate normal, and from the equation (3.6), we see that the β plays the same rolein the exponent looks like y; this gives an intuition that the multivariate normalprior distribution for β is conjugate. Hence, let us consider the prior distributionof β , p(β ) as multivariate normal (of dimension p+ 1) with mean vector β 0 andvariance-covariance matrix Σ0. The posterior distribution of β has the followingexpression:p(β |y, X , σ2) ∝L(β ;y, X , σ2)× p(β ) (3.7)∝exp{− 12σ2(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )}× exp{−12(β ᵀΣ−10 β −2β ᵀΣ−10 β 0+β ᵀ0Σ−10 β 0)}.22After simplification, we can writep(β |y, X , σ2) ∝exp{−12[β ᵀ(Σ−10 +X ᵀXσ2)β −2β ᵀ(Σ−10 β 0+X ᵀyσ2)]}(3.8)=exp{−12(β −µ β )ᵀV −1β (β −µ β )},which is proportional to a multivariate normal density, with mean vectorµ β = E[β |y, X , σ2] =(Σ−10 +X ᵀXσ2)−1(Σ−10 β 0+X ᵀyσ2), (3.9)and variance-covariance matrixV β = Var[β |y, X , σ2] =(Σ−10 +X ᵀXσ2)−1. (3.10)From the formula of posterior mean E[β |y, X , σ2] in equation (3.9), it is evi-dent that if the prior variance-covariance matrix Σ0 has elements with large magni-tude (that is the precision matrix Σ−10 has elements with small magnitude), then theposterior mean approximately equals the least square estimate of β : (X ᵀX )−1X ᵀy.Alternatively, if the variance of responses σ2 is very large, then the expectation isapproximately equals β 0, the prior mean.Since the posterior distribution of β is multivariate normal, and the samplingdistributions of the yi’s are also normal, in this setting the predictive posterior dis-tribution of a future observation, say y˜, will be normal as well. So, to compute theextended CPO criterion, we can directly use the closed form densities of the pre-dictive posterior distribution as the cross-validation densities. Alternatively, if wepretend that the predictive posterior formulation has no closed form solution, onecan approximate the cross-validation densities using the Monte Carlo (MC) sam-ples from the posterior distribution of β . After that, the extended CPO criterioncan be computed as discussed in subsection 2.4.4. In general, having closed formresults, there is no need to use the MC based results. But, here our purpose is toexamine how well the MC based results approximate the closed form results sincein more difficult problems we must rely on such MC results. Hence, we use both23the approaches discussed in next two subsections to compute cross-validation den-sities and hence the extended CPO criterion or the extended pseudo Bayes factor(PBF) as a model comparison tool.3.2.1 Extended CPO Criterion using Closed Form PosteriorPredictive DistributionsThe posterior predictive distribution of a new observation y˜, p(y˜ |y) can be readilyobtained using the equation (3.5). As discussed in section 3.1, conditional on β , y˜is distributed as normal with mean x˜′β and variance σ2. Also, the posterior distri-bution of β is multivariate normal with mean vector µ β and variance-covariancematrix V β . Now, p(y˜ |y) has the following expression:p(y˜ |y) =∫βφ(y˜; x˜′β , σ2)φp+1(β ;µ β ,V β )dβ , (3.11)where φ(y˜; x˜ ′β , σ2) and φp+1(β ;µ β ,V β ) are the corresponding normal densitiesfor N(x˜′β , σ2) and Np+1(µ β ,V β ). Simplifying the results, we can show that y˜ |yis distributed as normal with mean x˜′µ β and variance x˜′V β x˜+σ2. Such univariateresults can be extended to multivariate ones when we want to find the predictivedistribution of a vector of new observations.A cross-validation density with “leave-out one”, say f (yr |y−r), r = 1, . . . ,nwith y−r = {y1, . . . ,yr−1,yr+1, . . . ,yn} can be computed from the posterior predic-tive formulation (3.11) by setting y˜ = yr and y = y−r; this is the CPO criterion.Then, as described in subsection 2.4.3, the log-likelihood of these cross-validationdensities (or CPO’s), denoted by m(1), can be computed for two models. The dif-ference of m(1) in two competing models multiplied by the number of observationsn is known as the log pseudo Bayes factor. Using the value of log pseudo Bayesfactor (PBF), one can decide on one model over the other.Similarly, cross-validation densities with “leave-out size” greater than one, canbe computed by substituting y˜ by the leave-out elements in the posterior predictiveformulation (3.11). Accordingly, we can compute the log extended pseudo Bayesfactor based on this extended CPO criterion and then compare the competing twomodels. The procedure is already discussed in subsection 2.4.3.243.2.2 Extended CPO Criterion using Monte Carlo SamplesIn general, how the MC samples from a posterior distribution can be used directlyto compute the CPO criterion is discussed in subsection 2.4.4. Suppose we have Bdraws of β , say β (1), . . . ,β (B) from the posterior distribution of β , p(β |y, X , σ2)specified by equation (3.8). The CPO criterion or cross-validation density with“leave-out one” can be approximated using (2.16). Log PBFs can be computedusing the approximated cross-validation densities to compare two linear regressionmodels in Bayesian context. Cross-validation densities in extended CPO criterionwith “leave-out two” can be approximated by equation (2.18). Similarly, cross-validation densities for different “leave-out sizes” can be approximated; log EPBFscomputed using these approximated cross-validation densities can then be used asto compare linear regression models.We discuss the comparative behavior of the extended CPO criterion or theEPBF in the next subsection for “Normal model with unknown mean and knownvariance” setting with different “leave-out sizes”. Both the cases are considered:when the extended CPO criterion is computed as a closed form solution using thepredictive posterior distribution, and when the extended CPO criterion is approxi-mated using the MC samples from the posterior distribution of the linear regressionparameters.3.2.3 Comparative Behavior of Extended CPO Criterion: ClosedForm Versus Monte Carlo SamplesThe unknown regression parameters β correspond to the mean of the responseswhereas the variance of the responses σ2 is known in “Normal model with un-known mean and known variance” which is already discussed in section (3.2).Here, we examine how the extended CPO criterion, a model comparison tool be-haves in real life scenario for both the situations: closed form and MC based solu-tion with an illustrating example. We use a data file here named automobile thatis taken from the UCI Machine Learning Repository (Lichman, 2013). This data isoriginally extracted from the 1985 Ward’s Automotive Yearbook.Among the three types of entities contained in the original data file, we consideronly the specification of an automobile in terms of various characteristics, say25length of the automobiles, height of the automobiles, etc. Particularly, we con-sider the following five explanatory variables regarding the characteristics of theautomobiles:• length,• width,• height,• compression ratio, and• horsepowerwith log(price) as our response variable. We want to examine whether the log(price)of the automobiles depend on the listed characteristics of the automobiles. Thereare in total n = 195 non-missing observations for this automobile data. Wespecify the models and priors for the regression parameters β below.Model SpecificationThree different models are considered; these models denoted by Model 1, Model2, and Model 3 have different combinations of the explanatory variables. Thesethree models have the following specifications.1. Model 1: Full Model (Model with all explanatory variables listed above)with the formulationl pricei =β0+β1 lengthi+β2widthi+β3 heighti+β4 comp.ratioi+β5 horsepoweri+ εi,where i = 1,2, . . . ,195, l price denotes log(price) and comp.ratio indicatescompression ratio.2. Model 2: Model which leaves only height out of the full Model.3. Model 3: Model which leaves height and compression ratio out of the fullModel.26We consider 2500 MC samples from the posterior distribution of the regressionparameters β (see equation (3.8)) for the MC based calculation of the extendedCPO criterion. We divide these into 10 batches of equal size 250. We consider thelog extended CPO and log extended pseudo Bayes factor calculation for each ofthese batches. Also, we compute the closed form results for CPO. To assess howwell the MC based calculation approximates the exact answer we examine the de-viation of the results found in 10 separate MC results from the closed form result.All these are done for different “leave-out sizes” (from “leave-out 1” to “leave-outall”). We consider 15 different “leave-out sizes”; these are 1, 5, 10, 30, 50, 80,100, 120, 150, 170, 175, 180, 185, 190, and 195 (i.e., all). These sizes are takenarbitrarily with more sizes near the “leave-out all”. In addition, while taking thecombinations of MC samples required for extended CPO calculation with differ-ent “leave-out sizes”, we take all combinations if the total combinations are lessthan 2000 and take only 2000 combinations randomly if the total combinations ex-ceed 2000 (as the number of combinations increases drastically/exponentially with“leave-out size”).Prior SpecificationFor the regression parameters β , we consider multivariate normal prior with meanvector 0 and variance-covariance matrix identity matrix I. We run a simple lin-ear regression with the response and explanatory variables discussed above andobserve that the linear regression coefficients are within ±2 which makes the con-sidered mean vector for prior reasonable. If the regression coefficients are biggerthan ±2, then this prior mean specification might not be useful.We use the model and prior setup discussed above to obtain the results de-scribed in the next subsection. We describe our findings for the three modelsconsidered. We have three possible pairwise comparisons of these three models:Model 1 versus Model 2, Model 1 versus Model 3, and Model 2 versus model 3.At first, we describe the behavior of the extended CPO criterion for the three mod-els separately and then extend to three model comparisons using the EPBF whichrelies on the extended CPO criterion.273.2.4 ResultsFirst we visualize the closed form results of log extended CPO values obtainedfrom Model 1, Model 2 and Model 3 (see Figure 3.1). The purpose of this visu-alization is to display the log extended CPO values for different “leave-out sizes”(smaller to larger) for all three models at the same time. We can observe the pat-tern of the log extended CPO values over the increasing “leave-out sizes” from theFigure 3.1.−1.10−1.05−1.00−0.95leave−out sizelog extended CPO1 10 30 50 80 100 120 150 170 185Model 1Model 2Model 3Figure 3.1: Comparison of closed form results for three models (unknownmean but known variance)All three models show quite a similar pattern with decreasing log extendedCPO values over the increasing leave-out size. We observe a smaller decrease inlog extended CPO values up to “leave-out 170” but a sharp decrease in an expo-nential manner after “leave-out 170”. All three models have close log extendedCPO values, but the distance between the log extended CPO values increases withincreasing “leave-out sizes”, and the difference is clearly visible at “leave-out all”.The relative position of the three models remains the same at each of the “leave-out sizes”. This indicates the consistent pattern of these three models over different“leave-out sizes”. The similarity of the closed form log extended CPO values forthree models can be concluded from Figure 3.1 except the “leave-out all” point.28Such close models are interesting to examine the model comparison behaviorover different “leave-out sizes”. In closed form result, we have consistent patternfor these three models at all “leave-out sizes”; but this may not be the case whenwe must compute using MC samples. The bias related to the results obtained fromthe MC samples increases with the increasing leave-out points. This bias is an im-portant issue to investigate further. We need to compare the closed form results andresults based on MC samples to examine whether MC based results can replicatethe closed form results or there is a bias that distort the MC based results from theclosed form results. This will give us an intuition for the future when we have onlyMC based results due to unavailability of the closed form results. Now, for theModel 1 we observe the log extended CPO values from the closed form solutionsas well as from the MC samples with the setup described above in the Figure 3.2.The green filled circles represent the closed form values at each “leave-out size”.−1.10−1.05−1.00−0.95−0.90Model 1leave−out sizelog extended CPO1 10 30 50 80 100 120 150 170 185Figure 3.2: Comparison of closed form results and MC based results forModel 1 (unknown mean but known variance)The MC based results (for 10 batches here) match the corresponding closedform result up to “leave-out 120” (see Figure 3.2). The visible changes betweenthe results from MC samples and closed forms are observed at “leave-out 150” andhigher “leave-out sizes”. Compared to nearer “leave-out sizes”, at “leave-out all”we see a sharp decrease in the log extended CPO value for the closed form result29which is not the case for MC samples. Clearly, MC based results show an upwardbias from the closed form result. This is not only the case for Model 1 as we ob-serve the same pattern for the other two models considered here. The closed formresult demonstrates the true values, and we observe substantive positive departureof MC based results from these values at higher “leave-out sizes”.Now we compare the three models pairwise. First compare Model 1 withModel 2. For model comparison, we use the extended pseudo Bayes factor whichcan be computed from the calculated extended CPO for the competing models. Fig-ure 3.3 describes this comparison visually. The red filled circles denote the closedform results whereas the 10 black circles represent MC results from 10 batches ateach “leave-out size”. We use the same specification for Figure 3.4 and Figure 3.5.−3−2−1012Model 1 versus Model 2leave−out sizelog EPBF1 10 30 50 80 100 120 150 170 185Figure 3.3: Comparison of Model 1 and Model 2 (closed form results andMC based results for “unknown mean and known variance” situation)We observe some interesting results from Figure 3.3. As we have negative logEPBFs at all “leave-out sizes”, the closed form log EPBFs suggest the choice ofModel 2 over Model 1 (hence the same choice). The strength of evidence increaseswith increasing “leave-out sizes”. MC based results vary over the direction withincreasing “leave-out sizes”. All MC based results from 10 batches yield the samechoice of model (Model 2 over Model 1) for “leave-out sizes” less than 80 for thisparticular data. But, from “leave-out size” 80, the choices of the model fluctuate30for the MC based results obtained from 10 batches. This fluctuation is due to theMC error. We need to quantify the MC error to examine how this increases forincreasing ‘leave-out sizes”.For the smaller “leave-out sizes” (for example, 1, 5, 10) the closed form resultsare near the center of the MC based results. For the larger “leave-out sizes” say“leave-out 100” or more, the MC based results from most of the 10 batches havea tend to have higher values than the corresponding closed form results, Hencefrom Figure 3.3, it is clearly observed that the MC based results become positivelybiased and more variable as the “leave-out size” increases. We have the interestin whether it is possible to find a “leave-out size” where the EPBF is close to theformal Bayesian comparison with smaller MC error which is discussed later.Moreover, at “leave-out all”, the EPBF becomes a version of the formal Bayesiancomparison, that is, the Bayes factor. In Figure 3.3, the closed form result at “leave-out all” indicates the value of the formal Bayes factor. MC based results from all 10batches show positive bias from the formal Bayes factor value which implies thatwe are very poorly computing real Bayesian model comparison with some positiveMC errors for MC based results.Now we check the two other possible model comparisons: Model 1 versusModel 3 and Model 2 versus Model 3. Closed form and MC based log EPBFs forModel 1 versus Model 3 are displayed in Figure 3.4.Figure 3.4 displays the similar pattern as Figure 3.3. Here we compare Model1 with Model 3. As the comparison between Model 1 and Model 2, the closedform results once again indicate the same choice of model over different “leave-outsizes”, and the evidence is stronger with increasing “leave-out sizes”. Interestingly,the change in log EPBFs between two different “leave-out sizes” is steeper forcomparison of Model 1 versus Model 3 than the comparison of Model 1 versusModel 2. Based on the closed form results, Figure 3.4 suggests the choice ofModel 3 over Model 1.As with the comparison between Model 1 and Model 2, the closed form resultsare near the center of the MC based results from 10 batches for smaller “leave-outsizes” for the comparison between Model 1 and Model 3. Figure 3.4 shows thatthe MC based results become positively biased and more variable as the “leave-outsize” increases which is similar to Figure 3.3. The only distinction with the pre-31−6−4−20Model 1 versus Model 3leave−out sizelog EPBF1 10 30 50 80 100 120 150 170 185Figure 3.4: Comparison of Model 1 and Model 3 (closed form results andMC based results for “unknown mean and known variance” situation)vious comparison is that from “leave-out 120”, the choices of the model fluctuatefor the MC based results obtained from 10 batches, whereas this fluctuation startsfrom the “leave-out 80” in the comparison between Model 1 and Model 2.Now, we move to the last comparison: Model 2 versus Model 3. The closedform and MC based results for this comparison are displayed in Figure 3.5. Like theprevious two comparisons, the closed form results in this comparison indicate thesame choice of the model over different “leave-out sizes”. Similarly, the strengthof evidence increases with increasing “leave-out sizes”. The closed form resultsdisplayed in Figure 3.5 suggest the choice of Model 3 over Model 2 as we havenegative log EPBF values at all “leave-out sizes” for this comparison. All MCbased results from 10 batches yield the same choice of model (Model 3 over Model2) for “leave-out sizes” less than 100, and start to fluctuate from “leave-out size”100 and higher. The same behavior of the MC based results at “leave-out sizes”is observed in Figure 3.5 as in Figures 3.3 and 3.4. Once again, MC based resultsbecome positively biased and more variable as the “leave-out size” increases mightbe due to MC error.In summary, all three model comparisons show us the same behavior of theEPBF (computed from extended CPO criterion) as a model selection tool. The32−4−3−2−101Model 2 versus Model 3leave−out sizelog EPBF1 10 30 50 80 100 120 150 170 185Figure 3.5: Comparison of Model 2 and Model 3 (closed form results andMC based results for “unknown mean and known variance” situation)close form results choose the model with a small number of parameters in all threecomparisons. Also, for closed form solutions, with larger “leave-out sizes”, theEPBFs show larger difference among the competing models. MC based results de-viate from the closed form results in all three comparisons in an increasing patternwith the increasing “leave-out sizes”. The deviation is due to the MC error, and wetry to compute the contribution of MC error at different “leave-out sizes” which wediscuss in the following subsection.3.2.5 Summarizing the ComputationsFrom the discussion in the previous subsection, we came to know the variation ofthe MC based results at different “leave-out sizes”. We compute the root meansquared error (RMSE) of the MC based estimates of the log EPBFs at all consid-ered “leave-out sizes” to get a measurement of the variation from the closed formresults. Note that, RMSE is a way of measuring how good the MC based estimatesof the log EPBFs compared to the closed form log EPBF. The smaller the RMSE,the better way the MC based results are behaving in general. The RMSE of theestimated log EPBFs for all three model comparisons are tabulated in Table 3.1.33Table 3.1: Root mean squared errors of the estimated log EPBFs for all threemodel comparisonsLeave-out size Model 1 vs. Model 2 Model 1 vs. Model 3 Model 2 vs. Model 31 0.178 0.166 0.1535 0.190 0.136 0.14210 0.103 0.143 0.09230 0.161 0.202 0.12150 0.203 0.279 0.13780 0.303 0.351 0.166100 0.385 0.755 0.558120 0.396 0.626 0.710150 0.800 0.862 1.099170 1.130 1.291 1.126175 0.879 1.391 0.711180 1.088 1.689 0.915185 1.545 2.527 1.615190 1.374 4.369 3.474all 2.274 6.292 4.182Now, we can interpret the RMSE values from Table 3.1. For example, forcomparison of Model 1 with Model 2 at ”leave-out 100” the RMSE of the esti-mated log EPBFs is 0.385 which implies that the model selections while using MCbased EPBFs instead of the corresponding closed form are erroneous by a factorof exp(0.385) = 1.47 (which implies (exp(0.385)− 1)× 100 = 47 percent erro-neous model comparisons by the MC samples). Similarly, while comparing Model1 with Model 3, at ”leave-out 120” the RMSE of the estimated log EPBFs is 0.626which leads to 87 percent of the erroneous decision on model comparisons usingMC based EPBFs relative to the corresponding closed form. We can interpret allother RMSE values in a similar fashion.For all three comparisons, the RMSE values of the estimated log EPBFs fromMC samples have the same pattern; the RMSE values increase with the increasing“leave-out size” and these increase drastically at higher “leave-out sizes”. We canuse several cut-off values for the RMSE values to examine the level of error inmodel comparison decision while using the MC based EPBFs relative to closedform counterparts at different “leave-out sizes”. The RMSE values from Table 3.134are in logarithmic scale, and we consider three cut-off values for these values: log1.25, log 1.5, and log 2 which correspond to 25, 50 and 100 percent erroneousmodel selection when using the log EPBFs estimated from MC samples instead ofthe closed form values. The comparative RMSE values for the MC based EPBFsat the considered “leave-out sizes” with vertical lines through the cut-off values areplotted in Figure 3.6.01234RMSE for three model comparisonsleave−out sizeRMSE of log EPBF1 10 30 50 80 100 120 150 170 185Model 1 vs. Model 2Model 1 vs. Model 3Model 2 vs. Model 3log 1.25log 1.5log 2Figure 3.6: Root mean squared error of the log EPBFs from 2500 MC sam-ples with three cut-offs (unknown mean but known variance)If we consider the log 1.25 as the cut-off value, then from Figure 3.6 we observethat up to “leave-out 30” the RMSE values lie below the cut-off value for all threecomparisons. The RMSE values are on the both sides of the cut-off value between“leave-out 50” and “leave-out 80” whereas all the RMSE values exceed the cut-off value at any “leave-out size” greater than or equal to 100. Similarly, up to“leave-out 80” and “leave-out 120”, the RMSE values for all comparisons are lessthan or equal to the cut-off values log 1.5 and log 2 respectively. Also, the RMSEvalues for all comparisons are greater than the cut-off values log 1.5 and log 2 at“leave-out size” 150 and higher.35To reduce the errors in model selection via model comparisons using MC sam-ples we need to increase either the number of batches or the MC samples as per theMC rule. For our analysis, we have 2500 MC samples and 10 equal sized batcheswith 250 MC samples. As we are computing RMSE values in logarithmic scale, ifwe want error reduction by a factor k compared to the current level x, then we needto increase the current MC sample size by a factor of (ln(x×k)/ ln(k))2. For exam-ple, according to MC rule, at least 2500×(ln(6)/ ln(2))2 = 16705 MC samples areneeded if we want error reduction in model selection by a factor of 3 from the MCbased results at cut-off point log(2). In other words, the model comparison resultswithin RMSE value = log2 (or within 100% relative error) for 2500 MC samples,will be within RMSE value = log4/3 (or within 33% relative error) for 16705 MCsamples. Then, the relative error reduces to one third with the increased MC sam-ples. So, more MC samples will provide a correct model comparisons based on theMC samples.3.3 Normal Model with Unknown Mean and VarianceWe discuss the extended CPO criterion applied to linear regression models withunknown mean and variance of the responses in this section. Compared to modelsdiscussed in the previous section, we have an additional unknown parameter forthe variance of the responses. Accommodating this, now the likelihood of theresponses (3.6) have the following expression:L(β , σ2;y, X ) ∝ σ−n exp{− 12σ2(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )}. (3.12)Having specified the likelihood function, the next step is to specify the prior dis-tribution for β and σ2. We have already learned from the section 3.2 that if σ2is known, the multivariate normal prior distribution for β is conjugate. If σ2 ∼inverse-gamma (a,b), that isp(σ2) =baΓ(a)(1σ2)a+1exp(− bσ2), σ2 > 0,36then we can write the posterior distribution of σ2 with β known asp(σ2 |y, X , β ) ∝ p(σ2)L(σ2;y, X , β ,)∝(1σ2)a+1exp(− bσ2)σ−n× exp{− 12σ2(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )}=(1σ2)a+ n2+1exp{− 1σ2[b+12(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )]} ,which is simply an inverse-gamma density, so that the conjugate prior for σ2 is:{σ2 |y, X , β } ∼ IG(a− n2,[b+12(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )]) .Now, we can factorize the joint conjugate prior of β and σ2 asp(β , σ2) = p(β |σ2) p(σ2) = N(µ β ,σ2V β ) × IG(a,b) = NIG(µ β ,V β ,a,b)(3.13)=ba(2pi)(p+1)/2|V β |1/2Γ(a)(1σ2)a+1+ (p+1)2× exp[− 1σ2{b+12(β −µ β )V −1β (β −µ β )}](3.14)∝(1σ2)a+ (p+1)2 +1× exp[− 1σ2{b+12(β −µ β )V −1β (β −µ β )}],where a,b > 0, and Γ(·) denotes the Gamma function. This prior is called thenormal-inverse-gamma prior and can be denoted as NIG(µ β ,V β ,a,b) (Banerjee,2008).With the likelihood (3.12) and joint prior distribution (3.13), the joint posterior37distribution of (β ,σ2) has the following expressionp(β , σ2 |y, X ) ∝σ−n exp{− 12σ2(yᵀy−2β ᵀX ᵀy+β ᵀX ᵀXβ )}(3.15)×(1σ2)a+ (p+1)2 +1× exp[− 1σ2{b+12(β −µ β )V −1β (β −µ β )}].As suggested by Banerjee (2008), to derive the joint posterior distributionp(β , σ2 |y, X ), we use the multivariate completion of squares identity with a sym-metric positive definite matrix D:uᵀDu−2α ᵀu = (u−D−1α )ᵀD(u−D−1α )−α ᵀD−1α . (3.16)An application of the identity (3.16) gives,1σ2[b+12{(β −µ β )V −1β (β −µ β )+(yᵀy−2βᵀX ᵀy+β ᵀX ᵀXβ )}]=1σ2[b∗+12(β −µ ∗)ᵀV ∗−1(β −µ ∗)].Using this, p(β , σ2 |y, X ) in (3.15) can be re-written asp(β , σ2 |y, X )∝(1σ2)a∗+ (p+1)2 +1× exp[− 1σ2{b∗+12(β −µ ∗)ᵀV ∗−1(β −µ ∗)}],(3.17)which can be identified as a NIG(µ ∗,V ∗−1,a∗,b∗) withµ ∗ = (V ∗−1β +XᵀX )−1 (V −1β µ β +Xᵀy)V ∗ = (V −1β +XᵀX )−1a∗ = a+n/2b∗ = b+12[µ ᵀβV−1β µ β + yᵀy−µ ∗ᵀV ∗−1µ ∗].Having the joint posterior distribution of β and σ2 as NIG, and the samplingdistributions of the yi’s as normal, the predictive posterior distribution of any futureobservations can be obtained as a closed form solution after some calculations.38Hence, as the section 3.2, to compute the extended CPO criterion, we can directlyuse the closed form densities of the predictive posterior distribution as the cross-validation densities. Similarly, pretending the predictive posterior formulation hasno closed form solution, one can approximate the cross-validation densities usingthe Monte Carlo (MC) samples from the joint posterior distribution of β and σ2 andthen extended CPO criterion can be computed as discussed in the previous section.Once again, we use both the closed form and MC samples based approaches, andthese are discussed in next two subsequent subsections to compute the extendedpseudo Bayes factor (PBF) as a model comparison tool.3.3.1 Extended CPO Criterion using Closed Form PosteriorPredictive Distributions: Unknown Variance SituationSuppose we want to predict the outcome y˜ for a future t× (p+1) matrix of regres-sors X˜ . These y˜ are independent of y, and given β and σ2 known, we can write thesampling distribution of y˜: y˜ ∼ N(X˜β ,σ2It). Now, the predictive posterior distri-bution of y˜, that is p(y˜, |y, X˜ ) can be obtained using the joint posterior distributionof β and σ2 asp(y˜ |y, X˜ ) =∫β ,σ2φt(y˜; X˜β ,σ2It),×φNIG(p+1)(β ;µ ∗,V ∗−1,a∗,b∗)dβ dσ2 (3.18)=MVSt2a∗(X˜ µ ∗,b∗a∗(I+ X˜V ∗X˜ ᵀ)),where φt(y˜; X˜β ,σ2It) and φNIG(p+1)(β ;µ∗,V ∗−1,a∗,b∗) are the corresponding normaland normal-inverse-gamma densities for N(X˜β ,σ2It) and NIG(µ ∗,V ∗−1,a∗,b∗)respectively. The final expression follows from the marginal distribution of ywith NIG prior distribution for (β ,σ2) which utilizes the well-known Sherman-Woodbury-Morrison identity and some matrix identities.As discussed for the “Normal model with unknown mean and known vari-ance” situation in section 3.2, the cross-validation density with “leave-out one”,say f (yr |y−r), r = 1, . . . ,n can be computed directly from the posterior predictiveformulation (3.18) by setting y˜ = yr and y = y−r. Then, the log PBF can be com-puted for a comparison of two available models using the CPO criterion. Similarly,cross-validation densities with “leave-out size” greater than one, can be computed39directly. Hence, we can compute the log EPBF based on this extended CPO crite-rion and then compare the competing model pairs as described in section 3.2.3.3.2 Extended CPO Criterion using Monte Carlo Samples:Unknown Variance SituationThe purpose of calculating the extended CPO criterion using MC samples for the“Normal model with unknown mean and known variance” situation is already dis-cussed in section 3.2. However, instead of taking MC samples of β and σ2 fromtheir joint posterior distribution, it is preferable to factorize the joint posterior dis-tribution of β and σ2 into a marginal posterior of σ2 and a conditional posteriorof β (given σ2) that is p(β , σ2 |y, X ) = p(β |σ2, y, X )× p(σ2|y, X ). Then, wecan take samples of σ2|y, X and β |σ2, y, X from p(σ2|y, X ) and p(β |σ2, y, X )respectively. Since, both the prior and posterior belongs to the same family of dis-tribution, using equation (3.13), it can be shown that β |σ2, y, X ∼ N(µ ∗,σ2V ∗−1)and σ2|y, X ∼ IG(a∗,b∗). Thus B MC samples for (β ,σ2): (β ,σ2)(1), (β ,σ2)(2),. . . ,(β ,σ2)(B) can be drawn from the posterior.As discussed in subsection 3.2.2, the CPO criterion or cross-validation den-sity with “leave-out one” and “leave-out two” can be approximated using (2.16)and (2.18), and in general, cross-validation densities for different “leave-out sizes”can be approximated. Accordingly, the log EPBFs can be computed using theapproximated cross-validation densities to compare two linear regression mod-els in Bayesian context. Cross-validation densities in extended CPO criterionwith “leave-out two” can be approximated by equation (2.18). Similarly, cross-validation densities for different “leave-out sizes” can be approximated; extendedlog pseudo Bayes factors computed using these approximated cross-validation den-sities can then be used as to compare linear regression models.The comparative behavior of the extended CPO criterion or the EPBF for “Nor-mal model with unknown mean and variance” setting with different “leave-outsizes” is discussed in the next subsection for both the closed form and MC samplesbased solution.403.3.3 Comparative Behavior of Extended CPO Criterion forUnknown Variance Situation: Closed Form Versus MonteCarlo SamplesIn this subsection, we examine how the extended CPO criterion, a model compari-son tool behaves in real life scenario for both the cases: closed form and MC basedsolution with an illustrating example for “Normal model with unknown mean andvariance” setting. As the “Normal model with unknown mean and known variance”situation discussed in subsection (3.2.3), we use the same data set, response vari-able, explanatory variables and the same model setup. Since, we have additionalunknown parameter σ2 compared to subsection (3.2.3), only the prior specificationfor the parameters needs some work.Prior Specification: Unit Informative and g PriorsWe need to specify the prior parameters µ β , V β , a, and b. But, it is hard to findrepresentable values of these parameters for actual prior information, specificallyfor µ β and V β . Again, with increasing regressors, the construction of an informa-tive prior distribution gets harder. For example, with p+1 regressors, the numberof prior correlation parameters is(p+12), and it increases quadratically in p.Sometimes one can use the least squares estimates βˆ OLS as the prior mean for βin the absence of precise prior information; then, no probability statements aboutβ can be made. Another idea is to use minimally informative prior as possiblewhen the prior distribution doesn’t represent the real prior information about theparameters. To some extent, using this compared to an informative prior distribu-tion, more “objective” result can be obtained from the posterior distribution. Kassand Wasserman (1995) describes unit information prior as a weakly informativeprior. The amount of information in a unit information prior is just the informa-tion contained in only a single observation. For example, (X ᵀX )/σ2 denotes theprecision of βˆ OLS that can be thought of as the amount of information from n ob-servations. Then, the unit information prior will set 1σ2V β = (XᵀX )/nσ2. Also,using µ β = βˆ OLS is suggested by Kass and Wasserman (1995). This specificationrequires knowledge of y and hence cannot be a real prior distribution. However,for this unit information prior, only a small amount of information in y is used.41The idea of invariant parameter estimation to the scale change of the regressorscan be implemented for choosing priors of β . Suppose X˜ = X H where X repre-sents a set of regressors and H is a (p+ 1)× (p+ 1) matrix. Also, suppose theposterior distributions of β and β˜ are obtained using X and X˜ with same y. Then,according to the invariance principle, both the posterior distributions of β and H β˜should be the same. We need to specify µ β = 0 and σ2V β = k(XᵀX )−1, k > 0to fulfill this condition. The choice of prior parameters with k = gσ2 reveals aversion of the well-known “g-prior” (Zenllner, 1986). Note that, with g = n, weget the unit information prior discussed in the previous paragraph.For the regression parameters β , we consider “g-prior” that is a multivariatenormal prior with mean vector µ β = 0 and variance-covariance matrix σ2V β =nσ2 (X ᵀX )−1.Using the prior specification discussed above and the model specification dis-cussed in subsection 3.2.3, we obtain results that are described in the next subsec-tion. As subsection 3.2.4, we describe our findings for the three models consideredwith three possible pairwise comparisons of these three models. Once again, wedescribe the behavior of the extended CPO criterion for the three models separatelyand then extend to three model comparisons using the EPBF for “Normal modelwith unknown mean and variance” situation.3.3.4 ResultsFor “Normal model with unknown mean and variance” situation, first we visualizethe closed form results of log extended CPO values obtained from Model 1, Model2 and Model 3 (see Figure 3.7) as discussed for “Normal model with unknownmean and known variance” in subsection 3.2.4.All three models exhibit the same pattern with decreasing log extended CPOvalues over the increasing leave-out size with a sharp decrease in an exponentialmanner after “leave-out 50”. Also, all three models have close log extended CPOvalues. The relative position of the three models remains more or less the same ateach “leave-out size” though compared to other two models, Model 2 and Model3 have a slightly higher log extended CPO values at smaller and higher “leave-outsizes”. So, compared to the known variance case, from Figure 3.7 we observe that42−1.2−1.1−1.0−0.9−0.8−0.7−0.6−0.5leave−out sizelog extended CPO1 10 30 50 80 100 120 150 170 185Model 1Model 2Model 3Figure 3.7: Comparison of closed form results for three models (unknownmean and variance)the strength of evidence changes for the considered models at different “leave-outsizes”. We hope that this pattern will be more clearly observed when we comparethe models pairwise through log EPBFs as a model selection criterion.Along the closed form results, we want to examine the strength of evidence forthese models at all “leave-out sizes” when we must compute using MC samples.As we observed in the “known variance” situation, the strength of evidence mightbe affected due to the bias related to the results obtained from the MC samplesthat increase with the increasing leave-out points. Once again, we need to examinefurther this bias issue. Also, as the “known variance” situation, we want to comparethe closed form results and results based on MC samples to check whether MCbased results can replicate the closed form results in “unknown variance” situation.For the Model 1 we observe the log extended CPO values from the closed formsolutions as well as from the MC samples in the Figure 3.8. The red filled circlesrepresent the closed form values at each “leave-out size”.Compared to “known variance” situation (up to “leave-out 120”), the MC basedresults match the corresponding closed form result up to a smaller “leave-out sizes”(“leave-out 50”) that can be observed from Figure 3.8. After “leave-out 50”, the43−1.2−1.1−1.0−0.9−0.8−0.7−0.6Model 1leave−out sizelog extended CPO1 10 30 50 80 100 120 150 170 185Figure 3.8: Comparison of closed form results and MC based results forModel 1 (unknown mean and variance)changes between the results from MC samples and closed forms become visibleand at higher “leave-out sizes”, MC based results show an upward (exponentiallyincreasing) bias from the closed form result. This pattern is also observed for the“known variance” situation discussed in subsection 3.2.4. Also, for this “unknownvariance” situation, the same pattern is observed for Model 2 and Model 3 as weobserve the same pattern for the other two models considered here. For three mod-els, a substantive positive departure of MC based results from the closed formvalues is observed at higher “leave-out sizes”.We compare Model 1 with Model 2 now in the similar manner as discussedin in subsection 3.2.4. The EPBF, computed from the extended CPO is used as amodel comparison tool. This comparison can be described from Figure 3.9 wherethe red filled circles and black circles denote the closed form results and MC resultsfrom 10 batches at each “leave-out size” respectively. Same specifications are usedfor Figure 3.10 and Figure 3.11.Figure 3.9 exhibits the same pattern in comparing Model 1 with Model 2 asFigure 3.3 except the wider values of the MC based results. We observe negativeclosed form log EPBFs at all “leave-out sizes” suggesting the choice of Model 244−15−10−5051015Model 1 versus Model 2leave−out sizelog EPBF1 10 30 50 80 100 120 150 170 185Figure 3.9: Comparison of Model 1 and Model 2 (closed form results andMC based results for “unknown mean and variance” situation)over Model 1 (hence the same choice) with increasing strength of evidence forincreasing “leave-out sizes”. However, MC based results vary over the directionstarting from early “leave-out sizes”, and have values with wider range at higher“leave-out sizes”. As discussed before, the fluctuation from the closed form resultsis due to the MC error. We will quantify the MC error to examine how this increaseswith increasing ‘leave-out sizes”.The closed form results are near the center of the MC based results at smaller“leave-out sizes” (1, 5, and 10). For other “leave-out sizes”, the MC based re-sults from most of the 10 batches have a tend to have higher values than the cor-responding closed form results. As the “known variance” situation, Figure 3.9demonstrates that MC based results become positively biased and more variable asthe “leave-out size” increases for “unknown variance” situation. MC based resultsfrom most of the 10 batches show positive bias from the formal Bayes factor value(closed form value at “leave-out all”) implying that using MC samples we are verypoorly computing real Bayesian model comparison with some positive MC errors.Our goal is to examine the possibility of finding a “leave-out size” where the EPBFis close to the formal Bayesian comparison with smaller MC error which is dis-45−15−10−505101520Model 1 versus Model 3leave−out sizelog EPBF1 10 30 50 80 100 120 150 170 185Figure 3.10: Comparison of Model 1 and Model 3 (closed form results andMC based results for “unknown mean and variance” situation)cussed later.Next we examine two other possible model comparisons: Model 1 versusModel 3 and Model 2 versus Model 3. Figure 3.10 displays the closed form andMC samples based log EPBFs for Model 1 versus Model 3.Figure 3.10 displays similar pattern of choosing the smaller model as Figure 3.9with increasing “leave-out sizes”. Compared to the comparison between Model 1and Model 2, for the comparison between Model 1 and Model 3, the closed formresults indicate a change pattern in the choice of the model (changes direction at“leave-out 80”) over different “leave-out sizes”. Up to “leave-out 50”, the closedform log EPBF values are positive indicating the evidence of choosing Model 1over Model 3 though this evidence decreases with increasing “leave-out sizes”.Negative closed form log EPBF values at “leave-out 80” and higher “leave-outsizes” suggest the choice of Model 3 over Model 1 and the strength of this evidenceincreases with increasing “leave-out sizes” from “leave-out 80” and onwards. Thechoice of model alters here for “unknown variance” situation which doesn’t alter inthe “known variance” situation though the pattern of stronger evidence of smallermodel with increasing “leave-out sizes” is prevalent in both situations.46−15−10−5051015Model 2 versus Model 3leave−out sizelog EPBF1 10 30 50 80 100 120 150 170 185Figure 3.11: Comparison of Model 2 and Model 3 (closed form results andMC based results for “unknown mean and variance” situation)For “unknown variance” situation, the closed form results are near the center ofthe MC based results from 10 batches at early “leave-out sizes” for the comparisonbetween Model 1 and Model 3 as the comparison between Model 1 and Model 2(see Figure 3.10). Again, the MC based results become positively biased and morevariable as the “leave-out size” increases which is similar as previous comparison(Figure 3.9) as well as the same comparison for “known variance” situation shownin Figure 3.4. Compared to the “known variance” situation, the MC samples basedresults have a wider range with large MC errors.At last, we focus on the last comparison: Model 2 versus Model 3 in “unknownvariance” situation. The closed form and MC based results for this comparison aredisplayed in Figure 3.11. Like the previous comparison displayed in Figure 3.10,the closed form results in this comparison indicate the same pattern of choosingbigger model (Model 2) at early “leave-out sizes” and smaller model (Model 3) athigher “leave-out sizes”. The strength of evidence for bigger model decreases up to“leave-out 120”, and changes of direction (evidence of smaller model) is observedat “leave-out 120” and onwards. Hence, Figure 3.10 suggests the choice of Model2 up to “leave-out 120” and Model 3 from “leave-out 150”. Compared to “known47variance” counterpart, similar decreasing pattern in log EPBF values is observedthough direction of model choice changes in “unknown variance” situation. Thisimplies that possibly the magnitude of change in log EPBF values for increasing“leave-out sizes” is greater for the “unknown variance” than the “known variance”situation. The same behavior of the MC based results at all “leave-out sizes” isobserved in Figure 3.11 as Figure 3.9 and Figure 3.10 with wide range of values. Asprevious, MC based results are positively biased and more variable as the “leave-out size” increases that might be due to MC error.In short, all three model comparisons show us the same behavior of the EPBF(computed from extended CPO criterion) as a model selection tool. The closeform results have a tendency to choose the model with a small number of param-eters in all three comparisons with increasing “leave-out sizes”. For comparisonsbetween Models 1 and 3 and Models 2 and 3, the values of log EPBF becomenegative from positive at some “leave-out sizes” that indicates the change in theevidence for model selection. Moreover, the MC based results have a wider rangeof the “unknown variance” situation compared to the “known variance” situation,and deviate from the closed form results in all three comparisons in an increasingpattern with the increasing “leave-out sizes”. MC error might be responsible forthis deviation so that our intention is to compute the contribution of MC error atdifferent “leave-out sizes” which we discuss in the following subsection.3.3.5 Summarizing the Computations for Unknown VarianceSituationAs “known variance” situation, we compute the root mean squared error (RMSE)of the MC based estimates of the log EPBFs at all considered “leave-out sizes” toexamine the variation from the closed form results for “unknown variance” situa-tion. The RMSE of the estimated log EPBFs for all three model comparisons aretabulated in Table 3.2.From Table 3.2 for comparison of Model 1 with Model 2 at “leave-out 10”, theRMSE of the estimated log EPBFs is 0.863 which implies that model comparisonsusing MC based EPBFs relative to the corresponding closed form are erroneous bya factor of exp(0.863) = 2.37. Similarly, while comparing Model 1 with Model 3,at “leave-out 120”, the RMSE of the estimated log EPBFs is 2.460 which leads to a48Table 3.2: Root mean squared errors of the estimated log EPBFs for all threemodel comparisons in unknown variance situationLeave-out size Model 1 vs. Model 2 Model 1 vs. Model 3 Model 2 vs. Model 31 0.575 0.715 0.6905 0.771 0.716 0.90010 0.863 0.968 1.04330 2.370 2.575 2.62650 3.884 2.694 3.77780 7.069 5.388 4.419100 5.128 4.693 5.044120 5.558 2.460 3.871150 5.775 4.807 6.659170 4.140 6.318 6.633175 3.810 5.639 3.942180 6.034 5.748 5.542185 6.697 5.164 5.759190 6.785 5.645 5.326all 5.400 7.072 6.210factor of 11.7 erroneous model comparisons while using MC based EPBFs relativeto the corresponding closed form. We can interpret all other RMSE values in asimilar fashion.However, the RMSE values of the estimated log EPBFs increase with the in-creasing “leave-out size” and these increase drastically at higher “leave-out sizes”for three model comparisons. As the “known variance” situation, we use severalcut-off values (two here) for the RMSE values to examine the level of error in themodel selection that occur due to using the MC based EPBFs at different “leave-outsizes”. The RMSE values from Table 3.1 are in logarithmic scale, and we considertwo cut-off values for these values: log 1.25, and log 2 which correspond to 25,and 100 percent erroneous model selection while using the log EPBFs estimatedfrom MC samples instead of the closed form values. The RMSE values for theMC based EPBFs at the considered “leave-out sizes” with vertical lines throughthe cut-off values are plotted in Figure 3.12.We are observing a lot of error in model selection for “unknown variance”situation from Figure 3.6. If we consider the log 1.25 as the cut-off value, then we490123456RMSE for three model comparisonsleave−out sizeRMSE of log EPBF1 10 30 50 80 100 120 150 170 185Model 1 vs. Model 2Model 1 vs. Model 3Model 2 vs. Model 3log 1.25log 2Figure 3.12: Root mean squared error of the log EPBFs from 2500 MC sam-ples with three cut-offs (unknown mean and variance)observe that no RMSE values lie below this cut-off value for all three comparisons.The RMSE values for three model comparisons at “leave-out 1” lie under the log 2cut-off value only and exceed the cut-off value at any “leave-out size” greater than1. So, compared to “known variance” situation, we require larger MC samples ormore batches to get accurate model comparisons from the MC based results.Also, according to the discussion of increasing MC sample sizes to get morecorrect model comparisons based on the MC sample in subsection 3.2.4, we in-crease the MC samples from 2500 to 25000 that is now we have 10 batches with2500 MC samples. According to MC rule, we expect an error reduction from theMC based model comparisons by a factor of 8.95 (based on the formulation insubsection 3.2.4) with this 25000 MC samples compared to the 2500 MC samples.This implies an expectation to have all the RMSE values less than log(8.95) = 2.19in this Figure 3.12 that is up to “leave-out 30” below the cult-off value log(2) withthe 25000 MC samples. We plot the revised RMSE values considering 25000 MCsamples for three model comparisons in Figure 3.13.As per our expectation, Figure 3.13 shows that all the RMSE values for threemodel comparisons up to “leave-out 30” lie below the cut-off point log(2). In500123456RMSE for three model comparisonsleave−out sizeRMSE of log EPBF1 10 30 50 80 100 120 150 170 185Model 1 vs. Model 2Model 1 vs. Model 3Model 2 vs. Model 3log 1.25log 1.5log 2Figure 3.13: Root mean squared error of the log EPBFs from 25000 MC sam-ples with three cut-offs (unknown mean and variance)addition, we observe some RMSE values (up to “leave-out 10”) below the cut-offpoint log(1.5). Hence, the MC results show a lot of improvement in error reductionwith the 25000 samples compared to the 2500 samples.3.4 SummaryIn this chapter, we discussed the Bayesian model comparison for linear regressionmodel with the extended CPO criterion as the model selection tool. Two linear re-gression models are compared using extended pseudo Bayes factor (EPBF) whichis a compromise between the pseudo Bayes factor and the formal Bayes factor. Weconsider 15 different “leave-out sizes” where the model comparisons at “leave-out1” and “leave-out all” indicate pseudo Bayes factor and formal Bayes factor re-spectively. Two different situations namely “normal model with unknown meanand known variance” and “normal model with unknown mean and variance” areconsidered for the Bayesian linear regression model and discussed in detail withan example in sections 3.2 and 3.3. Since the closed form results are available,we compare the MC based results with these to examine how much the modelcomparison results deviate from the corresponding closed form ones in both the51situations. We take a single MC sample and then create 10 equal-sized batchesfrom that sample. Model comparison results are computed for 10 batches. Threelinear regression models: Model 1, Model 2, and Model 3 are considered here thatcontain five, four and three explanatory variables respectively. This implies that wehave three pairwise model comparisons.There are several take-away messages from the closed form results obtainedin both the situations. One of these is the change in the evidence of the model atdifferent “leave-out sizes”. For “normal model with unknown mean and knownvariance” situation, closed form results of the three model comparisons exhibitsame choice of models over different “leave-out sizes”. But, for the “normal modelwith unknown mean and variance” situation, the choice of model changes at some“leave-out size” for the comparisons between Model 1 and Model 3 and betweenModel 2 and Model 3. This implies that the direction of evidence under the pseudoBayes factor and the formal Bayes factor is different in these comparisons thoughpseudo Bayes factor is used as a proxy of the formal Bayes factor in the literature(Gelfand and Dey, 1994). In addition, using the EPBF values at different “leave-out sizes”, we can exactly specify the “leave-out size” up to which the evidence ofa specific model remains the same and changes hereafter. Another take-away mes-sage from the closed form solutions is that for both the situations, the strength ofthe evidence increases for the comparatively smaller model (regarding the numberof explanatory variables) among the two models compared in all the comparisons.The MC samples based results show a rapid departure from the closed formresults with increasing “leave-out sizes” for all the model comparisons in both thesituations. The ranges of the MC results are much higher in the “normal modelwith unknown mean and variance” situation compared to the other situation. Wecompute RMSE of the MC samples based estimates of the log EPBFs for threemodel comparisons to measure how well the MC samples based estimates approx-imate the closed form log EPBF. With 2500 MC samples, MC based estimatesexhibit less than 25% error in model selection compared to the closed form resultsfor “leave-out size” 50 or smaller in “normal model with unknown mean and vari-ance” situation. However, RMSE values of the MC samples based estimates ofthe log EPBFs are too high in “normal model with unknown mean and variance”situation. Error in the decision of model selection from the MC based results can52be reduced by increasing the number of MC samples.This chapter focuses on Bayesian model comparisons for linear regressionmodel using extended CPO or EPBF as a model comparison tool. Both the closedform and MC samples based results are available here. For the generalized linearmodels, for example, logistic regression models, no closed form solutions are avail-able, and model comparisons are examined using some form MC samples only. Wewill discuss the Bayesian model comparisons for logistic regression models in thenext chapter.53Chapter 4Bayesian Model Comparison forthe Generalized Linear ModelsWe discussed the Bayesian model comparison for the linear regression models inthe previous Chapter. In this Chapter, we focus on the Bayesian model comparisonfor the generalized linear models, for example, logistic regression models. Logisticregression models are used in studying the effect of explanatory variables on anominal response variable (response is continuous for linear regression models).As with linear regression models, both the classical and Bayesian model selectionmethods can be applied to logistic regression models. For example, one can useeither a classical approach say AIC or any Bayesian approach. In this chapter, wediscuss the extended pseudo Bayes factor (EPBF) as a model selection method forlogistic regression models in Bayesian context. Our interest is to examine how themodel selection behaves when we change the “leave-out size” in the EPBF. Fromthe EPBF, we can get the pseudo Bayes factors (PBFs) as well as the formal Bayesfactors (BFs) depending on the “leave-out sizes”. We want to examine whetherthere is any agreement or disagreement in between the PBFs and the formal BFs asmodel selection methods.544.1 Bayesian Logistic Regression ModelsWe can apply the Bayesian approach to estimate the parameters of the logistic re-gression models. In the classical approach, we only need the likelihood functionfor estimation purposes. But, in Bayesian approach, in addition, we do requirea prior specification for parameters of the logistic regression model. The logisticregression model constructs a model to predict the probability of the presence ofan indicator using the available explanatory variables. The log transformation ofthe ratio of probabilities, known as log odds or logit, linearizes the relationship be-tween the response and the explanatory variables. As the linear regression case, westart our discussion with defining a logistic regression model. Our response vari-able, say y has binary responses. Suppose the binary response variable representsan indicator (0 =Absence, 1 =Presence) of an event. Also, suppose we have a setof explanatory variables: x1,x2, . . . ,xp. Here, y = (y1, . . . ,yn) is a vector of lengthn representing binary responses of interest for n observations.Suppose β = (β0,β1, . . . ,βp)ᵀ are unknown regression parameters and xᵀi =(1,xi1,xi2, . . . ,xip) represents the ith individual’s row vector of explanatory vari-ables. The design matrix [of dimension n×(p+1)] in this case is X =(1,x1,x2, . . . ,xp). The response variable for the ith individual indicates the presence or absenceof the event for that subject by setting yi = 1 and yi = 1 respectively. If p(xi) rep-resents the probability that the event is present for subject i, and the ith individual’sset of the explanatory values are contained in xᵀi , then the logistic regression modelcan be written as:logit p(xi) = log[p(xi)1− p(xi)]= β0+β1 xi1+ · · ·+βp xip = xᵀi β , i= 1, . . . ,n.(4.1)Now, we can express the probability p(xi) of the presence of the event asp(xi) = Pr(yi = 1|xi) = exᵀi β1+ exᵀi β.As the response is binary, the likelihood contribution from the ith observation can55be written using a Bernoulli likelihood expression:Li(β ;y, X ) = [p(xi)]yi [1− p(xi)](1−yi)=[exᵀi β1+ exᵀi β]yi [1− exᵀi β1+ exᵀi β](1−yi).The individuals are assumed to be independent from each other, and hence thelikelihood function over the n individuals has the following expression:L(β ;y, X ) =n∏i=1Li(β ;y, X ) =n∏i=1[exᵀi β1+ exᵀi β]yi [1− exᵀi β1+ exᵀi β](1−yi). (4.2)We can utilize the likelihood function (4.2) to estimate the unknown parametersβ in classical inference. Also, we need to specify the prior distribution, say g(β )for β for the unknown parameters β so that we can compute the posterior distri-bution p(β |y, X ) to obtain the Bayesian inference of those parameters. A generalexpression of the posterior follows:p(β |y, X ) ∝ L(β ;y, X ) × g(β ). (4.3)The posterior in the equation (4.3) has no closed form expression. But, wecan take realizations from the expression in equation (4.3) to obtain a valid empir-ical guess about the posterior distribution in a Bayesian manner, and hence makeinference for the parameters. There are many proposed approaches available inthe literature which includes the Metropolis−Hastings (MH) method, many latent-variable schemes that facilitate Gibbs sampling, etc. and so on to obtain Markovchain Monte Carlo (MCMC) samples from the posterior.We use one of these approaches in this study which is efficient enough andpopularly known as the ‘Po´lya−Gamma approach’ (Polson et al., 2013). If weparameterize the binomial likelihoods (say 4.2) by the log odds, then the assump-tion in the ‘Po´lya−Gamma approach’ is that the likelihood is expressed as a mix-ture of normals. It can be noted that the Po´lya−Gamma distribution is a subsetof the class of infinite convolutions of gamma distributions (Polson et al., 2013),and generalization of the Po´lya distributions (Barndorff-Nielsen et al., 1982). The56‘Po´lya−Gamma approach’ is implemented in a R package called BayesLogitwhich utilizes an accept/reject sampler based on the alternating-series method thatis proposed by Devroye (1986). This sampler is very efficient and requires ex-ponential and inverse-Gaussian draws only; also, the bound of the probability ofaccepting a proposed draw is uniformly bounded below at 0.99919 (Polson et al.,2013). Also, no tuning is needed which makes this a reliable black box samplingroutine in all situations with the logit link, even in complex hierarchical models.As per the claim from Polson et al. (2013), ‘Po´lya−Gamma approach’ is nearlyefficient as the independence MH sampler for simple logistic models with no hier-archical structure, and most efficient in all other cases.We can directly use the MCMC samples from the posterior distribution of β toapproximate the cross-validation densities as no closed form densities of the pre-dictive posterior distribution are available for Bayesian logistic regression models.Then, as discussed in Chapter 3 for the linear regression models, we can com-pute the extended conditional predictive ordinate (CPO) criterion. However, inthe absence of the closed form results, we can’t make a comparison between theclosed form and posterior samples based results for the Bayesian logistic regres-sion models. But, motivated from examining how well the closed form results canbe approximated by the posterior samples for the linear regression models in theChapter 3, we can rely on such MCMC samples based results. Then the estimatedextended CPO criterion can be used to compute the extended pseudo Bayes fac-tor (PBF) to compare the Bayesian logistic regression models for model selectionpurpose.4.2 Extended CPO Criterion using MCMC Samples forLogistic Regression ModelsWe describe the computation of the extended CPO criterion directly from the pos-terior MC/MCMC samples in the subsection 2.4.4 of the Chapter 2. Accordingto that, at first we draw B realizations of β , say β (1), . . . ,β (B) from the expres-sion of the posterior distribution of β , p(β |y, X ) in equation (4.3). Then, havingB posterior samples of β , as the linear regression models, the CPO criterion orcross-validation density for logistic regression models with “leave-out one” can be57computed approximately using (2.16). After that, we can compute log PBFs us-ing the approximated cross-validation densities to compare two logistic regressionmodels in Bayesian context. Again, cross-validation densities in extended CPO cri-terion with “leave-out two” can be approximated by using the formulation in equa-tion (2.18). Similarly, cross-validation densities for different “leave-out sizes” canbe approximated; log EPBFs computed using these approximated cross-validationdensities can then be used as to compare logistic regression models in Bayesiancontext as the linear regression models discussed in Chapter 3.We discuss the behavior of the extended CPO criterion or the EPBF in the nextsection for Bayesian logistic regression models with different “leave-out sizes” ina real life scenario.4.3 Examining the Behavior of Extended CPO criterion:A Practical ExampleHere, we attempt to examine how the extended CPO criterion, a Bayesian modelcomparison tool, behaves with an illustrative example. We use a well-known datahere named birthwt which is available in the R package MASS. The location ofthe data collection is Baystate Medical Center, Springfield, Mass (Venables andRipley, 2002). The aim of collecting the data was to investigate whether somefactors related to mother are responsible for low child birth weight. From thisspecific data we consider the following five explanatory variables:• smoke: mothers’ smoking status during pregnancy (1 = Yes, 0 = No),• ui: presence of uterine irritability for mothers (1 = Yes, 0 = No),• ht: hypertension status of mothers (1 = Yes, 0 = No),• lwt: mothers’ weight at last menstrual period (in pounds), and• ptl: mothers’ previous premature labours (numbers).We have three binary, one continuous (lwt), and one count (ptl) explanatory vari-ables. Our binary response variable is low which is simply an indicator of birthweight less than 2.5 kg (1 = Yes, 0 = No). We want to fit a logistic regression58to model the probability that a child is born with low birth weight with respect tofactors related to the mother of that child. There are 189 non-missing observationsfor this birthwt data. We specify the models considered for comparisons andpriors for the regression parameters β below.4.3.1 Model SpecificationTwo models are considered here, and these are denoted by Model 1 and Model 2with different combinations of the explanatory variables. If p denotes the proba-bility that a child is born with low birth weight (low = 1), then these two modelshave the following specifications.1. Model 1: Big Model or Full Model (Model with all explanatory variableslisted above) with the formulationlog(pi1− pi)= β0+β1 smokei+β2 uii+β3 hti+β4 lwti+β5 ptli,where pi denotes the probability that ith child is born with low birth weight, i=1,2, . . . ,189. All explanatory variables represent the binary status of differ-ent factors related to the mother of the ith child.2. Model 2: Small Model (Model which leaves only ptl out of the full Model).Previously, for linear regression models discussed in Chapter 3, we consider2500 MC samples from the posterior distribution of the regression parametersβ . MC samples are taken independently whereas MCMC scheme provides sub-sequent dependent samples. Hence, we consider a large number of MCMC sam-ples (100000) from the posterior distribution of the regression parameters β for thecomputation of the extended CPO criterion in the Bayesian logistic regression set-ting. We divide those into 10 batches of equal size 10000. As the linear regressionmodels, we consider the log extended CPO and log extended pseudo Bayes factorcalculation for each of these batches at different “leave-out sizes” (from “leave-out 1” to “leave-out all”). For logistic regression models, we consider 11 different“leave-out sizes”; these are 1, 5, 10, 30, 50, 80, 100, 120, 150, 170, and 189 (i.e.,all). Also, to calculate extended CPO calculation with different “leave-out sizes”,59we take all combinations if the total combinations are less than 2000 and take only2000 combinations randomly if the total combinations exceed 2000.4.3.2 Prior Specification: Weakly Informative PriorWe use the efficient ‘Po´lya−Gamma approach’ for generating MCMC samplesfrom the posterior distribution (4.3) for Bayesian logistic regression models. How-ever, the ‘Po´lya−Gamma approach’ incorporates the normal prior for the parame-ters only. Different prior specifications for the parameters of the logistic regressionmodel are also available in the literature. For example, a weakly informative prioris suggested by Gelman et al. (2008).A weakly informative prior is a minimally informative prior as possible, andmostly used when there is a doubt that the prior distribution might not represent thereal prior information about the parameters. Gelman et al. (2008) propose a weaklyinformative prior on the scaled explanatory variables. Scaling is an important issuefor the logistic regression models as different scaling can render the exponentia-tion of the parameter coefficients (the odds ratios) very different. Hence, the ex-planatory variables need standardization, and one such application can be found inRaftery (1996) for Bayesian generalized linear models.Two proposals are made by Gelman et al. (2008). For binary explanatory variables,a shift is suggested so that their means are 0s and ranges are 1s. Shifting and scal-ing for the non-binary explanatory variables are suggested in a way so that theirmean and standard deviation become 0 and 0.5 respectively. With this scaling, thecontinuous variables have the same scale as the symmetric binary variables in theinterval [−0.5,0.5] with standard deviation 0.5. In our ‘Po´lya−Gamma approach’to Bayesian logistic regression models, we use this scaling on the explanatory vari-ables. Also, we have normal prior distribution: β j ∼ N(µ j,σ2j ), j = 0,1, . . . , pfor the regression parameters β . With this prior, the posterior distribution of theparameters of the logistic regression model 4.3 can be rewritten as60p(β |y, X ) ∝n∏i=1[exᵀi β1+ exᵀi β]yi [1− exᵀi β1+ exᵀi β](1−yi)(4.4)×p∏j=01√2piσ jexp[−12(β j−µ jσ j)2].Now, we need to specify the hyperparameters for the inference of the param-eters β using equation (4.4). We use independent normal prior distributions withmean 0 and variance log(5) for all the parameters in the logistic regression modelexcept the intercept term. This implication comes from two suggestions. Firstly,we consider the epidemiological idea of prior construction using the odds ratio(Greenland, 2006). The value of odds ratio = 5 is considered as a meaningfulupper bound of the possible odds ratio. We consider this as the variance of theindividual parameters. The second suggestion comes from Gelman et al. (2008)that recommends 5 as the upper bound for the absolute difference in the logit prob-ability. If we consider the variance of the individual normal priors as log(5), thenboth the suggestions can be taken into account.For the intercept, we consider a normal prior with mean 0 and a higher variancethan the other variances (hyperparameters) taken into account for the other param-eters. Gelman et al. (2008) documents that a change of 5 in the input implies thechange in the probability either from 0.01 to 0.5, or from 0.5 to 0.99 for logisticregression which indicates that a change of 10 leads to the change in the prob-ability from 0.01 to 0.99. According to this concept, for the intercept we use anormal prior with mean 0 and variance log(10). Such specification allows that onan average, the expected probability of success is within the bound [0.01,0.99].We discuss the results in the next subsection using the prior specification aboveand the model specification in subsection 4.3.1. As the results for model compar-isons for the linear regression models in Chapter 3, we describe our findings formodel comparison for the two models considered for different “leave-out sizes”using the EPBF.614.3.3 ResultsWe do not have closed form model comparison results for Bayesian logistic regres-sion models which we examined for Bayesian linear regression models. Modelcomparison results based on the MCMC samples between Model 1 and Model 2for different “leave-out sizes” is displayed in Figure 4.1 as discussed in subsec-tion 4.3.1. Here, the black triangles represent the log EPBF values from 10 differ-ent batches of size 10,000 and the red dots represent the mean log EPBF valuesobtained from those 10 batches at each “leave-out size”.−0.50.51.52.5Commparison of Model 1 and Model 2leave−out sizeLog EPBF5 30 50 80 100 120 150 170 189Figure 4.1: Comparison of big Model vs. small Model in Bayesian logisticregressionFigure 4.1 represents how the model comparisons behave while comparingModel 1 and Model 2 using the log EPBFs. As we have positive log EPBFsat all “leave-out sizes” except some batches at some small and large “leave-outsizes”, the computed log EPBFs from the MCMC samples suggest the choice ofthe big Model (Model 1) over the small Model (Model 2) at all “leave-out sizes”.From the estimated mean log EPBFs at each “leave-out size”, this pattern is ob-servable. As the results found for the linear regression models in the Chapter 3,the strength of evidence increases with increasing “leave-out sizes” for the logisticregression models. So, a pattern of the strength of evidence for a specific modelover the increasing “leave-out sizes” is observed in both the linear and logistic re-62gression model comparisons. However, the pattern suggests a stronger evidencefor the small Model over the big Model in linear regression cases but the reverse(big Model over the small Model) for logistic regression cases though these aredata-driven decisions.From Figure 4.1 it is clear that, MCMC based results vary over the direction atsome “leave-out sizes” (for “leave-out sizes” 10, 30, 50 and 150 and above). MonteCarlo (MC) error might be responsible for this fluctuation. If possible, quantifyingthe MC error will allow us to examine how this increases for increasing ‘leave-outsizes”.For the smaller “leave-out sizes” (for example, 1, 5, 10), the MCMC basedmodel comparison results (log EPBF values) from 10 batches are close to the es-timated mean of the model comparison results of those batches. For the larger“leave-out sizes”, say “leave-out 100” or more, the MCMC based results from the10 batches tend to be more spread (more in the right direction) from their meanvalue. Hence from Figure 4.1, we can say that as the “leave-out size” increases,the MCMC samples might produce more and more variable and probably posi-tively biased model comparison results for the Bayesian logistic regression mod-els. In Figure 4.1, the EPBF at “leave-out all” indicates the computed formal Bayesfactor using MCMC samples. From the theoretical perspective, this is our optimalmodel selection criterion. However, the Figure 4.1 suggests that we are very poorlycomputing real Bayesian model comparison with some positive MCMC errors forlogistic regression model comparison.As the linear regression models, we have the interest in whether it is possible tofind a “leave-out size” where the EPBF is close to the formal Bayesian comparisonwith smaller MCMC error which is discussed in the next subsection.4.3.4 Summarizing the ComputationsWe can examine how well the posterior samples (MC or MCMC) can approximatethe closed form model comparison results given that the closed form results areavailable. We have already done this for the linear regression models in Chapter3. But, we do not have the same setup for the logistic regression models due tounavailability of the closed form results. Hence, to examine how well the posterior63MCMC samples compute the model selection criterion EPBF, we focus on theestimated mean and standard error of the log EPBFs obtained from 10 batches ofMCMC samples with size 10000. We plot the estimated error bars for the 10 logEPBFs at each of the “leave-out sizes” in Figure 4.2.0.000.250.500.751.001 5 10 30 50 80 100 120 150 170 189Leave−out sizeMean log EPBF from 10 batchesQuantification of MCMC error over different 'leave−out sizes'Figure 4.2: Error bar plot of the estimated log EPBFs for Bayesian logisticregressionFigure 4.2 has a very straightforward interpretation about the behavior of theMCMC samples. It is observed that the width of the error bars for the estimatedlog EPBFs increases with the “leave-out sizes” except “leave-out 5”. So, the stan-dard error of the estimated log EPBFs increases with “leave-out sizes” except that“leave-out size” implying higher variability among the estimated log EPBFs from10 batches at larger “leave-out sizes”. For example, at “leave-out 1”, the standarderror of the estimated log EPBFs is 0.0148 with estimated mean 0.1697 whereasat “leave-out 150”, the standard error is 0.1384 with estimated mean 0.5586, andat “leave-out all”, the standard error is 0.2835 with estimated mean 0.8313. Thestandard error of the estimated log EPBFs increases rapidly for the larger “leave-out sizes” that are close to the “leave-out all” than the smaller “leave-out sizes”.The higher variability in the estimated log EPBFs for “leave-out 5” might be dueto random chance.64We are not surprised by this overall findings as we have already observed highervariability among the estimated log EPBFs obtained from MC samples for linearregression models discussed in Chapter 3. The same thing happens here. Also,being independently drawn samples, MC samples are less vulnerable to the MCerrors than the dependent MCMC samples. However, we have had a direct formu-lation of improving the model selection results by increasing a specific number ofMC samples for the linear regression models discussed in Chapter 3. We can notdo it here as no closed form solution is available for model comparison using EPBFfor logistic regression models. Still, we can generalize the findings of the Chapter3 in MC error reduction.For our analysis, we have 100000 MCMC samples and 10 equal sized batcheswith 10000 MCMC samples. Either increasing the total MCMC samples from100000 or the number of batches with 10000 MCMC samples each will improvethe result by reducing the error in model selection for sure. As a compromisebetween the pseudo Bayes factor and the formal Bayes factor, the EPBF allowsus to choose “leave-out sizes” greater than one with some errors. From the Fig-ure 4.2, we can think of “leave-out sizes” up to 80 with small variability (andhence small error) among the estimated log EPBFs. Then, we can get close tothe estimated optimal formal Bayesian comparisons (“leave-out all”) more thanthe estimated pseudo Bayes factors (“leave-out 1”). Also, with increasing pos-terior MCMC samples, we can go further closer to the estimated optimal formalBayesian comparisons i.e., the Bayes factors by using larger “leave-out sizes”.4.4 SummaryWe discussed the Bayesian model comparison for the logistic regression modelswith the extended CPO criterion or the extended pseudo Bayes factor (EPBF) asthe model comparison method in Chapter 4. It is an extension of the applicationof the extended CPO criterion or EPBF for the linear regression models discussedin Chapter 3 to the generalized linear models. Two logistic regression models arecompared at 11 different “leave-out sizes”. Without any closed form results, weonly rely on the MCMC based results to compute the EPBF for model comparison.Model comparison results are computed for each 10 equal-sized batch of MCMC65samples. Two logistic regression models: Model 1 and Model 2 are consideredwith five and four explanatory variables respectively.The MCMC samples based results obtained from 10 batches show a greatervariability from their mean result with increasing “leave-out sizes”. The standarderror of the estimated log EPBFs gets bigger with increasing “leave-out sizes”,and this poses the requirement of more MCMC samples to obtain a less variableresult. Error in the decision of model selection from the available only MCMCbased results can be reduced by increasing the number of MCMC samples.This chapter focuses on Bayesian model comparisons for generalized linearregression models, particularly logistic regression models using extended CPO orEPBF as a model selection method. Motivated by the findings in Chapter 3 forlinear regression models, we generalize the findings here to select generalized lin-ear models using EPBF as a model selection method. We hope that the EPBF canbe applied for comparing other types of models where no closed form results areavailable.66Chapter 5Discussions and ConclusionsWe try to document our overall findings comparatively in this Chapter. Also, westate some possible further investigations.5.1 DiscussionsWe have examined how a model selection method, the extended conditional predic-tive ordinate (CPO) criterion or the extended pseudo Bayes factor (EPBF), behavesfor linear regression models and generalized linear regression models, with illus-trative examples in Chapter 3 and Chapter 4 respectively. Two real life data setswere used in the illustrative examples. We discuss the overall findings here.We have closed form expressions for the posterior distributions and the pre-dictive distributions for the linear regression models in the Bayesian setting with aconjugate prior setup. This empowered us to compare the model selection resultsobtained from the closed form with the results obtained using the Monte Carlo(MC) samples from the posterior distribution of the linear regression model pa-rameters (as if no closed form expressions are available). Two different situationswere considered for linear regression models: variance of the error is (i) knownand (ii) unknown. In the second situation, there was an unknown scale parameterin addition to the unknown location parameters (regression coefficients).We considered three linear regression models to be compared that producethree pairwise model comparisons. We computed the extended CPO criterion67for the individual models using the closed form and the MC samples setting forboth the situations. The posterior MC samples were divided into 10 equal-sizedbatches to examine the deviation of the results from the closed form results forthose batches. For known variance situation, the MC based results matched thecorresponding closed form results up to a large “leave-out size” (‘leave-out 120”for all three models with 195 data points), and then started to deviate from theclosed form results at larger “leave-out sizes”, especially at close to “leave-outall”. For unknown variance situation, the MC based results matched the corre-sponding closed form results up to “leave-out 50” for three models. Hence, it canbe said that the MC error is getting bigger at the smaller “leave-out sizes” for theunknown variance situation than the known variance situation. Empirical resultsfrom the considered data suggest that we can use the “leave-out size” consider-ably larger than one without too much concern for either situation while using thecross-validation approach, even when we do not have the closed form results.However, this result is not general as we have evidence from one data set only.If we observe the similar pattern from other data, then the generalization can beestablished. Also when working with MC samples, based on the evidence, it isrecommended to avoid the “leave-out sizes” close to “leave-out all” for high devi-ation from the closed form results due to MC error.We have observed a decrease in the log extended CPO value for the closed formresults at larger “leave-out sizes”. The result obtained for “leave-out all” is differentthan the “leave-out 1” with a decreasing pattern for increasing “leave-out sizes”.Also, the direction of the MC error can be examined from the results. The MCbased results posed a substantive positive departure from the corresponding closedform results for all three models at larger “leave-out sizes” indicating a trend ofupward bias with increasing “leave-out sizes” close to “leave-out all”. Both thesituations exhibited this pattern, and the MC error was a bit worse for the unknownvariance situation than the known variance situation.For logistic regression models, we relied on the MCMC samples from the pos-terior to compute the extended CPO criterion for the individual models as we didnot have the closed form expressions for the posterior distribution of the unknownparameters. We have used the mean result of the 10 batches as a proxy of the closedform results for both the models considered. As expected, the extended CPO crite-68rion computed from 10 batches of MCMC samples increasingly deviate from theirmean result for individual models with increasing “leave-out sizes”. The deviationsare pretty small at smaller “leave-out sizes”. So, we have the similar overall find-ings for the simple linear regression models (simple models) and the generalizedlinear models (non-simple models) regarding the cross-validation approach withdifferent “leave-out sizes”. In both cases, we can use “leave-out size” greater thanone with a small increase in the MC errors.Next, we focus on the winning models that come through model comparisonusing the extended pseudo Bayes factor (EPBF) as a model selection method forboth the linear and logistic regression models. For linear regression models, wehad three pairwise model comparisons for the three considered models. For knownvariance situation, the computed log EPBs in close form setting confirmed the com-paratively smaller models regarding the parameters as the winning models in thosethree pairwise comparisons. However, this was not the case for the unknown vari-ance situation where the winning model changed for two pairwise comparisons.The comparatively smaller model was the winning model in the other model com-parison. For the model comparisons with the differing winning model, the compar-atively large model win up to a “leave-out size”, and after that small model startsto win and keep the trend in the same direction up to “leave-out all”.We can conclude two important findings from these results. At first, a generaltrend is observed in both the situations: the strength of the evidence for the smallermodel is increasing with increasing “leave-out sizes”. Also, in the presence of theclosed form results, we have the computed optimal Bayesian model comparisonor the Bayes factor value which is nothing but the EPBF at “leave-out all”. Now,we can safely say that the winning models are different for the EPBF at “leave-out 1” and “leave-out all”, at least for our model comparisons. So, the so-calledapproximation of the formal Bayes factor, the PBF has a different winning modelhere compared to the one found through the computed formal Bayes factor (EPBFat “leave-out all”). Hence, the importance of using “leave-out sizes” other than oneis observed to get a closer optimal Bayesian model comparison.For logistic regression models, the strength of the evidence for the big modelincreased with the increasing “leave-out sizes”. As the closed form results of thelinear regression models, the mean log EPBF value of the 10 MCMC batches69showed a particular pattern with the increasing “leave-out sizes”. We can reportthe general finding from both the cases as a trend of selecting a model over in-creasing “leave-out sizes” which sometimes lead to change in the winning modelfor the comparison of two close models.Finally, we want to summarize the computations from the posterior sampleswhen the closed form expressions of the posteriors are unavailable. The summarieswill focus on the possible error reduction and guide us to find a “leave-out size” thatproduces the closest possible formal Bayesian model comparison with a smallerincrease in the MC error.For the linear regression model comparisons, we formulated a mathematical ex-pression to find the sample size required to obtain a specific amount of improve-ment by reducing MC errors. We used root mean squared error (RMSE) to measurethe MC errors for the linear regression models in the Bayesian setting. For exam-ple, for the known variance situation, 2500 MC samples ensured that the MC sam-ples based result would have less than 25% error in selecting the winning model upto “leave-out 30”. Increasing the MC sample size with the exact formula given inChapter 3 would allow us to lower the error percentage in model selection as wellas the consideration of the larger “leave-out sizes”.For the unknown variance situation, we can use the same formulation but keep-ing in mind that a reasonably large MC sample is required in the presence of theunknown variance parameter. For the logistic regression models, no such exactformulation is available as we do not have the closed form posteriors. We com-puted the standard error of the estimated log EPBFs at different “leave-out sizes”to measure how much MC error prone the MCMC samples based model selectionswere. High variability (due to high MC errors) was observed at larger “leave-outsizes” implying the use of more MCMC samples to obtain a less variable modelselection result for the larger “leave-out sizes”. Also, being a dependent sample,an MCMC sample of a large size is required for the logistic regression modelscompared to the size of the independent MC samples used in the linear regressionmodel comparisons. A general take-away message from both the linear and logis-tic regression models is to use a reasonably large sample from the posterior for thecross-validation approach with a desired ‘leave-out size” so that one can have themodel comparison result close to the formal Bayesian model comparison which is70mathematically optimal.We also attempt to find the sources of variation for the MC errors. We con-sidered 10 different splits of 2000 combinations of MC or MCMC samples for thecomputation of the extended CPO or the EPBF at different “leave-out sizes”. Then,we have two sources of variation for the MC error. Some error might come dueto the batches, and the random splits might contribute to some error. For both thelinear and logistic regression models, we computed the contribution of these twosources in the overall MC error at different “leave-out sizes”. We found that boththe sources have more or less similar contribution to the overall MC error at all“leave-out sizes” without no clear pattern. So, neither the batches nor the splits arethe dominating source of the MC errors with the current setup.5.2 Further ScopeIn this study, we consider 10 equal-sized batches for all MC and MCMC basedresults. Also, 2000 combinations of these posterior samples are considered in thecalculation of the extended CPO or the EPBF for different “leave-out sizes” whenthe total number of combinations exceed 2000. One further extension of this studymight be to examine the effect of the number of the batches with equal and un-equal (decreasing/increasing) sizes as well as the number of combinations on themodel comparisons. A set of some combinations can be examined for this purpose.Examining the decomposition of the overall MC error for the variable number ofbatches and number of combinations might be a good idea to explain the sourcesof the MC error in detail.We use two different data sets for constructing linear and logistic regressionmodels. The model comparison results utilize those data and consider different“leave-out sizes” accordingly. However, it might be a good idea to look for therelative “leave-out sizes” of a specific data and try to generalize findings for the rel-ative “leave-out sizes” with an application on data with the number of data pointsfrom very small to very big. For example, “leave-out 5” for a data set with datapoints 100 is very different than the “leave-out 5” with data points 1000 regardingthe percentage of “leave-out sizes”. We expect the MC/MCMC based model com-parison results for these will be different too. Thus, the relative “leave-out sizes”71can be examined for different types of simple and non-simple models with varyingdata points as a potential further scope of this study.Only logistic regression models are considered as an example of non-simplemodels. One can use this methodology for comparing other different non-simplemodels such as nonlinear models, mixture models, hierarchical models, etc. Also,model selection is very sensitive and important for the causal inference problems.That might be a good application to utilize the extended CPO or the EPBF as amodel selection method in causal inference problems to select a useful model.5.3 ConclusionsThe Bayes factor is the desired model selection method in the Bayesian settingbecause it is optimal. Due to computational issue for the non-simple models, inparticular, with no closed form posterior, some other predictive approaches saycross-validation approach with “leave-out 1”, popularly known as pseudo Bayesfactor is used as an approximation. Mathematically, the Bayes factor can be com-puted from the cross-validation approach with “leave-out all”. However, then theBayes factor is computed with more MC error at “leave-out all” than the “leave-out 1”. In this study, we set our objective to find a “leave-out size” that producesa closer Bayesian model comparison with a small increase in the MC error for thenon-simple models where the closed form results are unavailable.We start with the comparison of simple models, say the linear regression mod-els. Since the closed form results are available, we compare those with the MCsamples based results (hypothetically assuming unavailability of the closed forms).MC samples based results produce a good agreement with the corresponding closedform results implying the usability of the MC samples in the absence of the closedforms. We observe that the winning model changes for different “leave-out sizes”with a pattern for some pairwise linear regression model comparisons; this putsa contradiction of using cross-validation approach with “leave-out 1” as it has adifferent winning model compared to “leave-out all”. However, some “leave-outsizes” that are greater than one can produce a closer Bayesian model comparisonpossibly with some MC error. We can reduce this MC error by increasing the sizeof the MC samples.72For the logistic regression models, the MCMC samples based results are re-liable as per the behavior of the MC samples as a proxy of the closed forms inthe linear regression models. Increasing sample size will decrease the MC errorfor the logistic regression models too. The use of the cross-validation approachwith “leave-out sizes” more than one produces closer Bayesian model comparisonwhich is evident from both the linear and logistic regression models. Differentother non-simple models can be examined to generalize our findings from thisstudy.73BibliographyAitkin, M. (1991). Posterior bayes factors. Journal of the Royal Statistical Society.Series B (Methodological), pages 111–142. → pages 5, 11, 12Akaike, H. (1973). Information theory and an extension of the maximum likelihoodprinciple. In Second International Symposium on Information Theory, pages267–281. Akademinai Kiado. → pages 5Banerjee, P. N. B. S. (2008). Bayesian linear model: Gory details. Dowloadedfrom http://www. biostat. umn. edu/˜ ph7440. → pages 37, 38Barndorff-Nielsen, O., Kent, J., and Sørensen, M. (1982). Normal variance-meanmixtures and z distributions. International Statistical Review/Revue Interna-tionale de Statistique, pages 145–159. → pages 56Berger, J. O. (2013). Statistical decision theory and Bayesian analysis. SpringerScience & Business Media. → pages 6Berger, J. O. and Pericchi, L. R. (1996). The intrinsic bayes factor for modelselection and prediction. Journal of the American Statistical Association,91(433):109–122. → pages 9, 13Berger, J. O. and Pericchi, L. R. (1998). Accurate and stable bayesian modelselection: The median intrinsic bayes factor. Sankhya¯: The Indian Journal ofStatistics, Series B, pages 1–18. → pages 10Box, G. E. (1980). Sampling and bayes’ inference in scientific modelling androbustness. Journal of the Royal Statistical Society. Series A (General), pages383–430. → pages 11Devroye, L. (1986). Introduction. In Non-Uniform Random Variate Generation,pages 1–26. Springer. → pages 57Geisser, S. (1975). The predictive sample reuse method with applications. Journalof the American Statistical Association, 70(350):320–328. → pages 11, 1374Geisser, S. (1980). Discussion on sampling and bayes’ inference in scientific mod-eling and robustness (by gep box). Journal of the Royal Statistical Society A,143:416–417. → pages 13Geisser, S. and Eddy, W. F. (1979). A predictive approach to model selection.Journal of the American Statistical Association, 74(365):153–160. → pages 1,13, 14Gelfand, A. E. and Dey, D. K. (1994). Bayesian model choice: asymptotics andexact calculations. Journal of the Royal Statistical Society. Series B (Method-ological), pages 501–514. → pages 1, 6, 12, 52Gelfand, A. E., Dey, D. K., and Chang, H. (1992). Model determination using pre-dictive distributions with implementation via sampling-based methods. Techni-cal report, DTIC Document. → pages 6Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informativedefault prior distribution for logistic and other regression models. The Annals ofApplied Statistics, pages 1360–1383. → pages 60, 61Greenland, S. (2006). Bayesian perspectives for epidemiological research: I. foun-dations and basic methods. International journal of Epidemiology, 35(3):765–775. → pages 61Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an au-toregression. Journal of the Royal Statistical Society. Series B (Methodological),pages 190–195. → pages 6Jeffreys, H. (1961). Theory of probability. → pages 7Kadane, J. B. and Lazar, N. A. (2004). Methods and criteria for model selection.Journal of the American statistical Association, 99(465):279–290. → pages 6Kass, R. E. and Wasserman, L. (1995). A reference bayesian test for nested hy-potheses and its relationship to the schwarz criterion. Journal of the americanstatistical association, 90(431):928–934. → pages 41Laud, P. W. and Ibrahim, J. G. (1995). Predictive model selection. Journal of theRoyal Statistical Society. Series B (Methodological), pages 247–262. → pages 9Lichman, M. (2013). UCI machine learning repository. → pages 25Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1/2):187–192. →pages 875Lindley, D. V. (1997). Some comments on bayes factors. Journal of StatisticalPlanning and Inference, 61(1):181–189. → pages 9Nelder, J. A. and Baker, R. J. (1972). Generalized linear models. Encyclopedia ofstatistical sciences. → pages 6Newton, M. A. and Raftery, A. E. (1994). Approximate bayesian inference by theweighted likelihood bootstrap (with discussion). Journal of the Royal StatisticalSociety Series B. v56, pages 1–48. → pages 16O’Hagan, A. (1995). Fractional bayes factors for model comparison. Journal ofthe Royal Statistical Society. Series B (Methodological), pages 99–138.→ pages10O’Hagan, A. (1997). Properties of intrinsic and fractional bayes factors. Test,6(1):101–118. → pages 10Pena, D. and Tiao, G. (1992). Bayesian robustness functions for linear models. jmbernardo, jo berger, ap dawid and afm smith. → pages 12Pettit, L. and Young, K. (1990). Measuring the effect of observations on bayesfactors. Biometrika, 77(3):455–466. → pages 7Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logisticmodels using po´lya–gamma latent variables. Journal of the American statisticalAssociation, 108(504):1339–1349. → pages 56, 57Raftery, A. E. (1996). Approximate bayes factors and accounting for model un-certainty in generalised linear models. Biometrika, 83(2):251–266. → pages60Raftery, A. E., Newton, M. A., Satagopan, J. M., and Krivitsky, P. N. (2006).Estimating the integrated likelihood via posterior simulation using the harmonicmean identity. → pages 16, 17Schwarz, G. et al. (1978). Estimating the dimension of a model. The annals ofstatistics, 6(2):461–464. → pages 5Shibata, R. (1980). Asymptotically efficient selection of the order of the modelfor estimating parameters of a linear process. The Annals of Statistics, pages147–164. → pages 6Smith, A. F. and Spiegelhalter, D. J. (1980). Bayes factors and choice criteria forlinear models. Journal of the Royal Statistical Society. Series B (Methodologi-cal), pages 213–220. → pages 876Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions.Journal of the royal statistical society. Series B (Methodological), pages 111–147. → pages 11, 13Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S.Springer, New York, fourth edition. ISBN 0-387-95457-0. → pages 58Zenllner, A. (1986). On assessing prior distributions and bayesian regression anal-ysis with g-prior distributions. Bayesian Inference and Decision Techniques:Essaya in Honor of Bruno de Finetti. Amsterdam: North-Hollar. → pages 4277
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Approximation of the formal Bayesian model comparison...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Approximation of the formal Bayesian model comparison using the extended conditional predictive ordinate… Hoque, Md Rashedul 2017
pdf
Page Metadata
Item Metadata
Title | Approximation of the formal Bayesian model comparison using the extended conditional predictive ordinate criterion |
Creator |
Hoque, Md Rashedul |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | The optimal method for Bayesian model comparison is the formal Bayes factor (BF), according to decision theory. The formal BF is computationally troublesome for more complex models. If predictive distributions under the competing models do not have a closed form, a cross-validation idea, called the conditional predictive ordinate (CPO) criterion can be used. In the cross-validation sense, this is a ''leave-out one'' approach. CPO can be calculated directly from the Monte Carlo (MC) outputs, and the resulting Bayesian model comparison is called the pseudo Bayes factor (PBF). We can get closer to the formal Bayesian model comparison by increasing the ''leave-out size'', and at ''leave-out all'' we recover the formal BF. But, the MC error increases with increasing ''leave-out size''. In this study, we examine this for linear and logistic regression models. Our study reveals that the Bayesian model comparison can favour a different model for PBF compared to BF when comparing two close linear models. So, larger ''leave-out sizes'' are preferred which provide result close to the optimal BF. On the other hand, MC samples based formal Bayesian model comparisons are computed with more MC error for increasing ''leave-out sizes''; this is observed by comparing with the available closed form results. Still, considering a reasonable error, we can use ''leave-out size'' more than one instead of fixing it at one. These findings can be extended to logistic models where a closed form solution is unavailable. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-08-28 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0354974 |
URI | http://hdl.handle.net/2429/62844 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_november_hoque_mdrashedul.pdf [ 592.95kB ]
- Metadata
- JSON: 24-1.0354974.json
- JSON-LD: 24-1.0354974-ld.json
- RDF/XML (Pretty): 24-1.0354974-rdf.xml
- RDF/JSON: 24-1.0354974-rdf.json
- Turtle: 24-1.0354974-turtle.txt
- N-Triples: 24-1.0354974-rdf-ntriples.txt
- Original Record: 24-1.0354974-source.json
- Full Text
- 24-1.0354974-fulltext.txt
- Citation
- 24-1.0354974.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0354974/manifest