ayesian Inference in the Multivariate Probit Model Estimation of the Correlation M a t r i x by Aline Tabet A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Statistics) ; The University Of British Columbia August, 2007 © Aline Tabet 2007 Abstract Correlated binary data arise in many applications. Any analysis of this type of data should take into account the correlation structure among the variables. The multivariate Probit model (MVP), introduced by Ashford and Snowden (1970), is a popular class of models particularly suitable for the analysis of correlated binary data. In this class of models, the response is multivariate, correlated and discrete. Generally speaking, the M V P model assumes that given a set of explanatory variables the multivariate response is an indicator of the event that some unobserved latent variable falls within a certain interval. The latent variable is assumed to arise from a multivariate normal distribution. Difficulties with the multivariate Probit are mainly due to computation as the likelihood of the observed discrete data is obtained by integrating over a multidimensional constrained space of latent variables. In this work, we adopt a Bayesian approach and develop an an efficient Markov chain Monte Carlo algorithm for estimation in M V P models under the full correlation and the structured correlation assumptions. Furthermore, in addition to simulation results, we present an application of our method to the Six Cities data set. Our algorithm has many advantages over previous approaches, namely it handles identifiability and uses a marginally uniform prior on the correlation matrix directly. Table of Contents . Abstract ii Table of Contents iii List of Tables vi List of Figures . viii Acknowledgements xi Dedication xii I Thesis 1 1 Introduction 2 1.1 Motivation 2 1.2 Outline 4 2 The Multivariate Probit Model 6 2.1 Model Specification and Notation 6 2.2 Difficulty with Multivariate Probit Regression: Identifiability 8 2.3 Bayesian Inference in Multivariate Probit Models 10 2.3.1 Prior Specification on ft 11 2.3.2 Prior Specification on the correlation matrix R . . . . 12 3 Correlation Estimation in the Saturated Model 16 3.1 Introduction 16 iii Table of Contents 3.2 Parameter Expansion'and Data Augmentation . . . . . . . . 16 3.2.1 Data Augmentation ' 16 3.2.2 Parameter Expansion for Data Augmentation . . . . 17 3.2.3 Data Transformation 18 3.3 Proposed Model 18 3.3.1 Imputation Step - 19 3.3.2 Posterior Sampling Step 19 3.4 Simulations : 24 3.4.1 Results for T = 3 25 3.4.2 Results for T = 8 34 3.4.3 Convergence Assessment 42 3.5 Application: Six Cities Data 46 4 Correlation Estimation in the Structured Model 50 4.1 Introduction 50 4.2 Conditional Independence 50 4.3 Gaussian Graphical Models 51 4.3.1 Graph Theory 52 4.3.2 The Hyper-inverse Wishart Distribution 54 4.4 Marginally Uniform Prior for Structured Covariance 55 4.5 PX-DA in Gaussian Graphical Models 58 4.6 Simulations 61 4.6.1 Loss Under the Saturated Model and the Structured Model 61 4.6.2 Effect of Decreasing Sample Size 62 4.6.3 Prediction Accuracy 69 4.7 Application: Six Cities Data Revisited 70 5 Conclusion 74 5.1 Summary 74 5.2 Extensions, Applications, and Future Work 75 Bibliography 77 Table of Contents II Appendices 81 Appendices A Distributions and Identities 82 A . l The Multivariate Normal (Gaussian) Distribution 82 A.2 The Gamma Distribution 82 A.3 The Standard Inverse Wishart Distribution 82 B Marginal Prior on R proof from Barnard et al. (2000) . . 84 C Computation of the Jacobian | J : Z —+ W\ 88 D Sampling from Multivariate truncated Gaussian 89 E Sampling from the Hyper Inverse Wishart Distribution (Car-valho et al., 2007) 92 F Simulation Results 94 List of Tables 2.1 Summary of how identifiability has been handled in some pre-vious work 11 3.1 Correlation results from simulations for T = 3 26 3.2 Regression coefficients results from simulations for T = 3 . . . 27 3.3 Correlation Results from simulations for T = 8 35 3.4 Regression coefficients results from simulations when T — 8 . 35 3.5 Six Cities Data: Posterior estimates using Marginal Prior, M L E estimate using M C E M and Posterior estimates using the Jointly Uniform Prior (Chib and Greenberg (1998)) . . . 47 4.1 Simulation results: Entropy and quadratic loss averaged over 50 data sets generated by different correlation matrices with the same structure 62 4.2 Entropy and Quadratic loss obtained by estimating the true correlation and partial correlation matrix with the PX-DA algorithm under the saturated and structured model assump-tion 64 4.3 Simulation results on the unconstrained correlation coeffi-cients corresponding to the model in 4.1, with n = 100, T = 8 based on N — 5000 Gibbs samples 66 4.4 Simulation results on the constrained correlation coefficients corresponding to the model in 4.1, with n = 1000, T = 8 based on N = 5000 Gibbs samples 67 vi List of Tables 4.5 Simulation results on the unconstrained correlation coeffi-cients corresponding to the model in 4.1, with n = 200, T — 8 based on N = 5000 Gibbs samples. 68 4.6 Simulation results on the constrained correlation coefficients corresponding to the model in 4.1, with n = 200, T = 8 based on N = 5000 Gibbs samples 69 4.7 Six Cities Data: Posterior estimates under structured model assumption, M L E estimate using M C E M and Posterior es-timates using the Jointly Uniform Prior under a saturated model assumption(Chib and Greenberg (1998)) 71 F . l Simulation results: Entropy and quadratic loss for 50 data sets generated by different correlation matrices with the same structure 94 F.2 Table F continued 95 List of Figures 2.1 A graphical representation of the model in 2.3 under a full correlation structure. Observed nodes are shaded 7 2.2 Marginal prior density for r j 2 when T = 3 and T = 10 under the jointly uniform prior p(R) oc 1, based on 2000 draws. (Figure 1 reproduced from Barnard et al. (2000)) . . . . . . . 13 2.3 Marginal correlations obtained using the prior in 2.12 by sam-pling from a standard inverse Wishart with degrees of freedom v = T + l 14 3.1 Correlation estimates for p = 0.4 , T — 3 and increasing sample size from n = 100 to n=1000 28 3.2 Correlation estimates for p = 0.8 , T = 3 and increasing sample size from n — 100 to n=1000 29 3.3 0 estimates for p = 0.4 , T = 3 and sample size n — 100 . . . 30 3.4 0 estimates for p = 0.4 , T = 3 and sample size n = 1000 . . . 31 3.5 0 estimates for p = 0.8 , T = 3 and sample size n = 100 . . . 32 3.6 0 estimates for p = 0.8 , T — 3 and sample size n — 1000 . . . 33 3.7 Correlation estimates for p — 0.2 , T = 8 and increasing sample size from n = 100 to n=1000 36 3.8 Correlation estimates for p = 0.6 , T = 8 and increasing sample size from n = 100 to n=500 37 3.9 0 estimates for p — 0.2 , T = 8 and sample size n — 100 . . . 38 3.10 0 estimates for p = 0.2 , T — 8 and sample size n = 1000 . . . 39 3.11 0 estimates for p = 0.6 , T = 8 and sample size n = 100 . . . 40 3.12 0 estimates for p = 0.6 , T = 8 and sample size n — 500 . . . 41 viii List of Figures 3.13 n = 100, T = 3, Trace plots as the number of iterations increase from JV = 500 to N = 5000 post "Burn-in". The algorithm has started to converge after about 1000 iteration post "Burn-in" . . . 43 3.14 n = 100, T = 3, Autocorrelation plots of a randomly chosen parameter from correlation matrices for the cases where the marginal correlations is p = 0.2, p = 0.4, p — 0.6, and p — 0.8 44 3.15 Trace plots of the cumulative mean and cumulative standard deviation of randomly chosen parameters from correlation matrices as the correlation is varied from p = 0.2, p = 0.4, p = 0.6, and p = 0.8 and n = 100, T = 3. The vertical line marks" the "Burn-in" value (500) used in the simulations . . . 45 3.16 Six Cities Data: Trace plots and density plots of the corre-lation coefficients. The vertical lines denote 95 % credible interval and the line in red indicates the posterior mean re-ported by Chib and Greenberg (1998) 48 3.17 Six Cities Data : Trace plots, density plots and autocorrela-tion plots of the regression coefficients. Vertical lines denote 95 % credible interval and the line in red indicates the poste-rior mean reported by Chib and Greenberg (1998) . 49 4.1 A graphical representation of a structured MVP. model for T — 3. The edge between Zn and is missing, this is equivalent to f 13 = 0. This structure is typical of longitudinal models where each variable is strongly associated with the one before it and after it, given the other variables in the model. . 52 4.2 A graphical model with T = 7 vertices. In this graph, Z\ is a neighbor of Z2. Z3, 2*2, and Z-j form a complete subgraph or a clique. This graph can be decomposed into two cliques {Zi, Z2, Z3, Z5, Z4} and {Z$, Zq, Z7}. {Z^} separates the two cliques 54 4.3 Marginal distributions of the prior on the correlation matrix corresponding to the model in 4.1 57 List of Figures 4.4 Illustration of the marginally uniform prior on the structure of the graph in figure 4.2. In this graph we have unequal clique sizes where | C i | = 5 and |C?2| — .3 58 4.5 Box plot of the entropy and quadratic loss obtained by gen-erating data from 50 correlation structures and computing the loss function under the full correlation structure versus a structured correlation structure . 63 4.6 Six Cities Data: Correlation and partial correlation estimates 72 4.7 Six Cities Data : Trace plots, density plots and autocorre-lation plots of the regression coefficients under a structured model assumption. Vertical lines denote 95 % credible inter-val and the line in red indicates the posterior mean reported by Chib and Greenberg (1998) 73 x Acknowledgements I would like to thank my supervisors Dr. Arnaud Doucet and Dr. Kevin Murphy. This work would not have been possible without their valued advice and suggestions. I also thank the staff and faculty members of the Statistics Department at UBC, in particular, Dr. Paul Gustafson, Dr. Harry Joe and Dr. Matias Salibian-Barrera, for their help, advice and mentorship. I am forever grateful to my family, Salma, Ghassan, Najat, Sal and Rhea, for their continued support and encouragement. The numerous sacrifices they made over the last few years allowed me to pursue my aspirations, and reach important milestones in my professional career. Finally I want to thank my friends and fellow graduate students, both in the Statistics Department and in Computer Science, for providing theoretical advice, computer support and numerous help, but most importantly for making the last two years a memorable journey. xi Dedication To.my mom and dad, your love and support makes everything possible. Part I Thesis C h a p t e r 1 I n t r o d u c t i o n 1.1 Motivat ion Correlated discrete data, whether be it biliary, nominal or ordinal, arise in many applications. Examples range from the study of group randomized clinical trials to consumer behavior, panel data, sample surveys and lon-gitudinal studies. Modeling dependencies between binary variables can be done using Markov random fields (e.g., Ising models). However, an attrac-tive alternative is to use a latent variable model, where the observed binary variables are assumed independent given latent Gaussian variables, which are correlated. An example of such model is the multivariate Probit model (MVP), introduced by Ashford and Snowden (1970). In this class of models, the response is multivariate, correlated and discrete. Generally speaking, the M V P model assumes that given a set of explanatory variables the multivari-ate response is an indicator of the event that some unobserved latent variable falls within a certain interval. The latent variable is assumed to arise from a multivariate normal distribution. The likelihood of the observed discrete data is then obtained by integrating over the multidimensional constrain space of latent variables. where i = 1,..., n indexes the independent observation, j — 1,..., T indexes the dimension of the response, Y^ is a T-dimensional vector taking values in {0,1}, Aij is the interval (0, oo) if Yij = 1 and the interval (—oo, 0] otherwise, 0 is the regression coefficient, S is the covariance matrix, and d>r(Zi\Xi, 0, B.) is the probability density function of the standard normal (1.1) 2 Chapter 1. Introduction distribution defined in A . l . The M V P model has been proposed as an alternative to the multivariate logistic model, which is defined as: p ( y , ^ i | ^ , / ? , E ) = e x p ( x ^ (i.2) £*=i exP Wife) The appeal of the probit model is that it relaxes the independence of the irrelevant alternatives (HA) property assumed by the logit model. This IIA property assumption states that if choice A is preferred to choice B out of the choice set {A,B}, then introducing a third alternative C, thus expanding the choice set to {A,B,C} must not make B preferred to A. This means that adding or deleting alternative outcome categories does not affect the odds among the remaining outcomes. More specifically in the logistic regression model, the odds of choosing m versus n does not depend on which other outcomes are possible. That is, the odds are determined only by the coefficient vectors for m and n, namely Bm and 3n: P(Yim = lpfr,/?,S) _ exp(s ; /U/ELiexp(z ; /? f c ) _ . ( P(Yin = l\XuP, E) exp(^/? n)/ D J L 1 exp(z^) ^ { P m M ) (1.3) In many cases, this is considered to be an unrealistic assumption (see for example McFadden (1974)), particularly when the alternatives are similar or redundant as is the case in many econometric applications. Until recently, estimation of M V P models, despite its appeal, has been difficult due to computational intractability especially when the response is high dimensional. However, recent advances in computational and simula-tion methods made this class of models more widely used. Both classical and Bayesian methods have been extensively developed for estimation of these models. For a low dimensional response, finding the maximum likelihood estimator numerically using quadrature methods for solving the multidimensional integral is possible, but becomes quickly intractable as the number of dimensions T increases usually past 3. 3 \ Chapter 1. Introduction Lerman and Manski (1981) suggest the method of simulated maximum likelihood (SML). This method is based on Monte Carlo simulations to ap-proximate the high dimensional integral to estimate the probability of each choice. McFadden (1989) introduced the method of simulated moments (MSM). This method also requires simulating the probability of each out-come based on moment conditions. Natarajan et al. (2000) introduced a Monte Carlo variant of the Expectation Maximization algorithm (MCEM) to find the maximum likelihood estimator without solving the high dimen-sional integral. Other frequentist methods were also developed using Gen-eralized Estimation Equations (GEE) (eg. Chaganty and Joe (2004)). On the Bayesian side, Albert and Chib (1993) introduced a method that involves a Gibbs Sampling algorithm using data augmentation for the univariate probit model. McCulloch and Rossi (1994) extended this model to the multivariate case. The Bayesian method entails iteratively'alternating between sampling the latent data and estimating the unknown parameters by drawing from their conditional distributions. The idea is that under mild conditions, successive sampling from the conditional distributions produces a Markov chain which converges in distribution to the desired joint conditional distribution. Other work on the Bayesian side includes that of Chib and Greenberg (1998), and more recently Liu (2001), Liu and Daniels (2006), and Zhang et al. (2006). These methods will be examined in more detail in Chapter 2. Geweke et al. (1994) compared the performance of the classical frequen-tist methods SML and MSM with the Bayesian Gibbs sampling method and found the Bayesian method to be superior especially when the covariates are correlated and the error variances vary across responses. 1.2 Outline In this work we adopt a Bayesian approach for estimation in the multivariate Probit class of models. The multinomial and the ordinal models are gener-alizations.of the binary case. The multivariate binary response is a special case of the multinomial response with only two categories. The ordinal 4 Chapter 1. Introduction model is also a special case of the multinomial model, where the categories are expected to follow a certain order. Al l the methods developed herein are developed for the multivariate binary model, but could be easily extended to include the multinomial and ordinal cases. The aim is to find a general framework to estimate the parameters required for inference in the M V P model, especially in high dimensional problems. We particularly focus,on the estimation of an identifiable correlation matrix under a full correlation assumption and a constrained partial correlation assumption. This thesis will be structured as follows: In Chapter 2, we introduce the notation' that will be used throughout the thesis. We discuss the problem of identifiability in the M V P class of models. We briefly compare several possible choices of prior distributions for Bayesian modeling, as well as review some methods that have been proposed in the literature to deal with identifiability and prior selection. In Chapter 3, we detail a method for estimating an identifiable corre-lation matrix under the saturated model. The saturated model admits a full covariance matrix where all off-diagonal elements are assumed. to be non-zero. We show simulation results on a low dimensional and a higher di-mensional problem. Finally, we further investigate the method, by applying it to a widely studied data set: The Six Cities Data. In Chapter 4, we extend the method developed in Chapter 3 to the case where a structure on the partial correlation matrix is imposed. To do so, we motivate the use of Gaussian graphical models and the Hyper-inverse Wishart Distribution. We provide a general introduction to Gaussian graphical models, and we adapt the algorithm and the priors developed in Chapter 3 to the new framework. Throughout this chapter, we assume that the structure of the inverse correlation matrix is known and given. Simulation results are presented as well as an application to the Six Cities Data set from Chapter 3. We conclude in Chapter 5, by summarizing the work and the results. We also discuss possible'extensions, applications and future work. 5 C h a p t e r 2 The Multivariate Probit Mode l 2.1 Mode l Specification and Notation The multivariate Probit model assumes that'each subject has T distinct bi-nary responses, and a matrix of covariates that can be any mixture of discrete and continuous variables. Specifically, let Y{ — (Yn,..., YIT) denote the T-dimensional vector of observed binary 0/1 responses on the ith subject, i = 1,..., n. Let Xi be a T x p design matrix, and let Zi — (Zn,..., Zix)' denote a T-variate normal vector of latent variables such that Zi^XiP + eu i = l,...,n (2.1) The relationship between Zij and Y^ in the multivariate probit model is given by f 1 if Za > 0; , x = { n J° . J = h--.,T (2.2) I 0 otherwise. So that P(Yi = l\/3,Z) = $(Zi) Zi ~ N(Xi(3,Z) ' ' (2.3) where $ is the Probit link which denotes the cumulative distribution function of the normal distribution as defined in A . l . Here (3 — (0\ , . . . ,@T) is a pxT matrix of unknown regression coefficients, e» is a T x 1 vector of residual error distributed as NT(0, E), where E is the T x T correlation matrix of Zj. 6 Chapter 2. The Multivariate Probit Model The posterior distribution of Zi is given by T f(Zi | Yi, 6, R) oc MZi\Xit 0, R) HiHzij > 0)I(yij = 1) + I(Zij < 0)I(yij = 0)} 3=1 (2-4) This is a multivariate truncated Gaussian where (pr{Z) is the probability density function of the normal distribution as in A . l . The likelihood of the observed data Y is obtained by integrating over the latent variables Z: P(Yi = yi\Xi,3,R)= [ ... f ^T{Zi\Xh0,R)dZi (2.5) J AiT J An 7 Chapter 2. The Multivariate Probit Model where A^ is the interval (0, oo) if Y^ — 1 and the interval (—oo, 0] otherwise. This formulation of the model is most general, since it allows the re-gression parameters as well as the covariates to vary across categories T. In this work, we let the covariates vary across categories, however, we con-strain the regression coefficients (3 to be fixed across categories by requiring 01 = • • • = PT — 0-2.2 Difficulty with Multivariate Probit Regression: Identifiability In the multivariate Probit model, the unknown parameters (/?, S) are not identifiable from the observed-data model (e.g: Chib and Greenberg (1998), Keane (1992)). This could be easily seen if we scale Z by a constant c > 0, we get cZ = c(Xp + e) (2.6) = X(cP) + ce (2.7) from equation 2.2, clearly Y will have the same value given Z and given cZ, which means that the likelihood of Y\X,P,Y, is the same as that of Y\X, cP, c 2 S. Furthermore, we have no way of estimating the value of c. In order to handle this identifiability issue in MVP, restrictions need to be imposed on the covariance matrix. In the univariate case, this restriction is handled by setting the variance to one. However, imposing such a restriction in the multivariate case is a little more complicated. It is not uncommon to ignore the identifiability problem and perform the analysis on the unidentified model and post-process samples by scaling with the sampling variance using the separation strategy R = £ ) - 1 £ D - 1 , where D is a diagonal matrix with diagonal elements du — This method is adopted by McCulloch and Rossi (1994), and is widely used (e.g Edwards and Allenby (2003)). Many researchers are uncomfortable working with unidentified parame-ters. For instance, ignoring identifiability adds difficulty in the choice of prior 8 Chapter 2. The Multivariate Probit Model distributions, since priors are placed on unidentified parameters. Therefore, if the prior is improper, it is difficult to verify that the scaled draws are from a proper posterior distribution. Koop (2003, p. 227) gives an empirical i l -lustration of the effect of ignoring identifiability. From simulation results, he shows that unidentifiable parameters have higher standard errors, and furthermore with non-informative priors there is nothing stopping estimates from going to infinity. McCulloch et al. (2000) address identifiability by setting the first diago-nal element of the covariance matrix an = 1- However, this means that the standard priors for covariance could no longer be used, they propose a prior directly on the identified parameters, but their method is computationally expensive, and is slow to converge as pointed out by Nobile (2000). Nobile suggests an alternative way of normalizing the covariance by drawing from an inverse Wishart conditional on an — 1 (Linardakis and Dellaportas, 2003). The approach of constraining one element of the covariance adds difficulty in the interpretability of the parameters and priors, and is computationally demanding and slow to converge. Other approaches impose constraints on E _ 1 , the precision matrix., Webb and Forster (2006) parametrize E - 1 in terms of its Cholesky decomposition: E _ 1 = <&TAtyT. In this parametrization, ^ is an upper triangular matrix with diagonal elements equal to 1, and A is a diagonal matrix. The elements of ^ could be regarded as the regression coefficients obtained by regressing the latent variable on its predecessors. Each Xjj is interpreted as the con-ditional precision of the latent data corresponding to variable j given the latent data for all the variables preceding j in the decomposition. Identifia-bility is addressed in this case by setting Xjj to 1. This approach only works if the data follows a specific ordering, for example time series. Dobra et al. (2004) propose an algorithm to search over possible orderings, however this becomes very computationally expensive in high dimensions. Alternatively, identifiability could be handled by restricting the covari-ance matrix E to be a correlation matrix R (Chib and Greenberg (1998)). The correlation matrix admits additional constraints, since in addition to being positive semi-definite, it is required to have diagonal elements equal to 9 Chapter 2. The Multivariate Probit Model 1 and off-diagonal elements € [—1,1]. Furthermore, just as in the covariance case, the number of parameters to be estimated increases quadratically with the dimension of the matrix. Barnard et al. (2000) use the decomposition S = DRD, and place a separate prior on R and D directly. They use a Griddy Gibbs sampler (Ritter and Tanner, 1992) to sample the correlation matrix. Their approach involves drawing the correlation elements one at time and requires setting grid sizes and boundaries. This approach is inefficient, especially in high dimensions. Chib and Greenberg (1998) use a Metropolis Hastings Random Walk algorithm to sample the correlation matrix. This is more efficient than the Griddy Gibbs approach because it draws the correlation coefficient in blocks. However the resulting correlation matrix is not guaranteed to be positive definite, which requires the algorithm to have an extra rejection step. Furthermore, as with random walk algorithms in general, the mixing is slow in high dimensions. Alternatively, some approaches use parameter expansion as described in Liu and Wu (1999) together with data augmentation, for example Liu (2001), Zhang et al. (2006), Liu and Daniels (2006), and others. The idea is to propose an alternative parametrization, to move from a constrained corre-lation space to sampling a less constrained covariance matrix and transform it back to a correlation matrix. These approaches differ mainly with the choice of priors and how the covariance matrix is sampled. The different possibilities for priors will be discussed in more detail in the next section, and an in-depth explanation of parameter expansion with data augmenta-tion algorithm is in the next Chapter. Table 2.1 gives a summary of the how identifiability has been handled in the Probit model. 2.3 Bayesian Inference in Multivariate Probit Models A Bayesian framework treats parameters as random variables and there-fore requires the computation of the posterior distribution of the unknown 10 Chapter 2. The Multivariate Probit Model Table 2.1: Summary of how identifiability has been handled in some previous work Identifiability Paper Ignored McCulloch and Rossi (1994) Restrict o\\ — 1 McCulloch et al. (2000) Nobile (2000) Restrict Xjj = 1 in S" 1 = yTAVT Webb and Forster (2006) Restrict E to R Barnard et al. (2000) Liu (2001) Liu and Daniels (2006) Zhang et al. (2006) random parameters conditional on the data. A straightforward application of Bayes rule results in the posterior distribution of ((3, R) where R is the correlation matrix, (3 is the matrix of regression coefficients, and D is the data. ir{(3,R\D) cx f{D\P,R)ir(P,R) (2.8) In order to estimate the posterior distribution, a prior distribution on the unknown parameters /3 and R needs to be specified. In the absence of prior knowledge, it is often desirable to have uninformative flat priors on the parameters we are estimating 2.3.1 Prior Specification on (3 It is common to assume that a priori (3 and R are independent. Liu (2001) propose a prior on /? that depends on R to facilitate computations. There are several other choices of priors in the literature on the regression coefficients p. The most common choice is a multivariate Gaussian distribution centered at B, with known diagonal covariance matrix typ. It is typical to choose large values for the diagonal elements of typ so that the prior on P is uninformative. This is the proper conjugate prior. In addition, without loss of generality, 11 Chapter 2. The Multivariate Probit Model we could set B to 0 TT0) ~ JVpr(0, ® / r ) (2.9) where 8 is the nT-dimensional vector obtained by stacking up the columns of the p x T regression coefficient matrix 8. In this work, we constrain the regression parameter to be constant across T. 2.3.2 Prior Specification on the correlation matrix R To handle identifiability, we restrict the covariance matrix E to be a correla-tion matrix, which means that the standard conjugate inverse Wishart prior for covariances cannot be used. Instead, a prior needs to be placed on R directly. However, as mentioned previously there does not exist a conjugate prior for correlation matrices. Barnard et al. (2000) discuss possible choices of diffuse priors on R. The first is the proper jointly uniform prior: ir(R) oc 1, fief (2.10) Where the correlation matrix space !ftr is a compact subspace of the hyper-cube [—1, l ] r ( T - 1 ) / 2 . The posterior distribution resulting from this prior is not easy to sample from. Barnard et al. use the Griddy Gibbs approach (Ritter and Tanner, 1992), which is inefficient. The approach in Chib and Greenberg (1998) uses this prior as well. Liu and Daniels (2006) use this prior for inference. However, they use a different prior to generate their' sampling proposal. It is important to note that using a jointly uniform prior would not result in uniform marginals on each rjj. Barnard et al. (2000) show that a jointly uniform prior will tend to favor marginal correlations close to 0, making it highly informative, marginally. This problem becomes more apparent as T increases (see Figure 2.2). 12 Chapter 2. The Multivariate Probit Model Another commonly used uninformative prior is the Jeffrey's prior ir(R) oc \R\—* (2.11) This prior is used by Liu (2001). Liu and Daniels (2006) use it for generating their proposal. Jointly Uniform Prior Figure 2.2: Marginal prior density for ?-12 when T = 3 and T = 10 under the jointly uniform prior p(R) oc 1, based on 2000 draws. (Figure 1 reproduced from Barnard et al. (2000)) It has been shown that in the context of parameter expansion, this prior helps facilitate computations. However, it suffers from the disadvantage of being improper. Improper priors are not guaranteed to have a proper posterior distribution and, in addition, cannot be used for model selection due to Lindley's paradox. Furthermore, it has been shown that the use of improper priors on covariance matrices is in fact informative and tends to favor marginal correlations close to ±1 (Rossi et al., 2005, Chapter 2). Alternatively, Barnard et al. (2000) propose a prior on R such that marginally each r^ is uniform on the interval [—1,1]. This is achieved by taking the joint distribution of R to be: 13 Chapter 2. The Multivariate Probit Model (2.12) The above distribution is difficult to sample from directly. However, they show that sampling from it can be achieved by sampling from a standard inverse Wishart with degrees of freedom equal to v = T+l and transforming Marginal Correlations Figure 2.3: Marginal correlations obtained using the prior in 2.12 by sampling from a standard inverse Wishart with degrees of freedom v — T + 1 back to a correlation matrix using the separation strategy (E = DRD). The proof is reproduced in Appendix B and the result is illustrated in Figure 2.3. The marginally uniform prior seems convenient, since it is proper and we are able to compute its normalizing constant. It does not push correlations toward 0 or ±1 even in high dimensions. Most importantly, because it is proper, it opens the possibility for Bayesian model selection. However, multiplying together the distribution of Z in equation 2.4 and the marginally uniform prior in 2.12, results in a posterior distribution that is complicated and not easily sampled from. 14 Chapter 2. The Multivariate Probit Model Nevertheless, we show in the next chapter that the marginal prior, when used in the context of parameter expansion, is actually computationally convenient for sampling from the posterior distribution. 15 C h a p t e r 3 Correlation Estimation in the Saturated Mode l 3.1 Introduction As we have seen from the previous chapter, inference in the M V P model is complicated due to the identifiability issue which requires constraining the covariance to be a correlation matrix. There is no conjugate prior for cor-relation matrices and therefore the posterior is not easily sampled from. In this Chapter, we build on previous work and adopt a Bayesian approach that uses a combination of Gibbs sampling and data augmentation. Furthermore, we use a re-parametrization leading to an expansion of the parameter space. ,This helps significantly with the computation of the posterior distribution. We focus on R being a full T x T correlation matrix. 3.2 Parameter Expansion and Data Augmentation 3.2.1 Data Augmentation Data Augmentation (DA) is an algorithm introduced by Tanner and Wong (1987), very popular in statistics, used mainly to facilitate computation; These methods center on the construction of iterative algorithms by intro-ducing artificial variables, referred to as "missing data" or latent variables. These variables may or may not have a physical interpretation but are mainly there for computational convenience. Let Y be the observed data, and 8 be the unknown parameter of interest. 16 Chapter 3. Correlation Estimation in the- Saturated Model If we are interested in making draws from f(Y\9), the idea is to find a latent variable Z such that the joint distribution f(Y, Z\9) is easily sampled from. The distribution of the observed data model is recovered by marginalizing the latent variable: f(Y\9) = f f(YZ\6)dZ (3.1) Algori thm 3.1 Data Augmentation At iteration i 1. Draw Z ~ f(Z\9, Y) oc /(Y, Z\9) 2. Draw 9 ~ f(9\Z, Y) oc f(Y, Z\9)f{9) } The data augmentation algorithm 3.1 iterates between an imputation step where the latent variables are sampled and a posterior estimation step until convergence. The samples of the unknown parameter 9 could then be used for inference. 3.2.2 Parameter Expansion for Data Augmentation Parameter Expansion for Data Augmentation (PX-DA) , introduced by Liu and Wu (1999), is a technique usually useful for accelerating convergence. The idea is that if we can find an hidden parameter a in the complete data model f(Y,Z\9), we can then expand this model to a larger model p(Y,W\0,a), that would preserve the distribution of the observed data model: Jp(Y,W\0,a)dW = f(Y\9) (3.2) We adopt the notation used in Liu and Wu (1999), and use W instead of Z and p instead of / to denote the latent data and the distributions under the expanded model. To implement the DA algorithm in this setting, a joint prior on the expansion parameter a and the original parameter of interest 9 needs to be specified such that the prior on 9 is the same under the original model and the expanded model (Jp(9,a)da = /(#)). This can be done by maintaining the prior for 9 at f(9) and specifying a prior p(a\9). 17 Chapter 3. Correlation Estimation in the Saturated Model By iterating through the steps of algorithm 3.2, we are able to achieve a faster rate of convergence than the DA algorithm in 3.1. Algori thm 3.2 PX-DA Algorithm At iteration i 1. Draw (a, W) jointly by drawing a ~ p(a\6) W ~ p(W\6,a,Y) <xp(Y, W\6,a) 2. Draw (a, 6) jointly by drawing a, 6\Y, W ~ p(Y, W\e, a)p(a\e)f(d) 3.2.3 Data Transformation Under certain conditions, an alternative view of the PX-DA treats W as the result of a transformation on the latent data Z induced by the expansion parameter a (Liu and Wu, 1999, Scheme 1). For this interpretation to hold, a transformation Z — ta(W), needs to be defined such that for any fixed value of a, ta(W) is a one-to-one differentiable mapping between Z and W: p(Y,W\6,a) = f<Y,ta<W)\0)\Ja'W)\ (3.3) where | J a (W) | is the determinant of the Jacobian of the transformation Ta evaluated at W. The algorithm is detailed in 3.3. Note that in the second step of algorithm 3.3, a is sampled from its prior distribution. This interpretation of the PX-DA algorithm is particularly useful in the case of M V P regression. 3.3 Proposed Model In the model we are proposing, we want to use PX-DA mainly to simplify computation. We adopt the scheme described in algorithm 3.3 (correspond-18 Chapter 3. Correlation Estimation in the Saturated Model Algori thm 3.3 PX-DA Algorithm/ Data Transformation (scheme 1) At iteration i 1. Draw Z ~ f(Z\Y,6), compute W = t~l(Z) 2. Draw (a, 9) jointly conditional on the latent data a,6\Y,W ~ p(Y'ta(W)\e)\Ja(W)\p(a\e)f(6) ing to scheme 1 in Liu and Wu (1999)). 3.3.1 Imputation Step Let 9 = (R,0), be the identifiable parameter of interest. The first step of algorithm 3.3, involves drawing Z conditional on the identifiable parameter 9. This is achieved by sampling from a multivariate truncated Gaussian as in equation (2.4). For the generation of multivariate truncated Gaussian variables, we fol-lowed the approach outlined in Appendix D. This approach uses Gibbs steps to cycle through a series of univariate truncated Gaussians. In each step Zij is simulated from Zij\Zi-j,-B, R, which is a univariate Gaussian distribution truncated to [0, oo) if Yjj = 1 and to (—oo, 0] ii'Yij — 0. The parameters of the untruncated distribution Zij\Zit-j, 8, R are obtained from the usual formulae for moments of conditional Gaussians. 3.3.2 Posterior Sampling Step Given the latent data sampled in step 1, we would like to draw (a, 9) from its posterior distribution. In order to implement step 2 of algorithm 3.3, we need to find an expansion parameter a, not identifiable from the observed data model, but identifiable from complete data-model. Subsequently, we need to define a transformation on the latent data. 19 Chapter 3. Correlation Estimation in the Saturated Model Denning the Expansion Parameter and the Transformation If we let Z = ta(W) = D~lW (3.4) or alternatively W = DZ, where D is a diagonal matrix with positive diagonal elements da = y/^u. The scale parameter D is not identifi-able. For reasons which will become clear later, we could conveniently pick a = ( a i , . . . , ar) to be a function of D by taking « = 2 * ( 3 ' 5 ) where r M is the ith diagonal element of R~l and di is the ith diagonal element of D. In this case, for any fixed value of a, D is a one-to-one function of a and t<x(W) i s a one-to-one differentiable mapping between Z and W. This choice of a is not arbitrary. It is conveniently picked so that when combined with the prior of (R,(3), the transformed likelihood, and the Ja-cobian, it results in a posterior distribution that is easily sampled from. The Transformed Complete Likelihood: p(Y,ta(W)\6)\Ja(W)\ For a given a, the determinant of the Jacobian 1 resulting by going from (Z —> W) under the transformation in 3.11 is given by: \J:Z-*W\ = 0 ( Z l , . . . Z n ) d(Wl...Wn). | ( / n ® D _ 1 ) | \D\-n (3.6) (3.7) (3.8) Combining the complete likelihood in equation 2.4 with the Jacobian, and after doing some algebra, we get: xsee a 3 x 3 example in Appendix C 20 Chapter 3. Correlation Estimation in the Saturated Model p(Y,ta(W)\0,R) \J:Z^W\ = p(Y,Z\0,R) x\J : Z ->W\ (3.9) = \R\S exp (y~\Y,{Zi - XiPYRT^Zi - X ^ x \J : Z - W\ = \D\-n\R\-% exp (-1 Y£D(Zi- XifiWiDRDy^DiZi - Xtf))^ = \DRD\-? exp (-1 J2(Wi - XiBDY{pRD)-x(Wi - XiPD^j If we define E = DRD 1 (3.10) e* = D{Z-Xp) (3.11) We can re-write the likelihood under the expanded data model in equa-tion 3.10 as p(Y,ta(W)\R,P)\Ja(W)\ oc | E | - t expir ( i r V e * ) (3.12) The Prior: p0(a\9)f(8) For Bayesian inference, we need to define a joint prior on 9 = (P, R) and a. We assume that P and R are independent a priori so that -K(P, R, a) — pQ(a\R)f(R)f(P). . Under the transformation £ = DRD, Barnard et al. (2000) showed that if we take E to be a standard inverse Wishart distribution as in A.4 we can re-write the distribution of E as in B . l l : tt(E) = 7r(a, R) x | J : E -> D, R\ = f(R)p(a\R) (3.13) Where with a particular choice of parameters, namely v = T + 1, the distribution f{R) is as in 2.12 such as each is uniform on the interval 21 Chapter 3. Correlation Estimation in the Saturated Model [—1,1]. Furthermore, the distribution of po(a\R) is Gamma with shape parameter (T + l)/2 and rate parameter 1. Therefore, we are able to get the desired prior distribution n(a\R)ir(R) by sampling E from a standard inverse Wishart with degrees of freedom v — T + 1, and transforming using S = DRD. Here, we like to point out that both the prior distributions of R and (3 are the same under the expanded model and the observed data model. This is a condition required for the PX-DA algorithm. In addition, we note that R and.a are not a priori independent. The independence of these parameters is a necessary condition only to prove the optimality of the convergence of algorithm 3.3. In this case, their independence is not key since we are using PX-DA mainly for the convenience in that it results in a posterior distributions that is easily sampled from. Posterior Distribution of (a, 0) Now that we have specified the expanded likelihood and prior on the param-eters of interest (R, (3) and the expansion parameter a, the joint posterior distribution of (/?, R, a) conditional on the latent data can be computed: B, R, a\Y, W ~ p{Y, ta(W)\3, R)\Ja(W)\f(R)f((3)Po(a\R) (3.14) where ta(W) = Z = D~lW is the transformation of the latent data and |J Q (W)| is the determinant of the Jacobian of going from Z —> W. We could therefore put together the likelihood in 3.12 and the marginally uniform prior on R in 2.12, the Gamma prior on a in 3.13, and the prior on (3 in 2.9, we get: Tr(R,a,8\Y,W) oc |E|~2 exp tr (E_1e*'e*) x I R l ^ - ' i H | i ? i i | ) - ( T + 1 ) / 2 x Gamma ( i ^ x exp(/?V^/?) T+ 1 ,1 (3.15) 22 Chapter 3. Correlation Estimation in the Saturated Model where the Gamma distribution is defined as in A.2. In order to sample from the joint posterior distribution in 3.15, we use a Gibbs Sampling framework, where we sample B\Z,R and then sample R, ct\W. Since given R, the parameter 3 is identifiable, we sample it prior to transforming the data. Straightforward computations give the posterior distribution of B\Y, Z, R. The normal distribution is the conjugate prior, therefore the posterior dis-tribution of 3 will also follow a multivariate normal distribution with mean parameters 3* and covariance \&jg where n % = Vf3 + Y . X i R ~ l X i i=l 3* = %-l(f^X[R-lZ) The joint posterior ir(R,a\Y, W,B) can be obtained from 3.15: ir(R,a\Y,W,B) oc |£ |~ t exptr ( i T V e * } (3.16) x | f i | ^ - i ( T J | i ? i i | ) - ( T + i ) / 2 x G a m m a j ' I + l ! i ^ We perform a change of variable S = DRD: 7r(S|y, W, B) oc 7r(i?, a\Y, W, 3) x \ Ja : (D, R) -> S| = | E | - t e x p i r ( S - 1 e V } x |S | -5 2 ( r + 1 ) e xp( - - i r (S- 1 ) ) = | E | " ^ i y + r + 1 ) e x p ( - ^ r ( E - 1 5 ) ) (3.17) This is an inverse Wishart distribution with v = n + T+ 1 and S = e*'e*. The second line in the equation above is obtained by reversing the steps of the proof in Appendix B. 23 Chapter 3. Correlation Estimation in the Saturated Model Algori thm 3.4 Full PX-DA Sampling Scheme in Multivariate Probit At iteration i 1. Imputation Step • Draw Z ~ f(Z\Y,3,R) from a truncated Multivariate Normal distribution TMVN(X0,R) as described in Appendix D. 2. Posterior Sampling Step Draw (0, R, a) jointly conditional on the la-tent data : • Draw 0\Z,Y,R from a Multivariate Normal distribution 0 ~ MVN(B*,^) • Draw a ~ po(a\R) from a Gamma distribution G(^^-, 1) • Compute the diagonal matrix D, where each diagonal element di — \ £ r a n d rn is the ith diagonal element of R"1. • compute W = ta(Z) = DZ or equivalently e* = D(Z - X0). • Draw Y,W from an inverse Wishart distribution S ~ IW(u, S) where v = n + T + 1 and 5 = e*'e*. • compute R = D _ 1 S D _ 1 Repeat until convergence 3.4 Simulations In order to test the performance of the algorithm developed in the previous section, we conduct several simulation studies first with T = 3 and then we increase the dimension to T — 8. The data is simulated as follows: we generate a design matrix with p = 2 covariates from a uniform distribution from [—0.5, 0.5], we set the coefficients 0 = (—1,1)' and we generate random error from a multivariate Gaussian distribution centered at 0 and a full correlation matrix R. We fix R such that all pij off-diagonal elements are of equal value. We try for different values of./? namely 0.2, 0.4, 0.6, and 0.8. The following two loss functions are considered to evaluate the accuracy of 24 Chapter 3. Correlation Estimation in the Saturated Model the estimated correlation matrix: Li(R, R) = tr(RR~l) - log | A R - 1 1 - T (3.18) L2(R,R) = tr(RR-l-lf (3.19) Where R is the estimated correlation and R is the true correlation used to generate the data. The first loss function is the entropy loss and the second is the quadratic loss. These loss functions are discussed in more detail in Yang and Berger (1994). In each case, TV = 10000 Gibbs samples are drawn and the first 500 are discarded as "Burn-in". We tried multiple runs, to ensure convergence of results. The correlation is always initialized at the identity matrix, and the latent variables are initialized at 0." 3.4.1 Results for T = 3 For T = 3, three parameters in the correlation matrix are estimated. Table 3.1 outlines results from the simulations for the correlation matrix. The posterior median estimate is reported, the number of parameters falling within the 95% credible interval, the average interval length, as well as the entropy loss and the quadratic loss. 95% credible intervals are calculated based on 2.5% and 97.5% quantiles of the estimates. We can see that the likelihood carries more information with larger cor-relation values, estimation of the correlation becomes more accurate and confidence intervals become smaller on average. Similarly with more data, estimates become more precise and furthermore, we see a decrease in both the entropy and the quadratic loss. Except in one case (r^ = 0.2, n = 500), the true correlation coefficient was always included in the 95% credible in-terval. Figures 3.1 and 3.2, provide examples of traceplots and density plots for the correlation matrix with p^j = 0.4 and pij = 0.8 respectively. Sub-figures (a) and (b) in each case show how the density becomes narrower by increasing the sample size from n — 100 to n = 1000. Furthermore, we see 25 Chapter 3. Correlation Estimation in the Saturated Model that the algorithm mixes very well and converges fast. Table 3.1: Correlation results from simulations for T = 3 Sample CI Contains Average CI Entropy Quadratic Size True Length Loss Loss 0.2 3/3 0.644 0.206 0:557 0.4 3/3 • 0.580 0.122 0.296 100 0.6 3/3 0.511 0.185 0.496 0.8 3/3 0.423 0.336 0.923 0.2 2/3 0.290 0.064 0.135 0.4 3/3 0.269 0.031 0.061 500 0.6 3/3 0.226 0.051 0.102 0.8 3/3 0.164 0.127 0.329 0.2 3/3 0.202 0.028 0.056 0.4 3/3 0.188 0.027 . 0.053 1000 0.6 3/3 0.165 0.037 0.075 0.8 3/3 0.113 0.067 0.173 Table 3.2, shows simulation results for the regression coefficients 0. For each coefficient, we report the median of the posterior distribution, a 95% credible interval and the standard error The true regression coefficients seems to always fall within the 95% credi-ble interval. Standard errors and consequently credible intervals lengths tend to become smaller with the increase of correlation as well as the increase in sample size. Figures 3.3, 3.4, 3.5, and 3.6 provide trace plots, density and autocor-relation plots for the regression coefficient in the case where the correlation matrix has elements pij = 0.4 and pij = 0.8 and increasing the sample size from n = 100 to n = 1000 respectively. The density becomes narrower with a larger sample size and here too, the algorithm seems to be mixing well. 26 Chapter 3. Correlation Estimation in the Saturated Model Table 3.2: Regression coefficients results from simulations for T = 3 Sample Confidence Standard Confidence Standard Size Pi Interval Error Interval Error 0.2 -1.34 (-1.88,-0.83) 0.27 1.32 (0.79, 1.86) 0.27 0.4 -1.20 (-1.72,-0.72) 0.26 0.88 (0.38, 1.37) 0.25 100 0.6 -0.99 (-1.47,-0.52) 0.24 0.88 (0.41, 1.36) 0.24 0.8 -1.28 (-1.73,-0.82) 0.23 1.05 (0.62, 1.49) . 0.22 0.2 -1.22 (-1.45,-0.99) 0.12 1.18 (0.95, 1.40) 0.12 0.4 -1.23 (-1.45,-1.00) 0.11 1.04 (0.82, 1.26) 0.11 500 0.6 -0.92 (-1.11,-0.71) 0.10 1.15 (0.93, 1.35) 0.11 ' 0.8 -1.14 (-1.33,-0.95) 0.10 0.93. (0.75, 1.11) 0.09 0.2 -1.12 (-1.28,-0.96) 0.08 1.14 (0.98, 1.30) 0.08 0.4 -1.09 (-1.25,-0.94) 0.08 0.96 (0.81, 1.12) 0.08 1000 0.6 -1.08 (-1.23,-0.93) 0.08 1.12 (0.97, 1.26) 0.08 0.8 -1.09 (-1.22,-0.96) 0.07 0.98 (0.85, 1.11) 0.07 27 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.1: Correlation estimates for p = 0.4 , T = 3 and increasing sample size from n = 100 to n=1000 2000 4000 6000 8000 10000 2000 4000 6000 8000 10000 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 (a) n = 100 2000 4000 6000 8000 10000 2000 4000 6000 B000 10000 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 (b) n = 1000 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.2: Correlation estimates for p = 0.8 , T = 3 and increasing sample size from n = 100 to n=1000 2000 4000 6000 6000 10000 2000 4000 6000 8000 10000 -1 -0.5 0 0.5 1 2000 4000 6000 8000 10000 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 (a) n = 100 2000 4000 6000 6000 10000 2000 4000 6000 8000 10000 29 (b) n = 1000 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.3: 3 estimates for p = 0.4 , T = 3 and sample size n = 100 30 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.4: /3 estimates for p = 0.4 , T = 3 and sample size n = 1000 31 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.5: /? estimates for p = 0.8 , T = 3 and sample size n = 100 :V2 Figure 3.6: 8 estimates for p = 0.8 , T = 3 and sample size n = 1000 33 Chapter 3. Correlation Estimation in the Saturated Model 3.4.2 Results for T = 8 For T = 8, we are estimating T(T — l)/2 = 28 parameters in the correlation matrix in addition to two regression coefficients. Table 3.3 shows the number of parameters falling within the 95% credible interval, the average interval length, the entropy loss and the quadratic loss. In this case, we see that with more parameters to be estimated, we did not loose very much on the accuracy, as the average 95% credible interval length has remained within the same range as in the case of T = 3. We also note that five of the experiments had only one out of 28 parameters not contained in the 95% credible interval. Furthermore, we could see that average CI length decreases with larger correlation values and a larger sample size, while loss improves only with an increased sample size. Figure 3.7 and 3.8 show the density and the trace plot of the correlation matrices with ptj — 0.2 and pij = 0.6 respectively. The density becomes more peaky and narrow with an increase in sample size as expected. For the regression coefficients, we see from figures 3.9, 3.10, 3.11, and 3.12 that with more information in the data, the median is closer to the true parameters. In addition, similar to the results we saw in the previous simulation, the credible intervals decrease in length with an increase in sample size. We also note from table 3.4, that the true value of 8 is always included in the 95% credible interval. 34 Chapter 3. Correlation Estimation in the Saturated Model Table 3.3: Correlation Results from simulations for T = 8 Sample CI Contains Average CI Entropy Quadratic Size True Length Loss Loss 0.2 28/28 0.577 2.306 10.445 0.4 27/28 0.511 4.327 29.895 100 0.6 28/28 0.492 4.879 75.270 0.2 27/28 0.279 0.439 1.133 0.4 28/28 0.254 0.489 1.312 500 0.6 27/28 0.221 0.560 1.436 0.2 27/28 0.200 0.180 0.400 0.4 28/28 0.181 0.186 0.421 1000 0.6 27/28 0.157 0.358 0.880 Table 3.4: Regression coefficients results from simulations when T — 8 Sample Confidence Standard Confidence Standard Size 'f'ij fa Interval Error fa Interval Error 0.2 -1.20 (-1.51,-0.89) 0.16 0.90 (0.60, 1.21) 0.15 0.4 -1.08 (-1.37,-0.81) 0.14 0.99 (0.71, 1.27) 0.14 100 0.6 -1.06 (-1.33,-0.79) 0.14 1.04 (0.78, 1.31) 0.14 0.2 -1.12 (-1.26,-0.98) 0.07 0.98 (0.85, 1.11) 0.07 0.4 -1.01 (-1.14,-0.88) 0.07 0.95 (0.82, 1.08) 0.07 500 0.6 -1.04 (-1.16,-0.92) 0.06 0.97 (0.86, 1.09) 0.06 0.2 -1.09 (-1.18,-0.99) 0.05 1.01 (0.91, 1.11) 0.05 0.4 -1.00 (-1.09,-0.91) 0.05 0.94 (0.85, 1.03) 0.05 1000 0.6 -1.00 (-1.08,-0.91) 0.04 0.97 (0.88, 1.05) 0.04 35 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.7: Correlation estimates for p = 0.2 , T = 8 and increasing sample size from n = 100 to n=1000 4000 2000 1 -1 0 1 100 r — p — )00 | | | 0 -1 0 1 too )00 I LL. 4000 2000 50000000 5000000I " 1 5000000I 1 -1 0 1 100 •1000 2000 LiI 400C 2000 1 4000 2000 4Q0Q 2000 4000 2000 4 000 2000 a 4000 2000 I. 1 0 1 j—| 4UU0i j— i-J i "UL. 4 000 2000 lu 1 0 1 I. 4000 20 OC >-.ooo 2000 4000 2000 1 a 500000 s 500000 1[ 1 JL—I 500000 1 I 1 0 1 I ! 0 1 5000000 5000000 4000 2000 5000000< 5000000I 1 jj-j : h 5000000 1 4000 2000 JL 1 0 1 LJ 1 0 1 - 1 0 1 — 4 0 0 0 1 ir-|-J1T3 4000 2000 5000000I s 5000000 1 1 50000001 5003000 1 V (a) n = 100 4000 2000 0 4000 2000 0 4 000 2000 0 4000 2000 0 4000 2000 0 400C 2000 0 CE i o 1 LU o 1 i 4000 2000 H j -oooooo-1 LU o " M w J i 0 r a 50000000 50000000 50000000 50000000 50000000 50000000 50000000 H ^ ^ y LaMmJ LZJ L M J « J | M 0H 0!H 0 H 0H 0 H 50000000 50000000 50000000 50000000 50000000 500800 0.51 1 0.4 | ) 0.51 1 0.5 j ; 1 0.5 r———. A wmm Mkm L,t_*J uugng mtim o - — i 0.2 • S H s M | 0 f I -0.5' 1 0 l1171'"-"^ 01- - J OI — J -0.5' 1 4 000 2000 — 4UUU j p—| 0.5 I -—1 0 . 5 1 ^ j 0.5 [-""""; 2000 | 0 ^W" 0 0 ^ f l " ^ _ J ' 0' -0.5' 1 -0.51 1 -0.3' 1 0 1 LD 4 000 2000 -1 0 1 40001 r— 2000 I 0I—Li. 4000 200G 0 4000 2000 T—j *uuu I n—I IUUU I r— 200 | 2000 -1 0 1 «—I 4UUU | Tj— _ L 2 o:LL, 4000 2000 LD D — I 4 COO 2000 0 4000 2C00 n 1 4UUU i ;— 50000000 50000000 5000000' 0.5 | • 1 0 . 4 l ^ v r j "j Q P W M B Q.2 M ^ B -0.51——J oi____i 50000000 Q 5 5000000' I A m$m 1 0 1 ° 5000000' D 4U00I n—I 40001 1—j 4000 i 1—i 40001 n—i 40001 r p - . 4000 i p— 2000 | 2000 | 2000 I 2000 1 2000 j 2000 1 ^ 0L. ..IlLU" 0I tilJe 0 | LjV! n ! IliJ'' 0I fid'* 0I LJ* 4000 2000 | 2000 l l J e njl 1' 4000 2000 0 4000 2000 36 (b) n = 1000 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.8: Correlation estimates for p = 0.6 , T = 8 and increasing sample size from n = 100 to n=500 4000 200C • 4000 2000 :H :H °:H °:H °:H °:H 50000000 _ 50090000 50000000 50000000 50000000 5000000' fi fi fi il 2000 I i 0.5 HH _JLU „L_JL 0 ffl 1 0 1 -1 0 1 50000001 j—ri 4000 I p-p 40001 rrr | 2000 I 2000 | ^ 1 0 1 ° 1 0 1 ° 1 0 1 j-r-i 4000 I - p r i 4000i pn 40001 pp ^ 2000 j| 2000 2000 ^ 01 l " 1 5000000' . :• 5000000 1 0 " .ti 1 -11 50000000 5000 OOOi H 1l—jjg 4000 2000 2000 4000 2000 ° 5000000 o i i r i i ' 1 5000000 5000000 , s 50O00OOI —n *UUUj i-T| 40001 p| 40001 pn 40001 pr I 2000 J 2000 1 2000 A 2000 I *j o' 0I M fjl 0 I ML 1 0 1 - 1 0 1 - 1 0 1 - 1 0 1 - 1 0 1 pr i 4000. r-T. 4000 | i - n 40001 p n 4000 i pr-j 40001 p i -l l 2000 ji1 2000 2000 J 2000 i 2000 I Job •8 oI MJ-" o I — * o I — • * » j a o I — o I iB*W 1 0 1 - 1 0 1 - 1 0 1 - 1 0 1 - 1 0 1 - 1 0 1 ' 5 0 0 pr-] 40001 nn 4000 I -n 4000i 1 — p i 4000 i pj-i 4000 i -pi 4000i pp ^ j 2000 | | 2000 | 2000 2000 j 2000 ^ 2000 ^ (a) n = 100 41)00 2000 4000 200!) 4000 2000 Oi j -0 o L_XP* -1 0 1 0. T- | D J 1 0 1 EL 1 0 1 OL 1 0 1 a 500TDOOO 5000)000 4000 2000 4000 2000 4000 I 1-2000 ni iJ t i f 1 ) r j U 0 I U -1 0 1 lOi • 0I—m -1 0 1 1 0 1 • 1 0 1 a . 4000 2000 40;;:; 20GO 4000 -2000 SOOEDOC 1 • 1 0 1 a 1 0 1 a 1 0 1 -1 0 1 , 0 n a j 91 1' 5000)000 5000)0 500GDOOO 4000 2000 J 4000 2000 TT o'—I 1 0 1 • 4000 2000 a . 1 0 1 LX 1 0 1 — i i | - -] 1 Q\ • UUUJUU soomoo a -1 0 1 a 15 ° 5000)00' 3.5 NW :) 50003001 .n soomooi t- ' 5000)001 37 (b) n = 500 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.9: (3 estimates for p = 0.2 , T = 8 and sample size n = 100 38 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.10: (3 estimates for p = 0.2 , T = 8 and sample size n = 1000 39 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.11: (3 estimates for p — 0.6 , T = 8 and sample size n = 100 40 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.12: 3 estimates for p = 0.6 , T = 8 and sample size n = 500 41 Chapter 3. Correlation Estimation in the Saturated Model 3.4.3 Convergence Assessment As with any Markov Chain Monte Carlo Algorithm, it is important to en-sure that the algorithm has converged. Unfortunately, there is no measure that can tell us definitively whether this has happened. The most common method of assessing convergence is by considering trace plots, which show the evolution of the MCMC output as a time series. These plots provide a simple way to examine the convergence behavior of the algorithm for the parameters under consideration. Trace plots are useful for immediately di-agnosing lack of convergence and poor mixing, if the M C M C sampler covers the support of the posterior distribution very slowly. Poor mixing invali-dates the density estimates, as it implies that the M C M C output is not a representative sample from the posterior distribution. From previous figures, it appears that the algorithm is mixing well, since we do not see any particular trends in the time series plots, and the algorithm seems to be leveling off to a stationary state. We also consider the mixing speed by looking at the effect of drawing more samples. For examples, Figure 3.13 shows the effect of increasing the number of Gibbs draws from 500 to 5000, for the simulation where T = 3 and n = 100. We could see that with 1000 iterations post burn-in, the algorithm has started to converge to the target density. Next we consider correlation plots of the estimated parameters. These plots depict the autocorrelation of the sequence of simulated values as a function of lag time. The autocorrelation plots can be useful in identifying slow mixing. Figure 3.14 shows the standardized autocorrelation plots of the different values of the correlation coefficients when T = 3 and n = 100. Standardized plots mean that at lag time 0 the autocorrelation is 1. We could note that autocorrelation drops near zero at around lag 10-20. Similar results where observed for the regressions coefficients 3, in previous plots. Finally, under convergence, the estimated parameter of interest is ex-pected to converge to a flat region near the true parameter and then fluctu-ate around that region. However, a statistic like the mean of the parameter 42 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.13: n = 100, T — 3, Trace plots as the number of iterations increase from N = 500 to N = 5000 post "Burn-in". The algorithm has started to converge after about 1000 iteration post "Burn-in". is expected to converge in the limit to a constant. Diagnostic plots such as the cumulative mean and the cumulative standard deviation of the estimates provide a way to see if this has happened. Figure 3.15 shows a plot of the cumulative means and standard devia-tions for the 10000 iteration from a randomly selected set of parameters from the simulation with n = 100 and T — 3. Both mean and standard deviation indeed level off to a flat line early on in the simulation. Furthermore, it appears that the value we have chosen to use for "Burn-in" is reasonable since it cuts off all the fluctuations that happen early on. Finally, it is important to note that starting values are critical for the speed of convergence. If the starting values of a parameter are poor, it may take some time for the mean to stabilize. The fluctuations caused by the 43 Chapter 3. Correlation Estimation in the Saturated Model > i > . . . . 1 . .o i 2 1 . ' 1 . . 1 0 10 20 30 40 50 60 0 10 20 30 40 50 60 Autocorrelation Autocorrelation Figure 3.14: n = 100, T = 3, Autocorrelation plots of a randomly chosen pa-rameter from correlation matrices for the cases where the marginal correlations is p = 0.2, p = 0.4, p = 0.6, and p = 0.8 poorly sampled values early on in the simulation are difficult to overcome, but in the long run, the mean will eventually stabilize. In the case of the Multivariate Probit, many have reported the sensitivity of convergence to the starting values of the parameters. In our case, many initialization values were tried,,the initial values that we chose to use are equivalent to running T univariate Probit models with 0 mean, these values provided optimal results. When the algorithm was initialized randomly, it took much longer to reach convergence. 44 Chapter 3. Correlation Estimation in the Saturated Model 2000 4000 6000 8000 10000 Samples p. 2000 4000 6000 8000 10000 Samples 2000 4000 6000 8000 10000 Samples 2000 4000 6000 8000 10000 Samples Figure 3.15: Trace plots of the cumulative mean and cumulative standard devi-ation of randomly chosen parameters from correlation matrices as the correlation is varied from p = 0.2, p = 0.4, p = 0.6, and p = 0.8 and n = 100, T = 3. The vertical line marks the "Burn-in" value (500) used in the simulations 45 Chapter 3. Correlation Estimation in the Saturated Model 3.5 Application: Six Cities Data In order to further evaluate our method and our prior, we apply it to the Six Cities data. This data set is based on a subset of data from the Six Cities study, a longitudinal study of the health effects of air pollution, which has been analyzed by Chib and Greenberg (1998) in the context of multivariate Probit and by others (eg. Glonek and McCullagh (1995)) with a multivariate Logit model. The data contain repeated binary measures of the wheezing status (1 = yes, 0 = no) for 537 children at ages 7, 8, 9 and 10 years. The objective of the study is to model the probability of wheeze status over time as a function of a binary indicator variable representing the mother's smoking habit during the first year of the study and the age of the child. We fit the same model as in Chib and Greenberg (1998): P(Yij = 1 R ) = 8o + 8xXijx + {32Xij2 + p3Xij3 (3.20) where j in {1,2,3,4} indexes the time at which the response was observed (ages 7,8,9 and 10), and Xij\ is the age centered at 9 years agej — 9, Xij2 is a binary variable representing the smoking status of the mother Xij2 = Xi2 — I mother-smokes, a n d Xijs is the interaction between smoking status and age X^\ * X^i- Here we would like to note that age is used both as a category in the response and as a covariate. We use the algorithm developed in this chapter to fit a full correlation matrix among the responses. N = 8000 samples are obtained and the first 500 were discarded as "Burn-in". Furthermore, conforming with what was done in simulations methods, initial values for 8 were sampled from their prior distribution, the latent data was initialized at zero and the correlation matrix was initialized at the identity matrix. Table 3.5 summarizes the parameters' posterior means and standard errors. It also provides, for comparative purposes, the results reported in Chib and Greenberg (1998) using both the maximum likelihood estimator and the posterior means resulting from the M C M C algorithm they develop in their paper. From comparing these results, we could see that they are very 46 Chapter 3. Correlation Estimation in the Saturated Model similar for both means and standard errors. We could note however, that the estimates obtained using the joint uniform priors are smaller compared to ones we obtained using the marginally uniform prior. This is consistent with what is expected, since the jointly uniform prior will tend to favor values closer to 0. We could also note that the intervals for Bi and (3% contain 0. This could be an indication that the mother's smoking habit may not have contributed to the wheezing status of the child at any age. Table 3.5: Six Cities Data: Posterior estimates using Marginal Prior, MLE esti-mate using MCEM and Posterior estimates using the Jointly Uniform Prior (Chib and Greenberg (1998)) Marj pnal Uniform Prior M C E M Jointly Uniform Prior Mean 95% CI s.e M L E s.e Mean s.e Po -1.13 (-1.26,-1.01) 0.06 -1.12 0.06 -1.13 0.06 fa -0.08 (-0.14,-0.02) 0.03 -0.08 0.03 -0.08 0.03 fa 0.18 (-0.02, 0.38) 0.10 0.15 0.10 0.16 0.10 fa 0.04 (-0.06, 0.14) 0.05 0.04 0.05 0.04 0.05 r\2 0.59 (0.45, 0.73) 0.07 0.58 0.07 0.56 0.07 r\z 0.54 (0.38, 0.68) 0.08 0.52 0.08 0.50 0.07 ru 0.55 (0.40, 0.69) 0.07 0.59 0.09 0.54 0.07 7*23 0.73 (0.60, 0.83) 0.06 0.69 0.05 0.66 0.06 r-24 0.57 (0.40, 0.69) 0.08 0.56 0.08 0.51 0.07 r-34 0.64 (0.40, 0.69) 0.08 0.63 0.08 0.60 0.06 In their paper, Chib and Greenberg (1998) do not show trace plots for any of their estimates and they do not discuss convergence diagnostics. It is therefore difficult to compare their algorithm to ours with that respect. Figures 3.16 and 3.17 depict the density plots and the trace plots of the cor-relation coefficients and the regression coefficients respectively. Trace plots do not seem to exhibit any patterns or poor mixing. The autocorrelation plot in 3.17 shows that there is a lag of 10 before autocorrelation goes down to 0. 47 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.16: Six Cities Data: Trace plots and density plots of the correlation coefficients. The vertical lines denote 95 % credible interval and the line in red indicates the posterior mean reported by Chib and Greenberg (1998). 48 Chapter 3. Correlation Estimation in the Saturated Model Figure 3.17: Six Cities Data : Trace plots, density plots and autocorrelation plots of the regression coefficients. Vertical lines denote 95 % credible interval and the line in red indicates the posterior mean reported by Chib and Greenberg (1998). 49 Chapter 4 Correlation Estimation in the Structured Mode l 4.1 Introduction Often times, when dealing with high dimensional problems, it is useful to impose a structure of association between the outcome variables Y\,..., Yp. In certain cases, the researcher may be interested in comparing different hypotheses of patterns of association. In other cases, the data might follow a natural grouping, where certain subsets of variables will express a higher degree of association within and less association between other groups of variables. For example, in longitudinal models, a variable might only be associated to the one preceding it at the previous time point. Furthermore, imposing a structure on the correlation matrix helps simplify computation. This is because the number of free parameters to be estimated is a smaller subset of the total number of parameters under the saturated model. Im-posing a structure helps both statistically and computationally. 4.2 Conditional Independence When variables exhibit a high marginal association, the pairwise estimate of their correlation is high. However, these two variables might be affected by a third variable, which acts as a mediating variable or confounder. As an example, consider the following three variables: smoking, drinking coffee and lung cancer. Drinking coffee and lung cancer might express a high degree of correlation, however once information about smoking is available, drinking coffee and lung cancer become decoupled. Therefore, smoking accounts for 50 Chapter 4. Correlation Estimation in the Structured Model the association between drinking coffee and lung cancer. In order to get around this problem, one could control for the other variables in the model, by considering the conditional dependence. Partial correlation represents the correlation between two variables conditional on all the other variables in the model. It can be obtained from the precision matrix Q — £ - 1 , with elements through the following relation: for i ^ j; v^i^JJ ' (4.1) pu for i = j. From this, we could see that a 0 in the precision matrix Si would result in a partial correlation of 0, meaning that the two variables are conditionally independent. In general, the correlation coefficient is a weak criterion for measuring dependence because marginally most variables will be correlated. This im-plies that zero correlation is in fact a strong indicator for independence. On the other hand, partial correlatiorrcoefficients provide a strong measure of dependence and, correspondingly, offer only a weak criterion of indepen-dence. In the multivariate Probit (MVP) class of models, the response is dis-crete and viewed as an indicator variable to a latent construct. Imposing a structure on the latent variables Z \ , . . . , Z T (see figure 4.1) results in a par-tial correlation matrix that is sparse, however this does not imply that the correlation matrix is sparse. In this section we shift our attention to the par-tial correlation matrix using undirected Gaussian graphical models (GGM). We impose the structure on the partial correlation matrix and subsequently estimate marginal correlations given this structure. 4.3 Gaussian Graphical Models Gaussian graphical models are a graphical representation of the conditional independence between multivariate Normal variables. In these graphs, vari-ables are represented by nodes. An edge is drawn between any two nodes unless these two nodes are conditionally independent. This way, graphs pro-51 Chapter 4. Correlation Estimation in the Structured Model i = 1 : n Figure 4.1: A graphical representation of a structured MVP model for T = 3. The edge between and Ziz is missing, this is equivalent to r~i3 = 0. This structure is typical of longitudinal models where each variable is strongly associated with the one before it and after it, given the other variables in the model. vide a clear representation of the interrelationship between variables in the model. This approach to modeling data is known as covariance selection (Dempster, 1972). The introduction of the hyper-inverse Wishart by Dawid and Lauritzen (1993) as a conjugate prior for structured covariance matrices was central in the development of Bayesian approaches for inference in this class of models. 4.3.1 Graph Theory In this section we review some graph theory used in this chapter, for a full account on graphical models, we refer the reader to Lauritzen (1996) or Whittaker (1990). An undirected graph is a pair G — (V,E), where V is a set of vertices representing variables and E, the edge-set, is a subset of the set of unordered distinct pair of vertices. Visually, each vertex i is a node representing the random variable i and an edge £ £ is an undirected edge connect-52 Chapter 4. Correlation Estimation in the Structured Model ing nodes i and j unless they are conditionally independent. In undirected graphical models if edge (i, j) € E then by symmetry edge ( j , i) G E as well. See for example figure 4.1, Z\ is conditionally independent of Z 3 given Z2 and is denoted by Z\EZ^\Z2-Below is a list of definitions used throughout this chapter, they are illus-trated in figure 4.2: • Vertices A and B are neighbors (nb) or adjacent in G if there is an edge (a, 6) in E. • A subgraph is a graph which has as its vertices some subset of the vertices of the original graph. • A graph or a subgraph is complete or fully connected if there is an edge connecting any two nodes. • A clique is a complete subgraph. • A set C is said to separate A from B if all paths from A to B have to go through C. • Subgraphs (A, B, C) form a decomposition of G if V = A U B, C = Ad B, where C is complete and separates A from B. • A sequence of subgraphs that cannot be further decomposed are the prime components of a graph. • A graph is said to be decomposable if every prime component is com-plete. • A distribution P is Markov with respect to a graph G, if for any decomposition (A, B) of G, AEB\A n B, where A n B is complete and separates A from B 53 Chapter 4. Correlation Estimation in the Structured Model Figure 4.2: A graphical model with T = 7 vertices. In this graph, Zx is a neighbor of Z2- Zz, Z2, and Zj form a complete subgraph or a clique. This graph can be decomposed into two cliques {Zi,Z2, Z3,Z5, Z4} and {Z3,Z6,Z7}. {Z3} separates the two cliques. 4.3.2 The Hyper-inverse Wishart Distribution The Hyper-inverse Wishart distribution, denned by Dawid and Lauritzen (1993), is a family of Markov probability distributions for structured co-variance matrices on decomposable graphs. Given a graph structure G, the probability distribution for E consistent with G follows a Hyper-inverse Wishart distribution denoted by: E ~ HIWa(b, D) whith degrees of freedom b > 0 and location parameter D > 0. The joint density of E decomposes as follows: For each clique C, Ec follows an inverse Wishart distribution IW(b,Dc). 54 Chapter 4. Correlation Estimation in the Structured Model 4.4 Marginally Uniform Pr ior for Structured Covariance In the M V P model, the covariance matrix is restricted to a correlation matrix due to identifiability reasons discussed in Chapter 2. In order to extend the PX-DA algorithm developed in Chapter 3, it is necessary to find a marginally uniform prior on the structured correlation matrix R. In decomposable graphical models, an edge joining any two variables i and j implies that the variables i and j belong to the same clique. This in turn means that the corresponding element in the inverse correlation matrix R"1 is not 0. On the other hand, if i and j do not belong to the same clique, the corresponding element in the inverse correlation matrix R~l is exactly , 0. The correlation matrix is obtained through matrix completion such that the resulting matrix R is positive definite. Therefore the elements of R corresponding to the marginal correlation of two variables not belonging to the same clique are not free parameters and their distributions are not of interest. On the other hand, non-zero elements in the partial correlation matrix indicate a high dependence among the corresponding variables, and since we would like an uninformative prior on the marginal correlations we would be interested in a prior on R such that: From the properties of the Hyper inverse Wishart distribution (Rover-ato, 2002), we know that given a graph structure, if Y,Q ~ HIW(b, D), then the covariance matrix of each prime component follows an inverse Wishart of prime components. It is important to note here that the precision param-eter b is common to all the prime components, and the location parameter Since the covariance of each prime component follows an inverse Wishart distribution £ p ~ IW(b, Dp), by taking Dp = Ip, we can obtain the distri-bution of the correlation matrix Rp for each prime component, as in Barnard (4.3) distribution with Sp^ -p,- ~ IW(b, Dp^pfj for j = 1,..., k and k is the number DpjPj is specific to each one. 55 Chapter 4. Correlation Estimation in the Structured Model et al.' (2000). ) (if.,• l+i) 2 fPj(R\b= \Pj\ + l) cx\R (4.4) where \Pj\ is the cardinality of the prime component P. If we consider the pairwise distribution of any two variables i and j within the same clique, we could appeal to the marginalization property of the inverse Wishart to obtain the marginal distribution of each r^ in that clique, based on the result used in the proof of Barnard et al. (2000) (see appendix B). Furthermore, in order to have a uniform distribution for each Tij on [—1,1], we sample Ec from a standard inverse Wishart with degree of freedom parameter b — \Cj \ + 1. However, because we are restricted to have the parameter b common to all the prime components, using the parametrization in A.4 would re-quire that all cliques have equal sizes. Alternatively, we could use the parametrization in Dawid and Lauritzen (1993) (see Appendix A.4). In that parametrization, sampling Ec ~ IW(6, Ic) and taking 5 — 2 is equivalent and furthermore, it would insure that the precision parameter is indepen-dent of the size of the cliques and could therefore be common to all prime components. Since the structure of the graph G is given, when i and j do not belong to the same clique, the corresponding element of the partial correlation matrix fij — 0 and rij is obtained through matrix completion. To illustrate this prior, given the graph structure in 4.1, we sample from a standard hyper-inverse Wishart Eg ~ (5 — 2, Ic) and we transform back to R using the separation strategy: E — DRD. Figure 4.3 illustrates the marginal distribution of the elements of the correlation matrix, under a structured partial correlation assumption, based on 5000 draws. 56 Chapter 4. Correlation Estimation in the Structured Model -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 - 0 . 5 1 Figure 4.3: Marginal distributions of the prior on the correlation matrix corre-sponding to the model in 4-1 /, . We could see that where there is an edge between the two variables, the corresponding element of the correlation matrix has a uniform distribution on [-1,1]. Furthermore, when two variables are conditionally independent, their marginal correlations are obtained to ensure that the correlation matrix is positive definite. Figure 4 .4 illustrates the same result on a more complicated structure where the size of the cliques are not equal. • 57 Chapter 4. Correlation Estimation in the Structured Model Figure 4.4: Illustration of the marginally uniform prior on the structure of the graph in figure 4-2. In this graph we have unequal clique sizes .where \Ci\ = 5 and \C2\ = 3 flBf™ •1 0 1-1 0 -1 0 1-1 0 1-1 0 1-1 0 1 -1 0 1 mmmi 1 L.jlli'i-J 0 -••iilpiwn WIBIWi f f ln f l H i LdUkJ oLdllliJ -1 0 1 ~-1 0 1-1 0 1 -1 0 1-1 0 1-1 0 1-1 0 1 -1 0 1-1 0 1-1 0 1-1 0 1 -1 0 1-1 0. -1 0 1 4.5 P X - D A in Gaussian Graphical Models The PX-DA algorithm for the structured case is very similar to the one used for the saturated model in chapter 3. In the imputation step, the latent variables are sampled given a correlation matrix and regression co-efficients. In the posterior sampling step, conditional on R, the regression coefficients are estimated in the same way as in chapter 3. The only differ-ence is in sampling of the covariance matrix. Rather than sampling from an inverse Wishart distribution as before, we impose the structure on the partial correlation matrix, and subsequently expand the model by scaling it by D, a diagonal matrix, and sample the covariance from a hyper in-verse Wishart distribution. Because D is diagonal, the zeros structure of the inverse correlation remains unchanged. 58 Chapter 4. Correlation Estimation in the Structured Model The multivariate Normal distribution of the latent variables W under the expanded model is also Markov with respect to G, and the joint likelihood decomposes as well: p(W\y \ nceCp(wc\Zc) •1 VVL G = j= ( 4 . 5 ) For Bayesian inference, the posterior distribution is given by Bayes rule: P(E|W, G) oc P(W|E, G)P(E|G) The Hyper-inverse Wishart is the conjugate local prior for any graph, there-fore we can compute the posterior as: nceCp(.wc\Xc)P(?c\b,Dc) YlSesp(Ws\Zs)P&s\b,Ds) [ ' ' Since for any prime component P (cliques and separators), Ep is inverse Wishart, the conjugate prior for covariance for the Normal distribution, we can write for each prime component: ir(Rp,ap\Yp,Wp) oc |Ep|~? exptr (E-1e*'e*} x i P p l ^ ^ - ^ n \Rpu\rm+1)'2 x Gamma (lP{ + 2 and doing a change of variable Dp, Rp —> Ep as in (3.11), we get a posterior distribution which is an Inverse Wishart. Ep ~ IW(6 + n,DP + Sw) where n is the number of observations in W, and Sw is the cross product matrix e*'e* under the expanded model. This way, an estimate of R is 5 9 Chapter 4. Correlation Estimation in the Structured Model obtained by sampling Ep ~ HIW(5 + n, Dp + Sz) and transforming to R using the separation strategy as before (see algorithm ??). The algorithm to implement this method is identical to the one in ??, except we replace the step: • Draw S|/3,y, W from an inverse Wishart distribution £ ~ IW(u, S) where v — n + T + 1 and S — e*'e*. by • Draw £ | /3 , Y, W from a hyper inverse Wishart distribution £ ~ HIW(S*, S), where S* — 2 - f n and S* = e*'e*. See sampling procedure in Appendix E. 60 Chapter 4. Correlation Estimation in the Structured Model 4.6 Simulations The motivation behind imposing a structure to the model is to reduce the number of parameters to be estimated. In the saturated model, all the parameters in the correlation matrix need to be estimated. However, by im-posing a structure on the inverse, the elements corresponding to correlation between conditionally independent variables are no longer free parameters. Since the number of free parameters to be estimated is reduced, the method that constrains the structure of the inverse should be more efficient at es-timating the correlation matrix. This is particularly beneficial when the number of parameters is large in proportion to sample size. To illustrate this, we consider the model with T = 8 correlated binary responses. Under the saturated model assumption, a correlation matrix with T(T — l)/2 = 28 parameters needs to be estimated. Alternatively,, if the longitudinal model assumption (as in figure 4.1) is made only 7 parameters would need to be estimated. This constitutes a significant reduction in the number of free parameters that could have a major impact especially for a small sample size. All the simulations in this chapter where generated using a partial cor-relation matrix corresponding to the graph with a structure as in figure 4.1, with T = 8. Data is generated by sampling Z from a multivariate Gaussian distribution centered at 0 with sample size n — 100. A density model with no covariates is assumed and we set Y = I{Z'-> 0). N = 5000 samples are drawn and the first 500 samples are discarded as "Burn-in". 4.6.1 Loss Under the Saturated Model and the Structured Model In the first simulation, the PX-DA algorithm is tested under both saturated and structured covariance assumptions. To do that we generate 50 data sets from 50 different correlation structures corresponding to the graph in figure 4.1. Each time, we run our algorithm and record the entropy loss and the quadratic loss (see complete results in Appendix F). Figure 4.5 is a boxplot of the results and table 4.6.1 gives the mean and standard deviation for the 61 Chapter 4. Correlation Estimation in the Structured Model Loss Model Mean s.e Entropy Saturated 13.875 25.164 Constrained 1.267 1.225 Quadratic Saturated 17098.952 119840.369 Constrained 2.776 2.838 Table 4.1: Simulation results: Entropy and quadratic loss averaged over 50 data sets generated by different correlation matrices with the same structure two estimated loss function resulting from these simulations. It is evident that the saturated model results in larger loss compared to the model where the structure is taken into account. To confirm, a paired t-test is performed. The difference in the means was significant in each case (pval < 1 x 10 - 5 ) . We can note large variability in the loss under the saturated model. 4.6.2 Effect of Decreasing Sample Size We would like to assess the effect of decreasing the proportion of parameters to samples size by varying the latter from n = 100 to n = 200. We generate data with the correlation matrix R, given in 4.9, such that the partial cor-relation matrix R, given in 4.10, has a structure corresponding to the model in figure 4.1. 1.000 0.796 0.552 0.058 -0.002 -0.001 -0.001 -0.000 0.796 1.000 0.693 0.072 -0.003 -0.001 -0.001 -0.000 0.552 0.693 L000 0.104 -0.004 -0.001 -0.001 -0.000 0.058 0.072 0.104 1.000 -0.036 -0.013 -0.010 -0.003 -0.002 -0.003 -0.004 -0.036 1.000 0.367 0.277 0.084 -0.001 -0.001 -0.001 -0.013 0.367 1.000 0.754 0.230 -0.001 -0.001 -0.0.01 -0.010 0.277 0.754 1.000 0.305 -0.000 -0.000 -0.000 -0.003 0.084 0.230 0.305 1.000 (4.9) In table 4.6.2, we report the entropy loss and the quadratic loss defined in 3.19. We could see that under the structured model assumption, the 62 Chapter 4. Correlation Estimation in the Structured Model 2501 200 h Figure 4.5: Box plot of the entropy and quadratic loss obtained by generating data from 50 correlation structures and computing the loss function under the full correlation structure versus a structured correlation structure reduction in loss is significant both in estimating the marginal correlation and the partial correlation. Moreover, the loss is reduced for all cases with an increase in sample size. We also note that under the structured model assumption, the loss in estimating the partial correlation matrix is the same as the one in estimating the correlation matrix, whereas under the saturated model assumption, the loss of estimating the partial correlation matrix is significantly larger than the loss incurred by estimating the marginal correlation matrix. Table 4.3 and table 4.5 outline simulation results for estimating the corre-lation coefficients of the unconstrained parameters for n = 100 and n = 200 respectively. 63 Chapter 4. Correlation Estimation in the Structured Model 1.000 0.688 -0.000 0.000 0.000 -0.000 0.000 -0.000 0.688 1.000 0.502 -0.000 -0.000 0.000 -0.000 0.000 0.000 0.502 1.000 0.075 0.000 -0.000 0.000 0.000 -0.000 0.000 0.075 1.000 -0.033 -0.000 0.000 -0.000 0.000 -0.000 0.000 -0.033 1.000 0.251 0.000 -0.000 -0.000 0.000 0.000 -0.000 0.251 1.000 0.714 -0.000 0.000 -0.000 -0.000 0.000 0.000 0.714 1.000 0.206 -0.000 0.000 -0.000 -0.000 -0.000 -0.000 0.206 1.000 (4.10) In the table, ps are the correlation coefficients under a structured as-sumption, p are the correlation coefficients under the saturated model, p are the partial correlation coefficients under the saturated model and ps are the partial correlation coefficients under the structured model. The results show that both the saturated and the structured model give similar results in estimating the unconstrained parameters of the marginal correlations. The standard errors and the 95% credible interval are very similar and just as we have seen in chapter 3, they are reduced by half in increasing the sample size from n = 100 to n = 200. However, in estimating the partial correlation parameters the strucured model has a smaller standard error and shorter credible intervals. n. Correlation Model Entropy Loss Quadratic Loss 100 Marginal Saturated 2.428 11.166 Marginal Structured 0.415 0.830 Partial Saturated 11.394 679762.050 Partial Structured •0.415 0.839 200 Marginal Saturated 1.179 4.139 Marginal Structured 0.219 0.479 Partial Saturated 1.145 20323.973 Partial Structured 0.241 0.547 Table 4.2: Entropy and Quadratic loss obtained by estimating the true correlation and partial correlation matrix with the PX-DA algorithm under the saturated and structured model assumption 64 Chapter 4. Correlation Estimation in the Structured Model More important differences between the structured model versus the sat-urated model are noted in the results of estimating the constrained param-eters of the correlation and partial correlation matrix. Tables 4.4 and 4.6 show the simulation results for the constrained parameters when n = 100 and n — 200 respectively. The standard errors and 95% credible intervals are smaller under the structured model in comparison with the saturated model for marginal correlations, and the structured model gives exact results for partial correlations. 65 Chapter 4. Correlation Estimation in the Structured Model Table 4.3: Simulation results on the unconstrained correlation coefficients cor-responding to the model in 4.1, with n =100, T — 8 based on N = 5000 Gibbs samples. CIContains Interval True Mean Median s.e 95% CI True Length P 1 2 0.796 0.707 0.714 0.096 (0.500,0.874) yes 0.374 Ph 0.796 0.701 0.707 0.092 (0.503,0.856) yes 0.352 P12 0.688 0.573 0.574 0.098 (0.378,0.758) yes 0.380 P 1 2 0.688 0.707 •0.714 0.096 (0.500,0.874) yes . 0.374 P 2 3 0.693 0.709 0.719 0.099 (0.492,0.876) yes 0.384 P 2 3 0.693 0.695 0.704 0.095 (0.483,0.858) yes 0.375 P 2 3 0.502 0.557 0.559 0.102 (0.351,0.743) yes 0.392 P 2 3 0.502 0.709 0.719 0.0.99 (0.492,0.876) yes 0.384 P 3 4 0.104 0.216 0.220 0.142 (-0.070,0.484) yes 0.555 P24 0.104 0.188 0.190 0.143 (-0.100,0.455) yes 0.555 P2A 0.075 0.135 0.132 0.106 (-0.071,0.349) yes 0.420 P2A 0.075 0.216 0.220 0.142 (-0.070,0.484) yes 0.555 P45 -0.036 -0.052 -0.051 0.149 (-0.334,0.237) yes 0.571 Pk -0.036 -0.046 -0.045 0.143 (-0.326,0.238) yes 0.564 P25 -0.033 -0.042 -0.042 0.131 (-0.298,0.215) yes 0.513 P 2 b -0.033 -0.052 -0.051 0.149 (-0.334,0.237) yes . 0.571 P56 0.367 0.347 0.353 0.132 (0.076,0.597) yes 0.521 Pu 0.367 0.326 0.332 0.136 (0.041,0.575) yes 0.535 Pu 0.251 0.251 0.251 0.112 (0.031,0.473) yes 0.442 Pu 0.251 0.347 N 0.353 0.132 (0.076,0.597) yes 0.521 P67 0.754 0.674 0.681 0.105 (0.443,0.858) yes 0.415 Pu 0.754 0.643 0.650 0.106 (0.411,0.829) yes 0.418 Pu 0.714 0.614 0.619 0.108 (0.385,0.806) yes 0.421 Pu 0.714 0.674 0.681 0.105 (0.443,0.858) yes 0.415 P78 0.305 0.065 0.067 0.148 (-0.229,0.350) yes 0.579 Pu 0.305 0.069 0.070 0.144 (-0.215,0.343) yes 0.559 Pu 0.206 0.053 0.052 0.110 (-0.165,0.266) yes 0.431 Pu 0.206 0.065 0.067 0.148 (-0.229,0.350) . yes 0.579 66 Chapter 4. Correlation Estimation in the Structured Model Table 4.4: Simulation results on the constrained correlation coefficients corre-sponding to the model in J^.l, with n = 1000, T = 8 based on N = 5000 Gibbs samples. CIContains Interval True Mean Median s.e 95% CI True Length Pl3 0.552 0.568 0.576 0.119 (0.314,0.773) yes 0.459 Pl3 0.552 0.489 0.490 0.103 (0.287,0.689) yes 0.402 Pl3 -0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 P\3 -0.000 0.157 0.159 0.200 (-0.260,0.524) yes 0.785 P24 0.072 0.032 0.031 0.148 (-0.254,0.324) yes 0.577 P24 0.072 0.130 0.129 0.102 (-0.073,0.334) yes 0.408 P24 -0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 P24 -0.000 -0.069 -0.075 0.196 (-0.443,0.316) yes 0.758 P46 -0.004 0.078 0.081 0.152 (-0.231,0.366) yes 0.596 P46 -0.004 -0.008 -0.004 0.034 (-0.089,0.057) yes 0.146 P46 0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 P46 0.000 0.149 0.152 0.185 (-0.225,0.507) yes 0.732 P57 -0.013 -0.014 -0.015 0.147 (-0.295,0.274) yes 0.569 Ph -0.013 -0.016 -0.011 0.052 (-0.129,0.086) yes 0.215 Ph -0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 hi -0.000 -0.005 -0.002 0.185 (-0.368,0.347) yes 0.715 67 Chapter 4. Correlation Estimation in the Structured Model Table 4.5: Simulation 'results on the unconstrained correlation coefficients cor-responding to the model in 4-1, with n = 200, T = 8 based on N = 5000 Gibbs samples. CIContains Interval True Mean Median s.e 95% CI True Length Pl2 0.796 0.716 0.720 0.066 (0.575,0.834) yes 0.259 Pl2 0.796 0.721 0.726 0.065 (0.581,0.831) yes 0.250 .P l2 0.688 0.577 0.579 0.073 (0.433,0.713) yes 0.280 Pl2 0.688 0.520 0.528 0.113 (0.275,0.717) yes 0.442 P23 0.693 0.725 0.729 . 0.068 (0.582,0.845) yes 0.263 Ph 0.693 0.731 0.736 0.065 (0.588,0.844) yes 0.255 Ph 0.502 0.591 0.593 0.074 . (0.442,0.736) yes 0.295 P23 0.502 0.525 0.536 0.117 (0.271,0.725) yes 0.454 P34 0.104 0.126 0.127 0.102 (-0.076,0.322) yes 0.398 Ph 0.104 0.128 0.130 0.104 (-0.082,0.325) yes 0.407 Ph 0.075 0.086 0.086 0.071 (-0.053,0.225) yes 0.278 P24 0.075 0.127 0.128 0.135 (-0.150,0.382) yes • 0.532 P45 -0.036 -0.163 -0.163 0.103 (-0.363,0.038) yes 0.401 Ph -0.036 -0.155 -0.157 0.103 (-0.353,0.052) yes 0.405 Ph -0.033 -0.141 -0.143 0.094 (-0.323,0.046) yes 0.369 P25 -0.033 T0.149 -0.151 0.117 (-0.373,0.081) yes 0.455 P56 0.367 0.396 0.399 0.097 (0.193,0.577) yes 0.384 P34 0.367 0.383 0.387 0.094 (0,193,0.558) yes 0.365 P34 0.251 0.263 0.263 0.073 (0.126,0.409) yes 0.282 P34 0.251 0.311 0.316 0.137 (0.021,0.562) yes 0.542 P67 0.754 0.759 0.763 0.066 (0.619,0.877) yes 0.258 P34 0.754 0.741 0.746 0.066 (0.599,0.854) yes 0.255 P34 0.714 0.695 0.699 0.071 (0.547,0.822) yes 0.275 P34 0.714 0.761 0.766 0.074 (0.598,0.893) yes 0.295 P78 0.305 0.295 0.297 0.102 (0.087,0.488) yes 0.401 P34 0.305 0.298 0.303 0.098 (0.097,0.477) yes 0.381 P34 0.206 0.205 0.205 0.072 (0.064,0.346) yes 0.282 P34 0.206 0.190 0.197 0.150 (-0.117,0.471) yes 0.588 68 Chapter 4. Correlation Estimation in the Structured Model Table 4.6: Simulation results on the constrained correlation coefficients corre-sponding to the model in 4-1, with n = 200, T = 8 based on N — 5000 Gibbs samples. CIContains Interval True Mean Median s.e 95% CI True Length 0.552 0.584 0.587 0.082 (0.411,0.732) yes 0.322 Ph 0.552 0.528 0.530 0.074 (0.380,0.668) yes 0.288 Pl3 -0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 P\z -0.000 0.136 0.140 0.145 (-0.151,0.409) yes 0.560 ' P24 0.072 0.060 0.061 0.103 (-0.153,0.260) yes 0.412 P24 0.072 0.093 0.094 0.077 (-0.060,0.243) yes 0.303 f>24 -0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 P24 -0.000 -0.067 -0.067 0.147 (-0.365,0.212) yes 0.577 P35 -0.004 -0.025 -0.025 0.106 (-0.229,0.174) yes 0.403 ^ 3 5 -0.004 -0.020 -0.016 0.024 (-0.081,0.015) yes 0.097 P35 0.000 0.000 0.000 0.000 (-0.000,0.000) yes 0.000 P 3 5 0.000 0.097 0.098 0.141 (-0.187,0.367) yes 0.554 P46 -0.013 -0.076 -0.076 0.108 (-0.288,0.137) yes 0.425 P46 -0.013 -0.060 -0.057 0.043 (-0.151,0.018) yes 0.170 P46 -0.000 -0.000 -0.000 0.000 (-0.000,0.000) yes 0.000 P4& -0.000 -0.059 -0.059 0.148 (-0.354,0.223) yes 0.577 4.6.3 P red ic t ion Accuracy An important benefit of regression models, is that they allow the prediction of a new outcome Y* given the model parameters and a new set of covariates X*. In the multivariate Probit class of models Y is binary, therefore we are interested in P{Y*- = 1\X*,Y). Rather than choosing a point estimator to make predictions, a Bayesian approach averages over the parameter space. This is given by the posterior predictive distribution: Pr(Y*\Y,X,X*). = Jpv{Y*\X*,p,R)p{P,R\Y,X)dpdR (4.11) = Jpv(Y*\Z)p(Z\X*,P,R)p(p,R\Y,X)dpdRdZ (4.12) 69 Chapter 4. Correlation Estimation in the Structured Model Therefore the predictive probabilities can be approximated by: Pr(Y"* = 1\Y,X,X*) ^ £ V > > 0 ) ( 4 ' 1 3 ) i = l where Z® ~ p(Z\X*, 0^, R^) and N is the number of Gibbs samples collected. Therefore for each draw of 0 and R, we sample according to the likelihood p(Z\X*, 0^\ .ftM). This involves sampling the latent variables from the multivariate normal distribu-tion with mean X*0^ and correlation matrix i?M and setting Y — I(z>o) . By Averaging over Monte Carlo samples we can obtain the predictive probability Pr(y* — 1|Y, X, X*) which is used to predict a new observation Y*. The predictive accuracy is then computed by counting the number of properly predicted values and dividing by the total number of predictions. To further investigate the advantages of estimating a structured correla-tion matrix, we compare the predictive accuracy under the assumption of a saturated model and a structured model. This time we use a data set with T = 25 and n = 100. In this data set, we expect the structured model to perform better than the saturated model, because the ratio of sample size to parameters to be estimated is much larger in the saturated case. In this simulation, we use a density model with no covariates and using the PX-DA algorithm, we sample N- = 5000 draws from the joint conditional posterior distribution under both the full and the structured correlation assumption. As expected, the results demonstrate the superiority of the structured model with 0.83 accuracy over the saturated model with 0.68 accuracy. 4.7 Application: Six Cities Data Revisited In this section, we revisit the Six Cities data, this time imposing the longitu-dinal structure on the covariance while fitting the exact model as 3.20. From table 4.7 and figure 4.6, we could see that the posterior mean of the marginal correlations coefficient do not correspond to the ones obtained under the 70 Chapter 4. Correlation Estimation in the Structured Model saturated model. This is particularly true for pu, P23 and P34. These corre-lations are now larger (posterior mean > 0.6) and the second order partial correlations are weaker and third order partial correlations are weaker still. Furthermore, the standard errors of the parameters that are constrained to zero in the inverse are smaller than the ones obtained in table 3.5. It is interesting to note however, that the estimates for 0 (Figure 3.17) and the standard errors have remained unchanged under the structured model. In order to assess predictive accuracy in this case, we use the method discussed in the previous section, we fit the model on a random subset of the data with n = 100 and we evaluate the predictive accuracy on n = 100 observations randomly selected from the remaining observations. The structured model has a slightly higher predictive accuracy (0.83) compared to the saturated model(0.80). Marginal Uniform Prior M C E M Jointly Uniform Prior Mean 95% CI s.e M L E s.e Mean s.e 00 -1.14 (-1.26,-1.02) 0.06 -1.12 0.06 -1.13 0.06 01 -0.08 (-0.15,-0.01) 0.03 -0.08 0.03 -0.08 0.03 02 0.17 (-0.03, 0.37) 0.10 0.15 0.10 0.16. 0.10 03 0.04 (-0.07, 0.15) 0.06 0.04 0.05 0.04 0.05 rn 0.63 (0.49, 0.75) 0.07 0.58 0.07 0.56 0.07 ri3 0.48 (0.35, 0.62) 0.07 0.52 0.08 0.50 0.07 ru 0.33 (0.21, 0.46) 0.06 0.59 0.09 0.54 0.07 r23 0.77 (0.66, 0.87) 0.05 0.69 0.05 0.66 0.06 r24 0.52 (0.66, 0.87) 0.07 0.56 0.08 0.51 0.07 0.68 (0.66, 0.87) 0.07 0.63 0.08 0.60 0.06 Table 4.7: Six Cities Data: Posterior estimates under structured model assump-tion, MLE estimate using MCEM and Posterior estimates using the Jointly Uni-form Prior under a saturated model assumption(Chib and Greenberg (1998)) 71 Chapter 4. Correlation Estimation in the Structured Model Figure 4.6: Six Cities Data: Correlation and partial correlation estimates 2000 4000 6000 800010000 2000 4000 6000 800010000 2000 4000 6000 800010000 0 , 5 Wr^ |ri|^ WPWWP™WP 2000 4000 6000 800010000 2000 4000 6000 800010000 (a) Marginal distribution and trace plots of the elements of the correlation matrix the red line denotes the estimates obtain in Chib and Greenberg (1998) by assuming the saturated model. 0 s 0 -0.5 20004000 6000 800010000 2000 4000 6000 800010000 2000 4000 6000 800010000 2000 4000 6000 800010000 2000 4000 6000 800010000 72 (b) Posterior distribution and trace plots of the elements of the partial correlation matrix. Chapter 4. Correlation Estimation in the Structured Model 20 40 60 Autocorrelation 20 40 60 Autocorrelation 0 20 40 Autocorrelation 20 40 60 Autocorrelation Figure 4.7: Six Cities Data : Trace plots, density plots and autocorrelation plots of the regression coefficients under a structured model assumption. Vertical lines de-note 95 % credible interval and the line in red indicates the posterior mean reported by Chib and Greenberg (1998). 73 Chapter 5 Conclusion 5.1 Summary The multivariate Probit model has several attractive features which make it particularly suitable for the analysis of correlated binary data. It relaxes the independence of the irrelevant alternatives (HA) property assumed by the logistic model and moreover, it is a natural choice in situations where an interpretation for thresholded continuous data is possible. It allows for flexible modeling of the association structure underlying the latent data and automatically accounts for overdispersion and underdispersion. Maximum likelihood estimation is not feasible in closed form in the multi-variate Probit class of models. Likelihood based approaches for estimation in M V P are very expensive due to the intractability of the high dimensional in-tegral that needs to be solved. The Bayesian framework is attractive because it allows the computation of a full posterior distribution on all unknown pa-rameters. The algorithm we proposed in 3.4 and extended in chapter 4 for structured models uses parameter expansion for data augmentation, which gives full conditional posterior distributions in closed form. This allows the implementation of a Gibbs sampler. Moreover, the algorithm we developed has many desirable properties: • It handles the identifiability problem in the M V P model by constrain-ing the covariance to be a correlation matrix, and placing the prior directly on the identified parameters. • The posterior distribution obtained through parameter expansion al-lows the use of the standard conjugate prior for the covariance. This makes the parameters easily interpret'able. 74 Chapter 5. Conclusion • The prior is marginally uniform and does not favor marginal corre-lations close to 0 or ±1 even in high dimensions. Furthermore, it is proper which makes it possible to do Bayesian model selection. • The full Gibbs framework is convenient as it bypasses having high dimensional proposal distributions. From previous work (Zhang et al., 2006), the design of such proposal distributions is difficult and requires careful tuning of the algorithm parameters. The extension of the algorithm provided in chapter 4, using Gaussian graph-ical models, greatly improves estimation and simplifies computation, espe-cially in high dimensional space, or when the proportion of parameters to sample size is high. Computation difficulties in our algorithm arise mainly from the sampling of univariate truncated Gaussian. We use the algorithm of Robert (1995), which is based on an accept/reject method (see appendix D). For certain simulations, accepting values was slow and this significantly slowed down the algorithm. This problem was more apparent in the estimation of a full covariance. For T — 8 and ./V = 5000, the program was taking on average 5 minutes to run in Matlab and for T = 25, N = 5000, it was taking about 30 minutes to run. In future work, it would be important to implement a different method for sampling the univariate truncated Gaussian and try to speed up computation in order to allow for scalability of the algorithm to higher dimensions. 5.2 Extensions, App l i ca t ions , and Future W o r k A natural and straightforward extension of the algorithm developed here is to multinomial and ordinal Probit, where the latent variables are thresholded to multiple intervals. In addition, the extension to a response consisting of a mixture of binary and continuous data could be interesting and useful for many applications. 75 Chapter 5. Conclusion Furthermore, future work would include an extension of the structured algorithm of chapter 4 to the case where the structure is unknown a priori, but where we would be interested to learn it from data. The method we proposed here is particularly suitable for this task. The Gaussian graphical model framework and the choice of a proper prior make model selection feasible. There are many applications of the multivariate Probit model, since cor-related binary data arise in many settings. Biological, medical, and social studies often yield binary or dichotomous data due to the lack of adequate and direct continuous measurements. Other examples include longitudinal data, panel data, latent class models in psychology. The M V P class of mod-els is also particularly attractive in marketing research, of consumer choice which subsequently yields to market segmentation and product design. In addition, one interesting application would be Bayesian Distance Metric Learning. This application is particularly important for classification. Yang et al. (2007) developed an approach that uses logistic regression to learn a similarity metric given labeled binary examples of similar and dissimilar pairs. In their approach, they ignore identifiability, which is also a concern for logistic models, and they use a variational approximation. Our model and algorithm could be easily adapted to this problem. 76 Bibliography J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422): 669-679,jun 1993. J. Ashford and R. Snowden. Multivariate probit analysis. Biometrics, 26 (3):535-546, sep 1970. J..Barnard, R. McCulloch, and X. Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrink-age. Statistica Sinica, 10:1281-1311, 2000. C. Carvalho, H. Massam, and M West. Simulation of the hyper-inverse wishart distribution in graphical models. Biometrika, 2007. To appear. N. Chaganty and H. Joe. Efficiency of generalized estimating equations for binary responses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4):851-860, 2004. S. Chib and E. Greenberg. Analysis of multivariate probit models. Biometrika, 85(2):347-361, 1998. R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999. A. Dawid and S. Lauritzen. Hyper markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272-1317, sep 1993. A. Dempster. Covariance selection. Biometrics, 28(1): 157-175, mar 1972. 77 Bibliography D. Dobra, C. Hans, B. Jones, J. Nevins, G. Yao, and M . West. Sparse graphical models for exploring gene expression data. J. Multivariate anal-ysis, 90:196-212, 2004. Y. Edwards and G. Allenby. Multivariate analysis of multiple response data. Journal of Marketing Research, 40(3):321-334, 2003. J. Geweke. Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints. Computing Science and Statis-tics: Proceedings of the Twenty-Third Symposium on the Interface, Alexan-dria, VA: American Statistical Association, pp., 1991. J. Geweke, M . Keane, and D. Runkle. Alternative computational ap-proaches to inference in the multinomial probit model. The Review of Economics and Statistics, 76(4):609-632, nov 1994. G. Glonek and P. McCullagh. Multivariate logistic models. Journal of the Royal Statistical Society. Series B (Methodological), 57(3):533-546, 1995. A. Gupta and D. Nagar. Matrix Variate Distributions. Chapman and Hall, 2000. M. Keane. A note on identification in the multinomial probit model. Jour-nal of Business and Economic Statistics, 10(2):193-200, apr 1992. G. Koop. Bayesian Econometrics. John Wiley and Sons, 2003. S. Lauritzen. Graphical Models. Clarendon Press, Oxford, 1996. S. Lerman and C F . Manski. On the Use of Simulated Frequencies to Ap-proximate Choice Probabilities. MIT Press, 1981. M . Linardakis and P. Dellaportas. Assessment of athenss metro passenger behaviour via a multiranked probit model. Journal of the Royal Statistical Society: Series C (Applied Statistics), 52:185-200(16), May.2003. C. Liu. Bayesian analysis of multivariate probit models - discussion on the art of data augmentation by van dyk and meng. Journal of Computational and Graphical Statistics, 10:75-81, 2001. 78 Bibliography J. Liu and Y. Wu. Parameter expansion for data augmentation. Journal of the American Statistical Association, 94(448) :1264-1274, Dec 1999. X. Liu and M . Daniels. A new algorithm for simulating a correlation ma-trix based on parameter expansion and reparameterization. Journal of Computational and Graphical Statistics, 15:897-914(18), December 2006. R. McCulloch and P. Rossi. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(l-2):207-240,. 1994. R. McCulloch, N. Poison, and P. Rossi. A bayesian analysis of the multi-nomial probit model with fully identified parameters. Journal of Econo-metrics, 99(1):173-193, November 2000. D. McFadden. Conditional Logit Analysis of Qualitative Choice Behavior. New York: Academic Press, 1974. D. McFadden. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica, 57(5):995-1026, sep 1989. R. Natarajan, C. McCulloch, and N. Kiefer. A monte carlo em method for estimating multinomial probit models. Computational Statistics and Data Analysis, 34:33-50, 2000. A. Nobile. Comment: Bayesian multinomial probit models with a nor-malization constraint. Journal of Econometrics, 99(2):335-345, December 2000. C. Ritter and M . Tanner. Facilitating the gibbs sampler: The gibbs stopper and the griddy-gibbs sampler. Journal of the American Statistical Associ-ation, 87(419):861-868, sep 1992. C. Robert. Simulation of truncated normal variables. Statistics and Com-puting, 5(2):121-125, 1995. P. Rossi, Allenby G., and R. McCulloch. Bayesian Statistics and Marketing. John Wiley and Sons, 2005. 79 Bibliography A. Roverato. Hyper inverse wishart distribution for non-decomposable graphs and its application to bayesian inference for gaussian graphical mod-els. Scandinavian Journal of Statistics, 29(3):391-411, 2002. M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82 (398):528-540, jun 1987. E Webb and J. Forster. Bayesian model determination for multivari-ate ordinal and binary data. Work in progress, May 2006. URL http: //www. soton. ac .uk/~j j f /Papers/Ordinal .pdf. J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990. L. Yang, R. Jin, and R. Sukthankar. Bayesian ac-tive distance metric learning. UAI,. July 2007. URL http: //www. cse .msu.edu/~yangliul/uai2007_bayesian.pdf. R. Yang and J. Berger. Estimation of a covariance matrix using the refer-ence prior. The Annals of Statistics, 22(3):1195-1211, Sep 1994. X. Zhang, W:J. Boscardin, and T. Belin. Sampling correlation matrices in bayesian models with correlated latent variables. Journal of Computational and Graphical Statistics, 15(4):880-896, December 2006. 80 Part II Appendices Appendix A Distributions and Identities A . l The Multivariate Normal (Gaussian) Distribution fx(xi,xn\pi, E) = ( 2 7 r ) n / 2 | S | i / 2 e x P (j>(x - M) T£(x - M)) (A.l) where p is the mean, and E is the variance-covariance matrix. A.2 The Gamma Distribution fx(x\a,0) = -^-xa-1eM-8x) (A.2) where a is the shape parameter and 3 is the rate parameter. A.3 The Standard Inverse Wishart Distr ibution There are several parametrization of the inverse Wishart distribution. We will list below the ones that we use in this work. 1. The parametrization used in Gupta and Nagar (2000) Let E ~ IW(m, IT), then / r ( 5 » <x | E | - ^ ) exp( - i i r (E- 1 ) ) (A.3) With Expectation: £(E) = h m-2T-2 The Matlab function invwishirnd from the M C M C tool box (http://www.mathworks.com/matlabcentral/fileexchange/ by David Shera), 82 Appendix A. Distributions and Identities implements sampling from this distribution. 2. The parametrization used in Barnard et al. (2000) Let S ~ IW(v, IT), then / r ( 2 » oc m-hl'+T+D e x p ( _ l i r ( s - i ) ) (A.4) With Expectation; IT E(X) = v - T - 1 This corresponds to A.3, with m — u + T + 1. This parametrization is implemented in the matlab function iwishrnd in the STAT toolbox. 3. The parametrization used in Dawid and Lauritzen (1993) Let £ ~ IW(S, IT), then / r ( E | J ) a | S r ^ 5 + 2 T ) e x p ( - ^ r ( £ - 1 ) ) (A.5) With Expectation: IT <J-2 This corresponds to A.3, with 5 = v — T + 1 . In the one dimensional case all three parametrization reduce to an inverse Chi Square: / ( a > ) c < ( a r ^ + 2 ) e x p ( - ^ ) ) (A.6) In this case we could see that v and 5 is parametrization 2 and 3 are equivalent and they are equal to v, and in parametrization 1, m = v + 2. 83 Appendix B Marginal Pr ior on R proof from Barnard et al. (2000) Under the transformation S = DRD, the Jacobian is given by der a dr. Where o-ij - didj'-ij iii^j; ._ T , - _ i T era = eL if i = j. Therefore, (B.l) (B.2) IJI = 2dx 2d2 2dT d\d2 di d. 1«3 = n2(4) r =2 r(n* dx—idp (B.3) As an illustration consider the following example with E a 3 x 3 covariance matrix. Using the transformation S = DRD, we can write 84 Appendix B. Marginal Prior on R proof from Barnard et al. (2000) £ = d\ d\d2 d\d3 d\d2 dl d2d3 d\d3 d2d3 d3 (B.4) The jacobian is: IJI = <9(crn, cr22, CT33, 0-12,013, 0-23) d(di,d2,d3,r12,ri3,r23) 2di 0 0 dzn3 .0 0 2d2 0 dm2 0 d3r23 0 0 2d3 0 d\r\3 d2r23 0 0 0 d\d2 0 0 0 0 0 0 d\d3 0 0 0 0 0 0 d2d3 Here we could see that the lower triangular part of the Jacobian matrix is 0 and therefore taking the determinant is equivalent to multiplying the diagonal elements, which gives us | J | = 2 3(did 2d 3) 3 . In Barnard et al. (2000), they start with £ ~ IW(v, IT), where the inverse Wishart is defined as in A.3. (B.5) T T ( £ W CK | £ | ^ ( i y + T + 1 ) e x p ( - i i r ( £ - 1 ) ) n(R,D\u) oc \ D R D \ - ^ + i ) e M l { D R D ) _ l ) x ^ cx |A|-i^(nrf*)-^r+1)(n*)Texp (~\tr{DRD)-^ oc |i2|2("+r+i)TJ^-("+ 1) exp r 2dl where r11 is the ith diagonal element of R 1. The distribution of R is obtained by marginalizing over D: roo f(R\u) = / n(R,D\u)dD Jo (B.6) 85 Appendix B. Marginal Prior on R proof from Barnard et al. (2000) oc jT \ R \ ^ T + D J J -^<,+u e x p d D ( B : 7 ) We could perform another change of variable from di — > a.i by letting ei-i = T T H . with da.i = —^ddi we can work through the algebra: i f°° - r / 3 f{R\u) oc \R\\W+*Y[] {di)A^i) e M _ a 0 d a i poo / J 2 \ (~ "+2) /2 , ( - " + 2 ) / 2 /•oo oc | i ? | l ^ + 1 ) ( n ^ ) - - I I / K ) ( i / - 2 ) / 2 e x p ( - a t ) ^ POO oc | i . |3^ r + 1 )(TTr«)-5TT / a i " - a ) / 2 expC-aO dt*. (B.8) i i • / o % ' r ( . | , i ) . From the above we could see that ir(R,D) = ir(R,a) = iT(a\R)ir(R) (B.9) Where ir(oi\R) ~ G a m m a ( ^ , l ) (B.10) a l i-i^-^ni^i))-^ 2 - . 0.11) i . The distribution of R in B . 11 is obtained by using the matrix algebra identity: ii I H-ii | • \R\ where Ra is the principal submatrix of R. The marginal distribution of each r^ - is obtained using the marginal-ization property of the inverse Wishart, which states that each principal submatrix of an inverse Wishart is also an inverse Wishart. This means 86 Appendix B. Marginal Prior on R proof from Barnard et al. (2000) that this derivation could be obtained for any T\ x T\ sub-covariance ma-trix. Choosing a submatrix S i of S, S i ~ IW{y - (T — T{),I). The density of the correlation submatrix is as in .B . l l , with T = T\ and v = v — (T — T\). In the case where T\ — 2, the marginal density is: / 2 ( r y » = ( l - r i j ) I T 1 ' (B.12) In this case, B.12 could be viewed as Beta , w~T2+l) on [-1,1] and is uniform when v = T + 1. 87 Appendix C Computation of the Jacobian \ J : Z ^ W \ ( • This is an example of the computation of the Jacobian of the transformation Z = D " 1 W for T = 3, p = 2, and i = 1,..., n: Zn The Jacobian is then I j, _ d(Zn, Z\2, Z\3, • . . , Zni, Zn2, Zns) 1 l _ d(WluW12,W13,...,Wnl,Wn2,Wn3) 0 0 0 D221 0 0 0 \J\ = Dn1 0 0 .0 D221 0 0 0 = \In®D-1\ = \D\~n . (C.5) = DrfWa (C.l) = D^Wi2 (C,2) = D^Wa (C.3) 88 Appendix D Sampling from Multivariate truncated Gaussian In the context of Multivariate Probit, we are interested in generating random samples from a multivariate Gaussian subject to multiple linear inequality constraints. We follow the method outlined in Geweke (1991) using Gibbs sampling. Let x ~ N(fi, £ ) a<x<b Where £ is an p x p covariance matrix of rank p, p is a p x 1 vector of means and a and b are vectors of lower and upper bound that can take on the values -co and +00 respectively. This problem is equivalent to sampling from a p-variate normal distribution subject to linear constraints: z ~ N(0, £ ) a < z < ,/? where a = a — p , and 3 = b — \x. We can then take x = ji + z. To sample the ^'s, we adopt a Gibbs sampling approach that uses the property that each element of z, conditional on all of the other elements of z is a univariate truncated normal. From conditional multivariate Normal distribution theory, we have the following result: If z ~ iV(0, T), the non truncated distribution 'E(zi\z-i) = ^CijZj (D.l) 89 Appendix D. Sampling from Multivariate truncated Gaussian where c^ - is defined in ?? and D.4 Then the truncated distribution has the following construction: (D.2) Where TN is the univariate truncated normal distribution, and the vec-tor of coefficients in the conditional mean is c% = (cii,...,cn-i,Ci+1,...Cip)'ci) (D.3) where i = 1,... ,p. and c i = _(S«)-iS«.<* and hj = ( E " ) _ 1 (D.4) Where £ w is the element in row i and column i of E - 1 and S* , < 1 is row i of E - 1 ' with E " deleted. As mentioned in Geweke (1991), we only need to perform these calcula-tion once in the beginning. We can then cycle through the Gibbs steps as follows: • Initialize z^> = 0 • In the first pass, generate p successive variables from : Zi ) • • • ) Zi-1> Zz+1> • • • ' P I J ' ' Z l » • • • ' Zi-\i Zi+1> • ' • > V > (D.S) where i — 1,..., p • • Repeat the above such that at the j ' t h pass: zU)UzU) Jj) zU-i) • ZU-V)„ f.(z(3) ZU) 0-D 2(_-i)\ (D.6) 90 Appendix D. Sampling from Multivariate truncated Gaussian • at the end of each pass we compute x{j) = p + z{j) (D.7) There are several methods available for generating a univariate truncated Normal distribution, in our implementation, we adopt the methods used in Robert (1995). Let x~N(n,iT,o2) where LX is the mean, u~ is the left truncation point and <r2 is the vari-ance. Robert (1995) uses a accept-reject algorithm that is more efficient than rejection sampling or the inverse cdf method which could be very inefficient if p~ — p is large. Assuming without loss of generality that ii — 0 and o2 — I- the algorithm proceeds as follows: 1. Generate z ~ Exp(a*, p,~) 2. Compute p(z) — exp(-(2 - a*) 2/ 2) 3. Generate u ~ U[0,1] and take x — z if u < p(z), otherwise go back to the first step. Where Exp(a*, p,~) is the translated Exponential distribution and the optimal value of a* = M ~ + ^ ~ ) 2 + 4 ) . 91 Appendix E Sampling from the Hyper Inverse Wishart Distr ibution (Carvalho et al. , 2007) Let G — (V,E), be a decomposable, undirected graph with \V\ = T. If we assume that G is a Gaussian Graphical Model with a sparse structure, then E ~ HIWo(b, D) Where the Hyper inverse Wishart is defined as in A.5. A graph can be represented by a perfect ordering of its prime components and separators. An ordering of components Pi € P and separators Si G S, (Pi, S2, P2, S3l..., PT), is said to be perfect if for every i = 2, 3 , . . . , T there exists a j < i such that Si = Pi Hi-i C Pj where In order to obtain the perfect ordering of prime components P\,..., Pk, we need to generate the junction tree of the graph (Cowell et al., 1999). For a decomposable undirected graph G, a junction tree is a represen-tation of its prime components. A junction tree is a subgraph of the de-composable graph that has the following characteristics: (1) it is a tree, (2) it contains all the nodes of the graph, and (3) it satisfies the junction tree 92 Appendix E. Sampling from the Hyper Inverse Wishart Distribution (Carvalho et al, 2007) property: For each pair Q , Cj of cliques with.intersection S, all cliques on the path between d and Cj contain S. Subsequently the joint density of £ factorizes as: •fc p(E\b,D)=p(EPl)l[P(EPz\Esi) (E.l) i=2 For decomposable models, all prime components are complete. Therefore standard results for the inverse Wishart (Gupta and Nagar, 2000) enables sampling from each of the distributions in the composition directly. Where T>Si,Ri = Ep]. s . . Define. ^Ri.Si = ^Ri - ^ R u S i ^ s ^ S i A (E.2) DRl.Si = D ^ - D R ^ D ^ D S ^ R , (E.3) The sampling scheme would proceed as follows: 1. sample E d ~ IW(b, Dcx), this gives the values of submatrix £ s 2 . 2. For i = 1,..., k, sample ERi.Si ~ IW(b-+\RI\1DR..SI) (E.4) • Ui ~ N(DRitSiD£,'2Ri.si®Ds?) . (E.5) We could then directly compute ERuSi = UiESi and ER. = ERi,Si + The non-free parameters of E are obtained through matrix completion given the perfect ordering of its prime components (for details see Lauritzen (1996)) using: ^Ri,Ai-i = ER^StEg^Es^Ai-! (E.6) 93 Appendix F Simulation Results Entropy Quadratic Saturated Structured Saturated Structured 22.228 2.534 154.075 3.827 28.097 3.130 237.388 4.862 36.069 3.095 424.618 4.381 23.836 3.111 - 189.696 4.585 27.421 3.558 226.054 4.563 21.959 3.041 149.302 4.225 39.708 4.216 1569.468 6.129 27.325 2.066 242.982 3.530 2.302 0.401 , 10.406 1.063 5.227 0.362 48.853 0.941 2.769 0.941 12.265 ' 8.453 2.375 0.513 16.248 0.995 2.867 0.503 17.114 1.250 4.383 0.638 42.125 4.627 3.162 0.533 20.535 1.354 5.947 0.739 644.714 3.338 2.413 0.558 10.570 1.427 2.616 0.559 12.475 1.100 3.793 0.444 36.864 0.970 22.310 2.087 157.641 3.469 3.518 0.637 21.579 1.890 2.114 0.455 9.493 1.079 2.103 0.371 8.683 0.870 Table F.l: Simulation results: Entropy and quadratic loss for 50 data sets gener-ated by different correlation matrices with the same structure 94 Appendix F. Simulation Results Entropy Quadratic Saturated Structured Saturated Structured 2.026 0.552 7.596 1.031 5.835 1.123 556.727 18.055 1.746 0.392 6.715 0.994 1.953 0.343 7.985 0.788 2.337 0.384 10.725 1.011 2.428 0.415 11.166 0.830 4.546 0.842 34.827 2.885 24.410 2.088 186.270 3.526 19.670 1.635 127.148 2.830 22.453 2.586 156.924 4.198 53.488 6.072 1066.345 5.868 29.999 2.337 278.968 3.848 26.146 2.313 211.350 3.441 1.860 0.402 6.632 0.756 2.209 0.343 10.706 0.819 2.749 0.532 14.805 0.990 1.701 0.501 5.315 0.866 1.953 0.408 8.378 0.820 ' 2.399 0.413 10.604 0.969 5.596 1.018 67.955 3.281 1.772 0.311 7.622 0.728 2.046 0.771 6.495 1.212 2.900 0.518 16.389 1.720 164.743 0.617 847548.010 1.271 6.030 0.501 283^085 1.661 3.300 0.553 17.834 1.489 2.937 0.890 17.898 3.991 Table F.2: Table F continued
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Bayesian inference in the multivariate probit model
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Bayesian inference in the multivariate probit model Tabet, Aline 2007
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Bayesian inference in the multivariate probit model |
Creator |
Tabet, Aline |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Correlated binary data arise in many applications. Any analysis of this type of data should take into account the correlation structure among the variables. The multivariate Probit model (MVP), introduced by Ashford and Snowden (1970), is a popular class of models particularly suitable for the analysis of correlated binary data. In this class of models, the response is multivariate, correlated and discrete. Generally speaking, the MVP model assumes that given a set of explanatory variables the multivariate response is an indicator of the event that some unobserved latent variable falls within a certain interval. The latent variable is assumed to arise from a multivariate normal distribution. Difficulties with the multivariate Probit are mainly due to computation as the likelihood of the observed discrete data is obtained by integrating over a multidimensional constrained space of latent variables. In this work, we adopt a Bayesian approach and develop an an efficient Markov chain Monte Carlo algorithm for estimation in MVP models under the full correlation and the structured correlation assumptions. Furthermore, in addition to simulation results, we present an application of our method to the Six Cities data set. Our algorithm has many advantages over previous approaches, namely it handles identifiability and uses a marginally uniform prior on the correlation matrix directly. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-03-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0101129 |
URI | http://hdl.handle.net/2429/32271 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2007-0603.pdf [ 6.94MB ]
- Metadata
- JSON: 831-1.0101129.json
- JSON-LD: 831-1.0101129-ld.json
- RDF/XML (Pretty): 831-1.0101129-rdf.xml
- RDF/JSON: 831-1.0101129-rdf.json
- Turtle: 831-1.0101129-turtle.txt
- N-Triples: 831-1.0101129-rdf-ntriples.txt
- Original Record: 831-1.0101129-source.json
- Full Text
- 831-1.0101129-fulltext.txt
- Citation
- 831-1.0101129.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0101129/manifest