Modeling Dependencies in Multivariate Data

by

Aline Talhouk

B.A., Concordia University, 1997
M.Sc., The University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2013

© Aline Talhouk 2013

Abstract

In multivariate regression, researchers are interested in modeling a correlated multivariate response variable as a function of covariates. The response of interest can be multidimensional, and the correlation between the elements of the multivariate response can be very complex. In many applications, the association between the elements of the multivariate response is treated as a nuisance parameter, and the focus is on efficiently estimating the regression coefficients in order to study the average change in the mean response as a function of predictors. In many other cases, however, the estimation of the covariance and, where applicable, the temporal dynamics of the multidimensional response is the main interest, as is the case in finance, for example. Moreover, the correct specification of the covariance matrix is important for the efficient estimation of the regression coefficients. These complex models usually involve some parameters that are static and some that are dynamic. Until recently, the simultaneous estimation of dynamic and static parameters in the same model has been difficult. The introduction of particle MCMC algorithms by Andrieu and Doucet (2002) has made it possible to consider such models. In this thesis, we propose a general framework for jointly estimating the covariance matrix of multivariate data as well as the regression coefficients. This is done under different settings, for different dimensions and measurement scales.

Preface

This work builds on work developed during my UBC Master's thesis (Tabet, 2007).
In particular, the work concerning binary data and the inference of static correlation matrices in Chapters 3 and 4 extends the Master's thesis work. More specifically, in Chapter 3, we extend the work done in Tabet (2007) to include model selection: a structure for the inverse correlation matrix is no longer assumed; rather, we estimate the structure directly from the data along with the other parameters. Another major change with respect to the Master's thesis arises in Chapter 4. The prior on the regression coefficients β that was previously suggested in Tabet (2007) results in a posterior that depends on the expansion parameter D. This is corrected in Chapter 4 by taking a conjugate prior for β that is centered at 0 and has a covariance matrix that depends on the correlation matrix R. Furthermore, the regression coefficients are sampled in the augmented space, and we are therefore able to integrate out β locally. Parts of the content of Chapters 3 and 4 were done in collaboration with, and under the supervision of, Dr. Kevin Murphy and Dr. Arnaud Doucet, and were published in the Journal of Computational and Graphical Statistics (Talhouk et al., 2012).

The heart rate variability data example discussed in Chapters 1 and 5 is based on physiological data obtained from the Pediatric Anesthesia Research Team (PART) at the British Columbia Children's Hospital. This study was approved by the University of British Columbia and Children's & Women's Health Centre of British Columbia Research Ethics Board (CREB H09-00905). Parents of all subjects provided written informed consent to participate in the study. More details about the study protocol and data collection are provided in Chris J. Brouse's Ph.D. dissertation (Brouse, 2013).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction
  1.1 Motivation
  1.2 Data Examples
    1.2.1 Heart Rate Variability as an Indicator of Nociception
    1.2.2 Stock Market Return of Technology Firms
    1.2.3 Women and Mathematics
    1.2.4 Six Cities Data
  1.3 Thesis Organization

2 Time-Varying Covariance
  2.1 The Wishart Prior
    2.1.1 Dynamic Modeling of the Covariance in the Literature
    2.1.2 The Inverse Wishart
    2.1.3 The Dynamic Linear Model
    2.1.4 The Wishart Autoregressive Process
    2.1.5 The Hyperparameters of the WAR Prior
  2.2 Bayesian Estimation with the WAR
  2.3 Sampling using Sequential Monte Carlo (SMC)
    2.3.1 Computation of the Marginal Likelihood
  2.4 SMC Algorithm for the Wishart Autoregressive Process
    2.4.1 The Non-central Wishart Prior as Importance Distribution
    2.4.2 A Central Wishart as Importance Distribution
    2.4.3 Numerical Experiments using SMC
  2.5 Estimating Fixed Parameters
  2.6 Particle MCMC for Fixed Parameter Estimation
    2.6.1 The Particle Independent Metropolis-Hastings Algorithm (PIMH)
    2.6.2 The Particle Marginal Metropolis-Hastings Algorithm (PMMH)
    2.6.3 Estimating Both M and S
  2.7 Summary and Conclusion

3 Static Correlation
  3.1 Introduction
  3.2 Conditional Independence
  3.3 Gaussian Graphical Models
  3.4 Some Graph Theory
  3.5 Bayesian Estimation in Gaussian Graphical Models
    3.5.1 Prior and Posterior for Covariance Matrices
    3.5.2 Covariance Selection and Model Determination
  3.6 Bayesian Estimation for Binary Data
    3.6.1 Probit Model with a Structured Correlation Matrix
  3.7 Simulation Results
  3.8 Conclusion

4 Estimating the Effects of Covariates
  4.1 Multivariate Model, Static Covariance
    4.1.1 Continuous Response
    4.1.2 Binary Response
  4.2 Dynamic Covariance
    4.2.1 Prior Specification for (β, Σ^{-1}_{1:T})
    4.2.2 Posterior and Sampling Scheme
  4.3 Simulation Results
  4.4 Summary and Discussion

5 Applications
  5.1 Heart Rate Variability as an Indicator of Nociception
  5.2 Stock Market Returns of Technology Firms
  5.3 "Women and Mathematics" Data Set
  5.4 Six Cities Data Set

6 Conclusion
  6.1 Summary
  6.2 Future Work
    6.2.1 Numerical Evaluation of the Hypergeometric Function
    6.2.2 Better Proposals for Static Parameters
    6.2.3 Improving Speed of Mixing and Avoiding Degeneracy
    6.2.4 Other Possible Extensions

Bibliography

Appendix A Distributions
  A.1 The Gaussian Distribution
    A.1.1 The Univariate Representation
    A.1.2 Multivariate and Matrix-Variate Gaussian Distribution
  A.2 The χ² Distribution
    A.2.1 The Central χ²
    A.2.2 The Non-central χ²
  A.3 The Wishart Distribution

Appendix B Conjugate Analysis
  B.1 Normal Wishart
  B.2 Normal Wishart with Covariates

Appendix C Heart Rate Data Results
  C.1 Bolus Event 1 Data Results
  C.2 Bolus Event 2 Data Results
  C.3 Bolus Event 3 Data Results
  C.4 Bolus Event 4 Data Results
  C.5 Bolus Event 5 Data Results
  C.6 Bolus Event 6 Data Results
  C.7 Bolus Event 7 Data Results
  C.8 Bolus Event 8 Data Results
  C.9 Bolus Event 9 Data Results
  C.10 Bolus Event 10 Data Results
  C.11 Bolus Event 11 Data Results
  C.12 Bolus Event 12 Data Results
  C.13 Bolus Event 13 Data Results
  C.14 Bolus Event 14 Data Results
  C.15 Bolus Event 15 Data Results
  C.16 Bolus Event 16 Data Results
  C.17 Bolus Event 17 Data Results
  C.18 Bolus Event 18 Data Results
  C.19 Bolus Event 19 Data Results
  C.20 Bolus Event 20 Data Results
  C.21 Bolus Event 21 Data Results
  C.22 Bolus Event 22 Data Results
  C.23 Bolus Event 23 Data Results
  C.24 Bolus Event 24 Data Results
  C.25 Bolus Event 25 Data Results
  C.26 Bolus Event 26 Data Results
  C.27 Bolus Event 27 Data Results
  C.28 Bolus Event 28 Data Results
  C.29 Bolus Event 29 Data Results
  C.30 Bolus Event 30 Data Results

List of Tables

1.1 Different types of multivariate data
1.2 Summary statistics of technology stock log-returns
1.3 Breakdown of wheezing cases by age group and smoking status of mother at onset. The percentages in parentheses are given with respect to each age group.
2.1 Posterior median of ρ obtained for different values of c and K.
2.2 Time to run different algorithms, in minutes.
5.1 Table of edge marginal probabilities comparing results obtained by the algorithm developed in this paper and those by the MC³ algorithm as outlined in Tarantola (2004)
5.2 Breakdown of wheezing cases by age group and smoking status of mother at onset. The percentages in brackets are percentages with respect to each age group.
5.3 Posterior means of the parameters of interest. The first two columns correspond to using the saturated model for the Jeffreys's prior of Lawrence et al. (2008) and the marginally uniform prior. The last column corresponds to using a marginally uniform prior and performing model selection.
5.4 Posterior predictive statistical checks with 95% intervals and p-values, indicating that the posterior predictive distribution of the proposed model is not statistically different from the observed data

List of Figures

1.1 A diagram depicting the multivariate longitudinal response with n rows, J columns and T time points.
1.2 The brain perceives pain through the nociception system, which prompts increased activity in the autonomic nervous system (ANS). In turn, the ANS activates the sympathetic nervous system (SNS), creating a stress response in the body that includes increased respiration, blood pressure and heart rate.
1.3 Heart rate data for a 6 year old child undergoing a dental procedure.
The child starts out in a nociceptive state. An anesthetic bolus is administered and the state of the child returns to normal. The top line is the raw measured heart rate and the bottom line is the filtered heart rate, with the mean removed.
1.4 A plot of the weekly return of Google, Microsoft, IBM and Apple stock for the period of September 1, 2004 to February 1, 2012
1.5 A graphical model representation of the multivariate longitudinal model which includes covariate information and the covariate effects parameter. Observed variables are shaded, unobserved variables are clear.
2.1 Graphical representation of the multivariate longitudinal data. The observed variables are shaded, and the latent (unobserved) variables are not. Conditional on the precision matrices, the observed data are independent over time.
2.2 Precision simulated from the prior using different values of the hyper-parameters ρ, S and d.
2.3 Time series plots showing the effect of changing ρ on the marginal elements of Σ⁻¹ in the bivariate case (J = 2).
2.4 Effective sample size and the mean squared error obtained by increasing the number of particles K.
2.5 Effective sample size and the mean squared error obtained by increasing the number of time steps T.
2.6 Effective sample size and the mean squared error obtained by increasing the number of observations n.
2.7 Effective sample size and the mean squared error obtained by increasing the number of observations n.
2.8 Marginal likelihood for different values of underlying ρ in varying dimensions, with T = 200 time steps.
2.9 Error in the marginal likelihood for different values of underlying ρ in varying dimensions, with T = 200 time steps.
2.10 PIMH simulation results showing the acceptance probability, the ESS and the marginal likelihood as a function of the number of particles, MC iterations and number of time steps.
2.11 The posterior traceplots of the parameter ρ obtained using the PMMH algorithm with different K and increasing c.
2.12 Plot of the marginal likelihood obtained from running SMC with different values of ρ with K = 1000 particles. The maximum is reached at ρ = 0.83.
2.13 Plot of the posterior estimate of ρ obtained using pMCMC with c = 0.2 and K = 1000.
2.14 Plot of the true correlation and the median estimated correlation along with a 95% credible interval, obtained using the PMMH algorithm estimating M and fixing S.
2.15 Average entropy loss for Σ⁻¹ computed over 30 independent experiments under a fixed ρ and using the PMMH algorithm.
2.16 Acceptance probability of PMMH on 30 different data sets, generated with different true values of ρ.
2.17 The posterior traceplots of the parameters ρ and s² obtained using the PMMH algorithm with a fixed K = 2000 and simultaneously increasing c₁ and c₂.
2.18 The posterior traceplots of the parameters ρ and s² obtained using the PMMH algorithm for c₁ = c₂ = 0.2.
2.19 Posterior distribution of the estimated precision in the univariate case (J = 1) with 95% credible interval using the different algorithms.
2.20 Plot of the true correlation and the median estimated correlation along with a 95% credible interval, obtained using the SMC algorithm with the Non-central Proposal.
2.21 Plot of the true correlation and the median estimated correlation along with a 95% credible interval, obtained using the PMMH algorithm estimating M and fixing S.
2.22 Plot of the true correlation and the median estimated correlation along with a 95% credible interval, obtained using the PMMH algorithm estimating both M and S.
3.1 A decomposable graph and its junction tree decomposition. Each node of the junction tree represents a clique, while vertices shared by adjacent nodes of the tree define the separators.
3.2 An example of two different graph structures, resulting in a fully connected precision matrix in 3.2(a) versus a sparse precision matrix in 3.2(b).
3.3 A graphical representation of the multivariate model with J = 3. In this model Y represents the observed binary data and Z represents the latent correlated Gaussian data.
3.4 Most probable graphical structures and estimated posterior probabilities for simulated data.
3.5 Matrix of posterior edge marginal probabilities, simulated data. Here the size of the square is indicative of the magnitude of the posterior edge marginal probability; the larger the square, the higher the probability that an edge exists between the corresponding nodes.
3.6 Boxplot of the entropy loss over 50 datasets obtained using a fixed saturated correlation matrix (left) versus model selection (right).
4.1 Example of ancestral lineages generated by an SMC algorithm for K = 5 and T = 3. The shaded path is Σ^{-1(2)}_{1:3} = (Σ_1^{-1(3)}, Σ_2^{-1(4)}, Σ_3^{-1(2)}) and the ancestral lineage is B^{(2)}_{1:3} = (3, 4, 2) (example adapted from Andrieu et al. (2010)).
4.2 Posterior estimates of β using Particle Gibbs
4.3 Plot of the true correlation and the median estimated correlation along with a 95% credible interval, obtained using the PG algorithm with conditional SMC.
4.4 Entropy loss computed at each time step and plotted against time.
5.1 Results for Bolus Event 1
5.2 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 1.
5.3 Results for Bolus Event 2
5.4 Results for Bolus Event 4
5.5 Posterior distribution of ρ obtained from running the PMMH algorithm on the four technology stock problem
5.6 The marginal likelihood evaluated and plotted for different scalar values of ρ on a grid. The maximum is reached at ρ = 1.
5.7 Most probable graphical structures and estimated posterior probabilities for the "Women and Mathematics" data set.
5.8 Autocorrelation function for the marginal correlation parameters obtained from the Six Cities example under the saturated assumption.
5.9 The four most probable graph structures, with their associated posterior probabilities.
C.1 Results for bolus event 1
C.2 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 1.
C.3 Results for bolus event 2
C.4 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 2.
C.5 Results for bolus event 3
C.6 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 3.
C.7 Results for bolus event 4
C.8 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 4.
C.9 Results for bolus event 5
C.10 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 5.
C.11 Results for bolus event 6
C.12 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 6.
C.13 Results for bolus event 7
C.14 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 7.
C.15 Results for bolus event 8
C.16 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 8.
C.17 Results for bolus event 9
C.18 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 9.
C.19 Results for bolus event 10
C.20 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 10.
C.21 Results for bolus event 11
C.22 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 11.
C.23 Results for bolus event 12
C.24 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 12.
C.25 Results for bolus event 13
C.26 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 13.
C.27 Results for bolus event 14
C.28 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 14.
C.29 Results for bolus event 15
C.30 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 15.
C.31 Results for bolus event 16
C.32 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 16.
C.33 Results for bolus event 17
C.34 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 17.
C.35 Results for bolus event 18
C.36 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 18.
C.37 Results for bolus event 19
C.38 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 19.
C.39 Results for bolus event 20
C.40 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 20.
C.41 Results for bolus event 21
C.42 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 21.
C.43 Results for bolus event 22
C.44 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 22.
C.45 Results for bolus event 23
C.46 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 23.
C.47 Results for bolus event 24
C.48 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 24.
C.49 Results for bolus event 25
C.50 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 25.
C.51 Results for bolus event 26
C.52 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 26.
C.53 Results for bolus event 27
C.54 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 27.
C.55 Results for bolus event 28
C.56 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 28.
C.57 Results for bolus event 29
C.58 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 29.
C.59 Results for bolus event 30
C.60 Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 30.
Glossary

Random and Observed Variables

• V: a response variable of interest; may also refer to a vertex in a graphical model.
• Z: a continuous random variable.
• Y: a binary 0/1 random variable.
• X: an n × p design matrix of covariates observed without error.

Parameters

• β: a p × J matrix of regression coefficients.
• µ: the mean parameter.
• Σ: a sequence of J × J covariance matrices of length T.
• S: the scale parameter of the Wishart distribution.
• M: the autoregressive parameter.
• u: a latent Gaussian variable.
• d: the degrees of freedom of the Wishart distribution.
• G: a graphical model, defined by a collection of vertices and edges.

Indices

• n: number of observations, indexed by i = 1, . . . , n.
• J: number of dependent variables, indexed by j = 1, . . . , J.
• T: number of time intervals (not necessarily equally spaced), indexed by t = 1, . . . , T.
• K: number of particles, indexed by m = 1, . . . , K.

Distribution-Related Notation

• g(·): the canonical link of a GLM.
• f(·): the likelihood of the observed data.
• π(·): the prior or posterior distribution of parameters.
• p(·): the target distribution of interest.
• q(·): the importance distribution used to approximate the target of interest.
• E(·): the expected value.
• Var(·): the variance.
• F(·): a parametric distribution that has not been specified.

Acronyms

• MCMC: Markov chain Monte Carlo
• SMC: Sequential Monte Carlo
• pMCMC: Particle Markov chain Monte Carlo
• AR: Autoregressive process
• WAR: Wishart Autoregressive process

Other Symbols

• ⊗: Kronecker product
• etr: exponential trace, such that etr(A) ≡ exp(tr(A))

Acknowledgments

First and foremost, I would like to express my heartfelt gratitude to my advisors during my time at UBC: Dr. Paul Gustafson, Dr. Arnaud Doucet and Dr. Kevin Murphy. I could not have asked for better role models, each inspirational, supportive, and patient.
Through your guidance and support, I learned to be a better researcher and critical thinker. It has been a huge honor to work with you, and I appreciate your contributions of time, ideas, and funding that made my graduate school experience possible and successful.

I would also like to thank my examining and supervisory committee, Dr. William Welch, Dr. Jim Zidek, Dr. Jane Wang and Dr. Petros Dellaportas, for their constructive, thoughtful and detailed feedback. I am very grateful to the anonymous reviewers at JCGS for their useful comments, which helped improve the work presented herein. This thesis was partially funded by an NSERC CGS-D grant; I would like to thank NSERC for their generous support.

A special thanks goes to all the faculty and staff in the Department of Statistics at UBC. I am particularly happy to acknowledge my debt to Dr. John Petkau and to Rick White, who helped plant and nurture my interest in consulting. My deepest thanks and acknowledgment go to my collaborator and friend Chris Brouse for allowing me to use the data from his research and for assisting in the biological interpretation of the results.

My time at UBC has been made enjoyable in large part by the many friends who became a part of my life. I am grateful for the memorable times spent in the company of Roman Holenstein, Kimberly Fernandes, Steve Kanters, Mike Marin, Alexandra Romann, Luke Bornn, Julien Cornebise, Francois Caron, Pierre Jacob, Glen Goodvin, Michael Chiang, and Mark Schmidt. Your friendship has been most inspiring, both professionally and personally. I have been very fortunate to have many friends I could rely on for support outside the UBC community. Marianne Zakhour, I am forever indebted to you for your generosity, endless support and loyalty. Roxane Genot, your contagious laughter makes the darkest day a little brighter.
Rodney Daughtrey, you have been my rock during the final stages of this Ph.D.; I am very grateful to have you in my life. Lastly, I would like to thank my family for their continued love, support and encouragement: Rhea, for brightening my life with laughter and hope; my brother Ziad, for teaching me the meaning of perseverance; my dad, for inspiring me to leave my mark; and my mom, who is the real reason I had any chance to make it this far down the academic path. Thank you.

Aline Talhouk
UBC
July 2013

Dedication

To Rhea of Sunshine.

Chapter 1

Introduction

1.1 Motivation

In multivariate regression, researchers are interested in modeling a correlated multivariate response variable as a function of covariates. The response of interest, in the most general case, can be three-dimensional and represented by a cube (Figure 1.1). Let V denote the multivariate longitudinal response, which consists of a vector of dimension J, measured for n units of observation over T time steps. Each time step t = 1, . . . , T represents a time slice of the cube.

Figure 1.1: A diagram depicting the multivariate longitudinal response with n rows, J columns and T time points.

The correlation between the elements of the multivariate response can be very complex. At time slice t, the J responses may be correlated, the n units of observation may be clustered, and, furthermore, the correlation might be changing smoothly over time. In this thesis, we do not consider clustered or spatially correlated units; all units of observation (the rows of the cube in Figure 1.1) are assumed to be independent and identically distributed. In certain applications, such as biostatistics, the association between the elements of the multivariate response is typically treated as a nuisance parameter.
The focus is on efficiently estimating the regression coefficients, in order to study the average change in the mean response as a function of predictors. However, in many cases, the estimation of the covariance, which encodes the co-variation, and, where applicable, the temporal dynamics of the multidimensional response variable are of interest. A common example arises in financial applications, where the primary aim is to model the volatility and the co-volatility of financial variables. Moreover, the correct specification of the covariance matrix is important for the efficient estimation of the regression coefficients. In this thesis, we aim to develop methodology to jointly estimate the association between the elements of the multivariate response as well as the effect of covariates. We propose a framework that allows for an increase in each of the three dimensions (n, J and T) and that is suitable for both continuous and binary measurements.

1.2 Data Examples

Multivariate data are very common and arise in many applications. They are referred to by different names, but each can be viewed as a special case of the cube in Figure 1.1. In Table 1.1, we summarize some of the different types of responses commonly encountered in applications, and link them back to the cube of Figure 1.1.

Data Type                   Units n   Outcomes J   Time T
Univariate Response         many      one          one
Multivariate Response       many      many         one
Longitudinal Response       many      one          many
Time Series                 one       one          many, equally spaced
Repeated Measure            many      one          many, not always equally spaced
Multivariate Time Series    one       many         many, equally spaced
Multivariate Longitudinal   many      many         many, not always equally spaced
MV Repeated Measure         many      many         many, not always equally spaced

Table 1.1: Different types of multivariate data

In the next sections, we will introduce some of the data examples to be used in this thesis, along with the research questions that we will be addressing.
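The response types of Table 1.1 can all be viewed as slices of a single three-way array; a minimal sketch, with all dimensions and values hypothetical:

```python
import numpy as np

# Hypothetical dimensions: n units, J outcomes, T time steps.
rng = np.random.default_rng(0)
n, J, T = 5, 3, 4
V = rng.normal(size=(n, J, T))   # the response "cube" of Figure 1.1

V_t = V[:, :, 0]                 # one time slice: an n x J matrix
# Special cases of Table 1.1 correspond to collapsing axes:
univariate = V[:, 0, 0]          # many units, one outcome, one time point
time_series = V[0, 0, :]         # one unit, one outcome, T time points
multivariate = V[:, :, 0]        # many units, many outcomes, one time point
```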
1.2.1 Heart Rate Variability as an Indicator of Nociception

The first example comes from a health research application in neuroscience. The response variable of interest is a time series measurement of the heart rate of a patient during surgery (n = 1, J = 1, T = 418). In this example, we do not consider covariate information; rather, we would like to model the variability in the heart rate at each time step. A change in heart rate variability is an indicator of a stress response (Figure 1.2), which signals that a patient is feeling pain. In sick patients, a strong stress response can cause serious injury. Anesthesiologists try to minimize the stress response during surgery by giving patients drugs that block nociception. Finding the appropriate balance can be challenging: too much anesthesia can make the patient very ill, while too little anesthesia increases the stress response. The aim is to measure the variability in heart rate to assist anesthesiologists in determining the level of Autonomic Nervous System (ANS) activity.

The data under consideration for this example is obtained from the Pediatric Anesthesia Research Team (PART) at the British Columbia Children's Hospital. The patients are 6 year old children undergoing dental procedures under general anesthesia.

Figure 1.2: The brain perceives pain through the nociception system, which prompts increased activity in the autonomic nervous system (ANS). In turn, the ANS activates the sympathetic nervous system (SNS), creating a stress response in the body that includes increased respiration, blood pressure and heart rate.

A baseline-stimulus response experiment was conducted, where a patient is in a baseline state, a stimulus occurs, and the patient exhibits a response changing the initial state. In the data we consider, the patient is experiencing pain, so that the baseline state is the nociceptive state. An anesthetic bolus (the stimulus) is administered. This can be a single shot or multiple shots applied over a few seconds. Subsequently, the state of the patient returns to normal, which is the anti-nociceptive state. There are a total of 30 such bolus events, which we analyze separately. Each event is unique, although some of these events are from the same procedure. In Figure 1.3, we plot the heart rate data from one such bolus event over time. The top line is the raw measurement of heart rate over T = 418 time steps. Typically the interest is not in modeling the mean but the variability of heart rate; therefore, it is common to pre-process the data to remove the mean (the bottom line in Figure 1.3). This is usually done by fitting an autoregressive linear regression model and working with the residuals.

Figure 1.3: Heart rate data for a 6 year old child undergoing a dental procedure. The child starts out in a nociceptive state. An anesthetic bolus is administered and the state of the child returns to normal. The top line is the raw measured heart rate and the bottom line is the filtered heart rate, with the mean removed.

1.2.2 Stock Market Return of Technology Firms

In the next example we consider volatilities of asset returns, which are generally thought to be time-changing (Poon and Granger, 2003). Moreover, they are likely to be serially correlated. This may suggest that the covariance matrix is smoothly and slowly changing in time.
This has been well established in the financial time series literature, making multivariate stochastic volatility models an important topic of research in that field. To illustrate this with an example, we consider the weekly log returns of technology stocks for the period of September 1, 2004 to February 1, 2012. The data was obtained from Yahoo! Finance. We look at log returns of closing values, adjusted for dividends and splits, for the following four technology assets: Google, Microsoft, IBM, and Apple. The time series are of dimension T = 387 time steps, with n = 1 and J = 4. A first step in the analysis of this model is to pre-process the data so that the response variable of interest is mean-centered. One way to do this is to fit a J-variate autoregressive ordinary least squares model and work with the fitted residuals instead of the raw data (more on this in Chapter 5). A plot of the mean-centered time series over time is shown in Figure 1.4 and the summary statistics are presented in Table 1.2. The sample correlation matrix is given by:

\hat{R} = \begin{pmatrix}
1.00 & 0.39 & 0.43 & 0.53 \\
0.39 & 1.00 & 0.51 & 0.36 \\
0.43 & 0.51 & 1.00 & 0.54 \\
0.53 & 0.36 & 0.54 & 1.00
\end{pmatrix}. \qquad (1.1)

Figure 1.4: A plot of the weekly return of Google, Microsoft, IBM and Apple stock for the period of September 1, 2004 to February 1, 2012.

            Minimum    Q1      Median   Mean    Q3     Maximum   St. Dev.
Google      −0.07     −0.01    0.00     0.00    0.01    0.08      0.02
Microsoft   −0.09     −0.01    0.00     0.00    0.01    0.06      0.02
IBM         −0.07     −0.01    0.00     0.00    0.01    0.06      0.01
Apple       −0.12     −0.01    0.00     0.00    0.02    0.08      0.02

Table 1.2: Summary statistics of technology stock log-returns

The main aim of this analysis is to provide an estimate of the 4 × 4 covariance matrix at each time point t, while accounting for the slow evolution of this matrix in time. Moreover, it would be desirable to estimate the smoothness parameter directly from the data instead of making assumptions about this parameter.

1.2.3 Women and Mathematics

Another example comes from research where the primary aim is to study the attitude of 1190 New Jersey high school students toward mathematics (Fowlkes et al., 1988a). In this research, there is no longitudinal component (T = 1); rather, we only consider the association between the following six binary response variables measured for each individual:

V1 Lecture Attendance (Attend/Did not Attend)
V2 Gender (Male/Female)
V3 School Type (Urban/Suburban)
V4 Perceived need for Mathematics in future work (Yes/No)
V5 Subject Preference (Math/Science vs. Liberal Arts)
V6 Future Plans (College/Job)

In this example the response Y is an n × J matrix of binary observations with n = 1190 and J = 6. The primary aim here is to study the structure of association between the elements of the response.

1.2.4 Six Cities Data

The final example is a subset of the Six Cities study, a longitudinal study of the health effects of air pollution. Even though the data is longitudinal in this case, we will not assume that the covariance is changing over time. The data under consideration consists of repeated binary measures of the wheezing status of 537 children from Steubenville, Ohio. The objective is to model the probability of wheeze status over time as a function of the mother's smoking status during the first year of the study.
The multivariate response variable of interest in this case is Vit, indicating the wheezing status of child i at annual visit t = 1, 2, 3, 4, corresponding to ages 7, 8, 9, and 10. The full data is an n × T matrix, with n = 537 and T = 4. The nature of the data is suggestive of an association between the T elements of the multivariate response variable. Even though the association is not assumed to change over time, it is possible that it may follow a particular structure. We are interested in estimating the association structure as well as assessing the effects of one covariate, the mother's smoking status at the beginning of the study (1 = yes, 0 = no), on the mean response. Table 1.3 breaks down the number of "cases" representing wheezing symptoms, by age group and mother's smoking status. From these results, it appears that the percentage of wheezing children is higher for smoking mothers at each age.

                        Child's Age
Mother Smoking Status   7             8             9             10
Smoker                  32 (17.0%)    40 (21.3%)    36 (19.1%)    27 (14.4%)
Non Smoker              55 (15.8%)    51 (14.6%)    49 (14.0%)    36 (10.0%)
Total                   87 (16.2%)    91 (16.9%)    85 (15.8%)    63 (11.7%)

Table 1.3: Breakdown of wheezing cases by age group and smoking status of mother at onset. The percentages in parentheses are given with respect to each age group.

1.3 Thesis Organization

In this thesis, we propose a framework for estimating the covariance matrix of multivariate data under different settings, different dimensions, and for different measurement scales. In Chapter 2, we explore methods and algorithms for estimating a sequence of covariance matrices that are smoothly changing in time. This is done for continuous data measurements. This corresponds to multivariate stochastic volatility models, which are very popular in the econometrics literature.
We develop algorithms that model the covariance Σt = Var(Vt) directly at each time step and the evolution in time Σt | Σt−1, taking into account the complex structure and the dynamic nature of the response. We use novel algorithms that allow for the simultaneous estimation of both static and time-varying parameters. We consider the performance of these algorithms as the dimension of the response increases (over n, J, and T). Moreover, we extend the algorithm to estimate some of the model's hyper-parameters.

In Chapter 3, the multivariate response variable under consideration is binary, with no longitudinal element. It consists of an n × J array of observations. Estimating a full correlation matrix requires the estimation of J(J − 1)/2 parameters. The number of parameters increases quadratically as a function of the number of variables. Therefore, when dealing with high dimensional problems, imposing a structure of association is helpful both conceptually and computationally. Conceptually, not all of the variables may be associated, and therefore it might make more sense from a subject matter standpoint to impose those sparsity constraints. From a computational viewpoint, this reduces the number of free parameters to be estimated compared with the saturated model. For a properly specified structure, this results in more accurate and efficient estimates of parameters. In Chapter 3, we appeal to the well-developed theory of covariance selection in Gaussian graphical models to impose sparsity constraints on the inverse covariance matrix. In the case of binary observations, for reasons of identifiability, the covariance matrix needs to be constrained to be a correlation matrix R (Chib and Greenberg, 1998), with diagonal elements Rii = 1 and off-diagonal elements Rij ∈ (−1, 1). The sparsity constraint is imposed by setting some off-diagonal elements (R^{-1})ij of the inverse correlation matrix to 0.
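This constraint can be illustrated numerically: zeros in an inverse covariance matrix survive the rescaling to a correlation matrix, since R^{-1} = D Σ^{-1} D with D = diag((Σ_11)^{1/2}, . . . , (Σ_JJ)^{1/2}). A sketch with a hypothetical chain-structured precision matrix:

```python
import numpy as np

# Hypothetical example: a tridiagonal (chain graph) inverse covariance K.
# Its zero pattern is preserved when the covariance is rescaled to a
# correlation matrix, because R^{-1} = D K D with D diagonal.
J = 5
K = np.eye(J)
for i in range(J - 1):
    K[i, i + 1] = K[i + 1, i] = 0.4   # chain graph: only neighbours linked

Sigma = np.linalg.inv(K)
d = np.sqrt(np.diag(Sigma))           # standard deviations
R = Sigma / np.outer(d, d)            # correlation matrix, R_ii = 1

R_inv = np.linalg.inv(R)
pattern_K = np.abs(K) > 1e-10
pattern_R = np.abs(R_inv) > 1e-10
print(np.array_equal(pattern_K, pattern_R))  # True
```

So a Gaussian graphical model structure chosen for Σ^{-1} carries over directly to R^{-1}, which is what makes the correlation-matrix constraint compatible with covariance selection.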
In Tabet (2007), we showed how this can be achieved using the theory of Gaussian graphical models when the sparsity structure is known. The main contribution in Chapter 3 is to propose an extension to the algorithm to allow the estimation of a sparse structure directly from the data.

In Chapter 4, in addition to estimating the association between the elements of the response, we also propose a methodology to estimate the effects of covariates. Even though the covariates X may be measured at each time step, i.e., time-varying covariates, their effect on the response, denoted by β, is assumed to be static. Figure 1.5 is a graphical model representation of the general multivariate longitudinal model with covariates. The unknown parameters are the covariance matrices Σ1:T and the covariate effect β. The response variable V1:T and the covariates X1:T are observed.

Figure 1.5: A graphical model representation of the multivariate longitudinal model which includes covariate information and a covariate effects parameter. Observed variables are shaded, unobserved variables are clear.

We consider estimating covariates when the response is continuous and Σ is evolving in time dynamically, as depicted in Figure 1.5. We also consider a non-longitudinal (static) case with T = 1. Under this setting we propose methodology to estimate covariate effects for a binary response, while accounting for the identifiability constraints. In Chapter 5, we show how the methodology developed in this thesis can be used to address real problems and applied to real datasets. We revisit the data examples that we introduced in the previous section. Finally, in Chapter 6, we discuss some of the limitations of the methodology and future work.

Chapter 2 Time-Varying Covariance

In this Chapter, we assume that the correlation between the elements of the multivariate response variable is slowly changing in time.
The aim is to estimate the model's underlying parameters, which in this case consist of a sequence of covariance matrices and a smoothing parameter that controls the degree of change from one time step to the next. In general, the smoothing parameter is assumed to be static, while the covariances are assumed to be dynamic. The main contribution of this Chapter is to propose a novel method for simultaneously estimating the dynamic and the static parameters of the model using particle Markov chain Monte Carlo (Andrieu and Doucet, 2002).

Throughout this Chapter, we assume the observed response variable is continuous, multivariate with J dimensions, longitudinal with T time steps, and with n observations. Under this scenario, the response is an n × J × T array of continuous observations corresponding to Figure 1.1. Let Z denote the multivariate response of interest; we assume that at any given time point t, Zt arises from a matrix-Normal distribution such that Zt ∼ MN_{n,J}(µ, Σt, Φ), where MN denotes the matrix-Normal distribution with probability density function given by

f(Z_t \mid \mu, \Sigma_t, \Phi) = (2\pi)^{-nJ/2}\, |\Phi|^{-J/2}\, |\Sigma_t^{-1}|^{n/2}\, \mathrm{etr}\left\{ -\tfrac{1}{2} \Sigma_t^{-1} (Z_t - \mu)' \Phi^{-1} (Z_t - \mu) \right\}, \qquad (2.1)

where etr denotes the exponential trace, µ represents the mean of Zt, Σt is a J × J time-varying covariance matrix that depends on Σt−1 and encodes the dependence between the columns of Zt, and the n × n matrix Φ encodes the dependence between the rows of Zt.

Figure 2.1: Graphical representation of the multivariate longitudinal data. The observed variables are shaded, and the latent (unobserved) variables are not. Conditional on the precision matrices, the observed data are independent over time.

In this research we assume that observations are independent and identically distributed; therefore we set Φ = I, the identity matrix.
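With Φ = I, the matrix-normal density (2.1) factorizes over the rows of Z_t into n independent J-variate Gaussians. A quick numerical check of this, with all values hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical dimensions and covariance for one time slice Z_t (mu = 0).
rng = np.random.default_rng(1)
n, J = 4, 3
A = rng.normal(size=(J, J))
Sigma = A @ A.T + J * np.eye(J)    # a positive definite J x J covariance
Z = rng.normal(size=(n, J))        # an observed n x J time slice

def matrix_normal_logpdf(Z, Sigma):
    """log f(Z | mu = 0, Sigma, Phi = I) from equation (2.1)."""
    n, J = Z.shape
    Sinv = np.linalg.inv(Sigma)
    _, logdet_Sinv = np.linalg.slogdet(Sinv)
    return (-0.5 * n * J * np.log(2 * np.pi)   # |Phi| = |I| contributes 0
            + 0.5 * n * logdet_Sinv
            - 0.5 * np.trace(Sinv @ Z.T @ Z))  # etr term with Phi^{-1} = I

lhs = matrix_normal_logpdf(Z, Sigma)
rhs = multivariate_normal(mean=np.zeros(J), cov=Sigma).logpdf(Z).sum()
print(np.isclose(lhs, rhs))  # True
```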
It is important to note that when Φ is equal to the identity matrix, the matrix-variate Gaussian simplifies to the multivariate Gaussian with probability density function given in Appendix A.1.2. Moreover, in this Chapter, for convenience, we model the data in terms of a precision matrix Σ_t^{-1} rather than a covariance matrix Σt. The response, Z_1, . . . , Z_T, is assumed to be mean-centered with µ = E(Z_t) = 0 and conditionally independent given Σ_1^{-1}, . . . , Σ_T^{-1}. Intuitively, this means that the information in Z_t depends on the information in Z_{t−1} only through what is captured in the change of Σ_t^{-1} | Σ_{t−1}^{-1}. Therefore, given the Markovian assumption of the model, the joint distribution of Z and Σ^{-1} decomposes according to the graph in Figure 2.1 and can be written as:

\pi(Z_{1:T}, \Sigma_{1:T}^{-1}) = f(Z_1 \mid \Sigma_1^{-1})\,\pi(\Sigma_1^{-1}) \prod_{t=2}^{T} f(Z_t \mid \Sigma_t^{-1})\,\pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}). \qquad (2.2)

This model represents a non-linear dynamic state space model where Z_t is the observed state and Σ_t^{-1} is the hidden state; the model is described by the observation density f(Z_t | Σ_t^{-1}) and the transition density π(Σ_t^{-1} | Σ_{t-1}^{-1}). Under a Bayesian paradigm, the joint posterior of Σ_{1:T}^{-1} | Z_{1:T} is proportional to the joint distribution in (2.2) and differs only by a normalizing constant. The likelihood f(Z_t | Σ_t^{-1}), also known as the observation density, is given in (A.5). Therefore, it remains to specify a joint prior on Σ_{1:T}^{-1}, which decomposes as the initial prior π(Σ_1^{-1}) and the transition density π(Σ_t^{-1} | Σ_{t-1}^{-1}).

2.1 The Wishart Prior

The Wishart distribution plays a very important role in Bayesian estimation of the inverse covariance matrix, because it is the conjugate prior on inverse covariance matrices for multivariate Gaussian data. There exist several parameterizations of the Wishart distribution (see Appendix A.3 for the different parameterizations and their equivalence).
Throughout this thesis, we might appeal to different parameterizations at different times, as needed. In this Chapter, we use the parametrization obtained by taking the sum of squared Gaussian variables. If a J × J precision matrix Σ^{-1} follows a Wishart distribution with degrees of freedom ν ≥ J and scale parameter Ω > 0, denoted by W_J(ν, Ω), its probability density function has mass only for positive definite matrices Σ^{-1} and is given by

\pi(\Sigma^{-1}) = \left[ 2^{\nu J/2}\, \Gamma_J(\nu/2)\, |\Omega|^{\nu/2} \right]^{-1} |\Sigma^{-1}|^{(\nu - J - 1)/2}\, \mathrm{etr}\left\{ -\tfrac{1}{2} \Omega^{-1} \Sigma^{-1} \right\}, \qquad (2.3)

where Γ_J is the multivariate gamma function given by

\Gamma_J(a) = \pi^{J(J-1)/4} \prod_{i=1}^{J} \Gamma\!\left( a - \tfrac{1}{2}(i - 1) \right). \qquad (2.4)

Because the Wishart is conditionally conjugate, and because it is the multidimensional generalization of the χ² distribution, its use is equivalent to extending the univariate model to the multivariate case and allows a consistent interpretation of hyperparameters as the dimension increases. For this reason, we restrict our attention to prior distributions that involve the Wishart and inverse Wishart distributions.

2.1.1 Dynamic Modeling of the Covariance in the Literature

Modeling time-varying covariance and correlation matrices has been a very active area of research in econometrics, through the study of multivariate stochastic volatility models. A few approaches have been proposed for this class of models; see Asai et al. (2006) and, more recently, Chib et al. (2007) for a comprehensive review. From this literature, we are particularly interested in models that use the Wishart family of distributions to model the dynamics over time.
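As a sanity check, the density (2.3)–(2.4) can be implemented directly and compared against scipy's Wishart log-density, which uses the same (df = ν, scale = Ω) parameterization; a sketch with hypothetical values:

```python
import numpy as np
from scipy.special import multigammaln   # log of the multivariate gamma (2.4)
from scipy.stats import wishart

def wishart_logpdf(X, nu, Omega):
    """Log of the Wishart density (2.3) for a precision matrix X."""
    J = X.shape[0]
    _, logdet_X = np.linalg.slogdet(X)
    _, logdet_O = np.linalg.slogdet(Omega)
    log_norm = (0.5 * nu * J * np.log(2)
                + multigammaln(0.5 * nu, J)
                + 0.5 * nu * logdet_O)
    return (0.5 * (nu - J - 1) * logdet_X
            - 0.5 * np.trace(np.linalg.solve(Omega, X))
            - log_norm)

# Hypothetical test point: nu = 6, Omega = I, and a random PD matrix X.
rng = np.random.default_rng(2)
J, nu = 3, 6
Omega = np.eye(J)
B = rng.normal(size=(J, nu))
X = B @ B.T

print(np.isclose(wishart_logpdf(X, nu, Omega),
                 wishart.logpdf(X, df=nu, scale=Omega)))  # True
```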
2.1.2 The Inverse Wishart

Rather than working with the precision matrix, Philipov and Glickman (2006) model the covariance transition by proposing the following inverse Wishart prior for Σt (see Appendix A.3 for the relationship to the Wishart distribution)

\Sigma_t \mid \Sigma_{t-1} \sim \mathcal{IW}_J(\nu, S_{t-1}), \qquad (2.5)

where ν is the degrees of freedom and S_{t−1} is the scale parameter, which is used to encode the temporal dynamics and is defined as

S_t = \frac{1}{\nu}\, A^{1/2} \left( \Sigma_t^{-1} \right)^{d} \left( A^{1/2} \right)'. \qquad (2.6)

Under this setup, the hyperparameters in the model are: A, a positive definite matrix with Cholesky decomposition A = (A^{1/2})(A^{1/2})'; d, a scalar parameter; and ν, an integer degrees of freedom parameter. This approach has some nice properties; namely, for a time-invariant structure, this model simplifies to the usual Normal-inverse-Wishart model. Furthermore, the hyperparameters A and d determine the dynamic behaviour of the covariance structure through S_t. The matrix A can be seen as a measure of inter-temporal covariance. The scalar parameter d denotes the strength of the relationships over time (long-memory, persistence). When d = 0, an independent and identically distributed covariance over time is implied. A special case also arises when d = 1 and A = I, corresponding to the matrix-variate random walk. On the other hand, using this approach does not appear to result in stationary marginal distributions. Philipov and Glickman (2006) do not make any claims on that either; they only prove stationarity of the determinant |Σt| for values of |d| < 1. To confirm this, we performed a simulation study by sampling from this prior distribution, for different values of A and d. Indeed, the marginal distribution of the covariance matrices seems to change at each time point (results not shown).
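A forward simulation of the transition (2.5)–(2.6), of the kind used for the check described above, can be sketched as follows; the hyperparameter values (A, d, ν) are hypothetical, and scipy's inverse-Wishart is used as a stand-in for IW_J:

```python
import numpy as np
from scipy.stats import invwishart

# Hypothetical hyperparameters for the Philipov-Glickman style transition.
rng = np.random.default_rng(3)
J, T, nu, d = 2, 50, 10, 0.8
A = np.array([[1.0, 0.3],
              [0.3, 1.0]])
A_half = np.linalg.cholesky(A)          # A = A^{1/2} (A^{1/2})'

def spd_power(M, p):
    """Real matrix power of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * w ** p) @ V.T

def next_scale(Sigma):
    # S_t = (1/nu) A^{1/2} (Sigma_t^{-1})^d (A^{1/2})'   -- equation (2.6)
    return A_half @ spd_power(np.linalg.inv(Sigma), d) @ A_half.T / nu

Sigma = np.eye(J)
draws = []
for t in range(T):
    Sigma = invwishart.rvs(df=nu, scale=next_scale(Sigma), random_state=rng)
    draws.append(Sigma)

# Every draw should remain symmetric positive definite:
print(all(np.all(np.linalg.eigvalsh(S) > 0) for S in draws))  # True
```

Plotting, say, draws[t][0, 0] against t for several values of d is one way to inspect the drifting marginal distributions noted above.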
2.1.3 The Dynamic Linear Model

Adopting a completely different approach, Carvalho and West (2007) propose an extension of the dynamic linear model for sequential forecasting and updating, and appeal to the hyper-inverse Wishart distribution that arises in Gaussian graphical models (see Appendix A.3 for the definition). The prior they place on Σt is Σt ∼ HIW_J(b_t, S_t), with

b_t = \delta b_{t-1} + 1, \qquad S_t = \delta S_{t-1} + e_t e_t',

where e_t is the vector of realized forecast errors, e_t = Z_t − f_t, and f_t denotes the forecast at time t. The advantage of this approach is that, by using the hyper-inverse Wishart, a sparse covariance matrix can be estimated by imposing a structure on Σ_t^{-1}. However, issues with this approach arise with the hyperparameters, because at each time point the hyperparameter used in the prior is a function of the posterior parameters and the data from the previous time point. Furthermore, this process has been shown to be non-stationary.

2.1.4 The Wishart Autoregressive Process

An alternative approach uses the Wishart autoregressive process of order 1, introduced by Gourieroux et al. (2009) and denoted by WAR(1). This process constructs, at each time step, a Wishart distributed matrix by summing over squared latent Gaussian variables. These latent variables evolve in time according to an autoregressive process. More specifically, for t = 1, . . . , T, we define an artificial latent variable u_t, a d × J matrix that follows a matrix-variate Gaussian distribution with initial distribution at t = 1 given by

u_1 \sim \mathcal{MN}_{d,J}(0, S, I_d), \qquad (2.7)

and subsequently,

u_t \mid u_{t-1} \sim \mathcal{MN}_{d,J}(M u_{t-1}, S - M S M', I_d). \qquad (2.8)

The temporal dependence is introduced by evolving u_t in time according to an autoregressive process of order 1, so that at each time step t, we have d independent AR processes for vectors of length J.
Taking the square of these vectors and adding them up results in a J × J matrix, and is equivalent to computing

\Sigma_t^{-1} = u_t' u_t, \qquad (2.9)

where Σ_t^{-1} marginally follows a Wishart distribution with integer degrees of freedom d > J and scale parameter S. Under this setup, Σ_1^{-1}, . . . , Σ_T^{-1} constitute a Wishart autoregressive process of order 1, which defines a matrix Markov process for Σ_t^{-1}, constructed using latent variables (Cameron, 1973). We are therefore able to build a joint prior on Σ_1^{-1}, . . . , Σ_T^{-1}:

\pi(\Sigma_{1:T}^{-1}) = \pi(\Sigma_1^{-1}) \prod_{t=2}^{T} \pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}), \qquad (2.10)

where the initial distribution at time point t = 1, given by π(Σ_1^{-1}), is the conjugate Wishart prior W_J(d, S), with degrees of freedom d, scale parameter S, and probability density function given by

\pi(\Sigma_1^{-1}) = \left[ 2^{dJ/2}\, \Gamma_J(d/2)\, |S|^{d/2} \right]^{-1} |\Sigma_1^{-1}|^{(d - J - 1)/2}\, \mathrm{etr}\left\{ -\tfrac{1}{2} S^{-1} \Sigma_1^{-1} \right\}. \qquad (2.11)

At subsequent time points, t > 1, the transition distribution π(Σ_t^{-1} | Σ_{t-1}^{-1}) is a non-central Wishart, denoted by W_J^{*}(d, S^{*}, Θ), with degrees of freedom d, scale parameter S^{*} = S − M S M', and non-centrality parameter Θ = (S − M S M')^{-1} M Σ_{t-1}^{-1} M', with density given by

\pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}) = \left[ 2^{dJ/2}\, \Gamma_J(d/2)\, |S^{*}|^{d/2} \right]^{-1} \mathrm{etr}\left\{ -\tfrac{1}{2} \Theta \right\} |\Sigma_t^{-1}|^{(d - J - 1)/2}\, \mathrm{etr}\left\{ -\tfrac{1}{2} S^{*-1} \Sigma_t^{-1} \right\} {}_0F_1\!\left( \tfrac{1}{2} d;\, \tfrac{1}{4} \Theta S^{*-1} \Sigma_t^{-1} \right), \qquad (2.12)

where {}_0F_1 is the hypergeometric function of matrix argument defined in Appendix A.3, equation (A.25). The dynamics of the WAR process are characterized by the degrees of freedom d > J; a symmetric positive semi-definite J × J matrix S to model the covariance of the latent process; and an unconstrained latent autoregressive matrix M that encodes the temporal dependence between the sequence of covariance matrices. The Wishart autoregressive process can also be defined for any autoregressive order; therefore, the Markov assumption of order 1 can be weakened.
Moreover, while in this thesis we construct the prior using integer degrees of freedom only, this process is also defined for non-integer degrees of freedom, as discussed in Gourieroux et al. (2009), which allows for greater modeling flexibility. Finally, this process guarantees that covariance and precision matrices are symmetric and positive definite. Another attractive property of the Wishart autoregressive process is that it represents a generalization of the conjugate model used in the univariate case. When T = 1, the Wishart autoregressive process simplifies to the conjugate Wishart distribution for precision matrices, and for J = 1, it simplifies to the χ² distribution.

Moreover, we can directly show that this process is stationary, with marginal distribution u_t ∼ MN_{d,J}(0, S, I_d) and Σ_t^{-1} ∼ W_J(d, S). This can be seen by taking iterated expectations and proceeding by induction, starting with

E(u_1) = 0,
E(u_t) = E\left( E(u_t \mid u_{t-1}) \right) = E(M u_{t-1}) = M\, E(u_{t-1}) = 0, \quad t = 2, \ldots, T.

Similarly, we can compute the marginal variance:

Var(u_1) = S,
Var(u_t) = E\left( Var(u_t \mid u_{t-1}) \right) + Var\left( E(u_t \mid u_{t-1}) \right)
         = (S - M S M') + M\, Var(u_{t-1})\, M'
         = S - M S M' + M S M' = S, \quad t = 2, \ldots, T.

2.1.5 The Hyperparameters of the WAR Prior

The temporal dynamics of the WAR model are controlled by the hyperparameter M. This can be seen from equations (2.7) and (2.8). To further illustrate this, suppose we take M = ρI_J, where I is the identity matrix and ρ is a scalar parameter. The "memory" of the process is encoded in ρ.
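The stationarity argument above can also be checked by simulation; a sketch of the construction (2.7)–(2.9), with hypothetical values of J, d, S, and M = ρI_J:

```python
import numpy as np

# Hypothetical WAR(1) hyperparameters: J outcomes, d latent rows, memory rho.
rng = np.random.default_rng(4)
J, d, T, rho = 2, 4, 5000, 0.9
M = rho * np.eye(J)
S = np.array([[1.0, 0.4],
              [0.4, 1.0]])

L = np.linalg.cholesky(S)                        # for u_1 ~ MN(0, S, I_d)
Q = np.linalg.cholesky(S - M @ S @ M.T)          # innovation covariance
u = rng.normal(size=(d, J)) @ L.T                # each row ~ N_J(0, S)
precisions = []
for t in range(T):
    precisions.append(u.T @ u)                   # Sigma_t^{-1} = u_t' u_t
    u = u @ M.T + rng.normal(size=(d, J)) @ Q.T  # row-wise AR(1) step

# Marginally Sigma_t^{-1} ~ W_J(d, S), so the time average of the simulated
# precisions should be close to E[Sigma_t^{-1}] = d * S:
avg = np.mean(precisions, axis=0)
print(np.allclose(avg, d * S, atol=0.6))  # True
```

With M = ρI the transition reduces to u_t = ρ u_{t−1} + ε_t with ε_t ∼ MN(0, (1 − ρ²)S, I_d), so the stationary variance S and the Wishart marginal W_J(d, S) are preserved at every step, matching the derivation above.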
A large ρ implies covariances with high dependency between time points; on the other hand, when ρ = 0, there are no dynamics, and at each time step we have an independent and identically distributed precision matrix Σ_t^{-1} ∼ W_J(d, S). The past matters more as ρ approaches 1. For ρ = 1 exactly, we have the same precision matrix at all time steps: Σ_1^{-1} = Σ_2^{-1} = · · · = Σ_T^{-1}. In this case, the time element becomes irrelevant, and the data can be stacked and analyzed simultaneously as i.i.d. data generated from the same Σ^{-1}. In this research, we only consider scenarios with 0 ≤ ρ ≤ 1.

The hyperparameters S and d represent the scale matrix and degrees of freedom of the Wishart distribution, respectively. Barnard et al. (2000) have shown that if Σ ∼ IW_J(d, S) and given the decomposition Σ = DRD, where R is a correlation matrix and D = diag(d_1, . . . , d_J) is a diagonal matrix of standard deviations with d_i = (Σ_ii)^{1/2} > 0, then the resulting correlation matrix R has a marginally uniform prior only when S = I, the identity matrix, and for d = J + 1 (note that this value depends on the parametrization of the Wishart). This means that marginally R_ij is uniformly distributed on the [−1, 1] interval. A marginally uniform distribution has desirable properties which will be further explored in Chapter 3.

To illustrate the effect of a change in M, S and d on the precision elements of Σ^{-1}, we consider an empirical example where we simulate from the prior distribution in (2.10). We begin by considering the univariate case. First we examine the effect of the autoregressive parameter by varying M = ρ ∈ {0.7, 0.98, 0.999} and holding S = 1 and d = J + 1 = 2. In the second simulation, we fix ρ = 0.98 and d = 2 and we vary S ∈ {0.01, 0.5, 1, 4}. Finally, in the last experiment, we fix ρ = 0.98 and S = 1 and vary d ∈ {J + 1, J + 50}. Given the above three scenarios, for each hyperparameter experiment, the latent Gaussian variables are generated according to the model in (2.7) and (2.8). The precisions are then computed using (2.9).

Figure 2.2(a) plots the values of a precision simulated from the prior using different values of ρ, maintaining the scale matrix at S = 1 and the degrees of freedom at d = 2. It can be noted that a larger ρ results in a smoother change in the precision over time. Furthermore, ρ has to be very large and very close to 1 before the trace plot flattens. For ρ = 0.7, the dependence is still not strong enough to show any visible persistence over time. In Figure 2.2(b), we plot the values obtained by simulating a precision from the prior, this time holding ρ at 0.98 and d at 2 and varying S. We can see that a smaller S generally translates into smaller precisions, which means very variable data. Finally, in Figure 2.2(c), ρ is set to 0.98 and S is set to 1, and we consider two possibilities for d. What we note is that a larger d also means a larger precision. The same pattern holds for the bivariate case in Figure 2.3 and the trivariate case (results not shown). Previous research shows that S controls the location of the precision while d controls the spread. For more on understanding the effect of hyperparameters on the covariance matrix, the reader is referred to Tokuda et al. (2011).

Figure 2.2: Precision simulated from the prior using different values of the hyper-parameters ρ, S and d: (a) varying ρ for S = 1 and d = J + 1; (b) varying S for ρ = 0.98 and d = J + 1; (c) varying d for ρ = 0.98 and S = 1.
Figure 2.3: Time series plots showing the effect of changing ρ on the marginal elements of Σ^{-1} in the bivariate case (J = 2).

2.2 Bayesian Estimation with the WAR

Given observed continuous Gaussian data, and adopting the WAR of order 1 as a prior for the precision matrix, one can begin to think of performing Bayesian inference on the distribution of Σ_{1:T}^{-1} | Z_{1:T}. Posterior estimation at time point t = 1 is straightforward. The Wishart is the conjugate prior; therefore, the posterior distribution of Σ_1^{-1} | Z_1 is also a Wishart distribution with updated parameters W_J(d + n, (S^{-1} + SST)^{-1}), where SST = Z_1' Z_1 is the residual sum of squares. At time points t > 1, the non-central Wishart is not conjugate. The conditional posterior distribution π(Σ_t^{-1} | Σ_{1:t-1}^{-1}, Z_{1:t}) simplifies to π(Σ_t^{-1} | Σ_{t-1}^{-1}, Z_t) due to the Markovian assumption of the model, and can be written as

\pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}, Z_t) \propto f(Z_t \mid \Sigma_t^{-1})\, \pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}). \qquad (2.13)

This distribution is not available in closed form, and therefore it is difficult to sample from it directly. In cases where the posterior distribution is not available analytically, the problem is addressed indirectly by generating random samples from the target distribution of interest using Monte Carlo methods. We propose a sequential Monte Carlo (SMC) algorithm to sample from the posterior distribution, using two different proposal densities. SMC methods are a general class of Monte Carlo methods that sample sequentially from a sequence of target probability densities of increasing dimensions.
These algorithms are particularly suitable for non-linear, non-Gaussian models.

2.3 Sampling using Sequential Monte Carlo (SMC)

The posterior distribution \pi(\Sigma^{-1}_{1:T} \mid Z_{1:T}) is multidimensional and not available analytically. SMC methods are a set of simulation-based methods which provide a convenient and attractive approach to recursively estimating a sequence of probability distributions. These methods are particularly suited to state space models. Using SMC, the intractable problem of sampling from a multidimensional target distribution, denoted by p(\cdot), is decomposed into a series of simpler problems which can be tackled recursively. Using Bayes' rule, the target posterior can be written as

p(\Sigma^{-1}_{1:T} \mid Z_{1:T}) = \frac{\pi(\Sigma^{-1}_{1:T}, Z_{1:T})}{\pi(Z_{1:T})} = \frac{\pi(\Sigma^{-1}_{1:T})\, f(Z_{1:T} \mid \Sigma^{-1}_{1:T})}{\pi(Z_{1:T})},   (2.14)

and the normalizing constant, also known as the marginal likelihood, is obtained by integrating over the latent state:

\pi(Z_{1:T}) = \int \pi(\Sigma^{-1}_{1:T}, Z_{1:T})\, d\Sigma^{-1}_{1:T}.   (2.15)

The joint distribution can be re-written as

\pi(\Sigma^{-1}_{1:T}, Z_{1:T}) = \pi(\Sigma^{-1}_{1:T-1}, Z_{1:T-1})\, \pi(\Sigma^{-1}_T \mid \Sigma^{-1}_{T-1})\, f(Z_T \mid \Sigma^{-1}_T).   (2.16)

Therefore, the posterior p(\Sigma^{-1}_{1:T} \mid Z_{1:T}) satisfies the recursion

p(\Sigma^{-1}_{1:T} \mid Z_{1:T}) \propto p(\Sigma^{-1}_{1:T-1} \mid Z_{1:T-1})\, \pi(\Sigma^{-1}_T \mid \Sigma^{-1}_{T-1})\, f(Z_T \mid \Sigma^{-1}_T).   (2.17)

Thus, we are able to break a high dimensional distribution into a product of distributions of lower dimensions. A general SMC algorithm for state space models can be described as a two-step recursion involving a sampling step and a correction step. The algorithm proceeds at time t = 1 by generating a sample of size K from the target distribution p(\Sigma^{-1}_1 \mid Z_1). In most cases it is not possible to sample directly from the target distribution p(\cdot); the K samples, also referred to as particles, are therefore drawn instead from an importance distribution, denoted by q(\cdot), with support as large as or larger than that of p(\cdot).
In order to correct for that approximation, the discrepancy between the target and the importance distribution, known as the importance weight, is computed for each sample by taking the ratio of the two distributions p and q evaluated at each particle m. The resulting weights are given by

w_1(\Sigma_1^{-1(m)}) = \frac{f(Z_1 \mid \Sigma_1^{-1(m)})\, \pi(\Sigma_1^{-1(m)})}{q(\Sigma_1^{-1(m)} \mid Z_1)}.   (2.18)

The K particles are then re-sampled according to the normalized weights W_1, where

W_1^{(m)} = \frac{w_1^{(m)}}{\sum_{l=1}^{K} w_1^{(l)}}.   (2.19)

This resampling step corrects the approximation and ensures that the resulting weighted samples are distributed according to the target distribution of interest. At the following time step, t = 2, the identity resulting from the recursion in (2.17) is considered: p(\Sigma_{1:2}^{-1} \mid Z_{1:2}) \propto p(\Sigma_1^{-1} \mid Z_1)\, \pi(\Sigma_2^{-1} \mid \Sigma_1^{-1})\, f(Z_2 \mid \Sigma_2^{-1}). The particles obtained from the previous step are used for p(\Sigma_1^{-1} \mid Z_1), and an importance density q(\Sigma_2^{-1} \mid \Sigma_1^{-1}, Z_2) is used to extend the dimension of the current particles to approximate \pi(\Sigma_2^{-1} \mid \Sigma_1^{-1})\, f(Z_2 \mid \Sigma_2^{-1}). This provides samples that are distributed according to the approximate distribution p(\Sigma_1^{-1} \mid Z_1)\, q(\Sigma_2^{-1} \mid Z_2, \Sigma_1^{-1}). Once more, in order to correct for the approximation, the importance weights w_2 must be computed and normalized, and the whole particle path is re-sampled according to the normalized weights W_2:

w_2(\Sigma_2^{-1(m)}) = \frac{f(Z_2 \mid \Sigma_2^{-1(m)})\, \pi(\Sigma_2^{-1(m)} \mid \Sigma_1^{-1(m)})}{q(\Sigma_2^{-1(m)} \mid Z_2, \Sigma_1^{-1(m)})}; \qquad W_2^{(m)} = \frac{w_2^{(m)}}{\sum_{l=1}^{K} w_2^{(l)}}.

The resampling step results in particles that are distributed according to p(\Sigma_1^{-1}, \Sigma_2^{-1} \mid Z_1, Z_2). This procedure is then repeated T times. At each step t the weights are computed and normalized according to

w_t(\Sigma_t^{-1(m)}) = \frac{f(Z_t \mid \Sigma_t^{-1(m)})\, \pi(\Sigma_t^{-1(m)} \mid \Sigma_{t-1}^{-1(m)})}{q(\Sigma_t^{-1(m)} \mid Z_t, \Sigma_{t-1}^{-1(m)})}; \qquad W_t^{(m)} = \frac{w_t^{(m)}}{\sum_{l=1}^{K} w_t^{(l)}}.   (2.20)

The resampling step ensures that the trajectories with small importance
weights are eliminated and the ones with large weights are replicated. Furthermore, while the resampling step introduces some Monte Carlo variance, it reduces the accumulation of error over time. At each time step, the entire particle path corresponding to the particles chosen in the resampling step is kept and propagated forward, and the remaining particles are discarded. This results in the algorithm degenerating over time, because a small proportion of the weights eventually contains all of the probability mass and hence most particles are discarded. In many cases, it is simpler to work with log-weights for numerical reasons, since weights can be very small or very large, and numerical precision might be a concern when normalizing. Furthermore, in order to prevent underflow or overflow, it is common to subtract the largest log-weight from all the log-weights prior to normalization. From equation (2.20), we can see that the optimal choice q(\cdot) \equiv p(\cdot) would result in weights all equal to 1, meaning that all the particles survive and are propagated forward. An importance distribution that is very different from p(\cdot) would result in very rapid degeneracy of the algorithm. For this reason, the choice of importance density for the SMC algorithm is critical.

2.3.1 Computation of the Marginal Likelihood

One of the useful by-products of the sequential Monte Carlo sampler is that an estimate of the marginal likelihood \hat p(Z_{1:T}) can be obtained directly from each SMC run. This estimate is evaluated recursively and is given by

\hat p(Z_{1:T}) = \hat p(Z_1) \prod_{t=2}^{T} \hat p(Z_t \mid Z_{1:t-1}),   (2.21)

where

\hat p(Z_t \mid Z_{1:t-1}) = \frac{1}{K} \sum_{m=1}^{K} w_t(\Sigma_{1:t}^{-1(m)}),   (2.22)

and where the w_t are the un-normalized weights at time point t. This is an estimate of

p(Z_t \mid Z_{1:t-1}) = \int w_t(\Sigma_t^{-1})\, q(\Sigma_t^{-1} \mid Z_t, \Sigma_{t-1}^{-1})\, p(\Sigma_{1:t-1}^{-1} \mid Z_{1:t-1})\, d\Sigma_{1:t}^{-1}.   (2.23)
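The log-weight stabilization trick, the marginal-likelihood increment in (2.22), and the resampling step can be sketched together in a single generic helper. This is an illustration in Python, independent of the Wishart model; the function name is ours:

```python
import numpy as np

def normalize_resample(logw, rng):
    """Stably normalize log-weights, estimate the marginal-likelihood
    increment, and draw multinomial resampling indices.

    Subtracting the largest log-weight before exponentiating prevents
    under/overflow, as described in the text.
    Returns (W, log_incr, idx):
      W        normalized weights W_t^{(m)}
      log_incr log of (1/K) sum_m w_t^{(m)}, the increment in (2.22)
      idx      K ancestor indices drawn with probabilities W
    """
    logw = np.asarray(logw, dtype=float)
    m = logw.max()
    w = np.exp(logw - m)             # safe: all values lie in (0, 1]
    W = w / w.sum()
    log_incr = m + np.log(w.mean())  # increment back on the original scale
    idx = rng.choice(len(W), size=len(W), p=W)
    return W, log_incr, idx
```

Even for log-weights near -1000, far below the range where exp alone would underflow, the normalized weights and the log-increment remain finite.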
2.4 SMC Algorithm for the Wishart Autoregressive Process

Estimation in the Wishart autoregressive process at the first time step results in an analytical solution that can be sampled exactly. Therefore, at time point t = 1, we can sample directly from the target distribution p(\Sigma_1^{-1} \mid Z_1) without the need for any correction. At time point t = 2, since the posterior distribution given in (2.13) is not available in closed form, we must choose an importance distribution to generate candidate samples. The weights can be computed for each sample as described previously; particle paths are chosen according to those weights and propagated to the following time point. The procedure is repeated for all time points. In the next section, we propose two different importance densities that can be used for running the SMC algorithm with the Wishart autoregressive process.

2.4.1 The Non-central Wishart Prior as Importance Distribution

The first approach to obtaining samples from the target distribution p(\Sigma_{1:T}^{-1} \mid Z_{1:T}) is to use the non-central Wishart prior distribution as importance distribution, by setting q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}, Z_t) \equiv \pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}) \sim \mathcal{W}(d, S^*, \Theta), with d, S^* and \Theta as defined in Section 2.2. Using the prior as importance distribution, the importance weights at each time step t simplify to the likelihood:

w_t = \frac{f(Z_t \mid \Sigma_t^{-1})\, \pi(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1})}{q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1})} = f(Z_t \mid \Sigma_t^{-1}).   (2.24)

A detailed description of the implementation of the SMC algorithm is outlined in Algorithm 2.1.

Algorithm 2.1 SMC using the Non-central Wishart Prior
For particles m = 1, \dots, K:
At time point t = 1:
• Sample \Sigma_1^{-1(m)} \overset{\text{i.i.d.}}{\sim} \mathcal{W}_J(d + n, (S^{-1} + SST)^{-1}), where SST = Z_1 Z_1'.
At time points t > 1:
• Sample \Sigma_t^{-1(m)} \overset{\text{i.i.d.}}{\sim} \mathcal{W}_J(d, S - MSM', (S - MSM')^{-1} M \Sigma_{t-1}^{-1(m)} M'), a non-central Wishart.
• Compute

w_t(\Sigma_t^{-1(m)}) = f(Z_t \mid \Sigma_t^{-1(m)}); \qquad W_t^{(m)} = \frac{w_t(\Sigma_t^{-1(m)})}{\sum_{l=1}^{K} w_t(\Sigma_t^{-1(l)})},

where f is the multivariate Gaussian p.d.f. evaluated at Z_t and \Sigma_t^{-1(m)} for each particle m.
• Re-sample the particles \Sigma_{1:t}^{-1(m)} with replacement according to the weights W_t^{(m)}.

The performance of this proposal distribution is expected to be satisfactory when the observations are not too informative. Because the proposal essentially ignores the observed data, it works well precisely when the data carry little information, so that the posterior is very close to the prior in shape.

2.4.2 A Central Wishart as Importance Distribution

An alternative proposal takes the data into account and uses the central Wishart distribution as an approximation to the non-central Wishart transition density. This is convenient because the central Wishart is the conjugate prior, and would therefore provide a closed form solution, making it easier to sample approximately from p(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}, Z_t) by proposing samples from a Wishart distribution such that q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}, Z_t) \propto f(Z_t \mid \Sigma_t^{-1})\, q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}). The central Wishart approximation of the non-central Wishart has been suggested by Gupta and Nagar (2000), and is obtained by matching the first moments of the two distributions. Using the representation of the non-central Wishart as the inner product of normal matrices, for t > 1, let u_t \sim \mathcal{MN}_{d,J}(M u_{t-1}, S^*, I_d); then \Sigma_t^{-1} = u_t' u_t and

E[\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}] = d S^* + M \Sigma_{t-1}^{-1} M'.   (2.25)

It is then easy to see that by taking q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}) to be a central Wishart with degrees of freedom d and scale parameter S^* + \frac{1}{d} M \Sigma_{t-1}^{-1} M', the first moment of this central Wishart matches the one in (2.25). Furthermore, the second moments have been shown to differ only by terms of order O(d^{-1}).
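The moment-matching identity (2.25) is straightforward to verify by simulation. The sketch below builds non-central Wishart draws from the matrix-normal representation for J = 2; the particular d, S*, the previous precision, and the square-root factor used to build u_{t-1} are all arbitrary choices for illustration (any u_{t-1} with u_{t-1}' u_{t-1} equal to the previous precision works):

```python
import numpy as np

# Monte Carlo check of (2.25): with u_t ~ MN_{d,J}(M u_{t-1}, S*, I_d)
# (independent rows, column covariance S*), Sigma_t^{-1} = u_t' u_t has
# conditional mean d S* + M Sigma_{t-1}^{-1} M'.
rng = np.random.default_rng(0)
J, d, rho = 2, 50, 0.9
M = rho * np.eye(J)
Sstar = np.array([[1.0, 0.3], [0.3, 1.0]])
L = np.linalg.cholesky(Sstar)
prev = np.array([[2.0, 0.5], [0.5, 1.5]])        # Sigma_{t-1}^{-1}
# pick any u_{t-1} with u_{t-1}' u_{t-1} = prev (eigen square root here)
evals, V = np.linalg.eigh(prev)
u_prev = np.zeros((d, J))
u_prev[:J] = (V * np.sqrt(evals)).T
mean_u = u_prev @ M.T                            # rows propagated by M
acc = np.zeros((J, J))
n_rep = 5000
for _ in range(n_rep):
    u = mean_u + rng.standard_normal((d, J)) @ L.T   # rows ~ N(mean, S*)
    acc += u.T @ u
empirical = acc / n_rep
theoretical = d * Sstar + M @ prev @ M.T
```

Over a few thousand replications the empirical mean of u_t' u_t agrees with d S* + M \Sigma_{t-1}^{-1} M' to within Monte Carlo error.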
Combining this proposal with the likelihood results in an importance approximation to the posterior that is also a central Wishart, with updated degrees of freedom d + n and scale parameter ((S - MSM' + \frac{1}{d} M \Sigma_{t-1}^{-1} M')^{-1} + SST)^{-1}. The importance weights can then be used to correct this approximation:

w_t(\Sigma_t^{-1}) = \frac{p(Z_t \mid \Sigma_t^{-1})\, p(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1})}{q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}, Z_t)}.   (2.26)

We expand the target distribution of interest:

p(Z_t \mid \Sigma_t^{-1})\, p(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1})
\propto (2\pi)^{-nJ/2} |\Sigma_t^{-1}|^{n/2} \operatorname{etr}\left(-\tfrac12 \Sigma_t^{-1} Z_t Z_t'\right)
\times \frac{|S^*|^{-d/2}}{2^{dJ/2}\, \Gamma_J(d/2)}\, |\Sigma_t^{-1}|^{(d-J-1)/2} \operatorname{etr}\left(-\tfrac12\left(S^{*-1}\Sigma_t^{-1} + S^{*-1} M \Sigma_{t-1}^{-1} M'\right)\right) {}_0F_1\!\left(d/2;\ \tfrac14 S^{*-1} M \Sigma_{t-1}^{-1} M' S^{*-1} \Sigma_t^{-1}\right)
\propto (2\pi)^{-nJ/2} \frac{|S^*|^{-d/2}}{2^{dJ/2}\, \Gamma_J(d/2)}\, |\Sigma_t^{-1}|^{(n+d-J-1)/2}
\times \operatorname{etr}\left(-\tfrac12\left[\Sigma_t^{-1}\left(Z_t Z_t' + S^{*-1}\right) + S^{*-1} M \Sigma_{t-1}^{-1} M'\right]\right)
\times {}_0F_1\!\left(d/2;\ \tfrac14 S^{*-1} M \Sigma_{t-1}^{-1} M' S^{*-1} \Sigma_t^{-1}\right),

where S^* = S - MSM'. The importance distribution is a central Wishart \mathcal{W}_J(d + n, ((S^* + \frac1d M \Sigma_{t-1}^{-1} M')^{-1} + Z_t Z_t')^{-1}), and its p.d.f. is given by

q(\Sigma_t^{-1} \mid \Sigma_{t-1}^{-1}, Z_t) = \frac{|\Sigma_t^{-1}|^{(d+n-J-1)/2}}{2^{(d+n)J/2}\, \Gamma_J\!\left(\frac{d+n}{2}\right)} \left|\left(S^* + \tfrac1d M \Sigma_{t-1}^{-1} M'\right)^{-1} + Z_t Z_t'\right|^{(d+n)/2} \operatorname{etr}\left(-\tfrac12 \Sigma_t^{-1}\left[\left(S^* + \tfrac1d M \Sigma_{t-1}^{-1} M'\right)^{-1} + Z_t Z_t'\right]\right).   (2.27)

Finally, we can compute the logarithm of the weight, up to a constant that cancels in the normalization, by considering the ratio of p(\cdot) and q(\cdot):

\log(w_t(\Sigma_t^{-1})) = -\frac{d}{2}\log|S^*| - \frac{d+n}{2}\log\left|\left(S^* + \tfrac1d M \Sigma_{t-1}^{-1} M'\right)^{-1} + Z_t Z_t'\right|
- \frac12 \operatorname{tr}\left(\left[S^{*-1} - \left(S^* + \tfrac1d M \Sigma_{t-1}^{-1} M'\right)^{-1}\right]\Sigma_t^{-1} + S^{*-1} M \Sigma_{t-1}^{-1} M'\right)
+ \log\left({}_0F_1\!\left(d/2;\ \tfrac14 S^{*-1} M \Sigma_{t-1}^{-1} M' S^{*-1} \Sigma_t^{-1}\right)\right).   (2.28)

The SMC sampling algorithm that uses the central Wishart as a proposal is detailed in Algorithm 2.2. A major limitation of this approach is the numerical evaluation of the {}_0F_1 confluent hypergeometric function with matrix argument, which is necessary for the computation of the log weights in (2.28).

Algorithm 2.2 SMC using the Central Wishart Approximation
For particles m = 1, \dots, K:
At time point t = 1:
• Sample \Sigma_1^{-1(m)} \overset{\text{i.i.d.}}{\sim} \mathcal{W}_J(d + n, (S^{-1} + SST)^{-1}), where SST = Z_1 Z_1'.
At time points t > 1:
• Sample \Sigma_t^{-1(m)} \overset{\text{i.i.d.}}{\sim} \mathcal{W}_J(d + n, ((S - MSM' + \tfrac1d M \Sigma_{t-1}^{-1} M')^{-1} + SST)^{-1}), a central Wishart with SST = Z_t Z_t'.
• Compute w_t(\Sigma_t^{-1(m)}) as the exponential of the expression in (2.28), and

W_t^{(m)} = \frac{w_t(\Sigma_t^{-1(m)})}{\sum_{l=1}^{K} w_t(\Sigma_t^{-1(l)})},

for each particle m.
• Re-sample the particles \Sigma_{1:t}^{-1(m)} with replacement according to the normalized weights W_t^{(m)}.

Koev and Edelman (2006) propose an efficient algorithm and Matlab code to evaluate the hypergeometric function. However, for certain values of \Sigma^{-1}, their algorithm does not appear to produce accurate results.

2.4.3 Numerical Experiments using SMC

In the first set of experiments, we consider the performance of the SMC algorithm under the two proposals discussed in the previous section, while increasing, one at a time, the number of particles (K), the number of observations per time point (n), the number of time points (T), and the dimensionality of the response at each time point (J). In order to evaluate each proposal, we monitor the computational efficiency of the algorithm as well as the accuracy of the estimates. In theory, both proposals are expected to produce similar results given enough computational power. To assess computational efficiency we consider the average effective sample size (ESS), the average over t = 1, \dots, T of \mathrm{ESS}_t, given by

\mathrm{ESS}_t = \frac{1}{\sum_{m=1}^{K} \left(W_t^{(m)}(\Sigma_t^{-1})\right)^2}.   (2.29)

We are interested in the ESS as a percentage of the initial number of particles used in the model: the higher this percentage, the more computationally efficient the algorithm.
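The effective sample size in (2.29) is a one-line computation; a minimal helper:

```python
import numpy as np

def effective_sample_size(W):
    """ESS_t = 1 / sum_m (W_t^{(m)})^2, for normalized weights W."""
    W = np.asarray(W, dtype=float)
    return 1.0 / np.sum(W**2)
```

Uniform weights give ESS = K (every particle contributes equally), while a single weight of 1 gives ESS = 1 (complete degeneracy).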
In order to evaluate accuracy for J = 1, we consider the average mean squared error (MSE), the average over time of \mathrm{MSE}_t, the mean squared deviation of the estimated particles from the true precision used to generate the data:

\mathrm{MSE}_t = \frac{1}{K}\sum_{m=1}^{K} \left(\hat\sigma_t^{-1(m)} - \sigma_t^{-1}\right)^2.   (2.30)

For J > 1 we are dealing with precision matrices instead of scalar precisions. In order to evaluate how different the estimated precision matrices are from the true precision matrices used to generate the data, the entropy loss was computed for each SMC particle at each time point using

L_t^{(m)}(\hat\Sigma_t^{-1(m)}, \Sigma_t^{-1}) = \operatorname{trace}(\hat\Sigma_t^{-1(m)} \Sigma_t) - \log|\hat\Sigma_t^{-1(m)} \Sigma_t| - J,   (2.31)

where \hat\Sigma_t^{-1(m)} is an SMC particle at time point t. The average entropy loss was computed by averaging over all particles and time points:

\bar L = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{1}{K}\sum_{m=1}^{K} L_t^{(m)}\right).

We first look at the scalar case, J = 1. While this case is not particularly interesting from a computational standpoint, it enables us to compare the two SMC proposals: the central and non-central Wishart simplify to the \chi^2 and non-central \chi^2 distributions respectively, and we are able to implement the SMC using both the prior and the conditional prior that uses the central \chi^2 approximation to the conditional posterior as importance densities. Because the {}_0F_1 function is easy to evaluate in the scalar case, we can evaluate the non-central \chi^2 density as required by the algorithm. The hyperparameters used to generate the data were fixed in all simulations to S = I_J and M = 0.98 I_J. The methods were implemented using both the Wishart and the \chi^2 distributions with both proposals; the univariate and the multivariate implementations are expected to give identical results for J = 1.
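The entropy loss (2.31) is also easy to code; a small helper, using a sign-aware log-determinant for numerical safety (the function name is ours):

```python
import numpy as np

def entropy_loss(prec_hat, prec_true):
    """Entropy loss (2.31): trace(P_hat Sigma) - log|P_hat Sigma| - J,
    where Sigma = inv(prec_true) is the true covariance.  The loss is
    zero exactly when the estimate equals the truth."""
    Sigma = np.linalg.inv(prec_true)
    A = prec_hat @ Sigma
    sign, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - A.shape[0]
```

For example, estimating the precision as twice the truth in two dimensions gives a loss of 2 - 2 log 2, strictly positive as expected.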
However, only the Wishart implementation can run when J > 1, and in that case only the non-central Wishart proposal is implemented, due to the difficulty of evaluating the {}_0F_1 function with a matrix argument. In Figure 2.4, we plot the ESS and MSE against the number of particles K, obtained using the different proposals and implementations, for a fixed n = 1 observation per time point and T = 200 time steps. The proposal that uses the non-central Wishart/\chi^2 distribution results in a larger average effective sample size than the one that uses the central Wishart/\chi^2 approximation. The ESS tends to stabilize at around 90% of the original number of particles. This is not surprising: with only one observation per time point the likelihood is not very informative, making the posterior very similar to the prior. The method that uses the central Wishart/\chi^2 approximation as a proposal results in an ESS around 45% of the initial number of particles. We can also see that, with enough particles, both methods give similar results in terms of the MSE statistic. Furthermore, the central and non-central \chi^2 implementations give results identical to the central and non-central Wishart implementations, as expected for J = 1.

[Figure 2.4: Effective sample size (a) and mean squared error (b) obtained by increasing the number of particles K, for the central and non-central \chi^2 and Wishart implementations.]
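A minimal end-to-end sketch of the prior-as-proposal SMC in the scalar case (J = 1, d = 2) follows, using the latent-Gaussian representation of the non-central \chi^2 prior as the proposal. For simplicity, the exact conjugate draw at t = 1 is replaced by a prior draw, so every step weights by the Gaussian likelihood only; this and the function name are our simplifications, not the thesis implementation:

```python
import numpy as np

def smc_prior_proposal_scalar(Z, d=2, S=1.0, rho=0.98, K=500, seed=1):
    """SMC for the scalar (J = 1) Wishart AR process with the prior as
    proposal.  Precisions are built from d latent Gaussian AR(1)
    components: sigma_t^{-1} = u_t' u_t.
    Returns particle precisions (T, K) and normalized weights (T, K)."""
    rng = np.random.default_rng(seed)
    T = len(Z)
    Sstar = S * (1 - rho**2)        # S* = S - M S M' for M = rho
    u = rng.normal(0.0, np.sqrt(S), size=(d, K))   # stationary start
    prec = np.empty((T, K)); Wts = np.empty((T, K))
    for t in range(T):
        if t > 0:                   # propagate the latent AR(1) components
            u = rho * u + rng.normal(0.0, np.sqrt(Sstar), size=(d, K))
        prec[t] = (u**2).sum(axis=0)
        # weight = N(0, 1/prec) log-likelihood of Z_t (constants dropped)
        logw = 0.5 * np.log(prec[t]) - 0.5 * prec[t] * Z[t]**2
        w = np.exp(logw - logw.max()); W = w / w.sum()
        idx = rng.choice(K, size=K, p=W)            # multinomial resampling
        u = u[:, idx]
        Wts[t] = W
    return prec, Wts
```

Running this on a short series returns strictly positive precision particles with weights that sum to one at every step.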
In the second set of experiments, using the same data set, we increase the number of time steps while maintaining the number of observations at n = 1 and the number of particles fixed at K = 1000. In Figure 2.5, we again plot the ESS and MSE, this time against the number of time steps T, obtained using the different proposals and implementations. The ESS and MSE estimates stabilize and reach stationarity at around T = 100. Furthermore, as expected, the ESS remains stable with increasing time steps, and the Wishart/\chi^2 implementations continue to produce identical results.

[Figure 2.5: Effective sample size (a) and mean squared error (b) obtained by increasing the number of time steps T.]

In the third set of experiments, we test the effect of increasing the number of observations from n = 1 to n = 140, while maintaining the number of time steps fixed at T = 200 and the number of particles at K = 10,000. As the number of observations increases, the likelihood becomes more informative, and the prior becomes a worse approximation to the posterior. In Figure 2.6, we plot the ESS and MSE as a function of the number of observations per time point n. As the number of observations increases, the effective sample size of the method that uses the prior as importance density deteriorates, dropping from 90% for n = 1 to 40% for n = 140.
On the other hand, the methods that use the central \chi^2/central Wishart approximation, which account for the observed data, become more computationally efficient, with the ESS increasing from 38% for n = 1 to 78% for n = 140. In addition, given enough particles, both proposals give similar MSE as the number of observations increases. As expected, with larger n the MSE becomes smaller, leading to more precise estimates.

[Figure 2.6: Effective sample size (a) and mean squared error (b) obtained by increasing the number of observations n.]

Finally, we extend our experiments to the multidimensional case, where we monitor the ESS and the entropy loss as J increases. In this case, however, we are not able to use the central Wishart approximation as a proposal, because there is no easy way to numerically evaluate the hypergeometric function {}_0F_1 needed for the computation of the weights; some methods were tried, but they did not produce consistent numerical results. Therefore, we focus on evaluating the extent to which the prior is able to serve as a suitable proposal density, at least in terms of effective sample size. In this experiment, the number of observations is kept at n = 1, the number of time steps is fixed at T = 200 and the number of particles is set to K = 5000. In Figure 2.7, we plot the ESS and the entropy loss against the dimension J of the precision matrix at each time step. The ESS percentage drops from around 90% for J = 1 to less than 30% for J = 6. Similarly, the entropy loss increases with larger dimensions.
These results are not very surprising given the increased complexity of the problem under consideration.

[Figure 2.7: Effective sample size (a) and entropy loss (b) obtained by increasing the dimension J of \Sigma_t^{-1}.]

2.5 Estimating Fixed Parameters

In many situations, information about the hyperparameters is not available, and estimating them from the data becomes of interest. For example, researchers may not know how smoothly the precision matrix varies from one time point to the next. In this case, it is useful to estimate the autoregressive parameter M from the data alongside the other unknown parameters rather than arbitrarily fixing it to a specific value. From an application standpoint, it is reasonable to assume that this autoregressive parameter is static over time. We consider the simple case M = \rho I, where \rho is an unknown scalar; we also fix the other hyperparameters to d = J + 1 and S = I_J, the identity matrix. A simple method to estimate the static autoregressive scalar parameter \rho, given the data, is to propose values of \rho on a grid covering its support and to evaluate the marginal likelihood corresponding to each proposed value. A parameter estimate is then obtained by choosing the value of \rho that maximizes the marginal likelihood. To illustrate this method, we perform simulations by taking a grid of values of \rho in the (0, 1) range and using SMC with the non-central Wishart as a proposal to compute the marginal likelihood for each value of \rho on the grid. The marginal likelihood is straightforward to obtain from the SMC algorithm using (2.21) and (2.22).
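The grid-search idea can be sketched generically. In the thesis setting, `log_marginal` would be an SMC run returning log \hat p(Z_{1:T}); here we substitute the exact stationary AR(1) Gaussian log-likelihood as a stand-in, purely for illustration (the function names and the toy model are ours):

```python
import numpy as np

def grid_estimate_rho(Z, log_marginal, grid=None):
    """Evaluate the (estimated) log marginal likelihood at each grid
    point and return the maximizer.  `log_marginal(rho, Z)` is a
    user-supplied callable (an SMC run in the thesis setting)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    scores = np.array([log_marginal(r, Z) for r in grid])
    return grid[np.argmax(scores)], grid, scores

def ar1_loglik(rho, Z):
    """Exact stationary AR(1) log-likelihood, constants in rho dropped."""
    ll = 0.5 * np.log(1 - rho**2) - 0.5 * Z[0]**2 * (1 - rho**2)
    resid = Z[1:] - rho * Z[:-1]
    return ll - 0.5 * np.sum(resid**2)

# simulate a long AR(1) series with true rho = 0.7 and recover it
rng = np.random.default_rng(3)
T, true_rho = 2000, 0.7
Z = np.empty(T); Z[0] = rng.normal(0, 1 / np.sqrt(1 - true_rho**2))
for t in range(1, T):
    Z[t] = true_rho * Z[t - 1] + rng.normal()
rho_hat, grid, scores = grid_estimate_rho(Z, ar1_loglik)
```

With a long enough series, the maximizing grid point lands near the true value of 0.7.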
Moreover, for \rho = 0 and \rho = 1, we can compute the marginal likelihood analytically: the non-centrality parameter becomes 0 and the non-central Wishart distribution reduces to a central Wishart distribution, which is conjugate, so a closed form solution is available (see Appendix B.1). The experiment is repeated for different underlying 'true' values of the parameter \rho \in \{0.7, 0.98, 0.999\} and for dimensions J \in \{2, 3, 5\}. The number of particles is fixed at K = 2500, in order to maintain a reasonable ESS for all experiments. The number of time steps is fixed at T = 200 and the number of observations per time step is fixed at n = 1. The experiment was repeated 5 times in order to obtain an estimate of the Monte Carlo error, computed as the squared deviations from the average marginal likelihood over the five repetitions. Figure 2.8 shows plots of the marginal likelihood for the different dimensions, with error bars representing the Monte Carlo error over the 5 repetitions. Figure 2.9 gives another depiction of the Monte Carlo error in estimating the marginal likelihood. We note that as the number of dimensions increases, it becomes challenging to estimate the marginal likelihood. For example, in Figure 2.8(a) the estimates of the marginal likelihood are in line with the theoretical marginal likelihood; however, in Figures 2.8(b) and 2.8(c), for J = 3 and J = 5, the SMC estimate falls below the theoretical result. Moreover, the error bars grow with increasing dimension. Furthermore, for smaller values of the underlying true \rho, it is more difficult to discriminate using the marginal likelihood, since the curve is mostly flat and only dips past the true value. Taking a closer look at Figure 2.9, we note that the error in estimating the marginal likelihood becomes amplified as \rho gets closer to 1. This is because the proposal distribution tends to a point mass as \rho gets close to 1.
The covariance matrix changes little over time as \rho \to 1, so we end up proposing the same covariance matrix with probability very close to 1. This also affects the computations numerically, as we are required to take the Cholesky decomposition of values very close to 0 when sampling the non-central Wishart.

[Figure 2.8: Marginal likelihood for different values of the underlying \rho \in \{0.7, 0.98, 0.999\} in varying dimensions, (a) J = 2, (b) J = 3, (c) J = 5, with K = 2500, n = 1 and T = 200 time steps.]

[Figure 2.9: Error in the marginal likelihood for different values of the underlying \rho \in \{0.7, 0.98, 0.999\} in varying dimensions, (a) J = 2, (b) J = 3, (c) J = 5, with K = 2500, n = 1 and T = 200 time steps.]

2.6.
2.6 Particle MCMC for Fixed Parameter Estimation

The approach discussed in the previous section can be used to estimate a fixed scalar M; however, it does not scale well to a more complex M. Particle Markov chain Monte Carlo (pMCMC) methods are a class of algorithms proposed by Andrieu et al. (2010) to build efficient proposal densities using SMC as part of an MCMC algorithm. This class of algorithms is particularly suited to the estimation of parameters that are highly correlated or static over time. The SMC algorithm is used to generate a proposal, which is accepted with probability equal to a ratio of marginal likelihoods. In the next section we introduce the most basic pMCMC method, the particle independent Metropolis-Hastings algorithm. Later, we build on that algorithm to devise algorithms more suitable to the complexity of the multivariate longitudinal model.

2.6.1 The Particle Independent Metropolis-Hastings Algorithm (PIMH)

The particle independent Metropolis-Hastings (PIMH) is a simple particle MCMC algorithm in which SMC is used to propose a possible MCMC candidate. As noted in Andrieu et al. (2010), PIMH on its own provides little advantage over SMC, mainly because of its computational complexity: given equal computational resources, SMC with enough particles can produce satisfactory results. Here, we present the algorithm as a stepping stone to more complex algorithms that use SMC in combination with other MCMC transitions for estimating fixed parameters. The algorithm, detailed in Algorithm 2.3, proposes candidates \Sigma_{1:T}^{*-1} from an SMC approximation \hat p(\cdot \mid Z_{1:T}). Given the current state \Sigma_{1:T}^{-1}, the new candidate is accepted with probability

1 \wedge \frac{\hat p^*(Z_{1:T})}{\hat p(Z_{1:T})(s-1)},
where \hat p^*(Z_{1:T}) is the marginal likelihood under the new candidate and \hat p(Z_{1:T})(s-1) is the marginal likelihood obtained at the previous iteration. This essentially means that if the marginal likelihood improves under the proposed candidate, the candidate will be accepted. The PIMH algorithm is ergodic under weak conditions and leaves p(\Sigma_{1:T}^{-1} \mid Z_{1:T}) invariant. Furthermore, the acceptance probability of the algorithm converges to 1 as the number of particles K \to \infty.

Algorithm 2.3 Particle Independent Metropolis-Hastings Algorithm (PIMH)
Initialization, s = 0:
• Run the SMC algorithm targeting p(\Sigma_{1:T}^{-1} \mid Z_{1:T}), sample \Sigma_{1:T}^{-1}(0) \sim \hat p(\cdot \mid Z_{1:T}), and denote by \hat p(Z_{1:T})(0) the associated marginal likelihood estimate.
For each subsequent MCMC iteration s \geq 1:
• Run an SMC algorithm targeting p(\Sigma_{1:T}^{-1} \mid Z_{1:T}), sample \Sigma_{1:T}^{-1*} \sim \hat p(\cdot \mid Z_{1:T}), and denote by \hat p^*(Z_{1:T}) the associated marginal likelihood estimate.
• With probability
1 \wedge \frac{\hat p^*(Z_{1:T})}{\hat p(Z_{1:T})(s-1)}
set \Sigma_{1:T}^{-1}(s) = \Sigma_{1:T}^{-1*} and \hat p(Z_{1:T})(s) = \hat p^*(Z_{1:T}); otherwise set \Sigma_{1:T}^{-1}(s) = \Sigma_{1:T}^{-1}(s-1) and \hat p(Z_{1:T})(s) = \hat p(Z_{1:T})(s-1).

In the next set of simulations, we implement the PIMH algorithm to propose candidates from p(\Sigma_{1:T}^{-1} \mid Z_{1:T}). The SMC algorithm with the prior as importance distribution, described in Algorithm 2.1, is used as the mechanism to generate proposals. The marginal likelihood is easily obtained using equation (2.21). For fixed M = 0.98 I_J, S = I, J = 3 and n = 1, data are generated and results are obtained by running the PIMH algorithm. We first consider the effect of increasing the number of particles K to 5000 (Figure 2.10(a)), while fixing the other parameters as above. In the first pane of Figure 2.10(a), the acceptance probability is plotted as a function of the number of particles.
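The accept/reject core of Algorithm 2.3 reduces to comparing log marginal-likelihood estimates from successive SMC runs. A skeleton follows, with a hypothetical `log_marginal_smc` stand-in that in practice would be a full SMC pass returning a trajectory and its log \hat p(Z_{1:T}):

```python
import numpy as np

def pimh(log_marginal_smc, n_iters, seed=0):
    """Particle independent Metropolis-Hastings (Algorithm 2.3 sketch).
    `log_marginal_smc(rng)` returns (trajectory, log p_hat(Z_{1:T})).
    A candidate is accepted with probability min(1, p_hat*/p_hat(s-1)),
    implemented on the log scale."""
    rng = np.random.default_rng(seed)
    traj, log_ml = log_marginal_smc(rng)            # initialization, s = 0
    chain, accepts = [log_ml], 0
    for s in range(1, n_iters):
        traj_star, log_ml_star = log_marginal_smc(rng)
        if np.log(rng.uniform()) < log_ml_star - log_ml:   # accept
            traj, log_ml = traj_star, log_ml_star
            accepts += 1
        chain.append(log_ml)
    return traj, np.array(chain), accepts / max(n_iters - 1, 1)

# toy stand-in: noisy log-marginal estimates around a fixed value
def fake_smc(rng):
    return None, -100.0 + 0.5 * rng.standard_normal()

traj, chain, acc_rate = pimh(fake_smc, 500)
```

With the noisy stand-in, the chain of log marginal-likelihood values hovers around the true value and the acceptance rate is well away from zero, mirroring the behavior described for the real algorithm.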
The acceptance probability follows an increasing trend with the number of particles, as expected. The second pane shows the ESS obtained from each SMC proposal within PIMH; as expected, the ESS increases linearly with the number of particles. The third pane shows the estimate of the marginal likelihood obtained from SMC as a function of the number of particles; this estimate does not fluctuate much. The results of the second set of experiments are outlined in Figure 2.10(b), where everything is held constant except the number of MCMC iterations, which is increased up to s = 1000; the number of particles K is fixed at 1000. The first pane shows that the acceptance probability of the algorithm remains more or less constant, fluctuating between 12% and 35%. Similarly, the ESS obtained from each SMC proposal within PIMH is stable. This is reassuring, because the acceptance probability should not change as a function of the number of MCMC iterations. Finally, the marginal likelihood in the third pane is consistent with the results of the previous experiment. The last panel, Figure 2.10(c), shows the results of running the PIMH algorithm for a fixed number of particles K = 10,000 and a fixed number of MCMC iterations s = 500, this time increasing the length of the time series to T = 500. In this case, the acceptance probability suffers considerably as T increases, which is expected given that a longer series increases the difficulty of the problem being solved.
[Figure 2.10: PIMH simulation results showing the acceptance probability, the ESS and the marginal likelihood as a function of (a) the number of particles, (b) the number of MCMC iterations and (c) the number of time steps.]

2.6.2 The Particle Marginal Metropolis-Hastings Algorithm (PMMH)

As mentioned previously, PIMH on its own offers little in terms of computational efficiency compared with SMC. However, in the presence of static parameters, particle MCMC methods become very useful. In the context of the Wishart autoregressive process, the autoregressive parameter M is assumed to be static. The particle marginal Metropolis-Hastings (PMMH) algorithm (Andrieu et al., 2010) builds a proposal for p(\Sigma_{1:T}^{-1}, M \mid Z_{1:T}) by proposing joint updates of \Sigma_{1:T}^{-1} and M. The proposed pair (\Sigma_{1:T}^{*-1}, M^*) is accepted with probability

1 \wedge \frac{\hat p_{M^*}(Z_{1:T})\, p(M^*)}{\hat p_M(Z_{1:T})\, p(M)} \cdot \frac{q(M \mid M^*)}{q(M^* \mid M)}.   (2.32)

If accepted, the update (\Sigma_{1:T}^{-1}, M) \to (\Sigma_{1:T}^{*-1}, M^*) is made and a new pair is proposed. A general PMMH algorithm is detailed in Algorithm 2.4.
As a first step, we consider a scalar static parameter such that M = ρI_J. We propose candidates from p_ρ(Σ^{-1}_{1:T} | Z_{1:T}) using the SMC proposal described in Algorithm (2.2). The prior p(ρ) is taken to be uniform on the interval (0, 1), so its density equals 1 for all values of ρ. Each candidate ρ* ∼ q(· | ρ(s−1)) is generated by a random walk centred at the previous value and constrained so that ρ* ∈ (0, 1):

\[
\rho^* = \rho(s-1) + c\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1), \tag{2.33}
\]

where c is a step-size tuning parameter. In order to propose new values ρ* within (0, 1), we sample from a univariate truncated Gaussian distribution, using the algorithm of Robert (1995). The Hastings component of the ratio in (2.32) is given by

\[
\frac{q(\rho \mid \rho^*)}{q(\rho^* \mid \rho)}
= \frac{\Phi\!\left(\frac{b-\rho}{c}\right) - \Phi\!\left(\frac{a-\rho}{c}\right)}
       {\Phi\!\left(\frac{b-\rho^*}{c}\right) - \Phi\!\left(\frac{a-\rho^*}{c}\right)}, \tag{2.34}
\]

where a and b are the lower and upper limits of truncation, ρ is the current value and ρ* is the proposed value; the symmetric Gaussian kernels cancel, leaving only the ratio of the truncation normalizing constants. In this case, a = 0 and b = 1 are the truncation points.

Algorithm 2.4 Particle Marginal Metropolis-Hastings (PMMH)

Step 1: Initialization, s = 0
• Set M(0) arbitrarily.
• Run an SMC algorithm targeting p_{M(0)}(Σ^{-1}_{1:T} | Z_{1:T}). Sample Σ^{-1}_{1:T}(0) ∼ p̂_{M(0)}(· | Z_{1:T}) and let p̂_{M(0)}(Z_{1:T}) denote the estimated marginal likelihood.

Step 2: For iteration s ≥ 1,
• Sample M* ∼ q(· | M(s−1)).
• Run an SMC algorithm targeting p_{M*}(Σ^{-1}_{1:T} | Z_{1:T}). Sample Σ^{-1*}_{1:T} ∼ p̂_{M*}(· | Z_{1:T}) and let p̂_{M*}(Z_{1:T}) denote the estimated marginal likelihood.
• Accept the pair {M*, Σ^{-1*}_{1:T}} with probability

\[
1 \wedge \frac{\hat{p}_{M^*}(Z_{1:T})\,p(M^*)}{\hat{p}_{M(s-1)}(Z_{1:T})\,p(M(s-1))}\,
\frac{q(M(s-1) \mid M^*)}{q(M^* \mid M(s-1))}.
\]

If accepted, set M(s) = M*, Σ^{-1}_{1:T}(s) = Σ^{-1*}_{1:T} and p̂_{M(s)}(Z_{1:T}) = p̂_{M*}(Z_{1:T}); otherwise set M(s) = M(s−1), Σ^{-1}_{1:T}(s) = Σ^{-1}_{1:T}(s−1) and p̂_{M(s)}(Z_{1:T}) = p̂_{M(s−1)}(Z_{1:T}).
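The truncated Gaussian proposal and its Hastings correction can be sketched as follows. This is a minimal sketch using inverse-CDF sampling via scipy rather than the rejection sampler of Robert (1995); the function name `propose_rho` is a hypothetical choice.

```python
import numpy as np
from scipy.stats import norm

def propose_rho(rho, c, a=0.0, b=1.0, rng=None):
    """Draw rho* from a Gaussian with sd c, centred at rho, truncated to (a, b),
    and return the log Hastings correction log q(rho | rho*) - log q(rho* | rho).

    As in (2.34), the symmetric Gaussian kernels cancel, so only the ratio
    of the normalising constants Phi((b-x)/c) - Phi((a-x)/c) survives.
    """
    rng = np.random.default_rng() if rng is None else rng
    # inverse-CDF draw from the truncated normal
    lo, hi = norm.cdf((a - rho) / c), norm.cdf((b - rho) / c)
    rho_star = rho + c * norm.ppf(lo + rng.uniform() * (hi - lo))
    z = lambda x: norm.cdf((b - x) / c) - norm.cdf((a - x) / c)
    log_q_ratio = np.log(z(rho)) - np.log(z(rho_star))
    return rho_star, log_q_ratio
```

The correction matters precisely because the normalizer depends on the centring point: proposals made from near a boundary are "squeezed" and must be re-weighted accordingly.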
The tuning parameter c of the proposal distribution is typically chosen so that the acceptance probability falls in the 20%-30% range. Given this choice of target and proposal density, the acceptance probability in (2.32) becomes

\[
1 \wedge \frac{\hat{p}_{\rho^*}(Z_{1:T})}{\hat{p}_{\rho(s-1)}(Z_{1:T})}\,
\frac{\Phi\!\left(\frac{1-\rho(s-1)}{c}\right) - \Phi\!\left(\frac{-\rho(s-1)}{c}\right)}
     {\Phi\!\left(\frac{1-\rho^*}{c}\right) - \Phi\!\left(\frac{-\rho^*}{c}\right)}, \tag{2.35}
\]

where the marginal likelihood is obtained readily from the SMC algorithm using equation (2.21). This PMMH sampler is ergodic and leaves p(ρ, Σ^{-1}_{1:T} | Z_{1:T}) invariant. To test the PMMH algorithm, we conduct a set of simulations in which we assess the computational efficiency of the algorithm by examining the traceplots of ρ obtained from PMMH as the number of particles K and the step size c increase. We fix the underlying "true" ρ = 0.98, S = I_J and d = J + 1, for J = 3 and n = 1. The starting value of ρ is set to 0.5. The resulting traceplots of the posterior distribution of ρ are shown in Figure 2.11. The acceptance probability decreases as c increases and increases with the number of particles. For the particular data set we simulated, the acceptance probability goes from 16% to 29% for c = 0.2 as K increases from 250 to 1000. For a fixed K = 1000, increasing c decreases the acceptance probability from 54% to 19%. Moreover, a larger c allows a faster exploration of the parameter space, requiring fewer MCMC iterations. After 5000 MCMC iterations, the median value of ρ, given in Table 2.1, is closer to the true value for larger c.
Figure 2.11: The posterior traceplots of the parameter ρ obtained using the PMMH algorithm with different K and increasing c.

Table 2.1: Posterior median of ρ obtained for different values of c and K.

            K = 250   K = 500   K = 1000
c = 0.001     0.53      0.51       0.51
c = 0.01      0.79      0.79       0.75
c = 0.2       0.77      0.77       0.77
c = 0.5       0.78      0.78       0.78

From the results of Table 2.1 and Figure 2.11, we see that the posterior median returned by the algorithm does not coincide with the true value of ρ. However, the grid method for estimating ρ, discussed in the previous section, gives a similar result. In Figure 2.12, we plot the marginal likelihood, estimated using SMC, as a function of values of ρ taken on a grid; the maximum is reached at 0.83, which is close to the posterior median returned by the PMMH algorithm depicted in Figure 2.13.

Figure 2.12: Plot of the marginal likelihood obtained from running SMC with different values of ρ with K = 1000 particles. The maximum is reached at ρ = 0.83.

Figure 2.13: Plot of the posterior estimate of ρ obtained using pMCMC with c = 0.2 and K = 1000.
We also consider the posterior distribution of Σ^{-1}_{1:T} obtained from the PMMH algorithm. However, rather than considering Σ^{-1}_{1:T} directly, we transform the particles from Σ^{-1}_{1:T} to R_{1:T}. We use the data from the experiment with c = 0.2 and K = 1000. In Figures 2.14(a), 2.14(b) and 2.14(c), we plot the marginal posterior median of each off-diagonal element of R_t, obtained from the PMMH algorithm, as a function of time. We also show the 95% credible interval, as well as the true value of each element of R_t used to generate the data. From these plots, we notice that the median estimate of the marginal elements of R_t gets closer to the truth as the series progresses. The credible interval is quite wide, reflecting our posterior uncertainty. Intuitively, the marginal prior is flat and, with n = 1, the likelihood is not very informative, resulting in a posterior that is quite spread out.

Figure 2.14: Plot of the true correlation and the median estimated correlation of (a) R12, (b) R13 and (c) R23, along with a 95% credible interval, obtained using the PMMH algorithm estimating M and fixing S.

In the next set of simulations, we compare the covariance matrix estimates obtained when ρ is sampled using PMMH to those obtained when ρ is set to an arbitrary value. Thirty different data sets are generated using a fixed value of ρ = 0.4 with S = I_J, n = 1 and J = 3.
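The transformation from a sampled precision matrix to the corresponding correlation matrix is the standard rescaling of the implied covariance; a minimal sketch:

```python
import numpy as np

def precision_to_correlation(prec):
    """Map a precision matrix Sigma^{-1} to the correlation matrix R:
    invert to recover Sigma, then rescale by the square roots of its diagonal."""
    sigma = np.linalg.inv(prec)
    d = 1.0 / np.sqrt(np.diag(sigma))
    return sigma * np.outer(d, d)
```

Applying this to each particle at each time step yields the sample of R_t from which the median and credible intervals in Figure 2.14 are computed.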
In each case, we compare the average entropy loss (defined in (2.31)). The experiment is repeated for ρ = 0.80 and ρ = 0.95. The results are outlined in Figure 2.15. In Figure 2.15(a), we consider the boxplots of the average entropy loss for three models where ρ is fixed and one where ρ is sampled using PMMH. PMMH produces results as good as fixing ρ at a value close to the true parameter, and the loss is smaller on average than that obtained by mis-specifying the model. In Figure 2.15(b), we show boxplots of the average entropy loss when the underlying true value of ρ is 0.8. The PMMH algorithm performs similarly to the algorithms with fixed values, but gives more consistent results between data sets; the average loss computed from the other algorithms is more variable across data sets. In Figure 2.15(c), boxplots of the average entropy loss are produced for an underlying ρ value of 0.95. PMMH performs slightly worse than a properly specified parameter, but similarly to a mis-specified model. In the previous simulation, the tuning parameter c was fixed at 0.3 across all experiments. This resulted in the wide range of acceptance probabilities plotted in Figure 2.16. The acceptance probability is very data dependent; when considering a new problem, one must ensure that the PMMH algorithm is tuned specifically to that problem, by adjusting c and the number of particles K to produce acceptance probabilities in the 20%-30% range.
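Equation (2.31) is not reproduced in this excerpt; the sketch below assumes the standard entropy (Stein) loss, L(Σ̂, Σ) = tr(Σ̂Σ^{-1}) − log|Σ̂Σ^{-1}| − J, averaged over time steps, which is zero exactly when the estimate equals the truth.

```python
import numpy as np

def entropy_loss(est, truth):
    """Stein's entropy loss L = tr(A) - log det(A) - J with A = est @ inv(truth).
    Non-negative, and zero iff est == truth."""
    a = est @ np.linalg.inv(truth)
    _, logdet = np.linalg.slogdet(a)
    return np.trace(a) - logdet - est.shape[0]

def average_entropy_loss(ests, truths):
    """Average the loss over a sequence of (estimate, truth) pairs, e.g. over t = 1..T."""
    return float(np.mean([entropy_loss(e, t) for e, t in zip(ests, truths)]))
```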
Figure 2.15: Average entropy loss for Σ^{-1} computed over 30 independent experiments under a fixed ρ and using the PMMH algorithm, for true ρ values of (a) 0.40, (b) 0.80 and (c) 0.95.

Figure 2.16: Acceptance probability of PMMH on 30 different data sets, generated with different true values of ρ.

2.6.3 Estimating Both M and S

In this section, we extend the model to account for both an unknown but static M and S. As mentioned earlier, while M controls the temporal dynamics between covariance matrices at different time steps, S controls the spread of each covariance matrix at each time step. We extend the PMMH algorithm introduced in the previous section to jointly propose candidates for M and S. The algorithm is identical to the one outlined in Algorithm (2.4), except that we now propose a new triplet (M*, S*, Σ^{-1*}_{1:T}). We assume that M and S are independent a priori, so that the acceptance probability of the MH ratio is given by

\[
1 \wedge \frac{\hat{p}_{M^*,S^*}(Z_{1:T})\,p(M^*)\,p(S^*)}{\hat{p}_{M,S}(Z_{1:T})\,p(M)\,p(S)}\,
\frac{q(M \mid M^*)}{q(M^* \mid M)}\,
\frac{q(S \mid S^*)}{q(S^* \mid S)}.
\]

As before, if accepted, the update (M, S, Σ^{-1}_{1:T}) → (M*, S*, Σ^{-1*}_{1:T}) is made; otherwise the current state (M, S, Σ^{-1}_{1:T}) is kept. This process is repeated.

For simplicity, we begin by exploring this idea for scalar parameters such that M = ρI_J and S = s²I_J. We fix d = J + 1 as before. The prior distribution p(ρ) is uniform on (0, 1), as before, and therefore cancels in the numerator and the denominator. As for s², we assume an inverse χ² prior distribution. We propose candidates for p_{M,S}(Σ^{-1}_{1:T} | Z_{1:T}) using the SMC proposal suggested in Algorithm (2.1). We sample ρ, as previously, from a univariate Gaussian truncated to (0, 1), centred at the current value of ρ and with variance c₁². As for S, a variety of proposals can be used to generate candidates (Browne, 2006). For simplicity, we use another Gaussian, truncated to (0, ∞), centred at the current value of s² and with variance c₂². The above MH ratio becomes

\[
1 \wedge \frac{\hat{p}_{M^*,S^*}(Z_{1:T})\,p(s^{2*})}{\hat{p}_{M,S}(Z_{1:T})\,p(s^{2})}\,
\frac{\Phi\!\left(\frac{1-\rho}{c_1}\right) - \Phi\!\left(\frac{-\rho}{c_1}\right)}
     {\Phi\!\left(\frac{1-\rho^*}{c_1}\right) - \Phi\!\left(\frac{-\rho^*}{c_1}\right)}\,
\frac{1 - \Phi\!\left(\frac{-s^{2}}{c_2}\right)}
     {1 - \Phi\!\left(\frac{-s^{2*}}{c_2}\right)}, \tag{2.36}
\]

where p̂_{M*,S*}(Z_{1:T}) is readily available from the SMC algorithm. To test this algorithm, we generate data from a model with underlying ρ = 0.98 and S = 1. We generate a series of length T = 100 and run the algorithm using K = 2000 particles for 2000 MCMC iterations. In Figures 2.17(a) and 2.17(b), we monitor the posterior distributions of M and S respectively as we vary c₁ and c₂. We initialize ρ(0) = s²(0) = 0.5. Here we notice, similarly to what was noted previously, that the acceptance probability decreases as a function of c₁ and c₂.
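The log of the ratio in (2.36) combines the marginal-likelihood ratio, the inverse-χ² prior ratio for s² (the flat prior on ρ cancels) and the two truncated-normal Hastings corrections. A hedged sketch, in which the prior degrees of freedom `nu0` is a hypothetical choice, not a value from the thesis:

```python
import numpy as np
from scipy.stats import norm

def log_accept_rho_s2(log_ml_star, log_ml, rho, rho_star, s2, s2_star,
                      c1, c2, nu0=3.0):
    """Log of the MH ratio in (2.36) for the joint (rho, s^2) move."""
    # inverse-chi^2(nu0) log density, up to an additive constant
    log_prior = lambda x: -(nu0 / 2 + 1) * np.log(x) - 1.0 / (2 * x)
    # normalisers of the truncated random-walk proposals
    z_rho = lambda x: norm.cdf((1 - x) / c1) - norm.cdf(-x / c1)  # (0, 1)
    z_s2 = lambda x: 1.0 - norm.cdf(-x / c2)                      # (0, inf)
    return (log_ml_star - log_ml
            + log_prior(s2_star) - log_prior(s2)
            + np.log(z_rho(rho)) - np.log(z_rho(rho_star))
            + np.log(z_s2(s2)) - np.log(z_s2(s2_star)))
```

When the proposed and current parameters coincide, every correction term cancels and the ratio reduces to the marginal-likelihood ratio alone.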
Figure 2.17: The posterior traceplots of the parameters (a) ρ and (b) s² obtained using the PMMH algorithm with a fixed K = 2000 and simultaneously increasing c₁ and c₂.

In Figures 2.18(a) and 2.18(b), we consider the posterior distributions obtained for ρ and S for c₁ = c₂ = 0.2, with an acceptance probability of 27%. For this particular data set, similarly to when we were estimating only ρ, we find that the posterior distribution of ρ has a median value of 0.73 and that of s² has a median value of 1.04.
Figure 2.18: The posterior traceplots of the parameters (a) ρ (median 0.736, 95% CI [0.527, 0.868]) and (b) s² (median 1.04, 95% CI [0.765, 1.44]) obtained using the PMMH algorithm for c₁ = c₂ = 0.2.

2.7 Summary and Conclusion

In this Chapter, we have proposed a model that can be useful for understanding the dynamic correlations (or co-movement) between multidimensional longitudinal data. Using SMC, with fixed hyperparameters and given enough particles, we are able to estimate the correlation matrix in time reasonably well. However, in many cases the interest lies in estimating the hyperparameters, which are assumed to be static. We have shown that through the use of particle MCMC we are able to develop a valid algorithm to infer static as well as dynamic parameters. As a final illustration of the different methods considered in this Chapter, we compare the results of the different algorithms on the same data set. In particular, we look at where the true precision parameters lie in comparison to the posterior median and 95% credible interval of the estimated parameter. In Figure 2.19, we plot the posterior distribution of the precision in time for one dimension, with J = 1. For this particular simulation, we generated data with n = 1, T = 200, a fixed underlying true ρ = 0.98, d = J + 1 and S = 1. We then ran the different algorithms with K = 2000 particles. For the PMMH algorithm, we ran 2000 MCMC iterations. When estimating only M, we set the step size to c = 0.1, resulting in a 30% acceptance probability.
For estimating both M and S, we set c₁ = 0.05 and c₂ = 1, resulting in a 24% acceptance probability. All the algorithms produce very similar estimates of the median precision. The results are similar both when we estimate the hyperparameters and when we fix them at their true values.

Figure 2.19: Posterior distribution of the estimated precision in the univariate case (J = 1) with 95% credible intervals, using (a) the non-central χ² proposal, (b) the central χ² proposal, (c) the PMMH algorithm estimating only M, and (d) the PMMH algorithm estimating M and S.

To obtain the results in Figures 2.20, 2.21 and 2.22, we generated data with n = 1, J = 3, T = 100, a fixed underlying true M = 0.98I_J, d = J + 1 and S = I_J. We then ran the different algorithms (SMC with fixed M and S, PMMH with fixed S, and PMMH with neither parameter fixed) with K = 2000 particles. For the PMMH algorithm, we ran 2000 MCMC iterations. When estimating only M, we set the step size to c = 0.25, resulting in a 32% acceptance probability. For estimating both M and S, we set c₁ = 0.08 and c₂ = 0.8, resulting in a 22% acceptance probability. After obtaining the estimates of Σ^{-1}, we transform to R and consider the median off-diagonal elements as well as the 95% credible interval. The median estimate is very similar from one algorithm to the next. From Figure 2.20, however, we notice that the SMC algorithm begins to degenerate.
The particle MCMC algorithms in Figures 2.21 and 2.22 do not degenerate, even if the SMC proposal does. This is what we would expect: by construction, the PMMH algorithm uses SMC only to obtain one sample path approximately distributed according to the posterior; we do not need a good SMC approximation of the joint. Finally, we compare the time it takes to run each algorithm (Table 2.2); unsurprisingly, PMMH takes much longer to run than SMC.

Table 2.2: Time to run the different algorithms, in minutes.

                         J = 1    J = 3
SMC (NCW Prop)             0.5      0.3
SMC (CW Prop)              5.0        –
PMMH (Est. M)             31.5    611.9
PMMH (Est. M and S)       35.8    578.9

The drawback of any Metropolis-Hastings scheme is that it depends heavily on the proposal distribution. The proposal we developed here as part of the pMCMC algorithm needs tuning of the random-walk step size for each data set. As we have seen in our many simulations, too small a value of c results in most candidates being accepted, but the algorithm moves very slowly and a longer chain is required to explore the space; too large a value of c results in better mixing, but fewer candidates are accepted. Future work needs to explore methods to adaptively and automatically tune these parameters to optimize mixing (Gilks et al., 1998). Theoretical and empirical results suggest that the acceptance probability should be in the 15%-50% range for a random-walk algorithm (Robert, 1996). Finally, a better implementation of a method that evaluates the ₀F₁ hypergeometric function in higher dimensions needs to be considered. It is important that the method be efficient, in order to keep the computation time reasonable. This would be particularly helpful for data sets with a large number of observations n.
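The adaptive tuning left here to future work can be illustrated with a one-line Robbins-Monro update of the step size. This is a sketch only: the function name, the target rate and the decay exponent are illustrative choices, not part of the thesis, and the decaying gain is what keeps such adaptation asymptotically harmless.

```python
import math

def adapt_step(c, accepted, iteration, target=0.25, kappa=0.6):
    """One Robbins-Monro update of the random-walk step size c, nudging the
    empirical acceptance rate toward `target`. The gain decays like
    iteration^-kappa, so the adaptation vanishes as the chain runs."""
    gain = (iteration + 1) ** (-kappa)
    return c * math.exp(gain * ((1.0 if accepted else 0.0) - target))
```

Calling this after every MH step grows c when too many candidates are accepted and shrinks it when too few are, without hand-tuning per data set.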
Figure 2.20: Plot of the true correlation and the median estimated correlation of R12, R13 and R23, along with a 95% credible interval, obtained using the SMC algorithm with the non-central proposal.

Figure 2.21: Plot of the true correlation and the median estimated correlation of R12, R13 and R23, along with a 95% credible interval, obtained using the PMMH algorithm estimating M and fixing S.

Figure 2.22: Plot of the true correlation and the median estimated correlation of R12, R13 and R23, along with a 95% credible interval, obtained using the PMMH algorithm estimating both M and S.
Chapter 3

Static Correlation

3.1 Introduction

Estimation of the association structure in the multivariate case simplifies greatly under certain settings, for example when the data under consideration are multivariate but not longitudinal, or longitudinal but not multivariate. Referring back to Figure 1.1 in the Introduction, this corresponds to a slice of the cube with dimensions n × J at T = 1 when the data are non-longitudinal, or n × T with J = 1 when the data are longitudinal but not multivariate. Moreover, when dealing with high-dimensional problems, it can be very useful from a computational viewpoint to reduce the number of parameters to be estimated. This is accomplished by imposing a structure of association on the covariance matrix. Furthermore, in certain cases, the researcher may be interested in comparing different hypotheses about patterns of association. In other cases, the data might follow a natural grouping, where certain subsets of variables express a higher degree of association within the group and less association with other groups of variables. A structure of association is imposed through constraints that force the inverse covariance to lie in a lower-dimensional space. The number of parameters to be estimated is reduced, which provides efficiency gains and facilitates interpretation. Nonetheless, without a priori knowledge or subject-matter expertise, it might be difficult to know what structure to impose. In this case, it is possible to learn the structure from the data; however, this adds to the computational effort. If one is willing to assume that the observed data follow a multivariate Gaussian distribution, the estimation of the covariance matrix under this setting is straightforward: it can be achieved by appealing to the well-developed theory of covariance selection in Gaussian graphical models (Dempster, 1972).
The structure of association among variables is imposed through conditional independence, by fixing certain off-diagonal elements of the inverse covariance matrix to zero. In this Chapter, we present methodology for the Bayesian estimation of full as well as constrained covariance and correlation matrices in the context of Gaussian graphical models. We first present some of the well-established methodology for Gaussian data. These methods are then extended to accommodate binary data, using the multivariate probit model. Estimation of the full and constrained covariance for binary data was addressed in Tabet (2007); here, we build on this work and provide an extension that learns the structure of association underlying the binary data. Throughout this Chapter, we assume the data to be mean-centered at 0 with no predictors in the model.

3.2 Conditional Independence

The concept of conditional independence plays a key role in how constraints are imposed on the association between variables in this Chapter, so it is useful to consider its definition and interpretation more closely. The covariance matrix Σ, or equivalently the correlation matrix R, encodes the marginal association between variables. Usually, when two variables exhibit a high marginal association, the pairwise estimate of their correlation or covariance is high. However, these two variables may be affected by a third variable, which acts as a mediator or confounder. For example, consider the following three variables: smoking, drinking coffee and lung cancer. Drinking coffee and lung cancer may express a high degree of association; however, once information about smoking is available, drinking coffee and lung cancer become decoupled, because smoking accounts for the association between them.
In order to filter out the indirect dependence caused by intermediate or confounding variables, one can control for all other variables in the model by considering the conditional dependence. In the context of Gaussian data, conditional dependence is encoded by the precision matrix, the inverse covariance Σ^{-1}. Let σ^{ij} denote the ij-th element of Σ^{-1}. When σ^{ij} = 0, variables i and j are conditionally independent given all the other variables in the model. Generally, most variables are marginally correlated, which implies that zero correlation is a strong indicator of independence. Conditional independence, moreover, accounts for confounding variables and indirect associations, providing a strong measure of dependence. Gaussian graphical models provide a convenient framework for imposing a conditional independence structure of association between Gaussian variables.

3.3 Gaussian Graphical Models

Gaussian graphical models are a class of undirected graphical models that provide a graphical representation of the inverse covariance matrix Σ^{-1}, characterizing conditional independence between Gaussian variables. An edge is drawn between two nodes in the graph unless the corresponding variables are conditionally independent given all the other variables in the model. In this way, graphs provide a clear representation of the interrelationships between the variables in the model. This approach to modeling data is known as covariance selection (Dempster, 1972). The introduction of the hyper-inverse Wishart distribution by Dawid and Lauritzen (1993) as a conjugate prior for structured covariance matrices was central to the development of Bayesian approaches for inference in this class of models (Giudici (1996), Giudici and Green (1999), and Wong et al. (2003)).
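The smoking/coffee/cancer intuition can be made concrete: scaling the off-diagonal elements of the precision matrix gives the partial correlations, and a zero entry signals conditional independence. A minimal sketch:

```python
import numpy as np

def partial_correlations(sigma):
    """Partial correlations from a covariance matrix: invert to get the
    precision Omega, then rho_{ij|rest} = -Omega_ij / sqrt(Omega_ii * Omega_jj).
    For Gaussian data, a zero entry means conditional independence."""
    omega = np.linalg.inv(sigma)
    d = 1.0 / np.sqrt(np.diag(omega))
    pc = -omega * np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc
```

For instance, if X2 and X3 each depend on X1 but not on each other, they are marginally correlated, yet their partial correlation given X1 is exactly zero.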
3.4 Some Graph Theory

In this section, we present some key graph theory concepts and definitions necessary to understand the statistical models developed in this Chapter. For a complete account of graphical model theory, the reader is referred to Lauritzen (1996).

An undirected graph is a pair G = (V, E), where V is a set of vertices representing variables and E, the edge set, is a subset of the set of unordered distinct pairs of vertices. Visually, each vertex i is a node representing the random variable i, and an undirected edge (i, j) ∈ E connects nodes i and j, present unless the two variables are conditionally independent. In the example of Figure 3.1(a), V1 ⊥⊥ V3 | V2 denotes that V1 is conditionally independent of V3 given V2. Below are some graph-theoretic definitions which we will call upon in this Chapter:

- A subgraph is a graph whose vertices are a subset of the vertices of the original graph.
- A graph or a subgraph is complete, or fully connected, if there is an edge connecting any two nodes.
- A clique is a complete subgraph.
- Let A and B be subgraphs. A set S is said to separate A from B if all paths from A to B go through S.
- Subgraphs (A, B, S) form a decomposition of G if V = A ∪ B and S = A ∩ B, where S is complete and separates A from B.
- The subgraphs in a sequence of decompositions that cannot be further decomposed are the prime components of the graph.
- A graph is said to be decomposable if every prime component is complete.
- A junction tree is a tree whose vertex set is the set of prime components of G, such that for any two prime components Pi and Pj and any T on the path between them, Pi ∩ Pj ⊂ T. This representation is critical for all computations done in graphical models.
- A distribution p(·) is Markov with respect to a graph G if, for any decomposition (A, B) of G, A ⊥⊥ B | A ∩ B, where A ∩ B is complete and separates A from B.
- If the joint distribution of a set of variables V is Markov with respect to a graph G, then it can be decomposed according to the graph structure as a ratio of products over prime components and separators,

\[
p(V \mid G) = \frac{\prod_{P \in \mathcal{P}} p(V_P)}{\prod_{S \in \mathcal{S}} p(V_S)}, \tag{3.1}
\]

where $\mathcal{P}$ is the set of all prime components and $\mathcal{S}$ is the set of all separators.

Computational efficiencies are a direct consequence of the Markov properties in graphical models, because they make it possible to model subsets of variables locally. In this research, we restrict our attention to decomposable graphs. Figure 3.1 is an example of a decomposable undirected graphical model, taken from Carvalho (2006). The original graph in 3.1(a) can be decomposed into its junction tree in 3.1(b). This graph has five cliques and four separators: {{1, 2, 5}, {2, 4, 5, 6}, {2, 3, 4}, {4, 6, 7}, {6, 7, 8, 9}} is the set of cliques and {{2, 5}, {2, 4}, {4, 6}, {6, 7}} is the set of separators.

Figure 3.1: A decomposable graph and its junction tree decomposition. Each node of the junction tree represents a clique, while vertices shared by adjacent nodes of the tree define the separators.

3.5 Bayesian Estimation in Gaussian Graphical Models

Let Z_i = (Z_{i1}, ..., Z_{iJ}) denote the J-dimensional vector of continuous responses on the i-th observation (1 ≤ i ≤ n). Given a graph structure G, we assume that the Z_i are i.i.d. N_J(0, Σ), so that Z follows a matrix-variate normal distribution consistent with G. In this Chapter we assume, without loss of generality, that the response is centered at 0; under the i.i.d. assumption, the probability density function of the matrix-variate Gaussian distribution simplifies to

\[
f(Z \mid \Sigma, G) = \frac{1}{(2\pi)^{nJ/2}}\,|\Sigma|^{-n/2}\,
\operatorname{etr}\!\left(-\tfrac{1}{2}\,\Sigma^{-1} Z'Z\right). \tag{3.2}
\]
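The clique/separator structure of Figure 3.1 can be checked mechanically, since a graph is decomposable exactly when it is chordal. A sketch assuming the networkx library is available:

```python
import networkx as nx
from itertools import combinations

# The decomposable graph of Figure 3.1, entered via its cliques.
cliques = [{1, 2, 5}, {2, 4, 5, 6}, {2, 3, 4}, {4, 6, 7}, {6, 7, 8, 9}]
G = nx.Graph()
for c in cliques:
    G.add_edges_from(combinations(sorted(c), 2))

# Decomposable <=> chordal (triangulated).
assert nx.is_chordal(G)

# The maximal cliques recovered from the graph are the junction-tree nodes;
# intersections of adjacent cliques give the separators {2,5}, {2,4}, {4,6}, {6,7}.
maximal = {frozenset(c) for c in nx.chordal_graph_cliques(G)}
assert maximal == {frozenset(c) for c in cliques}
```

This is the same decomposition that drives all the local computations below: every quantity of interest factorizes over these five cliques and four separators.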
In a Bayesian paradigm, given a graph structure, inference is based on the posterior distribution

\[
\pi(\Sigma \mid Z, G) \propto f(Z \mid \Sigma, G)\,\pi(\Sigma \mid G), \tag{3.3}
\]

where the likelihood f(Z | Σ, G) is given in (3.2) and π(Σ | G) is the prior on the covariance matrix.

3.5.1 Prior and Posterior for Covariance Matrices

When the graph G is fully connected, as in the example of Figure 3.2(a), this corresponds to a saturated model where all the elements of Σ^{-1} are different from 0 and no constraints are imposed.

Figure 3.2: An example of two different graph structures, resulting in a fully connected precision matrix in 3.2(a) versus a sparse precision matrix in 3.2(b).

In this case, the prior for Σ is the conjugate inverse Wishart prior, denoted Σ ∼ IW_J(ν, Ω), with degrees of freedom ν and inverse scale matrix Ω, defined as in Dawid and Lauritzen (1993) by

\[
\pi(\Sigma) = \frac{\left|\tfrac{\Omega}{2}\right|^{\frac{\nu+J-1}{2}}}{\Gamma_J\!\left(\frac{\nu+J-1}{2}\right)}\,
|\Sigma|^{-(\nu+2J)/2}\,
\exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\Omega\right)\right), \tag{3.4}
\]

where Γ_J denotes the multivariate gamma function. Note that this convention for the inverse Wishart distribution is different from the one used in the previous Chapter; refer to Appendix A.3 for the equivalence. The resulting posterior distribution is available analytically in closed form:

\[
\begin{aligned}
\pi(\Sigma \mid Z, G) &\propto f(Z \mid \Sigma, G)\,\pi(\Sigma \mid G) \\
&\propto |\Sigma|^{-n/2}\,\operatorname{etr}\!\left(-\tfrac{1}{2}\Sigma^{-1}Z'Z\right)
\times |\Sigma|^{-(\nu+2J)/2}\exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\Omega\right)\right) \\
&\propto |\Sigma|^{-(n+\nu+2J)/2}\exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}(\Omega + Z'Z)\right)\right),
\end{aligned}
\]

which is an inverse Wishart, IW_J(n + ν, Ω + Z'Z), with updated degrees of freedom and inverse scale matrix. On the other hand, consider a structured graph as in the example of Figure 3.2(b). The precision matrix is sparse, with elements σ^{ij} = 0 whenever an edge is missing between variables i and j. In this case, the inverse Wishart is no longer a valid prior distribution.
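The conjugate update for the saturated model is a two-line computation. A sketch assuming observations are the rows of Z; note the convention mapping, since scipy's inverse Wishart with degrees of freedom df has exponent −(df + J + 1)/2, so df = ν + J − 1 in the Dawid-Lauritzen convention of (3.4):

```python
import numpy as np
from scipy.stats import invwishart

def iw_posterior(nu, omega, Z):
    """Conjugate update in the Dawid-Lauritzen convention:
    prior IW(nu, Omega), n observed rows of Z, posterior IW(nu + n, Omega + Z'Z)."""
    n = Z.shape[0]
    return nu + n, omega + Z.T @ Z

def iw_sample(nu, omega, rng=None):
    """Draw one Sigma; scipy's degrees of freedom are nu + J - 1 in this convention."""
    J = omega.shape[0]
    return invwishart.rvs(df=nu + J - 1, scale=omega, random_state=rng)
```

Usage: update once with all the data, then draw posterior samples of Σ with `iw_sample`.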
Under a structured graph assumption, the hyper-inverse Wishart distribution, defined by Dawid and Lauritzen (1993), is a locally conjugate family of Markov probability distributions for matrices on decomposable graphs. Let \Sigma be a J \times J covariance matrix consistent with the decomposable graph G; then \Sigma follows a hyper-inverse Wishart distribution, denoted HIW_J(\nu, \Omega, G), with degrees of freedom \nu > 0 and location matrix \Omega > 0, which can be decomposed as

\pi(\Sigma \mid \nu, \Omega, G) = \frac{\prod_{P \in \mathcal{P}} \pi(\Sigma_P \mid \nu, \Omega_P)}{\prod_{S \in \mathcal{S}} \pi(\Sigma_S \mid \nu, \Omega_S)}.    (3.5)

The hyper-inverse Wishart distribution is the unique hyper-Markov distribution for \Sigma that is consistent with clique-marginals that are inverse Wishart distributed. This means that, for a given decomposable graph structure G, if \Sigma \sim HIW_J(\nu, \Omega, G), then the covariance matrix of each prime component P follows an inverse Wishart distribution, \Sigma_P \sim IW_J(\nu, \Omega_P). As seen earlier, the multivariate Gaussian distribution is Markov with respect to any decomposable graph G, and therefore the joint likelihood of the Gaussian variables Z decomposes according to the graph structure as a ratio of products over prime components and separators

f(Z \mid \Sigma, G) = \frac{\prod_{P \in \mathcal{P}} f(Z_P \mid \Sigma_P)}{\prod_{S \in \mathcal{S}} f(Z_S \mid \Sigma_S)},    (3.6)

where \mathcal{P} is the set of all prime components and \mathcal{S} is the set of all separators. We can therefore write the posterior distribution as

\pi(\Sigma \mid Z, G) \propto f(Z \mid \Sigma, G)\, \pi(\Sigma \mid G)
\propto \frac{\prod_{P \in \mathcal{P}} f(Z_P \mid \Sigma_P)}{\prod_{S \in \mathcal{S}} f(Z_S \mid \Sigma_S)} \cdot \frac{\prod_{P \in \mathcal{P}} \pi(\Sigma_P \mid \nu, \Omega_P)}{\prod_{S \in \mathcal{S}} \pi(\Sigma_S \mid \nu, \Omega_S)}
\propto \frac{\prod_{P \in \mathcal{P}} f(Z_P \mid \Sigma_P)\, \pi(\Sigma_P \mid \nu, \Omega_P)}{\prod_{S \in \mathcal{S}} f(Z_S \mid \Sigma_S)\, \pi(\Sigma_S \mid \nu, \Omega_S)}
= \frac{\prod_{P \in \mathcal{P}} \pi(\Sigma_P \mid Z_P)}{\prod_{S \in \mathcal{S}} \pi(\Sigma_S \mid Z_S)}.

The posterior distribution is also hyper-inverse Wishart, with updated parameters \Sigma \mid (Z, \nu, \Omega, G) \sim HIW_J(n + \nu, \Omega + Z^\top Z). Efficient methods for sampling from the hyper-inverse Wishart distribution rely on the junction tree representation of the graph and matrix results for inverse Wishart distributions (Carvalho and West, 2007).
Intuitively, the zero structure is fixed in the inverse covariance, and matrix completion methods are used to ensure that the resulting covariance is positive semi-definite and unique.

3.5.2 Covariance Selection and Model Determination

If the structure of the inverse correlation matrix is unknown, it is possible to estimate it directly from the data. This can be achieved by making the graph structure G an unknown parameter, assigning it a prior distribution \pi(G), and considering its posterior distribution for inference:

p(G \mid Z) \propto p(Z \mid G)\, p(G).    (3.7)

The marginal likelihood is given by the following integral over the space of positive definite symmetric matrices consistent with the structure of G:

p(Z \mid G) = \int_{\Sigma \in M(G)} f(Z \mid \Sigma, G)\, \pi(\Sigma \mid G)\, d\Sigma,    (3.8)

where M(G) is the set of all possible symmetric positive definite matrices consistent with G. For decomposable Gaussian graphical models, the marginal likelihood p(Z \mid G) is easily computed because the covariance matrix can be integrated out; the hyper-inverse Wishart is a locally conjugate prior and the posterior admits a closed form. In the case of decomposable models, this integral is given by a ratio of normalizing constants from the prior and the posterior,

p(Z \mid G) = (2\pi)^{-nJ/2}\, \frac{h(G, \nu, \Omega)}{h(G, \nu_n, \Omega_n)},    (3.9)

where \nu and \Omega are the hyperparameters of the prior on \Sigma, \nu_n = \nu + n and \Omega_n = \Omega + Z^\top Z are the updated posterior hyperparameters, and h denotes the normalizing constant of the hyper-inverse Wishart, given by

h(G, \nu, \Omega) = \frac{\prod_{C \in \mathcal{C}} \left|\Omega_C/2\right|^{(\nu+|C|-1)/2}\, \Gamma_{|C|}\!\left(\frac{\nu+|C|-1}{2}\right)^{-1}}{\prod_{S \in \mathcal{S}} \left|\Omega_S/2\right|^{(\nu+|S|-1)/2}\, \Gamma_{|S|}\!\left(\frac{\nu+|S|-1}{2}\right)^{-1}}.    (3.10)

The posterior distribution over graphs can be computed as

p(G \mid Z) = \frac{p(Z \mid G)\, \pi(G)}{\sum_{i=1}^{D_c} p(Z \mid G_i)\, \pi(G_i)},    (3.11)

where D_c is the total number of decomposable graphs. The marginal likelihood p(Z \mid G) can be computed using (3.9).
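The computation in (3.9)–(3.10) can be sketched in a few lines. The helper names below are our own, and we assume zero-indexed cliques and separators; the multivariate gamma function is taken from SciPy.

```python
import numpy as np
from scipy.special import multigammaln

def log_h(cliques, seps, nu, Omega):
    # log of h(G, nu, Omega), eq. (3.10): product over cliques divided by
    # product over separators of the inverse-Wishart normalizing constants.
    def log_term(idx):
        sub = Omega[np.ix_(idx, idx)]
        d = (nu + len(idx) - 1) / 2.0
        _, logdet = np.linalg.slogdet(sub / 2.0)
        return d * logdet - multigammaln(d, len(idx))
    return sum(log_term(c) for c in cliques) - sum(log_term(s) for s in seps)

def log_marginal_likelihood(Z, cliques, seps, nu=3, Omega=None):
    # log p(Z | G), eq. (3.9): ratio of prior and posterior constants.
    n, J = Z.shape
    Omega = np.eye(J) if Omega is None else Omega
    return (-0.5 * n * J * np.log(2 * np.pi)
            + log_h(cliques, seps, nu, Omega)
            - log_h(cliques, seps, nu + n, Omega + Z.T @ Z))

# Compare a chain graph 1-2-3 (no 1-3 edge) against the saturated graph
# on data that satisfy Z1 independent of Z3 given Z2.
rng = np.random.default_rng(42)
K = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.5, 0.5],
              [0.0, 0.5, 1.0]])   # precision with a zero in position (1,3)
Z = rng.multivariate_normal(np.zeros(3), np.linalg.inv(K), size=500)

lml_chain = log_marginal_likelihood(Z, cliques=[[0, 1], [1, 2]], seps=[[1]])
lml_full = log_marginal_likelihood(Z, cliques=[[0, 1, 2]], seps=[])
```

With enough data, the marginal likelihood favours the sparse graph that generated the observations, which is the basis of the model determination scheme above.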
If we assume that all decomposable graphs of any given size are equally likely, we can set a uniform prior \pi(G) = 1/D_c. This prior on G, as noted by Armstrong et al. (2009), will tend to favour graphs of medium size, where the size of a graph is given by the number of its edges. Other sparsity-inducing priors can be considered; a Bernoulli prior on the inclusion of edges, for example, would penalize a large number of edges, and further possibilities are discussed in Bornn and Caron (2011). However, for illustration purposes and computational convenience, in this thesis we limit our consideration to the uniform prior \pi(G) = 1/D_c.

Enumerating all possible graph structures can be a very arduous task, since the number of possible graphs is 2^{J(J-1)/2}. For graphs with more than 6 nodes, the computation becomes daunting. As an alternative, for higher dimensions, we can sample the graph from its posterior distribution by appealing to well-developed MCMC methods for sampling decomposable graphs (Giudici and Green, 1999). The Monte Carlo sampling algorithm suggested by Giudici and Green (1999) starts with a current graph G_c and proceeds by randomly selecting the indices of two vertices such that the addition or deletion of an edge between these two vertices would result in a decomposable graph. Because they change only one edge at a time, these moves are considered local moves and provide very important computational gains. The resulting proposed graph, denoted G_p, is then accepted according to the Metropolis-Hastings acceptance probability

\min\left(1, \frac{p(Z \mid G_p)\, \pi(G_p)}{p(Z \mid G_c)\, \pi(G_c)}\right).    (3.12)

This results in samples generated from the posterior distribution over all graphs, p(G \mid Z).
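The accept/reject step in (3.12) is conveniently carried out on the log scale; the function name below is ours, and the decomposability check for the proposal is assumed to have been done upstream.

```python
import numpy as np

def accept_graph_move(log_ml_prop, log_ml_curr,
                      log_prior_prop=0.0, log_prior_curr=0.0, rng=None):
    # Metropolis-Hastings acceptance for a local move G_c -> G_p, eq. (3.12),
    # computed on the log scale for numerical stability.
    rng = np.random.default_rng(rng)
    log_alpha = (log_ml_prop + log_prior_prop) - (log_ml_curr + log_prior_curr)
    return bool(np.log(rng.uniform()) < min(0.0, log_alpha))
```

Under the uniform prior \pi(G) = 1/D_c the prior terms cancel, so the move is driven entirely by the marginal likelihood ratio.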
It is important to note that most MCMC approximations of the posterior are expected to be slow to converge in these algorithms, particularly for large graphical models with many nodes.

3.6 Bayesian Estimation for Binary Data

In this section, we change the focus from a continuous response to an observed binary response. If the data under consideration are binary response variables, it is common to use the multivariate probit model, where the likelihood is written in terms of latent Gaussian variables. This is an extension of the linear model introduced in Chapter 1, using a probit link instead of the identity link. As discussed in the introduction, in this class of models the covariance matrix needs to be constrained to a correlation matrix, as in Chib and Greenberg (1998), in order to ensure identifiability. Let Y_i = (Y_{i1}, \ldots, Y_{iJ}) denote the J-dimensional vector of observed binary 0/1 responses on the ith subject (1 \le i \le n), and R = (r_{ij}) a J \times J correlation matrix. In the multivariate probit model we have

pr(Y_i = y_i \mid R) = \int_{A_{i1}} \cdots \int_{A_{iJ}} N_J(u; 0, R)\, du,    (3.13)

where A_{ij} is the interval [0, \infty) if y_{ij} = 1 and (-\infty, 0) otherwise, and N_k(u; m, \Sigma) is the density of a k-variate Gaussian distribution with argument u, mean vector m and covariance matrix \Sigma. Figure 3.3 provides a graphical representation of the multivariate probit model; the observed binary data are independent conditional on the latent Gaussian variables.

Figure 3.3: A graphical representation of the multivariate probit model with J = 3. In this model Y represents the observed binary data and Z represents the latent correlated Gaussian data.

Estimation in this class of models is not simple, since a prior specification on R is not straightforward; there are no conjugate priors for correlation matrices. Several priors on R have been proposed in the literature. In an influential paper, Chib and Greenberg (1998) use a jointly uniform prior for the correlation matrix, \pi(R) \propto 1. This prior poses several problems. Computationally, it results in a posterior that is not easy to sample. Chib and Greenberg (1998) propose a Metropolis-Hastings (MH) random walk algorithm to sample the correlation matrix, drawing the correlation coefficients in blocks. The resulting proposal is not guaranteed to be a correlation matrix and, moreover, as with random walk algorithms in general, the mixing is slow in high dimensions. Furthermore, Barnard et al. (2000) show that this prior tends to push marginal correlations to zero in high dimensions, making it very informative marginally. Similarly, the Jeffreys prior used by Liu (2001) tends to push marginal correlations to \pm 1 in high dimensions. Barnard et al. (2000) propose using a marginally uniform prior for R, given by

\pi(R) \propto |R|^{\frac{J(J-1)}{2} - 1} \left( \prod_{i=1}^{J} |R_{ii}| \right)^{-(J+1)/2},    (3.14)

where R_{ii} is the ith principal submatrix of R. This prior was suggested but not used in Zhang et al. (2006). Despite its not being conjugate, Tabet (2007) shows that this prior can be used within the context of a parameter expansion and data augmentation (PXDA) algorithm, to develop an efficient MCMC algorithm for inferring the posterior distribution of R.

Given n binary observations Y = (Y_1, \ldots, Y_n) and the prior density \pi(R), we are interested in the posterior density

\pi(R \mid Y) \propto \pi(R) \prod_{i=1}^{n} pr(Y_i \mid R).

The standard approach in the latent model is to introduce explicitly the latent variables Z = (Z_1, \ldots, Z_n) and to consider instead

\pi(R, Z \mid Y) \propto \pi(R) \prod_{i=1}^{n} N_J(Z_i \mid 0, R)\, I(Z_i \in A_i).    (3.15)

To sample from \pi(R, Z \mid Y), we use an MCMC algorithm as in Tabet (2007). The algorithm alternates between simulating the latent Gaussian data and estimating the model parameters. The simulation of Z, the latent variables, is standard.
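The orthant probability in (3.13) can be evaluated directly for small J. The sketch below (function name ours) uses the symmetry of the centered Gaussian: flipping the sign of each latent coordinate with y_j = 0 turns the orthant probability into a single CDF evaluation at zero.

```python
import numpy as np
from scipy.stats import multivariate_normal

def probit_orthant_prob(y, R):
    # pr(Y = y | R), eq. (3.13): probability that a N_J(0, R) vector
    # falls in the orthant selected by the binary pattern y.
    s = 2.0 * np.asarray(y) - 1.0      # +1 where y_j = 1, -1 where y_j = 0
    cov = np.outer(s, s) * R           # covariance of the sign-flipped latents
    # P(s_j Z_j > 0 for all j) = Phi_{SRS}(0) by symmetry about the origin
    mvn = multivariate_normal(mean=np.zeros(len(s)), cov=cov)
    return mvn.cdf(np.zeros(len(s)))
```

For R = I the latent coordinates are independent, so every binary pattern has probability (1/2)^J, and for any valid R the probabilities over all 2^J patterns sum to one.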
The full conditional density of Z factorizes as

\pi(Z \mid Y, R) = \prod_{i=1}^{n} \pi(Z_i \mid Y, R),

where \pi(Z_i \mid Y, R) \propto N_J(Z_i \mid 0, R)\, I(Z_i \in A_i) is a truncated multivariate Gaussian. For J = 1, we can sample from it directly. Otherwise, we sample approximately from it by cycling through a series of univariate truncated Gaussians (Geweke, 1991). In each step, Z_{ij} is simulated from \pi(Z_{ij} \mid Z_{i,-j}, R), which is a univariate Gaussian distribution truncated to [0, \infty) if Y_{ij} = 1 and to (-\infty, 0) if Y_{ij} = 0. To sample the univariate truncated Gaussians, we use the efficient algorithm of Robert (1995).

To sample the correlation matrix R, we adopt a parameter expansion framework. We define a transformation of the latent variables, W = ZD, where D = \mathrm{diag}(d_1, \ldots, d_J) is the expansion parameter with d_i > 0. It follows that

\pi(W \mid R, D) = MN_{n,J}(0, DRD, I_n).

We then define the new posterior

\pi(R, D, W \mid Y) \propto \pi(R)\, \pi(D \mid R)\, \pi(W \mid R, D) \prod_{i=1}^{n} I(W_i \in A_i).    (3.16)

Due to the lack of identifiability mentioned in Chapter 1, the marginals in R under (3.15) and (3.16) are the same, whatever the density \pi(D \mid R). We set \pi(D \mid R) = \prod_{i=1}^{J} \pi(d_i \mid R) with

d_i^2 \sim IG\left( (J + 1)/2,\; r^{ii}/2 \right),    (3.17)

where the notation r^{ij} refers to the ijth element of R^{-1} and IG(a, b) denotes the inverse-gamma distribution with shape parameter a and scale parameter b. The reason for this choice is that it follows from Barnard et al. (2000) that \Sigma = DRD \sim IW(2, I_J) when R follows (3.14), where IW(\nu, \Omega) denotes the inverse Wishart distribution with degrees of freedom \nu and inverse scale matrix \Omega defined as in Dawid and Lauritzen (1993). Hence, to sample (R, D) jointly from \pi(R, D \mid Y, W) = \pi(R, D \mid W), we can simply perform a change of variables \Sigma = DRD, sample \Sigma according to the resulting full conditional density \pi(\Sigma \mid W), and then compute R = D^{-1}\Sigma D^{-1} where d_i = (\sigma_{ii})^{1/2} and \sigma_{ii} is the ith diagonal element of \Sigma.
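The Geweke-style cycling step can be sketched as follows. This is an illustrative implementation (function name ours), using SciPy's generic truncated-normal sampler rather than the specialized algorithm of Robert (1995).

```python
import numpy as np
from scipy.stats import truncnorm

def sample_latent_zi(y, R, n_sweeps=50, seed=0):
    # Draw Z_i | Y_i = y, R approximately by cycling through the univariate
    # truncated-Gaussian full conditionals pi(Z_ij | Z_i,-j, R) (Geweke, 1991).
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    J = len(y)
    z = np.where(y == 1, 0.5, -0.5)   # start inside the correct orthant
    for _ in range(n_sweeps):
        for j in range(J):
            other = [k for k in range(J) if k != j]
            w = np.linalg.solve(R[np.ix_(other, other)], R[other, j])
            mu = w @ z[other]                           # conditional mean
            sd = np.sqrt(R[j, j] - w @ R[other, j])     # conditional std dev
            lo, hi = (0.0, np.inf) if y[j] == 1 else (-np.inf, 0.0)
            a, b = (lo - mu) / sd, (hi - mu) / sd       # standardized bounds
            z[j] = truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)
    return z
```

Each coordinate of the returned draw respects the orthant constraint imposed by the observed binary pattern.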
It is easy to check that

\pi(\Sigma \mid W) = IW(2 + n, I_J + W^\top W).    (3.18)

The full algorithm is summarized in Algorithm 3.1.

Algorithm 3.1 PXDA Algorithm (Saturated Model)
Given a current value of (R, Z, D):
• Sample Z_{ij} \sim \pi(Z_{ij} \mid Z_{i,-j}, R) for all i, j.
• Sample D using (3.17) and compute W = ZD.
• Sample \Sigma \sim IW(2 + n, I_J + W^\top W).
• Compute R = D^{-1}\Sigma D^{-1} where d_i = (\Sigma_{ii})^{1/2}.

3.6.1 Probit Model with a Structured Correlation Matrix

In the multivariate probit model, the association between the J columns of the n \times J matrix of binary responses Y = (Y^1, \ldots, Y^J) is modeled via the correlations between the J columns of the n \times J matrix of latent Gaussian variables Z = (Z^1, \ldots, Z^J), distributed according to (3.2). As we have seen in a previous section, conditional independence assumptions on Gaussian latent variables are encoded by the inverse correlation matrix R^{-1}, such that r^{ij} = 0 implies that Z^i and Z^j are conditionally independent given all the other variables in the model. Furthermore, we saw that Gaussian graphical models provide a convenient framework for imposing a conditional independence structure of association between variables. As noted in Webb and Forster (2008), the interpretation of conditional independence on the latent variables Z does not translate to the observed binary variables: Z^1 \perp\!\!\!\perp Z^2 \mid Z^3 does not imply Y^1 \perp\!\!\!\perp Y^2 \mid Y^3, but does imply Y^1 \perp\!\!\!\perp Y^2 \mid Z^3.

In the probit regression setting, we must define a prior on R and not on \Sigma. Marginally, we would like the correlation r_{ij} to be uniform on (-1, 1) if variable i is conditionally dependent on j, and we would like r^{ij}, the ijth element of R^{-1}, to be null if they are conditionally independent:

r^{ij} = 0 if Z^i \perp\!\!\!\perp Z^j \mid Z^{-i,-j};    r_{ij} \sim U(-1, 1) otherwise.    (3.19)

Because the Gaussian graphical model is a graphical representation of
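The (D, \Sigma, R) portion of Algorithm 3.1 can be sketched as below (function name ours). As before, the mapping df = \nu + J - 1 between the Dawid–Lauritzen convention and SciPy's inverse-Wishart parametrization is our own assumption.

```python
import numpy as np
from scipy.stats import invgamma, invwishart

def pxda_correlation_step(Z, R, seed=0):
    # One (D, Sigma, R) update of Algorithm 3.1, given the current latents Z.
    rng = np.random.default_rng(seed)
    n, J = Z.shape
    r_diag = np.diag(np.linalg.inv(R))            # r^{ii}, used in eq. (3.17)
    d = np.sqrt(invgamma.rvs(a=(J + 1) / 2.0, scale=r_diag / 2.0,
                             random_state=rng))   # d_i^2 ~ IG((J+1)/2, r^{ii}/2)
    W = Z * d                                     # W = Z D (column scaling)
    # Sigma | W ~ IW(2 + n, I_J + W'W); df = nu + J - 1 maps the
    # Dawid-Lauritzen convention onto scipy's parametrization (an assumption).
    Sigma = invwishart.rvs(df=2 + n + J - 1, scale=np.eye(J) + W.T @ W,
                           random_state=rng)
    s = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(s, s)                 # R = D^{-1} Sigma D^{-1}

rng = np.random.default_rng(7)
R_true = np.array([[1.0, 0.6], [0.6, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), R_true, size=500)
R_new = pxda_correlation_step(Z, R=np.eye(2))
```

Starting from R = I, a single update already concentrates near the correlation that generated the latent data, because the posterior on \Sigma is dominated by W^\top W.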
the covariance or, equivalently, the correlation matrix, conditional independence between variables i and j implies r^{ij} = 0. Moreover, when variables i and j are conditionally dependent, they belong, by the definition of a decomposable graph, to the same prime component P. From the property of the hyper-inverse Wishart, we know that \Sigma_P is saturated and follows an inverse Wishart distribution. Therefore, all the methods developed in Section 3.5 can be applied locally at the prime component level to define a marginally uniform prior on R_P. In order to obtain the marginally uniform prior on correlations under the parametrization of Dawid and Lauritzen (1993), the degrees of freedom should be set to 2, regardless of the size of the prime component, and the location parameter \Omega_P should be set to the identity matrix for each prime component. For each prime component, we then have the marginally uniform prior

\pi(R_P) \propto |R_P|^{\frac{|P|(|P|-1)}{2} - 1} \prod_{i \in P} |R_{P,ii}|^{-(|P|+1)/2},    (3.20)

where |P| is the cardinality of the prime component P and R_{P,ii} is the ith principal submatrix of R_P.

Following the same reasoning as earlier, we consider the scaled latent Gaussian variables W = ZD, which follow a matrix-variate normal distribution MN_{n,J}(0, \Sigma, I_n). The PXDA algorithm can now be extended to accommodate a structured inverse correlation matrix; the only difference is that \Sigma is obtained by first sampling \Sigma \mid W, G \sim HIW(\nu_n, \Omega_n), where \nu_n = 2 + n and \Omega_n = I_J + W^\top W, and then obtaining R as R = D^{-1}\Sigma D^{-1}. The algorithm is summarized in Algorithm 3.2.

Algorithm 3.2 PXDA Algorithm (Structured Precision Matrix)
Given a current value of (R, Z, D):
• Sample Z_{ij} \sim \pi(Z_{ij} \mid Z_{i,-j}, R) for all i, j.
• Sample D using (3.17) and compute W = ZD.
• Sample \Sigma \mid W, G \sim HIW(\nu_n, \Omega_n), where \nu_n = 2 + n and \Omega_n = I_J + W^\top W.
• Compute R = D^{-1}\Sigma D^{-1} where d_i = (\Sigma_{ii})^{1/2}.
As in the continuous case, we can extend the model to include sampling of the graph structure given the data. Once the expansion parameter is sampled and the latent variables are transformed, Z \to W, all the procedures applicable to Gaussian data are valid in this expanded space, and therefore sampling of the graph structure proceeds exactly as in Section 3.5.2. Then, given the graph structure, a hyper-inverse Wishart distribution can be used to impose the constraint on the precision matrix so that it is consistent with the graph structure. The complete algorithm using covariance selection is summarized in Algorithm 3.3.

Algorithm 3.3 PXDA Algorithm (Covariance Selection)
Given a current value of (G, R, Z, D):
• Sample Z_{ij} \sim \pi(Z_{ij} \mid Z_{i,-j}, R) for all i, j from a multivariate truncated Gaussian.
• Sample D using (3.17) and compute W = ZD.
• Propose a decomposable graph G_p by adding or deleting an edge and set G = G_p with the probability given in (3.12).
• Sample \Sigma \mid W, G \sim HIW(\nu_n, \Omega_n), where \nu_n = 2 + n and \Omega_n = I_J + W^\top W.
• Compute R = D^{-1}\Sigma D^{-1} where d_i = (\Sigma_{ii})^{1/2}.

3.7 Simulation Results

The aim of the first simulation is to demonstrate the ability of the algorithm to recover the true graph structure underlying observed binary variables given a design matrix of covariates. We generate n = 500 independent data points from the model, where the correlation matrix R used to generate the Gaussian observations was set to

R =
[ 1.000   0.000   0.000  -0.491   0.000 ]
[ 0.000   1.000  -0.296   0.000   0.116 ]
[ 0.000  -0.296   1.000   0.000  -0.392 ]
[-0.491   0.000   0.000   1.000   0.000 ]
[ 0.000   0.116  -0.392   0.000   1.000 ]    (3.21)

corresponding to the inverse correlation matrix

R^{-1} =
[ 1.318   0.000   0.000   0.647   0.000 ]
[ 0.000   1.096   0.325   0.000   0.000 ]
[ 0.000   0.325   1.278   0.000   0.464 ]
[ 0.647   0.000   0.000   1.318   0.000 ]
[ 0.000   0.000   0.464   0.000   1.182 ]    (3.22)

We ran the MCMC algorithm with model selection as described in Section 3.5.2 on the simulated data, and N = 50,000 Monte Carlo samples were collected.
Of the 822 possible decomposable structures for a graph with five nodes, the algorithm explored 271 graphs. From Figure 3.4, we see that the true graph structure underlying the data was the most probable structure sampled by the algorithm, accounting for approximately 15% of the posterior mass.

Figure 3.4: Most probable graphical structures and estimated posterior probabilities for simulated data.

In Figure 3.5, we show an image of the matrix of posterior edge marginal probabilities. These probabilities were approximated by the empirical frequencies with which an edge between any two nodes was included in the sampled graphs. When compared to the image of the true inverse correlation matrix underlying the data, all the non-zero entries in R^{-1} correspond to edges that have considerably higher posterior marginal probabilities. In order to assess convergence, multiple runs were undertaken with different over-dispersed starting values, and very similar results were obtained in all cases.

In order to explore the benefits of performing model selection in comparison to fitting a saturated model, we ran a simulation comparing the two alternatives. In this simulation, 50 different data sets were generated using the same graph structure as above, but with a different true underlying correlation matrix each time (consistent with the graph structure). For simplicity, no covariates were added in this case. For each data set, we fit both the saturated model and the model selection procedure. To evaluate the posterior estimators provided in each case, the entropy loss was computed for each Monte Carlo sample.

Figure 3.5: Matrix of posterior edge marginal probabilities, simulated data.
Here the size of the square indicates the magnitude of the posterior edge marginal probability; the larger the square, the higher the probability that an edge exists between the corresponding nodes.

The entropy loss was computed for each Monte Carlo sample using

L(\hat{R}, R) = \mathrm{trace}(\hat{R}R^{-1}) - \log |\hat{R}R^{-1}| - J,

where \hat{R} is the Monte Carlo sample estimate of the correlation matrix. Figure 3.6 shows the box plots of the average entropy loss recorded for each data set. We can see that using the model selection procedure resulted in a significantly smaller entropy loss (paired t-test p-value < 0.01).

Figure 3.6: Boxplot of the entropy loss over 50 datasets obtained using a fixed saturated correlation matrix (left) versus model selection (right).

3.8 Conclusion

In this Chapter, we looked at a simpler multivariate model where the time element is irrelevant, either because of the nature of the data (not a time series) or because the association is assumed to be static over time. Estimating a full precision matrix requires the estimation of J(J-1)/2 parameters; the number of parameters increases quadratically as a function of the number of variables under consideration. Therefore, when dealing with high dimensional problems, imposing a structure of association is helpful both conceptually and computationally. From a computational viewpoint, this reduces the number of free parameters to be estimated compared with a full model. Hence, for a properly specified structure, it will result in more accurate and efficient estimates of parameters. Furthermore, researchers are often able to identify or reject certain parsimonious structures based on their conceptual understanding of the data and their subject matter expertise. We appealed to the theory of Gaussian graphical models to infer the underlying graphical structure and estimate a covariance matrix consistent with that structure under a unified framework.
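The entropy loss is straightforward to compute; a minimal sketch (function name ours):

```python
import numpy as np

def entropy_loss(R_hat, R):
    # L(R_hat, R) = trace(R_hat R^{-1}) - log|R_hat R^{-1}| - J,
    # which is zero exactly when R_hat equals R, and positive otherwise.
    J = R.shape[0]
    M = R_hat @ np.linalg.inv(R)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - J
```

Because x - log x - 1 \ge 0 for x > 0, the loss is non-negative over the eigenvalues of \hat{R}R^{-1}, which makes it a natural discrepancy measure between a posterior estimate and the true correlation matrix.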
We extended existing methodology to allow for a binary response by using a probit link. Throughout this Chapter, we assumed that the model is mean-centered at 0. In the next Chapter, we will consider adding covariates.

Chapter 4 Estimating the Effects of Covariates

It is often the case that the variable of interest V depends on certain variables referred to as predictors or covariates. Under this setting, the aim of the analysis is to estimate the effect of the covariates on the multivariate response while accounting for the correlation between the elements of the response. This corresponds to a multivariate linear regression model, which assumes a linear relationship between the covariates and the multivariate response. In this Chapter, we extend the methodology introduced in Chapter 2 and Chapter 3 to include the estimation of the effects of covariates while simultaneously estimating the association between elements of the multivariate response.

In the simple multivariate case discussed in Chapter 3, the response under consideration is either an n \times T matrix, if it is a univariate time series with static covariance, or an n \times J matrix, if it is a multivariate cross-sectional response. For clarity, this will be referred to as the multivariate model. The more complex case arises when the response is multivariate, multi-observation and longitudinal with a dynamic covariance. This is the model discussed in Chapter 2. Under that setting, the observed response is an n \times T \times J cube, depicted in Figure 1.1. This will be referred to as the multivariate longitudinal model.

We begin by presenting an extension of the multivariate model of Chapter 3. We discuss standard posterior estimation of parameters in multivariate Bayesian regression. We will then show how the same concept can be extended in the context of the parameter expansion framework for a binary response, introduced in Chapter 3.
This has been done previously in Tabet (2007); however, the prior on \beta, the covariate effect, that had been previously suggested results in a dependence of the posterior on D, the expansion parameter. In this Chapter, we correct this by taking a conjugate prior for \beta, conditional on R, the correlation between the elements of the multivariate response, and centered at 0. Subsequently, we extend the methodology introduced in Chapter 2 to add the inference of covariate effects in the context of the multivariate longitudinal model. This is a more complex model that involves the estimation of dynamic as well as static parameters; we appeal to a variant of the particle MCMC algorithm to address this. Throughout this Chapter, we assume that the effect of covariates on the response is static, i.e. \beta is not time-varying, even though the covariates may be measured at different time steps.

4.1 Multivariate Model, Static Covariance

We begin by considering models with a covariance matrix that does not change in time; these are models that are either multivariate and cross-sectional or univariate and longitudinal, corresponding to the representation discussed in Chapter 3. For simplicity, we illustrate the modeling approach for a cross-sectional response of dimension n \times J.

4.1.1 Continuous Response

For a continuous response, the linear model can be written as

Z = X\beta + \epsilon,    \epsilon_i \stackrel{iid}{\sim} F(0, \Sigma),    (4.1)

where Z is an n \times J matrix of observed continuous responses, X is an n \times p design matrix of observed covariates, and \beta is a p \times J matrix of regression coefficients. The error term \epsilon is a matrix with rows independent and identically distributed according to a distribution F with mean 0 and J \times J covariance matrix \Sigma = (\sigma_{ij}).

In this model, the unknown parameters of interest are (\beta, \Sigma). We assume that the distribution of the error terms F is Gaussian.
The likelihood of the data is then multivariate Gaussian, with pdf given by

f(Z \mid X, \beta, \Sigma) = \frac{1}{(2\pi)^{nJ/2}} |\Sigma|^{-n/2} \,\mathrm{etr}\left( -\tfrac{1}{2} \Sigma^{-1} (Z - X\beta)^\top (Z - X\beta) \right).    (4.2)

A Bayesian approach models the posterior distribution of the parameters given the data,

\pi(\beta, \Sigma \mid Z) \propto f(Z \mid \beta, \Sigma)\, \pi(\beta, \Sigma).

We follow a default modeling approach by choosing a conjugate prior for \beta that depends on \Sigma; this choice of prior results in mathematical convenience when dealing with the binary response case. The joint prior becomes \pi(\beta, \Sigma) = \pi(\beta \mid \Sigma)\, \pi(\Sigma). For \beta, we follow a standard Bayesian conjugate approach and select the matrix-variate Gaussian distribution

\pi(\beta \mid \Sigma) \sim MN_{pJ}(0_{p \times J}, \Sigma, \Psi),    (4.3)

where \Psi is a p \times p matrix of hyperparameters representing the prior belief about the dispersion of the rows of \beta, and we assume that the dispersion across the columns of \beta follows \Sigma, as for the data. This means that, a priori, we believe that the effect of a given covariate on the different responses varies according to \Sigma, which is the same way the responses co-vary. Furthermore, it is not uncommon to assume that \Psi is a diagonal matrix with very large diagonal elements, so that the prior is uninformative; this amounts to assuming that the effects of the covariates on a given response are independent a priori.

For a prior on \Sigma, we also choose a standard Bayesian conjugate prior, the inverse-Wishart prior for covariance matrices,

\pi(\Sigma) \sim IW_J(d, S),    (4.4)

where S is a J \times J scale matrix and d is an integer representing the degrees of freedom. Given n observations Z and the prior density \pi(\beta, \Sigma), we are interested in the posterior density \pi(\beta, \Sigma \mid Z). The conjugate priors above result in a posterior distribution that is available analytically: the inverse-Wishart and the matrix normal for \Sigma and \beta respectively,

\pi(\beta, \Sigma \mid Z) = \pi(\Sigma \mid Z)\, \pi(\beta \mid \Sigma, Z).
It is easy to check that (see Appendix B.2 for algebraic details)

\pi(\Sigma \mid Z) \sim IW_J(d + n,\; Z^\top Z + S - B^\top \Xi^{-1} B),    (4.5)

and

\pi(\beta \mid Z, \Sigma) \sim MN_{pJ}(B, \Sigma, \Xi),    (4.6)

where \Xi^{-1} = X^\top X + \Psi^{-1} and B = \Xi X^\top Z. This is multivariate Bayesian linear regression, where samples from the posterior are collected by alternating between sampling \beta \mid \Sigma, Z and \Sigma \mid Z. When sampling \Sigma \mid Z, we are able to integrate out \beta, so that it does not affect sampling; however, since \beta is conditional on \Sigma, posterior samples of \Sigma are needed to obtain new samples of \beta. Algorithm 4.1 summarizes the steps for sampling from the joint posterior of (\beta, \Sigma).

Algorithm 4.1 Multivariate Bayesian Linear Regression
Given a current value of (\beta, \Sigma), the parameters are updated as follows:
• Sample \Sigma \mid Z \sim IW_J(d + n,\; Z^\top Z + S - B^\top \Xi^{-1} B).
• Sample \beta \mid \Sigma, Z \sim MN_{pJ}(B, \Sigma, \Xi),
where \Xi^{-1} = X^\top X + \Psi^{-1} and B = \Xi X^\top Z.

4.1.2 Binary Response

When the observed data are not continuous, we observe instead n binary observations y = (y_1, \ldots, y_n). As we have seen in Chapter 3, the covariance matrix \Sigma must be constrained, for identifiability reasons, to be a correlation matrix R. The likelihood of the data is written using a Gaussian latent representation,

pr(Y_i = y_i \mid X, \beta, R) = \int_{L_{i1}} \cdots \int_{L_{iJ}} N_J((X\beta)_i, R)\, dZ_i,

where (X\beta)_i is the ith row of X\beta and L_{ij} is the interval [0, \infty) if y_{ij} = 1 and (-\infty, 0) otherwise. In this case, we are interested in the posterior density

\pi(\beta, R, Z \mid y) \propto \pi(\beta, R) \prod_{i=1}^{n} N_J(Z_i \mid X_i\beta, R)\, I(Z_i \in L_i).    (4.7)

In order to sample from (4.7), we implement a Gibbs sampling approach similar to the one in the previous Chapter, where we sample the latent Gaussian variables from

\pi(Z \mid y, \beta, R) = \prod_{i=1}^{n} \pi(Z_i \mid y, \beta, R),

where \pi(Z_i \mid y, \beta, R) \propto N_J((X\beta)_i, R)\, I(Z_i \in L_i) is a multivariate truncated Gaussian, easily sampled as in the previous Chapter using the algorithms of Geweke (1991) and Robert (1995).
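Algorithm 4.1 admits an exact (non-iterative) joint draw, since \Sigma \mid Z and \beta \mid \Sigma, Z are both in closed form. The sketch below uses our own function names and, as in Chapter 3's examples, assumes the mapping df = \nu + J - 1 between the Dawid–Lauritzen convention and SciPy's inverse-Wishart parametrization.

```python
import numpy as np
from scipy.stats import invwishart, matrix_normal

def draw_beta_sigma(Z, X, d=3, S=None, psi_scale=100.0, seed=0):
    # One exact joint draw: Sigma | Z from (4.5), then beta | Sigma, Z
    # from (4.6), as in Algorithm 4.1, with Psi = psi_scale * I (diffuse).
    rng = np.random.default_rng(seed)
    n, J = Z.shape
    p = X.shape[1]
    S = np.eye(J) if S is None else S
    Xi_inv = X.T @ X + np.eye(p) / psi_scale     # Xi^{-1} = X'X + Psi^{-1}
    Xi = np.linalg.inv(Xi_inv)
    B = Xi @ X.T @ Z                             # posterior mean of beta
    scale = Z.T @ Z + S - B.T @ Xi_inv @ B
    # df = nu + J - 1 maps the Dawid-Lauritzen convention onto scipy (assumed).
    Sigma = invwishart.rvs(df=d + n + J - 1, scale=scale, random_state=rng)
    beta = matrix_normal.rvs(mean=B, rowcov=Xi, colcov=Sigma, random_state=rng)
    return beta, Sigma

rng = np.random.default_rng(3)
n, p, J = 400, 2, 2
X = rng.normal(size=(n, p))
beta_true = np.array([[1.0, -0.5], [0.3, 0.8]])
Z = X @ beta_true + rng.multivariate_normal(
    np.zeros(J), np.array([[1.0, 0.4], [0.4, 1.0]]), size=n)
beta_draw, Sigma_draw = draw_beta_sigma(Z, X)
```

With n = 400 observations, a single posterior draw of \beta lands close to the generating coefficients, reflecting the concentration of the matrix-normal full conditional around B.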
Furthermore, we also adopt a parameter expansion approach to obtain samples of (\beta, R), by defining a transformation of the latent variables W = ZD, where D = \mathrm{diag}(d_1, \ldots, d_J) is the expansion parameter with d_i > 0. It follows that

\pi(W \mid \beta, R, D) = MN_{n,J}(X\beta D, DRD, I_n).

We then define the new posterior

\pi(\beta, R, D, W \mid y) \propto \pi(R)\, \pi(\beta \mid R)\, \pi(D \mid R)\, \pi(W \mid \beta, R, D) \prod_{i=1}^{n} I(W_i \in L_i).    (4.8)

As before, we set \pi(D \mid R) = \prod_{i=1}^{J} \pi(d_i \mid R) with

d_i^2 \sim IG\left( (J + 1)/2,\; r^{ii}/2 \right).    (4.9)

To sample (\beta, R, D) jointly from \pi(\beta, R, D \mid y, W) = \pi(\beta, R, D \mid W), we first sample D from its prior distribution and perform the change of variables \Sigma = DRD and \gamma = \beta D. The independent prior on the regression coefficients \beta previously suggested in Tabet (2007) results in a dependence of the posterior on D, the expansion parameter; this does not arise in the present case, because we take a conjugate prior for \beta that is centered at 0, with a covariance matrix that depends on the correlation matrix R. We therefore sample (\Sigma, \gamma) according to the resulting full conditional density \pi(\gamma, \Sigma \mid W) = \pi(\Sigma \mid W)\, \pi(\gamma \mid W, \Sigma), and then compute R = D^{-1}\Sigma D^{-1} and \beta = \gamma D^{-1}, where d_i = (\sigma_{ii})^{1/2} and \sigma_{ii} is the ith diagonal element of \Sigma. These full conditionals are the same as the posterior distributions used in the previous section for Gaussian data:

\pi(\Sigma \mid W) = IW(2 + n,\; W^\top W + I_J - B^\top \Xi^{-1} B),    (4.10)

and

\pi(\gamma \mid W, \Sigma) = MN_{pJ}(B, \Sigma, \Xi),    (4.11)

where \Xi^{-1} = X^\top X + \Psi^{-1} and B = \Xi X^\top W. A general PXDA algorithm which includes covariates is given in Algorithm 4.2.

Algorithm 4.2 PXDA with Covariates
Given a current value of (\beta, R, Z), we update these parameters as follows:
• Sample Z_{ij} \sim \pi(Z_{ij} \mid Z_{i,-j}, \beta, R) for all i, j, using the algorithms of Geweke (1991) and Robert (1995).
• Sample D using (4.9) and compute W = ZD.
• Sample (\Sigma, \gamma) using (4.10)-(4.11).
• Compute R = D^{-1}\Sigma D^{-1} and \beta = \gamma D^{-1}, where d_i = (\Sigma_{ii})^{1/2}.
4.2 Dynamic Covariance

Let Z_1, \ldots, Z_T represent multivariate continuous longitudinal observations assumed to arise from a Gaussian distribution. The covariance matrix is assumed to be dynamic, and the covariates are measured at each time step as well; however, the effect of those covariates on the response is usually not changing in time. The linear model is written as

Z_{it} = X_{it}\beta + \epsilon_{it},    (4.12)

where X_{it} is a 1 \times p vector of covariates, assumed to be measured without error, for subject i at time point t, and \beta is a p \times J matrix of regression coefficients, assumed to be static in time. Finally, \epsilon_{it} \sim N(0, \Sigma_t), where \Sigma_t is a J \times J covariance matrix. In this model, the parameters of interest are (\beta, \Sigma^{-1}_{1:T}), a combination of static and dynamic parameters.

4.2.1 Prior Specification for (\beta, \Sigma^{-1}_{1:T})

For Bayesian inference we must consider a joint prior for (\beta, \Sigma^{-1}_{1:T}). Unlike in the previous section, we are not able to specify a prior for \beta that depends on \Sigma, since we are assuming that \beta is fixed while \Sigma varies in time; we therefore use an independent prior for \beta and \Sigma^{-1}_{1:T},

\pi(\beta, \Sigma^{-1}_{1:T}) = \pi(\beta)\, \pi(\Sigma^{-1}_{1:T}).    (4.13)

We already saw in Chapter 2 that a suitable prior for \Sigma^{-1}_{1:T} is the Wishart auto-regressive process, such that

\pi(\Sigma^{-1}_{1:T}) = \pi(\Sigma^{-1}_1) \prod_{t=2}^{T} \pi(\Sigma^{-1}_t \mid \Sigma^{-1}_{t-1}),    (4.14)

where \pi(\Sigma^{-1}_1) is a Wishart distribution W_J(d, S) and \pi(\Sigma^{-1}_t \mid \Sigma^{-1}_{t-1}) is the non-central Wishart distribution W(d, S^*, \Theta) with degrees of freedom d, scale parameter S^* = S - MSM^\top and non-centrality parameter \Theta = (S - MSM^\top)^{-1} M \Sigma^{-1}_{t-1} M^\top. It remains to specify a prior on the unknown matrix of regression coefficients \beta.
We propose a matrix-variate Gaussian prior for β,

β ∼ MN_{p,J}(0, I_J, Ψ_P),    (4.15)

with known diagonal covariance matrix Ψ_P encoding our prior belief about the degree of association between the p rows of β. We assume that the J columns of β are independent a priori, and so we set the association between columns to the identity matrix. The Gaussian distribution is conjugate to the likelihood and results in a posterior that can be sampled from easily. Furthermore, this prior on β is only weakly informative when the diagonal elements of Ψ_P are set to large values.

4.2.2 Posterior and Sampling Scheme

The resulting posterior distribution of interest is π(β, Σ_{1:T}^{-1} | Z_{1:T}). As we have seen in Chapter 2, sampling from this posterior is not available in closed form. Intuitively, one can think of a Gibbs sampling scheme that iteratively samples from π(β | Σ_{1:T}^{-1}, Z_{1:T}) and π(Σ_{1:T}^{-1} | Z_{1:T}, β). In this case, one could use a particle MCMC algorithm targeting π(Σ_{1:T}^{-1} | Z_{1:T}, β) and then, given the sampled values of Σ_{1:T}^{-1}, update π(β | Σ_{1:T}^{-1}, Z_{1:T}). Unfortunately, Andrieu et al. (2010) show that under this scenario a direct application of Algorithm (2.4) would not leave π(β, Σ_{1:T}^{-1} | Z_{1:T}) invariant. In order to address this, they suggest the use of a special type of particle MCMC update called the conditional SMC update. The conditional SMC update is similar to the regular SMC update; however, it requires keeping track of the "ancestral lineage". We define B_t^{(m)} to be the index of the ancestor particle of Σ_{1:T}^{-1(m)} at generation time t. A graphical illustration is provided in Figure 4.1. A conditional SMC update ensures that a prespecified particle path Σ_{1:T}^{-1} with ancestral lineage B_{1:T} is "frozen" and survives all the resampling steps, while the remaining K − 1 "free" particles are generated as usual.
Therefore, for each sampled value of the static parameter β, the current value of Σ_{1:T}^{-1} is guaranteed to survive all the re-sampling steps, and K − 1 new samples from p(Σ_{1:T}^{-1} | Z_{1:T}) are obtained at each step. A detailed description of the conditional SMC algorithm is provided in Algorithm 4.3. The resulting particle Gibbs algorithm therefore iterates between sampling Σ_{1:T}^{-1} using Algorithm 4.3, conditional on a current draw of β and the observed data, and then, given the new value of Σ_{1:T}^{-1} and the data, drawing a new value of β ∼ π(β | Σ_{1:T}^{-1}, Z_{1:T}). The posterior distribution of β can be obtained by considering a slice of the linear model for a given sequence of time-varying covariance matrices; for observation i, we have

Z_i = X_i β + ε_i,

where Z_i and ε_i are T × J and X_i is T × p. Applying the vec operator, which vectorizes a matrix by stacking its columns on top of one another, we have

z_i = vec(Z_i′) = (X_i ⊗ I_J) vec(β′) + vec(ε_i′),

where z_i is TJ × 1, (X_i ⊗ I_J) is TJ × pJ, vec(β′) is pJ × 1 and vec(ε_i′) is TJ × 1; here ⊗ denotes the Kronecker product and we used the identity vec((X_i β I_J)′) = vec(I_J β′ X_i′) = (X_i ⊗ I_J) vec(β′). If we let z_i = vec(Z_i′), β* = vec(β′), and redefine X_i = X_i ⊗ I_J, then

z_i ∼ iid N_{TJ}(X_i β*, Σ*),

where Σ* is a TJ × TJ block-diagonal covariance matrix constructed from Σ_1, …, Σ_T, and X_i is obtained by stacking the design matrices from all time points for each individual.

Figure 4.1: Example of ancestral lineages generated by an SMC algorithm for K = 5 and T = 3. The shaded path is Σ_{1:3}^{-1(2)} = (Σ_1^{-1(3)}, Σ_2^{-1(4)}, Σ_3^{-1(2)}) and the ancestral lineage is B_{1:3}^{(2)} = (3, 4, 2) (example adapted from Andrieu et al. (2010)).
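The vectorization identity above can be checked numerically. In NumPy, vec(A) (column stacking) corresponds to `A.T.ravel()`, since NumPy flattens arrays row-major:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p, J = 4, 2, 3
X = rng.standard_normal((T, p))
beta = rng.standard_normal((p, J))

vec = lambda A: A.T.ravel()   # stack the columns of A into one vector

lhs = vec((X @ beta).T)                       # vec(Z') with Z = X beta
rhs = np.kron(X, np.eye(J)) @ vec(beta.T)     # (X kron I_J) vec(beta')
assert np.allclose(lhs, rhs)
```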
For i = 1, …, n, the p.d.f. is given by

(2π)^{−TJn/2} |Σ*|^{−n/2} exp( −(1/2) tr[ Σ*^{-1} ∑_{i=1}^{n} (z_i − X_i β*)(z_i − X_i β*)′ ] ).

The prior on β* can be obtained by taking the vectorization of the matrix Gaussian in (4.15), which becomes

(2π)^{−pJ/2} |Ψ_P ⊗ I_J|^{−1/2} exp( −(1/2) β*′ (Ψ_P ⊗ I_J)^{-1} β* ).    (4.16)

Algorithm 4.3 Conditional SMC
Define B_t^{(m)} to be the index of the ancestor particle of Σ_{1:T}^{-1(m)} at generation time t.
Step 1: Let Σ_{1:T}^{-1} = (Σ_1^{-1(B_1)}, Σ_2^{-1(B_2)}, …, Σ_{T-1}^{-1(B_{T-1})}, Σ_T^{-1(B_T)}) be the path associated with the ancestral lineage.
Step 2: At time point t = 1:
• For m ≠ B_1, sample Σ_1^{-1(m)} i.i.d. ∼ W(d + n, (S^{-1} + SS_1)^{-1}), where SS_1 = (Z_1 − X_1β)′(Z_1 − X_1β).
Step 3: At time points t = 2, …, T:
• For m ≠ B_t, sample Σ_t^{-1(m)} i.i.d. ∼ W(d, S − MSM′, (S − MSM′)^{-1} M Σ_{t-1}^{-1(m)} M′), a non-central Wishart.
• Compute

w_t(Σ_t^{-1(m)}) = f(Z_t | Σ_t^{-1(m)});    W_t^{(m)} = w_t(Σ_t^{-1(m)}) / ∑_{l=1}^{K} w_t(Σ_t^{-1(l)}),

where f is the multivariate Gaussian p.d.f. evaluated at Z_t and Σ_t^{-1(m)} for each particle m.
• Re-sample the particles Σ_{1:t}^{-1(m)} with replacement according to the weights W_t^{(m)} using the conditional multinomial resampling algorithm.

Combining the likelihood above with the prior on β* results in the following posterior distribution for β*:

π(β* | Z_{1:T}, Σ_{1:T}^{-1}) = N_{pJ}(β_n, Ψ_n),    (4.17)

with

β_n = Ψ_n ∑_{i=1}^{n} X_i′ Σ*^{-1} z_i

and

Ψ_n^{-1} = (Ψ_P ⊗ I_J)^{-1} + ∑_{i=1}^{n} X_i′ Σ*^{-1} X_i.

The particle Gibbs (PG) algorithm, summarized in Algorithm 4.4 below, leaves the distribution p(β*, Σ_{1:T}^{-1} | Z_{1:T}) invariant.

Algorithm 4.4 Particle Gibbs (PG)
Step 1: Initialization, s = 0:
• Set β*(0) and B_{1:T}(0) arbitrarily.
Step 2: For s = 1, …, N:
• Sample β*(s) ∼ π(β* | Σ_{1:T}^{-1}(s − 1), Z_{1:T}) as in equation (4.17).
• Run a conditional SMC algorithm targeting p(Σ_{1:T}^{-1} | β*(s), Z_{1:T}) conditional on Σ_{1:T}^{-1}(s − 1).
• Sample Σ_{1:T}^{-1}(s) ∼ p̂(Σ_{1:T}^{-1} | β*(s), Z_{1:T}). In this case B_{1:T}(s) are implicitly sampled as well.
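The resampling step of Algorithm 4.3 differs from plain multinomial resampling only in that the frozen path must survive. A minimal sketch of this step (a simplification of conditional multinomial resampling: the frozen particle's ancestor index is pinned while the remaining indices are drawn from the normalized weights; the function name is made up):

```python
import numpy as np

def conditional_multinomial_resample(weights, frozen_idx, rng):
    """Draw ancestor indices for K particles so that the particle at
    frozen_idx keeps its own index (the prespecified path survives every
    resampling step), while the other K - 1 ancestors are drawn i.i.d.
    from the normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    K = len(w)
    ancestors = rng.choice(K, size=K, p=w)   # free multinomial draws
    ancestors[frozen_idx] = frozen_idx       # pin the frozen path
    return ancestors

rng = np.random.default_rng(4)
anc = conditional_multinomial_resample([0.1, 0.2, 0.3, 0.4], frozen_idx=2, rng=rng)
```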
4.3 Simulation Results

To test the PG algorithm, continuous data Z were generated according to

Z_{it} ∼ N(X_{it}β, Σ_t).

In this example, with fixed n = 1, J = 3, p = 2 and T = 100, covariates were generated by drawing uniformly on [−0.5, 0.5]. The matrix of regression coefficients underlying the model was fixed to

β = [  2   3   1
      −1   5   2 ].    (4.18)

Furthermore, to simulate Σ_{1:T}, we used the WAR process with d_0 = 4, S_0 = I_J, and we fixed M = 0.98 I_J. The number of particles used at each Gibbs step was set to K = 5000, and the total number of Gibbs iterations performed was 2000. With those settings, the algorithm took 3 days to run. Figure 4.2 shows the marginal distribution of β, as well as trace plots and autocorrelation (ACF) plots. We can see that the resulting posterior distribution for each element of β appears to be centered at the true value in most cases, and the true value is contained in the 95% credible interval for all of the elements of the matrix β. Furthermore, the ACF plots indicate that the autocorrelation between samples quickly drops to zero, which is a sign of good mixing of the algorithm. In Figure 4.3, we take a look at the posterior distribution of R_{1:T} = D_{1:T}^{-1/2} Σ_{1:T} D_{1:T}^{-1/2}, which we are able to compute from the Σ_{1:T}^{-1} samples obtained from the particle Gibbs algorithm. We plot the median of the marginal elements R_{ij} over time, as well as a 95% credible interval. The first thing we note is that the algorithm is degenerating: credible intervals only appear after time point t = 75, because up to that point the distribution is approximated by a unique particle, due to the re-sampling. Furthermore, the intervals do not always contain the true value of the correlation; however, this may be due to the algorithm degenerating. Having said that, the algorithm is picking up on the general trend and, given the complexity of the problem, the resulting point estimates may be reasonable.
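The correlation sequence plotted in Figure 4.3 is obtained from the precision samples by inverting and rescaling; a minimal sketch:

```python
import numpy as np

def precision_to_correlation(precision):
    """Map a precision matrix Sigma^{-1} to the correlation matrix
    R = D^{-1/2} Sigma D^{-1/2}, where D = diag(Sigma)."""
    sigma = np.linalg.inv(precision)
    d = np.sqrt(np.diag(sigma))
    return sigma / np.outer(d, d)
```

Applied to every posterior draw of Σ_t^{-1} at every t, this yields the posterior samples of R_t summarized in the figure.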
Finally, in order to evaluate the estimated sequences of matrices jointly as a whole, rather than marginally, we plotted the average entropy loss computed at each time step against time in Figure 4.4. We can see that the entropy loss seems to vary randomly over time. The lack of trend in the entropy loss is a good sign that the loss is not accumulating, and that the variability is simply due to Monte Carlo error.

Figure 4.2: Posterior estimates of β using Particle Gibbs. For each element of β, a histogram of the posterior samples (with the true value marked), a trace plot, and an ACF plot are shown.

4.4 Summary and Discussion

In this chapter, we have extended the models discussed in Chapters 2 and 3 to include covariate information and to estimate the dependence between the elements of the response simultaneously. When the covariance matrix is static in time, we showed how to extend Bayesian estimation to binary data using the parameter expansion for data augmentation framework. This allows the proper simultaneous estimation of the covariate effects, the covariance, and its structure. For a dynamic covariance matrix, we implemented the particle Gibbs algorithm to alternate between sampling the dynamic precision matrices and the fixed covariate effects. In order to ensure the validity of the particle Gibbs sampler, we construct the proposal using conditional SMC, where we condition on maintaining a prespecified path through the sampler. While the PG sampler is ergodic, it currently has poor mixing.
This was shown to be particularly true when the number of particles is small relative to T (Whiteley, 2010). This is because path degeneracy in the SMC causes the variables that are left out at any two consecutive iterations to be strongly dependent. Future work should consider improving the mixing by adding a backward simulation step to the algorithm, as suggested by Whiteley et al. (2010), Olsson and Ryden (2011) and Lindsten and Schon (2011). This strategy has been shown to drastically improve the degeneracy problem. Another alternative to consider is ancestor sampling, as suggested by Lindsten et al. (2012).

Figure 4.3: Plot of the true correlation and the median estimated correlation, along with a 95% credible interval, obtained using the PG algorithm with conditional SMC. Panels: (a) marginal distribution of R12; (b) marginal distribution of R13; (c) marginal distribution of R23.

Figure 4.4: Entropy loss computed at each time step and plotted against time.

Chapter 5

Applications

In this chapter, we showcase how the methodology developed in this thesis can be used to address real problems and applied to real datasets. We revisit the examples we discussed in Chapter 1.

5.1 Heart Rate Variability as an Indicator of Nociception

The brain perceives pain through the nociception system, which prompts increased activity in the autonomic nervous system (ANS). In turn, the ANS activates the sympathetic nervous system (SNS), creating a stress response in the body that includes increased respiration, blood pressure and heart rate.
In sick patients, a strong stress response can cause serious injury. A patient exists in a homeostatic balance between a state of sympathetic (stress) tone and parasympathetic (relaxation) tone, each a frequency of neural firing. When in a state of extreme stress, the sympathetic tone is high and the parasympathetic tone is low; when in a state of extreme relaxation, the opposite is true. Heart rate variability (HRV) is a measurement of the interaction between sympathetic and parasympathetic activity in autonomic functioning. In general, changes in HRV are highly correlated with morbidity and mortality. The first example we consider, introduced in Section 1.2.1, is data coming from an experiment conducted by the Pediatric Anesthesia Research Team (PART) at the British Columbia Children's Hospital. The heart rates of six-year-old children undergoing a dental procedure under general anesthesia were monitored (more details about the experiment can be found in Brouse (2013)). In this thesis we specifically focus on scenarios where a child undergoing a dental procedure experiences a painful stimulus, meaning the child is in a nociceptive state. The anesthesiologist responds by giving the child an anesthetic bolus, which consists of a single shot or multiple shots applied in succession. In order to analyze the available data, each bolus event is considered separately. The first and last 120 seconds are treated as padding and are ignored. The baseline (nociceptive) period runs from 121 seconds to 180 seconds from the beginning, and is 60 seconds long. The response (anti-nociceptive) period runs from 180 seconds from the end to 121 seconds from the end, and is also 60 seconds long. The first anesthetic bolus occurs at 180 seconds from the beginning, but there may be more than one: any number of anesthetic boluses may be given between 180 and 240 seconds from the end. This is illustrated in Figure 5.1(a).
The top line is the observed heart rate. We can see that in the nociceptive period, prior to receiving the anesthetic bolus, the heart rate behaves differently than after the bolus, when the patient no longer feels pain. The bottom line is the same data but with the mean filtered out. This is done by working with the residuals obtained from fitting a first-order autoregressive model, in order to isolate the source of variability over time to the variance parameter. In mathematical notation, the observed heart rate at time point t is given by V_t. The data is assumed to follow an autoregressive process that relates the heart rate to its value at the previous time step through an autoregressive parameter A, such that

V_t = A V_{t-1} + ε_t,    ε_t ∼ N(0, σ_t²).

Since we are only interested in inferring σ_{1:T}², we pre-process the data to remove the mean and work with the residuals

Z_t = V_t − Â V_{t-1}.

If we adopt a baseline-stimulus setup, we have, for each bolus event, two datasets that we consider independently: one corresponding to the nociceptive period and one to the anti-nociceptive period. The data in each period has dimensions J = 1, n = 1, and T = 60. The sequence of variances can then be inferred using the autoregressive process we introduced in Chapter 2. Clearly, from Figure 5.1(a), we expect the pattern of variability to change from the nociceptive period to the anti-nociceptive period. In each case, we run the PMMH algorithm to obtain an estimate of the variance as well as the underlying smoothness parameter ρ. Because we are working in the univariate case, we have the option to use either the non-central χ² or the central χ² approximation as a proposal in the algorithm. Both algorithms produced similar results, and therefore, for consistency, we only present the results obtained using the non-central χ² proposal.
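The pre-processing step can be sketched as follows; a least-squares estimate of A is assumed here, since the text does not specify how Â is obtained:

```python
import numpy as np

def ar1_residuals(v):
    """Fit V_t = A V_{t-1} + eps_t by least squares (no intercept) and
    return the residuals Z_t = V_t - A_hat V_{t-1}, which retain the
    time-varying variance but have the mean structure removed."""
    v = np.asarray(v, dtype=float)
    a_hat = np.dot(v[1:], v[:-1]) / np.dot(v[:-1], v[:-1])
    return v[1:] - a_hat * v[:-1]
```

Applied to a noiseless AR(1) series, the residuals are identically zero.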
For each bolus event, we run the PMMH algorithm with K = 1000 particles and 5000 MCMC iterations. In order to obtain an acceptance probability in the 20–30% range, the variance of the random walk proposal for ρ, the autoregressive hyperparameter, had to be adjusted for each bolus event and for each period. For the first bolus event it was set to c = 0.4 in the nociceptive period and to c = 0.05 in the anti-nociceptive period. In Figures 5.1(c) and 5.1(d), we show the trace plot and histogram of the posterior distribution of ρ in each of the two periods. From this figure and other examples, we can see that a larger step size results in a wider exploration of the space.

Figure 5.1: Results for bolus event 1. (a) The raw and centered heart rate data, with the bolus and the nociceptive and anti-nociceptive periods marked. (b) Posterior distribution of log(σ) over time in the nociception and anti-nociception phases. (c) Histogram and trace plot of the posterior distribution of ρ in the nociception phase (median 0.943, 95% CI [0.684, 0.994], acceptance probability 24.9%). (d) Histogram and trace plot of the posterior distribution of ρ in the anti-nociception phase (median 0.995, 95% CI [0.907, 1], acceptance probability 23.8%).

The posterior distribution of each log(σ_t) obtained in the nociceptive and anti-nociceptive periods is plotted over time in Figure 5.1(b).
We note that during the nociceptive period, the log standard deviation appears to fluctuate over time, whereas in the anti-nociceptive period it is much smoother. This is also reflected in the estimate of ρ, which is higher in the anti-nociceptive period (median 0.99). Figure 5.2 contains side-by-side boxplots comparing the estimates of ρ in both periods. Bolus event 1 is a very "clean" example that showcases well how the variability of the heart rate differs between the two periods. Results from bolus event 2, in Figure 5.3, are a little less straightforward to interpret.

Figure 5.2: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive state (ANC) for bolus event 1.

In this case, the painful stimulus occurred partway through the nociceptive period. The anesthesiologist responded quickly and an anesthetic bolus was given, which led the heart rate to begin readjusting. However, it is likely that another stimulus event occurred to counter the effect of the bolus. This behaviour in the data is mirrored in the estimated results. We observe that ρ is about the same in both periods, and the estimates of log σ fluctuate in both periods, not showing the level of smoothness that we saw in the anti-nociceptive period of the previous bolus event example.
Figure 5.3: Results for bolus event 2. (a) The raw and centered heart rate data. (b) Posterior distribution of log(σ) in the nociception and anti-nociception phases. (c) Histogram and trace plot of the posterior distribution of ρ in the nociception phase (median 0.985, 95% CI [0.859, 0.999], acceptance probability 26%). (d) Histogram and trace plot of the posterior distribution of ρ in the anti-nociception phase (median 0.963, 95% CI [0.302, 0.999], acceptance probability 29.1%).

Bolus event 4, in Figure 5.4, is another example of non-typical results. In this case, it is suspected that an anesthetic bolus was given in anticipation of a stimulus that did not occur. From Figure 5.4(a), we can see that the pattern in the variability of the heart rate over time does not change. Reliably, the estimates obtained from the algorithm mirror that: the estimates for ρ are high and the estimates of log(σ) are very smooth over time.
Figure 5.4: Results for bolus event 4. (a) The raw and centered heart rate data. (b) Posterior distribution of log(σ) in the nociception and anti-nociception phases. (c) Histogram and trace plot of the posterior distribution of ρ in the nociception phase (median 0.998, 95% CI [0.974, 1], acceptance probability 35%). (d) Histogram and trace plot of the posterior distribution of ρ in the anti-nociception phase (median 0.996, 95% CI [0.838, 1], acceptance probability 25.1%).

One of the limitations of the baseline-stimulus setup is that the windows used to collect the data are fixed. The researchers did not want to manipulate the windows so as not to introduce bias, but many other confounding events occurred during the procedure which clearly affect the results. In most cases, the findings resulting from the use of the PMMH algorithm appear to be clinically relevant and correspond to findings obtained using other methods. The results from all of the experiments are presented in Appendix C. Future work on this dataset should consider a changepoint detection algorithm that can take the full range of the data, detect the point(s) at which a change occurs, and check whether they match the observed data. Furthermore, as noted in Chapter 2, being able to adaptively set the step size for each data region would be useful.

5.2 Stock Market Returns of Technology Firms

In the next example, we consider the stock market log-returns of four technology firms: Google, Microsoft, IBM and Apple. A description of the data and the pre-processing was given in Section 1.2.2 of Chapter 1. The pre-processing step with the multivariate autoregressive filter ensures that the only parameter evolving in time is the covariance matrix. This in turn ensures that the conditional independence assumption of Chapter 2 is met.
The analysis is done, as in the previous example, on the residuals Z obtained from fitting the AR filter. In this case, Z is an n × J × T array with n = 1, J = 4, and T = 386. We assume that the data follow a matrix-variate Gaussian distribution,

Z_t | Σ_t^{-1} ∼ MN_{n,J}(0_{n×J}, Σ_t, I_n),

where the unknown parameter of interest is the sequence of T precision matrices Σ_1^{-1}, …, Σ_T^{-1}. We place a Wishart autoregressive prior on the sequence of matrices, as described in Chapter 2, such that

π(Σ_{1:T}^{-1}) = π(Σ_1^{-1}) ∏_{t=2}^{T} π(Σ_t^{-1} | Σ_{t-1}^{-1}),

where

Σ_1^{-1} ∼ W_J(d, S^{-1}, 0),
Σ_t^{-1} | Σ_{t-1}^{-1} ∼ W_J(d, S − MSM′, (S − MSM′)^{-1} M Σ_{t-1}^{-1} M′).

In this model, we fix the hyperparameter S = I_J, the identity matrix, and d = J + 1 = 5. This choice of hyperparameters results in a prior on Σ that in turn yields a marginally uniform prior on the correlation R between the stocks, where R is obtained from the decomposition R = D^{-1}ΣD^{-1}, with D the diagonal matrix with entries D_{ii} = (Σ_{ii})^{1/2}. The aim, therefore, is to infer the autoregressive parameter M = ρI as well as the precision sequence Σ_t^{-1} from the data. We run the PMMH algorithm (2.4) of Chapter 2 to obtain samples of ρ and Σ_{1:T}^{-1}, using the prior as a proposal distribution for Σ_{1:T}^{-1} and a random walk proposal for ρ with starting value ρ_0 = 0.5, with c chosen so that the acceptance probability lies between 25% and 50%, which indicates that the sampler is moving and exploring the space of possible values. For this example, we tried running the algorithm with different numbers of particles and for different numbers of iterations. However, it appears that, regardless of the step size of the random walk, once the random walk reaches ρ = 1, it gets stuck there. For illustration, Figure 5.5 shows posterior samples of ρ obtained from running the PMMH algorithm with K = 2000 particles and s = 5000 MCMC samples.
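The behaviour at the boundary can be probed by evaluating the SMC estimate of the log marginal likelihood log p(Z_{1:T} | ρ), which the PMMH acceptance ratio is built on, over a grid of ρ values. The sketch below does this for a toy univariate stochastic volatility model (log-variance AR(1)) with a bootstrap particle filter; it is a simplified stand-in for the Wishart model and Algorithm 2.4, and the model and all names are illustrative assumptions.

```python
import numpy as np

def log_marginal_estimate(z, rho, K, rng):
    """Bootstrap particle filter estimate of log p(z_{1:T} | rho) for the toy
    model  x_t = rho x_{t-1} + eta_t,  eta_t ~ N(0, 1 - rho^2),
           z_t | x_t ~ N(0, exp(x_t)).
    Requires |rho| < 1 (stationary latent process with unit variance)."""
    x = rng.normal(0.0, 1.0, size=K)                  # draws from stationarity
    log_ml = 0.0
    for zt in np.asarray(z, dtype=float):
        x = rho * x + rng.normal(0.0, np.sqrt(1.0 - rho**2), size=K)
        logw = -0.5 * (np.log(2 * np.pi) + x + zt**2 * np.exp(-x))
        m = logw.max()
        w = np.exp(logw - m)
        log_ml += m + np.log(w.mean())                # running evidence estimate
        x = rng.choice(x, size=K, p=w / w.sum())      # multinomial resampling
    return log_ml

rng = np.random.default_rng(3)
z = rng.normal(size=50)
grid_estimates = {r: log_marginal_estimate(z, r, K=500, rng=rng)
                  for r in (0.5, 0.9, 0.99)}
```

Plotting such estimates over a fine grid of ρ is exactly the diagnostic used below for the stock data.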
Figure 5.5: Posterior distribution of ρ obtained from running the PMMH algorithm on the four-technology-stock problem with c = 0.01. (a) Time series plot (acceptance probability 1%). (b) Histogram (median 0.996, 95% CI [0.993, 0.997]).

One of the advantages of assuming M = ρI is that, because ρ is a scalar, we can estimate the marginal likelihood for different values of ρ on a grid to see why the algorithm is getting stuck. The results are plotted in Figure 5.6. We can see that the marginal likelihood clearly peaks at 1; it is therefore no surprise that the algorithm gets stuck at this boundary. Finally, we can perhaps conclude that the posterior is concentrated in the (0.9, 1) region, and that there might be evidence against a stationary stochastic volatility model in this example.

Figure 5.6: The marginal likelihood evaluated and plotted for different scalar values of ρ on a grid. The maximum is reached at ρ = 1.

5.3 "Women and Mathematics" Data Set

In the next example we switch gears and look at an application where the correlation is assumed to be static. We illustrate the methods discussed in Chapter 3 by applying the model selection algorithm to the "Women and Mathematics" data set (Fowlkes et al., 1988b). This data set was analyzed by Tarantola (2004) and by Madigan and Raftery (1994) using the MC³ algorithm to learn the structure of a discrete graph. The primary aim is to study the attitude of 1190 New Jersey high school students toward mathematics.
The six binary variables collected are:

X1 Lecture Attendance (Attend/Did not Attend)
X2 Gender (Male/Female)
X3 School Type (Urban/Suburban)
X4 "I will be needing Mathematics in my future work" (Agree/Disagree)
X5 Subject Preference (Math/Science vs. Liberal Arts)
X6 Future Plans (College/Job)

Figure 5.7: Most probable graphical structures and estimated posterior probabilities for the "Women and Mathematics" data set.

We ran the algorithm for N = 200,000 iterations. Multiple runs with different starting values were performed, and very similar results were observed in all cases. Our method identifies six structures with probabilities higher than 3% (0.03), accounting for 38% of the total probability mass. The most probable structures are depicted in Figure 5.7. All of the probable graphs chosen by our algorithm show that X1, lecture attendance, is conditionally independent of everything else given all the other variables in the graph. Similarly, all of the probable models include edges between [X5, X2, X4], showing a conditional association between gender, subject preference, and whether the respondents will need mathematics in their future. These results are consistent with those found by others. Our method also detects with high probability the clique [X6, X5, X4], suggesting a conditional association between future plans, subject preference, and whether the respondents thought they would be needing mathematics in their future. Moreover, a relationship between X6, future plans, and X3, type of school, was detected. Comparing our results with those of Tarantola (2004), we note that while their most probable graph does not correspond to the most probable one obtained here, some of their most probable graphs are identical to the ones we obtained.
Furthermore, if we compare the edge marginal probabilities, we see that in their hierarchical models, for a certain value of the hyperparameters (λ0 = 64 in the non-hierarchical model and f = 64 in the hierarchical model), the edge marginals they obtained are very similar to ours (see Table 5.1). The graph they report in their paper corresponds to the hierarchical model with f = 1.

Edges   This paper   Non-hierarchical (λ0 = 64)   Hierarchical (f = 64)
(2,1)   0.11         0.10                         0.12
(3,1)   0.10         0.12                         0.15
(4,1)   0.14         0.15                         0.17
(5,1)   0.13         0.13                         0.15
(6,1)   0.13         0.15                         0.16
(3,2)   0.04         0.11                         0.12
(4,2)   0.90         1.00                         1.00
(5,2)   1.00         1.00                         1.00
(6,2)   0.20         0.02                         0.03
(4,3)   0.22         0.81                         0.80
(5,3)   0.56         0.33                         0.43
(6,3)   1.00         1.00                         1.00
(5,4)   1.00         1.00                         1.00
(6,4)   1.00         0.80                         0.89
(6,5)   0.98         0.79                         0.88

Table 5.1: Edge marginal probabilities comparing the results obtained by the algorithm developed in this paper with those of the MC³ algorithm as outlined in Tarantola (2004).

5.4 Six Cities Data Set

In the last example, we consider the data set used by Chib and Greenberg (1998), a subset of the Six Cities study, a longitudinal study of the health effects of air pollution. The data under consideration consist of repeated binary measures of the wheezing status of 537 children from Steubenville, Ohio. The objective is to model the probability of wheeze status over time as a function of the mother's smoking status during the first year of the study. The response variable of interest, y_{ij}, indicates the wheezing status of child i at measurement j = (1, 2, 3, 4), corresponding to ages (7, 8, 9, 10). The nature of the data suggests that there exists an association between the elements of the multivariate response variable. We are interested in assessing the effects of the mother's smoking status at onset (1 = yes, 0 = no). Table 5.2 breaks down the number of "cases" representing wheezing symptoms by age group and mother's smoking status.
From looking at these results, it appears that the percentage of wheezing children is higher for smoking mothers at each age.

Mother Smoking Status   Age 7        Age 8        Age 9        Age 10
Smoker                  32 (17.0%)   40 (21.3%)   36 (19.1%)   27 (14.4%)
Non-smoker              55 (15.8%)   51 (14.6%)   49 (14.0%)   36 (10.0%)
Total                   87 (16.2%)   91 (16.9%)   85 (15.8%)   63 (11.7%)

Table 5.2: Breakdown of wheezing cases by age group and smoking status of the mother at onset. The percentages in brackets are with respect to each age group.

As in Lawrence et al. (2008), we only use the mother's smoking status at onset as a covariate (x_1), along with an intercept term:

Pr(Y_{i,j} = 1) = Φ(β_{j,0} + β_{j,1} x_{i,1}),

where Φ denotes the standard Gaussian c.d.f. of the probit link. For the prior (4.3) on β, we set Ψ = 10^5 I_p. We fit the probit model using a saturated correlation matrix, as in Lawrence et al. (2008), but we also estimate the structure of association from the data. We generated N = 50,000 samples and used 3,000 samples as burn-in. Table 5.3 presents the posterior means for β and R. These results are compared to those presented in Lawrence et al. (2008). From the results in Table 5.3, we note that the results obtained using our proposed method and prior do not differ significantly from those presented in Lawrence et al. (2008). The estimates of the regression coefficients are robust to the correlation structure as well as to the prior specification. The marginal posterior correlations are slightly smaller under the marginally uniform prior, which is consistent with the fact that Jeffreys' prior tends to push marginal correlations toward ±1. Furthermore, in order to compare the performance of the algorithm from a computational standpoint, we look at
            Saturated       Saturated            Model Selection
            (Jeffreys)      (Marg. uniform)      (Marg. uniform)
β11         -0.983          -0.922               -0.922
β12          0.011           0.032                0.031
β21         -1.029          -0.929               -0.983
β22          0.219           0.223                0.222
β31         -1.054          -1.012               -1.013
β32          0.166           0.181                0.182
β41         -1.235          -1.180               -1.181
β42          0.150           0.167                0.169
r12          0.592           0.498                0.495
r13          0.535           0.455                0.422
r14          0.573           0.487                0.485
r23          0.700           0.588                0.589
r24          0.571           0.494                0.469
r34          0.641           0.548                0.552

Table 5.3: Posterior means of the parameters of interest. The first two columns correspond to the saturated model under the Jeffreys prior of Lawrence et al. (2008) and under the marginally uniform prior. The last column corresponds to using a marginally uniform prior and performing model selection.

the plots of the autocorrelation function in Figure 5.8, which we compare with Figure 1 in Lawrence et al. (2008). We can see that the proposed algorithm fares well in comparison, showing significant autocorrelation only up to about 10 to 15 lags, compared to 25 to 30 lags for the algorithm of Lawrence et al. (2008), which in turn performs significantly better than the algorithm of Chib and Greenberg (1998). Because our method performs correlation selection, whereby the graph is sampled alongside the other parameters, we can obtain a posterior distribution over graph structures. Figure 5.9 depicts the most probable graphs that were obtained. The saturated model used by Chib and Greenberg (1998) and Lawrence et al. (2008) was the second most probable graph, with posterior probability 0.26. The most probable graph structure (43% of the posterior mass) suggests that the wheezing status of the child at age 9 is conditionally independent of the wheezing status at age
7, given the other variables in the model.

Figure 5.8: Autocorrelation function for the marginal correlation parameters (R12, R13, R14, R23, R24, R34) obtained from the Six Cities example under the saturated assumption.

Figure 5.9: The four most probable graph structures, with their associated posterior probabilities (0.43, 0.26, 0.25 and 0.02).

Finally, in order to assess the fit of the model to the data, we use the posterior predictive checks suggested in Gelman et al. (2003). Posterior predictive checks are useful for detecting systematic differences between the model and the observed data. They are carried out by generating replicated data sets from the posterior predictive distribution of the statistical model; the replicated data are then compared with the observed data with respect to some features of interest. For this example, we took L = 25,000 samples from the joint posterior distribution and generated a data set Y^l for each (β^l, R^l), l = 1, ..., L. Each resulting binary data set is then compared with the observed data with respect to the following three quantities:

• % of kids who never experienced wheezing at any age;
• % of kids who experienced wheezing at every age;
• % of kids who did not experience wheezing at onset but, once they did, continued wheezing for more than 2 time points.

Each of the above measures, T(y), was computed under the observed data and compared to the quantity obtained under the replicated artificial data. The Bayesian p-value is used to assess whether differences are statistically significant. The Bayesian p-value is given by Pr(T(Y^rep) > T(y) | y) and is estimated from simulations as

(1/L) Σ_{l=1}^{L} I{ T(Y^l) > T(y) }.
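The Monte Carlo estimator of the Bayesian p-value above is simple to implement. The following is an illustrative Python sketch (the computations in this thesis were done in Matlab, and the replicated statistics here are synthetic stand-ins, not the Six Cities results):

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_p_value(T_obs, T_rep):
    """Fraction of replicated test statistics exceeding the observed one,
    i.e. a Monte Carlo estimate of Pr(T(Y_rep) > T(y) | y)."""
    return float(np.mean(np.asarray(T_rep) > T_obs))

# Toy illustration: pretend the replicated statistic is roughly Normal
# around 62 while the observed value is 66.1 (values are arbitrary).
T_obs = 66.1
T_rep = rng.normal(loc=62.0, scale=3.0, size=25_000)
p = bayesian_p_value(T_obs, T_rep)
```

In practice T_rep would hold the statistic computed on each of the L replicated data sets Y^l; a p-value near 0 or 1 then flags a systematic discrepancy between model and data.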
The p-values are computed for the case where the correlation structure is inferred from the data. The results in Table 5.4 suggest that the posterior predictive distribution produces data in reasonable agreement with the observed data; indeed, the p-values obtained are neither too close to 0 nor too close to 1.

Test Variable             T(y)    95% Interval for T(Y^rep)    p-value
% never exp. wheezing     66.1    [55.12, 67.59]               0.08
% always exp. wheezing     3.3    [1.11, 4.47]                 0.21
% incident wheezing        2.4    [1.86, 5.77]                 0.93

Table 5.4: Posterior predictive checks with 95% intervals and p-values, indicating that the posterior predictive distribution of the proposed model is not statistically different from the observed data.

Chapter 6

Conclusion

6.1 Summary

In this research work we have laid out a general and unified framework to model dependencies in multivariate data. More specifically, under a Bayesian paradigm, we considered the problem of estimating covariance matrices and covariate effects under different settings and for different data types. For continuous data, in Chapter 2, we extended the Bayesian linear model and proposed algorithms that allow the estimation of time-varying covariance matrices using the Wishart autoregressive (WAR) process. The Wishart autoregressive process is the multivariate generalization of the χ², a commonly used prior for the variance parameter in the univariate case. This natural extension therefore allows for easy interpretation of the model parameters and guarantees stationarity. However, we have seen that the algorithms needed to implement the WAR model are computationally expensive and do not scale well as the number of time steps, T, and the number of variables, J, simultaneously increase. This is even more difficult in the presence of static parameters.
This dynamic model for continuous data was extended in Chapter 4 to allow for the estimation of covariate effects using a particle Gibbs algorithm. In Chapter 3, we discussed methods for estimating a static covariance that is unknown but fixed over time. We also discussed algorithms to estimate the static covariance or correlation matrix when the observed data are binary. Furthermore, we introduced covariance selection models and Gaussian graphical models as a way to estimate a sparse structure for the inverse correlation matrix of multivariate binary data using the probit model. This model was further developed in Chapter 4, by adding linear predictors and estimating their effects.

6.2 Future Work

Finally, here are some directions to consider in future work.

6.2.1 Numerical Evaluation of the Hypergeometric Function

One of the first roadblocks in scaling the sequential Monte Carlo (SMC) algorithm for the estimation of a time-varying covariance matrix is the efficient and accurate numerical evaluation of the 0F1 hypergeometric function. The accurate evaluation of the hypergeometric function is needed for the central Wishart importance distribution used in the SMC algorithm. As we saw in the simulation results, the more informative the likelihood is, the more the posterior differs from the prior and the worse the non-central Wishart prior becomes as a proposal. Future work in numerical analysis could speed up, and make feasible, the use of a proposal that accounts for the observed data in dimensions J larger than 1.

6.2.2 Better Proposals for Static Parameters

In Chapter 2 we discussed the need to devise better proposals for the estimation of the fixed autoregressive parameter M. The current random walk proposal could be improved by automatically tuning the step size. It might also make sense, from an application standpoint, to devise proposals that allow for unconstrained as well as structured matrices M.
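The automatic step-size tuning mentioned above is a standard adaptation heuristic for random-walk proposals: the step is grown when the empirical acceptance rate exceeds a target and shrunk otherwise. A minimal Python sketch on a toy one-dimensional target (illustrative only; the function names and the toy target are our own, not the thesis implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def tune_step(step, acc_rate, target=0.3, factor=1.1):
    """Crude automatic step-size tuning for a random-walk proposal:
    grow the step when acceptance is above target, shrink it otherwise."""
    return step * factor if acc_rate > target else step / factor

def log_target(x):
    # Toy target: standard Normal (up to a constant).
    return -0.5 * x * x

step, x, accepts = 5.0, 0.0, 0
for i in range(5_000):
    prop = x + step * rng.standard_normal()   # random-walk proposal
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x, accepts = prop, accepts + 1
    if (i + 1) % 100 == 0:                    # adapt every 100 iterations
        step = tune_step(step, accepts / 100.0)
        accepts = 0
```

In a full particle MCMC implementation, adaptation of this kind is usually restricted to the burn-in phase so that the final chain remains a valid Markov chain.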
In their paper, Lin et al. (2012) suggest stacking the rows of the matrix M and proposing from a multivariate Normal distribution; this could be a good starting point. Moreover, as we have seen in our simulations in the scalar case, the effect of ρ is not linear: values have to be very close to 1 before we are able to detect persistence in the covariance matrix. One could therefore consider a random walk on a transformation of ρ to the real line (or, equivalently, a multiplicative random walk), which might be preferable in this case.

6.2.3 Improving Speed of Mixing and Avoiding Degeneracy

Particle MCMC algorithms are known to be computationally expensive, and therefore future work must include ways to speed up these algorithms to make them practical. Such methods include more efficient implementations in a faster programming language, as well as parallelisation over particles (Lee et al., 2010). In the case of particle Gibbs, methods that include a backward simulation step within the algorithm should be considered, as suggested in Whiteley (2010) and Lindsten and Schon (2011). The aim of these strategies is to avoid particle degeneracy and improve mixing. Another alternative that could be beneficial is ancestor sampling, as suggested by Lindsten et al. (2012).

6.2.4 Other Possible Extensions

Other extensions that could be attempted, once the above-mentioned issues have been addressed, include extending the model to binary data in the dynamic case. Under this scenario, the response is a multidimensional longitudinal vector of binary observations for each subject. One can consider extending the parameter expansion for data augmentation (PX-DA) algorithm to the dynamic case and, moreover, learning a structured covariance over time, possibly estimating the structure as well. This obviously adds to the complexity of the model and should not be attempted before resolving the issues with mixing and speed.
Other natural extensions, to accommodate mixed scales as well as ordinal and count scales, could also be considered.

Bibliography

C. Andrieu and A. Doucet. Particle filtering for partially observed Gaussian state space models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 64(4), 2002.

C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 72:1–33, 2010.

H. Armstrong, C.K. Carter, K. Wong, and R. Kohn. Bayesian covariance matrix estimation using a mixture of decomposable graphical models. Statistics and Computing, 19:303–316, 2009.

M. Asai, M. McAleer, and J. Yu. Multivariate stochastic volatility: A review. Econometric Reviews, 25(2):145–175, 2006.

J. Barnard, R. McCulloch, and X. Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10:1281–1311, 2000.

L. Bornn and F. Caron. Bayesian clustering in decomposable graphs. Bayesian Analysis, 6(4):829–846, 2011.

C.J. Brouse. Monitoring nociception during general anesthesia with heart rate variability. PhD thesis, 2013.

W.J. Browne. MCMC algorithms for constrained variance matrices. Computational Statistics & Data Analysis, 50(7):1655–1677, 2006. doi: 10.1016/j.csda.2005.02.008.

M.A. Cameron. A note on functions of Markov processes with an application to a sequence of χ² statistics. Journal of Applied Probability, 10(4):895–900, 1973.

C. Carvalho. Structure and sparsity in high-dimensional multivariate analysis. PhD thesis, 2006.

C.M. Carvalho and M. West. Dynamic matrix-variate graphical models. Bayesian Analysis, 2(1):69–98, 2007.

S. Chib and E. Greenberg. Analysis of multivariate probit models. Biometrika, 85(2):347–361, 1998.

S. Chib, Y. Omori, and M. Asai.
Multivariate stochastic volatility. Technical report, 2007.

A. Dawid and S. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272–1317, 1993.

A. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.

E.B. Fowlkes, A.E. Freeny, and J.M. Landwehr. Evaluating logistic models for large contingency tables. Journal of the American Statistical Association, 83(403):611–622, 1988a.

E.B. Fowlkes, A.E. Freeny, and J.M. Landwehr. Evaluating logistic models for large contingency tables. Journal of the American Statistical Association, 83(403):611–622, 1988b.

A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis, Second Edition. Chapman and Hall/CRC, 2003.

J. Geweke. Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints. In Computing Science and Statistics: Proceedings of the Twenty-Third Symposium on the Interface. American Statistical Association, 1991.

W.R. Gilks, G.O. Roberts, and S.K. Sahu. Adaptive Markov chain Monte Carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054, 1998.

P. Giudici. Learning in graphical Gaussian models. In Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting, pages 621–628, 1996.

P. Giudici and P.J. Green. Decomposable graphical Gaussian model determination. Biometrika, 86:785–801, 1999.

C. Gourieroux, J. Jasiak, and R. Sufana. The Wishart autoregressive process of multivariate stochastic volatility. Journal of Econometrics, 150(2):167–181, 2009.

A. Gupta and D. Nagar. Matrix Variate Distributions. Chapman and Hall, 2000.

P. Koev and A. Edelman. The efficient evaluation of the hypergeometric function of a matrix argument. Mathematics of Computation, pages 833–846, 2006.

S. Lauritzen.
Graphical Models. Clarendon Press, Oxford, 1996.

E. Lawrence, D. Bingham, C. Liu, and V.N. Nair. Bayesian inference for multivariate ordinal data using parameter expansion. Technometrics, 50(2):182–191, 2008.

A. Lee, C. Yau, M.B. Giles, A. Doucet, and C.C. Holmes. On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19(4):769–789, 2010.

M. Lin, C. Liu, and L. Niu. Bayesian estimation of the Wishart autoregressive stochastic volatility model. 2012.

F. Lindsten and T.B. Schon. On the use of backward simulation in the particle Gibbs sampler, 2011.

F. Lindsten, M.I. Jordan, and T.B. Schon. Ancestor sampling for particle Gibbs. arXiv preprint arXiv:1210.6911, 2012.

C. Liu. Bayesian analysis of multivariate probit models: discussion on "The art of data augmentation" by van Dyk and Meng. Journal of Computational and Graphical Statistics, 10:75–81, 2001.

D. Madigan and A.E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association, 89(428):1535–1546, 1994.

J. Olsson and T. Ryden. Rao-Blackwellization of particle Markov chain Monte Carlo methods using forward filtering backward sampling. IEEE Transactions on Signal Processing, 59(10):4606–4619, 2011.

A. Philipov and M.E. Glickman. Multivariate stochastic volatility via Wishart random processes. Journal of Business and Economic Statistics, 24(3):313–327, 2006.

S.-H. Poon and C.W.J. Granger. Forecasting volatility in financial markets: A review. Journal of Economic Literature, 41(2):478–539, 2003.

C. Robert. Simulation of truncated normal variables. Statistics and Computing, 5(2):121–125, 1995.

G.O. Roberts. Adaptive Markov chain Monte Carlo through regeneration. Chapman and Hall/CRC, 1996.

A. Tabet.
Bayesian inference in the multivariate probit model. Master's thesis, 2007.

A. Talhouk, A. Doucet, and K. Murphy. Efficient Bayesian inference for multivariate probit models with sparse inverse correlation matrices. Journal of Computational and Graphical Statistics, 21(3):739–757, 2012.

C. Tarantola. MCMC model determination for discrete graphical models. Statistical Modelling, 4(1):39–61, 2004. doi: 10.1191/1471082X04st063oa.

T. Tokuda, B. Goodrich, I. Van Mechelen, A. Gelman, and F. Tuerlinckx. Visualizing distributions of covariance matrices. 2011.

E. Webb and J. Forster. Bayesian model determination for multivariate ordinal and binary data. Computational Statistics and Data Analysis, 52:2632–2649, 2008.

N. Whiteley. Discussion on particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 72:306–307, 2010.

N. Whiteley, C. Andrieu, and A. Doucet. Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. arXiv preprint arXiv:1011.2437, 2010.

F. Wong, C. Carter, and R. Kohn. Efficient estimation of covariance selection models. Biometrika, 90(4):809–830, 2003.

X. Zhang, W.J. Boscardin, and T. Belin. Sampling correlation matrices in Bayesian models with correlated latent variables. Journal of Computational and Graphical Statistics, 15(4):880–896, 2006.

Appendix A

Distributions

We adopt the convention of using small letters to denote scalars and capital letters to denote vectors and matrices.

A.1 The Gaussian Distribution

A.1.1 The Univariate Representation

If Z_1, ..., Z_n are i.i.d. N(µ, σ²), with scalar mean µ and variance σ², then

f(Z | µ, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp( −½ σ^{−2} sst )    (A.1)

where sst = Σ_{i=1}^{n} (Z_i − µ)² is the sum of squared deviations from the mean.
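As a quick sanity check of (A.1), its logarithm can be verified numerically against a standard library evaluation of the Normal log-density; this illustrative Python sketch uses arbitrary values (the thesis computations were done in Matlab):

```python
import numpy as np
from scipy import stats

# Verify log of (A.1): the joint log-density of n i.i.d. N(mu, sigma2)
# draws equals the sst-based expression. Values here are arbitrary.
mu, sigma2 = 1.5, 2.0
z = np.array([0.3, 2.1, 1.0, 2.8])
n = z.size

sst = np.sum((z - mu) ** 2)                       # sum of squared deviations
log_f = (-0.5 * n * np.log(2 * np.pi)
         - 0.5 * n * np.log(sigma2)
         - 0.5 * sst / sigma2)                    # log of (A.1)

# Reference: sum of per-observation Normal log-densities.
log_f_ref = stats.norm.logpdf(z, loc=mu, scale=np.sqrt(sigma2)).sum()
```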
The log-likelihood is given by

l(µ, σ^{−2}; z_{1:n}) = −(n/2) log(2π) + (n/2) log σ^{−2} − ½ σ^{−2} sst    (A.2)

Random Number Generation

To generate n samples with mean µ and variance σ²:

• Sample z* ~ N(0, 1) using the randn function in Matlab.
• Compute z = σ z* + µ.

A.1.2 Multivariate and Matrix-Variate Gaussian Distribution

If Z ~ N_J(µ, Σ), with mean vector µ (J × 1) and variance-covariance matrix Σ (J × J), then the probability density function is given by

f(Z | µ, Σ) = (2π)^{−nJ/2} |Σ^{−1}|^{n/2} etr( −½ Σ^{−1} (Z − µ)(Z − µ)' )    (A.3)

The log-likelihood is given by

l(Z | µ, Σ) = −(nJ/2) log(2π) + (n/2) log |Σ^{−1}| − ½ tr( Σ^{−1} (Z − µ)(Z − µ)' )    (A.4)

The Matrix-Variate Normal Distribution

The matrix-variate Normal distribution is the generalization of multivariate Gaussian vectors to Gaussian random matrices. A random matrix X (n × J) is said to follow a matrix Normal distribution, X ~ MN_{n×J}(M_{n×J}, Σ_{J×J}, Ω_{n×n}), with pdf given by

f(X | M, Σ, Ω) = (2π)^{−nJ/2} |Σ^{−1}|^{n/2} |Ω|^{−J/2} etr( −½ Σ^{−1} (X − M)' Ω^{−1} (X − M) )    (A.5)

where M is n × J, Σ is J × J and Ω is n × n. Here M is the mean of the matrix X, Σ encodes the covariance between the columns of X, and Ω the covariance between the rows of X. The matrix normal is related to the multivariate normal in the following way: X ~ MN_{n×J}(M, Σ, Ω) if and only if vec(X) ~ N(vec(M), Σ ⊗ Ω), where vec(X) is the vector obtained by stacking the columns of X and ⊗ is the Kronecker product. The log-likelihood is given by

l(Σ, Ω, M; X) = −(nJ/2) log(2π) + (n/2) log |Σ^{−1}| − (J/2) log |Ω| − ½ tr( Σ^{−1} (X − M)' Ω^{−1} (X − M) )    (A.6)

Random Variable Generation

To generate an n × J matrix-variate Gaussian variable with mean M (n × J) and covariances Σ_J and I_n, we use the built-in Matlab function mvnrnd.
• Generate X = mvnrnd(M, Σ, 1).

A.2 The χ² Distribution

A.2.1 The Central χ²

If w ~ χ²(d, s), with degrees of freedom d and scale parameter s, then

f(w | d, s) = [2^{d/2} Γ(d/2)]^{−1} (1/s) (w/s)^{d/2−1} exp( −w/(2s) )    (A.7)

The log p.d.f. is given by

l(d, s; w) = −(d/2) log 2 − log Γ(d/2) + (d/2) log(w/s) − log w − w/(2s)    (A.8)

Expectation

The expectation of a standard χ² variable equals its degrees of freedom. To obtain the expectation of a scaled χ², we can start from the Gaussian representation. Let x_1, ..., x_d be i.i.d. N(0, s); then Σ_{i=1}^{d} (x_i/√s)² ~ χ²(d) with expected value d:

E[ Σ_{i=1}^{d} (x_i/√s)² ] = d    (A.9)

E[ Σ_{i=1}^{d} x_i² ] = sd    (A.10)

Random Number Generation

To generate n random variables according to χ²(d, s):

• Generate w' ~ χ²(d) using the Matlab function chi2rnd(d, n).
• Compute w = s w'.

A.2.2 The Non-central χ²

If w ~ χ²(d, s, θ), with degrees of freedom d, non-centrality parameter θ and scale parameter s, then

f(w | d, θ, s) = [2^{d/2} Γ(d/2)]^{−1} (1/s) (w/s)^{d/2−1} exp( −½ (θ + w/s) ) 0F1( ; d/2; θw/(4s) )    (A.11)

Here we use the representation of the non-central χ² variable w as a sum of squared Gaussian variables, w = Σ_{i=1}^{d} x_i², where x_i ~ N(µ_i, s) and θ is related to the means of the Gaussian variables through θ = Σ_{i=1}^{d} (µ_i/√s)². The log p.d.f. is given by

l(d, s; w) = −(d/2) log 2 − log Γ(d/2) + (d/2) log(w/s) − log w − ½ (θ + w/s) + log 0F1( ; d/2; θw/(4s) )    (A.12)

Expectation

E[ Σ_{i=1}^{d} (x_i/√s)² ] = d + θ    (A.13)

E[ Σ_{i=1}^{d} x_i² ] = s(d + θ)    (A.14)

Note that when the non-centrality parameter θ = 0, the non-central χ² reduces to the central χ².

Random Number Generation

To generate n random variables according to a non-central χ²(d, s, θ), a Matlab function can be readily used; it uses the Poisson-weighted mixture representation.

• Generate w' ~ χ²(d, θ) using the Matlab function ncx2rnd(d, θ, n).
• Compute w = s w' to obtain a scaled variable.
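The scaled central and non-central χ² generators above can be mirrored directly from the sum-of-squared-Gaussians representation; this illustrative Python sketch (a stand-in for the Matlab recipes, with arbitrary parameter values) also checks the expectations (A.10) and (A.14):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, s, theta = 5, 2.0, 3.0   # df, scale, non-centrality (illustrative values)

# Scaled central chi-square: w = s * chi2(d), so E[w] = s*d.
w_central = s * stats.chi2.rvs(d, size=100_000, random_state=rng)

# Scaled non-central chi-square via the sum-of-squared-Gaussians
# representation: w = sum_i x_i^2 with x_i ~ N(mu_i, s) and
# theta = sum_i mu_i^2 / s, so E[w] = s*(d + theta).
mu = np.zeros(d)
mu[0] = np.sqrt(theta * s)  # any mu with sum(mu^2)/s = theta works
x = rng.normal(mu, np.sqrt(s), size=(100_000, d))
w_noncentral = (x ** 2).sum(axis=1)
```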
A.3 The Wishart Distribution

The Central Wishart Distribution

If a J × J covariance matrix Σ ~ W_J(d, S), with degrees of freedom d and scale parameter S, then the p.d.f. is given by

f(Σ | d, S) = [2^{dJ/2} Γ_J(d/2) |S|^{d/2}]^{−1} |Σ|^{(d−J−1)/2} etr( −½ S^{−1} Σ )    (A.15)

where Γ_J is the multivariate Gamma function,

Γ_J(a) = π^{J(J−1)/4} Π_{i=1}^{J} Γ( a − (i−1)/2 )    (A.16)

There exist several parametrizations of the Wishart distribution. The one we use most frequently throughout (unless otherwise noted) results from taking the outer product of Gaussian vectors. The log pdf is given by

l(d, S; Σ) = −(dJ/2) log 2 − log Γ_J(d/2) − (d/2) log |S| + [(d−J−1)/2] log |Σ| − ½ tr( S^{−1} Σ )    (A.17)

From Gupta and Nagar (2000), Theorem 3.2.2, page 88: let X ~ MN_{d×J}(0, S, I_d) and define Σ = X'X; then for d ≥ J, Σ ~ W_J(d, S) with density given by (A.15). Another theorem of importance is the invariance of the Wishart distribution (Theorem 3.3.1, p. 90, Gupta and Nagar (2000)): if Σ ~ W_J(d, S), then for any non-singular J × J matrix A, AΣA' ~ W_J(d, ASA').

Expected Values

Let Σ = (σ_ij) ~ W_J(d, S); then

E(Σ) = dS,  E(σ_ij) = d s_ij    (A.18)

Random Variable Generation

To generate a random matrix Σ ~ W_J(d, S):

1. Generate d observations from X ~ N_J(0, I_J).
2. Compute Z = X'X; in this case Z ~ W_J(d, I_J).
3. Let S = AA', where A is the lower triangular matrix obtained using the Cholesky decomposition. By the invariance property, Σ = AZA' ~ W_J(d, S), as desired.

The Inverse Wishart Distribution

A matrix Σ (J × J) is said to have an inverse Wishart distribution, denoted IW_J(d, ψ), with degrees of freedom d and location matrix ψ, if its pdf is given by

f(Σ | d, ψ) = [2^{(d−J−1)J/2} Γ_J((d−J−1)/2)]^{−1} |ψ|^{(d−J−1)/2} |Σ|^{−d/2} etr( −½ Σ^{−1} ψ )    (A.19)

If Σ ~ IW_J(d, ψ), then Σ^{−1} ~ W_J(d − J − 1, ψ^{−1}).

The Standard Inverse Wishart Distribution

There are several parametrizations of the inverse Wishart distribution.
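The three-step recipe above for generating Σ ~ W_J(d, S) translates directly into code. The following is an illustrative Python/NumPy sketch (the thesis implementations are in Matlab), with a Monte Carlo check of the expectation E(Σ) = dS:

```python
import numpy as np

rng = np.random.default_rng(3)

def rwishart(d, S):
    """Draw Sigma ~ W_J(d, S) via the outer-product construction and the
    invariance property: if Z ~ W_J(d, I) and S = A A', then
    A Z A' ~ W_J(d, S)."""
    J = S.shape[0]
    X = rng.standard_normal((d, J))   # d rows of N(0, I_J)
    Z = X.T @ X                       # Z ~ W_J(d, I_J)
    A = np.linalg.cholesky(S)         # lower-triangular A with S = A A'
    return A @ Z @ A.T

# Monte Carlo check of E(Sigma) = d * S on an arbitrary 2x2 scale matrix.
S = np.array([[2.0, 0.5], [0.5, 1.0]])
d = 10
mean_draw = np.mean([rwishart(d, S) for _ in range(20_000)], axis=0)
```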
We list below the ones used in this work.

1. The parametrization used in Gupta and Nagar (2000). Let Σ ~ IW(m, I_J); then

f_J(Σ | m) ∝ |Σ|^{−m/2} exp( −½ tr(Σ^{−1}) )    (A.20)

with expectation E(Σ) = I_J / (m − 2J − 2). The Matlab function invwishirnd from the MCMC toolbox (http://www.mathworks.com/matlabcentral/fileexchange/, by David Shera) implements sampling from this distribution.

2. The parametrization used in Barnard et al. (2000). Let Σ ~ IW(ν, I_J); then

f_J(Σ | ν) ∝ |Σ|^{−(ν+J+1)/2} exp( −½ tr(Σ^{−1}) )    (A.21)

with expectation E(Σ) = I_J / (ν − J − 1). This corresponds to (A.20) with m = ν + J + 1. This parametrization is implemented in the Matlab function iwishrnd in the Statistics toolbox.

3. The parametrization used in Dawid and Lauritzen (1993). Let Σ ~ IW(δ, I_J); then

f_J(Σ | δ) ∝ |Σ|^{−(δ+2J)/2} exp( −½ tr(Σ^{−1}) )    (A.22)

with expectation E(Σ) = I_J / (δ − 2). This corresponds to (A.21) with δ = ν − J + 1.

In the one-dimensional case, all three parametrizations reduce to an inverse chi-square:

f(σ² | v) ∝ (σ²)^{−(v+2)/2} exp( −1/(2σ²) )    (A.23)

In this case we can see that ν and δ in parametrizations 2 and 3 are equivalent and equal to v, while in parametrization 1, m = v + 2.

The Non-Central Wishart Distribution

Σ ~ W_J(d, S, Θ) follows a non-central Wishart distribution, with degrees of freedom d, scale parameter S and non-centrality parameter Θ, with pdf given by

f(Σ | d, S, Θ) = [2^{dJ/2} Γ_J(d/2) |S|^{d/2}]^{−1} |Σ|^{(d−J−1)/2} etr( −½ Θ ) etr( −½ S^{−1} Σ ) 0F1( ; d/2; ¼ Θ S^{−1} Σ )    (A.24)

where Γ_J(d/2) = ∫_{A>0} etr(−A) (det A)^{(d−J−1)/2} dA is the multidimensional Gamma function and 0F1 is the hypergeometric function of matrix argument; the density is defined on positive definite matrices. The hypergeometric function has the series expansion

0F1( ; d/2; ¼ Θ S^{−1} Σ ) = Σ_{p=0}^{∞} Σ_l C_l( ¼ Θ S^{−1} Σ ) / [ (d/2)_l p! ]    (A.25)

where Σ_l denotes summation over all partitions l = (p_1, ...
, p_m) of the integer p, with p_1 ≥ ··· ≥ p_m ≥ 0; (d/2)_l is the generalized hypergeometric coefficient (d/2)_l = Π_{i=1}^{m} ( d/2 − (i−1)/2 )_{p_i}, with (a)_{p_i} = a(a+1)···(a+p_i−1); and C_l( ¼ Θ S^{−1} Σ ) is the zonal polynomial associated with partition l. The log pdf of the non-central Wishart is given by

l(d, S, Θ; Σ) = −(dJ/2) log 2 − log Γ_J(d/2) − (d/2) log |S| + [(d−J−1)/2] log |Σ| − ½ tr( Θ + S^{−1} Σ ) + log 0F1( ; d/2; ¼ Θ S^{−1} Σ )    (A.26)

This parametrization of the non-central Wishart arises by taking the outer product of Gaussians. From Gupta and Nagar (2000), Theorem 3.5.1, p. 114: let X_{d×J} ~ MN_{d×J}(M_{d×J}, S_J, I_d); then Σ_J = X'X ~ W_J(d, S, Θ) with Θ = S^{−1} M'M. Furthermore, Theorem 3.5.2 gives a useful result: let X ~ MN_{d×J}(M, S_J, I_d) and let A (q × J) be any matrix of rank q ≤ J; then

A X'X A' ~ W_q( d, ASA', (ASA')^{−1} A M'M A' )    (A.27)

Expectation

Let Σ ~ W_J(d, S_J, Θ) be a non-central Wishart with Θ = S^{−1} M'M; then

E(Σ) = d S_J + M'M    (A.28)

Random Number Generation

To generate a random variable W ~ W_J(d, S_J, Θ) with Θ = S^{−1} M'M, we can write

W = S^{1/2} [ Σ_{i=1}^{J} (X_i + m̃_i)(X_i + m̃_i)' + V ] S^{1/2}

where V ~ W_J(d − J, I_J), X_i ~ N_J(0, I), and the m̃_i are the columns of M̃ = chol( S^{−1/2} M'M S^{−1/2} ). Under this formulation we can see that
E(W) = E[ S^{1/2} ( Σ_{i=1}^{J} (X_i + m̃_i)(X_i + m̃_i)' + V ) S^{1/2} ]    (A.29)
     = S^{1/2} ( Σ_{i=1}^{J} E[ X_i X_i' + X_i m̃_i' + m̃_i X_i' + m̃_i m̃_i' ] + E(V) ) S^{1/2}    (A.30)
     = S^{1/2} ( Σ_{i=1}^{J} E[X_i X_i'] + Σ_{i=1}^{J} m̃_i m̃_i' + E(V) ) S^{1/2}    (A.31)
     = S^{1/2} ( J I + (d − J) I + Σ_{i=1}^{J} m̃_i m̃_i' ) S^{1/2}    (A.32)
     = S^{1/2} ( d I + M̃ M̃' ) S^{1/2} = dS + M'M    (A.33)

Appendix B

Conjugate Analysis

B.1 Normal Wishart

Let the rows of Z (n × J) be i.i.d. Z_i | Σ ~ N(0, Σ), with Σ^{−1} ~ W(d_0, S_0). The joint distribution is given by

π(Z, Σ^{−1}) = (2π)^{−nJ/2} |Σ^{−1}|^{n/2} etr( −½ Z'Z Σ^{−1} ) × C_0^{−1} |Σ^{−1}|^{(d_0−J−1)/2} etr( −½ S_0^{−1} Σ^{−1} )    (B.1)

where C_0 = 2^{d_0 J/2} Γ_J(d_0/2) |S_0|^{d_0/2} is the normalizing constant of the prior, and d_0 and S_0 are the hyperparameters for the degrees of freedom and the scale, respectively. The prior is conjugate to the likelihood, therefore the posterior distribution is given by

π(Σ^{−1} | Z) ~ W(d_n, S_n)    (B.2)

with d_n = d_0 + n and S_n = (S_0^{−1} + Z'Z)^{−1}. The marginal likelihood is given in closed form by

p(Z) = C_n / [ C_0 (2π)^{nJ/2} ]    (B.3)
     = [ Γ_J(d_n/2) |S_n|^{d_n/2} ] / [ Γ_J(d_0/2) |S_0|^{d_0/2} π^{nJ/2} ]    (B.4)

In the case where we have multiple time points, the marginal likelihood decomposes as

p(Z_1, ..., Z_T) = p(Z_1, ..., Z_{T−1}) p(Z_T | Z_1, ..., Z_{T−1})    (B.5)
               = p(Z_1) Π_{t=2}^{T} p(Z_t | Z_{t−1})    (B.6)

In the special case where Σ_1^{−1}, ..., Σ_T^{−1} are independent and identically distributed (this is the case ρ = 0),

p(Z_1, ...
, Z_T) = Π_{t=1}^{T} p(Z_t)    (B.7)
       = Π_{t=1}^{T} Γ_J((d_0+n)/2) |(S_0^{−1} + Z_t'Z_t)^{−1}|^{(d_0+n)/2} / [ Γ_J(d_0/2) |S_0|^{d_0/2} π^{nJ/2} ]    (B.8)

In the other special case, where Σ_1^{−1} = ··· = Σ_T^{−1}, corresponding to the WAR case with ρ = 1, the time element becomes irrelevant: the data can be stacked, and the joint distribution of data and parameter can be written as

π(Z, Σ^{−1}) = (2π)^{−nTJ/2} |Σ^{−1}|^{nT/2} etr( −½ Z'Z Σ^{−1} ) × C_0^{−1} |Σ^{−1}|^{(d_0−J−1)/2} etr( −½ S_0^{−1} Σ^{−1} )    (B.9)

which again results in a Wishart, with hyperparameters d_n = d_0 + nT and S_n = (S_0^{−1} + Z'Z)^{−1}, where Z consists of the rows of Z_1, ..., Z_T stacked. The marginal likelihood in this case has the same form as (B.4), with the new hyperparameters.

B.2 Normal Wishart with Covariates

The multivariate linear model is defined by

Z = Xβ + ε    (B.10)

or, elementwise,

Z_{ij} = Σ_{k=1}^{p} x_{ik} β_{kj} + ε_{ij},  i = 1, ..., n,  j = 1, ..., J    (B.11)

where Z is an n × J matrix of continuous responses, X is an n × p design matrix observed without error, β is a p × J matrix of regression coefficients, and ε is an n × J matrix of Gaussian error terms. In this case, Z ~ MN_{n×J}(Xβ, Σ_J, I_n) follows the matrix-variate Gaussian density

f(Z | β, Σ) = (2π)^{−nJ/2} |I_n|^{−J/2} |Σ_J|^{−n/2} etr( −½ Σ_J^{−1} (Z − Xβ)' I_n^{−1} (Z − Xβ) )    (B.12)

where Σ is a J × J covariance matrix representing the association between the errors of the different elements of the response, and I_n is an n × n matrix representing the association between the different observations; it is the identity because observations are assumed to be independent and identically distributed.
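Before adding covariates, the basic conjugate update of Section B.1 (d_n = d_0 + n, S_n = (S_0^{−1} + Z'Z)^{−1}) can be sketched in a few lines. This is an illustrative Python translation on synthetic data (the thesis code is in Matlab; parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def normal_wishart_update(Z, d0, S0):
    """Conjugate update of Section B.1: with rows Z_i ~ N(0, Sigma) and
    prior Sigma^{-1} ~ W(d0, S0), the posterior of Sigma^{-1} is
    W(d0 + n, (S0^{-1} + Z'Z)^{-1})."""
    n = Z.shape[0]
    dn = d0 + n
    Sn = np.linalg.inv(np.linalg.inv(S0) + Z.T @ Z)
    return dn, Sn

# Sanity check on synthetic data: the posterior mean of the precision,
# dn * Sn, should approach the true precision matrix for large n.
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=50_000)
dn, Sn = normal_wishart_update(Z, d0=3, S0=np.eye(2))
```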
We use a conjugate prior for β which is also matrix-variate Normal, β | Σ_J ~ MN_{p×J}(0_{p×J}, Σ_J, Ψ_p), with a p × J mean matrix of zeros and a p × p hyperparameter matrix Ψ_p, representing respectively the prior belief about the location and the dispersion of the distribution of β, with density

(2π)^{−pJ/2} |Ψ|^{−J/2} |Σ|^{−p/2} etr( −½ Σ_J^{−1} β' Ψ_p^{−1} β )    (B.13)

and Σ_J ~ IW(2, I), where I is the identity matrix, using the Lauritzen parametrization as in the paper, proportional to

|Σ|^{−(2+2J)/2} etr( −½ Σ^{−1} )    (B.14)

We would like to compute the posterior distribution

π(Σ, β | Z) ∝ f(Z | Σ, β) π(β | Σ) π(Σ)    (B.15)

Putting it all together, we get

π(Σ, β | Z) ∝ |Σ|^{−(n+p+2+2J)/2} |Ψ|^{−J/2} etr( −½ Σ^{−1} [ (Z − Xβ)'(Z − Xβ) + β' Ψ_p^{−1} β + I_J ] )    (B.16)

The term (Z − Xβ)'(Z − Xβ) + β'Ψ^{−1}β + I inside the square brackets can be re-written, by completing the square, as

(β − M)' Ξ^{−1} (β − M) − M' Ξ^{−1} M + C    (B.17)

where M = Ξ X'Z, Ξ^{−1} = X'X + Ψ^{−1} and C = Z'Z + I_J. Therefore we are able to re-write equation (B.16) as

π(Σ, β | Z) ∝ |Σ|^{−(n+2+2J)/2} etr( −½ Σ^{−1} ( C − M'Ξ^{−1}M ) )    [∝ IW_J(Σ; 2+n, Z'Z + I − M'Ξ^{−1}M)]
             × |Σ|^{−p/2} |Ψ|^{−J/2} |Ξ|^{J/2} etr( −½ Σ^{−1} (β − M)' Ξ^{−1} (β − M) )    [∝ MN_{p×J}(β; M, Σ, Ξ)]

From this expression, we can see that we are able to integrate out β. This leads to

π(Σ | Z) ~ IW_J( 2 + n, Z'Z + I − M'Ξ^{−1}M )    (B.18)

and

π(β | Z, Σ) ~ MN_{p×J}( M, Σ, Ξ )    (B.19)

Appendix C

Heart Rate Data Results

C.1 Bolus Event 1 Data Results

Figure C.1: Results for bolus event 1. (a) The raw and centered heart rate data, with the bolus marking the boundary between the nociceptive and anti-nociceptive periods. (b) Posterior distribution of log(σ) in the nociception and anti-nociception phases. (c) Histogram and traceplot of the posterior distribution of ρ in the nociception phase (median 0.943, 95% CI [0.684, 0.994], acceptance probability 24.9%). (d) Histogram and traceplot of the posterior distribution of ρ in the anti-nociception phase (median 0.995, 95% CI [0.907, 1], acceptance
probability 23.8%).

Figure C.2: Boxplot of the posterior of ρ in the nociceptive (NC) and anti-nociceptive (ANC) states for bolus event 1.

C.2 Bolus Event 2 Data Results

Figure C.3: Results for bolus event 2. (a) The raw and centered heart rate data. (b) Posterior distribution of log(σ) in the nociception and anti-nociception phases. (c) Histogram and traceplot of the posterior distribution of ρ in the nociception phase (median 0.985, 95% CI [0.859, 0.999], acceptance probability 26%). (d) Histogram and traceplot of the posterior distribution of ρ in the anti-nociception phase (median 0.963, 95% CI [0.302, 0.999], acceptance probability 29.1%).

Figure C.4: Boxplot of the posterior of ρ in the nociceptive (NC) and anti-nociceptive (ANC) states for bolus event 2.
Bolus Event 3 Data Results C.3 Bolus Event 3 Data Results Heart Rate Data for Bolus Event 3 120 Bolus 100 80 60 Anti Nociceptive Period Nociceptive Period 40 20 0 −20 50 100 150 200 250 time 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 300 500 400 Median −− 0.979 95 CI −− [0.93,0.999] Acc. Prob −− 25.1% 200 Median −− 0.925 95 CI −− [0.145,0.997] Acc. Prob −− 37.3% 300 200 100 100 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.5: Results for bolus event 3 147 C.4. Bolus Event 4 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.6: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 3. C.4 Bolus Event 4 Data Results Heart Rate Data for Bolus Event 4 120 Bolus 100 80 60 Anti Nociceptive Period Nociceptive Period 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 2500 2000 2000 1500 Median −− 0.998 95 CI −− [0.974,1] Acc. Prob −− 35% 1000 Median −− 0.996 95 CI −− [0.838,1] Acc. 
Prob −− 25.1% 1500 1000 500 0 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.7: Results for bolus event 4 148 C.4. Bolus Event 4 Data Results Posterior of ρ 1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 NC ANC Figure C.8: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 4. 149 C.5. Bolus Event 5 Data Results C.5 Bolus Event 5 Data Results Heart Rate Data for Bolus Event 5 140 Bolus 120 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 800 500 400 600 Median −− 0.941 95 CI −− [0.317,0.998] Acc. Prob −− 29.9% 400 Median −− 0.782 95 CI −− [0.0694,0.999] Acc. Prob −− 59.1% 300 200 200 0 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.9: Results for bolus event 5 150 C.6. Bolus Event 6 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.10: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 5. 
C.6 Bolus Event 6 Data Results Heart Rate Data for Bolus Event 6 140 120 Bolus 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 250 300 time 350 400 450 500 550 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 400 1500 300 Median −− 0.926 95 CI −− [0.506,0.994] Acc. Prob −− 26.7% 200 500 100 0 Median −− 0.99 95 CI −− [0.819,1] Acc. Prob −− 21% 1000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.11: Results for bolus event 6 151 C.6. Bolus Event 6 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.12: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 6. 152 C.7. Bolus Event 7 Data Results C.7 Bolus Event 7 Data Results Heart Rate Data for Bolus Event 7 160 140 120 Bolus 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 600 400 300 Median −− 0.945 95 CI −− [0.667,0.994] Acc. Prob −− 28.7% 400 Median −− 0.902 95 CI −− [0.658,0.977] Acc. 
Prob −− 30.1% 200 200 100 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.13: Results for bolus event 7 153 C.8. Bolus Event 8 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 NC ANC Figure C.14: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 7. C.8 Bolus Event 8 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 8 160 3 140 2 Nociception 120 Bolus 100 1 0 −1 80 Anti Nociceptive Period Nociceptive Period 3 Anti−Nociception 60 40 20 0 2 1 0 −1 −20 50 100 150 200 time 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 600 600 Median −− 0.934 95 CI −− [0.694,0.984] Acc. Prob −− 23.5% 400 200 0 Median −− 0.99 95 CI −− [0.941,1] Acc. Prob −− 20% 400 200 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.15: Results for bolus event 8 154 C.8. Bolus Event 8 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 NC ANC Figure C.16: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 8. 155 C.9. 
Bolus Event 9 Data Results C.9 Bolus Event 9 Data Results Heart Rate Data for Bolus Event 9 120 Bolus 100 80 60 Anti Nociceptive Period Nociceptive Period 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 300 300 Median −− 0.897 95 CI −− [0.49,0.99] Acc. Prob −− 31.6% 200 100 0 Median −− 0.978 95 CI −− [0.904,0.998] Acc. Prob −− 20.5% 200 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.17: Results for bolus event 9 156 C.10. Bolus Event 10 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.18: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 9. C.10 Bolus Event 10 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 10 140 2 Bolus Nociception 120 100 1 0 80 −1 Anti Nociceptive Period Nociceptive Period 2 Anti−Nociception 60 40 20 0 1 0 −1 −20 50 100 150 200 time 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 1000 1000 800 800 Median −− 0.968 95 CI −− [0.782,0.998] Acc. Prob −− 24.6% 600 400 400 200 0 Median −− 0.986 95 CI −− [0.875,1] Acc. 
Prob −− 23.9% 600 200 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.19: Results for bolus event 10 157 C.10. Bolus Event 10 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.20: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 10. 158 C.11. Bolus Event 11 Data Results C.11 Bolus Event 11 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 11 140 3 Bolus Nociception 120 100 2 1 0 80 −1 Anti Nociceptive Period Nociceptive Period 3 Anti−Nociception 60 40 20 0 2 1 0 −1 −20 50 100 150 200 250 300 350 400 450 time time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 800 1500 600 Median −− 0.956 95 CI −− [0.714,0.995] Acc. Prob −− 21.7% 400 500 200 0 Median −− 0.998 95 CI −− [0.98,1] Acc. Prob −− 36.4% 1000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.21: Results for bolus event 11 159 C.12. 
Bolus Event 12 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 NC ANC Figure C.22: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 11. C.12 Bolus Event 12 Data Results Heart Rate Data for Bolus Event 12 120 Bolus 100 80 60 Anti Nociceptive Period Nociceptive Period 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 600 800 600 Median −− 0.949 95 CI −− [0.802,0.991] Acc. Prob −− 24.1% 400 Median −− 0.975 95 CI −− [0.825,0.999] Acc. Prob −− 25.2% 400 200 200 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.23: Results for bolus event 12 160 C.12. Bolus Event 12 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.24: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 12. 161 C.13. Bolus Event 13 Data Results C.13 Bolus Event 13 Data Results Heart Rate Data for Bolus Event 13 140 120 Bolus 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 400 400 300 300 Median −− 0.912 95 CI −− [0.768,0.971] Acc. Prob −− 24% 200 100 0 Median −− 0.9 95 CI −− [0.758,0.96] Acc. 
Prob −− 24.8% 200 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.25: Results for bolus event 13 162 C.14. Bolus Event 14 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 NC ANC Figure C.26: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 13. C.14 Bolus Event 14 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 14 140 Bolus 2 Nociception 120 100 1 0 80 −1 Anti Nociceptive Period Nociceptive Period 2 Anti−Nociception 60 40 20 0 1 0 −1 −20 50 100 150 200 time 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 2000 1000 800 1500 Median −− 0.989 95 CI −− [0.871,1] Acc. Prob −− 24.8% 1000 Median −− 0.984 95 CI −− [0.844,1] Acc. Prob −− 24.2% 600 400 500 0 200 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.27: Results for bolus event 14 163 C.14. Bolus Event 14 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 NC ANC Figure C.28: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 14. 164 C.15. 
Bolus Event 15 Data Results C.15 Bolus Event 15 Data Results Heart Rate Data for Bolus Event 15 120 Bolus 100 80 60 Anti Nociceptive Period Nociceptive Period 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 250 4000 200 3000 Median −− 0.987 95 CI −− [0.947,0.999] Acc. Prob −− 30.9% 150 100 1000 50 0 Median −− 0.996 95 CI −− [0.316,1] Acc. Prob −− 27% 2000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.29: Results for bolus event 15 165 C.16. Bolus Event 16 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.30: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 15. C.16 Bolus Event 16 Data Results Heart Rate Data for Bolus Event 16 140 Bolus 120 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Anti−Nociceptive Period Nociceptive Period 300 1500 Median −− 0.966 95 CI −− [0.734,1] Acc. Prob −− 21% 1000 100 500 0 Median −− 0.895 95 CI −− [0.735,0.962] Acc. 
Prob −− 27.6% 200 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.4 0.5 0.6 0.7 0.8 0.9 1 0.2 0.2 0 0.3 Anti−Nociceptive Period Nociceptive Period 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.31: Results for bolus event 16 166 C.16. Bolus Event 16 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 NC ANC Figure C.32: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 16. 167 C.17. Bolus Event 17 Data Results C.17 Bolus Event 17 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 17 3 140 Bolus 2 Nociception 120 100 1 0 80 −1 Anti Nociceptive Period Nociceptive Period 3 Anti−Nociception 60 40 20 0 2 1 0 −1 −20 50 100 150 200 time 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 400 400 300 300 Median −− 0.994 95 CI −− [0.966,1] Acc. Prob −− 22.4% 200 100 0 Median −− 0.921 95 CI −− [0.58,0.994] Acc. Prob −− 31.9% 200 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.33: Results for bolus event 17 168 C.18. 
Bolus Event 18 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.34: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 17 C.18 Bolus Event 18 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 18 140 3 2 Nociception 120 Bolus 100 1 0 80 −1 Anti Nociceptive Period Nociceptive Period 3 Anti−Nociception 60 40 20 0 2 1 0 −1 −20 50 100 150 200 time 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 300 300 Median −− 0.884 95 CI −− [0.616,0.981] Acc. Prob −− 33.8% 200 100 0 Median −− 0.993 95 CI −− [0.965,1] Acc. Prob −− 27.6% 200 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.35: Results for bolus event 18 169 C.18. Bolus Event 18 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 NC ANC Figure C.36: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 18. 170 C.19. Bolus Event 19 Data Results C.19 Bolus Event 19 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 19 3 120 2 Nociception Bolus 100 80 1 0 −1 60 Anti Nociceptive Period Nociceptive Period 3 2 Anti−Nociception 40 20 0 1 0 −1 −20 50 100 150 200 250 300 350 400 450 time time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 400 500 400 300 Median −− 0.893 95 CI −− [0.69,0.967] Acc. 
Prob −− 25.5% 200 Median −− 0.962 95 CI −− [0.794,0.996] Acc. Prob −− 24.7% 300 200 100 0 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.37: Results for bolus event 19 171 C.20. Bolus Event 20 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 NC ANC Figure C.38: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 19. C.20 Bolus Event 20 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 20 140 Bolus 2 Nociception 120 100 1 0 80 −1 Anti Nociceptive Period Nociceptive Period 2 Anti−Nociception 60 40 20 0 1 0 −1 −20 50 100 150 200 250 300 350 400 time time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 300 1500 Median −− 0.878 95 CI −− [0.4,0.972] Acc. Prob −− 33% 200 100 0 Median −− 0.96 95 CI −− [0.312,0.998] Acc. Prob −− 25.8% 1000 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.39: Results for bolus event 20 172 C.20. 
Bolus Event 20 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.40: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 20. 173 C.21. Bolus Event 21 Data Results C.21 Bolus Event 21 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 21 4 140 3 Bolus Nociception 120 100 2 1 0 80 Anti Nociceptive Period Nociceptive Period 3 Anti−Nociception 60 −1 4 40 20 0 2 1 0 −1 −20 50 100 150 200 250 300 time 350 400 450 500 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 250 300 200 Median −− 0.886 95 CI −− [0.55,0.993] Acc. Prob −− 31.1% 150 Median −− 0.887 95 CI −− [0.36,0.985] Acc. Prob −− 33.2% 200 100 100 50 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 1 0.8 0.6 0.6 0.4 0.4 0.2 0.2 500 1000 1500 2000 2500 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 0.8 0 0.3 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.41: Results for bolus event 21 174 C.22. Bolus Event 22 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.42: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 21. C.22 Bolus Event 22 Data Results Heart Rate Data for Bolus Event 22 120 Bolus 100 80 60 Anti Nociceptive Period Nociceptive Period 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 500 400 400 300 Median −− 0.988 95 CI −− [0.934,0.999] Acc. Prob −− 24.3% 300 Median −− 0.995 95 CI −− [0.982,0.999] Acc. 
Prob −− 28.9% 200 200 100 100 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.43: Results for bolus event 22 175 C.22. Bolus Event 22 Data Results Posterior of ρ 1 0.98 0.96 0.94 0.92 0.9 0.88 0.86 0.84 0.82 0.8 NC ANC Figure C.44: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 22. 176 C.23. Bolus Event 23 Data Results C.23 Bolus Event 23 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 23 140 2 1 Nociception Bolus 120 100 80 0 −1 Anti Nociceptive Period Nociceptive Period 2 Anti−Nociception 60 40 20 0 1 0 −1 −20 50 100 150 200 250 time 300 350 400 450 500 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Anti−Nociceptive Period Nociceptive Period 1000 500 800 400 Median −− 0.964 95 CI −− [0.305,0.999] Acc. Prob −− 24.7% 600 200 400 100 200 0 Median −− 0.937 95 CI −− [0.502,0.994] Acc. Prob −− 27.7% 300 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 500 1000 1500 2000 2500 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period Nociceptive Period 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.45: Results for bolus event 23 177 C.24. 
Bolus Event 24 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 NC ANC Figure C.46: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 23. C.24 Bolus Event 24 Data Results Heart Rate Data for Bolus Event 24 140 Bolus 120 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 2500 2500 2000 2000 Median −− 0.998 95 CI −− [0.939,1] Acc. Prob −− 20.1% 1500 1000 1000 500 0 Median −− 0.994 95 CI −− [0.863,1] Acc. Prob −− 22.9% 1500 500 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.47: Results for bolus event 24 178 C.24. Bolus Event 24 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 0.4 NC ANC Figure C.48: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 24. 179 C.25. Bolus Event 25 Data Results C.25 Bolus Event 25 Data Results Heart Rate Data for Bolus Event 25 140 Bolus 120 100 80 Anti Nociceptive Period Nociceptive Period 60 40 20 0 −20 50 100 150 200 time 250 300 350 400 (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 5000 4000 4000 3000 Median −− 1 95 CI −− [0.917,1] Acc. Prob −− 27.7% 2000 Median −− 0.999 95 CI −− [0.758,1] Acc. 
Prob −− 25% 3000 2000 1000 0 1000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.49: Results for bolus event 25 180 C.26. Bolus Event 26 Data Results Posterior of ρ 1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 NC ANC Figure C.50: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 25. C.26 Bolus Event 26 Data Results log (σ) vs Time Heart Rate Data for Bolus Event 26 120 4 Bolus Nociception 100 80 60 1 4 Anti−Nociception 40 2 0 Anti Nociceptive Period Nociceptive Period 3 20 0 3 2 1 0 −20 50 100 150 200 time 250 300 350 400 time (a) The raw and centered heart rate (b) Posterior distribution of log(σ) in data the nociception and anti-nociception phase Nociceptive Period Anti−Nociceptive Period 400 4000 300 3000 Median −− 0.999 95 CI −− [0.994,1] Acc. Prob −− 25.3% 200 100 0 Median −− 0.999 95 CI −− [0.587,1] Acc. Prob −− 36.4% 2000 1000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 Nociceptive Period 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Anti−Nociceptive Period 1 0.2 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 (c) Histogram and traceplot of the pos- (d) Histogram and traceplot of the terior distribution of ρ in the nocicep- posterior distribution of ρ in the notion phase ciception phase Figure C.51: Results for bolus event 26 181 C.26. Bolus Event 26 Data Results Posterior of ρ 1 0.9 0.8 0.7 0.6 0.5 NC ANC Figure C.52: Boxplot of the posterior of ρ in the nociceptive state (NC) and anti-nociceptive (ANC) for bolus event 26. 182 C.27. 
C.27 Bolus Event 27 Data Results

[Figure C.53: Results for bolus event 27. (a) The raw and centered heart rate data, with the bolus and the nociceptive and anti-nociceptive periods marked. (b) Posterior distribution of log(σ) in the nociception and anti-nociception phases. (c) Histogram and traceplot of the posterior distribution of ρ in the nociception phase: median 0.996, 95% CI [0.989, 0.999], acceptance probability 23.5%. (d) Histogram and traceplot of the posterior distribution of ρ in the anti-nociception phase: median 0.965, 95% CI [0.849, 0.998], acceptance probability 27.7%.]

[Figure C.54: Boxplot of the posterior of ρ in the nociceptive (NC) and anti-nociceptive (ANC) states for bolus event 27.]

C.28 Bolus Event 28 Data Results

[Figure C.55: Results for bolus event 28. (a) The raw and centered heart rate data. (b) Posterior distribution of log(σ) over time in the nociception and anti-nociception phases. (c) Histogram and traceplot of the posterior distribution of ρ in the nociception phase: median 0.979, 95% CI [0.866, 0.998], acceptance probability 28.2%. (d) Histogram and traceplot of the posterior distribution of ρ in the anti-nociception phase: median 0.999, 95% CI [0.991, 1], acceptance probability 25.9%.]

[Figure C.56: Boxplot of the posterior of ρ in the nociceptive (NC) and anti-nociceptive (ANC) states for bolus event 28.]

C.29 Bolus Event 29 Data Results

[Figure C.57: Results for bolus event 29. (a) The raw and centered heart rate data. (b) Posterior distribution of log(σ) over time in the nociception and anti-nociception phases. (c) Histogram and traceplot of the posterior distribution of ρ in the nociception phase: median 0.996, 95% CI [0.443, 1], acceptance probability 51%. (d) Histogram and traceplot of the posterior distribution of ρ in the anti-nociception phase: median 0.999, 95% CI [0.44, 1], acceptance probability 32.8%.]

[Figure C.58: Boxplot of the posterior of ρ in the nociceptive (NC) and anti-nociceptive (ANC) states for bolus event 29.]

C.30 Bolus Event 30 Data Results

[Figure C.59: Results for bolus event 30. (a) The raw and centered heart rate data. (b) Posterior distribution of log(σ) over time in the nociception and anti-nociception phases. (c) Histogram and traceplot of the posterior distribution of ρ in the nociception phase: median 0.993, 95% CI [0.117, 0.999], acceptance probability 28%. (d) Histogram and traceplot of the posterior distribution of ρ in the anti-nociception phase: median 0.971, 95% CI [0.0691, 0.998], acceptance probability 37.6%.]

[Figure C.60: Boxplot of the posterior of ρ in the nociceptive (NC) and anti-nociceptive (ANC) states for bolus event 30.]
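The figures above report three summaries of each MCMC chain for ρ: the posterior median, an equal-tailed 95% credible interval, and the Metropolis-Hastings acceptance probability. As a minimal sketch of how such summaries might be computed from a stored chain (the `summarize_chain` helper and the synthetic chain are illustrative, not code from the thesis; the thesis's own implementation is not shown here):

```python
import random
import statistics

def summarize_chain(rho):
    """Posterior median, equal-tailed 95% credible interval, and
    Metropolis-Hastings acceptance probability for an MCMC chain of rho,
    matching the kind of summaries reported in Figures C.53-C.59."""
    s = sorted(rho)
    n = len(s)
    median = statistics.median(s)
    # Equal-tailed interval: 2.5th and 97.5th empirical percentiles.
    lo = s[int(0.025 * (n - 1))]
    hi = s[int(0.975 * (n - 1))]
    # The chain "moved" whenever consecutive draws differ (accepted proposal).
    moves = sum(1 for a, b in zip(rho, rho[1:]) if a != b)
    acc_prob = moves / (n - 1)
    return median, (lo, hi), acc_prob

# Illustrative use on a synthetic chain (not the thesis data):
random.seed(0)
chain = [min(1.0, max(-1.0, random.gauss(0.95, 0.02))) for _ in range(4500)]
med, ci, acc = summarize_chain(chain)
```

For a chain concentrated near ρ = 1, as in most bolus events, the median sits close to 1 while a low acceptance probability would signal poor mixing, which is why the traceplots are reported alongside the histograms.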
Item Metadata
Title | Modeling dependencies in multivariate data
Creator | Talhouk, Aline
Publisher | University of British Columbia
Date Issued | 2013
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2013-08-03
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0074025
URI | http://hdl.handle.net/2429/44748
Degree | Doctor of Philosophy - PhD
Program | Statistics
Affiliation | Science, Faculty of; Statistics, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2013-11
Campus | UBCV
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository | DSpace