Formal and Informal Approaches to Adjusting for Exposure Mismeasurement by Tian Shen B.Sc., The University of New Brunswick, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Statistics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August 2009 © Tian Shen, 2009 Abstract In many research areas, measurement error frequently occurs when investigators are trying to analyze the relationship between exposure variables and response variable in observational studies. Severe problems can be caused by the mismeasured exposure vari ables, such as loss of power, biased estimators, and misleading conclusions. As the idea of measurement error is adopted by more and more researchers, how to adjust for such error becomes an interesting point to study. Two big barriers in solving the problems are as follows. First, the mechanism of measurement error (the existence and magnitude of the error) is always unknown to researchers. Sometimes only a small piece of informa tion is available from previous studies. Moreover, the situation can be worsen when the study conditions are changed in the present study, which makes previous information not applicable. Second, some researchers may still argue about the consequences of ignoring measurement error due to its uncertainness. Thus, the adjustment for the mismeasure ment turn to be a difficult, or impossible task. In this thesis, we are studying situations where the binary response variable is pre cisely measured, but with a misclassified binary exposure or a mismeasured continuous exposure. We propose formal approaches to estimate unknown parameters under the non-differential assumption in both exposure conditions. The uncertain variance of mea surement error in the continuous exposure case, or the probabilities of misclassification in the binary exposure case, are incorporated by our approaches. Then the posterior models are estimated via simulations generated by the Gibbs sampler and the Metropo lis - Hasting algorithm. ii Abstract Meanwhile, we compare our formal approach with the informal or naive approach in both continuous and exposure cases based on simulated datasets. Odds ratios on log scales are used in comparisons of formal and informal approaches when the exposure variable is binary or continuous. General speaking, our formal approaches result in bet ter point estimators and less variability in estimation. Moreover, the 95% credible, or confidence intervals are able to capture the true values more than 90% of the time. At the very end, we apply our ideas on the QRS dataset to seek consistent conclu sions draws from simulated datasets and real world datasets, and we are able to claim that overall our formal approaches do a better job regardless of the type of the exposure variable. 111 Table of Contents Abstract ii Table of Contents iv List of Tables vi List of Figures x Acknowledgements xiv Dedication xv 1 Introduction 1 1.1 Misclassification and Measurement Error 1 1.2 Overview of Current Available Methods 2 1.3 Bayesian Analysis 4 1.3.1 Bayes Rule 4 1.3.2 Prior Distribution 5 1.3.3 Markov Chain Monte Carlo Algorithm 6 2 Simulation Study for Categorical Exposure Variable 8 2.1 Introduction 8 2.2 2 x 3 Table—Formal Analysis 8 2.3 Informal Analysis 10 2.4 Odds Ratio 11 iv Table of Contents Case 1: When We Know PjJS Case 2: When PijS are Unknown Case 3: Validation Data Results 2.8.1 Results for Case 1 2.8.2 Results for Case 2 Results for Case 3 Comparison of Odds Ratios 3 Simulation Study for Continuous Exposure 3.1 Introduction 3.2 Posterior and Prior Distributions 3.3 Case 1: When We Know j2 3.4 Case 2: When We Don’t Know u2 3.5 Case 3: When We Have Validation Data 3.6 Results 3.6.1 Results for Case 1 3.6.2 Results for Case 2 3.6.3 Results for Case 3 . . 3.7 Comparison of Three Approaches 4747495053545455606671 4 QRS Data Study 4.1 Naive Approach and Informal Approach 4.2 Formal Approach 4.3 Results 5 Conclusion and Future Work 85 2.5 2.6 2.7 2.8 2.9 2.10 13 14 16 17 17 24 32 44 Variable 77 79 80 84 Bibliography 88 V List of Tables 2.1 2x3 table due to the misclassification and measurement error 9 2.2 A 2x 3 table for formal analysis 9 2.3 2x2 table by epidemiologists’ rule 10 2.4 A 2x 2 table for informal analysis 11 2.5 Validation data and main data 16 2.6 True values, posterior means, 95% credible intervals of r0 and r1. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1 20 2.7 Bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and r1 for scenario 1 case 1. All results are based on 100 datasets, and their true values are 0.2 and 0.25 respectively 21 2.8 True values, posterior means, 95% credible intervals of r0 and r1. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 1 22 2.9 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and rl for scenario 2 case 1. All results are based on 100 datasets, and their true values are 0.7 and 0.4 respectively 23 2.10 True values, posterior means, 95% credible intervals of r0,r1 and pjs. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1 26 vi List of Tables 2.11 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and ri for scenario 1 case 2. All results are based on 100 datasets, and their true values are r0 = 0.2, ri = 0.25, (Poo,Poi, P02) = (0.5, 0.3, 0.2) and (pio,pil,pi2) = (0.1,0.3, 0.6) 29 2.12 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of TO and ri for scenario 1 case 2. All results are based on 100 datasets, and their true values are ro = 0.7,r1 = 0.4, (poo,poi,po2) (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3, 0.6) 30 2.13 True values, posterior means, 95% credible intervals of r0, r1 and pjjs. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 3. Validation size —200 36 2.14 True values, posterior means, 95% credible intervals of r0,r1 and jj5. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 3. Validation size =100 36 2.15 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of r0 and r1 for scenario 1 case 3. All results are based on 100 datasets, and their true values are r0 = 0.2, r1 = 0.25, (poo,poi,p02) (0.5, 0.3, 0.2) and (plo,pii,p12) = (0.1,0.3,0.6). Validation data=200. . 37 2.16 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of TO and ri for scenario 2 case 3. All results are based on 100 datasets, and their true values are r0 = 0.2, r1 = 0.25, (p00, Poi ,Po2) (0.5, 0.3,0.2) and (pio,pi1,p12) = (0.1,0.3,0.6). Validation data=100. . . 37 2.17 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of To and Ti for scenario 3 case 3. All results are based on 100 datasets, and their true values are r0 = 0.7,r1 = 0.4, (poo,poi,p02) = (0.5,0.3,0.2) and (pio,pii,p12) = (0.1,0.3,0.6). Validation data=200. . 41 vii List of Tables 2.18 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and ri for scenario case 3. All results are based on 100 datasets, and their true values are r0 = 0.7,r1 0.4, (poo,pol,po2) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation data=100 . . 41 2.19 Estimated bias, mean square error (MSE), coverage of 95 % confidence intervals and average 95%CI width of informal and formal log odds ratios for scenario 1 (or 2) in three cases. True log odds ratio is 0.28 46 3.1 True values, posterior means, 95% credible intervals of1i,A2,30.These are res’ults from the first study in case 1 58 3.2 Estimated bias, mean square error (MSE), coverage of 95% CI and the average width of, A2,i3o, i for case 1. All results are based on 100 datasets. 60 3.3 True values, posterior means, 95% credible intervals of u, A2,/3o, i3i and o.2 These are results from the first study in case 2. The “true” values are:=0,A=1,8o=—1.5and31 64 3.4 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ,u, A2, i3o, /3i and for case 2. All results are based on 100 datasets. The “true” values are: i = 0,X2 = 1,/3o = —1.5 and /3i = 1.5 66 3.5 True values, posterior means, 95% credible intervals of .t, A2, /3o, 3i and a2. These are results from the first study in case 3 69 3.6 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of u, A2, i3o, i3 and a2 for case 3. All results are based on 100 datasets with validation size 50. The “true” values are: = 0, A2 = l,/3zz—1.5 and/31—_zl.5 71 3.7 Average of posterior means and 95% confidence intervals for the average posterior means of j3o and i3 for “naive”, “informal” and “formal” ap proaches. Results are based on 100 samples in case 1, 2 and 3 76 viii List of Tables 4.1 Estimators, 95% confidence, or credible, intervals of /3o and f3i by using “naive”, “informal” and “formal” approaches 84 ix List of Figures 2.1 Traceplots of ro and r1 from MCMC algorithm in scenario 1 case 1. The traceplots show the 20000 iterations after 1000 burn-in period. The true values of ro and r1 are 0.2 and 0.25 respectively 19 2.2 Histogram of 100 posterior means of ro and r in the second simulation study for scenario I case 1. The “true” values of ro and r1 are 0.2 and 0.25 respectively 21 2.3 Histogram of 100 posterior means of r0 and r1 in the second simulation study for scenario 2 case 1. The true values of r0 and r1 are 0.7 and 0.4 respectively 23 2.4 Density plots of “true” pj values with its corresponding Beta density func tion. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (55,45), Beta(30, 70), Beta (15, 85); Beta(10, 80), Beta(25, 80) and Beta(65,35) 25 2.5 Traceplots of r0, r1 and Pij5 from MCMG algorithm in scenario 1 case 2. The traceplots show the 50000 iterations after 1000 burn-in period. The “true” values are ro = 0.2,ri =O.25,(poo,pol,po2) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3, 0.6) 27 2.6 Histogram of 100 posterior means of r0,r1 and PijS in the second simu lation study for scenario 1 case 2. The “true” values are r0 = 0.2, r1 = O.25,(poo,pol,p02) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3,0.6). . 28 x List of Figures 2.7 Histogram of 100 posterior means of r0,r1 and PijS in the second simu lation study for scenario 1 case 2. The “true” values are ro = 0.7, r1 = O.4,(poo,pol,po2) = (0.5,0.3,0.2) and (Plo,p11,p12) (0.1,0.3,0.6). . . 31 2.8 Density plots of “true” pj,j values with its corresponding Beta density func tion. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (1,2), Beta(1,2), Beta(1, 2); Beta(1, 2), Beta(1, 2) and Beta(1,2) 33 2.9 Traceplots of ro, ri and PijS from MGMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r0 = 0.2,r1 = 0.25, (poo,pol,po2) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation size =200 34 2.10 Traceplots of r0, r1 and PijS from MCMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r0 = 0.2,r1 = 0.25, (poo,pol,p02) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation size =100 35 2.11 Histogram of 100 posterior means of r0,r1 and jJS in the second simu lation study for scenario 1 case 3. The “true” values are r0 = 0.2, r1 = O.25,(poo,pol,p02) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Val idation size =200 39 2.12 Histogram of 100 posterior means of r0,r1 and PijS in the second simu lation study for scenario 2 case 3. The “true” values are r0 = 0.2, r1 = O.25,(poo,po1,p02) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Val idation size =100 40 2.13 Histogram of 100 posterior means of ro, r and PuS in the second simu lation study for scenario 3 case 3. The “true” values are r0 0.7, r1 = O.4,(poo,pol,po2) = (0.5,0.3,0.2) and (Plo,P11,P12) = (0.1,0.3,0.6). Vali dation size =200 42 xi List of Figures 2.14 Histogram of 100 posterior means of ro, ri and PijS in the second simu lation study for scenario 4 case 3. The “true” values are r0 = 0.7, r1 = O.4,(poo,pol,po2) = (0.5,0.3,0.2) and (p1o,pll,p12) = (0.1,0.3,0.6). Vali dation size =100 43 2.15 Comparisons of log informal odds ratios with log formal odds ratios for scenario 1 in case 1 (upper left), scenario 1 in case 2 (upper right), scenario 1 & 2 in case 3 (lower left and right) . The true log odds ratio value is 0.285. The line in the each graph plots if two odds ratios are the same (y=x line). The dash lines represent the “true” value 45 3.1 Traceplots of/30,/31X2from MCMC algorithm in case 1. The trace- plots show the 5000 iterations after 1000 burn-in period 56 3.2 Traceplots of , from MCMC algorithm in case 1. The traceplots show the 5000 iterations after 1000 burn-in period 57 3.3 Histograms of 100 posterior means for ,t, A2, i3o, i3 in the second study in case 1. The true values are u = 0, A2 = = —1.5 and i3 = 1.5 . . 59 3.4 Density plot of Inverse Gamma distribution with hyper-parameters: c = 200 and /3 = 50. The vertical line is “true” value of the a2 = 0.25 . . . 61 3.5 Traceplots of/30,/1X2from MCMC algorithm in case 2. The trace- plots show the 5000 iterations after 1000 burn-in period 62 3.6 Traceplots of 1u, A2 and .2 from MCMC algorithm in case 2. The traceplots show the 5000 iterations after 1000 burn-in period 63 3.7 Histograms of 100 posterior means for i,A2,j301 and u2in the second study in case 2. The “true” values are: = 0, A2 = 1, i3o = —1.5 and = 1.5 65 3.8 Traceplots of/30,13X2from MCMC algorithm in case 3. The trace plots show the 5000 iterations after 1000 burn-in period 67 3.9 Traceplots of, A2 and a2 from MCMC algorithm in case 3. The traceplots show the 5000 iterations after .1000 burn-in period 68 xii List of Figures 3.10 Histograms of 100 posterior means for ji, A2,/3o, 3i and 2in the second study in case 3. The validation size is 50. The “true” values are: z = O,A2=1,/30—1.5 and31=1.5 70 3.11 Pairwise plots of three approaches, “naive”, “informal” and “formal”, in estimating /3o and j3 under case 1. The solid line is if “y x “, and the dash lines are the corresponding true values 73 3.12 Pairwise plots of three approaches , “naive”, “informal” and “formal”,in estimating /3i and i3i under case . The solid line is if ‘y = x “, and the dash lines are the corresponding true values 74 3.13 Pairwise plots of three approaches, “naive”, “informal” and “formal”, in estimating /3o and /3i under case 3. The solid line is if “y = x “, and the dash lines are the corresponding true values 75 4.1 Prior plots of unknown parameters with their hyper-parameters: N(0,1002),A IG(0.01,0.01),cr2 c—’ IG(100,0.1),/3 “-i N(0,1002)and N(0, 12). The vertical lines are the corresponding “true” values as: = 4.64, A2 0.094, = —0.90, /3 = 0.76 and g2 = 0.00096 81 4.2 Traceplot and posterior density plots of p0000 iterations after 1000 burn-in period of /3o and /3i when applying the MH sampling method 83 3(111 Acknowledgements First of all, I would like to give special thanks to my supervisor, Professor Paul Gustafson, since without his thoughtful support and guidance, I would not able to complete this the sis. It was an honor to work with him and he is the best supervisor that one could ever ask for. I would also thank Professor Lang Wu, for agreeing to be my second reader and also for his excellent teaching throughout my study period. I express sincere gratitude to Professors John Pekau, Ruben Zamar, Arnaud Doucet, Michael Schuizer, Jiahua Cheri and Matias Salibián-Barrera for their constant support and outstanding teaching. I am grateful to everyone, who makes the department a wonderful place to study. Spe cial thanks to Mike Danilov, Ehsan Karim, Derrick Lee, Reman Epstein, Xu Chen , Liu Zhong and Xu Wang for their help and friendship, and to Peggy Ng, Viena Tran, Elaine Salameh for their support and understanding of my work. Tian Shen Vancouver, BC, July 2009 xiv To My lovely Parents xv Chapter 1 Introduction In many research areas, statistical methods are used to analyze the relationships between two or more variables. For example, in the epidemiology area, statistical models are used to understand or study the relationship between an outcome variable Y and an explanatory variable X. For instance Y can be presence or absence of heart disease, and X can be presence or absence of smoking, where Y and X are both binary variables, or X can be the blood pressure, which is a continuous variable. There are four types of designs that are most often applied, which are the cross-sectional study, cohort study, case-control study and randomized controlled clinical trial. In this thesis, we are focused on the case-control study, which is also referred to as a retrospective study. We randomly select subjects from “case” and “control” groups, then compare the health outcomes for the two groups based on selected subjects. The explanatory variable X is often measured by some instruments. When X is precisely measured, the instrument is called a gold standard. However, due to the high cost and lack of such precise instrument, X is often measured imprecisely. If X is a categorical variable, imprecise measurements are called misclassifications, while if X is a continuous variable, they are called measurement errors. In this thesis, we are working on both discrete and continuous cases of X. 1.1 Misclassification and Measurement Error Generally speaking, misclassification means grouping a subject into a wrong category For example, a person who smoked one pack of cigarette in the day of the experiment might be accidently grouped as a heavy smoker while this person might barely smoke in other days. Thus, rather than recording X itself, a surrogate variable X* is often 1 1.2. Overview of Current Available Methods recorded instead. The misclassification probability (say Pu) defines the probability of classifying a subject into group i while its true status is in group j. When X is a continuous variable, by the definition of the classical error model in the measurement error literature, a surrogate variable X* is the linear term of X plus an er ror term. For example, a recorded patient’s blood pressure might be higher than his/her true values due the equipment error. As defined in Chapter 3 in this thesis, X = X + Z, where Z is the error term and E(ZIX)=0. Measurement error and misclassification can be categorized into differential and non- differential cases. If the distribution of the surrogate variable X* only depends on the true value X but not the health outcome Y, then the mismeasurement is classified as non-differential. Otherwise, the mismeasurement is categorized as differential. Due to the misclassification and measurement error , usually we only have precisely mea sured health outcome Y , the surrogate variable X* and some other precisely measured covariates U in the data. Since the goal of most studies is to understand the relationship between X and Y, conclusions obtained from X* and Y instead could be very misleading. Thus, the study of measurement error and misclassification is significantly important. 1.2 Overview of Current Available Methods In the literature, especially in the biomedical research, many methods were proposed to deal with misclassification and measurement error. Barron (1977) proposed the matrix method that estimate the expectation of cell counts by using their margins, and the odds ratio can be estimated later on based on the cell counts. By reparameterizing the misclassification, Marshall (1990) presented the inverse matrix method to retrieve the odds ratio. However, Lyles (2002) pointed out that the inverse matrix method is just a maximum likelihood estimate method that corrects the biased odds ratio due to misclas sification. The efficiency of the matrix method and inverse matrix method were compared under the assumption of differential misclassification (the degree of measurement error is different across different groups) by Morrissey and Spiegelman (1999). They concluded 2 1.2. Overview of Current Available Methods that the inverse matrix is more efficient than the matrix method; nevertheless, the sensi tivity, specificity, and probability of exposure are some key determinants of the efficiency. Later, other methods like simulation extrapolation (SIMEX) and logistic regression model are also developed to approach the misclassification problem (Kuchenhoff, Mwalili, and Lesaifre, 2006; Skrondal and Rabe-Hesketh, 2004). With the improvement of computa tional capability of computers and enhanced simulation techniques, the Bayesian analysis becomes another prospective method to study the misclassification problem. Carroll, Ruppert, Stefanski, and Cainiceanu (2006) grouped methods that deal with measurement error into functional and structural models, where X is considered as fixed or random with minimum assumptions of distributions in the functional models while X is considered as random variables in the structud models. Two general methods used in the functional model category are the regression calibration (Pierce and Kellerer, 2004) and SIMEX (Cook and Stefanski, 1994) methods. Carroll, Ruppert, Stefanski, and Cainiceanu (2006) stated that even though these two methods are very simple, they are only consistent in some special cases such as linear regression and they have limited contributions in reducing the bias caused by the measurement error. Disregarding some limitations of the regression calibration and SIMEX methods, they are still widely used since both of them are very easy to implement by using the existing software packages, which reduces the potential difficulties in the analysis set-up part. In the structural method category, the Expectation-Maximization algorithm in the frequentist perspec tive and Markov chain Monte Carlo algorithm in the Bayesian perspective are two useful algorithms to solve the measurement error problems. Bayesian methods are capable of dealing with biases induced by both misclassification and measurement error. One great advantage of Bayesian methods is that they can fully incorporate the uncertainties of parameters. Though fully specified models are of ten required, this kind of information plus some knowledge of mismeasurement are often 3 1.3. Bayesian Analysis available to medical researchers before they conduct studies. When an observed dataset is available, the natural existence of prior information make Bayesian analysis an appealing method, since inference now can be made through the prior and present information. In this thesis, we are focus on using the Gibbs sampler and Metropolis-Hastings algorithm in Bayesian methods to solve the misclassification and measurement error problems. 1.3 Bayesian Analysis Bayesian inference is statistical inference in which data are used to update or to newly infer quantities that are observed or wished to learn by using probability model. The “combination” of Markov Chain Monte Carlo (MCMC) algorithm and Bayesian infer ences is often used in solving the mismeasurement related problems. 1.3.1 Bayes Rule To understand the Bayesian analysis, it is essential to understand the fundamental prin ciple of the analysis - Bayes rule. Assume there is a independently and identically dis tributed dataset y (yl, Y2, ..., y) with unknown parameter 8 and following distribution fe. Bayesian analysis is to conclude a parameter 8 based on the observed data, so we denote the conditional distribution on the observed data as f (8 y), which is called the posterior distribution. Meanwhile, the sampling distribution of the data set is: f(v18) = ]fe(y) Then, a joint probability distribution of 6 and y can be defined as: f(8,y) = f(8) x f(y18) 4 1.3. Bayesian Analysis where f(8) refers the prior distribution of 8. Applying the basic property of conditional distribution (Bayes’ rule), the posterior distribution turns to: f(I ) — f(8,y) — f(9) x f(yI) (11)Y-f() - f() Since the f(y) is a constant term, independent of the parameter 0, we can express the unnormalized posterior density as f(Iy) cx f(0) x f(yIO). To estimate the parameter 0, we can calculate the expected value of 8 under the posterior density (Gelman et al., 2004) by taking the integral: 18 0 dO fOf(O)f(y0)dO0 = j f( ff(0)f(yIO)dO If there is a closed form of the posterior distribution, the estimation of 0 would be very easy to carry out. However, sometimes there is no closed form of the posterior density, we have to use the other numerical methods, such as Markov Chain Monte Carlo algorithms. 1.3.2 Prior Distribution As an “absent” term in classical statistical analysis, the prior distribution plays a major role in the Bayesian analysis. There are mainly two types of prior distributions: infor mative priors and non-informative priors. A non-informative prior also can be called a “fiat” prior that has a limited impact on the posterior distribution. Researchers use non-informative priors usually because they don’t have very much knowledge about the parameter in the previous study or they don’t want the inference to be greatly affected by some external sources. Moreover, the determination of “flat” prior is not trivial since everyone has a different definition of “fiat”. The informative prior, information that gained from previous study ,on the other hand, may provide a strong influence on the posterior distribution. However, the influence is somehow related to the sample size, number of MCMC iterations and the form of the prior. In most cases, strong priors are not necessary when the sample size is big. 5 1.3. Bayesian Analysis Nevertheless, in some cases, the priors are needed to be strong regardless of the sample size. 1.3.3 Markov Chain Monte Carlo Algorithm The computation of the posterior distribution is always a problem since the density of the posterior sometimes is very complicated or even high dimensional. Because of this problem, the usage of Bayesian analysis has been very limited. When the MCMC algo rithm was first applied in 1990, the limitation of Bayesian analysis then vanished and Bayesian methods becomes increasingly popular. In the following two sections, two particular MCMC algorithms are introduced: Metropolis- Hastings algorithm (Greenberg and Chib, 1995) and Gibbs sampler (Casella and George, 1992). Metropolis-Hastings Algorithm The Metropolis-Hastings (MH) algorithm is an iteration method for generating a se quence of samples from a probability distribution when directly sampling is difficult or impossible. It uses an acceptance/rejection rule to cover the target distribution. The basic algorithm follows: 1. Randomly choose a starting point 0 2. For the ithiteration from i=1,2,..., (a) Sample a proposal 9* from a jumping distribution at iteration i, p(9*&i_l) (b) Calculate the ratio of densities, f(9*)p(9i_l 18*) r = f(9i-1)p(O*19 _1) where f(9) is the target distribution. 6 1.3. Bayesian Analysis (c) Set — 9* with the probability min(r, 1) 9i—1 otherwise Note that the acceptance rate shouldn’t be close to 0% or 100%, otherwise, the random walk would either move too slowly (cover the target population too slowly) or stand still for too long (likely to stay in certain regions). Thus, proper adjustments of the jumping distribution sometimes are necessary. Gibbs sampler The Gibbs sampler is a special case of MR algorithm. It is applied when the ex plicit form of the joint distribution is unknown but the conditional distribution of each variable is known. Assume we divide the parameter vector 9 into p components, say 9 = (9i, 92, ..., 0,,). Each iteration of Gibbs sampler cycles thought all the components of 9. Each component is drawn based on the values of all others. Thus, the algorithm that generates a Markov chain at iteration i is: 1. Randomly set the initial values for all the components 0 (9°), 9(0)) 2. For the ithiteration from i=1,2,..., for j=1...p, sample 9 from the distribution of where 9 represents all the components of 0, except for O, at their current values: 0(i) — 19(i) 9(i) 0(i) 0(i—i) 9(i—1) , 1’••’ j—]i j+1 ••• P Though the above two MCMC algorithms looks very simple, sometime the technical details can be very challenging. In this thesis, we will apply Gibbs sampler algorithm in Chapter 2 and Metropolis-Hastings algorithm in Chapter 3. 7 Chapter 2 Simulation Study for Categorical Exposure Variable 2.1 Introduction Assume a researcher is interesting in study the relationship between a binary exposure variable X and a continuous response variable Y. The exposure variable X is often coded as 0 or 1, where X = 0 refers to “unexposed” and X = 1 refers to “exposed” in most epidemiological situation. Instead of observing X, a surrogate variable X* is measured. Under the non-differential misclassification assumption, X and X* are conditionally in dependent of Y, and the specificity and sensitivity are used to measure the magnitude of the misclassification (Gustafson, 2004). Then, the sensitivity SN = Pr(X* = 1IX = 1) is the probability of correctly classifying a truly “exposed” subject, and the specificity SP Pr(X* = 0X = 0) is the probability of correctly classifying a truly “unexposed” subject. In the following subsections, we will introduce two approaches to analyze the relationship between discrete exposure variables and health outcome Y. 2.2 2x 3 Table—Formal Analysis Though there are only two conditions of the true exposure X, Yes or No, sometimes the surrogate variable X* has three conditions instead as: Unlikely, Maybe and Likely, due to possible instrument error (as displayed in Table 2.1). 8 2.2. 2x 3 Table—Formal Analysis Assessed Exposure Unlikely Maybe Likely True Exposure No P00 P01 P02 Yes Plo p11 P12 Table 2.1: 2x 3 table due to the misclassification and measurement error Note that the Pij in the table defines the probability of classifying a subject into group j while its true status is in group i. If the truly exposed condition, X, is known, then the exact relationship (referred as a “true” result) of X and Y is able to be analyzed. However, in reality, X is often inaccessible, and researchers only know its surrogate, X*. Even though, researchers are analyzing the relationship between X* and Y, they also tend to conclude it as the re lationship between X and Y. It is very dangerous to make such conclusion since it can be very biased. The first analysis method is termed as a formal analysis: analysis that acknowledge the existences of 2 x 3 table (Table 2.1) structure in the exposure condition. Specifically, the analysis is carried out based on Table 2.2. Exposure Unlikely Maybe Likely Health Outcome controls n00 n01 n02 cases n10 n11 n12 Table 2.2: A 2x 3 table for formal analysis where nj in the table is the number of subjects that fall in the condition (e.g. n00 is the number of subjects that “Unlikely” have the exposure in controls). In the literature, there are fewer studies that involves the formal analysis, which makes it an interesting point to study. The second approach is termed as an infonnal analysis: analysis that tend to ignore the 2 x 3 table structure in the exposure condition. More details of informal analysis are going to be talked about in the next section. 9 2.3. Informal Analysis 2.3 Informal Analysis When we assume the mismeasurement is non-differential, sensitivity and specificity of X* for X can be used to described the magnitude of the misclassification. The closer the values of SN and SP are to one, the less severe the misclassification is. As stated in Gustafson (2004), when the probability of the exposure is very rare, the effects of misclassification worsens much more with the decrease of specificity than the decreases of sensitivity, which means the impact of low specificity will be bigger than the impact of low sensitivity in a further analysis. This attracts particular attention in the epidemiological area for the study of rare disease since it implies that when the expo sure is very rare, the analysis can be fully misleading even with a very small effects of misclassification. Thus, when some epidemiologists realize that they have the 2x3 table structure (as Table 2.1) in hand, they tend to group the “Maybe” and “Unlikely” groups together to form a 2x2 table as in Table 2.3: where oo = P00 +poi and io = P10 + P11. Assessed Exposure I No Yes True Exposure No oo qoi Yes io q Table 2.3: 2x 2 table by epidemiologists’ rule They prefer such grouping so that more subjects are classified as unexposed, and this leads an increase of probability that a true negative is correctly classified, which means a large SP value. In such way, a low specificity could be avoided so that a small effect of misclassification won’t lead a huge impact in the further analysis. In the literature, most analyses are constructed based on the ignorance of the struc ture of Table 2.1. Thus, they are based on Table 2.4, and we would like to refer such analysis as informal analysis in the rest of the Chapter. 10 2.4. Odds Ratio Exposure No Yes Health outcome controls floo + fbi fl02 cases nib +flhi fli2 Table 2.4: A 2x table for informal analysis 2.4 Odds Ratio Suppose we have two groups of subjects, controls and cases, and denoted the number of subjects in each group as n0 and n1. Let r0 denote the prevalence of exposure in the control group and r1 denote the prevalence of exposure in the case group, i.e. r3 = Pr(X = 1Y 0) and r1 Pr(X = 1Y = 1). Thus, the odds ratio, which defines the relationship between X and Y is formed as: (2.1) 1 —ro In the informal analysis, the odds ratio is usually calculated as: l2 x (noo + noi)0 — fl02 x (flio + flu) where the standard error of log odds ratio is formulated as: / 1 1 1 1 \i/2 se=( +—+ +—\floo + nc no2 10 + flu l2 In the formal analysis, there are three different ways to calculate the odds ratio. Since the MCMC algorithm estimates the prevalence in both case and control groups (ro and ri) at each iteration, they are going to be updated every time. Thus, the first way is to find the posterior mean of odds ratios, which is to calculate the odds ratio at each MCMC iteration, and then take the average of them. The second way is to find the posterior average of r0,r1 of each iteration, then use formula (2.1) to find the odds ratio. The 11 2.4. Odds Ratio third way is to find the posterior mean of log odds ratio, which is to calculate average of the log of odds ratio at each iteration first and then transfer it by using exponential function. Mathematically, those three odds ratio can be written as: 1 where i refers to the th iteration, and m is the total number of iterations. im im = — ,rj (2.2) -- OR2 __ rp 1 —r where ro, ri are calculated from equation (2.2). And also, OR3 = (±1o(Oñ)) We are interested in comparing the odds ratios in both analysis. Since the odds ratio always falls into a very skewed distribution, we tend to compare the informal odds ratio and the formal odds ratio in the log scale, such that the distribution will be more sym metric. We are going to focus on comparing the ORinformal with OR3 in the log scale with respect to their point estimator and confidence/credible intervals as (n12x(noo+nol)Nlog(ORjf0T1) = log \fl02 X (nio+nii)j 12 2.5. Case 1: When We Know PijS with its 95% confIdence interval as: Clinformal: (1og(ökiflf.mal) — 1.96 * Se, log(ORjnformaj) + 1.96 * se). The log formal odds ratio is as: logOR3 = --log(OTh). Note that the 95% credible interval for log OR3 can be simply obtained by finding the 2.5% and 97.5% quantiles of all log OR3 that obtained from each iteration. In order to evaluate the performance of our proposed formal approach, we conduct simu lation studies under three cases as: when the probability of classify, Pij are known; when we only have prior information on pu; when we have some validation data, i.e. we have some subjects have both X and X measured. 2.5 Case 1: When We Know piS In some cases, researchers may know the probabilities of classifying the assessed exposure to true exposure from previous experiments, i.e. the pujs are known. Under this condition, the posterior density is expressed as: f(ro,ri,Xi...Xn,IYi...Yj,X’...X) = JJr(i(1’(1 — ro)(1_Xi)(1 — ri I(XO)(1—X) I(Xz1)(1—Xj) I(Xz2)(1—Xj)xIIpoo p01 p02 2 —- I(X=0)X, I(X;=1)x I(X,=2)X p11 p12 3 13 2.6. Case 2: When PijS are Unknown where I is an indicator function. We assume the prior distribution of r0 and r1 are uniform (the same assumption carries in the following two cases), and then we use the Gibbs sampling method to update the unknowns, r0 and r1 since their posterior distribu tions have familiar distributions that we can recognize, namely Beta distributions. The posterior distribution of X3 is viewed as: f(XjIro,ri,Xi,...Xj_i,Xj+i,...Xn,Yi...Yj,X...X) where . I(X;=0) I(X;=1) z(x;=2) a = (1 — ro) ‘(1 — ri) ‘p00 p01 p02 y, y, I(X=O) I(X=1) I(X,=2)b = r0 r1 p10 p11 p12 Thus, at each iteration of MCMC algorithm, the probability of getting X3 = 1 by given everything else “known” is b/(a + b). 2.6 Case 2: When pS are Unknown Though, in the previous section, we studied the case that PijS are available from other experiments, in reality, those values are often unknown. The maximum information that we have on them are some knowledge of their priors. We choose a Dirichlet distribution as the prior distribution for PijS for two reasons. First, the summation of POj and the summation of Plj are both equal to 1, i.e. P00 +poi +po2 = 1 and Pio +pii +p12 = 1. Second, the Dirichlet distribution has the property of being conjugate, which means the posterior distribution of Pij5 would also come from the Dirichlet distribution family. Thus, by choosing Dirichiet distribution as the prior, the MCMC algorithm is easy to compute the updated PiJS at each iteration, and the results are also easy to be interpreted. 14 2.6. Case 2: When PijS are Unknown Hence, we have: Poo,Poi,Po2 Dir(eoo,coi,c02) P1o,P11,Pi2 ‘- Dir(cio,cii,c12). Then, the posterior distribution is changed to: f(ro,ri,Xi, ...X.,,poo,poi,po2,pio,pii12Y ...X) cx — r )(1_Xi)(1_Yj)rXiYi(l — ri)’”ii —- I(x;=o)(1—X3)I(x;=i)(i—x) I(x=2)(i—x) Poi P02 3 I(X=0)X3 I(X=flX3 I(X=2)X P11 P12 3 coo—i Cj—l c02—1 C1IYl Ciii C121XPOO P01 P02 * J)10 p11 p12 cx ]Jr(12)(1 — ro)_Xi 1_Yi)ri5Ci J(l — —- (I(X’=O)(1—X,)+coo—1) (I(X=1)(1—X3)+coi—1] (I(X=2)(1—X,)+co—1Xjjp00 Poi P02 3 (I(XJ=0)X+cio—1) (I(X’=i)X+c —1) (I(X=2)X+c1—1) X p10 p11 P12 cx Xj(1-Yj) (1 — r )(1_Xi)( XYj (1 — xpooi I(X=0)(1—X)+coo—10ZI(X;=1)(1—X3)+coi02I(X=2)(1—X)+co—1) <p10 I(XJ=0)X3+cio—1 1 I(X$=1)X,+cii I(X,=2)X+ci2—i Note that coo, col, c02, do, Cii and d12 are called hyper-parameters, and the specific values assigned will be discussed in the results section. 15 2.7. Case 3: Validation Data 2.7 Case 3: Validation Data Sometimes, the true exposure variable X might be possible to capture, but it is too expensive to get for all the subjects. As a result, only a small proportion of the subjects have the complete information of X, X and Y, whereas the majority of the subjects don’t have the precisely measured exposure status, X. Table 2.5 presents the structure of the validation sample and the incomplete main data. While all counts corresponding to Validation Data Main Data x* x* Unlikely Maybe Likely Unlikely Maybe Likely Y=o x=o Y=o x=1 Y=1 X=O Y=1 x=1 Table 2.5: Validation data and main data levels of X, Y, X* are fully recorded in the validation data, only counts that correspond to Y, X are recorded in the main data. We want to use the information from the validation data to impute X for the main data and to make inference on X and Y. Our new posterior density is like: f(X1 Xm,poo,poi,po2,p1o,p11,p12,ro,r1IXm+1,.,.Xn,X,...X,,Y1,...Yn) m j=n flfo’Ix x flf(XIX3)x flf(X3)fl f(X) j=1 j m+1 fJf(XX) x fJf(X) x f(poo,pol,p02,plo,pll,p12) x f(ro) x f(r1), where j = 1, . . .m is the non-validation data part and j = m + 1, . . .n is the validation data part. The simulation process for this case does not change a lot regarding the change of the posterior distribution, and the only difference is that “known” X values in the validation data do not need to be updated, where the “unknown” X values, r0,r1 and PijS are 16 2.8. Results updated the same as in the previous case. 2.8 Results In order to gain information about the performance of MCMC algorithms in all three cases, the MCMC trace-plots, posterior mean, 95% equal-tail credible interval, estimated bias, estimated mean square error of each unknown parameters are checked in the fol lowing subsections. Moreover, when the prevalence in control and cases are small, i.e. r0 and r1 are relatively small, the odds ratio of “formal” and “informal” are compared later on to assess their performance. In all cases, two sets of the prevalence are used as (1): r0 = 0.2,r1 = 0.25 and (2): r0 = 0.7, r1 0.4, and each one is combined with the “true” probability of classifying values as P00 O.S,poi = O.3,Po2 = 0.2 and pio = = 0.3,pi2 = 0.6. Any hyper parameters used in the specific cases will be defined in the later subsections along with detailed information of each case scenarios. Moreover, two simulation studies are per formed regarding each scenario in each case. The first one focuses on studying estimation from one sample. The second concentrates on studying the sampling distributions of each estimator across 100 simulated datasets. 2.8.1 Results for Case 1 Two scenarios we have in this case are: • Scenario 1: (T0,Ti) = (0.2, 0.25), true odds ratio=1.33, (poo,pol,po2) (0.5, 0.3, 0.2), (pio,pll,p12) = (0.1,0.3, 0.6); • Scenario 2: (rO,Ti) (0.7,0.4), true odds ratio=0.283, (poo,poi,po2) = (0.5,0.3,0.2), (plo,pil,pi2) = (0.1,0.3,0.6). Two simulation studies are carried for each scenario, where studies for the first scenario will be talked about in details as an example. Note that the same procedures are applied 17 2.8. Results to the second scenario as well. In the first study, a dataset of size 4000 (2000 controls and 2000 cases) was generated based on the given “true” value, (ro,ri) = (0.2,0.25), (poo,pol,po2) = (0.5,0.3,0.2), and (pio,pii,p12)= (0.1,0.3, 0.6). Then, we use 21000 iterations, where the first 1000 is the burn-in period to update unknown parameters. The choice of burn-in period is make based on visual inspection. Figure 2.1 shows the traceplot of MCMC algorithm for ro and r1 after the first 1000 burn-in period. It shows that the Markov chains moving smoothly within the target area. 18 0 C’j 9 0 0 a 2.8. Results iterations Figure 2.1: Traceplots of r0 and r1 traceplots show the 20000 iterations r1 are 0.2 and 0.25 respectively. from MCMC algorithm in scenario 1 case 1. The after 1000 burn-in period. The true values of ro and Table 2.6 shows the true values, posterior means and the 95% credible intervals for r0 and r1. Both 95% credible interval of r0 and r1 covers the “given” values, which indicates I. I Iii Ii I .,LI tU ‘ ii. liii I i. ‘ r,- I I’ I I I 0 5000 10000 15000 20000 0 Co 0 0 0 0 5000 10000 15000 20000 iterations 19 2.8. Results that for this particular generated dataset, our approach works well. true value posterior mean 95% CI ro 0.2 0.18 (0.15, 0.23) r 0.25 0.27 ( 0.23, 0.31) Table 2.6: True values, posterior means, 95% credible intervals of ro and ri. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1. The second study will repeats the first study 100 times, which enable us to investi gate the sampling distributions of r0 and r1. Figure 2.2 is the histogram of 100 posterior means of r0 and r1. It demonstrates that the sampling distribution of r0 and r are ap proximately normally distributed and centered around the “true” values, majority values are around the “true” values. 20 2.8. Results , 0 C, 0 41) Ll 0 I I I I I I 0.18 0.22 026 0.30 Figure 2.2: Histogram of 100 posterior means of ro and r in the second simulation study for scenario 1 case 1. The “true” values of ro and r1 are 0.2 and 0.25 respectively. Table 2.7 confirms our observation by showing the 95% credible interval coverage rates are very high and the average lengths of the credible intervals are very small. Bias MSE Coverage of the 95%CI j Average 95%CI Width r0 0.0024 0.0025 91 T________ 0.09 r1 0.0028 0.0023 95 T ööi Table 2.7: Bias, mean square error (MSE), coverage of 95 % CI and the average width of r0 and r1 for scenario .1 case 1. All results are based on 100 datasets, and their true values are 0.2 and 0.25 respectively. Since desirable results are obtained from the statistical procedures and stabilized iter to - 0 - C.) LI) - 01 0 U) 01 0 0J C U- 0 0C 0,14 0.18 022 026 21 2.8. Results ations are observed from the traceplot, it is reasonable to conclude that our approach works well when the prevalence for both control and cases are relatively small. Table 2.8 shows the first simulation study result for scenario 2, where r0 = 0.7 and r1 0.4 and Pij values remain the same. Again, the result shows that for the particular true value posterior mean 95% CI ro 0.7 0.69 (0.63,0.72) r1 0.4 0.41 (0.36, 0.45 Table 2.8: True values, posterior means, 95% credible intervals of ro and ri. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 1. generated dataset, we are able to get reasonable estimators. However, in order to know how the model works in general, we need to review the results from the second simulation study. Figure 2.3 and Table 2.9 are histogram and statistical results from the second simulation study for scenario 2 in case 1. All results from two studies for scenario 2 suggest that our proposed approach works equally well when the prevalences are considerably larger. 22 2.8. Results >% C) C S 0 C) U- Figure 2.3: Histogram of 100 posterior means of r0 and r1 in the second simulation study for scenario 2 case 1. The true values of r0 and r1 are 0.7 and 0.4 respectively. Bias MSE ro 0.000068 0.0021 r1 0.00074 0.0023 Coverage of the 95%CI 98 _________ 94 Average 95%CI Width 0.085 0.092 Table 2.9: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of r0 and r1 for scenario 2 case 1. All results are based on 100 datasets, and their true values are 0.7 and 0.4 respectively. 0 Ci C C’) U) C Is, C C’) U,C) C U) 0 U) U- 0 C I I I I I 0.66 0.70 0.74 C r0 I I 0.32 0.36 0.40 0.44 t I 23 2.8. Results 2.8.2 Results for Case 2 In this case, there are also two scenarios as: • Scenario 1: (ro,ri) = (0.2, 0.25), true odds ratioz1.33, (poo,pol,po2) = (0.5,0.3,0.2), (plo,pll,p12) = (0.1,0.3,0.6); • Scenario 2 : (ro,ri) = (0.7.0.4), true odds ratio=1.33, (poo,pol,po2) = (0.5,0.3,0.2), (plo,pll,p12) = (0.1,0.3,0.6). Remember that in this case, since PijS are unknown, we need to specify prior distributions for them. The hyper-parameters cj are chosen particularly as (coo, c01, 002) = (55,30,15), (dO, c11, 012) (10, 25,65) for both scenarios. Those values are chosen so that the prior distribution are centrated nearly around the “true” values of Pu. Since any single com ponent of a Dirchlet vector has a Beta distribution, we are able to see how concentrated these priors are by using the Beta distribution with certain hyper-parameters. For exam ple, previous information (prior) states that P00 is from a Beta distribution with shape = 55, 3 = 45. Figure 2.4 displays the the true Pij values with its corresponding Beta density function. From the figure, we can see that most “true” pjj values are close to the centre of the the density function (especially with Poi and plo) and the range of the x-axis are pretty narrow, which suggests that the priors that we use in this case are concentrated ones. We use the concentrated priors here is because by simulation studies, we realized that the no matter the sample size is large or not, a strong prior is crucial in this case (as mentioned in Section 1.3.2). Same as in case 1, two simulation studies are carried for each scenario, where studies for the first scenario will be talked about in detail as an example. In the first study for scenario 1, a dataset of size 4000 (2000 controls and 2000 cases) was generated based on the given “true” value. Then, we use 51000 iterations (the first 1000 is the burn-in period) to update unknown parameters. Figure 2.5 show the traceplot of MCMC algorithm for r0, r1 and Pij5 after the first 1000 24 0.0 0.2 0.4 2.8. Results Figure 2.4: Density plots of “true” Pij values with its corresponding Beta density function. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (55,5), Beta(30, 70), Beta(15, 85); Beta(10, 80), Beta(25, 80) and Beta(65,35). 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 I ‘I 0.0 0.2 0.4 0.6 0.8 1.0 I I 0.0 0,2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 08 1.0 I” I’ 0.0 02 04 0.6 0.8 1.0 Pio Pu P12 25 2.8. Results burn-in period. Again, the Markov chains move smoothly within target range and no chain is fixed at a particular value. We also observe that the generated sample Markov Chain is somehow more stabilized in some unknown parameters (e.g. poi,pii) than oth ers (e.g. r0 and ri) after the burn-in period. Table 2.10 demonstrates the true value, estimated posterior mean and 95% credible interval for each unknown parameter from the first study of scenario 1. true value posterior mean 95% CI r0 0.2 0.25 (0.14,0.40) r1 0.25 0.32 (0.18, 0.41 P00 0.5 0.53 (0.46, 0.60 oi 0.3 0.31 (0.27, 0.35 P02 0.2 0.16 (0.09, 0.23) 0.1 0.1 (0.05, oii p 0.3 0.25 (0.18, 0.35) pi 0.6 0.64 (0.54, 0.73 Table 2.10: True values, posterior means, 95% credible intervals ofr0, r1 and pjjs. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1. Though there are discrepancies between the estimated posterior means of r0 and r1 and their true values, 95% credible intervals of estimated means do cover the true values. Figure 2.6 and Table 2.11 display the results from the second study (100 datasets simulation) of the first scenario in this case 26 2.8. Results L. i1L j, ., JIr1I. ________________________ ________________________ o 10000 30000 50000 itera5orrS IL I I I I 0 10000 30000 50000 iterations I. I. II Ii - iij 7 Figure 2.5: Traceplots ofr0,r1 and PijS from MCMC algorithm in scenario 1 case 2. The traceplots show the 50000 iterations after 1000 burn-in period. The “true” values are ro = O.2,r O.25,(poo,pOl,p02) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3, 0.6). I I I I . 0 10000 30000 00000 0 10000 30000 50000 iterations S iterations I I I 0 10000 30000 50000 iterations I I I 0 10000 30000 50000 iterations 0 10000 30000 50000 itera0ons 0 10000 30000 00000 iterations 27 2.8. Results I EfrhI I _________I J1+11hh I 0.50 0.52 0.54 0.56 0.28 0.30 0.32 0.34 0.145 0.150 0.155 0.160 P00 P01 P02 rffh1 _____ 0.095 0.100 0.105 0.110 0.22 0.24 0.26 0.28 0.30 0.60 0.62 0.64 0.66 Pta Pu P12 [oj ______I 0.20 0.25 0.30 0.35 024 0.28 0.32 0.36 r1 Figure 2.6: Histogram of 100 posterior means of r0, r1 and PijS in the second simulation study for scenario 1 case . The “true” values are r0 = 0.2,r1 0.25, (poe, PO1,P02) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3, 0.6). 28 2.8. Results From Figure 2.6, we observe that the histograms of 100 posterior means of 1j, T’i,Po,Pi,Pjo and P02 are shifted to the right from their true values, which indicates an overestimation of the parameters of interest. Though the graph suggests a possible unpleasant result, we still need to study those parameters’ corresponding sampling distributions to find out the true performance of our model. Table 2.11 shows the estimated bias, mean square error, 95% credible intervals coverages (of the true value) and the average length of 100 95% credible intervals for each parameter. Bias MSE Coverage of the 95%CI Average 95%CI Width r0 0.062 0.0021 100 0.25 r1 0.056 0.0019 100 0.24 0.035 0.00079 100 0.14 p01 0.014 0.00088 99 0.074 P02 -0.049 0.00023 100 0.14 pio 0.0022 0.00015 100 0.12 p -0.039 0.00072 100 0.16 p 0.0366 0.00061 100 0.18 Table 2.11: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and rl for scenario 1 case 2. All results are based on 100 datasets, and their true values arer0 = 0.2,r1 =O.25,(poo,pol,p02) = (0.5, 0.3, 0.2) and (p1o,p1,pi) = (0.1,0.3,0.6). It indicates that the even though the graph shows a potential unpleasant result, the parameters’ posterior distributions still cover their true values most of the time. As we discussed earlier, this case is very sensitive to the choice of prior distributions, and the observation obtained from Figure 2.6 could just imply a not strong enough prior for the particular dataset. Thus, it’s still reasonable to conclude that the algorithm works well for this case. Figure 2.7 and Table 2.12 are histogram and statistics results from the second sim ulation study for scenario 2 in case 2. We omit the results of the first study here due to its limitation of interpretation. From both the figure and table, we notice that no matter whether the probability of exposure is large or small, the prior distributions are always 29 2.8. Results very important. Weak priors may lead to poor results, and the stronger the prior is the better the results would be. This is dissimilar with most Bayesian cases, where when the sample size is large, the choice of the prior becomes insignificant. Future studies could be conducted on this case. Bias MSE Coverage of the 95%CI Average 95%CI Width ro 0.015 0.0027 100 0.23 r1 0.048 0.0025 99 0.23 p00 0.036 0.0017 100 0.15 Po 0.012 0.0020 96 0.10 -0.04 0.00043 100 0.15 pm -0.0036 0.00022 100 0.11 -0.011 0.0017 97 0.079 p 0.015 0.0015 100 0.14 Table 2.12: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width ofr0 and r1 for scenario 1 case 2. All results are based on 100 datasets, and their true values are r0 = 0.7,r1 = 0.4, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). 30 2.8. Results ,.rFffffhi i ErfF1111}L 0.45 050 055 0.60 0.25 0.30 0.35 0.40 012 0.14 0.16 0.18 P00 Po P02 ‘rll}[fThk 00jj1l11L J1IF[L, I 0.090 0095 0.100 0.105 022 026 030 0.34 0.55 0.60 0.65 070 Pm Pu P12 _______ 060 0.65 0.70 0.75 0.80 0.35 0.40 045 0,50 0.55 Figure 2.7: Histogram of 100 posterior means of r0,r1 and PijS in the second simulation study for scenario 1 case . The “true” values are r0 = O.7,r1 = 0.4, (poo,pol,po2) = (0.5,0.3,0.2) and (plo, pll,p12) = (0.1,0.3,0.6). 31 2.9. Results for Case 3 2.9 Results for Case 3 Under this cases, we will have four scenarios as follows: • Scenario 1: (ro,ri) (0.2, 0.25), true odds ratio==1.33, (Poo,pol,po2) (O.S,O.3, .2),(plo,pll,p12) = (0.1,0.3,0.6), (COO,Col,c02)= (1,1,1), (clQ,Cll,c12) (1,1,1) ,validation size=200; • Scenario 2: (ro,ri) = (0.2.0.25), true odds ratio=1.33, (poo,pol,po2) (O.S,O.3,0.),(plO,pll,p12) = (0.1,0.3,0.6), (coo,col,c02)= (1,1,1), (clo,cll,c12)= (1,1,1), validation size=400; • Scenario 3: (ro,ri) = (0.7,0.4), true odds ratio0.28, (poo,pol,p02) (O.S,O., .2),(P1o,P11,P12) = (0.1,0.3,0.6), (COO,cOl,C02) (1,1,1), (c10,c11,c12) = (1,1, 1), validation sizez200; • Scenario 4: (ro,ri) = (0.7.0.4), true odds ratio=0.28, (poo,pol,po2) = (0.5,0.3,0.2), (plo,p11,p12) = (0.1,0.3,0.6), (coo,coi,c02 = (1,1,1), (cio,cii,c12)= (1,1,1) ,validation size=100. Notice that in scenario 1 and scenario 3, the validation size is 200 compared to scenario 2 and scenario 4 (validation size is 100). As in case 2, let’s plot the true values of pjjs along with their prior density functions. From Figure 2.8, we can see that now the range of the x-axis (possible generated values) becomes wider and the “true” values are not that close the to the center of the density functions. This directly implies that these priors are flatter than the one we chose in case 2. We assign flat priors in this case because we believe valuable information could be obtained from the validation data. Once more, two simulation studies are carried for each of the scenario, and only the first of the two scenario will be talked about in details. Figure 2.9 and Figure 2.10 show the traceplots of MCMC algorithm for r0,r1 and jjS after the first 1000 burn-in period for scenario 1 and 2. 32 2.9. Results for Case 3 I I•I I I I 00 0.2 0.4 0.6 0.8 1.0 I I N 0.0 0.2 0.4 0.6 0.8 1.0 P12 Figure 2.8: Density plots of “true” pj,j values with its corresponding Beta density function. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (1,2,), Beta(1,2), Beta (1, 2); Beta(1, 2), Beta(1, 2) and Beta(1, 2,). 0 0.0 0.2 0.4 0.6 0.8 1.0 P20 N 0.0 0.2 6.4 0.6 0.8 1.0 [IIJ 0.0 0.2 0.4 0.6 0.8 1.0 Plo Po, N 00 0.2 0.4 0.6 0.8 1.0 N. P20 0 33 2.9. Results for Case 3 ‘I 0 5000 15000 terabons :1,1 I 0 5000 15000 0 5000 15000 0 5000 15000 ilera0ono II Ir.I II II liii Figure 2.9: Traceplots of r0, r1 and PijS from MCMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r0 0.2,r1 =O.25,(pOO,pOl,p02) = (0.5,0.3, 0.2) and (plo,pn,p12) = (0.1,0.3,0.6). Validation size =200. We can see that for the generated sample Markov Chain are somehow more stable in this case than case 2. Table 2.13 and 2.14 demonstrates the true value, estimated posterior mean and 95% credible interval for each unknown parameter from the first hi .1., II r.J_JI I II II 0 5000 10000 ileraSons —1 ii i _IrI—r, i 0 5000 15000 iler000rrO ‘I I II l I ‘l 0 5000 15000 0 5000 15000 ilera5005 ileradero 34 2.9. Results for Case 3 0 5000 15000 0 5000 15000 iterations iterations -I II. ..ii_... ir’i’ 0 5000 15000 iterations 0 5000 15000 iterations J “F ‘IiT 0 5000 15000 iterations 0 5000 15000 iterations 0 5000 15000 0 5000 15000 iterations iterations Figure 2.10: Traceplots of r0, r1 and PiJS from MGMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r0 0.2,r1 = 0.25, (Poo,pol,po2) = (0.5,0.3,0.2) and (p1o,pll,p12) = (0.1,0.3,0.6). Validation size =rlOO. 35 2.9. Results for Case 3 study of scenario 1 and 2. true value posterior mean 95% CI ro 0.2 0.18 (0.13,0.24) r1 0.25 0.24 (0.19, 0.30) P00 0.5 0.49 (0.45, 0.52 P01 0.3 0.31 (0.28, 0.34 P02 0.2 0.20 (0.16, 0.24 pio 0.1 0.07 (0.00 0.13) pn 0.3 0.27 (0.16, 0.37) pj 0.6 0.66 (0.56, 0.77) Table 2.13: True values, posterior means, 95% credible intervals of ro,ri and PjjS. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 3. Validation size =200. true value posterior mean 95% CI r0 0.2 0.22 (0.13,0.31) r1 0.25 0.27 (0.18, 0.37) poo 0.5 0.53 (0.46, 0.59) oi 0.3 0.25 (0.19, 0.31 ) P02 0.2 0.22 (0.19, 0.25) Plo 0.1 0.07 (0, 0.13 P11 0.3 0.42 (0.29, 0.54 pr 0.6 0.51 (0.42, 0.61 Table 2.14: True values, posterior means, 95% credible intervals ofr0,r1 and PjS. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 3. Validation size =100. From these one sample studies, we observe that the approach works very well with flat priors. Though the observation now is only based on one sample study result, it is verified by Tables 2.15 and 2.16 from the second simulation study (sampling distribution study based on 100 datasets). 36 2.9. Results for Case 3 Bias MSE Coverage of the 95%CI Average 95%CI Width r0 0.0047 0.0032 93 0.12 r1 0.0024 0.0029 98 0.12 P00 0.0017 0.0020 95 0.07 pç -0.00045 0.0015 96 0.067 p -0.0013 0.0015 98 0.071 pio 0.0090 0.0036 97 0.15 pu -0.01 0.0045 97 0.20 i:i: 0.001 0.0039 99 0.19 Table 2.15: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and r1 for scenario 1 case 3. All results are based on 100 datasets, and their true values arero = 0.2,ri = 0.25, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plO,pfl,p12) = (0.1,0.3,0.6). Validation data=200. Bias MSE Coverage of the 95%CI Average 95%CI Width r0 0.0066 0.0039 96 0.16 r1 0.0027 0.0038 96 0.16 poo 0.00050 0.0022 94 0.093 poi 0.0049 0.0023 94 0.086 po -0.0054 0.0021 99 0.09 pj 0.0090 0.0043 96 0.18 p -0.021 0.0068 93 0.25 p 0.012 0.0049 100 0.23 Table 2.16: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width ofr0 and r1 for scenario 2 case 3. All results are based on 100 datasets, and theirtrue values arer0 0.2,r1 = 0.25, (poo,pol,po2) = (0.5,0.3,0.2) and(pio,p11, 2= (0.1,0.3,0.6). Validation data—100. 37 2.9. Results for Case 3 Figure 2.11 and Figure 2.12 show the histograms of sampling distributions for each unknown parameter in scenario 1 and 2. Results from the above two tables and figures indicate that when there are some validation data, the formal approach performs equally well (compare with the case 2) though the prior information is weak. Even though intuitively, we may think that the larger validation size would have better results than the smaller size, by comparing Table 2.15 and Table 2.16, it is hard to conclude that there is significant difference in the results obtained from the scenario 1 and scenario 2. One possible explanation is that by increase the validation size from 100 to 200, the algorithm is only able to gain limited “valuable” information. This immediately rises a question that whether there is a cut off point that we are able to get the maximum benefit, i.e. fewer validation data and enough information to obtain a good estimation. This could be an interesting point to study later on. 38 2.9. Results for Case 3 _______ CL lb1. ‘: A LL :J __ _ _ __ ___________ __________ 00 I I I I I I I I I I I I I I 0.46 0.50 0.54 0.24 028 0.32 0.14 0.18 0.22 P00 P01 P01 0) JT[1l}L.3.. I I I I I I I I 0.05 0.10 0.15 0.20 0.15 0.25 0.35 0.45 0,50 0.55 0.60 0.65 0.70 Pio Pi, C I :r__[1lI1i1_00 0.10 0.15 020 0.25 0.30 020 0.25 0.30 r0 Figure 2.11: Histogram of 100 posterior means of ro, r1 and Pij8 in the second simulation study for scenarzo 1 case 3. The “true” values are r0 0.2,r1 = 0.25, (poo,pol,p02) = (0.5,0.3,0.2) and (Plo,P11,P12) (0.1,0.3,0.6). Validation size =200. 39 2.9. Results for Case 3 rf11}i ! ‘‘J{{}frflaT, 1 LitflThL 0.46 0.50 0.54 0.26 0.30 0.34 0.14 0.16 022 P00 Poi P01 ‘!;rfffi}1jTm 1EiriTTfbL., ‘r1LHL,, 005 0.10 0.15 0.20 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.50 Pio Pu Pi 0,10 0.15 0.20 0.25 0.30 0.20 0.25 0,30 0,35 r0 Figure 2.12: Histogram of 100 posterior means of ro, r1 and PijS in the second simulation study for scenario 2 case 3. The “true” values are r0 = 0.2,rj = 0.25, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation size =100. 40 2.9. Results for Case 3 The following tables and figures show the results of scenario 3 and scenario 4 where the prevalence rates are relatively larger (rO = 0.7 and r1 = 0.4). Bias MSE Coverage of the 959CI Average 95%CI Width r0 0.0033 0.0031 97 0.13 r 0.0064 0.0039 92 0.14 pj 0.0042 0.0033 94 0.11 P01 0.00019 0.0025 94 0.09 P02 -0.0044 0.0029 96 0.11 p 0.0001 0.002 95 0.078 pjj -0.0012 0.0020 93 0.075 p 0.0011 0.0023 96 0.093 Table 2.17: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and Ti for scenario 3 case 3. All results are based on 100 datasets, and their true values are r0 = 0.7,r1 =O.4,(poo,po1,p02) (0.5,0.3,0.2) and (plo,pll,pi2) = (0.1,0.3,0.6). Validation data=200. Bias MSE Coverage of the 95%CI Average 95%CI Width r0 0.0041 0.0038 97 0.19 r1 0.0062 0.0047 96 0.19 pyj 0.0097 0.0037 96 0.14 pr -0.0030 0.0026 96 0.10 P02 -0.0067 0.0034 96 0.14 pio -0.0006 0.0021 99 0.10 pH -0.0011 0.0025 90 0.082 PH 0.0017 0.0028 97 0.11 Table 2.18: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and Ti for scenario 4 case 3. All results are based on 100 datasets, and their true values are r0 = 0.7,r1 =O.,(pOO,p0i,p02) = (0.5, 0.3,0.2) and (plo,pn,pi2) = (0.1,0.3,0.6). Validation data=100 41 2.9. Results for Case 3 01 __________ _______ co_I Fti r} Lt :1 ___ - Wfl _ _ _ 0 ________ 01M liii L I I I I I I I I I I I I 0.45 0.50 0.55 0.60 0.22 0.26 0.30 0.34 0.15 0.20 0.25 Pa Poi P01 C01U, 01 rf{fFIITL, ‘ Eli—fl1 U ELrHWH1 Hi___at,_____ ______ I I I I I I P I I I I I I I I I 0.04 0.08 0.12 0.26 0.30 0.34 0.54 0.58 0.62 0.66 Plo Pu La_I I I 0 _ _ _ _ _ _ _ 0] __ ____ 0.65 0.70 0.75 0.80 0.35 040 0.45 0.50 r0 Figure 2.13: Histogram of 100 posterior means of ro, r1 and PijS in the second simulation study for scenario 3 case 3. The “true” values are r0 = 0.7,r1 = 0.4, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation size =l00. 42 2.9. Results for Case 3 0 C’, cof I El OhI I I I o & 1 Fl nH-, I-fff-1 ___ ___ : ___ w H __ __ __ __ ______ __ __ __ I I I I I I I I I 0.45 0.50 0.55 0.60 0.22 0.26 0.30 0.34 0.10 0.15 0.20 0.25 Poo Pot Pot 0 TrTTiI1ITh _ _ 0IjH a w oj _ _ __ _________ I I I I I I I I I I I 0.04 0.08 0.12 0.20 0.25 0.30 0.35 0.55 0.60 0.65 0.70 Pio Pit P12 0 0C01 C’,, -l o_ - I I I I I I rrrii S _ H Oh] _ __ 0.60 0.65 0.70 0.75 0.80 0.20 0.30 0.40 0.50 Figure 2.14: Histogram of 100 posterior means ofr0,r1 and PijS in the second simulation study for scenario 4 case 3. The “true” values are r0 — 0.7, r1 = 0.4, (Poo,Poi, P02) = (0.5,0.3,0.2) and (plo,p11,p12) = (0.1,0.3,0.6). Validation size —100 43 2.10. Comparison of Odds Ratios From the above results, we are able to conclude that our formal approach works well when the prevalence rate is relatively high, and again it is hard to conclude that more validation data will help in getting more precise results. 2.10 Comparison of Odds Ratios As we talked about previously, it is interesting to compare odds ratios estimated by formal and informal approaches, to see which one tends to be closer to the true vales. Since in the informal approach, we tend to group two categories from the exposure “Unlikely” and “Maybe” together, only when the probability of exposure is rare, we would only compare odds ratios in scenarios that ro = 0.2 and r1 = 0.25 in each case, i.e. scenario 1 in case 1 and 2, scenario 1, 2 in case 3. 44 Id 0 2.10. Comparison of Odds Ratios Figure 2.15: Comparisons of log informal odds ratios with log formal odds ratios for scenario 1 in case 1 (upper left), scenario .1 in case 2 (upper right), scenario 1 81 2 in case S (lower left and right) . The true log odds ratio value is 0.285. The line in the each graph plots if two odds ratios are the same (y=x line). The dash lines represent the “true” value. 0 —0.2 0.0 0.2 0.4 0.8 0.8 Iog(O0) --.o--o —0.2 0.0 0.2 0.4 0.6 0.8 Iog(OR) /0 //—0.2 00 02 0.4 0.6 0.8 —02 0.0 0.2 0.4 0.6 0.8 Iog(OR) 45 2.10. Comparison of Odds Ratios Figure 2.15 suggests that the informal odds ratios tend to underestimate the true values, where the formal odds ratios sometimes overestimates the true value. For most of generated datasets, the informal odds ratio is always less than the formal odds ratios, since the majority of the plots are below the “y=x” line. The following table illustrates the comparisons in details. Bias MSE Coy, of 95%CI Ave. 95%CI Width Scenario 1 Case 1 ORinformal -0.19 0.0078 21 0.27 OR3 0.008 0.020 93 0.75 Scenario 1 Case 2 ORinformal 0.19 f 0.0072 23 0.30 OR3 -0.055 f 0.016 93 0.66 Scenario 1 Case 3 ORinformal 0.19 0.0072 23 0.27 OR3 0.00027 0.018 93 0.70 Scenario 2 Case 3 ORinformal -0.18 0.0068 27 0.27 OR3 -0.0043 0.018 97 0.76 Table 2.19: Estimated bias, mean square error (MSE), coverage of 95 % confidence intervals and average 95% CI width of informal and formal log odds ratios for scenario 1 (or 2) in three cases. True log odds ratio is 0.28. Table 2.19 indicates that the formal analysis produces more precise estimation of log odds ratio. It has a small bias, great coverage rate, and a reasonable average of 95% CI width, where, on the other hand, the informal approach gives larger bias and unexpected small coverage rate. Thus, we are able to conclude that the formal approach generally does a better job than the informal approach in estimating the odds ratio. Researchers who apply the informal approach may need to take serious consideration of the measurement error, otherwise, the results might be very biased. 46 Chapter 3 Simulation Study for Continuous Exposure Variable 3.1 Introduction Suppose a researcher is interested in investigating the relationship between a health outcome, Y, and a continuous variable, X, and then concluding that E(YIX) = /3o +iX for unknown parameters o and i3i. Nevertheless, when the health outcome is measured precisely, a noisy measurement X is often obtained instead of X. If the researcher does not realize the existence of measurement error, or decide to ignore it, then his conclusion about Y and X could be biased. Thus, it is useful to study the impact of the mismeasured covariate X. Let’s assume a precisely measured predictor X has a normal distribution with mean i and variance )2, while its mismeasured surrogate X arises from an additive measurement error, which is non-differential, unbiased and normally distributed. Thus, X is normally distributed with mean X and variance a2. Moreover, since the measurement error is non differential, which means X* and Y are conditional independent of Y, the distribution of (X* Ix, Y) and the distribution of (X* IX) are identical. The joint distribution of X, X’ and Y can then be viewed as: f(X*,Y,X) = f(X*IX,Y) x f(YIX) x f(X) (3.1) f(X*IX) x f(YIX) x f(X). 47 3.1. Introduction The first term in equation (3.1) is called the measurement model, which is the conditional density of X* given X and Y. This defines that under the influence of Y, the surrogate X’ arises from the true variable X in a particular way. The second term is called the response model, which explains the relationship between the true explanatory variableX and the response Y. The last term is called the exposure model in epidemiological applications (Gustafson, 2004). Usually, specific distributions, which involve some unknown parameters will be assumed in equation 3.1, and to make inferences about the X*, X and Y relationship. Since in reality, the true explanatory variable X is unobserved, the likelihood function of X and Y is formed as f(X*,Y) ff(x*,x,Y)dx (3.2) = ff(X*IX,Y) x f(YX) x f(X)dX. Though in some cases, equation 3.2 is easy to evaluate, in other cases, big problems could arise when the integral does not have a closed form. Often, a Bayesian MCMC analysis will be used in such condition since one advantage of Bayesian approach is that the likeli hood function is not necessary expressed in explicit form. Dempster, Lairdd, and Rubin (1977) proposed the EM algorithm to solve the implicitly problem in a non-Bayesian way, however, in this paper, we will stay with the Bayesian MCMC methods. Researchers often want to compare the health outcome Y within two groups, thus, they often dichotomize the continuous variable X into two or more categories. Though, Roys ton, Altman, and Sauerbrei (2006) pointed out a considerable disadvantage of the di chotomization, it is still very common in the literature (MacCallum, Zhang, Preacher, and Rucker, 2002). In this paper, we dichotomize X into two groups with the rule if X > c, the subject is truly exposed, otherwise, the subject is not exposed. Note that in reality, the value c is often decided from a previous study or chosen by a health expert. 48 3.2. Posterior and Prior Distributions The relationship of the health outcome and predictor variable is often estimated by obtaining the coefficients from a linear regression model. Since the health outcome, Y, is a binary variable here, logistic regression appeals as a suitable model. General speaking, there are three approaches to estimate the coefficients. An “naive” approach would di chotomize X’ with respect to c, where an “informal” approach dichotomize the surrogate variable X* according to c (a threshold not necessarily the same as c). In reality, the true predictor variable X is unobserved, however, in the “formal” approach (discuss in the paper), it is pretended to be known and be dichotomized with respect to c and fit the model afterward. The choice of the threshold, c, is somehow arbitrary, however, in order to keep a high specificity (as in discrete case), some epidemiologists would intend to choose c to be bigger than the true c value, such that Pr(X* > c IX < c) is very small. Notice that researchers, who use the “naive” approach are often not aware of the measure ment error or intend to ignore it, while people who use “informal” or “formal” approach do acknowledge existence of the measurement error and try to find out a solution to the problem. Results from all three approach will be compared later in the Chapter. 3.2 Posterior and Prior Distributions In this paper, the constituent models are studied based on normal distributions. Specif ically speaking, we assume the measurement model, (X* Ix, Y), to be a normal distribu tion with mean X and variance 2• Since the measurement error is non-differential, we have XX N(X, a-2) The exposure model is also assumed as a normal distribution with unknown parameters j and i.e. X N(2). 49 3.3. Case 1: When We Know o-2 Prentice and Pyke (1979) pointed out that the odds ratios are equivalent when both prospective and retrospective logistic model are applied to the case-control data, thus we would like to assume the response model YIX follows a logistic regression, which is logitP’r(Y = lix) /3 + /31(X > c). By easy transformation, the response model turn to: Pr(Y=liX)= 1 + eI+11I(X>c) Note that all parameters 2, , )2, /3o and /3i are unknown, and proper prior distribu tions might be needed in order to proceed. Meanwhile, we would like to assume the independence of all prior distributions, so that f(u2,,Ao,th) = f(u2) x f() x f(A2) x f() x f(i). Specific prior distributions will be assigned later on. In this chapter, we focus on studying three cases to demonstrate the performance of the “formal approach” : when we have some knowledge the noise of the true exposure value from previous study, i.e. is known; when we only have some prior information about the noise term, i.e. 2 is unknown but we have some information there; when we have some validation data, i.e. we have some data on x along with X* and Y for some subjects. 3.3 Case 1: When We Know 2 Sometimes, researchers might have some knowledge about the noise of the true explana tory variable from previous study results, then the posterior density function is (3.3) fJf(IXj x flf(XIX) < JJf(x3) < f(/3) x f(3) x f() x f(A2). 50 3.3. Case 1: When We Know o2 In order to proceed, we need to specify prior distributions for unknown parameters u, A2, and To simplify the MCMC algorithm later on, it is convenient to as sign the normal distribution for ,/3o, i3 and Inverse Gamma distribution for A2 as their prior distributions. Thus, we have N(0,d); J3 ‘s.’ N(0,d); N(0,d); A2 IG(d4,d5). where the choice of hyper-parameters d, d, d, d4 and d5 determine how flat or con centrated a prior could be. Here, we choose d = d = 4 = 1002 and d4 = d5 = 0.01 so that have flatter priors for unknown parameters , A2, /3o and ,6. The posterior density function in (3.3) now turns to: (3.4) flf(YjIXj) x fJf(X;IX) x ]jf(X3)x fCo) x f(j3) x f(L) x f(A2) 1 ___________ 1 Z(X)2 cc x—e 22 x—e 2A flj 1 + x -_ei x x dg x A2_4+1)e_d5/A2d1 d2 d3 F(d4) We chose the normal distribution and Inverse Gamma distribution for t and A2 as prior distributions, since they have the property of being conjugate. A conjugate prior means the posterior distribution of ,u and A2 would come from the same distribution family as the prior distribution. In particular, a simple Gibbs sampler algorithm can be used to generate a sample of from the posterior distribution of t, which is a normal distribu tion. Similarly, we can use Gibbs sampler algorithm to generate a sample of A2 from its posterior distribution, the Inverse Gamma distribution. 51 3.3. Case 1: When We Know a2 Unlike with i and A2, the posterior distributions of 3o, j3 and X do not have the same form as any other familiar distributions that we recognize . Thus, the Gibbs sampler does not work for updating them and we need to use the Metropolics - Hasting algorithm instead. As introduced in Chapter 1, this algorithm is based on the accept/ reject rule. We would like to avoid the acceptance rate being extreme, such as 100% or 0%. The key to decide the rate is the jump distribution of the parameter. Let’s take j3o for example. Suppose we are at jth iteration right now and after updating i and A2, we want to update the th value of !o. We would calculate the joint density as where f is the joint density as in equation (3.4). Then, we would assign a jump size for i3o, such that = + t, where t is from the normal distribution with mean 0 and variance k2 and we would again calculate the new joint density value as: Now, we will pick the new updated j3 as: cond) with the probability min(b/a, 1), 3’ otherwise. A similar procedure is applied for i3 and X. The individual jump size often follows a normal distribution with mean 0 and variance k2. Note that the jump size for each estimated parameter may vary in next two cases. 52 3.4. Case 2: When We Don’t Know 2 3.4 Case 2: When We Don’t Know Though, in the previous section we talked about knowing the measurement error variance of the true X from other studies, in most situation, we do not know the exact value of a2 but rather have a prior distribution for it. Then, the new posterior density turns to: (3.5) = flf(IXj) x flf(XIX) x llf(X3)x fC8o) x f(j3) x f(fL) x f().2) xf(a2) Similarly for other parameters, we need specify a prior distribution for a2. Again, be cause of the conjugately property of Inverse Gamma distribution, we would like to assign Inverse Gamma distribution with shape parameter d6 and scale parameter d7 as the prior distribution of a2. Note that d6 and d7 are hyper-parameters. The choice of prior distributions for other unknown parameters are the same as in case 1 and all other hyper parameters would be assigned the same values. Now the joint density becomes: f(X1,23...,X,3o,31jt,A2,aIY ...,X) = flfIX x flf(XIX) x flf(X) x f() x fCdi) x fCu) xf(A2) x f(a2) ___________ 1 —(X1)2x—e 22 x—e 2A 113 1 +e130+13’()Cj>’) jfl x±e x x x dg4 x )2_(d4+1)e_d/A2 d3 F(d4) x d6 x a2_6+1)e_d7/2 P(d6) 53 3.5. Case 3: When We Have Validation Data We are able to update a2 by Gibbs sampling since its posterior distribution is known as Inverse Gamma distribution, and hyper-parameters d6 and d7 are specified later in the result section. 3.5 Case 3: When We Have Validation Data Similarly as the validation case in the discrete case, here we assume there is a small proportion of data with complete information on X, X* and Y, whereas the majority of the data do not have the precise measurement of X but have the surrogate variable X* instead. Thus, unlike the equation (3.3) and (3.5) in case 1 and case 2, the joint density here becomes: f(Xl,X2,3...,Xm,/3O,I3ljL,AUIX +l,Xm+..,Xn,Y,Y2,..., * * * * n’ 1’ 2’ m n = flfo’Ixj x HfX;1X3)x flf(X3) II f(IX) j=1 j j j=m+1 flf(XIX3)x ]Jf(X3)x f(j3) x fC8i) x f(p) x f(.\2) x f(a2), where the first j = 1, . . .m are the non-validation data and the rest is the validation data. Though the joint density changes, there are not many changes regarding the simulation process. The only difference is we do not update the “known” X values in the simulations. We used Gibbs sampler to update , A2 and a2, and the Metropolis - Hastings Algorithm is used to update the o, i3i and the “unknown” X values. 3.6 Results In the following subsections, traceplots of MCMC algorithms in all three cases are checked along with some statistics for each unknown parameter, such as the posterior mean, 95% 54 3.6. Results equal-tail credible interval, bias, and estimated mean square error. In this way, we are able to gain some information about how well the MCMC algorithms are working for par ticular models and randomly generated datasets. Moreover, comparisons of estimated logistic regression coefficients are made among all three approaches, “formai”, “naive” and “informal”. In all three cases, the true unknown (common) parameters are set to be: ,u 0, A2 = 1, i3o = —1.5, and /3 1.5, with the dichotomization value c = 1. Other choice of the hyper-parameters as well as the jump size are subject to different cases. Moreover, two simulation studies are performed for each case. The first concentrates on gathering information on estimates from one sample. The second focuses on studying the sampling distributions of each estimator across 100 simulated datasets. 3.6.1 Results for Case 1 In the first study, a dataset of size 500 was generated based on a choice of given parameter values. Then, we use 6000 iterations (the first 1000 iterations are burn-in period) of the MCMC algorithm to estimate unknown parameters. Note that the jump sizes for o, /3i and X in the Metropolis-Hasting algorithm are chosen to be 0.15, 0.75, and 1 respectively. Figure 3.1 and Figure 3.2 show the traceplots of the MCMC algorithm outputs for unknown parameters in case 1. These traceplots only plots the iterations after the burn-in period. We can see that the Markov Chain is somehow stabilized after the burn-in period, and the Markov chain move thoroughly within the target range. 55 3.6. Results ;; IiII.hI I_I oI ‘-1 I I 0 1000 2000 3000 4000 5000 iterations 0 1000 2000 3000 4000 5000 iterations Figure 3.1: Traceplots of X1,X2 from MCMC algorithm in case 1. The traeeplots show the 5000 iterations after 1000 burn-in period. ci C.) ci c’J c C C’J >< Ii ii C Cs 0 C C c I i’ LU U) iILiI.h I I 1111 0 1000 2000 3000 4000 5000 iterations 0 1000 2000 3000 4000 5000 iterations 56 3.6. Results c’J 0 I I I 0 1000 3000 iterations 5000 0 1000 3000 5000 iterations Figure 3.2: Traceplots of u, A2 from MCMG algorithm in ease 1. The traceplots show the 5000 iterations after 1000 burn-in period. iii I 0 0 0 IF 1 0 0 0 57 3.6. Results Table 3.1 shows the true values, posterior means and the 95% credible intervals for each of the unknown parameters estimated from the data. true value posterior mean 95% CI 0 -0.013 (-0.112, 0.086) 5 1 1.014 ( 0.86, 1.17) ,3 -1.5 -1.48 (-1.73, -1.21) 1.5 1.48 ( 0.79, 2.13) Table 3.1: True values, posterior means, 95% credible intervals of .i,A2,3o,61 These are results from the first study in case 1. From the table, we can see that the 95% credible interval of each unknown parameter actually covers the corresponding true value and the posterior mean is very close to the corresponding true value. The second study just involves repeating the first study 100 times and the sampling distribution of each estimator is studied. Figure 3.3 is the histogram of the posterior means for the 100 samples in case 1. 58 c c I I H C;’.- Cb 1% Fr eq ue nc y o 5 10 20 30 I I I I Fr eq ue nc y 0 10 20 30 I I 0 01 0 u p 0 01 p . 01 - to - to 01 Fr eq ue nc y p 01 0 5 10 15 20 0 5 10 15 20 25 Fr eq ue nc y 0 •0 0 b 0 1)1 3.6. Results From the figure, we can see that the sampling distributions of j2 and i3i are approx imately normally distributed and centered at their true values, whereas the sampling distribution for is a little right skewed and is somehow left skewed. Table 3.2 summarizes each parameter estimator as: Bias MSE Coverage of the 95%CI Average 95%C[ Width ,u 0.0023 0.0050 95 0.19 -zO.0096 0.0080 98 0.31 /3o -0.065 0.013 93 0.54 /3 0.051 0.038 94 1.42 Table 3.2: Estimated bias, mean square error (MSE), coverage of 95% CI and the average width ofpA,301for case 1. All results are based on 100 datasets. The above table shows that the biases and estimated mean square error (MSE) for each unknown parameter are quite small, especially for fL and )2• Furthermore, the average widths, out of the 100 runs, of the credible intervals for /2 and )2 are pretty small. How ever, the wide average widths for /3o and j3 suggest that there is more variation among the 100 estimated and th. Also, out of the 100 times, the 95% credible intervals cover (CI) cover the true and )2 value 96 times, and cover the true /3o and 94 times, which suggests a good overall performance of the formal approach. 3.6.2 Results for Case 2 In this case, 2, the measurement error variance, is also estimated with other parameters and the “true value” is chosen as 0.25. The hyper-parameters of the Inverse Gamma distribution of o2 are specified as 200 and 50 respectively. The specific choice of the hyper-parameters results in a concentrated prior distribution. From Figure 3.4, we can see that this prior has a relatively narrow range and the centre of the distribution is close to the “true” value. 60 >a) 3.6. Results density plot 0 c’J U) 0 U) 0 Figure 3.4: Density plot of Inverse Gamma distribution with hyper-parameters: c = 200 and 3 = 50. The vertical line is “true” value of the u2 = 0.25 0.20 0.25 0.30 0.35 61 3.6. Results Moreover, the jump sizes for updating and X in Metropolis - Hastings Algo rithm are changed to 0.15, 0.55, 0.8, to avoid extreme acceptance/rejection rates. Again results from the first study (1 sample study) are displayed first. Figure 3.5 and Figure 3.6 are the traceplots of parameters the MCMC algorithm in case 2. 0 0 >0 0 q q 0 0 Figure 3.5: Traceplots of /3, i3i, X1X2 from MCMC algorithm in case 2. The traceplots show the 5000 iterations after 1000 burn-in period. i [1 L .i ii. ii. 0 Is) ‘1 ri 0 1000 2000 3000 4000 5000 iterations Ii .111i ii I III] .II1I II ‘‘rrJi’ i I I 0 1000 2000 3000 4000 5000 iterations I ii I i .1 .11 0 1000 2000 3000 4000 5000 iterations 0 1000 2000 3000 4000 5000 iterations 62 0d 0 0 3.6. Results 0 0 C 0 0 1000 2000 3000 4000 5000 terations Figure 3.6: Traceplots of u, )2 and g2 from MCMG algorithm in case 2. The traceplots show the 5000 iterations after 1000 burn-in period. We can see that the Gibbs sampling algorithm is very stable when updating o2 in 5000 iterations, and the chains do not have a mixing or convergence problem. Table 3.3 shows the posterior means and 95% CI analyzed based on the particular dataset. Again, all the 95% CI covers the true values of the parameters, and it’s reason able to conclude that the approach works well for this particular dataset. The procedure of the first study is repeated 100 times in the second study. We are able 0 1000 2000 3000 4000 5000 5 1000 2000 3000 4000 5000 iterations iterations 63 3.6. Results true value posterior mean 95% CI t 0 -0.0061 (-0.089, 0.10) 5 1 0.88 ( 0.73, 1.03) 0.25 0.25 (0.22, 0.29) i3 -1.5 -1.51 (-1.77, -1.27) th 1.5 1.72 ( 1.06, 2.39) Table 3.3: True values, posterior means, 95% credible intervals of,A2,/3o,/3i and u2. These are results from the first study in case 2. The “true” values are: u = 0, ?.. —1.5 and 4 1.5. to study the sampling distribution of to get a better understanding of the potential problem (overestimate the parameter) in the first study. The problem could be just happening by chance or it could show that overall the MCMC algorithm underestimates th. Figure 3.7 confirms that the problem of underestimating /3 is just due to chance and overall the estimated is roughly follows a normal distribution centered at the “true” value. Except for A2, other parameter estimators are all approximately following a normal distribution centered at their corresponding “true” value. 64 3.6. Results rfl_fffR1..;. r ____ —0.10 —0.05 0.00 0.05 0.10 —2.0 —1.8 —16 —1.4 —1.2 —1.0 Po I ___ _ . 0.248 0.250 0.252 0.254 0.256 Figure 3.7: Histograms of 100 posterior means for , A2, io i3 and o2in the second study in case 2. The “true” values are: u — O,A2 = 1,/3o — —1.5 and j3i = 1.5. 65 3.6. Results Table 3.4 outlined the bias, MSE, coverage of 95% and average width of the 95% for each estimator. It suggest that our approach produces reliable estimators with small biases, small MSE, satisfactory coverage rates and reasonable average credible interval widths. Bias MSE Coverage of the 95%CI Average 95%CI Width a 0.0034 0.0047 95 0.20 V 0.007 0.0087 95 0.32 -0.054 0.013 97 0.54 /3i 0.019 0.035 95 1.41 0.0023 0.00013 100 0.071 Table 3.4: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of t, A2, i3o, i3 and g2 for case 2. All results are based on 100 datasets. The “true” values are: t = 0, A2 = 1, i3o = —1.5 and i3 1.5. 3.6.3 Results for Case 3 In this case, we took our validation size to be 50, which means 10% each dataset has precise measurements of X. The new jump sizes for i3o, i3 and X are 0.1, 0.35 and 0.7. Figure 3.8 and Figure 3.9 are the traceplots of parameters in the MCMC algorithm in the first study 66 3.6. Results rLI I I I I I I o woo 2000 3000 4000 5000 iterations I I I I I 0 1000 2000 3000 4000 5000 iterations 1000 2000 3000 4000 5000 Figure 3.8: Traceplots of th X1,X2 from MCMG algorithm in case 3. The traceplots show the 5000 iterations after 1000 burn-in period. C (N 0 (N >< 0 0 0 0 La as 0 1000 2000 3000 4000 5000 iterations Os >< 5’, 0 I ‘ iterations 67 iterations 3.6. Results i.. _1IT..9I I rIiIIII I••I’iIII1 0 1000 2000 3000 4000 5000 teratiorts 04 0 0 .5 04 0 0 04 0 0 1000 2000 3000 4000 5000 iterations Figure 3.9: Traceplots of t, A2 and o2 from MCMC algorithm in case 3. The traceplots show the 5000 iterations after 1000 bur’rt-iri period. 04 0 0 1. 1 IL.. 0 0 0 0 1000 2000 3000 4000 5000 0 68 3.6. Results true value posterior mean 95% CI u 0 -0.032 ( -0.013, 0.066) 1 1.03 ( 0.87, 1.19) ET 0.25 0.24 (0.21, 0.28) j3 -1.5 -1.44 (-1.68, -1.19) 4i 1.5 1.30 ( 0.63, 1.97) Table 3.5: True values, posterior means, 95% credible intervals of , )2, /3o, /3i and o2. These are results from the first study in case 3. Both figures (Figure 3.8 and Figure 3.9) and statistic values (Table 3.5) indicate that for this particular generated dataset, the approach did a good job. Next, the histogram (Figure 3.10) and summary statistics (Table 3.6) of the sampling distribution for each unknown parameter in the second study are presented. 69 3.6. Results II —015 —0.05 0.00 0.05 0.10 0.15 0.20 —1.8 —1.6 —1.4 —1.2 I 01’52!02!5 081,1! 0.235 0.245 0.255 0.265 a2 Figure 3.10: Histograms of 100 posterior means for )2, /3, j3 andu2in the second study in case 3. The validation size is 50. The “true” values are: t = 0,A2 = = —1.5 and i = 1.5. 70 3.7. Comparison of Three Approaches From Figure 3.10, we can see that all the sampling distributions of parameter es timators are approximately normally distributed with some skewness involved, expect the histogram for u2 looks uniformly distributed at first glance. However, by taking a close look at the figure, we noticed that the scale for the histogram of o2 has 3 decimal places, which suggests that most estimated values of g2 are very close to the true value, 0.25. This observation is also confirmed in Table 3.6, since the estimation of o2 has the smallest bias, MSE and average CI width and highest coverage rate (100%). Bias MSE Coverage of the 95%CI Average 95%CI Width i -0.0032 0.0050 97 0.19 A 0.012 0.0080 95 0.31 -0.024 0.013 95 0.55 /3i 0.051 0.036 95 1.34 0.00148 0.00059 100 0.066 Table 3.6: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of, )2, i3o, i3 and o2 for case S. All results are based on 100 datasets with validation size 50. The “true” values are: i = 0,)..2 = 1,/3o —1.5 and i3 = 1.5. Evidence, such as small bias, small MSE and hight percentage of true values coverage, in Table 3.6 demonstrate a good performance for the validation model. Overall, we can conclude that the “formal” approach did an excellent job in estimat ing unknown parameters for three different models as: knowing the measurement error variance, prior information of the measurement error variance and validation model. Next, we are going to study the comparative performance of the “formal” approach, the “naive” and the “informal” approach. 3.7 Comparison of Three Approaches As we discussed previously, the “naive”, “formal” and “informal” approach would di chotomize either X or X* with respect to c or c* to fit a logistic regression model as 71 3.7. Comparison of Three Approaches following: logit(Pr(Y = 1IX))formal = /3o + /31(X > c) logit(Pr(Y = 1X*))jflfyJ. = i3o + /31(X* > c) logit(Pr(Y = 1IX*))najve = + 13i1(X* > c) The relationship of the health outcome, Y and the exposure variable X is gained from the estimated coefficients, /3 and i3. Thus, the comparisons are mainly based on estimating these two parameters. The true values are the same as before: i3o = —1.5 and /3i 1.5, and comparisons are constructed in each of the three cases of the “formal” approach. Note that the c is chosen as 1.3 so that the specificity and the sensitivity is very high (both are around 95%). Figure 3.11 to Figure 3.13 are the pairwise plots of results from the three approaches. By observing them, we see there are some linear relationship between estimators, and the linear relationship is somehow weaker when estimating /3 than estimating i3o. Moreover, both the “naive” approach and the “informal” approach tend to overestimate /3o but underestimate i3 ,whereas estimations from “formal” approach are located around the true value. Since it is very hard to tell which one is “better” between the “naive” and the “informal” approach from the figures, the summary statistics of estimators’ sampling distribution in these two approaches are crucial to know. 72 3.7. Comparison of Three Approaches I I I I A0 0’ //0 I 0 S :j :040 • fl_ 1’ 0 8 -- S o0 00 ID I - I . I 001 o I 00 I r I I • I I I I I I I —1.8 —1.6 —1.4 —12 —1.8 —1.6 —1.4 —1.2 —1.7 —1.5 —1.3 —1.1 informal o naive Po informal 13o 00 0 Sb ICM09 d1 Iq. OnCo 41I I — 0 I 0 i0% - 0//’ 000 0.5 8o 0 : I .7 _______ /l __ 0 j%0o00 I 0 I IO oO I 0 I I I I I I I I I 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 informal , naive , informal i Figure 3.11: Pairwise plots of three approaches, “naive”, “informal” and “formal”, in estimating i3o and j3 under case 1. The solid line is if “y x “, and the dash lines are the corresponding true values. 73 3.7. Comparison of Three Approaches I / I 0° /1 Ij In Hi ‘0. /:° I 0°,, /°,°I I 0 I 0 0 I I Ip ___________ ‘1I I I I I I —1.8 —1.6 —1.4 —1.2 —1.8 —1.6 —1.4 —1.2 —1.7 —1.5 —1.3 —1.1 informal flo naive J1 infomial fib 0 a q 0 I8:o ,oe In 0,0& In I aoo 0 I o%0 ---0-Il--- - - I 60/ 0 °li 7 0 - - -: 20 : o$ — D’1 0 I I LI) I 0 Q90 I LI) ° I 0 I I 0 I I I I I I I I I 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 informal fl. naive 1 infomial I3 Figure 3.12: Pairwise plots of three approaches , “naive “, “informal” and “formal”, in estimating /3i and i3 under case . The solid line is if “y = x “, and the dash lines are the corresponding true values. 74 3.7. Comparison of Three Approaches I I I 0.5 1.0 1.5 2.0 informal i 8 0 0.5 1.0 1.6 2.0 natve i Figure 3.13: Pairwise plots of three approaches, “naive “, “informal” and “formal”, in estimating i3o and 3iunder case 3. The solid line is if “y = x “, and the dash lines are the corresponding true values. 75 / /0 It S 0 —1.8 —1.6 —1.4 —1.2 —1.0 0 04 .5 . 0 I0• 0 04 Ing 0 0 0 informal Pa naive Pr 0 04 Lb a, a LI 0 —1.8 —1.6 —1.4 —1.2 —1,0 informal 0 04 I 0 0 0 - -: 000 //L 0 0 L0 0 I I 01!0i52!0 informal Pi 0 U) 3.7. Comparison of Three Approaches Table 3.7 reports the average posterior means of /3o and i3, as well as 95% confidence intervals for the average posterior means in all three cases. We are able to conclude that the formal approach is superior to informal and naive approach, since only the confi dence interval of “formal” approach cover the true values of i3o and !3i. Moreover, when estimating /3, the formal approach produces posterior means and confidence intervals, which are more closer to the true values. The “naive” approach generated the most narrow confidence interval that may also implies over-confidence. These results suggests that it is very dangerous to ignore the measurement errors in the analysis and making proper adjustments for the measurement error is crucial. Average Posterior Mean 95%Confidence Interval Case 1 /3Onajve -1.41 (-1.43, -1.39) Ojnformal -1.36 (-1.38, -1.34) /3Oformal 1.52 (4.53, 1.48) /3’naive 0.98 (0.93, 1.03) ’informal 1.14 (1.08, 1.20) /3lformaj 1.55 (1.47, 1.63) Case 2 /3Oflajve 4.40 (-1.42, -1.37) JOjnformaj -1.34 (-1.36, -1.31) /3Oformal -1.49 (4.52, 1.47) /3inaive 0.97 (0.92, 1.0201) ljnforma1 1.04 (0.99, 1.10) /3lformal 1.52 (1.45, 1.59) Case 3 !3Onajve -1.43 (-1.45, 4.40) /Ojnformal 1.37 (-1.39, -1.35) /9Oformal 4.49 (4.52, 1.47) I3’naive 0.99 (0.95, 1.04) /ljnformaz 1.12 (1.06, 1.18) /3lformal 1.55 (1.48, 1.62) Table 3.7: Average of posterior means and 95% confidence intervals for the average posterior means of o and i3i for “naive “, “informal” and “formal” approaches. Results are based on 100 samples in case 1, and 3 76 Chapter 4 QRS Data Study To illustrate the ideas and methods that we discussed in the previous chapters in a real world example, we use the QRS dataset. This dataset is provide by Vittinghoff, Glid den, Shiboski, and McCulloch (2004). Heart problems can be diagnosed through the timing of diverse stages in the contraction of the heart. Electrocardiography(EKG) is the device that records the electrical activities of the heart through a duration of time. As the authors indicate the QRS wave is defined as a commonly measured time interval in the contraction of the ventricles. The study dataset contains the QRS times (in mu liseconds) for 53 patients, of whom 18 have the inducible ventricular tachycardia (IVT) and 35 of them are without IVT. Note that the sample size is relatively small, since it is very difficult to assemble a large number of subjects to participate in a brain wave study, and the cost of the study is very high. Thus, studies that involve brain waves and electrocardiography devices commonly have small sample sizes. Though the sample size is considerably small, it is still a good and clean dataset to illus trate our ideas and methods. The response variable Y takes the value of 1 if the subjects has IVT, and 0 otherwise, while the covariates variable X is the QRS time (in millisec onds). Since the QRS time is a continuous variable, we are focusing on the approach introduced in Chapter 3. Even through, in the literature, there are researchers who argue about the accuracy of the QRS duration (Tomlinson, Betts, and Rajappan, 2009), which indicates that measurement error could exist in the measurement of timing in the real world, for the purpose of this thesis, we treat the QRS timing as precisely measured (X) and we simulate the surrogate variable, X, in order to compare the results obtained from the true values, naive analysis, informal analysis and our formal analysis. 77 Chapter 4. QRS Data Study Nevertheless accuracy of the QRS is questioned by many researchers, there are few articles states the possible magnitude of the measurement error. According to Sahambi, Tandon, and Bhatt (2009), the maximum error rate of the QRS is 6.25% due to the 50 Hz power-line interference. As we lack of detailed information about how the data are collected, we would simply adopt the measurement error rate stated by Sahambi et al. (2009). For our illustrative proposes, we have to assume we know the variance of additive measurement error, u2, in order to generate X*. The error rate we accepted previously is a multiplicative error, and proper transformation is necessary to acquire an additive error (as we defined in Chapter 3). As a result, we choose X = log(QRStime) instead of QRS time directly. As defined in Chapter 3, under the nondifferential assumption, the measurement model here is X’IX N(1ogQRS,u2). Mathematically, we can compute a2 as: X = 1ogQRS+o-Z eX* = (QRS)e which motivates e = 1.0625 where Z is a standard normal random variable. We get the variance of additive mea surement error as 0.0312 = 0.00096, and the surrogate variable, X’, can be generated afterward. Since we don’t have validation data and we suppose we don’t know the variance of the measurement error, the “formal analysis” approach is based on the model that was 78 4.1. Naive Approach and Informal Approach introduced as case 2 in Chapter 3. Thus, the response model is 1ogit(P(Y=1X)) = log1 51 (4.1) = /30+/31(X>c) Under the assumption of nondifferential measurement error, the measurement model is X’X N(X,). The exposure model is: X N(,)). We will use this set-up to conduct naive, informal and formal analysis and compare results produced by three approaches with the true values. 4.1 Naive Approach and Informal Approach For the response model, the value of c is chosen as log(120) as suggested by Tomlinson, Betts, and Rajappan (2009). To refresh the memory, the naive approach would formulate the response model based on X* and c as: logit(P(Y = 1IX*)) P(Y=1II(X*>c)) = In order to perform the informal approach, we need to pick up a c value such that, the specificity is very high (as discussed in Chapter 3), and it is chosen as c = log(123) so 79 4.2. Formal Approach that the SP = 0.94. Then the response model of informal approach is: P(Y — 1II(X* > c))logit(P(Y = 1IX*)) = logit1 — P(Y= 1I(X* > cj) = /30+/31(X*>c*) The estimates of and for these two approaches can be easily obtained by using the gim function in R. Results are shown in the section 4.3. 4.2 Formal Approach Since we are assume the QRS timing from the dataset is the true value, we are able to obtain some information about the exposure variable X, such as the the mean of mean(X) = p. = 4.64, var(X) = A2 = 0.094 and the measurement error variance u2 0.00091. Note that those values are not applicable in the real world, and they are available here for this example only (because of our assumption). As discussed early in the simulation study of Chapter 3, we need to assign prior information for the “unknown” parameters. Remember that in section 3.6.2, flatter priors are assigned to unknown parameters since the simulated sample size is considerably large. Though, the sample size is pretty small in this dataset, we are still able to assign flatter priors to some of the unknowns. For example, p. is assigned as p. N(0, 1002), A2 is assigned as A2 IG(0.01, 0.01) and /i is assigned with N(0, 1002). Dislike the non-informative priors for p., A and i3o, the prior information becomes very crucial for 2 and i3. It is reasonable to assign concentrated priors to these two unknowns, since researchers could easily obtain relative information about these two unknowns from precious study. As a result, concentrated priors are assigned as such that j3 N(0, 12) and .2 IG(100, 0.1). Figure 4.1 displays the prior density plots vs their corresponding “true” values. 80 4.2. Formai Approach I I I I I —400 —200 0 200 400 0.00*00 5.Oe+302 1.00+303 1.50+303 2.00+303 I’ II.I.I —4 —2 0 2 4 —40000 —20000 0 20000 40000 0.0008 0.0010 0.0012 0.0014 0.0018 Figure 4.1: Prior plots of unknown parameters with their hyper-parameters: i N(0,1002),A IG(0.01,0.01),o-2-‘ IG(100,0.1),i3 ‘-‘. N(0,1002)and /3 ‘-- N(0,12). The vertical lines are the corresponding “true” values as: u = 4.64, A2 = 0.094, /3o —0.90, /3i 0.76 and g2 = 0.00096. 81 4.2. Formal Approach We observe that, regardless of the strength of the prior, the center of prior density for each unknown parameter is most likely located around the corresponding true value. The plot for A2 looks abnormal, since the range of its density function goes from 0 to infinity so that it is quite difficult to display on a limited scale. 2 has the most concentrated prior, since approximately 95% of its data are enclosed by 0.0008 and 0.0014, a pretty small range. x Similarly as in the simulation study, we are unable to obtain the full conditional distributions for o and so we are going to use the Metropolis- Hastings Algorithm to obtain their estimators. All other unknown parameters are updated as in the simulation study of Chapter 3 case 2. The jump size for updating X, /3o and in MR consist with the simulation study, which are 0.15, 0.35, 0.8 respectively. Figure 4.2 shows the traceplot of 200000 iterations after 2000 burn-in period, and there are no apparent mixing problems to be noticed. The numeric results are presented and compared in the next section. 82 4.2. Formal Approach I I I I —2.5 —1 5 —0.5 0.5 13o —2 —I 0 1 2 3 13i Figure 4.2: Traceplot and posterior density plots of !0000 iterations after 1000 burn-in period of /3 and /3 when applying the MH sampling method. LO C Lf) L() Co - I I I I I 0 50000 100000 200000 iterations >. a) C ci) 0 C ci) q C CD C C C’) C C C Co C C C’) 0 C C C’) - L. 1.1 Ii.i ii.l,.ii CD. C - — F.. IIIr1r —j. I J..I I I 0 50000 100000 200000 iterations 83 4.3. Results 4.3 Results Before discussing the results estimated from the naive, informal and formal approaches, we would like to find out the supposed true result first. It is very easily obtained from the glm function in R and the true model for explaining whether or not a subject has the IVT is estimated as follows: log1 Y11)) —0.90 + O.761(X > c). Table 4.1 records results of i3o and i3i from the naive, informal and formal approaches. Estimate 95%CI CI width /3Onajve -0.86 (-1.58, -0.14) 1.44 fOjnformal 0.86 (4.58, 0.14) 1.44 /3Oformaj -0.90 (-1.63, -0.22) 1.35 0.61 (-0.63, 1.84) 2.48 /3ljnform& 0.61 (0.63, 1.84) 2.48 /3lfo’rmai 0.67 (0.46, 1.73 ) 2.19 Table 4.1: Estimators, 95% confidence, or credible, intervals of i3o and /3 by using “naive”, “informal” and “formal” approaches. In light of the study performed in Chapter 3, the results for analyzing this dataset behave as we would expect. Though the results are close, the formal approach performed the best when strong priors for /3 and c2 are provided. As the data size gets larger, we believe that the formal approach will keep doing a good job, i.e. estimated values close to the “true” values and the less variability of estimated parameters, even when flatter priors are assigned. Surprisingly, the naive and informal approaches produce the same results, and one possible explanation is that the data size is very small, and there is no significant difference in modeling X* with threshold c or c in this special case. Note that when the data size increases, the chance that naive and informal approaches produce the same results will become slim. 84 Chapter 5 Conclusion and Future Work In this thesis, we propose a formal approach to adjust mismeasurement in case-control studies. Ignoring potential mismeasurement on exposure variables could lead to serious problems, such as loss of power, biased estimation and misleading conclusions. In the literature, many methods were proposed to deal with misclassification and measurement error, such as matrix method, inverse matrix method, regression calibration, SIMEX, Expectation-Maximization algorithm in frequentist perspective. Lots of methods are ready to use, nevertheless, they all have their limitations. For example, Carroll, Rup pert, Stefanski and Crainiceariu (2006) stated that though the SIMEX and regression calibration are simple methods to implement, they have limited contributions in reduc ing the bias caused by the measurement error. The Bayesian approach, on the other hand, is able to correct the bias more precisely and generally. Though, a potentially misspecified exposure model, too complex posterior and intensive computational require ments are occasionally drawbacks in the approach, it has the great advantage that the uncertainties of parameters can be fully incorporated. A formal approach in the Bayesian perspective is introduced in this dissertation to ac count for both categorical and continuous exposure variables under the non-differential assumption. Fundamental techniques and concepts are introduced in Chapter 1. Ideas of the proposed formal approach that deals with a categorical exposure variable is in troduced and studied through investigating three cases in Chapter 2. The underlying theme of the formal approach that adjusts the measurement error (continuous exposure variable) is presented and examined by studying its performance on three cases again in 85 Chapter 5. Conclusion and Future Work Chapter 3. In Chapter 4, a real world dataset is used to evaluate the proposed model. Gibbs sampler and Metropolis-Hasting algorithm are mainly used to sample the param eters of interest from their corresponding posterior distributions. In Chapter 2, we investigate three cases where we have different levels of knowledge about misclassified probabilities. In each case, the approach is implemented with both low and high prevalence rate, as well as a different validation sample size when we assume validation data are available. Stabilized traceplots suggest that the overall convergence rate is adequate for Markov chain simulation in our proposed model. When the sampling distribution of each unknown parameter is studied, statistical assessments such as, small estimator bias, small mean square error, high coverage rate of the true value and reason able average 95% credible interval length, all indicate that overall the model is efficient and accurate. When only the prior information about the misclassified probabilities is known, strong and concentrated priors are required to get good estimation. One pos sible explanation is that, a strong prior is able to reduce the variability of estimators and improve the efficiency of the approach. However, when a small proportion of the validation data is available, it is found that the strong priors become unnecessary, which indicates that the model is able to capture enough information to make good estimation. Moreover, it seems like the size of the validation data does not significantly affect the estimation, and this would be an interesting point to study later on. When the results obtained from low prevalence rate and high prevalence in each case is compared, it is delightful to observe that the approach could work for any prevalence rate. In the end of Chapter 2, estimated log odds ratios are compared for the proposed formal approach and informal approach. It is found, as expected, that the informal approach tends to underestimate the association between the exposure variable and response variable most of the time, and that less than 25% of the 95% confidence intervals actually cover the true value. Even though, the formal approach sometimes overestimate the log odds ratio, the majority of its 95% confidence intervals include the true values and the estimator 86 Chapter 5. Conclusion and Future Work bias is much smaller than it is in the informal approach. In Chapter 5, the proposed approach is implemented with a continuous exposure variable. A logistic regression model is specified for the binary exposure variable and dichotomized continuous exposure variable so that their association is measured according to the coeffi cients of logistic regression. Two simulation studies for three cases studies are conducted based on our knowledge of the magnitude of the measurement error. As expected, we find that our proposed approach is both efficient and accurate. As in the discrete case, proper priors are crucial when we only have the prior information in hand but not so important when the validation data are accessible. Coefficients obtained from the “naive” approach and informal approach (both of them use the association estimated from response vari able Y and X* to estimate the true relationship between Y and X) are compared with our proposed formal approach at the end of the chapter. Overall, the performance of those approaches decline as we move from formal approach to informal approach and then naive approach. A real world example is used to illustrate our idea and approach. Due to a very small sample size of the dataset, adjusted proper priors are again critical to gain valuable estimations. Fortunately, it is proved that our suggested approach can work practically after strong priors are assigned. Our proposed Bayesian adjustment for mismeasurement can be extended to a variety of research areas. One straightforward extension would be having more precisely mea sured covariates in addition to the model that we have right now. Further investigation can be conducted to understand the weakness of our approach, which is that strong priors are needed when we only have prior information. A relevant and interesting study would be to find out whether there is a “cut-off” validation size so that the researchers are able to gain the “maximum” information while spending minimum time or money. 87 Bibliography B. A. Barron. The effects of misclassification on the estimation of relative risk. Biomet rics, 33:414—418, 1977. R. J. Carroll, D. Ruppert, L.A. Stefanski, and C. M. Cainiceanu. Measurement Error in Nonlinear Models, Vol. 105 of Monographs on Statistics and Applied Probability. Chapman &Hall/CRC, Boca Raton., second edition, 2006. G. Casella and E.I. George. Explaining the gibbs sampler. The American Statistician, 46:167—174, 1992. J.R. Cook and L. A. Stefanski. Simulation-extrapolation estimation in parametric mea surement error models. Journal of American Statistical Association, 89:1314—1328, 1994. A. P. Dempster, N. M. Lairdd, and D. B. Rubin. Maximum likelihoodd from incomplete data via the em algorithm. Journal of the Royal Statistical Society, B 39(1):1—38, 1977. E. Greenberg and Siddhartha Chib. Understanding the metropolis-hastings algoithm. The American Statistician, 49(4): 167—174, 1995. Paul Gustafson. Measurement Error and Misclassification in Statistics and Epidemiol ogy: Impacts and Bayesian Adjustments, volume Vol. 13 of Interdisciplinary Statistics. Chapman &Hall/CRC, Boca Raton, 2004. H. Kuchenhoff, S. M. Mwalili, and E. Lesaifre. A general method for dealing with misclassification in regression: the misclassification simex. Biometrics, 62:85—96, 2006. 88 Chapter 5. Bibliography R. H. Lyles. A note on estimating crude odds ratios in case-control studies with differ entially misclassified exposure. Biometrics, 58:1034—1037, 2002. R.C. MacCallum, S. Zhang, K.J. Preacher, and D.D. Rucker. On the practice of di chotimization of quantitative variables. Psychological Methods,, 7:19—40, 2002. R. J. Marshall. Validation study methods for estimating exposure proportions and odds ratios with misclassified data. Journal of Clinical Epidemiology, 43:941—947, 1990. M. J. Morrissey and D. Spiegelman. Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons. Biometrics, 55:338—344, 1999. D.A. Pierce and A.M. Kellerer. Adjusting for covariate error with nonparametric assess ment of the true covariate distribution. Biometrika, 91:863—876, 2004. R.L. Prentice and R. Pyke. Quantitative analysis of errors due to power-line interference and base-line drift in detection of onsets and offsets in ECO using wavelets. Medical and Biological Engineering and Computing, 35(6):747—751, 1979. P. Royston, D. G. Altman, and W. Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine, 25(1):127—141, 2006. J.S. Sahambi, S.N. Tandon, and R.K.P. Bhatt. Quantitative analysis of errors due to power-line interference and base-line drift in detection of onsets and offsets in ECG using wavelets. Medical and Biological Engineering and Computing, 35(6):747—751, 2009. A. Skrondal and S. Rabe-Hesketh. Generalized Latent Variable Modeling: multi-level, longitudinal and strucutral equation models. Chapman &Hall/CRC, Boca Raton., 2004. D.R. Tomlinson, T.R. Y. Betts, and K. Rajappan. Accuracy of manual qrs duration assessment: its importance in patient selection for cardiac resynchronization and im plantable cardioverter defibrillator therapy. Europace, 11(5) :638—642, 2009. 89 Chapter 5. Bibliography E. Vittinghoff, David V. Glidden, S.C. Shiboski, and C. E McCulloch. Regression methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Springer, 2004. 90
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Formal and informal approaches to adjusting for exposure...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Formal and informal approaches to adjusting for exposure mismeasurement Shen, Tian 2009
pdf
Page Metadata
Item Metadata
Title | Formal and informal approaches to adjusting for exposure mismeasurement |
Creator |
Shen, Tian |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | In many research areas, measurement error frequently occurs when investigators are trying to analyze the relationship between exposure variables and response variable in observational studies. Severe problems can be caused by the mismeasured exposure vari ables, such as loss of power, biased estimators, and misleading conclusions. As the idea of measurement error is adopted by more and more researchers, how to adjust for such error becomes an interesting point to study. Two big barriers in solving the problems are as follows. First, the mechanism of measurement error (the existence and magnitude of the error) is always unknown to researchers. Sometimes only a small piece of information is available from previous studies. Moreover, the situation can be worsen when the study conditions are changed in the present study, which makes previous information not applicable. Second, some researchers may still argue about the consequences of ignoring measurement error due to its uncertainness. Thus, the adjustment for the mismeasurement turn to be a difficult, or impossible task. In this thesis, we are studying situations where the binary response variable is precisely measured, but with a misclassified binary exposure or a mismeasured continuous exposure. We propose formal approaches to estimate unknown parameters under the non-differential assumption in both exposure conditions. The uncertain variance of measurement error in the continuous exposure case, or the probabilities of misclassification in the binary exposure case, are incorporated by our approaches. Then the posterior models are estimated via simulations generated by the Gibbs sampler and the Metropolis - Hasting algorithm. Meanwhile, we compare our formal approach with the informal or naive approach in both continuous and exposure cases based on simulated datasets. Odds ratios on log scales are used in comparisons of formal and informal approaches when the exposure variable is binary or continuous. General speaking, our formal approaches result in bet ter point estimators and less variability in estimation. Moreover, the 95% credible, or confidence intervals are able to capture the true values more than 90% of the time. At the very end, we apply our ideas on the QRS dataset to seek consistent conclu sions draws from simulated datasets and real world datasets, and we are able to claim that overall our formal approaches do a better job regardless of the type of the exposure variable. |
Extent | 1924669 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-11-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0070879 |
URI | http://hdl.handle.net/2429/15219 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2009_fall_shen_tian.pdf [ 1.84MB ]
- Metadata
- JSON: 24-1.0070879.json
- JSON-LD: 24-1.0070879-ld.json
- RDF/XML (Pretty): 24-1.0070879-rdf.xml
- RDF/JSON: 24-1.0070879-rdf.json
- Turtle: 24-1.0070879-turtle.txt
- N-Triples: 24-1.0070879-rdf-ntriples.txt
- Original Record: 24-1.0070879-source.json
- Full Text
- 24-1.0070879-fulltext.txt
- Citation
- 24-1.0070879.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0070879/manifest