Formal and Informal Approaches to Adjusting for Exposure Mismeasurement by Tian Shen B.Sc., The University of New Brunswick, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Statistics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August 2009 © Tian Shen, 2009 Abstract In many research areas, measurement error frequently occurs when investigators are trying to analyze the relationship between exposure variables and response variable in observational studies. Severe problems can be caused by the mismeasured exposure vari ables, such as loss of power, biased estimators, and misleading conclusions. As the idea of measurement error is adopted by more and more researchers, how to adjust for such error becomes an interesting point to study. Two big barriers in solving the problems are as follows. First, the mechanism of measurement error (the existence and magnitude of the error) is always unknown to researchers. Sometimes only a small piece of informa tion is available from previous studies. Moreover, the situation can be worsen when the study conditions are changed in the present study, which makes previous information not applicable. Second, some researchers may still argue about the consequences of ignoring measurement error due to its uncertainness. Thus, the adjustment for the mismeasure ment turn to be a difficult, or impossible task. In this thesis, we are studying situations where the binary response variable is pre cisely measured, but with a misclassified binary exposure or a mismeasured continuous exposure. We propose formal approaches to estimate unknown parameters under the non-differential assumption in both exposure conditions. The uncertain variance of mea surement error in the continuous exposure case, or the probabilities of misclassification in the binary exposure case, are incorporated by our approaches. Then the posterior models are estimated via simulations generated by the Gibbs sampler and the Metropo lis - Hasting algorithm. ii Abstract Meanwhile, we compare our formal approach with the informal or naive approach in both continuous and exposure cases based on simulated datasets. Odds ratios on log scales are used in comparisons of formal and informal approaches when the exposure variable is binary or continuous. General speaking, our formal approaches result in bet ter point estimators and less variability in estimation. Moreover, the 95% credible, or confidence intervals are able to capture the true values more than 90% of the time. At the very end, we apply our ideas on the QRS dataset to seek consistent conclu sions draws from simulated datasets and real world datasets, and we are able to claim that overall our formal approaches do a better job regardless of the type of the exposure variable. 111 Table of Contents Abstract ii Table of Contents iv List of Tables vi List of Figures x Acknowledgements xiv Dedication xv 1 2 Introduction 1 1.1 Misclassification and Measurement Error 1 1.2 Overview of Current Available Methods 2 1.3 Bayesian Analysis 4 1.3.1 Bayes Rule 4 1.3.2 Prior Distribution 5 1.3.3 Markov Chain Monte Carlo Algorithm 6 Simulation Study for Categorical Exposure Variable 8 2.1 Introduction 8 2.2 2 x 3 Table—Formal Analysis 8 2.3 Informal Analysis 10 2.4 Odds Ratio 11 iv Table of Contents 2.5 Case 1: When We Know 2.6 Case 2: When 2.7 Case 3: Validation Data 16 2.8 Results 17 2.9 PijS 13 PjJS are Unknown 2.8.1 Results for Case 1 17 2.8.2 Results for Case 2 24 Results for Case 3 32 2.10 Comparison of Odds Ratios 3 5 44 Simulation Study for Continuous Exposure Variable 3.1 Introduction 3.2 Posterior and Prior Distributions 3.3 Case 1: When We Know 3.4 Case 2: When We Don’t Know u 2 3.5 Case 3: When We Have Validation Data 3.6 Results 3.7 4 14 3.6.1 Results for Case 1 3.6.2 Results for Case 2 3.6.3 Results for Case 3 2 j . . Comparison of Three Approaches QRS Data Study 77 4.1 Naive Approach and Informal Approach 79 4.2 Formal Approach 80 4.3 Results 84 Conclusion and Future Work Bibliography 85 88 V List of Tables 2.1 2x3 table due to the misclassification and measurement error 9 2.2 A 2x 3 table for formal analysis 9 2.3 2x2 table by epidemiologists’ rule 10 2.4 A 2x 2 table for informal analysis 11 2.5 Validation data and main data 16 2.6 True values, posterior means, 95% credible intervals of r 0 and r . These 1 are results from the first simulation study (one dataset simulation) for scenario 1 in case 1 2.7 20 Bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and r 1 for scenario 1 case 1. All results are based on 100 datasets, and their true values are 0.2 and 0.25 respectively 2.8 21 True values, posterior means, 95% credible intervals of r 0 and r . These 1 are results from the first simulation study (one dataset simulation) for scenario 2 in case 1 2.9 22 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and rl for scenario 2 case 1. All results are based on 100 datasets, and their true values are 0.7 and 0.4 respectively 23 2.10 True values, posterior means, 95% credible intervals of r ,r 0 1 and pjs. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1 26 vi List of Tables 2.11 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and ri for scenario 1 case 2. All results are based on 100 datasets, and their true values are r 0 (0.5, 0.3, 0.2) and (pio,pil,pi2) = 0.2, ri = 0.25, (Poo,Poi, P02) = = (0.1,0.3, 0.6) 29 2.12 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of TO and ri for scenario 1 case 2. All results are based on 100 datasets, and their true values are ro (0.5, 0.3, 0.2) and (plo,pll,p12) = = 1 0.7,r = 0.4, (poo,poi,po2) (0.1,0.3, 0.6) 30 2.13 True values, posterior means, 95% credible intervals of r ,r 0 1 and pjjs. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 3. Validation size —200 36 2.14 True values, posterior means, 95% credible intervals of r ,r 0 1 and jj5. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 3. Validation size =100 36 2.15 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of r 0 and r 1 for scenario 1 case 3. All results are based on 100 datasets, and their true values are r 0 (0.5, 0.3, 0.2) and (plo,pii,p12) = = 0.2, r 1 = 0.25, (poo,poi,p02) (0.1,0.3,0.6). Validation data=200. . 37 2.16 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of TO and ri for scenario 2 case 3. All results are based on 100 datasets, and their true values are r 0 (0.5, 0.3,0.2) and (pio,pi1,p12) = = 0.2, r 1 = 0.25, (p00, Poi ,Po2) (0.1,0.3,0.6). Validation data=100. . . 37 2.17 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of To and Ti for scenario 3 case 3. All results are based on 100 datasets, and their true values are r 0 (0.5,0.3,0.2) and (pio,pii,p12) = = 1 0.7,r = 0.4, (poo,poi,p02) (0.1,0.3,0.6). Validation data=200. = . 41 vii List of Tables 2.18 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and ri for scenario case 3. All results are based on 100 datasets, and their true values are r 0 = (0.5, 0.3, 0.2) and (plo,pll,p12) = 1 0.7,r 0.4, (poo,pol,po2) (0.1,0.3,0.6). Validation data=100 . = . 41 2.19 Estimated bias, mean square error (MSE), coverage of 95 % confidence intervals and average 95%CI width of informal and formal log odds ratios for scenario 1 (or 2) in three cases. True log odds ratio is 0.28 3.1 46 True values, posterior means, 95% credible intervals of 1 i,A 2 , 0 . 3 These are res’ults from the first study in case 1 3.2 58 Estimated bias, mean square error (MSE), coverage of 95% CI and the average width of, A ,i 2 o, i for case 1. All results are based on 100 datasets. 60 3 3.3 True values, posterior means, 95% credible intervals of u, A ,/ 2 o, i3i and 3 o.2 These are results from the first study in case 2. The “true” values 1 are:=0 = 2 = 1,8o=—1.5a 1.5 ,A nd3 3.4 3.5 64 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ,u, A , i3o, / 2 i and 3 for case 2. All results are based on 100 datasets. The “true” values are: i = 2 0,X = 1,/3o = True values, posterior means, 95% credible intervals of —1.5 and .t, i = 3 / 1.5 66 , /3o, 3i and 2 A . These are results from the first study in case 3 2 a 3.6 69 Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of u, 2 A i3o, i3 and a , 2 for case 3. All results are based on 100 datasets with validation size 50. The “true” values are: = 0, A 2 = l,/3zz—1.5 and/3 —_zl.5 1 3.7 71 Average of posterior means and 95% confidence intervals for the average posterior means of j o and i3 for “naive”, “informal” and “formal” ap 3 proaches. Results are based on 100 samples in case 1, 2 and 3 76 viii List of Tables 4.1 Estimators, 95% confidence, or credible, intervals of /3o and f i by using 3 “naive”, “informal” and “formal” approaches 84 ix List of Figures 2.1 Traceplots of ro and r 1 from MCMC algorithm in scenario 1 case 1. The traceplots show the 20000 iterations after 1000 burn-in period. The true values of ro and r 1 are 0.2 and 0.25 respectively 2.2 19 Histogram of 100 posterior means of ro and r in the second simulation study for scenario I case 1. The “true” values of ro and r 1 are 0.2 and 0.25 respectively 2.3 21 Histogram of 100 posterior means of r 0 and r 1 in the second simulation study for scenario 2 case 1. The true values of r 0 and r 1 are 0.7 and 0.4 respectively 2.4 23 Density plots of “true” pj values with its corresponding Beta density func tion. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (55,45), Beta(30, 70), Beta (15, 85); Beta(10, 80), Beta(25, 80) and Beta(65,35) 2.5 25 Traceplots of r ,r 0 1 and Pij5 from MCMG algorithm in scenario 1 case 2. The traceplots show the 50000 iterations after 1000 burn-in period. The “true” values are ro (plo,pll,p12) 2.6 = = 0.2,ri = ,(poo,pol,po2) 25 O. = (0.5,0.3,0.2) and (0.1,0.3, 0.6) Histogram of 100 posterior means of r ,r 0 1 and 27 PijS in the second simu lation study for scenario 1 case 2. The “true” values are r 0 ,(poo,pol,p02) 25 O. = (0.5,0.3,0.2) and (plo,pll,p12) = = 0.2, r 1 (0.1,0.3,0.6). = . 28 x List of Figures 2.7 Histogram of 100 posterior means of r ,r 0 1 and in the second simu PijS lation study for scenario 1 case 2. The “true” values are ro ,(poo,pol,po2) 4 O. 2.8 = (0.5,0.3,0.2) and = 0.7, r 1 (0.1,0.3,0.6). (Plo,p11,p12) . = . 31 Density plots of “true” pj,j values with its corresponding Beta density func tion. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (1,2), Beta(1,2), Beta(1, 2); Beta(1, 2), Beta(1, 2) and Beta(1,2) 2.9 Traceplots of ro, ri and PijS 33 from MGMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r 0 (plo,pll,p12) = = 1 0.2,r = 0.25, = (poo,pol,po2) (0.5,0.3,0.2) and (0.1,0.3,0.6). Validation size =200 34 2.10 Traceplots of r ,r 0 1 and PijS from MCMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r 0 (plo,pll,p12) = = 1 0.2,r = 0.25, (poo,pol,p02) = (0.5,0.3,0.2) and (0.1,0.3,0.6). Validation size =100 2.11 Histogram of 100 posterior means of r ,r 0 1 and 35 in the second simu jJS lation study for scenario 1 case 3. The “true” values are r 0 5,(poo,pol,p02) 2 O. = (0.5,0.3,0.2) and (plo,pll,p12) = = 0.2, r 1 = (0.1,0.3,0.6). Val idation size =200 39 2.12 Histogram of 100 posterior means of r ,r 0 1 and in the second simu PijS lation study for scenario 2 case 3. The “true” values are r 0 ,(poo,po1,p02) 25 O. = (0.5,0.3,0.2) and (plo,pll,p12) = = 0.2, r 1 = (0.1,0.3,0.6). Val idation size =100 40 2.13 Histogram of 100 posterior means of ro, r and PuS in the second simu lation study for scenario 3 case 3. The “true” values are r 0 ,(poo,pol,po2) 4 O. dation size =200 = (0.5,0.3,0.2) and (Plo,P11,P12) = 0.7, r 1 = (0.1,0.3,0.6). Vali 42 xi List of Figures 2.14 Histogram of 100 posterior means of lation study for scenario ,O. 4 (poo,pol,po2) = ro, ri and in the second simu PijS 0 4 case 3. The “true” values are r (0.5,0.3,0.2) and (p1o,pll,p12) = = 0.7, r 1 = (0.1,0.3,0.6). Vali dation size =100 43 2.15 Comparisons of log informal odds ratios with log formal odds ratios for scenario 1 in case 1 (upper left), scenario 1 in case 2 (upper right), scenario 1 & 2 in case 3 (lower left and right) . The true log odds ratio value is 0.285. The line in the each graph plots if two odds ratios are the same (y=x line). The dash lines represent the “true” value 3.1 45 Traceplots 1 2 0 of/3 , /3 X from MCMC algorithm in case 1. The traceplots show the 5000 iterations after 1000 burn-in period 3.2 Traceplots of , 56 from MCMC algorithm in case 1. The traceplots show the 5000 iterations after 1000 burn-in period 3.3 Histograms of 100 posterior means for case 1. The true values are u 3.4 0, A 2 = , i3o, i3 in the second study in 2 A = —1.5 and i3 = 1.5 . = 50. The vertical line is “true” value of the a 2 = 0.25 . 59 . Density plot of Inverse Gamma distribution with hyper-parameters: c 200 and /3 3.5 = ,t, 57 . = . Traceplots of 2 0 /3 , 1 /3 X from MCMC algorithm in case 2. The traceplots show the 5000 iterations after 1000 burn-in period 3.6 Traceplots of 1 u, A 2 and .2 62 from MCMC algorithm in case 2. The traceplots show the 5000 iterations after 1000 burn-in period 3.7 63 Histograms of 100 posterior means for 0 1 , 2 i,A , j3 3 and 2 u i n the second study in case 2. = 3.8 The “true” values are: = 0, A 2 = 1, i3o = —1.5 and 1.5 65 Traceplots of 2 ,13 0 /3 , 1 X from MCMC algorithm in case 3. The trace plots show the 5000 iterations after 1000 burn-in period 3.9 61 67 Traceplots of, A 2 and a 2 from MCMC algorithm in case 3. The traceplots show the 5000 iterations after .1000 burn-in period 68 xii List of Figures 3.10 Histograms of 100 posterior means for ji, ,/ 2 A 2 n the second o, 3i and i 3 study in case 3. The validation size is 50. The “true” values are: z = 0 = 2 O,A = =1.5 1 —1.5 1,/3 and3 70 3.11 Pairwise plots of three approaches, “naive”, “informal” and “formal”, in estimating /3o and j3 under case 1. The solid line is if “y x “, and the dash lines are the corresponding true values 3.12 Pairwise plots of three approaches estimating /3i and i3i under case , . 73 “naive”, “informal” and “formal”,in The solid line is if ‘y = x “, and the dash lines are the corresponding true values 74 3.13 Pairwise plots of three approaches, “naive”, “informal” and “formal”, in estimating / o and /3i under case 3. The solid line is if “y 3 = x “, and the dash lines are the corresponding true values 4.1 75 Prior plots of unknown parameters with their hyper-parameters: ),A 2 N(0,100 2 IG(0.01,0.01),cr c—’ 0 IG(100,0.1),/3 “-i N(0,100 ) 2 and N(0, 12). The vertical lines are the corresponding “true” values as: = 4.2 4.64, A 2 0.094, = —0.90, /3 = 0.76 and 2 = g 0.00096 81 Traceplot and posterior density plots of p0000 iterations after 1000 burn-in period of /3o and /3i when applying the MH sampling method 83 3(111 Acknowledgements First of all, I would like to give special thanks to my supervisor, Professor Paul Gustafson, since without his thoughtful support and guidance, I would not able to complete this the sis. It was an honor to work with him and he is the best supervisor that one could ever ask for. I would also thank Professor Lang Wu, for agreeing to be my second reader and also for his excellent teaching throughout my study period. I express sincere gratitude to Professors John Pekau, Ruben Zamar, Arnaud Doucet, Michael Schuizer, Jiahua Cheri and Matias Salibián-Barrera for their constant support and outstanding teaching. I am grateful to everyone, who makes the department a wonderful place to study. Spe cial thanks to Mike Danilov, Ehsan Karim, Derrick Lee, Reman Epstein, Xu Chen , Liu Zhong and Xu Wang for their help and friendship, and to Peggy Ng, Viena Tran, Elaine Salameh for their support and understanding of my work. Tian Shen Vancouver, BC, July 2009 xiv To My lovely Parents xv Chapter 1 Introduction In many research areas, statistical methods are used to analyze the relationships between two or more variables. For example, in the epidemiology area, statistical models are used to understand or study the relationship between an outcome variable Y and an explanatory variable X. For instance Y can be presence or absence of heart disease, and X can be presence or absence of smoking, where Y and X are both binary variables, or X can be the blood pressure, which is a continuous variable. There are four types of designs that are most often applied, which are the cross-sectional study, cohort study, case-control study and randomized controlled clinical trial. In this thesis, we are focused on the case-control study, which is also referred to as a retrospective study. We randomly select subjects from “case” and “control” groups, then compare the health outcomes for the two groups based on selected subjects. The explanatory variable X is often measured by some instruments. When X is precisely measured, the instrument is called a gold standard. However, due to the high cost and lack of such precise instrument, X is often measured imprecisely. If X is a categorical variable, imprecise measurements are called misclassifications, while if X is a continuous variable, they are called measurement errors. In this thesis, we are working on both discrete and continuous cases of X. 1.1 Misclassification and Measurement Error Generally speaking, misclassification means grouping a subject into a wrong category For example, a person who smoked one pack of cigarette in the day of the experiment might be accidently grouped as a heavy smoker while this person might barely smoke in other days. Thus, rather than recording X itself, a surrogate variable X* is often 1 1.2. Overview of Current Available Methods recorded instead. The misclassification probability (say Pu) defines the probability of classifying a subject into group i while its true status is in group j. When X is a continuous variable, by the definition of the classical error model in the measurement error literature, a surrogate variable X* is the linear term of X plus an er ror term. For example, a recorded patient’s blood pressure might be higher than his/her true values due the equipment error. As defined in Chapter 3 in this thesis, X = X + Z, where Z is the error term and E(ZIX)=0. Measurement error and misclassification can be categorized into differential and nondifferential cases. If the distribution of the surrogate variable X* only depends on the true value X but not the health outcome Y, then the mismeasurement is classified as non-differential. Otherwise, the mismeasurement is categorized as differential. Due to the misclassification and measurement error , usually we only have precisely mea sured health outcome Y , the surrogate variable X* and some other precisely measured covariates U in the data. Since the goal of most studies is to understand the relationship between X and Y, conclusions obtained from X* and Y instead could be very misleading. Thus, the study of measurement error and misclassification is significantly important. 1.2 Overview of Current Available Methods In the literature, especially in the biomedical research, many methods were proposed to deal with misclassification and measurement error. Barron (1977) proposed the matrix method that estimate the expectation of cell counts by using their margins, and the odds ratio can be estimated later on based on the cell counts. By reparameterizing the misclassification, Marshall (1990) presented the inverse matrix method to retrieve the odds ratio. However, Lyles (2002) pointed out that the inverse matrix method is just a maximum likelihood estimate method that corrects the biased odds ratio due to misclas sification. The efficiency of the matrix method and inverse matrix method were compared under the assumption of differential misclassification (the degree of measurement error is different across different groups) by Morrissey and Spiegelman (1999). They concluded 2 1.2. Overview of Current Available Methods that the inverse matrix is more efficient than the matrix method; nevertheless, the sensi tivity, specificity, and probability of exposure are some key determinants of the efficiency. Later, other methods like simulation extrapolation (SIMEX) and logistic regression model are also developed to approach the misclassification problem (Kuchenhoff, Mwalili, and Lesaifre, 2006; Skrondal and Rabe-Hesketh, 2004). With the improvement of computa tional capability of computers and enhanced simulation techniques, the Bayesian analysis becomes another prospective method to study the misclassification problem. Carroll, Ruppert, Stefanski, and Cainiceanu (2006) grouped methods that deal with measurement error into functional and structural models, where X is considered as fixed or random with minimum assumptions of distributions in the functional models while X is considered as random variables in the structud models. Two general methods used in the functional model category are the regression calibration (Pierce and Kellerer, 2004) and SIMEX (Cook and Stefanski, 1994) methods. Carroll, Ruppert, Stefanski, and Cainiceanu (2006) stated that even though these two methods are very simple, they are only consistent in some special cases such as linear regression and they have limited contributions in reducing the bias caused by the measurement error. Disregarding some limitations of the regression calibration and SIMEX methods, they are still widely used since both of them are very easy to implement by using the existing software packages, which reduces the potential difficulties in the analysis set-up part. In the structural method category, the Expectation-Maximization algorithm in the frequentist perspec tive and Markov chain Monte Carlo algorithm in the Bayesian perspective are two useful algorithms to solve the measurement error problems. Bayesian methods are capable of dealing with biases induced by both misclassification and measurement error. One great advantage of Bayesian methods is that they can fully incorporate the uncertainties of parameters. Though fully specified models are of ten required, this kind of information plus some knowledge of mismeasurement are often 3 1.3. Bayesian Analysis available to medical researchers before they conduct studies. When an observed dataset is available, the natural existence of prior information make Bayesian analysis an appealing method, since inference now can be made through the prior and present information. In this thesis, we are focus on using the Gibbs sampler and Metropolis-Hastings algorithm in Bayesian methods to solve the misclassification and measurement error problems. 1.3 Bayesian Analysis Bayesian inference is statistical inference in which data are used to update or to newly infer quantities that are observed or wished to learn by using probability model. The “combination” of Markov Chain Monte Carlo (MCMC) algorithm and Bayesian infer ences is often used in solving the mismeasurement related problems. 1.3.1 Bayes Rule To understand the Bayesian analysis, it is essential to understand the fundamental prin ciple of the analysis tributed dataset y - Bayes rule. Assume there is a independently and identically dis (yl, Y2, ..., y) with unknown parameter 8 and following distribution fe. Bayesian analysis is to conclude a parameter 8 based on the observed data, so we denote the conditional distribution on the observed data as f (8 y), which is called the posterior distribution. Meanwhile, the sampling distribution of the data set is: = f(v18) ]fe(y) Then, a joint probability distribution of 6 and y can be defined as: f(8,y) = f(8) x f(y18) 4 1.3. Bayesian Analysis where f(8) refers the prior distribution of 8. Applying the basic property of conditional distribution (Bayes’ rule), the posterior distribution turns to: — f(8,y) f(I Y-f() ) — - f(9) x f(yI) f() (11) Since the f(y) is a constant term, independent of the parameter 0, we can express the unnormalized posterior density as f(Iy) cx f(0) x f(yIO). To estimate the parameter 0, we can calculate the expected value of 8 under the posterior density (Gelman et al., 2004) by taking the integral: 0 = 18 0 j f( dO fOf(O)f(y0)dO ff(0)f(yIO)dO If there is a closed form of the posterior distribution, the estimation of 0 would be very easy to carry out. However, sometimes there is no closed form of the posterior density, we have to use the other numerical methods, such as Markov Chain Monte Carlo algorithms. 1.3.2 Prior Distribution As an “absent” term in classical statistical analysis, the prior distribution plays a major role in the Bayesian analysis. There are mainly two types of prior distributions: infor mative priors and non-informative priors. A non-informative prior also can be called a “fiat” prior that has a limited impact on the posterior distribution. Researchers use non-informative priors usually because they don’t have very much knowledge about the parameter in the previous study or they don’t want the inference to be greatly affected by some external sources. Moreover, the determination of “flat” prior is not trivial since everyone has a different definition of “fiat”. The informative prior, information that gained from previous study ,on the other hand, may provide a strong influence on the posterior distribution. However, the influence is somehow related to the sample size, number of MCMC iterations and the form of the prior. In most cases, strong priors are not necessary when the sample size is big. 5 1.3. Bayesian Analysis Nevertheless, in some cases, the priors are needed to be strong regardless of the sample size. 1.3.3 Markov Chain Monte Carlo Algorithm The computation of the posterior distribution is always a problem since the density of the posterior sometimes is very complicated or even high dimensional. Because of this problem, the usage of Bayesian analysis has been very limited. When the MCMC algo rithm was first applied in 1990, the limitation of Bayesian analysis then vanished and Bayesian methods becomes increasingly popular. In the following two sections, two particular MCMC algorithms are introduced: MetropolisHastings algorithm (Greenberg and Chib, 1995) and Gibbs sampler (Casella and George, 1992). Metropolis-Hastings Algorithm The Metropolis-Hastings (MH) algorithm is an iteration method for generating a se quence of samples from a probability distribution when directly sampling is difficult or impossible. It uses an acceptance/rejection rule to cover the target distribution. The basic algorithm follows: 1. Randomly choose a starting point 0 2. For the ithiteration from i=1,2,..., (a) Sample a proposal 9* from a jumping distribution at iteration i, p(9*&i_l) (b) Calculate the ratio of densities, r f(9*)p(9i_l 18*) = 9i_1) 1 f(9i-1)p(O* where f(9) is the target distribution. 6 1.3. Bayesian Analysis (c) Set — 9* with the probability min(r, 1) i—1 9 otherwise Note that the acceptance rate shouldn’t be close to 0% or 100%, otherwise, the random walk would either move too slowly (cover the target population too slowly) or stand still for too long (likely to stay in certain regions). Thus, proper adjustments of the jumping distribution sometimes are necessary. Gibbs sampler The Gibbs sampler is a special case of MR algorithm. It is applied when the ex plicit form of the joint distribution is unknown but the conditional distribution of each variable is known. Assume we divide the parameter vector 9 into p components, say 9 = ( i 9 , 92, ..., 0,,). Each iteration of Gibbs sampler cycles thought all the components of 9. Each component is drawn based on the values of all others. Thus, the algorithm that generates a Markov chain at iteration i is: 1. Randomly set the initial values for all the components 0 (9°), 9(0)) 2. For the ithiteration from i=1,2,..., for j=1...p, sample 9 from the distribution of where 9 represents all the components of 0, except for O, at their current values: 0 ( i) — (i) ( 19 9 i—i) i) 0 ( i) (0 , 1’••’ j—]i j+1 ••• (9 i—1) P Though the above two MCMC algorithms looks very simple, sometime the technical details can be very challenging. In this thesis, we will apply Gibbs sampler algorithm in Chapter 2 and Metropolis-Hastings algorithm in Chapter 3. 7 Chapter 2 Simulation Study for Categorical Exposure Variable 2.1 Introduction Assume a researcher is interesting in study the relationship between a binary exposure variable X and a continuous response variable Y. The exposure variable X is often coded as 0 or 1, where X = 0 refers to “unexposed” and X = 1 refers to “exposed” in most epidemiological situation. Instead of observing X, a surrogate variable X* is measured. Under the non-differential misclassification assumption, X and X* are conditionally in dependent of Y, and the specificity and sensitivity are used to measure the magnitude of the misclassification (Gustafson, 2004). Then, the sensitivity SN = Pr(X* = 1IX = 1) is the probability of correctly classifying a truly “exposed” subject, and the specificity SP Pr(X* = 0X = 0) is the probability of correctly classifying a truly “unexposed” subject. In the following subsections, we will introduce two approaches to analyze the relationship between discrete exposure variables and health outcome Y. 2.2 2x 3 Table—Formal Analysis Though there are only two conditions of the true exposure X, Yes or No, sometimes the surrogate variable X* has three conditions instead as: Unlikely, Maybe and Likely, due to possible instrument error (as displayed in Table 2.1). 8 2.2. 2x 3 Table—Formal Analysis Assessed Exposure Unlikely Maybe Likely True Exposure No Yes P00 P01 P02 Plo p11 P12 Table 2.1: 2x 3 table due to the misclassification and measurement error Note that the Pij in the table defines the probability of classifying a subject into group j while its true status is in group i. If the truly exposed condition, X, is known, then the exact relationship (referred as a “true” result) of X and Y is able to be analyzed. However, in reality, X is often inaccessible, and researchers only know its surrogate, X*. Even though, researchers are analyzing the relationship between X* and Y, they also tend to conclude it as the re lationship between X and Y. It is very dangerous to make such conclusion since it can be very biased. The first analysis method is termed as a formal analysis: analysis that acknowledge the existences of 2 x 3 table (Table 2.1) structure in the exposure condition. Specifically, the analysis is carried out based on Table 2.2. Health Outcome controls cases Exposure Unlikely Maybe 00 n 01 n 10 n 11 n Likely n02 2 1 n Table 2.2: A 2x 3 table for formal analysis where nj in the table is the number of subjects that fall in the condition (e.g. 0 n 0 is the number of subjects that “Unlikely” have the exposure in controls). In the literature, there are fewer studies that involves the formal analysis, which makes it an interesting point to study. The second approach is termed as an infonnal analysis: analysis that tend to ignore the 2 x 3 table structure in the exposure condition. More details of informal analysis are going to be talked about in the next section. 9 2.3. Informal Analysis 2.3 Informal Analysis When we assume the mismeasurement is non-differential, sensitivity and specificity of X* for X can be used to described the magnitude of the misclassification. The closer the values of SN and SP are to one, the less severe the misclassification is. As stated in Gustafson (2004), when the probability of the exposure is very rare, the effects of misclassification worsens much more with the decrease of specificity than the decreases of sensitivity, which means the impact of low specificity will be bigger than the impact of low sensitivity in a further analysis. This attracts particular attention in the epidemiological area for the study of rare disease since it implies that when the expo sure is very rare, the analysis can be fully misleading even with a very small effects of misclassification. Thus, when some epidemiologists realize that they have the 2x3 table structure (as Table 2.1) in hand, they tend to group the “Maybe” and “Unlikely” groups together to form a 2x2 table as in Table 2.3: where oo I True Exposure = P00 +poi and io = P10 + P11. Assessed Exposure No Yes No oo qoi Yes io q Table 2.3: 2x 2 table by epidemiologists’ rule They prefer such grouping so that more subjects are classified as unexposed, and this leads an increase of probability that a true negative is correctly classified, which means a large SP value. In such way, a low specificity could be avoided so that a small effect of misclassification won’t lead a huge impact in the further analysis. In the literature, most analyses are constructed based on the ignorance of the struc ture of Table 2.1. Thus, they are based on Table 2.4, and we would like to refer such analysis as informal analysis in the rest of the Chapter. 10 2.4. Odds Exposure No controls floo + fbi cases nib +flhi Health outcome Table 2.4: A 2x 2.4 Ratio Yes fl02 fli2 table for informal analysis Odds Ratio Suppose we have two groups of subjects, controls and cases, and denoted the number of subjects in each group as 0 n and n 0 denote the prevalence of exposure in . Let r 1 the control group and r 1 denote the prevalence of exposure in the case group, i.e. r 3 Pr(X = 1Y Pr(X 0) and r 1 = 1Y = = 1). Thus, the odds ratio, which defines the relationship between X and Y is formed as: (2.1) 1 —ro In the informal analysis, the odds ratio is usually calculated as: 0 — + noi) fl02 x (flio + flu) l2 x (noo where the standard error of log odds ratio is formulated as: \i/2 1 / 1 1 1 +—+ +— \floo + nc 10 + flu no2 l2 se=( In the formal analysis, there are three different ways to calculate the odds ratio. Since the MCMC algorithm estimates the prevalence in both case and control groups (ro and ri) at each iteration, they are going to be updated every time. Thus, the first way is to find the posterior mean of odds ratios, which is to calculate the odds ratio at each MCMC iteration, and then take the average of them. The second way is to find the posterior average of r ,r 0 1 of each iteration, then use formula (2.1) to find the odds ratio. The 11 2.4. Odds Ratio third way is to find the posterior mean of log odds ratio, which is to calculate average of the log of odds ratio at each iteration first and then transfer it by using exponential function. Mathematically, those three odds ratio can be written as: 1 where i refers to the th iteration, and m is the total number of iterations. im im = — ,rj (2.2) -- 2 OR rp 1 —r where ro, ri are calculated from equation (2.2). And also, 3 OR (±1o(Oñ)) = We are interested in comparing the odds ratios in both analysis. Since the odds ratio always falls into a very skewed distribution, we tend to compare the informal odds ratio and the formal odds ratio in the log scale, such that the distribution will be more sym metric. We are going to focus on comparing the ORinformal with OR 3 in the log scale with respect to their point estimator and confidence/credible intervals as log(ORjf T 0 1) = log (n12x(noo+nol)N \fl02 X (nio+nii)j 12 2.5. Case 1: When We Know PijS with its 95% confIdence interval as: Clinformal: (1og(ökiflf.mal) 1.96 — * Se, log(ORjnformaj) + 1.96 * se). The log formal odds ratio is as: 3 logOR = --log(OTh). Note that the 95% credible interval for log OR 3 can be simply obtained by finding the 2.5% and 97.5% quantiles of all log OR 3 that obtained from each iteration. In order to evaluate the performance of our proposed formal approach, we conduct simu lation studies under three cases as: when the probability of classify, Pij are known; when we only have prior information on pu; when we have some validation data, i.e. we have some subjects have both X and X measured. 2.5 Case 1: When We Know piS In some cases, researchers may know the probabilities of classifying the assessed exposure to true exposure from previous experiments, i.e. the pujs are known. Under this condition, the posterior density is expressed as: f(ro,ri,Xi...Xn,IYi...Yj,X’...X) = JJr(i(1’(1 — ro)(1_Xi)(1 — I(XO)(1—X) I(Xz1)(1—Xj) I(Xz2)(1—Xj) ri xIIpoo 01 p 02 p 2 —- I(X=0)X, I(X;=1)x I(X,=2)X 11 p 12 p 3 13 2.6. Case 2: When PijS are Unknown where I is an indicator function. We assume the prior distribution of r 0 and r 1 are uniform (the same assumption carries in the following two cases), and then we use the Gibbs sampling method to update the unknowns, r 0 and r 1 since their posterior distribu tions have familiar distributions that we can recognize, namely Beta distributions. The posterior distribution of X 3 is viewed as: f(XjIro,ri,Xi,...Xj_i,Xj+i,...Xn,Yi...Yj,X...X) where a = (1 b = 0 r — ‘(1 ro) y, y, I(X;=0) I(X;=1) z(x;=2) ri) — 00 ‘p 01 p . 02 p I(X=O) I(X=1) I(X,=2) 1 p r 10 11 p 12 p Thus, at each iteration of MCMC algorithm, the probability of getting X 3 = 1 by given everything else “known” is b/(a + b). 2.6 Case 2: When pS are Unknown Though, in the previous section, we studied the case that PijS are available from other experiments, in reality, those values are often unknown. The maximum information that we have on them are some knowledge of their priors. We choose a Dirichlet distribution as the prior distribution for PijS for two reasons. First, the summation of POj and the summation of Plj are both equal to 1, i.e. P00 +poi +po2 = 1 and Pio +pii +p12 = 1. Second, the Dirichlet distribution has the property of being conjugate, which means the posterior distribution of Pij 5 would also come from the Dirichlet distribution family. Thus, by choosing Dirichiet distribution as the prior, the MCMC algorithm is easy to compute the updated PiJS at each iteration, and the results are also easy to be interpreted. 14 2.6. Case 2: When are Unknown PijS Hence, we have: ), 02 Dir(eoo,coi,c Poo,Poi,Po2 P1o,P11,Pi2 ‘- ). 12 Dir(cio,cii,c Then, the posterior distribution is changed to: f(ro,ri,Xi, ...X) 12 ...X.,,poo,poi , 2 Yi, pio,pii,p ,po cx — r )(1_Xi)(1_Yj)rXiYi(l — ri)’”ii ) I(x;=i)(i—x) I(x=2)(i—x) 3 I(x;=o)(1—X —- Poi 02 P 3 3 I(X=2)X I(X=flX 3 I(X=0)X 12 P 11 P 3 coo—i Cj—l c02—1 * 01 P P02 XPOO cx ]Jr(12)(1 — C1IYl Ciii C121 J) 10 11 p 12 p CiYJ(l 5 ro)_Xi 1_Yi)ri — )+coi —1] (I(X=2)(1—X,)+co 3 —1) 2 —- (I(X’=O)(1—X,)+coo—1) (I(X=1)(1—X Xjjp 00 Poi P02 3 X +cio—1) (I(X’=i)X+c —1) (I(X=2)X+c 3 (I(XJ=0)X —1) 12 p 1 0 11 p P12 Xj(1-Yj) cx xpooi <p10 Note that coo, (1 — r )(1_Xi)( XYj (1 — 02 ) 3 I(X;=1)(1—X — +co Z 1 01 I(X=0)(1—X)+coo—1 i2 I(X=2)(1—X)+co — 1) I(XJ=0)X + 3 cio—1 1 I(X$=1)X,+cii col, c02, do, Cii and d12 are I(X,=2)X+ci — 2 i called hyper-parameters, and the specific values assigned will be discussed in the results section. 15 2.7. Case 3: Validation Data 2.7 Case 3: Validation Data Sometimes, the true exposure variable X might be possible to capture, but it is too expensive to get for all the subjects. As a result, only a small proportion of the subjects have the complete information of X, X and Y, whereas the majority of the subjects don’t have the precisely measured exposure status, X. Table 2.5 presents the structure of the validation sample and the incomplete main data. While all counts corresponding to Validation Data x* Unlikely Maybe x=o x=1 X=O x=1 Y=o Y=1 Main Data x* Unlikely Maybe Likely Likely Y=o Y=1 Table 2.5: Validation data and main data levels of X, Y, X* are fully recorded in the validation data, only counts that correspond to Y, X are recorded in the main data. We want to use the information from the validation data to impute X for the main data and to make inference on X and Y. Our new posterior density is like: 1 f(X Xm,poo,poi,po2,p1o,p11,p12,ro,r1IXm+1,.,.Xn,X,...X,,Y1,...Yn) j=n x ) 3 flf(XIX x ) 3 flf(X f(X) flfo’Ix j=1 m+1 j m fl fJf(XX) x fJf(X) x f(poo,pol,p02,plo,pll,p12) x f(ro) x f(r ), 1 where j = 1, .m is the non-validation data part and . . j = m + 1, . . .n is the validation data part. The simulation process for this case does not change a lot regarding the change of the posterior distribution, and the only difference is that “known” X values in the validation data do not need to be updated, where the “unknown” X values, ,r 0 r 1 and PijS are 16 2.8. Results updated the same as in the previous case. 2.8 Results In order to gain information about the performance of MCMC algorithms in all three cases, the MCMC trace-plots, posterior mean, 95% equal-tail credible interval, estimated bias, estimated mean square error of each unknown parameters are checked in the fol lowing subsections. Moreover, when the prevalence in control and cases are small, i.e. r 0 and r 1 are relatively small, the odds ratio of “formal” and “informal” are compared later on to assess their performance. In all cases, two sets of the prevalence are used as (1): r 0 0 r = 1 0.7, r values as P00 = 1 0.2,r = 0.25 and (2): 0.4, and each one is combined with the “true” probability of classifying O.S,poi = ,Po2 3 O. = 0.2 and pio = = 0.3,pi2 = 0.6. Any hyper parameters used in the specific cases will be defined in the later subsections along with detailed information of each case scenarios. Moreover, two simulation studies are per formed regarding each scenario in each case. The first one focuses on studying estimation from one sample. The second concentrates on studying the sampling distributions of each estimator across 100 simulated datasets. 2.8.1 Results for Case 1 Two scenarios we have in this case are: • Scenario 1: (T0,Ti) (0.5, 0.3, 0.2), (0.2, 0.25), true odds ratio=1.33, (pio,pll,p12) • Scenario 2: (rO,Ti) (0.5,0.3,0.2), = = (0.1,0.3, 0.6); (0.7,0.4), true odds ratio=0.283, (plo,pil,pi2) = (poo,pol,po2) (poo,poi,po2) = (0.1,0.3,0.6). Two simulation studies are carried for each scenario, where studies for the first scenario will be talked about in details as an example. Note that the same procedures are applied 17 2.8. Results to the second scenario as well. In the first study, a dataset of size 4000 (2000 controls and 2000 cases) was generated based on the given “true” value, (ro,ri) ) 12 (pio,pii,p = = (0.2,0.25), (poo,pol,po2) = (0.5,0.3,0.2), and (0.1,0.3, 0.6). Then, we use 21000 iterations, where the first 1000 is the burn-in period to update unknown parameters. The choice of burn-in period is make based on visual inspection. Figure ro and 1 r 2.1 shows the traceplot of MCMC algorithm for after the first 1000 burn-in period. It shows that the Markov chains moving smoothly within the target area. 18 2.8. Results I. I Iii Ii I .,LI tU ii. ‘ liii I i. 0 C’j 9 0 ‘ 0 a r,- I’ I I I I 0 5000 10000 15000 20000 15000 20000 iterations 0 Co 0 0 0 0 5000 10000 iterations Figure 2.1: Traceplots of r 0 and r 1 from MCMC algorithm in scenario 1 case 1. The traceplots show the 20000 iterations after 1000 burn-in period. The true values of ro and 1 are 0.2 and 0.25 respectively. r Table 2.6 shows the true values, posterior means and the 95% credible intervals for r 0 and r . Both 95% credible interval of r 1 1 covers the “given” values, which indicates 0 and r 19 2.8. Results that for this particular generated dataset, our approach works well. ro r true value 0.2 0.25 posterior mean 0.18 0.27 95% CI (0.15, 0.23) ( 0.23, 0.31) Table 2.6: True values, posterior means, 95% credible intervals of ro and ri. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1. The second study will repeats the first study 100 times, which enable us to investi gate the sampling distributions of r 0 and r . Figure 2.2 is the histogram of 100 posterior 1 means of r 0 and r . It demonstrates that the sampling distribution of r 1 0 and r are ap proximately normally distributed and centered around the “true” values, majority values are around the “true” values. 20 2.8. Results to - 0 0 C.) - U) 01 LI) - 01 0 0J , 0 C, C 0 41) Ll U- 0 0 C 0 0,14 0.18 022 026 I I 0.18 0.22 I I 026 I I 0.30 Figure 2.2: Histogram of 100 posterior means of ro and r in the second simulation study for scenario 1 case 1. The “true” values of ro and r 1 are 0.2 and 0.25 respectively. Table 2.7 confirms our observation by showing the 95% credible interval coverage rates are very high and the average lengths of the credible intervals are very small. r 0 1 r Bias 0.0024 0.0028 MSE 0.0025 0.0023 Coverage of the 95%CI 91 95 j Average 95%CI Width 0.09 T________ ööi T Table 2.7: Bias, mean square error (MSE), coverage of 95 % CI and the average width 0 and r 1 for scenario .1 case 1. All results are based on 100 datasets, and their true of r values are 0.2 and 0.25 respectively. Since desirable results are obtained from the statistical procedures and stabilized iter 21 2.8. Results ations are observed from the traceplot, it is reasonable to conclude that our approach works well when the prevalence for both control and cases are relatively small. Table 2.8 shows the first simulation study result for scenario 2, where r 0 1 r = 0.7 and 0.4 and Pij values remain the same. Again, the result shows that for the particular ro r 1 true value 0.7 0.4 posterior mean 0.69 0.41 95% CI (0.63,0.72) (0.36, 0.45 Table 2.8: True values, posterior means, 95% credible intervals of ro and ri. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 1. generated dataset, we are able to get reasonable estimators. However, in order to know how the model works in general, we need to review the results from the second simulation study. Figure 2.3 and Table 2.9 are histogram and statistical results from the second simulation study for scenario 2 in case 1. All results from two studies for scenario 2 suggest that our proposed approach works equally well when the prevalences are considerably larger. 22 2.8. Results C C’) 0 Ci U) C C’) U, C) >% C) C S 0 C) U- C U) 0 U) U- C 0 Is, C C I I 0.66 I I I I 0.74 0.70 0.32 I 0.36 0.40 0.44 0 r Figure 2.3: Histogram of 100 posterior means of r 0 and r 1 in the second simulation study for scenario 2 case 1. The true values of r 0 and r 1 are 0.7 and 0.4 respectively. ro 1 r Bias 0.000068 0.00074 MSE 0.0021 0.0023 Coverage of the 95%CI 98 94 t Average 95%CI Width 0.085 0.092 I Table 2.9: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average 0 and r width of r 1 for scenario 2 case 1. All results are based on 100 datasets, and their true values are 0.7 and 0.4 respectively. 23 2.8. Results 2.8.2 Results for Case 2 In this case, there are also two scenarios as: • Scenario 1: (ro,ri) = (0.2, 0.25), true odds ratioz1.33, (poo,pol,po2) (0.5,0.3,0.2), (plo,pll,p12) • Scenario 2 : (ro,ri) = = (0.1,0.3,0.6); (0.7.0.4), true odds ratio=1.33, (poo,pol,po2) (0.5,0.3,0.2), (plo,pll,p12) = = = (0.1,0.3,0.6). Remember that in this case, since PijS are unknown, we need to specify prior distributions for them. The hyper-parameters cj are chosen particularly as (coo, , 01 002) c (dO, 11 012) c , = (55,30,15), (10, 25,65) for both scenarios. Those values are chosen so that the prior distribution are centrated nearly around the “true” values of Pu. Since any single com ponent of a Dirchlet vector has a Beta distribution, we are able to see how concentrated these priors are by using the Beta distribution with certain hyper-parameters. For exam ple, previous information (prior) states that P00 is from a Beta distribution with shape = 55, 3 = 45. Figure 2.4 displays the the true Pij values with its corresponding Beta density function. From the figure, we can see that most “true” pjj values are close to the centre of the the density function (especially with Poi and plo) and the range of the x-axis are pretty narrow, which suggests that the priors that we use in this case are concentrated ones. We use the concentrated priors here is because by simulation studies, we realized that the no matter the sample size is large or not, a strong prior is crucial in this case (as mentioned in Section 1.3.2). Same as in case 1, two simulation studies are carried for each scenario, where studies for the first scenario will be talked about in detail as an example. In the first study for scenario 1, a dataset of size 4000 (2000 controls and 2000 cases) was generated based on the given “true” value. Then, we use 51000 iterations (the first 1000 is the burn-in period) to update unknown parameters. Figure 2.5 show the traceplot of MCMC algorithm for r ,r 0 1 and Pij5 after the first 1000 24 2.8. Results I 0.0 0.2 0.4 0.0 02 04 0.6 0.8 0.6 0.8 I” 1.0 I’ Pio 0.0 I 1.0 0.0 0.2 0.4 0.6 I 0.8 1.0 0.0 0,2 0.4 0.6 0.8 1.0 0.8 1.0 0.0 0.2 0.4 0.6 08 1.0 ‘I 0.2 0.4 Pu 0.6 P12 Figure 2.4: Density plots of “true” Pij values with its corresponding Beta density function. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (55,5), Beta(30, 70), Beta(15, 85); Beta(10, 80), Beta(25, 80) and Beta(65,35). 25 2.8. Results burn-in period. Again, the Markov chains move smoothly within target range and no chain is fixed at a particular value. We also observe that the generated sample Markov Chain is somehow more stabilized in some unknown parameters (e.g. poi,pii) than oth ers (e.g. r 0 and ri) after the burn-in period. Table 2.10 demonstrates the true value, estimated posterior mean and 95% credible interval for each unknown parameter from the first study of scenario 1. r 0 1 r P00 oi P02 p pi true value 0.2 0.25 0.5 0.3 0.2 0.1 0.3 0.6 posterior mean 0.25 0.32 0.53 0.31 0.16 0.1 0.25 0.64 95% CI (0.14,0.40) (0.18, 0.41 (0.46, 0.60 (0.27, 0.35 (0.09, 0.23) (0.05, oii (0.18, 0.35) (0.54, 0.73 Table 2.10: True values, posterior means, 95% credible intervals of r ,r 0 1 and pjjs. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 1. Though there are discrepancies between the estimated posterior means of r 0 and r 1 and their true values, 95% credible intervals of estimated means do cover the true values. Figure 2.6 and Table 2.11 display the results from the second study (100 datasets simulation) of the first scenario in this case 26 2.8. Results ii 1 L L. S j, JIr I 1 .r ., I 30000 o 10000 50000 I 0 10000 itera5orrS I I 30000 00000 . 0 10000 iterations 30000 50000 iterations IL I 0 10000 I I I 30000 I 50000 iterations I. I. I 0 10000 I I 30000 50000 iterations II Ii - 0 10000 I I 30000 50000 iterations iij 7 0 10000 30000 itera0ons 50000 0 10000 30000 00000 iterations Figure 2.5: Traceplots of r ,r 0 1 and PijS from MCMC algorithm in scenario 1 case 2. The traceplots show the 50000 iterations after 1000 burn-in period. The “true” values are ro = O.2,r 5,(poo,pOl,p02) = (0.5,0.3,0.2) and (plo,pll,p12) = (0.1,0.3, 0.6). 2 O. 27 2.8. Results J1+11hh I I 0.50 EfrhI 0.54 0.52 0.56 0.28 0.30 P00 0.32 0.34 0.145 P01 0.150 0.155 I 0.160 P02 rffh1 0.095 0.100 0.105 0.110 0.22 0.24 0.26 0.28 0.30 Pta Pu 0.60 0.62 0.64 0.66 P12 [oj 0.20 0.25 0.30 0.35 024 0.28 0.32 0.36 1 r Figure 2.6: Histogram of 100 posterior means of r ,r 0 1 and PijS in the second simulation study for scenario 1 case The “true” values are r 1 0.25, (poe, PO1,P02) = 0 = 0.2,r (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3, 0.6). . 28 2.8. Results From Figure 2.6, we observe that the histograms of 100 posterior means of 1j, T’i,Po,Pi,Pjo and P02 are shifted to the right from their true values, which indicates an overestimation of the parameters of interest. Though the graph suggests a possible unpleasant result, we still need to study those parameters’ corresponding sampling distributions to find out the true performance of our model. Table 2.11 shows the estimated bias, mean square error, 95% credible intervals coverages (of the true value) and the average length of 100 95% credible intervals for each parameter. r 0 1 r p01 P02 pio p p Bias 0.062 0.056 0.035 0.014 -0.049 0.0022 -0.039 0.0366 MSE 0.0021 0.0019 0.00079 0.00088 0.00023 0.00015 0.00072 0.00061 Coverage of the 95%CI 100 100 100 99 100 100 100 100 Average 95%CI Width 0.25 0.24 0.14 0.074 0.14 0.12 0.16 0.18 Table 2.11: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and rl for scenario 1 case 2. All results are based on 100 datasets, and their true values arer 0 = 0.2,r 1 = O. ,(poo,pol,p02) = (0.5, 0.3, 0.2) and (p1o,p1,pi) = 25 (0.1,0.3,0.6). It indicates that the even though the graph shows a potential unpleasant result, the parameters’ posterior distributions still cover their true values most of the time. As we discussed earlier, this case is very sensitive to the choice of prior distributions, and the observation obtained from Figure 2.6 could just imply a not strong enough prior for the particular dataset. Thus, it’s still reasonable to conclude that the algorithm works well for this case. Figure 2.7 and Table 2.12 are histogram and statistics results from the second sim ulation study for scenario 2 in case 2. We omit the results of the first study here due to its limitation of interpretation. From both the figure and table, we notice that no matter whether the probability of exposure is large or small, the prior distributions are always 29 2.8. Results very important. Weak priors may lead to poor results, and the stronger the prior is the better the results would be. This is dissimilar with most Bayesian cases, where when the sample size is large, the choice of the prior becomes insignificant. Future studies could be conducted on this case. ro 1 r p00 Po pm p Bias 0.015 0.048 0.036 0.012 -0.04 -0.0036 -0.011 0.015 MSE 0.0027 0.0025 0.0017 0.0020 0.00043 0.00022 0.0017 0.0015 Coverage of the 95%CI 100 99 100 96 100 100 97 100 Average 95%CI Width 0.23 0.23 0.15 0.10 0.15 0.11 0.079 0.14 Table 2.12: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of r 0 and r 1 for scenario 1 case 2. All results are based on 100 datasets, and their true values are r 0 = 0.7,r 1 = 0.4, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). 30 2.8. Results ,.rFf f hi ErfF1111}L i 0.45 050 055 0.60 0.25 0.30 ‘rll}[fThk 0.090 0095 0.100 0.105 0.65 0.40 0.16 0.18 l11L J1IF[L, 1 00jj 070 022 Pm 060 0.35 012 0.14 Po P00 0.70 026 030 P02 0.34 0.55 0.60 Pu 0.75 0.80 0.65 I P12 0.35 0.40 045 0,50 0.55 = = = = Figure 2.7: Histogram of 100 posterior means of r ,r 0 1 and PijS in the second simulation study for scenario 1 case The “true” values are r 0 1 O.7,r 0.4, (poo,pol,po2) (0.5,0.3,0.2) and (plo, pll,p12) (0.1,0.3,0.6). . 31 2.9. Results for Case 3 2.9 Results for Case 3 Under this cases, we will have four scenarios as follows: (0.2, 0.25), true odds ratio==1.33, (Poo,pol,po2) • Scenario 1: (ro,ri) 2 , 3 (O.S,O. ) ,(plo,pll,p12) O. • Scenario 2: (ro,ri) = = = (1,1,1), = = (0.1,0.3,0.6), 0 (coo,col,c 2 ) = (1,1,1), = (0.1,0.3,0.6), (1,1,1), (COO,cOl,C02) (1,1, 1), validation sizez200; • Scenario 4: (ro,ri) (0.5,0.3,0.2), = (0.7,0.4), true odds ratio0.28, (poo,pol,p02) 2 , 3 (O.S,O. ) ,(P1o,P11,P1 O. 2) ) 12 (cio,cii,c = (1,1,1), validation size=400; • Scenario 3: (ro,ri) = (COO,Col,c 2 0 ) (0.2.0.25), true odds ratio=1.33, (poo,pol,po2) 2 , 3 (O.S,O. ) ,(plO,pll,p12 0. ) ,c 10 (c , 1 12 c 1) (0.1,0.3,0.6), (1,1,1) ,validation size=200; (clQ,Cll,c 2 1 ) (clo,cll,c 2 1 ) = = (0.7.0.4), true odds ratio=0.28, (poo,pol,po2) (plo,p11,p12) = (0.1,0.3,0.6), (coo,coi,c ) 02 = (1,1,1), (1,1,1) ,validation size=100. Notice that in scenario 1 and scenario 3, the validation size is 200 compared to scenario 2 and scenario 4 (validation size is 100). As in case 2, let’s plot the true values of pjjs along with their prior density functions. From Figure 2.8, we can see that now the range of the x-axis (possible generated values) becomes wider and the “true” values are not that close the to the center of the density functions. This directly implies that these priors are flatter than the one we chose in case 2. We assign flat priors in this case because we believe valuable information could be obtained from the validation data. Once more, two simulation studies are carried for each of the scenario, and only the first of the two scenario will be talked about in details. Figure 2.9 and Figure 2.10 show the traceplots of MCMC algorithm for r ,r 0 1 and jjS after the first 1000 burn-in period for scenario 1 and 2. 32 2.9. Results for Case 3 NN 0 0.0 0.2 6.4 0.6 0.8 1.0 00 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 P20 Po, P20 0 [IIJ 0.0 0.2 0.4 0.6 0.8 1.0 Plo I N. I•I I I I 00 0.2 0.4 0.6 0.8 1.0 I N I 0.0 0.2 0.4 0.6 0.8 1.0 P12 Figure 2.8: Density plots of “true” pj,j values with its corresponding Beta density function. The vertical lines in the graph indicate the “true” values. The Beta density functions (from left to right) are: Beta (1,2,), Beta(1,2), Beta (1, 2); Beta(1, 2), Beta(1, 2) and Beta(1, 2,). 33 2.9. Results for Case 3 hi r.J_JI II .1., :1,1 ‘I I 0 15000 5000 0 15000 0 5000 terabons —1 0 i i 15000 5000 0 II Ir.I II 0 5000 15000 ilera0ono liii II ‘I I 5000 15000 5000 iler000rrO 0 II 10000 ileraSons ii _IrI—r, II I 5000 15000 ilera5005 II I l 0 5000 ‘l 15000 ileradero Figure 2.9: Traceplots of r ,r 0 1 and PijS from MCMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values 0 1 = O. 0.2,r are r 5,(pOO,pOl,p02) = (0.5,0.3, 0.2) and (plo,pn,p12) = (0.1,0.3,0.6). 2 Validation size =200. We can see that for the generated sample Markov Chain are somehow more stable in this case than case 2. Table 2.13 and 2.14 demonstrates the true value, estimated posterior mean and 95% credible interval for each unknown parameter from the first 34 2.9. Results for Case 3 -I II. ..ii_... ir’i’ 0 5000 15000 0 15000 5000 iterations 0 5000 iterations 15000 iterations J “F 0 5000 15000 0 5000 iterations 0 5000 15000 iterations ‘IiT 15000 iterations 0 5000 0 5000 15000 iterations 15000 iterations Figure 2.10: Traceplots of r ,r 0 1 and PiJS from MGMC algorithm in scenario 1 case 3. The traceplots show the 20000 iterations after 1000 burn-in period. The “true” values are r 0 1 = 0.25, (Poo,pol,po2) = (0.5,0.3,0.2) and (p1o,pll,p12) = (0.1,0.3,0.6). 0.2,r Validation size =rlOO. 35 2.9. Results for Case 3 study of scenario 1 and 2. ro r 1 P00 P01 P02 pio pn pj true value 0.2 0.25 0.5 0.3 0.2 0.1 0.3 0.6 posterior mean 0.18 0.24 0.49 0.31 0.20 0.07 0.27 0.66 95% CI (0.13,0.24) (0.19, 0.30) (0.45, 0.52 (0.28, 0.34 (0.16, 0.24 (0.00 0.13) (0.16, 0.37) (0.56, 0.77) Table 2.13: True values, posterior means, 95% credible intervals of ro,ri and PjjS. These are results from the first simulation study (one dataset simulation) for scenario 1 in case 3. Validation size =200. r 0 1 r poo oi P02 Plo P11 pr true value 0.2 0.25 0.5 0.3 0.2 0.1 0.3 0.6 posterior mean 0.22 0.27 0.53 0.25 0.22 0.07 0.42 0.51 95% CI (0.13,0.31) (0.18, 0.37) (0.46, 0.59) (0.19, 0.31 ) (0.19, 0.25) (0, 0.13 (0.29, 0.54 (0.42, 0.61 Table 2.14: True values, posterior means, 95% credible intervals of r ,r 0 1 and PjS. These are results from the first simulation study (one dataset simulation) for scenario 2 in case 3. Validation size =100. From these one sample studies, we observe that the approach works very well with flat priors. Though the observation now is only based on one sample study result, it is verified by Tables 2.15 and 2.16 from the second simulation study (sampling distribution study based on 100 datasets). 36 2.9. Results for Case 3 r 0 1 r P00 pç p pio pu i:i: Bias 0.0047 0.0024 0.0017 -0.00045 -0.0013 0.0090 -0.01 0.001 MSE 0.0032 0.0029 0.0020 0.0015 0.0015 0.0036 0.0045 0.0039 Coverage of the 95%CI 93 98 95 96 98 97 97 99 Average 95%CI Width 0.12 0.12 0.07 0.067 0.071 0.15 0.20 0.19 Table 2.15: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and r 1 for scenario 1 case 3. All results are based on 100 datasets, and their true values arero = 0.2,ri = 0.25, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plO,pfl,p12) = (0.1,0.3,0.6). Validation data=200. r 0 1 r poo poi po pj p p Bias 0.0066 0.0027 0.00050 0.0049 -0.0054 0.0090 -0.021 0.012 MSE 0.0039 0.0038 0.0022 0.0023 0.0021 0.0043 0.0068 0.0049 Coverage of the 95%CI 96 96 94 94 99 96 93 100 Average 95%CI Width 0.16 0.16 0.093 0.086 0.09 0.18 0.25 0.23 Table 2.16: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of r 0 and r 1 for scenario 2 case 3. All results are based on 100 datasets, and theirtrue values arer 1 = 0.25, (poo,pol,po2) = (0.5,0.3,0.2) and(pio,p 0.2,r 0 ) = 12 ,p 11 (0.1,0.3,0.6). Validation data—100. 37 2.9. Results for Case 3 Figure 2.11 and Figure 2.12 show the histograms of sampling distributions for each unknown parameter in scenario 1 and 2. Results from the above two tables and figures indicate that when there are some validation data, the formal approach performs equally well (compare with the case 2) though the prior information is weak. Even though intuitively, we may think that the larger validation size would have better results than the smaller size, by comparing Table 2.15 and Table 2.16, it is hard to conclude that there is significant difference in the results obtained from the scenario 1 and scenario 2. One possible explanation is that by increase the validation size from 100 to 200, the algorithm is only able to gain limited “valuable” information. This immediately rises a question that whether there is a cut off point that we are able to get the maximum benefit, i.e. fewer validation data and enough information to obtain a good estimation. This could be an interesting point to study later on. 38 2.9. Results for Case 3 LL . 1 lb 00 :J I I CL I I 0.50 0.46 0.54 I 0.24 I I 028 P00 I I ‘: 0.32 A I I 0.14 I I 0.18 0.22 P01 I P01 0) JT[1l}L.3.. I 0.05 I 0.10 I I 0.15 0.20 I 0.15 0.25 0.35 0.45 I 0,50 0.55 Pio C I 0.15 020 I 0.70 Pi, 00 0.10 I 0.60 0.65 0.25 0.30 :r__[1lI1i1_ 020 0.25 0.30 0 r Figure 2.11: Histogram of 100 posterior means of ro, r 1 and Pij8 in the second simulation study for scenarzo 1 case 3. The “true” values are r 0 1 = 0.25, (poo,pol,p02) = 0.2,r (0.5,0.3,0.2) and (Plo,P11,P12) (0.1,0.3,0.6). Validation size =200. 39 2.9. Results for Case 3 rf11}i 0.46 0.50 ! 0.54 ‘‘J{{}frflaT, 0.30 0.26 0.34 1 LitflThL 0.14 0.16 Poi P00 022 P01 m 1EiriTTfbL., ‘r1LHL,, r T ‘!;rfffi}1j 005 0.10 0.15 0.20 0.10 0.20 Pio 0,10 0.30 0.40 0.50 0.60 Pu 0.15 0.20 0.25 0.30 0.20 0.25 0.70 0.50 Pi 0,30 0,35 0 r Figure 2.12: Histogram of 100 posterior means of ro, r 1 and PijS in the second simulation study for scenario 2 case 3. The “true” values are r 0 = 0.2,rj = 0.25, (poo,pol,p02) = (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation size =100. 40 2.9. Results for Case 3 The following tables and figures show the results of scenario 3 and scenario 4 where the prevalence rates are relatively larger r 0 r pj P01 P02 p pjj p Bias 0.0033 0.0064 0.0042 0.00019 -0.0044 0.0001 -0.0012 0.0011 MSE 0.0031 0.0039 0.0033 0.0025 0.0029 0.002 0.0020 0.0023 (rO = 0.7 and r 1 Coverage of the 959CI 97 92 94 94 96 95 93 96 = 0.4). Average 95%CI Width 0.13 0.14 0.11 0.09 0.11 0.078 0.075 0.093 Table 2.17: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and Ti for scenario 3 case 3. All results are based on 100 datasets, and their true values are r 0 = 0.7,r 1 = O. (0.5,0.3,0.2) and (plo,pll,pi2) = ,(poo,po1,p02) 4 (0.1,0.3,0.6). Validation data=200. r 0 1 r pyj pr P02 pio pH PH Bias 0.0041 0.0062 0.0097 -0.0030 -0.0067 -0.0006 -0.0011 0.0017 MSE 0.0038 0.0047 0.0037 0.0026 0.0034 0.0021 0.0025 0.0028 Coverage of the 95%CI 97 96 96 96 96 99 90 97 Average 95%CI Width 0.19 0.19 0.14 0.10 0.14 0.10 0.082 0.11 Table 2.18: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average width of ro and Ti for scenario 4 case 3. All results are based on 100 datasets, and their true values are r 0 = 0.7,r 1 = O. ,(pOO,p0i,p02) = (0.5, 0.3,0.2) and (plo,pn,pi2) = 4 (0.1,0.3,0.6). Validation data=100 41 2.9. Results for Case 3 co_I Fti Lt :1 01 0.50 - I I 0.55 0.60 I 0.45 0 __________ I I I I I I 0.22 0.26 Pa 0.30 1 U I I 0.08 I I P 0.12 0.26 Plo 0] 0.20 I I 0.30 0.25 C01 U, rf{fFIITL, 1 ELrHWH at,_____ I 0.15 P01 ‘ Eli—fl I _______ I I I 0.34 Poi 01 0.04 r} Wfl 1M liii L 0 I I 0.34 Hi___ I 0.54 I 0.58 I I 0.62 I 0.66 Pu La_ 0.65 I I I 0.70 0.75 0.80 0 0.35 040 0.45 0.50 0 r Figure 2.13: Histogram of 100 posterior means of ro, r 1 and PijS in the second simulation study for scenario 3 case 3. The “true” values are r 1 = 0.4, (poo,pol,p02) = 0 = 0.7,r (0.5, 0.3, 0.2) and (plo,pll,p12) = (0.1,0.3,0.6). Validation size =l00. 42 2.9. Results for Case 3 & o w H 0.45 0.50 0.55 1 0 Fl cofI II C’, El I I Oh nH-, I I I 0.60 0.22 0.26 I Poo I I 0.30 I-fff-1 : I 0.34 I I 0.10 0.15 Pot 0.20 0.25 Pot 0 0Ij H a w oj TrTTiI1ITh I I 0.04 I I 0.08 0.12 _ I I 0.20 0.25 Pio _ I I I I I 0.30 0.35 0.55 0.60 0.65 Pit 0 0.70 P12 0 C’,, C01 -l - S Oh] o_ I 0.60 I 0.65 I I 0.70 0.75 0.80 H I 0.20 I rrrii 0.30 0.40 0.50 Figure 2.14: Histogram of 100 posterior means of r ,r 0 1 and PijS in the second simulation study for scenario 4 case 3. The “true” values are r 0 0.7, r 1 = 0.4, (Poo,Poi, P02) = — (0.5,0.3,0.2) and (plo,p11,p12) = (0.1,0.3,0.6). Validation size —100 43 2.10. Comparison of Odds Ratios From the above results, we are able to conclude that our formal approach works well when the prevalence rate is relatively high, and again it is hard to conclude that more validation data will help in getting more precise results. 2.10 Comparison of Odds Ratios As we talked about previously, it is interesting to compare odds ratios estimated by formal and informal approaches, to see which one tends to be closer to the true vales. Since in the informal approach, we tend to group two categories from the exposure “Unlikely” and “Maybe” together, only when the probability of exposure is rare, we would only compare odds ratios in scenarios that ro = 0.2 and r 1 = 0.25 in each case, i.e. scenario 1 in case 1 and 2, scenario 1, 2 in case 3. 44 2.10. Comparison of Odds Ratios Id 0 --.o--o 0 —0.2 0.0 0.2 0.4 0.8 0.8 —0.2 0.0 // 00 02 0.6 0.8 Iog(OR) 0 —0.2 0.4 0.2 Iog(O0) 0.4 0.6 0.8 / —02 0.0 0.4 0.2 0.6 0.8 Iog(OR) Figure 2.15: Comparisons of log informal odds ratios with log formal odds ratios for scenario 1 in case 1 (upper left), scenario .1 in case 2 (upper right), scenario 1 81 2 in case S (lower left and right) The true log odds ratio value is 0.285. The line in the each graph plots if two odds ratios are the same (y=x line). The dash lines represent the “true” value. . 45 2.10. Comparison of Odds Ratios Figure 2.15 suggests that the informal odds ratios tend to underestimate the true values, where the formal odds ratios sometimes overestimates the true value. For most of generated datasets, the informal odds ratio is always less than the formal odds ratios, since the majority of the plots are below the “y=x” line. The following table illustrates the comparisons in details. Scenario 1 Case 1 ORinformal 3 OR Scenario 1 Case 2 ORinformal 3 OR Scenario 1 Case 3 ORinformal 3 OR Scenario 2 Case 3 ORinformal 3 OR Bias -0.19 0.008 0.19 -0.055 0.19 0.00027 -0.18 -0.0043 f f MSE 0.0078 0.020 0.0072 0.016 0.0072 0.018 0.0068 0.018 Coy, of 95%CI 21 93 23 93 23 93 27 97 Ave. 95%CI Width 0.27 0.75 0.30 0.66 0.27 0.70 0.27 0.76 Table 2.19: Estimated bias, mean square error (MSE), coverage of 95 % confidence intervals and average 95% CI width of informal and formal log odds ratios for scenario 1 (or 2) in three cases. True log odds ratio is 0.28. Table 2.19 indicates that the formal analysis produces more precise estimation of log odds ratio. It has a small bias, great coverage rate, and a reasonable average of 95% CI width, where, on the other hand, the informal approach gives larger bias and unexpected small coverage rate. Thus, we are able to conclude that the formal approach generally does a better job than the informal approach in estimating the odds ratio. Researchers who apply the informal approach may need to take serious consideration of the measurement error, otherwise, the results might be very biased. 46 Chapter 3 Simulation Study for Continuous Exposure Variable 3.1 Introduction Suppose a researcher is interested in investigating the relationship between a health outcome, Y, and a continuous variable, X, and then concluding that E(YIX) = / +iX o 3 for unknown parameters o and i3i. Nevertheless, when the health outcome is measured precisely, a noisy measurement X is often obtained instead of X. If the researcher does not realize the existence of measurement error, or decide to ignore it, then his conclusion about Y and X could be biased. Thus, it is useful to study the impact of the mismeasured covariate X. Let’s assume a precisely measured predictor X has a normal distribution with mean i and variance )2, while its mismeasured surrogate X arises from an additive measurement error, which is non-differential, unbiased and normally distributed. Thus, X is normally distributed with mean X and variance a . Moreover, since the measurement error is non 2 differential, which means X* and Y are conditional independent of Y, the distribution of (X* Ix, Y) and the distribution of (X* IX) are identical. The joint distribution of X, X’ and Y can then be viewed as: f(X*,Y,X) = f(X*IX,Y) x f(YIX) x f(X) (3.1) f(X*IX) x f(YIX) x f(X). 47 3.1. Introduction The first term in equation (3.1) is called the measurement model, which is the conditional density of X* given X and Y. This defines that under the influence of Y, the surrogate X’ arises from the true variable X in a particular way. The second term is called the response model, which explains the relationship between the true explanatory variableX and the response Y. The last term is called the exposure model in epidemiological applications (Gustafson, 2004). Usually, specific distributions, which involve some unknown parameters will be assumed in equation 3.1, and to make inferences about the X*, X and Y relationship. Since in reality, the true explanatory variable X is unobserved, the likelihood function of X and Y is formed as f(X*,Y) ff(x*,x,Y)dx = (3.2) ff(X*IX,Y) x f(YX) x f(X)dX. Though in some cases, equation 3.2 is easy to evaluate, in other cases, big problems could arise when the integral does not have a closed form. Often, a Bayesian MCMC analysis will be used in such condition since one advantage of Bayesian approach is that the likeli hood function is not necessary expressed in explicit form. Dempster, Lairdd, and Rubin (1977) proposed the EM algorithm to solve the implicitly problem in a non-Bayesian way, however, in this paper, we will stay with the Bayesian MCMC methods. Researchers often want to compare the health outcome Y within two groups, thus, they often dichotomize the continuous variable X into two or more categories. Though, Roys ton, Altman, and Sauerbrei (2006) pointed out a considerable disadvantage of the di chotomization, it is still very common in the literature (MacCallum, Zhang, Preacher, and Rucker, 2002). In this paper, we dichotomize X into two groups with the rule if X > c, the subject is truly exposed, otherwise, the subject is not exposed. Note that in reality, the value c is often decided from a previous study or chosen by a health expert. 48 3.2. Posterior and Prior Distributions The relationship of the health outcome and predictor variable is often estimated by obtaining the coefficients from a linear regression model. Since the health outcome, Y, is a binary variable here, logistic regression appeals as a suitable model. General speaking, there are three approaches to estimate the coefficients. An “naive” approach would di chotomize X’ with respect to c, where an “informal” approach dichotomize the surrogate variable X* according to c (a threshold not necessarily the same as c). In reality, the true predictor variable X is unobserved, however, in the “formal” approach (discuss in the paper), it is pretended to be known and be dichotomized with respect to c and fit the model afterward. The choice of the threshold, c, is somehow arbitrary, however, in order to keep a high specificity (as in discrete case), some epidemiologists would intend to choose c to be bigger than the true c value, such that Pr(X* > c IX < c) is very small. Notice that researchers, who use the “naive” approach are often not aware of the measure ment error or intend to ignore it, while people who use “informal” or “formal” approach do acknowledge existence of the measurement error and try to find out a solution to the problem. Results from all three approach will be compared later in the Chapter. 3.2 Posterior and Prior Distributions In this paper, the constituent models are studied based on normal distributions. Specif ically speaking, we assume the measurement model, (X* Ix, Y), to be a normal distribu tion with mean X and variance 2• Since the measurement error is non-differential, we have XX N(X, a) 2 The exposure model is also assumed as a normal distribution with unknown parameters j and i.e. X N( ) 2 . 49 3.3. Case 1: When We Know o2 Prentice and Pyke (1979) pointed out that the odds ratios are equivalent when both prospective and retrospective logistic model are applied to the case-control data, thus we would like to assume the response model YIX follows a logistic regression, which is logitP’r(Y = lix) 1(X 1 /3 + /3 > c). By easy transformation, the response model turn to: Pr(Y=liX)= Note that all parameters 2, , )2, o and 3 / 1+ i 3 / eI+11I(X>c) are unknown, and proper prior distribu tions might be needed in order to proceed. Meanwhile, we would like to assume the independence of all prior distributions, so that , 2 f(u , o,th) ,A = ) x f() x f(A 2 f(u ) x f() x f(i). 2 Specific prior distributions will be assigned later on. In this chapter, we focus on studying three cases to demonstrate the performance of the “formal approach” : when we have some knowledge the noise of the true exposure value from previous study, i.e. is known; when we only have some prior information about the noise term, i.e. 2 is unknown but we have some information there; when we have some validation data, i.e. we have some data on x along with X* and Y for some subjects. 3.3 Case 1: When We Know 2 Sometimes, researchers might have some knowledge about the noise of the true explana tory variable from previous study results, then the posterior density function is (3.3) fJf(IXj x flf(XIX) < ) < f(/3) x f(3) x f() x 2 3 JJf(x f(A ) . 50 3.3. Case 1: When We Know 2 o In order to proceed, we need to specify prior distributions for unknown parameters u, , 2 A and To simplify the MCMC algorithm later on, it is convenient to as sign the normal distribution for , 2 as their o, i3 and Inverse Gamma distribution for A 3 / prior distributions. Thus, we have N(0,d); 3 J ‘s.’ N(0,d); N(0,d); 2 A 5 , 4 IG(d ) d . where the choice of hyper-parameters d, d, d, d 4 and d 5 determine how flat or con centrated a prior could be. Here, we choose d = that have flatter priors for unknown parameters d , = 4 = 1002 and d 4 = 5 d = 0.01 so , /3o and ,6. 2 A The posterior density function in (3.3) now turns to: (3.4) ) x 3 flf(YjIXj) x fJf(X;IX) x ]jf(X cc 1 x—e flj 1 + 1 d x -_ei x 3 d 2 d 22 x dg x F(d ) 4 fCo) x f(j3) 1 x—e x f(L) x f(A ) 2 2 Z(X) 2A A2_4+1)e_d5/A2 We chose the normal distribution and Inverse Gamma distribution for t and A 2 as prior distributions, since they have the property of being conjugate. A conjugate prior means the posterior distribution of ,u and A 2 would come from the same distribution family as the prior distribution. In particular, a simple Gibbs sampler algorithm can be used to generate a sample of from the posterior distribution of t, which is a normal distribu tion. Similarly, we can use Gibbs sampler algorithm to generate a sample of A 2 from its posterior distribution, the Inverse Gamma distribution. 51 3.3. Case 1: When We Know a 2 Unlike with i and A , the posterior distributions of o 2 3, j3 and X do not have the same form as any other familiar distributions that we recognize . Thus, the Gibbs sampler does not work for updating them and we need to use the Metropolics Hasting algorithm - instead. As introduced in Chapter 1, this algorithm is based on the accept/ reject rule. We would like to avoid the acceptance rate being extreme, such as 100% or 0%. The key to decide the rate is the jump distribution of the parameter. Let’s take j3o for example. Suppose we are at jth iteration right now and after updating i and A , we want to update 2 the th value of !o. We would calculate the joint density as where f is the joint density as in equation (3.4). Then, we would assign a jump size for i3o, such that = + t, where t is from the normal distribution with mean 0 and variance k 2 and we would again calculate the new joint density value as: Now, we will pick the new updated j3 as: cond) with the probability min(b/a, 1), 3’ otherwise. A similar procedure is applied for i3 and X. The individual jump size often follows a normal distribution with mean 0 and variance k . Note that the jump size for each 2 estimated parameter may vary in next two cases. 52 3.4. Case 2: When We Don’t Know 3.4 2 Case 2: When We Don’t Know Though, in the previous section we talked about knowing the measurement error variance of the true X from other studies, in most situation, we do not know the exact value of 2 but rather have a prior distribution for it. Then, the new posterior density turns to: a (3.5) = flf(IXj) x flf(XIX) x ) 3 llf(X x fC8o) x f(j3) x f(fL) x ) 2 f(). ) 2 xf(a Similarly for other parameters, we need specify a prior distribution for a . Again, be 2 cause of the conjugately property of Inverse Gamma distribution, we would like to assign Inverse Gamma distribution with shape parameter d 6 and scale parameter d 7 as the prior distribution of a . Note that d 2 6 and d 7 are hyp er-parameters. The choice of prior distributions for other unknown parameters are the same as in case 1 and all other hyper parameters would be assigned the same values. Now the joint density becomes: 3 1 f(X , 2 , X, jt,A ...,X,3o,3 , I 2 , 1 Y Y a = flfIX ) x 2 xf(A 113 1 + d6 ) 6 P(d flf(XIX) x flf(X) x f() x fCdi) x fCu) ) 2 f(a 0+1 1 e 1 3 ’()Cj>’) 3 x±e x x x ...,X) x x—e —) 1 —(X 1 2 2 2A x—e 22 jfl 3 d x dg4 ) 4 F(d x )2_(d4+1)e_d/A2 x a2_6+1)e_d7/2 53 3.5. Case 3: When We Have Validation Data We are able to update a 2 by Gibbs sampling since its posterior distribution is known as Inverse Gamma distribution, and hyper-parameters d 6 and d 7 are specified later in the result section. 3.5 Case 3: When We Have Validation Data Similarly as the validation case in the discrete case, here we assume there is a small proportion of data with complete information on X, X* and Y, whereas the majority of the data do not have the precise measurement of X but have the surrogate variable X* instead. Thus, unlike the equation (3.3) and (3.5) in case 1 and case 2, the joint density here becomes: f(Xl,X , 3 , I 2 , ...,Xm,/3O,I3 ...,Xn,Y,Y2,. Xm+l,Xm+ X U .., ljL,A n’ * * * 1’ 2’ * m = n flfo’Ixj ) 3 flf(XIX j = ) 3 HfX;1X j j=1 where the first x 1, . . x flf(X ) 3 j II f(IX) j=m+1 x ]Jf(X ) x f(j3) x fC8i) x 3 ) 2 f(p) x f(.\ .m are the non-validation data and the rest x f(a ) 2 , is the validation data. Though the joint density changes, there are not many changes regarding the simulation process. The only difference is we do not update the “known” X values in the simulations. We used Gibbs sampler to update , , and the Metropolis Hastings Algorithm 2 2 and a A - is used to update the o, i3i and the “unknown” X values. 3.6 Results In the following subsections, traceplots of MCMC algorithms in all three cases are checked along with some statistics for each unknown parameter, such as the posterior mean, 95% 54 3.6. Results equal-tail credible interval, bias, and estimated mean square error. In this way, we are able to gain some information about how well the MCMC algorithms are working for par ticular models and randomly generated datasets. Moreover, comparisons of estimated logistic regression coefficients are made among all three approaches, “formai”, “naive” and “informal”. In all three cases, the true unknown (common) parameters are set to be: ,u 1, i o 3 = —1.5, and /3 1.5, with the dichotomization value c = 0, A 2 = 1. Other choice of the hyper-parameters as well as the jump size are subject to different cases. Moreover, two simulation studies are performed for each case. The first concentrates on gathering information on estimates from one sample. The second focuses on studying the sampling distributions of each estimator across 100 simulated datasets. 3.6.1 Results for Case 1 In the first study, a dataset of size 500 was generated based on a choice of given parameter values. Then, we use 6000 iterations (the first 1000 iterations are burn-in period) of the MCMC algorithm to estimate unknown parameters. Note that the jump sizes for o, / i and X in the Metropolis-Hasting algorithm are chosen to be 0.15, 0.75, and 1 3 respectively. Figure 3.1 and Figure 3.2 show the traceplots of the MCMC algorithm outputs for unknown parameters in case 1. These traceplots only plots the iterations after the burn-in period. We can see that the Markov Chain is somehow stabilized after the burn-in period, and the Markov chain move thoroughly within the target range. 55 3.6. Results ci C.) ; ci IiII.hI c’J c I_I oI ‘-1 C I I 0 1000 2000 3000 4000 5000 0 1000 iterations 2000 3000 4000 5000 iterations C Cs iILiI.h Ii ii 0 C’J >< C C LU c I i’ U) 0 1000 2000 3000 iterations 4000 5000 1111 0 1000 2000 I I 3000 4000 5000 iterations Figure 3.1: Traceplots of ,X 1 X 2 from MCMC algorithm in case 1. The traeeplots show the 5000 iterations after 1000 burn-in period. 56 3.6. Results c’J 0 iii 0 I 0 0 0 IF 1 0 0 I 0 1000 I I 3000 iterations 5000 0 1000 3000 5000 iterations Figure 3.2: Traceplots of u, A 2 from MCMG algorithm in ease 1. The traceplots show the 5000 iterations after 1000 burn-in period. 57 3.6. Results Table 3.1 shows the true values, posterior means and the 95% credible intervals for each of the unknown parameters estimated from the data. 5 ,3 true value 0 1 -1.5 1.5 posterior mean -0.013 1.014 -1.48 1.48 95% CI (-0.112, 0.086) ( 0.86, 1.17) (-1.73, -1.21) ( 0.79, 2.13) Table 3.1: True values, posterior means, 95% credible intervals of are results from the first study in case 1. 1 , 2 .i,A . 3o,6 These From the table, we can see that the 95% credible interval of each unknown parameter actually covers the corresponding true value and the posterior mean is very close to the corresponding true value. The second study just involves repeating the first study 100 times and the sampling distribution of each estimator is studied. Figure 3.3 is the histogram of the posterior means for the 100 samples in case 1. 58 c C;’ 1% Cb .- H II c -to -to 1)1 0 01 01 p 0 I o 5 5 I 10 10 Frequency I 20 Frequency 15 I 30 20 b 0 0 •0 01 p. 01 0 p u 0 01 0 0 I 0 5 I 10 10 15 30 20 Frequency 20 Frequency 25 3.6. Results From the figure, we can see that the sampling distributions of j2 and i3i are approx imately normally distributed and centered at their true values, whereas the sampling distribution for is a little right skewed and is somehow left skewed. Table 3.2 summarizes each parameter estimator as: ,u o 3 / /3 Bias 0.0023 -zO.0096 -0.065 0.051 MSE 0.0050 0.0080 0.013 0.038 Coverage of the 95%CI 95 98 93 94 Average 95%C[ Width 0.19 0.31 0.54 1.42 Table 3.2: Estimated bias, mean square error (MSE), coverage of 95% CI and the average 2 pA , 0 3 for case 1. All results are based on 100 datasets. width of 1 The above table shows that the biases and estimated mean square error (MSE) for each unknown parameter are quite small, especially for fL and widths, out of the 100 runs, of the credible intervals for /2 )2• Furthermore, the average and )2 are pretty small. How ever, the wide average widths for /3o and j3 suggest that there is more variation among the 100 estimated and th. Also, out of the 100 times, the 95% credible intervals cover (CI) cover the true and )2 value 96 times, and cover the true /3o and 94 times, which suggests a good overall performance of the formal approach. 3.6.2 Results for Case 2 In this case, , 2 the measurement error variance, is also estimated with other parameters and the “true value” is chosen as 0.25. The hyper-parameters of the Inverse Gamma distribution of o 2 are specified as 200 and 50 respectively. The specific choice of the hyper-parameters results in a concentrated prior distribution. From Figure 3.4, we can see that this prior has a relatively narrow range and the centre of the distribution is close to the “true” value. 60 3.6. Results density plot 0 c’J U) > a) 0 U) 0 0.20 0.25 0.30 0.35 Figure 3.4: Density plot of Inverse Gamma distribution with hyper-parameters: c and 3 = 50. The vertical line is “true” value of the u 2 = 0.25 = 200 61 3.6. Results Moreover, the jump sizes for updating and X in Metropolis - Hastings Algo rithm are changed to 0.15, 0.55, 0.8, to avoid extreme acceptance/rejection rates. Again results from the first study (1 sample study) are displayed first. Figure 3.5 and Figure 3.6 are the traceplots of parameters the MCMC algorithm in case 2. 0 i [1 Ii i 111 i I III] . 1 L .i ii. ii. .II1I Is) II ri ‘1 ‘‘rrJi’ i I 0 1000 2000 3000 4000 5000 0 1000 2000 iterations 4000 5000 iterations 0 q I i I i .1 .11 q 0 >0 I 3000 0 0 0 0 1000 2000 3000 iterations 4000 5000 0 1000 2000 3000 4000 5000 iterations Figure 3.5: Traceplots of /3, i3i, X ,X 1 2 from MCMC algorithm in case 2. The traceplots show the 5000 iterations after 1000 burn-in period. 62 3.6. Results 0 d 0 0 0 1000 2000 3000 4000 5000 iterations 5 1000 2000 3000 4000 5000 iterations 0 0 C 0 0 1000 2000 3000 4000 5000 terations 2 from MCMG algorithm in case 2. The traceplots Figure 3.6: Traceplots of u, )2 and g show the 5000 iterations after 1000 burn-in period. We can see that the Gibbs sampling algorithm is very stable when updating o 2 in 5000 iterations, and the chains do not have a mixing or convergence problem. Table 3.3 shows the posterior means and 95% CI analyzed based on the particular dataset. Again, all the 95% CI covers the true values of the parameters, and it’s reason able to conclude that the approach works well for this particular dataset. The procedure of the first study is repeated 100 times in the second study. We are able 63 3.6. Results t 5 i3 th true value 0 1 0.25 -1.5 1.5 posterior mean -0.0061 0.88 0.25 -1.51 1.72 95% CI (-0.089, 0.10) ( 0.73, 1.03) (0.22, 0.29) (-1.77, -1.27) ( 1.06, 2.39) Table 3.3: True values, posterior means, 95% credible intervals of,A ,/3o,/3i and u 2 . 2 These are results from the first study in case 2. The “true” values are: u = 0, ?.. —1.5 and 4 1.5. to study the sampling distribution of to get a better understanding of the potential problem (overestimate the parameter) in the first study. The problem could be just happening by chance or it could show that overall the MCMC algorithm underestimates th. Figure 3.7 confirms that the problem of underestimating /3 is just due to chance and overall the estimated is roughly follows a normal distribution centered at the “true” value. Except for A , other parameter estimators are all approximately following 2 a normal distribution centered at their corresponding “true” value. 64 3.6. Results r rfl_fffR1..;. —0.10 —0.05 0.00 0.05 0.10 —2.0 —1.8 —16 —1.4 —1.2 —1.0 Po I . 0.248 0.250 0.252 0.254 0.256 Figure 3.7: Histograms of 100 posterior means for , 2 A io i3 and o , in the second study 2 in case 2. The “true” values are: u O,A 2 = 3 —1.5 and j3i = 1.5. ,/ 1 o — — 65 3.6. Results Table 3.4 outlined the bias, MSE, coverage of 95% and average width of the 95% for each estimator. It suggest that our approach produces reliable estimators with small biases, small MSE, satisfactory coverage rates and reasonable average credible interval widths. a V i 3 / Bias 0.0034 0.007 -0.054 0.019 0.0023 MSE 0.0047 0.0087 0.013 0.035 0.00013 Coverage of the 95%CI 95 95 97 95 100 Average 95%CI Width 0.20 0.32 0.54 1.41 0.071 Table 3.4: Estimated bias, mean square error (MSE), coverage of 95 % CI and the average 2 for case 2. All results are based on 100 datasets. The “true” , i3o, i3 and g 2 width of t, A values are: t = 0, A 2 = 1, i3o = —1.5 and i3 1.5. 3.6.3 Results for Case 3 In this case, we took our validation size to be 50, which means 10% each dataset has precise measurements of X. The new jump sizes for i3o, i3 and X are 0.1, 0.35 and 0.7. Figure 3.8 and Figure 3.9 are the traceplots of parameters in the MCMC algorithm in the first study 66 3.6. Results C rLI I I I I I I o woo 2000 3000 4000 5000 0 1000 2000 iterations 3000 4000 5000 4000 5000 iterations (N La 0 (N as Os >< >< 0 0 0 0 5’, I 0 I I I I I 0 1000 2000 3000 4000 iterations 5000 1000 ‘ 2000 3000 iterations Figure 3.8: Traceplots of ,X 1 2 from MCMG algorithm in case 3. The traceplots th X show the 5000 iterations after 1000 burn-in period. 67 3.6. Results 04 0 1. 1 IL.. i.. 0 0 0 0 0 0 1000 2000 3000 4000 9 . 1 _1IT . III . 5000 iterations 0 1000 1 rIiIIII I••I’iIII 2000 3000 4000 5000 teratiorts 04 0 0 .5 04 0 0 04 0 0 1000 2000 3000 4000 5000 iterations Figure 3.9: Traceplots of t, A 2 and o 2 from MCMC algorithm in case 3. The traceplots show the 5000 iterations after 1000 bur’rt-iri period. 68 3.6. Results u ET j3 4i true value 0 1 0.25 -1.5 1.5 posterior mean -0.032 1.03 0.24 -1.44 1.30 95% CI ( -0.013, 0.066) ( 0.87, 1.19) (0.21, 0.28) (-1.68, -1.19) ( 0.63, 1.97) . 2 Table 3.5: True values, posterior means, 95% credible intervals of , )2, / o, / 3 i and o 3 These are results from the first study in case 3. Both figures (Figure 3.8 and Figure 3.9) and statistic values (Table 3.5) indicate that for this particular generated dataset, the approach did a good job. Next, the histogram (Figure 3.10) and summary statistics (Table 3.6) of the sampling distribution for each unknown parameter in the second study are presented. 69 3.6. Results II —015 I —0.05 0.00 0.05 0.10 0.15 01’52!02!5 0.235 0.245 0.255 0.20 —1.8 —1.6 —1.4 —1.2 081,1! 0.265 2 a Figure 3.10: Histograms of 100 posterior means for )2, /3, j3 and u in the second study 2 in case 3. The validation size is 50. The “true” values are: t = 2 0,A = = —1.5 and i = 1.5. 70 3.7. Comparison of Three Approaches From Figure 3.10, we can see that all the sampling distributions of parameter es timators are approximately normally distributed with some skewness involved, expect the histogram for u 2 looks uniformly distributed at first glance. However, by taking a close look at the figure, we noticed that the scale for the histogram of o2 has 3 decimal 2 are very close to the true value, places, which suggests that most estimated values of g 0.25. This observation is also confirmed in Table 3.6, since the estimation of 2 o has the smallest bias, MSE and average CI width and highest coverage rate (100%). i A i 3 / Bias -0.0032 0.012 -0.024 0.051 0.00148 MSE 0.0050 0.0080 0.013 0.036 0.00059 Coverage of the 95%CI 97 95 95 95 100 Average 95%CI Width 0.19 0.31 0.55 1.34 0.066 Table 3.6: Estimated bias, mean square error (MSE), coverage of 95 % CI and the 2 for case S. All results are based on 100 datasets with average width of, )2, i3o, i3 and o validation size 50. The “true” values are: i = 0,)..2 = 1,/3o —1.5 and i3 = 1.5. Evidence, such as small bias, small MSE and hight percentage of true values coverage, in Table 3.6 demonstrate a good performance for the validation model. Overall, we can conclude that the “formal” approach did an excellent job in estimat ing unknown parameters for three different models as: knowing the measurement error variance, prior information of the measurement error variance and validation model. Next, we are going to study the comparative performance of the “formal” approach, the “naive” and the “informal” approach. 3.7 Comparison of Three Approaches As we discussed previously, the “naive”, “formal” and “informal” approach would di chotomize either X or X* with respect to c or c* to fit a logistic regression model as 71 3.7. Comparison of Three Approaches following: logit(Pr(Y logit(Pr(Y = logit(Pr(Y = 1IX))formal = /3o + /3 1(X > c) 1 = 1(X* > c) 1 i3o + /3 1 1X*))jflfyJ. = 1IX*))najve = + 13i1(X* > c) The relationship of the health outcome, Y and the exposure variable X is gained from the estimated coefficients, /3 and i3. Thus, the comparisons are mainly based on estimating these two parameters. The true values are the same as before: i3o = —1.5 and i 3 / 1.5, and comparisons are constructed in each of the three cases of the “formal” approach. Note that the c is chosen as 1.3 so that the specificity and the sensitivity is very high (both are around 95%). Figure 3.11 to Figure 3.13 are the pairwise plots of results from the three approaches. By observing them, we see there are some linear relationship between estimators, and the linear relationship is somehow weaker when estimating /3 than estimating i3o. Moreover, both the “naive” approach and the “informal” approach tend to overestimate /3o but underestimate i3 ,whereas estimations from “formal” approach are located around the true value. Since it is very hard to tell which one is “better” between the “naive” and the “informal” approach from the figures, the summary statistics of estimators’ sampling distribution in these two approaches are crucial to know. 72 3.7. Comparison of Three Approaches I I I I :j 0’ 0 A 0 //0 I 0 :040 8 -- S S 00 I I I —1.6 I I —1.4 —12 • —1.8 I I r 001 00 I I I I I —1.6 —1.4 —1.2 —1.7 —1.5 naive Po 00 09 On 0 q. - I informal o Co ID I . o —1.8 1’ fl_ • o0 0 informal Sb —1.3 o 3 1 CM I I 1 d I 41 I 0 —1.1 I 0 i0% — 0//’ .5 000 - I 8o 0 I I 0 /l I I I 0.5 1.0 1.5 informal .7 0 0 0 , 2.0 : IO I __ I I I 0.5 1.0 naive 1.5 2.0 , 0 _______ j%0o0 oO I I I 0.5 1.0 1.5 __ informal 2.0 i Figure 3.11: Pairwise plots of three approaches, “naive”, “informal” and “formal”, in estimating i3o and j3 under case 1. The solid line is if “y the corresponding true values. x “, and the dash lines are 73 3.7. Comparison of Three Approaches / I 1 /0° I Ij In Hi /:° 0°,, p —1.8 I I I I I I —1.6 ‘0. —1.4 informal 0 I I I I —1.2 —1.8 —1.6 —1.4 —1.2 7 —1.5 q In I I 0 0 o% - : — I I I I 1.0 1.5 2.0 informal fl. 20 0 °li I I I 0.5 1.0 1.5 Q90 I 0 I naive 1 . ---0-Il--- o$ I I Figure 3.12: Pairwise plots of three approaches estimating /3i and i3 under case the corresponding true values. 0 -: D’1 LI) ° LI) 0.5 - 60/ I I 0 - 0 I —1.1 fib I aoo 0 —1.3 0 In 0,0& —1.7 infomial a ,oe 0 ‘1 J1 naive 8:o 0 I I flo /°,° 0 I 2.0 I I I 0.5 1.0 1.5 infomial 2.0 I3 “naive “, “informal” and “formal”, in The solid line is if “y = x “, and the dash lines are , 74 3.7. Comparison of Three Approaches 0 / /0 04 .5 It . S 0 0 I0• LI 0 —1.8 —1.6 —1.2 —1.4 informal —1.0 —1.8 Pa I naive Pr —1.2 —1,0 0 0 04 0 0 0 04 04 //L -: Lb - In —1.4 informal 0 000 g —1.6 a, a 8 0 0 0 0 0 0 U) L0 0 0 I I I 0.5 1.0 1.5 informal i 2.0 0.5 I I 1.0 1.6 01!0i52!0 2.0 natve i informal Pi Figure 3.13: Pairwise plots of three approaches, “naive “, “informal” and “formal”, in estimating i3o and 3 i under case 3. The solid line is if “y the corresponding true values. = x “, and the dash lines are 75 3.7. Comparison of Three Approaches Table 3.7 reports the average posterior means of /3o and i3, as well as 95% confidence intervals for the average posterior means in all three cases. We are able to conclude that the formal approach is superior to informal and naive approach, since only the confi dence interval of “formal” approach cover the true values of i3o and ! i 3 . Moreover, when estimating /3, the formal approach produces posterior means and confidence intervals, which are more closer to the true values. The “naive” approach generated the most narrow confidence interval that may also implies over-confidence. These results suggests that it is very dangerous to ignore the measurement errors in the analysis and making proper adjustments for the measurement error is crucial. Case 1 Onajve 3 / Ojnformal 3 / Oformal 3 / ’naive 3 / ’informal 3 / lformaj 3 / Case 2 Average Posterior Mean -1.41 -1.36 1.52 0.98 1.14 1.55 95%Confidence Interval (-1.43, -1.39) (-1.38, -1.34) (4.53, 1.48) (0.93, 1.03) (1.08, 1.20) (1.47, 1.63) 4.40 -1.34 -1.49 0.97 1.04 1.52 (-1.42, -1.37) (-1.36, -1.31) (4.52, 1.47) (0.92, 1.0201) (0.99, 1.10) (1.45, 1.59) -1.43 1.37 4.49 0.99 1.12 1.55 (-1.45, (-1.39, (4.52, (0.95, (1.06, (1.48, Oflajve 3 / Ojnformaj 3 J Oformal 3 / inaive 3 / ljnforma1 3 / lformal 3 / Case 3 Onajve 3 ! Ojnformal 3 / Oformal 9 / ’naive 3 I ljnformaz 3 / lformal 3 / 4.40) -1.35) 1.47) 1.04) 1.18) 1.62) Table 3.7: Average of posterior means and 95% confidence intervals for the average posterior means of o and i3i for “naive “, “informal” and “formal” approaches. Results are based on 100 samples in case 1, and 3 76 Chapter 4 QRS Data Study To illustrate the ideas and methods that we discussed in the previous chapters in a real world example, we use the QRS dataset. This dataset is provide by Vittinghoff, Glid den, Shiboski, and McCulloch (2004). Heart problems can be diagnosed through the timing of diverse stages in the contraction of the heart. Electrocardiography(EKG) is the device that records the electrical activities of the heart through a duration of time. As the authors indicate the QRS wave is defined as a commonly measured time interval in the contraction of the ventricles. The study dataset contains the QRS times (in mu liseconds) for 53 patients, of whom 18 have the inducible ventricular tachycardia (IVT) and 35 of them are without IVT. Note that the sample size is relatively small, since it is very difficult to assemble a large number of subjects to participate in a brain wave study, and the cost of the study is very high. Thus, studies that involve brain waves and electrocardiography devices commonly have small sample sizes. Though the sample size is considerably small, it is still a good and clean dataset to illus trate our ideas and methods. The response variable Y takes the value of 1 if the subjects has IVT, and 0 otherwise, while the covariates variable X is the QRS time (in millisec onds). Since the QRS time is a continuous variable, we are focusing on the approach introduced in Chapter 3. Even through, in the literature, there are researchers who argue about the accuracy of the QRS duration (Tomlinson, Betts, and Rajappan, 2009), which indicates that measurement error could exist in the measurement of timing in the real world, for the purpose of this thesis, we treat the QRS timing as precisely measured (X) and we simulate the surrogate variable, X, in order to compare the results obtained from the true values, naive analysis, informal analysis and our formal analysis. 77 Chapter 4. QRS Data Study Nevertheless accuracy of the QRS is questioned by many researchers, there are few articles states the possible magnitude of the measurement error. According to Sahambi, Tandon, and Bhatt (2009), the maximum error rate of the QRS is 6.25% due to the 50 Hz power-line interference. As we lack of detailed information about how the data are collected, we would simply adopt the measurement error rate stated by Sahambi et al. (2009). For our illustrative proposes, we have to assume we know the variance of additive measurement error, u , in order to generate X*. The error rate we accepted previously 2 is a multiplicative error, and proper transformation is necessary to acquire an additive error (as we defined in Chapter 3). As a result, we choose X = log(QRStime) instead of QRS time directly. As defined in Chapter 3, under the nondifferential assumption, the measurement model here is X’IX N(1ogQRS,u ) 2 . Mathematically, we can compute a 2 as: X eX* = 1ogQRS+o-Z = (QRS)e = 1.0625 which motivates e where Z is a standard normal random variable. We get the variance of additive mea surement error as 0.0312 = 0.00096, and the surrogate variable, X’, can be generated afterward. Since we don’t have validation data and we suppose we don’t know the variance of the measurement error, the “formal analysis” approach is based on the model that was 78 4.1. Naive Approach and Informal Approach introduced as case 2 in Chapter 3. Thus, the response model is 1ogit(P(Y=1X)) 51 = 1 log = 1 + 0 /3 1 (X>c) /3 (4.1) Under the assumption of nondifferential measurement error, the measurement model is X’X N(X,). The exposure model is: X N(,)). We will use this set-up to conduct naive, informal and formal analysis and compare results produced by three approaches with the true values. 4.1 Naive Approach and Informal Approach For the response model, the value of c is chosen as log(120) as suggested by Tomlinson, Betts, and Rajappan (2009). To refresh the memory, the naive approach would formulate the response model based on X* and c as: logit(P(Y = P(Y=1II(X*>c)) 1IX*)) = In order to perform the informal approach, we need to pick up a c value such that, the specificity is very high (as discussed in Chapter 3), and it is chosen as c = log(123) so 79 4.2. Formal Approach that the SP = 0.94. Then the response model of informal approach is: logit(P(Y = 1IX*)) P(Y 1II(X* > c)) P(Y= 1I(X* > cj) + 0 /3 1 1 (X*>c*) /3 — = 1 logit — = The estimates of and for these two approaches can be easily obtained by using the gim function in R. Results are shown in the section 4.3. 4.2 Formal Approach Since we are assume the QRS timing from the dataset is the true value, we are able to obtain some information about the exposure variable X, such as the the mean of mean(X) = 0.00091. Note that those values are not applicable in the real world, and they are p. = 4.64, var(X) = 2 A = 0.094 and the measurement error variance u 2 available here for this example only (because of our assumption). As discussed early in the simulation study of Chapter 3, we need to assign prior information for the “unknown” parameters. Remember that in section 3.6.2, flatter priors are assigned to unknown parameters since the simulated sample size is considerably large. Though, the sample size is pretty small in this dataset, we are still able to assign flatter priors to some of the unknowns. 2 A For example, p. is assigned as p. IG(0.01, 0.01) and /i N(0, 1002), A 2 is assigned as is assigned with N(0, 1002). Dislike the non-informative priors for p., A and i3o, the prior information becomes very crucial for 2 and i3. It is reasonable to assign concentrated priors to these two unknowns, since researchers could easily obtain relative information about these two unknowns from precious study. As a result, concentrated priors are assigned as such that j3 N(0, 12) and .2 IG(100, 0.1). Figure 4.1 displays the prior density plots vs their corresponding “true” values. 80 4.2. Formai Approach I —400 —200 0 200 400 0.00*00 I I 5.Oe+302 1.00+303 I I 1.50+303 2.00+303 I’ II.I.I —4 —2 0.0008 0 0.0010 4 2 0.0012 0.0014 —40000 —20000 0 20000 40000 0.0018 Figure 4.1: Prior plots of unknown parameters with their hyper-parameters: i ),A N(0,100 2 2 IG(0.01,0.01),o0 IG(100,0.1),i3 )and /3 2 N(0,100 ). 2 N(0,1 The vertical lines are the corresponding “true” values as: u = 4.64, A 2 = 0.094, /3o 2 = 0.00096. —0.90, /3i 0.76 and g -‘ ‘-‘. ‘-- 81 4.2. Formal Approach We observe that, regardless of the strength of the prior, the center of prior density for each unknown parameter is most likely located around the corresponding true value. The plot for A 2 looks abnormal, since the range of its density function goes from 0 to infinity so that it is quite difficult to display on a limited scale. 2 has the most concentrated prior, since approximately 95% of its data are enclosed by 0.0008 and 0.0014, a pretty small range. x Similarly as in the simulation study, we are unable to obtain the full conditional distributions for o and so we are going to use the Metropolis- Hastings Algorithm to obtain their estimators. All other unknown parameters are updated as in the simulation study of Chapter 3 case 2. The jump size for updating X, /3o and in MR consist with the simulation study, which are 0.15, 0.35, 0.8 respectively. Figure 4.2 shows the traceplot of 200000 iterations after 2000 burn-in period, and there are no apparent mixing problems to be noticed. The numeric results are presented and compared in the next section. 82 4.2. Formal Approach LO C q Lf) C >. a) C CD C ci) C C’) C L() C C I I I 0 50000 100000 I I I 200000 —2.5 I I - L. C’) - C - 1.1 Co Ii.i ii.l,.ii C 0 C CD. C ci) C’) 0 F.. IIIr rI 1 — —j. I I I 0 50000 100000 iterations —0.5 0.5 13o iterations Co I —1 5 I C C 200000 J.. —2 —I 1 0 2 3 13i Figure 4.2: Traceplot and posterior density plots of !0000 iterations after 1000 burn-in period of /3 and /3 when applying the MH sampling method. 83 4.3. Results 4.3 Results Before discussing the results estimated from the naive, informal and formal approaches, we would like to find out the supposed true result first. It is very easily obtained from the glm function in R and the true model for explaining whether or not a subject has the IVT is estimated as follows: 1 log )) 1 Y 1 —0.90 + O.761(X > c). Table 4.1 records results of i3o and i3i from the naive, informal and formal approaches. Onajve 3 / Ojnformal 3 f Oformaj 3 / ljnform& 3 / lfo’rmai 3 / Estimate -0.86 0.86 -0.90 0.61 0.61 0.67 95%CI (-1.58, -0.14) (4.58, 0.14) (-1.63, -0.22) (-0.63, 1.84) (0.63, 1.84) (0.46, 1.73 ) CI width 1.44 1.44 1.35 2.48 2.48 2.19 Table 4.1: Estimators, 95% confidence, or credible, intervals of i3o and /3 by using “naive”, “informal” and “formal” approaches. In light of the study performed in Chapter 3, the results for analyzing this dataset behave as we would expect. Though the results are close, the formal approach performed the best when strong priors for /3 and c 2 are provided. As the data size gets larger, we believe that the formal approach will keep doing a good job, i.e. estimated values close to the “true” values and the less variability of estimated parameters, even when flatter priors are assigned. Surprisingly, the naive and informal approaches produce the same results, and one possible explanation is that the data size is very small, and there is no significant difference in modeling X* with threshold c or c in this special case. Note that when the data size increases, the chance that naive and informal approaches produce the same results will become slim. 84 Chapter 5 Conclusion and Future Work In this thesis, we propose a formal approach to adjust mismeasurement in case-control studies. Ignoring potential mismeasurement on exposure variables could lead to serious problems, such as loss of power, biased estimation and misleading conclusions. In the literature, many methods were proposed to deal with misclassification and measurement error, such as matrix method, inverse matrix method, regression calibration, SIMEX, Expectation-Maximization algorithm in frequentist perspective. Lots of methods are ready to use, nevertheless, they all have their limitations. For example, Carroll, Rup pert, Stefanski and Crainiceariu (2006) stated that though the SIMEX and regression calibration are simple methods to implement, they have limited contributions in reduc ing the bias caused by the measurement error. The Bayesian approach, on the other hand, is able to correct the bias more precisely and generally. Though, a potentially misspecified exposure model, too complex posterior and intensive computational require ments are occasionally drawbacks in the approach, it has the great advantage that the uncertainties of parameters can be fully incorporated. A formal approach in the Bayesian perspective is introduced in this dissertation to ac count for both categorical and continuous exposure variables under the non-differential assumption. Fundamental techniques and concepts are introduced in Chapter 1. Ideas of the proposed formal approach that deals with a categorical exposure variable is in troduced and studied through investigating three cases in Chapter 2. The underlying theme of the formal approach that adjusts the measurement error (continuous exposure variable) is presented and examined by studying its performance on three cases again in 85 Chapter 5. Conclusion and Future Work Chapter 3. In Chapter 4, a real world dataset is used to evaluate the proposed model. Gibbs sampler and Metropolis-Hasting algorithm are mainly used to sample the param eters of interest from their corresponding posterior distributions. In Chapter 2, we investigate three cases where we have different levels of knowledge about misclassified probabilities. In each case, the approach is implemented with both low and high prevalence rate, as well as a different validation sample size when we assume validation data are available. Stabilized traceplots suggest that the overall convergence rate is adequate for Markov chain simulation in our proposed model. When the sampling distribution of each unknown parameter is studied, statistical assessments such as, small estimator bias, small mean square error, high coverage rate of the true value and reason able average 95% credible interval length, all indicate that overall the model is efficient and accurate. When only the prior information about the misclassified probabilities is known, strong and concentrated priors are required to get good estimation. One pos sible explanation is that, a strong prior is able to reduce the variability of estimators and improve the efficiency of the approach. However, when a small proportion of the validation data is available, it is found that the strong priors become unnecessary, which indicates that the model is able to capture enough information to make good estimation. Moreover, it seems like the size of the validation data does not significantly affect the estimation, and this would be an interesting point to study later on. When the results obtained from low prevalence rate and high prevalence in each case is compared, it is delightful to observe that the approach could work for any prevalence rate. In the end of Chapter 2, estimated log odds ratios are compared for the proposed formal approach and informal approach. It is found, as expected, that the informal approach tends to underestimate the association between the exposure variable and response variable most of the time, and that less than 25% of the 95% confidence intervals actually cover the true value. Even though, the formal approach sometimes overestimate the log odds ratio, the majority of its 95% confidence intervals include the true values and the estimator 86 Chapter 5. Conclusion and Future Work bias is much smaller than it is in the informal approach. In Chapter 5, the proposed approach is implemented with a continuous exposure variable. A logistic regression model is specified for the binary exposure variable and dichotomized continuous exposure variable so that their association is measured according to the coeffi cients of logistic regression. Two simulation studies for three cases studies are conducted based on our knowledge of the magnitude of the measurement error. As expected, we find that our proposed approach is both efficient and accurate. As in the discrete case, proper priors are crucial when we only have the prior information in hand but not so important when the validation data are accessible. Coefficients obtained from the “naive” approach and informal approach (both of them use the association estimated from response vari able Y and X* to estimate the true relationship between Y and X) are compared with our proposed formal approach at the end of the chapter. Overall, the performance of those approaches decline as we move from formal approach to informal approach and then naive approach. A real world example is used to illustrate our idea and approach. Due to a very small sample size of the dataset, adjusted proper priors are again critical to gain valuable estimations. Fortunately, it is proved that our suggested approach can work practically after strong priors are assigned. Our proposed Bayesian adjustment for mismeasurement can be extended to a variety of research areas. One straightforward extension would be having more precisely mea sured covariates in addition to the model that we have right now. Further investigation can be conducted to understand the weakness of our approach, which is that strong priors are needed when we only have prior information. A relevant and interesting study would be to find out whether there is a “cut-off” validation size so that the researchers are able to gain the “maximum” information while spending minimum time or money. 87 Bibliography B. A. Barron. The effects of misclassification on the estimation of relative risk. Biomet rics, 33:414—418, 1977. R. J. Carroll, D. Ruppert, L.A. Stefanski, and C. M. Cainiceanu. Measurement Error in Nonlinear Models, Vol. 105 of Monographs on Statistics and Applied Probability. Chapman &Hall/CRC, Boca Raton., second edition, 2006. G. Casella and E.I. George. Explaining the gibbs sampler. The American Statistician, 46:167—174, 1992. J.R. Cook and L. A. Stefanski. Simulation-extrapolation estimation in parametric mea surement error models. Journal of American Statistical Association, 89:1314—1328, 1994. A. P. Dempster, N. M. Lairdd, and D. B. Rubin. Maximum likelihoodd from incomplete data via the em algorithm. Journal of the Royal Statistical Society, B 39(1):1—38, 1977. E. Greenberg and Siddhartha Chib. Understanding the metropolis-hastings algoithm. The American Statistician, 49(4): 167—174, 1995. Paul Gustafson. Measurement Error and Misclassification in Statistics and Epidemiol ogy: Impacts and Bayesian Adjustments, volume Vol. 13 of Interdisciplinary Statistics. Chapman &Hall/CRC, Boca Raton, 2004. H. Kuchenhoff, S. M. Mwalili, and E. Lesaifre. A general method for dealing with misclassification in regression: the misclassification simex. Biometrics, 62:85—96, 2006. 88 Chapter 5. Bibliography R. H. Lyles. A note on estimating crude odds ratios in case-control studies with differ entially misclassified exposure. Biometrics, 58:1034—1037, 2002. R.C. MacCallum, S. Zhang, K.J. Preacher, and D.D. Rucker. On the practice of di chotimization of quantitative variables. Psychological Methods,, 7:19—40, 2002. R. J. Marshall. Validation study methods for estimating exposure proportions and odds ratios with misclassified data. Journal of Clinical Epidemiology, 43:941—947, 1990. M. J. Morrissey and D. Spiegelman. Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons. Biometrics, 55:338—344, 1999. D.A. Pierce and A.M. Kellerer. Adjusting for covariate error with nonparametric assess ment of the true covariate distribution. Biometrika, 91:863—876, 2004. R.L. Prentice and R. Pyke. Quantitative analysis of errors due to power-line interference and base-line drift in detection of onsets and offsets in ECO using wavelets. Medical and Biological Engineering and Computing, 35(6):747—751, 1979. P. Royston, D. G. Altman, and W. Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine, 25(1):127—141, 2006. J.S. Sahambi, S.N. Tandon, and R.K.P. Bhatt. Quantitative analysis of errors due to power-line interference and base-line drift in detection of onsets and offsets in ECG using wavelets. Medical and Biological Engineering and Computing, 35(6):747—751, 2009. A. Skrondal and S. Rabe-Hesketh. Generalized Latent Variable Modeling: multi-level, longitudinal and strucutral equation models. Chapman &Hall/CRC, Boca Raton., 2004. D.R. Tomlinson, T.R. Y. Betts, and K. Rajappan. Accuracy of manual qrs duration assessment: its importance in patient selection for cardiac resynchronization and im plantable cardioverter defibrillator therapy. Europace, 11(5) :638—642, 2009. 89 Chapter 5. Bibliography E. Vittinghoff, David V. Glidden, S.C. Shiboski, and C. E McCulloch. Regression methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. Springer, 2004. 90
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Formal and informal approaches to adjusting for exposure...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Formal and informal approaches to adjusting for exposure mismeasurement Shen, Tian 2009
pdf
Page Metadata
Item Metadata
Title | Formal and informal approaches to adjusting for exposure mismeasurement |
Creator |
Shen, Tian |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | In many research areas, measurement error frequently occurs when investigators are trying to analyze the relationship between exposure variables and response variable in observational studies. Severe problems can be caused by the mismeasured exposure vari ables, such as loss of power, biased estimators, and misleading conclusions. As the idea of measurement error is adopted by more and more researchers, how to adjust for such error becomes an interesting point to study. Two big barriers in solving the problems are as follows. First, the mechanism of measurement error (the existence and magnitude of the error) is always unknown to researchers. Sometimes only a small piece of information is available from previous studies. Moreover, the situation can be worsen when the study conditions are changed in the present study, which makes previous information not applicable. Second, some researchers may still argue about the consequences of ignoring measurement error due to its uncertainness. Thus, the adjustment for the mismeasurement turn to be a difficult, or impossible task. In this thesis, we are studying situations where the binary response variable is precisely measured, but with a misclassified binary exposure or a mismeasured continuous exposure. We propose formal approaches to estimate unknown parameters under the non-differential assumption in both exposure conditions. The uncertain variance of measurement error in the continuous exposure case, or the probabilities of misclassification in the binary exposure case, are incorporated by our approaches. Then the posterior models are estimated via simulations generated by the Gibbs sampler and the Metropolis - Hasting algorithm. Meanwhile, we compare our formal approach with the informal or naive approach in both continuous and exposure cases based on simulated datasets. Odds ratios on log scales are used in comparisons of formal and informal approaches when the exposure variable is binary or continuous. General speaking, our formal approaches result in bet ter point estimators and less variability in estimation. Moreover, the 95% credible, or confidence intervals are able to capture the true values more than 90% of the time. At the very end, we apply our ideas on the QRS dataset to seek consistent conclu sions draws from simulated datasets and real world datasets, and we are able to claim that overall our formal approaches do a better job regardless of the type of the exposure variable. |
Extent | 1924669 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-11-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
IsShownAt | 10.14288/1.0070879 |
URI | http://hdl.handle.net/2429/15219 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2009_fall_shen_tian.pdf [ 1.84MB ]
- Metadata
- JSON: 24-1.0070879.json
- JSON-LD: 24-1.0070879-ld.json
- RDF/XML (Pretty): 24-1.0070879-rdf.xml
- RDF/JSON: 24-1.0070879-rdf.json
- Turtle: 24-1.0070879-turtle.txt
- N-Triples: 24-1.0070879-rdf-ntriples.txt
- Original Record: 24-1.0070879-source.json
- Full Text
- 24-1.0070879-fulltext.txt
- Citation
- 24-1.0070879.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0070879/manifest