Formal and Informal ApproachestoAdjusting for ExposureMismeasurementbyTian ShenB.Sc., The University of New Brunswick, 2006A THESIS SUBMITTED IN PARTIAL FULFILLMENTOFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2009©Tian Shen, 2009AbstractIn many research areas, measurement error frequently occurs when investigatorsaretrying to analyze the relationship between exposure variables and responsevariable inobservational studies. Severe problems can be causedby the mismeasured exposure variables, such as loss of power, biased estimators, and misleading conclusions.As the ideaof measurement error is adopted by more and more researchers,how to adjust for sucherror becomes an interesting point to study. Two big barriers in solvingthe problemsare as follows. First, the mechanism of measurement error (the existence andmagnitudeof the error) is always unknown to researchers. Sometimes onlya small piece of information is available from previous studies. Moreover, the situation canbe worsen when thestudy conditions are changed in the present study, which makes previous information notapplicable. Second, some researchers may still argue about theconsequences of ignoringmeasurement error due to its uncertainness. Thus, the adjustment forthe mismeasurement turn to be a difficult, or impossible task.In this thesis, we are studying situations where the binary responsevariable is precisely measured, but with a misclassified binary exposure ora mismeasured continuousexposure. We propose formal approaches to estimate unknownparameters under thenon-differential assumption in both exposure conditions. The uncertain varianceof measurement error in the continuous exposure case, or the probabilitiesof misclassificationin the binary exposure case, are incorporated by our approaches. Thenthe posteriormodels are estimated via simulations generated by the Gibbs samplerand the Metropolis - Hasting algorithm.iiAbstractMeanwhile, we compare our formal approach with the informalor naive approach inboth continuous and exposure cases based on simulateddatasets. Odds ratios on logscales are used in comparisons of formal and informal approaches whenthe exposurevariable is binary or continuous. General speaking, ourformal approaches result in better point estimators and less variability in estimation. Moreover,the 95% credible, orconfidence intervals are able to capture the true values morethan 90% of the time.At the very end, we apply our ideas on the QRS datasetto seek consistent conclusions draws from simulated datasets and real world datasets, and we areable to claimthat overall our formal approaches do a better job regardlessof the type of the exposurevariable.111Table of ContentsAbstractiiTable of ContentsivList of TablesviList of FiguresxAcknowledgementsxivDedicationxv1 Introduction11.1 Misclassification and Measurement Error11.2 Overview of Current Available Methods21.3 Bayesian Analysis41.3.1 Bayes Rule41.3.2 Prior Distribution51.3.3 Markov Chain Monte Carlo Algorithm62 Simulation Study for Categorical Exposure Variable82.1 Introduction82.2 2 x 3 Table—Formal Analysis82.3 Informal Analysis102.4 Odds Ratio11ivTable of ContentsCase 1: When We KnowPjJSCase 2: When PijS are UnknownCase 3: Validation DataResults2.8.1 Results for Case 12.8.2 Results for Case 2Results for Case 3Comparison of Odds Ratios3 Simulation Study for Continuous Exposure3.1 Introduction3.2 Posterior and Prior Distributions3.3 Case 1: When We Know j23.4 Case 2: When We Don’t Know u23.5 Case 3: When We Have Validation Data3.6 Results3.6.1 Results for Case 13.6.2 Results for Case 23.6.3 Results for Case 3 . .3.7 Comparison of Three Approaches47474950535454556066714 QRS Data Study4.1 Naive Approach and Informal Approach4.2 Formal Approach4.3 Results5 Conclusion and Future Work 852.52.62.72.82.92.101314161717243244Variable77798084Bibliography88VList of Tables2.1 2x3 table due to the misclassification and measurementerror 92.2 A 2x 3 table for formal analysis92.3 2x2 table by epidemiologists’ rule102.4 A 2x 2 table for informal analysis112.5 Validation data and main data162.6 True values, posterior means, 95% credible intervalsof r0 and r1. Theseare results from the first simulation study (onedataset simulation) forscenario 1 in case 1202.7 Bias, mean square error (MSE), coverage of 95% CI and the average widthof ro and r1 for scenario 1 case 1. All results are based on100 datasets,and their true values are 0.2 and 0.25 respectively212.8 True values, posterior means,95% credible intervals of r0 and r1. Theseare results from the first simulation study (one datasetsimulation) forscenario 2 in case 1222.9 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of ro andrl for scenario 2 case 1. All results are based on100 datasets, and their true values are 0.7 and0.4 respectively 232.10 True values, posterior means, 95% credible intervalsof r0,r1 and pjs.These are results from the first simulationstudy (one dataset simulation)for scenario 1 in case 126viList of Tables2.11 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of ro and ri for scenario 1 case 2. All results arebased on100 datasets, and their true values are r0 = 0.2,ri = 0.25,(Poo,Poi, P02) =(0.5, 0.3, 0.2) and(pio,pil,pi2)= (0.1,0.3, 0.6) 292.12 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of TO and ri for scenario 1 case 2. All results are based on100 datasets, and their true values arero = 0.7,r1 = 0.4,(poo,poi,po2)(0.5, 0.3, 0.2) and(plo,pll,p12)= (0.1,0.3, 0.6) 302.13 True values, posterior means, 95% credible intervals of r0,r1 andpjjs.These are results from the first simulation study(one dataset simulation)for scenario 1 in case 3. Validation size —200362.14 True values, posterior means, 95% credible intervals of r0,r1and jj5.These are results from the first simulation study (one datasetsimulation)for scenario 2 in case 3. Validation size =100362.15 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of r0 and r1 for scenario 1 case 3. All resultsare based on100 datasets, and their true values are r0 = 0.2, r1 = 0.25,(poo,poi,p02)(0.5, 0.3, 0.2) and(plo,pii,p12)= (0.1,0.3,0.6). Validation data=200. . 372.16 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width ofTO and ri for scenario 2 case 3. All results are based on100 datasets, and their true values are r0 = 0.2, r1 = 0.25,(p00, Poi ,Po2)(0.5, 0.3,0.2) and(pio,pi1,p12)= (0.1,0.3,0.6). Validation data=100. . . 372.17 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of To and Ti for scenario 3 case 3. All results are basedon100 datasets, and their true values are r0 = 0.7,r1 = 0.4,(poo,poi,p02) =(0.5,0.3,0.2) and(pio,pii,p12)= (0.1,0.3,0.6). Validation data=200. . 41viiList of Tables2.18 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of ro and ri for scenario case 3. All resultsare based on100 datasets, and their true values are r0 = 0.7,r10.4,(poo,pol,po2) =(0.5, 0.3, 0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation data=100 . . 412.19 Estimated bias, mean square error (MSE), coverageof 95 % confidenceintervals and average 95%CI width of informal and formallog odds ratiosfor scenario 1 (or 2) in three cases. True log odds ratio is 0.28463.1 True values, posterior means, 95% credible intervalsof1i,A2,30.Theseare res’ults from the first study in case 1583.2 Estimated bias, mean square error (MSE), coverageof 95% CI and theaverage width of, A2,i3o,ifor case 1. All results are based on 100 datasets. 603.3 True values, posterior means, 95% credible intervals of u,A2,/3o, i3iando.2These are results from the first study in case 2. The “true”valuesare:=0,A2=1,8o=—1.5and31=1.5643.4 Estimated bias, mean square error (MSE), coverageof 95 % CI and theaverage width of ,u, A2,i3o, /3iand for case 2. All results are based on100 datasets. The “true” values are: i = 0,X2= 1,/3o = —1.5 and/3i= 1.5 663.5 True values, posterior means, 95% credible intervals of .t, A2,/3o, 3ianda2. These are results from the first study incase 3 693.6 Estimated bias, mean square error (MSE), coverage of 95% CI and theaverage width of u, A2,i3o, i3and a2 for case 3. All results are based on100 datasets with validationsize 50. The “true” values are: = 0, A2 =l,/3zz—1.5 and/31—_zl.5713.7 Average of posterior means and 95% confidence intervals forthe averageposterior means ofj3oandi3for “naive”, “informal” and “formal” approaches. Results are based on 100 samples in case 1, 2 and3 76viiiList of Tables4.1 Estimators, 95% confidence, or credible, intervalsof/3oandf3iby using“naive”, “informal” and “formal” approaches 84ixList of Figures2.1 Traceplots of ro and r1 from MCMC algorithm inscenario 1 case 1. Thetraceplots show the 20000 iterations after 1000 burn-inperiod. The truevalues of ro and r1 are 0.2 and 0.25 respectively192.2 Histogram of 100 posterior means ofro and r in the second simulationstudy for scenario I case 1. The “true” values ofro and r1 are 0.2 and0.25 respectively 212.3 Histogram of 100 posterior means of r0 and r1in the second simulationstudy for scenario 2 case 1. The true values of r0 and r1 are 0.7 and 0.4respectively232.4 Density plots of “true”pjvalues with its corresponding Beta density function. The vertical lines in the graph indicate the “true” values. TheBeta density functions (from left to right) are: Beta(55,45), Beta(30, 70),Beta(15, 85); Beta(10, 80), Beta(25, 80) and Beta(65,35)252.5 Traceplots ofr0,r1 andPij5from MCMG algorithm in scenario 1 case 2.The traceplots show the 50000 iterations after 1000burn-in period. The“true” values are ro = 0.2,ri=O.25,(poo,pol,po2)= (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3, 0.6)272.6 Histogram of 100 posterior means of r0,r1and PijS in the second simulation study for scenario 1 case 2. The “true” valuesare r0 = 0.2, r1 =O.25,(poo,pol,p02)= (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3,0.6). . 28xList of Figures2.7 Histogram of 100 posterior means of r0,r1 and PijS in the second simulation study for scenario 1 case 2. The “true” values arero = 0.7, r1 =O.4,(poo,pol,po2)= (0.5,0.3,0.2) and(Plo,p11,p12)(0.1,0.3,0.6). . . 312.8 Density plots of “true”pj,jvalues with its corresponding Beta density function. The vertical lines in the graph indicate the “true” values. The Betadensity functions (from left to right) are: Beta (1,2), Beta(1,2), Beta(1,2); Beta(1, 2), Beta(1, 2) and Beta(1,2)332.9 Traceplots of ro, ri and PijS from MGMC algorithm in scenario 1 case 3.The traceplots show the 20000 iterations after 1000 burn-in period. The“true” values are r0 = 0.2,r1 = 0.25,(poo,pol,po2)= (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation size =200 342.10 Traceplots of r0,r1 and PijS from MCMC algorithm in scenario 1 case 3.The traceplots show the 20000 iterations after 1000 burn-in period. The“true” values are r0 = 0.2,r1 = 0.25,(poo,pol,p02)= (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation size =100352.11 Histogram of 100 posterior means of r0,r1 and jJS in the second simulation study for scenario 1 case 3. The “true” values are r0 = 0.2, r1 =O.25,(poo,pol,p02)= (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation size =200392.12 Histogram of 100 posterior means of r0,r1 and PijS in the second simulation study for scenario 2 case 3. The “true” values are r0 = 0.2, r1 =O.25,(poo,po1,p02)= (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation size =100 402.13 Histogram of 100 posterior means ofro, r and PuS in the second simulation study for scenario 3 case 3. The “true” values are r0 0.7, r1=O.4,(poo,pol,po2)= (0.5,0.3,0.2) and(Plo,P11,P12)= (0.1,0.3,0.6). Validation size =200 42xiList of Figures2.14 Histogram of 100 posterior meansof ro, ri and PijS in the second simulation study for scenario4case 3. The “true” values are r0 = 0.7, r1 =O.4,(poo,pol,po2)= (0.5,0.3,0.2) and(p1o,pll,p12)= (0.1,0.3,0.6). Validation size =100432.15 Comparisons of log informal odds ratios with logformal odds ratios forscenario 1 in case 1 (upper left), scenario 1 in case 2 (upperright), scenario1 & 2 in case 3 (lower left and right) .The true log odds ratio value is0.285. The line in the each graph plotsif two odds ratios are the same(y=x line). The dash lines represent the“true” value 453.1 Traceplots of/30,/31,X2fromMCMC algorithm in case 1. The trace-plots show the 5000 iterations after 1000 burn-inperiod 563.2 Traceplots of,from MCMC algorithm in case 1. The traceplotsshowthe 5000 iterations after 1000 burn-inperiod 573.3 Histograms of 100 posterior meansfor ,t, A2,i3o, i3in the second study incase 1. The true values are u = 0,A2 = = —1.5 andi3= 1.5 . . 593.4 Density plot of Inverse Gamma distributionwith hyper-parameters: c =200 and /3 = 50. The vertical line is “true”value of the a2 = 0.25 . . . 613.5 Traceplots of/30,/31,X2from MCMCalgorithm in case 2. The trace-plots show the 5000 iterations after 1000burn-in period623.6 Traceplots of 1u, A2 and.2from MCMC algorithmin case 2. The traceplotsshow the 5000 iterations after 1000burn-in period633.7 Histograms of 100 posteriormeans for i,A2,j30,31 and u2in the secondstudy in case 2. The “true”values are: = 0, A2 = 1,i3o= —1.5 and= 1.5653.8 Traceplots of/30,131,X2from MCMCalgorithm in case 3. The traceplots show the 5000 iterations after 1000burn-in period673.9 Traceplots of, A2 and a2 from MCMCalgorithm in case 3. The traceplotsshow the 5000 iterations after .1000 burn-inperiod68xiiList of Figures3.10 Histograms of 100 posterior means for ji, A2,/3o, 3iand 2in the secondstudy in case 3. The validation size is 50. The “true”values are: z =O,A2=1,/30=—1.5 and31=1.5703.11 Pairwise plots of three approaches, “naive”, “informal” and “formal”,inestimating/3oandj3under case 1. The solid line is if“yx “, and thedash lines are the corresponding true values733.12 Pairwise plots of three approaches , “naive”, “informal”and “formal”,inestimating/3iandi3iunder case . The solid line is if‘y= x “, and thedash lines are the corresponding true values743.13 Pairwise plots of three approaches, “naive”, “informal”and “formal”, inestimating/3oand/3iunder case 3. The solid line is if“y= x “, and thedash lines are the corresponding true values754.1 Prior plots of unknown parameters with theirhyper-parameters:N(0,1002),A IG(0.01,0.01),cr2c—’ IG(100,0.1),/30“-i N(0,1002)andN(0,12).The vertical lines are the corresponding “true” values as:= 4.64, A2 0.094, = —0.90, /3 = 0.76 and g2 = 0.00096814.2 Traceplot and posterior density plots of p0000 iterations after 1000burn-inperiod of /3o and/3iwhen applying the MH sampling method 833(111AcknowledgementsFirst of all, I would like to give special thanksto my supervisor, Professor Paul Gustafson,since without his thoughtful support and guidance, I wouldnot able to complete this thesis. It was an honor to work with him and he is the bestsupervisor that one could everask for. I would also thank Professor Lang Wu, foragreeing to be my second reader andalso for his excellent teaching throughout my study period.I express sincere gratitude to Professors John Pekau,Ruben Zamar, Arnaud Doucet,Michael Schuizer, Jiahua Cheri and Matias Salibián-Barrera fortheir constant supportand outstanding teaching.I am grateful to everyone, who makes the departmenta wonderful place to study. Special thanks to Mike Danilov, Ehsan Karim, Derrick Lee,Reman Epstein, Xu Chen , LiuZhong and Xu Wang for their help and friendship, andto Peggy Ng, Viena Tran, ElaineSalameh for their support and understanding ofmy work.Tian ShenVancouver, BC, July 2009xivTo My lovely ParentsxvChapter 1IntroductionIn many research areas, statistical methods are used to analyzethe relationships betweentwo or more variables. For example, in the epidemiology area, statisticalmodels areused to understand or study the relationship between an outcome variableY and anexplanatory variable X. For instance Y can be presence orabsence of heart disease, andX can be presence or absence of smoking, where Yand X are both binary variables, orX can be the blood pressure, which is a continuous variable. Thereare four types ofdesigns that are most often applied, which are the cross-sectionalstudy, cohort study,case-control study and randomized controlled clinical trial. In this thesis,we are focusedon the case-control study, which is also referredto as a retrospective study. We randomlyselect subjects from “case” and “control” groups, thencompare the health outcomes forthe two groups based on selected subjects. The explanatory variableX is often measuredby some instruments. When X is precisely measured, the instrumentis called a goldstandard. However, due to the high cost and lack of suchprecise instrument, X is oftenmeasured imprecisely. If X is a categorical variable, imprecisemeasurements are calledmisclassifications, while if X is a continuous variable, they are called measurementerrors.In this thesis, we are working on both discrete and continuouscases of X.1.1 Misclassification and Measurement ErrorGenerally speaking, misclassification means groupinga subject into a wrong categoryFor example, a person who smoked one pack of cigarette in theday of the experimentmight be accidently grouped as a heavy smoker whilethis person might barely smokein other days. Thus, rather than recording X itself, asurrogate variableX*is often11.2. Overview of Current Available Methodsrecorded instead. The misclassification probability (sayPu)defines the probability ofclassifying a subject into group i while its true status is in groupj.When X is a continuous variable, by the definition of the classical error model in themeasurement error literature, a surrogate variableX*is the linear term of X plus an error term. For example, a recorded patient’s blood pressure might be higher than his/hertrue values due the equipment error. As defined in Chapter 3 in this thesis, X = X+ Z,where Z is the error term andE(ZIX)=0.Measurement error and misclassification can be categorized into differential andnon-differential cases. If the distribution of the surrogate variableX*only depends on thetrue value X but not the health outcome Y, then the mismeasurement is classified asnon-differential. Otherwise, the mismeasurement is categorized as differential.Due to the misclassification and measurement error , usually we only have precisely measured health outcome Y , the surrogate variableX*and some other precisely measuredcovariates U in the data. Since the goal of most studies is to understand the relationshipbetween X and Y, conclusions obtained fromX*and Y instead could be very misleading.Thus, the study of measurement error and misclassification is significantly important.1.2 Overview of Current Available MethodsIn the literature, especially in the biomedical research, many methods were proposedtodeal with misclassification and measurement error. Barron (1977) proposed the matrixmethod that estimate the expectation of cell counts by using their margins, andtheodds ratio can be estimated later on based on the cell counts. By reparameterizing themisclassification, Marshall (1990) presented the inverse matrix method to retrieve theodds ratio. However, Lyles (2002) pointed out that the inverse matrix method is just amaximum likelihood estimate method that corrects the biased odds ratio due to misclassification. The efficiency of the matrix method and inverse matrix method were comparedunder the assumption of differential misclassification (the degree of measurement error isdifferent across different groups) by Morrissey and Spiegelman (1999). Theyconcluded21.2. Overview of Current Available Methodsthat the inverse matrix is more efficient than the matrixmethod; nevertheless, the sensitivity, specificity, and probability of exposure are somekey determinants of the efficiency.Later, other methods like simulation extrapolation (SIMEX)and logistic regression modelare also developed to approach the misclassification problem (Kuchenhoff,Mwalili, andLesaifre, 2006; Skrondal and Rabe-Hesketh, 2004). Withthe improvement of computational capability of computers and enhanced simulationtechniques, the Bayesian analysisbecomes another prospective method to study the misclassification problem.Carroll, Ruppert, Stefanski, and Cainiceanu (2006) grouped methodsthat deal withmeasurement error into functional and structural models,where X is considered as fixedor random with minimum assumptions of distributionsin the functional models whileX is considered as random variables in the structudmodels. Two general methodsused in the functional model category are the regression calibration (Pierceand Kellerer,2004) and SIMEX (Cook and Stefanski, 1994) methods. Carroll,Ruppert, Stefanski,and Cainiceanu (2006) stated that even though these two methods arevery simple, theyare only consistent in some special cases such as linearregression and they have limitedcontributions in reducing the bias caused by the measurementerror. Disregarding somelimitations of the regression calibration and SIMEX methods, they arestill widely usedsince both of them are very easy to implementby using the existing software packages,which reduces the potential difficulties in the analysisset-up part. In the structuralmethod category, the Expectation-Maximization algorithm in thefrequentist perspective and Markov chain Monte Carlo algorithm in the Bayesianperspective are two usefulalgorithms to solve the measurement error problems.Bayesian methods are capable of dealing with biases inducedby both misclassificationand measurement error. One great advantage of Bayesianmethods is that they canfully incorporate the uncertainties of parameters. Thoughfully specified models are often required, this kind of information plus some knowledge ofmismeasurement are often31.3. Bayesian Analysisavailable to medical researchers before they conduct studies. Whenan observed dataset isavailable, the natural existence of prior information makeBayesian analysis an appealingmethod, since inference now can be made throughthe prior and present information. Inthis thesis, we are focus on using the Gibbs sampler andMetropolis-Hastings algorithmin Bayesian methods to solve the misclassification andmeasurement error problems.1.3 Bayesian AnalysisBayesian inference is statistical inference in which data areused to update or to newlyinfer quantities that are observed or wished to learnby using probability model. The“combination” of Markov Chain Monte Carlo (MCMC)algorithm and Bayesian inferences is often used in solving the mismeasurement relatedproblems.1.3.1 Bayes RuleTo understand the Bayesian analysis, it is essentialto understand the fundamental principle of the analysis - Bayes rule. Assume there isa independently and identically distributed dataset y(yl, Y2,...,y)with unknown parameter 8 and following distributionfe.Bayesian analysis is to conclude a parameter 8 basedon the observed data, so wedenote the conditional distribution on the observed dataasf(8 y), which is called theposterior distribution. Meanwhile, the samplingdistribution of the data set is:f(v18)=]fe(y)Then, a joint probability distribution of 6 andy can be defined as:f(8,y) =f(8) xf(y18)41.3. Bayesian Analysiswhere f(8) refers the prior distribution of 8. Applyingthe basic property of conditionaldistribution (Bayes’ rule), the posterior distribution turns to:f(I )— f(8,y) — f(9) xf(yI)(11)Y-f()- f()Since the f(y) is a constant term, independent of theparameter 0, we can express theunnormalized posterior density asf(Iy)cx f(0) xf(yIO).To estimate the parameter 0, we can calculate the expectedvalue of 8 under the posteriordensity (Gelman et al., 2004) by taking the integral:180 dOfOf(O)f(y0)dO0= jf(ff(0)f(yIO)dOIf there is a closed form of the posterior distribution, the estimationof 0 would be veryeasy to carry out. However, sometimes there is no closed formof the posterior density, wehave to use the other numerical methods, such as MarkovChain Monte Carlo algorithms.1.3.2 Prior DistributionAs an “absent” term in classical statistical analysis, the priordistribution plays a majorrole in the Bayesian analysis. There are mainly two typesof prior distributions: informative priors and non-informative priors. A non-informative prior alsocan be calleda “fiat” prior that has a limited impact on the posterior distribution. Researchersusenon-informative priors usually because they don’t have verymuch knowledge about theparameter in the previous study or they don’t want the inferenceto be greatly affectedby some external sources. Moreover, the determination of “flat” prioris not trivial sinceeveryone has a different definition of “fiat”.The informative prior, information that gained fromprevious study ,on the other hand,may provide a strong influence on the posterior distribution.However, the influenceis somehow related to the sample size, number of MCMC iterationsand the form ofthe prior. In most cases, strong priors are not necessarywhen the sample size is big.51.3. Bayesian AnalysisNevertheless, in some cases, the priors are needed to be strong regardless of the samplesize.1.3.3 Markov Chain Monte Carlo AlgorithmThe computation of the posterior distribution is always a problem sincethe density ofthe posterior sometimes is very complicated or even high dimensional.Because of thisproblem, the usage of Bayesian analysis has been very limited. When the MCMC algorithm was first applied in 1990, the limitation of Bayesiananalysis then vanished andBayesian methods becomes increasingly popular.In the following two sections, two particular MCMC algorithms are introduced: Metropolis-Hastings algorithm (Greenberg and Chib, 1995) and Gibbs sampler (Casella and George,1992).Metropolis-Hastings AlgorithmThe Metropolis-Hastings (MH) algorithm is an iteration method for generatinga sequence of samples from a probability distribution when directly samplingis difficult orimpossible. It uses an acceptance/rejection rule to cover the target distribution. Thebasic algorithm follows:1. Randomly choose a starting point02. For theithiterationfrom i=1,2,...,(a) Sample a proposal9*from a jumping distribution at iteration i,p(9*&i_l)(b) Calculate the ratio of densities,f(9*)p(9i_l18*)r=f(9i-1)p(O*19i_1)where f(9) is the target distribution.61.3. Bayesian Analysis(c) Set— 9*with the probability min(r, 1)9i—1otherwiseNote that the acceptance rate shouldn’t be close to 0% or100%, otherwise, the randomwalk would either move too slowly (cover the target populationtoo slowly) or stand stillfor too long (likely to stay in certain regions). Thus,proper adjustments of the jumpingdistribution sometimes are necessary.Gibbs samplerThe Gibbs sampler is a special case of MR algorithm.It is applied when the explicit form of the joint distribution is unknown but the conditional distributionof eachvariable is known. Assume we divide the parameter vector 9 intop components, say9 =(9i, 92, ...,0,,). Each iteration of Gibbs sampler cycles thought all the componentsof9. Each component is drawn based on the values of all others. Thus,the algorithm thatgenerates a Markov chain at iteration i is:1. Randomly set the initial values for all the components0(9°),9(0))2. For theithiterationfrom i=1,2,...,for j=1...p, sample 9 from the distribution ofwhere 9 represents all the components of 0, exceptforO,at their current values:0(i) — 19(i) 9(i) 0(i) 0(i—i) 9(i—1), 1’••’ j—]i j+1 ••• PThough the above two MCMC algorithms looks verysimple, sometime the technicaldetails can be very challenging. In this thesis, we will apply Gibbs sampleralgorithm inChapter 2 and Metropolis-Hastings algorithm in Chapter3.7Chapter 2Simulation Study for CategoricalExposure Variable2.1 IntroductionAssume a researcher is interesting in study the relationship between a binary exposurevariable X and a continuous response variable Y. The exposure variable X is oftencodedas 0 or 1, where X = 0 refers to “unexposed” and X = 1 refers to “exposed” in mostepidemiological situation. Instead of observing X, a surrogate variableX*is measured.Under the non-differential misclassification assumption, X andX*are conditionally independent of Y, and the specificity and sensitivity are used to measure the magnitude ofthe misclassification (Gustafson, 2004). Then, the sensitivity SN= Pr(X*= 1IX= 1)is the probability of correctly classifying a truly “exposed” subject, and the specificitySPPr(X*=0X = 0) is the probability of correctly classifying a truly “unexposed”subject. In the following subsections, we will introduce two approaches to analyze therelationship between discrete exposure variables and health outcome Y.2.2 2x 3 Table—Formal AnalysisThough there are only two conditions of the true exposure X, Yes or No, sometimesthesurrogate variableX*has three conditions instead as: Unlikely, Maybe and Likely, dueto possible instrument error (as displayed in Table 2.1).82.2. 2x 3 Table—Formal AnalysisAssessed ExposureUnlikely Maybe LikelyTrue Exposure NoP00 P01 P02YesPlo p11 P12Table 2.1: 2x 3 table due to the misclassification andmeasurement errorNote that thePijin the table defines the probability of classifying asubject intogroup j while its true status is in group i.If the truly exposed condition, X, is known, thenthe exact relationship (referred asa “true” result) of X and Y is able to be analyzed. However,in reality, X is ofteninaccessible, and researchers only know its surrogate,X*.Even though, researchers areanalyzing the relationship betweenX*and Y, they also tend to conclude it as the relationship between X and Y. It is very dangerousto make such conclusion since it canbe very biased. The first analysis method is termedas a formal analysis: analysis thatacknowledge the existences of 2 x 3 table (Table 2.1) structurein the exposure condition.Specifically, the analysis is carried out based on Table 2.2.ExposureUnlikely Maybe LikelyHealth Outcome controls n00 n01n02cases n10 n11n12Table 2.2: A 2x 3 table for formal analysiswhere nj in the table is the number of subjects that fall inthe condition (e.g. n00 isthe number of subjects that “Unlikely” have the exposurein controls). In the literature,there are fewer studies that involves the formal analysis,which makes it an interestingpoint to study. The second approach is termed as an infonnalanalysis: analysis that tendto ignore the 2 x 3 table structure in the exposure condition.More details of informalanalysis are going to be talked about in the next section.92.3. Informal Analysis2.3 Informal AnalysisWhen we assume the mismeasurement is non-differential, sensitivityand specificity ofX*for X can be used to described the magnitude of the misclassification. The closerthevalues of SN and SP are to one, the less severe the misclassificationis.As stated in Gustafson (2004), when the probability of the exposureis very rare, theeffects of misclassification worsens much more with thedecrease of specificity than thedecreases of sensitivity, which means the impact of low specificity willbe bigger than theimpact of low sensitivity in a further analysis. This attracts particularattention in theepidemiological area for the study of rare disease sinceit implies that when the exposure is very rare, the analysis can be fully misleadingeven with a very small effects ofmisclassification. Thus, when some epidemiologists realize that theyhave the 2x3 tablestructure (as Table 2.1) in hand, they tendto group the “Maybe” and “Unlikely” groupstogether to form a 2x2 table as in Table 2.3: whereoo = P00 +poiandio = P10 + P11.Assessed ExposureINo YesTrue Exposure Nooo qoiYesioqTable 2.3: 2x 2 table by epidemiologists’ ruleThey prefer such grouping so that more subjectsare classified as unexposed, and thisleads an increase of probability that a true negative is correctlyclassified, which meansa large SP value. In such way, a low specificity could be avoidedso that a small effect ofmisclassification won’t lead a huge impact in the further analysis.In the literature, most analyses are constructed based on theignorance of the structure of Table 2.1. Thus, they are based on Table 2.4, and we wouldlike to refer suchanalysis as informal analysis in the rest of the Chapter.102.4. Odds RatioExposureNo YesHealth outcome controlsfloo + fbi fl02casesnib +flhi fli2Table 2.4: A 2x table for informal analysis2.4 Odds RatioSuppose we have two groups of subjects, controls andcases, and denoted the numberof subjects in each group as n0 and n1. Let r0 denotethe prevalence of exposure inthe control group and r1 denote the prevalence of exposurein the case group, i.e. r3 =Pr(X=1Y 0) and r1 Pr(X=1Y = 1). Thus, the odds ratio, which defines therelationship between X and Y is formed as:(2.1)1 —roIn the informal analysis, the odds ratio is usually calculatedas:l2x (noo+ noi)0 —fl02x(flio + flu)where the standard error of log odds ratio is formulatedas:/ 1 1 1 1\i/2se=( +—+ +—\floo +nc no2 10 + flu l2In the formal analysis, there are three different ways tocalculate the odds ratio. Sincethe MCMC algorithm estimates the prevalence in bothcase and control groups (ro andri) at each iteration, they are going to be updated every time.Thus, the first way is tofind the posterior mean of odds ratios, which is tocalculate the odds ratio at each MCMCiteration, and then take the average of them. Thesecond way is to find the posterioraverage of r0,r1 of each iteration, then use formula(2.1) to find the odds ratio. The112.4. Odds Ratiothird way is to find the posterior mean of log odds ratio, whichis to calculate averageof the log of odds ratio at each iteration first and thentransfer it by using exponentialfunction. Mathematically, those three odds ratio can be writtenas:1where i refers to thethiteration, and m is the total number of iterations.im im=—,rj (2.2)--OR2__rp1 —rwherero, ri are calculated from equation (2.2). And also,OR3 =(±1o(Oñ))We are interested in comparing the odds ratios in both analysis. Sincethe odds ratioalways falls into a very skewed distribution, we tend to comparethe informal odds ratioand the formal odds ratio in the log scale, such that thedistribution will be more symmetric. We are going to focus on comparing theORinformal with OR3 in the log scalewith respect to their point estimator and confidence/credible intervalsas(n12x(noo+nol)Nlog(ORjf0T1) = log\fl02X(nio+nii)j122.5. Case 1: When We Know PijSwith its 95% confIdence interval as:Clinformal: (1og(ökiflf.mal) — 1.96 * Se, log(ORjnformaj)+ 1.96 * se).The log formal odds ratio is as:logOR3 = --log(OTh).Note that the 95% credible interval for log OR3 can besimply obtained by finding the2.5% and 97.5% quantiles of all log OR3 that obtained fromeach iteration.In order to evaluate the performance of our proposedformal approach, we conduct simulation studies under three cases as: when the probabilityof classify,Pijare known; whenwe only have prior information onpu;when we have some validation data, i.e. we havesome subjects have both X and X measured.2.5 Case 1: When We KnowpiSIn some cases, researchers may know the probabilitiesof classifying the assessed exposureto true exposure from previous experiments, i.e. thepujs are known. Under this condition,the posterior density is expressed as:f(ro,ri,Xi...Xn,IYi...Yj,X’...X)=JJr(i(1’(1— ro)(1_Xi)(1 —riI(XO)(1—X) I(Xz1)(1—Xj) I(Xz2)(1—Xj)xIIpoop01 p022—-I(X=0)X, I(X;=1)x I(X,=2)Xp11 p123132.6. Case 2: WhenPijS are Unknownwhere I is an indicator function. We assume the prior distribution of r0and r1 areuniform (the same assumption carries in the following twocases), and then we use theGibbs sampling method to update the unknowns, r0 and r1 since their posteriordistributions have familiar distributions that we can recognize, namely Beta distributions. Theposterior distribution of X3 is viewed as:f(XjIro,ri,Xi,...Xj_i,Xj+i,...Xn,Yi...Yj,X...X)where.I(X;=0) I(X;=1) z(x;=2)a = (1 — ro) ‘(1 —ri) ‘p00 p01 p02y, y, I(X=O) I(X=1) I(X,=2)b = r0 r1 p10p11 p12Thus, at each iteration of MCMC algorithm, the probability of getting X3 = 1by giveneverything else “known” is b/(a+ b).2.6 Case 2: When pS are UnknownThough, in the previous section, we studied the case that PijS are availablefrom otherexperiments, in reality, those values are often unknown. The maximum information thatwe have on them are some knowledge of their priors. We choose a Dirichlet distributionas the prior distribution for PijS for two reasons. First, the summation ofPOjand thesummation ofPljare both equal to 1, i.e.P00+poi +po2 = 1andPio +pii +p12= 1.Second, the Dirichlet distribution has the property of being conjugate,which meansthe posterior distribution ofPij5would also come from the Dirichlet distribution family.Thus, by choosing Dirichiet distribution as the prior, the MCMC algorithmis easy tocompute the updated PiJS at each iteration, and the results are also easy tobe interpreted.142.6. Case 2: When PijS are UnknownHence, we have:Poo,Poi,Po2Dir(eoo,coi,c02),P1o,P11,Pi2‘- Dir(cio,cii,c12).Then, the posterior distribution is changed to:f(ro,ri,Xi, ...X.,,poo,poi,po2,pio,pii,p12Yi, ...X)cx—r)(1_Xi)(1_Yj)rXiYi(l—ri)’”ii—-I(x;=o)(1—X3)I(x;=i)(i—x) I(x=2)(i—x)Poi P023I(X=0)X3 I(X=flX3 I(X=2)XP11 P123coo—i Cj—l c02—1 C1IYl Ciii C121XPOOP01 P02* J)10p11 p12cx]Jr(12)(1— ro)_Xi 1_Yi)ri5CiYJ(l ——-(I(X’=O)(1—X,)+coo—1) (I(X=1)(1—X3)+coi—1] (I(X=2)(1—X,)+co2—1)Xjjp00PoiP023(I(XJ=0)X+cio—1) (I(X’=i)X+c —1) (I(X=2)X+c12—1)X p10 p11P12cxXj(1-Yj)(1— r)(1_Xi)(XYj(1—xpooiI(X=0)(1—X)+coo—101ZI(X;=1)(1—X3)+co1—i02I(X=2)(1—X)+co2—1)<p10I(XJ=0)X3+cio—11I(X$=1)X,+cii I(X,=2)X+ci2—iNote that coo,col, c02, do, Cii and d12 are called hyper-parameters, and the specific valuesassigned will be discussed in the results section.152.7. Case 3: Validation Data2.7 Case 3: Validation DataSometimes, the true exposure variable X might be possibleto capture, but it is tooexpensive to get for all the subjects. As a result, only a small proportionof the subjectshave the complete information of X, X and Y, whereasthe majority of the subjectsdon’t have the precisely measured exposure status, X. Table2.5 presents the structureof the validation sample and the incomplete main data.While all counts corresponding toValidation Data Main Datax* x*Unlikely Maybe Likely Unlikely Maybe LikelyY=o x=o Y=ox=1Y=1 X=O Y=1x=1Table 2.5: Validation data and main datalevels of X, Y, X* are fully recorded in the validation data, only countsthat correspondto Y, X are recorded in the main data. We want touse the information from thevalidation data to impute X for the main data and to make inferenceon X and Y.Our new posterior density is like:f(X1Xm,poo,poi,po2,p1o,p11,p12,ro,r1IXm+1,.,.Xn,X,...X,,Y1,...Yn)m j=nflfo’Ixx flf(XIX3)x flf(X3)flf(X)j=1 j m+1fJf(XX) x fJf(X) xf(poo,pol,p02,plo,pll,p12)x f(ro) x f(r1),where j = 1, . . .m is the non-validation data part andj= m + 1, . . .n is the validationdata part.The simulation process for this case does not changea lot regarding the change of theposterior distribution, and the only difference is that “known”X values in the validationdata do not need to be updated, where the “unknown” X values, r0,r1 and PijS are162.8. Resultsupdated the same as in the previous case.2.8 ResultsIn order to gain information about the performance of MCMC algorithmsin all threecases, the MCMC trace-plots, posterior mean, 95% equal-tail credible interval, estimatedbias, estimated mean square error of each unknown parameters are checkedin the following subsections. Moreover, when the prevalence in control and cases aresmall, i.e. r0and r1 are relatively small, the odds ratio of “formal” and “informal” are compared lateron to assess their performance.In all cases, two sets of the prevalence are used as (1):r0 = 0.2,r1 = 0.25 and (2):r0 = 0.7, r1 0.4, and each one is combined with the “true” probabilityof classifyingvalues asP00 O.S,poi = O.3,Po2= 0.2 andpio = = 0.3,pi2= 0.6. Any hyperparameters used in the specific cases will be defined in the latersubsections along withdetailed information of each case scenarios. Moreover, two simulationstudies are performed regarding each scenario in each case. The first one focuseson studying estimationfrom one sample. The second concentrates on studying the sampling distributionsof eachestimator across 100 simulated datasets.2.8.1 Results for Case 1Two scenarios we have in this case are:• Scenario 1:(T0,Ti) = (0.2, 0.25), true odds ratio=1.33,(poo,pol,po2)(0.5, 0.3, 0.2),(pio,pll,p12)= (0.1,0.3, 0.6);• Scenario 2:(rO,Ti) (0.7,0.4), true odds ratio=0.283,(poo,poi,po2) =(0.5,0.3,0.2),(plo,pil,pi2)= (0.1,0.3,0.6).Two simulation studies are carried for each scenario, where studies for thefirst scenariowill be talked about in details as an example. Note that the same proceduresare applied172.8. Resultsto the second scenario as well.In the first study, a dataset of size 4000 (2000 controls and 2000 cases) wasgeneratedbased on the given “true” value, (ro,ri) = (0.2,0.25),(poo,pol,po2)= (0.5,0.3,0.2), and(pio,pii,p12)= (0.1,0.3, 0.6). Then, we use 21000 iterations, where the first 1000 istheburn-in period to update unknown parameters. The choice of burn-in periodis makebased on visual inspection. Figure 2.1 shows the traceplot of MCMC algorithmforro and r1 after the first 1000 burn-in period. It shows that the Markov chains movingsmoothly within the target area.180C’j9 00a2.8. ResultsiterationsFigure 2.1: Traceplots of r0 and r1traceplots show the 20000 iterationsr1 are 0.2 and 0.25 respectively.from MCMC algorithm in scenario 1 case 1. Theafter 1000 burn-in period. The true values ofro andTable 2.6 shows the true values, posterior means and the95% credible intervals for r0and r1. Both 95% credible interval of r0 and r1 coversthe “given” values, which indicatesI. I Iii Ii I .,LItU‘ ii. liii Ii.‘ r,-II’I I I0 5000 10000 15000 200000Co0000 5000 10000 15000 20000iterations192.8. Resultsthat for this particular generated dataset, our approach works well.true value posterior mean 95% CIro 0.2 0.18 (0.15, 0.23)r 0.25 0.27(0.23, 0.31)Table 2.6: True values, posterior means, 95% credible intervals ofro and ri. These areresults from the first simulation study (one dataset simulation) for scenario1 in case 1.The second study will repeats the first study 100 times, which enableus to investigate the sampling distributions of r0 and r1. Figure 2.2 is the histogram of 100 posteriormeans of r0 and r1. It demonstrates that the sampling distribution of r0 andr are approximately normally distributed and centered around the “true” values, majority valuesare around the “true” values.202.8. Results, 0C,041)Ll0I I I I I I0.18 0.22 026 0.30Figure 2.2: Histogram of 100 posterior means ofro and r in the second simulation studyfor scenario 1 case 1. The “true” values ofro and r1 are 0.2 and 0.25 respectively.Table 2.7 confirms our observation by showing the95% credible interval coveragerates are very high and the average lengths of the credibleintervals are very small.Bias MSE Coverage of the 95%CIjAverage 95%CI Widthr0 0.0024 0.0025 91T________0.09r1 0.0028 0.0023 95TööiTable 2.7: Bias, mean square error (MSE), coverageof 95 % CI and the average widthof r0 and r1 for scenario .1 case 1. All results are based on 100 datasets,and their truevalues are 0.2 and 0.25 respectively.Since desirable results are obtained from the statisticalprocedures and stabilized iterto -0 -C.)LI) -010U)0100JCU-00C0,14 0.18 022 026212.8. Resultsations are observed from the traceplot, it is reasonableto conclude that our approachworks well when the prevalence for both control and cases are relativelysmall.Table 2.8 shows the first simulation study result for scenario 2, wherer0 = 0.7 andr1 0.4 andPijvalues remain the same. Again, the result shows that forthe particulartrue value posterior mean 95% CIro 0.7 0.69 (0.63,0.72)r1 0.4 0.41 (0.36, 0.45Table 2.8: True values, posterior means, 95% credibleintervals of ro andri. These areresults from the first simulation study (one dataset simulation) for scenario2 in case 1.generated dataset, we are able to get reasonable estimators.However, in order to knowhow the model works in general, we need to review the results fromthe second simulationstudy.Figure 2.3 and Table 2.9 are histogram and statistical results fromthe second simulationstudy for scenario 2 in case 1. All results from two studies for scenario2 suggest thatour proposed approach works equally well when theprevalences are considerably larger.222.8. Results>%C)CS0C)U-Figure 2.3: Histogram of 100 posterior means ofr0 andr1 in the second simulation studyfor scenario 2 case 1. The true values of r0 and r1 are 0.7and 0.4 respectively.Bias MSEro 0.000068 0.0021r1 0.00074 0.0023Coverage of the 95%CI98_________94Average 95%CI Width0.0850.092Table 2.9: Estimated bias, mean square error (MSE),coverage of 95 % CI and the averagewidth of r0 and r1 for scenario 2 case 1. All results are based on 100datasets, and theirtrue values are 0.7 and 0.4 respectively.0CiCC’)U)CIs,CC’)U,C)CU)0U)U-0CI I I I I0.66 0.70 0.74Cr0I I0.32 0.36 0.40 0.44t I232.8. Results2.8.2 Results for Case 2In this case, there are also two scenarios as:• Scenario 1: (ro,ri) = (0.2, 0.25), true odds ratioz1.33,(poo,pol,po2) =(0.5,0.3,0.2),(plo,pll,p12)= (0.1,0.3,0.6);• Scenario 2 : (ro,ri) = (0.7.0.4), true odds ratio=1.33,(poo,pol,po2) =(0.5,0.3,0.2),(plo,pll,p12)= (0.1,0.3,0.6).Remember that in this case, since PijS are unknown,we need to specify prior distributionsfor them. The hyper-parameters cj are chosen particularlyas (coo, c01,002) = (55,30,15),(dO, c11,012) (10, 25,65) for both scenarios. Those values are chosen so that the priordistribution are centrated nearly around the “true” valuesofPu.Since any single component of a Dirchlet vector has a Beta distribution,we are able to see how concentratedthese priors are by using the Beta distribution with certainhyper-parameters. For example, previous information (prior) states thatP00is from a Beta distribution with shape= 55, 3 = 45. Figure 2.4 displays the the truePijvalues with its corresponding Betadensity function. From the figure, we can see that most“true”pjjvalues are close tothe centre of the the density function (especially withPoiandplo)and the range of thex-axis are pretty narrow, which suggests that the priorsthat we use in this case areconcentrated ones. We use the concentrated priors here isbecause by simulation studies,we realized that the no matter the sample size is large or not,a strong prior is crucial inthis case (as mentioned in Section 1.3.2).Same as in case 1, two simulation studies are carried foreach scenario, where studiesfor the first scenario will be talked about in detailas an example.In the first study for scenario 1, a dataset of size 4000(2000 controls and 2000 cases)was generated based on the given “true” value. Then,we use 51000 iterations (the first1000 is the burn-in period) to update unknown parameters.Figure 2.5 show the traceplot of MCMC algorithm forr0,r1 and Pij5 after the first 1000240.0 0.2 0.42.8. ResultsFigure 2.4: Density plots of “true”Pijvalues with its corresponding Beta density function.The vertical lines in the graph indicate the “true” values. TheBeta density functions(from left to right) are: Beta (55,5), Beta(30, 70), Beta(15,85); Beta(10, 80), Beta(25,80) and Beta(65,35).0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0I ‘I0.0 0.2 0.4 0.6 0.8 1.0I I0.0 0,2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 08 1.0I” I’0.0 02 04 0.6 0.8 1.0Pio PuP12252.8. Resultsburn-in period. Again, the Markov chains move smoothly within target range andnochain is fixed at a particular value. We also observe that the generated sample MarkovChain is somehow more stabilized in some unknown parameters(e.g.poi,pii)than others (e.g. r0 and ri) after the burn-in period.Table 2.10 demonstrates the true value, estimated posterior mean and95% credibleinterval for each unknown parameter from the first study of scenario 1.true value posterior mean 95% CIr0 0.2 0.25 (0.14,0.40)r1 0.25 0.32 (0.18, 0.41P000.5 0.53 (0.46, 0.60oi0.3 0.31 (0.27, 0.35P020.2 0.16 (0.09, 0.23)0.1 0.1 (0.05,oiip0.3 0.25 (0.18, 0.35)pi0.6 0.64 (0.54, 0.73Table 2.10: True values, posterior means, 95% credible intervalsofr0,r1 and pjjs. Theseare results from the first simulation study (one dataset simulation) for scenario 1in case1.Though there are discrepancies between the estimated posterior means of r0 and r1andtheir true values, 95% credible intervals of estimated means do cover the true values.Figure 2.6 and Table 2.11 display the results from thesecond study (100 datasetssimulation) of the first scenario in this case262.8. ResultsL. i1iLj, .,JIr1I.r________________________ ________________________o 10000 30000 50000itera5orrSILI I I I0 10000 30000 50000iterationsI. I. II Ii - iij7Figure 2.5: Traceplots ofr0,r1 and PijS from MCMCalgorithm in scenario 1 case 2. Thetraceplots show the 50000 iterations after 1000 burn-inperiod. The “true” values arero = O.2,r O.25,(poo,pOl,p02) = (0.5,0.3,0.2) and(plo,pll,p12)= (0.1,0.3, 0.6).I I I I .0 10000 30000 00000 0 10000 30000 50000iterationsSiterationsI I I0 10000 30000 50000iterationsI I I0 10000 30000 50000iterations0 10000 30000 50000itera0ons0 10000 30000 00000iterations272.8. ResultsI EfrhII_________IJ1+11hh I0.50 0.52 0.54 0.56 0.28 0.30 0.32 0.34 0.145 0.150 0.155 0.160P00 P01P02rffh1______ _____0.095 0.100 0.105 0.110 0.22 0.24 0.26 0.28 0.30 0.60 0.62 0.640.66Pta Pu P12[oj______I _______0.20 0.25 0.30 0.35 024 0.28 0.32 0.36r1Figure 2.6: Histogram of 100 posterior means ofr0,r1 andPijSin the second simulationstudy for scenario 1 case . The “true” values are r0 = 0.2,r1 0.25,(poe, PO1,P02) =(0.5, 0.3, 0.2) and(plo,pll,p12)= (0.1,0.3, 0.6).282.8. ResultsFrom Figure 2.6, we observe that the histograms of 100 posterior means of1j,T’i,Po,Pi,PjoandP02are shifted to the right from their true values, whichindicates an overestimationof the parameters of interest. Though the graph suggestsa possible unpleasant result,we still need to study those parameters’ corresponding sampling distributionsto find outthe true performance of our model. Table 2.11 showsthe estimated bias, mean squareerror, 95% credible intervals coverages (of the true value) and the averagelength of 10095% credible intervals for each parameter.Bias MSE Coverage of the 95%CI Average95%CI Widthr0 0.062 0.0021 1000.25r1 0.056 0.0019 100 0.240.035 0.00079 1000.14p010.014 0.00088 990.074P02-0.049 0.00023 100 0.14pio0.0022 0.00015 1000.12p -0.039 0.00072 100 0.16p 0.0366 0.00061 100 0.18Table 2.11: Estimated bias, mean square error (MSE),coverage of 95 % CI and theaverage width of ro andrl for scenario 1 case 2. All results are based on 100 datasets, andtheir true values arer0 = 0.2,r1=O.25,(poo,pol,p02)= (0.5, 0.3, 0.2) and(p1o,p1,pi) =(0.1,0.3,0.6).It indicates that the even though the graph showsa potential unpleasant result, theparameters’ posterior distributions still cover their truevalues most of the time. As wediscussed earlier, this case is very sensitive to the choice of priordistributions, and theobservation obtained from Figure 2.6 could just implya not strong enough prior for theparticular dataset. Thus, it’s still reasonable to conclude that the algorithmworks wellfor this case.Figure 2.7 and Table 2.12 are histogram and statisticsresults from the second simulation study for scenario 2 in case 2. We omit theresults of the first study here due toits limitation of interpretation. From both the figure andtable, we notice that no matterwhether the probability of exposure is large or small, theprior distributions are always292.8. Resultsvery important. Weak priors may lead to poor results,and the stronger the prior is thebetter the results would be. This is dissimilar with most Bayesiancases, where when thesample size is large, the choice of the prior becomes insignificant.Future studies couldbe conducted on this case.Bias MSE Coverage of the 95%CI Average95%CI Widthro 0.015 0.0027 100 0.23r1 0.048 0.0025 99 0.23p000.036 0.0017 100 0.15Po0.012 0.0020 960.10-0.04 0.00043 1000.15pm-0.0036 0.00022 100 0.11-0.011 0.0017 970.079p0.015 0.0015 100 0.14Table 2.12: Estimated bias, mean square error (MSE),coverage of 95 % CI and theaverage width ofr0 and r1 for scenario 1 case 2. All results are based on100 datasets, andtheir true values are r0 = 0.7,r1 = 0.4,(poo,pol,p02)= (0.5, 0.3, 0.2) and(plo,pll,p12) =(0.1,0.3,0.6).302.8. Results,.rFffffhiiErfF1111}L0.45 050 055 0.60 0.25 0.30 0.35 0.40 012 0.14 0.16 0.18P00 Po P02‘rll}[fThk00jj1l11LJ1IF[L, I0.090 0095 0.100 0.105 022 026 030 0.34 0.55 0.60 0.65 070Pm Pu P12_______060 0.65 0.70 0.75 0.80 0.35 0.40 045 0,50 0.55Figure 2.7: Histogram of 100 posterior means ofr0,r1 and PijS in the secondsimulationstudy for scenario 1 case . The “true” values are r0 = O.7,r1 = 0.4,(poo,pol,po2) =(0.5,0.3,0.2) and(plo, pll,p12)= (0.1,0.3,0.6).312.9. Results for Case 32.9 Results for Case 3Under this cases, we will have four scenarios as follows:• Scenario 1:(ro,ri) (0.2, 0.25), true oddsratio==1.33,(Poo,pol,po2)(O.S,O.3,O.2),(plo,pll,p12)= (0.1,0.3,0.6),(COO,Col,c02)= (1,1,1),(clQ,Cll,c12) (1,1,1) ,validation size=200;• Scenario 2: (ro,ri) = (0.2.0.25), true odds ratio=1.33,(poo,pol,po2)(O.S,O.3,0.2),(plO,pll,p12) = (0.1,0.3,0.6), (coo,col,c02)= (1,1,1),(clo,cll,c12)= (1,1,1), validationsize=400;• Scenario 3: (ro,ri) = (0.7,0.4), true odds ratio0.28,(poo,pol,p02)(O.S,O.3,O.2),(P1o,P11,P12)= (0.1,0.3,0.6),(COO,cOl,C02) (1,1,1),(c10,c11,c12) = (1,1, 1), validation sizez200;• Scenario 4: (ro,ri) = (0.7.0.4), true odds ratio=0.28,(poo,pol,po2)= (0.5,0.3,0.2),(plo,p11,p12)= (0.1,0.3,0.6), (coo,coi,c02)= (1,1,1),(cio,cii,c12)= (1,1,1) ,validation size=100.Notice that in scenario 1 and scenario 3, the validation size is 200 comparedto scenario 2and scenario 4 (validation size is 100). As in case 2, let’s plot the truevalues of pjjs alongwith their prior density functions. From Figure 2.8, wecan see that now the range ofthe x-axis (possible generated values) becomes wider and the “true”values are not thatclose the to the center of the density functions. This directlyimplies that these priorsare flatter than the one we chose in case 2. We assign flat priors in thiscase because webelieve valuable information could be obtained from the validationdata.Once more, two simulation studies are carried for each of the scenario,and only the firstof the two scenario will be talked about in details. Figure 2.9and Figure 2.10 show thetraceplots of MCMC algorithm for r0,r1 and jjS afterthe first 1000 burn-in period forscenario 1 and 2.322.9. Results for Case 3I I•I I I I00 0.2 0.4 0.6 0.8 1.0I I N0.0 0.2 0.4 0.6 0.8 1.0P12Figure 2.8: Density plots of “true”pj,jvalues with its corresponding Beta density function.The vertical lines in the graph indicate the “true” values. TheBeta density functions(from left to right) are: Beta (1,2,), Beta(1,2), Beta(1, 2); Beta(1, 2), Beta(1, 2) andBeta(1, 2,).00.0 0.2 0.4 0.6 0.8 1.0P20N0.0 0.2 6.4 0.6 0.8 1.0[IIJ0.0 0.2 0.4 0.6 0.8 1.0PloPo,N00 0.2 0.4 0.6 0.8 1.0N.P200332.9. Results for Case 3‘I0 5000 15000terabons:1,1I0 5000 150000 5000 15000 0 5000 15000ilera0onoII Ir.I II IIliiiFigure 2.9: Traceplots of r0,r1 and PijS from MCMCalgorithm in scenario 1 case 3.The traceplots show the 20000 iterationsafter 1000 burn-in period. The “true” valuesare r0 0.2,r1=O.25,(pOO,pOl,p02)= (0.5,0.3, 0.2) and(plo,pn,p12)= (0.1,0.3,0.6).Validation size =200.We can see that for the generated sample Markov Chainare somehow more stablein this case than case 2. Table 2.13 and 2.14 demonstratesthe true value, estimatedposterior mean and 95% credible interval for each unknownparameter from the firsthi .1., IIr.J_JIIII II0 5000 10000ileraSons—1 ii i_IrI—r,i0 5000 15000iler000rrO‘I IIIlI‘l0 5000 15000 0 5000 15000ilera5005 ileradero342.9. Results for Case 30 5000 15000 0 5000 15000iterations iterations-I II. ..ii_...ir’i’0 5000 15000iterations0 5000 15000iterationsJ“F‘IiT0 5000 15000iterations0 5000 15000iterations0 5000 15000 0 5000 15000iterations iterationsFigure 2.10: Traceplots of r0,r1 and PiJS from MGMCalgorithm in scenario 1 case 3.The traceplots show the 20000 iterations after 1000 burn-inperiod. The “true” valuesare r0 0.2,r1=0.25,(Poo,pol,po2)= (0.5,0.3,0.2) and(p1o,pll,p12)= (0.1,0.3,0.6).Validation size =rlOO.352.9. Results for Case 3study of scenario 1 and 2.true value posterior mean 95% CIro 0.2 0.18 (0.13,0.24)r1 0.25 0.24 (0.19, 0.30)P000.5 0.49 (0.45, 0.52P010.3 0.31 (0.28, 0.34P020.2 0.20 (0.16, 0.24pio0.1 0.07 (0.00 0.13)pn0.3 0.27 (0.16, 0.37)pj 0.6 0.66 (0.56, 0.77)Table 2.13: True values, posterior means, 95% credibleintervals of ro,ri and PjjS. Theseare results from the first simulation study (one dataset simulation)for scenario 1 in case3. Validation size =200.true value posterior mean95% CIr0 0.2 0.22 (0.13,0.31)r1 0.25 0.27 (0.18, 0.37)poo0.5 0.53 (0.46, 0.59)oi0.3 0.25 (0.19, 0.31)P020.2 0.22 (0.19, 0.25)Plo0.1 0.07 (0, 0.13P110.3 0.42 (0.29, 0.54pr0.6 0.51 (0.42, 0.61Table 2.14: True values, posterior means, 95% credible intervalsofr0,r1 andPjS.Theseare results from the first simulation study (one dataset simulation)for scenario 2 in case3. Validation size =100.From these one sample studies, we observe that the approachworks very well with flatpriors. Though the observation now is only based on one samplestudy result, it is verifiedby Tables 2.15 and 2.16 from the second simulation study (sampling distributionstudybased on 100 datasets).362.9. Results for Case 3Bias MSE Coverage of the 95%CI Average95%CI Widthr0 0.0047 0.0032 93 0.12r1 0.0024 0.0029 98 0.12P000.0017 0.0020 950.07pç -0.00045 0.0015 96 0.067p -0.0013 0.0015 98 0.071pio 0.0090 0.0036 97 0.15pu-0.01 0.0045 97 0.20i:i:0.001 0.0039 99 0.19Table 2.15: Estimated bias, mean square error (MSE),coverage of 95 % CI and theaverage width ofro and r1 for scenario 1 case 3. All results are based on 100 datasets, andtheir true values arero = 0.2,ri = 0.25,(poo,pol,p02)= (0.5, 0.3, 0.2) and(plO,pfl,p12) =(0.1,0.3,0.6). Validation data=200.Bias MSE Coverage of the 95%CIAverage 95%CI Widthr0 0.0066 0.0039 96 0.16r1 0.0027 0.0038 96 0.16poo 0.00050 0.0022 94 0.093poi0.0049 0.0023 940.086po-0.0054 0.0021 99 0.09pj 0.0090 0.0043 96 0.18p-0.021 0.0068 930.25p0.012 0.0049 100 0.23Table 2.16: Estimated bias, mean square error (MSE),coverage of 95 % CI and theaverage width ofr0 and r1 for scenario 2 case 3. All results arebased on 100 datasets, andtheirtrue values arer0 0.2,r1 = 0.25,(poo,pol,po2)= (0.5,0.3,0.2) and(pio,p11,p12)=(0.1,0.3,0.6). Validation data—100.372.9. Results for Case 3Figure 2.11 and Figure 2.12 show the histograms of sampling distributionsfor eachunknown parameter in scenario 1 and 2. Results from theabove two tables and figuresindicate that when there are some validation data, the formal approach performsequallywell (compare with the case 2) though the prior information is weak. Even thoughintuitively, we may think that the larger validation sizewould have better results thanthe smaller size, by comparing Table 2.15 and Table 2.16, itis hard to conclude thatthere is significant difference in the results obtained from the scenario 1 and scenario2. One possible explanation is that by increase the validationsize from 100 to 200, thealgorithm is only able to gain limited “valuable” information. This immediatelyrisesa question that whether there is a cut off point that we are able to get the maximumbenefit, i.e. fewer validation data and enough informationto obtain a good estimation.This could be an interesting point to study later on.382.9. Results for Case 3_______CLlb1.‘:ALL:J___________ ___________ __________00I I I I I I I I I II I I I0.46 0.50 0.54 0.24 028 0.320.14 0.18 0.22P00 P01 P010)JT[1l}L.3.._______ ______I I I I I I II0.05 0.10 0.15 0.20 0.15 0.25 0.35 0.45 0,50 0.550.60 0.65 0.70Pio Pi,CI:r__[1lI1i1_000.10 0.15 020 0.25 0.30 020 0.25 0.30r0Figure 2.11: Histogram of 100 posterior meansof ro, r1 andPij8in the second simulationstudy for scenarzo 1 case 3. The “true” values are r0 0.2,r1= 0.25,(poo,pol,p02) =(0.5,0.3,0.2) and(Plo,P11,P12)(0.1,0.3,0.6). Validation size =200.392.9. Results for Case 3rf11}i!‘‘J{{}frflaT,1LitflThL0.46 0.50 0.54 0.26 0.30 0.34 0.14 0.16 022P00 Poi P01‘!;rfffi}1jTrm1EiriTTfbL.,‘r1LHL,,005 0.10 0.15 0.20 0.10 0.20 0.30 0.40 0.500.60 0.70 0.50Pio Pu Pi0,10 0.15 0.20 0.25 0.30 0.20 0.25 0,30 0,35r0Figure 2.12: Histogram of 100 posterior means of ro,r1 and PijS in the second simulationstudy for scenario 2 case 3. The “true” values are r0 = 0.2,rj= 0.25,(poo,pol,p02) =(0.5, 0.3, 0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation size =100.402.9. Results for Case 3The following tables and figures show the results of scenario3 and scenario 4 wherethe prevalence rates are relatively larger(rO = 0.7 and r1 = 0.4).Bias MSE Coverage of the 959CI Average95%CI Widthr0 0.0033 0.0031 97 0.13r 0.0064 0.0039 92 0.14pj 0.0042 0.0033 94 0.11P010.00019 0.0025 940.09P02-0.0044 0.0029 960.11p 0.0001 0.002 95 0.078pjj -0.0012 0.0020 93 0.075p 0.0011 0.0023 96 0.093Table 2.17: Estimated bias, mean square error (MSE), coverage of95 % CI and theaverage width of ro and Ti for scenario 3 case 3. All results arebased on 100 datasets, andtheir true values are r0 = 0.7,r1=O.4,(poo,po1,p02)(0.5,0.3,0.2) and(plo,pll,pi2) =(0.1,0.3,0.6). Validation data=200.Bias MSE Coverage of the 95%CI Average95%CI Widthr0 0.0041 0.0038 97 0.19r1 0.0062 0.0047 96 0.19pyj 0.0097 0.0037 96 0.14pr -0.0030 0.0026 96 0.10P02-0.0067 0.0034 96 0.14pio-0.0006 0.0021 99 0.10pH-0.0011 0.0025 90 0.082PH0.0017 0.0028 970.11Table 2.18: Estimated bias, mean square error (MSE),coverage of 95 % CI and theaverage width of ro andTi for scenario4case 3. All results are based on 100 datasets, andtheir true values are r0 = 0.7,r1=O.,(pOO,p0i,p02)= (0.5, 0.3,0.2) and(plo,pn,pi2) =(0.1,0.3,0.6). Validation data=100412.9. Results for Case 301__________ _______co_IFti r}Lt:1___-Wfl______0________01M liii LI I I I I I I I I I I I0.45 0.50 0.55 0.60 0.22 0.26 0.30 0.34 0.15 0.20 0.25Pa PoiP01C01U,01rf{fFIITL,‘Eli—fl1U__ELrHWH1Hi___at,__________ ______I I I I I I P I I I I I I I I I0.04 0.08 0.12 0.26 0.30 0.34 0.54 0.58 0.62 0.66Plo PuLa_I I I0_____________0]_________ _________0.65 0.70 0.75 0.80 0.35 040 0.45 0.50r0Figure 2.13: Histogram of 100 posterior means of ro, r1and PijS in the second simulationstudy for scenario 3 case 3. The “true” values are r0 = 0.7,r1= 0.4,(poo,pol,p02) =(0.5, 0.3, 0.2) and(plo,pll,p12)= (0.1,0.3,0.6). Validation size =l00.422.9. Results for Case 30C’,cof I ElOhI I I Io &1 FlnH-,I-fff-1___ ___:___wH_________ __________ _________I I I I I I I I I0.45 0.50 0.55 0.60 0.22 0.26 0.30 0.34 0.10 0.15 0.20 0.25Poo Pot Pot0TrTTiI1ITh_ __0Ij_Ha w oj_________I I I I I I I I I I I0.04 0.08 0.12 0.20 0.25 0.30 0.35 0.55 0.60 0.65 0.70Pio Pit P1200C01 C’,,-lo__-I I I I I I rrriiS______H______Oh]__________ ___________0.60 0.65 0.70 0.75 0.80 0.20 0.30 0.40 0.50Figure 2.14: Histogram of 100 posterior means ofr0,r1 andPijSin the second simulationstudy for scenario4case 3. The “true” values are r0 — 0.7, r1 = 0.4,(Poo,Poi, P02) =(0.5,0.3,0.2) and(plo,p11,p12)= (0.1,0.3,0.6). Validation size —100432.10. Comparison of Odds RatiosFrom the above results, we are able to conclude that our formal approach works wellwhen the prevalence rate is relatively high, and again it is hardto conclude that morevalidation data will help in getting more precise results.2.10 Comparison of Odds RatiosAs we talked about previously, it is interesting to compare odds ratios estimatedby formaland informal approaches, to see which one tends to be closer to the true vales. Since inthe informal approach, we tend to group two categories fromthe exposure “Unlikely”and “Maybe” together, only when the probability of exposure is rare, we wouldonlycompare odds ratios in scenarios thatro = 0.2 and r1 = 0.25 in each case, i.e. scenario1 in case 1 and 2, scenario 1, 2 in case 3.44Id02.10. Comparison of Odds RatiosFigure 2.15: Comparisons of log informalodds ratios with log formal odds ratios forscenario 1 in case 1 (upper left), scenario .1 in case 2 (upper right), scenario 1 812 incase S (lower left and right) . The true log odds ratio value is0.285. The line in theeach graph plots if two odds ratios are the same (y=x line). Thedash lines represent the“true” value.0—0.2 0.0 0.2 0.4 0.8 0.8Iog(O0)--.o--o—0.2 0.0 0.2 0.4 0.6 0.8Iog(OR)/0//—0.2 00 02 0.4 0.6 0.8—02 0.0 0.2 0.4 0.6 0.8Iog(OR)452.10. Comparison of Odds RatiosFigure 2.15 suggests that the informal odds ratios tendto underestimate the truevalues, where the formal odds ratios sometimes overestimates the truevalue. For mostof generated datasets, the informal odds ratio is always less thanthe formal odds ratios,since the majority of the plots are below the “y=x” line. The followingtable illustratesthe comparisons in details.Bias MSE Coy, of 95%CI Ave. 95%CI WidthScenario 1 Case 1ORinformal -0.19 0.0078 21 0.27OR3 0.008 0.020 93 0.75Scenario 1 Case 2ORinformal 0.19f0.0072 23 0.30OR3 -0.055f0.016 93 0.66Scenario 1 Case 3ORinformal 0.19 0.0072 23 0.27OR3 0.00027 0.018 930.70Scenario 2 Case 3ORinformal -0.18 0.0068 27 0.27OR3 -0.0043 0.018 97 0.76Table 2.19: Estimated bias, mean square error (MSE), coverageof 95 % confidenceintervals and average 95% CI width of informal and formallog odds ratios for scenario 1(or 2) in three cases. True log odds ratio is 0.28.Table 2.19 indicates that the formal analysis produces more preciseestimation of logodds ratio. It has a small bias, great coverage rate, anda reasonable average of 95% CIwidth, where, on the other hand, the informal approachgives larger bias and unexpectedsmall coverage rate.Thus, we are able to conclude that the formal approachgenerally does a better job thanthe informal approach in estimating the odds ratio. Researchers whoapply the informalapproach may need to take serious consideration of the measurementerror, otherwise,the results might be very biased.46Chapter 3Simulation Study for ContinuousExposure Variable3.1 IntroductionSuppose a researcher is interested in investigating the relationship betweena healthoutcome, Y, and a continuous variable, X, and then concluding thatE(YIX) = /3o +iXfor unknown parametersoandi3i.Nevertheless, when the health outcome is measuredprecisely, a noisy measurement X is often obtained instead of X. If theresearcher doesnot realize the existence of measurement error, or decideto ignore it, then his conclusionabout Y and X could be biased. Thus, it is useful to study the impactof the mismeasuredcovariate X.Let’s assume a precisely measured predictor X has a normal distribution withmean i andvariance)2,while its mismeasured surrogate X arises from an additive measurementerror, which is non-differential, unbiased and normally distributed. Thus, Xis normallydistributed with mean X and variance a2. Moreover, since the measurementerror is nondifferential, which meansX*and Y are conditional independent of Y, the distributionof(X*Ix,Y) and the distribution of(X*IX)are identical. The joint distribution of X, X’and Y can then be viewed as:f(X*,Y,X)= f(X*IX,Y) xf(YIX)x f(X) (3.1)f(X*IX)xf(YIX)x f(X).473.1. IntroductionThe first term in equation (3.1) is called the measurement model, whichis the conditionaldensity ofX*given X and Y. This defines that under the influenceof Y, the surrogateX’ arises from the true variable X in a particular way. The second term is called theresponse model, which explains the relationship between the true explanatoryvariableXand the response Y. The last term is called the exposure model in epidemiologicalapplications (Gustafson, 2004).Usually, specific distributions, which involve some unknown parameterswill be assumedin equation 3.1, and to make inferences about theX*,X and Y relationship. Since inreality, the true explanatory variable X is unobserved, the likelihood functionof X andY is formed asf(X*,Y) ff(x*,x,Y)dx(3.2)= ff(X*IX,Y)x f(YX) x f(X)dX.Though in some cases, equation 3.2 is easy to evaluate, in othercases, big problems couldarise when the integral does not have a closed form. Often,a Bayesian MCMC analysiswill be used in such condition since one advantage of Bayesianapproach is that the likelihood function is not necessary expressed in explicit form. Dempster, Lairdd,and Rubin(1977) proposed the EM algorithm to solve the implicitly problem in a non-Bayesianway,however, in this paper, we will stay with the Bayesian MCMC methods.Researchers often want to compare the health outcome Y withintwo groups, thus, theyoften dichotomize the continuous variable X into two or more categories.Though, Royston, Altman, and Sauerbrei (2006) pointed out a considerable disadvantageof the dichotomization, it is still very common in the literature (MacCallum,Zhang, Preacher,and Rucker, 2002). In this paper, we dichotomize X into two groupswith the rule ifX > c, the subject is truly exposed, otherwise, the subject is notexposed. Note thatin reality, the value c is often decided from a previousstudy or chosen by a health expert.483.2. Posterior and Prior DistributionsThe relationship of the health outcome and predictor variable is often estimatedbyobtaining the coefficients from a linear regression model. Since the healthoutcome, Y, isa binary variable here, logistic regression appeals as a suitable model. General speaking,there are three approaches to estimate the coefficients. An “naive” approachwould dichotomize X’ with respect to c, where an “informal” approach dichotomizethe surrogatevariableX*according to c (a threshold not necessarily the sameas c). In reality, thetrue predictor variable X is unobserved, however, in the “formal” approach(discuss inthe paper), it is pretended to be known and be dichotomized with respectto c and fitthe model afterward. The choice of the threshold, c, is somehow arbitrary,however, inorder to keep a high specificity (as in discrete case), some epidemiologistswould intendto choosecto be bigger than the true c value, such thatPr(X*> cIX < c)is verysmall.Notice that researchers, who use the “naive” approach are often not awareof the measurement error or intend to ignore it, while people whouse “informal” or “formal” approachdo acknowledge existence of the measurement error and try to find out asolution to theproblem. Results from all three approach will be compared later in theChapter.3.2 Posterior and Prior DistributionsIn this paper, the constituent models are studied based on normaldistributions. Specifically speaking, we assume the measurement model,(X*Ix,Y), to be a normal distribution with mean X and variance2•Since the measurement error is non-differential,wehaveXX N(X, a-2)The exposure model is also assumed as a normal distributionwith unknown parametersj and i.e.X N(2).493.3. Case 1: When We Know o-2Prentice and Pyke (1979) pointed out that the odds ratios are equivalentwhen bothprospective and retrospective logistic model are appliedto the case-control data, thuswe would like to assume the response modelYIX follows a logistic regression, which islogitP’r(Y= lix) /3 +/311(X > c). By easy transformation, the response modelturnto:Pr(Y=liX)=1 +eI+11I(X>c)Note that all parameters 2,,)2,/3oand/3iare unknown, and proper prior distributions might be needed in order to proceed. Meanwhile, we would liketo assume theindependence of all prior distributions, so thatf(u2,,A,o,th) = f(u2)xf()x f(A2)xf()xf(i).Specific prior distributions will be assigned later on.In this chapter, we focus on studying three casesto demonstrate the performance ofthe “formal approach” : when we have some knowledge the noise of thetrue exposurevalue from previous study, i.e. is known; when we only have some priorinformationabout the noise term, i.e. 2 is unknown but we have some informationthere; when wehave some validation data, i.e. we have some data on xalong withX*and Y for somesubjects.3.3 Case 1: When We Know 2Sometimes, researchers might have some knowledge about the noiseof the true explanatory variable from previous study results, then the posterior density functionis(3.3)fJf(IXjxflf(XIX)< JJf(x3)< f(/3) x f(3) xf()x f(A2).503.3. Case 1: When We Know o2In order to proceed, we need to specify prior distributions forunknown parametersu, A2, and To simplify the MCMC algorithm later on, it is convenient toassign the normal distribution for,/3o, i3and Inverse Gamma distribution for A2 as theirprior distributions. Thus, we haveN(0,d);J3 ‘s.’ N(0,d);N(0,d);A2 IG(d4,d5).where the choice of hyper-parameters d, d, d, d4 and d5 determinehow flat or concentrated a prior could be. Here, we choose d = d =4= 1002and d4 = d5 = 0.01 sothat have flatter priors for unknown parameters ,A2,/3oand ,6.The posterior density function in (3.3) now turns to:(3.4)flf(YjIXj)xfJf(X;IX)x ]jf(X3)xfCo)x f(j3) x f(L) x f(A2)1___________1Z(X)2cc x—e22x—e2Aflj1 +x -_ei x xdgxA2_4+1)e_d5/A2d1 d2 d3 F(d4)We chose the normal distribution and Inverse Gamma distribution fort and A2 as priordistributions, since they have the property of beingconjugate. A conjugate prior meansthe posterior distribution of ,u and A2 would come from the samedistribution family asthe prior distribution. In particular, a simple Gibbs sampler algorithmcan be used togenerate a sample of from the posterior distribution of t, whichis a normal distribution. Similarly, we can use Gibbs sampler algorithm to generate a sampleof A2 from itsposterior distribution, the Inverse Gamma distribution.513.3. Case 1: When We Know a2Unlike withiand A2, the posterior distributions of3o,j3 and X do not have the sameform as any other familiar distributions that we recognize . Thus, theGibbs samplerdoes not work for updating them and we need to use the Metropolics - Hastingalgorithminstead. As introduced in Chapter 1, this algorithm isbased on the accept/ reject rule.We would like to avoid the acceptance rate being extreme, such as 100%or 0%. The keyto decide the rate is the jump distribution of the parameter.Let’s takej3ofor example.Suppose we are atjthiteration right now and after updating i and A2, we wantto updatethethvalue of!o.We would calculate the joint density aswherefis the joint density as in equation (3.4). Then,we would assign a jump size fori3o,such that=+ t, where t is from the normal distribution with mean 0and variance k2 and we would again calculate the new jointdensity value as:Now, we will pick the new updated j3 as:cond)with the probability min(b/a, 1),3’ otherwise.A similar procedure is applied fori3and X. The individual jump size often followsa normal distribution with mean 0 and variance k2. Note that thejump size for eachestimated parameter may vary in next two cases.523.4. Case 2: When We Don’t Know23.4 Case 2: When We Don’t KnowThough, in the previous section we talked about knowing the measurement errorvarianceof the true X from other studies, in most situation, we do not know the exactvalue ofa2 but rather have a prior distribution for it. Then, the new posterior density turnsto:(3.5)= flf(IXj)xflf(XIX)xllf(X3)xfC8o)x f(j3) x f(fL) xf().2)xf(a2)Similarly for other parameters, we need specify a priordistribution for a2. Again, because of the conjugately property of Inverse Gamma distribution, we would liketo assignInverse Gamma distribution with shape parameter d6 andscale parameter d7 as theprior distribution of a2. Note that d6 and d7 are hyper-parameters. The choice ofpriordistributions for other unknown parameters are the same as incase 1 and all other hyperparameters would be assigned the same values.Now the joint density becomes:f(X1,X23,...,X,3o,31jt,A2,aIY,Y, ...,X)= flfIXxflf(XIX)x flf(X) xf()xfCdi)xfCu)xf(A2)x f(a2)___________1—(X1—)2x—e22x—e2A21131 +e130+131’()Cj>’)jflx±e xx xdg4x)2_(d4+1)e_d/A2d3 F(d4)xd6xa2_6+1)e_d7/2P(d6)533.5. Case 3: When We Have Validation DataWe are able to update a2 by Gibbs sampling since its posterior distribution is knownas Inverse Gamma distribution, and hyper-parameters d6 and d7 are specified later inthe result section.3.5 Case 3: When We Have Validation DataSimilarly as the validation case in the discrete case, here we assume there is a smallproportion of data with complete information on X,X*and Y, whereas the majority ofthe data do not have the precise measurement of X but have the surrogate variableX*instead. Thus, unlike the equation (3.3) and (3.5) in case 1 and case 2, the joint densityhere becomes:f(Xl,X2,X3,...,Xm,/3O,I3ljL,A,UIXm+l,Xm+,...,Xn,Y,Y2,...,* * * *n’1’ 2’m n= flfo’IxjxHfX;1X3)x flf(X3)IIf(IX)j=1 j j j=m+1flf(XIX3)x ]Jf(X3)x f(j3) xfC8i)xf(p)x f(.\2)x f(a2),where the first j = 1, . . .m are the non-validation data and the rest is the validation data.Though the joint density changes, there are not many changes regarding the simulationprocess. The only difference is we do not update the “known” X values in the simulations.We used Gibbs sampler to update,A2 and a2, and the Metropolis - Hastings Algorithmis used to update theo, i3iand the “unknown” X values.3.6 ResultsIn the following subsections, traceplots of MCMC algorithms in all three cases are checkedalong with some statistics for each unknown parameter, such as the posterior mean, 95%543.6. Resultsequal-tail credible interval, bias, and estimated mean square error. In this way, we areable to gain some information about how well the MCMC algorithms are working for particular models and randomly generated datasets. Moreover, comparisons of estimatedlogistic regression coefficients are made among all three approaches, “formai”, “naive”and “informal”.In all three cases, the true unknown (common) parameters areset to be: ,u 0, A2 =1,i3o= —1.5, and /3 1.5, with the dichotomization value c = 1. Other choiceofthe hyper-parameters as well as the jump size are subject to different cases. Moreover,two simulation studies are performed for each case. The first concentrates on gatheringinformation on estimates from one sample. The second focuses on studying the samplingdistributions of each estimator across 100 simulated datasets.3.6.1 Results for Case 1In the first study, a dataset of size 500 was generated based ona choice of given parametervalues. Then, we use 6000 iterations (the first 1000 iterations are burn-in period)ofthe MCMC algorithm to estimate unknown parameters. Note that the jump sizesforo,/3iand X in the Metropolis-Hasting algorithm are chosen to be 0.15, 0.75, and 1respectively. Figure 3.1 and Figure 3.2 show the traceplots of the MCMCalgorithmoutputs for unknown parameters in case 1. These traceplots only plots the iterationsafter the burn-in period. We can see that the Markov Chain is somehow stabilizedafterthe burn-in period, and the Markov chain move thoroughly within the target range.553.6. Results;;IiII.hII_IoI‘-1I I0 1000 2000 3000 4000 5000iterations0 1000 2000 3000 4000 5000iterationsFigure 3.1: Traceplots of X1,X2 from MCMC algorithm incase 1. The traeeplotsshow the 5000 iterations after 1000 burn-in period.ciC.)cic’JcCC’J><Ii iiCCs0CCc I i’LUU)iILiI.hI I11110 1000 2000 3000 4000 5000iterations0 1000 2000 3000 4000 5000iterations563.6. Resultsc’J0I I I0 1000 3000iterations5000 0 1000 3000 5000iterationsFigure 3.2: Traceplots of u, A2 from MCMG algorithm in ease 1. Thetraceplots show the5000 iterations after 1000 burn-in period.iii I000IF1000573.6. ResultsTable 3.1 shows the true values, posterior means and the95% credible intervals foreach of the unknown parameters estimated from the data.true value posterior mean 95% CI0 -0.013 (-0.112, 0.086)5 1 1.014(0.86, 1.17),3 -1.5 -1.48 (-1.73, -1.21)1.5 1.48(0.79, 2.13)Table 3.1: True values, posterior means, 95% credible intervals of .i,A2,3o,61.Theseare results from the first study in case 1.From the table, we can see that the 95% credible interval of each unknownparameteractually covers the corresponding true value and the posterior meanis very close to thecorresponding true value.The second study just involves repeating the firststudy 100 times and the samplingdistribution of each estimator is studied. Figure 3.3 is the histogram ofthe posteriormeans for the 100 samples in case 1.58ccIIHC;’.- Cb 1%Frequencyo5102030IIIIFrequency0102030II0 01 0 u p 0 01 p. 01-to -to01Frequencyp 01051015200510152025Frequency0 •0 0 b0 1)13.6. ResultsFrom the figure, we can see that the sampling distributions of j2 andi3iare approximately normally distributed and centered at their true values, whereasthe samplingdistribution for is a little right skewed and is somehow left skewed.Table 3.2 summarizes each parameter estimator as:Bias MSE Coverage of the 95%CI Average 95%C[Width,u 0.0023 0.0050 95 0.19-zO.0096 0.0080 98 0.31/3o-0.065 0.013 93 0.54/3 0.051 0.038 94 1.42Table 3.2: Estimated bias, mean square error (MSE), coverageof 95% CI and the averagewidth ofpA2,301for case 1. All results are based on 100 datasets.The above table shows that the biases and estimated mean square error(MSE) for eachunknown parameter are quite small, especially for fL and)2•Furthermore, the averagewidths, out of the 100 runs, of the credible intervals for /2 and)2are pretty small. However, the wide average widths for/3oand j3 suggest that there is more variation amongthe 100 estimated andth.Also, out of the 100 times, the 95% credible intervals cover(CI) cover the true and)2value 96 times, and cover the true/3oand 94 times, whichsuggests a good overall performance of the formal approach.3.6.2 Results for Case 2In this case, 2, the measurement error variance, is also estimated withother parametersand the “true value” is chosen as 0.25. The hyper-parameters of theInverse Gammadistribution of o2 are specified as 200 and 50 respectively. Thespecific choice of thehyper-parameters results in a concentrated prior distribution. From Figure3.4, we cansee that this prior has a relatively narrow range and the centre of thedistribution is closeto the “true” value.60>a)3.6. Resultsdensity plot0c’JU)0U)0Figure 3.4: Density plot of Inverse Gamma distributionwith hyper-parameters: c = 200and 3 = 50. The vertical line is “true” valueof the u2 = 0.250.20 0.25 0.30 0.35613.6. ResultsMoreover, the jump sizes for updating and X inMetropolis - Hastings Algorithm are changed to 0.15, 0.55, 0.8, to avoid extreme acceptance/rejectionrates.Again results from the first study (1 sample study) are displayedfirst. Figure 3.5 andFigure 3.6 are the traceplots of parameters the MCMC algorithm incase 2.00>00qq00Figure 3.5: Traceplots of /3,i3i,X1,X2 from MCMC algorithm in case 2. The traceplotsshow the 5000 iterations after 1000 burn-in period.i [1 L .i ii. ii.0Is)‘1ri0 1000 2000 3000 4000 5000iterationsIi .111i1 ii I III] .II1III‘‘rrJi’iI I0 1000 2000 3000 4000 5000iterationsI ii I i .1 .110 1000 2000 3000 4000 5000iterations0 1000 2000 3000 4000 5000iterations620d003.6. Results00C00 1000 2000 3000 4000 5000terationsFigure 3.6: Traceplots of u,)2and g2 from MCMG algorithm in case 2. The traceplotsshow the 5000 iterations after 1000 burn-in period.We can see that the Gibbs sampling algorithm is very stable when updatingo2 in5000 iterations, and the chains do not have a mixing or convergence problem.Table 3.3 shows the posterior means and 95% CI analyzed based on the particulardataset. Again, all the 95% CI covers the true values of the parameters,and it’s reasonable to conclude that the approach works well for this particulardataset.The procedure of the first study is repeated 100 times in the second study.We are able0 1000 2000 3000 4000 5000 5 1000 2000 3000 4000 5000iterations iterations633.6. Resultstrue value posterior mean 95% CIt 0 -0.0061 (-0.089,0.10)5 1 0.88(0.73, 1.03)0.25 0.25 (0.22, 0.29)i3 -1.5 -1.51 (-1.77, -1.27)th1.5 1.72(1.06, 2.39)Table 3.3: True values, posterior means,95% credible intervals of,A2,/3o,/3i and u2.These are results from the first study in case 2. The“true” values are: u = 0, ?..—1.5 and41.5.to study the sampling distribution of to get a betterunderstanding of the potentialproblem (overestimate the parameter) in thefirst study. The problem could be justhappening by chance or it could show that overall theMCMC algorithm underestimatesth.Figure 3.7 confirms that the problemof underestimating /3 is just due to chanceand overall the estimated is roughly follows a normaldistribution centered at the“true” value. Except for A2, other parameter estimatorsare all approximately followinga normal distribution centered at their corresponding“true” value.643.6. Resultsrfl_fffR1..;.r____—0.10 —0.05 0.00 0.05 0.10 —2.0 —1.8 —16 —1.4 —1.2 —1.0PoI___ _.0.248 0.250 0.252 0.254 0.256Figure 3.7: Histograms of 100 posterior means for , A2,io i3and o2in the second studyin case 2. The “true” values are: u — O,A2= 1,/3o — —1.5 andj3i= 1.5.653.6. ResultsTable 3.4 outlined the bias, MSE, coverage of95% and average width of the 95%for each estimator. It suggest that our approachproduces reliable estimators with smallbiases, small MSE, satisfactory coverage rates andreasonable average credible intervalwidths.Bias MSE Coverage of the 95%CI Average95%CI Widtha 0.0034 0.004795 0.20V 0.007 0.008795 0.32-0.054 0.013 970.54/3i0.019 0.035 951.410.0023 0.00013 1000.071Table 3.4: Estimated bias, mean square error (MSE),coverage of 95 % CI and the averagewidth of t, A2,i3o, i3and g2 for case 2. All results are based on 100 datasets. The“true”values are: t = 0, A2 = 1,i3o= —1.5 andi31.5.3.6.3 Results for Case 3In this case, we took our validation sizeto be 50, which means 10% each dataset hasprecise measurements of X. The new jump sizes fori3o, i3and X are 0.1, 0.35 and 0.7.Figure 3.8 and Figure 3.9 are the traceplots of parametersin the MCMC algorithm inthe first study663.6. ResultsrLI II I I I Io woo 2000 3000 4000 5000iterationsI I I I I0 1000 2000 3000 4000 5000iterations1000 2000 3000 4000 5000Figure 3.8: Traceplots ofthX1,X2 from MCMG algorithm in case 3. The traceplotsshow the 5000 iterations after 1000 burn-in period.C(N0(N><0000Laas0 1000 2000 3000 4000 5000iterationsOs><5’,0I ‘iterations67iterations3.6. Resultsi.._1IT1..9.IIIrIiIIIII••I’iIII10 1000 2000 3000 4000 5000teratiorts0400.504000400 1000 2000 3000 4000 5000iterationsFigure 3.9: Traceplots of t, A2 and o2 from MCMC algorithmin case 3. The traceplotsshow the 5000 iterations after 1000 bur’rt-iri period.04001. 1 IL..0000 1000 2000 3000 4000 50000683.6. Resultstrue value posterior mean 95% CIu 0 -0.032(-0.013, 0.066)1 1.03(0.87, 1.19)ET0.25 0.24 (0.21, 0.28)j3 -1.5 -1.44 (-1.68, -1.19)4i 1.5 1.30(0.63, 1.97)Table 3.5: True values, posterior means, 95% credible intervals of, )2,/3o,/3iand o2.These are results from the first study in case 3.Both figures (Figure 3.8 and Figure 3.9) and statistic values (Table 3.5) indicate thatfor this particular generated dataset, the approach did a good job. Next, the histogram(Figure 3.10) and summary statistics (Table 3.6) of the sampling distribution for eachunknown parameter in the second study are presented.693.6. ResultsII—015 —0.05 0.00 0.05 0.10 0.15 0.20 —1.8 —1.6 —1.4 —1.2I 01’52!02!5 081,1!0.235 0.245 0.255 0.265a2Figure 3.10: Histograms of 100 posterior means for)2,/3,j3andu2in the second studyin case 3. The validation size is 50. The “true” values are: t=0,A2== —1.5andi= 1.5.703.7. Comparison of Three ApproachesFrom Figure 3.10, we can see that all the sampling distributions of parameter estimators are approximately normally distributed with some skewness involved, expectthe histogram for u2 looks uniformly distributed at first glance. However, by taking aclose look at the figure, we noticed that the scale for the histogram ofo2has 3 decimalplaces, which suggests that most estimated values of g2 are very close to the true value,0.25. This observation is also confirmed in Table 3.6, since the estimation of o2 has thesmallest bias, MSE and average CI width and highest coverage rate (100%).Bias MSE Coverage of the 95%CI Average 95%CI Widthi -0.0032 0.0050 97 0.19A 0.012 0.0080 95 0.31-0.024 0.013 95 0.55/3i0.051 0.036 95 1.340.00148 0.00059 100 0.066Table 3.6: Estimated bias, mean square error (MSE), coverage of 95 % CI and theaverage width of,)2,i3o, i3and o2 for case S. All results are based on 100 datasets withvalidation size 50. The “true” values are: i= 0,)..2= 1,/3o —1.5 andi3= 1.5.Evidence, such as small bias, small MSE and hight percentage of true values coverage,in Table 3.6 demonstrate a good performance for the validation model.Overall, we can conclude that the “formal” approach did an excellent job in estimating unknown parameters for three different models as: knowing the measurement errorvariance, prior information of the measurement error variance and validation model.Next, we are going to study the comparative performance of the “formal” approach, the“naive” and the “informal” approach.3.7 Comparison of Three ApproachesAs we discussed previously, the “naive”, “formal” and “informal” approach would dichotomize either X orX*with respect to c orc*to fit a logistic regression model as713.7. Comparison of Three Approachesfollowing:logit(Pr(Y= 1IX))formal = /3o +/311(X > c)logit(Pr(Y =1X*))jflfyJ.1= i3o +/311(X*> c)logit(Pr(Y= 1IX*))najve=+13i1(X*> c)The relationship of the health outcome, Y and the exposure variableX is gained from theestimated coefficients,/3andi3.Thus, the comparisons are mainlybased on estimatingthese two parameters. The true values are the same as before:i3o= —1.5 and/3i1.5,and comparisons are constructed in each of the three cases of the “formal” approach.Note that the c is chosen as 1.3 so that the specificity and thesensitivity is very high(both are around 95%).Figure 3.11 to Figure 3.13 are the pairwise plots of results from the three approaches.Byobserving them, we see there are some linear relationshipbetween estimators, and thelinear relationship is somehow weaker when estimating/3than estimatingi3o.Moreover,both the “naive” approach and the “informal” approach tendto overestimate/3obutunderestimatei3,whereas estimations from “formal” approach are located around thetrue value. Since it is very hard to tell which one is “better”between the “naive” andthe “informal” approach from the figures, the summary statistics of estimators’samplingdistribution in these two approaches are crucial to know.723.7. Comparison of Three ApproachesI I II A00’ //0 I0S:j:040•fl_1’08 --So000IDI -I . I001oI 00I rI I • I I I I I II—1.8 —1.6 —1.4 —12 —1.8 —1.6 —1.4 —1.2 —1.7—1.5 —1.3 —1.1informalo naive Po informal 13o00 0Sb ICM09d1 Iq.On0Co41I I—0I0i0%-0//’000 0.58o 0:I.7_______/l__ __0j%0o00 I 0 I IO oOI 0I I I I I I I I I0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0informal , naive, informal iFigure 3.11: Pairwise plots of three approaches, “naive”,“informal” and “formal”, inestimatingi3oand j3 under case 1. The solid line is if “y x “, and the dash lines arethe corresponding true values.733.7. Comparison of Three ApproachesI /I0°/1IjInHi‘0./:° I0°,,/°,°I I 0I0 0I I Ip___________‘1I I I I I I—1.8 —1.6 —1.4 —1.2 —1.8 —1.6 —1.4 —1.2 —1.7 —1.5 —1.3—1.1informalflo naive J1 infomial fib0 aq0 I8:o,oeIn0,0&In Iaoo0Io%0---0-Il-----I 60/0°li70 -- -:20:o$— D’10 II LI) I 0 Q90 ILI) ° I0I I 0I I I I I I I I I0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0informalfl.naive 1 infomialI3Figure 3.12: Pairwise plots of three approaches , “naive“, “informal” and “formal”, inestimating/3iandi3under case . The solid line is if “y = x “, and the dash lines arethe corresponding true values.743.7. Comparison of Three ApproachesI I I0.5 1.0 1.5 2.0informali800.5 1.0 1.6 2.0natveiFigure 3.13: Pairwise plots of three approaches, “naive“, “informal” and “formal”, inestimatingi3oand3iunder case 3. The solid line is if“y = x “, and the dash lines arethe corresponding true values.75//0ItS0—1.8 —1.6 —1.4 —1.2 —1.0004.5 .0I0•004Ing000informalPa naive Pr004Lba,aLI0—1.8 —1.6 —1.4 —1.2 —1,0informal004I000- -:000//L00L00I I01!0i52!0informalPi0U)3.7. Comparison of Three ApproachesTable 3.7 reports the average posterior means of/3oandi3,as well as 95% confidenceintervals for the average posterior means in all threecases. We are able to conclude thatthe formal approach is superior to informal and naive approach, sinceonly the confidence interval of “formal” approach cover the true valuesofi3oand!3i.Moreover, whenestimating/3,the formal approach produces posterior means andconfidence intervals,which are more closer to the true values. The “naive”approach generated the mostnarrow confidence interval that may also implies over-confidence.These results suggeststhat it is very dangerous to ignore the measurement errors in the analysisand makingproper adjustments for the measurement error is crucial.Average Posterior Mean 95%Confidence IntervalCase 1/3Onajve-1.41 (-1.43, -1.39)Ojnformal-1.36 (-1.38, -1.34)/3Oformal1.52 (4.53, 1.48)/3’naive0.98 (0.93, 1.03)’informal1.14 (1.08, 1.20)/3lformaj1.55 (1.47, 1.63)Case 2/3Oflajve4.40 (-1.42, -1.37)JOjnformaj-1.34 (-1.36, -1.31)/3Oformal-1.49 (4.52, 1.47)/3inaive0.97 (0.92, 1.0201)ljnforma11.04 (0.99, 1.10)/3lformal1.52 (1.45, 1.59)Case 3!3Onajve-1.43 (-1.45, 4.40)/Ojnformal1.37 (-1.39, -1.35)/9Oformal4.49 (4.52, 1.47)I3’naive0.99 (0.95, 1.04)/ljnformaz1.12 (1.06, 1.18)/3lformal1.55 (1.48, 1.62)Table 3.7: Average of posterior means and 95%confidence intervals for the averageposterior means ofoandi3ifor “naive “, “informal” and “formal” approaches. Resultsare based on 100 samples in case 1, and 376Chapter 4QRS Data StudyTo illustrate the ideas and methods that we discussed in the previous chapters ina realworld example, we use the QRS dataset. This dataset is provide by Vittinghoff, Glidden, Shiboski, and McCulloch (2004). Heart problems can be diagnosed throughthetiming of diverse stages in the contraction of the heart. Electrocardiography(EKG) isthe device that records the electrical activities of the heart through a durationof time.As the authors indicate the QRS wave is defined as a commonly measured time intervalin the contraction of the ventricles. The study dataset contains the QRS times (inmuliseconds) for 53 patients, of whom 18 have the inducible ventricular tachycardia (IVT)and 35 of them are without IVT. Note that the sample size is relatively small, since itis very difficult to assemble a large number of subjects to participate ina brain wavestudy, and the cost of the study is very high. Thus, studies that involve brain waves andelectrocardiography devices commonly have small sample sizes.Though the sample size is considerably small, it is still a good and clean datasetto illustrate our ideas and methods. The response variable Y takes the value of 1 if the subjectshas IVT, and 0 otherwise, while the covariates variable X is the QRS time (in milliseconds). Since the QRS time is a continuous variable, we are focusing on the approachintroduced in Chapter 3. Even through, in the literature, there are researchers whoargueabout the accuracy of the QRS duration (Tomlinson, Betts, and Rajappan, 2009), whichindicates that measurement error could exist in the measurement of timing inthe realworld, for the purpose of this thesis, we treat the QRS timingas precisely measured (X)and we simulate the surrogate variable, X, in order to compare the resultsobtainedfrom the true values, naive analysis, informal analysis and our formal analysis.77Chapter 4. QRS Data StudyNevertheless accuracy of the QRS is questioned by many researchers,there are fewarticles states the possible magnitude of the measurementerror. According to Sahambi,Tandon, and Bhatt (2009), the maximum error rateof the QRS is 6.25% due to the 50Hz power-line interference. As we lack of detailed informationabout how the data arecollected, we would simply adopt the measurement error rate statedby Sahambi et al.(2009). For our illustrative proposes, we have to assume we know the varianceof additivemeasurement error, u2, in order to generateX*.The error rate we accepted previouslyis a multiplicative error, and proper transformationis necessary to acquire an additiveerror (as we defined in Chapter 3). As a result, we chooseX = log(QRStime) instead ofQRS time directly. As defined in Chapter 3, under the nondifferentialassumption, themeasurement model here isX’IX N(1ogQRS,u2).Mathematically, we can compute a2 as:X = 1ogQRS+o-ZeX*= (QRS)ewhich motivatese = 1.0625where Z is a standard normal random variable. Weget the variance of additive measurement error as0.0312= 0.00096, and the surrogate variable, X’, can begeneratedafterward.Since we don’t have validation data and we supposewe don’t know the variance ofthe measurement error, the “formal analysis” approach isbased on the model that was784.1. Naive Approach and Informal Approachintroduced as case 2 in Chapter 3. Thus, the response model is1ogit(P(Y=1X)) = log151(4.1)= /30+/31(X>c)Under the assumption of nondifferential measurement error, the measurement model isX’X N(X,).The exposure model is:X N(,)).We will use this set-up to conduct naive, informal and formal analysis and compareresults produced by three approaches with the true values.4.1 Naive Approach and Informal ApproachFor the response model, the value of c is chosen as log(120)as suggested by Tomlinson,Betts, and Rajappan (2009). To refresh the memory, the naive approach would formulatethe response model based onX*and c as:logit(P(Y= 1IX*))P(Y=1II(X*>c))=In order to perform the informal approach, we need to pick up a c value such that, thespecificity is very high (as discussed in Chapter 3), and it is chosen as c =log(123) so794.2. Formal Approachthat the SP = 0.94. Then the response model of informal approach is:P(Y — 1II(X*>c))logit(P(Y= 1IX*))= logit1— P(Y=1I(X* >cj)= /30+/31(X*>c*)The estimates of and for these two approaches can be easily obtainedby using thegim function in R. Results are shown in the section 4.3.4.2 Formal ApproachSince we are assume the QRS timing from the dataset is the true value, we are ableto obtain some information about the exposure variable X, suchas the the mean ofmean(X) = p. = 4.64, var(X) = A2 = 0.094 and the measurement error variance u20.00091. Note that those values are not applicable in the real world, and they areavailable here for this example only (because of our assumption). As discussed early inthe simulation study of Chapter 3, we need to assign prior information for the “unknown”parameters. Remember that in section 3.6.2, flatter priors are assigned to unknownparameters since the simulated sample size is considerably large. Though,the samplesize is pretty small in this dataset, we are still able to assign flatter priors to someof the unknowns. For example,p. is assigned as p. N(0,1002),A2 is assigned asA2 IG(0.01, 0.01) and/iis assigned with N(0,1002).Dislike the non-informativepriors for p., A andi3o,the prior information becomes very crucial for 2 andi3.It isreasonable to assign concentrated priors to these two unknowns, since researchers couldeasily obtain relative information about these two unknowns from preciousstudy. As aresult, concentrated priors are assigned as such thatj3N(0,12)and.2IG(100, 0.1).Figure 4.1 displays the prior density plots vs their corresponding “true” values.804.2. Formai ApproachI I I I I—400 —200 0 200 400 0.00*00 5.Oe+302 1.00+303 1.50+303 2.00+303I’II.I.I—4 —2 0 2 4 —40000 —20000 0 20000 400000.0008 0.0010 0.0012 0.0014 0.0018Figure 4.1: Prior plots of unknown parameters with their hyper-parameters: iN(0,1002),A IG(0.01,0.01),o-2-‘ IG(100,0.1),i30‘-‘. N(0,1002)and /3 ‘-- N(0,12).The vertical lines are the corresponding “true” values as: u = 4.64, A2 = 0.094,/3o—0.90,/3i0.76 and g2 = 0.00096.814.2. Formal ApproachWe observe that, regardless of the strength of the prior, the center of priordensity foreach unknown parameter is most likely located around the correspondingtrue value. Theplot for A2 looks abnormal, since the range of its densityfunction goes from 0 to infinityso that it is quite difficult to display on a limited scale. 2 has the mostconcentratedprior, since approximately 95% of its data are enclosedby 0.0008 and 0.0014, a prettysmall range. xSimilarly as in the simulation study, we are unable to obtainthe full conditionaldistributions foroand so we are going to use the Metropolis- HastingsAlgorithm toobtain their estimators. All other unknown parameters areupdated as in the simulationstudy of Chapter 3 case 2. The jump size for updating X, /3o and inMR consistwith the simulation study, which are 0.15, 0.35, 0.8 respectively.Figure 4.2 shows thetraceplot of 200000 iterations after 2000 burn-in period,and there are no apparent mixingproblems to be noticed. The numeric results are presentedand compared in the nextsection.824.2. Formal ApproachI I I I—2.5 —1 5 —0.5 0.513o—2 —I 0 1 2 313iFigure 4.2: Traceplot and posterior densityplots of !0000 iterations after 1000 burn-inperiod of /3 and /3 when applying the MH sampling method.LOCLf)L()Co -I I I I I0 50000 100000 200000iterations>.a)Cci)0Cci)qCCDCCC’)CCCCoCCC’)0CCC’) -L. 1.1 Ii.i ii.l,.iiCD.C -— F..IIIr1rI—j.IJ..I I I0 50000 100000 200000iterations834.3. Results4.3 ResultsBefore discussing the results estimated from the naive, informal and formalapproaches,we would like to find out the supposed true result first. It is very easilyobtained fromthe glm function in R and the true model for explaining whether or nota subject hasthe IVT is estimated as follows:log1Y11))—0.90 + O.761(X > c).Table 4.1 records results ofi3oandi3ifrom the naive, informal and formal approaches.Estimate 95%CI CI width/3Onajve-0.86 (-1.58, -0.14) 1.44fOjnformal0.86 (4.58, 0.14) 1.44/3Oformaj-0.90 (-1.63, -0.22) 1.350.61 (-0.63, 1.84) 2.48/3ljnform&0.61 (0.63, 1.84) 2.48/3lfo’rmai0.67 (0.46, 1.73)2.19Table 4.1: Estimators, 95% confidence, or credible, intervals ofi3oand/3by using“naive”, “informal” and “formal” approaches.In light of the study performed in Chapter 3, the results for analyzingthis datasetbehave as we would expect. Though the results are close, the formal approach performedthe best when strong priors for/3and c2 are provided. As the data size gets larger, webelieve that the formal approach will keep doing a good job, i.e. estimatedvalues closeto the “true” values and the less variability of estimated parameters, even when flatterpriors are assigned. Surprisingly, the naive and informal approachesproduce the sameresults, and one possible explanation is that the data size is very small,and there is nosignificant difference in modelingX*with threshold c or c in this special case. Note thatwhen the data size increases, the chance that naive and informal approachesproduce thesame results will become slim.84Chapter 5Conclusion and Future WorkIn this thesis, we propose a formal approach to adjust mismeasurement in case-controlstudies. Ignoring potential mismeasurement on exposure variables could leadto seriousproblems, such as loss of power, biased estimation and misleading conclusions. Intheliterature, many methods were proposed to deal with misclassification and measurementerror, such as matrix method, inverse matrix method, regression calibration,SIMEX,Expectation-Maximization algorithm in frequentist perspective. Lots of methodsareready to use, nevertheless, they all have their limitations. For example, Carroll,Ruppert, Stefanski and Crainiceariu (2006) stated that though the SIMEX andregressioncalibration are simple methods to implement, they have limited contributions in reducing the bias caused by the measurement error. The Bayesian approach, on the otherhand, is able to correct the bias more precisely and generally. Though,a potentiallymisspecified exposure model, too complex posterior and intensive computational requirements are occasionally drawbacks in the approach, it has the great advantagethat theuncertainties of parameters can be fully incorporated.A formal approach in the Bayesian perspective is introduced in this dissertationto account for both categorical and continuous exposure variables under the non-differentialassumption. Fundamental techniques and concepts are introduced in Chapter 1. Ideasof the proposed formal approach that deals with a categorical exposure variableis introduced and studied through investigating three cases in Chapter 2. Theunderlyingtheme of the formal approach that adjusts the measurement error (continuous exposurevariable) is presented and examined by studying its performance on three cases againin85Chapter 5. Conclusion and Future WorkChapter 3. In Chapter 4, a real world dataset is used to evaluate the proposed model.Gibbs sampler and Metropolis-Hasting algorithm are mainlyused to sample the parameters of interest from their corresponding posterior distributions.In Chapter 2, we investigate three cases where we have different levels of knowledgeabout misclassified probabilities. In each case, the approach is implemented with bothlow and high prevalence rate, as well as a different validation sample size when we assumevalidation data are available. Stabilized traceplots suggest thatthe overall convergencerate is adequate for Markov chain simulation in our proposed model. When the samplingdistribution of each unknown parameter is studied, statistical assessments such as, smallestimator bias, small mean square error, high coverage rate of the true valueand reasonable average 95% credible interval length, all indicate that overall the model isefficientand accurate. When only the prior information about the misclassified probabilitiesisknown, strong and concentrated priors are required to get good estimation. Onepossible explanation is that, a strong prior is able to reduce the variability ofestimatorsand improve the efficiency of the approach. However, when a small proportion of thevalidation data is available, it is found that the strong priors become unnecessary, whichindicates that the model is able to capture enough informationto make good estimation.Moreover, it seems like the size of the validation data does not significantlyaffect theestimation, and this would be an interesting point to study later on. When the resultsobtained from low prevalence rate and high prevalence in each caseis compared, it isdelightful to observe that the approach could work for any prevalencerate. In the endof Chapter 2, estimated log odds ratios are compared for the proposed formal approachand informal approach. It is found, as expected, that the informal approach tendstounderestimate the association between the exposure variable and response variablemostof the time, and that less than 25% of the 95% confidence intervals actuallycover thetrue value. Even though, the formal approach sometimes overestimate the log odds ratio,the majority of its 95% confidence intervals include the true values and theestimator86Chapter 5. Conclusion and Future Workbias is much smaller than it is in the informal approach.In Chapter 5, the proposed approach is implemented witha continuous exposure variable.A logistic regression model is specified for the binary exposure variableand dichotomizedcontinuous exposure variable so that their association is measured accordingto the coefficients of logistic regression. Two simulation studies for three cases studiesare conductedbased on our knowledge of the magnitude of the measurement error. As expected, we findthat our proposed approach is both efficient and accurate. As in the discretecase, properpriors are crucial when we only have the prior information in hand butnot so importantwhen the validation data are accessible. Coefficients obtained from the “naive”approachand informal approach (both of them use the association estimated fromresponse variable Y andX*to estimate the true relationship between Y and X) are compared withour proposed formal approach at the end of the chapter. Overall,the performance ofthose approaches decline as we move from formal approachto informal approach andthen naive approach. A real world example is used to illustrate our idea andapproach.Due to a very small sample size of the dataset, adjusted proper priorsare again criticalto gain valuable estimations. Fortunately, it is proved that our suggestedapproach canwork practically after strong priors are assigned.Our proposed Bayesian adjustment for mismeasurement canbe extended to a varietyof research areas. One straightforward extension would be having moreprecisely measured covariates in addition to the model that we have right now. Furtherinvestigationcan be conducted to understand the weakness of our approach, whichis that strong priorsare needed when we only have prior information. A relevant and interestingstudy wouldbe to find out whether there is a “cut-off” validation size so that theresearchers are ableto gain the “maximum” information while spending minimum time or money.87BibliographyB. A. Barron. The effects of misclassification on the estimation of relativerisk. Biometrics, 33:414—418, 1977.R. J. Carroll, D. Ruppert, L.A. Stefanski, and C. M. Cainiceanu.Measurement Errorin Nonlinear Models, Vol. 105 of Monographs on Statistics and AppliedProbability.Chapman &Hall/CRC, Boca Raton., second edition, 2006.G. Casella and E.I. George. Explaining the gibbs sampler.The American Statistician,46:167—174, 1992.J.R. Cook and L. A. Stefanski. Simulation-extrapolation estimationin parametric measurement error models. Journal of American Statistical Association,89:1314—1328,1994.A. P. Dempster, N. M. Lairdd, and D. B. Rubin. Maximumlikelihoodd from incompletedata via the em algorithm. Journal of the Royal Statistical Society, B 39(1):1—38, 1977.E. Greenberg and Siddhartha Chib. Understanding themetropolis-hastings algoithm.The American Statistician, 49(4): 167—174, 1995.Paul Gustafson. Measurement Error and Misclassificationin Statistics and Epidemiology: Impacts and Bayesian Adjustments, volume Vol. 13 ofInterdisciplinary Statistics.Chapman &Hall/CRC, Boca Raton, 2004.H. Kuchenhoff, S. M. Mwalili, and E. Lesaifre. A general methodfor dealing withmisclassification in regression: the misclassification simex. Biometrics,62:85—96, 2006.88Chapter 5. BibliographyR. H. Lyles. A note on estimating crude odds ratios in case-control studies withdifferentially misclassified exposure. Biometrics, 58:1034—1037, 2002.R.C. MacCallum, S. Zhang, K.J. Preacher, and D.D. Rucker. On the practice ofdichotimization of quantitative variables. Psychological Methods,, 7:19—40,2002.R. J. Marshall. Validation study methods for estimating exposure proportions and oddsratios with misclassified data. Journal of Clinical Epidemiology, 43:941—947, 1990.M. J. Morrissey and D. Spiegelman. Matrix methods for estimatingodds ratios withmisclassified exposure data: extensions and comparisons. Biometrics, 55:338—344,1999.D.A. Pierce and A.M. Kellerer. Adjusting for covariate error with nonparametricassessment of the true covariate distribution. Biometrika, 91:863—876,2004.R.L. Prentice and R. Pyke. Quantitative analysis of errors dueto power-line interferenceand base-line drift in detection of onsets and offsets in ECO usingwavelets. Medicaland Biological Engineering and Computing, 35(6):747—751, 1979.P. Royston, D. G. Altman, and W. Sauerbrei. Dichotomizingcontinuous predictors inmultiple regression: a bad idea. Statistics in Medicine, 25(1):127—141,2006.J.S. Sahambi, S.N. Tandon, and R.K.P. Bhatt. Quantitative analysisof errors due topower-line interference and base-line drift in detectionof onsets and offsets in ECGusing wavelets. Medical and Biological Engineering and Computing, 35(6):747—751,2009.A. Skrondal and S. Rabe-Hesketh. Generalized Latent VariableModeling: multi-level,longitudinal and strucutral equation models. Chapman &Hall/CRC,Boca Raton., 2004.D.R. Tomlinson, T.R. Y. Betts, and K. Rajappan. Accuracy ofmanual qrs durationassessment: its importance in patient selection for cardiac resynchronizationand implantable cardioverter defibrillator therapy. Europace, 11(5) :638—642,2009.89Chapter 5. BibliographyE. Vittinghoff, David V. Glidden, S.C. Shiboski, and C. EMcCulloch. Regression methodsin Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models.Springer,2004.90
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Formal and informal approaches to adjusting for exposure...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Formal and informal approaches to adjusting for exposure mismeasurement Shen, Tian 2009
pdf
Page Metadata
Item Metadata
Title | Formal and informal approaches to adjusting for exposure mismeasurement |
Creator |
Shen, Tian |
Publisher | University of British Columbia |
Date | 2009 |
Date Issued | 2009-11-18 |
Description | In many research areas, measurement error frequently occurs when investigators are trying to analyze the relationship between exposure variables and response variable in observational studies. Severe problems can be caused by the mismeasured exposure vari ables, such as loss of power, biased estimators, and misleading conclusions. As the idea of measurement error is adopted by more and more researchers, how to adjust for such error becomes an interesting point to study. Two big barriers in solving the problems are as follows. First, the mechanism of measurement error (the existence and magnitude of the error) is always unknown to researchers. Sometimes only a small piece of information is available from previous studies. Moreover, the situation can be worsen when the study conditions are changed in the present study, which makes previous information not applicable. Second, some researchers may still argue about the consequences of ignoring measurement error due to its uncertainness. Thus, the adjustment for the mismeasurement turn to be a difficult, or impossible task. In this thesis, we are studying situations where the binary response variable is precisely measured, but with a misclassified binary exposure or a mismeasured continuous exposure. We propose formal approaches to estimate unknown parameters under the non-differential assumption in both exposure conditions. The uncertain variance of measurement error in the continuous exposure case, or the probabilities of misclassification in the binary exposure case, are incorporated by our approaches. Then the posterior models are estimated via simulations generated by the Gibbs sampler and the Metropolis - Hasting algorithm. Meanwhile, we compare our formal approach with the informal or naive approach in both continuous and exposure cases based on simulated datasets. Odds ratios on log scales are used in comparisons of formal and informal approaches when the exposure variable is binary or continuous. General speaking, our formal approaches result in bet ter point estimators and less variability in estimation. Moreover, the 95% credible, or confidence intervals are able to capture the true values more than 90% of the time. At the very end, we apply our ideas on the QRS dataset to seek consistent conclu sions draws from simulated datasets and real world datasets, and we are able to claim that overall our formal approaches do a better job regardless of the type of the exposure variable. |
Extent | 1924669 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | Eng |
Collection |
Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2009-11-18 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0070879 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/15219 |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- ubc_2009_fall_shen_tian.pdf [ 1.84MB ]
- Metadata
- JSON: 1.0070879.json
- JSON-LD: 1.0070879+ld.json
- RDF/XML (Pretty): 1.0070879.xml
- RDF/JSON: 1.0070879+rdf.json
- Turtle: 1.0070879+rdf-turtle.txt
- N-Triples: 1.0070879+rdf-ntriples.txt
- Original Record: 1.0070879 +original-record.json
- Full Text
- 1.0070879.txt
- Citation
- 1.0070879.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
France | 7 | 0 |
China | 3 | 18 |
United States | 2 | 0 |
Canada | 1 | 0 |
City | Views | Downloads |
---|---|---|
Unknown | 7 | 1 |
Beijing | 2 | 0 |
Ashburn | 2 | 0 |
Shenzhen | 1 | 18 |
Vancouver | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0070879/manifest