Bayesian Adjustment for Exposure Misclassification in Case-Gontrol Studies by Bong Chu B.Sc.(H), McMaster University. 2005 A T H E S I S S U B M I T T E D I N P A R T I A L 11 I I 1 1 1 : . \ U O F T H E i;i:oi 11n:\ii:.\ i s FOR. T I H : D E G R E E O F IVI aster of Science in The Faculty of Graduate Studies (Statistics) The University Of British Columbia A p r i l , 2007 © Rong C h u 2007 V Abstract Measurement error occurs frequently in observational studies investigating the relationship between exposure variables and the clinical outcome. Error-prone observations on the explanatory variable may lead to biased estimation and loss of power in detecting the impact of an exposure variable. The mechanism of measurement error, such as whether or in what way the quality of data is affected by the disease status, is seldom completely revealed to the investigators. This increases uncertainty in assessing the consequences of ignoring measurement error associated with observed data, and brings difficulties to adjustment for mismeasurement. In this study, we consider situations with a correctly specified binary response, and a misclassified binary exposure. We propose a solution to conduct Bayesian adjustment to correct for measurement error subject to varying differentiality, including the nondifferential misclassification, differential misclassification and nearly nondifferential misclassification. Our Bayesian model incorporates the randomness of exposure prevalences and misclassification parameters as prior distributions. The posterior model is constructed upon simulations generated by Gibbs sampler and Metropolis-Hastings algorithm. Internal validation data is utilized to insure the resulting model is identifiable. Meanwhile, we compare the Bayesian model with maximum likelihood estimation (MLE) and simulation extrapolation (MC-SIMEX) methods, using simulated datasets. The Bayesian and MLE models produce accurate and similar estimates for odds ratio in describing the association between the disease and exposure, when appropriate as- ii Abstract sumptions regarding the differentially of misclassification are made. The 90% credible or confidence intervals capture the truth approximately 90% of the time. A Bayesian method corresponding to nearly nondifferential prior belief compromises between the loss of efficiency and loss of accuracy associated with other prior assumptions. At the end, we look at two case-control studies with misclassified exposure variables, and aim to make valid inference about the effect parameter. iii Table of Contents Abstract ii Table of Contents iv List of Tables vi List of Figures viii Acknowledgements . ix Dedication x 1 Introduction 1 2 1.1 Problem setup 1.2 Review of currently available methods r ' 3 7 Bayesian data analysis 9 2.1 Bayes theorem 9 2.2 Prior distributions 2.3 Posterior simulation 2.4 The Metropolis Hastings Algorithm 2.5 The Gibbs Sampler 2.6 The hybrid algorithm 10 • •• 12 • 13 13 - 14 iv Table of Contents 3 4 Maximum Likelihood Estimation 19 3.1 MLEs under differential misclassification 19 3.2 MLEs under nondifferential misclassification 23 Simulation Extrapolation Approach 25 5 Simulation Studies 5.1 Data Simulation 29 5.2 Choice of Hyperparameters 31 5.3 Convergence of MCMC Simulation 5.4 Comparison with results obtained using MLE and MC-SIMEX 6 Application in Epidemiological Studies 7 29 . 32 40 48 6.1 The study of sudden infant death syndrome 48 6.2 The study of invasive cervical cancer 54 Conclusion and Future Work Bibliography 59 63 List o f Tables 1.1 Validation data and main data 5.1 Average acceptance rates over 400 replicates in Case 1 5.2 Posterior distributions at various scenarios in Case 1 with different prior information 5.3 35 43 MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MLE methods based on datasets simulated in Case 1 . . 5.6 34 MSEs, empirical coverages and average widths of 90% credible intervals for logOR using Bayesian methods based on datasets simulated in Case 1 5.5 33 Posterior distributions at various scenarios in Case 2 with different prior information 5.4 4 44 MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MC-SIMEX under nondifferencial misclassification based on datasets simulated in Case 1 5.7 45 MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MC-SIMEX under differencial misclassification based on datasets simulated in Case 1 5.8 45 MSEs, empirical coverages and average widths of 90% credible intervals for logOR using Bayesian methods based on datasets simulated in Case 2 46 vi List of Tables 5.9 MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MLE methods based on datasets simulated in Case 2 . . 46 5.10 MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MC-SIMEX under nondifferencial misclassification based on datasets simulated in Case 2 47 6.1 Validation study and main study of SIDS 48 6.2 Estimates of model parameters in SIDS study 49 6.3 logOR, SD and 90% intervals for logOR in SIDS study 49 6.4 Validation data and main data for cervical cancer study 54 6.5 logOR, SD and 90% intervals for logOR in cervical cancer study 56 List of Figures 5.1 MCMC mixing based on iterations 8000-10000 regarding a dataset in Case 1, Scenario 1, using the nondifferential Bayesian method 5.2 36 MCMC mixing based on iterations 8000-10000 regarding a dataset in Case 1, Scenario 4, using the differential Bayesian method 37 5.3 Posterior histogram ofr , r\ and logOR based o.n 10000 iterations of Case 1 38 5.4 Posterior histogram ofr , r and logOR based on 10000 iterations of Case 2 39 6.1 MCMC mixing based on iterations 8000-10000 using Nearly NDF Bayesian 0 0 x method (SIDS study) 6.2 50 Plots of the estimated logOR as a function of misclassification size A in SIDS study 51 6.3 ' Prior and posterior distributions of TQ, r\ and logOR subject to three mis• classifications, 6.4 •. : 57 Plots of the estimated logOR as a function of misclassification size A in cervical cancer study 58 viii Acknowledgements I would like to express my special gratitude to my supervisor, Professor Paul Gustafson, and my co-supervisor Dr. Nhu Le, for their thoughtful guidance and support throughout my study in the Department of Statistics. Their indicatives and insight helped me to complete this thesis. Sincere thanks to Professors John Petkau, Lang Wu, Harry Joe, Ruben Zamar, Will Welch, Raphael Gottardo and Matias Salibidn-Barrera for their excellent teaching and constant support during my Master's study. I am also grateful for the encouragement from all graduate students and staff members in the department. Special thanks to Kenneth, Zhong, Juxin, Hui, Jean-Franois, Lawrence, Wei Liu and Mike Danilov. Rong Chu The University of British Columbia April 2007 ix To My Dear Parents X Chapter 1 Introduction In biomedical research, people are often interested in learning the relationship between a health-related outcome Y and an explanatory variable measuring some kind of exposure status, denoted by T. Sometimes, in practice, the exposure variable or clinical outcome is not precisely recorded. For instance, instead of T, an approximate measurement or a surrogate, X is obtained. Carroll, Ruppert, Stefanski and Crainiceanu (2006) found that measurement error in the explanatory variables had triple effects: • It causes bias in parameter estimation for statistical models. • It leads to a loss of power, sometimes profound, for detecting interesting relationship among variables. • It masks the features of the data, making graphical model analysis difficult. The first problem caused by replacing T with X without accounting for the measurement error in data analysis has most serious impact on subsequent statistical inference. Hence the goal of adjustment for mismeasurement is to achieve roughly unbiased estimates to reveal the relationship between Y and T indirectly, based on the measurements of Y, X and perhaps other correctly recorded covariates Z. As it is important to have adequate knowledge of the nature and type of measurement errors, some examples from epidemiology are listed below. The NHANES-I dataset was created in a prospective study consisting of nutrition habits and incidences of breast cancer concerning a cohort of 8596 females (Jones et al., 1 Chapter 1. Introduction 1987). The response (Y) represents the occurrence of breast cancer. The covariates of primary interest are the long-term nutrition variables (T). Other explanatory variables Z include demographic and clinical factors such body mass index (BMI), age, alcohol consumption, age at menarche, etc. The response Y and Z were assumed to be correctly recorded, whereas T was not measured due to the difficulty in observing diet of a large cohort over a long period of time. Instead, a surrogate X recording the nutrition intakes of study subjects in the previous 24 hours was retrieved during the interview. The longitudinal variation of diet results in a major measurement error in this study. The seasonal diet behavior and day-to-day nutrition intake differences make X an inadequate approximation of T. The nonnegligible mismeasurement was discussed in epidemiological literature (Beaton et al., 1979; Wu et al., 1986). Some measurement error models based on a subset of the cohort were proposed by Carroll et al. (2006). It is sensible to assume the conditional distribution of X given T and Y does not depend on Y in the first example, which is known as the nondifferential measurement error. However, in other circumstances, this condition does not hold. In case-control studies, explanatory variables are retrieved after the diagnosis. Two group of study subjects with positive (cases) or negative (controls) clinical outcomes are first recruited, and consecutively the explanatory variables about their exposure history are measured. This type of sampling scheme may well lead to the so called differential measurement error, i.e. the conditional distribution of the surrogate X given the unobservable exposure T also depends on the response Y. When information about covariates is collected through some "self-report" mechanism, subjects with positive clinical outcomes may erroneously "blame" a set of risk factors for their conditions, or "ignore" previous experience with exposure variables to avoid any connection between behaviour and disease. The controls on the other hand may pay less attention or make less efforts to provide precise information about their past actions, as they do not suffer from the disease. 2 Chapter 1. Introduction A small portion of subjects for whom both the rough measurements and "gold standard" measurements are acquired (the validation sample) is sometimes obtained to monitor the severeness of measurement error. The herpes and cervical cancer study serves as an example to demonstrate the utilization of validation study. A case-control study consisting of 732 subjects of cervical cancer and 1312 community or hospital controls with negative cervical cancer diagnosis was conducted to investigate the impact of herpes simplex virus type 2 (HSV-2, a binary variable) in the development of invasive cervical cancer (Hildesheim et al., 1991). The exposure status was detected by the western blot assay, which produced error-prone measurements. A refined, more accurate procedure was performed on a randomly selected sample whose disease statuses were blinded, in order to assess the misclassification rates. Carroll, Gail and Lubin (1993) observed from the validation sample that the misclassification differ between cases and control (Fisher's exact two-sided test implied a greater sensitivity for the cases, p=0.049), and proposed a pseudo-likelihood model to adjust for the differential measurement error. 1.1 P r o b l e m setup The second example reflects how mismeasurement phenomena arise from biomedical studies with categorical covariates. The measurement error in this situation is often referred as a misclassification problem. In this thesis, we restrict ourself to misclassification problems on a binary exposure variable T(—0, 1) in case-control studies with no other covariates at play. Discussions of measurement error on continuous or polychotomous (with more than 2 categories) exposure variable can be found in statistical literature (Gustafson, 2004; Carroll et al., 2006). We assume no measurement error arising for the outcome of interest Y (=0, 1). Complete information on Y, T and the surrogate exposure variable X (=0, 1) is available 3 Chapter 1. Introduction Table 1.1: Validation data and main data Validation Y=0 Y=l T X=l T-1 an a T=0 ai3 N an+ 0,13 Main .Data Data X=0 Y=0 Y=l X=l X=0 X=l X=0 X=l X=0 hi 012 001 002 aoi a-02 ai4 Q03 0-04 013 »14 003 004 ai2 + a i 4 aoi + ^03 «02 + ao4 ai5 ai6 005 OQ6 12 for a small proportion of data (validation'sample), whereas the true exposure status for the majority of study subjects (main study) is unobservable or cannot be precisely measured. The validation data and incomplete main data are presented in Table 1. While each cell'Oij in the validation data is fully specified (i — 0,1 ,j = 1,2, 3,4), only margins Oo5) ^06, Oi5, aw in the main misclassification table are recorded. Our goal is therefore to recover the main table and ultimately make inference on the relationship between the clinical outcome Y and the actual exposure variable T. Let us denote the true exposure prevalences amongst cases and controls by ri and ro respectively, where r» = P(T = l\Y = i), i = 0,1. The retrospective odds ratio describing the correlation between the response and explanatory variable is defined as r r ' n 1 ~r 0 which is equal to the prospective odds ratio, , p(y=l|T=i) P(Y=0\T=\) p(y=i|T=o) ' p(y=o|r=o) via Bayes rule and simple algebraic manipulations. The odds ratio is sometimes adopted to approximate the relative risk $ = P(Y = 1\T = l)/P(Y = 1|T = 0) of having a 4 Chapter 1. Introduction disease in two exposure groups, when the disease incidence rate is small (rare disease). Sensitivity (SN) and specificity (SP) jointly measure the magnitude of exposure misclassification. In the scenarios subject to differential misclassification, the surrogate X given the unobserved true exposure T and the response Y are not mutually independent. The sensitivities and specificities among cases and controls can be formulated as, SNi = P(X = 1\T = 1,Y = i),SPi = P(X = 0|T = Q,Y = i), i = 0,1. Prevalences of the apparent exposure for diseased and non-diseased individuals are denoted by r* and 7Q. Another way of expressing the degree of misclassification is to facilitate the positive predictive value (PPV) and negative predictive value (NPV). Their relationships are presented below. r* = P(X = l\Y = i) = Z] P(X = nSNi + (l - rO(l - SPi) =Q = l\T = j,Y = i)P(T = j\Y = i) (l.l) PPVi = P(T =1\X = 1,Y = i) P(X = l\T = 1, Y = i)P(T = 1\Y = i) P(X = 1\T = 1, Y = i)P(T = l\Y = i)+ P(X = 1|T = 0, Y = i)P(T = 0\Y - i) SNtn (1.2) SNiTi + ii-SPMi-n) NPVi = P(T = 0\X = 0,Y = i) P{X = 0\T = 0, Y = i)P(T = 0\Y = i) P(X = 0\T = 1, Y = i)P(T = 1\Y = i) + P(X = 0\T = 0, Y = i)P(T = 0\Y = i) SPi(l ~ n) SPi(l - n) + (l - SNJn 1.3) It is easy to justify that, in the main study the actual number of subjects of positive exposure status (bn) amongst those who are apparently exposed in either case or control 5 Chapter 1. Introduction group (045) follows a Binomial distribution, i.e. bn ~ Binomial(ai5, PPVi). Similarly, conditioning on the number of cases or controls with negative apparent exposure status (aft), the number of truly unexposed subjects ( ^ 4 ) follows Binomial{a{§, NPVi), for i = 0,l. When the nondifferential misclassification condition is fulfilled, meaning the conditional distribution of X\T,Y does not depend on Y, it follows immediately that SNQ = SNi = SN, SPo = SPi = SP. However it is worth pointing out that, nondifferential misclassification does not imply equality of cases and controls regarding the apparent exposure prevalence (PPVi, NPVi). The bias caused by misclassification of the explanatory variable in case-control studies can be evaluated by the attenuation factor (AF), (T>* $ with a numerator representing limit of the error-prone "apparent" odds ratio, 1 * Under the circumstances of nondifferential misclassification, the impact of bias imputed by measurement error is well understood (Gustafson, 2004). Under a weak condition that the possibility of having an accurately measured surrogate exposure is over 0.5 (i.e. SN + SP > 1), A F always moves towards the direction of = 1. In other words, there is a tendency to report an artificially weak association between the exposure and response in ignoring measurement error on the exposure. Furthermore, flattening effect on true odds ratios far away from the unity is more manifest. When the exposure prevalence is approaching 0 or 1, measurement error induces serious and sensitive attenuation on Chapter 1. Introduction the odds ratio. Nevertheless, Greenland and Gustafson (2006) found that, no general conclusion could be made regarding the direction of estimated association when the nondifferential misclassification on a binary exposure is not satisfied, or when the exposure variable is polychotomous. 1.2 Review of currently available methods There is a large literature on the correctness for mismeasurement in biomedical research. Most of the work concentrates on building measurement error models using frequentist methods (Walter and Irwig, 1987; Bashir and Duffy, 1997; Carroll, Ruppert, Stefanski and Crainiceanu, 2006, for example). When the'problem of misclassification on exposure variable is encountered in epidemiologic studies, the matrix method (Barron, 1977) estimates the expectations of cell counts (obtained by cross tabulating Y and T) by utilizing the margins of the main sample in Table 1.1. The odds ratio is estimated subsequently based on the cell counts, and the asymptotic variance of OR is derived by Greenland (1988) via'the Delta method. Marshall (1990) reparameterized the misclassification by PPV and NPV and proposed an inverse matrix method to retrieve the true odds ra- tio. Lyles (2002) proved that Marshall's formula was in fact the maximum likelihood estimate (MLE) subject to differential misclassification. Other approaches including the simulation extrapolation method and latent class logistic regression model are developed to tackle the same problem (Kiichenhoff, Mwalili and Lesaffre, 2006; Skrondal and Rabe-Hesketh, 2004). On the other hand, the dramatic improvement of computational capability of electronic computers and the development of indirect simulation techniques such as Markov chain Monte Carlo (MCMC) make it possible to explore misclassification problems from a Bayesian perspective (Gustafson, 2004). In fact, partial knowledge of misclassification probabilities is often accessible to medical researchers before the conduc7 Chapter 1. Introduction t i o n of a study. T h i s makes B a y e s i a n analysis a n a p p e a l i n g a p p r o a c h , for the inference w i l l be based o n t h e c o m b i n a t i o n of p r i o r knowledge a n d d a t a t h o r o u g h l y . Therefore, i n t h i s thesis, we p r i m a r i l y i n t r o d u c e a series of B a y e s i a n m e t h o d s suitable for different m i s c l a s s i f i c a t i o n assumptions. T h e i r performance w i l l be closely c o m p a r e d t o those of t h e m a x i m u m l i k e l i h o o d estimates ( M L E s ) a n d s i m u l a t i o n e x t r a p o l a t i o n ( S I M E X ) m e t h o d , u s i n g s i m u l a t i o n studies a n d two real life examples. T h e thesis is o r g a n i z e d as follows. C h a p t e r 2 presents f u n d a m e n t a l concepts a n d a l g o r i t h m s of t h e B a y e s i a n p a r a d i g m , a n d provides d e t a i l e d m e t h o d o l o g y for t h e p r o p o s e d B a y e s i a n m e t h ods. C h a p t e r s 3 a n d 4 review t h e M L E a n d S I M E X m e t h o d s u n d e r differential a n d nondifferential misclassifications. C h a p t e r 5 discusses t h e c o m p a r a t i v e b e h a v i o u r s of t h e three m e t h o d s based o n t w o s i m u l a t i o n studies. C h a p t e r 6 presents t h e performances of B a y e s i a n a n d other m e t h o d s v i a two case-control studies w i t h misclassified exposure variables a n d v a l i d a t i o n sub-samples. C h a p t e r 7 provides o v e r a l l c o n c l u s i o n a n d further r e m a r k s for the research. 8 Chapter 2 Bayesian data analysis 2.1 Bayes theorem Assume the observable random variables Y are distributed according to a joint density function f(y\6), where 9 is an unknown parameter vector. One's prior belief on the value of 6 before observing the data y is described by a prior distribution f(6). A joint probability distribution can be written as a product of the data distribution /(y|0) and the prior distribution /(#). Given the observed data, the posterior distribution of the unobserved parameters can be determined using Bayes' rule: /(%) = T^T - / W W ) f f(e*)f(y\e*)dd* (2 ' 1} (2.2) This equation can further be simplified by omitting the fixed factor /(y) for a given y that does not depend on 6: f(8\y) oc f (0)f (y|0) • When it is too complicated to derive a normalized form for the above unnormalized posterior distribution, Markov Chain Monte Carlo (MCMC) and other numerical 9 Chapter 2. Bayesian data analysis techniques should be applied. 2.2 Prior distributions The exposure prevalances r , r i , sensitivities SNQ, SNI, and specificities SPo, SPi are 0 the parameters of interest in this study, ranging from 0 to 1. We make them cover the whole real line by converting into a logit scale, logit(x) = log for x between 0 and 1. The prior information concerning these parameters can then be modeled using bivariate normal distributions. The actual exposure prevalences (r^), sensitivities(>SiVj) and specificities (SPi) of X as a surrogate for T are assumed to be uncorrelated in this circumstance. ( V* 1 ( P\0-\o- N J Pi \ UJ U) V l o g P2J T- n \ v. l - S^ 7 SP -SP SP^ \ 0 log 0 ( log \\ (To K SN -Sis SN! ( log 2 P2T1T2 T£ P3M2 51 J I 0 l l o g l J J The prior knowledge of differentiality for misclassification is reflected by ui, v , 71, 72 2 and varying values of p and p . Theory tells us that normality is retained from linear 2 3 10 Chapter 2. Bayesian data analysis combination of normal random variables. We then have 2p T T ) (2.3) gi - q ~ AT(7! - 72, 5? + b\ - 2p 5 S ) (2.4) Pi - P2 ~ N(vi - I/ , 2 + T 2 2 - 2 2 3 1 1 2 2 Parameters in the prior distributions are often named hyperparameters. In this context, they include pi, a^v^ Ti, 7$, 5i and pj. Hyperparameters reflect one's prior belief on the target parameters based on previous research, literature review or even personal, opinions. We set p,\ — p , v = v and 71 = 72 to be "unbiased", a priori, indicating the 2 x 2 prevalences and misclassification probabilities amongst cases and controls are centred around equal means; and o = a , T\ = r and S = 8 , asserting that, the variabilities x 2 2 x 2 associated with prevalences, sensitivities and specificities in two populations are consistent. As a result, the condition p = p = 1 implies p\ — p ~ A/(0,0), q\ — q ~ AT(0,0), 2 3 2 2 and furthermore pi = p , q — q . Since the logit transformation is a 1-to-l function, it 2 x 2 follows immediately that SNQ = SNi and SPQ = SP , which corresponds to nondifferenx tial misclassification. Conversely, a zero-valued correlation coefficient p or p indicates 2 3 the independence between p , p or q , q , and hence implies S W n and SNi or SP and x 2 x 2 0 SPi are independent. This intuitively reflects the fact that sensitivities or specificities are free to vary by themselves, and can be interpreted as, fully differential misclassification is achieved. Situations in between the two cases are defined as the "nearly nondifferential" misclassification to reflect a certain level of dependence between the misclassification parameters. In addition, by assigning particular values to hyperparameters, for instance Pi and a,, we can postulate a priori that r; lies between 0.02 and 0.5 say, on logit scale with 95% probability. In fact, this is a "wide" interval containing most feasible prevalence values in epidemiological applications. Because the marginal and conditional densities of a multivariate normal variable : 11 Chapter 2. Bayesian data analysis remain normal, the joint, marginal and conditional prior distributions for the original parameters r\, r , SN\, SN , SP\, SP2 can be easily derived applying standard variable 2 2 transformation techniques. 2.3 Posterior simulation Let us define the parameter vector 9 as e= (r ,r ,SN ,SN ,SP ,SP 1 2 1 2 1 2 The sampling distribution (ignoring constant terms) is /(y|0) = f [ \ \sNiri + (1 - SPi)(i - TO]"" [ ( l - SNi)n + SPi(i - n) Ot6 i=0 SNin an (i-5P0(i-n) an (l-SNjn and the posterior density / ( 0 | y ) is proportional to aii SPiil-n) },(2.5) f(9)f(y\9). With this complex unnormalized posterior density function, one cannot (a) simulate independent and identically distributed realizations from the posterior, or (b) simulate dependent but identically distributed realizations from the posterior distribution. Fortunately, under this circumstance, the family of MCMC techniques provides us with an approach to simulate from a Markov chain that converges to the posterior density as its stationary distribution (Gustafson, 2004). An alternative term referring to MCMC methods is the Metropolis Hastings (MH) algorithm. Together with Gibbs sampler, a special case of the MH algorithm, MCMC is commonly used in practice to build Markov chains,through drawing samples from the Bayesian posterior distributions. 12 Chapter 2. Bayesian data analysis 2.4 The Metropolis Hastings Algorithm The Metropolis Hastings (MH) algorithm was originally introduced by Metropolis et al. (1953) and was further generalized by Hastings (1970). The M H algorithm enjoys a simple form. Consider the situation when we try to construct a Markov chain converging to a stationary distribution (target density of interest) p(£,\z). First, an arbitrary starting point J i is drawn,so that pi^jz) > 0. A jumping distribution (or proposal distribution) s t n e n selected to simulate a candidate state £* at time t+1 provided the current state £*, for t = 1, 2, 3, ... . The acceptance probability is defined as (2.6) Next simulate u from the Uniform(0,l) distribution. If u < a, set £ set = t + 1 = £*; otherwise, The transition distribution of the Markov chain is therefore a" mixture of the proposal distribution and a point mass at ^ t + 1 = The iteration continues until a sample size n is reached after the "burn-in" period of size m. It is important to select sufficiently large m and n to guarantee accurate estimation of the quantities associated with the target distribution. The choice of the jumping distribution has great impact on the convergence of the Markov chain £ + £ + m 1 m ) 2 ) ...,£ m + n to the posterior density, and mixing of the parameter values. 2.5 The Gibbs Sampler The' Gibbs sampler can be treated as a special case of the M H algorithm of acceptance probability being 1 at every jump. It is useful when the target posterior distribution is multidimensional, such as the problem being studied here. When it is difficult to sample from the complicated desired joint distribution, the parameter vector 13 Chapter 2. Bayesian data analysis 0 c a n b e d i v i d e d i n t o say d subvectors o r b l o c k s ( 0 i , . . . , 0 d ) . A t every i t e r a t i o n , a l l d subvectors are u p d a t e d i n sequence, each b e i n g s a m p l e d f r o m t h e c o n d i t i o n a l dist r i b u t i o n g i v e n a l l other b l o c k s (the full conditional distribution). I n other words, a t step i, 6i is s a m p l e d from / ( # i | # ^ \ y ) , p r o v i d e d the latest values of other subvectors Q*-} = (0\,..., 9\_ , Oj+l,..., W i t h conjugate priors, i t is possible t o s i m u l a t e d i - x r e c t l y from t h e full c o n d i t i o n a l target d i s t r i b u t i o n s . Otherwise, the M H algorithm can be used t o u p d a t e p a r a m e t e r s w i t h i n a single block. 2.6 The hybrid algorithm W h e n t h e M a r k o v c h a i n u n d e r differential o r n e a r l y nondifferential misclassification reaches some state 9 i n t h e p a r a m e t e r space, t h e cell counts i n m a i n d a t a o f T a b l e 1.1 (hj), i = 0,1, j = 1,2,3,4 are generated from B i n o m i a l d i s t r i b u t i o n s described i n section 1.1. T h e d a t a is u p d a t e d as y = {y bs, Yunobs} = {(an, • • • , a ), (bn, • • • , 6 )}- . i4 0 i4 T h e s a m p l i n g d i s t r i b u t i o n or t h e l i k e l i h o o d f u n c t i o n becomes f (y obs j yunobs \9) =. L(6\yobs,Yunobs) = n{( ^r 5iv i+bn ((i-^)(i-n)r 3+6i3 i=0 ((l - SNi) n) (SPi (l - n)) « « aii+ba a } +b i _ J~J !^j.aii+ai2+ba+bi2 ^ _ ^.^ i3+ai4+fct3+fei4 a Cjpjau+bii i=0 (1 - SNif i2+ba SP* ii+bii (1 - SPi) } aa+biZ . (2.7) A s i t is n o t p l a u s i b l e t o s i m u l a t e 9 from the j o i n t posterior d i s t r i b u t i o n directly, we s a m p l e t h e 6-dimension p a r a m e t e r sequentially from t h e c o n d i t i o n a l posterior d i s t r i b u - 14 =0 Chapter 2. Bayesian data analysis tions, i f(r ,r \SN ,SN ,SP ,SP ,y) 0 1 0 1 0 oc 1 /(ro.n) fj { r ? " + 0 » " + 6 + t »\l - r .)««+a M + *i3+6 M J ; ( 2 8 ) t=0 1 oc HSNo^N^ruS^SPuy) / ( S i V , SiVi) {S A T ^ 0 6 " (1 - S W ) ° « < } , + 6 1 HS^SP^ruSNo^Nuy) oc /(SP ,5Pi) 0 TJ IsPr ' " t +b (2.9) a i (1 - 5 P i ) a i 3 + 6 i 3 j . (2.10) i=0 In above equations, /(r ,ri), 0 f(SN ,SNi) 0 and f(SPo,SPi) are bivariate normal prior densities for the prevalences, sensitivities and specificities over cases and controls. As the densities are not conditionally conjugate, the MH algorithm is performed at next "step. We apply univariate MH jumps embedded in Gibbs sampling to update each component in the paired of parameters, (r 0 ) (SNo,SNi) and (SPo,SP\). It ri), is important to choose a jumping distribution that leads to satisfactory performance of the Markov chain. One intuitive solution is to let the jumping distribution follow a Beta density as appeared in the likelihood function in Equation (2.7). This simplifies calculation of the acceptance rate by cross canceling the ratio of proposed vs. current likelihoods and the ratio between two jumping densities. As a result, we are left with, merely the ratio between two prior distributions. To be more specific, let us take the acceptance probability in one dimensional MH jump on r in Equation (2.8) for example. 0 The jumping rule.is specified as ^Beta(aQi + ao2 + b + bl + l,a 01 2 03 + ao4 + b + b + l), 03 04: close to the conditional sampling distribution. The ratio of ratios in Equation (2.6) 15 Chapter 2. Bayesian data analysis becomes f(r* t SN*,SN*,SP*,SP*,y*) r{, SN*,SN*,SP*,SP*,y*) r\,SN*,SN*,SP*,SP*) J(r* J(r* \r* ,r*,SN*,SN{,SP*,SP*) f(r* \r\, SN*, SNI SP*, SP*)L(r* ,r\, SN*, SN*, SP*, SP*) f(r* \r{, SN*, SNJ, SP*, SP*)L(r* , rj, SN*, SN\, SP*, SP*) L(r*,r\,SN*,SNj,SP*,SP*) L(r* ,r\,SN*,SN*,SP^,SP*) f(r*\r\,SN*,SN*,SP*,SP*) f(r* \r\,SN*,SN*,SP*,SP*) 0 r 0 0 0 0 0 0 0 0 0 f(ro\r[) f(ro\ \)' r The last step is obtained because of the mutual independence among prevalence, sensitivity and specificity in prior information. We now summarize the posterior simulation procedure as follows. 1. Acquire initial values of (rg, rf, SN$, SN?, 5P °, SP?), each lying between 0 and 1. 0 2. At the t* iteration, h • Given parameters 6* — (r* ,r[, SNQ, SN*, SPQ, SPQ, update the unobserved 0 actual exposure data Vunobs' based on Binomial distributions, for = 1 = 0,1,7 = 1,2,3,4. • Based on the updated cell counts {b*j} at the t th iteration, model parameters are generated alternately via Gibbs sampler and MH algorithm. (a) Simulate r* conditioning on (rj , SN*' , -1 1 0 algorithm. A proposed jumping rule is 1, a 03 + ao4 + b* Q3 + ^04 + 1)) w i t n SN*~ , L using MH SP^T , SP^ ) 1 ~Beta(aoi + a 1 02 + b* + b* + 01 02 acceptance rate min ( / f f i f y t - L , l } • 16 Chapter 2. Bayesian data analysis (b) Simulate r\ conditioning on (T-Q, SNQ' , SN![~ ,SPQ~ , 1 1 SP^" ). 1 1 posed jumping rule is r\ ~Beta(a,n + ai + b\ + b\ + 2 x 2 l,ai3 + A proau + b{ + 6j + 1), with acceptance rate min ^y^r^ry, l | • 3 4 (c) Simulate SN% conditioning on (r£, r \ , SW? , -1 SP^ , SP^ ), 1 1 using MH al- gorithm. A proposed jumping rule is 5A/Q* ~5eta(aoi+6 + l, with acceptance rate min (d) Simulate SN{ rate mm j (e) Simulate SP& j f^pf^t-i^, 1j • conditioning on (r£, r\, SN&, jumping rule is SAT* 01 ~Beta(a + 6' .+ 1, a u n SP' SP^ , 1 - 1 ). ao2+Oo2 + l), A proposed + 6 + 1), with acceptance 4 i2 12 , 1) • conditioning on (r£, rj, SWg, SWf, SP' - 1 ) using MH algo- rithm. A proposed jumping rule is SPQ ~J3e£a(a 4 + 6 + 1, a 3 + 6Q +1), 0 with acceptance rate min j f^f^-^spt-\^, (f) Simulate S P / conditioning ing rule is SP{ ~Beta(a u • Calculate the log odds ratio on (r^rj, 04 0 3 11 • SA/Q, SN{,SPQ). A proposed jump- + b\ + 1, ai + 6' + 1), with acceptance rate A at the t 3 th 3 iteration. 3. Repeat step (2) at subsequent iterations, for t = 1,..., m + n, to simulate target parameters alternately using the hybrid algorithm. The procedure is stopped after accomplishing m+n iterations, where m is the number of burn-in iterations and n stands for the number of target iterations. The multivariate Markov chain is generated and converges to the joint posterior distribution at sufficiently large m. The marginal Markov chains for individual parameters including the log odds ratio are constructed at the meantime and statistical inference can be performed thereafter. 17 Chapter 2. Bayesian data analysis Under the circumstances of nondifferential misclassification, the parameter space becomes 4 dimensional with 8 = (r ,ri, SN, SP). The prior densities and likelihood func0 tion are revised as ^( r^)~^' log r2) g=(log-^)~ATM ) 2 L(d\y , y ) obs unobs = i=0 { r ? ° « » » (1 - r ) ° 1 + + b + 6 i a + a " ' + 6 3 + 6 " _ ^ +b pan+bn _ pja +b ^ _ ^ This leads to the combinations of steps (c) and (d), (e) and (f) in the above procedure. S <Hi+bii tX N SN i2 i2 S f l S i3 i3 Only one M H jump is required to simulate Markov chain for the sensitivity or specificity within the Gibbs sampling structure. Nevertheless, the ideas on posterior simulation can be addressed and utilized as before. 18 Chapter 3 Maximum Likelihood Estimation The maximum likelihood estimate of a parameter 9 given a sample point x is a parameter value at which the likelihood L(9\x) attains its maximum as a function of 9 (Casella and Berger, 2002). Often candidate maximum likelihood estimators (MLEs) can be obtained through solving ^logL(9\x) = 0, when the likelihood or log likelihood function is differentiable in 6. The left hand side of the equation is sometimes named the score function associated with the likelihood. When 9 is multidimensional and a closed-form solution does not exist, iterative numerical procedures such as Newton Raphson algorithm are required. Because of some optimality quantities of MLE including the consistency and efficiency, the method of maximum likelihood estimation is currently the most popular method for point estimation. 3.1 MLEs under differential misclassification A standard way to express the likelihood in terms of (r ,ri, SN , SN , SPQ, SP\) for prob0 0 X lems consisting of a main study (misclassification data) and internal validation data is provided in Equation (2.5) under differential misclassification. Unfortunately the score functions do not form a tractable system and no closed-form solution exists under this parameterization. Lyles (2002) proposed an alternate to parameterize the likelihood in terms of 9* = (rj.rj, PPV , PPV 0 U NPV , 0 NPVi). This reparameterization leads to closed-form MLEs, and is proven to be equivalent to the inverse matrix method of Mar- 19 Chapter 3. Maximum Likelihood Estimation shall (1990). The likelihood function is of the form W\y) II { * * t = r i=0 a 1 - *)° r „*\ il a i0 ( p p v ((1 - PPVi)r*) ((l ri) -NPV^l ai3 (WPV$(I-O)°"} *\\ i2 - r*)) a . (3.1) By setting the score function associated with each parameter zero, we obtain MLEs an PPVi an NPVi + a^ ajA i2 a + &i4 il + <2i3 + &i5 a a t l + a%2 + i 3 + i4. + <2j5 + aj6 a (3-2) a If we apply the invariance property, the MLEs of the original parameters and the log odds ratio d> can be written as U. = PPVirt + il-NPViXl-f!) PPVif* SNi NPVj(l-f*) 1-fi SPi <j) = log (3-3) The Hessian matrix composed of mixed second derivatives is estimated at the M L E d* in form of H(0*).= logL V2 dd (V*|y) In fact, the zero cross partial derivatives simplifies the expression of Hessian matrix, and 20 Chapter 3. Maximum Likelihood Estimation the observed Fisher information is obtained as, i(0*) = -H(0*) / d logL(0* |y) 2 0 0 d logL(e*\y) 2 0 0 0 d logL(e*\y) 2 0 0 0 dPPVtf 0 d logL(0*\y) dPPV? 2 0 0 0 0 d logL(6*\y) dNPVg JP\ 2 0 V 0 0 d logL(e-\y) 2 0 0 . dNPV 2 / The asymptotic covariance matrix of MLEs in the Cramer-Rao lower bound £ g , is approximated by the inverse of the observed Fisher informatin at 9*, which bears the form of a 6x6 diagonal matrix with reciprocals of non-zero second derivatives of the log likelihood on the diagonal. As a result, variances of the "apparent" exposure, positive and negative predictive values are estimated as, var (r*) var (PPV^) f?(l-f\) &i2 + a + a Oil + i3 PPVj(l il a - i2 + a i5 + a i ( PPVj) + &i3 lyp?i(i a i 4 2 + NPVJ) a i4 The asymptotic variance for the log odds ratio is further estimated as (Morrissey and 21 Chapter 3. Maximum Likelihood Estimation Spiegelman, 1999; Lyles, 2002), /„N var W » ( r W j + JVfYj - l) 55?(f;) + (ff) 5Sr ( r W , 2 = ^ rf(l - hY — (1 - flf 2 ~ {NPV^j var + (3.5) using multivariate Delta method. var ($) = W % W , (3.6) T whereby d J L W= 9ppvi n0*=0* < aAfPv l^*=s* vy 0 a j, is a column vector composed of first derivatives evaluated at the MLE 6*. Similar ideas can be adapted to obtain the asymptotic variances for MLEs of the original parameters 6 = (r ,h,SNo,' 0 ^i,SP ,SP\) • 0 vaf{ri) = (PPVi + NPVi-1) {f*fvar(pPV^ (f* - if var var (r*) + + (NPV^ (3.7) 22 Chapter 3. Maximum Likelihood Estimation fpy. PPViT* (pPVi + NPVi - l)" 2 var (SNi) f*fi-PPVi(f*) (1 - ft) ft .•gpy. var (SPi) 1 -r. var (PPVi) PPVi ft (1-ft) • NPVj (1 - r*) + (3.8) (NPV^J (1 - r*) [PPVi NPVi *+ var var (f*)+ + NPVi (1 - hf var (PPV^j - ) 1 var (f*) + + a I-f* 1 -f. 3.2 NPVi (1-ft) ( i - hY 2 var NPVi (3.9) MLEs under nondifferential misclassification Neither parameterization mentioned above facilitates closed-form MLEs under the circumstances of nondifferential misclassification. Although explicit formulas that are computationally less intensive are provided in approaches such as the matrix method (Barron, 1977) and inverse matrix method (Marshall, 1990), the ML estimation is in general preferable due to its efficiency. We therefore proceed by employing a quasi-Newton method ("L-BFGS-B") by Byrd et al. (1995) that is implemented in the "optim()" function in statistical software R to achieve likelihood optimization in original parameter space of 0 = (r ,ri,SNo,SNi,SPo SPi). 0 : Function values, gradients (first order derivatives) of the log likelihood function and the initial estimates are supplied to the routine in order to build up a picture of the surface to be maximized ( R 2.4.0). 23 Chapter 3. Maximum Likelihood Estimation A numerically differentiated Hessian matrix at the best set of estimates in form of a matrix of mixed second derivatives, H(<j) will be returned if the option hessian is set T R U E in "optimQ" function. The estimate of observed Fisher information is received immediately i(l) = - H ( 0 ) , and the asymptotic covariance matrix of 6 in the Cramer-Rao lower bound is approximated by the inverse of the estimated Fisher information, denoted by i ( 0 ) . The asymp_1 totic variance of the log odds ratio estimate is attainable by the multivariate Delta method, var dr 0 ve\e \ ( °~r 0 a i 0 0 ' *2 °~fo?i dr °~f ri f cr 1 Ve\e \ (3.10) J whereby the matrix in the middle of the right hand is the 2x2 component of the estimated asymptotic covariance matrix corresponding to the ro and f\. 24 Chapter 4 Simulation Extrapolation Approach Simulation extrapolation (SIMEX) method was initially proposed to correct additive measurement error in general regression problems by Cook and Stefanski (1994) and further extended to handle other measurement error models (Cook and Stefansk, 1994; Kiichenhoff and Carroll, 1997; Eckert, Carroll and Wang, 1997). SIMEX exploits the relationship between the size of measurement error in covariate or response and the bias in parameter estimation. It utilizes a self-contained simulation component to establish a trend of measurement error-induced bias versus the variance of the extra added measurement error, and extrapolate this trend to the case of no measurement error (Carroll, Ruppert, Stefanski and Crainiceanu, 2006). The misclassification SIMEX (MC-SIMEX) method, proposed by Kiichenhoff, Mwalili and Lesaffre (2006) extends SIMEX to handle misclassification problems with discrete covariate or response. Let us describe MC-SIMEX in a logistic regression setting with a binary response variable Y, and a nondifferential misclassification prone binary covariate X that is associated with an unobservable independent variable T. logit (Y = l\t)=p 0 +Pit It follows immediately after simple arithmetic that Pi is the effect parameter of interest in this context, which is the log odds ratio of having a positive response, i.e. <f) = P\. The 25 Chapter 4. Simulation Extrapolation Approach nondifferential misclassification mechanism is characterized by a misclassification matrix, ( n = SP ^1-SP l-SN^ SN ' P{X = 0\T = 0) P(X = 0|T = 1) ^ P(X = 1\T = 0) P(X = 1\T = 1) j Assuming II is positive definite, the spectral decomposition can be written as II = EAE , -1 where A is a diagonal matrix of eigenvalues and E is composed of eigenvec- tors. Let us define a misclassification operation MC [II] (Z), indicating the procedure to simulate a misspecified random variable related to Z with measurement error probabilities SN, SP. It is then natural to express the relationship between X and T as X = MC [II] (T). The limit to which the naive estimator of B converges ignoring meax surement error on the exposure, is denoted by (3* (II), as the sample size approaches infinity. At the simulation stage of MG-SIMEX algorithm, extra misclassification II = A EA E~ , x l (A > 0) is attached to the error prone variable X to generate "reclassified" pseudo data, X (A) = MC [ n ] (X). A Under the assumption that the measurement error mechanisms affecting T and X are independent, one can write X (A) = MC [U. ] (T), 1+A and the naive estimator based on the simulated dataset (Xi (A), the form /3J ( l l 1 + A , i = 1,..., N, bears ) , which is achievable by least squares method. Hence for a fixed set 26 Chapter 4. Simulation Extrapolation Approach. of A values, 0 < Ai < A < ... < AM, the MC-SIMEX.reclassification step is repeated for 2 B times at every A^ to simulate pseudo data (Xi (A&), Yi) and to estimate /?* b The situation of no misclassification is obtained at A = —1, as n 1+A (n 1+Afc ). = II = / . The log 0 odds ratio at A^ is then estimated as * Pi,k Ef fe(n Q 1+A - =1 -g , for fc = 1 , . . . , M . At the extrapolation step, the MC-SIMEX estimator'BMCis thus acquired by 1. fitting an parametric extrapolation function L on data ^1 + \k,J3i,k), for instance by least squares, for k = 1,..., M; 2. extrapolating the parametric model back to the origin crossing ^0, PMCJ • Kiichenhoff et al. (2006) believe that the MC-SIMEX estimator is at least approximately consistent when the extrapolation function is correctly specified or sufficiently close to Pt (rffc ) . They conclude that a quadratic or loglinear extrapolation function +A works well in general for various models. Examples of MC-SIMEX plots illustrating the effect of increasing measurement error on logOR via different extrapolation functions are provided in Figures 6.2 and 6.4. The MC-SIMEX method can also be applied to differential misclassification problems. In the context to model relationship between binary covariate and response, a 4x4 misclassification matrix with separate matrices characterizing measurement error in case fix and control fl populations should be supplied to initiate the simulation and 0 27 Chapter 4. Simulation Extrapolation Approach extrapolation process, n= v o n (4.1) i y The misclassification matrices are usually either directly obtainable or can be estimated from the validation data. There are several candidates for the MC-SIMEX variance estimation var(J3 c). The M ease of parameter estimation in general regression context using SIMEX is somehow counterbalanced by the complex calculation of the standard errors. A bootstrap approach is outlined by Kuchenhoff et al. (2006), though the intensive computation required in simulation step brings difficulty to the implementation. Two other methods avoiding nested resampling are established by Stefanski and Cook (1995) and Carroll et al. (1996). The first one goes' closely with Tukey's jackknife standard error calculation, when the misclassification matrix is known or adequately estimated. The latter employs the asymptotic normal distribution and the asymptotic covariance matrix in the setting of M-estimation, and provides a more flexible estimation (Carroll et al., 2006). These two variance estimation methods are adapted in this thesis. The MC-SIMEX method is available in R after loading the "SIMEX" package that was developed by Lederer. The main function "mcsimex()" is modified to meet our needs to automate the estimation process of log odds ratio repeatedly under the differential misclassification. 28 Chapter 5 S i m u l a t i o n Studies 5.1 Data Simulation In order to demonstrate the performance of Bayesian adjustment to misclassificatibn and make comparison with other statistical approaches, we conduct simulation study under two cases. Two sets of fixed simulation parameters are assigned to cases 1 and 2 respectively. Four misclassification scenarios concerning different levels of misclassification differentiality are built to contrast statistical methods introduced in Chapter 2, 3 and 4. We generate 400 datasets (NREP=400) regarding each of the scenarios in each case. Data in scenario 1 are simulated under the nondifferential misclassification, with increasing degree of differentiality in scenarios 2, 3, and 4. Case 1: Simulation with equal numbers of cases and controls (N ca = Ncn t = 800), whereby each group consists of 25 percent of validation data and 75 percent of main (or misclassification prone) study data (JV£ = N v mt = 200, JV£ = N™ = 600). To achieve t the property that the odds ratio $ = -^p- for the true exposure T on the response Y is l-r 0 1.5, we have • Scenario 1: (r , r )=(0.075, 0.1084), (SN , SWi)=(0.6, 0.6), (SP , 0 1 0 0 SPi)=(0.9, 0.9) • Scenario 2: ( r , n)=(0.075, 0.1084), (SN , 5M)=(0.6, 0.65), (SP , 5Pi)=(0.9, 0 0 0 0.85) 29 Chapter 5. Simulation Studies • Scenario 3: (r , n)=(0.075, 0.1084), (SN , 5iV )=(0.6, 0.7), (SP , 5Pi)=(0.9,. 0 0 x 0 0.8) • Scenario 4: (r , ri)=(0.075, 0.1084), (SN , SWi)=(0.6, 0.75), (SP , SPi)=(0.9, 0 0 0 0.75) Case 2: Simulation with larger percentage of controls (./V = 1200) than cases (./V cnt ca = 400), while the ratio of validation data against main data remains 1:3 ( A ^ = 100, N^ t = 300, N£ = 300, N™ = 900). To achieve the property that the odds ratio $ for the t true exposure T on the response Y is 2, we have • Scenario 1: (r , ri)=(0.1, 0.1818), (SN , SWi)=(0.6, 0.6), (SP , SPi)=(0.85, 0.85) • Scenario 2: (r , n)=(0.1, 0.1818), (SN , 5iV )=(0.65, 0.6), (SP , 5Pi).=(0.8, 0.85) • Scenario 5: (r • Scenario ^: (r , ri)=(0.1, 0.1818), (SN , S'iVi)=(0.75, 0.6), (SP , 5Pi)=(0.7, 0.85) 0 0 0 0) 0 0 1 0 ^ = ( 0 . 1 , 0.1818), (SAT , SA^ )=(0.7, 0.6), (SP , 5P )=(0.75, 0.85) 0 0 1 0 0 1 0 To generate data in either case, we first simulate the true exposure status T (=0, 1) at Y = i from Bernoulli^), for i=0, 1, two binary quantities indicating the occurrence of misclassification in terms of sensitivity and specificity, of sample size iV or N . ca mt Using Equation 1.1, we then simulate the apparent (or surrogate) exposure measurements in two groups. Finally we cross tabulate the true and apparent exposure quantities of size N^ a or to construct the internal validation tables, and classify the remaining apparent exposure measurements to acquire the main study table (Table 1.1). Three Bayesian methods adopting nondifferential, nearly nondifferential and differential prior distributions respectively, are performed at each scenario within each case to adjust for possible misclassifications and assess the association between the true exposure and outcome. 30 Chapter 5. Simulation Studies 5.2 Choice of Hyperparameters Methodology about the MCMC method is described in details in Chapter 2. In simulation studies, wefirstspecify the hyperparameters appearing in prior distributions of the effect parameter of interest 6 — (r , r ,SNi, SPi), i = 0,1. According to Section 2.2, under the 0 x assumptions that p\ = p , o\ = 02, V\ = ^2, 7i = 72, i — T and Si = S , we assign r 2 2 2 p = — 2, a = 1 to model the prior information that the logit true exposures are normally distributed with 95% probability between logit(0.02) and logit(0.5). Mild correlation between r and r\ (pi = 0.3) is selected to allow relatively large standard deviation of 0 logOR around mean 0. Similarly, we set v — r = 1.7, 7 = 5 = 0.65 to represent the prior knowledge that the logit sensitivity and logit specificity are normally distributed within logit(0.6) and logit(0.95) with 95% probability. As discussed in Chapter 2, we set p = p = 1 to reflect nondifferential misclassification; p = p — 0 to express 2 3 2 3 prior belief in differential misclassification. The choice of p , p for nearly nondifferential 2 3 misclassification requires extra work. To reflect the priori that, the probability of having similar proportions of correctly measured exposure variable among cases and controls is about one quarter, i.e. P{\SNi - SN \ < 0.01} = P{\SPi - SP \ < 0.01} = 0.24, Q 0 a simulation study is conducted, and 0.9 is assigned to p and p . Alternatively, closed2 3 form solutions p = p = 0.9 can be obtained by solving the equation, 2 3 PUlogiHSNi) - logit(SN )\ < 0.1} = P{\logit(SP\) - logit{SP )\ < 0.1} = 0.1, 0 0 using expressions (2.3), (2.4) and the standard normal table. 31 Chapter 5. Simulation Studies 5.3 Convergence of M C M C Simulation The convergence and mixing of MCMC simulation should be checked before we make inference about the posterior distributions. Decisions are made based on visual inspection of simulation plots. After discarding the first 1000 simulations to diminish the effect of initial distributions, we draw samples from 10000 iterations, based on which statistical inferences are conducted. The last 2000 iterations of Markov chains for Ti, SN , SPi and t logOR at different scenarios (each represented by one out of 400 simulated datasets) in Case 1 are displayed in Figure 5.1 and 5.2. To be more specific, Figure 5.1 depicts the 8000-10000 realizations of a Markov chain simulated under the nondifferential misclassification assumption regarding a dataset in Case 1, scenario 1; Figure 5.2 depicts the 8000-10000 realizations of a Markov chain simulated under the differential misclassification assumption regarding a dataset in Case 1, scenario 4. It is observed that Markov chains generated by the "nondifferential", "nearly nondifferential" and "differential" Bayesian methods, under different prior distributions display satisfactory mixing and convergence. The Markov chains are moving thoroughly within the target range, and no chain is stuck at the similar values for successive iterations. Table 5.1 reports the acceptance rates based on 10000 posterior realizations over 400 replicated datasets at each scenario in Case 1. As described in section 2.6, the univariate proposal (or jumping) distributions of model parameters embedded in the Gibbs sampler, do not directly depend on preceding states, for information from the immediate previous states facilitates update of the unobserved cell counts in the main data Yunobs — and, the univariate candidate parameter value is simulated solely based on updated data and previous values of other parameters. Therefore, high acceptance rates displayed in Table 5.1 demonstrate, posterior Markov chain is moving around and adequacy of mixing is achieved. It is worth pointing out that, the jumping distributions 32 / Chapter 5. Simulation Studies proposed here do not involve tuning parameters as required in the random-walk MH algorithm. The typical challenge of adjusting variations (oVs) in the proposal distributions 81 ~ N(9 ,a .) (k = l,...,m), to maintain a moderate acceptance rate, is therefore t 2 k avoided. Meanwhile the Markov chains converge roughly to the target distributions afTable 5.1: Average acceptance rates over 400 replicates in Case 1 Nondifferential Bayesian method - scenario 1 SN S^ SPi ri SPo r Acceptance 0.973 0.950 0.728 0.728 0.977 0.977 0 0 Nearly nondifferential Bayesian method - scenario 2 SNo SNi n SPo SPi ro 0.742 0.741 0.668 Acceptance 0.976 0.947 0.540 Nearly nondifferential Bayesian method - scenario 3 SPi SN SPo ro Acceptance 0.977 0.944 0.485 0.665 0.572 0.602 0 Differential Bayesian method - scenario 4 SNo SPi SPo ro Acceptance 0.973 0.949 0.641 0.786 0.959 0.909 ter 1000 burn-in iterations. Posterior distributions of the exposure prevalences amongst diseased and non-diseased participants, and log odds ratio describing the association between the response and exposure, are illustrated by histograms of posterior samples (Figure 5.3 and 5.4). We hereby emphasize plots considering scenarios where models are consistent with the data-generation, though the simulations include inconsistent cases as well (e.g. differential model on nondifferentially misclassified data, or nondifferential model on differentially misspecified data). The pre-specified true exposure prevalence are 0.075 (control group), 0.1081 (case group) in Case 1, and 0.1 (control group), 0.1818 (case group) in Case 2. The simulation parameters along with the true log odds ratios (log(1.5) in Case 1 and log(2) in Case 2) are marked by vertical bars in the plots. It is shown in the plots that, posterior samples capture the true model parameters with high 33 Chapter 5. Simulation Studies accuracy. It is also noticeable that r and r\ are often estimated with bias towards the 0 same direction (i.e. whose medians located both below or above the vertical bars). This however leads to a more accurate estimate of logOR. Table 5.2: Posterior distributions at various scenarios in Case 1 with different prior information Nondifferential ri r Mean 0.075 0.102 SD 0.015 0.0169 5 %tile 0.052 0.076 95 %tile 0.101 0.131 0 th th prior at Case 1, scenario 1 logOR SNi SN SP! SP 0.901 0.901 0.706 0.706 0.353 0.011 0.011 0.063 0.063 0.265 0.883 . 0.883 0.599 0.599 -0.070 0.918 0.918 0.805 0.805 0.798 0 0 Nearly nondifferential prior at Case 1, scenario 2 SN SiVi SP SPi r Mean 0.884 0.847 0.777 0.785 0.053 0.098 SD 0.013 0.018 0.013 0.016 0.069 0.063 5 %tile 0.032 0.070 0.862 0.820 0.658 0.677 95 %tile 0.076 0.129 0.906 0.873 0.877 0.880 logOR 0.686 0.339 0.142 1.262 Nearly nondifferential prior at Case 1, scenario 3 SN . SNi SP n SPi ro Mean 0.060 0.091 0.884 0.798 0.715 0.750 0.014 0.018 0.013 0.017 0.074 0.064 SD 5 %tile 0.039 0.064 0.863 0.769 0.588 0.637 95 %tile 0.084 0.123 0.905 0.826 0.828 0.846 logOR 0.463 0.320 -0.049 0.995 0 0 0 th th Q 0 th th Differential prior at Case 1, scenario 4 SN SN . logOR SP SP n r 0.762 0.7520 0.529 0.898 0.645 0.055 0.089 0.360 0.014 0.019 0.013 0.020 0.103 0.077 0.614 -0050 0.034 0.060 0.876 0.730 0.468 1.128 0.080 0.123 0.920 0.795 0.805 0.864 0 Mean SD 5 %tile 95 %tile th th 0 r 0 X ; The sample mean, standard deviation and 90% credible interval (regarding 10000 post burn-in iterations) for each model parameter at different scenarios of Case 1 and Case 2 (each corresponding to one simulated dataset) are calculated under the appropriate prior assumptions regarding differentiality of misclassification. Results presented in Tables 5.2 34 Chapter 5. Simulation Studies Table 5.3: Posterior distributions at various scenarios in Case 2 with different prior information Nondifferential ri r Mean 0.104 0.162 SD 0.017 0.030 5 %tile 0.078 0.115 95 %tile 0.132 0.215 0 th th prior at Case 2, scenario 1 logOR SN SPo SP\ 0.828 0.828 0.629 0.629 0.511 0.013 0.013 0.060 0.060 0.280 0.806 0.806 0.529 0.529 0.049 0.849 0.849 0.727 0.727 0.975 0 Nearly nondifferential prior at Case 2, scenario 2 SPo SNo SNi SPi ro 0.072 0.152 0.804 0.861 0.627 0,608 Mean SD 0.014 0.029 0.013 0.019 0.071 0.073 5 %tile 0.050 0.106 0.783 0.828 0.506 0.486 95 %tile 0.095 0.202 0.826 0.892 0.742 0.727 th th logOR 0.840 0.310 0.323 1.344 Nearly nondifferential prior at Case 2, scenario 3 logOR SNo SN n SP SPi r Mean 0.103 0.150 0.775 0.820 0.770 0.751 0.426 SD . 0.017 0.028 0.016 0.023 0.054 0.063 0.292 5 %tile 0.076 0.106 0.749 0.782 0.676 0.642 -0.053 95 %tile 0.132 0.199 0.801 0.856 0.851 0.845 - 0.910 0 0 X th tft Mean SD 5 %tile 95 %tile th th Differential prior at Case 2, scenario 4 SNo SN! logOR n SPo SPi ro 0.110 0.196 0.697 0.861 0.765 0.758 0.674 0.017 0.032 0.017 0.027 0.058 0.069 0.274 0.084 0.145 0.669 0.815 0.663 0.636 0.221 0.140 0.250 0.725 0.904 0.855 0.863 1.116 and 5.3. Most sample means are sufficiently close to the true values of the parameters at different levels of exposure misclassification, except for the sensitives. The sensitivities in case and control groups are found larger than the true values in Cases 1 and 2. This should not downgrade the performance of our algorithm, because most 90% credible intervals cover the true values, except for SNi at scenario 2, Case 1 and a few exceptional observations in Case 2. To conclude, the Bayesian misclassification methods utilizing MCMC algorithm facilitate convergent posterior Markov chains with adequate mixing. 35 Chapter 5. Simulation Studies 8000 8500 9000 9500 10000 iteration Figure 5.1: MCMC mixing based on iterations 8000-10000 regarding a dataset in Case 1, Scenario 1, using the nondifferential Bayesian method 36 Chapter 5. Simulation Studies o n 8000 1 1 8500 9000 1 r 9500 10000 iteration i 1 1 8000 8500 9000 i 8000 1 1 8500 9000 ~i r 9500 10000 iteration 1 T 9500 10000 iteration Figure 5.2: MCMC mixing based on iterations 8000-10000 regarding a dataset in Case 1, Scenario 4, using the differential Bayesian method Chapter 5. Simulation Studies 0) CD C scenario 1: NDF scenario 1: NDF scenario 1: NDF o Dtl N O 0.04 0.08 0.12 3 m CD o • IT 0.06 0.10 0.14 r-ffl>i n rfffllllih^ ( -0.5 0.18 0.5 1.0 logOR scenario 2: NNDF :-di....Thr. r CNJ r 0.02 0.06 scenario 3: NNDF scenario 2: NNDF o o o I I I I I I I r I 0.04 0.08 0.12 0.16 0.10 -4 -0.5 0.5 1.5 logOR o o Tf P o •rffnTTlltrn 0.02 0.06 scenario 3: NNDF scenario 3: NNDF scenario 3: NNDF i s rr-ij q 1—I - 0.10 — 0.05" ^ 0.10^ c i0.15 — s J~TT-i in q ° -0.5 -— r^~T 0.5 i r 1.5 logOR scenario 4: DF <? 8 scenario 4: DF scenario 4: DF o o o CM T4 0.02 LL 0.06 0.05 0.10 0.10 0.15 l_,4HLv -0.5 0.5 1.5 logOR Figure 5.3: Posterior histogram ofr , ri and logOR based on 10000 iterations of Case 1 0 38 Chapter 5. Simulation Studies scenario 1: NDF CT CD o o ID CN _ q g> r—r O 0.06 scenario 1: NDF scenario 1: NDF 0.05 0.10 0.14 0.18 0.15 4fc o -0.5 0.25 0.5 1.5 logOR scenario 2: NNDF scenario 2: NNDF scenario 2: NNDF >. CD o i 0.04 i i 0.05 0.08 0.12 0.16 4HK 0.15 o y <=> § CM — CT <D O 0.25 ^ r -0.5 0.5 1.5 logOR scenario 3: NNDF scenario 3: NNDF scenario 3: NNDF o o o CN CO 0.04 fu o 0.08 0.12 0.16 0.10 0.20 o T-0.5 0.5 1.5 logOR scenario 4: DF scenario 4: DF ^1 n-i CD o T 0.05 j f | l ih-! 0.10 0.15 scenario 4: DF w c S n | o-E , r-rjl 0.05 0.15 p_ -Dp^T — I 0.25 0.35 CD 3 CT 0) -0.5 0.5 tu 1.5 logOR Figure 5.4: Posterior histogram ofr , r and logOR based on 10000 iterations of Case 2 0 x 39 Chapter 5. Simulation Studies Tables 5.2 and 5.3 also suggests that standard deviations increase with the inflation of differential misclassification. This is not unexpected because under nondifferential misclassification, data concerning cases and controls are pooled together to estimate the misclassification parameters. Whereas under differential misclassification, sensitivities and specificities are estimated separately according to data in each arm, hence resulting in less precise estimates (i.e., with larger variance). 5.4 Comparison with results obtained using M L E and MC-SIMEX In this section, we study the comparative performance of Bayesian methods, maximum likelihood estimation methods and simulation extrapolation methods in several misclassification situations. Three methods under the assumptions of nondifferential misclassification (NDF), nearly nondifferential misclassification (Nearly NDF, only applicable to Bayesian method), and differential misclassification (DF) are applied simultaneously to data simulated in section 5.1 (4 scenarios in two cases). The estimated mean squared error (MSE), defined as is calculated to assess the goodness of the point estimator for 4> = logOR, by incorporating both the precision (via variance) and accuracy (via bias) of <f>. NREP=400 represents the number of repeated datasets generated in each scenario. The coverage proportion of the logOR and average width of the 90% credible interval using Bayesian methods and the 90% confidence intervals using M L E or MC-SIMEX are also reported to evaluate the accuracy and precision of the effect estimator qb (Table 5.4, 5-5 , 5.6 and 5.7 for Case 1 40 Chapter 5. Simulation Studies and Table 5.8, 5.9 and 5.10 for Case 2). Table 5.4 shows the nondifferential Bayesian method has the best performance when the data is truly nondifferentially misclassified. It produces estimator with smallest bias, variation and overall error rate compared to other Bayesian methods in scenario 1. However, the related coverage proportion drops dramatically as the exposure misclassification gets more differential. The fact that the empirical coverage probability drops to 0.06 in scenario 4 illustrates the poor reliability of the nondifferential. Bayesian method, for the level of differentiality involved in the misclassification is seldom well understood in practice. The differential misclassification Bayesian method behaves better than the others in terms of smaller estimated MSEs and larger coverage proportions, when data are highly differentially misclassified (scenario 3 and 4), which agrees to our expectation, fn general, the differential Bayesian method is not as efficient under nondifferential misclassification, because data are split to two groups and estimations are accomplished on separate subsets. Nevertheless, it generates reliable estimations as misclassification deviates from complete nondifferential condition without losing much efficiency. The nearly nondifferential Bayesian method performs well in scenario 2 where departure from the nondifferential misclassification is mild. Separate estimation of sensitivities or specificities improves the accuracy of </>. Meanwhile by posing large positive correlation between SWjS and SP s, it "borrows strength" from measurements in both groups and produces t small variance and MSE. The MSE, coverage proportion and average width of credible interval associated with the nearly nondifferential method are always in between the other two Bayesian methods. It is a method worth exploring in practice when the mechanism of misclassification is not fully specified. In Table 5.5, we observe analogous phenomena comparing the performances of nondifferential (NDF) MLE and differential (DF) MLE. More accurate and precise estimation of logOR with smaller MSE, greater empirical coverage rate and shorter confidence in41 Chapter 5. Simulation Studies terval is supplied by NDF ML method in scenario 1. On the other hand, the estimates are far away from the target when the actual misclassification rates differ between cases and controls. The DF ML method yields consistent estimates with satisfactory coverage proportions (> 88%) over all situations, and is especially superior at more differential scenarios such as 3 and 4. It is also found that MLEs are systematically associated with slightly greater MSE and variation compared with the Bayesian estimates in Table 5.4, which may result from the utilization of appropriate prior information. Results obtained using differential (DF) and nondifferential (NDF) MC-SIMEX methods are summarized in Table 5.6 and 5.7. For each simulated dataset, the estimated effect parameter <f> is acquired via a quadratic extrapolation function and a loglinear extrapolation function, L uad(^) q L (X) log = a + «iA + a A , 2 0 = e 2 '\ ao+a The variance for each 4> is estimated using the asymptotic theory and Jackknife method. Tables 5.6 and 5.7 hence summarize the performance of MC-SIMEX methods at these four combinations. Similar to Bayesian and MLE methods, the NDF MC-SIMEX works better for the nondifferential misclassification scenario, but is off-target with differential measurement errors. In Case 1, <> /s achieved during reclassification at various values of A are on average better fit by the quadratic parametric function, leading to relatively more accurate estimates of <fi without measurement error. This is reflected by greater coverage proportions in Table 5.6 and 5.7. We also observe that estimates using either extrapolation functions bear same MSEs in each scenario, while Tukey's Jackknife variance is consistently smaller than asymptotic variance when quadratic function is applied. As the size of measurement error represented by SNi ano SPi is not totally specified 42 Chapter 5. Simulation Studies in this study, Tukey's Jackknife method is preferred to accommodate situations where variation of SNi, SPi is not negligible (Carroll et al., 2006). Compared with intervals by first two methods, the 90% confidence intervals generated by MC-SIMEX methods have apparently smaller probabilities to cover the true effect parameter logOR. Table 5.4: MSEs, empirical coverages and average widths of 90% credible intervals for logOR using Bayesian methods based on datasets simulated in Case 1 scenario 1 MSE Coverage Width MSE scenario .2 Coverage Width MSE scenario 3 Coverage Width MSE scenario 4 Coverage Width NDF Nearly NDF 0.0869 0.0773 0.8975 0.9075 0.9247 0.9947 0.0959 0.1801 0.9000 0.6825 1.01327 0.9325 0.4798 0.1230 0.8550 0.2400 1.0308 0.9295 0.8759 0.1545 0.0600 0.8000 1.0478 0.9249 DF 0.0993 0.8875 1.0279 0.0986 0.9075 1.0380 0.0946 0.8950 1.0501 0.1071 0.8875 1.0564 Similar findings present in Case 2. For in each simulated dataset, the number of controls are three time of the number of cases, data from the control group have more impact on the estimation of model parameters such as the exposure prevalences ro, ri and the effect parameter logOR. This results in greater coverage proportions than those in Case 1 when NDF methods (Bayesian or MLE or MC-SIMEX) are incorrectly applied to the differentially misclassified data (scenario 2, 3 and 4). It is also observed that the DF MC-SIMEX method fails at some replicative datasets in scenario 2. Two possibilities may cause this problem. First, the estimated misclassification matrix T I (Equation (4.1) in Chapter 4) does not exit. This may be caused by estimation of the original misclassification matrix IT using the validation data. Kiichenhoff et al. (2006) suggested use a refined approximation method (Israel et al, 2001) to estimate n . Second, a few A 43 Chapter 5. Simulation Studies Table 5.5: MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using M L E methods based on datasets simulated in Case 1 scenario 1 MSE Coverage Width scenario 2 MSE Coverage . Width scenario 3 MSE Coverage Width scenario 4 MSE Coverage Width NDF 0.0924 0.9100 0,97706 0.2258 0.6650 0.9849 0.5776 0.2125 0.9800 1.0361 0.0575 0.9765 DF 0.1224 0.8875 1.1062 0.1226 0.9075 1.1150 0.1176 0.8975 1.1235 0.1338 0.8875 1.1301 modifications are made on function "mcsimex()" to fulfil our needs to automate the estimation process of log odds ratio repeatedly under the differential misclassification. Although the performance of the modified function under boundary conditions (large misclassification rates) are not completely tested, we believe the chance that it causes a problem is very tiny. 44 Chapter 5. Simulation Studies Table 5.6: MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MC-SIMEX under nondifferential misclassification based on datasets simulated in Case 1 scenario 1 scenario 2 scenario 3 scenario 4 MSE Coverage Width MSE Coverage Width MSE Coverage Width MSE Coverage Width Quadratic extrapolation Asymptotic Jackknife variance variance 0.0810 0.0810 0.8875 0.8025 0.8929 0.7300 0.3414 0.3414 0.2450 0.3600 0.8704 0.6999 1.1960 1.1960 0.0100 -0.0025 0.8561 0.6788 2.4902 2.4902 0 0 0.8607 0.6718 Loglinear extrapolation Asymptotic Jackknife variance variance 0.1078 0.1078 0.7550 0.7350 0.7300 0.9317 0.6136 0.6136 0.3175 0.1650 1.0617 0.6999 1.7361 1.7361 0.005 0 1.0415 0,6788 3.5249 3.5249 0 0 0.6718 1.0683 Table 5.7: MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MC-SIMEX under differencial misclassification based on datasets simulated in Case 1 scenario 1 scenario 2 scenario 3 scenario 4 Quadratic extrapolation Asymptotic Jackknife variance variance MSE 0.1502 0.1502 Coverage 0.7700 0.6500 Width 0.8950. 0.7278 MSE 0.1226 0.1226 Coverage 0.6800 0.7875 Width 0.6937 0.8637 MSE 0.1423 . . 0.1423 Coverage 0.7175 0.6300 Width 0.8319 0.6746 MSE . 0.1920 0.1920 Coverage 0.6075 0.5200 Width 0.8194 0.6590 Loglinear extrapolation Asymptotic Jackknife variance variance 0.0829 0.0829 0.8550 0.6600 0.6227 0.7278 0.0877 0.0877 0.8625 0.8325 0.6937 0.5673 0.1992 0.1992 0.2800 0.5800 0.5204 0.6746 0.2995 0.2995 0.0900 0.0025 0.4734 0.6590 45 Chapter 5. Simulation Studies Table 5.8: MSEs, empirical coverages and average widths of 90% credible intervals for logOR using Bayesian methods based on datasets simulated in Case 2 scenario 1 scenario 2 scenario 3 scenario 4 MSE Coverage Width MSE Coverage Width MSE Coverage Width MSE Coverage Width NDF Nearly NDF 0.0832 0.0703 0.9000 0.8925 0.8734 0.9393 0.1624 0.1049 0.8650 0.7300 0.9157 0.9536 0.3317 0.1318 0.8100 0.4400 0.9624 0.9471 0.5852 0.1792 0.7125 0.2375 0.9722 0.9675 DF 0.0965 0.8900 0.9627 0.0977 0.8950 0.9643 0.0969 0.8900 0.9661 0.1035 0.9025 0.9712 Table 5.9: MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MLE methods based on datasets simulated in Case 2 scenario 1 MSE Coverage Width scenario 2 MSE Coverage Width scenario 3 MSE Coverage Width scenario 4 MSE Coverage Width NDF 0.0771 0.8925 0.9079 0.1452 0.8050 0.9595 0.2945 0.5500 0.9956 0.5429 0.3425 1.0222 DF 0.1055 0.8850 1.0215 0.1080 0.8950 1.0211 0.1025 0.9050 .1.0211 0.1006 0.9000 1.0255 46 Chapter 5. Simulation Studies Table 5.10: MSEs, empirical coverages and average widths of 90% confidence intervals for logOR using MC-SIMEX under nondifferential misclassification based on datasets simulated in Case 2 scenario 1 scenario 2 scenario 3 scenario 4 MSE Coverage Width MSE Coverage Width MSE Coverage Width MSE Coverage Width Quadratic extrapolation Asymptotic Jackknife variance variance 0.1247 0.1247 0.7950 0.6650 0.8890 0.7193 0.7746 0.7746 0.0200 0.0550 0.7172 0.9058 1.9450 1.9450 0 0 0.7145 0.9165 3.4673 3.4673 0 0 0.9236 0.7099 Loglinear extrapolation Asymptotic Jackknife variance variance 0.1725 0.1725 0.6575 0.7800 0.7198 1.0223 0.6240 0.6240 0.0575 0.0950 0.7172 0.6418 1.2375 1.2375 0 0 0.7145 0.4893 1.8875 1.8875 0 0 0.4382 0.7099 47 Chapter 6 Application in Epidemiological Studies 6.1 The study of sudden infant death syndrome To further illustrate how Bayesian methods, MLE methods and MC-SIMEX methods work in practice, we consider a case-control study on sudden infant death syndrome (SIDS) (Kraus, Greenland and Bulterys, 1989). It is of interest to investigate the association between the use of antibiotics during pregnancy and the occurrence of SIDS. The surrogate exposure quantity about maternal use of antibiotic (X) was measured by interview question (yes=l, no=0). Information from medical records (T) was extracted to conduct an internal validation study to assess the misclassification probabilities. The complete data is shown in Table 6.1. Ignoring possible measurement errors, the X — Y log odds ratio is estimated as 0.352 with 90% confidence interval (0.141, 0.563). Table 6.1: Validation study and main study of SIDS Main data Validation data Y=0 Y=l T X=l X=0 X=l X=0 Y X=l X=0 T=l 29 17 21 16 Y=l 122 442 T=0 22 143 12 168 Y=0 101 479 48 Chapter 6. Application in Epidemiological Studies Table 6.2: Estimates of model parameters in SIDS study ri SP SPi SN SNi 0 0 NDF Bayesian Mean SD Rejection 0.1593 0.0200 0.0419 0.2194 0.0232 0.0588 0.9017 0.0122 0.1987 0.9017 0.0122 0.1987 0.0241 0.6255 0.0450 0.0241 0.6255 0.0450 n SPo SPi SNo SNi Nearly NDF Bayesian Mean SD Rejection 0.1675 0,0215 0.0481 0.0562 0.2098 0.0234 0.9134 0.0140 0.2817 0.8889 0.01661 0.1959 0.6181 0.0559 0.2678 0.6333 0.0515 0.2529 NDF Mean 0.1634 0.2253 0.6031 0.6031 0.9028 0.9028 MLE SD 0.1981 0.2649 0.6810 0.6810 0.9241 0.9241 Mean 0.1701 0.2013 0.9216 0.8803 0.6378 0.6528 DF Bayesian SD Rejection 0.0222 0.0496 0.0236 0.05890.2794 0.0155 0.0184 0.2717 0.0632 0.0914 ' 0.0588 0.0199 DF MLE SD Mean 0.1793 0.0234 0.2095 0.0254 0.5966 0.0695 0.6060 0.0653 0.9255 0.0172 0.8782 0.0197 Table 6.3: logOR, SD and 90% intervals for logOR in SIDS study NDF Bayesian Nearly NDF Bayesian DF Bayesian NDF MLE DF MLE NDF MC-SIMEX quadratic, asymptotic quadratic, Jackknife loglinear, asymptotic loglinear, Jackknife DF MC-SIMEX quadratic, asymptotic quadratic, Jackknife loglinear, asymptotic loglinear, Jackknife log(OR) 0.3967 0.2791 0.2086 0.3983 0.1927 0.6131 0.7896 0.0365 0.2701 SD 90% intervals 0.1875 (0.0887, 0.7079) 0.2047 (-0.0576, 0.6135) 0.2154 (-0.1450, 0.5650) 0.1909 (0.08442, 0.7123) 0.2212 (-0.1711, 0.5566) 0.2265 0.1772 0.2579 0.1772 (0.2406, 0.9856) (0.3217, 0.9045) (0.3654, 1.2138) (0.4981,1.0810) 0.2221 0.1794 0.1090 0.1794 (-0.3288, 0.4019) (-0.2585, 0.3316) (0.0908, 0.4494) (-0.0249, 0.5651) 49 Chapter 6. Application in Epidemiological Studies 8000 8500 9000 9500 10000 8000 8500 iteration 8000 8500 9000 H 8500 1 9000 9500 10000 iteration 9500 10000 iteration i 8000 9000 8000 8500 9000 9500 10000 iteration 1 T 9500 10000 Figure 6.1: MCMC mixing based, on iterations 8000-10000 using Nearly NDF Bayesian method (SIDS study) Chapter 6. Application in Epidemiological Studies 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 (c)(1+X) (d)(1+X) Figure 6.2: Plots of the estimated logOR as a function of misclassification size A in SIDS study. The upper-left panel is based on a quadratic extrapolation subject to NDF MC-SIMEX. The upper-right panel is based on a loglinear extrapolation subject to NDF MC-SIMEX. The lower-left panel is based on a quadratic extrapolation subject to DF MC-SIMEX. The lower-right panel is based on a loglinear extrapolation subject to DF MC-SIMEX. 51 Chapter 6. Application in Epidemiological Studies The Markov chains generated by Bayesian methods present adequate mixing and convergence. Figure 6.1 displays posterior simulations over the last 2000 iterations when the Nearly ND method is applied. Study results after adjustment for misclassification are presented in Tables 6.2 and 6.3. The sample means and 90% intervals regarding model parameters (r , r\, SNi, SPi) obtained by NDF or DF Bayesian approach are 0 close to those estimated by corresponding MLE method. Standard errors generated from Bayesian and MLE methods are similar to one another. Parameters estimated under DF misclassification assumptions are found more variable than those derived under NDF or nearly NDF situations, which is consistent withfindingsfrom simulation studies. The point estimates of logOR in form of posterior mean are greater than zero, when the Bayesian, MLE or NDF MC-SIMEX methods are applied. It appears that log(OR) by NDF Bayesian or NDF MLE is twice greater than that obtained by DF Bayesian or DF MLE algorithm. Given the equality of misclassification rates in case and control groups, data in Table 6.1 suggest a positive association between the use of antibiotic and subsequence incidence of SIDS, for 0 is not contained in the 90% credible or confidence intervals generated by NDF methods. The estimates of log(OR) returned by MC-SIMEX are somehow distinguished from those returned from Bayesian or MLE method. It is depicted in Figure 6.2 that, log odds ratio Pi k decreases with A^ if the measurement of drug use is nondifferentially misclast sified, but increases with greater size of misclassification when SNi or SPi differentiates between two groups. As a result, combining the trajectory of the points, a quadratic extrapolation function produces a smaller positive log odds ratio at the origin under NDF and a smaller negative log odds ratio crossing the origin under DF. The existence of misclassification matrix n A is fulfilled at A = 0.5, 1, 1.5, 2, and two variance calcu- lations are accomplished at each misclassification level. Because SNi and SPi required in MC-SIMEX are estimated from the validation study, and their variabilities are not 52 Chapter 6. Application in Epidemiological Studies negligible, we consider the Jackknife variance as a more appropriate method to apply (Kiichenhoff et al., 2006). One drawback of MC-SIMEX method is, it does not provide approximation of other parameters such as the exposure prevalence or misclassification rates directly. In addition, a known or adequately estimated misclassification matrix JI is required to ensure the performance of the algorithm. 53 Chapter 6. Application in Epidemiological Studies 6.2 The study of invasive cervical cancer A brief profile of the case-control study investigating the impact of herpes simplex virus type 2 (HSV-2) in the progression of invasive cervical cancer is provided in Chapter 1. We adopt this study as an example to demonstrate the effectiveness of Bayesian adjustments when the differentiality of misclassification is borderline. Data for the cervical cancer study is shown in Table 6.4. It is noticeable from the validation and main data that the exposure prevalence of HSV-2 is high in both cases and controls. Table 6.4: Validation data and main data for cervical cancer study Main data Validation data Y=0 Y=l T X=l x=o X=l x=o Y X=l X=0 T=l 18 5 16 16 Y=l 375 318 T=0 3 13 11 33 Y=0 535 701 Ignoring measurement error arising from the inaccurate western blot procedure, the naive log odds ratio is estimated as 0.4529 (standard error = 0.0928), with 95% confidence interval (0.300, 0.606), indicating HSV-2 is positively correlated with the occurrence of invasive cervical cancer. We conduct Bayesian adjustment subject to three misclassification situations (NDF, nearly NDF and DF) via different prior distributions. Mild prior association (p = 0.3) is imposed for exposure prevalences in cases and controls. The x randomness and differentiality of misclassification parameters are reflected by hyperparameters, the same way as in simulation studies: p = p = 1 for NDF misclassification; 2 3 P2 — Pz = 0 for DF misclassification; and p = p = 0.9 for nearly NDF misclassification. 2 3 The mixing and convergence of posterior Markov chains are satisfactory: Figure' 6.3 depicts prior densities and posterior simulations regarding ro, r\ and logOR based on 10000 posterior realizations (after 1000 burn-in iterations), under different misclassifi5,4 Chapter 6. Application in Epidemiological Studies cation assumptions. The same prior distributions of ro, r and logOR are applied to x various differentially conditions. Although exposure prevalences seem to be underestimated in prior densities, the posterior distributions incorporate information from the data and prior beliefs and lead to greater, distinct exposure prevalences and positive log odds ratio. Figure 6.3 illustrates that, the resulting estimates obtained using Bayesian adjustment are reasonably robust to prior knowledge, for posterior distributions generated across three conditions are sufficiently close to one another. Yet the analogous shapes of the prior and posterior densities suggest that, the prior knowledge still plays a non-negligible role in generating the resulting models. Furthermore, it is noticeable that, as the measurement error becomes more differential, fo increases and f\ decreases slightly, resulting in small decrement of logOR, which is consistent with findings from simulation studies. Table 6.5 presents posterior means, posterior standard deviation and 90% credible intervals (or 90% confidence intervals for non Bayesian models) for the log odds ratio describing how herpes simplex virus type 2 affect development of cervical cancer. We observe that, exposure of HSV-2 is positively associated with a growing risk of developing cervical cancer. As expected, the Bayesian and MLE adjustments in general produce similar logOR under various misclassification assumptions. Effect estimates derived assuming more differential misclassifications move towards the unity of odds ratio, with larger variability. MC-SIMEX on the other hand generates bigger correlation, if the inaccuracy arising in western blot procedure does not depend on the incidence of cervical cancer; and smaller association if the measurement error is nondifferential. Extrapolations based on the quadratic and log linear models are displayed in Figure 6.4. As Carroll, Gail and Lubin (1993) pointed out, there is moderate evidence to show measurement error is differential across cases and controls. Sensitivities estimated from validation data alone are 0.78 for cases and 0.5 for controls (two sided Fisher's exact 55 Chapter 6. Application in Epidemiological Studies producing p-value=0.047). Nevertheless, if both the complete and incomplete data are considered, a likelihood ratio test for the nondifferentiality of misclassification with 2 degrees of freedom, generates a p-value at 0.073, indicating lack of evidence to reject the null at 5% significance level. Hence, it is more appropriate to interpret the differentiality of measurement as borderline. One advantage of Bayesian adjustment emerges in this context, as it incorporates the "in-between" scenario into consideration via nearly nondifferential prior densities. By imposing reasonably large correlation coefficients (p , ps 2 < 1) between sensitivities or specificities in cases and controls, the nearly ND Bayesian method compromises between loss of efficiency and accuracy associated with the "fully" differential and nondifferential adjustment methods. Furthermore, the sensitivities and specificities estimated via MLE or Bayesian methods are found consistent with the conditional probabilities of measurement errors calculated using the generalized latent variable modeling technique of Skrondal and Rabe-Hesketh (2004). Table 6.5: logOR, SD and 90% intervals for logOR in cervical cancer study NDF Bayesian Nearly NDF Bayesian DF Bayesian NDF MLE DF MLE NDF MC-SIMEX quadratic, asymptotic quadratic, Jackknife loglinear, asymptotic loglinear, Jackknife DF MC-SIMEX quadratic, asymptotic quadratic, Jackknife loglinear, asymptotic loglinear, Jackknife log(OR) 0.8665 0.6527 0.5417 0.9579 0.6081 SD 0.2047 0.2700 0.2938. 0.2366 0.3503 90% interval (0.5434, 1.2175) (0.2303, 1.1066) (0.0462, 1.0200) (0.5688, 1.3471) (0.0318, 1.1843) 0.9106 0.1845 0.1545 0.2426 0.1545 (0.6071, 1.2141) (0.6566, 1.1647) (0.7139, 1.5120) (0.8589, 1.3670) 0.1721 0.1327 0.1010 0.1327 (-0.1737, 0.3923) (-0.1090, 0.3276) (0.2678, 0.5999) (0.2156, 0.6521) 1.1130 0.1093 0.434 56 Chapter 6. Application in Epidemiological Studies NDF Bayesian adjustment o Nearly NDF Bayesian adjustmc DF Bayesian adjustment 0.0 0.4 r 0.8 0 0.0 0.4 0.8 -4 H 0 2 4 logOR Figure 6.3: Prior and posterior distributions of r , r and logOR subject to three misclassifications. Q v 57 Chapter 6. Application in Epidemiological Studies 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 (c)(1+X), (d)(1+X) Figure 6.4: Plots of the estimated logOR as a function of misclassification size A in cervical cancer study. The upper-left panel is based on a quadratic extrapolation subject to NDF MC-SIMEX. The upper-right panel is based on a loglinear extrapolation subject to NDF MC-SIMEX. The lower-left panel is based on a quadratic extrapolation subject to DF MC-SIMEX. The lower-right panel is based on a loglinear extrapolation subject to DF MC-SIMEX. 58 / Chapter 7 Conclusion and Future Work In this dissertation, we concentrate on comparing three adjustment models for binary exposure measurement error in case-control studies. When potential measurement error on the exposure is "ignored", statistical assessment of the impact of the exposure variable on a dichotomous health related outcome is misleading. The direction towards which the association between the actual but unobservable explanatory variable and the response is biased, is unpredictable in some circumstances. A Bayesian model is developed to account for various types of measurement error on the covariate (i.e., nondifferential misclassification, nearly nondifferential misclassification and differential misclassification), by incorporating randomness of the exposure prevalences, sensitivities and specificities amongst cases and controls via prior distributions. The concepts and fundamental techniques of Bayesian paradigm are reviewed in Chapter 2. A hybrid algorithm alternating the Gibbs sampler and Metropolis-Hastings algorithm is proposed to sample the parameters of interest (n, SN, SPi) i=0, 1 from the posterior distributions. Statistical inference about the odds ratio describing the correlation between the exposure and response, is made based on posterior realizations of model parameters after burn-in iterations. Other models previously developed to account for misclassification problems include the maximum likelihood estimation and simulation extrapolation method (Chapters 3 and 4). In connection to our research, we contrast model parameters and log odds ratio estimated in the Bayesian model against those generated by other methods under different misclassification scenarios, to investigate the overall 59 Chapter 7. Conclusion and Future Work performances of the adjustment models. Two simulation studies composed of four scenarios with gradually increasing levels of differentiality of misclassification are conducted in Chapter 5. Markov chains simulated using the Bayesian model display adequate convergence and mixing, when the prior distributions are consistent with data-generation. Our choice of the univariate proposal . distribution in the M H algorithm is proportional to the likelihood function and does not depend on the current parameter value. High acceptance rates are observed in both cases, indicating the proposed hybrid algorithm is efficient. When data are simulated using equal sensitivities and specificities in diseased and non-diseased populations, the nondifferential (ND) prior behaves the best, for logOR returned by the corresponding posterior model is associated with the smallest MSE, least variance and greatest coverage proportion. The differential (DF) Bayesian adjustment model performs better when the measurement error on the exposure is highly differential. The nearly D F prior performs well when the influence of disease status on the exposure misclassification is mild. A l though separate estimation of the sensitivities and specificities under the D F assumption improves accuracy of the estimates, the D F Bayesian model has a trade-off of loss of efficiency when the exposure is actually nondifferentially misclassified. The nearly D F Bayesian model to some extent, balances the loss of efficiency and accuracy, by posing large yet less than one, positive correlation between the SN^s and SPi's. It is found that, the MSE, coverage proportion and average width of the 90% credible intervals obtained using the nearly ND prior distributions, are always between those associated with the completely differential or nondifferential prior densities. Hence in situations where the mechanism of misclassification is uncertain or the differentiality is borderline, the nearly ND Bayesian adjustment should be adopted to reveal the actual relationship between the disease and actual exposure variable. Analogous phenomena regarding the performances of NDF vs. D F are detected in the 60 Chapter 7. Conclusion and Future Work maximum likelihood estimates (MLE) and MC-SIMEX estimates. As expected, coverage proportions of the true log odds ratio associated with 90% confidence or credible intervals using the appropriate M L E or Bayesian models, are satisfactory (89%) and similar to one another. On the other hand, we notice that MLEs are associated with systematically bigger M S E and interval width, compared with the Bayesian.estimates. One possible explanation is that, the properly specified prior distribution helps to reduce the variability of the effect estimator and improves the efficiency of Bayesian model. As no middle-ground scenario exists in the M L E approach, logOR is estimated either through a NDF stream or a D F stream. A two-step M L E procedure, sometimes can be adapted to recover the potentially biased effect estimator. A hypothesis test for the nondifferential misclassification (the null) can first be conducted using some standard procedures, for instance, the likelihood ratio test with two degrees of freedom, based on the validation data and main data. Under guidance of the testing result, the M L estimate of logOR subject to DF or N D F measurement error, will be reported as the final answer. Unfortunately, for situations where the evidence to reject the null is merely mild, say p-value is around 0.05, the two-stage procedure is incapable of providing a reliable estimate. The extrapolated log odds ratio at zero measurement error varies with the choice of extrapolation function in a MC-SIMEX model. Although Kiichenhoff, Mwalili and Lesaffre (2006) showed through some concrete examples that, the quadratic and log linear functions provided good approximation, in extrapolating logOR back to the origin of no measurement error, we observe these functions sometimes return substantially distinct logOR with variable coverage proportions. The subjective choice of an extrapolation function brings difficulty in determining an unique effect estimate in epidemiological applications and makes result interpretation ambiguous. Moreover, even if the differentiality of exposure misclassification is correctly specified, log odds ratios estimated via MC-SIMEX models tend to be more variable, having lower coverage rates and bigger 61 Chapter 7: Conclusion and Future Work MSEs, compared with the alternatives. The Bayesian adjustment model proposed in this dissertation can be naturally extended to other related areas. One immediate extension arises in the appearance of additional correctly recorded explanatory variables, in addition to the binary response and error-prone exposure discussed in this thesis. Previous studies show that, the impact of measurement error is more difficult to predict and the adjustment protocol becomes more complex in this case. Gustafson (2004) investigated other situation where two surrogates are employed to measure the unobserved true exposure. The likelihood functions utilized in the MLE and Bayesian models need to be reconstructed to capture information emerging from the dual error-prone covariates. Special attention should be paid to the sensitivities and specificities associated with each surrogate variable, conditioning on the binary disease status. Another interesting example is in investigation of misclassification on an exposure with more than 2 categories, for example, the exposure evaluation among three populations. It is then more appropriate to construct a posterior distribution based on a Multinomial likelihood instead of a Binomial sampling distribution. Eventually, the measurement error problems arising from combination of the above scenarios are worth exploring. Further research should be conducted to improve the validity of scientific findings in epidemiological studies. 62 Bibliography Barron, B. A. (1977). The effects of misclassification on the estimation of relative risk, Biometrics 33: 414-418. Bashir, S. A. and Duffy, S. W. (1997). The correction of risk esitmates for measurement error, Annals of Epidemiology 7: 154-164. Beaton, G. H., Milner, J. and Little, J. A. (1979). Sources of variation in 24-hour dietary recall data: implications for nutrition study design and interpretation, American Journal of Clinical Nutrition 32: 2546-2559. Byrd, R. H., Lu, P., Nocedal, J. and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization, Scientific Computing 16: 1190-1208. Carroll, R. J., Gail, M. H. and Lubin, J. H. (1993). Case-control studies with errors in covariates, Journal of the American Statistical Association 88: 185-199. Carroll, R. J., Kiichenhoff, H., Lombard, F. and Stefanski, L. A. (1996). Asymptotics for the simex estimator in nonlinear measurement error models, Journal of the American Statistical Association 91: 242-250. Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models, Vol. 105 of Monographs on Statistics and Applied Probability, second edn, Chapman & Hall/CRC, Boca Raton. 63 Bibliography Casella, G. and Berger, R. L. (2002). Statistical Inference, V o l 6 of Duxbury advanced series, second edn, Duxbury. Cook, J. R. and Stefansk, L. A. (1994). Simulation-extrapolation estimation in parametric measurement error eodels, Journal of the American Statistical Association 89: 1314— 1328. Eckert, R. S., Carroll, R. J. and Wang, N . (1997). Transformations to additivity in measurement error models, Biometrics 53: 262-272. Greenland, S. and Gustafson, P. (2006). Accounting for independent nondifferential misclassification does not increase certainty that an observed association is in the correct direction, American Journal of Epidemiology 164: 63-68. Gustafson, P. (2004). Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments, Vol. 13 of Interdisciplinary Statistics, Chapman & Hall/CRC, Boca Raton. Hastings, W. K . (1970). Monte carlo sampling methods using markov chains and their applications, Biometrika 57: 97-109. Hildesheim, A . , Mann, V . , Brinton, L. A., Szklo, M . , Reeves, W. C. and Rawls, W. E. (1991). Herpes simplex virus type 2: a possible interaction with human papillomavirus type 16/18 in the development of invasive cervical cancer, international Journal of Cancer 49: 335-340. Israel, R., Rosenthal, J. and Wei, J. (2001). Finding generators for markov chains via empirical transition matrices, with applications to credit ratings, Mathematical Finance 11: 245-265. 64 BibliographyJones, D. Y . , Schatzkin, A . , Green, S. B., Block, G., Brinton, L. A . , Ziegler, R. G., Hoover, R. and Taylor, P. R. (1987). Dietary fat and breast cancer in the national health and nutrition survey 1: epidemiologic follow-up study, Journal of the National Cancer Institute 79: 465-471. Kraus, J. F., Greenland, S. and Bulterys, M . (1989). Risk factors for sudden infant death syndrome in the us collaborative penrinatal project, International Journal of Epidemiology 18: 113-120. Kiichenhoff, H. and Carroll, R. J. (1997). Segmented regression with errors in predictors: semiparametric and parametric methods, Statistics in Medicine 16: 169-188. Kiichenhoff, H . , Mwalili, S. M . and Lesaffre, E. (2006). A general method for dealing with misclassification in regression: the misclassification simex, Biometrics 62: 85-96. Lyles, R. H . (2002). A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure, Biometrics 58: 1034-1037. Marshall, R. J. (1990). Validation study methods for estimating exposure proportions and odds ratios with misclassified data, Journal of Clinical Epidemiology 43: 941-947. Metropolis, N . , Rosenbluth, A . W., Rosenbluth, M . N . , Teller, A . H . and Teller, E. (1953). Equations of state calculations by fast computing machines, Mournal of Chemical Physics 21: 1087-1092. Morrissey, M . J. and Spiegelman, D. (1999). Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons, Biometrics 55: 338-344. Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: multilevel, longitudinal, and structural equation models, Chapman & Hall/CRC, Boca Raton. 65 Bibliography Stefanski, L. A. and Cook, J. R. (1995). Simulation-extrapolation: the measurement error jackknife, Journal of the American Statistical Association 90: 1247-1256. Walter, S. D. and Irwig, L. M. (1987). Estimation of test error rates, disease prevalence and'relative risk from misclassified data: a review, Journal of Clinical Epidemiology 41: 923-937. Wu, M. L., Whittemore, A. S. and Jung, D. L. (1986). Errors in reported dietary intakes, American Journal of Epidemiology 124: 826-835. 66
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Bayesian adjustment for exposure misclassification...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Bayesian adjustment for exposure misclassification in case-control studies Chu, Rong 2007-12-31
pdf
Page Metadata
Item Metadata
Title | Bayesian adjustment for exposure misclassification in case-control studies |
Creator |
Chu, Rong |
Publisher | University of British Columbia |
Date | 2007 |
Date Issued | 2011-03-07T18:29:03Z |
Description | Measurement error occurs frequently in observational studies investigating the relationship between exposure variables and the clinical outcome. Error-prone observations on the explanatory variable may lead to biased estimation and loss of power in detecting the impact of an exposure variable. The mechanism of measurement error, such as whether or in what way the quality of data is affected by the disease status, is seldom completely revealed to the investigators. This increases uncertainty in assessing the consequences of ignoring measurement error associated with observed data, and brings difficulties to adjustment for mismeasurement. In this study, we consider situations with a correctly specified binary response, and a misclassified binary exposure. We propose a solution to conduct Bayesian adjustment to correct for measurement error subject to varying differentiality, including the nondifferential misclassification, differential misclassification and nearly nondifferential misclassification. Our Bayesian model incorporates the randomness of exposure prevalences and misclassification parameters as prior distributions. The posterior model is constructed upon simulations generated by Gibbs sampler and Metropolis-Hastings algorithm. Internal validation data is utilized to insure the resulting model is identifiable. Meanwhile, we compare the Bayesian model with maximum likelihood estimation (MLE) and simulation extrapolation (MC-SIMEX) methods, using simulated datasets. The Bayesian and MLE models produce accurate and similar estimates for odds ratio in describing the association between the disease and exposure, when appropriate assumptions regarding the differentially of misclassification are made. The 90% credible or confidence intervals capture the truth approximately 90% of the time. A Bayesian method corresponding to nearly nondifferential prior belief compromises between the loss of efficiency and loss of accuracy associated with other prior assumptions. At the end, we look at two case-control studies with misclassified exposure variables, and aim to make valid inference about the effect parameter. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2011-03-07 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0101064 |
URI | http://hdl.handle.net/2429/32108 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- ubc_2007-0058a.pdf [ 3.11MB ]
- Metadata
- JSON: 1.0101064.json
- JSON-LD: 1.0101064+ld.json
- RDF/XML (Pretty): 1.0101064.xml
- RDF/JSON: 1.0101064+rdf.json
- Turtle: 1.0101064+rdf-turtle.txt
- N-Triples: 1.0101064+rdf-ntriples.txt
- Original Record: 1.0101064 +original-record.json
- Full Text
- 1.0101064.txt
- Citation
- 1.0101064.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 12 | 10 |
United States | 12 | 4 |
Russia | 10 | 1 |
United Kingdom | 6 | 0 |
India | 4 | 0 |
Canada | 3 | 0 |
Ukraine | 3 | 0 |
Brazil | 3 | 0 |
France | 2 | 0 |
Mexico | 2 | 0 |
Italy | 1 | 0 |
Spain | 1 | 0 |
Germany | 1 | 0 |
City | Views | Downloads |
---|---|---|
Unknown | 18 | 3 |
Shenzhen | 8 | 10 |
Saint Petersburg | 7 | 0 |
Ashburn | 5 | 0 |
Beijing | 4 | 0 |
Washington | 3 | 0 |
Mesquite | 2 | 0 |
Wilmington | 2 | 0 |
Mexico City | 2 | 0 |
Gurgaon | 2 | 0 |
Sheffield | 1 | 0 |
Oria | 1 | 0 |
Mount Pearl | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0101064/manifest