Evaluating the Performance of Hypothesis Testing in Case-Control Studies with Exposure Misclassification, using Frequentist and Bayesian Techniques by Mohammad Ehsanul Karim B.Sc., University of Dhaka, 2004 M.S., University of Dhaka, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Statistics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August 2009 © Mohammad Ehsanul Karim 2009 Abstract In epidemiologic studies, measurement error in the exposure variable can have large effects on the power of hypothesis testing for detecting the impact of exposure in the development of a disease. As it distorts the structure of data, more uncertainty is associated with the inferential procedure involving such exposure variables. The underlying theme of this thesis is the adjust ment for misclassification in the hypothesis testing procedure. We consider problems involving a correctly measured binary response and a misclassi fied binary exposure variable in a retrospective case-control scenario. We account for misclassification error via validation data under the assumption of non-differential misclassification. The objective here is to develop a test to check whether the exposure prevalence rates of cases and controls are the same or not, under the frequentist and Bayesian point of view. To evalu ate the test developed under the Bayesian approach, we compare that with an equivalent test developed under the frequentist approach. Both these approaches were developed in two different settings: in the presence or ab sence of validation data, to evaluate whether there is any gain in hypothesis testing for having such validation data. The frequentist approach involves the likelihood ratio test, while the Bayesian test is developed from posterior distribution generated by a mixed MCMC algorithm and a normal prior under realistic assumptions. The comparison between these two approaches is conducted using different simulated scenarios, as well as two real case- control studies having partial validation (internal) data. Different scenarios include settings with varying sensitivity and specificity, sample sizes, ex posure prevalence and proportion of unvalidated and validated data. One other scenario that was considered is to evaluate the performance under a fixed budgetary constraint. In the scenarios under consideration, we reach the same conclusion from the two hypothesis testing procedures. The simu lation study suggests that the adjusted model (with validation data model) is always better than the unadjusted model (without validation data model). However, exception is possible in the fixed budget scenario. U Table of Contents Abstract ii Table of Contents iii List of Tables v List of Figures vi Selected Notations viii Acknowledgements ix 1 Prelude 1 1.1 Introduction 1 1.2 The Impact of Misclassification 2 1.3 Suggested Correction of Misclassification in the Literature 3 1.4 Settings of the Problem under Investigation 4 1.5 Basic Terminologies used to Evaluate Misclassification and Measures of Association 7 1.6 Existing Literature on Misclassification 10 1.7 Motivation and Outline of the Current Work 12 2 Frequentist Adjustment 13 2.1 Introduction 13 2.2 Likelihood Functions 13 2.2.1 Without Validation Data 15 2.2.2 With Validation Data 15 2.3 Variance Estimates 17 2.4 Likelihood-ratio Tests 17 3 Bayesian Adjustment 19 3.1 Introduction 19 3.1.1 Bayes’ Theorem 19 111 Table of Contents 3.2 MCMC Algorithms 19 3.2.1 Metropolis-Hastings Algorithm . . . 21 3.2.2 Gibbs Algorithm 22 3.2.3 Mixed Algorithm 23 3.3 MCMC Diagnostics 29 3.3.1 Conventional Graphical Diagnosis 31 3.3.2 Gelman-Rubin Method for Monitoring Convergence 31 4 Simulation Results 33 4.1 Data Generation 33 4.2 Scenario Settings Under Frequentist Adjustment 33 4.2.1 Under Different Values of Sensitivity and Specificity 35 4.2.2 Under Different Sample Sizes 36 4.2.3 Under Different Exposure Prevalence Rates 36 4.2.4 Under Different Proportion of Validation and Main Part of the Data 39 4.2.5 Comparison under Budgetary Constraint 39 4.3 Scenario Settings Under Bayesian Adjustment 48 4.3.1 Under Different Values of Sensitivity and Specificity 49 4.3.2 Under Different Sample Sizes 49 4.3.3 Under Different Exposure Prevalence Rates 49 4.3.4 Under Different Proportion of the Validation Data 49 4.3.5 Diagnostics 54 5 Application in Epidemiological Studies 64 5.1 Introduction 64 5.2 Study of Sudden Infant Death Syndrome (SIDS) 64 5.3 Cervical Cancer and Herpes Simplex Virus Study 68 6 Conclusions and Further Research 75 6.1 Overall Conclusions 75 6.2 Further Research and Recommendations 77 Bibliography 79 iv List of Tables 1.1 Structure for main (unvalidated) part of the data 6 1.2 Structure for validation part of the data 7 1.3 Relationships among the basic terminologies in a 2 x 2 table 7 4.1 Scenarios under consideration 35 4.2 Scenarios under constant cost = $2400, assuming that collect ing validated data costs three (3) times as much as collecting unvalidated (main) data 43 4.3 Scenarios under constant cost = $2400, assuming that collect ing validated data costs five (5) times as much as collecting unvalidated (main) data 44 4.4 Scenarios under constant cost $2400, assuming that collect ing validated data costs ten (10) times as much as collecting unvalidated (main) data 45 5.1 Data from the study of sudden infant death syndrome (SIDS) and antibiotic prescription 65 5.2 Frequentist Estimates of the model parameters in the SIDS study 65 5.3 Bayesian Estimates of the model parameters in the SIDS study 68 5.4 Data from Herpes Simplex Virus-2 study 68 5.5 Frequentist Estimates of the model parameters in the HSV-2 study 70 5.6 Bayesian Estimates of the model parameters in the HSV-2 study 71 V List of Figures 4.1 Power curves under different sensitivity and specificity values: 0.6, 0.7, 0.8 and 0.9 respectively 37 4.2 Power curves under different sample sizes: 400, 600, 1000 and 2000 respectively (validation sub-sample size is fixed at 200 in each situation) 38 4.3 Power curves under different Exposure Prevalences: 0.25, 0.30, 0.35 and 0.4 respectively 40 4.4 Power curves under smaller Exposure Prevalences: 0.005, 0.01, 0.05 and 0.10 respectively 41 4.5 Power curves under different proportions of validation part and main part of the data: (1:9, 1:3, 1:1 and 3:1) respectively 42 4.6 Power curves under fixed amount of cost = $2400 assuming that collecting validated data costs three (3) times as much as collecting unvalidated (main) data 44 4.7 Power curves under fixed amount of cost = $2400 assuming that collecting validated data costs five (5) times as much as collecting unvalidated (main) data 45 4.8 Power curves under fixed amount of cost = $2400 assuming that collecting validated data costs ten (10) times as much as collecting unvalidated (main) data 46 4.9 Bayesian analysis results for different sensitivity and speci ficity values (.6, .7, .8, .9): Proportion of credible intervals excluding null value 50 4.10 Bayesian analysis results for different sample sizes (400, 600, 1000, 2000): Proportion of credible intervals excluding null value 51 4.11 Bayesian analysis results for different exposure prevalence (.25, .3, .35, .4): Proportion of credible intervals excluding null value 52 vi List of Figures 4.12 Bayesian analysis results for different ratio of data splits (1:9, 1:3, 1:1, 3:1 respectively for validation and main part): Pro portion of credible intervals excluding null value 53 4.13 Diagnosis of convergence of Bayesian analysis results: Trace Plots for r0 in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset . . 55 4.14 Diagnosis of convergence of Bayesian analysis results: Trace Plots for r1 in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset . . 56 4.15 Diagnosis of convergence of Bayesian analysis results: Trace Plots for SN in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset . . 57 4.16 Diagnosis of convergence of Bayesian analysis results: Trace Plots for SF in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single data.set . . 58 4.17 Sequence of the mean of posterior for r0 for the four Markov Chain Monte Carlo Chains for 10,000 iterations 59 4.18 Sequence of the mean of posterior for r1 for the four Markov Chain Monte Carlo Chains for 10,000 iterations 60 4.19 Sequence of the mean of posterior for SN for the four Markov Chain Monte Carlo Chains for 10,000 iterations 61 4.20 Sequence of the mean of posterior for SF for the four Markov Chain Monte Carlo Chains for 10,000 iterations 62 4.21 Sequence of the Gelman-Rubin R for the four Markov Chain Monte Carlo Chains for 10,000 iterations 63 5.1 MCMC for the with and without validation data model pa rameters in the SIDS study 67 5.2 Prior and Posterior Distributions of all the Parameters under Consideration in the SIDS study 69 5.3 MCMC for the with and without validation data model pa rameters in the HSV-2 study 72 5.4 Prior and Posterior Distributions of all the Parameters under Consideration in the HSV-2 study 74 vii Selected Notations Y Outcome variable of a study V = Categorical explanatory variable V* Recorded surrogate variable which is collected instead of V for practical reasons Observed part of the data Unobserved part of the data Exposure prevalence & Apparent exposure prevalence SN Sensitivity SF Specificity ‘I’ Odds-ratio PPV Positive predictive value NPV Negative predictive value L(.) Likelihood function f (.) density function H Logit transformation of exposure prevalence P Logit transformation of sensitivity T Logit transformation of specificity e Logit transformation of apparent exposure prevalence i,j,K,t Index vi” Acknowledgements I would like to thank everyone of the Department of Statistics, from faculty to staff and fellow graduate students, for making my M.Sc. program such an enriching and pleasant experience. In particular, it gives me a great pleasure to express my sincere gratitude and deepest appreciation to my supervisor Professor Paul Gustafson. His unique way of mentoring, valuable suggestions regarding my research, technical issues of research, hints for programming calculations, careful draft revision, constant inspiration and words of wisdom helped me a great deal to complete this dissertation in time. It truly has been an honor and a privilege to work with him. Nonetheless, I would like to express my sincere gratitude and appre ciation to Professors John Petkau, Jiahua Chen, Matfas Salibián-Barrera, Ruben H. Zamar, Arnaud Doucet, Lang Wu and Michael Schulzer for their excellent teaching and mentorship. Again, I would like to thank Professor John Petkau for kindly agreeing to be the second reader of my thesis, and for his numerous suggestions regarding clarity and consistency in my writing. On a personal note, I am eternally grateful to my parents who have provided me with abundance of opportunities and freedom all my life, and to my wife Suborna for her love. Vancouver, Mohammad Ehsanul Karim Canada ehsancstat.ubc.ca August 27, 2009 ix To my parents, and my wife Suborna x Chapter 1 Prelude 1.1 Introduction A health outcome, often simply presence or absence of disease, is usually the central issue of an epidemiological inquiry, whereas the exposure is a related factor that is possibly involved in development of disease. The scope of an exposure assessment is much broader in the sense that it can originate from various sources such as some physiological characteristic, psychologi cal characteristic, genetic factor, social or environmental element or genetic attribute. We can use some biological test or even self-reported survey in strument to assess exposure status. Intuition suggests that, whatever tool we use to evaluate that exposure status, there is always a possibility of hav ing mismeasurement. We can have a gold standard method of exposure status evaluation with a well-set definition of superior or ideal exposure as sessment. However, since such superior assessment may not be possible to implement on the whole study sample for various practical reasons, such as available resources or ethical considerations, an operational method of assessment has to be settled upon so that we can use that method on the entire sample. This operational definition is basically an indirect measure of the exposure of ultimate interest. The methodologies of assessing disease and evaluating exposure are quite different from one another. Therefore, the mechanisms by which measurement errors will occur from these two sources are very different. Evaluating the affect of the exposure to a given risk factor in the de velopment of a disease or infection is usually the goal in epidemiological studies. While making the causal association between an outcome variable that defines the disease and the exposure variable(s), it is crucial that both are recorded without error. However, due to restriction of resources, often such quantification by any association measure is hindered by the lack of preciseness of the measures of relevant exposures which are collected us ing the operational definition. When there exist any sources of error, it is possible that the researcher’s interpretations or findings of causal inference 1 1.2. The Impact of Misclassification have alternative explanations. A rich literature suggests that this has long been identified as a problem and there has been considerable interest in this problem. Most applied work still ignores this issue and suffers detrimental effects. In the current work, this issue is acknowledged and addressed. This problem is relevant as much to continuous as to categorical mea sures — although the terminology differs slightly. ‘Measurement error’ is the terminology used when the predictor variable under consideration is continu ous in nature. On average, the closer the true explanatory variable value and the measured value from the surrogate variable (error-corrupted variable) are, tho less measurement error exits, If the predictor variable is categorical instead (with two or more categories), we call it a problem of ‘misclassifica tion’. In this case, the probabilities of classifying a subject into the correct category are considered. The impact of both of the mismeasurement cases are somewhat similar, although the expressions and terminologies to evalu ate them are quite different. For the sake of clarity, let us define some notation: the goal of the study is to explain the relationship between the outcome variable (Y) and exposure variable. In the current work, we will restrict ourselves to a binary exposure variable and denote it as V. That is, we will only consider the situation whether a subject is either exposed or not. However, for practical reasons such as cost, time factors or unavailability of a gold standard, V might not be measured precisely or directly. Therefore, a cruder classification method is applied and a corresponding surrogate variable V* is recorded instead. This is mostly the case when the exposure status is unobservable or cannot be measured precisely within reasonable cost. Nevertheless, although plugging- in a surrogate variable by using an imprecise but cheap classification tool might seem a very intuitive solution, this is are not without consequences. The phenomenon of such error on the measure of association is sometimes referred to as information bias. A question of accuracy of the estimate of the measure of association between disease and exposure arises, and hence we need to evaluate the impact of such replacements. 1.2 The Impact of Misclassification Mismeasurement in the explanatory variables, when ignored, can have detri mental effects on statistical analysis such as: making the estimates of the 2 1.3. Suggested Correction of Misclassification in the Literature parameters biased in the model under investigation, reducing the discrim inative power, and masking various features of the data [Carroll et al., 2006]. In the case of obtaining an estimate of a measure of association, misclas sification presents a serious problem. Naive analysis that just substitutes the apparent exposure status for the unobserved true exposure status can produce highly biased estimates. When misclassification probabilities are equal for the two compared groups (exposed and unexposed), the estimates of measures of association such as relative risk, odds ratio, are biased toward the null value [Copeland et al., 1977]. However, the effect of misclassification error on hypothesis testing pro cedures might not be as detrimental as that on estimation, as mentioned in Bross [1954]. In this paper, it is argued that, if similar misclassification prevails in both exposed and unexposed populations, then the validity of the test of finding whether two proportions, that is, the exposure prevalences are different or not, is not affected. However, this does not come without a price - and the price is the power of the test, which is reduced in the pres ence of misclassification. Usually the extent of loss depends on the amount of misclassification. 1.3 Suggested Correction of Misclassification in the Literature Fortunately, reasonable estimates of measures of association are still attain able, even though the exposure variable under consideration is corrupted. For that, the researchers must have some knowledge about the nature of error to be able to correct or account for it. Identifiability becomes an issue for the likelihoods - if we have no clue regarding the extent of misclassifi cation [Walter and Irwig, 1988]. A number of methods for the correction of measurement error have been developed throughout the years, both in design and analysis stages. Methodologies in the design stage include repli cated measurements, validation studies, etc. In the validation study, the validated sub-sample is derived randomly from the same population under investigation (either internal or external to those included in the primary sample) and a superior method of exposure assessment is implemented on each subject in the sub-sample. All these methods have their own pros and cons. Taking into account such information, correction for misclassification or measurement error can be performed either in frequentist or Bayesian 3 1.4. Settings of the Problem under Investigation ways. We will discuss this topic in the current work from the ‘test of hy pothesis’ point of view. It is worth mentioning that we will be using internal validation sub- samples throughout this work. Although external validation sample some times helps generalizing the results to larger extent, it suffers from various other limitations as well, especially in the situations when cause is depen dent on many factors, not only on the predictor variable under investigation, which is a very common circumstance in the disease-exposure relationships. Also, in terms of cost, internal validation sub-samples are cheaper than ex ternal validation samples since inferior method of exposure assessment is already applied on the subjects of internal validation sub-samples. 1.4 Settings of the Problem under Investigation Let Y be the outcome of interest: for diseased subjects for disease free subjects. To keep the problem simple, it is assumed that the outcome variable Y is measurement error free. That is, we will deal with exposure misclassifica tion, not disease misclassification. The simplest setting in misclassification is a binary exposure variable, which is frequently the case in epidemiological studies. The binary variable V is used to denote the true exposure status: — J 1 for truly exposed 0 for truly unexposed. V” is a surrogate variable that denotes the exposure status observed by some instrument or measurement that is subject to a certain amount of error: — f 1 for apparently exposed 0 for apparently unexposed. Here the exposure variable V is considered to be replaced by the surrogate variable V* with considerable measurement error. It is also assumed that such exposure measurements are independent of other errors. 4 1.4. Settings of the Problem under Investigation To obtain information about the degree of mismeasurement, a validation sub-sample is necessary, where complete information is available about true exposure status (V), along with surrogate variable V* status (through an imperfect assessment on exposure). This is a small fraction of the main sam ple, where only the surrogate variable V* status is available. Throughout this work, we used various compositions of data by varying this fraction. We will discuss this further in Chapter 4. Although it is known that prospective study data are usually preferable study data, researchers have to make certain trade-offs due to feasibility. Retrospective designs are more popular because the secondary data sources are usually much cheaper. However, (unmatched) retrospective case-control studies are more subject to errors of measurement or misclassification, which often leads to invalid results. Therefore, we consider a retrospective case- control scenario, where n1 subjects are sampled from the diseased population (cases), and n0 subjects are sampled from the disease free population (con trols). To make valid causal inference from a retrospective study, a number of assumptions need to be appreciated. Consideration of the type or pattern of measurement error is very crucial in evaluating its likely impact on a measure of association. Researchers should be able to distinguish the con sequences of different patters of misclassification: such as differential and nondifferential misclassification - which are based on whether the pattern of error in exposure assessment varies with respect to disease status. Misclassi fication probabilities of exposure vary with respect to disease status in case of differential misclassification. Errors arising due to recall bias and percep tion are common sources for misclassification probabilities being different in relation to disease status. The presence of disease may have great influence on how subjects interpret or report about the exposure status. In this case, the conditional distribution of the surrogate exposure variable (or measure ment by ‘imperfect’ exposure assessment method), given the true exposure variable and outcome variable, that is, VV, Y, depends on Y. This is the case for many realistic situations. However, to simplify the problem, we sometimes assume that the conditional distribution of V* V, Y does not de pend on Y, that is, misclassification probabilities are invariant with respect to disease status (all cases and controls have the same probability of be ing misclassified). This is the definition of nondifferential misclassification. Throughout the current work, we will maintain the assumption of nondiffer ential misclassification, and the conclusions are valid under this particular 5 1.5. Basic Terminologies used to Evaluate Misclassification and Measures of Association assumption. The reason for this assumption is basically due to some of the simple features that are established in literature, such as “bias toward the null” in absence of other forms of error for a dichotomous exposure variable and its simplicity compared to relatively unpredictable effects of differential misclassification, Researchers usually go through more sophisticated designs like blinding (of exposure assessment with respect to the disease outcome) or some of its advanced variants to attempt to ensure that the nondifferential assumption holds. The notation we will use for the unvalidated sub-sample and validation sub-sample data structures under consideration is given in Table 1.1 and Table 1.2 respectively. The unvalidated and validation sub-samples are sep arate parts of the entire data. The validation sub-sample is the part of the data where we implemented both the inferior and superior methods of ex posure assessment. On the other hand, the unvalidated sub-sample is the part on which we used only the inferior method of exposure assessment, excluding those subjects who were randomly selected for the validation sub sample. For convenience, we will use the phrase ‘unvalidated sub-sample data’ and ‘main data’ interchangeably from now on. In each of these tables, the nj’s are observed, where i = 0, 1, j = 1,2,3,4, 5,6, but in Table 1.1, the uj ‘s are unobserved. Although the marginal totals of V* are observ able, we do not have direct information on how those subjects are classi fied with respect to V. The total number of subjects in the case group is flu + nj + fl13 + n14 + ni + ni = ni and similarly, the total number of subjects in the control group is oi + fl02 + fl03 + fl04 + no5 + floe = flo Table 1.1: Structure for main (unvalidated) part of the data [ y y=i J_____ V/V* V=1 V’=0 V=1 V’=0 V = 1 U02 V = 0 u03 Total u11 + U13 U12 + U14 UOl + UO3 U02 + U04 = nlS = fl16 = o5 = 6 1 5. Basic Terminologies used to Evaluate Misclassification and Measures of Association Table 1.2: Structure for validation part of the data Y Y=1 Y=o V/V* v=i V’=o I V=l VO V 1 fljj fl12 O1 V = 0 n13 n14 n03 n04 Total [[ n11 + n13 n12 + 7114 I no1 + o3 fl1J2 + ‘O4 Table 1.3: Relationships among the basic terminologies in a 2 x 2 table Test Condition Exposed Unexposed Exposed True Positive False NegativeTrue Condition Unexposed False Positive True Negative 1.5 Basic Terminologies used to Evaluate Misclassification and Measures of Association Let us denote the true exposure prevalence as: = P(V 1Y where i = 0, 1 for control and case respectively. As V in this case is unob served, the apparent exposure prevalence is defined as = P(V* = 1Y = i). Sensitivity and specificity are commonly used statistical measures of the performance of a binary classification test. In the current context, sensitivity (SN) measures the proportion of actual exposed people which are correctly identified as such. Specificity (SP) measures the proportion of unexposed people which are correctly identified. Thus, by definition, SN =P(V* = 1V = 1,Y=i) SP = P(V* = OIV =0, Y = i) Notice that we are characterizing misclassification in terms of classification probabilities. Therefore, SN2 and SP. range between 0 and 1, and the extent to which these are less than 1 indicates the intensity of the misclassification problem. 7 1.5. Basic Terminologies used to Evaluate Misclassification and Measures of Association When the conditional distribution of V* Iv, Y does not depend on Y (i.e., nondifferential misclassification condition), then we get SN0 = SN1 = SN and SP0 = SP1 = SF. The apparent exposure prevalence é1 can be expressed in terms of r1, SN1,SF1: = F(V*=1IY=i) = P(V =1, V = kIY = i) = F(V*=1iV=k,Y=i)P(VkIYi) = F(V*=liV=1,Y=i)P(V:=zliY=i)+ P(V* liv = 0,Y = i)P(V 0Y = i) = SN1 r1 + (1 — 5P1)(l — r1) (1.1) = SNr+(1—SF)(l—r), denoting common sensitivity by SN and common specificity by SF, under the assumption of nondifferential classification. Simple algebraic manipula tion from Equation (1.1) shows that (rI, SN1SF1) and (1 — r1, 1 — SN1, 1 — SF1) leads to same From Youden’s Index [Youden, 19501, we know that if the sensitivity and specificity are such that SN + SF — 1 <0, then the test is misleading. SN + SF = 1 would mean that the test is no more useful than a coin-flip guess. That is, the test has no discriminative power on the exposure group, and reports same proportion of positive tests for both exposed and unex posed groups. Therefore, a common assumption is SN + SF> 1. In our scenario, where both the response Y and the exposure variable V are binary, the odds-ratio is defined as: — F(V 1Y 1)/F(V = OIY =1) — F(V = 1Y = 0)/F(V = 0Y =0) = ri/(l—ri) (1.2) ro/(l — ro) which is a common measure of association between disease and exposure status for retrospective case-control studies. However, if exposure variable v is subject to misclassification error, an intuitive substitute is: * — 8/(l — i) 0 0 8 1.5. Basic Terminologies used to Evaluate Misclassification and Measures of Association Thus, the attenuation factor is defined as: I1* AF=r--, which gives us an idea of the magnitude of bias introduced by misclassifica tion. An alternative formulation for expressing degree of misclassification re quires us to use the Positive Predictive Value (FP) and the Negative Pre dictive Value (NFV). Positive Predictive Value (FFV) is the proportion of subjects with a positive test result from the inferior method of expo sure ase.sment, who actually is exposed, determined by superior method of exposure assessment. Similarly, Negative Predictive Value (NPV) is the proportion of subjects with a negative test result from the inferior method of exposure assessment, who actually is unexposed, as indicated by supe rior method of exposure assessment. These two quantities can be calculated from a 2 x 2 table. By implementing Bayes’ Rule as discussed in Equation (3.1), the relationships of (ri, SN, SF,) with FFT4 and NFV are derived as follows: PPV, = P(V=1IV* =1,Y=i) — P(V* = iV= 1,Y=i)P(V=1IY=i) — P(V = 1V= 1,Y=i)P(V=1IY =i)+P(V = 1IV=O,Y=i)P(V=OIY=i) SNr 1 4SNr + (1 — SP)(1 — NPV P(V=OIV* =O,Y=i) — P(V* =OIV=O,Y=i)P(V=OIY=i) — P(V*=0IV=o,Y=i)P(V=0IY=i)+P(V*=1IV=o,Y=i)P(V=OJY=i) SP(1—r) (15 — SP(1—r)+(1--SN)r However, unlike the implication of nondifferential misclassification with respect to sensitivity SN and specificity SF, FFV0 does not have to be equal to PPV1, nor does NPV0 has to be equal to NPV1 under nondifferential misclassification. 9 - 1.6. Existing Literature on Misclassification 1.6 Existing Literature on Misclassification Nondifferentiality is a recurring assumption in the epidemiologic literature due to some of its interesting results. Bross [1954] discussed the difficulties of inferences on a single proportion or the difference between two propor tions from a 2 x 2 classification table in the presence of misclassification. He indicated the distortion of estimation and the power reduction in hypothesis testing. He justified his statements under the assumption of nondifferential misclassification. Newell [1962] further substantiated the fact that nondif ferential misclassification errors will always tend to produce results biased towards the null (that is, the difference between the two rates will shrink while applying inferential procedures on the data with nondifferential mis classification). Also, Gullen et al. [1968] suggested that under broad as sumptions, classification error never results in the apparent difference being larger than the real difference in rates. Dosemeci et al. [1990] and Diamond and Lilienfeld [1962a, b] showed with some numerical examples that excep tions are possible and that nondifferentiality is not always tenable. Keys and Kihlberg [1963] tried to identify the reasons of such unusual deviation. To implement this result, the measurement error has to be independent of all other errors. A few good reviews of such unusual phenomenon are available in the literature, such as Thomas [1995] and Jurek et al. [2005]. Roih7nan et al. [2008] discussed misuse of the “bias toward the null” result, mostly when the assumptions for this result are not met. Even if the assumptions are met, it is not necessarily true for hypothesis testing: p values need not have upward bias as reported by Greenland and Gustafson [2006]. The situation gets even more complicated for more than two categories, i.e., when exposure is polytomous. [Gladen and Rogan, 1979] provided ex pressions for bias under nondifferential assumption. Early literature on the impact of misclassification includes Koch [1969] and Goldberg [1975]. Most of these describe the effect on association measures obtained from a 2 x 2 exposure-disease classification table. Goldberg [1972] discusses the issue with regard to hypothesis testing. Historically, the development of adjustments for mismeasurement were mostly under the nondifferentiality assumption. Copeland et al. [1977] sug gested extension of the “bias toward the null” result to ratio effect measures of association, such as the risk ratio and odds ratio and derived adjust 10 1.6. Existing Literature on Misclassification ment formulas to correct for misclassification given the nondifferential as sumption. Barron [1977] suggested a matrix method for such adjustment. Greenland [1980] further extended the adjustment to difference effect mea sures and also considered the possibility of misclassification of confounders. Greenland [1988b] discussed the basic methods for constructing variance estimators for the various parameters after adjusting for misclassification. Marshall [1990] proposed inverse matrix methods by reparameterizing the misclassification problem. Morrissey and Spiegelman [1999] discussed both the matrix and inverse matrix methods under various circumstances. Lyles [2002] reparameterized the likelihood of the problem and suggested a rela tively more convenient solution to the problem which does not require nu merical optimization. If all the parameters are unknown, nonidentifiability makes the inference impossible. A reasonable estimate of the misclassifi cation probabilities is required to carry on the inference. Adjustments for misclassification using replicated samples are provided by Walter and Irwig [1988]. Greenland [1988a] provided formulas for adjustment when a valida tion sample is present. More recent works include Greenland and Gustafson [2006], Greenland [2008] and Marshall [1997]. Marshall [1989] pointed out that the estimates of measures of association that adjust for misclassifica tion are very sensitive to the estimates of misclassification probabilities and even small discrepancies with actual probabilities can lead to misadjustment. Recent developments in the rapidly advancing field of computing made it possible to use the numerical approaches and simulation techniques to solve these problems in a more elegant way. The problems of mismeasurement were explored from a Bayesian context in Rahme et al. [2000], Joseph et al. [1995] and Prescott and Garthwaite [2002]. Gustafson et al. [2001] checked the point made by Marshall [1989] and suggested a Bayesian solution of the problem by incorporating some uncertainty about the misclassification probabilities by means of having a prior distribution of those parameters instead of a particular guess. Gustafson and Greenland [2006] showed that implementing such prior may provide narrower interval estimates of measure of association. Chu [2005] incorporated such uncertainty or randomness by implementing various prior distributions on the prevalence and misclassi fication probabilities and assessed the Bayesian adjustment of estimates of various parameters of misclassified data under various assumptions when val idation data is available. Estimates obtained from the Bayesian approach is then compared with estimates from previously developed methods such as the maximum likelihood estimates [Lyles, 2002] and SIMEX (simulation extrapolation method). 11 1.7. Motivation and Outline of the Current Work A general overview of the methods for misclassified categorical data and some extensions to higher dimensions are provided in Willett [1989] and Chen [1989]. Overall general discussion of these issues and the ways to combat such problems are documented in chapters 3 and 5 of Gustafson [2004]. 1.7 Motivation and Outline of the Current Work Although comparisons between the frequentist method with specific esti mates of parameters (misclassification probabilities and prevalences) and the Bayesian method with prior distributions on parameters (to incorporate uncertainty) provided in the literature, such comparisons have not yet been made for hypothesis testing. In this thesis, we will assess the impact of mis classification of dichotomous exposure on hypothesis testing for two settings - without considering validation data and its counterpart after adjustments using the estimates from validation data - under the nondifferential misclas sification assumption. The Bayesian adjustments for hypothesis testing will be compared with standard frequentist methods. In Chapter 1, we have discussed the historical developments, basic def initions and terminologies for misclassification error. The motivations for correction and some methods of adjusting for such errors are also discussed. The problem under investigation is specified. In Chapter 2 and 3, we will explain the models and methodologies of hypothesis testing in the presence of misclassification error from the frequentist and Bayesian points of view respectively. In chapter 4 we will show the simulation results under a set of scenarios and compare the classical and Bayesian methods. We use some real epidemiological datasets to implement these methods in Chapter 5 and conclude with general findings and further recommendations for future re searches in Chapter 6. 12 Chapter 2 Frequentist Adjustment 2.1 Introduction Maximum likelihood estimation (MLE) is a popular method used for fitting a statistical model to data. Pioneered by various statisticians including R. A. Fisher at the beginning of the last century, it has widespread applications in various fields. If the sample observations are available, this estimation proce dure searches over various possible population characteristics and eventually obtains the most likely value as the estimate of that population character istic. Having drawn a sample of ri values x1, x2, ..., z from a distribution where 4 is the parameter of interest, we form L(çb) = f(xi, x2,.. . , x,j. The method of maximum likelihood estimates ‘ by finding the value of that maximizes L() or, more commonly, the logarithmic transformed version of it. The solution can be found numerically using various optimization al gorithms. The popular alternatives to this estimation procedure are least squares procedure and method of moments. However, those estimates are not very efficient in various circumstances, whereas maximum likelihood estimates possess various desirable features such as consistency and asymp totically efficiency, if solution exists. The maximum likelihood estimation procedure can also be used on non-random samples, if certain adjustments are made, such as conditioning on the clusters or correlated groups, etc. 2.2 Likelihood Functions Previously in §1.2, we discussed the impact of misclassification. The es timates of (So, 9) obtained from the entire sample will be biased toward the null, under certain conditions. As described in §1.3, there are various methods suggested in the literature for adjusting the consequences of mis classification. We will use the method that uses a validation sub-sample. By using validation data, we can have an estimate of ro, r1, SN and SF. Therefore, given the observed data, under nondifferential misclassification, we can consider (ro, ri) or (So, Si) as the unknown parameters in the statis 13 2.2. Likelihood Functions tical models. As mentioned in §1.5, the true exposure prevalence is defined as r = P (V = iY = i), whereas, the apparent exposure prevalence is ex pressed as 8 P(V* = 1Y = i). From Equation (1.1), we can see that O can be expressed as a linear function of (ri, SN, SF). For the same dataset - when r, SN and SF remains fixed, we can use H0 : = i for without val idation data settings (by ignoring all the true exposure status V, but using all the apparent exposure status V* obtainable from the entire sample) and equivalently, H0 : ro = r1 for with validation data settings (by incorporating the true exposure status V from the validation sub-sample and the apparent exposure status V* obtainable from the entire sample). Since both of these hypotheses are applied on the same dataset, the total number of subjects under consideration are the same in each test. The stated hypotheses are simply variants of the following hypotheses respectively: H0 : ‘I’ 1 and H0 : =1. The notable distinction between these two models is that r can take any value between (0, 1), whereas é1 can take values between min(SN, 1 — SF) and max(SN, 1 — SF). We will discuss the likelihoods and the solution methods in the following subsections. One important point is worth mentioning: even in absence of validation data (that is, when true r0, ri , SN and SF are not estimable), due to the equivalence of hypotheses mentioned above, we can test H0 : = = 0 and we can conclude the same about H0 : ro = r1 = r. However, when validation data is not present, such equivalence is not true for estimation purposes, because when SN and SF are unknown, the relationship between (Oo, 0) and (ro, ri) is not known respectively (see Equation 1.1). Therefore, when a validation sub-sample is not available, we can not estimate (ro,r1), but from the entire sample we can estimate (Oo, O). 14 2.2. Likelihood Functions 2.2.1 Without Validation Data A standard way to express the likelihood in terms of the parameters (6o, 6) for problems consisting of misclassified data without validation part is: L(60,9iIV, Y) cc = 1Y = o)(flol+flo3+ o5x P(V = OY = O)(flO2+fl4flOG) x P(V* = iY = i)(flhl+fl13+fl15) x P(V* OY = i)(fl12+fl14+fl16) = 6nol+no3+no5) < {i. — 0 }(flO2+flO4+flO) x (nn+n13+n15) x {i —6}(fl12+fl14+fl1) (2.1) The maximum likelihood estimates of 6, 61 respectively are given by: floi+fl03+fl5 = fl01 + flhJ2 + no3 + no4 + no5 + no6 flu + ri3 + 9215 il + fl12 + fll3 + fl14 + fll5 + Th16 Under the null hypothesis H0 : 6 = 6 0, the maximum likelihood estimate is given by - — flOi+fl03+flQ5+flhl +nl3+fll5 — Oi +flo2 +fl03 +fl04 +flo5 +9206 +flui +fli2 +fll3 +fl14 + l5 + l6 2.2.2 With Validation Data A standard way to express the likelihood in terms of the parameters (ro, r1, SN, SP) for problems consisting of misclassified data with validation part 15 2.2. Likelihood Functions is provided in Equation (2.2) under nondifferential misclassification: L(ro, r, SN, SPV*, V Y) {P(v = 1Y = O)P(V* 0V 1,Y = 0)}01 {P(v 1IY = O)P(V* = 1V 1,Y = 0)}02 {P(v = OY = O)P(V* 1V = 0, Y = {P(v 0Y 0)P(V* OV = 0, Y = 0)}T04 {P(v = 1Y = 1)P(V* 0V = 1,Y = 1)}nul {P(v 1Y 1)P(V* = 1V ,Y i)}712 {P(v 0Y 1)P(V* 1V = 0, Y = {P(v = 0Y = 1)P(V* = 0,Y = {P(v* = 1IY = 0)}0{1 — (P(v* iY = {P(v* = 1Y 1)}’{1 — (P(v* = 1IY = l))}16 = {roSN}’°’{ro(1 — SN)}’°{(1 — ro)(1 — SP)}’°3 x {(i — ro)SP}°{riSN}Th11{ri(l — SN)}Th12 x {(i — ri)(1 — SP)}”3{(i — ri)SP}’ x {r0SN + (1 — ro)(1 — SP)}’205 x {i — (r0SN + (1 —r0)(1 — SP))}?06 x {r1SN + (1 — ri)(1 — SP)}’15 x {i— (riSN+(1—ri)(1—SP))}’’6. (2.2) This likelihood does not lead to a closed form for the maximum likelihood estimates of ro, r1, SN and SF. In quasi-Newton methods, the Hessian ma trix of second derivatives of the function to be optimized is not required. That is why, a general-purpose optimization based on quasi-Newton methods or a variable metric algorithm is used to optimize Equation (2.2), specifically the algorithm that was published simultaneously in 1970 by Broyden [1970], Fletcher [1970], Goldfarb [1970] and Shanno [1970] (that is the origin of the name Broyden - Fletcher - Goldfarb - Shanno or BFGS method). This algorithm uses function values and gradients to build up a picture of the surface to be optimized. However, for differential misclassification, we do have closed form ex pression for the maximum likelihood estimates of ro, ri, SN and SF. 16 2.3. Variance Estimates 2.3 Variance Estimates The numerical approximation to the Hessian matrix can be obtained from the BFGS algorithm (implemented in opt im function of R). The negative of the Hessian matrix is the observed Fisher information matrix. The inverse of the observed Fisher information matrix yields the asymptotic covariance matrix of the maximum likelihood estimates. By the use of multivariate delta method, one can easily obtain the asymptotic variance of the log odds ratio, given the estimated prevalence rates. 2.4 Likelihood-ratio Tests A likelihood-ratio test is a statistical test for making a decision between two hypotheses based on the value of the ratio of the likelihood under two differ ent hypotheses. The null hypothesis is often stated by saying the parameter is in a specified subset ‘I of the parameter space . The likelihood func tion is L() = L(çbx) is a function of the parameter with x held fixed at the value that was actually observed, i.e., the data. The likelihood ratio is A — sup L(q5Ix) : — supLx):E The numerator corresponds to the maximum likelihood of the observed re sult under the null hypothesis H0. The denominator corresponds to the maximum likelihood of the observed result under the alternative hypothesis H1. Lower values of the likelihood ratio mean that the observed result was less likely to occur under the null hypothesis. Higher values mean that the observed result was more likely to occur under the null hypothesis. The likelihood ratio A is between 0 and 1. The likelihood ratio test rejects the null hypothesis if A is less than a critical value which is chosen to obtain a specified significance level c. Usually it is difficult to determine the ex act distribution of the likelihood ratio for a specific problem. However, as the sample size n approaches , the test statistic —2 log(A) will be asymp totically x2 distributed with degrees of freedom equal to the difference in dimensionality of o and I. In the current context, for without valida tion data, I = 6 and I = (6o, 6k). Similarly, for with validation data, (r, SN, SP) and 1 = (ro, r1, SN, SF). Eventually from these tests, we obtain p-values. A convenient measure of the performance of any hypothesis test is to find the probability of not making type II errors (1 — 3), or in other words, not 17 2.4. Likelihood-ratio Tests making the error of “not rejecting null hypothesis when it is false” - power of the test. Powers can be thought as the ability of the hypothesis test to detect a false null hypothesis. In Chapter 4, we will use the power curve as a tool to compare the tests based on with and without validation data. Also we will try to identify whether frequentist methods perform better than Bayesian methods or not. We will discuss relevant Bayesian methodology in Chapter 3. 18 Chapter 3 Bayesian Adjustment 3.1 Introduction 3.1.1 Bayes’ Theorem Bayes’ Theorem is a simple mathematical formula used for calculating con ditional probabilities of random events. For the random variable X, that is distributed as L@5x), where is the parameter of interest, let fx(x) is the marginal distribution and hence a function of the observed X alone, while g() is the distribution of before observing X. Then Bayes’ Theorem says that the form of posterior distribution is: ir@Ix) = f(x,)fx(x) — g()L(Ix) fx(x) c g()LQx). (3.1) Although Equation (3.1) seems simple, it is a fundamental theorem which has deep impact in statistical theory. It is often the case that the posterior ‘w(ØJx) is non-standard or high dimensional, involving a lot of parameters. Then it is difficult to evaluate summaries such as the mean, variance, mo ments, etc. which require integration. Although analytically this is a dif ficult problem, algorithms discussed in the following sections help us find solutions numerically. It should be noted that simpler methods such as Laplace approximation can also be used to evaluate such summary quanti ties, but they require restrictive assumptions such as normal approximation to the posterior distribution and so on. Therefore, we consider algorithms that can be applied in broader contexts. 3.2 MCMC Algorithms From §2.4, frequentist likelihood ratio test results are based on the asymp totic assumption, that is, a x2 approximation for the sampling distribution 19 3.2. MCMC Algorithms of the test statistic —2 log(A), since the exact distribution of the statistic is hard to determine and varies from problem to problem. Monte Carlo meth ods can be an alternative to this approach. These methods can even be applied in the cases where the distributions are not in conventional format or unknown. To explain the idea of Monte Carlo, suppose that is a collection of model parameters or unknowns, and h() is a function of . We want to evaluate the expected value of the given function h() over a pdf ir(). In other words, we want to evaluate E(h()) fh(b)Tr(q)db. If ir has a very complex form, we proceed with the Monte Carlo integration technique. Here, we draw samples (2), (n) independently from Tr(). Then we estimate h()), which can be made as accurate as desired by increasing sample size. Therefore, the fundamental idea behind Monte Carlo methods is that, by repeatedly drawing random samples from the target population ir(), we can gain insight regarding the behavior of a statistic. When we observe the behavior for a very long time, we obtain an estimate of the sampling distribution of the statistic. But this added advan tage is not without a price - time and computer resources are big issues for these algorithms. However, recent advances in computing technologies have led to enormous popularity of Monte Carlo simulation as a powerful alter native to formula-based analytic approaches, especially where the solution requires a lot of assumptions. In the Bayesian context, this Tr() is the posterior density Tr(Ix), which may have a nonstandard, complicated form. Here x denotes the observed information, and is high dimensional. Sampling independently from the posterior density ir(Ix) is generally not feasible, and closed form solutions are not usually possible. Therefore, we generate a chain or dependent sam ples (1), (2), (‘) from the posterior using a Markov Chain scheme. This Markov chain generates each iteration ) taking into account of the previous value (i_1) only. We want to create a Markov Chain whose station ary or limiting or equilibrium distribution is the desired posterior Tr(q5 x). Here the posterior distribution ir(g5Ix) is the target distribution. To obtain the stationary distribution of the Markov Chain, we need to run the burn-in for a long time. Here, burn-in refers to the series of initial samples that are not expected to have yet converged to the target distribution and are hence excluded from any subsequent analysis. In brief, the basic idea of Markov Chain Monte Carlo is to iteratively produce parameter values that are rep resentative samples from the joint posterior. For large number of iterations, 20 3.2. MCMC Algorithms this scheme will provide samples ) from its stationary distribution, that is, when the successive samples becomes uncorrelated. This way, it is sur prisingly easy to approximate the posterior distribution ir(Ix). However, one added disadvantage to this entire procedure is that we have to monitor convergence. We will discuss this further in §3.3. The Markov Chain Monte Carlo algorithms that are used in the Bayesian version of the test under consideration are described in §3.2.3. But for detailed understanding of the procedure, we start with a general description of the Metropolis-Hastings algorithm and Gibbs algorithm. However, for basic terminologies and definitions used in these Markov Chain Monte Carlo algorithms, we refer the readers to Gelman [2004]. 3.2.1 Metropolis-Hastings Algorithm Suppose we need to estimate a parameter vector with k-elements, E 4 and the posterior, 1r(c). When the chain reaches the position at the ttI step, we draw ‘ from a distribution over the same support and we name it the proposal or jumping distribution, Pt(4I), according to which a new value ‘ (candidate point) is proposed given the new current value ]. One thing to keep in mind is that P(’I) should be easy to sample from. We are producing a multidimensional candidate value. The condition here is that the reverse function value, Pt(I’) should also exist. In the literature, the acceptance ratio is defined as follows: / ir(’)Pt(’) 32(.) The Metropolis-Hastings algorithm does not necessarily move on every it eration. The probabilistic rule that decides whether the candidate point is accepted or not, i.e., transition from t to (t + l)th point, is: — f ‘ with probability min{a(, [t1), l}[tj with probability 1 — min{c(’, [t1), 1} We only need to know the posterior distribution ir() up to a constant of proportionality. This is considered as the most attractive feature of the Metropolis-Hastings sampler. A single Metropolis-Hastings iteration proceeds with the following steps: 1. Initialize the chain with any arbitrary value. 21 3.2. MCMC Algorithms 2. Generate a candidate point // from (‘I), where is the current location. 3. Sample u from uniform (0, 1) distribution. 4. If cYç/) > u then accept ‘. 5. Otherwise keep 4 as the new location and repeat until convergence. From the obtained chain, we truncate burn-in samples, and the rest of the chain is used to estimate the posterior distribution. 3.2.2 Gibbs Algorithm The Gibbs sampler is a special case of Metropolis-Hastings where we always accept a candidate value. The idea of Gibbs sampling is that, given a multi- variate distribution, sample from a conditional distribution. This sampling is generally simpler than integrating over a joint distribution. Hence, the Gibbs sampler is simply a Markovian updating scheme, based on a sequence of conditional probabilistic statements. We will give a brief outline of Gibbs algorithm in its simplest form. Let the joint distribution of interest be ir(), where is a vector of of k parameters. The aim is to create a Markov chain that cycles through some conditional statements. A requirement for use of this sampler is that we must know the full conditional distributions. This is a major limitation of this algorithm, especially for the cases where the conditional distributions are hard to derive. The full set of required conditional distributions for are denoted by and defined by 7r() = ir(iI&, 2, . . . j4, j+1,. . . k) for i = 1, .. . , k. It should be possible to draw samples from these conditional distributions. At each iteration of the Gibbs sampling, the algorithm cycles througn these conditionals based on the most recent version of all other parameters. The order is not important, but it is important that the most recent draws from the other samples be used. The algorithm is as follows: 1. Decide on the starting values: = [01, .. , 2. At the ttI iteration, a single cycle is completed by drawing values from 22 3.2. MCMC Algorithms the following k distributions: _ijtl ,4jti A.It1 _t—•1 ,4j11 ‘P1 ‘P2 ‘ ‘P3 , ‘P4 ‘ ‘ ‘Pki ‘ ‘Pk [tj [t] jt—i] [t—1] [t—i1 [t—i] 2 Y2 , ‘P3 , , ‘ ‘Pk [tj jt] [t] [t—i] jt—i] [t—1] P3 ‘P1 ‘ ‘P2 ‘ ‘P4 ‘ •‘ Pk—1 ‘ ‘Pk [tj ,[t] ,[t] ,[tj [tJ ,[t—i) “k—i pk—1 ‘P1 , ‘P2 ‘ ‘P3 ‘ “‘ ‘Pk—2, ‘Pk [tJ ,[t] ,4[tl ,[tj ,[t] ,,[tj k ‘Pk ‘P1 ‘ ‘P2 ‘P3 , ‘ ‘Pk2, ‘Pk1 Here çb can be a multidimensional vector. 3. Set t = (t + 1) and repeat until convergence. If the Gibbs sampler has run for sufficiently long time, it produces samples from the desired stationary distribution. The attractive feature of the Gibbs sampling algorithm is that these conditional distributions contain enough in formation to eventually produce samples from the desired joint distribution. 3.2.3 Mixed Algorithm With Validation Data Likelihood Function: First, we define the parameter space: (ro,ri,SN,SP), where r0 is the exposure prevalence for controls and r1 is the same for cases, SN is the sensitivity and SF is the specificity under nondifferential classification. The cell counts u of the main data as shown by Table 1.1 are gener ated from a binomial distribution. To be more specific, the actual number of subjects that are in positive exposure status (u1) amongst those who are exposed in the groups of cases or controls (n5) follows a binomial with parameters n5 and PPI4 (as defined in Equation (1.4)). Likewise, condi tioning on the number of cases or controls with negative exposure status (ri6), the number of truly unexposed subjects (u4) follows a binomial with parameters j6 and NPV (as defined in Equation (1.5)). The likelihood function for this setting is given in Equation (3.3). Here, the data Y is updated as Y = {Y, Y} = {(m1,n,2, fl3, n4), (u1,u2,u3, 23 3.2. MCMC Algorithms U4) f(Y,,YI2) L(ro,ri,SN,SPIYn,Yu) ñ [pv = iv = i,Y = i)P(V = {P(v* OV 1,Y = i)P(V 1Y {P(v* = 1V = O,Y = i)P(V = OY = {P(v* OV = O,Y i)P(V = OY {P(v* iV = 1,Y = i)P(V = 1Y — j)}Uil {F(v* OV = 1,Y = i)P(V 1Y {P(v* 1V = O,Y = i)P(V = OJY = j)}U3 {P(v* OV = O,Y = i)P(V OY = i)}Ui4] [{SNjr}n x {(i - SN)r}’2 x {(1 - SP)(1 — {SP(1 — x {SNrI}” x {(i — SNj)rj}U2 {(i — SP1)(1 — r)}us3 x {SP(1 — rj)}U4] [{sNjri} +Ul x {(i — SNj)rj}Th2+2 {(i — SP)(1 — x {SP(1 — ri)}44] = x (1 — x SN’” X (1 — SN)2+2 x (1_ X SP4+4]. Under nondifferential misclassification, SN0 = SN1 — SN and SF0 = SP1 = SF (according to the definition that we used). Therefore, in Equation (3.3), we could have used SN and SF, instead of SN and SP. But we preferred to keep the general format to present the likelihood function for the broader context. Prior Specification: We are interested in (ro, ri, SN, SF), as de fined in section 3.2.3. Each of these parameters can possibly range from 0 to 1. To cover the whole real line from —oo to , we make a logit transfor mation of each of these parameters. To keep the problem manageable, we 24 3.2. MCMC Algorithms assume the following: / “ II __\ // \ / 2( H = ( LO9r \ N I ( /L0 ‘ I °o P°oi “ H ) — \ logy ) ) ‘k puoui F (log1 N) N(2,u), I (iog1Sp) (34) where H, H, F, I are just the logit transformed versions of ro, r1, SN, SF respectively. Here (ll, Hi)’ is assumed to follow a bivariate normal distri bution with hyperparameters o, and cr0, u1, p. Similarly, F and I follow independent normals with hyperparameters ,u2 2 and j, u3. Also conditional distribution of a bivariate normal variable remains nor mal, therefore, given F, I, we have Holili N({[Lo+p(Hi _1L1)},U(l_p2)) flub N({,u + p-(Ho — io)}, (1 — p2)) It should be noted that these assumptions of independence among the parameters and normal distribution structures of them are purely based on mathematical convenience. Researchers can think of other possible distribu tions if they find them suitable for the purpose. Also if one thinks that the assumption of independence of the parameters is inappropriate, it is possible to impose correlation among the parameters by means of some multivariate distribution with defined correlation structure. We assume that the analyst’s prior beliefs about the logit transformed parameters can be represented by the hyperparameters mentioned in Equa tion (3.4). These beliefs may be gained from relevant examples from the given subject area. Under fairly general conditions, we have empirical rea son to believe that both ro and ru usually lie between Tmjn = 0.02 and Tmax 0.50; we will assume a median being 0.125. Then i = ILi = 0.125. Within 2 standard deviation, on logit scale, we have orj = i = {logit(u) — logit(rmjn)}/3 under normality with 95% probability. Also as sume a mild value for p, say, 0.3 to allow relatively large standard deviation of log OR around the mean of 0. For SN and SF, we usually see them lying 25 3.2. MCMC Algorithms between 0.60 and 0,99; we will assume median of 0.80. Using the same logic as before, we determine the hyperparameters. It should be emphasized that the user can choose any hyperparameters of interest. The above is just an example of how we can construct the prior from the mentioned empirical beliefs. Often the posterior is robust to the assumed prior. We will discuss this point further in Chapter 5. Posterior: Since Equation (3.3) is a complex one, simulating 1 directly from the joint posterior distribution is troublesome. Therefore, we will sam ple sequentially from the conditional distributions as follows: (ro, riI, SN, SF) cc fr(ro, ri) fi [r1+ 2+Uul+Ui2 (1 — rj)nh3+4+3+ui4], (SNI,ro,ri,SP) cc fsN(SN)fl[SN’1 x(1_SN)22], (SP, SN, r0, ri) cc fsp(SP) fl [5pi4+ui4 x (1 — SP)Thi3+Ui3] (35) using the prior distribution fT, fsN and fs as already described. Since the densities are not conditionally conjugate, we implement uni variate Metropolis-Hastings jumps embedded in the Gibbs sampling. This algorithm will update each component in the pairs of parameters, (ro, r1), and the same for SN and SP. For satisfactory performance of the MCMC, we need o make suitable choice of jumping distribution. If we examine the likelihood function in Equation (3.3), and think of r, SN and SF sepa rately, it looks similar to a beta density. Hence we assume a beta jumping distribution. This simplifies calculation of the acceptance rate by cross can celing the ratio of proposed versus current likelihoods and the ratio between two jumping densities. For example, consider the acceptance probability for the one-dimensional M-H jump on ro in Equation (3.5). The jumping rule is specified as r’0 ‘-‘s Beta(noi +fl02 +U1+u2+ 1, rio3 +n04 +u3+u4+ 1), close to the conditional sampling distribution, where t is the index for iteration 26 3.2. MCMC Algorithms number. The ratio in Equation (3.2) becomes ir(r,lri ,SNt,SPt,Yt) — 7r(rIr,SNt,SPt,Yt) — Pt(rbr,r,SNt,SPt) Pt(rIrb,r,SN,SP) r(r’o Ir ,SNt ,SPt)L(rb,r ,SNt,SPt) — 7r(r Ir ,SNt ,SPt)L(r,r ,SNt ,SPt) — L(ro,r ,SNt ,SPt) L(r ,r ,SNt ,SPt) — 7r(r’oIr, SNt, SPt) — rIr, SNt, SPt) — ir(rbr) — ir(rr Therefore, we are left with merely the ratio between two prior distributions. Thus, using this mixed algorithm, we proceed as follows: 1. Set starting values of (r, r?, SN°, SP°). 2. At the tt/ iteration, • Given parameters Qt = (ri, r, SNt, SPt), generate new data as unobserved actual exposure data = {u} based on binomial distributions, for i = 0,1, j = 1,2,3,4. • Based on the updated cell counts {u} at the tt iteration, model parameters are generated as follows: (1) Simulate r conditioning on (r,SNt_l,SPt_) using the M-H algorithm, the proposed jumping rule is rj Beta(noi+ nO2tui,no3+no4+u),with acceptance rate mm { }. (ii) Similate r conditioning on (r,, SNt_l, Spt_1) using the M H algorithm. The proposed jumping rule is r’1 “.‘ Beta(nii + fl12 +U1+U2+ 1, nla + ri14 +u3+u4+ 1), with acceptance I ir(r’ir) rate mini Iir(r1 Iro,1) (iii) Simulate SNt conditioning on (ri, r, SPt_1) using the M-H algorithm. The proposed jumping rule is SN ‘- Beta(noi + U11 +7111 +1-41 + 1, no2+u2+n12+u2+ 1), with acceptance I ir(SVSN’’) rate mlnr(SNt_1ISNt_1) (iv) Simulate 5pt conditioning on (r,, r, SNt) using the M-H algorithm. The proposed jumping rule is SF Beta(no4+ 27 3.2. MCMC Algorithms Uj4 + fl14 + U4 + 1, o3 +u3 + flj3 +u3 + 1), with acceptance I r(S7VSP’) rate mlncr(SPt_1ISPt_1) . Calculate the log odds ratio at the tt iteration. 3. Repeat the step (2) at subsequent iterations, for t = 1,. . . , m + n, to simulate target parameters using the hybrid algorithm. The procedure is run for sufficiently long m + n iterations, where m is the number of burn-in iterations and n is the number of target iterations. Without Validation Data Likelihood Function: At first, we define the parameter space: (8o,9i) where 8o is the apparent exposure prevalence for controls and 9i is the same for cases. As for validation data case, let us assume that the cell counts from the main data as shown by Table 1.1 are generated from a binomial distribution. The likelihood function from this setting becomes L( = {o91}Y, Y) H1+fl3+flh5(1 — .)fl2+fl4+fl6 = 8fl01+fl03+fl05 (1 —00)fl02+fl04+fl 6 x 91111+fl13+fl15(l —80)fl12+fl14+fl16 Prior Specification: We are interested in (8o, Ox). Each of these parameters ranges from 0 to 1. To cover the whole real like from —oo to oo, we make a logit transformation of each of these parameters. To keep the problem manageable, we assume the following: (0 - (1og N ° Ie1) logy-)’ 1)’0o o? ))‘ where e0, e1 are just the logit transformed versions of 6, 6i respectively. Here (Os, 1) is assumed to follow bivariate normal distribution with h.y perparameters Jo, iti and o, 8i, . For the hyperparameters, the logic is the same as for the prior of (fl,Hi) in §3.2.3. The conditional distributions of a bivariate normal can be similarly derived. 28 3.3. MCMC Diagnostics Posterior: Similar to the case of data with a validation sub-sample, we proceed as follows for the mixed algorithm: 1. Set starting values of (08, 9). 2. At the tt iteration, a Given parameters = (0, 0), update the unobserved actual exposure data. • Based on the updated cell counts {uj} at the tth iteration, model parameters are generated as follows: (i) Simulate 0 conditioning on 0’ using the M-H algorithm. The proposed jumping rule is 9 Beta(noi + no3 + 7105 + 1, no2+no4+no6+1), with acceptance rate mm { (Ot_1t)} (ii) Simulate 14 conditioning on 0. The proposed jumping rule is 0 Beta(nn + fl13 + n15 + 1, n12 + nl4 + fl16 + 1), with I r(O’iI)acceptance rate mini 1 0’ • Calculate the log odds ratio at the tth iteration. 3. Repeat step (2) at subsequent iterations, for t = 1,.. .,m + n, to simulate target parameters alternately using the hybrid algorithm. 3.3 MCMC Diagnostics Formal convergence diagnostic techniques are addressed here, to identify various frequently occurring issues regarding mixing and coverage of the MCMC algorithms discussed in §3.2. There are several common issues as discussed by Gill [2008]: • There is no formal way to ensure that the chain at currently in the target distribution for a given Markov chain at a given time. • It is not possible to ensure that a Markov chain will explore all areas of the target distribution in finite time. • Slow convergence. Although theoretically this is not a problem, it is a practical issue. Fundamentally these concerns can be summarized as setting up the param eters of the process appropriately, ensuring satisfactory mixing throughout 29 3.3. MCMC Diagnostics the whole sample space, and obtaining convergence at some point. There are some design issues that must be taken into consideration before constructing and running the chain. Some of these considerations are taking decisions like determining where to start the chain, judging how long to burn-in the chain before recording values for inference, and determining whether to thin the chain values by discarding portions of the output. 1. Initialization: When little is known about the process, some researchers randomly distribute initial values through the state space. Usually it is best to try several different starting points in the state space and observe whether they lead to noticeably different descriptions of the posteriors. This is an obvious sign of non-convergence of the Markov chain. Unfortunately the reverse is not true: it is not the case that if one starts several Markov chains in different places in the state space and they congregate for a time in the same region that this is the re gion that characterizes the stationary distribution. It is possible that all of the chains are influenced by the same local maxima. 2. Burn-In: The beginning set of runs are discarded as they are not expected to be representative of the target distribution. Unfortunately, there is no formal way to calculate the appropriate length of the burn- in period. Assessing diagnostic plots or other convergence statistics described in the literature are the usual ways to determine the burn-in period. 3. Mixing: A chain that has not fully explored the stationary distribution will tend to give biased results since it is based on only a subset of the state space. Often slow mixing through the target distribution can be attributed to high correlation between model parameters - hence checking autocorrelation is a good idea. This is particularly the case with the Gibbs sampling algorithm. High intra-parameter correlation is also an issue with the Metropolis-Hastings algorithm since it will also induce slow mixing, due to observing too many rejected candidate values. 4. Chain thinning: In the very long simulations, storage of the observed values on the computer becomes a huge problem. Not only the stor age, but also the process of storing the high dimensional parameter realizations will slow down the computation. The idea of thinning the chain is to run the chain in an usual fashion, but record only every c-th value of the chain, thus reducing the storage demands while still pre serving the general trend of the Markov process. Here c is some small 30 3.3. MCMC Diagnostics integer. It is worth mentioning that this approach does not improve the quality of the estimate, speed up the chain or help in convergence - rather it is the other way around - the variance estimate will be some what distorted due to use of less observations. Still, it is a tool for dealing with possibly limited computer resources. Given the tradeoffs between storage and accuracy as well diagnostic ability, the value of c should be carefully chosen in any given problem. Keeping all the above aspects in mind, we still need to find the number of iterations that would be sufficient for approximating the convergence to the target distribution or the length of burn-in sample. Various methods are proposed in the literature for monitoring the convergence of Markov Chain Monte Carlo chains; see Cowles and Garlin [1996], Brooks et al. [1997], Geyer [1992], Raftery and Lewis [1992], Hastings [1970], Robert [1995] and Rizzo [2007] for more detailed discussion. We will discuss the graphical diagnosis and the approach suggested by Gelman and Rubin [1992] and gel; Gelman [2004]. 3.3.1 Conventional Graphical Diagnosis Graphically, trace plots are the most popular way to assess convergence. If the iterations are run for fairly long time, the trend will move from initial values to the desired density. Other plots that are popularly used include the mean graph - which plots the mean scores of the previous values versus the iteration number. If the chain is stable, a flat line will be produced. This does not evidently prove convergence, but if the chain is not producing a fiat line, this indicates that the chain has not yet converged. Also, density plots of the estimates after burn-in can be drawn. 3.3.2 Gelman-Rubin Method for Monitoring Convergence gel suggested that the lack of convergence can be appropriately detected by comparing multiple sequences (at least two) with initial points being widely dispersed in the target distribution. The Gelman-Rubin statistic R (shrink factor) of monitoring convergence of a Markov chain is based on comparing the behavior of a group of chains with respect to the variance of a given scalar summary statistic. The estimates of the variance of the statistic are analogous to estimates based on between-sample and within sample mean squared errors in a one-way analysis of variance. It uses the between sequence variation of the summary statistic as an upper bound and the within-sequence variance as a lower bound. The idea behind this 31 3.3. MCMC Diagnostics method is that, if the chain converges to the target distribution, both the variances will also converge. It is recommended that the sequence be run until R for all the summaries are less than 1.2 at most. If it is less than 1.1, the convergence is even better. 32 Chapter 4 Simulation Results 4.1 Data Generation To generate data for our setting as described in Table 1.1 and Table 1.2, the steps are: 1. We independently generate the true exposure status (V = 0 and 1) at Y i for i = 0, 1. The generating distribution is Bernoulli with parameter r. 2. We generate surrogate measurements V* IV from Bernoulli based on the fact that P(V*=1IV=0) = 1-SN F(V*=1IV=1) = SF, where r1 is the exposure prevalence, SN is the seusitivity and SF is the specificity under nondifferential classification. Now we cross-tabulate to get the validation table. The main data generation is exactly the same - but in this case we omit the true exposure status (V) from the classification - it is only about apparent measurements. 4.2 Scenario Settings Under Frequentist Adjustment While dealing with frequentist adjustments, we utilize all the n’s (observed values) but not the u’s (unobserved values) as mentioned in Tables 1.1 and 1.2. Tn the model without validation data or the two parameter model (these two names of this model will be used interchangeably throughout the entire work) discussed in §2.2.1, we simply use the column totals from the tables. But in the model with validation data (or four parameter model) discussed in §2.2.2, we also use the n’s that are inside the validation table. Hence, when we make comparison, for example - say, for sample size 2000 where 33 4.2. Scenario Settings Under Frequentist Adjustment 200 are in the validation and 1800 are in the main (unvalidated) part, then for without validation data we use the 2000 subjects aggregately as if there were no validation part (marginal total for the surrogate variable are known for both parts and the without validation model uses just these marginal totals). But with the validation model, we can recognize 200 subjects as comprising the validation part and the rest as the main part. To understand the performance of frequentist adjustment for nondiffer ential misclassification in the simplest possible way, several scenarios are considered, as shown in Table 4.1, varying the level of exposure prevalence, or sensitivity and specificity, or sample size, or sample proportion of the validation and main parts of the data. Notice that, the whole process is very complex. Here, the factors are merely assessed in an uni-dimensional way in all these cases, that is, all other factors are held constant when we switch from one scenario setting to another, so we will not be able to assess the possibility of interactions among the factors. That would require more combinations of scenarios and a huge amount of data would have to be gen erated. However, there would be some limitations to that approach as well - such as computing time and storage and, above all, comprehending and interpreting all those data would be challenging. As our objective of assess ing impacts on hypothesis testing is a relatively new one in epidemiologic research, this simplified approach should provide some rough ideas about the effects which will suffice as a first step in the process. We use power curves as the tool of comparison for this frequentist ap proach. Therefore, the null hypothesized value (difference between the ex posure prevalences is zero) is the mid-value on the horizontal axis. On the right and left side of it, four other equidistant difference points are selected in each direction based on the difference of the exposure prevalences from case and control groups, according to alternative hypothesis. In this work, the considered absolute difference between exposure prevalences from case and control groups were 0.05, 0.10, 0.15, 0.20 (fixing r0 and changing r1 to achieve the desired difference). Therefore, we have nine points in total in one power curve. The process of getting the estimated power is as follows for any one point: 10,000 datasets are generated according to the hypothe sized difference in exposure prevalence from two groups. We implement the hypothesis test on each dataset and evaluate the p-value. The estimated power is given by the proportion of the datasets that provide a p-value less than the chosen level of significance 0.05. In theory, with a large number of datasets, the lowest point of the power should be the chosen level of sig 34 4.2. Scenario Settings Under Frequentist Adjustment nificance at the null hypothesized point. This is under the assumption that asymptotic cut offs are accurate. Power curves from the hypotheses of H0 : = 6i = 6 (from without validation data) and H0 : r0 = r1 = r (from with validation data) are shown together in each graph because of their equivalence as described in §2.2. On a technical note, to allow reproducibility of the results, the seed is chosen arbitrarily and kept the same throughout the entire analysis. Table 4.1: Scenarios under consideration Factor changed SN, SF Sample Size Scenarios A B C D B F C B Validated data 200 200 200 200 200 200 200 200 Unvalidated data 1800 1800 1800 1800 200 400 800 1800 To = Ti 0.40 0.40 0.40 0.40 0.40 0.40 0.40 0.40 SN = SF 0.60 0.70 0.80 0.90 0.70 0.70 0.70 0.70 Factor changed Exposure Prevalence Proportion of data Scenarios I J K B B N 0 P Validated data 200 200 200 200 200 500 1000 1500 Unvalidated data 1800 1800 1800 1800 1800 1500 1000 500 r0 = 0.25 0.30 0.35 0.40 0.40 0.40 0.40 0.40 SN SF 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 4.2.1 Under Different Values of Sensitivity and Specificity We generated 10,000 datasets with exposure prevalence 0.4 for both case and control groups, where, in each dataset, 200 were in the validation part (50% in the case group, and the rest in the control group) of the data as was described in Table 1.2, and 1800 were in the main dataset (again 50% in the case group and the rest are in the control group) as was described in Table 1.1. Four different sets of sensitivity and specificity values were considered: 0.60, 0.70, 0.80 and 0.90. We implemented the likelihood ratio test (discussed in §2.2) for the two parameter (6o, 6) model for data with out validation part and the four parameter (ro, Ti, SN, SF) model for data with validation part. The estimated power curves out of these tests for all the cases under consideration are shown in Figure 4.1. From the figure, it is evident that the power of the two parameter model is always dominated by that of the four parameter model. The situation is much exacerbated 35 4.2. Scenario Settings Under Frequentist Adjustment when the values of sensitivity and specificity are low. However, when the misclassification is at least 0.8, the powers of both tests are almost the same. According to the theory, for the exact tests, the power at the null point (where the difference between the exposures of case and control groups are zero) should be equal to the level of significance, which is 0.05. Due to simulation error, this might deviate a bit. From the Figure 4.1, we can also see that the lowest point of powers do match at 0.05 in each setting. Therefore, the number of simulations considered here are adequate to show the power curves nicely. 4.2.2 Under Different Sample Sizes Like the previous scenario, we generated 10,000 datasets. The exposure prevalence for both case and control groups was 0.4. Sensitivity and speci ficity of both groups was 0.7. The sample sizes varied in this scenario as follows: 400, 600, 1000 and 2000, where in each situations, we had 200 as the validation part of the data and rest were the main part of the data (again half are allocated in case group, and rest are in control group in each setting). Still the four parameter model is superior considering the power of the likelihood ratio tests, as shown in Figure 4.2. Naturally, as the sample size increases, the power of both the tests increases. 4.2.3 Under Different Exposure Prevalence Rates In this scenario, we considered 10,000 datasets. Again sensitivity and speci ficity were set to be 0.7. In each dataset, we had 200 as the validation part of the data and 1800 as the main part of the data (50% in case group, and rest are in control group). The hypothesis regarding the exposure prevalence was always H0 : ro = = r or equivalently H0 : = 0, where r-j = — ri. Alternative hypothesis in this case would be that the difference of r0 and r1 is not zero. To draw a complete power curve, we assume that the possible differences in horizontal axis are 0.05, 0.10, 0.15, 0.20 in both directions, so that we get nine points in total to draw a power curve. There were four values of r under consideration: r = 0.25, r = 0.30, r = 0.35 and r = 0.40. From Figure 4.3, in all the cases, the power of the two parameter model is less than the four parameter models, and the power does not seem to vary much under different exposure prevalence values r. In practical situations, sometimes we see much less prevalence. Hence, we construct power curves for lower prevalence rates such as r = 0.005, r = 0.01, r = 0.05 and r 0.10, 36 42. Scenario Settings Under Frequentist Adjustment Figure 4.1: Power curves under different sensitivity and specificity values: 0.6, 0.7, 0.8 and 0.9 respectively \dtoD Parameter Difference Parameter Difference Parameter Difference Parameter Difference 37 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.2: Power curves under different sample sizes: 400, 600, 1000 and 2000 respectively (validation sub-sample size is fixed at 200 in each situation) Parameter Difference Parameter Difference Parameter Difference 38 4.2. Scenario Settings Under Frequentist Adjustment which are shown in Figure 4.4. In all the case, we find the previous conclu sion is still valid. One point we should mention is that, for lower exposure prevalence such as 0.005, while finding the maximum likelihood estimators, sometimes the optim function goes out of bound. Therefore, for finding maximum likelihood estimators of 10,000 simulations in the null hypothesis situation (rO = r1 = r), we had to iterate the process of generating new datasets 132,134 times for r = 0.005, 42,798 times for r 0.01, 10,859 times for r = 0.05 and 10,047 times for r = 0.10. However, for higher exposure prevalence rates, we never had this problem of non-convergence. 4.2.4 Under Different Proportion of Validation and Main Part of the Data As for all the other scenarios, we generated 10,000 datasets, with sensitivity and specificity 0.7 and exposure prevalence 0.4. But, keeping the total sample size fixed at 2000, we changed the proportions of the validation and the main (unvalidated) part of the dataset, — which are 1:9 (200:1800), 1:3 (500:1500), 1:1 (1000:1000) and 3:1 (1500:500) respectively. From Figure 4.5, the two parameter model has an identical power curve in all situations, but as the proportion of main data decreases for the four parameter model, power increases sharply. 4.2.5 Comparison under Budgetary Constraint Cost effectiveness is obviously an important measure of the ultimate worth of a study design. While designing a study, we aim to obtain the best qual ity of information for a given resource, say, in terms of money or time. Of course, the optimal solution for a given study design is hard to obtain, be cause not all the parameters are usually known and there might be external constraints. Nonetheless, for our study, by considering the stated assump tions and the parameters of the described models, we tried to find which model performs better under a fixed budgetary constraint. Validation data is costly to collect. The high cost of validation data limits the size of the validation sub-sample in a fixed cost design. From the previous scenarios we considered, the model without validation data could be at best as good as the model with validation data given favorable condi tions, but never better. The critical issue we wanted to investigate here is to 39 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.3: Power curves under different Exposure Prevalences: 0.25, 0.30, 0.35 and 0.4 respectively Parameter Difference Parameter Difference 40 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.4: Power curves under smaller Exposure Prevalences: 0.005, 0.01, 0.05 and 0.10 respectively Power Curve for HO: r = 0.005 Power Curve for HO: r.0.l Power Curve for HO: r 0.01 2poromelwMod — 4porovwlerMod D 0b5 0.0 0.5 00 Pwomwwve Power Corre for HO: r — 0.05 • - 2 po,uvulor Modd — 400,urnolerModd ibo o.b5 0.10 0.15 0 0 Jwenoo 2 pa,umolur Model — 4 parumotur Model 0.00 005 0.10 0.15 020 ParamelurJ2erecur -. 2 parunrole, Model — 4 parumoler Model 0.00 0.55 0.10 0.15 0.20 Pa,ameler0De,enoe 41 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.5: Power curves under different proportions of validation part and main part of the data: (1:9, 1:3, 1:1 and 3:1) respectively 42 4.2. Scenario Settings Under Frequentist Adjustment find out whether there is any point where the model without validation data becomes superior to the model with validation data. In other words, how costly the validation data have to be to abandon the model with validation data, or whether a researcher can always choose the model with validation data without any trade-off. We investigated using a particular example as follows. Say, we have $2400 as a budget for designing a retrospective study using either the model with validation data or the model without validation data. We arbitrarily set $1 as the cost of an unvalidated observation. We consider three pricing choices: 1. Collecting validated data costs three (3) times cost as much as collect ing unvalidated (main) data. The allocations of validated and unvali dated data considered are provided in Table 4.2. 2. Collecting validated data costs five (5) times cost as much as collecting unvalidated (main) data. The allocation of validated and unvalidated data considered are provided in Table 4.3. 3. Collecting validated data costs ten (10) times cost as much as collecting unvalidated (main) data. The allocations of amount of validated and unvalidated data considered are provided in Table 4.4. Table 4.2: Scenarios under constant cost = $2400, assuming that collecting vali dated data costs three (3) times as much as collecting unvalidated (main) data Scenario Validated data Unvalidated data Cost Q.S 2 x 50 2 x 1050 2 x (3 x 50 + 1050) = 2400 R.S 2 x 100 2 x 900 2 x (3 x 100 + 900) = 2400 S.3 2 x 200 2 x 600 2 x (3 x 200 + 600) = 2400 T.3 2x300 2x300 2x(3x300+300)=2400 In Tables 4.2, 4.3 and 4.4, we only consider situations where sample sizes are equal for cases and controls in both validated and unvalidated parts. From Figure 4.6, the with validation data model is still superior in all scenarios despite the fact that validation sample costs three times more to collect compared to an unvalidated sample. However, when the cost is five times as much, both models have almost the same utility, as shown in Figure 4.7. But from Figure 4.8, it is evident that the model without validation 43 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.6: Power curves under fixed amount of cost = $2400 assuming that collecting validated data costs three (3) times as much as collecting unvalidated (main) data —6.2 —6.1 olo oi 012 Parameter Difference —0.2 —0.1 0.0 0 1 Parameter Difference —6.2 —6.1 olo cli 02 Parameter Difference —0.2 —0.1 00 0.1 Parameter Difference Table 4.3: Scenarios under constant cost = $2400, assuming that collecting vali dated data costs five (5) times as much as collecting unvalidated (main) data Scenario Validated data Unvalidated data Cost Q.5 2 x 50 2 x 950 2 x (5 x 50 +950) =2400 R.5 2x100 2x700 2x(5x 100+700)=2400 S.5 2 x 150 2 x 450 2 x (5 x 150 + 450) 2400 T.5 2 x 200 2 x 200 2 x (5 x 200 + 200) = 2400 - - - - Without Validation Data — With Validation Data 2 Without Vetidation Data With Validation Data 012 44 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.7: Power curves under fixed amount of cost = $2400 assuming that collecting validated data costs five (5) times as much as collecting unvalidated (main) data Parameter Difference Parameter Difference O1O1O2 Parameter Difinrence Parameter Diffemnee Table 4.4: Scenarios under constant cost = $2400, assuming that collecting vali dated data costs ten (10) times as much as collecting unvalidated (main) data Scenario Validated data Unvalidated data Cost Q.1O 2x25 2x950 2x(10x25+950)=2400 R.1O 2x50 2x700 2x(10x50+700)=2400 Sb 2x75 2x450 2x(10x75+450)=2400 T.bO 2 x 100 2 x 200 2 x (10 x 100 + 200) = 2400 45 4.2. Scenario Settings Under Frequentist Adjustment Figure 4.8: Power curves under fixed amount of cost = $2400 assuming that collecting validated data costs ten (10) times as much as collecting unvalidated (main) data 0201000102 Parameter Difference Parameter Difference 0202 Parameter Difference Parameter Difference 46 4.2. Scenario Settings Under Frequentist Adjustment data can be superior to the with validation data model, given a fixed total cost and a much higher cost for validation data. This is the only case among the scenarios we have considered, where the model without validation data can possibly be superior - when the cost of a validated observation is much higher than the cost of an unvalidated observation. This is one practical limitation of the model with validation data that the researchers should keep in mind when designing a study. 47 4.3. Scenario Settings Under Bayesian Adjustment 4.3 Scenario Settings Under Bayesian Adjustment While dealing with Bayesian adjustments, we utilize all the n’s (observed values) and the u’s (unobserved values) in Tables 1.1 and 1.2, and the mod els (two parameter for data without validation part and four parameter models for data with validation part) discussed in §3.2.3 are utilized. As for frequentist adjustment, for the two parameter model, we simply use the column totals from the tables, while in the four parameter model, we use the n’s that are inside the validation table as they are observed. To ensure comparability, both models in Bayesian adjustment utilize the same amount of data. The only difference is that the four parameter model recognizes the validation part, while the two parameter model ignores the true classifica tion information of the validation part. Exactly the same scenarios discussed in §4.2 are considered to understand the performance of Bayesian adjustment to nondifferential misclassification, varying the level of exposure prevalence, or sensitivity and specificity, or sample size, or sample proportion of the validation and the main (unvali dated) part of the data. We used the power curve from the likelihood ratio tests as the compar ison tool in assessing the frequentist adjustments. However, finding such a tool for Bayesian adjustment models was not straightforward. Instead, this is what we have done: Once we have generated the data (as discussed in §4.1), we implemented the mixed algorithm as described in §3.2.3 for 10,000 Markov Chain Monte Carlo iterations. Half of the Markov Chain Monte Carlo iterations were discarded as burn-in (we will justify the length of chain and burn-in in §4.3.5). Using the retained chains, we constructed a 95% credible interval for the odds-ratio and checked whether this credible interval contained the null value (OR = 1) or not (where OR is a function of r0 and r1 for four parameter model as given in Equation (1.2), and also, for two parameter model, OR is a function of 9o and f as given in Equation (1.3)). One other way of serving this same purpose would be to construct 95% credible interval for the logarithmic transformation of the odds-ratio and test whether the constructed credible interval contains the null value (log OR 0) or not. 2,000 datasets for each set of parameters in a particu lar case of the scenario. To produce a graph for the cases of each scenario, we do it for nine points (corresponding to the differences of the alternative 48 4.3. Scenario Settings Under Bayesian Adjustment hypothesis, as was discussed in §4.2 for the frequentist approach of power curve construction procedure) for each of the models. This information of what proportion of credible intervals excluded the null value was used to find a probabilistic solution of the problem of comparison. This tool could also be labeled as a kind of power curve since this also uses the similar theme “re ject H0 if the credible interval excludes null value”, instead of the statement “reject H0 if the p-value is less than significance level”. From the deviation from one model’s curve to another, one can have some understanding of the performance of the two models in these situations. Again, to allow reproducibility of the results, the seed is chosen arbi trarily and kept the same throughout the entire analysis. Initial values need to be provided for Markov Chain Monte Carlo algorithms. Experience sug gests that the initial values does not have much impact on the final results. Details of this comment are shown in §4.3.5. 4.3.1 Under Different Values of Sensitivity and Specificity The same cases as considered in frequentist adjustment are carried out. From the Figure 4.9, it is evident that the two parameter model is always dominated by the four parameter model. 4.3.2 Under Different Sample Sizes Figure 4.10 shows that the tests get better as the sample size increases, but the four parameter model is always better than the two parameter model. 4.3.3 Under Different Exposure Prevalence Rates The graphs under consideration are shown in Figure 4.11. In all the cases, two parameter models are dominated by the four parameter models, and the graphs of credible intervals excluding null values do not seem to vary much for the various prevalence rates under consideration. 4.3.4 Under Different Proportion of the Validation Data From Figure 4.12, the two parameter model in all cases have almost similar curves, but the powers increase sharply as the proportion for the validation part increases for four parameter model. 49 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.9: Bayesian analysis results for different sensitivity and specificity values (.6, .7, .8, .9): Proportion of credible intervals excluding null value 50 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.10: Bayesian analysis results for different sample sizes (400, 600, 1000, 2000): Proportion of credible intervals excluding null value Parameter Difference Parameter Difference Parameter Difference 51 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.11: Bayesian analysis results for different exposure prevalence (.25, .3, .35, .4): Proportion of credible intervals excluding null value -0.2 —0.1 0.0 Parameter Difference 0.1 0.2 Parameter Difference Parameter Difference Parameter Difference 52 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.12: Bayesian analysis results for different ratio of data splits (1:9, 1:3, 1:1, 3:1 respectively for validation and main part): Proportion of credible intervals excluding null value It S S t a 1 0 1 S ant S -0.2 —0.1 0.0 Parameter Difference 0.1 0.2 -0.2 —0.1 0.0 01 Parameter Difference 02 S 20 t a p 1 - - - - - Withoat Validation Data IdatonDate It S S S ant S —0.2 —0.1 0.0 0.1 Parameter Difference 0.2 —0.2 —0.1 0.0 0.1 0.2 Parameter Difference 53 4.3. Scenario Settings Under Bayesian Adjustment In all four situations considered here, we have the same conclusion about the respective situations from both the frequentist and Bayesian approaches. In fact, if we compare Figure 4.9 with Figure 4.1 and Figure 4.10 with Figure 4.2 and Figure 4.11 with Figure 4.3 and Figure 4.12 with Figure 4.5, the shapes of curves from the respective situations are strikingly similarity. 4.3.5 Diagnostics For diagnostic purposes, we generate datasets with exposure prevalence 0.3 for both case and control groups and sensitivity and specificity both equals to 0.7. As shown in Figures 4.13, 4.14, 4.15 and 4.16 for 10,000 MCMC iterations, the trace plot of all the parameters r, r1, SN and SF look sta ble after the burn-in in four chains with different starting values (0.2, 0.4, 0.6, 0.8 for each parameters). All these figures are obtained using one single dataset as an example. The burn-in is colored as grey and after burn-in, the estimates are colored as black in each of these graphs. Sometimes graphical diagnostics are not very reliable, Therefore, we re sort to some statistics that are used for such chain diagnosis, such as Gelman and Rubin’s convergence diagnostic statistic which was discussed in §3.3.2. This statistic requires more than one chain, and hence we used the four chains with four different set of initial values. Theory says that the statistic should not go beyond 1.2. For this particular dataset, for the parameters r0, r1, SN and SF, we had the Gelman and Rubin’s convergence diagnostic statistic, R 1.011, 1.003, 1.011 and 1.008 respectively, for 10,000 itera tions in each. Figure 4.21 indicates the evolution of Gelman and Rubin’s convergence diagnostic statistic as the number of iterations increases. From this figure, it is evident that the chain is very satisfactory after burn-in. Also, to check whether the initial value has any effect on estimates, we plotted the means of each of the four chains which started from different initial values. As shown in Figures 4.17, 4.18, we can see that both converge to 0.3, which was the original exposure prevalence value used to generate the considered dataset. Similarly, from Figures 4.19, 4.20, we see that both the sensitivity and specificity eventually converge to 0.7, which was the parameter value used to generate the datasets under consideration. 54 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.13: Diagnosis of convergence of Bayesian analysis results: Trace Plots for ro in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset r0 r0 55 43. Scenario Settings Under Bayesian Adjustment Figure 4.14: Diagnosis of convergence of Bayesian analysis results: Trace Plots for r1 in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset O 20O 400 doo 8d00 ioOoo 9! iLii Ii!P1 g Jihihi. I I! ‘J - - it: -1: -r_‘Li’ 0 2000 4000 0000 Ii 0 00 4000 00oo 00oo io00 111111 ‘I I iiII1’ I’ll! 6000 10 2000 4000 Ii 0000 0000 10000 56 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.15: Diagnosis of convergence of Bayesian analysis results: Trace Plots for SN in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset O 2d00 4coo 6d00 8d00 ioOoo SN 11 a 0 2000 4000 6000 8000 10004 SN SN I SN 57 4.3. Scenario Settings Under Bayesian Adjustment rI9 6 2d00 4d00 oooo odoo ioOoo sP Figure 4.16: Diagnosis of convergence of Bayesian analysis results: Trace Plots for SF in 4 chains with different starting values (for 10,000 iterations, with half burn-in) for a single dataset wI 6 zobo 4000 doo adoo ioOoo sP iIi. U 0 I I I 1 0 2000 4000 6000 8000 10000 sP ‘I 0 2000 4000 6000 8000 10000 sp 58 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.17: Sequence of the mean of posterior for ro for the four Markov Chain Monte Carlo Chains for 10,000 iterations r0 r0 0 o ._ _______________________ 0 0 C., o C., 1’ 0 0 0 CN ° Ô 2d00 4OO 6d00 8d00 ioöc Ô 2d00 4d00 6d00 8OO ioôoo Chain 1 Chain 2 r0 r0 0 1 0 0 0 a 0 0 0 C. ° ó 2d00 4d00 6d00 8d00 ioôoo 6 2d00 4d00 6d00 8d00 ioôoo Chain 3 Chain 4 59 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.18: Sequence of the mean of posterior for r1 for the four Markov Chain Monte Carlo Chains for 10,000 iterations r1 r1 C 0 d 0 d 0 CD 0 a- a (‘1’ 0 0 0 c,J C., ° a 2d00 4000 60h0 8d00 ioöoo O 2d00 4d00 6d00 8OO bOa Chain 1 Chain 2 r1 0 0 0 0 0 0 _______________ 0 - a 2d00 4000 60h0 8d00 ioôoo 0 2000 4000 6000 8000 10000 Chain 3 Chain 4 60 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.19: Sequence of the mean of posterior for SN for the four Markov Chain Monte Carlo Chains for 10,000 iterations SN SN h ______ 0 d 0 0 CD o 0 C) 0 0 2d00 4000 6d00 8d00 10600 0 6 2000 4d00 6d00 8d00 ioôoo Chain 1 Chain 2 SN SN CD CD o CD r. CD o CD 0 cc? 0 0 0 0 __ _ ____ 6 2d00 4d00 6d00 8fiOO ioOoo 6 2000 4d00 6d00 8d00 ioôuo Chain 3 Chain 4 61 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.20: Sequence of the mean of posterior for SP for the four Markov Chain Monte Carlo Chains for 10,000 iterations sP sP 0 Cr, 0 0 ° 0 0 0 20O 4000 6d00 8000 ioôoo 0 2d00 4d00 6d00 80b0 ioOoo Chain 1 Chain 2 sP sP C.- 0 F-. CD CDCD F— 0 0 CD F-, C c’J, 0 0 C 0 F- ____________ 00 2d00 4d00 6d00 80b0 ioôoo 5 2d00 4d00 6d00 8d00 ioóoo Chain 3 Chain 4 62 4.3. Scenario Settings Under Bayesian Adjustment Figure 4.21: Sequence of the Celman-Rubin E for the four Markov Chain Monte Carlo Chains for 10,000 iterations Bum—in Samples — Retained Samples Cut—point at 1.1 - Cut—point at 1.2 Burn—in Semples — Retained Semplee Cut—point atll - Cut—point etIZ no - - - - Porn-in Samples — Ratainad Samptae Cut—point at 1.1 -- - Cut-pdntatl2 6 2600 4655 6650 ados ioOst SN Sum—inSamptes — Retained Samples Cot—point at 1.1 Cot—point at 12 6 265o 4600 sdoo Bobs ioOoo Sp S 2000 4000 6000 a000 10000 6 odoo 4600 6600 esbo ioboo 63 Chapter 5 Application in Epidemiological Studies 5.1 Introduction In almost all epidemiological studies, some amount of error in assessment is inevitable. The extent of such error depends on various factors, such as the nature of the exposure, and the instrument error associated with collecting the information. In this chapter, we will consider two epidemiologic datasets where challenges specifically arise in accurately identifying the outcome of exposure. The methods discussed in the previous chapters are considered and applied to these datasets. 5.2 Study of Sudden Infant Death Syndrome (SIDS) The performance of the methods described in the previous chapters are illustrated using a case-control study of antibiotic prescription during preg nancy and subsequent occurrence of Sudden Infant Death Syndrome (SIDS) [Greenland, 1988b, a, 2008], [Marshall, 1990, 1997], [Kraus et al., 1989]. The association of interest is between the prescription of antibiotics during pregnancy (V) and SIDS (Y). The surrogate exposure or error-prone mea surement (V*) was an interview response, whereas the true exposure (V) was derived from medical records. The validation studies, in cases (Y = 1) and controls (Y = 0), were joint (V*, V) designs done as sub-studies that resulted in the data presented in Table 5.1. Frequentist estimates of parameters of the two models under considera tion for the SIDS study data are reported in Table 5.2. From this table, we can see that the apparent prevalence rates are close estimates of the preva lence rates obtained while considering the validation data. The validation data model shows that the data has low sensitivity (0.6), but high speci 64 5.2. Study of Sudden Infant Death Syndrome (SIDS) Table 5.1: Data from the study of sudden infant death syndrome (SIDS) and antibiotic prescription Y Cases (Y = 1) J Controls (Y = 0) Validated Part V* = 1 V = 0 V* = 1 V = 0 V=r1 29 17 21 16 V=0 22 143 12 168 Unvalidated (main) 122 442 101 479 Total 173 602 134 663 ficity (0.9). The log-odds ratios in both groups are positive numbers. For without validation data, the estimate of odds ratio is 1.422 and 95% Wald confidence limits are (1.11, 1.83), calculated using the formula provided by Marshall [1997]. Not surprisingly, the likelihood ratio p-value obtained from this model is small (0.006). These results match with the case discussed by Greenland [2008]. On the other hand, for with validation data model, the estimated odds ratio is 1.49 with 95% Wald confidence limits (1.02, 2.16), which is coherent with the findings of Greenland and Gustafson [2006]. The likelihood ratio p-value is also small in this model (0.035). Therefore, the conclusions from both models are the same. They suggest that the hypoth esis H0 : r0 Ti is rejected at cv 0.05. That is, the true log(OR) is significantly far away from 0 based on the evidence provided by the SIDS study data. Table 5.2: Frequentist Estimates of the model parameters in the SIDS study Not considering_Validation data Considering Validation data Parameters Estimate S.E. Parameters Estimate S.E. 9o 0.168 0.013. r0 0.163 0.021 8 0.223 0.015 r1 0.225 0.024 SN 0.603 0.047 SP 0.903 0.013 log(OR) 0.352 0.128 log(OR) 0.398 0.191 P-value 0.006 P-value 0.035 The Bayesian estimates and standard errors are reported in Table 5.3. The priors used here are very general and similar to those described in §3.2.3. These results are very similar to those obtained using maximum likelihood. 65 5.2. Study of Sudden Infant Death Syndrome (SIDS) Both the 95% credible intervals of the odds ratio obtained from the with and without validation data model fail to include the null value 1 inside the interval. Moreover, the estimates and credible intervals are very similar to those obtained by frequentist methods. Therefore, the null hypothesis is still rejected by the Bayesian tools. That means the data suggests a positive association between the prescription of antibiotic and consequent incidence of SIDS, under the assumption of equality of misclassification probabilities. For the Bayesian estimates and hypothesis testing results reported in Ta ble 5.3 and trace plots in Figure 5.1, the initial values of r0, r1, SN and SP were set to 0.4, 0.4, 0.7 and 0.7 respectively. For 6o and 6, it was 0.2 and 0.2. One interesting issue needs to be addressed here. Other than a few spa cial cases, it is well known that under the nondifferential misclassification assumption, in absence of any other errors, the estimates of measure of as sociation, such as odds ratio should be biased towards the null ‘on average’. However, in this particular data, we notice that the estimate of odds ratio slightly goes away from null (1.42 to 1.49), as the theory suggests, but the p-value from Wald test gives us the opposite message - it increases from 0.0029 to 0.0184 respectively (which dictates towards the null behavior after adjustment). This might be due to the fact that the posterior variance is being iinrierestimated in the without validation data situation, and hence the posterior variance increases after adjustment, providing an even wider credible interval. Such phenomenon of increment of uncertainty even though the odds ratio moves away from null after adjustment, is already noted by Gustafson and Greenland [2006]. The likelihood ratio test acts similarly to the approximate Wald test. This might be one indication that the assump tion of nondifferentiality was not completely satisfied, if we rule out the explanation of random variation due to chance in this particular example. Since both p-values are small enough to reject the null hypothesis, this does not alter the conclusion in this example. The Gelman and Rubin convergence diagnostic statistic, 1? value for the four parameters ro, r, SN and SP are 1.002, 1.003, 1.045 and 1.009 respec tively. Also, for 8 and 8o, R gives 1.002 and 1.003 respectively. All these values are much less than 1.2. Here, various initial values were set to check the convergence - such as 0.2, 0.4, 0.6 and 0.8 for each of the parameters un der consideration. 10, 000 iterations were performed and half were retained after burn-in to estimate each parameters. Also, from Figure 5.2, we can see that the posterior distributions does not have any multimodality, which 66 5.2. Study of Sudden Infant Death Syndrome (SIDS) Figure 5.1: MCMC for the with and without validation data model parameters in the SIDS study 0 2000 4000 4 8000 1 — 4 Ii I ! 9ç1 0 2000 4000 0000 8004 000 80] FI LtI.[Iii 0 . i0I 67 5.3. Cervical Cancer and Herpes Simplex Virus Study Table 5.3: Bayesian Estimates of the model parameters in the SIDS study Not considering Validation setting J Considering Validation setting Parameters Estimate SD Parameters Estimate SD o 0.168 0.013 r0 0.161 0.020 9 0.222 0.015 r1 0.221 0.024 SN 0.609 0.046 SP 0.901 0.012 log(OR) 0.351 0.129 log(OR) 0.395 0.186 95%C.I. Does not include H0 value 95%C.I. Does not include Ho value (OR) (1.103, 1.830) (OR) (1.038, 2.153) is a sign of good convergence. 5.3 Cervical Cancer and Herpes Simplex Virus Study This data is listed in Carroll et al. [1993] and discussed in Prescott and Garthwaite [2002], Carroll et al. [2006]. The research question is whether exposure to herpes simplex virus contributes to the risk of cervical cancer. The response variable Y is an indicator of cervical cancer, V is exposure to type 2 herpes simplex virus (HSV-2) measured by a refined western blot procedure and V* is exposure to HSV-2 measured by the western blot pro cedure. The data is provided in Table 5.4. Table 5.4: Data from Herpes Simplex Virus-2 study Y Cases (Y = 1) ] Controls (Y = 0) Validated Part V* 1 V = 0 V = 1 V 0 V=1 18 5 16 16 V=0 3 13 11 33 [ Unvalidated (main) 375 318 535 701[ Total 396 336 562 750 Frequentist estimates of parameters of the two models under considera tion for the HSV-2 study data are reported in Table 5.5. From this table, we can see that the apparent prevalence rates and estimates of the prevalence 68 5.3. Cervical Cancer and Herpes Simplex Virus Study Figure 5.2: Prior and Posterior Distributions of all the Parameters under Con sideration in the SIDS study 04 0:5 06 07 08 09 10 sP pijot — Posterior - - - - prior — Postorior - - -. Prior — Posterior 00 0:1 02 0:3 0:4 r0 01 0:2 0:3 04 r1 ---- Prite — OSt000r 0:4 0:6 08 i’o SN -- -- Prior — Posterior oo o1 0:2 0:3 0:4 90 0o 01 0:2 03 0:4 69 5.3. Cervical Cancer and Herpes Simplex Virus Study rates for the case group obtained in the presence of the validation data are not in complete agreement. Especially the prevalence rates for case group are much higher than the control group, both in the before and after ad justments. For without validation data, the estimated odds ratio is 1.57 and 95% Wald confidence limits are (1.31, 1.89). Also, the likelihood ratio p-value obtained from this model is very small. On the other hand, for the with validation data model, the estimated odds ratio is 2.61 with 95% Wald confidence limits (1.62, 4.18). The likelihood ratio p-value is also very small in this model. Since the p-values obtained from both models are very small, the conclusions from both models are the same. They suggest that the hypothesis H0 : To = r is rejected at c = 0.05. The validation data model shows that the exposure assessment has moderate sensitivity, as well as moderate specificity. Table 5.5: Frequentist Estimates of the model parameters in the HSV-2 study Not considering Validation setting] Considering Validation setting Parameters Estimate S.E. Parameters Estimate S.E. 9 0.428 0.014 0.418 0.046 9 0.541 0.018 r1 0.652 0.053 SN 0.679 0.041 SP 0.743 0.043 log(OR) 0.453 0.093 log(OR) 0.958 0.237 P-value 9.966 x iO P-value 1.48 x 10 The Bayesian estimates and standard errors are reported in Table 5.6. For the model without validation data, these results are almost the same as those obtained using maximum likelihood. However, the estimates obtained from the model with validation data are not nearly as close. Nonetheless, both the 95% credible intervals of the odds ratio obtained from the with and without the validation data model fail to include the null value 1 inside the interval. Therefore, even with the Bayesian method, the null hypothesis is rejected. Moreover, the conclusions obtained from hypothesis testing are very similar to those obtained by frequentist methods. As a result, we can conclude that the exposure of HSV-2 is positively associated with increased risk of developing cervical cancer. Using the same prior that we used in simulations in chapter 4, we can see that the exposure prevalences are greatly underestimated in prior den 70 5.3. Cervical Cancer and Herpes Simplex Virus Study sities. The posterior exposure prevalences are very different than suggested in the prior. From Figure 5.4 it is evident that the posterior results are not dominated by the given prior. Table 5.6: Bayesian Estimates of the model parameters in the HSV-2 study Not considering Validation setting j Considering Validation setting Parameters Estimate SD Parameters Estimate SD 8 0.426 0.014 0.383 0.046 8 0.537 0.018 0.605 0.052 SN 0.700 0.043 SF 0.733 0.041 log(OR) 0.445 0.092 log(OR) 0.912 0.233 95%C.I. Does not include H0 value 95%C.I. Does not include H0 value (OR) (1.302, 1.870) (OR) (1.654, 4.085) For the Bayesian estimates and the trace plots, the initial values of r0, r1, SN and SP were set to 0.4, 0.4, 0.7 and 0.7 respectively. For 8o and 8, it was 0.45 and 0.45. The Gelman and Rubin convergence diagnostic statistic, 1? value for the four parameters r0, r1, SN and SF are 1.23, 1.16, 1.11 and 1.26 respec tively. Also, for 8 and 8, R gives 1.26 and 1.28 respectively. Notice that, most of these values are over 1.2 for 10, 000 iterations considering half of these as burn-in. Hence we can conclude that the convergence is not good for the cases under consideration for 10, 000 iterations. If we increase the number of iterations to 40,000 and R value for the four parameters r0, r1, SN and SF becomes 1.19, 1.16, 1.05 and 1.15 respectively. For 8o and 8, R now gives 1.01 and 1.10 respectively. As all of these R values are less than 1.2, we can conclude that the convergence is satisfactory for the cases under consideration for 40, 000 iterations, Therefore, we report the trace plots and the Bayesian estimates of the parameters for 40, 000 iterations in Table 5.6 and Figure 5.3. However, it should be noted that the changes in estimation are very small (changes mostly in third decimal places) and the standard errors are almost the same despite the larger number of iterations. As before, the initial values were set to be 0.2, 0.4, 0.6 and 0.8 for each of the parameters under consideration. One possible reason for this analysis requiring such large number of iterations could be due to the fact that some 71 5.3. Cervical Cancer and Herpes Simplex Virus Study Figure 5.3: MCMC for the with and without validation data model parameters in the HSV-2 study ij [i Before Burn—in — 3000 Burn-in ii11 iI’’ ‘1 6 io60o o6oo 30600 4060 Before Barn-in — her Burn-rn b r’ 6 1 Ooo 206 30603 4060 BBIB.BB rn-rn — 200cr Bum—rn 1F rr, irM’ r r 6 10600 20600 30603 4060C BBI000B rn-rn — re Burn-rn Ir Ni Ir 6 10600 20600 30600 4060 Before B urn-in — AflerBurn-rn 0 10000 20000 30000 40000 0 0000 20000 30000 40000 72 5.3. Cervical Cancer and Herpes Simplex Virus Study cell counts of the Table 5.4 are 5 or less. Another possibility is that the nondifferential assumption does not hold in this case. 73 5.3. Cervical Cancer and Herpes Simplex Virus Study Figure 5.4: Prior and Posterior Distributions of all the Parameters under Con sideration in the HSV-2 study 74 -V PflQf — P0ii0r ----Pta - 0.0 0.2 0.4 0.6 00 -. - V Pta — P0 04 05 06 07 0.6 09 10 SN 0.4 0.5 0.6 0.7 0.9 OS 1.0 SP ---V P599 — Posterior 60 Chapter 6 Conclusions and Further Research 6.1 Overall Conclusions Various practical issues force researchers to use inferior measures of expo sure assessment. When an ideal exposure measurement is replaced by an operational method or a surrogate variable, it is well known in the litera ture that due to the disparity between these two measures, there are several consequences of such compromise. Of course the extent of disparity plays a role in the consequences. To understand the extent to which the measure of association differs, a validation sub-sample is used to get some insight about the misclassification probabilities. Using the added information obtained from a validation sub-sample, adjustment measures are possible to correct for such bias and the subsequent power loss in hypothesis testing procedures. The nondifferentiality assumption is very popular in the epidemiologic literature due to its various attractive features. Two adjustment techniques are considered in this thesis under this assumption. One is based on fre quentist methods, power curves were derived for the likelihood ratio test both with and without validation data. This is basically a standard rou tine, used here as a benchmark. The detailed procedure is discussed in Chapter 2. The main goal is to evaluate the Bayesian counterpart which is based on a MCMC algorithm after reasonable diagnostic checks as dis cussed in Chapter 3. Both these methods are implemented in two settings: considering validation data and without considering validation data. In the frequentist method, estimates from the validation sub-sample are used to adjust for exposure misclassification, but in the Bayesian implementation, instead of having specific estimates of parameters, a set of priors are used so that some randomness or uncertainty is induced in the inferential process amongst cases and controls. 75 6.1. Overall Conclusions The main focus of this research is to identify the adjustment methodology that performs better under fairly general conditions in hypothesis testing. A set of scenarios are considered so that both methods can be compared using simulation study. These scenarios were constructed by varying the level of misclassification, prevalence, sample size, proportion of validation part ii1 the whole sample and under fixed cost constraint. Since a lot of scenarios are in possible, to simplify the problem, only the one dimensional effects due to the one of the parameters, sample size or sample composition change is considered at a time. Both methods are applied on all of these scenarios. Details are provided in Chapter 4. As a tool of evaluation, power curves are drawn for the frequentist method and the proportion of credible intervals that exclude the null value are plotted for Bayesian method. From these plots, it is clear that the with validation data model is always better. The without validation data (two parameter) model can be as good as with validation data (four parameter) model in extreme cases, but can never get better. We showed that this is true for hypothesis testing settings. The only case when the without validation data model can be superior to the with validation data model is under fixed budget, if the cost of collecting validated data is much higher than collecting usual unvalidated data. How high is high? This depends on the various parameters, considered sample sizes, composition of sample and budget for the study. We just showed by example that such an exception is possible. It is worth mentioning that the settings considered by Greenland and Gustafson [2006] are slightly different than those considered in this work, although they also address the issue of adjusting for misclassification in the context of hypothesis testing. In that paper, it is shown that given known or reasonably assumed (say, from educated guesses) values of sensitivity and specificity, the power does not improve after adjustment under nondiffer ential misclassification error (assumed to be free from any other sources of errors). This suggestion was based on the analysis of a single dataset. In contrast, in the current work, we showed that in presence of validation data, which enables us to estimate the true exposure prevalences, sensitivity and specificity, we have more power after adjustment, subject to the condition that the nondifferential misclassification assumption is satisfied. If the plots of the frequentist and the Bayesian method results for respec tive scenario are superimposed, they are almost indistinguishable. Compar ing these plots under each scenario, it is evident that both methods perform exactly the same way. Having both methods producing the same conclusion, 76 6.2. Further Research and Recommendations it is worth mentioning that the Bayesian framework, although very easy to generalize to other extensions of this problem, are very demanding in terms of resources and computing time to attain results without any MCMC di agnostic anomaly. On the other hand, with the frequentist methods used here, although closed forms are not always attainable, simple numerical rou tines can optimize these likelihoods very quickly. To give real life flavor, two epidemiologic datasets are also analyzed using the above methodologies in chapter 5, which are coherent with the simulation results. 6.2 Further Research and Recommendations Further research could focus on extending some of the simplistic assump tions that were considered, adapting the proposed models for problems with similar specifications and generalizing the simulation scenario to broader contexts. • One can consider larger combinations of the scenario setting than con sidered in this work to describe the effects in a broader sense. One could organize this effort by developing an experimental design (e.g., a fractional factorial design) involving the factors of interest. • One immediate extension of the work is to go beyond nondifferential assumption and check the results under differential misclassification, which is more realistic in many fields. For Bayesian adjustment, this can be easily done by considering the general model where the mis classification probabilities are different with respect to case and control and imposing a joint prior for those parameters with an assumed co variance structure. • To make the problem more realistic, additional exposures that are cor rectly measured are worth adding in the model. A logistic regression model can be a start in this direction. • This dissertation only deals with dichotomous exposure misclassifica tion. Polytomous exposure misclassification can also be another exten sion to this research. Instead of binomial assumption of misclassified exposure, a multinomial assumption will be used in that case. 77 6.2. Further Research and Recommendations • The models used in this work can be modified to allow using repli cated sub-set of data or data obtained from an alternative source in the absence of a benchmark scorer or gold standard method of expo sure assessment, instead of validation data, which could be more cost effective, especially when the cost of validation data is very high. • It is also worth investigating other tools to analyze the continuous ex posure data directly, instead of dichotomizing it to make it categorical, and try to identify how much sensitivity does one loose by categorizing the exposure variable. Spline analysis can be one way to move in this direction. • The Bayesian hypothesis testing could be accomplished by using the Bayes factor, and then compared with the standard likelihood tech niques to find out whether there is any discrepancy in those two methodologies. 78 Bibliography Barron, B., The effects of misclassification on the estimation of relative risk, Biometrics, 93(2), 414—418, 1977. Brooks, S., P. Dellaportas, and G. Roberts, An approach to diagnosing total variation convergence of MCMC algorithms, Journal of Computational and Graphical Statistics, 6(3), 251—265, 1997. Bross, I., Misclassification in 2 x 2 tables, Biometrics, 10(4), 478—486, 1954. Broyden, C., The convergence of a class of double-rank minimization algo rithms 1. general considerations, IMA Journal of Applied Mathematics, 6(1), 76—90, 1970. Carroll, R., M. Gail, and J. Lubin, Case-control studies with errors in covari ates, Journal of the American Statistical Association, 88(421), 185—199, 1993. Carroll, R., D. Ruppert, L. Stefanski, and C. Crainiceanu, Measurement Er ror in Nonlinear Models: A Modern Perspective, Chapman & Hall/CRC, 2006. Chen, T., A review of methods for misclassified categorical data in epidemi ology, Statistics in Medicine, 8(9), 914—921, 1989. Chu, R., Bayesian Adjustment for Exposure Misclassification in Case- Control Studies, Master’s thesis, Department of Statistics, University of British Columbia, 2005. Copeland, K., H. Checkoway, A. McMichael, and R. Holbrook, Bias due to misclassification in the estimation of relative risk, American Journal of Epidemiology, 105(5), 488—495, 1977. Cowles, M., and B. Carlin, Markov chain Monte Carlo convergence diagnos tics: a comparative review, Journal of the American Statistical Associa tion, 91(434), 883—904, 1996. 79 Chapter 6. Bibliography Diamond, E., and A. Lilienfeld, Effects of errors in classification and diagno sis in various types of epidemiological studies, American Journal of Public Health, 52(7), 1137—1145, 1962a. Diamond, E., and A. Lilienfeld, Misclassification errors in 2 x 2 tables with one margin fixed: some further comments, American Journal of Public Health, 52(12), 2106—2111, 1962b. Dosemeci, M., S. Wacholder, and J. Lubin, Does nondifferential misclas sification of exposure always bias a true effect toward the null value?, American Journal of Epidemiology, 132(4), 746—748, 1990. Fletcher, R., A new approach to variable metric algorithms, The Computer Journal, 13(3), 317—322, 1970. Gelman, A., Bayesian Data Analysis, CRC Press, 2004. Gelman, A., and D. Rubin, Inference from iterative simulation using multiple sequences, Statistical Science, 7(4), 457—472, 1992. Geyer, C., Practical markov chain monte carlo, Statistical Science, 7(4), 473—483, 1992. Gill, J., Bayesian Methods: A Social and Behavioral Sciences Approach, Chapman & Hall/CRC, 2008. Gladen, B., and W. Rogan, Misclassification and the design of environmental studies, American Journal of Epidemiology, 109(5), 607—616, 1979. Goldberg, J., The Effects of Misclassification on the Analysis of Data in 2 X 2 Tables, Unpublished dissertation, Harvard University School of Public Health. Boston, 1972. Goldberg, J., The effects of misclassification on the bias in the difference be tween two proportions and the relative odds in the fourfold table, Journal of the American Statistical Association, 70(351), 561—567, 1975. Goldfarb, J., A family of variable metric updates derived by variational means, Mathematics of Computing, 24(109), 23—26, 1970. Greenland, S., The effect of misclassification in the presence of covariates, American Journal of Epidemiology, 112(4), 564—569, 1980. 80 Chapter 6. Bibliography Greenland, S., Statistical uncertainty due to misclassification: implications for validation substudies., Journal of Clinical Epidemiology, 41(12), 1167— 1174, 1988a. Greenland, S., Variance estimation for epidemiologic effect estimates under misclassification, Statistics in Medicine, 7(7), 745—757, 1988b. Greenland, S., Maximum-likelihood and closed-form estimators of epidemio logic measures under misclassification, Journal of Statistical Planning and Inference, 138(2), 528—538, 2008. Greenland, S., and P. Gustafson, Accounting for independent nondifferential misclassification does not increase certainty that an observed association is in the correct direction, American Journal of Epidemiology, 164(1), 63—68, 2006. Gullen, W., J. Bearman, and E. Johnson, Effects of misclassification in epidemiologic studies., Public Health Reports, 83(11), 914—918, 1968. Gustafson, P., Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments, CRC Press, 2004. Gustafson, P., and S. Greenland, Curious phenomena in Bayesian adjust ment for exposure misclassification, Statistics in Medicine, 25(1), 2006. Gustafson, P., N. Le, and R. Saskin, Case-control analysis with partial knowledge of exposure misclassification probabilities, Biometrics, 57(2), 598—609, 2001. Hastings, W., Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57(1), 97—109, 1970. Joseph, L., T. Gyorkos, and L. Coupal, Bayesian estimation of disease preva lence and the parameters of diagnostic tests in the absence of a gold stan dard, American Journal of Epidemiology, 141 (3), 263—272, 1995. Jurek, A., S. Greenland, G. Maldonado, and T. Church, Proper interpreta tion of non-differential misclassification effects: expectations vs observa tions, International journal of epidemiology, 34(3), 680—687, 2005. Keys, A., and J. Kihlberg, Effect of misclassification on estimated rela tive prevalence of a characteristic: Part I. Two populations infallibly dis tinguished. Part II. Errors in two variables, American Journal of Public Health, 53(10), 1656, 1963. 81 Chapter 6. Bibliography Koch, G., The Effect of Non-Sampling Errors on Measures of Association in 2 x 2 Contingency Tables, Journal of the American Statistical Association, 64 (327), 852—863, 1969. Kraus, J., S. Greenland, and M. Bulterys, Risk factors for sudden infant death syndrome in the US Collaborative Perinatal Project, International Journal of Epidemiology, 18(1), 113—120, 1989. Lyles, R., A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure, Biometrics, 58(4), 1034—1037, 2002. Marshall, J., The use of dual or multiple reports in epidemiologic studies, Statistics in Medicine, 8(9), 1041—1049, 1989. Marshall, R., Validation study methods for estimating exposure proportions and odds ratios with misclassified data., Journal of Clinical Epidemiology, 43(9), 941—947, 1990. Marshall, R., Assessment of exposure misclassification bias in case-control studies using validation data, Journal of Clinical Epidemiology, 50(1), 15—19, 1997. Morrissey, M., and D. Spiegelman, Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons, Bio metrics, 55(2), 338—344, 1999. Newell, D., Errors in the interpretation of errors in epidemiology, American Journal of Public Health, 5, 1925—1928, 1962. Prescott, G., and P. Garthwaite, A simple Bayesian analysis of misclassified binary data with a validation substudy, Biometrics, 58(2), 454—458, 2002. Raftery, A., and S. Lewis, How many iterations in the Gibbs sampler, Bayesian Statistics, 4, 763—773, 1992. Rahme, E., L. Joseph, and T. Gyorkos, Bayesian sample size determination for estimating binomial parameters from data subject to misclassification, Applied Statistics, 49(1), 119—128, 2000. Rizzo, M., Statistical Computing with R, Taylor & Francis, Inc., 2007. Robert, C., Convergence control methods for Markov chain Monte Carlo algorithms, Statistical Science, 10(3), 231—253, 1995. 82 Chapter 6. Bibliography Rothman, K., S. Greenland, and T. Lash, Modern Epidemiology, Lippincott Williams & Wilkins, 2008. Shanno, D., Conditioning of quasi-Newton methods for function minimiza tion, Mathematics of Computation, 2 (111), 647—656, 1970. Thomas, D., When will nondifferential misclassification of an exposure pre serve the direction ofatrend?, American Journal of Epidemiology, .L4.2(7), 782—784, 1995. Walter, S., and L. Irwig, Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review., Journal of Clinical Epidemiology, 41(9), 923—937, 1988. Willett, W., An overview of issues related to the correction of non differential exposure measurement error in epidemiologic studies, Statis tics in Medicine, 8(9), 1031—1040, 1989. Youden, W., Index for rating diagnostic tests, Cancer, 8(1), 32—35, 1950. 83
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Evaluating the performance of hypothesis testing in...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Evaluating the performance of hypothesis testing in case-control studies with exposure misclassification,… Karim, Mohammad Ehsanul 2009
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Evaluating the performance of hypothesis testing in case-control studies with exposure misclassification, using frequentist and Bayesian techniques |
Creator |
Karim, Mohammad Ehsanul |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | In epidemiologic studies, measurement error in the exposure variable can have large effects on the power of hypothesis testing for detecting the impact of exposure in the development of a disease. As it distorts the structure of data, more uncertainty is associated with the inferential procedure involving such exposure variables. The underlying theme of this thesis is the adjustment for misclassification in the hypothesis testing procedure. We consider problems involving a correctly measured binary response and a misclassified binary exposure variable in a retrospective case-control scenario. We account for misclassification error via validation data under the assumption of non-differential misclassification. The objective here is to develop a test to check whether the exposure prevalence rates of cases and controls are the same or not, under the frequentist and Bayesian point of view. To evaluate the test developed under the Bayesian approach, we compare that with an equivalent test developed under the frequentist approach. Both these approaches were developed in two different settings: in the presence or absence of validation data, to evaluate whether there is any gain in hypothesis testing for having such validation data. The frequentist approach involves the likelihood ratio test, while the Bayesian test is developed from posterior distribution generated by a mixed MCMC algorithm and a normal prior under realistic assumptions. The comparison between these two approaches is conducted using different simulated scenarios, as well as two real case-control studies having partial validation (internal) data. Different scenarios include settings with varying sensitivity and specificity, sample sizes, exposure prevalence and proportion of unvalidated and validated data. One other scenario that was considered is to evaluate the performance under a fixed budgetary constraint. In the scenarios under consideration, we reach the same conclusion from the two hypothesis testing procedures. The simulation study suggests that the adjusted model (with validation data model) is always better than the unadjusted model (without validation data model). However, exception is possible in the fixed budget scenario. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-03-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0069386 |
URI | http://hdl.handle.net/2429/22472 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2009_fall_karim_mohammad.pdf [ 1.65MB ]
- Metadata
- JSON: 24-1.0069386.json
- JSON-LD: 24-1.0069386-ld.json
- RDF/XML (Pretty): 24-1.0069386-rdf.xml
- RDF/JSON: 24-1.0069386-rdf.json
- Turtle: 24-1.0069386-turtle.txt
- N-Triples: 24-1.0069386-rdf-ntriples.txt
- Original Record: 24-1.0069386-source.json
- Full Text
- 24-1.0069386-fulltext.txt
- Citation
- 24-1.0069386.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0069386/manifest