Topics on the Effect of Non-differential Exposure Misclassification

by

Dongxu Wang

B.Sc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate Studies
(Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

July 2012

© Dongxu Wang 2012

Abstract

There is an extensive literature on the deleterious impact of exposure misclassification when inferring exposure-disease associations, and on statistical methods to mitigate this impact. When the exposure is a continuous variable or a binary variable, a general mismeasurement phenomenon is attenuation in the strength of the relationship between exposure and outcome. However, few have investigated the effect of misclassification on a polychotomous variable. Using Bayesian methods, we investigate how misclassification affects the exposure-disease association under different settings of the classification matrix. We also apply a trend test and assess the effect of misclassification through the power of the test. In addition, since virtually all work on the impact of exposure misclassification presumes the simplest situation, where both the true status and the classified status are binary, my work diverges from the norm in considering classification into three categories when the actual exposure status is simply binary. Intuitively, the classification states might be labeled as 'unlikely exposed', 'maybe exposed', and 'likely exposed'. While this situation has been discussed informally in the literature, we provide some theory concerning what can be learned about the exposure-disease relationship under various assumptions about the classification scheme. We focus on the challenging situation whereby no validation data is available from which to infer classification probabilities, but some prior assertions about these probabilities might be justified.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Exposure Mismeasurement
  1.2 The Impact of Mismeasurement
    1.2.1 Mismeasured Continuous Variables
    1.2.2 Misclassified Binary Variables
    1.2.3 Misclassified Polychotomous Variables
  1.3 Case-Control Studies with a "Maybe" Exposed Group
2 The Impact of Misclassified Polychotomous Variable
  2.1 Subjects Cannot Be Misclassified More Than Two Categories off the True Exposure Level
    2.1.1 When E(Y | X) is Monotone
    2.1.2 When E(Y | X) is not Monotone
  2.2 Subjects Cannot Be Misclassified More Than Three Categories off the True Exposure Level
    2.2.1 When E(Y | X) is Monotone
    2.2.2 When E(Y | X) is not Monotone
  2.3 The Effect of the Exposure Distribution
  2.4 Test for Trend Across Categories
    2.4.1 Trend Test for Actual Exposure
    2.4.2 Trend Test for Apparent Exposure
    2.4.3 Power of the Trend Test
  2.5 Example: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
3 Case-Control Study with "Maybe" Exposed Group
  3.1 Identification Regions
    3.1.1 Constraint A
    3.1.2 Constraint B
    3.1.3 Comparison Between Constraint A and B
    3.1.4 Collapsing Exposure to Two Categories
    3.1.5 Examples
  3.2 Limiting Posterior Distribution
    3.2.1 Principle
    3.2.2 Examples
  3.3 Finite-Sample Posteriors
4 Conclusion
Bibliography

List of Tables

3.1 The lower bound on the odds ratio for collapsed case and for constraint A and B.

List of Figures

2.1 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is increasing
2.2 $\theta$ and $\theta^*$ when X is unimodal and $\theta$ is increasing
2.3 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is unimodal
2.4 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is unimodal
2.5 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is increasing
2.6 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is unimodal
2.7 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is increasing
2.8 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is increasing
2.9 $\theta$ and $\theta^*$ when X is uniformly distributed and $\theta$ is increasing
2.10 Relationship between tooth length and dose levels
3.1 Prior and identification regions
3.2 Identification regions under the combination (− − −)
3.3 Identification regions under the combination (− − +)
3.4 Identification regions under the combination (− + −)
3.5 Identification regions under the combination (− + +)
3.6 Identification regions under the combination (+ − −)
3.7 Identification regions under the combination (+ − +)
3.8 Identification regions under the combination (+ + −)
3.9 Identification regions under the combination (+ + +)
3.10 Limiting posterior distributions under the combination (− − −)
3.11 Limiting posterior distributions under the combination (− − +)
3.12 Limiting posterior distributions under the combination (− + −)
3.13 Limiting posterior distributions under the combination (− + +)
3.14 Limiting posterior distributions under the combination (+ − −)
3.15 Limiting posterior distributions under the combination (+ − +)
3.16 Limiting posterior distributions under the combination (+ + −)
3.17 Limiting posterior distributions under the combination (+ + +)
3.18 Posterior distributions under the combination (− − −)
3.19 Posterior distributions under the combination (+ + +)
3.20 Posterior distributions via informal analysis under the combination (− − −)
3.21 Posterior distributions via informal analysis under the combination (+ + +)

Acknowledgements

First and foremost, I would like to express my sincerest gratitude to my supervisor, Paul Gustafson. During my graduate study, his expertise and suggestions helped me throughout the research and writing of this thesis, and at the same time his patience and understanding allowed me the room to work in my own way. This thesis would not have been completed without his guidance and help. He is the best supervisor anyone could wish for.

I would like to thank Professor Lang Wu for taking time out from his busy schedule to be the second reader of this thesis.

I must also thank Professors Rollin Brant, John Petkau, Matias Salibian-Barrera, and Ruben Zamar for their teaching. They are the best teachers I have had in my life.

I would like to thank all my student colleagues, Ardavan Saeedi, Chao Xiong, Hao Luo, Hongyang Zhang, James Proudfoot, Jessica Chen, Jing Dong, Lang Qin, Meijiao Guan, Qian Ye, Yang Liu, Yongliang Zhai, and Yuqing Wu, for their various forms of support during my graduate study. They provided a stimulating and fun environment in which to learn and grow.

I would like to thank all the staff in the department office, Peggy Ng, Elaine Salameh, and Andrea Sollberger, for their consistent help during these two years. Because of them, I could focus on my research.

Last but not least, I wish to thank my parents, Jian Wang and Guang Yu, for supporting me throughout my life. You are always the most important people to me. I also need to thank my girlfriend, Zhaozhao Qin, for her love and care during these years.

Chapter 1

Introduction

In many studies, especially in epidemiology, researchers intend to measure both the outcome Y and the exposure X for given experimental units, and use statistical models to investigate the relationship between Y and X from the recorded data. Depending on the study design, the number of exposure and outcome variables varies considerably. In this thesis, we focus on the simplest case, which contains only one outcome Y and one exposure X. In general, both Y and X can be binary, polychotomous, or continuous. For example, the outcome can be death status, which is a binary variable; health condition measured on a five-point Likert scale, which is a polychotomous variable; or blood pressure, which is continuous.

In the ideal case, when the outcome and the exposure are exactly the target variables and are precisely measured, the measurement is considered the 'gold standard'. In this situation, as both Y and X are measured precisely, there is no doubt that the relationship inferred from the recorded data is the intended relationship.
However, in almost all studies it is impossible to have a gold standard measurement because of various types of limitations. These limitations include selection bias and information bias, which are discussed by Gordis (1996) [7]. For example, Krishnaiah, et al (2005) [15], carried out a cross-sectional study to investigate the association between smoking and cataracts. To assess cumulative smoking dose, subjects were classified based on cigarette pack-years, calculated by multiplying the number of packs of cigarettes smoked per day by the number of years the person had smoked. However, as data were gathered through questionnaires, both the number of packs smoked per day and the number of years of smoking were recorded imprecisely because of reporting bias. Many heavy smokers would be expected to report fewer cigarettes than they actually smoked, while some smokers might even over-report. Therefore, the recorded variables might not be exactly the actual target variables, but corresponding surrogate variables: a surrogate outcome $Y^*$ in place of $Y$, or a surrogate exposure $X^*$ in place of $X$.

In some cases, however, several measurements exist but none of them is viewed as a gold standard. Mismeasurement is not the concern in these cases, and the issues discussed in this thesis do not apply. Instead, these measurements might be combined into one or more dimensions that describe their main features, as mentioned in Kraemer, et al (2003) [14].

1.1 Exposure Mismeasurement

The goal of a study is to investigate the relationship between $Y$ and $X$, while the relationship obtained from the recorded data is actually between the surrogates, not the target variables. An analysis which pretends the surrogate variables are exactly the same as the variables of interest is referred to as 'naive'.
Because of the mismeasurement, the recorded data might lead to misleading results, which can substantially affect the conclusions. Therefore, it is worth discussing the effect of mismeasurement. For example, Kenkel, et al (2004) [13], investigated the effect of mismeasurement in retrospective smoking data. They showed that failing to account for mismeasurement would have led to incorrect inferences about the association between cigarette price and smoking participation. In conclusion, it is important to take the impact of mismeasurement into account.

In real cases, both the outcome and the exposure are possibly subject to mismeasurement. Gustafson (2004) [8] focuses mainly on situations where the exposure $X$ is subject to mismeasurement, setting aside the equally realistic scenarios where the outcome $Y$ is subject to mismeasurement. When only exposure mismeasurement happens, an analysis that pretends the surrogate $X^*$ is actually the variable of interest can lead to biased inferences, not just less precise inferences. Therefore, in this thesis we only consider the situation where the exposure $X$ is subject to mismeasurement but the outcome $Y$ is precisely measured. That is, the discussion that follows concerns studies where a surrogate exposure $X^*$ and an actual outcome $Y$ are recorded. Many studies have also been done for the case where the outcome is subject to misclassification, such as Garcia-Zattera, et al (2010) [6], and Savoca (2011) [19].

If the exposure $X$ is categorical, imprecise measurements of $X$ are called misclassifications, while if it is continuous, they are called measurement errors, as described in Shen (2009) [20]. When $X$ is continuous, let the measurement error be $Z$. The surrogate $X^*$ is defined as a linear form of $X$ such that $X^* = X + Z$. When $X$ is categorical, misclassification happens when an item is classified into a wrong exposure level.

It is also worth mentioning that mismeasurement may occur in two forms: differential and non-differential. In differential mismeasurement, the distribution of the surrogate $X^*$ depends on both the outcome $Y$ and the true exposure $X$. For example, mismeasurement of exposure may occur such that subjects in the case group are mismeasured more often than those in the control group, as explained in Gordis (1996) [7]. In contrast, non-differential mismeasurement refers to the situation where the surrogate $X^*$ depends only on the true exposure $X$ but not on the outcome $Y$. More formally, the conditional distribution of $(X^* \mid X, Y)$ is identical to the conditional distribution of $(X^* \mid X)$. Non-differential mismeasurement results from the degree of inaccuracy that characterizes how information is obtained from any study group, as explained in Gordis (1996) [7]. In this thesis, all discussion is under the assumption of non-differential exposure mismeasurement.

1.2 The Impact of Mismeasurement

Many have discussed how well naive estimation performs in the face of mismeasurement, for both continuous and binary exposures, e.g., Gustafson (2004) [8]. A general mismeasurement phenomenon is attenuation in the strength of the relationship between $Y$ and $X$ induced by mismeasuring $X$ as $X^*$. The attenuation factor is used to represent the magnitude of the impact of mismeasurement. However, even though polychotomous exposure is much more common than binary exposure in real cases, little work has been done on this kind of situation. Before misclassification can be adjusted for in such studies, it is worth discussing the effect of misclassification for polychotomous exposure first. We briefly state the effect of mismeasurement for continuous and binary exposures here, and mainly discuss the effect for polychotomous exposure in Chapter 2.

1.2.1 Mismeasured Continuous Variables

Consider the case where both the outcome $Y$ and the exposure $X$ are continuous. The true relationship between $Y$ and $X$ is assumed to be $E(Y \mid X) = \beta_0 + \beta_1 X$, where the coefficients $\beta_0$ and $\beta_1$ are unknown. When only exposure mismeasurement happens, $Y$ is measured correctly but a surrogate $X^*$ is recorded in place of the target exposure $X$. If the researchers do not adjust for the mismeasurement, the naive analysis will investigate the relationship $E(Y \mid X^*) = \beta_0^* + \beta_1^* X^*$ instead of $E(Y \mid X) = \beta_0 + \beta_1 X$. In this way, $\beta_0^*$ and $\beta_1^*$ will be interpreted, rather than $\beta_0$ and $\beta_1$.

Assume that both $X$ and $X^*$ follow normal distributions. Also, assume that $X^*$ is a non-differential and unbiased surrogate of $X$. Let $X \sim N(\mu, \lambda^2)$ and $X^* \mid X \sim N(X, \tau^2\lambda^2)$, where the parameter $\tau$ is chosen as $SD(X^* \mid X)/SD(X)$. Under these assumptions, standard distributional theory implies that $(X^*, X)$ has a bivariate normal distribution. Using the non-differential property and the properties of this bivariate normal distribution, the relationship between $\beta^*$ and $\beta$ can be written as:

$$\beta_0^* = \beta_0 + \frac{\tau^2}{1+\tau^2}\,\mu\beta_1, \qquad \beta_1^* = \frac{\beta_1}{1+\tau^2}.$$

The attenuation factor is $\beta_1^*/\beta_1 = 1/(1+\tau^2)$, which is positive but smaller than one. It shows that the relationship between $Y$ and $X^*$ has the same direction of trend as that between $Y$ and $X$. However, since $|\beta_1^*| < |\beta_1|$, the slope of the relationship between $Y$ and $X^*$ will always be flatter. Espino-Hernandez, et al (2010) [5], studied how to adjust for measurement error in case-control studies with continuous exposure using the Bayesian method.

1.2.2 Misclassified Binary Variables

A categorical exposure can be either binary or polychotomous. When $X$ is binary, the impact of misclassification is easy to intuit. The magnitude of the misclassification can be described by the sensitivity and the specificity of $X^*$, i.e., the probability of correct classification for exposed and unexposed subjects respectively.
Note that the sensitivity is defined as $SN = \Pr(X^* = 1 \mid X = 1)$, and the specificity is defined as $SP = \Pr(X^* = 0 \mid X = 0)$.

Consider the case where the outcome $Y$ is continuous and the exposure $X$ is binary. Without loss of generality, the relationship between $Y$ and $X$ can be written as $E(Y \mid X) = a + bX$, such that $E(Y \mid X = 0) = a$ and $E(Y \mid X = 1) = a + b$. When misclassification arises, $E(Y \mid X^*) = a^* + b^* X^*$ is estimated instead of $E(Y \mid X) = a + bX$, where $a^* = a + b\Pr(X = 1 \mid X^* = 0)$ and

$$\frac{b^*}{b} = 1 - \Pr(X = 0 \mid X^* = 1) - \Pr(X = 1 \mid X^* = 0).$$

The attenuation factor is $b^*/b$. Since the attenuation factor is always smaller than 1, a larger probability of misclassification leads to a more attenuated result, as in the case of continuous exposure. Chu (2010) [2] studied how to adjust for misclassification in case-control studies with binary exposure using the Bayesian method.

The attenuation factor $b^*/b$ can also be represented in terms of sensitivity and specificity. Note that prevalence is defined as the number of exposed persons present in the population at a specific time divided by the number of persons in the population at that time. Setting the actual and apparent prevalences of exposure to be $\pi = \Pr(X = 1)$ and $\pi^* = \Pr(X^* = 1)$ respectively, the attenuation factor can be written as:

$$\frac{b^*}{b} = (SN + SP - 1)\,\frac{\pi(1-\pi)}{\pi^*(1-\pi^*)}.$$

Since $\pi^*$ is a function of $(\pi, SN, SP)$, namely $\pi^* = SN\,\pi + (1 - SP)(1 - \pi)$, the attenuation factor is also a function of $(\pi, SN, SP)$.

1.2.3 Misclassified Polychotomous Variables

In a lot of research, the outcome $Y$ is still a continuous variable as described in Sections 1.2.1 and 1.2.2, but the exposure $X$ is an ordinal polychotomous variable, a categorical variable with more than two ordered categories. In this situation, if misclassification happens, the impact is more complex and hard to predict. For example, Lindblad, et al (2005) [16], carried out a study of the association between the intensity of smoking and cataracts.
In this study, the exposure is the intensity of smoking, which is an ordinal variable with more than two categories. Subjects were classified into one of many exposure levels according to the number of cigarettes smoked per day. As the data is based on the Swedish Mammography Cohort, exposure misclassification is unavoidable. Without adjustment for misclassification, biased results may influence the conclusion. It is therefore informative to investigate how misclassification affects the results in such studies.

As stated in Sections 1.2.1 and 1.2.2, when the exposure $X$ is either continuous or binary, mismeasurement always attenuates the strength of the exposure-disease association. Does the same conclusion hold for a polychotomous exposure? When the exposure $X$ is an ordinal polychotomous variable, both the distribution of the exposure and the distribution of the outcome will affect the final results. We will treat subjects with the same exposure level as a group, and use the group mean outcomes as responses. In Chapter 2, we will first discuss the effect of misclassification with different distributions of actual group mean outcomes. In this part, we will mainly investigate two situations: one in which subjects cannot be misclassified more than one category off the true exposure level, and one in which subjects cannot be misclassified more than two categories off. In each of these two situations, we will start with the simplest case, where the group mean outcomes are monotone, and then expand to the general case. Subsequently, we will describe the effect of misclassification with different distributions of the actual exposure. Finally, to compare with continuous and binary exposure, we will use a trend test to analyze the effect of misclassification for polychotomous exposure.

1.3 Case-Control Studies with a "Maybe" Exposed Group

In epidemiology, there are two main types of designs for clinical studies: randomized studies and observational studies.
Three types of observational studies are most commonly applied: the cohort study, the case-control study, and the cross-sectional study. Because of the limitations of a cross-sectional study in establishing a relationship between exposure and outcome, cohort and case-control studies are relied on in epidemiologic investigations and clinical research to estimate etiologic relationships. In Chapter 3, we will focus on the unmatched case-control study, where the goal is to infer exposure prevalences in the case and control populations, and thereby infer the exposure-disease odds ratio.

Some understanding of how the bias induced by unacknowledged misclassification depends on various aspects of the problem at hand can spawn informal strategies for mitigating this bias. This is taken up in Dosemeci, et al (1996) [3]. For instance, in interpreting their findings they say:

    These findings suggest that if, in the exposure assessment process of a case-control study, where the exposure prevalence is low, an occupational hygienist is not sure about the exposure status of a subject, it is judicious to classify that subject as unexposed.

This recommendation arises since, in the presence of low exposure prevalences, the magnitude of the bias increases much faster as the specificity drops from one than it does as the sensitivity drops from one. Thus keeping the exposure group pure, by limiting the misclassification of truly unexposed subjects into it, becomes paramount.

The form of such a recommendation suggests thinking of the exposure classification, at least initially, as being made into one of three categories. For the sake of definiteness, we label these categories as 'unlikely exposed', 'maybe exposed', and 'likely exposed'. Then, depending on the context, some mitigation of bias could be achieved by collapsing the observed exposure data from three categories down to two, e.g., merging the first two categories if exposure prevalence is low.
(In the face of high exposure prevalences, analogous considerations would suggest instead merging the last two categories.) After such a merge, data analysis can follow along the routine lines of inferring the odds ratio from a $2 \times 2$ exposure-disease table of counts.

It is natural to ask whether a more formal statistical scheme might better mitigate bias and/or better reflect a posteriori uncertainty about the target parameter. In particular, we investigate directly modelling the exposure classification into the 'unlikely exposed', 'maybe exposed', and 'likely exposed' categories. Thus the sensitivity and specificity of the classification scheme are supplanted by probability distributions across the three categories, for the truly exposed and the truly unexposed respectively. While non-differential misclassification with more than two categories has been considered in the literature (see, for instance, Dosemeci, et al (1990) [4], and Weinberg, et al (1994) [21]), this is typically considered when the same set of labels for more than two ordered states is used for both the true and observed exposure status (e.g., none, low, high). 'Non-square' situations, such as two states for the true status and three states for the observed status, do not seem to have garnered attention.

In my framework, we quantify the information about exposure prevalences, and hence the odds ratio, in a large-sample sense. In situations where classification probabilities are known, or can be consistently estimated from validation data, inferential options for consistent estimation of the odds ratio are available; see, for instance, Gustafson (2004) [8] or Buonaccorsi (2010) [1]. This thesis focusses on the more challenging setting where classification probabilities cannot be estimated consistently. Given this, we cannot expect to consistently estimate the exposure-disease odds ratio as the case and control sample sizes increase.
We may, however, be able to rule out some values for the odds ratio.

First we focus on determining identification regions from prior regions. In particular, given assumptions about the possible values of classification probabilities, we show what values of exposure prevalences are compatible with the distribution of the observable data. This falls within the rubric of partially identified models (e.g., Manski 2003 [18]), whereby even the observation of an infinite amount of data does not reveal the true values of the target parameters, but does rule out some values. We consider two prior regions based on different assumptions about a priori plausible values of the misclassification probabilities. The first is a weak assumption that the exposure classification scheme is 'better than random,' in a particular sense. The second is a stronger assumption of monotonicity, in the sense that for any two categories, and either level of true exposure, the worse classification is less likely.

Having established the form of the identification regions, we turn to determining the behaviour of the posterior distributions over the control and case exposure prevalences, as the control and case sample sizes go to infinity. This is pursued via the general approach to determining the limiting posterior distribution in partially identified models outlined in Gustafson (2005) [9] (also see Gustafson 2010 [11]). Necessarily, the support of the limiting posterior distribution is the corresponding identification region. Finally, we also show, via simulation, how the posterior distribution approaches its limit as the sample size grows.

Chapter 2

The Impact of Misclassified Polychotomous Variable

As introduced in Sections 1.2.1 and 1.2.2, mismeasurement will attenuate the exposure-disease association for both continuous exposure and binary exposure.
When the actual exposure $X$ is a polychotomous ordered variable, the impact of misclassification is more complex and harder to intuit. Let $Y$ denote the outcome of the experimental units; we only consider the situation where $Y$ is continuous. Also, let $X$ and $X^*$ represent the actual and apparent exposure status, both of which are polychotomous ordered variables, so that non-differential misclassification gives rise to $X^*$ as a surrogate for $X$. Without loss of generality, let $X$ and $X^*$ take values in $1, 2, \ldots, k$.

Let $P$ denote the classification matrix, where $p_{ij}$ is the probability of classifying a subject into the $j$-th exposure level given that this subject is actually in the $i$-th exposure level, $p_{ij} = \Pr(X^* = j \mid X = i)$. Some weak assumptions can be made to ensure that the classification is better than randomly assigning subjects to exposure levels. We call this the monotonicity assumption on $p_{ij}$:

  $p_{ij}$ is maximized when $j = i$, for any fixed $i$ from 1 to $k$;
  when $1 \le j \le i$, $p_{ij}$ decreases as $j$ decreases;
  when $i \le j \le k$, $p_{ij}$ decreases as $j$ increases.

Treating subjects with the same exposure level as a group, we will use the group mean outcomes as responses. Let $\theta$ be the mean outcome for actual exposure status, $\theta_i = E(Y \mid X = i)$; let $\theta^*$ be the mean outcome for apparent exposure status, $\theta_i^* = E(Y \mid X^* = i)$; and let $\pi$ be the distribution of actual exposure, $\pi_i = \Pr(X = i)$. According to Bayes' theorem, the relationship between $\theta$ and $\theta^*$ can be expressed in terms of $\pi$ and $P$:

$$\theta_j^* = \frac{\sum_{i=1}^{k} \theta_i\, p_{ij}\, \pi_i}{\sum_{i=1}^{k} p_{ij}\, \pi_i}. \tag{2.1}$$

Given the values of $P$ and $\pi$, it is informative to investigate the mapping from $\theta$ to $\theta^*$ to understand the impact of misclassification.

To gain a deeper understanding of the behavior of $X$ and $X^*$, we can treat the classification matrix $P$ as the transition matrix of a Markov chain and $\pi$ as the probability distribution of the starting position.
Then, the distribution of the surrogate $X^*$ can be viewed as the distribution of $X$ moved one step toward the stationary distribution, which can be expressed as:

$$\Pr(X^* = j) = \sum_{i=1}^{k} p_{ij}\, \pi_i.$$

When calculating $\theta_j^*$, the denominator of (2.1) is exactly $\Pr(X^* = j)$, while the numerator of (2.1) has the same one-step form but with starting weights proportional to $\theta\pi$, where $\theta\pi$ is the vector with $k$ elements such that $(\theta\pi)_i = \theta_i \pi_i$ for any $i$ from 1 to $k$. To investigate how $\theta$ and $\theta^*$ behave, we can compare the change between the mean outcomes of each pair of adjacent levels for both $X$ and $X^*$.

2.1 Subjects Cannot Be Misclassified More Than Two Categories off the True Exposure Level

When no misclassification happens at any exposure level, the classification matrix $P$ is an identity matrix, resulting in exactly the same $\theta$ and $\theta^*$. Under the monotonicity assumption on $p_{ij}$, let us first consider the least severe situation in which misclassification happens, where subjects cannot be misclassified more than one category off the true exposure level. In this situation, the classification matrix is restricted such that $p_{ij} = 0$ if $|i - j| \ge 2$. We will analyze the effect of misclassification for polychotomous exposure based on two different shapes of $\theta$.

2.1.1 When E(Y | X) is Monotone

Let us first look at the effect of misclassification for polychotomous exposure in the simplest situation, where $\theta$ is monotone. For any type of exposure distribution, we can state the following theorem.

Theorem 2.1.1 Under the situation when $p_{ij} = 0$ if $|i - j| \ge 2$, if $\theta$ is monotone, $\theta^*$ will also be monotone with the same direction of trend as $\theta$.

Proof: (i): First, assume that $X$ is uniformly distributed. That is, $\pi_i = 1/k$ for any $i$ from 1 to $k$.
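Then, the changes of the mean outcomes between adjacent levels can be expressed as follows.

When $j = 1$,

$$\theta_2^* - \theta_1^* = \frac{\theta_1 p_{12} + \theta_2 p_{22} + \theta_3 p_{32}}{p_{12} + p_{22} + p_{32}} - \frac{\theta_1 p_{11} + \theta_2 p_{21}}{p_{11} + p_{21}}$$
$$\propto (\theta_2 - \theta_1) p_{11} p_{22} + (\theta_3 - \theta_1) p_{11} p_{32} + (\theta_1 - \theta_2) p_{21} p_{12} + (\theta_3 - \theta_2) p_{21} p_{32}.$$

When $1 < j < k - 1$,

$$\theta_{j+1}^* - \theta_j^* = \frac{\theta_j p_{j,j+1} + \theta_{j+1} p_{j+1,j+1} + \theta_{j+2} p_{j+2,j+1}}{p_{j,j+1} + p_{j+1,j+1} + p_{j+2,j+1}} - \frac{\theta_{j-1} p_{j-1,j} + \theta_j p_{jj} + \theta_{j+1} p_{j+1,j}}{p_{j-1,j} + p_{jj} + p_{j+1,j}}$$
$$\propto (\theta_j - \theta_{j-1}) p_{j-1,j} p_{j,j+1} + (\theta_{j+1} - \theta_{j-1}) p_{j-1,j} p_{j+1,j+1} + (\theta_{j+2} - \theta_{j-1}) p_{j-1,j} p_{j+2,j+1}$$
$$+ (\theta_{j+1} - \theta_j) p_{jj} p_{j+1,j+1} + (\theta_{j+2} - \theta_j) p_{jj} p_{j+2,j+1} + (\theta_j - \theta_{j+1}) p_{j,j+1} p_{j+1,j} + (\theta_{j+2} - \theta_{j+1}) p_{j+1,j} p_{j+2,j+1}.$$

When $j = k - 1$,

$$\theta_k^* - \theta_{k-1}^* = \frac{\theta_{k-1} p_{k-1,k} + \theta_k p_{kk}}{p_{k-1,k} + p_{kk}} - \frac{\theta_{k-2} p_{k-2,k-1} + \theta_{k-1} p_{k-1,k-1} + \theta_k p_{k,k-1}}{p_{k-2,k-1} + p_{k-1,k-1} + p_{k,k-1}}$$
$$\propto (\theta_{k-1} - \theta_{k-2}) p_{k-2,k-1} p_{k-1,k} + (\theta_k - \theta_{k-2}) p_{k-2,k-1} p_{kk} + (\theta_k - \theta_{k-1}) p_{k-1,k-1} p_{kk} + (\theta_{k-1} - \theta_k) p_{k,k-1} p_{k-1,k}.$$

Under the monotonicity assumption on $p_{ij}$, it is easy to show that $p_{jj} p_{j+1,j+1} > p_{j,j+1} p_{j+1,j}$ for any $j$, so the single negative term in each expression is dominated by the term in $p_{jj} p_{j+1,j+1}$. Therefore, when $\theta$ monotonically increases ($\theta_{j-1} \le \theta_j \le \theta_{j+1} \le \theta_{j+2}$), we can prove that $\theta_j^* \le \theta_{j+1}^*$ for any $j$ from 1 to $k - 1$. It means that $\theta^*$ also monotonically increases. Similarly, when $\theta$ monotonically decreases, $\theta^*$ also monotonically decreases ($\theta_j^* \ge \theta_{j+1}^*$ for any $j$ from 1 to $k - 1$). Notice that $\theta_j^* = \theta_{j+1}^*$ holds if and only if $\theta_{j-1} = \theta_j = \theta_{j+1} = \theta_{j+2}$.

(ii): In general, for any $\pi$ which satisfies $0 < \pi_i < 1$ and $\sum_{i=1}^{k} \pi_i = 1$, this theorem still holds. Take the situation when $1 < j < k - 1$ as an example. The relationship between $\theta_j^*$ and $\theta_{j+1}^*$ can be represented as:

$$\theta_{j+1}^* - \theta_j^* = \frac{\theta_j \pi_j p_{j,j+1} + \theta_{j+1} \pi_{j+1} p_{j+1,j+1} + \theta_{j+2} \pi_{j+2} p_{j+2,j+1}}{\pi_j p_{j,j+1} + \pi_{j+1} p_{j+1,j+1} + \pi_{j+2} p_{j+2,j+1}} - \frac{\theta_{j-1} \pi_{j-1} p_{j-1,j} + \theta_j \pi_j p_{jj} + \theta_{j+1} \pi_{j+1} p_{j+1,j}}{\pi_{j-1} p_{j-1,j} + \pi_j p_{jj} + \pi_{j+1} p_{j+1,j}}$$
$$\propto (\theta_j - \theta_{j-1}) \pi_{j-1} \pi_j p_{j-1,j} p_{j,j+1} + (\theta_{j+1} - \theta_{j-1}) \pi_{j-1} \pi_{j+1} p_{j-1,j} p_{j+1,j+1} + (\theta_{j+2} - \theta_{j-1}) \pi_{j-1} \pi_{j+2} p_{j-1,j} p_{j+2,j+1}$$
$$+ (\theta_{j+1} - \theta_j) \pi_j \pi_{j+1} p_{jj} p_{j+1,j+1} + (\theta_{j+2} - \theta_j) \pi_j \pi_{j+2} p_{jj} p_{j+2,j+1} + (\theta_j - \theta_{j+1}) \pi_j \pi_{j+1} p_{j,j+1} p_{j+1,j} + (\theta_{j+2} - \theta_{j+1}) \pi_{j+1} \pi_{j+2} p_{j+1,j} p_{j+2,j+1}.$$

By the same argument as in (i), when $\theta$ is monotone, $\theta^*$ will still be monotone. Similarly, this conclusion also holds when $j = 1$ and $j = k - 1$.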
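
The statement of Theorem 2.1.1 can be spot-checked numerically. The sketch below is only an illustration, not part of the proof. It computes the apparent group means from equation (2.1) and tests monotonicity on randomly generated inputs. The names `apparent_means`, `random_tridiagonal` and `check_theorem` are hypothetical, as is the way the random matrices are drawn: each row places at most 1/3 on each neighbouring category, so the diagonal entry is the row maximum and entries two or more categories off the diagonal are zero.

```python
import random


def apparent_means(theta, P, pi):
    """Equation (2.1): apparent mean at level j is
    sum_i theta_i * p_ij * pi_i  divided by  sum_i p_ij * pi_i."""
    k = len(theta)
    result = []
    for j in range(k):
        numerator = sum(theta[i] * P[i][j] * pi[i] for i in range(k))
        denominator = sum(P[i][j] * pi[i] for i in range(k))
        result.append(numerator / denominator)
    return result


def random_tridiagonal(k, rng):
    """Hypothetical generator of a classification matrix with
    p_ij = 0 when |i - j| >= 2 and the row maximum on the diagonal."""
    P = [[0.0] * k for _ in range(k)]
    for i in range(k):
        left = rng.uniform(0.0, 1.0 / 3.0) if i > 0 else 0.0
        right = rng.uniform(0.0, 1.0 / 3.0) if i < k - 1 else 0.0
        if i > 0:
            P[i][i - 1] = left
        if i < k - 1:
            P[i][i + 1] = right
        P[i][i] = 1.0 - left - right  # at least 1/3, so it is the row maximum
    return P


def is_nondecreasing(values, tol=1e-9):
    """True when each entry is at least the previous one, up to rounding."""
    return all(b - a >= -tol for a, b in zip(values, values[1:]))


def check_theorem(trials=200, k=6, seed=0):
    """Draw increasing actual means, a random exposure distribution and a
    random tridiagonal matrix; report whether the apparent means were
    non-decreasing in every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        P = random_tridiagonal(k, rng)
        theta = sorted(rng.uniform(0.0, 100.0) for _ in range(k))
        weights = [rng.uniform(0.1, 1.0) for _ in range(k)]
        total = sum(weights)
        pi = [w / total for w in weights]
        if not is_nondecreasing(apparent_means(theta, P, pi)):
            return False
    return True


if __name__ == "__main__":
    print(check_theorem())
```

A run that returns `True` is consistent with the theorem for the sampled cases only; it does not replace the algebraic argument above.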
To visualize the effect of misclassification, Example 2.1.1 shows a case where a monotone increasing $\mu$ leads to a monotone increasing $\mu^*$.

Example 2.1.1 Take $k = 6$, $\mu_i = i^2$, and

\[
P = \begin{pmatrix}
0.6 & 0.4 & 0 & 0 & 0 & 0 \\
0.2 & 0.6 & 0.2 & 0 & 0 & 0 \\
0 & 0.2 & 0.6 & 0.2 & 0 & 0 \\
0 & 0 & 0.2 & 0.6 & 0.2 & 0 \\
0 & 0 & 0 & 0.2 & 0.6 & 0.2 \\
0 & 0 & 0 & 0 & 0.4 & 0.6
\end{pmatrix}
\]

as an example. Both $\mu$ and $\mu^*$ are displayed in Figure 2.1 when $X$ is uniformly distributed ($\pi_i = 1/6$), and in Figure 2.2 when $X$ is unimodal ($\pi = (1, 2, 3, 3, 2, 1)/12$). It shows that an increasing $\mu$ leads to an increasing $\mu^*$, as expected according to Theorem 2.1.1.

Figure 2.1: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is increasing.

Figure 2.2: $\mu$ and $\mu^*$ when $X$ is unimodal and $\mu$ is increasing.

2.1.2 When E(Y | X) is not Monotone

When $\mu$ is not monotone, a conclusion similar to Theorem 2.1.1 does not hold. Let $\mathrm{sign}(\mu_{j+1} - \mu_j)$ and $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ represent the change of mean outcomes of actual exposure and apparent exposure at the $j$-th pair of adjacent levels. For example, when $\mathrm{sign}(\mu_{j+1} - \mu_j)$ is positive, the mean outcome increases at the $j$-th level; and when $\mathrm{sign}(\mu_{j+1} - \mu_j)$ is negative, the mean outcome decreases. When $\mu$ is monotone, $\mathrm{sign}(\mu_{j+1} - \mu_j)$ and $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ will always be the same, according to Theorem 2.1.1. However, $\mathrm{sign}(\mu_{j+1} - \mu_j)$ might differ from $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ when $\mu$ is not monotone.

Consider the simplest case where $X$ is uniformly distributed and $\mu$ is unimodal with the lowest value taken at $i = l$, where $l$ is between 2 and $k - 1$. Under this situation, $\mu_l$ is the minimum value among $\mu$. Using the same procedure as in Theorem 2.1.1, it is easy to prove that when $j \ne l - 1$ or $l$, $\mathrm{sign}(\mu_{j+1} - \mu_j)$ and $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ are still always the same; however, either $\mu^*_{l-1} < \mu^*_l$ or $\mu^*_l > \mu^*_{l+1}$ might be the case. It means that only when $j = l - 1$ or $l$, it is possible that $\mathrm{sign}(\mu_{j+1} - \mu_j)$ and $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ are different.
The relationships of the mean outcome among the $(l - 1)$-th, $l$-th, and $(l + 1)$-th levels highly depend on the magnitudes of $\mu_{l-1}$ and $\mu_{l+1}$. To visualize the effect of misclassification, Example 2.1.2 shows two cases where $\mu$ is unimodal.

Example 2.1.2 Take $k = 6$, $X$ is uniformly distributed, $\mu = (50, 49, 48, 1, 5, 6)$, and

\[
P = \begin{pmatrix}
0.9 & 0.1 & 0 & 0 & 0 & 0 \\
0.1 & 0.8 & 0.2 & 0 & 0 & 0 \\
0 & 0.2 & 0.5 & 0.3 & 0 & 0 \\
0 & 0 & 0.2 & 0.5 & 0.3 & 0 \\
0 & 0 & 0 & 0.2 & 0.7 & 0.1 \\
0 & 0 & 0 & 0 & 0.1 & 0.9
\end{pmatrix}
\]

as an example. Both $\mu$ and $\mu^*$ are displayed in Figure 2.3. Under this setting, $\mu_4 < \mu_5$ but $\mu^*_4 > \mu^*_5$. The minimum mean outcome of actual exposure is taken at the 4-th level, whereas the minimum mean outcome of apparent exposure is taken at the 5-th level. Because of misclassification, the minimum value moves one category to the right.

Figure 2.3: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is unimodal.

In contrast, take $k = 6$, $X$ is uniformly distributed, $\mu = (6, 5, 3, 2, 1, 30)$, and

\[
P = \begin{pmatrix}
0.9 & 0.1 & 0 & 0 & 0 & 0 \\
0.1 & 0.8 & 0.2 & 0 & 0 & 0 \\
0 & 0.2 & 0.5 & 0.3 & 0 & 0 \\
0 & 0 & 0.3 & 0.5 & 0.2 & 0 \\
0 & 0 & 0 & 0.3 & 0.5 & 0.2 \\
0 & 0 & 0 & 0 & 0.4 & 0.6
\end{pmatrix}
\]

as an example. Both $\mu$ and $\mu^*$ are displayed in Figure 2.4. Under this setting, $\mu_4 > \mu_5$ but $\mu^*_4 < \mu^*_5$. The minimum mean outcome of actual exposure is taken at the 5-th level, whereas the minimum mean outcome of apparent exposure is taken at the 4-th level. Misclassification can also move the minimum value one category to the left.

Based on Example 2.1.2, we can draw a conclusion for the situation where $\mu$ is unimodal under the assumptions we made. When $j \ne l - 1$ or $l$, the directions of changes of $\mu$ and $\mu^*$ will be the same, as described in Theorem 2.1.1. When $j = l - 1$ or $l$, although it is possible that $\mathrm{sign}(\mu_{j+1} - \mu_j)$ and $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ are different, the overall shape of $\mu^*$ is still unimodal. The overall shape of $\mu$ and $\mu^*$ will stay unchanged, with the mode moving only one category either to the left or to the right.
Figure 2.4: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is unimodal.

2.2 Subjects Cannot Be Misclassified More Than Three Categories off the True Exposure Level

In Section 2.1, we discussed the situation with the least severe misclassification. As one step further, let's consider a worse situation where the surrogate $X^*$ can be misclassified two categories off the true exposure level, still under the monotonicity assumption of $p_{ij}$. That is, the classification matrix is more complicated, such that $p_{ij} = 0$ if $|i - j| \ge 3$. We will again analyze the effect of misclassification for polychotomous exposure based on two different forms of $\mu$, as in Section 2.1.

2.2.1 When E(Y | X) is Monotone

Still, let's first consider the simplest case when $\mu$ is monotone. Under the situation where $X$ is uniformly distributed, the relationship between $\mu^*_j$ and $\mu^*_{j+1}$ can be expressed as:

\[
\begin{aligned}
\mu^*_{j+1} - \mu^*_j
={}& \frac{\mu_{j-1}p_{j-1,j+1} + \mu_j p_{j,j+1} + \mu_{j+1}p_{j+1,j+1} + \mu_{j+2}p_{j+2,j+1} + \mu_{j+3}p_{j+3,j+1}}{p_{j-1,j+1} + p_{j,j+1} + p_{j+1,j+1} + p_{j+2,j+1} + p_{j+3,j+1}} \\
&- \frac{\mu_{j-2}p_{j-2,j} + \mu_{j-1}p_{j-1,j} + \mu_j p_{jj} + \mu_{j+1}p_{j+1,j} + \mu_{j+2}p_{j+2,j}}{p_{j-2,j} + p_{j-1,j} + p_{jj} + p_{j+1,j} + p_{j+2,j}} \\
\propto{}& (\mu_{j-1} - \mu_{j-2})p_{j-2,j}p_{j-1,j+1} + (\mu_j - \mu_{j-2})p_{j-2,j}p_{j,j+1} + (\mu_{j+1} - \mu_{j-2})p_{j-2,j}p_{j+1,j+1} \\
&+ (\mu_{j+2} - \mu_{j-2})p_{j-2,j}p_{j+2,j+1} + (\mu_{j+3} - \mu_{j-2})p_{j-2,j}p_{j+3,j+1} \\
&+ (\mu_j - \mu_{j-1})p_{j-1,j}p_{j,j+1} + (\mu_{j+1} - \mu_{j-1})p_{j-1,j}p_{j+1,j+1} \\
&+ (\mu_{j+2} - \mu_{j-1})p_{j-1,j}p_{j+2,j+1} + (\mu_{j+3} - \mu_{j-1})p_{j-1,j}p_{j+3,j+1} \\
&+ (\mu_{j-1} - \mu_j)p_{jj}p_{j-1,j+1} + (\mu_{j+1} - \mu_j)p_{jj}p_{j+1,j+1} \\
&+ (\mu_{j+2} - \mu_j)p_{jj}p_{j+2,j+1} + (\mu_{j+3} - \mu_j)p_{jj}p_{j+3,j+1} \\
&+ (\mu_{j-1} - \mu_{j+1})p_{j+1,j}p_{j-1,j+1} + (\mu_j - \mu_{j+1})p_{j+1,j}p_{j,j+1} \\
&+ (\mu_{j+2} - \mu_{j+1})p_{j+1,j}p_{j+2,j+1} + (\mu_{j+3} - \mu_{j+1})p_{j+1,j}p_{j+3,j+1} \\
&+ (\mu_{j-1} - \mu_{j+2})p_{j+2,j}p_{j-1,j+1} + (\mu_j - \mu_{j+2})p_{j+2,j}p_{j,j+1} \\
&+ (\mu_{j+1} - \mu_{j+2})p_{j+2,j}p_{j+1,j+1} + (\mu_{j+3} - \mu_{j+2})p_{j+2,j}p_{j+3,j+1}.
\end{aligned}
\]

From the expression of the apparent mean outcome change above, it is hard to determine $\mathrm{sign}(\mu^*_{j+1} - \mu^*_j)$ given $\mathrm{sign}(\mu_{j+1} - \mu_j)$: several terms are now negative when $\mu$ is increasing, and they are no longer dominated by matching positive terms. Therefore, Theorem 2.1.1 does not hold under this situation.
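The loss of monotonicity can be seen numerically. The sketch below is an illustration rather than part of the original argument: it takes the setting of Example 2.2.1 (uniform $X$, increasing $\mu$, and a classification matrix with $p_{ij} = 0$ for $|i - j| \ge 3$) and lists the adjacent pairs at which the apparent mean outcome decreases. The helper name `apparent_means_uniform` is an assumption made for this example.

```python
def apparent_means_uniform(mu, P):
    """Apparent means for uniform X: mu*_j = sum_i mu_i p_ij / sum_i p_ij."""
    k = len(mu)
    return [
        sum(mu[i] * P[i][j] for i in range(k)) / sum(P[i][j] for i in range(k))
        for j in range(k)
    ]


mu = [1, 2, 50, 51, 52, 53]  # monotone increasing actual mean outcomes

# Subjects may be misclassified up to two categories away.
P = [
    [0.9,   0.09, 0.01, 0.0,  0.0,   0.0],
    [0.01,  0.34, 0.33, 0.32, 0.0,   0.0],
    [0.085, 0.4,  0.5,  0.01, 0.005, 0.0],
    [0.0,   0.01, 0.48, 0.5,  0.006, 0.004],
    [0.0,   0.0,  0.32, 0.33, 0.34,  0.01],
    [0.0,   0.0,  0.0,  0.01, 0.09,  0.9],
]

mu_star = apparent_means_uniform(mu, P)

# 1-based indices j for which mu*_{j+1} < mu*_j.
decreasing_pairs = [j + 1 for j in range(len(mu) - 1) if mu_star[j + 1] < mu_star[j]]

print([round(m, 2) for m in mu_star])
print(decreasing_pairs)
```

Here the only reversal occurs between the third and fourth apparent levels, although $\mu_3 < \mu_4$.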
Even when $\mu$ is monotone, it is impossible to predict the changes of the mean outcomes of apparent exposure. For example, given that $\mu$ is monotone increasing, $\mu^*$ might decrease at some levels. Example 2.2.1 gives a case where $\mu$ is monotone but $\mu^*$ is not.

Example 2.2.1 Take $k = 6$, $X$ is uniformly distributed, $\mu = (1, 2, 50, 51, 52, 53)$, and

\[
P = \begin{pmatrix}
0.9 & 0.09 & 0.01 & 0 & 0 & 0 \\
0.01 & 0.34 & 0.33 & 0.32 & 0 & 0 \\
0.085 & 0.4 & 0.5 & 0.01 & 0.005 & 0 \\
0 & 0.01 & 0.48 & 0.5 & 0.006 & 0.004 \\
0 & 0 & 0.32 & 0.33 & 0.34 & 0.01 \\
0 & 0 & 0 & 0.01 & 0.09 & 0.9
\end{pmatrix}
\]

as an example. Both $\mu$ and $\mu^*$ are displayed in Figure 2.5. Given $\mu_3 < \mu_4$, misclassification leads to $\mu^*_3 > \mu^*_4$. Therefore, although $\mu$ is monotone increasing, it is possible that $\mu^*$ is not.

Figure 2.5: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is increasing.

2.2.2 When E(Y | X) is not Monotone

In Section 2.1.2, we concluded that, under the situation where $p_{ij} = 0$ if $|i - j| \ge 2$, a unimodal $\mu$ will still lead to a unimodal $\mu^*$. However, for a worse classification matrix such that $p_{ij} = 0$ if $|i - j| \ge 3$, the same conclusion does not hold. It is possible that $\mu$ is unimodal but $\mu^*$ is not, as shown in Example 2.2.2.

Example 2.2.2 Take $k = 6$, $X$ is uniformly distributed, $\mu = (180, 30, 10, 1, 80, 81)$, and

\[
P = \begin{pmatrix}
0.8 & 0.101 & 0.099 & 0 & 0 & 0 \\
0.107 & 0.8 & 0.047 & 0.046 & 0 & 0 \\
0.13 & 0.22 & 0.3 & 0.22 & 0.13 & 0 \\
0 & 0.015 & 0.23 & 0.26 & 0.25 & 0.245 \\
0 & 0 & 0.235 & 0.245 & 0.265 & 0.255 \\
0 & 0 & 0 & 0.24 & 0.35 & 0.41
\end{pmatrix}
\]

as an example. Both $\mu$ and $\mu^*$ are displayed in Figure 2.6. It shows that even though $\mu$ is unimodal with $\mu_i$ minimized at $i = 4$, $\mu^*$ is not unimodal but bimodal, where both $i = 2$ ($\mu^*_1 > \mu^*_2$ and $\mu^*_3 > \mu^*_2$) and $i = 4$ ($\mu^*_3 > \mu^*_4$ and $\mu^*_5 > \mu^*_4$) are local minimum points.

Figure 2.6: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is unimodal.

In general, with the other assumptions unchanged as in Section 2.1.2, even if $\mu$ is unimodal, a worse misclassification in which subjects can be misclassified
more than one category away from the true exposure level will influence the overall shape of $\mu^*$ such that it might not be unimodal.

2.3 The Effect of the Exposure Distribution

After discussing the effect of misclassification with different classification matrices and distributions of mean outcomes, we now focus on how $\mu^*$ is influenced by the distribution of exposure $X$. Recall that we denote by $\pi$ the distribution of $X$, such that $\pi_i = \Pr(X = i)$. Here, we only consider the simplest situation, where subjects can only be misclassified one category away from the true exposure level ($p_{ij} = 0$ if $|i - j| \ge 2$) under the monotonicity assumption of $p_{ij}$, and $\mu$ is monotone.

In Section 2.1.1, we proved that if $\mu$ is monotone, then $\mu^*$ will also be monotone with the same direction of trend, whatever the value of $\pi$. The distribution of $X$ does not affect the overall trend under this situation. Although the trend across adjacent levels will stay the same, the magnitude of the apparent mean outcome will depend on $\pi$. We will discuss how $\mu^*$ is influenced by the value of $\pi$.

Holding $P$ and $\mu$ fixed, let $\mu^{*(1)}$ be the vector of apparent mean outcomes when $X$ is uniformly distributed, and $\mu^{*(2)}$ be the vector of apparent mean outcomes when $X$ follows a non-uniform distribution $\pi$. When $1 < j < k$, the relationship between $\mu^{*(1)}_j$ and $\mu^{*(2)}_j$ given the same values of $\mu$ is:

\[
\begin{aligned}
\mu^{*(2)}_j - \mu^{*(1)}_j
={}& \frac{\pi_{j-1}\mu_{j-1}p_{j-1,j} + \pi_j\mu_j p_{jj} + \pi_{j+1}\mu_{j+1}p_{j+1,j}}{\pi_{j-1}p_{j-1,j} + \pi_j p_{jj} + \pi_{j+1}p_{j+1,j}} - \frac{\mu_{j-1}p_{j-1,j} + \mu_j p_{jj} + \mu_{j+1}p_{j+1,j}}{p_{j-1,j} + p_{jj} + p_{j+1,j}} \\
\propto{}& (\mu_j - \mu_{j-1})(\pi_j - \pi_{j-1})p_{j-1,j}p_{jj} + (\mu_{j+1} - \mu_{j-1})(\pi_{j+1} - \pi_{j-1})p_{j-1,j}p_{j+1,j} \\
&+ (\mu_{j+1} - \mu_j)(\pi_{j+1} - \pi_j)p_{jj}p_{j+1,j}.
\end{aligned}
\]

Since subjects can only be misclassified one category away from the true exposure level, the relationship between $\mu^{*(2)}_j$ and $\mu^{*(1)}_j$ is only influenced by three terms of $\pi$ ($\pi_{j-1}$, $\pi_j$, and $\pi_{j+1}$).
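The sign pattern in the expression above can be illustrated with a small computation. This is a sketch under stated assumptions: $\mu$ and $P$ are those of Example 2.1.1, while the increasing and decreasing exposure distributions `pi_increasing` and `pi_decreasing` are hypothetical choices made only for this illustration. Each product $(\mu_b - \mu_a)(\pi_b - \pi_a)$ is non-negative when $\mu$ and $\pi$ trend in the same direction and non-positive when they trend in opposite directions, so the differences from the uniform case should carry the corresponding sign at every level.

```python
def apparent_means(mu, P, pi):
    """Eq. (2.1): mu*_j = sum_i mu_i pi_i p_ij / sum_i pi_i p_ij."""
    k = len(mu)
    return [
        sum(mu[i] * pi[i] * P[i][j] for i in range(k))
        / sum(pi[i] * P[i][j] for i in range(k))
        for j in range(k)
    ]


k = 6
mu = [i ** 2 for i in range(1, k + 1)]  # increasing mean outcomes
P = [
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.6, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.6, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
]

pi_uniform = [1 / k] * k
pi_increasing = [i / 21 for i in range(1, k + 1)]   # hypothetical: same trend as mu
pi_decreasing = list(reversed(pi_increasing))       # hypothetical: opposite trend

baseline = apparent_means(mu, P, pi_uniform)
same_trend = apparent_means(mu, P, pi_increasing)
opposite_trend = apparent_means(mu, P, pi_decreasing)

diff_same = [b - a for a, b in zip(baseline, same_trend)]
diff_opposite = [b - a for a, b in zip(baseline, opposite_trend)]

print([round(d, 3) for d in diff_same])
print([round(d, 3) for d in diff_opposite])
```

With these inputs every entry of `diff_same` is non-negative and every entry of `diff_opposite` is non-positive.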
Similarly, we can write out the expressions when $j = 1$ and $j = k$:

\[
\mu^{*(2)}_1 - \mu^{*(1)}_1 \propto (\mu_2 - \mu_1)(\pi_2 - \pi_1)p_{11}p_{21}, \qquad
\mu^{*(2)}_k - \mu^{*(1)}_k \propto (\mu_k - \mu_{k-1})(\pi_k - \pi_{k-1})p_{k-1,k}p_{kk}.
\]

As values of $\pi$ other than $\pi_{j-1}$, $\pi_j$, and $\pi_{j+1}$ do not have any effect on the magnitude of $\mu^*_j$, a theorem can be stated based on only these three terms of $\pi$.

Theorem 2.3.1 When $p_{ij} = 0$ if $|i - j| \ge 2$ under the monotonicity assumption of $p_{ij}$: if $\mu$ and $(\pi_{j-1}, \pi_j, \pi_{j+1})$ are both monotone with the same direction of trend, then $\mu^{*(2)}_j \ge \mu^{*(1)}_j$; if $\mu$ and $(\pi_{j-1}, \pi_j, \pi_{j+1})$ are both monotone but with different directions of trend, then $\mu^{*(2)}_j \le \mu^{*(1)}_j$.

Only when $(\pi_{j-1}, \pi_j, \pi_{j+1})$ is monotone can we determine the relationship between $\mu^{*(1)}_j$ and $\mu^{*(2)}_j$. This allows us to compare the effect of misclassification between the cases when exposure is uniformly distributed and non-uniformly distributed. However, when $(\pi_{j-1}, \pi_j, \pi_{j+1})$ is not monotone, the relationship between $\mu^{*(1)}_j$ and $\mu^{*(2)}_j$ is hard to predict.

2.4 Test for Trend Across Categories

When exposure $X$ is a continuous variable and the mean actual outcome $E(Y \mid X)$ has a linear relationship with $X$, the slope of $E(Y \mid X)$ will always be steeper than the slope of the mean apparent outcome $E(Y \mid X^*)$. Moreover, the relationship between the intercepts of $E(Y \mid X)$ and $E(Y \mid X^*)$ is influenced by the direction of the trend. When $E(Y \mid X)$ is monotone increasing, the starting value of $E(Y \mid X^*)$ will always be higher than that of $E(Y \mid X)$, and vice versa.

What if $X$ is a polychotomous variable instead of a continuous variable? Let's have a look at the simplest case, where misclassification is least severe under the monotonicity assumption of $p_{ij}$ and $\mu$ is monotone. In Theorem 2.1.1, we proved that a monotone $\mu$ will always lead to a monotone $\mu^*$ with the same direction of trend. Therefore, misclassification does not influence the overall shape of the apparent mean outcome for polychotomous exposure, which is the same as for continuous exposure.
Besides that, we can state a theorem on the values of the mean outcomes at the starting and ending levels.

Theorem 2.4.1 Under the situation where $p_{ij} = 0$ if $|i - j| \ge 2$ and for any value of $\pi$: when $\mu$ is monotone increasing, $\mu^*_1$ is larger than $\mu_1$ and $\mu^*_k$ is smaller than $\mu_k$. When $\mu$ is monotone decreasing, the contrary is the case.

Proof: (i): When $\mu$ is monotone increasing, $\mu_1$ is the smallest element among $\mu_i$, for any $i$ from 1 to $k$. The relationship between $\mu^*_1$ and $\mu_1$ can be expressed as:

\[
\mu^*_1 - \mu_1 = \frac{\sum_{i=1}^{k} \pi_i p_{i1}\mu_i}{\sum_{i=1}^{k} \pi_i p_{i1}} - \mu_1
\propto \sum_{i=1}^{k} \pi_i p_{i1}\mu_i - \sum_{i=1}^{k} \pi_i p_{i1}\mu_1 > 0.
\]

Since $\mu^*_1 - \mu_1 > 0$, $\mu^*_1$ is larger than $\mu_1$.

(ii): When there are $k$ exposure levels and $\mu$ is monotone increasing, $\mu_k$ is the largest element among $\mu_i$, for any $i$ from 1 to $k$. The relationship between $\mu^*_k$ and $\mu_k$ can be expressed as:

\[
\mu^*_k - \mu_k = \frac{\sum_{i=1}^{k} \pi_i p_{ik}\mu_i}{\sum_{i=1}^{k} \pi_i p_{ik}} - \mu_k
\propto \sum_{i=1}^{k} \pi_i p_{ik}\mu_i - \sum_{i=1}^{k} \pi_i p_{ik}\mu_k < 0.
\]

Since $\mu^*_k - \mu_k < 0$, $\mu^*_k$ is smaller than $\mu_k$. Similarly, when $\mu$ is monotone decreasing, the same procedure shows that $\mu^*_1 - \mu_1 < 0$ and $\mu^*_k - \mu_k > 0$.

In summary, from Theorem 2.4.1, we can state that the smallest mean outcome for apparent exposure will always be larger than that for actual exposure, and the largest mean outcome will always be smaller. Therefore, misclassification leads to the same conclusion on the smallest and the largest values of mean outcomes for both continuous exposure and polychotomous exposure.

From both Theorem 2.1.1 and Theorem 2.4.1, we can summarize that the largest change between the mean outcomes of any two levels for apparent exposure will always be smaller than that for actual exposure. However, it is possible that the change of $\mu^*$ at some adjacent levels is larger than that of $\mu$ for polychotomous exposure, which will not happen for continuous exposure. Therefore, we still cannot conclude that misclassification attenuates the exposure-disease association for polychotomous exposure.
Example 2.4.1 Let's consider the situation when $k = 6$, $X$ is uniformly distributed, $\mu = (1, 2, 50, 51, 100, 110)$, which is monotone increasing, and

\[
P = \begin{pmatrix}
0.9 & 0.1 & 0 & 0 & 0 & 0 \\
0.1 & 0.8 & 0.2 & 0 & 0 & 0 \\
0 & 0.2 & 0.5 & 0.3 & 0 & 0 \\
0 & 0 & 0.2 & 0.5 & 0.3 & 0 \\
0 & 0 & 0 & 0.2 & 0.7 & 0.1 \\
0 & 0 & 0 & 0 & 0.1 & 0.9
\end{pmatrix}.
\]

Both $\mu$ and $\mu^*$ are displayed in Figure 2.7. The change between $\mu^*_3$ and $\mu^*_4$ is clearly larger than the change between $\mu_3$ and $\mu_4$, although the change of apparent mean outcomes between any other adjacent levels is smaller than that of actual mean outcomes.

Figure 2.7: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is increasing.

To investigate the effect of misclassification for polychotomous exposure, we will check how the overall trend of mean outcomes changes from $\mu$ to $\mu^*$ by a trend test. We will first introduce the trend test for actual exposure, and then for apparent exposure. In both parts, we only consider the simplest situation, where $X$ is uniformly distributed.

2.4.1 Trend Test for Actual Exposure

Treating all subjects with the same exposure level as a group, we can consider the group mean outcomes as the target responses. Assume that each group has the same expected sample size, denoted as $n$. Also, we make an equal variance assumption such that the variance of the response for each subject is the same, denoted as $\sigma^2$. Moreover, assume that these subjects are all independent of each other. Let $\bar{Y}$ denote the vector of sample mean outcomes for the $k$ groups. Assume that $\bar{Y}$ follows a normal distribution, which can be written as:

\[ \bar{Y} \sim N(\mu, \Sigma), \]

where $\mu$ is the true mean outcome and $\Sigma$ is the covariance matrix. Since, for group mean outcomes, the exposure level $X$ is the group index, the design matrix is the identity matrix $I_k$. Under the assumptions stated above, the covariance matrix can be expressed as $\Sigma = n^{-1}\sigma^2 I_k$.

For actual exposure $X$, the sample mean outcomes can be represented as $\hat{\mu} = \bar{Y}$.
Tests for linear trend across categories are best performed using a linear contrast in the coefficients corresponding to the various levels of $X$, say $c'\mu$. With the null hypothesis $H_0 : c'\mu = 0$, we can test whether $X$ has a non-zero trend, and so obtain a concise view of the slope of the overall mean outcomes. Given that there are $k$ exposure levels, the coefficient vector $c$ can be calculated using the least squares approach. In this way, $c'\mu$ is the slope of the least squares line of the group mean outcomes on the group index, i.e., on $(1, \ldots, k)$. A fixed number of categories always yields the same value of $c$. According to the ordinary least squares method,

\[ c'\hat{\mu} \sim N(c'\mu,\; c'\Sigma c). \]

The test statistic of a one-sample t-test can be written as:

\[ T = \frac{c'\hat{\mu}}{\sqrt{c'\Sigma c}}. \]

Note that the power of a test at significance level 0.05 is the probability that the test rejects the null hypothesis when the null hypothesis is false. For this one-sample t-test, the power calculation can be expressed as:

\[ \Pr(|T| > 1.96) = \Phi\!\left(-1.96 + \frac{|\Delta|\sqrt{n_t}}{\sigma_t}\right), \]

where $\Phi$ is the standard normal distribution function; $n_t$ is the total sample size (here $k$, since we use the group index as the observation); $|\Delta|$ is the expected mean difference ($c'\mu$); and $\sigma_t$ is the standard deviation ($\sqrt{c'\Sigma c}$). Given the number of exposure levels $k$, the power calculation depends on only two terms: the expected mean difference and the standard deviation. The larger $|\Delta|/\sigma_t$ is, the higher the power will be.

2.4.2 Trend Test for Apparent Exposure

Although the apparent exposure level of a certain subject might differ from the actual exposure level, the number of exposure levels is always the same for both actual exposure and apparent exposure. Therefore, if we treat $X^*$ as the group index, the design matrix for $X^*$ is still the identity matrix $I_k$, the same as for $X$. However, the variance of each group is influenced by misclassification. Denote the variance of the $i$-th apparent exposure level as $\sigma^{*2}_i$.
Because of misclassification, $\sigma^{*2}_i$ will not be equal across groups, which differs from the case of actual exposure. In general, $\sigma^{*2}_i$ can be written as:

\[
\begin{aligned}
\sigma^{*2}_i &= \mathrm{Var}(Y \mid X^* = i) \\
&= E[\mathrm{Var}(Y \mid X) \mid X^* = i] + \mathrm{Var}[E(Y \mid X) \mid X^* = i] \\
&= \sigma^2 + \mathrm{Var}\{\mu_1 I(X = 1) + \cdots + \mu_k I(X = k) \mid X^* = i\}.
\end{aligned}
\]

Given that a subject is apparently classified into the $i$-th exposure level, the actual exposure level of this subject can be one of a fixed number of levels with fixed probabilities. Therefore, it is reasonable to treat $(I(X = 1), \ldots, I(X = k) \mid X^* = i)$ as following a multinomial distribution. Denote by $P^*$ the transition matrix, where $p^*_{ij} = \Pr(X = j \mid X^* = i)$. For uniformly distributed $X$, each element of this transition matrix can be calculated from the classification matrix $P$ as $p^*_{ij} = p_{ji} / \sum_{h=1}^{k} p_{hi}$. From the properties of the multinomial distribution, we can state that:

\[ \mathrm{Var}(I(X = i) \mid X^* = h) = p^*_{hi}(1 - p^*_{hi}), \tag{2.2} \]

\[ \mathrm{Cov}(I(X = i), I(X = j) \mid X^* = h) = -p^*_{hi}p^*_{hj}. \tag{2.3} \]

Based on (2.2) and (2.3), we can express $\sigma^{*2}_i$ in terms of $\sigma^2$ and $p^*_{ij}$.

When $i = 1$,

\[
\begin{aligned}
\sigma^{*2}_1 &= \sigma^2 + \mathrm{Var}(\mu_1 I(X = 1) + \mu_2 I(X = 2) \mid X^* = 1) \\
&= \sigma^2 + \mu_1^2 p^*_{11}(1 - p^*_{11}) + \mu_2^2 p^*_{12}(1 - p^*_{12}) - 2\mu_1\mu_2 p^*_{11}p^*_{12};
\end{aligned}
\]

when $1 < i < k$,

\[
\begin{aligned}
\sigma^{*2}_i &= \sigma^2 + \mathrm{Var}(\mu_{i-1} I(X = i-1) + \mu_i I(X = i) + \mu_{i+1} I(X = i+1) \mid X^* = i) \\
&= \sigma^2 + \mu_{i-1}^2 p^*_{i,i-1}(1 - p^*_{i,i-1}) + \mu_i^2 p^*_{ii}(1 - p^*_{ii}) + \mu_{i+1}^2 p^*_{i,i+1}(1 - p^*_{i,i+1}) \\
&\quad - 2\mu_{i-1}\mu_i p^*_{i,i-1}p^*_{ii} - 2\mu_{i-1}\mu_{i+1} p^*_{i,i-1}p^*_{i,i+1} - 2\mu_i\mu_{i+1} p^*_{ii}p^*_{i,i+1};
\end{aligned}
\]

when $i = k$,

\[
\begin{aligned}
\sigma^{*2}_k &= \sigma^2 + \mathrm{Var}(\mu_{k-1} I(X = k-1) + \mu_k I(X = k) \mid X^* = k) \\
&= \sigma^2 + \mu_{k-1}^2 p^*_{k,k-1}(1 - p^*_{k,k-1}) + \mu_k^2 p^*_{kk}(1 - p^*_{kk}) - 2\mu_{k-1}\mu_k p^*_{k,k-1}p^*_{kk}.
\end{aligned}
\]

From the expressions above, it is clear that $\sigma^{*2}_i \ge \sigma^2$ for any $i$, since the added term is a variance. In conclusion, the group variances for apparent exposure will not be equal, and will always be at least as large as that for actual exposure.

Additionally, the sample size of each group also changes because of misclassification. The groups do not all have the same number of subjects for apparent exposure. Let $n_i$ denote the expected sample size of the $i$-th apparent group.
Given the value of $P$, $n_i$ can be calculated as: when $i = 1$, $n_1 = (p_{11} + p_{21})n$; when $1 < i < k$, $n_i = (p_{i-1,i} + p_{ii} + p_{i+1,i})n$; when $i = k$, $n_k = (p_{k-1,k} + p_{kk})n$. Let $\Sigma^*$ be the covariance matrix for apparent exposure. Then, $\Sigma^*$ can be written in terms of both $\sigma^{*2}_i$ and $n_i$:

\[
\Sigma^* = \mathrm{diag}\!\left(\frac{\sigma^{*2}_1}{n_1}, \frac{\sigma^{*2}_2}{n_2}, \frac{\sigma^{*2}_3}{n_3}, \ldots, \frac{\sigma^{*2}_{k-2}}{n_{k-2}}, \frac{\sigma^{*2}_{k-1}}{n_{k-1}}, \frac{\sigma^{*2}_k}{n_k}\right).
\]

For apparent exposure $X^*$, the sample mean outcomes can be represented as $\hat{\mu}^* = \bar{Y}^*$. The form of the test statistic and the power calculation are exactly the same as for actual exposure, but with $\mu$ replaced by $\mu^*$ and $\Sigma$ replaced by $\Sigma^*$.

2.4.3 Power of the Trend Test

To investigate the effect of misclassification, one possible way is to compare the power of the trend test between $X$ and $X^*$. Larger power of the test represents a stronger exposure-disease association. As misclassification does not affect the number of exposure levels, and given that the total sample size $n_t$ is unchanged, the power calculation depends only on the ratio of two terms,

\[ \frac{|\Delta|}{\sigma_t} = \frac{c'\mu}{\sqrt{c'\Sigma c}}. \tag{2.4} \]

As shown in Example 2.4.1, even when $\mu$ is monotonically increasing, the change between $\mu^*_i$ and $\mu^*_{i+1}$ is not always smaller than that between $\mu_i$ and $\mu_{i+1}$. As a result, we cannot guarantee that, for the numerators of (2.4), $c'\mu > c'\mu^*$ is always the case, and it is possible that misclassification increases the power of the trend test.

To compare the powers between actual exposure and apparent exposure, we can simply calculate the ratio of $|\Delta|/\sigma_t$ between actual exposure and apparent exposure, and compare it with 1. The ratio of $|\Delta|/\sigma_t$ can be expressed as:

\[ \frac{c'\mu / \sqrt{c'\Sigma c}}{c'\mu^* / \sqrt{c'\Sigma^* c}}. \tag{2.5} \]

The numerator of (2.5) is part of the power calculation for actual exposure, and the denominator of (2.5) is the corresponding part for apparent exposure. If $(c'\mu / \sqrt{c'\Sigma c}) / (c'\mu^* / \sqrt{c'\Sigma^* c}) > 1$, misclassification decreases the power of the trend test, and vice versa.
Example 2.4.2 gives a case in which $c'\mu < c'\mu^*$ but misclassification still decreases the power of the trend test.

Example 2.4.2 Let's consider a case when $k = 6$. By the least squares approach, the coefficient vector of the linear contrast $c$ can be calculated as:

\[ c' = (-5, -3, -1, 1, 3, 5). \]

Then, the null hypothesis can be written as:

\[ H_0 : -5\mu_1 - 3\mu_2 - \mu_3 + \mu_4 + 3\mu_5 + 5\mu_6 = 0. \]

Take $X$ uniformly distributed, $\sigma = 2$, $\mu = (4.5, 4.6, 5.0, 5.1, 5.5, 5.6)$, and

\[
P = \begin{pmatrix}
0.6 & 0.4 & 0 & 0 & 0 & 0 \\
0.1 & 0.5 & 0.4 & 0 & 0 & 0 \\
0 & 0.1 & 0.5 & 0.4 & 0 & 0 \\
0 & 0 & 0.4 & 0.5 & 0.1 & 0 \\
0 & 0 & 0 & 0.4 & 0.5 & 0.1 \\
0 & 0 & 0 & 0 & 0.4 & 0.6
\end{pmatrix}
\]

as an example. Both $\mu$ and $\mu^*$ are displayed in Figure 2.8. Under this setting, $c'\mu^* = 8.34$, which is larger than $c'\mu = 8.30$. Misclassification can increase the value of the linear contrast. However, because $c'\Sigma c < c'\Sigma^* c$, it is still possible that misclassification decreases the power of the trend test. Under this setting,

\[ \frac{c'\mu / \sqrt{c'\Sigma c}}{c'\mu^* / \sqrt{c'\Sigma^* c}} = 1.1359. \]

Figure 2.8: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is increasing.

As $(c'\mu / \sqrt{c'\Sigma c}) / (c'\mu^* / \sqrt{c'\Sigma^* c}) > 1$, the power of the trend test for actual exposure is larger than that for apparent exposure. It means that misclassification attenuates the exposure-disease association. In this example, misclassification has the same effect for polychotomous exposure as for continuous or binary exposure (for instance, when there are 6 subjects and the actual exposure levels of these subjects are all different, the power of the trend test for $\mu$ is 0.2877, and the power for $\mu^*$ is 0.2347, at significance level 0.05).

Besides the numerator of (2.4), the denominator of (2.4) for $\mu$ also cannot be guaranteed to be smaller than that for $\mu^*$. In conclusion, misclassification will not always attenuate the exposure-disease association.
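The quantities entering the ratio (2.5) can be assembled directly from $\mu$, $P$ and $\sigma$. The sketch below is an illustration under stated assumptions, not the computation used in the thesis: it takes the inputs of Example 2.4.2, builds the apparent means, variances and expected group sizes for uniformly distributed $X$, and evaluates the ratio. The function names are hypothetical, the contrast is the integer vector proportional to the centred group index, and the common group size `n` is an arbitrary choice that cancels in the ratio. Small differences from the reported value may arise from rounding or implementation details.

```python
import math


def linear_contrast(k):
    """Integer contrast proportional to the centred group index 1..k."""
    return [2 * i - (k + 1) for i in range(1, k + 1)]


def apparent_groups(mu, P, sigma2, n):
    """Means, variances and expected sizes of the apparent groups (uniform X)."""
    k = len(mu)
    means, variances, sizes = [], [], []
    for j in range(k):
        column = [P[i][j] for i in range(k)]
        total = sum(column)
        posterior = [p / total for p in column]  # Pr(X = i | X* = j)
        mean = sum(w * m for w, m in zip(posterior, mu))
        spread = sum(w * (m - mean) ** 2 for w, m in zip(posterior, mu))
        means.append(mean)
        variances.append(sigma2 + spread)  # law of total variance
        sizes.append(total * n)            # expected apparent group size
    return means, variances, sizes


def standardized_contrast(c, means, variances, sizes):
    """c'mu / sqrt(c' Sigma c) for a diagonal Sigma with entries variance/size."""
    numerator = sum(ci * mi for ci, mi in zip(c, means))
    denominator = math.sqrt(
        sum(ci ** 2 * v / s for ci, v, s in zip(c, variances, sizes))
    )
    return numerator / denominator


k, sigma, n = 6, 2.0, 1  # n is a hypothetical common group size
mu = [4.5, 4.6, 5.0, 5.1, 5.5, 5.6]
P = [
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.5, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.5, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.5, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.5, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
]

c = linear_contrast(k)
mu_star, var_star, n_star = apparent_groups(mu, P, sigma ** 2, n)

contrast_actual = sum(ci * mi for ci, mi in zip(c, mu))
contrast_apparent = sum(ci * mi for ci, mi in zip(c, mu_star))

actual = standardized_contrast(c, mu, [sigma ** 2] * k, [n] * k)
apparent = standardized_contrast(c, mu_star, var_star, n_star)
ratio = actual / apparent

print(round(contrast_actual, 2), round(contrast_apparent, 2), round(ratio, 4))
```

The apparent contrast exceeds the actual one, yet the ratio is above 1, in line with the conclusion of the example that the larger apparent variance outweighs the larger apparent contrast.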
Example 2.4.3 gives a case where misclassification increases the power of the trend test.

Example 2.4.3 Let's consider the situation where all the other settings are the same as in Example 2.4.2, but with a different classification matrix:

\[
P = \begin{pmatrix}
0.9 & 0.1 & 0 & 0 & 0 & 0 \\
0.25 & 0.7 & 0.05 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0.05 & 0.7 & 0.25 \\
0 & 0 & 0 & 0 & 0.1 & 0.9
\end{pmatrix}.
\]

Figure 2.9: $\mu$ and $\mu^*$ when $X$ is uniformly distributed and $\mu$ is increasing.

Both $\mu$ and $\mu^*$ are displayed in Figure 2.9. Under this setting, although $c'\mu > c'\mu^*$, we also have $c'\Sigma c > c'\Sigma^* c$. It can be calculated that

\[ \frac{c'\mu / \sqrt{c'\Sigma c}}{c'\mu^* / \sqrt{c'\Sigma^* c}} = 0.9977. \]

As $(c'\mu / \sqrt{c'\Sigma c}) / (c'\mu^* / \sqrt{c'\Sigma^* c}) < 1$, misclassification increases the power of the trend test. The power of the trend test for $\mu^*$ will be larger than that for $\mu$ (for instance, when there are 6 subjects and the actual exposure levels of these subjects are all different, the power of the trend test for $\mu$ is 0.2877, and the power for $\mu^*$ is 0.2888, at significance level 0.05). Therefore, misclassification strengthens the exposure-disease association. Under this setting, misclassification does not have the same effect on polychotomous exposure as on continuous or binary exposure.

2.5 Example: The Effect of Vitamin C on Tooth Growth in Guinea Pigs

To examine the effect of Vitamin C on tooth growth in guinea pigs, 60 guinea pigs were observed in a study. Each guinea pig was assigned one of three dose levels of Vitamin C (0.5, 1.0 and 2.0 mg) and the length of odontoblasts was recorded as the response to represent tooth growth condition. Guinea pigs given 0.5 mg Vitamin C are defined to be in group 1, those given 1.0 mg in group 2, and those given 2.0 mg in group 3, with 20 guinea pigs in each group. This data set (ToothGrowth) can be found in the "datasets" package in R.
Suppose that the original data set in the R package represents the real situation; we call it the actual data set. Then, the vector of group mean outcomes for true exposure can be calculated based on this actual data set and is denoted as $\mu$. Assume that researchers measured the outcomes (the length of odontoblasts) accurately but did not always classify the guinea pigs into their true exposure levels (dose level of Vitamin C) during the study. As exposure misclassification happened, the recorded exposure level of each guinea pig is the surrogate $X^*$ rather than the actual exposure $X$.

Assume that the actual exposure is uniformly distributed, and that the exposure level cannot be misclassified more than one category away from the true exposure level. Let the classification matrix be:

\[
P = \begin{pmatrix}
0.6 & 0.4 & 0 \\
0.2 & 0.6 & 0.2 \\
0 & 0.4 & 0.6
\end{pmatrix},
\]

e.g., $p_{12} = 0.4$ means that forty percent of the guinea pigs in the first group, which were given dose level 0.5 mg, were wrongly recorded as being given dose level 1.0 mg. Using this classification matrix, we can create a new data set, called the recorded data set. Assuming that researchers only have this recorded data set and not the actual data set, all the results are analyzed based on this recorded data set. If researchers do not adjust for misclassification, they treat the apparent exposure as the true exposure, which leads to a biased exposure-disease association. In the recorded data set, guinea pigs in the same group would not all have received the same level of Vitamin C in reality. Even though 20 guinea pigs actually received each of the three dose levels of Vitamin C, the sample sizes of the groups would not be equal in the recorded data set.

As the actual data set is unknown, the recorded data set is the only available information to analyze. Denote the vector of group mean outcomes of this recorded data set as $\mu^*$, which corresponds to the surrogate exposure. Based on the recorded data, the group sample sizes $n_i$ and the covariance matrix $\Sigma^*$ can also be calculated.
Then, it is possible to investigate the effect of misclassification in this study. Based on both the actual data set and the recorded data set, the vector of the actual group mean outcomes can be calculated as $\mu = (10.605, 19.735, 26.100)$, and the vector of the apparent group mean outcomes can be calculated as $\mu^* = (12.585, 19.206, 22.900)$. Both $\mu$ and $\mu^*$ are displayed in Figure 2.10.

Figure 2.10: Relationship between tooth length and dose levels.

This example is consistent with the conclusions of both Theorem 2.1.1 and Theorem 2.4.1. Since the exposure levels cannot be misclassified more than one category away from the true exposure levels and $\mu$ is monotonically increasing, $\mu^*$ is also monotonically increasing. The mean outcome of the guinea pigs recorded as receiving 0.5 mg Vitamin C (apparent exposure) is larger than that of those actually receiving 0.5 mg, and the mean outcome of those recorded as receiving 2.0 mg Vitamin C is smaller than that of those actually receiving 2.0 mg. To investigate the effect of misclassification on the overall mean outcomes, we will use a trend test and compare the power of the test between actual exposure and apparent exposure, as in Section 2.4.

The number of exposure levels for both actual exposure and apparent exposure is three. Using the least squares approach, the corresponding coefficient vector of the linear contrast $c$ can be calculated as:

\[ c' = (-1, 0, 1). \]

As discussed in Section 2.4.3, the power calculation depends on only two terms: $|\Delta|$ and $\sigma_t$. Given both the actual data set and the recorded data set, all values of $n$, $n_i$, $\Sigma$, and $\Sigma^*$ can be worked out. Then, we can calculate the value of the ratio of $|\Delta|/\sigma_t$:

\[ \frac{c'\mu / \sqrt{c'\Sigma c}}{c'\mu^* / \sqrt{c'\Sigma^* c}} = 3.56. \]

In this example, the ratio of $|\Delta|/\sigma_t$ between actual exposure and apparent exposure is 3.56, which is much larger than 1. The power of the trend test for $\mu$ will be larger than the power of the trend test for $\mu^*$.
It means that misclassification attenuates the exposure-disease association, which is the same effect as for continuous or binary exposure.

Chapter 3

Case-Control Study with "Maybe" Exposed Group

In standard case-control studies, there are two categories for disease, presence and absence, and two categories for exposure, exposed and unexposed. Then, based on the collected data set, researchers can create a $2 \times 2$ table and use it to analyze the exposure-disease association. In this thesis, we only consider non-differential exposure misclassification. It means that misclassification only happens when classifying the exposure status of subjects; the outcome status is always accurately classified. We will use a surrogate variable $X^*$ to represent the apparent exposure status.

When the exposure prevalence is low and the exposure status of a subject is uncertain, researchers usually classify this subject as unexposed. Let's consider what happens if all such subjects are kept as a new group, rather than merged into the unexposed group. In this way, there are three exposure levels instead of two: 'likely exposed', 'maybe exposed', and 'unlikely exposed'.

3.1 Identification Regions

Let $Y$, $X$, and $X^*$ represent an individual's disease status, actual exposure status, and apparent exposure status respectively. The disease status $Y$ has two categories, coded as zero for 'absence' and one for 'presence'. The actual exposure status $X$ also has two categories, coded as zero for 'unexposed' and one for 'exposed'. However, the apparent exposure status $X^*$ has three categories, coded as zero for 'unlikely exposed', one for 'maybe exposed', and two for 'likely exposed'. Thus observed data take the form of a $2 \times 3$ $(Y, X^*)$ data table.

The target parameters in the study are the prevalences of true exposure. Define $r_0$ and $r_1$ to be the prevalences of true exposure among controls and
cases such that:

\[ r_0 = \Pr\{X = 1 \mid Y = 0\}, \qquad r_1 = \Pr\{X = 1 \mid Y = 1\}. \]

The non-differential misclassification assumption is invoked, under which $Y$ and $X^*$ are conditionally independent given $X$. Thus the misclassification is described via $p_{ij}$, denoting the probability of classifying subjects into the $j$-th assessed exposure level given the $i$-th true exposure level:

\[
\mathbf{p}_0 = \begin{pmatrix} p_{00} \\ p_{01} \\ p_{02} \end{pmatrix}
= \begin{pmatrix} \Pr\{X^* = 0 \mid X = 0\} \\ \Pr\{X^* = 1 \mid X = 0\} \\ \Pr\{X^* = 2 \mid X = 0\} \end{pmatrix},
\qquad
\mathbf{p}_1 = \begin{pmatrix} p_{10} \\ p_{11} \\ p_{12} \end{pmatrix}
= \begin{pmatrix} \Pr\{X^* = 0 \mid X = 1\} \\ \Pr\{X^* = 1 \mid X = 1\} \\ \Pr\{X^* = 2 \mid X = 1\} \end{pmatrix}.
\]

Since the apparent exposure has three categories, both $\mathbf{p}_0$ and $\mathbf{p}_1$ are three dimensional. Then, the prevalences of apparent exposure among controls and cases, say $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$, can be expressed as a combination of $r_0$, $r_1$, $\mathbf{p}_0$, and $\mathbf{p}_1$:

\[
\boldsymbol{\theta}_0 = \begin{pmatrix} \theta_{00} \\ \theta_{01} \\ \theta_{02} \end{pmatrix}
= \begin{pmatrix} r_0 p_{10} + (1 - r_0)p_{00} \\ r_0 p_{11} + (1 - r_0)p_{01} \\ r_0 p_{12} + (1 - r_0)p_{02} \end{pmatrix}
= \begin{pmatrix} \Pr\{X^* = 0 \mid Y = 0\} \\ \Pr\{X^* = 1 \mid Y = 0\} \\ \Pr\{X^* = 2 \mid Y = 0\} \end{pmatrix},
\]

\[
\boldsymbol{\theta}_1 = \begin{pmatrix} \theta_{10} \\ \theta_{11} \\ \theta_{12} \end{pmatrix}
= \begin{pmatrix} r_1 p_{10} + (1 - r_1)p_{00} \\ r_1 p_{11} + (1 - r_1)p_{01} \\ r_1 p_{12} + (1 - r_1)p_{02} \end{pmatrix}
= \begin{pmatrix} \Pr\{X^* = 0 \mid Y = 1\} \\ \Pr\{X^* = 1 \mid Y = 1\} \\ \Pr\{X^* = 2 \mid Y = 1\} \end{pmatrix}.
\]

Both $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ are also three dimensional. Clearly the distribution of the observed data depends on the unknown parameters only via $\boldsymbol{\theta} = (\boldsymbol{\theta}_0, \boldsymbol{\theta}_1)$, and only functions of $\boldsymbol{\theta}$ are consistently estimable.

Going forward, it is useful to note that $\mathbf{p}_i$ and $\boldsymbol{\theta}_i$ belong to the probability simplex on three categories, which is denoted as $S_3$. When useful, it is possible to visualize points in $S_3$ by plotting the probability assigned to the first (third) category on the vertical (horizontal) axis, so that $S_3$ is represented as the lower-left triangle in the unit square $(0, 1)^2$.

We study the situation where the only direct information about the classification probabilities $\mathbf{p} = (\mathbf{p}_0, \mathbf{p}_1)$ is a priori knowledge that they lie in a particular subset of $S_3^2$. We write this as $\mathbf{p} \in \mathcal{P}$, and refer to $\mathcal{P}$ as the prior region.
In concept we could also consider a prior region for the exposure prevalences $r = (r_0, r_1)$, which would be a subset of $(0, 1)^2$, but in fact we only consider the situation where this prior region is all of $(0, 1)^2$.

Following the general approach to partial identification, as described for instance by Manski (2003) [18], we start with the prior region and the values of $\theta$ (thinking of the latter as equivalent to observation of an infinite number of controls and cases). Then we would like to know all the values of the unknown parameters (particularly the target parameters $r$) which could have produced this value of $\theta$. Formally, let the identification region $Q(\theta)$ be all values of the target parameters $(r_0, r_1)$ which yield this value of $\theta$ for some choice of $p \in \mathcal{P}$. Since the values of $\theta$ will be learnt as the sample size increases, the identification region can be regarded as all values of $(r_0, r_1)$ which are still plausible after having observed an infinite amount of data, presuming that the classification probabilities are indeed inside the prior region. Note that the identification region $r \in Q(\theta)$ will in turn generate an identification region (typically an interval) for the odds ratio $OR = \{r_1/(1 - r_1)\}/\{r_0/(1 - r_0)\}$.

3.1.1 Constraint A

To motivate a realistic prior region, note that merging the maybe and unlikely categories together would result in a binary classification scheme having sensitivity $p_{12}$ and specificity $1 - p_{02}$, so that an assumption of ‘better than random’ classification is expressed as $p_{12} > p_{02}$. Similarly, if the maybe and likely categories were merged, the assumption $p_{00} > p_{10}$ would hold sway. Therefore, if categories are not actually collapsed, it is natural to assume that both inequalities hold. We refer to this as constraint A, and express the prior region as $p \in \mathcal{P}_A$. With respect to the visualization scheme, $p_0$ and $p_1$ can be anywhere in the lower-left triangle representing $S_3$, so long as $p_1$ is south-east of $p_0$.
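Consider a value of $\theta$ which is compatible with constraint A, i.e., one which arises for some value of $p \in \mathcal{P}_A$ along with some value of $r \in (0, 1)^2$. Geometrically, it is immediate that $\theta$ is compatible with constraint A if and only if the line connecting $\theta_0$ and $\theta_1$ has negative slope. Below, this line is referred to simply as ‘the connecting line’. Without loss of generality, but for ease of exposition, I assume $\theta_1$ lies south-east of $\theta_0$ (as must arise if $r_0 < r_1$). Then, the identification region for $p$ can be expressed as:
\[ \{p_0, p_1 : \theta_{00} < p_{00} < 1,\; 0 < p_{02} < \theta_{02},\; 0 < p_{10} < \theta_{10},\; \theta_{12} < p_{12} < 1,\; p_{00} = a p_{02} + b,\; p_{10} = a p_{12} + b\}, \]
where $a = (\theta_{00} - \theta_{10})/(\theta_{02} - \theta_{12})$ and $b = (\theta_{02}\theta_{10} - \theta_{12}\theta_{00})/(\theta_{02} - \theta_{12})$. That is, $p_0$ can lie anywhere on the connecting line above $\theta_0$ and $p_1$ can lie anywhere on the connecting line below $\theta_1$ (though of course each $p_i$ must remain within $S_3$).

To transfer the identification region to $(r_0, r_1)$, we introduce two more parameters $z_0$ and $z_1$. Setting $z_0 = r_0/(r_1 - r_0)$ and $z_1 = (1 - r_1)/(r_1 - r_0)$, this geometry lends itself to a simple algebraic description of the identification region in terms of $z = (z_0, z_1)$, where
\[ p_0 = \theta_0 + z_0(\theta_0 - \theta_1), \tag{3.1} \]
\[ p_1 = \theta_1 + z_1(\theta_1 - \theta_0), \tag{3.2} \]
i.e., $z_i$ indicates how far $p_i$ lies beyond $\theta_i$. Thus each $z_i$ is non-negative, but cannot exceed the value which maps $p_i$ onto the boundary of $S_3$. Consequently, the identification region for $(z_0, z_1)$ is rectangular, given by $0 \le z_i \le \bar{z}_i(\theta)$, for $i = 0, 1$, where
\[ \bar{z}_0(\theta) = \begin{cases} \min\left( \dfrac{\theta_{02}}{\theta_{12} - \theta_{02}}, \dfrac{\theta_{01}}{\theta_{11} - \theta_{01}} \right) & \text{if } \theta_{11} - \theta_{01} > 0, \\[2ex] \dfrac{\theta_{02}}{\theta_{12} - \theta_{02}} & \text{if } \theta_{11} - \theta_{01} \le 0, \end{cases} \tag{3.3} \]
\[ \bar{z}_1(\theta) = \begin{cases} \min\left( \dfrac{\theta_{10}}{\theta_{00} - \theta_{10}}, \dfrac{\theta_{11}}{\theta_{01} - \theta_{11}} \right) & \text{if } \theta_{01} - \theta_{11} > 0, \\[2ex] \dfrac{\theta_{10}}{\theta_{00} - \theta_{10}} & \text{if } \theta_{01} - \theta_{11} \le 0. \end{cases} \tag{3.4} \]
From here, it is easy to verify that the rectangular identification region for $(z_0, z_1)$ maps to a polygonal identification region for $(r_0, r_1)$, via the map $(r_0, r_1) = (1 + z_0 + z_1)^{-1}(z_0, 1 + z_0)$.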
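The bounds (3.3) and (3.4) and the map from $z$ to $r$ can be sketched numerically as follows. This is a minimal illustration rather than code from the thesis: the function names are hypothetical, $\theta_1$ is assumed to lie south-east of $\theta_0$, and the input is the example value of $\theta$ used for Figure 3.1. Since the odds ratio equals $(1 + z_0)(1 + z_1)/(z_0 z_1)$ in terms of $z$, it decreases in both coordinates, so its smallest value over the rectangle occurs at the vertex $(\bar{z}_0, \bar{z}_1)$.

```python
def z_upper_bounds(theta0, theta1):
    """Upper bounds (3.3) and (3.4); assumes theta1 lies south-east of theta0."""
    z0 = theta0[2] / (theta1[2] - theta0[2])
    if theta1[1] - theta0[1] > 0:
        z0 = min(z0, theta0[1] / (theta1[1] - theta0[1]))
    z1 = theta1[0] / (theta0[0] - theta1[0])
    if theta0[1] - theta1[1] > 0:
        z1 = min(z1, theta1[1] / (theta0[1] - theta1[1]))
    return z0, z1


def z_to_r(z0, z1):
    """Map (z0, z1) to (r0, r1) = (z0, 1 + z0) / (1 + z0 + z1)."""
    total = 1 + z0 + z1
    return z0 / total, (1 + z0) / total


def odds_ratio(r0, r1):
    return (r1 / (1 - r1)) / (r0 / (1 - r0))


# Example value of theta (the one quoted for Figure 3.1).
theta0 = (0.645, 0.200, 0.155)
theta1 = (0.567, 0.200, 0.233)

z0_bar, z1_bar = z_upper_bounds(theta0, theta1)
r0_vertex, r1_vertex = z_to_r(z0_bar, z1_bar)
or_lower = odds_ratio(r0_vertex, r1_vertex)  # smallest odds ratio in the region

print(round(z0_bar, 3), round(z1_bar, 3), round(or_lower, 3))
```

At the opposite corner, $z = (0, 0)$ maps to $(r_0, r_1) = (0, 1)$, which is why the odds ratio has no finite upper bound over the whole rectangle.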
In particular, the identification region for $(r_0, r_1)$ can be expressed as:
\[ Q_A(\theta) = \left\{ (r_0, r_1) : r_1 > \frac{\bar{z}_0(\theta) + 1}{\bar{z}_0(\theta)}\, r_0,\; r_1 > \frac{\bar{z}_1(\theta)}{\bar{z}_1(\theta) + 1}\, r_0 + \frac{1}{\bar{z}_1(\theta) + 1},\; 0 < r_0, r_1 < 1 \right\}. \]
The situation described thus far is illustrated in the left panels of Figure 3.1, for the example values of $\theta_0 = (0.645, 0.200, 0.155)$ and $\theta_1 = (0.567, 0.200, 0.233)$. The top panel illustrates these $\theta_i$ values and the identification region for $p$, within $S_3$. The middle panel shows this identification region expressed in terms of $z$, and finally the bottom panel visualizes this region as $r \in Q_A(\theta)$.

3.1.2 Constraint B

Sometimes a stronger assumption than constraint A may be justified, making explicit reference to the chance of ‘maybe’ classification. The monotonicity of $p_0$ and $p_1$ might be assumed, whereby the worse a classification is, the less likely it is. This constraint, henceforth referred to as constraint B, can be expressed as $p_{00} > p_{01} > p_{02}$ and $p_{10} < p_{11} < p_{12}$. The prior region is defined to be $p \in \mathcal{P}_B$. The visual representation of $\mathcal{P}_B$ is given in the upper-right panel of Figure 3.1, in which $p_0$ must lie in the upper shaded triangle and $p_1$ must lie in the lower shaded triangle.

Say that $\theta$ is compatible with constraint A, and again assume, without loss of generality, that $\theta_1$ is south-east of $\theta_0$. Taking the geometric view, the identification region under constraint B will be non-empty if and only if the portion of the connecting line above $\theta_0$ intersects the prior region for $p_0$ and the portion below $\theta_1$ intersects the prior region for $p_1$. Say that $\theta$ is compatible with constraint B if the identification region is non-empty. If $\theta$ arises from a true value of $p \in \mathcal{P}_B$ then by definition $\theta$ is compatible with constraint B. However, if $\theta$ arises from a true value of $p \in \mathcal{P}_A \setminus \mathcal{P}_B$, then $\theta$ may or may not be compatible with constraint B.
Upon inspection of the upper-right panel of Figure 3.1, compatibility with constraint B arises if and only if the connecting line intersects the vertical axis between 0.5 and 1, and also intersects the horizontal axis between 0.5 and 1. Compared with constraint A, there are four additional restrictions defining the identification region. The identification region for $p$ can be expressed as:
\[ \{p_0, p_1 : \theta_{00} < p_{00} < 1,\; 0 < p_{02} < \theta_{02},\; 0 < p_{10} < \theta_{10},\; \theta_{12} < p_{12} < 1,\; p_{00} = a p_{02} + b,\; p_{10} = a p_{12} + b,\; p_{00} > p_{01},\; p_{01} > p_{02},\; p_{10} < p_{11},\; p_{11} < p_{12}\}. \]
For a $\theta$ compatible with constraint B, we can again express the identification region in terms of $z$. The upper bounds on $z_0$ and $z_1$ must correspond to the intersection of the connecting line with the vertical and horizontal axes respectively, and are therefore the same upper bounds (3.3) and (3.4) that apply under constraint A. Note that if $\theta$ is compatible with constraint B, then $\bar{z}_0(\theta)$ and $\bar{z}_1(\theta)$ can only take the values $\theta_{02}/(\theta_{12} - \theta_{02})$ and $\theta_{10}/(\theta_{00} - \theta_{10})$. It is also clear from the geometric view (upper-right panel of Figure 3.1 again) that $z_i$ will have a positive lower bound if and only if $\theta_i$ lies outside the prior region for $p_i$. Thus our identification region is now expressed as $\underline{z}_i(\theta) \le z_i \le \bar{z}_i(\theta)$, for $i = 0, 1$, where
\[ \underline{z}_0(\theta) = \max\left( 0, \frac{\theta_{01} - \theta_{02}}{(\theta_{11} - \theta_{12}) - (\theta_{01} - \theta_{02})} \right), \qquad \underline{z}_1(\theta) = \max\left( 0, \frac{\theta_{10} - \theta_{11}}{(\theta_{00} - \theta_{01}) - (\theta_{10} - \theta_{11})} \right). \]

Figure 3.1: Prior and identification regions. The three plots on the left are under constraint A, and the three plots on the right are under constraint B. The areas shaded in gray are the prior regions, and the areas shaded in black are the identification regions.

Again this rectangular identification region for $(z_0, z_1)$ induces a polygonal identification region $(r_0, r_1) \in Q_B(\theta)$, as illustrated in the middle-right and lower-right panels of Figure 3.1.
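The lower bounds on $z_0$ and $z_1$ under constraint B can be evaluated in the same spirit. The sketch below simply transcribes the two expressions given above. The function name is hypothetical, and the code assumes that $\theta_1$ lies south-east of $\theta_0$ and that neither denominator is zero. It is applied to the example value of $\theta$ used for Figure 3.1.

```python
def z_lower_bounds_b(theta0, theta1):
    """Lower bounds on (z0, z1) under constraint B, transcribed from the text.

    Assumes theta1 lies south-east of theta0 and non-zero denominators.
    """
    d0 = theta0[1] - theta0[2]  # theta01 - theta02
    d1 = theta1[1] - theta1[2]  # theta11 - theta12
    z0_low = max(0.0, d0 / (d1 - d0))
    e0 = theta0[0] - theta0[1]  # theta00 - theta01
    e1 = theta1[0] - theta1[1]  # theta10 - theta11
    z1_low = max(0.0, e1 / (e0 - e1))
    return z0_low, z1_low


theta0 = (0.645, 0.200, 0.155)
theta1 = (0.567, 0.200, 0.233)
z0_low, z1_low = z_lower_bounds_b(theta0, theta1)

print(round(z0_low, 3), round(z1_low, 3))
```

For this $\theta$, the value $\theta_0$ already satisfies $\theta_{00} > \theta_{01} > \theta_{02}$, so the lower bound on $z_0$ is zero, whereas $\theta_1$ has $\theta_{10} > \theta_{11}$ and therefore requires a positive lower bound on $z_1$.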
Formally, the identification region for $(r_0, r_1)$ can be expressed as:
\[ Q_B(\theta) = \left\{ (r_0, r_1) : r_1 > \frac{\bar{z}_0(\theta) + 1}{\bar{z}_0(\theta)}\, r_0,\; r_1 > \frac{\bar{z}_1(\theta)}{\bar{z}_1(\theta) + 1}\, r_0 + \frac{1}{\bar{z}_1(\theta) + 1},\; r_1 < \frac{\underline{z}_0(\theta) + 1}{\underline{z}_0(\theta)}\, r_0 \text{ (when } \underline{z}_0(\theta) \ne 0\text{)},\; r_1 < \frac{\underline{z}_1(\theta)}{\underline{z}_1(\theta) + 1}\, r_0 + \frac{1}{\underline{z}_1(\theta) + 1},\; 0 < r_0, r_1 < 1 \right\}. \]
Note that, if $\theta$ is compatible with constraint B, the identification regions for $z$ under constraints A and B are both rectangular with the same north-east vertex. Consequently, in terms of $r$, the lower boundary of $Q_B(\theta)$ is guaranteed to be a subset of the lower boundary of $Q_A(\theta)$.

3.1.3 Comparison Between Constraint A and B

Based on the identification regions for $p$, for $z$, and for $r$ described in Sections 3.1.1 and 3.1.2, we can summarize some salient features of the identification regions under constraints A and B in the following theorem.

Theorem 3.1.1 Say that $\theta$ is compatible with constraint A. Assume without loss of generality that $\theta_{00} > \theta_{10}$ and $\theta_{02} < \theta_{12}$ (as must arise if $r_0 < r_1$). Then:

(i) If $\theta \in \mathcal{P}_B$, then $Q_A(\theta) = Q_B(\theta)$. That is, both constraints give rise to the same identification region if $\theta_0$ and $\theta_1$ fall in the prior regions for $p_0$ and $p_1$ under constraint B. Otherwise, $Q_B(\theta) \subset Q_A(\theta)$.

(ii) Constraint A yields an infinite upper bound on the odds ratio.

(iii) If $\theta$ is compatible with constraint B, then constraint B yields a finite upper bound on the odds ratio if and only if $\theta_0$ is outside the prior region for $p_0$ and $\theta_1$ is outside the prior region for $p_1$.

(iv) Constraint A yields a lower bound on the odds ratio achieved at $r_0 = \bar{z}_0(\theta)/(\bar{z}_0(\theta) + \bar{z}_1(\theta) + 1)$ and $r_1 = (\bar{z}_0(\theta) + 1)/(\bar{z}_0(\theta) + \bar{z}_1(\theta) + 1)$. If $\theta$ is compatible with constraint B, then the same lower bound applies under constraint B.

Proof: (i): If $\theta \in \mathcal{P}_B$ then $\underline{z}_0(\theta) = \underline{z}_1(\theta) = 0$, and the result follows immediately.

(ii) & (iii): The odds ratio tends to infinity as $r_0$ goes to zero or $r_1$ goes to one. This corresponds to $p_0$ going to $\theta_0$ (from above/left) or $p_1$ going to $\theta_1$ (from below/right), along the connecting line.
By inspection, the prior region under constraint A never precludes either possibility (upper-left panel of Figure 3.1). Both are precluded under constraint B, however, if and only if both $\theta_0$ and $\theta_1$ lie outside the respective components of $\mathcal{P}_B$ (upper-right panel of Figure 3.1).

(iv): Clearly the maximum value of $r_0$ and the minimum value of $r_1$ correspond to the two intersections of the connecting line with the $S_3$ boundary. This can also be visualized as the north-east vertex of the identification rectangle for $z$, and as the middle of the three vertices which give the lower boundary for $Q_A(\theta)$ or $Q_B(\theta)$.

3.1.4 Collapsing Exposure to Two Categories

As alluded to in Section 1.3, it is informative to compare the identification regions described above to the identification region arising when exposure is collapsed from three to two categories. In particular, if low exposure prevalences are anticipated, the ‘unlikely exposed’ and ‘maybe exposed’ categories could be merged. Then the binary apparent exposure would be
\[ \tilde{X}^* = \begin{cases} 0 & \text{if } X^* \in \{0, 1\}, \\ 1 & \text{if } X^* = 2. \end{cases} \]
Given the setting
\[ \tilde{p}_0 = \Pr\{\tilde{X}^* = 1 \mid X = 0\} = p_{02}, \qquad \tilde{p}_1 = \Pr\{\tilde{X}^* = 1 \mid X = 1\} = p_{12}, \]
the quality of classification can be described by specificity $1 - \tilde{p}_0$ and sensitivity $\tilde{p}_1$. A weak and commonly invoked assumption is that $\tilde{p}_0 < \tilde{p}_1$, stating that the classification scheme is better than simply choosing an exposure status completely at random. Thus I take the prior region $\tilde{\mathcal{P}}$ to be the triangular region on $(0, 1)^2$ for which this inequality holds. The information gleaned from an infinite data sample would be the value of $\tilde{\theta} = (\tilde{\theta}_0, \tilde{\theta}_1)$, where:
\[ \tilde{\theta}_0 = r_0 \tilde{p}_1 + (1 - r_0)\tilde{p}_0 = \Pr\{\tilde{X}^* = 1 \mid Y = 0\}, \qquad \tilde{\theta}_1 = r_1 \tilde{p}_1 + (1 - r_1)\tilde{p}_0 = \Pr\{\tilde{X}^* = 1 \mid Y = 1\}. \]
The identification region for this problem is determined by Gustafson (2001) [12]. However, we express the results in a form more amenable for comparison with the results in Sections 3.1.1 and 3.1.2. Assume without loss of generality that $\tilde{\theta}_0 < \tilde{\theta}_1$ (as must arise if $r_0 < r_1$).
As per (3.1) and (3.2), we can define $\tilde{z} = (\tilde{z}_0, \tilde{z}_1)$, where $\tilde{p}_0 = \tilde{\theta}_0 + \tilde{z}_0(\tilde{\theta}_0 - \tilde{\theta}_1)$ and $\tilde{p}_1 = \tilde{\theta}_1 + \tilde{z}_1(\tilde{\theta}_1 - \tilde{\theta}_0)$. Then by the same geometric argument as earlier, we have a rectangular identification region of the form $0 < \tilde{z}_i < \bar{\tilde{z}}_i(\tilde{\theta})$ for $i = 0, 1$, such that:
\[ \bar{\tilde{z}}_0(\tilde{\theta}) = \frac{\tilde{\theta}_0}{\tilde{\theta}_1 - \tilde{\theta}_0}, \qquad \bar{\tilde{z}}_1(\tilde{\theta}) = \frac{1 - \tilde{\theta}_1}{\tilde{\theta}_1 - \tilde{\theta}_0}. \]
For comparison with the values of $z$ under constraints A and B, note that $\tilde{\theta}_0 = \theta_{02}$ and $\tilde{\theta}_1 = \theta_{12}$ in this collapsed case. The identification region maps to $r$ just as before, according to $(r_0, r_1) = (1 + \tilde{z}_0 + \tilde{z}_1)^{-1}(\tilde{z}_0, 1 + \tilde{z}_0)$. Thus we again have a polygonal boundary for the identification region $r \in \tilde{Q}(\tilde{\theta})$. The minimum odds ratio occurs when $r_0 = \tilde{\theta}_0$ and $r_1 = \tilde{\theta}_1$.

It is straightforward to compare the effect of collapsing to the use of three categories under constraint A. The conclusions are summarized in Theorem 3.1.2.

Theorem 3.1.2 Say that $\theta$ is compatible with constraint A. Also, assume without loss of generality that $\theta_{00} > \theta_{10}$ and $\theta_{02} < \theta_{12}$ (as must arise if $r_0 < r_1$). Then:

(i) $Q_A(\theta) \subseteq \tilde{Q}(\tilde{\theta})$;

(ii) $\tilde{Q}(\tilde{\theta})$ cannot produce a finite upper bound on the odds ratio, which is the same as for $Q_A(\theta)$;

(iii) the lower bound on the odds ratio from $\tilde{Q}(\tilde{\theta})$ cannot exceed that from $Q_A(\theta)$.

Proof: When $\theta$ is compatible with constraint A, $\bar{\tilde{z}}_0(\tilde{\theta}) \ge \bar{z}_0(\theta)$ and $\bar{\tilde{z}}_1(\tilde{\theta}) > \bar{z}_1(\theta)$. Also, the lower bounds on $z_i$ in both the collapsed case and under constraint A are zero, for $i = 0, 1$. By mapping the identification region for $(z_0, z_1)$ to $(r_0, r_1)$, it follows directly that $Q_A(\theta) \subseteq \tilde{Q}(\tilde{\theta})$. Since the upper bound on the odds ratio under constraint A is infinite, the collapsed case will also yield an infinite upper bound on the odds ratio. Also, the lower bound on the odds ratio for the collapsed case will always be smaller than or equal to the lower bound under constraint A.

3.1.5 Examples

To examine identification regions under some realistic scenarios, we use two settings of exposure prevalence among controls ($r_0$), two settings of the odds ratio ($OR$), and two settings of the classification probabilities $(p_0, p_1)$.
The value of $r_1$ is determined by the values of $r_0$ and $OR$. The symbols ‘-’ and ‘+’ are used to label the first and second settings for each of these three factors. For the exposure prevalence among controls, the settings are $r_0 = 0.05$ and $r_0 = 0.15$. For the exposure-disease association, the settings are $OR = 1.2$ and $OR = 2.0$. The first setting for the classification probabilities, $p^-$, is a ‘symmetric’ situation with $p_0 = (0.750, 0.200, 0.050)$ and $p_1 = (0.050, 0.200, 0.750)$. The second setting, $p^+$, corresponds to exposure being hard to detect, with $p_0 = (0.900, 0.075, 0.025)$ and $p_1 = (0.200, 0.300, 0.500)$. In total, there are $2^3 = 8$ values of $\theta$ arising from all combinations of these three factors. The identification regions for these eight scenarios are displayed in Figures 3.2 through 3.9.

From these figures, we see that, in all cases, collapsing and using constraint A yield very similar identification regions for $r$. The identification region using constraint B is typically very much smaller, though of course constraints A and B are guaranteed to yield the same lower bound on the odds ratio. Moreover, while some values of $\theta$ produce a finite upper bound on the odds ratio under constraint B, this does not happen for any of the eight scenarios considered here. To the extent that our scenarios are typical, this suggests that a finite upper bound is uncommon. In fact, we can see that low exposure prevalences will tend to produce values of $\theta_0$ close to $p_0$, and therefore inside the prior region under constraint B, unless $p_0$ happens to be very close to the boundary of the prior region. Thus we can intuit that a finite upper bound on the odds ratio will not commonly arise.

Figure 3.2: Identification regions for the combination (- - -). Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.0594. In the upper-left panel, the dot indicates the true exposure prevalences $(r_0, r_1)$, while the cross indicates apparent exposure prevalences upon collapsing to two categories and ignoring misclassification, $(\tilde{\theta}_0, \tilde{\theta}_1)$. In the other three panels, prior and identification regions are indicated in grey and black respectively. The upper-right panel is the $r$ plane in the collapsed case, the lower-left panel is under constraint A, and the lower-right panel is under constraint B. In the collapsed case, $\tilde{\theta}_0 = 0.0850$ and $\tilde{\theta}_1 = 0.0916$. Under constraints A and B, $\theta_0 = (0.7150, 0.2000, 0.0850)$ and $\theta_1 = (0.7084, 0.2000, 0.0916)$.

Figure 3.3: Identification regions for the combination (- - +). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.0594. In the collapsed case, $\tilde{\theta}_0 = 0.0488$ and $\tilde{\theta}_1 = 0.0532$. Under constraints A and B, $\theta_0 = (0.8650, 0.0863, 0.0488)$ and $\theta_1 = (0.8584, 0.0884, 0.0532)$.

Figure 3.4: Identification regions for the combination (- + -). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.0952. In the collapsed case, $\tilde{\theta}_0 = 0.0850$ and $\tilde{\theta}_1 = 0.1167$. Under constraints A and B, $\theta_0 = (0.7150, 0.2000, 0.0850)$ and $\theta_1 = (0.6833, 0.2000, 0.1167)$.

Figure 3.5: Identification regions for the combination (- + +). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.0952. In the collapsed case, $\tilde{\theta}_0 = 0.0488$ and $\tilde{\theta}_1 = 0.0702$. Under constraints A and B, $\theta_0 = (0.8650, 0.0863, 0.0488)$ and $\theta_1 = (0.8333, 0.0964, 0.0702)$.

Figure 3.6: Identification regions for the combination (+ - -). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.1748. In the collapsed case, $\tilde{\theta}_0 = 0.1550$ and $\tilde{\theta}_1 = 0.1723$. Under constraints A and B, $\theta_0 = (0.6450, 0.2000, 0.1550)$ and $\theta_1 = (0.6277, 0.2000, 0.1723)$.

Figure 3.7: Identification regions for the combination (+ - +). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.1748. In the collapsed case, $\tilde{\theta}_0 = 0.0963$ and $\tilde{\theta}_1 = 0.1080$. Under constraints A and B, $\theta_0 = (0.7950, 0.1088, 0.0963)$ and $\theta_1 = (0.7777, 0.1143, 0.1080)$.

Figure 3.8: Identification regions for the combination (+ + -). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.2609. In the collapsed case, $\tilde{\theta}_0 = 0.1550$ and $\tilde{\theta}_1 = 0.2326$. Under constraints A and B, $\theta_0 = (0.6450, 0.2000, 0.1550)$ and $\theta_1 = (0.5674, 0.2000, 0.2326)$.

Figure 3.9: Identification regions for the combination (+ + +). The layout is the same as Figure 3.2. Based on the given values of $r_0$ and $OR$, $r_1$ can be calculated to be 0.2609. In the collapsed case, $\tilde{\theta}_0 = 0.0963$ and $\tilde{\theta}_1 = 0.1489$. Under constraints A and B, $\theta_0 = (0.7950, 0.1088, 0.0963)$ and $\theta_1 = (0.7174, 0.1337, 0.1489)$.

More specifically, for given classification probabilities $p$ it is a simple matter to characterize how large $r_0$ must be (and how small $r_1$ must be) in order to produce a $\theta$ for which there is a finite upper bound on $OR$. This will happen if
\[ \frac{p_{01} - p_{02}}{(p_{10} + 2p_{12}) - (p_{00} + 2p_{02})} < r_i < \frac{2p_{00} + p_{02} - 1}{(2p_{00} + p_{02}) - (2p_{10} + p_{12})}, \]
for $i = 0, 1$. For instance, with $p = p^-$, the upper bound is finite if $r_i \in (0.214, 0.786)$, and with $p = p^+$, this bound is finite if $r_i \in (0.200, 0.892)$.

The lower bounds on the odds ratio in these eight scenarios are given in Table 3.1. As guaranteed by Theorem 3.1.1, the lower bound is always the same under constraints A and B; consistent with Theorem 3.1.2, it is smaller in the collapsed case. In most scenarios, the lower bound for the collapsed case is only very slightly lower. In a practical sense the bounds are useful. For instance, in the (+ - -) and (+ - +) scenarios, one can rule out an odds ratio below 1.14 when the true value is 1.2, and in the (+ + -) and (+ + +) scenarios, one can rule out an odds ratio below 1.7 when the true value is 2.
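The lower bounds in Table 3.1 can be recomputed from the scenario settings alone. The following Python sketch is one way to do so, under the assumption that the bounds are obtained exactly as described in Sections 3.1.1 and 3.1.4: the north-east vertex of the $z$ rectangle for constraints A and B, and the raw odds ratio of the collapsed table for the collapsed case. All function and variable names are hypothetical.

```python
from itertools import product

# Classification probability settings (p0, p1) from Section 3.1.5.
P_MINUS = ((0.750, 0.200, 0.050), (0.050, 0.200, 0.750))  # 'symmetric'
P_PLUS = ((0.900, 0.075, 0.025), (0.200, 0.300, 0.500))   # hard to detect


def odds_ratio(r0, r1):
    return (r1 / (1 - r1)) / (r0 / (1 - r0))


def theta(r, p0, p1):
    """Apparent prevalences: theta_j = r * p1[j] + (1 - r) * p0[j]."""
    return tuple(r * b + (1 - r) * a for a, b in zip(p0, p1))


def lower_bound_three_categories(t0, t1):
    """Odds ratio at the north-east vertex of the z rectangle, via (3.3)-(3.4)."""
    z0 = t0[2] / (t1[2] - t0[2])
    if t1[1] - t0[1] > 0:
        z0 = min(z0, t0[1] / (t1[1] - t0[1]))
    z1 = t1[0] / (t0[0] - t1[0])
    if t0[1] - t1[1] > 0:
        z1 = min(z1, t1[1] / (t0[1] - t1[1]))
    total = 1 + z0 + z1
    return odds_ratio(z0 / total, (1 + z0) / total)


def lower_bound_collapsed(t0, t1):
    """Raw odds ratio of the collapsed table, i.e. r0 = theta02, r1 = theta12."""
    return odds_ratio(t0[2], t1[2])


bounds = {}
settings = product((0.05, 0.15), (1.2, 2.0), (("-", P_MINUS), ("+", P_PLUS)))
for r0, true_or, (label, (p0, p1)) in settings:
    odds1 = true_or * r0 / (1 - r0)
    r1 = odds1 / (1 + odds1)
    t0, t1 = theta(r0, p0, p1), theta(r1, p0, p1)
    key = ("-" if r0 == 0.05 else "+", "-" if true_or == 1.2 else "+", label)
    bounds[key] = (lower_bound_collapsed(t0, t1),
                   lower_bound_three_categories(t0, t1))

for key, (collapsed, three) in bounds.items():
    print(" ".join(key), round(collapsed, 3), round(three, 3))
```

The loop visits the scenarios in the same order as the rows of Table 3.1, printing the collapsed-case bound followed by the bound under constraints A and B.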
It is also worth remembering that the lower bound in the collapsed case corresponds to the large-sample limit of the raw odds ratio in the collapsed data table. Thus the extent to which constraints A and B produce a higher lower bound than this reflects the utility of a formal adjustment approach over collapsing the ‘unlikely exposed’ and ‘maybe exposed’ categories together and treating this as the unexposed category.

Table 3.1: The lower bound on the odds ratio for the collapsed case and for constraints A and B.

Scenario    Collapsed case    Constraints A and B
(- - -)     1.085             1.087
(- - +)     1.097             1.100
(- + -)     1.422             1.436
(- + +)     1.474             1.496
(+ - -)     1.135             1.143
(+ - +)     1.137             1.147
(+ + -)     1.652             1.706
(+ + +)     1.643             1.715

3.2 Limiting Posterior Distribution

In a partially identified context such as that faced here, determining the identification region is only part of the inferential story. From a Bayesian perspective, as the sample size goes to infinity, the investigator learns more than just the identification region. The posterior distribution of the target parameter will tend to some distribution over the identification region, so an obvious issue to address is the extent to which the limiting posterior distribution is flat or peaked across the identification region.

3.2.1 Principle

Suppose $r_0$, $r_1$, $p_0$, and $p_1$ are independent of each other a priori. Assume that $r_0$ and $r_1$ are both uniformly distributed, such that $r_0 \sim U(0, 1)$ and $r_1 \sim U(0, 1)$. Also, assume that $p_0$ and $p_1$ follow Dirichlet distributions, written as $p_0 \sim \text{Dirichlet}(c_{00}, c_{01}, c_{02})$ and $p_1 \sim \text{Dirichlet}(c_{10}, c_{11}, c_{12})$, with the additional truncation of $p_0$ and $p_1$ to the assumed prior region $\mathcal{P}$.
Under these assumptions, the joint prior density can be written as:
\[ f(r_0, r_1, p_0, p_1) \propto \left( \prod_{i=0}^{1} \prod_{j=0}^{2} p_{ij}^{c_{ij} - 1} \right) I_{(0,1)}(r_0)\, I_{(0,1)}(r_1)\, I_{\mathcal{P}}(p_0, p_1). \]
Since the value of $\theta$ is estimable from data, and $r_0$ and $r_1$ are the target parameters, a reparameterization from $(r_0, r_1, p_0, p_1)$ to $(r_0, r_1, \theta_0, \theta_1)$ is helpful. By change of variables, the transformation gives the joint prior density as:
\[ \begin{aligned} f(r_0, r_1, \theta_0, \theta_1) \propto{} & \left( \frac{r_0\theta_{10} - r_1\theta_{00}}{r_0 - r_1} \right)^{c_{00} - 1} \left( \frac{r_0\theta_{11} - r_1\theta_{01}}{r_0 - r_1} \right)^{c_{01} - 1} \left( \frac{r_0\theta_{12} - r_1\theta_{02}}{r_0 - r_1} \right)^{c_{02} - 1} \\ & \times \left( \frac{(1 - r_1)\theta_{00} - (1 - r_0)\theta_{10}}{r_0 - r_1} \right)^{c_{10} - 1} \left( \frac{(1 - r_1)\theta_{01} - (1 - r_0)\theta_{11}}{r_0 - r_1} \right)^{c_{11} - 1} \left( \frac{(1 - r_1)\theta_{02} - (1 - r_0)\theta_{12}}{r_0 - r_1} \right)^{c_{12} - 1} \\ & \times \frac{1}{(r_0 - r_1)^2}\, I_{Q(\theta)}(r_0, r_1), \end{aligned} \]
where a non-zero density is obtained only when $r \in Q(\theta)$.

The joint posterior density of all the parameters given the data can be written as:
\[ f(r_0, r_1, \theta_0, \theta_1 \mid X^*, Y) = f(\theta_0, \theta_1 \mid X^*, Y)\, f(r_0, r_1 \mid \theta_0, \theta_1). \]
The distribution of the data $(X^*, Y)$ gives direct information on the parameters $\theta_0$ and $\theta_1$ only. As the sample sizes of the control and case groups increase, the conditional density $f(\theta_0, \theta_1 \mid X^*, Y)$ will become narrower, converging to a point mass at the true values of $\theta_0$ and $\theta_1$ in the limit. Also, for fixed $(\theta_0, \theta_1)$, the conditional prior density $f(r_0, r_1 \mid \theta_0, \theta_1)$ is simply proportional to the joint prior density $f(r_0, r_1, \theta_0, \theta_1)$. Thus the limiting posterior distribution of $(r_0, r_1)$ can simply be ‘read off’ from the expression given above.

As a final step, the limiting posterior distribution of $(r_0, r_1)$ induces a limiting posterior distribution on the exposure-disease odds ratio. By change of variables and marginalization, the limiting posterior density of the log odds ratio, say $s = \text{logit}\, r_1 - \text{logit}\, r_0$, can be expressed as:
\[ f(s \mid \theta_0, \theta_1) = \int g(r_0, s)\, f(r_0, \text{expit}(s + \text{logit}\, r_0) \mid \theta_0, \theta_1)\, dr_0, \tag{3.5} \]
where
\[ g(r_0, s) = \text{expit}(s + \text{logit}\, r_0)\{1 - \text{expit}(s + \text{logit}\, r_0)\}. \]
Note that the support of the integrand in (3.5) is those $r_0$ for which $\{r_0, \text{expit}(s + \text{logit}\, r_0)\} \in Q(\theta)$.
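To make (3.5) concrete, the following Python sketch evaluates the limiting density of the log odds ratio, up to a normalising constant, with a simple midpoint rule. It is an illustration under stated assumptions and not the implementation used in the thesis: the function names, the grid size, and the choice of constraint A with a uniform Dirichlet prior are all assumptions made for the example. The indicator $I_{Q(\theta)}$ is evaluated by solving for $(p_0, p_1)$ and checking that they are valid probabilities inside the prior region.

```python
import math


def expit(x):
    return 1 / (1 + math.exp(-x))


def logit(p):
    return math.log(p / (1 - p))


def solve_p(r0, r1, theta0, theta1):
    """Invert theta_i = r_i * p1 + (1 - r_i) * p0 for (p0, p1); needs r0 != r1."""
    d = r0 - r1
    p0 = tuple((r0 * b - r1 * a) / d for a, b in zip(theta0, theta1))
    p1 = tuple(((1 - r1) * a - (1 - r0) * b) / d for a, b in zip(theta0, theta1))
    return p0, p1


def constraint_a(p0, p1):
    """Prior region under constraint A: p12 > p02 and p00 > p10."""
    return p1[2] > p0[2] and p0[0] > p1[0]


def prior_density_r(r0, r1, theta0, theta1, c0, c1, in_prior_region):
    """Unnormalised f(r0, r1 | theta): Dirichlet kernels times 1 / (r0 - r1)^2."""
    if r0 == r1:
        return 0.0
    p0, p1 = solve_p(r0, r1, theta0, theta1)
    if min(p0 + p1) <= 0 or not in_prior_region(p0, p1):
        return 0.0  # (r0, r1) lies outside the identification region
    kernel = 1.0
    for p, c in zip(p0 + p1, c0 + c1):
        kernel *= p ** (c - 1)
    return kernel / (r0 - r1) ** 2


def limiting_density(s, theta0, theta1, c0, c1, in_prior_region, n_grid=2000):
    """Midpoint-rule approximation to (3.5), up to a normalising constant."""
    total = 0.0
    for k in range(n_grid):
        r0 = (k + 0.5) / n_grid
        r1 = expit(s + logit(r0))
        g = r1 * (1 - r1)  # Jacobian term g(r0, s)
        total += g * prior_density_r(r0, r1, theta0, theta1, c0, c1,
                                     in_prior_region)
    return total / n_grid


theta0 = (0.645, 0.200, 0.155)
theta1 = (0.567, 0.200, 0.233)
uniform = (1, 1, 1)

below = limiting_density(0.4, theta0, theta1, uniform, uniform, constraint_a)
above = limiting_density(0.8, theta0, theta1, uniform, uniform, constraint_a)
print(below, above > 0)
```

For this $\theta$ the lower bound on the odds ratio is roughly 1.7, so the density vanishes at $s = 0.4$ and is positive at $s = 0.8$. Evaluating the function on a grid of $s$ values and rescaling to integrate to one would give a normalised density.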
By inspection (e.g., see the bottom rows of Figure 3.1), for given $(s, \theta)$ this could be either an interval of $r_0$ values or a pair of disjoint intervals. In particular, think of the support as arising from intersecting the identification region in the $r$ plane with the level curve $\text{logit}\, r_1 - \text{logit}\, r_0 = s$. Provided the prior density of $(r \mid \theta)$ is bounded on $Q(\theta)$, the limiting density $f(s \mid \theta)$ will tend to zero as $s$ approaches the lower bound on $\log OR$, since the support of the integrand in (3.5) is readily seen to shrink to a single point in this limit, i.e., a unique $r_0$ value gives rise to the lower bound value of $s$. For given values of $\theta$, we can readily evaluate (3.5) using one-dimensional numerical integration.

3.2.2 Examples

The eight scenarios (values of $\theta$) from Section 3.1.5 are revisited, in combination with two different settings of the prior distribution according to hyperparameters $c_0$ and $c_1$. The first setting is $c_0^- = c_1^- = (1, 1, 1)$, corresponding to uniform distributions for $p_0$ and $p_1$ across the prior region. The second setting is $c_0^+ = (6, 4, 2)$, $c_1^+ = (2, 4, 6)$, which assigns more prior weight to better classifications (henceforth we refer to this setting as the ‘weighted’ prior).

It is also possible to mimic these hyperparameter settings for the collapsed case, via a Beta($\tilde{c}_0$) prior on specificity and a Beta($\tilde{c}_1$) prior on sensitivity. Then in the collapsed case, $\tilde{c}_0^- = \tilde{c}_1^- = (1, 1)$ is taken as an instance of uniform priors. In light of the collapsibility property of Dirichlet distributions, the analogous ‘weighted prior’ setting when the ‘maybe exposed’ and ‘unlikely exposed’ categories are combined is $\tilde{c}_0^+ = (10, 2)$, $\tilde{c}_1^+ = (6, 6)$.

The limiting posterior distributions of $\log OR$ for these eight scenarios appear in Figures 3.10 through 3.17.
In the case of uniform priors, we consistently see constraint B lead to a more peaked limiting posterior distribution than constraint A, even though the identification interval of $\log OR$ is unchanged. Thus, if it can be invoked, there is a benefit associated with the stronger assumption about misclassification probabilities. In turn, posteriors under constraint A are more peaked than their collapsed case counterparts, even though the identification interval of $\log OR$ is only very marginally bigger for the collapsed case analysis. Thus there is a benefit associated with directly adjusting for misclassification into the three exposure categories, rather than collapsing to two categories and then adjusting.

The behaviour of the posteriors arising from the weighted priors is more nuanced. Under constraint A, moving from the uniform prior to the weighted prior tends to result in a more concentrated posterior, as one might expect. However, and surprisingly, under constraint B, moving to the weighted prior tends to flatten the posterior. Consequently, with the weighted prior, the constraint A and constraint B posterior distributions tend to be very similar.

We have further investigated this surprising ‘interaction’ between using the more concentrated prior and the stronger constraint, and it seems to persist quite generally if exposure prevalences are low and the odds ratio is modest. If starting with uniform priors and constraint A, the resulting posterior induces a negative dependence between $\log OR$ and $W(p)$, where $W(\cdot)$ is the weighted prior density on the classification probabilities. Thus moving from the uniform prior to $W$ ‘downweights’ the long right tail, and thereby sharpens the posterior distribution of $\log OR$. However, upon ‘removing points’ that do not satisfy constraint B, the dependence is seen to become positive.
Thus the constraint B analysis has this curious feature of a more concentrated prior leading to a less concentrated posterior. We also note that with the weighted prior constraint A or B again leads to a more concentrated posterior than the ‘collapse then adjust’ strategy. 3.3 Finite-Sample Posteriors Until now, only the limiting behavior in the in nite sample size limit is under consideration. Under this situation, the posterior on 0 and 1 reduces to a point mass at the true values. It is instructive to see how the nite sample posterior distribution of logOR moves toward the limiting posterior distribution when the sample size increases, by simulating data under several of the previous scenarios. The prior distributions are taken as p0 Dirichlet(6; 4; 2) and p1 Dirichlet(2; 4; 6), truncated with constraint A. We simulate ve independent data sequences with equal numbers of controls and cases (ni = n, for i = 0 or 1), and then determine the posterior distribution of logOR after n = 100, n = 1000, and n = 5000 observations, using WinBUGS [17]. We generically writeDn for the observed data. Posterior densities arising under the ( ) and (+ + +) scenarios appear in Figures 3.18 and 3.19, with the limiting posterior densities also given. In both scenarios, we see the sampling variation in the posterior distri- bution diminish with sample size. However, the posterior approaches its limit much more quickly in the (+ + +) scenario then the ( ) scenario. In fact, this is readily understood, particularly if we contemplate how the posterior variance approaches its limit. Denote the posterior variance as 553.3. Finite-Sample Posteriors Figure 3.10: Limiting posterior distributions under the combination ( ). The left panel are the limiting posterior distributions of logOR for constraint A, constraint B, and collapsed case when p0 and p1 have noninformative uniform priors. 
The right panel are the limiting posterior distributions of logOR for constraint A, constraint B, and collapsed case when the prior distribution gives more weight to better classi cations. In this scenario, the true logOR is 0.1823. The lower bound of logOR is 0.0839 under both constraint A and B, and 0.0818 under collapsed case. 563.3. Finite-Sample Posteriors Figure 3.11: Limiting posterior distributions under the combination ( +). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.1823. The lower bound of logOR is 0.0953 under both constraint A and B, and 0.0934 under collapsed case. Figure 3.12: Limiting posterior distributions under the combination ( + ). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.6932. The lower bound of logOR is 0.3620 under both constraint A and B, and 0.3519 under collapsed case. 573.3. Finite-Sample Posteriors Figure 3.13: Limiting posterior distributions under the combination ( ++). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.6932. The lower bound of logOR is 0.4025 under both constraint A and B, and 0.3880 under collapsed case. Figure 3.14: Limiting posterior distributions under the combination (+ ). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.1823. The lower bound of logOR is 0.1332 under both constraint A and B, and 0.1267 under collapsed case. 583.3. Finite-Sample Posteriors Figure 3.15: Limiting posterior distributions under the combination (+ +). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.1823. The lower bound of logOR is 0.1373 under both constraint A and B, and 0.1284 under collapsed case. Figure 3.16: Limiting posterior distributions under the combination (++ ). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.6932. 
The lower bound of logOR is 0.5341 under both constraint A and B, and 0.5023 under collapsed case. 593.3. Finite-Sample Posteriors Figure 3.17: Limiting posterior distributions under the combination (+++). The layout is the same as Figure 3.10. In this scenario, the true log odds ratio is 0.6932. The lower bound of logOR is 0.5391 under both constraint A and B, and 0.4965 under collapsed case. Varfs(r)jDng, where s(r) = logitr1 logitr0 is the log odds ratio. Note that Varfg(r)jDng = E [Varfg(r)j gjDn] + Var [Efg(r)j gjDn] ; (3.6) where the rst term tends to a positive constant as n increases, but the second term is of the order n 1. In our general experience with partially identi ed models, the rst term can vary widely with the true parameter values. For instance, here it is far larger under the (+ + +) parameters settings than the ( ) settings. On the other hand, the second (order n 1) term, which is governed by the Fisher information in the model for (Dnj ); can vary much less with the parameter values. Thus getting ‘close to convergence,’ which corresponds to the second term becoming small com- pared to the rst, can arise at a much smaller n when the rst term is large, i.e., when the limiting posterior distribution is wide. Variance decomposi- tions such as (3.6) in partially identi ed models are studied at length by Gustafson (2006) [10]. The simulated data sets are also analyzed via the informal method al- luded to in Section 3.1.4. That is, ‘unlikely exposed’ and ‘maybe exposed’ subjects are merged and taken as ‘unexposed’, while the ‘likely exposed’ 603.3. Finite-Sample Posteriors Figure 3.18: Posterior distributions under the combination ( ). From the upper-left panel to the lower-right panel, the posterior distributions of logOR for the sample sizes 100, 1000, 5000, and the limiting posterior distribution are displayed. 613.3. Finite-Sample Posteriors Figure 3.19: Posterior distributions under the combination (+ + +). The layout is the same as Figure 3.18. 
subjects are taken to be ‘exposed’. Then a standard analysis, without any adjustment for misclassification, is applied to the resulting 2 × 2 data table. A Bayesian instantiation of the standard analysis is applied, whereby the exposure prevalences for controls and cases are assigned independent uniform priors, leading to independent Beta posterior distributions. The corresponding posterior distributions for logOR appear in Figures 3.20 and 3.21. In fact, these work quite well. By ignoring misclassification, markedly more peaked posterior distributions are obtained. Yet even when n = 5000, the resulting bias does not yet dominate. That is, the posterior does not yet rule out the true value of OR = 1.2 in the (- - -) setting or OR = 2.0 in the (+ + +) setting. Thus the informal strategy of choosing to treat ‘maybe exposed’ subjects as being unexposed in light of low exposure prevalence proves to be useful. Of course, with enough data one would eventually be led astray. That is, Table 3.1 shows that the posterior will tend to a point mass at OR = 1.09 in the (- - -) case and a point mass at OR = 1.72 in the (+ + +) case. Thus in concept, if not in practice, the informal scheme is unappealing.

Figure 3.20: Posterior distributions via informal analysis under the combination (- - -). From the upper-left panel to the lower panel, the posterior distributions of logOR for the sample sizes 100, 1000, and 5000 are displayed.

Figure 3.21: Posterior distributions via informal analysis under the combination (+ + +). The layout is the same as Figure 3.20.

Chapter 4 Conclusion

In Chapter 2, we investigated the impact of misclassification for polychotomous exposure. The relationship between the group mean outcomes of apparent exposure and actual exposure can be calculated using Bayes’ theorem.
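As a small numerical illustration of this Bayes' theorem calculation, the sketch below computes apparent-exposure group means from actual-exposure group means under non-differential misclassification, using E(Y | X* = j) = sum over i of E(Y | X = i) P(X = i | X* = j), where X is the actual and X* the apparent exposure level. The function name `apparent_group_means`, the three exposure levels, the prevalences, and the classification matrix are all hypothetical choices made for illustration; they are not values taken from Chapter 2.

```python
def apparent_group_means(true_means, prevalence, classification):
    """Group mean outcome for each apparent exposure level.

    true_means[i]        : E(Y | X = i), mean outcome at actual level i
    prevalence[i]        : P(X = i), distribution of the actual exposure
    classification[i][j] : P(X* = j | X = i), assumed non-differential
                           (it does not depend on the outcome Y)
    """
    n_true = len(true_means)
    n_apparent = len(classification[0])
    apparent = []
    for j in range(n_apparent):
        # Joint probabilities P(X = i, X* = j) = P(X* = j | X = i) P(X = i)
        joint = [classification[i][j] * prevalence[i] for i in range(n_true)]
        marginal = sum(joint)  # P(X* = j)
        # Bayes' theorem: P(X = i | X* = j)
        posterior = [w / marginal for w in joint]
        # Mix the actual-exposure means with the posterior weights
        apparent.append(sum(p * m for p, m in zip(posterior, true_means)))
    return apparent


if __name__ == "__main__":
    # Hypothetical inputs: increasing true means, and a classification matrix
    # that moves subjects at most one category away from their true level.
    true_means = [0.0, 1.0, 2.0]
    prevalence = [0.5, 0.3, 0.2]
    classification = [
        [0.8, 0.2, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.2, 0.8],
    ]
    for j, mean in enumerate(apparent_group_means(true_means, prevalence, classification)):
        print(f"apparent level {j}: mean outcome {mean:.4f}")
```

With these particular inputs the apparent group means stay increasing; changing the matrix or the prevalences lets one explore other settings.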
We first summarize that only under the least severe misclassification, where subjects cannot be misclassified more than one category away from the true exposure levels, will monotone group mean outcomes of actual exposure also lead to monotone group mean outcomes of apparent exposure. Whenever the classification is worse, the conclusion no longer holds. Then, we focus on the effect of the exposure distribution under the least severe misclassification. For a given exposure level, it is possible to compare the effect of misclassification at this level between uniformly distributed and non-uniformly distributed actual exposure. Moreover, as the goal of the study is to analyze the effect of misclassification for polychotomous exposure, we performed a trend test to investigate the overall effect of misclassification. By comparing the power of the trend test, we can find a counterexample in which misclassification strengthens the exposure-disease association. Therefore, we conclude that misclassification does not always attenuate the exposure-disease association for polychotomous exposure. That is, the effect of misclassification for polychotomous exposure is not always the same as for continuous exposure or binary exposure.

In Chapter 3, we have considered non-differential classification of a truly binary exposure into three categories. In this setting, inference about the exposure-disease association could be based on collapsing of categories, as implicitly advocated by Dosemeci (1996) [3]. Then the data could be analyzed without acknowledging misclassification, or perhaps binary misclassification with unknown sensitivity and specificity could be acknowledged. More formally, and as investigated here, the classification into three states can be modeled explicitly. This yields a partially identified inference problem, for which the first-order issue in the efficacy of inference is the size of the identification region.
Regardless of whether Bayesian or non-Bayesian inference is pursued, the size of the identification region summarizes how much uncertainty about target parameters would remain if an infinite amount of data could be collected. Section 3.1 illustrates how an infinite amount of data can rule out near-null values of the exposure-disease association. The choice of prior region for the classification probabilities can have a marked effect on the bivariate identification region for the control and case exposure prevalences, but little or no effect on the resulting identification interval for the odds ratio. The second-order issue, investigated in Section 3.2, is the extent to which the posterior distribution, in the large-sample limit, is flat or concentrated across the identification region. Section 3.2 shows that in many circumstances the limiting posterior distribution of logOR is indeed quite peaked. In Section 3.3, we also illustrated briefly how this limiting posterior distribution is approached with finite data sets, and drew comparisons with the informal approach of collapsing to two exposure categories and not adjusting for misclassification.

Bibliography

[1] J. Buonaccorsi. Measurement Error: Models, Methods, and Applications. Chapman and Hall, CRC Press, 2010.
[2] R. Chu. Bayesian adjustment for exposure misclassification in case-control studies. Statistics in Medicine, 29:994–1003, 2010.
[3] M. Dosemeci and P. A. Stewart. Recommendations for reducing the effects of misclassification on relative risk estimates. Occupational Hygiene, 3:169–176, 1996.
[4] M. Dosemeci, S. Wacholder, and J. H. Lubin. Does nondifferential misclassification of exposure always bias a true effect toward the null value? American Journal of Epidemiology, 19:746–748, 1990.
[5] G. Espino-Hernandez, P. Gustafson, and I. Burstyn. Bayesian adjustment for measurement error in continuous exposures in an individually matched case-control study. BMC Medical Research Methodology, 11:67, 2011.
[6] M. Garcia-Zattera, T. Mutsvari, A. Jara, D. Declerck, and E. Lesaffre. Correcting for Misclassification for a Monotone Disease Process with an Application in Dental Research. Wiley Online Library, 29:3103–3117, 2010.
[7] L. Gordis. Epidemiology. Saunders, 1996.
[8] P. Gustafson. Measurement Error and Misclassification in Statistics and Epidemiology: Impact and Bayesian Adjustments. Chapman and Hall, CRC Press, 2004.
[9] P. Gustafson. On model expansion, model contraction, identifiability, and prior information: two illustrative scenarios involving mismeasured variables (with discussion). Statistical Science, 20:111–140, 2005.
[10] P. Gustafson. Sample size implications when biases are modelled rather than ignored. Journal of the Royal Statistical Society, Series A, 169:883–902, 2006.
[11] P. Gustafson. Bayesian inference for partially identified models. International Journal of Biostatistics, 6(2), Article 17, 2010.
[12] P. Gustafson, N. D. Le, and R. Saskin. Case-control analysis with partial knowledge of exposure misclassification probabilities. Biometrics, 57:598–609, 2001.
[13] D. Kenkel, D. Lillard, and A. Mathios. Accounting for Misclassification Error in Retrospective Smoking Data. Health Economics, 13:1031–1044, 2004.
[14] H. Kraemer, J. Measelle, J. Ablow, M. Essex, W. Boyce, and D. Kupfer. A New Approach to Integrating Data From Multiple Informants in Psychiatric Assessment and Research: Mixing and Matching Contexts and Perspectives. American Journal of Psychiatry, 160(9):1566–1577, 2003.
[15] S. Krishnaiah, K. Vilas, B. Shamanna, G. Rao, R. Thomas, and D. Balasubramanian. Local sensitivity of inferences to prior marginals. IOVS, 46:58–65, 2005.
[16] B. Lindblad, N. Hakansson, H. Svensson, B. Philipson, and A. Wolk. Intensity of Smoking and Smoking Cessation in Relation to Risk of Cataract Extraction: A Prospective Study of Women. American Journal of Epidemiology, 162:73–79, 2005.
[17] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10:325–337, 2000.
[18] C. F. Manski. Partial Identification of Probability Distributions. Springer, 2003.
[19] E. Savoca. Accounting for Misclassification Bias in Binary Outcome Measures of Illness: The Case of Post-Traumatic Stress Disorder in Male Veterans. Sociological Methodology, 41:49–76, 2011.
[20] T. Shen. Formal and Informal Approaches to Adjusting for Exposure Misclassification. Thesis, Department of Statistics, UBC, 2009.
[21] C. R. Weinberg, D. M. Umbach, and S. Greenland. When will nondifferential misclassification of an exposure preserve the direction of a trend? (with discussion). American Journal of Epidemiology, 140:565–571, 1994.
Topics on the effect of non-differential exposure misclassification Wang, Dongxu 2012-12-31
Item Metadata
Title | Topics on the effect of non-differential exposure misclassification |
Creator | Wang, Dongxu |
Publisher | University of British Columbia |
Date | 2012 |
Date Issued | 2012-07-19 |
Description | There is quite an extensive literature on the deleterious impact of exposure misclassification when inferring exposure-disease associations, and on statistical methods to mitigate this impact. When the exposure is a continuous variable or a binary variable, a general mismeasurement phenomenon is attenuation in the strength of the relationship between exposure and outcome. However, few have investigated the effect of misclassification on a polychotomous variable. Using Bayesian methods, we investigate how misclassification affects the exposure-disease associations under different settings of the classification matrix. Also, we apply a trend test and assess the effect of misclassification according to the power of the test. In addition, since virtually all of the work on the impact of exposure misclassification presumes the simplest situation where both the true status and the classified status are binary, my work diverges from the norm in considering classification into three categories when the actual exposure status is simply binary. Intuitively, the classification states might be labeled as ‘unlikely exposed’, ‘maybe exposed’, and ‘likely exposed’. While this situation has been discussed informally in the literature, we provide some theory concerning what can be learned about the exposure-disease relationship, under various assumptions about the classification scheme. We focus on the challenging situation whereby no validation data is available from which to infer classification probabilities, but some prior assertions about these probabilities might be justified. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Collection | Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2012-07-19 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0072897 |
URI | http://hdl.handle.net/2429/42776 |
Degree | Master of Science - MSc |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |