Costs and Benefits of Environmental Data in Investigations of Gene-Disease Associations

by Hao Luo
B.Sc., Nanjing University, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
August 2012
© Hao Luo, 2012

Abstract

The inclusion of environmental exposure data may be beneficial, in terms of statistical power, to the investigation of a gene-disease association when one exists. However, resources invested in obtaining exposure data could instead be applied to measure disease status and genotype on more subjects. In a cohort study setting, we consider the tradeoff between measuring only disease status and genotype for a larger study sample and measuring disease status, genotype, and environmental exposure for a smaller study sample, under the ‘Mendelian randomization’ assumption that the environmental exposure is independent of genotype in the study population. We focus on the power of tests for gene-disease association, applied in situations where a gene modifies the risk of disease due to a particular exposure without a main effect of gene on disease. Our results are equally applicable to exploratory genome-wide association studies and more hypothesis-driven candidate gene investigations. We further consider the impact of misclassification of environmental exposures. We find that under a wide range of circumstances research resources should be allocated to genotyping larger groups of individuals, to achieve higher power for detecting the presence of gene-environment interactions by studying gene-disease association.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Study Designs
  2.1 (Y, X, G) Design
  2.2 (Y, G) Design
  2.3 Mixed Design
3 Cost Effectiveness
  3.1 Performance of the Mixed Design
  3.2 (Y, X, G) Design vs. (Y, G) Design
4 Misclassification
  4.1 Misclassification
  4.2 (Y, X∗, G) Design vs. (Y, G) Design
    4.2.1 Non-differential Misclassification
    4.2.2 Differential Misclassification
  4.3 Extension of Comparison
5 Other Issues
  5.1 Significance Level
  5.2 Presence of Main Gene Effect
  5.3 Case-Control & Case-Only
  5.4 3-Category Genotype
6 Conclusion & Discussion
Bibliography

List of Tables

Table 3.1 Parameter settings of the factorial experiment.
Table 3.2 Situations where collecting X data can be harmful.
Table 5.1 Notations for the data of a case-control study.

List of Figures

Figure 3.1 Effect of cost ratio on relative performance of (Y, X, G) design.
Figure 3.2 Power of the mixed design as proportion of (Y, X, G) data varies.
Figure 3.3 Break-even cost as a function of desired power.
Figure 3.4 Joint effect of (πX, β0) on break-even cost.
Figure 3.5 Situations where break-even cost is below 1.
Figure 4.1 Effect of non-differential misclassification: rare exposure.
Figure 4.2 Effect of non-differential misclassification: common exposure.
Figure 4.3 Effect of differential misclassification.
Figure 4.4 Comparison among three data types.
Figure 5.1 Results with a liberal significance level, 0.05.
Figure 5.2 Results with a stringent significance level, 5 × 10^−8.
Figure 5.3 Break-even cost with the presence of main gene effect.

Acknowledgments

First, I want to thank my supervisor, Professor Paul Gustafson, who helped me greatly with the completion of my thesis. It was my great honor to work with him. I thank Professor Igor Burstyn, who provided support and guidance as an expert in epidemiology. I would also like to thank Professor Gabriela Cohen Freue for agreeing to be my second reader. I express my gratitude to Professors John Petkau, Ruben Zamar, Jennifer Bryan, Harry Joe, William Welch, Lang Wu and Eugenia Yu for their constant support and excellent teaching. I am also grateful to Peggy Ng, Elaine Salameh and Andrea Sollberger for their hard work and kind help. Thanks go to everyone in our department for making it such a good place. Finally, I owe special thanks to my parents for their support and understanding of my study.
Chapter 1 Introduction

In recent decades, advances in genotyping technology and reductions in associated cost have made it feasible to conduct large-scale genome-wide association studies to locate disease susceptibility loci among thousands or millions of screened markers. However, such studies usually ignore the joint effects of genetic and environmental exposures, which may result in a loss of statistical power, as it is widely accepted that complex diseases are likely to be caused by the interplay of both genetic and environmental factors. Thus, it may be beneficial to collect concurrent environmental exposure data and conduct a study that takes the gene-environment interaction into account [Kraft et al., 2007, Williamson et al., 2010].

It can be challenging to measure environmental exposure well. In the context of a binary exposure, however, [Kraft et al., 2007] found that the benefit of having environmental data is maintained under misclassification with both sensitivity and specificity of 80%, levels that are commonly seen (e.g., [England et al., 2007, Pickett et al., 2009]). In occupational and environmental epidemiology, exposure misclassification can occur at a much higher rate, with sensitivity often around 50% or less (e.g., [Burstyn et al., 2009, Teschke et al., 2002]). Therefore the misclassification rates studied in [Kraft et al., 2007] are neither extreme nor typical of epidemiology in general.

Furthermore, it is typically costly to obtain exposure data, and this cost could instead be applied to measure disease status and genotype on more subjects. Thus, from a fixed-resource perspective, the additional cost of exposure assessment may be so high that measuring only disease status and genotype for a larger study sample may yield more power than measuring disease status, genotype, and environmental exposure for a smaller study sample.
It has been shown before that there is a balance between costly exposure estimates that are perfect and cheaper error-prone methods in achieving optimal study power on a fixed budget [Armstrong, 1996]. It has also been argued that to detect gene-environment interaction when exposure measures are assessed with error, it is beneficial to fit a marginal (gene-only) model to the data, when it can be assumed that gene is associated with outcome only in the presence of exposure, and the acquisition of environmental exposure is independent of genotype. The latter assumption is usually referred to as the ‘Mendelian randomization’ assumption [Smith, 2004], permitting the detection and quantification of environmental effects in the presence of latent confounding and heterogeneity in genetic susceptibility to environmental exposure. We compare the power of two cohort study designs aimed at detecting conditional dependence between disease and a genetic locus given environmental exposure, with and without assessment of this exposure, in terms of cost-effectiveness, i.e., which one achieves higher power on a fixed-budget basis. We identify the situations in which the resources should be allocated to enlarging the sample size of the study instead of assessing environmental exposure. We focus on scenarios where Mendelian randomization can be assumed; a test for marginal gene-disease association is intuitively sensible under this assumption. For the joint test which incorporates environmental data, a null of no gene effect is tested against an alternative that there is a main effect and/or an interaction effect of gene. However, we study the power of this test in the setting that there is no main gene effect, i.e., the gene effect is only evident in the presence of exposure. This is referred to as a ‘qualitative interaction’ [Williamson et al., 2010], and was also stressed in earlier work [Burstyn et al., 2009]. 
We also discuss under what conditions our findings apply to case-control studies, and contrast our approach with the case-only study design. Our analysis applies equally to candidate-gene investigations, in which the genotyped genes are selected for their involvement in the toxicity of a specific exposure, and to genome-wide association studies.

Chapter 2 Study Designs

Let Y be the binary disease status, X the environmental exposure, and G one of possibly very many ascertained genetic markers. We assume G is binary, as would result if we distinguished only between the homozygous dominant genotype and all other genotypes. We assume X is binary for the sake of simplicity and ease of illustration, with the view that X could in fact be a dichotomized version of a continuous exposure variable. The assumption of a binary X is commonly made in investigating methodologies for gene-environment interaction studies (see, for instance, [Kraft et al., 2007, Li and Conti, 2009, Umbach and Weinberg, 1997, Williamson et al., 2010]), since even if the underlying exposure is continuous, interpretation of the effect above a certain threshold is often desirable in developing policy for interventions.

2.1 (Y, X, G) Design

When (Y, G, X) are all observed, a saturated logistic regression model, allowing a gene-environment interaction, is commonly fit:

\[ \operatorname{logit}\Pr(Y = 1 \mid X, G) = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG. \]

Within this model, the null and alternative hypotheses are:

\[ H_0: \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad \text{vs} \quad H_1: \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix} \neq \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \]

This null hypothesis states that the gene is not associated with disease, given the environmental exposure status. We assume that being exposed is independent of having a specific gene in the study population. Further, we denote by πG = Pr(G = 1) the genotype prevalence, and by πX = Pr(X = 1) the environmental exposure prevalence.
Then the log-likelihood implied by the full model is:

\[
\begin{aligned}
\ell ={}& Y \log\frac{\exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]}{1 + \exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]}
 + (1 - Y)\log\frac{1}{1 + \exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]} \\
&+ G\log\pi_G + (1 - G)\log(1 - \pi_G) + X\log\pi_X + (1 - X)\log(1 - \pi_X).
\end{aligned}
\]

By standard large-sample theory, the asymptotic distribution of \(\hat\beta\) is multivariate normal, with mean equal to the true value and variance equal to the inverse of the expected Fisher information matrix. The expected Fisher information, I, is the negative of the expectation of the second derivatives of the log-likelihood. As an example, we show the algebra for the [β3, β3] entry, I44:

• The first derivative

\[ \frac{\partial \ell}{\partial \beta_3} = YXG - XG \times \frac{\exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]}{1 + \exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]}. \]

• The second derivative

\[ \frac{\partial^2 \ell}{\partial \beta_3^2} = -X^2 G^2 \times \frac{\exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]}{(1 + \exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG])^2}. \]

• The negative of the expectation

\[
\begin{aligned}
I_{44} &= -E\left[\frac{\partial^2 \ell}{\partial \beta_3^2}\right]
 = E\left[X^2 G^2 \times \frac{\exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG]}{(1 + \exp[\beta_0 + \beta_1 X + \beta_2 G + \beta_3 XG])^2}\right] \\
&= \Pr(G = 1, X = 1) \times \frac{\exp[\beta_0 + \beta_1 + \beta_2 + \beta_3]}{(1 + \exp[\beta_0 + \beta_1 + \beta_2 + \beta_3])^2}
 = \pi_G \pi_X \frac{\exp[\beta_0 + \beta_1 + \beta_2 + \beta_3]}{(1 + \exp[\beta_0 + \beta_1 + \beta_2 + \beta_3])^2}.
\end{aligned}
\]

Similarly, we can calculate the remaining elements of the information matrix. After some algebra, it turns out that

\[
I = \begin{pmatrix}
B_0 + B_1 + B_2 + B_3 & B_1 + B_3 & B_2 + B_3 & B_3 \\
B_1 + B_3 & B_1 + B_3 & B_3 & B_3 \\
B_2 + B_3 & B_3 & B_2 + B_3 & B_3 \\
B_3 & B_3 & B_3 & B_3
\end{pmatrix},
\]

where

\[
\begin{aligned}
B_0 &= (1 - \pi_G)(1 - \pi_X)\exp(\beta_0)(1 + \exp(\beta_0))^{-2}, \\
B_1 &= (1 - \pi_G)\pi_X\exp(\beta_0 + \beta_1)(1 + \exp(\beta_0 + \beta_1))^{-2}, \\
B_2 &= \pi_G(1 - \pi_X)\exp(\beta_0 + \beta_2)(1 + \exp(\beta_0 + \beta_2))^{-2}, \\
B_3 &= \pi_G\pi_X\exp(\beta_0 + \beta_1 + \beta_2 + \beta_3)(1 + \exp(\beta_0 + \beta_1 + \beta_2 + \beta_3))^{-2}.
\end{aligned}
\]

The power calculation is based on the Wald test. Under the alternative hypothesis, the Wald statistic follows a non-central χ² distribution, χ²_2(λ), with 2 degrees of freedom and non-centrality parameter:

\[
\lambda = n \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix}^{T} \left( I^{-1}_{(3:4,\,3:4)} \right)^{-1} \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix}
 = n \left[ \beta_2^2\, \frac{B_0^{-1} + B_1^{-1} + B_2^{-1} + B_3^{-1}}{B_0^{-1} + B_2^{-1}} + 2\beta_2\beta_3 + \beta_3^2 \right] \left( B_1^{-1} + B_3^{-1} \right)^{-1}.
\]

Then, the power of this joint test is calculated as:

\[ \text{Power} = \Pr\left(\chi^2_2(\lambda) > \chi^2_{1-\alpha,2}\right), \]

where \(\chi^2_{1-\alpha,2}\) is the (1 − α) quantile of the χ² distribution with 2 degrees of freedom.

2.2 (Y, G) Design

On the other hand, without X data the (Y, G) association can be represented by a saturated logistic regression, or a ‘reduced form’ of the above model:

\[ \operatorname{logit}\Pr(Y = 1 \mid G) = \alpha_0 + \alpha_1 G. \]

In this reduced model, the parameter of interest is the marginal gene-disease log odds ratio, α1, which is equal to 0 if there is no gene-disease association. Correspondingly, the null and alternative hypotheses for this reduced model are:

\[ H_0: \alpha_1 = 0 \quad \text{vs} \quad H_1: \alpha_1 \neq 0. \]

Further, we have

\[ \Pr(Y = 1 \mid G) = \sum_X \Pr(Y = 1 \mid X, G)\Pr(X \mid G). \]

Thus, the parameter α1 can be obtained by substituting the probabilities from the full model.

Under the assumption that being exposed is independent of having a specific gene in the study population, (β2, β3) = (0, 0) implies that

\[
\begin{aligned}
\Pr(Y = 1 \mid G = 1) &= \Pr(Y = 1 \mid X = 1, G = 1)\Pr(X = 1) + \Pr(Y = 1 \mid X = 0, G = 1)\Pr(X = 0) \\
&= \pi_X \frac{\exp[\beta_0 + \beta_1]}{1 + \exp[\beta_0 + \beta_1]} + (1 - \pi_X)\frac{\exp[\beta_0]}{1 + \exp[\beta_0]} \\
&= \Pr(Y = 1 \mid X = 1, G = 0)\Pr(X = 1) + \Pr(Y = 1 \mid X = 0, G = 0)\Pr(X = 0) \\
&= \Pr(Y = 1 \mid G = 0).
\end{aligned}
\]

This further implies that the gene, marginally, does not affect the risk of disease and hence α1 = 0. On the other hand, α1 = 0 implies that

\[
\frac{\exp(\beta_0 + \beta_1)}{1 + \exp(\beta_0 + \beta_1)}\pi_X + \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}(1 - \pi_X)
= \frac{\exp(\beta_0 + \beta_1 + \beta_2 + \beta_3)}{1 + \exp(\beta_0 + \beta_1 + \beta_2 + \beta_3)}\pi_X + \frac{\exp(\beta_0 + \beta_2)}{1 + \exp(\beta_0 + \beta_2)}(1 - \pi_X).
\]

Hence, α1 = 0 corresponds to a single curve in the (β2, β3) parameter space that goes through the origin and depends upon (πX, β0, β1). Further, we notice that the above equation holds only when (β2, β3) = (0, 0) or β2(β2 + β3) < 0. Thus, the null hypothesis of the marginal model is nearly equivalent to the null hypothesis of the full model. The (Y, G) design can therefore be used as an alternative to test the null hypothesis of conditional independence between Y and G given X.
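As a numerical illustration of the argument above, the marginal parameters (α0, α1) can be computed from the full-model parameters by averaging over X. The following is a minimal sketch, assuming gene-environment independence; the function names `expit`, `logit` and `marginal_params` and the parameter values in the calls are illustrative choices, not taken from the thesis.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))


def marginal_params(pi_x, b0, b1, b2, b3):
    """Return (alpha0, alpha1) of logit Pr(Y=1|G) = alpha0 + alpha1*G.

    Assumes X is independent of G, so Pr(X=1|G) = pi_x for both genotypes.
    """
    p = []
    for g in (0, 1):
        # Pr(Y=1|G=g) = sum over X of Pr(Y=1|X,G=g) * Pr(X)
        p_exposed = expit(b0 + b1 + (b2 + b3) * g)
        p_unexposed = expit(b0 + b2 * g)
        p.append(pi_x * p_exposed + (1.0 - pi_x) * p_unexposed)
    alpha0 = logit(p[0])
    alpha1 = logit(p[1]) - alpha0
    return alpha0, alpha1


# No gene effect in the full model gives no marginal gene effect.
print(marginal_params(0.4, logit(0.05), math.log(1.5), 0.0, 0.0))
# A pure interaction (beta2 = 0, beta3 > 0) induces a positive alpha1.
print(marginal_params(0.4, logit(0.05), math.log(1.5), 0.0, math.log(1.5)))
```

The first call returns α1 = 0, matching the implication shown above; the second shows how an interaction with no main gene effect still produces a non-zero marginal association that the (Y, G) design can detect.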
Without the assumption of gene-environment independence, a non-zero value of α1 can arise when (β2, β3) = (0, 0). In such instances, evidence of a non-null (Y, G) association cannot be taken as evidence that Y and G are conditionally dependent given X.

The log-likelihood implied by the reduced model is:

\[
\ell = Y \log\frac{\exp[\alpha_0 + \alpha_1 G]}{1 + \exp[\alpha_0 + \alpha_1 G]} + (1 - Y)\log\frac{1}{1 + \exp[\alpha_0 + \alpha_1 G]} + G\log\pi_G + (1 - G)\log(1 - \pi_G).
\]

Similar to the calculation shown in Section 2.1, after some algebra, we have the expected Fisher information matrix for this marginal model as:

\[
I = \begin{pmatrix} A_0 + A_1 & A_1 \\ A_1 & A_1 \end{pmatrix},
\]

where

\[
\begin{aligned}
A_0 &= (1 - \pi_G)\exp(\alpha_0)(1 + \exp(\alpha_0))^{-2}, \\
A_1 &= \pi_G\exp(\alpha_0 + \alpha_1)(1 + \exp(\alpha_0 + \alpha_1))^{-2}.
\end{aligned}
\]

Therefore, the power of the marginal test is calculated as:

\[ \text{Power} = \Pr\left(\chi^2_1(\lambda) > \chi^2_{1-\alpha,1}\right), \]

where \(\chi^2_1(\lambda)\) is a non-central χ² distribution with 1 degree of freedom and non-centrality parameter

\[ \lambda = n\alpha_1^2\left(A_0^{-1} + A_1^{-1}\right)^{-1}. \]

2.3 Mixed Design

There is another possible sampling scheme, involving (Y, X, G) measurements for some subjects and (Y, G) measurements for others. In this case, we can still fit the full model, and the corresponding null and alternative hypotheses are:

\[ H_0: \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad \text{vs} \quad H_1: \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix} \neq \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \]

Suppose our sample consists of (Y, X, G) measurements on N1 subjects and (Y, G) measurements on N2 subjects. Then the expected Fisher information matrix is:

\[ I = \frac{N_1}{N_1 + N_2} I_1 + \frac{N_2}{N_1 + N_2} I_2, \]

where I1 and I2 correspond to (Y, X, G) data and (Y, G) data respectively. For (Y, X, G) data, we have already derived the expected Fisher information matrix under the full model. Therefore, I1 takes the form given in Section 2.1:

\[
I_1 = \begin{pmatrix}
B_0 + B_1 + B_2 + B_3 & B_1 + B_3 & B_2 + B_3 & B_3 \\
B_1 + B_3 & B_1 + B_3 & B_3 & B_3 \\
B_2 + B_3 & B_3 & B_2 + B_3 & B_3 \\
B_3 & B_3 & B_3 & B_3
\end{pmatrix}.
\]

On the other hand, applying the full model to (Y, G) data implies that the marginal gene-disease association should be expressed in terms of β through:

\[
\begin{aligned}
\Pr(Y = 1 \mid G) &= \Pr(Y = 1 \mid X = 1, G)\Pr(X = 1) + \Pr(Y = 1 \mid X = 0, G)\Pr(X = 0) \\
&= \frac{\exp[\beta_0 + \beta_1 + (\beta_2 + \beta_3)G]}{1 + \exp[\beta_0 + \beta_1 + (\beta_2 + \beta_3)G]} \times \pi_X + \frac{\exp[\beta_0 + \beta_2 G]}{1 + \exp[\beta_0 + \beta_2 G]} \times (1 - \pi_X).
\end{aligned}
\]

The log-likelihood for (Y, G) data under the full model is

\[
\begin{aligned}
\ell ={}& Y \log\left\{ \frac{\exp[\beta_0 + \beta_1 + (\beta_2 + \beta_3)G]}{1 + \exp[\beta_0 + \beta_1 + (\beta_2 + \beta_3)G]}\pi_X + \frac{\exp[\beta_0 + \beta_2 G]}{1 + \exp[\beta_0 + \beta_2 G]}(1 - \pi_X) \right\} \\
&+ (1 - Y)\log\left\{ \frac{1}{1 + \exp[\beta_0 + \beta_1 + (\beta_2 + \beta_3)G]}\pi_X + \frac{1}{1 + \exp[\beta_0 + \beta_2 G]}(1 - \pi_X) \right\} \\
&+ G\log\pi_G + (1 - G)\log(1 - \pi_G).
\end{aligned}
\]

Then, I2 can be derived based on this log-likelihood. After some algebra, it turns out that:

\[
I_2 = \begin{pmatrix}
\frac{(B_0 + B_1)^2}{A_0} + \frac{(B_2 + B_3)^2}{A_1} & \frac{B_1(B_0 + B_1)}{A_0} + \frac{B_3(B_2 + B_3)}{A_1} & \frac{(B_2 + B_3)^2}{A_1} & \frac{B_3(B_2 + B_3)}{A_1} \\
\frac{B_1(B_0 + B_1)}{A_0} + \frac{B_3(B_2 + B_3)}{A_1} & \frac{B_1^2}{A_0} + \frac{B_3^2}{A_1} & \frac{B_3(B_2 + B_3)}{A_1} & \frac{B_3^2}{A_1} \\
\frac{(B_2 + B_3)^2}{A_1} & \frac{B_3(B_2 + B_3)}{A_1} & \frac{(B_2 + B_3)^2}{A_1} & \frac{B_3(B_2 + B_3)}{A_1} \\
\frac{B_3(B_2 + B_3)}{A_1} & \frac{B_3^2}{A_1} & \frac{B_3(B_2 + B_3)}{A_1} & \frac{B_3^2}{A_1}
\end{pmatrix},
\]

where A0, A1, and B0 to B3 are given in the previous two sections. Having determined the Fisher information matrix, the power calculation is again carried out by:

\[ \text{Power} = \Pr\left(\chi^2_2(\lambda) > \chi^2_{1-\alpha,2}\right), \]

where the non-centrality parameter is

\[ \lambda = n \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix}^{T} \left( I^{-1}_{(3:4,\,3:4)} \right)^{-1} \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix}. \]

Chapter 3 Cost Effectiveness

Many studies have compared the (Y, G) design with the (Y, X, G) design in terms of statistical power. [Kraft et al., 2007] found that collecting concurrent environmental data within large cohort studies could be beneficial for investigating gene-disease associations in situations where the gene effect is only evident in the presence of exposure. In practice, however, this kind of comparison may be ‘unfair’, since it ignores the fact that a (Y, X, G) design typically costs more money or resources than a (Y, G) design with the same sample size.
It is clear that resources invested in obtaining exposure data could instead be applied to measure disease status and genotype on more subjects. Therefore, it might be more appropriate to compare the power of different cohort study designs in terms of cost-effectiveness, i.e., which one achieves higher power on a fixed-budget basis. We presume the cost of measuring Y, X and all the genetic markers on a subject to be c times the cost of measuring Y and the markers alone, and refer to c > 1 as the cost ratio. For the purpose of illustration, we focus on only one genetic marker, denoted by G as mentioned in Chapter 2.

A pilot example is shown in Figure 3.1 to display the power of the (Y, X, G) design as the cost ratio varies. We assume that (πG, πX, β0, β1, β3) = (0.19, 0.4, logit 0.05, log 1.5, log 1.5), and that the study budget can afford 80% power for the (Y, G) design. From Figure 3.1, we can see that the power is quite sensitive to the cost ratio. In particular, collecting (Y, G, X) data yields power of only around 50% when the cost ratio is 2.5.

[Figure 3.1: Effect of cost ratio on relative performance of (Y, X, G) design. Horizontal axis: cost ratio; vertical axis: power of (Y, G, X) data.]

To help our comparison, we introduce the break-even cost c∗, which is the value of c for which the same total cost spent on either a smaller (Y, X, G) sample or a larger (by a factor of c∗) (Y, G) sample will yield the same power to detect a gene effect. If the actual cost ratio c exceeds c∗, then collecting only (Y, G) data and fitting the reduced model is a better use of resources than collecting (Y, X, G) data and fitting the full model.

3.1 Performance of the Mixed Design

We begin by investigating the performance of the mixed design. Suppose our sample consists of (Y, X, G) measurements on a proportion w (0 < w < 1) of the subjects and (Y, G) measurements only on the remaining (1 − w) × 100% of subjects.
Then the budget used to obtain (Y, G) data for m subjects, or (Y, X, G) data for m/c subjects, can also be used to obtain this kind of mixed data type for m/(cw + 1 − w) subjects. The power of this mixed design can then be calculated following the discussion in Section 2.3. Under the same setting as the pilot example, we examine the power for different values of w, as shown in Figure 3.2, with the cost ratio being 1.5 (top panel) and 2 (bottom panel). We note that the power calculation for the mixed data (w ∈ (0, 1)) is still valid when we have (Y, X, G) data only (w = 1), but is not applicable when we have (Y, G) data only (w = 0), since applying the full model to (Y, G) data would lead to a non-identifiability problem. Thus, the power as a function of w is not continuous at w = 0 (as evident in Figure 3.2).

Also, the performance of the mixed design depends on the value of the cost ratio. When the cost ratio is small, (Y, X, G) data are preferred, so the more weight on (Y, X, G), the higher the power. In contrast, when the cost ratio is large, (Y, G) data are preferred and increasing the proportion of (Y, X, G) decreases the power. But the model becomes nearly non-identifiable if too few (Y, X, G) observations are collected. Therefore, in this case, the power of the mixed design is maximized at some point between 0 and 1. However, (Y, G) data alone can achieve even higher power.

To make the comparison more tractable, we consider the situation where the cost ratio is the break-even cost. Then the power at the two endpoints, w = 0 and w = 1, is the same, so our problem reduces to maximizing the power over w ∈ (0, 1]. We conduct a factorial experiment, as shown in Table 3.1, to investigate the optimal w. In all settings, the shape of the power function is similar to that in the top panel of Figure 3.2, and the maximum power is always reached when w = 1.
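The mixed-design calculation discussed above can be sketched numerically. The sketch below builds I1 and I2 from Section 2.3, combines them with weight w, and obtains the non-centrality parameter through the Schur complement, which equals the inverse of the (3:4, 3:4) block of I^{-1}. All function names (`weights`, `mixed_information`, `noncentrality`, `mixed_sample_size`) are hypothetical, and the sample size of 1000 in the usage lines is an arbitrary illustrative value.

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def weights(pi_g, pi_x, b0, b1, b2, b3):
    """Return ([B0, B1, B2, B3], [A0, A1]) as defined in Sections 2.1 and 2.2."""
    def v(p):
        return p * (1.0 - p)          # equals exp(t) / (1 + exp(t))^2 at p = expit(t)
    q = [expit(b0), expit(b0 + b1), expit(b0 + b2), expit(b0 + b1 + b2 + b3)]
    B = [(1 - pi_g) * (1 - pi_x) * v(q[0]), (1 - pi_g) * pi_x * v(q[1]),
         pi_g * (1 - pi_x) * v(q[2]), pi_g * pi_x * v(q[3])]
    p0 = pi_x * q[1] + (1 - pi_x) * q[0]   # Pr(Y=1 | G=0)
    p1 = pi_x * q[3] + (1 - pi_x) * q[2]   # Pr(Y=1 | G=1)
    A = [(1 - pi_g) * v(p0), pi_g * v(p1)]
    return B, A


def mixed_information(w, B, A):
    """Per-subject information w*I1 + (1-w)*I2 for the full model."""
    B0, B1, B2, B3 = B
    I1 = [[B0 + B1 + B2 + B3, B1 + B3, B2 + B3, B3],
          [B1 + B3, B1 + B3, B3, B3],
          [B2 + B3, B3, B2 + B3, B3],
          [B3, B3, B3, B3]]
    v0 = [B0 + B1, B1, 0.0, 0.0]           # I2 = v0 v0'/A0 + v1 v1'/A1
    v1 = [B2 + B3, B3, B2 + B3, B3]
    return [[w * I1[i][j]
             + (1 - w) * (v0[i] * v0[j] / A[0] + v1[i] * v1[j] / A[1])
             for j in range(4)] for i in range(4)]


def noncentrality(n, w, pi_g, pi_x, b0, b1, b2, b3):
    """Non-centrality of the joint Wald test for n subjects, fraction w with X."""
    B, A = weights(pi_g, pi_x, b0, b1, b2, b3)
    I = mixed_information(w, B, A)
    det = I[0][0] * I[1][1] - I[0][1] * I[1][0]
    inv11 = [[I[1][1] / det, -I[0][1] / det], [-I[1][0] / det, I[0][0] / det]]
    # Schur complement S = I22 - I21 * inv(I11) * I12
    S = [[I[2 + i][2 + j] - sum(I[2 + i][k] * inv11[k][l] * I[l][2 + j]
                                for k in range(2) for l in range(2))
          for j in range(2)] for i in range(2)]
    beta = (b2, b3)
    return n * sum(beta[i] * S[i][j] * beta[j] for i in range(2) for j in range(2))


def mixed_sample_size(m, c, w):
    """Subjects affordable in the mixed design on the budget of m (Y, G) subjects."""
    return m / (c * w + 1 - w)


pilot = (0.19, 0.4, math.log(0.05 / 0.95), math.log(1.5), 0.0, math.log(1.5))
for w in (0.25, 0.5, 1.0):
    n = mixed_sample_size(1000, 1.5, w)
    print(w, n, noncentrality(n, w, *pilot))
```

Power then follows from the non-central χ² survival function with 2 degrees of freedom evaluated at the critical value. The sketch is only meaningful for w in (0, 1], in line with the non-identifiability at w = 0 noted above.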
We therefore conclude that this mixed data type is not a good choice compared with the other two sampling schemes, each of which uses a single data type. This also matches our intuition that we should invest all resources in the most cost-effective data type.

[Figure 3.2: Power of the mixed design as proportion of (Y, X, G) data varies. Two panels; horizontal axis: proportion of (Y, X, G); vertical axis: power.]

3.2 (Y, X, G) Design vs. (Y, G) Design

We have shown that the ‘mixed type’ sampling scheme is always less cost-effective than the better of the two single-type schemes, so we can focus on the comparison between (Y, X, G) data alone and (Y, G) data alone. To match the realistic situation in which the gene alone confers no additional disease risk in the absence of exposure, we consider scenarios where β2 = 0 and β3 ≠ 0, which is termed a ‘qualitative’ gene-environment interaction by [Williamson et al., 2010].

Let F_q(·, k) denote the cumulative distribution function of the non-central χ² distribution with q degrees of freedom and non-centrality parameter k. Let r_i denote the solution in x of the equation

\[ 1 - F_i\left(F_i^{-1}(1 - s, 0),\, x\right) = \text{Power}, \quad i = 1, 2, \]

where s is the pre-specified significance level. Based on the power calculation described in Chapter 2, the sample sizes required for the two study designs to achieve a certain power are:

\[
\begin{aligned}
N_{(Y,G)} &= \frac{r_1}{\alpha_1^2} \times \left(\frac{1}{A_0} + \frac{1}{A_1}\right), \\
N_{(Y,X,G)} &= \frac{r_2}{\beta_3^2 + 2\beta_2\beta_3 + \beta_2^2\, Q/P} \times (Q - P),
\end{aligned}
\]

where P = 1/B0 + 1/B2 and Q = P + 1/B1 + 1/B3. The break-even cost is just the ratio of the sample size of the (Y, G) design to the sample size of the (Y, X, G) design.
In particular, for the scenarios considered with a qualitative interaction, the break-even cost takes the following form:

\[ c^* = \frac{r_1}{r_2} \times \frac{\beta_3^2}{\alpha_1^2} \times \frac{\frac{1}{A_0} + \frac{1}{A_1}}{\frac{1}{B_1} + \frac{1}{B_3}}. \]

As a technical point, the first term r1/r2 is a function only of the significance level and the desired power. Its value increases as the desired power increases, but decreases as the significance level increases. Once the type I error and the desired power are specified, it becomes a constant. Given that large numbers of markers may be screened, so that some form of multiple comparison adjustment or false discovery rate control will be required, we report power when the significance level is 10^−4.

We start with the same parameter setting as given in the pilot example. Figure 3.3 shows the break-even cost c∗ as a function of the desired power. We can see that the break-even cost increases modestly with the desired power. Its shape also reflects the fact that c∗ can vary with the desired power, since we are comparing the power of a two-degree-of-freedom test with that of a one-degree-of-freedom test. In particular, we can read from Figure 3.3 that the break-even cost is around 1.7 when 80% power is desired. This implies that (Y, G) data are more cost-effective than (Y, X, G) data when the per-subject cost of obtaining X is more than 70% of the cost of obtaining (Y, G).

Factorial Experiment

Next, we investigate the break-even cost under different parameter settings through a factorial experiment. All 440,000 possible combinations of the parameter values listed in Table 3.1 have been considered.

Parameter   From   Step   To
πG          0.05   0.05   0.50
πX          0.05   0.05   0.50
β0          -5     0.5    0
β1          0.1    0.1    2
β3          0.1    0.1    2

Table 3.1: Parameter settings of the factorial experiment.
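The break-even cost formula above can be evaluated with the standard library alone. The following is a minimal sketch for the qualitative-interaction case (β2 = 0). It relies on two standard facts: a non-central χ² with 1 degree of freedom is the square of a shifted standard normal, and a non-central χ² with 2 degrees of freedom is a Poisson mixture of central χ² distributions with even degrees of freedom. The function names, the bisection bracket of 200 and the series length of 600 are assumptions made for illustration, not part of the thesis.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def power_1df(lam, alpha):
    """Pr(chi2_1(lam) > critical value), using chi2_1(lam) = (Z + sqrt(lam))^2."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    s = math.sqrt(lam)
    return _STD_NORMAL.cdf(s - z) + _STD_NORMAL.cdf(-s - z)


def power_2df(lam, alpha, terms=600):
    """Pr(chi2_2(lam) > critical value) via a Poisson mixture of central chi2."""
    half_c = -math.log(alpha)      # half of the chi2_2 critical value -2*log(alpha)
    weight = math.exp(-lam / 2.0)  # Poisson(lam/2) probability of j = 0
    term = math.exp(-half_c)       # e^{-c/2} (c/2)^i / i! at i = 0
    tail = term                    # Pr(chi2_{2(j+1)} > c) at j = 0
    total = 0.0
    for j in range(terms):
        total += weight * tail
        weight *= (lam / 2.0) / (j + 1)
        term *= half_c / (j + 1)
        tail += term
    return total


def required_noncentrality(power_fn, target, alpha, upper=200.0):
    """Bisection for the non-centrality r giving the target power."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if power_fn(mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def break_even_cost(pi_g, pi_x, b0, b1, b3, alpha=1e-4, power=0.8):
    """c* = N_(Y,G) / N_(Y,X,G) under a qualitative interaction (beta2 = 0)."""
    q10, q11 = expit(b0 + b1), expit(b0 + b1 + b3)
    B1 = (1 - pi_g) * pi_x * q10 * (1 - q10)
    B3 = pi_g * pi_x * q11 * (1 - q11)
    p0 = pi_x * q10 + (1 - pi_x) * expit(b0)   # Pr(Y=1 | G=0)
    p1 = pi_x * q11 + (1 - pi_x) * expit(b0)   # Pr(Y=1 | G=1) with beta2 = 0
    alpha1 = logit(p1) - logit(p0)
    A0 = (1 - pi_g) * p0 * (1 - p0)
    A1 = pi_g * p1 * (1 - p1)
    r1 = required_noncentrality(power_1df, power, alpha)
    r2 = required_noncentrality(power_2df, power, alpha)
    n_yg = r1 / alpha1 ** 2 * (1 / A0 + 1 / A1)
    n_yxg = r2 / b3 ** 2 * (1 / B1 + 1 / B3)
    return n_yg / n_yxg


# Pilot-example settings; the text reports a break-even cost of around 1.7 here.
print(break_even_cost(0.19, 0.4, logit(0.05), math.log(1.5), math.log(1.5)))
```

Looping `break_even_cost` over a grid such as the one in Table 3.1 would give a rough reproduction of the factorial experiment, subject to the approximations noted above.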
We find that among the 5 parameters, prevalence of exposure (πX) and background rate of the health outcome (β0) have consistent effects, while the impacts of the other three parameters involve interactions with the other parameters. Figure 3.4 shows the joint effect of (πX, β0) and the corresponding contour plot, with all other parameter values set as in Figure 3.3. We can see that increasing the value of β0 (i.e., studying an outcome that is more prevalent in the population) will increase the break-even cost, while increasing the value of πX (i.e., studying a population with higher prevalence of environmental exposure) has the opposite effect.

[Figure 3.3: Break-even cost as a function of desired power. Horizontal axis: power desired; vertical axis: break-even cost.]

[Figure 3.4: Joint effect of (πX, β0) on break-even cost. Surface plot and contour plot; axes: πX, β0, break-even cost.]

In particular, we notice that when the exposure is rare, the break-even cost becomes extremely large, which is a strong signal of the necessity of assessing environmental exposure (collecting data on X) as opposed to relying on the marginal model to detect a qualitative interaction. In our current comparison setting, the gene alone confers no additional risk, so the difference in disease risk is only evident among exposed subjects. Therefore, when the exposure is very rare, the sub-groups based on genotype are dominated by unexposed subjects, and hence exhibit very little difference in disease risk between the two groups. That is why we need X data to identify the exposed subjects in the two groups and focus on understanding differences in risk mainly in those two sub-groups: susceptible exposed and resistant exposed. On the other hand, when the exposure is common, (Y, G) data are more likely to be preferred.
Of all the settings with πX > 0.3, about 77% have a break-even cost below 2, and 89% have a break-even cost below 2.5. So we can generalize that for a common exposure, when the cost ratio is greater than 2, collection of (Y, G) data is typically more efficient. This may represent the realities of studies of highly exposed groups such as industry-based cohorts, or of contaminants that are widespread at ‘toxic’ levels in the general environment.

Collecting X Data can be Harmful

Interestingly, if we focus on a very prevalent environmental exposure, say with πX > 0.7, we find that the break-even cost can even be below 1, which means that using X data would decrease power even if they could be obtained for free! (See Figure 3.5.) We have chosen several sets of parameter values for which the break-even cost is below one, and conducted simulations to verify these theoretical results. Although the empirical results may not agree exactly with the theoretical results, since the power calculation is only asymptotically valid, there do exist situations where the (Y, G) study outperforms the (Y, X, G) study even with the same sample size, as reported in Table 3.2.

[Figure 3.5: Situations where break-even cost is below 1. Axes: πX (0.70 to 1.00), β0 (−5 to 0), break-even cost.]

Thus, when the exposure is very common, even perfectly measured X data are likely to be harmful, which seems counter-intuitive. To understand why this is the case, let us consider the extreme situation where everyone is exposed. In this case, binary X data clearly are useless, conveying no additional information beyond (Y, G). However, when we fit the full model to (Y, G, X) data, we are attempting to ‘parse’ any gene effect into the main effect β2 and the interaction effect β3. This is impossible without variation in X, and by extension inefficient when the X prevalence is very high.
This inefficiency is also manifested when the X prevalence is very low; however, the joint test still beats the marginal test in this case for the reasons described above.

πG     πX     β0     β1    β3    Sample Size   Power of (Y, G)   Power of (Y, X, G)
0.50   0.40   -5.0   1.4   2.0   1400          0.899             0.890
0.35   0.50   -5.0   0.9   1.9   1700          0.831             0.820
0.30   0.50   -5.0   1.7   1.3   2100          0.774             0.766
0.45   0.50   -5.0   1.9   1.3   1800          0.829             0.811
0.45   0.50   -5.0   1.6   1.2   2900          0.835             0.834
0.35   0.45   -5.0   1.5   1.9   1100          0.824             0.808
0.20   0.50   -5.0   1.9   1.2   2600          0.766             0.761
0.50   0.50   -5.0   1.9   1.2   2300          0.850             0.838
0.40   0.45   -5.0   0.8   2.0   1900          0.858             0.852
0.15   0.50   -5.0   1.9   2.0   800           0.771             0.767
0.50   0.45   -5.0   2.0   1.6   1200          0.849             0.839

100,000 data sets were simulated under each condition.

Table 3.2: Situations where collecting X data can be harmful.

Chapter 4 Misclassification

4.1 Misclassification

In Chapter 3, we compared the (Y, X, G) design with the (Y, G) design in terms of cost effectiveness, assuming all data are measured without error. However, this assumption is unrealistic. Whereas genetic information is relatively stable throughout life and can be measured nearly perfectly, exposure assessment is generally considered to be almost always error-prone. [England et al., 2007] found that reliance on self-reported smoking status among pregnant women can result in exposure misclassification, where 21.6% of self-reported quitters had evidence of active smoking. In some studies in occupational and environmental epidemiology, exposure misclassification can occur at a much higher rate, with sensitivity often around 50% or less [Burstyn et al., 2009, Teschke et al., 2002].

In this chapter, we consider situations in which exposure misclassification is present. We denote by X∗ the imperfectly measured environmental exposure, to distinguish it from the true environmental exposure X. Again, we assume X∗ and G are independent given X and Y.
Since the environmental exposure is assumed to be binary, the magnitude of misclassification can be described by sensitivity (SN = P(X∗ = 1|X = 1)) and specificity (SP = P(X∗ = 0|X = 0)). When we treat X∗ as if it were X, we are actually working under a true relationship of the form:

\[
\begin{aligned}
\Pr(Y = 1 \mid X^*, G) &= \sum_X \left\{ \Pr(Y = 1 \mid X, X^*, G)\,\Pr(X \mid X^*) \right\} \\
&= \sum_X \frac{\Pr(Y = 1 \mid X, G)\,\Pr(X^* \mid Y = 1, X, G)}{\Pr(X^* \mid X, G)} \times \frac{\Pr(X^* \mid X)\,\Pr(X)}{\Pr(X^*)} \\
&= \sum_X \Pr(Y = 1 \mid X, G)\,\Pr(X^* \mid X, Y = 1)\,\frac{\Pr(X)}{\Pr(X^*)}.
\end{aligned}
\]

Thus, we need a new set of parameters (πG∗, πX∗, β0∗, β1∗, β2∗, β3∗), rather than the true parameter setting (πG, πX, β0, β1, β2, β3), to capture the true nature of (Y, X∗, G) data. The prevalence of genotype remains unchanged, so that πG∗ = πG. The probability of being classified as exposed is

\[
\pi_X^* = \sum_Y \sum_X \left\{ \Pr(X^* = 1 \mid X, Y)\,\Pr(Y \mid X)\,\Pr(X) \right\}.
\]

Finally, β∗ can be derived from the original parameters by solving:

\[
\begin{aligned}
\Pr(Y = 1 \mid X^*, G) &= \operatorname{expit}(\beta_0^* + \beta_1^* X^* + \beta_2^* G + \beta_3^* X^* G) \\
&= \sum_X \operatorname{expit}(\beta_0 + \beta_1 X + \beta_2 G + \beta_3 X G)\,\frac{\Pr(X^* \mid X, Y = 1)\,\Pr(X)}{\Pr(X^*)},
\end{aligned}
\]

and it turns out that

\[
\begin{aligned}
\beta_0^* &= \operatorname{logit}\left[ \sum_X \operatorname{expit}(\beta_0 + \beta_1 X)\,\frac{\Pr(X^* = 0 \mid X, Y = 1)\,\Pr(X)}{\Pr(X^* = 0)} \right], \\
\beta_1^* &= \operatorname{logit}\left[ \sum_X \operatorname{expit}(\beta_0 + \beta_1 X)\,\frac{\Pr(X^* = 1 \mid X, Y = 1)\,\Pr(X)}{\Pr(X^* = 1)} \right] - \beta_0^*, \\
\beta_2^* &= \operatorname{logit}\left[ \sum_X \operatorname{expit}(\beta_0 + \beta_2 + (\beta_1 + \beta_3) X)\,\frac{\Pr(X^* = 0 \mid X, Y = 1)\,\Pr(X)}{\Pr(X^* = 0)} \right] - \beta_0^*, \\
\beta_3^* &= \operatorname{logit}\left[ \sum_X \operatorname{expit}(\beta_0 + \beta_2 + (\beta_1 + \beta_3) X)\,\frac{\Pr(X^* = 1 \mid X, Y = 1)\,\Pr(X)}{\Pr(X^* = 1)} \right] - \beta_0^* - \beta_1^* - \beta_2^*.
\end{aligned}
\]

We can see from the above expressions that fitting the full model to (Y, X∗, G) data generally gives biased point estimates of β. When (β2, β3) = (0, 0), however, it can easily be verified that the following two equations are both satisfied:

\[
\beta_2^* + \beta_0^* = \beta_0^*, \qquad \beta_3^* + \beta_2^* + \beta_1^* + \beta_0^* = \beta_1^* + \beta_0^*.
\]

This implies that (β2∗, β3∗) = (0, 0).
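The reparameterization above can be checked numerically. The following is a minimal sketch, not the thesis code, under two stated assumptions: misclassification is non-differential, so that Pr(X∗|X, Y = 1) = Pr(X∗|X) is determined by SN and SP alone, and X is independent of G. The function name `misclassified_betas` and the example values of SN and SP are hypothetical.

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def misclassified_betas(beta, pi_x, sn, sp):
    """Return (b0*, b1*, b2*, b3*) induced by treating X* as if it were X.

    Assumes non-differential misclassification: Pr(X*|X, Y) = Pr(X*|X).
    """
    b0, b1, b2, b3 = beta
    p_x = {0: 1.0 - pi_x, 1: pi_x}  # Pr(X = x)
    # Pr(X* = x_star | X = x), keyed by (x_star, x)
    p_xs_given_x = {(0, 0): sp, (1, 0): 1.0 - sp,
                    (0, 1): 1.0 - sn, (1, 1): sn}

    def risk(x_star, g):
        # Pr(Y = 1 | X* = x_star, G = g)
        #   = sum_x expit(b0 + b1 x + b2 g + b3 x g) Pr(X*|X) Pr(X) / Pr(X*)
        p_xs = sum(p_xs_given_x[(x_star, x)] * p_x[x] for x in (0, 1))
        return sum(expit(b0 + b1 * x + b2 * g + b3 * x * g)
                   * p_xs_given_x[(x_star, x)] * p_x[x] / p_xs
                   for x in (0, 1))

    b0s = logit(risk(0, 0))
    b1s = logit(risk(1, 0)) - b0s
    b2s = logit(risk(0, 1)) - b0s
    b3s = logit(risk(1, 1)) - b0s - b1s - b2s
    return b0s, b1s, b2s, b3s


# Illustrative call: qualitative interaction with a rare exposure and
# hypothetical classification quality SN = 0.75, SP = 0.95.
beta_true = (logit(0.05), math.log(1.5), 0.0, math.log(1.5))
print(misclassified_betas(beta_true, pi_x=0.2, sn=0.75, sp=0.95))
```

With SN = SP = 1 the function returns the original coefficients, and with β2 = β3 = 0 it returns β2∗ = β3∗ = 0 up to rounding, which is the property that keeps the test of the null hypothesis valid.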
Hence, (β2, β3) = (0, 0) in the (Y|X, G) relationship implies zero coefficients for G and X∗G in the (Y|X∗, G) relationship, so that fitting the full model to (Y, G, X∗) data still yields a valid test of the null hypothesis that Y and G are conditionally independent given X. However, the use of X∗ rather than X will reduce power (e.g. [Burstyn et al., 2009, Rothman et al., 1999, Vineis, 2004, Wong et al., 2003]), and the power calculation should be adjusted for the presence of misclassification: we plug the adjusted parameters given above into the power calculation shown in Section 2.1 to obtain the power for misclassified (Y, X∗, G) data.

4.2 (Y, X∗, G) Design vs. (Y, G) Design

In this section, we compare the (Y, X∗, G) study to the (Y, G) study in terms of cost effectiveness. Two types of misclassification are considered:

Non-differential misclassification occurs when the probability of being misclassified is the same for all study subjects.

Differential misclassification occurs when the probability of being misclassified differs across groups of study subjects.

4.2.1 Non-differential Misclassification

We first focus on non-differential misclassification models. Figure 4.1 and Figure 4.2 show the effect of non-differential misclassification under different parameter settings:

• Figure 4.1 – rare exposure: (πG, πX, β0, β1, β3) = (0.19, 0.2, logit 0.05, log 1.5, log 1.5),
• Figure 4.2 – common exposure: (πG, πX, β0, β1, β3) = (0.19, 0.7, logit 0.05, log 1.5, log 1.5).

In both figures, we evaluate the break-even cost under four scenarios: (i) top-left: SN = 0.75, SP = 0.75; (ii) top-right: SN = 0.95, SP = 0.95; (iii) bottom-left: SN = 0.75, SP = 0.95; (iv) bottom-right: SN = 0.95, SP = 0.75. The solid curve represents the break-even cost of the misclassified data, while the dashed curve represents the break-even cost of the perfect data.
The bigger the gap between these two curves, the more the break-even cost is influenced by misclassification.

[Figure 4.1: Effect of non-differential misclassification: rare exposure. Four panels, each plotting break-even cost against power desired for perfect data and imperfect data.]

[Figure 4.2: Effect of non-differential misclassification: common exposure. Four panels, each plotting break-even cost against power desired for perfect data and imperfect data.]

From these two figures, we can see that, similar to the impact of misclassification on estimation [Gustafson, 2004], power is more influenced by sensitivity (SN) when πX is large (common exposure) and more by specificity (SP) when πX is small (rare exposure). Also as expected, the break-even cost decreases as the misclassification becomes more severe, because more data have to be collected to compensate for the lower power due to the use of an imperfect surrogate for exposure. In other words, misclassified exposure has to be rather cheap compared to the collection of health outcome data and genotyping to be worthwhile, whereas greater expense can be justified for perfect (or near-perfect) exposure assessment.

In particular, when the quality of the X∗ data is very poor, the use of X∗ data can be harmful. Therefore, it is important to determine at what values of sensitivity and specificity the break-even cost dips below 1.
We already know from Chapter 3 that when the environmental exposure is very common, the break-even cost is likely to be below 1 even if X data are perfectly classified. We therefore investigate only situations with a moderate prevalence of exposure. All combinations described in Table 3.1 were investigated again, under various values for the quality of exposure classification. We find that for a moderate prevalence of exposure, say 0.2 < πX < 0.5, when SN = SP = 0.6, all combinations have a break-even cost below 1. Thus, if we cannot guarantee the quality of our environmental exposure data, it may be better to avoid assessing exposure. While it is well known that estimation bias can be removed by appropriate statistical adjustment for exposure measurement error, there is no way to recover the power lost by having X∗ measurements rather than X measurements [Greenland and Gustafson, 2006].

4.2.2 Differential Misclassification

Next, let us turn our attention to the situation where differential misclassification occurs, which is more likely in cohort studies.
We study three differential misclassification models that were also considered by [Williamson et al., 2010]:

(i) the “social stigma” model, where diseased subjects are unwilling to report their exposure status (sensitivity given disease = 80%, otherwise perfect classification);
(ii) the “better recall” model, where those with disease keep an eye on the exposure and thus can give perfect classification, but those without disease are not able to recall the exposure history well (sensitivity and specificity for undiseased = 80%, perfect classification for diseased);
(iii) the “blame” model, where subjects with disease blame their disease on an exposure and hence report it more often (specificity given disease = 80%, otherwise perfect classification).

[Figure 4.3: Effect of differential misclassification. Three panels (Social Stigma, Better Recall, Blame), each plotting break-even cost against power.]

Figure 4.3 shows the break-even cost under these three scenarios, where the dashed lines show the break-even cost without misclassification. We can see that there is a big drop under the “social stigma” model and the “blame” model, while no big change is seen under the “better recall” model. Hence, it seems that the loss in power is driven more by misclassification among diseased subjects.

4.3 Extension of Comparison

Finally, we can extend our comparison to three data types: (Y, G, X), (Y, G, X∗), and (Y, G). For example, consider again the scenario under which Figure 3.3 is created: (πG, πX, β0, β1, β3) = (0.19, 0.4, logit 0.05, log 1.5, log 1.5). Suppose non-differential misclassification occurs with sensitivity and specificity both equal to 0.9. We aim to achieve 80% power at a 10^−4 significance level.
We can then determine two corresponding break-even costs relative to (Y, G) data: c1 = 1.7 for (Y, G, X) data and c2 = 1.4 for (Y, G, X∗) data. The break-even cost between (Y, G, X) and (Y, G, X∗) is simply the ratio c1/c2 = 1.22. That is, when a decision must be made between the (Y, X, G) design and the (Y, X∗, G) design, the former is less cost-effective if its per-subject cost of data acquisition exceeds that of the latter by more than 22%. On this basis, Figure 4.4 can be created. Presuming that (Y, G, X) data are indeed more costly than (Y, G, X∗) data, which are in turn more costly than (Y, G) data, the shaded area in this plot corresponds to unrealistic cost ratios. The region of plausible cost values is divided into three subregions: area (i) is where (Y, G) data are the most cost-effective; area (ii) is where (Y, G, X∗) data should be collected; and area (iii) is where (Y, G, X) data are preferred. Thus, we can ‘see’ the best data type to collect as a function of the two actual per-subject cost ratios: (Y, G, X) versus (Y, G), and (Y, G, X∗) versus (Y, G). Plots of this form could be used for study-planning purposes, with pilot values of the parameters used to create a customized version of the plot. Anticipated sampling costs can then be located on the plot to visualize which of the three study designs is most cost-effective.

[Figure 4.4: Comparison among three data types. Horizontal axis: C[YX∗G]/C[YG]; vertical axis: C[YXG]/C[YG]; both from 0 to 4. Regions I, II and III are marked, together with c1 = 1.7 and c2 = 1.4.]

In some practical situations, though, the choice in study design may be between collecting X∗ and no exposure assessment at all, since exposure status cannot be ascertained without error at any price. Moreover, with low SN and SP the comparison may lead investigators to favour inferring the presence of a gene effect from (Y, G) data. Such situations are common in practice and have profound implications for rational allocation of research resources and choice of research questions.
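The three-way comparison behind Figure 4.4 can be sketched as a small decision rule. This is an illustration only, and it rests on an assumption not spelled out in this section: that a break-even cost equals the ratio of the sample sizes the two designs need for the target power, so that a design's total cost relative to the (Y, G) design is its per-subject cost ratio divided by its break-even cost. The function name `best_design` and the cost ratios in the example are hypothetical.

```python
def best_design(r_x, r_xstar, c1, c2):
    """Pick the most cost-effective of the three designs.

    r_x     -- per-subject cost of (Y, G, X) data relative to (Y, G) data
    r_xstar -- per-subject cost of (Y, G, X*) data relative to (Y, G) data
    c1, c2  -- break-even costs of (Y, G, X) and (Y, G, X*) relative to (Y, G)
    """
    # Total cost of reaching the target power, relative to reaching it with
    # (Y, G) data alone (assumes break-even cost = ratio of required sample sizes).
    relative_cost = {
        "(Y, G)": 1.0,
        "(Y, G, X)": r_x / c1,
        "(Y, G, X*)": r_xstar / c2,
    }
    return min(relative_cost, key=relative_cost.get)


# Break-even costs quoted in the text; the per-subject cost ratios are made up.
print(best_design(r_x=2.0, r_xstar=1.6, c1=1.7, c2=1.4))
print(best_design(r_x=1.5, r_xstar=1.3, c1=1.7, c2=1.4))
print(best_design(r_x=1.6, r_xstar=1.2, c1=1.7, c2=1.4))
```

Under this rule the boundary between preferring (Y, G, X) and (Y, G, X∗) is the line where the two cost ratios stand in the proportion c1/c2, which matches the pairwise comparison described above.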
Of course, one can usually allocate more resources to upgrade the instruments and improve the quality of data. Therefore, further extensions can be made to compare two environmental exposure surrogates of different quality.

Chapter 5

Other Issues

5.1 Significance Level

In the previous two chapters, our results were presented in the context of a 10^−4 significance level. Our use of the 10^−4 significance level is a compromise for illustrative purposes. In the context of a genome-wide association study, a much more stringent level, such as 5 × 10^−8, would be employed (or, perhaps more likely, false discovery rate control would be implemented). Conversely, for some environmental exposures the number of candidate genes is very limited. For example, certain common and important exposures are associated with only one or two single nucleotide polymorphisms, e.g., SNPs in the paraoxonase (PON1) gene and organophosphates [Burstyn et al., 2009]. In contexts with very limited numbers of candidate genes, much more liberal significance levels than 10^−4 would be applied.

Figure 5.1 and Figure 5.2 reproduce some figures in Chapter 2 with a 0.05 significance level and a 5 × 10^−8 significance level, respectively. The break-even cost for achieving 80% power is 1.6 when the significance level is 0.05, and 1.8 when the significance level is 5 × 10^−8. Thus, the relative performance of the (Y, X, G) study is improved with a more stringent significance level, although the circumstances under which the (Y, G) only design outperforms the (Y, X, G) design are insensitive to this choice.

[Figure 5.1: Results with a liberal significance level, 0.05. Break-even cost plotted against power desired, and against πX and β0.]
[Figure 5.2: Results with a stringent significance level, 5 × 10^−8. Break-even cost plotted against power desired, and against πX and β0.]

5.2 Presence of Main Gene Effect

Our reported power evaluations have been in the qualitative interaction setting (β2 = 0, β3 ≠ 0), as we believe this may be a typical circumstance. However, when we evaluate power in the presence of a main effect of gene (β2 ≠ 0), we find that the break-even cost decreases with the magnitude of the main effect, while the other parameters remain fixed at (πG, πX, β0, β1, β3) = (0.19, 0.4, logit 0.05, log 1.5, log 1.5), as shown in Figure 5.3. In fact, as the main gene effect comes to dominate the other effects in the model, the marginal gene effect may be large enough to be easily detected, and the involvement of exposure data may not help much. Thus, by evaluating power when in fact the gene-environment interaction is qualitative, we are considering a ‘least favorable’ setting for the (Y, G) only design, yet it often outperforms the (Y, X, G) design nonetheless.

[Figure 5.3: Break-even cost with the presence of main gene effect. Three panels (β2 = 0, β2 = 0.1, β2 = 0.2), each plotting break-even cost against power desired.]

5.3 Case-Control & Case-Only

While our results are presented in the cohort study setting, the power calculations are valid for case-control studies, with the proviso that the relevant intercept β0 would be that induced by the case-control sampling scheme rather than that describing disease prevalence in the target population:

\[
\beta_0^* = \beta_0 + \log\frac{N_{\mathrm{Case}}}{N_{\mathrm{Control}}} - \log\frac{\Pr(Y = 1)}{\Pr(Y = 0)}.
\]

The other caveat is that our results are developed under sampling from a distribution in which X and G are independent. In the cohort study setting, this naturally corresponds to independence in the study population.
In the case-control setting the situation is less clear, as the (X, G) distribution induced by case-control sampling is not identical to the (X, G) distribution in the target population.

We have addressed the value of X data (or X∗ data) by comparing tests for a main effect and/or interaction effect of G obtained from equally costly samples with and without X (or X∗). In either case, the null hypothesis β2 = β3 = 0 is considered. However, if X data are to be collected, it may not be worth collecting any information on controls. [Piegorsch et al., 1994] showed that a gene-environment interaction can be estimated more efficiently with a case-only design than with either a cohort or a case-control study, under the assumption that the environmental exposure and gene are independent among controls.

Let us first briefly review the case-only approach. The justification of the case-only design can be shown by expressing β3 in the following form:

\[
\exp(\beta_3) = \frac{\mathrm{Odds}(G = 1 \mid X = 1, Y = 1) / \mathrm{Odds}(G = 1 \mid X = 0, Y = 1)}{\mathrm{Odds}(G = 1 \mid X = 1, Y = 0) / \mathrm{Odds}(G = 1 \mid X = 0, Y = 0)}.
\]

Note that the denominator is equal to 1 when X and G are independent among controls. This is approximately true under the assumption of gene-environment independence (at the population level) and under a rare-disease assumption. Therefore, β3 can be estimated without any controls.

Let N_ijk be the number of subjects with Y = i, X = j and G = k (i, j, k = 0, 1) for a case-control study, as shown in Table 5.1.

Table 5.1: Notation for the data of a case-control study.

        Y = 1           Y = 0
        G = 0   G = 1   G = 0   G = 1
X = 0   N100    N101    N000    N001
X = 1   N110    N111    N010    N011

In a case-control study, β3 can be estimated by

\[
\hat{\beta}_3 = \log\frac{N_{111} N_{100} N_{010} N_{001}}{N_{011} N_{000} N_{110} N_{101}},
\]

and the estimated variance of \hat{\beta}_3 is

\[
\mathrm{Var}(\hat{\beta}_3) = \frac{1}{N_{111}} + \frac{1}{N_{110}} + \frac{1}{N_{101}} + \frac{1}{N_{100}} + \frac{1}{N_{011}} + \frac{1}{N_{010}} + \frac{1}{N_{001}} + \frac{1}{N_{000}}.
\]

On the other hand, by collecting cases only, β3 can also be estimated through

\[
\hat{\beta}_3 = \log\frac{N_{111} N_{100}}{N_{110} N_{101}},
\]

and the corresponding estimated variance is

\[
\mathrm{Var}(\hat{\beta}_3) = \frac{1}{N_{111}} + \frac{1}{N_{110}} + \frac{1}{N_{101}} + \frac{1}{N_{100}}.
\]

Thus, the case-only design provides a more efficient way of estimating gene-environment interaction.

However, we have deliberately not compared the case-only design to the (Y, G) only design, since this would be a ‘category mistake’ (or an ‘apples and oranges’ comparison): the case-only design can only test the null β3 = 0 (under the rare-disease assumption), whereas the (Y, G) only design can only test the null β2 = β3 = 0 (without invoking the rare-disease assumption).

Furthermore, in thinking about how the case-only design relates to the present discussion, exposure misclassification must be considered in deriving a comparison of study designs that is applicable to epidemiologic practice. We have already mentioned misclassification of environmental exposure as a point in favor of the (Y, G) only design compared to the (Y, X∗, G) design. In fact, the case-only design is even more susceptible to such misclassification than the (Y, X∗, G) design. Consider the worst case of a “useless” exposure classification having sensitivity equal to 1 − specificity. When (β2, β3) ≠ (0, 0), according to the reparameterization given in Section 4.1, this will induce β1∗ = β3∗ = 0 but β2∗ ≠ 0; hence the (Y, X∗, G) data still have some power to detect the gene effect. Conversely, X∗ and G will be conditionally independent given Y = 1, which will render the case-only design completely powerless. This gives a sense in which this design is an order of magnitude more susceptible to exposure misclassification than the ‘full data’ design.

5.4 3-Category Genotype

Finally, we can extend the discussion to the situation where the genotype has three categories.
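In cases where a gene exists in two allelic forms (designated A and a), three combinations of alleles (genotypes) are possible:

Homozygous-dominant genotype, when both alleles are dominant, i.e., AA;
Homozygous-recessive genotype, when both alleles are recessive, i.e., aa;
Heterozygous genotype, when the two alleles are different, i.e., Aa.

Let H denote the 3-category genetic factor, which can take values in {0, 1, 2}. The power calculations for the (Y, X, H) design and the (Y, H) design are analogous to those shown in Chapter 2. In what follows, we show only the key ingredients of the power calculation.

(Y, X, H) Design

The model applied to (Y, X, H) data is

\[
\operatorname{logit}\Pr(Y = 1 \mid X, H) = \tilde{\beta}_0 + \tilde{\beta}_1 X + \tilde{\beta}_{21} I(H = 1) + \tilde{\beta}_{22} I(H = 2) + \tilde{\beta}_{31} X I(H = 1) + \tilde{\beta}_{32} X I(H = 2),
\]

where I(·) is an indicator function. Correspondingly, the null and alternative hypotheses are

\[
H_0: (\tilde{\beta}_{21}, \tilde{\beta}_{22}, \tilde{\beta}_{31}, \tilde{\beta}_{32})^T = (0, 0, 0, 0)^T
\quad \text{vs} \quad
H_a: (\tilde{\beta}_{21}, \tilde{\beta}_{22}, \tilde{\beta}_{31}, \tilde{\beta}_{32})^T \neq (0, 0, 0, 0)^T.
\]

Finally, we have the expected Fisher information matrix as

\[
I = \begin{pmatrix}
D_0 + D_1 + D_{21} + D_{22} + D_{31} + D_{32} & D_1 + D_{31} + D_{32} & D_{21} + D_{31} & D_{22} + D_{32} & D_{31} & D_{32} \\
D_1 + D_{31} + D_{32} & D_1 + D_{31} + D_{32} & D_{31} & D_{32} & D_{31} & D_{32} \\
D_{21} + D_{31} & D_{31} & D_{21} + D_{31} & 0 & D_{31} & 0 \\
D_{22} + D_{32} & D_{32} & 0 & D_{22} + D_{32} & 0 & D_{32} \\
D_{31} & D_{31} & D_{31} & 0 & D_{31} & 0 \\
D_{32} & D_{32} & 0 & D_{32} & 0 & D_{32}
\end{pmatrix},
\]

where

\[
\begin{aligned}
D_0 &= \Pr(H = 0, X = 0)\,\exp(\tilde{\beta}_0)\,(1 + \exp(\tilde{\beta}_0))^{-2}, \\
D_1 &= \Pr(H = 0, X = 1)\,\exp(\tilde{\beta}_0 + \tilde{\beta}_1)\,(1 + \exp(\tilde{\beta}_0 + \tilde{\beta}_1))^{-2}, \\
D_{21} &= \Pr(H = 1, X = 0)\,\exp(\tilde{\beta}_0 + \tilde{\beta}_{21})\,(1 + \exp(\tilde{\beta}_0 + \tilde{\beta}_{21}))^{-2}, \\
D_{22} &= \Pr(H = 2, X = 0)\,\exp(\tilde{\beta}_0 + \tilde{\beta}_{22})\,(1 + \exp(\tilde{\beta}_0 + \tilde{\beta}_{22}))^{-2}, \\
D_{31} &= \Pr(H = 1, X = 1)\,\exp(\tilde{\beta}_0 + \tilde{\beta}_1 + \tilde{\beta}_{21} + \tilde{\beta}_{31})\,(1 + \exp(\tilde{\beta}_0 + \tilde{\beta}_1 + \tilde{\beta}_{21} + \tilde{\beta}_{31}))^{-2}, \\
D_{32} &= \Pr(H = 2, X = 1)\,\exp(\tilde{\beta}_0 + \tilde{\beta}_1 + \tilde{\beta}_{22} + \tilde{\beta}_{32})\,(1 + \exp(\tilde{\beta}_0 + \tilde{\beta}_1 + \tilde{\beta}_{22} + \tilde{\beta}_{32}))^{-2}.
\end{aligned}
\]

(Y, H) Design

The model applied to (Y, H) data is

\[
\operatorname{logit}\Pr(Y = 1 \mid H) = \tilde{\alpha}_0 + \tilde{\alpha}_{11} I(H = 1) + \tilde{\alpha}_{12} I(H = 2).
\]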
The corresponding null and alternative hypotheses are:

\[
H_0: (\tilde{\alpha}_{11}, \tilde{\alpha}_{12})^T = (0, 0)^T
\quad \text{vs} \quad
H_a: (\tilde{\alpha}_{11}, \tilde{\alpha}_{12})^T \neq (0, 0)^T.
\]

Finally, the expected Fisher information matrix is

\[
I = \begin{pmatrix}
C_0 + C_{11} + C_{12} & C_{11} & C_{12} \\
C_{11} & C_{11} & 0 \\
C_{12} & 0 & C_{12}
\end{pmatrix},
\]

where

\[
\begin{aligned}
C_0 &= \Pr(H = 0)\,\exp(\tilde{\alpha}_0)\,(1 + \exp(\tilde{\alpha}_0))^{-2}, \\
C_{11} &= \Pr(H = 1)\,\exp(\tilde{\alpha}_0 + \tilde{\alpha}_{11})\,(1 + \exp(\tilde{\alpha}_0 + \tilde{\alpha}_{11}))^{-2}, \\
C_{12} &= \Pr(H = 2)\,\exp(\tilde{\alpha}_0 + \tilde{\alpha}_{12})\,(1 + \exp(\tilde{\alpha}_0 + \tilde{\alpha}_{12}))^{-2}.
\end{aligned}
\]

Chapter 6

Conclusion & Discussion

Our main finding is that, under a wide range of circumstances, research resources aimed at identifying associations between genes and diseases can be more efficiently (in the sense of study power) allocated to genotyping larger groups of individuals than to exposure assessment, when exposure and genes interact. Likewise, an efficient study design to detect qualitative gene-environment interactions can typically omit exposure assessment if there is convincing evidence that the gene only influences risk of disease by modifying the effect of exposure. (The evidence for the mode of action of a gene in conferring risk of disease would have to arise from studies outside the realm of epidemiology.) These conclusions do not negate the need for exposure assessment in quantifying gene-disease and gene-environment interactions, but they do suggest that the claim in [Williamson et al., 2010], that it is always desirable to assess exposures in such studies, does not hold when resource constraints are considered. Of course, there may well be circumstances where neither analytical approach will yield satisfactory power, but our results support the claim made in [Burstyn et al., 2009] that a test for qualitative interaction typically requires a smaller sample size to achieve the same power as a study that collects error-prone exposure data and estimates the interaction directly.
It should be noted that when prior information is available on the magnitude of gene-environment interaction, data on gene and health outcome alone, under the Mendelian randomization assumption, can be used to estimate the magnitude of interaction through a Bayesian procedure [Gustafson and Burstyn, 2011].

It must be recognized that even in situations where (Y, G) data yield more power than (Y, X, G) data, the latter data structure does permit partitioning of the gene effect into main and interaction components. If (Y, G) data alone are collected and indicate a gene effect, then the question arises of whether this might arise via a main effect of G alone, an interaction effect of G alone (a qualitative interaction), or a combination of main and interaction effects. In some contexts a main effect is implausible a priori, so the results can be interpreted as evidence for a qualitative interaction. In other contexts, it may make sense to seek additional resources in order to obtain exposure or surrogate exposure measurements for a subsample, permitting estimation of the coefficients in the full model. The question of resource use is then more complex, since the subsample cost of exposure assessment is only incurred if the initial (Y, G) sample indicates association.

It is also paramount to consider that the number of environmental exposures of potential interest is large. For example, a conservative list used by the U.S. National Health and Nutrition Examination Survey consists of at least 266 ‘core’ exposures [Patel et al., 2010], and it is believed that the environmental exposures epidemiologists ought to be considering in exploratory studies number in the thousands, at least [Wild, 2005].
It is also clear that the cost of exposure assessment that meets the needs of epidemiology, by providing both accurate and biologically meaningful measures, will continue to escalate in the near term, given the experimental nature of the approaches that are being proposed [Rappaport and Smith, 2010]. There is hope that the costs of exposure assessment will decline in time, as they have done for genotyping, but exposure assessment is a much more complex technical challenge than genotyping. For the foreseeable future, scientists must contend with exposure assessment costs that can be on the order of hundreds of dollars per subject per exposure. Under these conditions, selecting appropriate exposures and genes to study in addressing important questions in public health will remain central to designing feasible and cost-effective investigations.

Overall, we conclude that in many situations, forgoing the collection of environmental exposure data in order to boost sample size is an efficient approach to assessing qualitative gene-environment interactions when the disease is known not to be caused by the gene alone. The approach may also prove valuable as an efficient first stage in identifying the role of a gene (via interaction or main effect) in causing a disease.

Bibliography

B. Armstrong. Optimizing power in allocating resources to exposure assessment in an epidemiologic study. American Journal of Epidemiology, 144(2):192–197, 1996.

I. Burstyn, H. Kim, Y. Yasui, and C. N.M. The virtues of a deliberately mis-specified disease model in demonstrating a gene-environment interaction. Occupational and Environmental Medicine, 66(6):374–380, 2009.

L. England, A. Grauman, C. Qian, D. Wilkins, E. Schisterman, F. Kai, and R. Levine. Misclassification of maternal smoking status and its effects on an epidemiologic study of pregnancy outcomes. Nicotine & Tobacco Research, 9(10):1005–1013, 2007.

S. Greenland and P. Gustafson. Accounting for independent nondifferential misclassification does not increase certainty that an observed association is in the correct direction. American Journal of Epidemiology, 164(1):63–68, 2006.

P. Gustafson. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Boca Raton: Chapman and Hall/CRC, 2004.

P. Gustafson and I. Burstyn. Bayesian inference of gene–environment interaction from incomplete data: What happens when information on environment is disjoint from data on gene and disease? Statistics in Medicine, 30(8):877–889, 2011.

P. Kraft, Y. Yen, D. Stram, J. Morrison, and W. Gauderman. Exploiting gene-environment interaction to detect genetic associations. Human Heredity, 63(2):111–119, 2007.

D. Li and D. Conti. Detecting gene-environment interactions using a combined case-only and case-control approach. American Journal of Epidemiology, 169(4):497–504, 2009.

C. Patel, J. Bhattacharya, and A. Butte. An environment-wide association study (EWAS) on type 2 diabetes mellitus. PLoS One, 5(5):e10746, 2010.

K. Pickett, K. Kasza, G. Biesecker, R. Wright, and L. Wakschlag. Women who remember, women who do not: A methodological study of maternal recall of smoking in pregnancy. Nicotine & Tobacco Research, 11(10):1166–1174, 2009.

W. Piegorsch, C. Weinberg, and J. Taylor. Non–hierarchical logistic models and case–only designs for assessing susceptibility in population–based case–control studies. Statistics in Medicine, 13(2):153–162, 1994.

S. Rappaport and M. Smith. Environment and disease risks. Science, 330(6003):460–461, 2010.

N. Rothman, M. Garcia-Closas, W. Stewart, and J. Lubin. The impact of misclassification in case-control studies of gene-environment interactions. IARC Scientific Publications, 148:89–96, 1999.

G. D. Smith and S. Ebrahim. Mendelian randomization: prospects, potentials, and limitations. International Journal of Epidemiology, 33(1):30–42, 2004.

K. Teschke, A. Olshan, J. Daniels, A. De Roos, C. Parks, M. Schulz, and T. Vaughan. Occupational exposure assessment in case–control studies: opportunities for improvement. Occupational and Environmental Medicine, 59(9):575–594, 2002.

D. Umbach and C. Weinberg. Designing and analysing case–control studies to exploit independence of genotype and exposure. Statistics in Medicine, 16(15):1731–1743, 1997.

P. Vineis. A self-fulfilling prophecy: are we underestimating the role of the environment in gene–environment interaction research? International Journal of Epidemiology, 33(5):945–946, 2004.

C. Wild. Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiology Biomarkers & Prevention, 14(8):1847–1850, 2005.

E. Williamson, A. Ponsonby, J. Carlin, and T. Dwyer. Effect of including environmental data in investigations of gene–disease associations in the presence of qualitative interactions. Genetic Epidemiology, 34(6):552–560, 2010.

M. Wong, N. Day, J. Luan, K. Chan, and N. Wareham. The detection of gene–environment interaction for continuous traits: should we deal with measurement error by bigger studies or better measurement? International Journal of Epidemiology, 32(1):51–57, 2003.
Title | Costs and benefits of environmental data in investigations of gene-disease associations |
Creator |
Luo, Hao |
Publisher | University of British Columbia |
Date Issued | 2012 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-08-29 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073077 |
URI | http://hdl.handle.net/2429/43080 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |