The Gene-environment Independence Assumption in theAnalysis of Case-control DatabyHao LuoB.Sc., Nanjing University, 2010M.Sc., The University of British Columbia, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)December 2016c© Hao Luo, 2016AbstractIn this thesis, we consider the problem of exploiting the gene-environment inde-pendence assumption in a case-control study inferring the joint effect of genotypeand environmental exposure on disease risk.We first take a detour and develop the constrained maximum likelihood esti-mation theory for parameters arising from a partially identified model, where someparameters of the model may only be identified through constraints imposed byadditional assumptions. We show that, under certain conditions, the constrainedmaximum likelihood estimator exists and locally maximizes the likelihood func-tion subject to constraints. Moreover, we study the asymptotic distribution of theestimator and propose a numerical algorithm for estimating parameters.Next, we use the frequentist approach to analyze case-control data under thegene-environment independence assumption. By transforming the problem intoa constrained maximum likelihood estimation problem, we are able to derive theasymptotic distribution of the estimator in a closed form. We then show that ex-ploiting the gene-environment independence assumption indeed improves estima-tion efficiency. Also, we propose an easy-to-implement numerical algorithm forfinding estimates in practice.Furthermore, we approach the problem in a Bayesian framework. By introduc-ing a different parameterization of the underlying model for case-control data, weare able to define a prior structure reflecting the gene-environment independenceassumption and develop an efficient numerical algorithm for the computation ofthe posterior distribution. The proposed Bayesian method is further generalizedto address the concern about the validity of the gene-environment independenceassumption.iiFinally, we consider a special variant of the standard case-control design, thecase-only design, and study the analysis of case-only data under the gene-environmentindependence assumption and the rare disease assumption. We show that theBayesian method for analyzing case-control data is readily applicable for the analy-sis of case-only data, allowing the flexibility of incorporating different prior beliefson disease prevalence.iiiPrefaceA version of Chapter 2 has been posted online at arxiv.org (Luo et al. [17]) and willbe submitted for publication. The idea was motivated in order to solve the problemin Chapter 3. Under the supervision of Dr. Gustafson, I derived the theoretical re-sults, conducted all computational work, and wrote the majority of the manuscript.Dr. Bouchard-Coˆte´ and Dr. Cohen Freue provided thoughtful suggestions andhelped with the revisions of the manuscript.Chapter 3, 4, and 5 are based on manuscripts collaborated with Dr. Gustafson,Dr. Burstyn, Dr. Bouchard-Coˆte´, and Dr. Cohen Freue. Some related work con-cerning the interaction between a binary genotype and a binary environmental ex-posure has been published (Luo et al. [18]). Versions of these chapters will besubmitted for publication. Dr. Gustafson initiated the research problem and in-spired me to the right direction when I had difficulties moving beyond the simpleset-up of binary genetic and environmental factors. I conducted all derivations, allcomputational work and the majority of the writing. Dr. Bouchard-Coˆte´ and Dr.Cohen Freue both gave me constructive criticism and helpful suggestions, whichprompted more thinking about the proposed methods and greatly improved thesemanuscripts. Dr. Burstyn gave valuable inputs from the perspective of an epidemi-ologist to make our research more relevant to the application in epidemiology.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Analysis of case-control data . . . . . . . . . . . . . . . . . . . . 11.2 Analysis of case-only data . . . . . . . . . . . . . . . . . . . . . 32 The Constrained Maximum Likelihood Estimation with PartiallyIdentified Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Statistical problem . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 The constrained maximum likelihood estimation . . . . . . . . . . 72.3.1 The constrained maximum likelihood estimate . . . . . . 92.3.2 Asymptotic distributions . . . . . . . . . . . . . . . . . . 182.3.3 Numerical algorithm . . . . . . . . . . . . . . . . . . . . 22v2.4 Example problem and simulation study . . . . . . . . . . . . . . 242.5 Just- and over-identified situations . . . . . . . . . . . . . . . . . 262.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 The Benefit of Exploiting the GEI Assumption for Analyzing Case-Control Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.2 Formulation of the problem . . . . . . . . . . . . . . . . . . . . . 303.3 A reparameterization of the model . . . . . . . . . . . . . . . . . 323.4 Estimation with known disease prevalence . . . . . . . . . . . . . 343.4.1 Theoretical properties of the estimator . . . . . . . . . . . 353.4.2 Numerical algorithm . . . . . . . . . . . . . . . . . . . . 363.5 Estimation with unknown disease prevalence . . . . . . . . . . . 373.5.1 Parameter identification . . . . . . . . . . . . . . . . . . 383.5.2 Theoretical properties of the estimator . . . . . . . . . . . 393.5.3 Numerical algorithm . . . . . . . . . . . . . . . . . . . . 413.6 Extension: a reduced logistic model . . . . . . . . . . . . . . . . 423.7 Efficiency gain . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.7.1 The special binary case . . . . . . . . . . . . . . . . . . . 453.7.2 The saturated model . . . . . . . . . . . . . . . . . . . . 463.7.3 The reduced model . . . . . . . . . . . . . . . . . . . . . 473.8 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . 503.8.1 The special binary case . . . . . . . . . . . . . . . . . . . 503.8.2 The saturated model . . . . . . . . . . . . . . . . . . . . 533.8.3 The reduced model . . . . . . . . . . . . . . . . . . . . . 543.8.4 The violation of the GEI assumption . . . . . . . . . . . . 573.9 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 Bayesian Inference in Case-Control Studies . . . . . . . . . . . . . . 634.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634.2 Another reparameterization . . . . . . . . . . . . . . . . . . . . . 644.3 Bayesian framework . . . . . . . . . . . . . . . . . . . . . . . . 66vi4.4 A simulation study . . . . . . . . . . . . . . . . . . . . . . . . . 694.5 Relaxation of the GEI assumption . . . . . . . . . . . . . . . . . 754.5.1 Two established methods . . . . . . . . . . . . . . . . . . 754.5.2 A generalized Bayesian framework . . . . . . . . . . . . 774.6 Another simulation study . . . . . . . . . . . . . . . . . . . . . . 784.7 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835 Bayesian Inference in Case-Only Studies . . . . . . . . . . . . . . . 855.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855.2 Bayesian case-only methods . . . . . . . . . . . . . . . . . . . . 855.3 Bias of the traditional case-only method . . . . . . . . . . . . . . 875.4 A simulation study . . . . . . . . . . . . . . . . . . . . . . . . . 875.5 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915.5.1 Analysis of colorectal cancer data . . . . . . . . . . . . . 915.5.2 Analysis of ovarian cancer data . . . . . . . . . . . . . . 935.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 946 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98A The Forms of Some Vectors and Matrices . . . . . . . . . . . . . . . 102viiList of TablesTable 2.1 Data structure for the example problem considered in Section2.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25Table 3.1 Comparison of the performance between the three methods:TRAD, GEI-U and GEI-K, in the special binary case . . . . . . 51Table 3.2 Comparison of the performance between the three methods:TRAD, GEI-U and GEI-K, in the scenarios with a saturated dis-ease risk model . . . . . . . . . . . . . . . . . . . . . . . . . 54Table 3.3 Comparison of the performance between the three methods:TRAD, GEI-U and GEI-K, in the scenarios with a reduced dis-ease risk model . . . . . . . . . . . . . . . . . . . . . . . . . 57Table 3.4 Comparison of the performance between the three methods:TRAD, GEI-U and GEI-K, when the GEI assumption is vio-lated, assuming a reduced disease risk model . . . . . . . . . . 60Table 3.5 Data from a case-control study concerning the interaction ofNAT2 genotype and smoking habit on bladder cancer . . . . . . 61Table 4.1 Parameter settings used in the simulation study in Section 3.8 . 69Table 4.2 The computational efficiency of the proposed importance sam-pling algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 72Table 4.3 Comparison of the performance between the TRAD method andthe proposed BGEI method . . . . . . . . . . . . . . . . . . . 73Table 4.4 Comparison of the performance between different methods insituations where the GEI assumption may or may not hold . . . 80viiiTable 4.5 Two datasets considered in Section 4.7 for the application of theproposed Bayesian methods . . . . . . . . . . . . . . . . . . . 82Table 4.6 The data analysis results for the two datasets considered in Sec-tion 4.7 by ten different methods . . . . . . . . . . . . . . . . 83Table 5.1 Parameter settings used in the simulation study in Section 5.4 . 89Table 5.2 Prior distributions and LPDs for case-only data under four pa-rameter settings . . . . . . . . . . . . . . . . . . . . . . . . . 89Table 5.3 Comparison of the performance between the Bayesian and thetraditional case-only method . . . . . . . . . . . . . . . . . . . 90Table 5.4 The coverage probabilities of the Traditional and the Bayesian95% confidence/credible intervals for different case-only sam-ple sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91Table 5.5 Case-control data concerning the interaction of NAT 2 genotypeand smoking status on colorectal cancer . . . . . . . . . . . . 92Table 5.6 Case-only data concerning the number of births and the statusof BRCA1/2 mutations for 832 women with ovarian cancer . . 93ixList of FiguresFigure 3.1 Comparison of the TRAD method, the GEI-U method, and theGEI-K method in terms of efficiency, with a saturated diseaserisk model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48Figure 3.2 Comparison of the TRAD method, the GEI-U method, and theGEI-K method in terms of efficiency, with a reduced diseaserisk model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49Figure 3.3 Comparison of the length of the 95% confidence interval be-tween the TRAD method and the GEI-U method, in the specialbinary case . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Figure 3.4 Comparison of the length of the 95% confidence interval be-tween the GEI-U method and the GEI-K method, in the specialbinary case . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Figure 3.5 Comparison of the length of the 95% confidence interval be-tween the TRAD method and the GEI-U method, in the sce-narios with a saturated disease risk model . . . . . . . . . . . 55Figure 3.6 Comparison of the length of the 95% confidence interval be-tween the GEI-U method and the GEI-K method, in the sce-narios with a saturated disease risk model . . . . . . . . . . . 56Figure 3.7 Comparison of the length of the 95% confidence interval be-tween the TRAD method and the GEI-U method, in the sce-narios with a saturated disease risk model . . . . . . . . . . . 58xFigure 3.8 Comparison of the length of the 95% confidence interval be-tween the GEI-U method and the GEI-K method, in the sce-narios with a reduced disease risk model . . . . . . . . . . . . 58Figure 4.1 Comparison of the length of the 95% confidence interval be-tween the traditional method and the proposed Bayesian GEI-based method . . . . . . . . . . . . . . . . . . . . . . . . . . 74Figure 5.1 Prior distributions of ζ0|θ for different values of disease preva-lence θ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88xiAcknowledgmentsMy deepest debt of gratitude goes to my supervisor, Professor Paul Gustafson, forhis continuous support of my Ph.D. studies. His guidance helped me build my ownskills to ask meaningful questions and find appropriate answers. His patience andencouragement helped me go through the most difficult time when I got stuck inmy research. Without any of these, I would never have been able to reach this far.I would like to thank the members of my supervisory committee, ProfessorAlexandre Bouchard-Coˆte´, Professor Gabriela Cohen-Freue, and Professor IgorBurstyn, for their inspiring questions and insightful comments. I am grateful tofaculty members and staff of the Department of Statistics at UBC for providingsuch a nice academic environment. I would like to thank my fellow students andfriends at UBC for making my study at UBC so enjoyable. Special thanks to Yan-ling Cai, Xin Geng, Yi Huang, Yang Liu, Ji Lv, Mengzhe Shen, Chumeng Wu, HuiYang, Hongyang Zhang, and Tingting Zhao.Last, I leave the warmest part of my heart for my family, my parents (ZhongliangLuo and Liping Wu) and my wife (Shihui Zhong), for their unconditional supportand love. Without them, everything is impossible.xiiDedicationto my parents and my wife.xiiiChapter 1IntroductionGenetic and environmental factors may jointly influence the risk of many com-plex diseases. Individuals with different genotypes may be affected differentlyby exposure to the same environmental factors, and have different disease pheno-types as the result of gene-environment interactions. For example, heavy smokerstend to have higher risk of bladder cancer if they also carry NAT2 slow acetylatorsgenotypes [10]. Thus, epidemiologists are interested in inferring gene-environmentinteraction. The study of gene-environment interactions will lead to better under-standing of the biological mechanisms and pathological processes that contributeto the development of complex diseases. Such an understanding can guide the de-velopment of more efficient measures for preventing or even curing disease. If anindividual carries a genotype that confers susceptibility to a certain disorder in aparticular environment, then the disease may be prevented by reducing exposure tothe environment.1.1 Analysis of case-control dataThe case-control study design is popular in studies of gene-environment interac-tion because it allows a better allocation of resources, where data can be collectedfor more cases. The traditional method for analyzing case-control data is fittingprospective logistic regression models regardless of the nature that data are col-lected retrospectively. Prentice and Pyke [24] showed that this method will lead to a1consistent estimator of the gene-environment interaction. The traditional estimatordoes not rely on any a priori assumptions about the joint distribution of genotypeand environmental exposure. In many settings, however, it is biologically plau-sible to assert that genotype and environmental exposure are independent of oneanother in the source population (hereafter the gene-environment independence as-sumption, or GEI for short), since genotype arises from the random assortment ofchromosomes carrying genes in meiosis [26]. Therefore, more efficient estimatorsof the gene-environment interaction from case-control data may be available byexploiting this assumption.Many authors have investigated the possible benefits of analyzing case-controldata under the GEI assumption or its variants ([23], [28], [3], [21], [22], [4]). Forinstance, Umbach and Weinberg [28] proposed the application of log-linear mod-els to case-control cell counts, using a form of the GEI assumption that assertsgene-environment independence within the population of controls. These authorsacknowledged that this form of the assumption is not natural, particularly as ac-quisition of genotype and environmental exposure are temporally antecedent to thedisease. They justified such an assumption by pointing out that if the disease is rarethen GEI in the source population will imply near GEI in the control population.Moreover, they showed that the estimator of the gene-environment interaction de-rived under this form of the GEI assumption differs from the traditional estimator,with a smaller asymptotic variance.More recently, Chatterjee and Carroll [3] studied the problem of maximum-likelihood estimation for case-control data, assuming GEI in the source population.They argued that, contrary to intuition, the intercept in the prospective relationshipcan in fact be identified under the GEI assumption. Further, they proposed a pro-file likelihood technique to obtain the maximum likelihood estimator based on theretrospective likelihood, and presented simulation results to show that their GEIestimator has considerably lower variance than the traditional estimator. However,looking specifically at the situation where both genotype and environmental expo-sure are binary, Chen and Chen [4] found no efficiency gain for the GEI estimatorover the traditional estimator. They claimed that estimating the intercept term inthe prospective relationship uses up the additional information inherent in the GEIassumption.2With different frameworks inducing different claims about efficiency gains as-sociated with the GEI assumption, epidemiologists are uncertain about the suit-ability of analytical strategies relying upon the assumption. This thesis brings clar-ity to this issue. In Chapter 2, we first take a detour to develop the constrainedmaximum likelihood estimation theory for estimating parameters arising from apartially identified model with some equality constraints introduced by additionalassumptions. This theory is then applied in Chapter 3 for analyzing case-controldata under the GEI assumption through a frequentist approach.Another big hindrance for the use of GEI-based methods is their non-robustnessto the violation of the GEI assumption. Albert et al. [2] discussed different sce-narios where an environmental factor is associated with a genetic marker. Forexample, they may be correlated if both are associated with other (uncontrolledconfounding) factors. Different methods have been proposed that relaxes the GEIassumption. Mukherjee and Chatterjee [21] developed an empirical Bayes-typeshrinkage estimator to trade off between bias and efficiency. Mukherjee et al. [22]proposed a full Bayesian analysis and design strategy to incorporate prior belief onthe assumption of gene-environment independence. However, these methods allfocused on the assumption that genotype and environmental exposure are indepen-dent in the population of controls. In Chapter 4, we develop a Bayesian frameworkfor analyzing case-control data under the GEI assumption (in the source popula-tion), and then generalize it to allow uncertainty about the assumption.1.2 Analysis of case-only dataPiegorsch and Weinberg [23] noticed that, when genotype and environmental expo-sure are independent of each other in the control population, the gene-environmentinteraction odds ratio can be estimated using their association odds ratio in casesalone. Thus, the case-only design, which only collects data on diseased subjects,serves as an alternative to the standard case-control design for studying the gene-environment interaction effect, assuming GEI among controls. In a recent system-atic review, Dennis et al. [6] showed no substantial difference between the case-only and case-control interaction estimates when the assumption indeed holds.Comparing to the standard case-control design, the case-only design has several3advantages. It not only avoids the difficulty of selecting appropriate controls, butalso achieves better precision for estimating interaction effects. More importantly,it greatly saves study resources.Despite its potential value, there are considerable concerns about the validityof the traditional case-only method because of its susceptibility to bias. Albertet al. [2] pointed out that inference made with the traditional case-only methodcan be highly sensitive to the assumption of GEI among controls. Even for smallamounts of gene-environment association among controls, the type I error for test-ing the interaction effects can be seriously inflated and/or the case-only estimatorof the interaction effect can be greatly biased. Thus, conclusions drawn from thetraditional case-only method regarding the existence and/or the magnitude of thegene-environment interaction effect can sometimes be misleading.The assumption of GEI among controls approximately holds under the GEI as-sumption and the rare disease assumption. Even though the GEI assumption canbe supported in theory by the principle of the random assortment of alleles at thetime of gamete formation (Mendel’s second law [26]), and in practice by empiri-cal evidence for some genetic variants and environmental factors, the rare diseaseassumption is not rigorously defined. It is not clear at what disease prevalence theassumption of GEI among controls will approximately hold when the GEI assump-tion actually holds in the source population. Indeed, Gatto et al. [8] reported that,under the GEI assumption, the gene-environment association in the control popula-tion may still not be negligible when the disease is only modestly rare. Therefore,the traditional case-only method may still produce substantial bias for the estima-tion of the gene-environment interaction effect, when the GEI assumption holdsand the disease is rare but not extremely rare. In Chapter 5, we investigate therelationship between the disease prevalence and the performance of the traditionalcase-only method. We also apply the Bayesian framework developed in Chapter 4for analyzing case-only data, where the rare disease assumption is clearly quanti-fied through an appropriate prior distribution on disease prevalence.4Chapter 2The Constrained MaximumLikelihood Estimation withPartially Identified Models2.1 IntroductionIn some scientific studies, due to constraints of logistics and/or resources, data arenot collected in the ideal way. Consequently, the available data may only partiallyidentify the statistical model under consideration, i.e., parameters of the statisticalmodel are identified up to a set of possible values instead of just one single value.The set of parameter values that correspond to the same distribution of observablesis usually termed the identification region. Manski [19] gives an overview of partialidentification and covers many scenarios where partial identification may arise.Of course, point-identification is preferred as it is fundamental for consistentpoint estimation and ensures many nice properties of model-based parameter es-timators. With a partially identified model, when possible one may impose somereasonable assumptions to achieve point-identification. Under such assumptions,the parameter vector is restricted to a subset of the original parameter space. Ifthis constrained parameter space has only a single point of intersection with theidentification region, then the parameter vector is uniquely identified.5In this chapter, we study the maximum-likelihood estimation of parametersarising from a partially identified model with some equality constraints introducedby additional assumptions. In particular, we consider the scenario where thereexists a special re-parameterization of all parameters of the model, which is termeda transparent re-parameterization by Gustafson et al. [14], such that the distributionof observables is completely determined by a proper subset of parameters aftertransformation.In the situation of adding parameter constraints to a model which is identifiedeven without the constraints, Aitchison and Silvey [1] studied the large-samplebehavior of maximum-likelihood estimators via a Lagrange multiplier approach.However, the assumption that the unconstrained version of the model is identifiedis embedded in their approach. Therefore, our work extends their theory to thesituation that identification is only obtained via imposition of the constraints.The rest of this chapter is organized as follows. We first introduce some gen-eral notation and give a mathematical formulation of the problem. We then provethe existence of the constrained maximum likelihood estimate and show that theestimator is asymptotically normally distributed. A numerical algorithm for com-puting the constrained maximum likelihood estimate is also developed. We thenconsider an example problem and use a simulation study to compare the perfor-mance of the proposed method and the general method, which does not depend onconstraints, to investigate the effect of imposing additional assumptions with a par-tially identified model. Moreover, we comment on a special situation where thereis no benefit in terms of estimation efficiency.2.2 Statistical problemSuppose our data consist of n i.i.d. observations x = (x1, . . . ,xn). The statisticalmodel underlying the data is assumed to be initially parameterized in scientificterms via a vector of s parameters. Let ω = (ω1, . . . ,ωs) be a re-parameterizationof the original parameters such that the log-likelihood function ` for the observeddata can be completely determined by its first r elements, say φ = (ω1, . . . ,ωr),through`(x,φ) =n∑i=1log f (xi,φ),6where f (x,φ) denotes the probability density function for an individual observationx. The remaining s− r parameters of ω are represented by another vector ψ =(ωr+1, . . . ,ωs), which cannot be learned from the observed data. Thus, ω = (φ ,ψ)is partially identified with the identified part φ and the unidentified part ψ .Further, we consider additional assumption that impose t equality constraintson ω:h(ω) =h1(ω)...ht(ω)= 0.These equality constraints can be used to identify ψ . Since the dimension of ψ iss− r, we assume that there are at least s− r equations so that ψ can be fully identi-fied. Also, it is reasonable to assume that the number of constraints does not exceedthe number of identified parameters, which is necessary for the development of ourmethod. Thus, we assume that s− r ≤ t ≤ r. Note that the true, though unknown,parameter value ω∗ = (ω∗1 , . . . ,ω∗s ) is presumed to satisfy these constraints itself,i.e., h(ω∗) = 0.Our objective is to find the constrained maximum likelihood estimate ωˆ thatmaximizes the log-likelihood function `(x,φ) subject to the condition h(ω) = 0,and study the properties of the estimator. Moreover, we will propose a numericalalgorithm for computing the constrained maximum likelihood estimate in practice.2.3 The constrained maximum likelihood estimationLet φˆ (u) denote the unconstrained maximum likelihood estimate under general con-ditional without additional assumptions. If the equation h(φˆ (u),ψ) = 0 with respectto ψ has a solution, say ψˆ(c), then ωˆ = (φˆ (u), ψˆ(c)) is the constrained maximumlikelihood estimate of the problem. This approach may fail, however, since theequation h(φ ,ψ) = 0 with respect to ψ may not necessarily have a solution forsome values of φ . Alternatively, we propose to estimate ω by finding the stationarypoint of (1/n)`(x,φ)+λ T h(ω), where λ = (λ1, . . . ,λt) is a Lagrange multiplier.7Thus, we consider the following s+ t equations:1ns(x,φ)+Jωλ = 0, (2.1)Kωλ = 0, (2.2)h(ω) = 0, (2.3)where s(x,φ) is the score vector of length r whose i-th component is ∂`(x,φ)/∂ωi,for i = 1, . . . ,r, Jω is the r× t matrix (∂h j(ω)/∂ωi), for i = 1, . . . ,r, j = 1, . . . , t,and Kω is the (s− r)× t matrix (∂h j(ω)/∂ωr+i), for i = 1, . . . ,s− r, j = 1, . . . , t.In this section, we will show that, under some general conditions, if x belongsto a set whose probability measure tends to 1 as n approaches infinity, then theequations (2.1) - (2.3) have a solution (ωˆ, λˆ ) such that ωˆ is within a small neigh-borhood of the true value ω∗. This solution is then proved to be the constrainedmaximum likelihood estimate that maximizes `(x,φ) subject to h(ω) = 0. Wethen extend the definition of (ωˆ, λˆ ) for all x ∈ Rn, and show the asymptotic distri-bution of the random variable thus defined. Finally, we propose an algorithm fornumerically computing (ωˆ, λˆ ). The development of this section is based on thework by Aitchison and Silvey [1]. However, due to the presence of the unidentifiedcomponent ψ , our work is more than a simple generalization of their results.We first impose some conditions on f (x,φ) and h(ω) within some neighbor-hood of ω∗, say Uα = {ω : ||ω−ω∗|| ≤ α}. We assume that f (x,φ) satisfies theconditions (F1) - (F4) as defined in [1]. These conditions are quite general andwill be satisfied in most practical estimation problems. Here, we just write oneimportant result implied by these conditions for later reference. If the conditionson f (x,φ) are satisfied, for any given positive numbers δ < α and ε < 1 and forsufficiently large n≥ n(δ ,ε), there exists a set Xn with the properties(X 1) Pr{Xn}> 1− ε .(X 2) ||s(x,φ ∗)/n||< δ 2, if x ∈ Xn.(X 3)(Mx,φ∗/n)can be expressed in the form −Bφ∗+δmx,φ∗ , where Mx,φ∗ is thematrix (∂ 2`(x,φ ∗)/∂ωi∂ω j), i, j = 1, . . . ,r, Bφ∗ is a certain positive defi-nite matrix, and mx,φ∗ is an r× r matrix, the moduli of whose elements are8bounded by 1, if x ∈ Xn.(X 4) There exists a constant, say κ1, such that for every ω ∈ Uα and i, j,k =1,2, . . . ,r, ∣∣∣∣1n ∂ 3`(x,φ)∂ωi∂ω j∂ωk∣∣∣∣< 2κ1,if x ∈ Xn.On the other hand, some conditions are assumed for the constraint function h(ω)as follows.(H 1) For every ω ∈Uα , the first order partial derivatives ∂hk(ω)/∂ωi, i= 1, . . . ,s,k = 1, . . . , t, exist and they are continuous function of ω .(H 2) For every ω ∈ Uα , the second order partial derivatives ∂ 2hk(ω)/∂ωi∂ω j,i, j = 1, . . . ,s, k = 1, . . . , t, exist, and they are uniformly bounded by a con-stant, say κ2, on Uα .(H 3) The r× t matrix Jω∗ and the (s− r)× t matrix Kω∗ are both of full rank, i.e.,rank(Jω∗) = t and rank(Kω∗) = s− r.2.3.1 The constrained maximum likelihood estimateWe begin by establishing a necessary and sufficient condition for the existence ofa solution to the equations (2.1) - (2.3) under some general conditions. It shouldbe noted that the following lemma cannot be directly generalized from Lemma 1in [1] by simply viewing the log-likelihood function as a function of ω and lettingBω∗ be the s× s matrix that naturally extends Bφ∗ , due to the singularity of Bω∗thus defined. Therefore, some modifications are required.Lemma 1. Suppose conditions on f and h are satisfied, and Xn is a set with theproperties (X 1) - (X 3) for some given positive numbers δ < α and ε < 1. Whenx ∈Xn and α is sufficiently small, then (ωˆ , λˆ ) is a solution of the equations (2.1) -(2.3) and ωˆ ∈Uδ , if and only if ωˆ satisfies a certain equation. This equation takesthe form −B˜ω∗(ω−ω∗)+δ 2v(x,ω) = 0, whereB˜ω∗ = Bφ∗ 00 Is−r ,9and v(x,ω) is a continuous function of ω on Uδ and ||v(x,ω)|| is bounded on Uδby a positive number κ†, which does not depend on δ .Proof. We first prove the necessity of the condition. By expanding the componentsof s(x,φ) at φ ∗ in the equation (2.1), and the components of h(ω) at ω∗ in theequation (2.3), we find that the solution of the equations (2.1) - (2.3) should alsosatisfy:1n{s(x,φ ∗)+Mx,φ∗(φ −φ ∗)+v(1)(x,φ)}+Jωλ = 0, (2.4)JTω∗(φ −φ ∗)+KTω∗(ψ−ψ∗)+v(2)(ω) = 0, (2.5)where(i) v(1)(x,φ) is a vector of dimension r whose m-th component is12(φ −φ ∗)T Lm(φ −φ ∗),where Lm is the matrix (∂ 3`(x,φ (m,1))/∂ωm∂ωi∂ω j), i, j = 1, . . . ,r, withφ (m,1) being a point such that ||φ (m,1)−φ ∗||< ||φ −φ ∗||, and(ii) v(2)(ω) is a vector of dimension s whose m-th component is12(ω−ω∗)T Hm(ω−ω∗),where Hm is the matrix (∂ 2hm(ω(m,2))/∂ωi∂ω j), i, j = 1, . . . ,s, with ω(m,2)being a point such that ||ω(m,2)−ω∗||< ||ω−ω∗||.Further, given property (X 3) , we can re-write the equations (2.4) and (2.5) in thefollowing form:−Bφ∗(φ −φ ∗)+Jωλ +δ 2v(3)(x,φ) = 0, (2.6)JTω∗(φ −φ ∗)+KTω∗(ψ−ψ∗)+δ 2v(4)(ω) = 0, (2.7)10wherev(3)(x,φ) =1nδ 2s(x,φ ∗)+1δmx,φ∗(φ −φ ∗)+ 1nδ 2 v(1)(x,φ), (2.8)v(4)(ω) =1δ 2v(2)(ω). (2.9)Moreover, by properties (X 2) - (X 4), we obtain a bound for v(3)(x,φ) as||v(3)(x,φ)|| ≤ 1nδ 2||s(x,φ ∗)||+ 1δ||mx,φ∗(φ −φ ∗)||+ 1nδ 2 ||v(1)(x,φ)||< 1+ r2+ r3κ1, (2.10)and, by condition (H 2), we have a bound for v(4)(ω) as||v(4)(ω)||< s3κ2(1δ 2||ω−ω∗||2)< s3κ2. (2.11)Next, since Bφ∗ is positive definite, we can pre-multiply the equation (2.6)by JTω∗B−1φ∗ to get an expression for JTω∗(φ − φ ∗), which is then plugged into theequation (2.7) to obtain the following equationJTω∗B−1φ∗ Jωλ +KTω∗(ψ−ψ∗)+δ 2(JTω∗B−1φ∗ v(3)(x,φ)+v(4)(ω))= 0. (2.12)Now the condition (H 3) implies that JTω∗B−1φ∗ Jω∗ is also positive definite. Besides,according to the condition (H 1), the elements of Jω are all continuous functionsof ω . It then follows that JTω∗B−1φ∗ Jω is also non-singular within Uα so long as αis sufficiently small. Thus, we can solve the equation (2.12) with respect to λ andexpress it in terms of ωλ =−Aω{KTω∗(ψ−ψ∗)+δ 2(JTω∗B−1φ∗ v(3)(x,φ)+v(4)(ω))}, (2.13)where we define the notation Aω = (JTω∗B−1φ∗ Jω)−1.So far, we are basically replicating the steps of the proof given by [1]. Now,we need to take some extra steps to find the expression for (ψ−ψ∗). By applying11the equation (2.13) to substitute for λ , the equation (2.2) becomes:KωAωKTω∗(ψ−ψ∗)+δ 2KωAω(JTω∗B−1φ∗ v(3)(x,φ)+v(4)(ω))= 0. (2.14)Following the same argument for JTω∗B−1φ∗ Jω , the condition (H 4) ensures that thematrix KωAωKTω∗ is not singular within Uα provided that α is sufficiently small.Thus, we can solve the equation (2.14) with respect to ψ and getψ−ψ∗ =−δ 2v(5)(x,ω), (2.15)wherev(5)(x,ω) =(KωAωKTω∗)−1(KωAω)(JTω∗B−1φ∗ v(3)(x,φ)+v(4)(ω)). (2.16)We then plug the equation (2.15) into the equation (2.13) and derive an updatedexpression for λ :λ =−δ 2v(6)(x,ω), (2.17)wherev(6)(x,ω) = Aω{−KTω∗v(5)(x,ω)+(JTω∗B−1φ∗ v(3)(x,φ)+v(4)(ω))}. (2.18)By combining the equations (2.6) and (2.15), with λ substituted using the equa-tion (2.17), we find that the solution of the equations (2.1) - (2.3) should also satisfy− B˜ω∗(ω−ω∗)+δ 2v(x,ω) = 0, (2.19)whereB˜ω∗ = Bφ∗ 00 Is−r ,andv(x,ω) = v(3)(x,φ)−Jωv(6)(x,ω)−v(5)(x,ω) .Finally, we have shown in the inequalities (2.10) and (2.11) that ||v(3)(x,φ)||12and ||v(4)(ω)|| are uniformly bounded within Uα . Also, given that Aω and KωAωKTω∗are positive definite within the closed set Uα , their determinants are both positivewithin Uα . Therefore, the continuity of the elements of these two matrices en-sures that their determinants are uniformly bounded within Uα . Then it followsthat v(x,ω) is a continuous function on Uδ and ||v(x,ω)|| is bounded on Uδ by apositive number, say κ†, which does not depend on δ .Now, we prove the sufficiency of the condition. Suppose the equation (2.19)has a solution ωˆ . That is, ωˆ satisfies Bφ∗ 00 Is−r φˆ −φ ∗ψˆ−ψ∗= δ 2 v(3)(x, φˆ)−Jωˆv(6)(x, ωˆ)−v(5)(x, ωˆ) . (2.20)By pre-multiplying the equation (2.20) by the t×s matrix (JTω∗B−1φ∗ , KTω∗), we haveJTω∗(φˆ −φ ∗)+KTω∗(ψˆ−ψ∗)+δ 2v(4)(ωˆ) = 0. (2.21)We first write v(1)(x,φ) and v(2)(ω) as the remainders after expanding s(x,φ) andh(ω), respectively,v(1)(x,φ) = s(x,φ)− s(x,φ ∗)−Mx,φ∗(φ −φ ∗), (2.22)v(2)(ω) = h(ω)−JTω∗(φ −φ ∗)−KTω∗(ψ−ψ∗). (2.23)Applying the equations (2.22) and (2.23) to substitute for v(1)(x,φ) and v(2)(ω) inthe equations (2.8) and (2.9), respectively, we getv(3)(x,φ) =1δ 2{1ns(x,φ)+Bφ∗(φ −φ ∗)}, (2.24)v(4)(ω) =1δ 2{h(ω)−JTω∗(φ −φ ∗)−KTω∗(ψ−ψ∗)}. (2.25)Finally, we substitute for v(4)(ωˆ) in the equation (2.21) using the equation (2.25).It immediately follows that h(ωˆ) = 0.Next, we apply the equations (2.24) and (2.25) to substitute for v(3)(x,φ) andv(4)(ω) in the equations (2.16) and (2.18), and end with the following expressions13for v(5)(x,ω) and v(6)(x,ω):v(5)(x,ω) =−(ψ−ψ∗)+ (KωˆAωKTω∗)−1 KωYω(1ns(x,φ)), (2.26)v(6)(x,ω) = Yω(1ns(x,φ))−KTω∗(KωAωKTω∗)−1 KωYω(1ns(x,φ)), (2.27)where Yω is defined as Yω = AωJTω∗B−1φ∗ . Now, by using the equations (2.24),(2.26) and (2.27) to substitue for v(3)(x, φˆ), v(5)(x, ωˆ) and v(6)(x, ωˆ) in the equation(2.20), respectively, we can see that ωˆ satisfies1ns(x, φˆ)−JωˆYωˆ(1ns(x, φˆ))= 0,−KωˆYωˆ(1ns(x, φˆ))= 0.As we have shown earlier that h(ωˆ) = 0, it is easy to see that ωˆ , jointly withλˆ =−Yωˆs(x, φˆ)/n, solves the equations (2.1) - (2.3).We now give the following theorem to show the existence of a solution of theequations (2.1) - (2.3).Theorem 1. Subject to conditions on f and h, if x ∈ Xn for a sufficiently smallgiven positive number δ and another given positive number ε < 1, then the equa-tions (2.1) - (2.3) have a solution (ωˆ, λˆ ) such that ωˆ ∈Uδ .Proof. The proof of Theorem 1 in [1] works here, provided the modified versionof Lemma 1 given above is used. Also, it is important to notice that the matrix B˜ω∗defined in Lemma 1 is positive definite provided that Bφ∗ is positive definite, andits minimum latent root is min{µ0,1}, where µ0 is the latent minimum root of Bφ∗ .Details are omitted.In the remainder of this section, we are going to show that the solution of theequations (2.1) - (2.3) as stated in Theorem 1 locally maximizes the log-likelihoodsubject to the constraints. This result was proved in [1] for the identified model.However, we are not able to prove this result for the partially identified model with14a direct extension of their proof. Alternatively, we take another route and use theresult given by Spring [27].To match with the set-up in [27], we change the order of variables and letη = (λ ,ω). Let HT denote the second order partial derivatives of the Lagrangianfunction `(x,φ)/n+λ T h(ω) evaluated at the critical point ηˆ = (λˆ , ωˆ)HT(n) =0 JTωˆ KTωˆJωˆ 1n Mφˆ +Xλˆ ,ωˆ YTλˆ ,ωˆKωˆ Yλˆ ,ωˆ Zλˆ ,ωˆ ,where Xλˆ ,ωˆ YTλˆ ,ωˆYλˆ ,ωˆ Zλˆ ,ωˆ= t∑k=1λˆk∂ 2hk∂ω1∂ω1 · · ·∂ 2hk∂ω1∂ωs.... . ....∂ 2hk∂ωs∂ω1 · · ·∂ 2hk∂ω1s∂ωs ,with Xλˆ ,ωˆ being the upper-left r× r block matrix, Yλˆ ,ωˆ being the bottom-left r×(s−r) block matrix, and Zλˆ ,ωˆ being the bottom-right (s−r)×(s−r) block matrix.Let Λ(n)k denote the principal upper left k-th order minor of the Hessian MatrixHT(n). According to Theorem 1 of [27], ωˆ locally maximizes the log-likelihoodfunction subject to the constraints, so long as (−1)t+pΛ(n)2t+p, p = 1, . . . ,s− t, areall positive.Note that λˆ was defined as λˆ =−Yωˆ(s(x, φˆ)/n). For any small number δ , bythe equation (2.24) and the inequality (2.10), if n is sufficiently large, we have1n||s(x, φˆ)||= ||−Bφ∗(φˆ −φ ∗)+δ 2v(3)(x, φˆ)||< κ3δ +(1+ r2+ r3κ1)δ 2,where κ3 is a positive number that depends only on the elements of Bφ∗ . Also, theelements of Yωˆ are bounded by a number independent of δ for ωˆ ∈Uδ . Therefore,15we have||λˆ ||= 1n||Yωˆs(x, φˆ)||< κ4δ +κ5δ 2,where κ4 and κ5 are positive numbers independent of δ . That is, λˆ converges to0 as n goes to infinity (δ → 0). By condition (H 2), the second partial derivatives∂ 2hk(ω)/∂ωi∂ω j, i, j = 1, . . . ,s, k = 1, . . . ,k, are all bounded by a constant 2κ2.Thus, it follows that Xλˆ ,ωˆ→ 0r×r, Yλˆ ,ωˆ→ 0r×(s−r), and Zλˆ ,ωˆ→ 0(s−r)×(s−r). Also,it is easy to see from Theorem 1 that, for ωˆ ∈Uδ with sufficiently small value ofδ , ωˆ converges to ω∗ as n goes to infinity. By condition (H 1), the elements ofJω and Kω are all continuous functions of ω . Thus, as n goes to infinity, Jωˆ , Kωˆ ,and Mφˆ/n approach Jω∗ , Kω∗ and Mφ∗/n, respectively. Furthermore, by property(X 2), we have Mφ∗/n approaches −Bφ∗ as n goes to infinity. Finally, we haveHT(n) converges to HT(∞) as n goes to infinity, whereHT(∞) =0 JTω∗ KTω∗Jω∗ −Bφ∗ 0Kω∗ 0 0 .Then, for sufficiently large n, the signs of the leading principal minors of HT(n)are the same as those of their corresponding minors of HT(∞). Therefore, we caninstead study the signs of the leading principal minors of HT(∞).For brevity, we suppress the subscripts ω∗ and φ ∗. First, given that B is positivedefinite, by Sylvester’s criterion the upper left d× d corner matrix of B, denotedby Bd , is also positive definite, for d = 1, . . . ,r. Next, since rank(J) = t, with somere-ordering of the rows if necessary, the first d rows of J, denoted by Jd , is a d× tmatrix of full column rank t, and thus the matrix JTd B−1d Jd is positive definite, ford = t + 1, . . . ,r. Similarly, as rank(K) = s− r, the first d rows of K, denoted byKd , is a d× t matrix of full row rank d, and thus the matrix Kd(JT B−1J)−1 KTd isagain positive definite, for d = 1, . . . ,s− r. Now we are ready to study the sign of16(−1)t+pΛ(∞)2t+p, for p = 1, . . . ,s− t. On one hand, for p = 1, . . . ,r− t, we have(−1)t+pΛ(∞)2t+p = (−1)t+p×det 0 JTt+pJt+p −Bt+p= (−1)t+p×det(−Bt+p)×det(−JTt+p (−Bt+p)−1 Jt+p)= (−1)2t+2p×det(Bt+p)×det(JTt+pB−1t+pJt+p)> 0.On the other hand, for p = r− t+1, . . . ,s− t, we have(−1)t+pΛ(∞)2t+p = (−1)t+p×det0 JT KTt+p−rJ −B 0Kt+p−r 0 0= (−1)t+p×det 0 JTJ −B×det− KTt+p−r0 0 JTJ −B−1( Kt+p−r 0 )= (−1)t+p×det(−B)×det(JT B−1J)×det(−Kt+p−r(JT B−1J)−1 KTt+p−r)= (−1)2t+2p×det(B)×det(JT B−1J)×det(Kt+p−r(JT B−1J)−1 KTt+p−r)> 0.Therefore, we have shown that (−1)t+pΛ(∞)2t+p, p = 1, . . . ,s− t, is always positive,and so is (−1)t+pΛ(n)2t+p for sufficiently large n. Thus, it follows that ωˆ is theconstrained maximum likelihood estimate of the problem.172.3.2 Asymptotic distributionsIn this section, we define sequences{(ωˆn, λˆ n)}that extends (ωˆ, λˆ ), as stated inthe Theorem 1, for all x∈Rn, and develop the asymptotic distribution for (ωˆn, λˆ n).Note that this section differs from the Section 5 of [1] in that the covariance matrixhere becomes a partitioned matrix of 3×3 blocks.Lemma 2. The following partitioned matrix is non-singular.Bφ∗ 0 −Jω∗0 0 −Kω∗−JTω∗ −KTω∗ 0Proof. For brevity, we omit the suffix φ ∗ and ω∗. Then we wish to find a matrixP11 P12 P13P21 P22 P23P31 P32 P33such thatB 0 −J0 0 −K−JT −KT 0P11 P12 P13P21 P22 P23P31 P32 P33=Ir 0 00 Is−r 00 0 It .18Since B is positive definite, and J and K are of full rank, it can be solved thatP11 = B−1−B−1J(JT B−1J)−1JT B−1+B−1J(JT B−1J)−1KT{K(JT B−1J)−1KT}−1 K(JT B−1J)−1JT B−1,P12 =−B−1J(JT B−1J)−1KT{K(JT B−1J)−1KT}−1,P13 =−B−1J(JT B−1J)−1+B−1J(JT B−1J)−1KT{K(JT B−1J)−1KT}−1 K(JT B−1J)−1,P22 ={K(JT B−1J)−1KT}−1,P23 =−{K(JT B−1J)−1KT}−1 K(JT B−1J)−1,P33 =−(JT B−1J)−1+(JT B−1J)−1KT{K(JT B−1J)−1KT}−1 K(JT B−1J)−1,and P21, P31, and P32 are the transposes of P12, P13, and P23, respectively, as it iseasy to see that the matrix is symmetric.Suppose x ∈ Xn, δ is small enough for Theorem 1 to apply, and (ωˆ, λˆ ) is asolution of equations (2.1) - (2.3) such that ωˆ ∈Uδ . We now write the equations(2.1) - (2.3) in a different form:Bφ∗+ bˆ(x) 0 −Jω∗− jˆ(x)0 0 −Kω∗− kˆ(x)−JTω∗− jˆ′(x) −KTω∗− kˆ′(x) 0φˆ −φ ∗ψˆ−ψ∗λˆ=1n s(x,φ∗)00 ,(2.28)where bˆ(x), jˆ(x), jˆ′(x), kˆ(x), and kˆ′(x) are matrices whose elements tend to 0 as δgoes to 0. Thus, by Lemma 2, if δ is sufficiently small, then the matrixBφ∗+ bˆ(x) 0 −Jω∗− jˆ(x)0 0 −Kω∗− kˆ(x)−JTω∗− jˆ′(x) −KTω∗− kˆ′(x) 019is also non-singular and we write its inverse asPˆ11(x) Pˆ12(x) Pˆ13(x)Pˆ21(x) Pˆ22(x) Pˆ23(x)Pˆ31(x) Pˆ32(x) Pˆ33(x) .Thus, if δ is sufficiently small and if x ∈Xn, we can solve from the equation (2.28)that φˆ −φ ∗ψˆ−φ ∗λˆ=Pˆ11(x) Pˆ12(x) Pˆ13(x)Pˆ21(x) Pˆ22(x) Pˆ23(x)Pˆ31(x) Pˆ32(x) Pˆ33(x)1n s(x,φ∗)00 . (2.29)Since the asymptotic distribution of s(x,φ ∗)/n is known, we can use the aboverelationship to induce the asymptotic distribution of (ωˆ, λˆ ). However, this mayonly be valid for x ∈ Xn, and we need to extend it to also account for x /∈ Xn.Let (δm), (εm) be two decreasing sequences of positive real numbers, such thatδ1 < µ1/κ3, ε1 < 1, and δm and εm both tend to 0 as m goes to infinity. Definean increasing sequence (nm) of integers such that, if n ≥ nm, there exists a setXn with properties (X 1) - (X 4) for ε = εm and δ = δm. For m = 1,2, . . . ,, ifnm ≤ n < nm+1, we choose a set Xn with properties (X 1) - (X 4) for ε = εm andδ = δm. When x∈Xn, the equations (2.1) - (2.3) have a solution (ωˆn, λˆ n) such that||ωˆn−ω∗||< δm, with ωˆn being the constrained maximum likelihood estimate forω . Thus, ωˆn and λˆ n satisfy the equation (2.29). When x /∈ Xn, we defineφˆ n−φ ∗ψˆn−ψ∗λˆ n=P11 P12 P13P21 P22 P23P31 P32 P331n s(x,φ∗)00 ,where Pi j, i, j = 1,2,3, are defined in the proof of Lemma 2. Note that the prob-ability of x /∈ Xn goes to zero as n goes to infinity. Thus, we have defined twosequences of random variables, (ωˆn) and (λˆ n), n = nm,nm+1, . . . , which have theproperty that ωˆn converges in probability to ω∗ as n goes to infinity. Moreover, ωˆn20and λˆ n jointly satisfy the equations (2.1) - (2.3).Theorem 2.√nφˆ n−φ ∗ψˆn−ψ∗λˆ n d→Ns+t000 ,P11 P12 0P21 P22 00 0 −P33 .Proof. If x /∈ Xn, we define Pˆi j(x) = Pi j, i, j = 1,2,3. Then, for sufficiently largen, we have√nφˆ n−φ ∗ψˆn−ψ∗λˆ n=Pˆ11(x) Pˆ12(x) Pˆ13(x)Pˆ21(x) Pˆ22(x) Pˆ23(x)Pˆ31(x) Pˆ32(x) Pˆ33(x)√n1n s(x,φ∗)00 .Since bˆ(x), jˆ(x), jˆ∗(x), kˆ(x), and kˆ∗(x) all tend to 0 as δ → 0, it follows that theelements of Pˆ11(x) Pˆ12(x) Pˆ13(x)Pˆ21(x) Pˆ22(x) Pˆ23(x)Pˆ31(x) Pˆ32(x) Pˆ33(x)converge in probability to the elements ofP11 P12 P13P21 P22 P23P31 P32 P33 .Moreover, it is known that the asymptotic distribution of (s(x,φ ∗)/n) is normalwith mean zero and asymptotic variance Bφ∗ . Thus, we have√n1n s(x,φ∗)00 d→Ns+t000 ,Bφ∗ 0 00 0 00 0 0 .21It then follows that the asymptotic distribution of√n(φˆ n−φ ∗, ψˆn−ψ∗, λˆ n)isNs+t000 ,P11 P12 P13P21 P22 P23P31 P32 P33Bφ∗ 0 00 0 00 0 0P11 P12 P13P21 P22 P23P31 P32 P33T .Finally, using the expressions for Pi j, i, j = 1,2,3, that were derived in the proof ofLemma 2, it can be verified that the asymptotic variance simplifies toP11 P12 0P21 P22 00 0 −P33 .The result then follows.2.3.3 Numerical algorithmThe solution of the equations (2.1) - (2.3), say ωˆ = (φˆ , ψˆ), usually does not havea closed form, and thus must be computed numerically. We may immediatelyconsider the Newton-Raphson method to solve the problem. However, that methodrequires the form of the Hessian matrix of h(ω), which is an s× s matrix and maybe very complicated, especially when s is large. Thus, we follow the approachproposed by [1] and develop an algorithm that is easier to implement.Suppose ω(0) = (φ (0),ψ(0)) is an initial guess for ωˆ such that ||ω(0)− ωˆ|| issmall. Then we consider a first order of approximation to s(x, φˆ) and h(ωˆ):s(x, φˆ)≈ s(x,φ (0))+Mx,φ (0)(φˆ −φ (0)),h(ωˆ)≈ h(ω(0))+JTω(0)(φˆ −φ (0))+KTω(0)(ψˆ−ψ(0)).Also, we assume that λˆ is close to 0 when n is large. Then to a first order of22approximation, we haveJωˆ λˆ ≈ Jω(0) λˆ ,Kωˆ λˆ ≈Kω(0) λˆ .Since ωˆ and λˆ jointly satisfy the equations (2.1) - (2.3), they should also approxi-mately satisfy−1n Mx,φ (0) 0 −Jω(0)0 0 −Kω(0)−JTω(0) −KTω(0) 0φˆ −φ (0)ψˆ−ψ(0)λˆ≈1n s(x,φ(0))0h(ω(0)) .When n is large,−Mx,φ (0)/n should be close to Bφ (0) . Thus, we use Bφ (0) to approx-imate −Mx,φ (0)/n. Finally, we have the formula for updating ω(0), and in generalfor updating ω(r−1) in the r-th iteration, φ (r)λ (r)= φ (r−1)0+ Bφ (r−1) −Hφ (r−1)−HTφ (r−1)0−1 1n s(x,φ (r−1))h(φ (r−1)) .If the sequence{(ω(r),λ (r))}converges, then it converges to a solution of theequation (2.1) - (2.3). The convergence of these sequences may depend on theinitial value used. For our problem, we can use the unconstrained maximum like-lihood estimate of φ plus a reasonable guess on the value of ψ as the initial valuefor ω , which we believe provides a good starting point. Thus, we expect thesesequences to converge in most practical situations. Finally, it should be noted thatλ (r−1) is actually missing from the right hand side of the above equation. Thus, theupdating procedure only needs to store the current value of ω = (φ ,ψ) for the nextiteration.232.4 Example problem and simulation studyIn this section, we use the proposed method to solve a missing data problem, whereparameters associated with the missing mechanism may only be identified withadditional assumptions. This sort of problem might otherwise be tackled with anexpectation-maximization algorithm. More specifically, consider a binary responsevariable Y and two binary explanatory variables X1 and X2. The probability ofhaving Y = 1 given (X1,X2) is assumed to be determined through a logistic model:logitPr(Y = 1|X1,X2) = β0+β1X1+β2X2+β3X1X2,where logit(p) = log(p)− log(1− p). Suppose we can observe X1 and X2 foreveryone sampled, but the status of Y is missing for some people. Let R indicatemissingness. The data structure is displayed in Table 1, where ni jk is the numberof subjects with complete data of (Y = i,X1 = j,X2 = k,R = 1), and m jk is thenumber of subjects with incomplete data of (X1 = j,X2 = k,R = 0), i, j,k = 0,1.The corresponding cell probabilities, as enclosed in parentheses in the Table 2.1,areri jk = Pr(Y = i,X1 = j,X2 = k,R = 1),s jk = Pr(X1 = j,X2 = k,R = 0),for i, j,k = 0,1. Based on Table 2.1, the log-likelihood of data is:`= ∑i, j,kni jk logri jk +∑j,km jk logs jk.In order to understand the relationship between Y and (X1,X2), we need to infer theproportions of subjects with Y = 1 among the groups of incomplete datat jk = Pr(Y = 1|X1 = j,X2 = k,R = 0),for j,k = 0,1. However, these quantities are not identifiable from data withoutadditional assumptions.Now, we make two assumptions. First, we assume that the status of Y is miss-24Y = 0, R = 1 Y = 1, R = 1 Y =?, R = 0X1 = 0, X2 = 0, n000 (r000) n100 (r100) m00 (s00)X1 = 1, X2 = 0, n010 (r010) n110 (r110) m10 (s10)X1 = 0, X2 = 1, n001 (r001) n101 (r101) m01 (s01)X1 = 1, X2 = 1, n011 (r011) n111 (r111) m11 (s11)Table 2.1: Data structure for the example problem considered in Section 2.4.ing at random, i.e., R and Y are conditionally independent given (X1,X2). Thisassumption imposes four constraints on parameters, and implieslog t jk− log(1− t jk) = logr1 jk− logr0 jk,for j,k = 0,1. Secondly, we assume that the effects of X1 and X2 on Y are ad-ditive on the logit scale, which means that the interaction effect β3 is zero. Thisassumption introduces one more constraint on parameters aslog(r100+ s00t00)(r111+ s11t11)(r101+ s01t01)(r110+ s10t10)= log(r000+ s00(1− t00))(r011+ s11(1− t11))(r001+ s01(1− t01))(r010+ s10(1− t10)) .Under these two assumptions, we can apply the proposed method to obtained themaximum likelihood estimates rˆi jk, sˆ jk, and tˆ jk, i, j,k = 0,1, subject to the abovefive constraints. Next, the constrained maximum likelihood estimates for the maineffects of X1 and X2 can be deduced throughβˆ1 = logrˆ110+ sˆ10tˆ10rˆ100+ sˆ00tˆ00− log rˆ010+ sˆ10(1− tˆ10)rˆ000+ sˆ00(1− tˆ00) ,βˆ2 = logrˆ101+ sˆ01tˆ01rˆ100+ sˆ00tˆ00− log rˆ001+ sˆ01(1− tˆ01)rˆ000+ sˆ00(1− tˆ00) ,and the corresponding estimated variances can be obtained by the delta method.Finally, based on the above problem, we conduct a simulation study to illustratethe performance of the proposed method. In particular, we randomly generate10000 datasets of size 1000 under the parameter setting β0 = logit 0.1, β1 = log2,25β2 = log3, β3 = 0, andPr(X1 = 0,X2 = 0) = 0.4, Pr(R = 0|X1 = 0,X2 = 0) = 0.2,Pr(X1 = 1,X2 = 0) = 0.3, Pr(R = 0|X1 = 1,X2 = 0) = 0.1,Pr(X1 = 0,X2 = 1) = 0.2, Pr(R = 0|X1 = 0,X2 = 1) = 0.05,Pr(X1 = 1,X2 = 1) = 0.1, Pr(R = 0|X1 = 1,X2 = 1) = 0.05.For each dataset, we apply the proposed method to obtain the constrained maxi-mum likelihood estimates for βˆ1 and βˆ2, and the associated 95% confidence inter-vals. Our simulation results show that the empirical biases for the estimators of β1and β2 are 0.0033 and 0.0029, respectively. Correspondingly, the coverage prob-abilities of the 95% confidence intervals are 95.1% and 95.2%, which match wellwith the nominal level. We can see that the proposed method performs very well.2.5 Just- and over-identified situationsIn the previous section, we have considered a partially identified model with fournon-identifiable parameters and made additional assumptions that impose five con-straints on parameters. Consequently, the constrained maximum likelihood estima-tors for the identifiable parameters, ri jk’s and s jk’s, i, j,k = 0,1, differ from theirunconstrained estimators. More importantly, compared to the unconstrained esti-mators, the constrained estimators are associated with smaller variances. For ex-ample, under the parameter setting considered in the previous section, the varianceof the asymptotic distribution of the the unconstrained estimator for (r000, . . . ,r111)26is 0.205 −0.06 −0.04 −0.02 −0.01 −0.01 −0.01 −0.01−0.06 0.172 −0.03 −0.01 −0.01 −0.01 −0.01 −0.01−0.04 −0.03 0.122 −0.01 −0.01 −0.01 −0.01 −0.01−0.02 −0.01 −0.01 0.054 −0.00 −0.00 −0.00 −0.00−0.01 −0.01 −0.01 −0.00 0.031 −0.00 −0.00 −0.00−0.01 −0.01 −0.01 −0.00 −0.00 0.047 −0.00 −0.00−0.01 −0.01 −0.01 −0.00 −0.00 −0.00 0.045 −0.00−0.01 −0.01 −0.01 −0.00 −0.00 −0.00 −0.00 0.037,and the asymptotic variance of the corresponding constrained estimator is0.197 −0.06 −0.03 −0.02 −0.00 −0.02 −0.02 −0.00−0.06 0.165 −0.04 −0.01 −0.02 −0.00 −0.00 −0.02−0.03 −0.04 0.115 −0.00 −0.01 0.00 0.00 −0.01−0.02 −0.01 −0.00 0.046 0.01 −0.01 −0.01 0.01−0.00 −0.02 −0.01 0.01 0.023 0.01 0.01 −0.01−0.02 −0.00 0.00 −0.01 0.01 0.039 −0.01 0.01−0.02 −0.00 0.00 −0.01 0.01 −0.01 0.038 0.01−0.00 −0.02 −0.01 0.01 −0.01 0.01 0.01 0.029.By comparing the elements along the diagonal of these two matrices, it is clear thatthe constrained estimator is more efficient than the unconstrained estimator for theproblem considered in the previous section.However, if we only make the missing at random assumption and allow themodel for Y |X1,X2 to be saturated, then we have only four constraints for fournon-identifiable parameters. In this case, we find that the constrained and uncon-strained maximum likelihood estimators for the identifiable parameters always co-incide and have the same asymptotic distribution. Thus, making the missing atrandom assumption alone leads to no efficiency gain.27Generally, we say that the parameters are over-identified when the number ofconstraints is greater than the number of unidentified parameters. In this case, theconstrained maximum likelihood estimator differs from the unconstrained estima-tor and achieves better efficiency. On the other hand, we say that the parametersare just-identified when the number of constraints is equal to the number of uniden-tified parameters. If that is the case, then the constrained estimator will coincidewith the unconstrained estimator, at least asymptotically. Moreover, identifyingthe unidentified parameters uses up the information provided by the additionalconstraints and thus a more efficient estimator is not available. This phenomenawas also observed by Chen and Chen [4] in the context of a gene-environmentindependence problem.2.6 ConclusionParameters arising from a partially identified model can be estimated when we haveenough equality constraints enforced by additional assumptions. Moreover, theconstrained maximum likelihood estimator for the identified part may or may notcoincide with its unconstrained counterpart, and achieves higher efficiency whenthey do not coincide.Another possibility for estimating parameters of a partially identified modelsubject to constraints is to exploit a reduced-form parameterization that is free ofconstraints. However, the capability of such approach is limited, as a closed formfor a reduced-form parameterization is often very complicated or even sometimesnot available. In contrast, the method presented in this paper is applicable in moregeneral settings. Moreover, since the log-likelihood function is usually expressedin its simplest form with a transparent re-parameterization, taking the second par-tial derivatives of the log-likelihood function becomes much more straightforward.Thus, the proposed method is also advantageous in terms of calculation.28Chapter 3The Benefit of Exploiting the GEIAssumption for AnalyzingCase-Control Data3.1 IntroductionIn this chapter, we apply the constrained maximum likelihood estimation theorydeveloped in Chapter 2 to study the benefit of exploiting the GEI assumption foranalyzing case-control data concerning a binary genotype and an environmentalexposure that has two or more categories. This chapter is organized as follows. Wefirst describe the underlying model for case-control data and formulate the problemunder consideration in more detail. We then develop a reparameterization of themodel and transform the problem into a constrained maximum likelihood estima-tion problem. Next, we propose methods for analyzing case-control data exploitingthe GEI assumption in different scenarios. We then investigate the efficiency gainof exploiting the GEI assumption, and conduct simulation studies to compare theperformance of the proposed GEI-based methods with the traditional method. Wealso consider a real dataset for the application of the proposed method. Finally,some concluding thoughts are given at the end of this chapter.293.2 Formulation of the problemLet D be the binary disease status, with D = 1 for presence and D = 0 for absence.Suppose the risk of having the disease is affected by a subject’s binary geneticfactor G, coded as {0,1}, and a categorical environmental exposure E with K+1levels, coded as {0,1, . . . ,K}. In a case-control study, genotype and environmentalexposure status are collected for n = n0+n1 subjects, including n0 controls and n1cases, where the case-to-control ratio, n1/n0, is pre-specified. Let ni jk denote thenumber of subjects in the D = i study arm with genotype G = j and environmentalexposure E = k. Thus, we can summarize case-control data in a 2× 2× (K + 1)table.Let ι = (ι00, . . . , ι0K , ι10, . . . , ι1K) denote the vector of probability masses forthe joint distribution of genotype and exposure, with ι jk = Pr(G = j,E = k) forj = 0,1 and k = 0, . . . ,K. It has 2K + 1 degrees of freedom since the sum of allelements is equal to one. We assume that the disease risk, given a subject’s statusof genotype and environmental exposure, is parameterized by a saturated logisticregression model:Pr(D = 1|E,G) = expit{β0+βGG+K∑k=1(β (k)E 1{k}(E)+β(k)EG1{k}(E)G)},where expit(x) = 1/{1+ exp(−x)} is the inverse of the logit function. For brevity,let βE = (β(1)E , . . . ,β(K)E ) represent the vector of all main environmental effects,and βEG = (β(1)EG, . . . ,β(K)EG ) represent the vector of all gene-environment interactioneffects. Then, the model can be parameterized by φ = (ι ,β0,βG,βE ,βEG), whichhas 4K+3 degrees of freedom.Under the GEI assumption, the joint distribution of genotype and environmen-tal exposure can be determined by their marginal distributions. Let κ = (κ0,κ1)and δ = (δ0, . . . ,δK) denote two vectors of probability masses for the marginaldistributions of genotype and environmental exposure, respectively. The numbersof free parameters in these two vectors are 1 and K, respectively, as the sum ofall elements in each vector is equal to one. Then the joint probability takes theproduct form ι jk = κ jδk for j = 0,1 and k = 0, . . . ,K. Therefore, we get a reducedform parameterization φr = (κ,δ ,β0,βG,βE ,βEG), which has 3K + 3 degrees of30freedom. We can then estimate φr by maximizing the retrospective log-likelihoodfor case-control data`(φr) = ∑i, j,kni jk logκ jδk pi jk(β0,βG,βE ,βEG)∑ j′,k′ κ j′δk′ pi j′k′(β0,βG,βE ,βEG),where pi jk(β0,βG,βE ,βEG) = Pr(D = i|G = j,E = k), i, j = 0,1, k = 0, . . . ,K.However, due to the complicated form with the denominator including summationover all control or case cell probabilities, direct maximization of the log-likelihoodwith respect to φr can be numerically challenging or even infeasible, especiallywhen K is large.Alternatively, Chatterjee and Carroll [3] proposed to obtain the estimate usinga profile-likelihood technique. The likelihood is first maximized with respect to δfor fixed values of (κ,β ) to derive the profile likelihood of the data, where β gener-ically represents (β0,βG,βE ,βEG). The profile likelihood is then maximized withrespect to (κ,β ) to obtain the maximum likelihood estimator of (κ,β ). If δˆ (κ,β )denotes the value of δ that maximizes the likelihood for fixed (κ,β ), the profilelikelihood is then `(κ,β , δˆ (κ,β )). Chatterjee and Carroll [3] have shown that theprofile likelihood `(κ,β , δˆ (κ,β )) can be computed without having to maximizethe log-likelihood numerically with respect to the potentially high-dimensional pa-rameter δ . Instead, it can be obtained in a closed form up to only one additionalparameter. More specifically, let θ denote the disease prevalence Pr(D = 1), andS denote the indicator of whether or not a subject has been selected in the case-control sample. We consider the joint probability distribution for D and G given Ein the case-control sample and letPr(D= i,G= j|E = k,S= 1)= ni{θ 1−i(1−θ)i}κ j pi jk(β0,βG,βE ,βEG)∑i′, j′ ni′ {θ 1−i′(1−θ)i′}κ j′ pi′ j′k(β0,βG,βE ,βEG),which only concerns parameters κ , β , and θ . Let n++k be the marginal fre-quency of the kth category of E for k = 0, . . . ,K. Then the profile likelihood`(κ,β , δˆ (κ,β )) can be computed as `∗(κ,β , θˆ(κ,β )), where`∗(κ,β ,θ(κ,β )) = ∑i, j,kni jk logPr(D = i,G = j|E = k,S = 1),31and θˆ(κ,β ) is defined by the solution of the equation:n1 =∑k∑jn++kPr(D = 1,G = j|E = k,S = 1).Thus, the semiparametric maximum likelihood estimate of (κ,β ) can be obtainedby solving score equations ∂`∗(κ,β ,θ)/∂ (κ,β ,θ) = 0 jointly with respect to κ ,β , and θ . However, given the complex form of Pr(D,G|E,S = 1), numericallysolving these estimating equations using standard methods, such as the Newton-Raphson method, is still challenging. Moreover, although it has been shown in[3] that, under suitable regularity conditions, the semiparametric maximum likeli-hood estimator is consistent and asymptotically follows a normal distribution, theasymptotic variance of the estimator was given in the form that includes a dou-ble expectation, first with respect to Pr(D,G|E,S = 1) and then with respect toPr(G,E|D). Thus, it is not trivial to obtain the standard error of the estimate.Therefore, we are seeking a simpler solution for analyzing case-control data underthe GEI assumption.3.3 A reparameterization of the modelWe now consider a different parameterization of the model such that the form ofthe log-likelihood function can be greatly simplified. First, as already defined inthe previous section, we use θ to denote the disease prevalence Pr(D = 1). Also,we define γ = (γ001, . . . ,γ01K ,γ101, . . . ,γ11K), where γi jk = Pr(G = j,E = k|D = i),for i, j = 0,1 and k = 0, . . . ,K, are sampling probabilities actually underlying case-control data. Note that γ000 and γ100 are excluded from γ as their values can beuniquely determined through the constraints ∑ j,k γi jk = 1 for i = 0,1. Now con-sider the parameterization ξ = (γ,θ). It can be easily verified that the mappingbetween ξ and φ is invertible. Particularly, by comparing subjects with environ-mental exposure E = k to those with baseline level E = 0, we are able to expressβ (k)EG by ξ throughβ (k)EG(γ) = logγ11kγ100γ00kγ010γ10kγ110γ01kγ000, (3.1)for k = 1, . . . ,K.32Suppose ξˆ = (γˆ, θˆ) is an estimator of ξ and its asymptotic distribution is√n γˆ− γ∗θˆ −θ ∗ d→N 00 , Σγ Σγ·θΣTγ·θ Σθ ,where ξ ∗ = (γ∗,θ ∗) is the true value of ξ . We can then easily deduce an estimatorof φ and apply the delta method to derive the corresponding asymptotic distribu-tion. In particular, the estimator of the interaction effect β (k)EG can be derived fromthe equation (3.1) as βˆ (k)EG = β(k)EG(γˆ), and its asymptotic distribution is√n(βˆ (k)EG−β (k)∗EG)d→N (0, ∇∗Tk Σγ∗∇∗k) , (3.2)where ∇∗k is the gradient of the function β(k)EG(γ) evaluated at γ∗. Moreover, inpractice, the asymptotic standard error of βˆ (k)EG isSE(βˆ (k)EG)=1√n√∇ˆkΣγˆ∇ˆk, (3.3)where ∇ˆk is the gradient of the function β(k)EG(γ) evaluated at γˆ . Therefore, it issufficient to analyze case-control data based on the parameterization ξ .With the new parameterization ξ , we first re-write the retrospective log-likelihoodfunction in a much simpler form`(γ) = ∑i, j,kni jk logγi jk,which shows even more clearly that, without any additional assumption, the modelcan only be partially identified, leaving the parameter θ not identified. Let γˆ(U)denote the unconstrained maximum likelihood estimator of γ . We can easily solvethe equation ∂`(γ)/∂γ = 0 with respect to γ , yielding γˆ(U)i jk = ni jk/ni for i, j = 0,1and k = 0, . . . ,K. The asymptotic distribution of γˆ(U) is√n(γˆ(U)− γ∗)d→N(0, B−1γ∗), (3.4)where Bγ∗ is the unconstrained Fisher information, i.e., the negative of the expec-33tation of the second derivative of the log-likelihood function `(γ), which is givenin Appendix A.Next, we look for the mathematical representation of the GEI assumption withthe new parameterization of the model. Under the GEI assumption, we haveι00ι1k = κ0δ0κ1δk = ι10ι0k, for k = 1, . . . ,K. Note that ι jk can be expressed byξ through ι jk(ξ ) = (1−θ)γ0 jk+θγ1 jk. Thus, the GEI assumption can be enforcedonto ξ through the following constraintsg(ξ ) = (g1(ξ ), . . . ,gK(ξ ))T = 0,where, for k = 1, . . . ,K,gk(ξ ) = ι00(ξ )ι1k(ξ )− ι10(ξ )ι0k(ξ ).These equality constraints define a GEI subspace of ξ , within which ξ can bemapped to φ r and the mapping is invertible.Finally, since each component of ξ is a probability, the following inequalityconstraints apply naturally on ξ :0≤ γi jk ≤ 1, (3.5)for i, j = 0,1, k = 0, . . . ,K, and0≤ θ ≤ 1. (3.6)Therefore, the problem is now transformed to finding the estimate of ξ that max-imizes `(γ) subject to the equality constraints g(ξ ) = 0 and the inequality con-straints (3.5) and (3.6).3.4 Estimation with known disease prevalenceWe begin with the special case where the disease prevalence is known, say from anexternal source. Suppose the true value of θ is known to be θ ∗. First, we can plugθ ∗ into g(ξ ) and thus the constraints induced by the GEI assumption now becomeg(γ,θ ∗) = 0, which are equations concerning γ only. Secondly, the inequality34constraint (3.6) is no longer in effect. Thus, the problem now becomes finding theestimate of γ that maximizes `(γ) subject to the equality constraint g(γ,θ ∗) = 0and the inequality constraints (3.5).3.4.1 Theoretical properties of the estimatorWe first study the asymptotic distribution of the constrained maximum likelihoodestimator of γ . Given that the true value of θ is known, within a sufficiently smallneighborhood of the true value of γ , the inequality constraints are automaticallysatisfied. Thus, in order to study the asymptotic properties of the estimator of γ ,we can ignore the inequality constraints (3.5) and treat the problem as estimatingparameters subject to equality constraints only, which has been studied by Aitchi-son and Silvey [1].Consider an auxiliary function T (γ,λ ) = (1/n)`(γ)+λ T g(γ;θ ∗), where λ isa Lagrange multiplier vector of length K. The maximum point of T (γ,λ ) can befound by solving ∂T (γ,λ )/∂ (γ,λ ) = 0 jointly with respect to γ and λ , which leadsto the following 5K+2 equations:1ns(γ)+Jγ;θ ∗λ = 0, (3.7)g(γ;θ ∗) = 0, (3.8)where s(γ) is the gradient of the log-likelihood function, and Jξ is the Jacobian ofg(ξ ) with respect to γ , both of which are given in Appendix A.Suppose (γˆ, λˆ ) is the solution to the equations (3.7) and (3.8). Applying thetheory in [1], as sample size n goes to infinity with the case-to-control ratio n1/n0fixed, the asymptotic distribution of (γˆ, λˆ ) is√n γˆ− γ∗λˆ d→N 00 , P11 00 P22 , (3.9)where P11 P12P21 P22−1 = Bγ∗ −Jγ∗,θ ∗−JTγ∗,θ ∗ 0 .35We have shown earlier that Bγ∗ represents the unconstrained Fisher information.Analogously, we refer to the matrix on the right hand side of the above equation asthe ‘constrained Fisher information’ that accounts for the GEI assumption.Finally, it is easy to see from the equation (3.9) that the variance of the asymp-totic distribution of γˆ is Σγ = P11. We can then apply the (3.2) to derive the asymp-totic distributions of the constrained maximum likelihood estimators of the inter-action effects, βˆ (k)EG for k = 1, . . . ,K, under the GEI assumption with known diseaseprevalence.3.4.2 Numerical algorithmIn practice, we need to compute the maximum likelihood estimate γˆ numerically,since there is no closed form solution. In general, there are various numericalalgorithms, such as the augmented Lagrangian algorithm [5] and the sequentialquadratic programming algorithm [9], for solving the constrained optimizationproblem. For this particular problem, however, naively implementing these al-gorithms using the existing tools like the NLopt library [16] encounters numericalerrors mainly because the target function has no definition when the inequalityconstraints (3.5) are violated. More care needs to be taken for successful imple-mentations of these advanced algorithms, which can be tricky. Alternatively, wefind that a simple variant of the Newton’s method as proposed in [1] is easy toimplement and works well in practice.We first find the constrained maximum likelihood estimate of γ subject to theequality constraints g(γ,θ ∗) = 0, ignoring the inequality constraints (3.5) for themoment. Specifically, we begin by setting the initial value of γ to be γ(U), as it isexpected that the constrained and the unconstrained maximum likelihood estimateof γ should be close, if not equal, to each other. We then iteratively update thevalue of γ by γλ← γ0+ c Bγ −Jγ,θ ∗−JTγ,θ ∗ 0−1 1n s(γ)g(γ,θ ∗) , (3.10)where c ∈ (0,1] is a tuning parameter that controls the step size of each update andprevent the values of γ from leaving their plausible range. We find c = 0.5 works36very well in practice. Also, although we update the value of λ in each iteration,we don’t need to keep track of it because it won’t be used in the next iteration.This process is repeated until convergence or termination, as summarized belowin Algorithm 1. The termination criteria include reaching the preset maximumnumber of iterations and running into numerical errors, both of which occur veryrarely.Algorithm 1 Find γˆ under the GEI assumption with known θ .Set the initial value γ(0)For m = 1,2, . . . ,MCompute γ(m) given γ(m−1) using the equation (3.10)If any error occurs:Set m = M and breakIf ||γ(m)− γ(m−1)||∞ < εBreakIf m<MOutput γˆ = γ(m)We now discuss the effect of the inequality constraints (3.5). If Algorithm1 converges and successfully finds a constrained maximum likelihood estimate γˆ ,then the inequality constraints (3.5) should be naturally satisfied, otherwise the log-likelihood does not even exist. However, it is possible, though very rarely occurredin our simulation studies, that Algorithm 1 may terminate due to the violation ofthese inequality constraints before it finds the final estimate. In that case, we needto use a smaller step size by decreasing the value of c to prevent our algorithmmoving too fast.Finally, having the constrained maximum likelihood estimate γˆ , we can plug itinto the equations (3.1) and (3.3), with Σγ = P11, to get the constrained maximumlikelihood estimates for the interaction effects and their associated standard errors.3.5 Estimation with unknown disease prevalenceWe now come back to our original problem in the more common settings wherethe disease prevalence is unknown. It needs to be learned indirectly through the37constraints imposed by the GEI assumption.3.5.1 Parameter identificationBefore we study estimation, we first discuss the identifiability of the parameters.In general settings where no assumption is made about the joint distribution ofgenotype and environmental exposure, it is well known that neither the joint dis-tribution of genotype and environmental exposure, ι , nor the intercept parameterof the logistic model, β0, is identifiable from case-control data. Under the GEIassumption, Chatterjee and Carroll [3] claimed that these parameters are theoreti-cally identifiable from the retrospective likelihood of case-control data, except forsome boundary situations where the logistic model for Pr(D|G,E) depends onlyon G or only on E, but not both. In other words, either (βG,βE ,βEG) = (0,βE ,0) or(βG,βE ,βEG) = (βG,0,0), corresponding to either only the main effect of G or onlythe main effect of E, respectively. However, this conclusion is incomplete since wefind that there are situations where the originally non-identifiable parameters maystill not be fully identified under the GEI assumption, even though the disease riskis affected by both genotype and environmental exposure.To learn the identifiability of the parameters under the GEI assumption, it issufficient to study the identifiability of θ , which is the only parameter in the pa-rameterization ξ that is not identifiable in general. We first consider the simplestsetting where G and E are both binary. In this case, the constraints imposed by theGEI assumption reduce to just one single equation as follows0 = (γ100θ + γ000(1−θ))(γ111θ + γ011(1−θ)) −(γ101θ + γ001(1−θ))(γ110θ + γ010(1−θ)).It is easy to see that this equation is quadratic in θ . Therefore, it may produce a‘twin’ solution comprised of the true disease prevalence as well as an erroneousvalue that also lies between 0 and 1. That is, there could exist two different val-ues of ξ within the GEI subspace such that they only differ in the value of θ .Correspondingly, it is possible that two different values of φ both satisfy the GEIassumption and result in the same case-control sampling probabilities. For exam-ple, it can be easily verified that the following two values of φ r lead to the same38case-control sampling probabilities:φ (1)r : βG = βE = βEG = log2, β0 = log1.5, κ = (0.5,0.5) , δ = (0.8,0.2) ,φ (2)r : βG = βE = βEG = log2, β0 = log5.5, κ =(1328,1528), δ =(5267,1567).Next, in more general settings where the environmental exposure takes more thantwo categories, the GEI-induced constraint can be viewed as multiple quadraticequations in θ , which can have at most two solutions. If there exist two values of θboth satisfying all of these equations, then these equations differ from one anotheronly up to some constants, which is an extremely rare but theoretically possiblesituation.Therefore, even under the GEI assumption, we may still not be able to fullyidentify all parameters from case-control data, even for some settings that arenot included in the boundary situations defined in [3]. In theory, the originallynon-identifiable parameters become ‘almost’ identified under the GEI assumption,though a ‘twin’ of the true value may be present as well.3.5.2 Theoretical properties of the estimatorWe now study the asymptotic distribution of the constrained maximum likelihoodestimator of ξ . Within a sufficiently small neighborhood of the true value of ξ ,the inequality constraints (3.5) and (3.6) are automatically satisfied. Thus, in orderto study the asymptotic properties of the estimator, we can ignore those inequalityconstraints and treat the problem as estimating parameters arising from a partiallyidentified model subject to only equality constraints, which has been studied inChapter 2.We still use the Lagrange multiplier method, and consider an auxiliary functionT (ξ ,λ ) = (1/n)`(γ)+ λ T g(ξ ). Since θ is now also an unknown parameter, weneed to solve ∂T (γ,θ ,λ )/∂ (γ,θ ,λ ) = 0 jointly with respect to γ , θ and λ to find39the maximum point of T (ξ ,λ ). Thus, we get the following 5K+3 equations1ns(γ)+Jξλ = 0, (3.11)Kξλ = 0, (3.12)g(ξ ) = 0, (3.13)where s(γ) is again the gradient of the log-likelihood function, and Jξ and Kξ arethe Jacobians of g(ξ ) with respect to γ and θ , respectively, which are all given inAppendix A.Suppose (γˆ, θˆ , λˆ ) is the solution of the equations (3.11) - (3.13). Then applyingthe theory developed in Chapter 2, as the sample size n goes to infinity with thecase-to-control ratio n1/n0 fixed, the asymptotic distribution of (γˆ, θˆ , λˆ ) is√nγˆ− γ∗θˆ −θ ∗λˆ d→N000 ,Q11 Q12 0Q21 Q22 00 0 −Q33 , (3.14)where Q11 Q12 Q13Q21 Q22 Q23Q31 Q32 Q33−1=Bγ∗ 0 −Jξ ∗0 0 −Kξ ∗−JTξ ∗ −KTξ ∗ 0 .Similarly as before, we refer to the matrix on the right hand side of the above equa-tion as the ‘constrained Fisher information’ that accounts for the GEI assumptionin the case that the disease prevalence is unknown.We find from the equation (3.14) that the variance of the asymptotic distributionof γˆ is Σγ = Q11. Again, we apply the (3.2) to derive the asymptotic distributionsof the constrained maximum likelihood estimators of the interaction effects, βˆ (k)EGfor k = 1, . . . ,K, under the GEI assumption with unknown disease prevalence.403.5.3 Numerical algorithmIn practice, the maximum likelihood estimate ξˆ has no closed form expression andstill needs to be computed numerically. For this particular problem, we use thenumerical algorithm proposed in Chapter 2, supplemented with a one-dimensionalgrid search if necessary, which is easy to implement and works well in practice.First, we search globally for the constrained maximum likelihood estimate ofξ subject to the equality constraints, ignoring the inequality constraints for themoment. We call this a global search because each component of ξ is allowed totake any value. We set the initial value of γ to be γ(U) for the same reason as before.Without any additional information on disease prevalence, the initial value of θ isset to be 0.5, the midpoint of its plausible range. Therefore, the algorithm beginswith an initial value ξ (0) = (γ(U),0.5). We then iteratively update the value of ξ byγθλ←γθ0+ cBγ 0 −Jξ0 0 −Kξ−JTξ −KTξ 0−11n s(γ)0g(ξ ) . (3.15)Still, there is no need to record the value of λ for the next iteration. This process isrepeated until convergence or termination. The termination criteria again includereaching the preset maximum number of iterations and running into numerical er-rors, both of which occur occasionally.Next, we consider the inequality constraints. As discussed earlier in Section3.4.2, we can tune the value of c to prevent values of γ from leaving their plausi-ble range, and the inequality constraints (3.5) are naturally satisfied. In practice,however, the inequality constraint (3.6) may sometimes be violated. That is, theabove algorithm may sometimes result in an estimate with θˆ /∈ [0,1], especiallywhen the sample size is not very large. In that case, we perform a one dimensionalgrid search to approximate the optimal value of θ over the fixed interval [0,1] thatmaximizes the log-likelihood `(γ(θ)). For any fixed value of θ , we can apply Al-gorithm 1 to find the constrained maximum likelihood estimate of γ and obtain themaximum log-likelihood corresponding to that given value of θ . Moreover, thisone-dimensional grid search is also performed in the rare situation when the global41search step fails to converge.In summary, we use Algorithm 2, which combines the primary step of globalsearch and the supplementary step of one-dimensional grid search, to obtain themaximum likelihood estimate ξˆ . We then plug it into the equations (3.1) and (3.3),with Σγ = Q11, to get the constrained estimates for the interaction effects and theirassociated standard errors.Algorithm 2 Find γˆ under the GEI assumption with unknown θ .Set the initial value ξ (0) = (γ(0),0.5)For m = 1,2, . . . ,MCompute ξ (m) given ξ (m−1)) using equation (3.15)If any error occurs:Set m = M and breakIf ||ξ (m)−ξ (m−1)||∞ < εBreakIf m<M and θ (m) ∈ [0,1]Output ξˆ = ξ (m)ElseSet w =−∞For θ = 0,0.01, . . . ,0.99,1Apply Algorithm 1 to find the estimate γˆ(θ).If `(γˆ(θ))> wSet w = `(γˆ(θ))Set ξˆ = (γˆ(θ),θ)Output ξˆ3.6 Extension: a reduced logistic modelIn some cases, it may make practical sense to treat the categorical environmen-tal exposure as ordinal rather than nominal. When the environmental exposure isordinal, a reduced logistic regression model can be useful:logitPr(D = 1|E,G) = β0+βGG+βEE +βEGEG.42It should be noted that this reduced model is just a special case of the saturatedmodel, where β (k)E = kβ(1)E and β(k)EG = kβ(1)EG, for k = 2, . . . ,K. Thus, assuming areduced model introduces additional 2K−2 constraints on γ ,h(γ) = (h0,2(γ), . . . ,h0,K(γ),h1,2(γ), . . . ,h1,K(γ))T = 0,where, for j = 0,1 and k = 2, . . . ,K,h j,k(γ) = logγ1 jkγ0 jk+(k−1) log γ1 j0γ0 j0− k log γ1 j1γ0 j1.Combining these equality constraints with those imposed by the GEI assumption,we now have in total 3K−2 constraint equations. The asymptotic theories and thenumerical algorithms developed in Sections 3.4 and 3.5 are still applicable withsome minor modifications. Suppose λ h is the Lagrange multiplier associated withh(γ), and λ g is the Lagrange multiplier associated with g(ξ ). Let Hγ denote theJacobian of h(γ) with respect to γ , the form of which is given in Appendix A. Wediscuss the extensions in three scenarios.No GEI assumptionWhen we only assume a reduced logistic regression model in a general set-ting without the GEI assumption, the problem becomes finding the estimateof γ that maximizes `(γ) subject to the equality constraint h(γ) = 0 and theinequality constraints (3.5). This also fits into the framework developed inSection 3.4. Thus, the asymptotic distribution of (γˆ, λˆ h) is√n γˆ− γ∗λˆ h d→N 00 , R11 00 R22 , (3.16)where R11 R12R21 R22−1 = Bγ∗ −Hγ∗−HTγ∗ 0 .The right hand side of the above equation is the ‘constrained Fisher infor-mation’ that accounts for only assuming a reduced model. It should then be43used to modify the equation (3.10) in Algorithm 1 for updating the value ofγ . This new algorithm can be used to numerically compute the estimate γˆ .GEI assumption with known θWhen we assume a reduced model under the GEI assumption with knowndisease prevalence, the problem is to find the estimate of γ that maximizes`(γ) subject to the equality constraints (h(γ), g(γ,θ ∗)) = 0 and the inequal-ity constraints (3.5). Following the work in Section 3.4, the asymptotic dis-tribution of (γˆ, λˆ h, λˆ g) is√nγˆ− γ∗λˆ hλˆ g d→N000 ,P′11 0 00 −P′22 −P′230 −P′32 −P′33 , (3.17)where P′11 P′12 P′13P′21 P′22 P′23P′31 P′32 P′33−1=Bγ∗ −Hγ∗ −Jγ∗;θ ∗−HTγ∗ 0 0−JTγ∗;θ ∗ 0 0 .The right hand side of the above equation is the ‘constrained Fisher informa-tion’ that accounts for assuming a reduced model and the GEI assumptionwith known disease prevalence. Again, it is used to modify the equation(3.10) in Algorithm 1 for updating the value of γ . This new algorithm is thenused to numerically compute the estimate γˆ in practice for this problem.GEI assumption with unknown θWhen we assume a reduced model under the GEI assumption with unknowndisease prevalence, the problem becomes finding the estimate of ξ that max-imizes `(γ) subject to the equality constraints (h(γ), g(ξ )) = 0 and the in-equality constraints (3.5) and (3.6). Following the work in Section 3.5, the44asymptotic distribution of (γˆ, θˆ , λˆ h, λˆ g) is√nγˆ− γ∗θˆ −θ ∗λˆ hλˆ gd→N0000 ,Q′11 Q′12 0 0Q′21 Q′22 0 00 0 −Q′33 −Q′340 0 −Q′43 −Q′44 ,(3.18)whereQ′11 Q′12 Q′13 Q′14Q′21 Q′22 Q′23 Q′24Q′31 Q′32 Q′33 Q′34Q′41 Q′42 Q′43 Q′44−1=Bγ∗ 0 −Hγ∗ −Jξ ∗0 0 0 −Kξ ∗−HTγ∗ 0 0 0−JTξ ∗ −KTξ ∗ 0 0 .Then the right hand side of the above equation is the ‘constrained Fisherinformation’ that accounts for assuming a reduced model and the GEI as-sumption with unknown disease prevalence. We then use it to modify theequation (3.15) in Algorithm 2 for updating the value of ξ . This new algo-rithm is used to numerically compute the estimate ξˆ in this situation.3.7 Efficiency gainIn this section, we investigate the benefit of exploiting the GEI assumption in termsof estimation efficiency. Whereas related work [3] has relied on simulation studiesto assess this benefit at only a limited number of parameter settings, using thetheories in previous sections allow us to directly evaluate the efficiency gain at avery large number of values for the underlying parameters.3.7.1 The special binary caseFirst, we consider the special case where both genotype and environmental ex-posure are binary. In this case, we can show that there is no efficiency gain byexploiting the GEI assumption if the disease prevalence is unknown. When the45GEI assumption is assumed and the disease prevalence is unknown, the varianceof the asymptotic distribution of the estimator of γ is Q11 as given in the equation(3.14). Following the results given in the proof of Lemma 2 in Chapter 2, we areable to express Q11 by the block matrices of the corresponding constrained Fisherinformation:Q11 = B−1−B−1J(JT B−1J)−1JT B−1+Wwhere subscripts are omitted for brevity andW = B−1J(JT B−1J)−1KT[K(JT B−1J)−1KT]−1K(JT B−1J)−1JT B−1.Note that K reduces to a scalar in this special case. Thus, it can be easily verifiedthatW = B−1J(JT B−1J)−1JT B−1.It then follows that Q11 = B−1, where B−1 is the variance of the asymptotic dis-tribution of the unconstrained estimator of γ . This result suggests that the un-constrained and the constrained estimators of γ asymptotically follow the samedistribution. Correspondingly, the unconstrained and the constrained estimators ofβ (k)EG, k= 1, . . . ,K, are asymptotically equivalent, as these parameters are connectedwith ξ only through γ . Thus, there is no efficiency gain by exploiting the GEI as-sumption when the disease prevalence is unknown in the special binary case. Thismatches the finding by Chen and Chen [4] that estimating the intercept term inthe prospective relationship uses up the additional information inherent in the GEIassumption in the special binary case.3.7.2 The saturated modelNext, we consider a saturated logistic model in the scenario where the environmen-tal exposure has four levels. Some exploratory experiments show that the magni-tude of efficiency gain is likely to be affected by the magnitude of the baselinerisk β0. Thus, we examine two different values of β0 (logit0.003 vs. logit0.3) sep-arately. For each value of β0, we randomly generate 10000 sets of values for theremaining parameters. First, the marginal distributions of genotype and environ-mental exposure, κ and δ , are generated using flat Dirichlet distributions that are46uniform over the simplex, respectively. Secondly, all components of the main ef-fects, βG and βE , and the interaction effects, βEG, are generated separately using astandard normal distribution.For each parameter setting, we first compute the variances of the asymptoticdistributions of the estimators of γ resulting from the traditional method (TRAD),the method exploiting the GEI assumption with unknown disease prevalence (GEI-U), and the method exploiting the GEI assumption with known disease prevalence(GEI-K), which are B−1 defined in the equation (3.4), Q11 defined in the equation(3.9), and P11 defined in the equation (3.14), respectively. We then use the equation(3.2) to deduce the asymptotic variances of the corresponding estimators of theinteraction effects β (k)EG, for k = 1, . . . ,K. We finally summarize results by boxplotsin Figure 3.1, where the asymptotic variance ratios of the GEI-U estimators to theTRAD estimators are shown in left panels, and the asymptotic variance ratios ofthe GEI-K estimators to the GEI-U estimators are shown in right panels.From Figure 3.1, we can see that exploiting the GEI assumption does lead toefficiency gain for the estimation of the interaction effects. This benefit is morelikely to be substantial when the baseline risk is small. Moreover, knowing thedisease prevalence can further lead to more efficiency gain, which also tends to begreater when the baseline risk is small.3.7.3 The reduced modelFinally, we consider a reduced logistic model in three different scenarios, wherethe environmental exposure has three, four, and five levels, respectively. Again,we look at two different values of β0 (logit0.003 vs. logit0.3) separately. For eachvalue of β0, we still randomly generate 10000 sets of values for the remainingparameters. Parameter values are generated in the same manner as before, exceptthat βE and βEG are now two scalars.For each parameter setting, we compute the asymptotic variances of the TRADestimator, the GEI-U estimator and the GEI-K estimator of γ , which are R11 de-fined in the equation (3.16), Q′11 defined in the equation (3.18) and P′11 defined inthe equation (3.17), respectively. The asymptotic variances of the correspondingthree estimators of the interaction effects βEG are then deduced. Figure 3.2 sum-47llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllβeg(1) βeg(2) βeg(3)0.00.20.40.60.81.0β0 = logit(0.3)RU:Tllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllβeg(1) βeg(2) βeg(3)0.00.20.40.60.81.0β0 = logit(0.3)RK:Ullllllllllllllllllllllllllllllllllllβeg(1) βeg(2) βeg(3)0.00.20.40.60.81.0β0 = logit(0.003)RU:Tllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllβeg(1) βeg(2) βeg(3)0.00.20.40.60.81.0β0 = logit(0.003)RK:UFigure 3.1: Comparison of the TRAD method, the GEI-U method, and theGEI-K method in terms of efficiency, with a saturated logistic model.The comparison is based on 10000 randomly generated parameter set-tings under the scenario where the environmental exposure has four lev-els. The asymptotic variance ratios of the GEI-U estimator to the TRADestimator, RU :T , are shown in left panels, and the asymptotic varianceratios of the GEI-K estimator to the GEI-U estimator, RK:U , are shownin right panels.marizes results, where the asymptotic variance ratios of the GEI-U estimator to theTRAD estimator are shown in left panels, and the asymptotic variance ratios of the48GEI-K estimator to the GEI-U estimator to the GEI-U estimator are shown in rightpanels.llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllK=2 K=3 K=40.00.20.40.60.81.0β0 = logit(0.3)RU:TlllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllK=2 K=3 K=40.00.20.40.60.81.0β0 = logit(0.3)RK:UllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllK=2 K=3 K=40.00.20.40.60.81.0β0 = logit(0.003)RU:TllK=2 K=3 K=40.00.20.40.60.81.0β0 = logit(0.003)RK:UFigure 3.2: A similar figure to Figure 3.1, but shown with a reduced logisticmodel in scenarios where the environmental exposure has three (K = 2),four (K = 3), and five (K = 4) levels, respectively.From Figure 3.2, we can see that exploiting the GEI assumption still leads to ef-ficiency gain with a reduced logistic model, although the effect is less pronouncedcompared to the case with a saturated logistic model. Also, we find the trend thatthe benefit of exploiting the GEI assumption becomes greater when the environ-mental exposure has more levels. Interestingly, a reversed trend is observed in the49right bottom panel of Figure 3.2, which suggests that the benefit of knowing thedisease prevalence seems to decrease as the levels of the environmental exposureincreases.3.8 Simulation studiesIn this section, we conduct simulation studies to compare the performance of thetraditional prospective logistic method (TRAD), the method exploiting the GEIassumption with unknown disease prevalence (GEI-U), and the method exploit-ing the GEI assumption with known disease (GEI-K) under different scenarios.We set the tuning parameters in Algorithm 1 and Algorithm 2 to be c = 0.5,ε = exp(−10log10), and M = 1000.3.8.1 The special binary caseWe first consider the special case where both genotype and environmental exposureare binary. We use the parameter setting: κ = (0.95,0.05), δ = (0.6,0.4), β0 =logit0.005, βG = log1.5, βE = log1.2, and βEG = log3, so that the GEI assumptionindeed holds. We consider three different sample sizes n∈ {500,1000,2000}, withequal numbers of controls and cases. For each sample size, we apply the TRADmethod, the GEI-U method, and the GEI-K method on 10000 randomly generatedsamples to obtain estimates and the corresponding 95% confidence intervals forthe interaction effect βEG.We first present some summaries of the 10000 generated datasets. The GEI-Kmethod successfully converges for all generated datasets. When the sample size is500/1000/2000, there are 80/0/0 datasets containing at least one zero cell count,which are not used for comparison. The GEI-K method converges for all of theremaining 9920/10000/10000 datasets. On the other hand, the GEI-U methoddirectly finds the constrained maximum likelihood estimates of γ without the needof the one-dimensional grid search for 5137/5168/5323 of those datasets. Forthe other 4783/4832/4677 datasets, the one-dimensional grid search is performedto find estimates. Among these datasets, on-boundary estimates (θˆ ∈ {0,1}) arefound for 4779/4832/4677 datasets.We then summarize simulation results by three key indices in Table 3.1: the50bias and the mean squared error of the estimators, plus the coverage probabilitiesof the 95% confidence intervals. First, we can see that the GEI-U estimator is alittle bit biased in practice because some estimates are computed differently bythe one-dimensional grid search in order to ensure that estimates of the diseaseprevalence actually make practical sense. However, this empirical bias reduces assample size increases. Secondly, the coverage probability of the GEI-U 95% con-fidence interval is even slightly greater than the nominal level. Thirdly, comparedto the TRAD method, the GEI-U method yields slightly better mean squared er-rors. Finally, we find that the GEI-K method performs the best among the threemethods. It leads to an unbiased estimator with substantially lower mean squarederrors. It is interesting to see that knowing the disease prevalence has no effecton the efficiency of the estimators of the interaction effect without any additionalassumption, but greatly improves the efficiency when the GEI assumption is made.Table 3.1: Comparison between the TRAD method, the GEI-U method, andthe GEI-K method in terms of the bias and the mean squared error (MSE)of the estimators of βEG, and the coverage probability of the 95% confi-dence intervals in the special binary case.n Bias MSE CoverageTRAD GEI-U GEI-K TRAD GEI-U GEI-K TRAD GEI-U GEI-K500 0.061 0.257 0.034 0.651 0.584 0.201 0.959 0.959 0.9571000 0.028 0.184 0.015 0.294 0.221 0.091 0.953 0.973 0.9562000 0.019 0.127 0.010 0.137 0.097 0.045 0.954 0.976 0.949We also compare the length of the 95% confidence intervals, as shown in Figure3.3 and 3.4. When the sample size is small, the GEI-U 95% confidence intervalsmay sometimes be shorter than their TRAD counterparts. When the sample sizegets larger, the 95% confidence intervals of these two estimators have about thesame length. This observation matches our earlier result that these two estimatorsare asymptotically equivalent in the special binary case. Again, the GEI-K methodoutperforms the other two methods, as it results in much shorter 95% confidenceintervals. More importantly, as we can see from Table 3.1, the coverage probability51of the GEI-K 95% confidence intervals is still maintained at the nominal level.lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll l lll2.5 3.0 3.5 4.0 4.5 5.02.53.03.54.04.55.0GEI−UTRADn = 500lllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll2.0 2.5 3.0 3.5 4.02.02.53.03.54.0GEI−UTRADn = 1000lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll1.4 1.6 1.81.41.61.8GEI−UTRADn = 2000Figure 3.3: Comparison of the TRAD method and the GEI-U method interms of the length of the 95% confidence interval in the special bi-nary case. For better visualization, we only present results for a randomsample of 1000 datasets from all 10000 simulated datasets. The greyline is the identity line.llllllll lllllllllllll lllllllllllllllllllllllllllllll llll lllllllllllllllllllllllllllllllll ll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllll lllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllll2 3 4 52345GEI−KGEI−Un = 500lllllllllllllllllllllllllllll llll lllll llllllllllllllllllllllllllllll lllll lllllllllllllllllllllllllllllllllllll llllllllllllllllll llllllllllllllllllllllllllllllllllllll lll lllllllllllllllllll llllllllllllllll ll llllllllllllllllllll lllllllll lllllllllllllll1.0 1.5 2.0 2.5 3.01.01.52.02.53.0GEI−KGEI−Un = 1000l lll lllllllllllllllllllllllllllllllllllll llllll llllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllll llllll lllllllllllllllllllll llllllllllllllllll lll llllllllllllll llllllllll llllllllllllllll llllllllllllllllllllllllllllllllllll llllllllllllllll llllllllllllllllllllllll llllllllllllllllllllll lllll0.8 1.0 1.2 1.4 1.60.81.01.21.41.6GEI−KGEI−Un = 2000Figure 3.4: Comparison of GEI-U method and the GEI-K method in terms ofthe length of the 95% confidence interval in the special binary case. Westill present results for a random sample of 1000 datasets from all 10000simulated datasets for better visualization. The grey line is the identityline.523.8.2 The saturated modelWe now consider a saturated disease risk model concerning a binary genotype andan environmental exposure having four categories (K = 3) under the parametersetting: κ = (0.9,0.1), δ = (0.4,0.3,0.2,0.1), β0 = logit 0.005, βG = 0, βE =(log1.1, log1.3, log1.5), and βEG = (log1.2, log1.6, log2). We still consider threedifferent sample sizes n ∈ {500,1000,2000}, with equal numbers of controls andcases. For each sample size, we apply the TRAD method, the GEI-U method, andthe GEI-K method on 10000 randomly generated samples to obtain the estimatesand the corresponding 95% confidence intervals for the interaction effects, β (1)EG,β (2)EG, and β(3)EG.We present some summaries about the 10000 generated datasets. When thesample size is 500/1000/2000, there are 921/74/1 datasets containing at least onezero cell count, which are not used for comparison. The GEI-K method convergesfor all of the remaining 9079/9926/9999 datasets. On the other hand, the GEI-Umethod directly finds the constrained maximum likelihood estimates of γ withoutthe need of the one dimensional grid search for 4541/5056/5107 of those datasets.For the other 4528/4870/4892 datasets, the one dimensional grid search is per-formed to find estimates. Among these datasets, non-boundary estimates are foundfor only 2/4/1 datasets.The three summary indices, the bias and the mean squared error of the esti-mators as well as the coverage probabilities of the 95% confidence intervals, arereported in Table 3.2. The comparisons between the GEI-U method and the TRADmethod, and between the GEI-K method and the GEI-U method, in terms of thelength of the 95% confidence intervals, are shown in Figure 3.5 and 3.6, respec-tively. First, we can see that the GEI-K method still performs the best among thethree methods with respect to all these aspects. Secondly, the coverage probabilityof the GEI-U 95% confidence interval is slightly smaller than the nominal level.Thirdly, compared to the TRAD method, the GEI-U method yields comparablemean squared error even when the sample size is small, and better mean squarederror as sample size increases. Finally, the GEI-U 95% confidence intervals arenearly always shorter than their TRAD method counterparts. For the parametersetting used in this simulation study, the asymptotic variance ratios for β (1)EG, β(2)EG53are 0.53, 0.65 and 0.88 between the GEI-U and the TRAD estimators, and 0.97,0.69 and 0.43 between the GEI-U and the GEI-K estimators. Correspondingly,when the sample size is sufficiently large, the GEI-U 95% confidence intervalsare 27%, 19%, and 6% shorter than their TRAD counterparts, and the GEI-K 95$confidence intervals 2%, 17%, and 35% shorter than their GEI-U counterparts, forβ (1)EG, β(2)EG, and β(3)EG, respectively. These theoretical results approximately matchthe empirical results shown in Figure 3.5 and 3.6.Table 3.2: Comparison of the TRAD method, the GEI-U method, and theGEI-K method in terms of the bias and the mean squared error (MSE)of the estimators, and the coverage probability of the 95% confidenceintervals in the scenarios with a saturated disease risk model.n Bias MSE CoverageTRAD GEI-U GEI-K TRAD GEI-U GEI-K TRAD GEI-U GEI-Kβ (1)EG 500 0.020 0.016 0.007 0.642 0.610 0.321 0.950 0.938 0.9581000 0.000 0.002 -0.005 0.284 0.250 0.147 0.952 0.941 0.9552000 0.002 0.013 -0.001 0.138 0.106 0.069 0.949 0.943 0.952β (2)EG 500 0.057 0.059 0.010 0.744 0.749 0.317 0.959 0.929 0.9601000 0.024 0.043 0.005 0.335 0.331 0.144 0.953 0.915 0.9562000 0.013 0.048 0.000 0.156 0.136 0.068 0.954 0.931 0.957β (3)EG 500 0.007 0.022 0.000 0.904 0.990 0.405 0.971 0.927 0.9631000 0.056 0.091 -0.007 0.520 0.547 0.184 0.961 0.910 0.9532000 0.037 0.104 -0.001 0.248 0.241 0.085 0.953 0.928 0.9513.8.3 The reduced modelNext, we consider a reduced disease risk model for a binary genotype and a four-category environmental exposure under the parameter setting: κ = (0.9,0.1), δ =(0.4,0.3,0.2,0.1), β0 = logit0.005, βG = 0, βE = log1.3, and βEG = log3. Again,we consider three different sample sizes n ∈ {500,1000,2000} and set the num-54llllllllllllllllllllllll llllllll llllllllllllllll lllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllll lllllllllllllllllllllll lllllllllllllll llllllllllllll llllllllllllll llllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllll lllll lllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllll llllllllllllll lllllllllllllllllll l lllll llllllllllll1.5 2.5 3.5 4.51.52.53.54.5GEI−UTRADβEG(1)n = 500l llllllllllllll llllllllllllllllllllll llllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllll llllllllllllllllllllllll ll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll ll lllllllllllllllllllllllll llllllll lllllllll l llllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllll lllllllllll llllllllllll llllll llllllllllllllll llllllllllllllll lllllllll2 3 4 52345GEI−UTRADβEG(2)llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllll llllllllllllllllllllllll llll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllll2 3 4 5 623456GEI−UTRADβEG(3)lllllllllllllllllllllllllllllllllllllll llllllllllll llllllllll llllll ll llllllllllllllllll ll llll lllllllll llllllll llllllllll ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllll l llllll llllllll lllllll lllllllllll llll llllllllllllllllllllllllllllllllllllllllll llll lllll l llll llllllllllllllllllllllll llllllllllllllllllll lllll lllllllllllllll lllllllllllllllllllllllllll lllllllllllll lllllllll1.5 2.0 2.5 3.01.52.02.53.0GEI−UTRADn = 1000l lllllllllllllllllllllllll llll llll lllllllllll lllllll lll lll ll lllllllllllllllllllllllllllllll llllllllllllll llllllllll llll llllllllll lllllllllllllllllllllll llllll llllllllllllllllllllllllllllllllllllllllllll llllllllll llllllllllllllll llllllllllllllllllllllllllllllllllllll lllllll lllllll lllllllllllllll llllllllllllllllllllllllllllllllll1.5 2.0 2.5 3.0 3.5 4.01.52.02.53.03.54.0GEI−UTRADllll llll llllll llllllllllllllllll llllllll lllllll lllllllllllllllllllllllllll llllllllllllllllllllllllllllll llllllllllllllllllllllll llllllllllllllllll llllllllllllllllllllllllllllllllllllllllll llllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll ll lll llllllllllll lllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllll1.5 2.5 3.5 4.51.52.53.54.5GEI−UTRADllllllllllllllllll llllllllllllllllllllllll llllllllllllllll lllllllll lllll llllllllllllllllllllllllll l llllllll lll lllllllllllllll llllllllllllllll llllllllll llllllllllllllllllllllllllllllllllllll llllllll llllllllllllllllllllllllllllllllllllllllllll llllllllllllllllll lllllllllllllllllllll lllllllllllll lllllllll llllllllllllllllllllllllllllllllllllllllllll llllllll lllllllllllllll lllllll llllllllllllllllll lllllllllllllllllll llllllllllll llllll llllllllllllllllllllll llllllllllllllllllllllllllll llllllllllllllllllllllllllllll llllllllllll llllllllll1.0 1.2 1.4 1.61.01.21.41.6GEI−UTRADn = 2000llllll llllllllllllllllllllllll l llllllllllll lllll llllllllll llllllllllllll lllllllllllllllllllllllllllll lllllllllllllllllllllllllllll llllllllllllllllllllllllllllll lll lllllllllllllllllllll llllllll lllllllllll llllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllll llllllllllll llllllllllll lllllllllllllll lllllll lllllll llllllllllllllllllllll llllllllllll lllllll lll lllllll llllllllllllll llllllllllllllllllllllllllllllllll lllllllllllllll llllllllllll llllll ll1.0 1.2 1.4 1.6 1.8 2.01.01.21.41.61.82.0GEI−UTRADllllllllllllll llllllll llllllll llllllll llllll llllllll llllllllll lllllllll lllllllllllllllllllllllllllll lllllllll lllllllllllllllllllllllllllllllllll llllll llllllllllllll ll llllllllllllllllllll llll lllllllllll ll llllllll lllllllll lllllllllllll llll lllllllllllllllllllllllllllll llllll1.5 2.5 3.51.52.53.5GEI−UTRADFigure 3.5: Comparison between the TRAD method and the GEI-U methodin terms of the length of the 95% confidence interval for the scenariowith a saturated disease risk model. Results are only presented for arandom sample of 1000 datasets from all 10000 simulated datasets. Thegrey line is the identity line.bers of controls and cases to be equal. We compare the performance of the TRADmethod, the GEI-U method, and the GEI-K method still based on 10000 randomsamples. It should be noted that, with a reduced model, we can estimate the in-teraction effect βEG even when the dataset contains zero cell counts. We find thatthe GEI-K method fails to converge for 1/0/0 dataset using the current setting of55lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllll lllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll2.0 2.5 3.0 3.5 4.0 4.52.02.53.03.54.04.5GEI−KGEI−UβEG(1)n = 500llllllllllllllllllllllll llllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllll lllllllllllll llllllllllllllllllllll llllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllll llllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllllllllll lll lllllllllllllllllllllllllllllllllllll lllllllllll2.0 3.0 4.0 5.02.03.04.05.0GEI−KGEI−UβEG(2)lllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllll2 3 4 5 623456GEI−KGEI−UβEG(3)llllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllll llllllllll lllll llllllllllll ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll ll llllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllll1.2 1.4 1.6 1.8 2.0 2.21.21.41.61.82.02.2GEI−KGEI−Un = 1000ll lllllllllllllllllllllllllllllllllllllllllll llll lll llllll l lllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllll lllllllllllllllllllll lll lllllllllll llllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllll llllllllllllllllll lllllllll lllllllllllllllllllllllllllllllllllllllllllll llllll1.5 2.0 2.5 3.01.52.02.53.0GEI−KGEI−Ullllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllll llllllllllllllll lllllllllllllllllllllllllllll llllllllllllllllllll lllllllllllllllllllllllllllll lllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll1.5 2.5 3.5 4.51.52.53.54.5GEI−KGEI−Ulllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllll llllllllllllll llllll llllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lll llllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll lllllllllllllllllllllll lllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllll llllllllll llll0.9 1.1 1.3 1.50.91.11.31.5GEI−KGEI−Un = 2000llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllll lllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllll lllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllll1.0 1.2 1.4 1.6 1.81.01.21.41.61.8GEI−KGEI−Ullllll llllllllllllllllllllllllllllllllll ll lllll lllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllll ll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllll1.0 1.5 2.0 2.51.01.52.02.5GEI−KGEI−UFigure 3.6: Comparison between the GEI-U method and the GEI-K methodin terms of the length of the 95% confidence interval for the scenariowith a saturated disease risk model. Results are only presented for arandom sample of 1000 datasets from all 10000 simulated datasets. Thegrey line is the identity line.tuning parameters. Among the remaining 9999/10000/10000 generated datasets,the GEI-U method can directly find the constrained maximum likelihood estimatesof γ without the need of the one dimensional grid search for 6941/7588/8366datasets. For the other 3058/2412/1634 datasets, the one dimensional grid searchis performed, and non-boundary estimates are found for only 1/0/0 datasets.56The three summary indices, the bias and the mean squared error of the esti-mators as well as the coverage probabilities of the 95% confidence intervals, arereported in Table 3.3. The comparisons between the GEI-U method and the TRADmethod, and between the GEI-K method and the GEI-U method, in terms of thelength of the 95% confidence intervals, are shown in Figure 3.7 and Figure 3.8,respectively. First of all, as expected, the GEI-K method still performs the bestamong the three methods. Moreover, the GEI-U estimators are only slightly bi-ased. The GEI-U method outperforms the TRAD method by achieving substan-tially lower mean squared errors even when the sample size is small. Also, theGEI-U 95% confidence intervals have coverage probabilities slightly above thenominal level, and are generally shorter than their TRAD counterparts. For this pa-rameter setting, the asymptotic variance ratios for βEG are 0.80 between the GEI-Uand the TRAD estimators, and 0.36 between the GEI-K and the GEI-U estimators,corresponding to 10% and 40% reductions in the length of 95% confidence inter-vals, respectively. These theoretical results are observed empirically in Figure 3.7and 3.8.Table 3.3: Comparison of the TRAD method, the GEI-U method, and theGEI-K method in terms of the bias and the mean squared error (MSE)of the estimators, and the coverage probability of the 95% confidenceintervals in the scenarios with a reduced disease risk model.n Bias MSE CoverageTRAD GEI-U GEI-K TRAD GEI-U GEI-K TRAD GEI-U GEI-K500 0.041 0.074 0.018 0.106 0.063 0.027 0.957 0.977 0.9521000 0.026 0.041 0.010 0.049 0.031 0.013 0.947 0.970 0.9532000 0.016 0.021 0.006 0.023 0.015 0.006 0.953 0.968 0.9513.8.4 The violation of the GEI assumptionFinally, we test the performance of the proposed GEI-based methods when theGEI assumption does not hold. We consider a reduced disease risk model as an57llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllll llllllll1.0 1.5 2.0 2.51.01.52.02.5GEI−UTRADn = 500lllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllll lllllll llllllllllllllllllll llll lllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllll lllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.7 0.8 0.9 1.0 1.1 1.20.70.80.91.01.11.2GEI−UTRADn = 1000lllllllllllllllllllllllllllllllllllllll lll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllll ll lllllllllllllllllllllllllllllllll lllllllllll lllllllllllll ll l0.50 0.60 0.700.500.600.70GEI−UTRADn = 2000Figure 3.7: Comparison between the TRAD method and the GEI-U methodin terms of the length of the 95% confidence interval for the scenariowith a reduced disease risk model. Results are only presented for arandom sample of 1000 datasets from all 10000 simulated datasets. Thegrey line is the identity line.lllllllllllllllllllllllllllll llllllllllllllllllllllllllllllll lllllllllll llllllllllllllllllllllllllllllllllllll l llllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllll lllllll0.6 1.0 1.4 1.80.61.01.41.8GEI−KGEI−Un = 500lllll llllllllllllllllllllllllllll ll llllllllllllllllllllllllllll llll llllllllllllllllll llllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllll0.4 0.5 0.6 0.7 0.8 0.90.40.50.60.70.80.9GEI−KGEI−Un = 1000llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllll0.30 0.40 0.50 0.600.300.400.500.60GEI−KGEI−Un = 2000Figure 3.8: Comparison between the GEI-U method and the GEI-K methodin terms of the length of the 95% confidence interval for the scenariowith a saturated disease risk model. Results are only presented for arandom sample of 1000 datasets from all 10000 simulated datasets. Thegrey line is the identity line.example. The parameter setting for β is the same as the previous section, i.e.,(β0,βG,βE ,βEG) = (logit0.005,0, log1.3, log3). For the joint distribution of geno-58type and environmental exposure, we consider two scenarios:(a) : ι = (0.364,0.269,0.178,0.089,0.036,0.031,0.022,0.011),(b) : ι = (0.38,0.27,0.17,0.08,0.02,0.03,0.03,0.02),where scenario (a) corresponds to a slight violation of the GEI assumption as theodds ratio between genotype and any pair of environmental exposure levels rangesfrom 0.8 to 1.3, and scenario (b) corresponds to a serious violation of the GEIassumption as the odds ratio between genotype and any pair of environmental ex-posure levels ranges from 0.2 to 4.8. For each scenario, we generate 10000 ran-dom samples of 500 controls and 500 cases. The GEI-K method converges forall datasets generated under both scenarios. Under the scenario (a), GEI-U esti-mates can be directly obtained by the global search step of Algorithm 2 for 7520out of 10000 generated datasets, and on-boundary estimates are found by one-dimensional grid search for all the other 2480 datasets. Under the scenario (b),GEI-U estimates can be directly obtained by the global search step of Algorithm2 for only 2646 out of 10000 generated datasets, and on-boundary estimates arefound by one-dimensional search for all the other 7354 datasets. We see that dataseem to be less likely to ‘support’ the GEI assumption when the assumption isseriously violated.Table 3.4 summaries the simulation results for scenarios (a) and (b) with thebias and the mean squared error of the estimators as well as the coverage proba-bilities of the 95% confidence intervals. When the GEI assumption is slightly vi-olated, the GEI-U method still performs pretty well, although slightly worse com-pared to the results shown in Table 3.3. It again leads to smaller mean squarederrors compared to the TRAD method, and the coverage probability of the 95%confidence interval is also maintained at the nominal level. When the GEI assump-tion is seriously violated, however, the GEI-U estimator is greatly biased. Con-sequently, the proposed method produces much larger mean squared errors thanthe TRAD method. Moreover, the coverage probability of the 95% confidence in-terval is much lower than the nominal level. Finally, we can see that the GEI-Kmethod is very sensitive to the violation of the GEI assumption and performs verypoorly when the assumption is seriously violated, even much worse than the GEI-U59method.Table 3.4: Comparison of the TRAD method, the GEI-U method, and theGEI-K method when the GEI assumption is violated, assuming a reduceddisease risk model. Scenarios (a) and (b) correspond to the situationswhere the GEI assumption is slightly and seriously violated, respectively.Simulation results of 10000 random samples are summarized in terms ofthe bias and the mean squared error (MSE) of the estimators, and thecoverage probability of the 95% confidence intervals.Bias MSE CoverageTRAD GEI-U GEI-K TRAD GEI-U GEI-K TRAD GEI-U GEI-KScenario (a) 0.023 0.092 0.080 0.046 0.036 0.020 0.954 0.958 0.916Scenario (b) 0.023 0.366 0.569 0.050 0.152 0.344 0.950 0.629 0.0073.9 Data analysisWe now consider a real dataset for the application of the proposed GEI-U method.The dataset is taken from Garca-Closas et al. [7], who investigate the associationsof polymorphisms in NAT and GST genes with bladder cancer risk and their inter-actions with cigarette smoking among subjects participating in the Spanish Blad-der Cancer Study. We focus on the joint effect of genetic variation in NAT2 andsmoking habit on bladder cancer risk, and restrict the analysis to subjects who hadcomplete information on NAT2 genotype (rapid/intermediate vs. slow acetylator)and smoking habit (never, occasional, former, and current), resulting in a total of1134 cases and 1130 controls. The observed cell frequencies are presented in Table3.5.We now apply the proposed GEI-U method to this dataset. The estimated in-teraction effects, accompanied by their 95% confidence intervals, are 0.529 with(−0.305,1.362) for the interaction between NAT2 slow acetylator and occasionalsmokers, 0.628 with (0.158,1.098) for the interaction between NAT2 slow acety-lator and former smokers, and 0.403 with (−0.092,0.898) for the interaction be-tween NAT2 slow acetylator and current smokers. For comparison, the correspond-60Table 3.5: Data from a case-control study concerning the interaction of NAT2genotype and smoking habit on bladder cancer. NAT2 genotype is codedas 0 for rapid/intermediate acetylator and 1 for slow acetylator. Smokingstatus is coded as 0 for never smoker, 1 for occasional smoker, 2 forformer smoker, and 3 for current smoker.G = 0 G = 1E = 0 E = 1 E = 2 E = 3 E = 0 E = 1 E = 2 E = 3D = 0 131 37 212 113 199 48 240 150D = 1 66 16 161 163 91 32 310 295ing estimates and 95% confidence intervals from the TRAD method are 0.530 with(−0.303,1.362), 0.628 with (0.160,1.096), and 0.407 with (−0.088,0.902), re-spectively. Thus, we can see that making the GEI assumption is not very helpfulfor this particular dataset. Moreover, the estimated disease prevalence from theGEI-U method is 0.51, which is not very convincing since bladder cancer is a rel-atively rare disease. This might suggest that the GEI assumption is not satisfiedfor this problem. However, it should also be reminded that, even when the GEI as-sumption actually holds, exploiting the assumption may lead to no efficiency gainfor some parameter settings, as can be seen from Figure 3.1.3.10 ConclusionIn this chapter, we have studied the problem of estimating parameters arising froma logistic regression model for case-control data. This problem was previouslystudied by Chatterjee and Carroll [3] using the profile likelihood technique. Weapproach the problem in a different way by treating it as a constrained maximumlikelihood estimation problem. It should be noted that both methods are based onthe same retrospective likelihood for the case-control data, and thus should yieldidentical estimates. However, the present contribution is useful because, comparedto the profile likelihood method in [3], our method is easier to implement. More-over, as our method gives the explicit form for the asymptotic variance of the esti-mator, it can help with the planning of a case-control study.61By switching the position of genotype and environmental exposure, our methodis readily applicable to the problem where the environmental exposure is binarybut the genotype has three levels (recessive, co-dominant/incomplete-dominant,dominant). In general, with modern genotyping methods, we often do not knowwhat trait is associated with a specific genetic marker/SNP within a gene, but havemany of these markers within a gene (looks like categories but with no meaningto order). Then the challenge becomes trying to find a suitable model to assessjoint impact of these markers with environmental exposure (typically forced to berepresented as dichotomous) on disease. Our method may be naturally applied toapproach this kind of problems.Many aspects of exploiting the gene-environmental independence assumptionin a case-control study have been well discussed by Chatterjee and Carroll [3].Particularly, they briefly described the possibility of generalizing their method toaddress population stratification. When the genotype and environmental exposureare independent conditional on some stratum variables, the environmental exposureand the stratum variables can be combined as the new ‘environmental variable’.This new variable is independent of the genetic factor, and thus the GEI-basedmethod can still be applied. Our method can be generalized in a similar way toaddress population stratification as well.62Chapter 4Bayesian Inference inCase-Control Studies4.1 IntroductionIn this chapter, we study methods for analyzing case-control data that exploit theGEI assumption in a Bayesian framework. This chapter is organized as follows.We first present another parameterization of the underlying model for case-controldata, which has a reduced form directly incorporating the GEI assumption and alsoconnects to the data structure closely. We then develop a Bayesian frameworkfor analyzing case-control data under the GEI assumption, and conduct simulationstudies to illustrate the performance of the proposed Bayesian method in situa-tions where the GEI assumption indeed holds. Next, we generalize the proposedBayesian method to allow uncertainty around the GEI assumption. We conductmore simulation studies to investigate the performance of the generalized Bayesianmethod in situations where the GEI assumption may or may not hold. Finally, weconsider two real datasets for the application of the proposed methods, and givesome concluding thoughts at the end of this chapter.634.2 Another reparameterizationIn Chapter 3, we presented a reparameterization of the underlying model that isclosely connected to the case-control data structure. It was used to facilitate theanalysis of case-control data from the frequentist perspective. With that parame-terization, however, it is difficult to set an appropriate prior structure reflecting theGEI assumption, since the assumption is defined through some constraint equa-tions. Thus, that parameterization is not suitable for Bayesian analysis. On theother hand, the original parameterization has a reduced form that directly incor-porates the GEI assumption. Nonetheless, it may still not be ideal for Bayesiananalysis in terms of computational efficiency, as it is not very closely connected tothe actual data structure. Therefore, we need a new parameterization of the modelthat has a reduced form directly incorporating the GEI assumption and connects tothe data structure as closely as possible.Let θ still denote the disease prevalence. We use γ0 = (γ000,γ001, . . . ,γ01K)and γ1 = (γ100, . . . ,γ11K) to represent the vectors of sampling probabilities un-derlying controls and cases, respectively. Further, we define ρ = (ρ1, . . . ,ρK)as the vector of gene-environment associations in the source population, whereρk = (ι1kι00)/(ι10ι0k) is the ratio of the odds of having genotype G = 1 in theE = k subpopulation to the odds of having genotype G = 1 in the E = 0 subpop-ulation, for k = 1, . . . ,K. Finally, let γ00 = (γ000, . . . ,γ00K ,γ01+) denote anothervector of probabilities in the control population, where γ01+ = 1−∑k γ00k is thecombined probability of having genotype G = 1.Next, we are going to show that the new parameterization ψ = (θ ,ρ,γ00,γ1)is a reparameterization of the original parameterization φ = (ι ,β0,βG,βE ,βEG).Since we have already shown in Chapter 3 that ξ = (θ ,γ0,γ1) is a reparameteriza-tion of φ , it is sufficient to show the connection between ξ and ψ . On one hand, asι jk = (1−θ)γ0 jk +θγ1 jk, we can write ρk in terms of ξ throughρk(ξ ) ={(1−θ)γ000+θγ100}{(1−θ)γ01k +θγ11k}{(1−θ)γ00k +θγ10k}{(1−θ)γ010+θγ110} ,for k = 1, . . . ,K. The values of the remaining components of ψ , (θ ,γ00,γ1), arereadily available from ξ . Thus, the value of ψ can be uniquely determined given64a value of ξ . On the other hand, in order to determine the value of ξ given avalue of ψ , we only need to find the value of (γ010, . . . ,γ01K) as the values of theother components of ξ can be directly found in ψ . We first rearrange the aboveexpression for ρk to get an equation with respect to γ010 and γ01k:−ρkι0kγ010+ ι00γ01k = θ1−θ (−ι00γ11k +ρkι0kγ110) ,for k = 1, . . . ,K, where the value of (ι00, . . . , ι0K) is known given ψ . Combiningthese equations with the constraint that ∑ j,k γ0 jk = 1, we end up with in total K+1equations for K+1 unknown parameters (γ010, . . . ,γ01K). Furthermore, these K+1equations can be rewritten in matrix form as1 1 1 · · · 1−ρ1ι01 ι00 0 · · · 0−ρ2ι02 0 ι00 · · · 0.......... . ....−ρKι0K 0 0 · · · ι00γ010γ011γ012...γ01K=1− (γ000+ · · ·+ γ00K)θ1−θ (−ι00γ111+ρ1ι01γ110)θ1−θ (−ι00γ112+ρ2ι02γ110)...θ1−θ (−ι00γ11K +ρKι0Kγ110).Then it is clear that this linear equation system always has a unique solution for(γ010, . . . ,γ01K), due to the non-singularity of the matrix of coefficients. Thus, thevalue of ξ can be uniquely determined given a value of ψ . Therefore, ψ can beconsidered as a reparameterization of ξ , and in turn a reparameterization of φ .If we let Pm denote the standard m-simplex, then the parameter space for ψis Φ = P2K+1 ×R2K+2. With the reparameterization from φ to ψ , the originalparameter space Φ is mapped to a new parameter space Ψ, which is only a propersubset of the space P1×RK+×PK+1×P2K+1 because some values of ψ in the largerspace don’t have correspondents in Φ. Moreover, for any given value of ψ in thelarger space, it is easy to test for its membership in Ψ. Now, the retrospectivelikelihood for case-control data in terms of the new parameterization ψ isL(ψ) ={∏j,kγn1 jk1 jk}×{∏kγn00k00k}×{∏k(γ01k(ψ))n01k}.65Finally, the GEI assumption can be easily represented by setting every componentof ρ to be one.4.3 Bayesian frameworkWe now develop a Bayesian framework for analyzing case-control data under theGEI assumption, based on the parameterization of ψ .We begin by specifying a prior distribution for ψ over its parameter space Ψ.First, making the GEI assumption is equivalent to stating that ψ ∈ ΨI , where ΨIdenotes the subset of Ψ corresponding to ρ = 1. Thus, the prior distribution haszero-density whenψ /∈ΨI . Next, we use a flat Dirichlet distribution of order 2K+2that is uniform over the simplex, Dir(1, . . . ,1), as the prior distribution for γ1.Also, the same flat Dirichlet distribution would be used as the prior distribution forγ0 if we were going to specify a prior distribution for ξ . That inspires a Dirichletdistribution of order K + 2, Dir(1, . . . ,1,K+1), as the prior distribution for γ00.Finally, we set an appropriate prior distribution for θ to reflect any knowledgewe have on the disease prevalence, which may come from an external source. Inthis chapter, we simply use a standard uniform distribution as the non-informativeprior distribution for θ , assuming no extra information is available. In summary,the prior density for ψ , denoted by τ(ψ), isτ(ψ) =Cτ · (γ01+)K ·1{ψ ∈ΨI} .where Cτ is a normalizing constant not depending on ψ . Finally, the posteriordensity of ψ , denoted by p(ψ), is proportional to the product of the prior densityτ(ψ) and the likelihood function L(ψ), and thus can be expressed as:p(ψ) =Cp · (γ01+)K ·∏kγn00k00k ·∏k(γ01k(ψ))n01k ·∏j,kγn1 jk1 jk ·1{ψ ∈ΨI} ,where Cp is another normalizing constant also not depending on ψ .To compute the posterior distribution for Bayesian inference, we use the tech-nique of importance sampling. The importance sampling method is often used asa variance reduction technique for approximating integrals. However, it does morethan just that. It also provides an alternative way to simulate from complex distri-66butions (Robert and Casella [25]). Briefly, the method simulates a random samplefrom a proposal distribution along with corresponding importance weights to forma weighted sample numerically representing the original distribution. We choosethe importance sampling technique over other Monte Carlo methods such as therejection sampling method and the Metropolis-Hastings method for two reasons.First, it is computationally more efficient for our problem since a good proposaldistribution can be established. Secondly, comparing to some other Markov ChainMonte Carlo methods like the Metropolis-Hastings method, the importance sam-pling technique comes with a relatively objective measure to evaluate the qualityof the simulated Monte Carlo sample. We will illustrate these two points in theremainder of this section.To ensure the performance of importance sampling, we need a good proposaldistribution forψ . First, we fix ρ = 1 to make sure thatψ ∈ΨI . With the set-up of aconjugate Dirichlet prior distribution for the multinomial distribution, the posteriordistribution for the case cell probabilities γ1 can be easily obtained as a Dirichletdistribution with updated parameters, Dir((n100+1, . . . ,n11K +1)), which in turnserves as a good proposal distribution for γ1. Similarly, a reasonable proposal dis-tribution for γ00 would be another Dirichlet distribution with updated parameters,Dir((n000+1, . . . ,n00K +1,n01++K+1)). Finally, we propose θ from a distribu-tion whose density function is u(θ). Therefore, the density function of the proposaldistribution of ψ isq(ψ) =Cq · (γ01+)n01++K ·u(θ) ·∏j,kγn1 jk1 jk ·∏kγn00k00k ,where Cq is the product of the normalizing constants of the two Dirichlet distribu-tions, which does not depend on ψ . Note that q(ψ) is positive over its support,which is a superset of ΨI . Thus, this proposal distribution of ψ is valid for theimportance sampling.We now discuss two candidate proposal distributions for θ . The first option isto use the prior distribution of θ , which is a standard uniform distribution in mostcases without external information on the disease prevalence. This proposal distri-bution may not be very efficient because we know the posterior distribution of θis more concentrated, especially when the sample size is very large. Nevertheless,67this simple proposal distribution works well for reasonable sample sizes in practice.The second option is to use the result of the frequentist approach as developed inChapter 3. More specifically, we propose θ from a truncated normal distribution onthe interval [0,1], with mean being the constrained maximum likelihood estimateof θ and standard deviation being the corresponding standard error. This is a moreaggressive proposal distribution that works more efficiently for most datasets, es-pecially when the sample size is large. However, when the constrained maximumlikelihood estimate of θ deviates far from the true value for some datasets due torandom variation, this proposal distribution can be very inefficient. In practice, wecan try both proposal distributions and it is very unlikely that both of them performpoorly.For each value of ψ generated by the proposal distribution, the correspondingimportance weight is p(ψ)/(mq(ψ)), where m is the total number of points gener-ated. However, this exact weight can not be directly obtained since the normalizingconstant in p(ψ), Cp, is intractable. In practice, we compute the relative weightw =∏k [γ01k(ψ)]n01k[γ01+]n01+·1(ψ ∈ΨI) ,and renormalize all weights to have them sum to one. Having obtained a weightedsample numerically representing the posterior distribution of ψ , by transformationwe can induce another weighted sample numerically representing the posterior dis-tribution of φ . Either directly based on this weighted sample or based on an un-weighted sample by further resampling, we can obtain the posterior mean and thecorresponding equal-tailed 95% credible intervals for Bayesian inference.Finally, we introduce the effective sample size (ESS) as a measure for assessingthe quality of a weighted sample. It is defined in [25] asESS =(∑mi=1 wi)2∑mi=1 w2iBriefly, it gives the sample size of a simple random sample that would convey thesame amount of information about the target posterior distribution as the weightedsample. Thus, we are going to keep track of the effective sample size of theweighted sample in our importance sampling algorithm and keep iterating until68it is greater than some threshold.4.4 A simulation studyWe now conduct a simulation study to investigate the performance of the proposedBayesian method that exploits the GEI assumption (BGEI), and also compare itwith the traditional prospective logistic method (TRAD).We focus on the situation where both genotype and environmental exposureare binary, and consider parameter settings from a factorial experiment, where wefix the value of δ , βG and βE , and consider all combinations of assignments oftwo different values each for κ , β0, and βEG. In particular, we first set parametersδ = (0.6,0.4), βG = log1.5, and βE = log1.2. We then consider κ = (0.95,0.05)for a rare genotype and κ = (0.7,0.3) for a common genotype, β0 = logit 0.005for a rare disease and β0 = logit 0.2 for a common disease, and βEG = log1.1for a weak gene-environment interaction effect and βEG = log3 for a strong gene-environment interaction effect. Thus, we end up with eight parameter settings usedin the simulation study, as shown in Table 4.1.Table 4.1: Parameter settings used in the simulation study in Section 3.8.κ δ β0 βG βE βEGφ 1 (0.95,0.05) (0.6,0.4) logit 0.005 log1.5 log1.2 log1.1φ 2 (0.95,0.05) (0.6,0.4) logit 0.005 log1.5 log1.2 log3φ 3 (0.95,0.05) (0.6,0.4) logit 0.2 log1.5 log1.2 log1.1φ 4 (0.95,0.05) (0.6,0.4) logit 0.2 log1.5 log1.2 log3φ 5 (0.7,0.3) (0.6,0.4) logit 0.005 log1.5 log1.2 log1.1φ 6 (0.7,0.3) (0.6,0.4) logit 0.005 log1.5 log1.2 log3φ 7 (0.7,0.3) (0.6,0.4) logit 0.2 log1.5 log1.2 log1.1φ 8 (0.7,0.3) (0.6,0.4) logit 0.2 log1.5 log1.2 log3For each parameter setting, we consider three different choices for sample size,n ∈ {1000,3000,9000}, and set the numbers of cases and controls to be equal.For each sample size, we apply the TRAD method and the BGEI method on 200069randomly generated samples to obtain point estimates and the corresponding 95%confidence/credible intervals for the interaction effects βEG. For the BGEI method,the proposal distribution for θ is chosen to be the truncated normal distribution, asdescribed in Section 4.3, and the threshold for the effective sample size is set to be5000.We first examine the computational efficiency of the importance sampling algo-rithm. Table 4.2 shows the 10-, 25-, 50-, 75-, and 90-percentiles of the importancesampling sample sizes actually used in our simulation study. In order to achieveeffective sample size of at least 5000, the importance sampling algorithm requiressample size less than 20000 for 75% of the generated datasets in all settings consid-ered. Moreover, except for one setting, the required importance sampling samplesize is not greater than 30000 for 90% of the generated datasets. Therefore, we cansee the importance sampling algorithm computes the posterior distribution veryefficiently for most datasets.Next, we summarize our simulation results with three key indices in Table 4.3:the bias and the mean square errors of the estimator, and the coverage probability ofthe 95% confidence/credible interval. First, we can see that, when the sample sizeis small, the BGEI estimator is biased for the parameter settings φ 2 and φ 6, bothcorresponding to the situation when the disease is rare and the gene-environmentinteraction effect is strong. For the remaining parameter settings, the BGEI es-timator is nearly unbiased. Secondly, the BGEI method achieves smaller meansquared error than the TRAD method in practice, especially when the sample sizeis relatively small. Yet this advantage diminishes as sample size increases, whichmatches our conclusion in Chapter 3 that exploiting the GEI assumption does notimprove asymptotic estimation efficiency in the special binary case. Lastly, thecoverage probability of the BGEI equal-tailed 95% credible interval is maintainedat the nominal level for a majority of parameter settings considered, but is slightlylower than the nominal level for a few parameter settings.We also compare the BGEI method with the TRAD method in terms of thelength of the 95% credible/confidence intervals, as shown in Figure 4.1. The firstimpression is that the length of the BGEI 95% credible interval is far more variablethan that of the TRAD 95% confidence interval. The BGEI 95% credible intervalsare shorter than their TRAD counterparts for most datasets, especially when the70sample size is small. But, when the gene-environment interaction effect is weak,there are a few datasets for which the BGEI 95% credible intervals are substantiallylonger than the TRAD 95% confidence intervals. However, as the sample size getslarger, the BGEI 95% credible interval and the TRAD 95% confidence interval tendto have similar length. This pattern can be seen more clearly from the last columnin Figure 4.1.Finally, we connect current results with earlier results in Section 3.8, where theparameter setting φ 2 and the sample size n= 1000 were also used in that simulationstudy, to compare the Bayesian GEI-based method (BGEI) and the frequentist GEI-based method (GEI-U as described in Chapter 3). Both estimators are biased andhave similar mean squared errors. However, the two methods result in differentinterval estimates. By sacrificing a little bit in coverage probability, the BGEI 95%credible intervals are shorter than the GEI-U 95% confidence intervals.71Table 4.2: The computational efficiency of the proposed importance sam-pling algorithm summarized by percentiles of the importance samplingsample sizes for achieving effective sample size of at least 5000.n Importance sampling sample sizes10-%tile 25-%tile 50-%tile 75-%tile 90-%tileφ 1 1000 8000 8000 9000 11000 160003000 8000 8000 9000 11000 160009000 8000 8000 9000 12000 20000φ 2 1000 7000 7000 8000 10000 130003000 6000 6000 7000 7000 90009000 6000 6000 6000 7000 8000φ 3 1000 8000 8000 9000 10000 140003000 8000 8000 9000 11000 140009000 8000 8000 9000 10000 15000φ 4 1000 8000 9000 11000 14000 220003000 7000 8000 9000 12000 190009000 7000 7000 8000 9000 11000φ 5 1000 9000 9000 10000 14000 240003000 9000 9000 10000 15000 300009000 9000 9000 12000 19000 58100φ 6 1000 7000 8000 8000 10000 150003000 7000 7000 8000 9000 140009000 7000 7000 7000 8000 12000φ 7 1000 9000 9000 10000 13000 210003000 8000 9000 10000 13000 200009000 8000 9000 10000 13000 22000φ 8 1000 8000 9000 11000 14000 200003000 9000 10000 11000 12000 150009000 9000 10000 10000 11000 1200072Table 4.3: Comparison of the performance between the TRAD method andthe proposed BGEI method for the estimation of the interaction effects,in terms of the bias and the mean squared error (MSE) of the estimators,and the coverage probability of the 95% credible/confidence intervals.n Bias MSE CoverageTRAD BGEI TRAD BGEI TRAD BGEIφ 1 1000 0.014 0.024 0.330 0.252 0.950 0.9513000 -0.005 0.015 0.097 0.080 0.952 0.9469000 -0.001 0.019 0.033 0.028 0.943 0.942φ 2 1000 0.028 0.268 0.275 0.204 0.965 0.9263000 0.012 0.179 0.093 0.074 0.949 0.9089000 0.002 0.106 0.032 0.026 0.942 0.909φ 3 1000 -0.013 -0.022 0.360 0.271 0.957 0.9553000 0.000 -0.008 0.114 0.092 0.945 0.9549000 -0.001 -0.004 0.035 0.028 0.959 0.957φ 4 1000 0.062 -0.032 0.457 0.294 0.959 0.9593000 0.029 -0.023 0.136 0.105 0.951 0.9539000 0.008 -0.026 0.043 0.042 0.952 0.946φ 5 1000 0.012 0.026 0.075 0.064 0.950 0.9523000 0.005 0.023 0.025 0.021 0.948 0.9399000 0.004 0.024 0.007 0.007 0.956 0.927φ 6 1000 0.011 0.142 0.076 0.063 0.950 0.9253000 0.001 0.085 0.025 0.021 0.950 0.9309000 0.000 0.047 0.009 0.007 0.950 0.930φ 7 1000 0.004 0.001 0.076 0.064 0.954 0.9563000 0.000 -0.001 0.024 0.020 0.961 0.9609000 0.001 -0.001 0.008 0.007 0.951 0.953φ 8 1000 0.009 -0.044 0.085 0.077 0.950 0.9473000 0.007 -0.019 0.029 0.029 0.948 0.9469000 0.000 -0.009 0.009 0.009 0.952 0.95273Figure 4.1: Comparison of traditional method (TRAD) and the proposed Bayesian GEI-based method (BGEI) in termsof the length of the 95% confidence/credible interval. The grey lines are the identity line.744.5 Relaxation of the GEI assumptionResearchers sometimes have concerns about the validity of the GEI assumption.Thus, we discuss a variant of the proposed Bayesian method that relaxes the GEIassumption.4.5.1 Two established methodsWe first briefly review two established methods for relaxing the GEI assumption.• The empirical Bayes methodMukherjee and Chatterjee [21] proposed a simple stochastic framework totrade off between bias and efficiency in a data-adaptive way. They showedthat the magnitude of the uncertainty parameter can be estimated from thedata itself. This estimate of the uncertainty parameter can then be used inan empirical Bayes fashion to obtain a shrinkage estimator that “shrinks” themaximum likelihood estimators of gene-environment interaction parametersunder a general model to those obtained under the model that assumes theindependence assumption.Particularly, in the simple set-up of a case-control study with a binary geneticfactor and a binary environmental exposure, Mukherjee and Chatterjee [21]considered two commonly used estimators of the interaction effect βEG, theone obtained from using all case-control dataβˆ (CC)EG = logn001n010n100n111n101n110n000n011,and the other obtained from using data of cases aloneβˆ (CO)EG = logn100n111n101n110.The empirical Bayes estimator of βEG was then proposed as the followingweighted estimatorβˆ (EB)EG =τˆ2τˆ2+ σˆ2CCβˆ (CC)EG +σˆ2CCτˆ2+ σˆ2CCβˆ (CO)EG ,75whereτˆ2 =(logn000n011n001n010)2is an estimate for the conceptual prior variability of the gene-environmentassociation in the population of controls, which measures the uncertaintyabout the independence (among controls) assumption, andσˆ2CC =1∑i=01∑j=01∑k=01ni jkis the estimated variance of the case-control estimator βˆ (CC)EG . When the dataprovide evidence in favor of GEI assumption in the control population (τˆ2→0), we have βˆ (EB)EG → βˆ (CC)EG . When the uncertainty regarding the assumptionbecomes stronger (τˆ2→ ∞), we have βˆ (EB)EG → βˆ (CO)EG .• The full Bayesian methodMukherjee et al. [22] proposed a proper full Bayesian approach for analyz-ing studies of gene-environment interaction, which provides a natural wayto incorporate uncertainty around the assumption of GEI in the populationof controls. In particular, they demonstrated their method in the simpleset-up with a binary genotype and a binary environmental exposure. Letγ0 = (γ000,γ001,γ010,γ011) and γ1 = (γ100,γ101,γ110,γ111) denote the under-lying sampling probabilities for controls and cases, respectively. The priordistributions on γ i are assumed to be independent Dirichlet distributions,namely, γ i ∼ Dir(α i) with α i = (αi00,αi01,αi10,αi11), i = 0,1. Through amultinomial-Dirichlet conjugate analysis, the posterior distribution on γ i canthen be easily derived in closed form as a Dirichlet distribution with updatedparameters, namely, γ i|ni ∼ Dir(ni +α i), where ni = (ni00,ni01,ni10,ni11),i = 0,1. Therefore, the posterior distribution of βEG can be obtained usingextremely inexpensive computation.Next, Mukherjee et al. [22] proposed to reflect prior belief on the assump-tion of GEI in the control population only through prior specification. Morespecifically, they defined the strength of the Dirichlet prior on the control76and case probability vector as si = ∑1j=0∑1k=0αi jk for i = 0,1. Differentchoices of s0 and s1 induce different variances on the logarithm of the gene-environment associations in controls and cases, respectively. Also, to makethe corresponding prior distribution on βEG roughly centered around zero,it was implicitly assumed in [22] that the two independent Dirichlet priordistributions are symmetric. For reflecting different degrees of belief on theassumption, the two parameter vectors, γ0 and γ1, are treated asymmetricallyin the prior specification. The prior distribution on γ1 is chosen to be fairlynon-informative, say α1 = (5,5,5,5). On the other hand, the Dirichlet prioron γ0 can have varying strength s0 to induce different prior variance aroundthe assumption of independence (among controls).However, the above two methods both focus on the other form of the GEI as-sumption that asserts GEI in the population of controls. This form of the assump-tion is not very natural, particularly as acquisition of genotype and environmentalexposure are temporally antecedent to the disease. Therefore, we study the methodfor relaxing the more natural assumption of GEI within the source population inthe next section.4.5.2 A generalized Bayesian frameworkIn this section, we present a generalized Bayesian method that extends the Bayesianframework developed in Section 4.3 to further incorporate uncertainty around theGEI assumption.Now, instead of asserting that ρ = 1, we model ρ a priori by a multivariatelog-normal distributionρ ∼ lnNorm(0, σ2ρ I),where the common standard deviation σρ can be specified to reflect different levelsof uncertainty about the GEI assumption. Assuming the priors for other parametersare the same as before, the density of the prior distribution of ψ now becomesτ˜(ψ) =Cτ˜ · (γ01+)K ·K∏k=1d(log(ρk)/σρ)ρk·1{ψ ∈Ψ} ,77where d(·) is the density function of the standard normal distribution. Note thatthe parameter space for ψ is now enlarged from ΨI to the entire Ψ. The posteriordensity of ψ now becomesp˜(ψ) =Cp˜ · (γ01+)K ·K∏k=1d(log(ρk)/σρ)ρk·∏kγn00k00k ·∏k(γ01k(ψ))n01k ·∏j,kγn1 jk1 jk ·1{ψ ∈Ψ} .Computationally, we need to extend the importance sampling algorithm to alsohave ρ proposed from its prior distribution. Moreover, since the GEI assumptionmay not hold, it would be better to use a standard uniform distribution as the pro-posal distribution for θ . Thus, we end up with a proposal distribution for ψ withthe following density functionq˜(ψ) =Cq˜ · (γ01+)n01++K ·K∏k=1d(log(ρk)/σρ)ρk·∏j,kγn1 jk1 jk ·∏kγn00k00k .We can then compute the relative importance weightw˜ =∏k [γ01k(ψ)]n01k[γ01+]n01+·1(ψ ∈Ψ)for each value of ψ generated from the proposal distribution and renormalize allweights to have them sum up to one. As before, we want the effective sample sizeof the weighted sample greater than some threshold to ensure a good approximationto the posterior distribution.4.6 Another simulation studyIn this section, we conduct another simulation study to compare the performance ofthe three methods discussed in Section 4.5 in situations where the GEI assumptionmay or may not hold.We still focus on the set-up with a binary genotype and a binary environmentalexposure. We assume that the marginal distributions for genotype and environmen-tal exposure are κ = (0.95,0.05) and δ = (0.6,0.4), respectively. The coefficients78in the logistic regression model are: β0 = logit 0.2, βG = log1.5, βE = log1.2, andβEG = log3. We now consider three different scenarios: (a) ρ = 1, (b) ρ = 1.1, and(c) ρ = 3, corresponding to no violation, a slight violation, and a serious violationof the GEI assumption, respectively. For each scenario, we generate 2000 datasetsof 500 cases and 500 controls.We analyze each generated case-control dataset using the empirical Bayes (EB)method, the full Bayesian (FB) method with three different choices of s0, theBayesian GEI (BGEI) method , and the generalized Bayesian (gBGEI) methodwith two different choices of σρ . For three FB methods, we fix the case Dirichletparameter α1 as (5,5,5,5) with s1 = 20, and set the control Dirichlet parameterα0 at (5,5,5,5), (20,20,20,20), and (80,80,80,80), corresponding to values ofs0 at 20 (FB-20), 80 (FB-80), and 320 (FB-320), respectively. This setting wasalso used in [22]. For two gBGEI methods, the standard deviation σρ is set at 0.1(gBGEI-0.1) and 1.0 (gBGEI-1.0) for small and great uncertainty around the GEIassumption, respectively. Table 4.4 summarizes our simulation results in three in-dices: the bias and the mean squared error of the estimators as well as the coverageprobabilities of the 95% credible/confidence intervals.When the GEI assumption holds or is only slightly violated, our proposedBayesian methods all result in unbiased or nearly unbiased estimators. The cov-erage probabilities of the 95% credible intervals are all maintained at the nominallevel. Moreover, the BGEI method produces the smallest mean squared error, al-though the mean squared error of the gBGEI-0.1 estimator is very close. On theother hand, the EB estimator is slightly biased and the three FB estimators are allgreatly biased. The coverage probabilities of the 95% credible/confidence intervalsresulting from the EB method and the three FB methods are all lower than the nom-inal level. Lastly, the performance of the FB method gets worse (more bias, greatermean squared error, and lower coverage probability) with stronger prior belief onthe assumption of GEI in the control population.When the GEI assumption is seriously violated, both the BGEI method andthe gBGEI-0.1 method perform poorly as they strongly rely on the validity of theGEI assumption. The gBGEI-1.0 method performs much better as it allows moreuncertainty around the GEI assumption. On the other hand, the performance ofthe EB method is relatively invariant to the violation of the GEI assumption, as it79Table 4.4: Comparison between different methods in terms of the bias andthe mean squared error (MSE) of the estimators of βEG, and the coverageprobability of the 95% credible/confidence intervals in situations wherethere is no violation (ρ = 1), a slight violation (ρ = 1.1), and a seriousviolation (ρ = 3) of the GEI assumptionBias MSE Coverageρ = 1 EB -0.116 0.318 0.910FB-20 -0.357 0.299 0.935FB-80 -0.665 0.542 0.658FB-320 -0.785 0.695 0.340BGEI -0.001 0.275 0.968gBGEI-0.1 -0.004 0.278 0.966gBGEI-1.0 -0.016 0.343 0.964ρ = 1.1 EB -0.072 0.276 0.932FB-20 -0.326 0.290 0.933FB-80 -0.575 0.435 0.750FB-320 -0.658 0.516 0.518BGEI 0.089 0.263 0.959gBGEI-0.1 0.080 0.267 0.960gBGEI-1.0 0.019 0.353 0.952ρ = 3 EB 0.249 0.397 0.898FB-20 -0.203 0.226 0.966FB-80 -0.141 0.125 0.980FB-320 -0.034 0.086 0.983BGEI 0.796 0.791 0.489gBGEI-0.1 0.760 0.741 0.557gBGEI-1.0 0.074 0.360 0.961automatically adjusts itself to rely less on the assumption. Finally, it is interestingto see that the FB methods work surprisingly well for this setting. The performanceof the FB method gets better with stronger prior belief on the assumption of GEI80in the control population. However, the true gene-environment odds ratio in thecontrol population is about 1.8, which does not justify the good performance of theFB method. Therefore, we feel that the performance of the FB method can be hardto predict.4.7 Data analysisWe now examine two datasets to illustrate our methods. One dataset, taken fromHwang et al. [15], reveals an interaction effect of maternal smoking during preg-nancy and a Taq1 polymorphism at the transforming growth factor alpha (TGFα)locus on oral clefts. In this study, 113 infants with cleft palate were identifiedas cases (D = 1) and 281 infants without cleft palate were selected as controls(D = 0). All subjects were tested to ascertain whether they are carriers (G = 1) ornon-carriers (G = 0) of any rare Taq1 alleles. Smoking status during pregnancy,with E = 1 for smokers and E = 0 for non-smokers, was obtained by interview.The other dataset considered concerns the joint effect of NAT2 genotype andcigarette smoking on bladder cancer, which was first published in Gu et al. [10] andwas later reanalyzed by Gustafson and Burstyn [13]. The dataset consists of 502cases (D = 1) and 512 controls (D = 0). The two categories of genotype are rapid(G= 0) and slow (G= 1) acetylator. The smoking status in this study is categorizedas either heavy (E = 1) or never/light (E = 0). Both datasets are summarized inTable 4.5.For each dataset, we apply ten estimators to quantify the gene-environmentinteraction, including the traditional case-control estimator (TRAD), the seven es-timators considered in Section 4.6 (EB, BGEI, gBGEI-0.1, gBGEI-1.0, FB-20, FB-80, FB-320), and another generalized Bayesian estimator with σρ = 3.0 (gBGEI-3.0). The results are summarized in Table 4.6. We again emphasize that the pro-posed Bayesian methods address the GEI assumption in the source population, butthe empirical Bayes method and the full Bayesian method concern the GEI assump-tion in the control population. Given that the oral cleft is a rare disease, however,the two assumptions are approximately equivalent.For both datasets, we see some similar patterns. First, the BGEI estimate isquite different from the TRAD estimate. Also, the gBGEI estimate is closer to81Table 4.5: Two datasets considered in Section 4.7. One was reported in [15]concerning the effect of TGF α genotype and maternal smoking duringpregnancy on oral cleft. Genotype is coded as 0 = common Taq1 allele,1 = rare Taq1 allele. Smoking status is coded as 0 = non-smoker, 1 =smoker. The other was used by [10] to investigate the effect of NAT2genotype and smoking during pregnancy on oral cleft. Genotype is codedas 0 = rapid acetylator , 1 = slow acetylator. Smoking status is coded as0 = never/light, 1 = heavy.Data from [15] Data from [10]G = 0 G = 1 G = 0 G = 1E = 0 E = 1 E = 0 E = 1 E = 0 E = 1 E = 0 E = 1D = 0 167 69 34 11 172 58 230 52D = 1 60 32 12 9 106 83 156 157the BGEI estimate when the uncertainty around the GEI assumption is weak, andcloser to the TRAD estimate when the uncertainty around the GEI assumptionis strong. Secondly, comparing to the BGEI estimate, the EB estimate is evenmore different from the TRAD estimate. Finally, the FB method tends to give theestimate that is most far away from the TRAD estimate when a very strong priorstrength is assumed.The two datasets might differ in term of the validity of the GEI assumption. Forthe first dataset, the T GFα genotype is probably independent of maternal smokingduring pregnancy, as the EB estimate is very different from the TRAD estimate.Thus, comparing to the TRAD method, all methods that strongly rely on the as-sumption yield different point estimates and shorter interval estimates. For thesecond dataset, the NAT 2 genotype might not be independent of maternal smokingduring pregnancy, since the estimate resulting from the empirical Bayes methodis less different from the TRAD estimate. Therefore, unlike the first dataset, allmethods produce more similar results.82Table 4.6: The point and interval estimates of the gene-environment interac-tion βeg for the two datasets considered in Section 4.7, obtained by tendifferent methods.Dataset of [15] Dataset of [10]Estimate 95% CI Estimate 95% CITRAD 0.586 (−0.628, 1.799) 0.651 (0.093, 1.208)EB 0.374 (−0.628, 1.376) 0.516 (−0.077, 1.110)BGEI 0.474 (−0.637, 1.598) 0.555 (0.032, 1.114)gBGEI-0.1 0.484 (−0.607, 1.610) 0.577 (0.044, 1.115)gBGEI-1.0 0.552 (−0.663, 1.730) 0.650 (0.095, 1.210)gBGEI-3.0 0.585 (−0.612, 1.790) 0.650 (0.109, 1.198)FB-20 0.435 (−0.640, 1.446) 0.626 (0.092, 1.172)FB-80 0.185 (−0.820, 1.130) 0.580 (0.052, 1.079)FB-320 0.088 (−0.805, 0.971) 0.496 (0.030, 0.970)4.8 ConclusionIn this chapter, we have developed a Bayesian framework for analyzing case-control data under the GEI assumption, and also generalized it to relax the GEIassumption. We have shown through two real dataset applications that the general-ized Bayesian method does indeed serve as a compromise between the traditionalcase-control method and the proposed Bayesian method exploiting the exact GEIassumption.We have seen in Chapter 3 that knowing the disease prevalence in addition tothe GEI assumption can further improve estimation efficiency. This is also truefor the proposed Bayesian methods. Moreover, the proposed Bayesian methodsallow more flexibility to incorporate prior knowledge on the disease prevalencethrough an appropriate prior distribution. Therefore, rather than knowing the exactdisease prevalence, any knowledge about the disease prevalence, like the rare dis-ease assumption, may help improve estimation efficiency. The more concentratedprior distribution assumed on the disease prevalence, the better efficiency can be83achieved.Finally, we have also briefly reviewed the empirical Bayes method and the fullBayesian method for relaxing the GEI assumption. However, these two methodsboth depend on the other form of the GEI assumption that asserts gene-environmentindependence in the population of controls. That version of GEI assumption is notas natural as the GEI assumption considered throughout this thesis. The assump-tion of GEI in the control population is often justified under the GEI assumption ifthe disease is rare. In that case, however, we feel that it would be more appropriateto directly incorporate the GEI assumption and the rare disease assumption into theanalysis, which can be easily achieved by the proposed Bayesian methods. We ex-pect that methods exploiting different versions of the GEI assumption would resultin similar estimates if the prior distribution asserts the disease to be very rare. Butsome discrepancy could be observed if the prior distribution allows possibility ofthe disease prevalence being less extreme.84Chapter 5Bayesian Inference in Case-OnlyStudies5.1 IntroductionIn this chapter, we study the analysis of case-only data under the GEI assumptionand the rare disease assumption. The case-only design is a special variant of thestandard case-control design, where only data on cases are collected. It exploits theGEI assumption for studying the gene-environment interaction, assuming that thedisease is rare. This chapter is organized as follows. We first show how to adaptthe Bayesian methods proposed in Chapter 4 for analyzing case-only data. We theninvestigate the prior distribution of the systematic bias of the traditional case-onlymethod under different levels of disease prevalence, while the other assumptionstill holds true. Simulation studies are conducted to compare the performance ofthe proposed Bayesian methods with the traditional case-only method. Finally, weapply the proposed Bayesian methods on two real datasets.5.2 Bayesian case-only methodsWe first describe the application of the Bayesian methods developed in Chapter4 on case-only data, since the case-only study can be viewed as a special type ofcase-control study where all control cell frequencies are zeros.85First, the likelihood in terms ofψ for case-only data is simply L(ψ)=∏ j,k γn1 jk1 jk .Next, we still fix ρ = 1 to reflect the GEI assumption, and set the prior distributionsfor γ00 and γ1 as Dir(1, . . . ,1,K+1) and Dir(1, . . . ,1), respectively. Moreover, toreflect the rare disease assumption, we acknowledge that a smooth prior on θ , suchas a Beta distribution putting most of its mass very close to zero, might make morepractical sense. However, for the sake of illustration, we initially assume that θis uniformly distributed over (0,ν), where ν is a threshold slightly greater thanzero that controls the prior belief concerning the rare disease assumption. We willuse both specifications for our data analysis in Section 5.5. In summary, the priordensity τ(ψ) is proportional to (γ01+)K when ψ ∈ ΨI and 0 < θ < ν , and zerootherwise. Finally, the posterior density p(ψ) is:p′(ψ) =Cp′ · (γ01+)K ·∏j,kγn1 jk1 jk ·1{ψ ∈ΨI} ·1{0< θ < ν} ,where Cp′ is a normalizing constant that does not depend on ψ . The interactioneffects are estimated by their posterior means.Computationally, we still use the technique of importance sampling to nu-merically represent the posterior distribution. The proposal distribution for γ1 isDir(n100+1, . . . ,n11K +1). The proposal distribution for γ00 is the same as itsprior distribution, Dir(1, . . . ,1,K+1), since n001 = · · ·= n01K = 0. Lastly, the pro-posal distribution for θ is also its prior distribution U(0,ν). With these proposaldistribution, it can be easily verified that the density function of the proposal dis-tribution for ψ is some constant. Therefore, all proposed values of ψ , such thatψ ∈ΨI and 0< θ < ν , are equally weighted.Next, we discuss the limiting case when we have an infinite amount of case-only data. In this case, the posterior distribution of case cell probabilities convergesto a point mass at its true value, say γ∗1. Correspondingly, the region of plausiblevalues of ψ , where ψ ∈ΨI , 0< θ < ν , and γ1 = γ∗1, may be greatly shrunk, leadingto more concentrated distributions of βEG. Thus, we can investigate the limitingposterior distributions (LPDs) of βEG’s to gain some insight about how much atmost can be learned from a non-identified model [11]. To obtain the LPDs, wemodify the importance sampling algorithm by having γ1 fixed at its true γ∗1 in theimportance sampling algorithm.86Finally, the generalized Bayesian method developed in Section 4.5 is also ap-plicable for the analysis of case-only data when GEI assumption may be relaxed.We still model ρ a priori by a multivariate log-normal distribution with a tuningparameter σρ controlling the degree of uncertainty around the GEI assumption. Forcase-only data, the generalized Bayesian method also assigns equal weights to allproposed values of ψ such that ψ ∈Ψ.5.3 Bias of the traditional case-only methodIn this section, we investigate the relationship between the disease prevalence andthe performance of the traditional case-only method. Particularly, in the case ofbinary genetic and environmental factors, we haveβEG = ζ1−ζ0, (5.1)where ζ1 and ζ0 are the logarithm of gene-environment odds ratios in the popula-tions of cases and controls, respectively. Since ζ1 is the target parameter estimatedby the traditional case-only method and βEG is the parameter of real interest, theirdifference ζ0 = ζ1−βEG measures the systematic bias of the traditional case-onlymethod.We examine the prior distributions of ζ0 conditional on different values of θ .Given a value of θ , we randomly generate 100000 values of ψ by sampling γ00and γ1 from their prior distributions, respectively. We only keep points such thatψ ∈ΨI and 0< θ < ν , which are more than 95% for all values of θ considered. Wethen calculate the corresponding value of ζ0 and obtain a numerical representationfor its distribution conditional on the given value of θ . Figure 5.1 shows the mean,accompanied with the 2.5- and 97.5-percentiles, of the prior distribution of ζ0|θfor different values of θ . We can see from this figure that some substantial bias canemerge when the disease is more prevalent than 0.5%.5.4 A simulation studyWe now conduct a simulation study to compare the proposed Bayesian case-onlymethod with the traditional case-only method. Still, we focus on the case of87Figure 5.1: The mean and the 2.5- and 97.5-percentiles of the prior distribu-tions of ζ0|θ for different values of disease prevalence θ . Note that alogarithmic scale is used for the x-axis.a binary genotype and a binary environmental exposure. We assume that themarginal distributions for genotype and environmental exposure are κ = (0.9,0.1)and δ = (0.7,0.3), respectively. We then set the main genetic and environmen-tal effects to be βG = 0 and βE = log2. Finally, we consider combinations oftwo values for β0 with two values for βEG. We consider β0 = logit 0.001 andβ0 = logit 0.01, corresponding to a very rare and a modestly rare disease, respec-tively. We also consider βEG = log2 and βEG = log6, corresponding to a modestand a strong gene-environment interaction effect, respectively. Thus, we end upwith four parameter settings used in the simulation study, as shown in Table 5.1.The true disease prevalences corresponding to these parameter settings are 0.14%,0.16%, 1.4%, and 1.6%, respectively.We first examine the LPDs of βEG under these four parameter settings. Thefour LPDs are obtained using ν = 0.01. Table 5.2 presents the mean, the median,and the 95% equal tailed credible intervals of the prior distribution and the fourLPDs of βEG. We can see that the LPDs are indeed much more concentrated than88Table 5.1: Parameter settings used in the simulation study in Section 5.4.κ δ β0 βG βE βEGφ 1 (0.9,0.1) (0.7,0.3) logit 0.001 0 log2 log2φ 2 (0.9,0.1) (0.7,0.3) logit 0.001 0 log2 log6φ 3 (0.9,0.1) (0.7,0.3) logit 0.01 0 log2 log2φ 4 (0.9,0.1) (0.7,0.3) logit 0.01 0 log2 log6the prior distribution. More importantly, each 95% LPD credible interval coversthe corresponding true value of βEG.Table 5.2: Summary statistics about the prior distribution of βEG and theLPDs for case-only data under four parameter settings.Mean Median 2.5- and 97.5-%ilesPrior -0.016 0.017 (-5.242, 5.053)LPD under φ 1 0.700 0.693 ( 0.552, 0.898)LPD under φ 2 1.805 1.788 ( 1.702, 1.979)LPD under φ 3 0.682 0.676 ( 0.540, 0.858)LPD under φ 4 1.718 1.703 ( 1.611, 1.935)Next, we compare the performance of the Bayesian and the traditional case-only method in estimating βEG with respect to the mean squared error (MSE) ofthe estimators, the coverage probability and the average length of the 95% confi-dence/credible intervals, based on 10000 datasets of 1000 cases. For the Bayesianmethod, we consider four different values, ν ∈ {0.001,0.003,0.01,0.03}, for theupper bound of the prior distribution on θ . Our simulation results are summarizedin Table 5.3. On one hand, when we are very confident that the disease is reallyrare (u = 0.001,0.003), the Bayesian interval estimates are very similar to the tra-ditional interval estimates. On the other hand, when we are less certain that thedisease is very rare, the Bayesian interval becomes wider to increase the cover-age probability of the 95% credible interval. Moreover, we find that the Bayesian89method tends to be over-conservative. Thus, as we can see from the results for φ 4,even when the prior belief on the disease prevalence slightly deviates from the truth(ν = 0.01 for θ = 0.016), the coverage probability of the 95% credible intervals isstill above the nominal level, and the average length of the intervals is comparableto that of the traditional method (0.809 versus 0.711).Table 5.3: Comparison of the performance between the Bayesian and the tra-ditional case-only method with respect to the mean squared error (MSE)of the estimators, the coverage probability and the average length of the95% confidence/credible intervals, based on 10000 dataset of 1000 cases.ν MSE Coverage Lengthφ 1 Traditional 0† 0.034 0.951 0.717Bayesian 0.001 0.034 0.953 0.7260.003 0.034 0.958 0.7470.01 0.034 0.973 0.8240.03 0.036 0.994 1.090φ 2 Traditional 0† 0.033 0.950 0.715Bayesian 0.001 0.033 0.953 0.7210.003 0.033 0.958 0.7410.01 0.033 0.972 0.8110.03 0.036 0.990 1.052φ 3 Traditional 0† 0.035 0.943 0.719Bayesian 0.001 0.035 0.945 0.7270.003 0.035 0.953 0.7490.01 0.035 0.974 0.8260.03 0.035 0.995 1.094φ 4 Traditional 0† 0.040 0.917 0.711Bayesian 0.001 0.041 0.922 0.7180.003 0.040 0.932 0.7380.01 0.039 0.961 0.8090.03 0.036 0.994 1.053† The traditional method can be conceptually viewed as assuming ν = 0.90Finally, we investigate the effect of sample size by considering four differentchoices, n1 ∈ {500,1000,2000,5000}, under the parameter setting φ 4. For theBayesian method, we set ν = 0.03. The coverage probabilities of the 95% confi-dence/credible intervals, based on 10000 datasets, are reported in Table 5.4. Weobserve that the two methods behave differently as the sample size increases. Forthe Bayesian method, the coverage probability is increasing and approaching 1 asthe sample size gets larger. On the other hand, the traditional case-only performsworse as the sample size increases. Technically, as sample size goes to infinity, the95% confidence interval of the traditional case-only method converges to a pointmass at the true log odds ratio among cases, and thus its coverage probability con-verges to zero unless there is no systematic bias. In contrast, Gustafson [12] hasshown that, in the partially identified context, the large-sample limit of frequentistcoverage for Bayesian (1−α) credible intervals is one over a large subset of theparameter space, and zero over its complement, where large means having priorprobability 1−α .Table 5.4: The coverage probabilities of the Traditional and the Bayesian95% confidence/credible intervals for different case-only sample sizesn1, based on 10000 datasets.n1 = 500 n1 = 1000 n1 = 2000 n1 = 5000Traditional 93.1% 91.2% 87.9% 77.5%Bayesian 98.6% 99.6% 99.9% 100.0%5.5 Data analysisWe consider two real datasets for the application of the proposed Bayesian meth-ods. One example concerns a binary environmental exposure and the other con-cerns a four-category environmental exposure.5.5.1 Analysis of colorectal cancer dataThe first dataset is from a case-control study concerning the molecular epidemiol-ogy of colorectal cancer (MECC), which was previously used by Mukherjee et al.91[22]. We used the part of data concerning the interaction effect of NAT 2 phenotype(slow vs. fast) and smoking status (never vs. ever). The observed cell counts areprovided in Table 5.5.Table 5.5: Case-control data concerning the interaction of NAT 2 genotype(slow vs. fast) and smoking status (never vs. ever) on colorectal cancer.Colorectal Cancer (No) Colorectal Cancer (Yes)NAT2 (slow) NAT2 (fast) NAT2 (slow) NAT2 (fast)Smoke (never) 665 437 623 410Smoke (ever) 584 285 475 277The traditional case-only method gives an estimated interaction effect of−0.121with a 95% confidence interval of (−0.315,0.073), and the traditional case-controlmethod results in an estimated interaction effect of 0.177 with a 95% confidenceinterval of (−0.092,0.445). We can see that the traditional case-only method maybe misleading in this data example, since its 95% confidence interval has littleoverlap with the case-control 95% confidence interval.We now apply the proposed Bayesian case-only method to this dataset. Weassume that the prevalence of colorectal cancer is less than 1%, i.e., the prior dis-tribution on θ is θ ∼ Unif(0,0.01). The posterior mean of βEG from the proposedBayesian method is−0.118, with the 95% credible interval being (−0.357,0.124).We also consider a more realistic prior distribution for the prevalence of colorectalcancer, which is θ ∼ Beta(5,995). The result is very similar, with the estimatedposterior mean being−0.121 and the 95% credible interval being (−0.358,0.117).Whatever prior is used for θ , we see that the proposed Bayesian method leadsto more conservative credible intervals that are slightly more overlapped with thecase-control confidence interval.Finally, we apply the generalized Bayesian case-only method to this dataset.We again use Beta(5,995) as the prior distribution for disease prevalence. We firstconsider the setting σρ = 0.05 to mimic the scenario that the GEI assumption mightonly be slightly violated. The estimated posterior mean is −0.124 and the 95%credible interval is (−0.382,0.132). Also, we consider the scenario where the GEI92assumption might be moderately violated with σρ = 0.5. In this case, the estimatedposterior mean is −0.123 and the 95% credible interval is (−1.157,0.900). Asexpected, the length of the 95% credible interval increases if we allow for thepossibility of a stronger violation of the GEI assumption.5.5.2 Analysis of ovarian cancer dataThe second dataset is from a population-based case-control study of ovarian cancerreported by Modan et al. [20]. This study assessed whether the use of oral contra-ceptives and multiparity lower the risk of ovarian cancer in carriers of a BRCA1/2mutation, as they do in non-carriers. Modan et al. [20] assumed that the the statusof BRCA1/2 mutations and the reproductive risk factors are independent in con-trols, and did not provide a detailed breakdown. Thus, only data from the 832cases were used. Our analysis focused on the interaction between the parity andthe status of BRCA1/2 mutations, with data summarized in Table 5.6.Table 5.6: Case-only data concerning the number of births and the status ofBRCA1/2 mutations for 832 women with ovarian cancer.Number of Births0 1-2 3-4 ≥ 5BRCA 1/2 NonCarriers 68 248 199 77BRCA 1/2 Carriers 20 119 90 11It was reported by Chatterjee and Carroll [3] that the estimate of the prevalenceof ovarian cancer in the underlying population is about 0.00087. Thus, we set thethreshold of the disease prevalence θ in the Bayesian method to be 0.2%. Makingthe 0-birth group the reference group, the estimated log odds ratios of the Bayesianmethod (and the associated 95% credible intervals) are 0.470 (−0.062,1.023) forthe group of 1-2 births, 0.413 (−0.141,0.980) for the group of 3-4 births, and−0.702 (−1.521,0.079) for the group of more than 4 births. Similar to the first dataanalysis example, we also consider a smooth prior θ ∼ Beta(1,999). This yieldsestimated log odds ratios of 0.476 (−0.068,1.042) for the group of 1-2 births,0.419 (−0.143,1.002) for the group of 3-4 births, and−0.698 (−1.507,0.093) for93the group of more than 4 births. Finally, we present the corresponding estimatesand the associated 95% confidence intervals resulting from the traditional case-onlymethod for comparison, which are 0.490 (−0.055,1.034), 0.430 (−0.127,0.988),and −0.722 (−1.527,0.083), respectively. We see that with a strong prior beliefasserting that the disease is very rare, the proposed Bayesian method gives verysimilar results to those from the traditional case-only method. Thus, the proposedBayesian method can be viewed in some sense as an extension of the traditionalcase-only method with more flexibility.Again, we also apply the generalized Bayesian case-only method to this dataset.We use Beta(1,999) as the prior distribution for disease prevalence. When the GEIassumption might only be slightly violated with σρ = 0.05, the estimated log oddsratios (and the associated 95% credible intervals) are 0.472 (−0.078,1.051) forthe group of 1-2 births, 0.413 (−0.152,0.992) for the group of 3-4 births, and−0.703 (−1.507,0.094) for the group of more than 4 births. When we assumeσρ = 0.5 to allow a moderate violation of the GEI assumption, the estimated logodds ratios (and the associated 95% credible intervals) are 0.466 (−0.647,1.579)for the group of 1-2 births, 0.415 (−0.729,1.562) for the group of 3-4 births, and−0.698 (−1.975,0.528) for the group of more than 4 births. Again, permittinga moderate violation of the GEI assumption leads to substantially wider intervalestimates.5.6 ConclusionWe have investigated the performance of the traditional case-only method withdifferent levels of disease prevalence, assuming that the genetic factor and the en-vironmental exposure are independent in the target population. We have foundsome empirical evidence that, for most realistic parameter settings, the traditionalcase-only method works quite well for diseases less prevalent than 0.1%. When thedisease prevalence is greater than 0.5%, however, we begin to see some substantialbias, and thus the traditional case-only method should be used with caution.We have shown that the Bayesian framework for analyzing case-control datadeveloped in Chapter 4 is readily applicable for analyzing case-only data. Partic-ularly, the Bayesian case-only method allows the flexibility of incorporating dif-94ferent prior beliefs on disease prevalence. Compared to the traditional case-onlymethod, the Bayesian case-only method leads to interval estimates that are muchless likely to miss the true value if a correct, or even slightly incorrect, prior beliefon disease prevalence is assumed. Moreover, we have seen from our simulationstudies and data examples that, if we are confident that the disease is indeed veryrare, the Bayesian case-only method gives almost identical results to the traditionalcase-only method. Thus, the proposed method can be viewed as a generalizationof the traditional case-only method, which improves the quality of inference bytaking expert opinion or previous knowledge into consideration.Finally, we have the generalized Bayesian method which relaxes the GEI as-sumption and thus can be used to address situations when the GEI assumptionmight be violated. We have seen that allowing for the possibility of a moderateviolation of the GEI assumption leads to a substantial increase in the length ofthe 95% credible interval. This reflects the fact that the case-only method is verysensitive to the violation of the GEI assumption.95Chapter 6Future WorkIn this thesis, we have studied the methods for analyzing case-control data thatexploit the GEI assumption from both the frequentist and Bayesian perspective.Though presented in the context of gene-environment interaction studies, our meth-ods can be applied to other case-control studies concerning two explanatory vari-ables that are independent of each other. We have also developed the constrainedmaximum likelihood estimation for partially identified models, which may evenhave a broader application to solve other problems. In this chapter, we brieflydescribe a few possible directions for future research.1. Partially identified models with no transparent reparameterizationIn Chapter 2, we assume that the partially identified model can be under-stood through a transparent re-parameterization that separates the identifi-able parameters from non-identifiable parameters. Unfortunately, such are-parameterization does not always exist. Gustafson et al. [14] gives twoexamples that do not admit a transparent re-parameterization. We suspectthat the established theoretical results are still valid even when a transparentre-parameterization is not available. However, a rigorous proof is needed.2. Numerical algorithm directly incorporating inequality constraintsIn Chapter 3, we use a two-step numerical algorithm for finding the GEI-constrained maximum likelihood estimate with unknown disease prevalence,where the second step of one-dimensional grid search is performed when the96estimate found by the first step breaks inequality constraints. That algorithmworks reasonably well and suffices for our research problem, as our focusis not developing the best numerical algorithm. However, more elegant andefficient numerical algorithms that directly incorporate inequality constraintsin each iteration may be desired.3. Reduced disease risk model for the Bayesian frameworkIn Chapter 4, we only propose the Bayesian method for a saturated diseaserisk model. One may also want to extend that for a reduced disease riskmodel. With a reduced model and the GEI assumption, it can be shown that(θ ,ρ,γ000,γ010,γ011,γ100,γ101,γ110,γ111) serves as another parameterizationof the model. A similar Bayesian framework can be developed for a reducedmodel based on this new parameterization.4. Continuous environmental exposureIn most gene-environment interaction studies, the environmental factors ofinterest are usually categorical variables with a few categories. Yet, thereare situations where a continuous environmental exposure is of interest. Inthat case, we may convert a continuous variable to a categorical variable bygrouping values, based on empirical quantiles for example. Hopefully, theloss in efficiency due to categorization can be minimized by having many,say ten, categories, and will be well compensated by the efficiency gain ofexploiting the GEI assumption. More research can be conducted to look intothis matter.97Bibliography[1] J. Aitchison and S. D. Silvey. Maximum-likelihood estimation of parameterssubject to restraints. The Annals of Mathematical Statistics, 29(3):813 – 828,1958. → pages 6, 8, 9, 11, 14, 18, 22, 35, 36[2] P. S. Albert, D. Ratnasinghe, J. Tangrea, and S. Wacholder. Limitations ofthe case-only design for identifying gene-environment interactions.American Journal of Epidemiology, 154(8):687 – 693, 2001. → pages 3, 4[3] N. Chatterjee and R. J. Carroll. Semiparametric maximum likelihoodestimation exploiting gene-environment independence in case-controlstudies. Biometrika, 92(2):399 – 418, 2005. → pages 2, 31, 32, 38, 39, 45,61, 62, 93[4] H. Y. Chen and J. Chen. On information coded in gene-environmentindependence in case-control studies. American Journal of Epidemiology,174(6):736 – 743, 2011. → pages 2, 28, 46[5] A. R. Conn, N. I. M. Gould, and P. L. Toint. A globally convergentaugmented lagrangian algorithm for optimization with general constraintsand simple bounds. SIAM Journal on Numerical Analysis, 28(2):545 – 572,1991. → pages 36[6] J. Dennis, S. Hawken, D. Krewski, N. Birkett, M. Gheorghe, J. Frei,G. McKeown-Eyssen, and J. Little. Bias in the case-only design applied tostudies of gene-environment and gene-gene interaction: a systematic reviewand meta-analysis. International Journal of Epidemiology, 40:1329 – 1341,2011. → pages 3[7] M. Garca-Closas, N. Malats, D. Silverman, M. Dosemeci, M. Kogevinas,D. W. Hein, A. Tardon, C. Serra, A. Carrato, R. Garca-Closas, J. Lloreta,G. Castao-Vinyals, M. Yeager, R. Welch, S. Chanock, N. Chatterjee,S. Wacholder, C. Samanic, M. Tor, F. Fernndez, F. X. Real, and R. N. Nat298slow acetylation and gstm1 null genotypes increase bladder cancer risk:results from the spanish bladder cancer study and meta-analyses. Lancet,366(9468):649 – 659, 2005. → pages 60[8] N. M. Gatto, U. B. Campbell, A. G. Rundle, and H. Ahsan. Furtherdevelopment of the case-only design for assessing gene-environmentinteraction: evaluation of and adjustment for bias. International Journal ofEpidemiology, 33:1014 – 1024, 2004. → pages 4[9] P. E. Gill and E. Wong. Mixed integer nonlinear programming, volume 154,chapter Sequential quadratic programming method, pages 147 – 224.Springer, 2012. → pages 36[10] J. Gu, D. Liang, Y. Wang, C. Lu, and X. Wu. Effects of n-acetyl transferase1 and 2 polymorphisms on bladder cancer risk in caucasians. MutationResearch, 581:97 – 104, 2005. → pages 1, 81, 82, 83[11] P. Gustafson. What are the limits of posterior distributions arising fromnonidentified models, and why should we care? Journal of the AmericanStatistical Association, 104(488):1682 – 1695, 2009. → pages 86[12] P. Gustafson. On the behavior of bayesian credible intervals in partiallyidentified models. Electronic Journal of Statistics, 6:2107 – 2124, 2012. →pages 91[13] P. Gustafson and I. Burstyn. Bayesian inference of gene environmentinteraction from incomplete data: What happens when information onenvironment is disjoint from data on gene and disease? Statistics inMedicine, 30:877 – 889, 2011. → pages 81[14] P. Gustafson, A. E. Gelfand, S. K. Sahu, W. O. Johnson, T. E. Hanson,L. Joseph, and J. Lee. On model expansion, model contraction, identifiabilityand prior information: two illustrative scenarios involving mismeasuredvariables. Statistical Science, 20(2):111 – 140, 2005. → pages 6, 96[15] S. J. Hwang, T. H. Beaty, S. R. Panny, N. A. Street, J. M. Joseph, S. Gordon,I. McIntosh, and C. A. Francomano. Association study of transforminggrowth factor alpha (tgf alpha) taqi polymorphismand oral clefts: indicationof gene-environment interaction in a population-based sample of infantswith birth defects. American Journal of Epidemiology, 141(7):629 – 636,1995. → pages 81, 82, 83[16] S. G. Johnson. The nlopt nonlinear-optimization package. URLhttp://ab-initio.mit.edu/nlopt. Accessed: 2016-05-31. → pages 3699[17] H. Luo, A. Bouchard-Coˆte´, G. Cohen Freue, and P. Gustafson. Theconstrained maximum likelihood estimation for parameters arising frompartially identified models. arXiv:1607.08826v1 [math.ST]. →pages iv[18] H. Luo, I. Burstyn, and P. Gustafson. Gene-environment independence incase-control studies: Issues of parameteriza- tion and bayesian inference.Statistics in Bioscience, 7:460 – 475, 2015. → pages iv[19] C. F. Manski. Partial identification of probability distributions. Springer,2003. → pages 5[20] B. Modan, P. Hartge, G. Hirsh-Yechezkel, A. Chetrit, F. Lubin, U. Beller,G. Ben-Baruch, A. Fishman, J. Menczer, J. P. Struewing, M. A. Tucker, andS. Wacholder. Parity, oral contraceptives and the risk of ovarian canceramong carriers and noncarriers of a brca1 or brca2 mutation. New EnglandJournal of Medicine, 345:235 – 240, 2001. → pages 93[21] B. Mukherjee and N. Chatterjee. Exploiting gene-environmentindependence for analysis of case-control studies: an empirical bayes-typeshrinkage estimator to trade-off between bias and efficiency. Biometrics, 64(3):685 – 694, 2008. → pages 2, 3, 75[22] B. Mukherjee, J. Ahn, S. B. Gruber, M. Ghosh, and N. Chatterjee.Case-control studies of gene-environment interaction: Bayesian design andanalysis. Biometrics, 66:934 – 948, 2010. → pages 2, 3, 76, 77, 79, 92[23] W. W. Piegorsch and C. R. Weinberg. Non-hierarchical logistic models andcase-only designs for assessing susceptibility in population-basedcase-control studies. Statistics in Medicine, 13:153 – 162, 1994. → pages 2,3[24] R. L. Prentice and R. Pyke. Logistic disease incidence models andcase-control studies. Biometrika, 66(3):403 – 411, 1979. → pages 1[25] C. P. Robert and G. Casella. Introducing Monte Carlo methods with R.Springer, 2010. → pages 67, 68[26] G. D. Smith and S. Ebrahim. Mendelian randomization: prospects,potentials, and limitations. International Journal of Epidemiology, 33:30 –42, 2004. → pages 2, 4[27] D. Spring. On the second derivative test for constrained local extrema. TheAmerican Mathematical Monthly, 92(9):631 – 643, 1985. → pages 15100[28] D. M. Umbach and C. R. Weinberg. Designing and analysing case-controlstudies to exploit independence of genotype and exposure. Statistics inMedicine, 66:403 – 411, 1997. → pages 2101Appendix AThe Forms of Some Vectors andMatricesWe first define notations for two kinds of auxiliary matrices. Let Is denote theidentity matrix of size s, Es,t denote the s× t all-ones matrix, and Os,t denote thes× t zero matrix.(a) The score function of the log-likelihood:s(γ) =n001γ001...n01Kγ01Kn101γ101...n11Kγ11K−n000γ000 ·EK,1n100γ100 ·EK,1.(b) The unconstrained Fisher information:Bγ = B(0)γ OK,KOK,K B(1)γ ,102where, for i = 0,1,B(i)γ =nin1γi01 0 · · · 00 1γi02....... . . 00 · · · 0 1γi1K+1γi00·EK,K.(c) The Jacobian of g(ξ ) with respect to γ:Jξ = (1−θ) ·J†θ ·J† ,whereJ† =−ι10 · IK−(ι01, . . . , ι0K)ι00 · IK−E2K+1,1× (ι11, . . . , ι1K).(d) The Jacobian of g(ξ ) with respect to θ :Kξ =(∂g1∂θ, . . . ,∂gK∂θ),where, for k = 1, . . . ,K,∂gk∂θ= (γ100− γ000)ι1k +(γ11k− γ01k)ι00− (γ10k− γ00k)ι10− (γ110− γ010)ι0k.(e) The Jacobian of h(γ), the constraints imposed by assuming a reduced model,103with respect to γ:Hγ = 1γ000 E2K+1,1× (1, . . . ,K−1) O2K+1,K−1− 1γ100 E2K+1,1× (1, . . . ,K−1) O2K+1,K−1+X0 OK,K−1OK+1,K−1 Y0X1 OK,K−1OK+1,K−1 Y1 ,where, for i = 0,1,Xi = (−1)i2γi013γi01 · · · Kγi01− 1γi02 0 · · · 00 − 1γi03....... . . 00 · · · 0 − 1γi0K,andYi = (−1)i− 1γi10 − 2γi10 · · · −K−1γi102γi113γi11 · · · Kγi11− 1γi12 0 · · · 00 − 1γi13....... . . 00 · · · 0 − 1γi1K.104
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- The gene-environment independence assumption in the...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
The gene-environment independence assumption in the analysis of case-control data Luo, Hao 2016
pdf
Page Metadata
Item Metadata
Title | The gene-environment independence assumption in the analysis of case-control data |
Creator |
Luo, Hao |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | In this thesis, we consider the problem of exploiting the gene-environment independence assumption in a case-control study inferring the joint effect of genotype and environmental exposure on disease risk. We first take a detour and develop the constrained maximum likelihood estimation theory for parameters arising from a partially identified model, where some parameters of the model may only be identified through constraints imposed by additional assumptions. We show that, under certain conditions, the constrained maximum likelihood estimator exists and locally maximizes the likelihood function subject to constraints. Moreover, we study the asymptotic distribution of the estimator and propose a numerical algorithm for estimating parameters. Next, we use the frequentist approach to analyze case-control data under the gene-environment independence assumption. By transforming the problem into a constrained maximum likelihood estimation problem, we are able to derive the asymptotic distribution of the estimator in a closed form. We then show that exploiting the gene-environment independence assumption indeed improves estimation efficiency. Also, we propose an easy-to-implement numerical algorithm for finding estimates in practice. Furthermore, we approach the problem in a Bayesian framework. By introducing a different parameterization of the underlying model for case-control data, we are able to define a prior structure reflecting the gene-environment independence assumption and develop an efficient numerical algorithm for the computation of the posterior distribution. The proposed Bayesian method is further generalized to address the concern about the validity of the gene-environment independence assumption. Finally, we consider a special variant of the standard case-control design, the case-only design, and study the analysis of case-only data under the gene-environment independence assumption and the rare disease assumption. We show that the Bayesian method for analyzing case-control data is readily applicable for the analysis of case-only data, allowing the flexibility of incorporating different prior beliefs on disease prevalence. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-01-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0340074 |
URI | http://hdl.handle.net/2429/59878 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_february_luo_hao.pdf [ 1.13MB ]
- Metadata
- JSON: 24-1.0340074.json
- JSON-LD: 24-1.0340074-ld.json
- RDF/XML (Pretty): 24-1.0340074-rdf.xml
- RDF/JSON: 24-1.0340074-rdf.json
- Turtle: 24-1.0340074-turtle.txt
- N-Triples: 24-1.0340074-rdf-ntriples.txt
- Original Record: 24-1.0340074-source.json
- Full Text
- 24-1.0340074-fulltext.txt
- Citation
- 24-1.0340074.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0340074/manifest