Model and Inference Issues Related to Exposure-DiseaseRelationshipsbyLi XingMSc, Simon Fraser University, 2007MSc, University of British Columbia, 2005BSc, Hebei University of Technology, 2000A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University Of British Columbia(Vancouver)August 2014c Li Xing, 2014AbstractThe goal of my thesis is to make contributions on some statistical issues related toepidemiological investigations of exposure-disease relationships.Firstly, when the exposure data contain missing values and measurement errors,we build a Bayesian hierarchical model for relating disease to a potentially harmfulexposure while accommodating these flaws. The traditional imputation method,called the group-based exposure assessment method, uses the group exposure meanto impute the individual exposure in that group, where the group indicator indicatesthat the exposure levels tend to vary more across groups and less within groups.We compare our method with the traditional method through simulation studies,a real data application, and theoretical calculation. We focus on cohort studieswhere a logistic disease model is appropriate and where group exposure meanscan be treated as fixed effects. The results show a variety of advantages of thefully Bayesian approach, and provide recommendations on situations where thetraditional method may not be suitable to use.Secondly, we investigate a number of issues surrounding inference and theshape of the exposure-disease relationship. Presuming that the relationship can beexpressed in terms of regression coefficients and a shape parameter, we investigatehow well the shape can be inferred in settings which might typify epidemiologicinvestigations and risk assessment. We also consider a suitable definition of theaverage effect of exposure, and investigate how precisely this can be inferred. Wealso examine the extent to which exposure measurement error distorts inferenceabout the shape of the exposure-disease relationship. All these investigations re-quire a family of exposure-disease relationships indexed by a shape parameter. Forthis purpose, we employ a family based on the Box-Cox transformation.iiThirdly, matching is commonly used to reduce confounding due to lack of ran-domization in the experimental design. However, ignoring measurement errors inmatching variables will introduce systematically biased matching results. There-fore, we recommend to fit a trajectory model to the observed covariate and thenuse the estimated true values from the model to do the matching. In this way, wecan improve the quality of matching in most cases.iiiPrefaceThis thesis was completed under the supervision of Professor Paul Gustafson.Chapter 2 of this thesis is based on the paper “A Comparison of Bayesian Hi-erarchical Modeling with Group-based Exposure Assessment in Occupational Epi-demiology.” Statistics in Medicine, Xing L, Burstyn I, Richardson DB, Gustafson P,2013 [56]. This research problem was identified and designed by my thesis advisorProfessor Paul Gustafson and by our collaborator Professor Igor Burstyn. As thefirst author, I proposed the model, developed the algorithm, conducted the mathe-matical calculation, and performed the empirical analysis and simulation studies.I drafted the manuscript; my supervisor and other coauthors helped me to reviseit. Professor David Richardson provided the data from the Savannah River Site(SRS) Study. The use of the SRS data has been approved by UBC Research EthicsBoard, under the project title, SRSBM, and the Certificate Number of the EthicsCertificate obtained, H09-03190. And Professor Paul Gustafson is the principalinvestigator of this project.Chapter 3 of this thesis is based on the paper “On the Shape of an Exposure-Disease Relationship, the Average Effect of Exposure, and the Impact of ExposureMeasurement Error.” under review for Statistics in Medicine, Xing L, Burstyn I,Gustafson P, 2014 [57]. This research problem was identified and designed bymy thesis advisor Professor Paul Gustafson. As the first author, I worked on var-ious conjectures, conducted numerical analysis and mathematical calculation, andperformed data analysis and simulation studies. I drafted the manuscript, my su-pervisor and other coauthor helped me to revise it.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 AComparison of BayesianHierarchicalModeling with Group-basedExposure Assessment in Occupational Epidemiology . . . . . . . . . 42.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Bayesian Hierarchical Modeling . . . . . . . . . . . . . . . . . . 72.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 112.5 Bias of the Group-based Exposure Assessment Method . . . . . . 172.6 An Empirical Example . . . . . . . . . . . . . . . . . . . . . . . 242.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 On the Shape of an Exposure-Disease Relationship, the Average Ef-fect of Exposure, and the Impact of Exposure Measurement Error . 36v3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.2 Logistic Box-Cox Model . . . . . . . . . . . . . . . . . . . . . . 393.2.1 Parameter Estimates in the Model . . . . . . . . . . . . . 403.2.2 Factors in Epidemiological Studies . . . . . . . . . . . . 453.2.3 Average Predictive Effect . . . . . . . . . . . . . . . . . . 473.3 Measurement Errors in the Model . . . . . . . . . . . . . . . . . 593.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684 Two-step Approach inMatching Process for Longitudinal MatchingCovariates in the Presence of Measurement Error . . . . . . . . . . 724.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.2 Baltimore Experience Corps Study . . . . . . . . . . . . . . . . . 754.3 Statistical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 764.4 Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 784.5 Treatment Effect . . . . . . . . . . . . . . . . . . . . . . . . . . 794.6 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 804.7 Matching Quality and Measurement Error . . . . . . . . . . . . . 864.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95viList of TablesTable 2.1 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and100 samples observed in each group . . . . . . . . . . . . . . 12Table 2.2 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and1000 samples observed in each group . . . . . . . . . . . . . . 13Table 2.3 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.4 and1000 samples observed in each group . . . . . . . . . . . . . . 14Table 2.4 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.4 and100 samples observed in each group . . . . . . . . . . . . . . 15Table 2.5 The SSE and the ASE of the estimated b1 when b1 = 0.6 and100 samples are observed in each group . . . . . . . . . . . . 16Table 2.6 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and100 samples observed in each group under the average differ-ence between group exposure means equal to 0.3 . . . . . . . . 23Table 2.7 Estimates from logistic regression using true exposure intensityto predict leukemia mortality . . . . . . . . . . . . . . . . . . 24Table 2.8 Reduced occupations of the SRS male workers and their sum-maries statistics . . . . . . . . . . . . . . . . . . . . . . . . . 26viiTable 2.9 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and20 samples observed in each group . . . . . . . . . . . . . . . 29Table 2.10 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.4 and20 samples observed in each group . . . . . . . . . . . . . . . 30Table 2.11 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and100 samples observed in each group under different priors forthe scale parameters . . . . . . . . . . . . . . . . . . . . . . . 32Table 2.12 Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and20 samples observed in each group under different priors for thescale parameters . . . . . . . . . . . . . . . . . . . . . . . . . 33viiiList of FiguresFigure 2.1 The standardized biases versus sB under sW = 0.5 and b1 =0.6. The vertical bars are 95% confidence intervals due to sim-ulation variability. . . . . . . . . . . . . . . . . . . . . . . . 18Figure 2.2 Plot of (1rg)(1 2rg) as a function of the group exposuremean, µg, based on b0 =4 and b1 = 0.4. . . . . . . . . . . 22Figure 3.1 The histogram of the level of acid phosphatase in King-Armstrongunits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Figure 3.2 Distributions of the exposure variable. . . . . . . . . . . . . . 46Figure 3.3 The risks of disease as a function of exposure in the selectedsettings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48Figure 3.4 Asymptotic standard deviation of lˆ over different settings. . . 50Figure 3.5 Asymptotic standard deviation of Dˆ over different settings. . . 51Figure 3.6 The asymptotic standard deviation of the estimated averagepredictive effect, Dˆ, and the asymptotic standard deviation oflˆ under P1 = 0.02 and different values of factor A, B, and D. . 52Figure 3.7 The asymptotic standard deviation of the estimated averagepredictive effect, Dˆ, and the asymptotic standard deviation oflˆ under P1 = 0.02 and different values of factor A, B, and D. . 53Figure 3.8 Coefficient of variation of Dˆ over different settings. . . . . . . 54Figure 3.9 The average predictive effect, D, and the large-sample limitingcoefficient, g⇤1 , from the simple logistic model under differentsettings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58ixFigure 3.10 The asymptotic standard deviation of gˆ1 from the misspecifiedmodel and the asymptotic standard deviation of Dˆ under differ-ent settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59Figure 3.11 The difference between the shape parameter, l , in the truemodel (3.1) and the large-sample limit, h⇤, in the misspecifiedmodel (3.19) in the presence of the multiplicative measurementerror with se = 0.1 across different settings. . . . . . . . . . . 62Figure 3.12 The difference between the shape parameter, l , in the truemodel (3.1) and the large sample limit, h⇤, in the model (3.19)in the presence of the multiplicative measurement error se =0.01 across different settings. . . . . . . . . . . . . . . . . . . 63Figure 3.13 The difference between the true probability and the estimatedprobability from the misspecified model (3.19) across differentsettings under se = 0.1. . . . . . . . . . . . . . . . . . . . . 64Figure 3.14 The difference between the true probability and the estimatedprobability from the misspecified model (3.19) across differentsettings under se = 0.01. . . . . . . . . . . . . . . . . . . . 65Figure 3.15 The difference between the true probability and the fitted onefrom the misspecified model (3.19). . . . . . . . . . . . . . . 66Figure 3.16 The estimated probability of the nodal involvement in the prostatecancer patients via the fitted logistic Box-Cox model. . . . . . 67Figure 4.1 Boxplot of the global distance from different conditions. . . . 82Figure 4.2 Boxplot of the global distance from different conditions. . . . 83Figure 4.3 Boxplot of the global distance from different conditions. . . . 84Figure 4.4 Boxplot of the global distance from different conditions. . . . 85Figure 4.5 Boxplot of the estimated treatment effects from different con-ditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87Figure 4.6 Boxplot of the estimated treatment effects from different con-ditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88Figure 4.7 Boxplot of the estimated treatment effects from different con-ditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89xFigure 4.8 Boxplot of the estimated treatment effects from different con-ditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90xiAcknowledgmentsI would like to express my special appreciation and thanks to my PhD supervisor,Professor Paul Gustafson, you have been a tremendous supervisor for me. Thanksfor your priceless advice on my research and for letting the research work becomea delightful and enjoyable journey. I also would like to express my thanks to ourcollaborator, Professor Igor Burstyn at the Department of Environmental and Oc-cupational Health of Drexel University, who provided valuable and critical advicein epidemiology. Also thanks to Professors Lang Wu and Rollin Brant for servingas my committee members along the the way, and thanks to our other collabora-tors, Professor David Richardson at the Department of Epidemiology, Universityof North Carolina, who provided the data from the Savannah River Site Study andgave valuable advice in epidemiology. And also thanks to Professors Qianli Xue,Jeanine M Parisi, and George Rebok at the Johns Hopkins Center on Aging andHealth for hosting me and working together with us on the matching problems.I feel so lucky to have been able to study at the UBC Statistics Department,where I had the good fortune to have been mentored by Professors Harry Joe,William J. Welch, Ruben H. Zamar, Matı´as Salibia´n-Barrera, and others. I learnedso much from their insight, advice and tremendous knowledge of statistics. Also Iwish also to acknowledge my Master degrees’ supervisors, Professor CharmaineDean of the Department of Statistical and Actuarial Science, University West-ern Ontario and Professor Kee Yuen Lam of the Mathematics Department, UBC.Thanks also to Peggy Ng, Andrea Sollberger, and Zin Kurji in the UBC Statisticsdepartment office, who always gave efficient assistance with sincere care ratherthan merely carrying out duties.A special thanks to my beloved husband, Xuekui Zhang, who takes lot of fam-xiiily responsibility to support my studies. Thanks to my friends and colleagues fortheir encouragement and generous help in my life.A final acknowledgement goes to NSERC and University 4-year Fellowshipsof UBC.xiiiChapter 1IntroductionInvestigation of relationships between exposures and diseases is a central topicin epidemiologic studies. The exposure variable could be workers’ exposure topotentially hazardous chemicals and/or ionizing radiation in the workplace in theoccupational epidemiologic settings, could be the level of biomarkers in the humanbody in medical research, or could be the level of air pollution and/or heavy metalexposure in environmental health studies. And the disease outcome is typicallybinary with 1 indicating the presence of the disease and 0 indicating the absenceof the disease. The goal of my thesis is to develop statistical methods to assistin finding associations between exposures and diseases under the consideration ofsome typical study concerns, which include the imperfectly measured exposuredata, various potential shapes of relationship between exposure and disease, andthe lack of randomization in the study design.As for concerns in such studies, first of all, it is often the case that the exposuremeasurements are from readings in devices which are prone to measurement errors.If ignored in the analysis, these would introduce biased inference. Also, particu-larly in occupational epidemiologic studies, it is very costly to measure everyone’sexposure in the workplace. So the investigators randomly sample some workers tohave their exposure measured, which introduces missing data in exposure measure-ments. However, the disease outcome data are typically relatively complete sincethey are obtained from the hospital records. Therefore, we cannot simply ignorethe incomplete exposure data nor just use the naive traditional imputation method,1which may introduce biased results. Secondly, to model the shape of the relation-ship between exposure and disease, the linear model and the logit-linear modelare overused in epidemiology due to convenience, but nonlinear relationships arequite possible and, therefore, require more attention. The final concern is that alot of epidemiologic studies on exposure-disease relationship are observational co-hort studies. Due to the lack of randomization in the experimental design, thereare potential confounders in estimating the relationship between exposure and dis-ease. Therefore, it is challenging to appropriately reduce confounding when thereis also measurement error in the confounders. I address these issues separately inthe following chapters.In Chapter 2, we build a Bayesian hierarchical model for relating disease to apotentially harmful exposure, which can accommodate the missingness and mea-surement errors in the exposure measurement. The traditional imputation method,called the group-based exposure assessment method, uses the group exposure meanto impute the individual exposure in that group, where the the group indicator in-dicates that the exposure levels tend to vary more across groups and less withingroups. We compare our method with the traditional method through simulationstudies, a real data application, and theoretical calculation. We focus on cohortstudies where a logistic disease model is appropriate and where group exposuremeans can be modeled as fixed effects. The results show a variety of advantages ofthe fully Bayesian approach, and lead to recommendations on situations where thetraditional group-based exposure assessment method may not be suitable to use.In Chapter 3, we explore some conjectures concerning the nonlinear shapes ofthe exposure-disease relationship. Presuming that the relationship can be expressedin terms of regression coefficients and a shape parameter, we investigate how wellthe shape can be inferred in settings which might typify epidemiologic investiga-tions and risk assessment. We also consider a suitable definition of the averageeffect of exposure, and investigate how precisely this can be inferred. This is doneboth in the case of using a model acknowledging uncertainty about the shape pa-rameter and in the case of using a simple model ignoring this uncertainty. We alsoexamine the extent to which exposure measurement error distorts inference aboutthe shape of the exposure-disease relationship. All these investigations require afamily of exposure-disease relationships indexed by a shape parameter. For this2purpose, we employ a family based on the Box-Cox transformation.In Chapter 4, we look into how to reduce confounding of the observed covari-ates in estimating the relationship between exposure and disease since in the pres-ence of measurement error in the confounders, the traditional matching methodcan provide systematically biased matching results. We propose to use the mixed-effects model to fit the repeated measurements or the longitudinal measurements ofthe matching covariates. Then we can adjust for the measurement errors by usingmatching on variables from the model. Simulation studies show us the this pre-matching process improve the matching quality and, as a consequence, treatmenteffect can be better estimated.Discussion on our current investigations and future research topics on exposure-disease relationships are mentioned in Chapter 5.3Chapter 2A Comparison of BayesianHierarchical Modeling withGroup-based ExposureAssessment in OccupationalEpidemiology2.1 IntroductionIn occupational cohort studies, interest lies in the association between potentiallyharmful occupational exposures occurring in the workplace and diseases outcomes.A binary outcome variable is often available for each cohort member. However,exposure quantities are usually measured for only a sample of members due to cost,or may be historically available only on a sample. For the occupational exposure,we often have repeated measurements, which are used to characterize day-to-dayvariations and measurement error. In this chapter we focus on a setting in which wewish to examine the association between a person’s average occupational exposureintensity, x, and a binary outcome, y. As for the measurement errors, there are twocommon types. One is the classical measurement error model, which states that4the observed exposure, x⇤, is randomly distributed around the truth, x,x⇤ = x+ e, e ⇠ N(0,s2W ), and x |= e.Please note that x⇤ is typically on the logarithmic scale of the observed exposuresince the distribution of the observed exposure on the original scale is often right-skewed. The other is called the Berkson error model. In a Berkson error model, itis assumed that we have a predefined targeted value. The underlying true value, x,is equal to the targeted value, x˜, plus a random error. That is,x = x˜+ e, e ⇠ N(0,s2Be), and x˜ |= e.It is typical in a controlled experiment, where we know a predefined level such asa temperature or radiation exposure level, and we wish to bring the experimentalquantity to this value, but the underlying true value may be away from this targetedvalue by an error. In an occupational epidemiology study, the exposure scores aresometimes assigned based on work areas, jobs, or other factors, and exposure levelsare classified based on these work groups. In such cases, group exposure meansare estimated with uncertainty and retain features of classical measurement errorstructure as in Berkson’s “modified controlled experiment” [4].There are two methods to assign average exposure intensity scores to cohortmembers by drawing on available exposure measurements for a subsample of thecohort. One is called group-based exposure assessment; the other is called individual-based exposure assessment. Assume x⇤gi j is the log transformed exposure measure-ment for the jth measurement of the ith subject in the gth group and it satisfies thefollowing model.x⇤gi j = log(exposure) = µg + ggi + egi j = xgi + egi j,where µg is the fixed group exposure mean, ggi is the random effect for the ith sub-ject in the gth group, xgi is the true exposure based on log scale for the gth group andith individual, and egi j is the measurement error with i= 1, · · · ,ng and g= 1, · · · ,G.For the group-based exposure assessment, the group exposure mean estimated fromthe sample of individuals from the gth group, X¯⇤g = n1g Ângi=1ÂJj=1 x⇤gi j, is used to5approximate the true exposure, xgi, for all people in that group. For the individual-based exposure assessment, if there are measurements for that individual, the in-dividual exposure mean, X¯⇤gi = J1ÂJj=1 x⇤gi j, is used to approximate the true ex-posure, xgi, for that individual. For the classical exposure model, Tielemans et al.[53] showed that, when regressing continuous disease outcomes on the exposuremeasurements, the group-based exposure assessment method provided less biasedestimates of the association parameter, and the individual-based exposure assess-ment method provided more biased, but more precise estimates of the associationparameter. Kim et al. [29] confirmed this result. However, when encounteringmissing their exposure measurements, the individual-based exposure assessmentmethod uses the complete-case analysis strategy. That is, the subjects without anyobserved exposure will be ignored. This strategy can cause loss of information[35], and, therefore, is not recommended in general. The group-based exposureassessment method works as a simple imputation method. That is, X¯⇤g is imputedas the exposure measurement for all people in that group, and then a model fordisease status given imputed exposure and covariates is fitted and interpreted asif it were a model for disease status given actual exposure and covariates. Thiscan improve the utility of the data, especially when there is a large proportion ofmissing exposure data. Thus occupational health researchers are more interestedin the group-based exposure assessment method, and there is more work on it inthe literature, which we describe in detail as follows.When the outcome variable is binary, Kim et al. [28] showed that the group-based exposure assessment method corresponds to an approximate Berkson er-ror model when the group exposure mean is well estimated, such as when it isbased on large sample size. Then E(Xgi|X¯⇤g ) = X¯⇤g +(m1)(X¯⇤g µg)⇡ X¯⇤g , wherem = Cov(Xgi, X¯⇤g )/Var(X¯⇤g ). Then they postulated an approximate Berkson errormodel as Xgi = X¯⇤g + egi, where egi is normally distributed and Cov(X¯⇤g ,egi) = 0.Based on this Berkson error model, Kim and Burstyn [27] showed that Bayesianmethods can provide less biased estimates of the association parameter comparedto the group-based exposure assessment method, when the group exposure meansare far apart and there are moderate between-subject variations. Kim et al. [29]further relaxed the requirement on the independence between the group exposuremean and the measurement error, and introduced a quasi-Berkson measurement6error model, which allows for correlation between the group exposure mean andthe random error when the group exposure mean is a random effect. For this quasi-Berkson type model, they compared the estimates of the association parametersbased on assuming random group effects and assuming fixed group effects. Theyalso stated that if there is a fixed set of studied occupational groups in the study,we should treat the group as fixed, while if we would like to draw inferences on allpossible occupational groups, it is more natural to assume that groups are createdthrough a random draw of all possible groupings. Under the assumption of eitherfixed or random group exposure means, the estimates of regression coefficients areless attenuated with a large sample size used to estimate group exposure means,when between-subject variability is small and the spread between group exposuremeans is large. Attenuation bias is virtually absent when a large number of mea-surements is used to estimate fixed-effect but not random-effect group exposuremeans (due to covariance of group exposure means and random error).In this chapter, we develop a fully Bayesian method, which should in principlegive certain advantages. As for dealing with missing data, the group-based expo-sure assessment method ignores the between-worker variation within a group bysimply replacing workers’ exposure levels by the sample group exposure mean,which distorts the marginal distributions and the measures of covariations of theexposure data and, therefore, may lead to invalid inference [35]. Our Bayesianmethod is based on a likelihood approach, which can accommodate all the uncer-tainty from the data, and also provide the posterior variance to describe the uncer-tainty about the parameters and the exposure measurements [22]. Secondly, giventhe groups are known, we use the classical type measurement error model to adjustfor measurement errors for each worker based on his repeated exposure measure-ments. We concentrate our efforts on exploring properties of the fully Bayesianapproach in a cohort study with the group exposure means as the fixed effects inthe analysis.2.2 Bayesian Hierarchical ModelingWe propose to build a hierarchical model, which contains three sub-models: aresponse model to link disease status and true exposure, an exposure model to7model true exposure based on occupational categories, and a measurement errormodel to adjust for measurement errors. The Bayesian method is our top candidateto jointly fit these models, and it has a natural way to deal with missing data, whichcan be automatically imputed by simulated draws via Markov chain Monte Carlobased on the joint distribution assumptions.Let ygi be the binary disease outcome for the i-th individual in the g-th group,sgi be the original true exposure level, with i = 1, · · · ,ng, and g = 1, · · · ,G. Ourresponse model can be described asH (E(ygi|sgi)) = b0 +b1 f (sgi) (2.1)where H(·) is the link function, particularly here we use logit link function, andf (·) is some transformation function. We focus on the logistic link function inthis chapter. For simplicity we do not write out the covariate effects, though theycould be easily added in (2.1). And we consider the log-transformed exposuredata with the assumption that sgi follows a log-normal distribution, which is oftena realistic assumption for occupational exposure data. The measured exposure,s⇤gi j, has multiplicative measurement error, which can be written as s⇤gi j = sgiegi j,where egi j also follows a log-normal distribution. Denote xgi = log(sgi), and x⇤gi j =log(s⇤gi j). Thus our measurement error model can be written asx⇤gi j|xgi ⇠ N(xgi,s2W ),where x⇤gi j can be treated as the j-th repeated measurement of the exposure intensitywith j = 1,2, · · · ,J, and sW is the within-worker standard deviation. We assumethat measurement errors are nondifferential, that is, these x⇤gi j’s are not dependenton the response, ygi given xgi. Our exposure model states the distribution of xgi asfollows:xgi ⇠ N(µg,s2B), (2.2)where µg is the exposure mean in the gth group, and sB is the between-workerstandard deviation within a group. In our study, the group is treated as a fixedeffect. For priors, we assume b0 ⇠ N(0,s2b0), b1 ⇠ N(0,s2b1), µg ⇠ N(0,s2µ) forg = 1,2, · · · ,G, s2W ⇠ Inv-Gamma(aW ,gW ), and s2B ⇠ Inv-Gamma(aB,gB).82.3 The AlgorithmBased on the models proposed in the previous section, we calculate the full condi-tional posterior distribution for all the parameters (b ,s2W ,s2B,µ1, · · · ,µG) and thelatent variables (xgi’s). For s2W ,s2B, and µg’s, we can write out the closed formsof their full conditional posterior distributions. For the remaining parameters andlatent variables, their full conditional posterior distributions are only available upto a proportionality constant. So we propose to use a MCMC algorithm, Metropo-lis within Gibbs [7]. Specifically, in each iteration, for s2W ,s2B, and µg’s, we useGibbs sampling to sample candidates, which are all accepted for updating, whilefor b and xgi’s, we use random walk Metropolis. For the random walk Metropolis,in each t-th iteration of the Metropolis sampling, we propose their candidate valuesbased on the normal distributions centered at their sampling values in the (t1)-thiteration and with some pre-tuned standard deviations. Those candidates may beaccepted or rejected based on the acceptance ratio, r. The definition of this accep-tance ratio can be found in [7]. If the acceptance ratio, r, is larger than 1, we usethe candidate value. If r < 1, we use the candidate value by probability r and stayat the (t1)-th sampled value by probability 1 r. Note that for efficient conver-gence, the standard deviations of the normal distributions in the random walks aredetermined by adaptively tuning the overall acceptance rates such that the rates fallin the range between 20% and 50% [7]. Our algorithm is coded in R 2.12.2.Usually there are missing values in the exposure measurements and these are anessential feature of the group-based exposure assessment approach where, by def-inition, exposure measurements are not available for all subjects for whom healthoutcome was determined. It is infeasible to make measurements on the whole co-hort. So, ideally, the investigators randomly choose some members of each groupto measure their exposure, and in this ideal situation the missing mechanism ismissing completely at random (MCAR) within each group, which is assumed bythe group-based exposure assessment method. For the fully Bayesian method, weextend the missing mechanism to missing at random (MAR) within each group.Please also note that MAR is based on the random sampling conditional on theobserved data, which may not always the case due to complex real situations. Forexample, some workers may have missing records due to loss of compliance from9more exposure. In this case, the missingness is based on unobserved data. In realanalysis, the statement of the exceptions for the missing data is encouraged [32]. Inour analysis, in addition, we assume if one measurement is missing, the other mea-surements for that subject are missing as well. Denote the exposure measurementsasX ⇤ =x⇤gi jÂGg=1 ng⇥J= X ⇤oX ⇤m!,where X ⇤o = (x⇤ogi j) is the observed exposure measurement and X⇤m = (x⇤mgi j) is themissing exposure measurement, the latent true exposure as X , and the outcomes asY . Let the missing indicator for the ith worker in the gth group bemgi =(1 if x⇤gi j is missing for j = 1, · · · ,J.0 if x⇤gi j is observed for j = 1, · · · ,J.and M = (mgi) with i = 1,2, · · · ,ng and g = 1,2, · · · ,G. We know that this missingmechanism does not depend on the missing data. Let q be the vector of parametersof data, f be the vector of parameters of the missing-data process, and q and f areassumed to be distinct. The full likelihood function can be written asL(q ,X ,f |Y ,M ,X ⇤) = f (Y ,M ,X ⇤|q ,X ,f )=G’g=1ng’i=1f (ygi,x⇤gi|q ,xgi) f (mgi|ygi,g,xgi,x⇤gi,f )=G’g=1ng’i=1f (ygi|q ,xgi) f (x⇤gi|q ,xgi) f (mgi|ygi,g,xgi,x⇤gi,f )We further assume that the missingness indicator mgi is conditionally independentof xgi and x⇤gi given ygi and g for all g = 1, · · · ,G since the true exposure is a latentquantity and the exposure measurements are not observed for all workers. Thenour likelihood function would be extended toL(q ,X ,f |Y ,M ,X ⇤) = G’g=1ng’i=1f (ygi|q ,xgi) f (x⇤gi|q ,xgi) f (mgi|ygi,g,f ).Also if we add the prior information, p(q ), of q , we can see the posterior distri-10bution of q is proportional to f (Y ,X ⇤|q ,X )p(q ). Then we do not need to modelM , and we can say that the missing mechanism is ignorable for Bayesian inference[35].2.4 Simulation StudyTo compare the performance of the fully Bayesian method and the group-based ex-posure assessment method, we design the following simulation studies correspond-ing to some occupational epidemiological settings [27, 28, 30, 31]. We simulatedata with 5000 workers, who are categorized into five groups, with the true groupexposure means µg = 1.1, 2.1, 3.1, 4.1, and 5.1. In each group, there are 1000workers. For each worker, one binary disease outcome is measured. We know thatthe exposure may only be measured for a sample of workers per group in a cohortstudy and there are two repeated measurements for those workers who had mea-sured exposure. We choose the number of the observed samples per group varyingas 1000 (complete data), 100 (90% subjects with missing exposure data), or 20(98% subjects with missing exposure data). We randomly select these observedworkers within each group. We set b0 = 4, b1 = 0.4 or 0.6, sW varying as 0.1,0.5, and 1.0, and sB varying as 0.1, 0.5, 1.0, 1.5 and 2.0. Thus the typical risk pergroup is varying from 2.8% to 12.3% given b1 = 0.4, or from 3.4% to 28.1% givenb1 = 0.6, under the mean exposure for each group.For each setting we simulate 100 data sets, and for each data set we do theparameter estimation using one MCMC chain with 10,000 iterations and stor-ing every 10-th iteration. The chosen priors are b0,b1 ⇠ N(0,100), sW ,sB ⇠Inv-Gamma(0.0001,0.0001), and µg ⇠ N(0,10000) with g = 1,2, · · · ,G. The firsthalf of the chain is discarded as burn-in, and the remaining half is kept for infer-ence. The estimated posterior mean from the chain is considered as the estimateof the corresponding parameter. And the estimated posterior standard deviation isconsidered as the standard error of the corresponding parameter.We compare the precision, variances, and coverage probabilities of the 95%credible intervals of the estimated association parameter b1 from the two methods.11Table 2.1: Bias, MSE and coverage rates of the fully Bayesian and the group-based exposure assessment methods when b1 = 0.6 and 100 samples ob-served in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.1 sW = 0.1 FB 0.0069 0.0013 0.94GB 0.0059 0.0013 0.96sW = 0.5 FB 0.0052 0.0014 0.96GB 0.0061 0.0014 0.95sW = 1.0 FB -0.0016 0.0015 0.96GB 0.0060 0.0016 0.93sB = 0.5 sW = 0.1 FB 0.0055 0.0015 0.94GB -0.0023 0.0014 0.95sW = 0.5 FB 0.0031 0.0015 0.95GB -0.0022 0.0014 0.94sW = 1.0 FB -0.0051 0.0017 0.95GB -0.0024 0.0016 0.90sB = 1.0 sW = 0.1 FB 0.0065 0.0018 0.92GB -0.0225 0.0019 0.85sW = 0.5 FB 0.0042 0.0019 0.94GB -0.0225 0.0020 0.85sW = 1.0 FB -0.0075 0.0020 0.93GB -0.0227 0.0021 0.80sB = 1.5 sW = 0.1 FB 0.0064 0.0024 0.91GB -0.0570 0.0049 0.51sW = 0.5 FB 0.0015 0.0022 0.91GB -0.0570 0.0049 0.53sW = 1.0 FB -0.0143 0.0024 0.89GB -0.0573 0.0050 0.50sB = 2.0 sW = 0.1 FB 0.0071 0.0032 0.89GB -0.1013 0.0121 0.17sW = 0.5 FB 0.0016 0.0031 0.91GB -0.1013 0.0120 0.14sW = 1.0 FB -0.0199 0.0032 0.89GB -0.1017 0.0122 0.1412Table 2.2: Bias, MSE and coverage rates of the fully Bayesian and the group-based exposure assessment methods when b1 = 0.6 and 1000 samplesobserved in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.1 sW = 0.1 FB 0.0066 0.0013 0.93GB 0.0060 0.0013 0.96sW = 0.5 FB 0.0063 0.0013 0.95GB 0.0058 0.0013 0.96sW = 1.0 FB 0.0065 0.0013 0.96GB 0.0056 0.0013 0.97sB = 0.5 sW = 0.1 FB 0.0052 0.0012 0.95GB -0.0015 0.0012 0.97sW = 0.5 FB 0.0056 0.0012 0.93GB -0.0017 0.0012 0.97sW = 1.0 FB 0.0059 0.0013 0.96GB -0.0019 0.0012 0.97sB = 1.0 sW = 0.1 FB 0.0044 0.0008 0.95GB -0.0203 0.0014 0.91sW = 0.5 FB 0.0054 0.0009 0.95GB -0.0205 0.0014 0.91sW = 1.0 FB 0.0063 0.0011 0.97GB -0.0207 0.0014 0.92sB = 1.5 sW = 0.1 FB 0.0011 0.0006 0.92GB -0.0532 0.0037 0.60sW = 0.5 FB 0.0017 0.0007 0.91GB -0.0534 0.0037 0.60sW = 1.0 FB 0.0029 0.0008 0.92GB -0.0536 0.0037 0.59sB = 2.0 sW = 0.1 FB 0.0005 0.0005 0.96GB -0.0957 0.0099 0.15sW = 0.5 FB 0.0007 0.0005 0.93GB -0.0958 0.0099 0.13sW = 1.0 FB 0.0015 0.0006 0.94GB -0.0960 0.0100 0.1213Table 2.3: Bias, MSE and coverage rates of the fully Bayesian and the group-based exposure assessment methods when b1 = 0.4 and 1000 samplesobserved in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.1 sW = 0.1 FB 0.0093 0.0023 0.91GB 0.0095 0.0022 0.95sW = 0.5 FB 0.0095 0.0023 0.93GB 0.0094 0.0022 0.95sW = 1.0 FB 0.0094 0.0023 0.92GB 0.0092 0.0022 0.95sB = 0.5 sW = 0.1 FB 0.0061 0.0019 0.92GB 0.0066 0.0019 0.97sW = 0.5 FB 0.0064 0.0020 0.93GB 0.0065 0.0019 0.97sW = 1.0 FB 0.0066 0.0020 0.94GB 0.0063 0.0019 0.97sB = 1.0 sW = 0.1 FB 0.0051 0.0014 0.93GB 0.0048 0.0016 0.97sW = 0.5 FB 0.0053 0.0014 0.95GB 0.0047 0.0016 0.97sW = 1.0 FB 0.0059 0.0016 0.93GB 0.0046 0.0015 0.97sB = 1.5 sW = 0.1 FB 0.0023 0.0010 0.93GB -0.0057 0.0015 0.98sW = 0.5 FB 0.0022 0.0010 0.92GB -0.0058 0.0015 0.98sW = 1.0 FB 0.0024 0.0011 0.92GB -0.0059 0.0015 0.98sB = 2 sW = 0.1 FB 0.0007 0.0006 0.92GB -0.0176 0.0014 0.98sW = 0.5 FB 0.0008 0.0006 0.95GB -0.0177 0.0014 0.98sW = 1.0 FB 0.0012 0.0006 0.93GB -0.0179 0.0014 0.9914Table 2.4: Bias, MSE and coverage rates of the fully Bayesian and the group-based exposure assessment methods when b1 = 0.4 and 100 samples ob-served in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.1 sW = 0.1 FB 0.0098 0.0023 0.94GB 0.0095 0.0022 0.94sW = 0.5 FB 0.0084 0.0022 0.94GB 0.0094 0.0022 0.92sW = 1.0 FB 0.0011 0.0021 0.94GB 0.0092 0.0022 0.92sB = 0.5 sW = 0.1 FB 0.0077 0.0020 0.97GB 0.0061 0.0020 0.97sW = 0.5 FB 0.0053 0.0019 0.97GB 0.0059 0.0019 0.97sW = 1.0 FB -0.0019 0.0018 0.96GB 0.0056 0.0019 0.97sB = 1.0 sW = 0.1 FB 0.0090 0.0021 0.96GB 0.0033 0.0019 0.96sW = 0.5 FB 0.0063 0.0020 0.97GB 0.0032 0.0018 0.96sW = 1.0 FB -0.0011 0.0019 0.94GB 0.0029 0.0018 0.96sB = 1.5 sW = 0.1 FB 0.0054 0.0024 0.93GB -0.0082 0.0021 0.94sW = 0.5 FB 0.0025 0.0022 0.94GB -0.0083 0.0021 0.93sW = 1.0 FB -0.0064 0.0022 0.91GB -0.0085 0.0021 0.92sB = 2 sW = 0.1 FB 0.0041 0.0025 0.92GB -0.0217 0.0024 0.88sW = 0.5 FB 0.0016 0.0023 0.90GB -0.0218 0.0023 0.90sW = 1.0 FB -0.0074 0.0022 0.93GB -0.0221 0.0024 0.9015Table 2.1 shows us when b1 = 0.6, sB is moderate or large, and only 100 workersare observed per group, the estimated b1 from the fully Bayesian method tends tohave smaller bias, smaller mean squared error (MSE), and more accurate coveragerates than for the group-based exposure assessment method. For instance, whensB = 2 and sW = 0.1, the absolute bias of the group-based exposure assessmentmethod is 14 fold that of the fully Bayesian method, the MSE of the group-basedexposure assessment is around 4 times that of the fully Bayesian method, and thecoverage of the group-based exposure assessment is only 17% compared with 89%from the fully Bayesian method. When sB becomes larger, the advantages of thefully Bayesian method are more prominent, which is consistent with the fact thatsince the group-based exposure assessment method ignores the between-workervariance within each group, more between-worker variance can cause more bias.On the other hand, when sW becomes large, the measurement becomes poor, whichmakes both methods less accurate. However, we did not find prominent differencesbetween the two methods when b1 is set as 0.4, as shown in Tables 2.3 and 2.4.Table 2.5: The SSE and the ASE of the estimated b1 when b1 = 0.6 and 100samples are observed in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )sB 1.0 1.5 2.0sW 0.1 0.5 1.0 0.1 0.5 1.0 0.1 0.5 1.0GB SSE 0.038 0.039 0.040 0.041 0.041 0.042 0.043 0.042 0.043ASE 0.034 0.034 0.034 0.032 0.032 0.032 0.030 0.030 0.030FB SSE 0.042 0.043 0.045 0.049 0.047 0.047 0.056 0.056 0.053ASE 0.041 0.041 0.042 0.046 0.047 0.046 0.051 0.050 0.048We calculate two kinds of summary standard errors of the estimated b1. Oneis the simulated standard error (SSE), which is defined as the standard deviation ofthe estimated b1’s from all simulated data under the same condition. The other isthe averaged standard error (ASE), which is calculated as the mean of the estimatedstandard errors of b1 from each simulation data set under the same condition. TheSSE and the ASE of the fully Bayesian method agree well in all of our simulated16conditions. However, for the group-based exposure assessment method, when thereis a large amount of missing data, Table 2.5 shows the SSE is much larger than theASE. This confirms that the group-based exposure assessment method underesti-mates the standard error of the association parameter.As for the biases, we inspect the raw bias, the standardized bias, and the per-centage bias. Table 2.1 and Figure 2.1 show that the biases of the estimated b1from the fully Bayesian method are much smaller than those from the group-basedexposure assessment method when b1 = 0.6 and sB is moderate or large. For thefully Bayesian method, the biases are within 5% of the value of b1, and the amountof missing data has an effect on the magnitude of the biases. For example, whenthere are only 20 workers observed, the biases are much greater than those with100 workers observed and complete data. However, as sB increases, the differ-ences between these three types of data tend to become less, and the magnitude ofbias is also becoming smaller. For the group-based exposure assessment method,the magnitude of the bias sharply increases when sB increases, regardless of theamount of missing data. Also the gaps of biases between data with different amountof missing data become larger and larger as sB increases.2.5 Bias of the Group-based Exposure AssessmentMethodIn addition to the simulation study, we can theoretically explore the cause of thebias of the group-based exposure assessment method for binary outcomes, if weassume that the group exposure means are known precisely instead of estimatedfrom a sample as it is done in practice. The response model under consideration isPr(ygi = 1|xgi,b0,b1) = logit1(b0 +b1xgi).We can obtain the relationship between ygi and µg by integrating out the xgi’s. Afterplugging in the exposure model (2.2), we havePr(ygi = 1|µg,egi,b0,b1) = logit1(b0 +b1µg +b1egi).170.0 0.5 1.0 1.5 2.0í25í20í15í10í50510σBPercentage Bias of Estimated β 1 ííííííííííííííííííííííííííííííííííííííííííííííííííííííííFB with Complete DataFB with 100 samples observed FB with 20 samples observed GB with Complete DataGB with 100 samples observed GB with 20 samples observedFigure 2.1: The standardized biases versus sB under sW = 0.5 and b1 = 0.6.The vertical bars are 95% confidence intervals due to simulation vari-ability.18To get the link between ygi and µg, we need to integrate out the error term:Pr(ygi = 1|µg,b0,b1)=Z +••logit1(b0 +b1µg +b1egi) 1q2ps2B exp e2gi2s2B!degi=Z +••11+ exp(b1egi)exp(b0+b1µg) 1q2ps2B exp e2gi2s2B!degi, (2.3)where egi ⇠ N(0,s2B). To calculate the integral (2.3), we do Taylor expansion of✓1+exp(b1egi)exp(b0 +b1µg)◆1onb1egi around 0. Let A= exp(b0+b1µg) and z=b1egi. Consider the functiong(z) =✓1+ezA◆1.We haveg0(z) = Aez(A+ ez)2,g(2)(z) =Aez(A+ ez)(A+ ez)3,g(3)(z) = Aez(A24Aez + e2z)(A+ ez)4,g(4)(z) =Aez(A+ ez)(A210Aez + e2z)(A+ ez)5.Then the first three terms of the expansion of g at z = 0 isg = g(0)+g0(0)z+g(2)(0)z22!+g(3)(0)z33!+g(4)(0)h44!=A1+AA(A+1)2z+A(1A)(A+1)3z22!A(A24A+1)(A+1)4z33!+A(1A)(A210A+1)(A+1)5h44!, (2.4)19where h lies between z and 0. Plugging (2.4) in to integration (2.3), we havePr(ygi = 1|µg,b0,b1)=A1+A+A(1A)2!(A+1)3b 21s2B+A(1A)(A210A+1)4!(A+1)5Z +••h4q2ps2B exp e2gi2s2B!degi,where A = exp(b0 +b1µg). The residual term isA(1A)(A210A+1)4!(A+1)5Z +••h4q2ps2B exp e2gi2s2B!degi.What is more, we haveA(1A)(A210A+1)4!(A+1)5Z +••h4q2ps2B exp e2gi2s2B!degi=A(1A)(A210A+1)4!(A+1)5Z +••h4q2ps2B exp e2gi2s2B!degiA(1A)(A210A+1)4!(A+1)5Z +••z4q2ps2B exp e2gi2s2B!degi=A(1A)(A210A+1)4!(A+1)5b 41 E(e4gi)=A(1A)(A210A+1)8(A+1)5b 41s4B.After we ignore the residual term, we get the approximation of the relationshipbetween ygi and µgi.Pr(ygi = 1|µg,b0,b1) ⇡ A1+A + A(1A)(A+1)3 b 21s2B2! ,20where A = exp(b0 +b1µg). Denote b = (b0,b1)0, and letrg(b ) = exp(b0 +b1µg)1+ exp(b0 +b1µg) .Then this probability equation of the true model can be written asPr(ygi = 1|µg,b0,b1)⇡ rg1+(1rg)(12rg) b 21s2B2 . (2.5)In comparison, when we assume that the group exposure means are known, therelationship between ygi and µg is assumed asPr(ygi = 1|µg,b0,b1) = rg.Also since they are not available in the real studies, µ 0gs are approximated by theircorresponding sample group exposure means. Therefore we can see that, when thesample group exposure mean is very close to the true group exposure mean, whichis often true under large sample sizes, the group-based exposure assessment methodmay not work well under any of the following conditions: (1), the associationparameter is big, i.e., the disease risk increases dramatically when there is one-unitincrease in the exposure; (2), the between-worker variation is relatively large; and(3), the prevalence of disease is rare, which makes rg far from 1 and 1/2.There is another conclusion from our theoretical calculation, which can be sup-ported by the simulation results mentioned in [27]. Assume there is the same incre-ment between the adjacent group exposure means. Kim and Burstyn [27] inspectedthe bias andMSE of the estimated b1 when b0 =4, b1 = 0.4, µ1 = 0.1, and the in-crement varies as 0.3,0.5,1, and 1.414 in their simulation studies. They concludedthat when the increment between groups is large and the between-subject varianceis relatively small, the group-based exposure assessment method gives unbiasedestimates, while it fails to adjust for the measurement errors properly when theincrement between groups is small. Their conclusions agree with our theoreticalcalculation. From equation (2.5), we can see that, given b0 =4 and b1 = 0.4, thecloser the function (1rg)(1 2rg) is to 0, the better the group-based exposureassessment method performs. Figure 2.2 is a plot of (1rg)(1 2rg) as a func-210 2 4 6 8 100.00.20.40.60.8µg(1<lg)(1<2lg)Figure 2.2: Plot of (1 rg)(1 2rg) as a function of the group exposuremean, µg, based on b0 =4 and b1 = 0.4.22tion of the gth group exposure mean, µg, with b0 =4 and b1 = 0.4. It shows that(1rg)(1 2rg) is decreasing to 0 as µg increases in the range of our consider-ation. Also given µ1, the smallest group exposure mean, the larger the incrementof groups, the larger means the other groups have. If the group exposure meansare more distributed to the right side of the curve in Figure 2.2, the group-basedexposure assessment method will have less bias. Then we can conclude that underthese given conditions, the further apart the group exposure means are, the betterperformance we can expect from the group-based exposure assessment method.Table 2.6: Bias, MSE and coverage rates of the fully Bayesian and the group-based exposure assessment methods when b1 = 0.6 and 100 samples ob-served in each group under the average difference between group expo-sure means equal to 0.3( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesW = 1.0 sB = 1.0 FB -0.0927 0.0263 0.91GB -0.0054 0.0257 0.97sB = 1.5 FB -0.0844 0.0207 0.86GB -0.0542 0.0333 0.94sB = 2.0 FB -0.0595 0.0144 0.83GB -0.1111 0.0422 0.82To confirm our aforementioned conclusion and also to compare the perfor-mance of the group-based exposure assessment and the fully Bayesian methodsunder smaller average difference between group exposure means, we run sup-plementary simulation studies under the average difference between group expo-sure means equal to 0.3 (with group exposure means equal to 0.1,0.4, · · · ,1.3),b0 = 4,b1 = 0.6,sW = 1, and sB = 1.0,1.5, and 2, which are selected to matchthe simulation settings in [27]. The simulation results are shown in Table 2.6. Wecan see that under fixed sW , as sB increases the performance of the group-basedexposure assessment method worsens according to the three listed criteria. This isconsistent with our theoretical finding. As sB increases, there are also some trends23for the fully Bayesian method, i.e., slightly increased bias, decreased MSE, andreduced coverage rate. Those indicate that increasing between group variance willalso make the fully Bayesian method perform worse. However, compared withthat of the group-based exposure assessment method, the performance of the fullyBayesian method is relatively stable and consistent with the simulation results un-der the average difference between group exposure means equal to 1. In addition,the fully Bayesian method is less biased, has smaller MSE and similar coveragerate when sB = 2.2.6 An Empirical ExampleTo illustrate the application of the fully Bayesian method, we use a subset of thedata from a study of exposure to radiation and mortality from leukemia conductedat the Savannah River Site (SRS) (South Carolina) [40, 41]. The study includes atotal of 18,883 workers hired between 1950 and 1986 who were followed through2002 to ascertain causes of death. We analyze data for the subcohort of 15,264male workers using the fully Bayesian method, and also using the group-basedexposure assessment method, and comment on the analyses resulting from bothmethods.Table 2.7: Estimates from logistic regression using true exposure intensity topredict leukemia mortalityEstimate Std. Error P valueIntercept -6.406 1.297 < 0.001True intensity exposure (log-scale) 0.268 0.108 0.013Total exposure days (log-scale) 0.264 0.152 0.081Black -0.657 0.476 0.167Pay monthly 0.021 0.327 0.948Pay weekly -0.265 0.343 0.440Attained Age -0.008 0.010 0.448The outcome variable, Y , is leukemia mortality. The predictor of interest in thecurrent analysis is the exposure intensity to ionizing radiation. The annual expo-sure measurements and the annual total days of employment were recorded. As24suggested by [40], there is an induction and latency period between exposure toionizing radiation and an observed change in risk of leukemia mortality. Since ex-posure data were available for the period 1950 - 1999, while follow-up spannedthe period 1950-2002, a 3-year lag was the minimal lag assumption possible toevaluate in these analyses. Therefore, for the exposure measurements and days ofexposure, we only consider those measured or counted 3 years before the leukemiaoutcome or before the exit from the study. Also we can get a worker’s exposure in-tensity estimate for a year using his annual measurement divided by his annual totalemployment days. For purposes of evaluating the methodologies, we let xi be theaverage of the ith worker’s annual intensities from study entry to the lag year. Theaverage number of such years is 12.9 (SE = 9.3). Conceptually we will treat xi asthe true exposure intensity, in order to evaluate the methodologies when less infor-mation about exposure is available. To estimate the association between leukemiamortality and radiation exposure intensity, we use a logistic regression with thetrue exposure intensity, xi, as a predictor. To reduce confounding, we adjust ouranalysis using the following confounders: total days of exposure; the attained age,which is defined as age of death if the worker is deceased during the study, or ageof exiting the study otherwise; sex; race (Black vs. other); and pay code (used tocontrol for socioeconomic differences in mortality and classified on the basis ofthe worker’s pay schedule when hired as paid monthly, weekly, or hourly) [40].Table 2.7 shows us the estimated coefficients, associated standard errors, and theirp-values. There is a significant association between exposure intensity and theleukemia outcome. We also test the difference between the coefficient of the trueexposure intensity (log-scale) and that of the total exposure days (log-scale), andfind that we cannot reject the null hypothesis that there is no difference betweenthese two coefficients. Therefore, there is some indication that it is cumulative ex-posure which is influencing the outcome. This would be an interesting topic forfurther investigation.To conduct group-based exposure assessment, we define exposure groups bycombining similar occupations that were recorded for the workers in the originalstudy and term them as “reduced occupations.” Though some workers changedtheir occupations during the study, for simplicity we choose their longest occupa-25Table 2.8: Reduced occupations of the SRS male workers and their sum-maries statisticsReduced Occupations Num of Leukemia Crude Risk Geometric Mean of radiation(Total Workers) of Disease exposure in excess relative rate/10mSv(Geometric SD)Clerical and kindred non-manual workers 9 (898) 1% 0.158 (3.376)Technicians, Analysts, and Assistants 4 (970) 0.41% 0.429 (3.085)Scientific and medical services 2 (508) 0.39% 0.166 (2.671)Engineering jobs 9 (2018) 0.45% 0.116 (2.622)Supervisors and managers 7 (1244) 0.56% 0.185 (3.073)General Service Operator 6 (1711) 0.35% 0.155 (2.840)Operators 16 (2062) 0.78% 0.437 (3.440)Skilled trades 13 (1542) 0.84% 0.327 (2.965)Other semi-skilled workers 3 (670) 0.45% 0.076 (1.789)tions during the study as their unique occupations during the study. We assumein each occupation, the exposure intensity is following a log-normal distribution.Table 2.8 shows us a summary of the data in each reduced occupation. To imaginea study with only two repeated measurements for each subject, we randomly se-lect two exposure intensity measurements for the ith worker, x⇤i1 and x⇤i2, and treatthem as two repeated measurements of the true exposure, xi. We use the fullyBayesian and the group-based exposure assessment methods separately to analyzethe data. We iterate this approach 100 times and obtain 100 model estimates fromeach method. The average of the estimated association parameter from the fullyBayesian method is 0.284 (ASE = 0.142). The average of the estimated associa-tion parameter from the group-based exposure assessment method is 0.228 (ASE= 0.271). The average of estimated within-worker variance is 1.226 and that of thebetween-worker variance is 0.988. Relative to the association based on observingthe true exposure intensities shown in Table 2.7, we can see that the estimates of theassociation parameter from the fully Bayesian method are less biased for the trueassociation between exposure and leukemia. There is some attenuation in the co-efficient estimate from the group-based exposure assessment method, which maybe due to lack of adjustment for measurement errors, and there is also a greateruncertainty from the group-based exposure assessment method, which is shownfrom the ASE of the group-based exposure assessment method doubling that fromthe fully Bayesian method. This is most likely because the group-based expo-sure assessment method does not make full use of the data. On the other side,for both methods we calculate the power to detect the significant association indi-26cated from the full data. The results show us based on the significance level set as0.05, the fully Bayesian method detects a significant signal for 55 of 100 replica-tions, while the group-based exposure assessment method does not detect for anyof them. Therefore, we would like to conclude that the fully Bayesian method canbetter recover the signal from the data compared with the group-based exposureassessment method in this data example.To examine the behavior of the fully Bayesian method under a more realisticprior for the association parameter, b1, we have tried the prior distribution of b1as N(0, log32). We repeat the above analysis. It turns out that the average of theestimated association parameter from the fully Bayesian method is 0.268 (ASE =0.140), the average of estimated within-worker variance is 1.226 and that of thebetween-worker variance is 0.988, and a significant signal is detected for 47 of 100replications. We can see that this informative prior gives a big improvement for theaccuracy of the association parameter.To examine both methods in the presence of a large proportion of missing data,we randomly select 90% of the subjects within each occupation group and treatthem as missing their exposure measurements. After that we randomly select tworepeated measurement as we did before and fit the models using both methods.We repeat this procedure 100 times. The average of the estimated association pa-rameter from the fully Bayesian method is 0.299 (ASE = 0.256). The averageof the estimated association parameter from the group-based exposure assessmentmethod is 0.243 (ASE = 0.257). The average of estimated within-worker varianceis 1.403 and that of the between-worker variance is 0.516. Compared to the fulldata results in Table 2.7 then, the two methods seem to be performing similarly.However, the fully Bayesian method gives 95%, 90%, and 80% credible intervalsfor the exposure intensity coefficient which excludes zero in 0, 9, and 36 of the 100replications respectively. The corresponding numbers of significant findings forthe group-based method are 0, 0 and 2 out of the 100 replications. So we concludethat even under the presence of missing data, the fully Bayesian method is morepowerful in this data example.Although a conditional logistic regression model would be a standard approachto analysis of matched case-control data [40], in order to illustrate our method,we use unconditional likelihood with adjustment for the matching factors. Our27analysis shows the significant association between exposure and mortality basedon the unmatched data as was shown in [40].2.7 DiscussionFrom our mathematical derivation and our simulation studies, we can concludethat there are a few conditions under which the group-based exposure assessmentmethod may not work well. One of them is the existence of big variation betweenworkers within a group. The bias of the estimated b1 from the group-based expo-sure assessment method is almost proportional to the magnitude of the between-worker variance under certain settings, as shown by [28]. The second conditionwhich may cause the group-based exposure assessment method to perform poorlyis low prevalence of the disease outcome. Kim et al. [29] have run simulations un-der different disease prevalences for the group-based exposure assessment method.Based on their simulation results, we can see that under lower prevalence the group-based exposure assessment method tends to have much larger MSE. And the thirdcondition is a big increase in risk given one-unit increase in exposure. Notice thatsome of these conditions commonly exist in occupational epidemiological studies,although it is comforting to note that for weak associations that are believed todominate epidemiology today, all methods considered work well [13]. So we rec-ommend the fully Bayesian method, which shows its good performance over dif-ferent parameter settings from our testing. In addition, the fully Bayesian methodcan provide estimates of uncertainty arising from the data, such as the variance ofthe measurement errors and the variance between workers within a group, and esti-mates of other quantities including the association parameters and latent variablessimultaneously through iterative simulation process. This should be more accuratethan separated steps to get those estimates, since the variance structure is well con-sidered [8]. This will further benefit epidemiologists and occupational hygienists toobtain estimates related to compliance and risk in assessing occupational exposureand overexposure [54].We need to mention that there are other factors, which can also affect the be-havior of the two methods. One such factor is the sample size. For the group-based28Table 2.9: Bias, MSE and coverage rates of the fully Bayesian and the group-based exposure assessment methods when b1 = 0.6 and 20 samples ob-served in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.1 sW = 0.1 FB 0.0066 0.0014 0.95GB 0.0060 0.0014 0.95sW = 0.5 FB 0.0104 0.0019 0.94GB 0.0053 0.0018 0.92sW = 1.0 FB 0.0162 0.0033 0.88GB 0.0034 0.0028 0.82sB = 0.5 sW = 0.1 FB 0.0073 0.0021 0.94GB -0.0027 0.0019 0.87sW = 0.5 FB 0.0097 0.0024 0.91GB -0.0035 0.0022 0.87sW = 1.0 FB 0.0130 0.0034 0.91GB -0.0056 0.0031 0.77sB = 1.0 sW = 0.1 FB 0.0152 0.0048 0.9GB -0.0258 0.0038 0.73sW = 0.5 FB 0.0268 0.0199 0.90GB -0.0265 0.0039 0.68sW = 1.0 FB 0.0181 0.0066 0.88GB -0.0285 0.0047 0.65sB = 1.5 sW = 0.1 FB 0.0204 0.0113 0.90GB -0.0653 0.0089 0.42sW = 0.5 FB 0.0222 0.0167 0.84GB -0.0658 0.0090 0.44sW = 1.0 FB 0.0167 0.0158 0.86GB -0.0677 0.0096 0.43sB = 2.0 sW = 0.1 FB 0.0203 0.0174 0.81GB -0.1169 0.0193 0.16sW = 0.5 FB 0.0176 0.0182 0.79GB -0.1173 0.0193 0.15sW = 1.0 FB 0.0031 0.0193 0.76GB -0.1190 0.0199 0.1829Table 2.10: Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.4 and 20 sam-ples observed in each group( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.1 sW = 0.1 FB 0.0098 0.0023 0.92GB 0.0096 0.0023 0.93sW = 0.5 FB -0.0213 0.0931 0.91GB 0.0090 0.0023 0.92sW = 1.0 FB -0.0188 0.0792 0.89GB 0.0075 0.0027 0.88sB = 0.5 sW = 0.1 FB 0.0069 0.0025 0.94GB 0.0059 0.0023 0.95sW = 0.5 FB -0.0417 0.1340 0.94GB 0.0053 0.0023 0.95sW = 1.0 FB -0.0354 0.1042 0.92GB 0.0036 0.0025 0.92sB = 1.0 sW = 0.1 FB 0.0128 0.0037 0.94GB 0.0010 0.0028 0.89sW = 0.5 FB -0.0295 0.0894 0.91GB 0.0004 0.0028 0.89sW = 1.0 FB -0.0036 0.0311 0.92GB -0.0012 0.0030 0.87sB = 1.5 sW = 0.1 FB 0.0094 0.0061 0.93GB -0.0146 0.0038 0.82sW = 0.5 FB 0.0099 0.0061 0.89GB -0.0151 0.0038 0.81sW = 1.0 FB 0.0074 0.0061 0.90GB -0.0166 0.0040 0.78sB = 2.0 sW = 0.1 FB 0.0163 0.0104 0.91GB -0.0333 0.0055 0.61sW = 0.5 FB 0.0247 0.0278 0.86GB -0.0337 0.0054 0.63sW = 1.0 FB 0.0196 0.0283 0.88GB -0.0351 0.0056 0.6430exposure assessment method, if the sample size is very small, the sample mean canbe far away from the true mean. The conditional mean imputation will cause verypoor estimation of b1. For the fully Bayesian method, if the sample size is small,the inference will be most likely affected by the priors, and since less informationcan be used in updating the posterior distribution, more biased estimates could beprovided given the same length of MC chain. For small sample size, two obser-vations from our simulation studies are noteworthy. When the sample size is verysmall, e.g. 100 workers per group and only 10 exposure measurements from eachgroup, both methods fail for some simulated data sets. Therefore, in practice, morecaution is required when the sample size is small. Secondly, we observe that thetwo methods behave differently at small sample size. For example, when there are1000 workers per group, but only 20 exposure measurements from each group, Ta-ble 2.9 and Table 2.10 show us that the group-based exposure assessment methodis less biased under the small between-group standard deviation given comparablecoverage rates. When sB is large, the fully Bayesian method performs better sincethe coverage of the group-based exposure assessment method deteriorates.We also look at the sensitivity of the fully Bayesian method based to the se-lection of priors. In the simulation studies, we used diffuse priors for all theparameters. However, Gelman [17] mentioned some concern on inverse gammadistribution of the scale parameters and suggested using a uniform prior on thestandard deviation, and a half-t family when the number of groups is small. Sowe did a small simulation study to investigate how our posterior estimates canchange if we use different hyper-parameters of the inverse gamma distributionsin our current settings. Instead of Inv-Gamma(0.0001,0.0001), we let sW ,sB ⇠Inv-Gamma(0.01,0.01), and we also vary the number of observed workers pergroup as 20 and 100. Tables 2.11 and 2.12 shows us the model estimates, whichare very close to those from Inv-Gamma(0.0001,0.0001) priors shown in Tables2.1 and 2.9, separately. This reflects the robustness of our modeling to some de-gree. However, we suggest readers to use more robust priors such as the uniformpriors written in our attached WinBUGs code and when the number of groups isless than 5, consider the half-t family mentioned in Gelman [17]. Also the Cauchy31Table 2.11: Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and 100 sam-ples observed in each group under different priors for the scale parame-ters( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.5 sW = 0.1 FB 0.0042 0.0015 0.94GB -0.0023 0.0014 0.95sW = 0.5 FB 0.0038 0.0015 0.96GB -0.0022 0.0014 0.94sW = 1.0 FB -0.0043 0.0016 0.94GB -0.0024 0.0016 0.90sB = 1.0 sW = 0.1 FB 0.0067 0.0018 0.96GB -0.0225 0.0019 0.86sW = 0.5 FB 0.0043 0.0019 0.93GB -0.0225 0.0020 0.85sW = 1.0 FB -0.0063 0.0019 0.97GB -0.0227 0.0021 0.80sB = 1.5 sW = 0.1 FB 0.0060 0.0024 0.91GB -0.0570 0.0049 0.51sW = 0.5 FB 0.0024 0.0022 0.92GB -0.0570 0.0049 0.53sW = 1.0 FB -0.0149 0.0023 0.93GB -0.0573 0.0050 0.50class of prior distributions for logistic regression parameters may have advantagesin handling separation and sparsity issues [19].In the literature there is another Bayesian approach for the occupational expo-sure data. Kim et al. [28] showed that when estimated group exposure mean is agood approximation of the true group exposure mean, the estimated group expo-sure mean is related to true exposure for each worker by an approximate Berksonerror model. Consequently, Kim and Burstyn [27] assume a Berkson exposure er-ror structure in a Bayesian framework. Also they further explored a setting wherethe standard deviation between workers with a group is known. Based on theseassumptions, they developed a Bayesian method, called Berkson-Bayesian group-32Table 2.12: Bias, MSE and coverage rates of the fully Bayesian and thegroup-based exposure assessment methods when b1 = 0.6 and 20 sam-ples observed in each group under different priors for the scale parame-ters( FB is for the fully Bayesian method, and GB is for the group-based exposure assess-ment method )Bias MSE Coverage RatesB = 0.5 sW = 0.1 FB 0.0070 0.0021 0.95GB -0.0027 0.0019 0.87sW = 0.5 FB 0.0093 0.0024 0.94GB -0.0035 0.0022 0.87sW = 1.0 FB 0.0140 0.0035 0.91GB -0.0056 0.0031 0.78sB = 1.0 sW = 0.1 FB 0.0144 0.0046 0.93GB -0.0258 0.0038 0.73sW = 0.5 FB 0.0253 0.0212 0.92GB -0.0265 0.0039 0.68sW = 1.0 FB 0.0249 0.0141 0.89GB -0.0285 0.0047 0.65sB = 1.5 sW = 0.1 FB 0.0168 0.0090 0.91GB -0.0653 0.0089 0.44sW = 0.5 FB 0.0165 0.0124 0.87GB -0.0658 0.0090 0.45sW = 1.0 FB 0.0159 0.0218 0.84GB -0.0677 0.0096 0.44based method, and compared it with the group-based assessment through theirsimulation studies. They concluded that Berkson-Bayesian group-based methodyields less biased estimates compared to group-based exposure assessment in allof the settings, especially when true group exposure means are far apart. However,the Berkson-Bayesian group-based method incurs equal or greater MSE than thegroup-based exposure assessment method in all of the settings. Based on similarsimulation settings, the fully Bayesian method provides less biased estimates inmost cases and reduction by a factor of 4 in mean squared error when sB = 2. Alsothe fully Bayesian method avoids plugging in an estimate of sB as if it were knownand it assumes the classical measurement error type, which is closer to reality33for occupational epidemiological studies. Both the Berkson-Bayesian group-basedmethod and the fully Bayesian method can be trivially implemented in WinBUGS.The WinBUGS code for the fully Bayesian method is given below:model{for (i in 1 : M){ # M is the total sample sizey[i] ˜ dbern(p[i]) # response modellogit(p[i]) <- beta0 + beta1 * x[i] # true likelihoodxx1[i] ˜ dnorm(x[i],tau.W) # measurement error modelxx2[i] ˜ dnorm(x[i],tau.W) # measurement error modelx[i] ˜ dnorm(mu.g[Group[i]],tau.B) # true exposure distribution}for (j in 1:G){ # G is the number of groupsmu.g[j] ˜ dnorm(0, 0.01) # prior of mean of each group}tau.W <- pow(sigma.W, -2) # tau.W=1/simga.Wˆ2tau.B <- pow(sigma.B, -2) # tau.B=1/simga.Bˆ2beta0 ˜ dnorm(0, 1.0E-2) # prior of beta0beta1 ˜ dnorm(0, 1.0E-2) # prior of beta1sigma.W ˜ dunif(0.001, 100) # prior of sigma.Wsigma.B ˜ dunif(0.001, 100) # prior of sigma.B}When researchers plan to use the fully Bayesian method, simulations on rel-evant settings are recommended to see how much missingness and what samplesizes are optimal. In the future, we would modify the fully Bayesian method to fita variety of studies. For example, we would modify the measurement error modelif those errors are not classical. We would change the response model for the timeto event outcome data. We would also accommodate different exposure distribu-tions. On the other hand, the fully Bayesian method could also be extended forsituations where exposure and outcome data are available on disjoint sets of indi-viduals. This commonly occurs in practice when health status of persons whoseexposure was monitored is unknown and is only linked to health effects thoughknowledge of group membership [32]. Another possible extension of the fullyBayesian approach is to use random group effects since this may be more naturalto some epidemiologic problems. Generally, we hope the availability of a more34appropriate method of analysis encourages epidemiologists to use the group-basedstudy design for estimating exposure-disease associations in occupational settings.35Chapter 3On the Shape of anExposure-Disease Relationship,the Average Effect of Exposure,and the Impact of ExposureMeasurement Error3.1 IntroductionThe exposure-disease relationship analysis, also called dose-response analysis, playsa central role in risk assessment in epidemiologic studies, especially when the ex-posure is a continuous quantity and the disease is indicated by a binary outcome.For example, in occupational epidemiology, workers exposed to some potentialhazard in the workplace may later develop the disease of interest. Therefore, wewould like to assess the risk based on the level of the exposure through modelingthe shape of the exposure-disease relationship. A popular approach to the dose-response analysis is categorical analysis, which breaks exposure into categories,usually based on the percentile method, and then looks for trend or relative risks.However, as mentioned in [20], one potential pitfall is that when exposure effects36are limited to extreme ends of the exposure scale, the estimated hazard from datacan be attenuated if the high exposure level is combined with some lower levelinto a single category in the data analysis. Instead, Greenland [20] suggested usingfractional polynomial regression and spline regression, which seems under-used inepidemiologic research. Steenland and Deddens [48] mentioned the importance ofthe spline method, which can provide a close fit of the dose-response curve, andalso suggest using parametric models for easier interpretation. In this chapter, wefocus on a parametric model, particularly though a Box-Cox type model, which isflexible and indexed by a shape parameter.The general model for the association between a binary disease outcome, Y ,and an exposure variable, X , could be written as follows. For the i-th subject, wehavelog✓pi1 pi◆= g(xi), (3.1)where pi = E(Yi|Xi = xi), function g describes the shape of the exposure-diseaserelationship, and here we let g(x) = b0 +b1x(l ). The function x(l ) is the Box-Coxtransformation, sometimes also called the power transformation [5], defined asx(l ) =( xl1l l > 0log(x) l = 0 . (3.2)This function has a nice continuity property at l = 0 as we can see that liml!0+ x(l ) =log(x), and this continuity extends to any degree of derivatives. Clearly the shapeparameter, l , can fit different patterns with the change of the predictor, X . Morespecifically l = 1 indicates a linear pattern, l > 1 indicates an increasing slopepattern, and l < 1 indicates a decreasing slope pattern. This model is a specialcase of the fractional polynomial logistic model, which is discussed in [44].As the nonlinearity of the Box-Cox transformation may introduce some diffi-culty in the model interpretation, and in some situations it may be hard to preciselyestimate the shape parameter, we can consider the average predictive effect, as analternative summary of the effect of a predictor. Gelman and Pardoe [18] suggestedaveraging the effect of a predictor over the population distribution of predictors. Ina linear regression with predictors only included as linear main effects, the regres-37sion coefficients coincide with the predictive effects of their corresponding predic-tors. In a model containing nonlinear and/or interaction terms of a predictor, theaverage predictive effect of the predictor is defined as the expectation of its slopebased on given distribution of the predictor. So we can think the average predic-tive effect as an averaged slope across its entire distribution. Examples are shownby Liu and Gustafson [36] in linear regression models and by Gustafson [23] inthe survival analysis context. We adapt their definitions to the logistic regressioncontext and define the average predictive effect as follows:D= Ed(logit(E(Y |X)))dX,presuming this expectation exists.When predictor X is poorly measured, we cannot ignore the measurement error.There are two common types of errors, the additive measurement error and themultiplicative measurement error [9]. For the classic additive measurement errormodel, the measured predictor, X⇤, is equal to the the true value, X , plus a randomerror,U . That is,X⇤ = X +U,where X andU are independent, andU follows the normal distribution centered at0. As for the multiplicative measurement error model, which can be appropriatefor a positive, skewed distribution of exposure that often arises in epidemiologicalapplications, a common model isX⇤ = XW, (3.3)where X andW are independent, andW follows the log-normal(0,s2e ) distribution.Measurement errors can distort the relationship between the response and pre-dictors. In the linear regression, there is attenuation in the b coefficient for thenon-differential additive measurement errors [9]. There is a similar attenuationin the shape parameter when we fit a power regression model and the predictorfollows a log-normal distribution for multiplicative non-differential measurementerrors [12]. Ku¨chenhoff and Carroll [33] showed that due to measurement errorsan original threshold relationship between true predictor and response becomes a38smooth relationship between the measured predictor and the response. When weconsider the generalized linear regression, the effect of measurement error is morecomplex. Stefanski and Carroll [50] found that in the simple logistic regression,the measurement error tends to attenuate the predicted probabilities towards 0.5.There does not appear to be any discussion in literature specifically about the ef-fect of measurement error in the Box-Cox family applied to logistic regression.In section 3.2, we start with a data example on the application of the logisticBox-Cox model, and then introduce the settings for all the issues we are going todiscuss. Due to the difficulty of direct interpretation of the parameters in the modeland/or the requirement of large sample sizes in order to precisely estimate the shapeparameter under some conditions, we investigate using the average predictive effectas an alternative way to summarize the relationship between the outcome and thepredictor. We also discuss using the estimated slope coefficient from the potentiallymisspecified simple logistic regression as an estimate of the average predictiveeffect. In section 3.3, we look at the distortion of the parameter estimates in thelogistic Box-Cox family model under different degrees of measurement error. Thechapter closes with a discussion in section 3.4.3.2 Logistic Box-Cox ModelWe start our discussion with precisely measured data. In this case, the maximumlikelihood estimates (MLEs) from the fitted model based on the observed data arenot contaminated by the measurement errors, and, therefore, can be treated as ap-propriate estimates of the parameters in the true model, assuming no model mis-specification. Given data (Y ,X ), we can obtain MLEs of the parameters in thelogistic Box-Cox model (3.1), based on the profile likelihood method. The loglikelihood function of model (3.1) isl(b0,b1,l |Y ,X ) =nÂi=1yi"b0 +b1 xli 1l !# log"1+ exp b0 +b1 xli 1l !!# ,39where Y = (y1,y2, · · · ,yn)T and X = (x1,x2, · · · ,xn)T . The MLEs, (bˆ0, bˆ1, lˆ ), canbe obtained by solving the equations below,—l(b0,b1,l |Y ,X ) b0=bˆ0, b1=bˆ1, l=lˆ = 0.The above equations include the score functions relating to b0 and b1,nÂi=126640@1xlˆi 1lˆ 1A0BB@yi exp✓bˆ0 + bˆ1 xlˆi 1lˆ ◆1+ exp✓bˆ0 + bˆ1 xlˆi 1lˆ ◆1CCA3775= 0. (3.4)Defining a new variable V = (X lˆ 1)/lˆ , we can rewrite equation (3.4) asnÂi=124 1vi!0@yiexp⇣bˆ0 + bˆ1vi⌘1+ exp⇣bˆ0 + bˆ1vi⌘1A35= 0. (3.5)We can see that given lˆ , the above equation (3.5) could be solved using standardlogistic regression software, if we assume (Y ,V ) are our data. Therefore, we canuse the profile likelihood method to get the MLE. First of all, we need to set anupper bound Bu of the estimate of l , since a very large value of l is meaninglessand implausible in the disease-exposure model. Then we do a grid search in theinterval [0,Bu] with 1000[Bu] equally spaced values of l , fit the model at eachvalue, and select the particular value which maximizes the likelihood function.3.2.1 Parameter Estimates in the ModelThe prostate cancer data from [6], which were analyzed in detail in [11], provide agood example to illustrate the logistic Box-Cox family model. The study involved53 prostatic cancer patients who had a laparotomy to see if the cancer had spreadto the surrounding lymph nodes. The outcome variable, Y , is binary indicatingpresence (1) or absence (0) of nodal involvement. One of the risk factors is thelevel of acid phosphatase in serum (in King-Armstrong units), X . The phosphataseis a measure of the degree of tissue damage. Figure 3.1 shows that the distributionof the phosphatase variable is positively skewed. Under the assumption of a log-40phosphatase (King−Armstrong units)Density0.5 1.0 1.5 2.00.00.51.01.52.0Figure 3.1: The histogram of the level of acid phosphatase in King-Armstrong units. The solid curve is an estimated density curve, andthe dashed curve is the density curve of log-normal(0.42,0.32).normal(µ,s2) distribution for the phosphatase, we get µˆ = 0.42 and sˆ = 0.32,which indicates some mild skewness. We fit the logistic Box-Cox model to thedata using the profile likelihood mentioned above. The final fitted model islogit [ E(Y |X)] = 0.40+2.24log(X).We can see that phosphatase is positively associated with the nodal involvement,and one unit increase in phosphatase in log-scale is associated with 9.39 timesincrease in the odds of nodal involvement. Since lˆ = 0 is at the boundary of theparameter space, we bootstrap 500 samples to check the variability of the parameterestimates. It turns out that for 91% of these samples lˆ is at the lower bound of 0,3.4% yield lˆ = 5, which is the imposed upper bound, and the remaining 5.6%yield lˆ between 0 and 5. Furthermore, the 95% CI for bˆ0 is [0.75,2.35], and the95% CI for bˆ1 is [0.33,6.51].We can see that l plays an important role in determining the shape of exposure-disease relationship, with a one unit difference in l corresponding to a big differ-ence in the shape. As an example, l = 0 corresponds to a log-transformation in the41logistic model, but l = 1 represents a simple logistic model. Therefore, we startwith evaluating the precision of the estimator of l . We calculate the Fisher infor-mation matrix to obtain the asymptotic variance of the MLE of l . One conclusionis that given fixed values of other parameters, when b1 goes to 0, the asymptoticvariance of lˆ goes to infinity. Denote Z as a random variable following the stan-dard normal distribution. Then we can write X = exp(µ +Zs). The likelihoodfunction for a single observation isL(b0,b1,l |X = x,Y = y) =24exp⇣b0 +b1 xl1l ⌘1+ exp⇣b0 +b1 xl1l ⌘35y24 11+ exp⇣b0 +b1 xl1l ⌘351y f (x),where f (x) is the probability density function of X . Then the log likelihood func-tion for a single observation isl(b0,b1,l |X = x,Y = y) =y b0 +b1 xl 1l ! log"1+ exp b0 +b1 xl 1l !#+ log[ f (x)]. (3.6)Based on the log likelihood function (3.6), we can calculate the Fisher information42matrix as follows.I1(q )= E26664∂2l(X ,Y )∂b20 ∂2l(X ,Y )∂b1∂b0 ∂2l(X ,Y )∂l∂b0∂2l(X ,Y )∂b1∂b0 ∂2l(X ,Y )∂b21 ∂2l(X ,Y )∂l∂b1∂2l(X ,Y )∂l∂b0 ∂2l(X ,Y )∂l∂b1 ∂2l(X ,Y )∂l2 37775= E2666664E✓ ∂2l(X ,Y )∂b20 |X◆ E⇣ ∂2l(X ,Y )∂b1∂b0 |X⌘ E⇣ ∂2l(X ,Y )∂l∂b0 |X⌘E⇣ ∂2l(X ,Y )∂b1∂b0 |X⌘ E✓ ∂2l(X ,Y )∂b21 |X◆ E⇣ ∂2l(X ,Y )∂l∂b1 |X⌘E⇣ ∂2l(X ,Y )∂l∂b0 |X⌘ E⇣ ∂2l(X ,Y )∂l∂b1 |X⌘ E⇣ ∂2l(X ,Y )∂l2 |X⌘ 3777775= E2664P(1P) P(1P)V P(1P)b1 ∂V∂lP(1P)V P(1P)V 2 P(1P)b1V ∂V∂lP(1P)b1 ∂V∂l P(1P)b1V ∂V∂l P(1P)⇣b1 ∂V∂l ⌘2 3775=2664R+•• p(1 p)f(z)dz R+•• p(1 p)vf(z)dz R+•• p(1 p)b1 ∂v∂l f(z)dzR+•• p(1 p)vf(z)dz R+•• p(1 p)v2f(z)dz R+•• p(1 p)b1v ∂v∂l f(z)dzR+•• p(1 p)b1 ∂v∂l f(z)dz R+•• p(1 p)b1v ∂v∂l f(z)dz R+•• p(1 p)⇣b1 ∂v∂l ⌘2 f(z)dz 3775 ,(3.7)where q = (b0,b1,l )T ,P= E(Y |X),f(z) is the probability density function of thestandard normal distribution, andV =Xl 1l .To calculate the asymptotic variance of l , we need to calculate the (3,3) entry ofthe inverse of the Fisher information matrix shown in (3.7). That is,Avar(lˆ ) = 1det[I1(q )]C33, (3.8)where C33 is the matrix cofactor of I1(q ) withC33 = (1)3+3det" R +•• p(1 p)f(z)dz R +•• p(1 p)vf(z)dzR +•• p(1 p)vf(z)dz R +•• p(1 p)v2f(z)dz #=✓Z +••p(1 p)f(z)dz◆✓Z +••p(1 p)v2f(z)dz◆✓Z +••p(1 p)vf(z)dz◆✓Z +••p(1 p)vf(z)dz◆ ,43anddet(I1(q ))= b 21 "✓Z +••p(1 p)f(z)dz◆✓Z +••p(1 p)v2f(z)dz◆ Z +••p(1 p)✓ ∂v∂l ◆2 f(z)dz!✓Z +••p(1 p)f(z)dz◆✓Z +••p(1 p)v∂v∂l f(z)dz◆✓Z +•• p(1 p)v ∂v∂l f(z)dz◆✓Z +••p(1 p)vf(z)dz◆✓Z +••p(1 p)vf(z)dz◆ Z +••p(1 p)✓ ∂v∂l ◆2 f(z)dz!+✓Z +••p(1 p)vf(z)dz◆✓Z +••p(1 p)∂v∂l f(z)dz◆✓Z +•• p(1 p)v ∂v∂l f(z)dz◆+✓Z +••p(1 p)∂v∂l f(z)dz◆✓Z +•• p(1 p)vf(z)dz◆✓Z +•• p(1 p)v ∂v∂l f(z)dz◆✓Z +••p(1 p)∂v∂l f(z)dz◆✓Z +•• p(1 p) ∂v∂l f(z)dz◆✓Z +•• p(1 p)v2f(z)dz◆ .In general, we cannot get closed-form expressions for the integrals shown in theabove equation (3.8). However, numerical evaluation of the integrals is possibleby using the Gaussian-Hermite Quadrature (GHQ). In GHQ, we use the followingapproximation,Z +••et2f (t)dt ⇡mÂi=1wi f (ti)where m is the number of sample points used, and {ti}(i=1,··· ,m) are the roots of the(physicists’ version of the) Hermite polynomial Hm(t)(i = 1,2, ...,m). The associ-ated weights wi are given in [1] aswi =2m1m!ppm2[Hm1(xi)]2.In the logistic Box-Cox model, the value of b1 reflects the association betweenthe outcome and the predictor. If we only have a weak association between Y andX , it may be hard to get a precise estimate of the relationship. Therefore, for smallvalues of b1, we may obtain a large variance for the estimated l . As we can seewhen b1 = 0, all values of l yield the same distribution of (Y |X). So next we lookat this mathematically. Sincelimb1!0 p(x) = exp(b0)1+ exp(b0) ,44based on Lebesgue’s Dominated Convergence Theorem [3], we havelimb1!0C33 ="exp(b0)(1+ exp(b0))2#2"Z +•• f(z)dzZ +•• v2f(z)dz✓Z +•• vf(z)dz◆2# ,andlimb1!0b 21 det(I1(q )) = " exp(b0)(1+ exp(b0))2#3"Z +••f(z)dzZ +••v2f(z)dzZ +••✓ ∂v∂l ◆2 f(z)dzZ +••f(z)dzZ +••v∂v∂l f(z)dzZ +•• v ∂v∂l f(z)dzZ +••vf(z)dzZ +••vf(z)dzZ +••✓ ∂v∂l ◆2 f(z)dz+Z +••vf(z)dzZ +••∂v∂l f(z)dzZ +•• v ∂v∂l f(z)dz+Z +••∂v∂l f(z)dzZ +•• vf(z)dzZ +•• v ∂v∂l f(z)dzZ +••∂v∂l f(z)dzZ +•• ∂v∂l f(z)dzZ +•• v2f(z)dz .Therefore, we can get that Avar(lˆ ) = O(b21 ). And when the association betweenoutcome and predictor is not strong, a relatively large sample size may be requiredfor reasonable estimation of l . Due to the complexity of the integrations in theFisher information matrix, we cannot get closed-form expressions in general, butwe can carry out the numerical calculation across a comprehensive range of set-tings.3.2.2 Factors in Epidemiological StudiesIn a typical epidemiological study, there are four main factors to consider,1. factor A: the shape of exposure distribution;452. factor B: the shape of exposure-disease relationship;3. factor C: the disease rarity;4. factor D: the strength of exposure-disease association.We vary the levels of the four factors factorially in order to evaluate estimatorperformance systematically and comprehensively. Let the exposure variable, X ,0.0 0.2 0.4 0.6 0.8 1.0 1.2012345xDensityσ=0.5σ=1σ=2Figure 3.2: Distributions of the exposure variable.follow the log-normal(µ,s2) distribution. Without loss of generality, we fix its95-th percentile as 1, and let s vary as 0.5,1 and 2 to represent different levelsof skewness. Figure 3.2 shows the selected distributions of X , which range frommild skewness to severe skewness. As for B, we let the shape parameter, l , in theBox-Cox transformation vary as 0,0.5,1 and 2, which corresponds to log, squareroot, linear and square functions respectively. We use the probability of disease atthe 5-th percentile of the exposure to indicate the disease rarity, denoting this asP1. For the strength of exposure-disease association, we consider the ratio of theprobability of the disease at 95-th percentile of the exposure to the probability of46the disease at 5-th percentile, which is denoted asR =Pr(Y = 1|X is at 95-th percentile)Pr(Y = 1|X is at 5-th percentile). (3.9)We let P1 vary as 0.02 and 0.1 to represent low and moderate disease prevalencerespectively, and let R vary as 1.1,2 and 5 to represent weak, medium and strongassociations respectively.Figure 3.3 shows the disease risk as a function of the exposure in the above-described settings. In each panel, the distribution of exposure and disease rarityis fixed. The risk functions vary due to the difference of the shape parametersin the model and the risk ratios indicating the strength of the association. Wecan see that given the distribution of exposure, as the exposure-disease associationbecomes stronger, the risk differences between different shape parameters at thesame exposure level become larger. Also, given the association, as the distributionbecomes more skewed, the risk differences between different shape parametersat the same exposure level become larger. These indicate that the skewness andthe strength of association may be related to the precision in estimating the shapeparameter.3.2.3 Average Predictive EffectIn the logistic Box-Cox model, the shape parameter, l , and the association param-eter, b1, jointly determine the slope of log odds at X = x as we haved(logit(P))dXX=x= b1xl1, (3.10)where P = E(Y |X). Due to the difficulty in direct interpretation of this slope andthe shape parameter, l , we are going to discuss another quantity with an easierinterpretation. For our model (3.1), assume that X follows a log-normal(µ,s2)470.0 0.5 1.0 1.50.000.10σ = 0.5x Pr( Y = 1 | X = x) 0.0 0.5 1.0 1.50.00.20.40.6σ = 0.5x Pr( Y = 1 | X = x) 0.0 0.5 1.0 1.50.000.10σ = 1x Pr( Y = 1 | X = x) 0.0 0.5 1.0 1.50.00.20.40.6σ = 1x Pr( Y = 1 | X = x) 0.0 0.5 1.0 1.50.000.10σ = 2x Pr( Y = 1 | X = x) 0.0 0.5 1.0 1.50.00.20.40.6σ = 2x Pr( Y = 1 | X = x) Figure 3.3: The risks of disease as a function of exposure in the selected set-tings. The solid curves in each panel correspond to l = 0, the dashedcurves are for l = 0.5, the dotted curves are for l = 1, and the dashed-dotted curves are for l = 2. The distribution of the exposure for eachpanel is shown at the bottom of each panel by segmented gray lines.48distribution, so that the average predictive effect can be calculated asD = Ed(logit(P))dX= Ehb1Xl1i= b1 Z +••1xp2ps e (ln(x)µ)22s2 xl1dx= b1e l12 [(l1)s2+2µ]. (3.11)We can see that the average predictive effect represents average slope in a nonlinearmodel. When the true model is linear, we have l = 1, and, based on the aboveequation (3.11), we have D = b1. Generally D is determined by b1, l , µ , and s .Under given µ and s , an estimate of D could be written asDˆ= bˆ1e lˆ12 [(lˆ1)s2+2µ]. (3.12)When µ and s2 are unknown, we can estimate them based on the empirical dis-tribution of the predictor, which is independent from the model fitting. As for theasymptotic variance of Dˆ, based on the multivariate delta method, we haveAvarDˆ⇡⇣ ∂D∂b1 , ∂D∂l ⌘S ∂D∂b1∂D∂l != D2 Avar(bˆ1)b 21 + 2⇥(l 1)s2 +µ⇤b1 Acov(bˆ1, lˆ )+⇥(l 1)s2 +µ⇤2Avar(lˆ )⌘ , (3.13)where ∂D∂b1∂D∂l != e l12 [(l1)s2+2µ]⇥(l 1)s2 +µ⇤b1e l12 [(l1)s2+2µ] ! ,andS= Avar(bˆ1), Acov(bˆ1, lˆ )Acov(bˆ1, lˆ ), Avar(lˆ ) !49is the asymptotic variance-covariance matrix of bˆ1 and lˆ . And if our sample sizen is sufficiently large [10], we haveVar(Dˆ)⇡ 1nAvar(Dˆ). (3.14)Based on equation (3.13), given µ and s2, we calculate the asymptotic variances oflˆ and that of Dˆ in all the settings mentioned above, to evaluate estimator precisionunder different conditions.●●●0100200300400500600ASD(λ^)σ = 0.5R1 R2 R3●● ●●●●●●●●●●●●●●●●●●●P1 = 0.02, λ = 0P1 = 0.02, λ = 0.5P1 = 0.02, λ = 1P1 = 0.02, λ = 2P1 = 0.1, λ = 0P1 = 0.1, λ = 0.5P1 = 0.1, λ = 1P1 = 0.1, λ = 2 ●●●0100200300400500600ASD(λ^)σ = 1R1 R2 R3●● ●●●●●●●●●●●●●●●P1 = 0.02, λ = 0P1 = 0.02, λ = 0.5P1 = 0.02, λ = 1P1 = 0.02, λ = 2P1 = 0.1, λ = 0P1 = 0.1, λ = 0.5P1 = 0.1, λ = 1P1 = 0.1, λ = 2●● ●0100200300400500600ASD(λ^)σ = 2R1 R2 R3●●●● ●●●●●●●●●●P1 = 0.02, λ = 0P1 = 0.02, λ = 0.5P1 = 0.02, λ = 1P1 = 0.02, λ = 2P1 = 0.1, λ = 0P1 = 0.1, λ = 0.5P1 = 0.1, λ = 1P1 = 0.1, λ = 2Figure 3.4: Asymptotic standard deviation of lˆ over different settings. R1 isfor R = 1.1, R2 is for R = 2, and R3 is for R = 5.Figure 3.4 shows that the asymptotic variance of lˆ tends to decrease as theexposure-disease association becomes stronger, as the skewness of distribution ofexposure becomes more severe, as the disease becomes more common, and as theshape parameter increases. As for the asymptotic variance of Dˆ in Figure 3.5, weobserve the same trend for the disease rarity and for the shape parameter, but the op-posite trend for more skewed distribution. As an example, in the rare disease case(P1 = 0.02) and lower risk ratio case (R= 1.1), ASD(Dˆ) doubles when the distribu-tion of X changes from log-normal(0.82,0.25) to log-normal(1.64,1) and in-creases around 20 times when the distribution of X changes from log-normal(1.64,1)to log-normal(3.29,4). As for the relationship of ASD(Dˆ) with the strength of theassociation, for mildly-skewed distribution of X , ASD(Dˆ) decrease as the associa-50● ● ●050100150200ASD(∆^)σ = 0.5R1 R2 R3● ● ●● ● ●● ●● ● ●● ● ●●● ● ●● ●●050100150200ASD(∆^)σ = 1R1 R2 R3● ● ●●●● ● ●● ● ●● ● ●● ● ● 050100150200ASD(∆^)σ = 2R1 R2 R3●●●●●●●●●●●●● ●●●●●●●0500100015002000ASD(∆^)σ = 2R1 R2 R3● ● ●● ● ●● ● ●● ● ●● ● ●Figure 3.5: Asymptotic standard deviation of Dˆ over different settings. R1 isfor R = 1.1, R2 is for R = 2, and R3 is for R = 5. The lines with soliddots are for P1 = 0.02; the lines with open circles are for P1 = 0.1; thesolid lines are for l = 0; the dashed lines are for l = 0.5; the dottedlines are for l = 1; and the dashed-dotted lines are for l = 2. Fromleft to right in the third panel, figure is only shown within the range of[0,200] in vertical axis for better comparison with its left two figuresand we plot it again in the fourth panel in its complete range.tion becomes stronger. However, for a very skewed distribution, it seems ASD(Dˆ)tends to increase for stronger associations for most of the cases.Figures 3.6 and 3.7 show us the comparison between ASD(lˆ ) and ASD(Dˆ).We can see that under the same distribution of exposure, as the shape parameter lincreases, both ASD(lˆ ) and ASD(Dˆ) decrease. Under the mildly-skewed distribu-tion of exposure, ASD(lˆ ) seems always be greater than ASD(Dˆ), especially whenthe exposure-disease association is weak. As an example, under the case with raredisease (P1 = 0.02), low risk ratio (R = 1.1) and mild skewness of exposure, nomatter what the shape parameter is, ASD(lˆ ) is around 10 times more than ASD(Dˆ).This indicates that to achieve the similar precision in the parameter estimates, weneed 100 times more observations for estimating l than for D. However, undervery skewed distribution of exposure, ASD(lˆ ) is much less than ASD(Dˆ) when lis less than 1. Especially under the case with rare disease (P1 = 0.02), low risk51●●●0200400600ASD(λ^) & ASD(∆^)σ = 0.5 , λ = 0R1 R2 R3● ● ●●● ●0200400600σ = 0.5 , λ = 0.5R1 R2 R3● ● ●●● ●0200400600σ = 0.5 , λ = 1R1 R2 R3● ●●● ●0200400600σ = 0.5 , λ = 2R1 R2 R3● ●●● ●0200400600ASD(λ^) & ASD(∆^)σ = 1 , λ = 0R1 R2 R3● ● ●●● ●0200400600σ = 1 , λ = 0.5R1 R2 R3● ● ●●● ●0200400600σ = 1 , λ = 1R1 R2 R3● ● ● ● ● ●0200400600σ = 1 , λ = 2R1 R2 R3● ●● ● ●05001500ASD(λ^) & ASD(∆^)σ = 2 , λ = 0R1 R2 R3●●●●● ●0200400600σ = 2 , λ = 0.5R1 R2 R3● ● ●● ● ●0200400600σ = 2 , λ = 1R1 R2 R3● ● ● ● ● ●0200400600σ = 2 , λ = 2R1 R2 R3● ●Figure 3.6: The asymptotic standard deviation of the estimated average pre-dictive effect, Dˆ, and the asymptotic standard deviation of lˆ underP1 = 0.02 and different values of factor A, B, and D. R1 is for R = 1.1,R2 is for R = 2, and R3 is for R = 5. The lines with solid dots representASD(lˆ ) and the lines with open circle represent ASD(Dˆ). The range inthe vertical axis is limited from 0 to 660, except for the left one in thebottom row due to its large values.ratio (R = 1.1) and l = 0, ASD(Dˆ) is around 20 times more than ASD(lˆ ). Thisexploration tells us that when the exposure distribution is mildly-skewed, inferenceon D requires less data and affords a more direct interpretation. However, on theother side, when exposure distribution is very skewed, it may not always be costeffective to do inference on the average predictive effect instead of l , due to theinflation of the variance of the estimate of D.Another quantity of interest is the coefficient of variation (CV) of Dˆ, which is52●● ●050150250ASD(λ^) & ASD(∆^)σ = 0.5 , λ = 0R1 R2 R3● ● ●●● ●050150250σ = 0.5 , λ = 0.5R1 R2 R3● ● ●●● ●050150250σ = 0.5 , λ = 1R1 R2 R3● ●●● ●050150250σ = 0.5 , λ = 2R1 R2 R3● ●●● ●050150250ASD(λ^) & ASD(∆^)σ = 1 , λ = 0R1 R2 R3● ● ●●● ●050150250σ = 1 , λ = 0.5R1 R2 R3● ● ●●● ●050150250σ = 1 , λ = 1R1 R2 R3● ● ●●● ●050150250σ = 1 , λ = 2R1 R2 R3● ● ●● ● ●02006001000ASD(λ^) & ASD(∆^)σ = 2 , λ = 0R1 R2 R3● ● ●●● ●050150250σ = 2 , λ = 0.5R1 R2 R3● ●●● ● ●050150250σ = 2 , λ = 1R1 R2 R3● ●● ● ● ●050150250σ = 2 , λ = 2R1 R2 R3● ● ●Figure 3.7: The asymptotic standard deviation of the estimated average pre-dictive effect, Dˆ, and the asymptotic standard deviation of lˆ underP1 = 0.02 and different values of factor A, B, and D. R1 is for R = 1.1,R2 is for R = 2, and R3 is for R = 5. The lines with solid dots representASD(lˆ ) and the lines with open circle represent ASD(Dˆ). The range inthe vertical axis is limited from 0 to 300, except for the left one in thebottom row due to its large values.given asH(Dˆ) =qAvar(Dˆ)Dˆ=sAvar(bˆ1)b 21 + 2 [(l 1)s2 +µ]b1 Acov(bˆ1, lˆ )+ [(l 1)s2 +µ]2Avar(lˆ ).Figure 3.8 shows the coefficient of variation of Dˆ at different levels of the fourfactors. We can see that CV of Dˆ tends to decrease as the disease becomes morecommon, as the exposure-disease association becomes stronger, and as l increases.However, when the distribution of the exposure becomes more skewed, the CV ofDˆ increases considerably. Therefore, for very skewed exposure distributions, it is53●●●0100200300400500600700 CV of ∆^σ = 0.5R1 R2 R3●● ●●●●●●●●●●●●●●P1 = 0.02, λ = 0P1 = 0.02, λ = 0.5P1 = 0.02, λ = 1P1 = 0.02, λ = 2P1 = 0.1, λ = 0P1 = 0.1, λ = 0.5P1 = 0.1, λ = 1P1 = 0.1, λ = 2●●●0100200300400500600700 CV of ∆^σ = 1R1 R2 R3●● ●●●●●●●●●●●●●●●●●●P1 = 0.02, λ = 0P1 = 0.02, λ = 0.5P1 = 0.02, λ = 1P1 = 0.02, λ = 2P1 = 0.1, λ = 0P1 = 0.1, λ = 0.5P1 = 0.1, λ = 1P1 = 0.1, λ = 2●●●0100200300400500600700 CV of ∆^σ = 2R1 R2 R3●●●●●●●●●●●● ●●●●●●●●●●P1 = 0.02, λ = 0P1 = 0.02, λ = 0.5P1 = 0.02, λ = 1P1 = 0.02, λ = 2P1 = 0.1, λ = 0P1 = 0.1, λ = 0.5P1 = 0.1, λ = 1P1 = 0.1, λ = 2Figure 3.8: Coefficient of variation of Dˆ over different settings. R1 is forR = 1.1, R2 is for R = 2, and R3 is for R = 5.harder to determine the true signal of D given the same sample size and other pa-rameter values. This doubly confirms that the preferable inference condition for Dis under mildly-skewed exposure distribution when exposure is perfectly measured.We know that D represents the average slope of the effect of the exposure. If alogit-linear model only contains the main effect of the predictor, the average pre-dictive effect is equal to the regression coefficient in the simple logistic regression.So we conjecture that when we fit a logit-linear model to a logit-nonlinear relation-ship, though it is a misspecified model, the estimated slope parameter may still bea good estimate of the average predictive effect. The simple logistic model can bewritten aslogit(Pr(Y = 1 |X = x)) = g0 + g1x.We would like to determine if the large-sample limiting slope, g⇤1 , from fitting thesimple logistic regression is close to the average predictive effect and to comparethe asymptotic variance of the estimated average predictive effect with that of theestimate of g1. Since we assume independent and identical distributions of theobservations, we focus our discussion on just one observation. The misspecified54log likelihood function for a single observation isl1(g0,g1|X ,Y ) = Y (g0 + g1X) log(1+ exp(g0 + g1X)).To get the large-sample limiting coefficients, g⇤0 and g⇤1 , we need to find the maxi-mizers for the expectation of the log likelihood function. That is,(g⇤0 ,g⇤1 ) = argmax(g0,g1)2R2 {E(l1(g0,g1))} .We can obtain the maximizers based on setting the first partial derivative equationsequal to 0. ∂E(l1)∂gi g0=g⇤0 ,g1=g⇤1 = E✓∂ l1∂gi◆g0=g⇤0 ,g1=g⇤1 = 0,where i = 1, and 2. That is, we haveE" 1X!✓Y exp(g⇤0 + g⇤1X)1+ exp(g⇤0 + g⇤1X)◆#= 0.And, after using the double expectation theorem and taking the expectation on Yfirstly, we haveE" 1X!✓Pexp(g⇤0 + g⇤1X)1+ exp(g⇤0 + g⇤1X)◆#= 0, (3.15)whereP = E(Y |X) =exp⇣b0 +b1 Xl1l ⌘1+ exp⇣b0 +b1 Xl1l ⌘ .Due to the misspecified likelihood, the inverse of the Fisher Information matrix canno longer provide the asymptotic variances of the parameters. Instead we use thesandwich estimation idea [15, 55], wherebyAvar(gˆ ) = J11 (g ⇤)V1(g ⇤)J11 (g ⇤), (3.16)where J1 =E(H(l1)),H(l1) is the Hessian matrix,V1 =Var(—l1), and g ⇤ =(g⇤0 ,g⇤1 )T .We calculate this asymptotic variance as follows. First the gradient of the log-55likelihood is—l1(g ) = ∂ l∂g0∂ l∂g1 != Y P⇤(Y P⇤)X ! ,where P⇤ = exp(g0 + g1X)/(1+ exp(g0 + g1X)) . Then its variance isV1(g ) = Var(—l1(g ))= E⇣—l1(g )(—l1(g ))T⌘= E⇥(Y P⇤)2⇤E⇥(Y P⇤)2X⇤E⇥(Y P⇤)2X⇤E⇥(Y P⇤)2X2⇤!= E⇥P2PP⇤+P⇤2⇤E⇥(P2PP⇤+P⇤2)X⇤E⇥(P2PP⇤+P⇤2)X⇤E⇥(P2PP⇤+P⇤2)X2⇤!.Secondly, the Hessian matrix of the likelihood isH(l1) =0@∂ 2l∂g20 ∂ 2l∂g0∂g1∂ 2l∂g0∂g1 ∂ 2l∂g21 1A= P⇤(1P⇤) P⇤(1P⇤)XP⇤(1P⇤)X P⇤(1P⇤)X2 ! .Then its expectation isJ1(g ) =E(H(l1)) = E(P⇤(1P⇤)) E(P⇤(1P⇤)X)E(P⇤(1P⇤)X) E(P⇤(1P⇤)X2)!.We need to solve equation (3.15) to get the large-sample limiting coefficientestimate of g and then to calculate equation (3.16) to get the asymptotic variance ofthe MLE of g . Since we can not get the closed-form solution for both of them, wedo both based on numerical calculation. First of all, given µ and s2, we simulateN samples from the distribution of X , and use the sample mean to approximate theexpectation. That is, for equation (3.15), we solve1NNÂi=1" 1xi!(pi p⇤i )#= 0, (3.17)56wherepi = E(Y |X = xi) =exp⇣b0 +b1 xli 1l ⌘1+ exp⇣b0 +b1 xli 1l ⌘ ,andp⇤i =exp(g⇤0 + g⇤1xi)1+ exp(g⇤0 + g⇤1xi) .If we treat (xi, pi)(i=1,··· ,N) as our data, then the equation (3.17) can be solved bystandard logistic regression software, such as the glm() function in the R software[39], which allows us to pretend that the outcome isn’t binary. We call the standarderrors of the coefficients reported from the glm() function as the simulation errorssince these errors are introduced by the limitation of a finite simulated sample size.And for equation (3.16), we getAvar(gˆ )⇡ Jˆ11 (g ⇤)Vˆ1(g ⇤)Jˆ11 (g ⇤), (3.18)whereJˆ1(g ⇤) =0BB@ 1N NÂi=1 [p⇤i (1 p⇤i )] 1N NÂi=1 [p⇤i (1 p⇤i )xi]1NNÂi=1[p⇤i (1 p⇤i )xi]1NNÂi=1⇥p⇤i (1 p⇤i )x2i⇤1CCA ,andVˆ1(g ⇤) =0BB@ 1N NÂi=1⇥pi2pip⇤i + p⇤2i ⇤ 1N NÂi=1⇥(pi2pip⇤i + p⇤2i )xi⇤1NNÂi=1⇥(pi2pip⇤i + p⇤2i )xi⇤1NNÂi=1⇥(pi2pip⇤i + p⇤2i )x2i⇤1CCA .In our example, we choose N = 100,000. The bias between the average pre-dictive effect, D, and the large-sample limiting coefficient, g⇤1 , has a clear patternshown in Figure 3.9. Given the exposure distribution, we see that g⇤1 is less than Dwhen l < 1, g⇤1 is greater than D when l is above 1, and they exactly agree witheach other when l = 1. Also the bias is smaller when l is closer to 1 and when thedisease is more common. However, as the distribution of the exposure becomesmore skewed, the bias increases, especially under l = 0. As an example, underthe severely skewed distribution of exposure, the bias can be larger than 60 when57●●●0 1 2 3 4 5 60123456γ1*∆σ = 0.5●●●●●●●●●●●●●●●●●●0 1 2 3 4 5 60123456γ1*∆σ = 1●●●●●●●●●●●●●●●●●●●0 1 2 3 4 5 60123456γ1*∆σ = 2●●●●●●●●●●●●●●●●●●0.0 0.5 1.0 1.5 2.0 2.50102030405060γ1*∆σ = 2●●●●●●●● ● ●● ●● ● ●● ● ●Figure 3.9: The average predictive effect, D, and the large-sample limitingcoefficient, g⇤1 , from the simple logistic model under different settings.The lines with solid dots are for P1 = 0.02; the lines with open circlesare for P1 = 0.1; the solid lines are for l = 0; the dashed lines are forl = 0.5; the dotted lines are for l = 1; and the dashed-dotted lines arefor l = 2; the gray line is the 45 degree line. From left to right, in thethird panel, the vertical axis of the figure is restricted to [0,6] for bettercomparison with the left two figures. We plot it again in the fourth panelover its complete range.P1 = 0.1 and l = 0. Therefore, the conditions producing small biases should bethat (1), the exposure variable is not very skewed, and (2), the shape parameter isclose to 1. Going beyond the analysis of the bias, Figure 3.10 shows the compar-ison of asymptotic variances between Dˆ and gˆ1. In most settings, the asymptoticvariance of gˆ1 is less than the asymptotic variance of Dˆ. Also, Avar(gˆ1) is muchmore stable across different distribution of exposure, disease rarity, risk ratio andthe change of shape parameters. Therefore, under mildly-skewed exposure distri-butions and shape parameter close to 1, we could consider using gˆ1 to replace Dˆ,since the variance is substantially reduced even though we may incur some bias.For the prostate cancer data from [6], we estimate the average predictive effectof the phosphatase as 3.59 with 95% CI [0.52,9.78] from the 500 bootstrap sam-ples. This tells us that one King-Armstrong unit increase in the level of the acidphosphatase may introduce more cases of nodal involvement in the prostate cancerpatients. At the same time, the estimated slope coefficient from the simple logisticregression model is 2.04 with 95% CI [0.11,8.16] from the bootstrap, indicating58● ● ●02060100ASD(∆^) & ASD(γ^ 1)σ = 0.5 , λ = 0R1 R2 R3● ● ●● ● ●● ● ●● ●●02060100ASD(∆^) & ASD(γ^ 1)σ = 0.5 , λ = 0.5R1 R2 R3● ● ●●●● ● ●●● ●02060100ASD(∆^) & ASD(γ^ 1)σ = 0.5 , λ = 1R1 R2 R3● ● ●●● ●● ● ●● ● ●02060100Avar(∆^) & Avar(γ^ 1)σ = 0.5 , λ = 2R1 R2 R3● ● ●● ●● ●● ●●02060100ASD(∆^) & ASD(γ^ 1)σ = 1 , λ = 0R1 R2 R3● ● ●● ● ●● ●●●●02060100ASD(∆^) & ASD(γ^ 1)σ = 1 , λ = 0.5R1 R2 R3● ● ●● ● ●● ●● ● ●02060100ASD(∆^) & ASD(γ^ 1)σ = 1 , λ = 1R1 R2 R3● ● ●●● ●● ● ● ●●02060100Avar(∆^) & Avar(γ^ 1)σ = 1 , λ = 2R1 R2 R3● ● ●● ●●●●05001500ASD(∆^) & ASD(γ^ 1)σ = 2 , λ = 0R1 R2 R3● ● ●● ● ●●●●050100150200ASD(∆^) & ASD(γ^ 1)σ = 2 , λ = 0.5R1 R2 R3●●●● ● ●●●●●02060100ASD(∆^) & ASD(γ^ 1)σ = 2 , λ = 1R1 R2 R3●●●● ● ●●● ●●02060100Avar(∆^) & Avar(γ^ 1)σ = 2 , λ = 2R1 R2 R3●●● ●Figure 3.10: The asymptotic standard deviation of gˆ1 from the misspecifiedmodel and the asymptotic standard deviation of Dˆ under different set-tings. R1 is for R = 1.1, R2 is for R = 2, and R3 is for R = 5. Thelines with solid dots are for P1 = 0.02 and the lines with open circlesare for P1 = 0.1. The solid lines are for ASD(Dˆ), and the dashed linesare for ASD(gˆ1). The range of the horizontal axis is limited from 0 to120, except for the left two panels in the bottom row.some evidence of an association.3.3 Measurement Errors in the ModelFor non-precisely measured data, the measurement errors in the predictor coulddistort the true relationship between the outcome, Y , and the predictor, X . There-fore, we may get biased parameter estimates in the logistic Box-Cox model. Weexplore the biases produced under different degrees of measurement errors across59all of the proposed settings. Denote X⇤ as the measured predictor, which containsthe multiplicative measurement error (3.3). Given X ⇠ log-normal(µ,s2), we haveX⇤ ⇠ log-normal(µ,s2⇤ ), with s2⇤ = s2 +s2e . We can describe the Box-Cox modelwith the binary outcome, Y , and the predictor measurement, X⇤, aslogit [E(Y = 1 |X⇤ = x⇤) ] = r0 +r1x⇤(h). (3.19)We assume that (r0,r1,h) is in the same parameter space with (b0,b1,l ). Presum-ing the model (3.1) describes the underlying relationship between the outcome, Y ,and the predictor, X , the above model (3.19) is then a misspecified model. Esti-mators that maximize the misspecified likelihood will tend to their large-samplelimits as the sample size goes to infinity [55]. Therefore, we focus our discussionon these large-sample limits. Denote (r⇤0 ,r⇤1 ,h⇤) as the large-sample limits of theestimates of (r0,r1,h). To look into the distortion of the parameter estimates frommeasurement errors, we numerically evaluate the bias in the parameter estimationand also the bias in the disease probability estimation. We can evaluate these bysolving the following optimization equation,(r⇤0 ,r⇤1 ,h⇤) = argmax(r0,r1,h)2Q {E(l(r0,r1,h|Y,X⇤))} , (3.20)where the likelihood function for a single observation isl(r0,r1,h|Y,X⇤)=Y r0 +r1✓X⇤h1h ◆log1+ exp✓r0 +r1✓X⇤h1h ◆◆ .We can obtain the maximizer, (r⇤0 ,r⇤1 ,h⇤), in equation (3.20) based on the gradientat (r⇤0 ,r⇤1 ,h⇤) being a zero vector. That is,—E(l(r0,r1,h|Y,X⇤)) r=r ⇤, h=h⇤ = 0.Based on the equations corresponding to r0 and r1, we haveE24 1X⇤h⇤1h⇤!0@Y exp⇣r⇤0 +r⇤1 X⇤h⇤1h⇤ ⌘1+ exp⇣r⇤0 +r⇤1 X⇤h⇤1h⇤ ⌘1A35= 0.60Denote P = E(Y |X). Then we haveE24 1X⇤h⇤1h⇤!0@Pexp⇣r⇤0 +r⇤1 X⇤h⇤1h⇤ ⌘1+ exp⇣r⇤0 +r⇤1 X⇤h⇤1h⇤ ⌘1A35= 0. (3.21)Define a new variable V = (X⇤h⇤1)/h⇤. We can rewrite equation (3.21) asE" 1V! Pexp(r⇤0 +r⇤1V )1+ expr⇤0 +r⇤1V!#= 0. (3.22)For each (b0,b1,l ), we use the following algorithm to calculate the theoreticalvalues of r⇤0 ,r⇤1 and h⇤ given different degrees of measurement error. Set an upperbound on l , denoted as Bu. Then we do a grid search in the interval [0,Bu] with1000[Bu] equally spaced values, fit the model at l equal to each value, and selectthe particular value which can maximize the likelihood function. We let se varyas 0.1 and 0.01 to represent two different degrees of measurement error. Whense = 0.1, the measured values are most likely in a range of 82% 122% of thetrue values due to multiplicative measurement errors, and we treat this situation asa representative of rough measurements. We use se = 0.01 to typify fairly precisemeasurements since the range shrinks to 98%102% of the true values.Figure 3.11 shows that under the rough measurements of the exposure, thelarge-sample limit of the shape parameter, h⇤, in the misspecified model (3.19) isalways smaller than the shape parameter, l , in the true model (3.1). And the extentof the attenuation is associated with the true value of the shape parameter, l , thestrength of the exposure-disease association, and the skewness of the exposuredistribution. Within these three factors, the value of l seems to play the mostprominent role in the extent of attenuation, while the effects of the strength ofthe exposure-disease association and the skewness of the exposure distribution canonly be seen clearly at a relativly large value of l . As an illustration, when l isless than 1, the attenuation is less than 0.025 regardless of the other conditions. Asl increases, the attenuation becomes more severe. When l = 2, the attenuationranges from 0 to 0.6. Also in this situation, we can see a clear pattern that as thestrength of exposure-disease association increases, the attenuation increases, and as61● ● ●0.00.10.20.30.40.50.6λ = 0λ − h*R1 R2 R3● ● ●0.00.10.20.30.40.50.6λ = 0.5λ − h*R1 R2 R3●● ● ●●● ● ●0.00.10.20.30.40.50.6λ = 1λ − h*R1 R2 R3●● ●●●● ●●●●●●0.00.10.20.30.40.50.6λ = 2λ − h*R1 R2 R3●●●●●●●●●●●●●Figure 3.11: The difference between the shape parameter, l , in the truemodel (3.1) and the large-sample limit, h⇤, in the misspecified model(3.19) in the presence of the multiplicative measurement error withse = 0.1 across different settings. The solid line is under s = 0.5 inthe exposure distribution, the dashed line is under s = 1 in the ex-posure distribution, and the dashed dotted line is under s = 2 in theexposure distribution. The lines with solid dots are for P1 = 0.02, andthe lines with open circles are for P1 = 0.1.62● ● ●0.00.10.20.30.40.5λ = 0λ − h*R1 R2 R3● ● ●0.00.10.20.30.40.5λ = 0.5λ − h*R1 R2 R3● ● ●0.00.10.20.30.40.5λ = 1λ − h*R1 R2 R3● ● ●●●●● ● ●0.00.10.20.30.40.5λ = 2λ − h*R1 R2 R3● ● ●●●●● ●●●●●Figure 3.12: The difference between the shape parameter, l , in the truemodel (3.1) and the large sample limit, h⇤, in the model (3.19) in thepresence of the multiplicative measurement error se = 0.01 across dif-ferent settings. The solid line is under s = 0.5 in the exposure distri-bution, the dashed line is under s = 1 in the exposure distribution, andthe dashed dotted line is under s = 2 in the exposure distribution. Thelines with solid dots are for P1 = 0.02 and the lines with open circlesare for P1 = 0.1.630.0 0.5 1.0 1.5−0.050.000.050.10 σ = 0.5 , λ = 0E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●● ●● ●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 0.5 , λ = 0.5E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 0.5 , λ = 1E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●● ●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 0.5 , λ = 2E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 1 , λ = 0E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●● ●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 1 , λ = 0.5E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 1 , λ = 1E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 1 , λ = 2E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 2 , λ = 0E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 2 , λ = 0.5E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●● ●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 2 , λ = 1E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●●0.0 0.5 1.0 1.5−0.050.000.050.10 σ = 2 , λ = 2E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●● ●●●●●Figure 3.13: The difference between the true probability and the estimatedprobability from the misspecified model (3.19) across different settingsunder se = 0.1. The solid curves are for R = 1.1, the dashed curvesare for R = 2, and the dashed-dotted curves are for R = 5. The curveswith triangles are for P1 = 0.02, and the curves with open circles arefor P1 = 0.1.640.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 0.5 , λ = 0E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 0.5 , λ = 0.5E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 0.5 , λ = 1E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 0.5 , λ = 2E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 1 , λ = 0E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 1 , λ = 0.5E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 1 , λ = 1E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 1 , λ = 2E(Y | X = x*) − E(Y | X* = x*)x*● ●●●●●● ●●●●●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 2 , λ = 0E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 2 , λ = 0.5E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●0.0 0.5 1.0 1.5−0.010−0.0050.0000.0050.010 σ = 2 , λ = 1E(Y | X = x*) − E(Y | X* = x*)x*●● ● ●●●● ●●●●●0.0 0.5 1.0 1.5−0.040.000.020.04σ = 2 , λ = 2E(Y | X = x*) − E(Y | X* = x*)x*● ● ● ●●●●●●●●●Figure 3.14: The difference between the true probability and the estimatedprobability from the misspecified model (3.19) across different settingsunder se = 0.01. The solid curves are for R = 1.1, the dashed curvesare for R = 2, and the dashed-dotted curves are for R = 5. The curveswith triangles are for P1 = 0.02, and the curves with open circles arefor P1 = 0.1. The figures in all the panels except the right bottompanel are plotted in the range of (0.010,0.010) in vertical axis forbetter comparison. The figure in the right bottom panel is plotted in awider range in vertical axis due to its special large values on verticalaxis.65the exposure distribution becomes more skewed, the attenuation increases as well.Under se = 0.01, we have similar results except, unsurprisingly, the attenuationis milder under each setting, as shown in Figure 3.12. If we look at the bias inestimating Pr(Y = 1|X) via the fitted (Y |X⇤) relationship, we can see that the fittedprobability is most biased when the distribution of exposure is very skewed andthe shape parameter l is 2, which is shown in Figure 3.14. In this scenario, underse = 0.1, we could estimate 5 fewer cases per 100 samples than the truth whenthe measured exposure is around 0.75, and could estimate around 8 more cases per100 samples than the truth when the measured exposure is close to 1.5, which isfar in the right tail. Also it seems that the fitted probability is always inflated atsmall values of exposure but then attenuated at large values of the exposure. We0.2 0.4 0.6 0.8 1.0−0.040.000.04E(Y | X = x*)E(Y | X = x*) − E(Y | X* = x*) σe = 0.1σe = 0.01Figure 3.15: The difference between the true probability and the fitted onefrom the misspecified model (3.19).know that in simple logistic regression, measurement errors in the predictor tendto attenuate fitted probabilities towards 0.5 [50]. Figure 3.15 indicates that we mayhave the same attenuation pattern for this nonlinear logistic regression model.As for a data example, we continue to use the prostate cancer data [6]. Firstly,we simulate 100 sets of measurement errors with the same sample size of theprostate cancer data from log-normal (0,s2e ) distribution with se = 0.1 and 0.01660.5 1.0 1.50.00.20.40.60.81.0phosphatase (King−Armstrong units) Pr( nodal involvement | phosphatase) 0.5 1.0 1.50.00.20.40.60.81.0phosphatase (King−Armstrong units) Pr( nodal involvement | phosphatase) Figure 3.16: The estimated probability of the nodal involvement in theprostate cancer patients via the fitted logistic Box-Cox model. In theleft panel, the solid curve comes from the original data, the dashedcurves come from three randomly selected simulated data sets contain-ing the multiplicative measurement error with se = 0.1 separately, andthe dotted curves come from three randomly selected simulated datasets containing the multiplicative measurement error with se = 0.01separately. In the right panel, the solid curve comes from the originaldata, the dashed curve is the average of the fitted probability from 100simulated data sets containing the multiplicative measurement errorwith se = 0.1, and the dotted curve is the average of the fitted proba-bility from 100 simulated data sets containing the multiplicative mea-surement error with se = 0.01. Note in both panel the dotted curvesare right under the solid curve.respectively. Secondly, we multiply the acid phosphatase predictor in the prostatecancer data with each set of simulated errors. In this way, we have simulated datasets with different degrees of multiplicative measurement errors, while the originalprostate cancer data play the role of data without measurement errors. For eachsimulated-error data set, we fit both the logistic Box-Cox model and the simplelogistic model, and summarize the results based on the degree of the measurementerrors. Figure 3.16 shows us the estimated probability of the nodal involvementgiven the level of acid phosphatase, using the fitted logistic Box-Cox model. We67can see that when the standard deviation of the measurement error is 0.01, it is hardto see the difference between the estimated probability based on data with and with-out measurement errors from the three randomly selected simulated data sets andfrom the average of the 100 simulated data sets. When the standard deviation ofthe measurement error distribution increases to 0.1, the difference in the estimatedprobability given the phosphatase level is apparent, but is small especially when thephosphatase is below its 95th percentile, 1.12. The change pattern of the differenceexactly agree with what we find from numerical analysis. Also based on the numer-ical results, since the phosphatase distribution is not very skewed and lˆ = 0, weexpect the bias could be reduced if there were more samples in the original data.However, there is a lot of variation across the simulated datasets mainly due toMonte Carlo errors. For the three randomly selected simulated data sets, there canbe attenuation or inflation at the large value of the acid phosphatase. As an exam-ple, when the measurement of the level of acid phosphatase is 1.0 King-Armstrongunits, we predict 5 fewer cases of nodal involvement in every 100 prostate can-cer patients than the estimated number from the original data in one scenario, 10fewer cases in another scenario, and 10 more cases in a third scenario. And whenthe measurement of the level of acid phosphatase is 1.5 King-Armstrong units, thedifference in the number of nodal involvement in the prostate cancer patients in-creases to around 10 for every 100 prostate cancer patients for the three sets ofsimulated data.3.4 DiscussionIn this chapter, we assume that the predictor, X , follows the log-normal(µ,s2).The log-scale parameter, µ , controls the plausible range of X given s . The shapeparameter, s , plays the role of a fraction of the association parameter. To explainmore on this, let us first consider the following transformation. Denote h = slandW = exp(µ/s +Z). Then we haveX (l ) = Xl 1l = s ⇥exp( µs h +Zh)1⇤h = s [Wh 1]h = sW (h).68And the model (3.1) becomeslog✓pi1 pi◆= b0 +b1sw(h)i .We can see that no matter what s is, after transformation, we can always constructa new predictor with shape parameter equal to 1 for which b1s plays the role ofthe association parameter. Based on this, we have one direct conclusion. That is,given b1, when s goes to 0, Avar(lˆ ) goes to infinity. This is because we have theAvar(hˆ) goes to infinity when s goes to 0 based on the calculation in the Appendix,and Avar(hˆ) = Avar(slˆ ) = s2Avar(lˆ ). So we know that given the associationparameter, if there is only a small variation of the predictor in log-scale, we maynot get precise estimation of the shape parameter.Our exploration of the Box-Cox type model could be naturally extended to thefractional polynomial model. A single covariate fractional polynomial of degree mcould be expressed as fm(X |x , p) = x0 + mÂj=1x jX (p j), (3.23)where m is a positive integer, p = (p1, · · · , pm) is a real-valued vector of powerswith p1 < · · · < pm and x = (x0,x1, · · · ,xm) are real-valued coefficients [44]. Alsothe round bracket notation represents the Box-Tidwell transformation,X (p j) =(X pj if p j 6= 0lnX if p j = 0.We can see that whenm= 1, it is equivalent to the Box-Cox family after we linearlytransform the coefficients. As for fitting the fractional polynomial model, there isan R contributed package, mph [39], which can work for multiple covariates giventhe degree for each covariate. As the fractional polynomials can provide quite flex-ible shapes to fit for nonlinear pattern data, Keogh et al. [26] showed the effectof measurement errors in a proportional hazard model in some specified nonlinearmodels when fitting fractional polynomial models though simulation studies. Thisis different from our discussion about the effect of measurement errors in this chap-69ter, since our discussion is within the logistic Box-Cox family model, and our focusis to look at the distortion of the parameter estimates under different degree of ran-dom errors. Similar discussion could also be made for the fractional polynomialfamily models. As there are more complex situations in epidemiologic research,[49] mentioned that splines with knots are good candidate for risk assessment,especially when response attenuates at high exposure level and epidemiologistsalso prefer a liner exposure-disease shape in the low dosage. Therefore, the splinemodel is also in our further exploration.We have explored the average predictive effect and its relationship with the co-efficient in the misspecified linear model. As an average slope between logit[E(Y |X)]and X , if the distribution of exposure is not very skewed, the average predictive ef-fect requires much less data to achieve the similar estimator precision comparedwith the shape parameter l . So, sometimes it is more worthwhile to use the aver-age predictive effect as an inferential target rather than the original parameters inthe model. Another conjecture is that the fitted coefficient from the simple logisticregression may estimate well the average predictive effect from a more complexnonlinear model. In this chapter, using the logistic Box-Cox model as an example,we see that when the distribution of exposure is not very skewed and the shape pa-rameter is close to 1, the large-sample limit of the coefficient is close to the averagepredictive effect, and the asymptotic variance of the estimated coefficient can bemuch smaller than that of the average predictive effect. Therefore, we are facinga bias-variance trade off in the model selection. In some conditions, we may pre-fer using the estimated coefficient from the simple model as an alternative to theestimated average predictive effect from a more complex model.From our investigation, the measurement errors tend to overestimate the risk atlow exposure and underestimate the risk at high exposure. In practice, the risk atlow exposure is used to determine threshold for acceptable risk. Therefore, over-estimating risk in the low range would lead to unjustifiably low exposure thresh-olds/regulatory limits. So for non-precisely measured data, appropriate adjustmentis recommended. A convenient way to do this is though a Bayesian hierarchicalmodel. Gustafson [22] and Carroll et al. [9] showed detailed steps on implementingthis in nonlinear models using Bayesian methods. As a byproduct, the average pre-dictive effect could be calculated based on the fitted parameters from the Bayesian70hierarchical model. In this situation we may be less interested in the coefficientin the simple logistic model due to its bias caused by measurement error. Also ifwe fit a misspecified Bayesian hierarchical model with the simple logistic regres-sion as one level, the computational complexity may be very similar to using thenonlinear model as one level.71Chapter 4Two-step Approach in MatchingProcess for LongitudinalMatching Covariates in thePresence of Measurement Error4.1 BackgroundEpidemiologic studies of exposure-disease relationships are often observational.We would like to compare the risks of diseases between the exposed and non-exposed groups, but due to lack of randomization in the treatment/exposure assign-ment, there are potential biases in estimating the causal effect, due to the imbalanceof the covariates distributions between the treatment and control groups. Matchedsampling is a common technique to achieve balance of the observed covariates be-tween the two groups. Matching has been discussed extensively in the literature,especially when it is time-independent. A time-independent matching happens ifthe treatment units receive the treatment around the same time, and, therefore, weonly consider two time points in the design and analysis procedure: (1), the base-line when the matching covariates are measured before any treatment initiation;and (2), after the treatment when the outcome is evaluated. For time-independent72matching, Stuart [52] provided “a structure for thinking about the matching meth-ods and guidance on their use, coalescing the existing research (both old and new)and providing a summary of where the literature on matching methods is now andwhere it should be headed.”In the meantime, in some observational studies, the study units receive thetreatment at different time points after baseline. In this scenario, a time-dependentmatching may be preferred, since if we select a control to match to a treatmentunit, we would like to use the value of covariates measured right before the treat-ment initiation rather than values measured at baseline. The risk-set matching is anexample of the time-dependent matching [34, 37]. In a risk-set matching, a studyunit who receives treatment at time t will be matched by another study unit, se-lected from the set of study units who have not received treatment by time t, butmay receive treatment later. In this chapter, we propose a different approach forthe time-dependent matching. We would like to model the trajectory from the re-peated or longitudinal measurements of the matching variables before the initiationof the treatment, then the matched controls are selected from the units in the controlgroup through the distance measure estimated based on the trajectory. There aretwo major benefits using the trajectory model before matching: (1), the measure-ment errors in the observed matching variables can be automatically adjusted forif we use estimated true values from the trajectory model to do the matching; and(2), the distance can be more accurately calculated, since the matching quantitiesplugged into the distance are more accurate and also we can have a better estimateof the variance-covariance matrix estimated from the trajectory model, which canespecially improve the accuracy of the Mahalanobis distance, since the variance-covariance matrix is central to this distance measure. More details regarding thetwo benefits are described below.First of all, regarding measurement errors in matched sampling, Steiner et al.[51] demonstrated that “unreliability of measurement can degrade the ability ofpropensity scores to reduce bias.” Althauser and Rubin [2], McShane et al. [38]provided ways to adjust for measurement errors in the linear regression and condi-tional logistic regression for matched sampling. As for the time-dependent match-ing, denote the measured matching variables for the ith subject as Z⇤i and the true73values as Zi. A common trajectory model can be written asZ⇤i (t) = f (t)+Qibi(t)+ ei(t), (4.1)where Zi(t) = f (t)+Qibi(t), f (t) is a function to describe the population averageover time, Qi is a design matrix for the random effects for the i-th subject, bi(t) isthe vector of random effects for the i-th subject, and ei(t) is the vector of randomerrors for the i-th subject. The estimated true variables, Zˆ i(t), from the model willbe used for matching. In this way, we can automatically adjust for the measurementerrors.Secondly, as for the distance measures in the time-dependent matching, Lu [37]used the propensity scores to measure the closeness between two units, which aredefined as the hazard component based on the Cox proportional model when theevent is defined as time to treatment. Other than time-dependent propensity scorematching, Li et al. [34] showed us a way to define the Mahalanobis distance forthe time-dependent matching, using an observational study of interstitial cystitisas an example. For the interstitial cystitis data, patients were evaluated at baselineand at intervals of approximately every 3 months. Pain, urgency, and nocturnal fre-quency of voiding were measured at baseline and every three months afterwards.After baseline, some patients receive treatment at different time points. They cat-egorize the measurements as baseline measurement, and later measurements foreach patient and pool the data together. That is, they have six quantities used tocalculate the variance-covariance matrix, which contains baseline and current pain,urgency, and frequency measurements. We can see that for a control unit, all mea-surements after the baseline are treated as independent records in calculation of thevariance-covariance matrix while the correlation between measurements for a pa-tient working as control has been ignored. This may introduce biased estimates ofthe distances since it allows replicated counts of samples in estimation of variance-covariance matrices. In our proposed method, the trajectory model can providebetter estimates of the variance-covariance matrix for the Mahalanobis distance.Based on (4.1), we have Var(Zi(t)) = QiVar(bi(t))QTi . Then the Mahalanobis dis-74tance is defined asd(Zi(t),Z j(t)) = (Zi(t)Z j(t))TSi(t)1(Zi(t)Z j(t)),where Si(t) = Var(Zi(t)), the i-th unit receives treatment at time t and the j-th unitis in the control group.Based on the above benefits, we propose to fit the trajectory model beforematching and call this approach the two-step matching approach. In section 4.2,we introduce the Baltimore Experience Corps study, in which the two-step methodcould be applied. In section 4.3, we describe the trajectory model designed for thisstudy. The matching algorithm is given in section 4.4. In section 4.5, we define thetreatment effect for the time-dependent matching. In section 4.6, we conduct sim-ulation studies for different trajectory models and compare the matching qualityand the bias of treatment effect estimates between the two-step matching approachand the traditional matching approach. The mathematical calculation on the biasintroduced by measurement error in the distance matching is described in section4.7. And we close our discussion in section 4.8.4.2 Baltimore Experience Corps StudyThe Baltimore Experience Corps study (EC study), USA [16] concerns a seniorvolunteer program, which places people 60 and older in meaningful roles in thepublic elementary schools, with the roles designed to have high impact on chil-dren’s outcomes, as well as the health and functional status of the older adults.Along with this program, an observational study, the Experience Corps project 2,was conducted to determine if this new program improves the educational out-comes of children. During the course of the study from Fall 2006 to Spring 2011,a total of 27 Baltimore city public schools participated in the EC program, includ-ing 16 that received treatment since the 2006/2007 academic year, 4 that receivedtreatment since the 2007/2008 academic year, 2 that received treatment since the2009/2010 academic year, and 5 that received treatment in the 2010/2011 aca-demic year. EC schools that meet the certain criteria were selected by a committeeof investigators and GHCC in consultation with the city school system [16]. Foran EC school, the senior volunteers started in September of the year of treatment75initiation, to help students in their academic learning. In following March, studentstook the tests which are used to evaluate their academic performance. The StanfordAchievement Tests Series Tenth Edition (Stanford 10) are used for children undergrade 3, and the Maryland School Assessment (MSA) tests are used for Grade 3and above, in reading and math, to meet the testing requirements of the federalNo Child Left Behind Act. The outcome variables of interest are the school-levelscores at grade 3, and then the school scores at grade 2 in previous year are treatedas matching variables. Since we assume that most students are promoted to thenext grade and most of them maintain their academic performance, we can bal-ance school-level student academic performance level between the treatment andthe control groups based on their performance in the previous year.In the educational data setting, the measurement errors can be interpreted as allkinds of deviations from their underlying truth, which is different from the tradi-tional sense of errors made in the process of measurement using some instruments.For example, the school-level test scores can be estimated from the average of in-dividual student scores. But the student individual scores would be affected by stu-dent retention, transferring, and/or teacher switching. So we assume the estimatedschool-level scores based on the average of individual scores are the underlyingtrue school-level scores plus some random error.4.3 Statistical ModelingIn randomized experiments, before treatment initiation ideally the joint distributionof covariates in the treatment group and the joint distribution of the covariates inthe control group are close to identical due to the random assignment. However, inobservational studies, these distributions may be different. So we denote quantitiesdifferently in the treatment group and control group. Let P1 be the distributionof covariates in the treatment schools and P2 be the distribution of covariates inthe control schools. We have N1 randomly selected treatment schools from P1 andN2 randomly selected control schools from P2 with N2 > N1. The matching vectorZ = (z1,z2, · · · ,zK)T contains testing scores for K topics. As an example, the topicscould be mathematics, reading, writing, and comprehension. Assume that giventhe group, the random-effects vector of each school in that group shares the same76variance-covariance matrix of the matching covariates at a given time. And we candefine the distance between the treatment school i and the control school j asdi, j = (Zi(t)Z j(t))T S11 (t)(Zi(t)Z j(t)) , (4.2)where S1(t) = Var(Zi(t)), and the treatment school i receives the treatment rightafter year t. The trajectory model for treatment schools can be described aszik(t) = a1(t)+a1k(t)+b1i(t), (4.3)where a1(t) is the overall mean for treatment schools, a1k(t) is the random topiceffect for the kth topic at year t, b1i(t) is the school random effect for the ith schoolat year t, i = 1,2, · · · ,N1,k = 1,2, · · · ,K. Note that the model is only applied to thetreatment schools before the treatment initiation. For control schools, we havez jk(t) = a2(t)+a2k(t)+b2 j(t), (4.4)where a2(t) is the overall mean for control schools, a2k(t) is the random topic effectfor the kth topic at year t, b2 j(t) is the school random effect for the jth school atyear t, j = 1,2, · · · ,N2,k = 1,2, · · · ,K. For the control schools, we can model themduring the entire study. Assume the random effects follow normal distributionscentered at 0 with Var(a1k(t)) = s2a1(t), Var(a2k(t)) = s2a2(t), Var(b1i(t)) = s2b1(t),and Var(b2 j(t)) = s2b2(t). Assuming the classical type random error, we havez⇤k(t) = zk(t)+ ek(t). (4.5)We assume that within each group, the random effects and the random errors aremutually independent, and the random errors follow normal distributions centeredat 0 with Var(eik(t)) =Var(e jk(t)) = s2e (t). Based on our statistical model, we pro-pose a two-step approach for the matching. That is, fit a mixed effects model toall of covariates before treatment initiation. That is, we use the covariates duringthe entire study for control schools and the covariates before treatment initiation fortreatment schools. And then use estimated true school-level scores to do the match-ing. We will compare this approach with direct matching using the raw scores.774.4 Matching AlgorithmDue to the longitudinal feature of time-dependent matching, sequential matchingis feasible. For sequential matching, we start matching treatment units who receivetreatment at earliest time point and after that we match treatment units who receivetreatment at the second earliest time points while the previously matched controlsare removed from the control pool. In this way, we sequentially match until the lasttime point of the study. This chronological matching procedure is called sequentialmatching. However, since we conduct matching without replacement, it may in-troduce poorly matched pairs in later years, as the treatment schools that receivedtreatment at earlier times have the priority in choosing better matched controls.So we prefer the simultaneous matching approach, which aims to consider all thetreatment schools simultaneously. In the simultaneous matching, there are also twooptions: the “greedy” nearest-available matching or the optimal matching method.In the nearest-available matching, we randomly order the treatment schools, andchoose for each treatment school in turn the nearest-available control school avail-able as a match [14, 45]. For this matching strategy, the matching quality maydepend on the order of the treatment units to match for. As an example, when thereare not many controls available, the competition of getting a well-matched controlfor a treatment unit is extensive [21]. The treatment unit can get better matchedcontrols if it is matched earlier. A complete matching is defined as each treatmentschool being matched with a control school. In this work, we particularly assumethe complete matching is formed based on matching without replacement. For theoptimal matching, we look at all of the possible complete matchings, and choosethe one which minimize the global distance. The optimal matching is guaranteedto find the global optimum through finding the path in a minimum cost flow in thenetwork, which was discussed in [21, 24, 42] and is implemented in R packageoptimatch [39].In this chapter, we conduct the optimal pair-matching rather than the nearest-available matching. In constructing a complete matching, once a control schoolis selected to match for a treatment school at a particular year, it will not be con-sidered as a candidate for the remaining unmatched treatment schools. The global78distance is defined asD =N1Âi=1di, j, (4.6)where di, j is defined in equation (4.2). We choose the complete matching whichminimizes the global distance.4.5 Treatment EffectThe causal treatment effect can be defined based on the well known counterfactualmodel [47]. Denote Y (1)(t) as the response vector of a school after receivingthe treatment at year t, and Y (2)(t) as the response of the same school if it wereassigned to the control group at the same year. Let T represents the treatmentassignment with T = 1 indicating the treatment assignment and T = 2 indicatingthe control assignment. We would like to match for Z(t1), which are measuredright before the treatment. Assume the treatment assignment is strongly ignorable[43]. That is, the treatment assignment is independent of the potential outcomesconditional on the matching variables, which can be denoted below.(Y (1)(t),Y (2)(t)) |= T |(Z(t1) = z(t1)), for all z(t1), (4.7)and 0< Pr(T = 1|Z(t1)= z(t1))< 1, for all z(t1). Denote E(Y (1)(t)|Z(t1)) = R1(Z(t 1)) and E(Y (2)(t)|Z(t 1)) = R2(Z(t 1)), where R1 is the ex-pected treatment surface as a function of Z(t 1) and R2 is the expected controlsurface as a function of Z(t1). Then the treatment effect conditional on Z(t1)can be written as R1(Z(t 1))R2(Z(t 1)). Assume the parallel response sur-faces with the same distance at all years. Then we can further writeRq(Z(t1)) = t q +V (Z(t1)),where V (Z(t 1)) is a function to represent the surface parallel to the treatmentand control surfaces of the matching vector, and q = 1,2. Then the treatment effectcan be defined as t = E1 [R1 (Z(t1))R2 (Z(t1))] ,79where E1(.) means expectation based on the distribution in the treatment group.The estimator of t we are going to use istˆ = 1N1N1Âi=1[Y i(1)(t)Y ji(2)(t)] , (4.8)where t is the one year before the treatment time for the treatment school i, the con-trol school ji is the matched to the treatment school i, and the complete matchingminimizes the global distance.4.6 Simulation StudyTo compare the matching quality and the accuracy of the treatment effect esti-mation between the two-step approach and the traditional matching approach, wecarry out the following simulation studies.There are 200 schools enrolled in a five-year study, and 24 schools are pre-selected as treatment schools. Without loss of generality, school 1 to 6 receivetreatment at year 1; school 7 to 12 receive treatment at year 2; school 13 to 18receive treatment at year 3; and school 19 to 24 receive treatment at year 4. Theremaining 176 control schools are never treated during the entire study. There arethree test scores recorded from tests in reading, vocabulary, and writing at grade2. The trajectory models of the treatment schools and the control schools for theschool-level scores at grade 2 are described separately below. The trajectory modelfor the treatment schools isz⇤ik(t) = a1k(t)+b1i(t)+ e1ik(t), (4.9)where i = 1,2, · · · ,24, k = 1,2,3, Var(b1i(t)) = s2b1(t), and Var(e1ik(t)) = s2e (t);the trajectory model for the control schools isz⇤jk(t) = a2k(t)+b2 j(t)+ e2 jk(t), (4.10)where j = 25,26, · · · ,200, k = 1,2, and 3, Var(b2 j(t)) = s2b2(t), and Var(e2 jk(t)) =s2e (t). In this model, we assume no topic random effects. We can see that a1k(t)and a2k(t) represent the longitudinal fixed effects of the k-th topic for treatment80schools and control schools seperately, b1i(t) and b2 j(t) represent the longitudinalrandom effects of the i-th treatment school and of the j-th control school separately,e1ik(t) and e1ik(t) are random errors. The response is just the addition of all thetopic scores after receiving the treatment for a treatment school or the additioncalculated separately for every year for a control school.In our simulation settings, we let the fixed effects be constant over years witha1k(t) = ak +h and a2k(t) = ak h , where a1 = a2 = a3 = 78 and h vary as0,0.5,1 and 2. We also let the random effects be independent of time with s2b2 = 1,s2b1 varies as 0.5,1, and 2, and s2e varies as 1/9,1 and 4. The choices of s2b1 ands2b2 are based on some discussion of the ratio of the variances between treatmentand control distributions [46].We simulate 1000 data sets in each scenario. For each data set, we fit only onetrajectory model with school intercept random effect instead of fitting two modelsfor treatment group and control group separately since we would like to keep thetrajectory model simple and also the sample size for treatment school is usuallysmall. After that we conduct 1 1 optimal matching based on the estimated trueschool scores. We also conduct 1 1 optimal matching directly from the originalschool scores. We compare the global distances of the final complete matchingfrom the two methods and also the estimated treatment effects from both methods.Figures 4.1, 4.2, 4.3, and 4.4 show us that in all of the considered settings theglobal distances from the two-step approach are smaller than those from the directmatching, which indicate the two-step matching can improve the matching qualityin the presence of measurement errors. We say a setting is matching favored if thefinal selected complete matching is close to perfect matching, which has the globaldistance equal to 0. For matching favored settings, we can see apparent improve-ment on the global distances. For example, when h = 0, it would be easier to finda better matched control for treatment schools since the values of their matchingvariables are close to each other. In these setting, we can see that no matter whatthe degree of the measurement errors is, the global distances can be decreased byusing the two-step approach. While when h = 2, the values of matching variablesbetween the treatment and control groups are far away from each other given theirstandard deviations. Under those conditions, when the degree of measurement er-81Figure 4.1: Boxplot of the global distance from different conditions. A: box-plot of the optimal true global distances; B1 : boxplot of the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 : boxplot of the optimal global distances based on di-rect matching under measurement errors with s2e = 19 ; C1 : boxplot ofthe optimal global distances based on two-step approach under measure-ment errors with s2e = 1; C2 : boxplot of the optimal global distancesbased on direct matching under measurement errors with s2e = 1; D1 :boxplot of the optimal global distances based on two-step approach un-der measurement errors with s2e = 4; and D2 : boxplot of the optimalglobal distances based on direct matching under measurement errorswith s2e = 4.82Figure 4.2: Boxplot of the global distance from different conditions. A: box-plot of the optimal true global distances; B1 : boxplot of the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 : boxplot of the optimal global distances based on di-rect matching under measurement errors with s2e = 19 ; C1 : boxplot ofthe optimal global distances based on two-step approach under measure-ment errors with s2e = 1; C2 : boxplot of the optimal global distancesbased on direct matching under measurement errors with s2e = 1; D1 :boxplot of the optimal global distances based on two-step approach un-der measurement errors with s2e = 4; and D2 : boxplot of the optimalglobal distances based on direct matching under measurement errorswith s2e = 4.83Figure 4.3: Boxplot of the global distance from different conditions. A: box-plot of the optimal true global distances; B1 : boxplot of the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 : boxplot of the optimal global distances based on di-rect matching under measurement errors with s2e = 19 ; C1 : boxplot ofthe optimal global distances based on two-step approach under measure-ment errors with s2e = 1; C2 : boxplot of the optimal global distancesbased on direct matching under measurement errors with s2e = 1; D1 :boxplot of the optimal global distances based on two-step approach un-der measurement errors with s2e = 4; and D2 : boxplot of the optimalglobal distances based on direct matching under measurement errorswith s2e = 4.84Figure 4.4: Boxplot of the global distance from different conditions. A: box-plot of the optimal true global distances; B1 : boxplot of the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 : boxplot of the optimal global distances based on di-rect matching under measurement errors with s2e = 19 ; C1 : boxplot ofthe optimal global distances based on two-step approach under measure-ment errors with s2e = 1; C2 : boxplot of the optimal global distancesbased on direct matching under measurement errors with s2e = 1; D1 :boxplot of the optimal global distances based on two-step approach un-der measurement errors with s2e = 4; and D2 : boxplot of the optimalglobal distances based on direct matching under measurement errorswith s2e = 4.85rors is weak, we can only see a little bit of improvement. And the improvementbecomes stronger when the degree of measurement errors increases. Other factorswhich can reflect whether the matching condition is favored or not are the stan-dard deviations of the school random effect. Given fixed h and fixed variance ofthe control school, if the ratio of the variance of the treatment schools to that ofthe control schools is smaller, it is relatively easier to get a better matching result.Therefore, we can see that when the ratio of s2b1 to s2b2 decreases, we can see moreapparent improvement regarding the global distance.Figures 4.5, 4.6, 4.7, and 4.8 show us that under matching favored conditionsthe estimated treatment effect is unbiased and the two-step approach can provideestimators with less variance. However, when the matching is under unfavoredconditions, we can get biased estimates even from matching based on true valuesof the matching variables.4.7 Matching Quality and Measurement ErrorUsing Mahalanobis distance matching methods, if the matching covariates containmeasurement errors, we can get systematically biased results. For the matchingbased on the measured matching covariates, the distance of a matched pair can bewritten asd⇤ = (Z⇤i Z⇤ji)TS⇤11 (Z⇤i Z⇤ji), (4.11)where S⇤1 is the variance-covariance matrix of the distribution of the measuredmatching covariates in the treatment group. If the matching is perfect, we haved⇤ = 0, and equivalently we have Z⇤i = Z⇤ji . Under the assumption of classicalmeasurement errors (4.5), we haveZ⇤i = Zi +Ei, and Z⇤ji = Z ji +E ji . (4.12)Therefore, the true distance of this matched pair isd = (ZiZ ji)TS11 (ZiZ ji) = (E ji Ei)TS11 (EiE ji), (4.13)86Figure 4.5: Boxplot of the estimated treatment effects from different condi-tions. A is from the optimal true global distances; B1 is from the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 is from the optimal global distances based on directmatching under measurement errors with s2e = 19 ; C1 is from the op-timal global distances based on two-step approach under measurementerrors with s2e = 1; C2 is from the optimal global distances based ondirect matching under measurement errors with s2e = 1; D1 is from theoptimal global distances based on two-step approach under measure-ment errors with s2e = 4; and D2 is from the optimal global distancesbased on direct matching under measurement errors with s2e = 4.87Figure 4.6: Boxplot of the estimated treatment effects from different condi-tions. A is from the optimal true global distances; B1 is from the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 is from the optimal global distances based on directmatching under measurement errors with s2e = 19 ; C1 is from the op-timal global distances based on two-step approach under measurementerrors with s2e = 1; C2 is from the optimal global distances based ondirect matching under measurement errors with s2e = 1; D1 is from theoptimal global distances based on two-step approach under measure-ment errors with s2e = 4; and D2 is from the optimal global distancesbased on direct matching under measurement errors with s2e = 4.88Figure 4.7: Boxplot of the estimated treatment effects from different condi-tions. A is from the optimal true global distances; B1 is from the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 is from the optimal global distances based on directmatching under measurement errors with s2e = 19 ; C1 is from the op-timal global distances based on two-step approach under measurementerrors with s2e = 1; C2 is from the optimal global distances based ondirect matching under measurement errors with s2e = 1; D1 is from theoptimal global distances based on two-step approach under measure-ment errors with s2e = 4; and D2 is from the optimal global distancesbased on direct matching under measurement errors with s2e = 4.89Figure 4.8: Boxplot of the estimated treatment effects from different condi-tions. A is from the optimal true global distances; B1 is from the optimalglobal distances based on two-step approach under measurement errorswith s2e = 19 ; B2 is from the optimal global distances based on directmatching under measurement errors with s2e = 19 ; C1 is from the op-timal global distances based on two-step approach under measurementerrors with s2e = 1; C2 is from the optimal global distances based ondirect matching under measurement errors with s2e = 1; D1 is from theoptimal global distances based on two-step approach under measure-ment errors with s2e = 4; and D2 is from the optimal global distancesbased on direct matching under measurement errors with s2e = 4.90where S1 is the variance-covariance matrix of the distribution of the true matchingcovariates in the treatment group, and E = (e1,e2, · · · ,ek) is the vector of randomerrors. Based on the expected value of matrix products shown in [25], for theperfectly matched pair under measurement errors, we have that the expectation ofthe distance becomesE(d|Z⇤i = Z⇤ji) = Eh(E ji Ei)T S11 (EiE ji) |Z⇤i = Z⇤jii= E⇥(E ji Ei) |Z⇤i = Z⇤ji⇤T S11 E⇥(EiE ji) |Z⇤i = Z⇤ji⇤+tr2VarEi|Z⇤i = Z⇤jiS11, (4.14)where tr(.) is the trace function, which calculates the sum of the diagonal of amatrix. If the random errors in both groups are all independent with each other, wehaveVarEi|Z⇤i = Z⇤ji= s2e I,where I is the identity matrix. Therefore, based on equation (4.14) we haveE(d|Z⇤i = Z⇤ji) = 2s2e trS11 . (4.15)As for a perfect complete matching based on the measured matching covariates,given independent random errors between treatment and control groups, the biasof the global distance is 2N1s2e tr(S11 ), which vanishes only when s2e = 0, i.e.,there are no measurement errors. If the matching conditions are not preferable, thebias may be much larger than the bias from the perfect matching.4.8 DiscussionBased on the simulation results and mathematical calculation, we can see that com-pared with the traditional matching approach the two-step approach can improvethe matching quality in the cross-sectional setting and/or when there is no longitu-dinal trend in the data, especially under matching favored conditions. We proposeto conduct more simulation studies in the future, focused on the settings wherethere are the various longitudinal trends in the matching covariates data, and/orvarious types of variance-covariance structure in the matching covariates. Also91more theoretical work is under way for longitudinal data.As for the estimate of the treatment effect, we use the average of the differ-ence of outcome within matched pair at specified time point. For time-independentmatching, Rubin [46] compared five different estimates of the treatment effect in-cluding the average of difference of outcome across matched pair and four otherregression adjusted estimates. For time-dependent matching, we would also liketo consider model adjusted estimates while the linear regression model may be fit-ted at each time point or we would model the matched pair using a mixed-effectsmodel and then do the adjustment based on the model output.Finally we would like to apply the two-step method to the EC study data orother real data with similar structure in the future.92Chapter 5DiscussionMy work on issues in the studies of exposure-disease relationships can go furtherand deeper.First of all, in my thesis, focus has only been given to binary outcomes. How-ever, in many studies, time-to-disease is also available. Therefore, I would liketo modify my models to accommodate time-to-event outcomes as well. As exam-ples, in Chapter 2, I can modify the logistic regression sub-model in the Bayesianhierarchical model into a Cox proportional hazard model, if the disease outcomeis a time-to-event variable. And in Chapter 3, it would be interesting to discussdose-response relationships in the Cox regression setting.Secondly, I can work more on statistical computation. In Chapter 2, for theBayesian hierarchical model I built, I developed the specified MCMC algorithmfor this type of model. This is much faster than using WinBUGs. To share thealgorithm with other researchers, I would like to formalize the algorithm into a Rpackage. This could benefit the researchers, who would like to run simulationsand/or do analysis with extra large sample sizes, as WinBUGs may take a verylong time to finish. For Chapter 3, for the model fitting, I used profile likelihood,which is commonly used for nonlinear models. I can try other numerical analysismethods to speed the program, though currently running time is not an issue.Finally, in the future, more attention should be given to the study design. Chap-ters 2 and 3 mainly focus on prospective cohort studies. I address some problemsin observational studies in Chapter 4. There is much work left for observational93studies if we use other matching strategies and consider various longitudinal pat-terns.94Bibliography[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions,10th printing with corrections. Dover, 1972. ISBN 9780486612720. !pages 44[2] R. P. Althauser and D. Rubin. Measurement Error and Regression to theMean in Matched Samples. Social Forces, 50(2):206, Dec. 1971. ISSN00377732. doi:10.2307/2576938. URLhttp://www.jstor.org/stable/2576938?origin=crossref. ! pages 73[3] R. Bartle. The elements of integration and Lebesgue measure. WileyInterscience, 1995. ! pages 45[4] J. Berkson. Are there two regressions? Journal of the American StatisticalAssociation, 45(250):164–180, 1950. ! pages 5[5] G. Box and D. Cox. An Analysis of Transformations. Journal of the RoyalStatistical Society. Series B ( . . . , 26(2):211–252, 1964. URLhttp://www.jstor.org/stable/10.2307/2984418. ! pages 37[6] B. W. Brown. Prediction Analyses for Binary Data. In: BiostatisticsCasebook (eds. R. J. Miller, B. Efron, B. W. Brown and L. E. Moses),. Wiley:New York, 1980. ! pages 40, 58, 66[7] B. P. Carlin and T. A. Louis. Bayesian Methods for Data Analysis. Chapman& Hall CRC, 2008. ! pages 9[8] R. J. Carroll. Variances are not always nuisance parameters. Biometrics, 59:211–220, 2003. ! pages 28[9] R. J. Carroll, D. Ruppert, L. A. Stefanski, and C. M. Crainiceanu.Measurement error in nonlinear models: a modern perspective, secondedition. CRC press, 2006. ! pages 38, 7095[10] G. Casella and R. L. Berger. Statistical Inference. Wadsworth &Brooks/Cole, 2001. ! pages 50[11] D. Collett. Modelling Binary Data. London: Chapman and Hall, 1991. !pages 40[12] K. S. Crump. The effect of random error in exposure measurement upon theshape of the exposure response. Dose-response : a publication ofInternational Hormesis Society, 3(4):456–64, Jan. 2005. ISSN 1559-3258.doi:10.2203/dose-response.003.04.002. URLhttp://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2477201&tool=pmcentrez&rendertype=abstract. ! pages 38[13] R. Doll. Weak associations in epidemiology: importance, detection, andinterpretation. Journal of Epidemiology - Japan EpidemiologicalAssociation, 6(4):S11–S20, 1996. ! pages 28[14] Donald B . Rubin. Bias reduction using Mahalanobis-metric matching.Biometrics, 36(2):293–298, 1980. URLhttp://www.jstor.org/stable/10.2307/2529981. ! pages 78[15] D. A. Freedman. On the so-called “Huber sandwich estimator” and “robuststandard errors”. The American Statistician, 60(4):299–302, 2006.doi:10.1198/000313006X152207. ! pages 55[16] L. P. Fried, M. C. Carlson, S. McGill, T. Seeman, Q.-L. Xue, K. Frick,E. Tan, E. K. Tanner, J. Barron, C. Frangakis, R. Piferi, I. Martinez,T. Gruenewald, B. K. Martin, L. Berry-Vaughn, J. Stewart, K. Dickersin,P. R. Willging, and G. W. Rebok. Experience Corps: A dual trial to promotethe health of older adults and children’s academic success. Contemporaryclinical trials, pages 1–13, May 2013. ISSN 1559-2030.doi:10.1016/j.cct.2013.05.003. URLhttp://www.ncbi.nlm.nih.gov/pubmed/23680986. ! pages 75[17] A. Gelman. Prior distributions for variance parameters in hierarchicalmodels. Bayesian Analysis, 1(3):515–533, 2006. ! pages 31[18] A. Gelman and I. Pardoe. Average predictive comparisons for models withnonlinearity, interactions, and variance components. SociologicalMethodology, 37:23–51, 2007. ! pages 37[19] A. Gelman, A. Jakulin, M. G. Pittau, and S. Y. Su. A weakly informativedefault prior distribution for logistic and other regression models. TheAnnals of Applied Statistics, 2(4):1360–1383, 2008. ! pages 3296[20] S. Greenland. Dose-response and trend analysis in epidemiology:alternatives to categorical analysis. Epidemiology, pages 356–365, 1995.URL http://www.jstor.org/stable/3702080. ! pages 36, 37[21] X. Gu and P. Rosenbaum. Comparison of multivariate matching methods:Structures, distances, and algorithms. Journal of Computational andGraphical Statistics, 2(4):405–420, 1993. URLhttp://www.tandfonline.com/doi/abs/10.1080/10618600.1993.10474623. !pages 78[22] P. Gustafson. Measurement Error and Misclassification in Statistics andEpidemiology: Impacts and Bayesian Adjustments. Chapman and Hall/CRCPress, 2003. ! pages 7, 70[23] P. Gustafson. On robustness and model flexibility in survival analysis:Transformed hazard models and average effects. Biometrics, 63:69–77,2007. ! pages 38[24] B. B. Hansen and S. O. Klopfer. Optimal full matching and related designsvia network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006. ! pages 78[25] D. Kendrick. Stochastic Control for Economic Models. McGraw-Hill, 1981.! pages 91[26] R. H. Keogh, A. D. Strawbridge, and I. R. White. Effects of ClassicalExposure Measurement Error on the Shape of Exposure-DiseaseAssociations. Epidemiologic Methods, 1(1), Jan. 2012. ISSN 2161-962X.doi:10.1515/2161-962X.1007. URL http://www.degruyter.com/view/j/em.2012.1.issue-1/2161-962X.1007/2161-962X.1007.xml. ! pages 69[27] H.-M. Kim and I. Burstyn. Bayesian method for improving logisticregression estimates under group-based exposure assessment with additivemeasurement errors. Archives of Environmental & Occupational Health, 64(4):261–265, 2009. ! pages 6, 11, 21, 23, 32[28] H.-M. Kim, Y. Yasui, and I. Burstyn. Attenuation in risk estimates in logisticand cox proportional-hazards models due to group-based exposureassessment strategy. The Annals of Occupational Hygiene, 50(6):623–635,2006. ! pages 6, 11, 28, 32[29] H.-M. Kim, D. Richardson, D. Loomis, M. V. Tongeren, and I. Burstyn.Bias in the estimation of exposure effects with individual- or group-based97exposure assessment. Journal of Exposure Science and EnvironmentalEpidemiology, 21(2):212–221, 2011. ! pages 6, 28[30] H. Kromhout and R. Vermeulen. Temporal, personal and spatial variabilityin dermal exposure. The Annals of Occupational Hygiene, 45(4):257–273,2001. ! pages 11[31] H. Kromhout, E. Symanski, and S. M. Rappaport. A comprehensiveevaluation of within- and between-worker components of occupationalexposure to chemical agents. The Annals of Occupational Hygiene, 37(3):253–270, 1993. ! pages 11[32] H. Kromhout, D. P. Loomis, G. J. Mihlan, L. A. Peipins, R. C. Kleckner,R. Iriye, and D. A. Savitz. Assessment and grouping of occupationalmagnetic field exposure in five electric utility companies. ScandinavianJournal of Work, Environment & Health, 21:43–50, 1995. ! pages 10, 34[33] H. Ku¨chenhoff and R. J. Carroll. Segmented regression with errors inpredictors: semi-parametric and parametric methods. Statistics in medicine,16(1-3):169–88, 1997. ISSN 0277-6715. URLhttp://www.ncbi.nlm.nih.gov/pubmed/9004390. ! pages 38[34] Y. P. Li, K. J. Propert, and R. R. Rosenbaum. Balanced Risk Set Matching.Journal of the American Statistical Association, 96(455):870 – 882, 2001.! pages 73, 74[35] R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data. JohnWiley & Sons, Inc., 2002. ! pages 6, 7, 11[36] J. Liu and P. Gustafson. On Average Predictive Comparisons andInteractions. International Statistical Review, 76(3):419–432, Dec. 2008.ISSN 03067734. doi:10.1111/j.1751-5823.2008.00056.x. URLhttp://doi.wiley.com/10.1111/j.1751-5823.2008.00056.x. ! pages 38[37] B. Lu. Propensity score matching with time-dependent covariates.Biometrics, 61(3):721 – 728, 2005. doi:10.1111/j.1541-0420.2005.00356.x.URL http://www.ncbi.nlm.nih.gov/pubmed/16135023. ! pages 73, 74[38] L. M. McShane, D. N. Midthune, J. F. Dorgan, L. S. Freedman, and R. J.Carroll. Covariate measurement error adjustment for matched case-controlstudies. Biometrics, 57(1):62–73, Mar. 2001. ISSN 0006-341X. URLhttp://www.ncbi.nlm.nih.gov/pubmed/11252619. ! pages 7398[39] R Core Team. R: A Language and Environment for Statistical Computing. RFoundation for Statistical Computing, Vienna, Austria, 2013. URLhttp://www.R-project.org. ! pages 57, 69, 78[40] D. B. Richardson and S. Wing. Leukemia mortality among workers at thesavannah river site. American Journal of Epidemiology, 166(9):1015–1022,2007. ! pages 24, 25, 27, 28[41] D. B. Richardson, S. Wing, and R. D. Daniels. Evaluation of externalradiation dosimetry records at the savannah river site, 1951-1989. Journal ofExposure Science and Environmental Epidemiology (2006), 1C12, pages1–12, 2006. ! pages 24[42] P. Rosenbaum. Optimal Matching for Observational Studies. Journal of theAmerican Statistical Association, 84(408):1024–1032, 1989. URLhttp://www.tandfonline.com/doi/abs/10.1080/01621459.1989.10478868. !pages 78[43] P. Rosenbaum and D. Rubin. The Central Role of the Propensity Score inObservational Studies for Causal Effects. Biometrika, 70(1):41–55, 1983.URL http://biomet.oxfordjournals.org/content/70/1/41.short. ! pages 79[44] P. Royston and D. Altman. Regression using fractional polynomials ofcontinuous covariates: parsimonious parametric modelling. AppliedStatistics, 43(3):429–467, 1994. URLhttp://www.jstor.org/stable/10.2307/2986270. ! pages 37, 69[45] D. Rubin. Matching to Remove Bias in Observational Studies. Biometrics,29(1):159–183, 1973a. URL http://www.jstor.org/stable/10.2307/2529684.! pages 78[46] D. Rubin. The Use of Matched Sampling and Regression Adjustment toRemove Bias in Observational Studies. Biometrics, 29(1):185–203, 1973b.URL http://www.jstor.org/stable/10.2307/2529685. ! pages 81, 92[47] D. Rubin. Estimating causal effects of treatments in randomized andnonrandomized studies. Journal of educational Psychology, 66(5):688–701,1974. URL http://psycnet.apa.org/journals/edu/66/5/688/. ! pages 79[48] K. Steenland and J. a. Deddens. A practical guide to dose-response analysesand risk assessment in occupational epidemiology. Epidemiology(Cambridge, Mass.), 15(1):63–70, Jan. 2004. ISSN 1044-3983.doi:10.1097/01.ede.0000100287.45004.e7. URLhttp://www.ncbi.nlm.nih.gov/pubmed/14712148. ! pages 3799[49] K. Steenland, R. Seals, M. Klein, J. Jinot, and H. D. Kahn. Risk estimationwith epidemiologic data when response attenuates at high-exposure levels.Environmental health perspectives, 119(6):831–7, June 2011. ISSN1552-9924. doi:10.1289/ehp.1002521. URLhttp://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3114819&tool=pmcentrez&rendertype=abstract. ! pages 70[50] L. Stefanski and R. Carroll. Covariate measurement error in logisticregression. The Annals of Statistics, 13(4):1335–1351, 1985. URLhttp://projecteuclid.org/euclid.aos/1176349741. ! pages 39, 66[51] P. M. Steiner, T. D. Cook, and W. R. Shadish. On the Importance of ReliableCovariate Measurement in Selection Bias Adjustments Using PropensityScores. Journal of Educational and Behavioral Statistics, 36(2):213–236,Feb. 2011. ISSN 1076-9986. doi:10.3102/1076998610375835. URLhttp://jeb.sagepub.com/cgi/doi/10.3102/1076998610375835. ! pages 73[52] E. A. Stuart. Matching methods for causal inference: A review and a lookforward. Stat Sci, 25(1):1–21, 2010. ! pages 73[53] E. Tielemans, L. L. Kupper, H. Kromhout, D. Heederik, and R. Houba.Individual-based and group-based occupational exposure assessment: Someequations to evaluate different strategies. The Annals of OccupationalHygiene, 42(2):115–119, 1998. ! pages 6[54] R. Tornero-Velez, E. Symanski, H. Kromhout, R. C. Yu, and S. M.Rappaport. Compliance versus risk in assessing occupational exposures.Risk Analysis, 17(3):279–292, 1997. ! pages 28[55] H. White. Maximum Likelihood Estimation of Misspecified Models.Econometrica: Journal of the Econometric Society, 50(1):1 – 25, 1982.URL http://www.jstor.org/stable/10.2307/1912526. ! pages 55, 60[56] L. Xing, I. Burstyn, D. B. Richardson, and P. Gustafson. A comparison ofbayesian hierarchical modeling with group-based exposure assessment inoccupational epidemiology. Statistics in Medicine, 32(21):3686–3699, 2013.! pages iv[57] L. Xing, I. Burstyn, and P. Gustafson. On the shape of an exposure-diseaserelationship, the average effect of exposure, and the impact of exposuremeasurement error. Under the review of Statistics in Medicine, 2014. !pages iv100
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Model and inference issues related to exposure-disease...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Model and inference issues related to exposure-disease relationships Xing, Li 2014
pdf
Page Metadata
Item Metadata
Title | Model and inference issues related to exposure-disease relationships |
Creator |
Xing, Li |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | The goal of my thesis is to make contributions on some statistical issues related to epidemiological investigations of exposure-disease relationships. Firstly, when the exposure data contain missing values and measurement errors, we build a Bayesian hierarchical model for relating disease to a potentially harmful exposure while accommodating these flaws. The traditional imputation method, called the group-based exposure assessment method, uses the group exposure mean to impute the individual exposure in that group, where the group indicator indicates that the exposure levels tend to vary more across groups and less within groups. We compare our method with the traditional method through simulation studies, a real data application, and theoretical calculation. We focus on cohort studies where a logistic disease model is appropriate and where group exposure means can be treated as fixed effects. The results show a variety of advantages of the fully Bayesian approach, and provide recommendations on situations where the traditional method may not be suitable to use. Secondly, we investigate a number of issues surrounding inference and the shape of the exposure-disease relationship. Presuming that the relationship can be expressed in terms of regression coefficients and a shape parameter, we investigate how well the shape can be inferred in settings which might typify epidemiologic investigations and risk assessment. We also consider a suitable definition of the average effect of exposure, and investigate how precisely this can be inferred. We also examine the extent to which exposure measurement error distorts inference about the shape of the exposure-disease relationship. All these investigations require a family of exposure-disease relationships indexed by a shape parameter. For this purpose, we employ a family based on the Box-Cox transformation. Thirdly, matching is commonly used to reduce confounding due to lack of randomization in the experimental design. However, ignoring measurement errors in matching variables will introduce systematically biased matching results. Therefore, we recommend to fit a trajectory model to the observed covariate and then use the estimated true values from the model to do the matching. In this way, we can improve the quality of matching in most cases. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-08-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0165967 |
URI | http://hdl.handle.net/2429/50042 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2014-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2014_september_xing_li.pdf [ 2.14MB ]
- Metadata
- JSON: 24-1.0165967.json
- JSON-LD: 24-1.0165967-ld.json
- RDF/XML (Pretty): 24-1.0165967-rdf.xml
- RDF/JSON: 24-1.0165967-rdf.json
- Turtle: 24-1.0165967-turtle.txt
- N-Triples: 24-1.0165967-rdf-ntriples.txt
- Original Record: 24-1.0165967-source.json
- Full Text
- 24-1.0165967-fulltext.txt
- Citation
- 24-1.0165967.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0165967/manifest