Bayesian Approach to Case-control Studies with Errors in the Covariates by Marc Vallee B.Sc, Universite de Montreal, 1996 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia December 1998 © Marc Vallee, 1998 In presenting degree freely at this the thesis in partial fulfilment of University of British Columbia, I agree available for copying of department publication this or of reference thesis by this for his and study. scholarly or thesis for her 4jgif<T< Department of The University of British Vancouver, Canada D a t e DE-6 (2/88) / W . Q*. Columbia /Ml purposes requirements that agree may representatives. financial permission. I further the It gain shall not be is that the Library permission granted by understood be for an advanced shall make for the that allowed without it extensive head of my copying or my written Abstract It is not uncommon to be faced with imprecise exposure measurements when dealing with case-control data. In cancer case-control studies, for instance, smoking histories may be unreliable. The usual methods of analysis involve logistic regression with different correction factors. The approach we adopt involves Bayesian fitting of a retrospective discriminant analysis model. The parameters of interest are the regression coefficients in the prospective logodds ratio for disease. Under a standard non-informative prior, the posterior means of these parameters are infinite. Posterior medians, however, perform reasonably relative to other estimators that adjust for covariate imprecision. For models with only continuous exposures, the Bayesian inference can be implemented with exact posterior simulation. The presence of binary covariates requires some elements of a covariance matrix to be fixed. We develop a general approach for sampling such a constrained covariance matrix. The Bayesian inference in this context now demands the use of a Gibbs sampling algorithm. ii Contents Abstract ii Contents iii List of Figures v Acknowledgements vii Dedication viii 1 Introduction 1 2 Literature Review 5 3 2.1 Basic Terminology 5 2.2 Correction Methods 7 2.2.1 The Armstrong, Whittemore and Howe Method . . . . 7 2.2.2 The Rosner, Willet and Spiegelman Methods 9 2.2.3 Other Methods 10 The P M E D Approach 14 iii 4 3.1 Methodology 14 3.2 Known Measurement Error Variance 18 3.3 Unknown Measurement Error Variance 21 3.4 Simulation Study 23 3.5 Discussion 30 Multivariate Exposure 4.1 Methodology 35 4.2 Binary Covariates 38 4.2.1 The Gibbs Sampler 40 4.2.2 Application 42 4.3 4.4 5 Covariance Matrix With Partially Fixed Diagonal 44 4.3.1 A Simple Case 49 4.3.2 More than Two Binary Covariates 53 Discussion 56 A n Application to Real Data 5.1 5.2 6 34 57 Bladder Cancer Case-Control Data 58 5.1.1 59 A Closer Look at the Data The Analysis 61 Conclusion 71 Bibliography 74 iv List of Figures 3.1 Comparison of the PMED and the AWH estimators in the known r case. The first two panels give (0.1,0.3,0.5,0.7,0.9) quantiles of the estimators as a function of y/r. The median is displayed with the solid line. The long-dashed line indicate true value of /3. The AWH estimator is only considered when third panel gives empirical coverage probabilities of 80% credible intervals, with error bars corresponding to plus and minus two standard errors 3.2 32 Comparison of the PMED, AWH, RWSl and RWS2 estimators in the unknown r case. The format is as per Figure 3.1 . . . . 5.1 Histograms for the distribution of the age and smoking history variables 5.2 33 66 Diagnostic plots to verify the normality of the logarithmic transformation of the age and smoking history variables v 67 5.3 Plots of the coefficients from the logistic regression of the disease status on increasing stratificaions of both the age and logarithmic transformation of the smoking history variables 5.4 68 Plot of the coefficients from the logistic regression of the disease status on increasing stratificaions of the logarithmic transformation of the smoking history variable 5.5 69 Histograms of the posterior distributions of the parameter estimates for Pi, (3 and P obtained with the assumption there was 2 3 no measurement error, 10% and finally 25% measurement error in the smoking history variable vi 70 Acknowledgements Sinceres remerciements a Paul Gustafson pour son support, sa patience, son ingeniosite et sa grande disponibilite. Paul a su me redonner confiance dans les moments difficiles. Merci aussi a Nhu Le pour ses precieux conseils et son experience. Merci a tous les membres du departement pour avoir fait de ces deux annees une experience extraordinaire. Et finalement, merci a ma famille pour ses indispensables encouragements et son support inconditionel. MARC The University of British Columbia December 1998 vii VALLEE A ma famille.. viii Chapter 1 Introduction The development of case-control methodology is of great importance to the field of epidemiology. A common problem which seems to arise in these types of studies is the mismeasurement of some of the covariates. Our objective is to develop, using Bayesian methods, an approach that will allow us to reach reasonable conclusions from case-control studies where some of the variables are measured imprecisely. Three major components are present here: Case-control studies, Errors in covariates and Bayesian methods. Case-contol studies are usually conducted to investigate the relationship between the presence of disease and of specific risk factors often referred to as the exposure. The central idea, as stated in Breslow (1996): "is the comparison of a group having the outcome of interest to a control group with regard to one or more characteristics." The work of Cornfield (1951) was a great contribution to the development of case-control methodology. He demonstrated 1 that the exposure odds ratio for cases versus controls equals the disease odds ratio for exposed versus unexposed. Also, provided the disease is rare, the exposure odds ratio (known as the relative risk) approximates the disease rate ratio. Basic case-control data can be regarded as two independent samples of covariable vectors, {a:oi}"=i from X\D = 0 (the controls), and {xu}^ from X\D = 1 (the cases), where D denotes the disease status (0 for healthy and 1 for diseased) and X is the covariable vector. The typical analysis of such data is done by fitting a prospective logistic regression model for D\X to the retrospectively sampled data. This procedure is described and justified by Prentice and Pyke (1979). It is not uncommon to have binary risk factors, in which case errors can occur by misclassification. The approach we will introduce later on, and most of the methods we will review, are designed to deal with continuous exposure variables. These are usually covariates that are more complex to measure as opposed to a simple classification; in which case a surrogate exposure X* is observed rather than the true exposure X. Examples of commonly, mismeasured covariates include information about diet or nutrient intake, past history of cigarette or alcohol use, and radiation exposure. The obvious solution is better measuring instruments and techniques, but this is not always feasible. It is widely recognized that in epidemiology or in studies of relationships between a response variable and a set of covariables, these covariates are often measured with error, which can seriously affect the statistical analysis. If ignored, these measurement errors may 2 lead to biased results; the usual consequence is the attenuation of the strength of the relationship. General discussion on classical statistical procedures for dealing with errors in covariates can be found in Fuller (1987) and Carroll (1989). Errors in covariates appear to be difficult to deal with classically; it seems a number of assumptions and approximations have to be introduced in order to proceed with the analysis. A natural alternative is the use of Bayesian methods. The key feature of the Bayesian approach is that it uses probability theory to describe uncertainty about both parameters and observables. Bayesians think of model parameters as random variables, therefore having a certain probability distribution which can be interpreted as belief about the possible values this parameter can take. This differs from the frequentists who consider model parameters to be fixed. In the Bayesian approach, the information that is available to the experimenter before data are observed and his belief contribute to the specification of a pior distribution. A sample is then taken and the data observed so that the prior distribution can be updated with this sample information. This updated prior distribution is called the posterior distribution. Inference is then made on the basis of this posterior distribution which is proportional to the product of the likelihood and the prior distribution. The recent developments in algorithms for performing Bayesian computations make this approach particularly appealing. We are referring here to Markov chain Monte Carlo (MCMC) methods; reviews of these methods can be found in both Smith and Roberts (1993) and Besag, Green, Higdon, and 3 Mergensen (1995). This makes the Bayesian approach to errors in covariates in case-control studies very sensible and reasonably simple. With the use of a measurement error model we will produce a distribution over the plausible values of the parameters of interest. Although quite natural, this approach has not received much attention in the literature. Richardson and Gilks (1993) and Muller and Roeder (1997) explore Bayesian approaches to errors in covariates in casecontrol studies. Most of the papers related to this topic suggest classical methods, among those are Armstrong, Whittemore, and Howe (1989), Rosner, Willet, and Spiegelman (1989) and Carroll, Gail, and Lubin (1993). In Maffick, and Gelfand (1995) a Bayesian approach to errors in covariates is presented, but it is not specifically applicable to case-control studies. In the following chapter we will review the main findings from many of the previously mentioned papers amongst others. In Chapter 3 we will introduce a univariate version of our method along with a simulation study. Chapter 4 will exhibit the generalization to multivariate applications. Finally, an analysis of bladder cancer case-control data will be presented in Chapter 5. 4 Chapter 2 Literature Review In this Chapter we will review the methods that have been suggested for correcting for measurement errors. As it was previously mentioned, most of the correction methods make use of the classical theory but we did find results that were obtained within the Bayesian framework. 2.1 Basic Terminology First, we believe a brief summary of the basic terminology could be quite helpful. Errors can be "random" or "systematic", the key feature of random errors is that the law of large numbers applies; if we were to repeat the measurement many times the mean of these replicates would provide an unbiased estimate of the true quantity we are trying to measure. As opposed to random errors, unbiasedness does not hold for systematic errors; the mean of many repeated measures would not necessarily converge toward the true value. 5 Errors can also be either "differential" or "nondifferential". This depends on whether the errors are related to the disease outcome D. An example of differential errors would be if the mismeasurement or misclassification of the exposure was dependent on the disease status. Such errors can be the cause of serious bias, but fortunately good study design can help guard against these. A measurement error is nondifferential if it arises in the same way for cases and controls; the errors distribution is independent of the disease outcome. A more formal definition of nondifferential errors can be stated as X*\X, D = X*\X, or equivalently as D\X,X* = D\X, which means that when the true exposure is known the measured one does not add any additional information. Nondifferential random errors can also be the cause of biased results, and most of the correction methods we will present deal with this type of errors. Measurement error has traditionally been modeled in two different ways: The "classical error" model and the "Berkson error" model. In the classical error formulation, the conditional distribution of the surrogate X* given the true value X is specified, while in the Berkson formulation, it is the conditional distribution of X given X* which is specified. In order to get an assessment of the measurement error distribution a "validation study" is often used. In a validation study a "gold standard" measurement of X is obtained and compared to the surrogate measure X* which will be used in the main study. In this manner a direct estimate of the error distribution is provided, but since gold standard measurements are generally quite expensive, they can only be performed on a small portion of 6 the whole sample. This short terminology review should make the subsequent sections easier to follow. Now we will present techniques suggested by different authors, some on which our own approach was based, and others that served as comparison for our work. 2.2 Correction Methods In this section we introduce different methods to analyse case-control data with imprecise exposure measurement. Two of these methods are presented in greater details because of the similarity of their initial setting to ours. 2.2.1 The Armstrong, Whittemore and Howe Method In their 1989 paper, Armstrong, Whittemore and Howe [hereafter AWH] suggest a method of correcting a standard logistic regression analysis to account for measurement errors. In order to adjust for the effect of the measurement error on the logistic regression coefficients obtained from case-control data, a multivariate discriminant analysis model is assumed for the joint distribution of the true covariate values and errors among cases and controls. Their approach is applicable to multiple strata designs, but for simplicity we will present here the one-stratum, one-dimensional case. They consider xp as a p-dimensional (p=l, for the present case) row vector of unknown true covariates for case {D=l) and control 7 (D=0). Follow- ing the dicriminant analysis model: ~ N(n + X D DA,a), where JV(-, •) stands for the normal distribution with mean /z+DA and variance a. This notation for the normal distribution will be used throughout the thesis. The cases and controls have a common variance cr, / i is the common part of the mean for both groups and A represents by how much the case mean differs from the control one. The information that is available is from lp measurements of the flawed X D (we will consider l = 1 here) with: D X* D = X + D e, D where e D ~ AT 7 ( + D5,T). The e^s are the error components and are independent of each other and of the X s. In their model the parameter 7 stands for the mean error common to D cases and controls and 5 represents the systematic difference in error between cases and controls, therefore allowing for differential errors. The parameter r is the error variance. These assumptions imply the following: XQ ~ N{fj, + 7, a + T), X{ ~ N((JL + 7 + A + 8,O + T). Then using a well known relationship between discriminant analysis and logistic regression, they obtain a model of the form: logit{P(D = 1\X = x)} = a + 8 fx. Fitting this model with x^ and x{, realizations of XQ and X* respectively, they get a 'naive' estimate R of R. They propose an estimator of the prospective log-odds ratio PAWH = P/X, where A is the correction factor which adjusts for measurement error. An estimate of PAWH can be obtained by the 'naive' R and the appropriate estimate of A. In the unidimensional case this correction factor is A = a/(r + cr); the estimators are A = 1 — A = 1- (S /S ) 2 2 if T is unknown. Here S 2 P = {SS 0 {T/S ) 2 if r is known, and + SSi)/(n + n - 2) is 0 x the pooled estimator of the variance r + a and SSD = 2~2i(x* — x* ) , D=0,1 2 Di D and no is the sample size. While S = ri^ SS2 estimates r from a validation 2 1 sample which would be needed when r is unknown and SS2 = Ylii^i ~ ii) • x 2 In Chapter 3 a simulation study that gives a comparison of the AWH approach with the one proposed in this work will be presented. 2.2.2 The Rosner, Willet and Spiegelman Methods Another approach is proposed by Rosner, Willet, and Spiegelman (1989) [hereafter RWS]. In their paper two methods are provided to correct relative risk estimates obtained from logistic regression models for measurement error in continuous covariates. Both methods require a separate validation study to estimate the regression coefficient A relating the surrogate measure to the true exposure. They first assume that the model relating a single-dimensional true exposure X and the probability of disease D is of the logistic form: logit{P{D = l\X = x)} = a + 9 0x. Then a linear relationship is assumed to exist between true exposure X and observed exposure X* of the form: X = a' + XX* + e, where e ~ N(0, a ). 2 Finally, they assume nondifferential errors and that the conditional distribution of X given X* and the marginal distribution of X* are the same for the main and validation study populations. Their first method is a linear approximation which yields an estimate of PRWSI of the same form as PAWH {P/X). Here A is obtained by regressing X on X* using the above linear model with ordinary least squares and the validation study data. Their second method is a likelihood approximation where a second-order Taylor series expansion is used to approximate the logistic function, enabling closed-form likelihood estimation of PRWS2- Again, performance of these methods will be examined in the simulation study presented in Chapter 3. A multivariate versions of their approaches is presented in Rosner, Spiegelman, and Willet (1990). 2.2.3 Other Methods Following the likelihood approach Carroll, Gail, and Lubin (1993) derive pseudolikelihoods on which to base estimators of the parameters of a prospective logistic model for case-control data that are, of course, measured with error. They also examine computationally simpler methods where the conditional 10 expectation of the true covariate X knowing the surrogate X* is substituted for X in the logistic model. Prentice and Pyke (1979) established the equivalence of the asymptotic distributions of estimators based on the prospective and retrospective likelihoods under case-control sampling, and Breslow and Day (1980) suggested the use of estimators based on the conditional likelihood. Forbes and Santner (1995) continue in the same vein looking at the effect of measurement error on the conditional maximum likelihood estimator (CMLE). They then go on to suggest three alternative estimators correcting for the measurement error: The first based on a correction for the asymptotic bias of the CMLE, the second is a functional estimator, and the third is a "transformed" estimator obtained by computing the CMLE using transformed covariates. In Buonaccorsi (1990), he makes use of the normal discriminant analysis model setting and the correction for measurement error is made possible by a double sampling scheme in which the surrogates are collected on all units and true values are obtained on a random subset of units, allowing the consideration of a large set of measurement error models. Yet another approach is presented in Roeder, Carroll, and Lindsay (1996) , in which a semiparametric mixture method is introduced. By using a mixture model, the relationship between the surrogate X* and the true covariate X can be modeled. The likelihood depends on the marginal distribution X and the measurement error density; this measurement error density is parametrically based on a validation sample, and the marginal of X is modeled 11 using a nonparametric mixture distribution. Work in the classical setting has also been done on sample size calculations for case-contol studies with errors in the covariates, both McKeownEyssen and Thomas (1985), and McKeown-Eyssen and Tibshirani (1994) discuss that matter. Although many problem in epidemiology can be naturally formulated in the Bayesian framework, the approach was not pursued due to computational complexities. But the introduction of Markov chain Monte Carlo sampling methods has really opened the way, and the literature on this subject is now more common. Richardson and Gilks (1993) take a Bayesian perspective on measurement error problems in epidemiology. The authors construct what they call a conditional independence model which is equivalent to considering nondifferential errors. They also introduce a graphical representation to this type of model. Then they indicate how Bayesian estimation can be carried out in these settings using a Gibbs sampler. Also using the Bayesian approach is Muller and Roeder (1994), paper in which they present a semiparametric model for case-control studies with errors in variables. The approach proposed in this paper is based on a nonparametric model for the exposure and a parametric disease model. The model they present is complex in structure, but is said to be simple to implement. We have reviewed here the papers we think are most relevant to the problem we will be tackling ourselves in the subsequent Chapters. The main 12 role the material found in these articles will be playing is guiding us in building our own initial model, thus making comparisons possible. The methodology we will be presenting will combine parts of the different approaches found in the literature that have not been tried together. 13 Chapter 3 The P M E D Approach To develop our own correction method we investigate the use of a retrospective discriminant analysis model for the unobserved real exposure, which leads to a Bayesian variance component model. In this chapter we present the basic methodology of our approach and illustrate it in the simple context of a univariate exposure. The method is applied to both scenarios where the measurement error variance is known and unknown. A simulation study is then carried out, and results are compared to the ones obtained with established methods. 3.1 Methodology As for the methods we reviewed earlier we have D and X that respectively represent an individual's disease status and covariable vector, with D coded as 0 (healthy) or 1 (diseased). Case-control data are obtained retrospectively and, 14 as it was mentioned in the introduction, can be regarded as two independent covariable vectors, {xoi}^ from X\D = 0 (the controls), and {xu}^ from X\D = 1 (the cases). We want to perform a fully Bayesian analysis of the retrospective data which demands a likelihood function based on the sampled distribution X\D, as opposed to the typical fitting of a prospective logistic regression model for D\X. Since our main objective is to learn about the prospective relationship D\X, we have to examine the form of the prospective model implied by a specified retrospective model. Assuming for now that all the covariables are continuous, a simple model to consider is one suggested by AWH, the normal discriminant analysis model: X\D = 0^ N(fx ,E ), 0 X\D = l~N(n ,'Z ). 0 1 (3.1) 1 The logistic regression model can be rewritten as: logit{P(D = 1\X = x)} = log = log = log . = lO9 = lO9 P(D = l\X = x) 1-P(D 'P(D = l\X = x) j = l\X = x)} P(D = 0\X = x)j P(X = x\D = 1)P(D = 1) P(X = x\D = 0)P(D = 0) (P(X = x\D = l)\ , (P(D = 1)\ {p\x , \ = xD = (f(X\D \f(X\D 0)) ° {pJb^} +l 9 = l)} = 0)j + C O n S t & n t - Plugging in the distributions suggested by the normal discriminant analysis model (3.1), we obtain: lo ff( \ ° \f(X\D X 9 D = )\ = 0)i 1 l o = " \ C le x p ( - H x - ^ y ^ i x - L n j ) ^ t c b e x p ^ - z i o V E o 15 1 ^ - ^ ) ) x cx ~2^ x ~ ( 1 ( E f V i - So x Er )^)^. ~ A»i) + (x~ Aio)'S Vo)z + ^ ( ( E p 1 1 - 0 - no)} Hence, the retrospective model implies a prospective model of the form logit{P(D = 1\X = x)} = a + 8 x + x Cx, T T (3.2) where 3 = E r V i - EQVO, and C = (EQ - E]" )/2, Given the fact that 8 1 1 and C depend only on the parameters of (3.1), they can be estimated from case-control data. It is not the case for the intercept a. Because it also depends on the marginal distribution of D, a is not estimable from casecontrol data alone. Information about the disease prevalence would be needed. The relationship between (3.1) and (3.2) is well known; it is exploited by Armstrong, Whittemore and Howe (1989) and Muller, Parmigiani, Schildkraut and Tardella (1997) amongst others. For the remainder of this chapter ideas will be presented in the univariate exposure setting, i.e. with X being a scalar. We will also assume a common variance v = E = £ i for both cases and controls. Model (3.1) 0 therefore becomes: X | D = O~iV(0o,i/), X | D = l~JV(/xi,i/), (3.3) which implies the prospective relationship logit{P(D = l\X = x)} = a + 8x, (3.4) with 8 = (fii — HQ)/V being the parameter of interest. Typically, the interpretation for 8 is that exp{/3(:r' — x)} is the disease odds ratio for exposure 16 x' compared to exposure x. Additionally, for rare diseases the relative risk of disease is approximately equal to this odds ratio. An important point that has to be considered is how much stronger an assumption is (3.3) compared to (3.4). If we consider /j to be the density of X\D = i, then we get exactly (3.4) when ^4ocexp(/?a;). (3.5) (3.5) can obviously be satisfied by densities /o and / i which are not normal, but it does restrict their tail behavior. If we assume without loss of generality that f3 > 0, then (3.5) implies that the right tail of / and the left tail of / i 0 fall off faster than exp(—f3\x\). If these 'thinner than exponential' tails are in fact normal, then the other tails (the left tail of / and the right tail of 0 /i) will also be normal. This suggests that (3.3) is likely to be appropriate in at least some situations where (3.4) holds. In practice transformations are often used to approximate normality. If such a transformation was used on the sample of control exposures for example, then in order to proceed with the retrospective model suggested here, the same transformation would have to yield approximate normality with a similar variance for the other exposure sample. We now have a model for the relationship between the exposure and the disease outcome. Next we have to consider the measurement error which will arise in that a surrogate exposure X* will be observed rather than the true exposure X. So we also have to model this error, we suggest the following 17 simple classical (non-Berkson) model: X*\X ~ N(X,r) (3.6) under which X* is an unbiased but noisy measurement of X. We assume the measurement error is nondifferential; that is, it arises in the same way for cases and controls. Using the formal definition of nondifferential error given in chapter 2, (3.3) and (3.6) can be collapsed to obtain the next model, from which the data arise: X*\D = 0 ~ N(U ,T 0 + V), X*\D = 1 ~ N(ti T LT + v). (3.7) So a direct consequence of the measurement error is some extra variability in the data compared to (3.3). This extra variability r (the measurement error variance) could be known from an external validation of the measurement process, or it could be estimated with the use of a validation sample as it was illustrated in chapter 2. The following two sections will illustrate our approach for either situation. 3.2 Known Measurement Error Variance In the present section we will assume r to be known. Since we are conducting a Bayesian analysis, we need a prior distribution in addition to the likelihood based on (3.7). This will yield a posterior distribution on the unknown parameters (/io, fJ-i, v) and enable us to estimate 3. The prior distribution on which our inference will be based is a standard noninformative prior for variance 18 component models, 7r(jLto,jLti,i/) oc ( r + i / ) . (3.8) -1 Formally, this prior is known as the reference prior (Berger and Bernardo, 1992) when v is the parameter of interest and (/i , Pi) are nuisance parameters. 0 The reference prior for 8 as the parameter of interest was also considered, but the fact that it did not have a simple form made it unappealing. Given the prior distribution in (3.8) and the realization x* = {xl , • • •, 0 n o- *ii''' x x 0 ' ^ n i i l fro the normal discriminant analysis model in (3.7), the m resulting posterior is of the following form: prior ir(u, ,(j,i,v\x*) oc (V + T) - I Q 1=1 xn(. x ) - ^ e x ( - ^ f ) + (3.9) P 1 \ 3 OC X e X X e X H H 2(„ + r) j i 2(, + r) ) H 6 X P C X 2(, + r) ) 2(„ + r) J Our Bayesian inference can be implemented by simulating independent draws from this posterior distribution. We can then integrate out /x and ^ from 0 (3.9) to get n ( \ * ) oc u x ( _ L _ ) ^ e x p ^ - & ( 5 ) ~ ^ + S p i ( s ? i ~ *i)^j £ g 19 2 ( 31 Q ) In order to make sampling easier, it is convenient to reparametrize from (//o,Ati,T) to (^0,^1,7), where 7 = r + v. The posterior distribution of ^\x* is then easily obtained from (3.10) and can be expressed as 7 r |c* = G\G > T, (3.11) where G has an inverse gamma distribution with shape parameter (n + n\ — 0 2)/2 and scale parameter (SS + SSi)/2, with SSi = Ej(x*j ~ x*f- Sampling 0 from G\G > r is easily implemented by repeatedly sampling G until G > r. Then v is taken to be the difference between the sampled G and the known r. Now that we have sampled 7 (or u) it is easy to see from (3.9) that we have frfrx* ~ Nixln^-y), (3.12) where //o and //1 are independent given 7 and x*. A draw from the joint posterior yLto? A*i>Tl^* c a n be obtained by sampling from ^\x* and then from A*o> A*i JT> *- Thus we have a simple algorithm for exact posterior simulation x in a variance components model. This computational approach is pursued in greater generality by Wolfinger and Kass (1996). Commonly, Bayesian parameter point estimation is done by using the posterior distribution's mean. It can be seen by the following that in the present case E(B\x*) is infinite; E{{fjn - / / o K V } = E{u~ E{^- = E{v- {x\ = (xl ^\u,x*)} l l 20 -x* )\x*} 0 -xl)E{v~ \x*} l by (3.12) but where a and u are both positive and are respectively the shape and scale parameter of an inverse gamma distribution, the numerator's integral blows up near r, therefore E{u \x*} -1 = +00. Intuitively this occurs because, according to both the likelihood based on (3.7) and the prior (3.8), zero is a plausible value for v which is the denominator of (3. Since the posterior mean of/3 does not exist, we use the posterior median of our parameter of interest as a point estimator. From now on we will refer to this posterior median as PMED. This concludes the known measurement error variance case. We will come back to it in a subsequent section to reveal the procedure and the results of a simulation study. 3.3 Unknown Measurement Error Variance We will now examine the situation where r is unknown, in which case it will have to be estimated along with v. The estimation of this error variance will be done with the use of a validation sample. In addition to the surrogate exposures for cases and controls, we assume an independent data set is available. This extra data set will consist of measurements of both the surrogate exposure X* and the true exposure X for each subject. We have assumed nondifferential measurement error, thus the subjects selection procedure for this validation sample does not matter; we can view the true exposure X as being sam21 pled from an arbitrary distribution, so long as the measurement error model X*\X ~ N(X,r) is the same for both the main and validation samples. The observed validation sample is denoted {{x , a^i)}™^, with SS — J2i(^2i~' 2i) x 2i 2 2 defined for subsequent use. To conduct our analysis we can proceed in a very similar manner as in the known r case. We begin by doing a reparametrization similar to the one done in the previous case. The variance components (u, r) are reparametrized to (7,7"), where 7 = r + v. We again use the reference prior for a variance components model but this time when (7, r) are the parameters of interest TT(^0I^I,7.T) « 7 T 1 1 (0 < T < 7 < CO) (3.13) Once more we have selected in (3.13) a noninformative prior. Combined with the likelihood given by the normal discriminant analysis model in (3.7), we can write the resulting posterior distribution from which we will be sampling: prior x (-\ x n (fi -x* y exp ( - - ILL exp 0 0 o 2 exp 7 Tli(/Xi -X\f 2 7 exp (3-14) Once again we can integrate out fi and /xi, this time obtaining: 0 ) (3.15) 22 Hence from (3.15) the marginal posterior distribution of the variance components is ( ,T)\X* 7 = (3.16) (G,T)\G>T, where G and T are independent, with G „ / G (!!i±£-*, + T ~ / G ( ! ^ i , ^ ) , (3.17) and the posterior conditional distribution for /XQ and fjb\ once 7 and r have been sampled is the same as in the known r case ^l^r^^JV^.n^)- (3-18) Thus again exact posterior sampling can be implemented, by simulating (G, T) pairs from the distributions in (3.17) until G > T, and then sampling /i and u, 0 x from (3.18) in order to get a set of /?'s. The posterior mean for the parameter of interest is infinite in this case too, so again our PMED is used as a point estimator. 3.4 Simulation Study Situations with known and unknown measurement error variance have now been addressed. In both cases our approach led to exact posterior sampling, which should make implementation simple and sampling procedures reasonably efficient In order to assess the performance of our approach we conduct 23 the analysis of simulated data sets and compare our results to established methods previously illustrated in chapter 2. Simulations are carried out based on samples size no = ni = 50 from the retrospective model (3.3) with parameter values ([JL , 0 u-i,u)=(0,1,1). Un- der these parameters, the prospective logistic regression coefficient is B = 1. The simulations are done with different values of \fr ranging from 0.0 to 2.0 by jumps of 0.1. In one case r is assumed to be known. In the other it has to be estimated from a simulated validation sample of size n = 50. In ei2 ther situation, the PMED estimate of B is computed as the median of 500 independent and identically distributed draws from the posterior distribution. Here are the multiple steps of the simulation process for the r known case: 1. Generate 500 samples of size n = 50 from a normal distribution with 0 mean 0 and variance v = 1. 2. Generate 500 samples of size n\ = 50 from a normal distribution with mean 0 and variance v = 1. 3. Create a vector of y/r% (0.0, 0.1, 0.2,...1.8, 1.9, 2.0). 4. Use the 500 samples generated in step 1. to create 500 new samples that will be from a normal distribution with mean /x — 0 and variance r (the 0 controls). 5. Use the 500 samples generated in step 2. to create 500 new samples that will be from a normal distribution with mean /ii = 1 and variance r (the 24 cases). 6. Sample in turn u, U,Q and u,\. • Sample 7 from (3.11) to get v. • Use v to sample both u, and u, from (3.12). 0 • Compute (3 = x — /x ). 0 7. Repeat step 6. 500 times, which gives us 500 independent and identically distributed draws from the posterior distribution. 8. Compute the PMED estimate of /?, the median of the 500 draws from the posterior. 9. Repeat steps 4., 5., 6. and 7. with every value \fr was given. The reason for steps 1. and 2. is that initializing all the future samples with these (which come from standard normal distributions), makes comparison from one measurement error level to another easier and more precise. The process is very similar for the r unknown case, except one step is added between steps 2. and 3. and between steps 5. and 6. to create the validation sample, and step 6. is slightly different. Here are these two added steps and the correction for step 6.: 1. Generate 500 samples of size n = 50 from a normal distribution with 2 mean 0 and variance v = 1. 25 2. Use the 500 samples generated in the previous step to create 500 new samples that will be from a normal distribution with mean // = 0 and 2 variance r (the validation sample). 3. Sample in turn v, and • Sample 7 and r from (3.16, 3.17) to get v. • Use v to sample both / i and fj,i from (3.18). 0 • Compute (3 = — //o). Quite obviously we did not generate both surrogate and true exposures in our validation sample, it actually consists in {(x* — x i)}"=].> the differences 2i 2 between the surrogate and true exposures. These differences should follow a normal distribution with mean 0 and variance r. The simulation routines were coded in C, and although probably not using the most efficient algorithms, they compared very favorably to S-plus with respect to running time and memory resources needed. The PMED estimator of (3 is compared to other estimators suggested in the literature. For the known r case, the comparison is done with a method by AWH, illustrated in chapter 2, in which a correction factor A is used to adjust the (3 estimate obtained from a 'naive' logistic regression. For the r unknown case, the AWH method is still applicable and two others by RWS, also discussed in chapter 2, are performed on our simulated data for comparison. The first by RWS, which we refer to as RSW1, takes the form of the AWH estimator, except that the correction factor A is estimated by the fitted slope of a regression of 26 X on X* using the validation sample. The second estimator, referred to as RWS2, is based on an approximate likelihood function. The RWS method needs to fulfill more assumptions then the PMED approach in order to be applicable. It requires that the variance of X be the same for the validation sample as for the case and control samples. If this requirement is not met, the fitted slope will no longer estimate A correctly. This assumption is not needed for the PMED method to work, it is applicable to any distribution of X in the validation sample. This seems more sensible as it is parameters of the conditional distribution for X*\X which must be estimated. The AWH estimator has a definite disadvantage in that it performs poorly when faced with substantial measurement error. The estimate A of the correction factor can sometimes take on a negative value, in fact, P(A < 0) can be quite high when r is sufficiently large. We can obtain A's distribution and derive the following: (3.19) where Fk is the chi-square distribution function with k degrees of freedom. The derivation of this probability is done using information on the AWH approach presented in Chapter 2. We illustrated there that in the known r scenario A can be estimated by A = 1 — (T/S ), where 2 From there we can write: P(A<0) = P ( l - {T/SD < 0) 27 = P{r/Sl>l) = P{Sl<r) VV V+ T \ Ano+ni-2 r J V P ^ 1+ V+ T V/T Similarly, in the unknown r case, P(A<0) where = F„ _ , 0 + N I 2 N 2 (7^777) (3.20) is the F distribution function with k and / degrees of freedom. This time the probability was derived given that in the unknown r scenario A is estimated by A = 1 — S / S , and 2 2 S /(V + T) 2 p = S /r ( 1 {l + 2 \ Sl u/r)s no+ni— 2,712 • 2 Again we can write: P(A<O) = P (1 - sl/s 2 p < 0) = P(S JS >l) = P(S /S <I) 2 2 p 2 2 P V 1 + U/TJS* •P ('Pno+ni—2,ri2 ^ \1 + U/T / 1 1+ \ \ ' ^/7 Because of this the the AWH estimator is implemented in our simulation study only when the measurement error r is small enough to ensure that either (3.19) or (3.20) does not exceed 0.05. 28 Results of the simulations for r known and r unknown are illustrated respectively in Figure 3.1 and Figure 3.2, which can be found at the end of the present chapter. In the first case, for each value of r 500 independent data sets are simulated, for each data set PMED and AWH estimates are computed. Using the sampling methods presented above in the different steps of the simulation we obtain samples of these estimates. The empirical (0.1,0.3,0.5,0.7,0.9) quantiles of each estimator are displayed as functions of y/r in the first two panels of Figure 3.1. We can see from these two panels that for small measurement error both methods yield sampling distributions of the estimators that are very similar, but as r gets larger the PMED estimator is more tightly centered about the true value 3 = 1. Even when (3.19) becomes non-negligeable and the AWH estimate is considered inappropriate, the PMED estimator keeps performing quite reasonably. We also proceed to construct Bayesian credible intervals for 3. If we take the a/2 and 1 — a/2 quantiles of the posterior distribution of'/?, the interval between these constitutes an equal-tailed (1 — a) credible interval for /?. These posterior quantiles can be estimated by empirical quantiles from the simulated posterior sample. Based on the 500 data sets, empirical coverage probabilities of 80% credible intervals are displayed in the third panel of Figure 3.1. The associated 'error bars' (plus and minus two standard errors) indicate that the credible intervals can be reasonably interpreted as 80% frequentist confidence intervals. For the unknown r case, Figure 3.2 compares the PMED, AWH, RWS1 29 and RWS2 estimators. The format is the same as in Figure 3.1. In the first four panels the sampling distributions of the estimators are presented by five empirical quantiles. The comparison between the PMED and AWH estimators is similar to that in the known r case, but the AWH estimate becomes inappropriate because (3.20) becomes non-negligeable even faster than in the previous case. The RWS estimators behave quite similarly to the PMED estimator, and in fact have slightly better performance for large measurement error variance r. However, as noted previously, the PMED estimator works under less restrictive assumptions than the RWS estimators. In the last panel are displayed the empirical coverage probabilities of 80% credible intervals. Again these seem consistent with the confidence interval interpretation. 3.5 Discussion We have now introduced methodology in a general setting up to a certain point after which we have illustrated our approach in the simple context of a univariate exposure for two particular cases. The simulations seem to indicate the equivalent or superior performance of the posterior median estimator compared to the others that were studied here in the analysis of case-control data with imprecise exposure measurements. We have discussed the fact that the assumption of a normal retrospective model compared to the assumption of a prospective logistic regression model is somewhat stronger, though in an applied situation it should be reasonably easy to verify the retrospective assumptions. Also, transformations on exposure variable to approximate 30 normality may often satisfy the retrospective model. Given the promising results of this simple case, the next step will be to make this Bayesian approach more practical by incorporating the realities of complex data sets. Thus in the next chapter we will generalize our methodology to deal with multiple covariates, some of which could be binary. 31 PMED 1.0 Measurement error SD AWH 1.0 Measurement error SD Coverage Probability 1.0 Measurement error SD Figure 3.1: Comparison of the PMED and the AWH estimators in the known r case. The first two panels give (0.1, 0.3,0.5, 0.7, 0.9) quantiles of the estimators as a function of y/r. The median is displayed with the solid line. The longdashed line indicate true value of (3. The AWH estimator is only considered when r is sufficiently small that P(X < 0) does not exceed 0.05. The third panel gives empirical coverage probabilities of 80% credible intervals, with error bars corresponding to plus and minus two standard errors. 32 PMED AWH 1.0 0.5 o.o 1.0 Measurement error SD Measurement error SD RWS1 RWS2 1.0 Measurement error SD 1.5 2.0 1.0 Measurement error SD Coverage Probability 0.5 1.0 Measurement error SD Figure 3.2: Comparison of the PMED, AWH, RWS1 and RWS2 estimators in the unknown r case. The format is as per Figure 3.1 33 2.0 Chapter 4 Multivariate Exposure The previous chapter established the validity of our work in a simple setting. The next logical step is to consider the more realistic situation in which we would be faced with multiple covariates. Developing this multivariate methodology will undoubtedly mean an increase in the level of complexity of the algebraic manipulations, the distribution identification and the sampling procedures from these distributions. The present chapter will illustrate the development of this multivariate approach, first with the general multivariate exposure methodology, then with the introduction in the model of binary covariates, and finally with the development of a sampling method for covariance matrices with partially fixed diagonal elements. 34 4.1 Methodology The initial setting will not be much different from the univariate exposure one. We are still dealing with case-control data for which the cases and the controls can be regarded as independent samples. The difference is that we are now faced with covariable matrices; {x ij}"=ij=i from X\D = 0 (the controls), 0 and {xuj}™l'ij =1 from X\D = 1 (the cases), where d is the dimension of the exposure covariate. Our inferential objective remains the same, that is, to learn about the prospective relationship D\X. But again, to conduct a Bayesian analysis, we will need a likelihood based on the sampled distribution X\D. Therefore, assuming all the covariables are continuous, we consider once more the normal discriminant analysis model: X\D = 0 ~ N(fio, £ ) , X\D = 1~ N(u 0 E ). (4.1) = a + 8 x + x Cx, (4.2) u x which still 'implies' the prospective model of the form logit{P(D = 1\X = x)} T T where 8 = E f Vi — EQ VOUp until now, this is exactly as in chapter 3 with the exception that we do not restrict X to be unidimentional. Let's assume we have a common covariance matrix for the cases and the controls, E = E = £ i , (4.1) becomes 0 X\D = 0 ~ N(fio, E), X\D = 1~ N(fi U E). (4.3) Leading us once more to a standard logistic regression for the prospective relationship logit{P{D = 1\X = x)} 35 = a + 8x, (4.4) where this time the parameter of interest 3 = (p,\ — / x ) E . _1 0 The measurement error arises in a similar fashion. It is still assumed to be nondifferential and follows the same model X*\X ~ N{X,T) (4.5) where r is now a d x d covariance matrix. Consequently, the observed data arise from X*\D 0~ = + £), N(fi ,T 0 X * | D = l ~ J V ( / i i , r + E). (4.6) We assume the error covariance matrix r to be known from an external validation of the measurement process. As in the univariate case we base our inference on the noninformative prior 7 r ^ ^ i , E ) o c | ( r + E)|-( )/ . d+1 (4.7) 2 0 This is a standard noninformative prior for variance component models. Our prior in (4.7) and our likelihood obtained from (4.6) yield the following posterior distribution: 7r(^,^\x*) oc |(E + r ) | - ( d+1 )/ 2 no x , II K + ^ ) l " ^ P S II 1 / 2 x K + T)\cx |(E + r ) | E 1/2 i ' (-j(x* 0 - /*)'(£ + r)-\x* - exp ( - - ( ^ - ( n o + n i + d + 1 ) / 2 ( 0 Ml ) ' ( E + T)-\x* n -^ no \ - « £ « o - A*o)'(S + r ) " ^ - // ) 1 0 ( x e x p 1 - 9 V z 711 5><i 36i=i N + Y\<i T >. - / /io)j - )j M l cx |(E + r ) | - " ( 0 + n i + d + 1 ) / 2 x exp{-(l/2) trace((E + r ) 5 ) } _1 0 (4.8) x exp{-(l/2) trace((E + r ) ^ ) } - 1 where for c = 0 or 1 (controls or cases) \ (/4 -*: ) 1} E 04 ( (1) 2 (1) - *; ) (1) 2 + E i=l V ( *W _ T*W)(x* (d) T - X< h d • •• (X* - T*^) W 2 where x*^ is a scalar representing the mean of the realizations of the jth covariate. We can integrate out JJLQ and \i\ from (4.8) which gives us the marginal posterior distribution of the variance components TT((Z + T)\X*) oc |(E + )|-(no+m+d+i)/2 r (4.9) x exp{-(l/2) trace((E + r ) 5 ) } _1 where n 1 S = E c=0 c E ( 2 i=l ' i c X ( n 1 (4.10) c)i ic ~ c) X — X (4 ( 1 ) - i K f {l) *(i) _ =»(i)w *< (o _ =*co) c E E c=0 i = l (x* -x*W)(x* W \ K-^ic W x c JK^ic -x*^) x c 37 ) ( 4 W - ^ ( D ) ) 2 So we have (S + T)\X* = G\[{G - T) positive definite], (4.11) where G has an inverse Wishart distribution with n + ni degrees of freedom 0 and scale matrix S. Sampling from G|[(G — r) positive definite] can be implemented by repeatedly sampling G and subtracting the known error covariance matrix from it until we get one that is positive definite. The posterior conditional distribution for /x and Hi once E has been 0 sampled is easily obtained from (4.8), both follow multivariate normals of dimension d Mo|E,r,x* ~N{x* ,n - (?: + T)), 1 0 0 (4.12) Using (4.11) to sample E and (4.12) to get u, and pi gives a reasonably simple 0 algorithm for exact posterior simulation in a variance components model. From this algorithm we can get a sample of our <i-dimensional parameter of interest /5 = (>L*I — /xo)E . This multivariate exposure setting will also call on our -1 PMED estimator, in this case the sample of /?'s will be split in d subsamples for each exposure variable and a median will be computed for each of these subsamples. 4.2 Binary Covariates Our approach is now applicable to multidimensional exposure variable, but an important assumption that we have made is that all variables are continuous. 38 However, in applied situations it is common to be faced with binary exposure variables, for which individuals are classified to be either exposed or unexposed. Even more common is a combination of both continuous and binary exposures. For this reason, it should be a priority for us to integrate this type of variable in our model. Since our methodology has so far been developed for continuous variables, we will try to find a continuous representation of these binary covariates. A method that could be used to obtain such a representation is described in Albert and Chib (1993), where they regress a unidimensional binary response variable on a set of covariates. They suppose N independent binary random unidimensional variables Yi, • • • > YN are observed, where Yi has a Bernouilli distribution with probability of success pi. These pi are related to a set of covariates that may be continuous or discrete. They define a binary regression model as P i = H(xl8),i = l,.-.,N, where 3 is a k x 1 vector of unknown parameters, xj = (xn, • • •, Xik) is a vector of known covariates, and H( ) is a known function linking the probabilities Pi with the linear structure of 3. Taking H to be the standard normal cdf $( ) yields what is called the probit model. In order to get this continuous representation they introduce N latent variables Z\, • • •, Z , where these Zi N are independent N(xj8,1), and define Yi = 1 if Zi > 0 and Yi — 0 otherwise. It can easily be shown that the Yi are independent Bernoulli random variables with & = P(Yi = l) = $(xJ3). 39 The joint posterior density of (3 and Z = (Zi, • • •, Z ) given the data N V = (yi, • • • ,VN) is given by n{P,Z\y) = Cn(P)f[{I(Z >0)I(y i = l) + I(Z <0)I{y i i i = 0)} i=l x<f>(Zi;xjB,l), where 7r(/3) is a prior on 3, ;/J.,a ) is the N(fi,a ) 2 2 pdf, I(X A) is an G indicator function that is equal to 1 if the random variable X is contained in the set A, and here and henceforth C is a proportionality constant. It is very difficult to sample directly from this distribution. But computation of the marginal posterior distribution of (3 using the Gibbs sampling algorithm requires only the posterior distribution of (3 conditional on Z and the posterior distribution of Z conditional on (3, and fortunately these full conditional distribution are of standard forms. The posterior density of (3 conditioned on Z is given by TT(/%, Z) = Cn(f3) f 4>(Zu xjB, 1), (4.13) i=l and the posterior density of Z conditioned on (3 results in the random variables Z i , • • •, Z N being independent with Zi= < z*|z;>o Z*\Z*<0 4.2.1 ,ifj/i = i where Z* ~ N(xf(3,l). (4.14) , if ^ = 0 The Gibbs Sampler We mentioned above the Gibbs sampling algorithm. Here is a brief review of this useful technique. Suppose one is interested in simulating from the posterior distribution of 6 partitioned into the vector components 6 = (# • • •, 9 ). 1; 40 P Although it may be difficult to sample from the joint posterior, suppose that it is easy to sample from the conditional distributions n(9 \{9j, j ^ k}). To k implement the Gibbs sampler, one starts with initial guesses of the 8 say i: 8^, • • •, 8^ and then simulates in turn 9? from *{0i\{6f\j±\}) from *(e \0[ \{9?\j>2}) 1 2 (4.15) eW from n(9 \{9f\j <p}). p The cycle in (4.15) is iterated t times, generating the sample 9^> = (0i \ • • •, 9f>) As t approaches infinity, it can be shown that the joint distribution of 8® approaches the joint distribution of 8, in practice this convergence is usually quite fast. So for sufficiently large t, say t*, can be regarded as one simulated value from the posterior distribution of 8. Repeating this process m times yields the sample {(#[* \ 8^-,\- • •, 8 p j j = 1, • • •, m}, which can be used for statistical inference. In practice, instead of restarting the algorithm once the convergence is obtained we will just keep for our posterior sample the 0\f for t = t*, • • •, t* + m. This concludes the review of the Gibbs sampler. Returning to the method suggested by Albert and Chib, now given a previous value of 3, one cycle of the Gibbs algorithm would produce Z and 3 from the distributions (4.14) and (4.13). Advances developed from this approach are presented in Chib and Greenberg (1998), in which they construct a multivariate probit model; the binary response variable is allowed to be multidimentional. This generaliza41 tion to multidimensional response makes this technique applicable to our own analysis. 4.2.2 Application We are faced with ^-dimensional binary data Y ^ = 0 or 1, i = 1, • • •, rij, where j is either 0 or 1 for controls or cases and k = 1, • • •, b, where b is the dimension of the binary covariate we want to include in our model. We must introduce the latent variables Z\jk, • • •, Z jk, where the Zij are independent multinormal nj N(r)j, £ 2 2 ) . We are using 77 and not xj(3 as the mean because our application is not in a regression context. The reason for the £ 2 2 notation of the covariance matrix will be explained later on. It is necessary for identifiability reasons to fix as l's the diagonal elements of this covariance matrix. This way the variance of Z^ is 1 as in the univariate case from the Albert and Chib method presented earlier, but the covariance parameters are free. In their multivariate development Chib and Greenberg simply refer to this as the necessity of £ 2 2 to be in correlation form. Then we define = 1 if Z^k > 0 and Y^ = 0 otherwise. The are independent vectors of O's and l's with pjk = P(Yijk = 1) = ^(Vjk), where <$>( ) is the standard normal cdf, which gives us the model known as the probit link model. Since the yi are observed we can estimate the pj and using the probit k link obtain estimates for the r}j as starting values for our Gibbs sampler. As k a starting value for £ 2 we can use l's for the variance (on the diagonal) and 2 O's for the covariance elements (off the diagonal). 42 So the next step is to go ahead and implement our Gibbs sampler. First by sampling Z from its posterior distribution conditioned on 77, £ 2 2 and Y, then by sampling £ 2 conditioned on 77 and Z, finally by sampling 77 conditioned on 2 £ 2 2 and Z. Repeating this enough times will at some point be equivalent to sampling from the joint posterior. In order to proceed with this sampling scheme we have to determine these conditional densities. The random variables Zij, • • •, Z j will be indenj pendent with f>dimensional "truncated multinomial" distributions Zij\y,r], E22 ~ Ar(77j,£ 2 2 ) with Z ijk > 0 if y ijk = 1, and Z ijk < 0 if y ijk = 0. (4-16) This way the initial binary covariates are represented by continuous variables following a multinormal distribution. It is therefore possible to include binary exposures in our model, and their connection to the normal distribution will make covariance estimation feasible. The rjj, which are vectors of size b, will also follow a multinormal distribution just like /x and Hi in (4.12). In fact, 0 when we will juxtapose the Z's to the observed continuous X*'s to deal with both the continuous and binary variables simultaneously, the 77^ will simply be the last b components of the \ij vectors. So we can sample 77, from the following r]j\Zj, £ 2 2 ~ N{Zj^ n- ) 1 22 (4.17) Finally we need to determine the conditional distribution of £ 2 2 • We know 43 from (4.11) that its distribution will somehow be related to an inverse Wishart, but we don't know yet how the correlation form condition (l's on the diagonal) will influence the actual distribution from which our sample has to be drawn. The next section will concentrate on determining this distribution and will present an approach for sampling such a covariance matrix. 4.3 Covariance M a t r i x W i t h P a r t i a l l y Fixed Diagonal The addition of these binary covariates in a continuous form will not change model (4.6) from which the data arise. The Z mentioned in the previous section will just have to be juxtaposed to X*, forming a new X* consisting of c continuous exposure variables and b binary exposure variables in a continuous form, where c + b = d the total dimension of the exposure. For later use, the binary covariates will always be placed at the end of the exposure vector; so we will have first the c continuous X* and then the b binary ones. This will yield a covariance matrix E splitted in the following way E = y where E n is c x c, E i 2 ^11 ^12 2^21 ^22 (4.18) J is c x b, E i is of course the transpose of E 2 i 2 and E 2 2 is b x b. We also make the important assumption that there is no misclassification for the binary covariates, therefore the errors in variables will only 44 be occurring for continuous variables. Also, with no influence on our previous work, some of the continuous exposures could be measured without error. This disposition of the continuous variables followed by the binary ones combined with the no-misclassifications assumption means that the lower diagonal of the covariance matrix E (the diagonal of £22) i U consist in a series of b l's, as w will the lower diagonal of (£ + r), since the lower diagonal of r will consist of a series of b O's. The major change comes from the new prior we will have to use. In (4.7) we suggested Jeffrey's noninformative prior, but now part of the covariance matrix is known. Thus we try the prior T T ( ^ I , £ ) oc |(£ + r ) | - ( d+1 )/ /(diag(£ + r ) = 1), 2 22 22 (4.19) where 7( ) is an indicator function equal to 1 if the diagonal elements of £22 are l's and equal to 0 otherwise. Do note that r 2 is a matrix of O's since 2 we have assumed no misclassification for the binary covariates. The addition of this new prior will operate a slight change in equation (4.9), the marginal posterior of the variance components now becomes 7r((£ + r)|X*) oc |(E + T)|-( N O + N I + D + 1 )/ 2 x exp{-(l/2) trace((£ + r)'^)} x/(diag(£ 2) = 1), (4.20) 2 where S is as defined in (4.10), thereby transforming (4.11) to (£ + T)\X* = G\[(G - T) positive definite and diag(G ) = 1)], 22 45 (4.21) Where G still has an inverse Wishart distribution with n + ri\ degrees of 0 freedom and scale matrix S. Sampling (S + r) from (4.21) is not a trivial task anymore. We need to introduce a matrix decomposition presented in Le and Zidek (1992). Take a covariance matrix E which follows an inverse Wishart distribution with m degrees of freedom and scale matrix which can be partitioned in the following way E H ( E = V -*21 ^22 where E n is u x u and E 2 is g x g. Through the Bartlett decomposition E 2 can be written as ^ E i | + '0E V' 2 '0^22 ^ T 22 V E2 '0' (4.22) ^22 / 2 where E i | = E n — E ^ E ^ ^ i , a uxu matrix and ip = E12E22 , auxg 1 2 matrix. The inverse Wishart distribution of E can be equivalently presented in terms of the new parameters (E 2, Ei|2, ip) as 2 E | * , m ~ IW(ty 2, 2 2 m-u) 2 Ei| |*,m ~ /W(*i| ,m) 2 ^|E!| (4.23) 2 2> ~ N{V *£, 12 E!| <g> * - ), 1 2 2 2 where IW(A, B) stands for inverse Wishart with scale matrix A and degrees of freedom B, E 2 is independent of (E | , ip), and <g> symbolizes the Kroneker 2 X 2 product. 46 This decomposition can obviously be applied to our own covariance matrix £ + r which has to be sampled from (4.21), and the partitioning will be the one suggested earlier in (4.18) T = (£ + r ) (£ + r ) i ^ E n + Til Ei2+Ti2^ ^ y £ l + 21 £ 2 + 22 ^ (£ + r) i (£ + r ) r T 2 where £22 + T 2 2 2 J u 2 \ 2 22 j = £22 since we assume there is no misclassification for the binary covariates. Therefore u will be equal to our number of continuous variables c, g equal to the number of binary variables 6, m equal to the degrees of freedom n + ni of our covariance matrix's inverse Wishart distribution and 0 \I/ equal to the scale matrix S of this same inverse Wishart distribution. Based on (4.22) £ + r can be reparametrized, leading to ( + T) T + IPZ IP T M2 22 VE22 ^ = £22^ ^22 J However, £ + r does not follow exactly an inverse Wishart distribution, since it has the extra conditions that it must be positive definite once r is subtracted from it and the elements of the lower diagonal (diag(£ 2)) have to be all l's. 2 Work has to be done to see how these conditions influence the distributions suggested in (4.23). First, the positive definite condition can be verified once the sampling is done, so this condition will not affect the distributions we will be sampling from. The diagonal elements set to be 1 will however make the sampling procedure more complex. For later use let us introduce the following notation 47 IW*(Scale, df) = G|diag(G) = 1, where G ~ IW{Scale, df). Given the independence between £ and ((E + r)i| , 0), sampling (E + , 2 2 2 r)i| and ip will always be trivial once the non-diagonal elements of E 2 2 2 are identified. The case where we have only one binary variable is quite simple. E 2 2 would be a scalar set to be 1, independent of ( £ + T ) I | and ip which could 2 be sampled from the distributions given in (4.23) without being affected by £ 2 2 = 1. Once these are sampled, E + r can be computed from (4.22), then r subtracted so that the positive definite condition can be verified. If the condition is not met we need to resample these parameters. The process becomes more complex when we are faced with more then one binary covariate, in which case b > 1 and E 2 2 is not a simple scalar equal to 1, it has off-diagonal elements. In such a case an iterative process will have to be used. The initial step will be to perform the decomposition as it was done in the "1 binary covariate" case. This time the decomposition gives us a matrix E 2 2 of dimension b > 1. From (4.23) we know that E 2 2 follows an inverse Wishart with n + n\ — c degrees of freedom and scale matrix 5 , and 0 22 its diagonal elements are all l's. To keep the notation as simple as we can we will now take E* = E 2 2 and S* — 5 . Given E*'s distribution, the next step 22 is to perform the same decomposition on E*, in which case the split has to be done in the following manner E* = ^ v* v* ^ 2 \ ^21 ^22 48 / so that E = 1 and E 2 2 is a (b - 1) x (b - 1) matrix still with l's on its diagonal. From (4.22) E 2 2 (now E*) can be written differently as n E* = -1* J 22 J where E* = E |2 n - E ^ E ^ E ^ , but E? + ^ |2 2 0 I T = 1, and 0 = E * ^ , a 1 vector of dimension 6—1. We have to keep using the decomposition until the bottom right corner, E for the above decomposition, is a scalar set to be 1. 2 2 So if b = 2, two iteration are needed; the decomposition has to be performed twice in order to get E 2 2 = 1, once to split the continuous and binary variables and once more to obtain the scalar form of the bottom right corner. We saw in (4.23) that the decomposition specifies the distributions of the different parameters obtained from the decomposition. These distributions will be slightly different from the second iteration and onward. 4.3.1 A Simple Case We will first deal with a fairly simple situation, that is the case where the number of binary covariates b is 2. In such a case the initial decomposition would lead us to (E + r ) | + V E V ' / E + 1 T = V S 2^ 2 2 2 T r VS 2 2 ^ (4.24) ^22 2 where (E + r ) i | + -0E f/' is a c x c matrix (remember that c is the number T 2 22 of continuous variables), and E 2 2 is a 2 x 2 matrix with l's on its diagonal. 49 From (4.23) we know that E | S , n , n i ~ IW*(S ,n 22 0 22 0 + n - c) x (£ +i-MS, n ni ~ i W ( % , n + n ) 0> 0 V>|(£ + r ) ^ , 5 - N(S S£, (S + r)!| ® S^ ), 1 12 where E 2 2 is independent of ( ( £ -I- (4.25) x 2 r)i\ ,ip). 2 For simplicity we will again use E* = E 2 2 and S* = S . 22 We now perform the decomposition on £*, this time resulting in ( Et + </>£^ | 2 £* = V where both £ 2 2 E ^> T a r e 2 2 2 0E$ ^ 2 E 22 and £*| + 0 E ^ T 2 2 (4.26) J equal to 1. Once again we use (4.23) to obtain E =1 2 2 E* |S*, n , m , c ~ IW(S* , n + n - c) |2 0 l]2 </»|Et , s* ~ N(S* S; \ |2 12 2 0 (4.27) x E; ® ^ v ) , 1 |2 the last two distributions are subject to the condition, E * | + </>£ 0 = 1, T 2 22 which can be reduced to E * | + (f> = 1, making the condition independent of 2 2 E which will not be the case for situations with more than 2 binary covariates. 2 2 E ^ ' s distribution is also simplified, given the fact that it is a scalar the inverse 2 Wishart distribution now becomes an inverse gamma, and its shape and scale parameters respectively are (n + n\ — c)/2 and S*\ /2. 0 2 In order to find the distribution for E ^ and <f) under this condition we 2 50 use the following reparametrization: Z\ = E ^ - h ^ and Z = (j). The Jacobian 2 2 for this transformation is J = 1. We already have the joint distribution for E*| and 4> 2 = /s; ,*(£I|2,0) |a /(^l i| )/( i|2)i s s 2 from which we can derive the joint distribution of our new variables Z and x Z 2 z) = fz z (zi, lt 2 /EJ ,*(*I - 2 | 2 4, z )J. 2 We now want to obtain the distribution of Z given Z\ = 1, we can not 2 get this exactly but do work out a function proportional to this conditional distribution: fz \z =\{z \z 2 l 2 l = l) = f ,z (zi,z )/f (zi Zl 2 oc OC 2 Zl = 1) (4.28) f =i,z (l,z ) Zl (1 - z 2 2 2 x exp v 2 )-(™+3)/2 2 (l-^ )/m 2 j where m = n + ni — c. This is not a standard distribution, but still we want 0 to sample from it. In order to do so we use the rejection sampling technique as presented in Ross(1997), to sample Z from a certain density proportional to/(): 1. Generate Y from a density proportional to g( ), where g( ) is a function that bounds /( ). 51 2. Generate U from a standard uniform distribution. 3. If U < f(Y)/g(Y), take Z = Y, else go back to 1. Therefore we need to find a bound for (4.28). We proceed in doing so by splitting (4.28) into two parts A = (1 - Z 2)-(m+3)/2 /m + e x p S* /m-l-(S* /m) (1 - zl)/m 2 22 l2 To obtain an upper bound for A we note that argmax (x~ exp(6/o;)) = b/a, (4.29) a the resulting bound is a constant obtained by plugging in A 777 (1 - 4) = —^( *n/ S m + 22/m - 1 S {Symf). The bound for B is which is proportional to a normal density with mean S* /m and variance 1/m, 2 making it easy to sample from the bounding function. Once Z is sampled, this means we have an estimate for </>, therefore we 2 can complete E* or E 2 in (4.26). Then use the distributions given in (4.25) 2 to sample the necessary parameters to complete E + r in (4.24). Following carefully all of the previous steps we are able to obtain an estimate of the covariance matrix with parts of it being fixed, then all that needs to be done is to subtract r and verify the positive definite condition. 52 4.3.2 More than Two Binary Covariates We must now be able to deal with a greater number of binary covariates, this will increase the number of iterations and will again slightly alter the distributions. Let's assume we still have c continuous variables and 6 binary ones, but this time b is greater than 2. We proceed as in the previous simple case and operate the initial decomposition and obtain a matrix just like in (4.24) except E 2 2 is now 6 x 6 , 6 > 2. The distributions in (4.25) still stand, therefore we can go ahead with the second decomposition as in the simple case and obtain £* just like in (4.26) with the exception that even with this second iteration E 2 2 is not yet a scalar equal to 1, it is now a (6 — 1) x (6 — 1) matrix. There is now a slight transformation to the distributions we had obtained in (4.27) £; |S*,m~ IW*(S* ,m-l) 2 EI | 2 22 | 5 * , m ~ / W ^ , m) (4.30) <^|Et|, s* ~ N(S* S* -\ E ; , ® ^ v ) , 1 2 where m = n 4- n\ — c, E 0 12 2 2 2 22 is independent of (E*| , 4>) and these are this still 2 subject to the condition £*| + (f>T, t ( 2 >T = 22 1- Since this time E^ is not yet a 2 scalar equal to 1, the condition EJ| + (pT, (f} — 1 can not be simplified and T 2 22 therefore depends on E^ which will once more influence our sampling. 2 We now want to sample from a density proportional to /(E; )/(Et )/(^|E* )/(diag ( s y = I ) / ( E ; + 0E^20 = 1). T 2 |2 |2 |2 First we will consider the distribution of (E , E*| , (/»|diag(E ) = 1) to be 22 53 2 22 g(£2 )/(£i| )/(0|£i| ), where #(£22) i 2 2 sn o t t n e 2 same as / ( £ ) . Thus we want 2 2 to sample from the density proportional to 2(£y/(£i )/M£t|2)/(£t|2 + ^ * | 2 2 < t > 2 1), = T (4-31) where both /(£*| ) and /(<^|E^ ) are known. We now reparametrize by in2 2 troducing Z\ = E*| + ^ E 0 and Z = <j>, once again the Jacobian for this r 2 2 2 2 transformation is J = 1. From this (4.31) becomes 0(E^)/(zi - z L* z )f(z \z 2 22 2 2 - z T,* z )I(z! 1 2 22 = 1), 2 therefore we want to sample from a density proportional to < 7 ( £ y / ( l - Z 2 E 4 ) / ( z | l - z \? z ). (4.32) T 2 2 2 2 22 2 By putting in (4.32) the respective distribution functions and doing a little algebra we get that (4.32) is proportional to 2(1 — z E z ) r 2 ( {z - S\ S* )S* {z 2 H 22 22 2 S* S* ) \ x 2 22 1 2 12 2(1-z £^) j' 2 v r 22 ( 4 - 3 3 ) ' v B To sample from this distribution we can apply the rejection sampling technique, therefore we need to bound (4.33). To find an upper bound for A we use (4.29). The resulting bound is a constant obtained by plugging in A 1-z^zT = S* /(m + 3). l2 B can easily be bounded by exp (- -(z l - S{ S* )S* {z l 2 2 22 22 54 - S{ S* ) ) l 2 2 22 T , which is to a near constant a multinomial distribution of dimension b — (i — l) with mean vector S* S 2 l a n 22 d covariance matrix S ~ , where i is the number l 22 of iterations performed to this point. This will again make it easy to sample from the bounding distribution. The whole bound is x exp ^—— s{ s ) £22(32 — s* s ) ^ > X 2 X 22 2 T 22 thus in both the bound and the distribution we want to sample from in (4.33) we have </(E ). These will cancel out in the third stage of the rejection 22 sampling, avoiding the task of determining them. However we see that in (4.33) we need E 2 2 , therefore we need to iterate the process and apply the decomposition on £ 2 2 until we have reached its bottom right corner which is 1. Once this is done we can move up one dimension at a time and use the preceding technique to sample the off-diagonal elements of the covariance matrix. As an example, suppose the number of continuous variables c is 1, and that the number of binary covariates b is 3. We apply the decomposition once to split the continuous part of the covariance matrix from the binary one. We apply the decomposition once more resulting in a 2 x 2 Ej^ matrix, then apply the decomposition one last time to obtain (E ) 2 = 1- Once this is done we 22 2 can work our way back up. Knowing (££2)22 = 1 we can sample from (4.33) with m = UQ + ni — c — 2 using the rejection sampler. This will give us the off-diagonal elements of E22 which we can use again in (4.33) with this time 55 m = n + n — c—1 giving us the off-diagonal elements of E 2, and now that 0 1 2 we have £22 we can use it to sample from the last two distributions of (4.25). It is now possible for us to run our Gibbs sampling algorithm of Section 4.2.2 since all the conditional distribution needed have been determined. 4.4 Discussion We have seen that in the generalization of our approach to a multivariate setting it is feasible to incorporate binary variables to our model. The use of a variation of the probit regression model enables us to get a continuous representation of these binary covariates. This continuous representation is obtained by generating values from truncated normal distributions which have their variance set to 1. Then an iterative process which makes it possible to estimate a covariance matrix with a partially fixed diagonal allows us to compute estimates of our parameter of interest which is B = main parameter of a standard logistic regression. 56 — yu )£ 0 _1 the Chapter 5 An Application to Real Data The work presented in the previous chapters has resulted in an approach that should be generally applicable to actual data. One of the generalizations allowed for multivariate exposures and another made the addition of binary covariates to the analysis possible. We will now verify via the analysis of real case-control data if our methodology is suitable to applied situations and how results are influenced by our correction for error. The present chapter will first describe the data we will be working on and expose the results obtained in a previously conducted analysis. We will also verify if the basic assumptions of our method are met and if so, will proceed with the analysis using our own approach. 57 5.1 Bladder Cancer Case-Control Data The data we will be conducting our analysis on is a fraction of a dataset which was obtained for a large-scale study. The objective of this study was to identify occupational cancer risk factors. Information on occupation, smoking and alcohol consumption histories was collected by means of a self-administered questionnaire from male cancer patients aged 20 years and over ascertained from the British Columbia population-based cancer registry from 1983 to 1990. To estimate smoking relationships for types of cancer known to be strongly associated with cigarette smoking, patients suffering from cancer types with no such associations were used as controls. So the control group consisted of all cancer types with the exception of those known to be associated with smoking. The analysis was performed by matching cases and controls on exact age and diagnosis year. The patients smoking histories were measured in pack-years (number of years of smoking 20 cigarettes a day). Odds ratios for different stratifications of these pack-years were estimated using conditional logistic regression. Based on this analysis, a statistically significant relationship was found between bladder cancer and cigarette smoking; the odds ratio would get bigger as the pack-years increased. This analysis was conducted assuming no measurement error was present. The details of this study are presented in Band et al. (to appear in the Journal of Occupational and Environmental Medicine). In this large-scale study many types of cancer were considered, but the fraction we focus on deals with bladder cancer and contains information on the 58 disease status (case or control), the age, the smoking history and the diagnosis year. Here we apply the Bayesian approach to re-examine the relationship between cigarette smoking and the risk of developing bladder cancer, taking into consideration measurement error of the smoking history information. 5.1.1 A Closer Look at the Data In light of these results we are interested in seeing if our method would lead to similar conclusions. A first step in doing so is to take a closer look at the data and see if the assumptions required by our approach are met. Our dataset consists of 1038 cases and 7006 controls. We have, for each individual, their bladder cancer status (healthy:0, diseased: 1), their age (in years), their smoking history (in pack-years) and their year of diagnosis (from 1983 to 1990). We initially get histograms for the age and the smoking history variables. The one for age has a longer and thicker left tail, but a transformation could probably approximate normality. The histogram for the smoking history, however, reveals a problem of greater consequence. There is an important concentration of 'never-smokers' or of individuals whose smoking history is 0 pack-years. In fact there are 1782 'never-smokers', of which 1658 are controls and 124 are cases, together they represent close to 20% of the whole group. Therefore, no transformations could lead us to an approximately normal distribution for the smoking history. Since the data in the present format fail to meet the normality as- 59 sumption, we need to consider a slightly different objective. We now consider 'smokers' only, or the individuals that did not have 0 pack-years for their smoking history. We examine the dose-response, that is, how the risk of bladder cancer is influenced by the amount of smoking a person has done throughout her life. Since we ignore the 'non-smokers' we are now left with 914 cases and 5348 controls. Once more we proceed to construct histograms for the age and the smoking history variables, these are shown in Figure 5.1. Based on these histograms, it seems age is quite close to normality, however smoking history will obviously require a transformation in order to approximate normality. The two bottom graphs of Figure 5.2 indicate that although it is not perfect, a logarithmic transformation of the smoking history could be considered to be approximately normal. To see if the age and the logarithmic transformation of the smoking history can both be used as continuous variables, we proceed with a logistic regression of the disease status on different stratifications of these two variables. The parameters obtained for these regressions are plotted against the increasing stratifications in Figure 5.3. We notice that for the regression on age the coefficients seem to follow a curvilinear trend, perhaps a logarithmic transformation of the age would correct for that. The coefficients for the logistic regression on the logarithmic transformation of the smoking history however increase in a relatively linear fashion, indicating that the logarithm of the pack-years can be used as a continuous variable. Given the non-linearity of the coefficients for the age, we try the logarithmic transformation of the age. We 'logistically' regress on stratifications of the logarithm of the age, a plot of 60 the obtained parameters shown in Figure 5.4 increases in a slightly more linear fashion than for the untransformed age, indicating that it is reasonable to use the logarithm of the age as a continuous variable. We also need to verify the normality assumption for this transformation of the age. The upper panels of Figure 5.2 indicate that normality of the logarithmic transformation of the age is reasonably acceptable. 5.2 The Analysis Based on the verification of assumptions done in the previous section we can now proceed to do an analysis of this bladder cancer case-control data using the methodology presented in the earlier chapters. The variables we use are the following: The logarithmic transformation of both the age and the smoking history, that we respectively name X\ and X , and a binary variable 2 indicating if the cancer patient was diagnosed before 1987 that we name I 3 . That last variable mainly serves as a check to see if diagnostic methods have changed between these two four years periods from 1983 to 1986 and 1987 to 1990. Using an indicator for each year from 1983 to 1990 would have given us eight binary variables, which our approach can handle, however a normal representation of these variables would not have been appropriate given that a 1 for one of them means 0 for all the others, setting the sum to 1. The normal representation which sets the sign of the normal given the binary response can deal with correlation but not so much as to fix the sum. The analysis is thus 61 performed following the model: logit{P{D = 1\X where X = (Xi,X , 2 = x)} = a + faXi + BX 2 2 +BY Z Z (5.1) Y ) and a is not estimable from case-control data alone. 3 Remember that the original analysis was performed assuming no measurement error was present. However, it is important to note that the information on the smoking history of these patients was obtained from a selfadministered questionnaire. For this reason smoking history is believed to be reported with some uncertainty. Some people convinced that smoking is undoubtedly the cause of their cancer may tend to overestimate it. Others who put the blame on different factors like pollution or work environment could tend to undermined the effect of their smoking habits on their illness and thereby underestimate it. We are interested in examining how different error levels in this variable would influence the results. In order to see how the parameter estimates would be affected, we conduct the analysis three times, each time assuming a different level of error on the smoking history variable. In the first case we assume no error is present, in the second case we assume a 10% error and finally we assume a 25% error. Note that the error when assumed known was incorporated in our method by assuming that the true data covariance matrix E was inflated with the addition of the errors' covariance matrix r, leading to the observed data covariance matrix E + r. We are considering the smoking history data to be arising from the following: log(X*) = log(Z)+log(X ), 2 62 (5.2) where X is the observed pack-years, X is the true pack-years and log(Z) is 2 2 the error component, which is equivalent to: X* = Z x X . (5.3) 2 In (5.3) the error component Z is now a multiplicative factor of the true packyears, so obtaining the desired error levels can be done in the following way: No error: r 22 =0 10% error: 1.1 = exp(.l) and 0.9 = exp(-.l) r 22 = (.l) 25% error: 1.25 S exp(.25) and 0.75 ^ exp(-.25) => r 2 = (.25) 2 22 Note that only the (2,2) element of r is affected by the error since we assume error only for the smoking history and that this error is not correlated to the other variables. Our correction therefore consists in subtracting from the estimated observed data covariance matrix £ + r the covariance matrix r we have assumed for the error, all this within runs of our Gibbs sampling algorithm because of the presence of a binary variable. Results of the analysis for each of these error level assumptions are summarized in Figure 5.5, in which can be found histograms of the posterior distribution of the parameters B ,B X 2 and 83. We can see from the histograms that the parameters B and /3 , respectively related to the logarithm of the x 3 age and the diagnosis period, do not seem to be influenced by the different error levels we have assumed for the pack-years. The range and the centrality measures of the posteriors for both B\ and B^ remain relatively constant while the error level is changing. Now what happens to the parameter B 1 It is the 2 63 one we expect should be mainly affected by the assumed error on the reported pack-years and in fact it is. Based on the histograms of Figure 5.5 and some summary statistics for each of these posteriors we can detect an increase in fl 2 as the error is increasing. This is exactly what was expected since the estimated parameters are obtained from the difference between the estimated case mean and control mean multiplied by the inverse of the estimated covariance matrix (fl = E ( / i i — fj, )). _ 1 0 When this covariance matrix is estimated from the observed data, it will be overestimated if there is error in the data. Therefore when correcting for the error the estimate of the covariance matrix will be reduced, which means division by a smaller value leading to an increased estimate of the parameter fl. As it was mentioned earlier, the interpretation of the parameter fl is that it is the coefficient in the prospective log-odds ratio and that for rare disease the relative risk of disease can be approximated by the odds ratio. Thus, the basic conclusion that can be drawn from the obtained results is that the risk of getting bladder cancer given the smoking history is underestimated if there is in fact error in the observed data. Meaning that an analysis performed assuming no error to be present gives conservative estimates of the relative risk of disease. It should be noted that in the present analysis it was impossible, as it is for many exposures, to get a validation sample in which we would have had both the true smoking history and the observed one. We could only guess on what the error was. Thus, our approach here was mainly used as a validation, 64 but the potential of the methodology should not be overlooked. 65 Histogram for Age CO o d CM o •••••••••I o d o d ^ III illll illlll illlll Hil llll 111fill111 111 Ii 4 :l •: h it ;t i| 20 40 60 80 100 Years Histogram for Smoking History o CM o d o d o o d o o d o d •111 i I ij :. t, 50 100 150 200 Pack-years Figure 5.1: Histograms for the distribution of the age and smoking history variables 66 Figure 5.2: Diagnostic plots to verify the normality of the logarithmic transformation of the age and smoking history variables 67 Age o o ie O o O CM o 2 4 6 8 10 12 14 Stratifications log(Smoking History) 4 Stratifications Figure 5.3: Plots of the coefficients from the logistic regression of the disease status on increasing stratificaions of both the age and logarithmic transformation of the smoking history variables 68 log(Age) c © o fc 0> O O 8 Stratifications Figure 5.4: Plot of the coefficients from the logistic regression of the disease status on increasing stratificaions of the logarithmic transformation of the smoking history variable 69 No Error 0.4 0.6 0.8 1.0 1.2 1.4 betal 0.10 0.15 0.20 0.25 0.30 0.35 beta2 -0.1 0.0 0.1 beta3 0.2 0.3 0.4 0.6 0.8 1.0 1.2 1.4 betal 0.10 0.15 0.20 0.25 0.30 0.35 beta2 -0.1 0.0 0.1 beta3 0.2 0.3 0.4 0.6 0.8 1.0 1.2 1.4 betal 0.10 0.15 0.20 0.25 0.30 0.35 beta2 -0.1 0.0 0.1 beta3 0.2 0.3 Error: 10% Error: 25% Figure 5.5: Histograms of the posterior distributions of the parameter estimates for Pi, 8 and /? obtained with the assumption there was no measurement error, 10% and finally 25% measurement error in the smoking history variable. 2 3 70 Chapter 6 Conclusion At the risk of repeating ourselves, imprecise exposure measurements are common in case-control studies. Since perfecting measuring instruments is not always a feasible option, accepting the errors and developing methodology that accounts for them in the analysis has generated a fair amount of literature. Chapter 2 reviewed and summarized some of the work that had been done on the subject. Prospective logistic regression being the typical approach for analyzing case-control data, most of the methods that offered a correction for these errors were based on this methodology which fits a prospective model to the retrospectively sampled data. The alternative we adopted called on the Bayesian approach which revealed to be a fairly natural way of incorporating uncertainty about the unobserved true exposure. The simulations and comparisons to known procedures carried out in Chapter 3 found our method to perform reasonably. Although the assumptions of the normal discriminant analysis model may be somewhat stronger than 71 those of the prospective logistic regression model they should be met fairly easily in an applied situation. In light of these encouraging results, we pursued our work to make this approach more adaptable to the realities of complex datasets. In the generalization to multivariate exposure all the basic elements from the univariate setting carried over, allowing the simplicity of the procedure to be kept intact. Simple yet efficient, for models with only continuous exposures the Bayesian inference could be performed with exact posterior sampling. The addition to the model of binary covariates presented a challenging problem. Fortunately, the use of a variation of the probit regression model enabled us to get a continuous representation of these binary covariates. This required some elements of a covariance matrix to be fixed, which led to the development of a general algorithm for sampling such a constrained covariance matrix. The Bayesian inference in this context required the use of a Gibbs sampling algorithm. Analyzing real case-control data showed the method could be applied fairly easily, and produced results that were in line with what was theoretically expected. Finally, although we could not explore these in the scope of this thesis, the following suggestions could be considered for further development of the presented methodology. One would be to generalize the approach to make it applicable to matched case-control studies. The other, to explore other applications of the algorithm for sampling covariance matrices with partially fixed 72 diagonal elements, for example multivariate or multinomial probit models. 73 Bibliography [1] Albert, M.H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669-679. [2] Armstrong, B.G., Whittemore, A.S. and Howe, G.R. (1989). Analysis of case-control data with covariate measurement error: application to diet and colon cancer. Statistics in Medicine 8, 1151-1163. [3] Berger. J.O. and Bernardo, J.M. (1992). On the development of reference priors. Bayesian Statistics 4 35-49. [4] Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research. Lyon: International Agency for Reseach on Cancer. [5] Breslow, N.E. (1996) Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14-28. [6] Buonaccorsi, J.R (1990). Double sampling for exact values in the normal discriminant model with application to binary regression. Commun. Statist. -Theory Meth. 19, 4569-4586 74 [7] Carroll, R.J., Gail, M.H. and Lubin, J.H. (1993). Case-control studies with errors in covariates. tion Journal of the American Statistical Associa- 88, 185-199 [8] Carroll, R.J., Roeder, K. and Wasserman, L. (1996). Flexible parametric measurement error models. Technical Report 648, Department of Statistics, Carnegie Mellon University. [9] Casella, G. and George, E.I. (1992). Explaining the Gibbs sampler. The American Statistician 46, 167-174. [10] Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit model. Biometrika 85, 347-361. [11] Cornfield, J. (1951). A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast and cervix. Journal of the National Cancer Institute 11, 1269-1275. [12] Dellaportas, P. and Stephens, D.A. (1995). Bayesian analysis of errorsin-variables regression models. Biometrics 51, 1085-1095. [13] Forbes, A.B. and Santner, T.J. (1995) Estimators of odds ratio regression parameters in matched case-control studies with covariate measurement error. Journal of the American 1075-1084 75 Statistical Association 90, [14] Le, N.D. and Zidek, J.V. (1992). Interpolation with uncertain spatial covariances: A Bayesian alternative to kriging. Journal of Multivariate 43, 351-374. Analysis [15] Mallick, B.K. and Gelfand, A . E . (1996). Semiparametric errors-invariables models: A Bayesian approach. ning and Inference Journal of Statistical Plan- 52, 307-321. [16] McKeown-Eyssen, G.E. and Thomas, D.C. (1985). Sample size determination in case-control studies: The influence of the distribution of exposure. Journal of Chronic Dis. 38, 559-568 [17] McKeown-Eyssen, G.E. and Tibshirani, R. (1994). Implications of measurement error in exposure for the sample sizes of case-control studies. American Journal of Epidemiology 139, 415-421. [18] Muller, R , Parmigiani, G., Schildkraut, J. and Tardella, L. (1996). A Bayesian hierarchical approach for combining case-control and prospective studies. Discussion Paper 96-29, Institute of Statistics and Decision Sciences, Duke University. [19] Muller, P. and Roeder, K. (1994). A Bayesian semiparametric model for case-control studies with errors in variables. Discussion Paper 9428, Institute of Statistics and Decision Sciences, Duke University. [20] Prentice, R.L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 76 66, 403-411. [21] Richardson, S. and Gilks, W.R. (1993). A Bayesian approach to measurement error problems in epidemiology using conditional independence models. American Journal of Epidemiology 138, 430-442. [22] Richardson, S. and Gilks, W.R. (1993). Conditional independence models for epidemiological studies with covariate measurement error. Statistics in Medicine 12, 1703-1722. [23] Roeder, K., Carroll, R.J. and Lindsay, B.G. (1996) A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722-732. [24] Rosner, B., Spiegelman, D. and Willett, W.C. (1990). Correction of logistic regression relative risk estimates and confidence intervals for measurement error: The case of multiple covariates measured with error. American Journal of Epidemiology 132, 734-745. [25] Rosner, B., Willett, W.C. and Spiegelman, D. (1989). Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Statistics in Medicine 8, 1051-1069. [26] Ross, S. (1997). Introduction to Probability Models. 6th ed. San Diego: Academic Press. [27] Thomas, D., Stram, D. and Dwyer, J. (1993). Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Annu. Rev. Publ. Health 14, 69-93 [28] Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics 22, 1701-1728. [29] Wolfinger, R.D. and Kass, R.E. (1996). Bayesian analysis of variance component models via rejection sampling. Technical Report 642, Department of Statistics, Carnegie Mellon University. 78
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A Bayesian approach to case-control studies with errors...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A Bayesian approach to case-control studies with errors in the covariates Vallée, Marc 1999
pdf
Page Metadata
Item Metadata
Title | A Bayesian approach to case-control studies with errors in the covariates |
Creator |
Vallée, Marc |
Date Issued | 1999 |
Description | It is not uncommon to be faced with imprecise exposure measurements when dealing with case-control data. In cancer case-control studies, for instance, smoking histories may be unreliable. The usual methods of analysis involve logistic regression with different correction factors. The approach we adopt involves Bayesian fitting of a retrospective discriminant analysis model. The parameters of interest are the regression coefficients in the prospective logodds ratio for disease. Under a standard non-informative prior, the posterior means of these parameters are infinite. Posterior medians, however, perform reasonably relative to other estimators that adjust for covariate imprecision. For models with only continuous exposures, the Bayesian inference can be implemented with exact posterior simulation. The presence of binary covariates requires some elements of a covariance matrix to be fixed. We develop a general approach for sampling such a constrained covariance matrix. The Bayesian inference in this context now demands the use of a Gibbs sampling algorithm. |
Extent | 3004229 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-06-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0088924 |
URI | http://hdl.handle.net/2429/8983 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1999-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1999-0101.pdf [ 2.87MB ]
- Metadata
- JSON: 831-1.0088924.json
- JSON-LD: 831-1.0088924-ld.json
- RDF/XML (Pretty): 831-1.0088924-rdf.xml
- RDF/JSON: 831-1.0088924-rdf.json
- Turtle: 831-1.0088924-turtle.txt
- N-Triples: 831-1.0088924-rdf-ntriples.txt
- Original Record: 831-1.0088924-source.json
- Full Text
- 831-1.0088924-fulltext.txt
- Citation
- 831-1.0088924.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0088924/manifest