"Science, Faculty of"@en . "Statistics, Department of"@en . "DSpace"@en . "UBCV"@en . "Valle\u00CC\u0081e, Marc"@en . "2009-06-11T20:27:58Z"@en . "1999"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "It is not uncommon to be faced with imprecise exposure measurements when\r\ndealing with case-control data. In cancer case-control studies, for instance,\r\nsmoking histories may be unreliable. The usual methods of analysis involve\r\nlogistic regression with different correction factors. The approach we adopt\r\ninvolves Bayesian fitting of a retrospective discriminant analysis model. The\r\nparameters of interest are the regression coefficients in the prospective logodds\r\nratio for disease. Under a standard non-informative prior, the posterior\r\nmeans of these parameters are infinite. Posterior medians, however, perform\r\nreasonably relative to other estimators that adjust for covariate imprecision.\r\nFor models with only continuous exposures, the Bayesian inference can be\r\nimplemented with exact posterior simulation.\r\nThe presence of binary covariates requires some elements of a covariance\r\nmatrix to be fixed. We develop a general approach for sampling such\r\na constrained covariance matrix. The Bayesian inference in this context now\r\ndemands the use of a Gibbs sampling algorithm."@en . "https://circle.library.ubc.ca/rest/handle/2429/8983?expand=metadata"@en . "3004229 bytes"@en . "application/pdf"@en . "Bayesian Approach to Case-control Studies with Errors in the Covariates by Marc Vallee B.Sc, Universite de Montreal, 1996 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia December 1998 \u00C2\u00A9 Marc Vallee, 1998 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my writ ten permission. Department of 4jgif 0, then (3.5) implies that the right tail of / 0 and the left tail of / i fall off faster than exp(\u00E2\u0080\u0094f3\x\). If these 'thinner than exponential' tails are in fact normal, then the other tails (the left tail of / 0 and the right tail of / i) will also be normal. This suggests that (3.3) is likely to be appropriate in at least some situations where (3.4) holds. In practice transformations are often used to approximate normality. If such a transformation was used on the sample of control exposures for example, then in order to proceed with the retrospective model suggested here, the same transformation would have to yield approximate normality with a similar variance for the other exposure sample. We now have a model for the relationship between the exposure and the disease outcome. Next we have to consider the measurement error which will arise in that a surrogate exposure X* will be observed rather than the true exposure X. 
So we also have to model this error; we suggest the following simple classical (non-Berkson) model:

$$X^* \mid X \sim N(X, \tau), \qquad (3.6)$$

under which X* is an unbiased but noisy measurement of X. We assume the measurement error is nondifferential; that is, it arises in the same way for cases and controls. Using the formal definition of nondifferential error given in Chapter 2, (3.3) and (3.6) can be collapsed to obtain the model from which the data arise:

$$X^* \mid D = 0 \sim N(\mu_0, \tau+\nu), \qquad X^* \mid D = 1 \sim N(\mu_1, \tau+\nu). \qquad (3.7)$$

So a direct consequence of the measurement error is some extra variability in the data compared to (3.3). This extra variability τ (the measurement error variance) could be known from an external validation of the measurement process, or it could be estimated with the use of a validation sample, as illustrated in Chapter 2. The following two sections illustrate our approach for each situation.

3.2 Known Measurement Error Variance

In the present section we assume τ to be known. Since we are conducting a Bayesian analysis, we need a prior distribution in addition to the likelihood based on (3.7). This will yield a posterior distribution on the unknown parameters (μ0, μ1, ν) and enable us to estimate β. The prior distribution on which our inference will be based is a standard noninformative prior for variance component models,

$$\pi(\mu_0, \mu_1, \nu) \propto (\tau+\nu)^{-1}. \qquad (3.8)$$

Formally, this prior is the reference prior (Berger and Bernardo, 1992) when ν is the parameter of interest and (μ0, μ1) are nuisance parameters. The reference prior with β as the parameter of interest was also considered, but the fact that it did not have a simple form made it unappealing.

Given the prior distribution in (3.8) and the realization x* = (x*_{10}, ..., x*_{n_0 0}, x*_{11}, ..., x*_{n_1 1}) from the normal discriminant analysis model in (3.7), the resulting posterior is of the following form:

$$\pi(\mu_0, \mu_1, \nu \mid x^*) \propto (\nu+\tau)^{-1} \prod_{i=1}^{n_0} (\nu+\tau)^{-1/2} \exp\left(-\frac{(x^*_{i0}-\mu_0)^2}{2(\nu+\tau)}\right) \prod_{i=1}^{n_1} (\nu+\tau)^{-1/2} \exp\left(-\frac{(x^*_{i1}-\mu_1)^2}{2(\nu+\tau)}\right)$$
$$\propto \left(\frac{1}{\nu+\tau}\right)^{(n_0+n_1+2)/2} \exp\left(-\frac{SS_0}{2(\nu+\tau)}\right) \exp\left(-\frac{n_0(\bar{x}^*_0-\mu_0)^2}{2(\nu+\tau)}\right) \exp\left(-\frac{SS_1}{2(\nu+\tau)}\right) \exp\left(-\frac{n_1(\bar{x}^*_1-\mu_1)^2}{2(\nu+\tau)}\right). \qquad (3.9)$$

Our Bayesian inference can be implemented by simulating independent draws from this posterior distribution. We can integrate out μ0 and μ1 from (3.9) to get

$$\pi(\nu \mid x^*) \propto \left(\frac{1}{\nu+\tau}\right)^{(n_0+n_1)/2} \exp\left(-\frac{\sum_i (x^*_{i0}-\bar{x}^*_0)^2 + \sum_i (x^*_{i1}-\bar{x}^*_1)^2}{2(\nu+\tau)}\right). \qquad (3.10)$$

In order to make sampling easier, it is convenient to reparametrize from (μ0, μ1, ν) to (μ0, μ1, γ), where γ = τ + ν. The posterior distribution of γ|x* is then easily obtained from (3.10) and can be expressed as

$$\gamma \mid x^* = G \mid G > \tau, \qquad (3.11)$$

where G has an inverse gamma distribution with shape parameter (n0 + n1 − 2)/2 and scale parameter (SS0 + SS1)/2, with SS_j = Σ_i (x*_{ij} − x̄*_j)². Sampling from G|G > τ is easily implemented by repeatedly sampling G until G > τ. Then ν is taken to be the difference between the sampled G and the known τ. Once γ (equivalently ν) has been sampled, it is easy to see from (3.9) that we have

$$\mu_j \mid \gamma, x^* \sim N(\bar{x}^*_j,\; n_j^{-1}\gamma), \qquad j = 0, 1, \qquad (3.12)$$

where μ0 and μ1 are independent given γ and x*. A draw from the joint posterior of (μ0, μ1, γ)|x* can therefore be obtained by sampling from γ|x* and then from μ0, μ1|γ, x*. Thus we have a simple algorithm for exact posterior simulation in a variance components model. This computational approach is pursued in greater generality by Wolfinger and Kass (1996).
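To make the algorithm concrete, here is a minimal sketch, in C (the language the simulation routines of Section 3.4 were written in), of this exact posterior simulation. It is not the thesis code: the sufficient statistics and τ are placeholder values, and the normal and gamma draws use standard textbook samplers (Box-Muller and Marsaglia-Tsang).

```c
/* Exact posterior simulation for the known-tau case: draw gamma from
   G | G > tau with G ~ InvGamma((n0+n1-2)/2, (SS0+SS1)/2)  (3.11),
   then mu_j | gamma ~ N(xbar_j, gamma/n_j)  (3.12), and beta = (mu1-mu0)/nu. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double runif(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }

static double rnorm(void) {                       /* Box-Muller */
    return sqrt(-2.0 * log(runif())) * cos(2.0 * M_PI * runif());
}

static double rgamma1(double a) {                 /* Marsaglia-Tsang, a >= 1 */
    double d = a - 1.0 / 3.0, c = 1.0 / sqrt(9.0 * d);
    for (;;) {
        double x, v, u;
        do { x = rnorm(); v = 1.0 + c * x; } while (v <= 0.0);
        v = v * v * v;
        u = runif();
        if (u < 1.0 - 0.0331 * x * x * x * x) return d * v;
        if (log(u) < 0.5 * x * x + d * (1.0 - v + log(v))) return d * v;
    }
}

/* G ~ InvGamma(shape a, scale b): 1/G ~ Gamma(a, rate b). */
double rinvgamma(double a, double b) { return b / rgamma1(a); }

static int cmp(const void *p, const void *q) {
    double d = *(const double *)p - *(const double *)q;
    return (d > 0) - (d < 0);
}

int main(void) {
    /* Placeholder sufficient statistics for n0 = n1 = 50 and known tau. */
    int n0 = 50, n1 = 50, m = 500, k;
    double xbar0 = 0.05, xbar1 = 0.98, ss0 = 70.0, ss1 = 68.0, tau = 0.25;
    double beta[500];

    srand(1);
    for (k = 0; k < m; k++) {
        double g, nu, mu0, mu1;
        do {                                      /* gamma | x* = G | G > tau */
            g = rinvgamma(0.5 * (n0 + n1 - 2), 0.5 * (ss0 + ss1));
        } while (g <= tau);
        nu = g - tau;
        mu0 = xbar0 + sqrt(g / n0) * rnorm();     /* (3.12) */
        mu1 = xbar1 + sqrt(g / n1) * rnorm();
        beta[k] = (mu1 - mu0) / nu;
    }
    qsort(beta, m, sizeof beta[0], cmp);
    printf("PMED estimate of beta: %f\n", 0.5 * (beta[m/2 - 1] + beta[m/2]));
    return 0;
}
```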
Commonly, Bayesian parameter point estimation is done by using the posterior distribution's mean. It can be seen as follows that in the present case E(β|x*) is infinite:

$$E\{(\mu_1-\mu_0)\nu^{-1} \mid x^*\} = E\{\nu^{-1} E(\mu_1-\mu_0 \mid \nu, x^*) \mid x^*\} = E\{\nu^{-1}(\bar{x}^*_1 - \bar{x}^*_0) \mid x^*\} \quad \text{by (3.12)}$$
$$= (\bar{x}^*_1 - \bar{x}^*_0)\, E\{\nu^{-1} \mid x^*\},$$

but

$$E\{\nu^{-1} \mid x^*\} = \frac{\int_\tau^\infty (\gamma-\tau)^{-1}\, \gamma^{-(a+1)} \exp(-u/\gamma)\, d\gamma}{\int_\tau^\infty \gamma^{-(a+1)} \exp(-u/\gamma)\, d\gamma},$$

where a and u are both positive and are respectively the shape and scale parameter of the inverse gamma distribution in (3.11). The numerator's integral blows up near τ (the integrand behaves like a positive multiple of (γ − τ)^{-1} there), therefore E{ν^{-1}|x*} = +∞. Intuitively this occurs because, according to both the likelihood based on (3.7) and the prior (3.8), zero is a plausible value for ν, which is the denominator of β = (μ1 − μ0)/ν. Since the posterior mean of β does not exist, we use the posterior median of our parameter of interest as a point estimator. From now on we refer to this posterior median as PMED. This concludes the known measurement error variance case; we will come back to it in a subsequent section to describe the procedure and the results of a simulation study.

3.3 Unknown Measurement Error Variance

We now examine the situation where τ is unknown, in which case it has to be estimated along with ν. The estimation of this error variance is done with the use of a validation sample. In addition to the surrogate exposures for cases and controls, we assume an independent data set is available. This extra data set consists of measurements of both the surrogate exposure X* and the true exposure X for each subject. We have assumed nondifferential measurement error, thus the subject selection procedure for this validation sample does not matter; we can view the true exposure X as being sampled from an arbitrary distribution, so long as the measurement error model X*|X ~ N(X, τ) is the same for both the main and validation samples. The observed validation sample is denoted {(x*_{2i}, x_{2i})}_{i=1}^{n_2}, with SS_2 = Σ_i (x*_{2i} − x_{2i})² defined for subsequent use.

To conduct our analysis we can proceed in a manner very similar to the known-τ case. We begin with a similar reparametrization: the variance components (ν, τ) are reparametrized to (γ, τ), where γ = τ + ν. We again use the reference prior for a variance components model, but this time with (γ, τ) as the parameters of interest,

$$\pi(\mu_0, \mu_1, \gamma, \tau) \propto \gamma^{-1}\tau^{-1}\, I(0 < \tau < \gamma < \infty). \qquad (3.13)$$

Once more we have selected in (3.13) a noninformative prior. Combined with the likelihood given by the normal discriminant analysis model in (3.7), the resulting posterior distribution from which we will be sampling is

$$\pi(\mu_0, \mu_1, \gamma, \tau \mid x^*) \propto \gamma^{-1}\tau^{-1}\, I(0<\tau<\gamma<\infty) \times \left(\frac{1}{\gamma}\right)^{n_0/2} \exp\left(-\frac{SS_0 + n_0(\mu_0-\bar{x}^*_0)^2}{2\gamma}\right)$$
$$\times \left(\frac{1}{\gamma}\right)^{n_1/2} \exp\left(-\frac{SS_1 + n_1(\mu_1-\bar{x}^*_1)^2}{2\gamma}\right) \times \left(\frac{1}{\tau}\right)^{n_2/2} \exp\left(-\frac{SS_2}{2\tau}\right). \qquad (3.14)$$

Once again we can integrate out μ0 and μ1, this time obtaining

$$\pi(\gamma, \tau \mid x^*) \propto I(0<\tau<\gamma<\infty)\, \gamma^{-(n_0+n_1)/2} \exp\left(-\frac{SS_0+SS_1}{2\gamma}\right) \tau^{-(n_2+2)/2} \exp\left(-\frac{SS_2}{2\tau}\right). \qquad (3.15)$$

Hence from (3.15) the marginal posterior distribution of the variance components is

$$(\gamma, \tau) \mid x^* = (G, T) \mid G > T, \qquad (3.16)$$

where G and T are independent, with

$$G \sim IG\left(\frac{n_0+n_1-2}{2}, \frac{SS_0+SS_1}{2}\right), \qquad T \sim IG\left(\frac{n_2}{2}, \frac{SS_2}{2}\right), \qquad (3.17)$$

and the posterior conditional distribution for μ0 and μ1 once γ and τ have been sampled is the same as in the known-τ case,

$$\mu_j \mid \gamma, \tau, x^* \sim N(\bar{x}^*_j,\; n_j^{-1}\gamma). \qquad (3.18)$$

Thus again exact posterior sampling can be implemented, by simulating (G, T) pairs from the distributions in (3.17) until G > T, and then sampling μ0 and μ1 from (3.18) in order to get a set of β's. The posterior mean for the parameter of interest is infinite in this case too, so again PMED is used as a point estimator.
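A sketch of the corresponding draw for the unknown-τ case, assuming the rinvgamma and rnorm helpers from the previous sketch are linked in; the function name and interface are illustrative, not the thesis implementation.

```c
/* One joint posterior draw for the unknown-tau case (3.16)-(3.18).
   rinvgamma() and rnorm() are the helpers from the previous sketch. */
#include <math.h>

extern double rinvgamma(double shape, double scale);
extern double rnorm(void);

void draw_unknown_tau(int n0, int n1, int n2,
                      double xbar0, double xbar1,
                      double ss0, double ss1, double ss2,
                      double *beta_out)
{
    double g, t, nu, mu0, mu1;
    do {                                      /* (G, T) | G > T   (3.16) */
        g = rinvgamma(0.5 * (n0 + n1 - 2), 0.5 * (ss0 + ss1));
        t = rinvgamma(0.5 * n2, 0.5 * ss2);
    } while (g <= t);
    nu = g - t;                               /* nu = gamma - tau */
    mu0 = xbar0 + sqrt(g / n0) * rnorm();     /* (3.18) */
    mu1 = xbar1 + sqrt(g / n1) * rnorm();
    *beta_out = (mu1 - mu0) / nu;
}
```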
3.4 Simulation Study

Situations with known and unknown measurement error variance have now been addressed. In both cases our approach led to exact posterior sampling, which should make implementation simple and sampling procedures reasonably efficient. In order to assess the performance of our approach we analyze simulated data sets and compare our results to the established methods reviewed in Chapter 2.

Simulations are carried out based on samples of size n0 = n1 = 50 from the retrospective model (3.3) with parameter values (μ0, μ1, ν) = (0, 1, 1). Under these parameters, the prospective logistic regression coefficient is β = 1. The simulations are done with values of √τ ranging from 0.0 to 2.0 in steps of 0.1. In one case τ is assumed to be known. In the other it has to be estimated from a simulated validation sample of size n2 = 50. In either situation, the PMED estimate of β is computed as the median of 500 independent and identically distributed draws from the posterior distribution. Here are the steps of the simulation process for the known-τ case:

1. Generate 500 samples of size n0 = 50 from a normal distribution with mean 0 and variance ν = 1.
2. Generate 500 samples of size n1 = 50 from a normal distribution with mean 0 and variance ν = 1.
3. Create a vector of √τ values (0.0, 0.1, 0.2, ..., 1.8, 1.9, 2.0).
4. Use the 500 samples generated in step 1 to create 500 new samples from a normal distribution with mean μ0 = 0 and the error-inflated variance ν + τ (the controls).
5. Use the 500 samples generated in step 2 to create 500 new samples from a normal distribution with mean μ1 = 1 and the error-inflated variance ν + τ (the cases).
6. Sample in turn ν, μ0 and μ1:
   • Sample γ from (3.11) to get ν.
   • Use ν to sample both μ0 and μ1 from (3.12).
   • Compute β = ν^{-1}(μ1 − μ0).
7. Repeat step 6 500 times, which gives 500 independent and identically distributed draws from the posterior distribution.
8. Compute the PMED estimate of β, the median of the 500 draws from the posterior.
9. Repeat steps 4 through 8 for every value √τ was given.

The reason for steps 1 and 2 is that initializing all the future samples with these (which come from standard normal distributions) makes comparison from one measurement error level to another easier and more precise. The process is very similar for the unknown-τ case, except one step is added between steps 2 and 3 and between steps 5 and 6 to create the validation sample, and step 6 is slightly different. Here are the two added steps and the modified step 6:

1. Generate 500 samples of size n2 = 50 from a normal distribution with mean 0 and variance ν = 1.
2. Use the 500 samples generated in the previous step to create 500 new samples from a normal distribution with mean μ2 = 0 and variance τ (the validation sample).
3. Sample in turn ν, μ0 and μ1:
   • Sample γ and τ from (3.16)-(3.17) to get ν.
   • Use ν to sample both μ0 and μ1 from (3.18).
   • Compute β = ν^{-1}(μ1 − μ0).

Quite obviously we did not generate both surrogate and true exposures in our validation sample; it actually consists of {(x*_{2i} − x_{2i})}_{i=1}^{n_2}, the differences between the surrogate and true exposures. These differences should follow a normal distribution with mean 0 and variance τ.
The simulation routines were coded in C, and although probably not using the most efficient algorithms, they compared very favorably to S-Plus with respect to running time and memory requirements.

The PMED estimator of β is compared to other estimators suggested in the literature. For the known-τ case, the comparison is done with the method of AWH, illustrated in Chapter 2, in which a correction factor Λ is used to adjust the β estimate obtained from a 'naive' logistic regression. For the unknown-τ case, the AWH method is still applicable, and two methods by RWS, also discussed in Chapter 2, are applied to our simulated data for comparison. The first, which we refer to as RWS1, takes the form of the AWH estimator, except that the correction factor Λ is estimated by the fitted slope of a regression of X on X* using the validation sample. The second estimator, referred to as RWS2, is based on an approximate likelihood function.

The RWS method needs to fulfill more assumptions than the PMED approach in order to be applicable. It requires that the variance of X be the same for the validation sample as for the case and control samples. If this requirement is not met, the fitted slope will no longer estimate Λ correctly. This assumption is not needed for the PMED method, which is applicable to any distribution of X in the validation sample. This seems more sensible, as it is the parameters of the conditional distribution of X*|X which must be estimated.

The AWH estimator has a definite disadvantage in that it performs poorly when faced with substantial measurement error. The estimate Λ̂ of the correction factor can sometimes take on a negative value; in fact, P(Λ̂ < 0) can be quite high when τ is sufficiently large. We can obtain Λ̂'s distribution and derive the following:

$$P(\hat\Lambda < 0) = F_{n_0+n_1-2}\left(\frac{(n_0+n_1-2)\tau}{\nu+\tau}\right), \qquad (3.19)$$

where F_k is the chi-square distribution function with k degrees of freedom. The derivation of this probability uses information on the AWH approach presented in Chapter 2. We illustrated there that in the known-τ scenario Λ can be estimated by Λ̂ = 1 − (τ/S²), where S² is the pooled sample variance of the observed exposures. From there we can write

$$P(\hat\Lambda < 0) = P(1 - \tau/S^2 < 0) = P(\tau/S^2 > 1) = P(S^2 < \tau),$$

and since (n0 + n1 − 2)S²/(ν + τ) has a chi-square distribution with n0 + n1 − 2 degrees of freedom, (3.19) follows.

[...]

$$\pi(\mu_0, \mu_1, \Sigma \mid X^*) \propto |\Sigma+\tau|^{-(d+1)/2} \prod_{i=1}^{n_0} |\Sigma+\tau|^{-1/2} \exp\left(-\tfrac{1}{2}(x^*_{i0}-\mu_0)'(\Sigma+\tau)^{-1}(x^*_{i0}-\mu_0)\right)$$
$$\times \prod_{i=1}^{n_1} |\Sigma+\tau|^{-1/2} \exp\left(-\tfrac{1}{2}(x^*_{i1}-\mu_1)'(\Sigma+\tau)^{-1}(x^*_{i1}-\mu_1)\right)$$
$$\propto |\Sigma+\tau|^{-(n_0+n_1+d+1)/2} \exp\left(-\tfrac{1}{2}\sum_{i=1}^{n_0}(x^*_{i0}-\mu_0)'(\Sigma+\tau)^{-1}(x^*_{i0}-\mu_0) - \tfrac{1}{2}\sum_{i=1}^{n_1}(x^*_{i1}-\mu_1)'(\Sigma+\tau)^{-1}(x^*_{i1}-\mu_1)\right),$$

with, in particular,

$$\mu_0 \mid \Sigma, \tau, x^* \sim N(\bar{x}^*_0,\; n_0^{-1}(\Sigma+\tau)). \qquad (4.12)$$

This multivariate exposure setting also calls on our PMED estimator; in this case the sample of β's is split into d subsamples, one per exposure variable, and a median is computed for each of these subsamples.

4.2 Binary Covariates

Our approach is now applicable to a multidimensional exposure variable, but an important assumption we have made is that all variables are continuous. However, in applied situations it is common to be faced with binary exposure variables, for which individuals are classified as either exposed or unexposed. Even more common is a combination of both continuous and binary exposures. For this reason, it should be a priority for us to integrate this type of variable into our model. Since our methodology has so far been developed for continuous variables, we will try to find a continuous representation of these binary covariates.
A method that can be used to obtain such a representation is described in Albert and Chib (1993), where a unidimensional binary response variable is regressed on a set of covariates. They suppose N independent binary random variables Y1, ..., YN are observed, where Yi has a Bernoulli distribution with probability of success pi. These pi are related to a set of covariates that may be continuous or discrete. They define a binary regression model as

$$p_i = H(x_i'\beta), \quad i = 1, \ldots, N,$$

where β is a k × 1 vector of unknown parameters, x_i' = (x_{i1}, ..., x_{ik}) is a vector of known covariates, and H(·) is a known function linking the probabilities pi with the linear structure x_i'β. Taking H to be the standard normal cdf Φ(·) yields what is called the probit model. In order to get a continuous representation they introduce N latent variables Z1, ..., ZN, where the Zi are independent N(x_i'β, 1), and define Yi = 1 if Zi > 0 and Yi = 0 otherwise. It can easily be shown that the Yi are then independent Bernoulli random variables with pi = P(Yi = 1) = Φ(x_i'β).

The joint posterior density of β and Z = (Z1, ..., ZN) given the data y = (y1, ..., yN) is given by

$$\pi(\beta, Z \mid y) = C\,\pi(\beta) \prod_{i=1}^{N} \{I(Z_i > 0)I(y_i = 1) + I(Z_i \le 0)I(y_i = 0)\}\,\phi(Z_i;\, x_i'\beta, 1),$$

where π(β) is a prior on β, φ(·; μ, σ²) is the N(μ, σ²) pdf, I(X ∈ A) is an indicator function equal to 1 if the random variable X is contained in the set A, and here and henceforth C is a proportionality constant. It is very difficult to sample directly from this distribution. But computation of the marginal posterior distribution of β using the Gibbs sampling algorithm requires only the posterior distribution of β conditional on Z and the posterior distribution of Z conditional on β, and fortunately these full conditional distributions are of standard forms. The posterior density of β conditioned on Z is given by

$$\pi(\beta \mid y, Z) = C\,\pi(\beta) \prod_{i=1}^{N} \phi(Z_i;\, x_i'\beta, 1), \qquad (4.13)$$

and the posterior density of Z conditioned on β results in the random variables Z1, ..., ZN being independent, with

$$Z_i = \begin{cases} Z^* \mid Z^* > 0 & \text{if } y_i = 1 \\ Z^* \mid Z^* \le 0 & \text{if } y_i = 0 \end{cases} \qquad \text{where } Z^* \sim N(x_i'\beta, 1). \qquad (4.14)$$

4.2.1 The Gibbs Sampler

We mentioned above the Gibbs sampling algorithm; here is a brief review of this useful technique. Suppose one is interested in simulating from the posterior distribution of θ partitioned into the vector components θ = (θ1, ..., θp). Although it may be difficult to sample from the joint posterior, suppose that it is easy to sample from the conditional distributions π(θk | {θj, j ≠ k}). To implement the Gibbs sampler, one starts with initial guesses for the θi, say θ1^{(0)}, ..., θp^{(0)}, and then simulates in turn

θ1^{(1)} from π(θ1 | θ2^{(0)}, ..., θp^{(0)}),
θ2^{(1)} from π(θ2 | θ1^{(1)}, θ3^{(0)}, ..., θp^{(0)}),
...
θp^{(1)} from π(θp | θ1^{(1)}, ..., θ_{p-1}^{(1)}),   (4.15)

yielding θ^{(1)} = (θ1^{(1)}, ..., θp^{(1)}). As t approaches infinity, it can be shown that the distribution of θ^{(t)} approaches the joint distribution of θ; in practice this convergence is usually quite fast. So for sufficiently large t, say t*, θ^{(t*)} can be regarded as one simulated value from the posterior distribution of θ. Repeating this process m times yields the sample {(θ1^{(t*)}_j, ..., θp^{(t*)}_j), j = 1, ..., m}, which can be used for statistical inference. In practice, instead of restarting the algorithm once convergence is obtained, we simply keep for our posterior sample the θ^{(t)} for t = t*, ..., t* + m. This concludes the review of the Gibbs sampler.

Returning to the method suggested by Albert and Chib: given a previous value of β, one cycle of the Gibbs algorithm produces Z and β from the distributions (4.14) and (4.13).
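As a concrete illustration of such a cycle, here is a minimal self-contained C sketch of the Albert and Chib sampler for a single covariate under a flat prior π(β) ∝ 1, in which case (4.13) reduces to β | Z ~ N(Σᵢ xᵢZᵢ / Σᵢ xᵢ², 1/Σᵢ xᵢ²). The data generation, the rejection-based truncated normal draw and all names are illustrative choices rather than anything prescribed by the thesis.

```c
/* One-covariate Albert-Chib probit Gibbs sampler with flat prior on beta.
   Cycle: Z_i | beta, y_i  ~ truncated N(x_i*beta, 1)      (4.14)
          beta | Z         ~ N(Sxz/Sxx, 1/Sxx)             (4.13), pi(beta) ~ 1 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double runif(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }
static double rnorm(void) { return sqrt(-2.0*log(runif())) * cos(2.0*M_PI*runif()); }

/* Draw from N(m,1) restricted to z > 0 (y=1) or z <= 0 (y=0).
   Plain rejection: fine for moderate |m|, inefficient for extreme means. */
static double rtruncnorm(double m, int y) {
    double z;
    do { z = m + rnorm(); } while (y ? (z <= 0.0) : (z > 0.0));
    return z;
}

#define N 100
int main(void) {
    double x[N], z[N], beta = 0.0, sxx = 0.0, postmean = 0.0;
    int y[N], i, t;

    srand(1);
    /* Illustrative data: true beta = 1 under the probit model. */
    for (i = 0; i < N; i++) {
        x[i] = rnorm();
        y[i] = (x[i] + rnorm() > 0.0);
        sxx += x[i] * x[i];
    }
    for (t = 0; t < 2000; t++) {
        double sxz = 0.0;
        for (i = 0; i < N; i++) {                 /* Z | beta, y */
            z[i] = rtruncnorm(x[i] * beta, y[i]);
            sxz += x[i] * z[i];
        }
        beta = sxz / sxx + rnorm() / sqrt(sxx);   /* beta | Z */
        if (t >= 1000) postmean += beta / 1000.0; /* keep post burn-in draws */
    }
    printf("approximate posterior mean of beta: %f\n", postmean);
    return 0;
}
```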
Advances developed from this approach are presented in Chib and Greenberg (1998), where a multivariate probit model is constructed: the binary response variable is allowed to be multidimensional. This generalization to a multidimensional response makes the technique applicable to our own analysis.

4.2.2 Application

We are faced with b-dimensional binary data Y_{ijk} = 0 or 1, i = 1, ..., n_j, where j is either 0 or 1 for controls or cases, and k = 1, ..., b, where b is the dimension of the binary covariate we want to include in our model. We must introduce the latent variables Z_{1j}, ..., Z_{n_j j}, where the Z_{ij} are independent multivariate normal N(η_j, Σ22). We use η and not x_i'β as the mean because our application is not in a regression context. The reason for the Σ22 notation of the covariance matrix will be explained later on. It is necessary for identifiability reasons to fix the diagonal elements of this covariance matrix at 1. This way the variance of each Z_{ijk} is 1, as in the univariate case of the Albert and Chib method presented earlier, but the covariance parameters are free. In their multivariate development, Chib and Greenberg simply refer to this as the necessity for Σ22 to be in correlation form. Then we define Y_{ijk} = 1 if Z_{ijk} > 0 and Y_{ijk} = 0 otherwise. The Y_{ij} are independent vectors of 0's and 1's with

$$p_{jk} = P(Y_{ijk} = 1) = \Phi(\eta_{jk}),$$

where Φ(·) is the standard normal cdf, which gives us the model known as the probit link model. Since the y_i are observed we can estimate the p_{jk} and, using the probit link, obtain estimates of the η_{jk} as starting values for our Gibbs sampler. As a starting value for Σ22 we can use 1's for the variances (on the diagonal) and 0's for the covariance elements (off the diagonal).

So the next step is to implement our Gibbs sampler: first sample Z from its posterior distribution conditioned on η, Σ22 and Y; then sample Σ22 conditioned on η and Z; finally sample η conditioned on Σ22 and Z. Repeating this enough times will eventually be equivalent to sampling from the joint posterior. In order to proceed with this sampling scheme we have to determine these conditional densities. The random variables Z_{1j}, ..., Z_{n_j j} are independent, with b-dimensional truncated multivariate normal distributions

$$Z_{ij} \mid y, \eta, \Sigma_{22} \sim N(\eta_j, \Sigma_{22}), \qquad (4.16)$$

with Z_{ijk} > 0 if y_{ijk} = 1, and Z_{ijk} ≤ 0 if y_{ijk} = 0. This way the initial binary covariates are represented by continuous variables following a multivariate normal distribution.
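Drawing from (4.16) can itself be done one coordinate at a time inside the Gibbs cycle, since each Z_{ijk} given the other coordinates is a univariate truncated normal. Below is a rough C sketch for b = 2, where Σ22 in correlation form has a single free element ρ, so that Z2 | Z1 ~ N(η2 + ρ(Z1 − η1), 1 − ρ²); the function names and the simple rejection draw are illustrative assumptions, not the thesis code.

```c
/* One Gibbs pass over the two coordinates of a latent pair (Z1, Z2)
   from (4.16) with b = 2: each full conditional is a univariate normal
   truncated at 0, with the sign fixed by the observed binary y. */
#include <stdlib.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double runif(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }
static double rnorm(void) { return sqrt(-2.0*log(runif())) * cos(2.0*M_PI*runif()); }

/* N(m, s2) restricted to the positive (y=1) or nonpositive (y=0) half line. */
static double rtrunc(double m, double s2, int y) {
    double z;
    do { z = m + sqrt(s2) * rnorm(); } while (y ? (z <= 0.0) : (z > 0.0));
    return z;
}

/* eta[0..1]: means; rho: off-diagonal of Sigma22 in correlation form;
   y[0..1]: observed binaries; z[0..1]: current latents, updated in place. */
void update_latent_pair(const double eta[2], double rho,
                        const int y[2], double z[2])
{
    double cvar = 1.0 - rho * rho;   /* conditional variance, unit diagonals */
    z[0] = rtrunc(eta[0] + rho * (z[1] - eta[1]), cvar, y[0]);
    z[1] = rtrunc(eta[1] + rho * (z[0] - eta[0]), cvar, y[1]);
}
```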
It is therefore possible to include binary exposures in our model, and their connection to the normal distribution will make covariance estimation feasible.

The η_j, which are vectors of size b, will also follow a multivariate normal distribution, just like μ0 and μ1 in (4.12). In fact, when we juxtapose the Z's to the observed continuous X*'s to deal with both the continuous and binary variables simultaneously, the η_j will simply be the last b components of the μ_j vectors. So we can sample η_j from

$$\eta_j \mid Z_j, \Sigma_{22} \sim N(\bar{Z}_j,\; n_j^{-1}\Sigma_{22}). \qquad (4.17)$$

Finally we need to determine the conditional distribution of Σ22. We know from (4.11) that its distribution will somehow be related to an inverse Wishart, but we do not yet know how the correlation form condition (1's on the diagonal) will influence the actual distribution from which our sample has to be drawn. The next section concentrates on determining this distribution and presents an approach for sampling such a covariance matrix.

4.3 Covariance Matrix With Partially Fixed Diagonal

The addition of these binary covariates in continuous form does not change model (4.6) from which the data arise. The Z mentioned in the previous section just has to be juxtaposed to X*, forming a new X* consisting of c continuous exposure variables and b binary exposure variables in continuous form, where c + b = d, the total dimension of the exposure. For later use, the binary covariates will always be placed at the end of the exposure vector; so we have first the c continuous X* and then the b binary ones. This yields a covariance matrix Σ split in the following way:

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad (4.18)$$

where Σ11 is c × c, Σ12 is c × b, Σ21 is of course the transpose of Σ12, and Σ22 is b × b. We also make the important assumption that there is no misclassification for the binary covariates; the errors in variables therefore only occur for continuous variables. Also, with no influence on our previous work, some of the continuous exposures could be measured without error. This arrangement of the continuous variables followed by the binary ones, combined with the no-misclassification assumption, means that the lower diagonal of the covariance matrix Σ (the diagonal of Σ22) will consist of a series of b 1's, as will the lower diagonal of (Σ + τ), since the lower diagonal of τ will consist of a series of b 0's.

The major change comes from the new prior we have to use. In (4.7) we suggested Jeffreys' noninformative prior, but now part of the covariance matrix is known. Thus we try the prior

$$\pi(\mu_0, \mu_1, \Sigma) \propto |(\Sigma+\tau)|^{-(d+1)/2}\, I(\mathrm{diag}(\Sigma_{22}+\tau_{22}) = 1), \qquad (4.19)$$

where I(·) is an indicator function equal to 1 if the diagonal elements of Σ22 are 1's and equal to 0 otherwise. Do note that τ22 is a matrix of 0's, since we have assumed no misclassification for the binary covariates.
The addition of this new prior makes a slight change to equation (4.9); the marginal posterior of the variance components now becomes

$$\pi((\Sigma+\tau) \mid X^*) \propto |(\Sigma+\tau)|^{-(n_0+n_1+d+1)/2} \exp\{-(1/2)\,\mathrm{trace}((\Sigma+\tau)^{-1}S)\}\, I(\mathrm{diag}(\Sigma_{22}) = 1), \qquad (4.20)$$

where S is as defined in (4.10), thereby transforming (4.11) to

$$(\Sigma+\tau) \mid X^* = G \mid [(G-\tau) \text{ positive definite and } \mathrm{diag}(G_{22}) = 1], \qquad (4.21)$$

where G still has an inverse Wishart distribution with n0 + n1 degrees of freedom and scale matrix S.

Sampling (Σ + τ) from (4.21) is not a trivial task anymore. We need to introduce a matrix decomposition presented in Le and Zidek (1992). Take a covariance matrix Σ which follows an inverse Wishart distribution with m degrees of freedom and scale matrix Ψ, and which can be partitioned in the following way:

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

where Σ11 is u × u and Σ22 is g × g. Through the Bartlett decomposition, Σ can be written as

$$\Sigma = \begin{pmatrix} \Sigma_{1|2} + \psi\Sigma_{22}\psi^T & \psi\Sigma_{22} \\ \Sigma_{22}\psi^T & \Sigma_{22} \end{pmatrix}, \qquad (4.22)$$

where Σ_{1|2} = Σ11 − Σ12Σ22^{-1}Σ21, a u × u matrix, and ψ = Σ12Σ22^{-1}, a u × g matrix. The inverse Wishart distribution of Σ can be equivalently presented in terms of the new parameters (Σ22, Σ_{1|2}, ψ) as

$$\Sigma_{22} \mid \Psi, m \sim IW(\Psi_{22},\, m-u)$$
$$\Sigma_{1|2} \mid \Psi, m \sim IW(\Psi_{1|2},\, m)$$
$$\psi \mid \Sigma_{1|2}, \Psi \sim N(\Psi_{12}\Psi_{22}^{-1},\; \Sigma_{1|2} \otimes \Psi_{22}^{-1}), \qquad (4.23)$$

where IW(A, B) stands for inverse Wishart with scale matrix A and B degrees of freedom, Σ22 is independent of (Σ_{1|2}, ψ), and ⊗ symbolizes the Kronecker product.

This decomposition can obviously be applied to our own covariance matrix Σ + τ, which has to be sampled from (4.21), with the partitioning suggested earlier in (4.18):

$$\Sigma + \tau = \begin{pmatrix} \Sigma_{11}+\tau_{11} & \Sigma_{12}+\tau_{12} \\ \Sigma_{21}+\tau_{21} & \Sigma_{22}+\tau_{22} \end{pmatrix} = \begin{pmatrix} (\Sigma+\tau)_{11} & (\Sigma+\tau)_{12} \\ (\Sigma+\tau)_{21} & (\Sigma+\tau)_{22} \end{pmatrix},$$

where Σ22 + τ22 = Σ22, since we assume there is no misclassification for the binary covariates. Therefore u will be equal to our number of continuous variables c, g equal to the number of binary variables b, m equal to the degrees of freedom n0 + n1 of our covariance matrix's inverse Wishart distribution, and Ψ equal to the scale matrix S of this same inverse Wishart distribution. Based on (4.22), Σ + τ can be reparametrized, leading to

$$\Sigma + \tau = \begin{pmatrix} (\Sigma+\tau)_{1|2} + \psi\Sigma_{22}\psi^T & \psi\Sigma_{22} \\ \Sigma_{22}\psi^T & \Sigma_{22} \end{pmatrix}.$$

However, Σ + τ does not follow exactly an inverse Wishart distribution, since it carries the extra conditions that it must be positive definite once τ is subtracted from it, and that the elements of the lower diagonal (diag(Σ22)) must all be 1's. Work has to be done to see how these conditions influence the distributions suggested in (4.23). First, the positive definite condition can be verified once the sampling is done, so this condition does not affect the distributions we sample from. The diagonal elements set to 1 will, however, make the sampling procedure more complex. For later use let us introduce the following notation:

$$IW^*(\text{Scale}, df) = G \mid \mathrm{diag}(G) = 1, \quad \text{where } G \sim IW(\text{Scale}, df).$$

Given the independence between Σ22 and ((Σ+τ)_{1|2}, ψ), sampling (Σ+τ)_{1|2} and ψ will always be trivial once the off-diagonal elements of Σ22 are identified.
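The positive definite check required by (4.21) is conveniently implemented as an attempted Cholesky factorization that is abandoned as soon as a pivot fails. A small illustrative helper in C (the dimension and matrix entries are made up for the example) might look like this:

```c
/* Check that G - tau is positive definite via a Cholesky attempt, as
   required by (4.21) before a sampled matrix is accepted. */
#include <stdio.h>
#include <math.h>

#define D 3

/* Returns 1 if the symmetric matrix A is positive definite, 0 otherwise. */
static int is_pos_def(double A[D][D]) {
    double L[D][D] = {{0}};
    int i, j, k;
    for (j = 0; j < D; j++) {
        double s = A[j][j];
        for (k = 0; k < j; k++) s -= L[j][k] * L[j][k];
        if (s <= 0.0) return 0;          /* pivot fails: not positive definite */
        L[j][j] = sqrt(s);
        for (i = j + 1; i < D; i++) {
            double t = A[i][j];
            for (k = 0; k < j; k++) t -= L[i][k] * L[j][k];
            L[i][j] = t / L[j][j];
        }
    }
    return 1;
}

int main(void) {
    /* Placeholder draw G and error matrix tau (error on one variable only). */
    double G[D][D]   = {{1.2, 0.3, 0.1}, {0.3, 1.5, 0.2}, {0.1, 0.2, 1.0}};
    double tau[D][D] = {{0.25, 0.0, 0.0}, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}};
    int i, j;
    for (i = 0; i < D; i++)
        for (j = 0; j < D; j++) G[i][j] -= tau[i][j];   /* Sigma = G - tau */
    printf("accept draw: %s\n", is_pos_def(G) ? "yes" : "no");
    return 0;
}
```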
The case where we have only one binary variable is quite simple: Σ22 would be a scalar set to be 1, independent of ((Σ+τ)_{1|2}, ψ), which could be sampled from the distributions given in (4.23) without being affected by Σ22 = 1. Once these are sampled, Σ + τ can be computed from (4.22), then τ subtracted so that the positive definite condition can be verified. If the condition is not met we resample these parameters.

The process becomes more complex when we are faced with more than one binary covariate, in which case b > 1 and Σ22 is not a simple scalar equal to 1; it has off-diagonal elements. In such a case an iterative process has to be used. The initial step is to perform the decomposition as it was done in the one-binary-covariate case. This time the decomposition gives us a matrix Σ22 of dimension b > 1. From (4.23) we know that Σ22 follows an inverse Wishart with n0 + n1 − c degrees of freedom and scale matrix S22, and its diagonal elements are all 1's. To keep the notation as simple as we can, we now take Σ* = Σ22 and S* = S22. Given Σ*'s distribution, the next step is to perform the same decomposition on Σ*, in which case the split has to be done in the following manner:

$$\Sigma^* = \begin{pmatrix} \Sigma^*_{11} & \Sigma^*_{12} \\ \Sigma^*_{21} & \Sigma^*_{22} \end{pmatrix},$$

so that Σ*11 = 1 and Σ*22 is a (b−1) × (b−1) matrix, still with 1's on its diagonal. From (4.22), Σ22 (now Σ*) can be written differently as

$$\Sigma^* = \begin{pmatrix} \Sigma^*_{1|2} + \phi\Sigma^*_{22}\phi^T & \phi\Sigma^*_{22} \\ \Sigma^*_{22}\phi^T & \Sigma^*_{22} \end{pmatrix},$$

where Σ*_{1|2} = Σ*11 − Σ*12Σ*22^{-1}Σ*21, but Σ*_{1|2} + φΣ*22φ^T = 1, and φ = Σ*12Σ*22^{-1}, a vector of dimension b − 1. We have to keep using the decomposition until the bottom right corner (Σ*22 for the above decomposition) is a scalar set to be 1. So if b = 2, two iterations are needed: the decomposition has to be performed twice in order to get Σ*22 = 1, once to split the continuous and binary variables and once more to obtain the scalar form of the bottom right corner. We saw in (4.23) that the decomposition specifies the distributions of the different parameters obtained from it; these distributions will be slightly different from the second iteration onward.

4.3.1 A Simple Case

We first deal with a fairly simple situation, the case where the number of binary covariates b is 2. In such a case the initial decomposition leads to

$$\Sigma + \tau = \begin{pmatrix} (\Sigma+\tau)_{1|2} + \psi\Sigma_{22}\psi^T & \psi\Sigma_{22} \\ \Sigma_{22}\psi^T & \Sigma_{22} \end{pmatrix}, \qquad (4.24)$$

where (Σ+τ)_{1|2} + ψΣ22ψ^T is a c × c matrix (remember that c is the number of continuous variables), and Σ22 is a 2 × 2 matrix with 1's on its diagonal. From (4.23) we know that

$$\Sigma_{22} \mid S, n_0, n_1 \sim IW^*(S_{22},\; n_0+n_1-c)$$
$$(\Sigma+\tau)_{1|2} \mid S, n_0, n_1 \sim IW(S_{1|2},\; n_0+n_1)$$
$$\psi \mid (\Sigma+\tau)_{1|2}, S \sim N(S_{12}S_{22}^{-1},\; (\Sigma+\tau)_{1|2} \otimes S_{22}^{-1}), \qquad (4.25)$$

where Σ22 is independent of ((Σ+τ)_{1|2}, ψ). For simplicity we again use Σ* = Σ22 and S* = S22. We now perform the decomposition on Σ*, this time resulting in

$$\Sigma^* = \begin{pmatrix} \Sigma^*_{1|2} + \phi\Sigma^*_{22}\phi^T & \phi\Sigma^*_{22} \\ \Sigma^*_{22}\phi^T & \Sigma^*_{22} \end{pmatrix}, \qquad (4.26)$$

where both Σ*22 and Σ*_{1|2} + φΣ*22φ^T are equal to 1. Once again we use (4.23) to obtain

$$\Sigma^*_{22} = 1, \qquad \Sigma^*_{1|2} \mid S^*, n_0, n_1, c \sim IW(S^*_{1|2},\; n_0+n_1-c), \qquad (4.27)$$

subject to the condition Σ*_{1|2} + φΣ*22φ^T = 1, which can be reduced to Σ*_{1|2} + φ² = 1, making the condition independent of Σ*22; this will not be the case for situations with more than two binary covariates.
Σ*_{1|2}'s distribution is also simplified: given that it is a scalar, the inverse Wishart distribution becomes an inverse gamma, with shape and scale parameters respectively (n0 + n1 − c)/2 and S*_{1|2}/2.

In order to find the distribution of Σ*_{1|2} and φ under the constraint, we start from the joint distribution

$$f_{\Sigma^*_{1|2},\phi}(s_{1|2}, \phi) = f(\phi \mid s_{1|2})\, f(s_{1|2}),$$

from which we can derive the joint distribution of the new variables Z1 = Σ*_{1|2} + φ² and Z2 = φ (the Jacobian of this transformation is J = 1),

$$f_{Z_1,Z_2}(z_1, z_2) = f_{\Sigma^*_{1|2},\phi}(z_1 - z_2^2,\; z_2)\,|J|.$$

We now want the distribution of Z2 given Z1 = 1. We cannot get this exactly, but we can work out a function proportional to this conditional distribution:

$$f_{Z_2 \mid Z_1=1}(z_2 \mid z_1 = 1) = f_{Z_1,Z_2}(1, z_2)/f_{Z_1}(1) \propto f_{Z_1,Z_2}(1, z_2)$$
$$\propto (1 - z_2^2)^{-(m+3)/2} \exp\left(-\frac{S^*_{11}/m - 2(S^*_{12}/m)z_2 + (S^*_{22}/m)z_2^2}{2(1-z_2^2)/m}\right), \qquad (4.28)$$

where m = n0 + n1 − c. This is not a standard distribution, but we still want to sample from it. In order to do so we use the rejection sampling technique as presented in Ross (1997): to sample Z from a density proportional to f(·),

1. Generate Y from a density proportional to g(·), where g(·) is a function that bounds f(·).
2. Generate U from a standard uniform distribution.
3. If U ≤ f(Y)/g(Y), take Z = Y; else go back to 1.

Therefore we need to find a bound for (4.28). We proceed by splitting (4.28) into two parts:

$$A = (1-z_2^2)^{-(m+3)/2} \exp\left(-\frac{S^*_{11}/m + S^*_{22}/m - 1 - (S^*_{12}/m)^2}{2(1-z_2^2)/m}\right), \qquad B = \exp\left(-\frac{(z_2 - S^*_{12}/m)^2}{2(1-z_2^2)/m}\right).$$

To obtain an upper bound for A we note that

$$\operatorname*{argmax}_x\;\left(x^{-a}\exp(-b/x)\right) = b/a; \qquad (4.29)$$

the resulting bound is a constant obtained by plugging into A

$$(1-z_2^2) = \frac{m}{m+3}\left(S^*_{11}/m + S^*_{22}/m - 1 - (S^*_{12}/m)^2\right).$$

The bound for B is

$$\exp\left(-\frac{(z_2 - S^*_{12}/m)^2}{2/m}\right),$$

which is proportional to a normal density with mean S*12/m and variance 1/m, making it easy to sample from the bounding function. Once Z2 is sampled we have an estimate of φ, so we can complete Σ* (that is, Σ22) in (4.26), and then use the distributions given in (4.25) to sample the parameters needed to complete Σ + τ in (4.24). Following all of the previous steps carefully, we obtain a draw of the covariance matrix with parts of it fixed; then all that needs to be done is to subtract τ and verify the positive definite condition.
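Here is a hedged C sketch of this rejection step, written directly from the reconstruction of (4.28) and its A-B split above: the values of m, S*11, S*12 and S*22 are placeholders (and assumed to make the A bound positive), and the clipping of the argmax of A into (0, 1] is an added safety guard rather than something spelled out in the thesis.

```c
/* Rejection sampling of Z2 from (4.28): f(z) = A(z)*B(z) up to a constant,
   with envelope g(z) = Amax * exp(-m*(z - S12/m)^2 / 2). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double runif(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }
static double rnorm(void) { return sqrt(-2.0*log(runif())) * cos(2.0*M_PI*runif()); }

static double m = 97.0;                              /* m = n0 + n1 - c        */
static double S11 = 96.0, S12 = 40.0, S22 = 98.0;    /* placeholder S* entries */

static double logA(double z) {                       /* log of the A factor    */
    double x = 1.0 - z*z;
    double b = 0.5*m*(S11/m + S22/m - 1.0 - (S12/m)*(S12/m));
    return -0.5*(m + 3.0)*log(x) - b/x;
}

static double logB(double z) {                       /* log of the B factor    */
    double d = z - S12/m;
    return -0.5*m*d*d/(1.0 - z*z);
}

int main(void) {
    double a = 0.5*(m + 3.0);
    double b = 0.5*m*(S11/m + S22/m - 1.0 - (S12/m)*(S12/m)); /* assumed > 0 */
    double xstar = b/a, logAmax, z, dd;
    if (xstar > 1.0) xstar = 1.0;                    /* keep argmax in (0, 1]  */
    logAmax = -a*log(xstar) - b/xstar;

    srand(1);
    for (;;) {
        z = S12/m + rnorm()/sqrt(m);                 /* Y from normal envelope */
        if (z <= -1.0 || z >= 1.0) continue;         /* f is zero outside (-1,1) */
        dd = z - S12/m;
        /* accept with prob f(Y)/g(Y) = A(Y)B(Y) / (Amax * exp(-m*dd*dd/2)) */
        if (log(runif()) <= logA(z) + logB(z) - logAmax + 0.5*m*dd*dd) break;
    }
    printf("sampled off-diagonal element: %f\n", z);
    return 0;
}
```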
4.3.2 More than Two Binary Covariates

We must now be able to deal with a greater number of binary covariates; this increases the number of iterations and again slightly alters the distributions. Assume we still have c continuous variables and b binary ones, but this time b is greater than 2.

We proceed as in the previous simple case: we perform the initial decomposition and obtain a matrix just like in (4.24), except that Σ22 is now b × b, b > 2. The distributions in (4.25) still stand, so we can go ahead with the second decomposition as in the simple case and obtain Σ* just like in (4.26), with the exception that even after this second iteration Σ*22 is not yet a scalar equal to 1; it is now a (b−1) × (b−1) matrix. There is now a slight transformation of the distributions we had obtained in (4.27):

$$\Sigma^*_{22} \mid S^*, m \sim IW^*(S^*_{22},\; m-1)$$
$$\Sigma^*_{1|2} \mid S^*, m \sim IW(S^*_{1|2},\; m)$$
$$\phi \mid \Sigma^*_{1|2}, S^* \sim N(S^*_{12}S^{*-1}_{22},\; \Sigma^*_{1|2} \otimes S^{*-1}_{22}), \qquad (4.30)$$

where m = n0 + n1 − c, Σ*22 is independent of (Σ*_{1|2}, φ), and these are still subject to the condition Σ*_{1|2} + φΣ*22φ^T = 1. Since this time Σ*22 is not yet a scalar equal to 1, the condition Σ*_{1|2} + φΣ*22φ^T = 1 cannot be simplified and therefore depends on Σ*22, which will once more influence our sampling. We want to sample from a density proportional to

$$f(\Sigma^*_{22})\,f(\Sigma^*_{1|2})\,f(\phi \mid \Sigma^*_{1|2})\,I(\mathrm{diag}(\Sigma^*_{22}) = 1)\,I(\Sigma^*_{1|2} + \phi\Sigma^*_{22}\phi^T = 1).$$

First we consider the distribution of (Σ*22, Σ*_{1|2}, φ | diag(Σ*22) = 1) to be g(Σ*22)f(Σ*_{1|2})f(φ|Σ*_{1|2}), where g(Σ*22) is not the same as f(Σ*22). Thus we want to sample from the density proportional to

$$g(\Sigma^*_{22})\,f(\Sigma^*_{1|2})\,f(\phi \mid \Sigma^*_{1|2})\,I(\Sigma^*_{1|2} + \phi\Sigma^*_{22}\phi^T = 1), \qquad (4.31)$$

where both f(Σ*_{1|2}) and f(φ|Σ*_{1|2}) are known. We now reparametrize by introducing Z1 = Σ*_{1|2} + φΣ*22φ^T and Z2 = φ; once again the Jacobian of this transformation is J = 1. From this, (4.31) becomes

$$g(\Sigma^*_{22})\,f(z_1 - z_2\Sigma^*_{22}z_2^T)\,f(z_2 \mid z_1 - z_2\Sigma^*_{22}z_2^T)\,I(z_1 = 1);$$

therefore we want to sample from a density proportional to

$$g(\Sigma^*_{22})\,f(1 - z_2\Sigma^*_{22}z_2^T)\,f(z_2 \mid 1 - z_2\Sigma^*_{22}z_2^T). \qquad (4.32)$$

Substituting the respective distribution functions into (4.32) and doing a little algebra, we get that (4.32) is proportional to

$$g(\Sigma^*_{22})\,\underbrace{(1 - z_2\Sigma^*_{22}z_2^T)^{-(m+3)/2}\exp\left(-\frac{S^*_{1|2}}{2(1 - z_2\Sigma^*_{22}z_2^T)}\right)}_{A}\;\underbrace{\exp\left(-\frac{(z_2 - S^*_{12}S^{*-1}_{22})S^*_{22}(z_2 - S^*_{12}S^{*-1}_{22})^T}{2(1 - z_2\Sigma^*_{22}z_2^T)}\right)}_{B}. \qquad (4.33)$$

To sample from this distribution we again apply the rejection sampling technique, and so need to bound (4.33). To find an upper bound for A we use (4.29); the resulting bound is a constant obtained by plugging into A

$$1 - z_2\Sigma^*_{22}z_2^T = \frac{S^*_{1|2}}{m+3}.$$

B can easily be bounded by

$$\exp\left(-\tfrac{1}{2}(z_2 - S^*_{12}S^{*-1}_{22})S^*_{22}(z_2 - S^*_{12}S^{*-1}_{22})^T\right),$$

which is, up to a near constant, a multivariate normal distribution of dimension b − (i−1), with mean vector S*12S*22^{-1} and covariance matrix S*22^{-1}, where i is the number of iterations performed to this point. This again makes it easy to sample from the bounding distribution. The whole bound is thus the product of these two bounds, and in both the bound and the distribution we want to sample from in (4.33) we have m = n0 + n1 − c − 1. This gives us the off-diagonal elements of Σ22, and now that we have Σ22 we can use it to sample from the last two distributions of (4.25). It is now possible to run the Gibbs sampling algorithm of Section 4.2.2, since all the conditional distributions needed have been determined.

4.4 Discussion

We have seen that in the generalization of our approach to a multivariate setting it is feasible to incorporate binary variables into our model. The use of a variation of the probit regression model enables us to get a continuous representation of these binary covariates. This continuous representation is obtained by generating values from truncated normal distributions which have their variance set to 1.
Then an iterative process which makes it possible to estimate a covariance matrix with a partially fixed diagonal allows us to compute estimates of our parameter of interest, β = Σ^{-1}(μ1 − μ0), the main parameter of a standard logistic regression.

Chapter 5

An Application to Real Data

The work presented in the previous chapters has resulted in an approach that should be generally applicable to actual data. One of the generalizations allowed for multivariate exposures and another made the addition of binary covariates to the analysis possible. We now verify, via the analysis of real case-control data, whether our methodology is suitable for applied situations and how results are influenced by our correction for error. The present chapter first describes the data we will be working with and presents the results obtained in a previously conducted analysis. We also verify whether the basic assumptions of our method are met and, if so, proceed with the analysis using our own approach.

5.1 Bladder Cancer Case-Control Data

The data we analyze is a fraction of a dataset obtained for a large-scale study whose objective was to identify occupational cancer risk factors. Information on occupation, smoking and alcohol consumption histories was collected by means of a self-administered questionnaire from male cancer patients aged 20 years and over, ascertained from the British Columbia population-based cancer registry from 1983 to 1990. To estimate smoking relationships for types of cancer known to be strongly associated with cigarette smoking, patients suffering from cancer types with no such associations were used as controls. So the control group consisted of all cancer types with the exception of those known to be associated with smoking.

The analysis was performed by matching cases and controls on exact age and diagnosis year. The patients' smoking histories were measured in pack-years (number of years of smoking 20 cigarettes a day). Odds ratios for different stratifications of these pack-years were estimated using conditional logistic regression. Based on this analysis, a statistically significant relationship was found between bladder cancer and cigarette smoking: the odds ratio grew as the pack-years increased. This analysis was conducted assuming no measurement error was present. The details of this study are presented in Band et al. (to appear in the Journal of Occupational and Environmental Medicine).

In this large-scale study many types of cancer were considered, but the fraction we focus on deals with bladder cancer and contains information on the disease status (case or control), the age, the smoking history and the diagnosis year. Here we apply the Bayesian approach to re-examine the relationship between cigarette smoking and the risk of developing bladder cancer, taking into consideration measurement error in the smoking history information.

5.1.1 A Closer Look at the Data

In light of these results we are interested in seeing whether our method leads to similar conclusions. A first step is to take a closer look at the data and see if the assumptions required by our approach are met. Our dataset consists of 1038 cases and 7006 controls. We have, for each individual, their bladder cancer status (healthy: 0, diseased: 1), their age (in years), their smoking history (in pack-years) and their year of diagnosis (from 1983 to 1990).
We initially produce histograms for the age and smoking history variables. The one for age has a longer and thicker left tail, but a transformation could probably approximate normality. The histogram for the smoking history, however, reveals a problem of greater consequence: there is an important concentration of 'never-smokers', individuals whose smoking history is 0 pack-years. In fact there are 1782 never-smokers, of which 1658 are controls and 124 are cases; together they represent close to 20% of the whole group. Therefore, no transformation could lead us to an approximately normal distribution for the smoking history.

Since the data in the present format fail to meet the normality assumption, we need to consider a slightly different objective. We now consider 'smokers' only, that is, the individuals who did not have 0 pack-years for their smoking history. We examine the dose-response, that is, how the risk of bladder cancer is influenced by the amount of smoking a person has done throughout their life. Ignoring the non-smokers leaves us with 914 cases and 5348 controls. Once more we construct histograms for the age and smoking history variables; these are shown in Figure 5.1. Based on these histograms, age seems quite close to normality, but smoking history obviously requires a transformation in order to approximate normality. The two bottom panels of Figure 5.2 indicate that, although it is not perfect, a logarithmic transformation of the smoking history can be considered approximately normal.

To see if the age and the logarithm of the smoking history can both be used as continuous variables, we perform logistic regressions of the disease status on different stratifications of these two variables. The parameters obtained for these regressions are plotted against the increasing stratifications in Figure 5.3. We notice that for the regression on age the coefficients seem to follow a curvilinear trend; perhaps a logarithmic transformation of the age would correct for that. The coefficients for the logistic regression on the logarithm of the smoking history, however, increase in a relatively linear fashion, indicating that the logarithm of the pack-years can be used as a continuous variable. Given the non-linearity of the coefficients for age, we try the logarithmic transformation of the age as well: we regress (logistically) on stratifications of the logarithm of the age, and the plot of the obtained parameters shown in Figure 5.4 increases in a slightly more linear fashion than for the untransformed age, indicating that it is reasonable to use the logarithm of the age as a continuous variable. We also need to verify the normality assumption for this transformation of the age; the upper panels of Figure 5.2 indicate that normality of the logarithm of the age is reasonably acceptable.

5.2 The Analysis

Based on the verification of assumptions done in the previous section, we can now proceed with the analysis of this bladder cancer case-control data using the methodology presented in the earlier chapters. The variables we use are the following: the logarithmic transformations of the age and of the smoking history, which we respectively name X1 and X2, and a binary variable indicating whether the cancer patient was diagnosed before 1987, which we name Y3.
That last variable mainly serves as a check on whether diagnostic methods changed between the two four-year periods, 1983 to 1986 and 1987 to 1990. Using an indicator for each year from 1983 to 1990 would have given us eight binary variables, which our approach can handle; however, a normal representation of these variables would not have been appropriate, given that a 1 for one of them means 0 for all the others, fixing the sum at 1. The normal representation, which sets the sign of the normal variable according to the binary response, can deal with correlation but not so much as to fix the sum. The analysis is thus performed following the model

$$\mathrm{logit}\{P(D = 1 \mid X = x)\} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 Y_3, \qquad (5.1)$$

where X = (X1, X2, Y3) and α is not estimable from case-control data alone.

Remember that the original analysis was performed assuming no measurement error was present. However, it is important to note that the information on the smoking history of these patients was obtained from a self-administered questionnaire. For this reason smoking history is believed to be reported with some uncertainty. Some people, convinced that smoking is undoubtedly the cause of their cancer, may tend to overestimate it. Others, who put the blame on different factors like pollution or work environment, could tend to play down the effect of their smoking habits on their illness and thereby underestimate it. We are interested in examining how different error levels in this variable would influence the results. In order to see how the parameter estimates are affected, we conduct the analysis three times, each time assuming a different level of error on the smoking history variable: in the first case we assume no error is present, in the second a 10% error, and finally a 25% error. Note that the error, when assumed known, was incorporated in our method by assuming that the true data covariance matrix Σ was inflated by the addition of the errors' covariance matrix τ, leading to the observed data covariance matrix Σ + τ. We consider the smoking history data to arise from

$$\log(X_2^*) = \log(Z) + \log(X_2), \qquad (5.2)$$

where X2* is the observed pack-years, X2 is the true pack-years and log(Z) is the error component, which is equivalent to

$$X_2^* = Z \times X_2. \qquad (5.3)$$

In (5.3) the error component Z is a multiplicative factor on the true pack-years, so the desired error levels can be obtained in the following way:

No error: τ22 = 0
10% error: 1.1 ≈ exp(0.1) and 0.9 ≈ exp(−0.1) ⟹ τ22 = (0.1)²
25% error: 1.25 ≈ exp(0.25) and 0.75 ≈ exp(−0.25) ⟹ τ22 = (0.25)²

Note that only the (2,2) element of τ is affected by the error, since we assume error only for the smoking history and that this error is not correlated with the other variables. Our correction therefore consists in subtracting the covariance matrix τ we have assumed for the error from the estimated observed data covariance matrix Σ + τ, all this within runs of our Gibbs sampling algorithm because of the presence of a binary variable.
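The correction itself is a small piece of linear algebra performed on each draw. A hedged C sketch for the d = 3 case (all numerical values are placeholders, and solve3 is an illustrative helper, not the thesis code) subtracts the assumed τ and solves Σβ = μ1 − μ0:

```c
/* Correct one posterior draw for measurement error: Sigma = (Sigma+tau) - tau,
   then beta = Sigma^{-1} (mu1 - mu0), solved by Gaussian elimination (d = 3). */
#include <stdio.h>

#define D 3

static void solve3(double A[D][D], double b[D], double x[D]) {
    int i, j, k;
    for (k = 0; k < D; k++) {            /* forward elimination, no pivoting:
                                            adequate for well-conditioned draws */
        for (i = k + 1; i < D; i++) {
            double f = A[i][k] / A[k][k];
            for (j = k; j < D; j++) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    for (i = D - 1; i >= 0; i--) {       /* back substitution */
        double s = b[i];
        for (j = i + 1; j < D; j++) s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }
}

int main(void) {
    /* Placeholder draw of Sigma + tau for (log age, log pack-years, Y3). */
    double sigma_tau[D][D] = { {0.020, 0.010, 0.002},
                               {0.010, 0.900, 0.050},
                               {0.002, 0.050, 1.000} };
    double dmu[D] = {0.01, 0.20, 0.02};  /* placeholder mu1 - mu0 */
    double tau22 = 0.25 * 0.25;          /* assumed 25% error level */
    double beta[D];

    sigma_tau[1][1] -= tau22;            /* only the (2,2) element of tau is
                                            nonzero (smoking history error)   */
    solve3(sigma_tau, dmu, beta);
    printf("beta1 = %f  beta2 = %f  beta3 = %f\n", beta[0], beta[1], beta[2]);
    return 0;
}
```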
The range and the centrality measures of the posteriors for both B\ and B^ remain relatively constant while the error level is changing. Now what happens to the parameter B21 It is the 63 one we expect should be mainly affected by the assumed error on the reported pack-years and in fact it is. Based on the histograms of Figure 5.5 and some summary statistics for each of these posteriors we can detect an increase in fl2 as the error is increasing. This is exactly what was expected since the esti-mated parameters are obtained from the difference between the estimated case mean and control mean multiplied by the inverse of the estimated covariance matrix (fl = E _ 1 ( / i i \u00E2\u0080\u0094 fj,0)). When this covariance matrix is estimated from the observed data, it will be overestimated if there is error in the data. There-fore when correcting for the error the estimate of the covariance matrix will be reduced, which means division by a smaller value leading to an increased estimate of the parameter fl. As it was mentioned earlier, the interpretation of the parameter fl is that it is the coefficient in the prospective log-odds ratio and that for rare disease the relative risk of disease can be approximated by the odds ratio. Thus, the basic conclusion that can be drawn from the obtained results is that the risk of getting bladder cancer given the smoking history is underestimated if there is in fact error in the observed data. Meaning that an analysis performed assuming no error to be present gives conservative estimates of the relative risk of disease. It should be noted that in the present analysis it was impossible, as it is for many exposures, to get a validation sample in which we would have had both the true smoking history and the observed one. We could only guess on what the error was. Thus, our approach here was mainly used as a validation, 64 but the potential of the methodology should not be overlooked. 65 Histogram for Age CO o d CM o o d o d o CM o d o d o o d o o d o d \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 \u00E2\u0080\u00A2 I ^ III illll illlll illlll Hil llll 111 fill 111 11 Ii 4 :l \u00E2\u0080\u00A2: h it ;t i| 20 40 60 80 100 Years Histogram for Smoking History \u00E2\u0080\u00A2111 i I ij :. 
[Figure 5.1: Histograms for the distribution of the age (in years) and smoking history (in pack-years) variables.]

[Figure 5.2: Diagnostic plots to verify the normality of the logarithmic transformations of the age and smoking history variables.]

[Figure 5.3: Plots of the coefficients from the logistic regression of the disease status on increasing stratifications of both the age and the logarithmic transformation of the smoking history variables.]

[Figure 5.4: Plot of the coefficients from the logistic regression of the disease status on increasing stratifications of the logarithmic transformation of the age variable.]

[Figure 5.5: Histograms of the posterior distributions of the parameter estimates for β1, β2 and β3, obtained under the assumptions of no measurement error, 10% measurement error, and 25% measurement error in the smoking history variable.]

Chapter 6

Conclusion

At the risk of repeating ourselves, imprecise exposure measurements are common in case-control studies. Since perfecting measuring instruments is not always a feasible option, accepting the errors and developing methodology that accounts for them in the analysis has generated a fair amount of literature. Chapter 2 reviewed and summarized some of the work that has been done on the subject. Prospective logistic regression being the typical approach for analyzing case-control data, most of the methods that offer a correction for these errors are based on this methodology, which fits a prospective model to the retrospectively sampled data. The alternative we adopted called on the Bayesian approach, which proved to be a fairly natural way of incorporating uncertainty about the unobserved true exposure.

The simulations and comparisons to known procedures carried out in Chapter 3 found our method to perform reasonably. Although the assumptions of the normal discriminant analysis model may be somewhat stronger than those of the prospective logistic regression model, they should be met fairly
Analyzing real case-control data showed the method could be applied fairly easily, and produced results that were in line with what was theoretically expected. Finally, although we could not explore these in the scope of this thesis, the following suggestions could be considered for further development of the presented methodology. One would be to generalize the approach to make it applicable to matched case-control studies. The other, to explore other appli-cations of the algorithm for sampling covariance matrices with partially fixed 72 diagonal elements, for example multivariate or multinomial probit models. 73 Bibliography [1] Albert, M.H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical As-sociation 88, 669-679. [2] Armstrong, B.G., Whittemore, A.S. and Howe, G.R. (1989). Analysis of case-control data with covariate measurement error: application to diet and colon cancer. Statistics in Medicine 8, 1151-1163. [3] Berger. J.O. and Bernardo, J.M. (1992). On the development of refer-ence priors. Bayesian Statistics 4 35-49. [4] Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research. Lyon: International Agency for Reseach on Cancer. [5] Breslow, N.E. (1996) Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91, 14-28. [6] Buonaccorsi, J.R (1990). Double sampling for exact values in the nor-mal discriminant model with application to binary regression. Com-mun. Statist. -Theory Meth. 19, 4569-4586 74 [7] Carroll, R.J., Gail, M.H. and Lubin, J.H. (1993). Case-control studies with errors in covariates. Journal of the American Statistical Associa-tion 88, 185-199 [8] Carroll, R.J., Roeder, K. and Wasserman, L. (1996). Flexible para-metric measurement error models. Technical Report 648, Department of Statistics, Carnegie Mellon University. [9] Casella, G. and George, E.I. (1992). Explaining the Gibbs sampler. The American Statistician 46, 167-174. [10] Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit model. Biometrika 85, 347-361. [11] Cornfield, J. (1951). A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast and cervix. Journal of the National Cancer Institute 11, 1269-1275. [12] Dellaportas, P. and Stephens, D.A. (1995). Bayesian analysis of errors-in-variables regression models. Biometrics 51, 1085-1095. [13] Forbes, A.B. and Santner, T.J. (1995) Estimators of odds ratio regres-sion parameters in matched case-control studies with covariate mea-surement error. Journal of the American Statistical Association 90, 1075-1084 75 [14] Le, N.D. and Zidek, J.V. (1992). Interpolation with uncertain spatial covariances: A Bayesian alternative to kriging. Journal of Multivariate Analysis 43, 351-374. [15] Mallick, B.K. and Gelfand, A.E. (1996). Semiparametric errors-in-variables models: A Bayesian approach. Journal of Statistical Plan-ning and Inference 52, 307-321. [16] McKeown-Eyssen, G.E. and Thomas, D.C. (1985). Sample size deter-mination in case-control studies: The influence of the distribution of exposure. Journal of Chronic Dis. 38, 559-568 [17] McKeown-Eyssen, G.E. and Tibshirani, R. (1994). Implications of measurement error in exposure for the sample sizes of case-control studies. American Journal of Epidemiology 139, 415-421. [18] Muller, R, Parmigiani, G., Schildkraut, J. and Tardella, L. (1996). A Bayesian hierarchical approach for combining case-control and prospective studies. 
Discussion Paper 96-29, Institute of Statistics and Decision Sciences, Duke University.

[19] Muller, P. and Roeder, K. (1994). A Bayesian semiparametric model for case-control studies with errors in variables. Discussion Paper 94-28, Institute of Statistics and Decision Sciences, Duke University.

[20] Prentice, R.L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403-411.

[21] Richardson, S. and Gilks, W.R. (1993). A Bayesian approach to measurement error problems in epidemiology using conditional independence models. American Journal of Epidemiology 138, 430-442.

[22] Richardson, S. and Gilks, W.R. (1993). Conditional independence models for epidemiological studies with covariate measurement error. Statistics in Medicine 12, 1703-1722.

[23] Roeder, K., Carroll, R.J. and Lindsay, B.G. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722-732.

[24] Rosner, B., Spiegelman, D. and Willett, W.C. (1990). Correction of logistic regression relative risk estimates and confidence intervals for measurement error: The case of multiple covariates measured with error. American Journal of Epidemiology 132, 734-745.

[25] Rosner, B., Willett, W.C. and Spiegelman, D. (1989). Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Statistics in Medicine 8, 1051-1069.

[26] Ross, S. (1997). Introduction to Probability Models. 6th ed. San Diego: Academic Press.

[27] Thomas, D., Stram, D. and Dwyer, J. (1993). Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Annual Review of Public Health 14, 69-93.

[28] Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics 22, 1701-1728.

[29] Wolfinger, R.D. and Kass, R.E. (1996). Bayesian analysis of variance component models via rejection sampling. Technical Report 642, Department of Statistics, Carnegie Mellon University.