Non-addi t ive effects i n Logis t ic Regression by Kazi Mahbubur Rashid Azad M . S c , University of Dhaka, 1992 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Statistics) We accept this thesis as conforming to the required standard The Univers i ty of B r i t i s h C o l u m b i a March 2003 © Kazi Mahbubur Rashid Azad, 2003 In presenting t h i s thesis i n p a r t i a l f u l f i l m e n t of the requirements for an advanced degree at the U n i v e r s i t y of B r i t i s h Columbia, I agree that the L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e for reference and study. I further agree that permission for extensive copying of t h i s thesis for s c h o l a r l y purposes may be granted by the head of my department or by h i s or her representatives. I t i s understood that copying or p u b l i c a t i o n of t h i s thesis for f i n a n c i a l gain s h a l l not be allowed without my written permission. Department of STATIST tC£ The U n i v e r s i t y of B r i t i s h Columbia Vancouver, Canada Date M o X d U 0<O, HOP 3 Abstract Logistic regression is commonly used in epidemiology to model the relationship between risk factors and presence/absence of a disease. Usually it is difficult to look for interaction structure (many possible pairwise interactions, for instance) to include in the model. So a model which is additive on the logit scale is fitted. If the number of risk factors is relatively large such an additive relationship may not make good sense. A new logistic regression model is proposed to incorporate non-additive interaction effects. In some scenarios this model might better reflect the relationship between the response variable and the risk factors. The Bayesian approach is followed to fit the model and a Markov chain Monte Carlo ( M C M C ) algorithm, known as the hybrid algorithm is used to simulate the parameters. We apply the new model to three examples and interpret the parameter estimates. We compare the predictive performance of the new model with that of the step-wise and the ordinary logistic regression models. n Contents Abstract ii Contents iii List of Tables v List of Figures vi Acknowledgements viii Dedication ix 1 Introduction 1 1.1 Logistic Regression 2 1.2 Logistic Regression Model 3 1.3 Model Fitting . . . 4 1.3.1 Log Likelihood for Binomial Data 4 1.4 Bayesian Approach to Model Fitting 6 1.5 M C M C methods for parameter estimation 8 1.5.1 Monte Carlo Integration 9 1.5.2 Markov Chains 10 1.5.3 The Metropolis-Hastings Algorithm '. 10 1.5.4 Hybrid (HY) Algorithm 13 1.6 Variable Selection 14 1.6.1 Stepwise Logistic Regression 15 in 1.7 Cross-Validation 17 1.8 Outline and Scope of the Thesis 17 2 Model Specification and Estimation 18 2.1 Interaction and Effect Modification 18 2.2 Specific non-additive functional form of the logit 23 2.3 Model Fitting and Estimation 25 2.3.1 Log Likelihood Function 26 2.3.2 Posterior density 27 2.3.3 Parameter estimation 27 3 Examples 34 3.1 Example 1 : South African Heart Study (SAHS) .' . 34 3.1.1 Model Comparison 40 3.2 Example 2 : Scottish Heart Health Study (SHHS) 46 3.2.1 Model Comparison 50 3.3 Example 3: Mystery data 53 3.3.1 Model Comparison 58 4 Discussion and Future Work 61 Bibliography 64 iv List of Tables 3.1 Description of the risk factors for SAHS 35 3.2 Summary results of the posterior distribution 36 3.3 Summary results from the ordinary logistic regression fit 36 3.4 Estimated average effects of the nine predictors using the N A D model 40 3.5 Code sheet for the Scottish Heart Health Study 46 3.6 Summary results of the posterior distribution from SHHS 47 3.7 Summary results from the ordinary logistic regression fit of the SHHS 47 3.8 Estimated average effects of the six predictors from SHHS . 50 3.9 Summary results of the posterior distribution from Mystery data . 54 3.10 Summary results from the ordinary logistic regression fit 54 3.11 Estimated average effects of the twelve predictors from Mystery data 55 v List of Figures 1.1 Logistic Function 3 2.1 Plot of the logits under three different models showing the presence and absence of interaction 19 2.2 Hypothetical representation of relationship among risk factors 21 2.3 Logit as non-linear function of the number of risk factors 22 2.4 Comparison of different A values 25 3.1 Sample path of the M C M C output where the first panel is for the intercept, panels 2-10 are of the coefficients and the last one is for A 37 3.2 Posterior distribution of the parameters. The first panel is for the intercept, panels 2-10 are of the coefficients and the last one is for A 38 3.3 Scatterplot for comparing average effects with the logistic regression estimates . . 41 3.4 Distribution of individual level effects for all risk factors 42 3.5 Boxplot for comparing performance of three logistic models 43 3.6 Scatterplot of fitted probabilities 44 3.7 Sample path of the M C M C output where the first panel is for the intercept, panels 2-8 are of the coefficients and the last one is for A 48 3.8 Posterior distribution of the parameters. The first panel is for the intercept, panels 2-8 are of the coefficients and the last one is for A 49 3.9 Scatterplot for comparing average effects with the logistic regression estimates . . 50 3.10 Trace plots of the average effects of the six predictors considering every tenth sample from the M C M C output of the parameters 51 vi 3.11 Boxplot for comparing performance of the three models 52 3.12 Distribution of the number of risk factors 53 3.13 Sample path of the M C M C output where the first panel is for the intercept, panels 2-13 are of the coefficients and the last one is for A 56 3.14 Posterior distribution of the parameters. The first panel is for the intercept, panels 2-13 are of the coefficients and the last one is for A 57 3.15 Scatterplot for comparing average effects with the logistic regression estimates . . 58 3.16 Trace plots of the average effects of the twelve predictors considering every tenth sample from the M C M C output of the parameters 59 3.17 Boxplot for comparing performance of the three models 60 vn Acknowledgements Firs t of a l l thanks to A l m i g h t y and the Department of Statistics to give me the opportuni ty to study here. I am very grateful to my supervisor D r . P a u l Gustafson for his innovative ideas that he shared w i t h me and many many thanks to h i m for giving me continuous support regarding theoretical development and implementing them into my thesis. Wi thou t his help i n programming I could not have finished my thesis. I a m thakful to D r . L a n g W u for his comments about my thesis as the second reader. M a n y thanks to L i s a Kuramoto for her hints regarding some programming problems and Latex. I am grateful to my wife Nafisa Anawer for her continuous mental support i n completing the thesis, though she was not here. I acknowledge my parents' support. K A Z I M A H B U B U R R A S H I D A Z A D The University of British Columbia March 2003 v i i i To my Parents and Wife. ix Chapter 1 Introduction Regression methods are commonly used to describe the relationship between a response vari- able and one or more explanatory variables. When the response variable is discrete, taking two (or more) possible values, logistic regression is the standard method of data analysis. In epidemiology we often have to model the relationship between the status (presence/absence) of a disease and one or more risk factors. Logistic regression models are commonly used for describing this type of relationship and determining if there is a significant effect of the risk factors on the disease outcome. Often we have to include interaction effects in the model along with the main effects to predict the response. Usually it is very difficult to look for interaction structure (many possible pairwise interactions, for instance) especially when there are many risk factors to include in the model. Linde and Osius [13] have commented, "within the setting of parametric logistic regression, interactions can be modeled only in a clumsy and limited way". So a model which is additive on the logit scale is fitted. For simplicity assume the risk factors are binary for now. When the number of risk factors are relatively large and if we want to include the pairwise interactions also such an additive relationship may not make good sense. Suppose we have ten risk factors. Further suppose that in fitting the additive model without interactions, each coefficient is estimated to be log(1.5), that is, each risk factor by itself has an odds-ratio of 1.5. The additive model says that someone with all ten risk factors has an odds-ratio of 1.510 « 58 relative to someone without any risk factors. In many applications this would be implausibly large. Thus instead of modeling an additive relationship and trying to pick out pairwise interactions terms, can we fit non-additive models? Are there any other kind 1 of interactions that better explain the relationship between the outcome of the response variable and the risk factors? The objective of our thesis is to address these questions by considering different functional forms for modeling the regression relationship. We also want to show that non-additive effects might have plausible and reasonable interpretations in real-life situations. In this chapter we will briefly describe the background materials to understand the logistic regression model, Bayesian approach to parameter estimation, MCMC methods, step- wise logistic regression and cross-validation to find out the predictive performance of different models. 1.1 Logistic Regression Let Y be a binary response variable indicating presence (V = 1) and absence (Y — 0) of a disease. Consider a collection of p explanatory variables or risk factors denoted by the vector X' — (Xi,X2, • •. ,Xp). In any regression problem the key quantity is the mean value of the response variable, given the value of the independent or explanatory variable. This quantity is called the conditional mean and will be expressed as UE(Y\X = x)". Let us denote this mean as n(x). For a binary response variable E(Y\X = x) = 1 x Pr(Y = 1\X) + 0 x Pr(Y = 0\X) = Pr(Y = 1\X = x). In linear regression we assume that this conditional mean may be expressed as a linear function of X. That is, E(Y\X = x) = ir(x) = Pr{Y = 1\X = x) = po + 0!Xi + ... + /3pxp, (1.1) which is called a linear probability model. Here /?o is the intercept and (3' = (/3i,^2,... ,@p) are regression coefficients to be estimated. When observations on Y are independent, this model is a generalized linear model (GLM) with identity link function. This linear model (1.1) has a major structural defect. With a dichotomous response variable ir takes values between 0 and 1, whereas linear functions take values over the entire real line. Model (1.1) predicts ir < 0 and 7r > 1 for sufficiently large or small x values. Therefore, there must in fact be a nonlinear relationship between ir(x) and x. 2 1.2 Logistic Regression Model Because of the structural problems with the linear probability model (1.1), a nonlinear or curvilinear relationship between x and ir(x) is more reasonable. When we expect a monotonic relationship, the S-shaped curve in Figure 1.1 is natural shape for explaining the relationship between a dichotomous response variable and risk factors. A function having this shape is ^ _ exp(/30 + Bxxx + . . . + (3pXp) 1 + exp(/30 + Bixi + . . . + 3pxp) exp(ft) + B'x) called the logistic regression function, where x denotes the vector of predictors. Figure 1.1: Logistic Function We now have to find the link function that will connect the mismatched quantities and for which the logistic regression model (1.2) is a G L M . For this model the odds of obtaining response Y = 1 are ir(x) 1 — ir(x) = exp{80 + p'x) This formula provides a basic interpretation for the coefficients. The odds increase multiplica- tively by for every unit increase in X = x. Suppose we are interested in the odds ratio ip. If we have a single risk factor, say x\, then ip for a specific x\ = a compared to x\ = b can be computed as: = exp(/?0 + Bia) exp(/?0 + /3i6) However, in multi-variable situation ip for a specific risk factor can be computed by keeping all other variables fixed at some arbitrary values, usually at their average values. This is called adjusted odds-ratio. The log odds has the linear relationship - * ( i ^ y = = Po + P'x (1.3) Thus, the appropriate link is the log odds transformation or the logit transformation. The logit, g{x;B) is linear in its parameters, may be continuous, and may range from —oo to +00, depending on the range of x. 1.3 M o d e l F i t t i n g 1.3.1 Log Likelihood for Binomial Data Suppose that we have total n number of cases in the study. The responses yi,y%, • • •,yn are assumed to be the observed values of independent random variables Y\,Y2, • • • ,Yn such that Yi,i = 1,2,... ,n has the binomial distribution with index mi, the number of observations in each group and parameter 7Tj. The yj's may be success counts or success proportions in each group. For simplicity, we assume mi = 1, i = 1, 2 , . . . , n so that each j/j now represents "success" or "failure" of the outcome. Let xi — (xio,xn,... ,Xip) denote the pth setting of values of p explanatory variable X's, where x^ = 1. The x^s may be binary or continuous. We express 4 the logistic regression model (1.3) as ir(xi) = exp j l Pjxij 1 + exp PjXij Therefore, the likelihood function may be written in the form n i=i = { n a - ^ j j n e x p f i o g ^ = {"[[(I - K(Xi)) I exp (1.4) For model (1.3), the ith logit is Y_\- {3jX{j, so the exponential term in the last expression equals exp n I p i=i \j=o exp V I n J2 y&a & j=0 V i = l / Also, since [1 — 7r(ajj)] = [1 + exp(Y_) • PjXij)] 1 , the log likelihood equals v f n j=0 \ i = l / i = l 1 + exp [ The maximum likelihood estimate j3 of /3 satisfies the likelihood equations = 0 for j = 0 ,1 , . . . ,p. Since am d(33 — ^ UiXij ^ a;. exp 1 + exp (1.5) 5 the (p+1) likelihood equations are = o i a n d ^_(y% ~ ftijxij = 0 for .7 = 1,2,... ,P (1.6) where exp 1 + exp denotes the M L estimate of n(xi). Since this system of equations is not linear in 8, iterative methods such as Newton-Raphson method are needed to evaluate parameter estimates. 1.4 Bayesian A p p r o a c h to M o d e l F i t t i n g In the previous section we described the maximum likelihood estimation(MLE) procedure for estimating the parameters in which we do not incorporate any prior information about the parameters, but rather estimate them on the basis of the observed data. Let our parameters of interest be the vector 6, with a precise meaning in the problem under study. In Bayesian data analysis we assume that 6 has some probability distribution and we include these information along with the observed data in the estimation process (see, for example, Gelman et. al. [5]). It is likely that the researcher has some knowledge about 0. Inclusion of this body of knowledge in the analysis is possible and scientifically recommended. Bayesians and frequentists have divergent views in this respect. The latter does not admit this information because it has not been observed and is therefore not subject to empirical verification. The Bayesian approach incorporates this information to the analysis through a density p{9) even when this information is not precise. The process of Bayesian data analysis can be divided into the following three steps: 1. Setting up a full probability model - a joint probability distribution for all observable and unobservable quantities in a problem. 2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution - the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data. 6 3. Evaluating the fit of the model and the implications of the resulting posterior distribution. Bayesian Inference Bayesian statistical conclusions about a parameter vector 9, or unobserved data y, are made in terms of probability statements. These probability statements are conditional on the observed value of y, and we write them as p(0\y) or p{y\y). Conditioning also applies to the fixed values of any covariates, x. Bayes' rule As outlined in item (1), Bayesian inference contains two ingredients for calculating the posterior density: p{9), the prior distribution of 9, and p(y\9), the sampling distribution of the observed data y. The former distribution can also be specified by some constants/parameters just like the distribution of y. Sometimes it is useful to distinguish them from the parameter of interest of 9. These constants are then called hyperparameters, as they are the parameters of the distribution of the parameters. Initially, the hyperparameters are assumed to be known. We call p(9) the prior density as it contains the probability distribution of 9 before the observation of the value of y. The likelihood function of 9 is L(9) = p{y\9). The joint probability distribution of 9 and y can then be defined as p(0,y)=p(9)p(y\9). Simply conditioning on the known value of the data y, using the basic property of conditional probability known as Bayes' rule, yields the posterior density: p(o,y) P m = P(y) p(0)p(y\0) ( 1 ? ) p(y) wherep(y) = _]0p(0)p(y|̂ ), and the sum is over all possible values of ̂ (or p(y) = f p(9)p(y\9)d9 in the case of continuous 0), is called the marginal distribution of y. Since p(y) does not depend on 9, an equivalent expression of (1.7) is the following unnormalized posterior density, p(9\y)cxp(9)p(y\9), (1.8) where the proportionality is as a function of 9 for fixed y and p(y) is the normalization constant which may not be evaluated easily. 7 The concepts of prior and posterior are always relative to the observation considered at a given moment. It is possible that after observing y and obtaining the posterior, a new observation y also related to 9 through an eventually different likelihood function becomes available. In this case, the posterior (relative to y) is the prior (relative to y) and a new posterior can be obtained by a new application of Bayes' theorem. A Bayes' estimator of 9 is the mean of the posterior distribution of 9, called the posterior expectation, i.e. Therefore, the obvious distinction between the M L E and Bayesian approach is that in the M L E procedure we estimate the parameters by maximizing the likelihood function whereas in the Bayesian approach we obtain estimates by computing posterior mean of the parameters. To make predictive inferences about unknown observable, y, the posterior predictive distribution of y is given by: From the second and third lines of the equation we see that the posterior predictive distribution is an average of conditional predictions over the posterior distribution of 9. 1.5 M C M C methods for parameter estimation If 9 has high dimension and the model is complex so that we can not get any closed or nice mathematical form for conditional distributions for 9, then it's very difficult to estimate the 9B = E(9\y) J 9p(9)p(y\9)d9 Jp(9)p(y\9)d9 • (1.9) Prediction (1.10) 8- parameters. Markov Chain Monte Carlo ( M C M C ) techniques provide an answer to the diffi- cult problem of simulation from the high- dimensional distribution of the unknown quantities that appears in complex models (ref. Gamerman [4], Gilks et. al. [7]). M C M C is essentially Monte Carlo integration using Markov Chains. We need to integrate over the posterior dis- tribution of model parameters given the data. Monte Carlo integration draws samples from the required distribution until it approaches equilibrium, known as the limiting distribution, and then forms sample averages to approximate expectations. So our limiting distribution is usually the posterior distribution. Markov Chain Monte Carlo draws these samples by running a cleverly constructed Markov Chain for a long time. The integrations in (1.9) have until recently been the source of most of the practical difficulties in Bayesian inference, especially in high dimensions. In most applications, analytic evaluation of E(9\y) is impossible. The best alternative way of evaluation is the M C M C . .1.5.1 Monte Carlo Integration Let X be a vector of k random variables, with distribution 7r(.), where X consists of model parameters and missing data. Our task is to evaluate the expectation mttYV - ff(xMx)dx for some function of interest /( .) . Monte Carlo integration evaluates E[f(X)] by drawing samples {Xt, t = 1, 2 , . . . , n} from 7r(.) and then approximating n t=i So the population mean of f{X) is estimated by a sample mean. When the samples {Xt} are independent, the laws of large numbers ensure that the approximation can be made as accurate as desired by increasing the sample size n. In general, drawing samples {Xt} independently from 7r(.) is difficult. However, the {Xt} need not necessarily be independent. They can be generated by any process which draws samples throughout the support of in the correct proportions. One way of doing this is through a Markov Chain having 7r(.) as its stationary distribution. This is then Markov chain Monte Carlo. 9 1.5.2 Markov Chains A Markov chain is a special type of stochastic processes in which the future state of the process depends only on the present state and is independent of the previous states. Suppose we generate a sequence of random variables, {Xo ,Xi ,X2 , . . . } , such that at each time t > 0, the next state Xt+i is sampled from a distribution P(Xt+\\Xt) which depends only on the current state of the chain, X t . That is, given X t , the next state Xt+\ does not depend further on the history of the chain {Xo, X\,,..., Xt-i). This sequence is called a Markov chain, and P(.|.) is called the transition kernel of the chain, we will assume that the chain is time-homogeneous, i.e. P(.|.) does not depend on t. Let the distribution of Xt given Xo be denoted by P^(Xt\Xo). Subject to regularity conditions, the chain wil l gradually 'forget' its initial state Xo , and P^(.\Xo) will eventually converge to a unique stationary distribution, which does not depend on t or Xo- Thus as t increases, the sampled points {Xt} will look increasingly like dependent samples from 7r(.). We can now use the output from the Markov chain to estimate E[f(X)], where X has distribution 7r(.). The estimator is called an ergodic average. If the chain is ergodic and EV[/ (X)] < oo for the unique limiting distribution TT then This result is a Markov chain equivalent of the law of large numbers. It states that the av- erages of chain values also provide strongly consistent estimates of parameters of the limiting distribution n despite their dependence. 1.5.3 The Metropolis-Hastings Algorithm Markov chains can be constructed in several ways. We will describe Markov chains under the class of Metropolis-Hastings algorithm. This name comes from papers by Metropolis et al. [15] and Hastings [11]. For the Metropolis-Hastings algorithm, at each time t, the next state Xt+i is chosen by first sampling a candidate point Y from a proposal transition q(.\Xt). The proposal (1.11) t=i tn —> ^ [ / ( X ) ] as n —> oo, with probability 1 10 distribution may depend on the current point Xt- The candidate point Y is then accepted with probability a(Xt,Y) where < ™ - - ( • • ; » ) • If the candidate point is accepted, the next state becomes Xt+\ = Y. If the candidate is rejected, the chain does not move, i.e. Xt+\ = Xt. Consider a distribution TC from which a sample must be drawn via Markov chains. This task will only make sense if the non-iterative generation of TC is very complicated or expensive. In this case, a transition kernel P(Xt+\\Xt) must be constructed in a way such that n is the equilibrium distribution of the chain. Consider reversible chains where the kernel P satisfies 7t(Xt)P(Xt+1\Xt) = ir(Xt+1)P{Xt\Xt+1). The kernel P(Xt+i\Xt) consists of two elements: an arbitrary transition kernel q{Xt+\\Xt) and the probability a(Xt, Xt+i) such that P(Xt+1\Xt) = q(Xt+1\Xt)a{Xt,Xt+i), if Xt ? Xt+l. So the transition kernel defines a density P(.\Xt) for every possible value of the parameter different from Xt. Consequently, there is a positive probability left for the chain to remain at Xt given by P(Xt\Xt) = 1- Jq(Xt+l \Xt)a{Xt, Xt+i)dXt+i. These two forms can be grouped in the general expression P(Xt\Xt) = q(Xt+1\Xt)a(Xt,Xt+1) + I(Xt+1 = Xt)[l - jq{Xt+l\Xt)a{Xt+1\Xt)dXt+l\, where 7(.) denotes the indicator function (taking the value 1 when its argument is true, and 0 otherwise). In practice, simulation of a draw from TC using the Metropolis-Hastings algorithm can be set up as follows: Initialize XQ; set t = 0. Repeat { Sample a point Y from q(.\Xt) 11 Sample a Uniform(0,l) random varaibale U If 17 < a{Xt,Y) set Xt+1 =Y otherwise set Xt+i = Xt Increment t R a n d o m walk chains and symmetric q(.\.) To implement the M-H algorithm, a suitable candidate-generating density should be specified. Typically, this density is selected from a family of distributions that require the spec- ification of such tuning parameters as the location and scale. One family of candidate-generating densities, that appears in the work of Metropolis et al. (1953), is given by q(Xt, Xt+\) = qi(Xt+i — Xt), where gi(.|.) is a multivariate density. The candidate Xt+\ is thus drawn ac- cording to the process Xt+i = Xt + z, where z is called the increment random variable and follows the distribution q\. Because the candidate is equal to the current value plus noise, this case is called a random walk chain. Possible choices for q\ include the multivariate nor- mal density and the multivariate-̂ . Note that when q\ is symmetric, the usual circumstance, qi(z) = q\(—z); the probability of move then reduces to There are some other family of candidate-generating densities. We are not discussing them here. There is one critical issue of choosing the spread or scale of the candidate-generating density. This choice of scale certainly affects the efficiency of the algorithm and affects the behavior of the chain in at least two dimensions: one is the "acceptance rate" (the percentage of times a move to a new point is made), and the other is the region of the sample space that is covered by the chain. Consider the situation in which the chain has converged and the density is being sampled around the mode. Then, if the spread is extremely large, some of the generated candidates will be far from the current value, and will therefore have a low probability of being accepted. If the spread chosen is too small, the chain will take longer to traverse the support of the density, and stays in the low probability regions resulting in high acceptance rates. To get a(X, Y) = min 12 a reasonable acceptance rate of around 50% we have to compromise between the two situations above. 1.5.4 Hybrid (HY) Algorithm Sometimes simple random walk M C M C algorithms for estimating parameters of the high- dimensional target posterior density can not yield satisfactory estimates. To improve the algorithm the following two steps can be followed for a successful M C M C run: (a) incorpo- rate derivative evaluations of the target log-posterior density, and (b) suppress the random walk behavior of the Markov chain. Incorporating derivative evaluations implies utilization of more information about the target distribution. By suppressing the random walk behavior we actually direct the chain to follow a definite path towards the target distribution for faster convergence and efficiency (see Neal [16],Gustafson et.al. [9]). In the usual M - H algorithm we update estimates one at a time and have to set the jump size for each of the components of the parameter vector. In the guided random walk hybrid algorithm we can update components of A;-dimensional parameter vector all at once and can set one jump size for the ^-dimension. Here is the brief description of the general algorithm. Let X ~ n be the target density having an unnormalized density function n(x) on a subset of Uk. The algorithm works by extending the state from X to (X,Y), and the unnormalized target density from n(x) to ir(x,y) = Tr(x)ir(y) = T r ^ e x p ^ - ^ y ^ (1.12) where Y has a N(0,lk) distribution independent of X. Thus we can sample from TX(X) by sampling (X, Y) from (1.12) and simply discarding the Y value. The following three steps should be followed to construct a Markov chain for (X, Y) having (1.12) as its stationary distribution. Also it is necessary to specify a step size e > 0, a function g : Rk —> and a constant 6 6 [0,1). 1. Determine a candidate state (x*,y*) as x* <- x + e[y + {e/2)g{x)\, y* f - -y-(e/2){g(x)+g(x*)}, 13 and randomly assign (x*,y*) with probability p, (x, y) with probability 1 — p, where V = mm { ir(x,y) 2. Unconditionally negate y, i.e. y <—V- 3. Perform an autoregressive update to y, i.e. y<r-N(6y,(l-^)1/2Ik). Thus we can improve our Monte Carlo sample estimates by using gradient evaluations g(x) = Alog7r(a;) of the parameters from the posterior distribution and should choose S close to one to suppress random walk behavior. 1.6 Var iab le Selection The goal of much research is to select those variables that results in the "best" predictive model when we have several potential independent variables to be included in the model. In many situations we have to consider interaction effects (say, pairwise) along with the main effects. Epidemiologists often suggest including all clinically and intuitively relevant variables in the model, regardless of their "statistical significance". But if the number of variables is large, it will be very difficult to get the actual effect of some biologically important variables. Thus, the approach should be seeking most parsimonious model that explains the data best. In order to achieve this goal we must have: (1) a basic plan for selecting variables, and (2) a set of methods for assessing the adequacy of the model. There are several methods that one can follow to select variables for a logistic regression model. One method is "Univariate method" in which variables are assessed one by one via likelihood ratio test whether to include in the model. This method is time-consuming and tedious. Another approach is to use a "Stepwise method" in which variables are selected either 14 for inclusion or exclusion from the model in a sequential fashion based solely on statistical criteria. There are two main versions of the stepwise procedure: (a) forward selection with a test for backward elimination, and (b) backward elimination followed by a test for forward selection. In the next section we will describe briefly the first version and use it in our future variable-selection method. Hosmer and Lemeshow [12] describe the stepwise method and use likelihood ratio test to select a variable. We will use a different criterion, score test to select variables. 1.6.1 Stepwise Logistic Regression In this method or algorithm a variable is selected on the basis of its "importance" in building up the model. The importance is defined in terms of a measure of the statistical significance of the coefficient for the variable and overall fit of the model. The statistic used is the score chi-square test. The specific procedure that we will use consists of the following few steps: Step 0: Suppose we have available a total of p possible independent variables, all of which are candidates to be included in the model and are judged to have plausible "biological" importance in studying the response. Step (0) begins with a fit of the "intercept only model" and evaluation of Score chi-squares with corresponding p-values of all of the p factors. The first most important variable to be entered in the model is that which has the highest Score chi-square value and, of course, a small p-value. A crucial aspect of implementing stepwise logistic regression is the choice of an reasonable "alpha(a)" level to judge the importance of variables. L e t p E be the choice where " E " stands for entry. The choice for pE determines how many variables eventually are included in the model. Many researchers have studied the choice of pE and their research has shown that pE = 0.05 is too stringent, often excluding important variables. Choosing a value for pB in the range of 0.15 and 0.20 is highly recommended. If the goal of the analysis is broader and we want to include more variables that provides better prediction, we can choose pE = 0.25 or even larger. Whatever the choice of p E , a variable is judged important enough to enter into the model if the p-value for Score chi-square is less than pE. Step 1: Step (1) begins with a fit of the logistic regression model containing the variable selected. The overall fit of the model is assessed via the likelihood ratio test and significance of 15 the coefficient of the maximum likelihood estimate is assessed. To ascertain whether an entered variable should be deleted from the model the program selects that variable which, when removed, yields a high p-value. To decide whether the variable should be removed, the program compares the estimated p-value to second pre-chosen "alpha" level, p R , which indicates some minimal level of continued contribution to the model where "R" stands for removal. Whatever value we choose for pR, it must exceed the value of pE to guard against the possibility of having the program enter and remove the same variable at successive steps. If we do not wish to exclude many variables once they have entered then we might use pK = 0.9. A more stringent value would be used if a continued "significant" contribution were required. For example, if we used pE = 0.20, then we might choose pR = 0.25. Thus if the p-value of the just entered variable exceeds pR then the variable is removed from the model, otherwise it will stay in the model. The program again calculates score chi-square values for all remaining variables not in the model in this step. Step 2: The procedure for step (2) is identical to that of step (1). The program selects the next variable to be entered into the model as the one having the highest score chi-square. Then it fits the model with the entered variables, assesses the overall fit of the model via the likelihood ratio test and assesses the significance of the maximum likelihood ratio estimates. Then it performs a check for the backward elimination. The process continues in this manner until the last variable selected according to pE value. In the end, the process produces a summary table of maximum likelihood estimates of the variables selected for the final model. The stepwise procedure that we will use in our model selection for different data sets will consist of the following two steps: (a) In the first step we will select main effects only for our model using the stepwise procedure. We want to make sure that we are selecting significant effects to be used in the second step. (b) In the second step we wil l ask the stepwise procedure to select different effects from among main effects (selected in the first step) and pairwise interactions of these main effects. 16 1.7 Cross -Va l ida t ion To compare the predictive power of Bayesian and classical models on some particular data sets, cross-validation is an effective method. The basic idea of cross-validation is to divide the whole data set randomly into a training sample and a validation or test sample. The results from this scheme may be sensitive to which particular subset of the data into training and test cases is utilized. For this reason, we use K-iold cross-validation in which we split the cases randomly into K roughly equal-sized segments. For example, when K = 5, the scenario looks like the following table: 1 2 3 4 5 Test Train Train Train Train For the kth part (first in the table), we fit the model to the other K — 1 parts of the data, and calculate the predictive performance of the fitted model when predicting the kth part of the data. We do this for k = 1,2,... , K and combine the K estimates of prediction. Each of the models to be compared is fit to the training sample first and then used to predict the test sample responses given the test sample covariates. Models which yield better predictions in this scheme are then preferred. 1.8 Out l ine and Scope of the Thesis We begin Chapter 2 by discussing the pairwise interactions and how do these modify the effects. Then we introduce a different functional relationship between the response variable and the risk factors. We try to explain how to interpret the effects of particular risk factors under this new model. The Bayesian approach to parameter estimation and implementation of a specific M C M C method for the new model are discussed. In Chapter 3 we use some datasets from real epidemiological studies to fit the new model, interpret the parameter estimates, and compare the predictive power of the new model with that of the step-wise and ordinary logistic regression models. Chapter 4 discusses some limitations that we face in implementing the new model, and how we might overcome these restrictions in the future. 17 Chapter 2 Model Specification and Estimation 2.1 Interact ion and Effect Mod i f i ca t i on Generally, interaction is said to present between two risk factors when the effect of one risk factor upon disease is different at (at least some) different levels of the second risk factor. Consider a model containing a dichotomous risk factor (e.g. sex) and a continuous covariate (e.g. age). When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way on the level of the covariate. That is, the covariate modifies the effect of the risk factor. Epidemiologists often use the term effect modifier to describe a variable that interacts with a risk factor (see, for instance, Woodward [19])- If the association between the covariate and the outcome variable is the same within each level of the risk factor, then there is no interaction between the covariate and the risk factor. Graphically, the absence of interaction yields a model with two parallel lines, one for each level of the risk factor (sex). In general, the absence of interaction is characterized by a model that contains no second or higher order terms involving two or more variables. In any epidemiologic study, we may have several risk factors that we decide to measure, and part of the study aims to decide which risk factors interact with others in regard to the disease outcome of interest. So then we can include in our model appropriate higher order terms to represent the effect of interaction. A n important step in the process of modeling a set of data is determining whether there is evidence of interaction in the data. If the number of risk factors in the data is relatively 18 Logit h Age Figure 2.1: Plot of the logits under three different models showing the presence and absence of interaction large, it's very difficult to identify which variables interact with each other. Figure 2.1 presents the graphs of three different logits. The graphs of these logits will be used to explain what is meant by interaction. Consider an example where the outcome variable is the presence or absence of C H D , the risk factor is sex and the covariate is age. Suppose that the line l\ corresponds to the logit for females as a function of age. Line li represents the logit for males. These two lines are parallel to each other, indicating that the relationship between age and C H D is the same for males and females. In this situation there is no interaction and the log odds for sex (male versus female), controlling for age, is given by the differences between the lines I2 and This difference is equal to the vertical distance between two lines, which is the same for all ages. Suppose instead that the logit for males is given by the line I3. This line is steeper than the line l\, for females, indicating that the relationship between age and C H D among males is different from that among females. When this occurs we say there is an interaction between age and sex. The estimate of log-odds ratio for sex (male versus female) controlling for age is still given by the vertical distance between the lines I3 — I1, but this difference now depends on the age at which the comparison is made. Thus, we can not estimate the odds ratio for sex without specifying the age. That is, age is an effect modifier. To consider the magnitude of "effect modification", say that C H D depends on two risk 19 factors, smoking status and systolic blood pressure. For simplicity, we assume that both risk factors are binary and can be defined as: { 0 if no smoking, 1 if smoking, and ! 0 if blood pressure is normal, 1 if blood pressure is high. Let Bi and p2 represent the coefficients of X i and X2, respectively. Then we can express the relationship between the probability of presence of C H D and the risk factors in the logit scale as (assuming interaction effect of smoking and blood pressure is present): logit Pr(Y = 1|X) = p0 + BXXX + p2X2 + pl2XxX2, (2.1) where p\2 is the magnitude of effect modification. Say that the effect of smoking is of interest. Comparing X i = 0 and X i = 1 we have Level of X 2 logit for Xx = 0 logit for X i = 1 logit difference X 2 = 0 Po P0 + P1 Pi X 2 = 1 Po +82 P0 + P1+P2 + Pu P1+P12 giving a sense in which P12 describes how the smoking effect is modified by blood pressure. Similarly, it can be shown that the blood pressure effect is modified by smoking, by the amount Pu, if we quantify the blood pressure effect by comparing X 2 = 0 and X 2 = 1 for two levels of smoking. Suppose we have 10 risk factors to consider. If we want to include all of them in the model without considering their statistical significance, and also want to evaluate their pairwise interaction effects, we will have 10+45=55 terms to include in the model. The additive model will then look like: logit Pr(Y = l\X)=p0 + piXx + ...+ p\0X1Q + Pi,2XxX2 + ...+ p\,10X9X10. (2.2) In such a situation most of the parameter estimates will be insignificant and unreliable due to confounding or some other factors, and evaluation of interaction effects will be unreliable. This 20 is not a feasible way of evaluating effects. The most widely used procedure for selecting main effects and interaction effects is the stepwise logistic regression procedure discussed in Chapter 1, Section 1.4. Nevertheless, one might raise several issues such as: 1. W i l l a pairwise interaction model always be realistic ? 2. Is an additive structure as in (2.1) always appropriate ? 3. Could there be other kinds of interaction whereby the effect of a given predictor is different for a subject at generally low risk than for a subject at generally high risk ? In an attempt to discuss these issues, suppose we have several risk factors under study and one of them is smoking status. Say we want to consider the effect of smoking on C H D as a function of the remaining predictors. Even though there may be numerous other risk factors, hypothetically consider the X-axis in Figure 2.2 to summarize their combined effect. If it were possible to place these predictors on this X-axis of the two-dimensional plane, from low levels to high levels, and smoking effect on the Y-axis, then we can contemplate the following figure. This picture is trying to grasp the idea of issue 3 above. From this figure we see that the effect a — increasing — decreasing unchanged Low risk High risk Remaining predictors Figure 2.2: Hypothetical representation of relationship among risk factors. 21 of smoking on C H D , considering remaining predictors from their low risk level to high risk level, might be constant, increasing or decreasing. The ordinary interaction model (2.1) is not amenable to describing this situation. So we wish to consider alternate models. We wil l return to this issue in Section 2.2. To get some sense of other issues, let us rewrite equation (2.1) as: logit Pr{Y = l\X)=/30 + g{Xx + ...+X,), (2.3) where p components of X are the risk factors and g is the function of the risk factors. In equation (2.1) we assumed g as linear and assumed additive effect of the risk factors on the disease. It may not always be true. It might be possible that after reaching a certain level, the risk on the disease outcome will be almost constant and effects of additional risk factors will be low. This situation encourages us to think g as a non-linear function of the risk factors. The following figure illustrates plausible non-linearity of the logit as a function of the number of risk factors present: Figure 2.3: Logit as non-linear function of the number of risk factors Say risk of the disease is the function of the number of risk factors where all of them are binary, representing e.g. absence and presence of the risk factors. Thus risk of the disease in the logit scale might be non-linear as plotted in Figure 2.3 above. From this figure, it can 22 be said that as the number of risk factor increases, the increased risk of the disease on the logit scale will be getting smaller and smaller. Qualitatively, we can conclude that the effect of an additional risk factor will be small as the number of risk factor increases, even if the risk factor may be biologically important. More precisely, let Z denote the number of risk factors. Then unit increase in Z will have an insignificant effect on the logit if Z is large. For making the above comments we assume here that all of the risk factors are equally important. No single risk factor has greater inffuence on the risk than any other. A new model wil l be introduced in Section 2.2 to generalize this behavior beyond the special case of equally important risk factors. In conclusion, instead of trying to identify the possible pairwise interaction effects for fitting the linear model, we could try to find some tractable and interpretable functional form of g. What functional form g could take? Could g interpret the situations we discussed above? In the next section a specific functional form of g will be assumed and will be used to describe the relationship between the logit and the risk factors. 2.2 Specific non-addit ive functional form of the logit Suppose we have Xi, i = 1,2,... ,p predictors where : { 0 if the i-th risk factor is absent, 1 if the i-th risk factor is present. The specific non-linear functional form of g that we are thinking of can be expressed in terms of logit and is given by: logit Pr(Y = l\X)=p0 + {p^X1 + ... + ftX^'* , (2.4) where A is an additional unknown parameter and X^s are binary as defined above. We call this model (2.4) as N A D (Non-ADditive) model throughout the entire thesis. The N A D model is not defined properly unless we restrict the signs of the coefficients Pi to be positive. This implies that Xi = 1 corresponds to higher risk than Xi — 0. The functional form may have some advantages over trying to pick out pairwise interaction terms. How do we interpret A? Consider a very simple case: Pi = P2 = • • • = Pp = P > 0. Then 23 equation (2.4) reduces to: logit Pr(Y = 1\X) = ft + { / ? A ( * i + • • • + ^ p ) } ^ , = P0 + /?(# of risk factors) 1 / A . (2.5) In this special setting, we see from Equation (2.5) that logit depends on how many risk factors a person has and obviously on the value of A. This special behavior of the logit where each of the risk factors is getting the same weight, is represented in Figure 2.3. To interpret j3 by the odds ratio let us compare the logit of somebody having no risk factors with the logit of a person with just one risk factor (the i-th one). Using Equation (2.5) we have: Risk factor 0 z-th one logit Po Po+Pi Thus the odds ratio is e^», just like the ordinary logistic regression model and bears the same interpretation. Now let p = 10 and P — 1. From Equation (2.5) we see that the functional form of logit depends on the value of A. For A = 1 we get the simple linear logistic function. For other values of A we can depict logit as a function of number of risk factors in this special example. This is shown in the first picture of Figure 2.4. What can be said about the effect of an additional risk factor on the risk for someone who has already p risk factors? For A = 1 it would be constant. For A > 1 and A < 1 the effect would be decreasing and increasing, respectively. This is depicted in the second picture of Figure 2.4. This simple example motivates us to think that logit might be a non-linear function of the risk factors with A different from one. But does a value of A different from one tend to explain the data better? How interpretable are such models? We will discuss these issues using some real life examples after estimating A and the coefficient /3's. However, in reality the usual picture is that /3's are not all equal. Let X = (Xi,X^), where X^ represent factors other than X{. Suppose the effect of Xi is of interest and could be quantified by: logit(X 2 = 1, X(l)) - logit(*i = 0, X{l)). (2.6) 24 Figure 2.4: Comparison of different A values. Using the N A D model effect of Xi can be assessed from equation (2.6) as follows: (i) If A > 1 then (2.6) is decreasing in each Xj, j ^ i. (ii) If A = 1 then (2.6) is unaffected by X^y (iii) If A < 1 then (2.6) is increasing in each Xj, j ^ i. Which value of A is best supported by data? How well does the N A D model describe the data? To find the answers we have to fit the model first and estimate the parameters. 2.3 M o d e l F i t t i n g and Es t ima t ion We assumed that we have y\,y2, • •••, yn independent binary outcomes of a disease of n inde- pendent observations in any study. For each of the y^s we have corresponding X\,X2,... ,XP predictors. More formally, we can define our observation as the pair {yi,Xi) for i = 1,2,... ,n where x{ = {xii,xi2, •. .,xip). 25 2.3.1 Log Likelihood Function Let us denote the N A D model by g(X; 3, A) = logit Pr(Y = 1\X) = 30 + + . . . + /? p A X p } Then we express the logistic regression model (2.7) as l/A (2.7) Pr(Y = l\X = x) = e<K*;/3,A) 1 + eg{X;p,\) ' or /3o+ Tv(Xi) = l/A /3o+ £ 1 + e V=» l / A • (2.8) Therefore, the likelihood function can be written in the following form: n L(3,\) = H7r(xir>[l-Tt(xi)]1-y>. i=l Rearranging and following the procedure of Section 1.2.1, Chapter 1, it turns out that the likelihood function equals: L{fi,\) = { n 2=1 1/A> l + e x p ( / ? 0 + £ / 3 , - s > x exp l/A" EM&+ 1 E # 2=1 (2.9) which yields the log likelihood function given by: J G M ) = E 2 = 1 l/A" #> +1E^ .3=1 log 1 + exp 1/A\ J (2.10) When A = 1, equation (2.10) reduces to the log likelihood function of the ordinary logistic regression model. In the ordinary logistic regression case we use Maximum Likelihood Principle to estimate the parameters 3. To estimate the parameters of the N A D model we will follow the Bayesian approach. 26 2.3.2 Posterior density The parameter A is positive. We will reparameterize A by defining qS = log A for numerical simplicity. So A = exp(0) and L(B,X) becomes L(B,(b). We assume a normal prior for </>, i.e. 7r(</>) ~ iV(0, c 2 ), or ^) = c^eXK~^2)- To avoid some numerical complexities associated with large A values we assume c = log(2). By choosing a symmetric prior for </> to be centered at 0, A and j are equally likely a priori. Centering the prior for 4> at 0 corresponds to centering the prior for A at 1, thereby favouring the simple logistic regression model. For parameters B we assume a noninformative diffuse prior distribution, that is n(B) oc 1. In practice, we might use a uniform prior distribution if we really have no prior knowledge about the parameters. The joint posterior distribution of all the parameters given the data is then defined as: P{B^\y)c<L{B,ct>W)A<t>). Therefore we can get the log posterior density as follows: (2.11) where K is an unknown constant. In the next section we will discuss how to estimate the parameters from this density by Markov Chain Monte Carlo methods. 2.3.3 Parameter estimation In the Bayesian perspective, we get estimates of the parameters from their posterior means as we discussed in Chapter 1, Section 1.2.2. If we want to estimate /?'s, e.g. B\, we have to evaluate 27 Equation (1.9) using the posterior density (2.11). One way to evaluate (1.9) is to compute conditional distributions of the parameters from the density (2.11) and then draw samples from their respective conditional distributions. This is known as the Gibbs sampler. But the conditional distributions of the parameters from density (2.11) do not have nice and simple mathematical closed form. Alternatively, we can draw samples from the posterior density (2.11) by the Random Walk Metropolis-Hastings (MH) algorithm discussed in Chapter 1, Section 1.3.3. Thus we can draw m samples of </>, for example, and get A = exp(<̂ >). We can then estimate A - A ( 1 ) + - m + A ( m ) to approximate / A log P(fi,\\y)d(3d\. To implement M H algorithm and simulate samples from Equation (2.11) using some data we need to specify: (i) A candidate generating density: We will use Normal density centered at the current parameter value with specific standard deviations or jump sizes. (ii) Initial values of (3 and <p. We will use estimates from the ordinary logistic regression fit of the data as initial values for (3, and (p° = 0. (iii) Jump sizes for each (3 and <p. A n important step in simulation is the setting up of jump sizes for each parameter, for proper mixing and a reasonable acceptance rate. We tried to draw samples from (2.11) using a data set (will be discussed in the next section) through RW M H algorithm. But due to high correlation between </> and (3 we were having trouble with the mixing of the sampler output, with a very low acceptance rate after several changes of the jump sizes. We then tried to rescale the /?'s by (3 = expO) w h i l e fixing the value of </>. Rescaling was implemented through multiplying jump sizes by exp((/>). The mixing was good and acceptance rate was satisfying, but we do not have estimates of (p. Besides, there's no established or specific rule to specify jump sizes. They are fixed by trial and error basis. A n alternative way to get simultaneous estimates of the parameters is to use Hybrid (HY) algorithm by using full information of the log posterior density which was discussed in Chapter 1, Section 1.3.4. Moreover, we can get rid of specifying several jump sizes by specifying only one jump size for all of the parameters. To implement the H Y algorithm we need to evaluate the derivatives of (2.11) with respect 28 to parameters P and (p. Rewrite Equation (2.11) as follows: log P((3,<j>\y) =K + Y. [Vi9i{fr & ~ l 0 ^ 1 + e9>{M))} - ^4>\ (2.12) 2=1 where (2.13) Differentiating (2.12) with respect to (3j,j = 0,1,2, . . . , p and <f> we get the following (p + 2) equations: o9i(P,4>) Q dlog P(/3,<p\y) = E 2=1 L dPj = E i=i d\ogP{(3,<p\yl = £ K [ W39l[(iA)) ~ 1 + e*>(M X W3m{M Vi- 1 1 30 i=i 1 -j- e9i(P,<t>) 1 1 -|- e9i{P,4>) _d_ dp, 9i{P, <f>), (2.14) (2.15) To get (p + 1) equations of (2.14) we need to evaluate -gj^giiP, (p)- Differentiating (2.13) with respect to Pj for j = 0,1,2,... ,p we get the following (p + 1) equations: dPo dm(PA) dpi = 1, aft exp e^ log E ^ % E # % E # * E # % e ^ x [Erf*, i a_ E $ % - e ^ - l e-<Pe9pe*-lX. il Pf-'Xn. 29 Similarly, dp2 E $ % dgi(PA) dPP = E $ % 0p 1 -^ip- Equation (2.14) can now be evaluated using the above equations. Further we need to evaluate ^gi(P, <t>) which is given by, dgj{PA) = d_ exp e^ log Ij^Pf Xa = ( E # % - I E 3 % = ( E # * = ( E # % 30 e-*log j > | % - d<t> iog[x;̂ %)(-e-*) e - * l o g [ X ; ^ 0 30 Now, since d_ 34> d<f> exp (e 0log(/3,)) = ft (Vlog(/3,)) therefore, d<f> £ (pf Iog(/3,-)^-) e-*iog(x;̂ Even though after using derivative information of the log posterior density in the H A we were not getting satisfactory sample estimates, that is, mixing of the sampler output was not good. We then redefine <f> = A; log A, where A; is a constant, then A = e*/fc. The prior distribution of <j> now depends on k. Since <j)/k ~ N(0, c 2 ) , <j> ~ N(0, c2k2). We assume k = 4 and c = ^log(2). Still we have not got reasonable acceptance rate and proper mixing of the estimates due to correlation between <j> and /3's. We then reparameterized /3's as a3- = ^ for j = 1,2,... ,p so that 6j = \aj. In this reparameterization we have to redefine the N A D model as follows: logit Pr{Y = 1\X) = 80 + [(Aai)**! + • • • + (Aap)A V V * = Po + = a 0 + A where ao = BQ. Let A A ( a ^ ! + . . . + a ^ p ) l / A a AXx + ... + a AX p l / A Xn + ... + a: (2.16) (2.17) 31 Then the log posterior density is given by: n 1 logP(a,(l>\y)=K + ^[yigi(a,<l>)-\og(l + e^a^)] - T^tf • (2-18) Differentiating (2.17) with respect to ay for j = 0,1,2,... ,p we get the following (p+ 1) equations: da0 doti = 1, e - W / * ) _ l e W k ) ( E a f * 9 a 2 5 -(#/*)_l e W / f c ) - l v dgija, <j>) dan "p ^ zp • Equation (2.14) can now be evaluated using these equations in a parameterization. Further -§^gija,4>) can be obtained as follows: dgija A) d<t> "(0/*) 1 + E ° r 4 ) * . ^ U « f k ) ^gja3)Xl3 _ e . W k ) ] o g ( j - ^ / ^ . . ( E £ = i o f / 4 % ) V=i Equation (2.15) can be evaluated using this equation where the last term would be Note that these above expressions of gradients do not cover the situation if all of the risk factors of a person have value zero. In this situations model (2.16) reduces to logit PrjY = 32 1|X) = ao and Equation (2.17) reduces to gi(a,4>) = atj. We then have the derivatives as: dgi(a, <p) = 1 n:(rv ii>) f)nArv th\ = o, da0 dgi(a,<t>) = dgi(a,<p) ^ _ dgj{a,<p) da\ dot2 dav and 9gi{a,4>) n d<t> So we will not get any information about the coefficients and A except the intercept. We have to keep it in mind during the M C M C run. The next chapter will be devoted to estimate the parameters of the model (2.16) by using some real data sets, where we compare the performance of the N A D model with that of the step-wise and the ordinary logistic regression model. 33 Chapter 3 Examples 3.1 E x a m p l e 1 : South A f r i c a n Hear t S tudy ( S A H S ) This dataset contains a retrospective sample of 460 males in a heart-disease high-risk region of the Western Cape, South Africa. The response variable was Coronary Heart Disease (CHD) represented as presence and absence of the disease. There are roughly two controls per case of C H D . Many of the C H D positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their C H D event. In some cases the measurements were made after these treatments. Of the 460 males there were 160 people who had the C H D event. There were nine risk factors measured in this study; one of them is binary and the remaining eight are continuous. Complete descriptions for some of the predictors were not available. These data are taken from a larger dataset, described in Rousseauw et. al. [17]. In Table 3.1 the names, description and type of the risk factors are given. We convert the eight continuous risk factors into binary variables by thresholding in comparison to their means. For evaluating the Hybrid (HY) Algorithm and to get a reasonable acceptance rate while ensuring proper mixing of the sampler output, we assume k = 4, c = | log(2), e = 0.07, 6 = 0.90. As we have mentioned before the initial values for 0's were the estimates from the ordinary logistic regression fit and (fp = 0. After iterating the algorithm for 20,000 times the acceptance rate was 82% and the Markov Chain is stabilized as can be seen from Figure 3.1. Figure 3.1 and 3.2 show the sample path of the M C M C output and the posterior distribution by histogram, 34 Table 3.1: Description of the risk factors for SAHS Variable Name Description Type SBP Systolic Blood Pressure Continuous Tobacco Cumulative tobacco consumption (kg) Continuous L D L Low density lipoprotein cholesterol Continuous Adiposity* Measured fatness Continuous Typea Type-A behavior Continuous Obesity* Measured fatness Continuous Alcohol Current alcohol consumption Continuous Age Age at onset Continuous FamHist Family history of heart disease (Present, Absent) Binary incomplete information. respectively. By looking at the plots in Figure 3.1 we see the almost instantaneous convergence of the sampler to the target posterior distribution. For this reason we do not throw away any 'burn-in' iterations to compute the posterior means and standard deviations. As a summary of the posterior distribution given the data, the posterior means and the corresponding posterior standard deviations, computed using the 20,000 iterations, are given in Table 3.2. The 90% equal-tailed credible intervals are also computed from the sample quantiles of the M C M C out- put, presented in Table 3.2. The predictors in Table 3.2 are listed in the order they were given in Table 3.1. The restriction imposed on the coefficients to be positive has some effect on some of the posterior distributions of the parameters as can be seen from the histograms in Figure 3.2 . We now fit the ordinary logistic regression model to the data to compare with the N A D model. The estimates, standard errors and the 95% confidence intervals are given in Table 3.3. Comparing the estimates from Table 3.3 with those M C M C output from the N A D model we see that the N A D model tends to provide larger estimated /3j's than does the ordinary logistic regression model and the posterior standard deviations are much bigger than the standard errors estimated by the ordinary logistic regression model. It is quite reasonable to have bigger posterior standard deviations since while sampling from the posterior density of the N A D model we incorporate more uncertainty through the additional parameter A and it is evident that data 35 Table 3.2: Summary results of the posterior distribution Variable Posterior Means Std Dev. 90% CI 5% 95% Intercept -3.91 0.83 -5.53 -2.76 1.11 0.69 0.18 2.47 x2 1.13 0.66 0.24 2.37 0.92 0.60 0.11 2.07 X4 1.18 0.74 0.18 2.53 x5 1.28 0.69 0.33 2.58 0.94 0.68 0.09 2.28 x7 0.51 0.44 0.02 1.36 Xs 1.75 0.71 0.81 3.06 x9 1.87 0.72 0.88 3.27 X 1.52 0.34 1.06 2.19 Table 3.3: Summary results from the ordinary logistic regression fit 95% CI Coefficients Estimate Std Error Lower Upper Intercept -2:70 0.37 -3.44 -1.97 0.38 0.23 -0.07 0.83 x2 0.48 0.23 0.02 0.94 Xz 0.38 0.23 0.07 0.84 x4 0.34 0.30 -0.26 0.93 x5 0.52 0.22 0.08 0.95 X6 0.21 0.27 -0.32 0.74 Xj 0.06 0.23 -0.40 0.51 x8 . 0.87 0.26 0.36 1.39 x9 0.93 0.22 0.50 1.36 36 0 5 0 0 0 10000 15000 20000 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 0 5 0 0 0 10000 15000 2 0 0 0 0 n c r w i in/w> i c n / v i o / w w Figure 3.1: Sample path of the M C M C output where the first panel is for the intercept, panels 2-10 are of the coefficients and the last one is for A. 37 o w I 1 1 1 1 1 .7 -6 -5 -4 -3 -2 MI1[III|M*TT-H I 1 1 1 1 1 n — i — i — i — i Intercept l l l l l l l l w I — I — I — I — I — I 0 1 2 3 4 5 i i i i i i i — r o 1 0 i — i — r o i l_r_== - I — i ![___ T — i — i — i — I 0 1 2 3 4 5 0 I 1 1 1 1 1 > 03 « 0 .I_t__ i — i — i — i — i 1.0 1.5 2.0 2.5 3.0 Figure 3.2: Posterior distribution of the parameters. The first panel is for the intercept, panels 2-10 are of the coefficients and the last one is for A. 38 support value of A greater than one. The uncertainty is reflected by the larger posterior standard deviations. The reason that the estimated standard errors from the ordinary logistic regression fit are smaller is that these are obtained assuming A = 1. But the posterior standard deviations are averaged over all possible values of A. Moreover, correlation between /?'s and A may give rise to larger posterior standard deviations. Consequently, as A gets larger than one we end up with more spread-out posterior distributions of the parameters. Again, when we fit ordinary logistic regression we assume that the effect of a particular risk factor is the same for all persons in the study, so it measures average effect of that risk factor irrespective of the levels of other risk factors. Wi th the N A D model, on the other hand, when interpreting f3j as the effect of the j-th risk factor we consider a person for whom all other risk factors are absent. Moreover, for A > 1 the effect of a particular risk factor decreases as the number of risk factors increases. Hence the N A D model gives larger estimated /?'s than does the ordinary logistic regression fit. As for example, the posterior mean /3g = 1.8662 represents the risk in logit of having FamHist as a potential risk factor assuming that the remaining risk factors X\ to X% are absent. To get comparable estimates as the ordinary logistic regression model we compute the average effect of the risk factors by the following quantity using sample M C M C output of the parameters: 1 n average effect of Xj = — ^ [logit(Xj = 1,-Xy) = a^j),/3, A) — n i=i logit(X, = 0, XU) = xi(j),P, A)] , (3.1) where j = 1,2,... ,9. Equation (3.1) gives the average effect of a particular risk factor with respect to the empirical distributions of other predictors, from the difference in logits when that risk factor is from its lower level to higher level in the presence of the remaining risk factors. Column 1 in Table 3.4 gives the posterior mean of (3.1), denoted by AV\. Column 2 was obtained by inserting posterior means of the parameters into Equation 3.1, denoted by AV2- From Table 3.4 we observe that AVi values are very close to those estimates obtained by fitting the ordinary logistic regression. The AV\ values are slightly larger than the ordinary logistic regression estimates. Therefore, it can be concluded that both models indicate similar average effects of the risk factors, as can be seen from Figure 3.3 in which AVi values are plotted 39 Table 3.4: Estimated average effects of the nine predictors using the N A D model Coefficients AVi AV2 x1 0.46 0.45 x2 0.49 0.46 Xz 0.38 0.35 xA 0.51 0.50 x5 0.58 0.57 x6 0.39 0.38 x7 0.17 0.14 x8 0.91 0.90 x9 0.98 0.98 against logistic regression estimates. While estimating effects, ordinary logistic regression model assumes that the effect of a particular risk factor is the same for all persons. If the effects of Xj in (3.1) for person i varies slightly with i, the N A D model is very close to the ordinary logistic regression model. However, histograms of individual level effects which are plotted in Figure 3.4, show considerable variation of effects from person to person. We thus qualitatively conclude that the fitted N A D model is substantially different from the fitted ordinary logistic regression model. 3.1.1 Mode l Comparison To compare the predictive power of the N A D model with that of the step-wise and ordinary logistic regression model (which is fit considering full model irrespective of the statistical sig- nificance of the risk factors), we use the cross-validation procedure as described in Chapter 1, Section 1.5. For purpose of cross-validation we randomly divide the whole data set into five approximately equal-sized segments to compute the predicted probabilities for these three models. The sample M C M C output via the N A D model was obtained after iterating the Hybrid (HY) algorithm 20,000 times using k = 4, e = 0.06, c = 0.5 * log(2) and S = 0.90. The step- wise models have been selected by using the procedure described in Chapter 1, Section 1.4.1. For selecting variables we use the entry probability p E = 0.20 and the probability of removal is pR = 0.25. The predicted probabilities of the presence of the disease outcome are then computed 40 0.2 0.4 0.6 0.8 Logistic regression estimates Figure 3.3: Scatterplot for comparing average effects with the logistic regression estimates by: Pr{Y = 1\X) = 1 (3.2) l+exp(-g(X,9)) where g(X, 0) is the corresponding estimated logit function for whichever model is under con- sideration. We compute Equation (3.2) for the NAD model by using the sample MCMC output and is based on the following relationship: Pr(new person has disease | data) = E {Pr(new person has disease | /3, A) | data} . The right hand side of the above relationship is approximated by: E[h(0,\) | data] « -=- £ ^ ( i )), (3.3) i=l where h is given by Equation (3.2). As an alternative, we also computed (3.2) by plugging in the posterior means of the parameters. The computed predicted probabilities for the three models are presented in Figure 3.5 by boxplots. We see from this figure that the NAD model predicts the presence of the disease outcome slightly better than the other two models. Its predictive performance about the absence of 41 Risk factor 1 Risk factor 2 Risk factor 3 a * o »• a 0.0 nr- 0.5 T - 1.0 T - 1.5 2.0 0.0 1- 0.5 T - 1.0 1~ 1.5 -1 2.0 0.0 0.5 1.0 ~1~ 1.5 2.0 Risk factor 4 Risk factor 5 Risk factor 6 a o • o , 10 0.0 0.5 ~1~ 1.0 1.5 2.0 o o CM 10 C O 0 » n 0 o a o lU I 1 r - 0.0 0.5 1.0 ~1~ 1.5 2.0 a * 0.0 —r 0.5 1.0 1.5 2.0 Risk factor 7 Risk factor 8 Risk factor 9 8H CM 0.0 0.5 T - t.0 " T " 1.5 2.0 a) o a o 0.0 0.5 " T " 1.0 1~ 1.5 2.0 o 2 0.0 11 i 0.5 " T " 1.0 1.5 2.0 Figure 3.4: Distribution of individual level effects for all risk factors 42 Absence of CHD Presence of CHD NAD Step Ordinary NAD Step Ordinary Three Logistic Models Three Logistic Models Figure 3.5: Boxplot for comparing performance of three logistic models disease is better than that of the step-wise but similar to that of the ordinary. Note that, step- wise and the ordinary logistic regression models have the same abilities to predict the presence of disease outcome. To have an intuitive idea of how similar is the N A D model with the ordinary logistic regression model we compute the fitted probabilities using both models from Equation (3.2). We use sample M C M C output to compute fitted probabilities by the N A D model. These are plotted against the fitted probabilities from the ordinary model in Figure 3.6. From this figure we see that the fitted probabilities from both models are highly collinear and lie approximately on the straight line. Qualitatively, we can say that though both models indicate similar fitted probabilities, they bear different interpretations and getting a posterior mean of A > 1 with the N A D model indicates presence of different kind of interactions. The log-likelihoods of the predicted probabilities of the three models can be computed 43 Figure 3.6: Scatterplot of fitted probabilities by: Likelihood L = j j p i w [ l - p i ] { 1 _ w ) i=i n log L = ^2 [Vi logpi + (1 - yi) log(l - pi)} i=i n r = E i=l y^log Pi + log(l - Pi) Using Equation (3.4) we have, (3.4) log L (NAD) = -270.0520, logL(NAD) = -272.2717 (using posterior means), logL(siep) = -284.7087, log L (ordinary) = —271.1916. Thus, the N A D model gives better log likelihood than the other two. To find out how much better is the predictive performance of the N A D model than that of the step-wise model, we 44 compute: logL(NAD) - logL(step)] n J ( 3 " 5 ) /14.6567 \ = e X H^60 - J = 1.0324. Hence the N A D model is only 3.24% better in predicting disease outcome than that of the step-wise model for this example. In the next section we will investigate the second example to estimate the parameters and compare the performance of the three models following the same procedures as this example. exp average < log L(NAD) \ L(step) -- exp 45 3.2 Example 2 : Scot t ish Hear t H e a l t h S tudy ( S H H S ) The Cardiovascular Epidemiology Unit of the University of Dundee, Scotland undertook a range of epidemiological studies in the mid 80's in order to understand the factors associated with C H D prevalence across Scotland. The Scottish Heart Health Study (SHHS) was one of these epidemiological studies with the objective to establish the levels of C H D risk factors in a cross- sectional sample of Scottish men and women aged 40-59 years drawn from different localities. This particular dataset contains a sample of 4049 males with six risk factors from the huge study. Of the 4049 subjects 196 had the C H D event. The response variable C H D is coded as { 1 if the individual had a C H D event 0 if not Description of the six risk factors are given in Table 3.5. Table 3.5: Code sheet for the Scottish Heart Health Study # Description Codes/Values Name 1 Age years A G E 2 Total Cholesterol mmol/1 T O T C H O L 3 Body Mass Index k g / m 2 B M I (weight / height2) 4 Systolic Blood Pressure mmHg S Y S T O L 5 Smoking status 1 = never smoked, 2 = ex-smoker, S M K 3 = current smoker 6 Activity in leisure 1 = active, 2 = average, 3 = inactive A C T I V I T Y The first four quantitative risk factors are converted into binary variables by thresholding in comparison to their means. For converting smoking status we assume 'never smoked' category as low level and 'ex-smoker' & 'current smoker' categories combinedly as high level. For activity in leisure we assume 'active' & 'average' as low level and 'inactive'as high level. We now draw samples from (2.18) by implementing hybrid algorithm. We assume k = 5, c = ^log(2), e = 0.035, and 8 = 0.90. The initial values for /3's are the estimates from the ordinary logistic regression fit, and (p° = 0. We iterate the algorithm for 20,000 times and the acceptance rate is 90%. The sample plots of the M C M C output are given in Figure 3.7. Figure 3.8 shows the histogram of the posterior distribution. From Figure 3.7 we see, though 46 Table 3.6: Summary results of the posterior d is t r ibut ion from S H H S Variable Posterior Means S td Dev. 90% C I 5% 95% Intercept -5.56 0.87 -7.23 -4.41 0.87 0.61 0.11 1.95 x2 1.81 0.76 0.86 3.21 Xz 0.54 0.54 0.01 1.55 x4 1.68 0.74 0.72 2.98 x5 1.57 0.82 0.48 3.10 x6 0.63 0.58 0.02 1.75 A 1.75 0.47 1.13 2.58 Table 3.7: Summary results from the ordinary logistic regression fit of the S H H S 95% C I Coefficients Est imate S td Er ro r Lower Upper Intercept -4.37 0.26 -4.88 -3.85 0.22 0.15 -0.08 0.51 x2 0.75 0.17 . 0.45 1.06 X3 0.09 0.15 -0.20 0.38 xA 0.68 0.15 0.37 0.98 x5 0.54 0.21 0.12 0.95 x6 0.08 0.19 -0.30 0.46 the mix ing of the sampler output is slow for some of the parameters, the M C i n each case is stationary. The posterior means and standard deviations, computed using the 20,000 iterations are given i n Table 3.6. The 90% equal-tailed credible intervals (CI) are also presented i n Table 3.6, obtained from 5% and 95% sample quantiles. T h e risk factors are listed i n the same order as given i n Table 3.5. The summary results of the ordinary logistic regression fit is given i n Table 3.7. A s w i t h Example 1, the restrict ion imposed on the coefficients to be positive has some effect on some of the posterior distr ibutions of the parameters as can be seen from Figure 3.8. Compar ing the estimates from Table 3.7 wi th the posterior means and standard devi- ations i n Table 3.6 we see the same features as before, that is, posterior means and standard deviations are larger than the estimates and standard errors of the ordinary logistic regression 47 fit as we expect. The comparable estimates that can be obtained by (3.1) from the sample M C M C output are given in Table 3.8. * - m - N - O -i i i i i 0 5000 10000 15000 20000 Iteration i i i i i 0 5000 10000 15000 20000 Iteration - cn - w - o - I I I I I 0 5000 10000 15000 20000 Iteration i i i i i 0 5000 10000 15000 20000 Iteration <j - N - O - I l l l l 0 5000 10000 15000 20000 Iteration 1 1 1 I 1 0 5000 10000 15000 20000 Iteration in q _ 1 1 1 1 1 0 5000 10000 15000 20000 Iteration I 1 I I 1 0 5000 10000 15000 20000 Iteration Figure 3.7: Sample path of the M C M C output where the first panel is for the intercept, panels 2-8 are of the coefficients and the last one is for A. AV\ values are the posterior means of (3.1) obtained by using every tenth sample of the M C M C output. AV~2 values are computed from (3.1) by inserting the posterior means of the parameters. Table 3.8 reflects the same fact that the AV2 values are very close to the estimates of the ordinary logistic regression fit. Moreover, AV\ and AV2 values are not very different. 48 rf1TlTTT>^_ Intercept I 1 1 1 1 1 o I 1 1 1 1 1 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 Figure 3.8: Posterior distribution of the parameters. The first panel is for the intercept, panels 2-8 are of the coefficients and the last one is for A. Therefore we can draw the same conclusion that both models indicate similar average effects of the risk factors as can be seen from Figure 3.9 in which AV2 values are plotted against the estimates from the ordinary logistic regression fit. Figure 3.10 shows the trace plots of the average effects of the six risk factors. Observing the trace plots and comparing with the sample paths of the parameters from Figure 3.7, we see that mixing of the sampler in these plots is better and looks good. Because of the high correlation between coefficients and A, sampling from the N A D model was very hard using the M C M C methods. As a consequence, mixing of the sampler in Figure 3.7 is slow. But while 49 o —I 1 1 1 1 1 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Logistic regression estimates Figure 3.9: Scatterplot for comparing average effects with the logistic regression estimates computing the average effects from the sample M C M C output, correlation has less effect and we get quite satisfactory mixing of the sample average effects. Table 3.8: Estimated average effects of the six predictors from SHHS Coefficients AVi AV2 0.26 0.25 x2 0.86 0.87 X3 0.13 0.11 0.75 0.75 x5 0.79 0.79 X6 0.16 0.14 3.2.1 Model Comparison The predictive power of the N A D model will be compared with that of the step-wise and ordi- nary logistic regression model. As before we are using cross-validation procedure. The predicted probabilities were computed by (3.2). For the N A D model we follow the same procedure as 50 0 500 1000 1500 2000 0 500 1000 1500 2000 Index Index Figure 3.10: Trace plots of the average effects of the six predictors considering every tenth sample from the M C M C output of the parameters Example 1 to compute the posterior means of the predicted probabilities using sample M C M C output. We also compute predicted probabilities plugging in the posterior means of the pa- rameters. The log likelihoods computed using both procedure were presented. The predicted probabilities are presented in Figure 3.11. From the boxplots we see that the N A D model predicts the presence of C H D slightly better than step-wise and the ordinary logistic regression models. It also predicts the absence of C H D slightly better than the step-wise but the same as the ordinary. 51 Absence of CHD NAD Step Ordinary Three Logistic Models Presence of CHD NAD Step Ordinary Three Logistic Models Figure 3.11: Boxplot for comparing performance of the three models The log likelihoods for three models computed using (3.4) are given below: log L(N AD) = -760.8353, logL(NAD) = —761.2751 (using posterior means), logL(step) = -765.9517, log L(or dinary) = -763.1630. The model with the largest log likelihood is the best. Comparing the above three log predictive likelihoods we see that the N A D model has the largest log likelihood, but the differences with the other two are very small in light of the large sample size. Although we have a large sample in this data set the number of events is relatively very small (only 5% of the total cases), which might limit the predictive performance of the N A D model. For this reason we could not see a huge improvement. Using Equation (3.5) we see that the N A D model is only 0.13% better than the step-wise model. Considering these issues we can say that the N A D model has slightly better predictive power than step-wise for this example. 52 3.3 Example 3: Mystery data The data for this example is from a real epidemiological study. The dataset contains 10,000 cases and twelve risk factors. Of the 10,000 subjects 1433 had the outcome event. The continuous risk factors are converted into binary variables. Due to reasons of data security and confidentiality, a description of the study and the risk factors is unavailable. The response variable y is defined as 1 if the outcome event is present, 0 if it is absent. The twelve predictors are denoted by Xi, X2, • • •, Xi2. Figure 3.12 shows the distribution of the number of risk factors. There are 109 people in this study who do not have any risk factors and nobody has all twelve predictors. By observing the histogram we see that very few people have ten or eleven risk factors. The majority of the subjects have three or four risk factors. n nn_ i 1 1 1 1 1 r 0 2 4 6 8 10 12 Number of risk factors Figure 3.12: Distribution of the number of risk factors To evaluate the Hybrid (HY) algorithm we assume k = 4, c = \ log(2), e = 0.016, 6 = 0.90. The initial values for /3's are the estimates from the ordinary logistic regression fit and 0° = 0. We iterate the algorithm 20,000 times and the acceptance rate is 88%. Figure 3.13 shows the sample path of the M C M C output. Figure 3.14 shows the his- togram of the posterior distribution. From Figure 3.13 we see that it took some iterations for 53 Table 3.9: Summary results of the posterior distribution from Mystery data Variable Posterior Means Std Dev. 90% CI 5% 95% Intercept -4.11 0.28 -4.66 -3.75 Xi 0.87 0.23 0.56 1.31 x2 0.50 0.22 0.18 0.92 Xz 1.19 0.23 0.85 1.61 0.28 0.20 0.02 0.61 x5 1.12 0.24 0.81 1.57 x6 0.72 0.26 0.32 1.18 X7 1.47 0.25 1.11 1.94 x8 1.32 0.26 0.94 1.81 x9 0.34 0.20 0.04 0.69 Xio 0.56 0.28 0.10 1.03 Xu 0.26 0.19 0.02 0.64 X\2 0.71 0.30 0.21 1.23 A 1.59 0.17 1.33 1.90 Table 3.10: Summary results from the ordinary logistic regression fit 95% CI Coefficients Estimate Std Error Lower Upper Intercept -3.353 0.110 -3.568 -3.138 xx 0.365 0.084 0.201 0.528 x2 0.101 0.061 -0.019 0.221 x3 0.494 0.061 0.374 0.614 x4 0.029 0.081 -0.131 0.188 x5 0.529 0.079 0.373 0.684 xG 0.236 0.084 0.072 0.399 X7 0.694 0.070 0.558 0.831 x8 0.557 0.068 0.424 0.690 x9 0.078 0.070 -0.058 0.214 Xw 0.157 0.100 -0.039 0.353 Xn 0.001 0.065 -0.125 0.128 X\2 0.265 0.111 0.047 0.483 54 Table 3.11: Estimated average effects of the twelve predictors from Mystery data Variable AVi AV2 Xx 0.37 0.37 x2 0.15 0.15 x3 0.56 0.56 Xi 0.07 0.06 x5 0.54 0.54 xe 0.26 0.25 Xj 0.75 0.75 x8 0.63 0.63 x9 0.09 0.08 Xio 0.18 0.17 Xn 0.06 0.05 Xl2 0.26 0.25 the chains to be stabilized. So we throw away first 5000 iterations as burn-in and compute pos- terior means from the remaining samples. The posterior means, standard deviations and 90% equal-tailed credible intervals, computed from 5% and 95% quantiles, are given in Table 3.9. From Figure 3.14 we see the effect of the restriction imposed on the coefficients to be positive on some of the posterior distributions of the parameters. Table 3.10 represents the summary results from the ordinary logistic regression fit. Com- paring the estimates from Table 3.10 with those from Table 3.9, we again observe that the N A D model provides larger estimated B^s and larger posterior standard deviations. Comparable estimates, obtained by using (3.1), are given in Table 3.11. AV\ values are obtained considering every tenth sample from the 15,000 sample M C M C output. AV2 values are computed by plugging in the posterior means of the parameters into (3.1). Both AV\ and AV2 values are very close to the estimates from the ordinary logistic regression fit. Thus we can draw the same conclusion that both models indicate similar average effects of the risk factors, as can be seen from Figure 3.15 in which AV2 values are plotted against logistic regression estimates. Figure 3.16 shows the trace plots of the average effects of the twelve predictors. Com- paring these plots with those from Figure 3.13 we observe that mixing is slow. Again, high correlation between A and coefficients makes it difficult to sample even after several tuning the 55 0 5000 10000 15000 20000 "i 1 1 r 0 5000 10000 15000 20000 0 5000 10000 15000 20000 0 5000 10000 15000 20000 0 5000 10000 15000 20000 0 5000 10000 15000 20000 0 5000 10000 15000 20000 0 5000 1 0000 15000 20000 n m i inrm irvmn ?mnn Figure 3.13: Sample path of the MCMC output where the first panel is for the intercept, panels 2-13 are of the coefficients and the last one is for A 56 Figure 3.14: Posterior distribution of the parameters. The first panel is for the intercept, panels 2-13 are of the coefficients and the last one is for A 57 n I I I I I I I 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Logistic regression estimates Figure 3.15: Scatterplot for comparing average effects with the logistic regression estimates step size, and get proper mixing of the sampler output. As correlation has less effect on the average effects we are getting better mixing of the sample average effects. 3.3.1 Model Comparison As with the Examples 1 and 2, we compare the predictive power of the N A D model with that of the step-wise and the ordinary logistic regression model. The cross-validation procedure is used. For selecting the step-wise model we use the probability for selecting a variable is 0.20 and for removing a variable is 0.25. The predicted probabilities were computed by (3.2). For implementing the H Y algorithm with the N A D model we use k = 4, e = 0.018. The posterior means of the predicted probabilities were computed using sample M C M C output. Predicted probabilities by plugging in the posterior means of the parameters were also computed. The log predictive likelihoods were presented below. The predicted probabilities were plotted in Figure 3.17. From the boxplots we observe that the N A D model predicts the presence and absence of the outcome the same as that of the step-wise model for this dataset. In both of the cases 58 \ 59 Absence of the event Presence of the event NAD Step Ordinary NAD Step Ordinary Three Logistic Models Three Logistic Models Figure 3.17: Boxplot for comparing performance of the three models ordinary logistic regression model is slightly worse. The log predictive likelihoods for three models computed by using (3.4) are as follows: log L{N AD) = -3767.815, logL(NAD) = —3768.121 (using posterior means), log L(step) = -3779.009, log L(ordinar-y) = —3772.059. Observing the above table we see that the N A D model has the largest log predictive likelihood, but the differences with the other two are very small. Though we have a large sample size and 15% cases have the event in this dataset, due to high correlation between A and coefficients we could not improve the predicted probabilities. Nevertheless, using (3.5) we see that the N A D model has 0.11% better predictive power than the step-wise model. 60 Chapter 4 Discussion and Future Work The most important finding to mention from the analysis of the data is that in each example we got a posterior mean of A which is greater than one. This supports our belief about existence of possible kind of interactions other than the pairwise and our interpretations about the effects of the risk factors on the logit (Chapter 2, Section 2.2). From our experience we agree with Linde et. al. [13] that in course of analyzing the data, selection and inclusion of pairwise interaction terms into the step-wise logistic regression model is time-consuming and clumsy. We showed that the N A D model is slightly better than the step-wise and the ordinary logistic regression models in predicting the outcome of the response variable. We estimated average effects of the risk factors via the N A D model which are similar to the estimates of the ordinary logistic regression model and bear the equivalent interpretation. It is also worth mentioning that the ordinary logistic regression model is almost as capable as the step-wise in prediction. For fitting the N A D model we assume binary covariates and restrict the sign of the coefficients to be positive. Due to the structural form of the N A D model we can not have negative covariates or negative values of the coefficients. Negative covariates can be included in the model by rescaling the minimum value to be zero. One useful technique that most epidemiologists apply when analyzing their data is to categorize the quantitative risk factors. However, if we have continuous risk factors scaled so that zero is the lowest risk and one is the highest risk, we can analyze them and interpret as usual conditional on > 0. The estimated (3 or the posterior mean of a parameter can be interpreted as representing the risk difference (in logit scale) of a highest risk subject compared to a lowest risk subject when assuming the 61 remaining risk factors are absent. In all of the three examples used to fit the N A D model we converted the quantitative risk factors into binary variables and we coded the risk factors to match our a priori beliefs about the direction of effects, as reflected by the restriction to nonnegative coefficients. Indeed, in practical epidemiological context the question might arise: could we make full use of the N A D model with these restrictions? Or should we think of a new functional form? In spite of these limitations the positive aspect is that we can still analyze quantitative risk factors using the N A D model as long as we consider positive values of the coefficients. In many epidemiological studies we could guess the direction of the effect of many of the risk factors in advance. By knowing this fact we could choose the appropriate coding and interpret the coefficient estimate accordingly. However, in reality some predictor's characteristics in an epidemiological study may not be guessed completely in advance and if we were asked to analyze that predictor what we will do? Suppose that all of the ft's are positive where i = 1,2, . . . , p . Let us introduce the new parameters 71,72, • • •, 7p- These parameters will govern the direction of the predictors and validate the structure of the N A D model by defining each of them as: { 1 if Xi — 1 is the high risk level, 0 if Xi = 0 is the high risk level. We have to estimate these parameters along with the other parameters already in the model. After introducing 7J'S the N A D model can be rewritten as: logit Pr(Y = 1\X) = ft + {p?X? (1 - X,)1^ + f%X? (1 - X2f~^ + . . . + / 3 A X p 7 p ( l - X p ) 1 - > } 1 / A , (4.1) where Xj 's are binary variables or if continuous, scaled to be zero or one. This functional form in the logit scale can overcome the restriction of presuming to know the direction of each effect in advance. Model (4.1) can be fit.by following the Bayesian approach to model fitting discussed in Chapter 2, Section 2.3. This model now contains some discrete and some continuous parameters. Two procedures can be followed to evaluate the parameters, either (i) by fixing 7 's and simulate fts and <p, or (ii) simulate all of them simultaneously. If 7 's are known in advance, we go back to the N A D model and can follow the same procedures. If not, a prior distribution for 7 , 62 say 7r(7), is need to be assumed along with 3 and cj). See George and McCulloch [6] for a discussion of assigning a prior distribution to 7 when there are mixtures of discrete and continuous parameters. We have the similar situation here. The joint posterior distribution is then written as: n"0M>7|y) ^ L{3,(b,i)TT{B)ir{(p)'K(i). Although analytical simplification of n(3,4>, j\y) is intractable, existing M C M C methods such as the Gibbs sampler or Metropolis-Hastings algorithms (see Smith and Roberts [18], Chib and Greenberg [2] for an overview) can be used to explore the posterior 7r(7|y). Applied to the complete posterior ir(3, </>,7|y), such methods simulate a Markov chain ^ \ ^ \ ^ l \ 3 ^ \ 4 2 \ i { 2 \ . . . , which converges in distribution to 7r(/3, <f>, -y\y). The embedded subsequence 7^,7^, . . . , thus converges to 7 ~ 7r(7|y). Refer to George and McCulloch [6] again for additional comments on the simulation process. While we have not implemented this in the present thesis, it should be straightforward extension of the methods described. 63 Bibliography [1] Agresti, Alan (1990). Categorical Data Analysis. John Wiley & Sons, New York. [2] Chib, Siddartha and Greenberg, Edward (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician Vol. 49, No. 4, 327-335. [3] Duane, S., Kennedy, A . D., Pendleton, B . J . , and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B 195, 216-222. [4] Gamerman, Dani (1997). Markov Chain Monte Carlo: Stochastic simulation for Bayesian inference. Chapman & Hall , London. [5] Gelman, A . , Carlin, John B. , Stern, Hal S. and Rubin, Donald B . (1995). Bayesian Data Analysis. Chapman & Hall , London. [6] George, E . I. and McCulloch, R. E . (1997). Approaches for Bayesian variable selection. Statist. Sinica 7, 339-373. [7] Gilks, W.R. , Richardson, S. and Spiegelhalter, D .J . (1998). Markov Chain Monte Carlo in Practice. Chapman & H a l l / C R C , London. [8] Gustafson, Paul (1998). A guided walk Metropolis algorithm, Statistics and Computing 8, 357-364. [9] Gustafson, P., MacNab, Y . C. and Wen, S. (2002). On the value of derivative evaluations and random walk suppression in Markov Chain Monte Carlo algorithms. Submitted. [10] Hastie, T. J . and Tibshirani, R. J . (2001). The Elements of Statistical Learning. Springer, New York. 64 [11] Hastings, W. K . (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109. [12] Hosmer, D. W . and Lemeshow, S. (2000). Applied Logistic Regression. John Wiley & Sons, Inc. New York. [13] Linde, Angelika van der and Osius, Gerhard (2001). Estimation of non-parametric multi- variate risk functions in matched case-control studies with application to the assessment of interactions of risk factors in the study of cancer. Statistics in Medicine 20, 1639-1662. [14] McCullagh, P. and Nelder, J . A . (1989). Generalized Linear Models, Second edition. Chap- man & Hall , London. [15] Metropolis, N . , Rosenbluth, A . W., Rosenbluth, M . N . , Teller A . H . , and Teller E . (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087-1092. [16] Neal, R. M . (1993). Bayesian learning via stochastic dynamics, in C L . Giles, S.J. Hanson, and J.D. Cowan (eds.), Advances in Neural Information Processing Systems 5, pp. 475-482, San Mateo, California: Morgan Kaufmann. [17] Rousseauw, J. , du Plessis, J . , Benade, A . , Jordaan, P., Kotze, J . , Jooste, P. and Ferreira, J . (1983). Coronary risk factor screening in three rural communities. South African Medical Journal 64, 430-436. [18] Smith, A . F . M . and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B 55, 3-24. [19] Woodward, Mark (1999). Epidemiology: Study Design and Data Analysis. Chapman &; H a l l / C R C . London. 65
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Non-addictive effects in logistic regression
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Non-addictive effects in logistic regression Kazi, Azad Mahbubur Rashid 2003
pdf
Page Metadata
Item Metadata
Title | Non-addictive effects in logistic regression |
Creator |
Kazi, Azad Mahbubur Rashid |
Date | 2003 |
Date Created | 2009-10-17 |
Date Issued | 2009-10-17 |
Description | Logistic regression is commonly used in epidemiology to model the relationship between risk factors and presence/absence of a disease. Usually it is difficult to look for interaction structure (many possible pairwise interactions, for instance) to include in the model. So a model which is additive on the logit scale is fitted. If the number of risk factors is relatively large such an additive relationship may not make good sense. A new logistic regression model is proposed to incorporate non-additive interaction effects. In some scenarios this model might better reflect the relationship between the response variable and the risk factors. The Bayesian approach is followed to fit the model and a Markov chain Monte Carlo (MCMC) algorithm, known as the hybrid algorithm is used to simulate the parameters. We apply the new model to three examples and interpret the parameter estimates. We compare the predictive performance of the new model with that of the step-wise and the ordinary logistic regression models. |
Extent | 2937201 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | Eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-10-17 |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0090878 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 2003-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/13988 |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- ubc_2003-0146.pdf [ 2.8MB ]
- Metadata
- JSON: 1.0090878.json
- JSON-LD: 1.0090878+ld.json
- RDF/XML (Pretty): 1.0090878.xml
- RDF/JSON: 1.0090878+rdf.json
- Turtle: 1.0090878+rdf-turtle.txt
- N-Triples: 1.0090878+rdf-ntriples.txt
- Original Record: 1.0090878 +original-record.json
- Full Text
- 1.0090878.txt
- Citation
- 1.0090878.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
United States | 7 | 2 |
Ukraine | 5 | 0 |
Japan | 4 | 0 |
China | 2 | 22 |
France | 1 | 0 |
South Africa | 1 | 0 |
Sri Lanka | 1 | 0 |
Italy | 1 | 1 |
City | Views | Downloads |
---|---|---|
Unknown | 8 | 7 |
Ashburn | 4 | 0 |
Tokyo | 4 | 0 |
Beijing | 2 | 4 |
Pietermaritzburg | 1 | 0 |
Redmond | 1 | 0 |
Seattle | 1 | 0 |
Milan | 1 | 1 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0090878/manifest