Average effects for regression models with misspecifications and diffuse interaction models

by

Juxin Liu

B.Sc., Anhui University, 2000
M.Sc., Beijing University, 2003
Ph.D., University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics)

The University of British Columbia
June 2007

© Juxin Liu 2007

Abstract

In epidemiological studies, how best to assess and interpret interaction of risk factors of interest has been the subject of a lively debate. In statistical regression models, the interaction between two putative risk factors is described by the regression coefficient of the product of the risk factors. What happens if a linear regression model without pairwise interaction terms is used to fit data actually generated from a linear regression model with all possible pairwise interactions? We apply the idea of the average effect to evaluate the consequences of misspecified models, and find that the average effect estimates are still consistent if the joint distribution of the risk factors satisfies certain conditions. It is known that pairwise interaction models encounter intractable problems, especially when the number of risk factors under consideration is large: if the number of risk factors is p, the number of pairwise interaction terms is $p(p-1)/2$. As an alternative strategy, we introduce the diffuse interaction model, which uses only one parameter to reflect the interactions among all the risk factors, without specifying which of the risk factors do indeed interact. We compare the two kinds of interaction models in terms of their ability to detect interactions. Another issue investigated in the thesis is devising MCMC algorithms to estimate diffuse interaction models. This is done not only for the diffuse interaction model assuming all risk factors interact in the same direction, either synergistically or antagonistically, but also for extended diffuse interaction models which relax this strong assumption.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement
Dedication

1 Introduction
  1.1 Motivation
  1.2 Classes of regression models
  1.3 Outline of thesis

2 Average effects for regression models with misspecifications
  2.1 Average effect
  2.2 General results
  2.3 Linear regression
    2.3.1 Difference in large sample limits of the two average effect estimators
    2.3.2 Relative efficiency of the two average effect estimators
  2.4 Nonparametric regression and smoothing
    2.4.1 Spline regression models
    2.4.2 Penalized regression models
  2.5 A middle scenario
  2.6 Summary

3 Comparison of interaction detectability under different interaction models
  3.1 General framework
  3.2 Comparison between pairwise interaction models and diffuse interaction models
    3.2.1 Introduction of diffuse interaction model
    3.2.2 Power comparison
  3.3 Summary

4 MCMC algorithms for diffuse interaction models
  4.1 Why MCMC?
  4.2 Details of MCMC algorithms
    4.2.1 One-group diffuse interaction models
    4.2.2 Two-group diffuse interaction models
  4.3 Discussion
  4.4 Example

5 Summarization and future work
  5.1 Conclusions
  5.2 Future work

Appendices

I Proof of (2.14) in Section 2.3.1
II Another consistent estimator of $V_*$ mentioned in Section 2.2
III Numerical approach to Cn and Cn in Section 2.4.1
IV Proof of (2.24) in Section 2.4.2
V Pseudocode for the hybrid MCMC algorithm in Section 4.2.1

Bibliography

List of Tables

4.1 Algorithm II: Frequency table based on posterior samples for each component of I with different values of $\beta_j$, $j \in \mathrm{ANT}$.
4.2 Algorithm III: Frequency table based on posterior samples for each component of I under the two-group diffuse interaction model with $\lambda$ unknown.
4.3 Algorithm III: Frequency table based on posterior samples for each component of I under the three-group diffuse interaction model.

List of Figures

2.1 Magnitude of relative bias as a function of $(\rho, \tau)$ for a multivariate log-normal distribution of p = 10 predictors. The cases (a) through (d) are as described in the text.
2.2 Relative efficiency of the "misspecified"-model average-effect estimator as a function of $\rho$ and $\|\gamma\|$, for an equi-correlated multivariate normal distribution of p = 10 predictors. The cases (a) and (c) are as described in the text.
2.3 Comparison of the two average effect estimators $\hat\delta_1^*(\hat\beta)$ and $\hat\delta_1(\hat\beta)$ for spline regression models with two predictors involved. The x-coordinate of each circle is $\hat\delta_1^*(\hat\beta)$ and the y-coordinate is $\hat\delta_1(\hat\beta)$. Different circles are produced by different values of $\rho$, the correlation coefficient between $X_1$ and $X_2$.
2.4 Comparison of asymptotic conditional variances of the two average effect estimators for spline regression models with two predictors involved. The x-coordinate of each circle is the conditional variance of the misspecified-model estimator and the y-coordinate is that of the right-model estimator. Different circles are produced by different values of $\rho$, the correlation coefficient between $X_1$ and $X_2$.
2.5 Scatter plots of average effect estimates in penalized spline regression with only two predictors: the top panel with sample size n = 200 and the bottom with n = 1000. The solid horizontal line in each panel identifies the true value of the average effect, and each circle represents a different simulated data set.
2.6 Histograms of the average effect estimates in penalized spline regression with only two predictors: the top panel with sample size n = 200 and the bottom with n = 1000; the symbol 'x' in each panel marks the true value of the average effect.
3.1 Power curves with $X_i$'s iid Bernoulli(0.5): the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the bottom panel $\eta = 1$.
3.2 Different choices of $\eta$ in Case II with binary predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few of the predictor pairs interacting positively, and the bottom panel has more mixed directions of interactions. The lengths of the $\eta$'s in different panels are normalized to be 1.
Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting.
3.3 Different choices of $\xi$ in the $X_i$'s distribution: the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, $\xi = 0.2, 0.5$ and $0.8$ respectively.
3.4 Different choices of b in $\beta_M = b\,1_p$: the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, $b = 0.1, 0.5$ and $1$ respectively.
3.5 Different choices of $\mathrm{Var}(\epsilon)$: the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, $\mathrm{Var}(\epsilon) = 0.5, 1$ and $5$ respectively.
3.6 Different choices of $\mathrm{Var}(\epsilon)$ with scaled $\lambda$: the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting.
3.7 Different choices of $\rho$ among the $X_j$'s: the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, $\rho = 0.005, 0.5$ and $0.98$ respectively.
3.8 Power curves with $X_i$'s iid Log-normal(0,1): the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the bottom panel $\eta = 1$.
3.9 Different choices of $\eta$ in Case II with continuous predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few of the predictor pairs interacting positively, and the bottom panel has more mixed directions of interactions. The lengths of the $\eta$'s in different panels are normalized to be 1. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting.
4.1 Algorithm I: Posterior correlations for $(\lambda, \beta_1)$ and $(\lambda, \alpha_1)$ respectively.
4.2 Algorithm I: MCMC trace plots for $\beta_j$, $j = 0,1,\dots,p$, and $\lambda$ with diffuse priors $\sigma^2_{\beta_0} = \sigma^2_{\beta} = \sigma^2_{\lambda} = 100$.
4.3 Algorithm I: Marginal posterior densities for $\beta_j$, $j = 0,1,\dots,p$, and $\lambda$ with diffuse priors $\sigma^2_{\beta_0} = \sigma^2_{\beta} = \sigma^2_{\lambda} = 100$.
4.4 Algorithm I: Posterior densities for $\beta_j$, $j = 0,1,\dots,p$, and $\lambda$ with informative priors $\sigma^2_{\beta_0} = \sigma^2_{\beta} = \sigma^2_{\lambda} = 0.4$.
4.5 Algorithm II: Using Proposal 1, MCMC trace plots for $\beta_j$, $j = 1,\dots,p$, with diffuse priors $\sigma^2_{\beta_0} = \sigma^2_{\beta} = 100$.
4.6 Algorithm II: Using Proposal 1, posterior densities for $\beta_j$, $j = 1,\dots,p$, with diffuse priors $\sigma^2_{\beta_0} = \sigma^2_{\beta} = 100$.
4.7 Algorithm II: Comparison of the two proposals: number of correct group allocations based on posterior samples for I.
4.8 Algorithm II: With Proposal 1, the autocorrelation curves for posterior samples of $\beta_j$, $j = 1,\dots,p$, respectively.
4.9 Algorithm II: With Proposal 2, the autocorrelation curves for posterior samples of $\beta_j$, $j = 1,\dots,p$, respectively.
4.10 Algorithm II: With Proposal 1, the posterior densities of samples with informative priors $\sigma^2_{\beta_0} = \sigma^2_{\beta} = 0.4$.
4.11 Algorithm II: Comparison of the number of correct group allocations based on posterior samples for I with different values of $\beta_j$ $(j \in \mathrm{ANT})$: top panel with the smaller value (0.4), and bottom panel with the bigger value (0.7).
4.12 Algorithm II: MCMC output of $\beta_4$ with true value set to 0.4: the top panel is the posterior density for the entire sequence of 10000 iterations, the middle is the posterior density for the subsequence $\beta_4 \mid I_4 = 1$, and the bottom is the posterior density for the subsequence $\beta_4 \mid I_4 = 2$.
4.13 Algorithm III: Trace plots for $\beta_j$, $j = 1,\dots,p$, based on the two-group diffuse interaction model with $\lambda$ unknown, respectively.
4.14 Algorithm III: Marginal posterior densities for $\beta_j$, $j = 1,\dots,p$, based on the two-group diffuse interaction model with $\lambda$ unknown, respectively.
4.15 Algorithm III: Plots of posterior samples for $\lambda$: the top panel is the trace plot, the middle is the posterior density plot, and the bottom is the autocorrelation curve.
4.16 Algorithm III: Autocorrelation curves for $\beta_j$, $j = 1,\dots,p$, based on the two-group diffuse interaction model with $\lambda$ unknown, respectively.
4.17 Algorithm V: Trace plots of MCMC samples for $\beta_j$, $j = 1,\dots,p$, based on three-group diffuse interaction models with $\lambda_2 = 4/5$, $\lambda_3 = 5/4$.
4.18 Algorithm V: Posterior densities of MCMC samples for $\beta_j$, $j = 1,\dots,p$, based on three-group diffuse interaction models with $\lambda_2 = 4/5$, $\lambda_3 = 5/4$.
4.19 Algorithm V: Autocorrelation curves of MCMC samples for $\beta_j$, $j = 1,\dots,p$, based on three-group diffuse interaction models with $\lambda_2 = 4/5$, $\lambda_3 = 5/4$.
4.20 Algorithm V: Number of correct group allocations based on posterior samples of I under three-group diffuse interaction models.
4.21 Algorithm V: Posterior densities of samples for $\beta_j$ conditional on $I_j$ correctly allocated ($j = 1,\dots,p$) under three-group diffuse interaction models.
4.22 Abalone data: trace plots for $\beta_j$, $j = 0,1,\dots,p$, for the whole sequence of 100,000 iterations including the burn-in period.
4.23 Abalone data: density plots for $\beta_j$ and $\delta_j$, $j = 1,\dots,p$, for the sequence of 40,000 post burn-in iterations. The solid lines stand for the average effects $\delta_j$ and the dashed lines stand for the $\beta_j$'s.
4.24 Abalone data: the top panel is the density plot of $\lambda$ and the bottom panel is the density plot of the relative antagonistic effect, for the sequence of 40,000 post burn-in iterations.

Acknowledgement

I would like to thank my supervisor, Professor Paul Gustafson, for his guidance, invaluable advice and great patience. Each time I am doing a "random walk", it is my supervisor who shows me an efficient "direction" in which to make a meaningful move.
Special thanks to Professors Harry Joe and Lang Wu for their helpful comments and suggestions on my proposal, and for the table-tennis playing as well. And extra thanks to Harry for the tips on LaTeX, which made the notation in my thesis well-exhibited. I am very grateful for the help from my fellow graduate students and the ladies in our main office. I feel lucky to have had the chance to study in our department, which is just like a big family.

Juxin Liu
The University of British Columbia
June 2007

To Ameng

Chapter 1
Introduction

1.1 Motivation

The term interaction is used in epidemiology to describe a situation in which two or more risk factors modify the effect of each other with regard to the occurrence or value of a given health outcome, denoted by Y. For dichotomous variables, interaction means that the effect of one risk factor, say A, on the outcome differs depending on whether another variable B (the effect modifier) is present. Moreover, if the presence of B, the effect modifier, potentiates/accentuates the effect of risk factor A, the variable and the risk factor are said to be synergistic (positive interaction); if the presence of B diminishes or eliminates the effect of risk factor A, the two variables are antagonistic (negative interaction). For continuous variables, interaction means that the effect of one risk factor on the outcome differs depending on the value of another variable (the effect modifier). A mathematical definition comes in Section 1.2. Later, in Section 3.2.1, we will use a more general definition of synergism/antagonism when diffuse interaction models are introduced.

In epidemiological studies, synergistic/antagonistic interaction among risk factors is common. For example, people suffering from obesity who also have high blood cholesterol have a higher chance of getting heart disease. Another example (of antagonism) is the interaction between smoking and intake of Vitamin A for the risk of lung cancer: people who smoke a lot but take Vitamin A in their daily diet have a lower risk of lung cancer than people who seldom smoke but lack Vitamin A.

Interaction can be described in two different but compatible ways, and each description leads to a specific strategy for the assessment of interaction. The first definition is based on homogeneity/heterogeneity of effects: interaction occurs when the effect of a risk factor A on outcome Y is not homogeneous across strata formed by a third variable B. When this definition is used, variable B is often referred to as an effect modifier. The second definition is based on the comparison between observed and expected joint effects of risk factor A and third variable B: interaction occurs when the observed joint effect of A and B differs from that expected on the basis of the independent effects of A and B.

How does one assess interactions? In this thesis, we focus only on the situation where the relationship between a continuous outcome Y and risk factors is of interest. Commonly the product of the two variables A and B is used to describe interaction effects. That is,
$$E(Y \mid A, B) = \beta_0 + \beta_1 A + \beta_2 B + \beta_{12} AB,$$
where $\beta_{12}$ reflects the magnitude of the interaction effect between A and B. If $\beta_{12} > 0$, the interaction between risk factors A and B is a synergistic (positive) interaction; otherwise it is an antagonistic (negative) interaction.
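When A and B are both binary, the product-term model above is saturated, and $\beta_{12}$ can be read directly as a difference of differences; a one-line check from the four cell means:
$$\{E(Y\mid A=1,B=1) - E(Y\mid A=0,B=1)\} - \{E(Y\mid A=1,B=0) - E(Y\mid A=0,B=0)\} = (\beta_1 + \beta_{12}) - \beta_1 = \beta_{12}.$$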
For simplicity, the pure quadratic terms $A^2$ and $B^2$ are omitted in the above model. (These terms are also omitted in the linear regression models mentioned later; an analysis with quadratic terms is discussed briefly in Section 2.5.) Note that if the outcome of interest is discrete, we can replace $E(Y\mid A,B)$ by $g(E(Y\mid A,B))$ in the above model, i.e., use a generalized linear model. For instance, if Y is binary, logistic regression can be used, with $g(E(Y\mid A,B)) = \log\{P(Y=1)/P(Y=0)\}$.

More and more attention has been focused on model misspecification since statisticians realized that, unfortunately, misspecified models are not uncommon in practice. Box (1979) and also McCullagh and Nelder (1983) observed that "all models are wrong", though some fit data better than others. Since we never know what the true model is in reality, by the "true" model we mean the true structure, or a model closer to the truth than the others. It is natural to ask whether the properties of estimators derived from misspecified models are affected. Does the estimator still converge to some limit asymptotically, and does this limit have any meaning? If the estimator is approximately consistent, is it also asymptotically normal? White (1982) provides answers to these questions using maximum likelihood techniques for estimation and inference about regression coefficients. Also, in White (1981), the consequences and detection of misspecified nonlinear regression models are explored.

Under the general topic of interactions, the goal of our work is to explore the consequences of a particular scenario of model misspecification. To be specific, what happens if we apply an additive model ignoring pairwise interactions to data which are actually generated from a pairwise interaction model? In this context, to make clear how the ignored interaction effects affect the results, we apply the average effect idea (defined in Chapter 2) rather than applying the results of White (1982) directly to the regression coefficients. As is known, the interpretability of regression coefficients of risk factors is rather limited when models include interaction terms.

The idea of the average effect is proposed by Gelman and Pardoe (2007) and by Gustafson et al. (2005) as well. Basically, it is the average of the predictive effect, which is the expected change in outcome associated with a unit change in one of the risk factors. In a linear regression model without interactions, the average effect of any putative risk factor is simply its regression coefficient. However, in a model with interaction terms, the predictive effect in general depends on the values of the risk factors. There are various definitions based on different distributions to average over; three versions are defined in Chapter 2. The main advantage of the average effect idea is that it makes comparisons possible between different parametric models with sets of parameters that have incomparable interpretations. The average effect idea can also be used in other contexts. For example, Xu and O'Quigley (2000), and also Gustafson (2007), give a definition of the average effect in survival analysis. In the future, we could also apply the idea of the average effect to explore the consequences of model misspecification within the framework of survival analysis.
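To make this misspecification scenario concrete before the formal treatment in Chapter 2, here is a minimal simulation sketch (illustrative only, not part of the thesis; the coefficient values, the log-normal predictor distribution, and the variable names are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    beta1, beta2, beta12 = 1.0, 0.5, 0.8

    # Dependent, skewed predictors (log-normal), then centred.
    z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n)
    x1, x2 = np.exp(z[:, 0]), np.exp(z[:, 1])
    x1 -= x1.mean()
    x2 -= x2.mean()

    # Data generated WITH an interaction term.
    y = beta1*x1 + beta2*x2 + beta12*x1*x2 + rng.normal(size=n)

    # Fit the additive model that IGNORES the interaction.
    X_add = np.column_stack([np.ones(n), x1, x2])
    alpha = np.linalg.lstsq(X_add, y, rcond=None)[0]

    # With centred predictors, the true average effect of x1 is beta1.
    print("true average effect:", beta1, "  additive-fit estimate:", alpha[1])

With these skewed, dependent predictors the additive fit is biased for the average effect; replacing np.exp(z) by z itself (multivariate normal predictors) removes the bias, anticipating Result 3 of Chapter 2.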
Nowadays another common issue arising in epidemiological studies is that a large number of potential risk factors (sometimes larger than the sample size) must be considered in modelling. One can imagine the challenge of modelling interaction effects when p, the number of risk factors, is relatively large. In particular, if the model under consideration is a pairwise interaction model, the number of all possible pairwise interaction terms is $p(p-1)/2$. For instance, if 12 risk factors are involved in the study, 66 pairs of possible interactions would need to be investigated, besides the possibility of higher-order interactions among three or more risk factors.

Stepwise procedures are the most widely used approaches for selecting important pairwise interaction terms in applied medical statistics. The basic idea of a stepwise procedure is to find a "best" subset of potential risk factors by successively adding or dropping one risk factor at a time. Take the forward stepwise regression procedure for instance: it starts by choosing the model containing the single best risk factor and then attempts to build up with subsequent additions of other risk factors one at a time, as long as these additions are worthwhile (a minimal sketch of such a procedure appears at the end of this section).

There are, however, a number of limitations of stepwise procedures. In particular, if the possible models under consideration are nested pairwise interaction models, the stepwise procedure does not scale with the number of potential risk factors (supposing all the risk factors involved at this stage are important). As the number p of risk factors increases, the number of submodels, $2^{p(p-1)/2}$, increases dramatically, making the computational burden enormous. Also, fitting the full model sometimes may not be suitable, because typically only a few of the p risk factors are included in the final model, and fitting the full model increases the numerical complexity of the methods unnecessarily.

Another problem is that the model selected by a stepwise procedure includes only those variables entered in the final model, and ignores the variables not selected and the uncertainty due to the model selection procedure. In the worst scenario, such procedures may underestimate uncertainty about the variables, overestimate confidence in the particular model selected, and lead to sub-optimal decisions and limited predictability (Raftery (1996); Draper (1995)). There are also other drawbacks of stepwise selection. For example, a small change in the data can result in very different models being selected, and this can reduce prediction accuracy, as discussed in Breiman (1996).

To overcome the difficulties caused by a large number of risk factors, we use diffuse interaction models as an alternative way to model interactions. This kind of model was proposed by Gustafson et al. (2005) in the context of a binary response. In this context, by synergism/antagonism we mean that the effect of a putative risk factor increases/decreases in magnitude as all the other risk factors move from absent to present (or as all the other risk factors increase, if they are continuous). Under this particular probability model, only one parameter is used to describe the interaction among all risk factors. That is, the parameter can tell the overall interaction direction, but without indicating which of the risk factors actually interact in that direction and which do not. Hence the model is a bit simplistic, but we postulate that it has more power, compared to the pairwise interaction model, to detect interactions.
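As referenced above, a greedy forward-selection sketch over pairwise interaction terms (illustrative only; the function name forward_stepwise_interactions is hypothetical, and the score here is the residual sum of squares with a fixed number of steps, whereas real stepwise procedures stop via F-tests, AIC, or BIC):

    import itertools
    import numpy as np

    def forward_stepwise_interactions(X, y, max_terms=5):
        """Greedily add the pairwise interaction column that most reduces
        the residual sum of squares, keeping all main effects throughout."""
        n, p = X.shape
        pairs = list(itertools.combinations(range(p), 2))
        design = [np.ones(n)] + [X[:, j] for j in range(p)]
        chosen = []
        for _ in range(max_terms):
            best = None
            for (i, j) in pairs:
                if (i, j) in chosen:
                    continue
                D = np.column_stack(design + [X[:, i] * X[:, j]])
                fit = np.linalg.lstsq(D, y, rcond=None)[0]
                rss = float(np.sum((y - D @ fit)**2))
                if best is None or rss < best[0]:
                    best = (rss, (i, j))
            i, j = best[1]
            chosen.append((i, j))
            design.append(X[:, i] * X[:, j])
        return chosen

Even this simple sketch makes the scaling problem visible: each step scans all remaining pairs, while the number of candidate submodels grows as $2^{p(p-1)/2}$.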
1.2 Classes of regression models

When we are concerned with the dependence of a response variable Y on observed risk factors $X_1,\dots,X_p$, an equation that relates Y to $X_1,\dots,X_p$ is usually called a regression equation. Denote the regression equation by
$$E(Y \mid X_1 = x_1, X_2 = x_2, \dots, X_p = x_p) = g(x_1, x_2, \dots, x_p) = g(\mathbf{x}).$$

First of all, we give a more precise definition of synergism/antagonism in mathematical language. For any pair $j < k$ and all $\epsilon, \delta > 0$, if
$$g(\mathbf{x} + \epsilon 1_j + \delta 1_k) - g(\mathbf{x} + \delta 1_k) - \{g(\mathbf{x} + \epsilon 1_j) - g(\mathbf{x})\} \ge (\le)\ 0,$$
the interaction among X is synergistic (antagonistic). Here $1_j$ denotes the $p \times 1$ vector of zeros except that the jth element is 1. Equivalently, if g is twice differentiable, the above inequality can be rewritten as
$$\frac{\partial^2 g(\mathbf{x})}{\partial x_j \partial x_k} \ge (\le)\ 0.$$
Note that there are other names for the above property, such as supermodular, directionally convex, and lattice-superadditive for nonnegative derivatives, and submodular, directionally concave, and lattice-subadditive for nonpositive derivatives. They are used to compare the dependence structure of random vectors having the same marginal distributions. More details and discussion are in Müller and Stoyan (2002). In the following we list all the regression equations considered in the thesis.

1. Linear regression models.
$$g(x_1, \dots, x_p) = \beta_0 + \sum_{j=1}^p \beta_j x_j.$$
Due to their simplicity and interpretability, linear regression models are the most widely used for either experimental or observational data. The regression coefficient $\beta_j$ indicates how much change in the response variable is associated with a unit change in $x_j$ when all the other risk factors are kept unchanged. It is acknowledged that this model is a linear approximation of the relationship between Y and X; by a first-order Taylor expansion, it is easy to derive the model expression. Therefore, it may work well only in a local region, where the surface has no curvature. That is the reason why prediction at X values outside of the range where we fit the model is usually dangerous and unreliable. By a second-order Taylor expansion, we have
$$g(x_1, \dots, x_p) = \beta_0 + \sum_{j=1}^p \beta_j x_j + \sum_{j=1}^p \beta_{jj} x_j^2 + \sum_{1 \le i < j \le p} \beta_{ij} x_i x_j.$$
Again this is good in a local region, where the surface has some curvature. Note that for a binary variable $x_j$, the term $\beta_{jj} x_j^2$ is not necessary. One thing worth mentioning is that the interpretation of the coefficients $\beta_j$ differs from that under the additive model: the expected change in the response variable Y after a one-unit change in $x_j$ (keeping all the others unchanged) now breaks down into several pieces, involving $\beta_j$, $\beta_{jj}$, and the $\beta_{ij} x_i$. For this model, for any pair $i < j$, we have synergistic interaction if $\beta_{ij} > 0$ and antagonistic interaction if $\beta_{ij} < 0$.

2. Spline regression models.
$$g(x_1, \dots, x_p) = \sum_{j=1}^p m_j(x_j), \qquad (1.1)$$
where $m_j$ is a smoothing function applied to $x_j$. Note this model is additive and does not include interaction terms. Taylor expansion allows us to write $m_j$ as
$$m_j(x) = \sum_{k=1}^{D+1} a_{jk} x^{k-1} + \mathrm{Rem}_j(x), \qquad a \le x \le b,\ j = 1, \dots, p,$$
$$\mathrm{Rem}_j(x) = (D!)^{-1} \int_a^b m_j^{(D+1)}(t)\,(x - t)_+^D\, dt,$$
where $(x - t)_+ = x - t$ if $x > t$ and zero otherwise. Note that for fixed j, if the $\mathrm{Rem}_j(x)$ are uniformly small in magnitude for all observed $x_{ji}$, polynomial regression on $x_j$ may provide a reasonable analysis. Otherwise, we need other methods to account for the term $\mathrm{Rem}_j$. One method of estimation which attempts to guard against departures from polynomial models is smoothing splines.
The basic premise is that the integral in $\mathrm{Rem}_j$ can be approximated using the quadrature formula
$$\mathrm{Rem}_j(x) \approx \sum_{k=1}^{L_j} a_{jk}(x - t_{jk})_+^D$$
for coefficients $a_{j1},\dots,a_{jL_j}$ and $t_{j1} < t_{j2} < \cdots < t_{jL_j}$ (the $t_{jk}$'s are knots in the definition of a spline, which is given in Section 2.4.1). Combining this with the original polynomial approximation leads to an overall approximation of the regression function by
$$m_j(x) = \alpha_{j1} x + \cdots + \alpha_{jD} x^D + \sum_{k=1}^{L_j} a_{jk}(x - t_{jk})_+^D,$$
which is the spline function under consideration in Chapter 2.

To introduce interactions between $X_1,\dots,X_p$, we use the following regression function, similar to pairwise interaction models,
$$g(x_1,\dots,x_p) = \sum_{j=1}^p m_j(x_j) + \sum_{i \ne j} x_i\, m_{ij}(x_j). \qquad (1.2)$$
Note that there are other possibilities for modelling the interactions between any pair of risk factors. Here, for simplicity, we only use the summation of products of one risk factor with a smoothing function of the other risk factor, which is different from the approach discussed in Gu (2002). In Gu (2002), the multivariate function is given by an ANOVA decomposition; that is, it is expressed as a constant, plus the sum of functions of one variable (main effects), plus the sum of functions of two variables (two-factor interactions), and so on, with the interactions assumed to lie in tensor product spaces.

Note that if we have a large number of risk factors, the additive model (1.1) may be too generous in allowing a few degrees of freedom per $X_j$, and it still does not take the interactions between the $X_j$'s into account. Such classes of functions are not "dense", i.e., not a universal approximation to the class of smooth functions on a rectangular region $\prod_{j=1}^p [x_{jL}, x_{jU}]$, where $x_{jL}, x_{jU}$ are the lower and upper bounds of the jth risk factor respectively. In the interaction model (1.2), the number of interaction terms increases dramatically as the number of risk factors increases. In such a case, we may use projection pursuit regression (Friedman and Stuetzle, 1981) or multivariate adaptive regression splines (Friedman, 1991). These two methods provide universal approximations to smooth functions by a sum of nonlinear functions of linear combinations of the $x_j$'s, i.e.,
$$g(x_1,\dots,x_p) = \sum_{m=1}^M \phi_m\left(\sum_{j=1}^p \alpha_{mj} x_j\right).$$
More discussion is in Venables and Ripley (1999) and Diaconis and Shahshahani (1984). Note that projection pursuit regression models and multivariate adaptive regression splines are not considered in the thesis, but they are possible and worthwhile topics for future study.
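As a concrete rendering of the truncated-power spline form of $m_j(x)$ above, a minimal sketch (illustrative only; the helper name truncated_power_basis, the knot placement, and the cubic degree are arbitrary choices):

    import numpy as np

    def truncated_power_basis(x, knots, degree=3):
        """Columns: x, x^2, ..., x^degree, then (x - t)_+^degree for each
        knot t, mirroring the spline form of m_j(x) displayed above."""
        x = np.asarray(x, dtype=float)
        cols = [x**d for d in range(1, degree + 1)]
        cols += [np.clip(x - t, 0.0, None)**degree for t in knots]
        return np.column_stack(cols)

    # Example: recover a smooth curve by least squares on the basis.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 300)
    y = np.sin(2*np.pi*x) + rng.normal(scale=0.2, size=300)
    B = np.column_stack([np.ones_like(x),
                         truncated_power_basis(x, np.linspace(0.1, 0.9, 9))])
    coef = np.linalg.lstsq(B, y, rcond=None)[0]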
3. Diffuse interaction models.

As discussed in the previous section, some problems arise when fitting a pairwise interaction model, especially with a large number of risk factors. For instance, high blood pressure is a warning signal for health problems, and there are many risk factors which may contribute to it:

- Age: the risk of high blood pressure increases as you get older.
- Gender: women are more likely to develop high blood pressure after menopause.
- Family history: high blood pressure tends to run in families.
- Body weight: the greater your body mass, the greater the risk of high blood pressure.
- Tobacco use: the chemicals in tobacco can increase the risk of high blood pressure.
- Sodium intake: too much sodium can lead to increased blood pressure.
- Excessive alcohol: heavy drinking can damage your heart.
- Stress: high levels of stress can lead to a (temporary) increase in blood pressure.

For all the risk factors listed except gender, a larger value means more risk of high blood pressure, and there might be synergistic interactions among those risk factors. How can one depict the magnitude of the synergism? As proposed in Gustafson et al. (2005), the diffuse interaction model is defined as
$$g(\mathbf{x}) = \beta_0 + \left\{\sum_{j=1}^p (\beta_j x_j)^{\lambda}\right\}^{1/\lambda}, \qquad (1.3)$$
with $\beta_j \ge 0$, $x_j \ge 0$, $\lambda > 0$. It is easy to verify that $\lambda < (>)\ 1$ leads to
$$\frac{\partial^2 g(\mathbf{x})}{\partial x_i \partial x_j} > (<)\ 0.$$
That is, $\lambda < (>)\ 1$ means synergistic (antagonistic) interaction. Clearly the magnitude of $\lambda$ is a measure of the degree of synergism/antagonism, and of the curvature of the surface as well. Note that when $\lambda = 1$, the above regression equation reduces to an additive linear regression model. Note also that $g(x_j, \mathbf{x}_{(j)} = 0) = \beta_0 + \beta_j x_j$, where $\mathbf{x}_{(j)} = (x_1,\dots,x_{j-1},x_{j+1},\dots,x_p)$ and $x_j = 0$ means the absence of the jth risk factor. That is, no matter what $\lambda$ is, $\beta_j$ is the increase in the response variable associated with a unit change in $x_j$ in the absence of all the other risk factors; the interpretation of $\beta_j$ thus differs from that under the linear regression model. The model is an increasing function, hence in the absence of all risk factors, that is, $x_j = 0$, $j = 1,\dots,p$, $\beta_0$ stands for the smallest expected response (usually the risk of some disease). Therefore, we may rewrite the diffuse interaction models in a more general form as
$$g(\mathbf{x}) = \beta_0 + \|\mathbf{x}\|, \qquad (1.4)$$
where $\|\cdot\|$ is a norm in $\mathbb{R}^p$. Even more generally, to get a regression function that is not monotone increasing, we may consider
$$g(\mathbf{x}) = \beta_0 + \|\mathbf{x} - \mathbf{a}\|,$$
where a is a vector of location parameters, standing for the values of the risk factors that lead to the smallest expected response. The diffuse interaction models (1.3) are the special class of (1.4) with the choice of norm
$$\|\mathbf{x}\|_{\beta} = \left\{\sum_{j=1}^p (\beta_j x_j)^{\lambda}\right\}^{1/\lambda}, \qquad \beta_j \ge 0,\ x_j \ge 0,\ \lambda > 0. \qquad (1.5)$$
Here the $\beta_j$'s are inverse scale parameters. Naturally, $\|\mathbf{x}\|_{\beta}$ can be interpreted as the distance between x and 0. More general classes of norms can be used.

(a) If the variables $X_j$ interact in different directions, one may want to partition those variables according to the direction of the interactions among them, that is,
$$g(\mathbf{x}) = \beta_0 + \sum_{j \in \mathrm{ADD}} \beta_j x_j + \left\{\sum_{j \in \mathrm{SYN}} (\beta_j x_j)^{\lambda_1}\right\}^{1/\lambda_1} + \left\{\sum_{j \in \mathrm{ANT}} (\beta_j x_j)^{\lambda_2}\right\}^{1/\lambda_2},$$
with $\beta_j \ge 0$, $x_j \ge 0$ for all j and $0 < \lambda_1 < 1 < \lambda_2$. It is easy to verify that for any pair $j < k$ in SYN (ANT),
$$\frac{\partial^2 g(\mathbf{x})}{\partial x_j \partial x_k} > (<)\ 0.$$

(b) Assuming there is a nesting of groups of variables, one might use the 2-level nested model
$$g(\mathbf{x}) = \beta_0 + \left[\left\{\sum_{j \in S_1} (\beta_j x_j)^{\lambda_1}\right\}^{\lambda/\lambda_1} + \left\{\sum_{j \in S_2} (\beta_j x_j)^{\lambda_2}\right\}^{\lambda/\lambda_2}\right]^{1/\lambda}, \qquad (1.6)$$
where $\lambda_1 < 1$ and $\lambda_2 > 1$ for a synergistic set $S_1$ and an antagonistic set $S_2$, with some interaction between the two sets. This function can also be extended to multiple levels of nesting. The second derivative of $g(\mathbf{x})$ in (1.6) is now more complicated. For $\lambda > 1$, for any pair (j, k) with $j \in S_1$, $k \in S_2$, the second derivative is negative; for $\lambda < 1$, for any such pair it is positive. If $1 < \lambda < \lambda_2$, for any pair (j, k) in $S_2$ the second derivative is negative; if $\lambda_1 < \lambda < 1$, for any pair (j, k) in $S_1$ the second derivative is positive.

The following example illustrates one scenario where one might consider using (1.6). Suppose $X_1, X_2, Z$ are the predictors relating to the response variable Y, but Z cannot be measured and is replaced by two surrogate variables $X_3$ and $X_4$. In this case, one might use
$$g(x_1, x_2, x_3, x_4) = \left\{(\beta_1 x_1)^{\lambda} + (\beta_2 x_2)^{\lambda} + (\beta_3 x_3 + \beta_4 x_4)^{\lambda}\right\}^{1/\lambda}.$$
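A minimal numerical sketch of the diffuse interaction model (1.3) (illustrative only; the helper names g_diffuse and mixed_diff and the parameter values are arbitrary), checking the sign of the mixed second derivative by finite differences:

    import numpy as np

    def g_diffuse(x, beta0, beta, lam):
        # Diffuse interaction regression function (1.3).
        return beta0 + np.sum((beta * x)**lam)**(1.0/lam)

    def mixed_diff(f, x, i, j, h=1e-4):
        # Finite-difference approximation to d^2 f / dx_i dx_j.
        e = np.eye(len(x))
        return (f(x + h*e[i] + h*e[j]) - f(x + h*e[i])
                - f(x + h*e[j]) + f(x)) / h**2

    x = np.array([1.0, 2.0, 0.5])
    beta = np.array([0.8, 1.2, 0.6])
    for lam in (0.5, 1.0, 2.0):
        f = lambda z: g_diffuse(z, 0.0, beta, lam)
        print(lam, mixed_diff(f, x, 0, 1))

The printed mixed derivative is positive for $\lambda = 0.5$ (synergy), essentially zero for $\lambda = 1$ (additivity), and negative for $\lambda = 2$ (antagonism), matching the characterization above.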
1.3 Outline of thesis

The following chapters are organized as follows. In Chapter 2, we study the consequences of fitting an additive regression model to data generated from a pairwise interaction model. We find that under some particular situations, the average effect estimates based on the misspecified model can still be consistent for the true values. Further, under the framework of spline regression, we investigate the consequences of model misspecification through failing to include interaction terms in the model when such terms exist. In Chapter 3, we introduce diffuse interaction models as an alternative to pairwise interaction models when the number of risk factors of interest is rather large, and we compare their ability to detect interactions with that of a pairwise interaction model. In Chapter 4, we propose an MCMC algorithm to estimate the parameters of the diffuse interaction model introduced in Chapter 3. Further, we propose other MCMC algorithms for more general versions of the diffuse interaction model, obtained by relaxing the assumption that all the risk factors interact in the same way, either synergistically or antagonistically. In Chapter 5, we summarize the results of the above three chapters and discuss some possible problems to consider in the future.

Chapter 2
Average effects for regression models with misspecifications

How bad is the estimation when the real relationship between the response variable and the predictors (i.e., risk factors in epidemiological studies) does involve interactions, but a model without interactions is fitted? More specifically, if the actual data generating mechanism is
$$Y = \sum_{j=1}^p h_j(X_j) + \sum_{1 \le i < j \le p} h_{ij}(X_i, X_j) + \epsilon, \qquad (2.1)$$
what will the result be if we fit the additive model
$$Y = \sum_{j=1}^p h_j(X_j) + \eta, \qquad (2.2)$$
where both $\epsilon$ and $\eta$ denote random errors, postulated to follow normal distributions?

2.1 Average effect

To evaluate the performance of the estimation under misspecified models, we apply the idea of the average effect, which is proposed in Gelman and Pardoe (2007) and also in Gustafson et al. (2005). The reasons for introducing the average effect are as follows.

(a) For comparing models with different parametric forms, different sets of parameters have different interpretations, so it is hard to explain the difference between the parameters from different models. For example, postulate the parametric forms of models (2.1) and (2.2) as below. In model (2.1): $h_j(X_j) = \beta_j X_j$, $h_{ij}(X_i, X_j) = \beta_{ij} X_i X_j$. In model (2.2): $h_j(X_j) = \alpha_j X_j$. Then the $\beta_{ij}$'s in model (2.1) are the parameters describing the pairwise interactions between the predictors, while no such parameters appear in model (2.2). It is hard to interpret the meaning of the difference between $\beta_j$ and $\alpha_j$. More precisely, the expected change in the response variable caused by a one-unit change of $X_j$, keeping the other predictors unchanged, is $\beta_j + \sum_{i \ne j} \beta_{ij} X_i$ in model (2.1), while in model (2.2) it is $\alpha_j$.

(b) More generally, if the two models of interest are quite different, there may be no common parameters to compare. For example, say one model is a linear regression model while the other is a nonparametric regression model. Then it is impossible to compare the coefficient estimates from the former model with the estimate of a function/curve from the latter.

To handle the difficulties mentioned above, we need a quantity which is associated with something in common among different models. The average effect is such a quantity, defined as follows.
If the predictor $X_j$ takes values in a continuous space, then the predictive effect of $X_j$ is defined as
$$\Delta_j(\mathbf{x};\theta) = \frac{\partial\, E(Y \mid X_j = x_j, X_{(j)} = x_{(j)};\theta)}{\partial x_j},$$
where $X_{(j)} = (X_1,\dots,X_{j-1},X_{j+1},\dots,X_p)$, i.e., the random vector comprising all the explanatory variables except the jth one, and $\theta$ denotes the parameter vector in the model. If the predictor $X_j$ can take only finitely many values, then the definition of its predictive effect is
$$\left(x_j^{(2)} - x_j^{(1)}\right)^{-1}\left\{E(Y \mid X_j = x_j^{(2)}, X_{(j)} = x_{(j)};\theta) - E(Y \mid X_j = x_j^{(1)}, X_{(j)} = x_{(j)};\theta)\right\},$$
where $x_j^{(1)}, x_j^{(2)}$ are a pair of different possible values of $X_j$. The predictive effect reflects the change in $E(Y \mid X)$ associated with a small change in the jth predictor $X_j$, conditioned on a specific value $X_{(j)} = x_{(j)}$.

The reason to use the notation $\Delta_j(\mathbf{x};\theta)$ and not $\Delta_j(x_{(j)};\theta)$ is that in general the predictive effect is a function of $x_j$ as well. For example, in the quadratic model with continuous predictors, that is,
$$E(Y \mid X) = \beta_0 + \sum_{j=1}^p \beta_j X_j + \sum_{j=1}^p \beta_{jj} X_j^2 + \sum_{i<j} \beta_{ij} X_i X_j,$$
the predictive effect of $X_j$ is $\beta_j + 2\beta_{jj} x_j + \sum_{i \ne j} \beta_{ij} X_i$. However, in some special situations, such as model (2.10) and model (2.11) appearing in the next section, $\Delta_j(\mathbf{x};\theta)$ does not depend on $x_j$.

Based on the definition of the predictive effect, we can define different versions of the average effect as the expected value of $\Delta_j(\mathbf{x};\theta)$ with respect to different distributions.

Definition 1: Averaging the predictive effect over the joint distribution of $X_{(j)}$, all the predictors except $X_j$, the average effect of the jth predictor is defined as
$$\delta_j(x_j;\theta) = E_{X_{(j)}}\{\Delta_j(X_j = x_j, X_{(j)};\theta)\}. \qquad (2.3)$$

Definition 2: Averaging the predictive effect over the joint distribution of all predictors $X_1,\dots,X_p$, the average effect of the jth predictor is defined as
$$\delta_j(\theta) = E_X\{\Delta_j(X;\theta)\}. \qquad (2.4)$$

Definition 3: Averaging the predictive effect over the conditional distribution of $X_{(j)} \mid X_j$, the average effect of the jth predictor is defined as
$$\delta_j(x_j;\theta) = E_{X_{(j)} \mid X_j}\{\Delta_j(X_j = x_j, X_{(j)};\theta)\}. \qquad (2.5)$$

In general the three definitions are not identical to each other. But in the special cases where the predictive effect of $X_j$ does not depend on the value of $x_j$, as in model (2.10) and model (2.11), the first two definitions coincide, because both are obtained by averaging over the joint distribution of $X_{(j)}$. In the following sections, we use Definition 2 unless specified otherwise.

Note that the predictive effect of $X_j$ is based on the conditional distribution of $Y \mid X$, while the average effect of $X_j$ is defined with respect to the joint distribution of $(Y, X_{(j)})$ or $(Y, X)$. We should also keep in mind that the idea of the average effect is not confined to the regression context. For example, Xu and O'Quigley (2000) and Gustafson (2007) apply this concept in survival analysis by averaging the hazard function over the joint distribution of $(T, X)$, where T is the failure time.
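As a concrete rendering of Definition 2, a minimal Monte Carlo sketch for the quadratic model displayed earlier in this section (illustrative only; the coefficient values and the equi-correlated normal design are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    p = 3
    beta = np.array([1.0, -0.5, 0.3])            # main effects
    beta_sq = np.array([0.2, 0.0, 0.1])          # pure quadratic terms
    beta_int = {(0, 1): 0.4, (0, 2): -0.2, (1, 2): 0.6}

    # Equi-correlated (rho = 0.5) standard normal predictors, so E(X) = 0.
    X = rng.multivariate_normal(np.zeros(p), 0.5*np.eye(p) + 0.5, size=200_000)

    # Predictive effect of X_1: beta_1 + 2*beta_11*X_1 + sum_{i != 1} beta_1i*X_i.
    j = 0
    pred_effect = beta[j] + 2*beta_sq[j]*X[:, j]
    for (i, k), b in beta_int.items():
        if j == i:
            pred_effect = pred_effect + b*X[:, k]
        elif j == k:
            pred_effect = pred_effect + b*X[:, i]

    # Definition 2: average over the joint distribution of all predictors.
    print("average effect of X_1:", pred_effect.mean())   # about beta[0], since E(X) = 0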
B y allowing general forms of functions, even nonparametric forms, involved in the components of S and T, we do have a rather broad possibilities of hj and hij in (2.2) and (2.1). In the forthcoming sections, we wi l l see how this general setting applies in different regression models. Hence, models (2.2) and (2.1) can be rewritten as follows, respectively. Y = S'cx + e, Y = T'P + e, where e follows N(0,o2) and are independent of X . Denote by 5j(/3), 5*(/3) the average effects of the j - t h predictor under the "true" model and fitted one respectively. 18 Chapter 2. Average effects for regression models with misspecihcations Assuming X/s are continuous and T, S are differentiable, let Tj = {j^Ti(Xi,-• • ,Xp),... ,-^Tt(Xi,... ,Xp^ , = (^QY-SI(XI,...,XP),...,-^-SS(XI,...,XP) So that the true average effect of Xj, denoted by Sj(f3), is [E(Tj)]'/3. Now let n be a sample size with (xix,..., xip, yt) being the observations of predictors and response variable for the ith subject, (xa,..., xip),i = 1,. . . ,n are assumed to be iid replications of (Xi,..., Xp) and e i , . . . , e„ are iid A^(0, cr2). Let T be the n x t design matrix with (i, k) element equal to Tk(Xn,..., Xip). Let Tj benxt design matrix with (i, k) element equal to j^-T^Xn,... ,Xip). Assuming T is of full column rank, we have 3 = ( T ' T ) _ 1 T ' Y , E(Tj)=n-1fjlnxi. where Y = (yuy2, • • • ,yn)'- Hence the estimate of the average effect of Xj is obtained by ^03) = n _ 1 l / f i ( T ' T ) - 1 T ' Y . W i t h • Y = T /3-+e, where e = ( e i , . . . , en)', we get <5~(/3) = n-ll'Tj(3 + rf1\%tT'T)-1re. 19 Chapter 2. Average effects for regression models with misspecihcations Therefore, we have v ^ ( W ) - W ) ) = v ^ ^ ^ ' T j - E ^ O ' ^ + n-^'f^T'T/n)- 1^ 1/ 2^. The first term in the right side of the above equation can be rewritten as where (¥,);. denotes the ith row of Tj. Since (Tj)j. are i.i.d., the above term converges in distribution to -/V(0, /3'Var (Tj)(3) by the multivariate central limit theorem. With the conditional variance identity and the assumption that e is independent of X, we have Var(T,'€i) = E(Vai(Ti.,£i|*i))-+Vai(E(Ti.'ci|Xi)) = E ( T T > 2 . Thus, by the multivariate central limit theorem, n n-1'2 ]T Ti.'a ^N(0,E (T'T)<J2) . Since the (Tj)j. are i.i.d., by the strong law of large numbers, we have n-^-^ECT,-)}'. If E(TT) is invertible, then ( n - W f 1 (E(TT')}" 1 . 20 Chapter 2. Average effects for regression models with misspecihcations W i t h the above results, a straightforward conclusion can be summarized as the following. R e s u l t 1: Assuming the existence of { E ( T T ' ) } " 1 and e is independent of X , v^(W)-W)) °N(0,Vj(p)), where Vj(3) = <x2[E (Tj)}'[E ( T T ' r ' E ( T , ) + /3 'Var (Tj)3. (2.6) Similarly, in the fitted model, let § be the n x s design matr ix wi th (i, k) element equal to Sk(Xn,..., S i p ) . Let be n x s design matr ix wi th (i,k) element equal to jfiqSk{Xn,..., XiP). Assuming that S is of full column rank, the estimate of the average effect of Xj wi th the "misspecified" model is 6*(J3) = • n - 1 l / S j ( S ' S ) - 1 S ' Y . Assuming E(SS') is invertible, ( S ' S J ^ S ' Y { E ( S S ' ) } _ 1 E ( S Y ) { E ( S S ' ) } - 1 { E ( S T ' ) } / 9 a*. Thus, in the l imit as n ^ oo, 5*{3) = lim<5}(/3) = E ( S ^ ' a * . Hence, wi th Y = TB + e, we have n v ^ ( § 0 9 ) - 5*(f3)) = n"V2 £ _ E (§,•) ') ( S ' S ^ S ' Y + i=l ( E (S^HS'S /n ) -V 1 / 2 S ' (T /3 - - S a , + e), (2.7) 21 a = a.s. Chapter 2. 
where $(\mathbb{S}_j)_{i\cdot}$ is the ith row of $\mathbb{S}_j$. By the strong law of large numbers,
$$n^{-1}\mathbb{S}'Y \xrightarrow{a.s.} E(SY) \left(= \{E(ST')\}\beta\right), \qquad (\mathbb{S}'\mathbb{S}/n)^{-1} \xrightarrow{a.s.} \{E(SS')\}^{-1}.$$
With the multivariate central limit theorem and the above facts, the first term in (2.7) asymptotically follows a multivariate normal distribution with mean vector 0 and covariance matrix $\alpha_*'\mathrm{Var}(S_j)\alpha_*$. Also by the multivariate central limit theorem, we have
$$n^{-1/2}\sum_{i=1}^n \mathbb{S}_{i\cdot}'\left(\mathbb{T}_{i\cdot}\beta - \mathbb{S}_{i\cdot}\alpha_* + \epsilon_i\right) \xrightarrow{d} N(0, V_*),$$
where $\mathbb{S}_{i\cdot}$ is the ith row of $\mathbb{S}$, $\mathbb{T}_{i\cdot}$ is the ith row of $\mathbb{T}$, and
$$V_* = E\left\{(\mathbb{T}_{i\cdot}\beta - \mathbb{S}_{i\cdot}\alpha_* + \epsilon_i)^2\,\mathbb{S}_{i\cdot}'\mathbb{S}_{i\cdot}\right\} = \sigma^2 E(SS') + E\left((T'\beta - S'\alpha_*)^2 SS'\right). \qquad (2.8)$$
Note that the first part is due to random error and the second part is due to model misspecification. Therefore, the asymptotic distribution of the second term in (2.7) is multivariate normal with mean vector 0 and covariance matrix $\{E(S_j)\}'\{E(SS')\}^{-1}V_*\{E(SS')\}^{-1}\{E(S_j)\}$. Immediately, the combination of the above results leads to the following result.

Result 2: Assuming the existence of $\{E(SS')\}^{-1}$ and that $\epsilon$ is independent of X,
$$\sqrt{n}\left(\hat\delta_j^*(\hat\alpha) - \delta_j^*(\beta)\right) \xrightarrow{d} N(0, v_j^*(\beta)),$$
where
$$v_j^*(\beta) = \{E(S_j)\}'\{E(SS')\}^{-1}V_*\{E(SS')\}^{-1}\{E(S_j)\} + 2\alpha_*'\mathrm{Cov}\left(S_j,\; S(T'\beta - S'\alpha_*)\right)\{E(SS')\}^{-1}E(S_j) + \alpha_*'\mathrm{Var}(S_j)\alpha_*. \qquad (2.9)$$

Remark. In some instances $T_j$ and/or $S_j$ might be constant, in which case the second term in (2.6) and the last two terms in (2.9) vanish. In particular, if the "true" regressors include pairwise products from X but the fitted model includes only linear terms from X, then the second term in (2.6) does not vanish, but the last two terms in (2.9) do.

Result 1 and Result 2 give the asymptotic distributions of the average effect estimates based on the "true" model and the "misspecified" model, respectively. Combining the two results, we can get a consistent average effect estimator from the misspecified model as long as $\delta_j^*(\beta) = \delta_j(\beta)$. Some easily-studied cases where this equality holds are shown in Result 3 later. Results 1 and 2 also make it possible to compare the efficiencies of the two estimators from the "true" model and the "misspecified" model. More discussion is given in Section 2.3.2.

One thing worth investigating here is finding a consistent estimator, under the "misspecified" model fitting, of the mean squared error $\sigma^2(\alpha) = E(Y - S'\alpha)^2$. The least squares estimator $\hat\alpha$ is a parameter vector that solves the problem
$$\min_{\alpha}\ s_n^2(\alpha) = (n-p-1)^{-1}\,\|Y - \mathbb{S}\alpha\|^2.$$
Now $s_n^2(\hat\alpha)$ may be rewritten in terms of matrices,
$$s_n^2(\hat\alpha) = \frac{n}{n-p-1}\left\{n^{-1}Y'Y - (n^{-1}Y'\mathbb{S})(\mathbb{S}'\mathbb{S}/n)^{-1}(n^{-1}\mathbb{S}'Y)\right\}.$$
Therefore, the true variances of the coefficients estimates from the fitted model are big- 24 Chapter 2. Average effects for regression models with misspecihcations ger than the reported standard errors of those estimates, which is caused by the model misspecification, the second piece in the last equality of the above equation. If no mis- specification occurred, then V* = E(SS'), then (S'S)s 2 1(a) is the consistent estimator of var (a). Let n V>a = n _ 1 2 ( ^ - S i . a ) 2 S ; . S i - . i=l then it is not hard to derive that Va is a consistent estimator of V* by a similar derivation as before. Therefore a large difference between Va and S ' S s ^ f i ) can be an evidence for model misspecification (see Whi t e (1980) for a formal test for misspecification). Hence v(at) can be consistently estimated by n ( S ' S ) " 1 jjj (Yi - Si.aj'sj.Si.j ( S ' S ) " 1 . There are other ways to get consistent estimator of v(a+) as well. One way is so-called "sandwich" method, which is based on the derivatives of log-likelihood function. Ac tua l ly it is exactly the same as the estimator used in Whi t e (1982), where Whi t e studied the asymptotic distribution of the maximum likelihood estimate in case of model misspeci- fication. A s known, the least squares estimate is the same as the maximum likelihood estimate in the linear regression. Under a simple setting wi th only two predictors in- volved, we can also apply the methodology in Whi t e (1982) and get the same result as Result 2. Details are shown in Appendix II. 2.3 Linear regression In this section, we assume that the response variable Y depends linearly on the predictors Xi's and that Y given X = (X\,..., Xp) is normally distributed wi th a constant variance. 25 Chapter 2. Average effects for regression models with misspecihcations Suppose the true relationship between Y and XiS is Y | X ~ TV (A, + BXXX + ••• + BpXp + BX2XXX2 + ••• + +(3P-1,PXP-1XP, a 2 ) , (2.10) while the fitted model is Y | X . ~ A T ( a 0 + + • • • + a p X p , r 2 ) . (2.11) Here 3 = (P0, Bu . . . , Bp, Bn,..., PP-\,P), a = ( a 0 , a u . . . , a.p). B y Definition 2 of average effect, it is direct to derive the following average effects based respectively on (2.10) and (2.11): Si{P) = Bi + Y^BiiE(Xi), (2.12) 5*(3) = a j , , (2.13) Naturally, as an informative summary we study the estimates of the average effect defined as n Sj(3) = n-15^Aj(xi{,-),3) = 4- + SA^i, i - l i^j n i=l where X;y) is the z-th observation of X y ) and is the sample mean of the observations of Xi. The estimates of parameters 3 and a are a l l least squares estimates. Note that in the normal linear regression context, the maximum likelihood estimates of the parameters are the same as the least squares estimates. In the following three sections, we w i l l compare the estimators of two average effects. 26 Chapter 2. Average effects for regression models with misspecihcations 2.3.1 Difference in large sample limits of the two average effect estimators Without loss of generality, we can think about centering predictors in the fitted model, that is, E (Y | X ) = a0 + a i X i + • • • + apXp, where Xi = Xi — E ( X j ) , i.e., E ( X ) = 0. Note that the estimates of a \ , . . . , a p are unchanged by centering. 
2.3.1 Difference in large sample limits of the two average effect estimators

Without loss of generality, we can think of centering the predictors in the fitted model, that is,
$$E(Y \mid X) = \alpha_0 + \alpha_1 X_1 + \cdots + \alpha_p X_p,$$
where each $X_i$ is replaced by $X_i - E(X_i)$, i.e., $E(X) = 0$. Note that the estimates of $\alpha_1,\dots,\alpha_p$ are unchanged by centering. The large-sample limit of $\hat\alpha$, denoted by $\alpha_*$, satisfies
$$E\left\{(1, X_1,\dots,X_p)'(1, X_1,\dots,X_p)\right\}\alpha_* = E\left\{Y\,(1, X_1,\dots,X_p)'\right\}.$$
Since the equation is symmetric in the p predictors, it suffices to determine the relationship between $\alpha_{1*}$ and $\beta$. By Cramer's rule, we can solve the equation above and get
$$\alpha_{1*} = \beta_1 + \frac{\sum_{k=1}^p \Sigma_{k1}\sum_{i<l}\beta_{il}\,E(X_iX_lX_k)}{\det(\Sigma)}, \qquad (2.14)$$
where $\Sigma = (\sigma_{ij})_{p\times p}$ is the covariance matrix of $(X_1,\dots,X_p)$ and $\Sigma_{k1}$ is the cofactor of $\Sigma$ corresponding to element $\sigma_{k1}$. Details of the proof of (2.14) are in Appendix I. Moreover, based on (2.3), the expression for the average effect, the above results can be summarized as below.

Result 3. Assume that the variables are centred, i.e., $E(X) = 0$. Let $T = (1, X', W')'$, where $W = (X_1X_2, X_1X_3, \dots, X_{p-1}X_p)'$, i.e., the true relationship involves pairwise interactions. Also let $S = (1, X')'$, i.e., the interactions are undetected or ignored in the modelling process. Then:

(i)
$$\delta_j^*(\beta) = \delta_j(\beta) + \frac{\sum_{k=1}^p \Sigma_{kj}\sum_{i<l}\beta_{il}\,E(X_iX_lX_k)}{\det(\Sigma)}, \qquad (2.15)$$
where $\Sigma$ is the covariance matrix of X, and $\Sigma_{kj}$ is the cofactor of the determinant corresponding to element $\sigma_{kj}$. Consequently:

(ii) if X has a multivariate normal distribution, or if the components of X are independent, then $\delta_j^*(\beta) - \delta_j(\beta) = 0$.

In particular, when p = 2, the connection equation (2.15) simplifies to
$$\delta_1^*(\beta) = \delta_1(\beta) + \beta_{12}\,\frac{\mathrm{Var}(X_2)\,E(X_1^2X_2) - \mathrm{Cov}(X_1,X_2)\,E(X_1X_2^2)}{\mathrm{Var}(X_1)\mathrm{Var}(X_2) - \mathrm{Cov}^2(X_1,X_2)}. \qquad (2.16)$$
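A quick Monte Carlo check of (2.16) (illustrative only; the construction $X_2 = X_1^2 - 1 + \text{noise}$ makes $E(X_1^2X_2)$ nonzero while keeping $\mathrm{Cov}(X_1,X_2) = 0$ and $E(X_2) \approx 0$):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1_000_000
    x1 = rng.normal(size=n)
    x2 = x1**2 - 1 + rng.normal(size=n)
    b1, b2, b12 = 1.0, 0.5, 0.8
    y = b1*x1 + b2*x2 + b12*x1*x2 + rng.normal(size=n)

    # Large-sample limit of the additive-fit coefficient of x1.
    S = np.column_stack([np.ones(n), x1, x2])
    alpha1 = np.linalg.lstsq(S, y, rcond=None)[0][1]

    # Right-hand side of (2.16).
    c = np.cov(x1, x2)[0, 1]
    num = np.var(x2)*np.mean(x1**2*x2) - c*np.mean(x1*x2**2)
    den = np.var(x1)*np.var(x2) - c**2
    delta1 = b1 + b12*np.mean(x2)              # true average effect, per (2.12)
    print(alpha1, delta1 + b12*num/den)        # the two agree, per (2.16)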
Of course transformations applied to predictor and response variables are an important part of regression modelling in practice, and the desirability of transforming a response variable to approximate normality is clear. Transformations on predictors, however, are typically 29 Chapter 2. Average effects for regression models with misspecihcations argued for on the basis of compatibil i ty wi th linearity and homoscedasticity assumptions, without regard for the resulting shape of the predictor distribution. Result 3 suggests that the shape of the distribution could also be a consideration in assessing possible transformations of the predictors. However, we also find that the consistency does not hold generally, wi th the following two examples. Firs t , assume Xi follows standard normal and X2\Xi follows Poisson(c\\Xi\). Since when p = 2, transformation of (2.16) gives the quantity Pu (d~l(3) — 6i(3)), nothing to do wi th true values of 3 and only depending on the distr ibution of X . Moreover, it can also somehow indicate the discrepancy of the two large sample limits. Note that if Pn = Pi = P2, under the setting of Result 3, this quantity is just the relative bias, i.e., j^i"1 [$i(3) — 5i(3)]\. Based on the property of the mean and variance of Poisson distribution, by some algebra we can derive that Solve Ci by setting the above to be 1. Note the fact that p — E (XiX2) - E (X\)E (X2) = C i E ( X 1 | X 1 | ) = 0.. This is caused by the fact that is an odd function and the integral interval is symmetric about zero. Hence, based on (2.16), we have sm-Si(d) P12 = E{X2X2} 0.718. 30 Chapter 2. Average effects for regression models with misspecihcations The last equality follows from 1 poo E(\Xf\) = 2^= \ x3e~x2/2dx V=x2 1 J^ye-yl2dy = 2sfl. 2ir Second, suppose equi-correlated predictors, each following log-normal distribution, defined as A J VV2(T) ' 3 i'---'P> where z = ( z u z p y ~ N (0,(1 - P)ip + P J P ) , M ( r ) = E ( e ^ ) = e r 2 / 2 ! V(T) = V a r ( e T ^ ) = e 2 r 2 - e r 2 . Here Ip is p-dimensional identity matr ix and J P is p-dimensional square matr ix wi th a l l p2 elements equal to 1. Note that as a —> 0, Xj converges to a standard normal. So that larger r corresponds to more non-normality and larger p corresponds to more depen- dence among predictors. Par t i t ion the (true) regression coefficients as (3 — (0o, (3'MJ P'I)' according to intercept, main, and interaction effects respectively. Using Result 3 we can compare the vector of true average effects 5(0) = (3M (since E ( X ) = 0) to the large- sample l imit of estimated effects from a model without interactions, i.e., 5*(pi). F rom Result 3 we know the relative bias is zero if p = 0 or r = 0, so the question of interest is how fast the bias grows as the components of X becomes both correlated and skewed. For p = 10 and selected values of (/3M,/37), Figure 2.1 depicts the relative bias as p and 31 Chapter 2. Average effects for regression models with misspecihcations T vary, where This i l lustration is based on fixing the direction of 3M and the relative length of 3j compared to 8M, i.e., 3M oc l p and \\3j\\ — \\3M\\. For convenience, we index the components of 3{ by (u, v) pair, where u < v. In Figure 2.1, four choices for the direction of 8j are considered: (a) BitUV oc 1, which involves a l l predictor pairs interacting positively. (b) Pi,Uv oc (—l)v~u+1, which means about half the pairs interact positively and the other half negatively. 
(c) \(\beta_{I,uv} \propto I\{|v\ (\mathrm{mod}\ p) - u| = 1\}\), which means only a few positive interactions.

(d) \(\beta_{I,uv} \propto (-1)^{I\{|v\ (\mathrm{mod}\ p) - u| = 1\}}\), which means a majority (minority) of positive (negative) interactions.

In each case \(RB_1\), the relative bias in estimating the average effect of \(X_1\), is calculated. Note that each of these choices except (b) involves sufficient symmetry that the relative bias is the same for estimating all p average effects, i.e., \(RB_1 = \cdots = RB_p\). We can see that the bias increases as \(\tau\) grows, which means that more non-normality causes larger bias. Moreover, the size of the bias also increases as \(\rho\) increases. Therefore, more non-normality of, and more dependence among, X give larger bias. The general impression from Figure 2.1 is that very large biases are possible when estimating average effects, if the predictors are substantially skewed and dependent. Note also that slight skewness and strong correlation tend to induce a bigger bias than strong skewness and slight correlation. Another thing worth noticing is that the relative biases are considerably smaller in panel (b) compared to the other three cases. This may represent a 'cancellation effect' between positive and negative interaction terms. Moreover, we have similar magnitudes of relative bias in the other three cases, whose common property is the "overwhelming" strength of interaction in one direction over the other.

Figure 2.1: Magnitude of relative bias as a function of (ρ, τ) for a multivariate log-normal distribution of p = 10 predictors. The cases (a) through (d) are as described in the text.

2.3.2 Relative efficiency of the two average effect estimators

A natural question of interest is which of the two estimators \(\hat\delta_j(\hat\beta)\) and \(\hat\delta_j^*(\hat\alpha)\) is more efficient, which amounts to comparing \(V_j(\beta)\) in Result 1 and \(V_j^*(\beta)\) in Result 2. We take the ratio of the latter to the former as the relative efficiency, with values larger than one representing the inefficiency occurring as a result of model misspecification. We apply Results 1 and 2 directly to the linear regression context, where an additive model is fitted to data generated by a pairwise interaction model. Due to the symmetry in the p predictors under both models, we take the average effect estimator of \(X_1\) as an example. Now we have

\[ T = (1, X_1, \ldots, X_p, X_1X_2, \ldots, X_{p-1}X_p)', \quad T_1 = (0, 1, 0, \ldots, 0, X_2, \ldots, X_p, 0, \ldots, 0)', \]
\[ S = (1, X_1, X_2, \ldots, X_p)', \quad S_1 = (0, 1, 0, \ldots, 0)'. \]

With Results 1 and 2, we have, as \(n \to \infty\),

\[ n\mathrm{Var}\{\hat\delta_1^*\} \to \sigma^2\{E(S_1)\}'\Sigma_S^{-1}\{E(S_1)\} + \{E(S_1)\}'\Sigma_S^{-1}E\{(T'\beta - S'\alpha_*)^2 SS'\}\Sigma_S^{-1}\{E(S_1)\}, \qquad (2.17) \]

\[ n\mathrm{Var}\{\hat\delta_1\} \to \sigma^2\{E(T_1)\}'\Sigma_T^{-1}\{E(T_1)\} + (\beta_{12}, \ldots, \beta_{1p})\mathrm{Var}(X_2, \ldots, X_p)(\beta_{12}, \ldots, \beta_{1p})', \qquad (2.18) \]

where \(\Sigma_S = E(SS')\) and \(\Sigma_T = E(TT')\). Based on these expressions it is clear that the difference between the two unconditional variances depends on the true value of the parameters \(\beta\). Generally speaking, therefore, the comparison of the two variances depends on the true values of the coefficients of those interaction terms involving \(X_1\). In particular, we are interested in situations where the additive model yields a consistent estimator even when the true relationship involves all pairs of interactions.
More precisely, if X has independent components or a multivariate normal distribution (as discussed in Result 3), what is the relative efficiency? We take the former case first. Assuming \(E(X_j) = 0\), \(\mathrm{Var}(X_j) = 1\), \(j = 1, \ldots, p\), and independence of \(X_1, \ldots, X_p\), we have

\[ \Sigma_S = I_{p+1}, \qquad \Sigma_T = \begin{pmatrix} \Sigma_S & 0 \\ 0 & \Sigma_{22} \end{pmatrix}, \]

where \(\Sigma_{22}\) is the covariance matrix of \((X_1X_2, \ldots, X_{p-1}X_p)'\). Thus the asymptotic variance of the estimator from the right model is

\[ \sigma^2(0,1,0,\ldots,0)\Sigma_T^{-1}(0,1,0,\ldots,0)' + \sum_{j=2}^{p}\beta_{1j}^2, \]

whose first term is exactly the first term in (2.17). Note the fact that

\[ T'\beta - S'\alpha_* = \sum_{1\le i<j\le p}\beta_{ij}X_iX_j, \qquad \text{where } \alpha_* = \Sigma_S^{-1}E(ST')\beta = (I_{p+1}, 0)\beta. \]

Therefore, the second term in (2.17) is

\[ E\Big\{\Big(\sum_{1\le i<j\le p}\beta_{ij}X_iX_j\Big)^2 X_1^2\Big\} = \sigma^2\big(3\|\gamma_{1\cdot}\|^2 + \|\gamma_{-1\cdot}\|^2\big), \]

where \(\gamma = \sigma^{-1}\beta_I\) and \((\gamma_{1\cdot}, \gamma_{-1\cdot})\) is a partition of \(\gamma\) into those interaction terms which do and do not involve \(X_1\), respectively. Hence the ratio of the two asymptotic variances is

\[ \frac{\sigma^2 + E\{(\sum_{1\le i<j\le p}\beta_{ij}X_iX_j)^2X_1^2\}}{\sigma^2 + \sum_{j=2}^p \beta_{1j}^2} = \frac{1 + 3\|\gamma_{1\cdot}\|^2 + \|\gamma_{-1\cdot}\|^2}{1 + \|\gamma_{1\cdot}\|^2}. \]

Therefore the asymptotic variance of the estimator based on the "misspecified" model is larger, consistent with the intuition that model misspecification induces extra uncertainty in \(\hat\delta_1^*(\hat\alpha)\).

Now the question of interest is how large or small the relative efficiency would be if there is some dependence among the components of X. Assume that X has a multivariate normal distribution with mean 0 and an equi-correlated covariance matrix, that is, \(X \sim N_p(0, (1-\rho)I_p + \rho J_p)\). From (2.17) and (2.18), for given p, it is easy to verify that the relative efficiency depends only on \(\rho\) and \(\gamma = \sigma^{-1}\beta_I\), the latter describing the interaction 'signal' relative to noise. For p = 10, Figure 2.2 shows the relative efficiency as a function of \(\rho\) and \(\|\gamma\|\), where two particular directions for \(\beta_I\), cases (a) and (c) defined in Section 2.3.1, are considered. We can see that if the true interactions among X are rather "sparse", as in choice (c), the efficiency loss is much smaller than with the "dense" interaction structure of case (a). Also note that the misspecified-model estimator can be very inefficient with strongly correlated predictors, but the efficiency loss tends to be slight with independent predictors. As a related point, the rate at which the efficiency loss grows with the strength of the underlying interaction signal is governed by the strength of correlation. A numerical sketch of the independent-predictor ratio derived above is given below.

Figure 2.2: Relative efficiency of the "misspecified"-model average-effect estimator as a function of (ρ, ||γ||), for the two cases described in the text.
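For the independent-predictor case, the limiting ratio derived above is simple to evaluate. A small sketch, assuming unit-variance independent predictors and illustrative values for β_I (the "sparse" pattern here is our stand-in loosely modelled on case (c)):

```python
# Limiting relative efficiency (1 + 3||g1||^2 + ||g-1||^2) / (1 + ||g1||^2)
# for independent standardized predictors; g = beta_I / sigma, p = 10.
import numpy as np

p, sigma = 10, 1.0
pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]

def rel_eff(beta_I):
    g = np.asarray(beta_I) / sigma
    in1 = np.array([0 in pr for pr in pairs])    # pairs involving X_1
    g1, gm1 = g[in1], g[~in1]
    return (1 + 3 * g1 @ g1 + gm1 @ gm1) / (1 + g1 @ g1)

dense = 0.3 * np.ones(len(pairs))                              # all pairs interact
sparse = np.array([0.3 if j - i == 1 else 0.0 for i, j in pairs])  # few pairs
print(rel_eff(dense), rel_eff(sparse))   # efficiency loss larger in dense case
```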
2.4 Nonparametric regression and smoothing

The above analyses in Section 2.3 are based on a simple scenario, where a parametric model that is linear in the continuous predictors \(X_i\) is postulated. What if the linear regression model is not appropriate? Fitting a linear model to data that actually contain a nonlinear structure can give very misleading results, even worse than useless ones. A more general alternative to linear regression is the nonparametric regression model. The distinguishing property of nonparametric regression is that there is no (or very little) a priori knowledge about the form of the true regression function, which allows the class of functions the model can represent to be very broad.

When should we use a nonparametric regression model? In many real problems, there is no information from the data, nor scientific knowledge, to suggest a parametric form, so a parametric model is often specified from a casual graphical summary of the data or chosen for convenience (for example, linear regression models). A predetermined parametric model might be too restricted or too low-dimensional to fully model a "rich" data set containing many unexpected features. In such cases we would like a nonparametric (smoothing) approach, which offers a flexible tool for analyzing unknown regression relationships between the response variable and the predictors. The choice between parametric and nonparametric modelling also depends a great deal on sample size, especially when there are many predictors.

Smoothing methods, widely used in nonparametric modelling, deserve a respectable place in statistics. Many papers and a number of books study this topic (Silverman, 1986; Eubank, 1988; Hastie and Tibshirani, 1990; Wahba, 1990; Green and Silverman, 1994; Gu, 2002). As a matter of fact, smoothing methods provide a bridge, or compromise, between making no assumptions on the underlying process that generated the data (a purely nonparametric approach) and making very strong assumptions (a parametric approach). In the following subsections, we focus on the consequences of model misspecification omitting interactions in the context of least squares regression with smoothing.

2.4.1 Spline regression models

Spline functions are very flexible and thus are often used in smoothing regression. A spline function is a piecewise, or segmented, polynomial. More precisely, it is defined as follows.

Definition: The function \(\phi\) is a spline on [a, b] of degree D with knots \(t_1, \ldots, t_L\) (\(a < t_1 < \cdots < t_L < b\)) if \(\phi\) is a polynomial of degree D on the subintervals \([a, t_1], [t_1, t_2], \ldots, [t_L, b]\) and \(\phi\) has D − 1 continuous derivatives on [a, b]. Denote the collection of these splines by \(Sp(t_1, \ldots, t_L; D)\).

Take D = 1 for example. A spline of degree 1 is continuous and piecewise linear, with breaks in slope occurring at \(t_1, \ldots, t_L\). Based on the definition of a spline, it is not hard to show that the collection of splines of a given degree is a linear space of functions. We can therefore talk about its dimension and construct bases for the space. The dimension is the number of parameters needed to describe a member of the space, which is D + L + 1 (or D + L if the spline space is forced not to include a constant term). The number of functions in a basis is simply equal to the dimension. For simplicity and ease of understanding, we introduce the power basis. For other forms of bases the analysis discussed later is also applicable, although with more difficult computational problems.

Power Basis: Denote the power basis by \(\phi_1, \ldots, \phi_{L+D+1}\):

\[ \phi_1(x) = 1, \quad \phi_2(x) = x, \quad \ldots, \quad \phi_{D+1}(x) = x^D, \quad \phi_{D+2}(x) = (x - t_1)_+^D, \quad \ldots, \quad \phi_{D+L+1}(x) = (x - t_L)_+^D, \]

where \((x - t)_+^D\) equals \((x - t)^D\) for x > t and 0 for x ≤ t. It is easy to show that each \(\phi_j\) above is a spline and that all the \(\phi_j\)'s are linearly independent. Therefore any function h in the spline space of degree D can be written as \(h(x) = \sum_{j=1}^{D+L+1} c_j\phi_j(x)\), as illustrated in the sketch below.
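As an illustration of the power basis, the following small sketch (function names are ours, not from the thesis) builds the design matrix whose columns are \(\phi_1, \ldots, \phi_{D+L+1}\):

```python
# Truncated power basis of degree D with knots t_1 < ... < t_L.
import numpy as np

def power_basis(x, knots, degree):
    """Columns: 1, x, ..., x^D, (x - t_1)_+^D, ..., (x - t_L)_+^D."""
    x = np.asarray(x, dtype=float)
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.where(x > t, (x - t) ** degree, 0.0) for t in knots]
    return np.column_stack(cols)

# Any spline of degree D with these knots is B @ c for some coefficients c.
x = np.linspace(0, 1, 7)
B = power_basis(x, knots=[0.25, 0.5, 0.75], degree=2)
print(B.shape)   # (7, D + L + 1) = (7, 6)
```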
Thus what happens if the fitted model is misspecified? Note that the main concern now is the impact of model misspecification. Therefore, how to choose the degree of the spline (D), the number of knots (L) and the locations of these knots \((t_1, \ldots, t_L)\) is left aside for the time being, although these choices are always a concern in smoothing; we assume they have been appropriately chosen in the following studies.

We start with a regression model having only two predictors. Say the fitted model is

\[ (Y\mid X_1 = x_1, X_2 = x_2) \sim N(m_1(x_1) + m_2(x_2), \sigma^2), \]

while the "true" model is

\[ (Y\mid X_1 = x_1, X_2 = x_2) \sim N(g_1(x_1) + g_2(x_2) + g_{12}(x_1, x_2), \sigma^2), \]

where \(m_1, m_2, g_1, g_2\) are splines. Here \(g_{12}\) accounts for the interaction between the two predictors. In general, there are many plausible possibilities for the form of \(g_{12}\). For simplicity and interpretability, we use the form \(g_{12}(X_1, X_2) = X_2t_1(X_1) + X_1t_2(X_2)\) in the following, where \(t_1, t_2\) are splines.

Note that in general the basis functions for different predictors could be different. For concise notation without loss of clarity, we suppress the subscript of \(\Phi\) since each of its components is expressed as a function of \(X_i\). Suppose

\[ m_i = \Phi'(X_i)\alpha_i, \quad g_i = \Phi'(X_i)\beta_i, \quad t_i = \Phi'(X_i)\beta_i^{int}, \qquad i = 1, 2. \]

Similar to the linear regression case, without loss of generality we assume that for \(i = 1, \ldots, p\), \(E(X_i) = 0\) and \(E[\Phi(X_i)] = 0\), where \(\Phi(X_i) = (\phi_1(X_i), \ldots, \phi_{K_i}(X_i))'\) (\(K_i\) denotes the number of basis functions for \(X_i\)). Otherwise, we can replace \(X_i\) (or \(\Phi(X_i)\)) by \(X_i - E(X_i)\) (or \(\Phi(X_i) - E[\Phi(X_i)]\)), and the centering constants would be absorbed into the intercept, keeping the coefficients of the basis functions unchanged. Thus the mean function of the fitted model can be written as

\[ E(Y\mid X_1, X_2) = \alpha_0 + (\Phi'(X_1), \Phi'(X_2))\begin{pmatrix}\alpha_1 \\ \alpha_2\end{pmatrix}, \]

where \(\alpha_0\) denotes the intercept. Let \(\alpha = (\alpha_0, \alpha_1', \alpha_2')'\). Based on White (1982), we have \(\alpha_*\), the large-sample limit of \(\hat\alpha\), as the solution to

\[ E\left\{\frac{\partial \log f(Y\mid X_1, X_2, \alpha)}{\partial \alpha}\right\} = 0, \]

where f denotes the density function of the fitted model and the expectation is with respect to the true joint distribution of Y and X. Therefore, we derive that

\[ E\{(1, \Phi'(X_1), \Phi'(X_2))'(1, \Phi'(X_1), \Phi'(X_2))\}\alpha_* = E\{Y(1, \Phi'(X_1), \Phi'(X_2))'\}. \]

Note the fact that the true mean function is

\[ E(Y\mid X) = (1, \Phi'(X_1), \Phi'(X_2), X_2\Phi'(X_1), X_1\Phi'(X_2))(\beta_0, \beta_1', \beta_2', \beta_1^{int\prime}, \beta_2^{int\prime})'. \qquad (2.19) \]

Therefore we can link the above two equalities to derive that

\[ \alpha_* = (\beta_0, \beta_1', \beta_2')' + C_{11}^{-1}C_{12}(\beta_1^{int\prime}, \beta_2^{int\prime})', \qquad (2.20) \]

where

\[ C_{11} = E\{(1, \Phi'(X_1), \Phi'(X_2))'(1, \Phi'(X_1), \Phi'(X_2))\}, \]
\[ C_{12} = E\{(1, \Phi'(X_1), \Phi'(X_2))'(X_2\Phi'(X_1), X_1\Phi'(X_2))\}, \]
\[ C_{22} = E\{(X_2\Phi'(X_1), X_1\Phi'(X_2))'(X_2\Phi'(X_1), X_1\Phi'(X_2))\}. \]

We also use the idea of average effect, as in the linear scenario, to evaluate the impact of model misspecification. That is, we are interested in the average effect obtained by averaging over the joint distribution of all predictors X. Note that the predictive effect of \(X_j\) here is a function of all predictors, while it is just a function of \(X_{(j)}\) in linear regression settings. Without loss of generality, we take the average effect of \(X_1\) as an example. The large-sample limit of the average effect of \(X_1\) estimated from the fitted model is

\[ \delta_1^*(\alpha_*) = E\left\{\left[\frac{d\Phi(X_1)}{dX_1}\right]'\alpha_{1*}\right\}, \qquad (2.21) \]

while that in the "true" model is

\[ \delta_1(\beta) = E\left\{\left[\frac{d\Phi(X_1)}{dX_1}\right]'\beta_1 + X_2\left[\frac{d\Phi(X_1)}{dX_1}\right]'\beta_1^{int} + \Phi'(X_2)\beta_2^{int}\right\}. \qquad (2.22) \]

Similar to the situations considered in linear regression, we also focus on some special cases here. Say, if \(X_1\) and \(X_2\) are independent of each other, then we get \(C_{12} = 0\).
Hence, the equality (2.20) now becomes

\[ \alpha_* = (\beta_0, \beta_1', \beta_2')'. \]

That is, if \(X_1\) and \(X_2\) are independent, we get consistency of the coefficient estimate \(\hat\alpha\), and subsequently agreement between the two average effects defined by (2.21) and (2.22).

In the following we study the uncertainties of the two average effect estimates. Let \(\Phi(x_l)\) be an n by \(K_l\) matrix with \(\Phi_{ij} = \phi_j(x_{li})\), and let \(x_l\) be an n by 1 vector with components \(x_{li}\), \(l = 1, 2\), \(i = 1, \ldots, n\), \(j = 1, \ldots, K_l\). Define the sample version of \(\Sigma_T = E(TT')\) through the blocks

\[ D_{11} = n^{-1}(1, \Phi(x_1), \Phi(x_2))'(1, \Phi(x_1), \Phi(x_2)), \]
\[ D_{12} = n^{-1}(1, \Phi(x_1), \Phi(x_2))'(x_2\Phi(x_1), x_1\Phi(x_2)), \]
\[ D_{22} = n^{-1}(x_2\Phi(x_1), x_1\Phi(x_2))'(x_2\Phi(x_1), x_1\Phi(x_2)), \]

where \(x_2\Phi(x_1)\) denotes \(\Phi(x_1)\) with its i-th row multiplied by \(x_{2i}\), and similarly for \(x_1\Phi(x_2)\). Again, similar to the linear regression case, we have

\[ \hat\alpha = (\hat\beta_0, \hat\beta_1', \hat\beta_2')' + D_{11}^{-1}D_{12}(\hat\beta_1^{int\prime}, \hat\beta_2^{int\prime})'. \]

Hence the estimate of the fitted average effect can be rewritten as

\[ \hat\delta_1^*(\hat\alpha) = [\overline{d\Phi}]'\hat\beta_1 + [\overline{d\Phi}]'U_1(\hat\beta_1^{int\prime}, \hat\beta_2^{int\prime})', \]

where

\[ \overline{d\Phi} = n^{-1}\sum_{i=1}^{n}\left.\frac{d\Phi(x)}{dx}\right|_{x = x_{1i}}, \]

and \(U_1\) consists of the rows of \((D_{11}^{-1}D_{12})[-1,\,]\) corresponding to \(\alpha_1\). Notationally, \(\#[-1,\,]\) denotes the sub-matrix of \(\#\) obtained by deleting the first row. The estimate of the average effect of \(X_1\) based on the "true" model is

\[ \hat\delta_1(\hat\beta) = [\overline{d\Phi}]'\hat\beta_1 + \left\{n^{-1}\sum_{i=1}^n x_{2i}\left[\left.\frac{d\Phi(x)}{dx}\right|_{x=x_{1i}}\right]'\right\}\hat\beta_1^{int} + \left\{n^{-1}\sum_{i=1}^n \Phi'(x_{2i})\right\}\hat\beta_2^{int}. \]

With Results 1 and 2, it is possible to compare the asymptotic variances of the two estimates. However, the calculation of (2.9) is now rather complicated, owing to \(E((T'\beta - S'\alpha_*)^2SS')\) and the last two non-vanishing terms in (2.9). Thus we compare the asymptotic conditional variances in the following. Recall that T is the vector of predictor functions in the true relationship, S is the vector of predictor functions in the fitted model, and \(T_1, S_1\) denote the derivatives with respect to \(X_1\) of T and S, respectively. Using essentially the same derivation as in the linear regression scenario, we get that as \(n \to \infty\),

\[ n\mathrm{Var}(\hat\delta_1^*(\hat\alpha)\mid X) \to \sigma^2[E(S_1)]'\{E(SS')\}^{-1}E(S_1), \]
\[ n\mathrm{Var}(\hat\delta_1(\hat\beta)\mid X) \to \sigma^2[E(T_1)]'\{E(TT')\}^{-1}E(T_1), \]

where

\[ T = (1, \Phi'(X_1), \Phi'(X_2), X_2\Phi'(X_1), X_1\Phi'(X_2))', \quad T_1 = \left(0, \left[\frac{d\Phi(X_1)}{dX_1}\right]', 0_{1\times K_2}, X_2\left[\frac{d\Phi(X_1)}{dX_1}\right]', \Phi'(X_2)\right)', \]
\[ S = (1, \Phi'(X_1), \Phi'(X_2))', \quad S_1 = \left(0, \left[\frac{d\Phi(X_1)}{dX_1}\right]', 0_{1\times K_2}\right)'. \]

As we discussed before, in the linear regression scenario joint normality of the predictors also yields consistent estimates of average effects. Will this good property still hold now? To explore the effect of correlation between the predictors on the difference between the two average effects, we set up the predictors and basis functions as follows (see the sketch after this list).

A1. Suppose \(X_1\) and \(X_2\) are bivariate normal with mean vector 0 and covariance matrix \(\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\).

A2. For simplicity, a common set of quadratic power basis functions is used for the two predictors:

\[ \phi_1(x) = x - k_1, \quad \phi_2(x) = x^2 - k_2, \quad \phi_3(x) = (x - t_1)^2I\{x > t_1\} - k_3, \quad \phi_4(x) = (x - t_2)^2I\{x > t_2\} - k_4, \]

where \(t_1, t_2\) are the knots, set to the 25% and 75% percentiles of the standard normal distribution respectively, and the \(k_i\)'s are centering constants making \(E(\phi_i(X_j)) = 0\), \(i = 1, \ldots, 4\). Although this choice of basis functions is simple, it mimics the generality of splines expressed as linear combinations of the power basis functions. As for the number of knots, more knots can easily be incorporated into the basis functions; here, for computational convenience, we just use two knots for demonstration.

A3. Set the true values of the parameters as \(\beta_0 = 0\), \(\beta_1 = \beta_2 = \beta_1^{int} = \beta_2^{int} = (0.5, 0.5, 0.5, 0.5)'\).
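The centring constants and the moment matrices C_11 and C_12 can be approximated by simulation rather than the numerical integration of Appendix III; a minimal sketch under the A1-A2 settings (sample sizes chosen arbitrarily):

```python
# Monte Carlo versions of C_11 and C_12 for the centred quadratic power basis
# with knots at the standard-normal quartiles (setup A1-A2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
t1, t2 = norm.ppf([0.25, 0.75])

def raw_basis(x):
    return np.column_stack([x, x**2,
                            np.where(x > t1, (x - t1)**2, 0.0),
                            np.where(x > t2, (x - t2)**2, 0.0)])

# Centring constants k_i estimated by simulation so that E phi_i(X) = 0.
k = raw_basis(rng.normal(size=1_000_000)).mean(axis=0)
phi = lambda x: raw_basis(x) - k

def C_matrices(rho, n=500_000):
    z1 = rng.normal(size=n)
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    S = np.column_stack([np.ones(n), phi(z1), phi(z2)])            # (1, Phi(X1), Phi(X2))
    W = np.column_stack([z2[:, None] * phi(z1), z1[:, None] * phi(z2)])
    return S.T @ S / n, S.T @ W / n                                 # C_11, C_12

C11, C12 = C_matrices(rho=0.5)
print(np.round(C12, 2))   # nonzero once the predictors are correlated
```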
A s for the number of knots, we can also easily incorporate more knots into the basis functions. Here for computational convenience, we just use two knots for demonstration. A 3 . Set the true values of parameters as BQ = 0,0 l = 02 = 0™* = 02nt = (0.5,0.5,0.5,0.5)' . Under the above settings, we can get the values of the elements of matrices C n , Cu by a numerical approach. The details of the numerical approach for the required integrals are in Appendix I I I . One fact worth noticing is that when p = 1, that is X\ = X2, Cn is not invertible. Therefore we can think of the difference between the two average effects as a function of p, the correlation coefficient, confined wi th in [0 ,1 ) . A s the correlation increases, the discrepancy between the two average effects is quite small compared to the range of average effects. The graphical summarization is shown in Figure 2.3, which implies that the bias of the average effect is close to zero even if the correlation of the two predictors is high. 47 Chapter 2. Average effects for regression models with misspecihcations Figure 2.3: Comparison of two average effect estimators 5i(0) and 5{(8) for spline re- gression models wi th two predictors involved. The x-coordinate of each circle is 6{(0) and the y-coordinate is 8\({5). Different circles are produced by different values of p, the correlation coefficient between X\ and X2. 1 1 1 1 1.5 2.0 2.5 3.0 3.5 A s shown in Figure 2.4, we can see that the difference of the two conditional variances is rather small compared to magnitude of the conditional variances. This means that even though the model is misspecified, the precision of the estimates for given values of predictors does not seem to be affected that much by the misspecification. ; 48 Chapter 2. Average effects for regression models with misspecihcations Figure 2.4: Comparison of asymptotic conditional variances of two average effect estima- tors for spline regression models wi th two predictors involved. The x-coordinate of each circle is the conditional variance of misspecified-model estimator and the y-coordinate is that of right-model estimator. Different circles are produced by different values of p, the correlation coefficient between X\ and X2. i'.W) 2.4.2 Penalized regression models Penalized spline regression (often referred to as P-splines) has received attention as a powerful smoothing method. Original ly suggested by O'Sul l ivan (1986), the method provides a range of practical modelling tools in applied statistics, wi th the books by Green and Silverman (1994) and more recently by Ruppert , Wand , and Carro l l (2003). The main principle of penalized spline regression is to estimate the unknown regression function by a compromise between sum of squares of residuals (represent the fidelity to the data) and smoothness of the estimate. We start wi th a univariate case. Suppose that we have data (xi,yi) (for now Xi is a 49 Chapter 2. Average effects for regression models with misspecihcations scalar not a vector), Yi = m(xi\oc) + €j, where m is a smooth function denoting the conditional mean of Yt given xit and {ej}" = 1 are independent, mean zero random errors wi th a constant variance. To estimate m we use a spline regression model L m(x\ a) — a 0 + aix + • • • + a^ar 0 + £ ak(x — tk)+, 'fc=i where d > 1 is an integer, a = (ao, a i , . . a i , . . . , a^)' is a vector of regression coefficients, ti < • • • < tL are fixed knots and (a; — tk)+ = (x — tk)DI{x > tk}. 
Actually, m is expressed in terms of a set of power spline basis functions. Recall that S denotes the "design matrix" for the model, so that the i-th row of S is

\[ S_i' = (1, x_i, \ldots, x_i^D, (x_i - t_1)_+^D, \ldots, (x_i - t_L)_+^D) = (1, \Phi'(x_i)). \]

However, simple parametric fitting of \(\alpha\) would lead to unsatisfactory results due to the high dimensionality of the basis. Instead, \(\alpha\) is estimated in a penalized manner by imposing a penalty on the coefficients in m. A roughness penalty is placed on \(\{a_k\}_{k=1}^L\), which is the set of jumps at the knots in the D-th derivative of \(m(x; \alpha)\). This leads to the penalized least-squares estimator

\[ \hat\alpha(\lambda) = \arg\min_{\alpha}\left[\sum_{i=1}^{n}\{y_i - m(x_i; \alpha)\}^2 + \lambda\sum_{k=1}^{L}a_k^2\right] = \arg\min_{\alpha}\{(Y - S\alpha)'(Y - S\alpha) + \lambda\alpha'P\alpha\}, \]

with \(\lambda\) a penalty parameter controlling the trade-off between fidelity to the data and smoothness of the fitted spline, and P the corresponding penalty matrix. To impose the roughness penalty mentioned above, we choose P to be a diagonal matrix whose first 1 + D diagonal elements are 0 and whose remaining diagonal elements are 1. By simple algebra, the penalized least squares estimate of \(\alpha\) is given by

\[ \hat\alpha(\lambda) = (S'S + \lambda P)^{-1}S'Y = (n^{-1}S'S + n^{-1}\lambda P)^{-1}\{n^{-1}S'Y\}. \]

Since \(n^{-1}\lambda P\) goes to a zero matrix, the asymptotic behaviour of \(\hat\alpha(\lambda)\) is the same as in standard spline regression without a penalty.

To work out the least squares estimate \(\hat\alpha(\lambda)\), we first need to determine an appropriate value of \(\lambda\). Generalized cross-validation (GCV) (Craven and Wahba, 1979) is one method of smoothing parameter selection that has proven effective and has good theoretical properties. Here we follow Ruppert (2002) closely. Let

\[ ASR(\lambda) = n^{-1}\sum_{i=1}^{n}\{y_i - m(x_i; \hat\alpha(\lambda))\}^2 \]

be the average squared residual using \(\lambda\), and let \(S(\lambda) = S(S'S + \lambda P)^{-1}S'\) be the "smoother" or "hat" matrix. Then

\[ GCV(\lambda) = \frac{ASR(\lambda)}{(1 - n^{-1}\mathrm{tr}\{S(\lambda)\})^2} \qquad (2.23) \]

is the generalized cross-validation statistic. Here \(\mathrm{tr}\{S(\lambda)\}\) is the "effective degrees of freedom" of the fit. One chooses \(\lambda\) to minimize the GCV statistic over a grid of values of \(\lambda\). Computation can be sped up and stabilized numerically with the following diagonalization method, a variation on the Demmler-Reinsch algorithm used to compute smoothing splines. Let B be a square matrix satisfying \(B^{-1}(B^{-1})' = S'S\); for example, \(B^{-1}\) can be a Cholesky factor of \(S'S\). Let U be orthogonal and C diagonal such that \(UCU' = BPB'\); for example, we can use the eigendecomposition of \(BPB'\) to find U and C. Then, by some algebra, we get

\[ \mathrm{tr}\{S(\lambda)\} = \sum_i (1 + \lambda C_i)^{-1}, \qquad (2.24) \]

where \(C_i\) is the i-th diagonal element of C. Details of the proof are given in Appendix IV. The elegance of this method is that the work of calculating B and (U, C) needs to be done only once, after which these quantities can be used for all values of \(\lambda\). Therefore, we have an efficient way to evaluate the GCV criterion defined by (2.23).
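Putting (2.23) and (2.24) together gives a compact fitting routine. The sketch below (simulated data and tuning constants are our own) computes B and the eigenvalues C once, then evaluates GCV over a grid of λ values:

```python
# Univariate P-spline fit with GCV evaluated through the diagonalization (2.24);
# D = 3 and L = 10 equally spaced interior knots, chosen for illustration.
import numpy as np

rng = np.random.default_rng(3)
n, D, L = 200, 3, 10
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

knots = np.linspace(0, 1, L + 2)[1:-1]
S = np.column_stack([x**d for d in range(D + 1)] +
                    [np.where(x > t, (x - t)**D, 0.0) for t in knots])
P = np.diag([0.0] * (D + 1) + [1.0] * L)

StS, Sty = S.T @ S, S.T @ y
Binv = np.linalg.cholesky(StS)           # B^{-1}(B^{-1})' = S'S
B = np.linalg.inv(Binv)
C = np.linalg.eigvalsh(B @ P @ B.T)      # eigenvalues of B P B', computed once

def gcv(lam):
    resid = y - S @ np.linalg.solve(StS + lam * P, Sty)
    df = np.sum(1.0 / (1.0 + lam * C))   # tr{S(lambda)} via (2.24)
    return np.mean(resid**2) / (1.0 - df / n) ** 2

grid = 10.0 ** np.linspace(-6, 2, 50)
lam_hat = grid[np.argmin([gcv(l) for l in grid])]
print(lam_hat)
```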
This method is easily extended to an additive model with more than one predictor, i.e., p > 1. Suppose we have data \((y_i, \mathbf{x}_i)\) with \(\mathbf{x}_i = (x_{1i}, \ldots, x_{pi})'\),

\[ Y_i = \alpha_0 + m_1(x_{1i}) + \cdots + m_p(x_{pi}) + \epsilon_i. \]

We use a spline model for each \(m_l\):

\[ m_l(x; \alpha_l) = \alpha_{l1}x + \cdots + \alpha_{lD}x^D + \sum_{k=1}^{L}a_{lk}(x - t_k)_+^D, \qquad l = 1, \ldots, p, \]

where \(\alpha_l = (\alpha_{l1}, \ldots, \alpha_{lD}, a_{l1}, \ldots, a_{lL})'\). Hence the whole parameter vector is \(\alpha = (\alpha_0, \alpha_1', \ldots, \alpha_p')'\), and the penalty matrix P is set to be the sum of p diagonal matrices \(P_l\) \((l = 1, \ldots, p)\). Each \(P_l\) is a (pK + 1) by (pK + 1) square matrix (K is the dimension of the basis) whose diagonal elements are 1 in the positions corresponding to the truncated-power coefficients \(a_{l1}, \ldots, a_{lL}\) of \(m_l\) and 0 elsewhere. Therefore the least squares estimate of \(\alpha\) is the minimizer of

\[ \sum_{i=1}^{n}\{y_i - m(\mathbf{x}_i; \alpha)\}^2 + \sum_{l=1}^{p}\lambda_l\sum_{k=1}^{L}a_{lk}^2, \quad \text{i.e.,} \quad (Y - S\alpha)'(Y - S\alpha) + \alpha'\Big(\sum_{l=1}^{p}\lambda_lP_l\Big)\alpha, \]

where S is the corresponding "design matrix" with i-th row \(S_{i\cdot} = (1, \Phi'(x_{1i}), \ldots, \Phi'(x_{pi}))\). Consequently the least squares estimate of \(\alpha\) can be solved as

\[ \hat\alpha(\lambda) = \Big(S'S + \sum_{l=1}^{p}\lambda_lP_l\Big)^{-1}S'Y. \]

If the component functions \(\{m_l\}_{l=1}^p\) require roughly the same amount of smoothing, we may assume a common value of \(\lambda_l\), \(l = 1, \ldots, p\), which gives quick access to the estimation of the smoothing parameter using the methodology introduced in the univariate P-spline case. More realistically, different component functions \(m_l\) require different amounts of smoothing, which can be accomplished by allowing the \(\lambda_l\) values to vary rather than imposing a common value. In the algorithm of Ruppert and Carroll (2000), \(\lambda_1, \ldots, \lambda_p\) are chosen by GCV in two steps; note that GCV is a function of \(\lambda_1, \ldots, \lambda_p\). In the first step, GCV is minimized under the assumption \(\lambda_1 = \cdots = \lambda_p = \lambda\). In the second step, with this common smoothing parameter as the starting value for each \(\lambda_l\), the \(\lambda_1, \ldots, \lambda_p\) are selected one at a time by minimizing the GCV criterion.

In the following simulation study, we see how the two average effect estimates based on penalized least squares are affected by the smoothing parameter. Suppose two predictors are involved in the model, each independently following a uniform distribution on [0, 1]. A common set of centred cubic basis functions is used, and hence a common smoothing parameter \(\lambda\) is assumed. Five knots are used in the basis functions, equi-spaced within [0, 1]. We generate a data set of n = 200 realizations of \((Y, X_1, X_2)\), where Y is generated from a normal distribution with variance 1 and mean given by the model with interactions (2.19); the true values of all coefficients are set to 0.5. We repeat this for 200 data sets, calculating the average effect estimate for each data set, and we vary the sample size n to see what happens as n gets larger. The top panel of Figure 2.5 shows the average effect estimates from data sets of size n = 200 and the bottom panel those of size n = 1000; the solid horizontal line in both panels is the true value of the average effect. We see that the estimates lie around the true value with substantial variability, which is caused by the variability of the smoothing parameter estimates. As the sample size increases, the variability goes down, i.e., the estimates in the bottom panel are more concentrated on the true value than those in the top panel. One problem with Figure 2.5 is that the range of the estimates is quite broad compared to the scale of the true value 2.0208, which can visually blur the concentration around the true value. To make this concentration more visible, we use only the subset of estimates lying between −10 and 10 to plot the histograms in Figure 2.6. The number of estimates outside this range is 32 for n = 200 and 8 for n = 1000, both small compared to 200, the total number of estimates.
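For reference, the quoted true value 2.0208 can be reproduced under one reading of the simulation setup, namely knots at j/6 (j = 1, ..., 5), a full cubic power basis, and centred predictors entering the interaction terms (so that E(X_2 − 1/2) = 0 removes the interaction contribution to the average effect). This reading is our assumption, not stated explicitly above:

```python
# Numerical check of the quoted true average effect 2.0208, assuming
# knots at j/6, a cubic power basis, X1, X2 ~ U(0,1) independent, all
# coefficients 0.5, and centred predictors in the interaction terms.
import numpy as np

knots = np.arange(1, 6) / 6.0
x = np.linspace(0.0, 1.0, 2_000_001)        # fine grid approximating U(0,1)

# Sum of derivatives of the cubic basis: 1 + 2x + 3x^2 + sum_k 3(x - t_k)_+^2.
dphi_sum = 1 + 2 * x + 3 * x**2 + sum(np.where(x > t, 3 * (x - t)**2, 0.0)
                                      for t in knots)
delta1 = 0.5 * dphi_sum.mean()              # coefficient 0.5 times E[derivative]
print(round(delta1, 4))                     # 2.0208
```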
Figure 2.5: Scatter plots of average effect estimates in penalized spline regression with two predictors: the top panel with sample size n = 200 and the bottom with n = 1000. The solid horizontal line in each panel identifies the true value of the average effect, and each circle represents a different simulated data set.

Figure 2.6: Histograms of the average effect estimates in penalized spline regression with two predictors: the top panel with sample size n = 200 and the bottom with n = 1000. The symbol 'x' in each panel marks the true value of the average effect.

2.5 A middle scenario

Although we do not get exactly consistent estimates in the spline regression scenario discussed above, we still wonder whether there is an intermediate situation between straight-line fitting and curve fitting. In this section, we consider models involving quadratic terms in the predictors. To explain, we assume the true model to be

\[ E(Y\mid X_1, X_2) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1^2 + \beta_4X_2^2 + \beta_{12}X_1X_2, \]

while the fitted model is

\[ E(Y\mid X_1, X_2) = \alpha_0 + \alpha_1X_1 + \alpha_2X_2 + \alpha_3X_1^2 + \alpha_4X_2^2. \]

Assume that the joint distribution of \(X_1\) and \(X_2\) is bivariate normal with mean vector 0, \(\mathrm{Cov}(X_1, X_2) = \rho\) and \(\mathrm{Var}(X_i) = 1\), \(i = 1, 2\). Based on the previous result (2.14), where \(\Sigma\) is now the covariance matrix of \((X_1, X_2, X_1^2, X_2^2)\), and using the moments of the bivariate normal distribution, it is easy to derive that

\[ \alpha_{1*} = \beta_1, \qquad \alpha_{3*} = \beta_3 + \frac{\rho}{1 + \rho^2}\beta_{12}. \]

With Definition 2 of average effect, that is, averaging over the joint distribution of \((X_1, X_2)\), we have \(\delta_1^*(\beta) = \delta_1(\beta) = \beta_1\). That is, we still have an (exactly) consistent estimator under this setting.

However, interestingly, based on Definition 1, i.e., averaging over the marginal distribution of \(X_2\), we get

\[ \delta_1^*(x_1; \beta) = \alpha_{1*} + 2\alpha_{3*}x_1 = \beta_1 + 2\left(\beta_3 + \frac{\rho}{1+\rho^2}\beta_{12}\right)x_1, \quad \text{and} \quad \delta_1(x_1; \beta) = \beta_1 + 2\beta_3x_1. \]

Hence, we get the difference between the fitted and true average effects

\[ \delta_1^*(x_1; \beta) - \delta_1(x_1; \beta) = \frac{2\rho}{1+\rho^2}\beta_{12}x_1. \]

It is easy to verify that the bias increases as \(\rho\) increases for given \(X_1 = x_1\). One special case is \(\rho = 0\), when the difference disappears. It is also clear that independence of the \(X_i\)'s yields consistent average effect estimates regardless of whether the joint distribution of the \(X_i\)'s is normal.

Furthermore, if Definition 3 is used, averaging over the conditional distribution \(X_2\mid X_1 = x_1\), we have

\[ \delta_1(x_1; \beta) = \beta_1 + 2\beta_3x_1 + \beta_{12}\rho x_1. \]

Note that \(\delta_1^*(x_1; \beta)\) is not subject to the change of definition because it does not depend on \(X_2\). Thus the bias becomes, correspondingly,

\[ \delta_1^*(x_1; \beta) - \delta_1(x_1; \beta) = \frac{\rho - \rho^3}{1+\rho^2}\beta_{12}x_1. \]

It is easy to verify that for any given \(x_1\), the bias increases if \(|\rho| < \sqrt{\sqrt{5} - 2}\) and decreases otherwise. Therefore, conditional on different definitions of average effect, we may reach different conclusions about the consistency of estimators based on a simpler model.
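The limits α_{1*} and α_{3*} derived above can be checked directly by simulation; a minimal sketch with arbitrary true coefficient values:

```python
# Monte Carlo check of alpha_1* = beta_1 and
# alpha_3* = beta_3 + rho/(1 + rho^2) * beta_12 for bivariate normal X.
import numpy as np

rng = np.random.default_rng(4)
n, rho = 1_000_000, 0.6
b0, b1, b2, b3, b4, b12 = 0.0, 1.0, 1.0, 1.0, 1.0, 1.0

x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
y = b0 + b1*x1 + b2*x2 + b3*x1**2 + b4*x2**2 + b12*x1*x2 + rng.normal(size=n)

S = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2])   # fit omits x1*x2
a = np.linalg.lstsq(S, y, rcond=None)[0]
print(a[1], a[3], b3 + rho / (1 + rho**2) * b12)          # a1 ~ b1; a3 ~ limit
```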
2.6 Summary

In Section 2.2, Results 1 and 2 give the asymptotic distributions of the average effect estimators under the "true" model and the "misspecified" model, respectively. Note that we do not have to assume that the two models are nested; that is, our results apply to more general model misspecification situations. In the linear regression context, discussed in Section 2.3, Result 3 gives conditions under which a consistent estimator is obtained when fitting an additive model without interactions to data generated from a pairwise interaction model without pure quadratic terms. Although the conditions, independence or joint normality (more generally, an elliptically contoured distribution) of the components of the centred X, may not be satisfied in practice, we could try appropriate transformations to bring the distribution of X close to either condition. In the spline regression context of Section 2.4, we can still have a consistent estimator when the predictors are independent and the spline bases are centred. We also explored the bias under the condition of joint normality, and found the magnitude of the bias to be quite small even when there is strong dependence between the predictors; that is, the estimator based on the additive model without interactions is approximately consistent. In Section 2.5, we considered models with quadratic terms in the predictors, where a consistent estimator arises under the conditions of Result 3.

We should be aware that the consistency of an average effect estimator depends on the definition of average effect, as illustrated by the example in Section 2.5. All the results and conclusions in Chapter 2 are based on Definition 2, that is, averaging over the joint distribution of all components of X. However, this definition may not always be appropriate. For example, suppose one is interested in comparing the risk of lung cancer between two groups of people with different smoking habits, say one group that never smokes and another that smokes every day, in the presence of a number of other risk factors such as gender, age, residence and so on. Since we want to know how smoking changes the risk of lung cancer, we should consider Definition 3, averaging over the distribution of all the other risk factors conditional on the smoking factor. Thus, for different scenarios, we need to investigate the consistency of estimators equipped with different definitions; which definition should be applied depends on the goal of the study.

Chapter 3

Comparison of interaction detectability under different interaction models

This chapter focuses on the comparison of power to detect interactions under different regression models, in particular a pairwise interaction model and a diffuse interaction model. Section 3.1 gives background on the asymptotic power of Wald (quadratic form) statistics under local alternatives. Section 3.2 gives a concrete example showing how powerful the diffuse interaction model is at detecting interactions, whether or not the true interaction structure is diffuse.

3.1 General framework

In this section, we derive a general result for the asymptotic power function of the Wald (quadratic form) test for the presence of interactions, based on two models. Let \(\mathcal{F} = \{f(y\mid x, \theta) : \theta \in \Theta\}\) and \(\mathcal{G} = \{g(y\mid x, \omega) : \omega \in \Omega\}\) denote two different parametric families of densities under consideration when modelling the relationship between the response variable Y and predictor variables \(X_1, \ldots, X_p\). We assume agreement between the two families at certain values of their parameters.
That is, one member of \(\mathcal{G}\) satisfies

\[ g(y\mid x, \omega_0) = f(y\mid x, \theta_0) \quad \text{for all } y \text{ and } x, \]

so that as \(\omega\) moves away from \(\omega_0\), \(g(y\mid x, \omega)\) moves away from \(\mathcal{F}\) in some specified way. Under model \(\mathcal{F}\) (\(\mathcal{G}\)), \(\theta = \theta_0\) (\(\omega = \omega_0\)) stands for the null model of no interactions. Let f, g denote densities and F, G the corresponding distributions. Let \(p_1 = \dim(\theta)\), \(p_2 = \dim(\omega)\) be the dimensions of \(\mathcal{F}\) and \(\mathcal{G}\), with

\[ s_F(\theta, Y, X) = \partial[\log f(Y\mid X, \theta)]/\partial\theta \quad \text{and} \quad s_G(\omega, Y, X) = \partial[\log g(Y\mid X, \omega)]/\partial\omega \]

being the respective score vectors, and

\[ I_F(\theta) = E_\theta\{s_F(\theta, Y, X)s_F'(\theta, Y, X)\} \quad \text{and} \quad I_G(\omega) = E_\omega\{s_G(\omega, Y, X)s_G'(\omega, Y, X)\} \]

being the corresponding Fisher information matrices. Note that the two expectations above are with respect to the joint distribution of Y and X. Let \(f_X(x)\) be the density of X with respect to a measure \(\nu(x)\) (either Lebesgue or counting measure).

To compare the capability to detect interactions under the two families \(\mathcal{F}\) and \(\mathcal{G}\), we set up two hypothesis tests with \(\mathcal{F}\) as the fitted model and \(\mathcal{G}\) as the true model. The reverse case, with \(\mathcal{F}\) the true distribution and \(\mathcal{G}\) the fitted model, is just an analog, so in the following we take the former case for demonstration.

To make the comparison tractable, and also to produce some nontrivial asymptotic powers that are not all equal to 1, we use a Pitman-type local analysis (developed by Le Cam (1960)), focusing on \(n^{-1/2}\)-neighborhoods of the true parameter values. Say \(\mathcal{F}\) is the fitted model while \(\mathcal{G}\) is the true model with \(\omega_n = \omega_0 + n^{-1/2}\Delta\eta\), where \(\Delta\) is a scalar and \(\eta\) is a vector of length \(p_2\). The hypotheses used here are

\[ H_0: C\theta = \xi_0 \quad \text{versus} \quad H_a: C\theta \ne \xi_0, \]

where C is an \(r \times \dim(\theta)\) matrix of full row rank and \(\xi_0\) is a given vector (for example, a zero vector denotes an additive model when \(C\theta\) is the vector of interaction parameters). Note that these hypotheses are more general than testing \(H_0: \theta = \theta_0\), where C reduces to an identity matrix as a special case; they cover any linear combination of the whole parameter \(\theta\) under the fitted model. To compare the capability to detect interactions, the parameters of interest are just those related to interactions. In particular, if the pairwise interaction model (defined later by (3.6)) is fitted, the whole parameter vector is \(\theta = (\beta_0, \beta_1, \ldots, \beta_p, \beta_{12}, \ldots, \beta_{(p-1)p})'\). Only the last \(q = p(p-1)/2\) components relate to interactions. Hence let \(C_1 = (0_{q\times(p+1)}, I_{q\times q})\) so that \(C_1\theta = (\beta_{12}, \ldots, \beta_{(p-1)p})'\), which is the parameter vector of interest for testing. If the fitted model is the diffuse interaction model (3.5), the whole parameter vector is \(\omega = (\beta_0, \beta_1, \ldots, \beta_p, \lambda)'\) and only \(\lambda\) relates to interaction. Thus let \(C_2 = (0_{1\times(p+1)}, 1)\) so that \(C_2\omega = \lambda\). Let \(\hat\theta_n\) be the maximum likelihood estimator based on \(f(y\mid x, \theta)\) from n observations from \(g(y\mid x, \omega_n)\). Then

\[ n(C(\hat\theta_n - \theta_0))'\{CI_F^{-1}(\theta_0)C'\}^{-1}(C(\hat\theta_n - \theta_0)) \]

is the Wald test statistic used
Note that the fact </(-| wo) = /(-| 0o) yields that 0,(u>o) = 0o- B y Whi te (1982) we know that the first i tem on the right side of (3.1) is asymptotically normal wi th mean 0 and covariance matr ix CIp1(60)C'. So we only need to work on the second item on the right side in (3.1). B y the definition of 0 * ( w ) , we know that 0*(<*>) is the value of 0 solving J sF(8'u),y, x)g{y\x, w)/ x (x)dydi /(x) = 0. (3.2) Based on Gustafson (2001), implici t differentiation of (3.2) gives ' Ew[4(0.(a;),y )X)]0Uw) + E w [ s i . ( 0 , ( a ; ) ) y , X ) s G ( u ; ; y ) X ) ] = O. Evaluated at u = u0, the above equality yields BB ^ = / ^ 1 ( 0 o ) E e o K ( 0 o ; y ! X ) S G ( u ; o ; y ) X ) } , (3.3) 63 Chapter 3. Comparison of interaction detectability under different interaction models which is derived by the fact that 0*(u>o) = OQ- Therefore, we have ^ UU> u>=u>o j AC du UI=U>0 V, where du,. .. dui Recall the equality (3.1). Based on O 'Br i en et al . (2006), assuming Q being the true model, the asymptotic distribution of n(c(en - e0))' {crF\d0)cyl {c(dn - e0)) is a noncentral x2 wi th degrees of freedom of rank(C) and noncentrality parameter 5 is calculated by 5=1 AC r? j {crF\oQ)cyl I A C (3.4) Suppose the equivalent null hypothesis for Q is CGu = COG where CQ is a rG x dim(u;) matrix. The Wald statistic based on the fitted model Q is n{CG(On - u ; 0 ) ) ' {CGIc\u0)C'Gyl (C(Qn - w „ ) ) . Its asymptotic distribution is noncentral Xrc(^G), where 6G = {ACGrj}' {CGIG\u0)C'Gyl {ACGr)} 64 Chapter 3. Comparison of interaction detectability under different interaction models 3.2 Comparison between pairwise interaction models and diffuse interaction models 3.2.1 Introduction of diffuse interaction model Greenland (1983) pointed out that the powers of statistical tests to detect interactions are very low in some commonly encountered epidemiological studies. We could imagine even lower power in the situations where the number of risk factors is rather large and only a very small fraction of a l l possible (pairwise) interaction terms really play a role. Gustafson et al. (2005) proposed another kind of interaction model, the diffuse interaction model, to deal wi th difficulties caused by a large number of risk factor under pairwise interaction models. B y diffuse interaction, we mean that the effect of a particular risk factor either increases (synergism) or decreases (antagonism) as al l the other risk factors increase, without regard to which of the other risk factors get involved wi th the effect modification. The diffuse interaction model introduced in Gustafson et al . (2005) is defined as The parameter A reflects the magnitude of the synergism/antagonism. Take a binary Xj for example, if A > 1 then it is easily verified that the interaction is antagonistic, in the sense that the value of E ( Y | A j = 1, X y ) = xy)) — E ( Y | A j = 0, X y ) = xy j ) decreases in each component of x y ) . If Xj is continuous, it is also easy to show that A > 1 gives E(Y\XU...,XP) = HD (3.5) dE ( Y | X = x) dxjdxk 65 Chapter 3. Comparison of interaction detectability under different interaction models which means synergism based on the definition in Section 1.2. Conversely, A < 1 corre- sponds to synergism. That is, the effect modification caused by any putative risk factor increases as other risk factors increase. The magnitude of the difference between A and 1 implies how much synergism/antagonism is present. 
However, we should be aware that \(\lambda\) does not provide information about which of the risk factors contribute to the effect modification for any putative risk factor.

3.2.2 Power comparison

Say the response variable Y is normally distributed and \(X_1, \ldots, X_p\) are the corresponding explanatory variables. We study an example under the following two interaction models.

Pairwise interaction model:

\[ E(Y\mid X_1, \ldots, X_p) = \mu_P = \beta_0 + \sum_{i=1}^{p}\beta_ix_i + \sum_{1\le i<j\le p}\beta_{ij}x_ix_j. \qquad (3.6) \]

Recall the diffuse interaction model:

\[ E(Y\mid X_1, \ldots, X_p) = \mu_D = \beta_0 + \left\{\sum_{i=1}^{p}(\beta_ix_i)^\lambda\right\}^{1/\lambda}, \qquad \beta_i \ge 0,\ x_i \ge 0,\ \lambda \ge 0. \]

Let \(\beta_M = (\beta_1, \ldots, \beta_p)'\) and \(\beta_I = (\beta_{ij})_{q\times 1}\) (\(q = p(p-1)/2\)), the coefficients of the pairwise interaction terms. Let \(\theta = (\beta_0, \beta_1, \ldots, \beta_p, \beta_I', \sigma^2)'\), the whole parameter vector in the pairwise interaction model, and let \(\omega = (\beta_0, \beta_1, \ldots, \beta_p, \lambda, \sigma^2)'\), the whole parameter vector in the diffuse interaction model. If \(\beta_I = \beta_{I0} = 0\) in the pairwise interaction model, the model reduces to an additive model without interaction terms. Correspondingly, if \(\lambda = \lambda_0 = 1\) in the diffuse interaction model, the model reduces to an additive model as well. Therefore, the two interaction models coincide when evaluated at \(\beta_{I0}\) and \(\lambda_0\), respectively.
Note that S£>(X0, Y, X) is the derivative of log /D(U>, Y, X) with respect to A evaluated at A0, i.e., the last component of the score vector SD(W,Y, X). Therefore the asymptotic power is P(x2(S) > x\, a)> where Xqt a is the upper a quantile of x2— distribution with degrees of freedom q. (ii) Fitted model is diffuse interaction model. Now the hypotheses to be tested are Ho : C2u = 1 versus Ha : C2u ^ 1, i.e., H0 :'A = 1 vs. Hi : X 1. Recall that C2 = (0i x p, 1). Therefore the asymptotic power is P{x\{5)>xla), where 5 = A V ^ A o ) Vb(Ao) = C2rD\u>o)C2, ID(00) = E ' < ' a i o § M ( ^ f o \ du J \ du w 0 = (A)./3I.---.A,.A0)'. Remark: As a matter of fact, the An should be positive to make the operation (-)A" meaningful. That is, the above analyses work well when sample size n big enough since 69 Chapter 3. Comparison of interaction detectability under different interaction models Xn converges to 1, which is away from the boundary value 0. Case II: Assume the true model is a pairwise interaction model with ^ 0 1 2 , n ^ Pin = \ P(P-l)P,n J ^ where rj is a q x 1 vector. For this time being, we set each element of rf to be 1, which means every pair of interactions is positive. Later we discuss the consequences of different choices of rj. (i) Fitted model is diffuse interaction model (3.5). The hypotheses to be tested here are Ho : C2u — 1 vs. Ha : C2UJ ̂  1. All the following is an analog of i) of Case I. Now e = (p0,pu...;pp,xy. The asymptotic power is P(xl(S)>xla), where dX, 8K ldPr P,=o dPj /3/=0 {//31(u;o)EaJo (sD(vj0, Y, X)s'P(0o, Y, X))} (P+2)- ' where MT. denotes for the rth row vector of matrix M. (ii) Fitted model is pairwise interaction model (3.6). The hypotheses to be tested here are HQ : C\0 = 0 versus H\ : C\0 ̂  0. We have 70 Chapter 3. Comparison of interaction detectability under different interaction models with 5 = A2ri'{lp2(0o)}~ir). Hence the asymptotic power is P (xq($) > Xq, a)- As discussed above, we have derived four asymptotic power functions for the four subcases, corresponding to all possible patterns of either pairwise interaction or diffuse interaction model being fitted while the underlying data generation process is the other one. However, it is not easy to tell which power is bigger based on the expressions of asymptotic powers. To be more specific, we take an example as below. Suppose we have p = 9 predictors at hand, the true values of all components of (3M are all equal to 0.5 and p0 = 0. The variance of random error is set to be a2 = 1. Also the predictors are identically independently distributed as Bernoulli with parameter £, the probability of success. Then we plot the power functions against the value of A, with £ = 0.5. As shown before, the key things to compute the power are Ip = E {sp(-, Y, X)s'P(-, Y, X)}, ID = E {sD(-, Y, X)s'D(-, Y, X)} , E {sP(-, Y, X)s'D(-, Y, X)} and E {sD(-, Y, X)sP(-, Y, X)}. Under the above settings, Figure 3.1 shows the result of the powers in the four sub- cases discussed above. The solid lines denote the power curves for diffuse interaction model fitting and the dashed lines denote those of pairwise interaction model fitting. We find that the diffuse interaction model is more powerful to detect interactions than pair- wise interaction model, regardless of whether the true model we postulate is the diffuse interaction model or not. 71 Chapter 3. 
Figure 3.1: Power curves with X_i ~iid Bernoulli(0.5): the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the bottom panel \(\eta = 1_{q\times 1}\).

In Case II, consider replacing \(\eta = (1, \ldots, 1)'\) with a vector \(\eta\) having entries ±1, that is, pairwise interaction terms with different "directions". For terms with positive coefficients, the impact of the interaction is to increase the value of the response variable when the corresponding predictors increase while the other predictors are kept unchanged. For terms with negative coefficients, the interaction effect decreases the response when the involved predictors increase with the other predictors unchanged.

As opposed to a pairwise interaction model, the role of \(\lambda\) in the diffuse interaction model is to measure the magnitude of the overall interaction among the predictors. Note that diffuse interaction models do not specify which of the predictors do or do not contribute to the interaction. Hence, we could expect the power of the diffuse interaction model to deteriorate when the true model involves more mixed directions of interaction effects. Since the "overall" interaction gets weaker and weaker as more interaction terms of the other direction appear in \(\eta\), we would expect the performance of the diffuse interaction model to get worse and worse. That is what Figure 3.2 implies. In the first panel, with the choice of \(\eta\) stating that all interaction terms have the same direction, the diffuse interaction model performs better than the pairwise interaction model in terms of power. However, when \(\eta\) is changed so that only some of the interaction terms act in the same direction, the two power curves cross, as shown in the middle panel: for some values of \(\Delta\) the diffuse interaction model works better, while for other values of \(\Delta\) the pairwise interaction model does better. Moreover, when \(\eta\) is set to have more mixed directions of interaction effects, the performance of the diffuse interaction model is worse, as implied by the last panel. In Figure 3.2, the top panel gives the power curves obtained by setting \(\eta\) to a vector of 1's, the middle panel to a vector made up of 10 ones and 26 zeros, and the bottom panel to a vector made up of 10 ones, 10 zeros and 16 minus-ones. Note that to make the plots comparable across different \(\eta\)'s, we normalize each \(\eta\) to have length 1.

Figure 3.2: Different choices of \(\eta\) in Case II with binary predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few predictor pairs interacting positively, and the bottom panel has more mixed directions of interactions. The lengths of the \(\eta\)'s in the different panels are normalized to 1. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting.
Subsequently, what happens if other settings, such as \(\xi\), \(\beta_M\), or \(\sigma^2\) (= Var(ε)), vary? Figure 3.3 shows the outcome for different \(\xi\)'s with \(\beta_M = 0.5\cdot 1_p\) and \(\sigma^2 = 1\): \(\xi = 0.2\) in the leftmost panel, \(\xi = 0.5\) in the middle panel and \(\xi = 0.8\) in the rightmost one. The three plots in the top panel are power curves for the case of the true model being the diffuse interaction model, and the other three in the bottom panel for the true model being the pairwise interaction model. As \(\xi\) varies, the performance of both models changes; however, the differences among the three plots in the bottom panel, where the true model is the pairwise interaction model, are not great. In other words, the distribution of the predictors has a greater effect when the diffuse interaction model is the true model than when the pairwise interaction model is.

Figure 3.3: Different choices of \(\xi\) in the distribution of the X_i's: the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, \(\xi\) = 0.2, 0.5 and 0.8 respectively.

Figure 3.4 shows the result of different choices of \(\beta_M\). For simplicity, we set all the components of \(\beta_M\) equal to each other, that is, \(\beta_M = b1_p\), with b = 0.1 in the leftmost panel, 0.5 in the middle panel and 1 in the rightmost one. From the three plots in the top panel, we can see that the performance of both models improves as the magnitude of b increases. However, the three plots in the bottom panel are almost the same. We therefore reach a similar conclusion as above, where \(\xi\) varied: a change of b has more effect in the case where the diffuse interaction model is true.

Figure 3.4: Different choices of b in \(\beta_M = b1_p\): the top panel with the true structure being the diffuse interaction model and the bottom panel with the true structure being the pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, b = 0.1, 0.5 and 1 respectively.

Figure 3.5 shows what happens if \(\sigma^2\) varies: \(\sigma^2\) = 0.5 in the leftmost panel, 1 in the middle panel and 5 in the rightmost one. Now the three plots in the top panel are quite similar to those in the bottom panel, respectively. In fact, based on formula (3.4), we know that the noncentrality parameter \(\delta\) is proportional to \(\sigma^{-2}\). Therefore, if we change the scale of \(\Delta\) according to the value of \(\sigma^2\), the shapes of the power curves are the same across different \(\sigma^2\); that is what Figure 3.6 implies.
Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, Var(e)=0.5,l and 5 respectively. Trtu modal la diffuse True model Is diffuse Trua model Is diffusa 10 is 20 o a 10 15 M 0 6 10 15 M Trua modal la palrwlaa True model Is pairwise True modal ia pairwise ~i H H 1 1 1 H h 1 1 1 H 15 20 0 S 10 IS 20 O S 10 IS 20 77 Chapter 3. Comparison of interaction detectability under different interaction models Figure 3.6: Different choices of Var (e) wi th scaled A : the top panel wi th the true structure to be diffuse interaction model and bot tom panel wi th the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. Trua modal la diff ma Trua modal la diffusa Trua modal la diffusa A l l the above plots are based on the distribution of X having independent components. In the situation that those predictors are dependent, what happens if the dependency changes? To construct a dependent structure between those predictors, we let Xi,..., XP\Z Bernoulli(Z), Z ~ Beta(a, b). B y a litt le algebra, we have p = (1 + a + 6 ) - 1 , as the correlation between any pair of Xi,... ,XP. Hence by changing the values of a and b, we may get different correlations. In Figure 3.7, the first column is power curves obtained from a = b — 100, so that p = 0.005; for the second column, a = b = 0.5, hence p = 0.5 and for the th i rd column a = b = 0.01, which leads to p = 0.98. The change of correlation does affect the power performance 78 Chapter 3. Comparison of interaction detectability under different interaction models more in the case that diffuse model is true than that in the case that pairwise interaction model is true. Figure 3.7: Different choices of p among A Y s : the top panel wi th the true structure to be diffuse interaction model and bot tom panel wi th the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, p—0.005,0.5 and 0.98 respectively. Trua model Is diffusa True modal Is diffusa Trua modal Is diffuse We also investigate what happens if the predictors are continuous. Suppose A Y s (i = 1, . . . ,9) are i . i .d . following a log-normal distribution. Tha t is for each i, A j = exp(Zj), where Zj 's are i . i .d . standard normal variables. Set /30 = 0,3M = 0 .51 p and the variance of random error cr2 = 1. Then we plot the asymptotic power against A (defined in the local alternatives) for different four subcases discussed before. Figure 3.8 shows a similar outcome to Figure 3.1, which is for binary predictors. That is, the diffuse interaction model is more powerful to detect interactions than pairwise interaction model, regardless of what the true structure of interaction is diffuse or not. 79 Chapter 3. Comparison of interaction detectability under different interaction models Figure 3.8: Power Curves wi th Xi's"~' Log-normal(0,1): the top panel wi th the true structure to be diffuse interaction model and bot tom panel wi th the true structure to be pairwise interaction model. 
Similar to the previous example with binary predictors, we also change the direction of the interactions when the pairwise interaction model is the true model. As shown in Figure 3.9, the diffuse interaction model loses power to detect interactions as the "overall" strength of interaction gets weaker.

Figure 3.9: Different choices of η in Case II with continuous predictors: the top panel has all predictor pairs interacting positively, the middle panel has only a few predictor pairs interacting positively, and the bottom panel has more mixed directions of interaction. The lengths of the η's in the different panels are normalized to 1. Solid lines denote power curves based on diffuse interaction model fits; dashed lines denote power curves based on pairwise interaction model fits.

3.3 Summary

From the two examples studied in Section 3.2.2, we see that the diffuse interaction model is more powerful at detecting interactions regardless of whether the (postulated) true interaction structure is diffuse or pairwise. However, as the directions of the true interactions among the predictors become more mixed, the diffuse interaction model becomes less powerful at detecting interaction.

Chapter 4
MCMC algorithms for diffuse interaction models

4.1 Why MCMC?

We introduced the diffuse interaction model in Chapter 3, where it was proposed as more powerful for detecting interactions than the pairwise interaction model, by using only a single parameter to describe the interactions among numerous predictors. As a related point, the diffuse interaction model may also be better for inference about interactions. In this chapter we apply MCMC algorithms to implement the model fitting. In this section we enumerate reasons for using MCMC algorithms rather than maximum likelihood estimation, which would also be feasible.

First, in the diffuse interaction models, all the parameters βj (except the intercept β0) and λ are constrained to be positive. Statistical inference for constrained maximum likelihood estimates is usually more complicated. Standard asymptotic theory asserts that inference for inequality-constrained parameters does not require special techniques, because for a large enough sample there will always be a confidence interval at the selected level of confidence that avoids the constraint boundaries. "Sufficiently large," however, can be quite large when the true parameter values are very close to these boundaries. In practice, our finite samples may not be large enough for confidence intervals to avoid the constraint boundaries, and this has implications for all parameters in models with inequality constraints, even those that are not themselves constrained. By comparison, an MCMC sampler automatically accommodates the constraints and yields appropriate interval estimates without extra effort.

Second, the quantity of interest may not be the model parameters directly, but some complicated function of them.
In particular, under diffuse interaction models we apply the average-effect idea to make inferences about interactions. The first derivative of the regression function with respect to xj, j = 1, ..., p, has a somewhat intricate form, so it takes effort to obtain interval estimates for average effects based on standard MLE theory. By contrast, interval estimates are readily obtained from the posterior samples of the parameters of the diffuse interaction model.

Third, for some extensions of the diffuse interaction model, maximum likelihood estimates could be unwieldy to compute. For example, to relax the assumption that all the predictors interact in the same way, we could allow the predictors to be partitioned into groups: within each group, the predictors interact in the same way. Here we assume that there is no overlap among the predictors in different subgroups; that is, each predictor has an indicator variable denoting the group to which it belongs. The specific parametric form of this extended model is given in Section 4.2. The parameter set is then a mixture of continuous parameters (the βj's, the coefficients of the predictors, and λ, the diffuse interaction parameter) and discrete parameters (the indicator variables associated with the predictors). There are 2^p (or 3^p with three groups) different patterns of group allocation, so the optimization procedure would be more complicated, especially when p is large.

4.2 Details of MCMC algorithms

In Gustafson et al. (2005), an efficient MCMC sampler is developed for binary responses. Here we apply a similar MCMC algorithm for continuous (normal) responses. We start with the simple case in which all the predictors interact in the same way.

4.2.1 One-group diffuse interaction models

In this subsection we consider the one-group diffuse interaction model

\[
Y \mid X = x \;\sim\; N\Big(\beta_0 + \Big\{\sum_{j=1}^{p}(\beta_j x_j)^{\lambda}\Big\}^{1/\lambda},\ \sigma^2\Big), \tag{4.1}
\]

where λ > 0 is the parameter accounting for interactions and βj > 0, xj ≥ 0 for j = 1, ..., p.

We apply a hybrid MCMC (HY) approach, similar to that in Gustafson et al. (2005), to make inference. The reason for using this algorithm is to avoid the waste caused by the randomness introduced in proposing candidate values. Basically, the idea of HY is to incorporate derivative information about the target density and, simultaneously, to suppress random-walk behavior in the sampling. Both strategies attempt to eliminate the inefficiency of the random walk commonly used in Metropolis-Hastings (MH) algorithms to generate candidate states. To explain: with a random-walk proposal, the direction of each movement around the target distribution is randomized. This can greatly increase the number of iterations required to reach equilibrium, and the situation worsens when the parameters of the target distribution are highly dependent on one another. See Gustafson et al. (2004) and Neal (1998) for further discussion. Pseudocode for the hybrid MCMC algorithm is provided in Appendix V.

In our simulation study, the priors of the parameters are log λ ~ N(0, σ²_λ) with σ²_λ = 100; β0 ~ N(0, σ²_{β0}) with σ²_{β0} = 100; and βj ~ N⁺(0, σ²_β) with σ²_β = 100 for j = 1, ..., p, where N⁺(μ, σ²) denotes the N(μ, σ²) distribution truncated to non-negative values.
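Before describing the sampler, a minimal data-generation sketch for model (4.1), using the simulation settings described next (binary Bernoulli(0.2) predictors, βj = 0.5, λ = 2, σ² = 1); the function name is ours:

```python
# Sketch: simulate from the one-group diffuse interaction model (4.1).
import numpy as np

rng = np.random.default_rng(0)

def diffuse_mean(X, beta0, beta, lam):
    # E(Y | X) = beta0 + { sum_j (beta_j x_j)^lam }^(1/lam)
    return beta0 + ((beta * X) ** lam).sum(axis=1) ** (1.0 / lam)

n, p = 2000, 10
X = rng.binomial(1, 0.2, size=(n, p))
beta0, beta, lam, sigma = 0.5, np.full(p, 0.5), 2.0, 1.0
y = diffuse_mean(X, beta0, beta, lam) + sigma * rng.normal(size=n)
```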
Note that we also check prior sensitivity by using informative priors with smaller hyperparameters σ²_{β0}, σ²_β. The prior for σ² is inverse-gamma(a, b) with shape parameter a = 0.0001 and scale parameter b = 0.0001.

Set the number of predictors to p = 10 and the sample size to n = 2000. Generate the p binary predictor variables independently from a Bernoulli(0.2) distribution. For simplicity, we set the true values of the βj's (j = 0, ..., p) all equal to 0.5. The true value of λ is set to 2; that is, all the predictors interact in the antagonistic direction. Then, for i = 1, ..., n, we generate the response Yi from a normal distribution with mean β0 + {Σ_{j=1}^p (βj x_{ij})^λ}^{1/λ} and variance σ² = 1. Note that in the model for binary responses no variance component is involved; as a consequence, our algorithm here has one more step, updating σ², than the MCMC algorithm used in Gustafson et al. (2005).

Following Hills and Smith (1992), the MCMC literature has considered changes in the parameterization of a model as a way to speed up convergence in a Gibbs sampler. The general suggestion is to make the components as "independent" as possible. Thus we implement MCMC using the new parameters (α, λ), where α0 = β0 and αj = βj/λ, j = 1, ..., p. From the scatter plot of the posterior samples of λ and α1 (left panel of Figure 4.1), a smaller correlation between the reparameterized components can clearly be seen; that is, the reparameterization works well. All the samples used in Figure 4.1 are based on 30,000 iterations after 20,000 burn-in iterations.

MCMC Algorithm I: at the t-th iteration,

Step 1. Given λ^(t−1), update (β0^(t−1), α1^(t−1), ..., αp^(t−1)) to (β0^(t), α1^(t), ..., αp^(t)) as a block using hybrid MCMC.

Step 2. Update λ^(t−1) using a random-walk Metropolis-Hastings move on log(λ).

Step 3. Update (σ²)^(t−1) via a Gibbs step. Given the inverse-gamma(a, b) prior for σ², the posterior conditional distribution given all the other parameters is inverse-gamma with shape parameter a + n/2 and scale parameter b + RSS/2, where

\[
\mathrm{RSS} = \sum_{i=1}^{n}\Big[y_i - \beta_0 - \Big\{\sum_{j=1}^{p}(\beta_j x_{ij})^{\lambda}\Big\}^{1/\lambda}\Big]^2.
\]

Note that Step 2 could be implemented with any other choice of prior for λ, owing to the generality of MH. Step 3, however, needs a conjugate prior for σ², because the Gibbs sampler draws from the corresponding full conditional distribution; it is in fact a particular choice of MH proposal leading to an acceptance ratio of 1. (A sketch of this conjugate update follows.)
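Step 3 is the only fully conjugate update; a minimal sketch of the inverse-gamma draw, assuming the current model mean has already been computed:

```python
# Sketch: the conjugate Gibbs update for sigma^2 in Step 3, under an
# inverse-gamma(a, b) prior (a = b = 0.0001 in the simulation).
import numpy as np

rng = np.random.default_rng(0)

def update_sigma2(y, mu, a=1e-4, b=1e-4):
    # mu = beta0 + { sum_j (beta_j x_ij)^lam }^(1/lam), computed per row
    rss = np.sum((y - mu) ** 2)
    shape, scale = a + len(y) / 2.0, b + rss / 2.0
    # if 1/sigma^2 ~ Gamma(shape, rate=scale), then sigma^2 ~ IG(shape, scale)
    return 1.0 / rng.gamma(shape, 1.0 / scale)
```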
The trace plots of the MCMC output in Figure 4.2 show that the above MCMC approach works well for the simulated data set under this specific prior setting. In each panel of Figure 4.3, the true value, marked by the solid vertical line, is covered by the corresponding 95% equal-tailed credible interval, although for some parameters the true value lies close to the lower or upper end of the interval.

Figure 4.2: Algorithm I: MCMC trace plots for βj, j = 0, 1, ..., p, and λ with diffuse priors.

Figure 4.3: Algorithm I: marginal posterior densities for βj, j = 0, 1, ..., p, and λ with diffuse priors σ²_{β0} = σ²_β = σ²_λ = 100.

Remarks on the algorithm:

First, the step sizes used to produce candidate values in Steps 1 and 2 should be adjusted as the sample size changes. Basically, we tune the step sizes to obtain relatively high acceptance rates, about 70% for Step 1 and about 50% for Step 2.

Second, when the priors of the parameters βj, j = 0, ..., p, and λ are changed to be more informative (smaller variance), the outcomes do not change much. Comparing the density plots in Figures 4.3 and 4.4, there is no serious difference between the distributions of posterior samples under the different priors. This differs from Gustafson et al. (2005), where the prior on β0 does influence the Bayesian inference.

Figure 4.4: Algorithm I: posterior densities for βj, j = 0, 1, ..., p, and λ with informative priors σ²_{β0} = σ²_β = σ²_λ = 0.4.

Third, as suggested in Gustafson et al. (2005), the algorithm above can easily be extended to the more general case with no positivity constraint on the sign of βj, by replacing βj xj in model (4.1) with g(xj, βj), where

\[
g(x,\beta) = \begin{cases} |\beta|\,x, & \beta \ge 0,\\ |\beta|\,(1-x), & \beta < 0. \end{cases}
\]

To remove the positivity constraint on the βj's, we need the assumption that the Xj's are bounded within [0, 1]; in practice, transformation of the predictors may be necessary. In what follows, for simplicity, we focus on βj's with positivity constraints and leave the unconstrained extension to future work.

4.2.2 Two-group diffuse interaction models

As one possible extension mentioned but not pursued in Gustafson et al. (2005), we study a more complicated model in this subsection. Say

\[
Y \mid X = x \;\sim\; N\Big(\beta_0 + \sum_{j\in \mathrm{ADD}} \beta_j x_j + \Big\{\sum_{j\in \mathrm{ANT}} (\beta_j x_j)^{\lambda}\Big\}^{1/\lambda},\ \sigma^2\Big), \tag{4.2}
\]

where ADD consists of the indices of the predictors belonging to the additive group and ANT is the set of indices of the predictors interacting in the antagonistic direction. (If we suspected that some of the explanatory variables interact synergistically, we would replace the antagonistic group by a synergistic group by setting λ < 1.) In this subsection we take λ to be fixed. Let I = {Ik}, k = 1, ..., p, where Ik = 1 if k ∈ ADD and Ik = 2 if k ∈ ANT. The parameters of this model are therefore ({βj}_{j=0}^p, {Ij}_{j=1}^p, σ²), so that, with λ fixed, we have p − 1 more parameters to update than in the simple case (4.1).

To explore the MCMC approach to model (4.2), we carry out the following simulation study. As before, set the number of predictors to p = 12 and the sample size to n = 2000. Generate each predictor variable from a Bernoulli(0.2) distribution. For simplicity, set the true values of the βj's (j = 1, ..., p) all equal to 0.7 and β0 = 0. The true value of λ is set to 2; that is, all the predictors in the interaction group interact in the antagonistic direction. The first six predictors X1, ..., X6 are set to belong to the ADD group, while the others belong to the ANT group. The starting values of the indicator variables are set to the worst possible case: each Ik starts at the opposite of its true value. Then we generate the response variable Yi, i = 1, ..., n, from a normal distribution with mean β0 + Σ_{j∈ADD} βj x_{ij} + {Σ_{j∈ANT} (βj x_{ij})^λ}^{1/λ} and variance σ² = 1.
For each j ∈ {1, ..., p}, the prior on Ij is the uniform distribution over {1, 2}. The prior on β0 is N(0, σ²_{β0}) with σ²_{β0} = 100, and for j = 1, ..., p the prior on βj is N⁺(0, σ²_β) with σ²_β = 100, where N⁺(μ, σ²) again denotes the N(μ, σ²) distribution truncated to non-negative values.

MCMC Algorithm II:

Step 1. Update the coefficients in the additive group, denoted β_ADD, together with β0. Given I and the coefficients of the predictors in the ANT group, denoted β_ANT, it is easy to verify that the posterior of γ = (β0, β'_ADD)' is proportional to

\[
\exp\Big[-\tfrac{1}{2\sigma^2}\,(\gamma-\hat\gamma)'\,(X_1'X_1 + D)\,(\gamma-\hat\gamma)\Big]\;1\{\beta_{\mathrm{ADD}} > 0\},
\qquad \hat\gamma = (X_1'X_1+D)^{-1}X_1'Y_1.
\]

This implies that (β0, β'_ADD)' follows a multivariate normal N( (X1'X1 + D)^{-1} X1'Y1, σ²(X1'X1 + D)^{-1} ) truncated by β_ADD > 0, where X1 = (1, X_ADD), Y1 = Y − Y2, the i-th element of Y2 is {Σ_{j∈ANT} (βj x_{ij})^λ}^{1/λ}, and D = σ² diag{σ_{β0}^{-2}, σ_β^{-2}, ..., σ_β^{-2}}. Here X_ADD denotes the design matrix whose columns hold the values of the explanatory variables in the additive group; analogously, X_ANT is the design matrix whose columns hold the observations of the predictors in the antagonistic group.

Step 2. Update the intercept together with the coefficients in the antagonistic group. Subtracting the contribution of the predictors in the additive group, we can now pretend there is only one group and all predictors interact antagonistically. Specifically, we apply Algorithm I, devised for the one-group diffuse interaction model, to (Y', X_ANT, β_ANT, β0), where Y' = Y − X_ADD β_ADD.

Step 3. Update (βk, Ik) together. We generate the candidate values (βk*, Ik*) as follows: Ik* = 3 − Ik, and log(βk*) ~ N(log(βk⁰), τ²). That is, the candidate value of Ik is simply the opposite of the current value (if Ik = 1, then Ik* = 2), and the random walk on the log scale of βk respects the positivity constraints in model (4.2). There is more than one reasonable criterion for determining βk⁰; two of them are discussed after the remarks on Algorithm II. At this stage, write βk⁰ in a general way as βk⁰ = h_{Ik,Ik*}(βk), where h is a deterministic function with a reversibility property: h is a 1-1 mapping on [0, ∞) and h⁻¹ = h, so that h_{Ik*,Ik}(h_{Ik,Ik*}(βk)) = βk. The acceptance ratio is then

\[
r = \frac{\pi(I^*,\beta^*)}{\pi(I,\beta)}\cdot
\frac{\phi\big(\{\log\beta_k - \log h_{I_k^*,I_k}(\beta_k^*)\}/\tau\big)}{\phi\big(\{\log\beta_k^* - \log h_{I_k,I_k^*}(\beta_k)\}/\tau\big)}\cdot
\frac{\beta_k^*}{\beta_k}, \tag{4.3}
\]

where π(·) is the joint posterior density of I and β for given σ² and φ is the density function of the standard normal distribution. According to Step 3 of Algorithm II, P(Ik | Ik*) = P(Ik* | Ik) = 1, so these proposal probabilities cancel.

Remarks: First, the fraction βk*/βk in (4.3) is the Jacobian of the transformation, arising because the proposed value is generated on the log scale of βk. Second, if τ² = 0, then the proposed value of βk is exactly βk⁰. However, the acceptance ratio then becomes somewhat intractable, because it takes effort to evaluate

\[
\lim_{\tau\to 0}\,
\frac{\phi\big(\{\log\beta_k - \log h_{I_k^*,I_k}(\beta_k^*)\}/\tau\big)}{\phi\big(\{\log\beta_k^* - \log h_{I_k,I_k^*}(\beta_k)\}/\tau\big)}
= \frac{\phi\big(\varepsilon\,\beta_k^0\, h'_{I_k^*,I_k}(\beta_k^0)/\beta_k\big)}{\phi(\varepsilon)}, \tag{4.4}
\]

where βk⁰ = h_{Ik,Ik*}(βk) and ε follows a standard normal distribution. (A sketch of the log-scale move, with the Jacobian correction made explicit, follows.)
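A minimal sketch of the Step 3 log-scale move for βk, with the Jacobian factor of (4.3) explicit; `log_post` and the group-switch map `h` are placeholders for the joint posterior and h_{Ik,Ik*}, not thesis code:

```python
# Sketch: MH move for a positive parameter on the log scale, centred at
# h(beta), with the beta*/beta Jacobian term from (4.3). h is assumed to
# be its own inverse (h o h = identity), as in Step 3.
import numpy as np

rng = np.random.default_rng(0)

def mh_logscale(beta, log_post, h, tau=0.2):
    beta_star = np.exp(np.log(h(beta)) + tau * rng.normal())
    z_fwd = (np.log(beta_star) - np.log(h(beta))) / tau       # forward proposal
    z_rev = (np.log(beta) - np.log(h(beta_star))) / tau       # reverse proposal
    log_r = (log_post(beta_star) - log_post(beta)
             - 0.5 * z_rev**2 + 0.5 * z_fwd**2
             + np.log(beta_star) - np.log(beta))              # Jacobian
    return beta_star if np.log(rng.random()) < log_r else beta
```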
Moreover, tuning τ to be larger helps to speed up the convergence of the samples of the βj's.

In the following, we demonstrate two ways to determine βk⁰, i.e., h_{Ik,Ik*}(βk).

Proposal 1: choose βk⁰ to keep the average effect of Xk unchanged. Since all the Xj's are binary in the simulation study, based on Definition 1 we may think of the average effect of Xk as E(Y | Xk = 1, X_(k)) − E(Y | Xk = 0, X_(k)). Therefore, the average effect is βk if k ∈ ADD, and E{(βk^λ + Z)^{1/λ}} − E{Z^{1/λ}} if k ∈ ANT, where

\[
Z = \sum_{j \in \mathrm{ANT}-\{k\}} (\beta_j X_j)^{\lambda}.
\]

Let z_i = Σ_{j∈ANT−{k}} (βj x_{ij})^λ. If Ik = 1, then Ik* = 2, and βk⁰ is the solution to

\[
n^{-1}\sum_{i=1}^{n}\{(\beta_k^0)^{\lambda} + z_i\}^{1/\lambda} - n^{-1}\sum_{i=1}^{n} z_i^{1/\lambda} = \beta_k. \tag{4.5}
\]

If Ik = 2, then Ik* = 1, and βk⁰ is given by

\[
\beta_k^0 = n^{-1}\sum_{i=1}^{n}\{\beta_k^{\lambda} + z_i\}^{1/\lambda} - n^{-1}\sum_{i=1}^{n} z_i^{1/\lambda}. \tag{4.6}
\]

In both equations, the left side is the estimated average effect of Xk after the change of Ik to Ik*, while the right side is the estimated average effect before the change. Here the estimated average effect is the sample mean of the average effect evaluated at each realization of X_(k). The motivation for this proposal is that average effect estimates tend to be robust, as discussed in Gustafson et al. (2005).

Proposal 2: choose βk⁰ to keep αk (= βk/λ_{Ik}) unchanged, where λ1 = 1 and λ2 = λ. That is, βk⁰ = λβk if Ik = 1, and βk⁰ = λ⁻¹βk if Ik = 2.

(A numerical sketch of (4.5) and (4.6) is given after the comparison below.)

Based on the plots of the MCMC output in Figures 4.5 and 4.6, the algorithm using Proposal 1 works well, although for a couple of the βj's the true value falls close to the tail of the posterior density.

Figure 4.5: Algorithm II: using Proposal 1, MCMC trace plots for βj, j = 1, ..., p, with diffuse priors σ²_{β0} = σ²_β = 100.

Figure 4.6: Algorithm II: using Proposal 1, posterior densities for βj, j = 1, ..., p, with diffuse priors σ²_{β0} = σ²_β = 100.

To compare the performance of the samplers under the two proposals in Step 3, we plot the number of correct group allocations, since the true group allocation is known in the simulation study. Figure 4.7 shows that Proposal 1, that is, solving for βk⁰ by keeping the average effect unchanged, yields the right group allocations after only a couple of iterations, whereas it takes many more iterations under Proposal 2, where βk⁰ is obtained by keeping αk unchanged. To make this point clearer, we also plot the autocorrelation coefficients up to lag 40 for the MCMC samples under the two proposals, as shown in Figures 4.8 and 4.9 respectively. In Figure 4.9, for some βj's (β6 to β11), there is still some dependence even at lag 40, while in Figure 4.8 the serial dependence drops close to zero after a small number of lags for all βj's. This implies that the convergence rate under Proposal 1 is faster than that under Proposal 2.
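Equation (4.6) is explicit, while (4.5) must be solved numerically. A minimal sketch using one-dimensional root finding (the bracket is ad hoc; function names are ours):

```python
# Sketch: Proposal 1. Solve (4.5) for beta_k^0 when X_k moves from the
# additive group to the antagonistic group; (4.6) for the reverse move.
# 'z' holds z_i = sum_{j in ANT-{k}} (beta_j x_ij)^lam for each observation.
import numpy as np
from scipy.optimize import brentq

def beta0_add_to_ant(beta_k, z, lam):
    base = np.mean(z ** (1.0 / lam))
    effect = lambda b: np.mean((b ** lam + z) ** (1.0 / lam)) - base
    # effect(b) is increasing with effect(0) = 0, so a root exists for
    # beta_k > 0; the upper bracket 50 is an arbitrary safeguard.
    return brentq(lambda b: effect(b) - beta_k, 1e-8, 50.0)

def beta0_ant_to_add(beta_k, z, lam):
    # (4.6): the new additive coefficient equals the old average effect
    return np.mean((beta_k ** lam + z) ** (1.0 / lam)) - np.mean(z ** (1.0 / lam))
```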
Figure 4.7: Algorithm II: comparison of the two proposals: number of correct group allocations based on the posterior samples of I (left: Proposal 1; right: Proposal 2).

Figure 4.8: Algorithm II: with Proposal 1, the autocorrelation curves for the posterior samples of βj, j = 1, ..., p, respectively.

Figure 4.9: Algorithm II: with Proposal 2, the autocorrelation curves for the posterior samples of βj, j = 1, ..., p, respectively.

Remarks on MCMC Algorithm II:

First, in Step 3 we only allow moves within the part of the parameter space in which at least two predictors fall in the antagonistic/synergistic interaction group. The reason for this constraint is that the term "interaction" has real meaning only when at least two predictors are involved. We could remove this constraint later; more is said about this in the discussion of future work.

Second, the priors on the βj's do not strongly affect the posterior distribution. Changing the hyperparameter σ²_β to 0.4, which is rather informative (narrow) compared with the previous value of 100, produces little difference between the density plots in Figure 4.6, based on diffuse priors, and those from the informative prior shown in Figure 4.10. In addition, for both kinds of priors, the posterior samples of Ij, j = 1, ..., p, agree with the true values with high frequency.

Figure 4.10: Algorithm II: with Proposal 1, the posterior densities of samples under informative priors σ²_{β0} = σ²_β = 0.4.

Third, the true values of the βj's (j ∈ ANT) do influence the posterior information about the group structure. Larger βj's (j ∈ ANT) lead to more posterior concentration around the true value of I, which governs the group allocation, while smaller βj's lead to less posterior concentration around the true group structure. Consider the extreme case in which all the βj, j ∈ ANT, are very small, say close to 0; then the conditional expectation of the response given all the explanatory variables is approximately β0 + Σ_{j∈ADD} βj Xj. In other words, the diffuse interaction model reduces to an additive model in the explanatory variables Xj, j ∈ ADD.¹ Let us make this point clearer with a simulation study.

¹ In fact, if the βj (j ∈ ANT) are too small, the numerical solution of (4.6) can be zero, which causes difficulty for the random walk on the log scale of βj.
Set the true value of each βj (j ∈ ANT) to 0.4, while the true value of the coefficients in the additive group remains 0.7, the same as in the previous simulation study. Checking the posterior sample of each indicator variable over the first 10,000 iterations, we obtain the results below. Clearly, as shown in Figure 4.11 and in Table 4.1 as well, we get more posterior mass on wrong group allocations with the smaller value of the βj's, and less with the larger value.

Figure 4.11: Algorithm II: comparison of the number of correct group allocations based on the posterior samples of I for different values of βj (j ∈ ANT): top panel with the smaller value (0.4), bottom panel with the larger value (0.7).

Table 4.1: Algorithm II: frequency table based on the posterior samples of each component of I, for different values of βj, j ∈ ANT.

βj = 0.4
        I1    I2    I3    I4    I5    I6    I7    I8    I9    I10   I11   I12
1      1.00  1.00  1.00  0.73  1.00  1.00  0.01  0.01  0.00  0.00  0.00  0.11
2      0.00  0.00  0.00  0.27  0.00  0.00  0.99  0.99  1.00  1.00  1.00  0.89

βj = 0.7
        I1    I2    I3    I4    I5    I6    I7    I8    I9    I10   I11   I12
1      1.00  1.00  1.00  1.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
2      0.00  0.00  0.00  0.00  0.00  0.00  1.00  1.00  1.00  1.00  1.00  1.00

Fourth, there is high posterior dependence between Ik and βk for each k; the blocking of (Ik, βk) in Step 3 is motivated by this fact. We tried updating Ik and βk separately (for each k) and found that the sampler got stuck, so that it could not move efficiently around the whole parameter space. The reason is that, with separate updates, the posterior distribution of Ik given βk concentrates strongly on one particular value, and vice versa. In other words, the updates of βk and Ik affect each other a great deal, and we should therefore block the two parameters. In the simulation study for MCMC Algorithm II, changing the true values of βj, j ∈ ANT, to 0.4 while keeping βj, j ∈ ADD, unchanged, Figure 4.12 depicts the dependence between Ik and βk more clearly. The top panel is the unconditional posterior density of the samples of β4. The middle panel is the posterior density of β4 | I4 = 1, that is, the samples of β4 for which the corresponding sample of I4 equals 1. The bottom panel is the posterior density of the samples of β4 | I4 = 2. The modes of the samples within the two groups clearly differ, and because of the considerable distance between the two modes, the density estimated from all the samples of β4 is bimodal, as the top panel indicates.

Figure 4.12: Algorithm II: MCMC output for β4 with true value 0.4: the top panel is the posterior density over the entire sequence of 10,000 iterations, the middle panel is the posterior density of the subsequence β4 | I4 = 1, and the bottom panel is the posterior density of the subsequence β4 | I4 = 2.

Fifth, the estimate Îk of the indicator variable Ik may be taken as the value sampled with the highest frequency. Here we should be careful about the Bayesian inference for the βk's.
For each βk, the point estimate and 95% equal-tailed credible interval can be obtained from the samples belonging to the Îk group only, rather than from the whole sample sequence of βk. Explicitly,

\[
\hat\beta_k = \frac{\sum_{i:\,I_k^{(i)}=\hat I_k} \beta_k^{(i)}}{\big|\{i: I_k^{(i)}=\hat I_k\}\big|},\qquad
L_{0.025} = 2.5\%\ \text{quantile of}\ \{\beta_k^{(i)} : I_k^{(i)} = \hat I_k\},\qquad
U_{0.975} = 97.5\%\ \text{quantile of}\ \{\beta_k^{(i)} : I_k^{(i)} = \hat I_k\},
\]

where βk^(i) and Ik^(i) are the i-th samples of βk and Ik respectively, |A| denotes the number of elements in the set A, and L_{0.025} and U_{0.975} are the lower and upper bounds of a 95% credible interval. Alternatively, one could base the inference on all the samples, since the true I is unknown in practice. In that case, if Îk is sampled with much higher frequency than the other values of Ik, the estimate of βk based on samples from the Îk group only should be close to that based on all the samples, and the credible intervals should be close to each other as well. On the other hand, if Îk does not have a dominant frequency, the estimate based on all the samples of βk will differ somewhat from that based on the Îk group only, and the credible interval in the former case will be wider.
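A minimal sketch of these group-conditional summaries; `beta_draws` and `i_draws` are hypothetical arrays of post burn-in samples of βk and Ik:

```python
# Sketch: point estimate and 95% equal-tailed interval for beta_k,
# conditional on the most frequently sampled group label I_hat.
import numpy as np

def group_conditional_summary(beta_draws, i_draws):
    values, counts = np.unique(i_draws, return_counts=True)
    i_hat = values[np.argmax(counts)]        # most frequently sampled group
    sub = beta_draws[i_draws == i_hat]       # samples from that group only
    return {"I_hat": int(i_hat),
            "beta_hat": sub.mean(),
            "ci95": (np.quantile(sub, 0.025), np.quantile(sub, 0.975))}
```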
4.3 Discussion

Beyond the two diffuse interaction models discussed above, we could consider more complicated models.

1. As a direct extension of model (4.2), we could allow λ to vary as well. However, a correspondingly direct extension of MCMC Algorithm II is not trivial to achieve. We are aware that (Ik, βk, λ) are closely associated for each k ∈ {1, ..., p}, so we need to devise a good joint proposal for the triple. To make use of the previous algorithm for two groups with λ known, we propose the following.

MCMC Algorithm III:

Step 1. Update β_ADD, the coefficients in the additive group, together with the intercept β0.

Step 2. Update β_ANT, the coefficients in the antagonistic group, together with the intercept β0. (These two steps have exactly the same structure as in Algorithm II.)

Step 3. For k = 1, ..., p, update (βk, Ik) as a block in the same way as in Algorithm II; that is, Ik* is the opposite of the current value of Ik, and βk* is proposed by adding noise to βk⁰, which leaves the average effect of Xk unchanged for the given value of λ.

Step 4. Update λ, given the other parameters, using an MH update of log(λ).

Step 5. Update σ² via a Gibbs step.

From the trace plot of the λ samples (top panel of Figure 4.15), we can see that the MCMC samples of λ have not reached the stationary distribution within the first 10,000 iterations. We increased the number of iterations and the chain still did not mix very well for λ. We may need to adjust the step size for the λ update or, better, find a more efficient way to propose candidates for λ. Moreover, according to the autocorrelation plot of the λ samples, the convergence is very slow, since the autocorrelation at lag 40 is still rather large. The situation is similar for the posterior samples of some βj's, as displayed in Figure 4.16. This implies that Algorithm III still needs improvement in terms of speeding up convergence. Still, the algorithm seems promising, since we obtain eight out of twelve correct estimates of the indicator variables in the simulated example, as shown in Table 4.2.

Figure 4.13: Algorithm III: trace plots for βj, j = 1, ..., p, under the two-group diffuse interaction model with λ unknown.

Figure 4.15: Algorithm III: plots of the posterior samples of λ: trace plot (top), posterior density (middle), autocorrelation curve (bottom).

Table 4.2: Algorithm III: frequency table based on the posterior samples of each component of I under the two-group diffuse interaction model with λ unknown.

        I1    I2    I3    I4    I5    I6    I7    I8    I9    I10   I11   I12
1      0.63  0.40  0.55  0.48  0.28  0.66  0.34  0.40  0.46  0.39  0.34  0.54
2      0.37  0.60  0.45  0.52  0.72  0.34  0.66  0.60  0.54  0.61  0.66  0.46
true    1     1     1     1     1     1     2     2     2     2     2     2

Figure 4.16: Algorithm III: autocorrelation curves for βj, j = 1, ..., p, under the two-group diffuse interaction model with λ unknown.

2. We can include more groups in the model, say additive, synergistic, and antagonistic, with the corresponding λ's known, for instance λ1 = 1, λ2 = 4/5, λ3 = 5/4 respectively. (The reason for choosing λ2 and λ3 close to 1 is stated later.) In this scenario, each Ik takes three possible values, so we cannot simply flip the current value of Ik in the joint proposal for (Ik, βk); we need a more complicated jumping rule for the update of Ik. To apply the previous two-group algorithm to the present three-group situation, an easy-to-extend algorithm is as below; a sketch of the three-group mean function appears after the algorithm.

MCMC Algorithm IV:

Step 1. Update the coefficients in the additive group together with the intercept.

Step 2. Update the coefficients in the synergistic group together with the intercept.

Step 3. Update the coefficients in the antagonistic group together with the intercept.

Step 4. For k ∈ SYN ∪ ADD, update (βk, Ik) by applying Step 3 of Algorithm II to the working response Y − v_ANT, where the i-th element of v_ANT is {Σ_{j∈ANT} (βj x_{ij})^{λ3}}^{1/λ3} and X_ANT is the design matrix whose columns hold the observations of the predictors in the antagonistic group. That is, by subtracting the contribution of the antagonistic group from the response, we can pretend we have only two groups, additive and synergistic.

Step 5. For k ∈ ANT ∪ ADD, update (βk, Ik) by applying Step 3 of Algorithm II to the working response Y − v_SYN, where the i-th element of v_SYN is {Σ_{j∈SYN} (βj x_{ij})^{λ2}}^{1/λ2} and X_SYN is the design matrix for the synergistic group. Again, by taking away the given contribution of the synergistic group, we may imagine having only two groups, additive and antagonistic.

Step 6. Update σ² via a Gibbs step.
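For concreteness, a minimal sketch of the three-group mean function underlying Algorithm IV; the working responses in Steps 4 and 5 simply subtract the corresponding group term. The encoding of the labels (1 = additive, 2 = synergistic, 3 = antagonistic) is ours:

```python
# Sketch: E(Y|x) = b0 + sum_ADD b_j x_j
#                 + {sum_SYN (b_j x_j)^l2}^(1/l2)
#                 + {sum_ANT (b_j x_j)^l3}^(1/l3).
import numpy as np

def three_group_mean(X, beta0, beta, groups, lam2=4/5, lam3=5/4):
    mu = np.full(X.shape[0], beta0, dtype=float)
    add, syn, ant = (groups == 1), (groups == 2), (groups == 3)
    mu += X[:, add] @ beta[add]
    if syn.any():
        mu += ((beta[syn] * X[:, syn]) ** lam2).sum(axis=1) ** (1 / lam2)
    if ant.any():
        mu += ((beta[ant] * X[:, ant]) ** lam3).sum(axis=1) ** (1 / lam3)
    return mu
```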
Steps 4 and 5 in Algorithm IV allow jumps between the additive and synergistic groups, and jumps between the additive and antagonistic groups. In other words, the big jump between the synergistic and antagonistic groups is broken into two smaller jumps, which are easier to achieve. Unfortunately, one obvious drawback is that at each iteration we need to update |ADD| + p pairs (Ik, βk), where |ADD| denotes the number of elements in the ADD group, since the index k runs over the indices in ADD twice, in Steps 4 and 5. We therefore propose the following algorithm, which needs to propose just p pairs (βk, Ik) at each iteration.

MCMC Algorithm V:

Step 1. Update the coefficients in the additive group together with the intercept.

Step 2. Update the coefficients in the synergistic group together with the intercept.

Step 3. Update the coefficients in the antagonistic group together with the intercept.

Step 4. For k = 1, ..., p, update (βk, Ik) as a block. The proposed value Ik*, different from Ik, is drawn with probability

\[
P(I_k^* \mid I_k) = \frac{\pi(I_k^* \mid I_{[-k]},\, \beta,\, \sigma^2)}{1 - \pi(I_k \mid I_{[-k]},\, \beta,\, \sigma^2)},
\]

where π is the joint posterior density of I, β, σ². This is the Metropolized Gibbs sampler of Liu (1996), who proves that this update is more efficient than the plain Gibbs move, in which the proposed value of Ik, possibly equal to the current value, is sampled from π(· | I_[−k], β, σ²). Then βk* is proposed so as to leave the average effect of Xk unchanged, the same idea as before. Hence the acceptance probability has the same form as (4.3), with the proposal probabilities P(Ik* | Ik) and P(Ik | Ik*) included in the ratio. (A sketch of the Metropolized Gibbs move for a three-valued label follows.)
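A minimal sketch of the Metropolized Gibbs move of Liu (1996) for a three-valued label, simplified to the label move alone (in Algorithm V the joint proposal for βk enters the acceptance ratio as well); `cond` holds the precomputed full-conditional probabilities π(Ik = g | I_[−k], β, σ²):

```python
# Sketch: propose j != i_cur with probability cond[j] / (1 - cond[i_cur]),
# accept with probability min(1, (1 - cond[i_cur]) / (1 - cond[i_star])).
import numpy as np

rng = np.random.default_rng(0)

def metropolized_gibbs(i_cur, cond):
    probs = cond.copy()
    probs[i_cur] = 0.0              # never propose the current value
    probs /= probs.sum()            # equals cond[j] / (1 - cond[i_cur])
    i_star = rng.choice(len(cond), p=probs)
    accept = min(1.0, (1 - cond[i_cur]) / (1 - cond[i_star]))
    return i_star if rng.random() < accept else i_cur
```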
Figure 4.17: Algorithm V: trace plots of the MCMC samples of βj, j = 1, ..., p, under the three-group diffuse interaction model with λ2 = 4/5, λ3 = 5/4.

Figure 4.18: Algorithm V: posterior densities of the MCMC samples of βj, j = 1, ..., p, under the three-group diffuse interaction model with λ2 = 4/5, λ3 = 5/4.

Figure 4.19: Algorithm V: autocorrelation curves of the MCMC samples of βj, j = 1, ..., p, under the three-group diffuse interaction model with λ2 = 4/5, λ3 = 5/4.

Remarks on MCMC Algorithm V:

First, note that the magnitudes of λ2 and λ3 have an effect. If λ2 and λ3 are close to 1, the jump between different groups, i.e., the update in Step 4, is easier to make. Otherwise, a change of group may cause a big change in the posterior density, which leads to a small acceptance ratio; the proposed values are then more likely to be rejected, which makes the algorithm inefficient. We also checked the performance of Algorithm V with all the λ's set equal to 1. Under this setting, the posterior distribution of Ik, for each given k ∈ {1, ..., p}, is almost the uniform distribution over {1, 2, 3}. This is consistent with our intuition, since when λ1 = λ2 = λ3 = 1 the three groups are actually identical to one another.

Second, according to the trace plots shown in Figure 4.17, the sampler mixes well. However, Figure 4.19 tells us that there is still some scope for improving the convergence rate: the sample dependence does not drop close to zero by lag 40 for some βk's, such as β5 and β11. One thing worth noticing is the triple mode in almost all the density plots of the βk's. The reason is that, for each k, the posterior distributions of βk conditional on the different values of Ik may have different (up to three) modes. This is further evidence of the high dependence between βk and Ik.

Table 4.3: Algorithm V: frequency table based on the posterior samples of each component of I under the three-group diffuse interaction model.

        I1    I2    I3    I4    I5    I6    I7    I8    I9    I10   I11   I12
1      0.33  0.31  0.34  0.31  0.33  0.33  0.29  0.28  0.31  0.29  0.26  0.28
2      0.23  0.19  0.18  0.24  0.17  0.34  0.38  0.43  0.31  0.28  0.26  0.43
3      0.44  0.51  0.49  0.46  0.50  0.33  0.32  0.29  0.38  0.43  0.48  0.29
true    1     1     1     1     2     2     2     2     3     3     3     3

Third, we also plot the number of correct group allocations. Most of the samples of I have four to six components correctly allocated, and, promisingly, Figure 4.20 shows that some samples of I achieve 10 or 11 correct allocations. Moreover, Figure 4.21 shows the posterior densities of the samples of βk conditional on the true group allocation of the k-th predictor, k = 1, ..., p. We find that for each βk the samples concentrate closely around its true value, a good sign that Algorithm V is working.

Figure 4.20: Algorithm V: number of correct group allocations based on the posterior samples of I under the three-group diffuse interaction model.

Figure 4.21: Algorithm V: posterior densities of the samples of βj conditional on Ij being correctly allocated (j = 1, ..., p) under the three-group diffuse interaction model.

4.4 Example

To provide an illustrative example of fitting the diffuse interaction model to a real data set, we consider the abalone growth data set available from the UCI Machine Learning Repository (Newman et al., 1998). The response variable Y is the age of the abalone. Ignoring the single categorical explanatory variable (sex), we take X to be the seven continuous explanatory variables (length, diameter, height, whole weight, shucked weight, viscera weight, shell weight). The data are observed on n = 4177 specimens. To find the overall direction of interaction among these variables, we apply Algorithm I to make inference. One thing worth noting is the release of the positivity constraints on the βj's, using the model with g(xj, βj) in place of βj xj mentioned in Section 4.2; as a consequence, all the observations of the Xj's are scaled to [0, 1]. The chosen priors are βj ~ N(0, 0.5), j = 0, 1, ..., p, log λ ~ N(0, 0.5), and σ² ~ inverse-gamma(0.0001, 0.0001). We also tried rather diffuse priors, replacing the small variance 0.5 by the larger value 50, and saw no serious difference in the output. Figure 4.22 gives trace plots of the posterior samples of βj, j = 0, 1, ..., p, which show that the MCMC algorithm worked well. The solid curves in Figure 4.23 are posterior densities of the average effects δ1, ..., δp, and the dashed lines are posterior densities of β1, ..., βp.
We can see that most of the δj's are slightly smaller than the corresponding βj's, although for X5 it is the other way around. This suggests that the overall interaction among X is antagonistic, which is consistent with the inference we can make based on the posterior sample of λ. The posterior mean of λ is 1.19, and the 95% equal-tailed credible interval is (0.92, 1.46). The posterior probability that λ > 1 is 0.93. Hence we have evidence of the presence of antagonism among X. The posterior density plot for λ is given in Figure 4.24.

For a better understanding of the extent of the antagonism, we define the relative antagonistic effect as

\[
\mathrm{RAE}(x) = \frac{g(x) - g(0) - \sum_{j=1}^{p}\{g(x_j \mathbf{1}_j) - g(0)\}}{\sum_{j=1}^{p}\{g(x_j \mathbf{1}_j) - g(0)\}}, \tag{4.7}
\]

where 1_j denotes the p × 1 vector of zeros except that the j-th element is 1. The numerator is the difference between the joint effect of X and the sum of the separate effects of each Xj, j = 1, ..., p, and the denominator is the sum of the separate effects. Averaging this ratio over the joint distribution of X, we get the average relative antagonistic effect (ARAE). Using the empirical distribution of X (the true distribution of X being unknown), we compute the ARAE from the posterior samples of βj, j = 0, ..., p, and λ; a graphical summary is given in the bottom panel of Figure 4.24, and a sketch of the computation is given at the end of this chapter. We again see evidence in favor of antagonism, since all samples of the ARAE are negative. Moreover, most samples of the ARAE lie in (−1, −0.6), which indicates a considerable degree of antagonism. The bimodality of the ARAE in the plot might be caused by the complexity of (4.7): the denominator is now also a function of λ, rather than the simple form Σ βj xj obtained when the βj's are confined to be positive. One thing worth mentioning is that MCMC provides an effective way to make inference about this complicated functional, the ARAE, from the posterior samples; this is one of the main reasons we pursue MCMC approaches for model fitting, as mentioned at the beginning of this chapter. Note that both of the definitions above apply equally to the synergistic case.

Figure 4.22: Abalone data: trace plots for βj, j = 0, 1, ..., p, over the whole sequence of 100,000 iterations, including the burn-in period.

Figure 4.23: Abalone data: density plots for βj and δj, j = 1, ..., p, from the sequence of 40,000 post burn-in iterations. Solid lines: average effects δj; dashed lines: βj.

Figure 4.24: Abalone data: density plot of λ (top panel) and of the average relative antagonistic effect (bottom panel), from the sequence of 40,000 post burn-in iterations.
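A minimal sketch of the ARAE computation for one posterior draw, written for the positive-coefficient case of model (4.1), where the separate effects reduce to βj xj; under the unconstrained g(x, β) version, the separate effects and g(0) would be computed from g accordingly. Function names are ours:

```python
# Sketch: ARAE (4.7) for a single posterior draw, averaged over the
# empirical distribution of X (rows of X). With beta_j > 0,
# g(x) - g(0) = {sum_j (beta_j x_j)^lam}^(1/lam) and the separate
# effects sum to sum_j beta_j x_j.
import numpy as np

def arae(X, beta, lam):
    joint = ((beta * X) ** lam).sum(axis=1) ** (1.0 / lam)
    separate = X @ beta
    ok = separate > 0                    # skip rows with no active predictor
    return np.mean((joint[ok] - separate[ok]) / separate[ok])

# posterior ARAE sample: evaluate arae(X, beta_draw, lam_draw) per draw
```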
Chapter 5
Summarization and future work

5.1 Conclusions

Overall, the findings are as follows. In Chapter 2, we study the consequences of fitting an additive regression model when a pairwise interaction model is assumed to be the true model. We obtain the asymptotic distributions of the average effect estimates based on the "misspecified" model and on the "true" model, as shown in Result 1 and Result 2 respectively. With the two large-sample limits from these results, we work out the consistency of the average effect estimates from the "misspecified" model under some easily checked conditions, which are given in Result 3. This result implies that the distribution of the risk factors does influence the size of the bias of the estimates under model misspecification, and it suggests that transformations of the risk factors, aiming toward normality, may help to reduce the bias of the average effect estimates. More generally, within the framework of spline regression models, we investigate the consequences of model misspecification arising from failure to incorporate interaction terms that are assumed to be part of the true model.

In Chapter 3, we introduce the diffuse interaction model, which is more powerful for detecting interaction than the pairwise interaction model, especially when the number of risk factors of interest is large. We compare the power to detect interactions under the two models in situations where the true model is assumed to be either the diffuse or the pairwise interaction model. To make the comparison tractable, large-sample and small-misspecification approximations are employed. To be specific, for whichever interaction model is assumed to be true, the true values of the parameters describing the magnitude of the interactions (λ in the diffuse interaction model, the βij's in the pairwise interaction model) lie within a local n^{-1/2} neighborhood of the values that imply no interaction at all, i.e., additive models. Therefore, as the sample size n goes to infinity, the model misspecification vanishes. Under a set of specific settings (shown in Section 3.2.2), we find that the power of the diffuse interaction model is superior to that of the pairwise interaction model regardless of whether the true model is the diffuse model itself. However, if the true model is the pairwise interaction model, detectability via the diffuse interaction model decreases as the "overall" strength of interaction among the risk factors weakens.

In Chapter 4, we develop an efficient MCMC algorithm for the one-group diffuse interaction model. We also investigate the possibility of generalizing the model away from the strong assumption that all risk factors interact in the same direction, i.e., synergistically or antagonistically. The more general but complex model has more parameters, since each risk factor has a corresponding indicator variable denoting the interaction group to which it belongs. With λ fixed, we have an efficient algorithm for the model with two groups of risk factors, one group having no interactions and the other interacting synergistically/antagonistically. We also see some hope of developing a good MCMC sampler for a more general model with three groups of risk factors, i.e., a no-interaction/additive group, a synergistic group and an antagonistic group.

5.2 Future work

In this section we discuss some interesting problems that could be studied in the future.

1. In terms of model misspecification ignoring interactions:
(a) In Section 2.3, the linear regression context, we find that under certain conditions, such as independence or joint normality of the risk factors, the average effect estimators based on the misspecified model are still consistent for the true values. But we need to know whether there are more general (possibly weaker) conditions that produce consistency in the face of model misspecification.

(b) We explored two examples in Section 2.3 to see how far the bias can be from zero. However, we need a more general investigation of the magnitude of the bias (or relative bias) as the joint distribution of (X1, ..., Xp) moves away from multivariate normality or independence.

(c) We could also study the consequences of omitting the interaction terms in the context of generalized linear models. In many epidemiological studies, the health outcome Y is binary or categorical (for example, diseased or not).

(d) The main results on average effects derived so far refer to Definition 2, that is, averaging predictive effects over the joint distribution of all predictors. What would the results be if other versions of the average effect were applied? Moreover, the idea of the average effect has more general applications, not confined to the context of regression or generalized regression. For example, we can also apply the idea in survival analysis by averaging the change in the hazard function, instead of the outcome/response variable, associated with a unit change in the putative risk factor.

2. In terms of MCMC algorithm development for more complex diffuse interaction models:

As suggested in Gustafson et al. (2005), a more general model can be built by partitioning the risk factors into three sets, additive, synergistic, and antagonistic, that is,

\[
E(Y \mid X_1, \dots, X_p) = \beta_0 + \sum_{j\in \mathrm{ADD}} \beta_j x_j
+ \Big\{\sum_{j\in \mathrm{SYN}} (\beta_j x_j)^{\lambda_s}\Big\}^{1/\lambda_s}
+ \Big\{\sum_{j\in \mathrm{ANT}} (\beta_j x_j)^{\lambda_a}\Big\}^{1/\lambda_a}, \tag{5.1}
\]

where 0 < λs < 1 < λa. Starting with the simplest diffuse interaction model, with all the risk factors in a single interaction group, we have an efficient MCMC sampler that does a good job, and MCMC also performs well for a diffuse model with two groups, additive and synergistic or antagonistic, with a fixed value of λ. However, for the more complicated situation, model (5.1), how to implement an efficient MCMC approach is still in progress. The challenge here is how to propose the indicator Ik and the corresponding βk together in an efficient way. In our algorithm, we only allow transitions between the additive group and a synergistic/antagonistic group; direct transitions between the two interaction groups are not allowed within a single move, because a direct change between the two interaction groups would cause a rather big change in the posterior density, leading to a low acceptance rate for the proposed (βk, Ik) pair. Although the simulation studies give some hope that the sampler mixes well, we still need a better one that mixes faster and leads to better estimation. To be more realistic, we would also add one more group, a no-effect group, to the current three-group interaction model. This is more challenging because it is even harder to construct efficient jumping schemes between the no-effect group and the other three groups.

3. In terms of model selection:
In analyses of linear regression with interactions, a stepwise strategy is often applied to choose a subset of interaction terms. The problem with the stepwise procedure is that it is computationally intractable when the number of risk factors is quite large, as discussed in Chapter 1. In the worst case, \(\tfrac{1}{2}\binom{p}{2}\big(\binom{p}{2}+1\big)\) possible models must be evaluated before obtaining the final model. Therefore, stepwise regression is not always a practical technique, and other selection techniques may have to be considered. Suppose we have a linear model with pairwise, and maybe even triple-wise, interaction terms. We could take this model and reparameterize from the original parameters θ to (φ, Λ), where φ are the average effects and Λ are nuisance parameters. Note that when Λ = 0, the interaction model is an additive model. Taking the pairwise interaction model (3.6) for instance, reparameterize the parameters θ as φ0 = β0, φj equal to the average effect of Xj, and Λ = {βij}, 1 ≤ i < j ≤ p. By setting up a prior for Λ that is quite concentrated around zero, we can carry out posterior inference for Λ and then pick out the important interaction terms instead of running stepwise procedures.

Appendix I
Proof of (2.14) in Section 2.3.1

Applying Cramer's rule to the (p+1) × (p+1) system of normal equations gives α1|D| = |D1|, where

\[
D = \begin{pmatrix} 1 & 0_{1\times p} \\ 0_{p\times 1} & \Sigma \end{pmatrix},
\qquad \Sigma = (\sigma_{ij})_{p\times p},\ \ \sigma_{ij} = \mathrm{Cov}(X_i, X_j),
\]

and D1 is D with its second column replaced by (E(Y), E(YX1), ..., E(YXp))'. Expanding |D1| by the cofactors of its first column leaves the determinant of Σ with its first column replaced by the vector of the E(YXl)'s.

Since Y = β0 + Σi βi Xi + Σ_{i<j} βij Xi Xj, each entry of the replaced column splits into a part from the main effects, Σi βi E(Xi Xl), and a part from the interactions, Σ_{i<j} βij E(Xi Xj Xl). By the linearity of the determinant in that column, we may write |D1| as the sum of two determinants,

\[
|D_1| = |D_1^{(1)}| + |D_1^{(2)}|, \tag{.1}
\]

where D1^(1) carries the main-effect entries and D1^(2) the interaction entries.

Consider |D1^(1)| first. Let Σ^{l1} denote the cofactor of the (l, 1) entry of Σ. Using E(Xi Xl) = Cov(Xi, Xl), expansion along the replaced column gives

\[
|D_1^{(1)}| = \sum_{i} \beta_i \big\{\mathrm{Cov}(X_i,X_1)\Sigma^{11} + \cdots + \mathrm{Cov}(X_i,X_p)\Sigma^{p1}\big\} = \beta_1 |\Sigma| = \beta_1 |D|, \tag{.2}
\]

since, for i ≠ 1, Cov(Xi, X1)Σ^{11} + ... + Cov(Xi, Xp)Σ^{p1} is the expansion of a determinant with two identical columns and therefore vanishes. Similarly,

\[
|D_1^{(2)}| = \sum_{i<j} \beta_{ij}\big\{\Sigma^{11} E(X_iX_jX_1) + \cdots + \Sigma^{p1} E(X_iX_jX_p)\big\}.
\]

Writing X̃ for a centered variable, the third moments decompose as E(XiXjX̃l) = E(X̃iX̃jX̃l) + E(Xi)Cov(Xj, Xl) + E(Xj)Cov(Xi, Xl); the covariance terms collapse as in (.2), yielding

\[
|D_1^{(2)}| = \sum_{i<j} \beta_{ij}\big\{\Sigma^{11} E(\tilde X_i\tilde X_j\tilde X_1) + \cdots + \Sigma^{p1} E(\tilde X_i\tilde X_j\tilde X_p)\big\} + \sum_{j>1} \beta_{1j} E(X_j)\,|\Sigma|. \tag{.3}
\]

Combining equations (.1), (.2) and (.3) and dividing by |D| gives equation (2.15), and hence (2.16) when p = 2.
Appendix II
Another consistent estimator of V* mentioned in Section 2.2

The following shows that the sandwich estimator is also a consistent estimator of V*. We work only on the simple case with two predictors; it is easy to extend the results from p = 2 to general p. In the fitted model, the log-likelihood of an observation is

\[
\log f_\theta(Y\mid X_1,X_2) = -\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log\tau^2 - \tfrac{1}{2\tau^2}(Y-\alpha_0-\alpha_1X_1-\alpha_2X_2)^2,
\]

while the true log-likelihood is

\[
\log g(Y\mid X_1,X_2) = -\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2\sigma^2}(Y-\beta_0-\beta_1X_1-\beta_2X_2-\beta_{12}X_1X_2)^2.
\]

Define the matrices

\[
A_n(\theta) = \Big\{n^{-1}\sum_{i=1}^n \partial^2\log f_\theta(Y_i\mid X_1=x_{1i},X_2=x_{2i})/\partial\theta_i\partial\theta_j\Big\},\qquad
B_n(\theta) = \Big\{n^{-1}\sum_{i=1}^n \partial\log f_\theta(\cdot)/\partial\theta_i\,\cdot\,\partial\log f_\theta(\cdot)/\partial\theta_j\Big\},
\]

with expectations A(θ) = E(∂² log f_θ(Y|X1,X2)/∂θi∂θj) and B(θ) = E(∂ log f_θ(Y|X1,X2)/∂θi · ∂ log f_θ(Y|X1,X2)/∂θj).

From White (1982) we know that the MLE of the parameter vector θ̂n = (α̂0, α̂1, α̂2, τ̂²) is a consistent estimator of θ*, the minimizer of the Kullback-Leibler Information Criterion (KLIC). That is,

\[
\hat\theta_n \to \theta_*\ \text{a.s.}, \tag{.1}
\]
\[
\sqrt{n}\,(\hat\theta_n - \theta_*) \to_d N\big(0,\, C(\theta_*)\big). \tag{.2}
\]

Intuitively, the KLIC measures our ignorance about the true model.

Remarks: (1) All expectations here are computed under the true density unless otherwise specified. (2) C(θ) is defined as C(θ) = A(θ)⁻¹ B(θ) A(θ)⁻¹, and An(θ)⁻¹ Bn(θ) An(θ)⁻¹ is a consistent estimator of C(θ). Because of its sandwich-like shape, this estimator is also called the "sandwich" estimator. Note that when the fitted model is correct, A(θ*) + B(θ*) equals zero; otherwise it may not, so A(θ*) + B(θ*) is a useful indicator of misspecification.

By some algebra, we have

\[
A(\theta_*) = -\begin{pmatrix} \tau_*^{-2}\,\Sigma_S & 0_{3\times 1} \\ 0_{1\times 3} & \tfrac{1}{2}\tau_*^{-4} \end{pmatrix},
\qquad \Sigma_S = E(SS') = E\{(1,X_1,X_2)'(1,X_1,X_2)\},
\]

and

\[
B(\theta_*) = \frac{1}{\tau_*^4}\begin{pmatrix} V^* & \gamma \\ \gamma' & \Delta \end{pmatrix},
\qquad
\gamma = \frac{1}{2\tau_*^2}\begin{pmatrix} E(\varepsilon^3)\\ E(\varepsilon^3 X_1)\\ E(\varepsilon^3 X_2)\end{pmatrix},
\qquad
\Delta = \tfrac{1}{4}\,E\Big(\frac{\varepsilon^2}{\tau_*^2}-1\Big)^2,
\]

where ε = Y − α0 − α1X1 − α2X2 and V* is defined in (2.8). Therefore, by the definition of the matrix C, we get

\[
C(\theta_*) = \begin{pmatrix} \Sigma_S^{-1} V^* \Sigma_S^{-1} & 2\tau_*^2\,\Sigma_S^{-1}\gamma \\ 2\tau_*^2\,\gamma'\Sigma_S^{-1} & 4\tau_*^4\,\Delta \end{pmatrix}.
\]

The upper-left sub-matrix is just V(θ*).

Appendix III
Numerical approach to C11 and C12 in Section 2.4.1

We give details of the numerical approach for obtaining C11 and C12 in (2.20). The joint distribution of (X1, X2) is bivariate normal. The elements of the matrix C11 are easily obtained by one-dimensional integration. As for the elements of the matrix C12, the typical term is

\[
E\big[\big((X_1-c_1)^a I\{X_1>c_1\} - k_1\big)\big((X_2-c_2)^b I\{X_2>c_2\} - k_2\big)\big], \tag{.1}
\]

where k_i = E\{(X_i - c_i)^a I(X_i > c_i)\}, with the obvious conventions when c_i = −∞. To reduce the two-dimensional integration to one dimension, we need only compute

\[
f(x_1) = E\big((X_2-c_2)^b I\{X_2>c_2\} \mid X_1 = x_1\big).
\]

Note that, conditional on X1 = x1, X2 can be written as X2 = ρx1 + √(1−ρ²) Z, where ρ is the correlation coefficient of X1 and X2 and Z denotes a standard normal variable. This integral is easily evaluated since Z is standard normal, and we can then regard the integrand in (.1), the typical term of C12, as a function of one variable:

\[
g(X_1) = \big((X_1-c_1)^a I\{X_1>c_1\} - k_1\big)\times\big(f(X_1)-k_2\big).
\]

Thus we can approximate the integral of the function by

\[
E\,g(X_1) \approx K^{-1}\sum_{i=1}^{K} g(b_i),
\]

where b_i is the i/(K+1) quantile of the marginal distribution of X1. In our numerical approach to the elements of C11 and C12, K is set to 2000.

Appendix IV
Proof of (2.24) in Section 2.4.2

With the facts that B⁻¹(B⁻¹)' = S'S and U C U' = B P B', we have

\[
S'S + \lambda P = B^{-1}(B^{-1})' + \lambda P = B^{-1}(I + \lambda BPB')(B^{-1})' = B^{-1}U(I+\lambda C)U'(B^{-1})'.
\]

Thus (S'S + λP)⁻¹ = B'U(I + λC)⁻¹U'B. Let V = SB'U; then V'V = I, which gives

\[
\mathrm{tr}\big(S(\lambda)\big) = \mathrm{tr}\big\{V(I+\lambda C)^{-1}V'\big\} = \mathrm{tr}\,(I+\lambda C)^{-1}.
\]
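The identity just proved gives a cheap way to evaluate the effective degrees of freedom of the penalized smoother; a minimal sketch, assuming the matrix C of Appendix IV has been formed:

```python
# Sketch: effective degrees of freedom tr(S(lambda)) = tr{(I + lambda*C)^-1}
# = sum_i 1/(1 + lambda * c_i), where c_i are the eigenvalues of C.
import numpy as np

def effective_df(C, lam):
    c = np.linalg.eigvalsh((C + C.T) / 2.0)   # symmetrize for stability
    return np.sum(1.0 / (1.0 + lam * c))
```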
Appendix V

Pseudocode for the hybrid MCMC algorithm in Section 4.2.1

Let $\theta \sim \Pi$ be the target distribution, having an unnormalized density function $\pi(\theta)$ on a subset of $\mathbb{R}^k$. In the following, we will abuse notation by using the same symbol to denote different functions when the meaning is clear from the context. The algorithm works by extending the state from $\theta$ to $(\theta, z)$, and the unnormalized target density from $\pi(\theta)$ to $\pi(\theta, z) = \pi(\theta)\pi(z)$, where $\pi(z)$ is the standard normal density.

0. Set values for the function $g$, the constant $\epsilon$, the constant $a$, and the number of iterations $I$.

1. Initialize the value of $\theta$. It could be generated from the prior of each component of $\theta$, or by fitting a simpler model (for example, an additive model in the regression model context). Also initialize $z$ by sampling from the standard normal distribution.

2. For each iteration $i = 1, \ldots, I$:

a) Generate a candidate state $(\theta^*, z^*)$ as
\[
\theta^* \leftarrow \theta + \epsilon\left\{ z + (\epsilon/2)g(\theta) \right\}, \qquad
z^* \leftarrow -z - (\epsilon/2)\left\{ g(\theta) + g(\theta^*) \right\},
\]
and draw $u$ from $\mathrm{Unif}(0,1)$. Set $(\theta, z) \leftarrow (\theta^*, z^*)$ if
\[
u < \min\left\{ 1, \frac{\pi(\theta^*, z^*)}{\pi(\theta, z)} \right\};
\]
otherwise, keep the original $(\theta, z)$.

b) Unconditionally negate $z$; that is, $z \leftarrow -z$.

c) Perform an autoregressive update to $z$; that is, draw $z \leftarrow N(az, (1 - a^2)I_k)$.

d) Set $\theta_i \leftarrow \theta$.

3. Output $\theta_1, \ldots, \theta_I$.

Remarks: Setting $g(\theta) = \nabla \log \pi(\theta)$ and $a$ close to 1, we obtain the hybrid algorithm. Other choices of $g$ and $a$ give different algorithms.
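For concreteness, the following is a runnable Python rendering of the pseudocode above; it is a sketch rather than the implementation used in the thesis, and the names hybrid_mcmc, log_pi, and grad_log_pi are illustrative. Here grad_log_pi plays the role of $g$, and eps, a, and n_iter correspond to $\epsilon$, $a$, and $I$.

import numpy as np

def hybrid_mcmc(log_pi, grad_log_pi, theta0, eps=0.1, a=0.95, n_iter=5000, seed=1):
    """Hybrid MCMC with one leapfrog step and persistent (autoregressive) momentum."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    z = rng.standard_normal(theta.shape)        # auxiliary variable, step 1
    draws = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        # Step 2a: deterministic proposal with the negation of z built in.
        theta_star = theta + eps * (z + 0.5 * eps * grad_log_pi(theta))
        z_star = -z - 0.5 * eps * (grad_log_pi(theta) + grad_log_pi(theta_star))
        # Metropolis acceptance for the extended target pi(theta) * N(z; 0, I).
        log_ratio = (log_pi(theta_star) - 0.5 * z_star @ z_star) \
                    - (log_pi(theta) - 0.5 * z @ z)
        if np.log(rng.uniform()) < log_ratio:
            theta, z = theta_star, z_star
        # Step 2b: unconditional negation; step 2c: autoregressive update.
        z = -z
        z = a * z + np.sqrt(1.0 - a * a) * rng.standard_normal(z.shape)
        draws[i] = theta                        # step 2d
    return draws

# Usage: sample a standard bivariate normal target.
samples = hybrid_mcmc(lambda t: -0.5 * t @ t, lambda t: -t, np.zeros(2))
print(samples.mean(axis=0), samples.var(axis=0))   # roughly (0, 0) and (1, 1)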
Bibliography

Box, G. (1979). Robustness in the strategy of scientific model building, Robustness in Statistics, pp. 201-236.

Breiman, L. (1996). Heuristics of instability and stabilization in model selection, Annals of Statistics 24(6): 2350-2383.

Diaconis, P. and Shahshahani, M. (1984). Nonlinear functions of linear combinations, SIAM Journal on Scientific and Statistical Computing 5(1): 175-191.

Draper, D. (1995). Assessment and propagation of model uncertainty, Journal of the Royal Statistical Society, Series B (Methodological) 57(1): 45-97.

Eubank, R. (1999). Nonparametric regression and spline smoothing, Marcel Dekker.

Friedman, J. (1991). Multivariate adaptive regression splines, The Annals of Statistics 19(1): 1-67.

Gelman, A. and Pardoe, I. (2007). Average predictive effects for models with nonlinearity, interactions, and variance components, Sociological Methodology.

Greenland, S. (1979). Limitations of the logistic analysis of epidemiologic data, American Journal of Epidemiology 110(6): 693.

Greenland, S. (1983). Tests for interaction in epidemiologic studies: a review and a study of power, Statistics in Medicine 2(2): 243-251.

Greenland, S. (1993). Basic problems in interaction assessment, Environmental Health Perspectives 101: 59-66.

Gu, C. (2002). Smoothing spline ANOVA models, Springer.

Gustafson, P. (2000). Bayesian regression modeling with interactions and smooth effects, Journal of the American Statistical Association 95(451): 795-806.

Gustafson, P. (2001). On measuring sensitivity to parametric model misspecification, Journal of the Royal Statistical Society, Series B (Statistical Methodology) 63(1): 81-94.

Gustafson, P. (2007). On robustness and model flexibility in survival analysis: transformed hazard models and average effects, Biometrics 63: 69-77.

Gustafson, P., Kazi, A. and Levy, A. (2005). Extending logistic regression to model diffuse interactions, Statistics in Medicine 24(13): 2089-2104.

Gustafson, P., MacNab, Y. and Wen, S. (2004). On the value of derivative evaluations and random walk suppression in Markov chain Monte Carlo algorithms, Statistics and Computing 14(1): 23-38.

Hills, S. and Smith, A. (1992). Parameterization issues in Bayesian inference, Bayesian Statistics 4: 227-246.

Hogan, M., Kupper, L., Most, B. and Haseman, J. (1978). Alternatives to Rothman's approach for assessing synergism (or antagonism) in cohort studies, American Journal of Epidemiology 108(1): 60-67.

Karlin, S. and Studden, W. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics, Interscience Publishers.

Kleinbaum, D., Kupper, L. and Morgenstern, H. (1982). Interaction, effect modification, and synergism, Epidemiologic Research, New York: Van Nostrand Reinhold, pp. 407-417.

Koopman, J. (1981). Interaction between discrete causes, American Journal of Epidemiology 113(6): 716-724.

Le Cam, L. (1960). Locally asymptotically normal families of distributions, University of California Publications in Statistics 3(2): 37-98.

Liu, J. (1996). Metropolized Gibbs sampler: an improvement.

Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window, Journal of the American Statistical Association.

McCullagh, P. and Nelder, J. (1983). Generalized linear models, Chapman & Hall.

Muller, A. and Stoyan, D. (2002). Comparison methods for stochastic models and risks, John Wiley and Sons.

Neal, R. (1996). Bayesian learning for neural networks, Springer.

Neal, R. (1998). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation, Learning in Graphical Models, pp. 205-225.

O'Brien, S., Kupper, L. and Dunson, D. (2006). Performance of tests of association in misspecified generalized linear models, Journal of Statistical Planning and Inference 136(9): 3090-3100.

Raftery, A. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models, Biometrika 83(2): 251-266.

Rothman, K. (1974). Synergy and antagonism in cause-effect relationships, American Journal of Epidemiology 99(6): 385.

Rothman, K., Greenland, S. and Walker, A. (1980). Concepts of interaction, American Journal of Epidemiology 112(4): 467.

Saracci, R. (1980). Interaction and synergism, American Journal of Epidemiology 112(4): 465.

Venables, W. and Ripley, B. (1999). Modern applied statistics with S-Plus, Springer, New York.

Wang, D., Zhang, W. and Bakhai, A. (2004). Comparison of Bayesian model averaging and stepwise methods for model selection in logistic regression, Statistics in Medicine 23(22): 3451-3467.

White, H. (1980). Using least squares to approximate unknown regression functions, International Economic Review 21(1): 149-170.

White, H. (1981). Consequences and detection of misspecified nonlinear regression models, Journal of the American Statistical Association 76(374): 419-433.

White, H. (1982). Maximum likelihood estimation of misspecified models, Econometrica 50(1): 1-26.

Xu, R. and O'Quigley, J. (2000). Estimating average regression effect under non-proportional hazards, Biostatistics 1(4): 423-439.
