Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Average effects for regression models with misspecifications and diffuse interaction models Liu, Juxin 2007

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata

Download

Media
[if-you-see-this-DO-NOT-CLICK]
ubc_2007-318171.pdf [ 5.2MB ]
Metadata
JSON: 1.0302408.json
JSON-LD: 1.0302408+ld.json
RDF/XML (Pretty): 1.0302408.xml
RDF/JSON: 1.0302408+rdf.json
Turtle: 1.0302408+rdf-turtle.txt
N-Triples: 1.0302408+rdf-ntriples.txt
Original Record: 1.0302408 +original-record.json
Full Text
1.0302408.txt
Citation
1.0302408.ris

Full Text

Average effects for regression models with misspecifications and diffuse interaction models by Juxin Liu B.Sc, Anhui University, 2000 M.Sc, Beijing University, 2003 Ph.D., University of British Columbia, 2007 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics) The University Of British Columbia June, 2007 © Juxin Liu 2007  Abstract In epidemiological studies, how best to assess and interpret interaction of risk factors of interest has been the subject of a lively debate. In statistical regression models, the interaction between two putative risk factors is described by the regression coefficient of the product of the risk factors. What happens if a linear regression model without pairwise interaction terms is used to fit the data actually generated from a linear regression model with all possible pairwise interactions? We apply the idea of average effect to evaluate the consequence of misspecified models and find out that the average effect estimates are still consistent if the joint distribution of risk factors satisfy some certain conditions. It is known that pairwise interaction models encounter intractable problems especially when the number of risk factor under consideration is quite large. The number of pairwise interaction terms is p(p — l ) / 2 , if the number of risk factors is p. As an alternative strategy, we introduce diffuse interaction model with only one parameter to reflect the interactions among all the risk factors, without specifying which of the risk factors do indeed interact. We compare the two kinds of interaction models in terms of ability to detect interactions. Another issue investigated in the thesis is to devise MCMC algorithms to estimate diffuse interaction models. This is done not only for the diffuse interaction model assuming all risk factors interact in the same direction, either synergistically or antagonistically, but also for extended diffuse interaction models which relaxing this strong assumption.  ii  Table of Contents Abstract  n  Table of Contents  iii  List of Tables  vi  List of Figures  y  ii xiii  Acknowledgement  Dedication  •  1 Introduction  xiv 1  1.1  Motivation  . .  1.2  Classes of regression models  1.3  Outline of thesis  2 Average effects for regression models with misspecifications  1 6 13  14  2.1  Average effect  14  2.2  General results  18  2.3  Linear regression  25  2.3.1  Difference in large sample limits of the two average effect estimators  27  2.3.2  Relative efficiency of the two average effect estimators  33  iii  Table of Contents  2.4 Nonparametric regression and smoothing  37  2.4.1 Spline regression models  38  2.4.2 Penalized regression models  49  2.5 A middle scenario  56  2.6 Summary  58  3 Comparison of interaction detectability under different interaction models  60  3.1 General framework  60  3.2 Comparison between pairwise interaction models and diffuse interaction models 3.2.1  65 Introduction of diffuse interaction model  3.2.2 Power comparison 3.3 Summary 4 M C M C algorithms for diffuse interaction models  65 66 81 82  4.1 Why MCMC?  82  4.2 Details of MCMC algorithms  84  4.2.1  One-group diffuse interaction models  84  4.2.2 Two-group diffuse interaction models  . 90  4.3 Discussion  104  4.4 Example  117  5 Summarization and future work  121  5.1 Conclusions  121  5.2 Future work  122  iv  Table of Contents  Appendices I  P r o o f of (2.14)in Section 2.3.1  126  II A n o t h e r consistent estimator of V* mentioned i n Section 2.2  130  I I I N u m e r i c a l approach to C n a n d Cn i n Section 2.4.1  133  I V P r o o f of (2.24) i n Section 2.4.2  135  V  P s e u d o c o d e for the h y b r i d M C M C a l g o r i t h m i n Section 4.2.1  Bibliography  . . . .  136 138  List of Tables 4.1  Algorithm II: Frequency table based on posterior samples for each component of I with different values of (3j,j € ANT.  4.2  101  Algorithm III: Frequency table based on posterior samples for each component of I under two-group diffuse interaction model with A unknown. . 108  4.3 Algorithm III: Frequency table based on posterior samples for each component of I under three-group diffuse interaction group  115  List of Figures 2.1  Magnitude of relative bias as a function of (p, r) for multivariate lognormal distribution of p = 10 predictors. The cases (a) through (d) are as described in the text  2.2  33  Relative efficiency of the "misspecified"-model average-effect estimator as a function of p and ||7||, for an equi-correlated multivariate normal distribution of p = 10 predictors. The cases (a) and (c) are as described in the text  2.3  37  Comparison of two average effect estimators  and 8l((3) for spline  regression models with two predictors involved. The x-coordinate of each circle is 5*(/3) and the y-coordinate is 5\(j3). Different circles are produced by different values of p, the correlation coefficient between X\ and X . 2  2.4  48  Comparison of asymptotic conditional variances of two average effect estimators for spline regression models with two predictors involved. The x-coordinate of each circle is the conditional variance of misspecified-model estimator and the y-coordinate is that of right-model estimator. Different circles are produced by different values of p, the correlation coefficient between X\ and X 2  49  vii  List of Figures  2.5 Scatter plots of average effect estimates in penalized spline regression with only two predictors: the top panel with sample size n = 200 and bottom with n = 1000. The solid horizontal line in each panel identifies the true value of average effect, and each circle represents a different simulated data set  55  2.6 Histograms of the average effect estimates in penalized spline regression with only two predictors: the top panel with sample size n = 200 and bottom with n = 1000, and the symbol 'x' in each panel marks the true value of average effect. 3.1 Power Curves with  55  Xi's '~' %  Bernoulli(0.5): the top panel with the true  structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the bottom panel V =  l,xi  • • • 72  3.2 Different choices of rj in Case II with binary predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few of predictor pairs interacting positively, and the bottom panel has more mixed directions of interactions. The lengths of TJ'S in different panels are normalized to be 1. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting  74  vm  List of Figures  3.3 Different choices of £ in X,'s distribution: the top panel with the true structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, £=0.2,0.5 and 0.8 respectively.  75  3.4 Different choices of b in f3 — bl : the top panel with the true structure M  p  to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, 6=0.1,0.5 and 1 respectively  76  3.5 Different choices of Var(e):the top panel with the true structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, Var(e)=0.5,l and 5 respectively  77  3.6 Different choices of Var(e) with scaled A: the top panel with the true structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting  78  List of Figures  3.7  Different choices of p among Xj's: the top panel with the true structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, /O=0.005,0.5 and 0.98 respectively  3.8  Power Curves with  Xi's '~' l  79  Log-normal(0,1): the top panel with the true  structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the bottom panel V = 1,X./N/</  3.9  •  •  80  Different choices of rj in Case II with continuous predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few of predictor pairs interacting positively, and the bottom panel has more mixed directions of interactions. The lengths of 77's in different panels are normalized to be 1. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting  81  4.1  Algorithm I: Posterior correlations for (A,/?i) and (A, ai) respectively. . .  86  4.2  Algorithm I: MCMC traceplots for P j jt  priors a  2  4.3  = 0,1,... ,p and A with diffuse  = a . = a\ = 100 2  0Q  0  87  Algorithm I: Marginal posterior densities for Pj, j = 0,1,... ,p, and A with diffuse priors cr| = a . = a\ = 100 2  o  p  88  x  List of Figures  4.4 Algorithm I: Posterior densities for Bj, j = 0,1,... ,p and A with informative priors cr  2 0Q  =  = cr\ — 0.4.  . ..:  89  4.5 Algorithm II: Using Proposal 1, MCMC trace plots for /3j,j  =  l,...,p  with diffuse priors,o~p = cr| — 100  95  o  4.6 Algorithm II: Using Proposal 1, posterior densities for  — 1,... ,p with  diffuse priors ajj = <r| = 100  96  0  4.7 Algorithm II: Comparison of two proposals: Number of correct group allocations based on posterior samples for 1  97  4.8 Algorithm II: With Proposal 1, the autocorrelation curves for posterior samples of Bj, j — 1,..., p respectively.  98  4.9 Algorithm II: With Proposal 2, the autocorrelation curves for posterior samples of Bj, j = 1,... ,p respectively  99  4.10 Algorithm II: With Proposal 1, the posterior densities of samples from with informative priors :<Xg  o  =  <jj|  = 0.4. ,  100  4.11 Algorithm II: Comparison of number of correct group allocations based on posterior samples for I with different values of Bj(j S ANT): top panel with smaller value (=0.4), and bottom panel with bigger value (=0.7). . .  101  4.12 Algorithm II: MCMC output of j3 with true value set to be 0.4: the top A  panel is the posterior density for entire sequence of 10000 iterations, the middle is the posterior density for subsequence /3|/ = 1, and the bottom 4  4  is the posterior density for subsequence /3|/ = 2 4  4  103  4.13 Algorithm III: Trace plots for (3j,j = 1,... ,p based on two-group diffuse interaction model with A unknown, respectively  106  4.14 Algorithm III: Marginal posterior densities for 8j,j = 1,... ;p based on two-group diffuse interaction model with A unknown, respectively.- .... 107  xi  List of Figures  4.15 Algorithm III: Plots of posterior samples for A, the top panel is the trace plot, the middle is the posterior density plot, and the bottom is the autocorrelation curve  108  4.16 Algorithm III: Autocorrelation curves for pj, j = 1,... ,p based on twogroup diffuse interaction model with A unknown, respectively  109  4.17 Algorithm V: Trace plots of MCMC samples for Pj, j = 1,... ,p based on three-group diffuse interaction models with A2 = 4/5, A3 = 5/4 4.18 Algorithm V: Posterior densities of MCMC samples for  P j , j  112 =  l , . . . , p  based on three-group diffuse interaction models with A = 4/5, A3 = 5/4. 2  113  4.19 Algorithm V: Autocorrelation curves of MCMC samples for Pj,j = 1,... ,p based on three-group diffuse interaction models with A2 = 4/5, A3 = 5/4.  114  4.20 Algorithm V: Number of correct group allocations based on posterior samples of I under three-group diffuse interaction models  116  4.21 Algorithm V: Posterior densities of samples for Pj conditional on Ij correctly allocated (j = 1,... ,p) under three-group diffuse interaction models. 116' 4.22 Abalone data: trace plots for  Pj,  j  =  0,1,...  , p ,  for the whole sequence of  100,000 iterations including the burn-in period  119  4.23 Abalone data: density plots for Pj and 5j, j — 1,... ,p, for the sequence of 40,000 post burn-in iterations. The solid lines stand for average effect Sj's and dashed lines stand for /Jj's  120  4.24 Abalone data: the top panel is the density plot of A and the bottom panel is the density plot of relative antagonistic effect, for the sequence of 40,000 post burn-in iterations  120  xii  I would like to thank my supervisor, Professor Paul Gustafson, for his guidance, invaluable advice and great patience. Each time when I am doing "random walk", it is my supervisor who shows me an efficient "direction" to make meaningful movement. Special thanks to Professors Harry Joe and Lang Wu for their helpful comments and suggestions on my proposal and table-tennis playing as well. And extra thanks to Harry for the tips on Latex, which makes the notations in my thesis well-exhibited. I am very grateful to the helps from my fellow graduate students and the ladies in our main office. I feel lucky to have had the chance to study in our department, which is just like a big family. Juxin Liu The University of British Columbia June 2007  xiii  To Ameng  xiv  Chapter 1 Introduction 1.1  Motivation  The term interaction is used in epidemiology to describe a situation in which two or more risk factors modify the effect of each other with regard to the occurrence or value of a given health outcome, denoted by Y. For dichotomous variables, interaction means that the effect of one risk factor, say A, on the outcome differs depending on whether another variable B (effect modifier) is present. Moreover, if the presence of B, the effect modifier, potentiates/accentuates the effect of risk factor A,, this variable and risk factor are said to be synergistic  (positive  interaction); if the presence of B diminishes or eliminates the effect of risk factor A, the two variables are antagonistic  (negative interaction). For continuous variables, the  phenomenon of interaction means that the effect of one risk factor on outcome differs depending on the value of another variable (effect modifier). A mathematical definition comes in Section 1.2. Later in Section 3.2.1, we will use a more general definition of synergism/antagonism when diffuse interaction  models are introduced. In epidemiological  studies, synergistic/antagonistic interaction among risk factors is common. For example, if people suffering from obesity have high blood cholesterol, then they have higher chances to get heart diseases. Another example (for antagonism) is the interaction between smoking and intake of Vitamin A for the risk of lung cancer. People who smoke a lot but take Vitamin A in daily dietary have lower risk of lung cancer than people who 1  Chapter 1.  Introduction  seldom smoke but lack of Vitamin A. Interaction can be described in two different but compatible ways. Each definition leads to a specific strategy for the assessment of interaction. The first, definition is based on homogeneity/heterogeneity of effects.  Interaction  occurs when the effect of a risk factor A on outcome Y is not homogeneous across strata formed by a third variable B. When this definition is used, variable B is often referred to as an effect modifier. The second, definition is based on the comparison between observed and expected joint effects of risk factor A and third variable B. Interaction occurs when the observed joint effect of A and B differs from that expected on the basis of the independent effects of A and B. How does one assess interactions?  In the thesis, we only focus on the situation  where the relationship between outcome Y (continuous) and risk factors is of interest. Commonly the product of the two variables A and B is used to describe interaction effects. That is, E (Y\A, B) = p + PiA + B B + 0  2  B AB, 12  where B reflects magnitude of interaction effect between A and B. If Pi > 0, the i2  2  interaction between risk factors A and B is synergistic (positive) interaction; otherwise it is an antagonistic (negative) interaction. For simplicity, the pure quadratic terms A and 2  B are omitted in the above model (These terms are also omitted in the linear regression 2  models which are mentioned later and an analysis with quadratic terms is mentioned briefly in Section 2.5). Note that if the outcome of interest is discrete, we could replace E(Y\A,B)  by g(E(Y\A,  B))  in the above model, i.e., a generalized linear model. For  instance, if Y is binary, logistic regression can be used in terms of g(E(Y\A,  B)) —  log (P(y = i ) / p ( y = o)). '  2  Chapter 1.  Introduction  More and more attention has been focused on model misspecifications since statisticians realize that unfortunately misspecified models are not uncommon in practice. Box (1979) and also McCullagh and Nelder (1983) mentioned "all models are wrong", though some fit data better than the others. Since we never know what the true model is in reality, by "true" model we assuming that is the true structure or closer to the truth than the others. It is natural to ask whether the properties of the estimator derived from misspecified models are affected. Does the estimator still converge to some limit asymptotically, and does this limit have any meaning? If the estimator is approximately consistent, is it also asymptotically normal? White (1982) provides answers to these questions by using maximum likelihood techniques for estimation and inference,of regression coefficients. Also in White (1981), the consequences and detection of misspecified nonlinear regression models are explored. Under the general topic of interactions, the goal of our work is to explore the consequences of a particular scenario of misspecified models . To be specific, what happens if we apply an additive model ignoring pairwise interactions to data which are actually generated from a pairwise interaction model? In this context, to make clear how those ignored interaction effects affect the results, we apply the average effect idea (definition given later in Chapter 2), while not applying the results from White (1982) directly to the regression coefficients. As is known, the interpretability of regression coefficients of risk factors is rather limited when models include interaction terms. The idea of average effect is proposed by Gelman and Pardoe (2007) and Gustafson et al. (2005) as well. Basically, it is the average of predictive effect, which is the expected change in outcome associated with a unit change in one of the risk factors. In a linear regression model without interactions, the average effect of any putative risk factor is simply the regression coefficient. However, in a model with interaction terms, the predictive effect in general depends on the value of risk factors. There are various definitions  Chapter 1.  Introduction  based on different distributions to average over. Three versions are denned in Chapter 2. The main advantage of the average effect idea is to make comparisons possible between different parametric models with sets of parameters that have incomparable interpretations. The average effect idea could also be used in other contexts. For example, Xu and O'Quigley (2000), also Gustafson (2007), gives a definition of average effect in survival analysis. In the future, we could also apply the idea of average effect to explore the consequences of model misspecifications under the framework of survival analysis. Nowadays another common issue arising in epidemiological studies is that a large (sometimes larger than sample size) number of potential risk factors should be considered in modelling. We could imagine the challenge to model the interaction effects when p, the number of risk factors, is relatively large. In particular, if the model under consideration is a pairwise interaction model, the number of all possible pairwise interaction terms is p(p — l)/2. For instance, if 12 risk factors are involved in the study, 66 pairs of possible interactions would be investigated besides the possibility of higher order interactions among three or more risk factors. Stepwise procedures are the most widely used approaches to select the important pairwise interaction terms in applied medical statistics. The basic idea of stepwise procedure is to find a "best" subset of potential risk factors by subsequently adding or dropping one risk factor at a time. Take forward stepwise regression procedure for instance, it starts off by choosing a model containing the single best risk factor variable and then attempts to build up with subsequent additions of other risk factors one at a time, as long as these additions are worthwhile. There are, however, a number of limitations with stepwise procedure. In particular, if possible models under consideration are nested pairwise interaction models, the stepwise procedure does not scale up to the number of potential risk factors (suppose all the risk factors involved at this stage are all important). As the number p of risk factors increases, the number of submodels, 2( ( 0/ ) increases dramatically, making the computational p  p-  2  )  4  Chapter 1.  Introduction  burden enormous. Also, the fitting of full model sometimes may not be suitable, because only a few of the p risk factors are typically included in the final model. And the fitting of the full model increases the numerical complexity of the methods unnecessarily. Another problem is that the model selected by a stepwise procedures includes only those variables entered in that final model, and ignores the variables not selected and the uncertainty due to the model selection procedure. In the worst possible scenario, such procedures may underestimate uncertainty about the variables, overestimate confidence in a particular model being selected, and may lead to sub-optimal decisions and limited predictability (Raftery (1996); Draper (1995)). There are also other drawbacks of stepwise selection. For example, a small change in the data can result in very different models being selected and this can reduce prediction accuracy, as discussed in Breiman (1996). To overcome the difficulties caused by large number of risk factors, we use diffuse interaction  models as an alternative way to model interactions. These kind of models are  proposed by Gustafson et al. (2005) in context of binary response. In this context, by synergism/ antagonism,  we mean that the effect of a putative risk factor increases/decreases  in magnitude as all the other risk factors move from absent to present. (Or as all the other risk factors increases if they are continuous.) Under this particular probability model, only one parameter is used to describe interaction among all risk factors. That is, the parameter can tell the overall interaction direction but without indicating which of the risk factors actually interact in that direction and which of them not. Hence the model is a bit simplistic, but we postulate it has more power, compared to pairwise interaction model, to detect interactions.  5  Chapter 1.  1.2  Introduction  Classes of regression models  When we are concerned with the dependence of a response variable Y on observed risk factors Xi,..., equation.  X, p  an equation that relates Y to X\,...,  X  p  is usually called as  regression  Denote the regression equation by E{Y\Xi  = Xi,X  2  = x ,...,X 2  p  = x ) = g(xi,x ,.. p  2  • ,x ) p  = g(x).  First of all, we give a more precise definition of synergism/antagonism in terms of mathematical languages. For any pair j < k and all e, 6 > 0, if g(x + elj + 6l ) - (x + <Jl ) - [g(x + el,-) - g'x)] > (<) 0, k  5  fc  the interaction among X is synergistic (antagonistic). Note lj means a p x 1 vector of zero except that the jth. element is 1. Equivalently, if g is twice differentiate, the above inequality can be rewritten as  Note that there are other names for the above definition, such as supermodular, directionally convex and lattice-superadditive for nonnegative derivatives and submodular, directionally concave, lattice-subadditive for nonpositive derivatives. They are used to compare the dependence structure of random vectors having the same marginal distributions. More details and discussions are in Muller and Stoyan (2002). In the following we list all the regression equations considered in the thesis.  6  Chapter 1.  Introduction  1. Linear regression models. v  g(x  u  ...  ijXj.  Due to the simplicity and interpretability, linear regression models are the most widely used for either experimental data or observational data. The regression coefficient Bj implies how much change in the response variable is associated with a unit change in Xj when keeping all the other risk factors unchanged. It is acknowledged that this model is a linear approximation of the relationship between Y and X. By first-order Taylor expansion, it is easy to derive the model expression. Therefore, it may work well only for a local region, where the surface does not have curvature. That's the reason why prediction of X values outside of the range where we fit the model is usually dangerous and not reliable. By second-order Taylor expansion, we have p  Also this is good for a local region, where the surface has some curvature. Note that for a binary variable Xj, the term Bjjxj is not necessary. One thing worth mentioning is that the explanation of coefficients Pj's are different from that under the additive model. The expected change in response variable Y after one unit change in Xj (while keeping all the others unchanged) now breaks down into several pieces, i.e., BjXj and BijXiXjii = 1,... ,p). For this model, for any pair i < j, we have synergistic interaction if Pij > 0 and antagonistic interaction if Bij < 0.  7  Chapter 1.  Introduction  2. Spline regression models. p  g(x ...,x ) u  = Y^ j( j)> m  p  ( )  x  L1  where nij is a smoothing function applied to Xj. Note this model is additive and does not include interaction terms. Taylor expansion allows to write rrij as D+l  j( )  m  =  x  cijkX ~ k  l  + Rerrij(x), a < x <b, j = 1,... ,p  fc=i  Remjix)  = (DI)'  [ mf (x)(x-0^,  1  b  +1)  Ja where (x — t )+ = x — t , if x > t and zero otherwise. k  k  k  Note that for fixed j , if Rerrij(x) are uniformly small for all observed Xji in magnitude, polynomial regression of Xj may provide a reasonable analysis. Otherwise, we need some other methods to take account of the item Renij.  One method of estimation which  attempts to guard against departures from polynomial models is smoothing splines. The basic premise is the integral in Rerrij can be approximated using the quadrature formula  Rerrij(x) « ^  a (x - tj )+ jk  k  fc=i  for coefficients  Oji,...  ,a,j  Lj  and tji < t  j2  <•••<t  jLj  (i^'s are knots in the definition  of spline, which is given in Section 2.4.1). Combining this with the original polynomial approximation leads to an overall approximation of the regression function by  rrij(x) = otjix H  h  a x  D  jD  + ^a (x jk  -  ijfc)+,  k=i  8  )  Chapter 1.  Introduction  which is the spline function under consideration in Chapter 2. To introduce the interactions between  X\,...  ,Xj,  we use the following regression func-  tion, similar to pairwise interaction models, p  (1.2)  Note that there are other possibilities to determine the interactions between any pair of risk factors. Here for simplicity we only use the summation of the products of one risk factor and smoothing function of the other risk factor, which is different from that discussed in Gu (2002). In Gu (2002), the multivariate function is given by an ANOVA decomposition, that is, it is expressed as a constant plus the sum of functions of one variable (main effects), plus the sum of functions of two variables (two-factor interactions)and so on. Note the interactions are assumed to be in tensor product spaces. Note that if we have numerous number of risk factors, the additive model (1.1) may be too generous allowing a few degrees of freedom per Xj and it does not take into account the interactions between X , ' s yet. Classes of function are not "dense" or a universal approximation to class of smooth function in a rectangular region of rFj=i[ j£> ju]> where x  x  Xjij^Xju are the lower and upper bounds of jth risk factor respectively. In the interaction model (1.2), the number of interaction terms increases dramatically when the number of risk factor increases. In such a case, we may use projection pursuit regression (Friedman and Stuezle, 1981) or multivariate adaptive regression splines (Friedman, 1991). These two methods consist of universal approximation to smooth functions by a sum of nonlinear functions of linear combinations of a^-'s, i.e., M  9  Chapter 1.  Introduction  More discussion is in Venables. and Ripley (1999) and Diaconis and Shahshahani (1984). Note that projection pursuit regression models and multivariate adaptive regression splines are not considered in the thesis, but it is possible and worthwhile to be studied in the future. 3. Diffuse interaction models. As discussed in the previous section, some problems arise when fitting a pairwise interaction model especially with a large number of risk factors. For instance, high blood pressure is a warning signal for health problems. There are many risk factors which may cause high blood pressure. Age: The risk of high blood pressure increases as you get older. Gender: Women are more likely to develop high blood pressure after menopause. Family history: High blood pressure tends to run in families. Body weight: The greater your body mass, more risk of high blood pressure. Tobacco use: The chemicals in tobacco can increase the risk of high blood pressure. Sodium intake: Too much sodium can lead to increased blood pressure. Excessive alcohol: Heavy drinking can damage your heart. Stress: High levels of stress can lead to a (temporary) increase in blood pressure. Therefore, for all the risk factors listed except gender, larger value means more risk of high blood pressure. There might be synergistic interactions among those risk factors. How to depict the magnitude of the synergism? As proposed in Gustafson et al. (2005), the  diffuse  interaction model is defined as  (1.3)  10  Chapter  1.  Introduction  w i t h Pj > 0, Xj > 0, A > 0. It is easy to verify that A < (>)1 leads to  dg(x)  dxidxj  >(<)0.  T h a t is, A < (>)1 means synergistic(antagonistic) interaction. Clearly the magnitude of A is a measurement of degree of synergism/antagonism and curvature of the surface as well. Note that when A = 1, the above regression equation just reduces to a n additive linear regression model. Note that  g(xj,*x.(j) = 0) = p + PjXj, 0  where X(j) = (x\,..., Xj-i,Xj+i,...,  x ) and Xj = 0 means the absence of j t h risk factor. p  T h a t is, no matter what A is, Pj is the increase i n response variable associated w i t h a unit change i n Xj, i n the absence of a l l the other risk factors. N o w the interpretation of Pj is different from that under linear regression model. T h e models are increasing function hence i n absence of all risk factors, that is, Xj = 0, j — 1,... ,p, Po stands for the smallest expected response (usually risk of some diseases). Therefore, we may rewrite the diffuse interaction models i n a more general sense as  5  (x) = A)+.N|,  •  (1.4)  where || • || is a n o r m i n E . Even more generally, to get a regression function that is not p  monotone increasing, we may consider  5(x) = /?o + ||x-a||,  11  Chapter  Introduction  1.  where a is the vector of location parameters, standing for the values of risk factors that leads to the smallest expected response. The diffuse interaction models (3.5) are a special class of (1.4) w i t h the choice of n o r m being  INI/?  = | ] [ > r ^ )'  A  . Pj > 0- *i > 0, A > 0.  j  Here Pj's are inverse scale parameters. Naturally,  (1.5)  can be interpreted as the distance  between x a n d 0. M o r e general classes of norms c a n be used, (a) If the variables Xj's interact i n different directions, one may want to p a r t i t i o n those variables depending on the direction of interactions among them, that is 1/A  l/Ai  (x) = A , +  5  E (& i)+{ jeADD  E  x  2  +{ E  O0^)*1  UeSYN  w i t h Pj > 0, Xj > 0, for a l l j a n d 0 < Aj < 1 < A . It is easy to verify that for any pair 2  j < k GSYN(ANT), dxjdxk  >(<)0.  (b) A s s u m i n g there is a nesting of groups of variables, one might use the 2-level nested model  .9(x) =  +  ieSi  ^A2  A/A ' 1/A 2  (1.6)  ies  2  where A i < 1 and A > 1 for a synergistic set S i a n d an antagonistic set S w i t h some 2  2  interaction between the two sets. A l s o this function can be extended to multiple levels of nesting. N o w the second derivative of g(x) i n (1.6) is more complicated. F o r A^> 1, any pair (j, k) w i t h j G Si, k € S , the second derivative is negative. For A < 1, any pair 2  (j, A;) w i t h j e Si, k G S , the second derivative is positive. If 1 < A < A , for any pair 2  2  12  Chapter 1. (j, k) €  Introduction  the second derivative is negative. If A i < A < 1, for any pair (j, k) € Si, the  second derivative is positive. The following example illustrate one scenario where we might consider to use (1.6). Suppose X\,X ,Z  are the predictors relating to response variable Y. However, Z can  2  not be measured and is replaced by two surrogate variables X and X . In this case, one 3  4  might use g'x ,x ,x ,x ) 1  1.3  2  3  4  = {(Pixi)  x  + (fox )  x  2  + (fox  + /3 x ) } A  3  4  4  1 / A  •  O u t l i n e o f thesis  The following chapters are organized as follows. In Chapter 2, we study the consequences of fitting an additive regression model to data generated from a pairwise interaction model. We find out that under some particular situations, the average effect estimates based on misspecified model can still be consistent with true values. Further, under the framework of spline regression, we investigate the consequences of model misspecifications by failing to include interaction terms into model, when such terms exist. In Chapter 3, we introduce diffuse interaction models as an alternative to pairwise interaction models when the number of risk factors of interest is rather large. A n d we compare its ability to detect interactions with a pairwise interaction model. In Chapter 4, we propose a M C M C algorithm to estimate the parameters in diffuse interaction model, introduced in chapter 3. Further, we propose other M C M C algorithms for more general versions of diffuse interaction model by relaxing the assumption that all the risk factors interact in the same way, either synergistically or antagonistically. In chapter 5, we summarize the results of the above three chapters and also discuss some possible problems to consider in future. 13  Chapter 2 Average effects for regression models with misspeciflcations How bad is the estimation when the real relationship between response variable and predictors (i.e.risk factors i n epidemiological studies) does involve interactions, but a model without interaction is fitted?  More specifically, if the actual d a t a  generating  mechanism is  Y = J2h (X )+ j  j=i  M * . ^ ) +C  j  (2-1)  i<j<i<p  what w i l l the result be if we fit an additive model  3=1  where b o t h e and 77 denote random errors, postulated to follow n o r m a l distributions.  2.1  Average effect  To evaluate the performance of the estimation under misspecified models, we apply the idea of average effect, which is proposed i n G e l m a n and Pardoe (2007) and also i n Gustafson et al. (2005).  T h e reasons to introduce the average effect are listed as the  following. (a) For comparing models w i t h different parametric forms, different sets of parameters 14  Chapter 2. Average effects for regression models with misspecihcations  have different interpretations, so it is hard to explain the difference between the parameters from different models. For example, postulate the parametric forms of models (2.1) and (2.2) as below. In model (2.1):  hj{Xj)  =  BjXj,  hij(Xi, Xj)  =  Bij Xi Xj.  In model (2.2): j(Xj)  h  = ctjXj.  Then B^s in model (2.1) are the parameters describing the pairwise interactions between the predictors, while no such parameters appear in model (2.2). It is hard to interpret the meaning of the difference between Pj and ctj. More precisely, the expected change in response variable caused by one unit change of Xj keeping other predictors unchanged in model (2.1) is Pj + Yli^tj PijXi, while in model (2.2) it is aj. (b) More generally, if the two models of interest are quite different, no common parameters could be compared. For example, say one model is a linear regression model while the other is a nonparametric regression model. Now it is impossible to compare the estimates of coefficients from the former model and the estimate of a function/curve from the latter model. To handle the difficulty mentioned above, we need a quantity which is associated with something in common among different models. The average effect is such a quantity. The definition of average effect is as follows. If the predictor Xj takes value in a continuous space, then the predictive effect of Xj  15  Chapter 2. Average effects for regression models with misspecihcations is denned as  Ai(x;fl)  where X y ) = (Xi,...,  dE(y\X =x ,X =x y,e) j  =  j  U)  u  dxj  X j - i , Xj+i,...,  X ), i.e., a random vector comprising a l l the exp  planatory variables except the j - t h one, and 8 denotes the parameter vector i n the model. If the predictor Xj can only take finitely many values, then the definition of its predictive effect is  (  1  /  X-  where x^\x^  (2  = x ; 0 ) - E(Y\Xj  ) { E m * ; = xf\x  O)  U)  = xf\x  U)  = x ;0)} , (j)  Xj  are a pair of different possible values of Xj. T h e quantity of predictive  effect reflects the change in E ( V | X ) associated w i t h asmall change i n the j - t h predictor Xj conditioned on a specific value of X y j = x^). T h e reason t o use the notation Aj(x; 6) a n d not Aj(x(j); 0) is t h a t i n general predictive effect is a function of Xj as well. For example, i n the quadratic model w i t h continuous predictors, that is  E(Y\X)  •  = 0 + f^P X Q  i  j  + J2B X*+ jj  the predictive effect of Xj is 2BjjXj + Yli^j PijXi-  &i i >> X X  However, i n some special situations  such as model (2.10) and model (2.11), appearing i n the next section, A j ( x ; 0 ) does not depend on Xj. Based on the definition of predictive effect, we can define different versions of average effect as the expected value of A^(x; 9) w i t h respect t o different distributions.  16  Chapter 2. Average effects for regression models with misspecihcations Definition 1: If averaging the predictive effect over the joint distribution of X ^ ) , all the predictors except Xj, then the average effect of the j-th. predictor is defined as  5J(XJ;  d) = Ex {Aj(Xj U)  =  Xj,  X  0 )  ; 0)}.  (2.3)  Definition 2: If averaging the predictive effect over the joint distribution of a l l predictors X\,...,  X, p  then the. average effect of the j ' - t h predictor is defined as  5 (0) = E { A , ( X ; 0 ) } . J  (2.4)  X  Definition 3: If averaging the predictive effect over the conditional distribution of X ( j ) | X j , then the average effect of the j - t h predictor is defined as  6j[XJ;d)  —  E , | x , {Aj(Xj = X o  Xj,  X y ) ; 6)}.  In general the three definitions are not identical to each other.  (2.5)  B u t i n the special  cases when the predictive effect of Xj does not depend on the value of Xj, as i n model (2.10) and model (2.11), the first two definitions are the same because b o t h are obtained by averaging over the joint distribution of X ^ j . In the following sections, we stick to use Definition 2 unless specified particularly. Note that the predictive effect of Xj is based on the conditional distribution of  Y\X,  while the average effect of Xj is defined w i t h respect to the joint distribution of (Y, X y ) ) or (Y, X).  W e should also keep in m i n d that the idea of average effect is not just  confined w i t h i n regression context. For example, X u and O ' Q u i g l e y (2000) and Gustafson (2007) apply this concept i n survival analysis by averaging hazard function over the joint distribution of (T, X ) , where T is failure time.  17  Chapter 2. Average effects for regression models with misspecihcations  2.2  General results  Suppose we have p predictors, Xi,...,X ,  a n d response variable Y, whose relationship  p  w i t h X , ' s is of interest. For the general framework, let  T = (Ti (Xi,...,  X ),..., p  T (Xi,..., t  X )) p  ,  whose components are the functions of predictors involved i n the "true" relationship between response variable and predictors, a n d let  S = (Si(Xi,...,  X ),..., p  S (Xi,..., s  X )) p  ,  denote the functions of the predictors i n the fitted model. B y allowing general forms of functions, even nonparametric forms, involved i n the components of S and T , we do have a rather broad possibilities of hj a n d hij i n (2.2) and (2.1). I n the forthcoming sections, we w i l l see how this general setting applies i n different regression models. Hence, models (2.2) and (2.1) can be rewritten as follows, respectively.  where e follows N(0,o ) 2  Y  =  S'cx + e,  Y  =  T'P + e,  a n d are independent of X . Denote b y 5j(/3), 5*(/3) the average  effects of the j - t h predictor under the "true" model and fitted one respectively.  18  Chapter 2. Average effects for regression models with misspecihcations Assuming X/s are continuous a n d T, S are differentiable, let  Tj  = {j^ i(Xi,-•  • ,X ),...  T  =  P  ix  ip  S  P  Xj, denoted by Sj(f3), is  [E(Tj)]'/3.  Now let n be  t  ith subject, (x ,...,  x ),i  a  = 1 , . . . , n are assumed to be iid replications  ip  X ) a n d e i , . . . , e„ are iid A^(0, cr ). Let T be the n x t design matrix with 2  p  (i, k) element equal to T (Xn,..., k  equal to  p  (x ,..., x , y ) being the observations of predictors a n d response  a sample size with  of (Xi,...,  ,X ^ ,  t  (^QY-SI(XI,...,X ),...,-^-S (XI,...,X )  So that the true average effect of  variable for the  ,-^T (Xi,...  p  j^-T^Xn,...  ,X ).  X ). Let Tj benxt  design matrix with  ip  (i, k) element  Assuming T is of full column rank, we have  ip  3 = (T'T) T'Y, _ 1  E(Tj)=n- f l i. 1  j nx  where Y = (y y , • • • ,y )'u  2  n  Hence the estimate of the average effect of Xj is obtained by  ^03) = n l f ( T ' T ) - T ' Y . _ 1  /  1  i  With  • Y = T/3-+e,  where e = ( e i , . . . , e )', we get n  <5~(/3) = n- l'Tj(3 l  +  rf \% T'T)- re. 1  t  1  19  Chapter 2. Average effects for regression models with misspecihcations  Therefore, we have  v^(W)-W))  = v ^ ^ ^ ' T j - E ^ O ' ^ + n-^'f^T'T/n)- ^ / ^. 1  1  2  The first term in the right side of the above equation can be rewritten as  where (¥,);. denotes the ith row of Tj. Since (Tj)j. are i.i.d., the above term converges in distribution to -/V(0, /3'Var (Tj)(3) by the multivariate central limit theorem. With the conditional variance identity and the assumption that e is independent of X, we have Var(T,' ) = E(Vai(T.£|*))-+Vai(E(T.'c|X)) ,  €i  i  =  i  i  i  i  i  E(TT> . 2  Thus, by the multivariate central limit theorem, n  n- '  1 2  ]T Ti.'a ^N(0,E  (T'T)<J ) . 2  Since the (Tj)j. are i.i.d., by the strong law of large numbers, we have n-^-^ECT,-)}'. If E(TT) is invertible, then (n-Wf  1  (E(TT')}" . 1  20  Chapter 2. Average effects for regression models with misspecihcations W i t h the above results, a straightforward conclusion can be summarized as the following. R e s u l t 1: A s s u m i n g the existence of { E ( T T ' ) } "  1  and e is independent of X ,  v^(W)-W))  °N(0, (p)), Vj  where (3)  = <x [E (Tj)}'[E 2  Vj  ( T T ' r ' E ( T , ) + /3'Var (Tj)3.  (2.6)  Similarly, i n the fitted model, let § be the n x s design m a t r i x w i t h (i, k) element equal to Sk(Xn,..., jfiqSk{Xn,...,  S ). i p  Xi ). P  Let  be n x s design m a t r i x w i t h (i,k) element equal to  Assuming that S is of full column rank, the estimate of the average  effect of Xj w i t h the "misspecified" model is  6*(J3) = • n - 1 l S ( S ' S ) - S ' Y . /  1  j  A s s u m i n g E(SS') is invertible,  a  = a.s.  (S'SJ^S'Y {E(SS')} E(SY) _ 1  {E(SS')}- {E(ST')} 9 1  /  a*.  Thus, i n the limit as n ^  oo, 5*{3) = lim<5}(/3) = E ( S ^ ' a * . Hence, w i t h Y = TB + e,  we have n  v ^ ( § 0 9 ) - 5*(f3))  =  n"V2 £  _  E  (§,•)') ( S ' S ^ S ' Y +  i=l  (E (S^HS'S/n)-V  1 / 2  S ' ( T / 3 - - S a , + e),  (2.7) 21  Chapter 2. Average effects for regression models with misspecihcations where (§j)i- is the i t h row o f ' S j . B y the strong law of large numbers,  n S ' Y —> E (SY)(= {E (ST')}/3), _1  (S'S/n)- ^ {E(SS')}" . 1  1  W i t h the multivariate central limit theorem a n d the above facts, the first t e r m i n (2.7) asymptotically follows a multivariate normal distribution w i t h mean vector of 0 a n d  (Sj)a*. A l s o by multivariate central limit theorem, we have  covariance m a t r i x of a'„Var  n  n' '  J2 % {Ti./3 - S,a* + e ) Z {0, V*), i=i  1 2  t  N  where Sj. is the i t h row of S, Tj. is the i t h row of T a n d  •,  V* = E {(HY/3 — Sj.a* + ei) Sj.Si.} 2  .= ff E(SS') + E ((T'^9 - S'a«) SS') . 2  2  (2.8)  Note that the first part is due to random error a n d the second part is due to model misspecification. Therefore, the asymptotic distribution of the second term i n (2.7) is a multivariate normal w i t h mean vector of 0 a n d covariance m a t r i x of  {E (S )}'{E (SS')}"f V*{E (SS')}" ^ (§,)}. 1  i  Immediately the combination of the above results leads to the following result.  Result 2 : A s s u m i n g the existence of {E (SS')} a n d e is independent of X , _1  V^{5f{P)-5;(3))^N(0,v;(3)),  22  Chapter 2. Average effects for regression models with  misspecihcations  where  «;(/3)  =  {E(S )}'{E(SS')}-. V-{E(SS')}- {E(S,)} + 1  1  j  2 a ' , C o v (S  S ( T ' / 3 - S ' a . ) ) { E ( S S ' ) } E (S,) + a ' , V a r ( S , > . . _ 1  jf  (2.9)  Remark. I n some instances Tj a n d / o r Sj might be constant, i n which case the second term i n (2.6) and the last two terms i n (2.9) vanishes. Particularly, if the "true" regressors include pairwise products from X , but the fitted model includes only linear terms from X , then the second term i n (2.6) does not vanish, but the last two terms i n (2.9) do vanish. Result 1 a n d Result 2 give the asymptotic distributions of the average effect estimates based on "true" model and "misspecified" model, respectively. C o m b i n i n g the two results, we can get a consistent average effect estimator from the misspecified model as long as Sj(P) = Sj((3). Some easily-studied cases where the above equality establishes are shown in Result 3 later. A l s o Result 1 a n d 2 make it possible to compare the efficiencies of the two estimators from "true" model and "misspecified" model. M o r e discussion is given i n Section 2.3.2. One thing worth investigating here is to find the consistent estimator, under the "misspecified" model fitting, of the mean squared error a ( a ) — E ( V — S ' c t ) . 2  2  A least squares estimator a is a parameter vector that solves the problem  m i n s ( a ) = (n - p - 1 ) | | Y - S a 2  _ 1  a  N o w s ( a ) m a y be rewritten i n terms of matrices, 2  n n—p— 1  23  Chapter 2. Average effects for regression models with misspecihcations B y the strong law of large numbers,  (n- Y'Y) 4'E(y ), 1  a  2  (ra^S'Yj^EtSy).  Therefore, s (a) A E ( y ) 2  a  E(S'y){E(SS')} E(Sy).  2  _1  n  Note that the a » is the unique minimizer of mean squared error a (a). N o w 2  cr (ct.) 2  =  E(y) -2E(yS'a,) + E(S'a,)  =  E(Y )-E(S'a»)  =  E ( y ) -a',E(SS')a.  =  E(y ) -E(s'y){E(ss')} E(sy).  2  2  2  2  (E(y|S) = S'a.)  ••  2  2  _1  T h a t is, s ( S ) is a consistent estimator of a (cx ). 2  2  f  (S'S/n)  ll  2 S  However,  (a) -4-{E(SS')}-V (a,), a  2  which is not equal to { E ( S S ' ) } V * { E (SS')} _1  _1  i n general. Note the fact that, by the  previous analyses and (2.8), we have  lim V a r ( n S ) /  v  =  v(cx )  =  {E(SS')} V*{E(SS')}"  =  {E ( S S ' ) } - V ( a . ) + {E ( S S ' ) } E ((T'/3 - S'a,) SS') {E (SS')} .  t  _1  1  _1  2  _1  Therefore, the true variances of the coefficients estimates from the fitted model are big24  Chapter 2. Average effects for regression models with misspecihcations ger than the reported standard errors of those estimates, which is caused by the model misspecification, the second piece i n the last equality of the above equation. If no misspecification occurred, then V* = E(SS'), then (S'S)s (a) is the consistent estimator of 2  1  var (a). Let n  Va = n 2(^-S .a) S;.Si-. >  _ 1  2  i  i=l  then it is not hard to derive that V  a  is a consistent estimator of V* by a similar derivation  as before. Therefore a large difference between V  a  and S ' S s ^ f i ) can be an evidence for  model misspecification (see W h i t e (1980) for a formal test for misspecification). Hence v(a ) t  can be consistently estimated by  (Yi - Si.aj'sj.Si.j ( S ' S ) " .  n ( S ' S ) " jjj 1  1  There are other ways to get consistent estimator of v(a+) as well. One way is so-called "sandwich" method, which is based on the derivatives of log-likelihood function. A c t u a l l y it is exactly the same as the estimator used i n W h i t e (1982), where W h i t e studied the asymptotic distribution of the maximum likelihood estimate i n case of model misspecification. A s known, the least squares estimate is the same as the m a x i m u m likelihood estimate i n the linear regression. Under a simple setting w i t h only two predictors i n volved, we can also apply the methodology i n W h i t e (1982) and get the same result as Result 2. Details are shown i n A p p e n d i x II.  2.3  Linear regression  In this section, we assume that the response variable Y depends linearly on the predictors Xi's and that Y given X = (X\,...,  X ) is normally distributed w i t h a constant variance. p  25  Chapter 2.  Average effects for regression models with misspecihcations  Suppose the true relationship between Y and XiS is  Y | X ~ TV ( A , + B X X  + ••• + B X  X  p  +B XX  p  X2  X  + ••• + +(3 - , X - X ,  2  P  1 P  P  1  P  a ), 2  (2.10)  while the fitted model is  Y | X . ~ AT(a +  +••• + a X , r ) .  Here 3 = (P , B . . . , B , B ,..., 0  u  p  p  P -\, ),  n  (2.11)  2  0  P  P  a = (a , a 0  p  u  . . . , a. ). p  B y Definition 2 of  average effect, i t is direct to derive the following average effects based respectively on (2.10) and (2.11):  S {P) = B + Y^B E(X ), i  i  5*(3) =  a  j  ii  (2.12)  i  ,  ,  (2.13)  Naturally, as a n informative summary we study the estimates of the average effect defined as n  Sj(3) = n- 5^A(x{,-),3) = 4- + SA^i, 1  j  i  i-l  i^j  n i=l  where X;y) is the z-th observation of X y ) and  is the sample mean of the observations of  Xi. T h e estimates of parameters 3 a n d a are a l l least squares estimates. Note that i n the normal linear regression context, the m a x i m u m likelihood estimates of the parameters are the same as the least squares estimates. In the following three sections, we w i l l compare the estimators of two average effects.  26  Chapter 2. Average effects for regression models with misspecihcations  2.3.1  Difference in large sample limits of the two average effect estimators  W i t h o u t loss of generality, we can think about centering predictors in the fitted model, that is,  E (Y | X ) = a + a i X i + • • • +  aX,  0  where Xi  = Xi — E ( X j ) , i.e., E ( X ) =  p  p  0. Note that the estimates of a \ , . . . , a  p  are  unchanged by centering. T h e large-sample limit of S , denoted by a*, satisfies  /  1  \  ( 1  x  x  E <  VX  p  Xi  > a, = E<Y  )  >  \ >  \X ) p  <  Since the equation is symmetric in the p predictors, it suffices to determine the relationship between ot\* and B y Cramer's Rule, we can solve the equation above and get  where S — (<7ij)  is the covariance matrix of (Xi,  pXp  • •• , X) p  and  is the cofactor of E .  Details of the proof of (2.14) are in A p p e n d i x I. Moreover, based on (2.3), expressions of average effect, the above results can be summarized as below. R e s u l t 3. Assume that the variables are centred, i.e., E ( X ) = 0. L e t T = where W = teractions.  (XiX ,XiX ,... 2  3  ,X -iX ), p  p  (1,X',W')'  i.e., the true relationship involves pairwise in-  Also let S = ( 1 , X ' ) ' , i.e., the interactions are undetected or ignored in the  27  Chapter 2. Average effects for regression models with misspeciBcations modelling process. T h e n (i)  *;(/9) =  W )+^  ^  ^  W  E  ,  (2  . ) 15  where E is the covariance m a t r i x of X , and E ^ is the cofactor of the determinant corresponding to element a jk  Consequently (ii) if X has a multivariate n o r m a l distribution, or if the components of X are independent, then  = 0.  SW-Sj'p)  Particularly, when p = 2, the connection equation (2.15) can be simplified as below  "Var (X )E {X*X ) - Cov{X X )E 2  5{{(3) = 5^(3) + f3  l2  2  u  (X^y  2  V a r ( X i ) V a r (X ) - C o v ^ , 2  2  (2.16)  X) 2  Remarks: T h e two conditions, b o t h independence a n d multivariate normality, are sufficient but not necessary to get the consistency of "wrong-model" estimate 61-((3). T h e condition that X follows a multivariate normal distribution can be replaced by an elliptical distribution. T h e latter actually is an extension of the former. T h e equation (2.15) shows that the bias is controlled by.E(XiXiXk).  Suppose the joint distribution of X is  an elliptical distribution EC(0,Y,,cp) a n d its characteristic function is exp(it'/i)0(t'St), where 0 is a scalar function and called characteristic generator. If i,j,k  are different to one another, the joint distribution of (Xi,X[,X ) k  is also  an elliptical distribution EC(0, E ^ , 4>) where Y,u is the corresponding sub-matrix of E k  28  Chapter 2. Average effects for regression models with misspecihcations  associated with (X Xi,Xk)it  By the definition we have (Xi, Xi, Xk)' = A'(Zi, Zi, Zk)',  where (Zi,Zi,Zk)  follows a spherical distribution (which is a special case of elliptical  distribution by setting 2 = 1). A is the lower triangular matrix in the Cholesky decomposition of Tjiik- Hence we can rewrite the above as Xi  — aZi,  Xi  = bZi + cZ  X  = dZi + eZi + fZk.  h  k  It is straightforward to get E(XiXiX ) k  = 0  because that E (Zf) = 0,E(X?Z ) t  = 0  and  E (ZiZiZk) = 0. If k = i or k = I, we can also show the expectation to be zero by replacing the third equation by either the first one (k = i) or the second one (k = I). The second part of Result 3 says that for certain distributions on X, a model ignoring interactions will yield consistent estimates of average effects, even though the true regression relationship involve interactions. In addition to being of conceptual interest, this suggests some practical modelling strategies. For instance, in applications where one wishes to avoid modelling interactions explicitly, one might attempt to pre-transform the predictors to approximate normality before fitting a linear model. If one is willing to think of average effects as targets of inference, then such transformations should reduce bias in estimating these effects via a model without interaction terms. Of course transformations applied to predictor and response variables are an important part of regression modelling in practice, and the desirability of transforming a response variable to approximate normality is clear. Transformations on predictors, however, are typically 29  Chapter 2. Average effects for regression models with misspecihcations argued for on the basis of compatibility w i t h linearity a n d homoscedasticity assumptions, without regard for the resulting shape of the predictor distribution. Result 3 suggests that the shape of the distribution could also be a consideration i n assessing possible transformations of the predictors. However, we also find that the consistency does not hold generally, w i t h the following two examples. F i r s t , assume Xi follows standard n o r m a l a n d X \Xi 2  follows Poisson(c\\Xi\).  when p = 2, transformation of (2.16) gives the quantity Pu (d~l(3) — 6i(3)),  Since nothing  to do w i t h true values of 3 a n d only depending on the d i s t r i b u t i o n of X . Moreover, i t can also somehow indicate the discrepancy of the two large sample limits. Note that i f Pn = Pi = P2, under the setting of Result 3, this quantity is just the relative bias, i.e., [$i(3) — 5i(3)]\.  j^i"  1  Based on the property of the mean a n d variance of Poisson  distribution, by some algebra we can derive that  Solve Ci by setting the above to be 1. Note the fact that p — E (XiX ) 2  C i E ( X | X | ) = 0.. T h i s is caused by the fact that 1  - E (X\)E (X ) = 2  is an o d d function a n d the  1  integral interval is symmetric about zero. Hence, based on (2.16), we have  sm-Si(d)  =  E{X X } 2  2  P12  0.718.  30  Chapter 2. Average effects for regression models with misspecihcations The last equality follows from  E(\Xf\)  1 2^=  =  1  V=x2  poo  \  x e~ / dx 3  x2 2  2irJ^ye-yl dy  2 fl.  =  2  s  Second, suppose equi-correlated predictors, each following log-normal distribution, defined as A  J  VV2( ) T  '  3  '---'P>  i  where  z = ( z M  u  z  p  y ~ N (0,(1 - )i P  (r) = E ( e ^ ) = e  Here I  p  + J ) , P  P  r 2 / 2  V(T) = V a r ( e ^ ) = e T  p  !  2 r 2  is p-dimensional identity m a t r i x and J  P  -e  r 2  .  is p-dimensional square m a t r i x w i t h a l l  p elements equal to 1. Note that as a —> 0, Xj converges to a standard normal. So that 2  larger r corresponds to more non-normality and larger p corresponds to more dependence among predictors. P a r t i t i o n the (true) regression coefficients as (3 — (0o, (3'MJ P'I)' according to intercept, main, and interaction effects respectively. U s i n g Result 3 we can compare the vector of true average effects 5(0)  = (3  M  (since E ( X ) = 0) to the large-  sample limit of estimated effects from a model without interactions, i.e., 5*(pi). F r o m Result 3 we know the relative bias is zero if p = 0 or r = 0, so the question of interest is how fast the bias grows as the components of X becomes b o t h correlated and skewed. For p = 10 and selected values of (/3 ,/3 ), Figure 2.1 depicts the relative bias as p and M  7  31  Chapter 2. Average effects for regression models with misspecihcations T vary, where  T h i s illustration is based o n fixing the direction of 3  M  compared to 8 , M  i.e., 3  M  oc l  p  a n d the relative length of 3j  a n d \\3j\\ — \\3 \\. For convenience, we index the M  components of 3 by (u, v) pair, where u < v. In Figure 2.1, four choices for the direction {  of 8j are considered: (a) Bi  oc 1, which involves a l l predictor pairs interacting positively.  (b) Pi, v  oc (—l) ~  tUV  U  v  u+1  , which means about half the pairs interact positively and the  other half negatively. ( ) Pi,uv oc I{\v ( m o d p ) — u\ — 1}, which means only a few of positive interactions. c  (d) Pi,uv ^ (—lY^ ( v  m o d  p)-'"l= } 1  )  which means a majority (minority) of positive (neg-  ative) interactions. In each case RB\, the relative bias i n estimating the average effect of Xi, is calculated. Note that each of these choices except (b) involves sufficient symmetry so that the relative bias is the same for estimating all p average effects, i.e., RBi = ... = RB . W e can see p  that the bias increase when o grows, which means that more non-normality would cause larger bias. Moreover, the size of bias also changes i n a n increasing trend as p increases. Therefore, more non-normality of and more dependence among X give larger bias. T h e general impression from Figure 1 is that very large biases are possible when estimating average effects, if predictors are substantially skewed a n d dependent.  Note also that  slight skewness and strong correlation tends t o induce a bigger bias than strong skewness and slight correlation. A n o t h e r thing worth notice is that the relative biases are considerably smaller i n panel (b) compared to the other three cases.  It may represent a 'cancellation effect'  32  Chapter 2. Average effects for regression models with misspecihcations between positive a n d negative interaction terms. Moreover, we have similar magnitudes of relative biases under the other three cases. T h e common property of the three cases is the "overwhelming" strength of interaction i n one direction over the other.  Figure 2 . 1 : Magnitude of relative bias as a function of (p, T) for multivariate log-normal distribution of p = 1 0 predictors. T h e cases (a) through (d) are as described i n the text. (a)  2.3.2  (b)  Relative efficiency of the two average effect estimators  A natural question of interest is which one is more efficient between the two estimators 5j(/3) and 5j(/3), which means to compare Vj((3) i n Result 1 a n d Vj((3) i n Result 2. W e take the ratio of the latter to the former as the relative efficiency, w i t h values larger than one representing the inefficiency occurring as a result of model misspecifications. We apply Result 1 and 2 directly to linear regression context, where a n additive model is fitted to the d a t a generated by pairwise interaction model. D u e to the symmetry i n  33  Chapter 2. Average effects for regression models with misspecifications the p predictors under b o t h models, we can only take the average effect estimator of X  x  for example. N o w we have  T  =  (1, Xi,...,  X , XiX ,...,  Ti  =  (0,1,0,.  S  =  (l,Xi,X ,...  S\  =  (0,1,0,...,0)'.  p  Xp-iXp)  2  ,  ..,0,X ,...,X ,0,...,0)' 2  p  ,X )  2  P  W i t h Result 1 and 2, we have  nVar{<5*}  ^ { E t S O j ' E i {£(§!)} +  ->  1  {E ( S O V E s ' E {(T'/3 - S'a,) SS'} E ^ E (S^)}, 2  nVar{<5i} —>  a {E(T )}'ET {E(T )} 2  +  1  1  1  {P12,  A )Var ( X , X ) ( P i , p \ ) ' , P  (2.17)  2  p  2  p  (2.18)  where Es  =  E{(l,Xi,...,X )  S  =  E {(1, Xi,...,  T  (l,Xi,...;X )},  p  (1,X\,...  p  X , XiX ,..., p  2  ,X ,XiX ,... p  2  XiXp) ,X X )}. X  P  Based on the expressions it is clear that the difference of the two unconditional variances depends on the true value of parameters (3. Therefore generally speaking, the comparison of the two variances could depend on the true values of coefficients of those interaction terms involving X . x  y  In particular, we are interested i n situations where the additive model can yield con-  34  Chapter 2. Average effects for regression models with misspecifications sistent estimator under the true relationship involving a l l pairs of interactions.  More  precisely, if X has independence of components or multivariate normal distribution (discussed i n Result 3), what is the relative efficiency? We take the former case first. A s s u m i n g E(Xj) the independence of Xi,...,  X  p  = 0, V a r (Xj) = l,j  = 1 , . . . ,p, a n d  , therefore we have  — Ip+l  7  S  0  s  V 0 £22 where S 2 is the covariance m a t r i x of (X\X<t, • • •, 2  X\X )'. P  T h u s the asymptotic variance of the estimates from the right model is  a (0,1,0,..., 0)2^(0, l,0,...,0)' + | > i 2  j=2  T h e first term is right the first term i n (2.17). Note the fact that  T'/3 - S'a» =  ^2  PijXiXj,  l<i<j<p  where  a.  = Ss E(ST') 1  -  (l i,0)d. P+  35  Chapter 2. Average effects for regression models with misspecihcations Therefore, the second term i n (2.17) is  Hence, the ratio of the two asymptotic variances is  a +E{{ z Pi x x fxi 2  1  l<J  i  i  j  l + 3ll7i.|| + i i 7 - i . | | 2  }  2  1 + Il7i.|!  2  where 7 = a~ 0 1  I  and (7i.,7_i.) is a partition of 7 into those interaction terms which do  and don't involve X\ respectively. Therefore, the asymptotic variance of the estimator based on "misspecified" model is larger. T h i s is consistent w i t h our i n t u i t i o n that more uncertainty i n Sj(3)  is caused by model misspecification.  N o w the question of interest is how large/small the relative efficiency would be if there is some dependence among the components of X? A s s u m i n g that X has a multivariate normal distribution w i t h mean 0 and equi-correlated covariance matrix, that is, X N (0, p  (1 - p)I  p  + pip).  ~  F r o m (2.18) and (2.17) for given p, it is easy to justify that the  relative efficiency depends only on p and 7 = a~ 3 , w i t h the latter one describing the 1  I  interaction 'signal' relative to noise. For p =10, Figure 2.2 shows the relative efficiency as a function of p and ||7||, where two certain directions for 3j, i.e., case (a) and (c) defined i n Section 2.3.1, are considered. We can see that if the true interactions among X are rather "sparse", like the choice of 0j  i n case (c), the efficiency loss is much smaller compared to that w i t h a  "dense"  interaction structure i n panel (a). A l s o note that the misspecified-model estimator can 36  Chapter 2. Average effects for regression models with misspecihcations be very inefficient w i t h strongly correlated predictors, but the efficiency loss tends to be slight w i t h independent predictors. A s a related point, the rate at which the efficiency loss grows w i t h the strength of the underlying interaction signal is governed by the strength of correlation. Figure 2.2: Relative efficiency of the "misspecified"-model average-effect estimator as a i n the text. (c)  \\\ \  \ \ \  \  \  •S  \  ^  " '  -|  I.J  •— 1.1  r  o.e  .0.4  p  2.4  Nonparametric regression and smoothing  T h e above analyses i n Section 2.3 are based on a simple scenario, where a parametric model i n terms of a linearity of response variable Y i n continuous predictors Xi's  is  postulated. W h a t if the linear regression model is not appropriate? F i t t i n g a linear model to the d a t a actually containing a nonlinear structure can given very misleading results, even worse than useless. A more general alternative to linear regression is nonparametric regression model. T h e distinguishing property of nonparametric regression is that there is no (or very little) a priori knowledge about the form of the true structure of the regression  37  Chapter 2. Average effects for regression models with misspecihcations function. It allows the class of functions which the model can represent to be very broad. W h e n shall we use nonparametric regression model? In many real problems, there is no information from the data nor scientific knowledge to suggest a parametric form, so that a parametric model is often specified from a casual graphical summary of the d a t a or chosen for convenience (for example linear regression models). A predetermined parametric model might be too restricted or too low-dimensional to fully model a "rich" d a t a set containing many unexpected features. In such a case, we would like the nonparametric (smoothing) approach, which offers a flexible t o o l i n analyzing unknown regression relationships between response variable and predictors. Also parametric vs non-parametric depends a lot on sample size, especially when there are many predictors. Smoothing methods, widely used i n nonparametric modelling, deserve a respectable place i n statistics. There are many papers and a number of books study on this topic (Silverman 1986; E u b a n k , 1988; Hastie and T i b s h i r a n i , 1990; W a h b a , 1990; Green and Silverman, 1994; G u , 2002).  A s a matter of fact, smoothing methods provide a  bridge/compromise between making no assumptions on the underlying process that generated the d a t a (a purely nonparametric approach) and making very strong assumptions (a parametric approach). In the following subsections, we mainly focus on consequences of model misspecification o m i t t i n g interactions under the context of least squares regression i n smoothing.  2.4.1  Spline regression models o  Spline functions are very flexible and thus are often used i n smoothing regression.  A  spline function is a piecewise or segmented p o l y n o m i a l . M o r e precisely, it is defined as below.  38  Chapter 2. Average effects for regression models with misspecihcations Definition: T h e function 0 is a spline on [a, b] of degree D w i t h knots * i , . . . , *£, (a < t i < • • • < tr, < b) if <f> is a p o l y n o m i a l of degree D on the subintervals [a, ti], [t\,t ], • •., [tc b] 2  and 0 has D — 1 continuous derivatives o n [a, 6]. Denote the collection of these splines by S (ti,... ,tL]D). p  Take D = 1 for example. A spline of degree 1 is continuous a n d  piecewise linear, w i t h breaks i n slope occurring at t\, • • • ,t^. Based on the definition of spline, it is not hard to show t h a t the collection of the splines of degree  is a linear space of functions. T h e n we can talk about its dimension  and construct bases for the space. T h e dimension is the number of parameters needed to describe a member of the space, which i s . D + L + l ( o r . D + L i f forcing the spline space to not include a constant term). T h e number of functions i n a basis w i l l simply be equal to the dimension. For simplicity and ease-of-understanding, we introduce the power basis. Note that for other forms of bases, the analysis discussed later is also applicable b u t w i t h more difficult computational problems.  Power Basis: Denote the power basis by 4>\,...,  (J>L+D+I-  1  4> {x)  =X,  2  4>D+\{x) = (pD+2(x)  ,D X  =  4>D+L+l(x)  = [x ~ t ) D +' L  39  Chapter 2. Average effects for regression models with misspecihcations where  (x - t)  D  I  0  for x > t, forx<t.'•  .  It is easy to show that each <f>j above is a spline and all the ^ ' s are linearly independent. Therefore for any function h i n the spline space w i t h degree D, it can be written as  M*) = £ ; = i  D + 1  Thus what happens if the fitted model is misspecified? Note that the m a i n concern now is about the impact of model misspecifications. Therefore, how to choose the degree of spline (D), number of knots ( L ) and locations of these knots (t\,... ,ti)  is left aside  for the time being, although this is always concerned i n smoothing. T h a t is, we assume they are already appropriately chosen i n the following studies. We start w i t h a regression model having only two predictors. Say the fitted model is  (Y\Xi  =xi,X  = x ) ~ 7 V ( m ^ x i ) + m (x ),o- ) , 2  2  2  2  2  while the "true" model is  (Y|Xi = xX u  where m i , m , g \ , g 2  predictors.  2  2  = x ) ~ N(g (x ) 2  1  are splines. Here g  l2  1  + g (x ) + 2  2  9U{XI,X ),T ) 2  2  ,  accounts for the interactions between the two  In general, there are many plausible possibilities for the form of _g . i2  simplicity and interpretability, we use the form of gi (Xi,X ) 2  the following, where t\,t  2  2  = X t\(X\) 2  +X t (X ) 1 2  2  For in  are splines.  \ Note that generally the basis functions for different predictors could be different. For concise notations without loss of clarity, we suppress the subscript of $ since each of its  40  Chapter 2.  Average effects for regression models with misspecihcations  components is expressed  function of Xi.  rrii =  &(Xi)oti,  9 l  =  &(Xi)0i,  t  =  t  Suppose  &(X )f% ,i a  i  =  l,2.  Similar to the linear regression case, without loss of generality we assume that for i =  *pQ = {M^i),...,  l , . . . , p , E ( X i ) = ' 0 , E [ * ( X ) ] = 0, where i  the number of basis functions for Xi). Xi — E(Xi)  (ov$(Xi)  — E[$(Xi)]),  <^PQ)'  Otherwise, we can replace Xi  (Ki denotes  (or<&pQ)) by  and those centering constants would be included  into the intercept, keeping the coefficients of basis functions unchanged. T h u s the mean function of the fitted model can be rewritten as  E (Y\X  U  X) 2  = a+ 0  (V(*i),  *'(X )) 2  v where a  0  t  t  2  /  denotes the intercept.  Let a = (a ,ai,a ). 0  Based on W h i t e (1982), we have a * , the large-sample limit of  2  the a , as the solution to E  where ot —  (OLQ,OL1,OL2)'  ^d\ogf(Y\X ,X )^ 1  2  =  Q )  .  Note that the / function i n above equality denotes the density function of the fitted model and the expectation operation is w i t h respect to the true joint distribution of Y and X .  41  Chapter 2.  Average effects for regression models with misspecihcations  Therefore, we derive that  [ (  1  \  E I  E  a i  Y  * ( * i  Note the fact that  / A E(Y\X)  =  ^  (X ),*\X ),X *\X ),X &{X j) 1  2  2  l  1  (2.19)  2  Dint  P\  V PT  j  Therefore we can link the above two equalities to derive that  int  Pi  +  \  cr/c-  (2.20).  11 ^12  where  C  N  =  E|(I *'(X ) * (X ))'(I $'(^ ) $'(X ))|, ,  )  1  )  2  Cia  =  E | ^1,  C  =  E {(x&'iX^X&'iX^  2 2  $'(^2)^  )  1  )  2  (* $'(Xi),Xi$'(X ))} 2  2  (x&'iXi^X&'iXd  We also use the idea of average effect as that i n the linear scenario to evaluate the impact of model misspecifications.  T h a t is, we are interested i n the average effect by  averaging over the joint distribution of a l l predictors X . Note that the predictive effect  42  Chapter 2. Average effects for regression models with misspecifications of Xj here is function of a l l predictors while just a function of X y ) i n linear regression settings. W i t h o u t loss of generality, we only take the average effect of Xi for example. Here the large sample limit of average effect of Xi estimated from the fitted model is  E  a  (2.21)  dX,  while that i n the "true" model is  dX  dXi  1  + PT *(x ) 2  (2.22)  Similar to the situations considered i n linear regression, we also focus o n some special cases here. Say, i f X  x  and X  2  axe independent of each other, then we get E (Cn) = 0.  Hence, the equality (2.20) now becomes  1 ^  y a=2* J  T h a t is, if Xi a n d X are independent, we get the consistency of the coefficient estimates 2  ci. Subsequently we also get the agreement between the two average effects defined by  (2.21) and (2.22). In the following we study the uncertainties of the two average effect estimates. L e t <&(Xj) be an n by Ki m a t r i x w i t h <E>y = 4>j( n) x  xu, I = 1,2,i = l,...,n,j  — l,...,Ki.  a  n  d xj be a n n by 1 vector w i t h components  Define the sample version of E T ( = E ( T T ' ) as  43  Chapter 2. Average effects for regression models with misspecifications follows.  I  D  12  1  \ D'12 ( D 11  \  1  J  (1 $(X ),*(X )),  * ' ( X i )  J  1  2  V *'(x) I 2  f D12  =  1  ^ (X $(X ),X *(X ))  *'(X!)  2  1  1  2  !  V *'(Xa) J X  £22  2  * ' ( X i ) (X *(X ),X $(X )).  =  2  V  X ^  1  1  2  (X ) 2  Again similar to the linear regression case, we have  + DT?D 11 ^ 1 2  Hence the estimate of the fitted average effect can be rewritten as  w )  = [d*r3 +([d#]'i> )3r+([d*]'i>i*2)3r, 1  u  44  Chapter 2. Average effects for regression models with misspecifications  where  n  D*  f  J ™ da;  r>  x = X l i; ' Il T—^T-i  (un ^) [-1,] 1  U  2\  22  U  Notationally, #[-1, ] denotes a sub-matrix of # after deleting the first row. The estimate of the average effect of X based on the "true" model is x  stf)  = [d*]'3  1  +  n  X;  d$(x), da:  2  \x=x  u  0 +  n  %2i  P  2  With Result 1 and 2, it is possible to compare the asymptotic variances of the two estimates. However, the calculation of (2.9) now is rather complicated due to E ((T'/3 — S'a»)SS') 2  and the last two non-vanishing terms in (2.9). Thus we compare the asymptotic  conditional  variances in the following. Recall that  T is the vector of the predictor functions in the true relationship and S is the vector the predictor functions in the fitted model. Recall that Tj, Si denote the derivative with respect to X\ of T and S, respectively. Using essentially the same derivation as that in the previous subsection of linear regression scenario, we get that as n —> oo, nVar(tfi(a)|X) - a [E(Si)]'{E(SS')} E(Si) 2  nVar(Si(/3)|X) -> a [E(Ti)]'{E(TT')} E(T ) 2  _1  x  45  Chapter 2. • Average effects for regression models with misspecihcations where  T  =  Ti  =  S Si  (1,^(X ),^'(X ),X ^(X ),X^'(X ))', 1  o,  2  1  2  1  • d$(Xi).  d$ 0lxK ,[(*(^2))]';  dXi  2  =  (l,*'^),*'^))',  =  (0,^|-$'(Xi),Oi ^).  E  (  X  2  d ^  }  x  A s we discussed before, i n the linear regression scenario, the joint n o r m a l i t y of the predictors can also yield the consistent estimates of average effects. W i l l this good property still hold now? T o explore the effect of the correlation between predictors on the difference of the two average effects, we'set up predictors and basis functions as follows. Al.  Suppose Xi and X  are bivariate n o r m a l w i t h mean vector 0 a n d covariance  2  matrix  \  7  £=  1 P P  IJ  A 2 . For simplicity, a common set of quadratic power basis functions for two predictors is used: <pi(x) = <j) (x) = 2  x - k x 2  u  k, 2  <hi(x) = <t>4(x) =  where ti,t  2  (x-ti) I{x>ti}-k , 2  3  (x-  t ) I{x 2  2  > t } - ki, 2  are the knots, which are set to be the 25% a n d 75% percentile of standard  normal, respectively. T h e ki's are the centering constants to make E(4>i(Xi)) — 0, i = 1 , . . . , 4 . A l t h o u g h the choice of basis functions is simple, it mimics the generality of 46  Chapter 2. Average effects for regression models with misspecihcations splines expressed as linear combinations of the power basis functions. A s for the number of knots, we can also easily incorporate more knots into the basis functions.  Here for  computational convenience, we just use two knots for demonstration. A3.  Set the true values of parameters as B  Q  =  0,0  l  = 0  2  = 0™* = 0  nt 2  =  (0.5,0.5,0.5,0.5)'. Under the above settings, we can get the values of the elements of matrices C n , Cu by a numerical approach. T h e details of the numerical approach for the required integrals are i n A p p e n d i x I I I . One fact worth noticing is that when p = 1, that is X\ = X , 2  Cn is not invertible.  Therefore we can think of the difference between the two average effects as a function of p, the correlation coefficient, confined w i t h i n [ 0 , 1 ) . A s the correlation increases, the discrepancy between the two average effects is quite small compared to the range of average effects. T h e graphical summarization is shown i n Figure 2.3, which implies that the bias of the average effect is close to zero even if the correlation of the two predictors is high.  47  Chapter 2. Average effects for regression models with misspecihcations  Figure 2.3: Comparison of two average effect estimators 5i(0) and 5{(8) for spline regression models w i t h two predictors involved. T h e x-coordinate of each circle is 6{(0) and the y-coordinate is 8\({5). Different circles are produced by different values of p, the correlation coefficient between X\ and X . 2  1  1.5  1 2.0  1  2.5  1 3.0  3.5  A s shown i n Figure 2.4, we can see that the difference of the two conditional variances is rather small compared to magnitude of the conditional variances. T h i s means that even though the model is misspecified, the precision of the estimates for given values of predictors does not seem to be affected that much by the misspecification.  ;  48  Chapter 2. Average effects for regression models with misspecihcations  Figure 2.4: C o m p a r i s o n of asymptotic conditional variances of two average effect estimators for spline regression models w i t h two predictors involved. T h e x-coordinate of each circle is the conditional variance of misspecified-model estimator and the y-coordinate is that of right-model estimator. Different circles are produced by different values of p, the correlation coefficient between X\ and X . 2  i'.W)  2.4.2  Penalized regression models  Penalized spline regression (often referred to as P-splines) has received attention as a powerful smoothing method.  Originally suggested by O ' S u l l i v a n (1986), the method  provides a range of practical modelling tools i n applied statistics, w i t h the books by Green and Silverman (1994) and more recently by R u p p e r t , W a n d , and C a r r o l l (2003). T h e m a i n principle of penalized spline regression is to estimate the unknown regression function by a compromise between sum of squares of residuals (represent the fidelity to the data) and smoothness of the estimate. We start w i t h a univariate case. Suppose that we have d a t a (xi,yi) (for now Xi is a  49  Chapter 2. Average effects for regression models with misspecihcations scalar not a vector),  Yi = m(xi\oc) + €j, where m is a smooth function denoting the conditional mean of Y given x t  it  and  {ej}"  =1  are independent, mean zero r a n d o m errors w i t h a constant variance. T o estimate m we use a spline regression model L  m(x\ a) — a + aix + • • • + a ^ a r + £  a (x — t )+,  0  0  k  k  'fc=i  where d > 1 is an integer, a = (ao, coefficients, ti < • • • < t  a^)' is a vector of regression  a i , . . a i , . . . ,  are fixed knots and (a; — tk)+ = (x — tk) I{x D  L  > t }. k  Actually,  m is expressed by a set of power spline basis functions. Recall that S denotes the "design matrix" for the model so that the i - t h row of S is  S  f  =  (l,x ,...,xP,(x -t )°,...,(x -t ) >) I  i  =  i  1  i  L  ( l , *'(?<))•  However, simple parametric fitting of a would lead to unsatisfactory results due to the high dimensionality of basis functions. Instead, a is estimated i n a penalized manner by imposing a penalty on the coefficients i n m. A roughness penalty is placed o n  {a } k  k=1  which is the set of jumps at the knots i n the p - t h derivative of m(x;, a ) . T h i s leads to the penalized least-squares estimator  {Y  L  71  i=i  =  {Vi ~ m(x; a ) } + A £ 2  a\  fe=i  min{(Y-Sa)'(Y-Sa) + Aa'Pa}, a  w i t h A as penalty parameter controlling the trade-off between fidelity to the data a n d  50  Chapter 2. Average effects for regression models with misspecihcations smoothness of the fitted spline and P as the corresponding penalty matrix. To put the roughness penalty mentioned above, we choose P to be a diagonal m a t r i x whose first (1 + D) diagonal elements are 0 and whose remaining diagonal elements are 1. B y simple algebra, the penalized least squares estimate of a is given by  o(A)  Since n~ XP 1  =  (S'S + A P ^ S ' Y ,  =  (n^S'S +  n^XPy^n-^'Y}.  goes to a zero matrix, the asymptotic behavior of a (A) is the same as that  i n the standard spline regression without penalty. T o work out the least squares estimates a (A), we need to determine an appropriate value of A first. Generalized cross-validation ( G C V ) (Craven and W a h b a , 1979) is one method of smoothing parameter selection that has proven effective and has good theoretical properties. Here we follow Ruppert (2002) closely. Let  ASR(X) = n - J2 {Vi ~ mixi; a7(A))} i=i 1  2  be the average squared residuals using A. L e t  S(A) = § ( S ' S + A P ) S ' _ 1  be the "smoother" or "hat" matrix. T h e n  GCV(X)=  A  (1 - n  is the generalized cross validation statistic.  ft r^ , {S(A)}) 1  m  2  (2-23)  Here t r { S ( A ) } is the "effective degrees of  freedom" of the fit. One chooses A minimizing G C V statistic over a grid of values of  51  Chapter 2. Average effects for regression models with misspecifications A.  C o m p u t a t i o n can be sped up a n d stabilized numerically w i t h the following diago-  nalization method that is variation on the Demmler-Reinsch algorithm used to compute smoothing splines. Let B be a square m a t r i x satisfying B~ (B~ )' 1  1  = § ' § , for example, B~  l  is a Cholesky  factor of S'S: L e t U be orthogonal a n d let C be. diagonal such that UCU' = BPB'. For example, we can use the eigen-decomposition of BPB' to find U a n d C. T h e n by some algebra, we get tr{S(A)} = £ ( l + A C ) - ,  (2.24)  1  l  i where  is the i t h diagonal element of C. Details of proof is given i n A p p e n d i x I V .  T h e elegance of this method is that the work of calculating B a n d (U, C) needs to be done only once a n d then these quantities can be used for a l l values of A. Therefore, we have an efficient way to evaluate G C V defined by (2.23). . T h i s method is easy to be extended to the additive model w i t h more than one predictor, i.e., p >1. Suppose we have d a t a ( y ; , X j ) (XJ =  Yi = a + mi(x ) 0  u  (xu,...,x i)'), P  + • • • + m (x i) + e,. p  p  We w i l l use a spline model for each m : k  Li  mi(x\ ai) = a x -I n  ha x  D  iD  +£  a (x - ti)+, ik  l=  l,...,p',  fc=i  where on — ( a n , . . . , a;o, o n , . . . , a/xj''• Hence the vector of whole parameters is a = ( a , a' ..., 0  1:  CKp)', a n d the penalty m a t r i x P is set to be the s u m of p diagonal matrices  Pi (i = 1,... ,p).  E a c h Pi is a (pK + 1) [K is the dimension of basis functions) by  pK + 1 square m a t r i x whose (p x i + 2 ) t h to (p x i + L + l ) t h diagonal elements are 1 and whose remaining diagonal elements are 0. Therefore the least squares estimate of a  52  Chapter 2. Average effects for regression models with misspecifications  is the minimizer of U  P  /\2{yi ~ m(x»; a)}  2  + ]T A; ^ <  i=i  i.e.  ;=i  fe=i  (Y - Sa)'(Y - Set) + a' ^  XiP^j  ex.  Note S is the corresponding "design matrix" with z'-th row as Si. =  {l,&{x ),...,&(x )). li  pi  Consequently the least squares estimate of a can be solved by  a(A) = ^S'S +  If the components function (mj }f  =1  AjPi j S'Y.  require roughly the same amounts of smoothing,  we may assume a common value of A;, I = 1,... ,p, which makes a quick access of the estimation of smoothing parameters by using the methodology introduced in the univariate P-spline case. More realistically, different component function mj's require different amounts of smoothing and this can be accomplished by allowing A; values to vary rather than a common value. In the algorithm of Ruppert and Carroll (2000), A ..., A are 1;  p  chosen by GCV in two steps. Note that GCV is a function of Ai,..., A . In the first p  step, GCV is minimized by assuming that Ai = • • • = A = A. In the second step, set p  the common smoothing parameter as the starting value of each Aj, Ai,..., A are selected p  one-at-a-time by minimizing the GCV criterion.  53 In the following simulation study, we see how the two average effect estimates based  on penalized least squares would be affected by the smoothing parameter. Suppose we have two predictors involved in the model and both of them follows a uniform distribution  Chapter 2. Average effects for regression models with misspecihcations on [0,1] independently. Here a common set of centered cubic basis functions are used and hence a common smoothing parameter A is assumed. In the basis function, five knots are used, which are equi-spaced w i t h i n [0,1]. Generate a data set of n = 200 realizations of (Y,Xi, X ), 2  where Y is generated from normal  distribution w i t h variance of 1 and mean is based on the model w i t h interactions (2.19). T h e true values of all the coefficients are set to be 0.5. Repeat generating 200 d a t a sets, and calculating the average effect estimate for each data set. W e can vary the sample size n to see what happens as n gets larger.  T h e top panel i n F i g u r e 2.5 shows the  average effect estimates from the d a t a sets of size n = 200 and the b o t t o m panel shows that of size n = 1000. T h e solid horizontal line i n both panel is the true value of the average effect. Therefore we can see that the estimates lie around the true value w i t h a larger variability, which is caused by the larger variability of the smoothing parameter estimates. A s sample size increases, the variability goes down, i.e., the estimates i n the b o t t o m panel are more concentrated on the true value than the top one. One problem from Figure 2.5 is that the range of the estimates is quite broad compared to the scale of the true value 2.0208, which could visually blur the concentration around the true value. To make the concentration around the true value more visual, we use only a subset of those estimates, those w i t h i n —10 and 10, to plot the histograms, illustrated by Figure 2.6. Note that the number of estimates outside of this range is 32 for n — 200 and 8 for n = 1000, which are b o t h small compared to 200, the number of estimates i n a l l .  54  Chapter 2. Average effects for regression models with misspecihcations  Figure 2.5: Scatter plots of average effect estimates i n penalized spline regression w i t h only two predictors: the top panel w i t h sample size n = 200 a n d b o t t o m w i t h n = 1000. T h e solid horizontal line i n each panel identifies the true value of average effect, a n d each circle represents a different simulated data set.  Figure 2.6: Histograms of the average effect estimates i n penalized spline regression w i t h only two predictors: the top panel w i t h sample size n — 200 a n d b o t t o m w i t h n = 1000, and the symbol ' x ' i n each panel marks the true value of average effect. n=20  $  8-  I  1  1  -10  -5  0  "  1  1  5  10  Average effect estimate  55  Chapter 2. Average effects for regression models with misspecihcations  2.5  A middle scenario  A l t h o u g h we do not get exactly consistent estimates i n the spline regression scenario as discussed above, we still wonder whether there is a n intermediate situation between straight-line fitting and curve fitting. In this section, we consider models involving the quadratic terms of the predictors. To explain, we assume the true model to be  E (Y\X X ) U  = 3 +BX  2  0  1  1  + PX 2  + B X? + B X\ +  2  3  B XX ,  A  l2  l  2  while the.fitted model is  E (Y\X X ) U  = a + Xi  2  Q  +aX  ai  2  Assume that the joint distribution of X\ and X  +aX  2  2  3  + a X\. A  is a bivariate n o r m a l distribution w i t h  2  mean vector 0, C o v ( X i , X ) = p and V a r p Q = l,i = 1,2. 2  Based on the previous result (2.14), we have  Cil*  — R i n - p ( Y \ , R St=l ^ — Pi + P i 2 E ( A ) + P12  "3.  =  n  f c l E  {XiX )X 2  k  2  Pz  +  12"  P.  where E is the covariance m a t r i x of  (X\,X ,Xf,X ). 2  2  Using moments of the bivariate normal distribution, i t is easy to derive that  «i*  =  «3*  =  P i ,  Pi + -1  P  nPl2-  1 + p  2  W i t h Definition 2 of average effect, that is, averaging over the joint distribution of  56  Chapter 2. Average effects for regression models with misspecifications  (Xi,X ), 2  we have 5\(0) —  T h a t is, we s t i l l have (exactly) consistent estimator  under this setting. However, interestingly, based on Definition 1, i.e., averaging over the m a r g i n a l distrib u t i o n of X , we get 2  5{(xi;(3)  =  a i + 2a »x  =  ft+  3  1  ^£12).*i,  2 (ft+  and  <Ji(z /3)  =  i;  p\ +  2p\ . Xl  Hence, we get the difference between the fitted and true average effects  5* {x ;P)-5 {x ;P) 1  1  1  1  =  -£—p\ . 1+ p 2Xl  It is easy to verify that the bias increases when p increases for given X\ = X\. O n e special case is that when p equals 0, the difference disappears. It is also clear that the independence of XiS make consistent average effect estimates no matter whether the joint d i s t r i b u t i o n of X^s is n o r m a l or not. Furthermore, if Definition 3 is used, averaging over the conditional d i s t r i b u t i o n X \Xi  =  2  xi, we have  8i{xi-p)  Note that 6l(xi\(3)  =  Pi + 2p xi + 3  p pxi. u  does not subject to the change of definition because i t does not  57  Chapter 2. Average effects for regression models with misspecihcations depend on X . 2  Thus, the bias becomes correspondingly  Sl(x ;P)-5 (x ;/3) 1  1  1  =  - ^J p  P-P  p\  2Xl  3  It is easy to verify that for any given x\,  a  the bias increases if \p\ < \J\/5  — 2 and  decreases otherwise. Therefore, conditional on different definitions of average effect, we may have different conclusions about the consistency of estimators based on a simpler model.  2.6  Summary  In section 2.2, Result 1 & 2 show the asymptotic distribution of average effect estimators under "true" model and "misspecified" model, respectively. Note we don't have to assume that the two models are nested, that is, our results can be applied to more general model misspecification situations. In the linear regression context, discussed i n section 2.5, Result 3 gives the conditions to yield consistent estimator when fitting an additive model without interactions to the d a t a generated from a pairwise interaction model without pure quadratic terms. A l t h o u g h the conditions, independence or joint normality (elliptical-contoured) of the components of X (centered), may not be satisfied i n practice, we could t r y appropriate transformations to make the distributions of X close to either of the conditions. In section 2.4, spline regression context, we can still have consistent estimator when the predictors are independent and spline basis are centered.  W e also  explore the bias under the condition of joint normality, and find out the magnitude of bias is quite small even when there is strong dependence between the predictors. T h a t is, the estimator based on additive model without interaction is approximately consistent.  58  Chapter 2. Average effects for regression models with misspecihcations In section 2.5, we consider models w i t h quadratic terms of predictors, where consistent estimator comes into being under the conditions in Result 3. W e should be aware of the fact that the consistency of average effect estimator depends on the definition of average effect, as implicated by the example i n section 2.5. A l l the results/conclusions we have in Chapter 2 is based on Definition 2, that is, averaging over the joint distribution of a l l components of X . However, this definition may not always be appropriate to use. For example, if one is interested i n comparison the risk of lung cancer between two groups of people having different smoking habit. Say one group never smoke and the other smoke everyday. There are also bunch of other risk factors, such as gender, age, resident and so on. Since we want to know how smoking makes difference i n the risk of lung cancer, we should consider Definition 3, averaging over the distribution of all the other risk factors conditional on smoking factor. Thus, for different scenarios, we need to investigate the consistency of estimator equipped w i t h different definitions.  A s for which definitions  should be applied, it depends on the goal of study.  59  Chapter 3 Comparison of interaction detectability under different interaction models T h i s chapter w i l l focus on comparison of power to detect interactions under different regression models, i n particular, a pairwise interaction model a n d . a diffuse interaction model. Section 3.1 has the background on asymptotic power under local alternatives for W a l d (or quadratic form score) statistics. Section 3.2 gives a concrete example to show how powerful the diffuse interaction model is to detect interactions no matter what the true structure of interaction is diffuse or not.  3.1  General framework  In this section, we give a general result about the asymptotic power function of score (quadratic form) test for presence of interactions are derived based on two models. Let T = {f(y\  x,6)  :6  G  6 } and  Q  = {g(y\  ~K,U>),U>  G  0,} denote two differe  parametric families of densities under consideration when modelling the relationship between response variable Y and predictor variables X ,... x  ,X . p  W e assume an agreement  between the two families at some certain values of their parameters. T h a t is, one member  60  Chapter 3.  Comparison of interaction detectability under different interaction models  from G satisfies that  g(y\  x, u> ) 0  = f(y \ x,  0 ), O  for a l l y and  x,  so as OJ moves away from w , g(y\ x, u>) moves away from F i n some specified way. Under 0  model T (Q), 6 = OO(OJ = U>Q) stands for the null model of no interactions. L e t f,g  denote densities and F, G denote the corresponding distributions. L e t p\ =  dim(0),p = dim(u>) be the dimensions of !F and Q, w i t h 2  s (e, Y, x) = 3 [ i o { / ( Y | x, e)}]/ae F  g  and S (L>,Y,X) G  = <9[log{ (Y|  X,u)}]/du>  5  being the respective score vectors, and  I {0) = E {s (e,Y,X)s' (d,Y,X)} F  e  F  F  and / (w) = G  E {SG(U>,Y,X)S' (U>,Y,X)} U  being the corresponding Fisher information matrices.  G  Note the two above expectation  operations are w i t h respect to joint distribution of Y and X. L e t /x(x) be density of X w i t h respect to measure i^(x) (either Lebesgue or counting measure). To compare the capability to detect interactions under the two families T and Q, we set up two sets of hypothesis tests w i t h T as the fitted model and Q as the true model. T h e reverse case w i t h T being the true distribution while Q being fitted model, is just an analog. Hence i n the following, we just take the former case for demonstration.  61  Chapter 3. Comparison of interaction detectahility under different interaction models  To make the comparison tractable and also to produce some nontrivial asymptotic powers that are not all equal to 1, we use a Pitman-type local analysis (developed by Le Cam (I960)), focusing on n / - neighborhoods of the true parameter values. -1  2  Say T is the fitted model while Q is the true model with u  = u3 + n~ ^ A'q, l  n  2  0  where A  is a scalar and rj is a vector of length p . The hypotheses we used here are H : CO = £ 2  0  versus H : CO ^ Co> where C is a matrix r x dim(0) of full row rank and £ a  0  ls a 0  certain vector (for example, zero vectors denotes an additive model if CO is the vector of interaction parameters). Note that the above hypotheses are more general than testing H : 0 = 0 , where C just reduces to an identity matrix as a special case. The above 0  0  hypotheses consider any linear combination of the whole parameter 0 under the fitted model. To compare the capability to detect interactions, the parameters of interest are just those related to interactions. In particular, if the pairwise interaction model (defined later by (3.6)) isfitted,the whole parameter vector 0 = (/3Q,PI, ... ,Pp,Pn,  •••  ,P{p-i) )'. p  Only the last q = p(p — l)/2 components relate to interactions. Hence let C\ = (0 (p+i), Iqxq) so that gx  C\Q = (Pn, • • •, P(p-i) )', P  which is what the parameter vector  of interest for testing. If the fitted model is the diffuse interaction model (3.5), the whole parameter vector is u  = (Po,P\,...  ,P , P  A)' and only A relates to interaction. Thus let  C - (0 ( i), 1) so that Cu> = A. 2  lx  p+  2  Let 0 be the maximum likelihood estimator based on f(y\ x, 0) from n observations n  from g(y\ x,u;„). Then  n(C(O  - 0 ))' {Cl^ie^C}'  1  n  0  (C(0  n  -  0„)) is the Wald test statistic used 62  Chapter 3.  Comparison of interaction detectability under different interaction models  here, whose asymptotic distribution is x  w i t h degrees of freedom of rank(C) if / ( - |  2  9) 0  is the true model. Because T is not the true model,  n l C{6  - d ) = n' C  l 2  l 2  n  0  (d  n  -  {0.(u ) - 0 ),  n^C  0.(w )) + n  0  n  (3.1)  where 0«(u;) is the parameter vector which minimizes the K u l l b a c k - L e i b l e r information criterion, that is  0*M  = argmin J  |log ^  e  j g(y\x,  o>)/ (x)dydi/(x). x  Note that the fact </(-| wo) = /(-| 0o) yields that 0,(u> ) = 0oo  B y W h i t e (1982) we know that the first item on the right side of (3.1) is asymptotically normal w i t h mean 0 and covariance m a t r i x CIp (6 )C'.  So we only need to work on the  1  0  second item on the right side i n (3.1). B y the definition of  0*(w),  we know that 0*(<*>) is the value of 0 solving  J s (8'u),y, F  x)g{y\x,  w ) / ( x ) d y d i / ( x ) = 0.  (3.2)  x  Based on Gustafson (2001), implicit differentiation of (3.2) gives  '  E [ 4 ( 0 . ( a ; ) , y X ) ] 0 U w ) + E [ s i . ( 0 , ( a ; ) y , X ) s ( u ; ; y X ) ] = O. w  )  Evaluated at u = u , 0  w  )  )  G  the above equality yields  BB ^  =/^ (0o)E K(0 ;y X) 1  e o  o  !  S G  (u;o;y X)}, )  (3.3)  63  Chapter 3.  Comparison of interaction detectability under different interaction models  which is derived by the fact that 0*(u>o) = OQ- Therefore, we have  ^ UU> u>=u>o  AC  du  UI=U>0  j  V,  where du,.  Recall the equality (3.1).  ..  dui  Based on O ' B r i e n et a l . (2006), assuming Q being the true  model, the asymptotic distribution of  n(c(e is a noncentral x  2  - e ))' {cr \d )cy  {c(d - e ))  l  n  0  F  0  0  n  w i t h degrees of freedom of r a n k ( C ) and noncentrality parameter 5 is  calculated by  5=1  r? j {cr \o )cy  I AC  l  AC  F  Q  Suppose the equivalent null hypothesis for Q is C u G  (3.4)  = COG where CQ is a r  G  x dim(u;)  m a t r i x . T h e W a l d statistic based on the fitted model Q is  n{C (O G  - u ; ) ) ' {C Ic\u )C' y  (C(Q  l  n  0  0  G  G  n  Its asymptotic distribution is noncentral Xr (^G),  G  = {AC rj}' G  {C I \u )C' y  l  G  G  0  G  w„)).  where  c  6  -  {AC r)} G  64  Chapter 3. Comparison of interaction detectability under different interaction models  3.2  Comparison between pairwise interaction models and diffuse interaction models  3.2.1  Introduction of diffuse interaction model  Greenland (1983) pointed out that the powers of statistical tests to detect interactions are very low i n some commonly encountered epidemiological studies. W e could imagine even lower power i n the situations where the number of risk factors is rather large and only a very small fraction of a l l possible (pairwise) interaction terms really play a role. Gustafson et al. (2005) proposed another k i n d of interaction model, the diffuse  interaction  model, to deal w i t h difficulties caused by a large number of risk factor under pairwise interaction models. B y diffuse interaction, we mean that the effect of a particular risk factor either increases (synergism) or decreases (antagonism)  as all the other risk factors  increase, without regard to which of the other risk factors get involved w i t h the effect modification.  T h e diffuse interaction model introduced i n Gustafson et al. (2005) is  defined as  E(Y\X ...,X ) U  P  = HD (3.5)  T h e parameter A reflects the magnitude of the synergism/antagonism. Take a binary Xj for example, if A > 1 then it is easily verified that the interaction is antagonistic, i n the sense that the value of E ( Y | A j = 1, X y ) = x y ) ) — E ( Y | A j = 0, X y ) = x y j ) decreases i n each component of x y ) . If Xj is continuous, it is also easy to show that A > 1 gives  dE ( Y | X = x ) dxjdxk 65  Chapter 3. Comparison of interaction detectability under different interaction models which means synergism based on the definition i n Section 1.2.  Conversely, A < 1 corre-  sponds to synergism. T h a t is, the effect modification caused by any putative risk factor increases as other risk factors increase. T h e magnitude of the difference between A and 1 implies how much synergism/antagonism is present. However, we should be aware that A does not provide the information about which of those risk factors contribute in the effect modification for any putative risk factor.  3.2.2  Power comparison  Say the response variable Y is normally distributed and X\,...,X  are the corresponding  P  explanatory variables. W e study an example under the following two interaction models. Pairwise interaction model:  •  E{Y\X ...,X ) U  p,  P  P  -  A) +  X)A*i+ i=l  Pa i ii  (-)  x x  l<i<?<P  3  6  '  Recall diffuse interaction model:  E(Y\X ...,X ) U  =  P  =  Let 3  M  =  (6i,...  , P  P  Y  p  D  + |  j  '  X  i  - °-  A  - °-  and 3j = (Aj)<jxi [q = p(p — l)/2), the coefficients of pairwise  interaction terms. Let 6 = (/3 , By,..., B , (3'j, a )', which is the whole parameter vector 2  0  p  i n the pairwise interaction model and let fi = (/? , B ,..., B , A, a ) ' , which is the whole 2  0  1  p  parameter vector i n the diffuse interaction model. If Bj = / 3  /0  = 0 i n the pairwise interaction model, the model reduces to be an  additive model without interaction terms. Correspondingly, if A = Ao = 1 i n the diffuse 66  Chapter 3.  Comparison of interaction detectability under different interaction models  interaction model, the model reduces to an additive model as well. Therefore, the two interaction models are the same if evaluating at f3I0 and A respectively. Denote by fp 0  the density function of Y | X i , . . . ,X  under the pairwise interaction model, and by fp>  P  the density function under the diffuse interaction model. Denote by- Ip the information matrix under the pairwise interaction model and I  under the diffuse interaction model.  D  Denote by s p ( - , Y , X) the score function of fp under pairwise interaction model and so SD(-,Y, X) the score function of fp, under diffuse interaction model. T h a t is,  (  p(-,Y,X)  S  =  i N  o-- (Y-Lip) 2  X  p  X\Xi  y Xp-iXp  J  ( s (;Y,X)\ D  x=1  =  i  \  o-- (Y- i ) 2  f  D  X  n  where  dctp dX  < =  = ~ [J^PiXi ,i=l  log [Y^PiXi  +  f^.PiXilogifiiXi). 1=1  It is obvious that interaction can be measured by only one parameter A i n (3.5) while p(p — l ) / 2 parameters are used i n (3.6). Hence i n the diffuse interaction model (3.5), we assume that all the predictors interact i n the same direction, either synergistically or antagonistically, which corresponds to A < 1 or A < 1 respectively. It is reasonable 67  Chapter 3.  Comparison of interaction detectability under different interaction models  to imagine that model (3.5) is more powerful to detect interaction since detecting the interaction effect i n one direction could be easier than that i n many possible directions. T h e comparison between two models will be explored by the following cases. C a s e I: Assume that the diffuse interaction (3.5) is the true model w i t h A„ = 1 + A n " / , where A is a scalar. 1  2  (i) T h e fitted model is the pairwise interaction model shown as  (3.6). To test  whether  there are interaction effects, set up the following hypotheses. H  : C d = 0 versus H  H:  /3 = 0 versus H  0  0  x  7  a  a  Recall that C\ = (0 , qxp  : C 0 ^ 0, i.e., x  : (3j ^ 0. I ). qXq  Power = P ( n { d ( 0 „  Let d  denote the M L E of 9. Therefore, we have  n  - Oo)}'  {dip  1  (OoM}-  1  {C {0 x  n  - Oo)} >  xl  Hence, CJp  1  (0 )C[ = If O  A c c o r d i n g to the results we discussed i n the previous section, we know that  n(Cifo -  eo^'iifwr'ic^  - Oo)) Z  2 x q  (s)  68  Chapter 3. Comparison of interaction detectahility under different interaction models  where the noncentrality parameter is dO* dX  A=l  A  where dO,  I (0o)  {E {s (O ,Y,X)s (X ,Y,X)}}.  l  ~ax  P  eo  P  0  D  0  Note that S£>(X , Y, X) is the derivative of log /D(U>, Y, X) with respect to A evaluated at 0  A , i.e., the last component of the score vector SD(W,Y, X). 0  Therefore the asymptotic power is P(x (S) > x\,)> where Xq is the upper a quantile 2  at  a  of x — distribution with degrees of freedom q. 2  (ii) Fitted model is diffuse interaction model. Now the hypotheses to be tested are Ho : C u = 1 versus H : C u ^ 1, i.e., a  2  H  0  2  :'A = 1 vs. Hi : X 1.  Recall that C2 = (0i , 1). Therefore the asymptotic power is xp  P{x\{5)>xl ), a  where 5 = Vb(Ao)  =  I (0 )  =  D  0  w  0  AV^Ao) C r \u>o)C , 2  E  D  2  ' < '  a  i  o  § M  du  J \  (^fo\ du  = (A)./3I.---.A,.A )'. 0  Remark: As a matter of fact, the A should be positive to make the operation (-)" A  n  meaningful. That is, the above analyses work well when sample size n big enough since 69  Chapter 3. Comparison of interaction detectability under different interaction models  X converges to 1, which is away from the boundary value 0. n  Case II: Assume the true model is a pairwise interaction model with  Pin  ^  012,  n  ^  \  P( -l) ,n  J  = P  P  ^  where rj is a q x 1 vector. For this time being, we set each element of rf to be 1, which means every pair of interactions is positive. Later we discuss the consequences of different choices of rj. (i) Fitted model is diffuse interaction model (3.5). The hypotheses to be tested here are Ho : e =  C u — 1 vs. Ha : C UJ 2  2  ^ 1. All the following is an analog of i) of Case I. Now  (p ,p ... p ,xy. 0  u  ;  p  The asymptotic power is P(xl(S)>xl ), a  where 8K dP  P,=o  l  r  dX,  dPj  {/ 3 (u;o)E (s (vj , Y, X)s' (0 , 1  /3/=0  /  aJo  D  0  P  o  Y, X))} (P+2)-  '  where M . denotes for the rth row vector of matrix M. T  (ii) Fitted model is pairwise interaction model (3.6). The hypotheses to be tested here are H : C\0 = 0 versus H\ : C\0 ^ 0. Q  We have  70  Chapter 3. Comparison of interaction detectability under different interaction models  with 5 = A ri'{lp (0 )}~ r). 2  2  i  o  Hence the asymptotic power is P (xq($) > Xq, a)-  As discussed above, we have derived four asymptotic power functions for the four subcases, corresponding to all possible patterns of either pairwise interaction or diffuse interaction model being fitted while the underlying data generation process is the other one. However, it is not easy to tell which power is bigger based on the expressions of asymptotic powers. To be more specific, we take an example as below. Suppose we have p = 9 predictors at hand, the true values of all components of (3 are all equal to 0.5 M  and p = 0. The variance of random error is set to be a = 1. Also the predictors are 2  0  identically independently distributed as Bernoulli with parameter £, the probability of success. Then we plot the power functions against the value of A, with £ = 0.5. As shown before, the key things to compute the power are Ip = E {sp(-, Y, X)s' (-, Y, X)}, P  I = E {s (-, Y, X)s' (-, Y, X)} , E {s (-, Y, X)s' (-, Y, X)} and E {s (-, Y, X)s (-, Y, X)}. D  D  D  P  D  D  P  Under the above settings, Figure 3.1 shows the result of the powers in the four subcases discussed above. The solid lines denote the power curves for diffuse interaction model fitting and the dashed lines denote those of pairwise interaction model fitting. We find that the diffuse interaction model is more powerful to detect interactions than pairwise interaction model, regardless of whether the true model we postulate is the diffuse interaction model or not.  71  Chapter 3. Comparison of interaction detectability under different interaction models  Figure 3.1: Power Curves with AYs'~ ' Bernoulli(0.5): the top panel with the true structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the bottom panel 77 = l i d  g x  True m o d e l is pairwise  !  ^  0  5  |  10  (  15  (  20  delta  In case II, consider replacing rj = (1,..., 1)' with a vector of 77 having entries ±1, that is, the pairwise interaction terms have different "directions". For those terms with positive coefficient, it implies that the impact of the interaction is to increase the value of response variable when the corresponding predictors increase while other predictors keeping unchanged. While for the terms with negative coefficient, the interaction effect causes decreased response if the involved predictors increase with other predictors unchanged. As opposed to a pairwise interaction model, the role of A in the diffuse interaction model is to measure the magnitude of the overall interaction among the predictors. Note that diffuse interaction models do not specify which of the predictors do or do not contribute to the interaction. Hence, we could expect the power of diffuse interaction model 72  Chapter  3.  Comparison  of interaction  detectability  under different interaction  models  would get worse when the true model under consideration involves more m i x e d directions of interaction effect. Since the "overall" interaction is getting weaker and weaker as more other direction of interaction terms appear i n 77, we would expect the performance of the diffuse interaction models to get worse and worse. T h a t is what Figure 3.2 implies. In the first panel, w i t h the choice of rj stating that all interaction terms have the same direction, the diffuse interaction model performs better than the pairwise interaction model i n terms of power. However, when rj is changed to have only part of the interaction terms playing the role i n the same direction, there is crossing of the two power curves as shown i n the middle panel. T h a t is, for some values of A , the diffuse interaction model works better while for some other values of A , the pairwise interaction model does better. Moreover, when 77 is set to have more mixed directions of interaction effect, the performance of diffuse interaction model is worse as implied i n the last panel. In Figure 3.2, the top panel is the power curves obtained by setting 77 of a vector of l ' s , the middle panel w i t h 77 of a vector made up of 10 l ' s and 26 0's, and the b o t t o m panel w i t h 77 of a vector made up of 10 l ' s , 10 0's and 16 —l's. Note here to make the plots across different 77's comparable, we normalize each 77 to have length equal to 1.  73  Chapter 3. Comparison of interaction detectability under different interaction models  Figure 3.2: Different choices of rj i n Case II w i t h binary predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few of predictor pairs interacting positively, and the b o t t o m panel has more mixed directions of interactions. T h e lengths of rj's i n different panels are normalized to be 1. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. True model Is palrwlsa  0  5  10  15  20  Subsequently what happens if other settings, like £, (3 ,  cr (= Var(e)) vary? Figure 2  M  3.3 shows the outcome of different £'s w i t h (3  = 0 . 5 1 and a  2  M  p  = 1. In the leftmost panel,  £ =0.2, while £ =0.5 i n the middle panel and £ =0.8 i n the rightmost one. T h e three plots i n the top panel are power curves i n case of the true model being diffuse interaction, and the other three i n the b o t t o m under the true model being pairwise interaction. A s £ varies, the performance of both models get worse. However, the difference among the other three plots i n the b o t t o m panel where the true model is pairwise interaction model is not great. In other words, the distribution of predictors has greater effect i n the case  74  Chapter 3. Comparison of interaction detectability under different interaction models of diffuse interaction model being true model than it does in case of pairwise interaction model being true model.  Figure 3.3: Different choices of £ in A Y s distribution: the top panel with the true structure to be diffuse interaction model and b o t t o m panel with the true structure to be pairwise interaction model.  Solid lines denote power curves based on diffuse interaction model  fitting and dashed lines denote power curves based on pairwise interaction model  fitting.  F r o m left to right across the three columns, £=0.2,0.5 and 0.8 respectively. True modal Is diffusa  True modal Is diffuse  True model Is diffusa  Figure 3.4 shows the result of different choices of 0 .  For simplicity, we set all the  M  components of Q  M  equal to each other, that is 3  M  oc bl . p  In the leftmost panel, b is  set to be 0.1, while 0.5 in the middle panel and 1 in the rightmost one. F r o m the three plots in the top panel, we can see that the performance of both models get better as the magnitude of b increases.  However, the three plots in the b o t t o m panel are almost the  same. Therefore, we get the similar conclusion as above, where £ varies. Change of b has more effect in the case that the diffuse interaction model is true.  75  Chapter  3.  Comparison  of interaction  detectability  under  different  interaction  models  Figure 3.4: Different choices of b in (3 = bl : the top panel with the true structure to be diffuse interaction model and bottom panel with the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. From left to right across the three columns, 6=0.1,0.5 and 1 respectively. M  Trua modal U diffusa  -i  i  1  S  10  15  p  True modal Is diffusa  H 20  True model Is pairwise  "-i O  ! S  1  i  10  15  True model Is diffuse  H 20  True model Is pairwise  H  1  1  1  H  0  5  10  IS  20  True model Is pairwise  Figure 3.5 shows what happens if cr varies. In the leftmost panel, cr set to be 0.5, 2  2  while 1 in the middle panel and 5 in the rightmost one. Now the three plots in the top panel are quite similar to those in the bottom respectively. In fact, based on the formula (3.4), we know that the noncentrality parameter 5 is just proportional to a . Therefore - 2  if we change the scale of A according to the value of a , the shapes of the power curves 2  are the same across different a . That is what Figure 3.6 implies. 2  ;  76  Chapter 3. Comparison of interaction detectability under different interaction models  Figure 3.5: Different choices of Var(e):the top panel w i t h the true structure t o be diffuse interaction model and b o t t o m panel w i t h the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. F r o m left to right across the three columns, Var(e)=0.5,l and 5 respectively. Trtu modal la diffuse  10  is  True model Is diffuse  20  o  Trua modal la palrwlaa  ~i 15  a  10  15  Trua model Is diffusa  M  0  True model Is pairwise  H 20  H 0  1 S  1 10  1 IS  6  10  15  M  True modal ia pairwise  H 20  h O  1 S  1 10  1 IS  H 20  77  Chapter 3. Comparison of interaction detectability under different interaction models  Figure 3.6: Different choices of V a r (e) w i t h scaled A : the top panel w i t h the true structure to be diffuse interaction model and b o t t o m panel w i t h the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. Trua modal la diff m a  Trua modal la diffusa  Trua modal la diffusa  A l l the above plots are based on the distribution of X having independent components. In the situation that those predictors are dependent, what happens if the  dependency  changes? To construct a dependent structure between those predictors, we let  Xi,...,  X \Z  Bernoulli(Z),  P  Z ~ Beta(a,  b).  B y a little algebra, we have p = (1 + a + 6 ) , as the correlation between any pair of - 1  Xi,...  ,X . P  Hence by changing the values of a and b, we may get different correlations. In  Figure 3.7, the first column is power curves obtained from a = b — 100, so that p = 0.005; for the second column, a = b = 0.5, hence p = 0.5 and for the t h i r d column a = b = 0.01, which leads to p = 0.98. The change of correlation does affect the power  performance 78  Chapter 3. Comparison of interaction detectability under different interaction models more i n the case that diffuse model is true t h a n that i n the case that pairwise interaction model is true. Figure 3.7: Different choices of p among A Y s : the top panel w i t h the true structure to be diffuse interaction model a n d b o t t o m panel w i t h the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting a n d dashed lines denote power curves based on pairwise interaction model fitting. F r o m left to right across the three columns, p—0.005,0.5 a n d 0.98 respectively. Trua model Is diffusa  True modal Is diffusa  Trua modal Is diffuse  We also investigate what happens i f the predictors are continuous. Suppose A Y s (i = 1 , . . . ,9) are i.i.d. following a log-normal distribution. T h a t is for each i, A j = exp(Zj), where Z j ' s are i.i.d. standard normal variables. Set /3 = 0,3 0  M  = 0 . 5 1 a n d the variance p  of random error cr = 1. T h e n we plot the asymptotic power against A (defined i n the 2  local alternatives) for different four subcases discussed before. Figure 3.8 shows a similar outcome to Figure 3.1, which is for binary predictors. T h a t is, the diffuse interaction model is more powerful to detect interactions than pairwise interaction model, regardless of what the true structure of interaction is diffuse or not.  79  Chapter 3. Comparison of interaction detectability under different interaction models  Figure 3.8: Power Curves w i t h Xi's"~' Log-normal(0,1): the top panel w i t h the true structure to be diffuse interaction model and b o t t o m panel w i t h the true structure to be pairwise interaction model. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. In the b o t t o m panel 77 = l xi/y/q~q  True m o d e l is diffuse  Similar to the previous example w i t h binary predictors, we also change the direction of interactions when the pairwise interaction model is the true model. A s shown i n F i g u r e 3.9, the diffuse interaction model lose the power to detect interactions as the "overall" strength of interaction gets weaker.  80  Chapter 3. Comparison of interaction detectability under different interaction models  Figure 3.9: Different choices of r\ i n Case II w i t h continuous predictors: the top panel involves all predictor pairs interacting positively, the middle panel has only a few of predictor pairs interacting positively, and the b o t t o m panel has more m i x e d directions of interactions. T h e lengths of 77's i n different panels are normalized to be 1. Solid lines denote power curves based on diffuse interaction model fitting and dashed lines denote power curves based on pairwise interaction model fitting. True model Is pairwise  0  5  10  15  20  det la  3.3  Summary  B y the two examples studied i n section 3.2.2, we can see that diffuse interaction model is more powerful to detect the interactions no matter what the true interaction structure (which is postulated) is diffuse interaction or pairwise interaction. However, as the direction of true interactions among predictors is more mixed, the diffuse interaction model gets less powerful to detect interaction.  81  Chapter 4 M C M C algorithms for diffuse interaction models 4.1  Why MCMC?  We introduced a diffuse interaction model in Chapter 3, where it is proposed to be more powerful to detect interactions than pairwise interaction model by using only a single parameter to describe the interactions among numerous predictors. A s a related point, the diffuse interaction model would be better for inferences on interactions. chapter, we are going to apply M C M C algorithms to implement the model  In this  fitting.  In this section, we enumerate reasons for using M C M C algorithms but not m a x i m u m likelihood estimation, which also may be feasible to be implemented. First, i n the diffuse interaction models, all the parameters /3,'s (except the intercept (3Q) and A are designated to be positive. T h e statistical inference of constrained maxi m u m likelihood estimates usually is more complicated.  Standard asymptotic theory  asserts that statistical inference regarding inequality constrained parameters does not require special techniques, because for a large enough sample there w i l l always be a confidence interval at the selected level of confidence that avoids the constraint boundaries. Sufficiently large, however, can be quite large, i n the cases when the true parameter values are very close to these boundaries. In practice, our finite samples may not be large enough for confidence intervals to avoid constraint boundaries, and this has implications  82  Chapter 4. MCMC algorithms for diffuse interaction models for all parameters i n models w i t h inequality constraints, even those that are not themselves constrained. Comparatively, M C M C sampler can automatically accommodate the constraints and yield appropriate interval estimate without extra efforts. Second, sometimes the interested quantity may not be the parameters i n the model directly and would be some complicated function of them. In particular, under diffuse interaction models, we apply the average effect idea to make inferences on interactions. Now the form of the first derivative of the regression function w i t h respect to x -, j — 3  1 , . . . , p is somewhat intricate so that it takes efforts to approach the interval estimates for average effects based on standard M L E . However, the interval estimates can be effectively achieved based on the posterior samples of the parameters of diffuse interaction models. T h i r d , considering some extensions to the diffuse interaction model, the m a x i m u m likelihood estimates could be heavy i n hand to calculate.  For example, to relax the  assumption that all the predictors interact i n the same way, we could allow the predictors to be categorized into groups: w i t h i n each group, the predictors interact i n the same way.  Here we assume that there is no overlapping among the predictors w i t h i n different  subgroups. T h a t is, each predictor has an indicator variable denoting the group to which it belongs.  T h e specific parametric form of this extended model w i l l be shown i n the  coming section 4.2. Therefore, the parameter set is a mixture of continuous parameters (Pj's, the coefficients of predictors, and A, the diffuse interaction parameter), and discrete parameters (the indicator variables associated w i t h predictors). W e have 2 (or 3 if three P  P  groups) different patterns of group allocation, hence the o p t i m i z a t i o n procedure would be more complicated especially when p is large.  83  Chapter 4. MCMC algorithms for diffuse interaction models  4.2  Details of M C M C algorithms  In Gustafson et al. (2005), an efficient M C M C sampler is developed for binary responses. Now, we apply a similar M C M C algorithm for continuous (normal) responses. W e start w i t h a simple case, that is, all the predictors interact i n the same way.  4.2.1  One-group diffuse interaction models  In this subsection, we consider a one-group diffuse interaction model  where A > 0 is the parameter accounting for interactions and Pj > 0,Xj  1  P-  '  > 0 for j = '  We apply a hybrid M C M C ( H Y ) approach, similar as that i n Gustafson et al. (2005) to make inference. T h e reason for using this algorithm is to avoid the waste caused by the randomness introduced by the proposal of candidate value. Basically, the idea of H Y is to incorporate derivative information of the target density and to suppress random walk behavior i n the sampling simultaneously.  B o t h of the strategies attempt to eliminate  the inefficiency of random walk, which is commonly used i n Metropolis-Hastings ( M H ) algorithms to generate candidate states. T o explain, by using random walk, the direction of each movement about the target distribution is randomized. T h i s can greatly increase the number of iterations required before achieving the equilibrium. T h e situation is getting even worse especially when the parameters involved i n the target distribution are highly dependent to each other.  M o r e discussions i n Gustafson et al. (2004) and Neal  (1998). T h e pseudocode of the h y b r i d M C M C algorithm is provided i n A p p e n d i x V . In our simulation study, the priors of the parameters are log A ~ N ( 0 , a ) w i t h c r ^ l O O ; 2  84  Chapter 4. MCMC algorithms for diffuse interaction models 0  0  ~ N(0,(7$ ) w i t h ^ = 1 0 0 ;  and /% ~  O  N+(0,CT|.) w i t h ^  =  100 for j =  l,...,p,  where N ( / i t , a ) denotes the N(/i, o ) distribution truncated to non-negative values. N o t e +  -2  2  that we also check the prior sensitivity by using informative priors w i t h smaller hyperparameters Op , G \ .  T h e prior for cr is inverse-gamma (a, b) w i t h the shape parameter 2  a = 0.0001 and the scale parameter b = 0.0001. Set the number of predictors to be p = 10, the sample size to be n =2000.  Gener-  ate the p binary predictor variables from Bernoulli(0.2) d i s t r i b u t i o n independently. For simplicity, we set the true values of / V s (j = 0 , . . . ,p) a l l equal to 0.5. T h e true value of A is set to be 2, that is, a l l the predictors interact i n antagonistic direction. T h e n for 1 = 1 , . . . ,n,we generate the response variable Yi from n o r m a l d i s t r i b u t i o n w i t h mean of 0o + < 2~2^=i(Pj v) f x  X  a  n  d variance a  2  = 1. Note that i n the model for binary re-  sponses, no variance component is involved. A s a consequence, our a l g o r i t h m here has one more step to update a , compared to the M C M C algorithm used i n Gustafson et al. 2  (2005). Following Hills and S m i t h (1992), the M C M C literature has considered changes i n the parameterization of a model as a way to speed up convergence i n a G i b b s Sampler. T h e general suggestion is to t r y to make the components as "independent" as possible. T h u s we implement M C M C using the new parameters ( a , A), where « o = 0o,&j — Pj/^ij — 1 , . . . ,p.  F r o m the scatter plot of posterior samples for A and a\ (left panel of Figure  4.1), a smaller correlation between reparametrized components can be clearly seen. T h a t is, the reparametrization works well. Note that a l l the samples used i n Figure 4.1 are based on the 30,000 iterations after 20,000 burn-in iterations.  85  Chapter 4. MCMC algorithms for diffuse interaction models  MCMC Algorithm I: At the t—th iteration, Step 1. For given A  (t-1)  , update from (^ , a? t_1)  ..., a  t_1) P  ) to {$\ o%\  af)  as a block by using a hybrid MCMC. Step 2. Update A^ ) by using a random walk Metropolis-Hastings to log(A). -1  Step 3. Update a ^ ^ via Gibbs sampler. Given the prior of a to be inverse-gamma 2  2  (a, b), the posterior conditional distribution given all the other parameters is inverse gamma with shape parameter a + n/2 and scale parameter of b + RSS/2, where  Note that Step 2 can also be implemented by any other choices of prior for A.due to the generality of MH. However Step 3 needs conjugate prior setting.for a because Gibbs 2  sampler get samples from the corresponding full conditional distribution. It is actually 86  Chapter 4. MCMC algorithms for diffuse interaction models a particular choice of the proposal i n M H leading to the acceptance ratio of 1. T h e trace plots of M C M C outputs i n Figure 4.2 shows that the above M C M C  ap-  proach works well for the simulated data set under the specific setting of priors for parameters.  In each panel of Figure 4.3, the true value, marked by the solid vertical  line, is covered w i t h i n the corresponding 95% equal-tailed credible interval. For some parameters though, the true value is somehow close to the lower/upper end of the credible interval. Figure 4.2: A l g o r i t h m I: M C M C traceplots for f3j,j = 0,l,...,p and A w i t h diffuse priors  0  2000  4000  6000  8000  10000  0  2000  4000  6000  6000  10000  2000  4000  6000  B0OO  10000  0  0  2000  4000  6000  6000  10000  2000  4000  6000  8000  10000  0  2000  4000  6000  8000  10000  2000  4000  6000  8000  10000  0  2000  4000  6000  6000  10000  iteration  0  2000  4000  6000  6000  10000  0  2000  4000  6000  6000  10000  87  Chapter 4. MCMC algorithms for diffuse interaction models  Figure 4.3: A l g o r i t h m I: M a r g i n a l posterior densities for 3j, j = 0,1,... ,p, and A w i t h diffuse priors cr| = <x|. = o~\ = 100. Q  Remarks for the algorithm: F i r s t , the step sizes used to produce candidate values i n step 2 and 3 should be adjusted as sample size changes. Basically, we tune the step sizes to get relatively high acceptance rates, about 70% for step 2 and about 50% for step 3. Second, when the priors of the parameters 8 ,j 3  = 0 , . . . ,p and A are changed to be  more informative (smaller variance), the outcomes do not change much. C o m p a r i n g the density plots i n Figure 4.3 and Figure 4.4, there is no serious difference between the distributions of posterior samples from different priors. T h i s point is different from Gustafson et al. (2005), where the prior of 6 does have influence on the Bayesian inference. 0  88  Chapter 4. MCMC algorithms for diffuse interaction models Figure 4.4: A l g o r i t h m I: Posterior densities for Pj, j = 0 , 1 , . . . ,p and A w i t h informative priors o\ = cr}. = a\ = 0.4.  0.3  0.4  0.5  0.6  0.7  0.8  0.2  0.4  0.2  -0.2  1 0.0  —i 0.2  3 1— 0.4  1  0.8  0.6  i 0.0  :  —i 0.2  ^-i 0.4  1  i 0.6  i 0.6  0.0  *T  f  0.6  0.8  _  0  i  f  0.4  1—, 1.0  1 1.5  2.0  2.5  3.0  3.5  T h i r d , as suggested i n Gustafson et al. (2005), the algorithm above can be easilyextended to a more general case of no positive constraint of the sign of Pj, replacing XjPj, i n model (4.1), by g(xj,Pj),  g(x,P)  where  \p\x-  l-x);  p>o, P<0.  To remove the positive constraint on the Pj's, we need the assumption that A ^ ' s are bounded w i t h i n [0,1]. In practice, transformation of predictors may be necessary. In the following, for simplicity,.we always focus on Pj's w i t h positive constraints and leave the extension without the constraints- to the future work.  89  Chapter 4. MCMC algorithms for diffuse interaction models 4.2.2  T w o - g r o u p diffuse interaction models  A s one possible extension mentioned but not pursued i n Gustafson et al. (2005), we study a more complicated model i n this subsection. Say  Y\X = x  ~  N  6o+ y  }2 (^) ,eADD  +( E (/^) f LeANT  ( -2)  A  4  where A D D consists of indices of the predictors belonging to t h e additive group a n d A N T is the set of indices of the predictors interacting" i n antagonistic direction. ( For sure, if we suspect that some of the explanatory variables interact i n synergistic way, we would replace the antagonistic group by synergistic group by setting A < 1.) I n this subsection, we take A to be fixed. L e t I = {Ik}k=i  , where  P  1;  k € ADD,  2;  fceANT.  h —  Therefore, the parameters i n this model now are ({Pj}^ , =0  {Ij} j=\-> o~ )- So that we have P  2  p — 1 more parameters,with A fixed, to update than that i n the simple case (4.1). To explore the M C M C approach to model (4.2), we do a simulation study as the following.  A s before, set the number of predictors p =12, the sample size n =2000.  Generate each predictor variable from a Bernoulli(0.2) distribution. For simplicity, we set the true values of /Vs ( j = l , . . . , p )  a l l equal to 0.7 a n d BQ = 0.  T h e true value  of A is set to be 2, that is, a l l the predictors involved i n interaction group interact i n antagonistic direction. T h e first six predictors Xi...  A  6  are set to belong to A D D group  while the others belong to A N T group. T h e starting values of the indicator variables are set to be the worst possible case, that is, each I is set to be the opposite value of the k  corresponding true value. T h e n we generate the response variable Yi, i = 1 , . . . , n from a 90  Chapter 4. MCMC algorithms for diffuse interaction models n o r m a l distribution w i t h mean of  'z ^ (Pj ij)  f >  x  ieADD  be ANT  J  and variance a =1. For each j € {1,... ,p}, the prior of I is uniform distribution over 2  3  {1,2}. T h e prior of B is JV(0, <rj ) w i t h <7^ =100. F o r j G 1,... ,p, the prior of Pj is 0  Bj ~ N (0,cr|.) w i t h a\ +  0  o  = 100 for j = 1,... ,p, where N (p, a ) denotes the N ( / i , CT ) +  2  2  distribution truncated to non-negative values. M C M C A l g o r i t h m II: Step 1. U p d a t e the coefficients i n the additive group, denoted by /  S ^ D D '  together  w i t h 0Q. G i v e n I and coefficients of predictors i n A N T group, denoted by  /3A.NT'  ^ *  s e a s  y *°  verify that the posterior of (/3 , / 3 ' A D D ) ' * P ° P ° t i o n a l to s  r  r  0  exp {-a~  2  ((A),/3'  ( i i + X  D)' ~  A D  ^ X ' ^ ) '  X  (XiXi + D) ((A,, / 3 ' A D D ) ' It implies that (B , 0  truncated by  / 3  A  /3'A.DD)'  D D  >  -  W + ^)" i i)} X l  lx  Y  > 0}.  follows a multivariate n o r m a l iV ( { X i X i + I ? } " ^ ^ ! , { X ' j X i + £>}  ^> h e r e w  Xi  =  Yi  = Y - Y  Y  2  =  £>  =  (I,XADD)>  {  2  A N T  X  , A  ^ A N T }  1  '  (T diag{cr^ ,a^ ,...,cT^ }. 2  2  2  2  0  Note that X ^ r j p j denotes the design m a t r i x w i t h c o l u m n vectors of the values of the  91  Chapter 4. MCMC algorithms for diffuse interaction models explanatory variables i n the additive group. Analogously,  X  A  j ^ r p  is the design m a t r i x  w i t h column vectors of the observations of predictors i n the antagonistic group. Step 2. U p d a t e the intercept together w i t h the coefficients i n the antagonistic group. Subtracting the contribution of predictors i n the additive group, we can now pretend there is only one group and all predictors interact antagonistically.  To be spe-  cific, we apply A l g o r i t h m I, devised for the one-group diffuse interaction model, to (Y',X  A  N  T  ,/3  A  N  T  ,/3 ) 0  where  Y' = Y -  Step 3.  U p d a t e (Pk, h)  denoted by (P ,I ), k  k  together.  X  A  D  D  / 3  A  D  D  .  W e generate the candidate values for  (p ,I )/ k  k  as the following:. | Tfc  =  log(^)  ~  T h a t is, the candidate value of I  k  I  k  3 — ifc, N(log(P° ),r ). 2  k  is just opposite to the current value. Say I  k  = 1, then  = 2. Using random walk on the log scale of P to fulfil the positive constraints i n k  model (4.2). There could be more than one criteria to determine 8® and two of them are discussed after remarks on A l g o r i t h m II. A t this stage, let's write P i n a general way as k  Pk =  h,r (Pk),  h  k  where h is a deterministic function w i t h reversibility property, that is, h is 1-1 mapping  92  Chapter 4. MCMC algorithms for diffuse interaction models on [0, oo) a n d h" = h, where h~ is the inverse function of h. Thus, we have 1  l  hi',i  k  (hi (P )) klIi  k  =p k  T h e n , the acceptance ratio is  n{i\p*)  V M f c y y  y  -  jp  (4.3)  where ir(-) is the joint posterior density of / a n d P for given a a n d (f> is the density 2  function of standard n o r m a l distribution. A c c o r d i n g to Step 3 i n A l g o r i t h m II P(Ik\Il)  =  P{I*k\h) = 1. Remarks: First, the fraction of Pl/Pk i n (4.3) comes out from the Jacobian of transformation because the proposed value is generated o n the log scale of Second, if r  2  = 0, then the proposed value of Pk is exactly P . However, the acceptance k  ratio now becomes somewhat intractable because it takes effort to figure out  (log(flfc)-log(h/»,7 (fit)) fc  I  lim — T  ^  '  \  J  —= —  ^ l o g ^ - l o g (*>,,,,,.(f>j) k  ^  *(-eflh' M)/h jM)) r  n  —  (4 4)  <K) e  where e follows standard normal distribution. Moreover, t u n i n g the size of r to be larger helps to speed up the convergence of samples for Pj's.In the following, we demonstrate two ways to determine P®, i.e., Proposal  hi j*(P).  1 : Choose P® to keep the average effect of Xk unchanged.  k  Since a l l X/s  are binary i n the simulation study, based o n Definition 1, think of the average effect for  93  Chapter 4. X  k  MCMC  algorithms for diffuse interaction models  as E (Y\X  k  = 1, X  ( f c )  ) - E (Y\X  = 0, X  k  ( f c )  ).  Therefore, we have  {  Pk,  EiPZ + Z} '*  if k e A D D ;  -E{ZV },  1  iffeeA N T ,  X  where Z=  £  (OjXj)*.  jeANT-{fc} Let Zj = S j e A N T - { f c } ( ^ ' u ) - ^ 4 x  A  1>  =  t  n  e  n  4*  » £ {& +^} _1  A  1/A  2, and Pi is the solution to  =  n _ 1  i=i  E ^ = A-  (-)  1/A  4 5  t=i  If 4 = 2, then 1% = 1, and /?£ is the solution to  P ^ n ^ i P t +z ^ - n - ^ z ] " . i=l i=l  ' (4.6)  In above two equations, the left side is the estimated average effect of Xk after change of Ik to I^., while the right side is the estimated average effect before the change. Here estimated average effect is the sample mean of average effect evaluated at each realization of X(fc). T h e reason for this proposal is that average effect estimate tends to be more robust, as discussed i n Gustafson et a l . (2005). Proposal 2 : Choose P% to make a (=p /\ ) k  k  Ik  unchanged. (Note that \  x  = 1, A = A.) 2  T h a t is,  f  \p , k  i f / * = 1,  X-'pk,  if h = 2.  Pk —  94  Chapter 4. MCMC  algorithms for diffuse interaction models  Based o n the following plots of the M C M C output, Figure 4 . 5 and Figure 4 . 6 , using Proposal 1, the algorithm works well. For a couple of the / 3 / s , though the true value falls close to t a i l of the density plot.  Figure 4 . 5 : A l g o r i t h m II: Using Proposal 1, M C M C trace plots for (3jJ = 1 , . . . ,p w i t h diffuse priors <Xg = cr|. = 1 0 0 . o  T^^^^^^^^^^^f^W 0  2000  4 OOO  6000  6000  10000  5 - PWIff^WJ^iPPW^I -1 1 1— f 0  5  2000  4000  6000  6000  10000  2000  4000  6000  6000  2000  4000  6000  8000  2000  10000  £  2000  4000  6000  6000  8000  10000  " - M1HHHHHHHHI 0  2000  iteration  0  4000  10000  ' HHHHHHHH 0  - ^^^^^^^^^^^^^  4000  6000  6000  10000  6000  10O00  6000  10000  Iteration  6000  100O0  2000  4000  6000  6000  10000  6000  10O00  2000  4000  6000  6000  10000  0 '  200(1  4000  2000  4000  itaration  0  2000  4000  6000  6000  Iteration  95  Chapter 4. MCMC algorithms for diffuse interaction models  Figure 4.6: A l g o r i t h m II: U s i n g Proposal 1, posterior densities for Bj,j diffuse priors cr ^ = cr^ = 100.  = 1,... ,p w i t h  2  Pi  0.4  0.6  fc  0.6  10  0.4  0.6  fc  08  1.0  0.2  0.4  0.6  0.6  To compare the performances of samples under different proposals i n Step 3, we plot the number of correct group allocations, since the true group allocation is known i n the simulation study. Figure 4.7 illustrates that P r o p o s a l 1, that is, solving for B® by keeping the average effect unchanged, gives the right group allocations only after a couple of iterations.  However it takes many more iterations for the samples under P r o p o s a l 2,  where /3° is solved by keeping a  k  unchanged.  plot the autocorrelation coefficients up to lag  To make this point more clear, we also  40 for MCMC  proposals, as shown i n Figures 4.8 and 4.9 respectively. (Pe — Pn),  samples under  different  In Figure 4.9, for some / V s  there is still some dependence even for lag 40. W h i l e i n Figure 4.8, for all  Pj's, the serial dependence drops close to zero after a small number of lags. It implies that the convergence rate under P r o p o s a l 1 is faster than that under P r o p o s a l 2.  96  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.7: A l g o r i t h m II: Comparison of two proposals: N u m b e r of correct group allocations based on posterior samples for I.  Proposal 1  Proposal 2  97  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.8: A l g o r i t h m II: W i t h P r o p o s a l 1, the autocorrelation curves for posterior samples of Pj, j = 1,..., p respectively.  20  30  20  30  10  20  i t 20 . 30  10  20 iteration  98  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.9: A l g o r i t h m II: W i t h P r o p o s a l 2, the autocorrelation curves for posterior samples of Pj, j = 1 , . . . ,p respectively. fa  Pi  Pa  - 1'-  £  J U11J U U Hi U U LU U U hU U U U LU.uLU. 10  20  Iterato in  30  i  40  1 10  1 20  1 30  °;  1  r 40  10  1 20  1 30  1  40  tierato in  p4  h 3 " £  10  20  Iterato in  30  „ ;  40  10  20  tierato in  30  JUlUUUU . UULUUULUUULiluuU. 1 1 1 1  40  1 0  10  P7  .b.llUIHIJUULIlLlllllllllllllllUU  1  10  1 20  Iterato in  1  30  r  .lillJlJLIllJIJLlLIII.lULIIIJULIlUUU. 1  i  40  i 10  Pw  -  5  30  40  Po 5 "  I  20  M I rato in  i  i  20  30  tierato in  *I;  , LllUU . i L l l l J U m i J . U U U U u l U U l,i, 10  r 40  Pi.  20  Iterato in  30  40  Pi!  :  AlllJULIllJIJUilJlJULIIUULlLlllJU.  0  10  20 30 Iterato in  40  .liiijyuiijuuiijijuiiiuuLiiijiju  0  10  20 30 Iterato in  40  <LlllJUU.IJUULIJIJULIllJULI.liUU.  0  10  20 Iterato in  30  40  Remarks o n M C M C algorithm II: First, i n step 3, we only allow the movement w i t h i n the parameter space such that at least two predictors fall w i t h i n the antagonistic/synectic interaction group. T h e reason of the constraint for now is that the terminology interaction has real meaning only when at least two predictors are involved.  W e could remove this constraint later a n d more  would be discussed i n the future work. Second, the priors of Pj's do not strongly affect the posterior distribution. Change the value of hyper-parameter ap to be 0.4, which is rather informative/narrow compared w i t h the previous value 100. There is not much difference between the density plots i n Figure 4.6, based o n diffuse priors a n d those from the informative prior shown i n Figure 4.10. In addition, for b o t h kinds of priors, the posterior samples for Ij, j = 1 , . . . , p have 99  Chapter 4. MCMC algorithms for diffuse interaction models high frequency of agreement w i t h the true values. Figure 4.10: A l g o r i t h m II: W i t h Proposal 1, the posterior densities of samples from w i t h informative priors \OpQ = o~p = 0.4.  f *:  y — i  1  I *:  & " -  — i  1—  1  1  o.e  —  —i—•—i—  oe .  0.4  i.o  0.S  0.8  0.7  0.8  0.9  1.0  1.1  T h i r d , the true value of Pj(j € A N T ) does have influence on the posterior information about group structure.  Bigger Pfs  (j £ A N T ) lead to more posterior concentration  around the true value of / , which governing group allocation. W h i l e smaller Pj's seems to lead to less posterior concentration around the true group structure. Consider an extreme case that all the Pj,j  € A N T are very small, say pretty close to 0, hence the conditional  expectation of response variable given a l l explanatory variables are approximately equal to p + zZjeADD^iPi0  ^  n  °t  n e r  words, the diffuse interaction model is reduced to an  additive model w i t h explanatory variables of Xj,j  G ADD.  1  Let's make this point clearer by looking at a simulation study. Set the true value of  In fact, if /3j (j eANT) are too small, the numerical solution of (4.6) would be zero, which cause difficulty in random walk on log scale of Pj. :  100  Chapter 4. MCMC algorithms for diffuse interaction models each /3j(i  eANT) to be 0.4, while the true value of coefficients i n additive group is 0.7,  same as the previous simulation study. Checking the posterior sample of each indicator variable for the first 10000 iterations, we get the following table. Clearly, as shown i n F i gure 4.11 a n d Table 4.1 as well, we get more posterior mass on wrong group allocations w i t h smaller value of fij's, but less w i t h larger /3^'s. Figure 4.11: A l g o r i t h m II: Comparison of number of correct group allocations based on posterior samples for I w i t h different values of (3j(j G ANT): top panel w i t h smaller value (=0.4), and b o t t o m panel w i t h bigger value (=0.7).  o oo  0  2000  ooa  4000  ooocooo  OOOO O  o  6000  OOOD  8000  ooo o  10000  iteration  Table 4.1: A l g o r i t h m II: Frequency table based on posterior samples for each component of I w i t h different values of  ft =0.4 1 2  A 1.00 0.00  ft =0.7  A  1 2  1.00 0.00  h  1.00 0.00  h  1.00 0.00  h  1.00 0.00  h 1.00 0.00  €A N T .  u  .73 .27  u  1.00 0.00  h  1.00 0.00  h  1.00 0.00  h  1.00 0.00  h  1.00 0.00  h  0.01 0.99  h  0.00 1.00  /  8  0.01 0.99  h  0.00 1.00  h  0.00 1.00  /io  Ai  0.00 1.00  0.00 1.00  A  2  .11 .89  u  ho  Ai  A  0.00 1.00  0.00 1.00  0.00 LOO  0.00 1.00  2  101  Chapter 4. MCMC algorithms for diffuse interaction models Fourth, there is high posterior dependence between I and p for each k. T h e blocking k  k  of (Ik, Pk) i n Step 3 is motivated by this fact. W e tried to update Ik and Pk separately (for each k), and found the sampler got stuck so that we could not move efficiently around the whole parameter space.  T h e reason is that i n the separate update case, posterior  distribution of Ik for given p highly concentrates on one certain value and vice versa. In k  other word, the update of p  k  and the update of Ik do affect each other a lot. Therefore,  we should block the two parameters. In the simulation study for M C M C A l g o r i t h m II, changing the true values of Pj,j A N T to be 0.4 while keeping Pj,j  €  € A D D unchanged, Figure 4.12 depicts the dependence  between I and P more clearly. T h e top panel is the unconditional posterior density plot k  k  of posterior samples for p . 4  T h e middle panel is the posterior density plot of P \I 4  1, that is the posterior sample of p  4  given posterior samples for I  4  b o t t o m panel is the posterior density plot of samples for P \I 4  4  equal to 1.  4  = The  = 2. Clearly the modes  of the samples within different groups are different by the latter two plots. D u e to the considerable distance between the two modes, we get a bi-mode posterior density curve estimated by a l l the samples of p , as indicated by the top panel. 4  102  Chapter 4. MCMC algorithms for diffuse interaction models  Figure 4.12: A l g o r i t h m II: M C M C output of B w i t h true value set to be 0.4: the top A  panel is the posterior density for entire sequence of 10000 iterations, the middle is the posterior density for subsequence P \h = 1, and the b o t t o m is the posterior density for A  subsequence /3 17/ = 2. 4  4  density plot of the first 10000 iterations  1 0.0  1 OS  1  1.0  ' 1.5  P"  density plot of eamples in the right group  i  i 0.0  OS  i 1.0  i 1.5  1.0  1.5  64  density plot of samples in the wrong group  0.0  05 B4  F i f t h , Ik, the estimate of indicator variable Ik may be determined by the one having higher frequency to be sampled. Here we should be careful about the Bayesian inference to B s. k  For each B , the point estimate and 95% equal-tailed credible interval could k  be obtained based on the samples belonging to the I  k  group only, while not t a k i n g into  account the whole sample sequence of B . To be more explicit, k  iM"=''}r £0.025  =  2.5% quantile of [sf  C/o.025  =  97.5% quantile of  : I® =  I]  : J<°  =/*}  k  103  Chapter 4. MCMC algorithms for diffuse interaction models where B \l ^ k  k  are the i - t h samples for B I kj  k  respectively, and  means the number  of element i n the set A, and L0.025 /fo.025 is the lower/upper b o u n d of a 95% credible interval. O r there is another possibility to make the inference based all the samples since the true I is unknown i n practice. In such a case, we could imagine that if I  k  higher frequency than other sampled values of I , k  from I  k  w i t h much  the estimate of B based on samples k  group only should be close to that based on samples from a l l possible groups.  T h e credible intervals may be close to each other as well. O n the other hand, if I  k  = I  k  does not have a superior frequency, then the estimate based on all samples for B would k  be somewhat different that from the I  k  group only. T h e credible interval i n the former  case would be wider.  4.3  Discussion  Other than the two diffuse interaction models discussed above, we could consider more complicated models. 1. A s a direct extension to model (4.2), we could allow A to vary as well. However, a corresponding direct extension to M C M C algorithm II is not t r i v i a l to achieve. W e are aware of the fact that (I , 8 , X ) are closely associated together for each k E { 1 , . . . ,p}. k  k  k  Therefore we need to devise a good joint proposal for the triple set. To make use of the previous algorithm for two groups w i t h A known, we propose the following algorithm. M C M C A l g o r i t h m III: Step 1. U p d a t e / 3 D D ' coefficients i n the additive group, together w i t h intercept B . A  Q  Step 2. U p d a t e / 9 N T ' coefficients i n the antagonistic group, together w i t h intercept A  A). (Note that the two steps have exactly the same structure as i n that A l g o r i t h m II.)  104  Chapter 4. MCMC algorithms for diffuse interaction models Step 3. For k = 1 , . . . ,p, update (Pk, Ik) s a block i n the same way used i n A l g o r i t h m a  II, that is, II is the opposite to the current value of I  k  noise to P®, which leaves the average effect of X  k  and PI is proposed by adding  unchanged for given value A.  Step 4. U p d a t e A, for given other parameters, by using M H update of log(A). Step 5. U p d a t e a  2  v i a G i b b s sampler.  F r o m the following trace plot of A samples, Figure 4.13, we can see that the M C M C samples for A have not achieved the stationary distribution w i t h i n the first 10000 iterations. W e increased the number of iterations and the M C M C still do not m i x very well for A. W e may adjust the stepsize for the update of A or better figure out other ways to propose the candidates of A more efficiently. Moreover, according to the autocorrelation plot of samples for A, the convergence speed is really slow since the autocorrelation of lag 40 is still rather large. It is similar for the posterior samples for "some Pj's, as displayed in Figure 4.16. T h i s implies that A l g o r i t h m III still need to be improved i n terms of speeding up the convergence. B u t still the algorithm seems promising since we get eight out of twelve correct estimates of indicator variables i n the simulated example, as shown in Table 4.2.  105  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.13: A l g o r i t h m III: Trace plots for Bj,j = 1,... ,p based on two-group diffuse interaction model w i t h A unknown, respectively.  2000  4000  6000  6000  10000  T 0  1 20O0  r— 4000  6000  2000  4000  6000  8000  10000  0  2000  4000  6000  0  8000  10000  2000  4000  1  |  2000  4000  6000  1 8000  6000  10000  1 8000  10000  8000  10000  8000  10000  Iteration  0  2000  4000  6000  8000  10000  0  2000  4000  6000  8000  10000  0  2000  Iteration  0  2000  4000  6000  8000  10000  2000  4000  6000  4000  6000  Iteration  8000  10000  2000  4000  6000  Iteration  106  Chapter 4. MCMC algorithms for diffuse interaction models  107  Chapter 4. MCMC algorithms for diffuse interaction models  Figure 4.15: A l g o r i t h m III: Plots of posterior samples for A, the top panel is the trace plot, the middle is the posterior density plot, and the b o t t o m is the autocorrelation curve.  1  1  1  1  2000  4000  6000  8000  n 0  1— 10000  iteration  1.0  1.2  1.4  1.6  1.8  2.0  2.2  X  0  10  20  . 3 0  40  X  Table 4.2: A l g o r i t h m III: Frequency table based on posterior samples for each component of I under two-group diffuse interaction model w i t h A unknown.  1 2 true value  h  h  h  h  h  h  h  h  h  ho  hi  h2  0.63 0.37 1  0.40 0.60 1  0.55 0.45 1  0.48 0.52 1  0.28 0.72 1  0.66 0.34 1  0.34  0.40 0.60 2  0.46 0.54 2  0.39 0.61 2  0.34 0.66 2  0.54 0.46 2  0.66 2  108  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.16: A l g o r i t h m III: A u t o c o r r e l a t i o n curves for f3j,j = 1, group diffuse interaction model w i t h A unknown, respectively.  Ma  JlllUlUllllJLIlUUJllllJluumu.  JllUllULIJULljUlULUUi  min  JIJIIJ1.I1IJ1I1IJ1U 20  llllllllilllimili  30  I 30  10  ;  i  20  11.11.1i I.I I I.J U1IJ 11,11,111,1  40  30  J LU U LU  JLIJUIIJLIllJLIJIJllJLIllJLIilJllJLI.  LI 1  u  in  J I l i u m LI ...  40  LIIULI1UU1U. 30  ,p based o n two-  40  20  30  Hit Ui UtUUlU Ui U LU LLUJ 10  10  30  40  LI  2. Including more groups into the model, say additive, synergistic a n d antagonistic group, w i t h the corresponding A's known, for instance,Ai = 1, A = 4 / 5 , A = 5/4 respec2  3  tively. (The reason to choose A2/A3 close to 1 is stated later.) In this scenario, there are three possible values of each I . Hence we cannot just flip the current value of I k  k  when  in the joint proposal for update of (Ik,Pk)- W e need more ^complicated j u m p i n g rule for the update of I . To apply the previous algorithm for two groups to the current situation k  w i t h three groups, a n easy-to-extend algorithm is as below. M C M C Algorithm IV: Step 1. U p d a t e the coefficients i n the additive group together w i t h intercept. Step 2. U p d a t e the coefficients i n the synergistic group together w i t h intercept. Step 3. U p d a t e the coefficients i n the antagonistic group together w i t h intercept.  109  Chapter 4. MCMC algorithms for diffuse interaction models  Step 4. For k e SYN (J ADD, update (p , I ) by applying step 3 in Algorithm II to k  Y - { X  A  N  T  ^  k  N  T  }  1  /  A  3  ,  \  where X^jvj'p is the design matrix with column vectors of the observations of predictors in the antagonistic group. That is, by subtracting the contribution of the antagonistic group from the response variable, we could pretend that we have only two groups, additive and synergistic group. Step 5. For k € ANT (J ADD, update (Pk,h) by applying step 3 in Algorithm II to  Y - { X  S  Y  N  A  2  ^ V N }  1  /  A  2  -  where XgyN is the design matrix with column vectors of the observations of predictors in the synergistic group. Again, by taking away the given contribution of the synergistic group, we may imagine having only two groups, additive and antagonistic group. Step 6. Update <r via Gibbs sampler. 2  The step 4 and 5 in the above algorithm allows jumps between additive and synergistic groups, and jumps between additive and antagonistic groups. In other words, the big jumps between synergistic and antagonistic groups are broken down into two small jumps, which would be easier to achieve. Unfortunately, one obvious drawback is that at each iteration we need to update |ADD| + p (|ADD| denotes for the number of elements in ADD group) of (Ik, Pk) pairs, since the index k runs over the indices in ADD twice as stated in step 4 and 5. Thus we propose the following algorithm which need to propose just p of (Pk,Ik) pairs at each iteration. MCMC Algorithm V: Step 1. Update the coefficients in the additive group together with intercept.  110  Chapter 4. MCMC algorithms for diffuse interaction models Step 2. U p d a t e the coefficients i n the synergistic group together w i t h intercept. Step 3. U p d a t e the coefficients i n the antagonistic group together w i t h intercept. Step 4. For k == 1 , . . . ,p, update (@ , I ) k  k  as a block. N o w the proposed value of  I, k  different from I , is drawn w i t h probability k  l-7r(/ |/ _ f c  (  f c ]  ,/3,a2)'  where.7r is the joint posterior density function of / , 3, cr . T h i s is referring to Metropolized 2  Gibbs sampler i n L i u (1996), which proves that the way used i n Step 4 above is more efficient than the Gibbs sampling method, that is, the proposed value of I , possibly the k  same as the current value of I , is sampled from ir(-\I[- ], 3, cr ). T h e n Q* is proposed by 2  k  leaving the average effect of X  k  k  k  unchanged, which is the same idea as before. Hence, the  acceptance probability is the same as (4.3), w i t h  P{lt\h)  i n  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.17: A l g o r i t h m V : Trace plots of M C M C samples for Bj,j = 1 , . . . ,p based on three-group diffuse interaction models w i t h A2 = 4 / 5 , A 3 = 5/4.  0  10000  20000  30000  iteration  40000  50000  0  10000  20000  30000  Iteration  40000  50000  0  10000  20000  30000  40000  50000  iteration  112  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.18: A l g o r i t h m V : Posterior densities of M C M C samples for Pj,j based on three-group diffuse interaction models w i t h A = 4 / 5 , A = 5/4. 2  3  —  l,...,p  Chapter 4. MCMC algorithms for diffuse interaction models  Figure 4.19: A l g o r i t h m V : Autocorrelation curves of M C M C samples for Bj,j = 1 , . . . ,p based on three-group diffuse interaction models w i t h A = 4 / 5 , A3 = 5/4. 2  lllllllllll 20  30 40  lllllllllll 1 1 ,1 1 1 1 ,1 1 ] ,  lllll|IIIIIIIMIIII  20  0  10  mm  30  20  30  nun  Ill  40  0  Lag  10  in  11111111111111111  20  30  40  0  Lag  10  20  30 40  Lag  Remarks on M C M C A l g o r i t h m V : First, note that the magnitudes of A2 and A 3 have an effect.  If A2 and A getting 3  closer to 1, the j u m p between different groups, i.e., update i n Step 4, is easier to make. Otherwise, the change of group may cause big change i n the posterior density, which leads to a small acceptance ratio. So the proposed values are more likely to be rejected, which makes the algorithm inefficient. W e also check the performance of A l g o r i t h m V by setting a l l A's equal to 1. Under this setting, the posterior distribution of h,for given k € {1; ... ;  is almost uniform distribution over {1,2,3}. T h i s is consistent w i t h our  intuition, since when A i = A2 = A 3 = 1, the three group are actually identical to each other. Second, according to the trace plots, as shown i n Figure 4.17, the sampler m i x well.  114  Chapter 4. MCMC algorithms for diffuse interaction models However, Figure 4.19 tells us that there is some scope to make improvement i n convergence rate. T h e sample dependence does not drop close to zero after lag 40 for some PkS, like P and p . 5  n  One thing worth noticing is the triple-mode i n almost all density  plots of PkS. T h e reason is that for each k, the posterior distributions of Pk conditional on different values of Ik may have different (up to three) modes. A c t u a l l y this is another evidence to show the high dependence between Pk and Ik-  Table 4.3: A l g o r i t h m III: Frequency table based on posterior samples for each component of I under three-group diffuse interaction group.  1 2 3 true value  h  h  h  h  h  h  h  h  h  ho  hi  h2  0.33 0.23 0.44 1  0.31 0.19 0.51 1  0.34 0.18 0.49 1  0.31 0.24 0.46 1  0.33 0.17 0.50 2  0.33 0.34 0.33 2  0.29 0.38 0.32 2  0.28 0.43 0.29 2  0.31 0.31 0.38 3  0.29 0.28 0.43 3  0.26 0.26 0.48 3  0.28 0.43 0.29 3  T h i r d , we also plot the number of correct group allocations. It seems like that most of the samples for I have 4-6 components correctly valued. It looks promising as implied by Figure 4.20, some of the samples for I do achieve 10 or 11 correct allocations. Moreover, Figure 4.21 shows the posterior density plots of samples for Pk conditional on the true group allocation of fc-th predictor, k — 1 , . . . ,p. We find that for each Pk, the samples highly concentrates around its true value.  It is a good sign to see that A l g o r i t h m V  somehow works.  115  )  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.20: A l g o r i t h m V : N u m b e r of correct group allocations based on posterior samples of I under three-group diffuse interaction models.  Figure 4.21: A l g o r i t h m V : Posterior densities of samples for Pj conditional o n Ij correctly allocated (j = 1 , . . . ,p) under three-group diffuse interaction models.  0.0  0.0  0.5  1.0  0.5  1.0  1.5  1.5  2.0  2.0  2.5  3.0 3.5  2.0  2.5 3.0  2.5  3.0  0.5  0.4  0.6  1.0 .  1.5  0.S  2.0  1.0  2.5  1.2  f . 0.4  ~~]  0.6  0.6  1  116  Chapter 4. MCMC algorithms for diffuse interaction models  4.4  Example  To provide an illustrative example of fitting the diffuse interaction model to a real d a t a set, we consider the abalone growth dataset available from the U C I M a c h i n e Learning Repository (Newman et. al. 1998). T h e response variable Y is the age of abalone. Ignoring the single categorical explanatory variable (sex), we take the dependent variables X to be the seven continuous explanatory variables (length, diameter, height, whole weight, shucked weight, viscera weight, shell weight). T h e d a t a are observed on n = 4177 specimens. To find out the overall direction of interaction among those dependent variables, we apply A l g o r i t h m I to make inference. One thing worth noting is the release of positive constraints on / V s by using the following model mentioned i n Section 4.2.  where  A s a consequence, all the observations of Xj's  are scaled to [0,1].  T h e chosen priors are Bj ~ iV(0, 0.5)(j = 0 , 1 , . . . , p), log A ~ JV(0,0.5), and a  2  ~  inverse-gamma (0.0001,0.0001). W e also tried rather diffuse priors by replacing the small variance 0.5 w i t h a larger value 50 and we d i d not see any serious difference in the output. Figure 4.22 gives trace plots for posterior samples of Bj, j = 0 , 1 , . . . ,p, which shows the M C M C algorithm worked well. T h e solid curves i n F i g u r e 4.23 are posterior densities for average effects 6i,...,5  p  and the dashed lines are posterior densities for / 3 i , . . . , B . p  We can see that most of ,<5j's are slightly smaller than the corresponding Bj's, although the case for X  5  is the other way around. T h i s means the overall interaction among X is  117  Chapter 4. MCMC algorithms for diffuse interaction models  antagonistic, which is consistent with the inference we can make based on the posterior sample for A. The posterior mean of A is 1.19, and the 95% equal-tailed credible interval is (0.92,1.46). The posterior probability that A > 1 is 0.93. Hence, we have evidence of the presence of antagonism among X. The posterior density plot for A is given in Figure 4.24. For a better understanding of the content of antagonism, we refer to relative antagonistic effect as <?(*) - 9(0) - T.U  (g(x i )  T.U(g(xiW-9{0))  J J  - g(o)) '  .  [  -  }  where \ means a p x 1 vector of zero except that the jth element is 1. Note that 3  the numerator is the difference of the joint effect of X and the sum of independent effect of each Xj,j = 1,... ,p, and the denominator is the sum of independent effect. Averaging the ratio over the joint distribution of X, we get average relative antagonistic effect (ARAE); By using the empirical distribution of X (the true distribution of X is unknown), we get ARAE based on the posterior samples of (3j,j = 0,... ,p and A and the graphical summary is given by the bottom panel in Figure 4.24. We can also see the evidence in favor of antagonism since all samples of ARAE are negative. Moreover, most of samples of ARAE are valued in (-0.6,-1), which indicates a considerable size of antagonism. The bi-mode of ARAE in the plot might be caused by the complexity of (4.7), because the denominator now is also a function of A other than the simple form ^  (3jXj  as before when Pj's are confined to be positive. One thing worth mention is  that MCMC can provide an effective way to make inference on this complicated function ARAE, based on the posterior samples. That's one of the main reasons why we pursue MCMC approaches for model fitting, as mentioned in the beginning of this chapter. Note that both of the definitions above can be easily applied to synergistic effect case.  118  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.22: Abalone data: trace plots for 3j,j = 0 , 1 , . . . ,p, for the whole sequence of 100,000 iterations including the burn-in period.  1 ftvV^^ Oe+00  1  I  I  1  2e+04  4e+04  6e+04  8e+04  "T  1e+05  0e+00  2e+04  1  1  4e+04  1  6e+04  -  Be+04  1e+05  iteration  -\  Oe+00  r  ~i  -  2e+04  4e+04  6e+Q4  Be+04  1e+05  0e+00  2e+04  i  1  r  4e+04  6e+04  8e+04  -  1e+05  iteration  |35  T 1  r  -  2e+04  4e+04  6e+04  Be+04  1e+05  Oe+00  2e+04  1  1  4e+04  6e*04  8e+04  6e+04  8e+04  Iteration  Oe+00  2e+04  . 4e+04  ~-  -j  iteration  . 6e+04  ~i e+04  r 1e+05  0e*00  2e+04  4e+04  !  (  1e+05  iteration  119  Chapter 4. MCMC  algorithms for diffuse interaction models  Figure 4.23: A b a l o n e data: density plots for and 5j, j = 1,... ,p, for the sequence of 40,000 post burn-in iterations. T h e solid lines stand for average effect Sj's and dashed lines stand for /3j's.  —! -0.5  " " I •T 0.0  0.5  1  1  1  I  t.O  1.5  2.0  2.5  ' -1.0  1  i—Ti-j-.-.-j-v  ,  !—  -0.5  0.0  0.5  1.0  X5  1.5  X6  .0  -0.5  0.0  -1.0  -0.5  0.0  0.5  1.0  1.5  X7  1  2  3  4  5  S  Figure 4.24: A b a l o n e data: the top panel is the density plot of A and the b o t t o m panel is the density plot of relative antagonistic effect, for the sequence of 40,000 post burn-in iterations.  t  I 1.5  1.0  2.0  average relative antagnistlc effect  AA -1.6  -1,4  -1.2  -1,0  -0.8  -0.6  120  Chapter 5 Summarization and future work 5.1  Conclusions  Overall, the findings are as follows. In chapter 2, we study the consequences of fitting an additive regression model a pairwise interaction model is assumed to be the true model. W e obtained the asymptotic distribution of average effect estimates based on the "misspecified" model and "true" model as well, as shown i n Result 1 and Result 2, respectively.  W i t h the two large  sample limits achieved i n the two results, we work out the consistency of average effect estimates from "misspecified" model under some easily-studied situations, which are given by Result 3. T h i s result implies that the distribution of risk factors does influence the size of bias of estimates i n cases of model misspecifications. Result 3 suggests that transformations  of risk factors, aiming toward normality, may help to reduce bias of  average effect estimates. M o r e generally, under the framework of spline regression models, we investigate the consequences of model misspecifications by failing to incorporate the interaction terms, which are assumed to be included into the true model. In chapter 3, we introduce the diffuse interaction model, which is more powerful to detect interaction than the pairwise interaction model especially when the number of risk factors of interest is rather large.  W e compare the power' to detect interactions  under the two models i n situations where we assume the true model is either diffuse or pairwise interaction model. T o make the comparison tractable, large sample and small  121  Chapter 5. Summarization and future work misspecification approximations are employed. To be specific, for either interaction model that is assumed to be true, the true values of the parameters standing for the magnitude of interactions (A i n the diffuse interaction model, /3y's i n the pairwise interaction model) are just w i t h i n a (local) n  - 1  /  2  neighborhood of those values which i m p l y no interactions  at a l l , i.e., additive models. Therefore, as sample size n goes up to infinity, the model misspecification vanishes to zero. Under a set of specific settings (shown i n Section 3.3), we find out that the power of diffuse interaction model is superior to pairwise interaction model no matter whether the true model is itself or not. However, if the true model is pairwise interaction, the detectability v i a diffuse interaction model decreases when the "overall" strength of interaction among the risk factors gets weaker. In chapter 4, we develop an efficient M C M C algorithm for one-group diffuse interaction model. A l s o we investigate the possibility of generalizing the model away from the strong assumption that a l l risk factors interact i n the same direction, i.e., synergistic or antagonistic. In the more generalized but complex model, we have more parameters since each risk factor has an corresponding indicator variable denoting to which interaction group it belongs. W i t h A fixed, we have an efficient algorithm for model w i t h two groups of risk factors, one group for risk factors having no interactions and the other groups for risk factors interacting synergistically/antagonistically. A n d we also see some hope to develop a good M C M C sampler for a more general model w i t h three groups of risk factors, i.e., a no-interaction/additive group, a synergistic group and an antagonistic group.  5.2  Future work  In this section we discuss some interesting problems that could be studied i n the future. 1. In terms of model misspecification ignoring interactions:  122  Chapter 5. Summarization and future work (a) . In Section 2.3, the linear regression context, we find that under some certain condition such as independence or joint normality of risk factors, the average effect estimators based on the misspecified m o d e l a r e still consistent w i t h the true values. B u t we need to know whether there are more general (possibly weaker) conditions which can produce the consistency i n the face of model misspecifications. (b) . W e explored two examples i n section 2.3 to see how far the bias can be away from zero. However, we need more general investigation of the magnitude of bias (or relative bias) as the joint distribution of (X\, • • • ,X ) P  moves away from multivariate normally or  independence. (c) . W e could also study the consequences of o m i t t i n g the interaction terms i n the context of generalized linear models.  Actually, i n many epidemiological studies,  the  health outcome Y is often binary/categorical (for example, diseased or not). (d) T h e m a i n results i n terms of average effects we have derived so far is referring to Definition 2,. that is, averaging predictive effects over the joint distribution of a l l predictors.  W h a t could the result be if other versions of average effect are applied?  Moreover, we are aware of the fact that idea of average effect could have more general applications, not just confined w i t h i n the context of regression or generalized regression. For example, we can also apply the idea of average effect i n survival analysis by averaging over the change i n hazard function, instead of outcome/response variable, associated w i t h a unit change i n the putative risk factor. 2. In terms of M C M C algorithm development for more complex diffuse interaction m o d els: A s suggested i n Gustafson et al. (2005), a more general model can be built by partitioning the risk factors into three sets: additive, synergistic, and antagonistic, that  123  Chapter 5. Summarization and future work is  E(Y\X ---,X ) U  = p +  P  ( ) 5J  0  {  ieADD l/A.  \  .E jeSYN  (ft^')  As  [ J  /  + \  \ 1/A  •  E  LeANT  ovo)  A a  [  0  >  J  where 0 < X < 1 < A . s  a  Starting w i t h the simplest diffuse interaction model w i t h a l l the risk factors i n single interaction group, we have an efficient M C M C sampler which does a good job. A n d also M C M C performs well for a diffuse model w i t h two groups, additive a n d synergistic or antagonistic w i t h a fixed value of A. However, for a more complicated situation, model (5.1), how to implement an efficient M C M C approach is still i n process. T h e challenge here is how to propose the indicator Ik a n d corresponding Pk together i n an efficient way. In our algorithm, we only allow the transition between additive group a n d synergistic/antagonistic group, while the transition between two interaction groups are not allowed directly w i t h i n one single movement. T h e reason is that the direct change between two interaction groups would cause rather b i g change i n posterior density function, which leads to low acceptance rate of proposed values of (Pk,Ik) P i a  r  A l t h o u g h we see some hope i n the simulation studies that the  sampler m i x well, we still need to find better one which mixes faster a n d leads to better estimation. To be more realistic, we would also a d d one more group, that is, no-effect group into the current three-group interaction model. T h i s is more challenging because it is even harder to make efficient j u m p i n g schemes between the no-effect group a n d other three groups. 3. In terms of model selections : 124  Chapter 5. Summarization and future work In the analyses of linear regression w i t h interactions, a stepwise strategy is often applied to choose a subset of interaction terms. T h e problem w i t h stepwise procedure is that it is a computationally intractable technique when the number of risk factors is quite large, as discussed i n Chapter 1. In the worst situation, | ( ^ ) ((?J) + l ) possible cases must be evaluated before obtaining the final model. Therefore, stepwise regression is not always a practical technique and other selection techniques may have to be considered. Suppose we have a linear model w i t h pairwise and maybe even triple-wise interaction terms. W e could take this model and reparametrize from the original parameters 6 to ((f), A), where c/> are the average effects and A are nuisance parameters. Note that when A = 0, the interaction model is an additive model. Take a pairwise interaction model (3.6) for instance, reparametrize the parameters 0 as follows:  <f>o  =  =  A),  {Pij}l<i<j<p-  B y setting up a prior for A which is quite concentrated around zero, we can do the posterior inference to A and then pick out those important interaction terms instead of doing stepwise procedures.  125  Appendix I Proof of (2.14)in Section  2.3.1  By Cramer's rule to solve the (p + 1) x (p + 1) system of equations, we get a \D\ = \D \, x  X  where 1  0  0  Var(Xi)  \ 0  Cov{X ,X  D =  ( 1  0  x  EY  Cov(X ,X ) x  p  .  p  0  •••  Cov(X ,X )  •••  E(YX )  Cov(X ,X )  •••  x  j  Var(X )  0 E(YX ) X  D =  ••  p  2  0  ^  Cov(X ,X x  p  x  \0  P  2  p  VarpQ-  J  Let ( S — (o~ij) x — p  Varpfi) CoviXuXi)  Cov(A"i,X ) ••• 2  Var(X ) 2  •••  Cov(X ,X )  •••  Cov(X ,X )\ x  p  Cov{X ,X ) 2  p  P  \Cov(X ,X ) x  p  2  p  Var(X ) p  /  Appendix I. Proof of (2.U)in Section 2.3.1 Since  U . ( l ,  E  X  ....  U  (  X  p  E(y)  with determinant expansion by the cofactors (first column), we may write |Z?i| as  E i A E ( X ^ ) + Ei<, A j E  (XiXjXi  i <?"2,  • • • , CTj  E , A E (XiXp) + x;^- A , E ( x ^ X p ) where cr^ denotes i-th column vector of E . By the property of determinant, we can rewrite the expressions above as summation of two determinants  and  , where  EAE(X X ) J  D™\  =  : .  1  > 2, cr  • • • , o-  p  EiP^iXiXp) E -<j A j E 8  \D  (XiXjXij  (2) |  , cr , • • , o-j 2  Ei<j A j E (XiXjXp)  127  Appendix I. Proof of (2.U)in Section 2.3.1 Let's consider  at first. Note the fact that E (XiXj)  i =  = Cov(A" ,X ). i  j  i  J 2 ^  C  o  v  (  X  i '  X  ^  + ••• +  n  CoviX^X^ } 1  (.2)  T h e last equality is derived by the fact that when i ^ 1  Cov(X  Xr)E  + ••• + Cov(X  n  u  h  Cov^A^)  : |  X )S  p l  p  ,  ,  • • • , o-p = 0  CoviX^Xp)  Similarly, we get  |£>i  2)  I  = E /?«[S  U  E (XiXjXJ  + ••• + £  p l  E  (XiXjX,,)].  i<3 Since  E (XiXjXi)  = E {XiXjX{)  + E (X^CoviXi,  X ) + E (Xi)Cav{X x  jt  X,)  128  Appendix I. Proof of (2.U)in Section 2.3.1 we can derive that  +  (Xj)[Cov{Xi, A ^ E  1 1  + ••• + Cov(X  X )E ]  •  pl  u  p  i<3 +  £  A j (X )[Cov(X , X ! ) E + •• •+ Cov(A^ X ) S ]  =  E f t i ( S E ( X * JC;A\) + • • • + E E { X i X j X , ) )  E  n  i  n  p l  j  p  pl  + EAjE(A^)|S|.  (.3)  Combining equations (.1), (.2) and (.3), we get the equation (2.15) and so (2.16) when P = 2.  129  Appendix II Another consistent estimator of mentioned in Section 2.2 The following is to show that sandwich estimator is a consistent estimator of V* as well. We only work on the simple case with two predictors and it is easy to expand the results of p = 2 to general p. In the fitted model, the log-likelihood function of an observation is  log foiYlXuX?)  = -^log27r - -\ogr l  - ±(Y  2  - a - aX 0  x  -  x  aX), 2  2  2  while the true log-likelihood function should be  \ogg{Y\X X ) u  = -\log27r  2  - - log a - ^(Y l  2  - 3 - frX, - B X 0  2  2  -  P X,X ) . 2  12  2  Define the matrices as below.  A {6)  =  jn- £d  B (9)  =  j n " Y^dlogfeiY^X,  n  n  1  1  2  l o g M Y i l X i = x ,X u  = x ,X  2  H  2  =  x^/dO^1  = x^/dO,  ,  • d\ogf (Y \X 6  x  x  = x ,X u  2  =  x^/dOj  130  Appendix II.  Another consistent estimator of V* mentioned in Section 2.2  T h e expecations of them are defined as  A(0)  =  B(6)  =  E(&logf (Y\X X )/dO d6 ) e  u  2  i  j  i  E(dlogf (Y\X ,X )/d9 -dlogfo(Y\X X )/d6 ).. e  1  2  l  u  B y W h i t e (1982) we know that the M L E of parameter vector 0  2  j  = ( a , &i, d , r ) , 2  n  0  2  which is a consistent estimator for 0*, m i n i m i z i n g the K u l l b a c k - L e i b l e r Information C r i terion ( K L ) . T h a t is  and Vn~(0 -0*)^ n  N{O,C{0*)).  (.2)  Intuitively, K L measures our ignorance about the true model. Remarks: 1) A l l the expectations here are calculated under the true density function unless specified. 2) C(8)  is defined as follows.  C(0) =  A n d A (0)~ B (O)A (0)~ 1  n  l  n  n  A(0)- B{0)A{0)- . l  1  is a consistent estimator of C(0).  A c c o r d i n g to its sandwich-  like shape, this estimator is also called "sandwich" estimator. Note that when the fitted model is correct, A(0 ) t  + B(0+) equals zero, otherwise it  may not. So A(0*) + JB(0*) is a useful indicator of misspecifications.  131  Appendix II. Another consistent estimator of V* mentioned in Section 2.2 B y some algebra, we have  4  MO*) =  -»3xl  (.3)  Olx 3 2T?  where ' E = E(SS') =  E{(1,X ,X )'(1,X ,X )}. 1  s  V*  B(e*) = ±  2  1  2  7 A  R 7  where  E(Y  7  =  1  Q  E ((y - a - a i X 0  4T?  ^ E ((y - a  A  —aiXi -  - a  E _L(y_  a  o  _  a X ) 2  -  x  3  2  a X fX ) 2  2  x  - ajXi - a X ) X ) 3  0  2  a  i  X  l  -a  2  2  ^  X ) -i 2  2  2  Note V * is defined in (2.8). Therefore by the definition of matrix C , we get  C(0.)  The left upper sub-matrix is just  E^VE^ 2 ' AA 2r^7  1  - l  2r E^ 2  7  4TV A 4  v(a ). t  132  Appendix III Numerical approach to C n and C12 in Section 2.4.1 We give details of the numerical approach how to get Cn and C\ in (2.20). The joint 2  distribution of (X\,X )  is bivariate normal.  2  The elements of matrix Cn are easily obtained by one-dimensional integration. As for the elements of matrix Cu, the typical term is  > ci} - h) ((X  E {((Jd - ) (I{Xi a  Cl  - c ) I{X  > c } - k )} ,  b  2  2  2  2  2  (.1)  where  { ki  0, t, t  if Ci if a  =  —  -00,  ti^  -00.  E((Xi-a) (I{Xi>Ci}). a  To reduce the two-dimensional integration into one dimension, we just need to figure out  E ((X  - c ) I{X b  2  2  2  >c }-k \X ). 2  2  1  133  Appendix III. Numerical approach to Cn and Cu in Section 2.4.1 Note the fact conditional on X\ = x\, X  can be rewritten as  2  X  2  + y/l - p  = pxi  where p is the correlation coefficient of X  2  x  variable.  and X  Z,  a n d Z stands for a standard normal  2  Therefore,  • f(x ) x  =  E((X -c ) I{X >c }\X  = x)  b  2  2  2  2  1  1  T h i s integral can be easily evaluated since Z is standard n o r m a l and then we can regard the integrand i n (.1), typical term of C , as a function of one variable: X2  g(X ) 1  =  ((X ~c ) (I{X >c }-k )x(f(X )-k ). a  1  1  1  1  1  1  2  Thus, we can approximate the integral of the function by the following K EigiX^^K-^gib,),  where 6j is i/(K  + 1) quantile of the marginal distribution of X\.  In our numerical  approach to the elements of Cn a n d Cu, K is set to be 2000.  134  Appendix IV Proof of (2.24) in Section 2.4.2 W i t h the facts that B~ (B~ )' 1  1  = S'S a n d UCU' = BPB', we have  S'S + A P  =  B-\B~ )'  =  B~ (l +  =  B^Uil  + XP  X  XBPB')(B~ )'  l  1  +  XtyUXB- )'. 1  Thus (S'S + A P ) " = B'U(I +  XC)~ U'B.  1  l  Let V = SB'U, then V'V = I, which gives  tr(S(a)) = tr{V(I + A C ) " V } =  tr(I + A C ) . _ 1  Appendix V Pseudocode for the hybrid M C M C algorithm in Section 4.2.1 Let 0 ~ II be the target distribution, having an unnormalized density function a subset of  7r(t9)  on  In the following, we will abuse notation with the same symbol used to  denote different functions if the meaning is clear from the context. The algorithm works by extending the state from 0 to  and the unnormalized target density from 7r(0)  (8,z),  to  TT(0,Z)  =  7r(0)7r(z)  0. Set values for function g, constant e, constant a, number of iterations I. • 1. Initialize the value of 0. Could be generated from the prior of each component of 8 or by fitting a simpler model (for example an additive model in regression model context). Also initialize z by sampling from standard normal. 2. For each iteration, i = 1,..., / : a) Generate a candidate state (8*,z*) as  0*  <-  0 + {z + (e/2)<?(0)},  *  <_  - -( /2){ (0)+ (0*)},  z  e  z  e  5  5  136  Appendix V.  Pseudocode for the hybrid MCMC  algorithm in Section 4.2.1  and draw u from unif(0,1). Set (0, z ) <— (0*, z * ) if  otherwise, keep the original ( 0 , z ) . b) U n c o n d i t i o n a l l y negate z ; that is,  Z i  Z.  c) Perform an autoregressive update to z ; that is,  z ^  d) Set  N{az,{l-a )I ). 2  k  0j «- 0.  3. O u t p u t 0 , . . . , 0 / . X  Remarks: Setting 5(0) = V l o g 7 r ( 0 ) and a close to 1 we obtain h y b r i d algorithm. Other choices of g and a gives different algorithms.  137  Bibliography Box, G . (1979). Robustness i n the strategy of scientific model building, Robustness in Statistics pp. 201-236. Breiman, L . (1996). Heuristics of instability and stabilization i n model selection, Annal of Statistics 24(6): 2350-2383. Diaconis, P. and Shahshahani, M . (1984). Nonlinear functions of linear combinations, SIAM Journal of Scientific Statistical Computation 5(1): 175-191. Draper, D . (1995). Assessment and propagation of model uncertainty, Journal of the Royal Statistical Society. Series B (Methodological) 57(1): 45-97. Eubank, R . (1999). Nonparametric regression and spline smoothing, M a r c e l Dekker. Friedman, J . (1991).- M u l t i v a r i a t e adaptive regression splines, The Annals of Statistics 19(1): 1-67. G e l m a n , A . and Pardoe, I. (2007). Average predictive effects for models w i t h nonlinearity, interactions, and variance components, Sociological Methodology. Greenland, S. (1979). L i m i t a t i o n s of the logistics analysis of epidemilogic data, American Journal of Epidemiology 110(6): 693. Greenland, S.. (1983). Tests for interaction i n epidemiologic studies: a review and a study of power, Statistics in Medicine 2(2): 243-51. 138  Bibliography Greenland, S. (1993). Basic problems i n interaction assessment, Environmental Health Perspectives 101: 59-66. G u , C . (2002). Smoothing spline ANOVA  models, Springer.  Gustafson, P. (2000). Bayesian regression modeling w i t h interactions a n d smooth effects, Journal of the American Statistical Association 95(451): 795-806. Gustafson, P. (2001). O n measuring sensitivity to parametric model misspecification, Journal of the Royal Statistical Society. Series B (Statistical Methodology) 63(1): 81-94. Gustafson, P. (2007). O n robustness and model flexibility i n survival analysis: transformed hazard models and average effects, Biometrics 63: 69-77. Gustafson, P., K a z i , A . a n d Levy, A . (2005). E x t e n d i n g logistic regression to model diffuse interactions, Statistics in medicine 24(13): 2089-2104. Gustafson, P., M a c N a b , Y . and W e n , S. (2004). O n the Value of derivative evaluations and random walk suppression i n M a r k o v C h a i n M o n t e C a r l o algorithms, Statistics and Computing 14(1): 23-38. Hills, S. a n d S m i t h , A . (1992). Parameterization issues i n Bayesian inference, Bayesian Statistics 4: 227-246. Hogan, M . , K u p p e r , L . , M o s t , B . and Haseman, J . (1978). Alternatives to R o t h m a n ' s approach for assessing synergism (or antagonism) i n cohort studies, American Journal of Epidemiology 108(1): 60-67. K a r l i n , S. a n d Studden, W . (1966). Tchebycheff Systems: With Applications in Analysis and Statistics, Interscience Publishers.  139  Bibliography K i e i n b a u m , D . , K u p p e r , L . and Morgenstern, H . (1982). Interaction, effect modification, and synergism, Epidemiologic  Research. New York: Van Nostrand  Reinholdpp.  K o o p m a n , J . (1981). Interaction between discrete causes, American  Journal  407-417. of Epi-  demiology 113(6): 716-724. L e C a m , L . (1960). L o c a l l y asymptotically normal families of distributions, of California  Publications  University  in Statistics 3(2): 37-98.  L i u , J . (1996). M e t r o p o l i z e d G i b b s sampler: an improvement. M a d i g a n , D . and Raftery, A . (1994). M o d e l selection and accounting for model uncertainly in- graphical models using Occam's window, Journal  of the American  Statistical  Association.' M c C u l l a g h , P . and Nelder, J . (1983). Generalized M u l l e r , A . and Stoyan, D . (2002). Comparison  linear models, C h a p m a n & H a l l .  methods for stochastic models and risks,  J o h n W i l e y and Sons. Neal, R . (1996). Bayesian  learning for neural networks, Springer.  Neal, R . (1998). Suppressing random walks i n M a r k o v chain M o n t e C a r l o using ordered overrelaxation, Learning in Graphical Models pp. 205-225. O ' B r i e n , S., K u p p e r , L . and Dunson, D . (2006). Performance of tests of association in misspecified generalized linear models, Journal 136(9): 3090-3100.  .  of statistical  planning  and  inference  ...  Raftery, A . (1996). A p p r o x i m a t e Bayes factors and accounting for model uncertainty in generalised linear models, Biometrika  83(2): 251-266.  140  Bibliography R o t h m a n , K . (1974). Synergy and antagonism i n cause-effect relationships,  American  Journal of Epidemiology 99(6): 385. R o t h m a n , K . , Greenland, S. a n d Walker, A . (1980). Concepts of interaction, American Journal of Epidemiology 1 1 2 ( 4 ) : 467; Saracci, R . (1980).  Interaction a n d synergism, American  Journal of Epidemiology  1 1 2 ( 4 ) : 465.  Venables, W . a n d Ripley, B . (1999).  Modern applied statistics with S-Plus, Springer  New Y o r k . W a n g , D . , Zhang, W . a n d B a k h a i , A . (2004). Comparison of Bayesian model averaging a n d stepwise methods for model selection i n logistic regression, Statistics in Medicine 2 3 ( 2 2 ) : 3451-3467. W h i t e , H . (1980). U s i n g least squares to approximate u n k n o w n regression functions, International Economic Review 2 1 ( 1 ) : 149-170. W h i t e , H . (1981).  Consequences and detection of misspecified nonlinear regression  models, Journal of the American Statistical Association 76(374): 419-433. W h i t e , H . (1982). M a x i m u m likelihood estimation of misspecified models, Econometrica 50(1): 1-26. X u , R . and O'Quigley, J . (2000).  E s t i m a t i n g average regression effect under non-  proportional hazards, Biostatistics 1(4): 423-439.  141  

Cite

Citation Scheme:

    

Usage Statistics

Country Views Downloads
China 3 0
Canada 1 0
Russia 1 0
United Kingdom 1 0
United States 1 0
City Views Downloads
Shenzhen 3 0
Ottawa 1 0
Penza 1 0
St Albans 1 0
Ashburn 1 0

{[{ mDataHeader[type] }]} {[{ month[type] }]} {[{ tData[type] }]}
Download Stats

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0302408/manifest

Comment

Related Items