Expanding the Capabilities of Bayesian Regression With Interactions and Smooth Effects

by

Nathan Johnson
B.Sc., Okanagan University College, 1998

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Statistics)

We accept this thesis as conforming to the required standard.

The University of British Columbia
September 2000

© Nathan Johnson, 2000

Abstract

The B-WISE (Bayesian regression With Interactions and Smooth Effects) method of regression modelling was designed to improve upon other flexible regression schemes in the areas of model interpretability and ease of implementation. B-WISE models have been shown to have good predictive performance, and performance can be improved further with a natural Bayesian model averaging scheme. In this thesis I outline some ways in which the constraints inherent in a B-WISE model can be relaxed, so that the technique can be used in more general situations and with even greater flexibility. It is shown that much of the interpretability of B-WISE models is retained and that implementation remains relatively straightforward.

Contents

Abstract
Contents
Acknowledgements
Dedication
1 Introduction
  1.1 Cubic Smoothing Splines and Cubic Regression Splines
  1.2 Some Methods for Multiple Regression
    1.2.1 Thin Plate Splines and Tensor Product Splines
    1.2.2 Regression Trees
    1.2.3 Multivariate Adaptive Regression Splines (MARS)
    1.2.4 Projection Pursuit Regression
    1.2.5 Neural Networks
    1.2.6 Gaussian Processes
  1.3 B-WISE
  1.4 Identifying Models With High Posterior Probability
2 Extensions to B-WISE
  2.1 Overlapping Interactions
  2.2 Categorical Variables
3 Examples
  3.1 Simulated Data
  3.2 Abalone Data
  3.3 Census Housing Data
4 Conclusion and Further Work
Bibliography
Appendix A Details of the Model Search Algorithm

Acknowledgements

I would like to thank Paul Gustafson for his initiative and integrity in research, as well as the freedom with guidance he afforded me in the directions I took. Thanks to Bertrand Clarke for being a second reader, and to the rest of the faculty for their high commitment to teaching and mentoring. I have thoroughly enjoyed my time in the Department of Statistics and won't begin to mention the colleagues who contributed to a memorable two years.

Nathan Johnson

The University of British Columbia
September 2000

To my parents — obviously!

Chapter 1

Introduction

Often an adequate summary of the relationship between a response variable $Y$ and predictor variables $X = (X_1, \ldots, X_p)$ is provided by a model
$$Y = f(X_1, \ldots, X_p) + \epsilon, \qquad \mathrm{var}(\epsilon) = \sigma^2,$$
where $\epsilon$ is usually normally distributed with mean 0.
The ability to predict $Y$ for an $X$ yet to be observed is perhaps the most important aspect of a good model, but interpretability is often a competing aspect. Investigators often want a model which is indeed a summary, making important features of the relationship apparent without conveying the more minute details. Hastie and Tibshirani (1990) give a thorough treatment of additive models, which have the form
$$Y = \alpha + \sum_{i=1}^{p} f_i(X_i) + \epsilon, \qquad \mathrm{var}(\epsilon) = \sigma^2, \qquad (1.1)$$
where $f_1, \ldots, f_p$ are unspecified smooth functions. The flexibility of the functions $f_1, \ldots, f_p$, coupled with the clarity and interpretability provided by the additive structure in (1.1), have made the additive model a standard tool for statistical analysis. Having obtained an additive model which adequately summarizes the relationship between $X$ and $Y$, an investigator can find in $f_i$ all of the model's information about the effect of $X_i$ on the response $Y$. A scatterplot of $Y$ against $X_i$, together with what Hastie and Tibshirani call the scatterplot smooth, $f_i$, is one appealing way to represent this information and assess its quality.

A large number of modelling schemes which go beyond additivity have been proposed for situations where the emphasis is on the predictive ability of $f(X_1, \ldots, X_p)$. These include surface smoothers, regression trees, MARS, projection pursuit regression, neural networks, and Gaussian processes, to name a few. Any of these methods can perform well in terms of prediction, but seldom do they lead to models that are interpretable. In particular, given a good fit $f(X_1, \ldots, X_p)$ it may be very difficult to elicit information about the effect of a particular predictor, yet at the same time the additive model may be too simplistic to be useful.

Significant research has been done to take the simple additive model one step further towards predictive ability, by describing possible bivariate interactions in predictor variables. We can write this extended additive model as
$$Y = \sum_{i} f_i(X_i) + \sum_{(j,k)} g_{jk}(X_j, X_k) + \epsilon, \qquad \mathrm{var}(\epsilon) = \sigma^2. \qquad (1.2)$$

1.1 Cubic Smoothing Splines and Cubic Regression Splines

Given knots $t_1 < \cdots < t_n$, write $h_i = t_{i+1} - t_i$, and let $f_i = s(t_i)$ and $\gamma_i = s''(t_i)$ denote the values and second derivatives of the spline $s$ at the knots. Then
$$s(t) = \frac{(t - t_i)f_{i+1} + (t_{i+1} - t)f_i}{h_i} - \frac{1}{6}(t - t_i)(t_{i+1} - t)\left\{\left(1 + \frac{t - t_i}{h_i}\right)\gamma_{i+1} + \left(1 + \frac{t_{i+1} - t}{h_i}\right)\gamma_i\right\}$$
for $t_i \le t \le t_{i+1}$, $i = 1, \ldots, n - 1$. In vector notation, using the above theorem, this can be written as
$$s(t) = v^T f - \tfrac{1}{6}(t - t_i)(t_{i+1} - t)\, w^T \gamma,$$
where $v$ is a vector with $v_i = (t_{i+1} - t)/h_i$, $v_{i+1} = (t - t_i)/h_i$, and all other components zero, and $w$ is a vector with $w_i = 1 + (t_{i+1} - t)/h_i$, $w_{i+1} = 1 + (t - t_i)/h_i$, and all other components zero. Thus we have a construction for $Z$.

1.2 Some Methods for Multiple Regression

1.2.1 Thin Plate Splines and Tensor Product Splines

Thin plate splines and tensor product splines fall into a category of smoothers that Hastie and Tibshirani call "generic multivariate smoothers," which are basically generalizations of one-dimensional methods to higher dimensions. These methods are often useful in two or three dimensions, but in higher dimensions they lead to models that are difficult to interpret. A more fundamental problem with these methods has been called the "curse of dimensionality." Most smoothing methods rely on the concept of localness: the smoothed value of an observed response is determined mostly by points in a local neighbourhood of the response. But in high dimensions, local neighbourhoods are inherently sparse. For example, in a uniformly distributed data set of $p$ predictors, a hypercube with sides that cover 10% of the range of each predictor will contain on average $0.1^p$ of the points in the data set.
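To make the sparsity claim concrete, the following short simulation (illustrative only, not part of the original text) estimates the fraction of uniformly distributed points captured by a neighbourhood whose sides span 10% of each predictor's range and compares it with the theoretical value $0.1^p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # uniform points in the p-dimensional unit hypercube

for p in (1, 2, 3, 5, 10):
    X = rng.uniform(size=(n, p))
    # Neighbourhood: a hypercube of side 0.1 centred at (0.5, ..., 0.5).
    inside = np.all(np.abs(X - 0.5) <= 0.05, axis=1)
    print(f"p = {p:2d}: observed fraction = {inside.mean():.6f}, "
          f"theoretical 0.1^p = {0.1 ** p:.6f}")
```

Already at $p = 5$ a neighbourhood of this size contains, on average, only about one point in a hundred thousand.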
This sparsity has prompted the development of other specialized methods that can deal more effectively with high-dimensional data, but the generic multivariate smoothers can be useful in two or three dimensions.

Thin plate splines are motivated by generalizing the roughness measure $r(f) = \int \{f''(t)\}^2\,dt$ to higher dimensions. The roughness measure for a bivariate function $f(x, y)$ analogous to the above measure for a univariate function is
$$r(f) = \int\!\!\int \left\{ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\, \partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right\} dx\, dy. \qquad (1.6)$$
Under this measure any plane in two dimensions has zero roughness. Green and Silverman note a physical interpretation similar to that for the univariate roughness measure: if an infinite elastic flat plate is deformed to the shape of the function $\epsilon f$ for small $\epsilon$, then the bending energy of the plate is, to first order, proportional to $r(f)$. The minimizer of the penalized residual sum of squares, using this roughness measure, is a thin plate spline. We can generalize the roughness measure (1.6) to still higher dimensions, where again a thin plate spline minimizes the penalized residual sum of squares, but for clarity we limit our discussion to the two-dimensional case. To define a thin plate spline we first define the function
$$\eta(r) = \frac{1}{16\pi}\, r^2 \log r^2.$$
Denote the points in the predictor space by $\mathbf{t}_i = (x_i, y_i)$, $i = 1, \ldots, n$, and let $\|\mathbf{t}\|$ denote the usual Euclidean norm of $\mathbf{t}$. Then a function $g(\mathbf{t})$ is a thin plate spline on the data set $\mathbf{t}_1, \ldots, \mathbf{t}_n$ if and only if $g$ is of the form
$$g(\mathbf{t}) = \sum_{i=1}^{n} \delta_i\, \eta(\|\mathbf{t} - \mathbf{t}_i\|) + a_1 + a_2 x + a_3 y$$
for some constants $\delta_1, \ldots, \delta_n, a_1, a_2, a_3$. We see that regression models based on thin plate splines are hardly interpretable, but in two and three dimensions they can at least be analysed graphically.

Another, perhaps more natural, way to generalize univariate splines is by using tensor product splines. Following the discussion in Green and Silverman we consider the most useful case of two dimensions. The tensor product of two functions $f(t)$ and $g(u)$ is the function $(f \otimes g)(t, u) = f(t)g(u)$. Now suppose we have a set of linearly independent functions $\{\delta_{i_1} : i_1 = 1, \ldots, q_1\}$ and another basis $\{\epsilon_{i_2} : i_2 = 1, \ldots, q_2\}$. The tensor product of these two bases is the set of all linear combinations of tensor products of basis functions. Let $\{\tau_1, \ldots, \tau_{m_1}\}$ and $\{\upsilon_1, \ldots, \upsilon_{m_2}\}$ be sequences such that the range of $t$ is $T = [\tau_1, \tau_{m_1}]$ and the range of $u$ is $U = [\upsilon_1, \upsilon_{m_2}]$. A tensor product cubic spline is constructed by dividing $T \times U$ into panels of the form $[\tau_r, \tau_{r+1}] \times [\upsilon_s, \upsilon_{s+1}]$. Within each panel the function is the product of a cubic in $t$ and a cubic in $u$, and at the joins between the panels the function is smooth. We omit the details, but suffice it to say that a large number of parameters must be estimated and, related to that issue, interpreting the function is difficult.

1.2.2 Regression Trees

Regression trees (Breiman, Friedman, Olshen and Stone 1984) carve up the predictor space into disjoint and exhaustive rectangles, modelling the response $Y$ within each rectangle as a constant. The method begins by searching over all candidate splitting points of all predictors for the optimal split, the one which divides the space into two regions such that the within-region variance is minimal. Then for each of the two subregions the process is repeated until convergence, where further splitting does not significantly improve the within-region variance. A regression tree can be presented graphically as a binary tree, or mathematically as
$$f(X) = \sum_{i} c_i\, I(X \in R_i),$$
where the regions $R_i$ are defined by the binary splits.
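As an illustration of the greedy search just described, the sketch below (illustrative only; it is not the CART implementation) finds the single best split by scanning every candidate splitting point of every predictor and minimizing the within-region sum of squares.

```python
import numpy as np

def best_split(X, y):
    """Return (predictor index, split point, SSE) minimizing the within-region
    sum of squares when the data are divided into {x <= s} and {x > s}."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:            # candidate split points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 0] > 0.6, 2.0, -1.0) + rng.normal(scale=0.1, size=200)
print(best_split(X, y))   # should split on predictor 0 near 0.6
```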
There is no standard convergence criterion in use, but one popular strategy is to build a large tree and then prune it to a smaller size using cross-validation. The final tree is the one with the smallest estimated prediction error.

The main drawback of regression trees is that they are discontinuous functions of the predictor variables. This leads to less than optimal predictive ability in situations where the true underlying relationship between $X$ and $Y$ is continuous. Also, while small trees are often very interpretable for practitioners, larger trees can be puzzling. However, the adaptive placement of the 'knots', or discontinuities, provides efficient use of the data, since more parameters are estimated only where the data require them.

1.2.3 Multivariate Adaptive Regression Splines (MARS)

The recursive partitioning approach outlined above motivated the MARS methodology proposed by Friedman (1991). Like regression trees, MARS uses adaptive knot placement, but produces a regression function that is continuous. Starting with the initial basis function $B_1(\mathbf{x}) = 1$, a search is made over all candidate splitting points of all predictors for the optimal split. That splitting point introduces a pair of new basis functions, $B_2(\mathbf{x}) = B_1(\mathbf{x})[+(x_i - t_i)]_+$ and $B_3(\mathbf{x}) = B_1(\mathbf{x})[-(x_i - t_i)]_+$, where $x_i$ is the predictor being split, $t_i$ is the optimal splitting point, and $[z]_+ = \max(0, z)$. Each iteration entails a search for the optimal splitting point for all of the basis functions present, and results in two new basis functions being introduced. The final regression function
$$\hat{f}(\mathbf{x}) = \sum_{i} a_i B_i(\mathbf{x})$$
is a continuous function of the predictor variables (since all the basis functions are continuous). The basis functions have the form
$$B_i(\mathbf{x}) = \begin{cases} 1, & i = 1, \\ \prod_{j=1}^{J_i} \left[s_{ji}\left(x_{v(j,i)} - t_{ji}\right)\right]_+, & i = 2, 3, \ldots, \end{cases}$$
where $s_{ji}$ is the sign of each factor, $v(j,i)$ is the index of the predictor in the $j$th factor of the $i$th basis function, $t_{ji}$ is the splitting point, and $J_i$ is the degree of interaction. We do not allow a predictor to occur more than once in a single basis function. As with regression trees, a popular strategy is to run the algorithm until a large number of basis functions have been formed, and then to delete the unimportant functions until some lack-of-fit measure is minimized.

Denison, Mallick and Smith (1998) put MARS in a Bayesian context by placing a prior distribution on the number $k$ of basis functions present, and on the vector $\theta^{(k)} = (\theta_1^{(k)}, \ldots, \theta_k^{(k)})$, where $\theta_i^{(k)} = (a_i, T_i, t_{1i}, s_{1i}, \ldots, t_{J_i i}, s_{J_i i})$ contains information that completely determines the $i$th basis function. The variable $T_i$ is the "type" of the basis function, defined to be the indicator of which predictors occur in the basis function. They then simulate samples from the joint posterior distribution
$$p(\theta^{(k)}, k \mid \mathbf{y}) \propto p(k, \theta^{(k)}, \mathbf{y}) = p(k)\, p(\theta^{(k)} \mid k)\, p(\mathbf{y} \mid k, \theta^{(k)})$$
using an MCMC method. The main strength of BMARS over MARS is this MCMC sampling scheme, by which Bayesian model averaging can be employed. Empirical results have shown that Bayesian model averages obtained from the BMARS algorithm have better predictive ability than good MARS models.
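To make the hinge construction above concrete, here is a small illustrative sketch (the function and variable names are invented; this is not Friedman's implementation) showing how an existing basis function spawns the pair $B(\mathbf{x})[+(x_i - t)]_+$ and $B(\mathbf{x})[-(x_i - t)]_+$.

```python
import numpy as np

def hinge(z):
    """[z]_+ = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def split_basis(B_parent, x_col, t):
    """Given an existing basis function evaluated on the data (B_parent),
    a predictor column x_col and a knot t, return the MARS-style pair
    B_parent * [+(x - t)]_+ and B_parent * [-(x - t)]_+."""
    return B_parent * hinge(x_col - t), B_parent * hinge(-(x_col - t))

rng = np.random.default_rng(2)
X = rng.uniform(size=(10, 2))
B1 = np.ones(10)                          # initial basis function B_1(x) = 1
B2, B3 = split_basis(B1, X[:, 0], 0.4)    # split on x_1 at knot 0.4
print(np.column_stack([B1, B2, B3]))
```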
1.2.4 Projection Pursuit Regression

Projection pursuit regression (Friedman and Stuetzle 1981) models are of the form
$$Y = \sum_{k=1}^{K} h_k\!\left(a_k^T X\right) + \epsilon,$$
where $a_k^T X$ is a one-dimensional projection of $X$ and $h_k$ is an arbitrary smooth function. Because all of the smoothing is univariate there are no dimensionality problems. Interactions between predictor variables can be accounted for, but are not easily understood from the functional form. In fact the functional form is not very enlightening at all for $K > 1$, though the judicious selection of projections does keep the number of parameters to a minimum.

1.2.5 Neural Networks

The simplest form of a neural network in the regression context can be written as
$$Y = \phi_0\!\left(\alpha_0 + \sum_{h=1}^{H} v_h\, \phi_h\!\left(\alpha_h + \sum_{i=1}^{p} w_{ih} X_i\right)\right) + \epsilon, \qquad (1.7)$$
where $\phi_h$ is almost always the logistic function, and $\phi_0$ is a linear, logistic or threshold function. The pictorial representation in Figure 1.1 is helpful. At each of the hidden nodes the inputs $X_1, \ldots, X_p$ are multiplied by weights $w_{ih}$ and summed. The functions $\phi_h$ are applied and the results are treated in the same way as inputs for the output node. The scheme originated in neuroscience as a model for electronic impulses in the brain. As such it is not a very realistic model, but as a nonlinear regression scheme it can be powerful. Besides model interpretability, which is basically nil, there are some complications in controlling the amount of fitting. Various suggestions have been made in the neural network literature for choosing the number of hidden nodes (which controls the amount of fitting). As with regression trees, cross-validation is one possibility, but no procedure has been widely agreed upon. Variants of the simple model (1.7) may have more than one hidden layer and may incorporate "skip-layer" connections which link the input layer directly to the output. Parameter estimation is done in an iterative manner using Gauss-Newton steps and can be time consuming. Neal (1996) put parameter estimation in a Bayesian framework, and also showed that a neural network converges to a Gaussian process as the number of hidden nodes approaches infinity.

Figure 1.1: A pictorial representation of model (1.7), showing the input layer and the hidden layer(s). At node $h$ of the hidden layer the inputs are multiplied by weights and summed; the result is input to a function whose value is then treated as an input to the next layer of the network.

1.2.6 Gaussian Processes

A Gaussian process can be used to define a prior distribution over functions of one or more input variables. Suppose we have a response variable $t$ which depends on a $p$-dimensional input $\mathbf{x}$. Denoting the observed targets by $t^{(1)}, \ldots, t^{(n)}$ and the corresponding inputs by $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}$, a prior for the relationship between $\mathbf{x}$ and $t$ is specified in terms of a covariance function $\mathrm{Cov}[t^{(i)}, t^{(j)}]$ which depends on the inputs. For example,
$$\mathrm{Cov}\left[t^{(i)}, t^{(j)}\right] = \sigma_a^2 + \sigma_b^2 \sum_{u=1}^{p} x_u^{(i)} x_u^{(j)}$$
is the covariance function for the simple regression model
$$t = a + \sum_{u=1}^{p} b_u x_u.$$
The covariance function
$$\mathrm{Cov}\left[t^{(i)}, t^{(j)}\right] = \eta^2 \exp\left\{-\sum_{u=1}^{p} \rho_u^2 \left(x_u^{(i)} - x_u^{(j)}\right)^2\right\},$$
where $\eta$ and $\rho_u$ are hyperparameters, can be used to obtain a regression function based on arbitrary smooth functions. Letting $C$ be the $n \times n$ matrix of values of the covariance function, it can be shown that the posterior predictive distribution of an unobserved $t^{(n+1)}$ is Gaussian with mean
$$E\left[t^{(n+1)} \mid t^{(1)}, \ldots, t^{(n)}\right] = \mathbf{k}^T C^{-1} \mathbf{t}$$
and variance
$$\mathrm{Var}\left[t^{(n+1)} \mid t^{(1)}, \ldots, t^{(n)}\right] = v - \mathbf{k}^T C^{-1} \mathbf{k},$$
where $\mathbf{k}$ is the vector of covariances between $t^{(n+1)}$ and the $n$ known targets, and $v$ is the prior variance of $t^{(n+1)}$ (given by $\mathrm{Cov}[t^{(n+1)}, t^{(n+1)}]$). More details and illustrations can be found in Neal (1997).

The main drawback to using Gaussian processes is that they become computationally expensive as $n$ increases. Memory requirements grow as $n^2$ and time requirements grow as $n^3$, but Neal suggests Gaussian processes can handle data sets with up to a thousand cases with modest computational resources.
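A minimal sketch of the predictive equations above, assuming a squared-exponential covariance with fixed hyperparameters and a known noise variance added to the diagonal of $C$ (these settings are invented for illustration; Neal treats the hyperparameters as unknown):

```python
import numpy as np

def cov(x1, x2, eta=1.0, rho=3.0):
    """Squared-exponential covariance eta^2 * exp(-rho^2 * ||x1 - x2||^2)."""
    return eta ** 2 * np.exp(-rho ** 2 * np.sum((x1 - x2) ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(size=(30, 1))                        # training inputs
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=30)

sigma2 = 0.1 ** 2                                    # assumed noise variance
n = len(t)
C = np.array([[cov(xi, xj) for xj in X] for xi in X]) + sigma2 * np.eye(n)

x_new = np.array([0.25])
k = np.array([cov(x_new, xi) for xi in X])           # covariances with new point
v = cov(x_new, x_new) + sigma2                       # prior variance of t^(n+1)

mean = k @ np.linalg.solve(C, t)                     # k' C^{-1} t
var = v - k @ np.linalg.solve(C, k)                  # v - k' C^{-1} k
print(mean, var)
```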
1.3 B-WISE

B-WISE models a continuous response $Y$ as a function of continuous predictors $X_1, \ldots, X_p$. For ease of interpretation we assume that the predictors have been rescaled so that 0 corresponds to a low value of the predictor and 1 corresponds to a high value. We assume the response $Y$ is rescaled to have mean zero. Most variable selection techniques are aimed at deciding which of the $p$ predictors should be included in the model and which should be excluded. In B-WISE, a predictor can be included as a smooth effect, included as a linear effect, or excluded from the model. Thus a B-WISE model can be written as
$$Y = \alpha + \sum_{i \in M} f_i(X_i) + \sum_{(j,k) \in I} g_{jk}(X_j, X_k) + \epsilon, \qquad (1.8)$$
where $\epsilon \sim N(0, \sigma^2)$, $M$ indicates which predictors are included as main effects only, and $I$ indicates which predictors are involved in a bivariate interaction. A main effect function $f_i$ can be either linear,
$$f_i(x_i) = \beta_i x_i,$$
or smooth,
$$f_i(x_i) = \beta_i x_i + s_i(x_i),$$
where $s_i(x_i)$ is a cubic regression spline constrained to be zero at the exterior knots 0 and 1. The bivariate effect functions include a main effect for each variable, so a predictor involved in an interaction will appear only in the second summation of (1.8). Moreover, a predictor may appear only once in (1.8), so that 'overlapping' interactions, as when $g_{12}$ and $g_{13}$ are both included, are not allowed.

Several proposals have been made for the structure of nonparametric bivariate functions. One possibility is to let $g_{jk}(x_j, x_k) = g_{jk}(x_j)h_{jk}(x_k)$, where the two univariate functions are nonparametric (Cuzick 1993). The main drawback associated with this technique is that a large number of smooth functions must be estimated. Hastie and Tibshirani (1993) proposed varying-coefficient models, in which the univariate function for one of the predictors is nonparametric and acts as a coefficient for the other predictor. Thus fewer parameters need to be estimated and much flexibility is retained.

In the B-WISE scheme, if two variables $x_j$ and $x_k$ are both included as linear effects then their bivariate effect function $g_{jk}$ is modelled in the usual way as
$$g_{jk}(x_j, x_k) = \beta_j x_j + \beta_k x_k + \beta_{jk} x_j x_k.$$
Under this structure the effect of one predictor can be viewed as a linear effect, with slope and intercept that depend on the value of the other predictor. For intuition into the effect of $x_j$ we might plot $g_{jk}(x_j, 0) = \beta_j x_j$ and $g_{jk}(x_j, 1) = \beta_k + (\beta_j + \beta_{jk})x_j$ on the same axes, with the understanding that the effect at an arbitrary $x_k$ is a linear interpolation,
$$g_{jk}(x_j, x_k) = (1 - x_k)\, g_{jk}(x_j, 0) + x_k\, g_{jk}(x_j, 1).$$
This is a useful feature, since it can be difficult for a picture of a surface in three dimensions to communicate information about a bivariate effect.

B-WISE models retain this feature when $x_j$ is included as a linear effect and $x_k$ is included as a smooth effect. In this case the bivariate effect function is modelled as
$$g_{jk}(x_j, x_k) = \beta_j x_j + \beta_k x_k + s_k(x_k) + \beta_{jk} x_j x_k + x_j\, t_{jk}(x_k),$$
where $t_{jk}(x_k)$ is another cubic regression spline with exterior knots constrained to be zero. We could plot $g_{jk}(0, x_k) = \beta_k x_k + s_k(x_k)$ and $g_{jk}(1, x_k) = \beta_j + (\beta_k + \beta_{jk})x_k + s_k(x_k) + t_{jk}(x_k)$ on the same axes, with the understanding that the effect of $x_k$ at arbitrary values of $x_j$ is a linear interpolation between the two smooth curves,
$$g_{jk}(x_j, x_k) = (1 - x_j)\, g_{jk}(0, x_k) + x_j\, g_{jk}(1, x_k).$$
Alternatively we might plot the two linear functions $g_{jk}(x_j, 0)$ and $g_{jk}(x_j, 1)$, but the interpolation for general values of $x_k$ is no longer linear.
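The edge-interpolation interpretation is easy to verify numerically. The sketch below (with made-up coefficient values) checks that, when both predictors enter linearly, the surface at any $x_k$ is exactly the linear interpolation between the two edge curves.

```python
import numpy as np

beta_j, beta_k, beta_jk = 1.5, -0.8, 2.0     # made-up coefficients

def g(xj, xk):
    """Bivariate effect when both predictors are linear."""
    return beta_j * xj + beta_k * xk + beta_jk * xj * xk

xj = np.linspace(0, 1, 5)
for xk in (0.0, 0.3, 0.7, 1.0):
    direct = g(xj, xk)
    interpolated = (1 - xk) * g(xj, 0.0) + xk * g(xj, 1.0)
    print(xk, np.allclose(direct, interpolated))   # True for every xk
```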
In the case where both predictors are included as smooth effects, their bivariate effect function is modelled as
$$g_{jk}(x_j, x_k) = \beta_j x_j + s_j(x_j) + \beta_k x_k + s_k(x_k) + \beta_{jk} x_j x_k + x_j\, t_{jk}(x_k) + x_k\, t_{kj}(x_j).$$
In this situation we can plot the bivariate effect as a function of either variable while holding the other variable fixed. However, it would be prudent to plot more than just two curves, since there is no linear interpolation interpretation. In all three cases for the bivariate effect function, the surface is determined by the univariate functions at the edges of the unit square, which facilitates interpretability and ease of computation.

The aim of the B-WISE scheme is to rank all, or many, possible models according to their simplicity and predictive ability. This ranking is made possible by means of a Bayesian paradigm in which a posterior probability is calculated for each model. This probability then serves as a measure of a model's quality, in terms of the criteria emphasized under this scheme. We assume that
$$Y \mid \theta, \sigma^2, \lambda \sim N(A_\lambda \theta, \sigma^2 I),$$
where $\theta$ is a parameter vector containing an intercept, slope terms, and the values of the splines at the knot points, and $A_\lambda$ is the design matrix for model $\lambda$. We use a prior of the form $p(\theta, \sigma^2, \lambda) = p(\theta \mid \sigma^2, \lambda)\, p(\sigma^2)\, p(\lambda)$, where $p(\sigma^2) \propto \sigma^{-2}$, and $p(\lambda)$ is uniform on the model space $\Lambda$. For $p(\theta \mid \sigma^2, \lambda)$ we use the g-prior, but in a slightly modified form that penalizes models according to their degree of roughness, $r(f) = \int \{f''(t)\}^2\,dt$. The modified g-prior makes use of a roughness matrix $R$ such that $v^T R v = r(f)$, where $v$ is the vector of spline values at the knot points (see equation 1.5). To penalize roughness we give $v$ the prior distribution $v \sim N\{0, (kR)^{-1}\}$, where $k$ determines the strength of the roughness penalty.

Now to quantify the roughness of an entire model we use the matrix $R$ to construct an overall roughness matrix $R_\lambda$ that satisfies
$$\theta^T R_\lambda \theta = \sum_{j} r(s_j) + \sum_{k} \frac{r(s_k) + r(s_k + t_k)}{2}, \qquad (1.9)$$
where $j$ ranges over smooth predictors involved in a main effect only, and $k$ ranges over smooth predictors involved in an interaction. This is, in some sense, the sum of the average roughnesses of all the predictors. Here a predictor involved in an interaction has an average roughness that depends on its effect when the interacting variable is at a 'high' value and on its effect when the interacting variable is at a 'low' value. One can verify that $R_\lambda$ is everywhere zero except for $R$ on the diagonal blocks corresponding to the parameters for $s_j$, $R/2$ on the diagonal blocks for $t_i$, and $R/2$ on the off-diagonal blocks for the $s_i$-$t_i$ pairs. Then the final form of the prior is
$$\theta \mid \sigma^2, \lambda \sim N\!\left(0,\; \sigma^2\left\{\tfrac{c}{n} A_\lambda^T A_\lambda + k R_\lambda\right\}^{-1}\right),$$
where the hyperparameters $c$ and $k$ must be set in order for the prior to be specified completely.

With the prior thus specified, the conditional distribution of $\theta \mid \sigma^2, \lambda, \mathbf{y}$ is normal with mean
$$\left\{\left(1 + \tfrac{c}{n}\right) A_\lambda^T A_\lambda + k R_\lambda\right\}^{-1} A_\lambda^T \mathbf{y}$$
and variance
$$\sigma^2 \left\{\left(1 + \tfrac{c}{n}\right) A_\lambda^T A_\lambda + k R_\lambda\right\}^{-1}.$$
The distribution of $\sigma^2 \mid \lambda, \mathbf{y}$ is inverse gamma with shape parameter $n/2$ and scale parameter $\{\mathbf{y}^T (I - H_\lambda)\mathbf{y}\}/2$, where
$$H_\lambda = A_\lambda \left\{\left(1 + \tfrac{c}{n}\right) A_\lambda^T A_\lambda + k R_\lambda\right\}^{-1} A_\lambda^T$$
is the hat matrix for model $\lambda$. The distribution of $\lambda \mid \mathbf{y}$ is
$$p(\lambda \mid \mathbf{y}) \propto \frac{\left|\tfrac{c}{n} A_\lambda^T A_\lambda + k R_\lambda\right|^{1/2}}{\left|\left(1 + \tfrac{c}{n}\right) A_\lambda^T A_\lambda + k R_\lambda\right|^{1/2}}\, \left\{\mathbf{y}^T (I - H_\lambda)\mathbf{y}\right\}^{-n/2}. \qquad (1.10)$$
Note that (1.10) has some intuitive appeal. We can write $\mathbf{y}^T (I - H_\lambda)\mathbf{y}$ as
$$\mathbf{y}^T (I - H_\lambda)\mathbf{y} = \left\{\|\mathbf{y} - \hat{\mathbf{y}}_\lambda\|^2 + k\, \hat{\theta}_\lambda^T R_\lambda \hat{\theta}_\lambda\right\} + \tfrac{c}{n}\|\hat{\mathbf{y}}_\lambda\|^2,$$
where $\hat{\mathbf{y}}_\lambda = H_\lambda \mathbf{y}$ are the fitted values. The term in braces is just the penalized residual sum of squares. The other factor in (1.10) can be interpreted as a penalty for model complexity. In the simple case where $k = 0$ we have
$$p(\lambda \mid \mathbf{y}) \propto \exp\left\{-\frac{n}{2}\log\!\left(\|\mathbf{y} - \hat{\mathbf{y}}_\lambda\|^2 + \tfrac{c}{n}\|\hat{\mathbf{y}}_\lambda\|^2\right) - \frac{d(\lambda)}{2}\log\!\left(1 + \tfrac{n}{c}\right)\right\},$$
where $d(\lambda)$ is the dimension of $\theta$ under model $\lambda$. Here we can easily see how the posterior probability involves a tradeoff between goodness of fit and the number of parameters in the model.
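In practice (1.10) is best evaluated on the log scale. The following sketch (illustrative only; $A_\lambda$, $R_\lambda$, $c$ and $k$ must be supplied by the user, and the toy values below are arbitrary) is one way the unnormalized log posterior probability of a model might be computed.

```python
import numpy as np

def log_post_model(A, R, y, c, k):
    """Unnormalized log p(lambda | y) from equation (1.10):
    0.5*log|(c/n) A'A + k R| - 0.5*log|(1 + c/n) A'A + k R|
    - (n/2)*log{ y'(I - H) y }."""
    n = len(y)
    AtA = A.T @ A
    M0 = (c / n) * AtA + k * R
    M1 = (1 + c / n) * AtA + k * R
    theta_hat = np.linalg.solve(M1, A.T @ y)     # posterior mean of theta
    quad = y @ y - y @ (A @ theta_hat)           # equals y'(I - H)y
    return (0.5 * np.linalg.slogdet(M0)[1]
            - 0.5 * np.linalg.slogdet(M1)[1]
            - 0.5 * n * np.log(quad))

# Toy use: a model with an intercept and one linear term, no spline (so R = 0).
rng = np.random.default_rng(4)
x = rng.uniform(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=50)
y = y - y.mean()                                 # response rescaled to mean zero
A = np.column_stack([np.ones(50), x])
print(log_post_model(A, np.zeros((2, 2)), y, c=1.0, k=5.0))
```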
1.4 Identifying Models With High Posterior Probability

Given predictor variables $X_1, \ldots, X_p$, the collection of models of form (1.8) makes up a model space $\Lambda$. When $p$ is small it is possible to evaluate every model in $\Lambda$. For larger $p$ it may not be feasible to evaluate every model, yet there are techniques available for locating models with high posterior probabilities. George and McCulloch (1993) describe one such Markov chain Monte Carlo (MCMC) technique, and George and McCulloch (1997) give advice for fast model space exploration. For B-WISE Gustafson proposes an MCMC-inspired algorithm which makes use of the nonnormalized posterior probability (1.10). When this probability is available up to a normalizing constant, the search algorithm need not contain the accept-reject step usually found in MCMC algorithms.

To describe the B-WISE search algorithm we define a neighbourhood of a particular model to contain all models obtained by

• including a predictor not already in the model
• upgrading the linear effect of a predictor to a smooth effect
• including an interaction between two variables present in the model and not already involved in interactions
• downgrading the smooth effect of a predictor to a linear effect
• removing a predictor included in the model as a linear effect
• removing an interaction term
• partner switching two interaction terms (e.g. changing $A \times B$ and $C \times D$ interactions to $A \times C$ and $B \times D$ interactions).

The algorithm begins by evaluating the nonnormalized posterior probability of all models in the neighbourhood of an initial model $\lambda_1$ chosen by the investigator. The nonnormalized probabilities are normalized relative to this neighbourhood and the next model $\lambda_2$ is chosen randomly from the neighbourhood according to these probabilities. The same procedure is performed for $\lambda_2$, and is repeated until some convergence criterion is satisfied. Gustafson suggests that the algorithm has likely converged if it fails to find a model in
$$\Lambda_\alpha = \left\{\lambda : p(\lambda \mid \mathbf{y}) > \alpha^{-1} \max_{\lambda} p(\lambda \mid \mathbf{y})\right\} \qquad (1.11)$$
for $t$ iterations in a row. The maximum in (1.11) is taken over all models evaluated by the algorithm so far, and Gustafson used $\alpha = 20$ and $t = 5$. The model search algorithm is more fully detailed in Appendix A.

Having performed a complete run of the algorithm we may find that a single best model will suit our purposes, or we may form a Bayesian model average using the Occam's window approach described in Raftery, Madigan and Hoeting (1997). Bayesian model averages are often found to have better predictive ability than a single best model, but in the B-WISE scheme dramatic improvements have not been seen.
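A compact sketch of the search loop described above is given below. It is illustrative only: the helpers `neighbours` and `log_post` are hypothetical and would have to implement the neighbourhood moves listed above and the log of (1.10), and models are assumed to be hashable descriptions (tuples, say).

```python
import numpy as np

def model_search(initial_model, neighbours, log_post, alpha=20.0, t_stop=5, seed=0):
    """Sketch of the neighbourhood-based search of Section 1.4.
    neighbours(m): list of models in m's neighbourhood (hypothetical helper).
    log_post(m): log of the unnormalized posterior probability (1.10)."""
    rng = np.random.default_rng(seed)
    evaluated = {initial_model: log_post(initial_model)}   # all models seen so far
    current, misses = initial_model, 0
    while misses < t_stop:
        cand = neighbours(current)
        new = [m for m in cand if m not in evaluated]
        for m in new:
            evaluated[m] = log_post(m)
        lp = np.array([evaluated[m] for m in cand])
        # Move to a neighbour with probability proportional to its posterior.
        w = np.exp(lp - lp.max())
        current = cand[rng.choice(len(cand), p=w / w.sum())]
        # Convergence check: did this iteration turn up a new model whose
        # posterior is within a factor alpha of the best found so far?
        best = max(evaluated.values())
        if any(evaluated[m] > best - np.log(alpha) for m in new):
            misses = 0
        else:
            misses += 1
    return evaluated
```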
Chapter 2

Extensions to B-WISE

There are at least two limitations of the B-WISE scheme as it stands. The first is that overlapping interactions are not permitted, so a given predictor variable can interact with at most one other. Yet in some studies a predictor variable may well interact with several other predictor variables. One example might be a study in human health with gender as a predictor variable, where an investigator may well expect the gender variable to interact with every other predictor variable. The second limitation of B-WISE is also apparent in this example: as yet no provision has been made for categorical predictor variables. In this chapter we discuss how B-WISE can be extended to allow overlapping interactions and categorical predictors.

2.1 Overlapping Interactions

The model (1.8) has an appealing structure. To investigate the model's information about a particular predictor one need examine only one term of the model, which will be either a univariate function $f_i$ or a bivariate function $g_{ij}$. To incorporate overlapping interactions we must assume a different structure underlying (1.8). The problem is that if both $g_{ij}$ and $g_{ik}$ appear in the model, there is no obvious place to put the main effect for predictor $X_i$, except perhaps in a separate main effect term $f_i$. This is the approach we take, so that the first summation in (1.8) contains the main effects for all the predictors in the model, and the second summation contains the bivariate interactions for any pairs of variables which interact. Thus, to investigate the model's information about a particular predictor, one must collect all of the terms involving that predictor.

With a predictor being able to interact with several others, there comes the question of how to quantify the roughness of a predictor's effect function. The old way is given by equation (1.9), where the roughness is given by $r(s_i)$ if $X_i$ is not involved in any interactions, and by $\{r(s_i) + r(s_i + t_i)\}/2$ if $X_i$ is involved in an interaction. We can build on this idea of averaging over the high and low levels of the interacting variable. If predictor $X_i$ interacts with $d$ other predictors $X_{j_1}, \ldots, X_{j_d}$, then there are $2^d$ different high-low combinations over which to average. If we put all of the interaction splines into a vector $T_i = (t_{ij_1}, \ldots, t_{ij_d})$ then we can write the roughness of the effect for $X_i$ as
$$2^{-d} \sum_{I} r\!\left(s_i + I^T T_i\right),$$
where $I$ ranges over all vectors of length $d$ with elements from $\{0, 1\}$. This simplifies to the expression given for the non-overlapping case where $d = 1$.

To specify the prior for the parameter vector $\theta$ we need a matrix $R_\lambda$ that satisfies
$$\theta^T R_\lambda \theta = \sum_{k} 2^{-\dim(T_k)} \sum_{I} r\!\left(s_k + I^T T_k\right),$$
where $k$ ranges over all predictors with smooth effects. Fortunately such an $R_\lambda$ is fairly easy to construct. If a predictor $X_i$ interacts with $d$ others then we put a $(d+1) \times (d+1)$ block matrix into $R_\lambda$, each block being the roughness matrix $R$ multiplied by a special factor. This special factor is 1 in the upper left block, 1/2 in the rest of the top row, left column and diagonal, and 1/4 in the remaining blocks.

Example: To show that the above construction works we give an example where a smooth predictor $X_1$ interacts with both $X_2$ and $X_3$. Let $a$ be the parameter vector for the spline $s_1(x_1)$, $b$ be the parameter vector for the spline $t_{12}(x_1)$ and $c$ be the parameter vector for the spline $t_{13}(x_1)$. Then
$$[a^T\, b^T\, c^T]\, R_\lambda \begin{bmatrix} a \\ b \\ c \end{bmatrix} = [a^T\, b^T\, c^T] \begin{bmatrix} R & \tfrac{1}{2}R & \tfrac{1}{2}R \\ \tfrac{1}{2}R & \tfrac{1}{2}R & \tfrac{1}{4}R \\ \tfrac{1}{2}R & \tfrac{1}{4}R & \tfrac{1}{2}R \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$
$$= \tfrac{1}{4}\left[a^T R a + (a+b)^T R (a+b) + (a+c)^T R (a+c) + (a+b+c)^T R (a+b+c)\right] \qquad (2.1)$$
$$= \tfrac{1}{4}\left[r(s_1) + r(s_1 + t_{12}) + r(s_1 + t_{13}) + r(s_1 + t_{12} + t_{13})\right],$$
which is the average roughness of the effect function for $X_1$.
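Equation (2.1) is easy to check numerically; in the sketch below $R$ is just an arbitrary symmetric positive semidefinite matrix standing in for the roughness matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
q = 4                                    # number of spline parameters
M = rng.normal(size=(q, q))
R = M @ M.T                              # arbitrary symmetric PSD stand-in for R
a, b, c = rng.normal(size=(3, q))        # splines s_1, t_12, t_13

blocks = np.block([[R,     R / 2, R / 2],
                   [R / 2, R / 2, R / 4],
                   [R / 2, R / 4, R / 2]])
theta = np.concatenate([a, b, c])

lhs = theta @ blocks @ theta
rhs = 0.25 * (a @ R @ a
              + (a + b) @ R @ (a + b)
              + (a + c) @ R @ (a + c)
              + (a + b + c) @ R @ (a + b + c))
print(np.allclose(lhs, rhs))             # True
```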
To show that the construction for $R_\lambda$ works in general, suppose $X_i$ interacts with $d = \dim(T_i)$ other predictors and consider the form of equation (2.1). All $2^d$ terms have an $a^T R a$. When we divide by $2^d$ in forming the average, the result is that $a^T R a$ has a coefficient of 1; hence the entry in the upper left block of $R_\lambda$. There are three other kinds of terms found in equation (2.1): terms of the form $x^T R x$ where $x$ is a spline other than $a$, terms of the form $a^T R x$ (or $x^T R a$), and terms of the form $x^T R y$ where both $x$ and $y$ are splines other than $a$. Terms of the first kind come from a larger term $(a + \cdots + x + \cdots)^T R (a + \cdots + x + \cdots)$, and there are $2^{d-1}$ possibilities for the exact form of this term (there are $d - 1$ splines other than $x$ which are either present or absent from this larger term). Dividing by $2^d$ in taking the average gives $x^T R x$ a coefficient of $2^{d-1}/2^d = 1/2$; hence the diagonal entries of $R_\lambda$. The same reasoning applies to $a^T R x$ (or $x^T R a$), which explains the top row and left column of $R_\lambda$. Terms of the third kind come from a larger term $(a + \cdots + x + \cdots + y + \cdots)^T R (a + \cdots + x + \cdots + y + \cdots)$, and there are $2^{d-2}$ possibilities for the exact form of this term. Dividing by $2^d$ means the term $x^T R y$ has a coefficient of 1/4; hence all the remaining blocks of $R_\lambda$.

2.2 Categorical Variables

Incorporating categorical variables into the B-WISE setup is fairly straightforward given the work done so far. The key assumption is that each predictor has an additive effect on the response variable, and that effect may depend on the values of several other predictors. We keep equation (1.8) as our model equation and remain in the wider setting where overlapping interactions are allowed.

If a categorical predictor $X_i$ has a main effect only, we write its effect function as
$$f_i(X_i) = \beta_{i2}\, I(X_i = x_{i2}) + \cdots + \beta_{im}\, I(X_i = x_{im}),$$
where $x_{i1}, \ldots, x_{im}$ are the $m$ mutually exclusive values which $X_i$ can take. The effect of $X_i$ being at level $x_{i1}$ is now hidden in the intercept $\alpha$ of equation (1.8). Note that when $X_i$ is a binary (0/1) categorical variable, we can take $f_i(X_i) = \beta_i X_i$, as we do for continuous variables.

As with all other types of interactions, when we allow a categorical predictor to interact with another predictor, we also include main effect functions for each of the variables involved in the interaction. These main effect functions have the same form as they would if the same two predictors were in the model but did not interact with each other. Thus, allowing an interaction between two variables already in a model amounts to only a modest increase in the complexity of the design matrix. We need to consider three types of interactions in which a categorical predictor $X_i$ may be involved: an interaction with a linear predictor, an interaction with a smooth predictor, and an interaction with another categorical predictor.

If $X_i$ interacts with a linear predictor $X_j$ we code the bivariate effect function as
$$g_{ij}(X_i, X_j) = I(X_i = x_{i2})\left(\alpha_{j2} + \beta_{j2} X_j\right) + \cdots + I(X_i = x_{im})\left(\alpha_{jm} + \beta_{jm} X_j\right).$$
In this case the effect of $X_i$ being at level $x_{i1}$ is hidden in the intercept $\alpha$ and the main effect function $f_j(X_j)$.

If categorical predictor $X_i$ interacts with a smooth predictor $X_j$, one possibility is to code the bivariate effect function as
$$g_{ij}(X_i, X_j) = I(X_i = x_{i2})\left(\alpha_{j2} + \beta_{j2} X_j + s_{j2}(X_j)\right) + \cdots + I(X_i = x_{im})\left(\alpha_{jm} + \beta_{jm} X_j + s_{jm}(X_j)\right),$$
but this causes complications when we try to measure the average roughness of the smooth predictor $X_j$. We return to this issue later. A better choice, perhaps, is to model the bivariate effect function as
$$g_{ij}(X_i, X_j) = I(X_i = x_{i2})\left(\alpha_{j2} + \beta_{j2} X_j\right) + \cdots + I(X_i = x_{im})\left(\alpha_{jm} + \beta_{jm} X_j\right),$$
so that the same spline (which would appear in the main effect function $f_j(X_j)$) is used for each category, but the linear effect is allowed to vary. This is the approach we favoured, with the hope that under this model the savings in the number of parameters to be estimated would make up for any loss in flexibility.
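As a concrete illustration of the coding just described (illustrative only; the data and names are invented), the sketch below builds the design columns $I(X_i = x_{il})$ and $I(X_i = x_{il})X_j$ for a categorical-by-linear interaction.

```python
import numpy as np

def categorical_columns(x_cat, levels):
    """Indicator columns I(X_i = x_il) for l = 2, ..., m; the first level is
    absorbed into the intercept."""
    return np.column_stack([(x_cat == lev).astype(float) for lev in levels[1:]])

def cat_by_linear_columns(x_cat, levels, x_lin):
    """Columns for the interaction of a categorical predictor with a linear
    predictor: I(X_i = x_il) and I(X_i = x_il) * X_j for l = 2, ..., m."""
    ind = categorical_columns(x_cat, levels)
    return np.column_stack([ind, ind * x_lin[:, None]])

# Toy data with hypothetical variable values.
x_cat = np.array(["a", "b", "c", "b", "a", "c"])
x_lin = np.array([0.1, 0.4, 0.9, 0.2, 0.7, 0.5])
print(cat_by_linear_columns(x_cat, ["a", "b", "c"], x_lin))
```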
The roughness matrix to be constructed under this approach is no different from that in the case where the categorical and smooth predictor do not interact, since there are no additional splines in the model to contribute to the model's roughness.

In the final case, we may have $X_i$ interacting with another categorical predictor $X_j$. In this case we model the bivariate effect function as
$$g_{ij}(X_i, X_j) = \sum_{k=2}^{m} \sum_{l=2}^{n} \gamma_{kl}\, I(X_i = x_{ik}, X_j = x_{jl}),$$
$m$ being the number of categories of $X_i$, and $n$ being the number of categories of $X_j$. The effect when either of these predictors is in its first category shows up in the main effect functions.

In extending B-WISE to allow categorical predictors, we have made provision for bivariate interactions between categorical and smooth predictors, but we have constrained the smooth predictor's effect function to use the same spline for each level of the categorical variable while allowing the linear piece to differ. We concede that important features of data could be missed under this modelling technique. In the remainder of this chapter we outline the construction of the roughness matrix $R_\lambda$ under the more flexible conditions where a smooth effect function for a predictor is allowed to use a different spline term for each of an interacting categorical predictor's levels. We do this in such a way that $\theta^T R_\lambda \theta$ may still be interpreted as the sum of the average roughnesses of all the predictors, and hence (1.10) remains interpretable.

Now suppose a smooth predictor $X_1$ interacts with a number of other continuous predictors $X_2, \ldots, X_{d+1}$ and a number of categorical predictors $W_1, \ldots, W_t$ with $c_1, \ldots, c_t$ categories, respectively. For each of the $2^d$ high-low combinations of the continuous predictors there are $c_1 \cdots c_t$ category combinations of the categorical predictors. We define the average roughness of $X_1$ to be the average of the roughnesses of all $2^d c_1 \cdots c_t$ effect functions for $X_1$. Let $w_{i2}, w_{i3}, \ldots, w_{ic_i}$ be parameter vectors for the splines corresponding to the different levels of predictor $W_i$. To build $R_\lambda$ we reconsider the form of equation (2.1). Each of the $2^d c_1 \cdots c_t$ terms would contain an $a^T R a$, and dividing by the number of terms gives the upper left block of $R_\lambda$ a coefficient of 1. The diagonal blocks corresponding to continuous interacting variables will have a coefficient of $\tfrac{1}{2}$, since there are $2^{d-1} c_1 \cdots c_t$ terms of the form $(a + \cdots + x + \cdots)^T R (a + \cdots + x + \cdots)$, where $x$ is the spline corresponding to the high value of an interacting continuous predictor. In fact one can see fairly easily that all blocks corresponding to continuous predictors will have the same multipliers as before. Now, blocks on the diagonal corresponding to categorical predictor $W_i$ will have a coefficient of $\tfrac{1}{c_i}$, since a term $(a + \cdots + w_{ij} + \cdots)^T R (a + \cdots + w_{ij} + \cdots)$ occurs $2^d c_1 \cdots c_{i-1} c_{i+1} \cdots c_t$ times. The same reasoning applies to the top row and left column blocks. Blocks off the diagonal corresponding to categorical predictors $W_i$ and $W_j$, $i \neq j$, will have a coefficient of $\tfrac{1}{c_i c_j}$, and blocks off the diagonal corresponding to the same categorical predictor, e.g. to the splines $w_{i2}$ and $w_{i3}$, will be zero. Lastly, blocks corresponding to categorical predictor $W_i$ and a smooth continuous predictor will have a coefficient of $\tfrac{1}{2 c_i}$, since a term containing both splines occurs $2^{d-1} c_1 \cdots c_{i-1} c_{i+1} \cdots c_t$ times. Obviously the computer code will have to be meticulous. This further increase in flexibility has good potential, but of course the increase in the number of parameters requires a larger amount of data for estimation.
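As an illustration of the bookkeeping involved, the following sketch (illustrative only; the labelling scheme is invented) enumerates the $2^d c_1 \cdots c_t$ combinations directly and recovers the block multipliers stated above.

```python
import itertools
import numpy as np

def block_multipliers(d, cats):
    """Multipliers of R for the blocks of R_lambda associated with one smooth
    predictor interacting with d continuous partners and with categorical
    partners having cats = (c_1, ..., c_t) categories.  Splines are labelled
    'main', 'cont1', ..., and ('cat', i, j) for level j = 2, ..., c_i of W_i."""
    splines = (["main"]
               + [f"cont{u}" for u in range(1, d + 1)]
               + [("cat", i, j) for i, c in enumerate(cats) for j in range(2, c + 1)])
    combos = list(itertools.product(*([[0, 1]] * d + [range(1, c + 1) for c in cats])))

    def active(s, combo):
        if s == "main":
            return True                          # main spline always present
        if isinstance(s, str):                   # continuous partner spline
            return combo[int(s[4:]) - 1] == 1    # present at the 'high' level
        _, i, j = s                              # categorical partner spline
        return combo[d + i] == j                 # present when W_i is at level j

    mult = {}
    for u, su in enumerate(splines):
        for v, sv in enumerate(splines[u:], start=u):
            mult[(su, sv)] = np.mean([active(su, c) and active(sv, c) for c in combos])
    return mult

# One continuous partner and one three-category partner:
for pair, m in block_multipliers(1, (3,)).items():
    print(pair, m)
```

For one continuous partner and one three-category partner this prints, for example, a multiplier of 1/2 for the continuous spline's diagonal block, 1/3 for each categorical spline's diagonal block, 1/6 for a categorical-by-continuous pair, and 0 for two splines of the same categorical predictor, in agreement with the rules above.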
Chapter 3

Examples

3.1 Simulated Data

Denison, Mallick and Smith (1998) tested BMARS on an example from Friedman (1991). Gustafson (2000) compared B-WISE with BMARS using the same example, and we will use it again to see if the extensions to B-WISE discussed above are helpful. The data consist of predictor variables $X = (X_1, \ldots, X_p)$ simulated from a $p$-dimensional unit hypercube, and a response variable $Y$ generated independently as $Y \mid X \sim N(f(X), \sigma^2)$, where
$$f(\mathbf{x}) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5,$$
and $\sigma^2 = 1$. In one version of the problem $(n, p) = (200, 6)$ (one predictor having no relation to $Y$), and in another version $(n, p) = (100, 10)$ (five predictors having no relation to $Y$). Rather than simulate the data as prescribed above, we use the data made available by Denison, Mallick and Smith, which is also the data that Gustafson analyzed. Gustafson found that in both versions of the problem B-WISE placed the

Table 3.1: Simulated data set with $(n, p) = (200, 6)$. The top five models found in one run of the model search algorithm.

Model                                                         Posterior probability
S(X_1) ⊗ ⋯ + L(X_10)                                          0.105
S(X_1) ⊗ S(X_2) + S(X_3) + L(X_4) + L(X_5) + L(X_10)          0.055
S(X_1) ⊗ S(X_2) + S(X_3) + L(X_4) ⊗ S(X_5) + ⋯