E x p a n d i n g the C a p a b i l i t i e s of B a y e s i a n Regression W i t h Interactions a n d S m o o t h Effects by Nathan Johnson B.Sc, Okanagan University College 1998 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF T H E REQUIREMENTS FOR THE D E G R E E OF M a s t e r of Science in T H E FACULTY OF GRADUATE STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard T h e U n i v e r s i t y of B r i t i s h C o l u m b i a September 2000 Â© Nathan Johnson, 2000 In presenting degree freely at this the University available copying of department publication for this or of thesis this partial fulfilment of British Columbia, reference thesis by in for his and scholarly or thesis study. for her financial of The University of British Vancouver, Canada Date DE-6 (2/88) S<jlf~ & ? I I further purposes Columbia ~L*P<S><$ gain the shall requirements agree that agree may representatives. permission. Department of be It not that the Library by understood be an advanced shall permission for granted is for allowed the that without make it extensive head of copying my my or written Abstract The B - W I S E (Bayesian regression W i t h Interactions and Smooth Effects) method of regression modelling was designed to improve upon other flexible regression schemes in the areas of model interpretability and ease of implementation. B - W I S E models have been shown to have good predictive performance, and performance can be even further improved with a natural Bayesian model averaging scheme. In this thesis I will outline some ways in which some of the constraints inherent in a B - W I S E model can be relaxed, so that the technique can be used in more general situations and with even greater flexibility. It is shown that much of the interpretability of B - W I S E models is retained and implementation is still relatively straightforward. Contents Abstract ii Contents iii Acknowledgements v Dedication vi 1 Introduction 1 1.1 Cubic Smoothing Splines and Cubic Regression Splines 3 1.2 Some Methods for Multiple Regression 6 1.2.1 T h i n Plate Splines and Tensor Product Splines 6 1.2.2 Regression Trees 9 1.2.3 Multivariate Adaptive Regression Splines (MARS) 10 1.2.4 Projection Pursuit Regression 11 1.2.5 Neural Networks 11 1.2.6 Gaussian Processes 12 1.3 B-WISE 14 1.4 Identifying Models W i t h High Posterior Probability 18 iii 2 3 4 Extensions to B - W I S E 22 2.1 Overlapping Interactions 22 2.2 Categorical Variables 25 Examples 29 3.1 Simulated Data 29 3.2 Abalone Data 35 3.3 Census Housing Data 40 Conclusion and Further Work 49 Bibliography 52 Appendix A Details of the Model Search Algorithm 54 iv Acknowledgements I would like to thank Paul Gustafson for his initiative and integrity in research, as well as the freedom with guidance he afforded me in the directions I took. Thanks to Bertrand Clarke for being a second reader, and to the rest of the faculty for their high commitment to teaching and mentoring. I have thoroughly enjoyed my time in the Department of Statistics and won't begin to mention the colleagues who contributed to a memorable two years. NATHAN The University of British Columbia September 2000 v JOHNSON To m y parents â€” vi obviously! Chapter 1 Introduction Often an adequate summary of the relationship between a response variable Y and predictor variables X = (X\,..., Y X) v is provided by a model = f(X ...,X ) u + e, p var(e) = cr , 2 where e is usually normally distributed with mean 0. The ability to predict Y for an X yet to be observed is perhaps the most important aspect of a good model, but often interpretability is a competing aspect. Investigators often want a model which is indeed a summary, making important features of the relationship apparent without conveying the more minute details. Hastie and Tibshirani (1990) give a thorough treatment of additive models, which have the form p Y = +E +' ()= ' a e var e ct2 where / i , . . . , f are unspecified smooth functions. The flexibility of the functions p / i , . . . , / , coupled with the clarity and interpretability provided by the additive p structure in (1.1), have made the additive model a standard tool for statistical analysis. Having obtained an additive model which adequately summarizes the 1 relationship between X and Y, an investigator can find in /j all of the model's information about the effect of Xi on the response Y. A scatterplot of Y against Xi, together with what Hastie and Tibshirani call the scatterplot smooth, / j , is one appealing way to represent this information and assess its quality. A large number of modelling schemes which go beyond additivity have been proposed for situations where the emphasis is on the predictive ability of f(X\,...,X ). p These include surface smoothers, regression trees, MARS, projec- tion pursuit regression, neural networks, gaussian processes, to name a few. Any of these methods can perform well in terms of prediction, but seldom do they lead to models that are interpretable. In particular, given a good fit f(X\,... ,X ) it may P be very difficult to elicit information about the effect of a particular predictor, yet at the same time the additive model may be too simplistic to be useful. Significant research has been done to take the simple additive model one step further towards predictive ability, by describing possible bivariate interactions in predictor variables. We can write this extended additive model as Y = Y f (X ) i / l l + ^2g (X ,X ) j<k jk j k + e, var(e) = a , 2 (1.2) where both the univariate and bivariate effect functions are unspecified. Note that model (1.2) retains much of the simplicity and transparency of the simple additive model. In particular, the estimated effect of a predictor X{ on the response can still be graphed, either as a curve in two dimensions as before, or a set of surfaces in three dimensions, depending on the number of bivariate interactions in which Xi is involved. In this thesis we develop some extensions to the B-WISE scheme originally proposed by Gustafson (2000). B-WISE is a loose acronym for Bayesian Regression With Interactions and Smooth Effects. In a Bayesian framework this scheme models data using the expanded additive model (1.2) and uses a novel structure for the 2 bivariate effect functions. Before giving a more thorough outline of the B - W I S E method we describe cubic splines, which are fundamental to B-WISE, and we discuss briefly some of the more popular methods for multiple regression. 1.1 Cubic Smoothing Splines and Cubic Regression Splines A scatterplot smooth is curve that to some extent summarizes the informa- tion contained in a scatterplot. A curve that is too rough and overfits the data will not be useful as a summary, and it will perform poorly as a predictor for new observations. O n the other hand, a curve that is too smooth may not contain enough information about local features or curvatures. Among all functions with two continuous derivatives, it seems that a function / which could minimize the penalized residual sum of squares, E(^-/(^)) + /{/"W} ^ 2 A i=i where (x\,yi),..., 2 (!-) 3 J (x , y ) are the data points, would satisfy both features of a good n n summary. The first term in (1.3) ensures that / contains enough information, that it follows the data closely; the second term ensures that it is not overly rough. Quantifying the roughness of / as f ( / " ) or linear then it has zero roughness. 2 has intuitive appeal. If / is constant Green and Silverman (1994) tell us that a mechanical spline (e.g. a thin strip of wood) constrained to pass through a set of points will assume the shape of the function / which minimizes bending energy, a quantity which to first order is proportional to f / " , provided the points are 2 reasonably close to a straight line. Thus we have a loose physical interpretation for this quantification of roughness. 3 For the following discussion we assume the values of the predictor variable are ordered so that x\ < X2 â€¢ â€¢ â€¢ < x and there are no ties. Green and Silverman n also tell us that among all functions with two continuous derivatives, the function which minimizes (1.3) is the cubic smoothing spline. A cubic smoothing spline is made up of n â€” 1 cubic polynomials, one for each region [xi, Xi+i], i = 1,. â€¢ . , n â€” 1. The cubics are joined together in a smooth way by requiring that at each of the interior knots X2, â€¢ â€¢ â€¢, the two cubics that join there must be equal in value and first and second derivatives. The second derivatives at the exterior knots x\ and x n are set to zero, making the spline a natural cubic spline (NCS). Outside of the exterior knots the N C S is linear. W i t h these constraints in place the minimizer g of (1.3) is found by solving an n x n linear system whose solution is the vector of spline values g = (g(x\),... ,g(x )) . T n A N C S is fully specified given its vector of knot-point values. To see this, note that there are originally 4(n â€” 1) parameters to be specified. The constraints for smoothness number three for each of the n â€” 2 interior knots, and the additional two requirements at the exterior knots bring the number of constraints to 3n â€” 4. Thus there are n free parameters to be determined by the n data points. While the cubic smoothing spline does minimize (1.3) and there is an elegant algorithm available to solve the nxn linear system, yet there are difficulties in interpreting a curve that requires such a large number of parameters in its description. For our purposes a better smooth is the cubic regression spline, which utilizes a smaller number of knots, and thus requires that fewer parameters be estimated than the cubic smoothing spline. For simplicity we let the spline have m equallyspaced knots t\ < ... < t , m and in our examples m is typically close to 7. From the many ways to specify a N C S we choose to parameterize in terms of the knot 4 values g .= {g{ti), â€¢ â€¢ â€¢ ,g{t )), which seems to be most amenable to interpretation m and specification of a prior distribution. Under this parameterization the values of the spline at arbitrary x = (xi,..., x ) can be expressed as n (1.4) = Zg, where Z depends only on x and the knot locations. For the interested reader we close this section with details on the construction of Z, which is an essential procedure in the implementation of B-WISE. We need to define two matrices Q and R. Let hi = ij+i â€” ti for i â€” 1,..., nâ€”1. The matrix Q is rn x (m â€” 2) with entries qij, i = l,...,m and j = 2,..., m â€” 1, given by for j = 2,..., m â€” 1, and all other entries are zero. The matrix R is (m â€” 2) x (m - 2) with entries r^, i and j running from 2 to m â€” 1, given by = ^ ( ^ i - i + ^i) for i = 2,... ,m - 1 =r i i+ ti = ^hi for 2 = 2,..., m - 2, and all other entries are zero. Finally we let 7 be the vector with the m â€” 2 second derivative values g"{ti), i = 2,... ,m â€” 1. One of the main results in Green and Silverman is stated in the following theorem. The vectors g and 7 specify a natural cubic spline g if and only if If this condition is satisfied then the roughness penalty will satisfy J g"(t) dt 2 = R 1 T 1 = g QR- Q . T 1 (1.5) T (Note that R is invertible since it is strictly diagonal dominant and hence strictly positive-definite.) The theorem gives a useful relation between g and 7, as well as a computationally convenient expression for the roughness penalty. The reader can verify that the value of the cubic spline at t is given by 9(t) (t - ti)g i i+ + (t - t)gi i+x hi _I _ (t 1>)(1 , _ {( +1 t) ^) 1 + 7 > + 1 + ( 1+ ^) ,} 7 for ti < t < ti \, i = 1,..., n â€” 1. + In vector notation, using the above theorem, this can be written as Â«(Â«) = where v is a vector with Vi = *, Vi \ = ^j^, and all other components zero, + and w is a vector with Wi = 1 + ^. *, tl+ =1+ and all other components zero. Thus we have a construction for Z. 1.2 1.2.1 Some M e t h o d s for M u l t i p l e R e g r e s s i o n T h i n P l a t e Splines a n d Tensor P r o d u c t Splines Thin plate splines and tensor product splines fall into a category of smoothers that Hastie and Tibshirani call "generic multivariate smoothers," which are basically generalizations of one-dimensional methods to higher dimensions. These methods are often useful for two or three dimensions, but in higher dimensions they lead to models that are difficult to interpret. A more fundamental problem with these methods has been called the "curse of dimensionality." Most smoothing methods rely on the concept of localness. The smoothed value of an observed response is determined mostly by points in a local neighbourhood of the response. But in high dimensions, local neighbourhoods are inherently sparse. For example, in a uniformly distributed data set of p predictors, a hypercube with sides that cover 10% of the range of each predictor will contain on average 0 . P of the points in the data set. This sparsity has prompted the development of other specialized methods that can deal more effectively with high dimensional data, but the generic multivariate smoothers can be useful in two or three dimensions. T h i n plate splines are motivated by the generalization of the roughness measure r(f) = j {f"(t)Y dt to higher dimensions. The roughness measure for a bivariate function f{x,y) analogous to the above measure for a univariate function is Under this measure any plane in two dimensions has zero roughness. Green and Silverman note a physical interpretation similar to that for the univariate roughness measure. If an infinite elastic flat plate is deformed to the shape of the function ef for small e then the bending energy of the plate is to first order proportional to r(f). The minimizer of the penalized residual sum of squares, using this roughness measure, is a thin plate spline. We can generalize the roughness measure (1.6) to still higher dimensions, where again a thin plate spline minimizes the penalized residual sum of squares, but for clarity we limit our discussion to the two-dimensional case. To define a thin plate spline we first define a function 7 77(7-) as Denote the points in the predictor space by tj = (xi,yi), i = 1,... , n, and let ||t| denote the usual Euclidean norm of t. Then a function g(t) is a thin plate spline on the data set t i , . . . , tâ€ž if and only if g is of the form n for some constants 6\,..., 6 , a\, n 02,03. We see that regression models based on thin plate splines are hardly interpretable, but in two and three dimensions they can at least be analysed graphically. Another, perhaps more natural, way to generalize univariate splines is by using tensor product splines. Following the discussion in Green and Silverman we consider the most useful case of two dimensions. The tensor product of two functions /(Â£) and g(u) is the function (/ <g> g)(t,u) = f(t)g(u). Now suppose we have a set of linearly independent functions {Sj : i = 1,..., q\} and another basis 1 {ej : J2 = 1,.. â€¢ , c/2J- The tensor product of these two bases is the set of all linear 2 combinations of tensor products of basis functions Let {ri,..., r } and {vi,..., v } be sequences such that the range of t is T = m i b~i; Tmi ], a n m2 d the range of u is U = [vi, v m2 ]. A tensor product cubic spline is constructed by dividing up T x U into panels of the form [r ,r +i] x [u ,u +i]. r r s s Within each panel the function is the product of a cubic in t and a cubic in u, and at the joins between the panels the function is smooth. We omit the details, but suffice it to say that a large number of parameters must be estimated and, related to that issue, interpreting the function is difficult. 8 1.2.2 Regression Trees Regression trees (Breiman, Friedman, Olshen and Stone 1984) carve up the predictor space into disjoint and exhaustive rectangles, modelling the response Y within each rectangle as a constant. The method begins by finding the optimal splitting point among all splitting points among all predictors. The optimal split is the one which divides the space into two regions such that the within region variance is minimal. Then for each of the two subregions the process is repeated until convergence, where further splitting does not significantly improve the within region variance. A regression tree can be presented graphically as a binary tree, or mathematically as in /(X) = J > / ( X e i^), i where the regions Ri are defined by the binary splits. There is no standard convergence criterion in use, but one popular strategy is to build a large tree and then prune it to a smaller size using cross-validation. The final tree is the one which has the smallest estimated prediction error. The main drawback to regression trees is that they are discontinuous functions of the predictor variables. This leads to less than optimal predictive ability in situations where the true underlying relationship between X and Y is continuous. Also, while small trees are often very interpretable for practitioners, larger trees can be puzzling. However the adaptive placement of the 'knots', or discontinuities, provides efficient use of the data since more parameters are estimated only where the data requires them. 9 1.2.3 M u l t i v a r i a t e A d a p t i v e Regression Splines ( M A R S ) The recursive partitioning approach outlined above motivated the M A R S methodology proposed by Friedman (1991). Like regression trees M A R S uses adaptive knot placement, but produces a regression function that is continuous. Start- ing with the initial basis function i?i(x) = 1, a search is made for the optimal splitting point among all splitting points among all predictors. That splitting point introduces a pair.of new basis functions, #2(x) = Bi(x)[+(xi â€” U)]+ and Bs(x) = B\{x)[â€” (xi â€” <t)]+, where X{ is the predictor being split, ti is the optimal splitting point, and [z]+ = max(0, z). Each iteration entails a search for the optimal splitting point for all of the basis functions present, and results in two new basis functions being introduced. The final regression function i is a continuous function of the predictor variables (since all the basis functions are continuous). The basis functions have the form Â£i(x) 1, Â»= i IljLl^jiC^i/O'i) â€” *ji)]-H * = 2, 3, . . . where Sji is the sign of each factor, v(ji) is the index of the predictor in the jth factor of the ith basis function, tji is the splitting point, and J{ is the degree of interaction. We do not allow a predictor to occur more than once in a single basis function. As with regression trees, a popular strategy is to run the algorithm until a large number of basis functions have been formed, and then to delete the unimportant functions until some lack-of-fit measure is minimized. Denison, Mallick and Smith (1998) put M A R S in a Bayesian context by placing a prior distribution on the number k of basis functions present, and on the vector 10 $( ) â€” k ..., ^fc), where = (ai, Ti,tu,su,..., tji, s ji) contains information that completely determines the ith basis function. The variable Tj is the "type" of the basis function, defined to be the indicator of which predictors occur in the basis function. They then simulate samples from the joint posterior distribution p(6^,k\y)cxp(k,9^,y) = p(k)p(9^\k)p(y\k,e^) using an M C M C method. The main strength in B M A R S over M A R S is this M C M C sampling scheme by which Bayesian model averaging can be employed. Emperical results have shown that Bayesian model averages obtained from the B M A R S algorithm have better predictive ability than good M A R S models. 1.2.4 Projection Pursuit Regression Projection pursuit regression (Friedman and Stuetzle 1981) models are of the form K Y = ^ M a f X ) + e, k-l where aJX. is a one-dimensional projection of X and hk is an arbitrary smooth function. Because all of the smoothing is univariate there are no dimensionality problems. Interactions between predictor variables can be accounted for but are not easily understood from the functional form. In fact the functional form is not very enlightening at all for K > 1, but the judicious selection of projections does keep the number of parameters to a minimum. 1.2.5 Neural Networks The simplest form of a neural network in the regression context can be written as Y = (/) 0 + hk$h (^a + v h 11 ihXij w j + e, (1.7) where <j>h is almost always the logistic function, and c/>n is a linear, logistic or threshold function. The pictorial representation in Figure (1.1) is helpful. At each of the hidden nodes the inputs X\,..., X p are multiplied by weights uiih and summed. The functions (ph are applied and the results are treated in the same way as inputs for the output node. The scheme originated in neuroscience as a model for electronic impulses in the brain. As such it is not a very realistic model, but as a nonlinear regression scheme it can be powerful. Besides model interpretability, which is basically nil, there are some complications in controlling the amount of fitting. Various suggestions have been made in the literature of neural networks for choosing the number of hidden nodes (which controls the amount of fitting). As with regression trees cross-validation is one possibility, but no procedure has been widely agreed upon. Variants of the simple model (1.7) may have more than one hidden layer and may incorporate "skip-layer" connections which link the input layer directly to the output. Parameter estimation is done in an iterative manner using GaussNewton steps and can be time consuming. Neal (1996) put parameter estimation in a Bayesian framework, and also showed that a neural network converges to a Gaussian process as the number of hidden nodes approaches infinity. 1.2.6 G a u s s i a n Processes A Gaussian process can be used to define a prior distribution over functions of one or more input variables. Suppose we have a response variable t which depends on a p parameter input x. Denoting the repeats of t as inputs as ..., and the corresponding ..., x^ \ a prior for the relationship between x and t is specified n in terms of a covariance function Cov [ t ^ , ^ ] which depends on the inputs. For 12 example, Cov u=l is the covariance function for the simple regression model v u=l The covariance function Cov u=l where 77 and p are hyperparameters, can be used to obtain a regression function u based on arbitrary smooth functions. Letting C be the n x n matrix of values of the covariance function, it can be shown that the posterior predictive distribution of an unobserved t ( n + 1 ) is Gaussian with mean = E k c-h T and variance Var t( )|t( ),...,tH n+1 1 =^-k c- k, where k is the vector of covariances between t^ v is the prior variance of * ( n+1 T ^ n+1 ) (given by Cov [ t ( n+1 1 and the n known targets, and ),t( n+1 ) ] ) . More details and illustrations can be found in Neal (1997). The main drawback to using Gaussian processes is that they become computationally expensive as n increases. Memory requirements grow as n and time 2 requirements grow as n , but Neal suggests Gaussian processes can handle data sets 3 with up to a thousand cases with modest computational resources. 13 1.3 B-WISE B - W I S E models a continuous response Y as a function of continuous predictors X\,..., X. p For ease of interpretation we assume that the predictors have been rescaled so that 0 corresponds to a low value of the predictor and 1 corresponds to a high value. We assume the response Y is rescaled to have mean zero. Most variable selection techniques are aimed at deciding which of the p predictors should be included in the model and which should be excluded. In B - W I S E , a predictor can be included as smooth effect, included as a linear effect, or excluded from the model. Thus a B - W I S E model can be written as Y where e ~ N(0,a ), 2 = a+Y fi{X )+ J 93k{Xj,X ) l k + e, (1.8) M indicates which predictors are included as main effects only and I indicates which predictors are involved in a bivariate interaction. A main effect function /j can be either linear, or smooth, where Si(x{) is a cubic regression spline constrained to be zero at the exterior knots 0 and 1. The bivariate effect functions include a main effect for each variable, so a predictor involved in an interaction will appear only in the second summation of (1.8). Moreover a predictor may only appear once in (1.8), so that 'overlapping' interactions, as when gÂ±2 and gis are both included, are not allowed. Several proposals 14 have been made for the structure of nonparametric bivariate functions. One possibility is to let gjk{xj,Xk) where the two univariate functions are gj (xj)hj (x ), = k k k nonparametric (Cuzick 1993). The main drawback associated with this technique is that a large number of smooth functions must be estimated. Hastie and Tibshirani (1993) proposed varying-coefficient models, in which the univariate function for one of the predictors is nonparametric and acts as a coefficient for the other predictor. Thus fewer parameters need to be estimated and much flexibility is retained. In the B-WISE scheme, if two variables Xj and x are both included as linear k effects then their bivariate effect function gj would be modelled as the usual k g {xj,x ) jk = k fyxj + j5 x k k + j3j XjX . k k Under this structure the effect of one predictor can be viewed as a linear effect, with slope and intercept that depend on the value of the other predictor. For intuition = into the effect of Xj we might plot gj (xj,0) k /3jXj and gj (xj, 1) = k {fij+flj )xj fi + k k on the same axes, with the understanding that the effect at an arbitrary x is a linear k interpolation, gjk(xj,x ) k = (1 - x )g {xj,0) k +x g jk k j k (xj,l). This is a useful feature, since it can be difficult for a picture of a surface in three dimensions to communicate information about a bivariate effect. B-WISE models retain this feature when Xj is include as a linear effect and x is included as a smooth k effect. In this case the bivariate effect function would be modelled.as = (3jXj gj (xj,x ) k where tj (x ) k k k + f3 x + s (x ) + P XjX k k k k jk k + Xjt (x ), jk k is another cubic regression spline with exterior knots constrained to be zero. We could plot gj (0,x ) k k = fi x + s (x ) and k k k 15 k g j k {l, x) = k + {/3 + fy )x + k k k $k( k) x + tjki^k) on the same axis, with the understanding that the effect of x at k arbitrary values of Xj is a linear interpolation between the two smooth curves, 9jk{xj,x ) = (1 - Xj)g {0,x ) k +Xjg (l,x ). k jk k jk Alternatively we might plot the two linear functions gj {xj,0) and gj (xj, 1), but k k the interpolation for general values of x is no longer linear. In the case where k both predictors are included as smooth effects then their bivariate effect function is modelled as 9jk(xj,x ) k = BjXj + Sj(xj) + Bx k k + s {x ) k k + 3 XjX jk k + Xjt {x ) jk + k x t (xj). k kj In this situation we can plot the bivariate effect as a function of either variable while holding the other variable fixed. However, it would be prudent to plot more than just two curves since there is no linear interpolation interpretation. In all three cases for the bivariate effect function the surface is determined by the univariate functions at the edges of the unit square, thus facilitating interpretability and ease of computation. The aim of the B - W I S E scheme is to rank all, or many, possible models according to their simplicity and predictive ability. This ranking is made possible by means of a Bayesian paradigm in which a posterior probability is calculated for each model. This probability then serves as a measure of a model's quality, in terms of the criteria emphasized under this scheme. We assume that Y\0,a ,X 2 ~ N(A\0,a I), 2 where 0 is a parameter vector with an intercept, slope terms, and the values of the splines at the knot points. A\ is the design matrix for model A. We use a prior of the form p(9, cr , A) = p(9\a , X)p(a )p(X), 2 2 2 where p(a ) oc a~ , and p(X) is uniform 2 on the model space A. For p(0\a , A) we use the g-prior, 2 16 2 but in a slightly modified form that penalizes models according to their degree of roughness, r(f) = /'{/"'(t)} dt. The modified g-prior makes use of a roughness 2 matrix R such that v Rv = r{f), T where v is the vector of spline values at the knot points (see equation 1.5). To penalize roughness we give v the prior distribution v ~ N{0, (/ci?) }, where k -1 determines the strength of the roughness penalty. Now to quantify the roughness of an entire model we use the matrix R to construct an overall roughness matrix Rx that satisfies ^ = Sr(.,) i; " r( )+r( + " ", + ) k 3 (1.9) where j ranges over smooth predictors involved in a main effect only, and k ranges over smooth predictors involved in an interaction. This is, in some sense, the sum of the average roughnesses of all the predictors. Here a predictor involved in an interaction has an average roughness that depends on its effect when the interacting variable is at a 'high' value and on its effect when the interacting variable is at a 'low' value. One can verify that R\ is everywhere zero except for R on the diagonal blocks corresponding to the parameters for S j , R/2 on the diagonal blocks for ti, and R/2 on the off-diagonal blocks for the Si-U pairs. Then the final form of the prior is 9\a , A ~ N ^0, a 2 + kR 2 x where the hyperparameters c and k must be set in order for the prior to be specified completely. With the prior thus specified, the conditional (6\a ,X,y) 2 17 is normally dis- tributed with mean + 1+ -) A A nJ T X and variance a [(1 + c/n)AlA X + kR),}- . A y, T x x The distribution of cr |A, y is inverse 2 1 x kR gamma with shape parameter ra/2 and scale parameter {y (I T H x = A [(l + ^)A A T x x + x â€” H )y}/2, where x 1-1 kR x is the hat matrix for model A. The distribution of A|y is PWV) oc K ^ ^ A + ^AI / 1 |(l + ^)^A 2 + A;i? | T ( J _ â€ž A Note that (1.10) has some intuitive appeal. We can write y (I T y {I T where y x - H )y x /2 1/2 = {\\y - y | | + k6%R 0 } x x as + ^\\y \\\ 2 A â€” H )y x x = H y are the fitted values. The term in braces is just the penalized x residual sum of squares. The other factor in (1.10) can be interpreted as a penalty for model complexity. In the simple case where k = 0 we have p{\\y) oc exp | - | l o g - y\\\ + 2 ^II?AII ) 2 - -y^s ol (i + ^) } . where d(X) is the dimension of 6 under model A. Here we can easily see how the posterior probability involves a tradeoff between goodness of fit and the number of parameters in the model. 1.4 Identifying Models W i t h H i g h Posterior P r o b a b i l ity Given predictor variables Xi,...,X p the collection of models of form (1.8) make up a model space A . When p is small it is possible to evaluate every model in A . For 18 larger p it may not be feasible to evaluate every model, yet there are techniques available for locating models with high posterior probabilities. George and M c C u l loch (1993) describe one such Markov chain Monte Carlo ( M C M C ) technique, and George and McCulloch (1997) give advice for fast model space exporation. For B W I S E Gustafson proposes an M C M C inspired algorithm which makes use of the nonnormalized posterior probability (1.10). When this probability is available up to a normalizing constant the search algorithm need not contain the accept-reject step usually found in M C M C algorithms. To describe the B - W I S E search algorithm we define a neighbourhood of a particular model to contain all models obtained by â€¢ including a predictor not already in the model â€¢ upgrading the linear effect of a predictor to a smooth effect â€¢ including an interaction between two variables present in the model and not already involved in interactions â€¢ downgrading the smooth effect of a predictor to a linear effect â€¢ removing a predictor included in the model as a linear effect â€¢ removing an interaction term â€¢ partner switching two interaction terms (eg. changing A x B and C x D interactions to A x C and B x D interactions. The algorithm begins by evaluating the nonnormalized posterior probability of all models in the neighbourhood of an initial model Ai chosen by the investigator. The nonnormalized probabilities are normalized relative to this neighbourhood and the next model A2 is chosen randomly from the neighbourhood according to these 19 probabilities. The same procedure is performed for A2, and is repeated until some convergence criterion is satisfied. Gustafson suggests that the algorithm has likely converged if it fails to find a model in A = {A : p(A|y) > a^maxxpiXly)} a (1.11) for t iterations in a row. The maximum in (1.11) is taken over all models evaluated by the algorithm so far, and Gustafson used a = 20 and t = 5. The model search algorithm is more fully detailed in Appendix A. Having performed a complete run of the algorithm we may find that a single best model will suit our purposes, or we may form a Bayesian model average using the Occam's window approach described in Raftery, Madigan and Hoeting (1997). Bayesian model averages are often found to have better predictive ability than a single best model, but in the B-WISE scheme dramatic improvements have not been seen. 20 Input layer Hidden layer(s) Figure 1.1: A pictorial representation of model (1.7). A t node h of the hidden layer the inputs are multiplied by weights and summed. The result is inputted to a function whose value is then treated as an input to the next layer of the network. 21 Chapter 2 Extensions to B-WISE There are at least two limitations of the B - W I S E scheme as it stands. The first is that overlapping interactions are not permitted, so a given predictor variable can interact with at most one other. Yet in some studies a predictor variable may well interact with several other predictor variables. One example might be a study in human health with gender as a predictor variable. Here an investigator may well expect the gender variable to interact with every other predictor variable. The second limitation of B - W I S E is also apparent in this example. As of yet no provision has been made for categorical predictor variables. In this section we discuss how B W I S E can be extended to allow overlapping interactions and categorical predictors. 2.1 O v e r l a p p i n g Interactions The model (1.8) has an appealing structure. To investigate the model's information about a particular predictor one need examine only one term of the model, which will be either a univariate function fi or a bivariate function gij. To incorporate overlapping interactions we must assume a different structure underlying (1.8). The 22 problem is that if both gij and gik appear in the model, there is no obvious place to put the main effect for predictor Xi, except perhaps in a separate main effect term This is the approach we take, so that the first summation in (1.8) contains the main effects for all the predictors in the model, and the second summation contains the bivariate interactions for any pairs of variables which interact. Thus, to investigate the model's information about a particular predictor, one must collect all of the terms involving that predictor. W i t h a predictor being able to interact with several others, there comes the question of how to quantify the roughness of a predictor's effect function. The old way is given by equation (1.9), where the roughness is given by r(s,) if Xi is not involved in any interactions, and by (r(sj) +r(si +ti))/2 ii Xi is involved in an interaction. We can build on this idea of averaging over the high and low levels of the interacting variable. If predictor Xi interacts with d other predictors Xj ,..., 1 Xj d then there are 2 different high-low combinations over which to average. If we put d all of the interaction splines into a vector T^ = ( i Z J 1 , . . . , Uj ) then we can write the d roughness of the effect for Xi as 2- Y,r(si + d I Ti) T I where I ranges over all vectors of length d with elements from {0,1}. This simplifies to the expression given for the non-overlapping case where d = 1. To specify the prior for the parameter vector 9 we need a matrix R\ that satisfies E^(^+ *)} > /rT 9 T R ^ D = Â£ J2- D I M ( T F C ) where k ranges over all predictors with smooth effects. Fortunately such an R\ is fairly easy to construct. If a predictor Xi interacts with d others then we put a 23 (d + 1) x (d + 1) block matrix into R\, each block being the roughness matrix R multiplied by a special factor. This special factor will be 1 in the upper left block, 1/2 in the rest of the top row, left column and diagonal, and 1/4 in the remaining blocks. Example: To show that the above construction works we give an example where a smooth predictor X\ interacts with both X2 and X3. Let a be the parameter vector for the spline Si(xi), b be the parameter vector for the spline Â£12(2:1) and c be the parameter vector for the spline Â£13(3:1). Then a [a h c ] R T T T b X = [a b c ] T T r R \R \R- a \R \R \R b .\ \ h _ R c = R R c 1 [a Ra + (a + b ) E ( a + b) + (a + c) R{& + c) T T T + (a + b + c) i?(a + b + c)] (2.1) T = \ [r(si) + r(ai + t ) + r{ l2 + t ) + r{si Sl l3 + Â£12 + Â£13)], which is the average roughness of the effect function for X\. To show that the construction for R\ works in general, suppose Xi interacts with d = dim(Tj) other predictors and consider the form of equation (2.1). A l l 2 d terms have an a i?a. When we divide by 2 in forming the average the result is that T d a i?a has a coefficient of 1. Hence the entry in the upper left block of R\. There r are three other kinds of terms found in equation (2.1): terms of the form x i ? x T where x is a spline other than a, terms of the form a i l x (or x i?a), and terms T T of the form ~x. Ry where both x and y are splines other than a. Terms of the first T kind come from a larger term (a +â€¢â€¢â€¢-(- x + â€¢â€¢ â€¢ ) R(a. + -- - + X + -- -), and there T are 2 ~ d 1 possibilities for the exact form of this term (there are d â€” 1 splines other 24 than x which are either present or absent from this larger term). Dividing by 2 in d taking the average gives x i ? x a coefficient of 2 T d_1 / 2 = 1/2. Hence the diagonal d entries of R\. The same reasoning applies to a i ? x (or x i?a), which explains the T T top row and left column of R\. Terms of the third kind come from a larger term (a+---+x + -- -+ yH ) R(SL T 2~ + -- -+ x + -- - + y + -- -), and there are d 2 possibilities for the exact form of this term. Dividing by 2 means the term x i ? y T d has a coefficient of 1/4. Hence all the remaining blocks of R\. 2.2 Categorical Variables Incorporating categorical variables into the B-WISE setup is fairly straightforward given the work done so far. The key assumption is that each predictor has an additive effect on the response variable, and that effect may depend on the values of several other predictors. We will keep equation (1) as our model equation and remain in the wider setting where overlapping interactions are allowed. If a categorical predictor X, has a main effect only, we write its effect function as fi(Xi) = B I(Xi = x ) H i2 i2 V B I(Xi im = x ), im where xn,... ,Xi are the m mutually exclusive values which Xi can take. The m effect of Xi being at level xn is now hidden in the intercept a of equation (1.8). Note that when Xi is a binary (0/1) categorical variable, we can take fi(X{) = faXi, as we do for continuous variables. As with all other types of interactions, when we allow a categorical predictor to interact with another predictor, we also include main effect functions for each of the variables involved in the interaction. These main effect functions have the same 25 form as they would if the same two predictors were in the model but did not interact with each other. Thus, allowing an interaction between two variables already in a model amounts to only a modest increase in the complexity of the design matrix. We need to consider three types of interactions i n which a categorical predictor X{ may be involved: an interaction with a linear predictor, an interaction with a smooth predictor, and an interaction with another categorical predictor. If Xi interacts with a linear predictor Xj we code the bivariate effect function as gij{Xi, Xj) = I(Xi = Xi ){aj2 2 + fijiXj) H V I(Xi = x )(a>j im m + Pj Xj). m In this case the effect of Xi being at level xn is hidden in the intercept a and the main effect function fj(Xj). If categorical predictor Xi interacts with a smooth predictor Xj, one possibility is to code the bivariate effect function as gij(Xi,Xj) = I(Xi = x )(aj i2 ~^~I{Xi = + fijiXj + s (Xj)) 2 H j2 %im){(Xjm "f" PjmXj + Sj (-Xj')), m but this causes complications when we try to measure the average roughness of the smooth predictor Xj. We return to this issue later. A better choice, perhaps, is to model the bivariate effect function as 9ij(X{,Xj) = I(Xi = x )(aj2 i2 + Pj Xj) 2 -\ V I(Xi = x )(a>j im m + fij Xj), m so that the same spline (which would appear in the main effect function fj(Xj)) is used for each category, but the linear effect is allowed to vary. This is the approach we favoured, with the hope that under this model the savings i n the number of parameters to be estimated would make up for any loss in flexibility. The roughness 26 matrix to be constructed under this approach is no different than that in the case where the categorical and smooth predictor do not interact, since there are no additional splines in the model to contribute to the model's roughness. In the final case, we may have Xi interacting with another categorical predictor Xj. In this case we model the bivariate effect function as m 9ij{X{,Xj) = n ~^2^2jijI(Xi = x ,Xj ik = Xji), k=2 1=2 m being the number of categories of Xi, and n being the number of categories of Xj. The effect when either of these predictors is in its first category shows up in the main effect functions. In extending B-WISE to allow categorical predictors, we have made provision for bivariate interactions between categorical and smooth predictors, but we have constrained the smooth predictor's effect function to use the same spline for each level of the categorical variable while allowing the linear piece to differ. We concede that important features of data could be missed under this modelling technique. In the remainder of this chapter we outline the construction of the roughness matrix R\ under the more flexible conditions where a smooth effect function for a predictor is allowed to use a different spline term for each of an interacting categorical predictor's levels. We do this in such a way that 6R\0 may still be interpreted as the sum of the average roughnesses of all the predictors, and hence (1.10) remains interpretable. Now suppose a smooth predictor X\ interacts with a number of other continuous predictors X , â€¢ â€¢ â€¢, X^+i and a number of categorical predictors 2 W\,..., Wt with c i , . . . , Ct categories, respectively. For each of the 2 high-low comd binations of the continuous predictors there are c\ â€¢ â€¢ â€¢ ct category combinations of the categorical predictors. We define the average roughness of X\ to be the average of the roughnesses of all 2 c\ â€¢ â€¢ â€¢ ct d effect functions for 27 X\. Let Wi ,Wiz, 2 â€¢ â€¢ â€¢ ,Wi Ci be parameter vectors for the splines corresponding to the different levels of predictor W{. To build R\ we reconsider the form of equation (2.1). Each of the 2 c i â€¢ â€¢ â€¢ cj d terms would contain an a Ra, T and dividing by the number of terms gives the upper left block of R\ a coefficient of 1. The diagonal blocks corresponding to continuous interacting variables will have a coefficient of ^, since there are 2 d _ 1 c i â€¢ â€¢ â€¢ ct terms of the form (a + .. . + x + .. .) R(a + .. . + x + ...), where x is the spline corresponding to T the high value of an interacting continuous predictor. In fact one can see fairly easily that all blocks corresponding to continuous predictors will have the same multipliers as before. Now, blocks on the diagonal corresponding to categorical predictor Wj will have a coefficient of ^ , since a term (a + ... + wij + . . .) R(a T + ... + Wij + ...) occurs 2 c i â€¢ â€¢ â€¢ C i _ i C i i â€¢ â€¢ â€¢ Q times. The same reasoning applies to the top row and d + left column blocks. Blocks off the diagonal corresponding to categorical variables W{ and Wj, i / j will have a coefficient of and blocks off the diagonal corre- sponding to the same categorical predictor, eg. to the splines Wi2 and ^ 3 , will be zero. Lastly, blocks corresponding to categorical predictor Wi and a smooth predictor will have a coefficient of 2 d _ 1 c i â€¢ â€¢ â€¢ Cj_iCi+i â€¢ â€¢ â€¢ Q. Obviously the computer code will have to be meticulous. This further increase in flexibility has good potential but of course the increase in the number of parameters requires a larger amount of data for estimation. 28 Chapter 3 Examples 3.1 Simulated Data Denison, Mallick and Smith (1998) tested BMARS on an example from Friedman (1991). Gustafson (2000) compared B-WISE with BMARS using the same example, and we will use it again to see if the extensions to B-WISE discussed above are helpful. The data consist of predictor variables X = (X\,..., X) simulated from a p p-dimensional unit hypercube, and a response variable Y generated independently as Y\X ~ AT(/(X),CT ), where 2 /(x) = 10sin(7nEiiE2) + 20(x - 0.5) + 10x + 5rr , 2 3 4 5 and a = 1. In one version of the problem (n,p) = (200,6) (one predictor having 2 no relation to Y), and in another version (n,p) = (100,10) (five predictors having no relation to Y). Rather than simulate the data as prescribed above we use the data made available by Denison, Mallick and Smith, which is also the data that Gustafson analyzed. Gustafson found that in both versions of the problem B-WISE placed the 29 Table 3.1: Simulated data set with (n,p) â€” (200,6). The top five models found i n one run of the model search algorithm. Model S(Xx) <S5 S(X ) 2 + S{X ) 3 + L(X ) + 4 L{X ) 5 <s5 S{X ) + S{X ) + S(Xi) + LiX ) SiX,) i5 S(X ) + S(X ) + L{X ) + SiX ) six,) $5 S(X ) + S{X ) + L(X ) Â® L(X ) six,) six,) 2 3 2 3 4 2 3 4 a5 S{X ) 2 + SiX ) 3 5 5 5 + SiX ) 4 + S(X ) 5 Posterior Probability 0.366 0.186 0.080 0.061 0.041 highest posterior probability on the qualitatively correct model, that is, the model with smooth effects for X,, X 2 and X , linear effects for X 3 4 and X$, and an inter- action between X, and X . Computational resources do not conveniently allow us 2 to evaluate every model in the p = 6 case, let alone the p = 10 case, but on several runs of the model search algorithm the qualitatively correct model was located and had the highest posterior probability of any model. Apparently the scheme penalizes model complexity in a way that prevents overfitting, and yet can describe fairly complicated data structures in detail. Table 3.1 lists the top five models found in one run of the model search algorithm, along with their posterior probabilities normalized relative to all models visited on that run. These are the same as the top five models found by Gustafson with the original B - W I S E scheme. Table 3.2 lists the top five models for the problem involving 100 points and 10 predictors. These models are not the same as the top five models in Gustafson, but the results there also show that X,o makes some contribution to predictive ability. We think the models listed in Table 3.2 represent a more compact cluster of models around the best model than those found by Gustafson. Figure 3.1 shows the univariate effects of the B - W I S E best model and a B M A R S model average for the p = 6 case. B y "univariate effect" we mean the func- 30 Table 3.2: Simulated data set with (n,p) = (100,10). The top five models found in one run of the model search algorithm. Model 5 S{X ) + SiX ) + L ( X ) + L{X ) Six,) a5 S(X ) + S(X ) + LiX ) + L(X ) + SiX,) <Â§>L(X ) SiX,) $Â§ S ( X ) + 5 ( X ) + L ( X ) + LiX ) + LiX, ) 2 3 2 3 2 3 2 a5 S(X ) 2 3 + SiX ) 3 5 10 5 4 six,) <sS> S(X ) + S(X ) six,) 5 4 t 0 + L(X ) Â® S(X ) + SiX,) <g ) i(A-io) + SiX ) + L(X ) + SiX,) â‚¬) i ( X 1 0 ) A 5 A 5 Posterior Probability 0.112 0.105 0.055 0.046 0.045 tion defined by holding all but one predictor variable constant in the fitted model. As such, a univariate effect can only be defined up to a vertical translation, since it is not clear what the intercept should be. The effects in Figure 3.1 have been vertically translated by eye so as to line up with the data. Much information regarding each of the fitted functions cannot be seen in the univariate effects, but the pictures do help to characterize the differences between the two regression methods. One apparent consequence of the model parsimony associated with B-WISE is that B-WISE curves are less flexible. The hyperparameters c and k are certainly determining factors here, and could be changed to make the univariate effects somewhat more dramatic than those shown in Figure 3.1. It is also interesting that even a BMARS model average has sharp bends. This could be due to a lack of diversity in the models that make up the average. Also of interest are the joint effect functions for X, and X , shown in Figures 2 3.2 and 3.3. Here again B-WISE is seen to produce a smoother, less flexible surface. Yet we will see in the next example that maintaining interpretable algebraic forms for models does not necessarily result in models with little predictive ability, as B-WISE can actually outperform BMARS in this respect. 31 0.4 06 predictor x4 0.9 1.0 00 0.2 04 0.6 prodlctor x5 0.9 t.O 0.0 0.2 0.0 02 04 06 predictor x6 Figure 3.1: The B-WISE univariate effects are shown with a lighter line, and the BMARS effects with a heavier line. Vertical translations are not of interest here; only the shapes of the curves are. 32 Figure 3.2: The B M A R S joint effect function for X\ and X flexible. 2 33 seems to be quite Figure 3.3: The B-WISE joint effect function for X\ and X is smoother than the corresponding BMARS function. 2 34 3.2 Abalone Data The data for this example are made available by the Department of Computer Science at the University of Toronto. Here 9 variables are recorded for each of 4177 abalone (molluscs). Our goal is to predict the number of rings in each abalone using the other variables as predictors (the age in years is roughly the number of rings plus 1.5). We treat the number of rings as a continuous variable, but in fact it has values on the integers ranging from 1 to 29. To facilitate comparison with regression methods that do not allow categorical variables we will for the present omit the "gender" variable, which classifies each abalone as either male, female, or infant. A n analysis of the residuals from a B - W I S E fit to all of the data showed that a log transformation of the response gave better agreement with the constant variance assumption. Figures 3.4 and 3.5 show diagnostics for a B - W I S E fit to all of the data, using a log transformation on the response. The assumption of constant variance is tenable, as is the assumption of normal errors. We performed a five-fold cross-validation, using roughly 4/5 of the data as a training set to fit a model and 1/5 of the data as a validation set to assess the model. Denoting the i-th observation of the validation set as (x*, y*), we assess the predictive ability of a model / using the root-mean-squared prediction error, where m is the number of points in the validation set. We terminated the B - W I S E model search algorithm if for five consecutive iterations it failed to locate a model in A , and we terminated the B M A R S algorithm after 20,000 iterations, following 2 the recommendation of the B M A R S creators. 35 9 0.0 0.5 1.0 t.5 2.0 2.5 0.0 predictor <5 0,5 1,0 1.5 2,0 2,5 3.0 predictor x6 Figure 3.4: The assumption of constant variance is nearly satisfied, but some funnelling occurs in the latter predictors. We are satisfied because of the simplicity of the log transformation. 36 QQ plot of Ordered Residuals Figure 3.6 shows that in all 5 splits of the data, the extended B - W I S E method produced models with better predictive ability than the B - W I S E method. Likewise, B - W I S E models performed better than B M A R S model averages in all but one split where a B M A R S model average performed better than both extended B - W I S E and B-WISE. Also shown are the results for the extended B - W I S E method which include the "gender" categorical predictor variable. It is unfair to compare these results to those obtained by the other methods which use one less predictor, but this example does show the usefulness of an ability to incorporate categorical variables. We also note some differences between the B M A R S and the B - W I S E algorithms. While the B M A R S algorithm may step from one model to another more quickly than B-WISE, the difference is usually not enough to make the overall running time shorter. For this example 20,000 iterations of B M A R S translated into a computing time of about 11 hours per split of data. In contrast, the B - W I S E algorithm converged in less than an hour for every split of data. We have found that a convenient way to represent B - W I S E models is with an upper diagonal p x p matrix, where p is the number of possible predictors in the model. The diagonal elements can be either 0, 1 or 2. If the (i,i)-th element is 0 then the ith predictor is excluded from the model, if 1 then it is included as a linear effect, and if 2 then it is included as a smooth effect. The off-diagonal elements can be either 0 or 1. If the (i,j)-th element is 0 then predictors i and j do not interact, and if 1 then they do interact. For example, in the fifth split of the abalone data, 38 "1 extended B-WISE with categorical predictor 1 extended B-WISE without categorical predictor METHOD 1 B-WISE â€” r BMARS Figure 3.6: Each line represents a particular split of the data. The extended IBWISE method has a lower RMSPE than both B-WISE and BMARS in four of the five splits. 39 the best B - W I S E model found could be represented as the matrix ( 2 1 0 1 0 0 0 2 0 0 0 0 1 0 0 0 0 2 1 0 0 2 0 1 2 1 1 (3.1) 2 V Here we can immediately see that all predictors are included in the model. The interaction pairs are predictors 1 and 2, 1 and 4, 2 and 7, 4 and 5, 5 and 7, and 6 and 7. The number of predictors is high enough that there is some sophistication required, but B - W I S E models are certainly more accessible than those of other adaptive regression schemes. 3.3 C e n s u s H o u s i n g Data The data for this example were collected as part of the 1990 US census, and are made available by the Computer Science Department at the University of Toronto. Here 16 features for each of 256 geographical regions are used to predict the median price of the houses in the region. The 16 predictor variables are 1. total number of families in the region 2. total number of households (HH's) 3. percentage of black people 4. percentage of people between 25-64 years of age 40 5. percentage of widowed females 6. percentage of HH's with 1 person 7. percentage of HH's with asian householder 8. percentage of HH's with Hispanic householder 9. percentage of HH's in which more then one non-relative lives 10. percentage of HH's which are non-family with 2 or more people 11. percentage of housing units (HU's) occupied 12. percentage of occupied H U ' s which are owner-occupied 13. percentage of vacant H U ' s which are not for rent, sale, migrant workers nor for seasonal, recreational or occasional use 14. percentage of occupied HU's with white householder 15. percentage of occupied H U ' s with householder not of Hispanic origin 16. percentage of HU's with 1 to 4 rooms Before performing the regression analysis we applied a log transformation to the response variable and fourth-root transformations to the two count variables, predictors 1 and 2, to achieve distributions which were closer to Gaussian. The remaining predictors are proportions on [0,1], and since some are highly skewed towards one end of the interval, we applied a logit transformation to all of these predictors. There were some values at the endpoints of [0,1], so before applying the logit transformation we applied a small shrinkage towards 0.5. Figures 3.7 and 3.8 show the plots of the response against predictors using the original data and 41 the transformed data, respectively. Predictor 15 is eliminated from the analysis because during the initial fit we found that this predictor was perfectly negatively correlated with predictor 8. From the description of each predictor this detail is not entirely obvious, but it becomes obvious upon attempting to run the model search algorithm. At this point we added a safety feature that keeps the algorithm from stepping through models for which A?A\ is non-invertible. In some cases this may occur because there are more parameters than data points, and in other cases because of linear dependence or near linear dependence between continuous predictor variables. Also, with several categorical predictors in the model there are often zero counts in some category combinations. If the right interaction is present this gives rise to a non-invertible A?A\. 42 0 20000 40000 0 predictor 1 0.0 0.10 20000 60000 0.0 0.4 predictor 2 0.6 0.2 predictor 3 0.4 0.6 predictor 4 0.20 predictor 5 0.0 0.05 0.15 0.2 0.4 0.6 0.S 1.0 predictor 11 predictor 10 0.0 0.2 0.4 0.6 0.6 predictor 16 Figure 3.7: Plots of median house price against each of the untransformed predictor variables. 43 2 *". *â€¢ "" â€¢ 5 ' *â€¢*.'â€¢ .** â€¢ S â€¢ 2 4 6 6 10 14 5 10 15 â€¢4-2 predictor 1 0 â€¢ .Â» v t â€¢ 2 -1.5 4 -0.5 predictor 3 0.5 predictor 4 prodicla< 2 I * I â€¢ â€¢ â€¢'*Â£&!:. ' â€¢.'V-..' . 1 s 5 .4 -3 * â€¢', â€¢. .Â« lO s â€¢3 predictor 5 1 -2 * A*. Â» 0 -5.0 0 5 : -4.0 -3,0 -2 0 - 4 - 2 predictor 7 predictor 6 0 ! predictor 8 â€” ' V:Â« â€¢ .S8k -5 -4 -3 -2 â€¢ â€¢ â€¢ , .:. A.fc.-5 -1 predictor 9 -2 0 2 predictor 13 -4 -3 0 -2 predictor 10 : :M... â€¢ â€¢A â€¢ SS l : :â€¢Â».& 4 2 4 â€¢2 predictor 11 T- ; s s â€¢â€¢â€¢â€¢^1 - 4 - 2 0 2 4 2 4 â€¢â€¢'i.'r' â€¢ . **â€¢'-**' *** * 4-3-2-1 predictor 14 0 predictor 12 0 1 predictor 15 Figure 3.8: Plots of the transformed response variable against each of the transformed predictor variables. Some skewness in the predictors has been eliminated. 44 The best model found by the model search algorithm was \ 1 1 V Figure 3.9 shows that the assumption of normal errors is tenable, but the assumption of constant variance may be problematic. As it happens, about 14% of the responses take on the minimum response value of 14999, while almost all of the other response values are unique. Apparently a lower bound was imposed on the median house price when the data were recorded. It was hoped that the model structure (3.3) would give some insight into the underlying influences of the median house price, but because of the large number of predictors this seems to be an example in which the model structure can not shed much understanding on the problem. Indeed, trying to interpret the above model 45 o d ci -1 -3 0 1 Quantiles of Standard Normal CD -1.0 -0.5 0.0 0.5 1.0 1.5 fitted values Figure 3.9: The normality assumption is reasonable, but the assumption of constant variance is problematic. Apparently a lower bound was imposed on the median house price ie. values less than 14999 were recorded as 14999, which produced the line of data points at the left of the lower plot. There does not seem to be an obvious remedy for the situation. 46 would be putting too much faith in the B - W I S E modelling scheme. Assuming there is some true underlying model that accounts for all systematic variation in the median house price, the most we can hope is that a good approximation to this model exists within the class of B - W I S E models. There is certainly some error in the structure of any B - W I S E model, and for that reason it may be prudent to average over several good models. A Bayesian model average is an average of several or all models in the model space, where each model is weighted by its posterior probability. For this example we constructed Bayesian model averages using only the models in A20 as this speeds up computation and reduces round-off error. Of course, the model weights were normalized relative to A2oWe performed five-fold cross-validation and assessed predictive ability using RMSPE. In all five splits the model average had a lower RMSPE than the best single model though the gains were perhaps not dramatic. This could be because, as Gustafson suggested, the B - W I S E models that make up the average are in some sense quite similar. 47 in CD d IO m d o in d Best Model Model Average method F i g u r e 3.10: RMSPE f o r t h e five splits o f t h e census h o u s i n g d a t a . 48 Chapter 4 Conclusion a n dFurther W o r k B-WISE improves on many existing flexible regression schemes in the area of interpretation by limiting the number of ways a predictor can enter the model, yet it allows enough flexibility to capture non-trivial features of data and has been shown to have excellent predictive ability. The interpretability comes mainly from the easy solicitation of univariate effects, even for predictors involved in an interaction. The flexibility is provided by cubic splines for the univariate effects and a structured approach to bivariate interactions. In this thesis we have implemented two important extensions to the original B-WISE introduced by Gustafson, but there are still other directions in which B-WISE could be extended. In our first example we hinted at some arbitrariness in the choice of the hyperparameters c and k. Gustafson provided a reasonable rationale for c = 1 and k = 0.1, but recommended some sensitivity analysis. Another option is to put a prior on these parameters. The main hurdle to using this approach would probably be choosing the prior distributions, and most likely we would lose the closed form for the model posterior probability (1.10). As a consequence, a full M C M C algorithm 49 (i.e. an algorithm with an accept-reject step) would be required for the model search. Because of these drawbacks we favour the simplicity of fixing the hyperparameters and perhaps tuning them after a sensitivity analysis. A vast number of data sets have a non-normally distributed response variable, in which case B - W I S E cannot be used. But we think some of the ideas used in B - W I S E would be useful for discrimination in the case of a binary response. As a starting point one could implement a deterministic, non-Bayesian version of the model search algorithm, which at each step moves towards the model which achieves the most separation D 2 between the two groups. = (ix - X 2 ) 5 " , ( x - x) 1 T 0 e d 1 2 A Bayesian implementation would take steps towards neighbouring models in accordance with the posterior probabilities of those models, and the model posterior probability would favour models which achieve large D and would penalize models according to their complexity. We think there are 2 discrimination problems where the flexibility of B - W I S E models would be useful. Currently the rate determining step in the software we have developed is the evaluation of A?A\, which has an operation count of 0(nq ), where q is the number 2 of columns in the design matrix. This can easily be lowered to 0(nq) for many of the model evaluations in the search algorithm. Suppose we have calculated AQAQ for a certain model An, and then we form a new model A i by adding a term to An, that is, inserting a column c into [A | c | a n d AfA, a = AQ [A |v4(,]. a Then the design matrix for A i is has the form ( AA T a AjA, = b cA T a \ AjA Ale AA CT C eA Ac AlA T a b 50 T a b T b b Most of the entries in this matrix have already been calculated. The remaining entries can be calculated in 0(nq) steps. Clearly, models formed by deleting columns from AQ are even easier to compute. We suspect there are several places in the model search code where fast updating methods could be employed. Once a certain quantity has been calculated for a model, the corresponding quantity for a model in the immediate neighbourhoud might be a simple update to the information already obtained. 51 Bibliography [1] Breiman, L . , J . Friedman, R. Olshen and C. Stone (1984). Classification and Regression Trees. Belmont, Calif.: Wadsworth, Inc. [2] Cuzick, J . (1993). Discussion of "Varying-coefficient Models" by T. J . Hastie and R. J . Tibshirani. Journal of the Royal Statistical Society, Series B, 55, 757-796. [3] Denison, D . G . T., B . K . Mallick and A . F . M . Smith (1998). Bayesian M A R S . Statistics and Computing, 8, 337-346. [4] Friedman, J . (1991). Multivariate Adaptive Regression Splines (with discussion). The Annals of Statistics, 19, 1-141. [5] Friedman, J . and W . Stuetzle (1981). Projection Pursuit Regression. Journal of the American Statistical Association, 76, 817-823. [6] George E . I. and R. McCulloch (1993). Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88, 881-889. [7] George E . I. and R. McCulloch (1997). Approaches for Bayesian Variable Selection, it Statistica Sinica, 7, 339-373. 52 [8] Green, P . J . and B . W . Silverman (1994). Nonparametric Regression and Gen- eralized Linear Models. London: Chapman and Hall. [9] Gustafson, P. (2000). Bayesian Regression Modelling with Interactions and Smooth Effects. To appear in Journal of the American Statistical Association, 95. [10] Hastie, T. J . and R. J . Tibshirani (1990). Generalized Additive Models. London: Chapman and Hall. [11] Hastie, T. J . and R. J . Tibshirani (1993). Varying-coefficient Models (with discussion). Journal [12] Neal, of the Royal Statistical R. M . (1996). Bayesian Learning Society, Series B, 55, 757-796. For Neural Networks. New York: Springer- Verlag. [13] Neal, R. M . (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report 9702, Department of Statistics, University of Toronto. [14] Raftery, A . E . , D . M . Madigan and J . Hoeting (1997). Model Selection and Accounting for Model Uncertainty in Linear Regression Models. Journal of the American Statistical Association, 92, 179-191. 53 Appendix A Details o f the M o d e l Search Algorithm In Section 1.4 we described an algorithm with which one can locate B - W I S E models of high posterior probability. In this appendix we give a pseudocode for the software we developed. Before the model search begins, the program must read in from files all of the relevant data, including the n x p matrix X of predictor variables, the n x 1 vector y of responses, the spline structure (Z from equation (1.4)) for each of the continuous variables, the roughness matrix (the R used in the construction of Z), and the initial model (specified using a matrix, as in (3.1)). We found it more convenient to construct Z and R using the matrix friendly routines in Splus and then to store them in files, rather than to construct them with the same C program that performs the model search. Having acquired the relevant quantities the algorithm proceeds as follows. print "Model 1"; 54 print model; // here model i s the user-specified i n i t i a l model print "log of non-normalized posterior p r o b a b i l i t y = "; bestlpp=designmatrix(model,p,X,y, Z,R,n,numknots); print bestlpp; The function designmatrix is the workhorse of the whole algorithm. It builds the design matrix A\ based on the specifications in model, and ultimately computes and returns the model posterior probability (1.10). stopflag=l; // the algorithm terminates i f stopflag reaches 5 numvisits=l; // t h i s i s the number of models v i s i t e d while(stopflag<5 and numvisits<=maxit) // maxit i s the maximum number of i t e r a t i o n s allowed { stopflag=stopflag+1; numnbhd=sizenbhd(model); // sizenbhd computes the number of models i n the nbhd of model modlist=list of a l l models i n neighbourhood of model; // In the loop below we w i l l compute the model posterior // f o r a l l of the models i n modlist. f o r ( l p from 1 to numnbhd) { curmodel=modlist[lp]; lookup curmodel i n modfile; i f found then l p p l i s t [ l p ] = l o g post prob found i n modfile; else { 55 // need to compute l o g post prob f o r curmodel and store i t i n modfile lpplist[lp]=design_matrix(curmodel,p,X,y,Z,R,n,numknots); if lpplist[lp]>bestlpp then { bestlpp=lpplist[lp]; bestmodel=curmodel; print "best so f a r = "; print bestlpp; } if l p p l i s t [ l p + l ] > b e s t l p p - l o g ( 2 0 . 0 ) then s t o p f l a g = 0 ; // If a model i s found which has post prob greater then 57, of // highest post prob then s t o p f l a g i s r e s e t to 0 . the write curmodel to modfile; write l p p l i s t [ l p ] to modfile; } } // Now we exponentiate and normalize the p o s t e r i o r p r o b a b i l i t i e s // r e l a t i v e to the immediate neighbourhood. lpplist=exp(lpplist); s=sum(lpplist); lpplist=lpplist/s; draw=runif(0,1); cumulative=0.0; f o r ( i from 1 to numnbhd) { 56 cumulative=cumulative+lpplist[i] ; if draw<=cumulative then { temp=i; e x i t f o r loop; } } print "Model "; numvisits=numvisits+l; p r i n t numvisits; model=modlist[temp]; p r i n t model; print print "log of non-normalized p o s t e r i o r p r o b a b i l i t y = "; log(s*lpplist[temp]); > print "The model with highest p o s t e r i o r d e n s i t y was "; p r i n t bestmodel; We have also written simpler programs to compute predicted or fitted values for a single B - W I S E model and for a Bayesian model average which uses the modfile created by the model search algorithm to construct the average. 57
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Expanding the capabilities of Bayesian regression with...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Expanding the capabilities of Bayesian regression with interactions and smooth effects Johnson, Nathan 2000-12-31
pdf
Page Metadata
Item Metadata
Title | Expanding the capabilities of Bayesian regression with interactions and smooth effects |
Creator |
Johnson, Nathan |
Date | 2000 |
Date Issued | 2009-07-10T22:38:05Z |
Description | The B-WISE (Bayesian regression With Interactions and Smooth Effects) method of regression modelling was designed to improve upon other flexible regression schemes in the areas of model interpretability and ease of implementation. B-WISE models have been shown to have good predictive performance, and performance can be even further improved with a natural Bayesian model averaging scheme. In this thesis I will outline some ways in which some of the constraints inherent in a B-WISE model can be relaxed, so that the technique can be used in more general situations and with even greater flexibility. It is shown that much of the interpretability of B-WISE models is retained and implementation is still relatively straightforward. |
Extent | 2232684 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-07-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0089547 |
URI | http://hdl.handle.net/2429/10640 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2000-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- ubc_2000-0436.pdf [ 2.13MB ]
- Metadata
- JSON: 1.0089547.json
- JSON-LD: 1.0089547+ld.json
- RDF/XML (Pretty): 1.0089547.xml
- RDF/JSON: 1.0089547+rdf.json
- Turtle: 1.0089547+rdf-turtle.txt
- N-Triples: 1.0089547+rdf-ntriples.txt
- Original Record: 1.0089547 +original-record.json
- Full Text
- 1.0089547.txt
- Citation
- 1.0089547.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 16 | 26 |
United States | 15 | 1 |
Iran | 3 | 0 |
Canada | 2 | 0 |
France | 2 | 0 |
Romania | 1 | 0 |
Russia | 1 | 0 |
City | Views | Downloads |
---|---|---|
Ashburn | 14 | 0 |
Shenzhen | 13 | 26 |
Didehban | 3 | 0 |
Beijing | 3 | 0 |
Edmonton | 2 | 0 |
Roubaix | 2 | 0 |
Hunedoara | 1 | 0 |
Unknown | 1 | 12 |
Wilmington | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0089547/manifest