Priors for Bayesian Neural Networks by Mark Robinson B.Sc, University of Guelph 1999 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia June 2001 © Mark Robinson, 2001 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of 5 4 4 ^ ^ The University of British Columbia Vancouver, Canada Date BU,?.4,-2 6d) DE-6 (2788) Abstrac t In recent years, Neural Networks (NN) have become a popular data-analytic tool in Statistics, Computer Science and many other fields. NNs can be used as universal approximators, that is, a tool for regressing a dependent variable on a possibly complicated function of the explanatory variables. The N N parameters, unfortunately, are notoriously hard to interpret. Under the Bayesian view, we propose and discuss prior distributions for some of the network parameters which encourage parsimony and reduce overfit, by eliminating redundancy, promoting orthogonality, linearity or additivity. Thus we consider more senses of parsimony than are discussed in the existing literature. We investigate the predictive performance of networks fit under these various priors. The Deviance Information Criterion (DIC) is briefly explored as a model selection criterion. i i Contents Abstract ii Contents iii List of Figures vi Acknowledgements viii Dedication ix 1 Introduction 1 1.1 Artificial Neural Networks 2 1.2 The Bayesian Approach 5 1.3 Current Approaches to Bayesian Neural Networks 6 1.3.1 Neal 7 1.3.2 Rios Insua and Miiller 8 1.3.3 Andrieu, de Freitas and Doucet 9 1.3.4 Lee 10 1.4 Datasets Used in the Thesis 12 1.4.1 Ozone Data 12 1.4.2 Tecator Data ; . . .- 14 1.4.3 Boston Housing Data 14 ii i 1.5 Outline and Scope of the Thesis : 16 2 Considerations for the Prior Distributions 19 2.1 Motivation . . : 19 2.2 The Geometry of the Problem 21 2.3 Weight Decay 22 2.4 Possibilities for the Prior 23 2.5 Parsimony Priors 23 2.5.1 Orthogonality of Input-to-Hidden Weights 24 2.5.2 Additivity of the Hidden Nodes . 24 2.5.3 Selection of a and b 25 2.6 Effective Domain Prior 26 2.7 Priors on B and a2 29 3 Inference via Markov Chain Monte Carlo 30 3.1 The Posterior Distribution 30 3.2 The MCMC Sampling Algorithm 31 3.3 Full Conditional Distributions 32 3.3.1 Derivation for a 2 32 3.3.2 Derivation for B 32 3.4 Acceptance Ratio Calculation for the M H Step 32 3.5 Posterior Inference 33 3.5.1 Importance Sampling 34 4 Empirical Results 35 4.1 Assessing Convergence . . . 35 4.2 Goals 37 4.3 Results on the Boston Housing Dataset 40 4.4 Results on the Tecator Dataset 42 4.5 Results on the Ozone Dataset . 43 iv 5 Discussions and Further Work 46 Appendix A: Implementation 51 Appendix B: Plots 57 Bibliography 63 v List of Figures 1.1 The Neural Network Architecture '3 1.2 Scatter Plot of Ozone Data Set . . . 13 1.3 Scatter Plot of Tecator Data Set. 15 1.4 Scatter Plot of Boston Housing Data Set 17 2.1 Interpretation of Neural Network Parameters 20 2.2 Examples of Logistic Nodes 21 2.3 A Piecewise Linear Approximation to the Logistic Function 27 2.4 A n Example of \JL outside the effective domain of interest. 28 4.1 M C M C Plots for the Neural Network Parameters for 10000 iterations of a network with 5 hidden nodes on the Ozone dataset. Prior distribution from Lee (2000a, 2000b). 36 4.2 M C M C Plots for the Neural Network Parameters for 10000 iterations of a network with 5 hidden nodes on the Ozone dataset. Prior distribution is the effective domain of interest 38 4.3 M C M C Plots of the a Parameter as a diagnostic for convergence of the Markov Chain. 39 4.4 Importance Weights for the Parsimony Priors (Based on 10000 Samples). 39 4.5 MSEs on the Validation Dataset (Boston Housing Data). Five runs from different initial values on five different splits of the data 41 4.6 Improvements (% Change in MSE) under Parsimony Priors (Boston Housing Data). Five runs from different initial values on five different splits of the data 41 vi 4.7 Improvements (% Change in MSE) under Parsimony Priors (Tecator Data). Ten runs from different initial values. 43 4.8 MSEs on the Validation Dataset (Ozone Data). Five runs from different initial values on five different splits of the data 44 4.9 Improvements (% Change in MSE) under Parsimony Priors (Ozone Data). Five runs from different initial values on five different splits of the data at z = 2, 3,4. . . . 45 5.1 Plots of DIC for Neural Network Models with 4, 6 and 8 Hidden Nodes for Multiple Runs and Splits. 50 vn Acknowledgements I would like to thank my supervisor, Dr. Paul Gustafson, for his support and encouragement throughout the process of producing this thesis. Also, thanks are owed to Dr. Harry Joe for his helpful comments on the thesis and for providing programming tips along the way. My time here at U B C would not have been the same had it hot been for the support of an incredible bunch of graduate students who helped me put things in perspective and who endured my high intensity. I graciously thank them for their support. Lastly, I would like to thank my parents, brothers and friends for their encouragement and especially to Lynn for her love and support. MARK ROBINSON The University of British Columbia June 2001 vii i To coffee and all things real ix Chapter 1 Introduction In recent years, there has been considerable interest in the use of neural networks by both the com-puter science and statistical communities. In the computing community, applications are usually associated with artificial intelligence (e.g. pattern recognition). For the statistician, the methodol-ogy is used primarily for flexible (non-linear) regression as well as classification. There is consider-able overlap in the two disciplines, especially in classification problems. Many reviews have been written from a statistical point of view (Cheng and Titterington, 1994; Ripley, 1994; Stern, 1996). As the name suggests, neural networks initially were derived in an attempt to model the parallel processing capacity of the human brain. Computers have a remarkable ability to perform routine calculation tasks in a fraction of the time a human can. However, a human can perform tasks such as identifying faces or recognizing characters with little or no effort; a computer performing the same task may require a considerable amount of time and a sophisticated algorithm. Artificial neural networks (ANNs) — the mathematical constructs discussed in this thesis — attempt to harness these parallel capabilities to analyze experimental data. In general, neural networks take a number of inputs (explanatory variables or covariates in statistical terminology) and via a collection of neurons (i.e. units which weight and sum a number of incoming variables to one output), generate a number of outputs (responses in statistical terms). For example, in a regression problem, the inputs may be some predictors x, which are used to calculate an output, y. The x and y can be scalar or vector quantities. In a classification problem, 1 I the outputs yi (i = 1, • • •, k) may represent the probability of being in the i-th group. From a computer science viewpoint, these networks are to be learned 1 from the data (i.e. the network parameters are to be estimated). For this thesis, it is desired to work under the Bayesian framework, in which one thinks of each parameter as having a distribution which represents belief about the value of that parameter. We will see shortly that this is a difficult task. 1.1 Artificial Neural Networks In the literature, ANNs go by many names: back-propagation NNs, feed-forward NNs and multilayer perceptron (MLP) NNs, all of which describe the same, general structure. The term back-propagations refers to the way that the derivatives are calculated for the optimization methods. Feed-forward refers to the forward directed links from the inputs to the outputs. The term M L P refers to the possibly multiple layers in the network which are connected by neurons. Every A N N consists of an input layer which is eventually connected to an output layer using 1 or more hidden layers. In getting from the inputs to the outputs, each link in the network contributes a weight and every node in the hidden layer forces that linear combination through an activation function. This process is repeated for each of the hidden layers. Commonly, biases are added at each stage before being passed through an activation function. A three layer (1 input, 1 hidden, 1 output) neural network is depicted in Figure 1. For a three layer network, the output layer is computed as a function of a linear combination (I.e.) of non-linear functions of a linear combination of the inputs. The neural network depicted in Figure 1 can be written mathematically as: f I.e. of non-linear- functions 1 I.e. of inputs J where p is the number of input units, H is the number of hidden units, q is the number of output l rrhe learning discussed here is often called supervised learning, in that the data are to be classified into known groups. Unsupervised learning involves classifying into groups which are not previously specified. 2 lutput Units Hidden Units Units Figure 1.1: The Neural Network Architecture. units and o,H are activation functions. Here, the Vhk represent the H x q hidden-to-output weights Uih are the p x H input-to-hidden weights, bk are the output biases and are the hidden biases. There are also networks suggested which have skip layer connections, meaning there are direct links from the input to output layers which have corresponding weights. In this case, the 3 above model (1.1) becomes: Uk = 00 < i=i V H / V + X] Vhkh ( CLh + uihxi h=l V i=l ) ) for A; = ! , - • • (1.2) skip layer terms In most cases, the 4>i, • • •, 4>H are taken to be the same sigmoidal function (e.g. the logistic function, (l + e~x)~1 or equivalently, the hyberbolic tangent function, (ex — e~x)(ex + e~x)~l). In a regression problem, 0o is taken to be the identity function and in a binary classification problem, o is commonly taken to be the logistic function. For example, a regression network with one input, four hidden units and one output unit with a tanh activation function would relate the input x to the output.y via the following model: y = b + v\ tanh(ai + u\x) + V2 tanh(a2 + U2x) + v$ tanh(a3 + u^x) + V4 tanh(a4 + u^x). In general, a neural network is a tool for establishing an intricate non-linear relationship in either a regression problem or classification problem. A very detailed non-linear relationship can be described by models such as (1.1) or (1.2). In fact, it can be shown (although no attempt is made here) that a three layer A N N can approximate any function (on a compact domain) to a desired level (Hornik et al. 1989) by having an appropriate number of hidden nodes. There are a few standard statistical methods which are special cases of the neural network. First of all, linear regression can be achieved with one hidden unit and identity functions for both \. The linear logistic model can ascertained by using the identity function for 4>i and the logistic function for (f>o- Other methods such as projection pursuit regression and radial basis functions are special cases as well (Chen and Titterington, 1994). The competing statistical methods of ANNs are, for regression: multivariate adaptive re-gression splines (MARS; Friedman 1991), and the Bayesian version thereof; generalized additive models, non-parametric regression, smoothing techniques and, more recently, Bayesian regression modelling with interactions and smooth effects (B-WISE; Gustafson 2000). For classification, other statistical methods include (linear, quadratic, flexible) discriminant analysis and nearest-neighbour 4 methods. Classification and Regression Trees (Breiman 1984) also overlap many applications with neural networks. Neural networks have been used quite extensively in a broad spectrum of disciplines. A small scan of applications includes: the papermaking industry (Edwards et al. 1999), the dairy industry (Macrossan et al. 1999) and meteorology and oceanography (Hsieh and Tang 1998). 1.2 The Bayesian Approach In this thesis, I have chosen to take the Bayesian approach. The question arises: Why add an additional layer of complexity (computational or otherwise) to an already intricate problem? My reason is twofold. First, with the recent technology in Markov Chain Monte Carlo methods (see Chapter 3), Bayesian methods are becoming easier to implement. Second, the additional complexity also provides additional flexibility in terms of model capabilities and methods for inference. The rest of this section describes the Bayesian approach, as it applies to this thesis. Frequentist (or classical or Fisherian) statisticians assume their data follows, at least ap-proximately, some parametric model, from which they form the likelihood function (i.e. the joint distribution of the observed data but thought of as a function of the unknown parameters in the model). They find the parameters which maximize the likelihood, that is, the parameters which maximize the probability of seeing that data. They use elegant asymptotic theory to make inference on the parameters or functions thereof. Bayesian statisticians attacking the same problem, assume the parameters in the model each have a distribution of their own. First, one's beliefs about these parameters are represented in a distribution called the prior distribution. Using the laws of probability, these beliefs are updated to the posterior distribution on which inferences are based. The prior distribution can allow for quite a flexible analysis, and therefore have a tremendous advantage over frequentist methods. Not surprisingly, this greatest benefit is also the greatest criticism, and rightfully so. Are humans able to reliable express their beliefs in a mathematical function? Likely not. In practice, it is extremely difficult to specify prior distributions. As will be illustrated later in the thesis, priors for neural networks are especially difficult to define because the parameters may not have simple 5 interpretations. Quite generally, assume we have random variables Y i , . . . , Yn which generate data y = a prior distribution of the parameter 0, denoted by p{9), we can appeal to the laws of probability and form the posterior distribution of 9 in light of the observed data y, denoted by p(9\y). The calculation is as follows: In many cases, the posterior distribution is not in a form easily recognizable as one of the standard distributions. Under certain general conditions, Markov Chain theory allows us to gener-ate a Markov Chain which has the posterior distribution as its stationary distribution. Therefore, we follow a Markov chain for some period of time and monitor whether it converges. Assuming the chain has converged, we generate very large samples (i.e. Monte Carlo) from and summarize pos-terior attributes (or functions thereof) from these. This method of computation is called Markov Chain Monte Carlo (MCMC) for obvious reasons. Also of importance, especially here in this thesis, is the concept of a predictive distribu-tion. Suppose that we have a new observation y* that we wish to make predictions for based on the previous observations. Again, using the laws of probability we can generate the conditional distribution of y* given the observed data y, as follows: After determining the posterior distribution (perhaps by simulation), predictive inferences for this new observation can then be made. 1.3 Current Approaches to Bayesian Neural Networks In this section, we discuss the recent methodology for analyzing neural networks under the Bayesian framework. Here, we discuss the work of (j/i) • • • )Vn) from some parametric density f{yi\0) (sometimes written as fe(yi))- After we specify p(o\y) = f(y\o)P(9) ff(y\9')P(9>)d9< c< f(y\e)p(e). 6 • Neal (1996) • Rios Insua and Miiller (1998); Miiller and Rios Insua (1998) • Andrieu, de Freitas and Doucet (1999); and, • Lee (2000a, 2000b). In each case, I keep the notation used in the original papers in order to form a distinction. 1.3.1 N e a l Similar to that in the previous sections, Neal defines the network with one hidden layer as: fk{x) = bk + Yvjkhj{x) i hj(x) = tanh ^a,j + £ where fk(x) denotes the fc*n output. The biases for the output and hidden units are a,j and bk, respectively. The weight from the input unit to the hidden unit is denoted by Uij. The weight from hidden unit j to output unit k is denoted by Vjk. The tanh activation function is the same shape as the logistic function but ranges from — 1 to 1 instead of 0 to 1. Suppose u>ij is any of the network weights (either u or v above). Neal specifies priors on the weights hierarchically as follows: WiMli ~ - ^ ( 0 , ^ . ) , and 2 r - i ( a w ^ w \ ° W i . ~ V 2 ' 2 ) where T _ 1 represents the inverted Gamma distribution That is, if X ~ Gamma(a, b), denoted T(a, b), then 1/X ~ Inverted-Gamma(a, 6), denoted F~l(a, b). Notice that there is a hyperparam-eter specified for each unit that the weight leaves from. Neal does, however, consider special cases where each hyperparameter is the same across the group. For example, if all the inputs were to known be on an equal footing, one could take cr 2 1 = • • • = o\ = a^,. The hyperparameters aw and 7 UJW are selected so as to assume very vague prior information regarding o\ . Similarly, hierarchical priors are specified for the bias parameters. A n advantage of specifying a hyperparameter for the connections coming out of each input unit is, that if the weights of connections coming out of some input i are typically small, then the values of the parameter i = l,...,n where e^N(0,a2) and ^(*) = _ L _ , 1 + e z and Xi = (xio,xn,... ,XiP) and so that input-to-hidden biases are present, Xio = 1. They define hierarchical priors similar to Neal, as follows: 1. Pj\p,p,crj ~ N(/j,p,aj) where p.p ~ N(ap,Ap) and a^2 ~ T(cb/2JcbCb/2) ~ i V ( / i 7 , 5 7 ) where ^ 7 ~ N ( a 7 , Ay) and Sy 1 ~ Wish(c 7 , ( c 7 C 7 ) - 1 ) 3. a " 2 ~ T(s/2,sS/2) Their key observation is that conditional on the 7 = ( 7 1 , . . . , 7 M ) parameter, one simply has a standard linear model. As a result, the M C M C algorithm can be implemented efficiently by integrating out B = (Bi,... ,8M)- Metropolis steps are done on 7 after marginalizing over 8 B. Conditional on 7 and the hyperparameters, B is multivariate normal. Similar to Neal, all hyperpriors are conjugate and can be sampled efficiently. They extend their method to the variable architecture case, that is, where the number of hidden nodes is not fixed. Because there is nonnegligible uncertainty about whether a certain num-ber of nodes is adequate, this is a particularly appealing approach. They develop a reversible-jump M C M C algorithm (Green 1995) in which a Markov Chain jumps randomly to different architectures, therefore having differing parameter spaces. 1.3.3 Andrieu, de Freitas and Doucet Andrieu, de Freitas and Doucet (1999) define a model with k hidden nodes and c outputs of the form k yt = b + /3rx4 + ^2aLJ4>(\\xt-nj\\) + nu t = 1,...,N, j=i where n 4 are zero-mean independent Gaussian errors with constant variances o\,...,o2 for each of the c outputs and $(•) is a radial basis function (RBF) . The //,• parameters represent the radial centres. Common choices of the R B F may be: • Linear: 4>(rj) = n • Thin plate spline: (j>(rj) = n2 log n • Multi-quadric: (j>(rj) = \fn2 + A 2 • Gaussian: (j)(rj) = exp(-Ar? 2). The interpretations of the ensuing radial basis networks would presumably be quite different for these functions. Andrieu, de Freitas and Doucet (1999) define a prior distribution over the parameters (b,/3,a, fx, a2) as follows. A priori, the radial basis centres fij are uniformly distributed over the space of where the input data lie, conditional on k. That is, they are restricted to the hyper-cube, with one corner at all the maximum values of each of the predictors and the opposite corner at all of the minimum values of the each of the predictors. In this respect, the prior is slightly 9 data dependent. Conditional on the centres, the collection of regression parameters (b,/3,a) are specified to have a + fc)-dimensional Gaussian distribution centred at the origin with inverse variance-covariance matrix -^DfDi where Di is the design matrix including columns for k radial basis functions, d covariates and an intercept. Hence, the prior penalizes the radial basis functions for being too close. The parameter k is also taken to be random and is assumed to have a trun-cated (i.e. k < N — (d + 1)) Poisson distribution with parameter A. The truncation is necessary to avoid columns of Di where are linearly dependent. To finalize the prior distribution, Jeffrey's noninformative prior is used for a2 (i.e. p(of) = c~ 2 ) , and vague inverted Gamma and Gamma priors are used for each of 6f and A. To simplify the posterior distribution, the authors marginalize over (b,/3,a) (Gaussian), analogous to what is done in Rios Insua and Miiller (1998). Sampling of the posterior distribution with k fixed is similar to Rios Insua and Miiller (1998). Metropolis-Hastings steps are performed sequentially for each hidden node j = 1,. . . , k on fij. Following this, the regression parameters (b,/3,a) can be sampled jointly from a multivariate normal distribution. Similarly, the variances of and S2 can be sampled from inverted Gamma distributions. The hyperparameter A can be sampled using Metropolis steps. Andrieu, de Freitas and Doucet (1999) extend this to the random k case by appealing to the reversible jump M C M C method discussed in Green (1995), like Rios Insua and Miiller in the last section. 1.3.4 Lee Most recently, Lee (2000a, 2000b) has introduced a non-informative prior distribution over the parameters (/3,7, a2) to avoid the selection of a large number of hyperparameters. Lee implements a neural network model of the form M yi = A) + YP3i,(.xTn)-+ ti-, i = i,-.-;n where, as before ~ N(0,a2) and ib(z) = (1 + e~z)~l. He selects a prior which puts equal density on all regions of the parameter space (except on the variance a2 where the log scale is used). A restricted version of the standard Bayesian linear 10 regression prior (p(8,j,a2) ex. a 2 ) is used. Unlike linear regression, this improper prior for neural networks gives rise to an improper posterior. Define Z = {%} where _ 1 1 3 1 + exp ( -7 j 0 - Eh IjhXih)' In the case of logistic activation functions, a restriction of the form n n = {(6,7,a2) : \ l j h \ < Cn,\ZTZ\ > Dn) ensures linear independence of the logistic basis functions. Lee proves that posterior in this case is indeed proper and suggests values of Cn and Dn which work well in practice. For M C M C sampling, Lee uses a streamlined version (i.e. because of fewer parameters) of the fixed-architecture Rios Insua and Miiller algorithm. Lee migrates to a discussion of model selection using Bayesian model averaging. Briefly, because there is uncertainty as to whether any given model is the correct one (or that there are multiple explanations of the data), it is of interest to combine and average over many candidate models. A review of this technique is given in Hoeting et al. (1999). Using Lee's notation, the predictive distribution of responses Y based on observed data D and q candidate models M i , . . . , Mq is P(Y\D) = Y^P{Y\D',Mi)P(Mi\D). i Clearly, this is a weighted average where the weights are the posterior probabilities of each model. Unfortunately, the term P(M{\D), sometimes known as the evidence is typically a difficult quantity to estimate! Due to the potentially large number of possible models (subsets of the explanatory variables + number of hidden nodes), he introduces a Bayesian Random Searching (BARS) algorithm which randomly proposes jumps from model to model. Based on the Bayesian Information Criterion (BIC) 2 , these steps are accepted or rejected. Assuming the models have equal prior probability, the posterior probabilities can be approximated as P(Mi\D) = eBICi Bid 2BICi = Li — | p i log n where Li is the maximized log-likelihood and pi is the number of parameters in the model; BIC is an asymptotic approximation to the log Bayes factor under certain conditions. 11 After the B A R S algorithm has found a reasonable subset of the model, a regular M C M C chain is run to fit a neural network on each. From the estimated posterior probabilites, predictions (or other posterior summaries of interest) can be averaged over numerous models. 1.4 Datasets Used in the Thesis In this Section, I give a brief description of the three datasets which are analyzed in the thesis: 1. Ozone Data (Lee 2000b) 2. Tecator Data (Thodberg 1996) 3. Boston Housing Data (Neal 1996) 1.4.1 Ozone Data This dataset consists of 330 observations of groundlevel ozone in the Los Angeles well as the nine.explanatory variables explained below Covariate Description 03 ozone concentration vh altitude at which the pressure is 500 millibars wind wind speed (mph) hum humidity (%) temp temperature (degrees F) ibh temperature inversion base height (feet) dpg pressure gradient (mmHg) ibt inversion base temperature (degrees F) vis visibility (miles) doy day of year Figure 1.2 gives a scatterplot of the data. One should note non-linearities of the response (e.g. 03 vs. doy) and predictors which are highly correlated (e.g. temp vs. ibt). 12 -2 0 2 -2 0 1 2 -2 0 2 -1 1 2 3 0 10 30 Figure 1.2: Scatter Plot of Ozone Data Set. 13 1.4.2 Tecator Data The Tecator dataset involves predicting the fat content of meat based on spectrometer measure-ments at 100 absorbencies. Tecator is a Swedish company that manufactures the Infratec Food and Feed Analyzer, the instrument used to determine the absorbencies at 100 wavelengths in the region 850-1050 nm. The 10 variables P C I , . . .,PC10 are shown in Figure 1.3 as well as the fat content response, y. P C I , . . .,PC10 represent the first 10 principal components of the measurements at the 100 absorbencies. 1.4.3 Boston Housing Data The Boston Housing dataset was introduced by Harrison and Rubinfeld (1978) and later analyzed in Neal (1996). Though the initial goal was not prediction, our results can be compared to Neal's. The data are comprised of observations on 14 items for 506 census tracts in the metropolitan Boston area. Our goal is to make predictions of the median house value using the other 13 covariates. A description of the variables is given below: 14 Figure 1.3: Scatter Plot of Tecator Data Set. 15 Covariate Description C R I M per capita crime rate by town ZN proportion of residential land zoned for lots over 25,000 sq.ft. INDUS proportion of non-retail business acres per town C H A S Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) N O X nitric oxides concentration (parts per 10 million) R M average number of rooms per dwelling A G E proportion of owner-occupied units built prior to 1940 DIS weighted distances to five Boston employment centres R A D index of accessibility to radial highways T A X full-value property-tax rate per $10,000 P T R A T I O pupil-teacher ratio by town B 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town L S T A T % lower status of the population M E D V median value of owner-occupied homes in $1000's 1.5 Outline and Scope of the Thesis Though neural networks are capable of applications with multiple outputs, we limit ourselves here to a network of a single output unit. The methods described in the following chapters can easily be extended to this case. We also limit ourselves to the fixed architecture case, although we fit networks at differing numbers of hidden nodes in order to address the differences. I now define the notation that wil l be used for the remainder of the thesis. We consider a neural network with k hidden nodes for regression of y on a p-dimensional covariate x as follows k Vi = A) + -^{CZJ +ljXi) + £i, i = l , . . . , n . where the ej's are independent Gaussian with mean 0 and variance a2 and ip(-) is the logistic function. The parameters a = ( a i , . . . , a*;) and 7 = (71, • • • ,7fe) are referred to as the input-to-hidden biases and weights, respectively. The parameter 3 = (80, B\,..., Bk) is comprised of a single. 16 0 60 M _ W _ L 0.0 0.8 m m 4 7 200 700 . ' " I I I 0 300 j_J_U_l 10 40 4 J j j i tea Id •11™ 1 • t i f f 2*. 1 i t . P^Ti] I 1 1 1 1 1 1 0 60 IDIS BDB2 11111 0 15 T T T T t 0 60 T T T T 5 20 Ll i-i ffl! H m TTTTT 14 22 T T T 10 Figure 1.4: Scatter Plot of Boston Housing Data Set. 17 hidden-to-output bias (3Q) and the hidden-to-output weights 8\,.-- ,8k- Commonly, 8 is referred to as the regression parameter. For notational convenience, I also define here x = (x\,..., xn) and y = (yi,---,yn)-The thesis discusses and validates approaches to specifying the prior distribution over the neural network parameters based on geometrical properties. Chapter 2 discusses new prospects for prior distributions over the network parameters. Chapter 3 mentions the details of an M C M C sampling scheme. Empirical results of predictive performance for the three datasets are presented in Chapter 4. The last Chapter gives some concluding remarks and future directions, including a brief description of model selection and an analysis based on the Deviance Information Criterion (DIC). The implementation details and some auxiliary plots are given in an Appendix. 18 Chapter 2 Considerations for the Prior Distributions In this chapter, I discuss alternative specifications of the prior distribution over the network pa-rameters. First, I motivate the topic with a simple example and then the priors that we use are introduced. 2.1 Motivation To motivate the discussion, consider a synthetic dataset of 200 observations which has the following true relationship y = cos(2.5a;) + 1.5 sin(2.5a;) + x + e, where e ~ iV(0,CT 2 = .25). The left panel of Figure 2.1 shows the data as well as 2 neural network fits to the data using the methods of this thesis. The two fits were taken from 2 Markov Chains that were started at different initial values.. A neural network with 3 hidden nodes was used. The fit corresponding to the solid line in 19 Figure 2.1: Interpretation of Neural Network Parameters. (a) Data and Neural Network Fits (b) Components of Neural Network Figure 2.1(a) was the following y(x) = 274.41 -527:65 ^(+1.56 - 0.44a:) +394.28 V(+0.41 - 0.83a;) -150.42 ^(-0.04 - 1.28a;) where ip(-) is the logistic function. Considering the scale that these parameters are on, it is inter-esting that, of all possible parameter combinations, the Markov Chain sampled this particular one. The right panel of Figure 2.1 shows the 3 separated logistic nodes in addition to the intercept term. The fit corresponding to the dotted line on panel (a) is y(x) = +662.49 -621.33 V»(-0.58 + 0.58a;) +24.87 ^(+0.73 + 1.97a;) -662.05^(+0.79 - 0.53a;). Hence, there is no similarity for either the regression parameters or for the input-to-hidden weights. 20 Figure 2.2: Examples of Logistic Nodes. From both the numerical and the graphical displays, two central themes about neural net-works are evident. First of all, the parameters would be difficult to interpret, and thus in a Bayesian framework, specification of a prior over them is also very difficult. Secondly, it is clear that many explanations of the data are possible, that is, there are many combinations of network parameters 2.2 The Geometry of the Problem For simplicity, consider a covariate x which is unidimensional. Consider the form of one hidden node as where p represents the center of the curvature and v describes the degree of curvature. Figure 2.2 shows examples of logistic nodes. Here, we see the common 5-shaped form of the logistic function. By changing the argument to x/10 and lOx, the logistic node becomes effectively linear and effectively a step function over x 6 (—3,3), respectively. Due to this quite flexible setup, neural networks are able to capture quite an elaborate relationship. which bear no apparent resemblance, but give approximately the same relationship. 21 2.3 Weight Decay A common nuisance in neural network modelling is that of overfitting. That is, because the network is so versatile, it may end up fitting the random noise in the data instead of the true relationship. Given enough hidden nodes, the network may be capable of a fit with almost no error. Obviously, this would not generalize well for prediction. Analogous to regularization in smoothing splines, weight decay penalizes for large values of the network weights ( 7 1 , . . . , 7^ in our notation), the values which correspond to sharper increases in the activation function. Smaller values of the weights correspond to activation functions which are closer to linearity. Thus, weight decay, if implemented appropriately, reduces oscillations and encourages a smoother response surface and can hopefully prevent overfit. Also a contributing factor to the degree of smoothness of the neural network output is the number of hidden nodes, k. In the frequentist setting, a penalty term is added to the likelihood (and therefore can be referred to as penalized likelihood) which achieves weight decay. To illustrate, let k . f(xi) = 8o + Yl Bi ^ ( a i + 7jxi) • Assuming normally distributed errors, one would maximize (with respect to 6 and 7 ) the log-likelihood i=i j=i where r 2 determines the severity of the penalty. A tradeoff between fitting the training set well and having a generalizable model is necessary. Commonly, the choice of T 2 is picked based on cross-validation experiments. Alternatively, a dataset is broken down into a training, validation and test set. The networks under different choices for T 2 are fit on the training set and the fits of each are evaluated on the validation set. The value of r 2 which gives the best fit is chosen and the resulting performance on the test set is reported. In the Bayesian paradigm, the above approach is no different than specifying a prior on each 7 j . It would correspond to a normal prior centred at zero with variance r 2 for each 7 ^ . 22 2.4 Possibilities for the Prior As mentioned in the previous section, the typical prior on 7,- which achieves weight decay is: A n appropriate value of r 2 is generally not known a prior. In the Bayesian framework, this un-certainty is best captured in a prior distribution as is commonly done in the literature (Neal 1996, Rios Insua and Miiller 1998). The standard approach, essentially for mathematical convenience, is to take an inverted gamma ( T _ 1 ) prior on r 2 . Neal discusses priors which disperse mass over a very broad range on the positive real line. For example, when analyzing the Boston Housing data, he uses parameters that give r - 2 a mean of 100 and a variance 200,000 for each of the input-to-hidden weight groups. Rios Insua and Miiller use somewhat less disparate T _ 1 priors and give little guidance as to how they were selected. Though Lee advocates a prior which in effect has no weight decay, it has been our experience that weight decay is necessary for reliable prediction in the Bayesian framework, especially in long runs of the Markov Chains (see Section 4.1). For us, this motivates two further possibilities. First, we explore other senses of parsimony such as orthogonality of the weights 77 or additivity of the hidden nodes. Second, we consider priors which encourage the parameters to reside in an effective domain of interest. 2.5 Parsimony Priors Note that the prior on 7^ as in (2.1) combined with a F~1(a, b) hyperprior on the T 2 parameter is mathematically equivalent to an unconditional Multivariate-i distribution for jj. That is, 7 , ~ A T ( 0 , r 2 / ) for j = l,... (2.1) (2.2) Consider a reparameterization of (aj,jj) <—> (/J,J,WJ,\2) given by 'j ~ wvf" a n d (2-3) 23 so that This helps one to think of desired geometrical properties of the unit vector Wj which promote model parsimony. A couple of possibilites follow. 2.5.1 Orthogonality of Input-to-Hidden Weights Since one desires different hidden nodes to explain distinct components of the relationship, we may discourage wi and Wj from pointing in the same direction. One possibility is to add a penalty for larger values of the absolute inner product between the all unit vectors wt and Wj, perhaps in the same flavour as (2.2), such as The term fc^2_1^ Yli