Priors for Bayesian Neural Networks by Mark Robinson B.Sc, University of Guelph 1999 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Statistics) we accept this thesis as conforming to the required standard The University of British Columbia June 2001 © Mark Robinson, 2001 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of 5 4 4 ^ ^ The University of British Columbia Vancouver, Canada Date BU,?.4,-2 6d) DE-6 (2788) Abstrac t In recent years, Neural Networks (NN) have become a popular data-analytic tool in Statistics, Computer Science and many other fields. NNs can be used as universal approximators, that is, a tool for regressing a dependent variable on a possibly complicated function of the explanatory variables. The N N parameters, unfortunately, are notoriously hard to interpret. Under the Bayesian view, we propose and discuss prior distributions for some of the network parameters which encourage parsimony and reduce overfit, by eliminating redundancy, promoting orthogonality, linearity or additivity. Thus we consider more senses of parsimony than are discussed in the existing literature. We investigate the predictive performance of networks fit under these various priors. The Deviance Information Criterion (DIC) is briefly explored as a model selection criterion. i i Contents Abstract ii Contents iii List of Figures vi Acknowledgements viii Dedication ix 1 Introduction 1 1.1 Artificial Neural Networks 2 1.2 The Bayesian Approach 5 1.3 Current Approaches to Bayesian Neural Networks 6 1.3.1 Neal 7 1.3.2 Rios Insua and Miiller 8 1.3.3 Andrieu, de Freitas and Doucet 9 1.3.4 Lee 10 1.4 Datasets Used in the Thesis 12 1.4.1 Ozone Data 12 1.4.2 Tecator Data ; . . .- 14 1.4.3 Boston Housing Data 14 ii i 1.5 Outline and Scope of the Thesis : 16 2 Considerations for the Prior Distributions 19 2.1 Motivation . . : 19 2.2 The Geometry of the Problem 21 2.3 Weight Decay 22 2.4 Possibilities for the Prior 23 2.5 Parsimony Priors 23 2.5.1 Orthogonality of Input-to-Hidden Weights 24 2.5.2 Additivity of the Hidden Nodes . 24 2.5.3 Selection of a and b 25 2.6 Effective Domain Prior 26 2.7 Priors on B and a2 29 3 Inference via Markov Chain Monte Carlo 30 3.1 The Posterior Distribution 30 3.2 The MCMC Sampling Algorithm 31 3.3 Full Conditional Distributions 32 3.3.1 Derivation for a 2 32 3.3.2 Derivation for B 32 3.4 Acceptance Ratio Calculation for the M H Step 32 3.5 Posterior Inference 33 3.5.1 Importance Sampling 34 4 Empirical Results 35 4.1 Assessing Convergence . . . 35 4.2 Goals 37 4.3 Results on the Boston Housing Dataset 40 4.4 Results on the Tecator Dataset 42 4.5 Results on the Ozone Dataset . 43 iv 5 Discussions and Further Work 46 Appendix A: Implementation 51 Appendix B: Plots 57 Bibliography 63 v List of Figures 1.1 The Neural Network Architecture '3 1.2 Scatter Plot of Ozone Data Set . . . 13 1.3 Scatter Plot of Tecator Data Set. 15 1.4 Scatter Plot of Boston Housing Data Set 17 2.1 Interpretation of Neural Network Parameters 20 2.2 Examples of Logistic Nodes 21 2.3 A Piecewise Linear Approximation to the Logistic Function 27 2.4 A n Example of \JL outside the effective domain of interest. 28 4.1 M C M C Plots for the Neural Network Parameters for 10000 iterations of a network with 5 hidden nodes on the Ozone dataset. Prior distribution from Lee (2000a, 2000b). 36 4.2 M C M C Plots for the Neural Network Parameters for 10000 iterations of a network with 5 hidden nodes on the Ozone dataset. Prior distribution is the effective domain of interest 38 4.3 M C M C Plots of the a Parameter as a diagnostic for convergence of the Markov Chain. 39 4.4 Importance Weights for the Parsimony Priors (Based on 10000 Samples). 39 4.5 MSEs on the Validation Dataset (Boston Housing Data). Five runs from different initial values on five different splits of the data 41 4.6 Improvements (% Change in MSE) under Parsimony Priors (Boston Housing Data). Five runs from different initial values on five different splits of the data 41 vi 4.7 Improvements (% Change in MSE) under Parsimony Priors (Tecator Data). Ten runs from different initial values. 43 4.8 MSEs on the Validation Dataset (Ozone Data). Five runs from different initial values on five different splits of the data 44 4.9 Improvements (% Change in MSE) under Parsimony Priors (Ozone Data). Five runs from different initial values on five different splits of the data at z = 2, 3,4. . . . 45 5.1 Plots of DIC for Neural Network Models with 4, 6 and 8 Hidden Nodes for Multiple Runs and Splits. 50 vn Acknowledgements I would like to thank my supervisor, Dr. Paul Gustafson, for his support and encouragement throughout the process of producing this thesis. Also, thanks are owed to Dr. Harry Joe for his helpful comments on the thesis and for providing programming tips along the way. My time here at U B C would not have been the same had it hot been for the support of an incredible bunch of graduate students who helped me put things in perspective and who endured my high intensity. I graciously thank them for their support. Lastly, I would like to thank my parents, brothers and friends for their encouragement and especially to Lynn for her love and support. MARK ROBINSON The University of British Columbia June 2001 vii i To coffee and all things real ix Chapter 1 Introduction In recent years, there has been considerable interest in the use of neural networks by both the com-puter science and statistical communities. In the computing community, applications are usually associated with artificial intelligence (e.g. pattern recognition). For the statistician, the methodol-ogy is used primarily for flexible (non-linear) regression as well as classification. There is consider-able overlap in the two disciplines, especially in classification problems. Many reviews have been written from a statistical point of view (Cheng and Titterington, 1994; Ripley, 1994; Stern, 1996). As the name suggests, neural networks initially were derived in an attempt to model the parallel processing capacity of the human brain. Computers have a remarkable ability to perform routine calculation tasks in a fraction of the time a human can. However, a human can perform tasks such as identifying faces or recognizing characters with little or no effort; a computer performing the same task may require a considerable amount of time and a sophisticated algorithm. Artificial neural networks (ANNs) — the mathematical constructs discussed in this thesis — attempt to harness these parallel capabilities to analyze experimental data. In general, neural networks take a number of inputs (explanatory variables or covariates in statistical terminology) and via a collection of neurons (i.e. units which weight and sum a number of incoming variables to one output), generate a number of outputs (responses in statistical terms). For example, in a regression problem, the inputs may be some predictors x, which are used to calculate an output, y. The x and y can be scalar or vector quantities. In a classification problem, 1 I the outputs yi (i = 1, • • •, k) may represent the probability of being in the i-th group. From a computer science viewpoint, these networks are to be learned 1 from the data (i.e. the network parameters are to be estimated). For this thesis, it is desired to work under the Bayesian framework, in which one thinks of each parameter as having a distribution which represents belief about the value of that parameter. We will see shortly that this is a difficult task. 1.1 Artificial Neural Networks In the literature, ANNs go by many names: back-propagation NNs, feed-forward NNs and multilayer perceptron (MLP) NNs, all of which describe the same, general structure. The term back-propagations refers to the way that the derivatives are calculated for the optimization methods. Feed-forward refers to the forward directed links from the inputs to the outputs. The term M L P refers to the possibly multiple layers in the network which are connected by neurons. Every A N N consists of an input layer which is eventually connected to an output layer using 1 or more hidden layers. In getting from the inputs to the outputs, each link in the network contributes a weight and every node in the hidden layer forces that linear combination through an activation function. This process is repeated for each of the hidden layers. Commonly, biases are added at each stage before being passed through an activation function. A three layer (1 input, 1 hidden, 1 output) neural network is depicted in Figure 1. For a three layer network, the output layer is computed as a function of a linear combination (I.e.) of non-linear functions of a linear combination of the inputs. The neural network depicted in Figure 1 can be written mathematically as: f I.e. of non-linear- functions 1 I.e. of inputs J where p is the number of input units, H is the number of hidden units, q is the number of output l rrhe learning discussed here is often called supervised learning, in that the data are to be classified into known groups. Unsupervised learning involves classifying into groups which are not previously specified. 2 lutput Units Hidden Units Units Figure 1.1: The Neural Network Architecture. units and <f>o,<pi, - • • ,(J>H are activation functions. Here, the Vhk represent the H x q hidden-to-output weights Uih are the p x H input-to-hidden weights, bk are the output biases and are the hidden biases. There are also networks suggested which have skip layer connections, meaning there are direct links from the input to output layers which have corresponding weights. In this case, the 3 above model (1.1) becomes: Uk = 00 < i=i V H / V + X] Vhk<t>h ( CLh + uihxi h=l V i=l ) ) for A; = ! , - • • (1.2) skip layer terms In most cases, the 4>i, • • •, 4>H are taken to be the same sigmoidal function (e.g. the logistic function, (l + e~x)~1 or equivalently, the hyberbolic tangent function, (ex — e~x)(ex + e~x)~l). In a regression problem, 0o is taken to be the identity function and in a binary classification problem, </>o is commonly taken to be the logistic function. For example, a regression network with one input, four hidden units and one output unit with a tanh activation function would relate the input x to the output.y via the following model: y = b + v\ tanh(ai + u\x) + V2 tanh(a2 + U2x) + v$ tanh(a3 + u^x) + V4 tanh(a4 + u^x). In general, a neural network is a tool for establishing an intricate non-linear relationship in either a regression problem or classification problem. A very detailed non-linear relationship can be described by models such as (1.1) or (1.2). In fact, it can be shown (although no attempt is made here) that a three layer A N N can approximate any function (on a compact domain) to a desired level (Hornik et al. 1989) by having an appropriate number of hidden nodes. There are a few standard statistical methods which are special cases of the neural network. First of all, linear regression can be achieved with one hidden unit and identity functions for both <fio and <f>\. The linear logistic model can ascertained by using the identity function for 4>i and the logistic function for (f>o- Other methods such as projection pursuit regression and radial basis functions are special cases as well (Chen and Titterington, 1994). The competing statistical methods of ANNs are, for regression: multivariate adaptive re-gression splines (MARS; Friedman 1991), and the Bayesian version thereof; generalized additive models, non-parametric regression, smoothing techniques and, more recently, Bayesian regression modelling with interactions and smooth effects (B-WISE; Gustafson 2000). For classification, other statistical methods include (linear, quadratic, flexible) discriminant analysis and nearest-neighbour 4 methods. Classification and Regression Trees (Breiman 1984) also overlap many applications with neural networks. Neural networks have been used quite extensively in a broad spectrum of disciplines. A small scan of applications includes: the papermaking industry (Edwards et al. 1999), the dairy industry (Macrossan et al. 1999) and meteorology and oceanography (Hsieh and Tang 1998). 1.2 The Bayesian Approach In this thesis, I have chosen to take the Bayesian approach. The question arises: Why add an additional layer of complexity (computational or otherwise) to an already intricate problem? My reason is twofold. First, with the recent technology in Markov Chain Monte Carlo methods (see Chapter 3), Bayesian methods are becoming easier to implement. Second, the additional complexity also provides additional flexibility in terms of model capabilities and methods for inference. The rest of this section describes the Bayesian approach, as it applies to this thesis. Frequentist (or classical or Fisherian) statisticians assume their data follows, at least ap-proximately, some parametric model, from which they form the likelihood function (i.e. the joint distribution of the observed data but thought of as a function of the unknown parameters in the model). They find the parameters which maximize the likelihood, that is, the parameters which maximize the probability of seeing that data. They use elegant asymptotic theory to make inference on the parameters or functions thereof. Bayesian statisticians attacking the same problem, assume the parameters in the model each have a distribution of their own. First, one's beliefs about these parameters are represented in a distribution called the prior distribution. Using the laws of probability, these beliefs are updated to the posterior distribution on which inferences are based. The prior distribution can allow for quite a flexible analysis, and therefore have a tremendous advantage over frequentist methods. Not surprisingly, this greatest benefit is also the greatest criticism, and rightfully so. Are humans able to reliable express their beliefs in a mathematical function? Likely not. In practice, it is extremely difficult to specify prior distributions. As will be illustrated later in the thesis, priors for neural networks are especially difficult to define because the parameters may not have simple 5 interpretations. Quite generally, assume we have random variables Y i , . . . , Yn which generate data y = a prior distribution of the parameter 0, denoted by p{9), we can appeal to the laws of probability and form the posterior distribution of 9 in light of the observed data y, denoted by p(9\y). The calculation is as follows: In many cases, the posterior distribution is not in a form easily recognizable as one of the standard distributions. Under certain general conditions, Markov Chain theory allows us to gener-ate a Markov Chain which has the posterior distribution as its stationary distribution. Therefore, we follow a Markov chain for some period of time and monitor whether it converges. Assuming the chain has converged, we generate very large samples (i.e. Monte Carlo) from and summarize pos-terior attributes (or functions thereof) from these. This method of computation is called Markov Chain Monte Carlo (MCMC) for obvious reasons. Also of importance, especially here in this thesis, is the concept of a predictive distribu-tion. Suppose that we have a new observation y* that we wish to make predictions for based on the previous observations. Again, using the laws of probability we can generate the conditional distribution of y* given the observed data y, as follows: After determining the posterior distribution (perhaps by simulation), predictive inferences for this new observation can then be made. 1.3 Current Approaches to Bayesian Neural Networks In this section, we discuss the recent methodology for analyzing neural networks under the Bayesian framework. Here, we discuss the work of (j/i) • • • )Vn) from some parametric density f{yi\0) (sometimes written as fe(yi))- After we specify p(o\y) = f(y\o)P(9) ff(y\9')P(9>)d9< c< f(y\e)p(e). 6 • Neal (1996) • Rios Insua and Miiller (1998); Miiller and Rios Insua (1998) • Andrieu, de Freitas and Doucet (1999); and, • Lee (2000a, 2000b). In each case, I keep the notation used in the original papers in order to form a distinction. 1.3.1 N e a l Similar to that in the previous sections, Neal defines the network with one hidden layer as: fk{x) = bk + Yvjkhj{x) i hj(x) = tanh ^a,j + £ where fk(x) denotes the fc*n output. The biases for the output and hidden units are a,j and bk, respectively. The weight from the input unit to the hidden unit is denoted by Uij. The weight from hidden unit j to output unit k is denoted by Vjk. The tanh activation function is the same shape as the logistic function but ranges from — 1 to 1 instead of 0 to 1. Suppose u>ij is any of the network weights (either u or v above). Neal specifies priors on the weights hierarchically as follows: WiMli ~ - ^ ( 0 , ^ . ) , and 2 r - i ( a w ^ w \ ° W i . ~ V 2 ' 2 ) where T _ 1 represents the inverted Gamma distribution That is, if X ~ Gamma(a, b), denoted T(a, b), then 1/X ~ Inverted-Gamma(a, 6), denoted F~l(a, b). Notice that there is a hyperparam-eter specified for each unit that the weight leaves from. Neal does, however, consider special cases where each hyperparameter is the same across the group. For example, if all the inputs were to known be on an equal footing, one could take cr 2 1 = • • • = o\ = a^,. The hyperparameters aw and 7 UJW are selected so as to assume very vague prior information regarding o\ . Similarly, hierarchical priors are specified for the bias parameters. A n advantage of specifying a hyperparameter for the connections coming out of each input unit is, that if the weights of connections coming out of some input i are typically small, then the values of the parameter <r^ . would also be small, thus signifying little relevance to the targets. In this case, the impact of this input would be automatically downweighted. Mackay (1994) and Neal (1996) call this Automatic Relevance Determination (ARD). For inference, Neal (1993) uses a M C M C method called hybrid Monte Carlo based on dynamical simulation techniques to make updates of the network weights and biases. Also, if the full conditional distributions of the hyperparameters are known, Gibbs sampling steps can be performed on them. 1.3.2 Rios Insua and Miiller Rios Insua and Miiller (1998) define their model for a single output and p predictors as: M Vi = J2 & VKzf 7J) + £i> i = l,...,n where e^N(0,a2) and ^(*) = _ L _ , 1 + e z and Xi = (xio,xn,... ,XiP) and so that input-to-hidden biases are present, Xio = 1. They define hierarchical priors similar to Neal, as follows: 1. Pj\p,p,crj ~ N(/j,p,aj) where p.p ~ N(ap,Ap) and a^2 ~ T(cb/2JcbCb/2) ~ i V ( / i 7 , 5 7 ) where ^ 7 ~ N ( a 7 , Ay) and Sy 1 ~ Wish(c 7 , ( c 7 C 7 ) - 1 ) 3. a " 2 ~ T(s/2,sS/2) Their key observation is that conditional on the 7 = ( 7 1 , . . . , 7 M ) parameter, one simply has a standard linear model. As a result, the M C M C algorithm can be implemented efficiently by integrating out B = (Bi,... ,8M)- Metropolis steps are done on 7 after marginalizing over 8 B. Conditional on 7 and the hyperparameters, B is multivariate normal. Similar to Neal, all hyperpriors are conjugate and can be sampled efficiently. They extend their method to the variable architecture case, that is, where the number of hidden nodes is not fixed. Because there is nonnegligible uncertainty about whether a certain num-ber of nodes is adequate, this is a particularly appealing approach. They develop a reversible-jump M C M C algorithm (Green 1995) in which a Markov Chain jumps randomly to different architectures, therefore having differing parameter spaces. 1.3.3 Andrieu, de Freitas and Doucet Andrieu, de Freitas and Doucet (1999) define a model with k hidden nodes and c outputs of the form k yt = b + /3rx4 + ^2aLJ4>(\\xt-nj\\) + nu t = 1,...,N, j=i where n 4 are zero-mean independent Gaussian errors with constant variances o\,...,o2 for each of the c outputs and $(•) is a radial basis function (RBF) . The //,• parameters represent the radial centres. Common choices of the R B F may be: • Linear: 4>(rj) = n • Thin plate spline: (j>(rj) = n2 log n • Multi-quadric: (j>(rj) = \fn2 + A 2 • Gaussian: (j)(rj) = exp(-Ar? 2). The interpretations of the ensuing radial basis networks would presumably be quite different for these functions. Andrieu, de Freitas and Doucet (1999) define a prior distribution over the parameters (b,/3,a, fx, a2) as follows. A priori, the radial basis centres fij are uniformly distributed over the space of where the input data lie, conditional on k. That is, they are restricted to the hyper-cube, with one corner at all the maximum values of each of the predictors and the opposite corner at all of the minimum values of the each of the predictors. In this respect, the prior is slightly 9 data dependent. Conditional on the centres, the collection of regression parameters (b,/3,a) are specified to have a + fc)-dimensional Gaussian distribution centred at the origin with inverse variance-covariance matrix -^DfDi where Di is the design matrix including columns for k radial basis functions, d covariates and an intercept. Hence, the prior penalizes the radial basis functions for being too close. The parameter k is also taken to be random and is assumed to have a trun-cated (i.e. k < N — (d + 1)) Poisson distribution with parameter A. The truncation is necessary to avoid columns of Di where are linearly dependent. To finalize the prior distribution, Jeffrey's noninformative prior is used for a2 (i.e. p(of) = c~ 2 ) , and vague inverted Gamma and Gamma priors are used for each of 6f and A. To simplify the posterior distribution, the authors marginalize over (b,/3,a) (Gaussian), analogous to what is done in Rios Insua and Miiller (1998). Sampling of the posterior distribution with k fixed is similar to Rios Insua and Miiller (1998). Metropolis-Hastings steps are performed sequentially for each hidden node j = 1,. . . , k on fij. Following this, the regression parameters (b,/3,a) can be sampled jointly from a multivariate normal distribution. Similarly, the variances of and S2 can be sampled from inverted Gamma distributions. The hyperparameter A can be sampled using Metropolis steps. Andrieu, de Freitas and Doucet (1999) extend this to the random k case by appealing to the reversible jump M C M C method discussed in Green (1995), like Rios Insua and Miiller in the last section. 1.3.4 Lee Most recently, Lee (2000a, 2000b) has introduced a non-informative prior distribution over the parameters (/3,7, a2) to avoid the selection of a large number of hyperparameters. Lee implements a neural network model of the form M yi = A) + YP3i,(.xTn)-+ ti-, i = i,-.-;n where, as before ~ N(0,a2) and ib(z) = (1 + e~z)~l. He selects a prior which puts equal density on all regions of the parameter space (except on the variance a2 where the log scale is used). A restricted version of the standard Bayesian linear 10 regression prior (p(8,j,a2) ex. a 2 ) is used. Unlike linear regression, this improper prior for neural networks gives rise to an improper posterior. Define Z = {%} where _ 1 1 3 1 + exp ( -7 j 0 - Eh IjhXih)' In the case of logistic activation functions, a restriction of the form n n = {(6,7,a2) : \ l j h \ < Cn,\ZTZ\ > Dn) ensures linear independence of the logistic basis functions. Lee proves that posterior in this case is indeed proper and suggests values of Cn and Dn which work well in practice. For M C M C sampling, Lee uses a streamlined version (i.e. because of fewer parameters) of the fixed-architecture Rios Insua and Miiller algorithm. Lee migrates to a discussion of model selection using Bayesian model averaging. Briefly, because there is uncertainty as to whether any given model is the correct one (or that there are multiple explanations of the data), it is of interest to combine and average over many candidate models. A review of this technique is given in Hoeting et al. (1999). Using Lee's notation, the predictive distribution of responses Y based on observed data D and q candidate models M i , . . . , Mq is P(Y\D) = Y^P{Y\D',Mi)P(Mi\D). i Clearly, this is a weighted average where the weights are the posterior probabilities of each model. Unfortunately, the term P(M{\D), sometimes known as the evidence is typically a difficult quantity to estimate! Due to the potentially large number of possible models (subsets of the explanatory variables + number of hidden nodes), he introduces a Bayesian Random Searching (BARS) algorithm which randomly proposes jumps from model to model. Based on the Bayesian Information Criterion (BIC) 2 , these steps are accepted or rejected. Assuming the models have equal prior probability, the posterior probabilities can be approximated as P(Mi\D) = eBICi Bid 2BICi = Li — | p i log n where Li is the maximized log-likelihood and pi is the number of parameters in the model; BIC is an asymptotic approximation to the log Bayes factor under certain conditions. 11 After the B A R S algorithm has found a reasonable subset of the model, a regular M C M C chain is run to fit a neural network on each. From the estimated posterior probabilites, predictions (or other posterior summaries of interest) can be averaged over numerous models. 1.4 Datasets Used in the Thesis In this Section, I give a brief description of the three datasets which are analyzed in the thesis: 1. Ozone Data (Lee 2000b) 2. Tecator Data (Thodberg 1996) 3. Boston Housing Data (Neal 1996) 1.4.1 Ozone Data This dataset consists of 330 observations of groundlevel ozone in the Los Angeles well as the nine.explanatory variables explained below Covariate Description 03 ozone concentration vh altitude at which the pressure is 500 millibars wind wind speed (mph) hum humidity (%) temp temperature (degrees F) ibh temperature inversion base height (feet) dpg pressure gradient (mmHg) ibt inversion base temperature (degrees F) vis visibility (miles) doy day of year Figure 1.2 gives a scatterplot of the data. One should note non-linearities of the response (e.g. 03 vs. doy) and predictors which are highly correlated (e.g. temp vs. ibt). 12 -2 0 2 -2 0 1 2 -2 0 2 -1 1 2 3 0 10 30 Figure 1.2: Scatter Plot of Ozone Data Set. 13 1.4.2 Tecator Data The Tecator dataset involves predicting the fat content of meat based on spectrometer measure-ments at 100 absorbencies. Tecator is a Swedish company that manufactures the Infratec Food and Feed Analyzer, the instrument used to determine the absorbencies at 100 wavelengths in the region 850-1050 nm. The 10 variables P C I , . . .,PC10 are shown in Figure 1.3 as well as the fat content response, y. P C I , . . .,PC10 represent the first 10 principal components of the measurements at the 100 absorbencies. 1.4.3 Boston Housing Data The Boston Housing dataset was introduced by Harrison and Rubinfeld (1978) and later analyzed in Neal (1996). Though the initial goal was not prediction, our results can be compared to Neal's. The data are comprised of observations on 14 items for 506 census tracts in the metropolitan Boston area. Our goal is to make predictions of the median house value using the other 13 covariates. A description of the variables is given below: 14 Figure 1.3: Scatter Plot of Tecator Data Set. 15 Covariate Description C R I M per capita crime rate by town ZN proportion of residential land zoned for lots over 25,000 sq.ft. INDUS proportion of non-retail business acres per town C H A S Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) N O X nitric oxides concentration (parts per 10 million) R M average number of rooms per dwelling A G E proportion of owner-occupied units built prior to 1940 DIS weighted distances to five Boston employment centres R A D index of accessibility to radial highways T A X full-value property-tax rate per $10,000 P T R A T I O pupil-teacher ratio by town B 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town L S T A T % lower status of the population M E D V median value of owner-occupied homes in $1000's 1.5 Outline and Scope of the Thesis Though neural networks are capable of applications with multiple outputs, we limit ourselves here to a network of a single output unit. The methods described in the following chapters can easily be extended to this case. We also limit ourselves to the fixed architecture case, although we fit networks at differing numbers of hidden nodes in order to address the differences. I now define the notation that wil l be used for the remainder of the thesis. We consider a neural network with k hidden nodes for regression of y on a p-dimensional covariate x as follows k Vi = A) + -^{CZJ +ljXi) + £i, i = l , . . . , n . where the ej's are independent Gaussian with mean 0 and variance a2 and ip(-) is the logistic function. The parameters a = ( a i , . . . , a*;) and 7 = (71, • • • ,7fe) are referred to as the input-to-hidden biases and weights, respectively. The parameter 3 = (80, B\,..., Bk) is comprised of a single. 16 0 60 M _ W _ L 0.0 0.8 m m 4 7 200 700 . ' " I I I 0 300 j_J_U_l 10 40 4 J j j i tea Id •11™ 1 • t i f f 2*. 1 i t . P^Ti] I 1 1 1 1 1 1 0 60 IDIS BDB2 11111 0 15 T T T T t 0 60 T T T T 5 20 Ll i-i ffl! H m TTTTT 14 22 T T T 10 Figure 1.4: Scatter Plot of Boston Housing Data Set. 17 hidden-to-output bias (3Q) and the hidden-to-output weights 8\,.-- ,8k- Commonly, 8 is referred to as the regression parameter. For notational convenience, I also define here x = (x\,..., xn) and y = (yi,---,yn)-The thesis discusses and validates approaches to specifying the prior distribution over the neural network parameters based on geometrical properties. Chapter 2 discusses new prospects for prior distributions over the network parameters. Chapter 3 mentions the details of an M C M C sampling scheme. Empirical results of predictive performance for the three datasets are presented in Chapter 4. The last Chapter gives some concluding remarks and future directions, including a brief description of model selection and an analysis based on the Deviance Information Criterion (DIC). The implementation details and some auxiliary plots are given in an Appendix. 18 Chapter 2 Considerations for the Prior Distributions In this chapter, I discuss alternative specifications of the prior distribution over the network pa-rameters. First, I motivate the topic with a simple example and then the priors that we use are introduced. 2.1 Motivation To motivate the discussion, consider a synthetic dataset of 200 observations which has the following true relationship y = cos(2.5a;) + 1.5 sin(2.5a;) + x + e, where e ~ iV(0,CT 2 = .25). The left panel of Figure 2.1 shows the data as well as 2 neural network fits to the data using the methods of this thesis. The two fits were taken from 2 Markov Chains that were started at different initial values.. A neural network with 3 hidden nodes was used. The fit corresponding to the solid line in 19 Figure 2.1: Interpretation of Neural Network Parameters. (a) Data and Neural Network Fits (b) Components of Neural Network Figure 2.1(a) was the following y(x) = 274.41 -527:65 ^(+1.56 - 0.44a:) +394.28 V(+0.41 - 0.83a;) -150.42 ^(-0.04 - 1.28a;) where ip(-) is the logistic function. Considering the scale that these parameters are on, it is inter-esting that, of all possible parameter combinations, the Markov Chain sampled this particular one. The right panel of Figure 2.1 shows the 3 separated logistic nodes in addition to the intercept term. The fit corresponding to the dotted line on panel (a) is y(x) = +662.49 -621.33 V»(-0.58 + 0.58a;) +24.87 ^(+0.73 + 1.97a;) -662.05^(+0.79 - 0.53a;). Hence, there is no similarity for either the regression parameters or for the input-to-hidden weights. 20 Figure 2.2: Examples of Logistic Nodes. From both the numerical and the graphical displays, two central themes about neural net-works are evident. First of all, the parameters would be difficult to interpret, and thus in a Bayesian framework, specification of a prior over them is also very difficult. Secondly, it is clear that many explanations of the data are possible, that is, there are many combinations of network parameters 2.2 The Geometry of the Problem For simplicity, consider a covariate x which is unidimensional. Consider the form of one hidden node as where p represents the center of the curvature and v describes the degree of curvature. Figure 2.2 shows examples of logistic nodes. Here, we see the common 5-shaped form of the logistic function. By changing the argument to x/10 and lOx, the logistic node becomes effectively linear and effectively a step function over x 6 (—3,3), respectively. Due to this quite flexible setup, neural networks are able to capture quite an elaborate relationship. which bear no apparent resemblance, but give approximately the same relationship. 21 2.3 Weight Decay A common nuisance in neural network modelling is that of overfitting. That is, because the network is so versatile, it may end up fitting the random noise in the data instead of the true relationship. Given enough hidden nodes, the network may be capable of a fit with almost no error. Obviously, this would not generalize well for prediction. Analogous to regularization in smoothing splines, weight decay penalizes for large values of the network weights ( 7 1 , . . . , 7^ in our notation), the values which correspond to sharper increases in the activation function. Smaller values of the weights correspond to activation functions which are closer to linearity. Thus, weight decay, if implemented appropriately, reduces oscillations and encourages a smoother response surface and can hopefully prevent overfit. Also a contributing factor to the degree of smoothness of the neural network output is the number of hidden nodes, k. In the frequentist setting, a penalty term is added to the likelihood (and therefore can be referred to as penalized likelihood) which achieves weight decay. To illustrate, let k . f(xi) = 8o + Yl Bi ^ ( a i + 7jxi) • Assuming normally distributed errors, one would maximize (with respect to 6 and 7 ) the log-likelihood i=i j=i where r 2 determines the severity of the penalty. A tradeoff between fitting the training set well and having a generalizable model is necessary. Commonly, the choice of T 2 is picked based on cross-validation experiments. Alternatively, a dataset is broken down into a training, validation and test set. The networks under different choices for T 2 are fit on the training set and the fits of each are evaluated on the validation set. The value of r 2 which gives the best fit is chosen and the resulting performance on the test set is reported. In the Bayesian paradigm, the above approach is no different than specifying a prior on each 7 j . It would correspond to a normal prior centred at zero with variance r 2 for each 7 ^ . 22 2.4 Possibilities for the Prior As mentioned in the previous section, the typical prior on 7,- which achieves weight decay is: A n appropriate value of r 2 is generally not known a prior. In the Bayesian framework, this un-certainty is best captured in a prior distribution as is commonly done in the literature (Neal 1996, Rios Insua and Miiller 1998). The standard approach, essentially for mathematical convenience, is to take an inverted gamma ( T _ 1 ) prior on r 2 . Neal discusses priors which disperse mass over a very broad range on the positive real line. For example, when analyzing the Boston Housing data, he uses parameters that give r - 2 a mean of 100 and a variance 200,000 for each of the input-to-hidden weight groups. Rios Insua and Miiller use somewhat less disparate T _ 1 priors and give little guidance as to how they were selected. Though Lee advocates a prior which in effect has no weight decay, it has been our experience that weight decay is necessary for reliable prediction in the Bayesian framework, especially in long runs of the Markov Chains (see Section 4.1). For us, this motivates two further possibilities. First, we explore other senses of parsimony such as orthogonality of the weights 77 or additivity of the hidden nodes. Second, we consider priors which encourage the parameters to reside in an effective domain of interest. 2.5 Parsimony Priors Note that the prior on 7^ as in (2.1) combined with a F~1(a, b) hyperprior on the T 2 parameter is mathematically equivalent to an unconditional Multivariate-i distribution for jj. That is, 7 , ~ A T ( 0 , r 2 / ) for j = l,... (2.1) (2.2) Consider a reparameterization of (aj,jj) <—> (/J,J,WJ,\2) given by 'j ~ wvf" a n d (2-3) 23 so that This helps one to think of desired geometrical properties of the unit vector Wj which promote model parsimony. A couple of possibilites follow. 2.5.1 Orthogonality of Input-to-Hidden Weights Since one desires different hidden nodes to explain distinct components of the relationship, we may discourage wi and Wj from pointing in the same direction. One possibility is to add a penalty for larger values of the absolute inner product between the all unit vectors wt and Wj, perhaps in the same flavour as (2.2), such as The term fc^2_1^ Yli<j \ wIwj\ represents average absolute dot product amongst the pairs of network weights. If all input-to-hidden weights are perfectly orthogonal (assuming k < p), then f(wi,..., Wk) = 1 whereas if all input-to-hidden weights are linearly dependent, / ( i o i , . . . , Wk) = (1 + 6 _ 1 ) _ f c ( a + P ) / 2 , a smaller value. Here, o and b determine the degree of penalization as they would for the weight decay prior. Of course, in the Bayesian framework, really what is meant by adding a penalty is specifying a prior distribution. In this case, we have a prior which peaks in regions where the (wi,... ,Wk) are more orthogonal and has lowest density when they are close to the linear dependence. 2.5.2 Additivity of the Hidden Nodes Yet another possibility for model parsimony is to encourage sparse linear combinations, where there are a small number of non-zero entries in the weight vector. In the limiting case, total additivity of the hidden nodes would result in a network of the form k 24 whereby the activation function acts on only one covariate at a time and, if k < p, then it acts on a subset of the original variables. If the true relationship is extremely non-linear in a particular covariate, it may require more than one hidden node. Also, because covariate j may not necessarily correspond with node j, the alternative definition Xij* represents a reshuffling of the x variables. A neural network of this type may mimic somewhat the flavour of generalized additive models (Hastie and Tibshirani 1990), except that tp is fixed in advance. To promote a prior distribution which achieves some degree of additivity, we may penalize unit vectors of the form (^=,. . . , -^ =) in favour of vectors of the form (0 , . . . , 0 ,1 ,0 , . . . , 0). This is achieved by defining a function, for example g(u) = J2i \ui\ ~ 1) a n d defining the prior as k ( i \ - i s ¥ -/ a d d ^ i ' - - - ' ™ * ) a II [1 + T9(wj)\ . (2.5) j=i V 7 The function g(u) is maximized when u = (-^=,..., ^=) corresponding to each covariate in a hidden node having equal weight. Since this is a scenario we wish to discount, the prior affords it the least mass. Accordingly, the prior gives largest mass when the weight is solely on one covariate. Again, the hyperparameters a and b determine the degree of penalization. 2.5.3 Selection of a and b One approach for selecting the parameters a and b is that of a "factor of zn calculation. In the original weight decay prior on r 2 , the parameter a represents the degrees of freedom and ^1 is the scale of the Multivariate-^ distribution. First of all, we pick a to be small to induce heavy tails (i.e. small number of degrees of freedom in a i-distribution). To select a value for b, we use the following. In the case of orthogonality, we favour total orthogonality over total linear dependence by a factor of z. That is, /oj-th (total orthogonality) / o r t n (linear dependence) By the same approach, we favour total addivity over equal strength by a factor of z. 25 2.6 Effective Domain Prior From the reparameterization above in (2.3), one can think of wTx as resting most often within a so-called effective domain of interest as defined by (—A(w), A(w)). So that this definition is tenable, the x's are prescaled (and precentred), a common procedure for models of this type. One logical possibility, based on the standard deviation of the linear combination wjx and the notion of approximate normality (where 2 standard deviations covers most of the range of the data), may be A(w) = 2^wJRwj, (2.6) where R is the correlation matrix of x. Consider the logistic activation function, ib{z) = (1 + e - 2 ) - 1 . The function is well approxi-mated by a function of the form 0, z < c, z, —c<z<c, 1, z > c. Selecting c = 3 seems to work well, for example (see Figure 2.3). Large values of A typically induce a more linear activation function, at least over the range of OLJ + The slope of this linear section is controlled to some extent by the A parameter, but also by the Bj corresponding to the node. Hence^ we have a redundancy. In the following, we consider a restriction on A and \x (in terms of A = A(w)) which should help alleviate this redundancy. Let z = wTx and consider, for simplicity, one hidden node. In this case, we are considering y(z)=Po + MfZ~f" A where z S (—A, A ) . Restriction on A. Concerned mainly with the internal linear piece, suppose we focus on the situation where A - LL , - A - a — - — < c and —— > - c A A 26 Figure 2.3: A Piecewise Linear Approximation to the Logistic Function. which corresponds to A - / i A + fi A > and A > - . c c In this situation, the function is well approximated by the linear central piece, leading to a function of the form Po . z < - A , V{*) = { A) + / 3 i ( ^ ) -A<z< A , 60 + 81 z> A: But, there are 4 parameters (8O,8I,/J,, A) available to characterize this function, which is an obvious overparameterization. R e s t r i c t i o n on /z. Similarly, we could also restrict our attention to the situation where \i < | A | . For values of fx outside this region, the majority of the curvature of the logistic function is missed (see figure 2.4) and, again we encounter a relationship that can be expressed in fewer parameters. Hence, by restricting \x G (-A(w), A(w)) and A < A W + H ; w e can discount situations that have alternative representations. A parameter value outside this region will have a representative inside the region (or, on the boundary) which is able essentially to characterize the same functional 27 CO b to b ci CM b o b - A + A (1 Figure 2.4: A n Example of fx outside the effective domain of interest. form. Thus, we may restrict ourselves to this region with little or no effect on the model. In practical terms, though, we only encourage the parameters to lie in this region by specifying a prior distribution which has greater density in these areas. In our approach, we advocate the desired region of the parameters by specifying the prior distributions as follows: w ~ Uniform, (2.7) where the shape parameter of the T _ 1 mimics previous priors and the scale parameter gives a mode at the boundary, ( A W + H ) 2 . Essentially, we are downweighting large values of A for being redundant and small values of A as weight decay. Note also this is a weakly data-dependent prior via the correlation matrix R. The prior is self-calibrating because the calculation of A(w) requires no subjective input. 28 2.7 Priors on j3 and a2 In previous analyses of neural networks (Neal 1996; Rios Insua and Miiller 1998), the prior on the 6 parameter is of the same hierarchical structure as the prior on jj. That is, weight decay is also applied to the /3's. Such a prior would encourage ridge regression. Although ridge regression has a definite benefit in linear regression where the predictors are highly correlated, it is unclear whether it is useful in a neural network context. Thus, we follow Lee (2000a, 2000b) and take the standard non-informative prior for Bayesian linear-regression, p(0) oc 1. The posterior is proper despite using an improper prior. The prior on o2 is also taken from Lee, who uses p(c r 2 ) OC \ , which is both non-informative and improper. The prior distributes equal mass over all regions for log cr2 and also results in a proper prior. Again, this is the standard non-informative prior (i.e. Jeffrey's prior) for Bayesian linear regression. 29 Chapter 3 Inference via Markov Chain Monte Carlo In this chapter, I discuss the M C M C sampling scheme that was used to fit the neural networks. The reader is referred to the Appendix for the implementation details using S-plus and C. 3.1 The Posterior Distribution To make inferences and predictions, we need to be able to sample from the posterior distribution, denoted hereafter by p(B,a,7,a2\x,y). The posterior distribution can be expressed as p{B,a,j,a2\x,y) = p(y\x, B, a, 7, c r 2 ) • p(B, a, 7, a2), where p(y\x,B, 01,7, a2) is the likelihood and p(B,a,j,a2) is the prior on the network parameters and hyperparameters. For notational convenience in what follows, consider the following definition. Let Z be a n x (p + 1) design matrix with elements 1 if j = 1, tp(oij-i + iJ_]Xi) otherwise. Wi th this simplification, one can think of (ZB)i as the prediction for observation i. 30 In the regression context where the errors are assumed to be independent Gaussian (with constant variance cr2), the posterior distribution is of the form p(/3,a,7,a2|a;,y) oc — ^ exp ( - (y - ZB)T(y - ZB)\'- \ -p(a,j). (3.1) (a2)2. I 2az J az The priors given in Chapter 2 (equations 2.7, 2.8 arid 2.9) are not in the same parameterization as that above. Fortunately, I do not calculate the full distributions in the (a, 7) parameterization because, as I show later in the chapter, it is not necessary. The Jacobian required for this calculation cancels out in the Metropolis-Hasting acceptance ratio (see Section 3.4). The M C M C Sampling Algorithm is discussed first and the details follow. 3.2 The M C M C Sampling Algorithm The sampling of the posterior consists of the following steps: 1. Start with initial values for a ,7 and B (e.g. Unit Gaussian random variables). 2. Draw o~2 and B from their full conditional distributions (see Section 3.3). 3. For each j = 1,... , k, perform Metropolis-Hastings (MH) steps on i. Propose a candidate (ciy, 7^ ) ~ N((aj, 7,), c 2 I p + i ) ii. Accept (aj,jj) with probability min ( l , p ^ ^ f c y ) ) (see Section 3.4 for further details). 4. Repeat steps 2 and 3 for the desired number of iterations. The parameter c can be adjusted to achieve a desired acceptance rate. It may also be beneficial to sample aj and 7^ separately with corresponding tuning parameters c\ and C2 in order to optimize acceptance. 31 3.3 Full Conditional Distributions 3.3.1 Derivation for a2 The distribution of a2\a, 7,8 can be written as , 2, , f 1 (y-Z8)T(y-ZB)\ p ( a | . . . ) o c ( ^ j e x p j - - , j , which can be readily identified as an inverted gamma distribution with shape parameter ^ and scale parameter where SSE is the sum of squared deviations from the observations and the estimates, based on the current values of 0 ,7 and fi. 3.3.2 Derivation for 6 The distribution of @\a,j, o2 can be written as: p(8\...) cx e x p { - ^ ( y - Z / 3 ) r ( y - Z / 3 ) } oc e x p { - ^ [ / 3 T Z T Z / 3 - 2 / ? T Z T y ] } . From this representation, it is clear the full conditional distribution is multivariate normal with mean (ZTZ)~1ZTy and variance-covariance matrix a2(ZTZ)~l. 3.4 Acceptance Ratio Calculation for the M H Step The prior distribution on a and 7, as specified indirectly by (2.7, 2.8 and 2.9) is necessary for calculating the acceptance ratio of the M H Step. For what follows, assume (pi,w,\, A(w)) is calculated from (ctj,')j) via the reparameterization in- (2.3) and the calculation in (2.6). Similarly, (fij, Wj, Xj, A(WJ)) can be calculated from the proposal (aj, jj). Define f([i,w,\,A) for the hierarchical prior defined by (2.7, 2.8 and 2.9). Assume the probing density for the M H step is q(-, •). The Jacobian of the transformation would appear as a term in both the prior (in the new parameterization) and in the probing density and would cancel 32 out in the acceptance ratio. Also, because we chose a symmetric prior (i.e. Gaussian), the q(a,a) and q(a,a) terms, where a is the value at the current iteration and a.is a proposal, cancel out as well. At last, one is left with an acceptance ratio of the form: f(p,j,Wj, Aj, AJ) p{y\a,7,/3,a2) min 1, f{fj,j,Wj,\j,Aj) p(y\a,-y,(3,o-2) where f(p,,w,\,A) = -^exp 1 A 2 A + M P + 1 cA 3.5 Posterior Inference After allowing sufficient time for the Markov Chains to converge, the next step is to make inferences from the posterior samples. Because our main interest is prediction, I discuss making a prediction for a new observation, say y* based on predictors x*. Assume we have allowed the chain to converge and have taken a large number, N, samples thereafter, perhaps also thinning them out to reduce the autocorrelation. (This thinning is only a concern when computing simulation standard errors of the estimates, which is not the main focus here.) Let (0^k\ a ^ , 7 ^ ) represent the k^ sample from the posterior distribution. The prediction for this new observation y* at x* for sample k can be calculated as y ^ \ x l = ^ ) + J:0f )^f+it)x*). If N is sufficiently large and the samples are sufficiently independent, then one would have an adequate approximation to the predictive distribution of y(x*) based on the samples 1, 2 , . . . , N. To summarize in a point estimate, one might choose the mean of the predictive distribution at the point x*. This quantity can be estimated by E(Y\x = x*)*y(x*) = ^J2y{k)(x*)-k=l For the analyses presented in the next section, I calculate a mean square error (MSE) for the training and the test datasets. Assuming n observations in the dataset, this calculation is the 33 \ average squared deviation, as follows: MSE = -Yj{yi-y{xi))2. 1=1 3.5.1 Importance Sampling Importance sampling is a practical tool to take samples from one distribution but in calculating a summary statistic (i.e. expectation of some function), weight the estimates in order to simulate sampling from an alternative distribution. For our purposes, we wish to make predictions under the alternative parsimony priors but we wish to do it by weighting the samples obtained from the posterior based on the effective domain prior. In general, suppose we have sampled z\,..., zm from a distribution with density TTI but we want to estimate a function g(Z) under an alternative distribution with density 712- It is well known that MZ). In our case, we wish to explore whether the extra senses of parsimony provide any benefit over and above the effective domain prior. For that reason, the ratio (^ -) and thus the weights are simply the penalty terms (2.4 and 2.5), for the orthogonality and additivity priors, respectively. That is 7T2 = 7Ti • / o r t h for the orthogonality prior and 7T2 = 7Ti • / a c J ( j for the additivity prior. Therefore, to make a prediction under the orthogonality prior, one weights the sample as follows yorthC* ) = ^ A T " 7 . where hk = / o r t h ( ™ i f c ) > - - - > ™ i f c ) ) -A similar procedure is used for the linearity prior. An estimate of the MSE under these new priors can then be assessed. 34 Chapter 4 Empirical Results This chapter discusses the empirical results we have found for the datasets mentioned in Chapter 1. First, I discuss the approach we took to monitoring the Markov Chains and I follow with a discussion of the goals of the empirical evaluation. Finally, a description of the results on each of the datasets is given. 4.1 Assessing Convergence In my experience, assessing the convergence of the Markov Chains for Bayesian Neural Networks is very difficult. To illustrate this point, I present what I know now to be a typical plot of the sampled values of the network parameters. It is well known that these posterior distributions are multimodal and so textbook-style convergence is literally unheard of. Figure 4.1 shows plots of the sampled network parameters for a neural network with five hidden nodes on the Ozone data set. The prior distribution used for the neural network in Figure 4.1 is the non-informative prior of Lee (2000a, 2000b). The panels are labelled by the parameter they describe. The reader should notice that because there are a large number of 7 ^ parameters in the network fit, a summary measure (i.e. length) is used for monitoring purposes. Perhaps the most relevant and useful summary measure is that of the standard deviation a. Despite many possible explanations of the data through the network parameters, one would hope that the estimate of the 35 T r 1 1 1 r i 1 1 r~—i r T 1 1 1 1 r 0 . 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 0 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 0 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 Iteration Iteration Iteration a Ml m 0 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 0 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 0 2 0 0 0 4 0 0 0 6 0 0 0 8 0 0 0 Iteration Iteration Iteration Figure 4.1: M C M C Plots for the Neural Network Parameters for 10000 iterations of a network with 5 hidden nodes on the Ozone dataset. Prior distribution from Lee (2000a, 2000b). 36 standard deviation remains stable. Figure 4.2 shows a similar plot for a neural network with 5 nodes on the same data but using the effective domain prior discussed in this thesis. Although the scales are different in many occasions, a couple of comments are worth noting. Both networks are able to fit the data reasonably well, as suggested by the plots on a. One potential disadvantage of the Lee. prior is that under a prior with no weight decay, the size of each 7 vector increase without bound. Even in longer runs' of the chain (e.g. 50000), this gain in magnitude of each 7 is still observed. Therefore, for the monitoring of the Markov Chains, we simply use the M C M C plot of the standard deviation as our guide. To illustrate the stability of the neural network to the parameter cr, we made multiple independent runs of the chain and plot them together. Plots of a for five independent fits of the network are shown in Figure 4.3 for the same dataset and prior as in Figure 4.2. For the simulated sampling under the parsimony priors using importance sampling it is also important to monitor the importance weights to ensure that a small subset of the observations are not receiving the bulk of the weight and thus swaying the predictions accordingly. In general, the importance weights are well behaved. For example, Figure 4.4 shows importance weights for the penalty functions (2.4) and (2.5), corresponding to the orthogonality and additivity priors, respectively. * . 4.2 G o a l s In the following sections of this chapter, I present a study of the performance of the methods suggested in the last chapter. The goals of the study are as follows: Demonstrate that the prior suggested in this thesis has the ability to capture the relationship adequately well. Examine the potential gains in using the parsimony priors. Evaluate the sensitivity of priors to the factor of z calculation. 37 Po Iteration Iteration Iteration o m i m T 1 1 1 1 r i 1 1 r- 1 h 1 1 1 1 r 0 2000 4 0 0 0 . 6000 8000 0 2000 4000 6000 8000 0 2000 4000 6000 8000 Iteration Iteration Iteration Figure 4.2: M C M C Plots for the Neural Network Parameters for 10000 iterations of a network with 5 hidden nodes on the Ozone dataset. Prior distribution is the effective domain of interest. 38 Figure 4.3: M C M C Plots of the a Parameter as a diagnostic for convergence of the Markov Chain. Of course, there are many other factors which add extra variability to these assessments. For example, results may be highly dependent on the number of hidden nodes which are selected. Other sources of variability include the starting point of the chain or the split of the data which the network is fit and tested on. 4.3 Resul ts on the B o s t o n H o u s i n g Dataset First of all, I present an illustration of the adequacy of the fit of a neural network under the effective domain prior for the Boston Housing dataset (A plot of the data was given in Figure 1.2). For the analyses presented in this chapter, the networks were trained using a random split of the data of size 253, leaving another 253 observations in a validation dataset on which we present prediction results. From the monitoring of the o parameter, as mentioned earlier, I chose a burnin of 20000 iterations which seems more than adequate. I arbitrarily chose a Monte Carlo sampling stage of 100000 iterations but thinning to take every 10-th sample. Predictions were made for the 253 observations in the validation dataset for each the 10000 samples and averaged to make a point estimate (i.e. posterior mean of the predictive distribution). MSEs were calculated as the average squared deviation between the estimated and observed response over the 253 observations. • As suggested earlier, there are many potential sources of variability which may enter. For this reason, the analysis of the Boston housing data was repeated for: i) different random splits of the data; ii) runs of the Markov Chain from different starting points, and iii) on 4, 6, 8 and 10 hidden nodes. For each number of hidden nodes, Figure 4.5 shows the results for five runs of the network fit on five randomly selected splits of the set. In most occasions, though not all, there exists more variation between different splits than between multiple runs on the same split. The straight line superimposed on each of these plots is the MSE presented in Neal (1996) for his neural network with eight hidden nodes and a Gaussian error distribution (MSE=13.7). In Neal's analysis, split of 253/253 was also used. It is clear from our analysis that there exists differences between the splits and it is very unlikely that I have chosen the same split as Neal. That said, our neural network is able to perform adequately well in terms of prediction. 40 Figure 4.5: MSEs on the Validation Dataset (Boston Housing Data). Five runs from different initial values on five different splits of the data. 4 H i d d e n N o d e s 6 H i d d e n N o d e s § 0 0 ooo e 8 8 T r (a) 4 Hidden Nodes (b) 6 Hidden Nodes 8 H i d d e n N o d e s I O H i d d e n N o d e s (c) 8 Hidden Nodes (d) 10 Hidden Nodes Figure 4.6: Improvements (% Change in MSE) under Parsimony Priors (Boston Housing Data). Five runs from different initial values on five different splits of the data. (a) 6 Nodes, Orthogonality Prior (b) 6 Nodes, Additivity Prior 41 To evaluate the profit of the parsimony priors discussed in Chapter 2, we take the same approach of trying many runs of the chain on many splits. Figure 4.6 shows gains in MSE on the percentage scale (i.e. MSg(orthog.j)r^addit.)-MS£;(e.d.) } for & n e t w o r k w i t h g h i d d e n n o d e s H e r 6 ) points less than the superimposed line at zero represent a gain in performance. For completeness, the corresponding plots for 4, 8 and 10 hidden nodes are given in Appendix B . It seems that there are gains, though modest, to be made by using the additivity and orthogonality priors. For 6 hidden nodes, average gains of 0.15% and 0.12% for the orthogonality and additivity priors were recorded, respectively. The plots in Figure 4.6 as well as those in the Appendix also suggest that when gains are made in one prior, they are also observed in the other. For the Boston dataset, the selection of the hyperparameters for the prior used a factor of z calculation, as suggested in Section 2.5.3, with z = 2. In the next sections, we explore further the sensitivity of this choice. 4.4 Results on the Tecator Dataset In this section, we discuss the results of our prior on the Tecator dataset (see Figure 1.3), previously analyzed with a neural network by Thodberg (1996). From the previous analyses on.the Tecator dataset, the training and validation set had already been chosen. Hence, there was no need to make random splits of the data. The predictive ability of our neural networks for this dataset was not as good as those presented in Thodberg (1996). For most of their networks, a model selection procedure was used to the choose a best model and that may explain the better performance. For example, he used a generalized prediction error procedure, derived from Akaike's Information Criterion or Mackay's evidence (Mackay 1992). Here, I focus on the potential gains by using the priors which encourage parsimony. We wished to explore further the sensitivity of the predictive performance of the neural networks to the factor of z calculation. For our analysis, we chose z—2, 3 or 4. Figure 4.7 shows 10 fits of the network with 4 hidden nodes on the given training set from different starting points at each of the three values of z suggested above. Not surprisingly, we observe that the gains (or losses) are exaggerated for large values of z. That is, if there are gains 42 Figure 4.7: Improvements (% Change in MSE) under Parsimony Priors (Tecator Data). Ten runs from different initial values. (a) 4 Nodes, Orthogonality Prior (b) 4 Nodes, Additivity Prior to be made at z = 2, then those gains are accentuated with z = 3 or 4. One would expect, though, that there will be a point where a higher penalty wil l provide no further advantage. As was done before, the corresponding graphs for 6, 8 and 10 hidden nodes are left until Appendix B . 4.5 Results on the Ozone Dataset In this section, I describe the neural networks fit on the Ozone dataset (see Figure 1.2). Again, multiple splits of the dataset into training and validation subsets were done. Fol-lowing Lee (2000a), I used a training set of size 200, leaving 130 observations in the validation set. Figure 4.8 shows the MSEs on the 5 different fits of the network at each of 5 different random splits of the data for networks with 4, 6 and 8 hidden nodes. One may note the variation between different splits of the data is considerably higher than that within multiple runs of a split. The previous literature has compared different methods of analyzing these data by the multiple correlation coefficient, R2. For example, a generalized additive model was able to achieve R2 = 0.80 and a Lee neural network attained R2 = 0.79. Again, our method does not perform quite as well (R2 = 0.76) as any of those mentioned in Lee (2000a), but here we have not considered any model 43 Figure 4.8: MSEs on the Validation Dataset (Ozone Data). Five runs from different initial values on five different splits of the data. 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 Split Split Split (a) 4 Hidden Nodes (b) 6 Hidden Nodes (c) 8 Hidden Nodes selection. As in the previous sections, we wish to determine whether there exists any benefit in using the parsimony priors and also to observe how sensitive the analysis is to the calibration parameter, z. Figure 4.9 shows the gains for both parsimony priors on networks with 8 hidden nodes. Shown there are multiple runs from different splits of the data for z = 2,3 and 4. The same observations can be made for the ozone dataset as for the previous datasets. That is, there are some gains to be made, but in general they are small. In most cases, the parsimony prior does little or no harm. Also, as was noticed earlier, increasing z has an effect of exaggerating the gains. 44 Figure 4.9: Improvements (% Change in MSE) under Parsimony Priors (Ozone Data). Five runs from different initial values on five different splits of the data at z = 2, 3,4. Split Split (a) Additivity Prior, 2=2 (b) Additivity Prior, 2=3 (c) Additivity Prior, 2=4 Split (d) Orthogonality Prior, 2=2 (e) Orthogonality Prior, 2=3 (f) Orthogonality Prior, 2=4 45 Chapter 5 Discussions and Further Work Neural Network models for regression are extremely useful methods to have in a statistician's toolkit. Applying these methods from the Bayesian point of view, as was done here, provides the data analyst with a flexible and robust approach. Although we limited ourselves in this thesis to a univariate continuous response, the methods mentioned here easily generalize to a multivariate continuous response, a binary response or a categorical response. In this thesis, we have discussed a novel method for specifying the prior distribution over the parameters for a neural network with logistic basis units. Briefly, our approach looks at the geometry of the logistic basis functions and specifies a prior which, on one end, applies weight decay, and on the other end, reduces redundancy by probabilistically eliminating parameters values which essentially overparameterize the model. This is accomplished by stochastically restricting the parameters to an effective domain of interest. Moreover, we introduce priors which encourage notions of parsimony, such as orthogonality of the input-to-hidden weights or additivity of the hidden nodes. These approaches were developed in Chapter 2. Following the lead of Rios Insua and Miiller (1998) and Lee (2000a, 2000b), methods for sampling the posterior distribution via M C M C are discussed in Chapter 3. Also mentioned is the method of importance sampling, which is used to assess the worth of the parsimony priors. Chapter 4 describes and discusses a comprehensive comparison study of the priors discussed in Chapter 2. The study covered many extra sources of variability such as: the split of the data, 46 multiple fits from different starting points and the sensitivity of calibrating the prior. The study showed that the effective domain prior was able to capture the relationship with similar performance as the previous methods suggested by Neal, Rios Insua and Miiller and Lee. A n examination of the parsimony priors was conducted at the same time. For each of the three datasets, the orthogonality and additivity afforded only a small influence. In some cases, it appears to have shown a small gain in predictive performance and in others it showed little or no gain. There were no cases where the parsimony priors had an adverse effect on predictive performance. In summary, the effective domain prior provides a valid and useful specification of the prior distribution over neural network parameters incorporating both weight decay and removing redun-dancy. The parsimony priors do not seem to provide much in the way of predictive performance, however, they also do little harm. Perhaps other senses of parsimony could prove useful. Future Directions: Model Selection. Briefly mentioned in the literature review in Chapter 1, model selection is an important part of effectively using a neural network model for prediction. Model selection encompasses both the selection of an adequate network architecture (i.e. number of hidden nodes) and a selection of a reasonable subset of the explanatory variables. In this thesis, one objective was to specify a prior distribution which indirectly encourages parsimonious models, and hopefully sidesteps some of the questions of model selection. However, the notions of additivity of the hidden nodes and orthogonality of the input-to-hidden weights were ineffective in providing major gains in predictive performance. A recent development in the area of Bayesian model selection is the work of Spiegelhalter, Best and Carlin (1998). Since many model selection criteria (e.g. Akaike's Information Criterion (AIC), Bayesian Information Criterion, Minimum Description Length) are calculations based on the likelihood evaluated at the maximum likelihood estimator, it is unclear how to do the corresponding calculation in a Bayesian setting. Spiegelhalter, Best and Carlin (1998) present an alternative strategy based on M C M C sam-pling output. They propose the Deviance Information Criterion (DIC), which is based on the 47 posterior distribution of the deviance statistic D(0) = -21ogp(y|0)+21og/(y). The second term assumes f(y) to be a perfect predictor (saturated model) which gives probability 1 to each observation and therefore has no effect on model selection. Typically, the deviance is used to summarize the goodness-of-fit of the model. In the Bayesian setting, one can summarize the fit of the model by the posterior mean of the deviance. For neural networks for regression, D reduces where M is the Monte Carlo sample size and 9\y denotes the posterior distribution of the parameters. Based on an analogy with A I C , the effective number of parameters, pp is defined by pD = D-D{E0\y{0}) = D-D(0). This has shown to be asymptotically non-negative (Spiegelhalter et al., 1998) As with other model selection criterion, the DIC is combined of a goodness-of-fit term and a penalty for complexity; it is defined as mc = D + P D . DIC mimics the form of the A I C . The AIC adds a penalty term (for the number of parameters) to the deviance evaluated at the maximum likelihood estimator. Similar to other measures (e.g. AIC) , smaller values of the DIC are preferred. Spiegelhalter et al. (1998) make the argument that the DIC is a natural generalization of the A I C . However, they give the following warning for the use of the DIC: "We need to emphasise that we do not recommend that DIC be used as a strict criterion for model choice or as a basis for model averaging." DiCiccio et al. (1997) discuss many methods of computing Bayes factors, a problem that parallels model selection. They mention that their methods are effective in situations where the posterior distribution is well behaved, a situation which is not well suited to neural networks. Since the DIC discussed above is an essentially a "for free" calculation once the Monte Carlo sample has been taken, we briefly explore its worth in the comparison of competing neural network 48 models. Our key consideration is the dependability of the estimate. Our knowledge of monitoring M C M C plots for neural networks would suggest that D will be stable, because it is composed of SSE, as well as the sample variance, a2. However, the deviance evaluated at the posterior mean of the parameters may be much less sound because of the multimodality. The following table shows a summary of the components of the DIC measure for repeated runs (on the same split) of a neural network'on 4, 6 and 8 hidden nodes (on the Boston dataset). Nodes D D(9) PD DIC 4 1684.183 1641.259 42.924 1727.107 4 1721.168 1657.078 64.09 1785.258 4 1668.159 1637.941 30.218 1698.377 4 1573.388 1531.038 42.35 1615.738 4 1729.801 1657.674 72.127 1801.928 6 1706.928 1678.629 28.299 1735.227 6 1992.409 1909.299 83.11 2075.519 6 1958.468 1908.796 49.672 2008.14 6 1882.927 1760.677 122.25 2005.177 6 1718.773 1640.565 78.208 1796.981 8 1886.049 1776.907 109.142 1995.191 8 2220.292 2135.306 84.986 2305.278 8 1873.486 1845.666 27.82 1901.306 8 1787.058 1838.826 -51.768 1735.29 8 2471.422 2542.731 -71.309 2400.113 As expected, one can see that the DIC criterion is quite variable even with respect to the ' starting value of the chain. Furthermore, variablility of the DIC from run to ruri increases as the number of nodes increases. It would seem as though the neural network with 4 hidden nodes provides a better model in this case. Figure 5.1 shows plots of DIC repeated for multiple splits. Again, one sees that the neural network with 4 hidden nodes provides the "better" model and it is clear that the DIC is much less stable for the model with 8 hidden nodes. 49 o C D o C O CXI o o cvi o o CNJ CVI C D C D C D C \ l C D C D O O C D C D C D 3P 2 X 4 H i d d e n N o d e s O 6 H i d d e n N o d e s + 8 H i d d e n N o d e s o 8 1 3 Split 4 o o Figure 5.1: Plots of DIC for Neural Network Models with 4, 6 and 8 Hidden Nodes for Multiple Runs and Splits. One alternative to model selection is that of Markov Chain Monte Carlo Model Composition ( M C 3 ) (Raftery et al. 1997) whereby M C M C is performed on the model space. They apply this method in the case of linear regression models. Further work is required in the current domain to determine the feasibility for neural network models. 50 Appendix A: Implementation The implementation of the neural networks combined the use of both programs written in C and R (see Ihaka and Gentleman 1996). Taking advantage of the speed of a programming environment such as C, the M C M C sam-pling steps and the prediction calculations were performed in C. These routines were called within R using the .C function. The outputs and results were plotted within R. Because of the similarity of R and S-plus, a similar implementation could be performed in S-plus. Pages 52 to 54 show the C programs. Pages 55 to 56 show the R programs. 51 52 53 > C 0> • H rH & 8 • H M o o> .3-3 <a 3 o " ftT) a; c CO M -> > Q d 3 3 o o o TJ TJ *0 £1 41 3 > 3 13 3 S G Z ft' - . 0) : «-t DI M 1, I (0 U S u l l ? O i > s — II ra M s =r N tfl N Z w d Qi JJ £ Pt (0 (V * a * * g CQ II (U II II E M V M V V (0 3 • n h - n - n O U V ••> '+ p N ^ o ^ u — n u it I — > — CN •rt ! Q N f—. »-< + • H + [/) Q) -rt z D) — U 43 & B (0 5 u W I 3 — V O -H • H a i <U JJ ' VI (/> (fl <U Z f N QJ jQ W 3 54 e u CN 33 ~ j I « S ~ 33 + a © w —- a -T3 flj * * •*->. ~ H 33 T j — ^ rH * S + + rH U fiaW ~ ± c + 2 w * + "T ^ ft E 33Vj a e a — i — -d O — » " i-H » — — • £ £ in ~ ro in X X m </></> en v l O i O i " AJ AJ * — n) ifl Q I * -H TJ T» Q> + — n ae rH £ — 0 O t U M V rH C C — 1 I AJ AJ v v o) at tt tt a z > > AJ fO - H m j> j> 0) H H 3 o o h - » C c , — •H ~ O X * C H <d nl id ^ P £ M AJ AJ — c a o •H a ~ p a • « t-t QJ « > <U rH • a u aa — u i a — x v H •» I " «rt I i— • a — y — u (8 (tJ (11 (0 flj S I ! ? § , ! , •H io ra V -H -H V 0,5 > H > > N . « O S o -Q, > M V> M >• W fi N (3 M II N — — — « 10 » U U «H a; AJ m m to B > ll E — «/> — AJ N 55 •g 99 gg • w TJ TJ TJ a, <D JJ JJ « o i l * ...8 i n in a I I) 01 ID I JJ JJ 10 H -rt 'rt J JJ JJ * I) M W tfl Ll 0> Q) £ N * * VJ H H H T3 i ui to • O H fN «D • Q> 0) QI -H U> JJ JJ JJ » rH JJ JJ JJ QI nj •—I to to to to JJ m <y 0) <D B nj > JJ JJ JJ rH — ( j p, a a « u M I I I > • Q >,>,>, - JJ — V> 0* V> rH tO Tj TJ TJ TJ • QI 5 •H -H -rt 01 >i P id (d nJ id • JJ JJ JJ > (d id id - D •• Tl TJ TJ o J*.-* § 5 S • <u rt — ^ -rt > > > tO > i_> JJ JJ -H o •• ra ra >,-. JJ. xi B JJ <d u — to TJ 3 * o to TJ (N H 10 z . e-a c v — — •rt O JJ — ^ M w n a a o -rt Qi Q) J 3 i JJ JJ - -J id rt i B 0 W I -rt -rt to > JJ JJ X JJ rtI 01 Q» Q> rH B TJ • S B B » • -3 > CI rH t t i '—• id V V V U I > 0 rH tN TJ TJ 01 QI QI 9 rH H to W 10 O rt Q B B B ^ > rH rH rH JJ U JJ - io io r • • -> > i O rH • ra 9 •rt € I 10 JJ IS •rt M CM JJ QI M tJ 0) 01 Bl W *W • JJ QI • 10 -H Ifl >1 rl t*H G o - • - M rH rH X U • (0 <A QI 01 > JJ (7 Ul . •rt tfl Q O «M - • • • tO JJ 0) Rt -rt 10 JJ O «W B •S K O r i QI • ra I tr oi > n w -i I i u > n JJ TJ • tfl -H *rt M M UH rH — o — rt JJ H O > tO tH I • •rt 0) QI 0) rH O* 10 tfl - r « e s 8 a rt u JJ oi ra a JJ id • io ja v TJ u 1 I a i 2 aw > C B C 3 u 11 tl ss li 5 o S u u B e d ^ M 13 5 -O W 5 2 § 3 ! ra id JJ JJ rH •8 *3 i V Oi D O M -rt . a to c to . -rt « */> ra — c jq m -rt - O TJ TJ a> w w * w 3 S g ^ id id ~ * TJ TJ T1 8*rH tf < <• M M M C — — U-i 5 X X - -rt -rt Ifl 1, M M 0» I JJ JJ JJ H a x i T l O) -rt \ * *H M * J — tfl • IMrl 1 fi -rt 0) J I > - - — v • a a to to 0) 67 M u u u O O I I U U V V M U H N , 01 Of JJ jJ _ D 1 C* tn p> r to to 3 3 E r* 0 B a o -n oi QI ra JJ JJ rH ra ra 3 B B U -rt -rt rH JJ JJ ra to tn u 01 Q) * TJ >i B •H t/> </> rH TJ TJ Id -rt >rt > H H • ra ra a rt > > *» JJ • • T) •3 * S " H TJ Ifl (0 rH — u n ra U TJ TJ > • to v v ra TJ >.Q -rt JJ i • JJ ra to v s to > QI ra QI • ; 5 1 - I •» _ v — v ra «-U-l r— M tn JJ • TJ C — . 10 (A — -rt tfl QI 0) tfl I tfl J >, j j U V tfl rt B ra p —'rt TJ * -rt U — — •• ra JJ QI Q — H M 1/1 O >i»W TJ QI tO E -rt > ra •«-. T) • > ~ rt W • TJ rH 01 tO 'rt Id JJ M rH > ra o « • B M > 3 "H M • ra JJ oi o M to cr >i TJ 0> 10 E E ^ ra *A ~ J3 fi a u •H </> + ra fi ra x: -rt — — u> ra -o c 3 — •rt 10 tfl ^ ra - E IN Xi (N 3 •a § T J i JJ JJ , w j V) U) >rt OJ 01 3 IM - — — ra • < rt H TJ U % ft D> fd t, Di a - TJ > i0 ro - — • • JJ fd i ra ra -H -JJ U <*-i ra ro * ra i I TJ TJ O i J | • Qt I QI XI I : c rj B JJ I QI QJ AJ '»H E E *rt rH ra f P O l Q I J J V V V U C J t l U — — -rH rarara o H N TJ <o e S E w • • • c -c •rt -rt -rt w QJ o) a» p u j J j J j J S l O I / l t O O w io io • E s E u >, OJ QI QI — • • • — e 56 Appendix B: Plots In the following pages, the extra plots referred to in Chapter 4 are given. 57 nor les °?lf83§ennaliKdPers°r Split Split 0|fl8ffiKYKS?r J t iv i ty Prior idden Nodes Split o o o 0 o o 8 o 0 8 0 © 8 3 Split Aq'dJtiyity Prior ' odes Split Boston Dataset. 58 Additivity Prior 4 Hidden Nodes Additivity Prior 6 Hidden Nodes Additivity Prior 8 Hidden Nodes Additivity Prior 1 0 Hidden Nodes Tecator Dataset. 59 Additivity Prior 4 Hidden Nodes Additivity Prior 6 Hidden Nodes Additivity Prior 8 Hidden Nodes Additivity Prior 1 0 Hidden Nodes Tecator Dataset. 60 oo so- 0 1." 3SW u| sBueuQ % 0 I- Z- E -3Sim U l aBuBijo % .2 Ii o-cn" o c 5= —I 1 — OH SO 0 0 SO- 0 1 -3 S W u| eBueqo % o o Io o .e a O C oxu o o o 0 0 9 0 - O'l- 0 l~ Z - E -3SW Ul S B U E U . 0 % .2 a Q-oT o c oxu o-a . C T J i — i — 01 so oo so- a i -3SW u ! eSueuo % 0 0 SO- <n-3Sinl U | eBueu.0 % D - C B £8 O C O X D J C T } o v- • z- e-3SIM Ul e6uBL|0 % Ozone Dataset. o d o o ~T 1— 01 SO 0 0 so- O'l-BSlftl ui eBuBgo % 61 S'l S'O 9 0 - S'l~ 3SW U! o6ireu,o % 9'I. 9 0 9 0 - S ' l -3 S H ui 96UEL|0 % 91 9 0 9 0 - S ' l -3SlNU!66ueuo'% i—i—r S'l S'O 9 0 - S l " 3SlfN "! SBUBLIO % o o op o i—i—r i — i — r 91 9 0 S O - S'l 3 S H U| aBueno % S'l- S'O S'O- S l " 3SW Li| O6UBL|0 % 91 9'0 S'O- S ' l -3SW u| 86UBMO % CM II i-N If g (0 0 00|30 i—i—r S'l S'O S'O- S ' l -3S1M U| efiuBiio % Ozone Dataset. i—i—i—l—i—i—r S'l S'O S'O- S ' l -3SW U! O6UBMO % 62 Bibliography [1] Andrieu, C, de Freitas, J .F .G. and Doucet, A . (1999). Robust Full Bayesian Learning for Neural Networks. Technical Report 343, Engineering Department, Cambridge University. [2] Breiman, L , Friedman, J.h., Olshen, R . A . and Stone, C .J . (1984) Classification and Regression Trees Monterey: Wadsworth and Brooks/Cole. [3] Cheng, B . and Titterington, D . M . (1994). Neural Networks: A review from a statistical per-spective (with discussion). Statistical Science, 9, 2-54. [4] DiCiccio, T .J . , Kass, R .E . , Raftery, A . E . and Wasserman, L . (1997). Computing Bayes Factors by Combining Simulation and Asymptotic Approximations. Journal of the American Statistical Association, 92, 903-915. [5] Edwards, P.J, Murray, A . F . , Papadopoulos, G. , Wallace, R., Barnard, J . and Smith, G. (1999). The Application of Neural Networks to the Papermaking Industry. IEEE Transactions on Neural Networks 10(6),1456-1464. [6] Friedman, J .H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics. 19, 1-141. [7] Green, P.J. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4),711-32. [8] Gustafson, P. (2000). Bayesian Regression Modelling with Interactions and Smooth Effects. Journal of the American Statistical Association. To appear. 63 [9] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. London: Chapman and Hall. [10] Harrison, D. and Rubinfeld, D .L . (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102. [11] Hoeting, J .A. , Madigan, D., Raftery, A . E . and Volinsky, C.T. (1999). Bayesian Model Aver-aging: A Tutorial (with discussion)." Statistical Science, 14,4,382-417. [12] Hornik, K , Stinchcombe, M . and White, H . (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5), 359-366. [13] Hsieh, W . W and Tang, B . (1998). Applying Neural Networks to Prediction and Data Analysis in Meteorology and Oceanography. Bulletin of the American Meteorological Society, 79(9),1855-1870. [14] Ihaka, R. and Gentleman, R. (1996) R: A Language for Data Analysis and Graphics. Journal of Graphical and Computational Statistics, 5, 299-314. [15] Lee, H . K . H . (2000a). A Noninformative Prior for Neural Networks. Technical Report 00-04, Duke University, Institute of Statistics and Decision Sciences. [16] Lee, H . K . H . (2000b). A Framework for Nonparametric Regression Using Neural Networks. Technical Report 00-32, Duke University, Institute of Statistics and Decision Sciences. [17] Lee, H . K . H . (2000c). Model Selection for Neural Network Classification. Technical Report 00-18, Duke University, Institute of Statistics and Decision Sciences. [18] Mackay, D.J .C (1992). The evidence framework applied to classification networks. Neural Com-putation. 4, 720-736. [19] Mackay, D.J .C (1994). Bayesian Non-Linear Modelling for the Energy Prediction Competition. ASHRAE Transactions, 100, Part 2, 1053-1062. Neural Computation. 4, 720-736. 64 [20] Macrossan, RE. , Abbass, H.A., Mengersen, K., Towsey, M. and Finn, G. (1999). Bayesian Neural Network Learning for Prediction in the Australian Dairy Industry. In Lecture Notes in Computer Science, 1642, 395-406. [21] Miiller, P. and Rios Insua, D. (1998). Issues in Bayesian Analysis of Neural Network Models, Neural Computation, 10, 571-592. [22] Neal, R.M. (1993). Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto. [23] Neal, R.M. (1996). Bayesian learning for neural networks, New York: Springer-Verlag. [24] Raftery, A.E.,. Madigan, D. and Hoeting, J.A. (1997). Bayesian Model Averaging for Linear Regression Models. Journal of the American Statistical Association. 437, 179-191. [25] Rios Insua, D. and Miiller, P. (1998). Feedforward Neural Networks for Nonparametric Regres-sion. In Practical Nonparametric and Semiparametric Bayesian Statistics (Dey Dipak, Peter Miiller, Debajyoti Sinha, editors) 181-193. Springer, New York. [26] Ripley, B.D. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society B, 56, 409-456. [27] Spiegelhalter, D.J., Best, N.G. and Carlin, B.P. (1998). Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Applied Statistics, to appear. [28] Stern, H.S. (1996). Neural networks in applied statistics. Technometrics, 38, 205-220. [29] Thodberg, H.H. (1996). A Review of Bayesian Neural Networks with an Application to Near Infrared Spectroscopy. IEEE Transactions on Neural Networks. 7(1), 56-72. 65
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Priors for Bayesian Neural Networks
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Priors for Bayesian Neural Networks Robinson, Mark 2001
pdf
Page Metadata
Item Metadata
Title | Priors for Bayesian Neural Networks |
Creator |
Robinson, Mark |
Date Issued | 2001 |
Description | In recent years, Neural Networks (NN) have become a popular data-analytic tool in Statistics, Computer Science and many other fields. NNs can be used as universal approximators, that is, a tool for regressing a dependent variable on a possibly complicated function of the explanatory variables. The NN parameters, unfortunately, are notoriously hard to interpret. Under the Bayesian view, we propose and discuss prior distributions for some of the network parameters which encourage parsimony and reduce overfit, by eliminating redundancy, promoting orthogonality, linearity or additivity. Thus we consider more senses of parsimony than are discussed in the existing literature. We investigate the predictive performance of networks fit under these various priors. The Deviance Information Criterion (DIC) is briefly explored as a model selection criterion. |
Extent | 2868929 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-08-06 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0090168 |
URI | http://hdl.handle.net/2429/11911 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2001-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2001-0583.pdf [ 2.74MB ]
- Metadata
- JSON: 831-1.0090168.json
- JSON-LD: 831-1.0090168-ld.json
- RDF/XML (Pretty): 831-1.0090168-rdf.xml
- RDF/JSON: 831-1.0090168-rdf.json
- Turtle: 831-1.0090168-turtle.txt
- N-Triples: 831-1.0090168-rdf-ntriples.txt
- Original Record: 831-1.0090168-source.json
- Full Text
- 831-1.0090168-fulltext.txt
- Citation
- 831-1.0090168.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0090168/manifest