BAYESIAN CURVE FITTING WITH ROUGHNESS PENALTY PRIOR DISTRIBUTIONS by MD MUSHFIQUR RAHMAN M.Sc, University of Dhaka, 1996 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Statistics) THE UNIVERSITY OF BRITISH COLUMBIA July 2005 © Md Mushfiqur Rahman, 2005 Abstract In statistical research with populations having a multilevel structure, hierarchical models can play significant roles. The use of the Bayesian approach to hierarchical models has numerous advantages over the classical approach. For example, a spline with roughness penalties can easily be expressed as a hierarchical model and the model parameters can be estimated by the Bayesian techniques. Splines are sometimes useful to express the rapid fluctuating relationship between response and the covariate. In smoothing spline problems, usually one smoothing parameter (variance component in Bayesian context) is considered for the whole data set. But to deal with rapidly fluctuating or wiggly data sets, it is more logical to consider different smoothing parameters at different knot points in order to find more efficient estimates of the the regression functions under consideration. In this study, we have proposed the roughness penalty prior distribution considering local variance components at different knot points and call it Prior 2. Prior 2 is compared with Prior 1, where a single global variance component is considered for the whole data set, and with Prior 3, where no roughness penalty terms are considered ( i.e., the parameters at different knot points are assumed independent). Performance of the proposed prior distributions are checked for three different data sets of different curvature. Similar performance of Prior 1 and Prior 2 is observed for all three data sets under the assumption of piecewise linear spline. The application has been extended to the case of natural cubic spline, where the modification of Prior 1 and Prior 3 are straightforward. However, for Prior 2, the modification becomes very tedious. We have proposed an approximate roughness penalty matrix for Prior 2. ii Parameters corresponding to the smoothing splines are estimated using M C M C techniques. We carefully compare the inferential procedures in simulation studies and illustrate them for two data sets. Similarity among the curves produced by Prior 1 and Prior 2 are observed, and they are much smoother than the curve estimated by Prior 3 for both piecewise linear and natural cubic splines. Therefore, in the context of Bayesian curve fitting, both local and global roughness penalty priors produce equally smooth curves in dealing with wiggly data. iii Contents Abstract ii Contents iv List of Tables vii List of Figures ix Acknowledgements xii Dedication xiii 1 Bayesian Hierarchical Models 1 1.1 Introduction 1 1.2 Hierarchical Models in Real Life 2 1.3 Bayesian Approach to Data Analysis 5 1.3.1 Bayes' Rule 5 1.4 The Bayesian treatment of the hierarchical model 6 1.5 Markov Chain Monte Carlo (MCMC) 7 1.5.1 Metropolis-Hastings Algorithm 7 1.5.2 Gibbs Sampling 9 1.6 Discussion 9 iv 2 Nonparametric Regression 11 2.1 Introduction . 11 2.2 Splines 13 2.3 Piecewise linear spline 13 2.4 Natural Cubic Spline (NCS) with Roughness Penalty 15 2.4.1 Interpolating and Plotting an NCS 16 2.4.2 Choosing the Smoothing Parameter for Spline Smoothing 17 3 Bayesian Curve Fitting 19 3.1 Introduction 19 3.2 Simple Roughness Priors 20 3.3 Levels of Hierarchy 26 3.4 Posterior Distributions and Simulation Study 26 3.4.1 Conditional Posteriors for Prior 1 27 3.4.2 Conditional Posteriors for Prior 2 27 3.4.3 Conditional Posteriors for Prior 3 29 3.5 Discussion 29 4 M C M C Simulation for Piecewise Linear Spline 31 4.1 Introduction 31 4.2 Backfitting Algorithm 32 4.3 Simulated Results: Example 1 32 4.3.1 Parameter Estimates 34 4.3.2 Monitoring the Convergence of M C M C Simulation 35 4.4 Simulated Results: Example 2 43 4.4.1 Parameter Estimates 43 4.4.2 Monitoring the Convergence of M C M C Simulation . 44 4.5 Simulated Results: Example 3 52 4.5.1 M C M C Run and Parameter Estimates 53 v 4.6 Discussion 56 5 M C M C Simulation for Natural Cubic Spline 57 5.1 Introduction 57 5.2 Modified Prior and Posterior Distributions 58 5.2.1 Prior 1 58 5.2.2 Prior 2 59 5.3 Example 1 62 5.3.1 Monitoring the Convergence of M C M C Simulation 62 5.4 Example 2 70 5.4.1 Monitoring the Convergence of M C M C Simulation 70 5.5 Discussion 73 6 Conclusion and Future Work 79 Appendix A 82 A . l Conditional Posterior for 9 in Prior 1 82 A.2 Conditional Posterior Distribution for r 2 in Prior 1 83 A.3 Conditional Posterior Distribution of S in Prior 1 84 A. 4 Conditional Posterior Distribution of a2 in Prior 1 84 Appendix B 85 B. l Conditional Posterior for 9 and r 2 in Prior 2 85 B.2 Conditional posterior distribution of w in Prior 2 86 B.3 Conditional Posterior Distribution for A in Prior 2 86 B.4 Conditional posteriors for 9 and r in Prior 3 87 Appendix C 88 C l Interpolating and Plotting an NCS 88 Bibliography 90 vi List of Tables 4.1 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1 35 4.2 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed lambda (A = 0.33) 35 4.3 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform lambda 37 4.4 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage lambda 37 4.5 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3 37 4.6 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1 46 4.7 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed A (A = 0.33) 46 4.8 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform A 47 4.9 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage A 47 4.10 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3 52 vii 5.1 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1 64 5.2 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed A (A = 0.33) 64 5.3 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform A 65 5.4 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage A 65 5.5 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3 70 5.6 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1 72 5.7 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed A (A = 0.33) 72 5.8 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform A 72 5.9 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage A 72 5.10 Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3 73 viii List of Figures 4.1 The motor-cycle data 33 4.2 Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means. 36 4.3 Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions 39 4.4 Histograms of the last 3000 posterior realizations of the variance component T 2 for each of the prior distributions 40 4.5 Plots of 5000 posterior realizations of the data variance o2 for each of the prior distributions 40 4.6 Histograms of the last 3000 posterior realizations of the data variance o2 for each of the prior distributions 41 4.7 Plots of 5000 posterior samples of the parameter A and the histograms for the last 3000 samples 41 4.8 Plots of 5000 posterior samples of mean w and the histograms for the last 3000 samples 42 4.9 Estimated curves for all the five choices of the prior distributions 42 4.10 All the five estimated curves from the five prior specifications 43 4.11 The simulated data 1 44 4.12 Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means 45 ix 4.13 Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions. 48 4.14 Histograms of the last 3000 posterior realizations of the variance component T 2 for each of the prior distributions 48 4.15 Plots of 5000 posterior realizations of the data variance o2 for each of the prior distributions 49 4.16 Histograms of the last 3000 posterior realizations of the data variance o2 for each of the prior distributions 49 4.17 Plots of 5000 posterior samples of the parameter A and the histograms for the last 300 samples 50 4.18 Plots of 5000 posterior samples of mean w and the histograms of the last 3000 samples 50 4.19 True and the estimated curves of all five choices of the prior distributions. . 51 4.20 All five estimated curves with the true curve 51 4.21 The simulated data 2 52 4.22 Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means 54 4.23 True and estimated curves of five choices of the prior distributions 55 4.24 All five estimated curves with the true curve 55 5.1 Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means 63 5.2 Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions 66 5.3 Histograms of the last 3000 posterior realizations of the variance component r 2 for each of the prior distributions 66 5.4 Plots of 5000 posterior realizations of the data variance o2 for each of the prior distributions 67 x 5.5 Histograms of the last 3000 posterior realizations of the data variance o2 for each of the prior distributions 67 5.6 Plots of 5000 posterior samples of the parameter A and histograms for the last 3000 samples 68 5.7 • Plots of 5000 posterior samples of mean w and histograms for the last 3000 samples 68 5.8 Estimated curves for all five choices of the prior distributions 69 5.9 All five estimated curves from the five prior specifications 69 5.10 Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means 71 5.11 Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions 74 5.12 Histograms of the last 3000 posterior realizations of the variance component T 2 for each of the prior distributions 74 5.13 Plots of 5000 posterior realizations of the data variance o2 for each of the prior distributions 75 5.14 Histograms of the last 3000 posterior realizations of the data variance o2 for each of the prior distributions 75 5.15 Plots of 5000 posterior samples of the parameter A and histograms for the last 3000 samples 76 5.16 Plots of 5000 posterior samples of w and histograms for the last 3000 samples. 76 5.17 True and the estimated curves of all the five choices of the prior distributions. 77 5.18 All the five estimated curves with the true curve 77 xi Acknowledgements I would like to warmly acknowledge the guidance and support of my supervisor, Dr. Paul Gustafson throughout my studies. I would also like to give many thanks to Professor Nancy E. Heckman, second reader of this thesis, for her valuable suggestions. Special thanks to those faculty members who were amicable with me in their courses and my TA work. I appreciate the co-operation of Christine Graham as a Graduate Co-ordinator of this department. Lastly, I am grateful to all the, graduate students in this department especially to Jafar, Mikhail, Shahadut, Lawrence and Yiping for their warm friendship and constant encourage- ment throughout my studies. M D . M U S H F I Q U R R A H M A N The University of British Columbia July 2005 xii To my Parents. xiii Chapter 1 Bayesian Hierarchical Models 1.1 Introduction Hierarchical models are at the heart of research in many real life situations where the popula- tion has hierarchical structure. The basic idea of hierarchical model (also known as variance component model, multilevel model, random effects model, or growth curve model) is to organize the lowest-level units into a hierarchy of successive higher-level units. For example, patients are in hospitals can be considered as level 1, hospitals are in cities can be considered as level 2, cities are in states can be considered as level 3 etc. We can then describe outcomes for an individual patient as a pooled effect for the individual patient, for her/his hospital, for the city and for the state. Each of these effects can often be regarded as one of an exchange- able collection of effects (e.g., all hospital-level effects) drawn from a distribution described by a variance component. There may also be regression coefficients at some or all of the levels. Once a hierarchical model is specified, inferences can be drawn for the population means at any level (hospital, city, etc.) from available data in both frequentist and Bayesian approaches. In the Bayesian perspective the parameter estimates are referred to as the posterior means and variances, and in the frequentist perspective the parameter estimates are referred to as the Best Linear Unbiased Predictors (BLUPs) (Robinson, 1991). Both 1 approaches often have better properties than simple sample-based estimators using data only from the unit in question. For multistage data structure hierarchical models are a good choice for the following reasons (Hossain, 2003): (a) Hierarchical models permit the direct framing of the theories about the effect of struc- tural change at each of the different levels of the hierarchy. (b) They provide accurate adjustments to the uncertainty assessment based on simple random sampling when the data are gathered in a hierarchical fashion in the presence of strong intra-cluster correlations. (c) Use of non-hierarchical models is inappropriate for hierarchical data because with a few parameters they usually can not fit the data accurately. This chapter is organized as follows. In section 1.2 we give some real life examples where hierarchical models are often used. Section 1.3 is for the Bayesian approach to data analysis. There we discuss the Bayes' rules and the estimation procedures in Bayesian techniques. The Bayesian treatment for the hierarchical model is presented in section 1.4 where we discuss the computation techniques of posterior distributions. The procedures of drawing samples from the posterior distributions through Markov Chain Monte Carlo (MCMC) techniques are described in section 1.5. We end with a brief discussion in section 1.6. 1.2 Hierarchical Models in Real Life Many authors worked on the data with hierarchical fashion and they explained the concept of hierarchical modelling with many nice examples. A few of them are as follows: (i) Smoothing spline In many practical situations, the relationship between a response and a covariate shows a rapid fluctuating curve. Examples include growth curves of organisms, learning 2 curves, and stock market trends. Then it is better to express the curve with the help of a piece-wise polynomial of certain degree (called a spline) over each subinterval of the range of the predictor considered. Nonparametric regression using linear and cubic splines is an attractive, flexible and widely applicable approach to curve estimation. Besides many existing classical approaches of smoothing splines a lot of studies con- cerning Bayesian curve fitting have been conducted. A spline with roughness penalties can easily be expressed as a hierarchical model and the model parameters can be esti- mated by Bayesian techniques. Denison et al. (1998), Smith and Kohn (1997), Altman and Casella (1995) and Silverman (1985) are some nice pieces of work for the Bayesian approach to smoothing splines. (ii) Small Area Estimation Hierarchical models can be used in small area estimation. Gosh and Rao (1994) intro- duced several techniques including Empirical Bayes and Hierarchical Bayes procedures for small area estimation. The term small area is defined as a small geographical area, such as a county, a municipality of a census division, or a small subpopulation such as a specific age-sex-race group of people within a large geographical area. The usual direct survey estimators for the totals and means for large domains from a small area, based on data only from the sample units in the area, are likely to yield unacceptably large standard errors due to the unduly small size of the sample in the area (Gosh and Rao, 1994). Small area estimation can be done efficiently by assuming a hierarchi- cal small area model and then applying the Bayesian approach to hierarchical models. Datta and Ghosh (1991) applied the hierarchical Bayes approach to estimation of small area means under general mixed linear models and also discussed the computational aspects. (iii) Random effects meta-analysis Meta-analysis is defined to be the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings. In recent 3 years, random effects meta-analysis is widely used by researchers in medical and social sciences for many of its advantages. Interested readers can read Whitehead (2002). Similar to the random effects meta-analysis, in the Bayesian approach, all the model parameters are treated as random variables and hence one can account for the uncer- tainty of all relevant sources of variability in the model. Let us consider a meta-analysis with the log odds ratios from several studies. A full Bayesian random effects model for meta-analysis includes a three level hierarchy and can be written as follows: Level (i) Observed individual log odds ratio given the true log odds ratio and the variance component follows a population distribution (usually normal). Level (ii) True log odds ratio given the overall population mean log odds ratio and the variance component follows a population distribution (usually normal). Level (iii) The overall mean log odds ratio and the variance component parameters are treated as independent random variables with some population distributions. (iv) Repeated measures data In longitudinal studies, we have repeated observations on the same individual under study at different occasions; for example, the monthly observations of the blood glucose levels of diabetic patients. This type of data can be arranged in a hierarchical model with two levels, where the measurement occasions (considered as level 1 units) are nested within individuals (considered as level 2 units). (v) Hierarchical models in Biostatistics In our previous patient-hospital and Bayesian meta-analysis examples we explained how the hierarchical models can be applied in biostatistics. Besides these, in some studies of offspring of animals and human beings we observe the data having hierar- chical or clustered structure, where offspring are grouped within families. Offspring from the same parents tend to be more alike in their physical and mental characteris- tics than individuals chosen at random from the population at large. Thus, offspring 4 may be the level 1 units in a 2-level structure where the level 2 units are the families. Details are found in Goldstain et al. (1994). 1.3 Bayesian Approach to Data Analysis Besides the observed data, sometimes, it is likely that the researcher has some knowledge about the parameter vector 0. Bayesian approach incorporates this information to the analysis through a density IT(0) with the observed data in the estimation process. The process of Bayesian data analysis can be divided into the following three steps (Gelman et al, 1995): 1. Setting up a full probability model - a joint probability distribution for all observable and unobservable quantities in a problem (i.e., the likelihood function times the prior distributions). 2. Conditioning on observed data, calculating and interpreting the appropriate posterior distribution - the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data. 3. Evaluating the fit of the model and the implications of the resulting posterior distri- bution. If necessary, one can alter or expand the model and repeat these three steps. 1.3.1 Bayes' Rule Bayesian inference about a parameter vector 6 is made in terms of probability statement. This probability statement is conditional on the observed value y and is called the posterior probability of 0 given y. Notationally we can write ir{0\y). To calculate the posterior density we need two ingredients: IT(0), the prior distribution of 0, and 7r(y|t9), the sampling distribution of the observed data y which is also called the likelihood function of 6. The joint probability distribution of 0 and y can then be defined as 7r(0,y) = 7r(0)7r(2/|0). (1.1) 5 Applying the Bayes' rule for conditional probability we get the posterior density of theta as follows: ?r(0,y) 7 r ( % ) = where 7r(y) = «*i*<f), ( 1.2) YJ e 7r(t9)7r(y|(9), for discrete 0 or ?r(y) = { JeTr(0)ir(y\Q)dO, for continuous 0, is called the marginal distribution of y. Since 7r(y) does not depend on 0 and, with fixed y can thus be considered as constant (also called normalization constant), an equivalent expression of (1.2) is as follows: T T ( % ) OC Tr(0)n(y\O). (1.3) A Bayes' estimator of 6 is the mean of the posterior distribution of 6, E(0\y) = J9 0ir(0\y)d0. 1.4 The Bayesian treatment of the hierarchical model Let us consider the following example of the hierarchical model: in a study of the effectiveness of chemotherapy, in cancer patients, let us have the observed data yij from the ith individual and jth. hospital with survival probability Oj. Hence, it might be reasonable to expect that estimates of the 0/s, which represent a sample of hospitals, should be related to each other. We can incorporate this similarity in a natural way if we use a prior distribution in which the O/s are viewed as a sample from a common population distribution. In such case the observed data y^ can be used to estimate aspects of the population distribution of the Ofs even though the values of 8/s are not observed. It is natural to model such a phenomenon hierarchically by assuming that each of the parameters, the c9/s, form an independent sample from a population distribution governed by some unknown parameter r 2 . Thus we write, n(0\r2)=f[<ej\T2). (1.4) 3=1 6 The main hierarchical part of these models is that r 2 is not known and thus has its own prior distribution, 7T(T2). The joint prior distribution can be written as T T ( T 2 , 0 ) = 7r(r2)7r(6>|r2), and the joint posterior distribution is 7r(r 2,0 |y) oc T T ( T 2 , 0)7r(y|r 2, 0) = 7r(r2)7r(0|r2)7r(y|c9), since 7r(y|r 2 ,0) depends only on 0; the hyperparameter r 2 affects y only through 0. Now the problem is how to choose the hyperprior distribution for r 2 . If little is known about r 2 , we can assign a noninformative prior distribution (we will discuss more in chapter 3), but we must be careful when using an improper prior density to check that the resulting posterior distribution is proper, and we should assess whether our conclusions are sensitive to this simplifying assumption. After writing the joint posterior density 7r(r 2 , 0|y) in terms of the likelihood function and the joint prior distribution, the computational strategy for the above hierarchical models needs the following two additional steps: (i) we should calculate the conditional posterior density for 0 given the hyperparameter T 2 and the observed data y, i.e., calculate 7r(0 |r 2 , y) and (ii) we should obtain the marginal posterior distribution 7r(r 2|y). 1.5 Markov Chain Monte Carlo (MCMC) 1.5.1 Metropolis-Hastings Algorithm Markov Chain Monte Carlo (MCMC) methods attempt to simulate direct draws from some distributions of interest. In recent years M C M C techniques have been increasingly used by researchers to simulate complex, nonstandard multivariate distributions. In the M C M C technique, one uses the previous sample values to randomly generate the next sample value and generate a Markov chain (as the transition probabilities between sample values are only 7 a function of the most recent sample value). Constructing such a Markov chain is surpris- ingly easy. We describe the form due to Hastings (1970), which is a generalization of the method first proposed by Metropolis et al. (1953). Suppose our goal is to draw samples from some distribution TT(9). For the Metropolis- Hastings algorithm, at each time t, the next state 6t+\ is chosen by first sampling a candidate point 9* from a proposal (or candidate-generating or jumping) distribution q(9*\0t). Note that the proposal distribution may depend on the current point 9t. The candidate point 9* is then accepted with probability a(9t,9*) where If the candidate point is accepted, the next state becomes 9t+i = 9*. If the candidate is rejected, the chain does not move, i.e., 9t+i = 6t. In practice the Metropolis-Hastings algorithm generates a sequence of draws from the distribution TV(9) with the following steps: 1. Start with any initial value #0, set t = 0. 2. Using the current 9 value, sample a candidate point 9* from some proposal distribution q(9*\9t). 3. Sample a uniform (0,1) random variable U. 4. If U < a(9t,9*) set 9t+1 = 9*, otherwise set 9t+1 = 9t. Repeat steps (2) to (4) n times to obtain a sample of size n. Convergence and mixing of the chain should be considered in order to select an appropriate sample size n. If the pro- posal distribution is symmetric, i.e., q(9*\9t) = q(9t\9*), the Metropolis-Hastings algorithm becomes the Metropolis algorithm. Hence, for the Metropolis algorithm we need to calculate a(9t, 9*), where (1.5) (1.6) 8 1.5.2 Gibbs Sampling The Gibbs sampler (Geman and Geman, 1984) is a special case of Metropolis-Hastings sam- pling where the random value is always accepted (i.e., a = 1). The main feature of the Gibbs sampler is that one only considers univariate conditional distributions. Such conditional dis- tributions are far easier to simulate than complex joint distributions and usually have simple forms (e.g., normal, inverse gamma, or other common prior distributions). A very nice ap- plication of the Gibbs sampler to generalized linear models with random effects is found in Zeger and Karim (1991). Some other applications are found in Gelfand and Smith (1990) and Gelfand et al. (1990). To explain the Gibbs sampler algorithm, let us consider that there are k parameters in the model, denoted by 9\, • • • ,9k, and that the conditional distributions p{0i\Gj^i,y), i = l , - - - ,k, are available for sampling. Then, given a set of starting values (6f\ • • • , 8^), for the first iteration we sample e[% from piOr^,.-. ,6f\y), e^ly from p(92\9?\ f f \ • • • ,0f\y), 8^\y from p(6k\tf\ • • • , y). Then using the first set of samples (9^\ • • • ,9^) we draw a second set and so on. The process continues until after n iterations when a sample (9^\ • • • , t?]^) is obtained. Again to select the sample size n, one should consider the convergence and mixing of the chain. The iterative process follows a Markov chain, which converges to its stationary distribution, that being the joint posterior distribution of the k parameters. 1.6 Discussion In this chapter, we have discussed the concept of hierarchical models with some examples. The Bayesian approach to deal with hierarchical models is presented here. We have also described the techniques of drawing samples from conditional posterior distributions. Our 9 main objective of this study is Bayesian curve fitting using both piece-wise linear and natural cubic splines. Therefore, in the next chapter, we will introduce some basic concepts of splines and their classical computational aspects. 10 Chapter 2 Nonparametric Regression 2.1 Introduction Regression analysis is one of the most popular and useful tools in data analysis. The goals of regression analysis are to describe the dependence of a response variable on one or more covariates and to predict the response in future. Suppose we have a response variable Y, and a predictor variable X. The dependence of Y on X can be expressed as the following functional form y = f(x) + e, (2.1) where e is the random noise and / is an unknown regression function that we wish to esti- mate. The value f(X) is the conditional expectation of Y given the value X, so it can be used to predict the future values of Y for different measured values of X. There are two approaches to find the regression function / : (a) parametric approach and (b) nonparamet- ric approach. In the parametric approach we have rigid parametric assumptions about the dependence between the response and the predictors. Linear regression, logistic regression, Poisson regression are examples of the parametric regression approach. In general, in the parametric approach the response variables are assumed to have a distribution in the expo- nential family. Within the exponential family the dependence of response on the covariates can be summarized under the framework of generalized linear models (GLM)(McCullah and 11 Nelder, 1989, Nelder and Wedderburn, 1972). The simplest form of G L M is the linear re- gression model y = /30 + Pix+e, where the responses are assumed to be normally distributed. Researchers use polynomial regression in situations where they know that curvilinear effects are present in the true response function. Polynomial regression is also useful for approxi- mating functions in unknown and possibly very complex nonlinear relationships. But they suffer from various drawbacks. For example, individual observations can exert an influence, in unexpected ways, on remote parts of the curve. Also, owing to the global nature of poly- nomial fitting, there are problems in estimating wiggly curves (Denison et al. 1998). Every parametric regression analysis requires rigid assumptions on the form of / that may not be true, and hence the usefulness of a non-parametric approach arises. In the non- parametric approach we do not have any rigid assumptions about the dependence between the response and the predictors. By making the relatively weak assumption that whatever the true relationship might be, it is a smooth curve, it is possible to let the data tell the analyst what the pattern truly is. Smoothing methods provide a bridge between making no assumptions on formal structure (non-parametric approach) and making very strong as- sumptions (parametric approach) (Simonoff, 1996). The assumption of smoothness can be interpreted as meaning that the dependence of the mean of Y on X should not change much if X does not change much. We expect an estimate / of / which is not much wigglier than / . The non-parametric approach involves the choice of a smoothing parameter which controls the balance between goodness of fit and smoothness of the estimated regression function. The running-mean, running-line, smoothing spline, kernel, and regression spline smoothers are all linear smoothers (Hastie and Tibshirani, 1990). Some smoothers are nonlinear, that is, the fit / cannot be written as f = Sy for any smoother matrix S independent of y. A l l the smoothers mentioned above are nonlinear if the selection of the smoothing parameter is based on the data y. 12 Since we compare Bayesian curve fitting using different roughness penalty prior distributions in splines, we describe the spline-based roughness penalty approach and associated techniques in the next few sections. In section 2.2 we give the definition of splines. In section 2.3 we discuss how the design matrices can be computed in case of piecewise linear splines. We briefly discuss the natural cubic spline approach to curve fitting in section 2.4, where we describe how to plot a natural cubic spline and how to use the cross-validation method to choose the smoothing parameter in the classical curve fitting approach. 2.2 Splines Univariate and polynomial splines are piecewise polynomials of some degree ci. The break- points marking a transition from one polynomial to the next are referred to as knots (Hansen and Kooperberg, 2002). For example, a linear spline consists of several piecewise linear func- tions and a cubic spline consists of several piecewise polynomials of degree three. Typically, a spline will also satisfy smoothness constraints describing how the different pieces are to be joined. These restrictions are specified in terms of the number of continuous derivatives, m, exhibited at the joins of the piecewise polynomials. Hence, in case of a cubic spline, we make the constraint of continuous first and second derivatives. In this study we denote the vector of p knots as t = (ti, • • • ,tp). Mathematically, a function / is called a spline of degree d if 1. the domain of / is an interval [a, b] and 2- /,/ ' ,/"> • • • >/^ - 1^ a r e a h continuous functions on [a, b]. The knots U, i = 1, • • • ,p, satisfy a = to < h < • • • < tp < tp+i = b, and / is a polynomial of degree at most d on each subinterval [ti, U+i], i = 0, • • • ,p. 2.3 Piecewise linear spline Higher-order polynomial interpolation is rarely used for practical purposes because of a poly- nomial wiggle, i.e., high swings of the interpolating polynomials between the data points. 13 Instead, most computational algorithms use the idea of splines, i.e., piecewise interpolation. In this research we consider experiments that generate scatter plot data (yiy Xi), i = 1, • • • , n, where the relationship can be linear in some places and curvilinear in other places. In other words, we consider scatter plot data where the regression function can easily be expressed by a piecewise linear spline. The primary objective of this thesis is to fit the scatter plot data through the Bayesian hierarchical model technique. For the given data, we want to fit equation (2.1) where f(x) — f(x\0) is modelled as a piecewise linear function, with knots t = (to, t\, • • • , tp, tp+i) satisfying a = to < t\ < • • • < tp < tp+i = b, and function values at the knots t90, Oi, • • • , 9P, 9p+i, such that the coordinate (U,9i) represents the location at which two line segments meet. To keep things simple, we assume that: (i) The knots are equally spaced. (ii) The value of 0jS are equal to zero in two exterior knots to and tp+i, i.e., 0O = 0 and 9p+i = 0, so that 6 = (9i,- • • , 9P)' is the vector of unknown parameters. (iii) The linear trend is being considered in the model separately and the estimation will be done by the backfitting procedure. In matrix notation we can express our model as Y = D6-rAG + e, (2.2) where Y is an n x 1 vector of observations, 6 is a p x 1 vector of parameters, c5 = (<50, Si)' is the vector of linear regression parameters, D = (l,x) is an n x 2 matrix, A is an n x p design matrix and e is an n x 1 vector of random error terms. The basis matrix or the design matrix A for the piece-wise linear spline can be computed from the corresponding x values as follows: 14 •ij — if tj-i < Xi < tj, if tj < Xi < tj+i and otherwise for i = 1, • • • , n and j = 1, - • • ,p. It is evident from here that for tj < Xi < tj+i, Aij+i = 1 — Aij, and that makes easy computation of the basis matrix. We assume that the random error e follows a multivariate normal distribution with mean 0 and covariance matrix a21. Hence, we can write Y ~ N(D6 + AO, a21), i.e., the observation Y follows a multivariate normal distribution with the mean vector D6 + AO and the covariance matrix a21. Here the quantity DS represents the linear part and the quantity AO represents the smooth part of the curve. 2.4 Natural Cubic Spline (NCS) with Roughness Penalty A cubic spline, / , on an interval [a, b] is said to be a natural cubic spline if its second and third derivatives are zero at the boundaries a and b. These conditions are known as natural boundary conditions. This implies that / is linear on the two extreme intervals [a, ti] and [tp,b] (Green and Silverman, 1994). There is much literature on natural cubic splines in- cluding the books of Green and Silverman (1994), Hastie and Tibshirani (1990) and Wahba (1990). The purpose of the nonparametric regression through natural cubic splines is to summarize the trend of a response measurement as a function of one or more predictors assuming a piecewise polynomial of degree three. In general it is always desirable to have an estimated curve (trend) which provides a good fit without being too wiggly. The roughness penalty approach makes a compromise between these two opposing factors. Roughness or wigglyness of a twice-differentiable curve / defined on [a, b] is measured by calculating its integrated squared second derivative J^{f"(x)}2dx. The roughness penalty 15 approach to curve estimation is easily done by defining the penalized sum of squares S = X > - f(xi)}2 + r2 f{j"{x)}2dx. (2.3) i=i J a The penalized least square estimator / is obtained by minimizing S over all twice-differentiable functions / . An important point here is that the minimizer of S is a natural cubic spline with knots at the xjs. The addition of the roughness penalty term r 2 f^{f"(x)}2dx in equa- tion (2.3) ensures that both the residual sum of squares 2~2i=i{Vi~ f(xi)}2 a n d the roughness T 2 J^{f"(x)}2dx are used to determine the cost S. In this context, the smoothing param- eter T 2 represents the rate of exchange between residual error and local variation. It also gives the amount in terms of sum of squared residual error that corresponds to one unit of integrated squared second derivative. For the given value of T 2 , minimizing S will give the best compromise between smoothness and goodness-of-fit. Choice of the optimum value of T 2 is very important, and usually it is done by cross validation that we will discuss in later sections. 2.4.1 Interpolating and Plotting an NCS Detailed calculations for interpolation and plotting an NCS are found in Green and Silverman (1994). For convenience, we use the same notation as Green and Silverman. Let / be a natural cubic spline with knots i i < t<i < • • • < tp and we define = f(ti) and % — /"{U) for i = 1, • • • ,p. By the definition of NCS, 71 = 7 P = 0. Let / = ( / i , - - - ,fP)T and 7 = (72i • • • > 7 p - i ) T - The vectors / and 7 specify the curve / completely with the two band matrices Q and R such that QTf = Ri- (2.4) Appendix C l contains the calculations for Q and R matrices. The roughness penalty term fa {f"(t)}2dt can be expressed as, f\f"(t)}2dt = 7Ti?7 = fKf, (2.5) 16 where the matrix K is defined by K = QR~1QT. (2.6) A natural cubic spline can be plotted as Y = AO, where A is an n x p basis matrix. The jth row (j = 1, • • • , n) of A, as shown in Appendix C l , can be written as: aj,i+l — aji ~~ ! M . ! H « { ( H ^ ) ^ + ( 1 + i d ) 4 and aAi, = ^ — | ( l + - r - J t ^ i y + ^1 + ) * V j , where z' = 1,2,--- ,z — l,z + 2, ••• ,p, for any t between U and and is the element of R 1QT. 2.4.2 C h o o s i n g the S m o o t h i n g Pa ramete r for Spl ine S m o o t h i n g The problem of choosing the smoothing parameter in curve estimation is very important because of the dependency of the estimated curve on the parameter. One can choose the parameter value subjectively. If r 2 in equation (2.3) is very large then the main component in S will be the roughness penalty term and hence the minimizer / will display very little curvature. Therefore, as r 2 tend to infinity, the term J^{f"(x)}2dx will be forced to zero and the curve / will approach the linear regression fit. On the other hand, if r 2 is relatively small, the main contribution to S will be the residual sum of squares, and the curve estimate / will not represent the curvature of the data well. There are many different data driven methods to select the value of the smoothing parameter r 2 , among them cross-validation (CV) and generalized cross-validation (GCV) are the widely used techniques. The Cross-validation and the Generalized Cross-validation Silverman (1985) and Green and Silverman (1994) discussed the 'leave-one-out' principle to choose the smoothing parameter and this is known as cross-validation. The basic principle of cross-validation is to leave the data points out one at a time and to choose the value of r 2 17 under which the missing data points are best predicted by the remainder of the data. Let / - i ( x ; T 2 ) be the smoothing spline calculated from all the data points except (xi,yi), using the value of r 2 as the smoothing parameter. The cross-validation choice of r 2 is then the value of T 2 that minimizes the cross-validation score defined as CV(r2) = -\J2{yi-f-i(xi-y)}2. (2.7) i = l For computational ease the CV score defined above can be modified as follows: OV(r^1-t{'^^}\ (2, , where / is the spline smoother calculated from the full data set (U, yi) with the smoothing parameter r 2 and A(T2) = (I — T2QR~1QT)~1 . The equation (2.8) shows that, provided that the diagonal entries Aa(r2) are known, the cross-validation score can be calculated from the residuals {yi — f(U)} about the spline smoother calculated from the full data set. Generalized cross-validation (GCV) is a modified form of CV and is a popular method for choosing the smoothing parameter (Craven and Wahba, 1979). It is obtained by replacing the term 1 — AH(T2) in the denominator of equation (2.8) by 1 — n _ 1 t rA( r 2 ) as follows: W ) 4 ^ - ^ J . (2.9) As in ordinary CV, the G C V choice of smoothing parameter is then carried out by minimizing the function GCV(T2) over r 2 . 18 Chapter 3 Bayesian Curve Fitting 3.1 Introduction The purpose of any curve fitting is to determine the true functional relationship between the response and the covariates. Before fitting any curve, we may have some clues about i the relationship, or we may have some ideas about what we think our approximation should look like. Obviously, we should think of the following issues which may influence how we proceed in determining unknown relationships. First, we should think about the type of the approximating function. Commonly used approximating functions include linear models, generalized linear models, smoothing splines, neural networks, wavelets, decision trees and kernel smoothers. All of these provide explicit models for the relationship between the response and the predictors (Denison et al., 2002). In this thesis we will consider the spline based relationships. The second issue is that we will need some criterion to find the best approximation of the truth. Since in the real data analysis situation, no single model can completely explain the true relationship between the response and the predictors, we may think of the best model among a set of alternatives as the one which most closely explains the true relationship for the particular purpose of interest. Thirdly, we should think of improving the quality of the models by incorporating available qualitative or quantitative knowledge a priori. This approach naturally leads us to Bayesian methods where we assign 19 prior distributions to the parameters in the model and then update the priors in the light of data. These yield a posterior distribution via Bayes' theorem: posterior oc prior x likelihood. In this chapter we mainly describe the prior and posterior distributions for curve fitting assuming the regression function is either piece-wise linear spline or a natural cubic spline. In section 3.2 we discuss simple roughness priors. There we discuss three different roughness penalty priors that we will use to compare the smoothness of the fitted curve. The hierar- chical structures of our Bayesian data analysis are presented in section 3.3. We calculate the conditional posteriors for all the model parameters in section 3.4. Finally, we give a brief discussion in section 3.5. 3.2 Simple Roughness Priors To perform Bayesian inference, the scientific question is how to specify a prior distribution for the parameter 9. Prior distributions can be either informative and non-informative. In the informative case the prior distribution represents a population of possible parameter values from which 9 has been drawn. But in many practical situations, there is no perfectly relevant population of 0's from which the current 9 has been drawn. Some authors sug- gest that the prior distribution should include all plausible values of 9, but the distribution need not be realistically concentrated around the true value, because often the information about 9 contained in the data will far outweigh that of any reasonable prior probability specification. When prior distributions have no population basis, they can be difficult to construct, and there has long been a desire for prior distributions that can be guaranteed to play a minimal role in the posterior distribution.. Such distributions are sometimes called 'reference prior distributions', and the prior density is described as vague, flat, diffuse or noninformative (Gelman et al. 1995). Jeffreys' invariance principle is sometimes used to define noninformative prior distributions. 20 The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy. In this case we call the prior distribution a conjugate prior distribution. For instance, the beta distribution is a conjugate prior for the binomial like- lihood function, and the normal distribution is a conjugate prior for the normal likelihood function. Conjugate prior distributions have some practical advantages and computational convenience. The probability distributions that belong to an exponential family have natural conjugate prior distributions. As mentioned earlier, we are interested in a Gaussian response that can be expressed as Y ~ N(DS+AO, a2I), i.e., the response vector Y follows a multivariate normal distribution with mean DS + AO and covariance matrix a21, where the quantity DS represents the linear part and the quantity AO represents the smooth part of the curve. Hence, the likelihood function can be written as: TT(Y|C5,0,C72,X) OC \£l\-$exp^~{Y-DS-AOyn-^Y-DS-AO)^ , (3.1) where Q = a21. Since the main focus of this study is to estimate the parameters c5 and 0 with the Bayesian technique, we need to specify the prior distributions of the parameters. Let us assume that the relationship between the response and the covariate can be expressed by a piecewise linear spline and we are interested in comparing the following roughness penalty prior distributions of 0. P r i o r 1. Let us be interested in estimating 0 using p equally spaced knot points and hence 0 contains p unknown parameters. By the conditional probability law the joint prior of 0 and r 2 can be written as 7T(0,T2) = 7T(0|T2)7T(T 2), and we write the roughness penalty prior for 0 given r 2 as i \ p / 2 TT(0|T2) OC ^ exp < T 2 (3.2) Here the proportionality (or normalizing) constant is not shown because it does not depend 21 on T 2 . The term Y%=i {̂ * ~ f ^ - i + ^ + i ] / 2 } 2 i n e c l u a t i o n ( 3- 2) c a n b e expressed in matrix notation as O'MO, where M = V 5 4 -1 1 4 0 •• • 0 0 -1 3 2 -1 1 4 . 0 0 1 4 -1 3 2 -1 .. . 0 0 0 1 4 -1 3 2 . 0 0 0 0 0 0 .. 3 2 -1 0 0 0 0 .. . -1 5 4 \ is a p x p symmetric matrix. Therefore, expression (3.2) can be written as TT(0|T2) OC P/2 IM- 1 ! -1/2 exp { - 2 ^ ' M * } = | B | 5 e x p | - ^ S 0 } , (3.3) where B = ( l / r 2 ) M and B"1 represents the covariance matrix of 6. The expression on the right hand side of (3.3) looks like the pdf of a multivariate normal distribution without the normalizing constant and hence we can say that 0 given r 2 follows multivariate normal dis- tribution with mean vector 0 and precision matrix B. Here B measures the overall roughness or wigglyness of the curve. For the variance component r 2 , the most commonly used prior distribution is the inverse Gamma distribution, and this is the conjugate prior distribution for the normal distribution. Hence, for r 2 , we assume an inverse gamma distribution with parameters a and B and therefore, the pdf of r 2 can be written as 7T(T2) = Ba ( a+ l ) exp HI for r2 > 0. (3.4) r(a) \T* We assume that both a and B are known and hence we can write the above prior distribution as 7T(T2) OC ( l / - r 2 / a + 1 ^ exp{—B/T2} where the proportionality constant involves only a and P, and does not have any effect in calculating the posterior distribution of r 2 . 22 Pr ior 2. Instead of considering overall roughness of the curve, if we consider the local roughness at each knot point to estimate the curve we may think of a vector of variance components say, r 2 = (r 2 • • • r 2 ) ' . Therefore, the joint prior of 0 and r 2 can be written as TT(0, r 2 ) = 7r (0 | r 2 )7r ( r 2 )7r ( r 2 ) • • • TT(T2) , where TT(0|T 2) OC h • • • , j exp I - - £ ^ J- 2 ^ (3.5) where h(-) is some function of (ji,"' ; 7?) related to the determinant of the covariance ma- r 0 v>i-i+0i trix of (9i, • • • , Op). Similar to the case of Prior 1, we can express the term Y%=i S ~ in matrix notation as 6'MT0, where MT is a symmetric p x p matrix as follows / X-L-L -i(x + X) x. n 0 A -*(*+*) *+*+'fi - 0 0 0 0 o ... + i + - i ( £ + *) 0 0 0 ... - * ( £ + *) 3 ^ + ^ and therefore 7T(I9|T 2) in (3.5) can be written as TT(0|T 2) OC | A f T | 1 / 2 e x p | ~ f l ' M T f l | . (3.6) Therefore, we can say that 6 follows a multivariate normal distribution with mean vector 0 and precision matrix MT. Here M " 1 measures the local variation of the data at each knot point. As we are assuming different r 2 at different knot points, each 7r(r t2) might be assigned the same inverse gamma distribution with hyper parameters a and 3. We are assuming this kind of prior rather than Prior 1 because, in some real life situations curves show different roughness in different places of the curve. More precisely, the data variances are equal but the smoothness of the curve differs for some of the knot points. Hence there is a chance of getting less efficient estimates if we consider the equal variance component for all the knot 23 points of the data. Instead, the inferences can be improved if we use a joint prior of 6 and T 2 which can account for local variation of the data, and hence we choose Prior 2. To estimate the parameters with the heteroscedastic situation we express r 2 as T 2 = WiT2 for i = 1, • This is in some sense similar to the weighted least squares (WLS) problem of linear regression. Here we consider both Wi and r 2 to be random samples form inverse gamma distributions. One can think of the scalar r 2 as governing 'overall' roughness, while w = (wi,-- - ,wp) governs local deviations from the overall roughness. By considering random Wi we always have the flexibility to let the data select the appropriate values for the local variation and hence produce better estimates. So now we have prior 7r(c?|io, T 2 ) instead of prior 7r(t9|r2), which penalizes roughness of the curve. Thus the prior distribution of 0 given w and r 2 can be written as IT(0\W,T2) OC p/2 h ( —2> \w( Wi exp 1 ' { O i - V t f ^ } ' 2 r 2 ^ 2 ^ t=i Wi (3.7) Similar to the earlier cases we can express the term Y%=i \ ~ wf r m matrix notation as 0'MW6, where Mw is a p x p symmetric matrix as follows / - L + - 1 - -l(-L + ±) J L 2 V W\ 1U2 J 4ti)i W2 iW3 2 \W2 W3 ) 4W2 2 V W2 W3 I 4tV2 W3 4W4 0 0 0 0 0 0 \ V 0 0 0 0 0 0 T . . . . 4 4 w p _ 2 Wp-i 2 \^ti)p_i Wp J 2 \^«; p _i wp J 4m 1 + ± p - l and hence 7r(0 | iu,T 2 ) can be written as n(6\w,T2) oc I B ^ I ^ e x p j -^B^j , (3.8) 24 where BW<T = (\/T2)MW. And therefore, we can say that 0 given w and r 2 follows a multivariate normal distribution with mean vector 0 and precision matrix BWfT. As a second-stage prior, let us assume that r 2 has an inverse gamma distribution with parame- ters a and 3 to match with the first prior, and we assume that each of w\, • • • , wp follows an inverse gamma distribution with both scale and shape parameters equal to 1/A, i.e., Wi ~ IG(l/\, 1/A), i = 1, • • • ,p. Here we see that the mean of Wi is 1/(1 — A) and the fact that as A goes to 0, the mean of Wi goes to 1. Hence w^s are centered at 1 for very small A. Another interesting feature of this prior specification is that as A decreases, the variance of Wi decreases, that is, as A goes to 0, the second prior converges to the first prior. Prior 3. To compare the above two priors, we consider a third prior for 9 which indicates neither local nor global specification for the roughness of the curve. We write, That is, 6 is assumed to have a multivariate normal distribution with mean vector 0 and precision matrix (1/r 2)/ and as usual we assume that 7T(T2) is an inverse gamma density with hyper parameters a and 3. Many studies suggest that a good choice of prior distribution for the data variance a2 is inverse gamma and the uniform prior for the linear regression parameters works well. Hence, for all the three cases, we assume that the data variance a2 has an inverse gamma distri- bution with known parameters a and 6, i.e., a2 ~ IG(a,b). Although the linear regression parameters can take both positive and negative values, for simplicity, we assume that 5Q and 5i are uniform over the interval (0,1). TT(0,T 2) = 7r(0 |T 2 )7r(r 2 ) , where (3.9) 25 3.3 Levels of Hierarchy To verify the performance of the proposed prior distributions we need to conduct simula- tion studies for the three-level hierarchical models for each of the priors. For Prior 1 the hierarchical structure can be written as: Similarly for Prior 2 the 3-level hierarchical structure can be expressed as follows: Y|C5,0,CT 2 ~ N(D6 + A0,n); where fi = a2l, 0 |w , r 2 ~ N(0,B^T), a2~IG(a,b) and £ ~ (7(0,1), r 2 ~ IG(a,3) and wt ~ IG Q , ^ , i = l,---,p. And finally, the hierarchical structure for Prior 3 is as follows: Y|<5,0,cr 2 ~ N{D6 + A0,n); where fi = a2l, For the simplicity of computation, we now assume that the 4th level hyperparameters a, 8, a, b and A are known constants. 3.4 Posterior Distributions and Simulation Study Under the Bayesian approach, prior beliefs about parameters are combined with sample information to create updated, or posterior, beliefs about the parameters. In the following subsections we present the conditional posterior distributions for all the model parameters in the cases of our proposed prior distributions. Y\S, 6,a- N{D6 + A0,tt), where fi = <r IG(a,3). 26 3.4.1 Conditional Posteriors for Prior 1 Combining the sampling distribution for the observable Y's and the prior distributions yields (' the joint posterior distribution of all the parameters and hyperparameters. For our Prior 1 the joint posterior distribution can be written as: TT(0, S, T2, C T 2 | Y , X ) oc TT(Y | 0 , 6, a2, X)7r(0|r 2)7r(r 2)7r(c5)7r(cT 2). (3.10) For posterior sampling through MCMC, we need to find the conditional posterior dis- tributions for each of the model parameters given all others, i.e., we have to calculate the marginal distributions, n(8\8,T2,a2, Y , X ) , 7r(5|0, T 2 ,CT 2 , Y , X ) , TT(T2\6, 6, a2, Y , X ) and u(a2\8, r 2 , Y , X ) from the joint posterior distribution (3.10). Details of the calculations are presented in Appendix A. From Appendix A . l , it can be written that the conditional pos- terior distribution of 0 given 8, r 2 , a2, Y and X , is multivariate normal with mean vector ( ( l / r 2 ) M + (A'A)'1/a2)'1 A ' Z i / c r 2 and covariance matrix ( ( l / r 2 M + (A'A)'1/a2)'1, where Z i = Y — D8. By Appendix A.2, we find that the conditional posterior distribution for the variance component r 2 given 0, 8, a2, Y and X , is inverse gamma with parameters a + p/2 and (0'M0)/2 + (3. The conditional posterior distribution of the linear regression param- eter <5 given 0, a 2 , Y and X , is multivariate normal with mean vector (D'D)~lD'Z2 and the covariance matrix (D'D)~1o2, where Z2 = Y — Ad, and is shown in Appendix A .3. Finally form Appendix A .4, we can say that the conditional posterior distribution of the data variance a2 given 8, 0, r 2 , Y and X , is inverse gamma with parameters a + n/2 and ( Y — D8 — A 0 ) ' ( Y — DS — AO) + b. Since the closed form solutions are found for all the conditional posterior distributions, the Gibbs sampler algorithm will be useful to draw the posterior samples. 3.4.2 Conditional Posteriors for Prior 2 Similar to Prior 1, the joint posterior distribution for Prior 2 can be written as TT(0, S, T2, W, a2\Y) oc TT(Y |0, S, a 2 , X)ir(6\w, T2)TT(T2)^(Wi) • • • TT(wp)ir(6)Tr(a2). (3.11) 27 As shown in Appendix B . l , the conditional posterior 7r(0|<5, T 2 , W, a2, Y , X) is a multivariate normal density with mean vector of (BW>T + (A1 A)'1/a2)"1 A"L\ju2 and covariance matrix of (BWtT + ( A ' A ) _ 1 / C T 2 ) _ 1 , and the conditional posterior distribution for r 2 given 0, w, S, a2, Y and X , is inverse gamma with the parameters a + p/2 and 1/2(0''BW>T0) + 3). The joint conditional posterior distribution for the vector w given 0 and r 2 (as in Appendix B.2) is f 1 1 p f 1 \ f 1 1 « M » . T * ) oc | B „ , r | V , e x p | _ _ ^ T e j x r i ^ ( - ) e x p { - - | Apparently, neither 7r(u;|0, r 2 ) can be written in a closed form density function, nor we can calculate the marginal conditional posterior densities 7 r ( iUj | u ; i , • • • , Wi+\, • • • , wp, 9, r 2 ) , for i = 1, • • • ,p, easily. This is because 7r ( i t f | 0 ,T 2 ) involves the determinant term IB^I1/2. The conditional posterior distribution of <5 given 0, cr2, Y and X , and the conditional posterior distribution of a2 given S, 0, r 2 , Y and X , are the same as we have found in the case of Prior 1, that is, TT(<5|0, a2,Y, X) ~ N {{D'D}~1 D'Z2, {D'D}-la2) and TT{O2\0, (5, r 2 , Y , X) ~ IG(a + n/2, {Y — DS — Ad}'{Y - DS - AO} + b). Ideally, one may think that the hyperparameter A is fixed, so the posterior samples can be drawn by using the Gibbs sampler algorithm from the conditional posteriors that have closed forms, and using the Metropolis Hastings algorithm from the posterior density TT(W\6,T2). However, in practice, we do not know which value of A will give us better estimates of the parameters of interest and hence, we should adopt some trial and error simulations to find the optimum value for A. On the other hand, we may consider some noninformative priors for A and let the data decide the value of A instead of using fixed values. We consider both fixed and random A approaches for posterior sampling. Assuming that A is uniform within (0, A 0) we calculate the conditional posterior distribution for A given 0, S, w, r 2 , Y and X in Appendix B.3. Daniels (1999) suggested an uniform shrinkage prior for the hyperparameter A, and in our usual notation, the uniform shrinkage prior can be written as 7r(A) ~ a2/{a2 + A), where a2 is the data variance. In the study we also consider the uniform shrinkage prior. The conditional posterior distribution for this case is calculated 28 in Appendix B.3. Since there are no closed form solutions found in both uniform and uniform shrinkage A cases, we will apply random walk Metropolis-Hastings algorithm to draw posterior samples for the hyperparameter A. 3.4.3 Conditional Posteriors for Prior 3 For Prior 3, the joint posterior distribution can be written as TT(0, S, T2, c r 2 | Y ) oc TT(Y |0, S, CT2, X)7r(0|T2)7r(r2)7r(c5)7r(t72). (3.12) As shown in Appendix B.4, the conditional posterior for 0 given 6, r 2 , a2, Y and X is a multivariate normal density with mean vector ( ( 1 / r 2 ) / + ( A ' A ) - 1 / < 7 2 ) - 1 A'Zi/a2 and the covariance matrix ( ( l / r 2 ) / + (A'A)'1 /a2)-1. The conditional posterior distribution for r 2 given 0, S, a2, Y and X is inverse gamma with parameters a + | and \{0'0) + 3. The other two conditional distributions, 7r(<5|0, <r2, Y , X ) and 7r(<72|0, <5, r 2 , Y , X ) , remain the same as before. That is, 7r(<S|0,<72, Y , X ) , is a multivariate normal density with mean vec- tor (D'D)~1D'Z2, and covariance matrix (D'D)~lo2, and 7r(cr2|0, S, r 2 , Y , X ) is an inverse gamma density with parameters a + | and ( Y — DS — A 0 ) ' ( Y — DS — AO) + b. Hence for all the conditional posterior distributions we may apply the Gibbs Sampler algorithm in order to draw the posterior samples. 3.5 Discussion In this chapter we have discussed our three choices of roughness penalty prior distributions. The posterior distribution for every prior distribution has been calculated. Closed form solutions are found for all the conditional posterior distributions in Prior 1 and Prior 3, and for all the parameters in Prior 2 except w and A. Hence, in Prior 2 both Gibbs sampler and random walk Metropolis-Hastings algorithms are applied to generate the posterior samples. In practice, we do not know how rough (wiggle) the true curve is. If the curve is not very rough, Prior 1 can be used for Bayesian curve estimation. On the other hand, if the true curve 29 is rough in some places but linear in other places, our conjecture is that Prior 2 may produce better estimates of the parameters. In the next chapters, we will check the performance of our proposed priors as well as the validity of our assertion about the roughness penalty priors in Bayesian curve fitting through several examples. 30 Chapter 4 M C M C Simulation for Piecewise Linear Spline 4.1 Introduction As our approaches to curve fitting are to allow the regression function / to be either piecewise linear or cubic spline functions, in this chapter, we provide a Bayesian version which models / by a piecewise straight line with a known number of equidistant knots. Some studies regarding Bayesian curve fitting has been done assuming an unknown number of knots at unknown locations. Denison, Mallick and Smith (1998) described a Bayesian methodology of curve fitting by a piecewise polynomial assuming a large collection of possible knots instead of choosing a single collection of knots. Their Bayesian knot selection procedure was done by selecting posterior samples for both the number of knots and their locations using the Markov chain Monte Carlo (MCMC) simulation technique of reversible jumps (Green, 1995). In this chapter we mainly discuss the simulation studies to check the performance of our proposed priors in the case of piecewise linear splines. Since we consider both linear and smooth effects of the same covariate on the response we need to apply the Bayesian back- fitting algorithm to fit the curve. We discuss the Bayesian backfitting algorithm in Section 31 4.2. Section 4.3 and 4.4 present the simulated results. Finally, a brief discussion is given in Section 4.5. 4.2 Backfitting Algorithm Hastie and Tibshirani (1990) discussed the backfitting procedure to fit the additive model Y = a + 2~Zi=i fi(Xi) + e, where the errors e are independent of the X{S, E(e) = 0 and var(e) = cr2. The Bayesian version of the Backfitting algorithm is also proposed by Hastie and Tibshirani (2000) and they called it Bayesian backfitting procedure and they applied it for posterior sampling from additive and generalized additive models. In this study we assume the additive model as y = 50 + Sxx + f{x : 0) + e, (4.1) where e is a random error term that follows normal distribution with mean 0 and constant variance a2. The fitted form of equation (4.1) can be written as y = So + S\X + Ad. So in our backfitting procedure we first guess the initial values of 8o and 8\ and then calculate y — 8Q + S\X and draw posterior samples for 6. Then we update So and c5i by drawing posterior samples using 0 in the equation y — AO = <5n + 5\X. We continue this process until a posterior sample of size n is obtained. 4.3 Simulated Results: Example 1 To illustrate how the priors work we apply them to several data sets. Our first data set, called Simulated Motorcycle Accident (Silverman, 1985) is taken from the S-plus data base, and is presented in Figure (4.1). The data frame consists of a series of n = 133 accelerometer readings taken through time in an experiment on the efficacy of crash helmets. In the data, the time points are not regularly spaced and there are multiple observations at some time 32 points, and all the observations are subject to random error. Considering time as a predictor variable, we fit the additive model (4.1) with all choices of prior distributions. o s a> 8 o CO time Figure 4.1: The motor-cycle data. Let us fix the number of knots p — 20. It is clear from Figure (4.1) that the smoothness of the underlying curve is not constant. To fit this data we choose the Bayesian method that makes the use of our proposed prior distributions. The model parameters 0, S, r 2 , w, o2 and A have been estimated and throughout the study we use the statistical software 'R' for computational purposes. In a Bayesian analysis a common question is what will be the value of the hyperpararneters. As the posterior sampling depends on the values of the hyperpararneters, one should be careful in choosing them. In this study, we have to set the values for the hyperpararneters a and b (for a 2), a and 8 (for r 2 ) , and A (for Prior 2). For the data variance o2 we choose both the hyperpararneters equal to 0.001, i.e., we consider a2 ~ IG(0.001,0.001). Such a prior distribution is referred to as non-informative. In a related context, Gustafson et al. (2003) suggested an inverse gamma prior for the variance component r 2 with mode equal to (0.01)2 and coefficient of variation equal to 1. This implies an inverse gamma distribution with a shape parameter of 3 and a scale parameter of (0.02)2. In this study we choose the shape parameter a = 3 and scale parameter 8 = 0.01. We can not make 8 less than 0.01 because of some computational difficulty. The smaller value of 8 forces T 2 33 to take very small values and when we compute the covariance matrix for 0 by using the term (Bw>r + (A'A)~1/a2)~1A'Z1/o2, we get the matrix (BW>T + (A'A)'1/a2) being close to non-invertible for Prior 2. Before considering random A , we run M C M C to simulate posterior samples for some fixed values of A . Among the estimates, it is observed that A = 0.33 (i.e., when the coefficient of variation of Wi = 1) gives better estimates of the parameters, and hence, we consider Prior 2 with A = 0.33 for comparing estimates obtained using other priors. A noninformative prior for the parameter A is assumed uniform within the interval (0, An) . Here we need not assume any value for the hyperparameter An because it cancels out both in the numerator and in the denominator when we calculate the acceptance ratio a in Metropolis algorithm (1.6). The uniform shrinkage prior for the hyperparameter A is also assumed. Both the priors for A are proper and hence lead to proper posterior distributions. Hence, depending on the different choices of A (fixed, uniform and uniform shrinkage), we have 3 different cases for Prior 2, and with Prior 1 and Prior 3 the total choices of prior distributions has increased to five. 4.3.1 Parameter Estimates For the parameter estimates we have run M C M C simulation techniques. The Gibbs sampler technique has been adopted for posterior simulation of 0, <5, r 2 and a2, and random walk Metropolis-Hastings algorithm for A and w. We present the estimates of the parameters T 2 , 6 and a2 for all the five specifications of the prior distributions in five tables (Tables 4.1-4.5). For the parameter vector 6 we have 20 estimates and, therefore, instead of presenting the estimates in a tabular form we prefer to present them graphically. Figure (4.2) shows the 60 posterior realizations (grey curves) for the parameter vector 0 with the means (dark curves) for all the five choices of the prior distributions. The 60 posterior realizations are chosen systematically from the last 3000 samples i.e., take one after every 50 samples. Although no big differences among the estimates have been noticed, by having a close look we observe that the mean curves produced by Prior 1 and Prior 2 are better than the mean curve produced by Prior 3 with respect to smoothness. 34 Similarity among the estimates of the data variance a2 is observed in all the five choices of the prior distributions. Estimates of the overall variance component r 2 differ greatly. For the case of Prior 3 we observe the biggest r 2 . The smallest r 2 is observed for Prior 2 with fixed A. For the Prior 2 with uniform A the estimate of r 2 is approximately equal to the estimate obtained in Prior 1 case. Although the estimates of the linear parameters So and Si differ slightly in all the cases, they have been considered as statistically insignificant since the 90% credible intervals include zero. Table 4.1: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1. Standard 5th 95th Parameter Mean deviation centile centile T 2 177.92 69.41 94.92 311.14 So 0.69 21.73 -36.76 35.13 Si 0.25 0.64 -0.77 1.35 a2 509.82 67.11 410.25 627.95 Table 4.2: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed lambda (A = 0.33). Standard 5th 95"1 Parameter Mean deviation centile centile T 2 103.36 51.37 40.10 205.44 So -2.88 19.20 -33.93 29.24 Si 0.22 0.55 -0.66 1.10 a2 512.71 67.35 412.00 630.89 4.3.2 Monitoring the Convergence of M C M C Simulation Before making any valid inferences about the estimates we need to check both the mixing and the convergence of M C M C run for them. We present the M C M C run for the parameters 35 Prion Prior 2: lambda=0.33 10 20 30 40 50 time Prior 2: uniform lambda 10 20 30 40 50 time Prior 2: uniform shrinkage lambda Prior 3 o _ < 0 8 - I o I I I I 10 20 30 i i 40 50 time Figure 4.2: Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means. 36 Table 4.3: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform lambda. Standard 5 t f t 95 t f t Parameter Mean deviation centile centile T2 174.05 73.46 87.11 310.32 <*0 -12.91 19.63 -43.24 20.67 Ch 0.20 0.62 -0.77 1.29 CT2 511.82 67.32 412.92 629.77 Table 4.4: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage lambda. Standard 5"1 95th Parameter Mean deviation centile centile r 2 166.52 68.31 82.79 293.66 So -2.87 21.05 -36.54 32.20 Si 0.10 0.55 -0.83 0.97 a2 508.79 68.32 405.95 633.42 Table 4.5: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3. Standard 5th 95th Parameter Mean deviation centile centile 1533.47 516.51 890.90 2538.64 So -31.85 16.61 -57.15 -3.99 Si 0.60 0.47 -0.18 1.37 a2 522.54 69.94 420.45 648.39 T 2 and a2 in Figures (4.3) and (4.5) respectively. The corresponding histograms based on the samples of iterations 2001-5000 are presented in Figure (4.4) and Figure (4.6). For each of the model parameters we have run M C M C simulations for 5000 iterations, among which the first 2000 iterations have been deleted as burn-in period of the chain and the remaining 3000 samples are used for inference purposes. 37 For updating A and w we perform random walk Metropolis-Hastings algorithm in an expo- nential scale. In other words, let Ao be the initial guess for A, then its candidate states are taken as A* = A 0 x exp(JV(0,u)), (4.2) where v is the jump size for updating A. We need to check for the convergence as well as mixing of the M C M C run for A. In practice, it is always desirable to have a Markov chain which can mix and converge well at the same time to ensure the accurate inference about the target posterior distribution. To have such a chain we need to adjust the jump size through monitoring the output. There is no hard and fast rule to determine the appropriate jump size and usually a trial and error method is adopted. Many research findings, for example, Gilks et al. (1996), suggest that for better mixing and convergence a desirable acceptance rate is 50%. So, while implementing an M C M C simulation it is necessary to plot the run to monitor the mixing and the convergence of the chain, and calculate the acceptance rate. We plot the M C M C chains for A and the corresponding histograms in Figure (4.7). In case of uniform A the posterior mean is 0.02, standard deviation is 0.01, the fifth percentile is 0.006 and the 95"* percentile is 0.04, where the acceptance ratio for the M C M C run is 50 percent. We find almost similar estimates for A when we consider the uniform shrinkage prior. In this case the posterior mean is 0.025, standard deviation is 0.02, the fifth and the 95th percentiles are 0.006 and 0.07, respectively, with the acceptance rate of 52 percent. At each knot point we have 5000 posterior samples for the w's (in other words, we have 5000 posterior samples for the vector w) in case of Prior 2. We have performed the same random walk Metropolis-Hastings algorithm as shown in Equation (4.2) for the parameter vector w. To check the convergence of the M C M C run, we take average w = YA=I wi a n < ^ pl°* the 5000 posterior realizations of w in Figure (4.8). The histograms for the last 3000 samples are also plotted in the same graph. It is known that for small values of A, the prior distribution of Wi's is centered at 1, so we expect small values for w. From Figure (4.8), it is clear that for both uniform and uniform shrinkage A priors, w converges to 1 after approximately 2000 38 iterations. The fitted curves with the data points are plotted in Figure (4.9). Since the true curve is unknown, we can only make the decision of better performance by visual inspection. We see that Prior 1 and Prior 2 with both uniform and uniform shrinkage A give similar but better fits and produce more smooth curves. From the mathematical point of view, we know that as A goes to 0, u>i goes to unity, and consequently, Prior 2 converges to Prior 1. Our simulation results support this argument empirically when we compare the five curves for all five different choices of prior distributions in Figure (4.10). P r i o r 2 : l a m b d a = 0 . 3 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 S O O O i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 S O O O i t e r a t i o n P r i o r 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a § -r § A 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n Figure 4.3: Plots of 5000 posterior realizations of the variance component T 2 for each of the prior distributions. 39 P r i o r 1 P r i o r 2 : l a m b d a — 0 . 3 3 2 - S 2 0 0 4 0 0 6 0 0 a o o 1 0 0 0 P r i o r 2 : u n i f o r m l a m b d a 2 0 0 4 0 0 6 0 0 8 0 0 1 0 0 0 & C T s O 2 0 0 4 0 0 6 0 0 e o o 1 0 0 0 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a & 8 J O 2 0 0 4 0 0 6 0 0 8 0 0 1 0 0 0 2 2 0 0 0 _ 2 3 0 0 0 4 0 0 0 Figure 4.4: Histograms of the last 3000 posterior realizations of the variance component r 2 for each of the prior distributions. P r i o r 2 : l a m b d a f i x e d S ss 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5< i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 S O O O i t e r a t i o n P r i o r 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n Figure 4.5: Plots of 5000 posterior realizations of the data variance a2 for each of the prior distributions. 40 P r i o r 1 P r i o r 2 : l a m b d a f i x e d i 1 1 1 1 1 300 400 500 600 TOO SOO 300 400 500 600 700 SOO P r i o r 2 : u n i f o r m l a m b d a 1 1 1 1 1 300 400 5 0 0 eoq 700 SOO _.2 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a 300 400 500 600 TOO SOO 2 i r - n 1 300 400 SOO 600 700 800 Figure 4.6: Histograms of the last 3000 posterior realizations of the data variance o2 for each of the prior distributions. Uniform lambda . Uniform lambda ~ I 1 1 1 1 r O 1 0 0 0 3 0 0 0 5 0 0 0 i t e r a t i o n C D i — r 1 1 1 1 O.O 0 . 1 0 . 2 0 . 3 O.A 0 . 5 Uniform shrinkage lambda Uniform shrinkage lambda C D =3 I 1 1 1 1 1 O.O 0 . 1 0 . 2 0 . 3 0 . 4 O . S i t e r a t i o n Figure 4.7: Plots of 5000 posterior samples of the parameter A and the histograms for the last 3000 samples. 41 F i x e d l a m b d a F i x e d l a m b d a T 1 1 1 1 O 1 0 0 0 3 0 0 0 5 0 0 0 i t e r a t i o n U n i f o r m l a m b d a - i 1 r 1 0 O O 3 0 0 0 i t e r a t i o n 5- § Ll-u. r - — i — i — i — i 1 . 0 1 . 5 2 . 0 2 . 5 3 . 0 3 . 5 4 . 0 U n i f o r m l a m b d a ~i 1 1 1 1 O . B 1 - O 1 . 2 1 .A 1 . 6 1 . S 2 . 0 U n i f o r m s h r i n k a g e l a m b d a o 1000 ~I 1 - 1 5 0 0 0 I t e r a t i o n U n i f o r m s h r i n k a g e l a m b d a & § - i I r - o.e 1 -o 1.2 1 .A 1 .s 1 .e 2.0 Figure 4.8: PZois o/ 5000 posterior samples of mean w and the histograms for the last 3000 samples. P r i o r 2 : l a m b d a ° . 3 3 1 0 2 0 3 0 4 0 S O t i m e P r i o r 2 : u n i f o r m l a m b d a T 1 1 1 0 2 0 3 0 AO S O t i m e P r i o r 3 § J t i m e n o 2 0 3 0 4 0 s o t i m e P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a Figure 4.9: Estimated curves for all the five choices of the prior distributions. 42 P r i o r 1 — P r i o r 2: f i x e d l a m b d a - • P r i o r 2: u n i f o r m l a m b d a • • P r i o r 2: u n i f o r m s h r i n k a g e - P r i o r 3 I 5 0 1 0 2 0 3 0 4 0 Figure 4.10: All the five estimated curves from the five prior specifications. 4.4 Simulated Results: Example 2 Our second data set is simulated in the following way: V = < — + e 10 ^ fc 30 - 0.2 x x + e sin (30 + f£) x 3x + e for x = 1,2, • • • , 100, for x = 101,102, • • • , 175 and for x = 171,172,-•• ,250, where e is a random error term which follows a normal distribution. In particular, we choose e ~ N(0,100). Hence we have a data set of size n = 250 which is plotted in Figure (4.11). The response is generated in such a way that some parts are linear and some parts are curvilinear. From the generated data, the model parameters 6, 6, w, r 2 , o2 and A are estimated by the Bayesian technique that makes the use of our proposed prior distributions for known and fixed number of knots p = 20. We use the same hyper parameters as in Example 1, i.e., we use r 2 ~ IG{3, 0.01) and a 2 ~ JG(0.001, 0.001). 4.4.1 Parameter Estimates As in the case of Example 1, the Gibbs sampler technique has been applied for posterior simulation of 6, d, T2 and cr2, and a random walk Metropolis-Hastings algorithm has been 43 used for w and A. We present the parameter estimates for all the five specifications of the prior distributions in five tables (Tables 4.6-4.10). Estimates of the parameter vector 6 are presented graphically. Figure (4.12) shows the 60 (random) posterior realizations (grey curves) for the parameter vector 6 and the means (dark curves) for all the five choices of the prior distributions. Apparently, from the plots we can say that the estimates from Prior 1 to Prior 2 do not differ. Although Prior 3 gives a less smooth mean curve, the standard errors of the estimates are lower (i.e., less variation among the 60 estimates) as compared to the other four cases. For this data set we also see that the estimates of the data variance a2 are approximately equal for all five choices of the priors. Estimates of r 2 differ greatly: the non-roughness penalty prior (i.e., Prior 3) gives the biggest r 2 ; on the other hand, Prior 2 with fixed A gives the smallest T 2 . Since the 90% credible intervals contain zero, we can say that the estimates of the linear regression parameters So and Si are statistically insignificant in all five cases. 4.4.2 Monitoring the Convergence of M C M C Simulation In this example we also run M C M C simulations for 5000 iterations to get posterior samples for all the model parameters, from which the first 2000 iterations are discarded as burn-in period of the chain. Figure (4.13) and (4.15) present 5000 posterior realizations for the Figure 4.11: The simulated data 1. 44 Priori Prior 2: lambda=0.33 1 1 1 1 100 150 200 250 x values Prior 2: uniform lambda T 50 ~\ 1 r 100 150 200 250 x values Prior 2: uniform shrinkage lambda 7 50 i 1 r 100 150 200 250 x values o o o i - A. 4 j J \. 4 •v / A V ¥ T 50 100 150 T 200 250 x values Figure 4.12: Sixty posterior realizations (grey curves) for the parameter vector 0. Dark curves show the posterior means. 45 Table 4.6: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1. Standard 5th 95 t h Parameter Mean deviation centile centile r> 33.37 17.07 12.31 65.42 So -6.69 7.68 -18.48 6.85 Si -0.05 0.04 -0.11 0.01 a2 93.49 9.45 78.81 110.05 Table 4.7: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed A (A = 0.33). Standard 95 t / l Parameter Mean deviation centile centile T2 18.75 10.07 6.41 37.43 So -3.93 6.65 -14.82 7.16 Si -0.56 0.04 -0.12 0.02 a2 92.69 9.09 78.79 108.10 parameters r 2 and cr2, respectively. We also plot the corresponding histograms for the last 3000 posterior samples in Figure (4.14) and Figure (4.16). The M C M C chains for A and the corresponding histograms are shown in Figure (4.17). In this figure, we observe that after approximately 1000 iterations, A nicely converges to 0. In this example, we also get the similar estimates of A for both the uniform and uniform shrinkage priors. In case of uniform prior, the posterior mean for the parameter A is 0.03, standard deviation is 0.06, the fifth percentile is 0.006 and the 95 t fe percentile is 0.16. The acceptance rate for the M C M C run is 52 percent. We calculate w for all the three cases of Prior 2 and plot the posterior samples in Figure (4.18). Similar to the case of Example 1, it is also observed that w converges to unity after approximately 2000 iterations. The true and fitted curves with the data points are plotted in Figure (4.19). Since the true curve is known, we can make the decision of better performance by comparing the estimated 46 Table 4.8: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform A. Standard 5th 95 t h Parameter Mean deviation centile centile 29.48 16.33 9.18 59.97 6o -2.90 6.27 -12.42 7.75 Si -0.06 0.03 -0.11 0.0002 a2 94.50 9.91 79.76 111.82 Table 4.9: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage A. Standard 5th 95ttl Parameter Mean deviation centile centile 29.73 18.14 6.85 62.72 So -3.22 7.77 -15.91 9.24 Si -0.06 0.04 -0.11 0.01 a2 95.23 10.19 80.30 113.07 curves with it. We have generated this kind of wiggly data set in order to have a clear snapshot of the behaviors of the prior distributions. Since our assertion is that Prior 2 will produce better curves for estimating wiggly functions, we will have a more smooth curve from it. Figure (4.20) shows that Prior 2 with both uniform and uniform shrinkage A give better estimates of the true curve, although the differences of the estimates obtained from Prior 1 and Prior 2 are very small. 47 P r i o r 1 P r i o r 2 : l a m b d a ° 0 . 3 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 3 O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n Figure 4.13: Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions. P r i o r 2 : l a m b d a ° 0 . 3 3 M P r i o r 2 : u n i f o r m l a m b d a ~i 100 - c 2 P r i o r 3 - I O O 2 0 0 3 0 0 4 0 0 g> ^ O 5 0 1 0 0 1 5 0 2 0 0 x 2 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a 1 0 O 1 5 0 2 0 0 2 Figure 4.14: Histograms of the last 3000 posterior realizations of the variance component r 2 /or eac/i of the prior distributions. 48 P r i o r "1 P r i o r 2 : l a m b d a f i x e d O "IOOO 2000 3000 4000 5000 i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a 1000 2000 3000 4000 i t e r a t i o n 3 1000 2000 3000 4000 5000 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a s 1000 2000 3000 4000 i t e r a t i o n 1000 2000 3000 4000 5000 i t e r a t i o n Figure 4.15: Plots of 5000 posterior realizations of the data variance o2 for each of the prior distributions. P r i o r 1 P r i o r 2 : l a m b d a f i x e d a> I" S 6 0 SO 1 0 0 1 2 0 1 4 . 0 P r i o r 2 : u n i f o r m l a m b d a 6 0 SO 1 0 0 1 2 0 1 4 0 •2. SO SO 1 0 0 1 2 0 1 4 0 o 2 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a eo so -too 120 140 P r i o r 3 60 80 100 120 14D O Figure 4.16: Histograms of the last 3000 posterior realizations of the data variance a for each of the prior distributions. 49 U n i f o r m l a m b d a U n i f o r m l a m b d a O 1 0 0 0 3 0 0 0 5 0 0 0 0 .0 0.1 0 .2 0 .3 0 .4 0 .5 i t e r a t i o n X U n i f o r m s h r i n k a g e l a m b d a U n i f o r m s h r i n k a g e l a m b d a ~i 1 1 1 1 r O 1 0 0 0 3 0 0 0 5 0 0 0 o CD <X> C M I 1 1 1 1 1 O.O O.I 0 .2 0 .3 0 .4 0 . 5 i t e r a t i o n X Figure 4.17: Plots of 5000 posterior samples of the parameter A and the histograms for the last 300 samples. i s i s F i x e d l a m b d a F i x e d l a m b d a l 1 i 1 1 O 1 0 0 0 3 0 0 0 S O O O i t e r a t i o n U n i f o r m l a m b d a IS 2 i 1 o 1000 1 1 1-1 3 0 0 0 5 0 0 0 i t e r a t i o n U n i f o r m s h r i n k a g e l a m b d a - i 1 1 1 r O I O O O 3 0 0 0 5 0 0 0 i t e r a t i o n <L> _ 1 . 0 1 . 5 2 . 0 2 . 5 W U n i f o r m l a m b d a 1 . 0 1 . 5 2 . 0 2 . 5 W U n i f o r m s h r i n k a g e l a m b d a S3 I 1 . 0 1 . 5 2 . 0 2 . 5 W Figure 4.18: Plots of 5000 posterior samples of mean w and the histograms of the last 3000 samples. 50 P r i o r 1 P r i o r 2 : l a m b d a = . 3 3 CO ^= S3 8. CO CO i=- S O 1 0 0 1 5 0 2 0 0 2 x v a l u e s P r i o r 2 : u n i f o r m l a m b d a 5 0 2 0 0 2 S O 1 O O 1 S O x v a l u e s P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a s -r co x v a l u e s Figure 4.19: True and the estimated curves of all five choices of the prior distributions. x v a l u e s Figure 4.20: All five estimated curves with the true curve. 51 Table 4.10: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3. Standard 5 t / l 95 t f t Parameter Mean deviation centile centile r 2 126.29 43.87 71.61 207.16 S0 3.61 5.39 -6.29 11.91 Si -0.04 0.03 -0.09 0.01 a2 90.20 8.42 77.19 104.51 4.5 Simulated Results: Example 3 In the previous two examples we have considered data sets with much wiggleness. That is, the underlying function fluctuates at different knot points. For comparison of the proposed priors we now consider a third data set with very small data variance and the underlying curve is not so rough. The data are generated in the following way: y = (x- 3.5)2 + e for 0 < x < 8, where e is a random error term and follows standard normal distribution. In this case we have a random sample of size n = 81 and we plot the data in Figure (4.21). From the figure it is clear that the data variance is almost the same and the underlying curve is not wiggly. We set the number of knots p = 10 and estimate all the model parameters with the Bayesian technique using the proposed prior distributions. We use the same specifications for the hyperprior parameters as in the previous two examples. O 2 4 e a x v a l u e s Figure 4.21: The simulated data 2. 52 4.5.1 M C M C Run and Parameter Estimates We also draw 5000 posterior samples for all the model parameters by performing M C M C runs. In this example, we are not showing all the parameter estimates and the figures regarding the convergence and mixing of M C M C runs. We only present 60 posterior samples (systematically chosen from the last 3000 samples) of the parameter vector 0 with the mean curves in Figure (4.22), the true and the fitted curves with the data points in Figure (4.23), and all the estimated curves with the true curve in Figure (4.24). Although all the priors give similar estimates as the true curve, by a close look at the figures, we can say that both Prior 1 and Prior 2 work better than Prior 3. As we see in the previous chapters, Prior 2 needs much more computation than Prior 1 and hence, in case of simple curve estimation, we would rather use Prior 1 for Bayesian curve estimation technique. 53 Prior 1 Prior 2: lambda=0.33 1 2 3 4 5 6 7 1 2 3 4 5 6 7 x values x values Prior 3 V INK . / - 1 2 3 4 5 6 7 x values Figure 4.22: Sixty posterior realizations (grey curves) for the parameter vector 6. Dark curves show the posterior means. 54 P r i o r 1 2 n d p r i o r w i t h l a m b d a = . 3 3 55 4.6 Discussion In this chapter three simulation studies have been done for the Bayesian curve fitting using hierarchical models with our three roughness penalty prior distributions. We perform M C M C to simulate 5000 posterior samples for all the model parameters. The first 2000 iterations can be regarded as the burn-in period of the chain, and therefore, they have been discarded for the chain. The remaining 3000 samples are used for parameter estimation and curve fitting. From the plots of posterior realizations we observe good convergence and mixing of the chains. In all the three examples, we find similar performance of the proposed prior distributions, where Prior 1 and Prior 2 show similar but better performance than Prior 3. However, we have different expectations regarding the performance of the proposed prior distributions. This is because Prior 1 estimates the curve considering the overall roughness penalty of the whole data set, while Prior 2 estimates the curve considering the local roughness penalty at each knot point, but Prior 3 does not consider any roughness penalty to estimate the curve. Therefore, if the data is not too rough, we do not expect much variation among the performance of the three priors. But, if the data are very rough (wiggle), Prior 3 should produce the worst estimate among the three. Also, if the wiggleness of the data varies at different knot points, Prior 2 should produce better estimates than Prior 1. On the basis of the examples we reject our initial hypothesis that in the case of much wiggleness of the data, Prior 2 produces the best estimates among the three, but accept the fact that both local and global roughness penalty priors produce the same smooth fit of the curve in case of Bayesian curve fitting using hierarchical models. In this chapter, all the inferences are drawn under the piecewise linear spline assumption. In the next chapter, we will continue the application of Bayesian hierarchical models under the assumption of natural cubic spline with roughness penalty approach. 56 Chapter 5 M C M C Simulation for Natural Cubic Spline 5.1 Introduction The Bayesian curve fitting with our proposed prior distributions under the assumption of piecewise linear splines has been discussed in Chapter 4. In this chapter, we also provide a Bayesian curve fitting which models the regression function / by a natural cubic spline (NCS) with a known number of equidistant knots. For our three different roughness penalty prior distributions of the parameter 6, discussed in Chapter 3, we have found three precision matrices, B, BWtT and / , in the case of piecewise linear splines. Modifications in the precision matrices B and BW<T are necessary in order to apply the prior distributions to the natural cubic spline. In section 5.2 we present the modified prior distributions and the corresponding posterior distributions. The simulation results of two examples are presented in section 5.3 and 5.4, and we give a brief discussion in section 5.5. 57 5.2 Modified Prior and Posterior Distributions In the case of a natural cubic spline with roughness penalty, the penalized sum of squares can be written as $ = £ > - f(*i)Y + ~2 f{f"(x)Ydx. (5.1) i=l T J a Here we use the smoothing parameter 1/r2 instead of T 2 in equation (2.3) of chapter 2 for our convenience. In matrix notation equation (5.1) can be written as S = (Y-f)'(Y-f) + ±f'Kf = (Y-D6- A0)'(Y -DS- Ad) + \f'Kf « (z - Ad)\z - AO) + \&KB, where K is the (jp + 2) x (p + 2) roughness penalty matrix (Green and Silverman, 1994) and Z = Y - DS. We get the roughness penalty term (1/T2)0'K0 from (l/r2)f'Kf by separating the linear and smooth parts of / and constraining the smooth part to be zero at the exterior knots. 5 . 2 . 1 P r i o r 1 For NCS, Prior 1 can be written as TT(0;T 2) = 7r(0 | r 2 )7r(r 2 ) where TT(0|T2) - N^V'1), V = --hKi, and K\ is the p x p matrix constructed from the matrix K excluding the rows and colums corresponding to the two exterior knots (i.e., deleting the first row and the first column, and the last row and the last column). The prior distributions for the parameters T 2 , a2 and <5 remain the same as before. Hence, by replacing ( l / r 2 ) M by V, we calcu- late the conditional posterior distributions. We find that 0 given S, r 2 , a2, Y and X , is multivariate normal with mean vector (V + (A'A)~l/a2)~x A'Zi/a2 and covariance matrix (V + (A'A)'11a2)'1, where Zi = Y — DS. The conditional posterior distribution for the variance component r 2 given 0, S, a2, Y and X , is inverse gamma with parameters a + p/2 and (0'V0)/2 + 3. The conditional posterior distribution of the linear regression parameter 8 given 0, a2, Y and X , and the conditional posterior distribution of the data variance a2 58 given c5, 6, r 2 , Y and X remain same as in the case of piecewise linear splines. We get the closed form solutions for all the conditional posterior distributions. Hence we will apply the Gibbs Sampler algorithm to draw the posterior samples. 5.2.2 Prior 2 In the case of the second prior we have assumed different variance components at different knot points. In particular, we have variances r 2 , r | , . . . , r 2 for the parameters 6i, 0 2 , . . . , 9P where r 2 measures the roughness of the curve at knot U. Hence it is logical to modify the roughness penalty term as follows: P + 1 i ru (5.2) Unlike the case of piecewise linear splines, it is not easy to express (5.2) in matrix form. We are not going to do that tedious mathematical operation. However, in the light of our expression in the case of piecewise linear splines, we can approximate (5.2) as: p+i P - * " 1 i rti £ 4 / {/"(*i)}2dzi « / % / , (5.3) where KT is a (p + 2) x (p + 2) matrix obtained by pre- and post- multiplications of the K matrix by a diagonal A T matrix as follows: f 0 . . . TO 0 0 0 0 . . . - L -1_ Tp+1 &00 ho koi hi k0,(p+l) h,(p+i) zŝ oo -ki •01 TITO TQTl k(p+i),o fc(p+i),i • • • fc(p+i),(p+i) • • • ^r^p+i) T 0 TO 0 0 0 0 1 TP+1 59 Since the variance components at the two exterior knots do not make any sense, we delete the first row and the first column and the last row and the last column of KT so that the roughness penalty matrix is —h T2T1 21 — kl2 T\Tp T2Tp >4p «2p d-A- T 2 "'pp Hence we assume that 0 given r 2 has the multivariate normal distribution with mean 0 and precision matrix K T . The interesting feature of this parameterization is that if T2 = r 2 = ... = r2 — T2, we get KT = (1/T2)KI. That is if all the r?s are equal, Prior 2 is exactly equal to Prior 1. For our reparameterization r 2 = u>iT2 we modify the precision matrix as VWLT = (1/T2)KW, where —hi Wl 1 ± , 1 fcoi — koo ^/W2W\ * 1 W2 y/W2V :k lp «2p — kn Therefore, 0 given w and r 2 follows a multivariate normal distribution with mean vec- tor 0 and precision matrix VWiT. For the prior distributions r 2 ~ IG(a,f3) and Wi ~ IG(l/\, 1/A) the conditional posterior distribution IT(Q\6,T2,W,O2, Y , X ) is a multivariate normal density with mean vector (VW<T + (A'A)-1 /a2)-1 A'Z\/a2 and the covariance matrix (Yw,r + {A'A)~l/cr2)~l, and the conditional posterior distribution for r 2 given 6, w, d, a2, Y and X , is inverse gamma with the parameters a+p/2 and (d'VWtT0)/2 + 8). To calculate the conditional posterior distribution of w we write Tr(w\d, r2) oc IT(0\T2, W)IT(W). Since the wjs are independent and identically distributed with inverse gamma, the joint posterior 60 distribution of wi, • • • ,wp, given 6 and r 2 , can be written as 7r(ioi, • • • ,Wp\6,T2) oc ir(6\T2,w)ir(wi) • • • 7r(wp) oc \vWtT\l2 exp ̂ -^e'vw,Te^ n w i \ (x+D r(i) W exp (5.4) We have VW<T = (1/T2)KW, and the matrix Kw can be expressed as AwKiAw with 0 . . . 0 I 0 * 0 0 0 1_ 'Wp Hence the determinant \VWtT\ can be expressed as nf=i l-̂ il- The quadratic form O'VW>T0 is equivalent to tr(yWjT06') and after some mathematical manipulation it can be written as (l/r2)tr(i<r1(A1,c9)(6> /A lu)) which is equal to £ £ f = i £ i = i J g j f > where fcy is element corresponding to the ith row and the jth column of matrix K\. Therefore, the conditional posterior distribution in expression (5.4) can be written as ^ , . . . ^ , r > ) oc y n y I ^ I - X P J - ^ E E - ^ I I ' ( i l t ' i M * " 1 n i = l r(i) W — exp \-—\ From the conditional joint distribution of w\, • • • ,wp, it is difficult to find the conditional posterior distribution of the ith element Wi, given all other WjS (j 7 M ) , 0 and r 2 . Hence we decide to draw posterior samples for w from the conditional joint distribution using a random walk Metropolis Hastings Algorithm. Al l other conditional posteriors such as ir(d\0, a 2 , Y , X) , TT(O2\0, S,T2, Y , X ) and the conditional posterior for A remain the same as found in section 3.4.2. It is obvious that all the conditional posterior distributions for the case of Prior 3 remain the same as we find in section 3.4.3. 61 5.3 Example 1 To illustrate the methodologies that we have described above, our first simulation study has been done for the Simulated Motorcycle Accident (Silverman, 1985) data. The additive model (4.1) is fitted with all five choices of prior distributions assuming fixed number of knots p = 20, and the model parameters 0, S, r 2 , a2,w and A have been estimated. For the same specifications of the hyperpararneters as defined in the previous chapter, we perform Gibbs sampling and random walk Metropolis algorithms for 5000 iterations. More specifically, we have performed Gibbs sampling technique for posterior simulation of 6, S, T 2 and a2, and random walk Metropolis algorithm for w and A. The estimates of the parameters r 2 , 5 and a2 for all five specifications of the prior distributions are presented in five tables (Tables 5.1- 5.5). Twenty estimates for the parameter vector 6 are shown graphically. Figure (5.1) shows the 60 posterior samples (grey curves) for the parameter vector 9 with the means (dark curves) for all five choices of the prior distributions. The differences among the estimated mean curves produced by Prior 1 and Prior 2 are very small. However, the curves are more smooth than that of Prior 3. It may also be mentioned that Prior 3 gives estimates with small standard errors. We observe that the estimates for all the parameters are similar for all three specifications of Prior 2. Estimates of the data variance a2 are above 500 with standard deviation more than 60 in all the cases. Estimates of r 2 differ greatly: for the non-roughness penalty prior (i.e., Prior 3) case r 2 is largest, on the other hand, the smallest r 2 is observed for Prior 2. The estimates of the linear regression parameters So and ch are statistically insignificant (i.e., 90% credible intervals include zero). 5.3.1 Monitoring the Convergence of M C M C Simulation To check the mixing and convergence we present the M C M C run for the parameters r 2 and CT2 in Figures (5.2) and (5.4), respectively, and to have an idea about their distribution, the corresponding histograms are plotted in Figures (5.3) and (5.5). For each of the model parameters we have run M C M C simulations for 5000 iterations, from which the first 2000 62 Prior 1 Prior 2: lambda=0.33 Table 5.1: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1. Standard 5th 95 t f t Parameter Mean deviation centile centile T2 34.88 14.67 17.52 62.64 So -5.71 23.60 -45.88 31.32 Si 0.23 0.64 -0.87 1.26 a2 520.40 69.56 416.96 642.15 Table 5.2: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed A (A = 0.33). Standard 5th 95th Parameter Mean deviation centile centile T2 23.95 12.52 8.78 48.39 So 15.10 25.98 -24.69 61.41 Si -0.02 0.61 -1.07 0.92 a2 519.93 72.05 414.24 648.17 iterations have been deleted as burn-in period of the chain and the remaining 3000 samples are used to draw inferences. The M C M C chains for A and the corresponding histograms are shown in Figure (5.6). In the case of uniform A, the posterior mean is 0.02, standard deviation is 0.04, the 5th percentile is 0.006 and the 95th percentile is 0.1. The acceptance ratio for the M C M C run is 51 percent. We find similar estimates for A when we consider the uniform shrinkage prior for it. Five thousand posterior samples of w and the histograms of the last 3000 samples are plotted in Figure (5.7). The empirical results indicate that the distribution of the WiS is centered at 1, as we see, w converges to 1 after approximately 2000 iterations for both uniform and uniform shrinkage priors of A. On the other hand, for the fixed A case, M C M C run does not show any convergence in 5000 iterations with an acceptance rate of 43 percent. The fitted curves with the data points are plotted in Figure (5.8). It is apparent that Prior 1 and 64 Table 5.3: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform A. Standard 5th 95"1 Parameter Mean deviation centile centile r2 31.91 13.92 15.58 57.64 So -1.47 22.99 -41.56 32.68 Si 0.21 0.58 -0.73 1.13 a2 516.21 68.99 417.29 639.29 Table 5.4: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage A. Standard 5th 95th Parameter Mean deviation centile centile T2 32.98 14.95 15.47 60.89 So -1.30 20.81 -38.08 31.53 Si 0.18 0.60 -0.79 1.18 a2 516.12 67.47 414.16 631.02 Prior 2 produce exactly the same smooth curve, better than the curve produced by Prior 3, when we see all the estimates together in Figure (5.9). Therefore, we are commented on the similar behavior of Prior 1 and Prior 2 under the assumption of natural cubic spline regarding to estimation and smoothness. 65 P r i o r 2 : l a m b d a = 0 . 3 3 53 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a 55 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n S3 S3 H O 1 0 0 0 2 0 0 0 3 0 0 0 - 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a SS 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n Figure 5.2: Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions. P r i o r 2 : l a m b d a = 0 . 3 3 3 -A P r i o r 2 : u n i f o r m l a m b d a n 1 O O 1 5 0 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a 5 0 1 0 0 1 5 0 T 2 a> C T 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 x 2 Figure 5.3: Histograms of the last 3000 posterior realizations of the variance component r 2 for each of the prior distributions. 66 P r i o r 1 P r i o r 2 : l a m b d a f i x e d 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 ( i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 i t e r a t i o n Figure 5.4: Plots of 5000 posterior realizations of the data variance o2 for each of the prior distributions. P r i o r 2 : l a m b d a f i x e d 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 S O O 9 0 0 P r i o r 2 : u n i f o r m l a m b d a 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 8 0 0 9 0 0 1 H 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 S O O 9 0 0 o 2 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a R - i 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 S O O 9 0 0 M 3 0 0 4 0 0 5 0 0 6 0 0 7 0 0 ' 8 0 0 9 0 0 Figure 5.5: Histograms of the last 3000 posterior realizations of the data variance a2 for each of the prior distributions. 67 Uniform lambda Uniform lambda - i — i — i — i — i — O 1 o o o 3 0 0 0 S O O O i teration Uniform shrinkage lambda ~i 1 r o 1000 3 0 0 0 S O O O i teration Uniform shrinkage lambda BL i — i — i — i — i — i — i O . O O O . - I O 0 . 2 0 0 . 3 0 A . Figure 5.6: Plots of 5000 posterior samples of the parameter A and histograms for the last SOOO samples. F i x e d l a m b d a F i x e d l a m b d a oo | - | 1 r~ O 1 0 0 0 3 0 0 0 5 0 0 0 i t e r a t i o n U n i f o r m l a m b d a i 1 1 1 r O 1 0 0 0 3 0 0 0 5 0 0 0 i t e r a t i o n U n i f o r m s h r i n k a g e l a m b d a 1 1 1 r - O 1 O O O 3 0 0 0 i t e r a t i o n U n i f o r m l a m b d a s- m I r - ~I 1 O . S O . Q 1 . 0 1 .1 1 . 2 W U n i f o r m s h r i n k a g e l a m b d a S i 1 —r—— 1 1 o .a o . s 1.0 1.-1 1.2 Figure 5.7: Plots of 5000 posterior samples of mean w and histograms for the last 3000 samples. 68 P r i o r 1 P r i o r 2 : l a m b d a = . 3 3 i l i i l 1 0 2 0 3 0 4 0 5 0 t i m e Figure 5.8: Estimated curves for all five choices of the prior distributions. 1 0 2 0 3 0 4 0 5 0 t i m e Figure 5.9: All five estimated curves from the five prior specifications. 69 Table 5.5: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3. Standard 5th 95*ft Parameter Mean deviation centile centile T 2 1366.78 442.48 802.83 2166.88 So -32.53 15.35 -56.11 -7.22 Si 0.64 0.48 -0.18 1.41 a2 533.12 71.91 427.13 662.09 5.4 Example 2 Here we fit the same data set of size n = 250 that we have used in Example 2 of Chap- ter 4 for the fixed number of knots p = 20 in order to have a second chance of comparing the proposed prior distributions. The same hyperpararneters are being used, i.e., we use r 2 ~ IG(3, 0.01) and a2 ~ 7G(0.001, 0.001). Similar to the case of Example 1, both the Gibbs sampler technique and a random walk Metropolis-Hastings algorithm are applied for posterior simulation of the parameters. Estimates of r 2 , S and a2 for all five specifications of the prior distributions are presented in five tables (Tables 5.6-5.10). Estimates of the pa- rameter vector 0 are presented graphically. Figure (5.10) shows the 60 posterior realizations (grey curves) for the parameter vector 0 with the means (dark curves) for all five choices of the prior distributions. We observe the similar performance as we find in example 1. To be specific, both Prior 1 and Prior 2 produce approximately the same smooth mean curves. The estimates of the data variance a2 are similar in all the five choices of the priors. The variance component r 2 is the only parameter whose estimates differ greatly for different choices of the prior distributions, and the largest r 2 is found in the case of Prior 3. The regression parameters remain statistically insignificant in this example as well. 5.4.1 Monitoring the Convergence of M C M C Simulation In our 5000 iterations of M C M C simulations, a clear picture of a burn-in period is found in the first 2000 iterations. Figures (5.11) and (5.13) present 5000 posterior realizations 70 Priori Prior 2: lambda=0.33 x values Figure 5.10: Sixty posterior realizations (grey curves) for the parameter vector 9. Dark curves show the posterior means. 71 Table 5.6: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 1. Standard 5th 95"1 Parameter Mean deviation centile centile r2 0.07 0.04 0.02 0.14 £o -9.54 8.99 -24.36 6.79 Si -0.04 0.05 -0.11 -0.05 a2 94.79 9.49 80.14 111.18 Table 5.7: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with fixed A (A = 0.33). Standard 5th 95'". Parameter Mean deviation centile centile r2 0.04 0.03 0.007 0.102 So -11.49 9.92 -30.03 3.60 Si -0.03 0.04 -0.10 0.04 a2 96.87 9.86 81.67 114.10 Table 5.8: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform A. Standard 517* 9 5 ^ Parameter Mean deviation centile centile T1 O06 O04 O02 0 1 4 - J 0 -4.07 8.05 -17.25 9.03 6! -0.05 0.04 -0.12 0.03 a2 95.53 9.87 80.84 112.36 Table 5.9: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 2 with uniform shrinkage A. Standard 5* 9 J F - Parameter Mean deviation centile centile T2" O06 O04 O02 OlF~ t50 -8.00 8.79 -22.48 6.50 Si -0.04 0.04 -0.11 0.03 a 2 95.20 9.38 80.98 112.39 72 Table 5.10: Posterior distributions from Gibbs sampling based on iterations 2001-5000 for Prior 3. Standard 5 t h 95th Parameter Mean deviation centile centile r 4 102.91 34.61 60.63 172.05 So 3.25 4.32 -3.56 10.59 Si -0.03 0.03 -0.08 0.01 a 2 91.13 8.51 77.88 105.56 for the parameters r 2 and a2, respectively. We also plot the corresponding histograms in Figures (5.12) and (5.14), respectively. The M C M C chains for A and the corresponding histograms are shown in Figure (5.15). We observe a nice convergence and a good mixing of the M C M C run. In this example, similar estimates of A for both the uniform and uniform shrinkage priors are found. For the uniform prior, the posterior mean for the parameter A is 0.015, the standard deviation is 0.01, the 5th percentile is 0.006 and the 95th percentile is 0.033. The acceptance ratio for the M C M C run is 53 percent. Posterior samples of w for the Prior 2 cases are plotted in Figure (5.16). Similar to the case of Example 1, w converges to 1 after approximately 2000 iterations for both uniform and uniform shrinkage A cases. True and the fitted curves with the data points are plotted in Figure (5.17). To compare the performance of the five priors together, all the five estimated curves with the true curve are plotted in Figure (5.18). Similar conclusion as in Example 1 is drawn, that is Prior 1 and Prior 2 perform equally better than Prior 3 in regard to smoothness. 5.5 Discussion Two simulation studies have been done in this chapter for the Bayesian curve fitting using hierarchical models with our proposed prior distributions under the assumption of natural cubic spline. We perform M C M C to simulate 5000 posterior samples for all the model parameters. The plots of posterior realizations reflects well convergence as well as mixing of the chains. Although the concepts of three prior distributions are completely different, in 73 P r i o r 1 P r i o r 2 : l a m b d a = 0 . 3 3 1 Q O O 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m l a m b d a O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 3 1 8 "I 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n l*llll>L*.i»<l • i 1 1 1 1 r O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a -i 1 1 1 r O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n Figure 5.11: Plots of 5000 posterior realizations of the variance component r 2 for each of the prior distributions. s - i -i 1 1 1 0 . 2 0 0 . 3 0 P r i o r 2 : l a m b d a = 0 . 3 3 - l 1 1 1 1 1 O . O O 0 . 1 0 0 . 2 0 0 . 3 0 P r i o r 2 : u n i f o r m l a m b d a 0 . 1 0 0 . 2 0 2 P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a -i 1 1 1 1 1 O . O O 0 . 1 0 0 . 2 0 0 . 3 0 ^.2 rr l r - 5 0 1 0 0 1 S O 2 0 0 2 S O 3 0 0 T 2 Figure 5.12: Histograms of the last 3000 posterior realizations of the variance component r 2 for each of the prior distributions. 74 Prior 1 Prior 2 : lambda fixed i t e r a t i o n i t e r a t i o n Prior 2 : uniform lambda Prior 2 : uniform shrinkage lambda i 1 1 1 1 r O 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 i t e r a t i o n Figure 5.13: Plots of 5000 posterior realizations of the data variance a2 for each of the prior distributions. Prior 2 : lambda fixed S - s ~~1 I-* eo eo 100 120 140 S 1 6 0 8 0 1 0 0 1 2 0 1 4 0 Prior 2 : uniform lambda 6 0 S O 1 0 0 1 2 0 1 4 0 Prior 2 : uniform shrinkage lambda l b * . eo so 100 120 140 60 so 100 120 140 „ 2 Figure 5.14: Histograms of the last 3000 posterior realizations of the data variance o2 for each of the prior distributions. 75 U n i f o r m l a m b d a U n i f o r m l a m b d a O 1 0 0 0 3 0 0 0 i teration 5 0 0 0 c r rh->— 0 . 0 0 0 .04 I 0 . 0 8 U n i f o r m s h r i n k a g e l a m b d a U n i f o r m s h r i n k a g e l a m b d a i 1 r O 1 0 0 0 3 0 0 0 5 0 0 0 i teration o O.OO 1 1 0 .04 1 1 0 . 0 8 Figure 5.15: Plots of 5000 posterior samples of the parameter A and histograms for the last 3000 samples. F i x e d l a m b d a 1OOO 3000 i t e r a t i o n F i x e d l a m b d a CL> O.S 1.0 1.S 2.0 W U n i f o r m l a m b d a 5 = ft+J T T~ o -IOOO ~i r~ 3000 i t e r a t i o n U n i f o r m l a m b d a - I : 1 "I i 0.5 Q.T O.S 1.1 W U n i f o r m s h r i n k a g e l a m b d a ~i i r O 1OOO 3000 i t e r a t i o n U n i f o r m s h r i n k a g e l a m b d a Ji i — t — i — i — i — i — i — i O.S O.T 0.9 1.1 w Figure 5.16: Plots of 5000 posterior samples of w and histograms for the last 3000 samples. 76 P r i o r 1 P r i o r 2 : l a m b d a = . 3 3 CO i = <3 • ' .V.'. • • • •Sh- •"" -" • \ \.y f - - t r u e ' - " * ; " , \ " * e s t i m a t e . 5 0 T~ 2 0 0 c o ^ - H 1 O O 1 5 0 x v a l u e s P r i o r 2 : u n i f o r m l a m b d a s o 2 0 0 2 5 0 55 a> 0 c o 5=- 1 O O 1 5 0 x v a l u e s P r i o r 2 : u n i f o r m s h r i n k a g e l a m b d a 55 - r 1 O O 1 5 0 x v a l u e s 1 0 0 1 5 0 x v a l u e s Figure 5.17: True and the estimated curves of all the five choices of the prior distributions. Figure 5.18: All the five estimated curves with the true curve. 77 Example 1, we do not observe any big difference among the estimates. To be specific, both Prior 1 and Prior 2 have produced exactly the same smooth curves which are slightly better than the curve produced by Prior 3. In Example 2, the upshot remain the same. Despite of the non-smoothing behavior of Prior 3, it gives estimates with less standard errors as we see in Figures (5.1) and (5.10). Therefore, we accept the fact that Prior 1 and Prior 2 perform in a similar way for estimating wiggly functions under the natural cubic spline assumption. 78 Chapter 6 Conclusion and Future Work The main focus of this dissertation is to compare three roughness penalty prior distributions for the parameters of smoothing splines in Bayesian hierarchical models. These models are useful in many real life situations and some of them have been discussed in Chapter 1. The M C M C simulation technique is essential for the Bayesian curve fitting and we have briefly explained the Gibbs sampler and the Metropolis-Hastings algorithms in Chapter 1. In order to estimate a curve in the context of a smoothing spline with roughness penalty approach, the usual prior distribution for the regression function considers only one variance component (or smoothing parameter) for the entire function and measures the global rough- ness of the curve. In this thesis we call it Prior 1. Our scientific interest is to explore more smooth curve from a wiggly data set, and hence we may rather think of different variance components for the parameters at different knot points. This kind of local variance compo- nent concept may produce more smooth curves in dealing with wiggly data. To apply the concept of measuring local roughness of the underlying curve, we propose a second prior and call it Prior 2. In connection with our prior search, to estimate a rapidly fluctuating function, we come up with the idea of comparing the above two priors with a non-roughness penalty prior distribution, in which all the parameters are independently distributed with a common global variance component, and we call it Prior 3. Our nonparametric approaches 79 to curve fitting include both piecewise linear and natural cubic splines. Hence for simulation we need to calculate the basis matrices for both piecewise linear and natural cubic spline cases. We have briefly discussed the procedures of calculating basis matrices in Chapter 2. Under the assumption of a piecewise linear spline, we have formulated the hierarchical mod- els for the Bayesian curve fitting with roughness penalty prior distributions, i.e., for Prior 1 and Prior 2, and also for the non-roughness penalty prior, Prior 3. We have calculated the posterior distributions for all the model parameters. In most of the cases, we have found closed form solutions for the posterior distributions and so we have applied Gibbs sampling technique. For the parameters w and A we have not obtained any closed form solutions, and hence we have adopted random walk Metropolis Hastings algorithms in exponential scale for simulating these parameters. In Chapter 4, the Bayesian backfitting algorithm has been applied to draw posterior sam- ples. This is because our regression function consists of both linear and smooth parts. Three different data sets of different curvatures are chosen for comparing the performance of the proposed prior distributions. Necessary steps have been taken to make the M C M C run and mixing of the posterior samples good. We have also calculated the standard errors and the credible intervals for some of the model parameters. In Chapter 4, simulation has been done only for the piecewise linear spline assumption. After observing the results from the piecewise linear spline, we feel encourage to compare the prior distributions in the case of natural cubic spline. The formulation of the Bayesian hierarchical model in the case of the global variance component prior distribution (Prior 1) is well established and found in many books and articles, on the other hand, the concept of local variance component prior distribution (Prior 2) is new and there is no such reference available. Since the calculation of the roughness penalty matrix under the assumption of NCS becomes very tedious, an approximate roughness penalty matrix has been proposed for 80 Prior 2 in order to calculate the posterior distributions. And no changes have been necessary in calculating the posterior distributions for Prior 3 in the case of NCS. For Prior 2, we have three different sets of posterior distributions depending on the hy- perparameter A (fixed, uniform and uniform shrinkage). However, they did not play any significant role to make the estimated curve different. Although the estimates of w do not show any convergence for the fixed A case, this does not show any effect on the estimated regression function. We carefully checked the estimation process and necessary steps are taken to make good mixing as well as convergence of the M C M C run. In Chapters 4 and 5 we have given brief discussions about our estimated curves. In recapitulating what we have found in our simulation studies we can say that both Prior 1 and Prior 2 perform in a similar way, better than Prior 3, in wiggly curves estimation. So, inclusion of the roughness penalty term is necessary for rapidly fluctuating curve estimation, and the local and global variance component concept does not make any difference in the estimation. Even though this argument is plausible, we still would like to have some more application to real life wiggle data. This may need some future research. Comparisons of the estimated curves have been made through visual examinations. Some deviance measures in the Bayesian context may be developed in order to have better comparisons. Although some of the classical techniques of fitting splines have been discussed in this thesis, we did not get any chance to compare them with our Bayesian approach. Throughout the study, the response has been assumed to be normal. In conclusion, we can say that future work can also be done on the application of these approaches to binary outcome data, count data, etc. i.e., to the exploration of the concept in the case of generalized additive models. 81 Appendix A A . l Conditional Posterior for 6 in Prior 1 Let Y be the vector of responses and X be the vector of predictors in a regression analysis context such that Y ~ N(Dd + AO, a21), where 6, S and a2 are the parameters of interest, and A and D are the basis matrices calculated from the x values. Hence, the likelihood function can be written as 7r(Y|o\0,<r 2,X) OC |Q|-iexp j - ^ ( Y - L > < 5 - AO)'Q-\Y - D8 - A 0 ) j , (A.l) where £1 = a21. In a Bayesian perspective, we consider all the above parameters as random variables and have probability distributions (called prior distributions). We assume that the density functions of the above parameters have the following forms (ignoring the normalizing constants) T r ^ a ^ ^ e x p j - ^ } , TT((72)OC ( ^ ) ( a + 1 ) e x p { - & A 7 2 } and (A.2) 7r(<5) oc 1, where r 2 , a and b are called hyperparameters. Let us assume that a and b are known and T 2 has an inverse gamma distribution with the density function / i \ («+i) 7 T ( T 2 ) O C(^J exp{-/?/r 2}, (A.3) where a and 3 are konwn constants. By applying Bayes' theorem we can derive the posterior distribution from the prior distribution and the likelihood function. The joint posterior 82 distribution of the parameters 0,S, r 2 and cr2 given Y and X can be written as TT(0, 5, r 2 , cr 2 |Y, X ) oc TT(Y |0, S, C T 2 , X)7r(0|r2)7r(r2)7r(<5)7r(cT2). (A.4) With the likelihood function, prior densities and the joint posterior distribution defined above, the conditional posterior distribution for 0 given 6, r 2 , a 2 , Y and X can be calculated as TT(0\6, r 2 , a 2 , Y , X ) oc TT(Y |0, S, a2, X )7r (0 | r 2 ) oc exp { - | ( Y - DS - AOyn-^Y - DS - AO)} x ( i ) ^ e x p { - l 0 ' B 0 } = \Q\--2 exp { - i ( Z ! - Aeyn-1(z1 - Ad)} ( £ ) ( p / 2 ) e x p {-\e'BO} oc exp {-\(6-(B + A£l-lA)-1AQl-lZl)'(B + A'Q^A)'1 x (0 - (B + A ' f i - M ) - 1 ^ - ^ ! ) } , where Z i = Y — DS. It follows that the density 7r(0|«5, r 2 , cr2, Y , X ) is multivariate normal with mean vector and covariance matrix Cli where /xx = (B + A'n-1A)-1A'n-1Z1=(\M+{-^^Y^ and A.2 Conditional Posterior Distribution for r 2 in Prior 1 From the definitions of Appendix A . l , the conditional posterior distribution for r 2 given 0, «5, a 2 , Y and X is equivalent to the conditional posterior distribution of r 2 given 0 since other densities do not contain r 2 . So TT(T 2|0, 8, cr2, Y , X ) = TT(T2|0) OC 7r(r 2)7r(0|r 2) oc exp{ - / 3 /r 2 } exp {-§0 'B0} (A.6) a ( i ) ( ^ + 1 ) e x p { - 5 l 5 ( 0 ' M 0 ) - ^ } . The right hand side of the above expression says that the density 7r(r 2 |0, S, cr2, Y , X ) is inverse gamma with parameters a + § and \(9'M6) + 3. 83 A.3 Conditional Posterior Distribution of S in Prior 1 From Appendix A . l , the conditional posterior distribution of t5 given 0, a2,Y and X is obtained as follows: 7r(f5|0,cr 2,Y,X) oc TT(Y|0, .5, cr2, X)TT(<5) oc | f i |"3 exp { - | ( Y - DS - A0)'Q- 1 (Y - DS - AO)} oc \n\-^exp{-^(Z2-DS)'Q-1(Z2-DS)} (A.7) oc exp{-±(«5 - (D'Q^D^D'Q-^YiD'Q^D)-1 x(S - (£)'Q- 1i?)- 1Z)'fi- 1Z 2)}, where Z2 = Y — AO. Hence we can say that 7r(t5|0, a2, Y , X ) is a multivariate normal density with mean vector (D'Q,~1D)~1D'Q,~1Z2 and covariance matrix (D'Cl^D)'1. Since ft — a21, we can write the mean vector as (D'D)~1D'Z2 and the covariance matrix as (D'D)~la2. A.4 Conditional Posterior Distribution of a2 in Prior 1 The conditional posterior distribution of a2 given «5, 0, r 2 Y and X is obtained as per the definitions in Appendix A . l as follows: TT(O-2|0, S, T 2 , Y , X ) oc TT(Y|0, S, a2, X)TT(O-2) oc ( £ ) n / 2 e x p { _ ^ ( Y -DS- A6)'(Y - DS - AO)} oc ( ^ ) a + ^ + 1 exp { - ^ 2 ( Y - DS - A0)'(Y - DS - AS) + b} . This is an IG(a + § , ( Y - DS - A6)'(Y - DS - A6) + b) density without the normalizing constants. 84 Appendix B B.l Conditional Posterior for 6 and r 2 in Prior 2 In addition to the parameters defined in Appendix A . l , let us have a vector w = • • • , wp)' of p parameters and A. By inclusion of the new parameters the density 7r(0|r 2) has been changed to 7T(0|T 2, W) with the functional form: TT(6\T2,W) oc | B W l T | 1 / 2 e x p j - i e ^ O where BWtT = ^Mw. Here we assume that uVs are independent and identically distributed random variables with density function 7r(tu<) = (B.l) And the prior distribution of A is assumed as uniform over the interval (0, A 0 ). The joint probability distribution for 0, S,T2,W and a2 given Y and X can be written as TT(0, S, r 2 , w, a2\Y) oc TT(Y|0 , 6, a2, X)TT(0|W, r 2)7r(r 2)7r(u;i) • • • 7r(wp)7r((5)7r(cr2). (B.2) As we have already shown the calculation for the conditional pdf 7r(0|<5, T 2 , cr2, Y , X ) in (A.5), in the similar manner, we can show that the condition posterior distribution 7r(0|<5, T 2 , W, a2, Y , X ) is multivariate normal with mean vector /Lt2 and the covariance matrix Q 2 where A*2 = (Bw;T + A'n-1A)-1A,n-1Z1=(J^Mw+(^^-) ^ and ft2 = (BWtT + A'n-'A)-1 = (±MW + ^ 85 Also in the similar way to (A.6), we obtain the conditional posterior distribution for r 2 given 0, w, S, a2, Y and X , and which is inverse gamma with the parameters a + | and B.2 Conditional posterior distribution of w in Prior 2 With the definitions in Appendix B . l , the Conditional posterior distribution of w given 0, 5, T 2 , a2, Y and X is equivalent to the conditional posterior distribution of w given 0 and T 2 . Therefore, we can write 7r(«;|0,r2) oc 7r(0 |r 2, W)TT(W). Since Wj ' s are independent and identically distributed with inverse gamma, the joint posterior distribution of wi,-- - ,wp given 0 and r 2 can be written as 7 r ( i 0 i , • • • ,wp\6, T2) oc 7 r(0 | r 2 , i y ) 7r ( i t ; i ) • • • TT(WP) B.3 Conditional Posterior Distribution for A in Prior 2 Let us consider the prior distribution of A is uniform over the interval (0, A 0), i.e., 7r(A) OC ^ , then the conditional probability distribution for A given 0, 6, w, r 2 , Y and X as per the definitions in Appendix B . l is as follows: oc |B„.^exp{-i<>'iW>} x ft M (^)< i+1>exp{-i} 7r (A|0 , r 2 , to ,a : •2,<5,Y,X) a 7r(t«|A)7r(A) 86 Let us consider another probability density function for A as, 7r(A) ~ For this case the conditional posterior is obtained as ir(\\6,T2,w,o2,d,Y,X) oc 7rHA)7r(A|cr2)7r(<72) B.4 Conditional posteriors for 0 and r in Prior 3 For Prior 3 the joint posterior distribution can be written as: TT(0, S, r 2 , a2\Y, X ) oc TT(Y |0, 5, a2, X)7r(0|r2)7r(r2)7r(<5)7r((T2), (B.3) where the densities 7T(T 2), ir(a2) and ir(5) are defined previously in Appendix A . l except that 7T(0|T2) oc {^2)^^ exp { —2f20'0} a n d hence, similar calculations as in (A.5) show that 7r(0, S, T 2 , C T 2 | Y , X ) is a multivariate normal density with mean vector / i 3 and the covariance matrix where Also similar to (A.6) we can show that 7r(r 2 |0, S, a2, Y , X ) is inverse gamma with parameters a + f and ±(0'0) + /?. 87 Appendix C C l Interpolating and Plotting an NCS Let / be a natural cubic spline with knots ti < t2 < ••• < tp and we define / ; = f(ti) and li = f"(U) for i = 1, ...,p. By the definition of NCS, 71 = 7 P = 0. Let / = (/ l 5 ...,fp)T and 7 = (72, ...,7 p _i) T . The vectors / and 7 specify the curve / completely with the two band matrices Q and R such that QTf = Rr (Ci) The Q and R matrices can be calculated as follows: Let hi = U+\ — U for i = 1, • • • ,p — 1. The Q matrix is a p x (p - 2) matrix with entries e/y, i = 1, • • • ,p; j = 2, • • • ,p - 1, defined as for j = 2, • • • ,p - 1 and ^ = 0 for \i - j\ > 2. The symmetric matrix R is (p — 2) x {p — 2) with elements r^, for i , j = 2, • • • , (p — 1), defined as r« = \{hi-i + hi) for t = 2, • • • ,p- 1, r i i i + i = r i + i , i = ^/ij for i = 2, • • • ,p - 2, and m = 0 for \i-j\ >2. We define a matrix K by K = QR~1QT (C.2) 88 with the property that f f"(tfdt = 1 T R 1 = fTKf. (C.3) J a The value of the cubic spline at a knot point t can be calculated as: m = ( * - ^ ; ( ^ - * ) / < . i ( t _ , ) ( < , 1 _ t ) { ( 1 + t = f i 7 m + ( , + u ^ i ) lt) (CA) for hi = ti+1 -U; U < t < ti+1; i = 1, • • • ,p - 1. We know that 7 = R~1QTf is a (p - 2) vector and if we write the zth element of 7 as v^fi + ^2/2 H 1- %>/p> we can write equation (C.4) as m = ilzMn _ (' - - « > * « { ( 1 + ^ t ) ^ » + (1 + 4 ^ ) ̂ } + (*« -«)/, _ 0 - tm« - u j ̂ + ^ „ w + ^ + t j ^ - t j < ' - ^ - ™ { ( i + ^ ) ^ 4 + * i r W " j( 1 + i £ ) „ i + 1 , p + ( 1 + ^ ) ^ } . (C. 5 ) In matrix notation, equation (C.5) can be written as {f(h),f(h),--- ,f(tP)) = Af (C.6) where A is a matrix of order nx p whose j th row (j = 1, • • • , n) is obtained as (t-U) (t - ^) (ti+i - t) 6 — £;)(ti+l - t ) h 6 (t — U)(ti+i -t)fp 6 hi 6 _(ti+1-t) ( t - t j ) ( t i + 1 - t ) Clji — {(1 + i ^ ) ^ + ( 1 + «s±zi) ,^}, ( t - t i ) ( t i ± 1 - t ) (/ t - t A . { , , h + L z l \ „ X and a*, = - U + J ̂ V ) ^ J ' where i' = 1,2, • • • , i - 1, i + 2, • • • ,p, for any t between U and ti+1, and % is the (i,j)th element of R~XQT and p is the number of knot points for which / needs to be estimated.. 89 Bibliography [1] N . S. Altman and G. Casella. Nonparametric empirical Bayes growth curve analysis. Journal of the American Statistical Association, 90:508-515, 1995. [2] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of Generalized Cross-Validation. Numerische Mathematika, 31:377-403, 1979. [3] G. S. Datta and M . Ghosh. Bayesian prediction in linear models: applications to small area estimation. Ann. Statist, 19:1748-1770, 1991. [4] M . J. Deniels. A prior for the variance in hierarchical models. The Canadian Journal of Statistics, 27:567-578, 1999. [5] D. G. T. Denison, C. C. Holmes, B. K. Mallick, and A. F. M . Smith. Bayesian Method for Nonlinear Classification and Regression. John Wiley & Sons, Ltd., England, 2002. [6] D. G. T. Denison, B. K. Mallick, and A. F. M . Smith. Automatic Bayesian curve fitting. J. R. Statist. Soc. Ser.B, 60:333-350, 1998. [7] A. E. Gelfand, S. E. Hills, A. Racine-Poon, and A. F. M . Smith. Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Sta- tistical Association, 85:972-985, 1990. [8] A. E. Gelfand and A. F. M . Smith. Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398-409, 1990. 90 [9] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, 1995. [10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984. [11] M . Ghosh and J. N . K. Rao. Small area estimation: an appraisal. Statistical Science, 9:55-76, 1994. [12] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, New York, 1996. [13] H. Goldstein, M . J. R. Healy, and J. Rasbash. Multilevel time series models with applications to repeated measures data. Statistics in Medicine, 13:1643-1655, 1994. [14] P. J. Green. Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711-732, 1995. [15] P. J. Green and B. W. Silverman. Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London, 1994. [16] P. Gustafson, D. Aeschliman, and A. R. Levy. A simple approach to fitting Bayesian Survival models. Lifetime Data Analysis, 9:5-19, 2003. [17] M . H. Hansen and C. Koopberg. Spline adaptation in extended linear models. Statistical Science, 17:2-51, 2002. [18] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990. [19] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their appli- cations. Biometrika, 57:97-109, 1970. 91 [20] M . S. Hossain. A conservative prior for Bayesian hierarchical models in Biostatistics. Master's thesis, University of British Columbia, 2003. [21] P. McCullagh and J. A. Nelder. Generalized Linear Models (2nd ed.). Chapman and Hall, London, 1989. [22] N . Metropolis, A. W. Rosenbluth, M . N. Rosenbluth, and A. H. Teller. Equations of sate calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1091, 1953. [23] J. A. Nelder and R. W. M . Wedderburn. Generalized Linear Models. J. R. Statist. Soc, Ser. A, 135:370-384, 1972. [24] G. K. Robinson. That BLUP is a good thing: the estimation of random effects. Statis- tical Science, 6:15-51, 1991. [25] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric curve fitting. J. R. Statist. Soc. Ser. B, 47:1-52, 1985. [26] J. S. Simonoff. Smoothing Methods in Statistics. Springer, New York, 1996. [27] M . Smith and R. Kohn. A Bayesian approach to nonparametric bivariate regression. Journal of the American Statistical Association, 92:1522-1535, 1997. [28] G. Wahba. Spline Models for Observational Data. CBMS-NSF, Philadelphia, Pennsyl- vania, 1990. [29] A. Whitehead. Meta-Analysis of controlled clinical trials. John Wiley and Sons, Ltd., England, 2002. [30] S. L. Zeger and M . R. Karim. Generalized linear models with random effects; a gibbs sampling approach. Journal of the American Statistical Association, 86: 79-86, 1991. 92
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Bayesian curve fitting with roughness penalty prior...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Bayesian curve fitting with roughness penalty prior distributions Rahman, Md Mushfiqur 2005
pdf
Page Metadata
Item Metadata
Title | Bayesian curve fitting with roughness penalty prior distributions |
Creator |
Rahman, Md Mushfiqur |
Date | 2005 |
Date Created | 2009-12-14 |
Date Issued | 2009-12-14 |
Description | In statistical research with populations having a multilevel structure, hierarchical models can play significant roles. The use of the Bayesian approach to hierarchical models has numerous advantages over the classical approach. For example, a spline with roughness penalties can easily be expressed as a hierarchical model and the model parameters can be estimated by the Bayesian techniques. Splines are sometimes useful to express the rapid fluctuating relationship between response and the covariate. In smoothing spline problems, usually one smoothing parameter (variance component in Bayesian context) is considered for the whole data set. But to deal with rapidly fluctuating or wiggly data sets, it is more logical to consider different smoothing parameters at different knot points in order to find more efficient estimates of the the regression functions under consideration. In this study, we have proposed the roughness penalty prior distribution considering local variance components at different knot points and call it Prior 2. Prior 2 is compared with Prior 1, where a single global variance component is considered for the whole data set, and with Prior 3, where no roughness penalty terms are considered ( i.e., the parameters at different knot points are assumed independent). Performance of the proposed prior distributions are checked for three different data sets of different curvature. Similar performance of Prior 1 and Prior 2 is observed for all three data sets under the assumption of piecewise linear spline. The application has been extended to the case of natural cubic spline, where the modification of Prior 1 and Prior 3 are straightforward. However, for Prior 2, the modification becomes very tedious. We have proposed an approximate roughness penalty matrix for Prior 2. Parameters corresponding to the smoothing splines are estimated using MCMC techniques. We carefully compare the inferential procedures in simulation studies and illustrate them for two data sets. Similarity among the curves produced by Prior 1 and Prior 2 are observed, and they are much smoother than the curve estimated by Prior 3 for both piecewise linear and natural cubic splines. Therefore, in the context of Bayesian curve fitting, both local and global roughness penalty priors produce equally smooth curves in dealing with wiggly data. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | Eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2009-12-14 |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0092149 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of |
Degree Grantor | University of British Columbia |
Graduation Date | 2005-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
URI | http://hdl.handle.net/2429/16674 |
Aggregated Source Repository | DSpace |
Download
- Media
- ubc_2005-0605.pdf [ 4.16MB ]
- [if-you-see-this-DO-NOT-CLICK]
- Metadata
- JSON: 1.0092149.json
- JSON-LD: 1.0092149+ld.json
- RDF/XML (Pretty): 1.0092149.xml
- RDF/JSON: 1.0092149+rdf.json
- Turtle: 1.0092149+rdf-turtle.txt
- N-Triples: 1.0092149+rdf-ntriples.txt
- Original Record: 1.0092149 +original-record.json
- Full Text
- 1.0092149.txt
- Citation
- 1.0092149.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 11 | 1 |
United States | 9 | 101 |
Japan | 5 | 0 |
United Kingdom | 3 | 0 |
France | 1 | 0 |
Canada | 1 | 0 |
Australia | 1 | 0 |
City | Views | Downloads |
---|---|---|
Beijing | 9 | 0 |
Unknown | 6 | 5 |
Ashburn | 4 | 0 |
Tokyo | 4 | 0 |
Shanghai | 2 | 0 |
Dagenham | 2 | 0 |
Kensington | 1 | 0 |
Sunnyvale | 1 | 0 |
Plano | 1 | 0 |
Calgary | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0092149/manifest