A Conservative Prior for Bayesian Hierarchical Models Biostatistics Md. Shahadut Hossain M.Sc, University of Dhaka. 1993 A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T OF THE REQUIREMENTS FOR THE D E G R E E OF Master of Science in • -'' i ,: • THE FACULTY OF GRADUATE STUDIES (Department of Statistics) We accept this thesis as conforming to the required standard The University of British Columbia August 2003 © Md. Shahadut Hossain, 2003 •• '• \ In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of The University of British Columbia Vancouver, Canada Date DE-6 (2/88) 7.Z- 0 g- POO 3 Abstract Hierarchical models are suitable and very natural to model many real life phenomena, where data arise in nested fashion. The use of Bayesian approach to hierarchical models has numerous advantages over the classical approach. Estimating a phenomenon with hierarchical model can be viewed as a smoothing problem, and hence while summarizing such a phenomenon via hierarchical model we do not want to undersmooth the phenomenon. That is, in most of the practical applications undersmoothing is more serious type of error than oversmoothing. So, we need an estimation approach which can guard against undersmoothing. If we can control the undersmoothing reasonably, we may get a better calibrated summary of the phenomenon we estimate. In this study, we have incorporated the aspect of smoothing in estimating the parameters of Bayesian hierarchical models. In doing so we have proposed a conservative prior for the variance component to achieve the adequate degree of smoothness while estimating the phenomenon under study. We have conducted simulation studies to decide about the appropriate values to be used for the hyperparameter while using the conservative prior to ensure the adequate degree of smoothness. We have investigated the performance of the proposed conservative prior in guarding against undersmoothing in simple normal-normal hierarchical models (random effects models for normal response) and in non-parametric regression curve estimation problem via simulation studies. We have also investigated the performance of the proposed prior compared to those of the uniform shrinkage prior and the Jeffreys' prior, with respect to both guarding against undersmoothing and the MSE of the estimated model parameters, through simulation studies. Contents Abstract'.' ii Contents iii List of Tables vi List of Figures viii Acknowledgements xii Dedication xiii 1 Hierarchical Models 1.1 Introduction. 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 1.2 Why Hierarchical Models? . . . . . . . . . '. . . . . . . . . . . . . . 2 1.3 Bayesian Approach to Hierarchical Models 3 1.4 Choice of Prior Specification for Variance Component in Hieraxchical Models . . 5 1.5 Application of the Proposed Prior: Simulation Studies iii 12 1.5.1 Monitoring the Convergence and Mixing the M C M C Run for a; and A . . 15 1.6 Analysis of Output . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.7 Comparison of Results Obtained by Using the Conservative Prior with those Obtained by,Using the Jeffreys' Prior and the Uniform Shrinkage Prior . . . 1.7.1 1.8 Analysis of Output . . . . . . . . . . ' . . . . Conclusion;^ V. . 27 . . . . . . . . . . 27 . "..'. . . . . . . . ' . . . . . . . . .30 2 Curve Fitting 32 2.1 Introduction 2.2 Parametric Approach . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Non-parametric Approach 2.4 2.5 ..'... . . . 32 . . . . ... 33 . .''. . . . . . . . . . . . . . . . . ;. . . . 33 2.3.1 Spline-based Roughness Penalty Approach . . . . . . . . . . . . . . . . . 34 2.3.2 Mathematical Formulation of the Roughness Penalty Approach . . . . .'..''35.' 2.3.3 Elegant Way of Representing NCS . . . . . . . . . . . . . . . . . . . . . . 35 Selection of Smoothing Parameter for Spline Smoothing . . . . . . . . . . . . . . 37 2.4.1 Cross-validation Method . . 2.4.2 Generalized Cross-validation approach . . . . . . Conclusion . . . . . . . . . . ; ... '. . . . . . . . . . 3 7 : . . . . . . . . . . . : ... . . . . . . . . .'.. . . . 38 . . . . . . 38 3 Bayesian Representation of Roughness Penalty Approach of Curve Fitting: Methodology and Simulation Studies 40 3.1 40 Introduction. . . . . . . . . . . . . . . . . . . . . .... . . . . . . . . . . . . . . 3.2 Bayesian Representation of Roughness Penalty Approach . . . . . . . . . . . . . 3.2.1 The Conservative Prior for the Random Effects Variance Component T in Roughness Penalty Approach 3.2.2 3.3 . . 44 Posterior Distributions for the Model Parameters Under Conservative Prior 45 Monitoring the Convergence of M C M C Simulation . . . . . . . . . . . . . 48 : Output Analysis . . . . 3.4.1 3.5 2 Performance of Conservative Prior in Smoothing Problem: Simulation Studies . . 47 3.3.1 3.4 . . . . . . . . . 41 49 Performance Analysis of Conservative Prior in Smoothing in case of Small Sample size (n = 23) . : . . ... . . . . . . . . . . . . . . . . . . . . . .63 Comparison of Results Obtained by Using the Proposed Conservative Prior with - those Obtained by Uniform Shrinkage Prior and Jeffreys' Prior . . . . . . . . . . 69 3.6 4 3.5.1 Jeffreys'Prior for Smoothing Problem . . . . .70 3.5.2 Comparison of Results: Simulation Studies 70 3.5.3 Comparison of Results for a = 3 . . . . . . . . . . . . . . . . . . . . . . . 79 Conclusion . : . . . . . . . . . . . . . . . . . . •'. . Discussion and Conclusion Bibliography . 85 87 90 List of Tables 1.1 Values of the hyperparameter a required to keep the maximum probability of undersmoothing to a certain level for different values of e 11 1.2 MSE of A for different priors when A = 0 . . .. . . . . . . . . . . . . .". . . . . 28 1.3 MSE of A for different priors when A = 1 . . . . 1.4 MSE of A for different priors when A = 5 . . . . . . . . 1.5 MSE of A for different priors, with a = 3 for conservative prior, when A = 0 . . . 29 1.6 MSE A for different priors, with a = 3 for conservative prior, when A = 1 . . . . . 30 1.7 MSE A for different priors, with a = 3 for conservative prior, when A = 5 . i . . . 3.1 Table summarizing the number of times that MSE(#) under Jeffreys' prior and . I : . . . . . . . . . . . . . . . . . . . 28 29 30 uniform shrinkage prior exceed those under conservative prior out of 100 occasions when T = 5 2 3.2 . . . . . . . . . . . . . . . . . . . . .75 Table summarizing the number of times that MSE(6>) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior out of 100 occasions when T = 1 2 3.3 . .. . . . . .. . . . . . .. . . . . . . . . . . 75 Table summarizing the number of times that MSE(<?)under Jeffreys' prior and . uniform shrinkage prior exceed those under conservative prior out of 100 occa-. sions when r = 0.01 . . . . . . . 2 . . . .. . . . . vi 75 3.4 Table summarizing the number of times that MSE(0) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior with a = 3 out of 100 occasions when r = 5 . . . . 84 2 3.5 Table summarizing the number of times that MSE(0) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior with a = 3 out of 100 occasions when r = 1 . . . . . . . 2 3.6 . .". . . . . . . . . . . . . . . . . . . . . . 84 Table summarizing the number of times that MSE(t?)under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior with a = 3 out of 100 occasions when r = 0.01 85 2 vii List of Figures 1.1 Probability of undersmoothing as a function of n . . . . . . . . . . . . . . . 1.2 Plot of A against n 1.3 Convergence plot for A when A = 0 . . . . . . . . 1.4 Convergence plot for A when A = 1 . . . . . . . . . . . . . . . . . . . 1.5 Convergence plot for A when A = 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.6 Convergence plot for w.' . . . . . .18 1.7 Histograms of A for different choices of a in the case of posterior mean when true . . . . . . . . .. . . . . • • • • • • • • • • • • • • • . 9 • • • 9 16 . . . . . . . . . . . . . .... 17 A =o . . . 1.8 .20 Histograms of A for different choices of a in the case of posterior mode when true A 1.9 . — 0 . . . .": . . 21 Histograms of A for different choices of a in the case of posterior mean when true A= 1. . . . . . . ; . , . . .V . . . . . . . . . . . •; . '. . . . r . . . .22 1.10 Histograms of A for different choices of a in the case of posterior mode when true A= l . . . . . . . . . . . . . . . . . . . . . . 23 1.11 Histograms of A for different choices of a in the case of posterior mean when true A= 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii :: 24 1.12 Histograms of A for different choices of a in the case of posterior mode when true A= 5. . . . . . . . . . . . . . . . ., ... . . . . . . . . . . . . . . . . . . . . 25 3.1 Plots of M C M C chains for T with different starting points . . . . . . . . . . . . 48 3.2 Plots of M C M C chains for a with different starting points ". . . . . . . . ; . . . . 49 3.3 Histogram of estimated r based on posterior mean . . . . . . . . . . . . . . . . . 3.4 Histogram of estimated r based on posterior mode . . . 3.5 Histogram of estimated a based on posterior mean . ... . . . . . . . . . ... 3.6 Histogram of estimatedCT based on posterior mode . 2 2 2 . . 51 2 2 2 51 52 . . . . . : . . . . . . . 52 3.7 . Plots of data values, true mean curves and the estimated curves f(x) for the selected data sets . . . . . ' . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.8 Plots of data values, true mean curves and the estimated curves f(x) for the selected data sets when T = 0.01 . . '.' . . . . 55 2 3.9 Plots of data values, true mean curves and the estimated curve f(x) obtained by using uniform shrinkage prior when r = 0.01 56 2 3.10 Plots of data values, true mean curves and the estimated curves f(x) obtained by using Jeffreys' prior when T == 0.01 . . . : 57 2 3.11 Plots of M C M C chains for r with different starting points when p=5 2 . . . . . . 58 3.12 Plots of M C M C chains for a with different starting points when p=5 . . . . . . 59 2 3.13 Histogram of estimated r based bn posterior mean while using fewer knots (p=5) 60 2 3.14 Histogram of estimated r based on posterior mode while using fewer knots (p=5) 60 2 3.15 Histogram of estimated a based on posterior mean when p=5 . . . . . . . . . . 61 3.16 Histogram of estimated a based on posterior mode when p=5 . . . . . . . . . . 61 1 2 3.17 Plots of true data, true curves and the estimated curve f(x) when p=5 . . . . . ...' 62 3.18 Plots of M C M G chains for T with different starting points when n=23 and p=10 64 2 3.19 Plots of M C M C chains for a with different starting points when n=23 and p=10 64 1 3.20 Histogram of estimated r based on posterior mean for n=23 and p=10 . . ... 65 3.21 Histogram of estimated r based on posterior mode for n=23 and p=10 . . . . . 65 2 2 3.22 Histogram of estimated r based on posterior mean for n=23 and p=5 . . . . . . 66 2 3.23 Histogram of estimated r based on posterior mode for n=23 and p=5 . . . . . . 66 2 3.24 Plots of the true curves, data values and the estimated curves f(x) when n=23 and p=10 . . . . . . . . . . . ... . . . . . . , . . . . . . . . . . . . . . . . 67 3.25 Plots of the true curves, data values and the estimated curves f(x) when n=23 and p=5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.26 Histograms of T for different priors when T = 5 and n = 111 . . . . . . . . . . . 71 3.27 Histograms of r for different priors when r = 0.01 and n = 111 . . . . . . . . . 72 3.28 Histograms of r for different priors when T = 5 and n = 23 . . . . . . . . . . . 73 2 2 2 2 2 2 3.29 Histograms of r for different priors when r = 0.01 and n = 23 . . . . . . . . . 2 2 . 74 3.30 Histograms of differences of MSE(t^) for pairs of different priors when r = 5 and n=in '. . . . . . . . . : . . 76 2 3.31 Histograms of differences of MSE(^) for pairs of different priors when r = 5 and 2 •:. n = 23 . : . : . ..: . . ... . . . . . . : : V . : ."v V . ' . 77 3.32 Histograms of differences of MSE(t^) for pairs of different priors when r =. 1 and 2 ?i = ill ." ... . . . . . . . . . . . . 77 3.33 Histograms of differences of MSE(<?) for pairs of different priors when r = 1 and 2 n = 23 . . . . . . . . . . . . . . . ' . . . . . . . . . .I , . : . . : . . . . .,'. . 78 : 3.34 Histograms of differences of MSE(0) for pairs of different priors when r and n = 111 . .'. . . . . . . . . == 0.01 2 . . . . . V . 78 : 3.35 Histograms of differences of MSE(0) for pairs of different priors when T = 6.01 2 and n = 23 . . . '. . . . . . . . . . . . . . . . ": .... 3.36 Histograms of T for different priors when T = 5 and n = 111 . . . . . . . 2 3.37 Histograms of r 3.38 Histograms of r 2 for different priors when r 2 = 5 and n = 23 2 : . . ; 79 . . . . 80 . . . . . . . . . . . 81 for different priors when T = 0.01 and n = 111 . . . . . . . . . 82 2 2 3.39 Histograms of T for different priors when T = 0.01 and n = 23 . . . . . . . . . . 2 2 xi 83 Acknowledgements I would like to convey my gratitude and thanks to my Supervisor Dr. Paul Gustafson for his guidance and help in completing my thesis work. Needless to mention that without his initiative, guidance and co-operation it was not possible for me to complete this piece of work. Secondly, I am grateful to Dr. Ying Macnab, Assistant Professor, Department of Health Care and Epidemiology, UBC, for being the second reader of my thesis. Also, I would like to thank all the faculty members in the department of statistics for their high commitment to quality of teaching and mentoring. Lastly, I have been benefited from the department of statistics, UBC, in many ways. I would like to take the opportunity to thank all students for making my last two years of stay enjoyable and office staff for their co-operation. M D . SHAHADUT HOSSAIN The University of British Columbia August 2003 xii To my Parents and Wife. xiii Chapter 1 Hierarchical Models 1.1 Introduction In many practical field of studies data may arise in nested or hierarchical fashion. Some common scenarios are: (i) in the field of health, e.g. patients nested within a sample of hospitals may themselves be nested within a geographic region etc. (ii) in meta-analysis, information from a number of studies on the same phenomenon are combined to produce more accurate inferences and predictions than those available from any single study. In such case of combining information, subjects are nested within studies (iii) in longitudinal studies, repeated observations on each individual study subject are nested within the individual In all the above cases if people model data as independent and identically distributed for the sake of simplicity, e.g., pretend that the subjects in the sampling experiment are drawn homogeneously from a single population, then the analysis applied to summarize the data can not be able to depict the real picture that prevails in the data set itself, because in this case the researcher will fail to capture the similarity between groups. ' ' Furthermore, in any hierarchical setup observations are obtained in clusters and the responses from the same cluster can not be assumed independent. So, to incorporate this within cluster 1 correlation and to reflect the dependence of cluster specific random effects we must consider some hierarchical modeling techniques. In hierarchical models the observable outcomes are modeled conditionally on the group level random effects and then the group level random effects are given a probabilistic specification in terms of the further parameters known as hyperparameters. 1.2 W h y Hierarchical Models? For nested data hierarchical models are good choice for the following reasons: (i) hierarchical models permit the direct framing of the theories about the effect of structural change at each of the different levels of the hierarchy (ii) they provide the accurate adjustments to the uncertainty assessment based on the simple random sampling when the data are gathered in hierarchical fashion in the presence of strong intra-cluster correlations (iii) use of non-hierarchical models is inappropriate for the hierarchical data because,with a few parameters they usually can not fit the data accurately. For example, if we consider the observed data ytj (i = 1,2,..., m; j — 1,2,..., r i j ) from all clusters are homogeneously arise from a common distribution with overall population effect 0, then such a model may fail to account for cluster-to-cluster variability , which may be an important source of . variability in the data. Again, with many parameters use of non-hierarchical models tend to overfit the data i n the sense of producing models that fit the existing data well but lead to inferior predictions for new data . Hierarchical models can handle both of the problems since they have enough parameters to fit the data well and they can avoid the problem of pverfitting by borrowing the strength across clusters using a population distribution to incorporate the similarity among the cluster-specific effects. . . Again, in causal analysis it may be necessary to introduce covariates at multiple levels of hierarchy. In such case the assumption of exchangeability of units or subjects at the lowest level breaks down even after conditioning on covariate information because of information sharing among the units at each level. So, we need to introduce covariates relevant to each of the higher level units. But introducing covariate at each level of hierarchy dramatically increases the number of parameters in the model and a sensible estimation is only possible through 2 further modeling in the form of population distribution, i.e., we need to consider a population distribution for the effects at each level of hierarchy and thus gives rise to hierarchical models. In a nutshell, hierarchical models can improve the cluster or area specific estimates by combining the data across the units at each level of hierarchy. And in doing so they consider the probabilistic mechanism that gives rise to the data at each level of hierarchy. Hierarchical models offer an explicit framework which can express similarity judgement, combine, information across units, and thus produce accurate and well-calibrated predictions of observable outcomes. 1.3 Bayesian Approach to Hierarchical Models Heuristically, it can be said that hierarchical models force us to be Bayesian because to set up a hierarchy we need to consider a population distribution (prior distribution) for the parameters at each level to incorporate the uncertainty about the parameters at that level. There are both methodological and computational arguments in favor of being Bayesian while dealing with the hierarchical models. Some of them are listed below. (i) Bayesian methods have the flexibility to incorporate the multiple levels of randomness that prevail in hierarchical models, and the resultant ability to combine information from different sources, while incorporating all reasonable sources of uncertainty in inferential summaries. They naturally lead to smoothed estimates in complicated data structures , and consequently have the ability to obtain better real-world answers. (ii) The psychological reason for adopting Bayesian methods is that in almost all practical cases the users of statisticians' work usually interpret interval estimates provided by statisticians as Bayesian intervals, that is, as probability statements about the likely values of unknown quantities conditional on the evidence in the data. Such direct probability statements require prior probability specifications for unknown quantities, and thus the kinds of answers the clients assume are Bayesian. Thus from structural point of view the strength of Bayesian hierarchical approach lies in (a) its ability to combine information from multiple sources, (b) its more encompassing accounting of uncertainty about unknowns in a statistical problem and (c) it is the essential tool for achieving partial pooling of estimates and compromising in a scientific way between alternative sources of information. Actually, such partial pooling is more sophisticated and scientific than simple pooling of individual estimates in the sense that such pooling considers all sources of uncertainty rather than considering only one source of random error within each unit. Besides theoretical view points there are computational view points also that lead us to be Bayesian. The computational view points are summarized below. (i) In the case of hierarchical modeling the likelihood approach becomes very difficult to implement, even with the normal response, when the level of hierarchy gets increased. In case of non-normal response the likelihood approach becomes intractable. . (ii) In causal analysis when the response from the same cluster are correlated random or mixed effects models are used. In such cases if the response are not normal the likelihood approach becomes intractable because it requires evaluation of high dimensional integral to calculate likelihood function. Though there are some approximate methods such as, the penalized quasi-likelihood (PQL) and the marginal quasi-likelihood (MQL) (Breslow and Clayton, 1993) based on Laplace's method of integral evaluation they have serious limitation in estimating the variance structure. Both PQL and MQL methods can give rise to a non-positive definite variance structure,-which is absurd. The possibility of likelihood approach's producing negative estimate for the variance component can be well understood by considering the case of proto-type normal-normal hierarchical model. Such a model can be written as Stage!: yi \ Qi ~ N(8i, LJ), where to is known Stage 2: 0; ~ JV(0, A) The two stages of the proto-type model can be collapsed to yi | A ~ N(0,CJ + A). The simple likelihood estimate for A is obtained as A = S — to, where S is the pooled sample 2 2 variance. Clearly, for u > S the likelihood estimate of variance component A will be 2 negative, i.e., pr(X < 0) > 0. The Bayesian formulation enjoys an advantage here because of the information on the variance components contributed by the prior specification. The Bayesian method is simple to implement due to Markov Chain Monte Carlo (MCMC) methods. ?' - , ' ' (iii) Even in linear mixed effects models (i.e., when the response are normal) closed-form solutions for the maximum likelihood (ML) and Restricted Maximum Likelihood (REML) estimators of the parameters are often unavailable. In such cases numerical optimization methods like Newton-Raphson and E M algorithm are used to calculate estimates of the model parameters. These algorithms are not guaranteed to locate the global maximum 4 of the likelihood function from an arbitrary starting point. But a hierarchical Bayesian . version of the mixed model does not have such problem of convergence to different points because it leads to simulation from the appropriate posterior distribution. This gives a distributional estimate of the model parameters and thus can provide more complete . 1.4 estimates ••. .'.•'•.'.•.>'."•' Choice of Prior Specification for Variance Component in Hierarchical Models In this study our concern is to make a good choice of prior specification for variance components in Bayesian hierarchical models. The choice of prior distribution for variance components is a critical task because often adequate subjective prior information is lacking. For situation where little prior information is available, the usual choice is to use a non-informative type prior. But the problem of using non-informative type priors is that most of them are improper, and so may lead to an improper joint posterior distribution, when applied to a variance component. Impropriety of joint posterior distribution implies that there does not exist a joint density to which an M C M C chain converges, i.e., MCMC algorithm leads to inferences about a nonexistent posterior distribution (Hobert and Casella, 1996); So, before using an improper prior we must ensure that the resulting joint posterior distribution is a proper one. Many authors worked on the issue of choosing non-informative prior for the variance components in hierarchical models. The leading ones are Jeffreys (1961), Box & Tiao (1973), Berger & Deely (1988), Berger& Bernando (1992), Daniels (1999) etc. Daniels (1999) introduced a non-informative prior, known as the uniform shrinkage prior, which itself is proper, and hence leads to proper posterior distribution. It also possesses many desirable frequentist properties. However, none of the authors who worked on this issue previously did consider any explicit criteria to choose the priors for the variance component. In this study we have considered the concept of smoothing in choosing the prior for the variance components. It is true that any estimation procedure is imperfect and we either byersmooth or undersmooth the phenomenon we estimate. But in many application undersmoothing is a worse error than oversmoothing, as undersmoothing involves postulating structure that is not really there. So, in this study we have attempted to choose priors for the variance components in such a way that we can guard against undersmoothing. For example, consider the following normal-normal hierarchical model. 5 Stage 1: y \ 9, A ~ N(A9,vy), where A is a known n x p matrix, 9 is a unknown p x 1 vector, v\ is an n x n matrix (may or may not be known). Stage 2: 9 | A ~ A^(0, AU2), where A is unknown scalar, V2 is a known p x p matrix. Stage 3: A ~ vr(A), where 7r(A) is a prior distribution for A, the variance component. Such models have numerous applications. Some of the scenarios are listed below. Scenario A : Random effects models. For instance, data are collected on patients at p hospitals. It may be reasonable to think that the patient outcomes are similar but not identical across hospitals. Thus y consists of patient outcomes, Q\ can be taken to be the ith hospital effect. \ Scenario B: Spatial models. For instance, data might describe observed disease prevalence in a geographic regions. It may be reasonable to think that the underlying prevalence in adjacent regions is similar but not identical. Thus 9{ is the ith region effect and V2 is chosen so that component of 9 which are geographically closer have higher positive correlation. Scenario C: Curve-fitting. Say that Y = /(X)+noise, and we wish to estimate the function / . The smoothing spline approaches to this problem can be viewed as a hierarchical model with 9 being the values of / at some fixed knot values. By appropriate choice of V2 we end up penalizing functions / which are more wiggly. All the three scenarios can be thought as smoothing problem. In each case we estimate (rather than fix) A so that the data decide on "how much smoothness" is appropriate. That is, how similar should the hospitals or regions be in scenarios A and B, and literally, how smooth should the function / be in scenario C. And the choice of prior at Stage 3 of the hierarchical model has some impact on this data-driven smoothing procedure. Again, in smoothing problem it is assumed that the underlying phenomenon is smooth, and hence we want an estimate of the phenomenon which is also smooth. For example, consider a hospital model about the effectiveness of a cardiac treatment, with patients in the hospital j having the survival probability 9j, it might be reasonable to expect that estimates of 0jS', which represent a sample of hospitals, should be related to each other. A natural way to incorporate this similarity of fys' is to use a prior distribution in which 9JS' are viewed as a sample from a 6 common population distribution. In such case the observed data, yij, with units indexed by i within groups indexed by j, can be used to estimate aspects of the population distribution of the 0js' even though the values of fys' are not observed. It is natural to model such a phenomenon hierarchically, with observable outcomes modeled conditionally on certain parameters, which themselves are given a probabilistic specification in terms.of further parameters known as hyperparameters. In such kind of hospital model it is important to have estimates of fys' which are not more variable than fys' themselves. Because, if it is known that the survival rates among patients suffering from cardiac ailment are higher in some hospitals than those in the other hospitals people may rush to the hospitals with higher survival rates, which may cause serious administrative problems. In such a case the idea is that we do not want to declare that some hospitals are better with respect to cardiac ailment care unless we are pretty much sure, i.e., we.want estimates of 9JS' to be smooth. So, in choosing a prior for A in Stage 3 we have emphasized on an estimate of A which does not undersmooth the phenomenon understudy. Undersmoothing is defined to be A > A and oversmoothing is defined to be A < A, where A is the estimate of A. Since undersmoothing is considered to be more serious type of error, in this study we have focused on choosing the Stage 3 prior in order to guard against undersmoothing. In particular, we consider choosing the prior in order to make the (frequentist) probability that A > A, i.e., pr(X > A), small. In search for such a prior we can start with the following proto-type normal-normal hierarchical model: . ' • , Stage 1: yi \ Oi ~ N(0i,uj), where yi (i — 1,2,... , n) is the observed summary outcome in the i t h cluster; Oi is the true effect for the i t h cluster and UJ is the known dispersion parameter. Stage 2: 0 | A ~ i V ( 0 , A ) { . • The above hierarchical model can be collapsed as: •= E[E( \O )\X} VL = = and var(yi\X) = T E(0i\X) 0 E[var( \0i)\X] yi = E(CJ\X) + = UJ + X + var(0i\X), var[E( \0i)\X} yi Therefore, yi\X ~ N(0,u + A). The likelihood function can then be expressed as L(X) oc 1 cu + X A prior density of the form 7r(A) O C (r^j)^ ^ a+1 would be conjugate, where ui+X ~ IG{a,b) e~ truncated to be u + A > ui, i.e., a priori A is distributed as V\ — w\V\ > w, where Vi ~ IG(a, b). Now, we need to make reasonable choices of a and b to guard against undersmoothing. The posterior distribution of A, given the data, is obtained as L(A | Y) be ( - L - ) ^ \ r ^ ^ ~ IG - + a, \ £ n + b i-l That is, a posteriori A is distributed as V2 — w;| V2 > w, where ~ J C (a + | , 6 + \ Now, take A =posterior mode as the estimator of A. So, A =max I ' ~ + 2 a + 2— w y ). 2 '^ The probability of undersmoothing can then be calculated as / pr(X > A) = pr t v f + i=l - w> A n + 2a + 2 pr J = pr (u + X)v + 2b n + 2a + 2 \ X 26 + X = pr > CJ i=t n + 2a + 2 > a; + A 2n > (1.1) pr I z > In Equation (1.1) , v — ^ j x " ~ x\ and 2 = -j= ~ iV(0,1) as n v 00. Again, from Equation (1.1) we observe that the chance of undersmoothing increases as A decreases. Hence, the worst possible case of undersmoothing takes place when A = 0, and so we can worry just about the case when A = 0, i.e., if we can ensure adequate guard against undersmoothing in case of A = 0 then we can automatically ensure the adequate guard against undersmoothing i n all the other cases. For A = 0, Equation(l.l) can be written as pr(X > A).= pr z > V2(a + 1-^) (1.2) For fixed n we can choose the values of a and b to make the probabilities given by equation (1.2) small, but there still has the questions about what happens as n —> 00. Theoretically, as 8 n —>• oo, ^ " J ^ w ^ 0, which implies that the probability of undersmoothing approaches to 0.5. Figure 1.1 displays the probability plots against n corresponding to the Equation (1.2). Form these plots it is observed that probability of undersmoothing approaches to 0.5 as n -V oo. Again, form the Figure it is also observed that bigger values of a help in keeping the probability of undersmoothing small. Figure 1.1: Probability of undersmoothing as a function of n o e S2 500 600 900 1000 Figure 1.2: Plot of A against n 800 1000 Now, since probability of undersmoothing approaches to 0.5 as the sample size gets larger, there may be frequent upward jumps in the values of A and we need to check how big these jumps are. To check the magnitude of the jumps in the values of A we have simulated data and 9 calculated A for different values of n and plotted them against n for a true A value of zero in Figure 1.2. For this plot the values of the hyperparameters a and b have been taken to be 3 and 1, respectively. From Figure (1.2) it is clear that though there are frequent jumps in.the values of A the jumps get smaller as n gets larger. So, we need not to worry about the jumps in A when n is large. . , ; Again, from Equation (1.1) we observe that the probability of undersmbothing is an increasing function of b. So, setting 6 = 0 can ensure maximum possible guard against undersmoothing. For b = 0 the probability of undersmoothing is obtained, as pr(X > A) = r > P • [ ( 1 . 3 ) From Equation (1.3) we observe that for fixed n the probability of undersmoothing decreases as a increases. Hence for fixed n we look for a bigger values of a to guard against undersmoothing. Another advantage of setting b to zero is that the probability of undersmoothing does not depend on A and w, and we can express the maximum value of probability of undersmoothing as a function of a only. Sometimes it is desired to express the probability of undersmoothing as the probability that A exceeds A plus a small increment obtained as a function of true A, i.e., pr(undersmoothing) = pr(A > A + e(A + w)) ( £ yf "-' K \ > (A + w)(l + «0 1 n + 2a + 2 V . 2(a + l)(l + e) + ne\ / M The probability of undersmoothing given by Equation (1.4) will be maximum if ^ / ( n ) = 2 ( o + + V( ) 2n ne + ; _ . is minimum. The minimum value of f(n) has been found to be 7(n) n = 2\/e(a + l)(l + e) mi Thus = ' . • pr(X > X + e(A + w)) - pr{z > 2\/e{a + 1)(1.+ e)) (1.6) .' (1.7) From Equation (1.7) we observe that the maximum value for the probability of undersmoothing will be smaller for bigger values of a. Table 1.1,constructed based on the maximum probability of undersmoothing given by the Equation (1.7), gives different values of a required to keep the maximum probability of undersmoothing to a certain level a (say), e.g. 5%. From Table 1.1 Table 1.1: Values of the hyperparameter a required to keep the maximum probability of undersmoothing to a certain level for different values of e a 0.05 0.10 0.20 e 0.10 7.15. 4.73 2.61 0.05 13.89 8.82 4.37 0.20 3.83 2.71 1.74 we see that quite big values of a is required if we want keep the maximum probability of undersmoothing at 5% level. For this level we need a = 4 even if we allow A to exceed A by 20% of the total variability, UJ + A. From the Table it is also observed that even if we allow 20% of the A exceed the true A itself by a magnitude of 20% of the data variability we still need a value of almost 2 for a (a = 1.74). Sb, it can be reasonably argued that the uniform shrinkage prior, which considers a — 1, can not control undersmoothing adequately. So, finally by considering all the advantages and mathematical ease it can be concluded that the desired conservative prior for Bayesian hierarchical model can be obtained from truncated inverted gamma prior rr(X) oc (•^j) ^e~by choosing 6 = 0 and a bigger value for o. We t<a+1 term the proposed prior as "conservative prior" because it is conservative against undersmoothing. Now, taking b — 0 in the inverted gamma prior leads to a class of priors and is termed as "parametric power" priors (Hobert &; Casella, 1996). The advantages of using such priors are that they are obtained from the inverted-gamma prior and for a > 0 they are proper and hence lead to a proper joint posterior distribution of unknown parameters, which is essential for adopting M C M C to simulate the. values'of. unknown parameters for valid posterior inferences (Hobert and Casella, 1996). Such priors give the flexibility to choose the value of a to guard against undersmoothing. Furthermore, these priors comprise a class of non-conjugate priors within which some of the most frequently used non-conjugate and relatively non-informative priors belong, e.g., Jeffreys prior and the uniform shrinkage prior. For a — 0 this class gives rise to Jeffreys prior and for d = 1 it gives rise to the uniform shrinkage prior. 11 1.5 Application of the Proposed Prior: Simulation Studies To verify the performance of the proposed class of priors we have conducted simulation studies for the simple normal-normal hierarchical model defined below. Stage 1: y^ | ~ N(0i,u}); i = 1,2,.V. ,7t; j = 1,2,..., A; Stage 2: (k | /x,A ~ N{fi,X) In stage 1, UJ is sometimes considered as nuisance parameter and assumed to be known. But in most of the practical situations ui is unknown and to be estimated form the data. For estimating UJ under the Bayesian set up we need to assume a prior for w in stage 3. So, the full Bayesian version of the above hierarchical model can be written as Stage 1: ytj | 0i,uj ~ N(0i,u) i• Stage 2: (),, ~ N(ii. A) Stage 3: 7r(//) O C 1 7T(A|c)a( ^r» • cx {^e-t ••; ; x where, T T ( O ; ) , 7T(/J) and 7r(A | UJ) are the priors for ui, ft and A, respectively. In stage 3 the prior for UJ has been taken to be unit information inverted-gamma prior by setting c = ^ and d = 5 for our simulation studies. The reason of choosing unit information prior is that we are assuming we have little prior information, worth of one data point, about the distribution of UJ and allowing UJ to be estimated mostly by the data. The fiat prior density for \i in stage 3 is reasonable for hierarchical model, because, the combined data from all n experiments (clusters) are generally highly informative about /j, and hence we can afford to be vague about its prior distribution. Now, the joint posterior distribution of 9s\ /i, UJ and A, given the data, can be 12 obtained as it(9, //,, UJ, A | Y) x ix{u) x Y[f(0i"\'fi,-X) oc V J n k .«• n • 2 n n 7r(A j UJ) x 7r(a;) X c(w) a+l 1 n UJ '+ A i=l i^lj-l c+1' 1 (1.8) e «> c(w) where, c(w) is the normalizing constant. This normalizing constant is required to make the joint prior 7r(A,a;) a proper probability distribution, which in turn make the joint posterior distribution given by the Equation (1.8) proper posterior distribution. Propriety of joint posterior for the model parameters is essential for valid posterior inferences. The normalizing constant for this problem is obtained as . C(UJ)= OO . . a+l /(Z7+A) dX = auj" Now, for posterior inferences about the model parameters we need to simulate draws for them from the posterior distribution. To draw simulations for each of the unknowns through M C M C method we need to compute the conditional posterior density for each of the unknowns. The conditional posterior for 6 is obtained as n _I u 2 TT(0 I fi,u,\,y) ex J J e Since 0 \ 6 • A. are assumed exchangeable we can write 6i \ p,uj,X,y ~'N(0i,r ), 2 n The conditional posterior density for // is obtained as n' 7r(/i I (),UJ,X,y) oc AW-") J]f::' 2 t=l Therefore, // | 0., ui. A. y ~ iV(0, £ ) . The conditional posterior density for UJ is obtained as a+l Tr{uj\0,fj,,X,y) oc [-) \UJ ) e 13 - —— . . . . . \U! + X J c(ui) where 6{ = Finally, the conditional posterior density for A is obtained as „7r(A | 9,UJ,u,y) oc - e XJ Since the conditional posteriors for 0JS' and \ui + A, have known parametric form we can use Gibbs sampler to draw simulations for them. But the conditional posteriors for UJ and A do not have known distributional form, and so we need to use random walk Metropolis-Hastings algorithm to draw simulations for them. It has been already mentioned that the goal of this study is to choose the appropriate value of a in the prior for A which can sufficiently guard against undersmoothing. Hence, to choose, the appropriate value(s) of a and to judge the performance of the proposed prior in estimating A, the variance component, we have performed simulation studies. For the simulation studies we have proceeded in the following way: (i) fixed some values for each of the unknown parameters / i , a; and A as their true values (ii) for each combination of true values of/x and A we have generated i.i.d. 0i,02, • • • ,9n from iV(/i,A) • ., (iii) for each of the #JS' an i.i.d sample yn,yi2, •• • ,ytk has been generated from N(0i,uj) (iv) given the generated data set, we have applied M C M C techniques to estimate the unknown parameters for different values of a and tried to identify which value of a is reasonable to sufficiently guard against undersmoothing. For our simulation studies we have considered three different values of A, the variance component. The chosen A values are 0, 1 and 5. For all cases the chosen values of a; is 1 and the value of /z has been taken to be 5. The reason, for choosing different values of A for a fixed value of UJ in our simulation study is to see how the proposed class of priors perform in estimating the unknown parameters in different situations when the ratios of the between and within cluster variances are smaller than, equal to and larger than unity and thus make a more general choice for a. The fixed choice of u> — 1 is reasonable in the sense that we can rescale the ui values to 1, which will not affect the other things of our study. For each combination, o f / J , UJ and A values we have generated ten 0 values 0\,02, • • • ,#io and from each of the generated 0 values we have generated a sample of data values. For our simulations studies we can consider both the small and large sample sizes. But here we have considered only the small sample size case especially, to demonstrate the ability of hierarchical model to give accurate estimates by borrowing 14 strength across clusters. To identify a reasonable choice for the hyperparameter, a, we have generated 100 data sets and estimated A by using each of the data sets for different choices of a. Finally, we have observed the number of times, out of 100, that A exceeds the true A value. 1.5.1 Monitoring the Convergence and Mixing the M C M C Run for to and A While using random-walk Metropolis-Hastings algorithm to generate posterior simulations for both UJ and A we need to check the convergence and mixing of M C M C run for them before making any valid inferences about them. For better mixing and convergence we need to choose appropriate jump size for the M C M C algorithm while updating the parameter estimates. In Metropolis-Hastings algorithm there are two problems- (i) the slow movement of the chain toward the target distribution and (ii) slow mixing of the M C M C chain. An M C M C chain may move first (high acceptance rate) but may show slow mixing, i.e, the M C M C chain may move around a specific region of the target distribution for many iterations; on the other hand, it may exhibit good mixing but. slow convergence. But in practice, it is always desirable to have a chain which mixes and converges well at the same time to ensure the accurate inference about the target posterior distribution. To have such a chain we need to adjust the jump size of the chain through monitoring the output. There is no hard and fast rule to determine the appropriate jump size. Usually, trial and error method is adopted. A considerable volume of research work has been carried out and suggestions have been made to monitor the mixing and convergence of an M C M C chain. A nice reference in this regard is Gilks, at. el. (1996).' Research findings suggest that for better mixing and convergence a desirable acceptance rate is around 50%. So, while implementing an M C M C simulation it is necessary to plot the run to monitor the mixing and the convergence of the chain. Also, we need to adjust the jump size so as to get an acceptance rate of around 50%. For updating UJ and A using Metropolis-Hastings algorithm we have used exponential scale, i.e., let UJO and Ao are the initial guesses for UJ and A, respectively, then the candidate states for them are taken as UJairr = W() X CX])(N({), k' )) Kurr - A 2 0 x exp{N(Q,k%)) where, ki and ki are the jump sizes for updating UJ and A, respectively. Each of the Figures 1.3, .1.4 and 1.5 displays four different M C M C chains for the three different true A values, respectively, considered for the simulation studies. In each case of the three true A values considered 15 we have used different initial values, which are widely dispersed from each other, for the four different chains. The convergence plots for UJ are displayed in Figure 1.6. Figure 1.3: Convergence plot for A when A = 0 —i—; • o 1000 2000 3000 4000 1000 2000 3000 iteration (a) initial value=20 iteration (b) initial value=10 iteration (c) initial value=6 iteration (c) initial value=1 16 4000 Figure 1.4: Convergence plot for A when A = 1 iteration (a) initial value=20 — i — 1000 — I — 2000 —i—: 3000 iteration (b) initial valuc=10 — I — 4000 iteration (c) initial value=6 iteration (c) initial value=1 Figure L5: Convergence plot for A when A = 5 • iteration (a) initial va!ue=20 iteration ' (b) initial value=10 iteration (c) initial value=6 iteration (c) initial value=1 17 Figure 1.6: Convergence plot for UJ in co i 4000 4500 iteration iteration ( a ) i n i t i a l value=20 ( b ) i n i t i a l value=10 -r r —r - 1000 4000 4500 iteration (c) initial ... i 4000 4500 2000 3000 iteration value=6 (c) initial 18 value=1 r 4000 '4500 From Figures 1.3, 1.4 and 1.5 we observe that all the chins of each of the true A values exhibit good mixing since none of the chains in any of the three cases kept moving around slowly in any particular region of the target distribution for many iterations. It is also observed that though different chains for each of the true A values start at different initial values all of them have stabilized at the respective true A values, which indicates that the M C M C chains for all of the A values have converged to the respective target posterior distributions. Furthermore, all the cases the chains have stabilized after very few iterations. So, we can draw posterior inferences about A by generating even a shorter chain because we do not need to throw away too many iterations as burn-in. Again, if we have a look at the plots of the Figure 1.6 we observe the same feature about the convergence and mixing of the chains for w as we do in the case of convergence and mixing of the chains of A. 1.6 Analysis of Output Since the goal of this study is to choose appropriate value/values of the hyperparameter a for the proposed conservative prior we have estimated A by using 100 different simulated data sets for different values of a. In estimating A we have used both the posterior mean and the posterior mode. In each case we have constructed histogram of. 100 A and see how many of them exceeds the true A value. For each of the model parameters we have run M C M C simulations for 2500 iterations, among which first 500 iterations have been thrown away as burn-in of the chain and used the remaining 2000 for inference purposes. Here, we have used 2500 iterations because from the convergence study of M C M C chains we have observed that for both UJ and A M C M C chains converge after very few iterations. The histograms of 100 estimates of A have been displayed in Figures 1.7-1.12 for different true A values. In each case, we have considered 6 different values of a as a'= 1,2,3,4,5 and 10. 19 Figure 1.7: Histograms of A for different choices of a in the case of posterior mean when true A=0 100 90 80 j? 70 § «> 3 50 £ 11 40 30 20 10 0 ~1 0.00 1 1 0.02 - r 0.04 - -1—r "i—i—r 0.06 o.io T 0.12 ~i—i—i—i—i—i—i—r—r 1—r~T—i—m—i—r 0.00 .. 0.02 / 0.04 , 0.06 , 0.08 0.10 0.12 F T 0.14 - A •X•' (b)a=2 X (a)a=1 A •X (c)a=3 0.08 0.14 A I. (d):a=4 0.10 0.12 A • I' • (e)a=5 (f)a=10 20 0.14 0.16 Figure 1.8: Histograms of A for different choices of a in the case of posterior mode when true A= 0 Oe-tOO A I: 46-05 ' A X (a)a=1 (b)a=2 0 0 *" 0. > 0 0 c CsD 0 a 90) • U. i. o . w 0OwOO —J— 1e-05 -T 3e-05 2e-05 4e-05 5e-05 A A (c)a=3 (d):a=4 I •I' - 100' 90 • 80 • 70' 60' 50' 40 30 20 10 0 0e*00 16-05 • 26-05 3e-05 46-05 56-05 06tO0 16-05 2e-05 36-05 .A . A . I (f)a=10 (e)a=5. 21 46-05 56-05 Figure 1.9: Histograms of A for different choices of a in the case of posterior mean when true A= 1 40- 40- > 300 > 300 c •c (D 3 20 - (0 CT 3 20- 0 £ 0 , 10- £ 10- n _ i—i—i—i—r—i—r—r - -i—i—i—i—i—i—r 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.9 1.0 1.1 1.2 1.3 1.4 1.5 i—i—rn—i—i—i—m—i—i—i—i—i—r 0.0 0.1 0.2 0.3 0.4 0.5 0,6 0.7.0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 A 0 a=1 .. • A II—i—rn—i—i—i—i—i—r • \- . A •'. (d):a=4 (c)ar3 \2 II 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 09 1.0 1.1 1.2 .1.3 1.4 1.5 i • III—i—i—i—rn—T—i—i—i—i—i T—r—i—i—i—i—T—i—i—i—i—i—i—i—i—r 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8. 0.9 1.0 1.1 (b)a=2 '-, : i—III—i—i—i—i—i—r .1.3 1.4 i.5 A 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8. 0.9 1.0 .1.1 .1.2 1.3 .1.4 .1.5 A • I "• (f)a=10 . (e) a=5 22 Figure 1.10: Histograms of A for different choices of a in the case of posterior mode when true A= 1 i—i—II II—r. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 i—i—i—i—i—i—i—i—i—i—i—i—i—i—r 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1—I—I—I—III 0.9 1.0 1.1 \Z 1.3 1.4 1.5 . A VI )a=2 (8)8=1, i—i—i—i—III—i—III—i—i—i—I—r 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 i—i—i—i—i—I—I—II i—I—I—I I I 0.0 o:i 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 12 1.3 1.4 1.5 A A I (c) a=3 i i i i i i i (d): a=4 i i i i i i i r :0.0 0.1 0.2 0.3.0.4 0.5 0.6 0.7 0.0 0.) 0.2 0.3 0.4 0.5 0.6 07 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 A " • • » , • • (f)a=10 (e)a=5 23 1.0 1.1 1.2 1.3 1.4 1.5 Figure 1.11: Histograms of A for different choices of a in the case of posterior mean when true A= 5 ;50 40 > o % I 30 20 IL 10 0 i I I I . z... l f h , - r i i i II i i—i—i—i—i—III—i—i—i r 1.5 2.0 2.5 3.0 3.5 4.0 45 5.0 5.5 6.0 6.5 7.0 7.5 - i r~i \ r i i i i i r~r 1.5 2.0 2.5' 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 A I i A I (a)a=1 T i I 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6,5 7.0 7.5 . (c)a=3 I (b)a=2 1 1 r~\ 1 1 1 1 1 1 1 1 r 1.5.2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 A .; I •• - • (d): a=4 ! 5040- c 30- be. 0 20- 0 3 IL 100- n—i—1—1—i—III—r 1—III 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 - T 1 1 i 1 i 1 i 1 1 1 i r 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 A X (f)a=10 (e)a=5 24 Figure 1.12: Histograms of A for different choices of a in the case of posterior mode when true A= 5 "•• (a)a=L. (b)a=2 I (c) a=3 50 40 c 30 - CT 20 O LL 10 - ~i 2.0 2.5 "T— 3.0 3.5 "I "T— 4.0 . 4.5 -r 5.0 0 - 5.5 6.0 6.5 7.0 7.5 ~i 1.5 1 1 1 2.0 • 2.5 . 3.0 1 3.5 1 1 1 1 1 1 1 4.0 4.5 5.0 5.5 6.0 6.5 7.0 r 7.5 A •I (f)a=i0 (e) a=5 From the histograms of A, based on posterior mean, we observe that for a < 3 most of the estimates of A exceeds the true value of A for all the cases considered. But for a > 3 few of the estimates exceeds the corresponding true parameter value except the cases when A = 0. Furthermore, the simulation studies reveal the fact that too big values of a cause serious underestimation. So, the general choice for a should be between 3-5. In the case when true A = 0, though some of the estimates exceed the true A values for values between 3-5 of a, the magnitudes of overestimation are quite small (Figures 1.7 and 1.8). Also, in prototype model, mathematically we, have shown that the smaller the value of true A the bigger, the 25 chance of undersmoothing and for A = 0 the probability of undersmoothing is the highest. So, Figures 1.7 and 1.8 confirm that a value between 3-5 of a performs quite well even in the worst case of undersmoothing. So, from the results of simulation studies we can conclude that use of the proposed conservative prior with a value between 3 to 5 for the hyperparameter a can sufficiently guard against undersmoothing when posterior mean is used as point estimate of A. Regarding the use of posterior mode as an estimate of A, the histograms of A (based on posterior mode) reveal that any value between 2 to 4 of a can guard quite well against, undersmoothing, and in the case of no heterogeneity (A = 0) posterior mode performs much better than posterior mean in guarding against undersmoothing. Even with a = 1 undersmoothing seems to.be under control. In the case of posterior mode all the A are very close to 0 (between 0 to 0.00005), the true value of A. So, as long as the point estimate of A is concerned posterior mode is preferable to posterior mean. But the problem of using the posterior mode is that it depends on the bin width used to calculate it. There is another problem, probably the more serious one in the context of smoothing, of depending on posterior mode as an estimate of A while estimating the mean vector 9 in Bayesian hierarchical model by using M C M C technique. The use of posterior mode as a point estimate of A ends up with suggesting the use of smaller values for a in conservative prior compared to the case when posterior mean is used (2 to 4 versus 3 to 5). Use of smaller a values, as suggested by the posterior mode, may lead to undersmoothing of the phenomenon, i.e., smaller a values may produce 9 estimates which are more variable than the true 9s. Because, while estimating the 9s by M C M C technique we generate 9s conditional on the specific A generated at each iteration of the M C M C run, i.e., we do not condition on the point estimate of A while generating 9s, rather we generate 9s for each generated A from its conditional posterior distribution. So, use of smaller a values, as suggested by the posterior mode, can generate a A bigger than the true A, which in turn results in generated 9s that are more variable than they truly are. This problem seems to be more serious when true A is 0. In this case the use of posterior mode as a point estimate of A simulation studies suggest that even a = 1 performs greatly in guarding against undersmoothing. . Again, from the histograms of A based on the posterior mean we observe that a = 5 sometimes can cause to much oversmoothing, especially, when A is relatively larger than UJ. Also, posterior mean suggests a value between 3 to 5 and posterior mode suggests a value between 2 to 4 for a to guard against undersmoothing. Since, both the point, estimates of A suggest a common range of values for a, which is between 3 to 4, hence, in general, irrespective of the choice of point estimate for.A we can conclude that any value between 3 to 4 for a in the conservative prior can reasonably be used to ensure adequate guard against undersmoothing, while at the 26 same time not oversmoothing the phenomenon top much and thus obtaining better calibrated estimates of the model parameters in normal-normal hierarchical models. 1.7 Comparison of Results Obtained by Using the Conservative Prior with those Obtained by Using the Jeffreys' Prior and the Uniform Shrinkage Prior For comparing the performance of the proposed conservative prior with those of the uniform shrinkage prior and Jeffreys, priors we have conducted simulation studies. We have estimated A from 100 different simulated data sets by using all the three competitive priors. We have examined the comparative performance of the priors from two different aspects- (i) first, with respect to their ability to guard against undersmoothing and (ii) second, since by,introducing the conservative prior we force the variance component A to be underestimated (oversmoothed), in a sense, we are introducing some sort of bias in A. So, the obvious question here is that how much we gain or lose with respect to mean-squared error (MSE) of A by accepting some bias while estimating it. So, to reflect this issue we have also compared the MSE of A for, the competitive priors. For our simulation studies we considered A = 0,1,5 and UJ = 1. In this case we have performed the simulation studies under both the small (n = 10) and large (n — 100) sample sizes. In this study we have considered the uniform shrinkage prior and the Jeffreys' prior as the main competitors of the proposed conservative prior. The reason is that these two priors are widely used non-informative priors for Bayesian hierarchical models. Daniels (1999) showed that other priors are not good in estimating the random effects variance component, especially in the situation of no heterogeneity and in the cases when the random effect variance. A is relatively smaller with compared to error variance UJ . 1.7.1 Analysis of Output For comparing the performance of the three competitive priors we have calculated the MSEs of A by using each of the three priors. In each case we have calculated the MSE by averaging over 100 simulated data sets. In all the cases we have used a — 5 for the conservative prior. MSEs are summarized in Tables 1.2, 1.3 and 1.4 for different true A values considered. 27 . Table 1.2: M S E of A for different priors when A = 0 Prior Conservative Uniform shrinkage Jeffreys' Conservative Uniform shrinkage Jeffreys' Sample size 10 100 • MSE 0.0001124 : 0.0001368 0.0001871 3.1606e - 06 3.5682e - 06 3.5337e - 0 6 [ Table 1.3: M S E of A for different priors when A = 1 Prior Conservative Uniform shrinkage Jeffreys''. Conservative Uniform shrinkage Jeffreys' Sample size 10 100 M S E .-. 0.032352 0.428368 0.838135 0.027283 0.435985 0.810519 From Tables 1.2, 1.3 and 1.4 we see that, irrespective of sample size, M S E of A is the minimum for the conservative prior and the maximum for the Jeffreys' prior when the true A values are smaller or of equal magnitude (A = 0 or A = 1) with compared to the within group variance UJ. But when the true A is larger compared to the error variance UJ the M S E for the conservative prior is the maximum while that for uniform shrinkage prior is the minimum in the case of small sample size. In case of large sample size the M S E of A is the maximum for Jeffreys' prior and minimum for the uniform shrinkage prior. The interesting point is that for large sample the M S E increases for both uniform shrinkage prior and Jeffreys' prior but decreases for conservative prior when true A value is relatively bigger (Table 1.4 ). This might be due the fact that, as it can be observed from Equation (1.1) and from Figure 1.1, as sample size gets larger the probability of undersmoothing increases toward a limit of 0.5, and so there is more chance of getting bigger values in the posterior simulations of A, which inflates the estimated A and thus contributing to higher M S E of A. But in case of conservative prior the bigger a value suppresses those upward jumps in the posterior simulations for A, and controls the undersmoothing, which in turn does not let the M S E to go up even when sample size gets larger. 28 Table 1.4: MSE of A for different priors when A = 5 Prior Conservative Uniform shrinkage Jeffreys' Conservative Uniform shrinkage Jeffreys' Sample size 10 100 MSE 1.995964 0.436016 0.831997 1.808555 0.666834 3.77446 Before drawing general conclusion about the comparative performance of the three competitive priors there is one point worth focusing on. So far, for calculating the MSE of A we have considered a = 5. Previously, from simulation studies, we have observed that any value between 3 to 5 of a is sufficient for guarding against undersmoothing while we use the proposed conservative prior. But if we have look at the histograms of A in Figures 1.11 and 1.12 we observe that using a — 5 for conservative prior gives a rather underestimation (oversmoothing) for A when true A is relatively larger, which might be the cause of bigger MSE of A in this case. Since any value between 3 to 5 of a is good enough to guard against undersmoothing and since for the case of relatively larger between group variance use of a = 5 gives too much underestimation it may be reasonable to use a_ — 3 and see how the conservative prior perform with respect to MSE of A in all the cases we have considered. To check this we have computed the MSE of A again for a = 3. Tables 1.5, 1.6 and 1.7 summarize these MSEs. Table 1.5: MSE of A for different priors, with a — 3 for conservative prior, when A = 0 Prior Conservative. Uniform shrinkage Jeffreys' Conservative Uniform shrinkage Jeffreys' Sample size 10 100 MSE 0.0001216 0.0001368 0.0001871 3.2738e - 06 3.5682e - 0 6 3.5337e - 06 Looking at the output in Tables 1.5, 1.6 and 1.7 we observe that with a ='3 the conservative prior produces the best results in terms of both guarding against undersmoothing and MSE in 29 Table 1.6: MSE A for different priors, with a == 3 for conservative prior, when A — 1 Prior Conservative Uniform shrinkage Jeffreys' Conservative Uniform shrinkage Jeffreys' . Sample size 10 . 100 MSE 0.117817 0.428368 0.838135 0.126487 0.435985 0.810519 Table 1.7: MSE A for different priors, with a — 3 for conservative prior, when A = 5 Prior Conservative Uniform shrinkage Jeffreys' Conservative Uniform shrinkage Jeffreys' Sample size 10 100 MSE 0.484169 0.436016 0.831997 0.474004 0.666834 3.77446 estimating A in all but one of the cases considered. The only case where the uniform shrinkage prior marginally beats (an MSE of 0.4842 for conservative prior versus an MSE of 0.4360 for uniform shrinkage prior) conservative prior is when between group variability is larger than the within group variability and sample size is small. So, in general, we can conclude that though the proposed conservative prior has been introduced with the aim of guarding against undersmoothing simulation studies shows that for a reasonably chosen value of hyperparameter a it can also produce better calibrated estimate for the random effects variance component and hence for the other parameters in hierarchical models. 1.8 Conclusion In this chapter we have introduced a class of non-informative priors for the random effects variance component in Bayesian hierarchical models. This class of priors can give rise to the widely used priors in Bayesian hierarchical models such as, the Jeffreys' prior and the uniform 30 shrinkage prior. Finally, for estimating the variance component A by Bayesian approach we have incorporated the idea of smoothing to choose the appropriate values of the hyperparameter a for the considered class of priors to get a prior which can sufficiently guard against undersmoothing. We named this prior as "conservative prior". Simulation studies reveal that we can achieve desired guard against undersmoothing if we choose a value between 3 to 5 for the hyperparameter a. But guarding against undersmoothing may result in some degree of bias, in the parameter estimates and hence an increase in the M S E of of the estimated parameters. Keeping this aspect in mind we have compared the performance of conservative prior in estimating the variance component A with those obtained by using the Jeffreys' prior and the uniform shrinkage prior with the help of simulation studies. From simulation studies we have found that use of a = 5 for conservative prior produces smaller M S E for A with compared to those produced by the use of other two priors when between group variability is relatively smaller than the within group variability, but produces higher M S E compared to the other two priors in the reverse situation except the case that it produces smaller M S E than Jeffreys' prior does when sample size is large. From simulation studies it is also observed that relatively higher M S E for A for conservative prior with a = 5 in the situation of relatively higher between group variation is actually due to the.fact that using a = 5 in this situation gives to much, underestimation of A (Figures 1.11 and 1.12). But using a = 3 nicely controls undersmoothing as well as gives lower M S E by providing more precise estimate of A, which in turn produces better calibrated estimates of the other model parameters. Finally, though we have conducted our simulation studies for the situations of between and within group variance ratios of 0, l a n d 5, it can be argued that for most of the practical situations it is very unlikely that the ratio of between group variability to the within group variability would be very high, e.g. as large as 5. For instance, if we think of a hospital model of cardiac treatment it is very unlikely that the survival probability 6{ will differ too much from one hospital to another. In such cases undersmoothing should be main concern and there is not too much to get worried in using any value between 3 to 5 for the hyperparameter a in conservative prior to get better calibrated estimates of the model parameters and thus have a better real life picture. In the case where it is really expected that the between group variability would be much larger than the within group variability we can use a = 3 for better calibrated estimates of the model parameters. 31 Chapter 2 Curve Fitting 2.1 Introduction In chapter 1, we have studied the performance of the proposed conservative prior in Bayesian normal-normal hierarchical model, which is actually an application of Bayesian hierarchical model to random effects model for normal response. Another potential area of using, the Bayesian hierarchical model is non-parametric curve-fitting. In this chapter we have briefly described the idea of non-parametric curve-fitting and the spline-based roughness penalty approach to curve-fitting. Curve fitting or regression function estimation is a common and most useful tool in statistics. It has two purposes-firstly, it provides a way of analyzing and presenting the dependence of a response variable on the design variables; secondly, it allows prediction of unobservable future responses. Suppose we have n measurements on a response variable y, and a single predictor variable x. In general, the dependence of y on a; can be expressed as y = f(x) + e . . (2.1) where / is the curve of some sort and e is the random noise. The objective of curve fitting is to estimate the function / in (2.1) from the observed data. There are two approaches to curve fitting- (i) parametric approach and (ii) non-parametric approach. 32 2.2 Parametric Approach The parametric approach imposes rigid parametric assumptions about the dependence between the response and the predictor. For instance, in the case of linear regression the regression function / in (2.1) is assumed to be linear. In general, in parametric approach response are assumed to follow a parametric distribution, e.g., a distribution from exponential family. Under the exponential family the dependence of response can be summarized under the framework of generalized linear model (GLM)(McCullah and Nelder, 1989). The simplest version of G L M is the linear regression model where the responses are assumed to follow normal distribution, thus making / a linear function of x, i.e., y = a + fix + e. That is, every parametric method requires rigid assumption on the form of / , which may not be true, and hence the question of a non-parametric approach comes through. 2.3 Non-parametric Approach Data for which none of the parametric method seems reasonable are often envisaged. In such cases, it is wise to let the data to show us the appropriate functional form rather than imposing any definite parametric form. Non-parametric methods are very flexible in allowing the data itself to decide on how the dependence pattern should be. The underlying assumption here is that the dependence of the mean of y on a; should not change much if x does not change much. This assumption is very often reasonable. This assumption can be interpreted as that we want an estimate of/, say / , which is at most as variable as / itself, i.e., we don't want / to be more wiggly than / . So, the non-parametric approach involves the choice of a smoothing parameter which controls the balance between goodness of fit and smoothness of the estimated regression function. There are different non-parametric methods for estimating regression function. Interested readers are referred to Simonoff (1996). Since our work is aimed at developing a conservative prior for Bayesian approach to spline-based roughness penalty approach of curve estimation, descriptions of the spline-based roughness penalty approach and associated techniques are given in the next few sections. 33 2.3.1 Spline-based Roughness Penalty Approach The basic idea of. the spline-based roughness penalty approach is to quantify the notion of a rapidfluctuatingcurve with the help of a piece-wise polynomial of certain degree (called spline) over each subintervals of the range of the predictor considered, and then pose the estimation problem in such a way that makes the necessary compromise between the two opposing aims of curve estimation. The opposing aims are-(i) goodness offit,which measures the closeness of the estimated curve to the true underlying curve and (ii) roughness, which measures the wigglyness (local variability) of the estimated curve. In general, it is always desirable to have an estimated curve which provides a good fit as well as not too wiggly. The roughness penalty approach makes a compromise between these two opposing factors. In our discussion we will focus on only cubic splines because polynomials of degree higher than three are too wiggly, while lower degree polynomials are not flexible enough to capture the local variation of the data. A cubic spline can capture the local variation because it has two continuous derivatives and at the same time it is not too wiggly. It is also very amenable for mathematical manipulation. Eubank (1988), Wahba (1990) and Green'& Silverman (1994) are excellent reference books on spline-based smoothing techniques. As our work will be based on cubic spline/natural cubic spline, formal definitions of cubic spline and natural cubic spline have been given below. Cubic Spline (CS): The function / is said to be a cubic spline on the. interval [a, b], satisfying a < t\ < 7J2 < ••• < tn < b, if / is a cubic polynomial on each of the intervals (a,£i), (*i,^2) • • • > {tn,b) and the polynomial pieces fit together at the points U in such a way that / itself and its first and second'derivatives are continuous at each rj^, and hence on the whole interval [a, b}. The points V s are called the knots. Natural Cubic Spline (NCS): A cubic spline on an interval [a, b] is said to be natural cubic spline if its second and third derivatives are zero at the boundaries a and b. These conditions are known as natural boundary conditions. These imply that / is linear on the two extreme interval [a,7Ji] and [t ,b]. n Natural cubic spline has enormous mathematical convenience because it can be exactly specified by finding a finite number of constants (Green and Silverman, 1994). 34 2.3.2 Mathematical Formulation of the Roughness Penalty Approach For constructing an estimate of the curve of type 2.1, whose observed values are subject to random error, suppose that h,t2, •' •., t are the points in [a, 6] satisfying a < t\ < £2 < . . . < n t n < b and that yi,y2> • • iVn are the observed values. Let $2[ft,b] be. the space of continuous twice differentiable functions, then for any function / in 52 [a, b] the penalized sum of squares is defined to be • S(f)=J2{yi-f(ti)} 2 : i = : 1 ' + <x f{f(x)"} dx : 2 « •••• (2.2) "• In the roughness penalty approach, / is calculated so as to be the minimizer of S(f) over the class S2[a, b] of all sufficiently smooth curves on [a, b]. The second term in (2.2) is the roughness penalty term. The addition of the term a f (f") dx in (2.2) ensures that the cost of 5(/) of the 2 particular curve is determined not only by its goodness-of-fit to the data quantified by the first term but also by its roughness f(f")' dx. The smoothing parameter a represents the strength 2 of penalty to be paid. Large value of a represents stronger penalty, which produces a smooth curve. On the other hand, if a is relatively small then the main contribution to S(f) will be the residual sum of squares and the curve estimate / will track the data closely, thus producing a wiggly curve. Thus minimization of S(f) can give a compromise between smoothness and goodness-of-fit. In implementing roughness penalty approach the obvious choice is the natural cubic spline because NCS produces the smoothest possible curve among all the polynomials with continuous second derivatives in b] minimizing 5(/) (Green and Silverman, 1994). Also knowing that / is a natural cubic spline has many advantages. We can specify / exactly by finding a finite number of constants because we only need to minimize S(f) over a finite dimensional class of functions, the NCS's with knots at'ij, instead of considering the infinite dimensional set of smooth functions 52[a, b]. Also, with an NCS there is an elegant algorithm of how the minimizing spline curve can be found by solving a set of linear equations. 2.3.3 Elegant Way of Representing N C S The most elegant way of representing NCS is the value second derivative representation (Green and Silverman, 1994). In this method an NCS can be specified by its value and the second 35 derivatives at each of the knots Let the spline have n equally, spaced knots satisfying h < h < • • • < t . Define fi = f(ti) and 7* = f"{U) for i — 1,2,..., n. For an N C S we have. n 71 = 7 n = 0. Let / be the vector of (f f ,..., u f )' and 7 be the vector of ( 7 2 , 7 3 , . . . , 7 n - i ) ' - 2 n For specification of the N C S with the help of / and 7 one needs to define two matrices Q and II using the knot values. . Let hi = 7Jj i + — ti for i = 1, 2 , . . . , n — 1. The Q matrix is an n x (n — 2) matrix with entries qij, i = 1,2,..., n; j = 2 , . . . , n — 1, defined as for j = 2 , 3 , . . . , n -1, and Qj, = 0 for \i - j | > 2. The symmetric matrix R is an (n — 2) x (n - 2) matrix with entries rjj, i , j . = 2,3,... ,n—.1,.defined as . • r^j = ^(/li-x +/ij),for i = 2,3,... ,.n - 1 • •. n,i+i = ^+1/ = ^/ii,for i = 2,3,... ,n - 2 and = 0 for |i — j\ > 2. Using these facts two important results are stated in Green and. Silverman (1994), which are given by the following theorem. T h e o r e m : The vectors / and 7 specify a natural cubic spline if the condition Q f = i ? 7 is T satisfied. If the above condition is satisfied then the roughness penalty wijl satisfy b ntydt •.'••'' with K' = , P ''.' =7 ^ 7 ; o QR~ Q . l T It can be verified that the value of the cubic spline at any point t is given by (Green and Silverman) f(A fit) = (t-U)fi+i — + {ti -t)fi+1 h l l ~ ^ - W i u for U < t < U i; i = 1,2,... ,ra - 1 + + l ~ t)[[l f, ^ t-ti\ + r-^j 7 i + 1 • f, + , t i + l -t\ Ji (2. In matrix notation, (2.3) can be written as . . (f(x )J(x ),...,f(x )) 1 2 N = Af . (2.4) where A is a matrix of order N x n whose (j)th row is obtained as ;.• •' (t-.u) *=— - j i + -T-j w ^ W +1 = a , K ~~h~ l , (*-*»)(<HI - 0 r / 1 6 6 \V tr—il = + 1 + ~h-) C ~xrJ [ + i + c w W + + l V -'-V H :. : C -h—) + ) J ~h~~) + l ^~J -'j 1 + c for «' = 1,2,..., i — 1, ? + 1,..., n for any t between ti and *t+i> where Cjj is the (?i)th element of R~ Q l T and iV is the number of knot points for which / needs to be estimated. 2.4 Selection of Smoothing Parameter for Spline Smoothing Though spline smoothing is a very popular non-parametric technique for estimating a regression function the performance of it'depends on the choice of smoothing parameter. So, the choice of smoothing parameter is the most essential task of spline smoothing. There are different methods for an automated selection of smoothing parameter. The most widely used ones in frequentist approach include the cross-validation (CV) and the generalized cross-validation (GCV) methods. 2.4.1 Cross-validation Method The basic idea behind C V method is in terms of prediction. It uses the principle of "leave-" one-out" prediction. The idea is to leave the data points out one at a time and to select the smoothing parameter a in equation (2.2) under which the removed data points are best predicted by the remaining data. Let • (t, a) be the curve estimate from all the data except yi using a smoothing parameter value a. Then the cross-validation choice of a is the value of a that minimizes the C V criterion 37 denned as C V ^ ^ ^ ^ i - j ^ ^ a t f . •1 (2.5) N An easier computational form of C V criterion denned by equation (2.5) has been given by Craven and Wahba (1979) as ^ : w- where / is the spline smoother calculated from the full data set (ti,yi) with the smoothing parameter a and B(a) = (/ + 2.4.2 aQR~ Q )~ . l T l Generalized Cross—validation approach Generalized cross-validation (Cravan and Wahba, 1979) method is a modified form of C V method. It is obtained by replacing Bu(a) in equation (2.6) by its average value, as-" - ' n^ trB(a) 1 • • : .",E{?A-/(A«)} , GCV(a) —- ~7~T~ , „ o (2.7) ' n{l -n-HrB(a)} ' As in ordinary CV method in GCV method the smoothing parameter a is obtained by minimiz2 v 2 K ing G C V score given by the equation (2.7). Theoretical arguments show that GCV produces asymptotically best possible choice for a in the sense of minimizing the average squared error at the design points. Also, GCV method has some computational advantages over CV method (Silverman, 1984). 2.5 Conclusion In this chapter we have discussed the roughness penalty approach to curve-fitting. The most important consideration in the roughness penalty approach is choosing the smoothing parameter. Two widely used frequentist methods in this regard are the C V and GCV. Both the methods have some limitations. The most serious one is that though in smoothing problem it is believed that the true underlying regression curve is a smooth one the CV or GCV approach 38 sometimes gives a smoothing parameter that produces an estimated curve which is very undersmooth (wiggly). Secondly, the basic idea of CV or GCV approach is to choose the value of the smoothing parameter a which minimizes the C V or GCV score. But in some cases C V or GCV score may be a monotone function of a, and in those cases no obvious solution is available. Thirdly, in cases when the function CV or GCV has not a unique minimum a simple grid search is usually used to locate the global minimum. In such cases it is not easy to determine that how small the grid length should be so that the global minimum can be located.. From this view point C V or GCV is not automated method in the strict sense. Again, with C V or G C V approach we can just estimate and predict the response as a function of the predictor, but we can not test the hypothesis that whether the response predictor relationship is significant. Considering all the limitations of CV or GCV approach it is desirable to have a more automated approach of selecting the smoothing parameter to get an estimated curve which is not more wiggly than the true curve itself and which can provide means to draw conclusion about the significance of the response—predictor relationship. 39 Chapter 3 Bayesian Representation of Roughness Penalty Approach of Curve Fitting: Methodology and Simulation Studies 3.1 Introduction In seeking an approach to regression curve estimation which can guard against undersmoothing, we can think of a Bayesian formulation of roughness penalty approach. Also, in the Bayesian framework of roughness penalty approach we can think of making use of the proposed conservative prior. Because simulation studies show that the proposed prior, with suitably chosen hyperparameter, performs quite well in guarding against undersmoothing in the case of simple normal-normal hierarchical model. In this regard, first of all we need to define the Bayesian formulation of the roughness penalty approach to curve-fitting. The current section gives the details of the Bayesian formulation of the roughness penalty approach. . ' The roughness penalty approach has straight forward Bayesian representation. Let y be the response vector and a; be the vector of the values of the design variable. In smoothing problem an adequate summary of the relationship between a response variable y and a predictor variable x is provided by the smoothing model (3.1) y = fU-) + <• where, e is usually multivariate normal with mean vector 0 and the variance matrix a I . 2 n Define n knots t\,...,t n over the values of the design variable x. Let fi = f(U), so that / — (/b • • • > fn)'. be the vector of the values of f(x) at the knots. With f(x) being a natural cubic spline with knots at ti — r,i, the penalized sum of squares can be expressed as S(/) = ( y - / ) ' ( ? , - / ) + A / ' t f / 40 . (3.2) where f'Kf 1990). is the measure of roughness (Green and Silverman, 1994; Hastie and Tibshirani, In the frequentist roughness penalty approach the goal is to estimate / such that S(f) is minimum. In Bayesian paradigm there is a nice straightforward representation for this roughness penalty approach. In this chapter we elaborate on the Bayesian representation of roughness penalty approach of smoothing problem. In Bayesian representation the roughness penalty approach to curve-fitting can be expressed as a multivariate normal-normal hierarchical model, where the smoothing parameter can be taken as the function of the random effects variance component. Hence, in order to guard against undersmoothing we have used the conservative prior for the variance component. We have conducted simulation studies to see how the proposed conservative prior can perform to guard against undersmoothing. Also, on the basis of simulation studies, we have compared the performance of conservative prior and that of uniform shrinkage prior (Daniels, 1999) and Jeffreys' prior with respect to guarding against undersmoothing as well as with respect to MSE of the estimated curve. We have considered uniform shrinkage prior as the main competitor of conservative prior with respect to guarding against undersmoothing since it can provide some degree of guard against undersmoothing due to its property of shrinkage toward the situation of no global heterogeneity in the data set. Finally, on the basis of simulation studies, a comparison of Bayesian approach using conservative prior and Bayesian approaches using uniform shrinkage prior and Jeffreys' prior have been made to see how each of the competitive priors perform with respect to the mean squared error of the estimated curve. 3.2 Bayesian Representation of Roughness Penalty Approach Suppose, we have the data (yi,Xi); i — 1,2,... , n and the model yi = f(xi).+ e;. Assume that yi ~ N(f(xj),o ). 2 Let f(x) be a natural cubic spline. Define distinct knots at x — to,x — ti,...,x = t ,x = tp+i over the values of the design variable x with /(io). = Go, f(t\) = #i, • . . , f{t ) — 0 , f(t +i) = 0p+\. Here, Oo, Q\,..., 0 i are unknown parameters with f(tj) — OJ; j = 0,1,... ,p + 1. By defining the vector of parameters 0 = (QQ,Q\, . . . ,0 \) at p p p p p + p+ the selected knots only we represent the true function f(x) with a lower dimensional vector 0. The main advantage of such a parameterization of the true function f(x) is that we are able to estimate the entire function f(x) by estimating a fewer number of parameters. For example, there may be the cases where the number of design points is over several hundred or even thousand. In such cases it is not computationally reasonable to estimate f(x) at all the design points. In such situations even 10 or 20 knots can give adequate approximation to the 41 true curve. In other words as long as smoothing is concerned, fewer number of knots can give the same level of flexibility as the case of taking knot at each design point. For a natural cubic spline it is assumed that f(x) is linear outside the boundary knots. Under this assumption we can consider the model (3.1) to be composed of a linear part and a smoothed part. The model can then be written as v =fta+ S\x + f(x)+e '" : : ••' Now, for simplicity, we can assume that linear trend has been removed so that /(io) = 0 and /(ip+i) = 0. In this case, 9 == ... ,9 )'. Under the assumption that linear part has been P removed model (3.3) can be written as y — BQ — B\x = f(x) + e, or equivalently, • / : V = f(x) + e (3.4) where y = y — Bo — B\ - Removing the linear part from the model (3.3) helps in defining a x proper prior for the parameter vector 6. The smoothed part f(x) and the linear part can be estimated simultaneously by Bayesian back-fitting algorithm (Haste and Tibshirani, 2000). In this study we have concentrated on estimating the smoothed part f(x) only. Under the model (3.4) we have ]9-dimensional parameter vector 9 = (9i,... ,9 )' to represent the true function P f(x). In this case, we have f'Kf = 9'K9 = f[f"(t)] dt as the measure of roughness of / . The 2 penalized sum of squares can now be expressed as : (3.5) S(f) = (y - f)'(y - /) + X9'K0 Since we represent, f(x) with the parameter vector 9 we estimate f(x) only at the selected knot points as 9. So, tp estimate the entire curve we can use the method of interpolating natural cubic spline (Green and Silverman, 1994) to have f{x) = A§, where A is an n xp design matrix. The computation of the matrix A has already been discussed in section 2.3.3. Now, for Bayesian representation we can write y~N(A9,a l ), 2 n . where y = ( y i , . . . , y )' is the n x 1 response vector. So, we can write /. = A9, and the penalized n sum of squares as • S(f) = (y-A0)'(y-A0) (3.6) + \9'K9 Consider the prior distribution for the parameter vector 9 as 0\T ~ iV(0, T V), where, V = K~ 2 2 L is a known p x p matrix (K has been defined in section 2.3.3). The hierarchical model for the 42 smoothing problem (3.4) can be expressed as y\0,o ~N(A9,a T„) 2 2 (3.7) 9\T ~N{0,T*V) 2 The posterior distribution for 6, given T ,CT and the data, is obtained as 2 TT(0\a ,r ,y) 2 2 f(y\9,a )n(9\T ) oc 2 2 . oc 2 ^ti^mv-M) -&o'v-ie e e where, IX(0\T ) is the prior for 0. 2 It can be easily verified that the posterior distribution of 9 is normal. The Bayesian estimate of 9 is given by. the posterior mean (or mode) of 9, which can be obtained by minimizing the quantity Q = (y - A9)'{y - AO) + ^0'V~ 0. l If we let A = <£, then Q = S(f), the penalized sum of squares. So, minimizing Q is the same as minimizing S(f) meaning that the Bayesian model (3.7) is the exact representation of the roughness penalty approach of curve fitting. For our model the smoothing parameter is A = the variability among 0s. Smaller r 2 Since 9 ~ N(0, T V ) , r 2 2 determines corresponds to a stronger penalty and hence attempts to produce a smoother curve, whereas larger T attempts to produce an wiggly curve. From model 2 (3.7) it is observed that in Bayesian paradigm we express the roughness penalty approach as hierarchical model and we take the smoothing parameter A as a function of the random effects variance component r . Since, in Bayesian approach degree of smoothing i n curve estimation 2 is controlled by the random effects variance component T we can use the conservative prior for 2 T so that the chance of having an estimated r 2 2 greater than the true T is low. Thus we can 2 have an estimated curve which is not more wiggly than the true underlying curve. For the full Bayesian treatment of the roughness penalty approach we need to specify priors for a and r . There are many possible choices for the priors of random effect variance component 2 2 r . Barry (1995) considered Jeffreys' prior jointly for r and a . Daniels (1999) considered uni2 2 2 form shrinkage prior for the covariance matrix of random effects in normal-normal hierarchical model. Though, in his study, Daniels focused on the simple normal-normal hierarchical model it is possible to construct an intuitive version of uniform shrinkage prior for smoothing problem represented by model (3.7). In any of the above cases, there was no or little attempts to guard against undersmoothing while estimating the random effects, 9, though it is desirable for most of the applications. In this study our goal is to suggest a prior for the random effect variance component T which can guard against undersmoothing. That is, we would like a prior for r 2 43 2 which encourages smaller r and therefore smoother f(x). In other words, we do not want f(x) 2 to be more wiggly than the real f(x). In this chapter we have conducted simulation studies to estimate the function f(x) by Bayesian roughness penalty approach using the proposed conservative prior for r . Also, besides using the suggested conservative prior we have.used the 2 uniform shrinkage prior and the Jeffreys' prior for the random effects variance component r 2 and have compared the results. 3.2.1 The Conservative Prior for the Random Effects Variance Component r 2 in Roughness Penalty Approach The goal of this study is to ensure the smoothness of the estimated curve i n non-parametric regression. In simple normal-normal hierarchical model simulation studies suggest that the proposed conservative prior can achieve this goal. The conservative prior suggested for normalnormal hierarchical model can be extended for the smoothing problem. In the case of smoothing we need to deal with the multivariate problem instead of univariate one, and we have to make adjustment for the dimension of the covariance matrix T V in defining the conservative prior 2 for r . In smoothing problem we have 2 fj\0.o- ~ N (AO, a I ), 2 2 n . \ . The conservative prior for r 0\T 2 2 ~ N (0,T V). 2 p can then be defined as 7T(TV) <* ^r. \a I + T V\ p •'. (3.8) 2 2 p The denominator p in the exponent has been taken to adjust for the dimension of the parameter vector 0 to make the prior comparable to the same one in the univariate case (simple normalnormal hierarchical model). So, the full Bayesian model for the smoothing problem (3.4) can be written as y\0,a ~ N (AO, a I ) 2 2 n O\T : .- 2 : ~ •K(T \O ) O C 2 2 N(0,r V) 2 .-• '. 1 /, \a I + T V\^ 2 2 p 1 7T((T ) O C 2 i e ° (3.9) where n(a ) is the prior for a , which has been taken to be a unit information inverse gamma 2 2 distribution centered at one. 44 Posterior Distributions for the Model Parameters Under Conservative 3.2.2 Prior . For the posterior inference of the model parameters we need the conditional or marginal posterior distributions of the model parameters 9, a and r . The joint posterior distribution of 9, 2 2 o and T is obtained as 2 •. • 2 7r(0,a ,T \y) 2 f(y\0,a )7i(0\T )n(a )T:(T \a ) oc 2 2 (v ) 2 2 ' 2 iV 2 2 \a I + .(r )^ . 2 2 +1 ~2 ) 2 T 2 V \ p - - • 'V i - i e p — a% — 3.10 where, c(<r ) is the normalizing constant for TX(T \O ) which is needed to obtain the joint prior 2 as 7T(T , 2 2 2 a ) = 7r(T |a )7r(a ) and hence to make the joint posterior (3.10) a proper probability 2 2 2 2 distribution. Propriety of the joint posterior distribution of the model parameters is essential for valid inferences about them. The normalizing constant c(cr ) is obtained as 2 oo' c(a ) 2 = f dr , J. \a I + r V\-t1 >1=1 2 p o oo / By making the substitution u = 2 2 1 1 - a+l V we have T = a u 2 2 dr 2 , (3.11) dr — a du. Substituting these facts 2 2 in the right hand side of the expression (3.11) we have OO : :. •• { ( > a2 •1 / _ 22\ < * a+l \i .+ uv.\* . P f 1 / ... , a+ l (^ ) { \I + uV\ — p 1 45 a U - ••• Now, from equation(3.10), the conditional posterior distribution of 0, given a , r and the data, 2 2 can be derived as n(0\a ,r ,y) 2 oc r T ^ - ^ 2 = where, A = as ^ ( r ^ v - o -^{(y- y(y- <>)+M' ~ t)} A() A v i e and Q = (y~A0)'(y-A0) + X0'V' 0. The quadratic form Q can be decomposed 1 • :- ' Q - y'[I - H(X))y + [0 - G(X)y}'(A'A 4- AV'~){0 - G(X)y] = W(X) + U(0) ] where, H{X) = A(A'A +XV' )' 1 A' and G{X) = {A'A + XV.- )- 1 1 A'. Therefore, the conditional 1 posterior distribution of 0, given cr , T and the data, is 2 • ; *(B\y) 2 oc c-^-Oixm^A^^-Oixm (3.12) From equation (3.12) it can easily be verified that the posterior distribution of 9 is normal with mean vector 0 = (A'A + A V ) ^ ' ? / and covariance matrix (A'A + A F ) c r , i.e., - 1 - 1 _ 1 0\a ,r ,y 2 ^ N[G(X)y,(A'A + XV- )-^ ] 2 1 _ 1 2 . 2 The conditional posterior distribution of a , given 0, r and the data, is 2 n(a \X,y) 2 oc ^ - ^ m - A B ) ( J ^ e ^ i ' - ^ (° )WJ.. \o I +T V\v 2 e 2 2 2 V ~ ) ^ " ' H c r ' \a I + r V\^-<^ ) 2 2 2 p - ^ CW) ' \ . (3.13). . .• Finally, the conditional posterior distribution of r , given 0, a and the data, is 2 2 : ir(r \X,y) 2 i&°' ~' — ^ • \a I + T V\T- . oc -j^e" ( ) ? r V 2 0 2 (3.14) 2 p From equations (3.12), (3.13) and (3.14) it is observed that the conditional posterior distribution of 0 has closed form but those of a and r do not have closed form. So, we can use Gibbs 2 2 sampler for posterior simulation of 0, but for a and r we need to use the Metropolis-Hastings 2 algorithm to draw posterior simulations. 46 2 3.3 Performance of Conservative Prior in Smoothing Problem: Simulation Studies As in the case of simple normal-normal hierarchical model simulation studies have been performed in smoothing problem also to demonstrate the performance of conservative prior in this area. For simulation studies we have considered the following situations: (i) simulation studies with large sample size (n = 111) (ii) simulation studies with small sample size (n — 23) For each of the above two cases we have considered two different values of p (the dimension of mean vector 9) as: (a) p = 10 .'• • (b) =5 P For n = 111 and p = 10 the design variable has been taken to be s - ^ ; i = 0,1,.'.., 110 and the equidistant knots have been taken at x = 0, x = 1, and p = 5 the design variable x has been defined to be x = 11. For n = 111 ^ 1 8 3 3 t ; i = 0 , 1 , . . . , 111 and the equidistant knots have been taken at x = 1.833333 x k; k = 0,1,... ,6. For each of the above 4 cases data have been generated as: (i) we have fixed the values of the unknown parameters <r and T (a =' 1, T =1) 2 (ii) given the value of r , we have generated l00 different 2 6 2 2 vectors as 2 6 ~ N(0,T V) 2 (iii) for each of the generated 0 vectors and given a value we have generated a data set (y 1 vector) as y ~ N(A0,a' l ). . '• 2 n From the generated data the model parameters 9, a and r 2 2 have been estimated by using the Bayesian technique that makes the use of the proposed conservative prior with a choice of a = 5. For parameter estimates we have run M C M G simulation techniques. We have used Gibbs sampler for posterior simulation of 9 and random-walk Metropolis-Hastings algorithm for a and 2 T . . 2 47 3.3.1 Monitoring the Convergence of M C M C Simulation For monitoring the convergence and mixing of the M C M C chains for both a and r we have 2 2 plotted four different chains for each of them. For updating a and r we have used exponential 2 2 scale. Figures 3.1 and 3.2 display four different M C M C chains for each of a and r , respectively, 2 2 obtained by using the same data set. For different chains of each parameter widely dispersed initial values have been used. From Figures 3.1 and 3.2 we observe that all the chins of each Figure 3.1: Plots of M C M C chains for T with.different starting points 2 0 500 1000 1500 2000 2500 0 • /. • , iteration (c) starting value=1 500 1000 1500 2000 ,. 2500 iteration . (d)startingvalue=0.002 of the parameters exhibit good mixing since none of the chains of any of the parameters kept moving slowly around any particular region of the target distribution for many iterations. It is also observed that though different chains for each of the parameters starts at different initial 48 Figure 3.2: Plots of M C M C chains for a with different starting points 2 500 1000 ~\ 1500 . iteration. (a) starting value=3.002 — r 1 2000 2500 iteration (b) starting value=2.002 . iteration • (c) starting value= 1.002 values all the chains for both a a n d r 2 iteration (d) starting value=0.002 2 have stabilized to very close to the respective true values after very few iterations. In all the cases the acceptance rates are found to be between 45% to 50%. 3.4 Output Analysis Since the main goal of the proposed conservative prior is to control the undersmoothing, i.e., to control the over estimation of the random effects variance component f , in this study 2 49 we have estimated the same true r value by using 100 different data sets, which have been 2 simulated from the distribution with the same true a and r values (r = l,cr = 1). Details 2 2 2 2 of data simulations had been elaborated earlier in section 3.3. For estimation of parameters by Bayesian approach we have considered the proposed conservative prior. We have adopted M C M C simulation to generate draws from the respective conditional posterior distributions. By monitoring the output for several independent chain for each of the model parameters we have found that for both T and cr the MCMC chains converge after very few iterations. So, 2 2 for posterior inferences about the model parameters each time we have run the M C M C chain for each of the parameters for 1500 iterations. Finally, we have thrown away first 500 iterations as burn-in of the chain and used the remaining 1000 iterations for inferential purposes. To estimate a and r 2 2 we have used both the posterior mean, and the posterior mode.. For 6 posterior mean and mode are the same because the posterior distribution of 0,given <r , r and 2 2 the data, is multivariate normal. Figures 3.3 and 3.4 display the histograms of 100 estimates of the same r value (true r = 1) obtained by using 100 different data sets for large sample size 2 2 (n = 111) and for.10 interior knots (p =10). Estimates in Figure 3.3 are based on posterior mean and those in Figure 3.4 are based on posterior mode. From Figures 3.3 and 3.4 we observe that for both cases hot too many estimates of r exceed the 2 true T value. However, in the case of posterior mean the number of r estimates that exceed 2 2 the true r value is slightly more than in the case of posterior mode (20 in case of mean versus 2 9 in case of mode). This is reasonable because mode is insensitive to any possible outliers. However, deciding about the value of the hyperparameter a on the basis of posterior mode may not be sufficient for guarding against undersmoothing, especially when T is relatively smaller 2 than a . The disadvantages of using the posterior mode in the context of undersmoothing have 2 already been discussed in section 1.6 of chapter 1. Regarding a , it is observed that estimates are very close to the true value and the estimates do 2 not differ too much whether we use the posterior mean or mode. This is because the posterior distribution of a is not so positively skewed as that of T . This picture will be more clear if we 2 2 look at the Figures 3.1 and 3.2. From each plot of Figure 3.1 it is observed that there are some big upward jumps in the posterior simulations of r indicating highly positive skewness of the 2 posterior distribution of r , whereas such big upward jumps are absent in each plot of Figure 3.2 2 of posterior simulations of a . The histograms of the estimates of a based on posterior mean 2 2 and mode are displayed in Figures 3.5 and 3.6, respectively. 50 Figure 3.3: Histogram of estimated T based on posterior mean 2 s s§ - I 0.0 — T ~ 0.5 ; 1—— 1—; 1.0 1.5 Figure 3.4: Histogram of estimated r I . 0.0 0.5 ~ 1.0 A t 2 51 2 1 1— 2.0 2.5 based on posterior mode 1— 1.5 —i— 2.0 —1 2.5 Figure .3.5: Histogram of estimated a based on posterior mean 2 S3 I 0.8 1.0 1.2 1.4 Figure 3.6: Histogram of estimated a based on posterior mode 2 S3 H From Figures 3.5 and 3.6 it is observed that the spreads of the both histograms are almost same, but the histogram of the estimates of a based on posterior mean is centered around 2 1.05, whereas that based on posterior mode is centered around 0.95. Finally, since the main purpose of smoothing problem is to estimate the curve f(x) itself, which in our study is parameterized by 6 vector, we are interested in drawing inferences about 6 vector also. By introducing the proposed conservative prior we attempt to estimate the 6 vector so that the estimates of the components of 0 vector are not more variable than the true values of the components of 6 vector. In our simulation study we have generated 100 different 0 vectors 52 and for each 9 vector we have generated a data set. Finally, we have used each generated data set to estimate the respective 9. Now, for our estimation problem we have parameterized the true curve f(x) with a lower dimensional vector 9, but in practice it is desirable to estimate the entire curve. The estimate of the entire curve can be obtained by using the method of interpolating natural cubic spline given the curve estimate at the selected knot points. This is obtained as f(x) = AB. Figure 3.7 displays the plots of the true data , the true curve as given by f(x) = A9 and the entire estimated curve for four randomly selected 9 vectors out of 100 of them.' - • From the plots (a), (b) and (c) of Figure 3.7 we observe that the estimated curves are quite smoothed and they track the respective underlying true mean curves very closely. On the other hand, if we look at the panel (d) of Figure 3.7 we observe that true curve is wiggly but estimated curve is smoothed. In this though the estimated curve could not track the true underlying mean curve very closely in some data ranges still it has captured the key features in the data set it represents. Actually, if there are too rapid fluctuation in the data then the true curve may become wiggly but the goal of smoothing problem is to estimate the data in such a way that the estimated curve is a smoothed one and can pick the key pattern in the data set. All the plots of Figure 3.7, especially the last panel, confirms the ability of conservative prior to guard against undersmoothing in estimating the true curve in the smoothing problem. Again, as it is already known, in hierarchical model and in smoothing problem the key concern on the part of an estimation method is its ability to detect the situation of no heterogeneity. So, to reflect the ability of the conservative prior to detect the situation of no heterogeneity we have plotted the data values, the true underlying mean curve and the estimated curve on the same plane in Figure 3.8 when T = 0.01. The similar plots for the uniform shrinkage prior and. 2 for the Jeffreys' prior for the same data sets used in the case of conservative prior to construct the Figure 3.8 are displayed in Figures 3.9 and 3.10, respectively. 53 Figure 3.7: Plots of data values, true mean curves and the estimated curves f(x) for the selected data sets 54 Figure 3.8: Plots of data values, true mean curves and the estimated curves f(x) for the selected data sets when r = 0.01 2 x 3.8(b) 3.8(a) 0 0 0 ° n % o 0 0 ' 0 0 °o ° 0 o <p „ °o % 0 0 0 0 0 0 o _0° 0 0 o° 0 o 00 0 0 o o <- — o o True curve I 0 — . — 0 0 o n a 0 0 0 0 "oo Estimated curve " o 0,0 0 — ^-0- -2 x o "° o o o oo ;. oo 0 0 True curve Estimated curve 0 o 0 6 0 :—* 10 2 10 X X 3.8(c) 3.8(d) 55 0 Figure 3.9: Plots of data values, true mean curves and the estimated curve f(x) obtained by using uniform shrinkage prior when r = 0.01 2 00 ^ 0 0 1 0 -TCP •><» .A oo . «v a —o ° °on on ~ o„ o ° o 0 0 0 0 00 0 0 — o» °° ° ° - - • True curve Estimated curve Truecurve Estimated curve 0 0 o n T^^ooo^-o--wtr- o o 0 0 0 <b „0 ° o nO ° 0 • « 0 o "T - 10 x X 3.9(a) 3.9(b) 0 o o 0 ° 0 ° 0 0 0 00 0 o. 0 0 o°° ° « 0 0 0 0 o0 0° ° ° o „o o 0 — Truecurve — : Estimated curve r——r I . 2 o 0 ( " 0 0 0 <- 0 Oo 0 %0- o 0 0 °o o 'a, 0 0 0 0 °o ---oA 0 °. ° ° -°-°rn, 0 0 0 0 ° 0 0 . „' ^ „ o o o o °» *— •"*«» . " 7 M " % 0 0 J X °° o o * <P 0 ^••r»-"^--i^ o V o ° o 0 „ C R 0 o. 00 c oo o ~r10 4 x X 3.9(c) 3.9(d) 56 Figure 3.10: Plots of data values, true mean curves and the estimated curves f(x) obtained by using Jeffreys'prior when r = 0.01 2 True curve Estimated curve 10 x X 3.10(a) 3.10(b) True curve . Estimated curve True curve Estimated curve — -• I • .6 I 10 10 X 3.10(d) 3.10(c) The plots of the Figure 3.8 show that all the estimated curves track the respective true underlying curves very closely, and none of them is more wiggly than the corresponding true curve. • ' Also, plots of the original data values show that there is almost no global variation, as it should be for a true r value close to 0, in the data set and the estimated curves have picked up this 2 feature of the data sets very well. Now, by comparing the plots of Figures 3.9 and 3.10 we see that the conservative prior and the uniform shrinkage prior produce very similar results for 3 out of the 4 selected data sets (panels (a), (b) and (d) of both the Figures). But for the data set corresponding to the panel (c) the estimated curve in the case of uniform shrinkage prior is not so close to the true curve as, it is in the case of conservative prior (plot 3.8(c) versus plot 3.9(c)). For this particular data set the estimated curve seems to be more wiggly than the true curve in the case of uniform shrinkage prior, whereas the conservative prior has produced a smoothed estimated curve which also tracks the true curve very closely. The picture is worse for 3 out of 4 data sets when Jeffreys' prior is used. Plots 3.10(b), 3.10(c) and 3.10(d) show that the estimated curves are far more wiggly than the corresponding true curves. So, we can 57 Figure 3.11: Plots of M C M C chains for r with different starting points when p = 5 2 "V CM iteration (a)slartingvalue=3.002 iteration (b)startingvalue=2.002 •V iteration • (c)stailing value=1.002 CM iteration (d)startingvalue=0.002 conclude that if undersmoothing can not be reasonably controlled we may get too wiggly and less accurate estimated curves. To check whether the use of conservative prior produces similar results in different situations we have also analyzed the simulated data for large sample size by specifying fewer number of knots (5 instead of 10). Figure 3.11 displays the convergence plots of M C M C chains for r with 2 different starting values. Different plots of Figure 3.11 reveal that though different chains started with different initial values all the chains have stabilized at the same level after very few iterations. Also mixing of M C M C chains looks very good. Similar conclusions can be drawn about the convergence and mixing of the chains for o (Figure 3.12). 2 58 Figure 3.12: Plots of M C M C chains for a with different starting points when p=5 2 Regarding the performance of conservative prior to control the undersmoothing in estimating 6 vector in case of 5 knots we can conclude that the proposed prior can perform equally well as it does in case of 10 knots. Figures 3.13 and 3.14 confirm this fact. Figure 3.13: Histogram of estimated r based on posterior mean while using fewer knots (p=5) 2 Figure 3.14: Histogram qf estimated r based on posterior mode while using fewer knots (p=5) 2 If we look at the histogram o f f 2 based on the posterior mean (Figure 3.13) and that based on posterior mode (Figure 3.14) we observe that in both cases f 2 is centered around 0.5, which is very much similar to the case when we used 10 knots instead of 5 knots. About the estimates of a similar conclusions can be drawn as in the case of using 10 knots except that for both 2 posterior mean and posterior mode the histograms of a , are centered around 1.05 instead of 2 60 1.05 in case of posterior mean and 0.95 in case of posterior mode when we used 10 knots (Figures 3.15 and 3.16). Figure 3.15: Histogram of estimated a based on posterior mean when p=5 2 1.0 1.2 Figure 3.16: Histogram of estimated a based on posterior mode when p=5 2 8 H ~ i 0.8 ;—i r :—i 1.0 ' 1 . 2 1.4 ; i— 1.6 So, in general we can conclude that the proposed conservative prior performs quite well, irrespective of the number of knots used, in controlling the undersmoothing while estimating the true regression curve by Bayesian roughness penalty approach. Finally, to see how the proposed conservative prior performs in estimating the true data in the case of using fewer knots we have plotted the true data values and the estimated curves on the same plane for the same data sets we have used in the case of 10 knots tb construct the Figure 3.7 in Figure 3.17 61 Figure 3.17: Plots of true data, true curves and the estimated curve f(x) when p=5 The plots in Figure 3.17 reveal that the estimated curves are smoother than those in Figure 3.7. Especially, if we look at the plots (a) and (d) of Figure 3.17 we observe that the estimated curves can not track the true curves as closely as they did in Figure 3.7. So, it can concluded that using too few knots may produce too smooth curves which can not pick up the key data feature as well as the estimated curves obtained by using sufficiently large number of knots can do. . 3.4.1 Performance Analysis of Conservative P r i o r i n S m o o t h i n g i n case of Small Sample size (n = 23) As mentioned earlier in Section 3.3, to give strong footing to our conclusion about the performance of the proposed conservative prior in smoothing problem, we have performed simulation studies for small sample size also. In small sample size case we have considered a sample size of n = 23. For n = 23, the two different cases are: (i) p=10, i.e., we have considered 10 interior knots (ii) p=5, i.e., we have considered 5 interior knots For case (i) the design values have been taken to be x = |; i = 0,1,..., 22 and the equidistant knots are taken at x = 0,x = l,...,x L 8 3 3 3 6 3 3 8 *; = 11. For case (ii) values of x are taken to be i = i = 0,1,...,21 and x = 11, and the knots are taken at x = 0 x 1.833333,a: = 1 x 1.833333,x = 2 x 1 . 8 3 3 3 3 x = 10 x 1.833333 ~ 11. For n = 23 and p = 10 the convergence plots of 4 independent MCMC chains for r , when true 2 T value is 1^ generated from the same data set with 4 different starting values are displayed in 2 Figure 3.18. '' \ Figure 3.18 shows that all the chains for T converge to the same level after few iterations and 2 all the chains exhibit good mixing. Note that the jump size in this case is 0.80 for T , and for 2 this jump size the acceptance rates have been found to be between 48% to 50%. For cr the convergence plots are displayed in Figure 3.19. Similar conclusions can be drawn 2 about the convergence of the chains for a as was drawn about the convergence of the chains 2 for T . The jump size for updating a is 0.60, and the acceptance rates have been found to be 2 between 48% to 50%. 2 : ... 63 Figure 3.18: Plots of MCMG chains for r with different starting points when n=23 and p=10 2 500 1000 , . 1500 . 2000 2500 : ' . • iteration (a) s t a r t i n g v a l u e = 3 . 0 0 2 iteration (b) starting v a l u o = 2 0 0 2 iteration (c) starting value=1.002 iteration (d)starting value=0.002 Figure 3.19: Plots of MCMC chains for a with different starting points when n=23 and p=10 1 T 1500 iteration (a) starting v a l u e = 3 . 0 0 2 iteration . (b) starting value=2.002 iteration (c) starting v a l u e = 1 . 0 0 2 iteration (d) s t a r t i n g v a l u e = 0 . 0 0 2 64 1 .. 2000 ..—;—r - . 2500' So, finally for drawing posterior inference about T 2 and a we. have run each chain of both the 2 parameters for 1500 iterations of whom first 500 iterations have been discarded as burn-in and the remaining 1000 have been used for inference purposes. Figures 3.20 and 3.21 display the histograms for the estimates of r based on posterior mean and mode respectively, when true 2 r 2 value is 1, obtained by using 100 different simulated data sets, Figure 3.20: Histogram of estimated r based on posterior mean for n=23 and p=10 2 0.0 1.0 0.5 1.5 2.0 v 2.5 Figure 3.21: Histogram of estimated r based on posterior niode for n=23 and p=10 2 •a H —i 0.0 . : r— —'—'—i—:—• 1 0.5 1.0 . 1—: 1.5 —i 2.0 : 1— 2.5 From Figures 3.20 and 3.21 we observe the similar picture as we did in case of large sample size. Again, for n = 23 and p = 5 the corresponding histograms are displayed in Figures 3.22 and 3.23. 65 Figure 3.22: Histogram of estimated T — 0.0 • 1 0.5 2 c 1 0 based on posterior mean for n=23 and p=5 — — - l 1.5 1—; 2.0 : 1— 2.5 Figure 3.23: Histogram of estimated r based on posterior mode for n=23 and p=5 2 s.H ST From Figures 3.22 and 3.23 we see that in this case also the conservative priors can perform very well in controlling undersmoothing while estimating the true curve in smoothing problem. So, in general we can conclude that the proposed conservative prior performs reasonably well in guarding against undersmoothing in non-parametric regression curve estimation problem. Finally, to see how the proposed conservative prior performs in estimating the true underlying curve and in capturing the key data feature in the case of small sample size we have plotted the true curve, the data values and the estimated curves on the same plane for the 4 randomly selected data set, in Figures 3.24 and 3.25 for p = 10 and p = 5, respectively. 66 Figure 3.24: Plots of the true curves, data values and the estimated curves f(x) when n=23 and p=10 • • Figure 3.25: Plots of the true curves, data values and the estimated curves /(x) when n=23 and p=5 ..;',>..•.'••' r Plots in Figure 3.24 reveal that for the selected data sets the Bayesian roughness penalty approach with the proposed conservative prior seems to perform quite well in estimating the true regression curve. By having a comparative look at the Figures 3.24 and 3.25 we observe the same picture as we did in the case of large sample size. So, in general, we can conclude that irrespective of sample size using too few knots may produce too smooth estimated curves which may not be able to capture the underlying feature of the respective data sets very well as they can do in the case, of using sufficient number of knots. Actually, it is always reasonable to use sufficiently large number of knots in smoothing problem. 3.5 Comparison of Results Obtained by Using the Proposed Conservative Prior with those Obtained by Uniform Shrinkage Prior and Jeffreys' Prior As we mentioned earlier that this study has been aimed at suggesting a conservative prior that can guard against undersmoothing since in many practical situations undersmoothing is considered to be a more serious error than oversmoothing. So, in this section we have studied the comparative performance of conservative prior and that of uniform shrinkage and Jeffreys' priors with respect to the issue of guarding against undersmoothing. Another important issue.in smoothing is that in case of using conservative prior we force the estimates of random effects 9 to be close to each other. By forcing the estimates to be closer we can reduce the variability among the estimated 9 "values but this fact may increase the bias in the estimated curve. So, one obvious question is that how much we gain (or lose) in terms of mean squared error (MSE) by using the conservative prior in comparison with the two competitive priors- the uniform shrinkage prior and the Jeffreys' prior. To check this fact we have compared the estimated MSE of 9 for conservative prior with those for the uniform shrinkage and the Jeffreys' priors using simulated data. For smoothing problem, which is a multivariate normal-normal hierarchical model, the uniform shrinkage prior is the special case of conservative prior with a = 1, but the derivation of Jeffreys' prior is not so straightforward. However, Barry (1995) derived the Jeffreys' prior for smoothing problem. 69 3.5.1 Jeffreys' P r i o r for Smoothing P r o b l e m Barry (1995) derived Jeffreys' prior jointly for a and r in smoothing problem. For defining 2 2 the Jeffreys' prior he considered the smoothing model to be y\9 ~ N{A6,cj I ) : 2 n 0\X ~ where, A = 7T(a , A) oc 2 N(0,XV) (3.15) He found the Jeffreys' prior for a and r jointly in smoothing problem to be 2 4^ \/R(XJ, 2 where m = r{V), R(X) = (n + m -p)[tr{H (X)} 2 + m -p] - [tr{H(X)} + m - p] and H(X) = A(A'A + A V " ) ^ . For smoothing model (3.4), r(V) = p and hence 2 - 1 - 1 R(X)=n{tr{H (X)}}-[tr{H(X)}J . 2 2 Now, it can be easily shown that the conditional posterior distribution of 6, given a , r and 2 the data, is normal with mean vector 0 = G(X)y and covariance matrix (A'A + whore G(X) = (A'A + XV ) l 2 XV~ )~ cr , l l 2 A'. x The conditional posterior density of a given A is obtained as 2 , • . ix{o \X,y) 2 l "' 1 - X '' (n oc ^ ^ , e ^ ~ Inv - yanuna 2 + W\ ) where, W = y'[I - H{X)]y. Finally, the marginal posterior density for A is 7r(A|y) oc j {A'A + . XV^^W'i The conditional posterior densities for 0 and a have closed form. So, we can use Gibbs sampler 2 to draw posterior simulations for 0 and a . 2 But, we have to use random walk Metropolis- Hastings algorithm to draw posterior simulations for A. Once we have the estimate of A we can get the estimate of r as T — ^ . 2 3.5.2 2 9 C o m p a r i s o n of Results: Simulation Studies Simulation studies have been performed to compare the results obtained by using the competitive priors in smoothing problem with respect to guarding against undersmoothing. To see whether the proposed conservative prior can perform reasonably well in case of both large and 70 Figure 3.26: Histograms of r for different priors when r = 5 and n = 111 2 2 §T "i—:—r-—i 0.0 2.5 1 5.0 7.5 1—;—r 10.0 12.5 r — i 15.0 r 17.5 20.0 (a) Conservative prior o 8 ~r 0.0 • 1 1 1 1 2.5 5.0 7.5 10.0 r~—i 12.5 15.0 r r 17.5 20.0 A (b) Uniform shrinkage prior n—~i 0.0 1 2.5 5.0 1—~i 7.5 10.0 1 12.5 r —i - 15.0 17.5 20.0 (c) Jeffry's prior small T 2 values, we have conducted our simulation studies for r 2 = 5 and T 2 = 0.01 while keeping the true <j value fixed at 1. For the purpose of comparison we have used a = 5 for 2 the conservative prior. Histograms of r for conservative prior, uniform shrinkage prior and 2 Jeffreys' prior based on 100 different data sets are plotted in Figures 3.26 and 3.27, respectively for T 2 = 5 and r 2 = 0.01 when n = 111. The same for n = 23 are plotted ih Figures 3.28 and 3.29. 7.1 Figure 3.27: Histograms of r for different priors when r = 0.01 and n = 111 2 2 o c . Q 24 3 CT 0) III 0.00 I I I I I I I I I I I 0.04 0.08 0.12 0.16 0.20 0.24 t I II 0.28 • 2 (a) Conservative prior CO >> f u. S-l 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.00 0.04 0.08 0.12 0.16 0.20 0.24 0.28 A (b) Uniform shrinkage prior > o c 0 3 CT O III 0.00 i i i i i i i i n i i i r 0.04 0.08 0.12 0.16 0.20 0.24 - A T 2 (c) Jeffry's prior 72 0.28 Figure 3.28: Histograms of T for different priors when T = 5 and n = 23 2 o 2 2 _r o ~i 0.0 Dux. r—i 2.5 n—i 1 .5,0 7.5 1 12.5 1 1 17.5 1-—r 22.5 (a) Conservative prior £ 8. 2 o !: 1 r—i—TT 1 0.0 2.5 5.0 7.5 r—i 12.5 1 1 17.5 . 1— 22.5 (b) Uniform shrinkage prior O to I. I 0.0 2.5 1—-i 5.0 1 7.5 1 12.5 r—i 17.5 1 n—r 22.5 (c) Jeffry's prior Histograms in Figure 3.26 reveal that the conservative prior can guard against undersmoothing most effectively. Jeffreys' prior produces the worst result in this regard. A similar picture is observed in the case of no heterogeneity (T = 0.01) (Figure 3.27). It is also observed that in 2 the case of low T undersmoothing is very severe for the uniform shrinkage and the Jeffreys' 2 priors with more than 50 and 80 out of 100 estimates of r larger than the true r value for the 2 2 uniform shrinkage prior and the Jeffreys' prior, respectively. For small sample size the picture is similar as it is observed in the case of large sample size (Figures 3.28 and 3.29). Now, to see the performance of the three competitive priors in terms of MSE of 9 we have also used the simulated data to compute the differences between MSEs obtained by conservative prior and those obtained by uniform shrinkage prior and Jeffreys' prior for 100 data sets. Finally, we have counted the number of times that MSEs of 9 in the case of Jeffreys' and uniform 73 Figure 3.29: Histograms of r for different priors when r 2 "i i i i i i = 0.01 and n = 23 2 r 0.00 0.09 020 0.30 0.40 0.50 0.60 0.70 0.80 A T 2 (a) Conservative prior 1 — r — i — r — r — i — r 0.00 0.09 0.20 0.30 0.40 0.50 0.60 0.70 0.80 A T 2 (b) Uniform shrinkage prior o c. to 3 I I i i i i r 0.00 0.09 0.20 0.30 0.40 0.50 0.60 0.70 0.80 A T (c) Jeffrys prior; 2 74 shrinkage priors exceed the MSEs of 0 obtained by conservative prior. It can be mentioned that in calculating the MSE of 0 we have averaged over the knots, not over the repeated data set, i.e., the MSE of 0is obtained as where p is the number of knbts used to estimate the regression curve. Tables 3.1, 3.2 and 3.3 represent those counts for T = 5, r = 1 and r = 0.01, respectively. 2 2 2 Table 3.1: Table summarizing the number of times that MSE(0) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior out of 100 occasions when r = 5 2 Sample Size 111 23 #[mse(0) >mse(0) ] 34 . -. . 19 usp #[mse(f9)j /y >mse(0) ] 43 33 con e con Table 3.2: Table summarizing the number.of times that MSE(0) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior out of 100 occasions when T = 1 2 Sample Size 111 23 #[mse(0) >mse(0) ] . 37. : 32 usp #[mse(0) con >mse((9) ] 52 • 48 Je// con Table 3.3: Table summarizing the number of times that MSE((9)under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior out of 100 occasions when r = 0.01 2 Sample Size 111 23 #[mse(6) >mse(0) ] 47 97 usp #[mse(6)j f >rnse(0) ] 63 96 cora ef con From Table 3.1 we observe that for r = 5 uniform shrinkage prior and Jeffreys' priors perform 2 better than conservative prior in terms of MSE of 0 irrespective of sample size. For both large and small sample sizes the count that MSEs of 0 obtained by uniform shrinkage prior and Jeffreys' priors exceed those obtained by conservative prior are less than 50 out of 100 occasions. But the picture is reverse for smaller r (r = 0.01). In this case of no heterogeneity 2 75 2 the performance of the conservative prior is far better than that of Jeffreys' prior, irrespective of sample size. For large sample size uniform shrinkage prior and conservative prior perform similarly, but for small sample size the conservative prior performs much better than uniform shrinkage prior. When r 2 is of equal magnitude of a , the error Variance, uniform shrinkage 2 prior still produces smaller MSE for more than 50 occasions out of 100 occasions, but the chance that the Jeffreys' prior produces higher MSE than the conservative prior are 52 and and 48, respectively for large and small sample size out of 100 occasions. Again, counting the number of times that MSEs of 9 for uniform shrinkage prior and Jeffreys' prior exceed those for conservative prior can not give any idea about the magnitudes by which the MSEs off? for uniform shrinkage prior and Jeffreys' priors exceed those for conservative prior and vice-versa. So, to reflect this picture we have constructed histograms of the differences of MSEs of 9 for each pair of the competitive priors. Figures 3.30, 3.31, 3.32, 3.33, 3.34 and 3.35 display these histograms. Figure 3.30: Histograms of differences of MSE(f?) for pairs of different priors when T = 5 and n = lll " • 2 n—•—i—;—i -6.03 -0.02 -0.01 —i 0.00 1—•• 0.01 1— 0.02 n . 0.03 mse(9(conservative)) - mse(9(unilormshrinkage)) -0.05 0.00 0.05 mse(6(conservative)) - mse(8(Jeffry)) 76 Figure 3.31: Histograms of differences of MSE((9) for pairs of different priors when T = 5 and n = 23 ' \ ' .. ' • • "• 2 -0.5 T T T 0.0 0.5 1.0 1.5 2.0 2.5 mse(8(conservative)) - mse(8(uniformshrinkage)) -1.0 -0.5 0.0 0.5 1.0 . 1.5 ' 2.0 mse(e(conservative)) - mse(0(Jeffry)) Figure 3.32: Histograms of differences of MSE(#) for pairs of different priors when r = 1 and n = 111 '. •• .' 2 & o I8 ~i—: 1 0.00 0.05 i — : — 0.10 mse(6(conservative)) - mse(e(uniformshrinkage)) (a) . reqtjency s - u. o _ o _ (VI o - —I— -0.05 0.00 H —I— 0.05 0.10 mse(8(conservative)) - mse(6(Jeffry)) . (b) 77 .0.15 Figure 3.33: Histograms of differences of MSE(0) for pairs of different priors when T n = 23 ~1— -0.2 0.0 2 = 1 and I 0.2 0.4 mse(e(conservative)) - mse(e(uniformshrinkage)) (a) mse(e(conservative))-mse(e(Jeffry)) •(b) Figure 3.34: Histograms of differences of MSE(f9) for pairs of different priors when r and n — 111 —i— -0.10 -0.05 0.00 0.05 mse(8(conservative)) - mse(8(uniformshrinkage)) -0.10 -0.05 mse(8(conservative)) - mse(8(Jeffry)) 78 2 = 0.01 Figure 3.35: Histograms of differences of MSE(0) for pairs of different priors when T = 0.01 and n = 23 ' reqi ncy 2 u. a in _ o _ m - o - III T -0.8 . -0.6 -0.4 -0.2 - 0.0 0.2 mse(6(conservative)) - mse(8(uniformshrinkage)) 1 1 1 -0.8 -0.6 -0.4 -0.2 0.0 0.2 mse(e(conservative)) - m s e ( 9 ( J e f f r y ) ) Figures 3.30 and 3.31 display the histograms of MSE differences for bigger r value (r = ' = 5) for 2 2 large and small sample sizes, respectively. For large sample size the magnitudes of difference are the same both ways around zero, which indicates that though the conservative prior produces large MSE for 0 in more occasions the magnitudes by which the MSEs of 0 for conservative prior exceed those for uniform shrinkage prior and Jeffreys' prior are the same as those in the case of other way around (Figure 3.30). For small sample size the magnitudes by which the MSEs of 0 for conservative prior exceed those for the other two priors are bigger in very few cases than those in the reverse case (Figure 3.31). Histograms for MSE differences in the case of T = 1 reveal the similar picture as they do in the case of larger T value. Finally, by looking 2 2 at the histograms of Figures 3.34 and 3.35 we observe that in the case of no heterogeneity the conservative prior beats the other two priors with big margin in all respect. 3.5.3 Comparison of Results for a = 3 In the previous section, for conservative prior we considered the value.of the hyper parameter a to be equal to 5. But for normal-normal hierarchical model, simulation studies showed that any value from 3 to 5 can guard against undersmoothing well. Actually, the bigger the value of a the smoother the estimated function will be. So, using a = 5 may oversmooth too much 79 and that is why the performance of conservative prior, in terms of MSE of 9, might be worse than those of the other two priors for large r value. Lowering the value of a to 3 may improve 2 the performance of conservative prior in terms of MSE of 0, while keeping undersmoothing in control at the same time. To check this fact we have estimated the model parameters by using conservative prior with a = 3 and compared the results as we did in the case oi a — 5 in the previous section. Figures 3.36 and 3.37 display the histograms of T ample size when T 2 = 5 and r 2 for the three competitive priors for large 2 = 0.01, respectively. The same for small sample size are displayed in Figures 3.38 and 3.39. Figure 3.36: Histograms of r for different priors when r = 5 and n = 111 2 2 1.8. & o a. *". T 0.0 "THT. 1 2.5 1 1 5.0 7.5 T T 1 1 1 T 1 1 22.5 r (a) Conservative prior 8 8H 2. w ~i 0.0 r—i——l 1 2.5 5.0 7.5 • —i 1— 12.5 17.5 I (b) Uniform shrinkage prior snn 0.0 2.5 5.0 7.5 l 1 T , 1 2 (c) Jeffry's prior 80 1 1 1 r Figure 3.37: Histograms of r for different priors when T = 2 2 5 and n = o c 0) a 0) Jk n—i—i—r 0.0 12.5 2.5 5.0 7.5 . . 17.5 22.5 A (a) Conservative prior >> 3^ 0 c a 0) i—rn—r — u . 0.0 2.5 5.0 7.5 12.5 17.5 A 22.5 (b) Uniform slhri shrinkage prior o • c | o . o u N 0 IL O j ' "i 0.0 i i i 2.5 5.0 7.5 i i 12.5 m i 17.5 A .1 (c) Jeffry's prior 81 r 22.5 ' • r 23 Figure 3.38: Histograms of r for different priors when r = 0.01 and n = 111 2 2 o 10 > t * 0) J £ o 0 u. CM - l~l-nn III I I I I I 0.00 I I III II I ! 0.04 0.08 0.12 0.16 0.20 0.24 ' 0.28 A (a) Conservative prior > <° 0 ffl P CM 11 n 0.00 r~i i i i i i i i i i i i r 0.04 0.08 0.12 0.16 0.20 0.24 0.26 inri prior (b) Uniform s shrinkage O c o ; to 2 CD * III I I I I I I I I I I I I I I IL 0.00 . 0.04 0.08 0.12 0.16 0.20 0.24 0.28 . A (c) Jeffry's prior 82 Figure 3.39: Histograms of r for different priors when r = 0.01 and n = 23 2 2 > o c a 3 CT ~r~—r—i——i 0.00 0.09 0.20 0.30 0.40 0.50 1 0.60 i 0.70 0.8 (a) Conservative prior c <° 3 ° CT * g> o 0.00 0.09' 0.20 0.30 0.40 0.50 0.60 0.70 0 . 8 0 A . . . (b) Uniform shrinkage prior o CT * £ o milium 0.00 0.09 T T 0.20 0.30 0.40 , 0.50 0.60 0.70 0 . 8 0 A T 2 (c) Jeffry's prior All of the Figures 3.36, 3.37, 3.38 and 3.39 confirm that conservative prior performed best in controlling undersmoothing in all the situations considered. To reflect the performance of conservative prior with a = 3 compared to the other two priors considered with respect to MSE of 9 we have summarized the number of times that MSE of 9 in the case of Jeffreys' and uniform shrinkage priors exceed the MSEs of 9 obtained by conservative prior in Tables 3.4, 3.5 and 3.6. From the counts of the Tables 3.4, 3.5 and 3.6 we observe that when r is relatively larger than 83 Table 3.4: Table summarizing the number of times that MSE(0) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior with a = 3 out of 100 occasions when T = 5 2 Sample Size 111 23 #[mse(0) >mse(0) „] 37 25 usp co #[mse(0)j // >mse(0) ] 46 46 e ctm Table 3.5: Table summarizing the number of times that MSE(0) under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior with a = 3 out of 100 occasions when r = 1 2 Sample Size 111 23 #[mse(0) >mse(0) ] , 3 7 35 usp con #[mse(0) // >mse(0) ] 52 .52 Je con the error variance cr the uniform shrinkage prior performs better, in terms of MSE of 0, in more 2 occasions than the proposed conservative prior. In case the when T 2 is relatively lower than the error variance the conservative prior with a = 3 performs better in more occasions. For the situation when the global variability is of equal magnitude of the local variability, though the performance of the conservative prior improves in the case of small sample size the counts that the MSEs of 0 for uniform shrinkage prior exceed those for the conservative prior are still less than 50 out of 100 occasions. Regarding the comparison of the performance of conservative prior and Jeffreys' prior, the output suggest that in case of lager T conservative prior and Jeffreys' 2 prior performs almost similarly but in the reverse case conservative prior always performs far better than Jeffreys' prior in terms of MSE of the estimated curve. 84 Table 3.6: Table summarizing the number.of times that MSE(f9)under Jeffreys' prior and uniform shrinkage prior exceed those under conservative prior with a — 3 out of 100 occasions when r = 0.01 2 Sample Size 111 23 3.6 #[mse(0) >mse(0) ] .'#[mse(0)j // >mse(0) ] 55 74 58 83 usp con e con Conclusion In this chapter we studied the performance of the proposed conservative prior with respect to its ability to controlling undersmoothing through simulation studies. We also studied the performance of the proposed prior with compared to the Jeffreys' prior and the uniform shrinkage prior. The reason for. considering the uniform shrinkage prior and the Jeffreys' prior as the competitor of the proposed prior was that these two priors are common choices in Bayesian hierarchical models. Also, they perform better than other priors used for Bayesian hierarchical models (Daniels, 1999). Furthermore, the uniform shrinkage prior can do some degree of smoothing due to its property of shrinkage toward zero. Like the uniform shrinkage prior, in a Bayesian hierarchical model the proposed conservative prior make possible to get a proper posterior distribution of the model parameters, which is essential for valid posterior inferences about the parameters. From simulation studies we observed that the proposed conservative prior performs better in controlling undersmoothing than the other two priors considered in this study. Also, in terms of MSE of the estimated curve the proposed prior exhibits better performance for the data sets when the global variation is relatively smaller than the local variation. In the other case also the proposed prior with a = 3 showed better performance in terms of MSE of the estimated curves for many data sets. So, from the result of simulation studies it can argued that using reasonably chosen value of the hyperparameter a, e.g. a = 3, for the conservative prior can nicely control the excessive degree of undersmoothing while, at the same time, not oversmoothing too much, and hence can give more accurate estimate of r 2 and consequently, of the other model parameters. Thus the Bayesian approach with conservative prior can give a better picture of the true underlying relationship between response and the covariate in smoothing problem. Again, if it is believed that the true underlying relationship between the response and the 85 covariate is smooth then it is desirable to have an estimated curve which is also smooth. In such cases controlling undersmoothing may be the main concern and hence the use of conservative prior, with a value between 3 to 5 of the hyperparameter a, can ensure the smoothness of the estimated curves. Finally, from the results of the simulation studies we can summarize that for a response variable fluctuating rapidly as a function of the design variable it is better to use the conservative prior while using the Bayesian roughness penalty approach in regression curve estimation. Actually, this is the case of larger relative error variability with compared to the global variability, and in such case a smooth estimated curve is really desirable. Also, since the Bayesian roughness penalty approach with the conservative prior for the random effects variance component provides smooth estimate of the true mean curve it is robust against any wild local observations. 86 Chapter 4 Discussion and Conclusion The work of this dissertation focuses on finding an appropriate prior distribution for the variance component in the Bayesian hierarchical models. In many real life situations data arise in hierarchical fashion. Some common situations where the hierarchical models arise are the random effect models, spatial models and non-parametric curve-fitting. All the above mentioned models can further be viewed as smoothing problem, and it is often desirable to have an estimated mean structure which is not more variable than the true underlying structure or the estimated regression curve which is not more wiggly than the true underlying curve. That is, we do not want to undersmooth the phenomenon we estimate. The motivation for this work comes from the thought that we should have an estimation technique which can control the undersmoothing while estimating a phenomenon by hierarchical models. In searching such a technique for hierarchical models we have considered the Bayesian approach because of its flexibility to incorporate the multiple level of randomness- the main feature of hierarchical models. Bayesian approach to hierarchical models have many other advantages over the classical approach which have already been discussed in Chapter 1. Finally, while adopting the Bayesian approach for hierarchical models we have thought of choosing a prior for the random effects variance component which can guard against undersmoothing. We have conducted simulation studies to choose the appropriate values for the hyperparameter for the proposed prior so that it can provide adequate guard against undersmoothing and at the same time do not oversmooth the phenomenon too much and thus can provide better calibrated estimates of the models parameters. In Chapter 1, we have formulated the Bayesian version of the normal-normal hierarchical model. 87 Simulation studies reveal that the proposed conservative prior, with values between 3 to 5 of the hyperparameter a, perform quite well in guarding against undersmoothing. With compared to the two of its competitors- the uniform shrinkage prior and the Jeffreys' prior- the proposed prior perform much better in controlling undersmoothing in all the situations considered for the simulation studies. Also, in terms of the MSE of the estimated variance component the proposed prior, with a = 3, perform much better than its competitors in all the situations considered. Even with a = 5 the proposed prior performs better than the Jeffreys' prior in all but one cases considered and performs better than the uniform shrinkage prior in the case when the between group variability is relatively smaller than the within group variability. In Chapter 3, we have formulated the Bayesian hierarchical model for the roughness penalty approach to curve-fitting. Then we have suggested a modified version of the conservative prior for that hierarchical model. Finally, we have conducted simulation studies to see how the conservative prior performs in smoothing problem to guard against undersmoothing. Also, we have investigated the performance of the conservative prior with those of the uniform shrinkage prior and the Jeffreys' prior via simulation studies. The results are encouraging in terms of guarding against undersmoothing. In terms of MSE of the estimated curve the proposed conservative prior performs better than its competitors in the case when global variability is less than the local variability. Also, in the other two cases the conservative prior performs better in more than 50% of the data sets considered with compared to the Jeffreys' prior. With compared to the uniform shrinkage prior the conservative prior produces smaller MSE in many occasions (though less than 50%). Again, it is always believed that the true underlying curve in non-parametric regression is usually smoothed. So, for most of the applications in non-parametric curve-fitting the smoothness of the estimated curves is the main concern, and hence the use of conservative prior with the suggested values of the hyperparameter a can ensure the desired degree of smoothness of the estimated curves. If someone is concerned about the MSE of the estimated curves also he/she may use the lowest value from the suggested range (3 to 5) for the hyperparameter. In a nutshell, on the basis of the results of simulation studies we can conclude that in the case of simple normal-normal hierarchical model the proposed conservative prior with a = 3 can be used without any reservation because with a — 3 the conservative prior beats the other two priors considered in this study- both in terms of guarding against undersmoothing as well as in terms of MSE of the estimated variance component. Also, in many situations where the smoothness of the estimates is the only concern, e.g., in the case of hospital model of recovery rates of a cardiovascular treatment in different hospitals, even bigger values of a (not exceeding 88 5) can be used. Again, in smoothing problem researchers can choose between the uniform shrinkage prior and the conservative prior if they think that the global variability is relatively larger than the local variability. In the reverse case, the use of conservative prior is always better. Also, in the case when the researchers is concerned only about the smoothness of the estimated curve the use of the conservative may get preference over the uniform shrinkage prior. As in all other areas there is a huge scope for further research in this area. Firstly, in the present study we have proposed the conservative prior and investigated its performance in the context of continuous response only. Further work is required to make the use of the conservative prior to the response arise from any distribution belongs to the exponential family. In this study, in normal-normal hierarchical model we did not consider the incorporation of predictors to predict the mean response. The hierarchical model that incorporates the response-predictor relationship for the Gaussian response is the linear mixed effects model (LMM). In L M M , the classical approach considers random effects variance component to be fixed and be estimated from the data. In doing so, the classical approach ignores the uncertainty in the random effects variance component. So, the reasonable alternative is the Bayesian hierarchical models. And in the case of Bayesian hierarchical models, we can consider the use of the conservative prior for the random effects variance component. Again, further work needs to be done to extend the use of the proposed conservative prior in the broad areas of generalized linear mixed models (GLMM). In the case of G L M M , the existing classical approaches lead to inefficient inference about the random effects. In G L M M , the estimate of random effects variance components becomes very difficult when the number of random effects becomes large. The approximate method proposed by Breslow and Clayton may produce estimate of the random effects covariance matrix which is negative definite (Breslow and Clayton, 1993). But the Bayesian formulation enjoys an advantage here because of the information on the variance component provided by the prior distribution. AJso, Bayesian procedures avoid the need for numerical integration by taking repeated samples form the posterior distribution. But the main concern in the Bayesian approach is to choose an appropriate prior for the random effects variance components, and the proposed conservative prior, if can be extended for G L M M , may provide a good alternative in this area. Natarajan and Kass (Ranjini Natarajan and Robert E . Kass, 2000) proposed the approximate uniform shrinkage prior and the approximate Jeffreys' prior for the random effects covariance matrix in GLMM. But the main concern with the use of those prior is undersmoothing. So, we can think of extending the use of proposed conservative prior for the random effects covariance matrix in G L M M to check undersmoothing and to have better calibrated estimates of the model parameters. 89 Bibliography [1] Barger, J. 0. and Deely, J. J. A bayesian approach to ranking and selection of related means with alternative to analysis-of-variance methodology. Journal of the American Statistical Association.. 83:364-373, 1988. [2] Barger, J. O. and Bernando, J. M . On the development of reference priors. In Barger, J. O., Bernando, J. M , Dawid, A. P., and Smit,h,F. M., editors, Bayesian Statistics-4, pages . 35-60. Oxford University Press, 1992. [3] Barry, Daniel. A bayesian analysis for a class of penalized likelihood estimates. Communicationsin Statistics, 34(4):1057-1071, 1995: [4] Box, G. E. P. and Tiao, G. C. Bayesian Inference in Statistical Analysis. Wiley, New York, 1973. [5] Breslow, N . E. and Clayton, D. G. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9-25, 1993. [6] Michael J. Daniels. A prior for the variance in hierarchical models. The Canadian Journal of Statistics, 27, No.3:567 578, 1999. [7] Eubank, R. L. Spline Smoothing and Nonparametric Regression. Marcel Drekker, New York, 1988. [8] Gilks, at. el. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.. [9] Green, P. J. and Silverman, B. W. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London, 1994. [10] Hastie, T. and Tibshirani, R. Generalized Additive Models. Chapman and Hall, New York, 1990. 90 [11] Hastie, T. and Tibshirani, R. Bayesian backfitting. Statistical Science, 15, No.3:196-223, .'2000. . [12] Hobert, J. P. and Casella, G. The effect of improper priors on gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91:1461-1476, 1996. [13] Jeffreys, H. Theory of Probability. Oxford University Press, 1961. [14] McCullagh, P.and Nelder, J. A. Generalized Linear Models. Chapman and Hall, London, 1989- = [15] Ranjini Natarajan and Robert E. Kass. Reference bayesian methods for generalized linear m i x e d models. Journal of the American Statistical Association, 95:227-237, 2000. [16] Silverman, B. W. . A fast and.efficient cross-validation method for smoothing parameter choice in spline regression. Journal of the American Statistical Association, 79:584-589, 1984. ..• ; . • : . [17] Simonoff, Jeffrey S. Smoothing Methods in Statistics. Springer-Verlag, New York, 1996. [18] Wahhba, G. Spline Models for Observational Data. SI AM, 1990. 91
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A conservative prior for bayesian hierarchical models...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A conservative prior for bayesian hierarchical models in biostatistics Hossain, Md. Shahadut 2003
pdf
Page Metadata
Item Metadata
Title | A conservative prior for bayesian hierarchical models in biostatistics |
Creator |
Hossain, Md. Shahadut |
Date Issued | 2003 |
Description | Hierarchical models are suitable and very natural to model many real life phenomena, where data arise in nested fashion. The use of Bayesian approach to hierarchical models has numerous advantages over the classical approach. Estimating a phenomenon with hierarchical model can be viewed as a smoothing problem, and hence while summarizing such a phenomenon via hierarchical model we do not want to undersmooth the phenomenon. That is, in most of the practical applications undersmoothing is more serious type of error than oversmoothing. So, we need an estimation approach which can guard against undersmoothing. If we can control the undersmoothing reasonably, we may get a better calibrated summary of the phenomenon we estimate. In this study, we have incorporated the aspect of smoothing in estimating the parameters of Bayesian hierarchical models. In doing so we have proposed a conservative prior for the variance component to achieve the adequate degree of smoothness while estimating the phenomenon under study. We have conducted simulation studies to decide about the appropriate values to be used for the hyperparameter while using the conservative prior to ensure the adequate degree of smoothness. We have investigated the performance of the proposed conservative prior in guarding against undersmoothing in simple normal-normal hierarchical models (random effects models for normal response) and in non-parametric regression curve estimation problem via simulation studies. We have also investigated the performance of the proposed prior compared to those of the uniform shrinkage prior and the Jeffreys' prior, with respect to both guarding against undersmoothing and the MSE of the estimated model parameters, through simulation studies. |
Extent | 5849339 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-10-29 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0091385 |
URI | http://hdl.handle.net/2429/14419 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2003-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2003-0361.pdf [ 5.58MB ]
- Metadata
- JSON: 831-1.0091385.json
- JSON-LD: 831-1.0091385-ld.json
- RDF/XML (Pretty): 831-1.0091385-rdf.xml
- RDF/JSON: 831-1.0091385-rdf.json
- Turtle: 831-1.0091385-turtle.txt
- N-Triples: 831-1.0091385-rdf-ntriples.txt
- Original Record: 831-1.0091385-source.json
- Full Text
- 831-1.0091385-fulltext.txt
- Citation
- 831-1.0091385.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0091385/manifest