Valid Estimation and Prediction Inference in Analysis of a Computer Model

by

Béla Nagy

B.Math, Eötvös Loránd University, 1995
Honours B.Math, Computer Science, University of Waterloo, 2000

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

The Faculty of Graduate Studies (Statistics)

The University of British Columbia (Vancouver)

August 2008

© Béla Nagy, 2008

Abstract

Computer models or simulators are becoming increasingly common in many fields in science and engineering, powered by the phenomenal growth in computer hardware over the past decades. Many of these simulators implement a particular mathematical model as a deterministic computer code, meaning that running the simulator again with the same input gives the same output.

Often running the code involves some computationally expensive tasks, such as solving complex systems of partial differential equations numerically. When simulator runs take too long, this can limit their usefulness. In order to overcome time or budget constraints by making the most out of limited computational resources, a statistical methodology has been proposed, known as the "Design and Analysis of Computer Experiments".

The main idea is to run the expensive simulator only at relatively few, carefully chosen design points in the input space, and based on the outputs construct an emulator (statistical model) that can emulate (predict) the output at new, untried locations at a fraction of the cost. This approach is useful provided that we can measure how much the predictions of the cheap emulator deviate from the real response surface of the original computer model.

One way to quantify emulator error is to construct pointwise prediction bands designed to envelope the response surface and make assertions that the true response (simulator output) is enclosed by these envelopes with a certain probability. Of course, to be able to make such probabilistic statements, one needs to introduce some kind of randomness.
A common strategy, which we use here, is to model the computer code as a random function, also known as a Gaussian stochastic process. We concern ourselves with smooth response surfaces and use the Gaussian covariance function, which is ideal when the response function is infinitely differentiable.

In this thesis, we propose Fast Bayesian Inference (FBI), which is both computationally efficient and can be implemented as a black box. Simulation results show that it can achieve remarkably accurate prediction uncertainty assessments in terms of matching coverage probabilities of the prediction bands, and the associated reparameterizations can also help parameter uncertainty assessments.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
Statement of Co-Authorship

1 Introduction
  1.1 Computer Model
  1.2 Statistical Model
  1.3 Reparameterization
  1.4 Preview of chapters
  Bibliography

2 Quantifying Prediction Uncertainty in Computer Experiments with Fast Bayesian Inference
  2.1 Introduction
  2.2 The Gaussian Process Model
  2.3 Outline of Fast Bayesian Inference
  2.4 Simulations
  2.5 Examples
  2.6 Dealing with the process variance
    2.6.1 Profile likelihood
    2.6.2 Integrated likelihood
  2.7 Fast Bayesian Inference in detail
  2.8 Concluding remarks
  Bibliography

3
  3.1 Introduction
  3.2 Statistical Model
    3.2.1 Likelihood
    3.2.2 Profile likelihood
    3.2.3 Log transformation
    3.2.4 Example
    3.2.5 One-dimensional special case
  3.3 Simulations
  3.4 Results
    3.4.1 Point estimation
    3.4.2 Parameter uncertainty
  3.5 Methods
    3.5.1 Likelihood-based estimators
    3.5.2 Normal approximations
    3.5.3 Bayes estimator
  3.6 Concluding remarks
  Bibliography

4
  4.1 Alternative correlation functions
  4.2 Additional terms in the model
  4.3 Different reparameterizations
  4.4 Numerical optimizations
  Bibliography

A Appendix to Chapter 3
  Bibliography

B Appendix to Chapter 4
  B.1 Warning
  B.2 Simulations
  B.3 MCMC
  B.4 Coverage Probability
  B.5 Results
  Bibliography

List of Tables

3.1 MLE of log θ1 in the first simulation study (n = 10d).
3.2 MLE of log θ1 in the second simulation study (n = 5d).
3.3 MLE of log σ² in the first simulation study (n = 10d).
3.4 MLE of log σ² in the second simulation study (n = 5d).
3.5 Bayes estimate of log σ² in the first simulation study (n = 10d).
3.6 Bayes estimate of log σ² in the second simulation study (n = 5d).

List of Figures

2.1 Optimal λ values in the first simulation study (n = 10d) for d = 1, ..., 10.
2.2 Optimal λ values in the second simulation study (n = 5d) for d = 1, ..., 10.
2.3 Coverage probabilities in the first simulation study (n = 10d) for d = 1, 4, 7, and 10.
2.4 Coverage probabilities in the second simulation study (n = 5d) for d = 1, 4, 7, and 10.
2.5 Coverage probabilities for the Arctic sea ice example (n = 50).
2.6 Coverage probabilities for the Wonderland example (n = 100).
3.1 The log transformation improved approximate normality of the profile likelihood for this one-dimensional (d = 1) example. The top two plots are for the two-parameter likelihood and the bottom two for the one-parameter profile likelihood. The ridges of the contours are marked by the dashed lines, reaching their apex at the MLE. Below the contour plots, these dashed lines are plotted as functions of the range parameter, representing the profile likelihood function. In addition to the profile likelihoods (dashed curves), their normal approximation is also shown for comparison (dotted curves).
3.2 Optimal λ values for d = 1 and n = 3, ..., 10 as a function of θ. The digits 3, ..., 9 in the plot represent the design sample size n and the digit 0 represents n = 10. The lines for n = 8, n = 9, and n = 10 do not start on the left side of the plot because of numerical difficulties for small θ.
3.3 Coverage probabilities of Wald confidence intervals and Bayes credible intervals in the first simulation study (n = 10d) for d = 1, 4, 7, and 10.
3.4 Coverage probabilities of Wald confidence intervals and Bayes credible intervals in the second simulation study (n = 5d) for d = 1, 4, 7, and 10.
3.5 Coverage probabilities of confidence regions based on the two likelihood functions and their normal approximations in the first simulation study (n = 10d) for d = 1, 4, 7, and 10.
3.6 Coverage probabilities of confidence regions based on the two likelihood functions and their normal approximations in the second simulation study (n = 5d) for d = 1, 4, 7, and 10.

Acknowledgments

There was a time when I thought that statistics was boring.
Driven by common misconceptions (reinforced by dry Statistics Canada press releases), I imagined it was nothing more than endless tabulating and mindless number crunching...

That view came to an end when I had the good fortune of taking my first real statistics course from Professor William J. Welch at the University of Waterloo. How different that was from my previous studies in mathematical statistics! Finally, I could see how we could learn about the real world by carefully designed statistical experiments.

Unfortunately, I did not have the courage to go for stats at that point in my life. That came several years later, when he was about to teach a course on the Design and Analysis of Experiments at UBC. Then I decided to take the plunge, and that was one of the best decisions of my life. He is an exceptionally inspirational teacher and I am grateful for his guidance as my academic advisor. I feel incredibly lucky for the opportunity to get my training in the Design and Analysis of Computer Experiments from one of the founders of this field. I have benefited immensely not only from his expertise in technical matters but also in countless other practical things that are important in academic life. Considering his many responsibilities, how he found the time to do so much for me is still a mystery, and I have to admit that I often felt guilty for diverting his time from other important things.

My thanks and gratitude also go to Dr. Jason L. Loeppky, who first started advising me as a postdoctoral fellow and later became my co-supervisor. He was instrumental in getting me started in my research the summer before I began my Ph.D. program. I will always remember our numerous stimulating conversations and debates that helped me identify my shortcomings and broaden my horizons.

I am especially grateful to both of my advisors for granting me many "degrees of freedom"
of independence and for enduring my stubbornness (which led me down blind alleys several times, resulting in countless hours of ultimately unproductive programming, e.g. concatenating Gaussian processes, constructing sequential designs, exploring bootstrapping schemes, etc.). Their patience and understanding are traits I value highly.

Derek Bingham at SFU also supported me in a variety of ways (including financially) and I am still astounded by all his efforts to get me involved in conferences, workshops, summer schools, and internships.

Although it is impossible to mention everyone else by name who contributed to my memorable grad student experience here at UBC, I still want to point out several remarkable individuals. Among the faculty in the Department of Statistics, professors Paul Gustafson, Jiahua Chen, and James V. Zidek stand out for their steadfast support and useful suggestions. Hardworking grad students in our department were too many to mention, but Jesse Raffa, Yiping Dou, and Kenneth Chi Ho Lo have inspired me on many occasions with their seemingly superhuman efforts. I also want to thank Wei Wang and Hui Shen for providing me with LaTeX templates to create this document. Of course, appreciation goes both ways, and I will always remember fondly that my fellow grad students awarded me twice for my own efforts: with the "Energizer Award (keeps working and working and...)" in 2006-2007 and the "If I were allowed, I would live in the Department Award" in 2005-2006. (Unfortunately, I did not attend the award ceremonies because I was, well, working...)

This research was funded by the Natural Sciences and Engineering Research Council of Canada and the National Program on Complex Data Structures of Canada. Computations were made using WestGrid, which is funded in part by the Canada Foundation for Innovation, Alberta Innovation and Science, BC Advanced Education, and the participating research institutions.
WestGrid equipment is provided by IBM, Hewlett Packard, and SGI.

My research would not have been possible without all the technical support staff who helped me perform enormous amounts of computation, store inordinate data sets, and use obscure software libraries. Martin Siegert at SFU has been incredibly helpful and responsive in installing, compiling, and debugging C++ code on WestGrid (robson). I will also be forever grateful to The Ha in our department and Roman Baranowski at the UBC WestGrid facility for not kicking me out for good after my careless code caused an accidental burnout of the department's workhorse (gosset) and, on another occasion, a catastrophic slowdown of the largest WestGrid cluster (glacier), respectively.

The Van de Plas family (Paul, Susan, and Monique) also deserve special credit, because they had a large share in my success by welcoming me to their city and being incredibly generous with their unconditional support and time to help me get started in Vancouver. When I think of Canadian hospitality and willingness to help, their example will always glow bright.

Finally, I also want to say thank you to Dr. Matthias Schonlau for his very useful advice about how to keep focused in grad school, and to credit his web site for helping me finish in the minimum time allowed for my program (three years), doing my part to improve graduation statistics! However, I should also mention that I started doing research four months before starting my Ph.D. and started taking courses in probability and statistics a year before that, pushing the total time invested to four years. (At the time of this writing, typing "how to finish a phd" or "finishing my phd" into Google and hitting the I'm Feeling Lucky button leads to his web page at http://www.schonlau.net/finishphd.html.)

Statement of Co-Authorship

This thesis was completed under the supervision of my advisors, Professor William J. Welch and Assistant Professor Jason L. Loeppky.

Chapter 2 is co-authored with Dr. William J.
Welch and Dr. Jason L. Loeppky. My main contributions are designing, running, and analyzing the simulations and cross-validations, applying the results of Kass and Slate (1994) to minimize nonnormality, and most of the writing.

Chapter 3 is co-authored with Dr. William J. Welch and Dr. Jason L. Loeppky. My main contributions are designing, running, and analyzing the simulations, deriving formulas for optimal transformations based on a nonnormality measure of Sprott (1973), and most of the writing.

Chapter 1

Introduction

This thesis is about how to achieve valid inference by reparameterizing a particular statistical model used in the field of computer experiments. After discussing what we mean by valid inference and its related philosophical implications, we describe the model and the rationale behind its reparameterization. Then we preview how the following chapters address specific aspects of this problem.

When we are talking about estimation or prediction, valid inference includes the ability to quantify uncertainty. Frequentists do that by constructing confidence sets, while Bayesians may prefer credible sets. In this thesis we use the classical frequentist interpretation. For instance, we view a 99% confidence region as a random entity that should cover the true value approximately 99% of the time over the course of many repeated, identical trials.

We evaluate the validity of our methods by extensive simulations, averaging over many data sets to estimate the actual coverage probabilities of the confidence regions. Unless the true coverage is roughly the same as the advertised nominal coverage, an inference method cannot be considered valid.

The methods we study can be classified as likelihood-based or Bayesian. But regardless of the philosophical underpinnings, they all have to go through the same frequentist simulation test, e.g.
even when we are dealing with a Bayesian credible interval, we still evaluate it by its frequentist properties in terms of matching coverage probabilities.

Hence, one could argue that our approach is a mix of frequentist, likelihoodist, and Bayesian ideas. However, that characterization would not do justice to the spirit of this work, since we are above all pragmatists, driven by practical, real-world applications, not just academic curiosities. The image of the practicing engineer, or research scientist, or some other professional experimenting with a computer model is paramount in our minds. As users, most of them could not care less about philosophical debates in statistics. What matters to them most is whether a given method works or does not work in the real situation they are facing. That is also a kind of philosophy we can relate to, and we hope that practitioners will find our contributions useful and will implement our proposals.

1.1 Computer Model

First, we need to make a distinction between the computer model or simulator and the statistical model or emulator. The computer model is not a statistical model. Instead, it is usually a complex mathematical model of ordinary and partial differential equations, implemented as a computer code, used to simulate a complex real-world phenomenon. Examples include weather modeling, chemical and biochemical reactions, particle physics, cosmology, semiconductor design, aircraft design, automotive crash simulations, etc.

Rapidly growing computing power has enabled scientists and engineers to build sophisticated computer models that can simulate a complex process to sufficient granularity, so that in some cases it is sufficient to study the virtual world created by the simulator instead of the original physical process in the real world. This may have several advantages, since physical experimentation can be time-consuming, expensive, or not possible at all for a variety of reasons (physical, legal, ethical, etc.).
In contrast, computer experimentation is usually limited only by the available computing resources.

As cutting-edge science and engineering is always pushing the boundary of what is possible, many of these simulators tend to be computationally expensive. Furthermore, the number of input variables may be so large that a systematic exploration of all possible input combinations of interest may not be possible because of a combinatorial explosion. This necessitates a faster approximation: an emulator that emulates the simulator.

1.2 Statistical Model

The emulator is a statistical model that can predict the output of the simulator based on a relatively small number of simulation runs. Since prediction may be many orders of magnitude faster than running the simulator code itself, the emulator may eventually replace the simulator. (This is why an emulator is sometimes called a meta-model: a model of the computer model.)

Of course, all this hinges on the ability of the emulator to accurately predict the unknown response of the simulator at an untried input combination. This is the subject of a specialized field in statistics that started with the seminal paper of Sacks, Welch, Mitchell, and Wynn (1989), titled "Design and Analysis of Computer Experiments".

The design part deals with the question of how to choose the initial input combinations for the simulator runs. Classical design of experiments techniques, such as replication, randomization, or blocking, do not apply, since what we are trying to predict is deterministic computer output with no observational error (if we run the code again with the same input, we get the same output). It quickly became apparent that space-filling designs were the most useful, such as the Latin hypercubes of McKay, Beckman, and Conover (1979) that are used in this thesis.
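To make the construction concrete, the Latin hypercube recipe of McKay, Beckman, and Conover (1979) can be sketched in a few lines of Python. This is an illustrative sketch using NumPy, not code from this thesis, and the function name is ours: each axis of the unit cube is cut into n equal strata, one point is placed per stratum, and the strata are paired across dimensions by independent random permutations.

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    """Draw an n-point Latin hypercube sample in [0, 1]^d
    (the construction of McKay, Beckman, and Conover, 1979):
    each axis is cut into n equal strata, one point per stratum,
    with the strata paired across axes by random permutations."""
    rng = np.random.default_rng(seed)
    # One independent random permutation of the stratum indices per dimension.
    strata = np.stack([rng.permutation(n) for _ in range(d)], axis=1)
    # Place each point uniformly at random inside its stratum.
    return (strata + rng.uniform(size=(n, d))) / n

X = latin_hypercube(10, 3, seed=0)
# Every one-dimensional projection of X hits each of the 10 strata exactly once,
# which is what makes the design space-filling in each input separately.
```

The one-point-per-stratum guarantee in every one-dimensional projection is what distinguishes this design from a plain uniform random sample of the same size.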
Since our work is about the analysis of computer experiments, we are not going to discuss design issues any further. The interested reader is referred to the substantial literature developed over the years on the many possible ways to construct such designs (e.g. Tang (1993); Morris and Mitchell (1995); or Mease and Bingham (2006) for a recent generalization to Latin hyperrectangles).

Although the output of a simulator may be multivariate, we can assume without loss of generality that it is univariate, since different outputs can be emulated separately. (This approach may be feasible even when the output is functional data, since sometimes we are interested in emulating only a finite number of summaries of the output function, instead of the entire function.) Hence, we are emulating a deterministic computer code with a single output y as a function of d ≥ 1 inputs: x1, x2, ..., xd. Likewise, for the emulator, we use a statistical model with a single dependent variable Y and d independent variables. Here we are assuming that all variables are real numbers and that the response is a smooth function of the d-dimensional vector x = (x1, x2, ..., xd)^T, having derivatives of all orders. This is a reasonable assumption for a large class of simulators because of the nature of the underlying system of differential equations.

The main idea is to model the deterministic function specified by the computer code with a random function Z(x). This is a counter-intuitive idea, since we are infusing randomness where none exists. Nevertheless, over the past 20 years or so, this approach has proven more successful than any other method for modeling deterministic response surfaces in computer experiments.

Other common names for Z(x) are Gaussian process, Gaussian stochastic process, stochastic process, spatial process, or random field, and this construct has been used extensively in spatial statistics, starting with geostatistics, where it is known as kriging (Cressie, 1993).
However, spatial applications are usually in just two or three dimensions, while computer models can have many more input variables. (For example, in Chapter 2, we will work with a model with d = 41 inputs.) The other major difference is that kriging models usually include a white noise term, but in the deterministic case there is no noise. This is why, for example, linear models are inappropriate, since the usual independence assumption for a random error ε is not satisfied in a model like

    Y(x) = Σ_j βj fj(x) + ε,

where each fj(x) is a function of x with unknown coefficients βj. But if we replace the random error term ε with a systematic error term Z(x), then we obtain the model in Sacks et al. (1989), which is a sum of a regression component or drift and a stochastic process Z(x) that captures the systematic departure from the drift:

    Y(x) = Σ_j βj fj(x) + Z(x),

where the assumption for the process Z(x) is that it has mean zero, constant variance σ², and a parametric correlation function depending on some measure of distance in the input space. To simplify calculations, we assume that there is no significant drift, making the regression part unnecessary and leaving the stochastic part as the only component in our model:

    Y(x) = Z(x).

We use the Gaussian correlation function, a common choice for modeling smooth response surfaces:

    Corr(Z(w), Z(x)) = ∏_{i=1}^{d} exp{ -θi (wi - xi)² },    (1.1)

specifying that the correlation between the responses at input sites w and x is a function of the distance between the two points, scaled by positive range parameters θi, measuring how active the process is in each of the d dimensions.

We should also mention that such models are often presented in a Bayesian way, saying that what we are doing is essentially using a Gaussian process prior for the data (Currin, Mitchell, Morris, and Ylvisaker, 1991).
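Equation (1.1) maps directly onto code. The following sketch (illustrative Python/NumPy, not from the thesis; the function name is ours) builds the correlation matrix of a design by exploiting the identity ∏_i exp{-θi Δi²} = exp{-Σ_i θi Δi²}.

```python
import numpy as np

def gaussian_corr(X, theta):
    """n-by-n correlation matrix for the Gaussian correlation function (1.1):
    R[a, b] = prod_i exp{-theta_i (x_ai - x_bi)^2}.
    The product over i is computed as a sum inside the exponent."""
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta, dtype=float)
    diff = X[:, None, :] - X[None, :, :]        # (n, n, d) pairwise coordinate differences
    return np.exp(-(theta * diff ** 2).sum(axis=-1))

# Three points on a line with range parameter theta_1 = 2:
R = gaussian_corr([[0.0], [0.5], [1.0]], theta=[2.0])
# The diagonal is 1 (a site is perfectly correlated with itself), R is symmetric,
# and correlation decays smoothly as the scaled squared distance grows.
```

Larger θi values make the correlation decay faster in dimension i, which is why the θi are interpreted as measuring how active the process is in each input.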
From that perspective, this correlation function puts all prior density on smooth, infinitely differentiable functions. However, in this thesis we avoid that kind of terminology, because we use the word "Bayesian" in a different way, referring to the joint distribution of the range parameters as part of the Fast Bayesian Inference method that is the subject of Chapter 2.

1.3 Reparameterization

Statistical models can be reparameterized for many different purposes. Our objective is to make the shape of the likelihood more Gaussian, enabling good normal approximations (i.e. approximating a likelihood function with the density function of a multivariate normal distribution). An excellent reference on this subject is Slate (1991), comparing several different measures of nonnormality. Note that this is about transforming the parameters of the model, as opposed to transforming the response, as popularized by Box and Cox (1964).

We use measures by Sprott (1973) for d = 1, and a multivariate extension by Kass and Slate (1994) for d > 1, to quantify nonnormality for relatively small sample sizes, when we cannot rely on asymptotics to guarantee a likelihood that is approximately normal. It appears that, in general, small-sample normality has not been investigated as thoroughly as asymptotic normality in statistics.

But in practice, small-sample results are often more relevant than large-sample results.
This is especially true in the field of computer experiments, where sample sizes are routinely small relative to the dimensionality of the input space because of the excessive computational cost of obtaining data. Hence, the lack of small-sample focus is even more puzzling in this research area (although it is understandable from a historical perspective, since the powerful computers required for simulations with finite samples have emerged only gradually over the past decades, while the hardware requirements for asymptotic investigations were rarely more than pen and paper).

Theory is lagging behind current practice: from the practitioners' point of view, the crucial question is how to make the most of a limited number of data points, but theoretical arguments are usually based on asymptotics, providing little guidance for small samples. (See Zhang and Zimmerman (2005) for a recent review of results based on increasing-domain or infill asymptotics, titled "Towards Reconciling Two Asymptotic Frameworks in Spatial Statistics".)

The original inspiration for this work was Karuri (2005), who observed that in a Bayesian setting the log transformation of the range parameters improved approximate normality of the posterior for one- and two-dimensional examples, and who demonstrated its usefulness for integration and prediction. Following up on her original idea, we show that the log transformation can be even more useful in higher dimensions, sometimes enabling surprisingly accurate uncertainty assessments for parameter estimation and prediction.

1.4 Preview of chapters

This is a manuscript-based thesis. Chapters 2 and 3 are separate articles intended for publication.
An earlier version of Chapter 2 has already been submitted to a journal, and Chapter 3 will follow soon.

Chapter 2 introduces Fast Bayesian Inference (FBI) and compares it to the traditional plug-in method on both simulated and real data sets, demonstrating that the prediction bands of the FBI are more valid than those of the plug-in in terms of their frequentist coverage probabilities. The equivalence of "profiling out" and "integrating out" the process variance σ² is established, and the resulting profile likelihood function of the range parameters θ1, ..., θd in (1.1) is approximated with a multivariate normal distribution. The quality of the approximation is evaluated by a nonnormality measure of Kass and Slate (1994). This measure is minimized numerically in an attempt to improve normal approximations. It is found that for large d, within the family of power transformations, the log transformation is close to optimal both with respect to minimizing nonnormality and with respect to assessing prediction uncertainty.

Using the same model with the same reparameterization, Chapter 3 examines how well the parameters can be estimated by maximum likelihood and how well one can quantify parameter uncertainty by normality-based confidence sets and by more exact likelihood-based confidence regions. It is found that although the point estimates slightly underestimate the real parameters, uncertainty can be measured adequately by normality-based (Wald-type) confidence intervals obtained from the standard errors derived from the observed information matrix. However, Wald-type confidence ellipsoids for the joint estimation of the model parameters are not as accurate as the ones obtained from inverting the likelihood-ratio tests (which themselves can become inadequate for small sample sizes). Implications for the FBI are discussed, and a Bayes estimator for σ² is presented that is less biased than the MLE of σ². Likelihood nonnormality (i.e.
the closeness of normal approximations) is explored graphically, revealing a mismatch in the tails. Another measure by Sprott (1973) for d = 1 demonstrates why the log transformation can be far from optimal in the one-dimensional special case, explaining why the results for d = 1 are in general inferior to the results for d > 1 in Chapters 2 and 3 and Appendix B.

Chapter 4 concludes the thesis by relating the two manuscripts to each other and to the field of study, reviewing the strengths and weaknesses of the research, and discussing potential directions for future work. Appendix A to Chapter 3 contains the derivations of the formulas necessary to compute a nonnormality measure of Sprott (1973) for random function models in the d = 1 case. Appendix B to Chapter 4 illustrates the robustness of the FBI through additional simulation studies from Nagy, Loeppky, and Welch (2007), including a wider range of parameter choices and smaller sample sizes for d = 1, ..., 10.

Bibliography

Box, G. E. P. and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B (Methodological), 26, 211–252.

Cressie, N. A. C. (1993), Statistics for Spatial Data, John Wiley & Sons.

Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991), "Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments," Journal of the American Statistical Association, 86, 953–963.

Karuri, S. W. (2005), "Integration in Computer Experiments and Bayesian Analysis," Ph.D. thesis, University of Waterloo.

Kass, R. E. and Slate, E. H. (1994), "Some Diagnostics of Maximum Likelihood and Posterior Nonnormality," The Annals of Statistics, 22, 668–695.

McKay, M. D., Beckman, R. J., and Conover, W. J. (1979), "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code," Technometrics, 21, 239–245.

Mease, D. and Bingham, D. (2006), "Latin Hyperrectangle Sampling for Computer Experiments,"
Technometrics, 48, 467-477.

Morris, M. D. and Mitchell, T. J. (1995), "Exploratory Designs for Computational Experiments," Journal of Statistical Planning and Inference, 43, 381-402.

Nagy, B., Loeppky, J. L., and Welch, W. J. (2007), "Fast Bayesian Inference for Gaussian Process Models," Tech. Rep. 230, Department of Statistics, The University of British Columbia.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), "Design and Analysis of Computer Experiments (C/R: P423-435)," Statistical Science, 4, 409-423.

Slate, E. H. (1991), "Reparameterizations of Statistical Models," Ph.D. thesis, Carnegie Mellon University.

Sprott, D. A. (1973), "Normal Likelihoods and Their Relation to Large Sample Theory of Estimation," Biometrika, 60, 457-465.

Tang, B. (1993), "Orthogonal Array-based Latin Hypercubes," Journal of the American Statistical Association, 88, 1392-1397.

Zhang, H. and Zimmerman, D. L. (2005), "Towards Reconciling Two Asymptotic Frameworks in Spatial Statistics," Biometrika, 92, 921-936.

Chapter 2

Quantifying Prediction Uncertainty in Computer Experiments with Fast Bayesian Inference

2.1 Introduction

Computer models have been used with great success throughout the sciences and engineering disciplines, for example in climate modeling, aviation, semiconductor design, nuclear safety, etc. Implemented as computer programs, deterministic models calculate an output y for a given input vector x. Depending on the complexity of the underlying mathematical model, this can be expensive computationally, creating a need for faster approximations. A common approach is to build a statistical model to approximate the output of the computer code. This has become known as the field of computer experiments in statistics, using Gaussian process (GP) models as computationally cheap surrogates (Sacks, Welch, Mitchell, and Wynn, 1989; Currin, Mitchell, Morris, and Ylvisaker, 1991; Welch, Buck, Sacks, Wynn, Mitchell, and Morris, 1992).
Trading off accuracy for speed is acceptable as long as we can measure how much the surrogate's prediction of the response might deviate from the real one. However, quantifying that uncertainty has been an ongoing challenge. (A version of this chapter has been submitted for publication. Authors: Nagy, B., Loeppky, J. L., Welch, W. J.)

This paper is about the prediction uncertainty that originates in the GP model itself and from estimating the parameters of the model. Of course, there are other sources of uncertainty that can be just as important. For instance, such a simple statistical model may be an oversimplification of the complex original model. But that is outside of the scope of this investigation. Our focus is on prediction within the class of functions defined by the GP model. The rationale is that if we cannot assess prediction uncertainty decently within this class of functions (that satisfy all of our assumptions), then we cannot realistically hope to do so when working with a different class of functions (that may not satisfy the modeling assumptions).

Prediction uncertainty can be quantified by a prediction band providing confidence limits for the response surface. The typical approach in computer experiments is to pretend that the response is a random realization of a GP that can be modeled with a modest number of parameters. Based on that assumption, it is straightforward to compute normality-based prediction limits using the standard error from the prediction variance formula. As long as the model parameters are known, this is a valid practice, resulting in confidence sets that by definition have a perfect match between nominal and actual coverage probabilities at all confidence levels.

However, in practice, most of the time the parameters of the GP model are not known but need to be estimated, usually by maximum likelihood. The resulting estimates are then often used as if they were the true values. But this ignores the uncertainty in estimating the parameters.
Plugging in the estimates in place of the true values in the prediction variance formula leads to prediction bands that are narrower than they should be. This has been a long-standing problem of the plug-in method (see Abt (1999) for a review).

In this article, we present a Bayesian way to deal with this problem and compare the frequentist properties of the traditional plug-in method with the new method, called Fast Bayesian Inference (FBI). This research was inspired by Karuri (2005), indicating the potential usefulness of the log transformation for the parameters. We concern ourselves with a noise-free GP model using the Gaussian covariance function. Model uncertainty is purposefully ignored by assuming that the response is a realization of such a Gaussian process. The only uncertainty left about the model is the exact values of its parameters. The main finding is that FBI can successfully propagate that parameter uncertainty into assessing prediction uncertainty, leading to improved frequentist properties of the resulting prediction bands.

After defining the GP model in the next section, Section 2.3 outlines the foundations for a computationally fast Bayesian analysis. Then two simulation studies are presented in Section 2.4, followed by two real examples in Section 2.5. The proposed method is described in detail in Sections 2.6 and 2.7. We finish the article with some concluding remarks in Section 2.8.

2.2 The Gaussian Process Model

Sacks et al. (1989) gave the following general model for a deterministic computer code y(x):
\[
Y(x) = \sum_j \beta_j f_j(x) + Z(x),
\]
that is, the sum of a regression model and a GP model Z(x) with mean zero. Note that no white noise term is necessary because of the deterministic nature of the code, i.e. if we rerun the simulator with the same input, we always get the same output. Often the regression component can be omitted, too (e.g.
see Chen (1996) and Steinberg and Bursztyn (2004)), because of the flexibility of the stochastic process that can easily take on the features of the underlying function. Following Linkletter, Bingham, Hengartner, Higdon, and Ye (2006) we assume a standardized response with mean zero (by subtracting the mean of all observations). Thus we model the computer code y(x) as if it were a realization of a mean zero Gaussian stochastic process Z(x) on the d-dimensional vector $x = (x_1, x_2, \ldots, x_d)^T$:
\[
Y(x) = Z(x).
\]
Hence all model parameters are in the covariance function:
\[
\mathrm{Cov}(Z(w), Z(x)) = \sigma^2 R(w, x),
\]
where $\sigma^2$ is the process variance and R(w, x) is the correlation between two configurations of the input vector, w and x:
\[
R(w, x) = \prod_{i=1}^{d} \exp\left\{ -\theta_i (w_i - x_i)^2 \right\}, \quad (2.1)
\]
where the positive $\theta_i$ range parameters control how variable the process is in a particular dimension. This is the Squared Exponential or Gaussian correlation function that is frequently used in computer experiments to model smooth response surfaces.

The likelihood is a function of $\sigma^2$ and the d-dimensional vector of range parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_d)^T$:
\[
L(\sigma^2, \theta) \propto \frac{1}{(\sigma^2)^{n/2} |R|^{1/2}} \exp\left\{ -\frac{y^T R^{-1} y}{2\sigma^2} \right\}, \quad (2.2)
\]
where y is the data vector of length n and R is the $n \times n$ design correlation matrix given by (2.1) for all pairs of input vector configurations in the data set. R is a function of the range parameter vector $\theta$. If $\theta$ is known, then the Best Linear Unbiased Predictor (BLUP) of the response at a new $x_0$ is
\[
\hat{y}_0(\theta) = r(x_0)^T R^{-1} y, \quad (2.3)
\]
where $r(x_0)$ is an $n \times$
$1$ vector of correlations between the new $x_0$ and the n design points (a function of $\theta$), again given by (2.1).

Furthermore, if $\sigma^2$ is also known, then the Mean Squared Error of the BLUP is
\[
\mathrm{MSE}_{\hat{y}_0}(\sigma^2, \theta) = \sigma^2 \left( 1 - r(x_0)^T R^{-1} r(x_0) \right), \quad (2.4)
\]
and these two formulas enable one to construct valid normality-based pointwise prediction bands, having a perfect match between nominal and true coverage at all levels under this model. However, that validity is dependent on the assumption that all parameters are known.

But in practice, often none of the parameters are known. Instead, they have to be estimated, usually by maximizing (2.2) to get the estimates $\hat{\sigma}^2$ and $\hat{\theta}$. When we plug in $\hat{\sigma}^2$ in place of $\sigma^2$ and $\hat{\theta}$ in place of $\theta$ in (2.3) and (2.4), we lose validity in the sense that the estimator of (2.4) based on $\hat{\sigma}^2$ and $\hat{\theta}$ is biased to be too small relative to the true mean squared error given by (2.4) based on $\sigma^2$ and $\theta$. In the computer experiments and the geostatistics literature, this problem is seen as a serious shortcoming of the traditional plug-in method (see the review in Abt (1999) for more details).

The root of this problem is ignoring the uncertainty due to estimating the model parameters. That suggests that a Bayesian approach could potentially help. However, before considering how to deal with parameter uncertainty, it is important to realize that not all parameters are created equal. We can see that $\theta$ exerts its influence on the BLUP (2.3) and its Mean Squared Error (2.4) in a highly nonlinear fashion through the correlation vector $r(x_0)$ and the correlation matrix R. In contrast, the dependence on $\sigma^2$ is much simpler. It is a factor in the MSE formula (2.4), but the BLUP itself is independent of $\sigma^2$. This has important implications when the parameters are unknown. It is easier to deal with uncertainty in $\sigma^2$ than in $\theta$ because the predictor is not affected by $\sigma^2$ and its MSE is simply proportional to $\sigma^2$.
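As a concrete illustration, formulas (2.1), (2.3), and (2.4) can be sketched in a few lines of numerical code. This is not the implementation used in the thesis; the function names (`gauss_corr`, `blup`) and the use of Python/NumPy are our own, for exposition only:

```python
import numpy as np

def gauss_corr(X1, X2, theta):
    """Gaussian correlation (2.1) between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * theta).sum(axis=2)
    return np.exp(-d2)

def blup(x0, X, y, theta, sigma2):
    """BLUP (2.3) and its mean squared error (2.4), for known sigma^2 and theta."""
    R = gauss_corr(X, X, theta)
    r0 = gauss_corr(x0[None, :], X, theta).ravel()
    yhat = r0 @ np.linalg.solve(R, y)                    # (2.3)
    mse = sigma2 * (1.0 - r0 @ np.linalg.solve(R, r0))   # (2.4)
    return yhat, mse
```

One property worth noting: at a design point the BLUP reproduces the observed output exactly and (2.4) returns zero, consistent with the deterministic nature of the simulator.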
In fact, we found that it is best to treat $\sigma^2$ as a nuisance parameter and eliminate it from the likelihood because it has a relatively minor role in quantifying prediction uncertainty. This can be done by either profiling or integrating, as shown later in Section 2.6. Either way, the result is the likelihood function $L(\theta)$ that is only a function of the range parameters, and this is the likelihood that we use for all subsequent calculations. Having eliminated $\sigma^2$, we can call $L(\theta)$ the profile likelihood or just simply the likelihood for short, and the log of this function the log-likelihood: $l(\theta) = \log L(\theta)$. In practice, one would optimize $l(\theta)$ numerically to get the Maximum Likelihood Estimate or MLE (e.g. see Welch et al. (1992)).

2.3 Outline of Fast Bayesian Inference

Bayesian statistics provides a natural way to incorporate parameter uncertainty into a predictive statistical model. However, a Bayesian approach immediately raises two nontrivial questions:

1. How to choose a prior?
2. How to sample from the posterior?

Usually, these are perceived as separate issues. But we choose a prior and a reparameterized likelihood that together lead to a posterior with a multivariate normal shape. This makes sampling from the posterior trivial. The basic idea is to use a parameterization that makes the likelihood approximately normal (Gaussian shape) and then to use a prior that makes the posterior normal. This way the problem is reduced to two questions:

1. How to get a nearly normal likelihood?
2. How to get a normal posterior?

The answer to the first question lies in reparameterization. We look at a family of transformations and pick one that is optimal (or nearly optimal) with respect to a criterion measuring the quality of a second-order approximation to the log-likelihood, and hence a Gaussian shape for the likelihood as a function of the transformed parameters.
We use the family of power transformations from Tukey (1957) (extended by Box and Cox (1964)), indexed by a real $\lambda$, that includes no transformation (for $\lambda = 1$) and the log transformation (for $\lambda = 0$) for the positive $\theta_i$ range parameters:
\[
\gamma_i =
\begin{cases}
\theta_i^{\lambda}, & \lambda \neq 0, \\
\log \theta_i, & \lambda = 0.
\end{cases}
\]
The same $\lambda$ value is used for all $\theta_i$ ($i = 1, \ldots, d$). To find the optimal $\lambda$ with the least observed nonnormality, we use a third derivative-based nonnormality measure from Kass and Slate (1994):
\[
\frac{1}{d^2} \sum_{i_1, i_2, i_3, i_4, i_5, i_6 = 1}^{d}
-\partial_{i_1 i_2} l(\mathrm{MLE})^{-1} \cdot \partial_{i_4 i_5} l(\mathrm{MLE})^{-1} \cdot \partial_{i_3 i_6} l(\mathrm{MLE})^{-1} \cdot \partial_{i_1 i_2 i_3} l(\mathrm{MLE}) \cdot \partial_{i_4 i_5 i_6} l(\mathrm{MLE}),
\]
where $\partial_{ij} l(\mathrm{MLE})$ denotes second and $\partial_{ijk} l(\mathrm{MLE})$ third partial derivatives of the log-likelihood function l, evaluated at the MLE. By minimizing this measure, one can find a log-likelihood that has relatively small third derivatives compared to second derivatives at the mode. This means that third-order nonnormality is minimized, making the shape of the likelihood approximately Gaussian in the neighborhood of the MLE. Simulations show that the optimal $\lambda$ values tend to cluster around zero (except in low dimensions). Thus, the log transformation for the range parameters empirically leads to a likelihood with approximately normal shape.

Having a nearly normal likelihood, we choose the multivariate normal posterior $N(\mathrm{MLE}, -H_{\mathrm{MLE}}^{-1})$, where $H_{\mathrm{MLE}}$ is the Hessian matrix of second derivatives of the log-likelihood at the MLE. Note that this is equivalent to choosing a prior. (But we never need to compute the prior; all we need is the posterior.) We can see that this prior is fairly non-informative because it leads to a posterior centered at the MLE, matching the curvature of the likelihood at the MLE up to the second order.
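The transformation family above is simple to implement. The following sketch (our own helper names, not code from the thesis) maps range parameters to and from the transformed scale:

```python
import numpy as np

def to_gamma(theta, lam):
    """Power transformation of the positive range parameters:
    gamma_i = theta_i^lam for lam != 0, and log(theta_i) at lam = 0."""
    theta = np.asarray(theta, dtype=float)
    return np.log(theta) if lam == 0 else theta ** lam

def to_theta(gamma, lam):
    """Inverse transformation back to the original positive scale."""
    gamma = np.asarray(gamma, dtype=float)
    return np.exp(gamma) if lam == 0 else gamma ** (1.0 / lam)
```

With $\lambda = 0$ this is the log transformation ultimately adopted for the FBI; the likelihood maximization can then run unconstrained in $\gamma$ and map back via the exponential.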
This seems sensible as this choice will not interfere much with the information coming from the data, which is contained in the likelihood function $L(\theta)$.

Although we are not interested in the prior per se and never compute it, we should point out connections with earlier work. A nearly normal likelihood implies a nearly uniform prior on the log scale. By a change of variables, we can verify that uniform priors on the log scale are equivalent to inverse priors on the original scale, which are known to approximate the Jeffreys prior in this case (Berger, De Oliveira, and Sansó, 2001). Using the results of Chen (1996), Karuri (2005) verified this approximation for d = 1 and d = 2 and suggested that similar results may hold in higher dimensions, too. Another way to justify using inverse priors is that they give prior weights inversely proportional to the parameter values, preventing overly large parameter estimates. For example, in our model, excessively large range parameter estimates could potentially underestimate the spatial correlations in the input space, undermining our spatial model.

We should also mention that Nagy, Loeppky, and Welch (2007) present the FBI from a slightly different viewpoint, namely as an approximation to the Bayesian method that uses uniform priors on the log scale for the range parameters. In this interpretation the FBI is only approximately Bayesian, since it is taking samples from the normal approximation of the posterior, where the posterior is proportional to the likelihood because of the uniform priors. But no matter which semantics we choose, the computations are always the same. More details are provided in Sections 2.6 and 2.7, but first we demonstrate through simulated and real examples that this method works remarkably well.

2.4 Simulations

The simulations were designed to mirror situations that one would expect to encounter in practice. That means balancing two main concerns.
The first one is that the design sample size n is often less than ideal because of the computational cost of obtaining data (slow simulators). The second is that n should still be large enough to enable meaningful prediction. One option is to tie it to the number of dimensions d. According to Sacks (Chapman, Welch, Bowman, Sacks, and Walsh, 1994; Loeppky, Sacks, and Welch, 2008), n = 10d could often serve as a rough initial estimate for an adequate sample size. Hence, the first set of simulations used this rule for d = 1, ..., 10. To ensure adequate prediction accuracy (with median prediction errors within 5% of the range of the data), the range parameters were set to $\theta = 25/(d+1)^2$ in all dimensions. (Of course, the model fitting procedure did not make the assumption that all range parameters were the same, and the resulting estimates were dispersed over a wide range.) The second set of simulations halved the sample size (n = 5d) while increasing correlations between the design sites (using $\theta = 5/(d+1)^2$) to maintain comparable prediction accuracy to the first study.

To obtain 1,000 replicates for a given combination of the sample size n and the common range parameter $\theta$, the following steps were repeated 1,000 times:

1. Select a random n point design by Latin hypercube sampling in $[0, 1]^d$ (McKay, Beckman, and Conover, 1979).
2. Sample 15 more points uniformly in $[0, 1]^d$ for prediction.
3. Generate a realization of the mean zero GP over the n + 15 points by setting the process variance to one and $\theta_i$ to $\theta$ for i = 1, ..., d.
4. Use the data for the n design points to fit the GP model.
5. Compute predictors with mean squared errors for the 15 additional points by the plug-in and FBI methods and then, for each $\alpha$ = 0.01, 0.02, ..., 0.99, construct 100(1 − $\alpha$)% pointwise prediction bands: predictor $\pm\, t_n^{\alpha/2} \sqrt{\mathrm{MSE}(\text{predictor})}$, where $t_n^{\alpha/2}$ is the upper $\alpha/2$ critical point of the $t_n$ distribution.
6.
Calculate coverage probabilities by counting how many of the 15 points were covered by the prediction bands of the plug-in and FBI for $\alpha$ = 0.01, 0.02, ..., 0.99.

Finally, the resulting actual coverage probabilities for both methods were averaged over the 1,000 replicates and plotted against the nominal levels for $\alpha$ = 0.01, 0.02, ..., 0.99. Note that although it is common to use normality-based prediction bands (especially for the plug-in), here we used the t-distribution with n degrees of freedom instead of the normal because it can slightly improve the match between nominal and true coverages, especially for small n. To make the comparison fair, here the $t_n$ distribution was used for the plug-in, too, to match Bayesian prediction bands based on the predictive distribution (O'Hagan, 1994; Santner, Williams, and Notz, 2003).

This simulation sequence was devised to represent a typical real world scenario. Latin hypercubes are the design of choice for GP models for prediction at new, untried inputs anywhere in $[0, 1]^d$. Although there are many improved variants of Latin hypercubes (Tang, 1993; Morris and Mitchell, 1995; Mease and Bingham, 2006), the original random version of McKay et al. (1979) was used here because of the enormous number of realizations generated. 1,000 replicates were used to make sure that both design and random generation effects were averaged out in the final calculation of coverage probabilities. In addition, using 15 random points for prediction (for each realization) gave a total sample size of 15,000 to average out all sampling effects.

For each realization, we tried several different values for $\lambda$, including choosing it dynamically by numerically minimizing the nonnormality measure of Kass and Slate (1994) with respect to $\lambda$. For the two simulation studies, Figures 2.1 and 2.2, respectively, show the distribution of the optimized $\lambda$ values (having the least nonnormality) over 1,000 simulated data sets each for d = 1, ..., 10.
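Steps 1 and 3 of the replicate procedure can be sketched as follows. This is a minimal NumPy illustration; the helper names and the small jitter term added for numerical stability are our own, not part of the original study:

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Step 1: random Latin hypercube of n points in [0,1]^d
    (the original design of McKay, Beckman, and Conover, 1979):
    each dimension receives exactly one point per equal-width stratum."""
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (strata + rng.random((n, d))) / n

def gp_realization(X, theta, rng, sigma2=1.0):
    """Step 3: draw one realization of the mean-zero GP with
    Gaussian correlation (2.1) over the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2 * theta).sum(axis=2)
    C = sigma2 * np.exp(-d2) + 1e-8 * np.eye(len(X))  # jitter for stability
    return np.linalg.cholesky(C) @ rng.standard_normal(len(X))
```

Fitting the model to the first n outputs and predicting the 15 held-out points then proceeds exactly as in Section 2.2.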
We can see that unless d is one or two, the optimal $\lambda$ is usually close to zero. Since in computer experiments we are primarily interested in high-dimensional applications, we chose $\lambda = 0$ because there is no evidence that any other value is better for high d. That means using the log transformation for $\theta_i$ (i = 1, ..., d).

Figures 2.3 and 2.4 contrast the frequentist performance of the plug-in and FBI methods by plotting nominal coverage levels (from 1% to 99%) vs. actual coverage for d = 1, 4, 7, and 10. The coverage probabilities were calculated by averaging over the 15 new points used for prediction and the 1,000 data sets (using $\lambda = 0$ for the log transformation). In addition to the solid line for the plug-in and the dashed line for FBI, a gray diagonal is also shown in the middle of each plot to help guide the eye: the closer the curves are to the diagonal, the better the match between nominal coverage (horizontally) and true coverage (vertically). Without exception, FBI achieved closer matching coverage than the plug-in at all levels for all d = 1, ..., 10 in both simulation studies. Results for d = 2, 3, 5, 6, 8, 9 were similar (not shown here). Except for d = 1, the dashed curves came remarkably close to the diagonal representing perfect matching (from 1% to 99% in Figures 2.3 and 2.4). Hence we can conclude that according to this frequentist criterion, FBI with $\lambda = 0$ provides approximately valid inference about prediction accuracy and is clearly superior to the plug-in method in this respect.

Other $\lambda$ values around zero yield similar results in terms of coverage probabilities. Also, using the optimal $\lambda$ for each data set (instead of a fixed value) has no additional benefit.
That suggests that the log transformation is nearly optimal in higher dimensions not only with respect to the nonnormality of the likelihood function, but also in terms of matching coverage probabilities of the FBI prediction bands.

[Figure 2.1: Optimal $\lambda$ values in the first simulation study (n = 10d) for d = 1, ..., 10.]

[Figure 2.2: Optimal $\lambda$ values in the second simulation study (n = 5d) for d = 1, ..., 10.]

[Figure 2.3: Coverage probabilities in the first simulation study (n = 10d) for d = 1, 4, 7, and 10.]

[Figure 2.4: Coverage probabilities in the second simulation study (n = 5d) for d = 1, 4, 7, and 10.]

2.5 Examples

The prediction uncertainty assessments of the two methods were also compared on two real data sets by a computationally intensive version of cross validation. This was done by randomly splitting the data in two (for training and validation) 100 times, and then averaging the resulting coverage probabilities the same way as for the 1,000 replicates for the simulations in Section 2.4. Here 100 replicates were sufficient because they were all subsets of the same data set. Also, a design size n < 5d was sufficient because of strong correlations between the design sites.

For a fixed design size n and a data set size m, the following steps were repeated 100 times:

1.
Select n points randomly (without replacement) from the available m points.

2. Use the data for those n points to fit the GP model, using the log transformation for the range parameters ($\lambda = 0$).
3. For each $\alpha$ = 0.01, 0.02, ..., 0.99, construct 100(1 − $\alpha$)% pointwise prediction bands for the remaining m − n points by both methods.
4. Calculate coverage probabilities by counting how many of those m − n points are covered by the prediction bands of the plug-in and FBI for $\alpha$ = 0.01, 0.02, ..., 0.99.

Finally, the resulting actual coverage probabilities were averaged over the 100 replicates and plotted against the nominal levels to facilitate visual comparison to the simulation results. When doing so, we have to keep in mind that there is an important difference between simulated and real data sets. When one generates data from the GP model repeatedly, then one can expect that over the long run, any useful inference method should show reasonably valid performance, since the data is from the true model, satisfying all modeling assumptions. But if the data comes from the real world, where the GP model may or may not be appropriate, that can potentially lead to other inference difficulties.

[Figure 2.5: Coverage probabilities for the Arctic sea ice example (n = 50).]

The first example from Chapman et al. (1994) had m = 157 data points in a 13-dimensional input space representing 157 successful runs of a dynamic-thermodynamic Arctic sea ice model with 13 inputs and four outputs. One of the outputs, sea ice velocity, proved especially resistant to prediction uncertainty assessments by the plug-in method, because the standard errors of the predictions were too small and, as a result, the prediction bands were always too narrow. To see whether FBI can quantify prediction uncertainty better, random subsets of n = 50 were chosen repeatedly (100 times) to fit the model, leaving the remaining 107 points for validation.
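The band construction and coverage bookkeeping in steps 3 and 4 can be sketched as below. One simplification to note: for brevity this sketch takes its critical values from the standard normal via the Python standard library, whereas the chapter uses $t_n$ quantiles (the two are close for moderate n such as n = 50); the function names are our own:

```python
import numpy as np
from statistics import NormalDist

def pointwise_bands(yhat, mse, alpha):
    """100(1 - alpha)% pointwise bands: predictor +/- critical value * sqrt(MSE)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * np.sqrt(mse)
    return yhat - half, yhat + half

def empirical_coverage(y_true, yhat, mse, alphas):
    """Step 4: fraction of held-out points enclosed by the bands, per nominal level."""
    cov = []
    for a in alphas:
        lo, hi = pointwise_bands(yhat, mse, a)
        cov.append(np.mean((lo <= y_true) & (y_true <= hi)))
    return np.array(cov)
```

Plotting `empirical_coverage(...)` against the nominal levels, averaged over the 100 random splits, reproduces the nominal-versus-actual comparisons shown in Figures 2.5 and 2.6.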
Figure 2.5 shows the coverage probabilities averaged over all repetitions. By looking at the solid line, it is apparent that the plug-in method indeed underestimated the uncertainty by a large margin. The dashed line for FBI is closer to the diagonal, indicating a better match.

[Figure 2.6: Coverage probabilities for the Wonderland example (n = 100).]

Figure 2.6 is for a more challenging 41-dimensional example with m = 500 data points, out of which n = 100 were used for fitting, leaving 400 for validation. This is the Wonderland simulator of Milik, Prskawetz, Feichtinger, and Sanderson (1996) for global sustainability with 41 inputs. Here the response is a human development index. Again, the prediction bands of the plug-in are too narrow and the FBI is also far from perfect, often making the opposite mistake by stretching the bands too wide, as indicated by the portion of the dashed line over the diagonal. (Although one could argue that overcoverage is often preferable to undercoverage.) But the true coverage of the FBI is still closer to the nominal than that of the plug-in at all confidence levels.

We can see that in both cases, the coverage of the FBI has larger deviations than in the simulations (undercovering at lower levels and overcovering at higher ones). Nevertheless, at the highest confidence levels it is close to the diagonal, indicating a good match. For example, at the 95% nominal level, the actual coverage of the FBI is 95.7% in both cases. In contrast, the plug-in's true coverage at the 95% level is only 67.8% in Figure 2.5 and 75.7% in Figure 2.6.

2.6 Dealing with the process variance

This section formally defines the likelihood $L(\theta)$ that is a function of only the range parameters. Two possible ways are presented for eliminating the process variance $\sigma^2$: "maximizing out" to get the profile likelihood and "integrating out"
to get the integrated likelihood (see Berger, Liseo, and Wolpert (1999) for a general discussion of these methods). While profiling is common in likelihood-based settings, Bayesians are usually more comfortable with integrating. Although in this case the same $L(\theta)$ function is obtained both ways, interpretations can still differ depending on the underlying framework.

2.6.1 Profile likelihood

Given $\theta$, $L(\sigma^2, \theta)$ in equation (2.2) has a unique maximum at
\[
\hat{\sigma}^2(\theta) = \frac{y^T R^{-1} y}{n}. \quad (2.5)
\]
This is easily obtained by differentiating $L(\sigma^2, \theta)$ with respect to $\sigma^2$, or by observing that given $\theta$ and y, the likelihood (2.2) is proportional to an Inverse Gamma density function with respect to the variable $\sigma^2$:
\[
\sigma^2 \mid \theta, y \sim \mathrm{IG}\left( \frac{n}{2} - 1, \frac{y^T R^{-1} y}{2} \right)
\]
and using the $b/(a+1)$ formula for the mode of an Inverse Gamma distribution $\mathrm{IG}(a, b)$ with density function
\[
f(x \mid a, b) = \frac{b^a \exp\{-b/x\}}{\Gamma(a)\, x^{a+1}}.
\]
Plugging in $\hat{\sigma}^2(\theta)$ from (2.5) into (2.2) yields the profile likelihood:
\[
L(\theta) \propto \frac{1}{(\hat{\sigma}^2(\theta))^{n/2} |R|^{1/2}} \exp\left\{ -\frac{y^T R^{-1} y}{2 \hat{\sigma}^2(\theta)} \right\} \propto (y^T R^{-1} y)^{-n/2} |R|^{-1/2}.
\]
Now the maximum likelihood estimation can be done using $L(\theta)$ instead of the original $L(\sigma^2, \theta)$, reducing the dimensionality of the required numerical optimization by one.

2.6.2 Integrated likelihood

Bayesians prefer to put a prior distribution on $\sigma^2$ before eliminating it. According to Berger et al. (2001), the most common choice is that of Handcock and Stein (1993), who used the improper prior $1/\sigma^2$ for $\sigma^2 > 0$. This can be interpreted as a relative weight function giving prior weights inversely proportional to the magnitude, encouraging $\sigma^2$ to be close to zero. Let $\pi(\theta)$ denote the prior for the range parameters, independent of $\sigma^2$.
Then the joint prior is of the form $\pi(\theta)/\sigma^2$ and the posterior is obtained by multiplying with the likelihood (2.2):
\[
\frac{\pi(\theta)}{\sigma^2} L(\sigma^2, \theta) \propto \frac{\pi(\theta)}{(\sigma^2)^{n/2+1} |R|^{1/2}} \exp\left\{ -\frac{y^T R^{-1} y}{2\sigma^2} \right\}
\]
and notice that
\[
\sigma^2 \mid \theta, y \sim \mathrm{IG}\left( \frac{n}{2}, \frac{y^T R^{-1} y}{2} \right),
\]
which means that $\sigma^2$ can be integrated out from the posterior to get the marginal posterior of $\theta$:
\[
\int_0^\infty \frac{\pi(\theta)}{(\sigma^2)^{n/2+1} |R|^{1/2}} \exp\left\{ -\frac{y^T R^{-1} y}{2\sigma^2} \right\} d\sigma^2
= \frac{\pi(\theta)\, \Gamma\!\left( \frac{n}{2} \right)}{\left( \frac{y^T R^{-1} y}{2} \right)^{n/2} |R|^{1/2}}
\propto \pi(\theta) L(\theta).
\]
Note that after integrating, the posterior for $\theta$ is proportional to the prior for $\theta$ times the same likelihood function $L(\theta)$ as above, which means that in this case profiling and integrating lead to the same likelihood function for the remaining parameters.

2.7 Fast Bayesian Inference in detail

Using the notation $\gamma = (\log \theta_1, \ldots, \log \theta_d)^T = \log \theta$ for the transformed parameter vector and $\theta = (\exp \gamma_1, \ldots, \exp \gamma_d)^T = \exp \gamma$ for the inverse transformation, the transformed likelihood function $L(\exp \gamma)$ tends to have a shape that is closer to a normal distribution with respect to $\gamma$ than the shape of the original $L(\theta)$ with respect to $\theta$. Working with log-likelihoods, the equivalent statement is that $l(\exp \gamma)$ is usually more quadratic than $l(\theta)$, which incidentally can also help the Maximum Likelihood Estimation that needs to be done numerically. Another advantage of the log transformation is that it makes the numerical optimization of the log-likelihood unconstrained: $\gamma \in \mathbb{R}^d$. This is the first step of Fast Bayesian Inference, which can be summarized as follows:

1. Maximize the log-likelihood $l(\exp \gamma)$ to get the MLE of $\gamma$, denoted $\hat{\gamma}$.
2. Compute the Hessian matrix of second derivatives of the log-likelihood at $\hat{\gamma}$, denoted $H_{\hat{\gamma}}$.
3.
Sample from the multivariate normal distribution $N(\hat{\gamma}, -H_{\hat{\gamma}}^{-1})$ to obtain M = 400 Monte Carlo samples: $\gamma^{(1)}, \ldots, \gamma^{(M)}$.
4. The FBI predictor is obtained by averaging:
\[
\frac{1}{M} \sum_{i=1}^{M} \hat{y}_0(\exp \gamma^{(i)}),
\]
and its Mean Squared Error can be computed by the variance decomposition formula:
\[
\frac{1}{M} \sum_{i=1}^{M} \mathrm{MSE}_{\hat{y}_0}\!\left( \hat{\sigma}^2(\exp \gamma^{(i)}),\, \exp \gamma^{(i)} \right)
+ \frac{1}{M-1} \sum_{j=1}^{M} \left( \hat{y}_0(\exp \gamma^{(j)}) - \frac{1}{M} \sum_{i=1}^{M} \hat{y}_0(\exp \gamma^{(i)}) \right)^2,
\]
that is, the average MSE of the plug-in predictors plus the sample variance of those predictors.

It is instructive to compare this sequence to the plug-in method (as described in Section 2.2). Both start by locating the MLE. After that the plug-in method jumps into the prediction phase right away, assuming that the value found at the mode is the one best estimate of the truth.

The FBI is more careful. In the second step it looks at the curvature of the log-likelihood at the MLE to quantify the uncertainty in the estimation of the point estimate. For example, if the surface is flat, that means high uncertainty, and the corresponding normal posterior in step 3 will have a high variance reflecting that uncertainty. In the final step, the FBI averages predictions based on the sample from that normal posterior. Again, there is a part that is identical to the plug-in method, since for each sample point, equations (2.3) and (2.4) are used to calculate the predictor and its Mean Squared Error, respectively (also using (2.5) to estimate $\sigma^2$ for a given $\gamma^{(i)}$ in the sample). This way the FBI will have many predictions to average (one for each sample point), while the plug-in method will have just one. Hence, the plug-in can be viewed as a special case of the FBI with Monte Carlo sample size M = 1.

2.8 Concluding remarks

We have introduced a new method for quantifying prediction uncertainty in computer experiments that is conceptually simple and easy to implement.
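Putting Sections 2.6 and 2.7 together, the FBI prediction step can be sketched compactly. The code below is a schematic NumPy illustration with our own naming; the numerical maximization in step 1 and the Hessian computation in step 2 are assumed to have been done already and are passed in as `gamma_mle` and `H`:

```python
import numpy as np

def corr(X1, X2, theta):
    """Gaussian correlation (2.1)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * theta).sum(axis=2)
    return np.exp(-d2)

def plugin_predict(x0, X, y, theta):
    """BLUP (2.3) and MSE (2.4), with sigma^2 profiled out via (2.5)."""
    R = corr(X, X, theta)
    r0 = corr(x0[None, :], X, theta).ravel()
    Rinv_y = np.linalg.solve(R, y)
    sigma2_hat = y @ Rinv_y / len(y)                        # (2.5)
    yhat = r0 @ Rinv_y                                      # (2.3)
    mse = sigma2_hat * (1.0 - r0 @ np.linalg.solve(R, r0))  # (2.4)
    return yhat, mse

def fbi_predict(x0, X, y, gamma_mle, H, M=400, rng=None):
    """Steps 3-4: sample gamma ~ N(gamma_mle, -H^-1), average the plug-in
    predictions, and combine MSEs by the variance decomposition formula."""
    rng = np.random.default_rng() if rng is None else rng
    G = rng.multivariate_normal(gamma_mle, -np.linalg.inv(H), size=M)
    preds, mses = zip(*(plugin_predict(x0, X, y, np.exp(g)) for g in G))
    preds = np.array(preds)
    return preds.mean(), np.mean(mses) + preds.var(ddof=1)
```

With M = 1, or with a posterior concentrated at the MLE, this collapses to the plug-in method, matching the observation above that the plug-in is a special case of the FBI.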
We have also shown how much the traditional plug-in method can underestimate prediction uncertainty by ignoring parameter uncertainty. Fast Bayesian Inference can potentially correct this deficiency by incorporating the uncertainty around the MLE. This is accomplished by utilizing a non-interfering prior that leaves the mode of the likelihood and the curvature at the mode unchanged, together with a normal posterior that matches that curvature (up to the second order). We have also found that the log transformation for the range parameters was effective in limiting (third-order) nonnormality. The main advantage of a normal posterior is that it allows one to draw independent samples from it directly, facilitating fast and easy Bayesian analysis. Although we are not dealing explicitly with the uncertainty in estimating the parameter $\sigma^2$, we have seen that incorporating only the uncertainty in estimating $\theta$ (and plugging in the MLE of $\sigma^2$ conditional on $\theta$) can propagate sufficient parameter uncertainty through the model for potentially valid prediction uncertainty assessments.

The implementation of the FBI method is straightforward, since it is a simple add-on to the plug-in. It can also be included in a commercial or open source software package as black box computer code, since the user does not need to know anything about its inner workings. Runtimes are comparable to those of the plug-in, since computations are dominated by the numerical optimization required to find the MLE. Hence, the word fast in the name of the method applies both to implementation (coding) time and to execution (run) time.

Finally, it is important to point out that when one expects the FBI to give valid prediction uncertainty assessments, one needs to keep in mind the two fundamental limitations of our study. The first one was mentioned already: the potential validity of the method rests on the assumption of a Gaussian process as the data generating mechanism.
However, for real data, this assumption may be inadequate or totally wrong, and results will depend entirely on the real underlying function.

The second serious limitation is that we studied the frequentist properties of the prediction bands in terms of coverage probabilities. Hence, validity is implied only over a long sequence of identical trials, according to the classical frequentist interpretation. But in practice, most of the time there is just one unique data set. However, the use of this criterion is not limited to frequentists. It is not uncommon for Bayesians to use it as a sanity check for their Bayesian credible regions. For example, Bayarri and Berger (2004) argue that "there is a sense in which essentially everyone should ascribe to frequentism" and provide the following version of the frequentist principle: "In repeated practical use of a statistical procedure, the long-run average actual accuracy should not be less than (and ideally should equal) the long-run average reported accuracy."

Bibliography

Abt, M. (1999), "Estimating the Prediction Mean Squared Error in Gaussian Stochastic Processes with Exponential Correlation Structure," Scandinavian Journal of Statistics, 26, 563–578.

Bayarri, M. J. and Berger, J. O. (2004), "The Interplay of Bayesian and Frequentist Analysis," Statistical Science, 19, 58–80.

Berger, J. O., De Oliveira, V., and Sansó, B. (2001), "Objective Bayesian Analysis of Spatially Correlated Data," Journal of the American Statistical Association, 96, 1361–1374.

Berger, J. O., Liseo, B., and Wolpert, R. L. (1999), "Integrated Likelihood Methods for Eliminating Nuisance Parameters," Statistical Science, 14, 1–28.

Box, G. E. P. and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society. Series B (Methodological), 26, 211–252.

Chapman, W. L., Welch, W. J., Bowman, K. P., Sacks, J., and Walsh, J. E. (1994), "Arctic sea ice variability: Model sensitivities and a multidecadal simulation,"
Journal of Geophysical Research, 99, 919–936.

Chen, X. (1996), "Properties of Models for Computer Experiments," Ph.D. thesis, University of Waterloo.

Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991), "Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments," Journal of the American Statistical Association, 86, 953–963.

Handcock, M. S. and Stein, M. L. (1993), "A Bayesian Analysis of Kriging," Technometrics, 35, 403–410.

Karuri, S. W. (2005), "Integration in Computer Experiments and Bayesian Analysis," Ph.D. thesis, University of Waterloo.

Kass, R. E. and Slate, E. H. (1994), "Some Diagnostics of Maximum Likelihood and Posterior Nonnormality," The Annals of Statistics, 22, 668–695.

Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., and Ye, K. Q. (2006), "Variable Selection for Gaussian Process Models in Computer Experiments," Technometrics, 48, 478–490.

Loeppky, J. L., Sacks, J., and Welch, W. J. (2008), "Choosing the Sample Size of a Computer Experiment: A Practical Guide," Tech. Rep. 238, Department of Statistics, The University of British Columbia.

McKay, M. D., Beckman, R. J., and Conover, W. J. (1979), "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code," Technometrics, 21, 239–245.

Mease, D. and Bingham, D. (2006), "Latin Hyperrectangle Sampling for Computer Experiments," Technometrics, 48, 467–477.

Milik, A., Prskawetz, A., Feichtinger, G., and Sanderson, W. C. (1996), "Slow-fast dynamics in Wonderland," Environmental Modeling and Assessment, 1, 3–17.

Morris, M. D. and Mitchell, T. J. (1995), "Exploratory Designs for Computational Experiments," Journal of Statistical Planning and Inference, 43, 381–402.

Nagy, B., Loeppky, J. L., and Welch, W. J. (2007), "Fast Bayesian Inference for Gaussian Process Models," Tech. Rep. 230, Department of Statistics, The University of British Columbia.

O'Hagan, A.
(1994), Kendall's Advanced Theory of Statistics. Volume 2B: Bayesian Inference, Cambridge University Press.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), "Design and Analysis of Computer Experiments (C/R: P423-435)," Statistical Science, 4, 409–423.

Santner, T. J., Williams, B. J., and Notz, W. (2003), The Design and Analysis of Computer Experiments, Springer Verlag, New York.

Steinberg, D. M. and Bursztyn, D. (2004), "Data Analytic Tools for Understanding Random Field Regression Models," Technometrics, 46, 411–420.

Tang, B. (1993), "Orthogonal Array-based Latin Hypercubes," Journal of the American Statistical Association, 88, 1392–1397.

Tukey, J. W. (1957), "On the Comparative Anatomy of Transformations," The Annals of Mathematical Statistics, 28, 602–632.

Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T. J., and Morris, M. D. (1992), "Screening, Predicting, and Computer Experiments," Technometrics, 34, 15–25.

Chapter 3

Inference for covariance parameters of a random function by likelihood-based approximations

3.1 Introduction

Random function models, also known as Gaussian process models or kriging models, have a long history in spatial statistics (Cressie, 1993). Other important application areas include the design and analysis of computer experiments, dating back to Sacks, Welch, Mitchell, and Wynn (1989), and, more recently, machine learning (Rasmussen and Williams, 2006).

Although sometimes the model parameters themselves can be of interest (Mardia and Marshall, 1984; Abt and Welch, 1998; Wang and Zhang, 2003), usually one is more interested in prediction than parameter estimation. Likewise, the main interest is in quantifying prediction uncertainty instead of parameter uncertainty. However, ignoring the uncertainty in estimating the parameters leads to underestimating the uncertainty in predictions (Abt, 1999).

A version of this chapter will be submitted for publication.
Authors: Nagy, B., Loeppky, J. L., Welch, W. J.

Our interest in this problem arose because of Fast Bayesian Inference (FBI) for deterministic computer codes, as described in Chapter 2, suggesting that quantifying parameter uncertainty was the key to good prediction uncertainty assessments. Our primary goal in this chapter is to investigate how well the covariance parameters can be estimated using the FBI framework. A secondary goal is to evaluate how well the likelihood can be approximated by a normal density function, which is another important aspect of the FBI method and its ability to accurately and efficiently assess the uncertainty in predictions.

In computer experiments, a random function model is used as a computationally cheap statistical surrogate for a complex mathematical model, implemented as a computer code. Often it takes a considerable amount of time to run the code because of the large amounts of computation involved. In general, it is not possible to run the code at each input combination of interest because that would lead to a combinatorial explosion for models with several input variables. In these cases the surrogate can be used to approximate the output of the code, based on the outputs from a relatively small sample from the input space.

Hence, from the practitioners' point of view, small sample results are more relevant in computer experiments than large sample results. We evaluate small sample properties by extensive simulations. Existing theory is not very helpful in this context, since it is built mostly on asymptotic arguments (see Stein (1999); Zhang and Zimmerman (2005); Furrer (2005) for the current state of the art of theoretical development).

After reviewing the statistical model used by FBI in the next section, together with the related issue of reparameterizations, Section 3.3 describes two sets of simulations using the same model as in Chapter 2, with the same reparameterization (log transformation).
Section 3.4 presents the simulation results for the estimation of the parameters, including an assessment of the uncertainty in the estimation by individual and joint confidence sets. Likelihood-based and Bayesian methods used to obtain those results are discussed in Section 3.5. We finish the chapter with some concluding remarks in Section 3.6.

3.2 Statistical Model

We consider a deterministic computer code with a single output that is a smooth function of $d \geq 1$ input variables. Here we reuse the model in Section 2.2, Chapter 2, which is a version of the statistical formulation in Sacks et al. (1989), Currin, Mitchell, Morris, and Ylvisaker (1991), or Welch, Buck, Sacks, Wynn, Mitchell, and Morris (1992), treating the response (code output) as if it were a realization of a real-valued, zero-mean Gaussian stochastic process $Z(x)$ on the $d$-dimensional real vector $x = (x_1, x_2, \ldots, x_d)^T$:
\[
Y(x) = Z(x).
\]
$Z(x)$ is parameterized by the process variance $\sigma^2$ and the $\theta_i$ range parameters in the Gaussian correlation function:
\[
\mathrm{Corr}(Z(w), Z(x)) = R(w, x) = \prod_{i=1}^{d} \exp\left\{ -\theta_i (w_i - x_i)^2 \right\}, \tag{3.1}
\]
specifying that the correlation is a function of the squared distance between the coordinates of the input vectors $w$ and $x$, scaled by the $\theta_i$ parameters along the $d$ dimensions ($i = 1, \ldots, d$).

3.2.1 Likelihood

The likelihood is a function of $d + 1$ variables: the range parameters in the $d$-dimensional vector $\theta = (\theta_1, \theta_2, \ldots, \theta_d)^T$ and the process variance $\sigma^2$:
\[
L(\sigma^2, \theta) \propto \frac{1}{(\sigma^2)^{n/2}\,|R|^{1/2}}\, \exp\left\{ -\frac{y^T R^{-1} y}{2\sigma^2} \right\}, \tag{3.2}
\]
where the $n \times$
$1$ vector $y$ contains the $n$ outputs of the computer code for the $n$ design points in the input space, and $R$ is the $n \times n$ design correlation matrix (a function of $\theta$), as specified by (3.1).

3.2.2 Profile likelihood

By differentiating (3.2) with respect to $\sigma^2$, we get that, given $\theta$, the likelihood $L(\sigma^2, \theta)$ reaches its maximum at
\[
\hat\sigma^2(\theta) = \frac{y^T R^{-1} y}{n}. \tag{3.3}
\]
Now if we plug $\hat\sigma^2(\theta)$ in (3.3) in place of $\sigma^2$ into (3.2), we get the profile likelihood $L(\theta)$ that is only a function of the $d$ range parameters:
\[
L(\theta) \propto \frac{1}{(\hat\sigma^2(\theta))^{n/2}\,|R|^{1/2}}\, \exp\left\{ -\frac{y^T R^{-1} y}{2\hat\sigma^2(\theta)} \right\} \propto (y^T R^{-1} y)^{-n/2}\, |R|^{-1/2}. \tag{3.4}
\]
Note that one can get the Maximum Likelihood Estimate (MLE) of $\theta$, denoted $\hat\theta$, by maximizing $L(\theta)$, and then get the MLE of $\sigma^2$ by plugging $\hat\theta$ into (3.3).

3.2.3 Log transformation

For parameters that can take only positive values, the log transformation is commonly employed in statistics for various reasons. One such objective is to improve normality of the likelihood for small sample sizes, as argued by Sprott (1973). A thorough investigation of this subject was provided by Slate (1991), showing how reparameterizations of statistical models can make the shape of the likelihood or posterior more Gaussian, enabling good normal approximations.

Karuri (2005) observed that, in a Bayesian setting, the log transformation of the range parameters in a random function model improved approximate normality of the posterior for one- and two-dimensional examples and demonstrated its usefulness for integration and prediction.

Nagy, Loeppky, and Welch (2007a) found that in the one-dimensional ($d = 1$) case this was a general trend for this model, too: the log transformation tends to reduce nonnormality of the profile likelihood, as quantified by two nonnormality measures in Sprott (1973).
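To make equations (3.1)-(3.4) concrete, the following Python sketch (our own illustration, not the thesis code) evaluates the Gaussian correlation matrix and the negative log of the profile likelihood on the log-$\theta$ scale; the small jitter added to $R$ is a pragmatic numerical guard against ill-conditioning, not part of the model.

```python
import numpy as np

def gauss_corr(X, theta):
    """Gaussian correlation matrix R of (3.1) for an n x d design X."""
    sq = (X[:, None, :] - X[None, :, :]) ** 2       # pairwise squared distances
    return np.exp(-(sq * theta).sum(axis=2))

def neg_log_profile_lik(log_theta, X, y):
    """Negative log of the profile likelihood (3.4), parameterized as
    gamma = log(theta) so that numerical optimization is unconstrained."""
    n = len(y)
    R = gauss_corr(X, np.exp(log_theta))
    L = np.linalg.cholesky(R + 1e-10 * np.eye(n))   # jitter for stability
    z = np.linalg.solve(L, y)
    quad = z @ z                                    # y^T R^{-1} y
    log_det_R = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * n * np.log(quad) + 0.5 * log_det_R

# Given the minimizing theta, the MLE of sigma^2 follows from (3.3): quad / n.
```

Minimizing `neg_log_profile_lik` over `log_theta` (e.g. with a quasi-Newton routine) yields $\hat\gamma$, and (3.3) then gives the MLE of $\sigma^2$.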
(In subsection 3.2.5, we revisit one of those measures to illustrate which transformations one could expect to be optimal for reducing nonnormality in the $d = 1$ case.)

In Chapter 2, and earlier in Nagy, Loeppky, and Welch (2007b), we demonstrated the usefulness of working on the log scale for $d = 1, \ldots, 10$ for the prediction uncertainty problem across a wide range of parameter settings. Using a multivariate nonnormality measure of Kass and Slate (1994), in Section 2.4, Chapter 2, we also showed that the log transformation was nearly optimal for large $d$ in the class of power transformations.

3.2.4 Example

To give some intuition about the relationship between the likelihood, the profile likelihood, and the log transformation, we present a one-dimensional ($d = 1$) toy example. Although the log transformation is rarely ideal for $d = 1$ (as we will show in the next subsection), it can still illustrate the general principles using the simplest possible case (and leave it up to the readers' imagination to extrapolate from that to higher-dimensional cases).

This example was created the following way: after simulating $n = 3$ data points from a one-dimensional random function repeatedly, using $\theta = 0.2$, $\sigma^2 = 1$, and an equispaced design $\{0, 0.5, 1\}$, we chose a realization where the log transformation was particularly successful in improving the approximate normality of the profile likelihood (for other realizations the approximation was also substantially helped by the log transformation, but in general not as much as for the one chosen here for illustration; see Nagy et al.
(2007a) for simulations and quantitative arguments, based on two nonnormality measures of Sprott (1973), about the effect of the log transformation for $d = 1$ and $n = 3, 6, 9, 12$).

[Figure 3.1 appears here: four panels showing likelihood contours and profile likelihoods on the original and log scales.]

Figure 3.1: The log transformation improved approximate normality of the profile likelihood for this one-dimensional ($d = 1$) example. The top two plots are for the two-parameter likelihood and the bottom two for the one-parameter profile likelihood. The ridges of the contours are marked by the dashed lines, reaching their apex at the MLE. Below the contour plots, these dashed lines are plotted as functions of the range parameter, representing the profile likelihood function. In addition to the profile likelihoods (dashed curves), their normal approximation is also shown for comparison (dotted curves).

Likelihood functions for this example are plotted in Figure 3.1. On the original scale (left), the contour plot of the two-parameter likelihood (3.2) is highly nonnormal, having a banana-shaped peak around the Maximum Likelihood Estimate and a sharp ridge along the axes, marked by the dashed line. Below the contour plot, the one-parameter version of this dashed line is also highly nonnormal. This is the profile likelihood (3.4) that can be obtained by maximizing (3.2) over all $\sigma^2$ given $\theta$. The dotted line is an unnormalized normal density function centered on the MLE of the range parameter with variance set to the negative inverse of the second derivative of the log profile likelihood at the MLE.
We can see that this normal approximation of $L(\theta)$ is a poor approximation of the profile likelihood.

In contrast, on the log scale (right), the contours are more ellipsoidal, suggesting less nonnormality. Below that, the difference is even more striking for the profile likelihood (dashed) that is virtually indistinguishable from its normal approximation (dotted) over the domain of $\log\theta$ shown (corresponding to the domain of $\theta$ on the left). At first look it may not be apparent that there are two separate lines in this plot (one dashed and one dotted) that overlap almost perfectly.

In Chapter 2, we used the log transformation to quantify prediction uncertainty for $d = 1, \ldots, 10$. We follow this reparameterization in this chapter for both the process variance and the range parameters. All of these parameters are positive-valued, and here we work with all of them on the log scale for estimation purposes.

The nonnormality measure of Kass and Slate (1994) used in Chapter 2 indicated that although the log transformation was nearly optimal in higher dimensions, this was not necessarily the case in low dimensions. This was especially apparent for $d = 1$, and we decided to double-check that finding using a different measure that is the subject of the next subsection.

3.2.5 One-dimensional special case

For a scalar $\theta$, let $l(\theta) = \log L(\theta)$ denote the logarithm of the profile likelihood, $\hat\theta$ the Maximum Likelihood Estimate (MLE) of $\theta$, and $l''(\hat\theta)$ and $l'''(\hat\theta)$ the second and third derivatives of $l(\theta)$ at the MLE, respectively. Following Sprott (1973), define the Expected nonnormality (ENN) measure for $\theta$:
\[
\text{ENN for } \theta = \left| E\,l'''(\hat\theta)\, \left( -E\,l''(\hat\theta) \right)^{-3/2} \right|.
\]
The intuition is that the expectation of the third derivative, standardized by the expectation of the second derivative, measures the deviation from normality (see Appendix A for taking expectations).
This measure is appropriate when one wishes to consider a family of possible likelihoods without conditioning on any particular data set. Sprott (1973) also provided a formula that quantifies the effect of a transformation $\phi$ on nonnormality, where $\phi$ is a twice differentiable function of $\theta$. After the $\phi$ transformation,
\[
\text{ENN for } \phi(\theta) = \left| E\,l'''(\hat\theta)\, \left( -E\,l''(\hat\theta) \right)^{-3/2} + 3\, \frac{\phi''(\hat\theta)}{\phi'(\hat\theta)}\, \left( -E\,l''(\hat\theta) \right)^{-1/2} \right|,
\]
where the first term inside the absolute value is the same as before in the definition of the ENN for $\theta$ and the second term is the effect of the transformation $\phi$. As in Chapter 2, we use the family of power transformations originally explored by Tukey (1957) and later extended by Box and Cox (1964), indexed by a real $\lambda$ that includes no transformation (for $\lambda = 1$) and the log transformation (for $\lambda = 0$) for the positive $\theta$ range parameter:
\[
\phi(\theta) =
\begin{cases}
\theta^\lambda, & \lambda \neq 0, \\
\log\theta, & \lambda = 0.
\end{cases}
\]

[Figure 3.2 appears here: curves of optimal $\hat\lambda$ values plotted against $\hat\theta$, one curve per sample size.]

Figure 3.2: Optimal $\hat\lambda$ values for $d = 1$ and $n = 3, \ldots, 10$ as a function of $\hat\theta$. The digits 3, ..., 9 in the plot represent the design sample size $n$ and the digit 0 represents $n = 10$. The lines for $n = 8$, $n = 9$, and $n = 10$ do not start on the left side of the plot because of numerical difficulties for small $\hat\theta$.

The equation $\text{ENN for } \phi(\theta) = 0$ has the following solution for $\lambda$:
\[
\hat\lambda = 1 + \frac{\hat\theta\, E\,l'''(\hat\theta)}{3\, E\,l''(\hat\theta)}.
\]
The optimal $\hat\lambda$ values for $d = 1$, $n = 3, \ldots, 10$, and $\hat\theta$ between 0.2 and 20 are plotted in Figure 3.2.
Unlike the simulations, where we use random Latin hypercubes (McKay, Beckman, and Conover, 1979), the design here is fixed and equally spaced: $\{\, i/(n-1) : i = 0, \ldots, n-1 \,\}$. Note that some values are missing for $n = 8$, 9, and 10 because of ill-conditioned correlation matrices that could not be inverted for small $\hat\theta$ (see Appendix A for more details).

Now we can see that it is no coincidence that the normal approximation of the profile likelihood for the one-dimensional example in Figure 3.1 is so good on the log scale, since $\hat\lambda$ is close to zero for small $\hat\theta$ when $n = 3$. The values in this plot are also consistent with the box-plots for $d = 1$ in Figures 2.1 and 2.2 in Chapter 2, indicating that the optimal $\hat\lambda$ can be substantially less than zero for larger $\hat\theta$ or larger $n$. This also suggests an explanation of the anomaly that the results of the FBI in the $d = 1$ case are often less satisfactory than the results for $d > 1$ when trying to quantify prediction uncertainty in Chapter 2 and Nagy et al. (2007b). We will see that the $d = 1$ case is also quite special with respect to parameter estimation when we present our results in Section 3.4.

3.3 Simulations

To be able to compare prediction uncertainty assessments in Chapter 2 with parameter uncertainty assessments in this chapter, we replicated the simulations in Chapter 2, setting the common range parameter $\theta = 25/(d+1)^2$ and the sample size $n = 10d$ for the first set, and $\theta = 5/(d+1)^2$, $n = 5d$ for the second set ($d = 1, \ldots, 10$). The process variance $\sigma^2$ was fixed at constant 1 for all 20 simulations. To obtain 1,000 replicates for a given combination of the sample size $n$ and the range parameter $\theta$, the following steps were repeated 1,000 times:

1. Select a random $n$-point design by Latin hypercube sampling in $[0, 1]^d$ (McKay et al., 1979).

2. Generate a realization of the Gaussian process over the $n$ points by setting $\sigma^2$ to 1 and $\theta_i$ to $\theta$ for $i = 1, \ldots, d$.

3.
Find the MLE of the range parameters by numerically optimizing the log profile likelihood, and then apply formula (3.3) to get the MLE of $\sigma^2$.

4. Estimate the parameters together with standard errors based on the MLE and the observed information, i.e., standard errors were obtained by taking square roots of the diagonal elements of the negative inverse of the Hessian matrix of second derivatives evaluated at the MLE.

Using the notation $\xi = \log\sigma^2$ for the log transformed process variance and $\gamma = (\log\theta_1, \ldots, \log\theta_d)^T = \log\theta$ for the transformed $\theta$ vector, and $\sigma^2 = \exp\xi$, $\theta = (\exp\gamma_1, \ldots, \exp\gamma_d)^T = \exp\gamma$ for the inverse transformations, the Hessian at the MLE is $\nabla^2 \log L(\exp\hat\xi, \exp\hat\gamma)$, where $\hat\xi$ and $\hat\gamma$ are the maximum likelihood estimates of $\xi$ and the vector $\gamma$, respectively, and $\nabla^2 \log L(\exp\xi, \exp\gamma)$ is defined as
\[
\begin{pmatrix}
\dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\xi^2} & \dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\xi\, \partial\gamma_1} & \cdots & \dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\xi\, \partial\gamma_d} \\[2ex]
\dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\gamma_1\, \partial\xi} & \dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\gamma_1^2} & \cdots & \dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\gamma_1\, \partial\gamma_d} \\[2ex]
\vdots & \vdots & \ddots & \vdots \\[2ex]
\dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\gamma_d\, \partial\xi} & \dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\gamma_d\, \partial\gamma_1} & \cdots & \dfrac{\partial^2 \log L(\exp\xi, \exp\gamma)}{\partial\gamma_d^2}
\end{pmatrix}.
\]
The observed information matrix is the negative Hessian matrix of second derivatives evaluated at the MLE: $-\nabla^2 \log L(\exp\hat\xi, \exp\hat\gamma)$. In Section 3.5 we describe well-known asymptotic methods using the inverse of this matrix to quantify the uncertainty in the estimation of the parameters $\xi$ and $\gamma_i$ ($i = 1, \ldots, d$).
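Step 4 can be illustrated with a short sketch (ours, with a generic numerically differentiated function standing in for $\log L(\exp\xi, \exp\gamma)$): the standard errors are the square roots of the diagonal of the negative inverse Hessian.

```python
import numpy as np

def num_hessian(f, x, h=1e-4):
    """Central-difference Hessian of a scalar function f at x, a generic
    stand-in for the Hessian of the log-likelihood at the MLE."""
    d = len(x)
    H = np.zeros((d, d))
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + h * I[i] + h * I[j]) - f(x + h * I[i] - h * I[j])
                       - f(x - h * I[i] + h * I[j]) + f(x - h * I[i] - h * I[j])) / (4 * h * h)
    return H

def standard_errors(hessian):
    """Square roots of the diagonal of the negative inverse Hessian,
    i.e. of the inverse observed information matrix."""
    return np.sqrt(np.diag(-np.linalg.inv(hessian)))
```

For an exactly quadratic log-likelihood $l(x) = -\tfrac{1}{2} x^T A x$, this recovers $\sqrt{\mathrm{diag}(A^{-1})}$, which is the normality-based benchmark the simulations below test against.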
But before that, in the next section, we first present a summary of the results, showing that on the log scale these methods work quite well for finite sample sizes.

Summary statistics for the 1,000 replicates were obtained for the log transformed model parameters. Here we outline the quantities calculated for the estimator of the log transformed process variance $\xi = \log\sigma^2$. (Similar summaries are presented for the other estimators in Section 3.4.)

1. The average estimate of $\xi$ over the 1,000 replicates is given by
\[
\bar{\hat\xi} = \frac{1}{1000} \sum_{i=1}^{1000} \hat\xi^{(i)},
\]
where $\hat\xi^{(i)}$ is estimated from the $i$th data set ($i = 1, \ldots, 1000$).

2. The bias of the estimator $\hat\xi$ is estimated by subtracting the real value from the mean estimate:
\[
\text{Bias}_{\hat\xi} = \bar{\hat\xi} - \xi.
\]

3. A p-value is attached to this bias by doing a two-sided, one-sample t-test on $\{\, \hat\xi^{(1)} - \xi, \ldots, \hat\xi^{(1000)} - \xi \,\}$, to test if it is significantly different from zero.

4. The sample variance of the estimator $\hat\xi$ is
\[
\text{Variance}_{\hat\xi} = \frac{1}{999} \sum_{i=1}^{1000} \left( \hat\xi^{(i)} - \bar{\hat\xi} \right)^2.
\]

5. The Mean Squared Error (MSE) of $\hat\xi$ is
\[
\text{MSE}_{\hat\xi} = \frac{1}{1000} \sum_{i=1}^{1000} \left( \hat\xi^{(i)} - \xi \right)^2.
\]

6. The estimated Mean Squared Error ($\widehat{\text{MSE}}$) of $\hat\xi$ is the average of the first diagonal elements of the inverse observed information matrix:
\[
\widehat{\text{MSE}}_{\hat\xi} = \frac{1}{1000} \sum_{i=1}^{1000} \left[ -\nabla^2 \log L\!\left( \exp\hat\xi^{(i)}, \exp\hat\gamma^{(i)} \right) \right]^{-1}_{(1,1)},
\]
where $\hat\xi^{(i)}$ and $\hat\gamma^{(i)}$ denote the MLEs of $\xi$ and $\gamma$, respectively, estimated from the $i$th data set ($i = 1, \ldots, 1000$).

Coverage probabilities for confidence intervals, credible intervals, and multidimensional confidence regions were also calculated by counting how many of the 1,000 replicates were covered by the $100(1-\alpha)\%$ confidence or credible sets for $\alpha = 0.01, 0.02, \ldots, 0.99$.
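The per-replicate summary quantities above (items 1-5) can be sketched as follows; this is our illustrative Python, with `scipy.stats.t` supplying the two-sided one-sample t-test p-value of item 3.

```python
import numpy as np
from scipy.stats import t as t_dist

def summarize(estimates, true_value):
    """Replicate summaries: mean estimate, bias, two-sided one-sample
    t-test p-value for the bias, sample variance, and MSE."""
    est = np.asarray(estimates, dtype=float)
    m = len(est)
    mean = est.mean()
    bias = mean - true_value
    var = est.var(ddof=1)                     # sample variance (divisor m - 1)
    mse = ((est - true_value) ** 2).mean()
    t_stat = bias / np.sqrt(var / m)          # one-sample t statistic on the errors
    p = 2.0 * t_dist.sf(abs(t_stat), df=m - 1)
    return {"mean": mean, "bias": bias, "p": p, "variance": var, "mse": mse}
```

Note the familiar decomposition MSE = variance + bias$^2$, up to the $m/(m-1)$ factor in the sample variance.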
More details about these procedures are given in Section 3.5, after the results are presented in the next section.

3.4 Results

3.4.1 Point estimation

The range parameters are estimated by maximum likelihood. Tables 3.1 and 3.2 summarize the results of estimating the first parameter of the log transformed $\theta$ vector: $\log\theta_1$. All numbers are on the log scale. The first feature that jumps out from both tables is the negative bias that, judging by the p-values, seems significant in all 20 cases, except for $d = 2$ in Table 3.1. This means that correlations between the responses have a tendency to appear stronger than they really are. However, considering the magnitude of the variance, this bias is relatively unimportant (i.e., statistical significance does not necessarily imply practical significance).

In Table 3.1, the $\widehat{\text{MSE}}$ column slightly underestimates the MSE column, but overall it is fairly close, meaning that it can measure well the uncertainty in the point estimation when $n = 10d$. But when $n = 5d$ (Table 3.2), the $\widehat{\text{MSE}}$ measure can become numerically unstable, as evidenced by the inflated numbers for $d = 5, 6, 8, 9, 10$ in Table 3.2.
This is caused by extreme uncertainty in some cases, when the likelihood surface at the mode is essentially flat in certain directions (i.e., the MLE is on a high-dimensional ridge), and near-zero second derivatives can lead to inflated inverses, rendering the $\widehat{\text{MSE}}$ measure effectively useless. One way to remedy this situation is to detect outliers and eliminate them from the $\widehat{\text{MSE}}$ statistic. However, judging what is an outlier caused by numerical issues and what is not is inherently subjective. In the next subsection we present a better way to assess parameter uncertainty graphically, instead of just relying on a single number.

 d    Real     Bias     p-value    Variance   MSE     $\widehat{\text{MSE}}$
 1    1.833   -0.019    7.51e-05    0.023    0.023    0.018
 2    1.022   -0.012    0.126       0.059    0.059    0.054
 3    0.446   -0.051    6.73e-07    0.104    0.106    0.088
 4    0.000   -0.052    5.57e-06    0.132    0.135    0.114
 5   -0.365   -0.057    9.19e-06    0.161    0.164    0.136
 6   -0.673   -0.048    0.000204    0.165    0.168    0.151
 7   -0.940   -0.066    2.21e-06    0.193    0.198    0.170
 8   -1.176   -0.051    0.000566    0.222    0.224    0.178
 9   -1.386   -0.086    2.45e-08    0.235    0.242    0.199
10   -1.577   -0.092    4.59e-10    0.214    0.222    0.203

Table 3.1: MLE of $\log\theta_1$ in the first simulation study ($n = 10d$).

 d    Real     Bias     p-value    Variance   MSE     $\widehat{\text{MSE}}$
 1    0.223   -0.070    6.71e-05    0.304    0.309     0.161
 2   -0.588   -0.081    1.77e-05    0.349    0.356     0.243
 3   -1.163   -0.103    9.49e-07    0.440    0.450     0.312
 4   -1.609   -0.142    1.88e-10    0.489    0.509     0.375
 5   -1.974   -0.169    2.35e-08    0.906    0.934    57.134
 6   -2.282   -0.172    8.96e-07    1.217    1.246    98.377
 7   -2.549   -0.183    4.22e-12    0.680    0.713     0.481
 8   -2.785   -0.244    8.49e-09    1.758    1.815    40.899
 9   -2.996   -0.206    3.75e-07    1.621    1.662    28.889
10   -3.186   -0.259    5.78e-09    1.950    2.016    20.442

Table 3.2: MLE of $\log\theta_1$ in the second simulation study ($n = 5d$).

The process variance can also be estimated by maximum likelihood. Tables 3.3 and 3.4 contain the results for the MLE of $\log\sigma^2$ for the first set of simulations with adequate sample size ($n = 10d$) and the second set with limited sample size ($n = 5d$), respectively.

 d    Real     Bias     p-value    Variance   MSE     $\widehat{\text{MSE}}$
 1    0.000   -0.127    4.2e-08     0.525    0.540    0.435
 2    0.000   -0.108    2.87e-10    0.286    0.297    0.290
 3    0.000   -0.052    0.00055     0.227    0.230    0.243
 4    0.000   -0.047    0.00257     0.238    0.240    0.220
 5    0.000   -0.036    0.013       0.209    0.211    0.207
 6    0.000   -0.030    0.0316      0.201    0.201    0.200
 7    0.000   -0.024    0.0956      0.205    0.205    0.190
 8    0.000   -0.025    0.0585      0.176    0.176    0.182
 9    0.000   -0.027    0.0469      0.184    0.184    0.177
10    0.000   -0.017    0.184       0.166    0.166    0.172

Table 3.3: MLE of $\log\sigma^2$ in the first simulation study ($n = 10d$).

 d    Real     Bias     p-value    Variance   MSE     $\widehat{\text{MSE}}$
 1    0.000   -0.286    1.3e-13     1.450    1.530    0.895
 2    0.000   -0.201    6.47e-12    0.835    0.875    0.692
 3    0.000   -0.164    2.28e-09    0.740    0.766    0.612
 4    0.000   -0.119    1.05e-06    0.588    0.601    0.572
 5    0.000   -0.097    5.07e-05    0.562    0.571    0.551
 6    0.000   -0.106    8.1e-06     0.558    0.568    0.543
 7    0.000   -0.096    8.15e-05    0.595    0.604    0.540
 8    0.000   -0.133    2.32e-07    0.649    0.666    0.532
 9    0.000   -0.120    5.12e-07    0.563    0.577    0.526
10    0.000   -0.114    1.42e-06    0.552    0.564    0.511

Table 3.4: MLE of $\log\sigma^2$ in the second simulation study ($n = 5d$).

Again, the numbers in the bias columns are all negative without exception; however, the evidence is not as strong for the first set: there are quite a few relatively large p-values in Table 3.3 for large $d$ (which also implies large $n$, since $n = 10d$ in this case). Hence, we can conclude that the negative bias is significant for all but the largest sample sizes. This finding is consistent with the simulation study in Mardia and Marshall (1984), but appears to contradict the simulations in Abt and Welch (1998), where no negative bias was reported for the MLE of $\sigma^2$ in one or two dimensions (this may be because of the much larger sample sizes: $n = 14$ for $d = 1$ and $n = 64$ for $d = 2$).

We also developed a Bayes estimator for the process variance based on FBI, taking into account the uncertainty in the estimation of the parameters, as described in Section 3.5. Its performance is given in Tables 3.5 and 3.6, which enable direct comparison with the MLE.

 d    Real     Bias     p-value    Variance   MSE     $\widehat{\text{MSE}}$
 1    0.000   -0.082    0.000537    0.561    0.567    0.322
 2    0.000   -0.017    0.314       0.292    0.292    0.222
 3    0.000    0.051    0.000888    0.233    0.235    0.189
 4    0.000    0.056    0.000332    0.244    0.247    0.174
 5    0.000    0.062    2.4e-05     0.215    0.218    0.164
 6    0.000    0.065    6.33e-06    0.206    0.210    0.159
 7    0.000    0.061    2.45e-05    0.209    0.213    0.153
 8    0.000    0.055    3.31e-05    0.175    0.178    0.145
 9    0.000    0.051    0.000195    0.187    0.190    0.143
10    0.000    0.058    9.9e-06     0.170    0.174    0.140

Table 3.5: Bayes estimate of $\log\sigma^2$ in the first simulation study ($n = 10d$).

 d    Real     Bias     p-value    Variance   MSE     $\widehat{\text{MSE}}$
 1    0.000   -0.224    2.75e-08    1.600    1.648    0.798
 2    0.000   -0.035    0.234       0.852    0.852    0.588
 3    0.000    0.026    0.352       0.756    0.756    0.527
 4    0.000    0.073    0.00309     0.601    0.605    0.479
 5    0.000    0.099    4.25e-05    0.584    0.593    0.466
 6    0.000    0.062    0.011       0.599    0.602    0.459
 7    0.000    0.063    0.0126      0.641    0.644    0.461
 8    0.000    0.008    0.767       0.736    0.735    0.488
 9    0.000   -0.016    0.524       0.657    0.657    0.503
10    0.000   -0.034    0.182       0.658    0.658    0.517

Table 3.6: Bayes estimate of $\log\sigma^2$ in the second simulation study ($n = 5d$).

The most important difference is that, with the exception of the $d = 1$ case, there is either no evidence that the Bayes estimator is biased, or, when there is, the bias is positive. Even in the one-dimensional case, the estimated negative bias is less severe than that of the MLE.

This may also provide an insight into how the FBI corrects the deficiency of the traditional plug-in method: by refusing to accept the too-small MLE of $\sigma^2$, it constructs its own estimates that, on average, do not severely underestimate the real $\sigma^2$ for large $d$.
As usual, the d = 1 case is again an exception, retaining a significant negative bias in Tables 3.5 and 3.6.

3.4.2 Parameter uncertainty

As we already mentioned in the previous section, one way of quantifying the uncertainty in the estimation of the parameters is by comparing the MSE and Ê columns. (This seems feasible for all six tables presented so far, except Table 3.2, which has inflated Ê numbers in higher dimensions.) If the two numbers are close, we would expect that normality-based (Wald-type) confidence intervals using the standard errors would have good frequentist properties. In other words, validity would be demonstrated by confidence intervals whose actual coverage is approximately equal to the nominal coverage.

But why not make that match (or the lack of it) more explicit? To visualize how good the match is, in Figures 3.3 and 3.4 we plotted nominal coverage levels (from 1% to 99%) against the true coverage for the three kinds of estimators for d = 1, 4, 7, and 10. These (frequentist) coverage probabilities were calculated by counting how many times the real values were covered out of 1,000 realizations (replicates). In addition, a gray diagonal is shown in the middle of each plot to help guide the eye: the closer the curves are to the diagonal, the better the match between nominal coverage (horizontal axis) and true coverage (vertical axis).
This way of plotting is robust with respect to outliers, since a few inflated standard errors (out of 1,000) will have only a negligible effect on the estimated true coverage probabilities.

On a technical note, we should also mention that although we used the abbreviation CI in these plots for both frequentist confidence intervals (based on the likelihood) and Bayesian credible intervals (based on the posterior), they have very different interpretations.

Figure 3.3: Coverage probabilities of Wald confidence intervals and Bayes credible intervals in the first simulation study (n = 10d) for d = 1, 4, 7, and 10. (Each panel plots nominal coverage, 1% to 99%, against true coverage for the Wald CI for log θ1, the Wald CI for log σ², and the Bayes CI for log σ².)

Figure 3.4: Coverage probabilities of Wald confidence intervals and Bayes credible intervals in the second simulation study (n = 5d) for d = 1, 4, 7, and 10.

But we can still evaluate the frequentist properties of these intervals, regardless of what assumptions we made when we derived them.
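The coverage curves in Figures 3.3 and 3.4 amount to a simple computation over the replicates; a sketch, assuming each replicate supplies a point estimate and a standard error (the function name `coverage_curve` is ours):

```python
import numpy as np
from scipy import stats

def coverage_curve(estimates, std_errs, real, levels):
    """Empirical coverage of Wald-type intervals across replicates.

    For each nominal level 1-alpha, form the interval
    estimate +/- z_{1-alpha/2} * std_err in every replicate and record
    the fraction of replicates whose interval covers `real`.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)
    true_cov = []
    for level in levels:
        z = stats.norm.ppf(0.5 + level / 2.0)  # two-sided normal quantile
        covered = np.abs(estimates - real) <= z * std_errs
        true_cov.append(covered.mean())
    return np.array(true_cov)
```

Plotting `levels` against the returned vector, with the diagonal for reference, gives a curve in the style of the figures; since each replicate contributes only 1/1,000 to each point, a few inflated standard errors barely move the curve, which is the robustness noted above.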
It is not uncommon for a credible interval to provide matching probabilities similar to those of its frequentist counterpart. Indeed, that is what we see in this case, too, although the coverage of the credible intervals centered on the Bayes estimate is clearly below that of the Wald confidence intervals centered on the MLE of log σ², which are very close to the diagonal.

Overall, what we can see in Figures 3.3 and 3.4 is that the Wald confidence intervals had a good match in the second simulation study and an almost perfect match (following the diagonal) in the first study. Also, note that this is indeed a robust way of visualizing uncertainty assessments, not as vulnerable to a few excessive outliers as the Ê measure in the tables.

Results for d = 2, 3, 5, 6, 8, 9 were similar (not shown here). That suggests that the log transformation is nearly optimal not only with respect to assessing prediction uncertainty by FBI, but also with respect to quantifying parameter uncertainty by normality-based confidence intervals. But when we attempt to derive joint confidence regions for all the parameters, a less rosy picture emerges.

Figures 3.5 and 3.6 depict the results for the same four cases (d = 1, 4, 7, and 10) that we have seen before in Figures 3.3 and 3.4.
But this time the coverage probabilities are for likelihood-based confidence regions: for the joint likelihood L(σ², θ) with d + 1 parameters, for the profile likelihood L(θ) with d parameters, and for their normal approximations, respectively. (The exact procedures for these calculations are given in the next section, where we also show that the confidence regions based on the normal approximations are equivalent to Wald-type confidence sets, which are easier to compute than the original likelihood-based ones.)

Figure 3.5: Coverage probabilities of confidence regions based on the two likelihood functions and their normal approximations in the first simulation study (n = 10d) for d = 1, 4, 7, and 10. (Each panel plots nominal against true coverage for the joint likelihood, the profile likelihood, and their respective normal approximations.)

Figure 3.6: Coverage probabilities of confidence regions based on the two likelihood functions and their normal approximations in the second simulation study (n = 5d) for d = 1, 4, 7, and 10.

Although in Figure 3.5 we can see only moderate mismatch between the nominal and true coverages, the gap grows larger in Figure 3.6, especially in higher dimensions (e.g., d = 10). Not only do the curves for the normal approximations fall short of the diagonal (indicating serious undercoverage), but the ones for the likelihoods do as well, which means that there is insufficient information in the data to satisfactorily quantify the uncertainty in the maximum likelihood estimation.

These two figures can also serve to visualize the nonnormality of the two likelihood functions in a way that is superior to measures that compress all information about the shape of the function into a single real number. For instance, the nonnormality measures of Sprott (1973) and their multivariate extensions by Kass and Slate (1994) can provide only limited information about tail behavior (Slate, 1991), because they are based solely on the curvature at the mode.

In contrast, the figures show not only what happens in the neighborhood of the mode, say, at the 1% nominal confidence level, but also what happens in the tails, say, at the 99% level. We can see that the true coverage for the normal approximations starts out very close to the original, indicating a nearly normal shape in the neighborhood of the MLE. However, as the nominal level increases on the x-axis, the gap between the approximation and its original progressively widens on the y-axis, meaning that the approximation is less accurate in the tails.

Figure 3.6 also indicates that the FBI does not propagate all parameter uncertainty (as specified by the likelihood) into prediction uncertainty. The gap between the profile likelihood (dashed line) and its normal approximation (dotted line) is substantial between the 50% and 99% confidence levels, meaning that the normal approximation (used as the posterior distribution by the FBI) not only thins the tails, but overall is much more concentrated around the mode than the original profile likelihood.
Comparison of Figures 3.5 and 3.6 shows, however, that the inference from the normal approximation improves substantially as the sample size increases from n = 5d to n = 10d.

We can also observe that the approximation for the profile likelihood (dotted line) lies closer to the diagonal than the approximation for the joint likelihood (dash-dotted line), illustrating an additional benefit of profiling. The difference may not appear large; however, from the prediction perspective, profiling seems justified considering that it also allows one to disentangle the uncertainty in the estimation of the relatively unimportant process variance parameter from the uncertainty in the estimation of the more important range parameters.

In summary, it appears that using normal approximations for the two likelihood functions on the log scale to quantify parameter uncertainty may be acceptable with adequate sample size (like n = 10d in our first simulation study), but may result in confidence regions with serious undercoverage for smaller samples (like n = 5d in the second simulation study). On the other hand, confidence intervals for the maximum likelihood estimates of individual parameters are much more robust in terms of coverage probabilities, retaining surprisingly good matching coverage even for smaller samples (except for d = 1). Although the coverage of the Bayes credible intervals for log σ² is less accurate, it seems robust to smaller sample sizes.

3.5 Methods

3.5.1 Likelihood-based estimators

We estimate the process variance σ² and the d range parameters in the θ vector by maximum likelihood (see Mardia and Marshall (1984) for regularity conditions for the consistency and asymptotic normality of these estimators). Since these estimators are not available in closed form, the optimization must be done numerically, e.g., by maximizing the logarithm of the joint likelihood function L(σ², θ).
Alternatively, one can optimize the log profile likelihood log L(θ) to find the MLE of θ and then use equation (3.3) to get the MLE of σ², as we did (see Welch et al. (1992) for optimization-related issues). Different parameterizations might affect the optimization process differently, but if done correctly, the end result should be the same because of the invariance of the MLE. (We consistently used the log transformation for everything in this chapter, including maximizing the log profile likelihood.)

Based on asymptotic theory, the joint likelihood L(σ², θ) can also be used to derive confidence sets for all d + 1 parameters jointly by inverting the likelihood ratio test. For example, following Meeker and Escobar (1995), an approximate 100(1 − α)% likelihood-based confidence region for (σ², θ) is the set of all values of (σ², θ) such that

    −2 log [ L(σ², θ) / L(σ̂², θ̂) ] < χ²(1 − α; d + 1),     (3.5)

where σ̂² and θ̂ denote the MLE of the parameter σ² and of the parameter vector θ, respectively, and χ²(1 − α; d + 1) is the 1 − α quantile of the chi-square distribution with d + 1 degrees of freedom.

Similarly, the profile likelihood function L(θ) can yield confidence sets for the d range parameters jointly. An approximate 100(1 − α)% likelihood-based confidence region for θ is the set of all values of θ such that

    −2 log [ L(θ) / L(θ̂) ] < χ²(1 − α; d).     (3.6)
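A region of the form (3.5) or (3.6) can be checked for coverage of a given parameter value directly from the defining inequality; a sketch (the helper name `lrt_covers` is ours, and the log-likelihood function must be supplied by the caller):

```python
from scipy import stats

def lrt_covers(loglik, true_params, mle_params, alpha, df):
    """Is `true_params` inside the 100(1-alpha)% likelihood-ratio
    confidence region?  Inverts the LRT as in (3.5)/(3.6):
    covered iff -2 * [loglik(true) - loglik(mle)] < chi2(1-alpha; df).
    """
    lr_stat = -2.0 * (loglik(true_params) - loglik(mle_params))
    return lr_stat < stats.chi2.ppf(1.0 - alpha, df)
```

With df = d + 1 for the joint likelihood and df = d for the profile likelihood, repeating this over the replicates yields coverage points in the style of Figures 3.5 and 3.6.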
These confidence regions are also invariant to parameter transformations. However, in general, they are not guaranteed to have a nice shape and can be cumbersome to calculate numerically, especially in higher dimensions. Fortunately, in our case we did not have to calculate the boundaries of these regions explicitly, since we were only interested in whether the true values were covered by them, and for that one needs to evaluate the likelihood at only two points: at the real value and at the MLE.

3.5.2 Normal approximations

For our normal approximations, we continue to use the MLEs as point estimates. But what happens to the confidence sets when we replace the likelihoods with their (multivariate) normal approximations? We get nicely shaped, symmetric confidence ellipsoids centered on the MLE (in d + 1 dimensions for all parameters jointly, or in d dimensions for the θ vector).

In this case the boundaries can be calculated analytically, but there is a trade-off for computational simplicity: we lose invariance to transformations and we also lose accuracy in terms of coverage probabilities. This is especially evident in Figures 3.5 and 3.6 if we compare the actual coverage for the two original likelihood functions with that of their approximations.

Here we describe the approximation procedure only for the d-dimensional log-transformed θ vector, since the (d + 1)-dimensional case is completely analogous, with an extra log-transformed σ² parameter. Using the same notation as in Section 3.3, namely γ = (log θ1, ..., log θd)ᵀ = log θ for the transformed θ vector and θ = (exp γ1, ..., exp γd)ᵀ = exp γ for the inverse transformation, we expect the transformed profile likelihood function L(exp γ) to have a shape that is more Gaussian with respect to γ than the shape of the original L(θ) is with respect to θ.

First we maximize log L(exp γ) to get the MLE of γ, denoted γ̂.
Then we compute the Hessian matrix of second derivatives of log L(exp γ) at γ̂, denoted H_γ̂ = ∇² log L(exp γ̂), where

    ∇² log L(exp γ) =
    [ ∂² log L(exp γ)/∂γ1²      ...  ∂² log L(exp γ)/∂γ1 ∂γd ]
    [ ...                       ...  ...                      ]
    [ ∂² log L(exp γ)/∂γd ∂γ1   ...  ∂² log L(exp γ)/∂γd²    ]

Then we can approximate L(exp γ) with the density function of the multivariate normal N(γ̂, −H_γ̂⁻¹) distribution, which is proportional to

    |−H_γ̂⁻¹|^(−1/2) exp{ −(1/2) (γ − γ̂)ᵀ [−H_γ̂⁻¹]⁻¹ (γ − γ̂) }.

Using this approximation instead of the profile likelihood function in inequality (3.6) leads to the confidence ellipsoid defined by the quadratic form

    (γ − γ̂)ᵀ [−H_γ̂] (γ − γ̂) < χ²(1 − α; d).

This is the same as the quadratic form for the normal-theory Wald subset statistic

    (γ − γ̂)ᵀ [Σ̂_γ̂]⁻¹ (γ − γ̂),

where Σ̂_γ̂ is obtained by leaving out the row and column for the log-transformed σ² parameter from the inverse of the observed information matrix (see Meeker and Escobar (1995) for a proof and also for a general discussion of the connection between profiling and constructing likelihood-based and Wald-type confidence regions).

With only one parameter, confidence ellipsoids become normality-based Wald confidence intervals, using the quantiles of the standard normal distribution with a standard error to provide confidence bounds symmetric about the MLE.
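In either case, checking whether a value falls inside the Wald ellipsoid reduces to comparing a quadratic form against a chi-square quantile; a sketch (the function name `wald_covers` is ours, and the Hessian on the log scale is assumed to be computed elsewhere, e.g. by automatic differentiation):

```python
import numpy as np
from scipy import stats

def wald_covers(gamma_true, gamma_hat, hessian, alpha):
    """Is gamma_true inside the 100(1-alpha)% Wald confidence ellipsoid
    (gamma - gamma_hat)^T [-H] (gamma - gamma_hat) < chi2(1-alpha; d)?

    `hessian` holds the second derivatives of the log (profile)
    likelihood at the MLE on the log scale, so -hessian is the observed
    information and must be positive definite.
    """
    diff = np.asarray(gamma_true, dtype=float) - np.asarray(gamma_hat, dtype=float)
    quad_form = diff @ (-np.asarray(hessian, dtype=float)) @ diff
    d = diff.size
    return quad_form < stats.chi2.ppf(1.0 - alpha, d)
```

For d = 1 the quadratic form is (γ1 − γ̂1)²/StdErr², so the check coincides with the symmetric Wald interval described next.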
For example, in terms of coverage probabilities, the assumption that (γ1 − γ̂1)/StdErr_γ̂1 follows a N(0, 1) distribution is equivalent to assuming that (γ1 − γ̂1)ᵀ [StdErr²_γ̂1]⁻¹ (γ1 − γ̂1) has a chi-square distribution with one degree of freedom, where the standard error of γ̂1, denoted StdErr_γ̂1, is obtained either by taking the square root of −H_γ̂⁻¹(1,1) or, equivalently, by taking the root of the diagonal element for γ̂1 of the inverse of the observed information matrix. (The root of the appropriate diagonal element of the inverse of the observed information matrix was also used to obtain a standard error for log σ̂².)

3.5.3 Bayes estimator

Fast Bayesian Inference uses the N(γ̂, −H_γ̂⁻¹) distribution as the posterior distribution of γ. The main advantage is that it is straightforward to obtain independent, identically distributed (iid) Monte Carlo samples from this posterior: γ(1), ..., γ(M), and since they are iid, a relatively small sample size is sufficient (we used M = 400). For each γ(i) in this sample, for prediction purposes, the FBI internally estimates σ² with σ̂²(exp γ(i)), using equation (3.3). Note that the estimator function σ̂²(θ) in (3.3) is a function of the untransformed range parameter vector θ, so we need to apply the inverse transformation to the log-transformed γ(i) vectors in the FBI sample (i = 1, ..., M). (Also, the notation γ(i) used here should not be confused with the hatted γ̂(i) used earlier in Section 3.3.)

Thus the Bayes estimate of log σ² is

    (1/M) Σ_{i=1}^{M} log σ̂²(exp γ(i)).

To derive a standard error for normality-based credible intervals, we can take the square root of the sample variance:

    (1/(M − 1)) Σ_{j=1}^{M} [ log σ̂²(exp γ(j)) − (1/M) Σ_{i=1}^{M} log σ̂²(exp γ(i)) ]².

3.6 Concluding remarks

Our main finding for point estimation by maximum likelihood is that all parameters tend to be underestimated.
We have introduced a Bayes estimator for the process variance that can reduce this negative bias for d = 1 and make it insignificant, or even turn it positive, for d > 1. Further work is needed to clarify how this less biased estimator can be exploited (besides quantifying prediction uncertainty in FBI). One possible avenue of investigation could be to fix σ² at the Bayes estimate and then see whether maximum likelihood estimation leads to improved estimates for the remaining parameters (which in turn could also be used to try to improve the Bayes estimate of σ², and so on).

In Chapter 2, the FBI method demonstrated that the log transformation was nearly optimal for assessing prediction uncertainty, since it left almost no room for further improvement in terms of matching coverage probabilities of the prediction bands. Likewise, we have shown that the log transformation is nearly optimal for quantifying parameter uncertainty, since the match between nominal and true coverages of the MLE-centered, normality-based Wald confidence intervals for the individual parameters is almost as good as that seen for the FBI prediction bands.

This also provides some insight into why the FBI is able to compute the uncertainty in the predictions so well. However, this is still not a fully satisfactory explanation in cases where we can get good matching coverage only for the Wald confidence intervals for each parameter separately, but not for the Wald confidence regions jointly.

Wald and likelihood ratio confidence regions are asymptotically equivalent (see, e.g., Cox and Hinkley (1974) for a proof). However, for small samples, the Wald approximation is often inferior in terms of matching coverage probabilities. We have shown that for our random function model the difference can be quite substantial. It is an open question how much different reparameterizations could help to close this gap.
We have seen that in the one-dimensional case, working on the log scale is not optimal in terms of profile likelihood nonnormality. Transformations that make the shape of the likelihood or the profile likelihood more Gaussian, perhaps adaptively (based on the data), might be worth exploring, since the only special case in which Wald confidence regions are equivalent to likelihood-based ones for finite sample sizes is when the likelihood is proportional to a normal density function.

Finally, we should point out that although the log transformation is rarely optimal in the one-dimensional special case according to the nonnormality measure used in subsection 3.2.5, that does not mean that it is not nearly optimal. On the contrary, results in Nagy et al. (2007a) suggest that, overall, the log transformation is quite useful in most cases for reducing the nonnormality of the profile likelihood, and our results seem to support that, since the approximations for d = 1 in Figures 3.5 and 3.6 look no worse than the ones for d > 1.

Bibliography

Abt, M. (1999), "Estimating the Prediction Mean Squared Error in Gaussian Stochastic Processes with Exponential Correlation Structure," Scandinavian Journal of Statistics, 26, 563–578.

Abt, M. and Welch, W. J. (1998), "Fisher Information and Maximum Likelihood Estimation of Covariance Parameters in Gaussian Stochastic Processes," The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 26, 127–137.

Box, G. E. P. and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B (Methodological), 26, 211–252.

Cox, D. R. and Hinkley, D. V. (1974), Theoretical Statistics, London: Chapman and Hall.

Cressie, N. A. C. (1993), Statistics for Spatial Data, John Wiley & Sons.

Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991), "Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments,"
Journal of the American Statistical Association, 86, 953–963.

Furrer, R. (2005), "Covariance estimation under spatial dependence," Journal of Multivariate Analysis, 94, 366–381.

Karuri, S. W. (2005), "Integration in Computer Experiments and Bayesian Analysis," Ph.D. thesis, University of Waterloo.

Kass, R. E. and Slate, E. H. (1994), "Some Diagnostics of Maximum Likelihood and Posterior Nonnormality," The Annals of Statistics, 22, 668–695.

Mardia, K. V. and Marshall, R. J. (1984), "Maximum Likelihood Estimation of Models for Residual Covariance in Spatial Regression," Biometrika, 71, 135–146.

McKay, M. D., Beckman, R. J., and Conover, W. J. (1979), "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code," Technometrics, 21, 239–245.

Meeker, W. Q. and Escobar, L. A. (1995), "Teaching about Approximate Confidence Regions Based on Maximum Likelihood Estimation," The American Statistician, 49, 48–53.

Nagy, B., Loeppky, J. L., and Welch, W. J. (2007a), "Correlation parameterization in random function models to improve normal approximation of the likelihood or posterior," Tech. Rep. 229, Department of Statistics, The University of British Columbia.

— (2007b), "Fast Bayesian Inference for Gaussian Process Models," Tech. Rep. 230, Department of Statistics, The University of British Columbia.

Rasmussen, C. E. and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, MIT Press.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), "Design and Analysis of Computer Experiments (C/R: P423-435)," Statistical Science, 4, 409–423.

Slate, E. H. (1991), "Reparameterizations of Statistical Models," Ph.D. thesis, Carnegie Mellon University.

Sprott, D. A. (1973), "Normal Likelihoods and Their Relation to Large Sample Theory of Estimation," Biometrika, 60, 457–465.

Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, Springer-Verlag Inc.

Tukey, J. W.
(1957), "On the Comparative Anatomy of Transformations," The Annals of Mathematical Statistics, 28, 602–632.

Wang, H. and Zhang, H. (2003), "On the Possibility of a Private Crop Insurance Market: A Spatial Statistics Approach," Journal of Risk and Insurance, 70, 111–124.

Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T. J., and Morris, M. D. (1992), "Screening, Predicting, and Computer Experiments," Technometrics, 34, 15–25.

Zhang, H. and Zimmerman, D. L. (2005), "Towards Reconciling Two Asymptotic Frameworks in Spatial Statistics," Biometrika, 92, 921–936.

Chapter 4

Discussion

In this thesis, we demonstrated a novel way to achieve approximately valid estimation and prediction inference for a particular statistical model frequently used in computer experiments to model smooth response surfaces.

In general, there are two kinds of goals for statistical modeling: prediction and model identification. We followed that separation of concerns when dividing our work into two separate publications. Although the two manuscripts share the same model and simulation design, Chapter 2 deals only with issues related to prediction and Chapter 3 with identifying (the covariance of) the random function model.

However, we do not just accept the model as it is, but seek nearly optimal reparameterizations to minimize nonnormality of the profile likelihood. Hence, another way to relate the two chapters is by looking at what kinds of nonnormality measures they use. Chapter 2 employs a multivariate measure from Kass and Slate (1994) based on the curvature at the mode. Chapter 3 takes a more visual approach to compare likelihoods with their normal approximations, revealing discrepancies not only in the neighborhood of the mode but also in the tails. (The one-dimensional special case is also investigated by a univariate measure of Sprott (1973) in Chapter 3, subsection 3.2.5, which is the same as the "Expected Non-Normality" measure in Nagy, Loeppky, and Welch (2007a).
Also, the multivariate measure in Kass and Slate (1994) for d = 1 reduces to the univariate "Observed Non-Normality" in Nagy et al. (2007a), also from Sprott (1973).)

Since the main goal in computer experiments is prediction, the contributions in Chapter 2 are arguably more important to this field than the results in Chapter 3. The main advantage of Fast Bayesian Inference is that it is computationally efficient and can be implemented as a black box. Moreover, it can relieve the user from the burden and responsibility of selecting a suitable prior or an appropriate MCMC algorithm. In other words, FBI has all the ingredients required for incorporation into a standard statistical package. That holds out the promise that one day it may become a widely used method across many fields in science and engineering.

In contrast, the results in Chapter 3 seem less interesting from a practical standpoint. One could even say that the significance of Chapter 3 lies mostly in providing some insight into why the FBI prediction bands are so accurate in Chapter 2, since precise assessments of parameter uncertainty can certainly help quantify prediction uncertainty, too. However, that is at most a partial explanation of the success of the FBI, since the prediction bands retain much of their accuracy even in extreme situations (such as very small sample sizes or extremely large range parameters), as shown in Nagy, Loeppky, and Welch (2007b) (see results included in Appendix B). Although these extremes are irrelevant in practice (since meaningful prediction is not possible there), it is still an interesting theoretical question what makes the FBI so robust across such a wide range of settings.

One practical shortcoming of the thesis is that the estimated true coverage probabilities of the prediction bands are averaged over both the hypercube [0, 1]^d and all the simulated data sets.
But it is never explored what happens at just one specific point in the input space or for just one particular realization, both of which could be more relevant in practice than the present blanket measure. And what if the data are not a realization of a Gaussian process? In practical applications this is almost always the case, yet no attempt is made to assess the robustness of FBI with respect to model misspecification. (One cannot make generalizations based on just the two real examples in Chapter 2.)

The simulation design can also be criticized on the grounds that it uses only one fixed θ value for each dimension (although we should mention that we obtained similar results with Bayesian simulation designs where log θ was drawn from either a uniform or a normal distribution, but those results are not presented here). On the other hand, the simulations can also be considered the primary strength of this thesis, pushing the limits of both currently available hardware (WestGrid high-performance computing facilities) and software (e.g., Intel's Math Kernel Library for matrix operations and ADOL-C for automatic differentiation in C++).

There are also many obvious extensions to our work, some of which seem easier to tackle than others. We briefly discuss some possible research directions and speculate on their perceived feasibility at the time of this writing.

4.1 Alternative correlation functions

Whether the FBI can be adapted to other covariance structures is one of the first questions that comes to mind. For instance, one possible generalization of the Gaussian correlation function is the Power Exponential family:

    Corr(Z(w), Z(x)) = ∏_{i=1}^{d} exp{ −θi |wi − xi|^pi },

where 0 < pi ≤ 2 (Sacks, Welch, Mitchell, and Wynn, 1989). In this thesis we presented results only for the special case pi = 2 (i = 1, ..., d), known as the Squared Exponential or Gaussian correlation function.
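The Power Exponential correlation above transcribes directly into code; a minimal sketch (the function name `power_exp_corr` is ours):

```python
import numpy as np

def power_exp_corr(w, x, theta, p):
    """Power Exponential correlation between inputs w and x:
    prod_i exp(-theta_i * |w_i - x_i|^p_i),  with 0 < p_i <= 2.
    Setting p_i = 2 for all i gives the Squared Exponential
    (Gaussian) special case used throughout the thesis.
    """
    w, x = np.asarray(w, dtype=float), np.asarray(x, dtype=float)
    theta, p = np.asarray(theta, dtype=float), np.asarray(p, dtype=float)
    return float(np.exp(-np.sum(theta * np.abs(w - x) ** p)))
```

Note the correlation is 1 at zero distance and decays with the θi-weighted distances; for p < 2 it has a kink at the origin, which is the loss of differentiability discussed below.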
But we also experimented with running the FBI with constant exponents fixed at values other than 2. What we found was that the prediction uncertainty assessments were not as uniformly valid as for p = 2. More work is required to understand what causes the difference. Results for d = 1 in Nagy et al. (2007a) suggest that transformations may play a large role. (For example, the log transformation tends to reduce nonnormality for p = 2, but that is less often the case for p < 2.)

Perhaps the most exciting question is whether the FBI can be extended to the situation where the parameters pi are unknown. Unfortunately, it is not clear at this time how one might accomplish that, since these parameters can take values only between 0 and 2, and the boundary case is very special. We have seen that the process can behave very differently (including losing differentiability), even for values of p only slightly less than 2, such as p = 2 − 10⁻⁶.

Brian Williams suggested another way to generalize the Gaussian correlation that can be viewed as a limiting case of the Matérn class (Matérn, 1947) in terms of differentiability. This was also recommended earlier in Stein (1999). The rationale is that this family has a parameter ν for fine-tuning differentiability, i.e., the process is differentiable k times if and only if k < ν, allowing much finer control than the Power Exponential family, for which the process is infinitely differentiable for p = 2 and not differentiable at all for p < 2.

4.2 Additional terms in the model

A general model could have a regression component and a white noise term in addition to the stochastic process Z(x):

    Y(x) = Σ_j βj fj(x) + Z(x) + ε,

where each fj(x) is a function of x with known or unknown coefficients βj, and ε is iid random error parameterized by a known or unknown variance. In fact, when we began developing FBI, we started out with a model that included both an unknown mean μ
and white noise with known or unknown variance:

    Y(x) = μ + Z(x) + ε.

Encouraging preliminary results for d = 2 were reported at the Annual Meeting of the Statistical Society of Canada in London, Ontario, in May 2006, in a talk entitled "Uncertainty in kriging predictions with and without random error". Since it did not become clear later how to deal with noise, we ended up dropping it from the model and turned our focus to the deterministic case:

    Y(x) = μ + Z(x),

with an unknown mean μ. FBI results for d = 1, ..., 5, comparable to the ones in Appendix B, were presented at the Joint Statistical Meetings in Seattle, Washington, in August 2006 under the title "Validity of Likelihood and Bayesian Inference for Gaussian Process Regression".

Eventually, we dispensed with μ, too, to simplify the algebra and speed up computations, enabling us to explore higher-dimensional cases and run long Markov chains. Since we found that the FBI prediction bands were remarkably accurate with or without μ in the model, it is not unreasonable to guess that one could get similarly good prediction uncertainty assessments for a model including more regression terms. The only required change in FBI would be that, in addition to σ², one would also have to eliminate the extra regression coefficients when deriving the profile likelihood function for the remaining range parameters. All this is standard practice, described in detail in Sacks et al. (1989) and Welch, Buck, Sacks, Wynn, Mitchell, and Morris (1992).

4.3 Different reparameterizations

There is no reason to restrict oneself to the family of power transformations in Tukey (1957), as we did in this thesis.
Although this class of functions is quite flexible and powerful, as shown by Box and Cox (1964), this choice is still arbitrary and cannot be expected to provide a satisfying solution in every situation.

In the one-dimensional case we experimented briefly with the $\rho = e^{-\theta}$ reparameterization for the range parameter (Linkletter, Bingham, Hengartner, Higdon, and Ye, 2006), which was also the inspiration for the logexp transformation for $\theta$, defined as $\log(e^{\theta} - 1)$ in Nagy et al. (2007a), where we compared it to the log transformation. However, those results were inconclusive, raising more questions than answers, and we decided not to include them here.

As we already mentioned in Chapter 3, adaptive transformations based on the data at hand might be worth exploring, too. Although the negative results in Chapter 2 seem to contradict this (i.e. adaptively optimizing $\lambda$ could not beat the log transformation), one should keep in mind that we were using just one specific model with one particular correlation structure that resulted in FBI prediction bands with almost perfect frequentist properties in terms of matching coverage probabilities, leaving almost no room for improvement. But a different model with another correlation function may need a more sophisticated reparameterization scheme, and it seems unwise to rule out adaptive approaches a priori based on a single negative result.

4.4 Numerical optimizations

In theory, reducing the nonnormality of likelihoods or profile likelihoods can help the required numerical optimizations to find the MLE.
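As a toy illustration of this point (our own sketch, not code from the thesis): a Newton-type iteration locates the maximum of an exactly quadratic log-likelihood in a single step, and converges rapidly when the log-likelihood is nearly quadratic:

```python
def newton_maximize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method for maximizing a smooth one-dimensional function.

    grad and hess are the first and second derivatives of the objective;
    for an exactly quadratic objective a single step lands on the
    stationary point.
    """
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Quadratic "log-likelihood" l(x) = -(x - 3)**2: one Newton step from any
# starting point reaches the maximizer x = 3.
xhat = newton_maximize(lambda x: -2.0 * (x - 3.0), lambda x: -2.0, x0=100.0)
```

The further the log-likelihood departs from this quadratic ideal, the more iterations (and the more care with step control) the optimizer needs.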
This is because when the shape of the likelihood functions is more Gaussian, the shape of the log-likelihoods is more quadratic, and those are exactly the kinds of functions that can be optimized very efficiently with Newton-type algorithms.

However, in practice, this can be tricky, since it is well known that methods relying heavily on derivatives can easily become unstable (Press, Flannery, Teukolsky, and Vetterling, 2002). This can be caused either by features of the objective function or by numerical inaccuracies. For example, although in our case the likelihood is log-concave (Paninski, 2004), it may still appear to possess several local maxima along the ridge where the MLE is located (Warnes and Ripley, 1987).

Another challenge for derivative-based optimizers is that in high dimensions some partial derivatives can get dangerously close to zero, even in places that are still far away from the MLE. It may be possible to alleviate this problem to some extent by dimensionality-reduction techniques, or perhaps even by just screening out the input variables with little or no effect before the optimization or in parallel (Welch et al., 1992).

But when it does work, Newton's method can achieve quadratic convergence, which means doubling the number of correct digits at each iteration. That suggests that it may be worth investing some time into carefully designing a statistical experiment to identify the factors that determine efficient convergence for the log-likelihoods in question. If those factors can be controlled in a black-box implementation, that can make Fast Bayesian Inference even faster, since its running time is determined by the numerical optimization required for maximum likelihood estimation.

Bibliography

Box, G. E. P. and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B (Methodological), 26, 211–252.

Kass, R. E. and Slate, E. H.
(1994), "Some Diagnostics of Maximum Likelihood and Posterior Nonnormality," The Annals of Statistics, 22, 668–695.

Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., and Ye, K. Q. (2006), "Variable Selection for Gaussian Process Models in Computer Experiments," Technometrics, 48, 478–490.

Matérn, B. (1947), "Methods of Estimating the Accuracy of Line and Sample Plot Surveys," Medd. fr. Statens Skogsforsknings Inst., 36.

Nagy, B., Loeppky, J. L., and Welch, W. J. (2007a), "Correlation parameterization in random function models to improve normal approximation of the likelihood or posterior," Tech. Rep. 229, Department of Statistics, The University of British Columbia.

— (2007b), "Fast Bayesian Inference for Gaussian Process Models," Tech. Rep. 230, Department of Statistics, The University of British Columbia.

Paninski, L. (2004), "Log-concavity results on Gaussian process methods for supervised and unsupervised learning," Advances in Neural Information Processing Systems, 17.

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (2002), Numerical Recipes in C++, Cambridge: Cambridge University Press.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), "Design and Analysis of Computer Experiments (C/R: P423-435)," Statistical Science, 4, 409–423.

Sprott, D. A. (1973), "Normal Likelihoods and Their Relation to Large Sample Theory of Estimation," Biometrika, 60, 457–465.

Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, Springer-Verlag Inc.

Tukey, J. W. (1957), "On the Comparative Anatomy of Transformations," The Annals of Mathematical Statistics, 28, 602–632.

Warnes, J. J. and Ripley, B. D. (1987), "Problems with Likelihood Estimation of Covariance Functions of Spatial Gaussian Processes," Biometrika, 74, 640–642.

Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T. J., and Morris, M. D.
(1992), "Screening, Predicting, and Computer Experiments," Technometrics, 34, 15–25.

Appendix A
Appendix to Chapter 3

In this appendix, we derive formulas for the Expected Nonnormality (ENN) measure in Chapter 3, Section 3.2.5.

Let $y$ denote the response vector having length $n$, mean zero, and covariance matrix $\sigma^2 R$, where $\sigma^2$ is the process variance and $R$ is the symmetric, positive definite $n \times n$ design correlation matrix (a function of the parameter $\theta$). The MLE of $\theta$ is treated as given, denoted by $\hat\theta$. Let $G$ denote the inverse matrix of $R$ and define the matrices $F = GR'$, $S = GR''$, and $T = GR'''$, where $R'$, $R''$, and $R'''$ are the first, second, and third derivatives of $R$, respectively (with respect to $\theta$). The trace of a matrix is denoted by $\mathrm{tr}(\cdot)$. For concise notation, we also define $t(\cdot) = \mathrm{tr}(\cdot)/n$.

Taking the log of $L(\theta)$, the log-likelihood is (up to an additive constant):
$$l(\theta) = -\frac{n}{2} \log \frac{y^T R^{-1} y}{n} - \frac{1}{2} \log |R|.$$
The functions $g$ and $h$ are also used to simplify calculations:
$$g(\theta) = \frac{y^T R^{-1} y}{n} \quad \text{and} \quad h(\theta) = -\frac{\log |R|}{n}.$$
Suppressing $\theta$ from $l(\theta)$, $g(\theta)$, and $h(\theta)$ gives the following equations for the log-likelihood $l$ and its first three derivatives:
$$l = \frac{n}{2}\,(h - \log g),$$
$$l' = \frac{n}{2}\left(h' - \frac{g'}{g}\right),$$
$$l'' = \frac{n}{2}\left(h'' + \left(\frac{g'}{g}\right)^2 - \frac{g''}{g}\right),$$
$$l''' = \frac{n}{2}\left(h''' - 2\left(\frac{g'}{g}\right)^3 + 3\,\frac{g' g''}{g^2} - \frac{g'''}{g}\right),$$
where
$$h' = -t(F), \qquad h'' = t(F^2 - S), \qquad h''' = -t(2F^3 - 3FS + T),$$
and
$$g' = y^T G' y / n, \qquad g'' = y^T G'' y / n, \qquad g''' = y^T G''' y / n,$$
where
$$G' = -FG, \qquad G'' = (2F^2 - S)G, \qquad G''' = -(6F^3 - 3FS - 3SF + T)G.$$

Lemma 1.
For a symmetric $n \times n$ matrix $Q$ and $y \sim N(0, \sigma^2 R)$,
$$E\, y^T Q y = \sigma^2\, \mathrm{tr}(QR).$$
Proof: Using any standard text on matrix algebra, e.g. Harville (1997),
$$E\, y^T Q y = E\, \mathrm{tr}(y^T Q y) = E\, \mathrm{tr}(Q y y^T) = \mathrm{tr}(Q\, E\, y y^T) = \mathrm{tr}(Q\, \sigma^2 R) = \sigma^2\, \mathrm{tr}(QR),$$
where we used the fact that $y$ has covariance matrix $\sigma^2 R$.

Lemma 2. For a symmetric $n \times n$ matrix $Q$, $y \sim N(0, \sigma^2 R)$, and $G = R^{-1}$,
$$E\, \frac{y^T Q y}{y^T G y} = t(QR).$$
Proof: Let $z = C^{-1} y$, where $C$ is the lower-triangular Cholesky factor of the covariance matrix $\sigma^2 R$. Then $z \sim N(0, I_n)$ and
$$C C^T = \sigma^2 R \;\Rightarrow\; R = C C^T / \sigma^2 \;\Rightarrow\; R^{-1} = \sigma^2 \left(C^T\right)^{-1} C^{-1}.$$
Substituting $y = Cz$ and $G = \sigma^2 \left(C^T\right)^{-1} C^{-1}$ we get:
$$E\, \frac{y^T Q y}{y^T G y} = E\, \frac{z^T C^T Q C z}{z^T C^T\, \sigma^2 \left(C^T\right)^{-1} C^{-1}\, C z} = \frac{1}{\sigma^2}\, E\, \frac{z^T (C^T Q C) z}{z^T z}.$$
Conniffe and Spencer (2001) state that the expectation of a ratio of this form is the ratio of the expectations for any quadratic form in the numerator. This is a consequence of the fact that the ratio is independent of its denominator, a result attributed to Geary (1933). Hence we can apply Lemma 1 separately to the numerator and the denominator:
$$E\, \frac{y^T Q y}{y^T G y} = \frac{E\, y^T Q y}{E\, y^T G y} = \frac{\sigma^2\, \mathrm{tr}(QR)}{\sigma^2\, \mathrm{tr}(GR)} = \frac{\mathrm{tr}(QR)}{\mathrm{tr}(I_n)} = \frac{\mathrm{tr}(QR)}{n} = t(QR).$$

Lemma 3. $E\, l''(\hat\theta)$ and $E\, l'''(\hat\theta)$ for the ENN are:
$$E\, l'' = \frac{n}{2}\left(t^2(F) - t(F^2)\right),$$
$$E\, l''' = \frac{n}{2}\left(2t^3(F) - 6t(F)\,t(F^2) + 3t(F)\,t(S) - 3t(FS) + 4t(F^3)\right).$$
Proof: When $\theta = \hat\theta$ (the MLE of $\theta$), then $l' = 0$, which implies that $g'/g = h'$.
Replacing $g'/g$ with $h'$ in the second and third derivative formulas for $l$ leads to the following expressions:
$$l'' = \frac{n}{2}\left(h'' + \left(h'\right)^2 - \frac{g''}{g}\right),$$
$$l''' = \frac{n}{2}\left(h''' - 2\left(h'\right)^3 + 3h'\,\frac{g''}{g} - \frac{g'''}{g}\right).$$
Taking expectations:
$$E\, l'' = \frac{n}{2}\left(h'' + \left(h'\right)^2 - E\, \frac{g''}{g}\right),$$
$$E\, l''' = \frac{n}{2}\left(h''' - 2\left(h'\right)^3 + 3h'\, E\, \frac{g''}{g} - E\, \frac{g'''}{g}\right).$$
Now Lemma 2 can be applied to the expectations of the ratios:
$$E\, \frac{g''}{g} = E\, \frac{y^T G'' y}{y^T G y} = t(G'' R) \quad \text{and} \quad E\, \frac{g'''}{g} = E\, \frac{y^T G''' y}{y^T G y} = t(G''' R).$$
Substituting the formulas for $G''$, $G'''$, and $h'$, $h''$, $h'''$ completes the proof.

Bibliography

Conniffe, D. and Spencer, J. E. (2001), "When Moments of Ratios Are Ratios of Moments," Journal of the Royal Statistical Society, Series D: The Statistician, 50, 161–168.

Geary, R. C. (1933), "A General Expression for the Moments of Certain Symmetrical Functions of Normal Samples," Biometrika, 25, 184–186.

Harville, D. A. (1997), Matrix Algebra from a Statistician's Perspective, Springer-Verlag Inc.

Appendix B
Appendix to Chapter 4

This appendix contains the simulation results in Nagy, Loeppky, and Welch (2007) for three different inference methods. In addition to the plug-in method and the FBI, which are the same as in Chapter 2, it also includes an extra Bayesian method using Markov chain Monte Carlo (MCMC) to sample from another posterior distribution that is different from the one used by the FBI. That means that the two Bayesian methods are not expected to give the same results.
(However, as we will see in Section B.3, there is a strong connection: the FBI's posterior is the normal approximation of the posterior sampled by the MCMC.)

After warning the reader in the next section why the results of this appendix should be taken with a grain of salt, in Section B.2 we outline the simulation procedure in Nagy et al. (2007). It is similar to the one presented in Chapter 2, but it explores a much wider range of experimental setups and also includes another method using MCMC, which is described in detail in Section B.3. Other differences include an alternative way to calculate the Coverage Probability (CP) of a prediction interval (derived in Section B.4) using the prior information that the data are a realization of a Gaussian process. Finally, the results presented in Section B.5 are based on the normality assumption for the prediction intervals (which can be improved for the smallest sample sizes by using the t-distribution instead, as argued in Chapter 2). The true coverage probabilities obtained for the three different methods for the nominal 90%, 95%, and 99% levels are summarized in a table, followed by 10 figures for the simulation results in $d = 1, \ldots, 10$, plotting the actual coverages of the $100(1-\alpha)\%$ pointwise prediction bands against the nominal levels for $\alpha = 0.01, 0.02, \ldots, 0.99$, in the same fashion as Figures 2.3 and 2.4 in Chapter 2.

B.1 Warning

Unlike the carefully designed 20 simulation studies in Chapters 2 and 3, the 90 simulations in this appendix have several limitations. This is because, in addition to sensible choices (that one may expect to encounter in practice), we also wanted to explore more extreme experimental setups, such as overly challenging response surfaces or samples that are much too small relative to the number of input variables. Of course, the drawback of casting such a wide net is that we catch more interesting behaviors than we ask for.
For example, unlike in Chapters 2 and 3, many times we could not complete computations for all 1,000 replicates because of numerical difficulties. Note that here we are no longer talking about ill-conditioning of correlation matrices whose inverses had unrealistically large elements (as in Chapter 3), but outright failures, where attempting to take the inverse ended with an error message. Although these failures were excluded from the final analysis, other degenerate cases were not; e.g. when we did get an inverse, we did not check whether it was "realistic" or not. Hence, there is no guarantee that numerical issues did not have a significant influence in some cases, and these limitations should be kept in mind when drawing conclusions from the results in this appendix. But in spite of all the difficulties, it is still interesting to observe the performance of the three methods on the frontiers of their applicability.

B.2 Simulations

The simulation plan can be viewed as a set of 10 statistically designed experiments for $d = 1, \ldots, 10$. For each experiment, the design was a $3 \times 3$ full factorial with 1,000 replicates. The two factors were the range parameter $\theta$ and the sample size $n$, both at three levels (equally spaced on the log scale): $\theta = 0.2, 2, 20$ and $n = 10d/4,\ 10d/2,\ 10d$ (where $10d/4$ was rounded up to the nearest integer).

To obtain 1,000 replicates for a given combination of $\theta$ and $n$, the following four steps were repeated (attempted) 1,000 times:

1. Select an $n$-point design by Latin hypercube sampling in the $d$-dimensional unit hypercube $[0,1]^d$ (McKay, Beckman, and Conover, 1979).

2. Generate a realization $y$ of the Gaussian process over the $n$ design points by setting the range parameter to $\theta$ in all dimensions and the process variance to one.

3. Sample 10 new points uniformly in the unit hypercube $[0,1]^d$ for prediction.

4.
Compute the predictors for the three methods, with their mean squared errors, for the 10 new points from the data $y$.

Note that step 2 or 4 could fail because of numerical issues, leading to an unsuccessful realization (missing value) for that particular replicate (not included in subsequent analysis). The only case where this had a catastrophic impact on results was the $\theta = 0.2$, $n = 10$ case for $d = 1$, since setting $\theta$ to 0.2 pushes the limits of the standard double-precision representation in the one-input case: numerical difficulties arise because the high correlations in the $n \times n$ correlation matrix (all close to one) make it ill-conditioned (nearly singular). Hence, in Section B.5, the numbers and plots are missing for the $\theta = 0.2$, $n = 10$ case from the first ($d = 1$) row of the table and the first ($d = 1$) figure, respectively.

B.3 MCMC

To compare the FBI with another Bayesian method, we used uniform priors on the log scale for the range parameters and sampled the resulting posterior by MCMC, using the Metropolis random walk algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, 1953), which has been used successfully in many high-dimensional problems. Of course, this comparison is not entirely fair, because the FBI does not take samples directly from this posterior but from its normal approximation. But to make it as comparable to the FBI as possible, everything was done on the log scale using the same $\gamma$-parameterization as in Chapter 2. Also, the first two moments of the $N(\hat\gamma, -H_{\hat\gamma}^{-1})$ normal approximation were utilized to help the implementation in step 1 and step 3 of the algorithm, respectively:

1. Initialize $\gamma^{(1)}$ at $\hat\gamma$.

2. To select a direction for a random walk step, sample an integer $j$ uniformly from $1, \ldots, d$.

3.
Given the current $\gamma^{(i)}$, set $\gamma^*$ to $\gamma^{(i)}$ and then add to the $j$th coordinate of $\gamma^*$ a normal random deviate with mean zero and standard deviation equal to three times the standard error in the $j$th dimension, estimated from the Hessian: $\sqrt{-H_{\hat\gamma}^{-1}(j,j)}$.

4. Compute the acceptance ratio for $\gamma^*$, given $\gamma^{(i)}$:
$$p = \min\left\{1,\ \frac{L(\exp \gamma^*)}{L(\exp \gamma^{(i)})}\right\}.$$

5. Set $\gamma^{(i+1)}$ to $\gamma^*$ with probability $p$ and to $\gamma^{(i)}$ with probability $1 - p$.

6. Repeat steps 2–5 until $i$ reaches the desired sample size.

When this algorithm works well, it constructs a Markov chain whose stationary distribution is the posterior distribution. The resulting sample can then be used for prediction exactly the same way as the sample for the FBI (step 4 in Chapter 2, Section 2.7). In other words, once the sampling is done, the treatment of the samples is identical.

But that does not mean that the samples are equivalent or similar. The MCMC algorithm constructs a large, dependent sample from a posterior that is only known up to a scale (proportional to the likelihood because of the uniform priors used for the $\gamma$ parameters). On the other hand, the FBI can get away with a much smaller independent, identically distributed (iid) sample that is not from the same posterior, but from its normal approximation $N(\hat\gamma, -H_{\hat\gamma}^{-1})$, as in Chapter 2. The Monte Carlo sample size for the FBI was only 400, minus those sample points that ran into numerical difficulties caused by the ill-conditioning of the correlation matrix. This happened mostly in lower-dimensional cases, especially in $d = 1$. The MCMC sample size was $N = 100{,}000$ (after 10,000 burn-in).
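The coordinate-at-a-time random-walk scheme in steps 1–6 can be sketched as follows (a minimal illustration with our own function names and a toy target, not the thesis implementation; it works with the log-posterior to avoid under/overflow in the likelihood ratio):

```python
import math
import random

def metropolis_coordinate_walk(log_post, gamma0, step_sd, n_samples, seed=0):
    """Coordinate-at-a-time random-walk Metropolis sampler.

    log_post : log of the (unnormalized) posterior density
    gamma0   : starting point (e.g. the posterior mode)
    step_sd  : per-coordinate proposal standard deviations
    """
    rng = random.Random(seed)
    d = len(gamma0)
    gamma = list(gamma0)
    lp = log_post(gamma)
    samples = []
    for _ in range(n_samples):
        j = rng.randrange(d)                    # pick a coordinate at random
        prop = list(gamma)
        prop[j] += rng.gauss(0.0, step_sd[j])   # random-walk proposal
        lp_prop = log_post(prop)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            gamma, lp = prop, lp_prop
        samples.append(list(gamma))
    return samples

# Toy target: independent standard normals in d = 2.
draws = metropolis_coordinate_walk(
    lambda g: -0.5 * sum(x * x for x in g), [0.0, 0.0], [1.0, 1.0], 5000)
```

In the thesis setup, `log_post` would be the log-likelihood in the $\gamma$-parameterization and `step_sd[j]` would be $3\sqrt{-H_{\hat\gamma}^{-1}(j,j)}$, as in step 3.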
Unlike the FBI sample, the MCMC sample did not suffer from numerical problems, because problematic points would never be accepted by the algorithm: the likelihood/posterior was set to zero whenever the Cholesky decomposition of the correlation matrix failed.

An MCMC run was considered successful if the acceptance rate was at least 15% and the Mean Effective Sample Size (MESS) was at least 50. Both measures were calculated after the burn-in phase. The following formula was used for the MESS:
$$\mathrm{MESS} = \frac{1}{d} \sum_{i=1}^{d} N \left[ 1 + 2 \sum_{k=1}^{1000} \left(1 - \frac{k}{N}\right) \hat\rho_k(i) \right]^{-1},$$
where $\hat\rho_k(i)$ is the $k$th sample autocorrelation in the $i$th dimension (Carter and Kohn, 1994).

These measures were intended to provide some minimal automatic quality control, since visual examination of various diagnostic plots for all runs was clearly not possible. Of course, there is no guarantee that an MCMC chain that met both of these criteria (and as a result was classified as successful) actually converged to the stationary distribution or was not deficient in some other way. The original technical report Nagy et al. (2007)
Since the true model is known during the simulation, we know thatconditionally on the realized data, Y0 is normally distributed with mean ?0and variance sigma20, where ?0 and sigma20 are given by equations (2.3) and (2.4) inChapter 2, respectively.Now suppose that after estimation, the predictor for Y0 was ?1 withmean squared error sigma21. This amounts to mis-specifying the distribution ofthe random variable Y0 as N(?1,sigma21) instead of the true N(?0,sigma20).Then the CP of a normality-based 100(1-alpha)% prediction interval about?1 isP0 parenleftbig?1 -sigma1 zalpha/2 < Y0 < ?1 + sigma1 zalpha/2parenrightbig == P0parenleftbigg?1 -sigma1 zalpha/2 -?0sigma0 <Y0 -?0sigma0 <?1 + sigma1 zalpha/2 -?0sigma0parenrightbigg== Phiparenleftbigg?1 + sigma1 zalpha/2 -?0sigma0parenrightbigg- Phiparenleftbigg?1 -sigma1 zalpha/2 -?0sigma0parenrightbigg,where P0 denotes the true probability distribution, Phi is the cumulativedistribution function of the standard normal N(0,1), and zalpha/2 satisfiesPhi(-zalpha/2) = alpha/2.90Appendix B. Appendix to Chapter 4B.5 ResultsThe following table is a summary of the CPs of the pointwise predictionbands of the three competing methods for the nominal 90%, 95%, and 99%confidence levels. The two-digit numbers in the table are truncated per-centages without the percent sign and without the fractional parts (roundeddown). The 3?3 arrangement inside each cell follows the layout of the plotsin the following figures by the three levels of theta horizontally and the threelevels of n vertically.After the table, the following 10 figures compare the validity of the threemethods for d = 1, ..., 10, for all combinations of the three levels of thetaand the three levels of n. In addition to the gray diagonal in the middle,three curves were plotted for the three methods relating the true coverageprobabilities on the vertical axis (from 1% to 99%) to the nominal coverageon the horizontal axis (from 1% to 99%). 
This is the same as before in Figures 2.3 and 2.4 in Chapter 2, just this time the axis labels are not shown, to be able to present 9 plots in the same figure in a $3 \times 3$ arrangement.

Plots are based on the realizations that were classified as successful, out of 1,000 attempts in total. Counts for the number of realizations included in the final calculations are shown in the top-left corner of each plot. Calculations of the CPs were always restricted to the successful subset of the 1,000 realizations, and all failures were excluded.

[Table: true coverage percentages of the nominal 90%, 95%, and 99% pointwise prediction bands for the plugin, FBI, and MCMC methods, $d = 1, \ldots, 10$; each cell is a $3 \times 3$ arrangement by the three levels of $\theta$ (horizontally) and the three levels of $n$ (vertically). The $\theta = 0.2$, $n = 10$ entries are missing for $d = 1$.]

[Figure: $d = 1$; $\theta = 0.2, 2, 20$; $n = 3, 5, 10$; true vs. nominal coverage curves for plugin, FBI, and MCMC, with realization counts per panel (the $\theta = 0.2$, $n = 10$ panel has 0 realizations).]
[Figure: $d = 2$; $\theta = 0.2, 2, 20$; $n = 5, 10, 20$; same layout.]
[Figure: $d = 3$; $\theta = 0.2, 2, 20$; $n = 8, 15, 30$; same layout.]
[Figure: $d = 4$; $\theta = 0.2, 2, 20$; $n = 10, 20, 40$; same layout.]
[Figure: $d = 5$; $\theta = 0.2, 2, 20$; $n = 13, 25, 50$; same layout.]
[Figure: $d = 6$; $\theta = 0.2, 2, 20$; $n = 15, 30, 60$; same layout.]
[Figure: $d = 7$; $\theta = 0.2, 2, 20$; $n = 18, 35, 70$; same layout.]
[Figure: $d = 8$; $\theta = 0.2, 2, 20$; $n = 20, 40, 80$; same layout.]
[Figure: $d = 9$; $\theta = 0.2, 2, 20$; $n = 23, 45, 90$; same layout.]
[Figure: $d = 10$; $\theta = 0.2, 2, 20$; $n = 25, 50, 100$; same layout.]

Bibliography

Carter, C. K. and Kohn, R. (1994), "On Gibbs Sampling for State Space Models," Biometrika, 81, 541–553.

McKay, M. D., Beckman, R.
J., and Conover, W. J. (1979), "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code," Technometrics, 21, 239–245.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087–1091.

Nagy, B., Loeppky, J. L., and Welch, W. J. (2007), "Fast Bayesian Inference for Gaussian Process Models," Tech. Rep. 230, Department of Statistics, The University of British Columbia.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Valid estimation and prediction inference in analysis...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Valid estimation and prediction inference in analysis of a computer model Nagy, Béla 2008-12-31
Item Metadata
Title | Valid estimation and prediction inference in analysis of a computer model |
Creator | Nagy, Béla |
Publisher | University of British Columbia |
Date | 2008 |
Date Issued | 2008-08-28T18:42:17Z |
Description | Computer models or simulators are becoming increasingly common in many fields in science and engineering, powered by the phenomenal growth in computer hardware over the past decades. Many of these simulators implement a particular mathematical model as a deterministic computer code, meaning that running the simulator again with the same input gives the same output. Often running the code involves some computationally expensive tasks, such as solving complex systems of partial differential equations numerically. When simulator runs become too long, their usefulness may be limited. In order to overcome time or budget constraints by making the most out of limited computational resources, a statistical methodology has been proposed, known as the "Design and Analysis of Computer Experiments". The main idea is to run the expensive simulator only at relatively few, carefully chosen design points in the input space, and based on the outputs construct an emulator (statistical model) that can emulate (predict) the output at new, untried locations at a fraction of the cost. This approach is useful provided that we can measure how much the predictions of the cheap emulator deviate from the real response surface of the original computer model. One way to quantify emulator error is to construct pointwise prediction bands designed to envelop the response surface and to assert that the true response (simulator output) is enclosed by these envelopes with a certain probability. Of course, to be able to make such probabilistic statements, one needs to introduce some kind of randomness. A common strategy, which we use here, is to model the computer code as a random function, also known as a Gaussian stochastic process. We concern ourselves with smooth response surfaces and use the Gaussian covariance function, which is ideal when the response function is infinitely differentiable. In this thesis, we propose Fast Bayesian Inference (FBI), which is both computationally efficient and can be implemented as a black box. Simulation results show that it can achieve remarkably accurate prediction uncertainty assessments in terms of matching coverage probabilities of the prediction bands, and the associated reparameterizations can also aid parameter uncertainty assessments. |
Extent | 780407 bytes |
Subject | Computer experiment; Bayesian inference; Gaussian process; Prediction uncertainty |
Genre | Thesis/Dissertation |
Type | Text |
File Format | application/pdf |
Language | eng |
Collection | Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2008-08-28 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0066559 |
URI | http://hdl.handle.net/2429/1561 |
Degree | Doctor of Philosophy - PhD |
Program | Statistics |
Affiliation | Faculty of Science; Department of Statistics |
Degree Grantor | University of British Columbia |
Graduation Date | 2008-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
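The Description above outlines the emulation idea: model the deterministic simulator as a Gaussian process with a Gaussian (squared-exponential) covariance, then form pointwise prediction bands at untried inputs. As a rough illustration, the following is a minimal sketch of the simplest "plugin" version, where the correlation parameters theta and the process variance are treated as known; it is not the thesis's FBI method, and the function names, the toy one-dimensional simulator, and the parameter values are all hypothetical.

```python
import numpy as np

def gauss_cov(X1, X2, theta, sigma2=1.0):
    # Gaussian ("squared-exponential") covariance, suited to
    # infinitely differentiable response surfaces.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * theta).sum(-1)
    return sigma2 * np.exp(-d2)

def gp_predict(X, y, Xnew, theta, sigma2=1.0, nugget=1e-8):
    # Conditional mean and variance of a zero-mean GP given the
    # n observed simulator runs (X, y); the small nugget is only
    # for numerical stability of the solve.
    K = gauss_cov(X, X, theta, sigma2) + nugget * np.eye(len(X))
    k = gauss_cov(Xnew, X, theta, sigma2)
    mean = k @ np.linalg.solve(K, y)
    var = sigma2 - np.einsum('ij,ji->i', k, np.linalg.solve(K, k.T))
    return mean, np.maximum(var, 0.0)

# Toy 1-d "simulator": deterministic and smooth, evaluated at
# 20 design points (a space-filling design would be used in practice).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 1))
y = np.sin(2.0 * np.pi * X[:, 0])
Xnew = np.linspace(0.0, 1.0, 5)[:, None]

mean, var = gp_predict(X, y, Xnew, theta=np.array([20.0]))
# Pointwise 95% prediction bands with parameters plugged in as known:
lower = mean - 1.96 * np.sqrt(var)
upper = mean + 1.96 * np.sqrt(var)
```

In practice theta and the variance are estimated, and plugging estimates in as if they were known tends to produce bands whose true coverage falls short of the nominal level; correcting that mismatch is exactly what the thesis's FBI and MCMC comparisons are about.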