Statistical Science
2016, Vol. 31, No. 1, 40–60
DOI: 10.1214/15-STS531
© Institute of Mathematical Statistics, 2016

Analysis Methods for Computer Experiments: How to Assess and What Counts?

Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. Welch

Hao Chen is a Ph.D. candidate and William J. Welch is Professor, Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Jason L. Loeppky is Associate Professor, Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). Jerome Sacks is Director Emeritus, National Institute of Statistical Sciences, Research Triangle Park, North Carolina 27709, USA (e-mail: sacks@niss.org).

Abstract. Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive, evidence-based assessment strategies are needed when comparing such modeling options. Otherwise, one is easily misled. Applying the strategies to several computer codes shows that a regression model more complex than a constant mean either has little impact on prediction accuracy or is an impediment. The choice of correlation function has modest effect, but there is little to separate two common choices, the power exponential and the Matérn, if the latter is optimized with respect to its smoothness. The applications presented here also provide no evidence that a composite of GPs provides practical improvement in prediction accuracy. A limited comparison of Bayesian and empirical Bayes methods is similarly inconclusive. In contrast, we find that the effect of experimental design is surprisingly large, even for designs of the same type with the same theoretical properties.

Key words and phrases: Correlation function, Gaussian process, kriging, prediction accuracy, regression.

1. INTRODUCTION

Over the past quarter century a literature beginning with Sacks, Schiller and Welch (1989), Sacks et al. (1989, in this journal), Currin et al. (1991), and O’Hagan (1992) has grown to explore the design and analysis of computer experiments. Such an experiment is a designed set of runs of a mathematical model implemented as a computer code. Running the code with vector-valued input x gives the output value, y(x), assumed real-valued and deterministic: Running the code again with the same value for x does not change y(x). A design D is a set of n runs at n configurations of x, and an objective of primary interest is to use the data (inputs and outputs) to predict, via a statistical model, the output of the code at untried input values.

The basic approach to the statistical model typically adopted starts by thinking of the output function, y(x), as being in a class of functions with prior distribution a Gaussian process (GP). The process has mean μ, which may be a regression function in x, and a covariance function, $\sigma^2 R$, from a specified parametric family. Prediction is then made through the posterior mean given the data from the computer experiment, with some variations depending on whether a maximum likelihood (empirical Bayes) or fuller Bayesian implementation is used. While we partially address those variations in this article, our main thrusts are the practical questions faced by the user: What regression function and correlation function should be used? Does it matter?

We will call a GP model with specified regression and correlation functions a Gaussian stochastic process (GaSP) model. For example, GaSP(Const, PowerExp) will denote a constant (intercept only) regression and the power-exponential correlation function.
The various regression models and correlation functions under consideration in this article will be defined in Section 2.

The rationale for the GaSP approach stems from the common situation that the dimension, d, of the space of inputs is not small, the function is fairly complex to model, and n is not large (code runs are expensive), hindering the effectiveness of standard methods (e.g., polynomials, splines, MARS) for producing predictions. The GaSP approach allows for a flexible choice of approximating models that adapts to the data and, more tellingly, has proved effective in coping with complex codes with moderately high d and scarce data. There is a vast literature treating an analysis in this context.

This article studies the impact on prediction accuracy of the particular model specifications commonly used, particularly μ, R, n, D. The goals are twofold. First, we propose a more evidence-based approach to distinguish what may be important from the unimportant and what may need further exploration. Second, our application of this approach to various examples leads to some specific recommendations.

Assessing statistical strategies for the analysis of a computer experiment often mimics what is done for physical experiments: a method is proposed, applied in examples (usually few in number), and compared to other methods. Where possible, formal, mathematical comparisons are made; otherwise, criteria for assessing performance are empirical. An initial empirical study for a physical experiment is forced to rely on the specific data of that experiment and, while different analysis methods may be applied, all are bound by the single data set. There are limited opportunities to vary sample size or design.

Computer experiments provide richer opportunities. Fast-to-run codes enable a laboratory to investigate the relative merits of an analysis method.
A whole spectrum of “replicate” experiments can be conducted for a single code, going beyond a thimbleful of “anecdotal” reports.

The danger of being misled by anecdotes can be seen in the following example. The borehole function [Morris, Mitchell and Ylvisaker, 1993, also given in the supplementary material (Chen et al., 2016)] is frequently used to illustrate methodology for computer experiments. A 27-run orthogonal array (OA) in the 8 input factors was proposed as a design, following Joseph, Hung and Sudjianto (2008). The 27 runs were analyzed via GaSP with a specific R (the Gaussian correlation function described in Section 2) and with two choices for μ: a simple constant (intercept) versus a method to select linear terms (SL), also described in Section 2. The details of these alternative models are not important for now, just that we are comparing two modeling methods. A set of 10,000 test points selected at random in the 8-dimensional input space was then predicted. The resulting values of the root mean squared error (RMSE) measure defined in (2.5) of Section 2 were 0.141 and 0.080 for the constant and SL regression models, respectively.

With the SL approach reducing the RMSE by about 43% relative to a model with a constant mean, does this example provide powerful evidence for using regression terms in the GaSP model? Not quite. We replicated the experiment with the same choices of μ, R, n and the same test data, but the training data came from a theoretically equivalent 27-run OA design. (There are many equivalent OAs, e.g., by permuting the labels between columns of a fixed OA.) The RMSE values in the second analysis were 0.073 and 0.465 for the constant and SL models, respectively; SL produced about 6 times the error relative to a constant mean: the evidence against using regression terms is even more powerful!

A broader approach is needed.
The one we take is laid out starting in Section 2, where we specify the alternatives considered for the statistical model’s regression component and correlation function, and define the assessment measures to be used. We focus on the fundamental criterion of prediction accuracy (uncertainty assessment is discussed briefly in Section 6.1). In Section 3 we outline the basic idea of generating repeated data sets for any given example. The method is (exhaustively) implemented for several fast codes, including the aforementioned borehole function, along with some choices of n and D. In Section 4 the method is adapted to slower codes where data from only one experiment are available. Ideally, the universe of computer experiments is represented by a set of test problems and assessment criteria, as in numerical optimization (Dixon and Szegö, 1978); the codes and data sets investigated in this article and its supplementary material (Chen et al., 2016) are but an initial set. In Section 5 other modeling strategies are compared. Finally, in Sections 6 and 7 we make some summarizing comments, conclusions and recommendations.

The article’s main findings are that regression terms are unnecessary or even sometimes an impediment, the choice of R matters for less smooth functions, and that the variability of performance of a method for the same problem over equivalent designs is alarmingly high. Such variation can mask the differences in analysis methods, rendering them unimportant and reinforcing the message that light evidence leads to flimsy conclusions.

2. STATISTICAL MODELS, EXPERIMENTAL DESIGN, AND ASSESSMENT

A computer code output is denoted by y(x) where the input vector, $x = (x_1, \ldots, x_d)$, is in the d-dimensional unit cube. As long as the input space is rectangular, transforming to the unit cube is straightforward and does not lose generality. Suppose n runs of the code are made according to a design D of input vectors $x^{(1)}, \ldots, x^{(n)}$ in $[0,1]^d$, resulting in data $y = (y(x^{(1)}), \ldots, y(x^{(n)}))^T$. The goal is to predict y(x) at untried x.

The GaSP approach uses a regression model and GP prior on the class of possible y(x). Specifically, y(x) is a priori considered to be a realization of

$$Y(x) = \mu(x) + Z(x), \tag{2.1}$$

where μ(x) is the regression component, the mean of the process, and Z(x) has mean 0, variance $\sigma^2$, and correlation function R.

2.1 Choices for the Correlation Function

Let x and x′ denote two values of the input vector. The correlation between Z(x) and Z(x′) is denoted by R(x, x′). Following common practice, R(x, x′) is taken to be a product of 1-d correlation functions in the distances $h_j = |x_j - x'_j|$, that is, $R(x, x') = \prod_{j=1}^{d} R_j(h_j)$. We mainly consider four choices for $R_j$:

• Power exponential (abbreviated PowerExp):

$$R_j(h_j) = \exp(-\theta_j h_j^{p_j}), \tag{2.2}$$

  with $\theta_j \ge 0$ and $1 \le p_j \le 2$ controlling the sensitivity and smoothness, respectively, of predictions of y with respect to $x_j$.

• Squared exponential or Gaussian (abbreviated Gauss): the special case of PowerExp in (2.2) with all $p_j = 2$.

• Matérn:

$$R_j(h_j) = \frac{1}{\Gamma(\nu_j)\, 2^{\nu_j - 1}} (\theta_j h_j)^{\nu_j} K_{\nu_j}(\theta_j h_j), \tag{2.3}$$

  where $\Gamma$ is the Gamma function, and $K_{\nu_j}$ is the modified Bessel function of order $\nu_j$. Again, $\theta_j \ge 0$ is a sensitivity parameter. The Matérn class was recommended by Stein (1999), Section 2.7, for its control via $\nu_j > 0$ of the differentiability of the correlation function with respect to $x_j$, and hence that of the prediction function. With $\nu_j = 1 + \frac{1}{2}$ or $\nu_j = 2 + \frac{1}{2}$, there are 1 or 2 derivatives, respectively. We call these subfamilies Matérn-1 and Matérn-2. Similarly, we use Matérn-0 and Matérn-∞ to refer to the cases $\nu_j = 0 + \frac{1}{2}$ and $\nu_j \to \infty$. They give the exponential family [$p_j = 1$ in (2.2)], with no derivatives, and Gauss, which is infinitely differentiable. Matérn-0, 1, 2 are closely related to linear, quadratic, and cubic splines. We believe that little would be gained by incorporating smoother kernels (but less smooth than the analytic Matérn-∞) in the study.

  Consideration of the entire Matérn class for $\nu_j > 0$ is computationally cumbersome for the large numbers of experiments we will evaluate. Hence, what we call Matérn has $\nu_j$ optimized over the Matérn-0, Matérn-1, Matérn-2, and Matérn-∞ special cases, separately for each coordinate.

• Matérn-2: Some authors (e.g., Picheny et al., 2013) fix $\nu_j$ in the Matérn correlation function to give some differentiability. The Matérn-2 subfamily sets $\nu_j = 2 + \frac{1}{2}$ for all j, giving 2 derivatives.

More recently, other types of covariance functions have been recommended to cope with “apparently nonstationary” functions (e.g., Ba and Joseph, 2012). In Section 5.2 we will discuss the implications and characteristics of these options.

Given a choice for $R_j$ and hence R, we define the n × n matrix R with element i, j given by $R(x^{(i)}, x^{(j)})$ and the n × 1 vector $r(x) = (R(x, x^{(1)}), \ldots, R(x, x^{(n)}))^T$ for any x where we want to predict.

2.2 Choices for the Regression Component

We explore three main choices for μ:

• Constant (abbreviated Const): $\mu(x) = \beta_0$.

• Full linear (FL): $\mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d$, that is, a full linear model in all input variables.

• Select linear (SL): μ(x) is linear in the $x_j$ like FL but only includes selected terms.

The proposed algorithm for SL is as follows. For a given correlation family construct a default predictor with Const for μ. Decompose the predictive function (Schonlau and Welch, 2006) and identify all main effects that contribute more than 100/d percent to the total variation. These become the selected coordinates. Typically, large main effects have clear linear components. If a large effect lacks a linear component, little is lost by including a linear term.
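The 100/d percent rule just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `select_linear_terms` and the example percentages are hypothetical, and the main-effect contributions are taken as given inputs rather than computed from the decomposition of Schonlau and Welch (2006).

```python
import numpy as np


def select_linear_terms(main_effect_pct):
    """Return the (0-based) indices of inputs whose main effect exceeds
    100/d percent of the total variation, where d is the number of inputs.

    main_effect_pct: percentage contributions of the d main effects,
    assumed to come from a prior fit with a constant regression term.
    """
    pct = np.asarray(main_effect_pct, dtype=float)
    d = pct.size
    threshold = 100.0 / d
    return [j for j in range(d) if pct[j] > threshold]


# Hypothetical contributions for d = 8 inputs (threshold is 12.5 percent).
example_pct = [80.0, 3.0, 1.0, 5.0, 0.5, 4.0, 2.0, 1.5]
selected = select_linear_terms(example_pct)  # only the first input qualifies
```

The selected indices would then define the linear terms included in μ(x) for the SL model.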
Inclusion of possible nonlinear trends can be pursued at considerable computational cost; we do not routinely do so, but in Section 4.1 we do include a regression model with nonlinear terms in $x_j$.

All candidate regression models considered can be written in the form

$$\mu(x) = \beta_0 + \beta_1 f_1(x) + \cdots + \beta_k f_k(x),$$

where the functions $f_j(x)$ are known. The maximum likelihood estimate (MLE) of $\beta = (\beta_0, \ldots, \beta_k)^T$ is the generalized least-squares estimate $\hat{\beta} = (F^T R^{-1} F)^{-1} F^T R^{-1} y$, where the n × (k + 1) matrix F has $(1, f_1(x^{(i)}), \ldots, f_k(x^{(i)}))$ in row i. This is also the Bayes posterior mean with a diffuse prior for β.

Early work (Sacks, Schiller and Welch, 1989) suggested that there is little to be gained (and maybe even something to lose) by using other than a constant term for μ. In addition, Lim et al. (2002) showed that polynomials can be exactly predicted with a minimal number of points using the Gauss correlation function, provided one lets the $\theta_j \to 0$. These points underline the fact that a GP prior can capture the complexity in the output of the code, suggesting that deploying regression terms is superfluous. The evidence we report later supports this impression.

2.3 Prediction

Predictions are made as follows. For given data and values of the parameters in R, the mean of the posterior predictive distribution of y(x) is

$$\hat{y}(x) = \hat{\mu}(x) + r^T(x) R^{-1} (y - F\hat{\beta}), \tag{2.4}$$

where $\hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 f_1(x) + \cdots + \hat{\beta}_k f_k(x)$.

In practice, the other parameters, $\sigma^2$ and those in the correlation function R of equations (2.2) or (2.3), must be estimated too. Empirical Bayes replaces all of β, $\sigma^2$, and the correlation parameters in R by their MLEs (Welch et al., 1992). Various other Bayes-based procedures are available, including one fully Bayesian strategy described in Section 5.1.
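As a rough illustration of the predictor (2.4), the following NumPy sketch computes the generalized least-squares estimate and the posterior mean for fixed correlation parameters, using the product power-exponential correlation (2.2). The function names, the toy data, and the small `nugget` added to the diagonal of R for numerical stability are assumptions of this sketch and not part of the method as described; in practice the correlation parameters would themselves be estimated, for example by maximum likelihood.

```python
import numpy as np


def power_exp_corr(A, B, theta, p):
    """Product power-exponential correlation (2.2) between rows of A and B."""
    h = np.abs(A[:, None, :] - B[None, :, :])        # shape (nA, nB, d)
    return np.exp(-np.sum(theta * h ** p, axis=2))   # product over j = exp(-sum)


def gasp_predict(X, y, X_new, theta, p, F=None, F_new=None, nugget=1e-10):
    """Posterior mean (2.4) for fixed theta and p.

    F and F_new are regression matrices for the training and new points;
    if omitted, a constant mean (a column of ones) is used.
    The nugget is an assumption added only for numerical stability.
    """
    n = X.shape[0]
    if F is None:
        F = np.ones((n, 1))
        F_new = np.ones((X_new.shape[0], 1))
    R = power_exp_corr(X, X, theta, p) + nugget * np.eye(n)
    r = power_exp_corr(X_new, X, theta, p)           # row i is r(x_new_i)^T
    # Generalized least squares: beta = (F' R^-1 F)^-1 F' R^-1 y
    beta = np.linalg.solve(F.T @ np.linalg.solve(R, F),
                           F.T @ np.linalg.solve(R, y))
    resid = y - F @ beta
    return F_new @ beta + r @ np.linalg.solve(R, resid)


# Toy data in d = 2 (hypothetical, for illustration only).
X = np.array([[0.1, 0.9], [0.3, 0.2], [0.5, 0.6],
              [0.7, 0.1], [0.9, 0.8], [0.2, 0.5]])
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2
theta = np.array([5.0, 5.0])
p = np.array([1.5, 1.5])

y_hat_train = gasp_predict(X, y, X, theta, p)   # close to y: the predictor interpolates
y_hat_new = gasp_predict(X, y, np.array([[0.4, 0.4]]), theta, p)
```

Setting all entries of `p` to 2 gives the Gauss correlation; a linear regression component can be used by passing `F` and `F_new` with a column of ones followed by the chosen input columns.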
Our focus, however, is not on the particular Bayes-based methods employed but rather on assumptions about the form of the underlying GaSP model.

2.4 Design

For fast codes we typically use as a base design D an approximate maximin Latin hypercube design (mLHD, Morris and Mitchell, 1995), with improved low-dimensional space-filling properties (Welch et al., 1996). A few other choices, such as orthogonal arrays, are also investigated in Section 3.5, with a more comprehensive comparison of different classes of design the subject of another ongoing study. In any event, we show that even for a fixed class of design and fixed n there is substantial variation in prediction accuracy over equivalent designs. Conclusions based on a single design choice can be misleading.

The effect of n on prediction accuracy was explored by Sacks, Schiller and Welch (1989) and more recently by Loeppky, Sacks and Welch (2009); its role in the comparison of competing alternatives for μ and R will also be addressed in Section 3.5.

2.5 Measures of Prediction Error

In order to compare various forms of the predictor in (2.4) built from the n code runs, $y = (y(x^{(1)}), \ldots, y(x^{(n)}))^T$, we need to set some standards. The gold standard is to assess the magnitude of prediction error via holdout (test) data, that is, in predicting N further runs, $y(x_{\mathrm{ho}}^{(1)}), \ldots, y(x_{\mathrm{ho}}^{(N)})$. The prediction errors are $\hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)})$ for $i = 1, \ldots, N$.

The performance measure we use is a normalized RMSE of prediction over the holdout data, denoted by $e_{\mathrm{rmse,ho}}$. The normalization is the RMSE using the (trivial) predictor $\bar{y}$, the mean of the data from the runs in the experimental design, the “training” set. Thus,

$$e_{\mathrm{rmse,ho}} = \frac{\sqrt{(1/N) \sum_{i=1}^{N} \bigl(\hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)})\bigr)^2}}{\sqrt{(1/N) \sum_{i=1}^{N} \bigl(\bar{y} - y(x_{\mathrm{ho}}^{(i)})\bigr)^2}}. \tag{2.5}$$

The normalization in the denominator puts $e_{\mathrm{rmse,ho}}$ roughly on [0, 1] whatever the scale of y, with 1 indicating no better performance than $\bar{y}$.
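The measure (2.5) is simple to compute once holdout predictions are available. The following is a minimal sketch; the function name `normalized_holdout_rmse` and the toy numbers are illustrative assumptions.

```python
import numpy as np


def normalized_holdout_rmse(y_hat_ho, y_ho, y_train):
    """Normalized holdout RMSE as in (2.5).

    y_hat_ho: predictions at the N holdout points
    y_ho:     true code outputs at the N holdout points
    y_train:  outputs from the n training runs (used only through their mean)
    """
    y_hat_ho = np.asarray(y_hat_ho, dtype=float)
    y_ho = np.asarray(y_ho, dtype=float)
    y_bar = np.mean(y_train)
    rmse_model = np.sqrt(np.mean((y_hat_ho - y_ho) ** 2))
    rmse_trivial = np.sqrt(np.mean((y_bar - y_ho) ** 2))
    return rmse_model / rmse_trivial


# Toy check: predicting the training mean everywhere gives a value of 1,
# and a perfect predictor gives 0.
y_train = [1.0, 2.0, 3.0]
y_ho = [0.0, 4.0]
e_trivial = normalized_holdout_rmse([2.0, 2.0], y_ho, y_train)
e_perfect = normalized_holdout_rmse(y_ho, y_ho, y_train)
```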
The criterion is related to $R^2$ in regression, but $e_{\mathrm{rmse,ho}}$ measures performance for a new test set and smaller values are desirable.

Similarly, worst-case performance can be defined as the normalized maximum absolute error. Results for this metric are reported in the supplementary material (Chen et al., 2016); the conclusions are the same. Other definitions (such as median absolute error) and other normalizations (such as those of Loeppky, Sacks and Welch, 2009) can be used, but without substantive effect on comparisons.

FIG. 1. Equivalent designs for d = 2 and n = 11: (a) a base mLHD design; (b) the base design with labels permuted between columns; and (c) the base design with values in the $x_1$ column reflected around $x_1 = 0.5$.

What are tolerable levels of error? Clearly, these are application-specific so that tighter thresholds would be demanded, say, for optimization than for sensitivity analysis. For general purposes we take the rule of thumb that $e_{\mathrm{rmse,ho}} < 0.10$ is useful. For normalized maximum error it is plausible that the threshold could be much larger, say 0.25 or 0.30. These speculations are consequences of the experiences we document later, and are surely not the last word. The value of having thresholds is to provide benchmarks that enable assessing when differences among different methods or strategies are practically insignificant versus statistically significant.

3. FAST CODES

3.1 Generating a Reference Set for Comparisons

For fast codes under our control, large holdout sets can be obtained. Hence, in this section performance is measured through the use of a holdout (test) set of 10,000 points, selected as a random Latin hypercube design (LHD) on the input space.

With a fast code many designs and hence training data sets can easily be generated. We generate many equivalent designs by exploiting symmetries. For a simple illustration, Figure 1(a) shows a base mLHD for d = 2 and n = 11.
Permuting the labels between columns of the design, that is, interchanging the $x_1$ and $x_2$ values as in Figure 1(b), does not change the inter-point distances or properties based on them such as the minimum distance used to construct mLHD designs. Similarly, reflecting the values within, say, the $x_1$ column around $x_1 = 0.5$ as in Figure 1(c), does not change the properties. In this sense the designs are equivalent.

In general, for any base design with good properties, there are $d!\,2^d$ equivalent designs and hence equivalent sets of training data available from permuting all column labels and reflecting within columns for a subset of inputs. For the borehole code mentioned in Section 1 and investigated more fully in Section 3.2, we have found that permuting between columns gives more variation in prediction accuracy than reflecting within columns. Thus, in this article for nearly all examples we only permute between columns: for d = 4 all 24 possible permutations, and for d ≥ 5 a random selection of 25 different permutations. The example of Section 5.2 with d = 2 is the one exception. Because $y(x_1, x_2)$ is symmetric in $x_1$ and $x_2$, permuting between columns does not change the training data and we reflect within $x_1$ and/or $x_2$ instead.

The designs, generated data sets, and replicate analyses then serve as the reference set for a particular problem and provide the grounds on which variability of performance can be assessed. Given the setup of Section 2, we want to assess the consequences of making a choice from the menu of three regression models and four correlation functions.

3.2 Borehole Code

The first setting we will look at is the borehole code (Morris, Mitchell and Ylvisaker, 1993) mentioned in Section 1 and described in the supplementary material (Chen et al., 2016). It has served as a test bed in many contexts (e.g., Joseph, Hung and Sudjianto, 2008).
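The construction of equivalent designs in Section 3.1 (permuting labels between columns and reflecting within columns) can be sketched as follows. The function names and the toy base design are assumptions made for illustration; in particular, the base design below is a random Latin hypercube, not an actual maximin LHD.

```python
import itertools
import math

import numpy as np


def column_permutations(D, n_perm=None, seed=0):
    """Return column-permuted copies of the design D (an n x d array).

    If n_perm is None, all d! permutations are used; otherwise n_perm
    distinct permutations are drawn at random.
    """
    d = D.shape[1]
    if n_perm is None:
        perms = list(itertools.permutations(range(d)))
    else:
        if n_perm > math.factorial(d):
            raise ValueError("n_perm exceeds the number of permutations d!")
        rng = np.random.default_rng(seed)
        chosen = set()
        while len(chosen) < n_perm:
            chosen.add(tuple(int(j) for j in rng.permutation(d)))
        perms = sorted(chosen)
    return [D[:, list(perm)] for perm in perms]


def reflect(D, cols):
    """Reflect the listed columns of D around 0.5 (inputs scaled to [0, 1])."""
    out = D.copy()
    out[:, cols] = 1.0 - out[:, cols]
    return out


def min_distance(D):
    """Smallest Euclidean distance between any two rows of D."""
    diff = D[:, None, :] - D[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist[np.triu_indices(D.shape[0], k=1)].min()


# Toy base design for n = 11, d = 4 (random LHD, for illustration only).
rng = np.random.default_rng(1)
n, d = 11, 4
base = np.column_stack([(rng.permutation(n) + 0.5) / n for _ in range(d)])

designs = column_permutations(base)          # all 4! = 24 permutations
n_equivalent = math.factorial(d) * 2 ** d    # d! 2^d equivalent designs
```

Each permuted or reflected design has the same minimum inter-point distance as the base design, which is the sense in which the designs are equivalent; the code output at the permuted inputs, and hence the training data, generally differs.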
We consider three different designs for the experiment: a 27-run, 3-level orthogonal array (OA), the same design used by Joseph, Hung and Sudjianto (2008); a 27-run mLHD; and a 40-run mLHD.

FIG. 2. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

There are 12 possible modeling combinations from the four correlation functions and three regression models outlined in Sections 2.1 and 2.2. The SL choice for μ here is always the term $x_1$. Its main effect accounts for approximately 80% of the variation in predictions over the 8-dimensional input domain, and all analyses with a Const regression model choose $x_1$ and no other terms across all designs and all repeat experiments.

The top row of Figure 2 shows results with the 27-run OA design. For a given modeling strategy, 25 random permutations between columns of the 27-run OA lead to 25 repeat experiments (Section 3.1) and hence a reference set of 25 values of $e_{\mathrm{rmse,ho}}$ shown as a dot plot. The results are striking. Relative to a constant regression model, the FL regression model has empirical distributions of $e_{\mathrm{rmse,ho}}$ which are uniformly and substantially inferior, for all correlation functions. The SL regression also performs very poorly sometimes, but not always. To investigate the SL regression further, Figure 3 plots $e_{\mathrm{rmse,ho}}$ for individual repeat experiments, comparing the GaSP(Const, Gauss) and GaSP(SL, Gauss) models. Consistent with the anecdotal comparisons in Section 1, the plot shows that the SL regression model can give a smaller $e_{\mathrm{rmse,ho}}$ (this tends to happen when both methods perform fairly well), but the SL regression sometimes has very poor accuracy (almost 0.5 on the normalized RMSE scale). The top row of Figure 2 also shows that the choice of correlation function is far less important than the choice of regression model.

FIG. 3. Borehole code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for an SL regression model versus a constant regression model. The 25 points are from repeat experiments generated by 25 random permutations between columns of a 27-run OA.

The results for the 27-run mLHD in the middle row of Figure 2 show that design can have a large effect on accuracy: every analysis model performs better for the 27-run mLHD than for the 27-run OA. (Note the vertical scale is different for each row of the figure.) The SL regression now performs about the same as the constant regression instead of occasionally much worse. There is no substantial difference in accuracy between the correlation functions. Indeed, the impact on accuracy of using the space-filling mLHD design instead of an OA is much more important than differences due to choice of the correlation function. The scaling in the middle row of plots somewhat mutes the considerable variation in accuracy still present over the 25 equivalent mLHD designs.

Increasing the number of runs to a 40-run mLHD (bottom row of Figure 2) makes a further substantial improvement in prediction accuracy. All 12 modeling strategies give $e_{\mathrm{rmse,ho}}$ values of about 0.02–0.06 over the 25 repeat experiments. Although there is little systematic difference among strategies, the variation over equivalent designs is still striking in a relative sense.

The strikingly poor results from the SL regression model (sometimes) and the FL model (all 25 repeats) in the top row of Figure 2 may be explained as follows. The design is a 27-run OA with only 3 levels. In a simpler context, Welch et al. (1996) illustrated nonidentifiability of the important terms in a GaSP model when the design is not space-filling.
The SL regression and, even more so, the FL regression complicate an already flexible GP model. The difficulty in identifying the important terms is underscored by the fact that for all 25 repeat experiments from the base 27-run OA, a least squares fit of a simple linear regression model in $x_1$ (with no other terms) gives $e_{\mathrm{rmse,ho}}$ values close to 0.45. In other words, performance of GaSP(SL, Gauss) is sometimes similar to fitting just the important $x_1$ linear trend. The performance of GaSP(FL, Gauss) is highly variable and sometimes even worse than simple linear regression.

Welch et al. (1996) argued that model identifiability is, not surprisingly, connected with confounding in the design. The confounding in the base 27-run OA is complex. While it is preserved in an overall sense by permutations between columns, how the confounding structure aligns with the important inputs among $x_1, \ldots, x_8$ will change across the 25 repeat experiments. Hence, the impact of confounding on nonidentifiability will vary.

In contrast, accuracy for the space-filling design in the middle row of Figure 2 is much better, even with only 27 runs. The SL regression model performs as accurately as the Const model (but no better); only the even more complex FL regression runs into difficulties. Again, this parallels the simpler Welch et al. (1996) example, where model identification was less problematic with a space-filling design and largely eliminated by increasing the sample size (the bottom row of Figure 2).

3.3 G-Protein Code

A second application, the G-protein code used by Loeppky, Sacks and Welch (2009) and described in the supplementary material (Chen et al., 2016), consists of a system of ODEs with 4-dimensional input.

Figure 4 shows $e_{\mathrm{rmse,ho}}$ for the three regression models (here SL selects $x_2$, $x_3$ as inputs with large effects) and four correlation functions. The results in the top row are for a 40-run mLHD.
With d = 4, all 24 possible permutations between columns of a single base design lead to 24 data sets and hence 24 $e_{\mathrm{rmse,ho}}$ values. The dot plots in the top row have similar distributions across the 12 modeling strategies. As these empirical distributions have most $e_{\mathrm{rmse,ho}}$ values above 0.1, we try increasing the sample size with an 80-run mLHD. This has a substantial effect on accuracy, with all modeling methods giving $e_{\mathrm{rmse,ho}}$ values of about 0.06 or less.

Thus, for the G-protein application, none of the three choices for μ or the four choices for R matter. The variation among equivalent designs is alarmingly large in a relative sense, dwarfing any impact of modeling strategy.

3.4 PTW Code

Results for a third fast-to-run code, PTW (Preston, Tonks and Wallace, 2003), are in the supplementary material (Chen et al., 2016). It has 11 inputs. We took an mLHD with n = 110 as the base design for the reference set. Prior information from engineers suggested incorporating linear $x_1$ and $x_2$ terms; SL also included $x_3$. No essential differences among μ or R emerged, but again there is a wide variation over equivalent designs.

FIG. 4. G-protein code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of $e_{\mathrm{rmse,ho}}$ in each dot plot.

3.5 Effect of Design

The results above document a significant, seldom recognized role of design: different, even equivalent, designs can have a greater effect on performance than the choice of μ, R. Moreover, without prior information, there is no way to ensure that the choice of design will be one of the good ones in the equivalence class.
Whether sequential experimentation, if feasible, can produce a more advantageous solution needs exploring.

The contrast between the results for the borehole 27-run OA and the 27-run mLHD is a reminder of the importance of using designs that are space-filling, a quality widely appreciated. It is no secret that the choice of sample size, n, has a strong effect on performance as evidenced in the results for the 40-point mLHD contrasted with those for the 27-point mLHD. A more penetrating study of the effect of n was conducted by Loeppky, Sacks and Welch (2009). That FL does as well as Const and SL for the borehole 40-point mLHD but performs badly for either of the two 27-point designs, and that none of the regression choices matter for the G-protein 40-point design or for the PTW 110-point design, suggests that “everything” works if n is large enough.

In summary, the choice of n and the choice of D given n can have huge effects. But have we enough evidence that choice of μ matters only in limited contexts (such as small n or poor design) and that choice of R does not matter? So far we have dealt with only a handful of simple, fast codes; it is time to consider more complex codes.

4. SLOW CODES

For complex costly-to-run codes, generating substantial holdout data or output from multiple designs is infeasible. Similarly, for codes where we only have reported data, new output data are unavailable. Being forced to depend on what data are at hand leads us to rely on cross-validation methods for generating multiple designs and holdout sets, through which we can assess the effect of variability not solely in the designs but also, and inseparably, in the holdout target data. We know from Section 3 that variability due to designs is considerable, and it is no surprise that variability in holdout sets would lead to variability in predictive performance.
The utility then of the created multiple designs and holdout sets is to compare the behavior of different modeling choices under varying conditions rather than relying on a single quantity attached to the original, unique data set.

Our approach is simply to delete a subset from the full data set, use the remaining data to produce a predictor, and calculate the (normalized) RMSE from predicting the output in the deleted (holdout) subset. Repeating this for a number (25 is what we use) of subsets gives some measure of variability and accuracy. In effect, we create 25 designs and corresponding holdout sets from a single data set and compare consequences arising from different choices for predictors.

The details described in the applications below differ somewhat depending on the particular application. In the example of Section 4.1 (a reflectance model for a plant canopy) there are, in fact, limited holdout data but no data from multiple designs. In the volcano-eruption example of Section 4.2 and the sea-ice model of Section 4.3 holdout data are unavailable.

4.1 Nilson–Kuusk Model

An ecological code modeling reflectance for a plant canopy developed by Nilson and Kuusk (1989) was used by Bastos and O’Hagan (2009) to illustrate diagnostics for GaSP models. With 5-dimensional input, two computer experiments were performed: the first using a 150-run random LHD and the second with an independently chosen LHD of 100 points.

We carry out three studies based on the same data. The first treats the 100-point LHD as the experiment and the 150-point set as a holdout sample. The second study reverses the roles of the two LHDs. A third study, extending one done by Bastos and O’Hagan (2009), takes the 150-run LHD, augments it with a random sample of 50 points from the 100-point LHD, takes the resulting 200-point subset as the experimental design for training the statistical model, and uses the remaining N = 50 points from the 100-run LHD to form the holdout set in the calculation of $e_{\mathrm{rmse,ho}}$.
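The repeated delete-and-predict scheme can be sketched as a generator of random index splits. This is a minimal illustration under assumptions: the function name `repeated_holdout_splits`, the seed, and the use of uniformly random subsets are choices made here, and the fitting and scoring steps are left out.

```python
import numpy as np


def repeated_holdout_splits(n_total, n_holdout, n_repeats=25, seed=0):
    """Return n_repeats pairs (train_idx, holdout_idx) of row indices.

    Each repeat deletes a random subset of size n_holdout from the
    n_total available runs; the remaining runs are kept for training.
    """
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_total)
        holdout_idx = np.sort(perm[:n_holdout])
        train_idx = np.sort(perm[n_holdout:])
        splits.append((train_idx, holdout_idx))
    return splits


# Setting of the third study: the 100-point LHD is split into 50 extra
# training points and 50 holdout points; the 150-run LHD is always used
# for training, so each repeat trains on 150 + 50 = 200 runs.
splits = repeated_holdout_splits(n_total=100, n_holdout=50)
n_train_per_repeat = 150 + len(splits[0][0])
```

For each pair, a predictor would be fitted to the training rows and the normalized RMSE computed on the holdout rows, giving one value per repeat.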
By repeating the sampling of the 50 points 25 times, we get 25 replicate experiments, each with the same base 150 runs but differing with respect to the additional 50 training points and the holdout set.

In addition to the linear regression choices we have studied so far, we also incorporate a regression model identified by Bastos and O’Hagan (2009): an intercept, linear terms in the inputs x1, . . . , x4, and a quartic polynomial in x5. We label this model “Quartic.” All analyses are carried out with the output y on a log scale, based on standard diagnostics for GaSP models (Jones, Schonlau and Welch, 1998).

TABLE 1
Nilson–Kuusk model: Normalized holdout RMSE of prediction, e_rmse,ho, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the holdout set is from a 150-run LHD

                      e_rmse,ho by correlation function
Regression model      Gauss     PowerExp   Matérn-2   Matérn
Constant              0.116     0.099      0.106      0.102
Select linear         0.115     0.099      0.106      0.105
Full linear           0.110     0.099      0.104      0.104
Quartic               0.118     0.103      0.107      0.106

Table 1 summarizes the results of the study with the 100-point LHD as training data and the 150-point set as a holdout sample. It shows the choice for μ is immaterial: the constant mean is as good as any. For the correlation function, Gauss is inferior to the other choices, there is some evidence that Matérn is preferred to Matérn-2, and there is little difference between PowerExp and Matérn, the best performers. Similar results pertain when the 150-run LHD is used for training and the 100-run set for testing [Table 4 in the supplementary material (Chen et al., 2016)].

The dot plots in Figure 5 for the third study are even more striking in exhibiting the inferiority of R = Gauss and the lack of advantages for any of the nonconstant regression functions. The large variability in performance among designs and holdout sets is similar to that seen for the fast-code replicate experiments of Section 3.
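The replicate construction of the third study, and the normalized RMSE it feeds, can be sketched as follows. This is an illustration only, not the code used for the article: the function names, the index convention (base LHD first, second LHD after it), and the seed are assumptions, and the normalization by the trivial training-mean predictor follows the description of (2.5) given in Section 5.2.

```python
import math
import random


def normalized_rmse(y_pred, y_true, y_train_mean):
    """Holdout RMSE divided by the holdout RMSE of the trivial predictor.

    The trivial predictor is the mean of the training outputs (assumed
    form of the normalization; see the text around (2.5)).
    """
    n = len(y_true)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)
    trivial = math.sqrt(sum((y_train_mean - t) ** 2 for t in y_true) / n)
    return rmse / trivial


def third_study_splits(n_base=150, n_second=100, n_extra=50, repeats=25, seed=0):
    """Yield (train_idx, holdout_idx) for each replicate experiment.

    Indices 0..n_base-1 stand for the base LHD; the next n_second indices
    stand for the second LHD (a hypothetical labelling for this sketch).
    """
    rng = random.Random(seed)
    base = list(range(n_base))
    second = list(range(n_base, n_base + n_second))
    for _ in range(repeats):
        extra = set(rng.sample(second, n_extra))      # 50 added training runs
        train = base + sorted(extra)                  # 200-run design
        holdout = [i for i in second if i not in extra]  # remaining 50 runs
        yield train, holdout
```

Each pair of index lists would then be passed to whatever fitting routine is being compared; any predictor can be plugged in, which is the point of the assessment.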
The perturbations of the experiment, from random sampling here, appear to provide a useful reference set for studying the behavior of model choices.

The large differences in prediction accuracy among the correlation functions, not seen in Section 3, deserve some attention. An overly smooth correlation function, the Gaussian, does not perform as well as the Matérn and power-exponential functions here. The latter two have the flexibility to allow needed rougher realizations. With the 150-run design and the constant regression model, for instance, the maximum of the log likelihood increases by about 50 when the power exponential is used instead of the Gaussian, with four of the pj in (2.2) taking values less than 2.

FIG. 5. Nilson–Kuusk code: Normalized holdout RMSE of prediction, e_rmse,ho, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

The estimated main effect (Schonlau and Welch, 2006) of x5 in Figure 6 from the GaSP(Const, PowerExp) model shows that x5 has a complex effect. It is also a strong effect, accounting for about 90% of the total variance of the predicted output over the 5-dimensional input space. Bastos and O’Hagan (2009) correctly diagnosed the complexity of this trend. Modeling it via a quartic polynomial in x5 has little impact on prediction accuracy, however. The correlation structure of the GaSP is able to capture the trend implicitly just as well.

FIG. 6. Nilson–Kuusk code: Estimated main effect of x5.

4.2 Volcano Model

A computer model studied by Bayarri et al. (2009) models the process of pyroclastic flow (a fast-moving current of hot gas and rock) from a volcanic eruption. The inputs varied are as follows: initial volume, x1, and direction, x2, of the eruption.
The output, y, is the maximum (over time) height of the flow at a location. A 32-run data set provided by Elaine Spiller [different from that reported by Bayarri et al. (2009) but a similar application] is available in the supplementary material (Chen et al., 2016). Plotting the data shows the output has a strong trend in x1, and putting a linear term in the GaSP surrogate, as modeled by Bayarri et al. (2009), is natural. But is it necessary?

The nature of the data suggests a transformation of y could be useful. The one used by Bayarri et al. (2009) is log(y + 1). Diagnostic plots (Jones, Schonlau and Welch, 1998) from using μ = Const and R = Gauss show that the log transform is reasonable, but a square-root transformation is better still. We report analyses for both transformations.

The regression functions considered are Const, SL (β0 + β1x1), full linear, and quadratic (β0 + β1x1 + β2x2 + β3x1²). The quadratic term is included because the estimated effect of x1 has a strong trend growing faster than linearly when looking at main effects from the surrogate obtained using √y and GaSP(Const, PowerExp).

Analogous to the approach in Section 4.1, repeat experiments are generated by random sampling of 25 runs from the 32 available to comprise the design for model fitting. The remaining 7 runs form the holdout set. This is repeated 25 times, giving 25 e_rmse,ho values in the dot plots of Figure 7. The conclusions are much like those in Section 4.1: there is usually no need to go beyond μ = Const, and PowerExp is preferred to Gauss. The failure of Gauss in the two “slow” examples considered thus far is surprising in light of the widespread use of the Gauss correlation function.

FIG. 7. Volcano model: Normalized holdout RMSE, e_rmse,ho, for four regression models and four correlation functions. The output variable is either √y or log(y + 1).

4.3 Sea-Ice Model

The Arctic sea-ice model studied in Chapman et al. (1994) and in Loeppky, Sacks and Welch (2009) has 13 inputs, 4 outputs, and 157 available runs.
The previous studies found modest prediction accuracy of GaSP(Const, PowerExp) surrogates for two of the outputs (ice mass and ice area) and poor accuracy for the other two (ice velocity and ice range). The question arises whether use of linear regression terms can increase accuracy to acceptable levels. Using a sampling process like that in Section 4.2 leads to the results in the supplementary material (Chen et al., 2016), where the answer is no: there is no help from μ = SL or FL, nor from changing R. Indeed, FL makes accuracy much worse sometimes.

5. OTHER MODELING STRATEGIES

Clearly, we have not studied all possible paths to GaSP modeling that one might take in a computer experiment. In this section we address several others, some in more detail, and point to issues that could be addressed in the fashion described above.

5.1 Full Bayes

A number of full Bayes approaches have been employed in the literature. They go beyond the statistical formulation using a GP as a prior on the class of functions and assign prior distributions to all parameters, particularly those of the correlation function. For illustration, we examine the GEM-SA implementation of Kennedy (2004), which we call Bayes-GEM-SA. One key aspect is its reliance on R = Gauss. It also uses the following independent prior distributions: β0 ∝ 1, σ² ∝ 1/σ², and θj exponential with rate 0.01 (Kennedy, 2004). When comparing its predictive accuracy with GaSP, μ = Const is used for all models.

For the borehole application, 25 repeat experiments are constructed for three designs, as in Section 3. The dot plots of e_rmse,ho in Figure 8 compare Bayes-GEM-SA with the Gauss and PowerExp methods in Section 3 based on MLEs of all parameters. (The method CGP and its dot plot are discussed in Section 5.2.) Bayes-GEM-SA is less accurate than either GaSP(Const, Gauss) or GaSP(Const, PowerExp).

FIG. 8. Borehole function: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are three base designs: a 27-run OA (left), a 27-run mLHD (middle), and a 40-run mLHD (right). For each base design, 25 random permutations between columns give the 25 values of e_rmse,ho in a dot plot.

Figure 9 similarly depicts results for the G-protein code. With the 40-run mLHD, the Bayesian and likelihood methods all perform about the same, giving only fair prediction accuracy. Increasing n to 80 improves accuracy considerably for all methods (the scales of the two plots are very different), far outweighing any systematic differences between their accuracies.

FIG. 9. G-protein: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are two base designs: a 40-run mLHD (left); and an 80-run mLHD (right). For each base design, all 24 permutations between columns give the 24 values of e_rmse,ho in a dot plot.

Bayes-GEM-SA performs as well as the GaSP methods for G-protein, not so well for Borehole with n = 27 but adequately for n = 40. Turning to the slow codes in Section 4, a different message emerges. Figure 10 for the Nilson–Kuusk model is based on 25 repeat designs constructed as for Figure 5 with a base design of 150 runs plus 50 randomly chosen from 100. The distributions of e_rmse,ho for Bayes-GEM-SA and Gauss are similar, with PowerExp showing a clear advantage. Moreover, few of the Bayes e_rmse,ho values meet the 0.10 threshold, while all the GaSP(Const, PowerExp) e_rmse,ho values do. Bayes-GEM-SA uses the Gaussian correlation function, which performed relatively poorly in Section 4; the disadvantage carries over to the Bayesian method here.

FIG. 10. Nilson–Kuusk model: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

The results in Figure 11 for the volcano code are for the 25 repeat experiments described in Section 4. Here again PowerExp dominates Bayes and for the same reasons as for the Nilson–Kuusk model. For the √y transformation, all but a few GaSP(Const, PowerExp) e_rmse,ho values meet the 0.10 threshold, in contrast to Bayes where all but a few do not.

These results are striking and suggest that Bayes methods relying on R = Gauss need extension. The “hybrid” Bayes-MLE approach employed by Bayarri et al. (2009) estimates the correlation parameters in PowerExp by their MLEs, fixes them, and takes objective priors for μ and σ². The mean of the predictive distribution for a holdout output value gives the same prediction as GaSP(Const, PowerExp). Whether other “hybrid” forms can be brought to bear effectively needs exploration.

5.2 Nonstationarity

The use of stationary GPs as priors in the face of “nonstationary appearing” functions has attracted a measure of concern despite the fact that all functions with L2-derivative can be approximated using PowerExp with enough data. Of course, there never are enough data. A relevant question is whether other priors, even stationary ones different from those in Section 2, are better reflective of conditions and lead to more accurate predictors.

West et al. (1995) employed a GP prior for y(x) with two additive components: a smooth one for global trend and a rough one to model more local behavior. Recently, a similar “composite” GP (CGP) approach was advanced by Ba and Joseph (2012). These authors used two GPs, both with Gauss correlation. The first has correlation parameters θj in (2.2) constrained to be small for gradually varying longer-range trend, while the second has larger values of θj for shorter-range behavior.
The second, local GP also has a variance that depends on x, primarily as a way to cope with apparent nonstationary behavior. Does this composite approach offer an effective improvement to the simpler choices of Section 2?

We can apply CGP via its R library to the examples studied in Sections 3 and 4, much as was just done for Bayes-GEM-SA. The comparisons in Figure 8 for the borehole function show that GaSP and CGP have similar accuracy for the two 27-run designs. GaSP has smaller error than CGP for the 40-run mLHD, though both methods achieve acceptable accuracy. The results in Figure 9 for G-protein show little practical difference between any of the methods, including CGP. For these two fast-code examples, there is negligible difference between CGP and the GaSP methods. For the models of Section 4, however, conclusions are somewhat different. GaSP(Const, PowerExp) is clearly much more accurate than CGP for the Nilson–Kuusk model (Figure 10) and roughly equivalent for the volcano code (Figure 11).

FIG. 11. Volcano model: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

FIG. 12. 2-d function: Holdout predictions versus true values of y from fitting (a) GaSP(Const, PowerExp) and (b) CGP.

Ba and Joseph (2012) gave several examples assessing the performance of CGP. For reasons noted in Sections 6.2 and 6.4 we only look at two.

10-d example. The test function is

$$y(x) = -\sum_{j=1}^{10} \sin(x_j)\,\bigl(\sin(j x_j^2/\pi)\bigr)^{20} \qquad (0 < x_j < \pi).$$

With n = 100, Ba and Joseph (2012) obtained unnormalized RMSE values of about 0.72–0.74 for CGP and about 0.72–0.88 for a GaSP(Const, Gauss) model over 50 repeat experiments.

This example demonstrates a virtue of using a normalized performance measure. To compute the normalizing factor for RMSE in (2.5), we followed the process of Ba and Joseph (2012).
Training data from an LHD with n = 100 gives ȳ, the trivial predictor. The normalization in the denominator of (2.5) is computed from N = 5000 random test points. Repeating this process 50 times gives normalization factors of 0.71–0.74, about the same as the raw RMSE values from CGP. Thus, CGP’s RMSE prediction accuracy is no better than that of the trivial ȳ predictor, and the default GaSP(Const, Gauss) method is worse. Effective prediction here is unattainable by CGP or GaSP, and perhaps by any other approach, with n = 100 because the function is so multi-modal; comparisons of CGP with other methods are meaningless in this example.

2-d example. For the function

$$y(x_1, x_2) = \sin\bigl(1/(x_1 x_2)\bigr) \qquad (0.3 \le x_j \le 1), \tag{5.1}$$

Ba and Joseph (2012) used a single design with n = 24 runs to compare CGP and GaSP(Const, Gauss). Their results suggest that accuracy is poor for both methods, which we confirmed. For this example, following Ba and Joseph (2012), a holdout set of 5000 random points on [0.3, 1]² was used. For one mLHD with n = 24, we obtained e_rmse,ho values of 0.23 and 0.24 for CGP and GaSP(Const, PowerExp), respectively. Moreover, the diagnostic plot in Figure 12 shows how badly CGP (and GaSP) perform. Both methods grossly over-predict for some points in the holdout set, with GaSP worse in this respect. Both methods also have large errors from under-prediction, with CGP worse.

Does this result generalize? With only two input variables and a function that is symmetric in x1 and x2, repeat experiments cannot be generated by permuting the column labels of the design. Reflecting within the x1 and x2 columns is considered below, but first we created multiple experiments by increasing n.

We were also curious about how large n has to be before acceptable accuracy is attained. Comparisons between CGP and GaSP(Const, PowerExp) were made for n = 24, 25, . . . , 48; for each value of n an mLHD was generated.
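For reference, the two test functions of this section and the normalizing factor (the RMSE of the trivial predictor ȳ on random test points) can be written down in a few lines. The sketch below is illustrative and makes simplifying assumptions that differ from the article: training points are drawn uniformly at random rather than from an LHD or mLHD, the function names and the seed are hypothetical, and no attempt is made to reproduce the reported factors of 0.71–0.74.

```python
import math
import random


def ten_d_test_function(x):
    """10-d example: y(x) = -sum_j sin(x_j) * sin(j * x_j^2 / pi)^20, 0 < x_j < pi."""
    return -sum(
        math.sin(xj) * math.sin(j * xj ** 2 / math.pi) ** 20
        for j, xj in enumerate(x, start=1)
    )


def two_d_test_function(x):
    """2-d example (5.1): y = sin(1 / (x1 * x2)), 0.3 <= x_j <= 1."""
    return math.sin(1.0 / (x[0] * x[1]))


def trivial_predictor_rmse(f, lower, upper, dim, n_train, n_test, rng):
    """RMSE of the training-mean predictor over random test points.

    Assumption: uniform random sampling stands in for the LHD used
    in the article for the training runs.
    """
    def draw():
        return [rng.uniform(lower, upper) for _ in range(dim)]

    y_bar = sum(f(draw()) for _ in range(n_train)) / n_train
    squared = sum((y_bar - f(draw())) ** 2 for _ in range(n_test))
    return math.sqrt(squared / n_test)


rng = random.Random(2016)
factor_10d = trivial_predictor_rmse(ten_d_test_function, 0.0, math.pi, 10, 100, 5000, rng)
factor_2d = trivial_predictor_rmse(two_d_test_function, 0.3, 1.0, 2, 24, 5000, rng)
```

Dividing a method's raw holdout RMSE by such a factor gives the normalized measure; a ratio near 1 means the method does no better than predicting ȳ everywhere.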
The e_rmse,ho results plotted in Figure 13(a) show that accuracy is not improved substantially for either method as n increases. Indeed, GaSP(Const, PowerExp) gives variable accuracy, with larger values of n sometimes leading to worse accuracy than for n = 24. (The results in Figure 13 for a model with a nugget term are described in Section 5.3.)

FIG. 13. 2-d function: Normalized holdout RMSE of prediction, e_rmse,ho, versus n for CGP (◦), GaSP(Const, PowerExp) (), and GaSP(Const, PowerExp) with nugget (+).

To try to improve the accuracy, even larger sample sizes were tried. Figure 13(b) shows e_rmse,ho for n = 50, 60, . . . , 200. Both methods continue to give poor accuracy until n reaches 70, after which there is a slow, unsteady improvement. Curiously, GaSP now dominates.

Permuting between columns of a design does not generate distinct repeat experiments here, but reflecting either or both coordinates about the centers of their ranges maintains the distance properties of the design; that is, x1 on [0.3, 1] is replaced by x1′ = 1.3 − x1, and similarly for x2. Results for the repeat experiments from reflecting within x1, x2, or both x1 and x2 are available in the supplementary material (Chen et al., 2016). They are similar to those in Figure 13.

Thus, CGP dominates here for n ≤ 60: it is inaccurate but less inaccurate than GaSP. For larger n, however, GaSP performs better, reaching the 0.10 threshold for e_rmse,ho before CGP does. This example demonstrates the potential pitfalls of comparing two methods with a single experiment. A more comprehensive analysis not only gives more confidence in the findings but may also be essential to provide a balanced overview of advantages and disadvantages.

These last two toy functions together with the results in Figures 8–11 show no evidence for the effectiveness of a composite GaSP approach. These findings are in accord with the earlier study by West et al.
(1995).

5.3 Adding a Nugget Term

A nugget augments the GaSP model in (2.1) with an uncorrelated ε term, usually assumed to have a normal distribution with mean zero and constant variance σ²_ε, independent of the correlated process Z(x). This changes the computation of R and r^T(x) in the conditional prediction (2.4), which no longer interpolates the training data. For data from physical experimentation or observation, augmenting a GaSP model in this way is natural to reflect random errors (e.g., Gao, Sacks and Welch, 1996; McMillan et al., 1999; Styer et al., 1995).

A nugget term has also been widely used for statistical modeling of deterministic computer codes without random error. The reasons offered are that numerical stability is improved, thereby overcoming computational obstacles, and also that a nugget can produce better predictive performance or better confidence or credibility intervals. The evidence, in the literature and presented here, suggests, however, that for deterministic functions the potential advantages of a nugget term are modest. More systematic methods are available to deal with numerical instability if it arises (Ranjan, Haynes and Karsten, 2011); adding a nugget does not convert a poor predictor into an acceptable one; and other factors may be more important for good statistical properties of intervals (Section 6.1). On the other hand, we also do not find that adding a nugget (and estimating it along with the other parameters) is harmful, though it may produce smoothers rather than interpolators. We now elaborate on these points.

A small nugget, that is, a small value of σ²_ε, is often included to improve the numerical properties of R. For the space-filling initial designs in this article, however, Ranjan, Haynes and Karsten (2011) showed that ill-conditioning in a no-nugget GaSP model will only occur for low-dimensional x, high correlation, and large n.
These conditions are not commonly met in initial designs for applications. For instance, none of the computations for this article failed due to ill-conditioning, and those computations involved many repetitions of experiments for the various functions and GaSP models. The worst conditioning occurred for the 2-d example in Section 5.2 with n = 200, but even here the condition numbers of about 10^6 did not preclude reliable calculations. When a design is not space-filling, matrix ill-conditioning may indeed occur. For instance, a sequential design for, say, optimization or contour estimation (Bingham, Ranjan and Welch, 2014) could lead to runs close together in the x space, causing numerical problems. If ill-conditioning does occur, however, the mathematical solution proposed by Ranjan, Haynes and Karsten (2011) is an alternative to adding a nugget.

A nugget term is also sometimes suggested to improve predictive performance. Andrianakis and Challenor (2012) showed mathematically, however, that with a nugget the RMSE of prediction can be as large as that of a least squares fit of just the regression component in (2.1). Our empirical findings, choosing the size of σ²_ε via its MLE, are similarly unsupportive of a nugget. For example, the 2-d function in (5.1) is hard to predict with a GaSP(Const, PowerExp) model (Figure 13), but the results with a fitted nugget term shown by a “+” symbol in Figure 13 are no different in practice from those of the no-nugget model.

Similarly, repeating the calculations leading to Figure 2 for the borehole function, but fitting a nugget term in all models, shows essentially no difference [the results with a nugget are available in Figure 1 of the supplementary material (Chen et al., 2016)]. The MLE of σ²_ε is either zero or very small relative to the variance of the correlated process: typically σ̂²_ε/σ̂² < 10^−6.
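The link between close-together runs, ill-conditioning, and a nugget can be seen in the smallest possible case. For two runs with correlation ρ, the 2 × 2 correlation matrix has eigenvalues 1 ± ρ; adding a nugget ratio δ = σ²_ε/σ² to the diagonal shifts them to 1 + δ ± ρ. The sketch below is a toy calculation under stated assumptions, not part of the article's computations: it uses a one-dimensional Gaussian correlation ρ = exp(−θh²), and the function name and default θ are hypothetical.

```python
import math


def condition_number_two_runs(distance, theta=1.0, nugget_ratio=0.0):
    """Condition number of the 2x2 matrix [[1 + d, rho], [rho, 1 + d]].

    rho = exp(-theta * distance**2) is a Gaussian correlation (assumed
    form); d is the nugget variance relative to the process variance.
    The eigenvalues are 1 + d + rho and 1 + d - rho, so distance must be
    positive when d is zero.
    """
    rho = math.exp(-theta * distance ** 2)
    return (1.0 + nugget_ratio + rho) / (1.0 + nugget_ratio - rho)


# Two runs very close together: nearly singular without a nugget.
close_no_nugget = condition_number_two_runs(0.001)
close_with_nugget = condition_number_two_runs(0.001, nugget_ratio=1e-6)
# Two well-separated runs: well conditioned either way.
far_no_nugget = condition_number_two_runs(3.0)
```

In this toy case the nugget only helps when the runs nearly coincide, which is consistent with the observation that space-filling initial designs rarely need one for numerical reasons.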
These findings are consistent with those of Ranjan, Haynes and Karsten (2011), who found for the borehole function and other applications that constraining the model fit to have at least a modest value of σ²_ε deteriorated predictive performance.

Another example, the Friedman function,

$$y(x) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5, \tag{5.2}$$

with n = 25 runs, was used by Gramacy and Lee (2012) to illustrate potential advantages of including a nugget term. Their context (performance criteria, analysis method, and design) differs in all respects from ours. Our results in the top row of Figure 14 show that the GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with n = 25 have highly variable accuracy, with e_rmse,ho values no better and often much worse than 20%. The effect of the nugget is inconsequential. Increasing the sample size to n = 50 makes a dramatic improvement in prediction accuracy, but the effect of a nugget remains negligible.

The Gramacy and Lee (2012) results are not inconsistent with ours in that they did not report prediction accuracy for this example. Rather, their results relate to the role of the nugget in sometimes obtaining better uncertainty measures when a poor choice of correlation function is inadvertently made, a topic we return to in Section 6.1.

6. COMMENTS

6.1 Uncertainty of Prediction

As noted in Section 1, our attention is directed at prediction accuracy, the most compelling characteristic in practical settings. For example, where the objective is calibration and validation, the details of uncertainty, as distinct from accuracy, in the emulator of the computer model are absorbed (and usually swamped) by model uncertainties and measurement errors (Bayarri et al., 2007). But for specific predictions it is clearly important to have valid uncertainty statements.

Currently, a full assessment of the validity of emulator uncertainty quantification is unavailable.
It has long been recognized that the standard error of prediction can be optimistic when MLEs of the parameters θj, pj, νj in the correlation functions of Section 2.1 are “plugged-in” because the uncertainty in the parameter values is not taken into account (Abt, 1999). Corrections proposed by Abt remain to be done for the settings in which they are applicable.

Bayes credible intervals with full Bayes methods carry explicit and valid uncertainty statements; hybrid methods using priors on some of the correlation parameters (as distinct from MLEs) may also have reliable credible intervals. But for properties such as actual coverage probability (ACP), the proportion of points in a test set with true response values covered by intervals of nominal (say) 95% confidence or credibility, the behavior is far from clear. Chen (2013) compared several Bayes methods with respect to coverage. The results showed variability with respect to equivalent designs like that found above for accuracy, a troubling characteristic pointing to considerable uncertainty about the uncertainty.

FIG. 14. Friedman function: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_rmse,ho in a dot plot.

In Figure 15 we see some of the issues. It gives ACP results for the borehole and Nilson–Kuusk functions. The left-hand plot for borehole displays the anticipated under-coverage using plug-in estimates for the correlation parameters. (Confidence intervals here use n − 1 rather than n in the estimate of σ in the standard error and t_{n−1} instead of the standard normal.) PowerExp is slightly better than Gauss, and Bayes-GEM-SA has ACP values close to the nominal 95%.
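The ACP as defined above is simple to compute once predictions and standard errors are in hand. The following is a minimal sketch with hypothetical names; the critical value is left to the caller (for example a t_{n−1} quantile taken from tables or a statistics package), and the standard normal quantile used in the usage lines is only a stand-in for illustration.

```python
from statistics import NormalDist


def actual_coverage_probability(y_true, y_pred, std_err, critical_value):
    """Proportion of test points whose true value lies inside
    y_pred +/- critical_value * std_err."""
    covered = sum(
        1
        for y, m, s in zip(y_true, y_pred, std_err)
        if abs(y - m) <= critical_value * s
    )
    return covered / len(y_true)


# Usage with made-up numbers and a normal critical value for a nominal 95% interval.
z = NormalDist().inv_cdf(0.975)
acp = actual_coverage_probability(
    y_true=[0.0, 0.0, 0.0, 0.0],
    y_pred=[0.5, 1.0, 3.0, -2.5],
    std_err=[1.0, 1.0, 1.0, 1.0],
    critical_value=z,
)
```

An ACP well below the nominal level over a test set indicates under-coverage of the kind discussed for the plug-in intervals.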
Surprisingly, the plot for the Nilson–Kuusk code on the right of Figure 15 paints a different picture. Plug-in with Gauss and Bayes-GEM-SA both show under-coverage, while plug-in PowerExp has near-ideal properties here. We speculate that the use of the Gauss correlation function by Bayes-GEM-SA is again suboptimal for the Nilson–Kuusk application, just as it was for prediction accuracy.

FIG. 15. Borehole and Nilson–Kuusk functions: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss), GaSP(Const, PowerExp), and Bayes-GEM-SA. For the borehole function, 25 random permutations between columns of a 40-run mLHD give the 25 values of ACP in a dot plot. For the Nilson–Kuusk function, 25 designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

The supplementary material (Chen et al., 2016) compares models with and without a nugget in terms of coverage properties for the Friedman function in (5.2). The results show that the problem of substantial under-coverage seen in many of the replicate experiments is not solved by inclusion of a nugget term. A modest improvement in the distribution of ACP values is seen, particularly for n = 50, an improvement consistent with the advantage seen in Table 1 of Gramacy and Lee (2012) from fitting a nugget term.

A more complete study is surely needed to clarify appropriate criteria for uncertainty assessment and how modeling choices may affect matters.

6.2 Extrapolation

GaSP-based methods are interpolators, so our findings are clearly limited to prediction in the space of the experiment. The design of the computer experiment should cover the region of interest, rendering extrapolation meaningless. If a new region of interest is found, for example, during optimization, the initial computer runs can be augmented; extrapolation can be used to delimit regions that have to be explored further. Of course, extrapolation is necessary in the situation of a new region and a code that can no longer be run. But then the question is how to extrapolate. Initial inclusion of linear or other regression terms may be more useful than just a constant, but it may also be useless, or even dangerous, unless the “right” extrapolation terms are identified. We suspect it would be wiser to examine main effects resulting from the application of GaSP and use them to guide extrapolation.

6.3 Performance Criteria

We have focused almost entirely on questions of predictive accuracy and used RMSE as a measure. The supplementary material (Chen et al., 2016) defines and provides results for a normalized version of maximum absolute error, e_max,ho. Other computations we have done use the median of the absolute value of prediction errors, with normalization relative to the trivial predictor from the median of the training output data. These results are qualitatively the same as for e_rmse,ho: regression terms do not matter, and PowerExp is a reliable choice for R. For slow codes, analysis like that in Section 4 but using e_max,ho has some limited value in identifying regions where predictions are difficult, the limitations stemming from a likely lack of coverage of subregions, especially at borders of the unit cube, where the output function may behave badly.

A common performance measure for slow codes uses leave-one-out cross-validation error to produce analogues of e_rmse,ho and e_max,ho, obviating the need for a holdout set. For fast codes, repeat experiments and the ready availability of a holdout set render cross-validation unnecessary, however. For slow codes with only one set of data available, the single assessment from leave-one-out cross-validation does not reflect the variability caused, for example, by the design.
In any case, qualitatively similar conclusions pertain regarding regression terms and correlation functions.

6.4 More Examples

The examples we selected are codes that have been used in earlier studies. We have not incorporated 1-d examples; while instructive for pedagogical reasons, they have little presence in practice. Other applications we could have included (e.g., Gough and Welch, 1994) duplicate the specific conclusions we draw below. There are also “fabricated” test functions in the numerical integration and interpolation literature (Barthelmann, Novak and Ritter, 2000) and some specifically for computer experiments (Surjanovic and Bingham, 2015). They exhibit characteristics sometimes similar to those in Section 5 (large variability in a corner of the space, a condition that inhibits and even prevents construction of effective surrogates) and sometimes no different than the examples in Section 3. Codes that are deterministic but with numerical errors could also be part of a diverse catalogue of test problems. Ideally performance metrics from various approaches would be provided to facilitate comparisons; the suite of examples that we employed is a starting point.

6.5 Designs

The variability in performance over equivalent designs is a striking phenomenon in the analyses of Section 3 and raises questions about how to cope with what seems to be unavoidable bad luck. Are there sequential strategies that can reduce the variability? Are there advantageous design types, more robust to arbitrary symmetries? For example, does it matter whether a random LHD, mLHD, or an orthogonal LHD is used? The latter question is currently being explored by the authors. That design has a strong role is both unsurprising and surprising.
It is not surprising that care must be taken in planning an experiment; it is surprising and perplexing that equivalent designs can lead to such large differences in performance that are not mediated by good analytic procedures.

6.6 Larger Sample Sizes

As noted in Section 1, our attention is on experiments where n is small or modest at most. With advances in computing power it becomes more feasible to mount experiments with larger values of n while, at the same time, more complex codes become feasible but only with limited n. Our focus continues to be on the latter and the utility of GaSP models in that context.

As n gets larger, Figure 2 illustrates that the differences in accuracy among choices of R and μ begin to vanish. Indeed, it is not even clear that using GaSP models for large n is useful; standard function fitting methods such as splines may well be competitive and easier to compute. In addition, when n is large nonstationary behavior can become apparent and encourages variations in the GaSP methodology such as decomposing the input space (as in Gramacy and Lee, 2008) or using a complex μ together with a computationally more tractable R (as in Kaufman et al., 2011). Comparison of alternatives when n is large is yet to be considered.

6.7 Are Regression Terms Ever Useful?

Introducing regression terms is unnecessary in the examples we have presented; a heuristic rationale was given in Section 2.2. The supplementary material (Chen et al., 2016) reports a simulation study with realized functions generated as follows: (1) there are very large linear trends for all xj; and (2) the superimposed sample path from a 0-mean GP is highly nonlinear, that is, a GP with at least one θj ≫ 0 in (2.2). Even under such extreme conditions, the advantage of explicitly fitting the regression terms is limited to a relative (ratio of e_rmse,ho) advantage, with only small differences in e_rmse,ho; the presence of a large trend causes a large normalizing factor.
Moreover, such functions are not the sort usually encountered in computer experiments. If they do show up, standard diagnostics will reveal their presence and allow effective follow-up analysis (see Section 7.2).

7. CONCLUSIONS AND RECOMMENDATIONS

This article addresses two types of questions. First, how should the analysis methodologies advanced in the study of computer experiments be assessed? Second, what recommendations for modeling strategies follow from applying the assessment strategy to the particular codes we have studied?

7.1 Assessing Methods

We have stressed the importance of going beyond “anecdotes” in making claims for proposed methods. While this point is neither novel nor startling, it is one that is commonly ignored, often because the process of studying consequences under multiple conditions is more laborious. The borehole example (Figure 2), for instance, employs 75 experiments arising from 25 repeats of each of 3 base experiments.

When only one training set of data is available (as can be the case with slow codes), the procedures in Section 4, admittedly ad hoc, nevertheless expand the range of conditions. This brings more generalizability to claims about the comparative performances of competing procedures. The same strategy of creating multiple training/holdout sets is potentially useful in comparing competing methods in physical experiments as well.

The studies in the previous sections lead to the following conclusions:

• There is no evidence that GaSP(Const, PowerExp) is ever dominated by use of regression terms, or other choices of R. Moreover, we have found that the inclusion of regression terms makes the likelihood surface multi-modal, necessitating an increase in computational effort for maximum likelihood or Bayesian methods. This appears to be due to confounding between regression terms and the GP paths.
• Choosing R = Gauss, though common, can be unwise.
The Matérn function optimized over a few levels of smoothness is a reasonable alternative to PowerExp.
• Design matters but cannot be controlled completely. Variability of performance from equivalent designs can be uncomfortably large.

There is not enough evidence to settle the following questions:

• Are full Bayes methods ever more accurate than GaSP(Const, PowerExp)? Bayes methods relying on R = Gauss were seen to be sometimes inferior, and extensions to accommodate less smooth R such as PowerExp, perhaps via hybrid Bayes-MLE methods, are needed.
• Are composite GaSP methods ever better than GaSP(Const, PowerExp) in practical settings where the output exhibits nonstationary behavior?

7.2 Recommendations

Faced with a particular code and a set of runs, what should a scientist do to produce a good predictor? Our recommendation is to make use of GaSP(Const, PowerExp), use the diagnostics of Jones, Schonlau and Welch (1998) or Bastos and O’Hagan (2009), and assess whether the GaSP predictor is adequate. If it is found inadequate, then the scientist should expect no help from introducing regression terms nor, until further evidence is found, from Bayes or CGP approaches. Of course, trying such methods is not prohibited, but we believe that inadequacy of the GaSP(Const, PowerExp) model is usually a sign that more substantial action must be taken.

We conjecture that the best way to proceed in the face of inadequacy is to devise a second (or multiple) stage process, perhaps by added runs, or perhaps by carving the space into more manageable subregions as well as adding runs.
How best to do this has been partially addressed, for example, by Gramacy and Lee (2008) and Loeppky, Moore and Williams (2010); effective methods constrained by limited runs are not apparent and in need of study.

ACKNOWLEDGMENTS

We thank the referees, Associate Editor, and Editor for suggestions that clarified and broadened the scope of the studies reported here.

The research of Loeppky and Welch was supported in part by grants from the Natural Sciences and Engineering Research Council, Canada.

SUPPLEMENTARY MATERIAL

Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?” (DOI: 10.1214/15-STS531SUPP; .zip). This report (whatcounts-supp.pdf) contains further description of the test functions and data from running them, further results for root mean squared error, findings for maximum absolute error, further results on uncertainty of prediction, and details of the simulation investigating regression terms. Inputs to the Arctic sea-ice code: ice-x.txt. Outputs from the code: ice-y.txt.

REFERENCES

ABT, M. (1999). Estimating the prediction mean squared error in Gaussian stochastic processes with exponential correlation structure. Scand. J. Stat. 26 563–578. MR1734262
ANDRIANAKIS, I. and CHALLENOR, P. G. (2012). The effect of the nugget on Gaussian process emulators of computer models. Comput. Statist. Data Anal. 56 4215–4228. MR2957866
BA, S. and JOSEPH, V. R. (2012). Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 6 1838–1860. MR3058685
BARTHELMANN, V., NOVAK, E. and RITTER, K. (2000). High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math. 12 273–288. MR1768951
BASTOS, L. S. and O’HAGAN, A. (2009). Diagnostics for Gaussian process emulators. Technometrics 51 425–438. MR2756478
BAYARRI, M. J., BERGER, J. O., PAULO, R., SACKS, J., CAFEO, J. A., CAVENDISH, J., LIN, C.-H. and TU, J. (2007). A framework for validation of computer models. Technometrics 49 138–154. MR2380530
BAYARRI, M. J., BERGER, J. O., CALDER, E. S., DALBEY, K., LUNAGOMEZ, S., PATRA, A. K., PITMAN, E. B., SPILLER, E. T. and WOLPERT, R. L. (2009). Using statistical and computer models to quantify volcanic hazards. Technometrics 51 402–413. MR2756476
BINGHAM, D., RANJAN, P. and WELCH, W. J. (2014). Design of computer experiments for optimization, estimation of function contours, and related objectives. In Statistics in Action (J. F. Lawless, ed.) 109–124. CRC Press, Boca Raton, FL. MR3241971
CHAPMAN, W. L., WELCH, W. J., BOWMAN, K. P., SACKS, J. and WALSH, J. E. (1994). Arctic sea ice variability: Model sensitivities and a multidecadal simulation. J. Geophys. Res. 99C 919–935.
CHEN, H. (2013). Bayesian prediction and inference in analysis of computer experiments. Master’s thesis, Univ. British Columbia, Vancouver.
CHEN, H., LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2016). Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?” DOI:10.1214/15-STS531SUPP.
CURRIN, C., MITCHELL, T., MORRIS, M. and YLVISAKER, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. Amer. Statist. Assoc. 86 953–963. MR1146343
DIXON, L. C. W. and SZEGÖ, G. P. (1978). The global optimisation problem: An introduction. In Towards Global Optimisation (L. C. W. Dixon and G. P. Szegö, eds.) 1–15. North Holland, Amsterdam.
GAO, F., SACKS, J. and WELCH, W. J. (1996). Predicting urban ozone levels and trends with semiparametric modeling. J. Agric. Biol. Environ. Stat. 1 404–425. MR1807773
GOUGH, W. A. and WELCH, W. J. (1994). Parameter space exploration of an ocean general circulation model using an isopycnal mixing parameterization. J. Mar. Res. 52 773–796.
GRAMACY, R. B. and LEE, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103 1119–1130. MR2528830
GRAMACY, R. B. and LEE, H. K. H. (2012). Cases for the nugget in modeling computer experiments. Stat. Comput. 22 713–722. MR2909617
JONES, D. R., SCHONLAU, M. and WELCH, W. J. (1998). Efficient global optimization of expensive black-box functions. J. Global Optim. 13 455–492. MR1673460
JOSEPH, V. R., HUNG, Y. and SUDJIANTO, A. (2008). Blind kriging: A new method for developing metamodels. J. Mech. Des. 130 031102–1–8.
KAUFMAN, C. G., BINGHAM, D., HABIB, S., HEITMANN, K. and FRIEMAN, J. A. (2011). Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann. Appl. Stat. 5 2470–2492. MR2907123
KENNEDY, M. (2004). Description of the Gaussian process model used in GEM-SA. Technical report, Univ. Sheffield. Available at http://www.tonyohagan.co.uk/academic/GEM/.
LIM, Y. B., SACKS, J., STUDDEN, W. J. and WELCH, W. J. (2002). Design and analysis of computer experiments when the output is highly correlated over the input space. Canad. J. Statist. 30 109–126. MR1907680
LOEPPKY, J. L., MOORE, L. M. and WILLIAMS, B. J. (2010). Batch sequential designs for computer experiments. J. Statist. Plann. Inference 140 1452–1464. MR2592224
LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2009). Choosing the sample size of a computer experiment: A practical guide. Technometrics 51 366–376. MR2756473
MCMILLAN, N. J., SACKS, J., WELCH, W. J. and GAO, F. (1999). Analysis of protein activity data by Gaussian stochastic process models. J. Biopharm. Statist. 9 145–160.
MORRIS, M. D. and MITCHELL, T. J. (1995). Exploratory designs for computational experiments. J. Statist. Plann. Inference 43 381–402.
MORRIS, M. D., MITCHELL, T. J. and YLVISAKER, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics 35 243–255. MR1234641
NILSON, T. and KUUSK, A. (1989). A reflectance model for the homogeneous plant canopy and its inversion. Remote Sens. Environ. 27 157–167.
O’HAGAN, A. (1992). Some Bayesian numerical analysis. In Bayesian Statistics, 4 (Peñíscola, 1991) (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 345–363. Oxford Univ. Press, New York. MR1380285
PICHENY, V., GINSBOURGER, D., RICHET, Y. and CAPLIN, G. (2013). Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 2–13. MR3038476
PRESTON, D. L., TONKS, D. L. and WALLACE, D. C. (2003). Model of plastic deformation for extreme loading conditions. J. Appl. Phys. 93 211–220.
RANJAN, P., HAYNES, R. and KARSTEN, R. (2011). A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data. Technometrics 53 366–378. MR2850469
SACKS, J., SCHILLER, S. B. and WELCH, W. J. (1989). Designs for computer experiments. Technometrics 31 41–47. MR0997669
SACKS, J., WELCH, W. J., MITCHELL, T. J. and WYNN, H. P. (1989). Design and analysis of computer experiments (with discussion). Statist. Sci. 4 409–435. MR1041765
SCHONLAU, M. and WELCH, W. J. (2006). Screening the input variables to a computer model via analysis of variance and visualization. In Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (A. Dean and S. Lewis, eds.) 308–327. Springer, New York.
STEIN, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
STYER, P., MCMILLAN, N., GAO, F., DAVIS, J. and SACKS, J. (1995). Effect of outdoor airborne particulate matter on daily death counts. Environ. Health Perspect. 103 490–497.
SURJANOVIC, S. and BINGHAM, D. (2015). Virtual library of simulation experiments: Test functions and datasets. Available at http://www.sfu.ca/~ssurjano.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MITCHELL, T. J. and MORRIS, M. D. (1992). Screening, predicting, and computer experiments. Technometrics 34 15–25.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MORRIS, M. D. and SCHONLAU, M. (1996). Response to James M. Lucas. Technometrics 38 199–203.
WEST, O. R., SIEGRIST, R. L., MITCHELL, T. J. and JENKINS, R. A. (1995). Measurement error and spatial variability effects on characterization of volatile organics in the subsurface. Environ. Sci. Technol. 29 647–656.

Statistical Science
2016, Vol. 31, No. 1, 40–60
DOI: 10.1214/15-STS531
© Institute of Mathematical Statistics, 2016

Analysis Methods for Computer Experiments: How to Assess and What Counts?

Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. Welch

Abstract. Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive, evidence-based assessment strategies are needed when comparing such modeling options. Otherwise, one is easily misled. Applying the strategies to several computer codes shows that a regression model more complex than a constant mean either has little impact on prediction accuracy or is an impediment. The choice of correlation function has modest effect, but there is little to separate two common choices, the power exponential and the Matérn, if the latter is optimized with respect to its smoothness. The applications presented here also provide no evidence that a composite of GPs provides practical improvement in prediction accuracy. A limited comparison of Bayesian and empirical Bayes methods is similarly inconclusive. In contrast, we find that the effect of experimental design is surprisingly large, even for designs of the same type with the same theoretical properties.

Key words and phrases: Correlation function, Gaussian process, kriging, prediction accuracy, regression.

1. INTRODUCTION

Over the past quarter century a literature beginning with Sacks, Schiller and Welch (1989), Sacks et al. (1989, in this journal), Currin et al.
(1991), and O’Hagan (1992) has grown to explore the design and analysis of computer experiments. Such an experiment is a designed set of runs of a mathematical model implemented as a computer code. Running the code with vector-valued input x gives the output value, y(x), assumed real-valued and deterministic: running the code again with the same value for x does not change y(x). A design D is a set of n runs at n configurations of x, and an objective of primary interest is to use the data (inputs and outputs) to predict, via a statistical model, the output of the code at untried input values.

Hao Chen is a Ph.D. candidate and William J. Welch is Professor, Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Jason L. Loeppky is Associate Professor, Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). Jerome Sacks is Director Emeritus, National Institute of Statistical Sciences, Research Triangle Park, North Carolina 27709, USA (e-mail: sacks@niss.org).

The basic approach to the statistical model typically adopted starts by thinking of the output function, y(x), as being in a class of functions with prior distribution a Gaussian process (GP). The process has mean μ, which may be a regression function in x, and a covariance function, $\sigma^2 R$, from a specified parametric family. Prediction is then made through the posterior mean given the data from the computer experiment, with some variations depending on whether a maximum likelihood (empirical Bayes) or fuller Bayesian implementation is used. While we partially address those variations in this article, our main thrusts are the practical questions faced by the user: What regression function and correlation function should be used? Does it matter?

We will call a GP model with specified regression and correlation functions a Gaussian stochastic process (GaSP) model.
For example, GaSP(Const, PowerExp) will denote a constant (intercept only) regression and the power-exponential correlation function. The various regression models and correlation functions under consideration in this article will be defined in Section 2.

The rationale for the GaSP approach stems from the common situation that the dimension, d, of the space of inputs is not small, the function is fairly complex to model, and n is not large (code runs are expensive), hindering the effectiveness of standard methods (e.g., polynomials, splines, MARS) for producing predictions. The GaSP approach allows for a flexible choice of approximating models that adapts to the data and, more tellingly, has proved effective in coping with complex codes with moderately high d and scarce data. There is a vast literature treating an analysis in this context.

This article studies the impact on prediction accuracy of the particular model specifications commonly used, particularly μ, R, n, D. The goals are twofold. First, we propose a more evidence-based approach to distinguish what may be important from the unimportant and what may need further exploration. Second, our application of this approach to various examples leads to some specific recommendations.

Assessing statistical strategies for the analysis of a computer experiment often mimics what is done for physical experiments: a method is proposed, applied in examples (usually few in number), and compared to other methods. Where possible, formal, mathematical comparisons are made; otherwise, criteria for assessing performance are empirical. An initial empirical study for a physical experiment is forced to rely on the specific data of that experiment and, while different analysis methods may be applied, all are bound by the single data set. There are limited opportunities to vary sample size or design.

Computer experiments provide richer opportunities. Fast-to-run codes enable a laboratory to investigate the relative merits of an analysis method.
A whole spectrum of “replicate” experiments can be conducted for a single code, going beyond a thimbleful of “anecdotal” reports.

The danger of being misled by anecdotes can be seen in the following example. The borehole function [Morris, Mitchell and Ylvisaker, 1993, also given in the supplementary material (Chen et al., 2016)] is frequently used to illustrate methodology for computer experiments. A 27-run orthogonal array (OA) in the 8 input factors was proposed as a design, following Joseph, Hung and Sudjianto (2008). The 27 runs were analyzed via GaSP with a specific R (the Gaussian correlation function described in Section 2) and with two choices for μ: a simple constant (intercept) versus a method to select linear terms (SL), also described in Section 2. The details of these alternative models are not important for now, just that we are comparing two modeling methods. A set of 10,000 test points selected at random in the 8-dimensional input space was then predicted. The resulting values of the root mean squared error (RMSE) measure defined in (2.5) of Section 2 were 0.141 and 0.080 for the constant and SL regression models, respectively.

With the SL approach reducing the RMSE by about 43% relative to a model with a constant mean, does this example provide powerful evidence for using regression terms in the GaSP model? Not quite. We replicated the experiment with the same choices of μ, R, n and the same test data, but the training data came from a theoretically equivalent 27-run OA design. (There are many equivalent OAs, e.g., by permuting the labels between columns of a fixed OA.) The RMSE values in the second analysis were 0.073 and 0.465 for the constant and SL models, respectively; SL produced about 6 times the error relative to a constant mean, so the evidence against using regression terms is even more powerful!

A broader approach is needed.
The one we take is laid out starting in Section 2, where we specify the alternatives considered for the statistical model’s regression component and correlation function, and define the assessment measures to be used. We focus on the fundamental criterion of prediction accuracy (uncertainty assessment is discussed briefly in Section 6.1). In Section 3 we outline the basic idea of generating repeated data sets for any given example. The method is (exhaustively) implemented for several fast codes, including the aforementioned borehole function, along with some choices of n and D. In Section 4 the method is adapted to slower codes where data from only one experiment are available. Ideally, the universe of computer experiments is represented by a set of test problems and assessment criteria, as in numerical optimization (Dixon and Szegö, 1978); the codes and data sets investigated in this article and its supplementary material (Chen et al., 2016) are but an initial set. In Section 5 other modeling strategies are compared. Finally, in Sections 6 and 7 we make some summarizing comments, conclusions and recommendations.

The article’s main findings are that regression terms are unnecessary or even sometimes an impediment, the choice of R matters for less smooth functions, and that the variability of performance of a method for the same problem over equivalent designs is alarmingly high. Such variation can mask the differences in analysis methods, rendering them unimportant and reinforcing the message that light evidence leads to flimsy conclusions.

2. STATISTICAL MODELS, EXPERIMENTAL DESIGN, AND ASSESSMENT

A computer code output is denoted by y(x) where the input vector, $x = (x_1, \ldots, x_d)$, is in the d-dimensional unit cube. As long as the input space is rectangular, transforming to the unit cube is straightforward and does not lose generality. Suppose n runs of the code are made according to a design D of input vectors $x^{(1)}, \ldots, x^{(n)}$ in $[0,1]^d$, resulting in data $y = (y(x^{(1)}), \ldots, y(x^{(n)}))^T$. The goal is to predict y(x) at untried x.

The GaSP approach uses a regression model and GP prior on the class of possible y(x). Specifically, y(x) is a priori considered to be a realization of

\[ Y(x) = \mu(x) + Z(x), \tag{2.1} \]

where μ(x) is the regression component, the mean of the process, and Z(x) has mean 0, variance $\sigma^2$, and correlation function R.

2.1 Choices for the Correlation Function

Let x and x′ denote two values of the input vector. The correlation between Z(x) and Z(x′) is denoted by R(x, x′). Following common practice, R(x, x′) is taken to be a product of 1-d correlation functions in the distances $h_j = |x_j - x'_j|$, that is, $R(x, x') = \prod_{j=1}^{d} R_j(h_j)$. We mainly consider four choices for $R_j$:

• Power exponential (abbreviated PowerExp):

\[ R_j(h_j) = \exp(-\theta_j h_j^{p_j}), \tag{2.2} \]

with $\theta_j \ge 0$ and $1 \le p_j \le 2$ controlling the sensitivity and smoothness, respectively, of predictions of y with respect to $x_j$.
• Squared exponential or Gaussian (abbreviated Gauss): the special case of PowerExp in (2.2) with all $p_j = 2$.
• Matérn:

\[ R_j(h_j) = \frac{1}{\Gamma(\nu_j)\, 2^{\nu_j - 1}} (\theta_j h_j)^{\nu_j} K_{\nu_j}(\theta_j h_j), \tag{2.3} \]

where Γ is the Gamma function, and $K_{\nu_j}$ is the modified Bessel function of order $\nu_j$. Again, $\theta_j \ge 0$ is a sensitivity parameter. The Matérn class was recommended by Stein (1999), Section 2.7, for its control via $\nu_j > 0$ of the differentiability of the correlation function with respect to $x_j$, and hence that of the prediction function. With $\nu_j = 1 + \frac{1}{2}$ or $\nu_j = 2 + \frac{1}{2}$, there are 1 or 2 derivatives, respectively. We call these subfamilies Matérn-1 and Matérn-2. Similarly, we use Matérn-0 and Matérn-∞ to refer to the cases $\nu_j = 0 + \frac{1}{2}$ and $\nu_j \to \infty$. They give the exponential family [$p_j = 1$ in (2.2)], with no derivatives, and Gauss, which is infinitely differentiable. Matérn-0, 1, 2 are closely related to linear, quadratic, and cubic splines.
We believe that little would be gained by incorporating smoother kernels (but less smooth than the analytic Matérn-∞) in the study.
Consideration of the entire Matérn class for $\nu_j > 0$ is computationally cumbersome for the large numbers of experiments we will evaluate. Hence, what we call Matérn has $\nu_j$ optimized over the Matérn-0, Matérn-1, Matérn-2, and Matérn-∞ special cases, separately for each coordinate.
• Matérn-2: Some authors (e.g., Picheny et al., 2013) fix $\nu_j$ in the Matérn correlation function to give some differentiability. The Matérn-2 subfamily sets $\nu_j = 2 + \frac{1}{2}$ for all j, giving 2 derivatives.

More recently, other types of covariance functions have been recommended to cope with “apparently nonstationary” functions (e.g., Ba and Joseph, 2012). In Section 5.2 we will discuss the implications and characteristics of these options.

Given a choice for $R_j$ and hence R, we define the n × n matrix R with element i, j given by $R(x^{(i)}, x^{(j)})$ and the n × 1 vector $r(x) = (R(x, x^{(1)}), \ldots, R(x, x^{(n)}))^T$ for any x where we want to predict.

2.2 Choices for the Regression Component

We explore three main choices for μ:

• Constant (abbreviated Const): $\mu(x) = \beta_0$.
• Full linear (FL): $\mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d$, that is, a full linear model in all input variables.
• Select linear (SL): μ(x) is linear in the $x_j$ like FL but only includes selected terms.

The proposed algorithm for SL is as follows. For a given correlation family construct a default predictor with Const for μ. Decompose the predictive function (Schonlau and Welch, 2006) and identify all main effects that contribute more than 100/d percent to the total variation. These become the selected coordinates. Typically, large main effects have clear linear components. If a large effect lacks a linear component, little is lost by including a linear term.
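The correlation families of Section 2.1, together with the matrix R and the vector r(x), can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 3-run design and the parameter values are all hypothetical, and the Matérn-0, 1, 2 cases use the standard closed forms that the Bessel-function expression (2.3) reduces to at half-integer $\nu_j$ (a textbook identity, not something stated in the article).

```python
import math

def power_exp(h, theta, p):
    """Power-exponential correlation (2.2): exp(-theta * h**p), 1 <= p <= 2."""
    return math.exp(-theta * h ** p)

def gauss(h, theta):
    """Gauss correlation: the p = 2 special case of (2.2)."""
    return power_exp(h, theta, 2.0)

def matern_half_integer(h, theta, k):
    """Matern-k for k = 0, 1, 2, i.e. nu = k + 1/2 in (2.3).

    At half-integer nu the Bessel form reduces to a polynomial in
    z = theta * h multiplied by exp(-z).
    """
    z = theta * h
    poly = {0: 1.0, 1: 1.0 + z, 2: 1.0 + z + z * z / 3.0}[k]
    return poly * math.exp(-z)

def product_corr(x, x_prime, corr_1d):
    """R(x, x') as a product of 1-d correlations in h_j = |x_j - x'_j|.

    corr_1d holds one single-argument function per input coordinate.
    """
    r = 1.0
    for xj, xpj, rj in zip(x, x_prime, corr_1d):
        r *= rj(abs(xj - xpj))
    return r

def corr_matrix(design, corr_1d):
    """n x n matrix R with element (i, j) equal to R(x^(i), x^(j))."""
    return [[product_corr(xi, xj, corr_1d) for xj in design] for xi in design]

def corr_vector(x, design, corr_1d):
    """Vector r(x) = (R(x, x^(1)), ..., R(x, x^(n)))^T."""
    return [product_corr(x, xi, corr_1d) for xi in design]

# Hypothetical 3-run design in d = 2 with illustrative parameter values.
design = [(0.1, 0.9), (0.5, 0.2), (0.8, 0.6)]
corr_1d = [lambda h: power_exp(h, theta=2.0, p=1.5),
           lambda h: matern_half_integer(h, theta=3.0, k=2)]
R = corr_matrix(design, corr_1d)
r = corr_vector((0.4, 0.4), design, corr_1d)
```

Passing one function per coordinate mirrors the product form of R(x, x′), so different inputs can use different families or parameters.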
Inclusion of possible nonlinear trends can be pursued at considerable computational cost; we do not routinely do so, but in Section 4.1 we do include a regression model with nonlinear terms in $x_j$.

All candidate regression models considered can be written in the form

\[ \mu(x) = \beta_0 + \beta_1 f_1(x) + \cdots + \beta_k f_k(x), \]

where the functions $f_j(x)$ are known. The maximum likelihood estimate (MLE) of $\beta = (\beta_0, \ldots, \beta_k)^T$ is the generalized least-squares estimate $\hat{\beta} = (F^T R^{-1} F)^{-1} F^T R^{-1} y$, where the n × (k + 1) matrix F has $(1, f_1(x^{(i)}), \ldots, f_k(x^{(i)}))$ in row i. This is also the Bayes posterior mean with a diffuse prior for β.

Early work (Sacks, Schiller and Welch, 1989) suggested that there is little to be gained (and maybe even something to lose) by using other than a constant term for μ. In addition, Lim et al. (2002) showed that polynomials can be exactly predicted with a minimal number of points using the Gauss correlation function, provided one lets the $\theta_j \to 0$. These points underline the fact that a GP prior can capture the complexity in the output of the code, suggesting that deploying regression terms is superfluous. The evidence we report later supports this impression.

2.3 Prediction

Predictions are made as follows. For given data and values of the parameters in R, the mean of the posterior predictive distribution of y(x) is

\[ \hat{y}(x) = \hat{\mu}(x) + r^T(x) R^{-1} (y - F\hat{\beta}), \tag{2.4} \]

where $\hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 f_1(x) + \cdots + \hat{\beta}_k f_k(x)$.

In practice, the other parameters, $\sigma^2$ and those in the correlation function R of equations (2.2) or (2.3), must be estimated too. Empirical Bayes replaces all of β, $\sigma^2$, and the correlation parameters in R by their MLEs (Welch et al., 1992). Various other Bayes-based procedures are available, including one fully Bayesian strategy described in Section 5.1.
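The generalized least-squares estimate and the predictor (2.4) can be sketched as follows for the case where the correlation parameters are treated as given, as in the statement of (2.4). This is a minimal sketch, not the authors' code: the small Gaussian-elimination solver, the 1-d test function, the four design points and the value θ = 5 are all assumptions made for illustration, and no likelihood maximization over θ is attempted.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gls_beta(F, R, y):
    """beta_hat = (F^T R^{-1} F)^{-1} F^T R^{-1} y."""
    k1 = len(F[0])
    cols = [[row[j] for row in F] for j in range(k1)]   # columns of F
    Rinv_cols = [solve(R, col) for col in cols]         # R^{-1} F, column by column
    Rinv_y = solve(R, y)
    A = [[dot(cols[a], Rinv_cols[b]) for b in range(k1)] for a in range(k1)]
    rhs = [dot(cols[a], Rinv_y) for a in range(k1)]
    return solve(A, rhs)

def predict(f_x, r_x, F, R, y, beta):
    """y_hat(x) = f(x)^T beta + r(x)^T R^{-1} (y - F beta), as in (2.4)."""
    resid = [yi - dot(row, beta) for row, yi in zip(F, y)]
    return dot(f_x, beta) + dot(r_x, solve(R, resid))

# Hypothetical 1-d example: Gauss correlation with an assumed theta = 5.
def corr(a, b, theta=5.0):
    return math.exp(-theta * (a - b) ** 2)

xs = [0.0, 0.3, 0.6, 1.0]
y = [math.sin(3.0 * x) for x in xs]      # stand-in for a deterministic code output
R = [[corr(a, b) for b in xs] for a in xs]
F = [[1.0] for _ in xs]                  # Const regression: a single column of ones
beta = gls_beta(F, R, y)

def y_hat(x):
    return predict([1.0], [corr(x, a) for a in xs], F, R, y, beta)
```

Because the code is deterministic and no nugget is used here, the predictor interpolates: evaluating `y_hat` at a design point returns the corresponding training output up to rounding error.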
Our focus, however, is not on the particular Bayes-based methods employed but rather on assumptions about the form of the underlying GaSP model.

2.4 Design

For fast codes we typically use as a base design D an approximate maximin Latin hypercube design (mLHD, Morris and Mitchell, 1995), with improved low-dimensional space-filling properties (Welch et al., 1996). A few other choices, such as orthogonal arrays, are also investigated in Section 3.5, with a more comprehensive comparison of different classes of design the subject of another ongoing study. In any event, we show that even for a fixed class of design and fixed n there is substantial variation in prediction accuracy over equivalent designs. Conclusions based on a single design choice can be misleading.

The effect of n on prediction accuracy was explored by Sacks, Schiller and Welch (1989) and more recently by Loeppky, Sacks and Welch (2009); its role in the comparison of competing alternatives for μ and R will also be addressed in Section 3.5.

2.5 Measures of Prediction Error

In order to compare various forms of the predictor in (2.4) built from the n code runs, $y = (y(x^{(1)}), \ldots, y(x^{(n)}))^T$, we need to set some standards. The gold standard is to assess the magnitude of prediction error via holdout (test) data, that is, in predicting N further runs, $y(x_{\mathrm{ho}}^{(1)}), \ldots, y(x_{\mathrm{ho}}^{(N)})$. The prediction errors are $\hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)})$ for i = 1, ..., N.

The performance measure we use is a normalized RMSE of prediction over the holdout data, denoted by $e_{\mathrm{rmse,ho}}$. The normalization is the RMSE using the (trivial) predictor $\bar{y}$, the mean of the data from the runs in the experimental design, the “training” set. Thus,

\[ e_{\mathrm{rmse,ho}} = \frac{\sqrt{(1/N)\sum_{i=1}^{N} \bigl(\hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)})\bigr)^2}}{\sqrt{(1/N)\sum_{i=1}^{N} \bigl(\bar{y} - y(x_{\mathrm{ho}}^{(i)})\bigr)^2}}. \tag{2.5} \]

The normalization in the denominator puts $e_{\mathrm{rmse,ho}}$ roughly on [0, 1] whatever the scale of y, with 1 indicating no better performance than $\bar{y}$.
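The measure in (2.5) translates directly into code. The sketch below is illustrative only: the function name and the small data vectors are hypothetical and chosen so that the two reference points of the scale (0 for exact prediction, 1 for the trivial predictor) are easy to see.

```python
import math

def normalized_rmse(y_pred, y_holdout, y_train):
    """e_rmse,ho of (2.5): holdout RMSE divided by the holdout RMSE of the
    trivial predictor y_bar, the mean of the training outputs."""
    n_ho = len(y_holdout)
    y_bar = sum(y_train) / len(y_train)
    num = sum((p - t) ** 2 for p, t in zip(y_pred, y_holdout)) / n_ho
    den = sum((y_bar - t) ** 2 for t in y_holdout) / n_ho
    return math.sqrt(num) / math.sqrt(den)

# Hypothetical numbers, for illustration only.
y_train = [1.0, 2.0, 3.0]            # training mean y_bar = 2.0
y_holdout = [1.5, 2.5, 3.5]
perfect = normalized_rmse(y_holdout, y_holdout, y_train)    # exact predictions
trivial = normalized_rmse([2.0] * 3, y_holdout, y_train)    # predict y_bar everywhere
```

Note that the mean in the denominator comes from the training data while the sums run over the holdout points, which is why the measure is only roughly confined to [0, 1].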
The criterion is related to $R^2$ in regression, but $e_{\mathrm{rmse,ho}}$ measures performance for a new test set and smaller values are desirable.

Similarly, worst-case performance can be defined as the normalized maximum absolute error. Results for this metric are reported in the supplementary material (Chen et al., 2016); the conclusions are the same. Other definitions (such as median absolute error) and other normalizations (such as those of Loeppky, Sacks and Welch, 2009) can be used, but without substantive effect on comparisons.

FIG. 1. Equivalent designs for d = 2 and n = 11: (a) a base mLHD design; (b) the base design with labels permuted between columns; and (c) the base design with values in the $x_1$ column reflected around $x_1 = 0.5$.

What are tolerable levels of error? Clearly, these are application-specific so that tighter thresholds would be demanded, say, for optimization than for sensitivity analysis. For general purposes we take the rule of thumb that $e_{\mathrm{rmse,ho}} < 0.10$ is useful. For normalized maximum error it is plausible that the threshold could be much larger, say 0.25 or 0.30. These speculations are consequences of the experiences we document later, and are surely not the last word. The value of having thresholds is to provide benchmarks that enable assessing when differences among different methods or strategies are practically insignificant versus statistically significant.

3. FAST CODES

3.1 Generating a Reference Set for Comparisons

For fast codes under our control, large holdout sets can be obtained. Hence, in this section performance is measured through the use of a holdout (test) set of 10,000 points, selected as a random Latin hypercube design (LHD) on the input space.

With a fast code many designs and hence training data sets can easily be generated. We generate many equivalent designs by exploiting symmetries. For a simple illustration, Figure 1(a) shows a base mLHD for d = 2 and n = 11.
Permuting the labels between columns of the design, that is, interchanging the $x_1$ and $x_2$ values as in Figure 1(b), does not change the inter-point distances or properties based on them such as the minimum distance used to construct mLHD designs. Similarly, reflecting the values within, say, the $x_1$ column around $x_1 = 0.5$ as in Figure 1(c), does not change the properties. In this sense the designs are equivalent.

In general, for any base design with good properties, there are $d!\,2^d$ equivalent designs and hence equivalent sets of training data available from permuting all column labels and reflecting within columns for a subset of inputs. For the borehole code mentioned in Section 1 and investigated more fully in Section 3.2, we have found that permuting between columns gives more variation in prediction accuracy than reflecting within columns. Thus, in this article for nearly all examples we only permute between columns: for d = 4 all 24 possible permutations, and for d ≥ 5 a random selection of 25 different permutations. The example of Section 5.2 with d = 2 is the one exception. Because $y(x_1, x_2)$ is symmetric in $x_1$ and $x_2$, permuting between columns does not change the training data and we reflect within $x_1$ and/or $x_2$ instead.

The designs, generated data sets, and replicate analyses then serve as the reference set for a particular problem and provide the grounds on which variability of performance can be assessed. Given the setup of Section 2, we want to assess the consequences of making a choice from the menu of three regression models and four correlation functions.

3.2 Borehole Code

The first setting we will look at is the borehole code (Morris, Mitchell and Ylvisaker, 1993) mentioned in Section 1 and described in the supplementary material (Chen et al., 2016). It has served as a test bed in many contexts (e.g., Joseph, Hung and Sudjianto, 2008).
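The symmetry argument of Section 3.1 can be checked numerically. The sketch below is a hedged illustration rather than the procedure used in the article: it builds a plain random LHD on cell midpoints (an assumption; the article's base designs are maximin LHDs or OAs), enumerates every column permutation combined with every pattern of reflections, and leaves the resulting list available for checking that the minimum inter-point distance is unchanged. All function names and the seed are hypothetical.

```python
import itertools
import math
import random

def random_lhd(n, d, rng):
    """Random Latin hypercube: each column is a random permutation of the
    n cell midpoints (i + 0.5) / n. A plain random LHD, not a maximin one."""
    cols = []
    for _ in range(d):
        levels = [(i + 0.5) / n for i in range(n)]
        rng.shuffle(levels)
        cols.append(levels)
    return [tuple(col[i] for col in cols) for i in range(n)]

def permute_columns(design, perm):
    """Relabel inputs: new column j takes the values of old column perm[j]."""
    return [tuple(x[j] for j in perm) for x in design]

def reflect_columns(design, flip):
    """Reflect x_j around 0.5 (x_j -> 1 - x_j) wherever flip[j] is True."""
    return [tuple(1.0 - xj if fj else xj for xj, fj in zip(x, flip))
            for x in design]

def min_distance(design):
    """Minimum Euclidean inter-point distance, the maximin criterion."""
    return min(math.dist(a, b) for a, b in itertools.combinations(design, 2))

rng = random.Random(0)      # arbitrary seed, for reproducibility only
d, n = 4, 11
base = random_lhd(n, d, rng)

# Enumerate all d! * 2^d transformed designs (384 for d = 4).
equivalents = [
    reflect_columns(permute_columns(base, perm), flip)
    for perm in itertools.permutations(range(d))
    for flip in itertools.product([False, True], repeat=d)
]
```

Each transformed design feeds the code a different set of input vectors, so the training outputs generally differ even though the distance-based design criteria are identical. For larger d one would sample permutations at random rather than enumerate them, as the article does with 25 permutations for d ≥ 5.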
We consider three different designs for the experiment: a 27-run, 3-level orthogonal array (OA), the same design used by Joseph, Hung and Sudjianto (2008); a 27-run mLHD; and a 40-run mLHD.

FIG. 2. Borehole function: Normalized holdout RMSE of prediction, e_{rmse,ho}, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_{rmse,ho} in a dot plot.

There are 12 possible modeling combinations from the four correlation functions and three regression models outlined in Sections 2.1 and 2.2. The SL choice for μ here is always the term x1. Its main effect accounts for approximately 80% of the variation in predictions over the 8-dimensional input domain, and all analyses with an SL regression model choose x1 and no other terms across all designs and all repeat experiments.

The top row of Figure 2 shows results with the 27-run OA design. For a given modeling strategy, 25 random permutations between columns of the 27-run OA lead to 25 repeat experiments (Section 3.1) and hence a reference set of 25 values of e_{rmse,ho} shown as a dot plot. The results are striking. Relative to a constant regression model, the FL regression model has empirical distributions of e_{rmse,ho} which are uniformly and substantially inferior, for all correlation functions. The SL regression also performs very poorly sometimes, but not always. To investigate the SL regression further, Figure 3 plots e_{rmse,ho} for individual repeat experiments, comparing the GaSP(Const, Gauss) and GaSP(SL, Gauss) models. Consistent with the anecdotal comparisons in Section 1, the plot shows that the SL regression model can give a smaller e_{rmse,ho} (this tends to happen when both methods perform fairly well), but the SL regression sometimes has very poor accuracy (almost 0.5 on the normalized RMSE scale). The top row of Figure 2 also shows that the choice of correlation function is far less important than the choice of regression model.

FIG. 3. Borehole code: Normalized holdout RMSE of prediction, e_{rmse,ho}, for an SL regression model versus a constant regression model. The 25 points are from repeat experiments generated by 25 random permutations between columns of a 27-run OA.

The results for the 27-run mLHD in the middle row of Figure 2 show that design can have a large effect on accuracy: every analysis model performs better for the 27-run mLHD than for the 27-run OA. (Note the vertical scale is different for each row of the figure.) The SL regression now performs about the same as the constant regression instead of occasionally much worse. There is no substantial difference in accuracy between the correlation functions. Indeed, the impact on accuracy of using the space-filling mLHD design instead of an OA is much more important than differences due to choice of the correlation function. The scaling in the middle row of plots somewhat mutes the considerable variation in accuracy still present over the 25 equivalent mLHD designs.

Increasing the number of runs to a 40-run mLHD (bottom row of Figure 2) makes a further substantial improvement in prediction accuracy. All 12 modeling strategies give e_{rmse,ho} values of about 0.02–0.06 over the 25 repeat experiments. Although there is little systematic difference among strategies, the variation over equivalent designs is still striking in a relative sense.

The strikingly poor results from the SL regression model (sometimes) and the FL model (all 25 repeats) in the top row of Figure 2 may be explained as follows. The design is a 27-run OA with only 3 levels. In a simpler context, Welch et al. (1996) illustrated nonidentifiability of the important terms in a GaSP model when the design is not space-filling.
The SL regression and, even more so, the FL regression complicate an already flexible GP model. The difficulty in identifying the important terms is underscored by the fact that for all 25 repeat experiments from the base 27-run OA, a least squares fit of a simple linear regression model in x1 (with no other terms) gives e_{rmse,ho} values close to 0.45. In other words, performance of GaSP(SL, Gauss) is sometimes similar to fitting just the important x1 linear trend. The performance of GaSP(FL, Gauss) is highly variable and sometimes even worse than simple linear regression.

Welch et al. (1996) argued that model identifiability is, not surprisingly, connected with confounding in the design. The confounding in the base 27-run OA is complex. While it is preserved in an overall sense by permutations between columns, how the confounding structure aligns with the important inputs among x1, ..., x8 will change across the 25 repeat experiments. Hence, the impact of confounding on nonidentifiability will vary.

In contrast, accuracy for the space-filling design in the middle row of Figure 2 is much better, even with only 27 runs. The SL regression model performs as accurately as the Const model (but no better); only the even more complex FL regression runs into difficulties. Again, this parallels the simpler Welch et al. (1996) example, where model identification was less problematic with a space-filling design and largely eliminated by increasing the sample size (the bottom row of Figure 2).

3.3 G-Protein Code

A second application, the G-protein code used by Loeppky, Sacks and Welch (2009) and described in the supplementary material (Chen et al., 2016), consists of a system of ODEs with 4-dimensional input.

Figure 4 shows e_{rmse,ho} for the three regression models (here SL selects x2, x3 as inputs with large effects) and four correlation functions. The results in the top row are for a 40-run mLHD.
With d = 4, all 24 possible permutations between columns of a single base design lead to 24 data sets and hence 24 e_{rmse,ho} values. The dot plots in the top row have similar distributions across the 12 modeling strategies. As these empirical distributions have most e_{rmse,ho} values above 0.1, we try increasing the sample size with an 80-run mLHD. This has a substantial effect on accuracy, with all modeling methods giving e_{rmse,ho} values of about 0.06 or less.

Thus, for the G-protein application, none of the three choices for μ or the four choices for R matter. The variation among equivalent designs is alarmingly large in a relative sense, dwarfing any impact of modeling strategy.

3.4 PTW Code

Results for a third fast-to-run code, PTW (Preston, Tonks and Wallace, 2003), are in the supplementary material (Chen et al., 2016). It has 11 inputs. We took an mLHD with n = 110 as the base design for the reference set. Prior information from engineers suggested incorporating linear x1 and x2 terms; SL also included x3. No essential differences among μ or R emerged, but again there is a wide variation over equivalent designs.

FIG. 4. G-protein code: Normalized holdout RMSE of prediction, e_{rmse,ho}, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of e_{rmse,ho} in each dot plot.

3.5 Effect of Design

The results above document a significant, seldom recognized role of design: different, even equivalent, designs can have a greater effect on performance than the choice of μ and R. Moreover, without prior information, there is no way to assure that the choice of design will be one of the good ones in the equivalence class.
Whether sequential experimentation, if feasible, can produce a more advantageous solution needs exploring.

The contrast between the results for borehole 27-run OA and the 27-run mLHD is a reminder of the importance of using designs that are space-filling, a quality widely appreciated. It is no secret that the choice of sample size, n, has a strong effect on performance as evidenced in the results for the 40-point mLHD contrasted with those for the 27-point mLHD. A more penetrating study of the effect of n was conducted by Loeppky, Sacks and Welch (2009). That FL does as well as Const and SL for the Borehole 40-point mLHD but performs badly for either of the two 27-point designs, and that none of the regression choices matter for the G-protein 40-point design or for the PTW 110-point design, suggests that "everything" works if n is large enough.

In summary, the choice of n and the choice of D given n can have huge effects. But have we enough evidence that choice of μ matters only in limited contexts (such as small n or poor design) and that choice of R does not matter? So far we have dealt with only a handful of simple, fast codes; it is time to consider more complex codes.

4. SLOW CODES

For complex costly-to-run codes, generating substantial holdout data or output from multiple designs is infeasible. Similarly, for codes where we only have reported data, new output data are unavailable. Being forced to depend on the data at hand leads us to rely on cross-validation methods for generating multiple designs and holdout sets, through which we can assess the effect of variability not solely in the designs but also, and inseparably, in the holdout target data. We know from Section 3 that variability due to designs is considerable, and it is no surprise that variability in holdout sets would lead to variability in predictive performance.
The utility then of the created multiple designs and holdout sets is to compare the behavior of different modeling choices under varying conditions rather than relying on a single quantity attached to the original, unique data set.

Our approach is simply to delete a subset from the full data set, use the remaining data to produce a predictor, and calculate the (normalized) RMSE from predicting the output in the deleted (holdout) subset. Repeating this for a number (25 is what we use) of subsets gives some measure of variability and accuracy. In effect, we create 25 designs and corresponding holdout sets from a single data set and compare consequences arising from different choices for predictors.

The details described in the applications below differ somewhat depending on the particular application. In the example of Section 4.1 (a reflectance model for a plant canopy) there are, in fact, limited holdout data but no data from multiple designs. In the volcano-eruption example of Section 4.2 and the sea-ice model of Section 4.3 holdout data are unavailable.

4.1 Nilson–Kuusk Model

An ecological code modeling reflectance for a plant canopy developed by Nilson and Kuusk (1989) was used by Bastos and O'Hagan (2009) to illustrate diagnostics for GaSP models. With 5-dimensional input, two computer experiments were performed: the first using a 150-run random LHD and the second with an independently chosen LHD of 100 points.

We carry out three studies based on the same data. The first treats the 100-point LHD as the experiment and the 150-point set as a holdout sample. The second study reverses the roles of the two LHDs. A third study, extending one done by Bastos and O'Hagan (2009), takes the 150-run LHD, augments it with a random sample of 50 points from the 100-point LHD, takes the resulting 200-point subset as the experimental design for training the statistical model, and uses the remaining N = 50 points from the 100-run LHD to form the holdout set in the calculation of e_{rmse,ho}.
By repeating the sampling of the 50 points 25 times, we get 25 replicate experiments, each with the same base 150 runs but differing with respect to the additional 50 training points and the holdout set.

In addition to the linear regression choices we have studied so far, we also incorporate a regression model identified by Bastos and O'Hagan (2009): an intercept, linear terms in the inputs x1, ..., x4, and a quartic polynomial in x5. We label this model "Quartic." All analyses are carried out with the output y on a log scale, based on standard diagnostics for GaSP models (Jones, Schonlau and Welch, 1998).

TABLE 1. Nilson–Kuusk model: Normalized holdout RMSE of prediction, e_{rmse,ho}, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the holdout set is from a 150-run LHD.

                      Correlation function
Regression model      Gauss     PowerExp    Matérn-2    Matérn
Constant              0.116     0.099       0.106       0.102
Select linear         0.115     0.099       0.106       0.105
Full linear           0.110     0.099       0.104       0.104
Quartic               0.118     0.103       0.107       0.106

Table 1 summarizes the results of the study with the 100-point LHD as training data and the 150-point set as a holdout sample. It shows the choice for μ is immaterial: the constant mean is as good as any. For the correlation function, Gauss is inferior to the other choices, there is some evidence that Matérn is preferred to Matérn-2, and there is little difference between PowerExp and Matérn, the best performers. Similar results pertain when the 150-run LHD is used for training and the 100-run set for testing [Table 4 in the supplementary material (Chen et al., 2016)].

The dot plots in Figure 5 for the third study are even more striking in exhibiting the inferiority of R = Gauss and the lack of advantages for any of the nonconstant regression functions. The large variability in performance among designs and holdout sets is similar to that seen for the fast-code replicate experiments of Section 3.
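The repeated-holdout scheme used for the slow codes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the normalization divides the holdout RMSE by the RMSE of the trivial predictor (the training mean), which is our reading of (2.5); the toy data, the function names, and the nearest-neighbour stand-in predictor are hypothetical, and any GaSP fitting routine could be passed as `fit_predict` instead.

```python
import math
import random


def normalized_rmse(y_hold, y_pred, y_train_mean):
    """Holdout RMSE divided by the holdout RMSE of the trivial predictor
    (the training mean); assumed reading of the paper's (2.5)."""
    num = sum((p - t) ** 2 for p, t in zip(y_pred, y_hold))
    den = sum((y_train_mean - t) ** 2 for t in y_hold)
    return math.sqrt(num / den)


def repeated_holdout(x, y, fit_predict, n_train, n_repeats=25, seed=0):
    """Randomly split the runs into a training design and a holdout set
    n_repeats times; return one normalized RMSE per split."""
    rng = random.Random(seed)
    indices = list(range(len(y)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train, hold = indices[:n_train], indices[n_train:]
        x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
        x_ho, y_ho = [x[i] for i in hold], [y[i] for i in hold]
        y_mean = sum(y_tr) / len(y_tr)
        y_pred = fit_predict(x_tr, y_tr, x_ho)
        scores.append(normalized_rmse(y_ho, y_pred, y_mean))
    return scores


def trivial_predictor(x_tr, y_tr, x_ho):
    """Predict the training mean everywhere; normalized RMSE is 1."""
    y_mean = sum(y_tr) / len(y_tr)
    return [y_mean for _ in x_ho]


def nearest_neighbour(x_tr, y_tr, x_ho):
    """Hypothetical stand-in for a fitted surrogate (not a GaSP model)."""
    return [y_tr[min(range(len(x_tr)), key=lambda i: abs(x_tr[i] - x0))]
            for x0 in x_ho]


# Toy data: 32 runs of a one-input function, split 25/7 as in Section 4.2.
x = [i / 31 for i in range(32)]
y = [xi ** 2 for xi in x]
trivial_scores = repeated_holdout(x, y, trivial_predictor, n_train=25)
nn_scores = repeated_holdout(x, y, nearest_neighbour, n_train=25)
```

Using the same `seed` for every candidate predictor gives each method the same 25 splits, so that the resulting dot plots compare methods on a common reference set.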
The perturbations of the experiment, from random sampling here, appear to provide a useful reference set for studying the behavior of model choices.

The large differences in prediction accuracy among the correlation functions, not seen in Section 3, deserve some attention. An overly smooth correlation function, the Gaussian, does not perform as well as the Matérn and power-exponential functions here. The latter two have the flexibility to allow needed rougher realizations. With the 150-run design and the constant regression model, for instance, the maximum of the log likelihood increases by about 50 when the power exponential is used instead of the Gaussian, with four of the p_j in (2.2) taking values less than 2.

FIG. 5. Nilson–Kuusk code: Normalized holdout RMSE of prediction, e_{rmse,ho}, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

The estimated main effect (Schonlau and Welch, 2006) of x5 in Figure 6 from the GaSP(Const, PowerExp) model shows that x5 has a complex effect. It is also a strong effect, accounting for about 90% of the total variance of the predicted output over the 5-dimensional input space. Bastos and O'Hagan (2009) correctly diagnosed the complexity of this trend. Modeling it via a quartic polynomial in x5 has little impact on prediction accuracy, however. The correlation structure of the GaSP is able to capture the trend implicitly just as well.

FIG. 6. Nilson–Kuusk code: Estimated main effect of x5.

4.2 Volcano Model

A computer model studied by Bayarri et al. (2009) models the process of pyroclastic flow (a fast-moving current of hot gas and rock) from a volcanic eruption. The inputs varied are as follows: initial volume, x1, and direction, x2, of the eruption.
The output, y, is the maximum (over time) height of the flow at a location. A 32-run data set provided by Elaine Spiller [different from that reported by Bayarri et al. (2009) but a similar application] is available in the supplementary material (Chen et al., 2016). Plotting the data shows the output has a strong trend in x1, and putting a linear term in the GaSP surrogate, as modeled by Bayarri et al. (2009), is natural. But is it necessary?

The nature of the data suggests a transformation of y could be useful. The one used by Bayarri et al. (2009) is log(y + 1). Diagnostic plots (Jones, Schonlau and Welch, 1998) from using μ = Const and R = Gauss show that the log transform is reasonable, but a square-root transformation is better still. We report analyses for both transformations.

The regression functions considered are Const, SL (β0 + β1 x1), full linear, and quadratic (β0 + β1 x1 + β2 x2 + β3 x1^2), because the estimated effect of x1 has a strong trend growing faster than linearly when looking at main effects from the surrogate obtained using √y and GaSP(Const, PowerExp).

Analogous to the approach in Section 4.1, repeat experiments are generated by random sampling of 25 runs from the 32 available to comprise the design for model fitting. The remaining 7 runs form the holdout set. This is repeated 25 times, giving 25 e_{rmse,ho} values in the dot plots of Figure 7. The conclusions are much like those in Section 4.1: there is usually no need to go beyond μ = Const and PowerExp is preferred to Gauss. The failure of Gauss in the two "slow" examples considered thus far is surprising in light of the widespread use of the Gauss correlation function.

FIG. 7. Volcano model: Normalized holdout RMSE, e_{rmse,ho}, for four regression models and four correlation functions. The output variable is either √y or log(y + 1).

4.3 Sea-Ice Model

The Arctic sea-ice model studied in Chapman et al. (1994) and in Loeppky, Sacks and Welch (2009) has 13 inputs, 4 outputs, and 157 available runs.
The previous studies found modest prediction accuracy of GaSP(Const, PowerExp) surrogates for two of the outputs (ice mass and ice area) and poor accuracy for the other two (ice velocity and ice range). The question arises whether use of linear regression terms can increase accuracy to acceptable levels. Using a sampling process like that in Section 4.2 leads to the results in the supplementary material (Chen et al., 2016), where the answer is no: there is no help from μ = SL or FL, nor from changing R. Indeed, FL makes accuracy much worse sometimes.

5. OTHER MODELING STRATEGIES

Clearly, we have not studied all possible paths to GaSP modeling that one might take in a computer experiment. In this section we address several others, some in more detail, and point to issues that could be addressed in the fashion described above.

5.1 Full Bayes

A number of full Bayes approaches have been employed in the literature. They go beyond the statistical formulation using a GP as a prior on the class of functions and assign prior distributions to all parameters, particularly those of the correlation function. For illustration, we examine the GEM-SA implementation of Kennedy (2004), which we call Bayes-GEM-SA. One key aspect is its reliance on R = Gauss. It also uses the following independent prior distributions: β0 ∝ 1, σ^2 ∝ 1/σ^2, and θ_j exponential with rate 0.01 (Kennedy, 2004). When comparing its predictive accuracy with GaSP, μ = Const is used for all models.

For the borehole application, 25 repeat experiments are constructed for three designs, as in Section 3. The dot plots of e_{rmse,ho} in Figure 8 compare Bayes-GEM-SA with the Gauss and PowerExp methods in Section 3 based on MLEs of all parameters. (The method CGP and its dot plot are discussed in Section 5.2.) Bayes-GEM-SA is less accurate than either GaSP(Const, Gauss) or GaSP(Const, PowerExp).

FIG. 8. Borehole function: Normalized holdout RMSE of prediction, e_{rmse,ho}, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are three base designs: a 27-run OA (left), a 27-run mLHD (middle), and a 40-run mLHD (right). For each base design, 25 random permutations between columns give the 25 values of e_{rmse,ho} in a dot plot.

Figure 9 similarly depicts results for the G-protein code. With the 40-run mLHD, the Bayesian and likelihood methods all perform about the same, giving only fair prediction accuracy. Increasing n to 80 improves accuracy considerably for all methods (the scales of the two plots are very different), far outweighing any systematic differences between their accuracies.

FIG. 9. G-protein: Normalized holdout RMSE of prediction, e_{rmse,ho}, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are two base designs: a 40-run mLHD (left); and an 80-run mLHD (right). For each base design, all 24 permutations between columns give the 24 values of e_{rmse,ho} in a dot plot.

Bayes-GEM-SA performs as well as the GaSP methods for G-protein, not so well for Borehole with n = 27 but adequately for n = 40. Turning to the slow codes in Section 4, a different message emerges. Figure 10 for the Nilson–Kuusk model is based on 25 repeat designs constructed as for Figure 5 with a base design of 150 runs plus 50 randomly chosen from 100. The distributions of e_{rmse,ho} for Bayes-GEM-SA and Gauss are similar, with PowerExp showing a clear advantage. Moreover, few of the Bayes e_{rmse,ho} values meet the 0.10 threshold, while all the GaSP(Const, PowerExp) e_{rmse,ho} values do. Bayes-GEM-SA uses the Gaussian correlation function, which performed relatively poorly in Section 4; the disadvantage carries over to the Bayesian method here.

FIG. 10. Nilson–Kuusk model: Normalized holdout RMSE of prediction, e_{rmse,ho}, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

The results in Figure 11 for the volcano code are for the 25 repeat experiments described in Section 4. Here again PowerExp dominates Bayes and for the same reasons as for the Nilson–Kuusk model. For the √y transformation, all but a few GaSP(Const, PowerExp) e_{rmse,ho} values meet the 0.10 threshold, in contrast to Bayes where all but a few do not.

These results are striking and suggest that Bayes methods relying on R = Gauss need extension. The "hybrid" Bayes-MLE approach employed by Bayarri et al. (2009) estimates the correlation parameters in PowerExp by their MLEs, fixes them, and takes objective priors for μ and σ^2. The mean of the predictive distribution for a holdout output value gives the same prediction as GaSP(Const, PowerExp). Whether other "hybrid" forms can be brought to bear effectively needs exploration.

5.2 Nonstationarity

The use of stationary GPs as priors in the face of "nonstationary appearing" functions has attracted a measure of concern despite the fact that all functions with L2-derivative can be approximated using PowerExp with enough data. Of course, there never are enough data. A relevant question is whether other priors, even stationary ones different from those in Section 2, are better reflective of conditions and lead to more accurate predictors.

West et al. (1995) employed a GP prior for y(x) with two additive components: a smooth one for global trend and a rough one to model more local behavior. Recently, a similar "composite" GP (CGP) approach was advanced by Ba and Joseph (2012). These authors used two GPs, both with Gauss correlation. The first has correlation parameters θ_j in (2.2) constrained to be small for gradually varying longer-range trend, while the second has larger values of θ_j for shorter-range behavior.
The second, local GP also has a variance that depends on x, primarily as a way to cope with apparent nonstationary behavior. Does this composite approach offer an effective improvement to the simpler choices of Section 2?

We can apply CGP via its R library to the examples studied in Sections 3 and 4, much as was just done for Bayes-GEM-SA. The comparisons in Figure 8 for the borehole function show that GaSP and CGP have similar accuracy for the two 27-run designs. GaSP has smaller error than CGP for the 40-run mLHD, though both methods achieve acceptable accuracy. The results in Figure 9 for G-protein show little practical difference between any of the methods, including CGP. For these two fast-code examples, there is negligible difference between CGP and the GaSP methods. For the models of Section 4, however, conclusions are somewhat different. GaSP(Const, PowerExp) is clearly much more accurate than CGP for the Nilson–Kuusk model (Figure 10) and roughly equivalent for the volcano code (Figure 11).

FIG. 11. Volcano model: Normalized holdout RMSE of prediction, e_{rmse,ho}, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

FIG. 12. 2-d function: Holdout predictions versus true values of y from fitting (a) GaSP(Const, PowerExp) and (b) CGP.

Ba and Joseph (2012) gave several examples assessing the performance of CGP. For reasons noted in Sections 6.2 and 6.4 we only look at two.

10-d example. The test function is

$$y(x) = -\sum_{j=1}^{10} \sin(x_j)\,\bigl(\sin(j x_j^2/\pi)\bigr)^{20}, \qquad 0 < x_j < \pi.$$

With n = 100, Ba and Joseph (2012) obtained unnormalized RMSE values of about 0.72–0.74 for CGP and about 0.72–0.88 for a GaSP(Const, Gauss) model over 50 repeat experiments.

This example demonstrates a virtue of using a normalized performance measure. To compute the normalizing factor for RMSE in (2.5), we followed the process of Ba and Joseph (2012).
Training data from an LHD with n = 100 gives ȳ, the trivial predictor. The normalization in the denominator of (2.5) is computed from N = 5000 random test points. Repeating this process 50 times gives normalization factors of 0.71–0.74, about the same as the raw RMSE values from CGP. Thus, CGP's RMSE prediction accuracy is no better than that of the trivial ȳ predictor, and the default method is worse. Effective prediction here is unattainable by CGP or GaSP and perhaps by any other approach with n = 100 because the function is so multi-modal; comparisons of CGP with other methods are meaningless in this example.

2-d example. For the function

$$y(x_1, x_2) = \sin\bigl(1/(x_1 x_2)\bigr), \qquad 0.3 \le x_j \le 1, \tag{5.1}$$

Ba and Joseph (2012) used a single design with n = 24 runs to compare CGP and GaSP(Const, Gauss). Their results suggest that accuracy is poor for both methods, which we confirmed. For this example, following Ba and Joseph (2012), a holdout set of 5000 random points on [0.3, 1]^2 was used. For one mLHD with n = 24, we obtained e_{rmse,ho} values of 0.23 and 0.24 for CGP and GaSP(Const, PowerExp), respectively. Moreover, the diagnostic plot in Figure 12 shows how badly CGP (and GaSP) perform. Both methods grossly over-predict for some points in the holdout set, with GaSP worse in this respect. Both methods also have large errors from under-prediction, with CGP worse.

Does this result generalize? With only two input variables and a function that is symmetric in x1 and x2, repeat experiments cannot be generated by permuting the column labels of the design. Reflecting within the x1 and x2 columns is considered below, but first we created multiple experiments by increasing n.

We were also curious about how large n has to be before acceptable accuracy is attained. Comparisons between CGP and GaSP(Const, PowerExp) were made for n = 24, 25, ..., 48; for each value of n an mLHD was generated.
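The two test functions of this section, and the normalizing factor described for the 10-d example, can be written down directly. The sketch below is an illustration under assumptions: it uses a plain random LHD built from shuffled strata (not the design generator of Ba and Joseph, 2012), takes the normalizing factor to be the RMSE of the training mean ȳ over uniformly random test points, and uses hypothetical function names. It is not guaranteed to reproduce the factors of 0.71–0.74 reported above.

```python
import math
import random


def y_10d(x):
    """10-d test function on (0, pi)^10."""
    return -sum(math.sin(xj) * math.sin(j * xj ** 2 / math.pi) ** 20
                for j, xj in enumerate(x, start=1))


def y_2d(x1, x2):
    """2-d test function (5.1) on [0.3, 1]^2."""
    return math.sin(1.0 / (x1 * x2))


def random_lhd(n, d, lower, upper, rng):
    """Random Latin hypercube: one point per stratum in every column."""
    cols = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)
        cols.append([lower + (upper - lower) * (s + rng.random()) / n
                     for s in strata])
    return [tuple(col[i] for col in cols) for i in range(n)]


def normalizing_factor(f, n, n_test, d, lower, upper, seed=0):
    """RMSE of the trivial predictor (training mean) over random test points."""
    rng = random.Random(seed)
    train = random_lhd(n, d, lower, upper, rng)
    y_bar = sum(f(x) for x in train) / n
    test = [tuple(rng.uniform(lower, upper) for _ in range(d))
            for _ in range(n_test)]
    return math.sqrt(sum((y_bar - f(x)) ** 2 for x in test) / n_test)


factor = normalizing_factor(y_10d, n=100, n_test=5000, d=10,
                            lower=0.0, upper=math.pi)
```

Dividing a method's raw RMSE by such a factor makes it immediately visible when a surrogate does no better than predicting the training mean, which is the point made for the 10-d example.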
The e_{rmse,ho} results plotted in Figure 13(a) show that accuracy is not improved substantially for either method as n increases. Indeed, GaSP(Const, PowerExp) gives variable accuracy, with larger values of n sometimes leading to worse accuracy than for n = 24. (The results in Figure 13 for a model with a nugget term are described in Section 5.3.)

FIG. 13. 2-d function: Normalized holdout RMSE of prediction, e_{rmse,ho}, versus n for CGP (◦), GaSP(Const, PowerExp) (), and GaSP(Const, PowerExp) with nugget (+).

To try to improve the accuracy, even larger sample sizes were tried. Figure 13(b) shows e_{rmse,ho} for n = 50, 60, ..., 200. Both methods continue to give poor accuracy until n reaches 70, after which there is a slow, unsteady improvement. Curiously, GaSP now dominates.

Permuting between columns of a design does not generate distinct repeat experiments here, but reflecting either or both coordinates about the centers of their ranges maintains the distance properties of the design, that is, x1 on [0.3, 1] is replaced by x1′ = 1.3 − x1, and similarly x2. Results for the repeat experiments from reflecting within x1, x2, or both x1 and x2 are available in the supplementary material (Chen et al., 2016). They are similar to those in Figure 13.

Thus, CGP dominates here for n ≤ 60: it is inaccurate but less inaccurate than GaSP. For larger n, however, GaSP performs better, reaching the 0.10 threshold for e_{rmse,ho} before CGP does. This example demonstrates the potential pitfalls of comparing two methods with a single experiment. A more comprehensive analysis not only gives more confidence in the findings but may also be essential to provide a balanced overview of advantages and disadvantages.

These last two toy functions together with the results in Figures 8–11 show no evidence for the effectiveness of a composite GaSP approach. These findings are in accord with the earlier study by West et al. (1995).

5.3 Adding a Nugget Term

A nugget augments the GaSP model in (2.1) with an uncorrelated ε term, usually assumed to have a normal distribution with mean zero and constant variance σ_ε^2, independent of the correlated process Z(x). This changes the computation of R and r^T(x) in the conditional prediction (2.4), which no longer interpolates the training data. For data from physical experimentation or observation, augmenting a GaSP model in this way is natural to reflect random errors (e.g., Gao, Sacks and Welch, 1996; McMillan et al., 1999; Styer et al., 1995). A nugget term has also been widely used for statistical modeling of deterministic computer codes without random error. The reasons offered are that numerical stability is improved, so overcoming computational obstacles, and also that a nugget can produce better predictive performance or better confidence or credibility intervals. The evidence, both in the literature and presented here, suggests, however, that for deterministic functions the potential advantages of a nugget term are modest. More systematic methods are available to deal with numerical instability if it arises (Ranjan, Haynes and Karsten, 2011), adding a nugget does not convert a poor predictor into an acceptable one, and other factors may be more important for good statistical properties of intervals (Section 6.1). On the other hand, we also do not find that adding a nugget (and estimating it along with the other parameters) is harmful, though it may produce smoothers rather than interpolators. We now elaborate on these points.

A small nugget, that is, a small value of σ_ε^2, is often included to improve the numerical properties of R. For the space-filling initial designs in this article, however, Ranjan, Haynes and Karsten (2011) showed that ill-conditioning in a no-nugget GaSP model will only occur for low-dimensional x, high correlation, and large n.
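The loss of interpolation when a nugget is added can be seen in a small numerical sketch. It is a simplified illustration under assumptions, not the authors' code: one input, a Gaussian correlation with a fixed (not estimated) θ, a constant mean estimated by generalized least squares, and a nugget entered as the ratio δ = σ_ε^2/σ^2 added to the diagonal of R. The names `predict`, `theta` and `delta` and the five training runs are hypothetical.

```python
import math


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def gauss_corr(u, v, theta):
    """Gaussian correlation between two scalar inputs."""
    return math.exp(-theta * (u - v) ** 2)


def predict(x_train, y_train, x_new, theta=5.0, delta=0.0):
    """Constant-mean predictor; delta > 0 adds a nugget to the diagonal of R."""
    n = len(x_train)
    R = [[gauss_corr(a, b, theta) + (delta if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    # GLS estimate of the constant mean: 1'R^{-1}y / 1'R^{-1}1.
    mu = sum(solve(R, y_train)) / sum(solve(R, [1.0] * n))
    w = solve(R, [yi - mu for yi in y_train])
    return [mu + sum(gauss_corr(x0, a, theta) * wi for a, wi in zip(x_train, w))
            for x0 in x_new]


# Hypothetical deterministic training data: five runs of sin(2*pi*x).
x_train = [0.0, 0.25, 0.5, 0.75, 1.0]
y_train = [0.0, 1.0, 0.0, -1.0, 0.0]
fit_no_nugget = predict(x_train, y_train, x_train, delta=0.0)
fit_nugget = predict(x_train, y_train, x_train, delta=0.1)
gap_no_nugget = max(abs(p - t) for p, t in zip(fit_no_nugget, y_train))
gap_nugget = max(abs(p - t) for p, t in zip(fit_nugget, y_train))
```

Without the nugget the predictor reproduces the training outputs up to rounding error; with δ = 0.1 it smooths them, so `gap_nugget` is clearly nonzero. The value δ = 0.1 is deliberately large for visibility and is far above the fitted ratios reported in this section.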
These conditions are not commonly met in initial designs for applications. For instance, none of the computations for this article failed due to ill-conditioning, and those computations involved many repetitions of experiments for the various functions and GaSP models. The worst conditioning occurred for the 2-d example in Section 5.2 with n = 200, but even here the condition numbers of about 10^6 did not preclude reliable calculations. When a design is not space-filling, matrix ill-conditioning may indeed occur. For instance, a sequential design for, say, optimization or contour estimation (Bingham, Ranjan and Welch, 2014) could lead to runs close together in the x space, causing numerical problems. If ill-conditioning does occur, however, the mathematical solution proposed by Ranjan, Haynes and Karsten (2011) is an alternative to adding a nugget.

A nugget term is also sometimes suggested to improve predictive performance. Andrianakis and Challenor (2012) showed mathematically, however, that with a nugget the RMSE of prediction can be as large as that of a least squares fit of just the regression component in (2.1). Our empirical findings, choosing the size of σ_ε^2 via its MLE, are similarly unsupportive of a nugget. For example, the 2-d function in (5.1) is hard to predict with a GaSP(Const, PowerExp) model (Figure 13), but the results with a fitted nugget term shown by a "+" symbol in Figure 13 are no different in practice from those of the no-nugget model.

Similarly, repeating the calculations leading to Figure 2 for the borehole function, but fitting a nugget term in all models, shows essentially no difference [the results with a nugget are available in Figure 1 of the supplementary material (Chen et al., 2016)]. The MLE of σ_ε^2 is either zero or very small relative to the variance of the correlated process: typically σ̂_ε^2/σ̂^2 < 10^{−6}.
These empirical findings on the nugget are consistent with those of Ranjan, Haynes and Karsten (2011), who found for the borehole function and other applications that constraining the model fit to have at least a modest value of $\sigma_\varepsilon^2$ deteriorated predictive performance.

Another example, the Friedman function,

$$y(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5, \tag{5.2}$$

with n = 25 runs, was used by Gramacy and Lee (2012) to illustrate potential advantages of including a nugget term. Their context (performance criteria, analysis method, and design) differs in all respects from ours. Our results in the top row of Figure 14 show that the GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with n = 25 have highly variable accuracy, with $e_{\mathrm{rmse,ho}}$ values no better and often much worse than 20%. The effect of the nugget is inconsequential. Increasing the sample size to n = 50 makes a dramatic improvement in prediction accuracy, but the effect of a nugget remains negligible.

The Gramacy and Lee (2012) results are not inconsistent with ours in that they did not report prediction accuracy for this example. Rather, their results relate to the role of the nugget in sometimes obtaining better uncertainty measures when a poor choice of correlation function is inadvertently made, a topic we return to in Section 6.1.

6. COMMENTS

6.1 Uncertainty of Prediction

As noted in Section 1, our attention is directed at prediction accuracy, the most compelling characteristic in practical settings. For example, where the objective is calibration and validation, the details of uncertainty, as distinct from accuracy, in the emulator of the computer model are absorbed (and usually swamped) by model uncertainties and measurement errors (Bayarri et al., 2007). But for specific predictions it is clearly important to have valid uncertainty statements.

Currently, a full assessment of the validity of emulator uncertainty quantification is unavailable.
It has long been recognized that the standard error of prediction can be optimistic when MLEs of the parameters $\theta_j$, $p_j$, $\nu_j$ in the correlation functions of Section 2.1 are "plugged-in" because the uncertainty in the parameter values is not taken into account (Abt, 1999). Corrections proposed by Abt remain to be done for the settings in which they are applicable.

Bayes credible intervals with full Bayes methods carry explicit and valid uncertainty statements; hybrid methods using priors on some of the correlation parameters (as distinct from MLEs) may also have reliable credible intervals. But for properties such as actual coverage probability (ACP), the proportion of points in a test set with true response values covered by intervals of nominal (say) 95% confidence or credibility, the behavior is far from clear. Chen (2013) compared several Bayes methods with respect to coverage. The results showed variability with respect to equivalent designs like that found above for accuracy, a troubling characteristic pointing to considerable uncertainty about the uncertainty.

FIG. 14. Friedman function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

In Figure 15 we see some of the issues. It gives ACP results for the borehole and Nilson–Kuusk functions. The left-hand plot for borehole displays the anticipated under-coverage using plug-in estimates for the correlation parameters. (Confidence intervals here use n − 1 rather than n in the estimate of σ in the standard error and $t_{n-1}$ instead of the standard normal.) PowerExp is slightly better than Gauss, and Bayes-GEM-SA has ACP values close to the nominal 95%.
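The ACP criterion defined in Section 6.1 is simple to compute once predictions and standard errors are available for a test set. The sketch below is one possible reading of that definition, not the authors' code: the function name `actual_coverage_probability` and its arguments are hypothetical, and the critical value is passed in by the caller so that either a standard normal quantile or a $t_{n-1}$ quantile taken from tables can be used.

```python
from statistics import NormalDist

def actual_coverage_probability(y_true, y_pred, std_err, crit):
    """Proportion of test points whose true value lies in y_pred +/- crit * std_err."""
    if not y_true or not (len(y_true) == len(y_pred) == len(std_err)):
        raise ValueError("inputs must be non-empty and of equal length")
    covered = sum(1 for y, m, s in zip(y_true, y_pred, std_err)
                  if abs(y - m) <= crit * s)
    return covered / len(y_true)

# Two-sided 95% critical value from the standard normal (about 1.96).
# A t quantile with n - 1 degrees of freedom would be somewhat larger.
z95 = NormalDist().inv_cdf(0.975)

# Illustrative (made-up) holdout values, predictions and standard errors.
acp = actual_coverage_probability(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.1, 2.0, 2.0, 4.5],
    std_err=[0.1, 0.1, 0.1, 0.5],
    crit=z95,
)
```

An ACP well below the nominal level, computed over a holdout set in this way, corresponds to the under-coverage discussed for the plug-in methods.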
Surprisingly, the plot for the Nilson–Kuusk code on the right of Figure 15 paints a different picture. Plug-in with Gauss and Bayes-GEM-SA both show under-coverage, while plug-in PowerExp has near-ideal properties here. We speculate that the use of the Gauss correlation function by Bayes-GEM-SA is again suboptimal for the Nilson–Kuusk application, just as it was for prediction accuracy.

The supplementary material (Chen et al., 2016) compares models with and without a nugget in terms of coverage properties for the Friedman function in (5.2). The results show that the problem of substantial under-coverage seen in many of the replicate experiments is not solved by inclusion of a nugget term. A modest improvement in the distribution of ACP values is seen, particularly for n = 50, an improvement consistent with the advantage seen in Table 1 of Gramacy and Lee (2012) from fitting a nugget term.

A more complete study is surely needed to clarify appropriate criteria for uncertainty assessment and how modeling choices may affect matters.

FIG. 15. Borehole and Nilson–Kuusk functions: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss), GaSP(Const, PowerExp), and Bayes-GEM-SA. For the borehole function, 25 random permutations between columns of a 40-run mLHD give the 25 values of ACP in a dot plot. For the Nilson–Kuusk function, 25 designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

6.2 Extrapolation

GaSP-based methods are interpolations so our findings are clearly limited to prediction in the space of the experiment. The design of the computer experiment should cover the region of interest, rendering extrapolation meaningless. If a new region of interest is found, for example, during optimization, the initial computer runs can be augmented; extrapolation can be used to delimit regions that have to be explored further. Of course, extrapolation is necessary in the situation of a new region and a code that can no longer be run. But then the question is how to extrapolate. Initial inclusion of linear or other regression terms may be more useful than just a constant, but it may also be useless, or even dangerous, unless the "right" extrapolation terms are identified. We suspect it would be wiser to examine main effects resulting from the application of GaSP and use them to guide extrapolation.

6.3 Performance Criteria

We have focused almost entirely on questions of predictive accuracy and used RMSE as a measure. The supplementary material (Chen et al., 2016) defines and provides results for a normalized version of maximum absolute error, $e_{\max,\mathrm{ho}}$. Other computations we have done use the median of the absolute value of prediction errors, with normalization relative to the trivial predictor from the median of the training output data. These results are qualitatively the same as for $e_{\mathrm{rmse,ho}}$: regression terms do not matter, and PowerExp is a reliable choice for R. For slow codes, analysis like that in Section 4 but using $e_{\max,\mathrm{ho}}$ has some limited value in identifying regions where predictions are difficult, the limitations stemming from a likely lack of coverage of subregions, especially at borders of the unit cube, where the output function may behave badly.

A common performance measure for slow codes uses leave-one-out cross-validation error to produce analogues of $e_{\mathrm{rmse,ho}}$ and $e_{\max,\mathrm{ho}}$, obviating the need for a holdout set. For fast codes, repeat experiments and the ready availability of a holdout set render cross-validation unnecessary, however. For slow codes with only one set of data available, the single assessment from leave-one-out cross-validation does not reflect the variability caused, for example, by the design.
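The normalized criteria of Section 6.3 can be sketched as follows. Each divides an error summary of the predictor by the same summary for a trivial predictor: the training mean for the RMSE and maximum-error versions, and the training median for the median version. The function names are hypothetical, and the exact form of the median-based criterion is our reading of the verbal description rather than a formula given in the text.

```python
import math
from statistics import mean, median

def e_rmse_ho(y_hold, y_pred, y_train):
    """Holdout RMSE of the predictor, relative to that of the training mean."""
    ybar = mean(y_train)
    num = math.sqrt(mean([(p - y) ** 2 for p, y in zip(y_pred, y_hold)]))
    den = math.sqrt(mean([(ybar - y) ** 2 for y in y_hold]))
    return num / den

def e_max_ho(y_hold, y_pred, y_train):
    """Maximum absolute holdout error, relative to that of the training mean."""
    ybar = mean(y_train)
    num = max(abs(p - y) for p, y in zip(y_pred, y_hold))
    den = max(abs(ybar - y) for y in y_hold)
    return num / den

def e_median_ho(y_hold, y_pred, y_train):
    """Median absolute holdout error, relative to that of the training median
    (an assumed form of the median-based criterion)."""
    ymed = median(y_train)
    num = median([abs(p - y) for p, y in zip(y_pred, y_hold)])
    den = median([abs(ymed - y) for y in y_hold])
    return num / den

# Illustrative (made-up) data: the predictor halves every error of the trivial one.
y_train = [1.0, 2.0, 3.0]
y_hold = [0.0, 4.0]
y_pred = [1.0, 3.0]
scores = (e_rmse_ho(y_hold, y_pred, y_train),
          e_max_ho(y_hold, y_pred, y_train),
          e_median_ho(y_hold, y_pred, y_train))
```

A value near 1 means the emulator does no better than the trivial predictor, while values near 0 indicate accurate prediction. Leave-one-out analogues would replace the holdout predictions by predictions from fits that omit each training point in turn.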
In any case, qualitatively similar conclusions pertain regarding regression terms and correlation functions.

6.4 More Examples

The examples we selected are codes that have been used in earlier studies. We have not incorporated 1-d examples; while instructive for pedagogical reasons, they have little presence in practice. Other applications we could have included (e.g., Gough and Welch, 1994) duplicate the specific conclusions we draw below. There are also "fabricated" test functions in the numerical integration and interpolation literature (Barthelmann, Novak and Ritter, 2000) and some specifically for computer experiments (Surjanovic and Bingham, 2015). They exhibit characteristics sometimes similar to those in Section 5 (large variability in a corner of the space, a condition that inhibits and even prevents construction of effective surrogates) and sometimes no different than the examples in Section 3. Codes that are deterministic but with numerical errors could also be part of a diverse catalogue of test problems. Ideally performance metrics from various approaches would be provided to facilitate comparisons; the suite of examples that we employed is a starting point.

6.5 Designs

The variability in performance over equivalent designs is a striking phenomenon in the analyses of Section 3 and raises questions about how to cope with what seems to be unavoidable bad luck. Are there sequential strategies that can reduce the variability? Are there advantageous design types, more robust to arbitrary symmetries? For example, does it matter whether a random LHD, mLHD, or an orthogonal LHD is used? The latter question is currently being explored by the authors.

That design has a strong role is both unsurprising and surprising.
It is not surprising that care must be taken in planning an experiment; it is surprising and perplexing that equivalent designs can lead to such large differences in performance that are not mediated by good analytic procedures.

6.6 Larger Sample Sizes

As noted in Section 1, our attention is on experiments where n is small or modest at most. With advances in computing power it becomes more feasible to mount experiments with larger values of n while, at the same time, more complex codes become feasible but only with limited n. Our focus continues to be on the latter and the utility of GaSP models in that context.

As n gets larger, Figure 2 illustrates that the differences in accuracy among choices of R and μ begin to vanish. Indeed, it is not even clear that using GaSP models for large n is useful; standard function fitting methods such as splines may well be competitive and easier to compute. In addition, when n is large nonstationary behavior can become apparent and encourages variations in the GaSP methodology such as decomposing the input space (as in Gramacy and Lee, 2008) or using a complex μ together with a computationally more tractable R (as in Kaufman et al., 2011). Comparison of alternatives when n is large is yet to be considered.

6.7 Are Regression Terms Ever Useful?

Introducing regression terms is unnecessary in the examples we have presented; a heuristic rationale was given in Section 2.2. The supplementary material (Chen et al., 2016) reports a simulation study with realized functions generated as follows: (1) there are very large linear trends for all $x_j$; and (2) the superimposed sample path from a 0-mean GP is highly nonlinear, that is, a GP with at least one $\theta_j \gg 0$ in (2.2). Even under such extreme conditions, the advantage of explicitly fitting the regression terms is limited to a relative (ratio of $e_{\mathrm{rmse,ho}}$) advantage, with only small differences in $e_{\mathrm{rmse,ho}}$; the presence of a large trend causes a large normalizing factor.
Moreover, such functions are not the sort usually encountered in computer experiments. If they do show up, standard diagnostics will reveal their presence and allow effective follow-up analysis (see Section 7.2).

7. CONCLUSIONS AND RECOMMENDATIONS

This article addresses two types of questions. First, how should the analysis methodologies advanced in the study of computer experiments be assessed? Second, what recommendations for modeling strategies follow from applying the assessment strategy to the particular codes we have studied?

7.1 Assessing Methods

We have stressed the importance of going beyond "anecdotes" in making claims for proposed methods. While this point is neither novel nor startling, it is one that is commonly ignored, often because the process of studying consequences under multiple conditions is more laborious. The borehole example (Figure 2), for instance, employs 75 experiments arising from 25 repeats of each of 3 base experiments.

When only one training set of data is available (as can be the case with slow codes), the procedures in Section 4, admittedly ad hoc, nevertheless expand the range of conditions. This brings more generalizability to claims about the comparative performances of competing procedures. The same strategy of creating multiple training/holdout sets is potentially useful in comparing competing methods in physical experiments as well.

The studies in the previous sections lead to the following conclusions:

• There is no evidence that GaSP(Const, PowerExp) is ever dominated by use of regression terms, or other choices of R. Moreover, we have found that the inclusion of regression terms makes the likelihood surface multi-modal, necessitating an increase in computational effort for maximum likelihood or Bayesian methods. This appears to be due to confounding between regression terms and the GP paths.
• Choosing R = Gauss, though common, can be unwise.
The Matérn function optimized over a few levels of smoothness is a reasonable alternative to PowerExp.
• Design matters but cannot be controlled completely. Variability of performance from equivalent designs can be uncomfortably large.

There is not enough evidence to settle the following questions:

• Are full Bayes methods ever more accurate than GaSP(Const, PowerExp)? Bayes methods relying on R = Gauss were seen to be sometimes inferior, and extensions to accommodate less smooth R such as PowerExp, perhaps via hybrid Bayes-MLE methods, are needed.
• Are composite GaSP methods ever better than GaSP(Const, PowerExp) in practical settings where the output exhibits nonstationary behavior?

7.2 Recommendations

Faced with a particular code and a set of runs, what should a scientist do to produce a good predictor? Our recommendation is to make use of GaSP(Const, PowerExp), use the diagnostics of Jones, Schonlau and Welch (1998) or Bastos and O'Hagan (2009), and assess whether the GaSP predictor is adequate. If found inadequate, then the scientist should expect no help from introducing regression terms and, until further evidence is found, neither from Bayes nor CGP approaches. Of course, trying such methods is not prohibited, but we believe that inadequacy of the GaSP(Const, PowerExp) model is usually a sign that more substantial action must be taken.

We conjecture that the best way to proceed in the face of inadequacy is to devise a second (or multiple) stage process, perhaps by added runs, or perhaps by carving the space into more manageable subregions as well as adding runs.
How best to do this has been partially addressed, for example, by Gramacy and Lee (2008) and Loeppky, Moore and Williams (2010); effective methods constrained by limited runs are not apparent and in need of study.

ACKNOWLEDGMENTS

We thank the referees, Associate Editor, and Editor for suggestions that clarified and broadened the scope of the studies reported here.

The research of Loeppky and Welch was supported in part by grants from the Natural Sciences and Engineering Research Council, Canada.

SUPPLEMENTARY MATERIAL

Supplement to "Analysis Methods for Computer Experiments: How to Assess and What Counts?" (DOI: 10.1214/15-STS531SUPP; .zip). This report (whatcounts-supp.pdf) contains further description of the test functions and data from running them, further results for root mean squared error, findings for maximum absolute error, further results on uncertainty of prediction, and details of the simulation investigating regression terms. Inputs to the Arctic sea-ice code: ice-x.txt. Outputs from the code: ice-y.txt.

REFERENCES

ABT, M. (1999). Estimating the prediction mean squared error in Gaussian stochastic processes with exponential correlation structure. Scand. J. Stat. 26 563–578. MR1734262
ANDRIANAKIS, I. and CHALLENOR, P. G. (2012). The effect of the nugget on Gaussian process emulators of computer models. Comput. Statist. Data Anal. 56 4215–4228. MR2957866
BA, S. and JOSEPH, V. R. (2012). Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 6 1838–1860. MR3058685
BARTHELMANN, V., NOVAK, E. and RITTER, K. (2000). High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math. 12 273–288. MR1768951
BASTOS, L. S. and O'HAGAN, A. (2009). Diagnostics for Gaussian process emulators. Technometrics 51 425–438. MR2756478
BAYARRI, M. J., BERGER, J. O., PAULO, R., SACKS, J., CAFEO, J. A., CAVENDISH, J., LIN, C.-H. and TU, J. (2007). A framework for validation of computer models. Technometrics 49 138–154. MR2380530
BAYARRI, M. J., BERGER, J.
O., CALDER, E. S., DALBEY, K., LUNAGOMEZ, S., PATRA, A. K., PITMAN, E. B., SPILLER, E. T. and WOLPERT, R. L. (2009). Using statistical and computer models to quantify volcanic hazards. Technometrics 51 402–413. MR2756476
BINGHAM, D., RANJAN, P. and WELCH, W. J. (2014). Design of computer experiments for optimization, estimation of function contours, and related objectives. In Statistics in Action (J. F. Lawless, ed.) 109–124. CRC Press, Boca Raton, FL. MR3241971
CHAPMAN, W. L., WELCH, W. J., BOWMAN, K. P., SACKS, J. and WALSH, J. E. (1994). Arctic sea ice variability: Model sensitivities and a multidecadal simulation. J. Geophys. Res. 99C 919–935.
CHEN, H. (2013). Bayesian prediction and inference in analysis of computer experiments. Master's thesis, Univ. British Columbia, Vancouver.
CHEN, H., LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2016). Supplement to "Analysis Methods for Computer Experiments: How to Assess and What Counts?" DOI:10.1214/15-STS531SUPP.
CURRIN, C., MITCHELL, T., MORRIS, M. and YLVISAKER, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. Amer. Statist. Assoc. 86 953–963. MR1146343
DIXON, L. C. W. and SZEGÖ, G. P. (1978). The global optimisation problem: An introduction. In Towards Global Optimisation (L. C. W. Dixon and G. P. Szegö, eds.) 1–15. North Holland, Amsterdam.
GAO, F., SACKS, J. and WELCH, W. J. (1996). Predicting urban ozone levels and trends with semiparametric modeling. J. Agric. Biol. Environ. Stat. 1 404–425. MR1807773
GOUGH, W. A. and WELCH, W. J. (1994). Parameter space exploration of an ocean general circulation model using an isopycnal mixing parameterization. J. Mar. Res. 52 773–796.
GRAMACY, R. B. and LEE, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103 1119–1130. MR2528830
GRAMACY, R. B. and LEE, H. K. H. (2012).
Cases for the nugget in modeling computer experiments. Stat. Comput. 22 713–722. MR2909617
JONES, D. R., SCHONLAU, M. and WELCH, W. J. (1998). Efficient global optimization of expensive black-box functions. J. Global Optim. 13 455–492. MR1673460
JOSEPH, V. R., HUNG, Y. and SUDJIANTO, A. (2008). Blind kriging: A new method for developing metamodels. J. Mech. Des. 130 031102–1–8.
KAUFMAN, C. G., BINGHAM, D., HABIB, S., HEITMANN, K. and FRIEMAN, J. A. (2011). Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann. Appl. Stat. 5 2470–2492. MR2907123
KENNEDY, M. (2004). Description of the Gaussian process model used in GEM-SA. Technical report, Univ. Sheffield. Available at http://www.tonyohagan.co.uk/academic/GEM/.
LIM, Y. B., SACKS, J., STUDDEN, W. J. and WELCH, W. J. (2002). Design and analysis of computer experiments when the output is highly correlated over the input space. Canad. J. Statist. 30 109–126. MR1907680
LOEPPKY, J. L., MOORE, L. M. and WILLIAMS, B. J. (2010). Batch sequential designs for computer experiments. J. Statist. Plann. Inference 140 1452–1464. MR2592224
LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2009). Choosing the sample size of a computer experiment: A practical guide. Technometrics 51 366–376. MR2756473
MCMILLAN, N. J., SACKS, J., WELCH, W. J. and GAO, F. (1999). Analysis of protein activity data by Gaussian stochastic process models. J. Biopharm. Statist. 9 145–160.
MORRIS, M. D. and MITCHELL, T. J. (1995). Exploratory designs for computational experiments. J. Statist. Plann. Inference 43 381–402.
MORRIS, M. D., MITCHELL, T. J. and YLVISAKER, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics 35 243–255. MR1234641
NILSON, T. and KUUSK, A. (1989). A reflectance model for the homogeneous plant canopy and its inversion. Remote Sens. Environ. 27 157–167.
O'HAGAN, A. (1992). Some Bayesian numerical analysis.
In Bayesian Statistics, 4 (Peñíscola, 1991) (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 345–363. Oxford Univ. Press, New York. MR1380285
PICHENY, V., GINSBOURGER, D., RICHET, Y. and CAPLIN, G. (2013). Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 2–13. MR3038476
PRESTON, D. L., TONKS, D. L. and WALLACE, D. C. (2003). Model of plastic deformation for extreme loading conditions. J. Appl. Phys. 93 211–220.
RANJAN, P., HAYNES, R. and KARSTEN, R. (2011). A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data. Technometrics 53 366–378. MR2850469
SACKS, J., SCHILLER, S. B. and WELCH, W. J. (1989). Designs for computer experiments. Technometrics 31 41–47. MR0997669
SACKS, J., WELCH, W. J., MITCHELL, T. J. and WYNN, H. P. (1989). Design and analysis of computer experiments (with discussion). Statist. Sci. 4 409–435. MR1041765
SCHONLAU, M. and WELCH, W. J. (2006). Screening the input variables to a computer model via analysis of variance and visualization. In Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (A. Dean and S. Lewis, eds.) 308–327. Springer, New York.
STEIN, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
STYER, P., MCMILLAN, N., GAO, F., DAVIS, J. and SACKS, J. (1995). Effect of outdoor airborne particulate matter on daily death counts. Environ. Health Perspect. 103 490–497.
SURJANOVIC, S. and BINGHAM, D. (2015). Virtual library of simulation experiments: Test functions and datasets. Available at http://www.sfu.ca/~ssurjano.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MITCHELL, T. J. and MORRIS, M. D. (1992). Screening, predicting, and computer experiments. Technometrics 34 15–25.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MORRIS, M. D. and SCHONLAU, M. (1996). Response to James M. Lucas. Technometrics 38 199–203.
WEST, O. R., SIEGRIST, R.
L., MITCHELL, T. J. and JENKINS, R. A. (1995). Measurement error and spatial variability effects on characterization of volatile organics in the subsurface. Environ. Sci. Technol. 29 647–656.

Submitted to the Statistical Science

Supplement to "Analysis Methods for Computer Experiments: How to Assess and What Counts?"

Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. Welch
University of British Columbia and National Institute of Statistical Sciences

1. TEST FUNCTIONS (FAST CODES)

1.1 Borehole function

The output y is generated by

$$y = \frac{2\pi T_u (H_u - H_l)}{\log(r/r_w)\left[1 + \dfrac{2 L T_u}{\log(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l}\right]},$$

where the 8 input variables and their respective ranges of interest are as in Table 1.

Variable  Description (units)                          Range
r_w       radius of borehole (m)                       [0.05, 0.15]
r         radius of influence (m)                      [100, 5000]
T_u       transmissivity of upper aquifer (m²/yr)      [63070, 115600]
H_u       potentiometric head of upper aquifer (m)     [990, 1110]
T_l       transmissivity of lower aquifer (m²/yr)      [63.1, 116]
H_l       potentiometric head of lower aquifer (m)     [700, 820]
L         length of borehole (m)                       [1120, 1680]
K_w       hydraulic conductivity of borehole (m/yr)    [9855, 12045]

Table 1. Borehole function input variables, units, and ranges. All ranges are converted to [0, 1] for statistical modeling.

Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). National Institute of Statistical Sciences, Research Triangle Park, NC 27709, USA (e-mail: sacks@niss.org).
Research supported by the Natural Sciences and Engineering Research Council, Canada. We thank the referees, Associate Editor and Editor for suggestions that clarified and broadened the scope of the studies reported here.
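The borehole function of Section 1.1 of the supplement is cheap to evaluate, which is what makes repeat experiments feasible. The sketch below is a direct transcription of the formula, under the assumption that log denotes the natural logarithm; the function names `borehole` and `unscale` are hypothetical, and `unscale` simply maps a coded input on [0, 1] back to a range such as those in Table 1.

```python
import math

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Borehole output y for inputs given in their original (unscaled) units."""
    log_ratio = math.log(r / rw)  # natural log assumed
    denom = log_ratio * (1.0 + 2.0 * L * Tu / (log_ratio * rw ** 2 * Kw) + Tu / Tl)
    return 2.0 * math.pi * Tu * (Hu - Hl) / denom

def unscale(u, lo, hi):
    """Map a coded input u in [0, 1] back to the original range [lo, hi]."""
    return lo + u * (hi - lo)

# Illustrative evaluation at a point inside the Table 1 ranges.
y1 = borehole(rw=0.1, r=1000.0, Tu=80000.0, Hu=1000.0,
              Tl=90.0, Hl=750.0, L=1400.0, Kw=11000.0)
# The output is linear in the head difference Hu - Hl, so doubling it doubles y.
y2 = borehole(rw=0.1, r=1000.0, Tu=80000.0, Hu=1250.0,
              Tl=90.0, Hl=750.0, L=1400.0, Kw=11000.0)
```

In a designed experiment the design points would be generated on the unit cube and passed through `unscale` for each input before calling `borehole`.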
Variable  Description            Range
u_1       rate constant          [2 × 10^6, 2 × 10^7]
u_2       rate constant          5 × 10^−3 (fixed)
u_3       rate constant          1 × 10^−3 (fixed)
u_4       rate constant          2 × 10^−3 (fixed)
u_5       rate constant          8 (fixed)
u_6       rate constant          [3 × 10^−5, 3 × 10^−4]
u_7       rate constant          [0.3, 3]
u_8       rate constant          1 (fixed)
x         initial concentration  [1.0 × 10^−9, 1.0 × 10^−6]

Table 2. G-protein code input variables and ranges. All variables are transformed to log scales on [0, 1] for statistical modeling.

1.2 G-protein code

The differential equations generating the output y of the G-protein system dynamics are

$$\begin{aligned}
\dot{\eta}_1 &= -u_1 \eta_1 x + u_2 \eta_2 - u_3 \eta_1 + u_5,\\
\dot{\eta}_2 &= u_1 \eta_1 x - u_2 \eta_2 - u_4 \eta_2,\\
\dot{\eta}_3 &= -u_6 \eta_2 \eta_3 + u_8 (G_{tot} - \eta_3 - \eta_4)(G_{tot} - \eta_3),\\
\dot{\eta}_4 &= u_6 \eta_2 \eta_3 - u_7 \eta_4,\\
y &= (G_{tot} - \eta_3)/G_{tot},
\end{aligned}$$

where $\eta_1, \ldots, \eta_4$ are concentrations of 4 chemical species, $\dot{\eta}_1 \equiv \partial \eta_1 / \partial t$, etc., and $G_{tot} = 10000$ is the (fixed) total concentration of G-protein complex after 30 seconds.

The input variables in this system are described in Table 2. Only d = 4 inputs are varied: we model y as a function of log(x), log($u_1$), log($u_6$), log($u_7$).

1.3 PTW code

The Preston-Tonks-Wallace (PTW) model describes the plastic deformation of various metals over a range of strain rates and temperatures (Preston, Tonks, and Wallace, 2003). For our purposes the model contains d = 11 input variables (parameters), where three of these are physically based (temperature, strain rate, strain), and the remaining 8 inputs can be used to tune the model to match data from physical experiments. Additional information on the model and the calibration problem can be found in Fugate et al. (2005). The inputs are all scaled to the unit interval [0, 1].

2. DATA

2.1 Volcano code

The data for the 32 runs of the volcano code are in Table 3.

2.2 Arctic sea ice code

The data are available in the supplementary files ice-x.txt and ice-y.txt.

3. FURTHER RESULTS FOR RMSE

Here we give further results for normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$.
Run  Volume  BasalAngle  Height  logHeight  sqrtHeight
  1    9.64       16.48   43.72       1.65        6.61
  2   10.20       15.08  151.16       2.18       12.29
  3    9.83       13.95   68.12       1.84        8.25
  4   10.02       17.61  105.60       2.03       10.28
  5   10.58       16.20  341.39       2.53       18.48
  6   10.77       15.36  540.20       2.73       23.24
  7   10.39       14.23  195.45       2.29       13.98
  8   10.95       17.33  675.34       2.83       25.99
  9    9.55       12.83   38.58       1.60        6.21
 10   10.11       18.73  123.66       2.10       11.12
 11    9.73       19.86   58.17       1.77        7.63
 12    9.92       11.70   81.92       1.92        9.05
 13   10.48       13.11  278.09       2.45       16.68
 14   10.67       18.45  476.99       2.68       21.84
 15   10.30       19.58  168.18       2.23       12.97
 16   10.86       11.98  606.20       2.78       24.62
 17    9.36       14.52   21.93       1.36        4.68
 18    8.80       15.92    8.64       0.98        2.94
 19    9.17       17.05   17.63       1.27        4.20
 20    8.98       13.39   13.93       1.17        3.73
 21    8.42       14.80    0.45       0.16        0.67
 22    8.23       15.64    0.00       0.00        0.00
 23    8.61       16.77    2.39       0.53        1.55
 24    8.05       13.67    0.06       0.02        0.24
 25    9.45       18.17   27.26       1.45        5.22
 26    8.89       12.27   14.68       1.20        3.83
 27    9.27       11.14   30.43       1.50        5.52
 28    9.08       19.30   12.95       1.14        3.60
 29    8.52       17.89    0.08       0.03        0.28
 30    8.33       12.55    1.83       0.45        1.35
 31    8.70       11.42    9.83       1.03        3.14
 32    8.14       19.02    0.00       0.00        0.00

Table 3. Volcano code: inputs (Volume and BasalAngle) and output (Height, logHeight, or sqrtHeight).
[Figure 1: dot plots of normalized RMSE for the Gauss, PowerExp, Matern and Matern2 correlation functions; panels Constant, Select linear and Full linear for the 27-run OA, 27-run mLHD and 40-run mLHD.]

Fig 1. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

3.1 Borehole function

Figure 1 shows normalized RMSE results for models with a fitted nugget. Comparison with Figure 2 in the main article for no-nugget models shows little difference, except that fitting a nugget term gives a small increase in the frequency of relatively poor results for models with a full-linear regression.

3.2 PTW code

Figure 2 gives results for $e_{\mathrm{rmse,ho}}$ for three regression models and four correlation functions. It is produced using the methods in the main paper's Section 3. Figure 3 gives $e_{\mathrm{rmse,ho}}$ results for the Bayes-GEM-SA and CGP methods in the main paper's Sections 5.1 and 5.2. Neither of these figures shows practical differences from the various modeling strategies.

3.3 Nilson-Kuusk Model

Table 4 gives results for 150 training runs and 100 runs for the hold-out test set. The roles of the two data sets are switched relative to the main paper's Table 1. Again, Gauss is inferior, and PowerExp and Matérn are the best performers.

3.4 Arctic sea ice

There are 13 input variables and 157 runs of the code available. Repeat experiments were generated by sampling
n = 10d = 130 runs, leaving 27 holdout observations.

Figure 4 gives results for $e_{\mathrm{rmse,ho}}$ for three regression models and four correlation functions. The four correlation functions give similar results, but the full-linear regression model is inferior for all four output variables.

Figure 5 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The four strategies give similar results.

Figure 6 gives $e_{\mathrm{rmse,ho}}$ results for models with estimated nugget terms. Comparison with the results for the same models without a fitted nugget in Figure 4 shows no practical differences.

[Figure 2: dot plots of normalized RMSE for the Gauss, PowerExp, Matern and Matern-2 correlation functions; panels Constant, Select linear and Full linear.]

Fig 2. PTW code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

[Figure 3: dot plots of normalized RMSE for Gauss, PowerExp, Bayes and CGP.]

Fig 3. PTW code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

                   Correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.111   0.080      0.087      0.078
Select linear      0.113   0.080      0.091      0.079
Full linear        0.109   0.079      0.090      0.076
Quartic            0.104   0.079      0.088      0.078

Table 4. Nilson-Kuusk model: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. The experimental data are from a 150-run LHD, and the hold-out set is from a 100-run LHD.

[Figure 4: dot plots of normalized RMSE for the Gauss, PowerExp, Matern and Matern2 correlation functions; panels Constant, Select linear and Full linear for the outputs IceMass, IceArea, IceVelocity and RangeOfArea.]

Fig 4. Arctic sea ice code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot. Results are given for four output variables: ice mass, ice area, ice velocity, and range of area.

3.5 2-d example

Figure 7 gives results for the function in (5.1) from three sets of repeat experiments. The designs leading to Figure 13 in the main paper have $x_1$, $x_2$, or both $x_1$ and $x_2$ reflected within columns about the centers of their ranges to generate equivalent designs and new training data. The resulting $e_{\mathrm{rmse,ho}}$ values are plotted in Figure 7.

4. RESULTS FOR MAXIMUM ABSOLUTE ERROR

Here we give results for normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, defined as

$$e_{\max,\mathrm{ho}} = \frac{\max_{i=1,\ldots,N} \left| \hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)}) \right|}{\max_{i=1,\ldots,N} \left| \bar{y} - y(x_{\mathrm{ho}}^{(i)}) \right|}. \tag{4.1}$$

Again, the measure of error is relative to the performance of the trivial
przyixtorA+yCiasaft-sts vef. 2012/04/10 file. TSWLateliabTeadS000001.tel date. DeWeabef 4, 2015UbULmg]g aYhHcXg Fcf WcadihYf YldYf]aYbhgN giddLYaYbh 7Normalized RMSE0.00.20.40.60.81.0Gauss PowerExp Bayes CGPIceVelocity0.00.20.40.60.81.0Gauss PowerExp Bayes CGPRangeOfArea0.00.10.20.30.40.50.6IceMass0.00.10.20.30.40.50.6IceArea6ig 5B Arwtiw syu iwy woxyN bormulizyx holxout fagE oz pryxiwtion, Kamse;ho, zor Gugd(Wonst,Guuss), Gugd(Wonst, dowyrEfip), VuflysAGEaAgA, unx WGdB hhy luttyr two mythoxs ulso huvywonstunt rygryssion moxylsB hhy E5K runs uvuiluvly ury runxomlfl split F5 timys into EGD runszor tting unx FK holxAout runs to givy thy F5 vuluys oz Kamse;ho in u xot plotBNormalized RMSE0.00.51.01.5Gauss PowerExp Matern Matern2ConstantRangeOfArea0.00.51.01.5Gauss PowerExp Matern Matern2Select linearRangeOfArea0.00.51.01.5Gauss PowerExp Matern Matern2Full linearRangeOfArea0.00.51.01.5ConstantIceVelocity0.00.51.01.5Select linearIceVelocity0.00.51.01.5Full linearIceVelocity0.00.20.40.60.81.01.2ConstantIceArea0.00.20.40.60.81.01.2Select linearIceArea0.00.20.40.60.81.01.2Full linearIceArea0.00.20.40.60.81.01.2ConstantIceMass0.00.20.40.60.81.01.2Select linearIceMass0.00.20.40.60.81.01.2Full linearIceMass6ig 6B Arwtiw syu iwy woxyN bormulizyx holxout fagE oz pryxiwtion, Kamse;ho, zor Gugd with ullwomvinutions oz thryy rygryssion moxyls unx zour worrylution zunwtionsB Evyrfl moxyl hus u ttyxnuggyt tyrmB hhy E5K runs uvuiluvly ury runxomlfl split F5 timys into EGD runs zor tting unxFK holxAout runs to givy thy F5 vuluys oz Kamse;ho in u xot plotBiasaft-sts vef. 2012/04/10 file. TSWLateliabTeadS000001.tel date. 
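The normalized maximum absolute error in (4.1) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name and the toy numbers are hypothetical, $\bar{y}$ is assumed to be the mean of the training outputs, and the RMSE analogue is written by assuming the same normalization by the trivial predictor.

```python
import numpy as np

def normalized_holdout_errors(y_holdout, y_pred, y_train_mean):
    """Return (e_rmse_ho, e_max_ho) for one hold-out set.

    Both errors are divided by the corresponding error of the trivial
    predictor that returns y_train_mean everywhere, so a value near 1
    means essentially no predictive ability.
    """
    y_holdout = np.asarray(y_holdout, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    model_err = y_pred - y_holdout          # errors of the fitted model
    trivial_err = y_train_mean - y_holdout  # errors of the trivial predictor
    e_rmse = np.sqrt(np.mean(model_err**2)) / np.sqrt(np.mean(trivial_err**2))
    e_max = np.max(np.abs(model_err)) / np.max(np.abs(trivial_err))
    return e_rmse, e_max

# Toy illustration with made-up numbers (not data from the paper).
y_ho = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 3.0, 3.0]
e_rmse, e_max = normalized_holdout_errors(y_ho, y_hat, y_train_mean=2.0)
```

Predicting the training mean everywhere gives both measures equal to 1, which is the sense in which values above 1 indicate no predictive ability.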
Fig 7. 2-d function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, versus n for CGP (◦), GaSP(Const, PowerExp) (△) and GaSP(Const, PowerExp) with nugget (+). For each value of n the base mLHD has $x_1$ reflected around the center of its range (top row), $x_2$ reflected (middle row), or both $x_1$ and $x_2$ reflected (bottom row).

4.1 Borehole function

Figure 8 gives results for $e_{\max,\mathrm{ho}}$ for three regression models and four correlation functions. Figure 9 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\max,\mathrm{ho}}$ results are analogous to the main paper's Figures 2 and 8 for $e_{\mathrm{rmse,ho}}$; the same patterns emerge.

Figure 10 gives $e_{\max,\mathrm{ho}}$ results for models with estimated nugget terms. Compared with Figure 8, inclusion of a nugget makes little difference except for a small increase in frequency of poor results with select-linear and full-linear regression models.

Fig 8. Borehole function: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

Fig 9. Borehole function: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models.

Fig 10. Borehole function: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

4.2 G-protein

Figure 11 gives results for $e_{\max,\mathrm{ho}}$ for three regression models and four correlation functions. Figure 12 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\max,\mathrm{ho}}$ results are analogous to the main paper's Figures 4 and 9 for $e_{\mathrm{rmse,ho}}$. The same conclusions emerge: there is little practical difference between the various modeling strategies.

Fig 11. G-protein code: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of $e_{\max,\mathrm{ho}}$ in each dot plot.

Fig 12. G-protein code: Normalized holdout maximum error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models.

4.3 PTW code

Figure 13 gives results for $e_{\max,\mathrm{ho}}$ for three regression models and four correlation functions, and Figure 14 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\max,\mathrm{ho}}$ results are analogous to Figures 2 and 3 in Section 3.2 of this supplement relating to $e_{\mathrm{rmse,ho}}$. The $e_{\max,\mathrm{ho}}$ criterion leads to similar conclusions: there are no practical differences in performance.

Fig 13. PTW code: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

Fig 14. PTW code: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

4.4 Nilson-Kuusk model

The $e_{\max,\mathrm{ho}}$ results in Tables 5 and 6 are for 100 training runs and 150 training runs, respectively. They are analogous to Table 1 of the main paper and supplementary Table 4, which relate to $e_{\mathrm{rmse,ho}}$. Again, Gauss is inferior. The differences in performance between the other correlation functions are small, but PowerExp is the best performer 7 out of 8 times.

Figure 15 gives results for $e_{\max,\mathrm{ho}}$ for three regression models and four correlation functions, and Figure 16 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\max,\mathrm{ho}}$ results are analogous to the main paper's Figures 5 and 10, which relate to $e_{\mathrm{rmse,ho}}$. Figures 15 and 16 lead to the same conclusions: R = PowerExp performs best, R = Gauss is inferior, there is no advantage from any of the non-constant regression functions, and neither Bayes-GEM-SA nor CGP perform well here.

Table 5. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the hold-out set is from a 150-run LHD.

                      Correlation function
Regression model      Gauss    PowerExp    Matérn-2    Matérn
Constant              0.35     0.27        0.28        0.28
Select linear         0.35     0.26        0.27        0.24
Full linear           0.30     0.27        0.28        0.28
Quartic               0.29     0.28        0.29        0.31

Table 6. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for four regression models and four correlation functions. The experimental data are from a 150-run LHD, and the hold-out set is from a 100-run LHD.

                      Correlation function
Regression model      Gauss    PowerExp    Matérn-2    Matérn
Constant              0.21     0.16        0.18        0.17
Select linear         0.25     0.16        0.19        0.18
Full linear           0.23     0.16        0.21        0.17
Quartic               0.23     0.16        0.21        0.17

Fig 15. Nilson-Kuusk code: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

Fig 16. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

4.5 Volcano code

Figure 17 gives results for $e_{\max,\mathrm{ho}}$ for three regression models and four correlation functions, and Figure 18 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\max,\mathrm{ho}}$ results are analogous to the main paper's Figures 7 and 11, which relate to $e_{\mathrm{rmse,ho}}$. Figure 17 shows that the regression model makes little difference if combined with PowerExp, which always performs well here. For the $\sqrt{y}$ response, the quadratic model makes the performances of the other correlation functions similar to that of PowerExp, but there is no advantage over GaSP(Const, PowerExp). Figure 18 again shows for this application that R = Gauss and Bayes-GEM-SA are inferior, while PowerExp and CGP perform about the same.

Fig 17. Volcano model: Normalized holdout maximum absolute error, $e_{\max,\mathrm{ho}}$, for three regression models and two correlation functions.

Fig 18. Volcano model: Normalized holdout maximum absolute error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

4.6 Arctic sea ice

Figure 19 gives results for $e_{\max,\mathrm{ho}}$ for three regression models and four correlation functions, and Figure 20 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\max,\mathrm{ho}}$ results are analogous to supplementary Figures 4 and 5 in Section 3.4, which relate to $e_{\mathrm{rmse,ho}}$. Figure 19 shows there is little practical advantage to be gained from a non-constant regression model. The small improvement seen for the response IceVelocity is in the context of $e_{\max,\mathrm{ho}}$ often greater than 1 for all models, i.e., no predictive ability. Figure 20 shows that GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP give similar results.

Figure 21 gives $e_{\max,\mathrm{ho}}$ results for models with estimated nugget terms. Comparison with the results for the same models without nugget terms in Figure 19 shows no practical improvement.

Fig 19. Arctic sea ice code: Normalized holdout maximum error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

4.7 Friedman function

Figure 22 compares prediction accuracy of models with and without a nugget via the $e_{\max,\mathrm{ho}}$ criterion. As in Figure 14 in the main article, no advantage from fitting a nugget is apparent.
Fig 20. Arctic sea ice code: Normalized holdout maximum error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

Fig 21. Arctic sea ice code: Normalized holdout maximum error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.
Fig 22. Friedman function: Normalized holdout maximum error of prediction, $e_{\max,\mathrm{ho}}$, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\max,\mathrm{ho}}$ in a dot plot.

5. UNCERTAINTY OF PREDICTION

The results in Figure 23 compare the distribution of ACP values with and without a nugget term for the Friedman function in equation (5.2) of the main article. With n = 25, substantial under-coverage occurs frequently; fitting a nugget term makes little difference. For n = 50 there is a modest improvement in the distribution towards the nominal 95% coverage when a nugget is fitted. Substantial under-coverage is still frequent, however.

Fig 23. Friedman function: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models, each without or with a nugget term. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of ACP in a dot plot.

6. REGRESSION TERMS

How the inclusion of regression terms performs in extreme settings can be seen in the following simulations. Functions were generated with 5-dimensional input x from a GaSP with regression component $\mu = 10(x_1 + x_2 + x_3 + x_4 + x_5)$, variance 1, and one of four GPs with Gaussian correlation functions differing in their true parameter values. The vector of correlation parameter values was either: (1) A = (6.7232, 2.4992, 0.6752, 0.0992, 0.0032), which sum to 10; (2) B = (3.3616, 1.2496, 0.3376, 0.0496, 0.0016), half the values of A; (3) has constant elements with value 2; and (4) has constant elements with value 1.

With sample size n = 50 from a mLHD and 500 random test points, 25 sample functions from the 4 GPs were taken and normalized RMSEs calculated. The average $e_{\mathrm{rmse,ho}}$ for the four simulation scenarios for Const versus FL fits are in Table 7.

What can be drawn from these numbers is that the strong trend composed with less correlated (rougher) functions gives fitting the linear regression little practical advantage. There can be a relative advantage for FL when $e_{\mathrm{rmse,ho}}$ is very small anyway for both methods.

The trend here is very extreme: a total trend of magnitude 50 across the input space of all 5 variables, relative to a GP with variance 1. Other simulations with less extreme trend show even less difference between the results for the Const and FL models.

Table 7. Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, from fitting constant versus full-linear regression models. The data are generated using one of four vectors of true values of the correlation parameters.

Correlation     Average $e_{\mathrm{rmse,ho}}$ when fitting
parameters      GaSP(Const, PowerExp)    GaSP(FL, PowerExp)
A               0.04                     0.02
B               0.02                     0.01
θ = 2           0.08                     0.06
θ = 1           0.04                     0.02

REFERENCES

Fugate, M., Williams, B., Higdon, D., Hanson, K. M., Gattiker, J., Chen, S.-R., and Unal, C. (2005), "Hierarchical Bayesian analysis and the Preston-Tonks-Wallace model," Tech. Rep. LA-UR-05-3935, Los Alamos National Laboratory, Los Alamos, NM.

Preston, D. L., Tonks, D. L., and Wallace, D. C. (2003), "Model of Plastic Deformation for Extreme Loading Conditions," Journal of Applied Physics, 93, 211–220.

Statistical Science
2016, Vol. 31, No. 1, 40–60
DOI: 10.1214/15-STS531
© Institute of Mathematical Statistics, 2016

Analysis Methods for Computer Experiments: How to Assess and What Counts?

Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. Welch

Abstract. Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive, evidence-based assessment strategies are needed when comparing such modeling options. Otherwise, one is easily misled. Applying the strategies to several computer codes shows that a regression model more complex than a constant mean either has little impact on prediction accuracy or is an impediment. The choice of correlation function has modest effect, but there is little to separate two common choices, the power exponential and the Matérn, if the latter is optimized with respect to its smoothness. The applications presented here also provide no evidence that a composite of GPs provides practical improvement in prediction accuracy.
A limited comparison of Bayesian and empirical Bayes methods is similarly inconclusive. In contrast, we find that the effect of experimental design is surprisingly large, even for designs of the same type with the same theoretical properties.

Key words and phrases: Correlation function, Gaussian process, kriging, prediction accuracy, regression.

1. INTRODUCTION

Over the past quarter century a literature beginning with Sacks, Schiller and Welch (1989), Sacks et al. (1989, in this journal), Currin et al. (1991), and O’Hagan (1992) has grown to explore the design and analysis of computer experiments. Such an experiment is a designed set of runs of a mathematical model implemented as a computer code. Running the code with vector-valued input x gives the output value, y(x), assumed real-valued and deterministic: Running the code again with the same value for x does not change y(x). A design D is a set of n runs at n configurations of x, and an objective of primary interest is to use the data (inputs and outputs) to predict, via a statistical model, the output of the code at untried input values.

Hao Chen is a Ph.D. candidate and William J. Welch is Professor, Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Jason L. Loeppky is Associate Professor, Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). Jerome Sacks is Director Emeritus, National Institute of Statistical Sciences, Research Triangle Park, North Carolina 27709, USA (e-mail: sacks@niss.org).

The basic approach to the statistical model typically adopted starts by thinking of the output function, y(x), as being in a class of functions with prior distribution a Gaussian process (GP). The process has mean μ, which may be a regression function in x, and a covariance function, $\sigma^2 R$, from a specified parametric family.
Prediction is then made through the posterior mean given the data from the computer experiment, with some variations depending on whether a maximum likelihood (empirical Bayes) or fuller Bayesian implementation is used. While we partially address those variations in this article, our main thrusts are the practical questions faced by the user: What regression function and correlation function should be used? Does it matter?

We will call a GP model with specified regression and correlation functions a Gaussian stochastic process (GaSP) model. For example, GaSP(Const, PowerExp) will denote a constant (intercept only) regression and the power-exponential correlation function. The various regression models and correlation functions under consideration in this article will be defined in Section 2.

The rationale for the GaSP approach stems from the common situation that the dimension, d, of the space of inputs is not small, the function is fairly complex to model, and n is not large (code runs are expensive), hindering the effectiveness of standard methods (e.g., polynomials, splines, MARS) for producing predictions. The GaSP approach allows for a flexible choice of approximating models that adapts to the data and, more tellingly, has proved effective in coping with complex codes with moderately high d and scarce data. There is a vast literature treating an analysis in this context.

This article studies the impact on prediction accuracy of the particular model specifications commonly used, particularly μ, R, n, D. The goals are twofold. First, we propose a more evidence-based approach to distinguish what may be important from the unimportant and what may need further exploration.
Second, our application of this approach to various examples leads to some specific recommendations.

Assessing statistical strategies for the analysis of a computer experiment often mimics what is done for physical experiments: a method is proposed, applied in examples (usually few in number), and compared to other methods. Where possible, formal, mathematical comparisons are made; otherwise, criteria for assessing performance are empirical. An initial empirical study for a physical experiment is forced to rely on the specific data of that experiment and, while different analysis methods may be applied, all are bound by the single data set. There are limited opportunities to vary sample size or design.

Computer experiments provide richer opportunities. Fast-to-run codes enable a laboratory to investigate the relative merits of an analysis method. A whole spectrum of “replicate” experiments can be conducted for a single code, going beyond a thimbleful of “anecdotal” reports.

The danger of being misled by anecdotes can be seen in the following example. The borehole function [Morris, Mitchell and Ylvisaker, 1993, also given in the supplementary material (Chen et al., 2016)] is frequently used to illustrate methodology for computer experiments. A 27-run orthogonal array (OA) in the 8 input factors was proposed as a design, following Joseph, Hung and Sudjianto (2008). The 27 runs were analyzed via GaSP with a specific R (the Gaussian correlation function described in Section 2) and with two choices for μ: a simple constant (intercept) versus a method to select linear terms (SL), also described in Section 2. The details of these alternative models are not important for now, just that we are comparing two modeling methods. A set of 10,000 test points selected at random in the 8-dimensional input space was then predicted.
The resulting values of the root mean squared error (RMSE) measure defined in (2.5) of Section 2 were 0.141 and 0.080 for the constant and SL regression models, respectively.

With the SL approach reducing the RMSE by about 43% relative to a model with a constant mean, does this example provide powerful evidence for using regression terms in the GaSP model? Not quite. We replicated the experiment with the same choices of μ, R, n and the same test data, but the training data came from a theoretically equivalent 27-run OA design. (There are many equivalent OAs, e.g., by permuting the labels between columns of a fixed OA.) The RMSE values in the second analysis were 0.073 and 0.465 for the constant and SL models, respectively; SL produced about 6 times the error relative to a constant mean, so the evidence against using regression terms is even more powerful!

A broader approach is needed. The one we take is laid out starting in Section 2, where we specify the alternatives considered for the statistical model’s regression component and correlation function, and define the assessment measures to be used. We focus on the fundamental criterion of prediction accuracy (uncertainty assessment is discussed briefly in Section 6.1). In Section 3 we outline the basic idea of generating repeated data sets for any given example. The method is (exhaustively) implemented for several fast codes, including the aforementioned borehole function, along with some choices of n and D. In Section 4 the method is adapted to slower codes where data from only one experiment are available. Ideally, the universe of computer experiments is represented by a set of test problems and assessment criteria, as in numerical optimization (Dixon and Szegö, 1978); the codes and data sets investigated in this article and its supplementary material (Chen et al., 2016) are but an initial set. In Section 5 other modeling strategies are compared.
Finally, in Sections 6 and 7 we make some summarizing comments, conclusions and recommendations.

The article’s main findings are that regression terms are unnecessary or even sometimes an impediment, the choice of R matters for less smooth functions, and that the variability of performance of a method for the same problem over equivalent designs is alarmingly high. Such variation can mask the differences in analysis methods, rendering them unimportant and reinforcing the message that light evidence leads to flimsy conclusions.

2. STATISTICAL MODELS, EXPERIMENTAL DESIGN, AND ASSESSMENT

A computer code output is denoted by y(x) where the input vector, $x = (x_1, \ldots, x_d)$, is in the d-dimensional unit cube. As long as the input space is rectangular, transforming to the unit cube is straightforward and does not lose generality. Suppose n runs of the code are made according to a design D of input vectors $x^{(1)}, \ldots, x^{(n)}$ in $[0,1]^d$, resulting in data $y = (y(x^{(1)}), \ldots, y(x^{(n)}))^T$. The goal is to predict y(x) at untried x.

The GaSP approach uses a regression model and GP prior on the class of possible y(x). Specifically, y(x) is a priori considered to be a realization of
\[ Y(x) = \mu(x) + Z(x), \tag{2.1} \]
where μ(x) is the regression component, the mean of the process, and Z(x) has mean 0, variance $\sigma^2$, and correlation function R.

2.1 Choices for the Correlation Function

Let x and x′ denote two values of the input vector. The correlation between Z(x) and Z(x′) is denoted by R(x, x′).
Following common practice, R(x, x′) is taken to be a product of 1-d correlation functions in the distances $h_j = |x_j - x'_j|$, that is, $R(x, x') = \prod_{j=1}^{d} R_j(h_j)$. We mainly consider four choices for $R_j$:

• Power exponential (abbreviated PowerExp):
\[ R_j(h_j) = \exp(-\theta_j h_j^{p_j}), \tag{2.2} \]
with $\theta_j \ge 0$ and $1 \le p_j \le 2$ controlling the sensitivity and smoothness, respectively, of predictions of y with respect to $x_j$.

• Squared exponential or Gaussian (abbreviated Gauss): the special case of PowerExp in (2.2) with all $p_j = 2$.

• Matérn:
\[ R_j(h_j) = \frac{1}{\Gamma(\nu_j)\, 2^{\nu_j - 1}} (\theta_j h_j)^{\nu_j} K_{\nu_j}(\theta_j h_j), \tag{2.3} \]
where $\Gamma$ is the Gamma function, and $K_{\nu_j}$ is the modified Bessel function of order $\nu_j$. Again, $\theta_j \ge 0$ is a sensitivity parameter. The Matérn class was recommended by Stein (1999), Section 2.7, for its control via $\nu_j > 0$ of the differentiability of the correlation function with respect to $x_j$, and hence that of the prediction function. With $\nu_j = 1 + \frac{1}{2}$ or $\nu_j = 2 + \frac{1}{2}$, there are 1 or 2 derivatives, respectively. We call these subfamilies Matérn-1 and Matérn-2. Similarly, we use Matérn-0 and Matérn-∞ to refer to the cases $\nu_j = 0 + \frac{1}{2}$ and $\nu_j \to \infty$. They give the exponential family [$p_j = 1$ in (2.2)], with no derivatives, and Gauss, which is infinitely differentiable. Matérn-0, 1, 2 are closely related to linear, quadratic, and cubic splines. We believe that little would be gained by incorporating smoother kernels (but less smooth than the analytic Matérn-∞) in the study.
Consideration of the entire Matérn class for $\nu_j > 0$ is computationally cumbersome for the large numbers of experiments we will evaluate. Hence, what we call Matérn has $\nu_j$ optimized over the Matérn-0, Matérn-1, Matérn-2, and Matérn-∞ special cases, separately for each coordinate.

• Matérn-2: Some authors (e.g., Picheny et al., 2013) fix $\nu_j$ in the Matérn correlation function to give some differentiability. The Matérn-2 subfamily sets
The Matérn-2 subfamily sets $\nu_j = 2 + 1/2$ for all j, giving 2 derivatives.

More recently, other types of covariance functions have been recommended to cope with “apparently nonstationary” functions (e.g., Ba and Joseph, 2012). In Section 5.2 we will discuss the implications and characteristics of these options.

Given a choice for $R_j$ and hence R, we define the n × n matrix R with element i, j given by $R(x^{(i)}, x^{(j)})$ and the n × 1 vector $r(x) = (R(x, x^{(1)}), \ldots, R(x, x^{(n)}))^T$ for any x where we want to predict.

2.2 Choices for the Regression Component

We explore three main choices for μ:

• Constant (abbreviated Const): $\mu(x) = \beta_0$.
• Full linear (FL): $\mu(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d$, that is, a full linear model in all input variables.
• Select linear (SL): μ(x) is linear in the $x_j$ like FL but only includes selected terms.

The proposed algorithm for SL is as follows. For a given correlation family construct a default predictor with Const for μ. Decompose the predictive function (Schonlau and Welch, 2006) and identify all main effects that contribute more than 100/d percent to the total variation. These become the selected coordinates. Typically, large main effects have clear linear components. If a large effect lacks a linear component, little is lost by including a linear term. Inclusion of possible nonlinear trends can be pursued at considerable computational cost; we do not routinely do so, but in Section 4.1 we do include a regression model with nonlinear terms in $x_j$.

All candidate regression models considered can be written in the form

$$\mu(x) = \beta_0 + \beta_1 f_1(x) + \cdots + \beta_k f_k(x),$$

where the functions $f_j(x)$ are known. The maximum likelihood estimate (MLE) of $\beta = (\beta_0, \ldots, \beta_k)^T$ is the generalized least-squares estimate $\hat{\beta} = (F^T R^{-1} F)^{-1} F^T R^{-1} y$, where the n × (k + 1) matrix F has $(1, f_1(x^{(i)}), \ldots, f_k(x^{(i)}))$ in row i.
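As a rough illustration of the definitions above, the following sketch builds the matrix R from the PowerExp family (2.2) and evaluates the generalized least-squares formula for a Const and an FL regression. It is only a sketch under stated assumptions: the 5-run design, the toy output, the values of $\theta_j$ and $p_j$, and the helper names (`power_exp_corr`, `corr_matrix`, `solve`, `gls_beta`) are all hypothetical, the correlation parameters are fixed rather than estimated by maximum likelihood, and a plain Gaussian elimination stands in for the more careful linear algebra a real implementation would use.

```python
import math

def power_exp_corr(x, xp, theta, p):
    """Product PowerExp correlation (2.2): prod_j exp(-theta_j * |x_j - x'_j|**p_j)."""
    return math.exp(-sum(t * abs(a - b) ** q for a, b, t, q in zip(x, xp, theta, p)))

def corr_matrix(design, theta, p):
    """n x n matrix R with element (i, j) equal to R(x^(i), x^(j))."""
    return [[power_exp_corr(xi, xj, theta, p) for xj in design] for xi in design]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (illustrative only)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def gls_beta(F, R, y):
    """GLS estimate (F^T R^-1 F)^-1 F^T R^-1 y for an n x (k+1) matrix F."""
    n, k1 = len(F), len(F[0])
    Rinv_y = solve(R, y)
    # Rinv_F[b] holds R^-1 times column b of F
    Rinv_F = [solve(R, [row[b] for row in F]) for b in range(k1)]
    A = [[sum(F[i][a] * Rinv_F[b][i] for i in range(n)) for b in range(k1)]
         for a in range(k1)]
    rhs = [sum(F[i][a] * Rinv_y[i] for i in range(n)) for a in range(k1)]
    return solve(A, rhs)

# Hypothetical 5-run design in [0,1]^2 and a toy output (assumptions for illustration)
design = [(0.1, 0.9), (0.3, 0.2), (0.5, 0.6), (0.7, 0.1), (0.9, 0.8)]
y = [2.0 + 3.0 * x1 for x1, _ in design]        # exactly linear in x_1
theta, p = (5.0, 5.0), (2.0, 2.0)               # p_j = 2 is the Gauss special case
R = corr_matrix(design, theta, p)

F_const = [[1.0] for _ in design]               # Const: mu(x) = beta_0
F_full = [[1.0, x1, x2] for x1, x2 in design]   # FL: beta_0 + beta_1 x_1 + beta_2 x_2
beta_const = gls_beta(F_const, R, y)
beta_full = gls_beta(F_full, R, y)
```

Because the toy output lies exactly in the column space of the FL matrix F, `beta_full` recovers (2, 3, 0) up to rounding, whatever R is. For Const, `beta_const` holds the single coefficient given by the generalized least-squares estimate $\hat{\beta}$.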
This is also the Bayes posterior mean with a diffuse prior for β.

Early work (Sacks, Schiller and Welch, 1989) suggested that there is little to be gained (and maybe even something to lose) by using other than a constant term for μ. In addition, Lim et al. (2002) showed that polynomials can be exactly predicted with a minimal number of points using the Gauss correlation function, provided one lets $\theta_j \to 0$. These points underline the fact that a GP prior can capture the complexity in the output of the code, suggesting that deploying regression terms is superfluous. The evidence we report later supports this impression.

2.3 Prediction

Predictions are made as follows. For given data and values of the parameters in R, the mean of the posterior predictive distribution of y(x) is

$$\hat{y}(x) = \hat{\mu}(x) + r^T(x) R^{-1} (y - F\hat{\beta}), \tag{2.4}$$

where $\hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 f_1(x) + \cdots + \hat{\beta}_k f_k(x)$.

In practice, the other parameters, $\sigma^2$ and those in the correlation function R of equations (2.2) or (2.3), must be estimated too. Empirical Bayes replaces all of β, $\sigma^2$, and the correlation parameters in R by their MLEs (Welch et al., 1992). Various other Bayes-based procedures are available, including one fully Bayesian strategy described in Section 5.1. Our focus, however, is not on the particular Bayes-based methods employed but rather on assumptions about the form of the underlying GaSP model.

2.4 Design

For fast codes we typically use as a base design D an approximate maximin Latin hypercube design (mLHD, Morris and Mitchell, 1995), with improved low-dimensional space-filling properties (Welch et al., 1996). A few other choices, such as orthogonal arrays, are also investigated in Section 3.5, with a more comprehensive comparison of different classes of design the subject of another ongoing study. In any event, we show that even for a fixed class of design and fixed n there is substantial variation in prediction accuracy over equivalent designs.
Conclusions based on a single design choice can be misleading.

The effect of n on prediction accuracy was explored by Sacks, Schiller and Welch (1989) and more recently by Loeppky, Sacks and Welch (2009); its role in the comparison of competing alternatives for μ and R will also be addressed in Section 3.5.

2.5 Measures of Prediction Error

In order to compare various forms of the predictor in (2.4) built from the n code runs, $y = (y(x^{(1)}), \ldots, y(x^{(n)}))^T$, we need to set some standards. The gold standard is to assess the magnitude of prediction error via holdout (test) data, that is, in predicting N further runs, $y(x_{\mathrm{ho}}^{(1)}), \ldots, y(x_{\mathrm{ho}}^{(N)})$. The prediction errors are $\hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)})$ for $i = 1, \ldots, N$.

The performance measure we use is a normalized RMSE of prediction over the holdout data, denoted by $e_{\mathrm{rmse,ho}}$. The normalization is the RMSE using the (trivial) predictor $\bar{y}$, the mean of the data from the runs in the experimental design, the “training” set. Thus,

$$e_{\mathrm{rmse,ho}} = \frac{\sqrt{(1/N)\sum_{i=1}^{N}\bigl(\hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)})\bigr)^2}}{\sqrt{(1/N)\sum_{i=1}^{N}\bigl(\bar{y} - y(x_{\mathrm{ho}}^{(i)})\bigr)^2}}. \tag{2.5}$$

The normalization in the denominator puts $e_{\mathrm{rmse,ho}}$ roughly on [0,1] whatever the scale of y, with 1 indicating no better performance than $\bar{y}$. The criterion is related to $R^2$ in regression, but $e_{\mathrm{rmse,ho}}$ measures performance for a new test set and smaller values are desirable.

Similarly, worst-case performance can be defined as the normalized maximum absolute error. Results for this metric are reported in the supplementary material (Chen et al., 2016); the conclusions are the same. Other definitions (such as median absolute error) and other normalizations (such as those of Loeppky, Sacks and Welch, 2009) can be used, but without substantive effect on comparisons.

FIG. 1. Equivalent designs for d = 2 and n = 11: (a) a base mLHD design; (b) the base design with labels permuted between columns; and (c) the base design with values in the $x_1$ column reflected around $x_1 = 0.5$.

What are tolerable levels of error?
Clearly, these are application-specific so that tighter thresholds would be demanded, say, for optimization than for sensitivity analysis. For general purposes we take the rule of thumb that $e_{\mathrm{rmse,ho}} < 0.10$ is useful. For normalized maximum error it is plausible that the threshold could be much larger, say 0.25 or 0.30. These speculations are consequences of the experiences we document later, and are surely not the last word. The value of having thresholds is to provide benchmarks that enable assessing when differences among different methods or strategies are practically insignificant versus statistically significant.

3. FAST CODES

3.1 Generating a Reference Set for Comparisons

For fast codes under our control, large holdout sets can be obtained. Hence, in this section performance is measured through the use of a holdout (test) set of 10,000 points, selected as a random Latin hypercube design (LHD) on the input space.

With a fast code many designs and hence training data sets can easily be generated. We generate many equivalent designs by exploiting symmetries. For a simple illustration, Figure 1(a) shows a base mLHD for d = 2 and n = 11. Permuting the labels between columns of the design, that is, interchanging the $x_1$ and $x_2$ values as in Figure 1(b), does not change the interpoint distances or properties based on them such as the minimum distance used to construct mLHD designs. Similarly, reflecting the values within, say, the $x_1$ column around $x_1 = 0.5$ as in Figure 1(c), does not change the properties. In this sense the designs are equivalent.

In general, for any base design with good properties, there are $d!\,2^d$ equivalent designs and hence equivalent sets of training data available from permuting all column labels and reflecting within columns for a subset of inputs. For the borehole code mentioned in Section 1 and investigated more fully in Section 3.2, we have found that permuting between columns gives more variation in prediction accuracy than reflecting within columns.
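The symmetry argument can be checked numerically. The sketch below enumerates the $d!\,2^d$ column permutations and within-column reflections of a base design and confirms that the minimum interpoint distance is the same for every one of them. The 5-run Latin hypercube in three dimensions and the function names are hypothetical, chosen only to keep the example small; the generated designs need not all be distinct if the base design itself has symmetries.

```python
import itertools
import math

def equivalent_designs(design):
    """Yield the d! * 2^d designs obtained by permuting columns and reflecting x_j -> 1 - x_j."""
    d = len(design[0])
    for perm in itertools.permutations(range(d)):
        for flips in itertools.product((False, True), repeat=d):
            yield [tuple(1.0 - x[j] if flips[k] else x[j] for k, j in enumerate(perm))
                   for x in design]

def min_distance(design):
    """Smallest Euclidean distance between any two design points."""
    return min(math.dist(a, b) for a, b in itertools.combinations(design, 2))

# Hypothetical 5-run Latin hypercube in [0,1]^3 (an assumption, not a design from the article)
base = [(0.1, 0.5, 0.9), (0.3, 0.9, 0.3), (0.5, 0.1, 0.7), (0.7, 0.7, 0.1), (0.9, 0.3, 0.5)]

designs = list(equivalent_designs(base))
n_equiv = len(designs)                       # d! * 2^d = 6 * 8 for d = 3
base_min = min_distance(base)
all_same = all(abs(min_distance(D) - base_min) < 1e-12 for D in designs)
```

Each design in `designs` would give a different training set for a code that is not symmetric in its inputs, even though all of them share the same distance-based properties.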
Thus, in this article for nearly all examples we only permute between columns: for d = 4 all 24 possible permutations, and for d ≥ 5 a random selection of 25 different permutations. The example of Section 5.2 with d = 2 is the one exception. Because $y(x_1, x_2)$ is symmetric in $x_1$ and $x_2$, permuting between columns does not change the training data and we reflect within $x_1$ and/or $x_2$ instead.

The designs, generated data sets, and replicate analyses then serve as the reference set for a particular problem and provide the grounds on which variability of performance can be assessed. Given the setup of Section 2, we want to assess the consequences of making a choice from the menu of three regression models and four correlation functions.

3.2 Borehole Code

The first setting we will look at is the borehole code (Morris, Mitchell and Ylvisaker, 1993) mentioned in Section 1 and described in the supplementary material (Chen et al., 2016). It has served as a test bed in many contexts (e.g., Joseph, Hung and Sudjianto, 2008). We consider three different designs for the experiment: a 27-run, 3-level orthogonal array (OA), the same design used by Joseph, Hung and Sudjianto (2008); a 27-run mLHD; and a 40-run mLHD.

FIG. 2. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

There are 12 possible modeling combinations from the four correlation functions and three regression models outlined in Sections 2.1 and 2.2. The SL choice for μ here is always the term $x_1$.
Its main effect accounts for approximately 80% of the variation in predictions over the 8-dimensional input domain, and all analyses with a Const regression model choose $x_1$ and no other terms across all designs and all repeat experiments.

The top row of Figure 2 shows results with the 27-run OA design. For a given modeling strategy, 25 random permutations between columns of the 27-run OA lead to 25 repeat experiments (Section 3.1) and hence a reference set of 25 values of $e_{\mathrm{rmse,ho}}$ shown as a dot plot. The results are striking. Relative to a constant regression model, the FL regression model has empirical distributions of $e_{\mathrm{rmse,ho}}$ which are uniformly and substantially inferior, for all correlation functions. The SL regression also performs very poorly sometimes, but not always. To investigate the SL regression further, Figure 3 plots $e_{\mathrm{rmse,ho}}$ for individual repeat experiments, comparing the GaSP(Const, Gauss) and GaSP(SL, Gauss) models. Consistent with the anecdotal comparisons in Section 1, the plot shows that the SL regression model can give a smaller $e_{\mathrm{rmse,ho}}$ (this tends to happen when both methods perform fairly well), but the SL regression sometimes has very poor accuracy (almost 0.5 on the normalized RMSE scale). The top row of Figure 2 also shows that the choice of correlation function is far less important than the choice of regression model.

FIG. 3. Borehole code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for an SL regression model versus a constant regression model. The 25 points are from repeat experiments generated by 25 random permutations between columns of a 27-run OA.

The results for the 27-run mLHD in the middle row of Figure 2 show that design can have a large effect on accuracy: every analysis model performs better for the 27-run mLHD than for the 27-run OA. (Note the vertical scale is different for each row of the figure.)
The SL regression now performs about the same as the constant regression instead of occasionally much worse. There is no substantial difference in accuracy between the correlation functions. Indeed, the impact on accuracy of using the space-filling mLHD design instead of an OA is much more important than differences due to choice of the correlation function. The scaling in the middle row of plots somewhat mutes the considerable variation in accuracy still present over the 25 equivalent mLHD designs.

Increasing the number of runs to a 40-run mLHD (bottom row of Figure 2) makes a further substantial improvement in prediction accuracy. All 12 modeling strategies give $e_{\mathrm{rmse,ho}}$ values of about 0.02–0.06 over the 25 repeat experiments. Although there is little systematic difference among strategies, the variation over equivalent designs is still striking in a relative sense.

The strikingly poor results from the SL regression model (sometimes) and the FL model (all 25 repeats) in the top row of Figure 2 may be explained as follows. The design is a 27-run OA with only 3 levels. In a simpler context, Welch et al. (1996) illustrated nonidentifiability of the important terms in a GaSP model when the design is not space-filling. The SL regression and, even more so, the FL regression complicate an already flexible GP model. The difficulty in identifying the important terms is underscored by the fact that for all 25 repeat experiments from the base 27-run OA, a least squares fit of a simple linear regression model in $x_1$ (with no other terms) gives $e_{\mathrm{rmse,ho}}$ values close to 0.45. In other words, performance of GaSP(SL, Gauss) is sometimes similar to fitting just the important $x_1$ linear trend. The performance of GaSP(FL, Gauss) is highly variable and sometimes even worse than simple linear regression.

Welch et al. (1996) argued that model identifiability is, not surprisingly, connected with confounding in the design. The confounding in the base 27-run OA is complex.
While it is preserved in an overall sense by permutations between columns, how the confounding structure aligns with the important inputs among $x_1, \ldots, x_8$ will change across the 25 repeat experiments. Hence, the impact of confounding on nonidentifiability will vary.

In contrast, accuracy for the space-filling design in the middle row of Figure 2 is much better, even with only 27 runs. The SL regression model performs as accurately as the Const model (but no better); only the even more complex FL regression runs into difficulties. Again, this parallels the simpler Welch et al. (1996) example, where model identification was less problematic with a space-filling design and largely eliminated by increasing the sample size (the bottom row of Figure 2).

3.3 G-Protein Code

A second application, the G-protein code used by Loeppky, Sacks and Welch (2009) and described in the supplementary material (Chen et al., 2016), consists of a system of ODEs with 4-dimensional input.

Figure 4 shows $e_{\mathrm{rmse,ho}}$ for the three regression models (here SL selects $x_2$, $x_3$ as inputs with large effects) and four correlation functions. The results in the top row are for a 40-run mLHD. With d = 4, all 24 possible permutations between columns of a single base design lead to 24 data sets and hence 24 $e_{\mathrm{rmse,ho}}$ values. The dot plots in the top row have similar distributions across the 12 modeling strategies. As these empirical distributions have most $e_{\mathrm{rmse,ho}}$ values above 0.1, we try increasing the sample size with an 80-run mLHD. This has a substantial effect on accuracy, with all modeling methods giving $e_{\mathrm{rmse,ho}}$ values of about 0.06 or less.

Thus, for the G-protein application, none of the three choices for μ or the four choices for R matter. The variation among equivalent designs is alarmingly large in a relative sense, dwarfing any impact of modeling strategy.

3.4 PTW Code

Results for a third fast-to-run code, PTW (Preston, Tonks and Wallace, 2003), are in the supplementary material (Chen et al., 2016).
It has 11 inputs. We took an mLHD with n = 110 as the base design for the reference set. Prior information from engineers suggested incorporating linear $x_1$ and $x_2$ terms; SL also included $x_3$. No essential differences among μ or R emerged, but again there is a wide variation over equivalent designs.

FIG. 4. G-protein code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of $e_{\mathrm{rmse,ho}}$ in each dot plot.

3.5 Effect of Design

The results above document a significant, seldom recognized role of design: different, even equivalent, designs can have a greater effect on performance than the choice of μ and R. Moreover, without prior information, there is no way to assure that the choice of design will be one of the good ones in the equivalence class. Whether sequential experimentation, if feasible, can produce a more advantageous solution needs exploring.

The contrast between the results for borehole 27-run OA and the 27-run mLHD is a reminder of the importance of using designs that are space-filling, a quality widely appreciated. It is no secret that the choice of sample size, n, has a strong effect on performance as evidenced in the results for the 40-point mLHD contrasted with those for the 27-point mLHD. A more penetrating study of the effect of n was conducted by Loeppky, Sacks and Welch (2009). That FL does as well as Const and SL for the Borehole 40-point mLHD but performs badly for either of the two 27-point designs, and that none of the regression choices matter for the G-protein 40-point design or for the PTW 110-point design, suggests that “everything” works if n is large enough.

In summary, the choice of n and the choice of D given n can have huge effects.
But have we enough evidence that choice of μ matters only in limited contexts (such as small n or poor design) and that choice of R does not matter? So far we have dealt with only a handful of simple, fast codes; it is time to consider more complex codes.

4. SLOW CODES

For complex costly-to-run codes, generating substantial holdout data or output from multiple designs is infeasible. Similarly, for codes where we only have reported data, new output data are unavailable. Being forced to depend on what data are at hand leads us to rely on cross-validation methods for generating multiple designs and holdout sets, through which we can assess the effect of variability not solely in the designs but also, and inseparably, in the holdout target data. We know from Section 3 that variability due to designs is considerable, and it is no surprise that variability in holdout sets would lead to variability in predictive performance. The utility then of the created multiple designs and holdout sets is to compare the behavior of different modeling choices under varying conditions rather than relying on a single quantity attached to the original, unique data set.

Our approach is simply to delete a subset from the full data set, use the remaining data to produce a predictor, and calculate the (normalized) RMSE from predicting the output in the deleted (holdout) subset. Repeating this for a number (25 is what we use) of subsets gives some measure of variability and accuracy. In effect, we create 25 designs and corresponding holdout sets from a single data set and compare consequences arising from different choices for predictors.

The details described in the applications below differ somewhat depending on the particular application. In the example of Section 4.1 (a reflectance model for a plant canopy) there are, in fact, limited holdout data but no data from multiple designs.
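The delete-and-predict procedure can be sketched in a few lines. The example below is a minimal illustration under several assumptions, not the analysis used in the article: a cheap toy function stands in for a slow code, the predictor is (2.4) with a constant mean and Gauss correlation, the parameters $\theta_j$ are fixed by hand instead of being estimated by maximum likelihood, and a small diagonal `jitter` is added purely for numerical stability. All function names are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (illustrative only)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def gauss_corr(x, xp, theta):
    """Gauss correlation: PowerExp (2.2) with all p_j = 2."""
    return math.exp(-sum(t * (a - b) ** 2 for a, b, t in zip(x, xp, theta)))

def fit_predict(X, y, X_new, theta, jitter=1e-8):
    """Predictor (2.4) with constant mean and fixed theta (assumption: no MLE step)."""
    n = len(X)
    R = [[gauss_corr(X[i], X[j], theta) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    Rinv_1 = solve(R, [1.0] * n)
    Rinv_y = solve(R, y)
    beta0 = sum(Rinv_y) / sum(Rinv_1)                        # GLS estimate when F is a column of ones
    w = [ry - beta0 * r1 for ry, r1 in zip(Rinv_y, Rinv_1)]  # R^-1 (y - F beta)
    return [beta0 + sum(gauss_corr(x, X[i], theta) * w[i] for i in range(n)) for x in X_new]

def normalized_rmse(y_hat, y_true, y_bar):
    """e_rmse,ho of (2.5): holdout RMSE divided by the RMSE of the trivial predictor y_bar."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_hat, y_true)) / len(y_true))
    den = math.sqrt(sum((y_bar - b) ** 2 for b in y_true) / len(y_true))
    return num / den

def repeated_holdout(X, y, n_train, n_rep, theta, seed=0):
    """Randomly split the runs n_rep times; return one normalized RMSE per split."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    results = []
    for _ in range(n_rep):
        rng.shuffle(idx)
        tr, ho = idx[:n_train], idx[n_train:]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        y_hat = fit_predict(X_tr, y_tr, [X[i] for i in ho], theta)
        y_bar = sum(y_tr) / len(y_tr)                        # mean of the training data only
        results.append(normalized_rmse(y_hat, [y[i] for i in ho], y_bar))
    return results

# Toy stand-in for a data set of 32 runs: 25 for fitting, 7 held out, repeated 25 times
rng = random.Random(1)
X = [(rng.random(), rng.random()) for _ in range(32)]
y = [math.exp(x1) + x2 ** 2 for x1, x2 in X]
errors = repeated_holdout(X, y, n_train=25, n_rep=25, theta=(5.0, 5.0))
```

The list `errors` plays the role of one dot plot: its spread reflects variability from both the retained design and the holdout set, which cannot be separated in this scheme.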
In the volcano-eruption example of Section 4.2 and the sea-ice model of Section 4.3 holdout data are unavailable.

4.1 Nilson–Kuusk Model

An ecological code modeling reflectance for a plant canopy developed by Nilson and Kuusk (1989) was used by Bastos and O’Hagan (2009) to illustrate diagnostics for GaSP models. With 5-dimensional input, two computer experiments were performed: the first using a 150-run random LHD and the second with an independently chosen LHD of 100 points.

We carry out three studies based on the same data. The first treats the 100-point LHD as the experiment and the 150-point set as a holdout sample. The second study reverses the roles of the two LHDs. A third study, extending one done by Bastos and O’Hagan (2009), takes the 150-run LHD, augments it with a random sample of 50 points from the 100-point LHD, takes the resulting 200-point subset as the experimental design for training the statistical model, and uses the remaining N = 50 points from the 100-run LHD to form the holdout set in the calculation of $e_{\mathrm{rmse,ho}}$. By repeating the sampling of the 50 points 25 times, we get 25 replicate experiments, each with the same base 150 runs but differing with respect to the additional 50 training points and the holdout set.

TABLE 1
Nilson–Kuusk model: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the holdout set is from a 150-run LHD

                    Correlation function
Regression model    Gauss    PowerExp    Matérn-2    Matérn
Constant            0.116    0.099       0.106       0.102
Select linear       0.115    0.099       0.106       0.105
Full linear         0.110    0.099       0.104       0.104
Quartic             0.118    0.103       0.107       0.106

In addition to the linear regression choices we have studied so far, we also incorporate a regression model identified by Bastos and O’Hagan (2009): an intercept, linear terms in the inputs $x_1, \ldots, x_4$, and a quartic polynomial in $x_5$.
We label this model “Quartic.” All analyses are carried out with the output y on a log scale, based on standard diagnostics for GaSP models (Jones, Schonlau and Welch, 1998).

Table 1 summarizes the results of the study with the 100-point LHD as training data and the 150-point set as a holdout sample. It shows the choice for μ is immaterial: the constant mean is as good as any. For the correlation function, Gauss is inferior to the other choices, there is some evidence that Matérn is preferred to Matérn-2, and there is little difference between PowerExp and Matérn, the best performers. Similar results pertain when the 150-run LHD is used for training and the 100-run set for testing [Table 4 in the supplementary material (Chen et al., 2016)].

The dot plots in Figure 5 for the third study are even more striking in exhibiting the inferiority of R = Gauss and the lack of advantages for any of the nonconstant regression functions. The large variability in performance among designs and holdout sets is similar to that seen for the fast-code replicate experiments of Section 3. The perturbations of the experiment, from random sampling here, appear to provide a useful reference set for studying the behavior of model choices.

The large differences in prediction accuracy among the correlation functions, not seen in Section 3, deserve some attention. An overly smooth correlation function, the Gaussian, does not perform as well as the Matérn and power-exponential functions here. The latter two have the flexibility to allow needed rougher realizations. With the 150-run design and the constant regression model, for instance, the maximum of the log likelihood increases by about 50 when the power exponential is used instead of the Gaussian, with four of the $p_j$ in (2.2) taking values less than 2.

FIG. 5. Nilson–Kuusk code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

The estimated main effect (Schonlau and Welch, 2006) of $x_5$ in Figure 6 from the GaSP(Const, PowerExp) model shows that $x_5$ has a complex effect. It is also a strong effect, accounting for about 90% of the total variance of the predicted output over the 5-dimensional input space. Bastos and O’Hagan (2009) correctly diagnosed the complexity of this trend. Modeling it via a quartic polynomial in $x_5$ has little impact on prediction accuracy, however. The correlation structure of the GaSP is able to capture the trend implicitly just as well.

FIG. 6. Nilson–Kuusk code: Estimated main effect of $x_5$.

4.2 Volcano Model

A computer model studied by Bayarri et al. (2009) models the process of pyroclastic flow (a fast-moving current of hot gas and rock) from a volcanic eruption. The inputs varied are as follows: initial volume, $x_1$, and direction, $x_2$, of the eruption. The output, y, is the maximum (over time) height of the flow at a location. A 32-run data set provided by Elaine Spiller [different from that reported by Bayarri et al. (2009) but a similar application] is available in the supplementary material (Chen et al., 2016). Plotting the data shows the output has a strong trend in $x_1$, and putting a linear term in the GaSP surrogate, as modeled by Bayarri et al. (2009), is natural. But is it necessary?

The nature of the data suggests a transformation of y could be useful. The one used by Bayarri et al. (2009) is log(y + 1). Diagnostic plots (Jones, Schonlau and Welch, 1998) from using μ = Const and R = Gauss show that the log transform is reasonable, but a square-root transformation is better still.
We report analyses for both transformations.

The regression functions considered are Const, SL ($\beta_0 + \beta_1 x_1$), full linear, and quadratic ($\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2$), because the estimated effect of $x_1$ has a strong trend growing faster than linearly when looking at main effects from the surrogate obtained using $\sqrt{y}$ and GaSP(Const, PowerExp).

Analogous to the approach in Section 4.1, repeat experiments are generated by random sampling of 25 runs from the 32 available to comprise the design for model fitting. The remaining 7 runs form the holdout set. This is repeated 25 times, giving 25 $e_{\mathrm{rmse,ho}}$ values in the dot plots of Figure 7. The conclusions are much like those in Section 4.1: there is usually no need to go beyond μ = Const and PowerExp is preferred to Gauss. The failure of Gauss in the two “slow” examples considered thus far is surprising in light of the widespread use of the Gauss correlation function.

FIG. 7. Volcano model: Normalized holdout RMSE, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. The output variable is either $\sqrt{y}$ or log(y + 1).

4.3 Sea-Ice Model

The Arctic sea-ice model studied in Chapman et al. (1994) and in Loeppky, Sacks and Welch (2009) has 13 inputs, 4 outputs, and 157 available runs. The previous studies found modest prediction accuracy of GaSP(Const, PowerExp) surrogates for two of the outputs (ice mass and ice area) and poor accuracy for the other two (ice velocity and ice range). The question arises whether use of linear regression terms can increase accuracy to acceptable levels. Using a sampling process like that in Section 4.2 leads to the results in the supplementary material (Chen et al., 2016), where the answer is no: there is no help from μ = SL or FL, nor from changing R. Indeed, FL makes accuracy much worse sometimes.

5. OTHER MODELING STRATEGIES

Clearly, we have not studied all possible paths to GaSP modeling that one might take in a computer experiment.
In this section we address several others, some in more detail, and point to issues that could be addressed in the fashion described above.

5.1 Full Bayes

A number of full Bayes approaches have been employed in the literature. They go beyond the statistical formulation using a GP as a prior on the class of functions and assign prior distributions to all parameters, particularly those of the correlation function. For illustration, we examine the GEM-SA implementation of Kennedy (2004), which we call Bayes-GEM-SA. One key aspect is its reliance on R = Gauss. It also uses the following independent prior distributions: $\pi(\beta_0) \propto 1$, $\pi(\sigma^2) \propto 1/\sigma^2$, and $\theta_j$ exponential with rate 0.01 (Kennedy, 2004). When comparing its predictive accuracy with GaSP, μ = Const is used for all models.

For the borehole application, 25 repeat experiments are constructed for three designs, as in Section 3. The dot plots of $e_{\mathrm{rmse,ho}}$ in Figure 8 compare Bayes-GEM-SA with the Gauss and PowerExp methods in Section 3 based on MLEs of all parameters. (The method CGP and its dot plot are discussed in Section 5.2.) Bayes-GEM-SA is less accurate than either GaSP(Const, Gauss) or GaSP(Const, PowerExp).

FIG. 8. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are three base designs: a 27-run OA (left), a 27-run mLHD (middle), and a 40-run mLHD (right). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

Figure 9 similarly depicts results for the G-protein code. With the 40-run mLHD, the Bayesian and likelihood methods all perform about the same, giving only fair prediction accuracy.
Increasing n to 80 improves accuracy considerably for all methods (the scales of the two plots are very different), far outweighing any systematic differences between their accuracies.

Bayes-GEM-SA performs as well as the GaSP methods for G-protein, not so well for Borehole with n = 27 but adequately for n = 40. Turning to the slow codes in Section 4, a different message emerges. Figure 10 for the Nilson–Kuusk model is based on 25 repeat designs constructed as for Figure 5 with a base design of 150 runs plus 50 randomly chosen from 100. The distributions of $e_{\mathrm{rmse,ho}}$ for Bayes-GEM-SA and Gauss are similar, with PowerExp showing a clear advantage. Moreover, few of the Bayes $e_{\mathrm{rmse,ho}}$ values meet the 0.10 threshold, while all the GaSP(Const, PowerExp) $e_{\mathrm{rmse,ho}}$ values do. Bayes-GEM-SA uses the Gaussian correlation function, which performed relatively poorly in Section 4; the disadvantage carries over to the Bayesian method here.

FIG. 9. G-protein: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are two base designs: a 40-run mLHD (left); and an 80-run mLHD (right). For each base design, all 24 permutations between columns give the 24 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

FIG. 10. Nilson–Kuusk model: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

The results in Figure 11 for the volcano code are for the 25 repeat experiments described in Section 4. Here again PowerExp dominates Bayes and for the same reasons as for the Nilson–Kuusk model. For the $\sqrt{y}$ transformation, all but a few GaSP(Const, PowerExp) $e_{\mathrm{rmse,ho}}$ values meet the 0.10 threshold, in contrast to Bayes where all but a few do not.

These results are striking and suggest that Bayes methods relying on R = Gauss need extension. The “hybrid” Bayes-MLE approach employed by Bayarri et al.
(2009) estimates the correlation parameters inPowerExp by their MLEs, fixes them, and takes ob-jective priors for μ and σ 2. The mean of the predictivedistribution for a holdout output value gives the sameprediction as GaSP(Const, PowerExp). Whether other“hybrid” forms can be brought to bear effectively needsexploration.5.2 NonstationarityThe use of stationary GPs as priors in the faceof “nonstationary appearing” functions has attracteda measure of concern despite the fact that all func-tions with L2-derivative can be approximated usingPowerExp with enough data. Of course, there never areenough data. A relevant question is whether other pri-ors, even stationary ones different from those in Sec-tion 2, are better reflective of conditions and lead tomore accurate predictors.West et al. (1995) employed a GP prior for y(x)with two additive components: a smooth one for globaltrend and a rough one to model more local behavior.Recently, a similar “composite” GP (CGP) approachwas advanced by Ba and Joseph (2012). These authorsused two GPs, both with Gauss correlation. The firsthas correlation parameters θj in (2.2) constrained to besmall for gradually varying longer-range trend, whilethe second has larger values of θj for shorter-range be-havior. The second, local GP also has a variance thatdepends on x, primarily as a way to cope with apparentnonstationary behavior. Does this composite approachoffer an effective improvement to the simpler choicesof Section 2?We can apply CGP via its R library to the exam-ples studied in Sections 3 and 4, much as was justdone for Bayes-GEM-SA. The comparisons in Figure 8for the borehole function show that GaSP and CGPhave similar accuracy for the two 27-run designs. GaSPhas smaller error than CGP for the 40-run mLHD,FIG. 11. Volcano model: Normalized holdout RMSE of prediction, ermse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp),Bayes-GEM-SA, and CGP.ANALYSIS METHODS FOR COMPUTER EXPERIMENTS 53FIG. 12. 
2-d function: Holdout predictions versus true values of y from fitting (a) GaSP(Const, PowerExp) and (b) CGP.

though both methods achieve acceptable accuracy. The results in Figure 9 for G-protein show little practical difference between any of the methods, including CGP. For these two fast-code examples, there is negligible difference between CGP and the GaSP methods. For the models of Section 4, however, conclusions are somewhat different. GaSP(Const, PowerExp) is clearly much more accurate than CGP for the Nilson–Kuusk model (Figure 10) and roughly equivalent for the volcano code (Figure 11).

Ba and Joseph (2012) gave several examples assessing the performance of CGP. For reasons noted in Sections 6.2 and 6.4 we only look at two.

10-d example. The test function is

\[ y(x) = -\sum_{j=1}^{10} \sin(x_j) \left( \sin\left( j x_j^2 / \pi \right) \right)^{20} \qquad (0 < x_j < \pi). \]

With n = 100, Ba and Joseph (2012) obtained unnormalized RMSE values of about 0.72–0.74 for CGP and about 0.72–0.88 for a GaSP(Const, Gauss) model over 50 repeat experiments.

This example demonstrates a virtue of using a normalized performance measure. To compute the normalizing factor for RMSE in (2.5), we followed the process of Ba and Joseph (2012). Training data from an LHD with n = 100 gives ȳ, the trivial predictor. The normalization in the denominator of (2.5) is computed from N = 5000 random test points. Repeating this process 50 times gives normalization factors of 0.71–0.74, about the same as the raw RMSE values from CGP. Thus, CGP’s RMSE prediction accuracy is no better than that of the trivial ȳ predictor, and the default method is worse. Effective prediction here is unattainable by CGP or GaSP, and perhaps by any other approach with n = 100, because the function is so multi-modal; comparisons of CGP with other methods are meaningless in this example.

2-d example. For the function

\[ y(x_1, x_2) = \sin\bigl( 1/(x_1 x_2) \bigr) \qquad (0.3 \le x_j \le 1), \tag{5.1} \]

Ba and Joseph (2012) used a single design with n = 24 runs to compare CGP and GaSP(Const, Gauss).
Their results suggest that accuracy is poor for both methods, which we confirmed. For this example, following Ba and Joseph (2012), a holdout set of 5000 random points on [0.3, 1]² was used. For one mLHD with n = 24, we obtained e_rmse,ho values of 0.23 and 0.24 for CGP and GaSP(Const, PowerExp), respectively. Moreover, the diagnostic plot in Figure 12 shows how badly CGP (and GaSP) perform. Both methods grossly over-predict for some points in the holdout set, with GaSP worse in this respect. Both methods also have large errors from under-prediction, with CGP worse.

Does this result generalize? With only two input variables and a function that is symmetric in x1 and x2, repeat experiments cannot be generated by permuting the column labels of the design. Reflecting within the x1 and x2 columns is considered below, but first we created multiple experiments by increasing n.

We were also curious about how large n has to be before acceptable accuracy is attained. Comparisons between CGP and GaSP(Const, PowerExp) were made for n = 24, 25, ..., 48; for each value of n an mLHD was generated. The e_rmse,ho results plotted in Figure 13(a) show that accuracy is not improved substantially for either method as n increases. Indeed, GaSP(Const, PowerExp) gives variable accuracy, with larger values of n sometimes leading to worse accuracy than for n = 24. (The results in Figure 13 for a model with a nugget term are described in Section 5.3.)

FIG. 13. 2-d function: Normalized holdout RMSE of prediction, e_rmse,ho, versus n for CGP (◦), GaSP(Const, PowerExp) (), and GaSP(Const, PowerExp) with nugget (+).

To try to improve the accuracy, even larger sample sizes were tried. Figure 13(b) shows e_rmse,ho for n = 50, 60, ..., 200. Both methods continue to give poor accuracy until n reaches 70, after which there is a slow, unsteady improvement.
Curiously, GaSP now dominates.

Permuting between columns of a design does not generate distinct repeat experiments here, but reflecting either or both coordinates about the centers of their ranges maintains the distance properties of the design, that is, x1 on [0.3, 1] is replaced by x′1 = 1.3 − x1, and similarly x2. Results for the repeat experiments from reflecting within x1, x2, or both x1 and x2 are available in the supplementary material (Chen et al., 2016). They are similar to those in Figure 13.

Thus, CGP dominates here for n ≤ 60: it is inaccurate but less inaccurate than GaSP. For larger n, however, GaSP performs better, reaching the 0.10 threshold for e_rmse,ho before CGP does. This example demonstrates the potential pitfalls of comparing two methods with a single experiment. A more comprehensive analysis not only gives more confidence in the findings but may also be essential to provide a balanced overview of advantages and disadvantages.

These last two toy functions together with the results in Figures 8–11 show no evidence for the effectiveness of a composite GaSP approach. These findings are in accord with the earlier study by West et al. (1995).

5.3 Adding a Nugget Term

A nugget augments the GaSP model in (2.1) with an uncorrelated ε term, usually assumed to have a normal distribution with mean zero and constant variance σ_ε², independent of the correlated process Z(x). This changes the computation of R and r^T(x) in the conditional prediction (2.4), which no longer interpolates the training data. For data from physical experimentation or observation, augmenting a GaSP model in this way is natural to reflect random errors (e.g., Gao, Sacks and Welch, 1996; McMillan et al., 1999; Styer et al., 1995). A nugget term has also been widely used for statistical modeling of deterministic computer codes without random error.
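The loss of interpolation caused by a nugget can be illustrated with a minimal sketch; it is not the implementation used for the results in this article. It assumes a constant mean, a Gaussian correlation with a single fixed θ (rather than an MLE), one input variable, and a nugget expressed as the variance ratio δ = σ_ε²/σ² added to the diagonal of R. The names gauss_corr, gasp_predict, theta, and delta are hypothetical.

```python
import numpy as np

def gauss_corr(a, b, theta):
    """Gaussian correlation exp(-theta * (a - b)^2) between two 1-d point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-theta * d**2)

def gasp_predict(x_train, y_train, x_new, theta, delta=0.0):
    """Constant-mean GP predictor; delta is the assumed nugget-to-process variance ratio."""
    n = len(x_train)
    R = gauss_corr(x_train, x_train, theta) + delta * np.eye(n)
    ones = np.ones(n)
    # Generalized least squares estimate of the constant mean
    mu = ones @ np.linalg.solve(R, y_train) / (ones @ np.linalg.solve(R, ones))
    r = gauss_corr(x_new, x_train, theta)   # correlations with the training points
    return mu + r @ np.linalg.solve(R, y_train - mu)

x = np.linspace(0.0, 1.0, 6)
y = np.sin(2 * np.pi * x)                   # a deterministic toy "code"

# Largest discrepancy between fitted and observed outputs at the training points
max_gap_no_nugget = np.max(np.abs(gasp_predict(x, y, x, theta=20.0) - y))
max_gap_nugget = np.max(np.abs(gasp_predict(x, y, x, theta=20.0, delta=0.1) - y))
```

In this toy setting the no-nugget fit reproduces the training outputs to rounding error, whereas the fit with delta = 0.1 does not, so the predictor becomes a smoother.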
The reasons offered are that numerical stability is improved, so overcoming computational obstacles, and also that a nugget can produce better predictive performance or better confidence or credibility intervals. The evidence, both in the literature and presented here, suggests, however, that for deterministic functions the potential advantages of a nugget term are modest. More systematic methods are available to deal with numerical instability if it arises (Ranjan, Haynes and Karsten, 2011), adding a nugget does not convert a poor predictor into an acceptable one, and other factors may be more important for good statistical properties of intervals (Section 6.1). On the other hand, we also do not find that adding a nugget (and estimating it along with the other parameters) is harmful, though it may produce smoothers rather than interpolators. We now elaborate on these points.

A small nugget, that is, a small value of σ_ε², is often included to improve the numerical properties of R. For the space-filling initial designs in this article, however, Ranjan, Haynes and Karsten (2011) showed that ill-conditioning in a no-nugget GaSP model will only occur for low-dimensional x, high correlation, and large n. These conditions are not commonly met in initial designs for applications. For instance, none of the computations for this article failed due to ill-conditioning, and those computations involved many repetitions of experiments for the various functions and GaSP models. The worst conditioning occurred for the 2-d example in Section 5.2 with n = 200, but even here the condition numbers of about 10⁶ did not preclude reliable calculations. When a design is not space-filling, matrix ill-conditioning may indeed occur. For instance, a sequential design for, say, optimization or contour estimation (Bingham, Ranjan and Welch, 2014) could lead to runs close together in the x space, causing numerical problems.
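The sensitivity of conditioning to near-coincident runs can be sketched numerically. The example is an illustration under assumed settings only: a Gaussian correlation with a common θ = 50, a 5 × 4 grid standing in for a space-filling design, and a nugget of 10⁻⁶ added to the diagonal. The function gauss_corr_matrix and the variable names are hypothetical.

```python
import numpy as np

def gauss_corr_matrix(X, theta):
    """Gaussian correlation matrix with entries exp(-theta * squared distance)."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-theta * sq_dist)

# A well-spread 20-run design on the unit square (grid used purely for illustration)
g1, g2 = np.meshgrid(np.linspace(0.0, 1.0, 5), np.linspace(0.0, 1.0, 4))
X_spread = np.column_stack([g1.ravel(), g2.ravel()])

# Same design, but with the second run moved to nearly coincide with the first
X_clustered = X_spread.copy()
X_clustered[1] = X_clustered[0] + 1e-6

cond_spread = np.linalg.cond(gauss_corr_matrix(X_spread, theta=50.0))
cond_clustered = np.linalg.cond(gauss_corr_matrix(X_clustered, theta=50.0))
cond_with_nugget = np.linalg.cond(
    gauss_corr_matrix(X_clustered, theta=50.0) + 1e-6 * np.eye(len(X_clustered))
)
```

Under these assumptions the spread design gives a condition number close to 1, the near-duplicate run inflates it by many orders of magnitude, and the small nugget brings it back down to the order of 10⁶.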
If ill-conditioning does occur, however, the mathematical solution proposed by Ranjan, Haynes and Karsten (2011) is an alternative to adding a nugget.

A nugget term is also sometimes suggested to improve predictive performance. Andrianakis and Challenor (2012) showed mathematically, however, that with a nugget the RMSE of prediction can be as large as that of a least squares fit of just the regression component in (2.1). Our empirical findings, choosing the size of σ_ε² via its MLE, are similarly unsupportive of a nugget. For example, the 2-d function in (5.1) is hard to predict with a GaSP(Const, PowerExp) model (Figure 13), but the results with a fitted nugget term shown by a “+” symbol in Figure 13 are no different in practice from those of the no-nugget model.

Similarly, repeating the calculations leading to Figure 2 for the borehole function, but fitting a nugget term in all models, shows essentially no difference [the results with a nugget are available in Figure 1 of the supplementary material (Chen et al., 2016)]. The MLE of σ_ε² is either zero or very small relative to the variance of the correlated process: typically σ̂_ε²/σ̂² < 10⁻⁶. These findings are consistent with those of Ranjan, Haynes and Karsten (2011), who found for the borehole function and other applications that constraining the model fit to have at least a modest value of σ_ε² deteriorated predictive performance.

Another example, the Friedman function,

\[ y(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5, \tag{5.2} \]

with n = 25 runs, was used by Gramacy and Lee (2012) to illustrate potential advantages of including a nugget term. Their context (performance criteria, analysis method, and design) differs in all respects from ours. Our results in the top row of Figure 14 show that the GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with n = 25 have highly variable accuracy, with e_rmse,ho values no better and often much worse than 20%. The effect of the nugget is inconsequential.
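For readers wishing to experiment with this example, (5.2) is simple to code. The sketch assumes inputs on the unit cube [0, 1]⁵ and uses a random LHD as a stand-in for the mLHD base designs used in this article; friedman and random_lhd are hypothetical names, and no model fitting is attempted.

```python
import numpy as np

def friedman(X):
    """Friedman function (5.2) for an (N, 5) array of inputs."""
    X = np.asarray(X, dtype=float)
    return (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20.0 * (X[:, 2] - 0.5) ** 2
            + 10.0 * X[:, 3]
            + 5.0 * X[:, 4])

def random_lhd(n, d, rng):
    """Random Latin hypercube: one point in each of n equal strata per column."""
    strata = np.array([rng.permutation(n) for _ in range(d)]).T   # shape (n, d)
    return (strata + rng.uniform(size=(n, d))) / n

rng = np.random.default_rng(0)
X_design = random_lhd(25, 5, rng)      # n = 25 runs, as in the top row of Figure 14
y_design = friedman(X_design)

# Sanity check at a point where only the sine term contributes
y_check = friedman([[1.0, 0.5, 0.5, 0.0, 0.0]])[0]
```

Permuting the columns of X_design before evaluating friedman would give further, equivalent designs in the spirit of the repeat experiments described here.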
Increasing the sample size to n = 50 makes a dramatic improvement in prediction accuracy, but the effect of a nugget remains negligible.

The Gramacy and Lee (2012) results are not inconsistent with ours in that they did not report prediction accuracy for this example. Rather, their results relate to the role of the nugget in sometimes obtaining better uncertainty measures when a poor choice of correlation function is inadvertently made, a topic we return to in Section 6.1.

6. COMMENTS

6.1 Uncertainty of Prediction

As noted in Section 1, our attention is directed at prediction accuracy, the most compelling characteristic in practical settings. For example, where the objective is calibration and validation, the details of uncertainty, as distinct from accuracy, in the emulator of the computer model are absorbed (and usually swamped) by model uncertainties and measurement errors (Bayarri et al., 2007). But for specific predictions it is clearly important to have valid uncertainty statements.

Currently, a full assessment of the validity of emulator uncertainty quantification is unavailable. It has long been recognized that the standard error of prediction can be optimistic when MLEs of the parameters θj, pj, νj in the correlation functions of Section 2.1 are “plugged-in” because the uncertainty in the parameter values is not taken into account (Abt, 1999). Corrections proposed by Abt remain to be done for the settings in which they are applicable.

Bayes credible intervals with full Bayes methods carry explicit and valid uncertainty statements; hybrid methods using priors on some of the correlation parameters (as distinct from MLEs) may also have reliable credible intervals. But for properties such as actual coverage probability (ACP), the proportion of points in a test set with true response values covered by intervals of nominal (say) 95% confidence or credibility, the behavior is far from clear. Chen (2013) compared several Bayes methods with respect to coverage.
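ACP as just defined is straightforward to compute once predictions and standard errors are available for a test set. The sketch assumes symmetric intervals of the form prediction ± q × standard error, with q a standard normal quantile; the function name actual_coverage and the toy numbers are hypothetical.

```python
from statistics import NormalDist

def actual_coverage(y_true, y_pred, std_err, level=0.95):
    """Proportion of test points whose true value lies inside the nominal interval."""
    q = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level = 0.95
    covered = [abs(t - p) <= q * s for t, p, s in zip(y_true, y_pred, std_err)]
    return sum(covered) / len(covered)

# Toy holdout set: the third prediction misses by far more than its interval allows
acp = actual_coverage(
    y_true=[0.0, 1.0, 2.0, 3.0],
    y_pred=[0.1, 1.0, 2.5, 3.0],
    std_err=[0.1, 0.1, 0.1, 0.1],
)
```

Here three of the four true values are covered, so the ACP is 0.75 against a nominal 0.95, a toy instance of under-coverage.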
The results showed variability with respect to equivalent designs like that found above for accuracy, a troubling characteristic pointing to considerable uncertainty about the uncertainty.

FIG. 14. Friedman function: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_rmse,ho in a dot plot.

In Figure 15 we see some of the issues. It gives ACP results for the borehole and Nilson–Kuusk functions. The left-hand plot for borehole displays the anticipated under-coverage using plug-in estimates for the correlation parameters. (Confidence intervals here use n − 1 rather than n in the estimate of σ in the standard error and t_{n−1} instead of the standard normal.) PowerExp is slightly better than Gauss, and Bayes-GEM-SA has ACP values close to the nominal 95%. Surprisingly, the plot for the Nilson–Kuusk code on the right of Figure 15 paints a different picture. Plug-in with Gauss and Bayes-GEM-SA both show under-coverage, while plug-in PowerExp has near-ideal properties here. We speculate that the use of the Gauss correlation function by Bayes-GEM-SA is again suboptimal for the Nilson–Kuusk application, just as it was for prediction accuracy.

The supplementary material (Chen et al., 2016) compares models with and without a nugget in terms of coverage properties for the Friedman function in (5.2). The results show that the problem of substantial under-coverage seen in many of the replicate experiments is not solved by inclusion of a nugget term.
A modest improvement in the distribution of ACP values is seen, particularly for n = 50, an improvement consistent with the advantage seen in Table 1 of Gramacy and Lee (2012) from fitting a nugget term.

A more complete study is surely needed to clarify appropriate criteria for uncertainty assessment and how modeling choices may affect matters.

6.2 Extrapolation

GaSP based methods are interpolations so our findings are clearly limited to prediction in the space of the experiment. The design of the computer experiment should cover the region of interest, rendering extrapolation meaningless. If a new region of interest is found, for example, during optimization, the initial computer runs can be augmented; extrapolation can be used to delimit regions that have to be explored further. Of course, extrapolation is necessary in the situation of a new region and a code that can no longer be run. But then the question is how to extrapolate. Initial inclusion of linear or other regression terms may be more useful than just a constant, but it may also be useless, or even dangerous, unless the “right” extrapolation terms are identified. We suspect it would be wiser to examine main effects resulting from the application of GaSP and use them to guide extrapolation.

FIG. 15. Borehole and Nilson–Kuusk functions: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss), GaSP(Const, PowerExp), and Bayes-GEM-SA. For the borehole function, 25 random permutations between columns of a 40-run mLHD give the 25 values of ACP in a dot plot. For the Nilson–Kuusk function, 25 designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

6.3 Performance Criteria

We have focused almost entirely on questions of predictive accuracy and used RMSE as a measure.
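As a concrete sketch of the normalized measure, the ratio of a predictor's holdout RMSE to that of the trivial predictor ȳ can be coded as follows and applied to the 10-d test function of Section 5.2. This is an illustration under assumptions: training and holdout points are drawn uniformly at random rather than from an LHD, only the trivial predictor is evaluated (no GaSP or CGP model is fitted), and the names e_rmse_ho and ten_d_function are hypothetical.

```python
import numpy as np

def e_rmse_ho(y_holdout, y_pred, y_train):
    """Holdout RMSE of y_pred relative to that of the trivial predictor mean(y_train)."""
    rmse = np.sqrt(np.mean((y_pred - y_holdout) ** 2))
    rmse_trivial = np.sqrt(np.mean((np.mean(y_train) - y_holdout) ** 2))
    return rmse / rmse_trivial

def ten_d_function(X):
    """10-d test function of Section 5.2 for an (N, 10) array with 0 < x_j < pi."""
    j = np.arange(1, 11)
    return -np.sum(np.sin(X) * np.sin(j * X**2 / np.pi) ** 20, axis=1)

rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, np.pi, size=(100, 10))    # n = 100 random points, not an LHD
X_hold = rng.uniform(0.0, np.pi, size=(5000, 10))    # N = 5000 random test points
y_train, y_hold = ten_d_function(X_train), ten_d_function(X_hold)

# Normalizing factor: holdout RMSE of the trivial predictor
norm_factor = np.sqrt(np.mean((y_train.mean() - y_hold) ** 2))

# By construction the trivial predictor itself has normalized RMSE equal to 1
e_trivial = e_rmse_ho(y_hold, np.full_like(y_hold, y_train.mean()), y_train)
```

A raw RMSE close to norm_factor therefore corresponds to a normalized value near 1, that is, to a predictor no more accurate than ȳ, which is the point made for the 10-d example in Section 5.2.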
The supplementary material (Chen et al., 2016) defines and provides results for a normalized version of maximum absolute error, e_max,ho. Other computations we have done use the median of the absolute value of prediction errors, with normalization relative to the trivial predictor from the median of the training output data. These results are qualitatively the same as for e_rmse,ho: regression terms do not matter, and PowerExp is a reliable choice for R. For slow codes, analysis like in Section 4 but using e_max,ho has some limited value in identifying regions where predictions are difficult, the limitations stemming from a likely lack of coverage of subregions, especially at borders of the unit cube, where the output function may behave badly.

A common performance measure for slow codes uses leave-one-out cross-validation error to produce analogues of e_rmse,ho and e_max,ho, obviating the need for a holdout set. For fast codes, repeat experiments and the ready availability of a holdout set render cross-validation unnecessary, however. For slow codes with only one set of data available, the single assessment from leave-one-out cross-validation does not reflect the variability caused, for example, by the design. In any case, qualitatively similar conclusions pertain regarding regression terms and correlation functions.

6.4 More Examples

The examples we selected are codes that have been used in earlier studies. We have not incorporated 1-d examples; while instructive for pedagogical reasons, they have little presence in practice. Other applications we could have included (e.g., Gough and Welch, 1994) duplicate the specific conclusions we draw below. There are also “fabricated” test functions in the numerical integration and interpolation literature (Barthelmann, Novak and Ritter, 2000) and some specifically for computer experiments (Surjanovic and Bingham, 2015).
They exhibit characteristics sometimes similar to those in Section 5 (large variability in a corner of the space, a condition that inhibits and even prevents construction of effective surrogates) and sometimes no different than the examples in Section 3. Codes that are deterministic but with numerical errors could also be part of a diverse catalogue of test problems. Ideally performance metrics from various approaches would be provided to facilitate comparisons; the suite of examples that we employed is a starting point.

6.5 Designs

The variability in performance over equivalent designs is a striking phenomenon in the analyses of Section 3 and raises questions about how to cope with what seems to be unavoidable bad luck. Are there sequential strategies that can reduce the variability? Are there advantageous design types, more robust to arbitrary symmetries? For example, does it matter whether a random LHD, mLHD, or an orthogonal LHD is used? The latter question is currently being explored by the authors.

That design has a strong role is both unsurprising and surprising. It is not surprising that care must be taken in planning an experiment; it is surprising and perplexing that equivalent designs can lead to such large differences in performance that are not mediated by good analytic procedures.

6.6 Larger Sample Sizes

As noted in Section 1, our attention is on experiments where n is small or modest at most. With advances in computing power it becomes more feasible to mount experiments with larger values of n while, at the same time, more complex codes become feasible but only with limited n. Our focus continues to be on the latter and the utility of GaSP models in that context.

As n gets larger, Figure 2 illustrates that the differences in accuracy among choices of R and μ begin to vanish.
Indeed, it is not even clear that using GaSP models for large n is useful; standard function fitting methods such as splines may well be competitive and easier to compute. In addition, when n is large nonstationary behavior can become apparent and encourages variations in the GaSP methodology such as decomposing the input space (as in Gramacy and Lee, 2008) or using a complex μ together with a computationally more tractable R (as in Kaufman et al., 2011). Comparison of alternatives when n is large is yet to be considered.

6.7 Are Regression Terms Ever Useful?

Introducing regression terms is unnecessary in the examples we have presented; a heuristic rationale was given in Section 2.2. The supplementary material (Chen et al., 2016) reports a simulation study with realized functions generated as follows: (1) there are very large linear trends for all xj; and (2) the superimposed sample path from a 0-mean GP is highly nonlinear, that is, a GP with at least one θj ≫ 0 in (2.2). Even under such extreme conditions, the advantage of explicitly fitting the regression terms is limited to a relative (ratio of e_rmse,ho) advantage, with only small differences in e_rmse,ho; the presence of a large trend causes a large normalizing factor. Moreover, such functions are not the sort usually encountered in computer experiments. If they do show up, standard diagnostics will reveal their presence and allow effective follow-up analysis (see Section 7.2).

7. CONCLUSIONS AND RECOMMENDATIONS

This article addresses two types of questions. First, how should the analysis methodologies advanced in the study of computer experiments be assessed?
Second, what recommendations for modeling strategies follow from applying the assessment strategy to the particular codes we have studied?

7.1 Assessing Methods

We have stressed the importance of going beyond “anecdotes” in making claims for proposed methods. While this point is neither novel nor startling, it is one that is commonly ignored, often because the process of studying consequences under multiple conditions is more laborious. The borehole example (Figure 2), for instance, employs 75 experiments arising from 25 repeats of each of 3 base experiments.

When only one training set of data is available (as can be the case with slow codes), the procedures in Section 4, admittedly ad hoc, nevertheless expand the range of conditions. This brings more generalizability to claims about the comparative performances of competing procedures. The same strategy of creating multiple training/holdout sets is potentially useful in comparing competing methods in physical experiments as well.

The studies in the previous sections lead to the following conclusions:

• There is no evidence that GaSP(Const, PowerExp) is ever dominated by use of regression terms, or other choices of R. Moreover, we have found that the inclusion of regression terms makes the likelihood surface multi-modal, necessitating an increase in computational effort for maximum likelihood or Bayesian methods. This appears to be due to confounding between regression terms and the GP paths.
• Choosing R = Gauss, though common, can be unwise. The Matérn function optimized over a few levels of smoothness is a reasonable alternative to PowerExp.
• Design matters but cannot be controlled completely. Variability of performance from equivalent designs can be uncomfortably large.

There is not enough evidence to settle the following questions:

• Are full Bayes methods ever more accurate than GaSP(Const, PowerExp)?
Bayes methods relying on R = Gauss were seen to be sometimes inferior, and extensions to accommodate less smooth R such as PowerExp, perhaps via hybrid Bayes-MLE methods, are needed.
• Are composite GaSP methods ever better than GaSP(Const, PowerExp) in practical settings where the output exhibits nonstationary behavior?

7.2 Recommendations

Faced with a particular code and a set of runs, what should a scientist do to produce a good predictor? Our recommendation is to make use of GaSP(Const, PowerExp), use the diagnostics of Jones, Schonlau and Welch (1998) or Bastos and O’Hagan (2009), and assess whether the GaSP predictor is adequate. If found inadequate, then the scientist should expect no help from introducing regression terms and, until further evidence is found, neither from Bayes nor CGP approaches. Of course, trying such methods is not prohibited, but we believe that inadequacy of the GaSP(Const, PowerExp) model is usually a sign that more substantial action must be taken.

We conjecture that the best way to proceed in the face of inadequacy is to devise a second (or multiple) stage process, perhaps by added runs, or perhaps by carving the space into more manageable subregions as well as adding runs. How best to do this has been partially addressed, for example, by Gramacy and Lee (2008) and Loeppky, Moore and Williams (2010); effective methods constrained by limited runs are not apparent and in need of study.

ACKNOWLEDGMENTS

We thank the referees, Associate Editor, and Editor for suggestions that clarified and broadened the scope of the studies reported here.

The research of Loeppky and Welch was supported in part by grants from the Natural Sciences and Engineering Research Council, Canada.

SUPPLEMENTARY MATERIAL

Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?” (DOI: 10.1214/15-STS531SUPP; .zip).
This report (whatcounts-supp.pdf) contains further description of the test functions and data from running them, further results for root mean squared error, findings for maximum absolute error, further results on uncertainty of prediction, and details of the simulation investigating regression terms. Inputs to the Arctic sea-ice code: ice-x.txt. Outputs from the code: ice-y.txt.

REFERENCES

ABT, M. (1999). Estimating the prediction mean squared error in Gaussian stochastic processes with exponential correlation structure. Scand. J. Stat. 26 563–578. MR1734262
ANDRIANAKIS, I. and CHALLENOR, P. G. (2012). The effect of the nugget on Gaussian process emulators of computer models. Comput. Statist. Data Anal. 56 4215–4228. MR2957866
BA, S. and JOSEPH, V. R. (2012). Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 6 1838–1860. MR3058685
BARTHELMANN, V., NOVAK, E. and RITTER, K. (2000). High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math. 12 273–288. MR1768951
BASTOS, L. S. and O’HAGAN, A. (2009). Diagnostics for Gaussian process emulators. Technometrics 51 425–438. MR2756478
BAYARRI, M. J., BERGER, J. O., PAULO, R., SACKS, J., CAFEO, J. A., CAVENDISH, J., LIN, C.-H. and TU, J. (2007). A framework for validation of computer models. Technometrics 49 138–154. MR2380530
BAYARRI, M. J., BERGER, J. O., CALDER, E. S., DALBEY, K., LUNAGOMEZ, S., PATRA, A. K., PITMAN, E. B., SPILLER, E. T. and WOLPERT, R. L. (2009). Using statistical and computer models to quantify volcanic hazards. Technometrics 51 402–413. MR2756476
BINGHAM, D., RANJAN, P. and WELCH, W. J. (2014). Design of computer experiments for optimization, estimation of function contours, and related objectives. In Statistics in Action (J. F. Lawless, ed.) 109–124. CRC Press, Boca Raton, FL. MR3241971
CHAPMAN, W. L., WELCH, W. J., BOWMAN, K. P., SACKS, J. and WALSH, J. E. (1994). Arctic sea ice variability: Model sensitivities and a multidecadal simulation. J. Geophys. Res.
99C 919–935.
CHEN, H. (2013). Bayesian prediction and inference in analysis of computer experiments. Master’s thesis, Univ. British Columbia, Vancouver.
CHEN, H., LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2016). Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?” DOI:10.1214/15-STS531SUPP.
CURRIN, C., MITCHELL, T., MORRIS, M. and YLVISAKER, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. Amer. Statist. Assoc. 86 953–963. MR1146343
DIXON, L. C. W. and SZEGÖ, G. P. (1978). The global optimisation problem: An introduction. In Towards Global Optimisation (L. C. W. Dixon and G. P. Szegö, eds.) 1–15. North Holland, Amsterdam.
GAO, F., SACKS, J. and WELCH, W. J. (1996). Predicting urban ozone levels and trends with semiparametric modeling. J. Agric. Biol. Environ. Stat. 1 404–425. MR1807773
GOUGH, W. A. and WELCH, W. J. (1994). Parameter space exploration of an ocean general circulation model using an isopycnal mixing parameterization. J. Mar. Res. 52 773–796.
GRAMACY, R. B. and LEE, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103 1119–1130. MR2528830
GRAMACY, R. B. and LEE, H. K. H. (2012). Cases for the nugget in modeling computer experiments. Stat. Comput. 22 713–722. MR2909617
JONES, D. R., SCHONLAU, M. and WELCH, W. J. (1998). Efficient global optimization of expensive black-box functions. J. Global Optim. 13 455–492. MR1673460
JOSEPH, V. R., HUNG, Y. and SUDJIANTO, A. (2008). Blind kriging: A new method for developing metamodels. J. Mech. Des. 130 031102–1–8.
KAUFMAN, C. G., BINGHAM, D., HABIB, S., HEITMANN, K. and FRIEMAN, J. A. (2011). Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann. Appl. Stat. 5 2470–2492. MR2907123
KENNEDY, M. (2004).
Description of the Gaussian process model used in GEM-SA. Technical report, Univ. Sheffield. Available at http://www.tonyohagan.co.uk/academic/GEM/.
LIM, Y. B., SACKS, J., STUDDEN, W. J. and WELCH, W. J. (2002). Design and analysis of computer experiments when the output is highly correlated over the input space. Canad. J. Statist. 30 109–126. MR1907680
LOEPPKY, J. L., MOORE, L. M. and WILLIAMS, B. J. (2010). Batch sequential designs for computer experiments. J. Statist. Plann. Inference 140 1452–1464. MR2592224
LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2009). Choosing the sample size of a computer experiment: A practical guide. Technometrics 51 366–376. MR2756473
MCMILLAN, N. J., SACKS, J., WELCH, W. J. and GAO, F. (1999). Analysis of protein activity data by Gaussian stochastic process models. J. Biopharm. Statist. 9 145–160.
MORRIS, M. D. and MITCHELL, T. J. (1995). Exploratory designs for computational experiments. J. Statist. Plann. Inference 43 381–402.
MORRIS, M. D., MITCHELL, T. J. and YLVISAKER, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics 35 243–255. MR1234641
NILSON, T. and KUUSK, A. (1989). A reflectance model for the homogeneous plant canopy and its inversion. Remote Sens. Environ. 27 157–167.
O’HAGAN, A. (1992). Some Bayesian numerical analysis. In Bayesian Statistics, 4 (Peñíscola, 1991) (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 345–363. Oxford Univ. Press, New York. MR1380285
PICHENY, V., GINSBOURGER, D., RICHET, Y. and CAPLIN, G. (2013). Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 2–13. MR3038476
PRESTON, D. L., TONKS, D. L. and WALLACE, D. C. (2003). Model of plastic deformation for extreme loading conditions. J. Appl. Phys. 93 211–220.
RANJAN, P., HAYNES, R. and KARSTEN, R. (2011). A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data. Technometrics 53 366–378.
MR2850469
SACKS, J., SCHILLER, S. B. and WELCH, W. J. (1989). Designs for computer experiments. Technometrics 31 41–47. MR0997669
SACKS, J., WELCH, W. J., MITCHELL, T. J. and WYNN, H. P. (1989). Design and analysis of computer experiments (with discussion). Statist. Sci. 4 409–435. MR1041765
SCHONLAU, M. and WELCH, W. J. (2006). Screening the input variables to a computer model via analysis of variance and visualization. In Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (A. Dean and S. Lewis, eds.) 308–327. Springer, New York.
STEIN, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
STYER, P., MCMILLAN, N., GAO, F., DAVIS, J. and SACKS, J. (1995). Effect of outdoor airborne particulate matter on daily death counts. Environ. Health Perspect. 103 490–497.
SURJANOVIC, S. and BINGHAM, D. (2015). Virtual library of simulation experiments: Test functions and datasets. Available at http://www.sfu.ca/~ssurjano.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MITCHELL, T. J. and MORRIS, M. D. (1992). Screening, predicting, and computer experiments. Technometrics 34 15–25.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MORRIS, M. D. and SCHONLAU, M. (1996). Response to James M. Lucas. Technometrics 38 199–203.
WEST, O. R., SIEGRIST, R. L., MITCHELL, T. J. and JENKINS, R. A. (1995). Measurement error and spatial variability effects on characterization of volatile organics in the subsurface. Environ. Sci. Technol. 29 647–656.
Submitted to the Statistical Science

Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?”

Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. Welch
University of British Columbia and National Institute of Statistical Sciences

Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). National Institute of Statistical Sciences, Research Triangle Park, NC 27709, USA (e-mail: sacks@niss.org).

∗Research supported by the Natural Sciences and Engineering Research Council, Canada. We thank the referees, Associate Editor and Editor for suggestions that clarified and broadened the scope of the studies reported here.

1. TEST FUNCTIONS (FAST CODES)

1.1 Borehole function

The output y is generated by

\[ y = \frac{2 \pi T_u (H_u - H_l)}{\log(r/r_w) \left( 1 + \dfrac{2 L T_u}{\log(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l} \right)}, \]

where the 8 input variables and their respective ranges of interest are as in Table 1.

Table 1
Borehole function input variables, units, and ranges. All ranges are converted to [0, 1] for statistical modeling.

Variable  Description (units)                                 Range
r_w       radius of borehole (m)                              [0.05, 0.15]
r         radius of influence (m)                             [100, 5000]
T_u       transmissivity of upper aquifer (m²/yr)             [63070, 115600]
H_u       potentiometric head of upper aquifer (m)            [990, 1110]
T_l       transmissivity of lower aquifer (m²/yr)             [63.1, 116]
H_l       potentiometric head of lower aquifer (m)            [700, 820]
L         length of borehole (m)                              [1120, 1680]
K_w       hydraulic conductivity of borehole (m/yr)           [9855, 12045]
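The borehole function of Section 1.1 and the ranges of Table 1 can be coded directly. The following is a minimal sketch, assuming the ranges as read from Table 1 and a linear map from the [0, 1] scale back to the natural scale (the supplement states only that ranges are converted to [0, 1]); the names BOREHOLE_RANGES, borehole, and from_unit_cube are hypothetical.

```python
import math

# (lower, upper) ranges as read from Table 1, in the units given there
BOREHOLE_RANGES = {
    "rw": (0.05, 0.15),
    "r": (100.0, 5000.0),
    "Tu": (63070.0, 115600.0),
    "Hu": (990.0, 1110.0),
    "Tl": (63.1, 116.0),
    "Hl": (700.0, 820.0),
    "L": (1120.0, 1680.0),
    "Kw": (9855.0, 12045.0),
}

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Borehole output y for inputs on their natural scales."""
    log_ratio = math.log(r / rw)
    denom = log_ratio * (1.0 + 2.0 * L * Tu / (log_ratio * rw**2 * Kw) + Tu / Tl)
    return 2.0 * math.pi * Tu * (Hu - Hl) / denom

def from_unit_cube(u):
    """Map inputs scaled to [0, 1] back to natural ranges (linear map assumed)."""
    return {k: lo + u[k] * (hi - lo) for k, (lo, hi) in BOREHOLE_RANGES.items()}

center = from_unit_cube({k: 0.5 for k in BOREHOLE_RANGES})
y_center = borehole(**center)
# Raising the upper-aquifer head to its maximum increases the output
y_high_head = borehole(**{**center, "Hu": 1110.0})
```

A design on the unit cube can be passed row by row through from_unit_cube and borehole to generate training or holdout outputs.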
1.2 G-protein code

The differential equations generating the output y of the G-protein system dynamics are
\[
\begin{aligned}
\dot\eta_1 &= -u_1 \eta_1 x + u_2 \eta_2 - u_3 \eta_1 + u_5,\\
\dot\eta_2 &= u_1 \eta_1 x - u_2 \eta_2 - u_4 \eta_2,\\
\dot\eta_3 &= -u_6 \eta_2 \eta_3 + u_8 (G_{tot} - \eta_3 - \eta_4)(G_{tot} - \eta_3),\\
\dot\eta_4 &= u_6 \eta_2 \eta_3 - u_7 \eta_4,\\
y &= (G_{tot} - \eta_3)/G_{tot},
\end{aligned}
\]
where \eta_1, ..., \eta_4 are concentrations of 4 chemical species, \dot\eta_1 \equiv \partial\eta_1/\partial t, etc., and G_{tot} = 10000 is the (fixed) total concentration of G-protein complex after 30 seconds.
The input variables in this system are described in Table 2. Only d = 4 inputs are varied: we model y as a function of log(x), log(u1), log(u6), log(u7).

Table 2
G-protein code input variables and ranges. All variables are transformed to log scales on [0, 1] for statistical modeling.

Variable  Description            Range
u1        rate constant          [2 x 10^6, 2 x 10^7]
u2        rate constant          5 x 10^-3 (fixed)
u3        rate constant          1 x 10^-3 (fixed)
u4        rate constant          2 x 10^-3 (fixed)
u5        rate constant          8 (fixed)
u6        rate constant          [3 x 10^-5, 3 x 10^-4]
u7        rate constant          [0.3, 3]
u8        rate constant          1 (fixed)
x         initial concentration  [1.0 x 10^-9, 1.0 x 10^-6]

1.3 PTW code

The Preston-Tonks-Wallace (PTW) model describes the plastic deformation of various metals over a range of strain rates and temperatures (Preston, Tonks, and Wallace, 2003). For our purposes the model contains d = 11 input variables (parameters), where three of these are physically based (temperature, strain rate, strain), and the remaining 8 inputs can be used to tune the model to match data from physical experiments. Additional information on the model and the calibration problem can be found in Fugate et al. (2005). The inputs are all scaled to the unit interval [0, 1].

2. DATA

2.1 Volcano code

The data for the 32 runs of the volcano code are in Table 3.

2.2 Arctic sea ice code

The data are available in the supplementary files ice-x.txt and ice-y.txt.

3. FURTHER RESULTS FOR RMSE

Here we give further results for normalized holdout RMSE of prediction, e_rmse,ho.
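The criterion e_rmse,ho is defined in the main article rather than in this supplement. As a rough sketch only, the function below assumes it has the same structure as the maximum-error criterion given later in the supplement: the holdout RMSE of the model divided by the holdout RMSE of the trivial predictor ȳ, with ȳ taken here to be the mean of the training outputs. The name `normalized_rmse` and its argument names are hypothetical.

```python
import math


def normalized_rmse(y_holdout, y_predicted, y_train_mean):
    """Holdout RMSE of the predictions relative to the trivial predictor.

    Assumed form: sqrt(sum (yhat_i - y_i)^2) / sqrt(sum (ybar - y_i)^2),
    where ybar (y_train_mean) is assumed to be the training-output mean.
    """
    if len(y_holdout) != len(y_predicted):
        raise ValueError("holdout outputs and predictions differ in length")
    sq_err_model = sum((p - y) ** 2 for p, y in zip(y_predicted, y_holdout))
    sq_err_trivial = sum((y_train_mean - y) ** 2 for y in y_holdout)
    return math.sqrt(sq_err_model / sq_err_trivial)
```

Under this form a value near 0 indicates accurate prediction, and a value near 1 indicates no improvement over predicting ȳ everywhere.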
Table 3
Volcano code: inputs (Volume and BasalAngle) and output (Height, logHeight, or sqrtHeight).

Run  Volume  BasalAngle  Height  logHeight  sqrtHeight
1    9.64    16.48       43.72   1.65       6.61
2    10.20   15.08       151.16  2.18       12.29
3    9.83    13.95       68.12   1.84       8.25
4    10.02   17.61       105.60  2.03       10.28
5    10.58   16.20       341.39  2.53       18.48
6    10.77   15.36       540.20  2.73       23.24
7    10.39   14.23       195.45  2.29       13.98
8    10.95   17.33       675.34  2.83       25.99
9    9.55    12.83       38.58   1.60       6.21
10   10.11   18.73       123.66  2.10       11.12
11   9.73    19.86       58.17   1.77       7.63
12   9.92    11.70       81.92   1.92       9.05
13   10.48   13.11       278.09  2.45       16.68
14   10.67   18.45       476.99  2.68       21.84
15   10.30   19.58       168.18  2.23       12.97
16   10.86   11.98       606.20  2.78       24.62
17   9.36    14.52       21.93   1.36       4.68
18   8.80    15.92       8.64    0.98       2.94
19   9.17    17.05       17.63   1.27       4.20
20   8.98    13.39       13.93   1.17       3.73
21   8.42    14.80       0.45    0.16       0.67
22   8.23    15.64       0.00    0.00       0.00
23   8.61    16.77       2.39    0.53       1.55
24   8.05    13.67       0.06    0.02       0.24
25   9.45    18.17       27.26   1.45       5.22
26   8.89    12.27       14.68   1.20       3.83
27   9.27    11.14       30.43   1.50       5.52
28   9.08    19.30       12.95   1.14       3.60
29   8.52    17.89       0.08    0.03       0.28
30   8.33    12.55       1.83    0.45       1.35
31   8.70    11.42       9.83    1.03       3.14
32   8.14    19.02       0.00    0.00       0.00
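The two transformed outputs in Table 3 can be reproduced from the Height column. The sketch below assumes logHeight is the base-10 logarithm of Height + 1 and sqrtHeight is the square root of Height. The base-10 choice is inferred from the tabulated values (for example, run 1), not stated in the text, and the function names are hypothetical.

```python
import math


def log_height(height):
    """Assumed transform for the logHeight column: log10(Height + 1)."""
    return math.log10(height + 1.0)


def sqrt_height(height):
    """Transform for the sqrtHeight column: square root of Height."""
    return math.sqrt(height)


# Run 1 of Table 3 has Height = 43.72.
run1_log = round(log_height(43.72), 2)
run1_sqrt = round(sqrt_height(43.72), 2)
```

Adding 1 before taking logs keeps runs with zero height (runs 22 and 32) on the scale.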
3.1 Borehole function

Figure 1 shows normalized RMSE results for models with a fitted nugget. Comparison with Figure 2 in the main article for no-nugget models shows little difference, except that fitting a nugget term gives a small increase in the frequency of relatively poor results for models with a full-linear regression.

[Figure 1: dot plots of Normalized RMSE for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear) and base design (27-run OA, 27-run mLHD, 40-run mLHD).]
Fig 1. Borehole function: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_rmse,ho in a dot plot.

3.2 PTW code

Figure 2 gives results for e_rmse,ho for three regression models and four correlation functions. It is produced using the methods in the main paper's Section 3. Figure 3 gives e_rmse,ho results for the Bayes-GEM-SA and CGP methods in the main paper's Sections 5.1 and 5.2. Neither of these figures show practical differences from the various modeling strategies.

[Figure 2: dot plots of Normalized RMSE for Gauss, PowerExp, Matern and Matern-2; panels Constant, Select linear, Full linear.]
Fig 2. PTW code: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP with all combinations of three regression models and four correlation functions. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of e_rmse,ho in a dot plot.

[Figure 3: dot plots of Normalized RMSE for Gauss, PowerExp, Bayes and CGP.]
Fig 3. PTW code: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of e_rmse,ho in a dot plot.

3.3 Nilson-Kuusk model

Table 4 gives results for 150 training runs and 100 runs for the hold-out test set. The roles of the two data sets are switched relative to the main paper's Table 1. Again, Gauss is inferior, and PowerExp and Matérn are the best performers.

Table 4
Nilson-Kuusk model: Normalized holdout RMSE of prediction, e_rmse,ho, for four regression models and four correlation functions. The experimental data are from a 150-run LHD, and the hold-out set is from a 100-run LHD.

                   e_rmse,ho by correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.111   0.080      0.087      0.078
Select linear      0.113   0.080      0.091      0.079
Full linear        0.109   0.079      0.090      0.076
Quartic            0.104   0.079      0.088      0.078

3.4 Arctic sea ice

There are 13 input variables and 157 runs of the code available. Repeat experiments were generated by sampling n = 10d = 130 runs, leaving 27 holdout observations.
Figure 4 gives results for e_rmse,ho for three regression models and four correlation functions. The four correlation functions give similar results, but the full-linear regression model is inferior for all four output variables.
Figure 5 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The four strategies give similar results.
Figure 6 gives e_rmse,ho results for models with estimated nugget terms. Comparison with the results for the same models without a fitted nugget in Figure 4 shows no practical differences.

[Figure 4: dot plots of Normalized RMSE for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear) and output (RangeOfArea, IceVelocity, IceArea, IceMass).]
Fig 4. Arctic sea ice code: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP with all combinations of three regression models and four correlation functions. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of e_rmse,ho in a dot plot. Results are given for four output variables: ice mass, ice area, ice velocity, and range of area.

[Figure 5: dot plots of Normalized RMSE for Gauss, PowerExp, Bayes and CGP; panels IceVelocity, RangeOfArea, IceMass, IceArea.]
Fig 5. Arctic sea ice code: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of e_rmse,ho in a dot plot.

[Figure 6: same panel layout as Figure 4.]
Fig 6. Arctic sea ice code: Normalized holdout RMSE of prediction, e_rmse,ho, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of e_rmse,ho in a dot plot.

3.5 2-d example

Figure 7 gives results for the function in (5.1) from three sets of repeat experiments. The designs leading to Figure 13 in the main paper have x1, x2, or both x1 and x2 reflected within columns about the centers of their ranges to generate equivalent designs and new training data. The resulting e_rmse,ho values are plotted in Figure 7.

4. RESULTS FOR MAXIMUM ABSOLUTE ERROR

Here we give results for normalized holdout maximum absolute error of prediction, e_max,ho, defined as
\[
e_{\max,\mathrm{ho}} = \frac{\max_{i=1,\ldots,N} \left|\hat{y}(x^{(i)}_{\mathrm{ho}}) - y(x^{(i)}_{\mathrm{ho}})\right|}{\max_{i=1,\ldots,N} \left|\bar{y} - y(x^{(i)}_{\mathrm{ho}})\right|}. \tag{4.1}
\]
Again, the measure of error is relative to the performance of the trivial predictor, ȳ.
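The criterion in (4.1) can be sketched in a few lines of Python. The function and argument names are hypothetical, and ȳ is assumed here to be the mean of the training outputs (the supplement only calls it the trivial predictor).

```python
def normalized_max_error(y_holdout, y_predicted, y_train_mean):
    """Normalized holdout maximum absolute error, following (4.1).

    Numerator: largest absolute prediction error over the holdout points.
    Denominator: largest absolute error of the trivial predictor ybar,
    assumed to be the mean of the training outputs (y_train_mean).
    """
    if len(y_holdout) != len(y_predicted):
        raise ValueError("holdout outputs and predictions differ in length")
    max_err_model = max(abs(p - y) for p, y in zip(y_predicted, y_holdout))
    max_err_trivial = max(abs(y_train_mean - y) for y in y_holdout)
    return max_err_model / max_err_trivial
```

A value of 1 or more means the worst-case error is no better than that of predicting ȳ everywhere, which is how the supplement reads the IceVelocity results.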
[Figure 7: Normalized RMSE versus n; panels "Reflect x1", "Reflect x2" and "Reflect x1 and x2", each for n = 24 to 48 and n = 50 to 200.]
Fig 7. 2-d function: Normalized holdout RMSE of prediction, e_rmse,ho, versus n for CGP (◦), GaSP(Const, PowerExp) (△) and GaSP(Const, PowerExp) with nugget (+). For each value of n the base mLHD has x1 reflected around the center of its range (top row), x2 reflected (middle row), or both x1 and x2 reflected (bottom row).

4.1 Borehole function

Figure 8 gives results for e_max,ho for three regression models and four correlation functions. Figure 9 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting e_max,ho results are analogous to the main paper's Figures 2 and 8 for e_rmse,ho; the same patterns emerge.
Figure 10 gives e_max,ho results for models with estimated nugget terms. Compared with Figure 8, inclusion of a nugget makes little difference except for a small increase in frequency of poor results with select-linear and full-linear regression models.

[Figure 8: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear) and base design (27-run OA, 27-run mLHD, 40-run mLHD).]
Fig 8. Borehole function: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_max,ho in a dot plot.

[Figure 9: dot plots of Normalized Max Error for Gauss, PowerExp, Bayes and CGP; panels 27-run OA, 27-run mLHD, 40-run mLHD.]
Fig 9. Borehole function: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models.

[Figure 10: same panel layout as Figure 8.]
Fig 10. Borehole function: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_max,ho in a dot plot.

4.2 G-protein

Figure 11 gives results for e_max,ho for three regression models and four correlation functions. Figure 12 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting e_max,ho results are analogous to the main paper's Figures 4 and 9 for e_rmse,ho. The same conclusions emerge: there is little practical difference between the various modeling strategies.

[Figure 11: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear) and base design (40-run mLHD, 80-run mLHD).]
Fig 11. G-protein code: Normalized holdout maximum absolute error of prediction, e_max,ho, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of e_max,ho in each dot plot.

[Figure 12: dot plots of Normalized Max Error for Gauss, PowerExp, Bayes and CGP; panels 40-run mLHD and 80-run mLHD.]
Fig 12. G-protein code: Normalized holdout maximum error of prediction, e_max,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models.

4.3 PTW code

Figure 13 gives results for e_max,ho for three regression models and four correlation functions, and Figure 14 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting e_max,ho results are analogous to Figures 2 and 3 in Section 3.2 of this supplement relating to e_rmse,ho. The e_max,ho criterion leads to similar conclusions: there are no practical differences in performance.

[Figure 13: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels Constant, Select linear, Full linear.]
Fig 13. PTW code: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP with all combinations of three regression models and four correlation functions. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of e_max,ho in a dot plot.

[Figure 14: dot plots of Normalized Max Error for Gauss, PowerExp, Bayes and CGP.]
Fig 14. PTW code: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of e_max,ho in a dot plot.

4.4 Nilson-Kuusk model

The e_max,ho results in Tables 5 and 6 are for 100 training runs and 150 training runs, respectively. They are analogous to Table 1 of the main paper and supplementary Table 4, which relate to e_rmse,ho. Again, Gauss is inferior. The differences in performance between the other correlation functions are small, but PowerExp is the best performer 7 out of 8 times.
Figure 15 gives results for e_max,ho for three regression models and four correlation functions, and Figure 16 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting e_max,ho results are analogous to the main paper's Figures 5 and 10, which relate to e_rmse,ho. Figures 15 and 16 lead to the same conclusions: PowerExp performs best, Gauss is inferior, there is no advantage from any of the non-constant regression functions, and neither Bayes-GEM-SA nor CGP perform well here.

Table 5
Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, e_max,ho, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the hold-out set is from a 150-run LHD.

                   e_max,ho by correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.35    0.27       0.28       0.28
Select linear      0.35    0.26       0.27       0.24
Full linear        0.30    0.27       0.28       0.28
Quartic            0.29    0.28       0.29       0.31

Table 6
Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, e_max,ho, for four regression models and four correlation functions. The experimental data are from a 150-run LHD, and the hold-out set is from a 100-run LHD.

                   e_max,ho by correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.21    0.16       0.18       0.17
Select linear      0.25    0.16       0.19       0.18
Full linear        0.23    0.16       0.21       0.17
Quartic            0.23    0.16       0.21       0.17

[Figure 15: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels Constant, Select Linear, Full Linear, Quartic.]
Fig 15. Nilson-Kuusk code: Normalized holdout maximum absolute error of prediction, e_max,ho, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

[Figure 16: dot plots of Normalized Max Error for Gauss, PowerExp, Bayes and CGP.]
Fig 16. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

4.5 Volcano code

Figure 17 gives results for e_max,ho for three regression models and four correlation functions, and Figure 18 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting e_max,ho results are analogous to the main paper's Figures 7 and 11, which relate to e_rmse,ho.
Figure 17 shows that the regression model makes little difference if combined with PowerExp, which always performs well here. For the √y response, the quadratic model makes the performances of the other correlation functions similar to that of PowerExp, but there is no advantage over GaSP(Const, PowerExp). Figure 18 again shows for this application that Gauss and Bayes-GEM-SA are inferior, while PowerExp and CGP perform about the same.

[Figure 17: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear, Quadratic) and response (log(y+1), √y).]
Fig 17. Volcano model: Normalized holdout maximum absolute error, e_max,ho, for three regression models and two correlation functions.

[Figure 18: dot plots of Normalized Max Error for Gauss, PowerExp, Bayes and CGP; panels √y and log(y+1).]
Fig 18. Volcano model: Normalized holdout maximum absolute error of prediction, e_max,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

4.6 Arctic sea ice

Figure 19 gives results for e_max,ho for three regression models and four correlation functions, and Figure 20 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting e_max,ho results are analogous to supplementary Figures 4 and 5 in Section 3.4, which relate to e_rmse,ho. Figure 19 shows there is little practical advantage to be gained from a non-constant regression model. The small improvement seen for the response IceVelocity is in the context of e_max,ho often greater than 1 for all models, i.e., no predictive ability. Figure 20 shows that GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP give similar results.
Figure 21 gives e_max,ho results for models with estimated nugget terms. Comparison with the results for the same models without nugget terms in Figure 19 shows no practical improvement.

[Figure 19: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear) and output (RangeOfArea, IceVelocity, IceArea, IceMass).]
Fig 19. Arctic sea ice code: Normalized holdout maximum error of prediction, e_max,ho, for GaSP with all combinations of three regression models and four correlation functions. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of e_max,ho in a dot plot.

4.7 Friedman function

Figure 22 compares prediction accuracy of models with and without a nugget via the e_max,ho criterion. As in Figure 14 in the main article, no advantage from fitting a nugget is apparent.
[Figure 20: dot plots of Normalized Max Error for Gauss, PowerExp, Bayes and CGP; panels IceVelocity, RangeOfArea, IceMass, IceArea.]
Fig 20. Arctic sea ice code: Normalized holdout maximum error of prediction, e_max,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of e_max,ho in a dot plot.

[Figure 21: dot plots of Normalized Max Error for Gauss, PowerExp, Matern and Matern-2; panels by regression model (Constant, Select linear, Full linear) and output (RangeOfArea, IceVelocity, IceArea, IceMass).]
Fig 21. Arctic sea ice code: Normalized holdout maximum error of prediction, e_max,ho, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of e_max,ho in a dot plot.
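The repeated random splits described in the Arctic sea ice captions (157 runs split 25 times into 130 fitting runs and 27 hold-out runs) can be sketched as follows. This is only an illustration: the function name `random_splits`, the seed, and the use of uniform sampling without replacement are assumptions, not details given in the supplement.

```python
import random


def random_splits(n_runs=157, n_fit=130, n_repeats=25, seed=0):
    """Return a list of (fit_indices, holdout_indices) pairs.

    Each repeat samples n_fit run indices without replacement for
    fitting; the remaining runs form the hold-out set. The seed is an
    arbitrary choice made here for reproducibility.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        fit_idx = sorted(rng.sample(range(n_runs), n_fit))
        fit_set = set(fit_idx)
        holdout_idx = [i for i in range(n_runs) if i not in fit_set]
        splits.append((fit_idx, holdout_idx))
    return splits


splits = random_splits()
```

Each of the 25 pairs would then be used to fit a model on the fitting runs and to compute e_rmse,ho or e_max,ho on the hold-out runs, giving the 25 points in each dot plot.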
[Figure 22: dot plots of Normalized Max Error, No nugget versus Nugget; panels Gauss and PowerExp for n = 25 and n = 50.]
Fig 22. Friedman function: Normalized holdout maximum error of prediction, e_max,ho, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of e_max,ho in a dot plot.

5. UNCERTAINTY OF PREDICTION

The results in Figure 23 compare the distribution of ACP values with and without a nugget term for the Friedman function in equation (5.2) of the main article. With n = 25, substantial under-coverage occurs frequently; fitting a nugget term makes little difference. For n = 50 there is a modest improvement in the distribution towards the nominal 95% coverage when a nugget is fitted. Substantial under-coverage is still frequent, however.

[Figure 23: dot plots of Actual Coverage Probability, No nugget versus Nugget; panels Gauss and PowerExp for n = 25 and n = 50.]
Fig 23. Friedman function: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models, each without or with a nugget term. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of ACP in a dot plot.

6. REGRESSION TERMS

How the inclusion of regression terms performs in extreme settings can be seen in the following simulations. Functions were generated with 5-dimensional input x from a GaSP with regression component 10(x1 + x2 + x3 + x4 + x5), variance 1, and one of four GPs with Gaussian correlation functions differing in their true parameter values. The vector of correlation parameter values was either: (1) θ_A = (6.7232, 2.4992, 0.6752, 0.0992, 0.0032), which sum to 10; (2) θ_B = (3.3616, 1.2496, 0.3376, 0.0496, 0.0016), half the values of θ_A; (3) has constant elements with value 2; and (4) has constant elements with value 1.
With sample size n = 50 from a mLHD and 500 random test points, 25 sample functions from the 4 GPs were taken and normalized RMSEs calculated. The average e_rmse,ho for the four simulation scenarios for Const versus FL fits are in Table 7.

Table 7
Normalized holdout RMSE of prediction, e_rmse,ho, from fitting constant versus full-linear regression models. The data are generated using one of four vectors of true values of the correlation parameters.

Correlation   Average e_rmse,ho when fitting
parameters    GaSP(Const, PowerExp)   GaSP(FL, PowerExp)
θ_A           0.04                    0.02
θ_B           0.02                    0.01
θ_j = 2       0.08                    0.06
θ_j = 1       0.04                    0.02

What can be drawn from these numbers is that the strong trend composed with less correlated (rougher) functions gives fitting the linear regression little practical advantage. There can be a relative advantage for FL when e_rmse,ho is very small anyway for both methods.
The trend here is very extreme: a total trend of magnitude 50 across the input space of all 5 variables, relative to a GP with variance 1. Other simulations with less extreme trend show even less difference between the results for the Const and FL models.

REFERENCES

Fugate, M., Williams, B., Higdon, D., Hanson, K. M., Gattiker, J., Chen, S.-R., and Unal, C. (2005), "Hierarchical Bayesian analysis and the Preston-Tonks-Wallace model," Tech. Rep. LA-UR-05-3935, Los Alamos National Laboratory, Los Alamos, NM.
Preston, D. L., Tonks, D. L., and Wallace, D. C. (2003), "Model of Plastic Deformation for Extreme Loading Conditions," Journal of Applied Physics, 93, 211–220.

Arctic sea ice - 13 input variables
----------------------------------------------------------------------------
Run SnowAlbedo Shortwave MinLead OceanicHeat Snowfall Longwave AtmosDrag
----------------------------------------------------------------------------
1 0.5375000 1.4125 2.9250e-02 -2.687500 3.000000 0.27000 1.2538e-03
2 0.9187501 3.7750 1.9500e-02 -4.375000 2.312500 0.32500 7.2250e-04
3 0.8812500 2.5750 2.8500e-02 -4.250000 0.375000 0.48000 1.3813e-03
4 0.5687500 3.8500 1.2750e-02 -1.625000 3.500000 0.13000 1.0200e-03
5 0.8250001 1.4875 4.6500e-02 -3.812500 4.750000 0.25000 4.8880e-04
6 0.6125000 1.7500 1.6500e-02 -2.187500 2.437500 0.43000 1.4238e-03
7 0.9750000 1.3375 5.6250e-02 -1.937500 5.000000 0.50000 1.4025e-03
8 0.9562500 2.6500 7.5000e-04 -1.875000 2.125000 0.41500 1.7000e-04
9 0.9687500 3.6250 2.2500e-02 -2.750000 4.000000 0.39500 3.8250e-04
10 0.8437500 1.6375 1.2000e-02 -3.875000 1.000000 0.28500 2.9750e-04
11 0.9437500 2.0125 3.9000e-02 -0.562500 3.562500 0.11500 1.6150e-03
12 0.6875000 2.6125 2.4000e-02 -4.875000 1.625000 0.18000 5.9500e-04
13 0.5500000 2.3875 1.7250e-02 -4.500000 1.812500 0.25500 7.6500e-04
14 0.8125000 1.0375 1.4250e-02 -0.250000 0.062500 0.30500 1.5513e-03
15 0.9062500 1.9375 8.2500e-03 -4.562500 2.812500 0.46500 9.3500e-04
16 0.7375000 1.2250 6.0000e-03 -0.750000 4.562500 0.38000 1.9130e-04
17 0.6812500 3.5500 2.2500e-03 -1.187500 3.250000 0.24000 6.3800e-05
18 0.7875000 1.0750 2.4750e-02 -3.687500 2.062500 0.18500 1.5938e-03
19 0.8625000 3.1000 2.1750e-02 -3.125000 3.750000 0.35000 1.0413e-03
20 0.9375000 1.5250 2.0250e-02 0.000000 4.375000 0.40500 8.9250e-04
21 0.6625000 2.2375 5.2500e-03 -3.750000 2.937500 0.32000 2.3380e-04
22 0.5937500 1.3750 3.0750e-02 -2.125000 4.312500 0.34000 9.9880e-04
23 0.7625001 2.8375 3.3750e-02 -1.125000 2.000000 0.13500 1.1688e-03
24 0.6000000 2.4250 1.8000e-02 -2.562500 4.812500 0.45000 1.0838e-03
25 0.6750000 1.5625 2.7750e-02 -1.562500 2.687500 0.10500 1.6363e-03
26 0.5250000 2.0500 9.7500e-03 -0.312500 0.687500 0.42500 4.0380e-04
27 0.6937500 3.4750 2.3250e-02 -3.062500 0.250000 0.22000 5.7380e-04
28 0.5125000 3.8125 3.7500e-02 -0.625000 3.187500 0.30000 1.4880e-04
29 0.7937501 2.9500 4.3500e-02 -2.437500 3.312500 0.12500 1.1475e-03
30 0.7187500 2.1625 4.9500e-02 -2.625000 2.250000 0.44000 9.1380e-04
31 0.5562500 1.0000 4.1250e-02 -3.937500 2.375000 0.49000 5.5250e-04
32 0.5000000 3.9250 3.4500e-02 -0.187500 0.625000 0.48500 1.5088e-03
33 0.6187500 2.2750 5.9250e-02 -1.375000 2.750000 0.16000 9.5630e-04
34 0.8187500 1.8625 7.5000e-03 -2.312500 3.687500 0.47000 1.1900e-03
35 0.9812501 1.7125 4.8000e-02 -4.312500 1.687500 0.14000 1.3175e-03
36 0.5187500 1.8250 4.0500e-02 -0.437500 3.125000 0.34500 2.5500e-04
37 0.6500000 3.9625 0.0000e+00 -0.125000 3.625000 0.31000 8.0750e-04
38 0.9250000 1.2625 4.5000e-03 -2.062500 4.875000 0.19500 5.1000e-04
39 0.8500000 3.2125 1.5000e-02 -4.062500 2.562500 0.26000 1.7000e-03
40 0.7500000 2.8750 4.2000e-02 -3.000000 4.687500 0.31500 4.6750e-04
41 0.8937500 3.4375 3.0000e-03 -3.375000 1.437500 0.45500 6.1630e-04
42 0.7250000 2.8000 4.7250e-02 -1.437500 3.875000 0.21000 1.4450e-03
43 0.5812500 1.1500 2.6250e-02 -4.625000 4.250000 0.21500 7.4380e-04
44 0.9500001 2.7625 6.7500e-03 -4.437500 3.375000 0.33000 6.8000e-04
45 0.9625000 3.2500 1.8750e-02 -0.687500 0.312500 0.36000 2.1300e-05
46 0.8000000 1.3000 2.1000e-02 -0.062500 0.187500 0.20500 1.2325e-03
47 0.5437500 2.2000 5.7000e-02 -4.750000 0.000000 0.27500 2.7630e-04
48 0.7312500 3.3250 5.8500e-02 -3.187500 1.250000 0.29500 1.0625e-03
49 0.5875000 4.0000 6.0000e-02 -4.812500 4.937500 0.38500 3.1880e-04
50 0.9312500 3.4000 1.5000e-03 -0.375000 1.062500 0.33500 1.4875e-03
51 0.9125000 3.5125 2.5500e-02 -3.625000 0.562500 0.41000 1.1050e-03
52 0.7000000 2.7250 3.1500e-02 -0.812500 1.187500 0.17500 8.5000e-05
53 0.7125000 1.9750 5.7750e-02 -1.062500 2.625000 0.37500 1.5725e-03
54 0.6375000 1.1875 3.0000e-02 -0.500000 2.187500 0.28000 1.2750e-03
55 0.7687500 3.2875 4.5750e-02 -2.812500 1.937500 0.36500 4.2500e-05
56 0.5625000 2.3125 5.1000e-02 -3.437500 1.750000 0.15000 1.3600e-03
57 0.8875001 1.9000 1.3500e-02 -5.000000 1.562500 0.26500 3.6130e-04
58 0.7562500 3.6625 5.3250e-02 -1.750000 1.125000 0.42000 8.2880e-04
59 0.9875000 3.0625 1.0500e-02 -2.375000 0.750000 0.11000 4.2500e-04
60 0.6062500 1.1125 5.0250e-02 -1.250000 0.125000 0.23000 1.2113e-03
61 0.7062500 2.3500 3.3000e-02 -4.125000 3.812500 0.15500 9.7750e-04
62 0.8375000 2.6875 3.2250e-02 -3.312500 2.875000 0.16500 1.1263e-03
63 0.7812500 3.1375 1.1250e-02 -2.250000 3.937500 0.47500 3.4000e-04
64 0.6562500 1.7875 3.6750e-02 -3.562500 3.437500 0.43500 1.0630e-04
65 1.0000000 3.7375 5.1750e-02 -4.937500 1.312500 0.17000 1.6788e-03
66 0.8562501 2.5375 2.7000e-02 -0.937500 1.875000 0.23500 1.2750e-04
67 0.8062500 2.0875 4.8750e-02 -4.687500 0.875000 0.22500 8.7130e-04
68 0.5062500 2.5000 5.5500e-02 -1.812500 0.937500 0.19000 6.5880e-04
69 0.6437500 3.5875 5.4750e-02 -4.187500 4.125000 0.39000 1.2963e-03
70 0.6250000 2.4625 5.4000e-02 -1.500000 1.500000 0.35500 1.4663e-03
71 0.7437500 3.8875 3.7500e-03 -3.250000 1.375000 0.40000 4.4630e-04
72 0.5312500 1.4500 3.5250e-02 -3.500000 4.187500 0.49500 7.8620e-04
73 0.5750000 1.6000 3.8250e-02 -4.000000 4.437500 0.20000 7.0130e-04
74 0.8312500 3.3625 1.5750e-02 -2.875000 4.500000 0.24500 1.5300e-03
75 0.9000000 2.1250 3.9750e-02 -0.875000 0.812500 0.37000 0.0000e+00
76 0.8750000 3.1750 3.6000e-02 -2.000000 0.500000 0.14500 8.5000e-04
77 0.6312500 1.6750 4.4250e-02 -2.500000 4.062500 0.12000 1.3388e-03
78 0.7750000 3.7000 9.0000e-03 -1.687500 4.625000 0.46000 2.1250e-04
79 0.8687500 3.0250 4.5000e-02 -1.312500 2.500000 0.10000 5.3130e-04
80 0.6687500 2.9125 5.2500e-02 -2.937500 3.062500 0.29000 6.3750e-04
81 0.9937500 2.9875 4.2750e-02 -1.000000 0.437500 0.44500 1.6575e-03
1 0.9991900 3.9870 5.5124e-02 -0.003655 4.733500 0.11675 4.4798e-05
2 0.9137500 1.4141 2.3801e-02 -1.134500 0.000165 0.42088 1.6997e-03
3 0.6640300 1.6224 5.9901e-02 -2.695600 0.446290 0.21206 1.9932e-07
4 0.5002000 3.5541 6.0000e-02 -0.041606 2.177700 0.48623 1.6968e-03
5 0.5820400 2.4135 7.6685e-03 -0.186300 0.373040 0.10426 1.3004e-03
6 0.9437900 3.7724 4.3544e-03 -1.366200 0.001933 0.48634 1.1332e-03
7 0.9964700 2.4422 1.6938e-03 -4.996300 4.109500 0.49977 8.7192e-04
8 0.5048400 2.8929 4.5013e-02 -4.292900 4.872700 0.35383 3.8969e-04
9 0.8992900 1.0019 4.9885e-02 -2.084700 4.814600 0.10000 1.7000e-03
10 0.9263000 3.8506 1.3151e-05 -4.171600 2.728000 0.48467 8.1255e-05
11 0.5459600 3.4211 2.7146e-02 -1.485900 1.225100 0.13342 1.5818e-03
12 0.8024200 3.9657 1.9113e-02 -3.767900 0.260440 0.49277 1.2801e-03
13 0.9857600 2.7828 3.6267e-02 -0.756700 0.314730 0.44982 9.6849e-04
14 0.8233100 3.8589 3.3620e-02 -4.323900 1.087400 0.10136 2.3049e-04
15 0.9310200 2.9240 2.5238e-02 -3.318000 0.570270 0.10193 1.3439e-05
16 0.9986800 1.0089 3.9249e-02 -0.379740 0.567030 0.28480 1.0033e-03
17 0.7779300 1.4269 4.2311e-02 -0.324530 2.789300 0.39036 8.0921e-04
18 0.8660800 3.8006 5.2622e-02 -4.692500 4.037000 0.36843 1.5447e-03
19 0.5102700 3.2743 9.9253e-03 -4.680000 0.034325 0.40482 1.5879e-03
20 0.6070300 3.5920 3.3641e-02 -4.944800 0.730160 0.47566 1.0331e-03
21 0.7090600 1.1936
3.7699e-02 -3.029900 1.591300 0.49650 1.2205e-03 22 0.7863700 2.1362 2.1789e-03 -4.574600 4.981400 0.34442 1.1849e-03 23 0.8848300 1.5329 1.7370e-02 -0.189730 2.248300 0.18032 4.4391e-04 24 0.5007000 1.8788 5.6843e-02 -2.502400 0.351810 0.32744 5.3733e-04 25 0.8413200 1.1843 4.6963e-02 -0.051837 3.320600 0.19726 9.7577e-04 26 0.7376800 1.7172 1.8621e-02 -3.569800 1.550100 0.10151 1.6551e-03 27 0.7616500 1.3332 6.3316e-03 -3.489300 4.885700 0.40881 9.8459e-06 28 0.7960400 3.0398 5.8490e-02 -1.532200 0.567920 0.41917 6.4355e-04 29 0.6966200 1.1003 1.1793e-02 -4.963700 4.532900 0.32134 3.8363e-06 30 0.9893800 1.3506 1.8324e-02 -0.584130 4.615500 0.15448 1.4794e-03 31 0.6335100 1.7956 7.7952e-05 -2.075000 4.495400 0.22400 5.1666e-04 32 0.9691300 2.5669 5.9539e-02 -4.997200 2.992600 0.46340 1.6846e-04 33 0.5962800 3.3083 1.3846e-03 -0.988880 1.151200 0.18487 1.6965e-03 34 0.5201800 3.7137 5.9985e-02 -3.953900 0.867360 0.12697 1.2211e-03 35 0.9595000 3.4952 2.5877e-02 -0.249320 3.137400 0.20007 1.6621e-03 36 0.9085400 2.9695 4.4311e-03 -0.387590 2.069300 0.19999 1.3846e-03 37 0.9325500 1.3707 3.8164e-02 -3.655100 4.075200 0.14425 6.3643e-04 38 0.5787800 3.7284 5.1964e-02 -0.608960 2.428000 0.16864 2.8120e-04 39 0.6507100 1.6706 3.4095e-02 -4.560500 4.439400 0.29230 1.3790e-03 40 0.8890400 2.4600 2.9733e-02 -4.442600 0.121770 0.16841 1.3667e-04 41 0.7230000 3.9887 1.1652e-02 -4.345600 0.521430 0.25889 1.5484e-03 42 0.7504000 1.5944 2.8494e-02 -0.008915 3.453200 0.10464 3.6870e-04 43 0.5451300 4.0000 5.2843e-03 -3.386300 2.470500 0.21652 8.4651e-04 44 0.9739700 3.9036 3.9006e-02 -4.653100 2.148200 0.24524 1.1371e-03 45 0.9497300 2.3202 2.9230e-02 -2.452600 2.855900 0.34722 1.6616e-03 46 0.8633500 1.4594 5.9998e-02 -3.536700 1.345900 0.49557 1.4861e-03 47 0.5782000 1.9411 2.1752e-02 -3.614600 2.148300 0.33351 2.2064e-07 48 0.5543000 2.7197 4.3162e-02 -2.704700 0.050771 0.46717 1.4540e-03 49 0.6866300 3.2062 2.0000e-02 -1.637800 2.361300 0.31280 6.9147e-04 50 0.9440800 
3.0978 1.5482e-02 -1.140000 4.999900 0.20973 1.5361e-04 1 0.520833 1.425 0.0145 -3.458333 3.375000 0.176667 9.775000e-04 2 0.612500 3.975 0.0475 -4.041667 1.958333 0.403333 6.375000e-04 3 0.970833 3.225 0.0495 -0.791667 3.458333 0.350000 5.525000e-04 4 0.745833 1.275 0.0155 -2.708333 0.208333 0.336667 7.791667e-04 5 0.562500 1.825 0.0055 -0.125000 0.875000 0.416667 1.600833e-03 6 0.954167 1.375 0.0405 -4.291667 1.541667 0.423333 1.374167e-03 7 0.654167 3.775 0.0185 -2.208333 0.708333 0.390000 8.641667e-04 8 0.770833 1.975 0.0325 -2.541667 4.458333 0.183333 1.487500e-03 9 0.829167 2.175 0.0265 -1.291667 4.375000 0.296667 8.075000e-04 10 0.637500 1.125 0.0025 -2.291667 4.791667 0.143333 1.119167e-03 11 0.729167 3.925 0.0565 -2.125000 3.125000 0.156667 7.225000e-04 12 0.712500 3.825 0.0065 -0.458333 1.791667 0.190000 1.175833e-03 13 0.870833 2.425 0.0555 -4.958333 2.041667 0.290000 9.491667e-04 14 0.537500 2.875 0.0195 -4.625000 3.875000 0.116667 1.841667e-04 15 0.679167 2.725 0.0095 -3.708333 2.208333 0.370000 1.685833e-03 16 0.587500 3.575 0.0075 -2.958333 1.375000 0.323333 8.358333e-04 17 0.604167 1.475 0.0235 -4.125000 0.458333 0.356667 2.691667e-04 18 0.737500 1.075 0.0225 -1.375000 0.958333 0.470000 1.402500e-03 19 0.670833 3.425 0.0315 -0.541667 0.791667 0.270000 5.241667e-04 20 0.879167 3.725 0.0295 -3.291667 3.541667 0.410000 1.459167e-03 21 0.720833 3.025 0.0275 -1.791667 0.041667 0.283333 1.657500e-03 22 0.837500 3.675 0.0465 -1.125000 3.041667 0.110000 1.005833e-03 23 0.979167 2.025 0.0175 -3.375000 4.208333 0.310000 1.232500e-03 24 0.920833 1.175 0.0585 -4.375000 3.791667 0.363333 4.108333e-04 25 0.695833 2.075 0.0515 -0.041667 1.041667 0.210000 1.034167e-03 26 0.595833 2.125 0.0425 -1.458333 1.875000 0.276667 3.825000e-04 27 0.895833 2.225 0.0105 -2.791667 3.291667 0.150000 1.430833e-03 28 0.754167 3.475 0.0365 -4.541667 2.291667 0.383333 4.250000e-05 29 0.820833 1.325 0.0355 -2.458333 0.625000 0.216667 1.629167e-03 30 0.579167 2.375 0.0395 -1.708333 
2.958333 0.223333 1.572500e-03 31 0.554167 3.125 0.0305 -1.875000 2.708333 0.396667 2.408333e-04 32 0.987500 1.925 0.0345 -3.208333 4.625000 0.170000 6.091667e-04 33 0.662500 2.775 0.0385 -0.958333 0.125000 0.476667 4.958333e-04 34 0.629167 2.975 0.0415 -0.208333 1.708333 0.463333 2.975000e-04 35 0.570833 3.525 0.0445 -0.875000 4.958333 0.256667 9.208333e-04 36 0.845833 2.825 0.0285 -4.208333 4.541667 0.430000 1.062500e-03 37 0.529167 3.275 0.0245 -1.208333 2.875000 0.450000 4.391667e-04 38 0.504167 1.675 0.0205 -3.041667 0.541667 0.243333 2.125000e-04 39 0.962500 2.275 0.0135 -3.958333 1.208333 0.376667 1.090833e-03 40 0.620833 3.175 0.0085 -4.875000 1.125000 0.130000 7.508333e-04 41 0.645833 3.875 0.0455 -3.125000 2.375000 0.456667 1.544167e-03 42 0.995833 2.625 0.0525 -3.625000 3.708333 0.483333 1.515833e-03 43 0.704167 2.525 0.0005 -4.458333 0.375000 0.436667 1.345833e-03 44 0.887500 2.325 0.0485 -0.375000 3.958333 0.496667 6.941667e-04 45 0.762500 3.075 0.0595 -3.791667 3.625000 0.236667 1.317500e-03 46 0.862500 1.875 0.0505 -0.708333 1.291667 0.250000 1.260833e-03 47 0.929167 2.675 0.0045 -1.541667 4.708333 0.263333 1.204167e-03 48 0.912500 1.525 0.0435 -4.791667 3.208333 0.490000 6.658333e-04 49 0.512500 1.625 0.0255 -1.958333 2.458333 0.316667 1.147500e-03 50 0.787500 2.475 0.0035 -3.875000 4.041667 0.303333 1.416667e-05 51 0.937500 3.325 0.0535 -2.625000 0.291667 0.163333 4.675000e-04 52 0.779167 1.575 0.0575 -2.875000 1.625000 0.123333 9.916667e-05 53 0.795833 2.925 0.0165 -0.625000 2.625000 0.136667 8.925000e-04 54 0.854167 1.025 0.0015 -3.541667 2.791667 0.203333 3.258333e-04 55 0.804167 1.775 0.0545 -1.041667 4.291667 0.330000 1.558333e-04 56 0.545833 1.225 0.0115 -0.291667 4.125000 0.443333 1.275000e-04 57 0.687500 3.375 0.0375 -4.708333 4.875000 0.103333 1.289167e-03 58 0.904167 1.725 0.0335 -1.625000 2.541667 0.343333 3.541667e-04 59 0.812500 2.575 0.0215 -2.041667 1.458333 0.196667 7.083333e-05 60 0.945833 3.625 0.0125 -2.375000 2.125000 0.230000 
5.808333e-04 ------------------------------------------------------------------- Run SensHeat LatentHeat LogIceStr OceanicDrag OpenAlbedo IceAlbedo ------------------------------------------------------------------- 1 1.3500000 0.2650000 3.864333 6.250000 8.7500e-02 0.3250000 2 3.3000002 0.8287500 4.064333 0.500000 4.5625e-01 0.6000000 3 0.3000000 0.4025000 5.189333 1.125000 3.1875e-01 0.4562500 4 2.7000001 0.6500000 4.564333 4.875000 1.5625e-01 0.5500000 5 1.0500001 1.2000000 5.289333 5.250000 5.6250e-02 0.3562500 6 0.4500000 0.4575000 4.339333 5.500000 4.4375e-01 0.4625000 7 3.5250001 1.1725000 5.439333 6.875000 4.8750e-01 0.3437500 8 5.4750004 0.6362500 3.814333 3.250000 2.6250e-01 0.3000000 9 1.7250000 0.2375000 3.839333 7.875000 3.0625e-01 0.7250000 10 1.2000001 0.5812500 4.989333 2.250000 2.3750e-01 0.8000000 11 2.8500001 0.1000000 3.939333 3.750000 2.5000e-02 0.3687500 12 5.2500000 0.2925000 3.689333 9.000000 7.5000e-02 0.7437500 13 0.0000000 0.2512500 4.264333 1.500000 1.2500e-01 0.4312500 14 1.1250000 0.9112500 3.464333 6.000000 4.0625e-01 0.4250000 15 1.4250001 0.2237500 3.714333 4.750000 1.3750e-01 0.4812500 16 0.7500000 0.9937500 4.364333 7.375000 1.4375e-01 0.3375000 17 3.9750001 0.5537500 4.864333 8.000000 3.8750e-01 0.7812500 18 4.3500004 1.1312500 5.014333 5.750000 5.0000e-02 0.7000001 19 5.3250003 0.8562501 5.414333 8.125000 4.3750e-02 0.7625001 20 4.0500002 0.5950000 4.414333 1.875000 2.4375e-01 0.3625000 21 2.1000001 0.7600000 4.289333 7.250000 1.8125e-01 0.5937500 22 4.7250004 0.7737500 4.714333 9.375000 3.1250e-01 0.5062500 23 4.2750001 0.9525001 4.689333 1.625000 1.2500e-02 0.5687500 24 4.4250002 0.3887500 4.439333 4.500000 3.4375e-01 0.6250000 25 2.2500000 0.5125000 5.064333 1.000000 2.0000e-01 0.3062500 26 3.1500001 0.6637500 3.789333 9.750000 1.1250e-01 0.5812500 27 4.5750003 0.7050000 3.614333 3.000000 4.6875e-01 0.7687500 28 1.9500001 0.3612500 4.789333 7.125000 2.2500e-01 0.6062501 29 3.8250001 0.9387500 4.664333 1.250000 1.6875e-01 
0.6875000 30 2.1750002 0.1825000 5.314333 7.750000 2.9375e-01 0.4937500 31 5.8500004 0.4987500 4.764333 10.000000 3.6875e-01 0.6125000 32 2.3250001 0.2787500 3.989333 2.125000 3.7500e-01 0.4062500 33 3.9000001 0.3200000 3.539333 7.625000 1.5000e-01 0.6625000 34 5.0250001 0.7325000 4.464333 8.500000 3.2500e-01 0.7187500 35 3.0000000 0.1550000 5.139333 4.000000 3.6250e-01 0.4375000 36 1.5000000 0.2100000 4.239333 9.625000 0.0000e+00 0.6937500 37 5.7750001 0.8837500 4.089333 6.375000 1.7500e-01 0.6187500 38 4.6500001 0.8150000 3.514333 0.625000 1.3125e-01 0.3812500 39 1.5750001 0.6225000 3.739333 8.750000 1.1875e-01 0.7125000 40 2.9250002 0.4712500 3.964333 9.500000 3.5625e-01 0.3500000 41 3.7500002 0.4437500 3.764333 1.750000 2.7500e-01 0.6812500 42 6.0000000 0.9800000 4.114333 6.125000 4.3750e-01 0.4750000 43 0.6750000 1.1450000 3.639333 6.625000 3.1250e-02 0.5312500 44 1.2750001 1.0625000 4.889333 8.250000 2.6875e-01 0.7062500 45 2.4000001 1.0074999 3.664333 0.000000 1.9375e-01 0.4875000 46 4.8000002 0.5262500 3.914333 0.375000 1.0625e-01 0.7375000 47 3.2250001 0.3337500 5.239333 5.000000 4.5000e-01 0.5187500 48 4.5000000 1.0212500 4.164333 6.500000 2.5625e-01 0.6750000 49 4.9500003 0.4850000 4.639333 2.375000 4.9375e-01 0.3312500 50 0.9750000 1.1587501 4.814333 4.250000 3.3125e-01 0.4687500 51 0.8250001 0.3475000 4.914333 0.250000 5.0000e-01 0.3187500 52 2.0250001 1.1037500 5.039333 3.125000 9.3750e-02 0.6375001 53 2.4750001 0.9662500 3.439333 4.375000 4.6250e-01 0.7875000 54 5.9250002 0.3750000 3.589333 3.375000 2.8125e-01 0.4500000 55 4.1250000 0.6775000 3.564333 2.500000 3.8125e-01 0.5375000 56 4.2000003 0.1275000 4.739333 3.875000 2.0625e-01 0.3750000 57 1.8750001 0.4162500 4.014333 8.375000 2.1875e-01 0.3937500 58 1.8000001 0.7875000 4.189333 9.125000 6.8750e-02 0.5250000 59 3.6000001 0.4300000 4.964333 5.125000 2.8750e-01 0.5562500 60 1.6500001 0.3062500 4.939333 9.875000 3.0000e-01 0.4187500 61 0.6000000 1.0762500 4.214333 2.750000 4.1250e-01 0.5125001 62 
2.6250000 1.0900000 5.214333 5.625000 4.0000e-01 0.5437501 63 5.1750002 0.1962500 5.264333 0.750000 2.1250e-01 0.6500000 64 5.1000004 1.0487500 5.114333 3.625000 4.7500e-01 0.5750001 65 3.3750002 0.9250000 5.389333 1.375000 8.1250e-02 0.5000000 66 2.7750001 0.8425000 4.314333 4.125000 6.2500e-03 0.3125000 67 0.1500000 0.7187500 4.389333 9.250000 3.7500e-02 0.6312500 68 5.6250000 0.8975000 4.139333 7.500000 3.5000e-01 0.7312501 69 5.5500002 0.5675000 4.514333 6.750000 4.8125e-01 0.4125000 70 2.5500002 0.1412500 5.164333 0.875000 4.2500e-01 0.5875000 71 0.2250000 0.8700000 3.489333 2.875000 4.1875e-01 0.4000000 72 0.5250000 0.5400000 5.339333 8.625000 6.2500e-02 0.6687501 73 4.8750000 1.1862500 5.089333 5.875000 2.3125e-01 0.4437500 74 3.0750001 0.1687500 4.839333 7.000000 2.5000e-01 0.5625000 75 0.9000000 1.1175001 4.539333 0.125000 4.3125e-01 0.6562500 76 0.0750000 0.1137500 5.364333 2.000000 1.8750e-01 0.7562500 77 5.4000001 0.8012500 4.489333 4.625000 1.6250e-01 0.7750000 78 3.6750002 0.7462500 3.889333 5.375000 3.3750e-01 0.3875000 79 3.4500001 0.6087501 4.039333 8.875000 1.8750e-02 0.6437500 80 0.3750000 1.0350000 4.614333 3.500000 1.0000e-01 0.7937501 81 5.7000003 0.6912500 4.589333 2.625000 3.9375e-01 0.7500000 1 5.9997000 1.1589000 3.728167 1.120600 3.1427e-01 0.7998700 2 2.2264000 1.0322000 4.010978 0.410670 4.9997e-01 0.5514900 3 5.7302000 0.4378000 3.460086 1.269900 6.7792e-02 0.3066200 4 0.4633100 0.1433800 5.437766 1.544100 2.6867e-02 0.3950900 5 3.1825000 0.8446400 5.011613 0.035989 7.7866e-02 0.6735800 6 0.0073140 0.8061900 5.121429 2.285800 5.1269e-02 0.7260500 7 4.5960000 0.3383300 4.627888 0.351580 4.0538e-01 0.3105800 8 0.0095330 0.1834300 3.451648 0.348540 4.7205e-01 0.6761800 9 4.9261000 0.5693300 4.822502 2.712100 3.9996e-03 0.3006600 10 3.2789000 0.7215500 4.694280 0.906750 4.9718e-01 0.4507400 11 4.8171000 1.0666000 5.397262 0.219020 2.1799e-01 0.3492300 12 2.6602000 0.2106300 3.505272 8.842800 4.3526e-01 0.5195100 13 3.3406000 0.9153600 
3.970049 9.912100 4.8917e-02 0.7986400 14 3.9517000 0.1026200 3.619709 5.770200 2.9396e-01 0.3253200 15 1.3097000 0.3653500 5.431509 5.365000 1.6848e-01 0.7375500 16 3.8246000 0.2584900 5.279804 6.906800 4.9844e-01 0.3850200 17 0.6145200 0.3929000 4.929945 1.147500 4.5226e-01 0.7633100 18 5.9957000 0.2867400 3.948447 2.996400 3.3700e-01 0.6794200 19 0.8468600 0.9316100 5.343546 3.517400 1.6954e-01 0.3145500 20 2.8816000 1.1733000 4.070333 3.374400 1.4065e-01 0.7035300 21 0.1761700 0.7625300 4.854409 4.690900 1.2050e-01 0.6594000 22 5.9876000 0.1235600 5.077077 7.326300 1.0754e-01 0.3577100 23 1.5836000 0.7963400 3.906195 4.072000 4.9639e-01 0.6547400 24 2.2911000 0.5222600 3.656443 0.919570 4.0402e-01 0.4368500 25 4.4842000 0.5002400 4.269186 0.074368 1.9276e-01 0.3758600 26 5.7496000 1.1923000 4.574610 9.496700 2.6735e-01 0.7625000 27 5.3283000 0.2642800 4.100715 9.935100 2.7406e-02 0.4520400 28 0.7680000 1.1925000 4.266373 2.410800 3.6799e-01 0.4041200 29 5.0351000 1.1999000 3.521766 1.757200 1.2282e-01 0.4990300 30 5.4754000 0.5220900 3.445121 2.322700 4.5116e-01 0.6973200 31 2.0068000 0.1011400 5.247728 6.105200 3.8664e-01 0.7966800 32 5.9057000 0.9582200 5.359019 0.667470 5.1881e-02 0.6393800 33 0.0089600 0.7051800 5.251395 7.741700 4.2043e-01 0.5757500 34 0.6751200 0.1034700 4.074304 6.929800 1.9812e-01 0.3000800 35 3.8655000 0.8704800 5.164977 9.988700 1.3020e-01 0.5172900 36 2.6345000 1.0018000 4.432119 7.989100 2.1750e-01 0.4211900 37 1.1061000 0.4695100 5.206394 8.693400 3.4598e-03 0.6242100 38 2.3969000 0.9173100 4.514893 1.493000 2.5622e-01 0.4858800 39 2.3406000 0.3240300 3.831735 0.075667 4.9214e-02 0.7449100 40 3.0248000 0.5017600 4.380880 5.084100 4.7197e-01 0.7994100 41 5.5046000 0.3263400 4.447065 9.524200 1.0187e-02 0.5387300 42 4.2783000 0.7330300 5.172048 7.202100 4.3740e-01 0.7526600 43 3.9904000 0.1757900 3.822600 3.534700 3.4630e-01 0.5538100 44 4.6088000 0.6066400 4.948472 9.560600 1.8588e-01 0.4695100 45 3.4437000 1.0054000 3.478624 
1.810500 3.7831e-01 0.7425200 46 1.7404000 0.5664500 3.641107 0.000474 1.0077e-01 0.4926200 47 0.0014500 0.6688500 4.206097 7.435900 4.9989e-01 0.7972200 48 0.3564800 0.5456400 4.666640 4.195000 2.4020e-01 0.6710900 49 3.4885000 1.1329000 5.296950 9.254200 8.1136e-05 0.6731200 50 2.5518000 0.7634200 3.556556 9.646800 1.6867e-01 0.6811200 1 5.95 0.567500 4.289333 0.250000 0.137500000 0.729167 2 3.55 0.769167 3.956000 0.750000 0.462500000 0.487500 3 0.25 0.420833 4.456000 5.750000 0.495833300 0.520833 4 3.65 1.080833 4.489333 2.083333 0.212500000 0.587500 5 3.45 0.347500 4.822666 1.583333 0.412500000 0.762500 6 2.25 0.714167 3.456000 5.083333 0.345833300 0.645833 7 1.45 0.952500 5.056000 8.250000 0.362500000 0.795833 8 3.85 0.457500 4.556000 3.416667 0.179166700 0.404167 9 4.25 0.255833 3.989333 4.750000 0.487500000 0.712500 10 3.35 0.842500 4.189333 7.750000 0.445833300 0.695833 11 2.95 0.677500 3.856000 4.416667 0.045833330 0.412500 12 1.05 0.530833 3.756000 6.416667 0.479166700 0.637500 13 0.85 0.732500 5.089333 9.750000 0.279166700 0.779167 14 2.15 0.402500 3.589333 4.083333 0.070833330 0.704167 15 5.65 0.109167 4.389333 2.250000 0.312500000 0.687500 16 5.85 0.659167 5.122666 5.250000 0.004166667 0.470833 17 3.05 0.970833 4.156000 3.583333 0.395833300 0.662500 18 1.95 0.640833 5.189333 1.250000 0.170833300 0.362500 19 4.35 0.182500 5.422666 2.583333 0.154166700 0.420833 20 4.95 0.219167 4.922666 7.583333 0.054166670 0.629167 21 0.55 0.145833 3.522666 8.416667 0.387500000 0.620833 22 5.15 0.494167 4.422666 3.083333 0.270833300 0.437500 23 2.05 1.025833 3.689333 2.416667 0.429166700 0.312500 24 2.85 0.934167 5.222666 9.083333 0.229166700 0.770833 25 0.75 0.329167 3.889333 8.750000 0.220833300 0.454167 26 5.05 0.879167 3.656000 5.583333 0.187500000 0.787500 27 2.75 0.897500 5.022666 9.250000 0.329166700 0.445833 28 2.65 0.310833 4.722666 5.916667 0.254166700 0.720833 29 1.85 0.604167 4.622666 7.416667 0.454166700 0.670833 30 3.15 0.824167 4.089333 1.916667 
0.287500000 0.345833 31 0.95 0.439167 5.156000 4.583333 0.029166670 0.570833 32 5.55 0.805833 5.322666 7.083333 0.237500000 0.462500 33 4.05 0.787500 3.722666 6.750000 0.337500000 0.754167 34 4.65 1.172500 3.822666 7.250000 0.062500000 0.537500 35 1.25 1.007500 3.489333 0.916667 0.037500000 0.612500 36 0.15 1.117500 4.789333 0.583333 0.162500000 0.379167 37 5.45 1.190833 4.756000 8.583333 0.437500000 0.429167 38 2.55 1.044167 4.022666 0.416667 0.320833300 0.529167 39 5.35 0.200833 3.622666 6.250000 0.012500000 0.354167 40 2.35 0.860833 4.889333 0.083333 0.470833300 0.370833 41 0.65 1.099167 5.256000 3.916667 0.112500000 0.737500 42 1.55 0.549167 3.556000 4.916667 0.195833300 0.320833 43 1.65 0.475833 4.989333 6.916667 0.020833330 0.679167 44 2.45 0.585833 4.689333 5.416667 0.095833330 0.554167 45 0.05 0.365833 3.789333 2.916667 0.245833300 0.495833 46 3.75 1.154167 4.656000 7.916667 0.370833300 0.745833 47 4.55 0.237500 4.522666 3.750000 0.379166700 0.512500 48 4.45 0.164167 4.222666 6.083333 0.304166700 0.579167 49 5.25 0.989167 4.589333 4.250000 0.295833300 0.387500 50 3.95 1.062500 5.389333 9.416667 0.079166670 0.479167 51 4.85 1.135833 4.056000 8.083333 0.120833300 0.604167 52 1.35 0.622500 4.356000 9.916667 0.420833300 0.504167 53 5.75 0.750833 5.289333 3.250000 0.404166700 0.329167 54 0.35 0.292500 5.356000 1.750000 0.354166700 0.595833 55 3.25 0.274167 4.956000 2.750000 0.129166700 0.654167 56 4.75 0.127500 4.122666 8.916667 0.262500000 0.395833 57 1.75 0.512500 4.322666 1.416667 0.204166700 0.545833 58 1.15 0.695833 4.856000 6.583333 0.087500000 0.304167 59 0.45 0.384167 3.922666 9.583333 0.145833300 0.562500 60 4.15 0.915833 4.256000 1.083333 0.104166700 0.337500 Statistical Science2016, Vol. 31, No. 1, 40–60DOI: 10.1214/15-STS531© Institute of Mathematical Statistics, 2016Analysis Methods for ComputerExperiments: How to Assess and WhatCounts?Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. WelchAbstract. 
Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive, evidence-based assessment strategies are needed when comparing such modeling options. Otherwise, one is easily misled. Applying the strategies to several computer codes shows that a regression model more complex than a constant mean either has little impact on prediction accuracy or is an impediment. The choice of correlation function has modest effect, but there is little to separate two common choices, the power exponential and the Matérn, if the latter is optimized with respect to its smoothness. The applications presented here also provide no evidence that a composite of GPs provides practical improvement in prediction accuracy. A limited comparison of Bayesian and empirical Bayes methods is similarly inconclusive. In contrast, we find that the effect of experimental design is surprisingly large, even for designs of the same type with the same theoretical properties.

Key words and phrases: Correlation function, Gaussian process, kriging, prediction accuracy, regression.

Hao Chen is a Ph.D. candidate and William J. Welch is Professor, Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Jason L. Loeppky is Associate Professor, Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). Jerome Sacks is Director Emeritus, National Institute of Statistical Sciences, Research Triangle Park, North Carolina 27709, USA (e-mail: sacks@niss.org).

1. INTRODUCTION

Over the past quarter century a literature beginning with Sacks, Schiller and Welch (1989), Sacks et al. (1989, in this journal), Currin et al. (1991), and O’Hagan (1992) has grown to explore the design and analysis of computer experiments. Such an experiment is a designed set of runs of a mathematical model implemented as a computer code. Running the code with vector-valued input $\mathbf{x}$ gives the output value, $y(\mathbf{x})$, assumed real-valued and deterministic: Running the code again with the same value for $\mathbf{x}$ does not change $y(\mathbf{x})$. A design $D$ is a set of $n$ runs at $n$ configurations of $\mathbf{x}$, and an objective of primary interest is to use the data (inputs and outputs) to predict, via a statistical model, the output of the code at untried input values.

The basic approach to the statistical model typically adopted starts by thinking of the output function, $y(\mathbf{x})$, as being in a class of functions with prior distribution a Gaussian process (GP). The process has mean $\mu$, which may be a regression function in $\mathbf{x}$, and a covariance function, $\sigma^2 R$, from a specified parametric family. Prediction is then made through the posterior mean given the data from the computer experiment, with some variations depending on whether a maximum likelihood (empirical Bayes) or fuller Bayesian implementation is used. While we partially address those variations in this article, our main thrusts are the practical questions faced by the user: What regression function and correlation function should be used? Does it matter?

We will call a GP model with specified regression and correlation functions a Gaussian stochastic process (GaSP) model. For example, GaSP(Const, PowerExp) will denote a constant (intercept only) regression and the power-exponential correlation function.
The various regression models and correlation functions under consideration in this article will be defined in Section 2.

The rationale for the GaSP approach stems from the common situation that the dimension, $d$, of the space of inputs is not small, the function is fairly complex to model, and $n$ is not large (code runs are expensive), hindering the effectiveness of standard methods (e.g., polynomials, splines, MARS) for producing predictions. The GaSP approach allows for a flexible choice of approximating models that adapts to the data and, more tellingly, has proved effective in coping with complex codes with moderately high $d$ and scarce data. There is a vast literature treating an analysis in this context.

This article studies the impact on prediction accuracy of the particular model specifications commonly used, particularly $\mu$, $R$, $n$, $D$. The goals are twofold. First, we propose a more evidence-based approach to distinguish what may be important from the unimportant and what may need further exploration. Second, our application of this approach to various examples leads to some specific recommendations.

Assessing statistical strategies for the analysis of a computer experiment often mimics what is done for physical experiments: a method is proposed, applied in examples (usually few in number) and compared to other methods. Where possible, formal, mathematical comparisons are made; otherwise, criteria for assessing performance are empirical. An initial empirical study for a physical experiment is forced to rely on the specific data of that experiment and, while different analysis methods may be applied, all are bound by the single data set. There are limited opportunities to vary sample size or design.

Computer experiments provide richer opportunities. Fast-to-run codes enable a laboratory to investigate the relative merits of an analysis method. A whole spectrum of “replicate” experiments can be conducted for a single code, going beyond a thimbleful of “anecdotal” reports.

The danger of being misled by anecdotes can be seen in the following example. The borehole function [Morris, Mitchell and Ylvisaker, 1993, also given in the supplementary material (Chen et al., 2016)] is frequently used to illustrate methodology for computer experiments. A 27-run orthogonal array (OA) in the 8 input factors was proposed as a design, following Joseph, Hung and Sudjianto (2008). The 27 runs were analyzed via GaSP with a specific $R$ (the Gaussian correlation function described in Section 2) and with two choices for $\mu$: a simple constant (intercept) versus a method to select linear terms (SL), also described in Section 2. The details of these alternative models are not important for now, just that we are comparing two modeling methods. A set of 10,000 test points selected at random in the 8-dimensional input space was then predicted. The resulting values of the root mean squared error (RMSE) measure defined in (2.5) of Section 2 were 0.141 and 0.080 for the constant and SL regression models, respectively.

With the SL approach reducing the RMSE by about 43% relative to a model with a constant mean, does this example provide powerful evidence for using regression terms in the GaSP model? Not quite. We replicated the experiment with the same choices of $\mu$, $R$, $n$ and the same test data, but the training data came from a theoretically equivalent 27-run OA design. (There are many equivalent OAs, e.g., by permuting the labels between columns of a fixed OA.) The RMSE values in the second analysis were 0.073 and 0.465 for the constant and SL models, respectively; SL produced about 6 times the error relative to a constant mean, so the evidence against using regression terms is even more powerful!

A broader approach is needed. The one we take is
The one we take islaid out starting in Section 2, where we specify the al-ternatives considered for the statistical model’s regres-sion component and correlation function, and definethe assessment measures to be used. We focus on thefundamental criterion of prediction accuracy (uncer-tainty assessment is discussed briefly in Section 6.1).In Section 3 we outline the basic idea of generatingrepeated data sets for any given example. The methodis (exhaustively) implemented for several fast codes,including the aforementioned borehole function, alongwith some choices of n and D. In Section 4 the methodis adapted to slower codes where data from only oneexperiment are available. Ideally, the universe of com-puter experiments is represented by a set of test prob-lems and assessment criteria, as in numerical optimiza-tion (Dixon and Szegö, 1978); the codes and data setsinvestigated in this article and its supplementary ma-terial (Chen et al., 2016) are but an initial set. In Sec-tion 5 other modeling strategies are compared. Finally,42 CHEN, LOEPPKY, SACKS AND WELCHin Sections 6 and 7 we make some summarizing com-ments, conclusions and recommendations.The article’s main findings are that regression termsare unnecessary or even sometimes an impediment, thechoice of R matters for less smooth functions, and thatthe variability of performance of a method for the sameproblem over equivalent designs is alarmingly high.Such variation can mask the differences in analysismethods, rendering them unimportant and reinforcingthe message that light evidence leads to flimsy conclu-sions.2. STATISTICAL MODELS, EXPERIMENTALDESIGN, AND ASSESSMENTA computer code output is denoted by y(x) wherethe input vector, x = (x1, . . . , xd), is in the d-dimen-sional unit cube. As long as the input space is rect-angular, transforming to the unit cube is straightfor-ward and does not lose generality. Suppose n runs ofthe code are made according to a design D of in-put vectors x(1), . . . 
,x(n) in [0,1]d , resulting in datay = (y(x(1)), . . . , y(x(n)))T . The goal is to predict y(x)at untried x.The GaSP approach uses a regression model and GPprior on the class of possible y(x). Specifically, y(x) isa priori considered to be a realization ofY(x) = μ(x) + Z(x),(2.1)where μ(x) is the regression component, the mean ofthe process, and Z(x) has mean 0, variance σ 2, andcorrelation function R.2.1 Choices for the Correlation FunctionLet x and x′ denote two values of the input vector.The correlation between Z(x) and Z(x′) is denoted byR(x,x′). Following common practice, R(x,x′) is takento be a product of 1-d correlation functions in the dis-tances hj = |xj −x′j |, that is, R(x,x′) =∏dj=1 Rj(hj ).We mainly consider four choices for Rj :• Power exponential (abbreviated PowerExp):Rj(hj ) = exp(−θjhpjj),(2.2)with θj ≥ 0 and 1 ≤ pj ≤ 2 controlling the sensitiv-ity and smoothness, respectively, of predictions of ywith respect to xj .• Squared exponential or Gaussian (abbreviatedGauss): the special case of PowerExp in (2.2) withall pj = 2.• Matérn:Rj(hj ) = 1(νj )2(νj−1)(θjhj )νj Kνj (θjhj ),(2.3)where is the Gamma function, and Kνj is the mod-ified Bessel function of order νj . Again, θj ≥ 0 is asensitivity parameter. The Matérn class was recom-mended by Stein (1999), Section 2.7, for its controlvia νj > 0 of the differentiability of the correlationfunction with respect to xj , and hence that of theprediction function. With νj = 1 + 12 or νj = 2 + 12 ,there are 1 or 2 derivatives, respectively. We callthese subfamilies Matérn-1 and Matérn-2. Similarly,we use Matérn-0 and Matérn-∞ to refer to the casesνj = 0 + 12 and νj → ∞. They give the exponen-tial family [pj = 1 in (2.2)], with no derivatives, andGauss, which is infinitely differentiable. Matérn-0,1,2 are closely related to linear, quadratic, andcubic splines. 
We believe that little would be gained by incorporating smoother kernels (but less smooth than the analytic Matérn-∞) in the study.

Consideration of the entire Matérn class for $\nu_j > 0$ is computationally cumbersome for the large numbers of experiments we will evaluate. Hence, what we call Matérn has $\nu_j$ optimized over the Matérn-0, Matérn-1, Matérn-2, and Matérn-∞ special cases, separately for each coordinate.
• Matérn-2: Some authors (e.g., Picheny et al., 2013) fix $\nu_j$ in the Matérn correlation function to give some differentiability. The Matérn-2 subfamily sets $\nu_j = 2 + \frac{1}{2}$ for all j, giving 2 derivatives.

More recently, other types of covariance functions have been recommended to cope with “apparently nonstationary” functions (e.g., Ba and Joseph, 2012). In Section 5.2 we will discuss the implications and characteristics of these options.

Given a choice for $R_j$ and hence R, we define the $n \times n$ matrix $\mathbf{R}$ with element i, j given by $R(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ and the $n \times 1$ vector $\mathbf{r}(\mathbf{x}) = (R(\mathbf{x}, \mathbf{x}^{(1)}), \ldots, R(\mathbf{x}, \mathbf{x}^{(n)}))^T$ for any $\mathbf{x}$ where we want to predict.

2.2 Choices for the Regression Component

We explore three main choices for μ:

• Constant (abbreviated Const): $\mu(\mathbf{x}) = \beta_0$.
• Full linear (FL): $\mu(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d$, that is, a full linear model in all input variables.
• Select linear (SL): $\mu(\mathbf{x})$ is linear in the $x_j$ like FL but only includes selected terms.

The proposed algorithm for SL is as follows. For a given correlation family construct a default predictor with Const for μ. Decompose the predictive function (Schonlau and Welch, 2006) and identify all main effects that contribute more than 100/d percent to the total variation. These become the selected coordinates. Typically, large main effects have clear linear components. If a large effect lacks a linear component, little is lost by including a linear term.
Inclusion of possible nonlinear trends can be pursued at considerable computational cost; we do not routinely do so, but in Section 4.1 we do include a regression model with nonlinear terms in $x_j$.

All candidate regression models considered can be written in the form

$$\mu(\mathbf{x}) = \beta_0 + \beta_1 f_1(\mathbf{x}) + \cdots + \beta_k f_k(\mathbf{x}),$$

where the functions $f_j(\mathbf{x})$ are known. The maximum likelihood estimate (MLE) of $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_k)^T$ is the generalized least-squares estimate $\hat{\boldsymbol{\beta}} = (\mathbf{F}^T \mathbf{R}^{-1} \mathbf{F})^{-1} \mathbf{F}^T \mathbf{R}^{-1} \mathbf{y}$, where the $n \times (k + 1)$ matrix $\mathbf{F}$ has $(1, f_1(\mathbf{x}^{(i)}), \ldots, f_k(\mathbf{x}^{(i)}))$ in row i. This is also the Bayes posterior mean with a diffuse prior for $\boldsymbol{\beta}$.

Early work (Sacks, Schiller and Welch, 1989) suggested that there is little to be gained (and maybe even something to lose) by using other than a constant term for μ. In addition, Lim et al. (2002) showed that polynomials can be exactly predicted with a minimal number of points using the Gauss correlation function, provided one lets $\theta_j \to 0$. These points underline the fact that a GP prior can capture the complexity in the output of the code, suggesting that deploying regression terms is superfluous. The evidence we report later supports this impression.

2.3 Prediction

Predictions are made as follows. For given data and values of the parameters in R, the mean of the posterior predictive distribution of $y(\mathbf{x})$ is

$$\hat{y}(\mathbf{x}) = \hat{\mu}(\mathbf{x}) + \mathbf{r}^T(\mathbf{x}) \mathbf{R}^{-1} (\mathbf{y} - \mathbf{F}\hat{\boldsymbol{\beta}}), \tag{2.4}$$

where $\hat{\mu}(\mathbf{x}) = \hat{\beta}_0 + \hat{\beta}_1 f_1(\mathbf{x}) + \cdots + \hat{\beta}_k f_k(\mathbf{x})$.

In practice, the other parameters, $\sigma^2$ and those in the correlation function R of equations (2.2) or (2.3), must be estimated too. Empirical Bayes replaces all of $\boldsymbol{\beta}$, $\sigma^2$, and the correlation parameters in R by their MLEs (Welch et al., 1992). Various other Bayes-based procedures are available, including one fully Bayesian strategy described in Section 5.1.
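As a minimal numerical sketch of the generalized least-squares estimate and the predictor (2.4), the following Python code uses the power-exponential correlation (2.2). It assumes the correlation parameters $\theta_j$ and $p_j$ are already known (no maximum likelihood step is shown), and the function names `power_exp_corr` and `gasp_predict` are hypothetical, introduced only for illustration.

```python
import numpy as np

def power_exp_corr(A, B, theta, p):
    """Product power-exponential correlation, as in (2.2), between rows of A and B."""
    # h[i, k, j] = |A[i, j] - B[k, j]|, the coordinate-wise distances
    h = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(theta * h ** p, axis=2))

def gasp_predict(X, y, X_new, theta, p, F=None, F_new=None):
    """Posterior mean (2.4) for fixed, assumed-known theta and p.

    F and F_new hold the regression functions at the design and prediction
    points; if F is omitted, a constant (Const) regression is used.
    """
    if F is None:
        F = np.ones((X.shape[0], 1))
        F_new = np.ones((X_new.shape[0], 1))
    R = power_exp_corr(X, X, theta, p)      # n x n correlation matrix
    r = power_exp_corr(X_new, X, theta, p)  # row i is r(x)^T for new point i
    # Generalized least-squares estimate of beta
    Rinv_F = np.linalg.solve(R, F)
    Rinv_y = np.linalg.solve(R, y)
    beta_hat = np.linalg.solve(F.T @ Rinv_F, F.T @ Rinv_y)
    resid = y - F @ beta_hat
    return F_new @ beta_hat + r @ np.linalg.solve(R, resid)

# Illustrative data only: a toy function on a random design in [0, 1]^2
rng = np.random.default_rng(1)
X = rng.random((12, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2
theta = np.array([5.0, 5.0])
p = np.array([1.5, 1.5])
y_hat = gasp_predict(X, y, X, theta, p)
print(np.max(np.abs(y_hat - y)))  # interpolation error at the design points
```

Because the code is deterministic, the predictor interpolates the training data, so the printed error should be at the level of rounding error. Linear systems are solved directly rather than forming $\mathbf{R}^{-1}$, a common numerical choice that is not prescribed by the text.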
Our focus, however, is not on the particular Bayes-based methods employed but rather on assumptions about the form of the underlying GaSP model.

2.4 Design

For fast codes we typically use as a base design D an approximate maximin Latin hypercube design (mLHD, Morris and Mitchell, 1995), with improved low-dimensional space-filling properties (Welch et al., 1996). A few other choices, such as orthogonal arrays, are also investigated in Section 3.5, with a more comprehensive comparison of different classes of design the subject of another ongoing study. In any event, we show that even for a fixed class of design and fixed n there is substantial variation in prediction accuracy over equivalent designs. Conclusions based on a single design choice can be misleading.

The effect of n on prediction accuracy was explored by Sacks, Schiller and Welch (1989) and more recently by Loeppky, Sacks and Welch (2009); its role in the comparison of competing alternatives for μ and R will also be addressed in Section 3.5.

2.5 Measures of Prediction Error

In order to compare various forms of the predictor in (2.4) built from the n code runs, $\mathbf{y} = (y(\mathbf{x}^{(1)}), \ldots, y(\mathbf{x}^{(n)}))^T$, we need to set some standards. The gold standard is to assess the magnitude of prediction error via holdout (test) data, that is, in predicting N further runs, $y(\mathbf{x}_{\mathrm{ho}}^{(1)}), \ldots, y(\mathbf{x}_{\mathrm{ho}}^{(N)})$. The prediction errors are $\hat{y}(\mathbf{x}_{\mathrm{ho}}^{(i)}) - y(\mathbf{x}_{\mathrm{ho}}^{(i)})$ for $i = 1, \ldots, N$.

The performance measure we use is a normalized RMSE of prediction over the holdout data, denoted by $e_{\mathrm{rmse,ho}}$. The normalization is the RMSE using the (trivial) predictor $\bar{y}$, the mean of the data from the runs in the experimental design, the “training” set. Thus,

$$e_{\mathrm{rmse,ho}} = \frac{\sqrt{(1/N)\sum_{i=1}^{N} \bigl(\hat{y}(\mathbf{x}_{\mathrm{ho}}^{(i)}) - y(\mathbf{x}_{\mathrm{ho}}^{(i)})\bigr)^2}}{\sqrt{(1/N)\sum_{i=1}^{N} \bigl(\bar{y} - y(\mathbf{x}_{\mathrm{ho}}^{(i)})\bigr)^2}}. \tag{2.5}$$

The normalization in the denominator puts $e_{\mathrm{rmse,ho}}$ roughly on [0,1] whatever the scale of y, with 1 indicating no better performance than $\bar{y}$.
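The measure in (2.5) can be computed in a few lines. The sketch below is only an illustration: the function name `normalized_holdout_rmse` and the toy numbers are assumptions, not taken from the article.

```python
import numpy as np

def normalized_holdout_rmse(y_pred, y_holdout, y_train):
    """Normalized holdout RMSE, following (2.5).

    The numerator is the RMSE of the predictions on the holdout set; the
    denominator is the RMSE of the trivial predictor, the training mean.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_holdout = np.asarray(y_holdout, dtype=float)
    y_bar = np.mean(y_train)  # mean of the training outputs
    rmse_model = np.sqrt(np.mean((y_pred - y_holdout) ** 2))
    rmse_trivial = np.sqrt(np.mean((y_bar - y_holdout) ** 2))
    return rmse_model / rmse_trivial

# Toy values (hypothetical): the training mean is 2.0
y_train = np.array([1.0, 2.0, 3.0])
y_holdout = np.array([0.0, 2.0, 4.0])
print(normalized_holdout_rmse(np.full(3, 2.0), y_holdout, y_train))  # 1.0
print(normalized_holdout_rmse(y_holdout, y_holdout, y_train))        # 0.0
```

Predicting the training mean everywhere gives a value of 1, and a perfect predictor gives 0, which matches the interpretation of the scale given above.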
The criterion is related to $R^2$ in regression, but $e_{\mathrm{rmse,ho}}$ measures performance for a new test set and smaller values are desirable.

Similarly, worst-case performance can be defined as the normalized maximum absolute error. Results for this metric are reported in the supplementary material (Chen et al., 2016); the conclusions are the same. Other definitions (such as median absolute error) and other normalizations (such as those of Loeppky, Sacks and Welch, 2009) can be used, but without substantive effect on comparisons.

FIG. 1. Equivalent designs for d = 2 and n = 11: (a) a base mLHD design; (b) the base design with labels permuted between columns; and (c) the base design with values in the $x_1$ column reflected around $x_1 = 0.5$.

What are tolerable levels of error? Clearly, these are application-specific so that tighter thresholds would be demanded, say, for optimization than for sensitivity analysis. For general purposes we take the rule of thumb that $e_{\mathrm{rmse,ho}} < 0.10$ is useful. For normalized maximum error it is plausible that the threshold could be much larger, say 0.25 or 0.30. These speculations are consequences of the experiences we document later, and are surely not the last word. The value of having thresholds is to provide benchmarks that enable assessing when differences among different methods or strategies are practically insignificant versus statistically significant.

3. FAST CODES

3.1 Generating a Reference Set for Comparisons

For fast codes under our control, large holdout sets can be obtained. Hence, in this section performance is measured through the use of a holdout (test) set of 10,000 points, selected as a random Latin hypercube design (LHD) on the input space.

With a fast code many designs and hence training data sets can easily be generated. We generate many equivalent designs by exploiting symmetries. For a simple illustration, Figure 1(a) shows a base mLHD for d = 2 and n = 11.
Permuting the labels between columns of the design, that is, interchanging the $x_1$ and $x_2$ values as in Figure 1(b), does not change the inter-point distances or properties based on them such as the minimum distance used to construct mLHD designs. Similarly, reflecting the values within, say, the $x_1$ column around $x_1 = 0.5$ as in Figure 1(c), does not change the properties. In this sense the designs are equivalent.

In general, for any base design with good properties, there are $d!\,2^d$ equivalent designs and hence equivalent sets of training data available from permuting all column labels and reflecting within columns for a subset of inputs. For the borehole code mentioned in Section 1 and investigated more fully in Section 3.2, we have found that permuting between columns gives more variation in prediction accuracy than reflecting within columns. Thus, in this article for nearly all examples we only permute between columns: for d = 4 all 24 possible permutations, and for d ≥ 5 a random selection of 25 different permutations. The example of Section 5.2 with d = 2 is the one exception. Because $y(x_1, x_2)$ is symmetric in $x_1$ and $x_2$, permuting between columns does not change the training data and we reflect within $x_1$ and/or $x_2$ instead.

The designs, generated data sets, and replicate analyses then serve as the reference set for a particular problem and provide the grounds on which variability of performance can be assessed. Given the setup of Section 2, we want to assess the consequences of making a choice from the menu of three regression models and four correlation functions.

3.2 Borehole Code

The first setting we will look at is the borehole code (Morris, Mitchell and Ylvisaker, 1993) mentioned in Section 1 and described in the supplementary material (Chen et al., 2016). It has served as a test bed in many contexts (e.g., Joseph, Hung and Sudjianto, 2008).
We consider three different designs for the experiment: a 27-run, 3-level orthogonal array (OA), the same design used by Joseph, Hung and Sudjianto (2008); a 27-run mLHD; and a 40-run mLHD.

FIG. 2. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

There are 12 possible modeling combinations from the four correlation functions and three regression models outlined in Sections 2.1 and 2.2. The SL choice for μ here is always the term $x_1$. Its main effect accounts for approximately 80% of the variation in predictions over the 8-dimensional input domain, and all analyses with a Const regression model choose $x_1$ and no other terms across all designs and all repeat experiments.

The top row of Figure 2 shows results with the 27-run OA design. For a given modeling strategy, 25 random permutations between columns of the 27-run OA lead to 25 repeat experiments (Section 3.1) and hence a reference set of 25 values of $e_{\mathrm{rmse,ho}}$ shown as a dot plot. The results are striking. Relative to a constant regression model, the FL regression model has empirical distributions of $e_{\mathrm{rmse,ho}}$ which are uniformly and substantially inferior, for all correlation functions. The SL regression also performs very poorly sometimes, but not always. To investigate the SL regression further, Figure 3 plots $e_{\mathrm{rmse,ho}}$ for individual repeat experiments, comparing the GaSP(Const, Gauss) and GaSP(SL, Gauss) models. Consistent with the anecdotal comparisons in Section 1, the plot shows that the SL regression model can give a smaller $e_{\mathrm{rmse,ho}}$ (this tends to happen when both methods perform fairly well), but the SL regression sometimes has very poor accuracy (almost 0.5 on the normalized RMSE scale). The top row of Figure 2 also shows that the choice of correlation function is far less important than the choice of regression model.

FIG. 3. Borehole code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for an SL regression model versus a constant regression model. The 25 points are from repeat experiments generated by 25 random permutations between columns of a 27-run OA.

The results for the 27-run mLHD in the middle row of Figure 2 show that design can have a large effect on accuracy: every analysis model performs better for the 27-run mLHD than for the 27-run OA. (Note the vertical scale is different for each row of the figure.) The SL regression now performs about the same as the constant regression instead of occasionally much worse. There is no substantial difference in accuracy between the correlation functions. Indeed, the impact on accuracy of using the space-filling mLHD design instead of an OA is much more important than differences due to choice of the correlation function. The scaling in the middle row of plots somewhat mutes the considerable variation in accuracy still present over the 25 equivalent mLHD designs.

Increasing the number of runs to a 40-run mLHD (bottom row of Figure 2) makes a further substantial improvement in prediction accuracy. All 12 modeling strategies give $e_{\mathrm{rmse,ho}}$ values of about 0.02–0.06 over the 25 repeat experiments. Although there is little systematic difference among strategies, the variation over equivalent designs is still striking in a relative sense.

The strikingly poor results from the SL regression model (sometimes) and the FL model (all 25 repeats) in the top row of Figure 2 may be explained as follows. The design is a 27-run OA with only 3 levels. In a simpler context, Welch et al. (1996) illustrated nonidentifiability of the important terms in a GaSP model when the design is not space-filling.
The SL regression and, even more so, the FL regression complicate an already flexible GP model. The difficulty in identifying the important terms is underscored by the fact that for all 25 repeat experiments from the base 27-run OA, a least squares fit of a simple linear regression model in $x_1$ (with no other terms) gives $e_{\mathrm{rmse,ho}}$ values close to 0.45. In other words, performance of GaSP(SL, Gauss) is sometimes similar to fitting just the important $x_1$ linear trend. The performance of GaSP(FL, Gauss) is highly variable and sometimes even worse than simple linear regression.

Welch et al. (1996) argued that model identifiability is, not surprisingly, connected with confounding in the design. The confounding in the base 27-run OA is complex. While it is preserved in an overall sense by permutations between columns, how the confounding structure aligns with the important inputs among $x_1, \ldots, x_8$ will change across the 25 repeat experiments. Hence, the impact of confounding on nonidentifiability will vary.

In contrast, accuracy for the space-filling design in the middle row of Figure 2 is much better, even with only 27 runs. The SL regression model performs as accurately as the Const model (but no better); only the even more complex FL regression runs into difficulties. Again, this parallels the simpler Welch et al. (1996) example, where model identification was less problematic with a space-filling design and largely eliminated by increasing the sample size (the bottom row of Figure 2).

3.3 G-Protein Code

A second application, the G-protein code used by Loeppky, Sacks and Welch (2009) and described in the supplementary material (Chen et al., 2016), consists of a system of ODEs with 4-dimensional input.

Figure 4 shows $e_{\mathrm{rmse,ho}}$ for the three regression models (here SL selects $x_2$, $x_3$ as inputs with large effects) and four correlation functions. The results in the top row are for a 40-run mLHD.
With d = 4, all 24 possible permutations between columns of a single base design lead to 24 data sets and hence 24 $e_{\mathrm{rmse,ho}}$ values. The dot plots in the top row have similar distributions across the 12 modeling strategies. As these empirical distributions have most $e_{\mathrm{rmse,ho}}$ values above 0.1, we try increasing the sample size with an 80-run mLHD. This has a substantial effect on accuracy, with all modeling methods giving $e_{\mathrm{rmse,ho}}$ values of about 0.06 or less.

Thus, for the G-protein application, none of the three choices for μ or the four choices for R matter. The variation among equivalent designs is alarmingly large in a relative sense, dwarfing any impact of modeling strategy.

FIG. 4. G-protein code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of $e_{\mathrm{rmse,ho}}$ in each dot plot.

3.4 PTW Code

Results for a third fast-to-run code, PTW (Preston, Tonks and Wallace, 2003), are in the supplementary material (Chen et al., 2016). It has 11 inputs. We took an mLHD with n = 110 as the base design for the reference set. Prior information from engineers suggested incorporating linear $x_1$ and $x_2$ terms; SL also included $x_3$. No essential differences among μ or R emerged, but again there is a wide variation over equivalent designs.

3.5 Effect of Design

The results above document a significant, seldom recognized role of design: different, even equivalent, designs can have a greater effect on performance than the choice of μ and R. Moreover, without prior information, there is no way to assure that the choice of design will be one of the good ones in the equivalence class.
Whether sequential experimentation, if feasible, can produce a more advantageous solution needs exploring.

The contrast between the results for borehole 27-run OA and the 27-run mLHD is a reminder of the importance of using designs that are space-filling, a quality widely appreciated. It is no secret that the choice of sample size, n, has a strong effect on performance as evidenced in the results for the 40-point mLHD contrasted with those for the 27-point mLHD. A more penetrating study of the effect of n was conducted by Loeppky, Sacks and Welch (2009). That FL does as well as Const and SL for the Borehole 40-point mLHD but performs badly for either of the two 27-point designs, and that none of the regression choices matter for the G-protein 40-point design or for the PTW 110-point design, suggests that “everything” works if n is large enough.

In summary, the choice of n and the choice of D given n can have huge effects. But have we enough evidence that choice of μ matters only in limited contexts (such as small n or poor design) and that choice of R does not matter? So far we have dealt with only a handful of simple, fast codes; it is time to consider more complex codes.

4. SLOW CODES

For complex costly-to-run codes, generating substantial holdout data or output from multiple designs is infeasible. Similarly, for codes where we only have reported data, new output data are unavailable. Being forced to depend on the data at hand leads us to rely on cross-validation methods for generating multiple designs and holdout sets, through which we can assess the effect of variability not solely in the designs but also, and inseparably, in the holdout target data. We know from Section 3 that variability due to designs is considerable, and it is no surprise that variability in holdout sets would lead to variability in predictive performance.
The utility then of the created multiple designs and holdout sets is to compare the behavior of different modeling choices under varying conditions rather than relying on a single quantity attached to the original, unique data set.

Our approach is simply to delete a subset from the full data set, use the remaining data to produce a predictor, and calculate the (normalized) RMSE from predicting the output in the deleted (holdout) subset. Repeating this for a number (25 is what we use) of subsets gives some measure of variability and accuracy. In effect, we create 25 designs and corresponding holdout sets from a single data set and compare consequences arising from different choices for predictors.

The details described in the applications below differ somewhat depending on the particular application. In the example of Section 4.1, a reflectance model for a plant canopy, there are, in fact, limited holdout data but no data from multiple designs. In the volcano-eruption example of Section 4.2 and the sea-ice model of Section 4.3 holdout data are unavailable.

4.1 Nilson–Kuusk Model

An ecological code modeling reflectance for a plant canopy developed by Nilson and Kuusk (1989) was used by Bastos and O’Hagan (2009) to illustrate diagnostics for GaSP models. With 5-dimensional input, two computer experiments were performed: the first using a 150-run random LHD and the second with an independently chosen LHD of 100 points.

We carry out three studies based on the same data. The first treats the 100-point LHD as the experiment and the 150-point set as a holdout sample. The second study reverses the roles of the two LHDs. A third study, extending one done by Bastos and O’Hagan (2009), takes the 150-run LHD, augments it with a random sample of 50 points from the 100-point LHD, takes the resulting 200-point subset as the experimental design for training the statistical model, and uses the remaining N = 50 points from the 100-run LHD to form the holdout set in the calculation of $e_{\mathrm{rmse,ho}}$.
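The index bookkeeping for the third study can be sketched as follows. This is only an illustration of the splitting step under stated assumptions: the function name `third_study_splits`, the default of 25 repeats, and the random seed are hypothetical choices, and no model fitting is shown.

```python
import numpy as np

def third_study_splits(n_pool=100, n_extra=50, n_repeats=25, seed=0):
    """Random splits of the 100-run LHD into extra training runs and holdout runs.

    Each repeat returns two index arrays into the 100-run LHD: the n_extra
    runs added to the 150-run base design, and the remaining runs, which
    form the holdout set.
    """
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_pool)
        extra_train = np.sort(perm[:n_extra])  # joins the 150 base runs
        holdout = np.sort(perm[n_extra:])      # N = 50 holdout runs
        splits.append((extra_train, holdout))
    return splits

splits = third_study_splits()
extra_train, holdout = splits[0]
print(len(splits), extra_train.size, holdout.size)
```

For each split, the training data would be all 150 base runs plus the rows indexed by `extra_train`, and the normalized RMSE would be evaluated on the rows indexed by `holdout`.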
By repeating the sampling of the 50 points 25 times, we get 25 replicate experiments, each with the same base 150 runs but differing with respect to the additional 50 training points and the holdout set.

In addition to the linear regression choices we have studied so far, we also incorporate a regression model identified by Bastos and O’Hagan (2009): an intercept, linear terms in the inputs $x_1, \ldots, x_4$, and a quartic polynomial in $x_5$. We label this model “Quartic.” All analyses are carried out with the output y on a log scale, based on standard diagnostics for GaSP models (Jones, Schonlau and Welch, 1998).

TABLE 1
Nilson–Kuusk model: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the holdout set is from a 150-run LHD

                        Correlation function
Regression model    Gauss    PowerExp    Matérn-2    Matérn
Constant            0.116    0.099       0.106       0.102
Select linear       0.115    0.099       0.106       0.105
Full linear         0.110    0.099       0.104       0.104
Quartic             0.118    0.103       0.107       0.106

Table 1 summarizes the results of the study with the 100-point LHD as training data and the 150-point set as a holdout sample. It shows the choice for μ is immaterial: the constant mean is as good as any. For the correlation function, Gauss is inferior to the other choices, there is some evidence that Matérn is preferred to Matérn-2, and there is little difference between PowerExp and Matérn, the best performers. Similar results pertain when the 150-run LHD is used for training and the 100-run set for testing [Table 4 in the supplementary material (Chen et al., 2016)].

The dot plots in Figure 5 for the third study are even more striking in exhibiting the inferiority of R = Gauss and the lack of advantages for any of the nonconstant regression functions. The large variability in performance among designs and holdout sets is similar to that seen for the fast-code replicate experiments of Section 3.
The perturbations of the experiment, from random sampling here, appear to provide a useful reference set for studying the behavior of model choices.

The large differences in prediction accuracy among the correlation functions, not seen in Section 3, deserve some attention. An overly smooth correlation function (the Gaussian) does not perform as well as the Matérn and power-exponential functions here. The latter two have the flexibility to allow needed rougher realizations. With the 150-run design and the constant regression model, for instance, the maximum of the log likelihood increases by about 50 when the power exponential is used instead of the Gaussian, with four of the $p_j$ in (2.2) taking values less than 2.

FIG. 5. Nilson–Kuusk code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

The estimated main effect (Schonlau and Welch, 2006) of $x_5$ in Figure 6 from the GaSP(Const, PowerExp) model shows that $x_5$ has a complex effect. It is also a strong effect, accounting for about 90% of the total variance of the predicted output over the 5-dimensional input space. Bastos and O’Hagan (2009) correctly diagnosed the complexity of this trend. Modeling it via a quartic polynomial in $x_5$ has little impact on prediction accuracy, however. The correlation structure of the GaSP is able to capture the trend implicitly just as well.

FIG. 6. Nilson–Kuusk code: Estimated main effect of $x_5$.

4.2 Volcano Model

A computer model studied by Bayarri et al. (2009) models the process of pyroclastic flow (a fast-moving current of hot gas and rock) from a volcanic eruption. The inputs varied are as follows: initial volume, $x_1$, and direction, $x_2$, of the eruption.
The output, y, is the maximum (over time) height of the flow at a location. A 32-run data set provided by Elaine Spiller [different from that reported by Bayarri et al. (2009) but a similar application] is available in the supplementary material (Chen et al., 2016). Plotting the data shows the output has a strong trend in $x_1$, and putting a linear term in the GaSP surrogate, as modeled by Bayarri et al. (2009), is natural. But is it necessary?

The nature of the data suggests a transformation of y could be useful. The one used by Bayarri et al. (2009) is log(y + 1). Diagnostic plots (Jones, Schonlau and Welch, 1998) from using μ = Const and R = Gauss show that the log transform is reasonable, but a square-root transformation is better still. We report analyses for both transformations.

The regression functions considered are Const, SL ($\beta_0 + \beta_1 x_1$), full linear, and quadratic ($\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2$), because the estimated effect of $x_1$ has a strong trend growing faster than linearly when looking at main effects from the surrogate obtained using $\sqrt{y}$ and GaSP(Const, PowerExp).

Analogous to the approach in Section 4.1, repeat experiments are generated by random sampling of 25 runs from the 32 available to comprise the design for model fitting. The remaining 7 runs form the holdout set. This is repeated 25 times, giving 25 $e_{\mathrm{rmse,ho}}$ values in the dot plots of Figure 7. The conclusions are much like those in Section 4.1: there is usually no need to go beyond μ = Const and PowerExp is preferred to Gauss. The failure of Gauss in the two “slow” examples considered thus far is surprising in light of the widespread use of the Gauss correlation function.

FIG. 7. Volcano model: Normalized holdout RMSE, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. The output variable is either $\sqrt{y}$ or log(y + 1).

4.3 Sea-Ice Model

The Arctic sea-ice model studied in Chapman et al. (1994) and in Loeppky, Sacks and Welch (2009) has 13 inputs, 4 outputs, and 157 available runs.
The previous studies found modest prediction accuracy of GaSP(Const, PowerExp) surrogates for two of the outputs (ice mass and ice area) and poor accuracy for the other two (ice velocity and ice range). The question arises whether use of linear regression terms can increase accuracy to acceptable levels. Using a sampling process like that in Section 4.2 leads to the results in the supplementary material (Chen et al., 2016), where the answer is no: there is no help from μ = SL or FL, nor from changing R. Indeed, FL makes accuracy much worse sometimes.

5. OTHER MODELING STRATEGIES

Clearly, we have not studied all possible paths to GaSP modeling that one might take in a computer experiment. In this section we address several others, some in more detail, and point to issues that could be addressed in the fashion described above.

5.1 Full Bayes

A number of full Bayes approaches have been employed in the literature. They go beyond the statistical formulation using a GP as a prior on the class of functions and assign prior distributions to all parameters, particularly those of the correlation function. For illustration, we examine the GEM-SA implementation of Kennedy (2004), which we call Bayes-GEM-SA. One key aspect is its reliance on R = Gauss. It also uses the following independent prior distributions: $\pi(\beta_0) \propto 1$, $\pi(\sigma^2) \propto 1/\sigma^2$, and $\theta_j$ exponential with rate 0.01 (Kennedy, 2004). When comparing its predictive accuracy with GaSP, μ = Const is used for all models.

For the borehole application, 25 repeat experiments are constructed for three designs, as in Section 3. The dot plots of $e_{\mathrm{rmse,ho}}$ in Figure 8 compare Bayes-GEM-SA with the Gauss and PowerExp methods in Section 3 based on MLEs of all parameters. (The method CGP and its dot plot are discussed in Section 5.2.) Bayes-GEM-SA is less accurate than either GaSP(Const, Gauss) or GaSP(Const, PowerExp).

FIG. 8. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are three base designs: a 27-run OA (left), a 27-run mLHD (middle), and a 40-run mLHD (right). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

Figure 9 similarly depicts results for the G-protein code. With the 40-run mLHD, the Bayesian and likelihood methods all perform about the same, giving only fair prediction accuracy. Increasing n to 80 improves accuracy considerably for all methods (the scales of the two plots are very different), far outweighing any systematic differences between their accuracies.

FIG. 9. G-protein: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. There are two base designs: a 40-run mLHD (left); and an 80-run mLHD (right). For each base design, all 24 permutations between columns give the 24 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

Bayes-GEM-SA performs as well as the GaSP methods for G-protein, not so well for Borehole with n = 27 but adequately for n = 40. Turning to the slow codes in Section 4, a different message emerges. Figure 10 for the Nilson–Kuusk model is based on 25 repeat designs constructed as for Figure 5 with a base design of 150 runs plus 50 randomly chosen from 100. The distributions of $e_{\mathrm{rmse,ho}}$ for Bayes-GEM-SA and Gauss are similar, with PowerExp showing a clear advantage. Moreover, few of the Bayes $e_{\mathrm{rmse,ho}}$ values meet the 0.10 threshold, while all the GaSP(Const, PowerExp) $e_{\mathrm{rmse,ho}}$ values do. Bayes-GEM-SA uses the Gaussian correlation function, which performed relatively poorly in Section 4; the disadvantage carries over to the Bayesian method here.

FIG. 10. Nilson–Kuusk model: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

The results in Figure 11 for the volcano code are for the 25 repeat experiments described in Section 4. Here again PowerExp dominates Bayes and for the same reasons as for the Nilson–Kuusk model. For the $\sqrt{y}$ transformation, all but a few GaSP(Const, PowerExp) $e_{\mathrm{rmse,ho}}$ values meet the 0.10 threshold, in contrast to Bayes where all but a few do not.

These results are striking and suggest that Bayes methods relying on R = Gauss need extension. The “hybrid” Bayes-MLE approach employed by Bayarri et al. (2009) estimates the correlation parameters in PowerExp by their MLEs, fixes them, and takes objective priors for μ and $\sigma^2$. The mean of the predictive distribution for a holdout output value gives the same prediction as GaSP(Const, PowerExp). Whether other “hybrid” forms can be brought to bear effectively needs exploration.

5.2 Nonstationarity

The use of stationary GPs as priors in the face of “nonstationary appearing” functions has attracted a measure of concern despite the fact that all functions with $L_2$-derivative can be approximated using PowerExp with enough data. Of course, there never are enough data. A relevant question is whether other priors, even stationary ones different from those in Section 2, are better reflective of conditions and lead to more accurate predictors.

West et al. (1995) employed a GP prior for $y(\mathbf{x})$ with two additive components: a smooth one for global trend and a rough one to model more local behavior. Recently, a similar “composite” GP (CGP) approach was advanced by Ba and Joseph (2012). These authors used two GPs, both with Gauss correlation. The first has correlation parameters $\theta_j$ in (2.2) constrained to be small for gradually varying longer-range trend, while the second has larger values of $\theta_j$ for shorter-range behavior.
The second, local GP also has a variance that depends on x, primarily as a way to cope with apparent nonstationary behavior. Does this composite approach offer an effective improvement to the simpler choices of Section 2?

We can apply CGP via its R library to the examples studied in Sections 3 and 4, much as was just done for Bayes-GEM-SA. The comparisons in Figure 8 for the borehole function show that GaSP and CGP have similar accuracy for the two 27-run designs. GaSP has smaller error than CGP for the 40-run mLHD, though both methods achieve acceptable accuracy. The results in Figure 9 for G-protein show little practical difference between any of the methods, including CGP. For these two fast-code examples, there is negligible difference between CGP and the GaSP methods. For the models of Section 4, however, conclusions are somewhat different. GaSP(Const, PowerExp) is clearly much more accurate than CGP for the Nilson–Kuusk model (Figure 10) and roughly equivalent for the volcano code (Figure 11).

FIG. 11. Volcano model: Normalized holdout RMSE of prediction, ermse,ho, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

FIG. 12. 2-d function: Holdout predictions versus true values of y from fitting (a) GaSP(Const, PowerExp) and (b) CGP.

Ba and Joseph (2012) gave several examples assessing the performance of CGP. For reasons noted in Sections 6.2 and 6.4 we only look at two.

10-d example. The test function is

\[ y(x) = -\sum_{j=1}^{10} \sin(x_j)\left(\sin(j x_j^2/\pi)\right)^{20} \qquad (0 < x_j < \pi). \]

With n = 100, Ba and Joseph (2012) obtained unnormalized RMSE values of about 0.72–0.74 for CGP and about 0.72–0.88 for a GaSP(Const, Gauss) model over 50 repeat experiments.

This example demonstrates a virtue of using a normalized performance measure. To compute the normalizing factor for RMSE in (2.5), we followed the process of Ba and Joseph (2012). Training data from an LHD with n = 100 gives ȳ, the trivial predictor. The normalization in the denominator of (2.5) is computed from N = 5000 random test points. Repeating this process 50 times gives normalization factors of 0.71–0.74, about the same as the raw RMSE values from CGP. Thus, CGP’s RMSE prediction accuracy is no better than that of the trivial ȳ predictor, and the default GaSP(Const, Gauss) method is worse. Effective prediction here is unattainable by CGP or GaSP, and perhaps by any other approach with n = 100, because the function is so multi-modal; comparisons of CGP with other methods are meaningless in this example.

2-d example. For the function

\[ y(x_1, x_2) = \sin\bigl(1/(x_1 x_2)\bigr) \qquad (0.3 \le x_j \le 1), \tag{5.1} \]

Ba and Joseph (2012) used a single design with n = 24 runs to compare CGP and GaSP(Const, Gauss). Their results suggest that accuracy is poor for both methods, which we confirmed. For this example, following Ba and Joseph (2012), a holdout set of 5000 random points on [0.3, 1]² was used. For one mLHD with n = 24, we obtained ermse,ho values of 0.23 and 0.24 for CGP and GaSP(Const, PowerExp), respectively. Moreover, the diagnostic plot in Figure 12 shows how badly CGP (and GaSP) perform. Both methods grossly over-predict for some points in the holdout set, with GaSP worse in this respect. Both methods also have large errors from under-prediction, with CGP worse.

Does this result generalize? With only two input variables and a function that is symmetric in x1 and x2, repeat experiments cannot be generated by permuting the column labels of the design. Reflecting within the x1 and x2 columns is considered below, but first we created multiple experiments by increasing n.

We were also curious about how large n has to be before acceptable accuracy is attained. Comparisons between CGP and GaSP(Const, PowerExp) were made for n = 24, 25, . . . , 48; for each value of n an mLHD was generated.
The ermse,ho results plotted in Figure 13(a) show that accuracy is not improved substantially for either method as n increases. Indeed, GaSP(Const, PowerExp) gives variable accuracy, with larger values of n sometimes leading to worse accuracy than for n = 24. (The results in Figure 13 for a model with a nugget term are described in Section 5.3.)

FIG. 13. 2-d function: Normalized holdout RMSE of prediction, ermse,ho, versus n for CGP (◦), GaSP(Const, PowerExp) (), and GaSP(Const, PowerExp) with nugget (+).

To try to improve the accuracy, even larger sample sizes were tried. Figure 13(b) shows ermse,ho for n = 50, 60, . . . , 200. Both methods continue to give poor accuracy until n reaches 70, after which there is a slow, unsteady improvement. Curiously, GaSP now dominates.

Permuting between columns of a design does not generate distinct repeat experiments here, but reflecting either or both coordinates about the centers of their ranges maintains the distance properties of the design; that is, x1 on [0.3, 1] is replaced by x1′ = 1.3 − x1, and similarly for x2. Results for the repeat experiments from reflecting within x1, x2, or both x1 and x2 are available in the supplementary material (Chen et al., 2016). They are similar to those in Figure 13.

Thus, CGP dominates here for n ≤ 60: it is inaccurate but less inaccurate than GaSP. For larger n, however, GaSP performs better, reaching the 0.10 threshold for ermse,ho before CGP does. This example demonstrates the potential pitfalls of comparing two methods with a single experiment. A more comprehensive analysis not only gives more confidence in the findings but may also be essential to provide a balanced overview of advantages and disadvantages.

These last two toy functions together with the results in Figures 8–11 show no evidence for the effectiveness of a composite GaSP approach. These findings are in accord with the earlier study by West et al. (1995).

5.3 Adding a Nugget Term

A nugget augments the GaSP model in (2.1) with an uncorrelated ε term, usually assumed to have a normal distribution with mean zero and constant variance σ_ε², independent of the correlated process Z(x). This changes the computation of R and r^T(x) in the conditional prediction (2.4), which no longer interpolates the training data. For data from physical experimentation or observation, augmenting a GaSP model in this way is natural to reflect random errors (e.g., Gao, Sacks and Welch, 1996; McMillan et al., 1999; Styer et al., 1995).

A nugget term has also been widely used for statistical modeling of deterministic computer codes without random error. The reasons offered are that numerical stability is improved, thereby overcoming computational obstacles, and also that a nugget can produce better predictive performance or better confidence or credibility intervals. The evidence, both in the literature and presented here, suggests, however, that for deterministic functions the potential advantages of a nugget term are modest. More systematic methods are available to deal with numerical instability if it arises (Ranjan, Haynes and Karsten, 2011); adding a nugget does not convert a poor predictor into an acceptable one; and other factors may be more important for good statistical properties of intervals (Section 6.1). On the other hand, we also do not find that adding a nugget (and estimating it along with the other parameters) is harmful, though it may produce smoothers rather than interpolators. We now elaborate on these points.

A small nugget, that is, a small value of σ_ε², is often included to improve the numerical properties of R. For the space-filling initial designs in this article, however, Ranjan, Haynes and Karsten (2011) showed that ill-conditioning in a no-nugget GaSP model will only occur for low-dimensional x, high correlation, and large n.
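The effect of a small nugget on the conditioning of R can be illustrated numerically. The sketch below is not taken from the article: it assumes the Gaussian correlation function with a fixed, hypothetical θ, a hypothetical 1-d design of equally spaced points (low dimension, high correlation, fairly large n), and a nugget expressed as the ratio σ_ε²/σ². The function names are illustrative only.

```python
import numpy as np

def gauss_corr(X, theta):
    """Gaussian correlation matrix with entries exp(-sum_j theta_j (x_ij - x_kj)^2)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences, shape (n, n, d)
    return np.exp(-np.sum(theta * diff**2, axis=2))

def condition_number(X, theta, nugget_ratio=0.0):
    """2-norm condition number of R + nugget_ratio * I.

    nugget_ratio plays the role of sigma_eps^2 / sigma^2 (an assumption of this sketch).
    """
    R = gauss_corr(X, theta)
    return np.linalg.cond(R + nugget_ratio * np.eye(len(R)))

# Hypothetical worst case: 30 equally spaced points in one dimension, theta = 1.
X = np.linspace(0.0, 1.0, 30).reshape(-1, 1)
theta = np.array([1.0])
print("no nugget   :", condition_number(X, theta))
print("nugget 1e-6 :", condition_number(X, theta, nugget_ratio=1e-6))
```

With the nugget, every eigenvalue of the matrix is raised by the nugget ratio, so the condition number is bounded by roughly n divided by that ratio; the price is that the predictor no longer interpolates the training data.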
These conditions are not commonly met in initial designs for applications. For instance, none of the computations for this article failed due to ill-conditioning, and those computations involved many repetitions of experiments for the various functions and GaSP models. The worst conditioning occurred for the 2-d example in Section 5.2 with n = 200, but even here the condition numbers of about 10⁶ did not preclude reliable calculations. When a design is not space-filling, matrix ill-conditioning may indeed occur. For instance, a sequential design for, say, optimization or contour estimation (Bingham, Ranjan and Welch, 2014) could lead to runs close together in the x space, causing numerical problems. If ill-conditioning does occur, however, the mathematical solution proposed by Ranjan, Haynes and Karsten (2011) is an alternative to adding a nugget.

A nugget term is also sometimes suggested to improve predictive performance. Andrianakis and Challenor (2012) showed mathematically, however, that with a nugget the RMSE of prediction can be as large as that of a least squares fit of just the regression component in (2.1). Our empirical findings, choosing the size of σ_ε² via its MLE, are similarly unsupportive of a nugget. For example, the 2-d function in (5.1) is hard to predict with a GaSP(Const, PowerExp) model (Figure 13), but the results with a fitted nugget term shown by a “+” symbol in Figure 13 are no different in practice from those of the no-nugget model.

Similarly, repeating the calculations leading to Figure 2 for the borehole function, but fitting a nugget term in all models, shows essentially no difference [the results with a nugget are available in Figure 1 of the supplementary material (Chen et al., 2016)]. The MLE of σ_ε² is either zero or very small relative to the variance of the correlated process: typically σ̂_ε²/σ̂² < 10⁻⁶. These findings are consistent with those of Ranjan, Haynes and Karsten (2011), who found for the borehole function and other applications that constraining the model fit to have at least a modest value of σ_ε² degraded predictive performance.

Another example, the Friedman function,

\[ y(x) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5, \tag{5.2} \]

with n = 25 runs, was used by Gramacy and Lee (2012) to illustrate potential advantages of including a nugget term. Their context (performance criteria, analysis method, and design) differs in all respects from ours. Our results in the top row of Figure 14 show that the GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with n = 25 have highly variable accuracy, with ermse,ho values no better and often much worse than 20%. The effect of the nugget is inconsequential. Increasing the sample size to n = 50 makes a dramatic improvement in prediction accuracy, but the effect of a nugget remains negligible.

The Gramacy and Lee (2012) results are not inconsistent with ours in that they did not report prediction accuracy for this example. Rather, their results relate to the role of the nugget in sometimes obtaining better uncertainty measures when a poor choice of correlation function is inadvertently made, a topic we return to in Section 6.1.

6. COMMENTS

6.1 Uncertainty of Prediction

As noted in Section 1, our attention is directed at prediction accuracy, the most compelling characteristic in practical settings. For example, where the objective is calibration and validation, the details of uncertainty, as distinct from accuracy, in the emulator of the computer model are absorbed (and usually swamped) by model uncertainties and measurement errors (Bayarri et al., 2007). But for specific predictions it is clearly important to have valid uncertainty statements.

Currently, a full assessment of the validity of emulator uncertainty quantification is unavailable.
It has long been recognized that the standard error of prediction can be optimistic when MLEs of the parameters θj, pj, νj in the correlation functions of Section 2.1 are “plugged-in” because the uncertainty in the parameter values is not taken into account (Abt, 1999). Corrections proposed by Abt remain to be done for the settings in which they are applicable.

Bayes credible intervals with full Bayes methods carry explicit and valid uncertainty statements; hybrid methods using priors on some of the correlation parameters (as distinct from MLEs) may also have reliable credible intervals. But for properties such as actual coverage probability (ACP), the proportion of points in a test set with true response values covered by intervals of nominal (say) 95% confidence or credibility, the behavior is far from clear. Chen (2013) compared several Bayes methods with respect to coverage. The results showed variability with respect to equivalent designs like that found above for accuracy, a troubling characteristic pointing to considerable uncertainty about the uncertainty.

FIG. 14. Friedman function: Normalized holdout RMSE of prediction, ermse,ho, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of ermse,ho in a dot plot.

In Figure 15 we see some of the issues. It gives ACP results for the borehole and Nilson–Kuusk functions. The left-hand plot for borehole displays the anticipated under-coverage using plug-in estimates for the correlation parameters. (Confidence intervals here use n − 1 rather than n in the estimate of σ in the standard error and t_{n−1} instead of the standard normal.) PowerExp is slightly better than Gauss, and Bayes-GEM-SA has ACP values close to the nominal 95%.
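As a concrete reading of the ACP definition above, the following minimal sketch counts the proportion of test points whose true output falls inside the interval "prediction ± critical value × standard error". It is an illustration only: the function name, the argument layout, and the toy numbers are hypothetical, and the critical value (for instance the t_{n−1} quantile mentioned above) is assumed to be supplied by the caller.

```python
def actual_coverage_probability(y_true, y_pred, std_err, crit):
    """Proportion of test points with |y_true - y_pred| <= crit * std_err.

    crit is the critical value of the nominal interval, e.g. the 0.975
    quantile of t_{n-1} for a nominal 95% interval (supplied by the caller).
    """
    covered = [abs(y - p) <= crit * s
               for y, p, s in zip(y_true, y_pred, std_err)]
    return sum(covered) / len(covered)

# Toy illustration with made-up numbers (not results from the article):
# n = 10 training runs, so crit is the 0.975 quantile of t_9, about 2.262.
y_true = [0.0, 1.0, 5.0, -10.0]
y_pred = [0.0, 0.0, 0.0, 0.0]
std_err = [1.0, 1.0, 1.0, 1.0]
print(actual_coverage_probability(y_true, y_pred, std_err, crit=2.262))
```

An ACP well below the nominal level, as in this toy case, is what the text calls under-coverage.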
Surprisingly, the plot for the Nilson–Kuusk code on the right of Figure 15 paints a different picture. Plug-in with Gauss and Bayes-GEM-SA both show under-coverage, while plug-in PowerExp has near-ideal properties here. We speculate that the use of the Gauss correlation function by Bayes-GEM-SA is again suboptimal for the Nilson–Kuusk application, just as it was for prediction accuracy.

The supplementary material (Chen et al., 2016) compares models with and without a nugget in terms of coverage properties for the Friedman function in (5.2). The results show that the problem of substantial under-coverage seen in many of the replicate experiments is not solved by inclusion of a nugget term. A modest improvement in the distribution of ACP values is seen, particularly for n = 50, an improvement consistent with the advantage seen in Table 1 of Gramacy and Lee (2012) from fitting a nugget term.

A more complete study is surely needed to clarify appropriate criteria for uncertainty assessment and how modeling choices may affect matters.

FIG. 15. Borehole and Nilson–Kuusk functions: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss), GaSP(Const, PowerExp), and Bayes-GEM-SA. For the borehole function, 25 random permutations between columns of a 40-run mLHD give the 25 values of ACP in a dot plot. For the Nilson–Kuusk function, 25 designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

6.2 Extrapolation

GaSP-based methods are interpolators, so our findings are clearly limited to prediction in the space of the experiment. The design of the computer experiment should cover the region of interest, rendering extrapolation meaningless. If a new region of interest is found, for example, during optimization, the initial computer runs can be augmented; extrapolation can be used to delimit regions that have to be explored further. Of course, extrapolation is necessary in the situation of a new region and a code that can no longer be run. But then the question is how to extrapolate. Initial inclusion of linear or other regression terms may be more useful than just a constant, but it may also be useless, or even dangerous, unless the “right” extrapolation terms are identified. We suspect it would be wiser to examine main effects resulting from the application of GaSP and use them to guide extrapolation.

6.3 Performance Criteria

We have focused almost entirely on questions of predictive accuracy and used RMSE as a measure. The supplementary material (Chen et al., 2016) defines and provides results for a normalized version of maximum absolute error, emax,ho. Other computations we have done use the median of the absolute value of prediction errors, with normalization relative to the trivial predictor from the median of the training output data. These results are qualitatively the same as for ermse,ho: regression terms do not matter, and PowerExp is a reliable choice for R. For slow codes, analysis like that in Section 4 but using emax,ho has some limited value in identifying regions where predictions are difficult, the limitations stemming from a likely lack of coverage of subregions, especially at borders of the unit cube, where the output function may behave badly.

A common performance measure for slow codes uses leave-one-out cross-validation error to produce analogues of ermse,ho and emax,ho, obviating the need for a holdout set. For fast codes, repeat experiments and the ready availability of a holdout set render cross-validation unnecessary, however. For slow codes with only one set of data available, the single assessment from leave-one-out cross-validation does not reflect the variability caused, for example, by the design.
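The leave-one-out idea just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: it uses a constant-mean GP with Gaussian correlation and a fixed, hypothetical θ (no maximum likelihood estimation), and it normalizes by the leave-one-out error of the trivial mean predictor, which is one plausible analogue of ermse,ho rather than a definition given in the text. All function names are illustrative.

```python
import numpy as np

def gauss_corr(A, B, theta):
    """Gaussian correlations between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum(theta * diff**2, axis=2))

def predict_const_gauss(X, y, X_new, theta):
    """Best linear unbiased predictor for a constant-mean GP with fixed theta."""
    R = gauss_corr(X, X, theta)
    r = gauss_corr(X_new, X, theta)
    ones = np.ones(len(y))
    Rinv_y = np.linalg.solve(R, y)
    Rinv_1 = np.linalg.solve(R, ones)
    mu = (ones @ Rinv_y) / (ones @ Rinv_1)   # generalized least squares mean
    return mu + r @ (Rinv_y - mu * Rinv_1)

def loo_normalized_rmse(X, y, theta):
    """Leave-one-out RMSE of the GP divided by that of the mean predictor."""
    n = len(y)
    err_gp = np.empty(n)
    err_trivial = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i             # drop run i, refit, predict run i
        pred = predict_const_gauss(X[keep], y[keep], X[i:i + 1], theta)[0]
        err_gp[i] = y[i] - pred
        err_trivial[i] = y[i] - y[keep].mean()
    return np.sqrt(np.mean(err_gp**2)) / np.sqrt(np.mean(err_trivial**2))

# Hypothetical toy data: a 5 x 5 grid on [0, 1]^2 and a smooth test function.
g = np.linspace(0.0, 1.0, 5)
X = np.array([[a, b] for a in g for b in g])
y = X[:, 0] + X[:, 1] ** 2
print(loo_normalized_rmse(X, y, theta=np.array([5.0, 5.0])))
```

A single number of this kind comes from one design only, which is exactly the limitation noted above: it says nothing about how the value would change under an equivalent design.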
In any case, qualitatively similar conclusions pertain regarding regression terms and correlation functions.

6.4 More Examples

The examples we selected are codes that have been used in earlier studies. We have not incorporated 1-d examples; while instructive for pedagogical reasons, they have little presence in practice. Other applications we could have included (e.g., Gough and Welch, 1994) duplicate the specific conclusions we draw below. There are also “fabricated” test functions in the numerical integration and interpolation literature (Barthelmann, Novak and Ritter, 2000) and some specifically for computer experiments (Surjanovic and Bingham, 2015). They exhibit characteristics sometimes similar to those in Section 5 (large variability in a corner of the space, a condition that inhibits and even prevents construction of effective surrogates) and sometimes no different from the examples in Section 3. Codes that are deterministic but with numerical errors could also be part of a diverse catalogue of test problems. Ideally performance metrics from various approaches would be provided to facilitate comparisons; the suite of examples that we employed is a starting point.

6.5 Designs

The variability in performance over equivalent designs is a striking phenomenon in the analyses of Section 3 and raises questions about how to cope with what seems to be unavoidable bad luck. Are there sequential strategies that can reduce the variability? Are there advantageous design types, more robust to arbitrary symmetries? For example, does it matter whether a random LHD, mLHD, or an orthogonal LHD is used? The latter question is currently being explored by the authors.

That design has a strong role is both unsurprising and surprising. It is not surprising that care must be taken in planning an experiment; it is surprising and perplexing that equivalent designs can lead to such large differences in performance that are not mediated by good analytic procedures.

6.6 Larger Sample Sizes

As noted in Section 1, our attention is on experiments where n is small or modest at most. With advances in computing power it becomes more feasible to mount experiments with larger values of n while, at the same time, more complex codes become feasible but only with limited n. Our focus continues to be on the latter and the utility of GaSP models in that context.

As n gets larger, Figure 2 illustrates that the differences in accuracy among choices of R and μ begin to vanish. Indeed, it is not even clear that using GaSP models for large n is useful; standard function fitting methods such as splines may well be competitive and easier to compute. In addition, when n is large nonstationary behavior can become apparent and encourages variations in the GaSP methodology such as decomposing the input space (as in Gramacy and Lee, 2008) or using a complex μ together with a computationally more tractable R (as in Kaufman et al., 2011). Comparison of alternatives when n is large is yet to be considered.

6.7 Are Regression Terms Ever Useful?

Introducing regression terms is unnecessary in the examples we have presented; a heuristic rationale was given in Section 2.2. The supplementary material (Chen et al., 2016) reports a simulation study with realized functions generated as follows: (1) there are very large linear trends for all xj; and (2) the superimposed sample path from a 0-mean GP is highly nonlinear, that is, a GP with at least one θj ≫ 0 in (2.2). Even under such extreme conditions, the advantage of explicitly fitting the regression terms is limited to a relative (ratio of ermse,ho) advantage, with only small differences in ermse,ho; the presence of a large trend causes a large normalizing factor. Moreover, such functions are not the sort usually encountered in computer experiments. If they do show up, standard diagnostics will reveal their presence and allow effective follow-up analysis (see Section 7.2).

7. CONCLUSIONS AND RECOMMENDATIONS

This article addresses two types of questions. First, how should the analysis methodologies advanced in the study of computer experiments be assessed? Second, what recommendations for modeling strategies follow from applying the assessment strategy to the particular codes we have studied?

7.1 Assessing Methods

We have stressed the importance of going beyond “anecdotes” in making claims for proposed methods. While this point is neither novel nor startling, it is one that is commonly ignored, often because the process of studying consequences under multiple conditions is more laborious. The borehole example (Figure 2), for instance, employs 75 experiments arising from 25 repeats of each of 3 base experiments.

When only one training set of data is available (as can be the case with slow codes), the procedures in Section 4, admittedly ad hoc, nevertheless expand the range of conditions. This brings more generalizability to claims about the comparative performances of competing procedures. The same strategy of creating multiple training/holdout sets is potentially useful in comparing competing methods in physical experiments as well.

The studies in the previous sections lead to the following conclusions:

• There is no evidence that GaSP(Const, PowerExp) is ever dominated by use of regression terms, or other choices of R. Moreover, we have found that the inclusion of regression terms makes the likelihood surface multi-modal, necessitating an increase in computational effort for maximum likelihood or Bayesian methods. This appears to be due to confounding between regression terms and the GP paths.
• Choosing R = Gauss, though common, can be unwise. The Matérn function optimized over a few levels of smoothness is a reasonable alternative to PowerExp.
• Design matters but cannot be controlled completely. Variability of performance from equivalent designs can be uncomfortably large.

There is not enough evidence to settle the following questions:

• Are full Bayes methods ever more accurate than GaSP(Const, PowerExp)? Bayes methods relying on R = Gauss were seen to be sometimes inferior, and extensions to accommodate less smooth R such as PowerExp, perhaps via hybrid Bayes-MLE methods, are needed.
• Are composite GaSP methods ever better than GaSP(Const, PowerExp) in practical settings where the output exhibits nonstationary behavior?

7.2 Recommendations

Faced with a particular code and a set of runs, what should a scientist do to produce a good predictor? Our recommendation is to make use of GaSP(Const, PowerExp), use the diagnostics of Jones, Schonlau and Welch (1998) or Bastos and O’Hagan (2009), and assess whether the GaSP predictor is adequate. If found inadequate, then the scientist should expect no help from introducing regression terms and, until further evidence is found, none from Bayes or CGP approaches either. Of course, trying such methods is not prohibited, but we believe that inadequacy of the GaSP(Const, PowerExp) model is usually a sign that more substantial action must be taken.

We conjecture that the best way to proceed in the face of inadequacy is to devise a second (or multiple) stage process, perhaps by added runs, or perhaps by carving the space into more manageable subregions as well as adding runs.
How best to do this has been partially addressed, for example, by Gramacy and Lee (2008) and Loeppky, Moore and Williams (2010); effective methods constrained by limited runs are not apparent and in need of study.

ACKNOWLEDGMENTS

We thank the referees, Associate Editor, and Editor for suggestions that clarified and broadened the scope of the studies reported here.

The research of Loeppky and Welch was supported in part by grants from the Natural Sciences and Engineering Research Council, Canada.

SUPPLEMENTARY MATERIAL

Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?” (DOI: 10.1214/15-STS531SUPP; .zip). This report (whatcounts-supp.pdf) contains further description of the test functions and data from running them, further results for root mean squared error, findings for maximum absolute error, further results on uncertainty of prediction, and details of the simulation investigating regression terms. Inputs to the Arctic sea-ice code: ice-x.txt. Outputs from the code: ice-y.txt.

REFERENCES

ABT, M. (1999). Estimating the prediction mean squared error in Gaussian stochastic processes with exponential correlation structure. Scand. J. Stat. 26 563–578. MR1734262
ANDRIANAKIS, I. and CHALLENOR, P. G. (2012). The effect of the nugget on Gaussian process emulators of computer models. Comput. Statist. Data Anal. 56 4215–4228. MR2957866
BA, S. and JOSEPH, V. R. (2012). Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 6 1838–1860. MR3058685
BARTHELMANN, V., NOVAK, E. and RITTER, K. (2000). High dimensional polynomial interpolation on sparse grids. Adv. Comput. Math. 12 273–288. MR1768951
BASTOS, L. S. and O’HAGAN, A. (2009). Diagnostics for Gaussian process emulators. Technometrics 51 425–438. MR2756478
BAYARRI, M. J., BERGER, J. O., PAULO, R., SACKS, J., CAFEO, J. A., CAVENDISH, J., LIN, C.-H. and TU, J. (2007). A framework for validation of computer models. Technometrics 49 138–154. MR2380530
BAYARRI, M. J., BERGER, J.
O., CALDER, E. S., DALBEY, K., LUNAGOMEZ, S., PATRA, A. K., PITMAN, E. B., SPILLER, E. T. and WOLPERT, R. L. (2009). Using statistical and computer models to quantify volcanic hazards. Technometrics 51 402–413. MR2756476
BINGHAM, D., RANJAN, P. and WELCH, W. J. (2014). Design of computer experiments for optimization, estimation of function contours, and related objectives. In Statistics in Action (J. F. Lawless, ed.) 109–124. CRC Press, Boca Raton, FL. MR3241971
CHAPMAN, W. L., WELCH, W. J., BOWMAN, K. P., SACKS, J. and WALSH, J. E. (1994). Arctic sea ice variability: Model sensitivities and a multidecadal simulation. J. Geophys. Res. 99C 919–935.
CHEN, H. (2013). Bayesian prediction and inference in analysis of computer experiments. Master’s thesis, Univ. British Columbia, Vancouver.
CHEN, H., LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2016). Supplement to “Analysis Methods for Computer Experiments: How to Assess and What Counts?” DOI:10.1214/15-STS531SUPP.
CURRIN, C., MITCHELL, T., MORRIS, M. and YLVISAKER, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. Amer. Statist. Assoc. 86 953–963. MR1146343
DIXON, L. C. W. and SZEGÖ, G. P. (1978). The global optimisation problem: An introduction. In Towards Global Optimisation (L. C. W. Dixon and G. P. Szegö, eds.) 1–15. North Holland, Amsterdam.
GAO, F., SACKS, J. and WELCH, W. J. (1996). Predicting urban ozone levels and trends with semiparametric modeling. J. Agric. Biol. Environ. Stat. 1 404–425. MR1807773
GOUGH, W. A. and WELCH, W. J. (1994). Parameter space exploration of an ocean general circulation model using an isopycnal mixing parameterization. J. Mar. Res. 52 773–796.
GRAMACY, R. B. and LEE, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc. 103 1119–1130. MR2528830
GRAMACY, R. B. and LEE, H. K. H. (2012). Cases for the nugget in modeling computer experiments. Stat. Comput. 22 713–722. MR2909617
JONES, D. R., SCHONLAU, M. and WELCH, W. J. (1998). Efficient global optimization of expensive black-box functions. J. Global Optim. 13 455–492. MR1673460
JOSEPH, V. R., HUNG, Y. and SUDJIANTO, A. (2008). Blind kriging: A new method for developing metamodels. J. Mech. Des. 130 031102–1–8.
KAUFMAN, C. G., BINGHAM, D., HABIB, S., HEITMANN, K. and FRIEMAN, J. A. (2011). Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann. Appl. Stat. 5 2470–2492. MR2907123
KENNEDY, M. (2004). Description of the Gaussian process model used in GEM-SA. Technical report, Univ. Sheffield. Available at http://www.tonyohagan.co.uk/academic/GEM/.
LIM, Y. B., SACKS, J., STUDDEN, W. J. and WELCH, W. J. (2002). Design and analysis of computer experiments when the output is highly correlated over the input space. Canad. J. Statist. 30 109–126. MR1907680
LOEPPKY, J. L., MOORE, L. M. and WILLIAMS, B. J. (2010). Batch sequential designs for computer experiments. J. Statist. Plann. Inference 140 1452–1464. MR2592224
LOEPPKY, J. L., SACKS, J. and WELCH, W. J. (2009). Choosing the sample size of a computer experiment: A practical guide. Technometrics 51 366–376. MR2756473
MCMILLAN, N. J., SACKS, J., WELCH, W. J. and GAO, F. (1999). Analysis of protein activity data by Gaussian stochastic process models. J. Biopharm. Statist. 9 145–160.
MORRIS, M. D. and MITCHELL, T. J. (1995). Exploratory designs for computational experiments. J. Statist. Plann. Inference 43 381–402.
MORRIS, M. D., MITCHELL, T. J. and YLVISAKER, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics 35 243–255. MR1234641
NILSON, T. and KUUSK, A. (1989). A reflectance model for the homogeneous plant canopy and its inversion. Remote Sens. Environ. 27 157–167.
O’HAGAN, A. (1992). Some Bayesian numerical analysis. In Bayesian Statistics, 4 (Peñíscola, 1991) (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 345–363. Oxford Univ. Press, New York. MR1380285
PICHENY, V., GINSBOURGER, D., RICHET, Y. and CAPLIN, G. (2013). Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 2–13. MR3038476
PRESTON, D. L., TONKS, D. L. and WALLACE, D. C. (2003). Model of plastic deformation for extreme loading conditions. J. Appl. Phys. 93 211–220.
RANJAN, P., HAYNES, R. and KARSTEN, R. (2011). A computationally stable approach to Gaussian process interpolation of deterministic computer simulation data. Technometrics 53 366–378. MR2850469
SACKS, J., SCHILLER, S. B. and WELCH, W. J. (1989). Designs for computer experiments. Technometrics 31 41–47. MR0997669
SACKS, J., WELCH, W. J., MITCHELL, T. J. and WYNN, H. P. (1989). Design and analysis of computer experiments (with discussion). Statist. Sci. 4 409–435. MR1041765
SCHONLAU, M. and WELCH, W. J. (2006). Screening the input variables to a computer model via analysis of variance and visualization. In Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics (A. Dean and S. Lewis, eds.) 308–327. Springer, New York.
STEIN, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
STYER, P., MCMILLAN, N., GAO, F., DAVIS, J. and SACKS, J. (1995). Effect of outdoor airborne particulate matter on daily death counts. Environ. Health Perspect. 103 490–497.
SURJANOVIC, S. and BINGHAM, D. (2015). Virtual library of simulation experiments: Test functions and datasets. Available at http://www.sfu.ca/~ssurjano.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MITCHELL, T. J. and MORRIS, M. D. (1992). Screening, predicting, and computer experiments. Technometrics 34 15–25.
WELCH, W. J., BUCK, R. J., SACKS, J., WYNN, H. P., MORRIS, M. D. and SCHONLAU, M. (1996). Response to James M. Lucas. Technometrics 38 199–203.
WEST, O. R., SIEGRIST, R.
L., MITCHELL, T. J. and JENKINS, R. A. (1995). Measurement error and spatial variability effects on characterization of volatile organics in the subsurface. Environ. Sci. Technol. 29 647–656.

Submitted to the Statistical Science
imsart-sts ver. 2012/04/10 file: TSWLateliabTeadS000001.tel date: December 4, 2015

Supplement to "Analysis Methods for Computer Experiments: How to Assess and What Counts?"
Hao Chen, Jason L. Loeppky, Jerome Sacks and William J. Welch
University of British Columbia and National Institute of Statistical Sciences

Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: hao.chen@stat.ubc.ca; will@stat.ubc.ca). Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada (e-mail: jason.loeppky@ubc.ca). National Institute of Statistical Sciences, Research Triangle Park, NC 27709, USA (e-mail: sacks@niss.org).
*Research supported by the Natural Sciences and Engineering Research Council, Canada. We thank the referees, Associate Editor and Editor for suggestions that clarified and broadened the scope of the studies reported here.

1. TEST FUNCTIONS (FAST CODES)

1.1 Borehole function

The output y is generated by

$$y = \frac{2\pi T_u (H_u - H_l)}{\log(r/r_w)\left(1 + \dfrac{2 L T_u}{\log(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l}\right)},$$

where the 8 input variables and their respective ranges of interest are as in Table 1.

Variable  Description (units)                             Range
r_w       radius of borehole (m)                          [0.05, 0.15]
r         radius of influence (m)                         [100, 50000]
T_u       transmissivity of upper aquifer (m^2/yr)        [63070, 115600]
H_u       potentiometric head of upper aquifer (m)        [990, 1110]
T_l       transmissivity of lower aquifer (m^2/yr)        [63.1, 116]
H_l       potentiometric head of lower aquifer (m)        [700, 820]
L         length of borehole (m)                          [1120, 1680]
K_w       hydraulic conductivity of borehole (m/yr)       [9855, 12045]

Table 1. Borehole function input variables, units, and ranges. All ranges are converted to [0, 1] for statistical modeling.
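The borehole formula and the Table 1 ranges can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names `borehole` and `borehole_unit` and the dictionary `BOREHOLE_RANGES` are illustrative, and the logarithm is assumed to be the natural logarithm, as is usual for this test function.

```python
import math

# Input ranges (lower, upper) as listed in Table 1.
BOREHOLE_RANGES = {
    "rw": (0.05, 0.15),
    "r": (100.0, 50000.0),
    "Tu": (63070.0, 115600.0),
    "Hu": (990.0, 1110.0),
    "Tl": (63.1, 116.0),
    "Hl": (700.0, 820.0),
    "L": (1120.0, 1680.0),
    "Kw": (9855.0, 12045.0),
}


def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Evaluate the borehole output y at inputs given in their original units."""
    log_ratio = math.log(r / rw)  # natural log assumed
    bracket = 1.0 + 2.0 * L * Tu / (log_ratio * rw ** 2 * Kw) + Tu / Tl
    return 2.0 * math.pi * Tu * (Hu - Hl) / (log_ratio * bracket)


def borehole_unit(u):
    """Evaluate y for inputs u (dict keyed like BOREHOLE_RANGES) scaled to [0, 1]."""
    scaled = {
        name: lo + u[name] * (hi - lo)
        for name, (lo, hi) in BOREHOLE_RANGES.items()
    }
    return borehole(**scaled)


if __name__ == "__main__":
    centre = {name: 0.5 for name in BOREHOLE_RANGES}
    print(borehole_unit(centre))  # output at the centre of the unit cube
```

Working on the unit cube, as `borehole_unit` does, matches the statement in the Table 1 caption that all ranges are converted to [0, 1] before modeling.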
Variable  Description             Range
u_1       rate constant           [2 x 10^6, 2 x 10^7]
u_2       rate constant           5 x 10^-3 (fixed)
u_3       rate constant           1 x 10^-3 (fixed)
u_4       rate constant           2 x 10^-3 (fixed)
u_5       rate constant           8 (fixed)
u_6       rate constant           [3 x 10^-5, 3 x 10^-4]
u_7       rate constant           [0.3, 3]
u_8       rate constant           1 (fixed)
x         initial concentration   [1.0 x 10^-9, 1.0 x 10^-6]

Table 2. G-protein code input variables and ranges. All variables are transformed to log scales on [0, 1] for statistical modeling.

1.2 G-protein code

The differential equations generating the output y of the G-protein system dynamics are

$$\dot{\eta}_1 = -u_1 \eta_1 x + u_2 \eta_2 - u_3 \eta_1 + u_5,$$
$$\dot{\eta}_2 = u_1 \eta_1 x - u_2 \eta_2 - u_4 \eta_2,$$
$$\dot{\eta}_3 = -u_6 \eta_2 \eta_3 + u_8 (G_{tot} - \eta_3 - \eta_4)(G_{tot} - \eta_3),$$
$$\dot{\eta}_4 = u_6 \eta_2 \eta_3 - u_7 \eta_4,$$
$$y = (G_{tot} - \eta_3)/G_{tot},$$

where $\eta_1, \ldots, \eta_4$ are concentrations of 4 chemical species, $\dot{\eta}_1 \equiv \partial \eta_1 / \partial t$, etc., and $G_{tot} = 10000$ is the (fixed) total concentration of G-protein complex after 30 seconds.
The input variables in this system are described in Table 2. Only d = 4 inputs are varied: we model y as a function of log(x), log(u_1), log(u_6), log(u_7).

1.3 PTW code

The Preston-Tonks-Wallace (PTW) model describes the plastic deformation of various metals over a range of strain rates and temperatures (Preston, Tonks, and Wallace, 2003). For our purposes the model contains d = 11 input variables (parameters), where three of these are physically based (temperature, strain rate, strain), and the remaining 8 inputs can be used to tune the model to match data from physical experiments. Additional information on the model and the calibration problem can be found in Fugate et al. (2005). The inputs are all scaled to the unit interval [0, 1].

2. DATA

2.1 Volcano code

The data for the 32 runs of the volcano code are in Table 3.

2.2 Arctic sea ice code

The data are available in the supplementary files ice-x.txt and ice-y.txt.

3. FURTHER RESULTS FOR RMSE

Here we give further results for normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$.
Run  Volume  BasalAngle  Height  logHeight  sqrtHeight
  1    9.64       16.48   43.72       1.65        6.61
  2   10.20       15.08  151.16       2.18       12.29
  3    9.83       13.95   68.12       1.84        8.25
  4   10.02       17.61  105.60       2.03       10.28
  5   10.58       16.20  341.39       2.53       18.48
  6   10.77       15.36  540.20       2.73       23.24
  7   10.39       14.23  195.45       2.29       13.98
  8   10.95       17.33  675.34       2.83       25.99
  9    9.55       12.83   38.58       1.60        6.21
 10   10.11       18.73  123.66       2.10       11.12
 11    9.73       19.86   58.17       1.77        7.63
 12    9.92       11.70   81.92       1.92        9.05
 13   10.48       13.11  278.09       2.45       16.68
 14   10.67       18.45  476.99       2.68       21.84
 15   10.30       19.58  168.18       2.23       12.97
 16   10.86       11.98  606.20       2.78       24.62
 17    9.36       14.52   21.93       1.36        4.68
 18    8.80       15.92    8.64       0.98        2.94
 19    9.17       17.05   17.63       1.27        4.20
 20    8.98       13.39   13.93       1.17        3.73
 21    8.42       14.80    0.45       0.16        0.67
 22    8.23       15.64    0.00       0.00        0.00
 23    8.61       16.77    2.39       0.53        1.55
 24    8.05       13.67    0.06       0.02        0.24
 25    9.45       18.17   27.26       1.45        5.22
 26    8.89       12.27   14.68       1.20        3.83
 27    9.27       11.14   30.43       1.50        5.52
 28    9.08       19.30   12.95       1.14        3.60
 29    8.52       17.89    0.08       0.03        0.28
 30    8.33       12.55    1.83       0.45        1.35
 31    8.70       11.42    9.83       1.03        3.14
 32    8.14       19.02    0.00       0.00        0.00

Table 3. Volcano code: inputs (Volume and BasalAngle) and output (Height, logHeight, or sqrtHeight).
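The two transformed responses in Table 3 can be reproduced from the Height column. The sketch below assumes that logHeight is the base-10 logarithm of Height + 1 and that sqrtHeight is the square root of Height. The base of the logarithm is not stated in the text; it is inferred here because it matches the tabulated values to two decimals. The function names are illustrative.

```python
import math

# A few (Height, logHeight, sqrtHeight) rows copied from Table 3
# (runs 1 to 5 and run 22, which has zero height).
TABLE3_ROWS = [
    (43.72, 1.65, 6.61),
    (151.16, 2.18, 12.29),
    (68.12, 1.84, 8.25),
    (105.60, 2.03, 10.28),
    (341.39, 2.53, 18.48),
    (0.00, 0.00, 0.00),
]


def log_height(height):
    """Assumed transform: log10(Height + 1), which is defined at Height = 0."""
    return math.log10(height + 1.0)


def sqrt_height(height):
    """Square-root transform of Height."""
    return math.sqrt(height)


def matches_table(rows, tol=0.006):
    """Check the assumed transforms against the two-decimal table values."""
    return all(
        abs(log_height(h) - lg) < tol and abs(sqrt_height(h) - sq) < tol
        for h, lg, sq in rows
    )


if __name__ == "__main__":
    print(matches_table(TABLE3_ROWS))
```

Adding 1 before taking logarithms keeps the transform finite for the two runs with zero height.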
3.1 Borehole function

Figure 1 shows normalized RMSE results for models with a fitted nugget. Comparison with Figure 2 in the main article for no-nugget models shows little difference, except that fitting a nugget term gives a small increase in the frequency of relatively poor results for models with a full-linear regression.

Fig 1. Borehole function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

3.2 PTW code

Figure 2 gives results for $e_{\mathrm{rmse,ho}}$ for three regression models and four correlation functions. It is produced using the methods in the main paper's Section 3. Figure 3 gives $e_{\mathrm{rmse,ho}}$ results for the Bayes-GEM-SA and CGP methods in the main paper's Sections 5.1 and 5.2. Neither of these figures show practical differences from the various modeling strategies.

3.3 Nilson-Kuusk Model

Table 4 gives results for 150 training runs and 100 runs for the hold-out test set. The roles of the two data sets are switched relative to the main paper's Table 1. Again, Gauss is inferior, and PowerExp and Matérn are the best performers.

3.4 Arctic sea ice

There are 13 input variables and 157 runs of the code available. Repeat experiments were generated by sampling
n = 10d = 130 runs, leaving 27 holdout observations.

Fig 2. PTW code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

Fig 3. PTW code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

                 $e_{\mathrm{rmse,ho}}$ by correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.111   0.080      0.087      0.078
Select linear      0.113   0.080      0.091      0.079
Full linear        0.109   0.079      0.090      0.076
Quartic            0.104   0.079      0.088      0.078

Table 4. Nilson-Kuusk model: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for four regression models and four correlation functions. The experimental data are from a 150-run LHD, and the hold-out set is from a 100-run LHD.

Figure 4 gives results for $e_{\mathrm{rmse,ho}}$ for three regression models and four correlation functions. The four correlation functions give similar results, but the
full-linear regression model is inferior for all four output variables.

Fig 4. Arctic sea ice code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot. Results are given for four output variables: ice mass, ice area, ice velocity, and range of area.

Figure 5 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The four strategies give similar results.
Figure 6 gives $e_{\mathrm{rmse,ho}}$ results for models with estimated nugget terms. Comparison with the results for the same models without a fitted nugget in Figure 4 shows no practical differences.

3.5 2-d example

Figure 7 gives results for the function in (5.1) from three sets of repeat experiments. The designs leading to Figure 13 in the main paper have $x_1$, $x_2$, or both $x_1$ and $x_2$ reflected within columns about the centers of their ranges to generate equivalent designs and new training data. The resulting $e_{\mathrm{rmse,ho}}$ values are plotted in Figure 7.

4. RESULTS FOR MAXIMUM ABSOLUTE ERROR

Here we give results for normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, defined as

$$e_{\mathrm{max,ho}} = \frac{\max_{i=1,\ldots,N} \left| \hat{y}(x_{\mathrm{ho}}^{(i)}) - y(x_{\mathrm{ho}}^{(i)}) \right|}{\max_{i=1,\ldots,N} \left| \bar{y} - y(x_{\mathrm{ho}}^{(i)}) \right|}. \tag{4.1}$$

Again, the measure of error is relative to the performance of the trivial
predictor, $\bar{y}$.

Fig 5. Arctic sea ice code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.

Fig 6. Arctic sea ice code: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{rmse,ho}}$ in a dot plot.
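The normalized maximum absolute error in (4.1) is straightforward to compute from holdout predictions. The sketch below is illustrative, not the authors' code. It assumes that $\bar{y}$ is the mean of the training outputs, supplied by the caller as `y_bar`, and it includes a normalized RMSE on the assumption that it is scaled by the same trivial predictor; that definition is given in the main paper, not in this supplement.

```python
import math


def normalized_max_abs_error(y_true, y_pred, y_bar):
    """e_max,ho as in (4.1): worst prediction error over the worst error of y_bar."""
    worst_model = max(abs(p - t) for p, t in zip(y_pred, y_true))
    worst_trivial = max(abs(y_bar - t) for t in y_true)
    return worst_model / worst_trivial


def normalized_rmse(y_true, y_pred, y_bar):
    """Assumed form of e_rmse,ho: holdout RMSE over the RMSE of the trivial predictor."""
    n = len(y_true)
    rmse_model = math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)
    rmse_trivial = math.sqrt(sum((y_bar - t) ** 2 for t in y_true) / n)
    return rmse_model / rmse_trivial


if __name__ == "__main__":
    y_holdout = [1.0, 2.0, 3.0, 4.0]      # hypothetical holdout outputs
    y_hat = [1.5, 2.0, 3.0, 4.0]          # hypothetical predictions
    y_bar = 2.5                           # hypothetical training mean
    print(normalized_max_abs_error(y_holdout, y_hat, y_bar))
    print(normalized_rmse(y_holdout, y_hat, y_bar))
```

Under either measure a value near 1 means the model does no better than predicting $\bar{y}$ everywhere, which is how values greater than 1 are read later in this supplement.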
Fig 7. 2-d function: Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, versus n for CGP (◦), GaSP(Const, PowerExp) (△) and GaSP(Const, PowerExp) with nugget (+). For each value of n the base mLHD has $x_1$ reflected around the center of its range (top row), $x_2$ reflected (middle row), or both $x_1$ and $x_2$ reflected (bottom row). The left panels cover n = 24 to 48 and the right panels n = 50 to 200.

Fig 8. Borehole function: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.

4.1 Borehole function

Figure 8 gives results for $e_{\mathrm{max,ho}}$ for three regression models and four correlation functions. Figure 9 compares GaSP(Const, Gauss), GaSP(Const, PowerExp),
Bayes-GEM-SA, and CGP. These two figures reporting $e_{\mathrm{max,ho}}$ results are analogous to the main paper's Figures 2 and 8 for $e_{\mathrm{rmse,ho}}$; the same patterns emerge.
Figure 10 gives $e_{\mathrm{max,ho}}$ results for models with estimated nugget terms. Compared with Figure 8, inclusion of a nugget makes little difference except for a small increase in frequency of poor results with select-linear and full-linear regression models.

4.2 G-protein

Figure 11 gives results for $e_{\mathrm{max,ho}}$ for three regression models and four correlation functions. Figure 12 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\mathrm{max,ho}}$ results are analogous to the main paper's Figures 4 and 9 for $e_{\mathrm{rmse,ho}}$. The same conclusions emerge: there is little practical difference between the various modeling strategies.

4.3 PTW code

Figure 13 gives results for $e_{\mathrm{max,ho}}$ for three regression models and four correlation functions, and Figure 14 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\mathrm{max,ho}}$ results are analogous to Figures 2 and 3 in Section 3.2 of this supplement relating to
Fig 9. Borehole function: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. Panels: 27-run OA, 27-run mLHD, 40-run mLHD.

Fig 10. Borehole function: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. There are three base designs: a 27-run OA (top row); a 27-run mLHD (middle row); and a 40-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.
Fig 11. G-protein code: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for all combinations of three regression models and four correlation functions. There are two base designs: a 40-run mLHD (top row); and an 80-run mLHD (bottom row). For each base design, all 24 permutations between columns give the 24 values of $e_{\mathrm{max,ho}}$ in each dot plot.

Fig 12. G-protein code: Normalized holdout maximum error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. Panels: 40-run mLHD, 80-run mLHD.

Fig 13. PTW code: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.
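The Fig 11 caption refers to all 24 permutations between columns of a base design. With the d = 4 varied inputs of the G-protein code there are 4! = 24 ways to assign design columns to inputs. The sketch below shows one way such equivalent designs could be enumerated; it assumes that "permutation between columns" means reordering the columns of the design matrix. The function name and the tiny 3-run design are hypothetical.

```python
from itertools import permutations


def permuted_designs(design):
    """Yield every column-permuted copy of a design given as a list of rows."""
    d = len(design[0])
    for perm in permutations(range(d)):
        # Column j of the new design is column perm[j] of the base design.
        yield [[row[j] for j in perm] for row in design]


if __name__ == "__main__":
    # Hypothetical 3-run, 4-input base design on [0, 1]; not a real mLHD.
    base_design = [
        [0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8],
        [0.9, 0.0, 0.25, 0.75],
    ]
    designs = list(permuted_designs(base_design))
    print(len(designs))  # number of equivalent designs generated
```

For larger d, enumerating all d! orderings is impractical, which is consistent with the other captions using 25 random permutations instead.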
$e_{\mathrm{rmse,ho}}$. The $e_{\mathrm{max,ho}}$ criterion leads to similar conclusions: there are no practical differences in performance.

Fig 14. PTW code: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The base design is a 110-run mLHD; 25 random permutations between columns give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.

4.4 Nilson-Kuusk Model

The $e_{\mathrm{max,ho}}$ results in Tables 5 and 6 are for 100 training runs and 150 training runs, respectively. They are analogous to Table 1 of the main paper and supplementary Table 4, which relate to $e_{\mathrm{rmse,ho}}$. Again, Gauss is inferior. The differences in performance between the other correlation functions are small, but PowerExp is the best performer 7 out of 8 times.

                 $e_{\mathrm{max,ho}}$ by correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.35    0.27       0.28       0.28
Select linear      0.35    0.26       0.27       0.24
Full linear        0.30    0.27       0.28       0.28
Quartic            0.29    0.28       0.29       0.31

Table 5. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for four regression models and four correlation functions. The experimental data are from a 100-run LHD, and the hold-out set is from a 150-run LHD.

                 $e_{\mathrm{max,ho}}$ by correlation function
Regression model   Gauss   PowerExp   Matérn-2   Matérn
Constant           0.21    0.16       0.18       0.17
Select linear      0.25    0.16       0.19       0.18
Full linear        0.23    0.16       0.21       0.17
Quartic            0.23    0.16       0.21       0.17

Table 6. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for four regression models and four correlation functions. The experimental data are from a 150-run LHD, and the hold-out set is from a 100-run LHD.

Figure 15 gives results for $e_{\mathrm{max,ho}}$ for three regression models and four correlation functions, and Figure 16 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\mathrm{max,ho}}$ results are analogous to the main paper's Figures 5 and 10, which relate to $e_{\mathrm{rmse,ho}}$. Figures 15 and 16 lead to the same conclusions: PowerExp performs best, Gauss is inferior, there is no advantage from any of the non-constant regression functions, and neither Bayes-GEM-SA nor CGP perform well here.

Fig 15. Nilson-Kuusk code: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for four regression models and four correlation functions. Twenty-five designs are created from a 150-run LHD base plus 50 random points from a 100-run LHD. The remaining 50 points in the 100-run LHD form the holdout set for each repeat.

Fig 16. Nilson-Kuusk model: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

Fig 17. Volcano model: Normalized holdout maximum absolute error, $e_{\mathrm{max,ho}}$, for three regression models and two correlation functions. Panel columns: Constant, Select linear, Full linear, Quadratic; one panel row per response.

4.5 Volcano code

Figure 17 gives results for $e_{\mathrm{max,ho}}$ for three regression models and four correlation functions, and Figure 18 compares GaSP(Const, Gauss),
GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\mathrm{max,ho}}$ results are analogous to the main paper's Figures 7 and 11, which relate to $e_{\mathrm{rmse,ho}}$. Figure 17 shows that the regression model makes little difference if combined with PowerExp, which always performs well here. For the $\sqrt{y}$ response, the quadratic model makes the performances of the other correlation functions similar to that of PowerExp, but there is no advantage over GaSP(Const, PowerExp). Figure 18 again shows for this application that Gauss and Bayes-GEM-SA are inferior, while PowerExp and CGP perform about the same.

Fig 18. Volcano model: Normalized holdout maximum absolute error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP.

Fig 19. Arctic sea ice code: Normalized holdout maximum error of prediction, $e_{\mathrm{max,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.

4.6 Arctic sea ice

Figure 19 gives results for $e_{\mathrm{max,ho}}$ for three regression models and four correlation functions, and
Figure 20 compares GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. These two figures reporting $e_{\mathrm{max,ho}}$ results are analogous to supplementary Figures 4 and 5 in Section 3.4, which relate to $e_{\mathrm{rmse,ho}}$. Figure 19 shows there is little practical advantage to be gained from a non-constant regression model. The small improvement seen for the response IceVelocity is in the context of $e_{\mathrm{max,ho}}$ often greater than 1 for all models, i.e., no predictive ability. Figure 20 shows that GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP give similar results.
Figure 21 gives $e_{\mathrm{max,ho}}$ results for models with estimated nugget terms. Comparison with the results for the same models without nugget terms in Figure 19 shows no practical improvement.

4.7 Friedman function

Figure 22 compares prediction accuracy of models with and without a nugget via the $e_{\mathrm{max,ho}}$ criterion. As in Figure 14 in the main article, no advantage from fitting a nugget is apparent.
Fig 20. Arctic sea ice code: Normalized holdout maximum error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss), GaSP(Const, PowerExp), Bayes-GEM-SA, and CGP. The latter two methods also have constant regression models. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.

Fig 21. Arctic sea ice code: Normalized holdout maximum error of prediction, $e_{\mathrm{max,ho}}$, for GaSP with all combinations of three regression models and four correlation functions. Every model has a fitted nugget term. The 157 runs available are randomly split 25 times into 130 runs for fitting and 27 hold-out runs to give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.
Fig 22. Friedman function: Normalized holdout maximum error of prediction, $e_{\mathrm{max,ho}}$, for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models with no nugget term versus the same models with a nugget. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of $e_{\mathrm{max,ho}}$ in a dot plot.

5. UNCERTAINTY OF PREDICTION

The results in Figure 23 compare the distribution of ACP values with and without a nugget term for the Friedman function in equation (5.2) of the main article. With n = 25, substantial under-coverage occurs frequently; fitting a nugget term makes little difference. For n = 50 there is a modest improvement in the distribution towards the nominal 95% coverage when a nugget is fitted. Substantial under-coverage is still frequent, however.

6. REGRESSION TERMS

How the inclusion of regression terms performs in extreme settings can be seen in the following simulations. Functions were generated with 5-dimensional input x from a GaSP with regression component $\mu = 10(x_1 + x_2 + x_3 + x_4 + x_5)$, variance 1, and one of four GPs with Gaussian correlation functions differing in their true parameter values. The vector of correlation parameter values was either: (1) $\theta_A = (6.7232, 2.4992, 0.6752, 0.0992, 0.0032)$, which sum to 10; (2) $\theta_B = (3.3616, 1.2496, 0.3376, 0.0496, 0.0016)$, half the values of $\theta_A$; (3) $\theta$ has constant elements with value 2; and (4) $\theta$ has constant elements with value 1.
With sample size n = 50 from a mLHD and 500 random test points, 25 sample functions from the 4 GPs were taken and normalized RMSEs calculated. The average $e_{\mathrm{rmse,ho}}$ for the four simulation scenarios for Const versus FL fits are in Table 7.
What can be drawn from these
numbers is that the strong trend composed with less correlated (rougher) functions gives fitting the linear regression little practical advantage. There can be a relative advantage for FL when $e_{\mathrm{rmse,ho}}$ is very small anyway for both methods.

Fig 23. Friedman function: ACP of nominal 95% confidence or credibility intervals for GaSP(Const, Gauss) and GaSP(Const, PowerExp) models, each without or with a nugget term. There are two base designs: a 25-run mLHD (top row); and a 50-run mLHD (bottom row). For each base design, 25 random permutations between columns give the 25 values of ACP in a dot plot.

Correlation     Average $e_{\mathrm{rmse,ho}}$ when fitting
parameters      GaSP(Const, PowerExp)   GaSP(FL, PowerExp)
$\theta_A$      0.04                    0.02
$\theta_B$      0.02                    0.01
$\theta = 2$    0.08                    0.06
$\theta = 1$    0.04                    0.02

Table 7. Normalized holdout RMSE of prediction, $e_{\mathrm{rmse,ho}}$, from fitting constant versus full-linear regression models. The data are generated using one of four vectors of true values of the correlation parameters.

The trend here is very extreme: a total trend of magnitude 50 across the input
space of all 5 variables, relative to a GP with variance 1. Other simulations with less extreme trend show even less difference between the results for the Const and FL models.

REFERENCES

Fugate, M., Williams, B., Higdon, D., Hanson, K. M., Gattiker, J., Chen, S.-R., and Unal, C. (2005), "Hierarchical Bayesian analysis and the Preston-Tonks-Wallace model," Tech. Rep. LA-UR-05-3935, Los Alamos National Laboratory, Los Alamos, NM.
Preston, D. L., Tonks, D. L., and Wallace, D. C. (2003), "Model of Plastic Deformation for Extreme Loading Conditions," Journal of Applied Physics, 93, 211–220.

Arctic sea ice - 13 input variables
----------------------------------------------------------------------------
Run SnowAlbedo Shortwave MinLead    OceanicHeat Snowfall Longwave AtmosDrag
----------------------------------------------------------------------------
  1 0.5375000  1.4125    2.9250e-02 -2.687500   3.000000 0.27000  1.2538e-03
  2 0.9187501  3.7750    1.9500e-02 -4.375000   2.312500 0.32500  7.2250e-04
  3 0.8812500  2.5750    2.8500e-02 -4.250000   0.375000 0.48000  1.3813e-03
  4 0.5687500  3.8500    1.2750e-02 -1.625000   3.500000 0.13000  1.0200e-03
  5 0.8250001  1.4875    4.6500e-02 -3.812500   4.750000 0.25000  4.8880e-04
  6 0.6125000  1.7500    1.6500e-02 -2.187500   2.437500 0.43000  1.4238e-03
  7 0.9750000  1.3375    5.6250e-02 -1.937500   5.000000 0.50000  1.4025e-03
  8 0.9562500  2.6500    7.5000e-04 -1.875000   2.125000 0.41500  1.7000e-04
  9 0.9687500  3.6250    2.2500e-02 -2.750000   4.000000 0.39500  3.8250e-04
 10 0.8437500  1.6375    1.2000e-02 -3.875000   1.000000 0.28500  2.9750e-04
 11 0.9437500  2.0125    3.9000e-02 -0.562500   3.562500 0.11500  1.6150e-03
 12 0.6875000  2.6125    2.4000e-02 -4.875000   1.625000 0.18000  5.9500e-04
 13 0.5500000  2.3875    1.7250e-02 -4.500000   1.812500 0.25500  7.6500e-04
 14 0.8125000  1.0375    1.4250e-02 -0.250000   0.062500 0.30500  1.5513e-03
 15 0.9062500  1.9375    8.2500e-03 -4.562500   2.812500 0.46500  9.3500e-04
 16 0.7375000  1.2250    6.0000e-03 -0.750000   4.562500 0.38000  1.9130e-04
 17 0.6812500  3.5500    2.2500e-03 -1.187500   3.250000 0.24000  6.3800e-05
 18 0.7875000  1.0750    2.4750e-02 -3.687500   2.062500 0.18500  1.5938e-03
 19 0.8625000  3.1000    2.1750e-02 -3.125000   3.750000 0.35000  1.0413e-03
 20 0.9375000  1.5250    2.0250e-02  0.000000   4.375000 0.40500  8.9250e-04
 21 0.6625000  2.2375    5.2500e-03 -3.750000   2.937500 0.32000  2.3380e-04
 22 0.5937500  1.3750    3.0750e-02 -2.125000   4.312500 0.34000  9.9880e-04
 23 0.7625001  2.8375    3.3750e-02 -1.125000   2.000000 0.13500  1.1688e-03
 24 0.6000000  2.4250    1.8000e-02 -2.562500   4.812500 0.45000  1.0838e-03
 25 0.6750000  1.5625    2.7750e-02 -1.562500   2.687500 0.10500  1.6363e-03
 26 0.5250000  2.0500    9.7500e-03 -0.312500   0.687500 0.42500  4.0380e-04
 27 0.6937500  3.4750    2.3250e-02 -3.062500   0.250000 0.22000  5.7380e-04
 28 0.5125000  3.8125    3.7500e-02 -0.625000   3.187500 0.30000  1.4880e-04
 29 0.7937501  2.9500    4.3500e-02 -2.437500   3.312500 0.12500  1.1475e-03
 30 0.7187500  2.1625    4.9500e-02 -2.625000   2.250000 0.44000  9.1380e-04
 31 0.5562500  1.0000    4.1250e-02 -3.937500   2.375000 0.49000  5.5250e-04
 32 0.5000000  3.9250    3.4500e-02 -0.187500   0.625000 0.48500  1.5088e-03
 33 0.6187500  2.2750    5.9250e-02 -1.375000   2.750000 0.16000  9.5630e-04
 34 0.8187500  1.8625    7.5000e-03 -2.312500   3.687500 0.47000  1.1900e-03
 35 0.9812501  1.7125    4.8000e-02 -4.312500   1.687500 0.14000  1.3175e-03
 36 0.5187500  1.8250    4.0500e-02 -0.437500   3.125000 0.34500  2.5500e-04
 37 0.6500000  3.9625    0.0000e+00 -0.125000   3.625000 0.31000  8.0750e-04
 38 0.9250000  1.2625    4.5000e-03 -2.062500   4.875000 0.19500  5.1000e-04
 39 0.8500000  3.2125    1.5000e-02 -4.062500   2.562500 0.26000  1.7000e-03
 40 0.7500000  2.8750    4.2000e-02 -3.000000   4.687500 0.31500  4.6750e-04
 41 0.8937500  3.4375    3.0000e-03 -3.375000   1.437500 0.45500  6.1630e-04
 42 0.7250000  2.8000    4.7250e-02 -1.437500   3.875000 0.21000  1.4450e-03
 43 0.5812500  1.1500    2.6250e-02 -4.625000   4.250000 0.21500  7.4380e-04
 44 0.9500001  2.7625    6.7500e-03
-4.437500 3.375000 0.33000 6.8000e-04 45 0.9625000 3.2500 1.8750e-02 -0.687500 0.312500 0.36000 2.1300e-05 46 0.8000000 1.3000 2.1000e-02 -0.062500 0.187500 0.20500 1.2325e-03 47 0.5437500 2.2000 5.7000e-02 -4.750000 0.000000 0.27500 2.7630e-04 48 0.7312500 3.3250 5.8500e-02 -3.187500 1.250000 0.29500 1.0625e-03 49 0.5875000 4.0000 6.0000e-02 -4.812500 4.937500 0.38500 3.1880e-04 50 0.9312500 3.4000 1.5000e-03 -0.375000 1.062500 0.33500 1.4875e-03 51 0.9125000 3.5125 2.5500e-02 -3.625000 0.562500 0.41000 1.1050e-03 52 0.7000000 2.7250 3.1500e-02 -0.812500 1.187500 0.17500 8.5000e-05 53 0.7125000 1.9750 5.7750e-02 -1.062500 2.625000 0.37500 1.5725e-03 54 0.6375000 1.1875 3.0000e-02 -0.500000 2.187500 0.28000 1.2750e-03 55 0.7687500 3.2875 4.5750e-02 -2.812500 1.937500 0.36500 4.2500e-05 56 0.5625000 2.3125 5.1000e-02 -3.437500 1.750000 0.15000 1.3600e-03 57 0.8875001 1.9000 1.3500e-02 -5.000000 1.562500 0.26500 3.6130e-04 58 0.7562500 3.6625 5.3250e-02 -1.750000 1.125000 0.42000 8.2880e-04 59 0.9875000 3.0625 1.0500e-02 -2.375000 0.750000 0.11000 4.2500e-04 60 0.6062500 1.1125 5.0250e-02 -1.250000 0.125000 0.23000 1.2113e-03 61 0.7062500 2.3500 3.3000e-02 -4.125000 3.812500 0.15500 9.7750e-04 62 0.8375000 2.6875 3.2250e-02 -3.312500 2.875000 0.16500 1.1263e-03 63 0.7812500 3.1375 1.1250e-02 -2.250000 3.937500 0.47500 3.4000e-04 64 0.6562500 1.7875 3.6750e-02 -3.562500 3.437500 0.43500 1.0630e-04 65 1.0000000 3.7375 5.1750e-02 -4.937500 1.312500 0.17000 1.6788e-03 66 0.8562501 2.5375 2.7000e-02 -0.937500 1.875000 0.23500 1.2750e-04 67 0.8062500 2.0875 4.8750e-02 -4.687500 0.875000 0.22500 8.7130e-04 68 0.5062500 2.5000 5.5500e-02 -1.812500 0.937500 0.19000 6.5880e-04 69 0.6437500 3.5875 5.4750e-02 -4.187500 4.125000 0.39000 1.2963e-03 70 0.6250000 2.4625 5.4000e-02 -1.500000 1.500000 0.35500 1.4663e-03 71 0.7437500 3.8875 3.7500e-03 -3.250000 1.375000 0.40000 4.4630e-04 72 0.5312500 1.4500 3.5250e-02 -3.500000 4.187500 0.49500 7.8620e-04 73 0.5750000 1.6000 
3.8250e-02 -4.000000 4.437500 0.20000 7.0130e-04 74 0.8312500 3.3625 1.5750e-02 -2.875000 4.500000 0.24500 1.5300e-03 75 0.9000000 2.1250 3.9750e-02 -0.875000 0.812500 0.37000 0.0000e+00 76 0.8750000 3.1750 3.6000e-02 -2.000000 0.500000 0.14500 8.5000e-04 77 0.6312500 1.6750 4.4250e-02 -2.500000 4.062500 0.12000 1.3388e-03 78 0.7750000 3.7000 9.0000e-03 -1.687500 4.625000 0.46000 2.1250e-04 79 0.8687500 3.0250 4.5000e-02 -1.312500 2.500000 0.10000 5.3130e-04 80 0.6687500 2.9125 5.2500e-02 -2.937500 3.062500 0.29000 6.3750e-04 81 0.9937500 2.9875 4.2750e-02 -1.000000 0.437500 0.44500 1.6575e-03 1 0.9991900 3.9870 5.5124e-02 -0.003655 4.733500 0.11675 4.4798e-05 2 0.9137500 1.4141 2.3801e-02 -1.134500 0.000165 0.42088 1.6997e-03 3 0.6640300 1.6224 5.9901e-02 -2.695600 0.446290 0.21206 1.9932e-07 4 0.5002000 3.5541 6.0000e-02 -0.041606 2.177700 0.48623 1.6968e-03 5 0.5820400 2.4135 7.6685e-03 -0.186300 0.373040 0.10426 1.3004e-03 6 0.9437900 3.7724 4.3544e-03 -1.366200 0.001933 0.48634 1.1332e-03 7 0.9964700 2.4422 1.6938e-03 -4.996300 4.109500 0.49977 8.7192e-04 8 0.5048400 2.8929 4.5013e-02 -4.292900 4.872700 0.35383 3.8969e-04 9 0.8992900 1.0019 4.9885e-02 -2.084700 4.814600 0.10000 1.7000e-03 10 0.9263000 3.8506 1.3151e-05 -4.171600 2.728000 0.48467 8.1255e-05 11 0.5459600 3.4211 2.7146e-02 -1.485900 1.225100 0.13342 1.5818e-03 12 0.8024200 3.9657 1.9113e-02 -3.767900 0.260440 0.49277 1.2801e-03 13 0.9857600 2.7828 3.6267e-02 -0.756700 0.314730 0.44982 9.6849e-04 14 0.8233100 3.8589 3.3620e-02 -4.323900 1.087400 0.10136 2.3049e-04 15 0.9310200 2.9240 2.5238e-02 -3.318000 0.570270 0.10193 1.3439e-05 16 0.9986800 1.0089 3.9249e-02 -0.379740 0.567030 0.28480 1.0033e-03 17 0.7779300 1.4269 4.2311e-02 -0.324530 2.789300 0.39036 8.0921e-04 18 0.8660800 3.8006 5.2622e-02 -4.692500 4.037000 0.36843 1.5447e-03 19 0.5102700 3.2743 9.9253e-03 -4.680000 0.034325 0.40482 1.5879e-03 20 0.6070300 3.5920 3.3641e-02 -4.944800 0.730160 0.47566 1.0331e-03 21 0.7090600 1.1936 
3.7699e-02 -3.029900 1.591300 0.49650 1.2205e-03 22 0.7863700 2.1362 2.1789e-03 -4.574600 4.981400 0.34442 1.1849e-03 23 0.8848300 1.5329 1.7370e-02 -0.189730 2.248300 0.18032 4.4391e-04 24 0.5007000 1.8788 5.6843e-02 -2.502400 0.351810 0.32744 5.3733e-04 25 0.8413200 1.1843 4.6963e-02 -0.051837 3.320600 0.19726 9.7577e-04 26 0.7376800 1.7172 1.8621e-02 -3.569800 1.550100 0.10151 1.6551e-03 27 0.7616500 1.3332 6.3316e-03 -3.489300 4.885700 0.40881 9.8459e-06 28 0.7960400 3.0398 5.8490e-02 -1.532200 0.567920 0.41917 6.4355e-04 29 0.6966200 1.1003 1.1793e-02 -4.963700 4.532900 0.32134 3.8363e-06 30 0.9893800 1.3506 1.8324e-02 -0.584130 4.615500 0.15448 1.4794e-03 31 0.6335100 1.7956 7.7952e-05 -2.075000 4.495400 0.22400 5.1666e-04 32 0.9691300 2.5669 5.9539e-02 -4.997200 2.992600 0.46340 1.6846e-04 33 0.5962800 3.3083 1.3846e-03 -0.988880 1.151200 0.18487 1.6965e-03 34 0.5201800 3.7137 5.9985e-02 -3.953900 0.867360 0.12697 1.2211e-03 35 0.9595000 3.4952 2.5877e-02 -0.249320 3.137400 0.20007 1.6621e-03 36 0.9085400 2.9695 4.4311e-03 -0.387590 2.069300 0.19999 1.3846e-03 37 0.9325500 1.3707 3.8164e-02 -3.655100 4.075200 0.14425 6.3643e-04 38 0.5787800 3.7284 5.1964e-02 -0.608960 2.428000 0.16864 2.8120e-04 39 0.6507100 1.6706 3.4095e-02 -4.560500 4.439400 0.29230 1.3790e-03 40 0.8890400 2.4600 2.9733e-02 -4.442600 0.121770 0.16841 1.3667e-04 41 0.7230000 3.9887 1.1652e-02 -4.345600 0.521430 0.25889 1.5484e-03 42 0.7504000 1.5944 2.8494e-02 -0.008915 3.453200 0.10464 3.6870e-04 43 0.5451300 4.0000 5.2843e-03 -3.386300 2.470500 0.21652 8.4651e-04 44 0.9739700 3.9036 3.9006e-02 -4.653100 2.148200 0.24524 1.1371e-03 45 0.9497300 2.3202 2.9230e-02 -2.452600 2.855900 0.34722 1.6616e-03 46 0.8633500 1.4594 5.9998e-02 -3.536700 1.345900 0.49557 1.4861e-03 47 0.5782000 1.9411 2.1752e-02 -3.614600 2.148300 0.33351 2.2064e-07 48 0.5543000 2.7197 4.3162e-02 -2.704700 0.050771 0.46717 1.4540e-03 49 0.6866300 3.2062 2.0000e-02 -1.637800 2.361300 0.31280 6.9147e-04 50 0.9440800 
3.0978 1.5482e-02 -1.140000 4.999900 0.20973 1.5361e-04 1 0.520833 1.425 0.0145 -3.458333 3.375000 0.176667 9.775000e-04 2 0.612500 3.975 0.0475 -4.041667 1.958333 0.403333 6.375000e-04 3 0.970833 3.225 0.0495 -0.791667 3.458333 0.350000 5.525000e-04 4 0.745833 1.275 0.0155 -2.708333 0.208333 0.336667 7.791667e-04 5 0.562500 1.825 0.0055 -0.125000 0.875000 0.416667 1.600833e-03 6 0.954167 1.375 0.0405 -4.291667 1.541667 0.423333 1.374167e-03 7 0.654167 3.775 0.0185 -2.208333 0.708333 0.390000 8.641667e-04 8 0.770833 1.975 0.0325 -2.541667 4.458333 0.183333 1.487500e-03 9 0.829167 2.175 0.0265 -1.291667 4.375000 0.296667 8.075000e-04 10 0.637500 1.125 0.0025 -2.291667 4.791667 0.143333 1.119167e-03 11 0.729167 3.925 0.0565 -2.125000 3.125000 0.156667 7.225000e-04 12 0.712500 3.825 0.0065 -0.458333 1.791667 0.190000 1.175833e-03 13 0.870833 2.425 0.0555 -4.958333 2.041667 0.290000 9.491667e-04 14 0.537500 2.875 0.0195 -4.625000 3.875000 0.116667 1.841667e-04 15 0.679167 2.725 0.0095 -3.708333 2.208333 0.370000 1.685833e-03 16 0.587500 3.575 0.0075 -2.958333 1.375000 0.323333 8.358333e-04 17 0.604167 1.475 0.0235 -4.125000 0.458333 0.356667 2.691667e-04 18 0.737500 1.075 0.0225 -1.375000 0.958333 0.470000 1.402500e-03 19 0.670833 3.425 0.0315 -0.541667 0.791667 0.270000 5.241667e-04 20 0.879167 3.725 0.0295 -3.291667 3.541667 0.410000 1.459167e-03 21 0.720833 3.025 0.0275 -1.791667 0.041667 0.283333 1.657500e-03 22 0.837500 3.675 0.0465 -1.125000 3.041667 0.110000 1.005833e-03 23 0.979167 2.025 0.0175 -3.375000 4.208333 0.310000 1.232500e-03 24 0.920833 1.175 0.0585 -4.375000 3.791667 0.363333 4.108333e-04 25 0.695833 2.075 0.0515 -0.041667 1.041667 0.210000 1.034167e-03 26 0.595833 2.125 0.0425 -1.458333 1.875000 0.276667 3.825000e-04 27 0.895833 2.225 0.0105 -2.791667 3.291667 0.150000 1.430833e-03 28 0.754167 3.475 0.0365 -4.541667 2.291667 0.383333 4.250000e-05 29 0.820833 1.325 0.0355 -2.458333 0.625000 0.216667 1.629167e-03 30 0.579167 2.375 0.0395 -1.708333 
2.958333 0.223333 1.572500e-03 31 0.554167 3.125 0.0305 -1.875000 2.708333 0.396667 2.408333e-04 32 0.987500 1.925 0.0345 -3.208333 4.625000 0.170000 6.091667e-04 33 0.662500 2.775 0.0385 -0.958333 0.125000 0.476667 4.958333e-04 34 0.629167 2.975 0.0415 -0.208333 1.708333 0.463333 2.975000e-04 35 0.570833 3.525 0.0445 -0.875000 4.958333 0.256667 9.208333e-04 36 0.845833 2.825 0.0285 -4.208333 4.541667 0.430000 1.062500e-03 37 0.529167 3.275 0.0245 -1.208333 2.875000 0.450000 4.391667e-04 38 0.504167 1.675 0.0205 -3.041667 0.541667 0.243333 2.125000e-04 39 0.962500 2.275 0.0135 -3.958333 1.208333 0.376667 1.090833e-03 40 0.620833 3.175 0.0085 -4.875000 1.125000 0.130000 7.508333e-04 41 0.645833 3.875 0.0455 -3.125000 2.375000 0.456667 1.544167e-03 42 0.995833 2.625 0.0525 -3.625000 3.708333 0.483333 1.515833e-03 43 0.704167 2.525 0.0005 -4.458333 0.375000 0.436667 1.345833e-03 44 0.887500 2.325 0.0485 -0.375000 3.958333 0.496667 6.941667e-04 45 0.762500 3.075 0.0595 -3.791667 3.625000 0.236667 1.317500e-03 46 0.862500 1.875 0.0505 -0.708333 1.291667 0.250000 1.260833e-03 47 0.929167 2.675 0.0045 -1.541667 4.708333 0.263333 1.204167e-03 48 0.912500 1.525 0.0435 -4.791667 3.208333 0.490000 6.658333e-04 49 0.512500 1.625 0.0255 -1.958333 2.458333 0.316667 1.147500e-03 50 0.787500 2.475 0.0035 -3.875000 4.041667 0.303333 1.416667e-05 51 0.937500 3.325 0.0535 -2.625000 0.291667 0.163333 4.675000e-04 52 0.779167 1.575 0.0575 -2.875000 1.625000 0.123333 9.916667e-05 53 0.795833 2.925 0.0165 -0.625000 2.625000 0.136667 8.925000e-04 54 0.854167 1.025 0.0015 -3.541667 2.791667 0.203333 3.258333e-04 55 0.804167 1.775 0.0545 -1.041667 4.291667 0.330000 1.558333e-04 56 0.545833 1.225 0.0115 -0.291667 4.125000 0.443333 1.275000e-04 57 0.687500 3.375 0.0375 -4.708333 4.875000 0.103333 1.289167e-03 58 0.904167 1.725 0.0335 -1.625000 2.541667 0.343333 3.541667e-04 59 0.812500 2.575 0.0215 -2.041667 1.458333 0.196667 7.083333e-05 60 0.945833 3.625 0.0125 -2.375000 2.125000 0.230000 
5.808333e-04 ------------------------------------------------------------------- Run SensHeat LatentHeat LogIceStr OceanicDrag OpenAlbedo IceAlbedo ------------------------------------------------------------------- 1 1.3500000 0.2650000 3.864333 6.250000 8.7500e-02 0.3250000 2 3.3000002 0.8287500 4.064333 0.500000 4.5625e-01 0.6000000 3 0.3000000 0.4025000 5.189333 1.125000 3.1875e-01 0.4562500 4 2.7000001 0.6500000 4.564333 4.875000 1.5625e-01 0.5500000 5 1.0500001 1.2000000 5.289333 5.250000 5.6250e-02 0.3562500 6 0.4500000 0.4575000 4.339333 5.500000 4.4375e-01 0.4625000 7 3.5250001 1.1725000 5.439333 6.875000 4.8750e-01 0.3437500 8 5.4750004 0.6362500 3.814333 3.250000 2.6250e-01 0.3000000 9 1.7250000 0.2375000 3.839333 7.875000 3.0625e-01 0.7250000 10 1.2000001 0.5812500 4.989333 2.250000 2.3750e-01 0.8000000 11 2.8500001 0.1000000 3.939333 3.750000 2.5000e-02 0.3687500 12 5.2500000 0.2925000 3.689333 9.000000 7.5000e-02 0.7437500 13 0.0000000 0.2512500 4.264333 1.500000 1.2500e-01 0.4312500 14 1.1250000 0.9112500 3.464333 6.000000 4.0625e-01 0.4250000 15 1.4250001 0.2237500 3.714333 4.750000 1.3750e-01 0.4812500 16 0.7500000 0.9937500 4.364333 7.375000 1.4375e-01 0.3375000 17 3.9750001 0.5537500 4.864333 8.000000 3.8750e-01 0.7812500 18 4.3500004 1.1312500 5.014333 5.750000 5.0000e-02 0.7000001 19 5.3250003 0.8562501 5.414333 8.125000 4.3750e-02 0.7625001 20 4.0500002 0.5950000 4.414333 1.875000 2.4375e-01 0.3625000 21 2.1000001 0.7600000 4.289333 7.250000 1.8125e-01 0.5937500 22 4.7250004 0.7737500 4.714333 9.375000 3.1250e-01 0.5062500 23 4.2750001 0.9525001 4.689333 1.625000 1.2500e-02 0.5687500 24 4.4250002 0.3887500 4.439333 4.500000 3.4375e-01 0.6250000 25 2.2500000 0.5125000 5.064333 1.000000 2.0000e-01 0.3062500 26 3.1500001 0.6637500 3.789333 9.750000 1.1250e-01 0.5812500 27 4.5750003 0.7050000 3.614333 3.000000 4.6875e-01 0.7687500 28 1.9500001 0.3612500 4.789333 7.125000 2.2500e-01 0.6062501 29 3.8250001 0.9387500 4.664333 1.250000 1.6875e-01 
0.6875000 30 2.1750002 0.1825000 5.314333 7.750000 2.9375e-01 0.4937500 31 5.8500004 0.4987500 4.764333 10.000000 3.6875e-01 0.6125000 32 2.3250001 0.2787500 3.989333 2.125000 3.7500e-01 0.4062500 33 3.9000001 0.3200000 3.539333 7.625000 1.5000e-01 0.6625000 34 5.0250001 0.7325000 4.464333 8.500000 3.2500e-01 0.7187500 35 3.0000000 0.1550000 5.139333 4.000000 3.6250e-01 0.4375000 36 1.5000000 0.2100000 4.239333 9.625000 0.0000e+00 0.6937500 37 5.7750001 0.8837500 4.089333 6.375000 1.7500e-01 0.6187500 38 4.6500001 0.8150000 3.514333 0.625000 1.3125e-01 0.3812500 39 1.5750001 0.6225000 3.739333 8.750000 1.1875e-01 0.7125000 40 2.9250002 0.4712500 3.964333 9.500000 3.5625e-01 0.3500000 41 3.7500002 0.4437500 3.764333 1.750000 2.7500e-01 0.6812500 42 6.0000000 0.9800000 4.114333 6.125000 4.3750e-01 0.4750000 43 0.6750000 1.1450000 3.639333 6.625000 3.1250e-02 0.5312500 44 1.2750001 1.0625000 4.889333 8.250000 2.6875e-01 0.7062500 45 2.4000001 1.0074999 3.664333 0.000000 1.9375e-01 0.4875000 46 4.8000002 0.5262500 3.914333 0.375000 1.0625e-01 0.7375000 47 3.2250001 0.3337500 5.239333 5.000000 4.5000e-01 0.5187500 48 4.5000000 1.0212500 4.164333 6.500000 2.5625e-01 0.6750000 49 4.9500003 0.4850000 4.639333 2.375000 4.9375e-01 0.3312500 50 0.9750000 1.1587501 4.814333 4.250000 3.3125e-01 0.4687500 51 0.8250001 0.3475000 4.914333 0.250000 5.0000e-01 0.3187500 52 2.0250001 1.1037500 5.039333 3.125000 9.3750e-02 0.6375001 53 2.4750001 0.9662500 3.439333 4.375000 4.6250e-01 0.7875000 54 5.9250002 0.3750000 3.589333 3.375000 2.8125e-01 0.4500000 55 4.1250000 0.6775000 3.564333 2.500000 3.8125e-01 0.5375000 56 4.2000003 0.1275000 4.739333 3.875000 2.0625e-01 0.3750000 57 1.8750001 0.4162500 4.014333 8.375000 2.1875e-01 0.3937500 58 1.8000001 0.7875000 4.189333 9.125000 6.8750e-02 0.5250000 59 3.6000001 0.4300000 4.964333 5.125000 2.8750e-01 0.5562500 60 1.6500001 0.3062500 4.939333 9.875000 3.0000e-01 0.4187500 61 0.6000000 1.0762500 4.214333 2.750000 4.1250e-01 0.5125001 62 
2.6250000 1.0900000 5.214333 5.625000 4.0000e-01 0.5437501 63 5.1750002 0.1962500 5.264333 0.750000 2.1250e-01 0.6500000 64 5.1000004 1.0487500 5.114333 3.625000 4.7500e-01 0.5750001 65 3.3750002 0.9250000 5.389333 1.375000 8.1250e-02 0.5000000 66 2.7750001 0.8425000 4.314333 4.125000 6.2500e-03 0.3125000 67 0.1500000 0.7187500 4.389333 9.250000 3.7500e-02 0.6312500 68 5.6250000 0.8975000 4.139333 7.500000 3.5000e-01 0.7312501 69 5.5500002 0.5675000 4.514333 6.750000 4.8125e-01 0.4125000 70 2.5500002 0.1412500 5.164333 0.875000 4.2500e-01 0.5875000 71 0.2250000 0.8700000 3.489333 2.875000 4.1875e-01 0.4000000 72 0.5250000 0.5400000 5.339333 8.625000 6.2500e-02 0.6687501 73 4.8750000 1.1862500 5.089333 5.875000 2.3125e-01 0.4437500 74 3.0750001 0.1687500 4.839333 7.000000 2.5000e-01 0.5625000 75 0.9000000 1.1175001 4.539333 0.125000 4.3125e-01 0.6562500 76 0.0750000 0.1137500 5.364333 2.000000 1.8750e-01 0.7562500 77 5.4000001 0.8012500 4.489333 4.625000 1.6250e-01 0.7750000 78 3.6750002 0.7462500 3.889333 5.375000 3.3750e-01 0.3875000 79 3.4500001 0.6087501 4.039333 8.875000 1.8750e-02 0.6437500 80 0.3750000 1.0350000 4.614333 3.500000 1.0000e-01 0.7937501 81 5.7000003 0.6912500 4.589333 2.625000 3.9375e-01 0.7500000 1 5.9997000 1.1589000 3.728167 1.120600 3.1427e-01 0.7998700 2 2.2264000 1.0322000 4.010978 0.410670 4.9997e-01 0.5514900 3 5.7302000 0.4378000 3.460086 1.269900 6.7792e-02 0.3066200 4 0.4633100 0.1433800 5.437766 1.544100 2.6867e-02 0.3950900 5 3.1825000 0.8446400 5.011613 0.035989 7.7866e-02 0.6735800 6 0.0073140 0.8061900 5.121429 2.285800 5.1269e-02 0.7260500 7 4.5960000 0.3383300 4.627888 0.351580 4.0538e-01 0.3105800 8 0.0095330 0.1834300 3.451648 0.348540 4.7205e-01 0.6761800 9 4.9261000 0.5693300 4.822502 2.712100 3.9996e-03 0.3006600 10 3.2789000 0.7215500 4.694280 0.906750 4.9718e-01 0.4507400 11 4.8171000 1.0666000 5.397262 0.219020 2.1799e-01 0.3492300 12 2.6602000 0.2106300 3.505272 8.842800 4.3526e-01 0.5195100 13 3.3406000 0.9153600 
3.970049 9.912100 4.8917e-02 0.7986400 14 3.9517000 0.1026200 3.619709 5.770200 2.9396e-01 0.3253200 15 1.3097000 0.3653500 5.431509 5.365000 1.6848e-01 0.7375500 16 3.8246000 0.2584900 5.279804 6.906800 4.9844e-01 0.3850200 17 0.6145200 0.3929000 4.929945 1.147500 4.5226e-01 0.7633100 18 5.9957000 0.2867400 3.948447 2.996400 3.3700e-01 0.6794200 19 0.8468600 0.9316100 5.343546 3.517400 1.6954e-01 0.3145500 20 2.8816000 1.1733000 4.070333 3.374400 1.4065e-01 0.7035300 21 0.1761700 0.7625300 4.854409 4.690900 1.2050e-01 0.6594000 22 5.9876000 0.1235600 5.077077 7.326300 1.0754e-01 0.3577100 23 1.5836000 0.7963400 3.906195 4.072000 4.9639e-01 0.6547400 24 2.2911000 0.5222600 3.656443 0.919570 4.0402e-01 0.4368500 25 4.4842000 0.5002400 4.269186 0.074368 1.9276e-01 0.3758600 26 5.7496000 1.1923000 4.574610 9.496700 2.6735e-01 0.7625000 27 5.3283000 0.2642800 4.100715 9.935100 2.7406e-02 0.4520400 28 0.7680000 1.1925000 4.266373 2.410800 3.6799e-01 0.4041200 29 5.0351000 1.1999000 3.521766 1.757200 1.2282e-01 0.4990300 30 5.4754000 0.5220900 3.445121 2.322700 4.5116e-01 0.6973200 31 2.0068000 0.1011400 5.247728 6.105200 3.8664e-01 0.7966800 32 5.9057000 0.9582200 5.359019 0.667470 5.1881e-02 0.6393800 33 0.0089600 0.7051800 5.251395 7.741700 4.2043e-01 0.5757500 34 0.6751200 0.1034700 4.074304 6.929800 1.9812e-01 0.3000800 35 3.8655000 0.8704800 5.164977 9.988700 1.3020e-01 0.5172900 36 2.6345000 1.0018000 4.432119 7.989100 2.1750e-01 0.4211900 37 1.1061000 0.4695100 5.206394 8.693400 3.4598e-03 0.6242100 38 2.3969000 0.9173100 4.514893 1.493000 2.5622e-01 0.4858800 39 2.3406000 0.3240300 3.831735 0.075667 4.9214e-02 0.7449100 40 3.0248000 0.5017600 4.380880 5.084100 4.7197e-01 0.7994100 41 5.5046000 0.3263400 4.447065 9.524200 1.0187e-02 0.5387300 42 4.2783000 0.7330300 5.172048 7.202100 4.3740e-01 0.7526600 43 3.9904000 0.1757900 3.822600 3.534700 3.4630e-01 0.5538100 44 4.6088000 0.6066400 4.948472 9.560600 1.8588e-01 0.4695100 45 3.4437000 1.0054000 3.478624 
1.810500 3.7831e-01 0.7425200 46 1.7404000 0.5664500 3.641107 0.000474 1.0077e-01 0.4926200 47 0.0014500 0.6688500 4.206097 7.435900 4.9989e-01 0.7972200 48 0.3564800 0.5456400 4.666640 4.195000 2.4020e-01 0.6710900 49 3.4885000 1.1329000 5.296950 9.254200 8.1136e-05 0.6731200 50 2.5518000 0.7634200 3.556556 9.646800 1.6867e-01 0.6811200 1 5.95 0.567500 4.289333 0.250000 0.137500000 0.729167 2 3.55 0.769167 3.956000 0.750000 0.462500000 0.487500 3 0.25 0.420833 4.456000 5.750000 0.495833300 0.520833 4 3.65 1.080833 4.489333 2.083333 0.212500000 0.587500 5 3.45 0.347500 4.822666 1.583333 0.412500000 0.762500 6 2.25 0.714167 3.456000 5.083333 0.345833300 0.645833 7 1.45 0.952500 5.056000 8.250000 0.362500000 0.795833 8 3.85 0.457500 4.556000 3.416667 0.179166700 0.404167 9 4.25 0.255833 3.989333 4.750000 0.487500000 0.712500 10 3.35 0.842500 4.189333 7.750000 0.445833300 0.695833 11 2.95 0.677500 3.856000 4.416667 0.045833330 0.412500 12 1.05 0.530833 3.756000 6.416667 0.479166700 0.637500 13 0.85 0.732500 5.089333 9.750000 0.279166700 0.779167 14 2.15 0.402500 3.589333 4.083333 0.070833330 0.704167 15 5.65 0.109167 4.389333 2.250000 0.312500000 0.687500 16 5.85 0.659167 5.122666 5.250000 0.004166667 0.470833 17 3.05 0.970833 4.156000 3.583333 0.395833300 0.662500 18 1.95 0.640833 5.189333 1.250000 0.170833300 0.362500 19 4.35 0.182500 5.422666 2.583333 0.154166700 0.420833 20 4.95 0.219167 4.922666 7.583333 0.054166670 0.629167 21 0.55 0.145833 3.522666 8.416667 0.387500000 0.620833 22 5.15 0.494167 4.422666 3.083333 0.270833300 0.437500 23 2.05 1.025833 3.689333 2.416667 0.429166700 0.312500 24 2.85 0.934167 5.222666 9.083333 0.229166700 0.770833 25 0.75 0.329167 3.889333 8.750000 0.220833300 0.454167 26 5.05 0.879167 3.656000 5.583333 0.187500000 0.787500 27 2.75 0.897500 5.022666 9.250000 0.329166700 0.445833 28 2.65 0.310833 4.722666 5.916667 0.254166700 0.720833 29 1.85 0.604167 4.622666 7.416667 0.454166700 0.670833 30 3.15 0.824167 4.089333 1.916667 
0.287500000 0.345833 31 0.95 0.439167 5.156000 4.583333 0.029166670 0.570833 32 5.55 0.805833 5.322666 7.083333 0.237500000 0.462500 33 4.05 0.787500 3.722666 6.750000 0.337500000 0.754167 34 4.65 1.172500 3.822666 7.250000 0.062500000 0.537500 35 1.25 1.007500 3.489333 0.916667 0.037500000 0.612500 36 0.15 1.117500 4.789333 0.583333 0.162500000 0.379167 37 5.45 1.190833 4.756000 8.583333 0.437500000 0.429167 38 2.55 1.044167 4.022666 0.416667 0.320833300 0.529167 39 5.35 0.200833 3.622666 6.250000 0.012500000 0.354167 40 2.35 0.860833 4.889333 0.083333 0.470833300 0.370833 41 0.65 1.099167 5.256000 3.916667 0.112500000 0.737500 42 1.55 0.549167 3.556000 4.916667 0.195833300 0.320833 43 1.65 0.475833 4.989333 6.916667 0.020833330 0.679167 44 2.45 0.585833 4.689333 5.416667 0.095833330 0.554167 45 0.05 0.365833 3.789333 2.916667 0.245833300 0.495833 46 3.75 1.154167 4.656000 7.916667 0.370833300 0.745833 47 4.55 0.237500 4.522666 3.750000 0.379166700 0.512500 48 4.45 0.164167 4.222666 6.083333 0.304166700 0.579167 49 5.25 0.989167 4.589333 4.250000 0.295833300 0.387500 50 3.95 1.062500 5.389333 9.416667 0.079166670 0.479167 51 4.85 1.135833 4.056000 8.083333 0.120833300 0.604167 52 1.35 0.622500 4.356000 9.916667 0.420833300 0.504167 53 5.75 0.750833 5.289333 3.250000 0.404166700 0.329167 54 0.35 0.292500 5.356000 1.750000 0.354166700 0.595833 55 3.25 0.274167 4.956000 2.750000 0.129166700 0.654167 56 4.75 0.127500 4.122666 8.916667 0.262500000 0.395833 57 1.75 0.512500 4.322666 1.416667 0.204166700 0.545833 58 1.15 0.695833 4.856000 6.583333 0.087500000 0.304167 59 0.45 0.384167 3.922666 9.583333 0.145833300 0.562500 60 4.15 0.915833 4.256000 1.083333 0.104166700 0.337500 Arctic sea ice - 4 output variables -------------------------------------------------------------------- Run IceMass IceArea IceVelocity RangeOfArea -------------------------------------------------------------------- 1 28.0925 858.4200 64.7989 17.7539 2 NA NA NA NA 3 NA NA NA NA 4 23.6671 
1013.2809 59.0887 10.1966 5 33.6611 908.6530 23.2440 5.6086 6 12.8168 686.4774 64.4795 15.5639 7 80.2354 1041.0073 28.7366 15.9091 8 17.5758 1029.6089 40.4271 8.6029 9 21.1764 918.3986 33.3483 8.4740 10 18.7918 986.6699 36.6348 8.6815 11 66.8593 1012.3691 74.2131 20.6161 12 47.1297 1055.3019 27.5257 10.4743 13 11.4082 446.5576 102.5274 17.9979 14 48.0503 940.2105 77.9180 27.7288 15 17.0462 767.6149 62.1914 16.3610 16 17.2996 948.5461 23.3061 6.1597 17 19.3960 1056.6251 11.6700 5.6502 18 51.4083 1089.4678 35.6725 11.1223 19 46.9488 1066.3645 24.1405 9.1610 20 50.9734 1042.6746 69.1123 13.9697 21 12.3473 963.1450 30.1542 7.1923 22 55.8963 1052.2637 22.5584 11.1095 23 76.2172 1063.9843 69.2516 19.3792 24 25.9524 980.5435 56.8555 13.6181 25 NA NA NA NA 26 17.9903 939.3955 33.3035 15.4317 27 47.4019 1035.6718 51.4850 16.2930 28 26.5546 927.5731 17.4474 12.9699 29 NA NA NA NA 30 26.8484 895.8054 31.8161 15.4403 31 66.0207 1018.4913 18.1147 18.3207 32 NA NA NA NA 33 91.6324 1041.0626 32.5382 17.7724 34 20.9279 1022.0900 52.1404 10.8432 35 56.5643 1048.1985 40.0925 11.0354 36 27.9922 922.1439 19.0244 10.6832 37 17.3214 1007.4663 56.4963 10.5191 38 NA NA NA NA 39 25.6900 897.0154 60.8135 17.9371 40 47.0434 1019.7569 22.0604 9.4213 41 23.7964 903.4302 96.4945 19.6305 42 108.4831 1077.2700 37.0781 22.1120 43 22.4252 823.2452 50.1323 14.5770 44 9.8054 888.9754 41.2581 7.0697 45 NA NA NA NA 46 NA NA NA NA 47 50.1663 976.1697 21.0574 20.3704 48 88.2821 1026.0039 32.5483 19.8021 49 76.9606 990.7841 33.2208 19.6909 50 20.8695 954.8767 87.2722 15.4864 51 NA NA NA NA 52 42.8508 1048.4762 18.3161 9.5625 53 73.1104 952.7972 66.0781 25.8746 54 77.3832 1031.7157 64.7460 23.3265 55 59.6047 1012.3020 18.4821 15.5599 56 67.2248 1008.2618 44.6623 17.3288 57 18.2524 989.7067 33.0174 8.5823 58 28.5928 903.9642 34.2709 18.1898 59 30.7918 1103.1759 23.4295 6.6926 60 35.0504 960.0704 31.0606 16.3087 61 27.8810 807.6848 89.4528 20.7844 62 44.0634 1067.6035 31.2157 9.3378 63 25.2519 994.1107 
75.4404 11.9656 64 61.1253 1033.9933 18.6697 14.3272 65 NA NA NA NA 66 44.4077 1069.6334 17.9298 7.9077 67 17.6524 863.5967 44.5026 12.0665 68 113.7279 1070.2357 22.0104 22.8503 69 82.9204 1007.6683 31.6752 20.1662 70 17.8865 777.0690 65.1718 16.6942 71 10.8653 793.6827 39.7069 9.4200 72 NA NA NA NA 73 73.0543 1074.7257 25.3399 13.8878 74 26.2713 1006.6944 44.7263 8.1954 75 36.9651 1044.9807 3.3607 7.1410 76 18.4841 763.1233 53.6555 8.3777 77 97.2709 1097.7416 38.7489 18.6711 78 18.1863 989.8572 32.2083 8.2891 79 72.5136 1102.3480 20.3697 11.9701 80 23.6247 899.7114 50.0579 10.6544 81 NA NA NA NA 1 143.9022 1147.6920 21.7474 30.4955 2 NA NA NA NA 3 102.9413 1038.8593 6.4257 23.3857 4 NA NA NA NA 5 NA NA NA NA 6 19.6972 715.7026 101.9613 24.8733 7 NA NA NA NA 8 NA NA NA NA 9 NA NA NA NA 10 8.4896 926.2330 58.1046 8.4365 11 NA NA NA NA 12 10.2201 773.7947 50.5651 18.5389 13 46.9198 1010.7829 32.2039 17.0954 14 52.2801 1062.8518 21.2752 9.0795 15 29.7807 1079.8865 7.6766 5.4620 16 67.7779 1071.7053 25.8372 12.2647 17 29.9050 835.2781 96.7181 15.5217 18 107.3926 1016.2476 60.4717 24.5364 19 10.6286 768.1317 95.4541 19.7086 20 26.1472 927.3298 71.3081 22.4411 21 14.7056 798.5698 73.6780 15.3487 22 11.1872 990.2441 57.1501 8.5511 23 34.2750 1070.9807 44.3425 8.4722 24 49.9117 885.1348 94.3641 29.8801 25 NA NA NA NA 26 50.2149 1101.1412 33.9193 10.6541 27 18.5824 1045.7645 8.6557 5.7558 28 39.4377 941.5604 58.6172 17.1935 29 26.4037 1069.5350 11.3957 6.8345 30 NA NA NA NA 31 15.5202 1008.2104 32.7845 6.4083 32 115.9583 1043.8834 51.3323 23.8756 33 13.4618 838.5945 60.9955 12.2748 34 11.0023 671.0287 54.1561 15.2067 35 54.1061 1101.2600 25.4460 9.1428 36 25.2090 1054.0166 60.8005 11.5128 37 28.9961 1025.4668 19.0800 6.1891 38 55.7760 1006.2620 42.3576 13.4121 39 NA NA NA NA 40 44.2918 1082.3187 16.2205 9.4605 41 19.0483 974.6289 54.0456 16.7798 42 67.0760 1137.5922 17.9079 10.8867 43 20.5040 917.3510 77.0908 18.6593 44 65.6763 1065.2570 22.4357 12.1915 45 NA NA NA NA 46 NA 
NA NA NA 47 9.7387 759.5698 9.6878 8.9307 48 9.7122 637.6529 77.5844 16.2787 49 32.6055 1028.4966 21.4946 10.9849 50 31.9525 1097.3027 18.4270 6.3661 1 NA NA NA NA 2 51.6048 920.9564 106.8727 28.9043 3 17.5319 762.7305 42.0884 8.4899 4 28.2850 1009.8871 75.8806 19.4991 5 NA NA NA NA 6 49.4078 934.8795 71.1114 22.2424 7 15.7346 923.5403 43.2867 14.0801 8 58.9760 1046.8557 58.1720 13.4092 9 52.4084 1050.0486 42.2349 10.2752 10 25.1785 1054.5599 58.3404 11.0758 11 69.2846 1019.4514 38.4099 13.9703 12 28.3805 887.0622 62.8959 17.3376 13 27.1953 945.1967 23.3352 8.6403 14 23.1052 981.9948 32.8832 7.8247 15 NA NA NA NA 16 13.5912 963.3745 60.6661 16.1290 17 26.7678 990.9900 35.2481 15.0889 18 NA NA NA NA 19 40.3421 982.1132 40.8914 18.5419 20 49.5279 1024.9757 28.3452 10.8127 21 20.2745 694.2266 59.4815 20.9970 22 97.9378 1085.1761 43.2615 18.7986 23 53.3219 919.8050 104.2474 25.8918 24 70.2716 1053.6654 16.0383 12.7164 25 32.3497 873.2568 47.2809 14.8747 26 82.9790 1061.5891 24.4324 16.3139 27 25.2319 1083.5582 33.3770 6.8663 28 29.0519 956.7084 12.1898 11.6862 29 40.0426 1022.5867 43.7399 11.2396 30 NA NA NA NA 31 9.3607 768.9521 38.1868 12.5533 32 79.8747 1120.8846 21.0701 14.2230 33 49.8560 1000.4399 29.1079 20.4543 34 58.6257 1004.1964 21.8429 21.0912 35 NA NA NA NA 36 NA NA NA NA 37 35.9416 998.4996 23.8258 18.8829 38 31.9802 984.4305 94.7669 18.3228 39 37.0714 1008.7313 59.7962 15.7220 40 NA NA NA NA 41 NA NA NA NA 42 42.8313 837.1102 78.3339 24.1837 43 7.1160 773.2409 57.2028 13.3353 44 48.5398 985.2294 27.5504 10.8576 45 26.6478 491.0860 86.5007 26.5413 46 85.1714 1070.0127 26.8903 16.2902 47 26.1311 1029.1494 85.2551 13.2828 48 58.2723 1009.4652 28.0063 12.3142 49 45.8463 1028.2770 47.1790 17.3831 50 13.4271 1036.4664 9.0003 5.8029 51 100.6901 1078.5682 19.1679 19.9764 52 47.7863 1057.7366 10.5171 7.7570 53 50.5019 1100.6416 40.7831 9.4077 54 13.5089 902.5354 57.5244 7.2695 55 74.0051 1048.3409 21.9675 13.4440 56 23.8279 996.2996 17.8649 10.1647 57 NA NA NA NA 
58 30.2251 1009.0953 19.0226 7.1195 59 20.8639 969.4079 13.9900 6.2769 60 40.8394 1046.1118 85.4104 14.5497
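The listings above are whitespace-flattened, so a reader who wants to work with them has to regroup the tokens into rows first. The following is a minimal sketch of one way to do that, not the authors' own tooling. It assumes that each output row consists of a run index followed by four values, that a run index of 1 marks the start of a new design block, and that the token `NA` marks a run with no recorded output. The function names `parse_flat_table` and `split_designs` are hypothetical; the sample string copies the first rows of the first two output blocks.

```python
import math


def parse_flat_table(text, n_cols):
    """Regroup whitespace-separated tokens into rows of n_cols fields.

    The first field of each row is read as the integer run index;
    the remaining fields are floats, with the token "NA" mapped to NaN.
    """
    tokens = text.split()
    if len(tokens) % n_cols != 0:
        raise ValueError("token count is not a multiple of n_cols")
    rows = []
    for i in range(0, len(tokens), n_cols):
        run = int(tokens[i])
        values = [math.nan if t == "NA" else float(t)
                  for t in tokens[i + 1:i + n_cols]]
        rows.append((run, values))
    return rows


def split_designs(rows):
    """Start a new block each time the run index resets to 1 (assumption)."""
    blocks = []
    for run, values in rows:
        if run == 1 or not blocks:
            blocks.append([])
        blocks[-1].append((run, values))
    return blocks


# First rows of the first two output blocks, copied from the listing above.
sample = (
    "1 28.0925 858.4200 64.7989 17.7539 2 NA NA NA NA 3 NA NA NA NA "
    "1 143.9022 1147.6920 21.7474 30.4955 2 NA NA NA NA"
)

blocks = split_designs(parse_flat_table(sample, n_cols=5))

# Keep only the runs in the first block that have all four outputs recorded.
complete = [row for row in blocks[0]
            if not any(math.isnan(v) for v in row[1])]
```

The same `parse_flat_table` call with `n_cols=8` or `n_cols=7` would regroup the two input listings, provided the dashed rules and header names are stripped first.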
Analysis Methods for Computer Experiments : How to Assess and What Counts? Chen, Hao; Loeppky, Jason L.; Sacks, Jerome, 1931-; Welch, William J. 2016
Item Metadata
Title | Analysis Methods for Computer Experiments : How to Assess and What Counts? |
Creator | Chen, Hao; Loeppky, Jason L.; Sacks, Jerome, 1931-; Welch, William J. |
Date Issued | 2016 |
Description | Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive, evidence-based assessment strategies are needed when comparing such modeling options. Otherwise, one is easily misled. Applying the strategies to several computer codes shows that a regression model more complex than a constant mean either has little impact on prediction accuracy or is an impediment. The choice of correlation function has modest effect, but there is little to separate two common choices, the power exponential and the Matérn, if the latter is optimized with respect to its smoothness. The applications presented here also provide no evidence that a composite of GPs provides practical improvement in prediction accuracy. A limited comparison of Bayesian and empirical Bayes methods is similarly inconclusive. In contrast, we find that the effect of experimental design is surprisingly large, even for designs of the same type with the same theoretical properties. |
Subject | Correlation function; Gaussian process; kriging; prediction accuracy; regression |
Genre | Postprint Article |
Type | Text Dataset |
Language | eng |
Date Available | 2016-05-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0302078 |
URI | http://hdl.handle.net/2429/58142 |
Affiliation | Science, Faculty of; Irving K. Barber School of Arts and Sciences (Okanagan); Other UBC; Non UBC; Statistics, Department of; Computer Science, Mathematics, Physics and Statistics, Department of (Okanagan) |
Citation | Statistical Science 2016, Vol. 31, No. 1, 40–60 |
Publisher DOI | 10.1214/15-STS531 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Institute of Mathematical Statistics |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52383-Chen_H_et_al_Analysis_Methods.pdf [ 785.64kB ]
- 52383-Chen_H_et_al_Analysis_Methods_supplement.pdf [ 849.64kB ]
- 52383-Chen_H_et_al_Analysis_Methods_ice-x.txt [ 27.33kB ]
- 52383-Chen_H_et_al_Analysis_Methods_ice-y.txt [ 13.3kB ]