Interactive Bayesian Optimization Learning User Preferences for Graphics and Animation by Eric Brochu BA, University of Regina, 1997 BSc, University of Regina, 1998 MSc, The University of British Columbia, 2004 a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the faculty of graduate studies (Computer Science) The University of British Columbia (Vancouver) December 2010 c© Eric Brochu, 2010 Abstract Bayesian optimization with Gaussian processes has become increasingly pop- ular in the machine learning community. It is efficient and can be used when very little is known about the objective function, making it useful for optimizing expensive black box functions. We examine the case of using Bayesian optimization when the objective function requires feedback from a human. We call this class of problems interactive Bayesian optimization. Here, we assume a parameterized model, and a user whose task is to find an acceptable set of parameters according to some perceptual value func- tion that cannot easily be articulated. This requires special attention to the qualities that make this a unique problem, and so, we introduce three novel extensions: the application of Bayesian optimization to “preference galleries”, where human feedback is in the form of preferences over a set of instances; a particle-filter method for learning the distribution of model hyperparameters over heterogeneous users and tasks; and a bandit-based method of using a portfolio of utility functions to select sample points. Us- ing a variety of test functions, we validate our extensions empirically on both low- and high-dimensional objective functions. We also present graph- ics and animation applications that use interactive Bayesian optimization techniques to help artists find parameters on difficult problems. We show that even with minimal domain knowledge, an interface using interactive Bayesian optimization is much more efficient and effective than traditional “parameter twiddling” techniques on the same problem. ii Preface All the work presented here has been performed under the supervision of Nando de Freitas. In addition, I have been fortunate enough to work with outstanding co-authors on several publications which have become parts of this thesis. • The initial work in using Bayesian optimization for preference gal- leries (Chapter 3) was presented at NIPS and SIGGRAPH [Brochu et al., 2007a; 2007b]. On these publications, we worked with Abhijeet Ghosh, who provided the parameterized Bidirectional Reflectance Dis- tribution Function modelling and rendering code we used as the test problem (§6.3). Abhijeet also helped tailor the SIGGRAPH poster presentation to a graphics audience, which helped it to win First Prize at the SIGGRAPH Student Research Competition. • Chapter 4 describes a novel method of using particle filters to track hy- perparameter evolution over multiple users. This work was presented at the SIGGRAPH/Eurographics Symposium on Computer Anima- tion [Brochu et al., 2010a]. Here we worked with co-author Tyson Brochu, who created the animation application we used for testing. The animation method presented in §6.4.1 is a result of his research, and that section was largely written by Tyson. The implementation of the parameterized interface (but not the machine learning aspects) was designed in consultation with Tyson. • The idea of using hedging methods to manage a portfolio of acquisi- tion functions (Chapter 5) was greatly aided by discussions with Matt iii Hoffman. Matt was instrumental in helping to test the idea and pro- vided valuable input in the writing of Chapter 5. He also wrote the description of the control test problem in §5.3. The control experiment framework we used was coded by Matt with Hendrik Kück for their earlier publication [Hoffman et al., 2009]. A version of the work has been made available as a technical report on the arXiv e-print archive [Brochu et al., 2010b]. An expanded version is under preparation for future submission. • Finally, much of Chapters 1, 2, 3 and 7 will appear as a chapter [Brochu et al., 2011] in an upcoming book, and some parts of Chapter 2 were previously presented in an article for Autonomous Robots [Martinez– Cantin et al., 2009]. The work of my co-authors in those publications is not part of this thesis. The human subject experiments were approved by the University of British Columbia Behavioural Research Ethics Board under certificate H10- 01682. iv Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi List of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . xv 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 An introduction to Bayesian optimization . . . . . . . . . . . 4 1.3 Contributions and overview . . . . . . . . . . . . . . . . . . . 6 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 The Bayesian optimization approach . . . . . . . . . . . . . . 11 2.1.1 Priors over functions . . . . . . . . . . . . . . . . . . . 14 2.1.2 Choice of covariance functions . . . . . . . . . . . . . 17 2.2 Acquisition functions for Bayesian optimization . . . . . . . . 20 2.2.1 Improvement-based acquisition functions . . . . . . . . 20 v 2.2.2 Confidence bound criteria . . . . . . . . . . . . . . . . 25 2.2.3 Maximizing the acquisition function . . . . . . . . . . 29 2.3 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 A brief history of Bayesian optimization . . . . . . . . . . . . 31 2.4.1 Kriging . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.2 Experimental design . . . . . . . . . . . . . . . . . . . 33 2.4.3 Active learning . . . . . . . . . . . . . . . . . . . . . . 35 2.4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . 36 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.5.1 Test functions . . . . . . . . . . . . . . . . . . . . . . . 39 3 Preference galleries . . . . . . . . . . . . . . . . . . . . . . . 41 3.1 Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Probit model for binary observations . . . . . . . . . . . . . . 43 3.3 Experiments: gallery labelling methods . . . . . . . . . . . . . 47 4 Learning hyperparameters across user sessions . . . . . . . 54 4.1 Mean and covariance hyperparameters . . . . . . . . . . . . . 56 4.1.1 Kernel hyperparameters . . . . . . . . . . . . . . . . . 60 4.2 Learning from the user base . . . . . . . . . . . . . . . . . . . 61 4.3 Experiments: hyperparameter learning . . . . . . . . . . . . . 63 4.4 Experiments: learning the mean function from an RBF network 66 5 Portfolio strategies for acquisition selection . . . . . . . . . 67 5.1 Hedge algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2 Experiments: acquisition functions and acquisition strategies 71 5.3 Experiment: control of a particle simulation . . . . . . . . . . 78 5.4 Conclusions and future work . . . . . . . . . . . . . . . . . . 80 6 Applications: active preference learning with Bayesian op- timization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.2 Active preferences for 2D smoke simulation . . . . . . . . . . 82 6.3 Interactive Bayesian optimization for material design . . . . . 84 vi 6.3.1 User study . . . . . . . . . . . . . . . . . . . . . . . . 87 6.4 Procedural fluid animation . . . . . . . . . . . . . . . . . . . 88 6.4.1 Procedural animation . . . . . . . . . . . . . . . . . . 90 6.4.2 Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.4.3 Experiment: hyperparameter learning . . . . . . . . . 92 6.4.4 Gallery interface performance . . . . . . . . . . . . . . 93 6.4.5 Full application interface compared to parameter twid- dling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.4.6 Discovery . . . . . . . . . . . . . . . . . . . . . . . . . 101 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.1 Discussion and advice to practitioners . . . . . . . . . . . . . 104 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 A Test functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 vii List of Tables 2.1 Standard test functions used in this thesis, and their dimen- sionality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6.1 Results of the user study on the BRDF gallery. . . . . . . . . 88 6.2 Results of experiments in which users were shown target an- imations and asked to find them using only specific methods. 97 6.3 Comparison of users using the system with a zero function as the mean, and with a mean trained on previous user data. . . 101 viii List of Figures 1.1 Example of using Bayesian optimization on a toy 1D design problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Simple 1D Gaussian process with three observations. . . . . . 15 2.2 The effect of changing the kernel hyperparameters. . . . . . . 18 2.3 Gaussian process from Figure 2.1, additionally showing the region of probable improvement. . . . . . . . . . . . . . . . . 22 2.4 Examples of acquisition functions and their settings. . . . . . 26 2.5 Examples of acquisition functions and their settings in 2 di- mensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.6 Comparison of acquisition functions on a toy 1D problem. . . 28 2.7 Some randomly-generated test functions. . . . . . . . . . . . . 40 3.1 Example of a set of preference relations used to infer a GP on a toy problem. . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 Testing methods of selecting sample points for preferences on a set of test functions on a simulated 2-gallery. . . . . . . . . 49 3.3 Preference methods on the synthetic functions on a simulated 2-gallery, using the same methods used to generate Figure 3.2. 50 3.4 Testing methods of selecting sample points for preferences on a set of test functions on a simulated 4-gallery. . . . . . . . . 53 4.1 Effect of different mean functions with and without evidence. 56 4.2 Examples of the impact of kernel width (θ) and noise (σ2noise) hyperparameter settings on a toy 1D problem. . . . . . . . . 60 ix 4.3 Learning ARD kernel width hyperparameters using a particle filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4 Effect of adding more data to compute the RBF prior mean. 66 5.1 Examples of typical evolution of GP-Hedge’s portfolio with K = 9 for each objective function. . . . . . . . . . . . . . . . 71 5.2 Comparison of different acquisition approaches on four commonly- used test functions. . . . . . . . . . . . . . . . . . . . . . . . . 73 5.3 Comparison of different hedging strategies on three commonly- used test functions. . . . . . . . . . . . . . . . . . . . . . . . . 74 5.4 Comparison of different hedging strategies on four commonly used literature functions. . . . . . . . . . . . . . . . . . . . . . 75 5.5 Comparison of performance of the acquisition approaches on synthetic functions sampled from a GP prior with randomly initialized hyperparameters. . . . . . . . . . . . . . . . . . . . 76 5.6 Results of experiments on the repeller control problem. . . . . 79 6.1 An example of the interactive Bayesian optimization smoke simulation comparison tool. . . . . . . . . . . . . . . . . . . . 83 6.2 A shorter-than-average but otherwise typical run of the BRDF preference gallery tool. . . . . . . . . . . . . . . . . . . . . . . 85 6.3 The animation gallery in action. . . . . . . . . . . . . . . . . 89 6.4 Impact of learning hyperparameters from user data. . . . . . 94 6.5 Gallery interfaces used for user studies. . . . . . . . . . . . . 95 6.6 Results of the experiments of §6.4.4, shown as boxplots of the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 A.1 The Branin function. . . . . . . . . . . . . . . . . . . . . . . . 117 x List of Symbols The symbols used in this thesis follow a few conventions for clarity. • Lower-case, unbolded Latin and Greek letters (x, α, etc.) represent scalar quantities and functions. • Bold, lower-case Latin and Greek letters (x,α, etc.) represent vectors. • Bold, upper-case Latin letters (X, etc.) represent matrices. • Calligraphic upper-case Latin letters (X , etc.) represent sets and dis- tributions. • The subscript notation Za:b indicates the sequence Za, Za+1, . . . , Zb. We additionally use a number of symbols not listed here, where the symbols are needed for discussion, but not relevant to the rest of the thesis. symbol section definition f(·) 1 objective function A 1 compact set being optimized d 1 dimensionality of the compact set being opti- mized x 1 an instance in A xn 1.2 the nth sequential sample drawn during an op- timization D 1.2 a sample and observation pair, e.g. {x, y} x? 2.1 the global argmax f(x?) 2.1 the global maximum u(·) 2.1 an acquisition function xi symbol section definition y, y(·) 2.1 a (possibly noisy) sample from the objective function yn, y(xn) 2.1 a (possibly noisy) sample from the objective function at the instance xn εn 2.1 additive Gaussian noise at the instance xn D1:t 2.1 a set of t observations, {x1:t, y1:t} m(·) 2.1 the mean function of a Gaussian process x+ 2.1 incumbent of a set of points: argmaxxi f(xi) x(?) 2.1 the argmax of a local maximum σ2noise 2.1 variance of Gaussian noise GP(m(·), k(·, ·)) 2.1.1 Gaussian Process, defined by mean function m and covariance function k k(·, ·) 2.1.1 the kernel (covariance) function of a Gaussian process ft 2.1.1 noise-free observation of xt f1:t 2.1.1 noise-free observations of x1:t K 2.1.1 kernel matrix k 2.1.1 vector of kernel function values for xt+1 µ(·) 2.1.1 posterior mean function σ2(·) 2.1.1 posterior variance function θ 2.1.2 length-scale hyperparameter θ 2.1.2 vector of length-scale hyperparameters PI(·) 2.2.1 the “probability of improvement” acquisition function I(·) 2.2.1 the improvement function EI(·) 2.2.1 the “expected improvement” acquisition func- tion Φ(·) 2.2.1 cumulative distribution function of the standard Normal distribution φ(·) 2.2.1 probability density function of the standard Normal distribution ξ 2.2.1 exploration/exploitation trade-off hyperparam- eter for improvement-based acquisition func- tions LCB(·) 2.2.2 lower confidence bound acquisition function UCB(·) 2.2.2 upper confidence bound acquisition function GP-UCB(·) 2.2.2 Srinivas et al. variant of the confidence bound acquisition function r(·) 2.2.2 instantaneous regret function xii symbol section definition ν 2.2.2 exploration/exploitation trade-off hyperparam- eter for GP-UCB acquisition function µ+ 2.3 the previously-observed sample with the highest expected value (the incumbent where noise is present) X 2.4.2 design matrix, where rows are inputs x1:t xfirst 2.5 the value of the first function sample, or the argmax of the set of initial samples G 2.5 the “gap” measure of optimization performance ri, ci 3.2 ith preference pair in a probit model. ri and ci are samples from A, where ri is preferred to ci. v(·) 3.2 valuation function fMAP 3.2 maximum a posteriori estimates of f derived from a set of preference relationships γ 4.1 location parameter of mean hyperprior Γ 4.1 matrix of all location parameters γ β 4.1 weight parameter of mean hyperprior β 4.1 vector of weight parameters of mean hyperprior α 4.1 covariance hyperprior ρ 4.2 generic hyperparameter ρ 4.2 vector of generic hyperparameters τ 4.2 counter of runs of the optimization system on different objectives xiii List of Acronyms Acronyms used in this thesis and the sections in which introduced. symbol section definition GP 2.1.1 Gaussian process PI 2.2.1 probability of improvement EI 2.2.1 expected improvement GP-UCB 2.2.2 Gaussian process upper confidence bound RBF 4.1 radial basis function BRDF 6.3 bidirectional reflectance distribution functions xiv Acknowledgements A martial arts student went to his teacher and said earnestly, “I am devoted to studying with you. How long will it take to master your art?” The teacher’s reply was casual: “Ten years.” Impatiently, the student answered, “But I want to master it faster than that! I will work very hard. I will practice every day, ten or more hours a day if I have to. How long will it take then?” The teacher thought for a moment. “Twenty years.” — traditional Zen story Since I started grad school all those years ago, Nando de Freitas has been not only a supervisor, but a friend, a mentor and a source of inspiration. He’s had the patience to let me follow my own (sometimes misguided) research muse and the strength to see this thesis through to the end. It is with pride that I will always remain one of Nando’s students. This thesis would not have been possible without him. Thanks go to my fellow students, who have made these years so enjoy- able and so rewarding, whether we were blasting techno through computer speakers in the wee hours before a paper deadline, discussing graphics and machine learning over sushi, or filling the sofas of “die Kommune” on West 23rd Avenue for marathon Buffy the Vampire Slayer viewings. Your friend- ship has kept me sane and your criticism has kept me honest. I won’t try xv to name you all for fear of leaving someone out, but you know who you are. I’ve been lucky enough to be funded throughout most of my PhD: my Koerner and UBC Graduate fellowships and NSERC doctoral scholarship have made this research possible, and I am truly grateful. I’ve also been very fortunate to work with the brilliant and creative folks at Worio over the course of my degree, which has influenced my research in ways both large and small. Last, but very far from least, I thank Janelle. Even when you were on the other side of the planet, you were standing beside me. xvi Chapter 1 Introduction An enormous body of scientific literature has been devoted to the problem of optimizing a nonlinear function f(x) over a compact set A. In the realm of optimization, this problem is formulated concisely as follows: max x∈A⊂Rd f(x) One typically assumes that the objective function f(x) has a known math- ematical representation, is convex, or is at least cheap to evaluate. Despite the influence of classical optimization on machine learning, many learning problems do not conform to these strong assumptions. Often, evaluating the objective function is expensive or even impossible, and the derivatives and convexity properties are unknown. In many realistic sequential decision making problems, for example, one can only hope to obtain an estimate of the objective function by simulating future scenarios. Whether one adopts simple Monte Carlo simulation or adaptive schemes, as proposed in the fields of planning and reinforcement learning, the process of simulation is invariably expensive. Moreover, in some applications, drawing samples f(x) from the function corresponds to expensive processes: drug trials, destructive tests or financial investments. In this thesis, our motivating examples are based on active user modelling for learning parameters for procedural graphics and animation. In active 1 user modelling, x represents attributes of a user query—for example, a set of parameters for a procedural animation—and f(x) requires a response from the human in the form of a rating or ranking. Computers must ask the right questions and the number of questions must be kept to a minimum so as to avoid annoying the user. 1.1 Motivation A computer graphics artist sits down to use a simple renderer to find appropriate surfaces for a typical reflectance model. It has a series of parameters that must be set to control the simulation: “specularity”, “Fresnel reflectance coefficient”, and other, less- comprehensible ones. The parameters interact in ways difficult to discern. The artist knows in his mind’s eye what he wants, but he’s not a mathematician or a physicist—no course he took during his MFA covered Fresnel reflectance models. Even if it had, would it help? He moves the specularity slider and waits for the image to be generated. The surface is too shiny. He moves the slider back a bit and runs the simulation again. Better. The surface is now appropriately dull, but too dark. He moves a slider down. Now it’s the right colour, but the specularity doesn’t look quite right any more. He repeatedly bumps the specularity back up, rerunning the renderer at each attempt until it looks right. Good. Now, how to make it look metallic...? Problems in simulation, animation, rendering and other areas often take such a form, where the desired end result is identifiable by the user, but parameters must be tuned in a tedious trial-and-error process. This is par- ticularly apparent in psychoperceptual models, where continual tuning is required to make something “look right”. Unfortunately, it is not at all easy to find a mapping from parameterized animation to psychoperceptual plau- sibility. The perceptual objective function is simply unknown. However, it is fairly easy for humans to judge the quality of an instance in many graphics and animation problems—in fact, it is trivial and almost instantaneous. The 2 application of this principle to animation and other psychoperceptual tools is motivated by the observation that humans often seem to be forming a mental model of the objective function. This model enables them to exploit feasible regions of the parameter space where the value is predicted to be high and to explore regions of high uncertainty. It is our thesis that the process of tweaking parameters to find a result that looks “right” is akin to sampling a perceptual objective function, and that twiddling the parameters to find the best result is, in essence, optimization. Our objective function is the psychoperceptual process underlying judgement—how well a realization fits what the user has in mind. Conventional parameter-tweaking techniques are often inefficient. For example, Hutter et al. [2007] were able to significantly speed up algorithms by automatically tuning the parameters—even improving on the best per- formance achieved by the algorithm designers themselves. They note that “[t]he fact that these empirical results contradicted the algorithm designer’s intuition illustrates clearly the limitations of even an expert’s ability to com- prehend the complex interplay between the many parameters of a sophis- ticated heuristic algorithm.” By using automatic techniques to help select sample points, we can allow the artist to focus on the task of identification, and possibly fine-tuning: the methods presented in this thesis are intended to work alongside familiar tools, not to replace them. In all this, the goal is to separate high-level cognitive tasks performed by human users from te- dious low-level tasks that can be left to machines. At the same time, we need to help the user perform tasks efficiently—cognitive effort (attention) is the limiting resource in effective decision-making [Simon, 1978], and re- ducing the cognitive effort required for a problem often leads to improved accuracy [Russo and Leclerc, 1991]. This motivates a key insight: it is not necessary to accurately model the entire objective function. The problem is actually one of optimization, not regression. We can’t directly maximize the user’s psychoperceptual model, so we use an acquisition function (§2.2) as a principled way of trading off exploration (showing the user examples unlike any they have seen) and ex- ploitation (trying to show the user improvements on examples they have 3 indicated preference for). Of course, regression-based learning can produce an accurate model of the user’s valuation over the entire space of parame- ters, which would also allow us to find the best value. However, this comes at the cost of asking the user to evaluate many, many examples that have no practical relation what she is looking for. Our method tries instead to make the most efficient possible use of the user’s time and cognitive effort: a problem well-suited to Bayesian optimization. Furthermore, the value function can be any psychoperceptual process that lends itself to sliders and preferences: the model can support an ani- mator looking for a particular “cartoon physics” effect, an artist trying to capture a particular mood in the lighting of a scene, or an electronic musi- cian looking for a specific sound or rhythm. Though we use animation and rendering as motivating domains, our work has a broad scope of application in music and other arts, as well as psychology, marketing and econometrics, and human-computer interfaces. To avoid confusion with other utility models in this thesis, we borrow from the econometrics literature and refer to this objective as the value model (or simply the value function). In assuming that the user’s interests can be modelled this way, we are making an assumption that the value function is smooth, or can be reasonably modelled with a smooth function. In doing so, we follow the extensive work in the psychology and economics of decision-making (e.g., [Kahneman and Tversky, 1979; Payne et al., 1993; McFadden, 2001; Train, 2003]), and direct the interested reader to these resources. 1.2 An introduction to Bayesian optimization Bayesian optimization is a powerful strategy for finding the extrema of ob- jective functions that are expensive to evaluate. It is applicable in situations where one does not have a closed-form expression for the objective function, but where one can obtain observations (possibly noisy) of this function at sampled values. It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand 4 is non-convex. We will provide an informal description here, to give the reader an intuition, and provide a more detailed discussion starting in §2.1. Bayesian optimization techniques are some of the most efficient ap- proaches in terms of the number of function evaluations required (see, e.g., [Močkus, 1994; Jones et al., 1998; Streltsov and Vakili, 1999; Jones, 2001; Sasena, 2002]). Much of the efficiency stems from the ability of Bayesian optimization to incorporate prior belief about the problem to help direct the sampling, and to trade off exploration and exploitation of the search space. It is called Bayesian because it uses the famous “Bayes’ theorem”, which states (simplifying somewhat) that the posterior probability of a model (or theory, or hypothesis) M given evidence (or data, or observations) E is pro- portional to the likelihood of E given M multiplied by the prior probability of M : P (M |E) ∝ P (E|M)P (M). Inside this simple equation is the key to optimizing the objective func- tion. In Bayesian optimization, the prior represents our belief about the space of possible objective functions. Although the cost function is un- known, it is reasonable to assume that there exists prior knowledge about some of its properties, such as smoothness, and this makes some possible objective functions more plausible than others. Let’s define xi as the ith sample, and f(xi) as the observation of the ob- jective function at xi. As we accumulate observations1 D1:t = {x1:t, f(x1:t)}, the prior distribution is combined with the likelihood function P (D1:t|f). Essentially, given what we think we know about the prior, how likely is the data we have seen? If our prior belief is that the objective function is very smooth and noise-free, data with high variance or oscillations should be considered less likely than data that barely deviate from the mean (see, for example, Figure 4.2). Now, we can combine these to obtain our posterior distribution: P (f |D1:t) ∝ P (D1:t|f)P (f). The posterior captures our updated beliefs about the unknown objective 1Here we use subscripts to denote sequences of data, i.e., y1:t = {y1, . . . , yt}. 5 function. One may also interpret this step of Bayesian optimization as estimating the objective function with a surrogate function (also called a response surface), described formally in §2.1.1 with the mean function of a Gaussian process. To sample efficiently, Bayesian optimization uses an acquisition function to determine the next location xt+1 ∈ A to sample. The decision represents an automatic trade-off between exploration (where the objective function is very uncertain) and exploitation (trying values of x where the objective function is expected to be high). This optimization technique has the nice property that it aims to minimize the number of objective function evalua- tions. Moreover, it is likely to do well even in settings where the objective function has multiple local maxima. Figure 1.1 shows a typical run of Bayesian optimization on a 1D problem. The optimization starts with two points. At each iteration, the acquisition function is maximized to determine where next to sample from the objective function—the acquisition function takes into account the mean and variance of the predictions over the space to model the utility of sampling. The objective is then sampled at the argmax of the acquisition function, the Gaussian process is updated and the process is repeated. 1.3 Contributions and overview In this thesis, we present a novel, optimization-based approach to setting pa- rameters interactively. In our approach, we implicitly model the user’s value function from feedback on instances of images or animations generated from a space of valid parameter settings for procedural graphics and animation tasks. In effect, we are automating the “slider twiddling” process typically used by artists to find desired settings. The method is based on Bayesian optimization, which allows prior knowledge about the problem—whether from experts or simply regular users of the system—to be incorporated. We refer to this as interactive Bayesian optimization. Our goal is to offload the cognitive burden of estimating and exploring different sets of parame- ters from the user to the computer, though we can incorporate conventional 6 acquisition max acquisition function (u( ·)) observation (x) objective fn (f( ·)) t = 2 new observation (xt) t = 3 posterior mean (µ( ·)) posterior uncertainty (µ( ·)±σ( ·)) t = 4 Figure 1.1: An example of using Bayesian optimization on a toy 1D design problem. The figures show a Gaussian process (GP) approximation of the objective function over four iterations of sampled values of the objective function. The figure also shows the acquisition function in the lower shaded plots. The acquisition is high where the GP predicts a high objective (exploitation) and where the prediction uncertainty is high (exploration)—areas with both attributes are sampled first. Note that the area on the far left remains unsampled, as while it has high uncertainty, it is (correctly) predicted to offer little improvement over the highest observation. slider twiddling into the framework easily. In Chapter 2, we provide the necessary background for the problem. We formally present Bayesian optimization with Gaussian processes (§2.1) and describe various acquisition functions (§2.2) and the role of Gaussian 7 noise (§2.3). In §2.4, we cover the history of Bayesian optimization, and the related fields of kriging, GP experimental design and GP active learning. In §2.5, we present background information on the experimental methods we use in Chapters 3–5. To deal with the unique problems involved in having humans in the loop, we introduce several novel extensions to Bayesian optimization, which form our main contributions to the field: • In Chapter 3, we show how our optimization methods can be applied to “preference galleries”. In a preference gallery interface, users need only indicate which of a set of generated instances look the most like the target. This is a much easier task for humans than reliably providing absolute magnitudes of single instances. • Chapter 4 introduces a novel way of adaptively learning the model hyperparameters using particle filters. Because of the small number of data we collect during each optimization, we cannot learn hyperpa- rameters through conventional means. However, we know that there is regularity among users and sessions, even when users have different goals. Our insight is that every time the application is used, a model is trained. While different runs might involve users with different simula- tion goals, we can track the distribution over hyperparameter settings across user sessions. This allows the system to exploit regularity be- tween users and sessions and learn from all the users of the system. • Effective Bayesian optimization requires the optimization of an acqui- sition function to select sample points. However, there are several different acquisition functions in the literature, none of which is guar- anteed to outperform the others on an arbitrary unknown function. In Chapter 5, we present a method of adaptively updating a portfolio of acquisition functions. At each time step, one is selected, and the en- tire portfolio can be updated based on changes in the posterior. This method is shown to perform better than the best individual acquisition function in the portfolio. 8 In Chapter 6, we present practical applications of our model in design galleries that allow artists to find parameters efficiently for graphics and an- imation tasks. Our most-sophisticated interface also allows expert users to continue to use familiar parameter-finding techniques, and to restrict areas of the optimization. User expertise is a valuable resource, and it is impor- tant that our system augments, rather than replaces, existing techniques. We show that users are able to find instances efficiently using the tool, and adaptive prior updates make the system significantly and progressively bet- ter at modelling the user. Finally, Chapter 7 is a brief discussion of interactive Bayesian optimiza- tion, future work, and advice to future practitioners. 9 Chapter 2 Background Numerous researchers before us have faced the task of optimizing expensive, black-box functions. In this chapter1, we will look at the existing work in this area. In §2.1–§2.3, we will detail the Gaussian process-based Bayesian optimization framework we will use as the basis of the work in the remainder of this thesis. In §2.4, we will look at the history and recent successes of Bayesian optimization and some of the related fields. We will also describe the experimental setting we use to evaluate our techniques in §2.5. The basic concepts of optimization and Gaussian processes are well- covered in the open literature, so we do not spend a great deal of time re-presenting this material. However, as the devil is often in the details, the reader unfamiliar with the computational details of Gaussian process inference and optimization may wish to have a more thorough text at hand. In working with Gaussian processes, we extensively consulted Rasmussen and William’s book on the topic [2006]. There is not yet an equivalent canonical text on Bayesian optimization, but we found the PhD theses of Schonlau [1997] and Lizotte [2008], and the book The Design and Analysis of Computer Experiments [Santner et al., 2003] to be excellent resources while working on this thesis. In addition, the PhD thesis of Osborne [2010] does 1An earlier version of this chapter was the basis of an online tutorial [Brochu et al., 2009], and a modified version will appear as part of an upcoming collection [Brochu et al., 2011]. Small parts of this chapter have also appeared in our previously-published work in the field [Brochu et al., 2007a; 2007b; 2010a; 2010b; Martinez–Cantin et al., 2009]. 10 an admirable job of covering Gaussian processes for prediction, optimization and quadrature, though as it was published after we had already written most of this chapter, it did not have a large impact on our work. 2.1 The Bayesian optimization approach Optimization is a broad and fundamental field of mathematics. In order to harness it to our ends, we need to narrow it down by defining the conditions we are concerned with. Our first restriction is to simply specify that the form of the problem we are concerned with is maximization, rather than the more common form of minimization. We adopt this form for convenience, because the main problems we are concerned with are optimizing the value and acquisition functions. The maximization of a real-valued function x? = argmaxx f(x) can be regarded as the minimization of the transformed function g(·) = −f(·). We also assume that the objective is Lipschitz-continuous. That is, there exists some constant C, such that for all x1,x2 ∈ A: ‖f(x1)− f(x2)‖ ≤ C‖x1 − x2‖, though C may be (and typically is) unknown. We can narrow the problem down further by defining it as one of global, rather than local optimization. In local maximization problems, we need only find a point x(?) such that f(x(?)) ≥ f(x),∀x s.t. ‖x(?) − x‖ < . If −f(·) is convex, then any local maximum is also a global maximum. However, in our optimization problems, we cannot assume that the negative objective function is convex. It might be the case, but we have no way of knowing before we begin optimizing. 11 It is common in global optimization, and true for our problem, that the objective is a black box function: we do not have an expression of the ob- jective function that we can analyze, and we do not know its derivatives. Evaluating the function is restricted to querying at a point x and getting a (possibly noisy) response of f(x) or (anti-)derivatives of f evaluated at x. Black box optimization also typically requires that all dimensions have bounds on the search space. In our case, we can safely make the simplify- ing assumption these bounds are all axis-aligned, so the search space is a hyperrectangle of dimension d. A number of approaches exist for this kind of global optimization and have been well-studied in the literature (e.g., [Törn and Žilinskas, 1989; Mongeau et al., 1998; Liberti and Maculan, 2006; Zhigljavsky and Žilinskas, 2008]). Deterministic approaches include interval optimization and branch and bound methods. Stochastic approximation is a popular idea for opti- mizing unknown objective functions in machine learning contexts [Kushner and Yin, 1997]. It is the core idea in most reinforcement learning algo- rithms [Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998], learning methods for Boltzmann machines and deep belief networks [Younes, 1989; Hinton and Salakhutdinov, 2006] and parameter estimation for nonlinear state space models [Poyiadjis et al., 2005; Martinez–Cantin et al., 2006]. However, these are generally unsuitable for our domain because they still require many samples, and in the active user-modelling domain drawing samples is expensive. Even in a noise-free domain, evaluating an objective function with known Lipschitz continuity L on a d-dimensional unit hypercube, guaranteeing the best observation f(x+) ≥ f(x?)− requires a minimax approach, requiring (L/2)d samples [Betrò, 1991]. This can be an incredibly expensive premium to pay for insurance against unlikely scenarios. As a result, the idea natu- rally arises to relax the guarantees against pathological worst-case scenarios. The goal, instead, is to use evidence and prior knowledge to maximize the posterior at each step, so that each new evaluation decreases the distance between the true global maximum and the expected maximum given the model. This is sometimes called “one-step” [Močkus, 1994] “average-case” 12 [Streltsov and Vakili, 1999] or “practical” [Lizotte, 2008] optimization. This average-case approach has weaker demands on computation than the worst- case approach. As a result, it may provide faster solutions in many practical domains where one does not believe the worst-case scenario is plausible. Bayesian optimization uses the prior and evidence to define a posterior distribution over the space of functions. The Bayesian model allows for an elegant means by which informative priors can describe attributes of the objective function, such as smoothness or the most likely locations of the maximum, even when the function itself is not known. Optimizing follows the principle of maximum expected utility, or, equivalently, minimum expected risk. The process of deciding where to sample next requires the choice of a utility function and a way of optimizing the expectation of this utility with respect to the posterior distribution of the objective function. This secondary optimization problem is usually easier because the utility is typically chosen so that it is easy to evaluate, though still nonconvex. To make clear which function we are discussing, we will refer to this utility as the acquisition function (also sometimes called the infill function). In §2.2, we will discuss some common acquisition functions. In practice, there is also the possibility of measurement noise, which we will assume is Gaussian. We define xi as the ith sample and yi = f(xi) + εi, with εi iid∼ N (0, σ2noise), as the noisy observation of the objective function at xi. We will discuss the noise model in §2.3. We focus here on real, Gaussian observations for ease of presentation, though this will be extended to discrete observations in Chapter 3. The Bayesian optimization procedure is shown in Algorithm 1. As men- tioned earlier, it has two components: the posterior distribution over the objective and the acquisition function. Let us focus on the posterior dis- tribution first and come back to the acquisition function in §2.2. As we accumulate observations D1:t = {x1:t, y1:t}, a prior distribution P (f) is com- bined with the likelihood function P (D1:t|f) to produce the posterior dis- tribution: P (f |D1:t) ∝ P (D1:t|f)P (f). The posterior captures the updated beliefs about the unknown objective function. 13 Algorithm 1 Bayesian Optimization 1: for t = 1, 2, . . . do 2: Find xt by optimizing the acquisition function over the GP: xt = argmaxx u(x|D1:t−1). 3: Sample the objective function: yt = f(xt) + εt. 4: Augment the data D1:t = {D1:t−1, (xt, yt)} and update the GP. 5: end for 2.1.1 Priors over functions Any Bayesian method depends on a prior distribution, by definition. A Bayesian optimization method will converge to the optimum if (i) the acqui- sition function is continuous and approximately minimizes the risk (defined as the expected deviation from the global minimum at a fixed point x); and (ii) conditional variance converges to zero (or appropriate positive minimum value in the presence of noise) if and only if the distance to the nearest ob- servation is zero [Močkus, 1982; 1994]. Many models could be used for this prior—early work mostly used the Wiener process (§2.4). However, Gaus- sian process (GP) priors for Bayesian optimization date back at least to the late 1970s [O’Hagan, 1978; Žilinskas, 1980]. Močkus [1994] explicitly set the framework for the Gaussian process prior by specifying the additional “simple and natural” conditions that (iii) the objective is continuous; (iv) the prior is homogeneous; (v) the optimization is independent of the mth differences. This includes a very large family of common optimization tasks, and Močkus showed that the GP prior is well-suited to the task. A GP is a stochastic process for which any finite combination of samples will be normally distributed. However, for our purposes, it is often useful to intuitively think of a GP as analogous to a function, but instead of returning a scalar f(x) for an arbitrary x, it returns the mean and variance of a normal distribution (Figure 2.1) over the possible values of f at x. Stochastic processes are sometimes called “random functions”, by analogy to random variables. Just as a Gaussian distribution is a distribution over a random variable, completely specified by its mean and covariance, a GP is a distribution over functions, completely specified by its mean function, m and covariance 14 Figure 2.1: Simple 1D Gaussian process with three observations. The solid black line is the GP surrogate mean prediction of the objective function given the data, and the shaded area shows the mean plus and minus the variance. The superimposed Gaussians correspond to the GP mean and standard deviation (µ(·) and σ(·)) of prediction at the points, x1:3. function, k: f(·) ∼ GP(m(·), k(·, ·)). For convenience, we assume here that the prior mean is the zero function and the covariance is appropriately scaled; in §4.1 we present alternative priors for the mean and covariance. This leaves us the more interesting question of defining the covariance function k. A very popular choice is the 15 squared exponential function: k(xi,xj) = exp ( −1 2 ‖xi − xj‖2 ) . (2.1) Note that this function approaches 1 as values get close together and 0 as they get further apart. Two points that are close together can be expected to have a very large influence on each other, whereas distant points have almost none. This is a necessary condition for convergence under the assumptions of [Močkus, 1994]. We will discuss more sophisticated kernels in §2.1.2. If we were to sample from the prior, we would choose {x1:t} and sample the values of the function at these indices to produce the pairs {x1:t, f1:t}, where f1:t = f(x1:t). The function values are drawn according to a multi- variate normal distribution N (0,K), where the kernel matrix is given by: K = k(x1,x1) . . . k(x1,xt) ... . . . ... k(xt,x1) . . . k(xt,xt) . Of course, the diagonal values of this matrix are 1, which is only possible in a noise-free environment. We will discuss noise in §2.3. Also, recall that we have for simplicity chosen the zero mean function. In our optimization tasks, however, we will use data from an external model to fit the GP and get the posterior. Assume that we already have the observations {x1:t, f1:t}, say from previous iterations, and that we want to use Bayesian optimization to decide what point xt+1 should be considered next. Let us denote the value of the function at this arbitrary point as ft+1 = f(xt+1). Then, by the properties of Gaussian processes, f1:t and ft+1 are jointly Gaussian:[ f1:t ft+1 ] ∼ N ( 0, [ K k kT k(xt+1,xt+1) ]) , where k = [ k(xt+1,x1) k(xt+1,x2) · · · k(xt+1,xt) ] 16 Using the Sherman-Morrison-Woodbury formula (see, e.g., [Rasmussen and Williams, 2006; Press et al., 2007]), one can easily arrive at an expression for the predictive distribution: P (ft+1|D1:t,xt+1) = N ( µt(xt+1), σ2t (xt+1) ) where µt(xt+1) = kTK−1f1:t σ2t (xt+1) = k(xt+1,xt+1)− kTK−1k. That is, µt(xt+1) and σ2t (xt+1) model the predictive posterior distribution P (ft+1|D1:t,xt+1). For legibility, we will omit the subscripts on µ and σ except where it might be unclear. In our sequential decision making setting, the number of query points is relatively small and, consequently, the GP predictions are easy to compute. In other problems, the number of data may be quite large, in which case some form of adaptive ‘dropping’ of data is required (see e.g. [Osborne et al., 2010]). Note that the idea of using a GP approximation to compute attributes of an expensive function is not unique to Bayesian optimization. For example, in the Hybrid Monte Carlo literature, the GPHMC algorithm [Rasmussen, 2003] uses a GP for most of the HMC computations, sampling the actual posterior only to ensure the approximation is reasonable. 2.1.2 Choice of covariance functions The choice of covariance function for the Gaussian process is crucial, as it determines the smoothness properties of samples drawn from it. The squared exponential kernel in Eqn. (2.1) is actually a little naive, in that divergences of all features of x affect the covariance equally. Typically, it is necessary to generalize by adding hyperparameters. In an isotropic model, this can be done with a single hyperparameter θ, which 17 1.5 1.0 0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 θ=0.1 0 1 2 3 4 5 3 2 1 0 1 2 3 1.5 1.0 0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 θ=0.2 0 1 2 3 4 5 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 1.5 1.0 0.5 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 1.0 θ=0.5 0 1 2 3 4 5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Figure 2.2: The effect of changing the kernel hyperparameters. Shown are squared exponential kernels with θ = 0.1, 0.2, 0.5. On the left is the function k(0,x). On the right are some one-dimensional functions sampled from a GP with the hyperparameter value. controls the width of the kernel: k(xi,xj) = exp ( − 1 2θ2 ‖xi − xj‖2 ) . For anisotropic models, a very popular choice is the squared exponential kernel with a vector of automatic relevance determination (ARD) hyperpa- rameters θ Neal:1996,Williams:1996: k(xi,xj) = exp (− 12(xi − xj)T diag(θ)−2(xi − xj)), where diag(θ) is a diagonal matrix with d entries θ along the diagonal. Intuitively, if a particular θ` has a small value, the kernel becomes indepen- dent of the `th input, effectively removing irrelevant dimensions. Figure 2.2 18 shows examples of different hyperparameter values on the squared exponen- tial function and what functions sampled from those values look like. In §4.1.1, we will discuss how the hyperparameters can be learned. Another important kernel for Bayesian optimization is the Matérn kernel [Matérn, 1960; Stein, 1999], which incorporates a smoothness parameter ς to permit greater flexibility in modelling functions: k(xi,xj) = 1 2ς−1Γ(ς) (2 √ ς ‖xi − xj‖)ς Hς (2 √ ς ‖xi − xj‖) , where Γ(·) and Hς(·) are the Gamma function and the Bessel function of order ς. Note that as ς →∞, the Matérn kernel reduces to the squared ex- ponential kernel, and when ς = 0.5, it reduces to the unsquared exponential kernel. As with the squared exponential, length-scale hyperparameter are often incorporated. While the squared exponential and Matérn are the most common ker- nels for GPs, numerous others have been examined in the machine learn- ing literature (see, e.g., [Genton, 2001] or [Rasmussen and Williams, 2006, Chapter 4] for an overview). Appropriate covariance functions can also be used to extend the model in other interesting ways. For example, the recent sequential sensor work of Osborne, Garnett and colleagues uses GP models with extensions to the covariance function to model the characteristics of changepoints [Osborne et al., 2010] and the locations of sensors in a network [Garnett et al., 2010a]. A common additional hyperparameter is simply a scalar applied to k to control the magnitude of the variance. Determining which of a set of possible kernel functions to use for a prob- lem typically requires a combination of engineering and automatic model selection, either hierarchical Bayesian model selection [Mackay, 1992] or cross-validation. However, these methods require fitting a model given a representative sample of data. In Chapter 4, we discuss how model selec- tion can be performed using models believed to be similar. The techniques introduced in Chapter 5 could also be applied to model selection, though that is outside the scope of this thesis. 19 2.2 Acquisition functions for Bayesian optimization Now that we have discussed placing priors over smooth functions and how to update these priors in light of new observations, we will focus our at- tention on the acquisition component of Bayesian optimization. The role of the acquisition function is to guide the search for the optimum. Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because the pre- diction is high, the uncertainty is great, or both. Maximizing the acquisition function is used to select the next point at which to evaluate the function. That is, we wish to sample f at argmaxx u(x|D), where u(·) is the generic symbol for an acquisition function. The choice of acquisition function is nontrivial. In this chapter, we will discuss several different methods which been proposed in the literature, and show how they give rise to distinct sampling behaviour. Each works well for certain classes of functions, and it is often difficult or impossible to know which will perform best on an unknown function. In Chapter 5 we propose a solution to this problem, based on a hierarchical hedging approach for managing an adaptive portfolio of acquisition functions. 2.2.1 Improvement-based acquisition functions The early work of Kushner [1964] suggested maximizing the probability of improvement over the incumbent f(x+), where x+ = argmaxxi∈x1:t f(xi), so that PI(x) = P (f(x) ≥ f(x+)) = Φ ( µ(x)− f(x+) σ(x) ) , where Φ(·) is the normal cumulative distribution function. This function is also sometimes called MPI (for “maximum probability of improvement”) or “the P -algorithm” (since the utility is the probability of improvement). The drawback, intuitively, is that this formulation is pure exploitation. 20 Points that have a high probability of being infinitesimally greater than f(x+) will be drawn over points that offer potentially larger gains but less certainty. As a result, a modification is to add a trade-off parameter ξ ≥ 0: PI(x) = P (f(x) ≥ f(x+) + ξ) = Φ ( µ(x)− f(x+)− ξ σ(x) ) , (2.2) The exact choice of ξ is left to the user, though Kushner recommended a schedule for ξ, so that it started fairly high early in the optimization, to drive exploration, and decreased toward zero as the algorithm continued. Several researchers have studied the empirical impact of different values of ξ in different domains [Törn and Žilinskas, 1989; Jones, 2001; Lizotte, 2008]. An appealing characteristic of this formulation for perceptual and pref- erence models is that while maximizing PI(·) is still greedy, it selects the point most likely to offer an improvement of at least ξ. This can be useful in psychoperceptual tasks, where there is a threshold of perceptual difference. Jones [2001] notes that the performance of PI(·) “is truly impressive. It would be quite natural if the reader, like so many others, became enthusiastic about this approach. But if there is a single lesson to be taken away from this paper, it is that nothing in this response-surface area is so simple. There always seems to be a counterexample. In this case, the difficulty is that [the PI(·) method] is extremely sensitive to the choice of the target. If the desired improvement is too small, the search will be highly local and will only move on to search globally after searching nearly exhaustively around the current best point. On the other hand, if [ξ] is set too high, the search will be excessively global, and the algorithm will be slow to fine-tune any promising solutions.” A somewhat more satisfying alternative acquisition function would be one that takes into account not only the probability of improvement, but the magnitude of the improvement a point can potentially yield. In particular, 21 Figure 2.3: Gaussian process from Figure 2.1, additionally showing the re- gion of probable improvement. The maximum observation is at x+. The darkly-shaded area in the superimposed Gaussian above the dashed line can be used as a measure of improvement, I(x). The model pre- dicts almost no possibility of improvement by observing at x1 or x2, while sampling at x3 is more likely to improve on f(x+). we want to minimize the expected deviation from the true maximum f(x?), when choosing a new trial point: xt+1 = argmin x E(‖ft+1(x)− f(x?)‖ |D1:t) = argmin x ∫ ‖ft+1(x)− f(x?)‖P (ft+1|D1:t)dft+1, Note that this decision process is myopic in that it only considers one-step- ahead choices. However, if we want to plan two steps ahead, we can easily 22 apply recursion: xt+1 = argmin x E ( min x′ E(‖ft+2(x′)− f(x?)‖ |Dt+1) |D1:t ) One could continue applying this procedure of dynamic programming for as many steps ahead as desired. However, because of its expense, Močkus et al. [1978] proposed the alternative of maximizing the expected improve- ment with respect to f(x+). Specifically, Močkus defined the improvement function as: I(x) = max{0, ft+1(x)− f(x+)}. That is, I(x) is positive when the prediction is higher than the best value known thus far. Otherwise, I(x) is set to zero. The new query point is found by maximizing the expected improvement: x = argmax x E(max{0, ft+1(x)− f(x+)} |Dt) The likelihood of improvement I on a normal posterior distribution charac- terized by µ(x), σ2(x) can be computed from the normal density function, 1√ 2piσ(x) exp ( −(µ(x)− f(x +)− I)2 2σ2(x) ) . The expected improvement is the integral over this function: E(I) = ∫ I=∞ I=0 I 1√ 2piσ(x) exp ( −(µ(x)− f(x +)− I)2 2σ2(x) ) d I = σ(x) [ µ(x)− f(x+) σ(x) Φ ( µ(x)− f(x+) σ(x) ) + φ ( µ(x)− f(x+) σ(x) )] The expected improvement can be evaluated analytically [Močkus et al., 23 1978; Jones et al., 1998], yielding: EI(x) = { (µ(x)− f(x+))Φ(Z) + σ(x)φ(Z) if σ(x) > 0 0 if σ(x) = 0 (2.3) Z = µ(x)− f(x+) σ(x) where φ(·) and Φ(·) denote the PDF and CDF of the standard normal dis- tribution respectively. Figure 2.3 illustrates a typical expected improvement scenario. Exploration-exploitation trade-off The expectation of the improvement function with respect to the predictive distribution of the Gaussian process enables us to balance the trade-off of exploiting and exploring. When exploring, we should choose points where the surrogate variance is large. When exploiting, we should choose points where the surrogate mean is high. It is highly desirable for our purposes to express EI(·) in a general- ized form which controls the trade-off between global search and local op- timization (exploration/exploitation). Schonlau [1997] suggests adding a non-negative integer parameter γ, such that I(xt+1) = max{0, (µ(xt+1)− f(x+))γ} which results in an expected improvement of EIγ(x) = σγ(x) γ∑ j=0 (−1)j ( γ j ) Zg−jTj , 24 where T0 = Φ(Z) T1 = −φ(Z) Tj>1 = −Zj−1φ(Z) + (j − 1)Tj−2 When γ = 1, (which degenerates to Eqn. (2.3)), emphasis is placed on trying to improve near x+, unless the observations strongly suggest im- provement in areas of high variance. As γ is increased, areas of larger, but less-likely, increase will be favoured. While there is no obvious way to se- lect γ for an arbitrary function, Sasena [2002] proposes an annealing-type schedule to allow global search to smoothly collapse to local improvement. Lizotte [2008] suggests an alternative ξ ≥ 0 parameter such that: EI(x) = { (µ(x)− f(x+)− ξ)Φ(Z) + σ(x)φ(Z) if σ(x) > 0 0 if σ(x) = 0 ,(2.4) where Z = { µ(x)−f(x+)−ξ σ(x) if σ(x) > 0 0 if σ(x) = 0 . This ξ is very similar in flavour to the ξ used in Eqn. (2.2). Lizotte’s ex- periments suggest that setting ξ = 0.01 (scaled by the signal variance if necessary) works well in almost all cases, and interestingly, setting a cooling schedule for ξ to encourage exploration early and exploitation later does not work well empirically, contrary to intuition (though Lizotte did find that a cooling schedule for ξ might slightly improve performance on short runs (t < 30) of PI optimization). 2.2.2 Confidence bound criteria Cox and John [1992; 1997] introduce an algorithm they call “Sequential Design for Optimization”, or SDO. Given a random function model, SDO 25 Figure 2.4: Examples of acquisition functions and their settings. The GP posterior is shown at top. The other images show the acquisition functions for that GP. From the top: probability of improvement (Eqn. (2.2)), expected improvement (Eqn. (2.4)) and upper confidence bound (Eqn. (2.5)). The maximum of each function is shown with a triangle marker. selects points for evaluation based on the lower confidence bound of the prediction site: LCB(x) = µ(x)− κσ(x), where κ ≥ 0. While they are concerned with minimization, we can maximize by instead defining the upper confidence bound: UCB(x) = µ(x) + κσ(x). 26 Figure 2.5: Examples of acquisition functions and their settings in 2 dimen- sions. The top row shows the objective function (which is the scaled Branin function here), and the posterior mean and variance estimates µ(·) and σ2(·). The samples used to train the GP are shows with white dots. The second row shows the acquisition functions for the GP. From left to right: probability of improvement (Eqn. (2.2)), expected improvement (Eqn. (2.4)) and upper confidence bound (Eqn. (2.5)). The maximum of each function is shown with a triangle marker. Like other parameterized acquisition models we have seen, the param- eter κ is left to the user. However, an alternative acquisition function has been proposed by Srinivas et al. [2010]. Casting the Bayesian optimization problem as a multi-armed bandit, the acquisition is the instantaneous regret function r(x) = f(x?)− f(x). 27 0 .0 0 .5 1 .0 101 t = 1 0 .0 0 .5 1 .0 PI t = 2 t = 3 t = 4 t = 5 t = 6 0 .0 0 .5 1 .0 101 t = 1 0 .0 0 .5 1 .0 EI t = 2 t = 3 t = 4 t = 5 t = 6 0 .0 0 .5 1 .0 101 t = 1 0 .0 0 .5 1 .0 GP-UCB t = 2 t = 3 t = 4 t = 5 t = 6 F ig u re 2. 6: C om pa ri so n of pr ob ab ili ty of im pr ov em en t (t op ), ex pe ct ed im pr ov em en t (m id dl e) an d up pe r co nfi de nc e bo un d (b ot to m ) ac qu is it io n fu nc ti on s on a to y 1D pr ob le m . In th e up pe r ro w s, th e ob je ct iv e fu nc ti on is sh ow n w it h th e do tt ed re d lin e, th e so lid bl ue lin e is th e G P po st er io r m ea n. In th e lo w er ro w s, th e re sp ec ti ve ac qu is it io n fu nc ti on s ar e sh ow n, w it h a st ar de no ti ng th e m ax im um . T he op ti m iz at io ns ar e in it ia liz ed w it h th e sa m e tw o po in ts , bu t qu ic kl y fo llo w di ff er en t sa m pl in g tr aj ec to ri es . In pa rt ic ul ar , no te th at th e co ns er va ti ve P I al go ri th m ig no re s th e re gi on ou ts id e th e lo ca l m ax im um on ce it is de te rm in ed th er e is m in im al ch an ce of im pr ov em en t, w hi le E I an d G P -U C B co nt in ue to ex pl or e. T hi s is a kn ow n pr ob le m w it h th e P I al go ri th m . 28 The goal of optimizing in the framework is to find: min T∑ t r(xt) = max T∑ t f(xt), where T is the number of iterations the optimization is to be run for. Using the upper confidence bound selection criterion with κt = √ ντt and the hyperparameter ν > 0 Srinivas et al. define GP-UCB(x) = µ(x) + √ ντtσ(x). (2.5) It can be shown that this method has a cumulative regret which is bounded by O( √ TβTγT ) with high probability. Here γT and βT are problem-specific, depending upon the particular form of kernel-function used, but for many kernels their product can be shown to be sublinear in T . We refer the interested reader to the original paper [Srinivas et al., 2010] for exact bounds. The sublinear bound on cumulative regret implies that the method is no-regret, i.e., that limT→∞RT /T = 0. This in turn provides a bound on the convergence rate for the optimization process, since the regret at the maximum f(x∗)−maxt f(xt) is upper bounded by the average regret RT T = f(x∗)− 1 T T∑ t=1 f(xt). Figures 2.4 and 2.5 show how with the same GP posterior, different acquisition functions with different maxima are defined. Figure 2.6 shows how PI, EI and GP-UCB give rise to distinct sampling behaviour over time. 2.2.3 Maximizing the acquisition function To find the point at which to sample, we still need to maximize the con- strained objective u(x). Unlike the original unknown objective function, u(·) can be cheaply sampled. We optimize the acquisition function using DIRECT [Jones et al., 1993], a deterministic, derivative-free optimizer. It uses the ex- isting samples of the objective function to decide how to proceed to DIvide 29 the feasible space into finer RECTangles. A particular advantage in active learning applications is that DIRECT can be implemented as an “any-time” algorithm, so that as long as the user is doing something else, it continues to optimize, and when interrupted, the program can use the best results found to that point in time. Methods such as Monte Carlo and multistart have also been used, and seem to perform reasonably well [Močkus, 1994; Lizotte, 2008]. 2.3 Noise The model we’ve used so far assumes that we have perfectly noise-free ob- servations. In real life, this is rarely possible, and instead of observing f(x), we can often only observe a noisy transformation of f(x). The simplest transformation arises when f(x) is corrupted with Gaussian noise ∼ N (0, σ2noise) [Rasmussen and Williams, 2006]. If the noise is addi- tive, we can easily add the noise distribution to the Gaussian distribution N (0,K) and define yi = f(xi) + i. Since the mean is zero, this type of noise simply requires that we replace K with K + σ2noiseI. This yields the predictive distribution: P (yt+1|D1:t,xt+1) = N (µt(xt+1), σ2t (xt+1) + σ2noise), and the sufficient statistics µt(xt+1) = kT [K + σ2noiseI] −1y1:t σ2t (xt+1) = k(xt+1,xt+1)− kT [K + σ2noiseI]−1k. In a noisy environment, we also change the definition of the incumbent in the PI and EI acquisition functions. Instead of using the best observation, we use the distribution at the sample points, and define as the incumbent, 30 the point with the highest expected value, µ+ = argmax xi∈x1:t µ(xi). This avoids the problem of attempting to maximize probability or expected improvement over an unreliable sample. It is also possible to resample po- tential incumbents to get more reliable estimates of the values in a noisy environment [Bartz-Beielstein et al., 2005; Huang et al., 2006; Hutter et al., 2009], a process sometimes called intensification. Nonstationary noise mod- els are also possible, such as autoregressive moving-average noise [Murray- Smith and Girard, 2001] and heteroskedastic Gaussian noise [Goldberg et al., 1998]. 2.4 A brief history of Bayesian optimization The earliest work we are aware of resembling the modern Bayesian opti- mization approach is the early work of Kushner [1964], who used Wiener processes for unconstrained one-dimensional problems. Kushner’s decision model was based on maximizing the probability of improvement (§2.2.1). He also included a parameter that controlled the trade-off between ‘more global’ and ‘more local’ optimization, in the same spirit as the exploration- exploitation trade-off. A key difference is that in a (one-dimensional) Wiener process, the intervals between samples are independent, and Kushner was concerned with the problem of selecting from a finite set of intervals. Later work extended Kushner’s technique to multidimensional optimization, us- ing, for example, interpolation in a Delauney triangulation of the space [El- der, 1992] or projecting Wiener processes between sample points [Stuckman, 1988]. Meanwhile, in the former Soviet Union, Močkus and colleagues developed a multidimensional Bayesian optimization method using linear combinations of Wiener fields. This was first published in English as [Močkus et al., 1978]. This paper also, significantly, describes an acquisition function that is based on myopic expected improvement of the posterior, which has been widely 31 adopted in Bayesian optimization as the expected improvement function (§2.2.1). A more recent review of Močkus’ approach is [Močkus, 1994]. At the same time, a large, related body of work emerged under the name kriging (§2.4.1), in honour of the South African student who developed this technique at the University of the Witwatersrand [Krige, 1951], though largely popularized by Matheron and colleagues (e.g. [Matheron, 1971]). In kriging, the goal is interpolation of a random field via a linear predictor. The errors on this model are typically assumed to not be independent, and are modelled with a Gaussian process. More recently, Bayesian optimization using Gaussian processes has been successfully applied to derivative-free optimization and experimental design, where it is called Efficient Global Optimization, or EGO (§2.4.2). There exist several consistency proofs for this algorithm in the one- dimensional setting [Locatelli, 1997] and one for a simplification of the al- gorithm using simplicial partitioning in higher dimensions [Žilinskas and Žilinskas, 2002]. The convergence of the algorithm using multivariate Gaus- sian processes has been recently established in [Vasquez and Bect, 2008]. 2.4.1 Kriging Kriging has been used in geostatistics and environmental science since the 1950s and remains important today. We will briefly summarize the con- nection to Bayesian optimization here. More detailed examinations can be found in, for example, [Stein, 1999; Sasena, 2002; Diggle and Ribeiro, 2007]. This section is primarily drawn from these sources. In many modelling techniques in statistics and machine learning, it is assumed that samples drawn from a process with independent, identically distributed residuals, typically, ε ∼ N (0, σ2noise): y(x) = f(x) + ε In kriging, however, the usual assumption is that errors are not inde- pendent, and are, in fact, spatially correlated: where errors are high, it is expected that nearby errors will also be high. Kriging is a combination of a 32 linear regression model and a stochastic model fitted to the residual errors of the linear model. The residual is modelled with a zero-mean Gaussian process, so ε is actually parameterized by x: ε(x) ∼ N (0, σ2(x)). The actual regression model depends on the type of kriging. In simple kriging, f is modelled with the zero function, making it a zero-mean GP model. In ordinary kriging, f is modelled with a constant but unknown function. Universal kriging models f with a polynomial of degree k with bases m and coefficients β, so that y(x) = k∑ j=1 βjmj(x) + ε(x). Other, more exotic types of kriging are also used. Clearly, kriging and Bayesian optimization are very closely related. There are some key differences in practice, though. In Bayesian optimization, mod- els are usually fit through maximum likelihood. In kriging, models are usu- ally fit using a variogram, a measure of the average dissimilarity between samples versus their separation distance. Fitting is done using least squares or similar numerical methods, or interactively, by an expert visually inspect- ing the variogram plot with specially-designed software. Kriging also often restricts the prediction model to use only a small number of neighbours, making it fit locally while ignoring global information. Bayesian optimiza- tion normally uses all the data in order to learn a global model. 2.4.2 Experimental design Kriging has been applied to experimental design under the name DACE, after “Design and Analysis of Computer Experiments”, the title of a paper by Sacks et al. [1989] (and more recently a book by Santner et al. [2003]). In DACE, the regression model is a best linear unbiased predictor (BLUP), and the residual model is a noise-free Gaussian process. The goal is to find a design point or points that optimizes some criterion. The “efficient global optimization”, or EGO, algorithm is the combina- tion of DACE model with the sequential expected improvement (§2.2.1) ac- 33 quisition criterion. It was published in a paper by Jones et al. [1998] as a re- finement of the SPACE algorithm (Stochastic Process Analysis of Computer Experiments) [Schonlau, 1997]. Since EGO’s publication, there has evolved a body of work devoted to extending the algorithm, particularly in adding constraints to the optimization problem [Audet et al., 2000; Sasena, 2002; Boyle, 2007], and in modelling noisy functions [Bartz-Beielstein et al., 2005; Huang et al., 2006; Hutter et al., 2009; Hutter, 2009]. In so-called “classical” experimental design, the problem to be addressed is often to learn the parameters ζ of a function gζ such that yi = gζ(xi) + εi,∀i ∈ 1, . . . , t with noise εi (usually Gaussian) for scalar output yi. xi is the ith set of experimental conditions. Usually, the assumption is that gζ is linear, so that yi = ζTxi + εi. An experiment is represented by a design matrix X, whose rows are the independent inputs x1:t. If we let ε ∼ N (0, σ), then for the linear model, the variance of the parameter estimate ζ̂ is V ar(ζ̂) = σ2(XTX)−1, and for an input xi, the prediction is V ar(ŷt) = σ2xTi (X TX)−1xi. An optimal design is a design matrix that minimizes some character- istic of the inverse moment matrix (XTX)−1. Common criteria include A-optimality, which minimizes the trace; D-optimality, which minimizes the determinant; and E-optimality, which minimizes the maximum eigenvalue. Experimental design is usually non-adaptive: the entire experiment is designed before data is collected. However, sequential design is an important and active subfield (e.g. [Williams et al., 2000; Busby, 2009]). 34 2.4.3 Active learning Active learning is another area related to Bayesian optimization, and of particular relevance to our task. Active learning is closely related to exper- imental design and, indeed, the decision to describe a particular problem as active learning or experimental design is often arbitrary. However, there are a few distinguishing characteristics that most (but by no means all) active learning approaches use. • Active learning is most often adaptive: a model exists that incorpo- rates all the data seen, and this is used to sequentially select candidates for labelling. Once labelled, the data are incorporated into the model and new candidates selected. This is, of course, the same strategy used by Bayesian optimization. • Active learning employs an oracle for data labelling, and this oracle is very often a human being, as is the case with interactive Bayesian optimization. • Active learning is usually concerned with selecting candidates from a finite set of available data (pool-based sampling). Experimental design and optimization are usually concerned with continuous domains. • Finally, active learning is usually used to learn a model for classifi- cation, or, less commonly, regression. Usually the candidate selection criterion is the maximization of some informativeness measure, or the minimization of uncertainty. These criteria are often closely related to the alphabetic criteria of experimental design. An excellent recent overview of active learning is [Settles, 2010]. We are particularly interested in active learning because it often uses a human oracle for label acquisition. An example using GPs is the object categorization of Kapoor et al. [2007]. In this work, candidates are selected from a pool of unlabelled images so as to maximize the margin of the GP classifier and minimize the uncertainty. Osborne et al. [2010] also use active learning in a Gaussian process problem, deciding when to sample variables of interest 35 in a sequential sampling problem with faults and changepoints. Chu and Ghahramani [2005a] briefly discuss how active learning could be used for ranking GPs, by selecting sample points that maximize entropy gained with a new preference relation. Interesting recent work with GPs that straddles the boundary between active learning and experimental design is the sensor placement problem of Krause et al. [2008]. They examine several criteria, including maximum entropy, and argue for using mutual information. Ideally, they would like to simultaneously select a set of points to place the entire set of sensors in a way that maximizes the mutual information. This is essentially a clas- sical experimental design problem with maximum mutual information as the design criterion. However, maximizing mutual information over a set of samples is NP-complete, so they use an active learning approach. By exploiting the submodularity of mutual information, they are able to show that sequentially selecting sensor locations that greedily maximize mutual information, they can bound the divergence of the active learning approach from the experimental design approach. This work influenced the GP-UCB acquisition function (§2.2.2). Finally, as an aside, in active learning, a common acquisition strategy is selecting the point of maximum uncertainty. This is called uncertainty sampling [Lewis and Gale, 1994]. GPs have the useful property that the posterior variance (interpreted as uncertainty) is independent of the actual observations! As a result, if this is the criterion, the entire active learn- ing scheme can be designed before a single observation are made, making adaptive sampling unnecessary. 2.4.4 Applications Bayesian optimization has recently begun to appear in the machine learning literature as a means of optimizing difficult black box optimizations. A few recent examples include: • Lizotte et al. [2007; 2008] used Bayesian optimization to learn a set of robot gait parameters that maximize velocity of a Sony AIBO ERS-7 36 robot. As an acquisition function, the authors used maximum prob- ability of improvement (§2.2.1). They show that the Bayesian op- timization approach not only outperformed previous techniques, but used drastically fewer evaluations. • Frean and Boyle [2008] use Bayesian optimization to learn the weights of a neural network controller to balance two vertical poles with dif- ferent weights and lengths on a moving cart. • Cora’s MSc thesis [2008] uses Bayesian optimization to learn a hierar- chical policy for a simulated driving task. At the lowest level, using the vehicle controls as inputs and fitness to a course as the response, a pol- icy is learned by which a simulated vehicle performs various activities in an environment. • Martinez–Cantin et al. [2009] also applied Bayesian optimization to policy search. In this problem, the goal was to find a policy for robot path planning that would minimize uncertainty about its location and heading, as well as minimizing the uncertainty about the environmen- tal navigation landmarks. • Hutter’s PhD thesis [2009] studies methods of automatically tuning al- gorithm parameters, and presents several sequential approaches using Bayesian optimization. • The work of Osborne, Garnett et al. [Osborne, 2010; Osborne et al., 2010; Garnett et al., 2010b] uses Bayesian optimization to select the locations of a set of (possibly heterogenous) sensors in a dynamic sys- tem. In this case, the samples are a function of the locations of the entire set of sensors in the network, and the objective is the root mean squared error of the predictions made by the sensor network. 2.5 Experiments We have conducted a series of experiments to demonstrate the effectiveness of our methods. In the experiments of Chapters 3–5, we study the behaviour 37 function d Branin 2 Hartman 3 3 Shekel 10 4 Hartman 6 6 Table 2.1: Standard test functions used in this thesis, and their dimension- ality. The function expressions are described in Appendix A. of our methods applied to known mathematical functions. By not using hu- man judgement and adopting functions whose optima are known, we are able to measure performance precisely. As part of the discussion of our ap- plications in Chapter 6, we will also bring users into the loop to validate our approach. While we know our methods work well in a simulated environ- ment, we want to ensure that some overlooked aspect of human psychology does not cause our assumptions to be violated for the application. Most previous studies of Bayesian optimization (e.g., [Schonlau, 1997; Streltsov and Vakili, 1999; Sasena, 2002; Lizotte, 2008]) have been concerned with the number of evaluations required to reach the global optimum to within some error, or the error after some pre-determined number of samples. In interactive Bayesian optimization, however, we consider it more desirable to know how well the optimization performs doing at each iteration, which tells us whether a user could be seen to be making progress—that is, it is important to track the evolution of the optimization, not just the final result. To measure optimization efficiency at each time step, Huang et al. [2006] suggest using the “gap” metric: G = f(xfirst)− f(x+) f(xfirst)− f(x?) , (2.6) where f(xfirst) is the value of the first function sample (or in the case where multiple samples were used on the first iteration, the max of that set), and f(x+) is the best value seen so far. G will therefore be a number between 0, indicating no improvement over the initial sample (f(x+) = f(xfirst)), and 1, meaning that the incumbent is the maximum (f(x+) = f(x?)). 38 2.5.1 Test functions The set of test functions we used is based on previous work with Bayesian optimization: the full set of functions used in the PhD theses of Schonlau [1997], Sasena [2002] and Lizotte [2008], excepting the ‘mystery’ function unique to Sasena, and two of the two-dimensional functions. In our exper- iments we found that the additional 2D functions (the Goldstein-Price and 6-Hump Camelback) performed similarly to the Branin, and in any case we wish to focus our attention on higher-dimensional functions instead. In- stead, we use the Hartman 3 function, so we can see the impact of increasing dimensionality on performance. Details of the test functions can be found in Appendix A. The standard optimization test functions we use are listed in Table 2.1. We are aware of no widely-used test functions for Bayesian optimization with dimensionality higher than the 6-dimensional version of the Hartman function, which Schonlau, Sasena, Lizotte, Streltsov & Vakili, Finkel [2003] and Osborne et al. [2009] all use. Boyle [2007] uses synthetic functions to test the impact of dimensionality on Bayesian optimization, but also stops at 6 dimensions. (Streltsov and Vakili test a 10-dimensional version of the Rastrigin function, however this function is pathologically multimodal and not very relevant to our work.) It’s not at all clear, though, that even if there were standard test func- tions for higher dimensions that we would want to use them. As discussed in Chapter 1, it is trivial, especially in high dimensions, to create a syn- thetic objective function that requires an unfeasible number of evaluations to optimize. We are not interested here in solving this problem. In an in- teractive optimization problem, we are interested in problems we already know can be optimized by a human being, because in the past a human be- ing has optimized them, even if by finding a set of good parameters with a parameter-twiddling interface. Instead, the problem is optimizing efficiently. As there is no generally-agreed-upon test functions for Bayesian opti- mization in higher dimensions, we seek to sample synthetic functions from a known GP prior. Lizotte follows a similar strategy. A GP prior is infinite- 39 Figure 2.7: Some randomly-generated test functions, sampled from a Gaus- sian ARD kernel with random hyperparameters. dimensional, so on a practical level for performing experiments we simulate this by sampling points from the prior. For each trial, we use an ARD kernel with θ drawn randomly from [0, 2]d to get a mix of informative and uninformative dimensions. We then iteratively sample a fixed number of d- dimensional points using a Latin hypercube design, compute the covariance matrix K, and sample f ∼ GP(0,K). Using the response surface as the ob- jective function, we can then approximate the maximum using conventional global optimization techniques to get f(x?), which permits us to use the gap metric for performance. Note that these sample points are only used to construct the objective, and are not known to the optimization methods. Some examples of randomly-generated 2D synthetic functions are shown in Figure 2.7. We will use these test functions in the machine experiments of Chap- ters 3–5 to test and verify our theoretical contributions. In Chapter 6, we will test our extensions with a real applications and human users. 40 Chapter 3 Preference galleries The Bayesian optimization model described in Chapter 2 requires that each function evaluation have a continuous response. However, this is not al- ways the case. In applications requiring human perception, for instance, preferences are often more accurate than ratings. Prospect theory, for ex- ample, employs utility models based on relation to a reference point, based on evidence that the human perceptual apparatus is attuned to evaluate differences rather than absolute magnitudes [Kahneman and Tversky, 1979; Tversky and Kahneman, 1992]. We present here a Bayesian optimization application based on discrete choice for a “preference gallery” application1. In the case of a person rating the suitability of a procedurally-generated animation or image, each sample of value involves creating an instance with the given parameters and asking a human to provide feedback, which is interpreted as the function response. This is a very expensive class of func- tions to evaluate! Furthermore, it is in general impossible to even sample the function directly and get a consistent response from users. Asking hu- mans to rate an animation on a numerical scale has built-in problems—not only will scales vary from user to user, but human evaluation is subject to phenomena such as drift, where the scale varies over time, anchoring, 1This chapter is based on published work presented at SIGGRAPH (poster session) [Brochu et al., 2007a] and Advances in Neural Information Processing Systems [Brochu et al., 2007b]. Some experiments have been added to use the same experimental framework as the rest of this thesis. 41 in which early experiences dominate the scale [Siegel and Castellan, 1988; Payne et al., 1993]. However, human beings do excel at comparing options and expressing a preference for one over others [Kingsley, 2006]. This insight allows us to approach the optimization function in another way. By present- ing two or more realizations to a user and requiring only that they indicate preference, we can get more robust results with much less cognitive burden on the user [Kendall, 1975]. While this means we can’t get responses for a value function directly, we model the value as a latent function, inferred from the preferences, which permits a Bayesian optimization approach. 3.1 Preferences Learning preferences has been an active field of machine learning for some time. When we have a vocabulary of labels that could be applied and want to find an ordering, it is called label preference. However, what we are concerned with is the problem of instance preference, where we have data and we want to find a preference ordering over it. Much of the work in this field has involved casting instance preference as a binary classification problem [Wong et al., 1988; Herbrich et al., 1998], but we are instead going to cast the problem as one of probabilistic discrete choice over a latent function. Probability models for learning from discrete choices have a long history in psychology and econometrics [Thurstone, 1927; Stern, 1990; McFadden, 2001]. They have been studied extensively for use in rating chess players, and the Elo Rating System [Élő, 1978] was adopted by the FIDE, the World Chess Federation, to model the probability of one player beating another. It has since been adopted to many other two-player games such as Go and Scrabble, and, more recently, online computer gaming [Herbrich and Grae- pel, 2006]. These methods differ from our work in that they are intended to predict the probability of a preference outcome over a finite set of possible pairs, whereas we work with infinite sets and are only incidentally interested in modelling outcomes. Glickman and Jensen [2005] use Bayesian optimal design for adaptively finding pairs for tournaments. Like our method, this 42 work uses the results of previous comparisons to select the next pair to be evaluated, and can be considered a form of active learning. It differs from our method in that it adopts different acquisition models and, more importantly, in the fact that the number of items being compared is finite. Parts of our method are based on [Chu and Ghahramani, 2005b], which presents a preference learning method using probit models and Gaussian processes. They use a Thurstone-Mosteller model, but with an innovative nonparametric model of the utility. Note that the idea of using a latent Gaussian process prior and observations generated according to a generalized linear model has a much older history in spatial statistics (see, e.g. [Diggle et al., 1998; Eidsvik et al., 2009]). 3.2 Probit model for binary observations The probit model allows us to deal with binary observations of f(·) in general. That is, every time we try a value of x, we get back a binary variable, say either zero or one. From the binary observations, we have to infer the latent function f(·). In order to marry the presentation in this section to the user modelling applications discussed later, we will introduce probit models in the particular case of preference learning. Assume we have shown the user M pairs of items from a set of N in- stances. For each pair, the user has identified a preference. The data set therefore consists of the ranked pairs: D = {ri ci; i = 1, . . . ,M}, where the symbol indicates that the user prefers r to c. We use x1:t to denote the t distinct elements in the training data. That is, ri and ci correspond to two elements of x1:t. ri ci can be interpreted as a binary variable that takes value 1 when ri is preferred to ci and is 0 otherwise. In the probit approach, we model the utility v(·) for items r and c as 43 follows: v(ri) = f(ri) + ε v(ci) = f(ci) + ε, (3.1) where the noise terms are Gaussian: ε ∼ N (0, σ2noise). Following [Chu and Ghahramani, 2005b], we assign a nonparametric Gaussian process prior to the unknown mean value: f(x′) ∼ GP(0, k(x′,x)). That is, at the t training points: P (f) = |2piK|− 12 exp ( −1 2 fTK−1f ) , where f = {f(x1), f(x2), . . . , f(xt)}. Random utility models such as (3.1) have a long and influential history in psychology and the study of individual choice behaviour in economic markets. Daniel McFadden’s Nobel Prize speech [McFadden, 2001] provides a glimpse of this history. More comprehensive treatments appear in classical economics books on discrete choice theory. Under our Gaussian utility models, the probability that item r is pre- ferred to item c is given by: P (ri ci|f(ri), f(ci)) = P (v(ri) > v(ci)|f(ri), f(ci)) = P (ε− ε < f(ri)− f(ci)) = Φ(Zi), where Zi = f(ri)− f(ci)√ 2σnoise and Φ(·) is the CDF of the standard normal distribution. By including the noise term σnoise, we allow for noisy inputs, including violations of the tri- angle inequality. This model, relating binary observations to a continuous latent function, is known as the Thurstone-Mosteller law of comparative judgement in psychology [Thurstone, 1927; Mosteller, 1951] or the intraper- sonal random utility model in econometrics [McFadden, 1980]. In statistics 44 it goes by the name of binomial-probit regression. If the user had more than two choices one could adopt a polychotomous regression model [Holmes and Held, 2006]. This multi-category extension would, for example, enable the user to state no preference, or a degree of preference for any of the two items being presented. One could also easily adopt a logistic (sigmoidal) link function ϕ (Zi) = (1 + exp (−Zi))−1. In fact, such choice is known as the Bradley-Terry model [Stern, 1990]. Our goal is to estimate the posterior distribution of the latent utility function given the discrete data. That is, we want to maximize P (f |D) ≈ P (f) M∏ i=1 P (ri ci|f(ri), f(ci)). Although there exist sophisticated variational and Monte Carlo meth- ods for approximating this distribution, we favour a common simple strat- egy: Laplace approximation [Diggle et al., 1998; Chu and Ghahramani, 2005b]. Our motivation for doing this is the simplicity and computational efficiency of this technique. Moreover, given the amount of uncertainty in user valuations, we believe the choice of approximating technique plays a small role and hence we expect the simple Laplace approximation to per- form reasonably in comparison to other techniques. For recent looks at more advanced techniques, we refer the reader to e.g. [Rue et al., 2009; Eidsvik et al., 2009]. The Laplace approximation follows from Taylor-expanding the log pos- terior about a set point f̂ : logP (f |D) = logP (f̂ |D) + gT (f − f̂)− 1 2 (f − f̂)TH(f − f̂), where g = ∇f logP (f |D) and H = −∇f∇f logP (f |D). At the mode of the posterior (f̂ = fMAP), the gradient g vanishes, and we obtain: P (f |D) ≈ P (f̂ |D) exp [ −1 2 (f − f̂)H(f − f̂) ] In order to obtain this approximation, we need to compute the maximum a 45 posteriori (MAP) estimate fMAP, the gradient g and the information matrix H. The gradient is given by: g = ∇f logP (f |D) = ∇f [ const− 1 2 fTK−1f + M∑ i=1 log Φ(Zi) ] = −K−1f +∇f [ M∑ i=1 log Φ(Zi) ] = −K−1f + b, where the j-th entry of the N -dimensional vector b is given by: bj = 1√ 2σnoise M∑ i=1 φ(Zi) Φ(Zi) [ ∂ ∂f(xj) (f(ri)− f(ci)) ] , where φ(·) denotes the PDF of the standard normal distribution. Clearly, the derivative hi(xj) = ∂∂f(xj)(f(ri) − f(ci)) is 1 when xj = ri, -1 when xj = ci and 0 otherwise. Proceeding to compute the second derivative, one obtains the Hessian: H = K−1 + C, where the matrix C has entries Cm,n = − ∂ 2 ∂f(xm)∂f(xn) M∑ i=1 log Φ(Zi) = 1 2σ2 M∑ i=1 hi(xm)hi(xn) [ φ(Zi) Φ2(Zi) + φ2(Zi) Φ(Zi) Zi ] The Hessian is a positive semi-definite matrix. Hence, one can find the MAP estimate with a simple Newton–Raphson recursion: fnew = fold −H−1g |f=fold . At f = fMAP, we have P (f |D) ≈ N (Kb, (K−1 + C)−1) . 46 preferences 0.2 0.1 0.35 0.5 0.2 0.35 0.2 0.6 0.8 0.7 0 0.2 0.4 0.6 0.8 1 Figure 3.1: Example of a set of preference relations (left table) used to infer a GP (right plot) on a toy problem. The preferences are indicated as a set of preferences between two points in the space, which serves as in- put to a function that finds a Gaussian process that takes into account all the available preference information, as well as prior information on the smoothness and noise. with b = K−1fMAP. The goal of our derivation, namely the predictive distri- bution P (ft+1|D), follows by straightforward convolution of two Gaussians: P (ft+1|D) = ∫ P (ft+1|fMAP)P (fMAP|D)dfMAP ∝ N (kTK−1fMAP, k(xt+1,xt+1)− kT (K + C−1)−1k). An example of the procedure on a toy 1D problem is shown in Figure 3.1. We can see that the inferred model explains the observed preferences, while also adhering to prior information about smoothness and noise. 3.3 Experiments: gallery labelling methods To evaluate the preference gallery approach, we conducted a number of simulated gallery experiments using different techniques for modelling the GP. To do this, we use the known test functions described in Appendix A. In all cases, the experiments are run 25 times with random initialization. We use the gap metric (§2.5) to measure the evolution of the optimization. 47 We tested both simulated preference pairs (2-gallery) and a simulated en- vironment in which pairs were defined over four instances at each iteration (4-gallery). 2-gallery In the 2-gallery experiments, we iteratively selected pairs of data according to different methods. We then define preferences, which are added to the GP as training data (§3.2): xi xj ⇐⇒ f(xi) > f(xj) xj xi ⇐⇒ f(xj) > f(xi). Where f is the test function. In the very rare case where f(xi) = f(xj), we randomly assign preference. We tested three methods of sample selection2: • argmax EI, argmax EI+1. This method takes advantage of the fact that in a GP model, the covariance K is independent of the actual ob- servations, so if we know xt+1, we can find σt+1(·) without knowing yt+1. The first sample is obtained by maximizing EI (Eqn. (2.4)). We then update K for the sample and optimize EI again, using µt(·) and σt+1(·). This method was also used by Schonlau [1997] to select sequential samples without observations. • x+, argmax EI. In this method, the two samples for comparison are the incumbent and the argmax of EI. This is intended to directly seek improvement over the incumbent, though at the cost of displaying fewer unique samples to the user. • random. The two points are drawn at random from the valid param- eters, as a baseline comparison. 2Recent work by Azimi et al. [2011] has shown a novel and interesting method of batch Bayesian optimization by starting with a large set of candidates and iteratively pruning the least-promising until the desired batch size it achieved. While we do not test it here, it seems a promising direction. 48 Figure 3.2: Testing methods of selecting sample points for preferences on a set of test functions on a simulated 2-gallery. All methods draw two points at a time. The method labelled argmax EI, argmax EI+1 selects the point of maximum EI, updates the covariance matrix and then samples the point of maximum EI on the updated model. The method labelled x+, argmax EI used the incumbent and the point of maximum EI. The method labelled random selects two random points. 49 0 10 20 30 40 50 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 G Synthetic 2 argmaxEI,argmaxEI+1 x+ , argmaxEI random 0 10 20 30 40 50 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Synthetic 4 0 10 20 30 40 50 iterations 0.0 0.1 0.2 0.3 0.4 0.5 0.6 G Synthetic 8 0 10 20 30 40 50 iterations 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Synthetic 12 Figure 3.3: Preference methods on the synthetic functions on a simulated 2-gallery, using the same methods used to generate Figure 3.2. 50 We show the results of our experiments on functions from the literature (Appendix A) in Figure 3.2, and from synthetic functions (§2.5.1) in Fig- ure 3.3. While there is no clear winner among the non-random methods, on the literature functions, using the x+, argmax EI values is generally the best-performing overall, even though it involves only half as many distinct samples as the alternatives. Interestingly, on the Branin function, random selection performs very well, though all methods have high variance. Look- ing at the samples selected, we hypothesize that this is something of an artifact of the way the Branin function is constructed, where it is fairly easy for random samples to land close to the maximum, while more methodical techniques will tend to either start with a good sample and slowly improve, or first discount regions of high variance near the edges of the space first. On the synthetic functions, argmax EI, argmax EI+1 tends to be the clear winner, particularly for higher-dimensional test functions. As dimensional- ity increases, and samples become further apart, exploring more of the space by drawing two previously unseen samples at each iteration rather than one sample and the incumbent becomes a more effective strategy. 4-gallery In the 4-gallery experiment, we iteratively select a “gallery” of instances, xt+1, . . . ,xt+4 using one of the techniques described below. Preferences are then found for all pairs xi,xj of instances in the gallery, where xi xj ⇐⇒ f(xi) > f(xj) + xj xi ⇐⇒ f(xj) > f(xi) + . ( ≥ 0 is a parameter to simulate two values so similar the user chooses to not distinguish them.) The GP is then updated with the preferences. The methods are extensions of the ones we use for the 2-gallery experiment. • argmax EI1:4. This method iteratively selects points, updates the covariance matrix, and optimizes EI to collect four samples for the gallery. 51 • x+, argmax EI1:3. Compares the incumbent to three iterative samples taken by updating the covariance matrix and optimizing EI. • random. The four points are drawn at random. The results over the literature test functions are shown in Figure 3.4. In this case, the method of using the incumbent and three iterations of EI is clearly dominant. Studying the behaviour, we observe that using the incumbent allows for “anchoring” behaviour, in which highly-confident com- parisons are made between local maxima until the best is found. 52 0 10 20 30 40 50 0.0 0.2 0.4 0.6 0.8 1.0 G Branin argmaxEI1 :4 x+ , argmaxEI1 :3 random 0 10 20 30 40 50 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 Shekel 10 0 10 20 30 40 50 iterations 0.0 0.2 0.4 0.6 0.8 1.0 G Hartman 3 0 10 20 30 40 50 iterations 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Hartman 6 Figure 3.4: Testing methods of selecting sample points for preferences on a set of test functions on a simulated 4-gallery. All methods draw four points at a time and then find preferences between every pair of points. The method labelled argmax EI1:4 draws the four points by iteratively selecting the point of maximum EI, updating the covariance matrix and then sampling the point of maximum EI on the updated model, until 4 samples are selected. The method labelled x+, argmax EI1:3 uses the same technique to find three samples, and uses the incumbent as the fourth. The method labelled random selects four random points. 53 Chapter 4 Learning hyperparameters across user sessions Our early work with Bayesian optimization proposed a technique to assist artists with parameter tuning for bidirectional reflectance distribution func- tions (BRDF s, see §6.3). Later, during the development of a procedural smoke animation system, we found ourselves with a parameterized system with 12 continuous parameters1. Setting these was a challenge for the de- velopers, let alone other users, so we looked to adapt the methods we used previously [Brochu et al., 2007a; 2007b]. In the process, though, we found that the settings of the kernel hyperparameters, which control the smooth- ness of the objective function on each of the parameters, has a very large impact on the quality of the optimization. We originally sought to have an expert set these, but our best attempts to do so were simply not good enough. Even worse, approaches in the literature for automatically set- ting these values require far more data than we have available [Jones, 2001; Santner et al., 2003]. Furthermore, many regions of the parameter space (combinations of pa- rameter ranges for the animation) never produce good animation instances. 1The animation problem is described in §6.4. That section and this chapter are based on published work presented at the ACM SIGGRAPH/Eurographics Symposium on Com- puter Animation [Brochu et al., 2010a]. 54 However, under the zero-mean Bayesian optimization model, until the user supplies evidence, parameter settings in these regions are no less likely to be selected than ones that are frequently chosen by other users. Particularly with the relatively large number of parameters we use, this requires users to repeatedly view and rate uninteresting settings until evidence is accrued, instead of first focussing on the most promising areas. In short, the earlier model does not offer an elegant way of incorporating user expertise into the prior. After even a small amount of time spent using the system, users had developed a good enough semantic understanding of certain parameters that they wanted to either set parameters to a specific value, or restrict the optimization to a user-defined region. With these shortcomings in mind, we extended the interactive Bayesian optimization approach with several new features, which we incorporate into a novel application for assisting animators (§6.4). Our primary improve- ment, however, is a new hierarchical Bayesian method for incorporating data from previous users of the system to adaptively tune model hyperparameters. These hyperparameters control aspects such as the relative importance of changing settings of the different parameters, and the belief, before evidence is added, that an area of the space is unlikely to generate good animations. The problem of learning hyperparameters is usually difficult or impossible for small data sets, such as the ones generated in a single user session. We get around this by treating the problem as one of tracking the distributions of the hyperparameters across user sessions. As different users employ the system to find different animations, the system automatically improves and adapts. We define a session as a single “use” of the system, starting from a model with no user data, and ending when the user has found a suitable instance or given up searching for it. While data such as the hyperparameter values might be tracked for some problems, at heart a session is set of samples and observations D1:T . Sessions are indexed by τ , so D1:3,τ refers to the first three instances of x, y for the τ th session. 55 p ri o r mean function 1 p o st e ri o r mean function 2 mean function 3 Figure 4.1: Effect of different mean functions with and without evidence. The upper row shows the model with no observations. The lower row shows the model with four observations. Note that the models largely agree near the observations, but as the observations become more dis- tant, the mean function becomes more strongly informative. 4.1 Mean and covariance hyperparameters In the previous sections, we assume very simple priors on the GP mean and covariance. In practice, however, we might want to use other priors, such as radial basis function networks. In this model, the mean m(·) is represented with radial basis functions (RBFs) with C centres Γ = γ1:C and coefficients β = β1:C : m(x) = C∑ j=1 q(‖x− γj‖)βj . Where q is a squared exponential RBF of the form q(z) = exp(−z2). 56 (If more flexibility were desired, this could introduce additional hyperpa- rameters ω1:C , so that qj(z) = e−ωjz 2 .) Note that a GP with a squared ex- ponential covariance function is equivalent to this RBF model with C =∞, and an alternate mean function would be to fit a GP to all prior user data and use the mean of that GP as the prior mean of the new GP. We refer the reader to, for example, [Bishop, 2006] for an introduction to RBF function approximation. Adopting an informative mean function allows us to learn a general model of which areas of the space tend to generate high-value regions. The GP formulation allows an elegant way of trading off this prior knowledge with the observations. Near observation points, data dominates the predic- tion. Further from the observation points, the mean function dominates. Figure 4.1 shows some examples on a 1D model with some simple mean functions. Subsequently, we describe the procedures for estimating the hyperparam- eters. To begin, we estimate of the locations of the basis centres Γ by, for example, C-means clustering the parameters of previous simulations. We can cluster because sample density near previous successful optimizations should be higher [Močkus, 1994], and we would like these cluster centroids to be the peaks of our mean function. Intuitively, the resulting clusters capture the different regions of the space that users are typically interested in. By conditioning on Γ, the coefficients β can be obtained analytically us- ing standard Bayesian conjugate analysis. Specifically, we place a Gaussian hyperprior on them β ∼ N (0, αδ2) with regularizer δ2. We place an inverse-gamma hyperprior on α with shape parameter a > 0 and scale parameter b > 0: α ∼ IG ( a 2 , b 2 ) ∝ ( 1 α )a 2 +1 e− b 2α We assume for now that the RBF centre estimates Γ are given, and turn 57 our attention to approximating the posteriors of β and α. We use as pseudo- measurement the function estimated by the Laplace method in §3.2. This two-stage approximation approach is motivated by computational simplicity. We model the pseudo-measurements with the following model: P (β, α|X, fMAP) ∝ P (fMAP|X,β)P (β)P (α). The posterior P (fMAP|X,β) follows from Bayes’ rule: P (β, α|X, fMAP) ∝ exp [ −1 2 (fMAP −Qβ)T (αK)−1(fMAP −Qβ)− 12αδ2β Tβ ] P (α) = exp [ − 1 2α Z ] P (α) where K is the kernel matrix; fMAP is the maximum a posteriori estimate of {f(x1), f(x2), . . . , f(xt)}, which will be µ(xi) in the case of direct noisy observations or inferred from preferences in the binomial case (§3.2); Q is defined as Q = q(x1,γ1) . . . q(x1,γC) ... . . . ... q(xt,γ1) . . . q(xt,γC) ; and Z = (fMAP −Qβ)TK−1(fMAP −Qβ) + βT δ−2β = fTMAPK −1fMAP + βTQTK−1Qβ − 2fTMAPK−1Qβ + δ−2βTβ = βT (QTK−1Q + Iδ−2)β − 2fTMAPK−1Qβ + fTMAPK−1fMAP = βTM−1β − 2fTMAPK−1Qβ + fTMAPK−1fMAP. For notational convenience, let the matrix M−1 = QTK−1Q + Iδ−2 and 58 the vector h = MQTK−1fMAP. Then Z = βTM−1β − 2fTMAPK−1Qβ + fTMAPK−1fMAP = βTM−1β − 2hTM−1β + hTM−1h− hTM−1h + fTMAPK−1fMAP = (β − h)TM−1(β − h) + fTMAPK−1fMAP − hTM−1h We can use this to get an expression for the posterior: P (β, α|X, fMAP) ∝ exp [− 12α fTMAPK−1fMAP − hTM−1h] |2piαK|− 12 × exp [− 12α(β − h)TM−1(β − h)]P (α) Integrating out β and recalling that N is the number of data (and that K is therefore an N -by-N matrix), we get: P (α|X, fMAP) ∝ P (α)|2piαK|− 12 exp [ − 1 2α (fTMAPK −1fMAP − hTM−1h) ] ∝ α a2+1+ d2 exp [ − 1 2α b+ fTMAPK −1fMAP − hTM−1h ] This allows us to obtain analytical expressions for the posterior modes of these quantities: β̂ = h = QTK−1fMAP QTK−1Q + δ−2 α̂ = b′ a′ + 1 = b+ fTMAPK −1fMAP − hTM−1h a+N + 2 By approximating the posteriors of α and β with a Dirac-delta masses at α̂ and β̂, the mean and variance of the predictive distribution become: µ(x) = C∑ i=1 q(x,γi)β̂i + k TK−1 ( fMAP −QT β̂ ) σ2(x) = ( k(x,x)− kTK−1k) α̂ Note that this is somewhat similar to the process of universal kriging, 59 θ=1.0, σ 2noise =2 θ=0.01, σ 2 noise =0.001 θ=0.2, σ 2 noise =0.1 Figure 4.2: Examples of the impact of kernel width (θ) and noise (σ2noise) hyperparameter settings on a toy 1D problem. All three models have the same data, but different hyperparameters. The leftmost model at- tributes most of the data to draws from a smooth but very noisy func- tion. The centre model assumes the mean function holds everywhere except the close vicinity of the observations. The model on the right is more conservative, assuming a combination of smooth function and noise. in which a regression model is fit to the data with a GP used to model the residuals (§2.4.1). Here, however, we incorporate the regression model into the mean function and relax the universal kriging requirement that the regression model be unbiased. 4.1.1 Kernel hyperparameters We now turn our attention to the process of inferring the kernel parameters θ and σ2noise. In our user application, these have intuitive interpretations with regard to the impact on the user—θ is the relationship between distance in the parameter space and the value function (§2.1.2), and σ2noise is the noise or uncertainty associated with the user’s ratings (§2.3). In many applications, the kernel hyperparameters can be learned by maximum a posteriori inference. This method works well for many applica- tion of GPs, but unfortunately for our application, it is known to work poorly when the number of training data is small [Hastie et al., 2009]—exactly the situation we are in. The sparsity of data in the space can lead to some dimen- sions becoming very flat, or even monotonically increasing to infinity. This 60 can lead to low-quality models at best, and ill-conditioned covariance ma- trices at worst (Figure 4.2). In the experimental design literature, Bayesian optimization is often initialized using Latin hypercubes, which bypasses the need to learn parameters until enough data has been selected to make MAP feasible. However, the number of data required is significant, typically a minimum 10d [Jones, 2001; Santner et al., 2003], though this can be aided with an informative hyperprior on the hyperparameters, often a log normal prior [Lizotte, 2008; Frean and Boyle, 2008]. 4.2 Learning from the user base Our central hypothesis is that the interests of users are regular. That is, there are some regions of the parameter space that are interesting to some (or all) users, and some that are never, or very rarely, of interest. Using the zero-function as the prior mean fails to capture this until data is added. We would like to use the results of previous trials (optimizations) to learn the hyperparameters of a semi-parametric prior for the current session. This is crucial to both reduce the number of user interactions and to scale the method to higher dimensions. Our insight is that every time the application is used, a model is trained. While different runs might involve users with different simulation goals, if we record the final user data D of each run, this can be used to generate a distribution over the hyperparameters, which identifies some hyperparameter settings as more likely than others, even before a new user starts using the system. It is important to emphasize that this does not directly affect the actual parameters of the animation—that is determined by the user’s feedback. This is confined to the distributions of the hyperparameters. As more evidence is added to the model via user feedback, the hyperparameters can be updated to reflect that. The goal is not to restrict the hyperparameters, but to bring in knowledge gained from the results of previous users of the system. We propose to learn the RBF hyperparameters (a, b) as well as the kernel hyperparameters (θ, σ2noise) using data gathered from previous optimizations. We will refer to these with the generic symbol ρ. 61 We can model the sequence of optimizations conducted by the same user or different users with a dynamical state space model [Doucet et al., 2001, Chapter 1]. In this dynamic model, the hyperparameters ρ correspond to the unknown states. The state space model consists of an initial belief P (ρ0), a stochastic evolution process P (ρτ |ρτ−1) and an observation com- ponent P (fτ |ρτ ). To be precise, the observation fτ is in fact the maximum a posteriori estimate fMAP derived from the user feedback of run τ as out- lined in the previous section, but we drop the MAP superscript to simplify the notation. Our goal is then to compute the optimal filtering distribu- tion P (ρτ |f1:τ ) given that τ runs of the application have taken place. This distribution is intractable, but may be approximated with a Monte Carlo histogram estimator with samples obtained using a particle filter [Doucet et al., 2001], as illustrated in Algorithm 2. We treat the hyperparameters as independent, and use a separate dy- namic Gaussian diffusion model for each hyperparameter2. The dynamic diffusion model allows the hyperparameters to converge to values that ex- hibit some random drift over time. We don’t eliminate this drift as we believe it is useful to have a nonstationary model component in our applica- tion domain to take into account changes in the user, problem, application interface, etc. At the beginning of each user session, we set the hyperparameters to the particle filter means estimated from previous trials. After each user session, we use the inferred function f to compute the fitness of each particle according to the following non-linear Gaussian observation likelihood model: P (fτ |ρτ ) = |2piKρτ |−1/2 exp [ −1 2 fTK−1ρτ f ] . In this expression, we have emphasized the dependency of K on the kernel hyperparameters ρτ for clarity. Algorithm 2 shows the particle filter method of updating for hyperpa- rameters ρ. This is a simple particle filter, where the importance sampling 2This is done for efficiency and interpretability—we could theoretically model the hy- perparameters using a single particle filter if desired. 62 Algorithm 2 Particle Filter for Hyperparameter Learning 1: For i = 1, . . . , N sample ρ (i) 0 ∼ P (ρ0). 2: τ = 1 3: while True do 4: For i = 1, . . . , N sample eρ(i)τ ∼ P (ρτ |ρ(i)τ−1) 5: For i = 1, . . . , N evaluate importance weights ew(i)τ = P (fτ |eρ(i)τ ) and normalize them. 6: Resample with replacement N particles “ ρ (i) 0:τ , i = 1, . . . , N ” from the set“eρ(i)0:τ , i = 1, . . . , N” according to the importance weights ew(i)τ . 7: τ = τ + 1 8: end while proposal is the transition prior P (ρτ |ρτ−1). If one has prior knowledge about the hyperparameters, it is possible to incorporate this into the design of more sophisticated proposal distributions [Doucet et al., 2001]. A promising, and related, recent approach has been to marginalize over the hyperparameters using Bayesian Monte Carlo [Osborne et al., 2008], which has been shown to outperform MAP methods [Garnett et al., 2010a]. While combining this approach with ours is beyond the scope of this thesis, we feel this is a natural fit, and could be used to model much more sophisticated hyperparameter distributions effectively, though at some computational cost. In essence, the model is continually learning from a variety of users doing a variety of tasks with it. The more it is used, the more representative and useful the priors and hyperpriors become. 4.3 Experiments: hyperparameter learning In this section, we evaluate the performance of the particle filter (Algorithm 2) when optimizing a test function with known maxima: the Shekel 10 function (Appendix A). The function has 4 dimensions, 10 local maxima and 1 global maximum. Because of its dimensionality and fairly steep modes, it is difficult to optimize with naive techniques, but we can expect a well- designed general global optimizer to offer measurable improvement, even if it doesn’t find the global maximum. This makes it ideal for study with the gap metric. 63 session Figure 4.3: Learning ARD kernel width hyperparameters on the Shekel 10 test function using a particle filter. The upper subfigure shows the evolution of θ1:4 over 25 sessions. The lower subfigure shows the per- formance measure G corresponding to these hyperparameters. It also shows the same measure using hyperparameters learned by maximum likelihood, using all the accumulated data. We can see that ML is fun- damentally unable to improve above a low baseline due its inability to recover from poor initial estimates of the hyperparameters. We fix the GP mean to zero and focus on studying the effect of learning only the ARD kernel width hyperparameters, θ1:4,τ . In this optimization setting, the observations are noise-free (we will examine the effect of learning the mean function in §4.4, and the interaction between noise and kernel width hyperparameters has been well-studied elsewhere [Santner et al., 2003; Lizotte, 2008]). We train the particle filters for sessions τ as shown in Algorithm 2. At each τ , we gather an observation vector fτ by running 20 iterations of Bayesian optimization, using the mean of the predictive distribution p(θτ |f1:τ−1). The simulation of fτ is not deterministic because 64 of random initialization. The procedure continues as detailed in Algorithm 2. We store the 4 mean trajectories for 25 runs, θ̂1:4,1:25, computed with the particle filter. For each of these trajectory values, we run 20 iterations of Bayesian optimization and record the mean and variance of G (Eqn. (2.6)), called hereG20. We adopted 20 iterations because this is roughly the point at which users of the animation system start to quit if they do not see significant improvement. We repeated the experiment 10 times to obtain confidence estimates. The evolution of G20 is shown in the lower plot of Figure 4.3. For comparison, we also show G20 for hyperparameters learned by maximizing the likelihood directly: for each session τ , data from all previous sessions D1:20,1:τ−1 are used to find the MAP estimate of the hyperparameters, and these hyperparameter values are used for the τ th session. The results show that the particle filter not only converges at a reason- ably fast rate, but also leads to significant improvements in optimization performance. After about 15 sessions, there is little further variation in the particle filter means, and optimization performance reaches a plateau of performance. The maximum likelihood method, meanwhile, forms a kind of performance baseline, and consistently performs very poorly on the Shekel 10 function. This was somewhat unanticipated, as ML might be expected to get better hyperparameter estimates with more data. What appears to be happening is that the estimates from the first session are so bad (all θ values are either close to 0 or extremely high) that the optimization on the next session draws unrepresentative samples (either repeatedly sampling the same tiny region, or sampling the corners of the space). With this data, it is nearly impossible to learn better θ values. This doesn’t happen with the particle filter model, because there is a limit to the range of the hy- perparameter values at each session, so a single disastrous estimate cannot propagate forward. We conjecture that using the ML method with a well- designed hyperprior would improve performance, though we did not study this scenario. 65 Figure 4.4: Effect of adding more data to compute the RBF prior mean. We run Bayesian optimization to find the global maximum of the Shekel 10 function, with G a measure of the optimization performance. Solid lines are the mean performance at each iteration and error bars show the variance. We can see even 50 prior data offer a dramatic im- provement in the speed and quality of the optimization, and 500 offers yet more improvement. 4.4 Experiments: learning the mean function from an RBF network To test the impact of the mean function on the optimization function, again use the Shekel 10 test function. We sample randomly and with noise, and use that data to learn the β and Γ values of the RBF network that we use as a prior. We then optimize the function iteratively by selecting the point of maximum EI and sampling from f and updating the GP. We ran tests using 50 and 500 data to train the RBF network, using Latin hypercube sampling and adding noise ε to the samples, where ε ∼ N (0, 0.5). We set the size of the network (arbitrarily) at p = 25 bases. We run the experiments for 50 iterations, which we feel is the most most users would find acceptable. Figure 4.4 shows the effect of the prior on optimizing the network. We can see that an RBF network trained on even 50 noisy samples offers striking improvement on the rate of optimization, and 500 samples even more. 66 Chapter 5 Portfolio strategies for acquisition selection Bayesian optimization is efficient and can be used when very little is known about the objective function, making it popular for expensive black-box op- timization scenarios. It is able to do this by sampling the objective using an acquisition function which incorporates the model’s estimate of the objec- tive and the uncertainty at any given point. However, we saw in §2.2 that there are several different parameterized acquisition functions in the liter- ature, and it is often unclear which one to use. In this chapter, instead of using a single acquisition function, we adopt a portfolio of acquisition func- tions governed by an online multi-armed bandit strategy1. We describe the method, which we call GP-Hedge, which builds on recent developments in the field of online learning and multi-armed bandits [Cesa-Bianchi and Lu- gosi, 2006] and we will show that in our experiments GP-Hedge consistently outperforms the best individual acquisition function. There is no choice of acquisition function that can guaranteed to per- form best on an arbitrary, unknown objective2. In fact, it may be the case that no single acquisition function will perform the best over an entire 1This chapter is based on work published as an arXiv e-print [Brochu et al., 2010b]. 2While studying the objective might allow an expert to make an educated guess, even this can be difficult as Bayesian optimization is normally used specifically when sampling the objective is expensive. 67 Algorithm 3 Hedge [Auer et al., 1998] 1: Select parameter η ∈ R+. 2: Set gi0 = 0 for i = 1, . . . ,K. 3: for t = 1, 2, . . . do 4: Choose action it with probability pt(i) = exp(ηg i t−1)/ PK `=1 exp(ηg ` t−1). 5: Receive rewards rit. 6: Update gains git = g i t−1 + r i t. 7: end for Algorithm 4 GP-Hedge 1: Select parameter η ∈ R+. 2: Set gi0 = 0 for i = 1, . . . ,K. 3: for t = 1, 2, . . . do 4: Nominate points from each acquisition function: xit = argmaxx ui(x|D1:t−1). 5: Select nominee xt = x j t with probability pt(j) = exp(ηg i t−1)/ PK `=1 exp(ηg ` t−1). 6: Sample the objective function yt = f(xt) + t. 7: Augment the data D1:t = {D1:t−1, (xt, yt)} and update the GP. 8: Receive rewards rit = µt(x i t) from the updated GP. 9: Update gains git = g i t−1 + r i t. 10: end for optimization—a mixed strategy in which the acquisition function samples from a pool at each iteration might work better than any single acquisition. Previous work in this problem involved a strategy that alternated explo- ration and exploitation phases, such as the P ?-algorithm for one-dimensional optimization [Törn and Žilinskas, 1989]. However, we wish to automatically balance a portfolio of algorithms. This approach can be treated as a hi- erarchical multi-armed bandit problem, in which each of the K arms is an acquisition function, each of which is itself an infinite-armed bandit problem. 5.1 Hedge algorithms Hedge (Algorithm 3) is an algorithm which at each time step t selects an ac- tion i with probability pt(i) based on the cumulative rewards (gain) for that action [Auer et al., 1998]. After selecting an action the algorithm receives reward rit for each action and updates the gain vector. In the Bayesian opti- mization setting, we can define K bandits each corresponding to a single ac- quisition function. Choosing action i corresponds to sampling from the point 68 Algorithm 5 Exp3 [Auer et al., 1998] 1: Select parameters η ∈ R+, γ ∈ (0, 1]. 2: for t = 1, 2, . . . do 3: Get distribution pt from Hedge (Algorithm 3). 4: Choose action it to be j with probability p̂(j) = (1− γ)pt(j) + γ/K. 5: Receive reward ritt ∈ [0, 1]. 6: Return simulated reward vector r̂t to Hedge, where r̂ i t = r i t/p̂t(i) if j = it, or r̂ i t = 0 otherwise. 7: end for nominated by function ui, i.e., xit = argmaxx ui(x|D1:t−1) for i = 1, . . . ,K. While in the conventional Bayesian optimization setting the objective func- tion is sampled only once per iteration, Hedge is a “full information game”, meaning it requires a reward rit for every action at every time step. We can achieve this by defining the reward at xit as the expected value of the GP model at xit. That is, r i t = µt(x i t). We refer to this method as GP-Hedge. Provided that the objective function is smooth, this reward definition is reasonable. The GP-Hedge procedure is shown in Algorithm 4. Note that it is nec- essary to optimize K acquisition functions at each time step rather than 1. While this might seem expensive, this is unlikely to be a major problem in practice for small K, as (i) Bayesian optimization is typically employed when sampling the objective is so expensive as to dominate other costs, including acquisition function optimization; (ii) approximate optimization of u is usu- ally sufficient [Močkus, 1994]; and (iii) it is straightforward to optimize the acquisitions functions in parallel on a modern multicore processor. Auer et al. also propose Exp3 (Algorithm 5), a variant of Hedge that applies to the partial information game. In this setting it is no longer as- sumed that rewards are observed for all actions. Instead at each iteration a reward is only associated with the particular action selected. The algorithm uses Hedge as a subroutine where rewards observed by Hedge at each itera- tion are rit/p̂t(i) for the action selected and zero for all other actions. Here p̂t(i) is the probability that Hedge would have selected action i. The Exp3 algorithm, meanwhile, selects actions from a distribution that is a mixture between p̂t(i) and the uniform distribution. Intuitively this ensures that the 69 Algorithm 6 NormalHedge [Chaudhuri et al., 2009] 1: Initialize Ri0 = 0, p i 0 = 1/K. 2: for t = 1, 2, . . . do 3: Each action i incurs loss `it. 4: Learner incurs loss `At = PK i=1 p i t` i t. 5: Update cumulative regrets Rit = R i t−1 + (` A t − `it). 6: Use line search to find ct > 0 satisfying 1/K PK i=1 exp(([R i t]+) 2/(2ct)) = e, where [Rit]+ = max{0, Rit}. 7: pit+1 ∝ [Rit]+/ct exp(([Rit]+)2/(2ct)). 8: end for algorithm does not miss good actions because the initial rewards were low (i.e., it continues exploring all strategies, albeit less frequently). In comparing these two hedging strategies we will first note that GP- Hedge is somewhere “in between” a full and a partial information game. For example, the method collapses to a partial information game if all sample points are too distant, such that the kernels of these points exert negligible influence on each other. However, this behaviour is not observed in practical situations because of smoothness and our particular selection of acquisition functions. Another possible strategy is the NormalHedge algorithm [Chaudhuri et al., 2009], shown in Algorithm 6. We adapt NormalHedge by at each step sampling a single action it to be j with (normalized) probability p j t and defining the loss as `it = 1 1 + exp ( µ(xit) ) to meet the form required by NormalHedge. The algorithm has the appeal- ing quality that non-zero weights are only applied to actions that perform better than the mean performance, so unproductive acquisition functions are quickly discarded. However, NormalHedge is also designed to take advan- tage of situations where the number of bandit arms (acquisition functions) is large, and may not be a good match to problems where K is relatively small. 70 0 20 40 60 80 100 ξ=0.01 ξ=0.10 EI, ξ=1.00 ξ=0.01 ξ=0.10 PI, ξ=1.00 ν=0.1 ν=0.2 GP-UCB, ν=1.0 Branin (2D) 0 20 40 60 80 100 Shekel 10 (4D) 0 20 40 60 80 100 iterations (t) ξ=0.01 ξ=0.10 EI, ξ=1.00 ξ=0.01 ξ=0.10 PI, ξ=1.00 ν=0.1 ν=0.2 GP-UCB, ν=1.0 Hartman 3 (3D) 0 20 40 60 80 100 iterations (t) Hartman 6 (6D) Figure 5.1: Examples of typical evolution of GP-Hedge’s portfolio with K = 9 for each objective function. The height of each band corresponds to the probability pt(i) at each iteration. 5.2 Experiments: acquisition functions and acquisition strategies We first tested performance using our standard suite of test functions: the Branin, Shekel 10, Hartman 3 and Hartman 6 functions (Appendix A). For each experiment, we optimized 25 times with random initialization and computed the mean and variance of the gap metric (§2.5) over time. In these experiments we first learned the hyperparameters θ by sampling 10d points with the Latin hypercube method and maximizing the log marginal likelihood of the data given θ. These samples were then discarded. When also tested on synthetic functions (§2.5). For the synthetic functions, for 71 each trial, we sample a different function and optimize it using the different techniques, so each plot shows the mean result over 25 different functions, with one trial each. We compared the standard acquisition functions using parameters sug- gested by previous authors, i.e., ξ = 0.01 for EI and PI, δ = 0.1 and ν = 0.2 for GP-UCB [Lizotte, 2008; Srinivas et al., 2010]. The plots for EI, PI and GP-UCB shown all use these parameters. For the GP-Hedge trials, we tested performance under using both 3 acquisition functions and 9 acquisi- tion functions. For the 3-function variant we use the standard acquisition functions with default hyperparameters. The 9-function variant uses these same three as well as 6 additional acquisition functions consisting of: both PI and EI with ξ = 0.1 and ξ = 1.0, GP-UCB with ν = 0.1 and ν = 1.0. Used alone, these values are not expected to perform as well as the defaults and our experiments confirmed this hypothesis. However, we are curious to see if adding known suboptimal acquisition functions will help or hinder GP-Hedge in practice. Results for the gap measure Gt are shown in Figure 5.2. While the im- provement GP-Hedge offers over the best single acquisition function varies, there is almost no combination of function and time step in which the 9- function GP-Hedge variant is not the best-performing method. The results suggest that the extra acquisition functions assist GP-Hedge in exploring the space in the early stages of the optimization process. Figure 5.2 also displays, for a single example run, how the the arm probabilities pt(i) used by GP-Hedge evolve over time. We have observed that the distribution becomes more stable when the acquisition functions come to a general con- sensus about the best region to sample. As the optimization progresses, exploitation becomes more rewarding than exploration, resulting in more probability being assigned to methods that tend to exploit. However, note that if the initial portfolio had consisted only of these more exploitative ac- quisition functions, the likelihood of entrapment at suboptimal points would have been higher. In Figures 5.3 and 5.4 we compare against the other Hedging strategies introduced in §5.1 under both the gap measure and mean average regret. We 72 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 G t Branin (2D) GP-Hedge, K=9 GP-Hedge, K=3 GP-UCB PI EI 0 20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 Shekel 10 (4D) 0 20 40 60 80 100 iterations (t) 0.0 0.2 0.4 0.6 0.8 1.0 G t Hartman 3 (3D) 0 20 40 60 80 100 iterations (t) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Hartman 6 (6D) Figure 5.2: Comparison of different acquisition approaches on four commonly-used test functions. The lines and error bars in the subplots show the mean and variance of the gap metric averaged over 25 trials. With K = 3 acquisition functions, GP-Hedge beats the best-performing acquisition function in almost all cases. With K = 9, we add addi- tional instances of the three acquisition functions, but with different parameters. Despite the fact that these additional functions individu- ally perform worse than the ones with default parameters, adding them to GP-Hedge improves performance in the long run. 73 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 G t Branin (2D) GP-Hedge, K=9 GP-UCB Exp3, K=9 uniform, K=9 Normal-Hedge, K=9 0 20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 Shekel 10 (4D) 0 20 40 60 80 100 iterations (t) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 G t Hartman 3 (3D) 0 50 100 150 200 250 300 350 400 iterations (t) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Hartman 6 (6D) Figure 5.3: Comparison of different hedging strategies on three commonly- used test functions. The subplots show the mean and variance of the gap metric averaged over 25 trials. 74 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 1.2 m e a n a v e ra g e r e g re t Branin (2D) Exp3, K=9 GP-Hedge, K=9 uniform, K=9 GP-UCB Normal-Hedge, K=9 0 20 40 60 80 100 9.70 9.75 9.80 9.85 9.90 9.95 10.00 10.05 10.10 Shekel 10 (4D) 0 20 40 60 80 100 iterations (t) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 m e a n a v e ra g e r e g re t Hartman 3 (3D) 0 50 100 150 200 250 300 350 400 iterations (t) 1.0 1.5 2.0 2.5 3.0 3.5 Hartman 6 (6D) Figure 5.4: Comparison of different hedging strategies on four commonly used literature functions. The subplots plots show the mean average regret for each method (lower is better). We see that mixed strate- gies (i.e., GP-Hedge) perform comparably to GP-UCB under the re- gret measure and outperform this individual strategy under the gap measure (Figure 5.3). As the problems get harder, and with higher dimensionality, GP-Hedge outperforms other mixed strategies. 75 0 50 100 150 200 0.0 0.2 0.4 0.6 0.8 1.0 G t Synthetic (5D) GP-Hedge, K=9 GP-Hedge, K=3 GP-UCB PI EI 0 50 100 150 200 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Synthetic (10D) 0 50 100 150 200 iterations (t) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 G t Synthetic (20D) 0 50 100 150 200 iterations (t) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Synthetic (40D) Figure 5.5: Comparison of performance of the acquisition approaches on synthetic functions sampled from a GP prior with randomly initialized hyperparameters. Shown are the mean and variance of the gap metric over 25 sampled functions. Note, the variance in this plot is very different from the variance shown in previous plots. Here, the variance is a relative measure of how well the various algorithms perform while the functions themselves are varied. While the variance is high (which is to be expected over diverse functions), we can see that Hedge is at least comparable to the best acquisition functions and usually better. This confirms our thesis that, in the absence of strong knowledge about the best acquisition to use, it is sensible to use Hedge. 76 also introduce a baseline strategy which utilizes a portfolio uniformly dis- tributed over the same acquisition functions. The results show that mixing across multiple acquisition functions provides significant performance ben- efits under the gap measure, and as the problems’ difficulty/dimensionality increases we see that GP-Hedge outperforms other mixed strategies. The uniform strategy performs well on the easier test functions, as the individ- ual acquisition functions are reasonable. However, for the hardest problem (Hartman 6) we see that the performance of the naive uniform strategy de- grades. NormalHedge performs particularly poorly on this problem. We observed that this algorithm very quickly collapses to an exclusively ex- ploitative portfolio which becomes very conservative in its departures from the incumbent. We again note that this strategy is intended for large values of K, which may explain this behaviour. In the case of the regret measure we see that the hedging strategies perform comparable to GP-UCB, a method designed to optimize this mea- sure. We further note that although the average regret can be seen as a lower-bound on the convergence of Bayesian optimization methods, this bound can be loose in practice. Further, in the setting of Bayesian opti- mization we are typically concerned not with the cumulative regret during optimization, but instead with the regret incurred by the incumbent af- ter optimization is complete—that is, the instantaneous regret of the best observation, independent of the regrets of the other observations. Simi- lar notions of “simple regret” have been studied in [Audibert et al., 2010; Bubeck et al., 2009]. Based on the performance in these experiments, we use Hedge as the underlying algorithm for GP-Hedge in the remainder of the experiments. As can be seen in Figure 5.5, GP-Hedge with K = 9 is again the best- performing method, which becomes even more clear as the dimensionality increases. Interestingly, the worst-performing function changes as dimen- sionality increases. In the 40-D experiments, GP-UCB, which generally performed well in other experiments, does quite poorly. Examining the be- haviour, it appears that by trying to minimize regret instead of maximizing improvement, GP-UCB favours regions of high variance. However, since 77 a 40-D space remains extremely sparsely populated even with hundreds of samples, the vast majority of the space still has high variance, and thus high acquisition value. 5.3 Experiment: control of a particle simulation We also applied these methods to optimize the behaviour of a simulated physical system in which the trajectories of falling particles are controlled via a set of repelling forces. This is a difficult, nonlinear control task whose resulting objective function exhibits fairly isolated regions of high value sur- rounded by severe plateaus. Briefly, the four-dimensional state-space in this problem consists of a particle’s 2D position and velocity (p, ṗ) with two- dimensional actions consisting of forces which act on the particle. Particles are also affected by gravity and a frictional force resisting movement. The goal is to direct the path of the particle through regions of the state space with high reward r(p) so as to maximize the total reward accumulated over many time-steps. In our experiments we use a finite, but large, time-horizon H. In order to control this system we employ a set of “repellers” each of which is located at some position Ci = (Ai, Bi) and has strength Wi (see the upper plot of Figure 5.6). The force on a particle at position p is a weighted sum of the individual forces from all repellers, each of which is inversely pro- portional to the distance p− Ci. For further details on this model we refer the reader to [Hoffman et al., 2009]. The use of Bayesian optimization for control problems has previously been proposed by Frean and Boyle [2008], who used it for a control problem of balancing two poles on a cart (§2.4.4). This problem can be formulated in the setting of Bayesian optimization by defining the vector of repeller parameters x = (W1, A1, B1, . . . ). In the experiments shown in Figure 5.6 we will utilize three repellers, resulting in a 9D optimization task. We can then define our objective as the total H-step expected reward f(x) = E [∑H t=0 r(pt)|x ] . Finally, since the model defines a probability distribution Px(p0:H) over particle trajectories we can obtain a noisy estimate of this objective function 78 Figure 5.6: Results of experiments on the repeller control problem. The upper plot displays 10 sample trajectories over 100 time-steps for a particular repeller configuration (not necessarily optimal). The lower plot shows the progress of each of the described Bayesian optimization methods on a similar model, averaged over 25 runs. by sampling a single trajectory and evaluating the sum over its immediate rewards. Results for this optimization task are shown in Figure 5.6. As with the previous synthetic examples GP-Hedge outperforms each of its constituent methods on average. We further note the particularly poor performance of the PI acquisition function on this example, which in part we hypothesize is a result of plateaus in the resulting objective function. In particular PI has trouble exploring after it has “locked on” to a particular mode, a fact 79 that seems exacerbated when there are large regions with very little change in objective. 5.4 Conclusions and future work Hedging strategies are a powerful tool in the design of acquisition functions for Bayesian optimization. In this paper we have shown that strategies that adaptively modify a portfolio of acquisition functions often perform substan- tially better—and almost never worse—than the best-performing individ- ual acquisition function. Our experiments have shown that full-information strategies are able to outperform partial-information strategies in many situ- ations. However, partial-information strategies can be beneficial in instances of high K or in situations where the acquisition functions provide very con- flicting advice. Evaluating these tradeoffs is an interesting area of future research. As noted earlier (and in §2.2.2) the cumulative regret provides a method for lower-bounding the convergence properties of these algorithms. Bound- ing GP-Hedge is left for future work. Bounds for the underlying GP-UCB algorithm have been shown [Srinivas et al., 2010], as well as bounds for re- lated but distinct bandit problems [Grunewalder et al., 2010]. Ultimately we would also like to show similar bounds for the gap measure on GP-Hedge, but this result seems much more difficult. Finally, while outside the scope of this work, we feel this portfolio-based approach could be used for model selection (§2.1.2), as well, possibly in conjunction with the cross-session model introduced in Chapter 4, and/or Bayesian Monte Carlo model integration [Osborne et al., 2008]. 80 Chapter 6 Applications: active preference learning with Bayesian optimization In this chapter, we will present several applications that use interactive Bayesian optimization to find parameters for graphics and animation prob- lems1. We also present results of experiments done with users, which demon- strate the effectiveness of interactive Bayesian optimization. 6.1 Related work The work of Talton et al. [2009] is a notable recent example of a design gallery approach in the computer graphics literature. They introduce a collaborative system which uses data from a body of users to learn spatially- varying parameters, though their work is still quite distinct from ours. Their model is based on density estimation in high-dimensional spaces, whereas we are interested in optimizing individual user value model. Their approach is also intended as a novel interface to aid users who are unfamiliar with the system, while our approach is intended to work in conjunction with more 1This chapter is based on previously published work [Brochu et al., 2007a; 2007b; 2010a]. See the Co-Authorship Statement for details on co-author contributions. 81 traditional slider manipulation approaches to finding parameters. The gallery interface of [Marks et al., 1997] is perhaps the best known assistance tool for animators. It is a browsing tool where a set of animations is displayed on a 2D layout using multi-dimensional scaling. It uses heuristics to find the set of input parameters to be used in the generation of the display. We depart from this heuristic treatment and instead present a principled probabilistic decision making approach to model the design process. Psychoperceptual preference elicitation has been previously done in graph- ics in the context of image-based rendering [Kuang et al., 2004], as well as evaluation of tone-mapping operators for high dynamic range (HDR) im- ages [Ledda et al., 2005] and displays [Akyüz et al., 2007]. Ledda et al. used preference to conduct psychoperceptual experiments to evaluate tone map- ping operators. Participants were presented with pairs of images and asked to indicate which they thought most closely resembled a reference scene. This approach is very similar in spirit to our system, though it uses a finite set of data and does not interactively select pairs. In this chapter, we detail the applications we have developed to date using interactive Bayesian optimization: an interactive smoke simulation (§6.2), a Bidirectional Radial Basis Function gallery tool (§6.3) and a pro- cedural fluid animation generator (§6.4). On the latter two, most complex applications, we detail the results of user studies we have conducted to eval- uate its use as a tool. 6.2 Active preferences for 2D smoke simulation The initial motivation for this work was an animation tool based on Jos Stam’s smoke simulation [Stam, 2003]. We present it here as a motivating example only—we did not use study it experimentally. This is a 2D simula- tion, which is not necessarily intended to be physically accurate, but rather to generate smoke effects that look “realistic”. It uses the familiar Navier- Stokes equations2 and a grid of samples on a density field. At each time 2We use here the “standard” symbols for Navier-Stokes, not to be confused with sym- bols used elsewhere in this thesis. 82 Figure 6.1: An example of the interactive Bayesian optimization smoke simulation comparison tool (§6.2). The simulation engine is inter- active and real-time, but controlling the simulation environment re- quires the setting of five interacting parameters. Our system allows a user with no prior experience to find the desired simulation parame- ters by generating simulations for the user to compare. At any point, the user can also ask the system to predict the current estimated best simulation parameters based on his or her feedback. step, the density field is updated by applying a velocity field (representing air flow or other interference, also sampled on a grid), uniform diffusion, and additional sources of density (that is, smoke added to the system from a source point). ∂u ∂t = −(u · ∇)u + ν∇2u + f ∂% ∂t = −(u · ∇) + κ∇2%+ S The simulation has a vector x of five parameters, all of which take posi- tive real values: viscosity, diffusion, time step, source (the amount of smoke generated in the simulation when the user clicks the simulation window) and force (the amount of disruption the user’s mouse click creates in the 83 simulation). Clearly, some of these are dependent (force and time step, for example). The system is extremely robust, and will not blow up even when given bad values, but large parts of the parameter space generate simulations that look disappointing, at best. Furthermore, this real-time simulation can generate a variety of legitimate smoke, from wispy curlicues like cigarette smoke to thick, viscous smoke resembling factory exhaust. Our simulation-rating environment is a GUI application shown in Fig- ure 6.1. The user-feedback portion of our application allows the user to interact with two simulation environments and indicate preference using whatever desired value model they have in mind, typically a specific smoke effect. By clicking mouse buttons, the user can add and move smoke within either simulation window. A window button allows the user to indicate preference, at which point our algorithm is run again and selects two more simulations to show the user. At any time, the user can click a button in the GUI to get the “best” simulation, which is computed by running DIRECT to find an x (set of parameters) that maximizes the value rather than EI(·). If this is not the final result the user sought, it can be rated as usual, and another iteration of our algorithm is performed. 6.3 Interactive Bayesian optimization for material design Properly modelling the appearance of a material is a necessary component of realistic image synthesis. The appearance of a material is formalized by the notion of the Bidirectional Reflectance Distribution Function (BRDF). In computer graphics, BRDFs are most often specified using various an- alytical models. Analytical models that are of interest to realistic image synthesis are the ones that observe the physical laws of reciprocity and en- ergy conservation while typically also exhibiting shadowing, masking and Fresnel reflectance phenomenon. Realistic models are therefore fairly com- plex with many parameters that need to be adjusted by the designer for the proper material appearance. Unfortunately these parameters can interact in non-intuitive ways, and small adjustments to certain settings may result in 84 Target 1. 2. 3. 4. Figure 6.2: A shorter-than-average but otherwise typical run of the BRDF preference gallery tool (§6.3). At each (numbered) iteration, the user is provided with two images generated with parameter instances and indicates the one they think most resembles the target image (top-left) they are looking for. The red boxes indicate the user’s selections at each iteration. 85 non-uniform changes in the appearance. This can make the material design process quite difficult for the artist end user, who is not expected to be an expert in the field. To alleviate this problem, Ngan et al. [2006] presented an interface for navigation in a perceptually uniform BRDF space based on a metric derived from user studies. However, this is still somewhat con- straining as the user has to develop an understanding of the various aspects of material appearance such as varying degrees of diffuseness, glossiness, specularity, Fresnel effects and/or anisotropy in order to navigate such an interface. An artist often knows the look that she desires for a particular application without necessarily being interested in understanding the vari- ous subtleties of reflection. We attempt to deal with this using a preference gallery approach, in which users are simply required to view two or more images rendered with different material properties and indicate which they prefer, in an iterative process. We use the interactive Bayesian optimization preference model described in §3.2 on an example gallery application for helping users find a BRDF. For the purposes of this example, we limit ourselves to isotropic materials and ignore wavelength dependent effects in reflection. The gallery uses the Ashikhmin-Shirley Phong model [Ashikhmin et al., 2000] for the BRDFs, which has been shown to be well-suited for representing real materials [Ngan et al., 2005]. The BRDFs are rendered on a sphere under high-frequency natural illumination as this has been shown to be the desired setting for human perception [Fleming et al., 2001]. Our gallery demonstration presents the user with two BRDF images at a time. The GP model is updated after each preference is indicated. We use parameters of real measured materials from the MERL database [Ngan et al., 2005] for seeding the parameter space, but can draw arbitrary parameters after that. By querying the user with a paired comparison, one can estimate statis- tics of the value function at the query point, but only at considerable ex- pense. Thus, we wish to make sure that the samples we do draw will generate the maximum possible improvement. Our method for achieving this goal iterates between the following steps: 1. Present the user with a new set of instances and record preferences 86 from the user: Augment the training set of paired choices with the new user data. 2. Infer the value function: Here we use a Thurstone-Mosteller model with Gaussian processes [Chu and Ghahramani, 2005b]. See §3.2 for details. Note that in this application, the value function is the objective of Bayesian opti- mization. We will use the terms interchangeably. 3. Optimize the acquisition function of the value to obtain the query points for the next gallery: Methods for selecting a set of instances are described in §3.3. 6.3.1 User study To evaluate the performance of our application, we have run a simple user study in which the generated images are restricted to a subset of 38 mate- rials from the MERL database that we deemed to be representative of the appearance space of the measured materials. The user is given the task of finding a single randomly-selected image from that set by indicating pref- erences. Figure 6.2 shows a typical user run, where we ask the user to use the preference gallery to find a provided target image. At each step, the user need only indicate the image they think looks most like the target. This would, of course, be an unrealistic scenario if we were to be evaluating the application from an HCI stance, but here we limit our attention to the model, where we are interested here in demonstrating that with human users maximizing value is preferable to learning the entire latent function. Using five subjects, we compared 50 trials using the expected improve- ment function (§2.2.1) to select the images for the gallery (maxEI), 50 trials using maximum variance (maxσ), and 50 trials using samples selected using a randomized Latin hypercube algorithm. In each case, one of the gallery images was the incumbent and the other was selected by the algorithm. The algorithm type for each trial was randomly selected by the computer and neither the experimenter nor the subjects knew which of the three al- gorithms was selecting the images. The results are shown in Table 6.1. N is the number clicks required of the user to find the target image. Clearly maxEI dominates, with a mean N less than half that of the competing al- 87 algorithm trials N (mean ± std) random 50 18.40 ± 7.87 maxσ 50 17.87 ± 8.60 maxEI 50 8.56 ± 5.23 Table 6.1: Results of the user study on the BRDF gallery (§6.3.1). gorithms. Interestingly, selecting images using maximum variance does not perform much better than random. We suspect that this is because maxσ has a tendency to select images from the corners of the parameter space, which adds limited information to the other images, whereas the random (Latin hypercube) algorithm at least guarantees that the selected images fill the space. 6.4 Procedural fluid animation Procedural methods for generating animation have long been used in visual effects and games studios due to their efficiency and artist controllability. However, this control comes with a cost: a set of often unintuitive parameters confronts the user of a procedural animation system. The desired end result is often identifiable by the user, but these parameters must be tuned in a tedious trial-and-error process. For example, realistic animation of smoke can be achieved by driving a particle system through a simple combination of vortex rings and curl noise [Bridson et al., 2007]. However even these two relatively simple pro- cedural methods are influenced by several parameters: the velocity, radius and magnitude of the vortex rings, and the length scale and magnitude of the curl noise. Adding more procedural “flow primitives”, such as uniform and vortical flows, sources and sinks [Sims, 1990; Wejchert and Haumann, 1991], turbulent wind [Stam and Fiume, 1993], vortex particles [Selle et al., 2005], and vortex filaments [Angelidis and Neyret, 2005] can produce a wider variety of animations, but each of these primitives carries its own set of associated parameters. These parameters can interact in subtle and non-intuitive ways, and small adjustments to certain settings may result in 88 Figure 6.3: The animation gallery in action. The upper and middle images show intermediate stages of the animation gallery. The inset image in the lower right is a frame from the final animation, generated after the artists has found the desired parameter settings. 89 non-uniform changes in the appearance. We apply the Bayesian optimization model as the learning engine for our procedural animation design tool. Our contribution to the design problem is a “gallery” approach in which users can view several animations generated from multiple parameters, and provide feedback in the form of real-valued ratings indicating how close an animation is to what they are looking for. In practice, the first few examples presented to the user will be points of high uncertainty, since little of the space is explored (that is, the model is very uncertain about the user’s value criteria). Later galleries include more examples of high predicted value, as a model of the user’s interest is learned. 6.4.1 Procedural animation We produce smoke animation by driving a set of passive marker particles through a procedurally-generated velocity field. This velocity field is gen- erated by taking the curl of a (vector-valued) potential function, which au- tomatically ensures that the resulting velocity field is divergence-free, an important characteristic of fluid motion. There are two main components to this potential field, which we linearly combine: the contribution due to a set of vortex rings, and a spatially varying noise function. The potential function of a vortex ring perpendicular to the y axis with centre u, radius R, at a point in space z is given by ψv(z) = 1 (R−D)2 + 2RD < z3, 0,−z1 >, where D = ‖z − u‖. The potential function associated with curl noise [Bridson et al., 2007] is a spatially and temporally continuous noise function ψn(z) = g(z, L), where L is the length scale of noise. ψg(z) = g(z, L). Our examples use Flow Noise [Perlin and Neyret, 2001] for this function. The velocity field is then the curl of the linear combination of these two 90 potential fields: v(z) = ∇× (Aψv(z) +Bψn(z)). This simple model results in at least four parameters which must be tuned: the radius of a vortex ring, R, the length-scale of the noise func- tion, L, and the relative strengths of each potential function, A,B ∈ R+. Additionally, our examples use vortex rings which move upward with some velocity, and are generated at the origin with some frequency, resulting in two additional parameters per ring. We model a total of four distinct curl noise layers, for an additional 8 parameters, though the use of the curl noise layers is not required for all animations, and they can be disabled by setting both parameters to 0. Since this method is procedural, and not a simulation, the variety of animations capable of production is fundamen- tally limited, though still quite large. Further parameters which could be added to the system include the spawn rate of marker particles, orientation of vortex rings, and the time derivative of any of the parameters mentioned above. 6.4.2 Gallery The gallery interface (Figure 6.3) is our user-facing parameter-optimization tool. The user has the option of setting any or all of the parameters man- ually, either to a fixed value, or to a range of values. Otherwise, they may leave them at the default range of acceptable parameters. Fixing or setting the ranges of parameters directly sets the bounds of the optimization of the EI function. Four animations are shown at a time. Based on the results of our gallery- selection experiments (§3.3), one of these is the previously-seen animation with the highest predicted value, and one is the point of maximum ex- pected improvement. Remaining windows in the gallery are selected based on Schonlau’s method [Schonlau, 1997] of simulating updates to the GP by iteratively maximizing the EI and updating the covariance matrix of the max EI function. Note that at any stage, the user can set the parameters to a fixed value 91 or change the range. This permits the user to set up useful work-flows. For example, users can start with several free parameters and view examples until they find one similar to their target and fix most of the parameters, using the model to help set one or two “tricky” remaining values. This is a frequently-observed process in human decision making called “filtration” or “anchoring and adjustment”. Alternatively, the user can adjust parameters until they reach a point where they are frustrated with one or more and then use the system to help find it. This is important, as Payne et al. [1993] demonstrate that users will change decision-making strategies depending on previous experience and information extracted and the task at hand. Forcing users into a particular mode can result in poor performance. In any case, the goal is not to remove or restrict the process of manually setting parameters, but to augment it. Animations are generated using the procedural system described above, during a non-interactive “animation” phase. At each frame of the animation, the flow primitives are updated, and new ones are spawned if necessary. New particles are spawned at a source, advected according to the set of flow primitives, and all particle positions for a given frame are written to disk. The animation can then be previewed in an OpenGL window by streaming the particle data from disk. After each run of the application, the final data vector D1:t for the run is logged and the RBF parameters and distributions of the hyperparameters (a, b,θ, σ2noise) are updated (§4.2). The user has the option to skip this step if they do not wish to have the results logged. The application uses the hyperparameters and RBF network mean func- tion described above, trained on all available data from previous users. The hyperparameter updates are computed in the background or at an appro- priate time when the user’s computer has free cycles. 6.4.3 Experiment: hyperparameter learning One of the main innovations of our work is the particle-filter updates of the hyperparameters. Particle filters are known to be an effective way of 92 modelling a distribution, but to our knowledge the concept of using them as a consistent model of hyperparameters across GPs trained with different data is novel. We need to confirm that the method will work well on our problem. To track the fit of the particle filter to the model, we used the gallery to test one parameter at a time, with several users. Each user was given the task of using the gallery to find a target animation, which they were shown. The gallery was used with all the dimensions except one set to the target animation parameters. Users had to find the remaining parameter setting by rating generated animations using the gallery tool. After each run, the particle filters were updated for the noise hyperparameter and the kernel width hyperparameter of the free parameter. Figure 6.4 shows the results of the running the experiment on the σ2noise and θ hyperparameters over 20 runs with 4 users. It is interesting to see that it takes only a few runs for the estimate to move from the initial value to a fairly consistent “region” of values. While the hyperparameters fluctuate a little with different users and use cases, the change is slow after the first few trials. This suggests that in the hyperparameter-learning scenario, particle filters are both quick to learn the hyperparameters and robust to noise. 6.4.4 Gallery interface performance The experiments of Chapter 4 show clearly that particle filters and RBF models for the GP mean result in significant improvements in Bayesian optimization of known mathematical functions. To test the performance of components of the system with human beings in the loop, we would like to simulate the task of an animator looking for a specific animation. The difficulty with measuring performance in these animation tasks is that we don’t know the animator’s precise intentions. That is, we don’t know the objective function’s global maximum. To overcome this, an expert generated a set of five distinct animations, for which the target parameters were known. Users were shown one of the target animations, picked at random, and asked to find the corresponding target parameters using different variations of the 93 Figure 6.4: Impact of learning hyperparameters from user data. Users were asked to perform tasks on the application and the hyperparameters were updated after each run using our particle filter (§4.2). The re- sults show the evolution of 5 hyperparameter settings over 20 runs, involving 4 users. In all cases, the initial values, which are set to the common default of 1, were very quickly corrected and became in- creasingly stable thereafter, despite being used by different users for different tasks. This indicates that after an initial “burn-in” period where the particles are fairly scattered, they tend to congregate around the mean and are resistant to rapid change caused by simple variance from run to run. This suggests our learning model is both quick to learn and robust. 94 rating (1-gallery) pairwise preference (2-gallery) 4-gallery Figure 6.5: Gallery interfaces used for the user studies described in §6.4.4, slightly edited to remove extraneous interface details. The 1-gallery functions by having the user directly rate animations with numerical values. The 2-gallery uses pairwise preferences: the user indicates which of the pair is preferred. The 4-gallery uses degrees of preference to allow richer input. All instances with the highest preference are pre- ferred to all other instances, all instances with the middle preference are preferred to all instances with the lowest preference. 95 interface (Figure 6.5). When users found an animation they felt was “close enough”, they could select it. Subsequently, the application terminated and logged the number of iterations, distinct animations viewed, and the error. A GP zero mean function was used so as to avoid giving unfair advantage to the target animations. We tested the following scenarios: • expert-set hyperparameters (expertHP) We tested interactive Bayesian optimization on our BRDF interface task in §6.3.1, and so we wish to compare our work to that as directly as possible, even though the applications are different. To do this, we (a) set the kernel hyperparameters θ and σ2noise to expert-selected values; (b) restricted the gallery to 2 windows, one generated from argmaxx EI(x) and one from argmaxx µ(x). To set θ we had an expert set an initial θ E. We then ran 5 sessions with those hyperparameters, and maximized the likelihood to learn θLL for each of those sessions. The final value for each hyperparameter θ` was either θE` or the median of θ LL ` , whichever seemed more reasonable to our expert. • particle filter hyperparameters (PFHP) We compared the expert- set hyperparameter model to a pairwise model which is identical except that it uses hyperparameters learned with a particle filter trained on user sessions, as described in §4.1. • ratings We repeated the rating experiment of §6.3.1 on our application to see if there was any undiscovered aspect of our problem that might make it easier to find the target by rating instances numerically. In this case, the user was shown a single animation at a time, corresponding to argmaxx EI(x) and asked to rate it with a “score” between 0 and 1. • 4-gallery Using the same hyperparameters as PFHP, for comparison purposes, we generated a gallery of 4 instances over which the user could enter preferences. Once all preferences were entered, they were added to the model, and a new set of gallery parameters selected. As shown in Table 6.2 and Figure 6.6, the hyperparameters learned by PFHP result in a statistically significant (p < 0.05) improvement in both the 96 sc en ar io ga ll er y si ze p ar am . se ss io n s it er at io n s an im at io n s er ro r e x p e rt H P 2 4 20 12 .4 3 ± 4. 45 22 .6 6 ± 7. 35 0. 44 ± 0. 30 P F H P 2 4 20 8. 45 ± 2. 81 14 .4 4 ± 5. 03 0. 36 ± 0. 34 ra ti n g s 1 4 20 28 .3 5 ± 5. 13 28 .3 5 ± 5. 13 0. 31 ± 0. 26 4 -g a ll e ry 4 4 30 7. 57 ± 4. 67 24 .8 5 ± 15 .4 8 0. 22 ± 0. 17 ? m a n u a l (n ov ic e) 1 12 3 35 .3 3 ± 7. 13 35 .3 3 ± 7. 13 2. 02 ± 0. 35 m a n u a l (e xp er t) 1 12 5 28 .4 0 ± 5. 95 28 .4 0 ± 5. 95 0. 91 ± 0. 30 ? 4 -g a ll e ry + m a n u a l 4 12 20 5. 38 ± 2. 63 19 .1 4 ± 7. 02 1. 23 ± 0. 74 ? T ab le 6. 2: R es ul ts of ex pe ri m en ts of §6 .4 .4 an d 6. 4. 5 in w hi ch us er s w er e sh ow n ta rg et an im at io ns an d as ke d to fin d th em us in g on ly sp ec ifi c m et ho ds . ga ll er y si ze is th e nu m be r of si m ul ta ne ou s an im at io ns in st an ce s in th e in te rf ac e us ed . p ar am . is th e nu m be r of fr ee pa ra m et er s th e us er w as tr yi ng to fin d. se ss io n s is th e to ta l nu m be r of se ss io ns th e sc en ar io w as ru n to co lle ct da ta . it er at io n s, an im at io n s an d er ro r ar e th e m ea n an d st an da rd de vi at io n of th e nu m be r of in te rf ac e it er at io ns , th e to ta ln um be r of di ff er en t an im at io ns vi ew ed , an d th e di ve rg en ce of th e us er ’s se le ct ed pa ra m et er va lu es fr om th e ta rg et an im at io n, re sp ec ti ve ly . B o ld re su lt s in di ca te th at fo r th e in di ca te d at tr ib ut e, th e sc en ar io is th e st at is ti ca lly -s ig ni fic an t be st -p er fo rm in g (p < 0. 05 ). A st er is ks in di ca te si gn ifi ca nc e 0. 05 < p < 0. 2. T he up pe r pa rt of th e ta bl e il lu st ra te s so m e of th e tr ad e- off s in vo lv ed (§6 .4 .4 ). T he 4- ga lle ry ap pr oa ch re qu ir es vi ew in g m or e an im at io ns th an pa ir w is e, bu t is th e m os t ac cu ra te , w hi le di re ct ra ti ng pe rf or m s po or ly , as pr ed ic te d. T he pa ir w is e pr ef er en ce s pr od uc ed hi gh er er ro r, bu t re qu ir ed fe w er to ta l an im at io n vi ew s. T he lo w er pa rt sh ow s ho w a ta sk (§6 .4 .5 ) th at is ne ar ly im po ss ib le fo r no vi ce us er s w it h th e m an ua li nt er fa ce be co m es re la ti ve ly ea sy w he n th e m an ua li nt er fa ce is co m bi ne d w it h th e 4- ga lle ry ap pr oa ch . 97 iterations animations error Figure 6.6: Results of the experiments of §6.4.4. This is the same data as the upper part of Table 6.2, shown as boxplots to visualize the magni- tudes of the mean performance and range of results. number of iterations and animations viewed over the expert hyperparameters (expertHP), and result in a slightly lower error. This represents a substantial savings in human effort, and eliminates the risk involved in having a human expert attempt the difficult but crucial task of setting the hyperparameters. The 4-gallery approach requires viewing more animations than pairwise, but is the most accurate (with significance p < 0.2), while direct rating performs poorly, as predicted. (Of course, it should be no surprise that the 4-gallery involved more instances, as at each iteration, 4 animations are gen- erated, rather than 2.) The pairwise preferences produced higher error, but required fewer total animation views. Anecdotally, users reported difficulty in using the pairwise preferences due to the limited feedback available. We suspect this lead to the very large standard deviations on the error. Based on these results, the case for using a gallery of several instances is signifi- cant, but not as clear-cut as the case for learning the hyperparameters and mean function. If animating is cheap and the cognitive effort of rating is low (as in our application), a large gallery offers lower error. If the goal is to minimize the total number of animations generated and rated, using pairs is preferable. Numerically rating individual animations, however, suffers in both accuracy and the number of instances. Clearly, if we want to make the best use of user time, preferences are the most suitable choice. 98 6.4.5 Full application interface compared to parameter twiddling Next, we test whether our “real” application—in which users can restrict ranges and fix parameters in addition to using a gallery of machine-selected animations—is more effective than more traditional “parameter twiddling” applications: • manual We implemented a single-window GUI with no machine learn- ing: the user sets all parameters manually. We distinguished between novices, who have no real understanding of what the parameters mean, and an expert, who is very familiar with the procedural animation sys- tem, and who also has a commanding knowledge of fluid animation in general. • 4-gallery + manual This is the full interface, in which all parameters start off unrestricted, and the user can indicate preference, and can also restrict the range of any parameter, including setting it to a single value. This directly controls the bounds of the EI maximization, so the next set of parameters will be in the indicated ranges. The hyperparameters are again learned with a particle filter, and the prior mean function is set to zero. We found that we had to discontinue the initial manual-tweaking experiment. Our first subjects became so frustrated with trying to set the parameters that they frequently expressed a desire to give up. The numbers show the results when the experiment was terminated. We include the final results by means of comparison. We also included a small number of runs for an expert user who was familiar with the manual interface. These are shown as manual (expert) in Table 6.2. We consider the numbers for the manual tool very unreliable—the significant result is that for non-experts, using the gallery made completing the task feasible at all! With that caveat, though, we find it interesting that the expert trials with the manual interface were only slightly better than the non-expert users using the gallery-plus-manual interface. 99 We used interactive Bayesian optimization on the problem described in §6.3.1 so we wish to compare the procedural fluid gallery to that as directly as possible, even though the applications (BRDF parameters versus procedural fluid animation parameters) are different. To do this, we set up what we will refer to as the expertHP model, which uses the method described there. This consists of (a) setting the prior mean to be the zero function (i.e., β = 0); (b) setting the kernel hyperparameters θ to expert-selected “reasonable” values and keeping them there; and (c) restricting the gallery to 2 windows, one generated from argmaxx EI(x) and one from argmaxx µ(x). To set θ we had an expert set an initial θE. We then ran 5 sessions with those hyperparameters, and used the log likelihood method to learn θLL for each of those sessions. The final value for each hyperparameter θ` was either θE` or the median of θLL` , whichever seemed more reasonable to our expert. We collected data using 4 users for a total of 20 sessions. We then compared those results to 20 sessions in which the hyperpa- rameters were trained using the particle filter method described in §4.1. We initially trained the hyperparameters using the results of the expertHP ses- sions. Note that in both scenarios, we use the zero function for the mean, so as not to make animations more likely simply because they happened to be in the pool of available targets. The results of our observations in §4.3 indicate this would be a sufficient number of sessions to learn a fairly stable model. The trained particle filters were then used for a second set of sessions, PFHP. The gallery here was still pairwise, and the user’s task the same, but we used hyperparameters that had been automatically tuned on 30 sessions of the system. The results are shown in Table 6.2 and Figure 6.6. Our particle filter method reduced the mean number of animations viewed by almost half, with a comparable error. The mean number of iterations (pairs viewed, in this case) was also substantially reduced, though not by quite as much, as our method is more likely to show the user the “best” animations as a component of different pairs. Nevertheless, this represents a substantial savings in human effort. 100 mean f’n iterations Q1 Q2 zero 11.25 ± 3.60 6.64 ± 1.76 7.00 ± 1.41 trained 6.50 ± 2.15 7.42 ± 1.10 7.36 ± 0.81 Table 6.3: Comparison of users using the system with a zero function as the mean, and with a mean trained on previous user data. Users were asked to think of an animation and use the system to try to find it. We measured the number of iterations before the user terminated and collected responses to a two-question survey. Q1 was “how close is your selected animation to what you had in mind”. Q2 was “independent of how close it was to your target, how much did you like the selected ani- mation”. The trained mean function offers improvement in all metrics, but especially in reducing the number of iterations of the system. 6.4.6 Discovery To test the effect of learning the mean function (§4.1), we designed a “dis- covery” experiment, where a user has an idea of what they want, but no ground truth. This also allows us to assess the impact of learning the mean function in a realistic setting. In this experiment, users (familiar with the system, but non-experts) are simply asked to have a rough idea in mind and use the system to find an animation they are satisfied with. In the first 15 trials, we used the zero-mean function. In the second 15, we learned the mean function using data from the first trials. Users were not told which mean function they were using. At the end of each session, users were asked to answer, using a subjective scale of 1–10, the following questions: (Q1) “how close is your selected animation to what you had in mind?”; and (Q2) “independent of how close it was to your target, how much did you like the selected animation?”. The goal of these two questions is to measure the effect the mean function had on helping the user find an instance, and also to determine whether the trained mean might cause the user to favour instances that are appealing but not what they were looking for. This could happen, for instance, if the mean function had too much influence, biasing the search toward animations it was trained over exploration. The results are shown in Table 6.3. The most dramatic difference is in the number of iterations required for users to find a target—from an average 101 of 11.25 to just 6.5. Moreover, the higher responses to Q1 indicate that users were, indeed, finding what they were looking for, and were not just settling for instances suggested due to the mean function. 102 Chapter 7 Conclusion This thesis has sought to show how Bayesian optimization with Gaussian processes can be applied to solve a real problem: helping animators and artists find parameters for procedural graphics and animation problems. To this end, we have identified several unique aspects of the problem and suggested solutions. The first issue is that human beings are poor at rating psychopercep- tual experiences with absolute magnitudes. This makes optimizing a scalar function directly a difficult proposition. However, humans excel at indicat- ing preferences. In Chapter 3, we showed how a probit model can be used in a gallery application to elicit user feedback. This allows users to respond in a more natural way, using preferences, from which we infer a GP model. Another problem with interactive applications is that typically very little data is collected in a user session, which makes setting hyperparameters and the mean function challenging. Learning hyperparameters in a single user session is impractical, but the nature of the task itself offers a solution: while each individual session with the application might involve different users looking for different animations, if we treat the hyperparameter-learning problem as one of state estimation in a hierarchical Bayesian model, we can use particle filters to make the process automatic. In Chapter 4, we show how this can be done. By only updating the distribution, the model is guided toward good settings without being overly restricted. The continuous 103 learning and updating ensures that the more the gallery application is used, the better it will become at meeting the needs of its users, even if those needs change over time. Finally, determining the acquisition function for an unknown objective is a difficult task. In Chapter 5, we show that strategies that adaptively modify a portfolio of acquisition functions often perform substantially better—and almost never worse—than the best-performing individual acquisition func- tion. Our experiments have shown that full-information strategies are able to outperform partial-information strategies in many situations. In Chapter 6, we put our theoretical work into practice with a series of applications. We demonstrate the performance of applications in a set of user studies in a controlled environment. However, user expertise is a valu- able resource, and if a user knows the ranges or exact value of a parameter, or if they wish to use conventional slider twiddling as part of their workflow, it is important that our tool not interfere. We present a framework in which our techniques can be used in conjunction with existing methods. 7.1 Discussion and advice to practitioners Interactive Bayesian optimization is a powerful tool for machine learning, where the problem is often not acquiring data, but acquiring labels. In many ways, it is like conventional active learning, but instead of acquiring training data for classification or regression, it allows us to develop frameworks to efficiently solve novel kinds of learning problems such as those discussed in Chapters 3–6. It provides us with an efficient way to learn the solutions to problems, and to collect data, all within a Bayesian framework. However, Bayesian optimization is also a fairly recent addition to the machine learning community, and not yet extensively studied on user ap- plications. Here, we wish to describe some of the shortcomings we have experienced in our work with Bayesian optimization, both as caveats and as opportunities for other researchers. A particular issue is that the design of the prior is absolutely critical to efficient Bayesian optimization. Gaussian processes are not always the 104 best or easiest solution, but even when they are, great care must be taken in the design of the kernel. In many cases, though, little is known about the objective function, and, of course, it is expensive to sample from (or we wouldn’t need to use Bayesian optimization in the first place). The practical result is that in the absence of (expensive) data, either strong assumptions are made without certainty that they hold, or a weak prior must be used. It is also often unclear how to handle the trade-off between exploration and exploitation in the acquisition function. Too much exploration, and many iterations can go by without improvement. Too much exploitation leads to local maximization. These problems are exacerbated as dimensionality is increased—more dimensions means more samples are required to cover the space, and more parameters and hyperparameters may need to be tuned, as well. In order to deal with this problem effectively, it may be necessary to do automatic feature selection, or assume independence and optimize each dimension in- dividually. Another limitation of Bayesian optimization is that the acquisition is currently both myopic and permits only a single sample per iteration. Look- ing forward to some horizon would be extremely valuable for reinforcement learning problems, as well as in trying to optimize within a known budget of future observations. Recent work [Garnett et al., 2010b; Azimi et al., 2011] has indicated very promising directions for this work to follow. Being able to efficiently select entire sets of samples to be labelled at each iteration would be a boon to design galleries and other batch-incremental problems. Finally, there are many extensions that will need to be made to Bayesian optimization for particular applications—feature selection, time-varying mod- els, censored data, heteroskedasticity, nonstationarity, non-Gaussian noise, etc. In many cases, these can be dealt with as extensions to the prior—in the case of Gaussian processes, for example, a rich body of literature exists in which such extensions have been proposed. However, these extensions need to take into account the adaptive and iterative nature of the optimization problem, which can vary from trivial to impossible. Clearly, there is a lot of work to be done in Bayesian optimization, but 105 we feel that the doors it opens make it worthwhile. It is our hope that as Bayesian optimization proves itself useful in the machine learning domain, the community will embrace the fascinating new problems and applications it opens up. 106 Bibliography [Akyüz et al., 2007] A. O. Akyüz, R. Fleming, B. E. Riecke, E. Reinhard, and H. H. Bülthoff. Do HDR displays support LDR content?: a psychophysical evaluation. ACM Transactions on Graphics, 26(3):38, 2007. [Angelidis and Neyret, 2005] A. Angelidis and F. Neyret. Simulation of smoke based on vortex filament primitives. In ACM-SIGGRAPH/Eurographics Sympo- sium on Computer Animation (SCA), 2005. [Ashikhmin et al., 2000] M. Ashikhmin, S. Premośe, and P. Shirley. A microfacet- based BRDF generator. ACM Transactions on Graphics, pages 65–74, 2000. [Audet et al., 2000] C. Audet, J. Jr, Dennis, D. W. Moore, A. Booker, and P. D. Frank. Surrogate-model-based method for constrained optimization. In AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Op- timization, 2000. [Audibert et al., 2010] J. Audibert, S. Bubeck, and R. Munos. Best arm identifi- cation in multi-armed bandits. In Proc. of the Conference on Learning Theory (COLT), 2010. [Auer et al., 1998] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gam- bling in a rigged casino: the adversarial multi-armed bandit problem. Technical Report NC2-TR-1998-025, NeuroCOLT2 Technical Report Series, 1998. [Azimi et al., 2011] J. Azimi, A. Fern, and X. Z. Fern. Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems 24, 2011. [Bartz-Beielstein et al., 2005] T. Bartz-Beielstein, C. Lasarczyk, and M. Preuss. Sequential parameter optimization. In Proc. CEC-05, 2005. [Bertsekas and Tsitsiklis, 1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro- Dynamic Programming. Athena Scientific, 1996. [Betrò, 1991] B. Betrò. Bayesian methods in global optimization. J. Global Opti- mization, 1:1–14, 1991. [Bishop, 2006] C. M. Bishop. Pattern Recognition and Machine Learning. Springer- Verlag, 2006. [Boyle, 2007] P. Boyle. Gaussian Processes for Regression and Optimisation. PhD thesis, Victoria University of Wellington, Wellington, New Zealand, 2007. 107 [Bridson et al., 2007] R. Bridson, J. Houriham, and M. Nordenstam. Curl-noise for procedural fluid flow. ACM Transactions on Graphics, 26(3):46, 2007. [Brochu et al., 2007a] E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems 21, 2007. [Brochu et al., 2007b] E. Brochu, A. Ghosh, and N. de Freitas. Preference galleries for material design. In ACM SIGGRAPH 2007 Posters, page 105, 2007. [Brochu et al., 2009] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions. Technical Report TR-2009- 023, University of British Columbia, Department of Computer Science, 2009. [Brochu et al., 2010a] E. Brochu, T. Brochu, and N. de Freitas. A Bayesian inter- active optimization approach to procedural animation design. In Eurographics/ ACM SIGGRAPH Symposium on Computer Animation, 2010. [Brochu et al., 2010b] E. Brochu, M. Hoffman, and N. de Freitas. Hedging strate- gies for Bayesian optimization. eprint arXiv:1009.5419, arXiv.org, September 2010. [Brochu et al., 2011] E. Brochu, V. M. Cora, and N. de Freitas. 2008 Machine Learning Summer School, volume 6368 of LNCS/LNAI Tutorial Series, chapter A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Applica- tion to Active User Modeling and Hierarchical Reinforcement Learning. Springer, 2011. To appear. [Bubeck et al., 2009] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory, pages 23–37. Springer, 2009. [Busby, 2009] D. Busby. Hierarchical adaptive experimental design for Gaussian process emulators. Reliability Engineering and System Safety, 94(7):1183–1193, July 2009. [Cesa-Bianchi and Lugosi, 2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, 2006. [Chaudhuri et al., 2009] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 23, 2009. [Chu and Ghahramani, 2005a] W. Chu and Z. Ghahramani. Extensions of Gaus- sian processes for ranking: semi-supervised and active learning. In Learning to Rank workshop at NIPS-18, 2005. [Chu and Ghahramani, 2005b] W. Chu and Z. Ghahramani. Preference learning with Gaussian processes. In Proc. 22nd International Conf. on Machine Learn- ing, 2005. [Cora, 2008] V. M. Cora. Model-based active learning in hierarchical policies. Mas- ter’s thesis, University of British Columbia, Vancouver, Canada, April 2008. [Cox and John, 1992] D. D. Cox and S. John. A statistical method for global optimization. In Proc. IEEE Conference on Systems, Man and Cybernetics, volume 2, pages 1241–1246, 1992. 108 [Cox and John, 1997] D. D. Cox and S. John. SDO: A statistical method for global optimization. In M. N. Alexandrov and M. Y. Hussaini, editors, Multidisciplinary Design Optimization: State of the Art, pages 315–329. SIAM, 1997. [Diggle and Ribeiro, 2007] P. J. Diggle and P. J. Ribeiro. Model-Based Geostatis- tics. Springer Series in Statistics. Springer, 2007. [Diggle et al., 1998] P. J. Diggle, J. A. Tawn, and R. A. Moyeed. Model-based geo- statistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350, 1998. [Doucet et al., 2001] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer, 2001. [Eidsvik et al., 2009] J. Eidsvik, S. Martinao, and H. Rue. Approximate Bayesian inference in spatial generalized linear mixed models. Scandinavian Journal of Statistics, 36:1–22, 2009. [Elder, 1992] J. F. Elder, IV. Global Rd optimization when probes are expensive: The GROPE algorithm. In Proc. IEEE International Conference on Systems, Man and Cybernetics, 1992. [Élő, 1978] Á. Élő. The Rating of Chess Players: Past and Present. Arco Publish- ing, New York, 1978. [Finkel, 2003] D. E. Finkel. DIRECT Optimization Algorithm User Guide. Center for Research in Scientific Computation, North Carolina State University, 2003. [Fleming et al., 2001] R. Fleming, R. Dror, and E. Adelson. How do humans deter- mine reflectance properties under unknown illumination? In CVPR Workshop on Identifying Objects Across Variations in Lighting, 2001. [Frean and Boyle, 2008] M. Frean and P. Boyle. Using Gaussian processes to op- timize expensive functions. In W. Wobcke and M. Zhang, editors, AI 2008: Advances in Artificial Intelligence, volume 5360 of Lecture Notes in Computer Science, pages 258–267. Springer Berlin / Heidelberg, 2008. [Garnett et al., 2010a] R. Garnett, M. Osborne, S. Reece, A. Rogers, and S. Roberts. Sequential Bayesian prediction in the presence of changepoints and faults. The Computer Journal, 2010. [Garnett et al., 2010b] R. Garnett, M. Osborne, and S. Roberts. Bayesian opti- mization for sensor set selection. In Proceedings of the 9th ACM/IEEE Interna- tional Conference on Information Processing in Sensor Networks, pages 209–219. ACM, 2010. [Genton, 2001] M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001. [Glickman and Jensen, 2005] M. E. Glickman and S. T. Jensen. Adaptive paired comparison design. J. Statistical Planning and Inference, 127:279–293, 2005. [Goldberg et al., 1998] P. W. Goldberg, C. K. I. Williamsn, and C. M. Bishop. Re- gression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems 10, 1998. 109 [Grunewalder et al., 2010] S. Grunewalder, J. Audibert, M. Opper, and J. Shawe- Taylor. Regret bounds for Gaussian process bandit problems. In Proceedings of the Conference on Artificial Intelligence and Statistics, 2010. [Hastie et al., 2009] T. Hastie, R. Tibrishani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, second edition, 2009. [Herbrich and Graepel, 2006] R. Herbrich and T. Graepel. Trueskill: A Bayesian skill rating system. Technical Report MSR-TR-2006-80, Microsoft Research, June 2006. [Herbrich et al., 1998] R. Herbrich, T. Graepel, P. Bollman-Sdorra, and K. Ober- mayer. Learning preference relations for information retreival. Technical report, AAAI Technical Report, 1998. [Hinton and Salakhutdinov, 2006] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504 – 507, 2006. [Hoffman et al., 2009] M. Hoffman, H. Kück, N. de Freitas, and A. Doucet. New inference strategies for solving Markov decision processes using reversible jump MCMC. In Uncertainty in Artificial Intelligence, 2009. [Holmes and Held, 2006] C. Holmes and L. Held. Bayesian auxiliary variable mod- els for binary and multinomial regression. Bayesian Analysis, 1(1):145–168, 2006. [Huang et al., 2006] D. Huang, T. T. Allen, W. I. Notz, and N. Zheng. Global optimization of stochastic black-box systems via sequential Kriging meta-models. J. Global Optimization, 34(3):441–466, March 2006. [Hutter et al., 2007] F. Hutter, D. Babić, H. H. Hoos, and A. Hu. Boosting verifi- cation by automatic tuning of decision procedures. In Proc. of Formal Methods in Computer Aided Design (FMCAD’07), 2007. [Hutter et al., 2009] F. Hutter, H. H. Hoos, K. Leyton-Brown, and K. P. Murphy. An experimental investigation of model-based parameter optimisation: SPO and beyond. In Proc. GECCO’09, 2009. [Hutter, 2009] F. Hutter. Automating the Configuration of Algorithms for Solv- ing Hard Computational Problems. PhD thesis, University of British Columbia, Vancouver, Canada, August 2009. [Jones et al., 1993] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. J. Optimization Theory and Apps, 79(1):157–181, 1993. [Jones et al., 1998] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. J. Global Optimization, 13(4):455– 492, 1998. [Jones, 2001] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. J. Global Optimization, 21:345–383, 2001. [Kahneman and Tversky, 1979] D. Kahneman and A. Tversky. Prospect theory: an analysis of decision making under risk. Econometrica, 47:263–291, 1979. 110 [Kapoor et al., 2007] A. Kapoor, K. Grauman, R. Urtasun, and T. Sarrell. Active learning with Gaussian processes for object categorization. In Proc. International Conference on Computer Vision (ICCV), 2007. [Kendall, 1975] M. Kendall. Rank Correlation Methods. Griffin Ltd, 1975. [Kingsley, 2006] D. C. Kingsley. Preference uncertainty, preference refinement and paired comparison choice experiments. Working Paper 06-06, Center for Eco- nomic Analysis, University of Colorado at Boulder, 2006. Dept. of Economics, University of Colorado. [Krause et al., 2008] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. J. Machine Learning Research, 9:235–284, 2008. [Krige, 1951] D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. J. the Chemical, Metallurgical and Mining Soc. of South Africa, 52(6), 1951. [Kuang et al., 2004] J. Kuang, H. Yamaguchi, G. Johnson, and M. Fairchild. Test- ing HDR image rendering algorithms. In Proc. IS and T/SID 12th Color Imaging Conference, 2004. [Kushner and Yin, 1997] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, 1997. [Kushner, 1964] H. J. Kushner. A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise. J. Basic Engineering, 86:97– 106, 1964. [Ledda et al., 2005] P. Ledda, A. Chalmers, T. Troscianko, and H. Seetzen. Eval- uation of tone mapping operators using a high dynamic range display. ACM Transactions on Graphics, 24(3):640–648, August 2005. [Lewis and Gale, 1994] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proc. ACM SIGIR Conference on Research and Development in Information Retreival, 1994. [Liberti and Maculan, 2006] L. Liberti and N. Maculan, editors. Global Optimiza- tion: From Theory to Implementation. Springer Nonconvex Optimization and Its Applications. Springer, 2006. [Lizotte et al., 2007] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Au- tomatic gait optimization with Gaussian process regression. In IJCAI, 2007. [Lizotte, 2008] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2008. [Locatelli, 1997] M. Locatelli. Bayesian algorithms for one-dimensional global op- timization. J. Global Optimization, 1997. [Mackay, 1992] D. J. C. Mackay. A practical bayesian framework for backpropaga- tion networks. Neural Computation, 4(3):448–472, 1992. [Marks et al., 1997] J. Marks, B. Andalman, P. A. Beardsley, W. Freeman, S. Gib- son, J. Hodgins, T. Kang, B. Mirtich, H. Pfister, W. Ruml, K. Ryall, J. Seims, and S. Shieber. Design galleries: A general approach to setting parameters for computer graphics and animation. Computer Graphics, 31, 1997. 111 [Martinez–Cantin et al., 2006] R. Martinez–Cantin, N. de Freitas, and J. Castel- lanos. Analysis of particle methods for simultaneous robot localization and map- ping and a new algorithm: Marginal-SLAM. In Proc. IEEE International Con- ference on Robots and Automation, 2006. [Martinez–Cantin et al., 2009] R. Martinez–Cantin, N. de Freitas, E. Brochu, J. Castellanos, and A. Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots, 27(2):93–103, 2009. [Matérn, 1960] B. Matérn. Spatial Variation. Springer-Verlag, 2nd (1986) edition, 1960. [Matheron, 1971] G. Matheron. The theory of regionalized variables and its ap- plications. Cahier du Centre de Morphologie Mathematique, Ecoles des Mines, 1971. [McFadden, 1980] D. McFadden. Econometric models for probabilistic choice among products. Journal of Business, 53(3):13–29, 1980. [McFadden, 2001] D. McFadden. Economic choices. The American Economic Re- view, 91:351–378, 2001. [Močkus et al., 1978] J. Močkus, V. Tiesis, and A. Žilinskas. Toward Global Opti- mization, volume 2, chapter The Application of Bayesian Methods for Seeking the Extremum, pages 117–128. Elsevier, 1978. [Močkus, 1982] J. Močkus. The Bayesian approach to global optimization. In R. Drenick and F. Kozin, editors, System Modeling and Optimization, volume 38, pages 473–481. Springer Berlin / Heidelberg, 1982. [Močkus, 1994] J. Močkus. Application of Bayesian approach to numerical methods of global and stochastic optimization. J. Global Optimization, 4(4):347 – 365, 1994. [Mongeau et al., 1998] M. Mongeau, H. Karsenty, V. Rouzé, and J.-B. Hiriart- Urruty. Comparison of public-domain software for black-box global optimization. Technical Report LAO 98-01, Universite Paul Sabatier, Toulouse, France, 1998. [Mosteller, 1951] F. Mosteller. Remarks on the method of paired comparisons: I. the least squares solution assuming equal standard deviations and equal correla- tions. Psychometrika, 16:3–9, 1951. [Murray-Smith and Girard, 2001] R. Murray-Smith and A. Girard. Gaussian pro- cess priors with ARMA noise models. In Irish Signals and Systems Conference, 2001. [Ngan et al., 2005] A. Ngan, F. Durand, and W. Matusik. Experimental analysis of BRDF models. In Proc. Eurographics Symposium on Rendering, pages 117–226, 2005. [Ngan et al., 2006] A. Ngan, F. Durand, and W. Matusik. Image-driven navigation of analytical BRDF models. In T. Akenine-Möller and W. Heidrich, editors, Eurographics Symposium on Rendering, 2006. 112 [O’Hagan, 1978] A. O’Hagan. On curve fitting and optimal design for regression. Journal of the Royal Statistical Society B, 40:1–42, 1978. [Osborne et al., 2008] M. Osborne, S. Roberts, A. Rogers, S. Ramchurn, and N. Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proceedings of the 7th international conference on Information processing in sensor networks, pages 109–120. IEEE Computer Society, 2008. [Osborne et al., 2009] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), 2009. [Osborne et al., 2010] M. Osborne, R. Garnett, and S. Roberts. Active data selec- tion for sensor networks with faults and changepoints. In IEEE International Conference on Advanced Information Networking and Applications, 2010. [Osborne, 2010] M. Osborne. Bayesian Gaussian Processes for Sequential Predic- tion, Optimization and Quadrature. PhD thesis, University of Oxford, 2010. [Payne et al., 1993] J. W. Payne, J. R. Bettman, and E. J. Johnson. The Adaptive Decision Maker. Cambridge University Press, 1993. [Perlin and Neyret, 2001] K. Perlin and F. Neyret. Flow noise. In ACM SIG- GRAPH 2001 Technical Sketches and Applications, page 187, Aug 2001. [Poyiadjis et al., 2005] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle methods for optimal filter derivative: Application to parameter estimation. In IEEE ICASSP, pages 925–928, 2005. [Press et al., 2007] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flan- nery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 3rd edition, 2007. [Rasmussen and Williams, 2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006. [Rasmussen, 2003] C. E. Rasmussen. Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. In Bayesian Statistics 7, pages 651–659. Oxford University Press, 2003. [Rue et al., 2009] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian in- ference for latent Gaussian process models using integrated Laplace approxima- tions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71:319–392, 2009. [Russo and Leclerc, 1991] J. E. Russo and F. Leclerc. Characteristics of successful information programs. J. Social Issues, 17:759–769, 1991. [Sacks et al., 1989] J. Sacks, W. J. Welch, T. J. Welch, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989. [Santner et al., 2003] T. J. Santner, B. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer, 2003. 113 [Sasena, 2002] M. J. Sasena. Flexibility and Efficiency Enhancement for Con- strained Global Design Optimization with Kriging Approximations. PhD thesis, University of Michigan, 2002. [Schonlau, 1997] M. Schonlau. Computer Experiments and Global Optimization. PhD thesis, University of Waterloo, Waterloo, Ontario, Canada, 1997. [Selle et al., 2005] A. Selle, N. Rasmussen, and R. Fedkiw. A vortex particle method for smoke, water and explosions. ACM Transactions on Graphics, 24(3):910–914, 2005. [Settles, 2010] B. Settles. Active learning literature survey. Computer Science Technical Report 1648, University of Wisconsin-Madison, January 2010. [Siegel and Castellan, 1988] S. Siegel and N. J. Castellan. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, 1988. [Simon, 1978] H. A. Simon. Rationality as process and as product of thought. American Economic Review, 68:1–16, 1978. [Sims, 1990] K. Sims. Particle animation and rendering using data parallel com- putation. In SIGGRAPH ’90: Proceedings of the 17th annual conference on Computer graphics and interactive techniques, pages 405–413, New York, NY, USA, 1990. ACM. [Srinivas et al., 2010] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaus- sian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010. [Stam and Fiume, 1993] J. Stam and E. Fiume. Turbulent wind fields for gaseous phenomena. In SIGGRAPH ’93, pages 369–376, 1993. [Stam, 2003] J. Stam. Real-time fluid dynamics for games. In Proceedings of the Game Developer Conference, 2003. [Stein, 1999] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer, 1999. [Stern, 1990] H. Stern. A continuum of paired comparison models. Biometrika, 77:265–273, 1990. [Streltsov and Vakili, 1999] S. Streltsov and P. Vakili. A non-myopic utility func- tion for statistical global optimization algorithms. J. Global Optimization, 14:283–298, 1999. [Stuckman, 1988] B. Stuckman. A global search method for optimizing nonlinear systems. IEEE Transactions on Systems, Man and Cybernetics, 18(6):965–977, 1988. [Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. [Talton et al., 2009] J. O. Talton, D. Gibson, L. Yang, P. Hanrahan, and V. Koltun. Exploratory modeling with collaborative design spaces. In Proc. 2nd Annual ACM SIGGRAPH Conf. and Exhibition in Asia, 2009. 114 [Thurstone, 1927] L. Thurstone. A law of comparative judgement. Psychological Review, 34:273–286, 1927. [Törn and Žilinskas, 1989] A. Törn and A. Žilinskas. Global Optimization. Springer-Verlag, 1989. [Train, 2003] K. Train. Discrete Choice Methods with Simulation. Cambridge Uni- versity Press, 2003. [Tversky and Kahneman, 1992] A. Tversky and D. Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. J. Risk and Un- certainty, 5:297–323, 1992. [Vasquez and Bect, 2008] E. Vasquez and J. Bect. On the convergence of the ex- pected improvement algorithm. Technical Report arXiv:0712.3744v2, arXiv.org, Feb 2008. [Wejchert and Haumann, 1991] J. Wejchert and D. Haumann. Animation aerody- namics. In SIGGRAPH ’91, pages 19–22, New York, NY, USA, 1991. ACM. [Williams et al., 2000] B. J. Williams, T. J. Santner, and W. I. Notz. Sequen- tial design of computer experiments to minimize integrated response functions. Statistica Sinica, 10:1133–1152, 2000. [Wong et al., 1988] S. K. M. Wong, Y. Y. Yao, and P. Bollmann. Linear structure in information retrieval. In ACM SIGIR Conference on Research and Development in Information Retreival, 1988. [Younes, 1989] L. Younes. Parameter estimation for imperfectly observed Gibbsian fields. Prob. Theory and Rel. fields, 82:625–645, 1989. [Zhigljavsky and Žilinskas, 2008] A. Zhigljavsky and A. Žilinskas. Stochastic Global Optimization. Springer Optimization and Its Applications. Springer, 2008. [Žilinskas and Žilinskas, 2002] A. Žilinskas and J. Žilinskas. Global optimization based on a statistical model and simplical partitioning. Computers and Mathe- matics with Applications, 44:957–967, 2002. [Žilinskas, 1980] A. Žilinskas. Lecture Notes in Control and Information Sciences, chapter On the Use of Statistical Models of Multimodal Functions for the Con- struction of Optimization Algorithms. Number 23. Springer-Verlag, 1980. 115 Appendix A Test functions In this appendix, we describe the forms of the literature test functions used for experiments in Chapters 3, 4 and 5. Note that while we give the conven- tional minimization forms here, when running the experiments we maximize −f(x). Branin Two-dimensional Branin function (Figure A.1). Has 3 global minima f(x) = 0.397887, at x = [−pi, 12.275], [pi, 2.275], [9.42478, 2.475]. f(x) = ( x2 − 1.275 (x1 pi )2 + 5 pi x1 − 6 )2 + 10 ( 1− 1 8pi ) cos(x1) + 10 −5 ≤ x1 ≤ 10 0 ≤ x2 ≤ 15 116 Figure A.1: The Branin function. Hartman 3 Three-dimensional Hartman function. Minimum: f([0.1, 0.5559, 0.8522]T ) = −3.8628. f(x) = − 4∑ i=1 Ci exp − 3∑ j=1 Aij(xj −Bij)2 A = 3 10 30 0.1 10 35 3 10 30 0.1 10 35 B = 0.3689 0.117 0.26730.4699 0.4387 0.7470 0.1091 0.8732 0.5547 C = [1 1.2 1 3] 0 ≤ xi ≤ 1, i = 1, . . . , 3 117 Hartman 6 Six-dimensional Hartman function. Minimum f(x) = −3.3224, when x = [0.2017, 0.15, 0.4769, 0.2753, 0.3117, 0.6573]T . f(x) = − 4∑ i=1 Ci exp − 6∑ j=1 Aij(xj −Bij)2 A = 10 3 17 3.5 1.7 8 0.05 10 17 0.1 8 14 3 3.5 1.7 10 17 8 17 8 0.05 10 0.1 14 B = 0.1313 0.1696 0.5569 0.0124 0.8283 0.5886 0.2329 0.4135 0.8307 0.3736 0.1004 0.9991 0.2348 0.1451 0.3522 0.2883 0.3047 0.6650 0.4047 0.8828 0.8732 0.5743 0.1091 0.0381 C = [1 1.2 3 3.2] 0 ≤ xi ≤ 1, i = 1, . . . , 6 Shekel 10 Four-dimensional function with 10 minima, and one global minimum f(x) = −10.1532, when x = [4, 4, 4, 4]. 118 f(x) = 10∑ i=1 1 (x−Ai)(x−Ai)T + Ci A = 4 4 4 4 1 1 1 1 8 8 8 8 6 6 6 6 3 7 3 7 2 9 2 9 5 5 3 3 8 1 8 1 6 2 6 2 7 3.6 7 3.6 C = [0.1 0.2 0.2 0.4 0.4 0.6 0.3 0.7 0.5 0.5] 0 ≤ xi ≤ 10, i = 1, . . . , 4 119