Practical Bayesian optimization with applicationto tuning machine learning algorithmsbyBobak ShahriariB. Sc., McGill University, 2008M. Sc., Simon Fraser University, 2011a thesis submitted in partial fulfillmentof the requirements for the degree ofDoctor of Philosophyinthe faculty of graduate and postdoctoralstudies(Computer Science)The University of British Columbia(Vancouver)August 2016c© Bobak Shahriari, 2016AbstractBayesian optimization has recently emerged in the machine learning com-munity as a very effective automatic alternative to the tedious task of hand-tuning algorithm hyperparameters. Although it is a relatively new aspectof machine learning, it has known roots in the Bayesian experimental de-sign (Lindley, 1956; Chaloner and Verdinelli, 1995), the design and analysisof computer experiments (DACE; Sacks et al., 1989), Kriging (Krige, 1951),and multi-armed bandits (Gittins, 1979). In this thesis, we motivate andintroduce the model-based optimization framework and provide some histor-ical context to the technique that dates back as far as 1933 with applicationto clinical drug trials (Thompson, 1933).Contributions of this work include a Bayesian gap-based exploration pol-icy, inspired by Gabillon et al. (2012); a principled information-theoreticportfolio strategy, out-performing the portfolio of Hoffman et al. (2011); anda general practical technique circumventing the need for an initial boundingbox. These various works each address existing practical challenges in theway of more widespread adoption of probabilistic model-based optimizationtechniques.Finally, we conclude this thesis with important directions for future re-search, emphasizing scalability and computational feasibility of the approachas a general purpose optimizer.iiPrefaceThis thesis is an edited volume of three separate research projects led or co-led by the author of the thesis. Two of these projects culminated in publica-tions at the International Conference on Artificial Intelligence and Statistics(AiStats) in 2014 and 2016—namely (Hoffman et al., 2014; Shahriari et al.,2016a)—and are detailed in Chapters 6 and 4. All experiments and figureswere respectively run and produced by the author, and are included in thepresent document with the approval of all co-authors.Chapters 2 and 3 were inspired by a third publication—namely (Shahri-ari et al., 2016b)—and may exhibit some minor overlap. Similarly, Chapter 5and Appendix D were also originally written by the author for publicationin (Shahriari et al., 2016b) and therefore features significant overlap. Onceagain, barring some minor editing, these chapters were written, and all fig-ures were produced, by the author.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Model-based optimization and the Bayesian approach . . . . 31.2 Historical review . . . . . . . . . . . . . . . . . . . . . . . . . 61.3 Thesis contributions . . . . . . . . . . . . . . . . . . . . . . . 71.3.1 Pure exploration in multi-armed bandits . . . . . . . . 81.3.2 Portfolio methods . . . . . . . . . . . . . . . . . . . . 91.3.3 Unconstrained Bayesian optimization . . . . . . . . . 101.3.4 Open-source software . . . . . . . . . . . . . . . . . . 101.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.1 Observation models . . . . . . . . . . . . . . . . . . . . . . . . 132.2 Bayesian inversion with parametric models . . . . . . . . . . 182.3 Gaussian process priors . . . . . . . . . . . . . . . . . . . . . 23iv2.3.1 Covariance kernels . . . . . . . . . . . . . . . . . . . . 262.3.2 Prior mean functions . . . . . . . . . . . . . . . . . . . 272.3.3 Marginal likelihood . . . . . . . . . . . . . . . . . . . . 282.3.4 Computational cost . . . . . . . . . . . . . . . . . . . 283 Decision making . . . . . . . . . . . . . . . . . . . . . . . . . . 303.1 Definitions and notation . . . . . . . . . . . . . . . . . . . . . 313.2 Improvement-based policies . . . . . . . . . . . . . . . . . . . 383.3 Optimistic policies . . . . . . . . . . . . . . . . . . . . . . . . 393.4 Information-based policies . . . . . . . . . . . . . . . . . . . . 403.5 Recommendation strategies . . . . . . . . . . . . . . . . . . . 434 Bayesian gap-based exploration . . . . . . . . . . . . . . . . 454.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.3.1 Application to a traffic sensor network . . . . . . . . . 524.3.2 Automatic machine learning . . . . . . . . . . . . . . . 554.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 Entropy search portfolio . . . . . . . . . . . . . . . . . . . . . 595.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605.1.1 Estimating the expected entropy . . . . . . . . . . . . 615.1.2 Discretizing X : sampling posterior global minimizers . 655.1.3 Algorithm summary . . . . . . . . . . . . . . . . . . . 675.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685.2.1 Global optimization test functions . . . . . . . . . . . 705.2.2 Geostatistics datasets . . . . . . . . . . . . . . . . . . 715.2.3 Control tasks . . . . . . . . . . . . . . . . . . . . . . . 725.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736 Unconstrained Bayesian optimization via regularization . . 756.1 Regularizing improvement policies . . . . . . . . . . . . . . . 766.2 Extension to general policies . . . . . . . . . . . . . . . . . . . 78v6.3 Choice of regularization . . . . . . . . . . . . . . . . . . . . . 796.4 Volume doubling . . . . . . . . . . . . . . . . . . . . . . . . . 806.5 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 816.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826.6.1 Synthetic benchmarks: Hartmann . . . . . . . . . . . 836.6.2 MLP and CNN on MNIST . . . . . . . . . . . . . . . 836.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 856.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 877.1 Decision-making: unifying BayesGap and ESP . . . . . . . . 877.2 Contributions to the practice . . . . . . . . . . . . . . . . . . 897.3 Future work: first order Bayesian optimization . . . . . . . . 907.4 Final word: from academia to societal impact . . . . . . . . . 91Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93A BayesGap theory . . . . . . . . . . . . . . . . . . . . . . . . . 105A.1 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . 105A.2 Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107B Approximate sampling from posterior global minimizers . 112C Gradients of acquisition functions . . . . . . . . . . . . . . . 114C.1 Probability of improvement . . . . . . . . . . . . . . . . . . . 115C.2 Expected improvement . . . . . . . . . . . . . . . . . . . . . . 116C.3 Gaussian process upper confidence bound . . . . . . . . . . . 116D Approximating Gaussian process inference . . . . . . . . . 118D.1 Sparse pseudo-inputs . . . . . . . . . . . . . . . . . . . . . . . 118D.2 Sparse spectrum . . . . . . . . . . . . . . . . . . . . . . . . . 119D.3 Random forest surrogates . . . . . . . . . . . . . . . . . . . . 121E Hyperparameter learning . . . . . . . . . . . . . . . . . . . . 123E.1 Point estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 124viE.2 Markov chain Monte Carlo . . . . . . . . . . . . . . . . . . . 124E.3 Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . 127viiList of TablesTable 4.1 Models and discretized hyperparameter settings for auto-matic machine learning . . . . . . . . . . . . . . . . . . . . 56viiiList of FiguresFigure 1.1 Visualization of Bayesian optimization . . . . . . . . . . . 5Figure 2.1 Probabilistic graphical model for the stochastic bandit . . 16Figure 2.2 Probabilistic graphical model of Bayesian inversion . . . . 19Figure 2.3 Gaussian process visualization . . . . . . . . . . . . . . . 27Figure 3.1 Probabilistic graphical model of the Bayesian optimiza-tion decision problem . . . . . . . . . . . . . . . . . . . . 31Figure 3.2 Visualizing the surrogate regression model and variousinduced acquisition functions . . . . . . . . . . . . . . . . 41Figure 4.1 Visualization of the BayesGap algorithm . . . . . . . . . . 48Figure 4.2 Probability of error on a traffic sensor network . . . . . . 53Figure 4.3 Boxplot of RMSE on wine regression task . . . . . . . . . 57Figure 4.4 Visualization of selections and recommendations of BayesGapand EI on automatic machine learning task . . . . . . . . 58Figure 5.1 Visualization of Bayesian optimization with portfolios . . 62Figure 5.2 Visualization of the Entropy Search Portfolio . . . . . . . 63Figure 5.3 Best observed evaluation on synthetic functions . . . . . . 70Figure 5.4 Best observed evaluation on mining datasets . . . . . . . 72Figure 5.5 Best observed evaluation on control tasks . . . . . . . . . 74Figure 6.1 Visualization of the alternate views of regularization . . . 76ixFigure 6.2 Visualization of selections of the regularized policy onone- and two-dimensional toy examples . . . . . . . . . . 79Figure 6.3 Comparison of vanilla, regularized, and volume doubling EI 81Figure 6.4 Scatter plots of selections on a real hyperparameter tuningtask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84Figure D.1 Comparison of surrogate regression models . . . . . . . . 120xAcknowledgementsThank you to my examiners Dale Schuurmans, Jane Z Wang, and Will JWelch for their feedback on the thesis as well as a very interesting discussionat my defence. Thank you to Mark Schmidt and Nick Harvey for servingon my supervisory committee. None of the work in this thesis would havebeen possible without the intellectual, emotional, and financial support ofboth my supervisors Alex and Nando, even though distance and time zonesoften worked against us. From PEP8 to GPs, so much of what I knowhas, one way or another, gone through Matt W Hoffman, that he holds anhonorary supervisor status in my mind. Thank you to Kath and Joyce fortheir patience, and to Joanna McGrenere for a very supportive meeting ata particularly sensitive time.Since creative writing was never a skill of mine, I will simply default to asomewhat chronological listing of important people who, through sustainedor momentary support, have helped me complete the complicated experiencethat is graduate school.It all begins with my mom and dad for making sure their son came beforetheir irreconcilable differences. My older brother Behrak for always beingavailable for a chat and a Corona; my younger brother Bijan who is smarter,fitter, and better looking than his brothers; my brother Behrad of whom Istill have vivid fond memories. My aunt and second mother, Ame´ Mavach,who always demonstrated a saintly generosity and dedication toward herloved ones. My uncle Ge´rard and cousin Jean-Philippe for following suitand being two of the sweetest human beings I have ever known. My auntSima and uncle Alain for their monthly phone calls full of support, wisdom,xiand humour. My oldest friend Ariane with whom I still laugh as much andas loudly as we did 26 years ago, and her warm and fuzzy family Dina,Mitra, and Hushang. My good friends Marc, Walid, and Kayvan, who arealways a game of phone tag away from a deep conversation; Priscilla andCatherine, for their much needed fashion advice; and Denis, my favouritepilot.At McGill I had the good fortune of crossing paths with two exceptionalacademic brothers: Charles, whose skills in the art of chit-chat are unpar-alleled, and Ashton, my life and chess coach. Moving to Vancouver wasmade easy thanks to Kevin, Monica, Franck, Ashley, Danny, Frenchie andKate. Leaving Vancouver is extremely difficult because of Fitzy, Grahamy,Ty, Alexis, Jonesy and Angela, Jesse and SJ, Julesy and Addis, Noah andKelly. Thank you to my labmates Matt, Misha, Ziyu, Emty, Mareija, Dave,and Sancho; and my remote fellow Nandians: Iannis, Brendan, Marcin, andJakob. Thanks as well to a more recent friend: Mike Gelbart.During a short stint in San Francisco, I had the absolute pleasure ofliving with my good friend Ashton and the one and only Bob West, whoselove for a good pun may even rival my own; through them I met a fistfulof tremendous human beings: Jake the wise, Hristo the Wolf, Julian thebodybuilder, Matteu the limer, Ryan the corbeau, and Fabian the mailman.While in the bay area, I worked with some very talented folks: Nathan,Dennis, Jamal, and Aaron; and I would particularly like to acknowledgeMartin, Mike, and Ofer for their great mentorship and timely advice. Mostof my weekends were spent enjoying the dolce vita with Jason, Colleen,Callie, Tarun, Joseph, Lindsay and Becca, Rajeev and Jillian, and Audrey.When I spent a brief time in London, several people were responsible forkeeping me sane, in particular I would like to thank Ogla for her incrediblepatience; Dave for many a necessary distraction; Ryan and John for a timelyheart-to-heart at the Euston Tap that I will likely never forget; Nando,Anj, and Sienna for showing me exceptional hospitality and their poochPicasso for countless calming walks. On the topic of hospitality, IannisAssael certainly takes the cake by sharing his 250 sqft studio apartmentfor three weeks, not once making me feel unwelcome. Gabe, with a similarxiiheroic gesture, let me crash on an inflatable mattress in his room for a wholeweek.Coming back to British Columbia, was a difficult adjustment, one thatNilima, Paul, Vijay, Mira, and Amanda have made much easier. Whilefinishing my thesis on the hightest peak of Mount Burnaby, Nilima and herstudents Sebastia´n and Bamdad filled my final student days with stimulatingdiscussions and barrels of laughter; in fact all three deserve acknowledgementfor relentless and uncensored feedback on the thesis.There are a few people deserving acknowledgement that do not quitefit in the chronological flow: Fitzy for being a true ally, a model for self-improvement, and always sharing a teen-aged Scotch; “Swiss” Marc Ryserfor being a kindred—more hipster—spirit for as long as I’ve known him;Gunnar for planning a once-in-a-lifetime Icelandic adventure; Joel Sidorukand Jean-Philippe Ramaut for being models of bravery and successful risk-taking; Femi for being a loyal friend with an infectious laugh; Jimi for be-ing a fun and light-hearted friend; Moustapha and Roberto for being greatfriends in the machine learning community; and Dave Bryant for being agreat companion one particularly stressful evening. Of course, no acknowl-edgement would be complete without a strong French contingent: Franck,E´mine, Caroline, Aure´lie, Benjamin.Last but certainly not least, I would like to thank Nilima for being amentor in mathematics, academics, integrity, generosity and charity, par-enthood and friendship. Nilima, Paul, and their wonderful children Vijayand Mira have truly been—and continue to be—a surrogate family.xiiiChapter 1IntroductionModel-based optimization has become a popular and successful approachto global optimization of black-box functions. In particular, the frameworkspecializes in finding the optimizer of an unknown objective function f ina data efficient way, using a relatively low number of function evaluations.The quantification “relatively low” is deliberately vague because it will oftendepend on the dimensionality of the input space, the variability in the objec-tive function, and the level of noise in the output. In general, the objectivefunctions of interest are non-convex and multi-modal, they do not readilyproduce derivative information, and can be expensive to evaluate, hencethe goal of data efficiency. The Bayesian approach is particularly appealingin this setting because it facilitates combining small amounts of data withprior knowledge or expertise to make predictions with uncertainties thatare consistent with the prior. In fact, a positive side-effect of a Bayesianmethodology is that it requires the practitioner to precisely quantify herprior knowledge1 and encode her inductive bias into a prior probability dis-tribution.This approach is widely applicable to any problem where one can evalu-ate the objective function at arbitrary points, even when these evaluationsare corrupted by measurement noise. Major successful applications of these1 Actually, even when there is no prior knowledge, this can be encoded using what iscolloquially known as a non-informative prior, e.g., the Jeffreys prior.1techniques include:. design and analysis of computer experiments (Sacks et al., 1989; Joneset al., 1998; Azimi et al., 2012),. reinforcement learning, robotics, and control (Lizotte et al., 2007;Martinez-Cantin et al., 2007; Brochu et al., 2009; Calandra et al.,2014),. environmental monitoring and sensor networks (Garnett et al., 2010;Srinivas et al., 2010; Marchant and Ramos, 2012),. advertising and recommender systems (Scott, 2010; Chapelle and Li,2011),. interactive user interfaces (Brochu et al., 2007, 2010),. combinatorial optimization (Hutter et al., 2011; Wang et al., 2013b),. adaptive Monte Carlo samplers (Mahendran et al., 2012; Wang et al.,2013a).The last bullet point is a specific instance of one general area of particu-lar interest to the machine learning community: the automatic tuning ofalgorithms (Bergstra et al., 2011; Snoek et al., 2012; Swersky et al., 2013;Thornton et al., 2013; Hoffman et al., 2014). In this setting, the objectivefunction is the generalization accuracy (or error) of a learning algorithm,and the inputs are the algorithmic or hyperparameter choices. Whereasthese parameters have traditionally been manually tuned and selected bydomain experts, this tedious approach is not scalable and is indeed quicklybecoming unwieldy, if not impossible, with the increase in complexity ofmodern learning algorithms. For instance, neural network architectures canbe parameterized by the number of layers, the number of units per layer,the type of non-linearity at each layer, etc; while, their learning algorithmsalso rely on tunable parameters such as learning rates, momentum, andregularization parameters.The Bayesian approach to optimizing an unknown objective function isto model the uncertainty over the function as randomness. With a prob-abilistic model for the objective, we have the powerful tools of probabilitytheory at our disposal, as we will see in the next chapter. An agent can2then use the model as a surrogate for the objective function and carefullyselect where the next data point should be acquired through what is calleda selection policy. These policies leverage both the model prediction and itspredictive uncertainty to negotiate local and global search. This constantstruggle is commonly referred to as the exploration-exploitation trade-offand is a recurring theme in optimization, machine learning, and artificial in-telligence. The need for uncertainty quantification for such effective policiescannot be over-stressed; without uncertainty estimates—either Bayesian orfrequentist—only very rudimentary exploration is possible, e.g., -greedy al-gorithms. Indeed, the performance of the selection policy, i.e. how quicklyit is able to approach the global maximizer in practice, hinges on preciselyhow the surrogate model prediction and uncertainty are utilized. In Chap-ter 3, we formally introduce selection policies and review several popularalternatives that are available in the literature.1.1 Model-based optimization and the BayesianapproachConsider the problem of finding a global maximizer x? ∈ X of a given ob-jective function f : X 7→ Y. For the sake of generality, let us be deliberatelyvague about the domain, range, and the function space H to which f be-longs, and simply express the problem as follows for now:find x? ∈ arg maxx∈Xf(x). (1.1)Algorithm 1 is a very general sequential search framework, which includesmany popular algorithms such as gradient ascent and Newton’s method. Inthe following examples, we emphasize how a common optimization algorithmfits into the generic sequential scheme by using a local model of the functionalong with a selection criterion, yielding the next iterate.Example 1 (Newton’s method)Assuming a deterministic function evaluations which output both the gra-3Algorithm 1 Generic sequential model-based optimization schemeRequire: –f – unknown noisy objective functionX – input domainαt – sequence of search criteriaT – fixed query budget1: for iteration t up to T do2: xt+1 ← arg maxX αt / selection3: yt+1 ← f(xt+1) / observation4: update model given (xt+1, yt+1) / modelling5: end for6: return xˆT / recommendationdient gt and Hessian Ht evaluated at xt, the next point xt+1 is selected bymaximizing the criterionαNMt (x) := f(xt) + gt(x− xt) +12(x− xt)>Ht(x− xt), (1.2)which is simply the quadratic local approximation given by Taylor’s theorem,assuming appropriate continuity and differentiability.Of course, in this case, the criterion αNM can be optimized analyticallyat the cost of inverting the Hessian matrix, yielding the famous Newtoniteration scheme:xt+1 = xt −H−1t gt (1.3)The last iterate is returned as the ultimate recommendation xˆT = xT .In this simple case, the search criterion and the model are one and thesame. The quadratic model does not fit historical data {xs, ys,gs,Hs}s<t,nor does it feature uncertainty quantification, which is why this algorithmmakes greedy local moves. While it very quickly converges to a local opti-mizer close to the initial x0, it is well known that such a method must berandomly restarted for good performance on multimodal objective functions.In contrast, Bayesian optimization is a probabilistic approach to thismodel-based optimization scheme. Figure 1.1 provides a visualization of4Figure 1.1: Visualization of three consecutive iterations of Bayesian opti-mization on a noiseless objective function (black dashed curve). As thisfigure is meant to provide a visual intuition, the notions of probabilisticmodel (blue curve and shaded area) and acquisition function (purple curve)remain purposely nebulous for now. While observed data is shown as orangedots, the purple circles indicate the maximum of the acquisition function,and the orange circles indicate the next observation.5three consecutive iterations in the sequential algorithm. In particular, weprescribe a prior belief over the possible objective functions f and thensequentially refine this model, as data are observed, via Bayesian poste-rior updating. The Bayesian posterior represents our updated beliefs, givenall our observations, on the likely objective function we are optimizing.Equipped with this probabilistic model, we can sequentially induce criteriaαt : X 7→ R, similar to those in Example 1 yet more sophisticated in thatthey leverage the uncertainty in the posterior to guide a global exploration.In the Bayesian optimization literature, these criteria are known as acquisi-tion functions. As we will clarify in Chapter 3, selection policies need nottake the form of an acquisition function to be maximized. Though they areperhaps the most ubiquitous form of selection policy, in principle any pro-cedure that produces an x ∈ X is a selection policy, the simplest examplebeing uniformly random search.1.2 Historical reviewArguably the earliest work related to Bayesian optimization was that ofWilliam Thompson in 1933 where he considered the likelihood that oneunknown Bernoulli probability is greater than another given observationaldata (Thompson, 1933). In his article, Thompson argues that when con-sidering, for example, two alternative medical treatments, one should noteliminate the worst one based on a single clinical trial. Instead, he proposes,one should estimate the probability that one treatment is better than theother and weigh future trials in favour of the seemingly better treatment,while still, occasionally, trying the seemingly suboptimal one. Thompsonrightly argues that by adopting a single treatment following a clinical trial,there is a fixed chance that all subsequent patients will be given suboptimaltreatment. In contrast, by dynamically selecting a fraction of patients foreach treatment, this sacrifice becomes vanishingly small.In modern terminology, Thompson was directly addressing the exploration–exploitation trade-off, referring to the tension between selecting the bestknown treatment for every future patient (the greedy strategy) and contin-6uing the clinical trial to more confidently assess the quality of both treat-ments. This is a recurring theme not only in the Bayesian optimizationliterature, but also in the related fields of sequential experimental design,multi-armed bandits (whose arms correspond to differing available valuesof x), and operations research.Although modern experimental design had been developed a decade ear-lier by Ronald Fisher’s work on agricultural crops, Thompson introducedthe idea of making design choices dynamically as new evidence becomesavailable. This is a general strategy known as sequential experimental de-sign or, in the multi-armed bandit literature, adaptive or dynamic allocationrules (Lai and Robbins, 1985; Gittins, 1979).Meanwhile, a Master’s student from the University of the Witwatersrandintroduced a method now known in the resource exploration community asKriging, inheriting its name from its original author: Krige (1951). Notonly was the Kriging technique hugely successful in the mining and naturalresource exploration community, but seemingly independently, a similar ap-proach originated from a global optimization perspective through the earlywork of Kushner (1964); Mocˇkus et al. (1978) which later influenced thefield of design and analysis of computer experiments (DACE; Sacks et al.,1989; Jones et al., 1998; Jones, 2001).1.3 Thesis contributionsThis document is the compilation of multiple works with the common themeof addressing open practical challenges in Bayesian optimization, particu-larly as they relate to tuning hyperparameters of machine learning algo-rithms. In this section, we discuss a few motivating issues with Bayesianoptimization and the contributions made towards addressing these openquestions.71.3.1 Pure exploration in multi-armed banditsMany acquisition functions that are commonly used in Bayesian optimiza-tion have either no guarantee or guarantees with respect to a performancemetric known as cumulative regret (e.g. Srinivas et al., 2010; Russo andVan Roy, 2014a). This metric is not suitable for the task of optimizationbecause it rewards and penalizes intermediate selections. Since the task ofBayesian optimization is to find a global optimizer after a fixed budget offunction evaluations2, the simple regret is more appropriate because it onlyassesses the quality of the final recommendation; the selection strategy (theacquisition function) is therefore free to explore within the allocated budget.Drawing from the multi-armed bandit literature (see Cesa-Bianchi andLugosi, 2006, for a good introduction), particularly the pure explorationsetting, we developed a Bayesian extension to a new bandit algorithm byGabillon et al. (2012).This was joint work with Matthew W. Hoffman (Hoffman et al., 2014).In this work, we discretized two Bayesian optimization tasks reformulat-ing them as multi-armed bandit problems, and demonstrated the superiorperformance of our method, BayesGap, over existing acquisition functionson two real world problems: one application in traffic control and one inautomatic model selection and parameter tuning.Finally, working with noisy bandit feedback for the task of automaticmodel tuning and selection requires defining an objective, or reward, func-tion. Strictly speaking, this objective function must return independentand identically distributed evaluations (these are considered random drawsbecause they can be noise-corrupted). Therefore in running experimentsfor this work, we developed an experimental protocol based on a methodcalled repeated learning-testing (see Burman, 1989, for a study of similarmethods).2 Actually, one could also consider the same optimization problem with a fixed confi-dence requirement, i.e. the stopping criterion is defined in terms of our confidence in ourrecommendation. This was recently considered by Soare et al. (2014)81.3.2 Portfolio methodsWhen considering Bayesian optimization as a solution, no single acquisitionstrategy3 has been found to perform best on all problem instances. Hoffmanet al. (2011) address this issue by proposing the use of a portfolio containingmultiple acquisition strategies. The important new element of this approachis the meta-criterion by which a portfolio method selects among differentstrategies. This function is analogous to an acquisition function, but at ahigher level. Whereas acquisition functions assign utility to points in theinput space, a meta-criterion assigns utility to candidates suggested by itsmember acquisition functions. In the concluding chapter, we also propose aunifying view of these portfolio methods and the aforementioned BayesGapapproach.The meta-criteria considered by Hoffman et al. (2011) are based on ad-versarial bandit algorithms like Hedge and Exp3. These meta-criteria relyon the average performance of the member acquisition functions’ past se-lections to decide which to select. Instead, my work on the entropy searchportfolio (ESP) considers each candidate purely on how much information itis likely to provide towards locating the optimum. This work leveraged newideas from information guided search (Villemonteix et al., 2009; Hennig andSchuler, 2012). This was joint work with Ziyu Wang and Matthew W. Hoff-man (Shahriari et al., 2014) and we demonstrated the good performance ofESP across various problem domains. Interestingly, given the quality of theacquisition functions included in our portfolios we also observed surprisinglygood performance with a random portfolio, which supported our claim thatmixing strategies in portfolios is beneficial.Finally, another major contribution of this work is the evaluation ofan approximation to Thompson sampling for the continuous domain. Thisacquisition function, also known as randomized probability matching or pos-3 In a slight abuse of terminology, we will often use the terms acquisition function,strategy, or policy interchangeably. Similarly, selection strategy or policy also refer to thesame concept. However, these should not to be confused with the recommendation strategyor policy which outputs the optimization algorithm’s final estimate of the optimum. Theseconcepts will be defined more precisely in Chapter 3.9terior sampling, had thus far been constrained to the discrete multi-armedbandit setting, but recent work by my co-author extended this acquisitionfunction to the continuous domain. Our work in (Shahriari et al., 2014) wasthe first empirical evaluation of the method, which performed very well.1.3.3 Unconstrained Bayesian optimizationSetting search domain bounds is often described as one of the main difficul-ties that hinders broader use of Bayesian optimization as a standard frame-work for hyperparameter tuning. For example, this obstacle was raised morethan once at the NIPS 2014 Workshop on Bayesian optimization as one ofthe remaining practical hurdles in the way of more widespread adoption.While it is often possible to prescribe a reasonable range for each variablethat is being optimized, treating those ranges as hard constraints can haveadverse effects. Obviously, when the true global optimum of the objectivefunction is outside of those ranges, there is no way to recover from a, possiblyill-informed, initial range. Even when the optimum lies within the bound-ing box (within all specified ranges) many current acquisition functions willoften focus on the corners and walls of the bounding box. Corners and wallscorrespond to at least one and potentially many variables being in one ofthe two most extreme possible values. This undesired artifact becomes moreand more pronounced in higher dimensions.Our most recent work, detailed in Chapter 6, proposes an approachthat accommodates unconstrained search (Shahriari et al., 2016a). By pre-scribing a regularizing prior mean, the search is weakly focused around theregularizer centre but is otherwise unconstrained. Our proposed method ispractical and easy to add to any existing Bayesian optimization toolbox thatis based on Gaussian process priors over objective functions.1.3.4 Open-source softwareFinally, many of the acquisition functions and techniques used in this thesisare included in a python package that we have released on github, and stillactively develop. These repositories (reggie for Gaussian process models10and pybo for Bayesian optimization policies) include demos to facilitatebroad adoption among the community. The python package benchfunkincludes many synthetic benchmarking functions as well as python wrappersto facilitate running pybo on black-boxes that are not necessarily written inpython. In fact, our Interactive wrapper allows the black-box evaluationto be entered manually by, say, a lab technician, in chemical or biologicalexperiments for instance.1.4 Thesis outlineSince model-based optimization solutions are composed of two main mod-ules, namely modelling and decision-making, we will structure our review ofbackground material into two corresponding chapters. In particular, Chap-ter 2 provides necessary background on probabilistic models, both the ob-servation model, also known as the measurement or noise model, and thesurrogate model, which encapsulates our belief about the unknown objectivefunction f . On the other hand, Chapter 3 formally introduces the notion ofa selection policy and acquisition functions before reviewing some commonalgorithms used in practice.Chapter 4 presents our work on BayesGap, a Bayesian bandit algorithmfor pure exploration. The theorem guaranteeing exponentially vanishingprobability of error is stated in this chapter and the proof is relegated toAppendix A. Chapter 5 introduces the details of the entropy search portfolio(ESP) and the experiments we ran to evaluate its performance. Chapter 6describes our latest work on unconstrained Bayesian optimization, providingintuition and empirical validation. Finally, Chapter 7 features a discussion ofour contributions in the scope of the current state of Bayesian optimizationas well as possible research directions for the future.11Chapter 2ModellingThis chapter concerns fitting probabilistic models to observed data. Weseparate our models into two subcomponents: the observation model andthe surrogate model. The former is responsible for simulating the datageneration process, while the latter maintains our knowledge of the unknownunderlying objective function.Before we rigorously define any probabilistic model, we need a proba-bility space (Ω,F ,P), equipped with a set of outcomes Ω, a σ-algebra ofmeasurable events F , and a probability measure P : F 7→ [0, 1] associating aprobability to each event in F . See (Rosenthal, 2006) for a thorough reviewof basic probability theory. There exist two kinds of probability models,namely parametric and nonparametric distributions. The defining featureof a parametric distribution is that it is characterized by a fixed number ofparameters, as opposed to nonparametric models, which require storage thatincreases with the amount of data they are fit to. While we only considerparametric observation models in this work, we consider both parametricand nonparametric surrogate models in Sections 2.2 and 2.3, respectively.We begin this chapter with a section introducing a unifying probabilisticgenerative observation model for the generalized stochastic bandit, high-lighting some useful examples. Next, in Section 2.2 we detail a few ways toinvert these forward models to obtain a surrogate for the unknown objec-tive function f . Finally, in Section 2.3, we introduce Gaussian process (GP)12regression: the main surrogate model of interest in this thesis.2.1 Observation modelsSince their introduction nearly a century ago (Thompson, 1933), multi-armed bandits have been extended to continuous spaces, bridging the gapbetween the bandit literature and that of efficient global optimization ofstochastic black-box functions. We note the particularly relevant work ofSrinivas et al. (2010) on GP bandits and that of Bubeck et al. (2011), whocoined the term X -armed bandits.Due to this unifying view, we use a general notion of a stochastic bandit,which includes stochastic black-boxes with continuous input sets, such as GPbandits, and stochastic multi-armed bandits with discrete and finite inputsets. Our interest in stochastic bandits in this opening section is purelyas forward generative models responsible for producing the observed data.Therefore, we assume a fixed and knowable objective function f .Definition 1 (Stochastic bandits)Let X and Z denote input and latent sets, respectively. Let (Y,FY ) be ameasurable set of outputs, and let PY denote the space of valid probabilitymeasures on the σ-algebra FY . Finally, let θy ∈ Θy be a fixed parameter.Then the stochastic bandit (f, νθy) is characterized by. a fixed underlying objective function f : X 7→ Z and. a parametric observation model νθy : Z 7→ PY .Stochastic bandit feedback: When an agent selects an input x ∈ X ,the bandit performs the following:zx = f(x), (2.1)Yx ∼ νθy(zx), (2.2)where a realization of the random variable Yx : Ω 7→ Y is returned but zx iskept hidden from the agent, hence the “latent” qualifier.13When the input space X is discrete and finite, the bandit is commonlyknown as a multi-armed bandit, and the input space is often denoted A,referring to the available arms, instead of X . In what follows, a ∈ A refersspecifically to an arm of a multi-armed bandit, whereas x ∈ X refers togeneral inputs. In the multi-armed bandit setting, the idiom pulling an armtranslates to selecting an input.The term objective for f is deliberately general, encompassing both riskand reward functions. In the hyperparameter tuning setting for instance,we seek to minimize generalization error through noisy evaluations Y ofempirical test error on minibatches of available data (e.g., see Section 4.3.2).On the other hand, in reinforcement learning, an agent seeks to maximizeher observed rewards. One obvious and important difference between thebandit setting and reinforcement learning is that in Definition 1 selectionsmade by the agent do not change the input space or the reward function,whereas in reinforcement learning, picking an action can take the agentto a different state where the available actions and the underlying rewardfunction may differ from those of the previous time step.Note that selecting an input x in Definition 1, yields a realization ofthe single random variable Yx ∈ Y, as opposed to an observation of thefull process {Yx}x∈X . This type of interaction, with partial information,is called bandit feedback. The stochastic qualifier distinguishes this type ofbandit from adversarial bandits, where the rewards do not come from adistribution (known or unknown). This thesis will only concern stochasticbandit feedback, but the reader is referred to the very good compilation ofboth partial and full information results in the adversarial, also known asnon-stochastic, setting by Auer et al. (1995).Though Definition 1 allows for a large variety of observation models, inthis thesis we will only consider two types, binomial and Gaussian, whichwe detail next.Example 2 (Binomial bandits)In the binomial bandit, the possible outputs are Y = {0, . . . , n}, the latentset is Z = [0, 1], and n ∈ Θy = N is a fixed, known parameter. Selecting an14input x ∈ X results in an observationYx ∼ νn(zx) = Binomial(n, zx), (2.3)where zx = f(x).Note that, while both sets are embedded in the real line, strictly speak-ing, Y 6= Z, which motivates the need for distinct output and latent sets.The special case of a binomial bandit with n = 1 is called the Bernoullibandit.As we will see in Example 4, in the multi-armed binomial bandit, if weassume each za is sampled from an independent beta distribution for alla ∈ A, the conjugacy of the beta prior and binomial observations make thisforward model particularly amenable to analytic Bayesian inversion.In the binomial bandit, all Ya are finite integer-valued random variables,bounded above by the fixed number of trials n in the forward model νn.These are useful black-boxes for modelling such observations as drug successrates in clinical drug trials, click-through rates in split-testing experiments,correct classifications in hyperparameter tuning experiments, etc.The stochastic bandit we consider next is a forward model that producereal-valued observations, which are better suited for modelling such obser-vations as empirical test error for tuning regression algorithms, temperaturemeasurements, and many more. Example 3 details how Gaussian banditsfit into the stochastic bandit formalism.Example 3 (Gaussian bandits)In the Gaussian bandit, the output and latent space are both the same com-pact subset Y = Z ⊆ R, the observation model is additionally parameterizedby the signal noise variance η2 ∈ Θy = (0,∞). Selecting an input x ∈ Xresults in an observationYx ∼ νη2(zx) = N (zx, η2), (2.4)where zx = f(x).15f xtyt θyt ∈ Tθff xtyt θyt ∈ TFigure 2.1: Probabilistic graphical models of bandit feedback in the stochas-tic (left) and the Bayesian (right) settings. Diamond shaped nodes indicateagent choices, while square and round nodes represent deterministic andrandom variables, respectively.In most cases of interest, the objective function is unknown and onenatural way to model our uncertainty about the function f is by treating it asrandom. This type of randomness is sometimes called epistemic, contrastingit from the uncertainty in the observations Y , for example, which is aleatoric.Roughly speaking, the former refers to uncertainty in a quantity that isknowable, whereas the latter describes uncertainty that is inherent. Forinstance, we will often assume that there exists a true objective function fwhich an oracle could reveal to us. On the other hand, the observations Yare inherently random due to emission or measurement noise. Of course,the line becomes blurred when we ask whether the source of noise is itselfknowable, which is why we will not venture any deeper into the philosophicaldifferences between the two types of uncertainty. Indeed, from a probability-theoretic perspective, the two types of randomness are equivalent.Therefore in true Bayesian fashion, for the rest of the thesis, we assumethe objective function is a realization of a random function F : Ω 7→ H,where H ⊂ {u : X 7→ Z} is some function space. We then prescribe anabstract prior distribution Π over F . It is abstract because, as we will see inthe next section, it is only defined via the random process {Zx : Ω 7→ Z}x∈Xat arbitrary finite collections of points such that Zx ≡ F (·)(x), for all x ∈ X .16Note that since we require pointwise evalutation of realization of F , thenthe function space must at least satisfy H ⊆ L∞(X ). Moreover, in order todefine a Z-valued random process, we need the measurable set (Z,FZ).Definition 2 (Bayesian bandits)Let X , Z, Θy, and (Y,FY ) be as in Definition 1. Let (H,FH) be a measur-able space of functions and let θf ∈ Θf be an additional fixed parameter.Then the Bayesian bandit (Πθf , νθy) is characterized by. a distribution over objective functions Πθf : FH 7→ [0, 1] and. a parametric observation model νθy , as in Definition 1.Bayesian bandit feedback: When the Bayesian bandit is initialized, anobjective function is sampledf ∼ Πθf , (2.5)and then, whenever an agent selects an input x ∈ X , the bandit proceeds asin Definition 1:zx = f(x), (2.6)Yx ∼ νθy(zx), (2.7)where once again Yx is observed and zx is kept hidden.In principle, we could—and eventually will—similarly treat the parame-ters θ := (θf , θy) as unknown and model them as epistemic random variables.However, for the discussion in this and the next chapter, θ is assumed to befixed and known for a more simple exposition. Nevertheless, Appendix Eoutlines several ways to estimate or learn this additional parameter fromobservations via maximum likelihood, empirical Bayes, or Monte Carlo ap-proaches.172.2 Bayesian inversion with parametric modelsThe next two sections describe how to draw useful conclusions about theobjective function f , given a dataset of observations D := {(xs, ys)}s≤t,with (xs, ys) ∈ X × Y. Depending on the observation model, there are var-ious Markov-like inequalities in the frequentist arsenal that can be used tobound the objective function f(x). While these inequalities often requireno additional assumptions other than iid outcomes, extracting meaningfulbounds requires many data points. When the number of evaluations is nota limiting factor, techniques such as Hoeffding (Maron and Moore, 1993) orempirical Bernstein (Mnih et al., 2008) races, or the more recent optimisticoptimization techniques (Auer, 2003) are all applicable. Structural assump-tions can help more efficiently propagate information from the data; see forinstance the works of Dani et al. (2008); Abbasi-Yadkori et al. (2011) forfrequentist approaches in the linear bandit setting, or the works of Munos(2011); Valko et al. (2013); Bubeck et al. (2011), which make various weaksmoothness assumptions.Alternatively, when data is difficult or expensive to acquire and onewishes to avoid unnecessarily repeating queries, the Bayesian frameworkcan be a very attractive alternative. At the cost of making mild a prioriassumptions on f , stronger conclusions can be drawn from relatively lit-tle data, including uncertainty quantification that is consistent with boththe observed data and the prior assumptions. In this section we examinesome parametric models and defer a discussion of Gaussian processes to thefollowing section.Let us first consider the discrete K-armed Bayesian bandit (Π0, ν). Inthese cases, the random function F and the process Z := {Za}a∈A coin-cide and are simply multivariate random variables. Our inductive biases areprescribed in the form of a prior distribution Π0, the density of which wedenote Π0(z) in a slight abuse of notation. Then, for fixed D = {(as, ys)}s≤t,Bayesian inversion is a direct application of Bayes’ rule, noting the indepen-18θff xtyt θyt ∈ TFigure 2.2: Probabilistic graphical model of the Bayesian approach to datafitting. Shaded round nodes are observed random variables and shadeddiamond nodes are the corresponding choices x.dence of all each Ya given a and za:Πt(z) ∝ Π0(z)t∏s=1ν(ys; zas), (2.8)where we omit the parameters θ for simplicity. The updated belief about Zgiven data D, on the left hand side of Equation (2.8), is called the posteriordistribution. The two factors on the right hand side of Equation (2.8) denotethe prior density and the likelihood, respectively. When the correspondingtwo distributions form a conjugate pair, the posterior is in the same familyas the prior distribution, and is readily derived analytically, see Example 4.Example 4 (Beta-binomial multi-armed bandit inversion)Consider the K-armed binomial bandit with n trials defined in Example 2,replacing X with a finite and discrete set of K arms A. The objectivefunction is multivariate a realization z ∈ [0, 1]K from some prior belief dis-tribution.Prescribe K independent beta prior distributions Za ∼ Π1,10 = Beta(1, 1)for all a ∈ A, i.e. θf = (1, 1). As promised in Example 2, the conjugacyof the beta and binomial distributions yields an analytic expression withefficient computation. In particular, after observing D = {(as, ys)}ts=1 the19product of the densities yields:Π1,1t (z) ∝K∏a=1Beta(za; 1, 1)t∏s=1Binomial(ys;n, zas) (2.9)∝K∏a=11t∏s=1zysas(1− zas)n−ys (2.10)=K∏a=1znaa (1− za)nt−na , (2.11)which is clearly another beta distribution and can readily be normalized toobtain the posterior distribution for each a ∈ A:Za | D ∼ Beta(nt+ na, 1 + nt− na), (2.12)where na :=∑s≤tas=ays is the total number of successes observed for arm a.Similarly, the Gaussian bandit can be inverted analytically, by prescrib-ing a K-variate normal prior Π0(z) = N (0, I). In these two simple cases, theindex set A is a subset of N, merely numbering the arms from 1 to K = |A|.However, the formulation in Definition 1 readily accommodates more com-plex index sets of the form A = {xa ∈ Rd}Ka=1. Indeed, when structuredfeature vectors xa are available for each arm, they allow us to define morecomplex forward models. Perhaps the simplest and most common way toexploit available feature vectors is through a linear model. These modelsassume the following structure:f(xa) = wTxa, ∀xa ∈ A. (2.13)Such a structure identifies each objective function with a correspondingweight vector w ∈ Rd. Once again, prescribing a prior distribution forf is as simple as fixing a multi-variate distribution on the measurable set(Rd,B(Rd)) to which w belongs. In this case, when it clear from the context,we denote the prior over functions with Π0(w).20Notice that the vector w may couple the reward distribution of all arms.Indeed, if we place a prior on the parameter w, observing a realization ofYa changes our posterior on w, which can in turn change the distributionof Ya′ | Ya as long as xTaxa′ 6= 0. However, if d = K and the featurevectors are simply the orthogonal axis-aligned unit vectors A = {eˆa}Ka=1,then w ≡ z and we recover the independent K-armed bandit, as in theprevious examples in this chapter. Meanwhile, in the following examples wehighlight two commonly studied bandits where such additional structure inthe index set can be exploited.Example 5 (linear-Gaussian multi-armed bandit inversion)Given the K-armed Gaussian bandit in Example 3 with d-dimensional fea-ture vectors xa, ∀a ∈ A, and assume the linear structure in Equation (2.13).Prescribe the d-variate normal prior distribution Πw0,Σ00 = N (w0,Σ0).Once again, from Bayes’ rule it follows that, given observations {(xas , ys)}ts=1,Πw0,Σ00 (w)t∏s=1νη2(ys; wTxas) = N (w; w0,Σ0)t∏s=1N (ys; wTxas , η2)(2.14)which is proportional to another multi-variate normal density with meanand covariance matrix:wt = Σt(Σ−10 w0 + η−2XTy) (2.15)Σt =(Σ0 + η−2XTX)−1, (2.16)where the matrix X and vector y are the concatenated inputs and outputs,respectively, i.e. the sth row of X is xTas and [y]s = ys.Recall that one can learn the signal noise parameter θy = η2 via aBayesian treatment. In particular, in this case prescribing an inverse-gammaprior distribution on η2 yields an analytic posterior distribution for the jointparameters (w, η2). Alternatively, the unknown noise variance η2 can bemarginalized to obtain a posterior distribution for w alone which accountsfor the uncertainty in θy. Note, however, that this distribution will no longer21be normal.While the linear-Gaussian bandit is useful in settings where observa-tions are real valued, we need a slightly different model to deal with countobservations such as performance of classifiers, e.g., number of correct clas-sifications. The generalized linear model (GLM) extends linear models tosuch a setting and corresponds to the following structural assumption:f(xa) = g−1(wTxa), (2.17)where g : [0, 1] 7→ R is called the link function, and is usually chosen fordesired analytic or computational convenience and held fixed. Then for fixedlink function g, we once again have a correspondence between w and f , sothat in the following example, we may prescribe a prior Π0 on w insteadof f .The use of a feature vector xa—if it is available—allows us to exploitpossible structure, which cannot be done, for example, in the beta-binomialmodel because it assumes all arms are independent.Example 6 (GLM-binomial bandits)Consider the multi-armed binomial bandit of Example 2 with d-dimensionalfeatures for arms A = {xa ∈ Rd}Ka=1, and assume a GLM structure for f .Prescribe the d-variate normal prior distribution Πw0,Σ00 = N (w0,Σ0),then from Bayes’ rule it follows that, given observations {(xas , ys)}ts=1,Πw0,Σ0t (w) ∝ Πw0,Σ00 (w)t∏s=1νn(ys; g−1(wTxas))(2.18)= N (w; w0,Σ0)t∏s=1Binomial(ys;n, zas) (2.19)∝ exp(−12(w −w0)TΣ0(w −w0))× (2.20)K∏a=1g−1(wTxa)na(1− g−1(wTxa))nt−na, (2.21)22where na =∑s≤tas=ays is the total number of successes observed for arm a.Unlike in the beta-binomial case, this expression is not a known dis-tribution and cannot be tractably normalized. Therefore we must resortto approximate inference methods. Markov chain Monte Carlo (MCMC)methods (Andrieu et al., 2003) approximate the posterior with a sequenceof samples that, in the limit of large samples, converge to samples from theexact posterior; this is the approach taken in (Scott, 2010) with a probit linkfunction. In contrast, the Laplace approximation fits a Gaussian distribu-tion to the posterior by matching the curvature of the posterior distributionat the mode. For example in (Chapelle and Li, 2011), Bayesian logistic re-gression with a Laplace approximation was used to model click-throughs forthe recommendation of news articles in a live experiment. Moment match-ing, expectation propagation (EP), or other variational approaches can alsobe used here.2.3 Gaussian process priorsRather than inducing a prior on f by assuming some parametric form fwand prescribing a prior on the parameters w as in Examples 6 and 5, we canprescribe a prior on f directly. One simple way to define a distribution overobjective functions is through a stochastic process {Zx}x∈X . In particular,in this thesis we will use the Gaussian process (GP) exclusively. Though theexact GP is a nonparametric model, we discuss several scalable parametricapproximations in Appendix E.After introducing the Gaussian process, we review the GP-bandit andoutline the derivation of exact inference under a Gaussian observation modelν. We then review the role of the GP covariance and mean functions, aswell as that of the marginal likelihood function used in calibration. Finally,we discuss computational costs.The Gaussian process GP(µ0, k0) is fully characterized by its mean func-tion µ0 : X 7→ R and its symmetric positive kernel, or covariance function,k0 : X × X 7→ R+. See (Rasmussen and Williams, 2006) for an extensive23treatment. Consider an arbitrary finite collection of t input points {xs ∈X}s≤t. In Gaussian process regression, the random vector Z := {Zxs}s≤t isnormally distributed as followsZ ∼ N (m,K) (2.22)where the elements of the mean vector and covariance matrix are defined asmi := µ0(xi) and Ki,j := k0(xi,xj), respectively. Let us ground the rest ofour discussion in the following concrete example.Example 7 (Gaussian process bandits with Gaussian observation models)The Gaussian process bandit with Gaussian observation model is the Bayesianbandit (Π0, ν), whereΠ0 = GP(µ0, k0), (2.23)ν = N (· , η2). (2.24)i.e. the function f is assumed to be a sample from GP(µ0, k0).Although we do not explicitly indicate it, the functions µ0 and k0 areusually parameterized by a θf ∈ Θf . Similarly, we omit the dependence ofν on η2 ∈ Θy. See Appendix E for various ways estimate or learn the jointθ = (θf , θy).When an agent selects a set of inputs {xs}s≤t, the following forwardmodel generates the dataZ ∼ N (m,K), (2.25)Y ∼ N (Z, η2I). (2.26)Now consider the realizations y = {ys}s≤t, which combine with the cor-responding inputs to form the set of t observations D = {(xs, ys)}ts=1 and anarbitrary test point x with objective value z = f(x). According to the Gaus-sian process assumption, the latent variables z and z are jointly Gaussian;by the convolution of two Gaussian, the assumption of Gaussian observa-tions leads to y and z being jointly Gaussian as well; and finally, it follows24from basic Gaussian conditioning that z | y is also normally distributed,with mean and variance depending on the test input x in the following wayµt(x) = µ0(x) + k(x)T(K + η2I)−1(y −m) , and (2.27)σ2t (x) = kt(x,x) , where (2.28)kt(x,x′) = k0(x,x′)− k(x)T(K + η2I)−1k(x′) , (2.29)where [k(x)]i := k0(xi,x). Repeating this exercise with any finite set oftest points in X leads to a jointly Gaussian distribution with mean vectorspecified by the function µt and covariance matrix computed from the kernelkt. Since this satisfies the definition of the Gaussian process, we can assertthat the posterior distribution over the objective function {Zx}x∈X | D isalso a GP{Zx}x∈X | D ∼ GP(µt, kt). (2.30)Formally we can say that Gaussian processes are closed under conditioningon Gaussian observations. This is an extremely useful feature of the GP,especially in Bayesian optimization, because it allows us to sequentially con-dition on more and more data while being guaranteed a Gaussian processposterior surrogate model of the objective function.The posterior mean and variance evaluated at any point x represent themodel’s prediction and uncertainty, respectively, in the objective function atthe point x. These posterior functions are key ingredients in selecting thenext query point xt+1 as detailed in Chapter 3.Remark 1 (On differentiability)Both of the posterior functions can be evaluated at arbitrary test pointsx; moreover, as long as the prior mean and kernel are differentiable, thefollowing gradients can be computed:∇xµt(x) = ∇xµ0(x) +∇xk(x)T(K + η2I)−1(y −m), and (2.31)∇xσ2t (x) = 2∇xk0(x,x)− 2∇xk(x)T(K + η2I)−1k(x). (2.32)These gradients are useful when maximizing certain acquisition functions.25See Chapter 3 for a review of acquisition functions.2.3.1 Covariance kernelsIn Gaussian process regression, the covariance function k0 dictates the struc-ture of the objective functions we can fit. For instance, if we expect our ob-jective function to have some periodic structure, we can prescribe a periodickernel; or if we expect our objective function to be smooth with long rangeinteractions, we can prescribe a heavy tailed highly differentiable radial ba-sis function. While kernels need not be stationary, in this thesis we focus onstationary kernels, which only depend on the difference of their arguments.Mate´rn kernels are a very flexible family of stationary kernels, parame-terized by a smoothness parameter κ > 0, so called because samples froma GP with such a kernel are differentiable bκ − 1c times (Rasmussen andWilliams, 2006). The exponential kernel is a special case of the Mate´rn ker-nel with κ = 12 ; and the squared exponential kernel is the limiting kernelwhen κ → ∞. The following are the most commonly used Mate´rn kernels,labelled by their smoothness parameter.k12 (x,x′) = σ2 exp(−r) (2.33)k32 (x,x′) = σ2 exp(−√3r)(1 +√3r) (2.34)k52 (x,x′) = σ2 exp(−√5r)(1 +√5r + 53r2) (2.35)k∞(x,x′) = σ2 exp(−12r2), (2.36)where r2 = (x − x′)TΛ−1(x − x′) and Λ is a diagonal matrix of d squaredlength scale `2i . This family of covariance functions are therefore parame-terized by an amplitude σ2 and d length scale hyperparameters, which willcontribute to the set of parameters θf . Covariance functions with learnablelength scale parameters are also known as automatic relevance determina-tion (ARD) kernels. Figure 2.3 provides a visualization of the kernel profiles,samples from the corresponding priors and posteriors.26Kernel profile Samples from prior Samples from posteriorMATE´RN1MATE´RN3MATE´RN5SQ-EXPFigure 2.3: Gaussian process visualization. Left: Visualization of variouskernel profiles. The horizontal axis represents the distance r > 0. Middle:Samples from GP priors with the corresponding kernels. Right: Samplesfrom GP posteriors given two data points (black dots). Note the sharperdrop in the Mate´rn1 kernel leads to rough features in the associated samples,while samples from a GP with the Mate´rn3 and 5 kernels are increasinglysmooth.2.3.2 Prior mean functionsWhile the prior kernel function controls the smoothness and amplitude ofallowed reward functions, the prior mean provides a way to encode possibleknown, deterministic behaviour. In practice, this function is set to a con-stant µ0(x) ≡ µ0 and inferred from data using learning techniques coveredin Appendix E. Unless otherwise specified, in what follows we assume a con-stant prior mean function for convenience.1 However, the prior mean func-tion is a principled way to incorporating expert knowledge of likely objectivefunctions or known deterministic behaviour, if they are available. Much ofthe following analysis can be readily extended to non-constant functions µ0.In recent work, the prior mean has been used to steer the search clearfrom the boundary. In this case, the prior mean acts as a penalty termwhich favours points within the boundary as opposed to the edges. Thisproblem becomes more pronounced in high dimensions where most of thevolume of the search space is in the volume along its boundary. Meanwhile,in our own work, described in detail in Chapter 6, we used the prior meanto remove the dependence on a boundary altogether.1There may also be more compelling arguments for a simple constant mean (see, e.g.,Chen et al., 2016).272.3.3 Marginal likelihoodAnother attractive property of the Gaussian process model is that it pro-vides an analytical expression for the marginal likelihood of the data, wheremarginal refers to the fact that the unknown latent function f is marginal-ized out. The expression for the log marginal likelihood is simply givenby:log p(y | {xs}s≤t, θ) =− 12(y − µ0)T(Kσ2,` + η2I)−1(y − µ0)− 12log |Kσ2,` + η2I| − t2log(2pi), (2.37)where θ := (σ, `, µ0, η2); and the dependence on θ is finally made explicitby, for example, adding a superscript to the covariance matrix Kσ2,`t . Themarginal likelihood is very useful in learning the hyperparameters, as wewill see in Appendix E. The right hand side of Equation 2.37 can be brokeninto three terms: the first term quantifies how well the model fits the data,which is simply a Mahalanobis distance between the model predictions andthe data; the second term quantifies the model complexity, smoother covari-ance matrices will have smaller determinants and therefore lower complexitypenalties; finally, the last term is simply a linear function of the number ofdata points t, indicating that the likelihood of data tends to decrease withlarger datasets unless the first term compensates accordingly.Conveniently, as long as the kernel is differentiable with respect to its hy-perparameters (not shown in the equation), the marginal likelihood can bedifferentiated and can therefore be optimized via gradient ascent to obtainan estimate, the so-called type II maximum likelihood (MLII) or empiricalBayes estimate. When data is scarce this can overfit the available data. InAppendix E we will review various practical strategies for learning hyper-parameters which all use the marginal likelihood.2.3.4 Computational costAlthough we have analytic expressions, exact inference in Gaussian processregression is O(t3) where t is the number of observations. This cost is due28to the inversion of the covariance matrix. In principle, the Cholesky de-composition can be computed once, saved and reused, and cheaply updatedvia rank-one updating techniques. In practice however the Cholesky de-composition must be recomputed every time the kernel hyperparameters θare updated, which, depending on the learning approach and/or the learn-ing schedule, happens often enough that the cubic cost is the bottleneck ofthe Bayesian optimization approach—excluding the cost of evaluating theblack-box. Appendix D in the appendix reviews several ideas which proposeto reduce this computational burden.29Chapter 3Decision makingIn the preceding chapter, we described some probabilistic models that can beused to represent our belief about the unknown objective function f givena set of t observations Dt. In this chapter, we focus on the mechanismsfor selecting the sequence of query points {xs}ts=1 and the final recommen-dation xˆt. Naturally, selecting these points arbitrarily could be wasteful,considering the sophisticated probabilistic models developed in Chapter 2.These surrogate models can be cheaply evaluated in lieu of the objectivefunction f , and their uncertainty quantification can be effectively leveragedto reason about where to query the function next in some optimal sense.We refer to such mechanisms as policies or, equivalently, strategies, whichwe use interchangeably. We distinguish two types of policies: the selectionpolicy, used to sample the query point xt at iteration t in the optimizationprocedure; and the recommendation strategy used to return a point xˆt es-timating the location of the optimizer if the procedure were stopped afteriteration t.We begin this chapter by formally defining policies and regret in thefollowing section; then minimizing the latter yields the Bayes-optimal policydiscussed in the subsequent section. The computationally intractable natureof the Bayes-optimal strategy will lead the discussion towards tractable,myopic, sub-optimal alternatives that are commonly used in model-basedoptimization.30x1 . . . xn−1 xn xˆnD1 . . . Dn−1 Dny1 . . . yn−1 yn SnfFigure 3.1: Probabilistic graphical model describing the decision problem.All nodes represent random quantities: shaded nodes are observed variables,diamond shaped choice nodes represent (possibly random) decision variables.Hyperparameters θ are omitted for reduced clutter but they are still present,as in Figure 2.1.3.1 Definitions and notationUntil the current chapter, the inputs {xs} were considered to be fixed, how-ever in this chapter we will allow them to be random variables by virtue ofpossible dependence on previous observations but also from a possible ex-ternal source of randomness—the simplest example being an -greedy typealgorithm. Therefore, for the purpose of this chapter, we will capitalize theselection variables Xs : Ω 7→ X and denote realizations xs as before; andsimilarly for the random recommendations Xˆs : Ω 7→ X and their realiza-tions xˆs. Finally, we will denote pis and ρs the distribution of Xs and Xˆs,respectively, and refer to them as the selection and recommendation poli-cies. In a slight abuse of notation, as it is usually clear from the context, wewill use pis to refer to both the distribution and its density, if it exists, withrespect to some base measure.At each iteration t, the purpose of a policy pit is to somehow producea query point xt ∈ X , only using the information Dt−1 that is available.Therefore, we define a policy as a sequence of distributions over the measur-31able input space (X ,FX), which can be sampled to obtain an input x. Notethat the following formulation can also accommodate deterministic selectionstrategies via the Dirac δ measure. Note however, that even when a policyis a deterministic map from previous observations to a point in X , it inheritsthe randomness in the observations {Yxs}. Whenever it is obvious from thecontext, we will denote Ys = Yxs for all s. Figure 3.1 depicts the prob-abilistic graphical model that describes the decision problem in Bayesianoptimization.Definition 3 (Valid policies)Given a sequence of growing datasets {Ds}s≥1, such that Ds−1 ⊂ Ds, onecan generate the following filtration {Fs := σ(Ds)}s≥1. Then. a valid selection policy pi := {pit} is a sequence of probability measuressuch that if Xt ∼ pit, Xt ⊥ f | Ft−1; and. a valid recommendation policy ρ := {ρt} is a sequence of probabilitymeasures such that if Xˆt ∼ ρt, Xˆt ⊥ f | Ft.Intuitively, the independence constraint excludes all policies whose choicesdepend on the objective function other than through available data. No-tice the recommendation strategy has access to one additional observation.Strictly speaking, a policy includes both a selection policy pi and a recom-mendation strategy ρ, but we will often drop the qualifiers when it is clearwhich type of policy we are referring to—either from the context or notation.Before reasoning about the optimality, or even the performance, of policies,let us define the following useful operator.Definition 4 (Gap operator)Let u ∈ H be an objective function in a function space H, e.g. a reproducingkernel Hilbert space. The gap operator ∆ : H 7→ H is a non-linear operatorsuch that the gap quantity associated with u is defined pointwise as:∆u(x) := maxXu− u(x). (3.1)32If f is the true objective function, the function ∆f measures the sub-optimality of a point x. In particular, ∆f(x?) = 0 and ∆f(x) > 0,∀x 6∈arg maxX f . In a slight abuse of terminology, when it is clear from thecontext, the function ∆f is simply called the gap.This operator allows us to express the following two notions of regret.Whereas the first evaluates the quality of each selection online, the secondonly evaluates selections insofar as they help locate a good final recommen-dation; therefore more experimental selections are not necessarily penalized,in particular, not if they provide valuable information. We will distinguishthe tasks of minimizing the former and the latter as online and offline op-timization, respectively.Definition 5 (Regret)Given an arbitrary sequence of selections {xs}ts=1, the cumulative regretRt is the accumulation of the sub-optimality of all selections:Rt :=t∑s=1γs−1∆f(xs), γ ∈ (0, 1]. (3.2)When the discount factor γ < 1, Rt is said to be discounted. Note that∆f (xs) is sometimes called the instantaneous regret at iteration s, de-noted rs.Finally, given an arbitrary recommendation xˆt, the simple regret Stquantifies the sub-optimality of a recommendation xˆt:St := ∆f(xˆt). (3.3)Intuitively, when the cumulative regret is asymptotically constant, wehave reached the global optimum, and similarly, when simple regret asymp-totically vanishes. Lai and Robbins (1985) proved that no policy can achievebetter than logarithmic cumulative regret—logarithmic in the total selectionbudget T . On the other hand, Bubeck et al. (2009) prove that achievingthe optimal cumulative regret rate prevents a policy from achieving expo-nentially vanishing simple regret.33Let us focus on the task of offline optimization—sometimes called thepure exploration setting—and briefly discuss Bayes-optimal recommendationand selection strategies.Recommendation Recall that we wish to maximize the objective func-tion f , which is unknown and modelled as a Gaussian process. Once the finalfunction evaluation budget T is reached, the only quantity we can minimizeis the final expected simple regret yielding the following problem:minimize E[∆f(xˆ) | FT ] (3.4)s.t. xˆ ∈ XˆT ,where we allow the set of allowable recommendations XˆT ⊆ X to be a datadependent subset of the entire input set X . While it is perfectly reasonableto set Xˆt = X for all t ∈ N, it is very common to restrict the set of allowablerecommendations to the observed queries Xˆt = {xs}ts=1 for all t ∈ N.We can obtain a straightforward optimal recommendation strategy byexpanding Equation (3.5)E[∆f(xˆ) | FT ] = E[maxX f − f(xˆ) | FT ] (3.5)= E[maxXf | FT ]− E[f(xˆ) | FT ] (3.6)and noticing that the first term does not depend on the recommendation xˆ.Hence, the optimal recommendation strategy isxˆT ∈ arg maxxˆ∈XˆTE[f(xˆ) | FT ] = arg maxxˆ∈XˆTµT (xˆ), (3.7)with ties broken arbitrarily. Note once again that xT is a random variablerealization by virtue of its dependence on FT . This recommendation yieldsthe following expression for the simple regret:ST = E[maxX f | FT ]−maxXˆTµT . (3.8)34Selection Given a finite budget of T function evaluations, naturally theultimate simple regret ST is the loss that should guide the exploration.In this Bayesian setting, at any iteration t, we can express our currentexpectation of the ultimate loss of a given selection policy pi via the followingnested conditional expectations using the tower propertyE [ST | Ft−1] = E [E [ST | Ft] | Ft−1] . (3.9)This recursive formulation of the expected ultimate reward allows us to ex-press the deterministic Bayes-optimal selection policy pi? := {pi?t := δxt}Tt=1recursively asxt+1 ∈ arg minx∈XE [Epi? [ST | Dt ∪ (x, Yx)] | Ft] , (3.10)where ties are broken arbitrarily. The subscript pi? appended to the innerexpectation emphasizes that Xs ∼ pi?s for all s > t.Problem (3.10) is severely intractable, both analytically and computa-tionally, in general. In fact, there is currently only one setting with a knowncomputationally tractable optimal sequence due to Gittins (1979):1 the in-dependent Bernoulli bandit with finite query budget (or infinite budget withdiscounted rewards). This approach relies on casting the problem as Markovdecision process and solving it using dynamic programming. If M denotesthe number of possible rewards (e.g., M = 2 for Bernoulli arms), the com-putational complexity of this method is O(TM2). Therefore, it is difficultto generalize this approach to more complicated, for example continuous,reward distributions.Consider instead, a strategy that is optimal for the single step problem,i.e. the optimal strategy for T = t + 1, then once again using the towerpropertyxt+1 ∈ arg minx∈XE[E[St+1 | Dt ∪(x, Yx)] | Ft] (3.11)1See also (Gittins et al., 2011).35= arg maxx∈XE[maxXˆt∪{x}µt+1 − E[maxXf | Dt ∪(x, Yx)] | Ft] (3.12)where the outer expectation marginalizes the observation Yx, then= arg maxx∈XE[maxXˆt∪{x}µt+1 | Ft]− E[maxXf | Ft](3.13)= arg maxx∈XE[maxXˆt∪{x}µt+1 | Ft](3.14)where in the final step, the second term drops out because it does not de-pend on the selection x and we recover a selection policy known as theKnowledge Gradient (KG; Frazier et al., 2009), often mistaken for expectedimprovement—a subtly different policy we discuss in the next section—dueto their striking similarity. In order to understand how KG obtained itsname, we may add the constant St−1 from Equation (3.13), note that thisdoes not change the selection, as St is independent of xt+1 since it dependson xˆt, which must be Ft-measurable. Then, several algebraic manipulationslead toxt+1 ∈ arg maxx∈XE[maxXˆt∪{x}µt+1 | Ft]− E[maxXf | Ft]+ E[maxXf | Ft]−maxXˆtµt (3.15)= arg maxx∈XE[maxXˆt∪{x}µt+1 | Ft]−maxXˆtµt. (3.16)The first term in (3.16) is simply the expected value of the function at therecommended point after observing the function at the point x, to which wesubtract the very same estimate before observation (x, Yx). Intuitively then,KG selects the point xt that is expected to most increases the maximum ofour prediction of the objective function f after the observation (xt, Yxt).Finally, notice that the knowledge gradient policy is a one-step look-ahead Bayesian strategy as it reasons about the expected effect of the next36observation (x, Yx) before selecting a point. Similarly, many selection strate-gies can be formulated in a utilitarian framework in which the next pointxt+1 maximizes some expected utility, where the expectation is taken withrespect to the posterior surrogate model f | Ft, and possiply future obser-vations Yt+1, . . . , Yt+q in the case of a q-step look-ahead strategy.2Consider once again a single-step look-ahead policy such as KG, anddefine a utility function U : X ×H× Y 7→ R which maps an arbitrary querypoint x and its corresponding outcome, modelled by Yx, to a measure ofquality of the experiment for a given function f ∈ H, e.g. , how much thequery would improve our posterior estimate of the optimum in KG, or howmuch information this query will provide towards some goal as in (Lindley,1956). Note that U is a random variable due to the unknown function fand unobserved outcome Yx. Given some data accumulated thus far, we canmarginalize both f and Yx to obtain the expected utility of a query point x:αt+1(x) = E [U (x, f, Yx) | Ft] , (3.17)and the selection policy is simply pit = δxt , where xt+1 ∈ arg maxX αt+1 andties are broken arbitrarily. With the exception of predictive entropy searchin Section 3.4, we do not consider any look-ahead expected utilities and sothe expectations will only be taken over the posterior f | Ft only. Therefore,the utility U will often only take arguments in X ×HWhereas in experimental design and decision theory, the function α iscalled the expected utility or criterion, in Bayesian optimization it is of-ten called the acquisition or infill function. These acquisition functions arecarefully designed to trade off exploration of the search space and exploita-tion of current promising areas. In the next sections, we present traditionalimprovement-based and optimistic acquisition functions, followed by morerecent information-based approaches. We conclude the chapter with a dis-cussion of three recommendation strategies.2 For simplicity, in this section we ignore the hyperparameter dependence and we defera discussion of marginalizing hyperparameters to Section E.373.2 Improvement-based policiesImprovement-based acquisition functions favour points that are likely to im-prove upon a target τ , the policies’ only hyperparameter. An early strategyin the literature, due to (Kushner, 1964), is commonly known as (maximum)probability of improvement (PI or MPI). As its name suggests, this acquisi-tion function measures the probability, according to the model, that a pointx leads to an improvement upon τ . Since the posterior distribution of f(x)is Gaussian, we can analytically compute this probability as follows:αPIt+1(x) := P [f(x) > τ | Ft] (3.18)Recall that αPIt+1 is then maximized to select the next query point. Forthis criterion, the utility function is simply an indicator of improvementUPI(x, f) := I[f(x) > τ ]. From this insight it is clear that all improvementsare treated equal and PI simply accumulates the posterior probability massabove (or below, when minimizing) τ at x.Although probability of improvement can perform very well when thetarget is known, in general the heuristic used for an unknown target τ causesPI to exploit quite aggressively (Jones, 2001).One could instead measure the expected improvement (EI) which incor-porates the amount of improvement. This new criterion, due to Mocˇkus etal. (Mocˇkus et al., 1978), corresponds to a utility called the improvementfunction, usually denoted I, and is defined as followsUEI(x, f) = I(x, f(x)) := (f(x)− τ)I[f(x) > τ ]. (3.19)Note that I > 0 only if there is an improvement. These improvement strate-gies have been empirically studied in the literature (Jones et al., 1998; Jones,2001) and recently convergence rates have been proven for EI (Bull, 2011).Remark 2 (Analytic expressions with GP surrogates)When using a Gaussian process surrogate model, both EI and PI yield the38following analytic expressionsαPIt+1(x) = Φ(µt(x)− τσt(x))and (3.20)αEIt+1(x) = (µt(x)− τ)Φ(µt(x)− τσt(x))+ σt(x)φ(µt(x)− τσt(x)), (3.21)when σt > 0 and vanish otherwise. Here, Φ and φ denote the standardnormal cumulative distribution function and probability density function,respectively. Analytic expressions for the derivatives of these acquisitionfunctions are also readily derived in the appendix (Appendix C), providedthe covariance kernel is differentiable. The analytic expression above allowsvery efficient evaluation of the acquisition function and the analytic gradi-ents facilitate using multi-restarted quasi-Newton methods for more efficientmaximization.Remark 3 (On the choice of target τ)Finally, although the target objective value (i.e., the best achievable objec-tive value) is often unknown, in practice τ is adaptively set to either thebest observed value y+ := maxi=1:t yi or alternatively the current incum-bent µ+t := maxi=1:t µt(xi). Whereas for PI this heuristic can lead to anoverly greedy optimization (Jones, 2001), it works reasonably with EI inpractice (Snoek et al., 2012). For additional exploration, an additive (orrelative) minimum improvement parameter can be added (or multiplied) toy+ to avoid excessively local selections.3.3 Optimistic policiesDating back to the seminal work of Lai and Robbins (1985) on the multi-armed bandit problem, the upper confidence bound criterion has been apopular way of negotiating exploration and exploitation, often with prov-able cumulative regret bounds. The guiding principle behind this class ofstrategies is to be optimistic in the face of uncertainty. Indeed, using the up-per confidence for every query point x corresponds to effectively using a fixedprobability best case scenario according to the model. Originally, the up-39per confidence was given by frequentist Chernoff–Hoeffding bounds (Auer,2003), and more sophisticated bounds in the linear bandit setting (Daniet al., 2008; Abbasi-Yadkori et al., 2011).More recently, the Gaussian process upper confidence bound (GP-UCB;Srinivas et al., 2010) algorithm was proposed as a Bayesian optimistic al-gorithm with provable cumulative regret bounds. In the deterministic case,a branch-and-bound extension to GP-UCB was proven to have exponen-tially vanishing regret (de Freitas et al., 2012). The GP-UCB algorithmhas since been generalized to other Bayesian models by considering upperquantiles (Kaufmann et al., 2012a) instead of Equation (3.22) defined below,which is more reminiscent of frequentist concentration bounds. In the GPcase, since the posterior at any arbitrary point x is a Gaussian, any quantileof the distribution of f(x) is computed with its corresponding value of βt asfollows:αUCBt+1 (x) := µt(x) + βtσt(x). (3.22)There are theoretically motivated guidelines for setting and scheduling thehyperparameter βt to achieve optimal cumulative regret (Srinivas et al.,2010) and, as with τ in the improvement policies, tuning this parameterwithin these guidelines can offer a performance boost.Finally, there also exist variants of these algorithms for the contextualbandits (Vanchinathan et al., 2014) and parallel querying (Desautels et al.,2014).3.4 Information-based policiesIn contrast to the acquisition functions introduced so far, information-basedpolicies consider the posterior distribution over the unknown minimizer x?,denoted P(X? | Dt), where X? is the random variable which models ourknowledge of x?. This distribution is implicitly induced by the posteriorover objective functions f and is sometimes called Pt?. There are two policiesin this class, namely Thompson sampling and entropy search.Though it was introduced in 1933 (Thompson, 1933), Thompson sam-40PIEIUCBTSFigure 3.2: Visualizing the surrogate regression model and various inducedacquisition functions. (Top) The true objective function is shown as adashed line and the probabilistic regression model is shown as a blue linewith a shaded region delimiting the 2σt credible intervals. Finally, the obser-vations are shown as red crosses. (Bottom) Four acquisition functions areshown. In the case of PI, the optimal mode is much closer to the best obser-vation as in the alternative methods, which explains its greedy behaviour.In contrast, the randomization in TS allows it to explore more aggressively.pling has attracted renewed interest in the multi-armed bandit community,producing empirical evaluations (Scott, 2010; Chapelle and Li, 2011) as wellas theoretical results (Kaufmann et al., 2012b; Agrawal and Goyal, 2013;Russo and Van Roy, 2014a). Thompson sampling (TS) is a randomizedstrategy which samples a reward function from the posterior and selects thearm with the highest simulated reward. Therefore the selection made by TScan be expressed as the randomized acquisition function xt+1 ∼ Pt?.However, in continuous search spaces, the analog of Thompson samplingis to draw a continuous function f (t) from the posterior GP and optimize itto obtain xt+1. In order to be optimized, the sample f(t) needs to be fixedso it can be queried at arbitrary points; unfortunately, it is not clear how tofix an exact sample from the GP. However, using recent spectral sampling41techniques (Bochner, 1959; Rahimi and Recht, 2007; La´zaro-Gredilla et al.,2010), we can draw an approximate sample from the posterior that can beevaluated at any arbitrary point x (Herna´ndez-Lobato et al., 2014), whichextends TS to continuous search spaces. As an acquisition function, TS canbe formulated asαTSt+1(x) := f(t)(x), where f (t)s.s.∼ GP(µ0, k0 | Ft) (3.23)ands.s.∼ indicates approximate simulation via spectral sampling. Empiricalevaluations show good performance which, however, seems to deteriorate inhigh dimensional problems, likely due to aggressive exploration (Shahriariet al., 2014).Instead of sampling the distribution Pt?, entropy search (ES) techniquesaim to reduce the uncertainty in the location X? by selecting the point thatis expected to cause the largest reduction in entropy of X? (Villemonteixet al., 2009; Hennig and Schuler, 2012; Herna´ndez-Lobato et al., 2014). Interms of utility, entropy search methods are one-step look-ahead policiesthat use the information gain defined as followsUES(x, Yx) = H(X? | Ft)−H(X? | Dt ∪ {(x, Yx)}), (3.24)where the unknown function f is already marginalized.In other words, ES measures the expected information gain from query-ing an arbitrary point x and selects the point that offers the most infor-mation about the unknown x?. The acquisition function for ES can beexpressed formally asαESt+1(x) := H(X? | Ft)− E [H(X? | Dt ∪ {(x, Yx)}) | Ft]where H(X? | Ft) denotes the differential entropy of the posterior distri-bution Pt?, and the expectation is over the distribution of the random vari-able Yx ∼ N (µt(x), σ2t (x) + η2).Once again, this function is not tractable for continuous search spaces Xso approximations must be made. Early work discretized the space X and42computed the conditional entropy via Monte Carlo sampling (Villemonteixet al., 2009). More recent work uses a discretization of the X to obtaina smooth approximation to Pt? and its expected information gain (Hennigand Schuler, 2012). This method is unfortunately O(M4) where M is thenumber of discrete so-called representer points.Finally, predictive entropy search (PES) removes the need for a dis-cretization and approximates the acquisition function in O((t + d)3) time,which, for t < n is of the same order as EI (Herna´ndez-Lobato et al., 2014).This is achieved by using the symmetric property of mutual information torewrite αES(x) asαPESt+1 (x) := H(Yx | Ft)− E [H(Yx | Ft, X?) | Ft] , (3.25)where the expectation is now with respect to the random variable X?. Theexpectation can be approximated via Monte Carlo with Thompson samples;and three simplifying assumptions are made to compute H(Yx | Ft, X?),which we do not discuss here but which can be found in (Herna´ndez-Lobatoet al., 2014). Empirically, this algorithm has been shown to perform as wellor better than the discretized version without the unappealing quartic term,making it arguably the state of the art in entropy search approximation.3.5 Recommendation strategiesFinally, let us discuss recommendation strategies ρt. As its name suggests,at iteration t, ρt returns the algorithm’s recommendation on where the op-timum x? might be, in light of observations Dt. Although recommendationstrategies can in principle be stochastic, they are usually deterministic, i.e.ρt := δxˆt .If the objective function f is truly a random draw from our prior prob-abilistic surrogate model, and our observations yt are in fact derived fromf(xt) via our likelihood model, then by Bayes’ rule, the posterior distri-bution f | Ft is the correct belief to hold on f . Therefore, the correctrecommendation strategy ρlatt is to select the input xˆ with the highest latent43surrogate mean µt(x) as followsxˆlatt := arg maxxˆ∈Xµt(xˆ). (3.26)However, in practice we rarely know the exact form of the prior and like-lihood, and even if we do, exact inference may not be tractable such thatµt(x) may be an approximation. In such cases of model misspecification orapproximate inference, we may prefer to rely on observations rather thanthe surrogate model, opting for the best observation strategy ρobst :xˆobst := arg max{xs}s≤ty(xs). (3.27)Naturally, if the observations yt are very noisy, it is not hard to see thatρobst can be mislead. Instead, we can restrict recommendations to inputsfor which we have observations and choose between them using the smooth-ing posterior mean µt, rather than the noisy observations yt, leading to astrategy we refer to as selecting the incumbent :xˆinct := arg max{xs}s≤tµt(xs). (3.28)Note that in the noiseless case, ρobst and ρinct will return the same recommen-dations. Finally, note that practitioners may understandably be reluctantto trust a recommendation xˆ for which the algorithm has no correspondingobservation yˆ and therefore ρinct is often a good compromise between trustingthe model and trusting the data.In the multi-armed bandit setting, Bubeck et al. (2009) explored vari-ous recommendation strategies. While the empirical best arm strategy isanalogous to ρinct , the other two strategies only make sense with discreteinput spaces:3 namely, the most played arm and the empirical distributionof plays, which is a randomized recommendation strategy that samples uni-formly from the observed inputs {xs}s≤t with multiplicity.3 Strictly speaking, with continuous input spaces, atomic selection policies could allowsuch recommendation strategies as well.44Chapter 4Bayesian gap-basedexplorationIn this chapter, I discuss a Bayesian policy that we developed for the problemof pure exploration in multi-armed bandits. Though it could be awkwardlyexpressed as an acquisition function, the policy presented in this chapter isnot explicitly an acquisition function that is maximized. This work includesa probably approximately correct (PAC) proof and an extensive comparisonand empirical study of multiple acquisition functions for pure exploration.Finally, we apply this optimization strategy to the problem of automaticmodel selection and hyperparameter tuning with good performance. Thiswas joint work with Matthew W. Hoffman.4.1 AlgorithmIn this section we will introduce a gap-based solution to the Bayesian op-timization problem, which we call BayesGap. This approach builds on thework of Gabillon et al. (2011, 2012), which we will refer to as UGap,1 andoffers a principled way to incorporate correlation between different arms,whereas UGap assumes all arm reward distributions are independent.1Technically, UGapEb, denoting bounded horizon.45At the beginning of round t we will assume that the decision maker isequipped with high-probability upper and lower bounds Uk(t) and Lk(t) onthe unknown mean zk for each arm k ∈ A = {1, . . . ,K}. While this ap-proach can encompass more general bounds, for the linear-Gaussian banditwe consider in this work we can define these quantities in terms of the pos-terior mean and standard deviation µkt ± βσkt. These bounds also give riseto a confidence diameter sk(t) = Uk(t) − Lk(t) = 2βσkt. We will refer toβ as the exploration constant, and note that in this chapter we will use aslightly different notation µkt := µt(k) and σkt := σt(k).Given bounds on the mean reward for each arm, we can then introducethe per-arm worst-case gap quantityBk(t) = maxi 6=kUi(t)− Lk(t), (4.1)which involves a comparison between the lower bound of arm k and thehighest upper bound among all alternative arms. Ultimately this quantityprovides an upper bound on the simple regret (see Lemma B1 in the ap-pendix) and will be used explicitly both in the selection policy and its PACproof. However, rather than directly selecting the arm that minimizes thisgap, we will consider the two candidate armsJ(t) = arg mink∈ABk(t) and (4.2)j(t) = arg maxk 6=J(t)Uk(t). (4.3)We will then define the deterministic selection policy asat = arg maxk∈{j(t),J(t)}sk(t). (4.4)Intuitively, this strategy selects either J(t), the arm that minimizes ourbound on the simple regret, or j(t), the best optimistic arm. We say opti-mistic because j(t) is the index that maximizes Uk(t), not µkt, and Uk(t) isan optimistic upper bound on the expected reward of arm k. Between the46Algorithm 2 BayesGap1: for t = 1, . . . , T do2: J(t)← arg mink∈ABk(t)3: j(t)← arg maxk 6=J(t) Uk(t)4: at ← arg maxk∈{j(t),J(t)} sk(t)5: yt ∼ νat6: update posterior µkt and σkt given (at, yt)7: update bound on H and re-compute β8: update posterior bounds Uk(t) and Lk(t)9: end for10: tˆ← arg mint≤T BJ(t)(t)11: return aˆT ← J(tˆ)two candidates J(t) and j(t), the arm with the highest uncertainty will beselected: the one expected to provide the most information. Finally, we willdefine the deterministic recommendation strategy asaˆT = J(arg mint≤TBJ(t)(t)). (4.5)In other words, the recommended arm aˆT ∈ {J(t)}t<T minimizes the regretbound, over all times t ≤ T . The reason behind this particular choice issubtle, but is necessary for the proof of the method’s simple regret bound.2Algorithm 2 outlines the pseudo-code for BayesGap.4.2 TheoryWe now present our theoretical result concerning the case of linear–Gaussianbandits, where we ask which value of β should be used. This will be aparticular case of the more general result in Theorem 2 in Appendix A.Recall the linear–Gaussian bandit introduced in Chapter 2. Selecting anarm k ∈ A from the K-armed Gaussian bandit results in an observing arealization of N (zk, η2), where zk = f(xk) and xk ∈ Rd, ∀k ∈ A are feature2See inequality (b) in Appendix A.47y(a) Updated posteriorymax U−k(b) Locate max upper boundsymax U−kBk(c) Compute gap for each armymax U−kBk J j(d) Establish candidates (Steps 2–3)ymax U−kBk J(e) Select most uncertaincandidate (Step 4)Figure 4.1: Visualization of one iteration of the BayesGap algorithm ona 5-armed Bernoulli bandit, with references to Algorithm 2 where appro-priate. Each panel includes 5 columns corresponding to the 5 arms. Theunknown true values of zk are represented by black dots, while the posteriordistribution of each zk is represented by magenta curves. The magenta barsare delimited by the bounds Lk and Uk, while the length of the blue barsquantify the gap Bk. The candidate arms J and j, and the selection arehighlighted in orange and green, respectively.48vectors. Furthermore, we assume the linear structure of Equation (2.13):f(xk) = wTxk, ∀xk ∈ A. (4.6)where we prescribe an isotropic normal prior distribution Πw0,σ0 = N (w0, σI).Consider the true gap quantities ∆k = |maxi 6=k zi − zk|, ∀k ∈ A. Thisquantity is different from the gap quantity defined in Chapter 2 only forthe best arm k?, where rather than vanishing, ∆k? measure the differencebetween the optimal arm and the runner up. Otherwise, for all other armsit is again a measure of sub-optimality. Given this quantity let Hk =max(12(∆k + ), ) be an arm-dependent hardness quantity; essentially ourgoal is to reduce the uncertainty in each arm to below this level, at whichpoint with high probability we will identify the best arm. Now, given H =∑kH−2k we define our exploration constant asβ2 =T−Kη2+ κσ24H, (4.7)where κ =∑k ‖xk‖−2. We have chosen β such that with high probabilitywe recover an -best arm, as detailed in the following theorem. This theoremrelies on bounding the uncertainty for each arm by a function of the numberof times that arm is pulled. Roughly speaking, if this bounding functionis monotonically decreasing and if the bounds Uk and Lk hold with highprobability we can then apply Theorem 23 to bound the simple regret ST ofBayesGap.4Theorem 1Consider a K-armed Gaussian bandit problem, horizon T , and upper andlower bounds defined as above. For > 0 and β defined as in Equation (4.7),the algorithm attains simple regret satisfying P(ST ≤ ) ≥ 1−KTe−β2/2.Proof. Let Nk(t) be the number of times arm k ∈ A has been pulled by3The additional Theorem is in Appendix A and is a modification of that in (Gabillonet al., 2012).4Recall the simple regret ST := maxk zk−zkˆT , where kˆT is the recommended arm afterT rounds.49round t. Using the definition of the posterior variance for arm k, we canwrite the confidence diameter assk(t) = 2β√xTkΣtxk (4.8)= 2β√η2xTk(∑iNi(t− 1) xixTi + η2σ2I)−1xk (4.9)≤ 2β√η2xTk(Nk(t− 1) xkxTk + η2σ2I)−1xk. (4.10)In the second equality we decomposed the Gram matrix XTX in terms ofa sum of outer products over the fixed vectors xi—letting X be the matrixwhose rows are the vectors xi. In the final inequality we noted that byremoving samples we can only increase the variance term, and so here weset Ni(t−1) = 0 for i 6= k. We will let the result of this final inequality definean arm-dependent bounding function gk : N 7→ (0,∞). Letting A = 1N η2σ2we can simplify this quantity using the Sherman–Morrison formula asgk(N) = 2β√(η2/N)xTk(xkxTk +AI)−1xk (4.11)= 2β√η2N‖xk‖2A(1− ‖xk‖2/A1 + ‖xk‖2/A)(4.12)= 2β√η2‖xk‖2η2σ2+N‖xk‖2, (4.13)which is monotonically decreasing in N . The inverse of this function can besolved for asg−1k (s) =4β2η2s2− η2σ21‖xk‖2 . (4.14)By setting∑k g−1k (Hk) = T − K and solving for β we then obtain thedefinition of this term given in the statement of the proposition. Finally, byreference to Lemma B4 in the appendix, we can see that for each k and t,the upper and lower bounds must hold with probability 1 − e−β/2. Theselast two statements satisfy the assumptions of Theorem 2 in the appendix,thus concluding the proof.Here we should note that while we are using Bayesian methodology to50drive the exploration of the bandit, we are analyzing this using frequentistregret bounds. This is a common practice when analyzing the regret ofBayesian bandit methods (Srinivas et al., 2010; Kaufmann et al., 2012a).We should also point out that implicitly Theorem 2 assumes that each armis pulled at least once regardless of its bound. However, in our setting wecan avoid this in practice due to the correlation between arms.On the exploration parameter β. The proof and derivation of β givenabove require knowledge of the hardness quantity H, which is unknownin most practical applications. Instead of requiring H from the user, ourapproach is to adaptively bound it. Intuitively, the quantity β controlshow aggressively BayesGap explores uncertain arms; indeed, β is directlyproportional to the width of the uncertainty sk(t) and inversely proportionalto H. As a result, in order to initially encourage more exploration we willlower bound the hardness quantity. In particular, we can do this by upperbounding each ∆k, using conservative posterior-dependent upper and lowerbounds on zk. In this work we use three posterior standard deviations awayfrom the posterior mean: µkt ± 3σkt—we emphasize that these bounds arenot the same as Lk(t) and Uk(t). Then a high probability upper bound on∆k is simply given by∆ˆkt = maxj 6=k(µjt + 3σjt)− (µkt − 3σkt). (4.15)From this point we can recompute H and in turn recompute β (Step 7 inAlgorithm 2). For all experiments we will use this adaptive method.Comparison with UGap. The method in this section provides a Bayesianversion of the UGap algorithm which modifies the bounds used in this ear-lier algorithm’s arm selection step. By modifying step 6 of the BayesGappseudo-code to use either Hoeffding or Bernstein bounds we recover theUGap algorithm. Note, however, that in doing so UGap assumes indepen-dent arms with bounded rewards.We can now compare UGap’s probability of error O(KT exp(−T−KH ))51with that of BayesGap O(KT exp(−T−K+κη2/σ2Hη2)). With minor differences,these bounds are of the same order. First, the additional η2 term is primarilydue to the distinction between bounded and Gaussian-distributed rewards.Similarly, the σ2 term corresponds to the concentration of the prior, and wecan see that the more concentrated the prior is (smaller σ) the faster thisrate is. Note, however, that the proof of BayesGap’s simple regret relies onthe true rewards for each arm being within the support of the prior, so onecannot increase the algorithm’s performance by arbitrarily concentratingthe prior. Finally, the κ term is related to the linear relationship betweendifferent arms. Additional theoretical results on improving these boundsremains for future work.4.3 ExperimentsIn the following subsections, we benchmark our proposed algorithm againsta wide variety of policies on two real-data applications. In Section 4.3.1,we revisit the traffic sensor network problem of Srinivas et al. (2010). InSection 4.3.2, we consider the problem of automatic model selection andalgorithm configuration.4.3.1 Application to a traffic sensor networkIn this experiment, we are given data taken from traffic speed sensors de-ployed along highway I-880 South in California. Traffic speeds were collectedat K = 357 sensor locations for all working days between 6AM and 11AMfor an entire month. Our task is to identify the single location of least con-gestion, i.e. with the highest expected speed. This data was also used in thework of Srinivas et al. (2010) on GP-UCB.Naturally, the readings from different sensors are correlated, however,this correlation is not necessarily only due to geographical location. There-fore specifying a similarity kernel over the space of traffic sensor locationsalone would be overly restrictive. Following the approach of Srinivas et al.(2010), we construct the design matrix treating two-thirds of the available52UCBEUGapBayesUCBBayesGap EI PIGPUCBThompson0.060.080.100.120.140.16Probability of errorFigure 4.2: Probability of error on the optimization domain of traffic speedsensors (averaged over 840 runs). For this real data set, BayesGap providesconsiderable improvements over the Bayesian cumulative regret alternativesand the frequentist simple regret counterparts.data as historical and use the remaining third to evaluate the policies. Inmore detail, the GP kernel matrix G ∈ RK×K is set to be empirical co-variance matrix of measurements for each of the K sensor locations, thecorresponding design matrix is X = V D12 , where G = V DV T .Following (Srinivas et al., 2010), we estimate the noise level η of the ob-servation model using this data. We consider the average empirical varianceof each individual sensor (i.e. the signal variance corresponding to the diag-onal of G) and set the noise variance η2 to 5% of this value; this correspondsto η2 = 4.78. We choose a broad prior with regularization coefficient σ = 20.In order to evaluate different bandit and Bayesian optimization algo-rithms, we use each of the remaining 840 sensor signals (the aforementionedthird of the data) as the true mean vector z for independent runs of theexperiment. Using historical data in this way enables us to evaluate theground truth for each run and estimate the probability that the policiesreturn the best arm.53Other policies in this empirical comparison:(1) UCBE (Audibert et al., 2010): variant of classical UCB of Auer (2003),substituting the log(t) exploration term in UCB with a constant of orderlog(T ) for known horizon T ;(2) UGap (Gabillon et al., 2012): frequentist algorithm that inspired ourBayesian variant;(3) BayesUCB and GP-UCB: Bayesian extensions of UCB introduced byKaufmann et al. (2012a) and Srinivas et al. (2010), respectively;(4) PI, EI, and Thompson sampling (see Chapter 3).Note that techniques (1) and (2) attack the problem of best arm identi-fication and use bounds which encourage more aggressive exploration. How-ever, they do not take correlation into account. On the other hand, tech-niques such as (3) are designed for cumulative regret, but model the corre-lation among the arms. It might seem at first that we are comparing applesand oranges. However, the purpose of comparing these methods, even iftheir objectives are different, is to understand empirically what aspects ofthese algorithms matter the most in practical applications.The results, shown in Figure 4.2, are the probabilities of error for eachstrategy, using a time horizon of T = 400. (Here we used = 0, but varyingthis quantity had little effect on the performance of each algorithm.) Bylooking at the results, we quickly learn that techniques that model correla-tion perform better than the techniques designed for best arm identification,even when they are being evaluated in a best arm identification task. Theimportant conclusion is that one must always invest effort in modelling thecorrelation among the arms.The results also show that BayesGap does better than alternatives in thisdomain. This is not surprising because BayesGap is the only competitor thataddresses budgets, best arm identification and correlation simultaneously.544.3.2 Automatic machine learningThere exist many machine learning toolboxes, such as Weka and scikit-learn.However, for a great many data practitioners interested in finding the besttechnique for a predictive task, it is often hard to understand what eachtechnique in the toolbox does. Moreover, each technique can have manyfree hyper-parameters that are not intuitive to most users.Bayesian optimization techniques have already been proposed to auto-mate machine learning approaches such as MCMC inference (Mahendranet al., 2012; Hamze et al., 2013; Wang et al., 2013a), deep learning (Bergstraet al., 2011), preference learning (Brochu et al., 2007, 2010), reinforcementlearning and control (Martinez-Cantin et al., 2007; Lizotte, 2008), and more(Snoek et al., 2012). In fact, methods to automate entire toolboxes (Weka)have appeared very recently (Hutter et al., 2012), and go back to old racingstrategies for model selection (Maron and Moore, 1993; Shen et al., 2011).In this experiment, we will demonstrate BayesGap by automating re-gression with scikit-learn. Our focus will be on minimizing the cost ofcross-validation in the domain of big data. In this setting, training and test-ing each model can take a prohibitive long time. If we are working undera finite budget, say if we only have three days before a conference dead-line or the deployment of a product, we cannot afford to try all models inall cross-validation tests. However, it is possible to use techniques such asBayesGap and Thompson sampling to find the best model with high prob-ability. In our setting, the action of “pulling an arm” will involve selectinga model, splitting the dataset randomly into training and test sets, trainingthe model, and recording the test-set performance.In this bandit domain, our arms will consist of five scikit-learn tech-niques and associated parameters selected on a discrete grid. We considerthe following models and parameter settings for regression:Within a class of regressors, we model correlation using a squared ex-ponential kernel with unit length scale, i.e., k(x, x′) = e−(x−x′)2 . Using thiskernel, we compute a kernel matrix G and construct the design matrix asbefore.55Model & hypers (160 total) discretized settingsLasso (8)alpha {1, 5} · 10{−4,−3,−2,−1}Random forests (64)n estimators 10{0,1,2,3}min samples split 1, 3, 5, 7min samples leaf 2, 6, 10, 14linSVM (16) & rbfSVM (64)C 10{−3,−2,−1,0}epsilon 10{−4,−3,−2,−1}(rbf only) gamma 0.025, 0.05, 0.1, 0.2K-NN (8)n neighbors 1, 3, 5, 7, 9, 11, 13, 15Table 4.1: Models and discretized hyperparameter settings used in the au-tomatic machine learning model selection experiment.When an arm is pulled we select training and test sets uniformly atrandom. Each set includes 10% of the data in the original dataset; theremaining 80% is ignored for this particular arm pull. We then train theselected model on the training set, and test on the test set. This specificform of cross-validation is similar to that of repeated learning-testing (Arlotet al., 2010; Burman, 1989).We use the wine dataset from the UCI Machine Learning Repository,where the task is to predict the quality score (between 0 and 10) of a winegiven 11 attributes of its chemistry. We repeat the experiment 100 times.We report, for each method, an estimate of the RMSE for the recommendedmodels on each run. Unlike in the previous section, we do not have theground truth generalization error, and in this scenario it is difficult to es-timate the probability of error. Instead we report the RMSE, but remarkthat this is only a proxy for the error rate that we are interested in.The performance of the final recommendations for each strategy and afixed budget of T = 10 tests is shown in Figure 4.3. The results for other56RMSEFigure 4.3: Boxplot of RMSE averaged over 100 runs with a fixed budget ofT = 10. EI, PI, and GP-UCB get stuck in local minima.budgets are almost identical. It must be emphasized that the number ofallowed function evaluations (10 tests) is much smaller than the number ofarms (160 models). Hence, frequentist approaches that require pulling allarms, e.g. UGap, are inapplicable in this domain.The results indicate that Thompson and BayesGap are the best choicesfor this domain. Figure 4.4 shows the individual arms pulled and recom-mended by BayesGap (above) and EI (bottom), over each of the 100 runs, aswell as an estimate of the ground truth RMSE for each individual model. EIand PI often get trapped in local minima. Due to the randomization inher-ent to Thompson sampling, it can explores more depending on the specifiedprior.4.4 ConclusionThis work demonstrates the benefit of modelling correlations between armsin multi-armed bandits. Not only does modelling correlations help convergemore quickly to better optima—as in the traffic sensor network results—57Figure 4.4: Allocations and recommendations of BayesGap (top) and EI(bottom) over 100 runs at a budget of T = 40 training and validation tests,and for 160 models (i.e., more arms than possible observations). Histogramsalong the floor of the plot show the arms pulled at each round while thehistogram on the far wall shows the final arm recommendation over 100different runs. The solid black line on the far wall shows the estimated“ground truth” RMSE for each model. Note that EI quite often gets stuckin a locally optimal rbfSVM.but such modelling efforts accommodate budgets that are smaller than thenumber of available arms by generalizing observations to unobserved arms—as in the automatic model selection results.There is room to improve the theoretical results proven in Appendix A.In particular, we emphasize two loose bounds: the union bound that con-tributes the factor of KT ignores both dependencies across arms and acrosstime (or rounds); meanwhile, the inequality in Equation 4.10 is the resultof a very conservative, worst case bound that effectively ignores interactionsbetween arms.While generalizations to continuous search spaces are not studied, it is aninteresting direction for future work. Indeed, the appealing feature of thesegap-based algorithms is that they always consider two different candidatesand select the most uncertain one. Such an approach could allow Bayesianoptimization algorithms to avoid getting stuck in local optima.58Chapter 5Entropy search portfolioCommitting to a policy a priori is a challenging task, which can lead to dra-matically different performance. Consider, for example, the experiments inChapter 4 where Thompson sampling ranked very low in the traffic sensornetwork results, but very high in the automatic machine learning results.In this chapter, we address this practical challenge. Indeed, we investigatea hierarchical approach to decision making, whereby at every iteration t,we appeal to several policies to provide candidate query points and select asingle xt among them. Though this type of approach is referred to as a port-folio method (Hoffman et al., 2011), it can be interpreted as turning an opti-mization problem over the input space X into a multi-armed bandit problemover the set of arms (acquisition functions) At := {xkt := arg maxX αkt }Kk=1.1In order to pick among the arms At, we propose another utilitarian ap-proach, which we distinguish from the utilitarian member policies by usingthe terms meta-policy or meta-criterion, denoted ut.2 Algorithm 3 outlinesthe slightly modified framework for Bayesian optimization with portfolios,and Figure 5.1 provides a visualization to help follow the discussion.In Section 5.1.1, we outline our proposed meta-criterion, and the variousapproximations and practical techniques used to estimate it. An important1 See (Neufeld et al., 2014) for a similar bandit perspective on portfolios applied toadaptive Monte Carlo.2 Most of the time, since the time index is unambiguous, we will omit the subscript tto simply the notation.59Algorithm 3 Bayesian optimization with portfoliosRequire: a stochastic black-box (f, ν)initial data D0 = ∅a portfolio A = {αk}Kk=11: for t = 1, . . . , T do2: At ← {arg maxX αkt }Kk=1 / collect candidates3: xt ← arg maxAt u( · ;Dt−1)4: yt ∼ ν(f(xt))5: Dt ← Dt−1 ∪ {(xt, yt)}6: end for7: return xˆT ← arg max{xs}s≤TµTstep in the approximation relies on sampling from Equation (5.1), and inSection 5.1.2, we show how to efficiently approximately sample from thisdistribution, an approach that also acts as an extension of Thompson sam-pling to continuous domains with GP priors. Though this particular sectionwas concurrent work by Herna´ndez-Lobato et al. (2014), ours was the firstempirical evaluation of this technique as an acquisition function. The entirealgorithm, including computational complexity analysis, is summarized inSection 5.1.3 and the pseudocode is provided in Algorithm 4. Section 5.2details several experiments and results in the domains of geostatistics andcontrol.5.1 AlgorithmInspired by the works of Villemonteix et al. (2009) and Hennig and Schuler(2012), the approach we propose in this chapter is to use the reduction inentropy as a meta-criterion to select an xt among the points in At.Recall that our prescribed surrogate probabilistic model induces a dis-tribution over the location of the unknown global optimizer x?. Note thatthis optimizer is a random variable, due to the epistemic randomness in f .Given data D, let us define the following distribution over the measurable60input space (X ,FX)P?(E | D) := P(arg minx∈Xf(x) ∈ E∣∣∣ D) , ∀E ∈ FX , (5.1)denoting the posterior over minimizer locations, with density p?( · | D).Our proposed algorithm we call entropy search portfolio (ESP) then selects,among the candidate points A = {xk := arg maxx∈X αk(x;D)}Kk=1, the onewhich maximizes the reduction in uncertainty. As a measure of uncertaintyin the posterior distribution P?, we will use the differential entropy H[p?],and therefore propose the negative expected entropy as a meta-criterion:xt := arg maxxk∈Au(xk;D), where (5.2)u(xk;D) := −EH[p?(· ∣∣ D ∪ {(xk, Yk)})] , (5.3)where the expectation is taken over the, as yet, unobserved random variableYk ∼ νxk .5.1.1 Estimating the expected entropyUnfortunately, the meta-criterion expressed in Equation (5.3) is analyticallyintractable and requires some form of approximation in order to make thealgorithm feasible.3 We begin by estimating the expectation via MonteCarlo integration, taking N samples y(n)k ∼ p(yk | xk,D) and approximatingthe meta-criterion asu(xk;D) ≈ − 1NN∑n=1H[p?( · | D˜(n)k )], (5.4)where D˜(n)k := D ∪ {(xk, y(n)k )}. In practice one can use stratified or quasi-Monte Carlo to reduce the variance of this estimate.We turn our attention to the entropy estimation. For continuous densi-3 Since this work, Herna´ndez-Lobato et al. (2014) proposed a new and better approxi-mation that can readily be incorporated in this work.61x1 x2 x3z(1) ... z(i) z(G)Figure 5.1: Visualization of Bayesian optimization with acquisition functionportfolios. Top: visualization of our probabilistic model at a particulartime. Middle: three acquisition functions, members of a portfolio, in thiscase EI (blue), PI (cyan), and Thompson (red). The purpose of Algorithm 4is to select between the candidates {xk}3k=1. Bottom: histogram of 1000approximate samples from P?. Here we also show G draws from P?( · | D)that we denote z(i).62xky(1)ky(n)ky(N)k[f(s)k,n]iz(1) ... z(i) z(G)Figure 5.2: Visualization of the Entropy Search Portfolio. Top: for eachcandidate xk (blue dashed line) we draw N hallucinations (blue diamonds)from the conditional. In practice this is done with quasi-Monte-Carlo toreduce variance. Middle: for each hallucination y(n)k we augment the GPmodel (green line and shaded area) and sample it S times at the discretepoints z(i) to obtain the f(s)kn (blue triangles) for s = 1, . . . , S. We find theminimizer of each vector f(s)kn (magenta triangle). Bottom: finally, we binthe S minimizers into a discrete empirical distribution pˆ depicted here as amagenta histogram.63ties p the differential entropy can be written asH[p(x)] := −∫Xp(x) log p(x) dx. (5.5)Monte Carlo integration cannot estimate this integral because the integrandlog p?( · | D) cannot be evaluated pointwise. Instead, we approximate thisdensity with a discrete probability mass function pˆ? and use its discreteentropy as a proxy for the differential entropy of interest in (5.5).Let us therefore restrict our attention to a discretization of the domain{z(i)}Gi=1 ⊂ X , and in an effort to maintain the flow of this section, we deferthe technical details of how the space is discretized to the next section.Given such a discretization, we can define the following probability massfunction:pˆ?(z) := P(Yz ≤ Yz(i) , ∀z(i)), ∀z ∈ {z(i)}, (5.6)and similarly for the posterior distribution pˆ?(z | D).Combining this discrete proxy for the entropy with the Monte Carlointegration above yields the following approximation to our meta-criterionu(xk;D) ≈ 1NN∑n=1G∑i=1pˆ?(z(i) | D˜(n)k ) log pˆ?(z(i) | D˜(n)k ). (5.7)Finally, we must estimate pˆ?(z(i) | D˜(n)k ). Let the multivariate randomvariable f := (f(z(1)), . . . , f(z(G))) be the vector of latent objective functionvalues evaluated at the representer points. Under the assumption that fis sampled from a GP, the posterior distribution of f | D˜(n)k is normal; asa result, we can produce S exact samples f(s)kn from the posterior. Theprobabilities necessary to compute the entropy can then be approximatedby the relative countspˆkni =1SS∑s=1I[i = arg minj=1,...,G[f(s)kn ]j]. (5.8)In other words, the number pˆkni is the proportion of times the representer64Algorithm 4 Entropy search portfolioRequire: candidates {xk}Kk=1, observations D1: z(i) ∼ P?( · | D), i = 1, . . . , G2: for k = 1, . . . ,K do3: for n = 1, . . . , N do4: y(n)k ∼ ν(f(xk) | D) / hallucinate next observation5: D˜(n)k ← D ∪ {(xk, y(n)k )}6: f(s)kn ∼ p(f | D˜(n)k ) for s = 1, . . . , S7: pˆkni ← 1S∑s I[i = arg minj [f(s)kn ]j]8: end for9: uk ← 1N∑Nn=1∑Gi=1 pˆkni log pˆkni10: end for11: kˆ = arg maxk=1,...,Kuk12: return xkˆpoint z(i) was the minimizer among the S sampled functions with halluci-nated data point (xk, y(n)k ).Finally, combining the estimates above yields our approximated meta-criterionu(xk;D) ≈ 1NN∑n=1G∑i=1pˆkni log pˆkni. (5.9)We provide the pseudocode for this estimate of the meta-criterion in Algo-rithm 4 along with a corresponding visualization in Figure 5.2.5.1.2 Discretizing X : sampling posterior global minimizersDiscretizing the search space X for such entropy estimates is an establishedapproximation technique (Villemonteix et al., 2009; Hennig and Schuler,2012). Naturally, obtaining a good estimate depends on the particular dis-cretization. In the work of Hennig and Schuler (2012), the discrete pointsz(i) are called representer points and are sampled from either PI or EI, de-pending on the implementation; earlier, in the work of Villemonteix et al.(2009), the discretization was sampled from a latin hypercube (LHC) spacefilling design. Both of these works have argued the importance of both65randomization and sampling points {z(i)} close to the target distributionP?( · | D˜k). While the LHC approach provides a very simple discretization,it is independent of the current posterior, which can lead to very small ef-fective discretization, i.e. eventually most representer points will have noeffect on the estimate. On the other hand, running a slice-sampler on PI orEI is a practically challenging heuristic because these acquisition functionsquickly become highly multi-modal with shrinking support.In this work, we propose using approximate samples from the posteriordistribution P?( · | D), which we expect to be closer to the target distribu-tion than EI or PI, by virtue of the fact that the posterior does not varytoo dramatically from one observation to the next. This step correspondsto Line 1 of Algorithm 4; after this step the {z(i)} are kept fixed for theremainder of the ESP subroutine, and only resampled at the next iterationof the optimization outer-loop: Algorithm 3.We use the following two step process to sample from P?( · | D)f | D ∼ GP, (5.10)z(i) = arg minXf, ∀i = 1, . . . , G (5.11)where ties are broken arbitrarily. At first glance this would require con-structing an infinite-dimensional object representing the function f . Thiscould be done by sequentially constructing f while it is being optimized.However, evaluating such an f would ultimately have cost O(m3) where mis the number of function evaluations necessary to find each optimum z(i).Instead, following the work of Herna´ndez-Lobato et al. (2014), we will sam-ple an analytic approximation to f , which provides a closed form expressionthat can be differentiated and optimized efficiently. We defer the technicaldetails of this technique to Appendix B.Extension to Thompson sampling. Finally, this approach can also beused directly as an acquisition strategy; consider the following randomizedstrategyxt ∼ P?( · | D).66For finite X this is simply the well-established Thompson sampling algo-rithm. However, this derivation extends Thompson sampling to continuoussearch spaces. This work is the first to evaluate this acquisition functionempirically.5.1.3 Algorithm summaryIn Algorithm 3, we describe general pseudocode for performing Bayesianoptimization using a portfolio of acquisition functions. Here line 3 denotes ageneral selection criterion such as the GP-Hedge approach of (Hoffman et al.,2011) or our own ESP, summarized in Algorithm 4. The computational costof this approach will be dominated by the cubic cost of solving the linearsystem on line 6 of Algorithm 4. As a result the complexity will be on theorder O(KNSt3), however since K, N , and S are fixed a priori, this is only aconstant-factor slower than standard Bayesian optimization algorithms andcan easily be parallelized.One consideration we have not mentioned is the selection of hyperpa-rameters, which can have a considerable effect on the performance of anyBayesian optimization algorithm. For example, if we fix the length scaleparameters Λ to be too small or large, then observing a point (x, y) willaffect the posterior in a neighbourhood that is, in turn, too small or largearound x, respectively. This will adversely affect the optimization process,causing any Bayesian optimization algorithm to search too globally or toolocally—in a sense, to explore or exploit too much.Typical approaches to GP regression will often optimize the marginallikelihood or evidence as a means of setting these parameters. However, inthe Bayesian optimization setting this approach is particularly ineffectivedue to the initial paucity of data. Even worse, such optimization can alsolead to quite severe local maxima around the initial data points. In thiswork we instead give a fully Bayesian treatment of the hyperparameters.Let ψ denote a vector of hyperparameters which includes any kernel andlikelihood parameters. Let p(ψ | D) ∝ p(ψ) p(D|ψ) denote the posteriordistribution over these parameters where p(ψ) is a hyperprior and p(D | ψ)67is the GP marginal likelihood. For a fully Bayesian treatment of ψ we mustmarginalize our acquisition strategy with respect to this posterior. Thecorresponding integral has no analytic expression and must be approximatedusing Monte Carlo.Consider now drawing M samples {ψ(i)} from the posterior p(ψ|Dt).Often an acquisition strategy is specified with respect to some internal ac-quisition function which is optimized at every iteration in order to select xt.These functions can then be approximately marginalized with respect to thehyperparameter samples in order to instead optimize an integrated acquisi-tion function. This approach was taken in (Snoek et al., 2012) for the EIand PI acquisition functions. Note that the randomized Thompson samplingstrategy, introduced in Section 5.1.2, only requires a single hyperparametersample as we can see this as a joint sample from the hyperparameters andthe function minimizer.Our approach has additional complexity in that our candidate selectioncriterion depends on the hyperparameter samples as well. This occurs explic-itly in lines (4–6) of Algorithm 4 which sample from the GP posterior. Thiscan be solved simply by adding an additional loop over the hyperparametersamples. Our particular use of representer points in line 1 also depends onthese hyperparameters. We can solve this problem by equally allocating ourrepresenters between the M different hyperparameter samples.5.2 ExperimentsIn this section we evaluate the proposed method on several problems: twosynthetic test functions commonly used for benchmarking in global opti-mization, two real-world datasets from the geostatistics community, andtwo relatively high dimensional simulated control problems. We compareour method against two categories of baselines: first, Bayesian optimizationbased on single acquisition functions, and second, other portfolio methodsbased on the same ensemble of acquisition functions as ESP (but combiningthem in different ways).In the first category, we included three well-known Bayesian optimization68acquisition functions, namely the EI, PI, and Thompson methods describedearlier. For EI we used the implementation available in the spearmintpackage,4 and the latter two were implemented in the spearmint framework.All three methods were included in the portfolios. Note that we do notcompare against GPUCB (Srinivas et al., 2010) as the bounds do not applydirectly when integrating over GP hyperparameters. This would require anadditional derivation which we do not consider here.In the second category, we compare ESP against the GPHedge portfoliomethod of Hoffman et al. (2011) and an approach which selects between dif-ferent base strategies uniformly at random labelled RP (Random Portfolio).In order to marginalize over the GP hyperparameters (i.e. kernel length-scale and amplitude, and prior constant mean) we used slice sampling asimplemented in the spearmint package. By default, the spearmint softwareoutputs 10 samples at each iteration. Both improvement-based methods, EIand PI, can simply be evaluated with each sample and averaged, while inkeeping with the Thompson strategy, our Thompson implementation onlyuses a single sample, namely the last one. As reported above, our proposedESP method splits its 500 representer points equally among the hyperpa-rameter samples. In other words, for each of the 10 hyperparameters wedraw 50 points. In addition, in order to estimate u(xk;D), for each of 5simulated y(n)k and each GP hyperparameter, ESP draws 1000 samples fromP?( · | D˜(n)k ), and then computes and averages the entropy estimates.In the following experiments we evaluate the performance of each algo-rithm as a function of the number of function evaluations. When the trueoptimum is known, as in the case of the Branin and Hartmann 3 experiments,we measure performance by the absolute error. For all other minimizationexperiments, performance is given by the minimum function value obtainedup to that iteration. Each experiment was repeated 20 times with differentrandom seeds. For each method, we plot the median performance metricover these repetitions as well as shading the band between the lower andupper quartiles.4https://github.com/JasperSnoek/spearmint690 20 40 60 80 100Function evaluations10-910-810-710-610-510-410-310-210-1100Absolute errorBraninEIPIThompsonRPRP9HedgeHedge9ESPESP90 20 40 60 80 100Function evaluations10-910-810-710-610-510-410-310-210-1100Absolute errorHartmann 3EIPIThompsonRPRP9HedgeHedge9ESPESP9Figure 5.3: Absolute error of the best observation for the Branin and Hart-mann 3 synthetic functions. The 9 additional random experts in RP9,GPHedge9, and ESP9 affect the RP and GPHedge methods much moredramatically than ESP.5.2.1 Global optimization test functionsWe begin with two synthetic functions commonly used for benchmarkingglobal optimization methods: Branin and Hartmann 3 (Lizotte, 2008). Theyare two- and three-dimensional, respectively, and are both continuous, bounded,and multimodal. The experiments were run up to a final horizon of T = 100.Figure 5.3 reports the observed performance measured in absolute error ona logarithmic scale.It is interesting to note that despite the poor performance of PI in thistask, all portfolios still outperform their base strategies, with the exceptionof Thompson on Branin. As expected, ESP makes the best use of its baseexperts and outperforms all other methods in both examples.5It is interesting to note that each of the two domains have their ownchampion, Thompson being a clear winner on Branin, and EI being themore attractive option on Hartmann 3. This observation motivates the useof portfolios of acquisition functions.5Note that PI performs poorly in this multimodal example because of PI’s greedi-ness. This issue could in principle be addressed using a minimum improvement parameter(Jones, 2001; Brochu et al., 2009). However we set this parameter to its default value of 0in our experiments since our focus in this work is in selecting between acquisition functionsrather than tuning them. However, one could imagine in future work using parameterizedfamilies of strategies as a way of enlarging portfolios.70In these synthetic experiments we also demonstrate the resilience of ESPwith respect to to the inclusion of poor base strategies. We do so by adding 9random experts to each portfolio (we denote these ESP9, RP9, etc.). Theseso-called random experts select a point uniformly at random in the boundingbox X . We expect this sort of random search to be comparable to the otherbase methods in the initial stage of the optimization and eventually providetoo much exploration and not enough exploitation. Note however that afew random experts in a portfolio could actually be beneficial in providing aconstant source of purely exploratory candidates; precisely how many is aninteresting question we do not discuss in the present work. Nevertheless, forthe dimensionality and difficulty of these examples, we propose 9 additionalrandom experts as being too many and indeed we observe empirically thatthey substantially deteriorate performance for all portfolios.In particular, we see that, especially on Hartmann 3, ESP is virtually un-affected until it reaches 5 digits of accuracy. Meanwhile, the progress madeby RP is hindered by the random experts which it selects as frequently asthe legitimate acquisition functions. Significantly worse is the performanceof GPHedge which, due to the initial success of the random experts, favoursthese until the horizon is reached. Note in contrast that ESP does not rely onany expert’s past performance, which makes it robust to lucky guesses andtime-varying expert performances. GPHedge could possibly be improvedby tuning the reward function it uses internally but this is a difficult andproblem dependent task.5.2.2 Geostatistics datasetsThe next set of experiments were carried out on two datasets from thegeostatistics community, referred to here as Brenda and Agromet (Clarkand Harper, 2008).6 Since these datasets consist in finite sets of points,we transformed each of them into a function that we can query at arbitrarypoints via nearest neighbour interpolation. This produces a jagged piecewiseconstant function, which is outside the smoothness class of our surrogate6Both datasets can be found at kriging.com/datasets.710 20 40 60 80 100Function evaluations−0.25−0.20−0.15−0.10−0.05Function valueBrendaEIPIThompsonRPHedgeESP0 20 40 60 80 100Function evaluations−1100−1050−1000−950−900Function valueAgrometEIPIThompsonRPHedgeESPFigure 5.4: Best observed evaluation on mining datasets Brenda andAgromet. ESP outperforms the other portfolio methods while RP is domi-nated by all others.models and hence a relatively difficult problem.Brenda is a dataset of 1,856 observations in the depths of a copper minein British Columbia, Canada, that was closed in 1990. The search domain istherefore three-dimensional and the objective function represents the quan-tity of underground copper. Agromet is a dataset of 18,188 observationstaken in Natal Highlands, South Africa, where gold grade of the soil is theobjective. The search domain here is two-dimensional.We note that PI, which has so far been an under-achiever, is among thebetter strategies on Agromet, while EI is the last. This is further moti-vation for the use of portfolios. On both of these examples, RP performspoorly whereas GPHedge fares somewhat better. We can see that ESP per-forms particularly well on Brenda. On Agromet, ESP outperforms the otherportfolio methods and is competitive with the best acquisition function—Thompson in the initial exploration phase, followed by PI after around 60evaluations.5.2.3 Control tasksOur final experiments compare the portfolio methods on two control tasks.Walker is an eight-dimensional control problem where the inputs are fed intoa simulator which returns the walking speed of a bipedal robot (Westervelt72and Grizzle, 2007). The code for the simulator is available online. Thisexample was also used in a Bayesian optimization context by Herna´ndez-Lobato et al. (2014). Repeller is a nine-dimensional control problem consid-ered in previous work on Bayesian optimization portfolios (Hoffman et al.,2011, 2009). This problem simulates a particle in a two-dimensional worldbeing dropped from a fixed starting region while accelerated downward bygravity. The task is to direct the falling particle through circular regions oflow loss by placing 3 repellers with 3 degrees of freedom each, namely theirposition in the plane and repulsion strength. Figure 5.5 reports our resultson both control tasks.On these tasks, GPHedge performed poorly. On the other hand, RP per-forms well initially on both problems but its rate of improvement then slowsdown, particularly in Walker where it falls about 10 evaluations behind.Meanwhile, ESP is the clear top performer among the portfolio methods onWalker and catches up to RP on Repeller. We have also added the RP9,GPHedge9, and ESP9 methods in this experiment to study their sensitivityto random guesses. Indeed, both RP9 and GPHedge9 exhibit noticeablypoorer performance than RP and GPHedge, while in contrast ESP9 is notsignificantly affected.The good initial performance of the random portfolios and GPHedge9could possibly be due to the high dimensionality of the search space wheremodel-based approaches can suffer from cold start. In fact, a widely usedtechnique suggested by Jones (2001) involves using a certain number ofinitial samples selected from a Latin hypercube as an initialization in orderto alleviate this problem.5.3 ConclusionIn this work, we revisited the use of portfolios for Bayesian optimization. Weintroduced a novel information-theoretic meta-criterion we call ESP, whichcan achieve performance matching or exceeding that of its best componentexpert. This is particularly important since we show in our experiments thatthe post-hoc best acquisition function varies between problem instances. We730 5 10 15 20 25 30Function evaluations−1.30−1.25−1.20−1.15−1.10Function valueWalkerEIPIThompsonRPHedgeESP0 20 40 60 80 100Function evaluations−14−12−10−8−6−4−2Function valueRepellerEIPIThompsonRPRP9HedgeHedge9ESPESP9Figure 5.5: Best observed evaluation on the Walker and Repeller controltasks. Here we see ESP outperforming the other methods on Walker. OnRepeller, ESP is once again competitive with the best methods and exhibitsmuch more robustness to poor experts as evidenced by ESP9 following theperformance of ESP so closely.have also found that ESP ranks consistently high across functions of differentdimensionality even though the members of its portfolio do not exhibit thisbehaviour. Further experiments also demonstrated that ESP is less sensitiveto poorly performing experts than the other portfolio mechanisms.We also showed that, surprisingly, a random portfolio can exhibit rea-sonably good performance, albeit not as good as ESP. Such a strategy isalso more susceptible to the inclusion of poorly performing experts thanESP, however not as susceptible as GPHedge due to the latter algorithm’sreliance on past performance as a predictor of future performance. On theother hand, ESP simply approaches vanilla entropy search as random ex-perts are added to the portfolio.Finally, we propose a mechanism for sampling representer points whichis more principled and reliable than previous approaches, which relied onslice sampling from surrogate measures. We additionally showed that thissampling mechanism can itself be used as an effective acquisition strategy,thereby extending the popular Thompson sampling approach to continuousspaces with kernel methods.74Chapter 6Unconstrained Bayesianoptimization viaregularizationThe established Bayesian optimization practice requires a user-defined bound-ing box which is assumed to contain the optimizer. However, when little isknown about the probed objective function, it can be difficult to prescribesuch bounds. In this work we modify the standard Bayesian optimizationframework in a principled way to allow automatic, data-driven resizing ofthe search space.This work is born out of the observation that the only component of stan-dard Bayesian optimization that requires a bounding box on the search spaceis the maximization of the acquisition function. Therefore we propose a wayto regularize the acquisition function to be a priori small in regions far to theobserved data. In a first step, we achieve this by prescribing a non-constantminimum improvement function in expected improvement (EI; Jones et al.,1998) or probability of improvement (PI; Kushner, 1964). In a second step,we provide an equivalent view that involves a structured prior mean on theGaussian Process. With this view, we are able to experiment with meth-ods that are not improvement based, such as entropy search (Villemonteixet al., 2009; Hennig and Schuler, 2012; Herna´ndez-Lobato et al., 2014) and75Hinge quadraticf(x)µn (x)ξ(x)Quadraticx¯−R x¯+REIEI-Hx¯−12w x¯+12wEI-Q(a) Minimum improvement viewHinge quadratic Quadraticx¯−R x¯+REI-Hx¯−12w x¯+12wEI-Q(b) Prior mean viewFigure 6.1: Visualization of the two alternate views of regularization inBayesian optimization. The objective function and posterior surrogate meanare represented as black dashed and solid lines, respectively, with greyshaded areas indicating ±2σn. Integrating the surrogate model above thetarget (orange shaded area) results in the regularized EI acquisition function(magenta and cyan). Using a non-stationary target with a constant priormean (left) or a fixed target with a non-stationary prior mean (right) leadto indistinguishable acquisition functions, which decay at infinity.Thompson sampling (Herna´ndez-Lobato et al., 2014).In this section, we introduce two methods that will result in robustnessto the choice of initial bounding box. The first is our proposed approach,which can be interpreted as a regularization of current methods, while thesecond is a simple heuristic we call volume doubling.6.1 Regularizing improvement policiesWe motivate the regularized approach by considering improvement policies,in particular, EI. However, in the next section we show that this proposedapproach can be extended more generally to all GP-derived acquisition func-tions, and in fact it is not difficult to apply the idea to other surrogatemodels, which we leave for future work.Improvement policies are a popular class of acquisition functions that76rely on the improvement functionI(x) = (f(x)− τ)I[f(x) ≥ τ ] (6.1)where τ is some target value to improve upon. Expected improvement thencompute E[I(x)] under the posterior GP.When the optimal objective value f? is known and we set τ = f?, thesealgorithms are referred to as goal seeking (Jones, 2001). When the optimumis not known, it is common to use a proxy for f?, such as the value of thebest observation so far, y+; or in the noisy setting, one can use either themaximum of the posterior mean or the value of the mean prediction µn atthe incumbent x+ ∈ arg maxx∈x1:n µn(x).In some cases, the above choice of target τ = y+ can lead to a lackof exploration, therefore it is common to choose a minimum improvementparameter ξ > 0 such that τ = (1 + ξ)y+ (for convenience here we assumey+ is positive and in practice one can subtract the overall mean to make itso). Intuitively, the parameter ξ allows us to require a minimum fractionalimprovement over the current best observation y+. Previously, the param-eter ξ had always been chosen to be constant,1 if not zero. In this workwe propose to use a function ξ : Rd 7→ R+ which maps points in the spaceto a value of fractional minimum improvement. Following the same intu-ition, the function ξ lets us require larger improvements from points that arefarther and hence acts as a regularizer that penalizes distant points. Theimprovement function hence becomes:I(x) = (f(x)− τ(x))I[f(x) ≥ τ(x)], (6.2)where the target is now a function of x:τ(x) = (1 + ξ(x))y+; (6.3)the choice of ξ(x) is discussed in the Section 6.3.1 Here we mean constant with respect to x, there has been previous work on adaptivelyscheduling this parameters.776.2 Extension to general policiesIn the formulation of the previous section, our method seems restricted toimprovement policies. However, many recent acquisition functions of in-terest are not improvement-based, such as GP-UCB, entropy search, andThompson sampling. In this section, we describe a closely related formula-tion that generalizes to all acquisition functions that are derived from a GPsurrogate model.Consider expanding our choice of non-stationary target τ in Equation (6.3)I(x) = (f(x)− y+(1 + ξ(x)))I[f(x) ≥ y+(1 + ξ(x))]= (f(x)− y+ξ(x)− y+)I[f(x)− y+ξ(x) ≥ y+]= (f˜(x)− y+)I[f˜(x) ≥ y+] (6.4)where f˜ is the posterior mean of a GP from (2.27) with prior mean µ˜0(x) =µ0(x) − y+ξ(x). Notice the similarity between (6.1) and (6.4). Indeed, inits current form we see that the regularization can be achieved simply byusing a different prior mean µ˜0 and a constant target y+. This duality canbe visualized when comparing the left and right panel of Figure 6.1.Strictly speaking, Equations (6.1) and (6.4) are not exactly equivalent.Indeed, using a surrogate GP with prior mean µ˜0, the posterior mean yieldsan additional term− k(x)T(K + σ2I)−1ξ(X), (6.5)where [ξ(X)]i = ξ(xi). This negative term will only accentuate the vanishingof expected improvement for test points x that are far from the regularizercentre. Indeed, in Figure 6.1 the two views produce indistinguishable regu-larized acquisition functions (in this case EI). However, we favour this newformulation because we can apply the same regularization to any policywhich uses a Gaussian process, namely Thompson sampling and entropysearch.78f(x)n 3d3d<n 15d15d<n 30d+2+3+1Figure 6.2: Visualization of selections xn made by the EI-H algorithm ontwo toy problems: three Gaussian modes in one (left) and two (right) di-mensions. Grey lines delimit the initial bounding box; grey square markersindicate the initial Latin hypercube points, while the orange and blue pointsdistinguish between the first and second half of the evaluation budget of 30d,respectively. In the two-dimensional example, the height of the Gaussiansare indicated by +1, +2, and +3.6.3 Choice of regularizationBy inspecting Equation (3.21), we see that any coercive2 prior mean func-tion would lead to an asymptotically vanishing EI acquisition function as||x|| → ∞. More precisely, this is due to both Φ and φ vanishing as theirarguments approach −∞. In this work, we consider two coercive regular-izing prior mean functions, namely a quadratic (Q) and an isotropic hinge-quadratic (H), defined as follows (excluding the constant bias b)ξQ(x) = (x− x¯)Tdiag(w2)−1(x− x¯), (6.6)ξH(x) = I[‖x− x¯‖2 > R](‖x− x¯‖2 −RβR)2. (6.7)Both of these regularizers are parameterized by d location parameters x¯, andwhile ξQ has an additional d width parameters w, the isotropic regularizer ξHhas a single radius R and a single β parameter, which controls the curvature2 By a coercive regularizer ξ we mean one that satisfies: lim||x||→∞ ξ(x)→ −∞.79of ξH outside the ball of radius R; in what follows we fix β = 1.Remark 4 (On fixing the prior mean hyperparameters)We are left with the choice of centre x¯ and radius parameters R (or widthsw). Unlike the bias term b, these parameters of the regularizer are notintended to allow the surrogate to better fit the observations Dn. In fact,using the marginal likelihood to estimate or marginalize ψ = {x¯, R,w},could lead to fitting the regularizer to a local mode which could trap thealgorithm in a suboptimal well. For this reason, we use an initial, temporaryuser-defined bounding box to set ψ at the beginning of the run; the value ofψ remains fixed in all subsequent iterations.Note that, while the bounding box in current Bayesian optimizationpractice is a hard constraint on the domain, our approach uses it simply togenerate a regularizer which will focus the sequential search without con-straining it. One clear advantage of our algorithm is that it adheres to thecurrent user interface, allowing practitioners to continue simply specifying areasonable range for each search dimension. This is arguably a much morenatural requirement than a multidimensional centre and width parameters.Finally, note that when users specify an arbitrary bounding box, theyare in effect fixing 2d parameters. In that respect, our algorithm requiresno more parameters than current practice, yet allows the search to progressoutside of this arbitrary box, given a large enough budget.6.4 Volume doublingFinally, let us define the volume doubling heuristic. It consists of expandingthe search space regularly as the optimization progresses, starting with aninitial user-defined bounding box. This method otherwise follows the stan-dard Bayesian optimization procedure and optimizes within the boundingbox that is available at the given time step n. This approach requires twoparameters: the number of iterations between expansions and the growthfactor γ. Naturally, to avoid growing the feasible space X by a factor thatis exponential in d, the growth factor applies to the volume of X . Finally,800.0 0.2 0.4 0.6 0.80.00.20.40.60.81.012 120507090MLP MNIST0.0 0.2 0.4 0.6 0.80.00.20.40.60.81.012 120507090CNN MNIST0.0 0.2 0.4 0.6 0.80.00.20.40.60.81.06 602.51.50.5Branin0.0 0.2 0.4 0.6 0.80.00.20.40.60.81.09 90303540MLP CIFAREI-VEI-QEI-HEIRSFigure 6.3: Best observations as optimization progresses. In the Hartmannexperiments, the known optimum is represented as a horizontal dotted line.Plotting mean and standard error over 40 (Hartmann), 25 (MLP), and 20(CNN) repetitions.the expansion is isotropic about the centre of the domain. In this work, wedouble (γ = 2) the volume every 3d evaluations (only after an initial Latinhypercube sampling of 3d points).6.5 VisualizationBefore describing our experimental results, Figure 6.2 provides a visualiza-tion of EI with the hinge-quadratic prior mean, optimizing two toy problemsin one and two dimensions. The objective functions simply consist of threedistant Gaussian modes of varying heights and the initial bounding box isset such that it does not include the optimum. We draw attention to theway the space is gradually explored outward from the initial bounding box.816.6 ExperimentsIn this section, we evaluate our proposed methods and show that theyachieve the desirable behaviour on two synthetic benchmarking functions,and a simple task of tuning the stochastic gradient descent and regulariza-tion parameters used in training a multi-layered perceptron (MLP) and aconvolutional neural network (CNN) on the MNIST dataset.Experimental protocol. For every test problem of dimension d and ev-ery algorithm, the optimization was run with an overall evaluation budgetof 30d including an initial 3d points sampled according to a Latin hypercubesampling scheme (as suggested in (Jones, 2001)). Throughout each particu-lar run, at every iteration n we record the value of the best observation upto n and report these in Figure 6.3. Experiments were repeated to reportand compare the mean and standard error of the algorithms: the syntheticexperiments were repeated 40 times, while the MNIST experiments wererepeated 25 and 20 times for the MLP and the CNN, respectively.Algorithms. We compared the two different methods above—volume dou-bling and regularization—to the standard EI with a fixed bounding box.Common random seeds were used for all methods in order to reduce con-founding noise. All algorithms were implemented in the pybo frameworkavailable on github,3 and are labelled in the following figures as follows:EI: Vanilla expected improvement with hyperparameter marginalization viaMCMC.EI-V: Expected improvement with the search volume doubled every 3diterations.EI-H: Regularized EI with a hinge-quadratic prior mean with β = 1 andR fixed by the circumradius of the initial bounding box.3 https://github.com/mwhoffman/pybo82EI-Q: Regularized EI with a quadratic prior mean where the widths w arefixed to those of the initial bounding box.RS: As an additional benchmark, on the neural network tuning tasks,we considered a random selection strategy, which uniformly sampledwithin the user-defined bounding box.Note that for the regularized methods EI-H/Q, the initial bounding boxis only used to fix the location and scale of the regularizers, and to sampleinitial query points. In particular, both regularizers are centred aroundthe box centre. For the quadratic regularizer the width of the box in eachdirection is used to fix w, whereas for the hinge-quadratic R is set to thebox circumradius. Once these parameters are fixed, the bounding box is nolonger relevant, and more importantly, the algorithm is free to select pointsoutside of it, see Figure 6.2 and 6.4 for example.6.6.1 Synthetic benchmarks: HartmannThe Hartmann 3 and 6 functions (numbers refer to their dimensionality) arestandard, synthetic global optimization benchmarking test functions withknown global optima. These are typically optimized in the unit hypercube[0, 1]d, as we do in our Hartmann3 and 6 experiments.In a separate experiment, indicated by an appended asterisk (e.g. Hartmann3∗),we consider an initial bounding box of side length 0.2 centred uniformly atrandom within the unit hypercube. Each of the 40 repetitions of this exper-iment fixed a different such domain for all algorithms. The smaller domainhas a 0.2d probability of including the global optimum, especially unlikelyin the six-dimensional problem. This experiment is meant to test whetherour proposed methods are capable of useful exploration outside the initialbounding box and further compare them in such a situation.6.6.2 MLP and CNN on MNISTThe MNIST hand-written digit recognition dataset is a very common taskfor testing neural network methods and architectures. Neural networks are83LearningrateMomentumL1 reg.Figure 6.4: Pairwise scatterplot of selections (upper triangle) and recom-mendations (lower triangle) for the MLP-MNIST experiment. For example,the second plot of the first row corresponds to a scatter plot of the selectedlearning rates vs. momentum parameters for a single seed. In contrast, thefirst plot of the second row corresponds to a scatter plot of the recommendedlearning rates and momentum parameters over all runs. The initial bound-ing box and sample points (for this particular run) are shown as a blackrectangle and black square dots, respectively. All other points respect thecolour scheme of Figure 6.3. (the `2 regularization parameters were croppedout for space considerations.)usually trained using some variant of stochastic gradient descent (SGD).The hyperparameters can impact both the speed of convergence and thequality of the trained network. We consider an MLP with 2048 hidden unitswith tanh non-linearities, and a CNN with two convolutional layers. Theseexamples were taken from the official GitHub repository of torch demos.4The code written for this work can be readily extended to any other demoin the repository or in fact any script that can be run from the shell.In this experiment, we optimize four parameters of the SGD optimizer,4https://github.com/torch/demos84namely the learning rate and momentum, and the `1 and `2 regularizationcoefficients. The parameters were optimized in log space (base e) with aninitial bounding box of [−3,−1] × [−3,−1] × [−3, 1] × [−3, 1], respectively.For each parameter setting, a black-box function evaluation corresponds toeither training the MLP for 5 epochs or the CNN for 3, and returning thetest set accuracy. To be clear, the goal of this experiment is not to achievestate-of-the-art for this classification task but instead to demonstrate thatour proposed algorithms can find optima well outside their initial boundingboxes.6.6.3 ResultsFigure 6.3 shows that for the Hartmann tests, the proposed Bayesian opti-mization approaches work well in practice. The results confirm our hypoth-esis that the proposed methods are capable of useful exploration outside theinitial bounding box. We note that when using the entire unit hypercube asthe initial box, all the Bayesian optimization techniques exhibit similar per-formance as in this case the optimum is within the box. The Hartmann testsalso show that the volume doubling heuristic is a good baseline method; andthe plateaus suggest that this method warrants further study in, perhapsadaptive, scheduling strategies. Although it is less effective than EI-H as thedimensionality increases, it is nonetheless an improvement over standard EIin all cases.The MNIST experiment shows good performance from all three methodsEI-{V,H,Q}, particularly from the hinge-quadratic regularized algorithm.Indeed, when compared to the standard EI, EI-H boasts over 20% improve-ment in accuracy on the MLP and almost 10% on the CNN.We believe that EI-H performs particularly well in settings where a smallinitial bounding box is prescribed because the hinge-quadratic regularizerallows the algorithm to explore outward more quickly. In contrast, EI-Qperforms better when the optimum is included in the initial box; we suspectthat this is due to the fact that the regularizer avoids selecting boundary andcorner points, which EI and EI-V tend to do, as can be seen in Figure 6.4.85Figure 6.4 demonstrates that the algorithm in fact explores points out-side the initially suggested bounded domain (drawn as a black rectangle).Indeed, while the green dots (EI-V) follow the corners of the growing bound-ing box, the magenta and cyan dots of EI-H/Q, respectively, do not exhibitthis artifact.6.7 ConclusionIn this work, we propose a new versatile approach to Bayesian optimizationwhich is less sensitive to the initial choice of bounds because the algorithm iscapable of growing its search space adaptively to an arbitrary size. Indeed,given a small initial bounding box that does not include the optimum, wehave demonstrated that our approach can expand its region of interest andreach the optimum. This method has a proximal flavour; to select the nextpoint, we seek a maximizer of the acquisition function that is not too farfrom the current neighbourhood.Our method fits seamlessly within the current Bayesian optimizationframework, and can be readily used with any acquisition function which isinduced by a GP. Furthermore, all the extra hyperparameters but one (β)can readily be learned using existing hyperparameter sampling mechanisms.86Chapter 7DiscussionThis thesis contributed to both major components of the Bayesian optimiza-tion framework: modelling and decision-making. We validated our methodsby extensive experimenting on ubiquitous synthetic test functions, real-worldbenchmarking tasks, as well as some new tasks we introduced. Moreover,throughout the thesis work, increasing effort was put on making our ex-periments reproducible and our software available online as free and opensource.The exposition of the background material, most of which can be foundin our recent review paper (Shahriari et al., 2016b), often focused on hy-perparameter tuning, which is of particular interest to the machine learningcommunity. Nevertheless, the applicability of the method is much broaderand we anticipate the model-based sequential optimization techniques re-viewed in this thesis to find uses in many scientific and engineering fields.7.1 Decision-making: unifying BayesGap and ESPSince our work on ESP, the approach of Herna´ndez-Lobato et al. (2014)known as PES has proven to be an efficient alternative to the Monte Carloapproximations carried out in Chapter 5. Our results would likely exhibitmuch less variance and be much more efficient using PES. One may askwhether it is worth maintaining a portfolio of candidates {xk} when it is87possible to efficiently optimize the entropy directly over the entire inputspace X . However, as we have seen in Chapter 4 as well as in the recent workof Russo and Van Roy (2014b) on information-directed sampling (IDS), com-bining regret-minimizing policies with more exploratory information-greedyapproaches can be very effective. Indeed, similar to BayesGap, ESP sequen-tially reduces the entire search space X to a set of candidates, reminiscent ofJ and j in the BayesGap algorithm. The next query point is then selectedamong them using a measure of uncertainty, entropy in the case of ESP andposterior variance in the case of BayesGap.Applied to parallel queries. The problem of optimization given a par-allel query budget is often a much more realistic setting than the strictlysequential one. In many cases, the objective function can be evaluatedmultiple times in parallel, for instance, one often has access to a clusterof available compute nodes in order to parallelize hyperparameter tuning.There has been much recent work in batch Bayesian optimization where ateach iteration the agent chooses a set of inputs to be queried next. Becausethis auxiliary task is a combinatorial optimization problem, most presentsolutions simulate the sequential process either by hallucinating pendingobservations (Azimi et al., 2010; Snoek et al., 2012), approximating the in-tractable maximum of a sequence of acquisition functions (Janusevskis et al.,2012), or conditioning on the predictive mean at pending query points (De-sautels et al., 2014). This practice artificially reduces the variance at pendinglocations, which in turn causes these algorithms to output a diverse set ofpoints. Moreover, for randomized acquisition functions such as Thompsonsampling and its approximate extension to continuous spaces discussed inAppendix B, the algorithm can simply be run multiple times with differentpseudo-random generator seeds to simply produce a batch Thompson sam-pling strategy. This last approach is an interesting one which, to the bestof our knowledge, has yet to be studied in the literature.For the approaches that simulate the sequential search, the acquisitionfunctions that are used are designed to select the next query point so as totrade off exploration and exploitation for the next observation. Whereas this88makes sense when the optimizer can only run a single experiment, there maybe a more optimal search strategy when multiple experiments can be run inparallel. In fact the burden of trading off exploration and exploitation is nolonger on each individual experiment but on the collective, which allows theindividual experiments to be more aggressively exploratory or exploitative.Our work on BayesGap and portfolio methods, provide a good way to obtaina diverse set or points, due to the different behaviours of different acquisitionfunctions.In practice, the recommendation strategy is to select the incumbent,which is the best observed query point (after smoothing). Such a recom-mendation strategy would benefit from dedicating a portion of the querybatch to pure exploitation, which will result in refining the surrogate modelaround the current incumbent.7.2 Contributions to the practiceUntil very recently, the established Bayesian optimization practice requireda user-defined bounded domain, which is assumed to contain the global op-timizer. However, raised as an important remaining challenge at the 2014NIPS workshop on Bayesian optimization, practitioners are often hesitantto prescribe a bounding box for the optimization domain because they areunwilling to completely discard large areas of the search domain. This is anunnecessary barrier to adoption that our work on unconstrained Bayesianoptimization (Shahriari et al., 2016a) addressed via an intuitive regulariza-tion technique. While this shifts the difficulty of selecting a bounding boxto that of choosing a regularizer, we proposed a simple and effective way toderive a regularizer from user-defined search ranges, which are often moreintuitively set. Our experiments show that this approach is much more for-giving to a poor choices of search ranges as it can exceed them if necessary.Meanwhile setting hard constraints is still just as easy. Furthermore, thoughwe experimented with expected improvement and Gaussian process models,the simplicity of the approach means that it can be extended to many moremodels and acquisition functions.89Indeed, that work featured an important experimental component, andthe regularized methods proved effective on the task of tuning the stochas-tic gradient descent optimizer of two neural network architectures on twofamous object recognition datasets: MNIST and CIFAR.7.3 Future work: first order Bayesian optimiza-tionRecently, we have also been exploring incorporating first order information,such as gradients and directional derivatives in the Bayesian optimizationsurrogate model. Although the use of Gaussian processes with derivativeinformation is not new (see for example Rasmussen and Williams, 2006;Lizotte, 2008; Osborne, 2010) recent work by Maclaurin et al. (2015) hasprovided renewed motivation for this line of research. Indeed, in the contextof tuning the hyperparameters of a neural network, they are capable of back-propagating through gradient descent iterations in order to obtain gradientsof the empirical loss with respect to said hyperparameters (Maclaurin et al.,2015).Whereas they use this so-called “hypergradient” to tune the hyperpa-rameters via gradient descent, we are exploring how much the gradientscan accelerate the convergence of Bayesian optimization. In addition, sinceincluding gradient information increases the computational complexity ofexact GP inference by a factor of d3 where d is the number of hyperparam-eters, we are also exploring including directional derivatives, which wouldincrease complexity by a fixed factor independent of d. Eventually, we wishto prove convergence rates for Bayesian optimization that are better thanthe existing results (Srinivas et al., 2010; Bull, 2011) thanks to derivativeinformation, which will hopefully resemble those of gradient descent.As a practical application, consider training a recurrent neural networks(RNN). Indeed, training techniques such as backpropagation through time(BPTT) are known to produce noisy gradients and lead to error magnifica-tion due to the recurrent nature of the model. As mentioned above, Bayesian90optimization can elegantly incorporate such noisy gradients and derivatives,as well as all the observations of the empirical loss, hopefully leading tofaster convergence to better (global) optima. However, the challenge forBayesian optimization is to scale to the high-dimensional search space andthe many observations such optimization will certainly require.Fortunately, there has been much interest in compressing neural net-works using Fourier or cosine transforms, e.g., work by Koutnik et al. (2010)and more recently the ACDC work of Moczulski et al. (2015). Such tech-niques could potentially reduce the dimensionality of a neural network of asignificant and useful size down to a manageable number in the hundredsand hopefully dozens. Meanwhile, in the Bayesian optimization community,Kandasamy et al. (2015) have used additive kernels successfully to optimizefunctions of dozens of variables; and Kandasamy’s latest unpublished model,SALSA (Shrunk Additive Least Squares Approximation), seems to improveon this even further. These two approaches could be combined to train re-current neural networks using Bayesian optimization, in other words, learna good set of weights for the neural network. Note that this is in contrast toexisting Bayesian optimization work that aims to tune the hyperparametersrather than the weights.7.4 Final word: from academia to societal impactMany real world implementations could—and, in our opinion, should—incorporate the Bayesian optimization framework of (i) distilling observeddata into interpretable probabilistic surrogate models to facilitate (ii) prin-cipled, (near-)optimal, justifiable, and (semi-)automatic decision-making.This is often a challenging automation task because it involves interven-tion, and people are understandably reluctant to surrender such agency toan automatic mechanism. Hence our emphasis on interpretable models andjustifiable decision-making, that will likely, at least initially, be involve someform of human oversight: driver-less cars are a timely example. Large scaleprojects such as kernel learning (Wilson, 2014) and the automatic statisti-cian Duvenaud et al. (2013) are important steps towards statistical inter-91pretability. Combined with Bayesian optimization, these solutions have thepotential to have a tremendous positive impact on accurate, efficient, anddata-driven medical diagnostic tools to aid medical staff, and similarly foreducation and environmental management initiatives.92BibliographyY. Abbasi-Yadkori, D. Pa´l, and C. Szepesva´ri. Improved algorithms forlinear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett,F. Pereira, and K. Q. Weinberger, editors, Advances in Neural InformationProcessing Systems 24, pages 2312–2320. Curran Associates, Inc., 2011.18, 40S. Agrawal and N. Goyal. Thompson sampling for contextual bandits withlinear payoffs. In International Conference on Machine Learning, 2013.41C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introductionto MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.ISSN 0885-6125. 23S. Arlot, A. Celisse, et al. A survey of cross-validation procedures for modelselection. Statistics surveys, 4:40–79, 2010. 56J. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010. 54P. Auer. Using confidence bounds for exploitation-exploration trade-offs.Journal of Machine Learning Research, 3:397–422, 2003. 18, 40, 54P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in arigged casino: The adversarial multi-armed bandit problem. In Sympo-sium on Foundations of Computer Science, pages 322–331, 1995. 1493J. Azimi, A. Fern, and X. Fern. Batch bayesian optimization via simulationmatching. NIPS, 2010. 88J. Azimi, A. Jalali, and X. Fern. Hybrid batch bayesian optimization. InICML, 2012. 2J. Bergstra, R. Bardenet, Y. Bengio, and B. Ke´gl. Algorithms for hyper-parameter optimization. In Advances in Neural Information ProcessingSystems, pages 2546–2554, 2011. 2, 55S. Bochner. Lectures on Fourier integrals. Princeton University Press, 1959.42, 112L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001. 121E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with dis-crete choice data. In Advances in Neural Information Processing Systems,pages 409–416, 2007. 2, 55E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian op-timization of expensive cost functions, with application to active usermodeling and hierarchical reinforcement learning. Technical Report UBCTR-2009-23 and arXiv:1012.2599v1, Dept. of Computer Science, Univer-sity of British Columbia, 2009. 2, 70E. Brochu, T. Brochu, and N. de Freitas. A Bayesian interactive opti-mization approach to procedural animation design. In Proceedings of the2010 ACM SIGGRAPH/Eurographics Symposium on Computer Anima-tion, pages 103–112, 2010. 2, 55S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed ban-dits problems. In Algorithmic Learning Theory, pages 23–37. Springer,2009. 33, 44S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari. X-armed bandits. Jour-nal of Machine Learning Research, 12:1655–1695, 2011. 13, 1894A. D. Bull. Convergence rates of efficient global optimization algorithms.Journal of Machine Learning Research, 12:2879–2904, 2011. 38, 90P. Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503–514, 1989. 8, 56R. Calandra, N. Gopalan, A. Seyfarth, J. Peters, and M. Deisenroth.Bayesian gait optimization for bipedal locomotion. In Learning and In-telligent OptimizatioN (LION8), pages 274–290, 2014. 2N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cam-bridge University Press, New York, 2006. 8K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. J.of Stat. Science, 10:273–304, 1995. iiO. Chapelle and L. Li. An empirical evaluation of Thompson sampling.In Advances in Neural Information Processing Systems, pages 2249–2257,2011. 2, 23, 41H. Chen, J. L. Loeppky, J. Sacks, W. J. Welch, et al. Analysis methodsfor computer experiments: How to assess and what counts? StatisticalScience, 31(1):40–60, 2016. 27I. Clark and W. V. Harper. Practical Geostatistics 2000: Case Studies.Ecosse North America, 2008. 71A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A uni-fied framework for classification, regression, density estimation, manifoldlearning and semi-supervised learning. Foundations and Trends in Com-puter Graphics and Vision, 7(2–3):81–227, 2011. 121V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimizationunder bandit feedback. In Conference on Learning Theory, pages 355–366, 2008. 18, 4095N. de Freitas, A. Smola, and M. Zoghi. Exponential regret bounds forGaussian process bandits with deterministic observations. In InternationalConference on Machine Learning, 2012. 40P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers.Journal of the Royal Statistical Society: Series B (Statistical Methodol-ogy), 68(3):411–436, 2006. 128T. Desautels, A. Krause, and J. Burdick. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. Journalof Machine Learning Research, 2014. 40, 88S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid MonteCarlo. Physics letters B, 195(2):216–222, 1987. 125D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani.Structure discovery in nonparametric regression through compositionalkernel search. In Proceedings of the 30th International Conference onMachine Learning, June 2013. 91P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy forcorrelated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009. 36V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-banditbest arm identification. In Advances in Neural Information ProcessingSystems 24: 25th Annual Conference on Neural Information Process-ing Systems 2011. Proceedings of a meeting held 12-14 December 2011,Granada, Spain., pages 2222–2230, 2011. 45V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: Aunified approach to fixed budget and fixed confidence. In Advances inNeural Information Processing Systems, pages 3212–3220, 2012. ii, 8, 45,49, 54, 105R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for96sensor set selection. In ACM/IEEE International Conference on Infor-mation Processing in Sensor Networks, pages 209–219. ACM, 2010. 2J. C. Gittins. Bandit processes and dynamic allocation indices. Journal ofthe Royal Statistical Society. Series B (Methodological), pages 148–177,1979. ii, 7, 35J. C. Gittins, K. Glazebrook, and R. Weber. Multi-armed bandit allocationindices. Wiley, 2011. 35R. B. Gramacy and N. G. Polson. Particle learning of Gaussian processmodels for sequential design and optimization. Journal of Computationaland Graphical Statistics, 20(1):102–118, 2011. 127, 129F. Hamze, Z. Wang, and N. de Freitas. Self-avoiding random dynamics oninteger complex systems. ACM Transactions on Modeling and ComputerSimulation (TOMACS), 23(1):9, 2013. 55P. Hennig and C. J. Schuler. Entropy search for information-efficient globaloptimization. The Journal of Machine Learning Research, pages 1809–1837, 2012. 9, 42, 43, 60, 65, 75J. M. Herna´ndez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictiveentropy search for efficient global optimization of black-box functions. InAdvances in Neural Information Processing Systems. 2014. 42, 43, 60, 61,66, 73, 75, 76, 87M. W. Hoffman, H. Kueck, N. de Freitas, and A. Doucet. New inferencestrategies for solving Markov decision processes using reversible jumpMCMC. In Uncertainty in Artificial Intelligence, pages 223–231, 2009.73M. W. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation forBayesian optimization. In Uncertainty in Artificial Intelligence, pages327–336, 2011. ii, 9, 59, 67, 69, 7397M. W. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budgetconstraints in model-based bandit optimization with application to auto-matic machine learning. In AI and Statistics, pages 365–374, 2014. iii, 2,8F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based op-timization for general algorithm configuration. In LION, pages 507–523,2011. 2, 121F. Hutter, H. H. Hoos, and K. Leyton-Brown. Parallel algorithm configura-tion. In LION, pages 55–70, 2012. 55J. Janusevskis, R. Le Riche, D. Ginsbourger, and R. Girdziusas. Learn-ing and Intelligent Optimization: 6th International Conference, LION6, Paris, France, January 16-20, 2012, Revised Selected Papers, chapterExpected Improvements for the Asynchronous Parallel Global Optimiza-tion of Expensive Functions: Potentials and Challenges, pages 413–418.Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-34413-8. doi: 10.1007/978-3-642-34413-8 37. URL http://dx.doi.org/10.1007/978-3-642-34413-8_37. 88D. R. Jones. A taxonomy of global optimization methods based on responsesurfaces. J. of Global Optimization, 21(4):345–383, 2001. 7, 38, 39, 70,73, 77, 82D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization ofexpensive black-box functions. J. of Global optimization, 13(4):455–492,1998. 2, 7, 38, 75K. Kandasamy, J. Schneider, and B. Poczos. High dimensional bayesianoptimisation and bandits via additive models. In International Conferenceon Machine Learning, 2015. 91E. Kaufmann, O. Cappe´, and A. Garivier. On bayesian upper confidencebounds for bandit problems. In International Conference on ArtificialIntelligence and Statistics, pages 592–600, 2012a. 40, 51, 5498E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptot-ically optimal finite-time analysis. In Algorithmic Learning Theory, vol-ume 7568 of Lecture Notes in Computer Science, pages 199–213. SpringerBerlin Heidelberg, 2012b. 41J. Koutnik, F. Gomez, and J. Schmidhuber. Evolving neural networks incompressed weight space. In Proceedings of the 12th Annual Conferenceon Genetic and Evolutionary Computation, GECCO ’10, pages 619–626,2010. 91D. G. Krige. A statistical approach to some basic mine valuation problemson the witwatersrand. Journal of Chemical, Metallurgical, and MiningSociety of South Africa, 52(6), 1951. ii, 7H. J. Kushner. A new method of locating the maximum point of an arbitrarymultipeak curve in the presence of noise. Journal of Fluids Engineering,86(1):97–106, 1964. 7, 38, 75T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules.Advances in applied mathematics, 6(1):4–22, 1985. 7, 33, 39M. La´zaro-Gredilla, J. Quin˜onero-Candela, C. E. Rasmussen, and A. R.Figueiras-Vidal. Sparse spectrum Gaussian process regression. The Jour-nal of Machine Learning Research, 11:1865–1881, 2010. 42, 119, 120D. V. Lindley. On a measure of the information provided by an experiment.The Annals of Mathematical Statistics, pages 986–1005, 1956. ii, 37J. S. Liu and R. Chen. Sequential monte carlo methods for dynamic systems.Journal of the American statistical association, 93(443):1032–1044, 1998.127D. Lizotte. Practical Bayesian Optimization. PhD thesis, University ofAlberta, Canada, 2008. 55, 70, 90D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gaitoptimization with Gaussian process regression. In Proc. of IJCAI, pages944–949, 2007. 299D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based hyperpa-rameter optimization through reversible learning. In Proceedings of the32nd International Conference on Machine Learning, July 2015. 90N. Mahendran, Z. Wang, F. Hamze, and N. de Freitas. Adaptive MCMCwith Bayesian optimization. Journal of Machine Learning Research -Proceedings Track, 22:751–760, 2012. 2, 55R. Marchant and F. Ramos. Bayesian optimisation for intelligent environ-mental monitoring. In NIPS workshop on Bayesian Optimization andDecision Making, 2012. 2O. Maron and A. W. Moore. Hoeffding races: Accelerating model selectionsearch for classification and function approximation. Robotics Institute,page 263, 1993. 18, 55R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos. Ac-tive policy learning for robot planning and exploration under uncertainty.Robotics Science and Systems, 2007. 2, 55V. Mnih, C. Szepesva´ri, and J.-Y. Audibert. Empirical Bernstein stopping.In Proceedings of the 25th International Conference on Machine Learning,ICML ’08, pages 672–679. ACM, 2008. ISBN 978-1-60558-205-4. doi:10.1145/1390156.1390241. 18J. Mocˇkus, V. Tiesis, and A. Zˇilinskas. The application of bayesian methodsfor seeking the extremum. In L. Dixon and G. Szego, editors, TowardGlobal Optimization, volume 2. Elsevier, 1978. 7, 38M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas. Acdc: A structuredefficient linear layer. Technical report, University of Oxford, 2015. 91R. Munos. Optimistic optimization of a deterministic function without theknowledge of its smoothness. In Advances in Neural Information Process-ing Systems, pages 783–791, 2011. 18100I. Murray and R. P. Adams. Slice sampling covariance hyperparameters oflatent Gaussian models. In Advances in Neural Information ProcessingSystems, pages 1732–1740, 2010. 125R. M. Neal. Slice sampling. Annals of statistics, pages 705–741, 2003. 126J. Neufeld, A. Gyorgy, D. Schuurmans, and C. Szepesvari. Adaptive Monte-Carlo via bandit allocation. In Proceedings of the International Conferenceon Machine Learning, ICML, 2014. 59M. Osborne. Bayesian Gaussian Processes for Sequential Prediction, Opti-misation and Quadrature. PhD thesis, University of Oxford, Oxford, UK,2010. 90J. Quin˜onero-Candela and C. E. Rasmussen. A unifying view of sparse ap-proximate Gaussian process regression. The Journal of Machine LearningResearch, 6:1939–1959, 2005. 119A. Rahimi and B. Recht. Random features for large-scale kernel machines.In Advances in Neural Information Processing Systems, pages 1177–1184,2007. 42, 112C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for MachineLearning. The MIT Press, 2006. 23, 26, 90J. Rosenthal. A First Look at Rigorous Probability Theory. World Scientific,2006. ISBN 9789812703705. 12D. Russo and B. Van Roy. Learning to optimize via posterior sampling.Mathematics of Operations Research, 39(4):1221–1243, 2014a. 8, 41D. Russo and B. Van Roy. Learning to optimize via information-directedsampling. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, andK. Q. Weinberger, editors, Advances in Neural Information ProcessingSystems 27, pages 1583–1591. 2014b. 88J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysisof computer experiments. Statistical science, pages 409–423, 1989. ii, 2, 7101S. L. Scott. A modern Bayesian look at the multi-armed bandit. AppliedStochastic Models in Business and Industry, 26(6):639–658, 2010. 2, 23,41M. Seeger, C. Williams, and N. Lawrence. Fast forward selection to speed upsparse Gaussian process regression. In Artificial Intelligence and Statistics,volume 9, 2003. 118, 119B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Coˆte´, and N. de Fre-itas. An entropy search portfolio. In NIPS workshop on Bayesian Opti-mization, 2014. 9, 10, 42B. Shahriari, A. Bouchard-Coˆte´, and N. de Freitas. Unbounded Bayesianoptimization via regularization. In AISTATS, 2016a. iii, 10, 89B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Tak-ing the human out of the loop: A review of bayesian optimization. InProceedings of the IEEE, volume 104, pages 1–28, 2016b. iii, 87H. Shen, W. J. Welch, and J. M. Hughes-Oliver. Efficient, adaptive cross-validation for tuning and comparing models, with application to drugdiscovery. The Annals of Applied Statistics, pages 2668–2687, 2011. 55E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages1257–1264, 2005. 118, 119J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimiza-tion of machine learning algorithms. In Advances in Neural InformationProcessing Systems, pages 2951–2959, 2012. 2, 39, 55, 68, 88, 125M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear ban-dits. In Advances in Neural Information Processing Systems, pages 828–836, 2014. 8N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian processoptimization in the bandit setting: No regret and experimental design. In102International Conference on Machine Learning, pages 1015–1022, 2010.2, 8, 13, 40, 51, 52, 53, 54, 69, 90K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization.In Advances in Neural Information Processing Systems, pages 2004–2012,2013. 2W. R. Thompson. On the likelihood that one unknown probability exceedsanother in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. ii, 6, 13, 40C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA:Combined selection and hyperparameter optimization of classification al-gorithms. In Knowledge Discovery and Data Mining, pages 847–855, 2013.2M. K. Titsias. Variational learning of inducing variables in sparse Gaus-sian processes. In International Conference on Artificial Intelligence andStatistics, pages 567–574, 2009. 119M. Valko, A. Carpentier, and R. Munos. Stochastic simultaneous optimisticoptimization. In International Conference on Machine Learning, 2013. 18H. P. Vanchinathan, I. Nikolic, F. De Bona, and A. Krause. Explore-exploitin top-N recommender systems via Gaussian processes. In Proceedings ofthe 8th ACM Conference on Recommender systems, pages 225–232. ACM,2014. 40J. Villemonteix, E. Vazquez, and E. Walter. An informational approach tothe global optimization of expensive-to-evaluate functions. J. of GlobalOptimization, 44(4):509–534, 2009. 9, 42, 43, 60, 65, 75Z. Wang, S. Mohamed, and N. de Freitas. Adaptive Hamiltonian and Rie-mann manifold Monte Carlo samplers. In International Conference onMachine Learning, pages 1462–1470, 2013a. 2, 55103Z. Wang, M. Zoghi, D. Matheson, F. Hutter, and N. de Freitas. Bayesianoptimization in high dimensions via random embeddings. In InternationalJoint Conference on Artificial Intelligence, pages 1778–1784, 2013b. 2E. Westervelt and J. Grizzle. Feedback Control of Dynamic Bipedal RobotLocomotion. Control and Automation Series. CRC PressINC, 2007. ISBN9781420053722. 72A. G. Wilson. Covariance kernels for fast automatic pattern discovery andextrapolation with Gaussian processes. PhD thesis, University of Cam-bridge, 2014. 91104Appendix ABayesGap theoryA.1 Proof of Theorem 2The proof of this section and the lemmas of the next section follow from theproofs of Gabillon et al. (2012). The modifications we have made to thisproof correspond to the introduction of the function gk which bounds theuncertainty sk in order to make it simpler to introduce other models. Wealso introduce a sufficient condition on this bound, i.e. that it is monotoni-cally decreasing in N in order to bound the arm pulls with respect to g−1k .Ultimately, this form of the theorem reduces the problem of of proving aregret bound to that of checking a few properties of the uncertainty model.Theorem 2Consider a bandit problem with horizon T and K arms. Let Uk(t) and Lk(t)be upper and lower bounds that hold for all times t ≤ T and all arms k ≤ Kwith probability 1− δ. Finally, let gk be a monotonically decreasing functionsuch that sk(t) ≤ gk(Nk(t − 1)) and∑k g−1k (Hk) ≤ T − K. We can thenbound the simple regret asP(ST ≤ ) ≥ 1−KTδ. (A.1)Proof. We will first define the event E such that on this event every meanis bounded by its associated bounds for all times t. More precisely we can105write this asE = {∀k ≤ K,∀t ≤ T, Lk(t) ≤ zk ≤ Uk(t)}. (A.2)By definition, these bounds are given such that the probability of deviatingfrom a single bound is δ. Using a union bound we can then bound theprobability of remaining within all bounds as P(E) ≥ 1−KTδ.We will next condition on the event E and assume regret of the formST > in order to reach a contradiction. Upon reaching said contradictionwe can then see that the simple regret must be bounded by with probabilitygiven by the probability of event E , as stated above. As a result we needonly show that a contradiction occurs.We will now define tˆ = arg mint≤T BJ(t)(t) as the time at which therecommended arm attains the minimum bound, i.e. aˆT = J(tˆ) as defined in(4.5). Let tk ≤ T be the last time at which arm k is pulled. Note that eacharm must be pulled at least once due to the initialization phase. We canthen show the following sequence of inequalities:min(0, sk(tk)−∆k) + sk(tk) ≥ BJ(tk)(tk) (a)≥ BΩT (tˆ) (b)≥ RΩT (c)> . (d)Of these inequalities, (a) holds by Lemma B3, (c) holds by Lemma B1,and (d) holds by our assumption on the simple regret. The inequality (b)holds due to the definition aˆT and time tˆ. Note, that we can also write thepreceding inequality as two casessk(tk) > 2sk(tk)−∆k > , if ∆k > sk(tk); (A.3)2sk(tk)−∆k ≥ sk(tk) > , if ∆k ≤ sk(tk) (A.4)106This leads to the following bound on the confidence diameter,sk(tk) > max(12(∆k + ), ) = Hk (A.5)which can be obtained by a simple manipulation of the above equations.More precisely we can notice that in each case, sk(tk) upper bounds both and 12(∆k + ), and thus it obviously bounds their maximum.Now, for any arm k we can consider the final number of arm pulls, whichwe can write asNk(T ) = Nk(tk − 1) + 1 ≤ g−1(sk(tk)) + 1 (A.6)< g−1(Hk) + 1. (A.7)This holds due to the definition of g as a monotonic decreasing function,and the fact that we pull each arm at least once during the initializationstage. Finally, by summing both sides with respect to k we can see that∑k g−1(Hk)+K > T , which contradicts our definition of g in the Theoremstatement.A.2 LemmasIn order to simplify notation in this section, we will first introduce B(t) =mink Bk(t) as the minimizer over all gap indices for any time t. We will alsonote that this term can be rewritten asB(t) = BJ(t)(t) = Uj(t)(t)− LJ(t)(t), (A.8)which holds due to the definitions of j(t) and J(t).Lemma B1For any sub-optimal arm k 6= a?, any time t ∈ {1, . . . , T}, and on eventE, the immediate regret of pulling that arm is upper bounded by the indexquantity, i.e. Bk(t) ≥ Rk.Proof. We can start from the definition of the bound and expand this term107asBk(t) = maxi 6=kUi(t)− Lk(t) (A.9)≥ maxi 6=kzi − zk = z? − zk = Rk. (A.10)The first inequality holds due to the assumption of event E , whereas thefollowing equality holds since we are only considering sub-optimal arms, forwhich the best alternative arm is obviously the optimal arm.Lemma B2For any time t let k = at be the arm pulled, for which the following state-ments hold:if k = j(t), then Lj(t)(t) ≤ LJ(t)(t), (A.11)if k = J(t), then Uj(t)(t) ≤ UJ(t)(t). (A.12)Proof. We can divide this proof into two cases based on which of the twoarms is selected.Case 1: let k = j(t) be the arm selected. We will then assume thatLj(t)(t) > LJ(t)(t) and show that this is a contradiction. By definition of thearm selection rule we know that sj(t)(t) ≥ sJ(t)(t), from which we can easilydeduce that Uj(t)(t) > UJ(t)(t) by way of our first assumption. As a resultwe can see thatBj(t)(t) = maxj 6=j(t)Uj(t)− Lj(t)(t) (A.13)< maxj 6=J(t)Uj(t)− LJ(t)(t) = BJ(t)(t). (A.14)This inequality holds due to the fact that arm j(t) must necessarily have thehighest upper bound over all arms. However, this contradicts the definitionof J(t) and as a result it must hold that Lj(t)(t) ≤ LJ(t)(t).Case 2: let k = J(t) be the arm selected. The proof follows the sameformat as that used for k = j(t).108Corollary B2If arm k = at is pulled at time t, then the minimum index is bounded aboveby the uncertainty of arm k, or more preciselyB(t) ≤ sk(t). (A.15)Proof. We know that k must be restricted to the set {j(t), J(t)} by defini-tion. We can then consider the case that k = j(t), and by Lemma B2 weknow that this imposes an order on the lower bounds of each possible arm,allowing us to writeB(t) ≤ Uj(t)(t)− Lj(t)(t) = sj(t)(t) (A.16)from which our corollary holds. We can then easily see that a similarargument holds for k = J(t) by ordering the upper bounds, again viaLemma B2.Lemma B3On event E, for any time t ∈ {1, . . . , T}, and for arm k = at the followingbound holds on the minimal gap,B(t) ≤ min(0, sk(t)−∆k) + sk(t). (A.17)Proof. In order to prove this lemma we will consider a number of cases basedon which of k ∈ {j(t), J(t)} is selected and whether or not one or neitherof these arms corresponds to the optimal arm a?. Ultimately, this results insix cases, the first three of which we will present are based on selecting armk = j(t).Case 1: consider a? = k = j(t). We can then see that the followingsequence of inequalities holds,z(2)(a)≥ zJ(t)(t)(b)≥ LJ(t)(t)(c)≥ Lj(t)(t)(d)≥ zk − sk(t). (A.18)Here (b) and (d) follow directly from event E and (c) follows from Lemma B2.109Inequality (a) follows trivially from our assumption that k = a?, as a resultJ(t) can only be as good as the 2nd-best arm. Using the definition of ∆kand the fact that k = a?, the above inequality yieldssk(t)− (zk − z(2)) = sk(t)−∆k ≥ 0 (A.19)Therefore the min in the result of Lemma B3 vanishes and the result followsfrom Corollary B2.Case 2: consider k = j(t) and a? = J(t). We can then writeB(t) = Uj(t)(t)− LJ(t)(t) (A.20)≤ zj(t)(t) + sj(t)(t)− zJ(t)(t) + sJ(t)(t) (A.21)≤ zk − z? + 2sk(t) (A.22)where the first inequality holds from event E , and the second holds becauseby definition the selected arm must have higher uncertainty. We can thensimplify this as= 2sk(t)−∆k (A.23)≤ min(0, sk(t)−∆k) + sk(t), (A.24)where the last step evokes Corollary B2.Case 3: consider k = j(t) 6= a? and J(t) 6= a?. We can then write thefollowing sequence of inequalities,zj(t)(t) + sj(t)(t)(a)≥ Uj(t)(t)(b)≥ Ua?(t)(c)≥ z?. (A.25)Here (a) and (c) hold due to event E and (b) holds since by definition j(t) hasthe highest upper bound other than J(t), which in turn is not the optimalarm by assumption in this case. By simplifying this expression we obtainsk(t)−∆k ≥ 0, and hence the result follows from Corollary B2 as in Case 1.Cases 4–6: consider k = J(t). The proofs for these three cases followthe same general form as the above cases and is omitted. Cases 1 through1106 cover all possible scenarios and prove Lemma B3.Lemma B4Consider a normally distributed random variable X ∼ N (µ, σ2) and β ≥ 0.The probability that X is within a radius of βσ from its mean can then bewritten asP(|X − µ| ≤ βσ) ≥ 1− e−β2/2. (A.26)Proof. Consider U ∼ N (0, 1). The probability that U exceeds some positivebound c > 0 can be writtenP(U > c) =e−c2/2√2pi∫ ∞ce(c2−u2)/2 du=e−c2/2√2pi∫ ∞ce−(u−c)2/2−c(u−c) du≤ e−c2/2√2pi∫ ∞ce−(u−c)2/2 du≤ 12e−c2/2.The first inequality holds due to the fact that e−c(u−c) ≤ 1 for u ≥ c, andthe second holds since c > 0. Using a union bound we can then bound bothsides as P(|U | > c) ≤ e−c2/2. Finally, by setting U = (X − µ)/σ and c = βwe obtain the bound stated above.111Appendix BApproximate sampling fromposterior global minimizersGiven a continuous positive stationary kernel k0, Bochner (1959) assertsthe existence of its Fourier dual s(w), the spectral density of k0. We cannormalize it, letting p(w) = s(w)/α be the associated normalized density,we can write the kernel as the expectationk0(x,x′) = αEp(w)[e−iwT(x−x′)] (B.1)= 2αEp(w,b)[cos(wTx + b) cos(wTx′ + b)] , (B.2)where b ∼ U [0, 2pi]. Let φ(x) = √2α/m cos(Wx + b) denote an m-dimensional feature mapping and where W and b consist of m stackedsamples from p(w, b). The kernel k can then be approximated by the innerproduct of these features, k0(x,x′) ≈ φ(x)Tφ(x′). This approach was usedby (Rahimi and Recht, 2007) as an approximation method in the contextof kernel methods. The feature mapping φ(x) allows us to approximatethe Gaussian process prior for f with a linear model f(x) = φ(x)Tθ whereθ ∼ N (0, I) is a standard multivariate normal. By conditioning on D, theposterior for θ is also multivariate normal θ | D ∼ N (A−1ΦTy, η2A−1)where A = ΦTΦ+η2I, y is the vector of the output data, and Φ is a matrixof features evaluated on the input data.112Let φ(i) and θ(i) be a random set of features and the corresponding poste-rior weights sampled both according to the generative process given above.They can then be used to construct the function f (i)(x) = φ(i)(x)Tθ(i),which is an approximate posterior sample of f—albeit one with a finiteparameterization. We can then maximize this function to obtain z(i) =arg minx∈X f (i)(x), which is approximately distributed according to P?( · |D). To produce samples from this process it is also necessary to derivethe spectral density for any kernel of interest. This quantity is given bythe kernel’s Fourier transform s(w) = 1(2pi)d∫eiwTτk(τ ,0) dτ . For exam-ple, for the squared-exponential and Mate´rn kernels the normalized spectraldensities take the formwse ∼ N (0,Λ) (B.3)wmate´rn ∼ T (0,Λ, 52), (B.4)i.e. these are distributed as a normal and Student’s t respectively, which areeasy to sample from. Recall that Λ is the diagonal matrix of length scaleparameters for ARD kernels.113Appendix CGradients of acquisitionfunctionsIn this chapter, we provide some gradients of popular acquisition func-tions that came in handy when implementing the Bayesian optimizationapproaches.Consider the following GP posterior mean and varianceµ(x) = kT (x)K−1y (C.1)σ2(x) = k(x,x)− k(x)K−1k(x), (C.2)which we readily differentiate to obtain∇µ(x) = ∇kT (x)K−1y (C.3)∇σ2(x) = −2∇kT (x)K−1k(x). (C.4)Note that in the last step we assume a stationary kernel k which means thatk(x,x) is some constant.114C.1 Probability of improvementThe fist acquisition function of interest is probability of improvement (PI),which, for a GP model, can be analytically computed yieldingαPI(x) := Φ(z(x)) (C.5)wherez(x) :=µ(x)− y+σ(x), (C.6)where Φ is the standard normal CDF and y+ = mini yi. This expression canbe differentiated as follows∇αPI(x) = φ(z(x))∇z(x) (C.7)where phi is the standard normal pdf and∇z(x) = ∇[µ(x)− y+σ(x)](C.8)=∇µ(x)σ(x)− (µ(x)− y+)∇σ(x)σ2(x)(C.9)=∇µ(x)σ(x)− 12z(x)∇σ2(x)σ2(x)(C.10)where in the last step we used the fact that ∇σ(x) = 12 ∇σ2(x)σ(x) in order toexpress the gradient in terms of the gradient of the variance, which is easierto compute. Substituting (C.10) into (C.7) we get:∇αPI(x) = φ(z(x))[∇µ(x)σ(x)− 12z(x)∇σ2(x)σ2(x)](C.11)115C.2 Expected improvementA similar acquisition function is expected improvement (EI); as with PI, fora GP model, EI can be analytically computed and reduces to the followingsimple expression:αEI(x) := σ(x) [z(x)Φ(z(x)) + φ(z(x))] , (C.12)with z(x) defined as in (C.6). In the following, it will help to denote A(z) =zΦ(z) + φ(z) and its first derivative A′(z) = Φ(z) + zφ(z) + φ′(z) = Φ(z),where in the last step we use the property of the standard normal pdf φ′(z) =zφ(z). Note that with this new quantity A(z), the expected improvementcan be written αEI(x) = σ(x)A(z(x)). Let us now compute the gradient ofthis expression.∇αEI(x) = A(z(x))∇σ(x) + σ(x)A′(z(x))∇z(x) (C.13)=12A(z(x))σ(x)∇σ2(x) + σ(x)Φ(z(x))[∇µ(x)σ(x)− 12z(x)∇σ2(x)σ2(x)](C.14)=12αEI(x)∇σ2(x)σ2(x)+ Φ(z(x))∇µ(x)− 12z(x)Φ(z(x))∇σ2(x)σ(x)(C.15)∇αEI(x) = 12∇σ2(x)σ2(x)[αEI(x)− 12z(x)Φ(z(x))σ(x)]+ Φ(z(x))∇µ(x)(C.16)where we have written the gradient of EI entirely in terms of the posteriorquantities, µ and σ2, and their derivatives.C.3 Gaussian process upper confidence boundFinally, recall the definition of GP-UCB:αUCB(x) := µ(x) + βσ2(x), (C.17)116which can be differentiated∇αUCB(x) = ∇µ(x) + β∇σ2(x). (C.18)117Appendix DApproximating Gaussianprocess inferenceFor large datasets, or large function evaluation budgets in the Bayesian op-timization setting, the cubic cost of exact inference is prohibitive and therehave been many attempts at reducing this computational burden via approx-imation techniques. In this chapter, we review two sparsification techniquesfor Gaussian processes and the alternative random forest regression.D.1 Sparse pseudo-inputsOne early approach considered usingm < t inducing pseudo-inputs to reducethe rank of the covariance matrix to m, resulting in a significant reductionin computational cost (Seeger et al., 2003; Snelson and Ghahramani, 2005).By forcing the interaction between the t data points x1:t and any test pointx to go through the set of m inducing pseudo-inputs, these methods cancompute an approximate posterior in O(tm2 + m3) time. Pseudo-inputmethods have since been unified in a single theory based on the followingoverarching approximation:p(f(x), f) =∫p(f(x), f ,u)du118≈∫q(f(x) | u)q(f | u)p(u) du = q(f , f∗) (D.1)where u is the vector of function values at the pseudo-inputs. All the sparseGP approximations based on pseudo-inputs can be formulated in terms ofthe form used for the training and test conditionals, q(f(x)|u) and q(f |u),respectively (Quin˜onero-Candela and Rasmussen, 2005).In the seminal works on pseudo-input methods, the locations of thepseudo-inputs were selected to optimize the marginal likelihood of the SPGP(Seeger et al., 2003; Snelson and Ghahramani, 2005). In contrast, a varia-tional approach has since been proposed to marginalize the pseudo-inputsto maximize fidelity to the original exact GP (Titsias, 2009) rather than thelikelihood of the approximate GP.The computational savings in the pseudo-input approach to approximat-ing the GP comes at the cost of poor variance estimates. As can be observedin Figure D.1, the uncertainty (blue shaded area) exhibits unwanted pinch-ing at pseudo-inputs, while it is overly conservative in between and awayfrom pseudo-inputs. In this instance, the 10 inducing points, indicated withblack crosses, were not optimized to emphasize the potential pathologies ofthe method. Since in Bayesian optimization we use the credible intervals toguide exploration, these artefacts can mislead our search.D.2 Sparse spectrumWhile inducing pseudo-inputs reduce computational complexity by using afixed number of points in the search space, sparse spectrum gaussian pro-cesses (SSGP) take a similar approach to the kernel’s spectral space (La´zaro-Gredilla et al., 2010). Bochner’s theorem states that any stationary kernelk(x,x′) = k(x− x′) has a positive and finite Fourier spectrum s(w), i.e.k(x) =1(2pi)d∫e−iw·xs(w) dw. (D.2)Since the spectrum is positive and bounded, it can be normalized such thatp(w) := s(w)/ν is a valid probability density function. In this formulation,1190.0 0.5 1.021012Exact GP0.0 0.5 1.0SPGP0.0 0.5 1.0SSGP0.0 0.5 1.0Random ForestFigure D.1: Comparison of surrogate regression modelsevaluating the stationary kernel is equivalent to computing the expectationof the Fourier basis with respect to its specific spectral density p(w) as inthe followingk(x,x′) = ν Ep(w)[e−iw·(x−x′)]. (D.3)As the name suggests, SSGP approximates this expectation via Monte Carloestimation using m samples drawn from the spectral density so thatk(x,x′) ≈ νmm∑i=1e−iw(i)·xeiw(i)·x′ (D.4)where w(i) ∼ s(w)/ν. The resulting finite dimensional problem is equivalentto Bayesian linear regression with m basis functions and the computationalcost is once again reduced to O(tm2 +m3).As with the pseudo-inputs, the spectral points can also be tuned viamarginal likelihood optimization. Although this introduces a risk of overfit-ting, it allows for a smaller number of basis functions with good predictivepower (La´zaro-Gredilla et al., 2010). Once again, in Figure D.1 we havenot tuned the 80 spectral points in this way. Whereas around observeddata (red crosses) the uncertainty estimates are smoother than the pseudo-inputs method, away from observations both the prediction and uncertaintyregions exhibit spurious oscillations. This is highly undesirable for Bayesianoptimization where we expect our surrogate model to fall back on the prioraway from observed data.120D.3 Random forest surrogatesFinally, as an alternative to Gaussian processes, random forest regressionhas been proposed in the context of sequential model-based algorithm con-figuration (SMAC) (Hutter et al., 2011). Introduced in 2001 (Breiman,2001), random forests are a class of scalable and highly parallelizable re-gression models that have been very successful in practice (Criminisi et al.,2011). More precisely, the random forest is an ensemble method wherethe weak learners are decision trees trained on random subsamples of thedata (Breiman, 2001). Averaging the predictions of the individual treesproduces an expressive response surfaces.The random forest regression model gives SMAC the flexibility to handlecategorical input features and to scale to large observational datasets. Onthe other hand, the exploration strategy in SMAC still requires an uncer-tainty estimate for predictions at test points. Since the random forest doesnot readily provide an estimate of the variance of its predictions, Hutteret al. proposed using the empirical variance in the predictions across treesin the ensemble (Hutter et al., 2011).Although random forests are good interpolators in the sense that theyoutput good predictions in the neighbourhood of training data, they arevery poor extrapolators. Indeed, far from the data, the predictions of alltrees could be identical, resulting in a poor prediction; more importantly,using the variance estimate of SMAC results in extremely confident inter-vals. In Figure D.1 for example, away from data the shaded area is verynarrow around a very poor constant prediction. Even more troubling is thefact that in areas of missing data multiple conflicting predictions can causethe empirical variance to blow up sharply, as can be seen in Figure D.1.While Gaussian processes are also poor extrapolators (when used with lo-cal kernels), they at least produce highly uncertain predictions away fromthe data: a much more desirable behavior when trading off exploration andexploitation.Finally, another drawback of random forests for Bayesian optimization isthat the response surface is discontinuous and non-differentiable so gradient121based optimization methods are not applicable and SMAC must rely on acombination of local and random search when maximizing the acquisitionfunction.122Appendix EHyperparameter learningIn Chapter 2, we mostly ignored the GP mean and kernel hyperparame-ters and assumed they were known or given. Consider the generic functiong : X ×Θ 7→ R, where θ ∈ Θ represents the hyperparameters of our GP. Inthe context of Bayesian optimization, this function could be our objectivefunction or any function derived from the Gaussian process, but for con-creteness, it may help to think of it specifically as the acquisition functionα. We wish to marginalize out our uncertainty about θ with the followingexpressiongt(x) := Eθ [g(x; θ)|Dt] =∫g(x; θ)p(dθ|Dt) (E.1)This integral is over our posterior belief over θ given observations Dt, whichcan be decomposed via Bayes rule asp(θ|Dt) = p(y|x1:t, θ)p(θ)p(Dt) =:γt(θ)p(Dt) (E.2)where we have defined γt as the unnormalized posterior. In this section, wereview several methods used in practice to approximate the integral (E.1).We first consider simple point estimates, followed by the Markov chainMonte Carlo and sequential Monte Carlo methods, and finally, we presentvariational methods.123E.1 Point estimatesThe simplest approach to tackling (E.1) is to fit the hyperparameter toobserved data using a point estimate θˆMLt or θˆMAPt , corresponding to type IImaximum likelihood or maximum a posteriori estimates, respectively. Theposterior is then replaced by a delta measure at the corresponding θˆt whichyieldsgˆt(x) = g(x; θˆt). (E.3)The estimators θˆMLt and θˆMAPt can be obtained by optimizing the marginallikelihood or the unnormalized posterior, respectively. For certain priorsand likelihoods, these quantities as well as their gradients can be computedanalytically. For example, the GP regression model yields the followingmarginal likelihood:Lt(θ) := log p(y|x1:t, θ) (E.4)= −12(y − µ0)T(Kθt + σ2I)−1(y − µ0)−12log |Kθt + σ2I| −t2log(2pi).(E.5)Therefore it is common to use multi-started local optimizers such as thelimiter-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method onobjectives Lt or γt.In Bayesian optimization, our uncertainty about the response surfaceplays a key role in guiding exploration and therefore it is important to in-corporate our uncertainty about θ in the regression model. Naturally, thesepoint estimates cannot capture this uncertainty and the following sectionspresent common techniques for marginalizing out the hyperparameters inmore principled ways.E.2 Markov chain Monte CarloThe common component in Monte Carlo (MC) methods is that they ap-proximate the integral in (E.1) using N samples{θ(i)t}Ni=1from the posterior124distribution p(dθ|Dt):E [g(x; θ)|Dt] ≈ 1NN∑i=1g(x; θ(i)t ). (E.6)However, in practice it is impossible to sample directly from the poste-rior so we have to use sampling techniques to obtain samples that aremarginally distributed according to p(dθ|Dt). In the Markov chain MonteCarlo (MCMC) approach, a Markov chain of samples is generated where thesamples are marginally distributed according to the target posterior in thelimit of infinitely long chains.In this section we introduce the slice sampler which is a particularMCMC algorithm used in practice for hyperparameter marginalization inBayesian optimization (Murray and Adams, 2010; Snoek et al., 2012). Notethat other MCMC methods could also be used; in particular, since theanalytic gradient of the marginal likelihood can be computed, Hamilto-nian Monte Carlo (HMC (Duane et al., 1987)) could be employed, but thismethod can be hard to tune in practice.Consider an auxiliary random variable u ∈ and the following joint prob-ability distribution with density pi(θ, u) = I[0 ≤ u ≤ pi(θ)] such that∫pi(θ, u)du =∫ pi(θ)0du = pi(θ). (E.7)Therefore we can sample from pi(dθ) by sampling from this joint distribu-tion and ignoring the auxiliary variable u. One way to create a Markovchain over this augmented space is by defining the following transition ker-nel κ(θ, u, dθ′, du′) = p(dθ′|u′)p(du′|θ), where the conditionals are definedasp(du′|θ) = U[0,pi(θ)](du′) (E.8)p(dθ′|u′) = U{θ′:pi(θ′)≥u′}(dθ′). (E.9)Because this algorithm only consists of comparisons of probability densities,125we can replace the above pi(θ) with the unnormalized γ(θ). Therefore, wefirst compute γ(θ) and draw a uniform random variable u′ ∈ [0, γ(θ)]; wethen sample uniformly within the slice {θ′ : γ(θ′) ≥ u′}. Even in onedimension it is difficult to find this set so in practice a width parameter wis chosen and a candidate window of width w randomly centred around thecurrent sample θ is initialized. The endpoints θL and θU of the slice aretested to see whether u′ ≥ γ(θL/U ) holds, i.e. whether either endpoint isoutside of the slice. The bounds are iteratively grown by taking steps ofwidth w until the test holds.Once the bounds θL/U include the slice1, a proposal θ′ ∈ [θL, θU ] isuniformly sampled. If γ(θ′) > u the proposal is accepted (and the Markovchain carries on from here), if the proposal is rejected, the bounds θL/U areupdated according to whether θ′ < θ or θ′ > θ, respectively. These steps aredescribed more precisely in Algorithm 5, where we consider the slice samplerwithout the stepping out subroutine.In multiple dimensions one can either sequentially sample each dimen-sion independently, or draw a random direction and slice sample along it.Another common strategy is to partition the dimensions into blocks andsequentially slice sample along a random direction in the subspaces corre-sponding to each of the blocks. The latter is especially recommended whencertain dimensions, i.e. hyperparameters are highly correlated.1Actually, since the slice is not necessarily a simply connected set, the bounds are onlyguaranteed to include part of the slice. See (Neal, 2003) for a more rigorous proof ofcorrectness of the slice sampling algorithm.126Algorithm 5 Slice samplingRequire: γ, the unnormalized posterior;θ, the current sample of the Markov chain;w, a width parameter.1: Sample u ∼ U [0, γ(θ)]2: Sample v ∼ U [0, w]3: Set θL = θ − v and θU = θ + w4: while θL < θU do5: Sample θ′ ∼ U [θL, θU ]6: if γ(θ′) ≥ u then7: return θ′8: else9: if θ′ < θ then10: Set θL = θ′11: else12: Set θU = θ′13: end if14: end if15: end while16: return θ′Naturally, in practice this method is used far from the asymptotic limitand therefore the samples can be highly correlated. Furthermore, multi-modal posteriors over θ can be hard to approximate due to poor mixingor short chains. The following section reviews a Monte Carlo method thataddresses the mixing issue when the target distribution is in fact a slowlyvarying distribution, such as our posterior over θ.E.3 Sequential Monte CarloIn this section, the approach we take is a generalization of particle filters dueto (Liu and Chen, 1998) called sequential Monte Carlo (SMC) and has beenimplemented for hyperparameter marginalization by (Gramacy and Polson,1272011). SMC is especially useful when one wants to compute the expectationof a sequence of functions {gt} with respect to a slowly varying sequence ofdistributions pit. In our case, this sequence of distributions consists of theposterior distribution of θ after every observation.The idea of SMC is to maintain a set of N particles and correspondingweights{(θ(i)t , w(i)t )}with target distribution p(dθ|Dt). At time t + 1, weevolve the particles following some transition kernel Kt(θ,dθ) and updatethe weights accordingly. Then the SMC approximation to (E.1) differs from(E.6) slightly and can be written as follows:E [g(x; θ)|Dt] ≈N∑i=1g(x; θ(i)t )wˆ(i)t , (E.10)where wˆ(i)t are the normalized weightswˆ(i)t :=w(i)t∑Nj=1w(j)t. (E.11)Consider the sequence of target distributions pit(dθ) := p(dθ|Dt) where,with a slight abuse of notation, wherever we use θ as an argument to pit weare referring to its associated density with respect to some base measure—we use this convention throughout the paper. While we cannot compute thisquantity pointwise, we can compute the unnormalized measure by applyingBayes rulepit(θ) = p(θ|Dt) = p(Dt|θ)p(θ)p(Dt) =γt(θ)p(Dt) . (E.12)Note that the unnormalized posterior γt(θ) is the product of the marginallikelihood and the prior which can be computed pointwise.The weights are sequentially updated according tow(i)t = w(i)t−1w˜t(θ(i)t−1, θ(i)t ). (E.13)The incremental weight w˜t(·, ·) depends on the transition kernel Kt thatis used. Equation 31 from (Del Moral et al., 2006) states that if Kt is128an MCMC kernel with stationary distribution pit(θ) then a good – thoughsuboptimal – incremental weight to usew˜t(θ, θ′) =γt(θ)γt−1(θ)(E.14)which yieldsw˜t(θ, θ′) =p(Dt|θ)p(θ)p(Dt−1|θ)p(θ) =p(Dt|θ)p(Dt−1|θ) =p(yt|xt,Dt−1, θ)p(Dt−1|θ)p(Dt−1|θ)(E.15)= p(yt|xt,Dt−1, θ). (E.16)Since the weight of the particle θ′ = θ(i)t only depends on its parent θ(i)t−1, itis possible to weigh and resample particles before propagating them; thusrecover the particle learning approach of (Gramacy and Polson, 2011). Thisis sometimes referred to as the look-ahead trick because it allows us toobserve the next data point before resampling the current set of particles.Note that in the previous section we detailed a MCMC kernel with sta-tionary distribution pit(θ) and we can use the same slice sampler to proposenew particles. This way, the SMC approach can be interpreted as evolvingmultiple MCMC chains in parallel with interactions at resampling steps.129
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Practical Bayesian optimization with application to...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Practical Bayesian optimization with application to tuning machine learning algorithms Shahriari, Bobak 2016
pdf
Page Metadata
Item Metadata
Title | Practical Bayesian optimization with application to tuning machine learning algorithms |
Creator |
Shahriari, Bobak |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | Bayesian optimization has recently emerged in the machine learning community as a very effective automatic alternative to the tedious task of hand-tuning algorithm hyperparameters. Although it is a relatively new aspect of machine learning, it has known roots in the Bayesian experimental design (Lindley, 1956; Chaloner and Verdinelli, 1995), the design and analysis of computer experiments (DACE; Sacks et al., 1989), Kriging (Krige, 1951), and multi-armed bandits (Gittins, 1979). In this thesis, we motivate and introduce the model-based optimization framework and provide some historical context to the technique that dates back as far as 1933 with application to clinical drug trials (Thompson, 1933). Contributions of this work include a Bayesian gap-based exploration policy, inspired by Gabillon et al. (2012); a principled information-theoretic portfolio strategy, out-performing the portfolio of Hoffman et al. (2011); and a general practical technique circumventing the need for an initial bounding box. These various works each address existing practical challenges in the way of more widespread adoption of probabilistic model-based optimization techniques. Finally, we conclude this thesis with important directions for future research, emphasizing scalability and computational feasibility of the approach as a general purpose optimizer. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-09-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International |
DOI | 10.14288/1.0314167 |
URI | http://hdl.handle.net/2429/59104 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2016-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2016_november_shahriari_bobak.pdf [ 3.42MB ]
- Metadata
- JSON: 24-1.0314167.json
- JSON-LD: 24-1.0314167-ld.json
- RDF/XML (Pretty): 24-1.0314167-rdf.xml
- RDF/JSON: 24-1.0314167-rdf.json
- Turtle: 24-1.0314167-turtle.txt
- N-Triples: 24-1.0314167-rdf-ntriples.txt
- Original Record: 24-1.0314167-source.json
- Full Text
- 24-1.0314167-fulltext.txt
- Citation
- 24-1.0314167.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0314167/manifest