Regret Bounds for Gaussian Process Bandits Without Observation Noise by Masrour Zoghi B.Sc., The University of British Columbia, 2003 M.Sc., University of Toronto, 2004 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August 2012 c Masrour Zoghi 2012Abstract This thesis presents some statistical re nements of the bandits approach presented in [11] in the situation where there is no observation noise. We give an improved bound on the cumulative regret of the samples chosen by an algorithm that is related (though not identical) to the UCB algorithm of [11] in a complementary setting. Given a function f on a domain D Rd, sampled from a Gaussian process with an anisotropic kernel that is four times di erentiable at 0, and a lattice L D, we show that if the points in L are chosen for sampling using our branch-and-bound algorithm, the regret asymptotically decreases according to O e t (ln t)d=4 with high probability, where t is the number of observations carried out so far and is a constant that depends on the objective function. iiPreface This thesis grew out of a collaboration with Alex Smola and my supervisor, Nando de Freitas. An abridged version of the results presented here were presented in the Workshop on Bayesian Optimization at NIPS 2011. The results included in Section 3.1 and the Appendix were obtained by the other authors and the proofs are only included for the sake of completeness of the exposition. Everything else was proven and written by the author of this thesis. iiiTable of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Global optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Costly evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Gaussian process bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Surrogates for optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Our algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Approximation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Finiteness of regret (statement) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Remarks on the main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3.1 On the statement of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3.2 Outline of the proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . 17 ivTable of Contents 3.3.3 Further remarks on the GP prior . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Finiteness of regret (proof) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Appendices Apendix: Auxilliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 vList of Figures 1.1 Lipschitz hypothesis being used to discard pieces of the search space . . . . . . . . . 3 1.2 A sample application of the branch and bound algorithm . . . . . . . . . . . . . . . 5 2.1 Branch and Bound algorithm for a 2D function . . . . . . . . . . . . . . . . . . . . . 10 3.1 The elimination of other smaller peaks . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 The shrinking of the relevant set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 viAcknowledgements First and foremost, I would like to thank my supervisor for all the direction he provided throughout this thesis and for his patience when the proofs were not transpiring as e ortlessly. I am also very grateful to Alex Smola for his collaboration and for providing me with an elegant proof of a key lemma when I was su ering through a much messier argument. This work bene ted greatly from many lengthy discussions with John Chia, who was working on the more experimental aspects of this project. I learned a lot from talking to both Matt Ho man and Ziyu Wang, who were working on related topics, and who happened to sit at the two desks next to mine. I would also like to thank our wonderful Group Assistant, Kath Imhiran, for her incessant pleasantness, which is truly a scarce commodity. Outside of the university, I would like to thank my many \sisters and brothers in arms" from the Mining Justice Alliance, who added some meaning to my life by allowing me to assist them in their e orts to shed some light on injustices emanating from Vancouver. viiDedication I would like to dedicate this thesis to my parents, without whose sacri ces over the last 33 years none of this would be possible, and to M.S. who contributed greatly to my intellectual cultivation during my short stay in Vancouver. viiiChapter 1 Introduction Let f : D ! R be a function on a compact subset D Rd. We would like to address the global optimization problem xM = argmax x2D f(x): Let us assume for the sake of simplicity that the objective function f has a unique global maximum (although it can have many local maxima) The space D might be the set of possible parameters that one could feed into a time-consuming algorithm or the locations where a sensor could be deployed, and the function f might be a measure of the performance of the algorithm (e.g. how long it takes to run) or whatever quantity it is that our given sensor measures (e.g. temperature). In this paper, our assumption is that once the function has been probed at point x 2 D, then the value f(x) can be observed with very high precision. This is the case when the deployed sensors are very accurate or if the algorithm whose parameters are being con gured is deterministic. The challenge in these applications is two-fold; the optimization problem is of a global nature without any convexity assumption on f and it is costly to evaluate f at any given point. We will discuss these two stumbling blocks below. 11.1. Global optimization 1.1 Global optimization Global optimization is a di cult problem without any assumptions on the objective function f . The main complicating factor is the uncertainty over the extent of the variations of f , e.g. one could consider the characteristic function fxMg, which is equal to 1 at xM and 0 elsewhere, and none of the methods we mention here can optimize this function without exhaustively searching through every point in D. The way a large number of global optimization methods address this problem is by imposing some prior assumption on how fast the objective function f can vary. The most blatant manifes- tation of this remedy is the imposition of a Lipschitz assumption on the function f , which requires the change in the value of f(x), as the point x moves around, to be smaller than a constant multiple of the distance traveled by x (cf. [6]). In fact, as pointed out in [3] (cf. Figure 3 therein), it is only important to have this kind of tight control over the function near its optimum: elsewhere in the space, we can have what they have dubbed a \weak Lipschitz" condition. One way to relax these hard Lipschitz constraints is by putting a Gaussian Process (GP) prior on the function. Instead of restricting the function from oscillating too fast, a GP prior requires those fast oscillations to have low probability (cf. [5], Theorem 5). This is the setting of Bayesian optimization (aka GP bandits), with which we concern ourselves: a review of GPs and Bayesian optimization is provided in Section 2. The main point of these derivative bounds (be it hard or soft, as outlined above) is to assist with the exploration-exploitation trade-o that global optimization algorithms have to grapple with. In the absence of any assumptions of convexity on the objective function, a global optimization algorithm is forced to explore enough until it reaches a point in the process when with some degree of certainty it can localize its search space and perform local optimization (aka exploitation). Derivative bounds such as the ones discussed here together with the boundedness of the search space, guaranteed by the compactness assumption on D, provide us with such certainty by producing a useful upper bound that allows us to shrink the search space. This is illustrated in Figure 1.1: suppose that we know that our function is Lipschitz with Lipschitz constant L, then given sample points as shown in the gure (the red dots), we can use the Lipschitz property to discard pieces of the search space (the shaded regions). This is done 21.1. Global optimization Figure 1.1: An example of the Lipschitz hypothesis being used to discard pieces of the search space. by nding points in the search space where the function could not possibly be higher than the maximum value we’ve already observed without violating our Lipschitz hypothesis. Such points are found by placing cones at the sampled points with slope equal to L and seeing where those cones lie below the maximum observed value (the red dashed line in this case): since the function is guaranteed to be bounded from above by the cones, we know that there is nothing interesting to nd in the shaded regions. Note that this crude approach is often very wasteful because very often the slope of the function is much smaller than L, and as we will see below (cf. Figure 1.2), GPs do a better job of providing an upper bound that could be used to limit the search space, by essentially choosing Lipschitz constants that vary over space and time. 31.2. Costly evaluation 1.2 Costly evaluation Now, let us also assume that the objective function f is costly to evaluate (e.g. time-wise or nancially), in which case we would like to avoid probing f as much as possible and to get close to the optimum as quickly as possible. The solution to this problem is to approximate f with a surrogate function that provides a good upper bound for f and which is easier to calculate and optimize. In the Lipschitz-based algorithms, this upper bound is provided by the Lipschitz constraint, whereas in the case of X-armed bandits, this bound is provided by the functions Bh;i(n) de ned on page 1663 of [3]. Again, GPs provide a soft version of this by providing us with high probability bounds that could be modi ed depending on how certain we would like to be of our upper bounds. Moreover, the upper bounds provided by GPs have analytic expressions, which make evaluation and optimization relatively easy. We refer the reader to [2] for a general review of the literature on the various surrogate functions utilized in Bayesian optimization. Let us also point out that in addition to being easier to evaluate, surrogate functions can also aid with global optimization by restricting the domain of interest. More precisely, if we know that f g for some known function g and we have probed f at x0 to get the value y0 = f(x0), then we can shrink our search space to the relevant set R = fx 2 Djg(x) > y0g; which will be substantially smaller than D, supposing that g is a relatively tight upper bound of f near the optimum: it can be atrociously inaccurate away from the optimum as long as it stays below max f . The surrogate function that we will make extensive use of here is called the Upper Con dence Bound (UCB), which is de ned to be + B , where and are the posterior predictive mean and standard deviation of the GP and B is a constant to be chosen by the algorithm. The UCB surrogate function has the desirable property that it provides the tightest bound on f where the function has been probed the most: this is illustrated in Figure 1.2. 41.2. Costly evaluation Figure 1.2: An example of our branch and bound maximization algorithm with UCB surrogate + B , where and are the mean and standard deviation of the GP respectively. The region consisting of the points x for which the upper con dence bound (x) + B (x) is lower that the maximum value of the lower con dence bound (x) B (x) does not need to be sampled anymore. Note that the UCB surrogate function bounds f from above. This surrogate function has been studied extensively in the literature and this paper relies heavily on the ideas put forth in the paper by Srinivas et al [11], in which the algorithm consists of repeated optimization of the UCB surrogate function after each sample. One key di erence between our setting and that of [11] is that, whereas we assume that the value of the function can be observed exactly, in [11] it is necessary for the noise to be non-trivial (and Gaussian) because the main quantity that is used in their estimates, namely information gain (cf. Equation (3) in [11]), becomes unde ned when the variance of the observation noise ( 2 in their notation) is set to 0: cf. the expression for I(yA; fA) that was given in the paragraph following Equation (3). Moreover, our algorithm can in some ways be thought of as a limiting case of the UCB algorithm in [11] when the parameter B goes to in nity because our main goal is to shrink as much as possible, even though we do use + B to limit our search space. In subsequent work, we hope to prove that the algorithm presented here is inferior to the UCB algorithm, but in the mean time we thought it valuable to present this manifestly wasteful algorithm and show that the cumulative regret RT one obtains from applying it belongs to O(1), given a xed function, in contrast to the previous bounds that grow as O( p T ), where T is the number of samples collected. In fact, we show that the regret rT decreases according to O e t (ln t)d=4 . This asymptotic result is valuable to applications that are meant to run inde nitely because it unburdens the user from having to decide when to stop the bandit algorithm and just exploit the 51.2. Costly evaluation optimum point found so far because the overall regret the algorithm can accrue is bounded from above. The paper whose results are most similar to ours is [9], but there are some key di erences in the methodology, analysis and obtained rates. For instance, we are interested in cumulative regret, whereas the results of [9] are proven for nite stop-time regret. In our case, the ideal application is the optimization of a function that is C2-smooth and has an unknown non-singular Hessian at the maximum. We obtain a regret rate O e t (ln t)d=4 , whereas the DOO algorithm in [9] has regret rate O(e t) if the Hessian is known and the SOO algorithm has regret rate O(e p t) if the Hessian is unknown. In addition, the algorithms in [9] can handle functions that behave like ckx xMk near the maximum (cf. Example 2 therein). This problem was also studied by [14] and [4], but using the Expected Improvement surrogate instead of UCB. Our methodology and results are di erent, but complementary to theirs. 6Chapter 2 Gaussian process bandits 2.1 Gaussian processes As in [11], the objective function is distributed according to a Gaussian process prior: f(x) GP(m( ); ( ; )): (2.1) For convenience, and without loss of generality, we assume that the prior mean vanishes, i.e., m( ) = 0. There are many possible choices for the covariance kernel. One obvious choice is the anisotropic kernel with a vector of known hyperparameters [10]: (xi; xj) = e (xi xj) >D(xi xj) ; (2.2) where e is an isotropic kernel and D is a diagonal matrix with positive hyperparameters along the diagonal and zeros elsewhere. Our results apply to squared exponential kernels and Mat ern kernels with parameter 2. In this paper, we assume that the hyperparameters are xed and known in advance. We can sample the GP at t points by choosing points x1:t := fx1; : : : ; xtg and sampling the values of the function at these points to produce the vector f1:t = [f(x1) f(xt)]>. The function values are distributed according to a multivariate Gaussian distribution N (0;K), with covariance entries (xi; xj). Assume that we already have several observations from previous steps, and that we want to decide what action xt+1 should be considered next. Let us denote the value of the function at this arbitrary new point as ft+1. Then, by the properties of GPs, f1:t and ft+1 are 72.2. Surrogates for optimization jointly Gaussian: 2 6 4 f1:t ft+1 3 7 5 N 0 B @0; 2 6 4 K k> k (xt+1; xt+1) 3 7 5 1 C A ; where k = [ (xt+1; x1) (xt+1; xt)]>. Using the Schur complement, one arrives at an expression for the posterior predictive distribution: P (ft+1jx1:t+1; f1:t) = N ( t(xt+1); 2t (xt+1)); where t(xt+1) = k>K 1f1:t; 2t (xt+1) = (xt+1; xt+1) k >K 1k (2.3) and f1:t = [f(x1) f(xt)]>. 2.2 Surrogates for optimization When it is assumed that the objective function f is sampled from a GP, one can use a combination of the posterior predictive mean and variance given by Equations (2.3) to construct surrogate functions, which tell us where to sample next. Here we use the UCB combination, which is given by t(x) +Bt t(x); where fBtg1t=1 is a sequence of numbers speci ed by the algorithm. This surrogate trades-o explo- ration and exploitation since it is optimized by choosing points where the mean is high (exploitation) and where the variance is large (exploration). Since the surrogate has an analytical expression that is easy to evaluate, it is much easier to optimize than the original objective function. Other popular surrogate functions constructed using the su cient statistics of the GP include the Probability of Improvement, Expected Improvement and Thompson sampling. We refer the reader to [2, 7, 8] for details on these. 82.3. Our algorithm 2.3 Our algorithm The main idea of our algorithm (Algorithm 1) is to tighten the bound on f given by the UCB surrogate function by sampling the search space more and more densely and shrinking this space as more and more of the UCB surrogate function is \submerged" under the maximum of the Lower Con dence Bound (LCB). Figure 1.2 illustrates this intuition. Algorithm 1 Branch and Bound Input: A compact subset D Rd, a discrete lattice L D and a function f : D ! R. R D 1 repeat Sample Twice as Densely: 2 Sample f at enough points in L so that every point in R is contained in a simplex of size . Shrink the Relevant Region: Set eR := x 2 R T (x) + p T T (x) > sup R T (x) p T T (x) : T is the number points sampled so far and T = 2 ln jLjT 2 = 4 lnT + 2 ln jLj with 2 (0; 1). Solve the following constrained optimization problem: (x 1; x 2) = argsup (x1;x2)2 eR eR kx1 x2k R B x 1 + x 2 2 ; kx 1 x 2k , where B(p; r) is the ball of radius r centred around p. until R\ L = ? More speci cally, the algorithm consists of two iterative stages. During the rst stage, the function is sampled along a lattice of points (the red crosses in Figure 2.1). In the second stage, the search space is shrunk to discard regions where the maximum is very unlikely to reside. Such regions are obtained by nding points where the UCB is lower than the LCB (the complement of the colored region in the same panel as before). The remaining set of relevant points is denoted by eR. In order to simplify the task of shrinking the search space, we simply nd an enclosing ball, which is denoted by R in Algorithm 1. Back to the rst stage, we consider a lattice that is twice as dense as in the rst stage of the previous iteration, but we only sample at points that lie within 92.3. Our algorithm Figure 2.1: Branch and Bound algorithm for a 2D function. The colored region is the search space and the color-map, with red high and blue low, illustrates the value of the UCB. Four steps of the algorithm are shown; progressing from left to right and top to bottom. The green dots designate the points where the function was sampled in the previous steps, while the red crosses denote the freshly sampled points. our new smaller search space. In the second stage, the auxiliary step of approximating the relevant set eR with the ball R introduces ine ciencies in the algorithm, since we only need to sample inside eR. This can be easily remedied in practice to obtain an e cient algorithm, for instance by using an ellipsoid instead of a ball. Our analysis will show that even without these improvements it is already possible to obtain very strong exponential convergence rates. Of course, practical improvements will result in better constants and ought to be considered seriously. 10Chapter 3 Analysis 3.1 Approximation results Lemma 1 (Variance Bound) Let : Rd Rd ! R be a kernel that is four times di erentiable along the diagonal f(x; x) jx 2 Rdg, with Q de ned as in Lemma 7.2, and f GP (0; ( ; )) a sample from the corresponding Gaussian Process. If f is sampled at points x1:T = fx1; : : : ; xT g that form a -cover of a subset D Rd, then the resulting posterior predictive standard deviation T satis es sup D T Q 2 4 Proof Let H be the RKHS corresponding to and h 2 H an arbitrary element, with g := (1 P1:T )h the residual de ned in Lemma 7.5. By Lemma 7.3, we know that k1 P1:T k 1 and so we have kgkH k1 P1:T k khkH khkH (3.1) Moreover, by Lemma 7.2, we know that the second derivative of g is bounded by kgkHQ, and since by Lemma 7.5 we know that g vanishes at each xi, we can use Lemma 9.2 and the inequality given by inequality (3.1) to conclude that jh(x) P1:Th(x)j := jg(x)j kgkHQ 2 4 by Lemma 9.2 khkHQ 2 4 by inequality (3.1) and so for all x 2 D we have jh(x) P1:Th(x)j Q 2 4 khkH (3.2) 113.1. Approximation results On the other hand, by Lemma 8, we know that for all x 2 D we have the following tight bound: jh(x) P1:Th(x)j T (x) khkH : (3.3) Now, given the fact that both inequalities (3.2) and (3.3) are bounding the same quantity and that the latter is a tight estimate, we necessarily have that T (x) khkH Q 2 4 khkH : Canceling khkH gives the desired result. 123.2. Finiteness of regret (statement) 3.2 Finiteness of regret (statement) Having shown that the variance vanishes according to the square of the resolution of the lattice of sampled points, we now move on to show that this estimate implies an exponential asymptotic vanishing of the regret encountered by our Branch and Bound algorithm. This is laid out in our main theorem stated below and proven in the supplementary material. The theorem considers a function f , which is a sample from a GP with a kernel that is four times di erentiable along its diagonal. The global maximum of f can appear in the interior of the search space, with the function being twice di erentiable at the maximum and with non-vanishing curvature. Alternatively, the maximum can appear on the boundary with the function having non- vanishing gradient at the maximum. Given a lattice that is ne enough, the theorem asserts that the regret asymptotically decreases in exponential fashion. The main idea of the proof of this theorem is to use the bound on given by Proposition 1 to reduce the size of the search space. The key assumption about the function that the proof utilizes is the quadratic upper bound on the objective function f near its global maximum, which together with Proposition 1 allows us to shrink the relevant region R in Algorithm 1 rapidly. The gures in the proof give a picture of this idea. The only complicating factor is the factor p t in the expression for the UCB that needs to be estimated. This is dealt with by modeling the growth in the number of points sampled in each iteration with a di erence equation and nding an approximate solution of that equation. Recall that D Rd is assumed to be a non-empty compact subset and f a sample from the Gaussian Process GP (0; ( ; )) on D. Moreover, in what follows we will use the notation xM := argmax x2D f(x). Also, by convention, for any set S, we will denote its interior by S , its boundary by @S and if S is a subset of Rd, then conv(S) will denote its convex hull. The following holds true: Theorem 2 Suppose we are given: 1. > 0, a compact subset D Rd, and a stationary kernel on Rd that is four times di er- entiable; 2. f GP(0; ) a continuous sample on D that has a unique global maximum xM , which satis es 133.2. Finiteness of regret (statement) one of the following two conditions: (y) xM 2 D and f(xM ) c1kx xMk2 < f(x) f(xM ) c2kx xMk2 for all x satisfying x 2 B(xM ; 0) for some 0 > 0; (z) xM 2 @D and both f and @D are smooth at xM , with rf(xM ) 6= 0; 3. any lattice L D satisfying the following two conditions 2L \ conv(L) L (3.4) 2 l log2 0 diam(D) m +1 L \ L 6= ? (3.5) if f satis es (y) Then, there exist positive numbers A and and an integer T such that the points speci ed by the Branch and Bound algorithm, fxtg, will satisfy the following asymptotic bound: For all t > T , with probability 1 we have r(xt) < Ae t (ln t)d=4 : We would like to make a few clarifying remarks about the theorem. First, note that for a random sample f GP(0; ) one of conditions (y) and (z) will be satis ed almost surely if is a Mat ern kernel with > 2 and the squared exponential kernel because the sample f is twice di erentiable almost surely by [1, Theorem 1.4.2] and [12, x2.6]) and the vanishing of at least one of the eigenvalues of the Hessian is a co-dimension 1 condition in the space of all functions that are smooth at a given point, so it has zero chance of happening at the global maximum. Second, the two conditions (3.4) and (3.5) simply require that the lattice be \divisible by 2" and that it be ne enough so that the algorithm can sample inside the ball B(xM ; 0) when the maximum of the function is located in the interior of the search space D. Finally, it is important to point out that the rate decay does not depend on the choice of the lattice L, even though as stated, the statement of the theorem chooses only after L is speci ed. The theorem was written this way simply for the sake of readability. Given the exponential rate of convergence we obtain in Theorem 2, we have the following niteness conclusion for the cumulative regret accrued by our Branch and Bound algorithm: 143.2. Finiteness of regret (statement) Corollary 3 Given , f GP(0; ) and L D as in Theorem 2, the cumulative regret is bounded from above. Remark 4 It is worth pointing out the trivial observation that using a simple UCB algorithm with monotonically increasing and unbounded factor p t, without any shrinking of the search space as we do here, necessarily leads to unbounded cumulative regret since eventually p t becomes large enough so that at points x0 far away from the maximum, p t t(x0) becomes larger than f(xM ) f(x). In fact, eventually the UCB algorithm will sample every point in the lattice L. 153.3. Remarks on the main theorem 3.3 Remarks on the main theorem This section includes a discussion of the assumptions placed on the objective function in Theorem 2 as well as an outline of the proof, the full details of which are included in the appendix. 3.3.1 On the statement of Theorem 2 A few remarks on the assumptions and the conclusion of the main theorem are in order: A. Relationship between the local and global assumptions on f : The theorem has two seemingly unrelated restrictions on the function f : the global GP prior and the local behaviour near the global maximum. However, in many circumstances of interest, the local condition follows almost surely from the global condition. Two such circumstances are if is a Mat ern kernel with > 2 (including the squared exponential kernel) or if is six times di erentiable. In either case, the sample f is twice di erentiable almost surely, in the former case by [1, Theorem 1.4.2] and [12, x2.6]) and in the latter situation by [5, Theorem 5]. If the global maximum xM lies in the interior of D, the Hessian of f at xM will almost surely be non-singular since the vanishing of at least one of the eigenvalues of the Hessian is a co-dimension 1 condition in the space of all functions that are smooth at a given point, hence justifying condition (y). On the other hand, if xM lies on the boundary of D, then condition (z) will be satis ed almost surely, since the additional event of the vanishing of rf(xM ) is a codimension d phenomenon in the space of functions with global maximum at xM . B. Uniqueness of the global maximum: A randomly drawn continuous sample from a GP on a compact domain will almost surely have a unique global maximum: this is because the space of continuous functions on a compact domain that attain their global maximum at more than one point have codimension one in the space of all continuous functions on that domain. C. Assumptions on L: The two conditions (3.4) and (3.5) simply require that the lattice be \divisible by 2" and that it be ne enough so that the algorithm can sample inside the ball B(xM ; 0) when the maximum of the function is located in the interior of the search space D. One can simply choose L to be the set of points in D that have oating point coordinates: it’s just the points at which the algorithm is allowed to sample the function. D. On ’s dependence: Finally, it is important to point out that the decay rate does not 163.3. Remarks on the main theorem depend on the choice of the lattice L, even though as stated, the statement of the theorem chooses only after L is speci ed. The theorem was written this way simply for the sake of readability. 3.3.2 Outline of the proof of Theorem 2 The starting point for the proof is the observation that one can use the posterior predictive mean and standard deviation of the GP to obtain a high probability envelope around the objective function (cf. Lemma 10 in the appendix). Given the fact that the thickness of this envelope is determined by the height of the posterior predictive standard deviation, , we can use the bound given by Proposition 1 to show that asymptotically one can rapidly dispense with large portions of the search space, as illustrated in Figure 1.2. One disconcerting component of Algorithm 1 is the step that requires sampling twice as densely in each iteration, since the number of samples can start to grow exponentially, hence killing any hope of obtaining exponentially decreasing regret. However, this is where the assumption on the local behaviour near the global maximum becomes relevant. Since Proposition 1 tells us that every time the function is sampled twice as densely, decreases by a factor of 4, and given our assumption that the function has quadratic behaviour near the global maximum, we can conclude that the radius of the search space is halved after each iteration and so the number of sampled points added in each iteration roughly remains constant. Of course, this assumes that the multiplicative factor p t remains constant in this process. However, the algorithm requires p t to grow logarithmically, and so to ll this gap we need to bound the growth of p t, which is tied to the number of samples needed in each iteration of the algorithm, which in turn is linked to the resolution of the lattice of sampled points and the size of the relevant set R, which in turn depends on the size of p t t. This circular dependence gives rise to a di erence equation, whose solutions we bound by solving the corresponding di erential equation. 3.3.3 Further remarks on the GP prior Let us step back for a moment and pose the question of whether it would be possible to carry out a similar line of reasoning under other circumstances. To answer this, one needs to identify the key ingredients of the proof, which are the following: 173.3. Remarks on the main theorem A. A mechanism for calculating a high probability envelope around the objective function (cf. Lemma 10); B. An estimate showing that the thickness of the envelope diminishes rapidly as the function is sampled more and more densely (cf. Proposition 1), so that the search space can be shrunk under reasonable assumptions on the behaviour of the function near the peak. The reason for our imposing a GP prior on f is that it gives us property A, while our smoothness assumption on the kernel guarantees property B. However, GPs are but one way one could obtain these properties and they do this essentially by coming up with local estimates of the Lipschitz constant based on the observed values of the objective function nearby. Perhaps one could explicitly incorporate similar local estimates on the Lipschitz constant into tree based approaches like HOO and SOO, cf. [3] and [9], in which case one would be able to dispense with the GP assumption and get similar performance. But, that is beyond the scope of this paper and will be left for future work. 183.4. Finiteness of regret (proof) 3.4 Finiteness of regret (proof) We will need the following two de nitions in the proof of the theorem: De nition 5 Given the above setup, the regret function is de ned to be r(x) = max D f f(x): De nition 6 (Covering Number) Denote by B a Banach space with norm k k. Furthermore denote by B B a set in this space. Then the covering number n (B;B) is de ned as the minimum number of balls with respect to the Banach space norm that are required to cover B entirely. Proof [Theorem 2] The proof consists of the following steps: Global: We rst show that after a nite number of steps the algorithm zooms in on the neighbourhood B(xM ; 0). This is done by rst showing that can be chosen small enough to squeeze the set f 1((fM ; fM ]) into any arbitrarily small neighbourhood of xM and that as the function is sampled more and more densely, the UBC-LCB envelope around f becomes arbitrarily tight, hence eventually tting the relevant set inside a small neighbourhood of xM . Please refer to Figure 3.1 for a graphical depiction of this process. GI : Since D is compact and f is continuous and has a unique maximum, for every > 0, we can nd an = ( ) > 0 such that f 1 ((fM ; fM ]) B(xM ; ); where fM = max f . To see this, suppose on the contrary that there exists a radius > 0 such that for all > 0 we have f 1 ((fM ; fM ]) * B(xM ; ) which means that there exists a point x 2 D such that f(xM ) f(x) < but kx xMk > . Now, for each i 2 N, pick a point xi 2 f 1 (fM 1i ; fM ] n B(xM ; ): this gives us a sequence of points fxig in D, which by the compactness of D has a convergent 193.4. Finiteness of regret (proof) Figure 3.1: The elimination of other smaller peaks. subsequence fxikg, whose limit we will denote by x . From the continuity of f and the fact that f(xM ) f(xi) < 1i , we can conclude that f(xM ) f(x ) = 0, which contradicts our assumption that f has a unique global maximum since we necessarily have x =2 B(xM ; ). GII : De ne := ( 0) 4 , with 0 as in Condition (y) of the statement of Theorem 2. GIII : For each T , de ne the \relevant set" RT D as follows: RT = x 2 D T (x) + p T T (x) > sup R T (x) p T T (x) : GIV : Choose T = b ln(T ), with b chosen large enough to satisfy the conditions of Lemma 10. Then, it is possible to sample f densely enough so that p T max x2D T (x) < ; (3.6) so thatRT B(xM ; 0). This is because as D is sampled more and more densely we have = O( 2), where is the distance between the points of the grid, and = O ln 1 d = O ( ln ) and so p ! 0 as ! 0, and so there exists a 0 small enough so that 203.4. Finiteness of regret (proof) a lattice of resolution 0 would give us the bound given in inequality (3:6). The end point of this process is depicted in Figure 3.1, where the relevant set RT lies inside the non-shaded region: the reason for this inclusion and \thickness" 4 is described below, in Step L1 of the proof: cf. Equation (3.7). Local: Once the algorithm has localized attention to a neighbourhood of xM , then we can show that the regret decreases exponentially; to do so, we will proceed by sampling the relevant set twice as densely and shrinking the relevant set, and repeating these two steps. The claim is that in each iteration, the maximum regret goes down exponentially and the number of the new points that are sampled in each re ning iteration is asymptotically constant. To prove this, we will write down the equations governing the behaviour of the number of sampled points and . We will adopt the following notation to carry out this task: { ‘ - the resolution of the lattice of sampled points at the end of the (‘ + 1)th re ning iteration inside R‘+1 (de ned below). { ‘ = sup x2R‘ N‘(x) at the end of the ‘ th iteration. Note that ‘ / 2‘ . Also, note that 0 by the choice of 0. { N‘ - number of points that have been sampled by the end of the ‘th iteration. { N‘ = N‘+1 N‘. { R‘ - the relevant set at the beginning of the ‘th iteration. Note that R1 B(xM ; 0). { ‘ = diam(R‘) 2 . Note that 1 < 0. 213.4. Finiteness of regret (proof) L1: We have the following chain of inequalities: N1 N0 +Wn 0 R0; (Rd; k k2) where n 0 R0; (Rd; k k2) is the 0-covering number as de ned in De nition 6 N0 +WN ( 0; 0) where N ( ; ) := n B(0; ); (Rd; k k2) N0 +WN 0 @ s 4 0 p N0 c2 ; 0 1 A N0 +WN 0 @ s 4 0 p b lnN0 c2 ; 0 1 A = N0 +WN c p 0 4 p lnN0; 0 where c := s 4 p b c2 In the rst line of the above chain of inequalities, the factor W multiplying the last term is any number greater than one that gives an upper bound on the algorithm’s wastefulness when it comes to producing a -covering of any set S Rd as compared to the minimum number of points necessary, i.e. n S; (Rd; k k2) . The expression r 4 0 p N0 c2 comes about as follows: using the notations B = p N0 and = N0 we know by Lemma 10 that f and are intertwined with each other in the sense that both of the following chains of inequality hold: B f +B f B f+B ; which, combined together, give us the following chain of inequalities f 2B B f +B f+2B : (3.7) Since, we also know that (x) 0 for all x 2 R0, we can conclude that f 2B 0 B +B f+2B 0: 223.4. Finiteness of regret (proof) Moreover, if condition (y) holds, we know that in R0, the function f satis es c1r2 < f(x) f(xM ) < c2r2, where r = r(x) := kx xMk, so we get that f(xM) c1r 2 2B 0 B +B f(xM) c2r 2+2B 0: Now, recall that R0 is de ned to consist of points x where (x)+B (x) sup D (x) B (x), but given the fact that we have the above outer envelope for B , we can conclude that R0 n x f(xM) c2r(x) 2+2B 0 max f(xM) c1r(x) 2 2B 0 o = n x f(xM) c2r(x) 2+2B 0 f(xM) 2B 0 o = n x c2r(x)2+2B 0 2B 0 o = n x c2r(x)2 4B 0 o = ( x r(x) r 4B 0 c2 ) Now, if, on the other hand, f satis es condition (z), then by the smoothness assumptions in (z), we know that rf(xM ) is perpendicular to @D at xM and so there exist positive numbers c1 and c2 such that in a neighbourhood of xM we have c1r f f(xM ) c2r 2: Note that in the argument above in the case of (y), the precise form of the lower bound on f was irrelevant, since all we are interested in is its maximum. So, the same argument goes through again. This is depicted in Figure 3.2, where B := p N0 = p b lnN0. 233.4. Finiteness of regret (proof) Figure 3.2: The shrinking of the relevant set R‘. Here, B = p N0 L‘+1: Now, let us suppose that we are the end of the ‘th iteration. We have N‘+1 N‘ +WN ( ‘; ‘) = N‘ +WN c p ‘ 4 p lnN‘; ‘ N‘ +WN c r 0 4‘ 4 p lnN‘; 0 2‘ by Proposition 1 = N‘ +WN c p 0 4 p lnN‘; 0 since N (2 ; 2 ) = N ( ; ) for any and := N‘ +Wn 0 B(0; c p 0 4 p lnN‘); (Rd; k k2) by the de nition of N N‘ +Wn 0 h c p 0 4 p lnN‘; c p 0 4 p lnN‘ id ; (Rd; k k2) N‘ +W 2c p 0 4 p lnN‘ 0 d since a regular lattice of resolution 0 gives a 0-covering N‘ + C(lnN‘) d 4 where C = W 2c p 0 0 d So, the number of samples needed by the branch and bound algorithm is governed by the di erence inequation N‘ C(lnN‘) d 4 : (3.8) 243.4. Finiteness of regret (proof) To study the solutions of this di erence equation, we consider the corresponding di er- ential equation: dN d‘ = C(lnN) d 4 : (3.9) Since this equation is separable, we can write dN (lnN) d 4 = Cd‘: Now, letting ‘ = L be a given number of iterations in the algorithm and N(L) the corre- sponding number of sampled points, we can integrate both sides of the above equation to get Z N(L) N(0) dN (lnN) d 4 = Z L 0 Cd‘ = CL: Given the fact that the integral on the left can’t be solved analytically, we will use the lower bound N(L) N(0) (lnN(L)) d 4 Z N(L) N(0) dN (lnN) d 4 to get N(L) N(0) C(lnN(L)) d 4 L (3.10) Given a time t, we will denote by ‘t the largest non-negative integer such that N‘t < t or 0 if no such number exists. We illustrate this somewhat obtuse de nition with the following example: 1 OO 2 OO N0 OO N1 OO N‘t OO t OO N‘t+1 OO 253.4. Finiteness of regret (proof) Now, by Lemma 11, for all t >> N0 we have rt 2 p t max R‘t t 2 p b ln t ‘t 2 0 p b ln t 4‘t 8 0 p b ln t 4‘t+1 8 0 p b ln t 1 4 N‘t+1 N0 C(lnN‘t+1) d=4 by Equation 3.10 8 0 p b ln t 1 4 DN‘t+1 (lnN‘t+1) d=4 for some D > 0 since N‘t+1 > N0 8 0 p b ln t 1 4 Dt (ln t)d=4 for t satisfying ln t > d 4 (see ? below) since t N‘t+1 8 0 p be Et (ln t)d=4 + ln ln t2 8 0 p be Et (ln t)d=4 + Et 2(ln t)d=4 for large enough t = Ae t (ln t)d=4 for A = 8 0 p b and = E=2. ? The reason for the speci c criterion ln t > d4 is that the function x (lnx)d=4 is increasing when this condition is satis ed, and so decreasing x from N‘t + 1 to t decreases its value, increasing the overall expression 1 4 x (ln x)d=4 . To see that x (lnx)d=4 becomes increasing when lnx > d4 , we simply need to calculate its derivative: d dx x (lnx)d=4 = 1 (lnx)d=4 d 4 x x(lnx)d=4+1 = lnx d4 (lnx)d=4 : Moreover, since N‘t+1 t, if the derivative of x (lnx)d=4 is positive at t, it is also positive between t and N‘t+1 and so the function is indeed increasing in that interval. 26Chapter 4 Discussion In this thesis, we have put forth a modi cation of the UCB algorithm presented in [11], which has the additional step of discarding pieces of the search space where the global maximum is very unlikely to be. We show that this additional step leads to a considerable improvement of the regret accrued by the algorithm: in particular, the cumulative regret obtained by our Branch and Bound (BB) algorithm is bounded from above, whereas the cumulative regret bounds obtained in [11] are unbounded. This is because UCB’s regret is O 1p t up to logarithmic factors, whereas for BB, the regret is O e t (ln t)d=4 . The main realization that we had in the process of writing this paper was that what a ects the rate of convergence the most is the possibility of dispensing with chunks of the search space. This observation can also be seen in the works involving hierarchical partitioning, e.g. [9], where regions of the space are deemed as less worthy of probing as time goes on. The most obvious next step is to carry out similar analysis for the case when there is non-trivial observation noise. Another natural question is whether or not, in each iteration of the Branch and Bound algorithm, one could dispense with points that lie outside the current (or perhaps the previous) relevant set when calculating the covariance matrix K in the posterior equations (2.3), and how much of an e ect such a computational cost-cutting measure would have on the regret encountered by the algorithm. Another obvious question is how does the performance of the Branch and Bound algorithm pre- sented here compares with a modi ed version of UCB algorithm in which in addition to optimizing the UCB surrogate function, one introduces an occasional \cleansing" step of getting rid of regions in the search space where the maximum is very unlikely to be. Finally, we believe that similar regret bounds should hold for contextual bandit problems, using similar methods, although the unpredictability of the context introduces new di culties. 27Bibliography [1] Robert J. Adler and Jonathan E. Taylor. Random Fields and Geometry. Springer, 2007. [2] Eric Brochu, Vlad M Cora, and Nando de Freitas. A tutorial on Bayesian optimization of ex- pensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-023, arXiv:1012.2599v1, University of British Columbia, Department of Computer Science, 2009. [3] S ebastien Bubeck, R emi Munos, Gilles Stoltz, and Csaba Szepesvari. X-armed bandits. Journal of Machine Learning Research, 12:1655{1695, 2011. [4] Adam D. Bull. Convergence rates of e cient global optimization algorithms. Journal of Machine Learning Research, 12:2879{2904, 2011. [5] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for non- parametric binary regression. Ann. Stat., 34:2413{2429, 2006. [6] P. Hansen, B. Jaumard, and S. Lu. Global optimization of univariate Lipschitz functions: I. survey and properties. Mathematical Programming, 55:251{272, 1992. [7] Matthew Ho man, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In Uncertainty in Arti cial Intelligence, pages 327{336, 2011. [8] Benedict May, Nathan Korda, Anthony Lee, and David Leslie. Optimistic Bayesian sampling in contextual-bandit problems. 2010. [9] R emi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, 2011. [10] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. 28[11] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian pro- cess optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010. [12] Michael L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999. [13] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008. [14] Emmanuel Vazquez and Julien Bect. Convergence properties of the expected improvement algorithm with xed mean and covariance functions. Journal of Statistical Planning and In- ference, 140:3088{3095, 2010. 29Apendix: Auxilliary Lemmas We begin our analysis by showing that, given su cient explored locations, the residual variance is small. More speci cally, for any point x contained in the convex hull of a set of d points that are no further than apart from x, we show that the residual is bounded by O(khkH 2), where khkH is the Hilbert Space norm of the associated function and that furthermore the residual variance is bounded by O( 2). This quantity is subsequently related to ‘2 and ‘1 covering numbers of the associated domain. We begin by relating residual variance, projection operators, and interpolation in Hilbert Spaces. The results are standard and we only present them here for the purpose of being self-contained. 30Apendix: Auxilliary Lemmas Lemma 7 (Hilbert Space Properties) Given a set of points x1:T := fx1; : : : ; xT g 2 D and a Reproducing Kernel Hilbert Space (RKHS) H with kernel the following bounds hold: 1. Any h 2 H is Lipschitz continuous with constant khkH L, where k kH is the Hilbert space norm and L satis es the following: L2 sup x2D @x@x0 (x; x 0)jx=x0 and for (x; x 0) = e (x x0) we have L2 @2xe (x)jx=0: (1) 2. Any h 2 H has its second derivative bounded by khkHQ where Q2 sup x2D @2x@ 2 x0 (x; x 0)jx=x0 and for (x; x 0) = e (x x0) we have Q2 @4xe (x)jx=0: (2) 3. The projection operator P1:T on the subspace span t=1:T f (xt; )g H is given by P1:Th := k >( )K 1 hk( ); hi (3) where k( ) = k1:T ( ) := [ (x1; ) (xT ; )] > and K := [ (xi; xj)]i;j=1:T ; moreover, we have hk( ); hi := 2 6 6 6 6 4 h (x1; ); hi ... h (xT ; ); hi 3 7 7 7 7 5 = 2 6 6 6 6 4 h(x1) ... h(xT ) 3 7 7 7 7 5 : Here P1:TP1:T = P1:T and kP1:T k 1 and k1 P1:T k 1. 4. Given sets x1:T x1:T 0 it follows that kP1:ThkH kP1:T 0hkH khkH. 5. Given tuples (xi; hi) with hi = h(xi), the minimum norm interpolation h with h(xi) = h(xi) is given by h = P1:Th. Consequently its residual g := (1 P1:T )h satis es g(xi) = 0 for all xi 2 x1:T . Proof We prove the claims in sequence. 1. This follows from Corollary 4.36 in [13], with j j = 1. 2. Same as above, just with j j = 2. 31Apendix: Auxilliary Lemmas 3. For any operator V with full column rank the projection on the image of V is given by V (V >V ) 1V >. In our case, V is de ned as V : RT // H w := 2 6 6 6 6 4 w1 ... wT 3 7 7 7 7 5 // k>( )w : To calculate the adjoint V >, let h 2 H and w 2 RT be arbitary elements and consider the following chain of equalities: D V >h;w E RT = hh; VwiH (cf. [13] Equation A.19) = D h;k>( )w E H = * h; k(x1; ) k(xT ; ) 2 6 6 6 6 4 w1 ... wT 3 7 7 7 7 5 + H = hh; k(x1; )w1 + + k(xT ; )wT iH = hh; k(x1; )iw1 + + hh; k(xT ; )iwT (by linearity of h ; iH) = hh; k(x1; )iH hh; k(xT ; )iH 2 6 6 6 6 4 w1 ... wT 3 7 7 7 7 5 = hk( ); hi>Hw (by the symmetry of h ; iH) = hhk( ); hiH ;wiRT (by the de ntion of h ; iRT ) Since, this equality holds for all w, we can conclude that for all h 2 H V >h = hk( ); hiH : Now, all we need to do is to calculate the expression V >V to complete the derivation of the 32Apendix: Auxilliary Lemmas expression for P1:T ; to this end let w 2 RT be arbitrary: V >Vw = D k( );k>( )w E H = D k( );k>( ) E H w = * 2 6 6 6 6 4 (x1; ) ... (xT ; ) 3 7 7 7 7 5 ; (x1; ) (xT ; ) + H w = 2 6 6 6 6 4 h (x1; ); (x1; )iH h (x1; ); (xT ; )iH ... . . . ... h (xT ; ); (x1; )iH h (xT ; ); (xT ; )iH 3 7 7 7 7 5 w = 2 6 6 6 6 4 (x1; x1) (x1; xT ) ... . . . ... (xT ; x1) (xT ; xT ) 3 7 7 7 7 5 w (cf. [13] De nition 4.18 and Equation 4.14) = Kw and so V >V = K. This provides us with P1:T . The remaining claims follow from standard properties of projec- tion operators. 4. Projection operators satisfy kP1:T k 1. This proves the second claim. The rst claim can be seen from the fact that projecting on a subspace can only have a smaller norm than the superspace projection. 5. We rst show that the projection is an interpolation. This follows from h(xi) = P1:Th(xi) = hP1:Th; (xi; )i = hh; P1:T (xi; )i = hh; (xi; )i = h(xi): Correspondingly g(xi) = h(xi) h(xi) = 0 for all xi 2 x1:T . By construction P1:Th uses h only in evaluations h(xi), hence for any two functions h; h0 with h(xi) = h0(xi) we have P1:Th = P1:Th0. Since kP1:T k 1 it follows that kP1:Thk khkH. Hence there is no interpolation with norm smaller than kP1:Thk. 33Apendix: Auxilliary Lemmas Lemma 8 (Gaussian Process Variance) Under the assumptions of Lemma 7 it follows that jh(x) P1:Th(x)j khkH T (x) where 2 T (x) = (x; x) k > 1:T (x)K 1k1:T (x) (4) and this bound is tight. Moreover, 2T (x) is the residual variance of a Gaussian process with the same kernel. Proof To see the bound we again use the Cauchy-Schwartz inequality jh(x) P1:Th(x)j = j(1 P1:T )h(x)j = jh(1 P1:T )h; (x; )iHj (by the de ning property of h ; iH, cf. [13] Def. 4.18) = jhh; (1 P1:T ) (x; )iHj (since 1 P1:T is an orthogonal projection and so self-adjoint) khkH k(1 P1:T ) (x; )k (by Cauchy-Schwarz) This inequality is clearly tight for h = (1 P1:T ) (x; ) by the nature of dual norms. Next note that k(1 P1:T ) (x; )k 2 = h(1 P1:T ) (x; ); (1 P1:T ) (x; )i = h (x; ); (1 P1:T ) (x; )i = (x; x) h (x; ); P1:T (x; )i = 2 T (x): The second equality follows from the fact that 1 P1:T is idempotent. The last equality follows from the de nition of P1:T . The fact that 2T (x) is the residual variance of a Gaussian Process re- gression estimate is well known in the literature and follows, e.g. from the matrix inversion lemma. 34Apendix: Auxilliary Lemmas Lemma 9 (Approximation Guarantees) In the following we denote by x1:T D a set of loca- tions and assume that g(xi) = 0 for all xi 2 x1:T . 1. Assume that g is Lipschitz continuous with bound L. Then g(x) Ld(x; x1:T ), where d(x; x1:T ) is the minimum distance kx xik between x and any xi 2 x1:T . 2. Assume that g has its second derivative bounded by Q0. Moreover, assume that x is contained inside the convex hull of x1:T such that the smallest such convex hull has a maximum pairwise distance between vertices of d. Then we have g(x) 14Q 0d2. Proof The rst claim is an immediate consequence of the Lipschitz property of g. To see the second claim we need to establish a number of issues: without loss of generality assume that the maximum within the convex hull containing x is attained at x (and that the maximum rather than the minimum denotes the maximum deviation from 0). The maximum distance of x to one of its vertices is bounded by = p 2. This is established by considering the minimum enclosing ball and realizing that the maximum distance is achieved for the regular polyhedron. To see the maximum deviation from 0 we exploit the fact that @xg(x) = 0 by the assumption of x being the maximum (we need not consider cases where x is on a facet of the polyhedral set since in this case we could easily reduce the dimensionality). In this case the largest deviation between g(x) and g(xi) is obtained by making g a quadratic function g(x0) = Q0 2 kx 0 xk2. At distance p 2 the function value is bounded by 2Q0 4 . Since the latter bounds the maximum deviation it does bound it for g in particular. This proves the claim. 35Apendix: Auxilliary Lemmas For the sake of completeness of the exposition, we will also include the proofs of the following two lemmas, which rst appeared in [11]: Lemma 10 (Lemma 5.1 of [11]) Given any nite set L, any sequnce of points fx1; x2; : : :g L and f : L ! R a sample from GP(0; ( ; )), for all 2 (0; 1), we have P n 8x 2 L; t 1 : jf(x) t 1(x)j p t t 1(x) o 1 ; where t = 2 ln jLj t and f tg is any positive sequence satisfying P 1 t = 1. Here jLj denotes the number of elements in L. Proof Looking at the complement of the set whose probability we’re interested in, we have P n 8x 2 L; t 1 : jf(x) t 1(x)j p t t 1(x) o = 1 P n 9x 2 L; t 1 s:t: jf(x) t 1(x)j > p t t 1(x) o = 1 P [t 1 [x2L jf(x) t 1(x)j > p t t 1(x) 1 X t 1 X x2L P jf(x) t 1(x)j > p t t 1(x) = 1 X t 1 X x2L 2P f(x) > t 1(x) + p t t 1(x) = 1 X t 1 X x2L 2P n r > p t r N (0; 1) o = 1 X t 1 X x2L 2 p 2 Z 1 p t e r2 2 dr = 1 X t 1 X x2L 2 p 2 Z 1 0 e (u+ p t) 2 2 du (u = r p t; du = dr; r = u+ p t) = 1 X t 1 X x2L 2 p 2 Z 1 0 e u2+2u p t+ t 2 du = 1 X t 1 X x2L 2e t 2 p 2 Z 1 0 e u p te u2 2 du 1 X t 1 X x2L 2e t 2 p 2 Z 1 0 e u2 2 du (* e u p t 1 8u 0) = 1 X t 1 X x2L e t 2 2 4 2 R1 0 e u 2 2 du p 2 3 5 = 1 X t 1 X x2L e t 2 * 12 4 R1 1 e u 2 2 du p 2 3 5 = 1 X t 1 X x2L e t 2 = 1 X t 1 X x2L tjLj = 1 X t 1 1 t X x2L 1 jLj = 1 : 36Apendix: Auxilliary Lemmas Lemma 11 (Lemma 5.2 in [11]) Let L a non-empty nite set and f : L ! R an arbitrary function. Also assume that there exist functions ; : L ! R and a constant p , such that jf(x) (x)j p 8x 2 L: (y) Then, we have r(x) 2 p (x) 2 p max L : Proof We know from the de nition of sup and the upper and the lower bounds given by (y) that (x) p (x) f(x) sup L f sup L ( + p ) = (x) + p (x): So, f(x) and supL f lie in the interval [ (x) p (x); (x) + p (x)], and so their distance from each other can be at most 2 p (x), and so r(x) := sup L f f(x) 2 p (x): 37
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Regret bounds for Gaussian process bandits without...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Regret bounds for Gaussian process bandits without observation noise Zoghi, Masrour 2012-12-31
pdf
Page Metadata
Item Metadata
Title | Regret bounds for Gaussian process bandits without observation noise |
Creator |
Zoghi, Masrour |
Publisher | University of British Columbia |
Date | 2012 |
Date Issued | 2012-08-02 |
Description | This thesis presents some statistical refinements of the bandits approach presented in [11] in the situation where there is no observation noise. We give an improved bound on the cumulative regret of the samples chosen by an algorithm that is related (though not identical) to the UCB algorithm of [11] in a complementary setting. Given a function f on a domain D ⊆ R^d , sampled from a Gaussian process with an anisotropic kernel that is four times differentiable at 0, and a lattice L ⊆ D, we show that if the points in L are chosen for sampling using our branch-and-bound algorithm, the regret asymptotically decreases according to O(e^{τt/(ln t)^{d/4}}) with high probability, where t is the number of observations carried out so far and τ is a constant that depends on the objective function. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Collection |
Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2012-08-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-ShareAlike 3.0 Unported |
DOI | 10.14288/1.0052133 |
URI | http://hdl.handle.net/2429/42865 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-sa/3.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- ubc_2012_fall_zoghi_masrour.pdf [ 589.92kB ]
- Metadata
- JSON: 1.0052133.json
- JSON-LD: 1.0052133+ld.json
- RDF/XML (Pretty): 1.0052133.xml
- RDF/JSON: 1.0052133+rdf.json
- Turtle: 1.0052133+rdf-turtle.txt
- N-Triples: 1.0052133+rdf-ntriples.txt
- Original Record: 1.0052133 +original-record.json
- Full Text
- 1.0052133.txt
- Citation
- 1.0052133.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 32 | 16 |
Spain | 18 | 0 |
United States | 15 | 2 |
Japan | 6 | 0 |
France | 5 | 0 |
Germany | 4 | 24 |
Netherlands | 3 | 0 |
Canada | 3 | 0 |
India | 2 | 0 |
Russia | 2 | 1 |
Chile | 1 | 0 |
Republic of Korea | 1 | 0 |
City | Views | Downloads |
---|---|---|
Unknown | 26 | 30 |
Shenzhen | 14 | 11 |
Ashburn | 11 | 0 |
Tianjin | 10 | 5 |
Tokyo | 6 | 0 |
Beijing | 4 | 0 |
Nanchang | 3 | 0 |
Edmonton | 3 | 0 |
Mannheim | 3 | 0 |
Mountain View | 3 | 1 |
Roubaix | 2 | 0 |
Saint Petersburg | 2 | 0 |
Chengdu | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0052133/manifest