Regret Bounds for Gaussian Process Bandits Without Observation Noise by Masrour Zoghi B.Sc., The University of British Columbia, 2003 M.Sc., University of Toronto, 2004 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) August 2012 c© Masrour Zoghi 2012 Abstract This thesis presents some statistical refinements of the bandits approach presented in [11] in the situation where there is no observation noise. We give an improved bound on the cumulative regret of the samples chosen by an algorithm that is related (though not identical) to the UCB algorithm of [11] in a complementary setting. Given a function f on a domain D ⊆ Rd, sampled from a Gaussian process with an anisotropic kernel that is four times differentiable at 0, and a lattice L ⊆ D, we show that if the points in L are chosen for sampling using our branch-and-bound algorithm, the regret asymptotically decreases according to O ( e − τt (ln t)d/4 ) with high probability, where t is the number of observations carried out so far and τ is a constant that depends on the objective function. ii Preface This thesis grew out of a collaboration with Alex Smola and my supervisor, Nando de Freitas. An abridged version of the results presented here were presented in the Workshop on Bayesian Optimization at NIPS 2011. The results included in Section 3.1 and the Appendix were obtained by the other authors and the proofs are only included for the sake of completeness of the exposition. Everything else was proven and written by the author of this thesis. iii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Global optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Costly evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Gaussian process bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Surrogates for optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Our algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Approximation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Finiteness of regret (statement) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Remarks on the main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3.1 On the statement of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3.2 Outline of the proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . 17 iv Table of Contents 3.3.3 Further remarks on the GP prior . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Finiteness of regret (proof) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Appendices Apendix: Auxilliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 v List of Figures 1.1 Lipschitz hypothesis being used to discard pieces of the search space . . . . . . . . . 3 1.2 A sample application of the branch and bound algorithm . . . . . . . . . . . . . . . 5 2.1 Branch and Bound algorithm for a 2D function . . . . . . . . . . . . . . . . . . . . . 10 3.1 The elimination of other smaller peaks . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 The shrinking of the relevant set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 vi Acknowledgements First and foremost, I would like to thank my supervisor for all the direction he provided throughout this thesis and for his patience when the proofs were not transpiring as effortlessly. I am also very grateful to Alex Smola for his collaboration and for providing me with an elegant proof of a key lemma when I was suffering through a much messier argument. This work benefited greatly from many lengthy discussions with John Chia, who was working on the more experimental aspects of this project. I learned a lot from talking to both Matt Hoffman and Ziyu Wang, who were working on related topics, and who happened to sit at the two desks next to mine. I would also like to thank our wonderful Group Assistant, Kath Imhiran, for her incessant pleasantness, which is truly a scarce commodity. Outside of the university, I would like to thank my many “sisters and brothers in arms” from the Mining Justice Alliance, who added some meaning to my life by allowing me to assist them in their efforts to shed some light on injustices emanating from Vancouver. vii Dedication I would like to dedicate this thesis to my parents, without whose sacrifices over the last 33 years none of this would be possible, and to M.S. who contributed greatly to my intellectual cultivation during my short stay in Vancouver. viii Chapter 1 Introduction Let f : D → R be a function on a compact subset D ⊆ Rd. We would like to address the global optimization problem xM = argmax x∈D f(x). Let us assume for the sake of simplicity that the objective function f has a unique global maximum (although it can have many local maxima) The space D might be the set of possible parameters that one could feed into a time-consuming algorithm or the locations where a sensor could be deployed, and the function f might be a measure of the performance of the algorithm (e.g. how long it takes to run) or whatever quantity it is that our given sensor measures (e.g. temperature). In this paper, our assumption is that once the function has been probed at point x ∈ D, then the value f(x) can be observed with very high precision. This is the case when the deployed sensors are very accurate or if the algorithm whose parameters are being configured is deterministic. The challenge in these applications is two-fold; the optimization problem is of a global nature without any convexity assumption on f and it is costly to evaluate f at any given point. We will discuss these two stumbling blocks below. 1 1.1. Global optimization 1.1 Global optimization Global optimization is a difficult problem without any assumptions on the objective function f . The main complicating factor is the uncertainty over the extent of the variations of f , e.g. one could consider the characteristic function χ{xM}, which is equal to 1 at xM and 0 elsewhere, and none of the methods we mention here can optimize this function without exhaustively searching through every point in D. The way a large number of global optimization methods address this problem is by imposing some prior assumption on how fast the objective function f can vary. The most blatant manifes- tation of this remedy is the imposition of a Lipschitz assumption on the function f , which requires the change in the value of f(x), as the point x moves around, to be smaller than a constant multiple of the distance traveled by x (cf. [6]). In fact, as pointed out in [3] (cf. Figure 3 therein), it is only important to have this kind of tight control over the function near its optimum: elsewhere in the space, we can have what they have dubbed a “weak Lipschitz” condition. One way to relax these hard Lipschitz constraints is by putting a Gaussian Process (GP) prior on the function. Instead of restricting the function from oscillating too fast, a GP prior requires those fast oscillations to have low probability (cf. [5], Theorem 5). This is the setting of Bayesian optimization (aka GP bandits), with which we concern ourselves: a review of GPs and Bayesian optimization is provided in Section 2. The main point of these derivative bounds (be it hard or soft, as outlined above) is to assist with the exploration-exploitation trade-off that global optimization algorithms have to grapple with. In the absence of any assumptions of convexity on the objective function, a global optimization algorithm is forced to explore enough until it reaches a point in the process when with some degree of certainty it can localize its search space and perform local optimization (aka exploitation). Derivative bounds such as the ones discussed here together with the boundedness of the search space, guaranteed by the compactness assumption on D, provide us with such certainty by producing a useful upper bound that allows us to shrink the search space. This is illustrated in Figure 1.1: suppose that we know that our function is Lipschitz with Lipschitz constant L, then given sample points as shown in the figure (the red dots), we can use the Lipschitz property to discard pieces of the search space (the shaded regions). This is done 2 1.1. Global optimization Figure 1.1: An example of the Lipschitz hypothesis being used to discard pieces of the search space. by finding points in the search space where the function could not possibly be higher than the maximum value we’ve already observed without violating our Lipschitz hypothesis. Such points are found by placing cones at the sampled points with slope equal to L and seeing where those cones lie below the maximum observed value (the red dashed line in this case): since the function is guaranteed to be bounded from above by the cones, we know that there is nothing interesting to find in the shaded regions. Note that this crude approach is often very wasteful because very often the slope of the function is much smaller than L, and as we will see below (cf. Figure 1.2), GPs do a better job of providing an upper bound that could be used to limit the search space, by essentially choosing Lipschitz constants that vary over space and time. 3 1.2. Costly evaluation 1.2 Costly evaluation Now, let us also assume that the objective function f is costly to evaluate (e.g. time-wise or financially), in which case we would like to avoid probing f as much as possible and to get close to the optimum as quickly as possible. The solution to this problem is to approximate f with a surrogate function that provides a good upper bound for f and which is easier to calculate and optimize. In the Lipschitz-based algorithms, this upper bound is provided by the Lipschitz constraint, whereas in the case of X-armed bandits, this bound is provided by the functions Bh,i(n) defined on page 1663 of [3]. Again, GPs provide a soft version of this by providing us with high probability bounds that could be modified depending on how certain we would like to be of our upper bounds. Moreover, the upper bounds provided by GPs have analytic expressions, which make evaluation and optimization relatively easy. We refer the reader to [2] for a general review of the literature on the various surrogate functions utilized in Bayesian optimization. Let us also point out that in addition to being easier to evaluate, surrogate functions can also aid with global optimization by restricting the domain of interest. More precisely, if we know that f ≤ g for some known function g and we have probed f at x0 to get the value y0 = f(x0), then we can shrink our search space to the relevant set R = {x ∈ D|g(x) > y0}, which will be substantially smaller than D, supposing that g is a relatively tight upper bound of f near the optimum: it can be atrociously inaccurate away from the optimum as long as it stays below max f . The surrogate function that we will make extensive use of here is called the Upper Confidence Bound (UCB), which is defined to be µ + Bσ, where µ and σ are the posterior predictive mean and standard deviation of the GP and B is a constant to be chosen by the algorithm. The UCB surrogate function has the desirable property that it provides the tightest bound on f where the function has been probed the most: this is illustrated in Figure 1.2. 4 1.2. Costly evaluation Figure 1.2: An example of our branch and bound maximization algorithm with UCB surrogate µ + Bσ, where µ and σ are the mean and standard deviation of the GP respectively. The region consisting of the points x for which the upper confidence bound µ(x) + Bσ(x) is lower that the maximum value of the lower confidence bound µ(x)−Bσ(x) does not need to be sampled anymore. Note that the UCB surrogate function bounds f from above. This surrogate function has been studied extensively in the literature and this paper relies heavily on the ideas put forth in the paper by Srinivas et al [11], in which the algorithm consists of repeated optimization of the UCB surrogate function after each sample. One key difference between our setting and that of [11] is that, whereas we assume that the value of the function can be observed exactly, in [11] it is necessary for the noise to be non-trivial (and Gaussian) because the main quantity that is used in their estimates, namely information gain (cf. Equation (3) in [11]), becomes undefined when the variance of the observation noise (σ2 in their notation) is set to 0: cf. the expression for I(yA; fA) that was given in the paragraph following Equation (3). Moreover, our algorithm can in some ways be thought of as a limiting case of the UCB algorithm in [11] when the parameter B goes to infinity because our main goal is to shrink σ as much as possible, even though we do use µ + Bσ to limit our search space. In subsequent work, we hope to prove that the algorithm presented here is inferior to the UCB algorithm, but in the mean time we thought it valuable to present this manifestly wasteful algorithm and show that the cumulative regret RT one obtains from applying it belongs to O(1), given a fixed function, in contrast to the previous bounds that grow as O( √ T ), where T is the number of samples collected. In fact, we show that the regret rT decreases according to O ( e − τt (ln t)d/4 ) . This asymptotic result is valuable to applications that are meant to run indefinitely because it unburdens the user from having to decide when to stop the bandit algorithm and just exploit the 5 1.2. Costly evaluation optimum point found so far because the overall regret the algorithm can accrue is bounded from above. The paper whose results are most similar to ours is [9], but there are some key differences in the methodology, analysis and obtained rates. For instance, we are interested in cumulative regret, whereas the results of [9] are proven for finite stop-time regret. In our case, the ideal application is the optimization of a function that is C2-smooth and has an unknown non-singular Hessian at the maximum. We obtain a regret rate O ( e − τt (ln t)d/4 ) , whereas the DOO algorithm in [9] has regret rate O(e−t) if the Hessian is known and the SOO algorithm has regret rate O(e− √ t) if the Hessian is unknown. In addition, the algorithms in [9] can handle functions that behave like −c‖x− xM‖α near the maximum (cf. Example 2 therein). This problem was also studied by [14] and [4], but using the Expected Improvement surrogate instead of UCB. Our methodology and results are different, but complementary to theirs. 6 Chapter 2 Gaussian process bandits 2.1 Gaussian processes As in [11], the objective function is distributed according to a Gaussian process prior: f(x) ∼ GP(m(·), κ(·, ·)). (2.1) For convenience, and without loss of generality, we assume that the prior mean vanishes, i.e., m(·) = 0. There are many possible choices for the covariance kernel. One obvious choice is the anisotropic kernel κ with a vector of known hyperparameters [10]: κ(xi, xj) = κ̃ ( −(xi − xj)>D(xi − xj) ) , (2.2) where κ̃ is an isotropic kernel and D is a diagonal matrix with positive hyperparameters along the diagonal and zeros elsewhere. Our results apply to squared exponential kernels and Matérn kernels with parameter ν ≥ 2. In this paper, we assume that the hyperparameters are fixed and known in advance. We can sample the GP at t points by choosing points x1:t := {x1, . . . , xt} and sampling the values of the function at these points to produce the vector f1:t = [f(x1) · · · f(xt)]>. The function values are distributed according to a multivariate Gaussian distribution N (0,K), with covariance entries κ(xi, xj). Assume that we already have several observations from previous steps, and that we want to decide what action xt+1 should be considered next. Let us denote the value of the function at this arbitrary new point as ft+1. Then, by the properties of GPs, f1:t and ft+1 are 7 2.2. Surrogates for optimization jointly Gaussian: f1:t ft+1 ∼ N 0, K k> k κ(xt+1, xt+1) , where k = [κ(xt+1, x1) · · ·κ(xt+1, xt)]>. Using the Schur complement, one arrives at an expression for the posterior predictive distribution: P (ft+1|x1:t+1, f1:t) = N (µt(xt+1), σ2t (xt+1)), where µt(xt+1) = k >K−1f1:t, σ2t (xt+1) = κ(xt+1, xt+1)− k>K−1k (2.3) and f1:t = [f(x1) · · · f(xt)]>. 2.2 Surrogates for optimization When it is assumed that the objective function f is sampled from a GP, one can use a combination of the posterior predictive mean and variance given by Equations (2.3) to construct surrogate functions, which tell us where to sample next. Here we use the UCB combination, which is given by µt(x) +Btσt(x), where {Bt}∞t=1 is a sequence of numbers specified by the algorithm. This surrogate trades-off explo- ration and exploitation since it is optimized by choosing points where the mean is high (exploitation) and where the variance is large (exploration). Since the surrogate has an analytical expression that is easy to evaluate, it is much easier to optimize than the original objective function. Other popular surrogate functions constructed using the sufficient statistics of the GP include the Probability of Improvement, Expected Improvement and Thompson sampling. We refer the reader to [2, 7, 8] for details on these. 8 2.3. Our algorithm 2.3 Our algorithm The main idea of our algorithm (Algorithm 1) is to tighten the bound on f given by the UCB surrogate function by sampling the search space more and more densely and shrinking this space as more and more of the UCB surrogate function is “submerged” under the maximum of the Lower Confidence Bound (LCB). Figure 1.2 illustrates this intuition. Algorithm 1 Branch and Bound Input: A compact subset D ⊆ Rd, a discrete lattice L ⊆ D and a function f : D → R. R ← D δ ← 1 repeat Sample Twice as Densely: • δ ← δ 2• Sample f at enough points in L so that every point in R is contained in a simplex of size δ. Shrink the Relevant Region: • Set R̃ := { x ∈ R ∣∣∣∣µT (x) +√βTσT (x) > supR µT (x)−√βTσT (x) } . T is the number points sampled so far and βT = 2 ln ( |L|T 2 α ) = 4 lnT + 2 ln |L|α with α ∈ (0, 1). • Solve the following constrained optimization problem: (x∗1, x ∗ 2) = argsup (x1,x2)∈R̃×R̃ ‖x1 − x2‖ • R ← B ( x∗1 + x∗2 2 , ‖x∗1 − x∗2‖ ) , where B(p, r) is the ball of radius r centred around p. until R∩ L = ∅ More specifically, the algorithm consists of two iterative stages. During the first stage, the function is sampled along a lattice of points (the red crosses in Figure 2.1). In the second stage, the search space is shrunk to discard regions where the maximum is very unlikely to reside. Such regions are obtained by finding points where the UCB is lower than the LCB (the complement of the colored region in the same panel as before). The remaining set of relevant points is denoted by R̃. In order to simplify the task of shrinking the search space, we simply find an enclosing ball, which is denoted by R in Algorithm 1. Back to the first stage, we consider a lattice that is twice as dense as in the first stage of the previous iteration, but we only sample at points that lie within 9 2.3. Our algorithm Figure 2.1: Branch and Bound algorithm for a 2D function. The colored region is the search space and the color-map, with red high and blue low, illustrates the value of the UCB. Four steps of the algorithm are shown; progressing from left to right and top to bottom. The green dots designate the points where the function was sampled in the previous steps, while the red crosses denote the freshly sampled points. our new smaller search space. In the second stage, the auxiliary step of approximating the relevant set R̃ with the ball R introduces inefficiencies in the algorithm, since we only need to sample inside R̃. This can be easily remedied in practice to obtain an efficient algorithm, for instance by using an ellipsoid instead of a ball. Our analysis will show that even without these improvements it is already possible to obtain very strong exponential convergence rates. Of course, practical improvements will result in better constants and ought to be considered seriously. 10 Chapter 3 Analysis 3.1 Approximation results Lemma 1 (Variance Bound) Let κ : Rd × Rd → R be a kernel that is four times differentiable along the diagonal {(x, x) |x ∈ Rd}, with Q defined as in Lemma 7.2, and f ∼ GP (0, κ(·, ·)) a sample from the corresponding Gaussian Process. If f is sampled at points x1:T = {x1, . . . , xT } that form a δ-cover of a subset D ⊆ Rd, then the resulting posterior predictive standard deviation σT satisfies sup D σT ≤ Qδ 2 4 Proof Let H be the RKHS corresponding to κ and h ∈ H an arbitrary element, with g := (1− P1:T )h the residual defined in Lemma 7.5. By Lemma 7.3, we know that ‖1− P1:T ‖ ≤ 1 and so we have ‖g‖H ≤ ‖1− P1:T ‖ ‖h‖H ≤ ‖h‖H (3.1) Moreover, by Lemma 7.2, we know that the second derivative of g is bounded by ‖g‖HQ, and since by Lemma 7.5 we know that g vanishes at each xi, we can use Lemma 9.2 and the inequality given by inequality (3.1) to conclude that |h(x)− P1:Th(x)| := |g(x)| ≤ ‖g‖HQδ 2 4 by Lemma 9.2 ≤ ‖h‖HQδ 2 4 by inequality (3.1) and so for all x ∈ D we have |h(x)− P1:Th(x)| ≤ Qδ 2 4 ‖h‖H (3.2) 11 3.1. Approximation results On the other hand, by Lemma 8, we know that for all x ∈ D we have the following tight bound: |h(x)− P1:Th(x)| ≤ σT (x) ‖h‖H . (3.3) Now, given the fact that both inequalities (3.2) and (3.3) are bounding the same quantity and that the latter is a tight estimate, we necessarily have that σT (x) ‖h‖H ≤ Qδ2 4 ‖h‖H . Canceling ‖h‖H gives the desired result. 12 3.2. Finiteness of regret (statement) 3.2 Finiteness of regret (statement) Having shown that the variance vanishes according to the square of the resolution of the lattice of sampled points, we now move on to show that this estimate implies an exponential asymptotic vanishing of the regret encountered by our Branch and Bound algorithm. This is laid out in our main theorem stated below and proven in the supplementary material. The theorem considers a function f , which is a sample from a GP with a kernel that is four times differentiable along its diagonal. The global maximum of f can appear in the interior of the search space, with the function being twice differentiable at the maximum and with non-vanishing curvature. Alternatively, the maximum can appear on the boundary with the function having non- vanishing gradient at the maximum. Given a lattice that is fine enough, the theorem asserts that the regret asymptotically decreases in exponential fashion. The main idea of the proof of this theorem is to use the bound on σ given by Proposition 1 to reduce the size of the search space. The key assumption about the function that the proof utilizes is the quadratic upper bound on the objective function f near its global maximum, which together with Proposition 1 allows us to shrink the relevant region R in Algorithm 1 rapidly. The figures in the proof give a picture of this idea. The only complicating factor is the factor √ βt in the expression for the UCB that needs to be estimated. This is dealt with by modeling the growth in the number of points sampled in each iteration with a difference equation and finding an approximate solution of that equation. Recall that D ⊆ Rd is assumed to be a non-empty compact subset and f a sample from the Gaussian Process GP (0, κ(·, ·)) on D. Moreover, in what follows we will use the notation xM := argmax x∈D f(x). Also, by convention, for any set S, we will denote its interior by S◦, its boundary by ∂S and if S is a subset of Rd, then conv(S) will denote its convex hull. The following holds true: Theorem 2 Suppose we are given: 1. α > 0, a compact subset D ⊆ Rd, and κ a stationary kernel on Rd that is four times differ- entiable; 2. f ∼ GP(0, κ) a continuous sample on D that has a unique global maximum xM , which satisfies 13 3.2. Finiteness of regret (statement) one of the following two conditions: (†) xM ∈ D◦ and f(xM )− c1‖x− xM‖2 < f(x) ≤ f(xM )− c2‖x− xM‖2 for all x satisfying x ∈ B(xM , ρ0) for some ρ0 > 0; (‡) xM ∈ ∂D and both f and ∂D are smooth at xM , with ∇f(xM ) 6= 0; 3. any lattice L ⊆ D satisfying the following two conditions • 2L ∩ conv(L) ⊆ L (3.4) • 2 ⌈ − log2 ρ0diam(D) ⌉ +1L ∩ L 6= ∅ (3.5) if f satisfies (†) Then, there exist positive numbers A and τ and an integer T such that the points specified by the Branch and Bound algorithm, {xt}, will satisfy the following asymptotic bound: For all t > T , with probability 1− α we have r(xt) < Ae − τt (ln t)d/4 . We would like to make a few clarifying remarks about the theorem. First, note that for a random sample f ∼ GP(0, κ) one of conditions (†) and (‡) will be satisfied almost surely if κ is a Matérn kernel with ν > 2 and the squared exponential kernel because the sample f is twice differentiable almost surely by [1, Theorem 1.4.2] and [12, §2.6]) and the vanishing of at least one of the eigenvalues of the Hessian is a co-dimension 1 condition in the space of all functions that are smooth at a given point, so it has zero chance of happening at the global maximum. Second, the two conditions (3.4) and (3.5) simply require that the lattice be “divisible by 2” and that it be fine enough so that the algorithm can sample inside the ball B(xM , ρ0) when the maximum of the function is located in the interior of the search space D. Finally, it is important to point out that the rate decay τ does not depend on the choice of the lattice L, even though as stated, the statement of the theorem chooses τ only after L is specified. The theorem was written this way simply for the sake of readability. Given the exponential rate of convergence we obtain in Theorem 2, we have the following finiteness conclusion for the cumulative regret accrued by our Branch and Bound algorithm: 14 3.2. Finiteness of regret (statement) Corollary 3 Given κ, f ∼ GP(0, κ) and L ⊆ D as in Theorem 2, the cumulative regret is bounded from above. Remark 4 It is worth pointing out the trivial observation that using a simple UCB algorithm with monotonically increasing and unbounded factor √ βt, without any shrinking of the search space as we do here, necessarily leads to unbounded cumulative regret since eventually √ βt becomes large enough so that at points x′ far away from the maximum, √ βtσt(x ′) becomes larger than f(xM )− f(x). In fact, eventually the UCB algorithm will sample every point in the lattice L. 15 3.3. Remarks on the main theorem 3.3 Remarks on the main theorem This section includes a discussion of the assumptions placed on the objective function in Theorem 2 as well as an outline of the proof, the full details of which are included in the appendix. 3.3.1 On the statement of Theorem 2 A few remarks on the assumptions and the conclusion of the main theorem are in order: A. Relationship between the local and global assumptions on f : The theorem has two seemingly unrelated restrictions on the function f : the global GP prior and the local behaviour near the global maximum. However, in many circumstances of interest, the local condition follows almost surely from the global condition. Two such circumstances are if κ is a Matérn kernel with ν > 2 (including the squared exponential kernel) or if κ is six times differentiable. In either case, the sample f is twice differentiable almost surely, in the former case by [1, Theorem 1.4.2] and [12, §2.6]) and in the latter situation by [5, Theorem 5]. If the global maximum xM lies in the interior of D, the Hessian of f at xM will almost surely be non-singular since the vanishing of at least one of the eigenvalues of the Hessian is a co-dimension 1 condition in the space of all functions that are smooth at a given point, hence justifying condition (†). On the other hand, if xM lies on the boundary of D, then condition (‡) will be satisfied almost surely, since the additional event of the vanishing of ∇f(xM ) is a codimension d phenomenon in the space of functions with global maximum at xM . B. Uniqueness of the global maximum: A randomly drawn continuous sample from a GP on a compact domain will almost surely have a unique global maximum: this is because the space of continuous functions on a compact domain that attain their global maximum at more than one point have codimension one in the space of all continuous functions on that domain. C. Assumptions on L: The two conditions (3.4) and (3.5) simply require that the lattice be “divisible by 2” and that it be fine enough so that the algorithm can sample inside the ball B(xM , ρ0) when the maximum of the function is located in the interior of the search space D. One can simply choose L to be the set of points in D that have floating point coordinates: it’s just the points at which the algorithm is allowed to sample the function. D. On τ ’s dependence: Finally, it is important to point out that the decay rate τ does not 16 3.3. Remarks on the main theorem depend on the choice of the lattice L, even though as stated, the statement of the theorem chooses τ only after L is specified. The theorem was written this way simply for the sake of readability. 3.3.2 Outline of the proof of Theorem 2 The starting point for the proof is the observation that one can use the posterior predictive mean and standard deviation of the GP to obtain a high probability envelope around the objective function (cf. Lemma 10 in the appendix). Given the fact that the thickness of this envelope is determined by the height of the posterior predictive standard deviation, σ, we can use the bound given by Proposition 1 to show that asymptotically one can rapidly dispense with large portions of the search space, as illustrated in Figure 1.2. One disconcerting component of Algorithm 1 is the step that requires sampling twice as densely in each iteration, since the number of samples can start to grow exponentially, hence killing any hope of obtaining exponentially decreasing regret. However, this is where the assumption on the local behaviour near the global maximum becomes relevant. Since Proposition 1 tells us that every time the function is sampled twice as densely, σ decreases by a factor of 4, and given our assumption that the function has quadratic behaviour near the global maximum, we can conclude that the radius of the search space is halved after each iteration and so the number of sampled points added in each iteration roughly remains constant. Of course, this assumes that the multiplicative factor √ βt remains constant in this process. However, the algorithm requires √ βt to grow logarithmically, and so to fill this gap we need to bound the growth of √ βt, which is tied to the number of samples needed in each iteration of the algorithm, which in turn is linked to the resolution of the lattice of sampled points δ and the size of the relevant set R, which in turn depends on the size of √βtσt. This circular dependence gives rise to a difference equation, whose solutions we bound by solving the corresponding differential equation. 3.3.3 Further remarks on the GP prior Let us step back for a moment and pose the question of whether it would be possible to carry out a similar line of reasoning under other circumstances. To answer this, one needs to identify the key ingredients of the proof, which are the following: 17 3.3. Remarks on the main theorem A. A mechanism for calculating a high probability envelope around the objective function (cf. Lemma 10); B. An estimate showing that the thickness of the envelope diminishes rapidly as the function is sampled more and more densely (cf. Proposition 1), so that the search space can be shrunk under reasonable assumptions on the behaviour of the function near the peak. The reason for our imposing a GP prior on f is that it gives us property A, while our smoothness assumption on the kernel guarantees property B. However, GPs are but one way one could obtain these properties and they do this essentially by coming up with local estimates of the Lipschitz constant based on the observed values of the objective function nearby. Perhaps one could explicitly incorporate similar local estimates on the Lipschitz constant into tree based approaches like HOO and SOO, cf. [3] and [9], in which case one would be able to dispense with the GP assumption and get similar performance. But, that is beyond the scope of this paper and will be left for future work. 18 3.4. Finiteness of regret (proof) 3.4 Finiteness of regret (proof) We will need the following two definitions in the proof of the theorem: Definition 5 Given the above setup, the regret function is defined to be r(x) = max D f − f(x). Definition 6 (Covering Number) Denote by B a Banach space with norm ‖·‖. Furthermore denote by B ⊆ B a set in this space. Then the covering number n(B,B) is defined as the minimum number of balls with respect to the Banach space norm that are required to cover B entirely. Proof [Theorem 2] The proof consists of the following steps: • Global: We first show that after a finite number of steps the algorithm zooms in on the neighbourhood B(xM , ρ0). This is done by first showing that can be chosen small enough to squeeze the set f−1((fM − , fM ]) into any arbitrarily small neighbourhood of xM and that as the function is sampled more and more densely, the UBC-LCB envelope around f becomes arbitrarily tight, hence eventually fitting the relevant set inside a small neighbourhood of xM . Please refer to Figure 3.1 for a graphical depiction of this process. GI : Since D is compact and f is continuous and has a unique maximum, for every ρ > 0, we can find an = (ρ) > 0 such that f−1 ((fM − , fM ]) ⊆ B(xM , ρ), where fM = max f . To see this, suppose on the contrary that there exists a radius ρ > 0 such that for all > 0 we have f−1 ((fM − , fM ]) * B(xM , ρ) which means that there exists a point x ∈ D such that f(xM )−f(x) < but ‖x−xM‖ > ρ. Now, for each i ∈ N, pick a point xi ∈ f−1 ((fM − 1i , fM ]) \ B(xM , ρ): this gives us a sequence of points {xi} in D, which by the compactness of D has a convergent 19 3.4. Finiteness of regret (proof) Figure 3.1: The elimination of other smaller peaks. subsequence {xik}, whose limit we will denote by x∗. From the continuity of f and the fact that f(xM ) − f(xi) < 1i , we can conclude that f(xM ) − f(x∗) = 0, which contradicts our assumption that f has a unique global maximum since we necessarily have x∗ /∈ B(xM , ρ). GII : Define ∗ := (ρ0) 4 , with ρ0 as in Condition (†) of the statement of Theorem 2. GIII : For each T , define the “relevant set” RT ⊆ D as follows: RT = { x ∈ D ∣∣∣∣µT (x) +√βTσT (x) > supR µT (x)−√βTσT (x) } . GIV : Choose βT = b ln(T ), with b chosen large enough to satisfy the conditions of Lemma 10. Then, it is possible to sample f densely enough so that √ βT max x∈D σT (x) < ∗, (3.6) so thatRT ⊆ B(xM , ρ0). This is because as D is sampled more and more densely we have σ = O(δ2), where δ is the distance between the points of the grid, and β = O ( ln 1 δd ) = O (− ln δ) and so √βσ → 0 as δ → 0, and so there exists a δ0 small enough so that 20 3.4. Finiteness of regret (proof) a lattice of resolution δ0 would give us the bound given in inequality (3.6). The end point of this process is depicted in Figure 3.1, where the relevant set RT lies inside the non-shaded region: the reason for this inclusion and “thickness” 4∗ is described below, in Step L1 of the proof: cf. Equation (3.7). • Local: Once the algorithm has localized attention to a neighbourhood of xM , then we can show that the regret decreases exponentially; to do so, we will proceed by sampling the relevant set twice as densely and shrinking the relevant set, and repeating these two steps. The claim is that in each iteration, the maximum regret goes down exponentially and the number of the new points that are sampled in each refining iteration is asymptotically constant. To prove this, we will write down the equations governing the behaviour of the number of sampled points and σ. We will adopt the following notation to carry out this task: – δ` - the resolution of the lattice of sampled points at the end of the (` + 1) th refining iteration inside R`+1 (defined below). – ` = sup x∈R` σN`(x) at the end of the ` th iteration. Note that ` ∝ δ2` . Also, note that 0 ≤ ∗ by the choice of δ0. – N` - number of points that have been sampled by the end of the ` th iteration. – ∆N` = N`+1 −N`. – R` - the relevant set at the beginning of the `th iteration. Note that R1 ⊆ B(xM , ρ0). – ρ` = diam(R`) 2 . Note that ρ1 < ρ0. 21 3.4. Finiteness of regret (proof) L1: We have the following chain of inequalities: N1 ≤ N0 +Wnδ0 ( R0, (Rd, ‖ · ‖2) ) where nδ0 ( R0, (Rd, ‖ · ‖2) ) is the δ0-covering number as defined in Definition 6 ≤ N0 +WN (ρ0, δ0) where N (ρ, δ) := nδ ( B(0, ρ), (Rd, ‖ · ‖2) ) ≤ N0 +WN √40√βN0 c2 , δ0 ≤ N0 +WN √40√b lnN0 c2 , δ0 = N0 +WN ( c √ 0 4 √ lnN0, δ0 ) where c := √ 4 √ b c2 In the first line of the above chain of inequalities, the factor W multiplying the last term is any number greater than one that gives an upper bound on the algorithm’s wastefulness when it comes to producing a δ-covering of any set S ⊆ Rd as compared to the minimum number of points necessary, i.e. nδ (S, (Rd, ‖ · ‖2)). The expression √ 40 √ βN0 c2 comes about as follows: using the notations B = √ βN0 and σ = σN0 we know by Lemma 10 that f and µ are intertwined with each other in the sense that both of the following chains of inequality hold: µ−Bσ ≤ f ≤ µ+Bσ f−Bσ ≤ µ ≤ f+Bσ, which, combined together, give us the following chain of inequalities f−2Bσ ≤ µ−Bσ ≤ f ≤ µ+Bσ ≤ f+2Bσ. (3.7) Since, we also know that σ(x) ≤ 0 for all x ∈ R0, we can conclude that f−2B0 ≤ µ−Bσ ≤ µ+Bσ ≤ f+2B0. 22 3.4. Finiteness of regret (proof) Moreover, if condition (†) holds, we know that in R0, the function f satisfies −c1r2 < f(x)− f(xM ) < −c2r2, where r = r(x) := ‖x− xM‖, so we get that f(xM)−c1r2−2B0 ≤ µ−Bσ ≤ µ+Bσ ≤ f(xM)−c2r2+2B0. Now, recall that R0 is defined to consist of points x where µ(x)+Bσ(x) ≥ sup D µ(x)−Bσ(x), but given the fact that we have the above outer envelope for µ ± Bσ, we can conclude that R0 ⊆ { x ∣∣∣ f(xM)−c2r(x)2+2B0 ≥ max f(xM)−c1r(x)2−2B0} = { x ∣∣∣ f(xM)−c2r(x)2+2B0 ≥ f(xM)−2B0} = { x ∣∣∣ −c2r(x)2+2B0 ≥ −2B0} = { x ∣∣∣ c2r(x)2 ≤ 4B0} = { x ∣∣∣ r(x) ≤ √4B0 c2 } Now, if, on the other hand, f satisfies condition (‡), then by the smoothness assumptions in (‡), we know that ∇f(xM ) is perpendicular to ∂D at xM and so there exist positive numbers c1 and c2 such that in a neighbourhood of xM we have −c1r ≤ f − f(xM ) ≤ −c2r2. Note that in the argument above in the case of (†), the precise form of the lower bound on f was irrelevant, since all we are interested in is its maximum. So, the same argument goes through again. This is depicted in Figure 3.2, where B := √ βN0 = √ b lnN0. 23 3.4. Finiteness of regret (proof) Figure 3.2: The shrinking of the relevant set R`. Here, B = √ βN0 L`+1: Now, let us suppose that we are the end of the ` th iteration. We have N`+1 ≤ N` +WN (ρ`, δ`) = N` +WN ( c √ ` 4 √ lnN`, δ` ) ≤ N` +WN ( c √ 0 4` 4 √ lnN`, δ0 2` ) by Proposition 1 = N` +WN ( c √ 0 4 √ lnN`, δ0 ) since N (2ρ, 2δ) = N (ρ, δ) for any ρ and δ := N` +Wnδ0 ( B(0, c √ 0 4 √ lnN`), (Rd, ‖ · ‖2) ) by the definition of N ≤ N` +Wnδ0 ([ −c√0 4 √ lnN`, c √ 0 4 √ lnN` ]d , (Rd, ‖ · ‖2) ) ≤ N` +W ( 2c √ 0 4 √ lnN` δ0 )d since a regular lattice of resolution δ0 gives a δ0-covering ≤ N` + C(lnN`) d 4 where C = W ( 2c √ 0 δ0 )d So, the number of samples needed by the branch and bound algorithm is governed by the difference inequation ∆N` ≤ C(lnN`) d 4 . (3.8) 24 3.4. Finiteness of regret (proof) To study the solutions of this difference equation, we consider the corresponding differ- ential equation: dN d` = C(lnN) d 4 . (3.9) Since this equation is separable, we can write dN (lnN) d 4 = Cd`. Now, letting ` = L be a given number of iterations in the algorithm and N(L) the corre- sponding number of sampled points, we can integrate both sides of the above equation to get ∫ N(L) N(0) dN (lnN) d 4 = ∫ L 0 Cd` = CL. Given the fact that the integral on the left can’t be solved analytically, we will use the lower bound N(L)−N(0) (lnN(L)) d 4 ≤ ∫ N(L) N(0) dN (lnN) d 4 to get N(L)−N(0) C(lnN(L)) d 4 ≤ L (3.10) Given a time t, we will denote by `t the largest non-negative integer such that N`t < t or 0 if no such number exists. We illustrate this somewhat obtuse definition with the following example: • • · · · • · · · • · · · • · · · • · · · • · · · 1 OO 2 OO N0 OO N1 OO N`t OO t OO N`t+1 OO 25 3.4. Finiteness of regret (proof) Now, by Lemma 11, for all t >> N0 we have rt ≤ 2 √ βt maxR`t σt ≤ 2 √ b ln t`t ≤ 20 √ b ln t 4`t ≤ 80 √ b ln t 4`t+1 ≤ 80 √ b ln t ( 1 4 ) N`t+1−N0 C(lnN`t+1) d/4 by Equation 3.10 ≤ 80 √ b ln t ( 1 4 ) DN`t+1 (lnN`t+1) d/4 for some D > 0 since N`t+1 > N0 ≤ 80 √ b ln t ( 1 4 ) Dt (ln t)d/4 for t satisfying ln t > d 4 (see ? below) since t ≤ N`t+1 ≤ 80 √ be − Et (ln t)d/4 + ln ln t 2 ≤ 80 √ be − Et (ln t)d/4 + Et 2(ln t)d/4 for large enough t = Ae − τt (ln t)d/4 for A = 80 √ b and τ = E/2. ? The reason for the specific criterion ln t > d4 is that the function x (lnx)d/4 is increasing when this condition is satisfied, and so decreasing x from N`t + 1 to t decreases its value, increasing the overall expression ( 1 4 ) x (ln x)d/4 . To see that x (lnx)d/4 becomes increasing when lnx > d4 , we simply need to calculate its derivative: d dx x (lnx)d/4 = 1 (lnx)d/4 − d 4 x x(lnx)d/4+1 = lnx− d4 (lnx)d/4 . Moreover, since N`t+1 ≥ t, if the derivative of x(lnx)d/4 is positive at t, it is also positive between t and N`t+1 and so the function is indeed increasing in that interval. 26 Chapter 4 Discussion In this thesis, we have put forth a modification of the UCB algorithm presented in [11], which has the additional step of discarding pieces of the search space where the global maximum is very unlikely to be. We show that this additional step leads to a considerable improvement of the regret accrued by the algorithm: in particular, the cumulative regret obtained by our Branch and Bound (BB) algorithm is bounded from above, whereas the cumulative regret bounds obtained in [11] are unbounded. This is because UCB’s regret is O ( 1√ t ) up to logarithmic factors, whereas for BB, the regret is O ( e − τt (ln t)d/4 ) . The main realization that we had in the process of writing this paper was that what affects the rate of convergence the most is the possibility of dispensing with chunks of the search space. This observation can also be seen in the works involving hierarchical partitioning, e.g. [9], where regions of the space are deemed as less worthy of probing as time goes on. The most obvious next step is to carry out similar analysis for the case when there is non-trivial observation noise. Another natural question is whether or not, in each iteration of the Branch and Bound algorithm, one could dispense with points that lie outside the current (or perhaps the previous) relevant set when calculating the covariance matrix K in the posterior equations (2.3), and how much of an effect such a computational cost-cutting measure would have on the regret encountered by the algorithm. Another obvious question is how does the performance of the Branch and Bound algorithm pre- sented here compares with a modified version of UCB algorithm in which in addition to optimizing the UCB surrogate function, one introduces an occasional “cleansing” step of getting rid of regions in the search space where the maximum is very unlikely to be. Finally, we believe that similar regret bounds should hold for contextual bandit problems, using similar methods, although the unpredictability of the context introduces new difficulties. 27 Bibliography [1] Robert J. Adler and Jonathan E. Taylor. Random Fields and Geometry. Springer, 2007. [2] Eric Brochu, Vlad M Cora, and Nando de Freitas. A tutorial on Bayesian optimization of ex- pensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-023, arXiv:1012.2599v1, University of British Columbia, Department of Computer Science, 2009. [3] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvari. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011. [4] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011. [5] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for non- parametric binary regression. Ann. Stat., 34:2413–2429, 2006. [6] P. Hansen, B. Jaumard, and S. Lu. Global optimization of univariate Lipschitz functions: I. survey and properties. Mathematical Programming, 55:251–272, 1992. [7] Matthew Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, pages 327–336, 2011. [8] Benedict May, Nathan Korda, Anthony Lee, and David Leslie. Optimistic Bayesian sampling in contextual-bandit problems. 2010. [9] Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, 2011. [10] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. 28 [11] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian pro- cess optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010. [12] Michael L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999. [13] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008. [14] Emmanuel Vazquez and Julien Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and In- ference, 140:3088–3095, 2010. 29 Apendix: Auxilliary Lemmas We begin our analysis by showing that, given sufficient explored locations, the residual variance is small. More specifically, for any point x contained in the convex hull of a set of d points that are no further than δ apart from x, we show that the residual is bounded by O(‖h‖H δ2), where ‖h‖H is the Hilbert Space norm of the associated function and that furthermore the residual variance is bounded by O(δ2). This quantity is subsequently related to `2 and `1 covering numbers of the associated domain. We begin by relating residual variance, projection operators, and interpolation in Hilbert Spaces. The results are standard and we only present them here for the purpose of being self-contained. 30 Apendix: Auxilliary Lemmas Lemma 7 (Hilbert Space Properties) Given a set of points x1:T := {x1, . . . , xT } ∈ D and a Reproducing Kernel Hilbert Space (RKHS) H with kernel κ the following bounds hold: 1. Any h ∈ H is Lipschitz continuous with constant ‖h‖H L, where ‖·‖H is the Hilbert space norm and L satisfies the following: L2 ≤ sup x∈D ∂x∂x′κ(x, x ′)|x=x′ and for κ(x, x′) = κ̃(x− x′) we have L2 ≤ ∂2xκ̃(x)|x=0. (1) 2. Any h ∈ H has its second derivative bounded by ‖h‖HQ where Q2 ≤ sup x∈D ∂2x∂ 2 x′κ(x, x ′)|x=x′ and for κ(x, x′) = κ̃(x− x′) we have Q2 ≤ ∂4xκ̃(x)|x=0. (2) 3. The projection operator P1:T on the subspace span t=1:T {κ(xt, ·)} ⊆ H is given by P1:Th := k >(·)K−1 〈k(·), h〉 (3) where k(·) = k1:T (·) := [κ(x1, ·) · · ·κ(xT , ·)]> and K := [κ(xi, xj)]i,j=1:T ; moreover, we have 〈k(·), h〉 := 〈κ(x1, ·), h〉 ... 〈κ(xT , ·), h〉 = h(x1) ... h(xT ) . Here P1:TP1:T = P1:T and ‖P1:T ‖ ≤ 1 and ‖1− P1:T ‖ ≤ 1. 4. Given sets x1:T ⊆ x1:T ′ it follows that ‖P1:Th‖H ≤ ‖P1:T ′h‖H ≤ ‖h‖H. 5. Given tuples (xi, hi) with hi = h(xi), the minimum norm interpolation h̄ with h̄(xi) = h(xi) is given by h̄ = P1:Th. Consequently its residual g := (1 − P1:T )h satisfies g(xi) = 0 for all xi ∈ x1:T . Proof We prove the claims in sequence. 1. This follows from Corollary 4.36 in [13], with |α| = 1. 2. Same as above, just with |α| = 2. 31 Apendix: Auxilliary Lemmas 3. For any operator V with full column rank the projection on the image of V is given by V (V >V )−1V >. In our case, V is defined as V : RT // H w := w1 ... wT // k>(·)w . To calculate the adjoint V >, let h ∈ H and w ∈ RT be arbitary elements and consider the following chain of equalities: 〈 V >h,w 〉 RT = 〈h, Vw〉H (cf. [13] Equation A.19) = 〈 h,k>(·)w 〉 H = 〈 h, [ k(x1, ·) · · · k(xT , ·) ] w1 ... wT 〉 H = 〈h, k(x1, ·)w1 + · · ·+ k(xT , ·)wT 〉H = 〈h, k(x1, ·)〉w1 + · · ·+ 〈h, k(xT , ·)〉wT (by linearity of 〈 , 〉H) = [ 〈h, k(x1, ·)〉H · · · 〈h, k(xT , ·)〉H ] w1 ... wT = 〈k(·), h〉>Hw (by the symmetry of 〈 , 〉H) = 〈〈k(·), h〉H ,w〉RT (by the defintion of 〈 , 〉RT ) Since, this equality holds for all w, we can conclude that for all h ∈ H V >h = 〈k(·), h〉H . Now, all we need to do is to calculate the expression V >V to complete the derivation of the 32 Apendix: Auxilliary Lemmas expression for P1:T ; to this end let w ∈ RT be arbitrary: V >Vw = 〈 k(·),k>(·)w 〉 H = 〈 k(·),k>(·) 〉 H w = 〈 κ(x1, ·) ... κ(xT , ·) , [ κ(x1, ·) · · · κ(xT , ·) ]〉 H w = 〈κ(x1, ·), κ(x1, ·)〉H · · · 〈κ(x1, ·), κ(xT , ·)〉H ... . . . ... 〈κ(xT , ·), κ(x1, ·)〉H · · · 〈κ(xT , ·), κ(xT , ·)〉H w = κ(x1, x1) · · · κ(x1, xT ) ... . . . ... κ(xT , x1) · · · κ(xT , xT ) w (cf. [13] Definition 4.18 and Equation 4.14) = Kw and so V >V = K. This provides us with P1:T . The remaining claims follow from standard properties of projec- tion operators. 4. Projection operators satisfy ‖P1:T ‖ ≤ 1. This proves the second claim. The first claim can be seen from the fact that projecting on a subspace can only have a smaller norm than the superspace projection. 5. We first show that the projection is an interpolation. This follows from h̄(xi) = P1:Th(xi) = 〈P1:Th, κ(xi, ·)〉 = 〈h, P1:Tκ(xi, ·)〉 = 〈h, κ(xi, ·)〉 = h(xi). Correspondingly g(xi) = h(xi) − h̄(xi) = 0 for all xi ∈ x1:T . By construction P1:Th uses h only in evaluations h(xi), hence for any two functions h, h ′ with h(xi) = h′(xi) we have P1:Th = P1:Th ′. Since ‖P1:T ‖ ≤ 1 it follows that ‖P1:Th‖ ≤ ‖h‖H. Hence there is no interpolation with norm smaller than ‖P1:Th‖. 33 Apendix: Auxilliary Lemmas Lemma 8 (Gaussian Process Variance) Under the assumptions of Lemma 7 it follows that |h(x)− P1:Th(x)| ≤ ‖h‖H σT (x) where σ2T (x) = κ(x, x)− k>1:T (x)K−1k1:T (x) (4) and this bound is tight. Moreover, σ2T (x) is the residual variance of a Gaussian process with the same kernel. Proof To see the bound we again use the Cauchy-Schwartz inequality |h(x)− P1:Th(x)| = |(1− P1:T )h(x)| = |〈(1− P1:T )h, κ(x, ·)〉H| (by the defining property of 〈 , 〉H, cf. [13] Def. 4.18) = |〈h, (1− P1:T )κ(x, ·)〉H| (since 1− P1:T is an orthogonal projection and so self-adjoint) ≤ ‖h‖H ‖(1− P1:T )κ(x, ·)‖ (by Cauchy-Schwarz) This inequality is clearly tight for h = (1−P1:T )κ(x, ·) by the nature of dual norms. Next note that ‖(1− P1:T )κ(x, ·)‖2 = 〈(1− P1:T )κ(x, ·), (1− P1:T )κ(x, ·)〉 = 〈κ(x, ·), (1− P1:T )κ(x, ·)〉 = κ(x, x)− 〈κ(x, ·), P1:Tκ(x, ·)〉 = σ2T (x). The second equality follows from the fact that 1 − P1:T is idempotent. The last equality follows from the definition of P1:T . The fact that σ 2 T (x) is the residual variance of a Gaussian Process re- gression estimate is well known in the literature and follows, e.g. from the matrix inversion lemma. 34 Apendix: Auxilliary Lemmas Lemma 9 (Approximation Guarantees) In the following we denote by x1:T ⊆ D a set of loca- tions and assume that g(xi) = 0 for all xi ∈ x1:T . 1. Assume that g is Lipschitz continuous with bound L. Then g(x) ≤ Ld(x, x1:T ), where d(x, x1:T ) is the minimum distance ‖x− xi‖ between x and any xi ∈ x1:T . 2. Assume that g has its second derivative bounded by Q′. Moreover, assume that x is contained inside the convex hull of x1:T such that the smallest such convex hull has a maximum pairwise distance between vertices of d. Then we have g(x) ≤ 14Q′d2. Proof The first claim is an immediate consequence of the Lipschitz property of g. To see the second claim we need to establish a number of issues: without loss of generality assume that the maximum within the convex hull containing x is attained at x (and that the maximum rather than the minimum denotes the maximum deviation from 0). The maximum distance of x to one of its vertices is bounded by δ/ √ 2. This is established by considering the minimum enclosing ball and realizing that the maximum distance is achieved for the regular polyhedron. To see the maximum deviation from 0 we exploit the fact that ∂xg(x) = 0 by the assumption of x being the maximum (we need not consider cases where x is on a facet of the polyhedral set since in this case we could easily reduce the dimensionality). In this case the largest deviation between g(x) and g(xi) is obtained by making g a quadratic function g(x ′) = Q ′ 2 ‖x′ − x‖2. At distance δ√2 the function value is bounded by δ 2Q′ 4 . Since the latter bounds the maximum deviation it does bound it for g in particular. This proves the claim. 35 Apendix: Auxilliary Lemmas For the sake of completeness of the exposition, we will also include the proofs of the following two lemmas, which first appeared in [11]: Lemma 10 (Lemma 5.1 of [11]) Given any finite set L, any sequnce of points {x1, x2, . . .} ⊆ L and f : L → R a sample from GP(0, κ(·, ·)), for all α ∈ (0, 1), we have P { ∀x ∈ L, t ≥ 1 : |f(x)− µt−1(x)| ≤ √ βtσt−1(x) } ≥ 1− α, where βt = 2 ln ( |L|pit α ) and {pit} is any positive sequence satisfying ∑ 1 pit = 1. Here |L| denotes the number of elements in L. Proof Looking at the complement of the set whose probability we’re interested in, we have P { ∀x ∈ L, t ≥ 1 : |f(x)− µt−1(x)| ≤ √ βtσt−1(x) } = 1− P { ∃x ∈ L, t ≥ 1 s.t. |f(x)− µt−1(x)| > √ βtσt−1(x) } = 1− P ( ∪t≥1 ∪x∈L |f(x)− µt−1(x)| > √ βtσt−1(x) ) ≥ 1− ∑ t≥1 ∑ x∈L P ( |f(x)− µt−1(x)| > √ βtσt−1(x) ) = 1− ∑ t≥1 ∑ x∈L 2P ( f(x) > µt−1(x) + √ βtσt−1(x) ) = 1− ∑ t≥1 ∑ x∈L 2P { r > √ βt ∣∣∣r ∼ N (0, 1)} = 1−∑ t≥1 ∑ x∈L 2√ 2pi ∫ ∞ √ βt e− r2 2 dr = 1− ∑ t≥1 ∑ x∈L 2√ 2pi ∫ ∞ 0 e− (u+ √ βt) 2 2 du (u = r − √ βt, du = dr, r = u+ √ βt) = 1− ∑ t≥1 ∑ x∈L 2√ 2pi ∫ ∞ 0 e− u2+2u √ βt+βt 2 du = 1− ∑ t≥1 ∑ x∈L 2e −βt 2√ 2pi ∫ ∞ 0 e−u √ βte− u2 2 du ≥ 1− ∑ t≥1 ∑ x∈L 2e −βt 2√ 2pi ∫ ∞ 0 e− u2 2 du (∵ e−u √ βt ≤ 1 ∀u ≥ 0) = 1− ∑ t≥1 ∑ x∈L e −βt 2 2 ∫∞0 e−u22 du√ 2pi = 1−∑ t≥1 ∑ x∈L e −βt 2 * 1∫∞−∞ e−u22 du√ 2pi = 1− ∑ t≥1 ∑ x∈L e −βt 2 = 1− ∑ t≥1 ∑ x∈L α pit|L| = 1− α ∑ t≥1 1 pit ∑ x∈L 1 |L| = 1− α. 36 Apendix: Auxilliary Lemmas Lemma 11 (Lemma 5.2 in [11]) Let L a non-empty finite set and f : L → R an arbitrary function. Also assume that there exist functions µ, σ : L → R and a constant √β, such that |f(x)− µ(x)| ≤ √ βσ ∀x ∈ L. (†) Then, we have r(x) ≤ 2 √ βσ(x) ≤ 2 √ βmax L σ. Proof We know from the definition of sup and the upper and the lower bounds given by (†) that µ(x)− √ βσ(x) ≤ f(x) ≤ sup L f ≤ sup L (µ+ √ βσ) = µ(x) + √ βσ(x). So, f(x) and supL f lie in the interval [µ(x)− √ βσ(x), µ(x) + √ βσ(x)], and so their distance from each other can be at most 2 √ βσ(x), and so r(x) := sup L f − f(x) ≤ 2 √ βσ(x). 37
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Regret bounds for Gaussian process bandits without...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Regret bounds for Gaussian process bandits without observation noise Zoghi, Masrour 2012
pdf
Page Metadata
Item Metadata
Title | Regret bounds for Gaussian process bandits without observation noise |
Creator |
Zoghi, Masrour |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | This thesis presents some statistical refinements of the bandits approach presented in [11] in the situation where there is no observation noise. We give an improved bound on the cumulative regret of the samples chosen by an algorithm that is related (though not identical) to the UCB algorithm of [11] in a complementary setting. Given a function f on a domain D ⊆ R^d , sampled from a Gaussian process with an anisotropic kernel that is four times differentiable at 0, and a lattice L ⊆ D, we show that if the points in L are chosen for sampling using our branch-and-bound algorithm, the regret asymptotically decreases according to O(e^{τt/(ln t)^{d/4}}) with high probability, where t is the number of observations carried out so far and τ is a constant that depends on the objective function. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-08-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-ShareAlike 3.0 Unported |
DOI | 10.14288/1.0052133 |
URI | http://hdl.handle.net/2429/42865 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-sa/3.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2012_fall_zoghi_masrour.pdf [ 589.92kB ]
- Metadata
- JSON: 24-1.0052133.json
- JSON-LD: 24-1.0052133-ld.json
- RDF/XML (Pretty): 24-1.0052133-rdf.xml
- RDF/JSON: 24-1.0052133-rdf.json
- Turtle: 24-1.0052133-turtle.txt
- N-Triples: 24-1.0052133-rdf-ntriples.txt
- Original Record: 24-1.0052133-source.json
- Full Text
- 24-1.0052133-fulltext.txt
- Citation
- 24-1.0052133.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0052133/manifest