Near-Optimal Herding

by Samira Samadi
B.Sc., Sharif University of Technology, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2014

© Samira Samadi 2014

Abstract

Herding is an algorithm of recent interest in the machine learning community, motivated by inference in Markov random fields. It solves the following Sampling Problem: given a set X ⊂ ℝ^d with mean µ, construct an infinite sequence of points from X such that, for every t ≥ 1, the mean of the first t points in that sequence lies within Euclidean distance O(1/t) of µ. The error of a solution to the Sampling Problem is defined to be the distance between the empirical mean of the first t samples and the original mean µ.

The O(1/t) error bound suppresses the dependence on d and X. In this thesis, we study the best dependence on d and |X| that can be achieved for the error in the Sampling Problem. Known analyses of the Herding algorithm give an error bound that depends on geometric properties of X but, even under favorable conditions, this bound depends linearly on d.

We first show that any algorithm for the Sampling Problem must have error Ω(√d/t). Afterward, we present a new polynomial-time algorithm that solves the Sampling Problem with error O(√d log^2.5 |X| / t), assuming that X is finite. This implies that our algorithm is optimal to within logarithmic factors. Finally, we prove that the actual error of the Herding algorithm is strictly worse than the error of our algorithm if we measure the error in the ℓ∞-norm. Our algorithm is randomized and based on recent algorithmic results in discrepancy theory. We implement our algorithm and other potential solutions for the Sampling Problem and evaluate them on various inputs.

Preface

This work is the product of a great deal of active discussion between me and my supervisor, Prof. Nick Harvey, and numerous refinements.
The theoretical contributions are largely based on work published at the Conference on Learning Theory 2014 [21]. All the algorithms are implemented in Matlab, and plots are included for the experimental evaluations.

Table of Contents

Abstract
Preface
Table of Contents
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Our Contribution
  1.2 Related Work
  1.3 Thesis Organization
2 Formal Definitions and Main Results
  2.1 Sampling Problem
  2.2 Random Sampling
  2.3 Data Sets
  2.4 Herding Algorithm
    2.4.1 The choice of w_0 is important
  2.5 Perceptron
  2.6 Towards Optimal Bound for Sampling Problem
    2.6.1 John's position
  2.7 Main Results
3 Analysis of the Existing Results
  3.1 Random Sampling Error Analysis
  3.2 Herding Error Analysis
4 Discrepancy Theory
  4.1 Permutation Problem
  4.2 Steinitz Algorithm
  4.3 Our Approach for Solving the Permutation Problem
5 Our Algorithm
  5.1 Bounding ss
  5.2 Relating ss and st
  5.3 Proofs of Main Results
  5.4 Performance of Our Algorithm
  5.5 Lower Bound for the Sampling Problem
  5.6 ‖·‖∞ Lower Bound for the Herding Algorithm
  5.7 Comparison
  5.8 Partial Coloring Lemma Hypothesis
6 Conclusions
  6.1 Remarks
  6.2 Open Questions
Bibliography

List of Figures

2.1 The set of points X and its convex hull K (black); the smallest circle centred at the origin that contains K and its radius δ (blue); the biggest circle centred at µ that is contained in K and its radius r (red).
2.2 Random sampling performance on different inputs.
2.3 Herding algorithm performance on different inputs.
2.4 Herding algorithm error depends on w_0.
2.5 The body on the left is not in John's position but the body on the right is in John's position.
4.1 Error of the Steinitz algorithm for matrix F in ‖·‖₂.
5.1 Error of our algorithm on matrix F measured in ‖·‖₂ and ‖·‖∞.
5.2 Herding algorithm error in ‖·‖∞.
5.3 Comparison of the error for different sampling algorithms.

Acknowledgements

First and foremost, I would like to express my utmost gratitude to my supervisor, Prof. Nick Harvey. He has devoted countless hours to me and my work and provided me with insightful guidance and endless support. I immensely enjoyed and learned from every meeting I had with him, and I am deeply indebted to him for being a great mentor. I would like to extend my sincerest thanks to Prof. Alexandre Bouchard-Côté for being my second reader and for his helpful comments. I would like to thank Prof. David Kirkpatrick for his continued support and valuable feedback.

I also owe my special thanks to my friends and colleagues at UBC, too numerous to name here, for their help and for making my life much more enjoyable. I would very much like to thank Maryam Siahbani and Shabnam Mirshokraie for their true kindness and support during my stay in Vancouver.

Last but not least, I would like to thank my family for their tireless support, love and trust. None of this would have been possible without their love and patience.

Dedication

To my brother, Mehdi, for always being there for me.

Chapter 1
Introduction

Herding is a topic that has attracted much interest in the machine learning community over the past few years. It was originally proposed [33] as a method for "sampling" from a Markov random field¹ that agrees with a given data set X. The traditional approach for this scenario is to use maximum likelihood estimation to infer the parameters of the model; once the model is obtained, one can draw samples from it.

A drawback of the traditional approach is that the maximum likelihood estimation step can be computationally intractable.
Another difficulty is that the estimated model can have several local modes, which may cause the algorithm to get stuck in one of them. The Herding approach sidesteps these intractability problems; it does not infer the model parameters, but instead deterministically produces a sequence of "samples" whose moments rapidly converge to the moments of the given data set X.

A concise statement of the Herding algorithm is as follows:

    x_{t+1} ∈ argmax_{x ∈ X} ⟨w_t, x⟩
    w_{t+1} = w_t + µ − x_{t+1}

where x_t is the sample that the algorithm produces in step t and w_t is a weight vector that guides the choice of the next sample. The initial weight vector w_0 is initialized arbitrarily. The error of a sampling algorithm after t steps is defined as the Euclidean distance between the empirical mean of the first t samples and the original mean µ. The convergence rate of a sampling algorithm is the error at each step as a function of t.

The Herding algorithm has several attractive properties. Unlike previous sampling techniques that used random number generation, Herding is fully deterministic. Computationally, it is very simple; it is a greedy algorithm whose pseudocode is only two lines. Yet despite its simplicity, it has a dignified lineage. Herding is related to conditional gradient methods [4] and to quadrature methods [4, 23].

¹ Markov random field is a synonym for undirected graphical model.

There is also a deep connection between the Herding algorithm and the Perceptron method [19]. It was observed that the supervised Perceptron algorithm and the Herding algorithm can both be derived from the classic "Perceptron Cycling Theorem" [26]. We will discuss this connection in more detail in Section 2.5.

Herding was originally proposed for discrete spaces, but it can naturally be extended to continuous spaces using the kernel trick [17].
The kernel trick is a method for implicitly mapping observations from a general set X into a high-dimensional inner product space V in which the observations acquire a well-behaved linear structure. The resulting kernel Herding algorithm can accommodate a large number of features.

Finally, an intriguing property is that the samples produced by Herding converge faster than independent random samples, in the sense that the error after t Herding samples is inversely proportional to t, whereas the error of t random samples is inversely proportional to √t [17]². The Herding algorithm keeps track of all the samples generated so far and avoids sampling from areas that have already been oversampled. This helps the algorithm achieve a fast convergence rate.

The purpose of this thesis is to rigorously analyze the theoretical underpinnings of the Herding algorithm. In particular, since Herding's convergence properties have received much attention, we wish to determine the best convergence achievable by any algorithm. For this purpose, we first need to define a mathematical setting for our target class of sampling algorithms.

1.1 Our Contribution

We define an exact mathematical problem that asks "How can we perform efficient sampling?" We call this problem the Sampling Problem. Our goal is to analyze any possible algorithm that gives a solution to this problem.

One of our main observations is that there is a connection between the Sampling Problem and discrepancy theory, an important topic in combinatorics and geometry. We exploit this connection by restating the Sampling Problem in the geometrical setting established by Banaszczyk [8]. The new statement of the problem allows us to find a close relation between the Sampling Problem and a well-studied problem in discrepancy theory which we call the Permutation Problem.
We show that there is an algorithmic reduction from the Permutation Problem to the Sampling Problem, and we focus on solving the Permutation Problem.

² The proof given in [17] for the error of the Herding algorithm seems not to be precise; we give a corrected version of it in Chapter 3.

To begin, we present a fairly simple lower bound on the convergence rate of any algorithm for the Sampling Problem (see Section 5.5). We then observe that this lower bound is actually optimal if a classic conjecture in discrepancy theory is true. Next, we present a partial result towards this conjecture which is algorithmic and uses recent breakthroughs in discrepancy theory [9, 24]. This yields a new, polynomial-time algorithm that produces a sequence of samples whose convergence rate is within a logarithmic factor of optimal. In contrast, the best known upper bound on the convergence of the Herding algorithm is significantly worse, even under generous assumptions. We conjecture that the actual convergence of the Herding algorithm is strictly worse than the convergence of our new algorithm, and we prove that this is true if convergence is measured in the ℓ∞-norm.

1.2 Related Work

The invention of the Herding algorithm goes back to 2009, when Welling [33] first proposed it as a new method for directly generating pseudosamples from the observed moments of a model. One can formulate this problem as the general maximum entropy problem [32] as follows. We are given the pairwise marginal distributions P_ij and we would like to estimate the joint distribution P(x) that has maximum entropy. Once we have estimated the full joint distribution, we can draw samples from it.
The learning phase in the latter formulation is computationally expensive, and this is what makes the Herding algorithm an attractive approach; Herding avoids this difficulty by directly generating samples.

Although Welling [33] did not focus on the theoretical foundations of the Herding algorithm, researchers later tied the algorithm to well-studied areas in learning theory and continuous optimization. Chen et al. [17] consider Herding as an infinite-memory process and explain it as a mapping x_{t+1} = G(x_1, ..., x_t, w_0). In this scenario, the Herding algorithm can be interpreted as taking gradient steps of size 1 on the function G. Another observation is that it is easy to apply the kernel trick to the Herding algorithm. This is due to the fact that, in the Herding algorithm, the formula for obtaining the next sample x_{t+1} only involves inner products. This yields a kernel Herding algorithm whose error decreases as O(1/t).

Another line of work [4] showed that the Herding algorithm is related to the conditional gradient algorithm, also known as the Frank-Wolfe algorithm. Frank-Wolfe algorithms are a class of iterative optimization algorithms that have applications in many areas, including machine learning, signal processing and statistics. These algorithms can be applied to any smooth convex function with a compact convex domain. An attractive property of Frank-Wolfe algorithms is that they do not perform expensive operations: at each iteration, they only need to compute the gradient of the objective function and optimize linear functions over its domain. This is unlike other optimization methods, such as projected variants of gradient descent, that require a projection step at each iteration.
The connection between Herding and Frank-Wolfe allows the authors to suggest better approaches for the task of approximating integrals in a reproducing kernel Hilbert space.

From another perspective, Herding is related to deterministic algorithms for sampling from distributions [16, 17, 22, 23]. For many researchers, it is very interesting to pursue Herding from this line of work, as it provides a well-defined mathematical setting for it. In particular, since Herding's convergence rate has attracted much attention, recent work has studied using Herding ideas to improve convergence of the Gibbs sampler [14]; this relates to interesting directions in combinatorics [22] and Markov chain Monte Carlo methods [16].

1.3 Thesis Organization

The rest of this thesis is organized as follows. In Chapter 2, we present formal definitions and briefly explain our main results. In Chapter 3, we prove some of the important existing results. In Chapter 4, we present a brief overview of discrepancy theory, which is necessary to explain our results in detail. In Chapter 5, we present our main algorithm and analyze its convergence rate. In Chapter 6, we conclude our work and discuss future directions.

Chapter 2
Formal Definitions and Main Results

In this chapter we define the Sampling Problem, which is the main problem of study in this thesis. We continue by discussing various solutions to the Sampling Problem, including Random sampling and the Herding algorithm. In each section, we provide theoretical upper bounds on the error of these algorithms as well as experimental evaluations of their performance on different data sets³. For now, we do not go into the proofs of the theoretical results; all the details are explained in Chapter 3. The goal of this chapter is to give the reader a good understanding of the Sampling Problem and the properties of its current solutions.

The following notation will be used throughout the thesis. We write [n] for the set {1, 2, ..., n}.
For any matrix M we let M_{i,∗} refer to the ith row of M, and M_{∗,j} refer to the jth column of M. All logarithms are in base 2. The ℓ_p-norm is denoted ‖·‖_p. The ith standard basis vector is denoted e_i.

2.1 Sampling Problem

Let X ⊂ ℝ^d be a finite set with |X| = n. Let µ = ∑_{x ∈ X} x / n and δ = max_{x ∈ X} ‖x‖₂. Let r be the maximum value such that the ℓ₂-ball centered at µ of radius r is contained in the convex hull of X (see Figure 2.1).

More generally, one could assume that X is infinite and that it lies in an infinite-dimensional Hilbert space. For now we restrict to the case of finite X and finite dimension, but in Section 6.2 we briefly discuss the infinite case too.

³ All the functions are implemented in Matlab.

Figure 2.1: The figure shows: the set of points X and its convex hull K (black); the smallest circle centred at the origin that contains K and its radius δ (blue); the biggest circle centred at µ that is contained in K and its radius r (red).

Definition 2.1.1 (Sampling Problem). We wish to find an infinite sequence of points x_1, x_2, ... from the set X. Each x_i is called a pseudosample. The error of the first t pseudosamples is ‖∑_{i ≤ t} x_i / t − µ‖₂. The goal is to ensure that, for all t ≥ 1, the error of the first t pseudosamples is at most α/t, where α is a quantity that can depend arbitrarily on d, n and X but must not depend on t.

Our goal is to rigorously analyze the Sampling Problem and to find an algorithm that solves this problem with the best possible error. We start by considering Random sampling and ask whether this algorithm gives a solution to the Sampling Problem.

2.2 Random Sampling

The most natural algorithm for solving the Sampling Problem is Random sampling. The formal definition of this algorithm can be stated as follows:

Definition 2.2.1 (Random Sampling). Draw i.i.d. uniform samples from the set X.

We evaluate the error function for Random sampling both theoretically and experimentally.
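Definition 2.2.1 is straightforward to implement. The following Python sketch (illustrative only; the thesis's own implementations are in Matlab, and the function name and test input here are ours) draws the pseudosamples and tracks the error after each step:

```python
import numpy as np

def random_sampling_error(X, T, seed=None):
    """Draw T i.i.d. uniform pseudosamples from the columns of X and
    return the error ||sum_{j<=t} X_j / t - mu||_2 for every t = 1..T."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    mu = X.mean(axis=1)
    idx = rng.integers(n, size=T)            # i.i.d. uniform column indices
    partial = np.cumsum(X[:, idx], axis=1)   # running sums of the samples
    t = np.arange(1, T + 1)
    return np.linalg.norm(partial / t - mu[:, None], axis=0)

# A mean-centered random input in the spirit of the experiments (mu = 0).
X = np.random.default_rng(0).standard_normal((20, 25))
X -= X.mean(axis=1, keepdims=True)
err = random_sampling_error(X, 50, seed=1)
```

On such an input the resulting error curve decays roughly like 1/√t, matching the analysis discussed next.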
Standard analysis of Random sampling shows that the error of this algorithm decreases as O(1/√t)⁴. This proves that Random sampling does not solve the Sampling Problem. In order to understand the performance of this algorithm in the experiments, we first need to introduce the data sets that we use throughout this chapter.

⁴ We prove this result in Section 3.1.

2.3 Data Sets

We represent any finite data set X ⊂ ℝ^d (with |X| = n) as the columns of a d × n matrix. Here we introduce the data sets that we use in our experiments. For all the matrices, we add a normalization step to make µ = 0 and δ = 1.

(1) Consider the random matrix F_{d×n} such that each entry of F is drawn independently from the standard normal distribution.

(2) Let G_{d×n} be the identity matrix.

(3) Let the matrix H_{d×n} be defined explicitly as follows:

    H_{d×n} = [ A  B  C ],

where A is a single column (1/√d, ..., 1/√d, 1)ᵀ, each of the ⌈√d⌉ columns of B equals (−ε, ..., −ε, 0)ᵀ, each of the d columns of C equals (0, ..., 0, −ε)ᵀ, and ε = 1/d. That is, |A| = 1, |B| = ⌈√d⌉ and |C| = d, which gives n = d + ⌈√d⌉ + 1.

Figure 2.2 shows the performance of Random sampling when applied to the matrices F, G and H with d = 20 and n = 25, generating 50 pseudosamples.

Figure 2.2: Random sampling performance on different inputs.

So far we have presented Random sampling and seen that this algorithm does not give a solution to the Sampling Problem. The next natural idea for solving the Sampling Problem is a greedy algorithm, which we explain in the next section.

2.4 Herding Algorithm

The Herding algorithm was first introduced by Welling [33] as a method for learning in intractable Markov random field models. It is a deterministic algorithm that directly generates samples from the model of study. A concise statement of the Herding algorithm is as follows⁵:

    x_{t+1} ∈ argmax_{x ∈ X} ⟨w_t, x⟩        (2.4.1a)
    w_{t+1} = w_t + µ − x_{t+1}              (2.4.1b)

The initial weight vector w_0 is initialized somewhat arbitrarily.
⁵ In the definition of the Herding algorithm, we also assume that the algorithm chooses the smallest index in the set argmax_{x ∈ X} ⟨w_t, x⟩. This assumption is used in the proof of Theorem 2.7.3.

The literature has proposed setting w_0 = 0 or w_0 = µ [4, 17]. In Section 3.2 we show that the error of the Herding algorithm is O((‖w_0‖₂ + δ²/r)/t).

Figure 2.3 shows the performance of the Herding algorithm when applied to the matrices F, G and H with d = 20 and n = d + √d + 1 = 25. The algorithm starts with w_0 = 0 and generates 50 pseudosamples. The top plot shows the error in ‖·‖₂ and the bottom plot in ‖·‖∞. Comparing this figure with Figure 2.2, it is easy to see that the 2-norm error of Herding is much better than that of Random sampling. In Section 5.6 we show that the error of the Herding algorithm is strictly worse than the error of our algorithm if we compute the error in the ℓ∞-norm. This is why we include the ℓ∞-norm evaluation of the error for the Herding algorithm in Figure 2.3.

Figure 2.3: Herding algorithm performance on different inputs.

2.4.1 The choice of w_0 is important

In the previous section, we analyzed the performance of the Herding algorithm when it starts with w_0 = µ = 0. It is interesting to see that the choice of the initial weight vector w_0 plays an important role in the analysis of the error of the Herding algorithm.

We run the Herding algorithm on matrix H to generate 150 pseudosamples. Figure 2.4 shows the ‖·‖₂ error of this algorithm when started with different weight vectors w_0:

• The dashed line shows the error of the Herding algorithm for w_0 = (0, ..., 0)ᵀ.
• The solid line shows the error of the Herding algorithm for w_0 = (1, ..., 1)ᵀ.
• The dotted line shows the error of the Herding algorithm for w_0 = (d, ..., d)ᵀ.

Figure 2.4: Herding algorithm error depends on w_0.

Our observation is that there is a direct connection between the error of the Herding algorithm and ‖w_0‖₂.
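The two-line update (2.4.1) translates directly into code. Below is a Python sketch (the thesis implementations are in Matlab; the function name and test input here are ours), with ties in the argmax broken by smallest index as assumed in footnote 5:

```python
import numpy as np

def herding(X, T, w0=None):
    """Run the Herding update (2.4.1) for T steps on the columns of X.
    Returns the chosen column indices and the ||.||_2 error after each step."""
    d, n = X.shape
    mu = X.mean(axis=1)
    w = np.zeros(d) if w0 is None else np.array(w0, dtype=float)
    total = np.zeros(d)
    picks, errors = [], []
    for t in range(1, T + 1):
        j = int(np.argmax(w @ X))  # (2.4.1a); np.argmax breaks ties by smallest index
        x = X[:, j]
        w = w + mu - x             # (2.4.1b)
        total += x
        picks.append(j)
        errors.append(np.linalg.norm(total / t - mu))
    return picks, np.array(errors)

X = np.random.default_rng(0).standard_normal((20, 25))
X -= X.mean(axis=1, keepdims=True)  # normalize so that mu = 0
picks, err = herding(X, 50)
```

With w_0 = 0 and µ = 0 the weight vector is exactly the negated running sum, so the greedy step always picks the point most opposed to the accumulated error.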
2.5 Perceptron

So far we have discussed Random sampling and the Herding algorithm as potential solutions for the Sampling Problem. We observed that the Herding algorithm solves the Sampling Problem with error that decreases as O((‖w_0‖₂ + δ²/r)/t). Initially it may seem intriguing that the error is proportional to 1/t, but this is an instance of a much more general phenomenon, which we discuss in this section.

The Perceptron algorithm was first introduced by Rosenblatt as a method for learning relations in brain data [18]. Later on, it was implemented as the "Mark 1 Perceptron" hardware and used as an AI machine for image recognition. At the time of its invention, the Perceptron seemed poised to have a big influence on the AI community. However, after a few years it was shown that it cannot be applied to a large class of data sets. In this section, we introduce the Perceptron as an online algorithm for learning a linear threshold function.

We present the Perceptron algorithm in the online learning setting, which is defined as follows [5]. The algorithm receives an unlabelled example and predicts a classification of this example using its current hypothesis. In the next step, the correct answer is presented to the algorithm, and the process repeats.

Assume that we have a set of examples {x_1, ..., x_n} with labels {y_1, ..., y_n} ∈ {±1}ⁿ. The goal of the Perceptron algorithm is to learn a linear classifier

    w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0

with the following procedure [5]:

Algorithm 1 Perceptron
Start: Set t = 1 and w_1 = 0
Iterative Step:
· Given example x_t, predict positive iff w_t · x_t > 0
· For a wrong prediction, update w_{t+1} as follows:
  • Mistake on positive: w_{t+1} = w_t + x_t
  • Mistake on negative: w_{t+1} = w_t − x_t
· t = t + 1

There is a connection between the classic Perceptron algorithm and the Herding algorithm: the general "Perceptron Cycling Theorem" applies to both of them.
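Before elaborating on that connection, Algorithm 1 can be written out concretely. The following is a Python sketch of the standard online Perceptron; the toy data set is ours, chosen to be linearly separable:

```python
import numpy as np

def perceptron(examples, labels, epochs=10):
    """Online Perceptron: predict sign(<w, x>), update w on every mistake.
    Returns the learned weight vector and the total number of mistakes."""
    w = np.zeros(examples.shape[1])
    mistakes = 0
    for _ in range(epochs):
        for x, y in zip(examples, labels):   # labels are in {+1, -1}
            if y * (w @ x) <= 0:             # wrong (or undecided) prediction
                w = w + y * x                # +x on a positive mistake, -x on a negative one
                mistakes += 1
    return w, mistakes

# Toy data: the label is the sign of the first coordinate.
Xs = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
ys = np.array([1, 1, -1, -1])
w, m = perceptron(Xs, ys)  # w correctly classifies all four points
```

On separable data the number of mistakes is finite (Novikoff's bound); here a single update already yields a consistent hypothesis.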
It was observed [19] that if Equation (2.4.1a),

    x_{t+1} ∈ argmax_{x ∈ X} ⟨w_t, x⟩,

is replaced with any rule that ensures ⟨w_t, x_{t+1}⟩ ≥ 0, then the Perceptron Cycling Theorem [13] implies that the error is at most α/t for some quantity α. However, the proof of this theorem uses compactness and does not give any quantitative bound on α. Giving an effective bound on α was an open question for decades, until [3] recently proved an explicit bound that is exponential in d.

2.6 Towards Optimal Bound for Sampling Problem

As we have seen, the Perceptron Cycling Theorem implies that the Herding algorithm solves the Sampling Problem with an error function that is bounded by α/t for some quantity α. This shows that the interesting aspect of the Sampling Problem is not that the error is proportional to 1/t, but rather determining a precise value of α such that the error is at most α/t. The purpose of this thesis is to address the following question, which seems fundamental.

Question 2.6.1. What is the best value of α (as a function of d, n or X) that can be obtained in the Sampling Problem?

The analysis of the Herding algorithm gives a bound on α that depends on r, and hence on the geometry of X. An undesirable aspect of this bound is that it does not reflect the intrinsic geometry: it is highly sensitive to rescalings of X. Imagine multiplying the first coordinate of all points in X by a quantity ε and keeping the other coordinates unchanged; if the coordinates represent features of the data, then this amounts to simply changing the units of the first feature. One can see that, as ε → 0, r decreases by a factor of roughly ε, whereas δ remains mostly unchanged. Thus, such scalings can arbitrarily increase the ratio δ/r, and hence the error bound of the Herding algorithm. This motivates the need for a bound on α that is not disrupted by unimportant rescalings.
In the next section, we introduce a geometrical setting that fulfills these conditions.

2.6.1 John's position

For any convex body K, there is a canonical transformation that negates the effects of such unimportant rescalings. We say that K is in John's position if the largest ellipsoid contained in K is a scalar multiple of the unit Euclidean ball (see [6, 30]). So consider K to be the convex hull of X, and assume that the center of this ball is µ. Then the quantity r defined above is the radius of this ball. John's theorem asserts that K is contained in the ball centered at µ of radius rd. Assuming that µ = 0, it follows that the quantity δ defined above is at most rd, and hence⁶ δ/r ≤ d. Thus, under these assumptions, the error of the Herding algorithm is O((‖w_0‖₂ + δd)/t).

⁶ This bound is tight: a simplex centered at the origin has δ/r exactly equal to d.

Figure 2.5: The body on the left is not in John's position but the body on the right is in John's position.

An attractive aspect of this bound is that it essentially depends only on d, which is a fundamental parameter of the Sampling Problem. This suggests two natural questions. First, does a bound depending only on d hold without making the assumptions of this section? Second, is this linear dependence on d optimal? We investigate the answers to these questions in the next section.

2.7 Main Results

In this section, we present the main contributions of this thesis toward answering the above questions. Our first, and rather simple, observation is:

Theorem 2.7.1. For each d, there exists a set X ⊂ ℝ^d with max_{x ∈ X} ‖x‖₂ = 1 such that every solution to the Sampling Problem on X has error Ω(√d/t) for t = Θ(d²).

If we allow error bounds that depend on n, then we can show a nearly-matching upper bound.

Theorem 2.7.2.
There is an algorithm with running time poly(d, n) that solves the Sampling Problem with error O(√d δ log^2.5(n)/t).

Theorem 2.7.1 is proven in Section 5.5 and Theorem 2.7.2 is proven in Section 5.3. We do not know whether the algorithm of Theorem 2.7.2 has strictly better error than the Herding algorithm. However, if we measure error in the ℓ∞-norm then our algorithm is seen to be strictly better, assuming that n = 2^{o(d^{1/5})}.

Theorem 2.7.3. For each d, we can construct a set X ⊂ ℝ^d such that the Herding algorithm has max_t ‖t · (∑_{i=1}^t x_i/t − µ)‖∞ = Ω(√d). On the other hand, the algorithm of Theorem 2.7.2 has max_t ‖t · (∑_{i=1}^t x_i/t − µ)‖∞ = O(log^2.5(n)).

Chapter 3
Analysis of the Existing Results

In this chapter we formally prove the upper bound results mentioned earlier for Random sampling and the Herding algorithm. In Section 3.1, we prove that the error of Random sampling after generating t samples is O(1/√t). Furthermore, this bound is tight, which proves that Random sampling does not give a solution to the Sampling Problem. In Section 3.2, we prove that the error of the Herding algorithm after generating t samples decreases as O((‖w_0‖₂ + δ²/r)/t).

3.1 Random Sampling Error Analysis

We prove that the error of Random sampling on the set X is O(1/√t). For the purposes of this thesis, we are not concerned with the other factors suppressed by the O notation in O(1/√t), as the 1/√t dependence is enough to show that Random sampling does not solve the Sampling Problem.

We are given the set X with the assumptions of Section 2.1. For the sake of simplicity, we also assume that 0 ≤ x ≤ 1 coordinate-wise for all x ∈ X⁷. Let X_1, X_2, ..., X_t, ... be the random variables that represent the pseudosamples generated by Random sampling. For every j ∈ [t], we write X_j = (X_{1j}, ..., X_{dj})ᵀ. Our approach is to apply the Chernoff bound to the tail of the distribution of the random variable ∑_{j=1}^t X_{ij} for all i ∈ [d].
Afterward, we can use a union bound and norm inequalities to bound ‖∑_{j=1}^t X_j/t − µ‖₂.

The Chernoff bound is one of the most powerful tools in probability theory. It was first introduced by Herman Rubin [31] to show that the sum of independent random variables has a distribution whose tail is bounded by an exponentially decreasing function. One statement of the Chernoff bound is as follows [20, 27]:

⁷ The proof can be generalized to any bounded interval.

Theorem 3.1.1 (Chernoff Upper Tail). Let Z_1, ..., Z_m be independent random variables such that Z_i always lies in the interval [0, 1]. Let Z = Z_1 + ... + Z_m and µ = E[Z]. Let p_i = E[Z_i] and note that µ = ∑_i p_i. Then for any δ > 0,

    Pr[Z ≥ (1 + δ)µ] ≤ exp(−µ((1 + δ) ln(1 + δ) − δ)).

For any δ ∈ [0, 1],

    Pr[Z ≥ (1 + δ)µ] ≤ exp(−µδ²/3).        (3.1.1)

For any δ ≥ 1,

    Pr[Z ≥ (1 + δ)µ] ≤ exp(−µδ/3).        (3.1.2)

Consider the sequence of independent random variables X_{i1}, X_{i2}, ..., X_{it} for i ∈ [d]. We define the new random variable ∑_{j=1}^t X_{ij}. Note that E[∑_{j=1}^t X_{ij}] = t E[X_{i1}] = tµ_i. Now we apply the Chernoff bound to the random variable ∑_{j=1}^t X_{ij} with δ = 30 ln d / (√t µ_i). From Equation 3.1.1 we know that for δ ∈ [0, 1],

    Pr[∑_{j=1}^t X_{ij} ≥ (1 + δ)tµ_i] ≤ exp(−300 ln²d / µ_i) ≤ exp(−300 ln²d),

where the last inequality uses the fact that µ_i ≤ 1. From Equation 3.1.2, for δ ≥ 1,

    Pr[∑_{j=1}^t X_{ij} ≥ (1 + δ)tµ_i] ≤ exp(−10 ln d √t) ≤ exp(−10 ln d).

Thus, for any δ ≥ 0,

    Pr[∑_{j=1}^t X_{ij}/t − µ_i ≥ δµ_i] = Pr[∑_{j=1}^t X_{ij} ≥ (1 + δ)tµ_i] ≤ exp(−10 ln d).

Since δµ_i = 30 ln d / √t,

    Pr[∑_{j=1}^t X_{ij}/t − µ_i ≥ 30 ln d / √t] ≤ exp(−10 ln d) = d⁻¹⁰.        (3.1.3)

Let us define A_i to be the event that ∑_{j=1}^t X_{ij}/t − µ_i ≤ 30 ln d / √t. We want to compute Pr[⋂_{i=1}^d A_i]. Using the union bound and Equation 3.1.3,

    Pr[⋂_{i=1}^d A_i] = 1 − Pr[⋃_{i=1}^d A_iᶜ] ≥ 1 − ∑_{i=1}^d Pr[A_iᶜ] ≥ 1 − d⁻⁹.
Thus, with probability at least 1 − d^{−9}, for all i ∈ [d],

∑_{j=1}^t X_{ij}/t − µ_i ≤ 30 ln(d)/√t.

So with probability at least 1 − d^{−9} we have that

‖∑_{j=1}^t X_j/t − µ‖_2 = √(∑_{i=1}^d (∑_{j=1}^t X_{ij}/t − µ_i)²) ≤ 30 √d ln(d)/√t.

This shows that the error of t pseudosamples generated by Random sampling is O(√d ln(d)/√t) with high probability. The dependence on d can be improved by a better choice of δ. One can complete the proof by showing that the 1/√t dependence is tight.

3.2 Herding Error Analysis

In Section 2.1, we introduced the Herding algorithm and showed its performance on different data sets. Recall that the algorithm is defined by

x_{t+1} ∈ arg max_{x∈X} ⟨w_t, x⟩,
w_{t+1} = w_t + µ − x_{t+1}.

In this section we theoretically analyze the error of the Herding algorithm and prove the upper bound O((‖w_0‖_2 + δ²/r)/t). The argument given here is a modification of the proof in [17].

Suppose the Herding algorithm generates pseudosamples x_1, x_2, …, x_t, … and weights w_0, w_1, w_2, …, w_t, …. We have w_i = w_{i−1} + µ − x_i for all 1 ≤ i ≤ t. Summing these equations gives w_t = w_0 + tµ − ∑_{i=1}^t x_i. The error of the Herding algorithm for the first t samples is therefore

‖(1/t)∑_{i=1}^t x_i − µ‖_2 = (1/t)‖∑_{i=1}^t x_i − tµ‖_2 = (1/t)‖w_0 − w_t‖_2 ≤ (1/t)(‖w_0‖_2 + ‖w_t‖_2).

So in order to bound the error, we only need to bound ‖w_t‖_2. We claim that ‖w_t‖_2 ≤ O(δ²/r). Define the convex body C = K − µ. Since ‖x‖_2 ≤ δ for all x ∈ K, we have ‖c‖_2 ≤ 2δ for all c ∈ C. Let c_t = arg max_{c∈C} ⟨w_t, c⟩ for all t; the second equation of the Herding algorithm then becomes w_{t+1} = w_t − c_t. Since K contains a ball of radius r centred at µ and C is a translate of K, C contains a ball of radius r centred at 0. Thus r · w_t/‖w_t‖_2 ∈ C.

Consider the sequence w_0, w_1, … and let k be the first index for which ‖w_k‖_2 > 2δ²/r. Then

‖w_{k+1}‖²_2 − ‖w_k‖²_2 = ‖c_k‖²_2 − 2⟨w_k, c_k⟩ ≤ 4δ² − 2⟨w_k, r · w_k/‖w_k‖_2⟩ = 4δ² − 2r‖w_k‖_2 < 0.

Therefore ‖w_k‖_2 > ‖w_{k+1}‖_2.
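The update rule recalled above is simple to implement. A minimal Python sketch on toy data (variable names are ours, not from the thesis's Matlab code) exhibits both the telescoping identity w_t = w_0 + tµ − ∑_i x_i and the O(1/t) decay of the error:

```python
import numpy as np

def herding(X, t, w0=None):
    """Run the Herding update x_{k+1} = argmax_{x in X} <w_k, x>,
    w_{k+1} = w_k + mu - x_{k+1} for t steps.  Returns the chosen
    pseudosamples, the mean mu, and the final weight vector."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    w = np.zeros(X.shape[1]) if w0 is None else np.asarray(w0, dtype=float)
    samples = []
    for _ in range(t):
        j = int(np.argmax(X @ w))        # greedy choice over the finite set X
        samples.append(X[j])
        w = w + mu - X[j]
    return np.array(samples), mu, w

# Since mu is the centroid of X, it lies in the convex hull of X, so
# ||w_t|| stays bounded on generic data and the error shrinks like 1/t.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
samples, mu, w = herding(X, 400)
errs = [np.linalg.norm(samples[:t].mean(axis=0) - mu) for t in range(1, 401)]
```

On such generic data the interiority condition of the analysis (a ball of radius r around µ inside K) holds, so the error after 400 samples is far below the early errors.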
This means that as soon as the norm of a weight vector exceeds 2δ²/r, the norm starts decreasing until it drops below 2δ²/r again. Since w_k = w_{k−1} − c_{k−1}, we have

‖w_k‖_2 ≤ ‖w_{k−1}‖_2 + ‖c_{k−1}‖_2 ≤ 2δ²/r + 2δ = O(δ²/r).

Note that this is the maximum value that the norm of any vector w_t can attain. Therefore the error of the Herding algorithm is bounded by

‖(1/t)∑_{i=1}^t x_i − µ‖_2 ≤ (1/t)(‖w_0‖_2 + ‖w_t‖_2) ≤ O((‖w_0‖_2 + δ²/r)/t).

Chapter 4

Discrepancy Theory

Discrepancy theory is a major area of study in both combinatorics and geometry [15, 25]. The powerful results in this area have applications in many theoretical areas of computer science. We begin this chapter by defining a more stringent problem that is closely related to the Sampling Problem.

4.1 Permutation Problem

Definition 4.1.1 (Permutation Problem). Let X, µ, d and n be as before. We wish to find a bijection π : {1, …, n} → X. The error of the first t points is ‖∑_{i≤t} π(i)/t − µ‖_2. The goal is to ensure that, for all t ≥ 1, the error of the first t points is at most α/t.

It is easy to see that any solution to the Permutation Problem yields a solution to the Sampling Problem with the same parameter α. To see this, consider the infinite sequence obtained by concatenating copies of X, each ordered by π. Every contiguous subsequence of length n sums to nµ. So, for every length-t prefix of this infinite sequence, only the last t mod n vectors contribute to the error. The error contributed by those last vectors is exactly ‖∑_{i=1}^{t mod n} (π(i) − µ)/t‖_2, which is assumed to be at most α/t.

Thus, to give an upper bound on the error in the Sampling Problem, it suffices to give an upper bound on the error in the Permutation Problem. This is the approach used to prove Theorem 2.7.2. To explain this approach in more detail, we must introduce tools from discrepancy theory.

Fix a convex body B ⊂ R^d that is 0-symmetric (i.e., B = −B). For any finite set V ⊂ B with ∑_{v∈V} v = 0, let β_V be the smallest value for which there is an ordering v_1, v_2, …
of V with ∑_{i≤t} v_i ∈ β_V · B for all t. The Steinitz constant of B, denoted S(B), is the supremum of β_V over all finite V ⊂ B. Note that S(B) depends only on d and B, and is not a function of |V|.

The connection between the Permutation Problem and discrepancy theory is now apparent. Let B_2^d be the Euclidean unit ball in R^d. Let X′ be X translated so that its centroid µ is the origin, and scaled so that X′ ⊂ B_2^d. There is an ordering x_1, x_2, … of X′ such that ∑_{i≤t} x_i ∈ S(B_2^d) · B_2^d for all t. This gives a solution to the Permutation Problem on X with α = S(B_2^d) · max_{x∈X} ‖x‖_2.

It is conjectured that S(B_2^d) = O(√d) (see [11, 12]). If true, this would imply a solution to the Permutation Problem, and hence to the Sampling Problem, with α = O(√d) · max_{x∈X} ‖x‖_2. This would match our lower bound in Section 5.5.

4.2 Steinitz Algorithm

The best known bound on S(B_2^d) is the Steinitz Lemma: S(B_2^d) ≤ d ([11]). Furthermore, the proof is algorithmic, so it gives an efficient solution to the Sampling Problem with α = d · max_{x∈X} ‖x‖_2. We call this solution the Steinitz algorithm. This result is of mild interest, in that it answers the question raised in the previous section of whether there is a solution with error linear in d without the various assumptions on K.

We run the Steinitz algorithm on the matrix F (defined in Section 2.3) to generate 50 pseudosamples. Figure 4.1 shows the ‖·‖_2 of the empirical error and the theoretical error (dδ/t) for this algorithm. Note that the empirical error of the Steinitz algorithm periodically returns to zero: the error starts at zero and, after every |X| iterations, the algorithm has added a complete ordering of X, which contributes zero to the error.

4.3 Our Approach for Solving the Permutation Problem

Due to the connection from the Sampling Problem to the Permutation Problem described in Section 4.1, we will focus on solving the Permutation Problem.
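The reduction of Section 4.1, that only the last t mod n points of the concatenated sequence contribute to the error, can be verified numerically. A small Python sketch (helper name ours):

```python
import numpy as np

def prefix_error(seq, mu, t):
    """Error ||(1/t) sum_{i<=t} seq_i - mu||_2 of the first t points."""
    return np.linalg.norm(np.mean(seq[:t], axis=0) - mu)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X -= X.mean(axis=0)          # centre the set so that mu = 0
mu = np.zeros(3)
perm = rng.permutation(len(X))

reps = 5
n = len(X)
seq = np.concatenate([X[perm]] * reps)   # concatenate copies of X, ordered by perm

# Every full pass through X contributes zero, so the error at time t
# depends only on the partial pass of length t mod n:
for t in range(1, reps * n + 1):
    r = t % n
    partial = np.zeros(3) if r == 0 else X[perm][:r].sum(axis=0)
    assert np.isclose(prefix_error(seq, mu, t), np.linalg.norm(partial) / t)
```

In particular the error vanishes (up to floating-point error) at every multiple of n, which is exactly the cyclic behaviour visible in Figure 5.1 later.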
The main result of this section is Theorem 4.3.1, which solves the Permutation Problem with error bounded by O(√d log^2.5 n).

Theorem 4.3.1. Let V ⊂ R^d satisfy max_{v∈V} ‖v‖_2 ≤ 1 and ∑_{v∈V} v = 0. Let n = |V|. Then there is an ordering v_1, v_2, … of V such that ‖∑_{i≤t} v_i‖_2 = O(√d log^2.5 n) for all t ≥ 1. Furthermore, there is an algorithm with running time poly(d, n) to find such an ordering.

Although Theorem 4.3.1 implies a solution to Theorem 2.7.2, it does not imply anything about the Steinitz constant S(B_2^d), because that quantity must not depend on n. To explain the proof of Theorem 4.3.1, we must introduce some further definitions and notation [8].

Figure 4.1: Error of the Steinitz algorithm for matrix F in ‖·‖_2.

Let B ⊂ R^d be a full-dimensional convex body that is 0-symmetric. Let B′ ⊂ R^d be bounded. For n ∈ N, we define sv(B, B′; n), ss(B, B′; n) and st(B, B′; n) as follows:

• sv(B, B′; n) is the smallest λ > 0 such that for all x_1, …, x_n ∈ B′ there exist ε_1, …, ε_n ∈ {−1, 1} such that ∑_{i=1}^n ε_i x_i ∈ λB. The symbol sv abbreviates "signed vectors".

• ss(B, B′; n) is the smallest λ > 0 such that for all x_1, …, x_n ∈ B′ there exist ε_1, …, ε_n ∈ {−1, 1} such that ∑_{i=1}^k ε_i x_i ∈ λB for all k ∈ [n]. The symbol ss abbreviates "signed series".

• st(B, B′; n) is the smallest λ > 0 such that for all x_1, …, x_n ∈ B′ with ∑_{i=1}^n x_i = 0 there exists a permutation π of [n] such that ∑_{i=1}^k x_{π(i)} ∈ λB for all k ∈ [n]. The symbol st abbreviates "Steinitz constant".

Let us now state some results using this notation. Let B_p be the unit ball of the ℓ_p-norm in R^d. Theorem 4.3.1 states that st(B_2, B_2; n) ≤ O(√d log^2.5 n). It is known that sv(B_∞, B_∞; n) ≤ O(√(d log(2n/d))) and that sv(B_∞, B_1; n) ≤ 2. These results can be found in standard references [1]. It is also known that sv(B_∞, B_2; n) ≤ O(√log d) [7].
The central open question in this area is Komlós' conjecture that sv(B_∞, B_2; n) ≤ O(1) [28].

Algorithmic proofs are known for some of the results mentioned above. Recent algorithmic breakthroughs proved that sv(B_∞, B_∞; n) ≤ O(√(d log(2n/d))) [9, 24]. These techniques also algorithmically prove sv(B_∞, B_2; n) ≤ O(log d), which is slightly worse than Banaszczyk's result. The core of these algorithmic results is the following lemma ([24]). We will also use this lemma as a key ingredient in our proof in Section 5.1.

Lemma 4.3.2 (Partial Coloring Lemma). Let v_1, …, v_d ∈ R^n be vectors and let x_0 ∈ [−1, 1]^n. Let c_1, …, c_d ≥ 0 be scalars such that ∑_{i=1}^d exp(−c_i²/16) ≤ n/16. Let ε > 0. Then there exists a randomized algorithm which, with probability at least 0.1, finds a point x ∈ [−1, 1]^n such that

(i) |⟨x − x_0, v_i⟩| ≤ c_i ‖v_i‖_2 for all i ∈ [d];

(ii) |x_j| ≥ 1 − ε for at least n/2 indices j ∈ [n].

The running time of the algorithm is poly(n, d, 1/ε).

Now that we have introduced the essential tools from discrepancy theory, we are ready to explain our main result in this new framework. In the next chapter we prove Theorem 2.7.2; this amounts to proving st(B_2, B_2; n) ≤ O(√d log^2.5 n) constructively.

Chapter 5

Our Algorithm

In this chapter we present the main result of this thesis, an efficient randomized algorithm for solving the Sampling Problem. Our goal is to give an algorithmic proof of st(B_2, B_2; n) ≤ O(√d log^2.5 n). Briefly, our approach consists of the following steps:

1. We use an iterative application of the Partial Coloring Lemma (Lemma 4.3.2) to show that ss(B_∞, B_2; n) ≤ O(log^2.5 n). The procedure is explained in the proof of Theorem 5.1.1.

2. We show that there is an algorithmic reduction from st to ss. Formally, we give a constructive proof of st(B_∞, B_2; n) ≤ ss(B_∞, B_2; n) in Theorem 5.2.1. Our proof is an algorithmic version of the proof of a classic theorem in discrepancy theory.

3. Applying some standard norm inequalities, we conclude that st(B_2, B_2; n) ≤ O(√d log^2.5 n).
This completes the proof of Theorem 4.3.1.

Once we have an algorithmic proof of st(B_2, B_2; n) ≤ O(√d log^2.5 n), it is straightforward to prove Theorem 2.7.2: use this algorithm to get an ordering of the elements of X and repeat this ordering infinitely many times.

The rest of this chapter is organized as follows. In Section 5.5 and Section 5.6 we prove Theorem 2.7.1 and Theorem 2.7.3 respectively. We finish this part by comparing empirical results of our algorithm, the Herding algorithm and Random sampling.

5.1 Bounding ss

In this section, we carry out the first step of our algorithm.

Theorem 5.1.1. Let V be a matrix of size d × n such that max_j ‖V_{∗,j}‖_2 ≤ δ. Then there exists χ ∈ {−1, 1}^n such that ‖∑_{j=1}^k V_{∗,j} χ_j‖_∞ ≤ O(δ log^2.5 n) for all k ∈ [n]. Furthermore, there exists a randomized algorithm to compute the coloring χ in time poly(n, d).

Proof sketch. The Partial Coloring Lemma can be used iteratively to control the discrepancy of the rows of the matrix V; this is the standard technique of Spencer used to bound sv(B_∞, B_2; n). However, it does not suffice to bound ss(B_∞, B_2; n): we must additionally control the discrepancy of all prefixes of the rows of V. The idea is to construct from V a new matrix W by adding rows that indirectly control the prefixes of the rows of V. Some care is required to ensure that δ does not increase dramatically. Finally, we iteratively apply the Partial Coloring Lemma to W to obtain the desired coloring.

Proof. The proof has several steps.

Construction of W. We construct a new matrix W from V by adding new rows which are "subrows" of the rows of V. Roughly speaking, for every row of V, we make many new copies of that row. Each copy has most of its entries zeroed out, except those in a contiguous region of length 2^k that ends at a multiple of 2^k.

More formally, define I := { [j · 2^k + 1, (j + 1) · 2^k] ∩ [n] : j ≥ 0, k ≥ 0 }, and note that |I| ≤ 2n − 1. We may enumerate the entries of I as I_1, I_2, ….
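The family I just defined, and the counting claims used below, are easy to sanity-check in code. A small Python sketch (function names ours; we take n a power of two, for which the bound |I| ≤ 2n − 1 is attained exactly, and each index lies in O(log n) members):

```python
import math

def interval_system(n):
    """The dyadic family I = {[j*2^k + 1, (j+1)*2^k] ∩ [n]} as a set of
    frozensets of indices (duplicate intervals collapse, since I is a set)."""
    fam = set()
    k = 0
    while 2 ** k <= 2 * n:
        for j in range((n - 1) // 2 ** k + 1):
            fam.add(frozenset(range(j * 2 ** k + 1,
                                    min((j + 1) * 2 ** k, n) + 1)))
        k += 1
    return fam

def prefix_decomposition(k, fam):
    """Write the prefix {1,...,k} as a disjoint union of members of fam
    (the binary decomposition used for the prefix bound later)."""
    parts, start = [], 1
    while start <= k:
        # largest member starting at `start` that fits inside [1, k]
        best = max((iv for iv in fam if min(iv) == start and max(iv) <= k),
                   key=len)
        parts.append(best)
        start = max(best) + 1
    return parts

n = 32                       # a power of two, where |I| = 2n - 1 exactly
fam = interval_system(n)
assert len(fam) == 2 * n - 1
# every index lies in at most ceil(log2 n) + 1 members of I:
assert all(sum(l in iv for iv in fam) <= math.ceil(math.log2(n)) + 1
           for l in range(1, n + 1))
```

The helper `prefix_decomposition` implements the observation, used in the "Discrepancy of prefixes" step below, that every prefix {1, …, k} splits into O(log n) disjoint members of I.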
For each i ∈ [d], let W^i be a matrix of size |I| × n, constructed from the row V_{i,∗} as follows:

for all a ∈ {1, …, |I|},  W^i_{a,l} = V_{i,l} if l ∈ I_a, and W^i_{a,l} = 0 otherwise.

Note that V_{i,∗} itself is one of the rows of W^i. Finally, the matrix W is constructed by "stacking" the matrices W^i on top of each other. That is, the rows of W are precisely the set of rows that appear in any W^i. So the matrix W has size d′ × n, where d′ = d · |I| = O(nd).

Let δ′ be max_j ‖W_{∗,j}‖_2, the largest 2-norm of any column of W. We claim that δ′ ≤ δ · √⌈log n⌉. To see this, note that for every l ∈ [n], we have |{ I ∈ I : l ∈ I }| ≤ ⌈log n⌉. Thus, each entry V_{i,j} appears in at most ⌈log n⌉ of the copies of the row V_{i,∗}.

Applying the Partial Coloring Lemma. The next step of the proof is to apply the Partial Coloring Lemma to the matrix W. We set the parameter c_i = Cδ′/‖W_{i,∗}‖_2, where C = 32 and δ′ is the maximum 2-norm of any column of W.

We now check the hypotheses of the Partial Coloring Lemma. The argument is similar to existing arguments [9, 24]; we give a complete proof in Section 5.8. We have

∑_{i=1}^{d′} ‖W_{i,∗}‖²_2 = ∑_{j=1}^n ‖W_{∗,j}‖²_2 ≤ nδ′².

In particular, there are at most n/2^r rows W_{i,∗} with ‖W_{i,∗}‖²_2 ∈ [2^r δ′², 2^{r+1} δ′²]. So

∑_{i=1}^{d′} exp(−c_i²/16) < ∑_{r∈Z} (n/2^r) exp(−C²/2^{r+5}).  (5.1.1)

We prove in Section 5.8 that the right-hand side of (5.1.1) is at most n/16 if C = 32. This establishes the hypothesis of the Partial Coloring Lemma.

Iteratively applying the Partial Coloring Lemma. This step is identical to previous results [9, 24]. The idea is that a partial coloring ensures that at least half of the coordinates are very close to either 1 or −1. If we recurse on the remaining coordinates, we can obtain a full coloring. The depth of the recursion is only ⌈log n⌉, since the number of remaining coordinates halves in each step.

Each application of the Partial Coloring Lemma increases the discrepancy of row W_{i,∗} by at most c_i ‖W_{i,∗}‖_2 = Cδ′.
The final vector resulting from this recursion is a vector χ ∈ [−1, 1]^n satisfying |χ_j| ≥ 1 − ε for all j ∈ [n], and

|⟨W_{i,∗}, χ⟩| ≤ Cδ′ · ⌈log n⌉ = O(δ′ log n)  for all i ∈ [d′].  (5.1.2)

For further explanation, see [24].

Rounding. The next step is to produce an integral coloring. As in prior work, we simply round each coordinate of χ to either +1 or −1. The additional discrepancy introduced by this rounding is easily bounded. Every entry V_{i,j} satisfies |V_{i,j}| ≤ ‖V_{∗,j}‖_2 ≤ δ. Increasing |χ_j| from 1 − ε to 1 changes the discrepancy by at most εδ, so rounding all n coordinates changes it by at most εδn. Choosing ε ≤ 1/(nδ), the additional discrepancy produced by this rounding is at most 1.

Discrepancy of prefixes. Finally, we must show that every prefix of every row of V has low discrepancy under the coloring χ. Observe that any set P = {1, …, k}, k ≤ n, can be written as the disjoint union P = ∪_{a∈S} I_a for some S ⊆ [|I|] with |S| ≤ O(log n). Thus, for any i ∈ [d],

|∑_{j=1}^k V_{i,j} χ_j| ≤ ∑_{a∈S} |∑_{j∈I_a} V_{i,j} χ_j| = ∑_{a∈S} |⟨W^i_{a,∗}, χ⟩| ≤ |S| · O(δ′ log n) = O(δ′ log² n) = O(δ log^2.5 n).

This completes the proof.

5.2 Relating ss and st

In the previous section we gave an algorithmic proof of ss(B_∞, B_2; n) ≤ O(log^2.5 n). The next step in proving Theorem 4.3.1 is to relate ss to st; furthermore, the proof should be constructive. The Chobanyan theorem is a classic theorem in discrepancy theory that relates ss and st. Unfortunately, the standard proof of this theorem (see [11]) is not constructive. We derive an algorithmic version of the Chobanyan theorem, leading to the following result.

Theorem 5.2.1. Let V be a real matrix of size d × n with ∑_j V_{∗,j} = 0. Let δ = max_j ‖V_{∗,j}‖_2. Assume that for every permutation π of [n] there exist ε_1, …, ε_n ∈ {−1, 1} (depending on π) such that max_{1≤k≤n} ‖∑_{j=1}^k ε_j V_{∗,π(j)}‖_∞ ≤ A. Furthermore, assume that there is an algorithm with running time poly(n, d) to compute the signs ε_1, …, ε_n. Then there is an algorithm to construct a permutation π* of [n] such that max_{1≤k≤n} ‖∑_{j=1}^k V_{∗,π*(j)}‖_∞ ≤ A + δ.
This algorithm also has running time poly(n, d).

Proof. We describe an algorithm that outputs the desired permutation π*. For each permutation π of [n], define disc(π) = max_{1≤k≤n} ‖∑_{j=1}^k V_{∗,π(j)}‖_∞.

Consider an arbitrary permutation π_1 of [n], and let B_1 = disc(π_1). If B_1 ≤ A, then simply output π_1. Otherwise, by the hypothesis of the theorem, there exist ε¹_1, …, ε¹_n ∈ {−1, 1} such that max_{1≤k≤n} ‖∑_{j=1}^k ε¹_j V_{∗,π_1(j)}‖_∞ ≤ A. We construct a permutation π_2 from π_1 as follows. Let

M_+ = { j ∈ [n] : ε¹_j = +1 }  and  M_− = { j ∈ [n] : ε¹_j = −1 }.

We have

∑_{j=1}^k V_{∗,π_1(j)} + ∑_{j=1}^k ε¹_j V_{∗,π_1(j)} = 2 · ∑_{j∈M_+∩[k]} V_{∗,π_1(j)},
∑_{j=1}^k V_{∗,π_1(j)} − ∑_{j=1}^k ε¹_j V_{∗,π_1(j)} = 2 · ∑_{j∈M_−∩[k]} V_{∗,π_1(j)}.

Hence

‖∑_{j∈M_+∩[k]} V_{∗,π_1(j)}‖_∞ ≤ (A + B_1)/2  and  ‖∑_{j∈M_−∩[k]} V_{∗,π_1(j)}‖_∞ ≤ (A + B_1)/2.

The new permutation π_2 is formed by concatenating the elements of M_+, in the order given by π_1, followed by the elements of M_−, in the reverse of the order given by π_1. It follows from the assumption ∑_j V_{∗,j} = 0 that

max_{1≤k≤n} ‖∑_{j=1}^k V_{∗,π_2(j)}‖_∞ ≤ (A + B_1)/2.  (5.2.1)

Let B_2 := disc(π_2). If B_2 ≤ A, we output π_2. Otherwise, (5.2.1) gives B_2 ≤ (A + B_1)/2, and so A < B_2 ≤ (A + B_1)/2 < B_1. We then repeat the procedure explained above to get another ordering π_3 from π_2.

Let ℓ = log((B_1 − A)/δ) and suppose we repeat this process ℓ times. Possibly some ordering π_i has disc(π_i) ≤ A, in which case the algorithm sets π* = π_i and terminates. If not, then since B_{i+1} − A ≤ (B_i − A)/2, we have disc(π_ℓ) = B_ℓ ≤ A + δ. So the algorithm sets π* = π_ℓ and terminates.

As a final remark, we should specify how to choose the initial ordering π_1. A good choice is to use the ordering produced by the algorithmic proof of the Steinitz Lemma [11]. This choice ensures that B_1 ≤ dδ, and so the number of iterations ℓ is at most log(B_1/δ) ≤ log d.

5.3 Proofs of Main Results

In this section, we explain the proofs of Theorem 4.3.1 and Theorem 2.7.2. The previous two sections have presented algorithmic proofs that ss(B_∞, B_2; n) ≤ O(log^2.5(n)) and that st(B_∞, B_2; n) ≤ ss(B_∞, B_2; n).
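The reordering step of Theorem 5.2.1 from Section 5.2 is concrete enough to run. The Python sketch below substitutes a simple greedy sign rule for the coloring algorithm of Theorem 5.1.1 (so the value A is only heuristic; all names are ours) and checks the key inequality (5.2.1) for one refinement step:

```python
import numpy as np

def disc(V, perm):
    """max_k || sum_{j<=k} V[perm[j]] ||_inf  (prefix discrepancy)."""
    return np.abs(np.cumsum(V[perm], axis=0)).max()

def greedy_signs(V, perm):
    """Stand-in sign oracle: pick eps_j in {-1,+1} greedily to keep the
    signed prefixes short (the thesis uses Theorem 5.1.1 instead)."""
    p = np.zeros(V.shape[1])
    eps = np.empty(len(perm))
    for i, j in enumerate(perm):
        eps[i] = -1.0 if p @ V[j] > 0 else 1.0   # sign shrinking <p, v>
        p += eps[i] * V[j]
    return eps

def refine(V, perm):
    """One step of the reordering in Theorem 5.2.1: split by sign, then
    concatenate M+ in order with M- in reverse order."""
    eps = greedy_signs(V, perm)
    plus = [j for i, j in enumerate(perm) if eps[i] > 0]
    minus = [j for i, j in enumerate(perm) if eps[i] < 0]
    A = np.abs(np.cumsum(eps[:, None] * V[perm], axis=0)).max()
    return np.array(plus + minus[::-1]), A

rng = np.random.default_rng(0)
V = rng.standard_normal((40, 3))
V -= V.mean(axis=0)                  # the vectors sum to 0, as required

perm1 = np.arange(len(V))
perm2, A = refine(V, perm1)
# The key inequality (5.2.1): disc(perm2) <= (A + disc(perm1)) / 2.
assert disc(V, perm2) <= (A + disc(V, perm1)) / 2 + 1e-9
```

The inequality holds for any sign oracle, which is exactly why the reduction works: each refinement halves the gap between disc(π_i) and A.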
The proof of Theorem 4.3.1 now follows easily.

Proof (of Theorem 4.3.1). We may think of the set V as a d × n matrix by imposing an arbitrary order on the vectors. Applying Theorem 5.1.1 with δ = 1 shows that there exists an efficient randomized algorithm that, for any ordering of the columns of V, computes a coloring χ ∈ {−1, 1}^n with max_{1≤k≤n} ‖∑_{j=1}^k V_{∗,j} χ_j‖_∞ ≤ O(log^2.5 n).

The existence of this algorithm satisfies the key hypothesis of Theorem 5.2.1. Consequently, Theorem 5.2.1 asserts that there exists an algorithm with running time poly(n, d) that constructs a permutation π of these vectors with

max_{1≤t≤n} ‖∑_{j=1}^t V_{∗,π(j)}‖_∞ ≤ O(log^2.5 n).  (5.3.1)

Using standard norm inequalities, we have that

max_{1≤t≤n} ‖∑_{j=1}^t V_{∗,π(j)}‖_2 ≤ O(√d log^2.5 n),

as required.

As observed earlier, Theorem 4.3.1 easily implies Theorem 2.7.2.

Proof (of Theorem 2.7.2). Let X ⊂ R^d and denote its members as x_1, …, x_n. Let y_1, y_2, … be the infinite sequence computed by Algorithm 2. We must show that this sequence has error O(√d δ log^2.5(n)/t).

Consider any t ≥ 1, and let p and 0 ≤ r ≤ n − 1 be integers such that t = pn + r. Note that ∑_{j=1}^n x_j − nµ = 0. Consequently, the error of y_1, …, y_t is

‖∑_{j=1}^t y_j/t − µ‖_2 = ‖∑_{j=1}^t y_j − tµ‖_2 / t
= ‖(∑_{j=1}^{pn} y_j − pnµ) + (∑_{j=pn+1}^t y_j − rµ)‖_2 / t
= ‖p(∑_{j=1}^n x_j − nµ) + (∑_{j=1}^r x_{π(j)} − rµ)‖_2 / t
= ‖∑_{j=1}^r x_{π(j)} − rµ‖_2 / t,

where the last equality uses ∑_{j=1}^n x_j − nµ = 0. By Theorem 4.3.1, the ordering π satisfies

max_{1≤r≤n} ‖∑_{j=1}^r v_{π(j)}‖_2 ≤ O(√d log^2.5 n),

which implies

max_{1≤r≤n} ‖∑_{j=1}^r x_{π(j)} − rµ‖_2 ≤ O(√d δ log^2.5 n).

It follows that the error is

‖∑_{j=1}^t y_j/t − µ‖_2 ≤ O(√d δ log^2.5(n)/t),

as required.

Algorithm 2: Our new algorithm for the Sampling Problem.
Input: a set X = {x_1, …, x_n} ⊆ R^d.
Let δ = max_{i∈[n]} ‖x_i‖_2 and µ = ∑_{i∈[n]} x_i/n.
Let V = {v_1, …, v_n} where v_i = (x_i − µ)/2δ.
Since ∑_{i∈[n]} v_i = 0 and max_{i∈[n]} ‖v_i‖_2 ≤ 1, we may apply the algorithm of Theorem 4.3.1 to get an ordering π : [n] → [n] of the elements of V.
Output: the infinite sequence y_1, y_2, …
obtained by ordering the vectors in X according to the ordering π, and then repeating this ordering infinitely many times. Formally, for all t ≥ 1, the t-th output point is y_t = x_{π(t mod n)}, where we take the range of "mod n" to be {1, …, n}.

5.4 Performance of Our Algorithm

We run our algorithm on the matrix F with d = 5 and n = 10 to generate 50 pseudosamples, measuring the error in the ℓ2-norm and the ℓ∞-norm. The cyclic behaviour in both graphs of Figure 5.1 reflects the fact that the error of the algorithm periodically hits zero every |X| steps: the zero error appears each time a complete permutation of X has been added to the sequence of samples generated so far. Comparing this figure with Figure 2.2, Figure 2.3 and Figure 4.1 for the ℓ2-norm error on matrix F, it can be seen that our algorithm performs worse than Random sampling, the Herding algorithm and the Steinitz algorithm. In Section 5.6 we will show that our algorithm beats the Herding algorithm if we measure the error in the ℓ∞-norm.

5.5 Lower Bound for the Sampling Problem

In this section we prove that every algorithm for the Sampling Problem has error Ω(√d δ/t), proving Theorem 2.7.1. This shows that the algorithm of Theorem 2.7.2 is optimal up to logarithmic factors.

Theorem 5.5.1. For any d, let n = d² and define a matrix V of size d × (n + d) whose first n columns are all equal to −(1/n) · 1 and whose last d columns are the standard basis vectors e_1, …, e_d. Then, for any sequence x_1, x_2, … generated from the columns of this matrix, there exists t ∈ N such that ‖∑_{i=1}^t x_i‖_2 ≥ Ω(√d).

Proof. Consider any sequence x_1, x_2, …. Fix t = n/2 and consider the first t elements x_1, …, x_t of the sequence. Let a_i be the number of occurrences of e_i among these t elements, and let a = ∑_{i=1}^d a_i. Define z := ∑_{i=1}^t x_i.
Then

z = (a_1 − (t − a)/n, a_2 − (t − a)/n, …, a_d − (t − a)/n)^T,

since the column −(1/n) · 1 occurs t − a times. The i-th coordinate of z is equal to a_i − 1/2 + a/n. We consider two cases.

Case 1: a ≥ n/3. Then there exists an index l for which a_l ≥ a/d ≥ n/3d. So

z_l ≥ n/(3d) − 1/2 + 1/3 = n/(3d) − 1/6 = d/3 − 1/6,

using n = d². Thus ‖z‖_2 ≥ Ω(d) ≥ Ω(√d).

Case 2: a ≤ n/3. Then a/n ≤ 1/3, so for all i ∈ [d],

a_i − 1/2 ≤ z_i ≤ a_i − 1/2 + 1/3 = a_i − 1/6.

Since a_i is a nonnegative integer, |z_i| ≥ 1/6. So every coordinate of z has absolute value at least 1/6, and thus

‖∑_{i=1}^t x_i‖_2 = ‖z‖_2 ≥ √d/6 = Ω(√d),

as required.

Figure 5.1: Error of our algorithm on matrix F measured in ‖·‖_2 and ‖·‖_∞.

Proof (of Theorem 2.7.1). Let X consist of the columns of V. Note that µ, the mean of X, is 0, and that δ = max_{x∈X} ‖x‖_2 = 1. Assume that we have an algorithm that solves the Sampling Problem on X and generates pseudosamples x_1, x_2, …. By Theorem 5.5.1 there exists an index t ∈ N such that ‖∑_{i=1}^t x_i‖_2 ≥ Ω(√d). Therefore

‖∑_{i=1}^t x_i/t − µ‖_2 ≥ Ω(√d δ/t),

as required.

5.6 ‖·‖_∞ Lower Bound for the Herding Algorithm

In this section, we define a matrix J and prove that ‖t · error‖_∞ of the Herding algorithm on this data set exceeds Ω(√d) at some point.

Theorem 5.6.1. Let ε = 1/(d√d) and let the matrix J of size d × n consist of five column blocks A, B, C, D and E, with |A| = 1, |B| = ⌈1/(ε√d)⌉, |C| = ⌈1/ε⌉, |D| = d − 1 and |E| = ⌈1/ε⌉, so that n = d + ⌈1/(ε√d)⌉ + 2⌈1/ε⌉. Block A is a single column whose first d − 1 entries are 1/√d and whose last entry is 1; block D consists of the negated standard basis vectors −e_1, …, −e_{d−1}; and blocks B, C and E consist of columns with entries in {0, ±ε, ±1}, chosen so that µ, the average of the columns, is 0. Let x_1, x_2, … be the pseudosamples generated by the Herding algorithm when we apply it to J with w_0 = 0. Then there exists T ∈ N such that ‖∑_{i=1}^T x_i‖_∞ ≥ Ω(√d).

Proof. Set T = (2√d + 1)(√d/2 − 1). Assume that the Herding algorithm has chosen pseudosamples x_1, x_2, …, x_T in the first T iterations. From the definition of the Herding algorithm and µ = 0, we know that w_i = w_{i−1} − x_i for all 1 ≤ i ≤ T. Summing these T equalities, w_T = w_0 − (x_1 + x_2 + . .
. + x_T). Substituting w_0 = 0, we get w_T = −(x_1 + x_2 + … + x_T). So in order to compute the value of ∑_{i=1}^T x_i, we only need to compute the vector w_T.

We claim that for all 1 ≤ t ≤ T, w_t is as follows. Choose 0 ≤ k ≤ √d/2 − 1 such that (2√d + 1)(k − 1) + 1 < t ≤ (2√d + 1)k + 1; then

w_t = [ 1 − k/√d, …, 1 − k/√d, −k/√d, …, −k/√d, −k ]^T,

in which there are t − 1 coordinates of value 1 − k/√d at the top, d − t coordinates of value −k/√d in the middle, and one coordinate of value −k at the bottom. We prove our claim by induction.

Base case: t = 1. From the definition of the Herding algorithm, x_1 is a column maximizing ⟨w_0, V_{∗,j}⟩ over j ∈ [n]. Since w_0 = 0, we may take x_1 = V_{∗,1}, the column in block A. So w_1 = w_0 − x_1 = −V_{∗,1}. This agrees with the formula above with k = 1.

Inductive step. Assuming that the hypothesis is true for t = 1, 2, …, l, we prove that it is also true for t = l + 1. We consider two cases.

• t = (2√d + 1)k + 1 for some 0 ≤ k ≤ √d/2 − 1. In order to determine x_{t+1}, we need to compute ⟨w_t, V_{∗,j}⟩ for all j ∈ [n] and find the column that maximizes this value. We consider five cases:

Case 1: V_{∗,j} ∈ A. In this case ⟨w_t, V_{∗,j}⟩ = k/√d + k/d.
Case 2: V_{∗,j} ∈ B. In this case ⟨w_t, V_{∗,j}⟩ = −kε(√d + 1 + 1/√d).
Case 3: V_{∗,j} ∈ C. In this case ⟨w_t, V_{∗,j}⟩ = kε.
Case 4: V_{∗,j} ∈ D. In this case ⟨w_t, V_{∗,j}⟩ = −1 + k/√d or k/√d.
Case 5: V_{∗,j} ∈ E. In this case ⟨w_t, V_{∗,j}⟩ = kε(√d + 1 + 1/√d).

Substituting the values of t, k and ε, it is easy to check that the maximum value is achieved by choosing x_{t+1} = V_{∗,1} from block A. So w_{t+1} = w_t − x_{t+1} =

[ 1 − (k + 1)/√d, …, 1 − (k + 1)/√d, −(k + 1)/√d, …, −(k + 1)/√d, −(k + 1) ]^T,

with t − 1 coordinates of value 1 − (k + 1)/√d at the top, d − t coordinates of value −(k + 1)/√d in the middle, and −(k + 1) at the bottom. Note that t = (2√d + 1)k + 1 gives t + 1 = (2√d + 1)k + 2, so (2√d + 1)((k + 1) − 1) + 1 < t + 1 ≤ (2√d + 1)(k + 1) + 1, which matches the claimed formula for w_{t+1} with parameter k + 1.

• (2√d + 1)(k − 1) + 2 ≤ t ≤ (2√d + 1)k for some 1 ≤ k ≤ √d/2 − 1. Similarly to the previous case, we need to compute ⟨w_t, V_{∗,j}⟩ for all j ∈ [n] and find the column that maximizes this value.
We consider five cases:

Case 1: V_{∗,j} ∈ A. In this case ⟨w_t, V_{∗,j}⟩ ≤ (k − 1)/√d + k/d.
Case 2: V_{∗,j} ∈ B. In this case ⟨w_t, V_{∗,j}⟩ ≤ ε(√d(2 − k) − k − k/√d).
Case 3: V_{∗,j} ∈ C. In this case ⟨w_t, V_{∗,j}⟩ = kε.
Case 4: V_{∗,j} ∈ D. In this case ⟨w_t, V_{∗,j}⟩ = −1 + k/√d or k/√d.
Case 5: V_{∗,j} ∈ E. In this case ⟨w_t, V_{∗,j}⟩ ≤ ε(√d · k + k + k/√d − 1).

Similarly to the previous case, it is easy to check that the maximum value, k/√d, is achieved by choosing x_{t+1} = −e_t from block D. (Recall that e_t denotes the t-th standard basis vector.) So

w_{t+1} = w_t − x_{t+1} = w_t + e_t = [ 1 − k/√d, …, 1 − k/√d, −k/√d, …, −k/√d, −k ]^T,

in which the top t coordinates are 1 − k/√d, the middle d − 1 − t coordinates are −k/√d, and the last coordinate is −k. Note that t + 1 is still in the same interval as before, and therefore the above vector agrees with the claimed formula for w_{t+1}. This completes the proof of our claim.

The proof is completed by setting T = (2√d + 1)(√d/2 − 1) and k = √d/2 − 1. In this case ‖w_T‖_∞ ≥ √d/2 − 1 = Ω(√d).

We finish this section by showing that Theorem 5.6.1 is borne out in practice. Figure 5.2 shows the performance of the Herding algorithm on the matrix J with d = 19 × 19 = 361, generating 600 pseudosamples. The Y-axis shows ‖t · error‖_∞ of the algorithm and the X-axis shows the number of samples. The figure clearly shows that the graph eventually crosses the line √d/2. The reason we plot ‖t · error‖_∞ instead of ‖error‖_∞ is to make the growth of the error function easier to see.

Figure 5.2: Herding algorithm error in ‖·‖_∞.

5.7 Comparison

We run our algorithm, the Herding algorithm and Random sampling on the matrices F and G with d = 6 and n = 10, and on the matrix J with d = 6, generating 30 pseudosamples. Figure 5.3 shows the error of these algorithms measured in ‖·‖_2 and ‖·‖_∞. It can be observed that our algorithm does better than the other algorithms if we measure the error in the ‖·‖_∞.
For the case of ‖·‖_2, our algorithm is strictly better than Random sampling but approximately the same as the Herding algorithm.

Figure 5.3: Comparison of the error for different sampling algorithms.

5.8 Partial Coloring Lemma Hypothesis

We prove that the right-hand side of (5.1.1) is at most n/16 if C = 32. Dividing by n, it suffices to show that

∑_{r∈Z} (1/2^r) exp(−C²/2^{r+5}) ≤ 1/16.  (5.8.1)

First, assuming only that C ≥ 0, we have

∑_{r≥6} (1/2^r) exp(−C²/2^{r+5}) ≤ ∑_{r≥6} 1/2^r = 1/32.  (5.8.2)

Next assume that C = 32. Then

∑_{r≤0} (1/2^r) exp(−C²/2^{r+5}) = ∑_{i≥0} 2^i · exp(−C² · 2^i/32) ≤ ∑_{i≥0} exp(i − 32 · 2^i) < ∑_{i≥1} exp(−31i) = e^{−31}/(1 − e^{−31}) < 1/200.  (5.8.3)

Finally,

∑_{r=1}^5 (1/2^r) exp(−C²/2^{r+5}) = e^{−16}/2 + e^{−8}/4 + e^{−4}/8 + e^{−2}/16 + e^{−1}/32 < 1/44.  (5.8.4)

Summing (5.8.2), (5.8.3) and (5.8.4) establishes (5.8.1). This establishes the hypothesis of the Partial Coloring Lemma.

Chapter 6

Conclusions

In this thesis we studied the theoretical foundations of the Herding algorithm and, more generally, the Sampling Problem. We gave lower and upper bound analyses for the error of the Sampling Problem and designed an efficient algorithm that solves this problem with near-optimal error. To derive our results we used randomness and made several assumptions on the input data. In this chapter, we discuss how these assumptions can be relaxed, suggesting directions for further research on this problem.

6.1 Remarks

Our algorithm in Chapter 5 is described as a randomized algorithm. It can be derandomized by replacing our use of the Lovett-Meka algorithm with a derandomized form of Bansal's algorithm [10]. It is also plausible that the running time of our algorithm can be improved by using recent results [29] that may be faster than the Lovett-Meka algorithm.

Our bound ss(B_∞, B_2; n) = O(log^2.5 n) can be improved if we do not require an efficient algorithm. It is also known [7] that sv(B_∞, B_2; n) = O(√log n). Our results in Chapter 5 show that

ss(B_∞, B_2; n) = O(log^1.5 n) · sv(B_∞, B_2; n).
(6.1.1)

Combining our result with Banaszczyk's yields a solution to the Sampling Problem with error O(δ√d log²(n)/t), but no known efficient algorithm achieves that bound.

Our lower bound in Theorem 2.7.1 is for the particular case t = d²/2. It would be interesting to obtain a similar bound for the cases t ≪ d² and t ≫ d².

6.2 Open Questions

Our work leaves several questions unanswered. We do not prove that our algorithm of Chapter 5 has better error than the Herding algorithm. In particular, it is conceivable that the Herding algorithm has optimal error O(δ√d/t), matching our lower bound. Alternatively, it is also conceivable that the O(δd/t) bound on the Herding algorithm's error is actually tight.

We do not know whether our bound in (6.1.1) is tight. It is conceivable that

ss(B_∞, B_2; n) = O(1) · sv(B_∞, B_2; n).  (6.2.1)

It seems that techniques substantially different from ours would be required to prove this.

A major open question in discrepancy theory [11, 12] is whether

ss(B_2, B_2; n) ≤ O(√d).  (6.2.2)

It would be very interesting if (6.2.1) were true, as then the Komlós conjecture would imply (6.2.2).

Our algorithm in Chapter 5 assumes that the input set X is finite. It would be interesting to study whether our algorithm can be generalized to the case of infinite X. It would also be interesting to study whether our algorithm gives good results for the intended applications of Herding in Markov random fields.

Our definition of the Sampling Problem states that µ is the mean of X, whereas a more general formulation could incorporate a distribution p on X. It would be interesting to see whether our algorithm can be modified for this more general case. Here we briefly suggest one possible approach. The general case should be reducible to the uniform-distribution case. Assume that p assigns probability p(x_i) to each point x_i ∈ X. If each p(x_i) = a_i/b_i for natural numbers a_i and b_i, then one can set c = LCM(b_1, …, b_n) and generate a new multiset Y with c · a_i/b_i copies of x_i.
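This construction is easy to sketch for rational probabilities (the function name is ours, not part of any algorithm in the thesis):

```python
from fractions import Fraction
from math import lcm

def to_uniform_multiset(points, probs):
    """Reduce sampling from (X, p) with rational p to uniform sampling:
    build Y containing c * p(x_i) copies of x_i, where c is the least
    common multiple of the denominators of the probabilities."""
    fracs = [Fraction(p) for p in probs]
    assert sum(fracs) == 1
    c = lcm(*(f.denominator for f in fracs))
    Y = []
    for x, f in zip(points, fracs):
        Y.extend([x] * (c * f.numerator // f.denominator))
    return Y

X = ["a", "b", "c"]
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
Y = to_uniform_multiset(X, p)
# |Y| = lcm(2, 3, 6) = 6, with 3 copies of "a", 2 of "b", 1 of "c".
```

A uniform sample from Y then has exactly the distribution p on X, which is the reduction described above.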
The uniform distribution on Y is the same as the distribution p on the original set X, so we only need to feed the set Y to our algorithm. There are two caveats with this approach. First, we have ignored irrational probabilities, and second, we may have |Y| ≫ |X|. It seems possible to mitigate the runtime increase, and to deal with irrational probabilities, through the use of standard sampling lemmas [2], which can approximate p by a distribution for which c is small.

Bibliography

[1] N. Alon and J. Spencer. The Probabilistic Method. John Wiley, 2000.

[2] I. Althöfer. On sparse approximations to randomized strategies and convex combinations. Linear Algebra and its Applications, 199:339–355, 1994.

[3] E. Amaldi and R. Hauser. Boundedness theorems for the relaxation method. Mathematics of Operations Research, 30:939–955, 2005.

[4] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between Herding and conditional gradient algorithms. In Proc. ICML, 2012.

[5] M. F. Balcan. Lecture notes of 8803 machine learning theory, 2011. Manuscript.

[6] K. Ball. An elementary introduction to modern convex geometry. Random Structures and Algorithms, 31, 1997.

[7] W. Banaszczyk. Balancing vectors and Gaussian measures of n-dimensional convex bodies. Random Structures and Algorithms, 12(4):351–360, 1998.

[8] W. Banaszczyk. On series of signed vectors and their rearrangements. Random Structures and Algorithms, 40:301–316, 2012.

[9] N. Bansal. Constructive algorithms for discrepancy minimization. In FOCS, 2010.

[10] N. Bansal and J. Spencer. Deterministic discrepancy minimization. Algorithmica, 67(4):451–471, 2013.

[11] I. Bárány. On the power of linear dependencies. Building Bridges, Bolyai Society Mathematical Studies, 19:31–45, 2008.

[12] V. Bergström. Zwei Sätze über ebene Vektorpolygone. Abh. Math. Sem. Univ. Hamburg, 8:148–152, 1931.

[13] H. D. Block and S. A. Levin. On the boundedness of an iterative procedure for solving a system of linear inequalities.
Proceedings of the American Mathematical Society, 26:229–235, 1970.
[14] L. Bornn, Y. Chen, N. de Freitas, M. Eskelin, J. Fang, and M. Welling. Herded Gibbs sampling. In International Conference on Learning Representations, 2013.
[15] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, 2000.
[16] S. Chen, J. Dick, and A. B. Owen. Consistency of Markov chain quasi-Monte Carlo on continuous state spaces. Annals of Statistics, 39:673–701, 2011.
[17] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel Herding. In Proc. UAI, 2010.
[18] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 1958.
[19] A. Gelfand, Y. Chen, L. van der Maaten, and M. Welling. On Herding and the Perceptron Cycling theorem. In Proc. NIPS, 2010.
[20] N. Harvey. Lecture notes of CPSC 536N: Randomized Algorithms, 2011. Manuscript.
[21] N. Harvey and S. Samadi. Near-optimal Herding. In COLT, pages 1165–1182, 2014.
[22] A. E. Holroyd and J. Propp. Rotor walks and Markov chains. Algorithmic Probability and Combinatorics, 520:105–126, 2010.
[23] F. Huszár and D. Duvenaud. Optimally-weighted Herding is Bayesian quadrature. In Proc. UAI, 2012.
[24] S. Lovett and R. Meka. Constructive discrepancy minimization by walking on the edges. In FOCS, 2012.
[25] J. Matoušek. Geometric Discrepancy: An Illustrated Guide. Springer, 1999.
[26] M. L. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[27] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, USA, 1995.
[28] J. Spencer. Six standard deviations suffice. Trans. Amer. Math. Soc., 289:679–706, 1985.
[29] T. Rothvoss. Constructive discrepancy minimization for convex sets. CoRR, abs/1404.0339, 2014.
[30] R. Vershynin. Lecture notes in geometric functional analysis, 2009. Manuscript.
[31] X. Lin, C. Genest, D. L. Banks, G. Molenberghs, D. W. Scott, and J. L. Wang, editors. Past, Present, and Future of Statistical Science. CRC Press, 2014.
[32] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
[33] M. Welling. Herding dynamical weights to learn. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
Item Metadata

Title: Near-optimal Herding
Creator: Samadi, Samira
Publisher: University of British Columbia
Date Issued: 2014
Genre: Thesis/Dissertation
Type: Text
Language: eng
Date Available: 2014-08-26
Provider: Vancouver : University of British Columbia Library
Rights: Attribution-NonCommercial-NoDerivs 2.5 Canada
DOI: 10.14288/1.0166033
URI: http://hdl.handle.net/2429/50167
Degree: Master of Science - MSc
Program: Computer Science
Affiliation: Science, Faculty of; Computer Science, Department of
Degree Grantor: University of British Columbia
Graduation Date: 2014-09
Campus: UBCV
Scholarly Level: Graduate
Rights URI: http://creativecommons.org/licenses/by-nc-nd/2.5/ca/