Herded Gibbs and Discretized HerdedGibbs SamplingbyMareija EskelinenB.Sc., Simon Fraser University, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2013c? Mareija Eskelinen 2013AbstractStatistical inference is at the heart of the probabilistic programming ap-proach to artificial intelligence; as far as statistical inference is concerned,the Gibbs sampler is one of the most widely applicable and popular ran-dom sampling techniques. On the flip side, deterministic sampling is stillnascent and has yet to be widely accepted into the field of Bayesian infer-ence. In this thesis, we advance the domain of deterministic sampling bydiscussing and evaluating two deterministic sampling techniques inspired bythe Gibbs sampler. The first technique is the recently-introduced herdedGibbs sampler and the second is the novel discretized herded Gibbs sam-pler. Herded Gibbs sampling is an entirely deterministic algorithm withan impressive O(1/T ) convergence rate for both models with independentvariables and fully connected probabilistic graphical models. However, theconvergence for herded Gibbs on sparsely connected probabilistic graphicalmodels is still an open problem. Additionally, the computational complexityof herded Gibbs increases exponentially with the connectivity of the prob-abilistic graphical model, making it better suited for partially connectedgraphical models. We empirically demonstrate that herded Gibbs outper-forms Gibbs on partially connected probabilistic graphical models for thetasks of image denoising with Markov Random Fields and named entityrecognition with Conditional Random Fields. Furthermore, we develop dis-cretized herded Gibbs as a variant of herded Gibbs that attempts to reducethe spatial complexity by approximating herded Gibbs. Though discretizedherded Gibbs achieves linear storage requirements and is capable of fastinitial convergence rates, it can only approximate the target up to a finiteaccuracy. This thesis highlights the power of one recent deterministic sam-pling algorithm and shows that deterministic sampling with good space andtime complexity remains an elusive goal.iiPrefaceThis thesis is a byproduct of collaboration between Nando de Freitas, JingFang, Max Welling, Yutian Chen, Luke Bornn and myself. As a result ofour collaboration various other publications arose including [8, 10, 18].This work most strongly overlaps with the recent publications of herdedGibbs that include experimental results, namely [10, 18], as we provide theempirical evaluations of herded Gibbs as part of this thesis. These empiri-cal evaluations of herded Gibbs were the direct result of collaboration withNando de Freitas and Jing Fang with whom we contributed to the develop-ment of the code base and execution of these experiments. For a completeview of herded Gibbs we include a mention of the theoretical convergenceguarantees developed by Yutian Chen [8] but make no claim to have con-tributed to that research.Any research in this thesis touching upon the novel discretized herdedGibbs is the result of collaboration with Nando de Freitas and has not beenpreviously published.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . 42 Herding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Herded Gibbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73.1 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1.1 Computational Complexity . . . . . . . . . . . . . . . 83.1.2 Spatial Complexity . . . . . . . . . . . . . . . . . . . 104 Discretized Herded Gibbs . . . . . . . . . . . . . . . . . . . . 115 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145.1 Markov Random Field . . . . . . . . . . . . . . . . . . . . . . 145.2 Herded Gibbs . . . . . . . . . . . . . . . . . . . . . . . . . . 165.2.1 Simple Complete Graph . . . . . . . . . . . . . . . . 165.2.2 Scalability on Fully Connected Markov Random Field(MRF)s . . . . . . . . . . . . . . . . . . . . . . . . . . 175.2.3 MRF for Image Denoising . . . . . . . . . . . . . . . 18ivTable of Contents5.2.4 Conditional Random Field (CRF) for Named EntityRecognition . . . . . . . . . . . . . . . . . . . . . . . 215.3 Discretized Herded Gibbs (DHG) Convergence . . . . . . . . 236 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . 29Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31vList of Tables5.1 Joint distribution of the two-variable model. . . . . . . . . . . 165.2 Errors of image denoising example after 30 iterations (all mea-surements have been scaled by ?10?3). We use an Ising priorwith Jij = 1 and four Gaussian noise models with different??s. For each ?, we generated 10 corrupted images by addingGaussian noise. The final results shown here are averages andstandard deviations (in parentheses) across the 10 corruptedimages. D denotes the damping factor in mean field. . . . . . 215.3 Gibbs, herded Gibbs and Viterbi for the Named Entity Recog-nition (NER) task. The average computational time each ap-proach took to do inference for the entire test set is listed(in square brackets). After only 400 iterations (90.48 sec-onds), herded Gibbs already achieves an F1 score of 84.75,while Gibbs, even after 800 iterations (115.92 seconds) onlyachieves an F1 score of 84.61. For the same computation,herded Gibbs is more accurate than Gibbs. . . . . . . . . . . 235.4 Weight sharing for DHG employing 5 bins. For a binary graphof three variables, each node has 22 = 4 conditioning states.This table depicts how those states are allocated to herd-ing parameters for a randomly generated three-variable MRF.Note, this is the same graph analyzed by Figure 5.10 and Ta-ble 5.5. The red entry of the table is that which is alteredin the 15-bin version. Weights with no conditioning statesassigned to them are never used. . . . . . . . . . . . . . . . . 265.5 Weights sharing for DHG employing 15 bins on an MRF ofthree variables. Note, this example employs the same graphfor which performance is depicted in Figure 5.10 and for whichweight sharing (with 5 weights) is depicted in Table 5.4. Thered entries of the table are those that differ from the 5-bincase. Weights with no conditioning states assigned to themare never used. . . . . . . . . . . . . . . . . . . . . . . . . . . 26viList of Figures4.1 An illustration of shared herding weights for variable Xi whenDHG employs four herding parameters and the probabilitydistribution pi = P (Xi|xNi) is defined by a sigmoid. (Note,pi can be defined by any function and, for that reason, isabstracted in Algorithm 4.1.) Since Neighbouring State 1(xN 1i)and Neighbouring State 2(xN 2i)create similar con-ditional probabilities for Xi = 1, they share the herding pa-rameter w4. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125.1Two-variable model. . . . . . . . . . . . . . . . . . . . . . . . 165.2 (a) Approximating a marginal distribution with Gibbs (blue)and herded Gibbs (red) for an MRF of two variables, con-structed so as to make the move from state (0, 0) to (1, 1)progressively more difficult as decreases. The four columns,from left to right, are for = 0.1, = 0.01, = 0.001 and = 0.0001. Table 5.1 provides the joint distribution for thesevariables. The error bars for Gibbs correspond to one stan-dard deviation. Row (b) illustrates that the empirical conver-gence rate of herded Gibbs matches the expected theoreticalrate. In the plot of row (b), the upper-bound in the errorof herded Gibbs was used to remove the oscillations so as toillustrate the behaviour of the algorithm more clearly. . . . . 175.3 Approximation errors for one-variable marginals on a fully-connected graph of ten variables. The graph was randomlygenerated with coupling parameters ? N (0, 1). Results areaveraged across 10 sampling trials with random starts. Theerror bars correspond to one standard deviation. (Note thatthese results are typical for MRFs for which convergence wasobserved.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18viiList of Figures5.4 Rectangular 2D lattice MRF model. Variables yi and yj de-note pixels of the given noisy image and variables xi and xjdenote pixels of the clean, uncorrupted image. . . . . . . . . 195.5 Original image (left) and its corrupted version (right), withnoise parameter ? = 4. . . . . . . . . . . . . . . . . . . . . . . 195.6 Reconstruction errors for the image denoising task. The re-sults are averaged across 10 corrupted images with GaussiannoiseN (0, 16). The error bars correspond to one standard de-viation. Mean field requires the specification of the dampingfactor D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205.7 Typical skip-chain CRF model for named entity recognition. . 215.8 Results for the application of the NER CRF to a randomWikipedia sample [1]. Entities are automatically classified asPerson, Location and Organization. . . . . . . . . . . . . . 225.9 Approximation errors for one-variable marginals on a fully-connected graph of ten variables. The graph was randomlygenerated with coupling parameters ? N (0, 1). Results areaveraged across 10 sampling trials with random starts. Theerror bars correspond to one standard deviation. . . . . . . . 245.10 Approximation errors for two-variable marginals on a fully-connected graph of three variables. The graph was randomlygenerated with coupling parameters ? N (0, 1). Results areaveraged across 10 sampling trials with random starts. Theerror bars correspond to one standard deviation. . . . . . . . 25viiiGlossaryCUD Completely Uniformly DistributedCRF Conditional Random FieldDHG Discretized Herded GibbsMCMC Markov chain Monte CarloMRF Markov Random FieldNER Named Entity RecognitionqMC quasi-Monte CarloixAcknowledgmentsI would like to express my sincere gratitude to my supervisor, Nando deFreitas, for his relentless guidance, support and motivation throughout thisprocess. I would also like to extend warm thanks to my collaborators, JingFang, Yutian Chen, Max Welling, and Luke Bornn, who were amazing towork with and with whom much of the content of this thesis was developed.Furthermore, thanks to my lab-mates who created a truly excellent workenvironment and who provided help in times of difficulty.Finally, to my friends, family, and amazing partner, thank you for beingmy lifeline and the preservers of my sanity, I would not have been able todo any of this without you!xChapter 1IntroductionIn an ideal world, analytically finding the perfect solution to any computa-tional problem would be straightforward. However, difficulties often arisethat make the analytical computation of a solution intractable. Certain real-world constraints such as the sheer volume of the data set or the algorithm?scomputational complexity push a problem into the realm of intractability.For problems where it is intractable to compute the solution analytically,some approximation scheme must be applied. In these cases, sampling playsa crucial role as one of the most common and widely applicable approxima-tion techniques. In essence, the ability to perform statistical inference, bymeans of sampling, allows one to solve incredibly complex problems rangingfrom Natural Language Processing to Computer Vision and beyond.The most common class of sampling algorithms are the Markov chainMonte Carlo (MCMC) algorithms which estimate integrals by sampling fromthe state space. (The state space is the multi-dimensional space that de-scribes the domain of a problem or its solution space.) With MCMC algo-rithms, samples are generated by performing a walk (often random) throughthe state space, known as a Markov chain. Each step of the Markov chaincontributes one sample to the estimate, i.e. the integrand value at that sam-ple location is counted in an estimate of the target integral. As more samplesare accumulated the estimate of the target integral will improve. The rateat which the empirical estimate converges to the target, as a function ofthe number of samples, is known as the sampling algorithm?s convergencerate. The convergence rate of the Gibbs sampler, a Monte Carlo algorithm,is O(1/?T ), where T denotes the number of samples.Over the last 60 years, we have witnessed great progress in the designof randomized sampling algorithms like the MCMC samplers; e.g. [3, 9,17, 24] and the references therein. In contrast, the design of deterministicalgorithms for ?sampling? from distributions is still in its inception [6, 7, 14,22]. There are, however, many important reasons for pursuing this line ofattack on the problem. From a theoretical perspective, this is a well definedmathematical challenge whose solution might have important consequences.It also brings us closer to reconciling the fact that we typically use pseudo-1Chapter 1. Introductionrandom number generators to run Monte Carlo algorithms on classical, VonNeumann architecture-based computers. Moreover, the theory for some ofthe recently proposed deterministic sampling algorithms has taught us thatthey can achieve O(1/T ) convergence rates [7, 14], which are much fasterthan the Gibbs rate of O(1/?T ) for computing ergodic averages (whereT is the number of samples). From a practical perspective, the design ofdeterministic sampling algorithms creates an opportunity for researchers toapply a great body of knowledge on optimization to the problem of sampling;see [4] for an early example of this.Some of the first deterministic samplers attempt to address concernsregarding the use of pseudo-random number generators on classical, VonNeumann architecture computers by taking advantage of low-discrepancysequences and employing these sequences in place of pseudo-random num-ber generators. In doing so, they achieve improved convergence rates. Thesequasi-Monte Carlo (qMC) sampling procedures generate samples by relyingon a uniformly distributed sequence of numbers that produce ?nice? sam-ples whose ergodic average converges to the target distribution at a rate nearO(1/T ). For any subinterval spanned by a uniformly distributed sequence,the sequence must cover that subinterval with samples proportional to thelength of the subinterval. For example, the sequence ?0, pi, 2pi, 3pi, ...? is uni-formly distributed modulo 1. The concept of uniform sequences can be mademore definitive by employing Completely Uniformly Distributed (CUD) se-quences [6]; a CUD sequence obeys more strict uniformity constraints butcan be more easily interpreted as one or more complete periods of a pseudo-random number generator. However, it is unclear if CUD sequences can beconstructed greedily.The convergence rate of O(1/T ) is attainable with qMC methods thatproduce pseudo samples by transforming sampling into optimization. Herd-ing, introduced in [26], is one such technique that employs weights to tracksample occurrences. These weights are employed to steer sample generationby directly minimizing some discrepancy between the true distribution andan ergodic average of the pseudo samples. While it guarantees a convergencerate of O(1/T ) under certain conditions [7, 11], herding is only applicableto normalized distributions. Another qMC algorithm that transforms sam-pling into optimization is the Rotor-Router algorithm [14]. This algorithmdeterministically steers sample generation in the joint space by employingthe transition probabilities to construct a sequence of ?successors? to bevisited from each joint state. This deterministic analogue of a random walkhas provably superior concentration rates for quantities such as normalizedhitting frequencies, hitting times and occupation frequencies. Again, it is2Chapter 1. Introductionshown that discrepancies in these quantities decrease at O(1/T ) instead ofthe usual O(1/?T ).The problem of sampling from unnormalized distributions is at the heartof the field of Bayesian inference and the probabilistic programming ap-proach to artificial intelligence [5, 12, 19, 20]. However, until very recentlythere did not exist deterministic tools for sampling from unnormalized mul-tivariate probability distributions; such a constraint drastically limited theapplicability of deterministic sampling procedures. At the same time, de-spite great progress in Monte Carlo simulation, the celebrated Gibbs samplercontinues to be one of the most widely-used algorithms. Some examples ofits usage are as the inference engine behind popular statistics packages [19],in several tools for text analysis [23], and Boltzmann machines [2, 13]. Thepopularity of Gibbs stems from its simplicity of implementation and thefact that it is a very generic algorithm. Without any doubt, it would be re-markable if we could design generic deterministic Gibbs samplers with fast(theoretical and empirical) rates of convergence.In this thesis, we advance the domain of deterministic sampling by dis-cussing the recently introduced herded Gibbs sampler [10, 18] as well asproposing a novel variant of this technique, DHG. Herded Gibbs deran-domizes the Gibbs sampler by employing herding?s deterministic samplingprocedure. This has the effect of generalizing herding to unnormalized distri-butions. Herded Gibbs takes advantage of the deterministic updates utilizedby herding to match moments of the target distribution and to retain theconvergence rate of O(1/T ) for distributions of independent variables andfully-connected probabilistic graphical models. Furthermore, the good con-vergence performance of herded Gibbs has been demonstrated over multi-ple applications not covered by its theoretical guarantees, including NamedEntity Recognition and image de-noising. However, a proof establishingsuitable conditions that ensure convergence of herded Gibbs sampling forsparsely connected probabilistic graphical models is still unavailable. In-terestingly, if for a fully connected graphical model, we build a new graphwhere every state is a node and directed connections exist between nodesthat can be reached with a single herded Gibbs update, then herded Gibbsbecomes equivalent to the Rotor-Router model of Alex Holroyd and JimPropp1 [14]. We expect that many of the results from this literature applyto herded Gibbs as well.Though herded Gibbs demonstrates exceptional performance for smallfully-connected graphical models and large partially-connected graphical1We thank Art Owen for pointing out this connection.31.1. Thesis Contributionsmodels, it does not scale well to large highly-connected graphical models.In these instances, the spatial complexity of herded Gibbs makes samplingintractable. The novel sampling procedure, DHG is an attempt to reducethe spatial complexity of herded Gibbs; it achieves linear storage require-ments in the maximal degree. We demonstrate that though DHG is capableof fast initial convergence rates, it can only approximate the target up to afinite accuracy.1.1 Thesis ContributionsIn this thesis, we provide an empirical evaluation of the herded Gibbs sam-pling technique, but our greatest contribution is the investigation of a novelvariant of herded Gibbs. DHG is an attempt at achieving the best of bothworlds: good computational and spatial complexity. Though the algorithmdoes not prove successful, it provides important insight into this realm ofsampling.4Chapter 2HerdingHerding [11, 25, 26] is a qMC method that produces pseudo samples bydirectly minimizing the discrepancy between some measure of the targetdistribution and an ergodic average of the pseudo samples. The algorithm,in essence, transforms the task of sampling to a task of global optimization.In order to do so, it employs weights to track sample occurrences and de-terministically updates those weights upon the generation of each sample.While it guarantees an impressive convergence rate of O(1/T ), herding isonly applicable to normalized distributions. The magic of herding lies inhow the weights are used to steer samples and improve upon the (typical)O(1/?T ) convergence rate of random samplers.Herding is a completely deterministic process that generates pseudo sam-ples, x ? X ? Rn, in a greedy fashion with the objective of matchingmoments ? of the target distribution. Herding achieves this objective bygenerating a weakly chaotic sequence of pseudo samples according to thefollowing recursion.x(t) = arg maxx?X?w(t?1),?(x)?w(t) = w(t?1) + ?? ?(x(t)), (2.1)where x(t) is the sample produced at time step t, w(t) is the weight vector ofherding parameters at time step t, ? : X ? H is a feature vector (statistic)from X to a Hilbert space H with inner product ??, ??, and ? ? H is themean moment of the target distribution (expected value of ? over the data).Importantly, this herding procedure will only be consistent if the moments? characterize the target distribution. If we choose normalized features bymaking ??(x)? constant for all x, then the update to generate samples x(t)for t = 1, 2, . . . , T in Equation 2.1 is equivalent to minimizing the objectiveJ(x1, . . . ,xT ) =???????1TT?t=1?(x(t))?????2, (2.2)where T may have no prior known value and ? ? ? =???, ?? is the naturallydefined norm based upon the inner product of the space H [4, 7].5Chapter 2. HerdingHerding can be used to produce samples from normalized probabilitydistributions. For example, given a binary probabilistic model, we mightdenote ? to be the joint distribution over that model. That is, ?i ? [0, 1]and?ni=1 ?i = 1. In this case, our feature vector ?(x) is a one-to-onemapping from x ? X = {0, 1}n to a vector of length 2n of zeros where oneentry is set to 1. If n = 2, we might map x = (0, 0)T to ?(x) = (1, 0, 0, 0)T .Therefore, an average of pseudo samples will be an empirical estimate of ?:?? = T?1T?t=1?(x(t)) (2.3)In this case, one step of the herding algorithm involves finding the largestcomponent of the weight vector (i? = arg maxi?{1,2,...,n}w(t?1)i ), settingx(t) = i?, fixing the i?-entry of ?(x(t)) to one and all other entries to zero,and updating the weight vector: w(t) = w(t?1) + (? ? ?(x(t))). The out-put is a set of samples {x(1), . . . , x(T )} for which the empirical estimate ??converges on the target distribution ? as O(1/T ).The herding method, as described thus far, only applies to normalizeddistributions or to problems where the objective is not to guarantee that thesamples come from the right target, but to ensure that some moments arematched. An interpretation of herding in terms of Bayesian quadrature hasbeen put forward recently by [15].6Chapter 3Herded GibbsHerded Gibbs sampling is at the forefront of deterministic sampling by mak-ing it more generally applicable. In particular, herded Gibbs has made itpossible to use deterministic tools for sampling from unnormalized multi-variate probability distributions. Moreover, it takes steps towards providingfast (theoretical and empirical) rates of convergence in this domain.Herded Gibbs achieves this deterministic simulation with fast rates ofconvergence by capitalizing on a recent idea for deterministic simulationknown as herding, and adapting herding to conditional sampling. Further-more, it is the Gibbs sampler that provides the structure by which to modifyherding to perform conditional sampling. Herded Gibbs sampling transformsthe deterministic global optimization of herding to a task of iterative localoptimization and in doing so retains good convergence rates while general-izing the herding technique.Definition 1. For a graph of discrete nodes G = (V,E), where the set ofnodes are the random variables V = {Xi}Ni=1, Xi ? X , let pi denote thetarget distribution defined on G.Gibbs sampling is one of the most popular methods to draw samplesfrom pi. Gibbs alternates (either systematically or randomly) the samplingof each variable Xi given XN (i) = xN (i), where i is the index of the node,and N (i) denotes the neighbors of node i. That is, Gibbs generates eachsample from its full-conditional distribution p(Xi|xN (i)).Herded Gibbs replaces the sampling from full-conditionals with herdingat the level of the full-conditionals. That is, it alternates a process of match-ing the full-conditional distributions p(Xi = xi|XN (i)). To do this, herdedGibbs defines a set of auxiliary weights {wi,xN (i)} for any value of Xi = xiand XN (i) = xN (i). For ease of presentation, we assume the domain of Xiis binary, X = {0, 1}, and we use one weight for every i and assignmentto the neighbors xN (i). Herded Gibbs can be trivially generalized to themultivariate setting by employing weight vectors in R|X | instead of scalars.If the binary variable Xi has four binary neighbors XN (i), we must main-tain 24 = 16 weight vectors. Only the weight vector corresponding to the73.1. ComplexityAlgorithm 3.1 Herded Gibbs Sampling.Input: T .Step 1: Set t = 0. Initialize X(0) in the support of pi and w(0)i,xN (i) in(pi(Xi = 1|xN (i))? 1, pi(Xi = 1|xN (i))).for t = 1? T doStep 2: Pick a node i according to some policy. Denote w = w(t?1)i,x(t?1)N (i).Step 3: If w > 0, set x(t)i = 1, otherwise set x(t)i = 0.Step 4: Update weight w(t)i,x(t)N (i)= w(t?1)i,x(t?1)N (i)+ pi(Xi = 1|x(t?1)N (i) )? x(t)iStep 5: Keep the values of all the other nodes x(t)j = x(t?1)j ,?j 6= i andall the other weights w(t)j,xN (j) = w(t?1)j,xN (j), ?j 6= i or xN (j) 6= x(t?1)N (i) .end forOutput: x(1), . . . ,x(T )current instantiation of the neighbors is updated, as illustrated in Algo-rithm 3.1. The memory complexity of herded Gibbs is exponential in themaximum node degree. Note the algorithm is a deterministic Markov pro-cess with state (X,W).The initialization in step 1 guarantees that X(t) always remains in thesupport of pi and helps guarantee the convergence of herded Gibbs. For adeterministic scan policy in step 2, we take the value of variables x(tN), t ? Nas a sample sequence. Throughout the paper all experiments employ a fixedvariable traversal for sample generation. We call one such traversal of thevariables a sweep.3.1 Complexity3.1.1 Computational ComplexityAs herded Gibbs sampling is a deterministic algorithm, there is no station-ary probability distribution of states. Instead, the convergence of herdedGibbs is evaluated by examining the average of the sample states over timeand hypothesizing that it converges to the joint distribution, i.e. our tar-get distribution pi. To make the treatment precise, we need the followingdefinition.Definition 2. For a graph of discrete nodes G = (V,E), where the set ofnodes V = {Xi}Ni=1, Xi ? X , P(?)T is the empirical estimate of the joint83.1. Complexitydistribution obtained by averaging over T samples acquired from G. P (?)T isderived from T samples, collected at the end of every sweep over N variables,starting from the ? th sweep:P (?)T (X = x) =1T?+T?1?k=?I(X(kN) = x) (3.1)The convergence rate of herded Gibbs is described by the behaviour ofthe limiting average sample distribution over time. Specifically, the conver-gence of herded Gibbs is shown by the following:limT??P (?)T (x) = pi(x),?? ? 0 (3.2)The rate of convergence of herded Gibbs is governed by the rate of con-vergence of Equation 3.2. Yutian Chen et al. [8, 18] proved convergence ofthe limiting average sample distribution to the target distribution pi. Forcompleteness we include these results in the following theorems.In an empty graph, all the variables are independent of each other andherded Gibbs reduces to running N one-variable chains in parallel. Exam-ples of failing convergence in the presence of rational ratios between the piiswere observed in [4]. The paper also mentioned the need for further theo-retical research on this matter. Theorem 1 provides formal conditions forconvergence in the restricted domain of empty graphs.Theorem 1. For an empty graph, when herded Gibbs has a fixed scanningorder, and {1, pi1, . . . , piN} are rationally independent,2 the empirical distri-bution P (?)T converges to the target distribution pi as T ?? for any ? ? 0.For fully-connected (complete) graphs, convergence is guaranteed evenwith rational ratios. In fact, herded Gibbs converges to the target jointdistribution at a rate of O(1/T ) with a O(log(T )) burn-in period. Thisstatement is formalized in Theorem 2.Theorem 2. For a fully-connected graph, when herded Gibbs has a fixedscanning order and a Dobrushin coefficient 3 of the corresponding Gibbs2A set of n real numbers, x1, x2, . . . , xn, is said to be rationally independent if for anyset of rational numbers, a1, a2, . . . , an, we have?ni=1 aixi = 0? ai = 0, ?1 ? i ? n.3Dobrushin coefficients are employed to characterize dynamical systems such as theGibbs sampler. The Dobrushin coefficient is defined as ?(x) = supx?X ,x 6=0||Tx||||x|| , where Tis the transition matrix that characterizes the sampler and X is the domain.93.1. Complexitysampler ? < 1, there exist constants l > 0, and B > 0 such thatdv(P(?)T ? pi) ??T, ?T ? T ?, ? > ??(T ) (3.3)where ? = 2N(1+?)l(1??) , T? = 2Bl , ??(T ) = log 21+?((1??)lT4N), and dv(?pi) :=12 ||?pi||1.However, for a generic graph there are no mathematical guarantees onthe convergence rate of herded Gibbs. In fact, one can easily constructsynthetic examples for which herded Gibbs does not seem to converge tothe true marginals and joint distribution. For the examples covered bythese theorems and for examples with real data, herded Gibbs demonstratesgood behaviour. The exact conditions under which herded Gibbs convergesfor sparsely connected graphs are still unknown.3.1.2 Spatial ComplexityHerded Gibbs achieves the improved convergence rate of O(1/T ) by tradingcomputational complexity for spatial complexity. The typical MCMC con-vergence rate of O(1/?T ) is replaced by an exponential storage cost withherded Gibbs. In a sense, herded Gibbs discretizes over the domain of nodestates.Definition 3. For a binary probabilistic graphical model of N nodes and anO(d) maximum degree of connectivity, herded Gibbs has a spatial complexityofO(2dN)(3.4)Due to the herded Gibbs? exponential spatial complexity (in the con-nectivity of the graphical model), this technique is only practical for prob-lem instances employing small or sparsely-connected probabilistic graphicalmodels.10Chapter 4Discretized Herded GibbsDHG is a novel sampling procedure that was inspired by herded Gibbs andwas designed to reduce herded Gibbs? spatial complexity while attemptingto preserve its convergence rate. DHG achieves a reduction in spatial com-plexity by approximating the herding procedure of herded Gibbs (Steps 3-4of Algorithm 3.1). This reduced spatial complexity is a result of parametersharing. Though DHG is scalable to large highly-connected probabilisticgraphical models, the novel technique only demonstrates fast convergenceinitially and will typically reach a limiting accuracy.DHG requires linear storage in the maximum degree of a graphical model,as compared to the exponential storage of herded Gibbs (as stated in Defi-nition 3). Such a reduction in the storage cost is obtained by grouping andsharing herding parameters belonging to each node. Herded Gibbs has expo-nential spatial complexity because a node retains O(1) parameters for eachof the possible configurations of its neighbours. In contrast, DHG retains afixed number of parameters for each node.DHG achieves linear storage requirements by defining the set of auxil-iary weights {wi,bin} for any value of Xi = xi and for each bin = {1...B}(instead of for each configuration of XNi , as is done with herded Gibbs).Each configuration of a node?s neighbours xNi is then assiciated with oneparameter according to the conditional probability of X given xNi . Thus,configurations with similar conditional probabilities share a herding weight.Figure 4.1 illustrates an instance of DHG where two conditional states ofa node, xN 1i and xN 2i , share the same herding parameter. The parameterallocation of DHG is analogous to discretization along the probability axis(range) instead of over the node states (domain).As with herded Gibbs, only the weight vector corresponding to the cur-rent instantiation of the neighbours is updated, Algorithm 4.1 illustratesthe sampling procedure of DHG. Note that the algorithm is a deterministicMarkov process with state (X,W).Definition 4. For a binary probabilistic graphical model of discrete nodesG = (V,E), where the set of nodes V = {Xi}Ni=1, Xi ? X , and a discretiza-tion of the probability axis into B intervals, DHG has a spatial complexity11Chapter 4. Discretized Herded GibbsofO (BN) (4.1)In this way, DHG was designed to be better suited for graphs of arbitrar-ily large degree. However, the addition of a meta parameter, B, controllingthe spatial complexity and expressiveness of DHG convolutes deploymentof the algorithm. Furthermore, the convergence behaviour of this noveltechnique in empirical trials suggests a terminal accuracy proportionallygoverned by the number of bins employed B.As with herded Gibbs, DHG performs a systematic scan of the variablesand samples each in turn; Algorithm 4.1 describes sample generation as afixed traversal of the variables. For DHG the conditional sampling of onevariable, given the state of its neighbours, differs from herded Gibbs by theweights employed in herding.Algorithm 4.1 describes the conditional sampling process for DHG. Anew sample for the current node, Xi, is drawn given some configurationof the node?s neighbours, xN(i), and the current herding weights for thatnode, w(t?1)i . In particular, it is the fixed configuration of xN(i) thatdetermines which of the node?s herding parameters will be used to gen-erate its next value. In herded Gibbs, the boolean vector described byw1w3w2w4Conditional (Sigmoid) Activation FunctionP (Xi|xN 2i )P (Xi|xN 1i )Figure 4.1: An illustration of shared herding weights for variable Xi whenDHG employs four herding parameters and the probability distribution pi =P (Xi|xNi) is defined by a sigmoid. (Note, pi can be defined by any functionand, for that reason, is abstracted in Algorithm 4.1.) Since NeighbouringState 1(xN 1i)and Neighbouring State 2(xN 2i)create similar conditionalprobabilities for Xi = 1, they share the herding parameter w4.12Chapter 4. Discretized Herded GibbsAlgorithm 4.1 Discretized Herded Gibbs Sampling.Input: T and B (the number of herding parameters per node).Step 1: Set t = 0. Initialize X(0) and wi,bin,?i ? {1...N} and ?bin ? {1...B}.for t = 1? T doStep 2: Pick a node i according to some policy.Step 3: Compute pxi = pi(Xi = 1|x(t?1)N (i) ).Step 4: Ascertain the appropriate herding weight by computing:binpxi = min {bpxi ?Bc, B ? 1}. Denote w = wbinpxi .Step 5: If w > 0, set x(t)i = 1, otherwise set x(t)i = 0.Step 6: Update weight w(t)binpxi= w(t?1)binpxi+ pi(Xi = 1|x(t?1)N (i) )? x(t)iStep 7: Keep the values of all the other nodes x(t)j = x(t?1)j ,?j 6= i and all theother weights w(t)bin = w(t?1)bin ,?bin 6= binpxi .end forOutput: x(1), . . . ,x(T )xN(i) uniquely determines which of the 2|Ni| herding parameters to em-ploy. Conversely, in DHG, the conditional probability distribution for Xi,? = P (Xi = 1|xN(i), ?), identifies a point in the interval [0, 1], and one ofthe B intervals in which that point lies. For the herding parameter singledout, w, the DHG algorithm greedily chooses a new value for Xi so as tominimize the discrepancy between sample averages attributed to w, and theconditional distribution, ?.The approximation scheme of DHG achieves a reduction in spatial com-plexity. In particular, the cost of employing DHG is linear in the connectiv-ity of the probabilistic graphical model and controlled by a meta-parameter.The computational effects of this algorithm?s approximation scheme are eval-uated empirically in Section 5.3. The experimental results suggest that DHGis only capable of fast convergence initially and will eventually plateau inperformance, providing no further improvements to an estimate of the tar-get.13Chapter 5ExperimentsIn this chapter, we provide an empirical evaluation of the conditional-herdingdeterministic sampling algorithms. We choose to perform our evaluationsover two varieties of probabilistic graphical models: MRF and CRF. Thoughthere exist partial theoretical convergence guarantees for herded Gibbs, weimprove upon the understanding of this algorithm by demonstrating its ef-fectiveness empirically. We begin with two experiments that are consistentwith the theoretical results of herded Gibbs. These experiments employfully-connected probabilistic graphical models. We then extend our experi-mental evaluation beyond the theory and explore the performance of herdedGibbs on partially-connected probabilistic graphical models. In contrast,our experimental evaluation of DHG is designed to flesh out an understand-ing of this novel technique. Since DHG is an approximation scheme withno theoretical guarantees behind it, empirical evaluation is all the more im-portant. Using multiple randomly-generated probabilistic graphical models,we explore the convergence of DHG. One of DHG?s shortcomings is thatafter an extended period of time, it no longer generates samples of valueand no further improvements are made to the estimate of the target. Thisis analogous to a plateau in the convergence rate of DHG after sufficientlymany iterations.5.1 Markov Random FieldThroughout Chapter 5 herded Gibbs and DHG are evaluated over two prob-abilistic graphical models: CRFs and MRFs. We employ CRFs and MRFsin slightly different capacities. The CRF is employed to highlight one real-world application of herded Gibbs on the Named Entity Recognition task.The MRF provides a generic framework that is easily manipulated to stressthe performance of both sampling techniques, and therefore it is a mainstayof our experiments. In this section, we introduce the MRF model in de-tail and describe its characteristics. In particular, we limit ourselves to thebinary MRFs employed in our experiments.A binary MRF is an undirected probabilistic graphical model defined145.1. Markov Random Fieldover variables {Xi}i?{1...N}, where Xi ? {?1,+1}, that have the MarkovProperty4. Dependencies and interactions between these variables are de-scribed by parameters ?i,j that establish the coupling strength between nodesi and j. ?i,j > 0 describes neighbouring nodes, Xi and Xj , that are likely tobe in the same state, ?i,j < 0 describes neighbouring nodes that are likely tobe in different states and ?i,j = 0 if nodes Xi and Xj are not neighbours inthe MRF. (Note that the matrix ? is symmetric.) The parameters ?i,j allowfor an analytical description of the energy of the MRF; one such expressionof the energy of an MRF can be summarized by Equation 5.1.E(x, ?) = ??i,j?i,jxi,xj (5.1)Analogously, an MRF can be summarized by defining pair-wise cliquepotentials as follows:?ij(xi, xj) =(e?ij e??ije??ij e?ij)(5.2)and the joint distribution over {Xi}i?{1...N} can be expressed as as either oftwo possible representations:p(x|?) =1Z(?)?i?j?ij(xi, xj) =1Z(?)exp??12?i?j?ijxixj?? , (5.3)Note that the probability of a joint state as described by Equation 5.3 isinversely proportional to the energy of that same state, described by Equa-tion 5.1. These representations of the joint distribution allow us to easilyderive expressions for the conditionals that can be computed analytically.An analytical expression for the conditional is essential for sampling withherded Gibbs and DHG because both of these techniques rely on matchingthe conditionals (either exactly or approximately).4The Markov Property enforces memoryless behaviour in a process.155.2. Herded Gibbs5.2 Herded GibbsIn this section, we evaluate the herded Gibbs sampling algorithm over avariety of illustrative sample problems ranging from the contrived to thereal-world. We begin our experimental analysis with an empirical evalua-tion of herded Gibbs on a small problem, contrived to stress the performanceof the technique, and continue along this grain with large randomly gener-ated problems. Though herded Gibbs demonstrates excellent performancein these artificial scenarios, we are truly interested in the performance ofthe technique for real-world tasks. As such, herded Gibbs is empiricallyevaluated for two real-world problems employing sparsely connected proba-bilistic graphical models: Named Entity Recognition and Image Denoising.Significantly, these problems are not covered by the theoretical convergenceguarantees of herded Gibbs, yet the technique demonstrates good perfor-mance. Overall, herded Gibbs performs well on fully-connected graphicalmodels and partially-connected real-world models.5.2.1 Simple Complete GraphWe begin with an illustration of herded Gibbs on a simple complete graphand demonstrate this technique substantially outperforming the Gibbs sam-pler. In particular, we consider a fully-connected MRF model of two vari-ables, X1 and X2, as shown in Figure 5.1 with a joint distribution overthese variables as shown in Table 5.1. The joint distribution of this modelis designed so as to make sampling from the model very difficult. As seenin Table 5.1, the joint distribution is parametrized by , and as decreases,transitioning between states of high probability becomes progressively moredifficult.X1 X2Figure 5.1:Two-variable model.X1 = 0 X1 = 1 P(X2)X2 = 0 1/4? 1/4X2 = 1 3/4? 3/4P(X1) 1/4 3/4 1Table 5.1: Joint distribution of the two-variable model.Figure 5.2 shows the marginal distribution P (X1 = 1) approximated byboth Gibbs and herded Gibbs for different . As decreases, both approachesrequire more iterations to converge, but herded Gibbs clearly outperformsGibbs. Furthermore, Figure 5.2b also highlights that Herding does indeedexhibit a linear convergence rate consistent with the theoretical guaranteesof Section 3.1.1.165.2. Herded Gibbs(a) Approximate marginals obtained via Gibbs (blue) and herded Gibbs (red).(b) Log-log plot of marginal approximation errors obtained via Gibbs (blue) andherded Gibbs (red).Figure 5.2: (a) Approximating a marginal distribution with Gibbs (blue)and herded Gibbs (red) for an MRF of two variables, constructed so as tomake the move from state (0, 0) to (1, 1) progressively more difficult as decreases. The four columns, from left to right, are for = 0.1, = 0.01, = 0.001 and = 0.0001. Table 5.1 provides the joint distribution for thesevariables. The error bars for Gibbs correspond to one standard deviation.Row (b) illustrates that the empirical convergence rate of herded Gibbsmatches the expected theoretical rate. In the plot of row (b), the upper-bound in the error of herded Gibbs was used to remove the oscillations soas to illustrate the behaviour of the algorithm more clearly.5.2.2 Scalability on Fully Connected MRFsThough it is not computationally practical to scale herded Gibbs to largefully connected probabilistic graphical models, we demonstrate that herdedGibbs retains good convergence for these models. We include one suchrepresentative example demonstrating the continued (excellent) convergenceof herded Gibbs sampling on an MRF of 10 variables. Figure 5.3 contraststhe convergence of herded Gibbs (having a convergence rate of O(1/T )) tothe Gibbs sampler on a large, 10-node fully-connected MRF. Herded Gibbsretains excellent convergence behaviour for this problem and it is importantto note that the convergence behaviour of herded Gibbs shown here is typical175.2. Herded Gibbs0 5000 10000 15000 20000Iterations0.000.010.020.030.040.05ErrorGibbsHerded GibbsFigure 5.3: Approximation errors for one-variable marginals on a fully-connected graph of ten variables. The graph was randomly generated withcoupling parameters ? N (0, 1). Results are averaged across 10 samplingtrials with random starts. The error bars correspond to one standard devi-ation. (Note that these results are typical for MRFs for which convergencewas observed.)of the algorithm on large models for which convergence is attainable.In poorly conditioned problems, where convergence is difficult to achievewith any MCMC algorithm, herded Gibbs also struggles to converge. Insuch situations, Gibbs sampling demonstrated similar convergence issues.5.2.3 MRF for Image DenoisingNext, we consider the standard setting of a grid-lattice MRF for imagedenoising. Let us assume that we have a binary image corrupted by noise,and that we want to infer the original clean image. LetXi ? {?1,+1} denotethe unknown true value of pixel i, and yi the observed, noise-corrupted valueof this pixel. We take advantage of the fact that neighboring pixels are likelyto have the same label by defining an MRF with an Ising prior. Thus, wespecify a rectangular 2D lattice MRF, depicted in Figure 5.4.The known parameters ?ij establish the coupling strength between nodesi and j and for the image denoising task. ?ij = 1 if node i is connected tonode j and ?ij = 0 otherwise. Note that the matrix ? is symmetric. Whenthe ?ij > 0, then the neighboring pixels i and j are likely to be in the same185.2. Herded Gibbsxiyiyjxj?xi,yiFigure 5.4: Rectangular 2D lattice MRF model. Variables yi and yj denotepixels of the given noisy image and variables xi and xj denote pixels of theclean, uncorrupted image.Figure 5.5: Original image (left) and its corrupted version (right), with noiseparameter ? = 4.state. Initializing the potentials ?ij as described encourages label smoothnesstypically consistent with natural images.For this image denoising example, the MRF model combines the Isingprior with a likelihood model for a joint representation as follows:p(x,y) = p(x)p(y|x) =??1Z?i?j?ij(xi, xj)?? .[?ip(yi|xi)](5.4)Recall that the potentials ?ij are an alternative representation of the cou-pling strength between nodes i and j, as defined in Equation 5.2, and en-courage label smoothness. The likelihood terms p(yi|xi) are conditionallyindependent (e.g. Gaussians with known variance ?2 and mean ? centered195.2. Herded Gibbsat each value of xi, denoted ?xi). In more precise terms,p(x,y|?,?, ?) =1Z(?,?, ?)exp??12?i?j?ijxixj ?12?2?i(yi ? ?xi)2?? .(5.5)In this exemplar image denoising experiment, noisy versions of the bi-nary image, seen in Figure 5.5 (left), were created through the addition ofGaussian noise, with varying ?. Figure 5.5 (right) shows a corrupted imagewith ? = 4. The L2 reconstruction errors as a function of the number ofiterations, for this example, are shown in Figure 5.6. The plot compares theherded Gibbs method against Gibbs and two versions of mean field with dif-ferent damping factors [21]. The results demonstrate that the herded Gibbstechnique is among the best methods for solving this task.A comparison for different values of ? is presented in Table 5.2. Asexpected, mean field does well in the low-noise scenario. However, meanfield requires the tuning of the damping factor and the wrong dampingfactor may result in very bad performance. Herded Gibbs demonstratesimpressive performance improvements over comparable techniques in thehigh-noise scenarios and consistently outperforms the Gibbs sampler in allscenarios.Figure 5.6: Reconstruction errors for the image denoising task. The resultsare averaged across 10 corrupted images with Gaussian noise N (0, 16). Theerror bars correspond to one standard deviation. Mean field requires thespecification of the damping factor D.205.2. Herded GibbsTable 5.2: Errors of image denoising example after 30 iterations (all mea-surements have been scaled by ?10?3). We use an Ising prior with Jij = 1and four Gaussian noise models with different ??s. For each ?, we gener-ated 10 corrupted images by adding Gaussian noise. The final results shownhere are averages and standard deviations (in parentheses) across the 10corrupted images. D denotes the damping factor in mean field.PPPPPPPMethod?2 4 6 8Herded Gibbs 21.58(0.26) 32.07(0.98) 47.52(1.64) 67.93(2.78)Gibbs 21.63(0.28) 37.20(1.23) 63.78(2.41) 90.27(3.48)Mean field (D=0.5) 15.52(0.30) 41.76(0.71) 76.24(1.65) 104.08(1.93)Mean field (D=1) 17.67(0.40) 32.04(0.76) 51.19(1.44) 74.74(2.21)5.2.4 CRF for Named Entity RecognitionOther?PersonZedOthersOther.Other?PersonZedOthersOtherdeadOther.OtherdeadOtherbabyFigure 5.7: Typical skip-chain CRF model for named entity recognition.NER involves the identification of entities, such as people and locations,within a text sample. A conditional random field (CRF) for NER models therelationship between entity labels and sentences with a conditional probabil-ity distribution: P (Y |X, ?), where X is a sentence, Y is a labeling, and ? isa vector of coupling parameters. The parameters, ?, are feature weights andmodel relationships between variables Yi and Xj or Yi and Yj . A chain CRFonly employs relationships between adjacent variables, whereas a skip-chainCRF can employ relationships between variables where subscripts i and jdiffer dramatically. Skip-chain CRFs are important in language tasks, suchas NER and semantic role labeling, because they allow us to model longdependencies in a stream of words, see Figure 5.7.Once the parameters have been learned, the CRF can be used for infer-ence; a labeling for some sentence X is found by maximizing the above prob-215.2. Herded Gibbsability. Inference for CRF models in the NER domain is typically carriedout with the Viterbi algorithm. However, if we want to accommodate long-term dependencies, thus resulting in the so called skip-chain CRFs, Viterbibecomes prohibitively expensive. To surmount this problem, the Stanfordnamed entity recognizer [16] makes use of annealed Gibbs sampling 5.To demonstrate herded Gibbs on a practical application of great interestin text mining, we modify the standard inference procedure of the Stanfordnamed entity recognizer by replacing the annealed Gibbs sampler with theherded Gibbs sampler. The herded Gibbs sampler is not annealed. To findthe maximum a posteriori sequence Y , we simply choose the sample withhighest joint discrete probability. In order to be able to compare againstViterbi, we have purposely chosen to use single-chain CRFs. We remind thereader, however, that the herded Gibbs algorithm could be used in caseswhere Viterbi inference is not possible.?Pumpkin? (Tim Roth) and ?Honey Bunn? (Amanda Plummer) are havingbreakfast in a diner. They decide to rob it after realizing they could makemoney off the customers as well as the business, as they did during theirprevious heist. Moments after they initiate the hold-up, the scene breaksoff and the title credits roll. As Jules Winnfield (Samuel L. Jackson) drives,Vincent Vega (John Travolta) talks about his experiences in Europe, fromwhere he has just returned: the hash bars in Amsterdam, the FrenchMcDonald?s and its ?Royale with Cheese?.Figure 5.8: Results for the application of the NER CRF to a randomWikipedia sample [1]. Entities are automatically classified as Person, Lo-cation and Organization.We used the pre-trained 3-class CRF model in the Stanford NER pack-age [16]. This model is a linear chain CRF with pre-defined features andpre-trained feature weights, ?. For the test set, we used the corpus for theNIST 1999 IE-ER Evaluation. Performance is measured in per-entity F1(F1 = 2 ?precision?recallprecision+recall). For all the methods, except Viterbi, we show F1scores after 100, 400 and 800 iterations in Table 5.3. For Gibbs, the re-sults shown are the averages and standard deviations over 5 random runs.We used a linear annealing schedule for Gibbs. As the results illustrate,herded Gibbs attains the same accuracy as Viterbi and it is faster than5Annealing improves the slow random walk that can initially impede the progress ofthe Gibbs sampler.225.3. DHG Convergenceannealed Gibbs. Unlike Viterbi, herded Gibbs can be easily applied to skip-chain CRFs. After only 400 iterations (90.5 seconds), herded Gibbs alreadyachieves an F1 score of 84.75, while Gibbs, even after 800 iterations (115.9seconds) only achieves an F1 score of 84.61. The experiment thus clearlydemonstrates that(i) herded Gibbs does no worse than the optimal solution, Viterbi, and(ii) herded Gibbs yields more accurate results for the same amount ofcomputation.Figure 5.8 provides a representative NER example of the performance ofGibbs, herded Gibbs and Viterbi (all methods produced the same annotationfor this short example).Table 5.3: Gibbs, herded Gibbs and Viterbi for the NER task. The averagecomputational time each approach took to do inference for the entire testset is listed (in square brackets). After only 400 iterations (90.48 seconds),herded Gibbs already achieves an F1 score of 84.75, while Gibbs, even after800 iterations (115.92 seconds) only achieves an F1 score of 84.61. For thesame computation, herded Gibbs is more accurate than Gibbs.Iterations100 400 800MethodsAnnealed Gibbs 84.36(0.16) [55.73s] 84.51(0.10) [83.49s] 84.61(0.05) [115.92s]Herded Gibbs 84.70 [59.08s] 84.75 [90.48s] 84.81 [132.00s]Viterbi 84.81[46.74s]5.3 DHG ConvergenceIn this section, we will empirically evaluate the performance of the DHGsampler. We compare the convergence behaviour of DHG to that of theGibbs sampler on multiple randomly generated probabilistic graphical mod-els.DHG was designed for large scale, highly-connected models. As such,some of the first experimental results we obtained evaluated the techniquein these areas. Figure 5.9 depicts the performance of DHG on a randomly235.3. DHG Convergencegenerated fully-connected graphical model with parameters drawn from theStandard Normal. DHG is capable of providing deceptively good perfor-mance initially, but in the long run can only approximate the target up toa finite accuracy and the Gibbs sampler will inevitably surpass the perfor-mance of DHG as seen in Figure 5.9. On a difficult problem, where Gibbsconverges slowly to the target, DHG will provide significantly superior ini-tial convergence, while on an easier problem, where Gibbs converges muchfaster to the target, DHG will (sooner) provide notably poor performancein comparison to Gibbs. In both instances, DHG converges to the target ina similar way and it is the random sampler whose performance drasticallydiffers.0 50000 100000 150000 200000Iterations0.0000.0010.0020.0030.0040.0050.0060.0070.008ErrorGibbsDiscretized Herded Gibbs 5 binsDiscretized Herded Gibbs 15 binsDiscretized Herded Gibbs 20 binsFigure 5.9: Approximation errors for one-variable marginals on a fully-connected graph of ten variables. The graph was randomly generated withcoupling parameters ? N (0, 1). Results are averaged across 10 samplingtrials with random starts. The error bars correspond to one standard devi-ation.In order to better understand the inner workings of DHG we exploredsmaller MRFs and multiple variations of DHG. In particular, Figure 5.10demonstrates the performance of two variants of DHG on a fully-connectedMRF of three variables. For any graph of three variables, herded Gibbs245.3. DHG Convergence0 50000 100000 150000 200000Iterations0.00000.00010.00020.00030.00040.0005ErrorGibbsDiscretized Herded Gibbs 5 binsDiscretized Herded Gibbs 15 binsFigure 5.10: Approximation errors for two-variable marginals on a fully-connected graph of three variables. The graph was randomly generatedwith coupling parameters ? N (0, 1). Results are averaged across 10 sam-pling trials with random starts. The error bars correspond to one standarddeviation.would require the storage of four herding weights per variable (for a totalof 12 herding parameters). In contrast, DHG allows the designer to choosehow many herding weights should be maintained per variable. We refer tothe number of weights per variable as the number of bins employed by DHG.We evaluate DHG for two choices of discretization: the first employs 5 pa-rameters per variable (for a total of 15 herding parameters), and the secondemploys 15 parameters per variable (for a total of 45 herding parameters).Though in both instances of DHG there is a surplus of parameters, mostparameters are never employed in the sampling procedure. In particular, Ta-ble 5.5 is a detailed depiction of which of the 15 parameters are employed byDHG (with 5 bins) and how those parameters are shared between differentconditioning states.255.3. DHG ConvergenceTable 5.4: Weight sharing for DHG employing 5 bins. For a binary graphof three variables, each node has 22 = 4 conditioning states. This tabledepicts how those states are allocated to herding parameters for a randomlygenerated three-variable MRF. Note, this is the same graph analyzed byFigure 5.10 and Table 5.5. The red entry of the table is that which isaltered in the 15-bin version. Weights with no conditioning states assignedto them are never used.XXXXXXXXXXVariableWeightwi,1 wi,2 wi,3 wi,4 wi,5X1 3 0 0 0 1X2 3 0 0 0 1X3 1 1 0 0 2We obtain further insight into the effects of parameter sharing when wecompare the parameter sharing of the 5 bin version (as shown in Table 5.4)to the 15 bin version of DHG (as shown in Table 5.5). The two instancesemploy nearly identical distributions of the herding parameters, differing inonly one shared weight assignment. Herding parameter w3,5 represents twoconditional states of X3 in the 5 bin case, whereas in the 15 bin case thoseconditional states of X3 are represented by herding parameters w3,14 andw3,15 (the relevant entries highlighted in red).Table 5.5: Weights sharing for DHG employing 15 bins on an MRF of threevariables. Note, this example employs the same graph for which performanceis depicted in Figure 5.10 and for which weight sharing (with 5 weights) isdepicted in Table 5.4. The red entries of the table are those that differ fromthe 5-bin case. Weights with no conditioning states assigned to them arenever used.XXXXXXXXXXVariableWeightwi,1 wi,2 ... wi,5 wi,6 wi,7 ... wi,13 wi,14 wi,15X1 3 0 ... 0 0 0 ... 0 0 1X2 3 0 ... 0 0 0 ... 0 0 1X3 1 0 ... 0 1 0 ... 0 1 1265.3. DHG ConvergenceThis small change in parameter sharing has a drastic effect on the per-formance of DHG. Figure 5.10 depicts an early plateau in the performanceof DHG with 5 bins, whereas DHG with 15 bins appears to achieve goodconvergence behaviour. Importantly, the Gibbs sampler is able to surpassthe instance of DHG employing 5 bins mid-way through the simulation andapproaches the performance of DHG employing 15 bins. This example il-lustrates the dramatic effects that parameter sharing can have on perfor-mance because of the minor change in the sharing of herding weights. Theconvergence issues of DHG can be better exemplified on large probabilis-tic graphical models. Returning to Figure 5.9 the lack of convergence ofDHG is all the more evident, even for versions employing a large numberof parameters. We hypothesize that Gibbs will eventually surpass any in-stantiation of DHG that employs a shared herding parameter to representmultiple different conditional probabilities.When a weight w is shared between two conditional states of a nodeXi, say xN 1i and xN 2i , where without loss of generality P (Xi = 1|xN 1i ) >P (Xi = 1|xN 2i ) those two states must compete for control of the sharedweight. Because the two conditioning states share a weight, their condi-tional probabilities must be ?close? to each other and the distance betweenthe conditionals has an upper bound of the bin size (as illustrated by Fig-ure 4.1). In other words, P (Xi = 1|xN 1i )? P (Xi = 1|xN 2i ) < 1/B where Bis the number of weights per variable. The DHG samples produced at firstare ?good? samples in that they chip away at the error by placing samplesat areas of high probability (that work well for both conditionals). There-fore, initially the shared weight can steer samples for both conditionals atthe same time reasonably well because the conditionals agree. Samples gen-erated later do not continue to improve the estimate of the target becausethe weight w is being torn between two competing objectives (attemptingto produce samples for the two conditionals).By an extension of this reasoning, the minimum or terminal error achievedby DHG is governed by the cumulative discrepancies between target con-ditionals that share weights. Furthermore, you will notice that the actualerror of the DHG estimate in Figure 5.9 and Figure 5.10 is quite small andless than 1/B which agrees with this theory.Though DHG eventually stops converging to the target, it is capable ofvery fast initial convergence rates. To this end we considered the possibilityof employing DHG as a mechanism to initialize the sampling process of theGibbs sampler. Our results indicated that DHG did not provide significantimprovements over the Gibbs sampler alone; in the long-run both the hybrid275.3. DHG Convergencealgorithm and the Gibbs sampler provided comparable performance.The confusion caused by DHG?s sharing of weights has been shown tohave a direct effect on the performance of this sampling technique. ThoughDHG is scalable to large highly-connected probabilistic graphical models,the novel technique only demonstrates fast convergence initially and willtypically reach a limiting accuracy. The limiting accuracy achieved by DHGis a direct result of parameter sharing. In effect, if DHG is run long enoughGibbs will surpass any version of DHG that employs shared weights.28Chapter 6Conclusion and DiscussionIn this thesis, we advance the domain of deterministic sampling throughthe evaluation of two deterministic sampling techniques: the recently in-troduced herded Gibbs sampler and the novel discretized herded Gibbssampler. Both of these techniques are completely deterministic variants ofthe infamous Gibbs sampler that trade randomness for spatial complexity.Herded Gibbs has strong theoretical convergence guarantees and demon-strates empirical performance that is consistent with those theoretical guar-antees. Furthermore, we empirically show that herded Gibbs generalizesto situations/scenarios not covered by its theoretical guarantees. On theother hand, DHG has better spatial complexity than herded Gibbs and iscapable of good convergence behaviour. However, DHG does not provideconsistently good convergence behaviour.In this thesis, we discussed recent theoretical developments surroundingherded Gibbs sampling and empirically demonstrated the good convergencebehaviour of this algorithm. As mentioned, the Herded Gibbs sampling al-gorithm is a deterministic variant of the popular Gibbs sampling algorithm.While Gibbs relies on drawing samples from the full-conditionals at ran-dom, herded Gibbs generates the samples by matching the full-conditionals.Importantly, the herded Gibbs algorithm remains very close to the Gibbsalgorithm and hence retains its simplicity of implementation.The synthetic, denoising and named entity recognition experiments pro-vided evidence that herded Gibbs outperforms Gibbs sampling. However, asdiscussed, herded Gibbs requires storage of the conditional distributions forall instantiations of the neighbors in the worst case. This storage require-ment indicates that it is more suitable for sparse probabilistic graphicalmodels, such as the CRFs used in information extraction. At the otherextreme, we highlight recent theoretical developments of deterministic sam-pling that show herded Gibbs converges with a rate of O(1/T ) for modelswith independent variables and fully-connected models. Thus, there is a gapbetween theory and practice that needs to be narrowed. We do not antic-ipate that this will be an easy task, but it is certainly a key direction forfuture work. Furthermore, the lack of convergence guarantees on sparsely29Chapter 6. Conclusion and Discussionconnected models raises a question of cause: can convergence be guaranteedon sparsely connected models that employ moment matching or are theseconvergence issues unique to herded Gibbs?DHG was developed to address the spatial constraints of herded Gibbs;it was an attempt to design an efficient herding algorithm for densely con-nected probabilistic graphical models. This novel algorithm reduced thespatial complexity of herded Gibbs by approximating its matching scheme(of the full-conditionals). DHG matched groupings over the full-conditionalsthat respect conditional probability values. In this thesis, we evaluated theconvergence of DHG on randomly generated fully-connected probabilisticgraphical models. We demonstrated with multiple examples the commonpitfalls of this novel technique as a complete sampling solution. ThoughDHG is capable of fast initial convergence rates, it can only approximatethe target up to a finite accuracy.Continuing along the path of DHG, the design of spatially and com-putationally efficient herding algorithms for densely connected probabilisticgraphical models remains an important area for future research. Such algo-rithms, in conjunction with Rao Blackwellization, will enable us to attackmany statistical inference tasks, including Bayesian variable selection andDirichlet processes.30Bibliography[1] Pulp fiction - wikipedia, the free encyclopedia @ONLINE, June 2012.[2] D. H. Ackley, G. Hinton, and T.. Sejnowski. A learning algorithm forBoltzmann machines. Cognitive Science, 9:147?169, 1985.[3] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An Introductionto MCMC for Machine Learning. Machine Learning, 50(1):5?43, 2003.[4] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence be-tween herding and conditional gradient algorithms. In InternationalConference on Machine Learning, 2012.[5] P. Carbonetto, J. Kisynski, N. de Freitas, and D. Poole. NonparametricBayesian logic. In Uncertainty in Artificial Intelligence, pages 85?93,2005.[6] S. Chen, J. Dick, and A. B. Owen. Consistency of Markov chain quasi-Monte Carlo on continuous state spaces. Annals of Statistics, 39(2):673?701, 2011.[7] Y. Chen, M. Welling, and A.J. Smola. Supersamples from kernel-herding. In Uncertainty in Artificial Intelligence, pages 109?116, 2010.[8] Yutian Chen. Herding: Driving deterministic dynamics to learn andsample probabilistic models. Ph.D. Thesis. University of California,Irvine, 2013.[9] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte CarloMethods in Practice. Statistics for Engineering and Information Science.Springer, 2001.[10] Jing Fang. Herded gibbs. M.Sc. Thesis. University of British Columbia,2012.[11] A. Gelfand, Y. Chen, L. van der Maaten, and M. Welling. On herdingand the perceptron cycling theorem. In Advances in Neural InformationProcessing Systems, pages 694?702, 2010.31Bibliography[12] N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B.Tenenbaum. Church: a language for generative models. Uncertainty inArtificial Intelligence, 2008.[13] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality ofdata with neural networks. Science, 313(5786):504?507, 2006.[14] Alexander E Holroyd and James Propp. Rotor walks and Markovchains. Algorithmic Probability and Combinatorics, 520:105?126, 2010.[15] F. Husza?r and D. Duvenaud. Optimally-weighted herding is Bayesianquadrature. Arxiv preprint arXiv:1204.1664, 2012.[16] T. Grenager J. R. Finkel and C. Manning. Incorporating non-local in-formation into information extraction systems by Gibbs sampling. InProceedings of the 43rd Annual Meeting on Association for Computa-tional Linguistics, ACL ?05, pages 363?370, Stroudsburg, PA, USA,2005. Association for Computational Linguistics.[17] J. S. Liu. Monte Carlo strategies in scientific computing. Springer,2001.[18] Nando de Freitas Mareija Eskelin Jing Fang Luke Bornn, Yutian Chenand Max Welling. Herded gibbs sampling. International Conference onLearning Representations (ICLR), To appear, 2013.[19] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS aBayesian modelling framework: Concepts, structure, and extensibility.Statistics and Computing, 10(4):325?337, 2000.[20] B. Milch and S. Russell. General-purpose MCMC inference over rela-tional structures. In Uncertainty in Artificial Intelligence, pages 349?358, 2006.[21] K. P. Murphy. Machine Learning: a Probabilistic Perspective. MITPress, 2012.[22] I. Murray and L. T. Elliott. Driving Markov chain Monte Carlo with adependent random stream. Technical Report arXiv:1204.3187, 2012.[23] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, andM. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allo-cation. In ACM SIGKDD international conference on Knowledge dis-covery and data mining, pages 569?577, 2008.32Bibliography[24] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer,2nd edition, 2004.[25] M. Welling. Herding dynamic weights for partially observed randomfield models. In Uncertainty in Artificial Intelligence, pages 599?606,2009.[26] M. Welling. Herding dynamical weights to learn. In International Con-ference on Machine Learning, pages 1121?1128, 2009.33
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Herded Gibbs and discretized herded Gibbs sampling
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Herded Gibbs and discretized herded Gibbs sampling Eskelinen, Mareija 2013
pdf
Page Metadata
Item Metadata
Title | Herded Gibbs and discretized herded Gibbs sampling |
Creator |
Eskelinen, Mareija |
Publisher | University of British Columbia |
Date Issued | 2013 |
Description | Statistical inference is at the heart of the probabilistic programming approach to artificial intelligence; as far as statistical inference is concerned, the Gibbs sampler is one of the most widely applicable and popular random sampling techniques. On the flip side, deterministic sampling is still nascent and has yet to be widely accepted into the field of Bayesian inference. In this thesis, we advance the domain of deterministic sampling by discussing and evaluating two deterministic sampling techniques inspired by the Gibbs sampler. The first technique is the recently-introduced herded Gibbs sampler and the second is the novel discretized herded Gibbs sampler. Herded Gibbs sampling is an entirely deterministic algorithm with an impressive O(1/T) convergence rate for both models with independent variables and fully connected probabilistic graphical models. However, the convergence for herded Gibbs on sparsely connected probabilistic graphical models is still an open problem. Additionally, the computational complexity of herded Gibbs increases exponentially with the connectivity of the probabilistic graphical model, making it better suited for partially connected graphical models. We empirically demonstrate that herded Gibbs outperforms Gibbs on partially connected probabilistic graphical models for the tasks of image denoising with Markov Random Fields and named entity recognition with Conditional Random Fields. Furthermore, we develop discretized herded Gibbs as a variant of herded Gibbs that attempts to reduce the spatial complexity by approximating herded Gibbs. Though discretized herded Gibbs achieves linear storage requirements and is capable of fast initial convergence rates, it can only approximate the target up to a finite accuracy. This thesis highlights the power of one recent deterministic sampling algorithm and shows that deterministic sampling with good space and time complexity remains an elusive goal. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-08-27 |
Provider | Vancouver : University of British Columbia Library |
Rights | CC0 1.0 Universal |
DOI | 10.14288/1.0052182 |
URI | http://hdl.handle.net/2429/44896 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2013-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/publicdomain/zero/1.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2013_fall_eskelinen_mareija.pdf [ 657.43kB ]
- Metadata
- JSON: 24-1.0052182.json
- JSON-LD: 24-1.0052182-ld.json
- RDF/XML (Pretty): 24-1.0052182-rdf.xml
- RDF/JSON: 24-1.0052182-rdf.json
- Turtle: 24-1.0052182-turtle.txt
- N-Triples: 24-1.0052182-rdf-ntriples.txt
- Original Record: 24-1.0052182-source.json
- Full Text
- 24-1.0052182-fulltext.txt
- Citation
- 24-1.0052182.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0052182/manifest