"Science, Faculty of"@en . "Computer Science, Department of"@en . "DSpace"@en . "UBCV"@en . "Fang, Jing"@en . "2012-09-06T17:29:50Z"@en . "2012"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this thesis, we introduce a herding variant of this algorithm that is entirely deterministic. We demonstrate, with simple examples, that herded Gibbs exhibits better convergence behavior for approximating the marginal distributions than Gibbs sampling. In particular, image denoising exemplifies the effectiveness of herded Gibbs as an inference technique for Markov Random Fields (MRFs). Also, we adopt herded Gibbs as the inference engine for Conditional Random Fields (CRFs) in Named Entity Recognition (NER) and show that it is competitive with the state of the art. The conclusion is that herded Gibbs, for graphical models with nodes of low degree, is very close to Gibbs sampling in terms of the complexity of the code and computation, but that it converges much faster."@en . "https://circle.library.ubc.ca/rest/handle/2429/43188?expand=metadata"@en . "Herded Gibbs Sampling by Jing Fang B. Eng., Zhejiang University, 2010 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) September 2012 c\u00C2\u00A9 Jing Fang, 2012 Abstract The Gibbs sampler is one of the most popular algorithms for inference in statisti- cal models. In this thesis, we introduce a herding variant of this algorithm that is entirely deterministic. We demonstrate, with simple examples, that herded Gibbs exhibits better convergence behavior for approximating the marginal distributions than Gibbs sampling. In particular, image denoising exemplifies the effectiveness of herded Gibbs as an inference technique for Markov Random Fields (MRFs). Also, we adopt herded Gibbs as the inference engine for Conditional Random Fields (CRFs) in Named Entity Recognition (NER) and show that it is competi- tive with the state of the art. The conclusion is that herded Gibbs, for graphical models with nodes of low degree, is very close to Gibbs sampling in terms of the complexity of the code and computation, but that it converges much faster. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Herding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Herding and Frank-Wolfe Algorithms . . . . . . . . . . . . . . . 6 2.1.1 Frank-Wolfe Algorithms . . . . . . . . . . . . . . . . . . 6 2.1.2 Herding as a Frank-Wolfe Algorithm . . . . . . . . . . . 7 2.2 Herding and Maximum/Minimum Entropy . . . . . . . . . . . . . 7 3 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
9 4 Herded Gibbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 iii 4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.1 Simple Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.1.1 Two-Node Example . . . . . . . . . . . . . . . . . . . . 15 5.1.2 Four-Node Example . . . . . . . . . . . . . . . . . . . . 17 5.1.3 Three-Node Example . . . . . . . . . . . . . . . . . . . . 19 5.2 Markov Random Field (MRF) for Image Denoising . . . . . . . . 20 5.3 Conditional Random Field (CRF) for Named Entity Recognition . 26 6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 iv List of Tables Table 5.1 Joint distribution of the two-variable model. . . . . . . . . . . 16 Table 5.2 Joint distribution of the three-variable model. The model is parametrized such that x1 \u00E2\u008A\u00A5 x2|x3 and p(x1|x3) = p(x2|x3). . . 19 Table 5.3 Errors of image denoising example after 30 iterations (all mea- surements have been scaled by \u00C3\u009710\u00E2\u0088\u00923). We use an Ising prior with Wi j = 1 and Gaussian noise models with different \u00CF\u0083 \u00E2\u0080\u0099s. For each \u00CF\u0083 , we generated 10 corrupted images by adding Gaussian noise. The final results shown here are averages and standard deviations (in the parentheses) across the 10 corrupted images. I: in-place updates; P: parallel updates; D: damping factor; R: random step size. . . . . . . . . . . . . . . . . . . . . . . . . 25 v Table 5.4 Comparisons of Gibbs, herded Gibbs, herded Gibbs with a ran- domized step size, and Viterbi for the NER task. We used the pre-trained 3-class CRF model in the Stanford NER package [7]. For the test set, we used the corpus for the NIST 1999 IE-ER Evaluation. Performance is measured in per-entity F1( F1 = 2 \u00C2\u00B7 precision\u00C2\u00B7recallprecision+recall ) . For all the methods except Viterbi, we show F1 scores after 100, 400 and 800 iterations. For Gibbs and herded Gibbs with a random step size, the results shown are the averages and standard deviations (in the parentheses) over 5 random runs. We used a linear annealing schedule for Gibbs and in-place updates for both versions of herded Gibbs. The average computational time each approach took to do inference for the entire test set is also listed (in the brackets). R: random step size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 vi List of Figures Figure 2.1 Two iterations corresponding to two versions of the conditional gradient algorithm. Left: \u00CF\u0081(t) = 1/(t+1); right: line search. . 6 Figure 5.1 Two-variable model. . . . . . . . . . . . . . . . . . . . . . . 16 Figure 5.2 Approximate marginals obtained via Gibbs and herded Gibbs for an MRF of two variables, constructed so as to make the move from state (0,0) to (1,1) progressively more difficult as \u00CE\u00B5 decreases. Table 5.1 provides the joint distribution for these variables. The error bars correspond to one standard deviation. The plots are best viewed in color. . . . . . . . . . . . . . . . 16 Figure 5.3 Four-variable model. . . . . . . . . . . . . . . . . . . . . . . 17 Figure 5.4 The error in approximating the marginals (top) and joint (bot- tom) for an MRF of four variables. 
When the ratios of condi- tional distributions are rational, the deterministic herded Gibbs fails to converge to the joint. This is however not a problem when the ratios are irrational. In both cases, herded Gibbs with a randomized step size converges well, while providing a very significant improvement over standard Gibbs sampling. . . . . 18 Figure 5.5 Three-variable model. . . . . . . . . . . . . . . . . . . . . . 19 Figure 5.6 The marginal of the first node of a three-node MRF and its esti- mates. Herded Gibbs with a random step size converges to the correct value at an impressive rate. The left plot shows 5,000 iterations while the right plots shows 50,000 iterations. . . . . 20 vii Figure 5.7 The MRF employed for the task of image denoising. The vari- ables xi,x j are hidden variables representing the true pixel val- ues, and the variables yi,y j are observed variables representing the noisy pixel values. . . . . . . . . . . . . . . . . . . . . . 21 Figure 5.8 Image denoising results using different methods to perform ap- proximate inference. We use an Ising prior with Wi j = 1 and a Gaussian noise model with \u00CF\u0083 = 4. We show the results after 1, 10 and 30 iterations across the image and also the posterior means computed by averaging over the last 10 iterations. We use in-place updates for each method, \u00CE\u00B5 \u00E2\u0088\u00BC 2 U[0,1/2] for herded Gibbs with a randomized step size, and a damping factor of 1 for mean field inference. . . . . . . . . . . . . . . . . . . . . 22 Figure 5.9 Reconstruction error for the image denoising task. The results are averaged across 10 corrupted images with Gaussian noise N (0,16). The error bars correspond to one standard devi- ation. Here we introduce a shared version of herded Gibbs, which stores conditionals that have the same potential values together. As shown, the shared version of herded Gibbs outper- forms all the other approaches. In general, the in-place update versions of each approach perform better than the correspond- ing parallel update versions. Herded Gibbs with a random- ized step size has almost the same performance with standard Herded Gibbs under the same setting. For better display, we do not show them in the plots. . . . . . . . . . . . . . . . . . 24 Figure 5.10 A representative application of NER to a random Wikipedia sample [1]. Entities are identified as follows: Person, Loca- tion, Organization. . . . . . . . . . . . . . . . . . . . . . . . 26 viii Glossary MCMC Markov chain Monte Carlo MRF Markov Random Field CRF Conditional Random Field NER Named Entity Recognition ix Acknowledgments I would like to express my utmost gratitude to many people, without whom this thesis would not have been possible. First and foremost, my supervisor Nando de Freitas for his invaluable guidance and support throughout my graduate study. He devoted numerous hours to my work and gave me the freedom to explore my own ideas. Dr. Luke Bornn and Dr. David Poole for their collaboration and valuable feedback on this work. Mareija Eskelinen for being a wonderful collaborator and for the time and efforts she devoted to proofread my thesis. Misha Denil, Masrour Zoghi, Ziyu Wang, Matt Hoffman, David Buchman, Shakir Mohamed, Xing Chen, Shafiq Joty, Kevin Swersky and other fellow UBC CS students for discussing ideas, answering questions, and sharing code with me. I highly value their openness and willingness to help. Finally, I would like to thank my parents for their tireless love and support. 
It means very much to me that they are proud to have me as their daughter. x Chapter 1 Introduction Computing the posterior distribution is the core component of Bayesian analysis. This computation involves the integration of high-dimensional functions, which can be computationally very difficult. Several approaches that avoid direct integra- tion have been proposed over the decades [6, 14]. Markov chain Monte Carlo (MCMC) methods are one of the most popular classes of such algorithms, which attempt to simulate direct draws from some com- plex distribution of interest [2, 15]. They generate samples based on constructing a Markov chain whose equilibrium distribution is exactly the target distribution. As the transition probabilities between sample values are only a function of the most recent sample values, it only needs the previous sample values in order to generate the next. The states of the chain after a large number of steps (the so-called burn-in period) can then be used as a sample of the target distribution. The quality of the sample improves as a function of the number of steps. One particular MCMC method, the Gibbs sampler [11], was realized to be widely applicable to a broad class of Bayesian problems in the early 1990s [10]. It was first introduced in the context of image processing as a special case of the Metropolis-Hastings algorithm, and was then extended to be a general framework for sampling from a large set of variables. Gibbs sampling is commonly used for statistical inference. It is applicable when the joint distribution is not known ex- plicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and is easy to sample from. 1 Over the decades, numerous variations and extensions of the basic Gibbs sam- pler have arisen. In this thesis, we introduce a novel variant of the Gibbs sampler that is entirely deterministic - herded Gibbs. It builds upon a recently published deterministic herding technique and is shown empirically to be an effective and valuable algorithm for statistical inference. 1.1 Thesis Contributions In this thesis, we present herded Gibbs, a deterministic variant of the well-known and widely-used Gibbs sampler. We provide an empirical study of the proposed algorithm and show that it is a valuable algorithm that performs better than the standard Gibbs sampler in several different tasks of great interest, including image denoising and Named Entity Recognition. Our algorithm is a novel and key step in the design of herding methods for sampling from large discrete distributions. By an image denoising example, we show that a shared version of herded Gibbs requires less storage while performing even better than comparable inference tech- niques. This to some extent resolves the storage issue of herded Gibbs and makes it a potentially more useful algorithm that is capable of being applied to more general models. We also show by experiments that, for herded Gibbs to converge to both the marginals and joint, we require the ratios of the conditional distributions to be ir- rational. If this is satisfied, herded Gibbs should converge to the correct marginals and joint. Otherwise, it may fail to converge. Furthermore, by simply randomizing the step size of herded Gibbs, we will be able to deliver the correct marginal and joint distributions, while retaining improved rates of convergence over the tradi- tional Gibbs sampling. 
This thesis presents a very effective, useful and straightforward algorithm as well as important methodological and empirical contributions to research on herding. We believe this contribution will be of wide interest to researchers in the machine learning community.

1.2 Thesis Organization

This thesis is organized as follows: in Chapters 2 and 3, we give an overview of the herding algorithm and the Gibbs sampler respectively, which are the bases of our algorithm. We then introduce in Chapter 4 the proposed algorithm, herded Gibbs, which is a variant of the Gibbs sampler and is entirely deterministic. In Chapter 5, we use several experiments to show the effectiveness of herded Gibbs as an inference technique and compare it with several existing popular algorithms for inference. Finally, we discuss the limitations and extensions of herded Gibbs and conclude in Chapter 6.

Chapter 2
Herding

Herding [9, 16, 17] is a deterministic procedure for generating pseudo-samples x ∈ X ⊆ R^n, such that the empirical moments µ of the data are matched. A moment µ_i is the expectation of some feature function φ_i over the data and can be used as a quantitative measure of the data set; examples include the mean and the variance. The herding procedure, at iteration t, is as follows:

    x^{(t)} \in \arg\max_{x \in \mathcal{X}} \langle w^{(t-1)}, \phi(x) \rangle, \qquad
    w^{(t)} = w^{(t-1)} + \mu - \phi(x^{(t)}),        (2.1)

where φ : X → H is a feature map (statistics) from X to a Hilbert space H with inner product ⟨·,·⟩, w ∈ H is the vector of parameters, and µ ∈ H is the moment vector that we want to match (E[φ(x)], the expected value of φ over the data). The pseudo-samples x^{(t)}, for t = 1, 2, ..., T, are generated so as to minimize the objective function

    E_T = \left\| \mu - \frac{1}{T} \sum_{t=1}^{T} \phi(x^{(t)}) \right\|^2,        (2.2)

where ‖·‖ = √⟨·,·⟩ is the norm induced by the inner product of the space H [3, 5].

As presented, herding is a moment matching algorithm. However, it can also be used to produce samples from normalized probability distributions. For example, let µ denote a discrete, normalized probability distribution with µ_i = p(x = i), i = 1, ..., n. That is, µ_i is the probability of the i-th bin. A natural feature in this case is the vector φ(x) that has all entries equal to zero, except for the entry at the position indicated by x. For instance, if x = 2, we have φ(x) = (0, 1, 0, 0, ..., 0)^T. Hence, µ̂ = (1/T) ∑_{t=1}^{T} φ(x^{(t)}) is an empirical estimate of the distribution. The corresponding pseudocode is shown in Algorithm 2.1. Note that this is a simple instance of the algorithm of equation (2.1).

Algorithm 2.1 Herding to sample from a discrete distribution
Input: µ, T
Initialize w deterministically, e.g. as a vector of ones of size n.
for t = 1, 2, ..., T do
  Find the largest component of w: i* = argmax_{i ∈ {1,2,...,n}} w_i
  Set x^{(t)} = i*
  Set φ to be a vector of zeros of size n
  Set the i*-th entry of φ to one
  Update: w^{(t+1)} = w^{(t)} + (µ − φ)
end for
Output: x^{(1)}, ..., x^{(T)}
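For concreteness, the following is a minimal NumPy sketch of Algorithm 2.1. It is not code from the thesis; the function name and the tie-breaking behaviour of argmax (the first maximal entry is picked) are illustrative choices.

```python
import numpy as np

def herd_discrete(mu, T):
    """Deterministic herding for a discrete distribution (Algorithm 2.1).

    At every step the bin with the largest weight is selected, and the
    weights are then shifted by (mu - one_hot(selected bin)).
    """
    mu = np.asarray(mu, dtype=float)
    w = np.ones_like(mu)              # deterministic initialization, as in Algorithm 2.1
    samples = np.empty(T, dtype=int)
    for t in range(T):
        i_star = int(np.argmax(w))    # largest component of w (ties: first index)
        samples[t] = i_star
        phi = np.zeros_like(mu)
        phi[i_star] = 1.0
        w += mu - phi                 # w^{(t+1)} = w^{(t)} + (mu - phi)
    return samples

# The empirical bin frequencies track mu at the fast O(1/T) herding rate.
mu = np.array([0.1, 0.2, 0.3, 0.4])
x = herd_discrete(mu, 1000)
print(np.bincount(x, minlength=mu.size) / x.size)
```

Because the update is entirely deterministic, rerunning the function always reproduces the same pseudo-sample sequence.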
The benefit of using herding here is that its rate of convergence is O(1/T), which is much faster than the standard Monte Carlo rate of O(1/√T). Unfortunately, this only applies to normalized distributions, or to problems where the objective is not to guarantee that the samples come from the right target but simply to ensure that some moments are matched. An interpretation of herding in terms of quadrature has been put forward recently by Huszár and Duvenaud [12]. In this thesis, we will show that it is possible to use herding to generate samples from more complex unnormalized probability distributions.

2.1 Herding and Frank-Wolfe Algorithms

2.1.1 Frank-Wolfe Algorithms

The Frank-Wolfe algorithm, also known as the reduced gradient method and the convex combination algorithm, is a class of iterative optimization algorithms. It was first proposed by Marguerite Frank and Philip Wolfe in 1956 [8] as an algorithm for quadratic programming, and was later shown to be an iterative method for nonlinear programming that need not be restricted to quadratic programming. Given a smooth (twice continuously differentiable) convex function J on a compact convex set M with gradient J′, to solve the optimization problem min_{g∈M} J(g), Frank-Wolfe algorithms only require (in addition to the computation of the gradient J′) the ability to optimize linear functions on M. The first class of such algorithms is often referred to as conditional gradient algorithms: given an initial guess g^{(1)} ∈ M, at iteration t we first find an optimum of min_{h∈M} ⟨J′(g^{(t)}), h⟩, and then take the next iterate on the segment between g^{(t)} and h, i.e., g^{(t+1)} = (1 − ρ^{(t)}) g^{(t)} + ρ^{(t)} h, where ρ^{(t)} ∈ [0,1]. For the choice of ρ^{(t)}, we can (a) simply take ρ^{(t)} = 1/(t+1), or (b) find the point in the segment with optimal value through line search. The iterations corresponding to these two choices are illustrated in Figure 2.1. As t increases, g^{(t)} approaches the optimal solution µ.

[Figure 2.1: two panels, each showing the iterates g^{(1)}, g^{(2)}, g^{(3)} approaching µ.]
Figure 2.1: Two iterations corresponding to two versions of the conditional gradient algorithm. Left: ρ^{(t)} = 1/(t+1); right: line search.

2.1.2 Herding as a Frank-Wolfe Algorithm

As described by Bach et al. [3], herding can be reformulated as a conditional gradient algorithm. Consider the optimization problem

    \min_{g \in \mathcal{M}} J(g) = \frac{1}{2} \| g - \mu \|^2.        (2.3)

The conditional gradient updates for this problem are as follows:

    \bar{g}^{(t+1)} \in \arg\min_{g \in \mathcal{M}} \langle g^{(t)} - \mu,\, g \rangle, \qquad
    g^{(t+1)} = (1 - \rho^{(t)})\, g^{(t)} + \rho^{(t)}\, \bar{g}^{(t+1)}.        (2.4)

With a step size of ρ^{(t)} = 1/(t+1) and the change of variable g^{(t)} = µ − w^{(t)}/t, these updates are equivalent to the herding steps shown in Equation (2.1). With this setting, we have ḡ^{(t)} = φ(x^{(t)}) and g^{(t)} = (1/t) ∑_{u=1}^{t} φ(x^{(u)}), which implies that each sample is uniformly weighted and the set of samples approximates µ.
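To spell out this equivalence (a short verification written here using the indexing of Equation (2.4); it is not quoted from Bach et al. [3]): substituting ρ^{(t)} = 1/(t+1) and g^{(t)} = µ − w^{(t)}/t into the second update and multiplying through by t + 1 gives

    \Big(1 - \tfrac{1}{t+1}\Big)\Big(\mu - \tfrac{w^{(t)}}{t}\Big) + \tfrac{1}{t+1}\,\phi(x^{(t+1)}) \;=\; \mu - \tfrac{w^{(t+1)}}{t+1}
    \;\Longrightarrow\;
    w^{(t+1)} = w^{(t)} + \mu - \phi(x^{(t+1)}),

which is the herding weight update. Likewise, since g^{(t)} − µ = −w^{(t)}/t, minimizing ⟨g^{(t)} − µ, g⟩ over the extreme points φ(x) of M is the same as maximizing ⟨w^{(t)}, φ(x)⟩, which recovers the herding argmax.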
An alternative choice of step size is to use line search to find the point in the segment with optimal value. This leads to the setting

    \rho^{(t)} = \frac{\langle g^{(t)} - \mu,\; g^{(t)} - \bar{g}^{(t+1)} \rangle}{\| g^{(t)} - \bar{g}^{(t+1)} \|^2} \in [0, 1].

With this setting, we have g^{(t)} = ∑_{u=1}^{t} ( ∏_{v=u}^{t−1} (1 − ρ^{(v−1)}) ) ρ^{(u−1)} φ(x^{(u)}), which means the weight for each sample is non-uniform and the weighted set of samples approximates µ.

2.2 Herding and Maximum/Minimum Entropy

In the maximum entropy estimation setting, given a moment vector µ, the goal of herding is to produce a pseudo-sample drawn from a distribution whose moments match µ. For an introduction to maximum entropy estimation, see the book by Bernardo and Smith [4, Section 4.5.4].

Bach et al. [3] apply herding to estimate the maximum entropy distribution in cases where it can be easily computed, namely for X = {−1, 1}^d (with d ≤ 10) and φ(x) = x ∈ [−1, 1]^d. They conjecture that for almost surely all random vectors µ ∈ [−1, 1]^d, regular herding converges to the maximum entropy distribution. Moreover, they conclude that:

• For a random mean vector µ (which avoids rational ratios between the means), regular herding (with no line search) empirically converges to the maximum entropy distribution.
• For irrational means with rational ratios between the means, herding does not converge to the maximum entropy distribution.
• For rational means, herding does not converge either, but the behavior is more erratic.
• Herding with line search never converges to the maximum entropy distribution. On the contrary, it happens to be close to a minimum entropy solution, where many states have probability zero.

In addition, the work also mentions that these conclusions cannot be further generalized to other cases. In general, the herding procedure, while leading to a consistent estimate of the mean vector µ, does not converge to the maximum entropy distribution.

Chapter 3
Gibbs Sampling

Gibbs sampling (popularized in the context of image processing by Geman and Geman 1984 [11]) is a simple MCMC method used to perform approximate inference in factored probabilistic models. In many cases it is not possible to simulate directly from the posterior distribution; in these cases, Gibbs sampling may prove useful.

The idea behind Gibbs sampling is that it alternates (either systematically or randomly) the sampling of each variable while conditioning on the values of all the other variables in the distribution. That is, given a joint sample x^{(t−1)} of all the variables, we generate a new sample x^{(t)} by sampling each component in turn, based on the most recent values of the other variables. Typically, we do not sample those visible variables whose values are already known.
Suppose we have an n-dimensional vector x and expressions for the full conditionals p(x_j | x_1, ..., x_{j−1}, x_{j+1}, ..., x_n). The pseudocode of Gibbs sampling is shown in Algorithm 3.1.

Algorithm 3.1 Systematic-scan Gibbs sampler
1: Input: p(x_j | x_{−j}) for j = 1, 2, ..., n, and T.
2: Initialize x^{(0)}.
3: for t = 1, 2, ..., T do
4:   Sample x_1^{(t)} ∼ p(x_1 | x_2^{(t−1)}, x_3^{(t−1)}, ..., x_n^{(t−1)})
5:   Sample x_2^{(t)} ∼ p(x_2 | x_1^{(t)}, x_3^{(t−1)}, ..., x_n^{(t−1)})
6:   ...
7:   Sample x_j^{(t)} ∼ p(x_j | x_1^{(t)}, ..., x_{j−1}^{(t)}, x_{j+1}^{(t−1)}, ..., x_n^{(t−1)})
8:   ...
9:   Sample x_n^{(t)} ∼ p(x_n | x_1^{(t)}, x_2^{(t)}, ..., x_{n−1}^{(t)})
10: end for
11: Output: x^{(1)}, ..., x^{(T)}

In general, x_j may only depend on some of the other variables. If we represent p(x) as a graphical model, we can infer the dependencies by looking at x_j's Markov blanket, which consists of its neighbors in the graph. Thus, to sample x_j, we only need to know the values of x_j's neighbors. In this sense, Gibbs sampling is a distributed algorithm. However, it is not a parallel algorithm, since the samples must be generated sequentially.

It is common to discard some number of samples at the beginning, until the Markov chain has burned in or reached its stationary distribution, which by construction is the target joint distribution over the variables. We then consider only every m-th sample when averaging values to compute an expectation, because successive samples are not independent of each other but come from a Markov chain with some amount of correlation.

In the early phase of the sampling process, the algorithm tends to move slowly around the sample space rather than moving around quickly as is desired. As a result, the samples generated in this stage tend to have a high amount of autocorrelation. Simulated annealing is often used to reduce the random walk behavior and to reduce autocorrelation between samples. Other techniques that may reduce autocorrelation include collapsed Gibbs sampling, blocked Gibbs sampling, and ordered overrelaxation.

The sequence obtained by Gibbs sampling can be used to approximate the joint distribution, to approximate the marginal distribution of one of the variables or some subset of the variables, or to compute an integral.
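To make Algorithm 3.1 concrete, here is a minimal NumPy sketch for binary variables (not code from the thesis; the interface, the toy two-variable joint and the value ε = 0.01 are illustrative choices; the same joint reappears as Table 5.1 in Chapter 5).

```python
import numpy as np

def gibbs(full_conditionals, x0, T, rng=None):
    """Systematic-scan Gibbs sampler (Algorithm 3.1) for binary variables.

    full_conditionals[j](x) must return p(x_j = 1 | x_{-j}); each variable is
    resampled in a fixed order using the most recent values of the others.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=int)
    chain = np.empty((T, x.size), dtype=int)
    for t in range(T):
        for j, cond in enumerate(full_conditionals):
            x[j] = int(rng.random() < cond(x))   # draw x_j from its full conditional
        chain[t] = x
    return chain

# Toy two-variable joint (the model of Table 5.1), here with eps = 0.01:
eps = 0.01
joint = np.array([[0.25 - eps, eps],       # row x2 = 0: columns x1 = 0, 1
                  [eps, 0.75 - eps]])      # row x2 = 1
conds = [lambda x: joint[x[1], 1] / joint[x[1]].sum(),     # p(x1 = 1 | x2)
         lambda x: joint[1, x[0]] / joint[:, x[0]].sum()]  # p(x2 = 1 | x1)
print(gibbs(conds, x0=[0, 0], T=5000)[:, 0].mean())        # estimate of p(x1 = 1); true value 3/4
```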
Chapter 4
Herded Gibbs

In this chapter, we introduce the herded Gibbs sampler, a variant of the popular Gibbs sampling algorithm that is completely deterministic. While Gibbs relies on drawing samples from the full conditionals in a random way, herded Gibbs generates samples by matching the full conditionals using the herding procedure. Given that Gibbs sampling is one of the workhorses of statistical inference, it is remarkable that it is possible to construct an entirely deterministic version of this algorithm.

4.1 Example

For ease of presentation, we introduce the herded Gibbs sampler with a simple but very popular example: image denoising with a Markov Random Field (MRF). Let us assume that we have a binary image corrupted by noise, and that we want to infer the original clean image. Let x_i ∈ {−1, +1} denote the unknown true value of pixel i, and y_i the observed, noise-corrupted value of this pixel.

We take advantage of the fact that neighboring pixels are likely to have the same label by defining an MRF with an Ising prior. That is, we specify a rectangular 2D lattice with the following pairwise clique potentials:

    \psi_{ij}(x_i, x_j) = \begin{pmatrix} e^{J_{ij}} & e^{-J_{ij}} \\ e^{-J_{ij}} & e^{J_{ij}} \end{pmatrix}        (4.1)

and joint distribution

    p(x \mid J) = \frac{1}{Z(J)} \prod_{i \sim j} \psi_{ij}(x_i, x_j)
                = \frac{1}{Z(J)} \exp\Big( \frac{1}{2} \sum_{i \sim j} J_{ij} x_i x_j \Big),        (4.2)

where i ∼ j is used to indicate that nodes i and j are connected. The known parameters J_ij establish the coupling strength between nodes i and j. Note that the matrix J is symmetric. If all the J_ij > 0, then neighboring pixels are likely to be in the same state.

The MRF model combines the Ising prior with a likelihood model as follows:

    p(x, y) = p(x)\, p(y \mid x)
            = \Big[ \frac{1}{Z} \prod_{i \sim j} \psi_{ij}(x_i, x_j) \Big] \cdot \Big[ \prod_i p(y_i \mid x_i) \Big].        (4.3)

The potentials ψ_ij encourage label smoothness. The likelihood terms p(y_i | x_i) are conditionally independent (e.g. Gaussians with known variance σ² and mean µ centered at each value of x_i, denoted µ_{x_i}). In more precise terms,

    p(x, y \mid J, \mu, \sigma) = \frac{1}{Z(J, \mu, \sigma)} \exp\Big( \frac{1}{2} \sum_{i \sim j} J_{ij} x_i x_j - \frac{1}{2\sigma^2} \sum_i (y_i - \mu_{x_i})^2 \Big).        (4.4)
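For reference, the full conditional that the samplers below need can be read off from Equation (4.4) by keeping only the factors that involve x_i. The expression below is a worked-out sketch rather than a formula quoted from the thesis, and it assumes that the sum over i ∼ j in (4.2) runs over ordered pairs, so that the factor 1/2 cancels against the double counting of edges:

    p(x_i = k \mid x_{\mathcal{N}(i)}, y_i)
    = \frac{\exp\!\Big( k \sum_{j \in \mathcal{N}(i)} J_{ij} x_j - \frac{(y_i - \mu_k)^2}{2\sigma^2} \Big)}
           {\sum_{k' \in \{-1,+1\}} \exp\!\Big( k' \sum_{j \in \mathcal{N}(i)} J_{ij} x_j - \frac{(y_i - \mu_{k'})^2}{2\sigma^2} \Big)},
    \qquad k \in \{-1, +1\},

so the conditional depends on the observation y_i and on the neighboring pixels x_{N(i)} only through the weighted sum of their values.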
4.2 Algorithm

As discussed in Chapter 3, Gibbs sampling is a popular method for recovering the marginals p(x_i | y). It alternates (either systematically or randomly) the sampling of each pixel x_i while conditioning on the observations y and the neighbors of x_i, denoted x_{N(i)}. That is, it generates each sample from its full-conditional distribution. The pseudocode of Gibbs sampling for this example is shown in Algorithm 4.1. We use the notation x_{N(2)−1}^{(t−1)} to indicate the neighbors of variable x_2 at the previous iteration, excluding the variable x_1. This is because we already have an updated value of x_1 at time t, and x_1 has been assumed to be a neighbor of x_2. The herded Gibbs sampler replaces the sampling from full conditionals with herding at the level of the full conditionals, as shown in Algorithm 4.2. That is, one alternates a process of matching the full-conditional distributions p(x_i | y_i, x_{N(i)}).

Algorithm 4.1 Systematic-scan Gibbs sampler
1: Input: An image y with M pixels, and T.
2: for t = 1, 2, ..., T do
3:   Sample x_1^{(t)} ∼ p(x_1 | y_1, x_{N(1)}^{(t−1)})
4:   Sample x_2^{(t)} ∼ p(x_2 | y_2, x_1^{(t)}, x_{N(2)−1}^{(t−1)})
5:   ...
6:   Sample x_M^{(t)} ∼ p(x_M | y_M, x_{N(M)}^{(t)})
7: end for
8: Output: x^{(1)}, ..., x^{(T)}

Algorithm 4.2 Systematic-scan herded Gibbs sampler
1: Input: An image y with M pixels, and T.
2: for t = 1, 2, ..., T do
3:   Apply herding to generate x_1^{(t)} given y_1 and x_{N(1)}^{(t−1)} (see Algorithm 4.3)
4:   Apply herding to generate x_2^{(t)} given y_2, x_1^{(t)} and x_{N(2)−1}^{(t−1)}
5:   ...
6:   Apply herding to generate x_M^{(t)} given y_M and x_{N(M)}^{(t)}
7: end for
8: Output: x^{(1)}, ..., x^{(T)}

Note that we apply the herding algorithm to each instantiation of the full-conditional distribution for x_i. For example, if we assume that the variable x_1 has four binary neighbors x_{N(1)}, we will maintain 2^4 = 16 weight vectors. Only the weight vector corresponding to the current instantiation of the neighbors is updated, as illustrated in Algorithm 4.3.

Hence, herded Gibbs effectively trades off memory for increased statistical efficiency. The memory complexity of herded Gibbs is exponential in the maximum node degree. However, the cost of one iteration is not significantly more expensive than that of Gibbs, and the convergence rate of herded Gibbs is faster than that of Gibbs, as shown by means of experiments in the following chapter. Moreover, by comparing various models for fitting the convergence curves, we conjecture that the rate of convergence of herded Gibbs is O(1/T). Proving this fact mathematically is an open problem.

Algorithm 4.3 Step 3 of Algorithm 4.2, with µ = p(x_1 | y_1, x_{N(1)}^{(t−1)}), assuming x_1 has 4 binary neighbors
1: Input: µ and a set of stored weight vectors {w_1^{(t−1)}, w_2^{(t−1)}, ..., w_16^{(t−1)}} for x_1.
2: if x_{N(1)}^{(t−1)} = (0,0,0,0) then
3:   Find the largest component of w_1^{(t−1)}: i* = argmax_{i ∈ {1,2,...,n}} w_{1,i}^{(t−1)}
4:   chosen = 1
5: else if x_{N(1)}^{(t−1)} = (0,0,0,1) then
6:   Find the largest component of w_2^{(t−1)}: i* = argmax_{i ∈ {1,2,...,n}} w_{2,i}^{(t−1)}
7:   chosen = 2
8: ...
9: else if x_{N(1)}^{(t−1)} = (1,1,1,0) then
10:   Find the largest component of w_15^{(t−1)}: i* = argmax_{i ∈ {1,2,...,n}} w_{15,i}^{(t−1)}
11:   chosen = 15
12: else
13:   Find the largest component of w_16^{(t−1)}: i* = argmax_{i ∈ {1,2,...,n}} w_{16,i}^{(t−1)}
14:   chosen = 16
15: end if
16: Set x_1^{(t)} = i*
17: Set φ to be a vector of zeros of size n = 2 (if x_1 is binary)
18: Set the i*-th entry of φ to one
19: Set w_j^{(t)} = w_j^{(t−1)} for all j ≠ chosen
20: Update: w_chosen^{(t)} = w_chosen^{(t−1)} + (µ − φ)
21: Output: {w_1^{(t)}, w_2^{(t)}, ..., w_16^{(t)}} and x_1^{(t)}
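The following is a minimal NumPy sketch of the herded Gibbs sampler for binary variables (again, not the thesis code; the dictionary-of-weights interface is an illustrative design choice). Relative to the Gibbs sketch at the end of Chapter 3, the only change is that the random draw is replaced by a herding step on the weight vector attached to the current neighbour configuration.

```python
import numpy as np

def herded_gibbs(full_conditionals, neighbours, x0, T):
    """Systematic-scan herded Gibbs (Algorithms 4.2/4.3) for binary variables.

    full_conditionals[j](x) returns p(x_j = 1 | x_{-j}); neighbours[j] lists the
    indices x_j's conditional depends on. One weight vector (of size 2) is kept
    per variable and per configuration of its neighbours, and only the weight
    vector matching the current configuration is updated.
    """
    x = np.array(x0, dtype=int)
    weights = {}                                     # (j, neighbour config) -> weight vector
    chain = np.empty((T, x.size), dtype=int)
    for t in range(T):
        for j, cond in enumerate(full_conditionals):
            config = tuple(x[k] for k in neighbours[j])
            w = weights.setdefault((j, config), np.ones(2))
            x[j] = int(np.argmax(w))                 # deterministic "draw"
            mu = np.array([1.0 - cond(x), cond(x)])  # full conditional of x_j (independent of x_j itself)
            phi = np.zeros(2)
            phi[x[j]] = 1.0
            weights[(j, config)] = w + mu - phi      # herding update of the active weight only
        chain[t] = x
    return chain

# Same two-variable model as in the Gibbs sketch (Table 5.1 with eps = 0.01):
eps = 0.01
joint = np.array([[0.25 - eps, eps], [eps, 0.75 - eps]])
conds = [lambda x: joint[x[1], 1] / joint[x[1]].sum(),
         lambda x: joint[1, x[0]] / joint[:, x[0]].sum()]
chain = herded_gibbs(conds, neighbours=[[1], [0]], x0=[0, 0], T=5000)
print(chain[:, 0].mean())   # deterministic estimate of p(x1 = 1); true value 3/4
```

As discussed in Section 5.1.2, the purely deterministic variant may show small persistent oscillations when the ratios of the conditionals are rational; the randomized step size introduced there removes this issue.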
Chapter 5
Experiments

5.1 Simple Examples

As shown by Ip and Wang [13], a given set of full-conditional distributions specifies a unique joint distribution if and only if the set is complete and compatible. Furthermore, if these conditions are met, then we can recover the joint distribution from the conditional distributions, which in turn allows us to recover the marginal distributions (though this last step is computationally expensive). In the following examples, we assess whether herded Gibbs is capable of estimating the correct conditionals and marginals.

5.1.1 Two-Node Example

First, we consider a model of two variables, x1 and x2, as shown in Figure 5.1; the joint distribution of these variables is shown in Table 5.1.

Figure 5.1: Two-variable model.

Table 5.1: Joint distribution of the two-variable model.

             X1 = 0      X1 = 1      P(X2)
  X2 = 0     1/4 − ε     ε           1/4
  X2 = 1     ε           3/4 − ε     3/4
  P(X1)      1/4         3/4         1

Figure 5.2 shows the marginal distribution P(X1 = 1) approximated by both Gibbs and herded Gibbs for different ε. As ε decreases, both approaches require more iterations to converge. In contrast, as ε approaches zero, Gibbs struggles to converge, whereas herded Gibbs continues to provide good deterministic convergence behavior. In this and subsequent experiments, we set the initial parameters w to a constant vector. We observed that the performance of herded Gibbs does not change when we initialize w at random.

[Figure 5.2: four panels (ε = 1e−01, 1e−02, 1e−03, 1e−04), each plotting the estimate of p(x1 = 1) against iterations for Gibbs and herded Gibbs.]
Figure 5.2: Approximate marginals obtained via Gibbs and herded Gibbs for an MRF of two variables, constructed so as to make the move from state (0,0) to (1,1) progressively more difficult as ε decreases. Table 5.1 provides the joint distribution for these variables. The error bars correspond to one standard deviation. The plots are best viewed in color.

5.1.2 Four-Node Example

Next, we study a model of four variables with a joint distribution as given in Equation (4.2) and illustrated in Figure 5.3. The left column of Figure 5.4 shows the 1-norm error of approximating the marginals and the joint distribution when the coupling parameters J_ij are set to 1. (For this choice of coupling parameters, the ratios of the conditional distributions are rational numbers.) While herded Gibbs approximates the marginal distributions well, it fails to approximate the joint distribution.

Figure 5.3: Four-variable model.

This finding is not new and has in fact been reported in a simple maximum-entropy setting [3]. There, the authors conjecture that almost sure convergence of the herding algorithm to the maximum entropy distribution can be established when µ is a random vector. In our setting, by drawing the full conditionals uniformly at random in the unit interval, we create irrational ratios among these full conditionals. This follows from the fact that the rationals have measure zero in the unit interval. When the ratios of full conditionals are irrational, deterministic cycles in the state (sample) sequence are broken (or made very large in digital computers) and, hence, herded Gibbs converges to the right marginals and joint distribution. The right-hand side of Figure 5.4 illustrates this: herded Gibbs converges to the correct joint distribution when the coupling parameters are set at random, J_ij = 1 + N(0, 1/4). These results add further evidence to the conjecture made by Bach et al. [3]. The choice of irrational µ's has also appeared in the work by Welling and Chen [18]. Clearly, however, we should avoid irrational µ's whose ratios are rational.

This discussion provides us with the necessary intuition for the design of an alternative, random-step-size herded Gibbs algorithm:

    x^{(t)} \in \arg\max_{x \in \mathcal{X}} \langle w^{(t-1)}, \phi(x) \rangle, \qquad
    w^{(t)} = w^{(t-1)} + (1 + \varepsilon) \left[ \mu - \phi(x^{(t)}) \right],        (5.1)

where ε ∼ 2U[0, 1/2] is a small uniform random number (the method is fairly robust with respect to this choice).
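In the notation of the herded Gibbs sketch at the end of Chapter 4, only the weight update changes; the snippet below is a hypothetical drop-in replacement, and it reproduces the scaling ε ∼ 2U[0, 1/2] literally as written in Equation (5.1), which amounts to a uniform draw on [0, 1] (an interpretation, since the thesis does not spell the sampling out further).

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_herding_update(w, mu, phi, rng=rng):
    """Randomized-step-size herding update, Equation (5.1)."""
    eps = 2.0 * rng.uniform(0.0, 0.5)   # eps ~ 2 U[0, 1/2]: a small uniform random number
    return w + (1.0 + eps) * (mu - phi)
```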
The randomization of the step size effectively breaks cycles and results in ergodicity of the marginals and joint, as illustrated in Figure 5.4.

[Figure 5.4: four panels; columns: rational ratios (left), irrational ratios (right); rows: L1 error for all marginals (top), L1 error for the joint (bottom); each plots error against iterations for Gibbs, herded Gibbs, and herded Gibbs with a random step size.]
Figure 5.4: The error in approximating the marginals (top) and joint (bottom) for an MRF of four variables. When the ratios of conditional distributions are rational, the deterministic herded Gibbs fails to converge to the joint. This is however not a problem when the ratios are irrational. In both cases, herded Gibbs with a randomized step size converges well, while providing a very significant improvement over standard Gibbs sampling.

This four-node experiment has shown that herded Gibbs can yield fast and accurate estimates of the marginals and joint distribution when its step size is randomized. The experiment also seems to indicate that the deterministic herded Gibbs can result in good estimates of the marginal distributions. The following three-node experiment shows that this is not always the case.

5.1.3 Three-Node Example

Consider a three-node model as illustrated in Figure 5.5, whose joint distribution is shown in Table 5.2. If the model is parametrized such that p(x1|x3) = p(x2|x3) and we apply herded Gibbs, with identical initial parameters w, to generate samples in the order x1, x2, x3, then x1 and x2 will always have the same value. In other words, we will never reach the states (x1 = 0, x2 = 1) or (x1 = 1, x2 = 0). In this case, the deterministic herded Gibbs fails to converge to the correct marginals. This is depicted in Figure 5.6.

Figure 5.5: Three-variable model.

Table 5.2: Joint distribution of the three-variable model. The model is parametrized such that x1 ⊥ x2 | x3 and p(x1|x3) = p(x2|x3).

  x1             0      0      0      0      1      1      1      1
  x2             0      0      1      1      0      0      1      1
  x3             0      1      0      1      0      1      0      1
  P(x1,x2,x3)  3/91  36/91   6/91  12/91   6/91  12/91  12/91   4/91

[Figure 5.6: two panels plotting the estimate of p(x1 = 1) against iterations for Gibbs, herded Gibbs, and herded Gibbs with a random step size.]
Figure 5.6: The marginal of the first node of a three-node MRF and its estimates. Herded Gibbs with a random step size converges to the correct value at an impressive rate. The left plot shows 5,000 iterations while the right plot shows 50,000 iterations.
Even if all states can be reached, the deterministic algorithm can still fail. In our three-node example, all states can be reached if the herding parameters for x1 and x2 are initialized to different values. However, this does not break state cycles, and the algorithm still fails to converge to the right marginals. Once again, to obtain fast convergence, we simply have to randomize the step size.

5.2 Markov Random Field (MRF) for Image Denoising

As discussed in Chapter 4, herded Gibbs performs deterministic sampling, which in itself can provide computational benefits over comparable stochastic techniques. Furthermore, this derandomized sampling algorithm is shown to provide performance gains beyond those of similar, widely adopted stochastic techniques. Here, herded Gibbs is compared to both Gibbs sampling and mean field inference. In particular, the herded Gibbs algorithm's performance is exemplified on image denoising of the "NIPS" image (as seen in Figure 5.8). Various noisy "NIPS" images were created through the addition of Gaussian noise, with varying σ, to the original "NIPS" image. For each noisy image, a set of observed variables is implicitly defined, one for each pixel; in Figure 5.7 the observed variables are depicted as y_i and y_j.

Figure 5.7: The MRF employed for the task of image denoising. The variables x_i, x_j are hidden variables representing the true pixel values, and the variables y_i, y_j are observed variables representing the noisy pixel values.

Image denoising requires the inference of a set of hidden variables, x_i and x_j as seen in Figure 5.7, that describe the true binary values of each pixel. Accurately reproducing the hidden variables will result in the construction of an image much like the original "NIPS" image of Figure 5.8a. Gibbs sampling proceeds by sequentially resampling each hidden variable given the values of its neighbors (in the MRF) at the previous iteration. Though herded Gibbs performs inference in much the same way, the samples drawn via herded Gibbs are computed deterministically. Such a similarity establishes Gibbs sampling as a natural choice for evaluating the performance of herded Gibbs. Analogously, mean field inference is well suited for evaluating the performance of herded Gibbs in that it performs inference by employing the mean values of neighboring nodes in the MRF. Furthermore, both mean field inference and Gibbs sampling are widely used and natural choices for the task of image denoising with MRFs.

Recall that herded Gibbs employs coupling parameters in the joint distribution. If the coupling parameters J_ij are all the same, say J_ij = J, we will have ∑_ij J_ij f(x_i, x_j) = J ∑_ij f(x_i, x_j). For the Ising model, ∑_ij f(x_i, x_j) just equals the sum of pixel values of a node's neighbors. Thus two different configurations of a node's neighbors can give the same value of J ∑_ij f(x_i, x_j), because of this symmetry. If we store the conditionals for configurations with the same sum together, we only need to store as many conditionals as there are different possible values of the sum. This inspires a shared version of herded Gibbs.

[Figure 5.8: (a) original image; (b) input/noisy image; (c) denoising results after iterations 1, 10 and 30 and the mean of the last 10 iterations, for herded Gibbs, herded Gibbs - shared, herded Gibbs - rand step size, herded Gibbs - shared & rand step size, Gibbs, and mean field.]
Figure 5.8: Image denoising results using different methods to perform approximate inference. We use an Ising prior with W_ij = 1 and a Gaussian noise model with σ = 4. We show the results after 1, 10 and 30 iterations across the image and also the posterior means computed by averaging over the last 10 iterations. We use in-place updates for each method, ε ∼ 2U[0, 1/2] for herded Gibbs with a randomized step size, and a damping factor of 1 for mean field inference.

The shared version of herded Gibbs presents additional opportunities for optimization. In particular, the difference in performance between in-place updates and parallel updates was evaluated. For in-place updates (Gauss-Seidel updates), we compute the value of each node on the go. For parallel updates (Jacobi updates), we compute all the nodes synchronously at the end of each iteration. In general, the in-place update versions of each approach perform better than the corresponding parallel update versions, but this conclusion might change if one has access to a large cluster of processors.

Figure 5.8c provides one particular example of image denoising where Gaussian noise was constructed with σ = 4. The performance of the inference techniques in this case is indicative of their average performance and demonstrates the impressive convergence of herded Gibbs. After ten iterations of the algorithms, the herded Gibbs techniques already demonstrate improvements over comparable techniques. After thirty iterations, the herded Gibbs techniques have effectively converged and are significantly beyond the comparable techniques. A long-term evaluation of the inference techniques for image denoising of "NIPS" is shown in Figure 5.9. The shared version of herded Gibbs converges faster than all other approaches, with in-place herded Gibbs following as a close second. Herded Gibbs' superior performance over comparable stochastic techniques makes the algorithm a promising new inference technique.

[Figure 5.9: reconstruction error against iterations for Gibbs; mean field (in-place and parallel updates, damping factors 0.5 and 1); herded Gibbs (in-place and parallel updates); and herded Gibbs - shared (in-place and parallel updates).]
Figure 5.9: Reconstruction error for the image denoising task. The results are averaged across 10 corrupted images with Gaussian noise N(0, 16). The error bars correspond to one standard deviation. Here we introduce a shared version of herded Gibbs, which stores conditionals that have the same potential values together. As shown, the shared version of herded Gibbs outperforms all the other approaches. In general, the in-place update versions of each approach perform better than the corresponding parallel update versions. Herded Gibbs with a randomized step size has almost the same performance as standard herded Gibbs under the same setting. For better display, we do not show them in the plots.
A comparison of the image denoising results over different σ's is shown in Table 5.3. Overall, herded Gibbs provides superior performance in that it converges much faster than comparable techniques, and it does so consistently over a range of σ's. As such, herded Gibbs provides a potential platform for improving upon applications employing Gibbs sampling or other similar sampling techniques.

Table 5.3: Errors of the image denoising example after 30 iterations (all measurements have been scaled by ×10^−3). We use an Ising prior with W_ij = 1 and Gaussian noise models with different σ's. For each σ, we generated 10 corrupted images by adding Gaussian noise. The final results shown here are averages and standard deviations (in parentheses) across the 10 corrupted images. I: in-place updates; P: parallel updates; D: damping factor; R: random step size.

  Method \ σ                       2            4            6            8
  Gibbs                          2.23 (0.19)  8.35 (0.58)  18.4 (1.76)  27.9 (2.14)
  Mean field (I, D=0.5)          2.04 (0.12)  10.4 (0.83)  21.6 (1.20)  30.8 (1.18)
  Mean field (I, D=1)            1.95 (0.15)  7.09 (0.47)  14.6 (1.35)  22.3 (1.86)
  Mean field (P, D=0.5)          2.15 (0.13)  12.3 (0.83)  24.4 (1.02)  33.5 (1.00)
  Mean field (P, D=1)            2.30 (0.14)  13.6 (0.64)  26.9 (0.80)  36.2 (0.77)
  Herded Gibbs (I)               1.98 (0.15)  6.51 (0.57)  13.8 (1.48)  21.1 (2.32)
  Herded Gibbs (I, R)            2.01 (0.12)  6.57 (0.60)  13.8 (1.49)  20.9 (2.23)
  Herded Gibbs (P)               2.18 (0.13)  10.6 (0.66)  22.4 (1.21)  31.2 (1.35)
  Herded Gibbs (P, R)            2.23 (0.10)  10.6 (0.63)  22.4 (1.26)  31.1 (1.37)
  Herded Gibbs - shared (I)      1.94 (0.08)  6.41 (0.66)  12.9 (1.30)  19.0 (1.40)
  Herded Gibbs - shared (I, R)   2.10 (0.12)  6.41 (0.60)  12.8 (1.37)  19.2 (1.53)
  Herded Gibbs - shared (P)      2.18 (0.10)  10.8 (0.62)  23.0 (1.10)  31.9 (1.05)
  Herded Gibbs - shared (P, R)   2.29 (0.11)  10.9 (0.72)  23.0 (1.15)  32.1 (1.24)

5.3 Conditional Random Field (CRF) for Named Entity Recognition

Named Entity Recognition (NER) involves the identification of entities, such as people and locations, within a text sample. Information extraction, such as automatically identifying and extracting calendar dates or phone numbers from email messages, employs NER as a vital component of its computation. Historically, NER was performed by techniques like Viterbi that, though accurate and efficient under the Markov assumption, become intractable if the model attempts to account for long-term dependencies such as non-local information. As shown by Finkel et al. [7], Gibbs sampling and simulated annealing can incorporate long-distance structure while preserving tractable inference in factored probabilistic models. Like Gibbs, herded Gibbs is capable of accommodating long-term dependencies, and the performance improvements it can obtain over Gibbs (as demonstrated in Section 5.2) make it an interesting candidate for NER. We show here that herded Gibbs can be easily adapted to NER and requires fewer iterations to find the maximum. Figure 5.10 provides a representative NER example of the performance of Gibbs, herded Gibbs, herded Gibbs with a randomized step size, and Viterbi (all methods produced the same annotation for this short example).
Though the accuracy of these algorithms is fairly similar, herded Gibbs (with or without a randomized step size) again converges faster than its annealed Gibbs counterpart and can account for long-term dependencies in the text. The results of Table 5.4 indicate the performance gain associated with herded Gibbs: herded Gibbs achieves accuracy equal to that of Viterbi and a faster convergence rate than annealed Gibbs.

[Figure 5.10 text sample:] "Pumpkin" (Tim Roth) and "Honey Bunny" (Amanda Plummer) are having breakfast in a diner. They decide to rob it after realizing they could make money off the customers as well as the business, as they did during their previous heist. Moments after they initiate the hold-up, the scene breaks off and the title credits roll. As Jules Winnfield (Samuel L. Jackson) drives, Vincent Vega (John Travolta) talks about his experiences in Europe, from where he has just returned: the hash bars in Amsterdam, the French McDonald's and its "Royale with Cheese".
Figure 5.10: A representative application of NER to a random Wikipedia sample [1]. Entities are identified as follows: Person, Location, Organization.

Table 5.4: Comparison of Gibbs, herded Gibbs, herded Gibbs with a randomized step size, and Viterbi for the NER task. We used the pre-trained 3-class CRF model in the Stanford NER package [7]. For the test set, we used the corpus for the NIST 1999 IE-ER Evaluation. Performance is measured in per-entity F1, where F1 = 2 · precision · recall / (precision + recall). For all the methods except Viterbi, we show F1 scores after 100, 400 and 800 iterations. For Gibbs and herded Gibbs with a random step size, the results shown are the averages and standard deviations (in parentheses) over 5 random runs. We used a linear annealing schedule for Gibbs and in-place updates for both versions of herded Gibbs. The average computational time each approach took to do inference for the entire test set is also listed (in brackets). R: random step size.

  Method \ Iterations    100                       400                      800
  Annealed Gibbs         84.36 (0.16) [55.73s]     84.51 (0.10) [83.49s]    84.61 (0.05) [115.92s]
  Herded Gibbs           84.70 [59.08s]            84.75 [90.48s]           84.81 [132.00s]
  Herded Gibbs (R)       84.73 (0.03) [61.13s]     84.75 (0.00) [93.21s]    84.81 (0.00) [137.39s]
  Viterbi                84.81 [46.74s]

Chapter 6
Conclusion

In this thesis, we introduce herded Gibbs, a deterministic variant of the popular Gibbs sampling algorithm. While Gibbs relies on drawing samples from the full conditionals in a random way, herded Gibbs generates the samples by matching the full conditionals. Several experiments are employed to indicate the effectiveness of herded Gibbs over multiple domains and in tricky situations that can inhibit the performance of Gibbs sampling. The experiments also showed that the deterministic algorithm can fail to converge to the joint distribution when the ratios of the conditionals are rational, but that it converges when the same ratios are irrational. This motivated the introduction of a herded Gibbs algorithm with a random step size, which retained improved rates of convergence over Gibbs while delivering the correct marginal and joint distributions. Proving consistency and rates of convergence of the deterministic herded Gibbs method for irrational ratios, and of the random-step-size herded Gibbs, is an open problem.
As demonstrated Chapter 4, herded Gibbs trades storage cost for faster conver- gence by storing a parameter for each conditional. Without resolving the storage issue, it will be hard for herded Gibbs to be applicable to more general models, such as densely connected models. However, by the image denoising experiment we show that a shared version of herded Gibbs requires less storage while perform- ing even better than comparable inference techniques. Further exploration of the shared version of herded Gibbs and its applications to other models is a promising avenue for further research. 28 Bibliography [1] Pulp fiction - wikipedia, the free encyclopedia @ONLINE, June 2012. URL http://en.wikipedia.org/wiki/Pulp Fiction. \u00E2\u0086\u0092 pages viii, 26 [2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50:5\u00E2\u0080\u009343, 2003. ISSN 0885-6125. URL http://dx.doi.org/10.1023/A:1020281327116. 10.1023/A:1020281327116. \u00E2\u0086\u0092 pages 1 [3] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, 2012. \u00E2\u0086\u0092 pages 4, 7, 17 [4] J. M. Bernardo and A. Smith. Bayesian Theory, pages 207\u00E2\u0080\u0093209. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester, 1994. \u00E2\u0086\u0092 pages 7 [5] Y. Chen, M. Welling, and A. Smola. Supersamples from kernel-herding. In Uncertainty in Artificial Intelligence, pages 109\u00E2\u0080\u0093116, 2010. \u00E2\u0086\u0092 pages 4 [6] M. Evans and T. Swartz. Methods for approximating integrals in statistics with special emphasis on bayesian integration problems. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.5152. \u00E2\u0086\u0092 pages 1 [7] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL \u00E2\u0080\u009905, pages 363\u00E2\u0080\u0093370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. doi:10.3115/1219840.1219885. URL http://dx.doi.org/10.3115/1219840.1219885. \u00E2\u0086\u0092 pages vi, 26, 27 [8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95\u00E2\u0080\u0093110, 1956. ISSN 1931-9193. 29 doi:10.1002/nav.3800030109. URL http://dx.doi.org/10.1002/nav.3800030109. \u00E2\u0086\u0092 pages 6 [9] A. Gelfand, Y. Chen, L. van der Maaten, and M. Welling. On herding and the perceptron cycling theorem. In Advances in Neural Information Processing Systems, pages 694\u00E2\u0080\u0093702, 2010. \u00E2\u0086\u0092 pages 4 [10] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410): 398\u00E2\u0080\u0093409, 1990. doi:10.1080/01621459.1990.10476213. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1990.10476213. \u00E2\u0086\u0092 pages 1 [11] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721 \u00E2\u0080\u0093741, nov. 1984. ISSN 0162-8828. doi:10.1109/TPAMI.1984.4767596. \u00E2\u0086\u0092 pages 1, 9 [12] F. Husza\u00CC\u0081r and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. 
arXiv preprint arXiv:1204.1664, 2012. → pages 5

[13] E. H. Ip and Y. J. Wang. Canonical representation of conditionally specified multivariate discrete distributions. Journal of Multivariate Analysis, 100(6):1282–1290, 2009. → pages 15

[14] A. F. M. Smith. Bayesian computational methods. Philosophical Transactions of the Royal Society of London, Series A: Physical and Engineering Sciences, 337(1647):369–386, 1991. doi:10.1098/rsta.1991.0130. URL http://rsta.royalsocietypublishing.org/content/337/1647/369.abstract. → pages 1

[15] B. Walsh. Markov chain Monte Carlo and Gibbs sampling. 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.131.4064. → pages 1

[16] M. Welling. Herding dynamical weights to learn. In International Conference on Machine Learning, pages 1121–1128, 2009. → pages 4

[17] M. Welling. Herding dynamic weights for partially observed random field models. In Uncertainty in Artificial Intelligence, pages 599–606, 2009. → pages 4

[18] M. Welling and Y. Chen. Statistical inference using weak chaos and infinite memory. Journal of Physics: Conference Series, 233(1), 2010. → pages 17