Herded Gibbs Sampling by Jing Fang B. Eng., Zhejiang University, 2010 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) September 2012 c Jing Fang, 2012 Abstract The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this thesis, we introduce a herding variant of this algorithm that is entirely deterministic. We demonstrate, with simple examples, that herded Gibbs exhibits better convergence behavior for approximating the marginal distributions than Gibbs sampling. In particular, image denoising exemplifies the effectiveness of herded Gibbs as an inference technique for Markov Random Fields (MRFs). Also, we adopt herded Gibbs as the inference engine for Conditional Random Fields (CRFs) in Named Entity Recognition (NER) and show that it is competitive with the state of the art. The conclusion is that herded Gibbs, for graphical models with nodes of low degree, is very close to Gibbs sampling in terms of the complexity of the code and computation, but that it converges much faster. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 3 Herding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Herding and Frank-Wolfe Algorithms . . . . . . . . . . . . . . . 6 2.1.1 Frank-Wolfe Algorithms . . . . . . . . . . . . . . . . . . 6 2.1.2 Herding as a Frank-Wolfe Algorithm . . . . . . . . . . . 7 Herding and Maximum/Minimum Entropy . . . . . . . . . . . . . 7 3 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4 Herded Gibbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.1 11 2 2.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii 4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.1 Simple Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.1.1 Two-Node Example . . . . . . . . . . . . . . . . . . . . 15 5.1.2 Four-Node Example . . . . . . . . . . . . . . . . . . . . 17 5.1.3 Three-Node Example . . . . . . . . . . . . . . . . . . . . 19 5.2 Markov Random Field (MRF) for Image Denoising . . . . . . . . 20 5.3 Conditional Random Field (CRF) for Named Entity Recognition . 26 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5 6 iv List of Tables Table 5.1 Joint distribution of the two-variable model. . . . . . . . . . . Table 5.2 Joint distribution of the three-variable model. The model is parametrized such that x1 ⊥ x2 |x3 and p(x1 |x3 ) = p(x2 |x3 ). . . Table 5.3 16 19 Errors of image denoising example after 30 iterations (all measurements have been scaled by ×10−3 ). We use an Ising prior with Wi j = 1 and Gaussian noise models with different σ ’s. For each σ , we generated 10 corrupted images by adding Gaussian noise. The final results shown here are averages and standard deviations (in the parentheses) across the 10 corrupted images. I: in-place updates; P: parallel updates; D: damping factor; R: random step size. . . . . . . . . . . . . . . . . . . . . . . . . v 25 Table 5.4 Comparisons of Gibbs, herded Gibbs, herded Gibbs with a randomized step size, and Viterbi for the NER task. We used the pre-trained 3-class CRF model in the Stanford NER package [7]. For the test set, we used the corpus for the NIST 1999 IE-ER Evaluation. Performance is measured in per-entity F1 precision·recall F1 = 2 · precision+recall . For all the methods except Viterbi, we show F1 scores after 100, 400 and 800 iterations. For Gibbs and herded Gibbs with a random step size, the results shown are the averages and standard deviations (in the parentheses) over 5 random runs. We used a linear annealing schedule for Gibbs and in-place updates for both versions of herded Gibbs. The average computational time each approach took to do inference for the entire test set is also listed (in the brackets). R: random step size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi 27 List of Figures Figure 2.1 Two iterations corresponding to two versions of the conditional gradient algorithm. Left: ρ (t) = 1/(t + 1); right: line search. . 6 Figure 5.1 Two-variable model. . . . . . . . . . . . . . . . . . . . . . . 16 Figure 5.2 Approximate marginals obtained via Gibbs and herded Gibbs for an MRF of two variables, constructed so as to make the move from state (0, 0) to (1, 1) progressively more difficult as ε decreases. Table 5.1 provides the joint distribution for these variables. The error bars correspond to one standard deviation. The plots are best viewed in color. . . . . . . . . . . . . . . . 16 Figure 5.3 Four-variable model. . . . . . . . . . . . . . . . . . . . . . . 17 Figure 5.4 The error in approximating the marginals (top) and joint (bottom) for an MRF of four variables. When the ratios of conditional distributions are rational, the deterministic herded Gibbs fails to converge to the joint. This is however not a problem when the ratios are irrational. In both cases, herded Gibbs with a randomized step size converges well, while providing a very significant improvement over standard Gibbs sampling. . . . . 18 Figure 5.5 Three-variable model. . . . . . . . . . . . . . . . . . . . . . 19 Figure 5.6 The marginal of the first node of a three-node MRF and its estimates. Herded Gibbs with a random step size converges to the correct value at an impressive rate. The left plot shows 5,000 iterations while the right plots shows 50,000 iterations. . . . . vii 20 Figure 5.7 The MRF employed for the task of image denoising. The variables xi , x j are hidden variables representing the true pixel values, and the variables yi , y j are observed variables representing the noisy pixel values. . . . . . . . . . . . . . . . . . . . . . Figure 5.8 21 Image denoising results using different methods to perform approximate inference. We use an Ising prior with Wi j = 1 and a Gaussian noise model with σ = 4. We show the results after 1, 10 and 30 iterations across the image and also the posterior means computed by averaging over the last 10 iterations. We use in-place updates for each method, ε ∼ 2 U[0,1/2] for herded Gibbs with a randomized step size, and a damping factor of 1 for mean field inference. . . . . . . . . . . . . . . . . . . . . Figure 5.9 22 Reconstruction error for the image denoising task. The results are averaged across 10 corrupted images with Gaussian noise N (0, 16). The error bars correspond to one standard deviation. Here we introduce a shared version of herded Gibbs, which stores conditionals that have the same potential values together. As shown, the shared version of herded Gibbs outperforms all the other approaches. In general, the in-place update versions of each approach perform better than the corresponding parallel update versions. Herded Gibbs with a randomized step size has almost the same performance with standard Herded Gibbs under the same setting. For better display, we do not show them in the plots. . . . . . . . . . . . . . . . . . 24 Figure 5.10 A representative application of NER to a random Wikipedia sample [1]. Entities are identified as follows: Person, Location, Organization. . . . . . . . . . . . . . . . . . . . . . . . viii 26 Glossary MCMC Markov chain Monte Carlo MRF Markov Random Field CRF Conditional Random Field NER Named Entity Recognition ix Acknowledgments I would like to express my utmost gratitude to many people, without whom this thesis would not have been possible. First and foremost, my supervisor Nando de Freitas for his invaluable guidance and support throughout my graduate study. He devoted numerous hours to my work and gave me the freedom to explore my own ideas. Dr. Luke Bornn and Dr. David Poole for their collaboration and valuable feedback on this work. Mareija Eskelinen for being a wonderful collaborator and for the time and efforts she devoted to proofread my thesis. Misha Denil, Masrour Zoghi, Ziyu Wang, Matt Hoffman, David Buchman, Shakir Mohamed, Xing Chen, Shafiq Joty, Kevin Swersky and other fellow UBC CS students for discussing ideas, answering questions, and sharing code with me. I highly value their openness and willingness to help. Finally, I would like to thank my parents for their tireless love and support. It means very much to me that they are proud to have me as their daughter. x Chapter 1 Introduction Computing the posterior distribution is the core component of Bayesian analysis. This computation involves the integration of high-dimensional functions, which can be computationally very difficult. Several approaches that avoid direct integration have been proposed over the decades [6, 14]. Markov chain Monte Carlo (MCMC) methods are one of the most popular classes of such algorithms, which attempt to simulate direct draws from some complex distribution of interest [2, 15]. They generate samples based on constructing a Markov chain whose equilibrium distribution is exactly the target distribution. As the transition probabilities between sample values are only a function of the most recent sample values, it only needs the previous sample values in order to generate the next. The states of the chain after a large number of steps (the so-called burn-in period) can then be used as a sample of the target distribution. The quality of the sample improves as a function of the number of steps. One particular MCMC method, the Gibbs sampler [11], was realized to be widely applicable to a broad class of Bayesian problems in the early 1990s [10]. It was first introduced in the context of image processing as a special case of the Metropolis-Hastings algorithm, and was then extended to be a general framework for sampling from a large set of variables. Gibbs sampling is commonly used for statistical inference. It is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and is easy to sample from. 1 Over the decades, numerous variations and extensions of the basic Gibbs sampler have arisen. In this thesis, we introduce a novel variant of the Gibbs sampler that is entirely deterministic - herded Gibbs. It builds upon a recently published deterministic herding technique and is shown empirically to be an effective and valuable algorithm for statistical inference. 1.1 Thesis Contributions In this thesis, we present herded Gibbs, a deterministic variant of the well-known and widely-used Gibbs sampler. We provide an empirical study of the proposed algorithm and show that it is a valuable algorithm that performs better than the standard Gibbs sampler in several different tasks of great interest, including image denoising and Named Entity Recognition. Our algorithm is a novel and key step in the design of herding methods for sampling from large discrete distributions. By an image denoising example, we show that a shared version of herded Gibbs requires less storage while performing even better than comparable inference techniques. This to some extent resolves the storage issue of herded Gibbs and makes it a potentially more useful algorithm that is capable of being applied to more general models. We also show by experiments that, for herded Gibbs to converge to both the marginals and joint, we require the ratios of the conditional distributions to be irrational. If this is satisfied, herded Gibbs should converge to the correct marginals and joint. Otherwise, it may fail to converge. Furthermore, by simply randomizing the step size of herded Gibbs, we will be able to deliver the correct marginal and joint distributions, while retaining improved rates of convergence over the traditional Gibbs sampling. This thesis presents a very effective, useful and straightforward algorithm as well as important methodological and empirical contributions to research on herding. We believe this contribution will be of a wide interest to researchers in the machine learning community. 2 1.2 Thesis Organization This thesis is organized as follows: in Chapter 2 and 3, we give an overview of the herding algorithm and the Gibbs algorithm respectively, which are the bases of our algorithm. Then we introduce in Chapter 4 the proposed algorithm, herded Gibbs, which is a variant of the Gibbs sampler and is entirely deterministic. In Chapter 5, we use several experiments to show the effectiveness of herded Gibbs as an inference technique and compare it with several existing popular algorithms for inference. Finally we discuss the limitations and extensions of herded Gibbs and conclude in Chapter 6. 3 Chapter 2 Herding Herding [9, 16, 17] is a deterministic procedure for generating pseudo-samples x ∈ X ⊆ Rn , such that the empirical moments µ of the data are matched. A moment µi is the expectation of some feature function φi over data and can be used as a quantitative measure of the data set. Some examples include the mean and the variance. The herding procedure, at iteration t, is as follows: x(t) w (t) ∈ arg max w(t−1) , φ (x(t−1) ) x∈X (t−1) = w + µ − φ (x(t) ), (2.1) where φ : X → H is a feature map (statistics) from X to a Hilbert Space H with inner product ·, · , w ∈ H is the vector of parameters, and µ ∈ H is the moment vector that we want to match (E[φ (x)], the expected value of φ over data). The pseudo-samples, x(t) for t = 1, 2, . . . , T , are generated so as to minimize the objective function 1 T ET = µ − ∑ φ (x(t) ) T t=1 where · = 2 , (2.2) ·, · is the naturally defined norm based upon the inner product of the space H [3, 5]. As presented, herding is a moment matching algorithm. However, it can also be used to produce samples of normalized probability distributions. For example, let µ denote a discrete, normalized probability distribution 4 with µi = p(x = i), i = 1, . . . , n. That is, µi is the probability of the i-th bin. A natural feature in this case is the vector φ (x) that has all entries equal to zero, except for the entry at the position indicated by x. For instance, if x = 2, we have T φ (x) = (0, 1, 0, 0, . . . , 0)T . Hence, µ = 1/T ∑t=1 φ (x(t) ) is an empirical estimate of the distribution. The corresponding pseudocode is shown in Algorithm 2.1. Note that this is a simple instance of the algorithm of equation (2.1). Algorithm 2.1 Herding to sample from a discrete distribution Input: µ, T Initialize w deterministically, e.g. a vector of ones of size n. for t = 1, 2, . . . , T do Find the largest component of w: i = arg maxi∈{1,2,...,n} wi Set x(t) = i Set φ to be a vector of zeros of size n Set the i -entry of φ to one Update: w(t+1) = w(t) + (µ − φ ) end for Output: x(1) , . . . , x(T ) The benefit of using herding here is that its rate of convergence is O(1/T ), √ which is much faster than the standard Monte Carlo rate of O(1/ T ). Unfortunately, this only applies to normalized distributions or to problems where the objective is not to guarantee that the samples come from the right target, but simply to ensure that some moments are matched. An interpretation of herding in terms of quadrature has been put forward recently by Husz´ar and Duvenaud [12]. In this thesis, we will show that it is possible to use herding to generate samples from more complex unnormalized probability distributions. 5 2.1 2.1.1 Herding and Frank-Wolfe Algorithms Frank-Wolfe Algorithms The Frank-Wolfe algorithm, also known as the reduced gradient method and the convex combination algorithm, is a class of iterative optimization algorithms. It was first proposed by Marguerite Frank and Philip Wolfe in 1956 [8] as an algorithm for quadratic programming and was later proven to be an iterative method for nonlinear programming that need not be restricted to quadratic programming. Given a smooth (twice continuously differentiable) convex function J on a compact convex set M with gradient J , to solve the optimization problem ming∈M J(g), Frank-Wolfe algorithms only require (in addition to the computation of the gradient J ) to be able to optimize linear functions on M . The first class of such algorithms is often referred to as conditional gradient algorithms: given an initial guess g(1) ∈ M , at iteration t, we first find an optimum for minh∈M J (g(t) ), h , and then take the next iterate on the segment between g(t) and h, i.e., g(t+1) = (1 − ρ (t) )g(t) + ρ (t) h, where ρ (t) ∈ [0, 1]. For the choice of ρ (t) , we can (a) simply take ρ (t) = 1/(t + 1) or (b) find the point in the segment with optimal value through line search. The iterations corresponding to these two choices are illustrated in Figure 2.1. As t increases, g(t) approaches the optimal solution µ. g(2) g(2) g(3) g(3) µ µ g(1) g(1) Figure 2.1: Two iterations corresponding to two versions of the conditional gradient algorithm. Left: ρ (t) = 1/(t + 1); right: line search. 6 2.1.2 Herding as a Frank-Wolfe Algorithm As described by Bach et al. [3], herding can be reformulated as a conditional gradient algorithm. Considering the optimization problem: 1 min J(g) = ||g − µ||2 , 2 g∈M (2.3) the procedures for a conditional gradient algorithm to solve this problem is as follows: g¯ (t+1) ∈ arg min g(t) − µ, g g∈M g (t+1) = (1 − ρ (t) )g(t) + ρ (t) g¯ (t+1) , (2.4) With a step size of ρ (t) = 1/(t + 1) and a change of variable g(t) = µ − w(t) /t, these updates are equivalent to the herding steps as shown in Equation 2.1. With this setting, we have g¯ (t) = φ (x(t) ) and g(t) = 1t ∑tu=1 φ (x(u) ), which implies that each sample is uniformly weighted and the set of samples approximate µ. An alternative choice of step size is to use line search to find the point in the g(t) −µ,g(t) −¯g(t+1) ∈ ||g(t) −¯g(t+1) ||2 t t−1 (v−1) (u−1) (u) )ρ φ (x ), ∑u=1 ∏v=u (1 − ρ segment with optimal value. This leads to a setting of ρ (t) = [0, 1]. With this setting, we have g(t) = which means the weight for each sample is non-uniform and the weighted set of samples approximate µ. 2.2 Herding and Maximum/Minimum Entropy In the maximum entropy estimation setting, given a moment vector µ, the goal of herding is to produce a pseudo-sample drawn from a distribution whose moments match µ. For an introduction to maximum entropy estimation, see the book by Bernardo and Smith [4, Section 4.5.4]. Bach et al. [3] apply herding to estimate the maximum entropy distribution in cases where it can be easily computed, namely for X = {−1, 1}d (with d ≤ 10) and φ (x) = x ∈ [−1, 1]d . They conjecture that for almost surely all random vectors µ ∈ [−1, 1]d , regular herding converges to the maximum entropy distribution. Moreover, they conclude that: 7 • For a random mean vector µ (which would avoid rational ratios between the means), regular herding (with no line search) empirically converges to the maximum entropy distribution. • For irrational means with rational ratios between the means, herding does not converge to the maximum entropy distribution. • For rational means, herding does not converge either, but the behavior is more erratic. • Herding with line search never converges to the maximum entropy distribution. On the contrary, it happens to be close to a minimum entropy solution, where many states have probability zero. In addition, the work also mentions that the conclusions can not be further generalized to other cases. In general, the herding procedure, while leading to a consistent estimation of the mean vector µ, does not converge to the maximum entropy distribution. 8 Chapter 3 Gibbs Sampling Gibbs sampling (popularized in the context of image processing by Geman and Geman 1984 [11]) is a simple MCMC method used to perform approximate inference in factored probabilistic models. In many cases it is not possible to do direct simulation from the posterior distributions. In these cases, Gibbs sampling may prove useful. The idea behind Gibbs sampling is that it alternates (either systematically or randomly) the sampling of each variable while conditioning on the values of all the other variables in the distribution. That is, given a joint sample x(t−1) of all the variables, we generate a new sample x(t) by sampling each component in turn, based on the most recent values of the other variables. Typically, we do not sample those visible variables whose values are already known. Suppose we have an n-dimensional vector x and the expressions for the full conditionals p(x j |x1 , ..., x j−1 , x j+1 , ..., xn ). The pseudocode of Gibbs sampling is shown in Algorithm 3.1. In general, x j may only depend on some of the other variables. If we represent p(x) as a graphical model, we can infer the dependencies by looking at x j ’s Markov blanket, which are its neighbors in the graph. Thus to sample x j , we only need to know the values of x j ’s neighbors. In this sense, Gibbs sampling is a distributed algorithm. However, it is not a parallel algorithm, since the samples must be generated sequentially. It is common to discard some number of samples at the beginning until the Markov chain has burned in or reached its stationary distribution, which by con9 Algorithm 3.1 Systematic-scan Gibbs sampler 1: Input: p(x j |x− j ) for j = 1, 2, ..., n and T . 2: Initialize x(0) . 3: for t = 1, 2, . . . , T do 4: 5: 6: 7: 8: 9: 10: 11: (t) (t−1) Sample x1 ∼ p(x1 |x2 (t) x2 (t−1) , x3 (t−1) , ..., xn ) (t) (t−1) (t−1) p(x2 |x1 , x3 , ..., xn ) Sample ∼ .. . (t) (t) (t) (t−1) (t−1) Sample x j ∼ p(x j |x1 , ..., x j−1 , x j+1 ..., xn ) .. . (t) (t) (t) (t) Sample xn ∼ p(xn |x1 , x2 , ..., xn−1 ) end for Output: x(1) , . . . , x(T ) struction is the target joint distribution over the variables. We then consider only every mth sample when averaging values to compute an expectation, because successive samples are not independent of each other but from a Markov chain with some amount of correlation. In the early phase of the sampling process, the algorithm tends to move slowly around the sample space rather than moving around quickly as is desired. As a result, the samples generated in this stage tend to have a high amount of autocorrelation. Simulated annealing is often used to reduce the random walk behavior and to reduce autocorrelations between samples. Other techniques that may reduce autocorrelation include collapsed Gibbs sampling, blocked Gibbs sampling, and ordered overrelaxation. The sequence obtained by Gibbs sampling can be used to approximate the joint distribution, to approximate the marginal distribution of one of the variables or some subset of the variables, or to compute an integral. 10 Chapter 4 Herded Gibbs In this chapter, we will introduce the herded Gibbs sampler, a variant of the popular Gibbs sampling algorithm that is completely deterministic. While Gibbs relies on drawing samples from the full-conditionals in a random way, herded Gibbs generates samples by matching the full-conditionals using the herding procedure. Given that Gibbs sampling is one of the work-horses of statistical inference, it is remarkable that it is possible to construct an entirely deterministic version of this algorithm. 4.1 Example For ease of presentation, we introduce the herded Gibbs sampler with a simple, but very popular example: Image denoising with a Markov Random Field (MRF). Let us assume that we have a binary image corrupted by noise, and that we want to infer the original clean image. Let xi ∈ {−1, +1} denote the unknown true value of pixel i, and yi the observed, noise-corrupted value of this pixel. We take advantage of the fact that neighboring pixels are likely to have the same label by defining an MRF with an Ising prior. That is, we specify a rectangular 2D lattice with the following pair-wise clique potentials: ψi j (xi , x j ) = eJi j e−Ji j e−Ji j eJi j 11 (4.1) and joint distribution: 1 1 ψi j (xi , x j ) = exp ∏ Z(J) i∼ j Z(J) p(x|J) = 1 ∑ Ji j xi x j , 2 i∼ j (4.2) where i ∼ j is used to indicate that nodes i and j are connected. The known parameters Ji j establish the coupling strength between nodes i and j. Note that the matrix J is symmetric. If all the Ji j > 0, then neighboring pixels are likely to be in the same state. The MRF model combines the Ising prior with a likelihood model as follows: p(x, y) = p(x)p(y|x) = 1 ψi j (xi , x j ) . Z∏ i∼ j ∏ p(yi |xi ) (4.3) i The potentials ψi j encourage label smoothness. The likelihood terms p(yi |xi ) are conditionally independent (e.g. Gaussians with known variance σ 2 and mean µ centered at each value of xi , denoted µxi ). In more precise terms, p(x, y|J, µ, σ ) = 4.2 1 exp Z(J, µ, σ ) 1 1 Ji j xi x j − 2 ∑(yi − µxi )2 . ∑ 2 i∼ j 2σ i (4.4) Algorithm As discussed in Chapter 3, Gibbs sampling is a popular method for recovering the marginals p(xi |y). It alternates (either systematically or randomly) the sampling of each pixel xi while conditioning on the observations y and the neighbors of xi , denoted xN (i) . That is, it generates each sample from its full-conditional distribution. The pseudocode of Gibbs sampling for this example is shown in Algorithm 4.1. We (t−1) use the notation xN (2)−1 to indicate the neighbors of variable x2 at the previous iteration excluding the variable x1 . This is because we already have an updated value of x1 at time t, and x1 has been assumed to be a neighbor of x2 . The herded Gibbs sampler replaces the sampling from full-conditionals with herding at the level of the full-conditionals, as shown in Algorithm 4.2. That is, one alternates a process of matching the full-conditional distributions, p(xi |yi , xN (i) ). 12 Algorithm 4.1 Systematic-scan Gibbs sampler 1: Input: An image y with M pixels, and T . 2: for t = 1, 2, . . . , T do 3: 4: 5: 6: 7: 8: (t) (t−1) (t) (t) Sample x1 ∼ p(x1 |y1 , xN (1) ) (t−1) Sample x2 ∼ p(x2 |y2 , x1 , xN (2)−1 ) .. . (t) (t) Sample xM ∼ p(xM |yM , xN (M) ) end for Output: x(1) , . . . , x(T ) Algorithm 4.2 Systematic-scan herded Gibbs sampler 1: Input: An image y with M pixels, and T . 2: for t = 1, 2, . . . , T do 3: 4: 5: 6: 7: 8: (t) (t−1) Apply herding to generate x1 given y1 and xN (1) (see Algorithm 4.3) (t) (t) (t−1) Apply herding to generate x2 given y2 , x1 and xN (2)−1 .. . (t) (t) Apply herding to generate xM given yM and xN (M) end for Output: x(1) , . . . , x(T ) Note that we apply the herding algorithm to each instantiation of the fullconditional distribution for xi . For example, if we assume that the variable x1 has four binary neighbors xN (1) , we will maintain 24 = 16 weight vectors. Only the weight vector corresponding to the current instantiation of the neighbors is updated, as illustrated in Algorithm 4.3. Hence, herded Gibbs effectively trades off memory for increased statistical efficiency. The memory complexity of herded Gibbs is exponential in the maximum node degree. However, the cost for one iteration is not significantly more expensive than that of Gibbs and the convergence rate of herded Gibbs is faster than that of Gibbs, as shown by means of experiments in the following section. Moreover, by comparing various models for fitting the convergence curves, we conjecture that the rate of convergence of herded Gibbs is O(1/T ). Proving this fact mathematically is an open problem. 13 (t−1) Algorithm 4.3 Step 3 of Algorithm 4.2, with µ = p(x1 |y1 , xN (1) ) assuming x1 has 4 binary neighbors (t−1) 1: Input: µ and a set of stored weight vectors {w1 2: if (t−1) xN (1) (t−1) , . . . , w16 = (0, 0, 0, 0) then (t−1) : i = arg maxi∈{1,2,...,n} w1,i (t−1) : i = arg maxi∈{1,2,...,n} w2,i 3: Find the largest component of w1 4: chosen = 1 (t−1) 5: else if xN (1) = (0, 0, 0, 1) then 6: 7: (t−1) , w2 Find the largest component of w2 chosen = 2 .. . (t−1) (t−1) 8: (t−1) 9: else if xN (1) = (1, 1, 1, 0) then (t−1) (t−1) Find the largest component of w15 : i = arg maxi∈{1,2,...,n} w15,i chosen = 15 else (t−1) (t−1) Find the largest component of w16 : i = arg maxi∈{1,2,...,n} w16,i chosen = 16 end if (t) Set x1 = i Set φ to be a vector of zeros of size n = 2 (if x1 is binary) Set the i -entry of φ to one (t−1) (t) for all j = chosen 19: Set w j = w j 10: 11: 12: 13: 14: 15: 16: 17: 18: (t) (t−1) 20: Update: wchosen = wchosen + (µ − φ ) (t) (t) (t) (t) 21: Output: {w1 , w2 , . . . , w16 } and x1 14 } for x1 . Chapter 5 Experiments 5.1 Simple Examples As shown by Ip and Wang [13], a given set of full-conditional distributions specifies a unique joint distribution if and only if the set is complete and compatible. Furthermore, if the conditions are met, then we can recover the joint distribution from the conditional distributions which allows us to recover the marginal distributions (though this last step is computationally expensive). In the following examples, we assess whether herded Gibbs is capable of estimating the correct conditionals and marginals. 5.1.1 Two-Node Example First, we consider a model of two variables, x1 and x2 , as shown in Figure 5.1; the joint distribution of these variables is shown in Table 5.1. Figure 5.2 shows the marginal distribution P(X1 = 1) approximated by both Gibbs and herded Gibbs for different ε. As ε decreases, both approaches require more iterations to converge. Alternately, as ε approaches zero, Gibbs struggles to converge, whereas, herded Gibbs continues to provide good deterministic convergence behavior. In this and subsequent experiments, we set the initial parameters w to a constant vector. We observed that the performance of herded Gibbs does not change when we initialize w at random. 15 x1 x2 Figure 5.1: Two-variable model. X2 = 0 X2 = 1 P(X1 ) X1 = 0 1/4 − ε ε 1/4 X1 = 1 ε 3/4 − ε 3/4 P(X2 ) 1/4 3/4 1 Table 5.1: Joint distribution of the two-variable model. ε = 1e−01 0.756 p(x1=1) ε = 1e−02 Gibbs Herded Gibbs 0.754 0.76 0.75 0.75 0.748 0.74 0.746 0.744 0 Gibbs Herded Gibbs 0.77 0.752 p(x1=1) 0.73 2 4 6 8 Iterations ε = 1e−03 10 4 x 10 0 4 6 8 Iterations p(x1=1) ε = 1e−04 1 Gibbs Herded Gibbs 0.8 2 10 4 x 10 p(x1=1) Gibbs Herded Gibbs 0.9 0.78 0.76 0.8 0.74 0.7 0.72 0.7 0 2 4 6 Iterations 8 0.6 0 10 4 x 10 2 4 6 Iterations 8 10 4 x 10 Figure 5.2: Approximate marginals obtained via Gibbs and herded Gibbs for an MRF of two variables, constructed so as to make the move from state (0, 0) to (1, 1) progressively more difficult as ε decreases. Table 5.1 provides the joint distribution for these variables. The error bars correspond to one standard deviation. The plots are best viewed in color. 16 5.1.2 Four-Node Example Next, we study a model of four variables with a joint distribution as shown in Equation 4.2 and illustrated in Figure 5.3. The left column of Figure 5.4 shows the 1-norm error of approximating the marginals and the joint distribution when the coupling parameters Ji j are set to 1. (For this choice of coupling parameters, the ratios of the conditional distributions are rational numbers.) While herded Gibbs approximates the marginal distributions well, it fails to approximate the joint distribution. x1 x2 x3 x4 Figure 5.3: Four-variable model. This finding is not new and has in fact been reported in a simple maximumentropy setting [3]. There, the authors conjecture that almost sure convergence of the herding algorithm to the maximum entropy distribution can be established when µ is a random vector. In our setting, by drawing the full conditionals uniformly at random in the unit interval, we are creating irrational ratios among these full conditionals. This follows from the fact that the rationals have measure zero in the unit interval. When the ratios of full conditionals are irrational, deterministic cycles in the state (sample) sequence are broken (or be made very large in digital computers) and, hence, herded Gibbs converges to the right marginals and joint distribution. The right hand side of Figure 5.4 illustrates this. Herded Gibbs converges to the correct joint distribution when the coupling parameters are set at random: Ji j = 1 + N (0, 1/4). These results add further evidence to the conjecture made by Bach et al. [3]. The choice of irrational µ’s has also appeared in the work by Welling and Chen [18]. Clearly, however, we should avoid irrational µ’s whose ratios are rational. This discussion provides us with the necessary intuition for the design of an 17 Irrational ratios Rational ratios L1 error for all marginals 0.4 0.3 0.4 Gibbs Herded Gibbs Herded Gibbs − random step size 0.2 0.1 0.1 0.5 1 1.5 2 Iterations 0.3 0 0 1 1.5 2 Iterations x 10 4 x 10 0.4 Gibbs Herded Gibbs Herded Gibbs − random step size 0.2 0.1 0.1 0.5 1 1.5 2 Iterations Gibbs Herded Gibbs Herded Gibbs − random step size 0.3 0.2 0 0 0.5 4 0.4 L1 error for the joint 0.3 0.2 0 0 Gibbs Herded Gibbs Herded Gibbs − random step size 0 0 0.5 4 x 10 1 1.5 2 Iterations 4 x 10 Figure 5.4: The error in approximating the marginals (top) and joint (bottom) for an MRF of four variables. When the ratios of conditional distributions are rational, the deterministic herded Gibbs fails to converge to the joint. This is however not a problem when the ratios are irrational. In both cases, herded Gibbs with a randomized step size converges well, while providing a very significant improvement over standard Gibbs sampling. alternative random step size herded Gibbs algorithm: x(t) ∈ arg max w(t−1) , φ (x(t−1) ) x∈X (t) w = w(t−1) + (1 + ε) µ − φ (x(t) ) , (5.1) where ε ∼ 2 U[0,1/2] is a small uniform random number (the method is fairly robust with respect to this choice). The randomization of the step size effectively breaks cycles and results in ergodicity of the marginals and joint as illustrated in 18 Figure 5.4. This four-node experiment has shown that herded Gibbs can yield fast and accurate estimates of the marginals and joint distribution when its step size is randomized. The experiment also seems to indicate that the deterministic herded Gibbs can result in good estimates of the marginal distributions. The following three-node experiment shows that this is not always the case. 5.1.3 Three-Node Example Consider a three-node model as illustrated in Figure 5.5, whose joint distribution is shown in Table 5.2. If the model is parametrized such that p(x1 |x3 ) = p(x2 |x3 ) and we apply herded Gibbs, with identical initial parameters w, to generate samples in the order of x1 , x2 , x3 , then x1 and x2 will always have the same value. In other words, we will never reach the states (x1 = 0, x2 = 1) or (x1 = 1, x2 = 0). In this case, the deterministic herded Gibbs fails to converge to the correct marginals. This is depicted in Figure 5.6. x1 x3 x2 Figure 5.5: Three-variable model. Table 5.2: Joint distribution of the three-variable model. The model is parametrized such that x1 ⊥ x2 |x3 and p(x1 |x3 ) = p(x2 |x3 ). x1 x2 x3 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1 P(x1 , x2 , x3 ) 3 91 36 91 6 91 12 91 6 91 12 91 12 91 4 91 19 0.67 0.65 p(x1=1) 0.66 Gibbs Herded Gibbs Herded Gibbs − random step size 0.65 p(x1=1) 0.645 Gibbs Herded Gibbs Herded Gibbs − random step size 0.64 0.64 0.635 0.63 0.63 0.625 0.62 0.62 0.61 0.615 0.6 0 1000 2000 3000 4000 5000 0.61 0 1 Iterations 2 3 4 Iterations 5 Figure 5.6: The marginal of the first node of a three-node MRF and its estimates. Herded Gibbs with a random step size converges to the correct value at an impressive rate. The left plot shows 5,000 iterations while the right plots shows 50,000 iterations. Even if all states can be reached, the deterministic algorithm can still fail. In our three-node example, all states can be reached if the herding parameters for x1 and x2 are initialized to different values. However, this does not break state cycles and the algorithm still fails to converge to the right marginals. Once again, to obtain fast convergence, we simply have to randomize the step size. 5.2 Markov Random Field (MRF) for Image Denoising As discussed in Chapter 4, herded Gibbs performs deterministic sampling which in itself can provide computational benefits over comparable stochastic techniques. Furthermore, this derandomized sampling algorithm is further shown to provide performance gains beyond that of similar widely-adapted stochastic techniques. Here, herded Gibbs is compared to both Gibbs sampling and mean field inference. In particular, the herded Gibbs algorithm’s performance is exemplified for image denoising of the “NIPS” image (as seen in Figure 5.8). Various noisy “NIPS” images were created through the addition of Gaussian noise, with varying σ , to the original “NIPS” image. For each noisy image a set of observed variables is 20 4 x 10 xj xi yj yi Figure 5.7: The MRF employed for the task of image denoising. The variables xi , x j are hidden variables representing the true pixel values, and the variables yi , y j are observed variables representing the noisy pixel values. implicitly defined, one for each pixel; in Figure 5.7 the observed variables are depicted as yi and y j . Image denoising requires the inference of a set of hidden variables, xi and x j as seen in Figure 5.7, that describe the true binary values of each pixel. Accurately reproducing the hidden variables will result in the construction of an image much like the original “NIPS” image of Figure 5.8a. Gibbs sampling proceeds by sequentially resampling each hidden variable given the values of its neighbors (in the MRF) at the previous iteration. Though herded Gibbs performs inference in much the same way, the samples drawn via herded Gibbs are computed deterministically. Such a similarity establishes Gibbs sampling as a natural choice for evaluating the performance of herded Gibbs. Analogously, mean field inference is well-suited for evaluating the performance of herded Gibbs in that it performs inference by employing the mean values of neighboring nodes in the MRF. Furthermore, both mean field inference and Gibbs sampling are widely used and natural choices for the task of image denoising with MRFs. Recall that herded Gibbs employs coupling parameters in the joint distribution. If the coupling parameters Ji j are all the same, say Ji j = J, we will have ∑i j Ji j f (xi , x j ) = J ∑i j f (xi , x j ). For the Ising model, ∑i j f (xi , x j ) just equals the sum of pixel values of a node’s neighbors. Thus two different configurations of 21 (a) Original image Iteration 1 (b) Input/Noisy image Iteration 10 Iteration 30 Mean of last 10 iterations Herded Gibbs Herded Gibbs − shared Herded Gibbs − rand step size Herded Gibbs − shared & rand step size Gibbs Mean field (c) Denoising results Figure 5.8: Image denoising results using different methods to perform approximate inference. We use an Ising prior with Wi j = 1 and a Gaussian noise model with σ = 4. We show the results after 1, 10 and 30 iterations across the image and also the posterior means computed by averaging over the last 10 iterations. We use in-place updates for each method, ε ∼ 2 U[0,1/2] for herded Gibbs with a randomized step size, and a damping factor of 1 for mean field inference. 22 a node’s neighbors can give the same value of J ∑i j f (xi , x j ) because of the symmetrical characteristic of neighbors. If we store the conditionals for configurations with the same sum together, we only need to store as many conditionals as different possible values that the sum on the right hand side could take. This gives us inspiration to develop a shared version of herded Gibbs. The shared version of herded Gibbs presents additional opportunities for optimization. In particular, the difference in performance between in-place updates and parallel updates was evaluated. For in-place updates (Gauss-Seidel updates), we compute the value of each node on the go. For parallel updates (Jacobi updates), we compute all the nodes synchronously at the end of each iteration. In general, the in-place update versions of each approach perform better than the corresponding parallel update versions, but this conclusion might change if one has access to a large cluster or processors. Figure 5.8c provides one particular example of image denoising where Gaussian noise was constructed with σ = 4. The performance of inference techniques in this case is indicative of their average performance and demonstrates the impressive convergence of herded Gibbs. After ten iterations of the algorithms, the herded Gibbs techniques already demonstrate improvements over comparable techniques. After thirty iterations, the herded Gibbs techniques have effectively converged and are significantly beyond the comparable techniques. A long-term evaluation of the inference techniques for image denoising of “NIPS” is shown in Figure 5.9. The shared version of herded Gibbs converges faster than all other approaches with in-place herded Gibbs following as a close second. Herded Gibbs’ superior performance to comparable stochastic techniques makes the algorithm a promising new inference technique. 23 4 6 x 10 Gibbs Mean field (in−place updates, damping factor=0.5) Mean field (in−place updates, damping factor=1) Mean field (parallel updates, damping factor=0.5) Mean field (parallel updates, damping factor=1) Herded Gibbs (in−place updates) Herded Gibbs (parallel updates) Herded Gibbs − shared (in−place updates) Herded Gibbs − shared (parallel updates) Reconstruction error 5 4 3 2 1 0 0 5 10 15 20 25 30 35 40 45 Iterations Figure 5.9: Reconstruction error for the image denoising task. The results are averaged across 10 corrupted images with Gaussian noise N (0, 16). The error bars correspond to one standard deviation. Here we introduce a shared version of herded Gibbs, which stores conditionals that have the same potential values together. As shown, the shared version of herded Gibbs outperforms all the other approaches. In general, the in-place update versions of each approach perform better than the corresponding parallel update versions. Herded Gibbs with a randomized step size has almost the same performance with standard Herded Gibbs under the same setting. For better display, we do not show them in the plots. 24 50 A comparison of the image denoising results over different σ ’s is shown in Table 5.3. Overall, herded Gibbs is demonstrated to provide consistently superior performance in that it converges much faster than comparable techniques and does so consistently over a range of σ ’s. As such, herded Gibbs provides a potential platform for improving upon applications employing Gibbs sampling or other similar sampling techniques. Table 5.3: Errors of image denoising example after 30 iterations (all measurements have been scaled by ×10−3 ). We use an Ising prior with Wi j = 1 and Gaussian noise models with different σ ’s. For each σ , we generated 10 corrupted images by adding Gaussian noise. The final results shown here are averages and standard deviations (in the parentheses) across the 10 corrupted images. I: in-place updates; P: parallel updates; D: damping factor; R: random step size. PP PP σ PP PP Method P Gibbs Mean field (I, D=0.5) Mean field (I, D=1) Mean field (P, D=0.5) Mean field (P, D=1) Herded Gibbs (I) Herded Gibbs (I, R) Herded Gibbs (P) Herded Gibbs (P, R) Herded Gibbs - shared (I) Herded Gibbs - shared (I, R) Herded Gibbs - shared (P) Herded Gibbs - shared (P, R) 2 4 6 8 2.23(0.19) 2.04(0.12) 1.95(0.15) 2.15(0.13) 2.30(0.14) 1.98(0.15) 2.01(0.12) 2.18(0.13) 2.23(0.10) 1.94(0.08) 2.10(0.12) 2.18(0.10) 2.29(0.11) 8.35(0.58) 10.4(0.83) 7.09(0.47) 12.3(0.83) 13.6(0.64) 6.51(0.57) 6.57(0.60) 10.6(0.66) 10.6(0.63) 6.41(0.66) 6.41(0.60) 10.8(0.62) 10.9(0.72) 18.4(1.76) 21.6(1.20) 14.6(1.35) 24.4(1.02) 26.9(0.80) 13.8(1.48) 13.8(1.49) 22.4(1.21) 22.4(1.26) 12.9(1.30) 12.8(1.37) 23.0(1.10) 23.0(1.15) 27.9(2.14) 30.8(1.18) 22.3(1.86) 33.5(1.00) 36.2(0.77) 21.1(2.32) 20.9(2.23) 31.2(1.35) 31.1(1.37) 19.0(1.40) 19.2(1.53) 31.9(1.05) 32.1(1.24) 25 5.3 Conditional Random Field (CRF) for Named Entity Recognition Named Entity Recognition (NER) involves the identification of entities, such as people and locations, within a text sample. Information extraction, such as automatically identifying and extracting calendar dates or phone numbers from email messages, employs NER as a vital component of its computation. Historically, NER was performed by techniques like Viterbi that, though accurate and efficient under the Markov assumption, become intractable if the model attempts to account for long-term dependencies such as non-local information. As shown by Finkel et al. [7], Gibbs sampling and simulated annealing can incorporate long distance structure while preserving tractable inference in factored probabilistic models. Like Gibbs, herded Gibbs is capable of accommodating long term dependencies and the performance improvements it can obtain over Gibbs (as demonstrated in Section 5.2) make it an interesting candidate for NER. We show here that herded Gibbs can be easily adapted to NER and requires fewer iterations to find the maximum. Figure 5.10 provides a representative NER example of the performance of Gibbs, herded Gibbs, herded Gibbs with a randomized step size, and Viterbi (all methods produced the same annotation for this short example). Though, the accuracy of these algorithms is fairly similar, herded Gibbs (with or without a randomized step size) again converges faster than its annealed Gibbs counterpart and can account ”Pumpkin” (Tim Roth) and ”Honey Bunny” (Amanda Plummer) are having breakfast in a diner. They decide to rob it after realizing they could make money off the customers as well as the business, as they did during their previous heist. Moments after they initiate the hold-up, the scene breaks off and the title credits roll. As Jules Winnfield (Samuel L. Jackson) drives, Vincent Vega (John Travolta) talks about his experiences in Europe, from where he has just returned: the hash bars in Amsterdam, the French McDonald’s and its ”Royale with Cheese”. Figure 5.10: A representative application of NER to a random Wikipedia sample [1]. Entities are identified as follows: Person, Location, Organization. 26 for long-term dependencies in the text. The results of Table 5.4 indicate the performance gain associated with herded Gibbs; herded Gibbs achieves accuracy equal to that of Viterbi and a faster convergence rate than annealed Gibbs. Table 5.4: Comparisons of Gibbs, herded Gibbs, herded Gibbs with a randomized step size, and Viterbi for the NER task. We used the pre-trained 3-class CRF model in the Stanford NER package [7]. For the test set, we used the corpus for the NIST 1999 IE-ER Evaluation. Performance is precision·recall measured in per-entity F1 F1 = 2 · precision+recall . For all the methods except Viterbi, we show F1 scores after 100, 400 and 800 iterations. For Gibbs and herded Gibbs with a random step size, the results shown are the averages and standard deviations (in the parentheses) over 5 random runs. We used a linear annealing schedule for Gibbs and in-place updates for both versions of herded Gibbs. The average computational time each approach took to do inference for the entire test set is also listed (in the brackets). R: random step size. ❳❳ ❳❳❳ Iterations ❳❳❳ 100 ❳❳ Method ❳❳ Annealed Gibbs Herded Gibbs Herded Gibbs (R) Viterbi 84.36(0.16) [55.73s] 84.70 [59.08s] 84.73(0.03) [61.13s] 27 400 800 84.51(0.10) [83.49s] 84.75 [90.48s] 84.75(0.00) [93.21s] 84.61(0.05) [115.92s] 84.81 [132.00s] 84.81(0.00) [137.39s] 84.81[46.74s] Chapter 6 Conclusion In this thesis, we introduce herded Gibbs, a deterministic variant of the popular Gibbs sampling algorithm. While Gibbs relies on drawing samples from the fullconditionals in a random way, herded Gibbs generates the samples by matching the full-conditionals. Several experiments are employed to indicate the effectiveness of herded Gibbs over multiple domains and in tricky situations that can inhibit the performance of Gibbs sampling. The experiments also showed that the deterministic algorithm can fail to converge to the joint distribution when the ratios of the conditionals are rational, but that it converges when the same ratios are irrational. This motivated the introduction of a herded Gibbs algorithm with random step size, which retained improved rates of convergence over Gibbs, while delivering the correct marginal and joint distributions. Proving consistency and rates of convergence of the deterministic herded Gibbs method for irrational ratios and the random step size herded Gibbs is an open problem. As demonstrated Chapter 4, herded Gibbs trades storage cost for faster convergence by storing a parameter for each conditional. Without resolving the storage issue, it will be hard for herded Gibbs to be applicable to more general models, such as densely connected models. However, by the image denoising experiment we show that a shared version of herded Gibbs requires less storage while performing even better than comparable inference techniques. Further exploration of the shared version of herded Gibbs and its applications to other models is a promising avenue for further research. 28 Bibliography [1] Pulp fiction - wikipedia, the free encyclopedia @ONLINE, June 2012. URL http://en.wikipedia.org/wiki/Pulp Fiction. → pages viii, 26 [2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50:5–43, 2003. ISSN 0885-6125. URL http://dx.doi.org/10.1023/A:1020281327116. 10.1023/A:1020281327116. → pages 1 [3] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, 2012. → pages 4, 7, 17 [4] J. M. Bernardo and A. Smith. Bayesian Theory, pages 207–209. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester, 1994. → pages 7 [5] Y. Chen, M. Welling, and A. Smola. Supersamples from kernel-herding. In Uncertainty in Artificial Intelligence, pages 109–116, 2010. → pages 4 [6] M. Evans and T. Swartz. Methods for approximating integrals in statistics with special emphasis on bayesian integration problems. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.5152. → pages 1 [7] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. doi:10.3115/1219840.1219885. URL http://dx.doi.org/10.3115/1219840.1219885. → pages vi, 26, 27 [8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956. ISSN 1931-9193. 29 doi:10.1002/nav.3800030109. URL http://dx.doi.org/10.1002/nav.3800030109. → pages 6 [9] A. Gelfand, Y. Chen, L. van der Maaten, and M. Welling. On herding and the perceptron cycling theorem. In Advances in Neural Information Processing Systems, pages 694–702, 2010. → pages 4 [10] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410): 398–409, 1990. doi:10.1080/01621459.1990.10476213. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1990.10476213. → pages 1 [11] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721 –741, nov. 1984. ISSN 0162-8828. doi:10.1109/TPAMI.1984.4767596. → pages 1, 9 [12] F. Husz´ar and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. Arxiv preprint arXiv:1204.1664, 2012. → pages 5 [13] E. H. Ip and Y. J. Wang. Canonical representation of conditionally specified multivariate discrete distributions. Journal of Multivariate Analysis, 100(6): 1282–1290, 2009. → pages 15 [14] A. F. M. Smith. Bayesian computational methods. Philosophical Transactions of the Royal Society of London. Series A: Physical and Engineering Sciences, 337(1647):369–386, 1991. doi:10.1098/rsta.1991.0130. URL http://rsta.royalsocietypublishing.org/content/337/1647/369.abstract. → pages 1 [15] B. Walsh. Markov chain monte carlo and gibbs sampling. 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.131.4064. → pages 1 [16] M. Welling. Herding dynamical weights to learn. In International Conference on Machine Learning, pages 1121–1128, 2009. → pages 4 [17] M. Welling. Herding dynamic weights for partially observed random field models. In Uncertainty in Artificial Intelligence, pages 599–606, 2009. → pages 4 30 [18] M. Welling and Y. Chen. Statistical inference using weak chaos and infinite memory. Journal of Physics: Conference Series, 233(1), 2010. → pages 17 31
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Herded Gibbs sampling
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Herded Gibbs sampling Fang, Jing 2012
pdf
Page Metadata
Item Metadata
Title | Herded Gibbs sampling |
Creator |
Fang, Jing |
Publisher | University of British Columbia |
Date Issued | 2012 |
Description | The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this thesis, we introduce a herding variant of this algorithm that is entirely deterministic. We demonstrate, with simple examples, that herded Gibbs exhibits better convergence behavior for approximating the marginal distributions than Gibbs sampling. In particular, image denoising exemplifies the effectiveness of herded Gibbs as an inference technique for Markov Random Fields (MRFs). Also, we adopt herded Gibbs as the inference engine for Conditional Random Fields (CRFs) in Named Entity Recognition (NER) and show that it is competitive with the state of the art. The conclusion is that herded Gibbs, for graphical models with nodes of low degree, is very close to Gibbs sampling in terms of the complexity of the code and computation, but that it converges much faster. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-09-06 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 3.0 Unported |
IsShownAt | 10.14288/1.0052124 |
URI | http://hdl.handle.net/2429/43188 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2012-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/3.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2012_fall_fang_jing.pdf [ 1.21MB ]
- Metadata
- JSON: 24-1.0052124.json
- JSON-LD: 24-1.0052124-ld.json
- RDF/XML (Pretty): 24-1.0052124-rdf.xml
- RDF/JSON: 24-1.0052124-rdf.json
- Turtle: 24-1.0052124-turtle.txt
- N-Triples: 24-1.0052124-rdf-ntriples.txt
- Original Record: 24-1.0052124-source.json
- Full Text
- 24-1.0052124-fulltext.txt
- Citation
- 24-1.0052124.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0052124/manifest