Monte Carlo Integration in Discrete Undirected Probabilistic Models by Firas Hamze M.Sc., Stanford University, 2001 B.Eng.(Hon.), McGill University, 1999 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Computer Science) The University Of British Columbia (Vancouver) April 18, 2008 c Firas Hamze 2008 ii Abstract This thesis contains the author’s work in and contributions to the field of Monte Carlo sampling for undirected graphical models, a class of statistical model commonly used in machine learning, computer vision, and spatial statistics; the aim is to be able to use the methodology and resultant samples to estimate integrals of functions of the variables in the model. Over the course of the study, three different but related methods were proposed and have appeared as research papers. The thesis consists of an introductory chapter discussing the models considered, the problems involved, and a general outline of Monte Carlo methods. The three subsequent chapters contain versions of the published work. The second chapter, which has appeared in (Hamze and de Freitas 2004), is a presentation of new MCMC algorithms for computing the posterior distributions and expectations of the unknown variables in undirected graphical models with regular structure. For demonstration purposes, we focus on Markov Random Fields (MRFs). By partitioning the MRFs into non-overlapping trees, it is possible to compute the posterior distribution of a particular tree exactly by conditioning on the remaining tree. These exact solutions allow us to construct efficient blocked and Rao-Blackwellised MCMC algorithms. We show empirically that tree sampling is considerably more efficient than other partitioned sampling schemes and the naive Gibbs sampler, even in cases where loopy belief propagation fails to converge. We prove that tree sampling exhibits lower variance than the naive Gibbs sampler and other naive partitioning schemes using the theoretical measure of maximal correlation. We also construct new information theory tools for comparing different MCMC schemes and show that, under these, tree sampling is more efficient. Although the work discussed in Chapter 2 exhibited promise on the class of graphs to which it was suited, there are many cases where limiting the topology is quite a handicap. The work in Chapter 3 was an exploration in an alternative methodology for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm, published in (Hamze and de Freitas 2005), fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree Abstract iii of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs. The final contribution of this thesis, presented in Chapter 4 and also in (Hamze and de Freitas 2007), emerged from some empirical observations that were made while trying to optimize the sequence of edges to add to a graph so as to guide the population of samples to the high-probability regions of the model. Most important among these observations was that while several heuristic approaches, discussed in Chapter 1, certainly yielded improvements over edge sequences consisting of random choices, strategies based on forcing the particles to take large, biased random walks in the state-space resulted in a more efficient exploration, particularly at low temperatures. This motivated a new Monte Carlo approach to treating complex discrete distributions. The algorithm is motivated by the N-Fold Way, which is an ingenious event-driven MCMC sampler that avoids rejection moves at any specific state. The N-Fold Way can however get “trapped” in cycles. We surmount this problem by modifying the sampling process to result in biased state-space paths of randomly chosen length. This alteration does introduce bias, but the bias is subsequently corrected with a carefully engineered importance sampler. iv Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Undirected Graphical Models . . . . . . . . . . . . . . . . 1.2 Inference and Normalization . . . . . . . . . . . . . . . . . 1.2.1 Brief Summary of Exact Inference . . . . . . . . . 1.2.2 Complexity Properties of Some Potts/Ising models 1.2.3 Variational Methods . . . . . . . . . . . . . . . . . 1.2.4 Monte Carlo Methods . . . . . . . . . . . . . . . . 1.3 Contributions of This Thesis . . . . . . . . . . . . . . . . 1.3.1 Tree Sampling . . . . . . . . . . . . . . . . . . . . 1.3.2 Hot Coupling . . . . . . . . . . . . . . . . . . . . . 1.3.3 Large-Flip Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 5 5 6 7 7 13 14 15 17 2 Tree Sampling . . . . . . . . . . . . 2.1 Chapter Overview . . . . . . . . 2.2 Tree Sampling for MRFS . . . . 2.2.1 Model Specification . . . 2.2.2 Tree Sampling Algorithm 2.3 Theoretical Justification . . . . . 2.4 Numerical Results . . . . . . . . 2.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 18 18 18 19 20 27 29 3 Hot Coupling . . . . . . . . . . . . . . . 3.1 Chapter Overview . . . . . . . . . . 3.2 Sequential Monte Carlo . . . . . . . 3.2.1 The general SMC framework 3.3 The Algorithm . . . . . . . . . . . . 3.4 Experimental Results . . . . . . . . . 3.5 Chapter Summary and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 34 36 39 40 42 46 v Contents 4 Large-Flip Importance Sampling . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Preliminaries and Notation . . . . . . . . . . . . . . . . . 4.3 Event-Driven Monte Carlo . . . . . . . . . . . . . . . . . . 4.4 The Large-Flip Quasi-Gibbs Sampler . . . . . . . . . . . . 4.5 Sample Selection and Correcting the Bias with Importance pling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sam. . . . . . . . 47 47 49 49 54 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5 Conclusion 57 62 vi Acknowledgements I would like to thank my supervisor Nando de Freitas for all his help, time, and advice over the course of this degree. Not enough can be said of the patience and understanding of my fiancee (wife at the time of these revisions) Joanna Wood; a strong case can be made that tolerating a graduate student is at least as hard as being one! I am also grateful to my thesis committee, Professors Kevin Murphy, Paul Gustafson, Joel Friedman, and Rabab Ward, and to my external examiner Professor Dale Schuurmans of the University of Alberta, for offering constructive comments and making the defense ritual a memorable and (dare I say) enjoyable experience. My fellow students, especially my “supervisor siblings,” at the UBC Laboratory for Computational Intelligence shared many of the vicissitudes of grad student life: failed and successful experiments, sprinting for the last bus at 2:30am, certain electronic “vices,” faulty plumbing in Edinburgh art galleries, etc. Your companionship is certainly part of this story; I will mention no names purely out of fear of leaving anyone out. Finally, I would like to thank my mother, father, and sister for their support over the years. 1 Chapter 1 Introduction 1.1 Undirected Graphical Models An undirected graphical model (Lauritzen 1996) is a multivariate statistical model whose joint distribution is specifiable as a product of functions called potentials defined on subsets of the variables. A special case of the above definition occurs when the subsets are of size at most two; the resulting function defines a pairwise model. The definition of the functions suggests a natural graphical interpretation of the model; the variables can be visualized as forming the vertices of a graph, with two vertices being linked by an edge if and only if they appear together in at least one of the potential functions. A set of variables whose graphical representation is such that an edge appears between every member of the set are said to form a clique in the representation. Mathematically, the distribution of an undirected graphical model is given by: 1 ΨC (xC ) (1.1) π(x) = Z C∈C . The set C is a set of cliques of the graph, and the functions ΨC (xC ) are the potentials defined on the cliques C. The potentials are arbitrary as long as they are positive. It is possible that the potential on a given clique is merely the product of potentials on a set of smaller cliques. One example is the pairwise case mentioned above; if all pairs of variables appear in a potential function, then the model is said to be fully-connected. In such a case, the variables all form one clique; nonetheless the “fundamental” clique size is two. This is detailed much more rigorously in (Lauritzen 1996). The quantity Z in (1.1) is a normalization factor ensuring that π(x) is a correct probability; it is given by: ΨC (x′C ) Z= x′ (1.2) C∈C Known also as the partition function, computing it is a daunting problem since the summation runs over all possible states the model can be in; unfortunately, it invariably appears in most parameter estimation and learning equations in undirected models, considerably complicating those tasks. In contrast, learning in directed graphical models, whose joint density is a product of conditional probabilities representable as directed acyclic graphs, does not encounter this Chapter 1. Introduction 2 obstacle. Undirected models, however, are more naturally suited to many problems where the dependence among the variables is not immediately specifiable in terms of conditional probabilities. In this thesis, we primarily consider methods applicable to pairwise models, except in the final chapter where even this requirement is relaxed. We also restricted our attention to models with discrete-valued domains; continuousvalued models require the consideration of an additional set of issues. An alternate way of specifying the general distribution (1.1) of an undirected model is the fundamental Gibbs-Boltzmann distribution of statistical physics: π(x) = 1 −βE(x) e Z(β) (1.3) The function E(x) returns an energy for each configuration of the system; β is known as an inverse temperature parameter and specifies the “peakedness” of the joint distribution. High values of β, alternatively referred to as “low temperatures,” imply that the system will be found in one of its lowest-energy states; lower values mean that it behaves more like an uncorrelated, random system. Note that, if say, the energy term consists of a sum of terms defined on pairs of variables, then the positive-valued potentials in (1.1) correspond to the product of the exponentials of these energy terms. Undirected graphical models are best-suited to applications where instead of modeling the dependence among variables via conditional distributions (“causal” relations) it is more natural to do so with these potential functions which reflect the variates’ “compatibilities.” Examples include image processing (Geman and Geman 1984, Besag 1986), spatial statistics (Besag 1974, Rue and Held 2006), and neural computation (Amit 1989, Ackley, Hinton and Sejnowski 1985). Undirected models are also known as Markov Random Fields(MRFs); we will thus refer to them interchangeably over the course of this work. There is a considerable array of literature discussing the fundamental properties of these systems, for example (Kindermann and Snell 1980), and we shall certainly not reproduce all the insights therein here. We will, however, briefly discuss the notions of conditional independence and a fundamental theorem identifying any MRF with a system possessing a Gibbs-Boltzmann distribution of the form (1.3). Formally, suppose we had a set M {1, . . . , M } and a set of M random variables (X1 , . . . XM ) each assuming one of q values from a discrete set D {D1 , . . . , Dq }. Now consider the set M to index vertices in a graph, and define N N1 , . . . , NM to be a neighborhood system, that is for each i ∈ M, Ni is a subset of M such that an edge connects i to each j ∈ Ni in the graph. Then the joint distribution π(X1 = x1 , . . . XM = xM ) is said to be a Markov Random Field if the following property holds: π(Xi = xi |Xj , j ∈ M\i) = π(Xi = xi |Xj , j ∈ Ni ) (1.4) The condition says that the conditional distribution of each variable is not altered by the instantiation of any other variables in the system once all the Chapter 1. Introduction 3 “neighboring” variables are known. The neighborhood Ni is known as the Markov Blanket of each node. Variable i is also said to be conditionally independent all other variables given its neighbors. The fundamental theorem of Hammersley and Clifford (Hammersley and Clifford 1971) asserts that the system of variables (X1 , . . . , XM ) is a Markov Random Field if and only if it possesses a Gibbs-Boltzmann distribution of the form in Equation (1.3). Details can be found in, say, (Bremaud 1999). We mentioned above that calculating (or at least estimating) the partition function is often of interest in statistical learning. In estimation applications, we are primarily concerned with making inferences about a set of unobserved (or “latent”) variables given a set of observations. Encoding these observations into a general MRF model is straightforward: the relevant variables are simply assigned their observed values; of course to ensure the correctness of this conditional distribution, the normalization constant is a function of this evidence. We will refer to the vector of unobserved and observed variables as X and Y respectively. The general task of inference consists in computation of expectations of functions of the variables of the form: Eπ [h(X)|Y ] (1.5) A special case of (1.5) consists of the computation of the marginal probabilities of the unobserved variables. In principle, for discrete-valued models, one can compute these expectations by summing out all other variables, but unless the model possesses special structure, this is not practically possible for all but the smallest graphs. The prototypical MRF is a mainstay model of statistical mechanics. Known as the Ising model after its inventor, Ersnt Ising (Ising 1925), it has elucidated the nature of phase transitions in ferromagnets and disordered systems and provided a useful model for interacting particle systems. In its general form, the energy function (known also as a Hamiltonian) appearing in Equation (1.3) is given by: hi xi (1.6) Jij xi xj − E(x) − i,j i where xi ∈ {−1, 1}. The parameters Jij are real numbers whose sign determines the mutual “preference” of corresponding variables Xi , Xj ; positive and negative values of Jij imply that the pair tends to have the same value (both positive or negative) and different values respectively. In statistical physics, positive and negative values of Jij are known as ferromagnetic and antiferromagnetic cases respectively. The linear term in the energy models an external influence (“field”) on each variable; the sign specifies the tendency this external influence imposes on variable i. For both the Jij and the hi parameters, the magnitude reflects the strength (weight) of that particular interaction. There has been a vast amount of research in the statistical physics community into the properties of this model; in the planar ferromagnetic case without an external field, its asymptotic (large system size) statistical properties have been solved exactly by Onsager (Onsager 1944). Unfortunately, an analogous Chapter 1. Introduction 4 success treating three (or higher) dimensional arbitrary models proved elusive for decades, and a recent NP-Completeness result has proved the task of computing exactly the partition function of such models to be almost certainly futile (Istrail 2000). For the case where the Jij are allowed to be negative and cycles exist such that the product of the sign of the interactions along the cycles is negative, the situation is also grim; even finding the minimum of the energy is NPHard except in limited special cases. In contrast, the global minimum of the ferromagnetic Ising model without an external field is trivial in any connectivityall the variables align themselves with each other. The complexity of the energy landscape arises from the fact that interactions “compete;” the negativity of the interaction sign products along cycles implies that it is impossible to “satisfy” all the edges along that cycle. This feature is known as “frustration” in the physics literature, and the models are known as Spin Glasses. Much study on these systems has revealed a fascinating array of behavior; two classic surveys detailing the developments (Fischer and Hertz 1991) and (Mezard, Parisi and Virasoro 1987). A generalization of the Ising model to variables that assume more than binary values is known as the Potts model. Analogously, an interaction term contributes a Jij to the overall energy (which can be positive or negative) if the variables on edge (i, j) are the same. Ising and Potts models have often made their way into applications in machine learning (Zhu and Ghahramani 2002, Getz, Shental and Domany 2005, Hinton 1999), and computer vision (Geman and Geman 1984, Li 2001). A very simple application of Ising-type models to computer vision is as follows: consider the intensities of the M pixels in an image to be the random variables Xi ; they can take one of q colours. What we are given, however, is a set of possibly corrupted observations Y of the intensities. If we have a model of the corruption process, then we can use an Ising-type prior to encode our belief in the smoothness of the “true” image. There is often considerably more information to encode than smoothness, like, for example, texture, which would require a richer model. The current state-of-the-art research in MRFs for low-level computer vision (Roth and Black 2005) is exploring the usage of higher-order neighborhoods to overcome the limitations of models using pairwise, nearest neighbor interactions. In statistical classification, this set of models has been applied to the task of semi-supervised learning (Zhu and Ghahramani 2002). The problem can be posed as follows: given a set of data, few of which actually have known labels, can we, in conjunction with a hypothesized model, utilize the labeled data to provide information about the unlabeled data’s classes? In (Zhu and Ghahramani 2002), an Ising-type model is proposed which specifies the Jij to be positive values that decrease exponentially in the Euclidean distance between data points i and j. The Boltzmann machine (Ackley et al. 1985) is a binary-valued, Ising-type MRF where the Jij are indeed allowed to be negative; the competing interactions have been interpreted as representing the processes of neural inhibition and Chapter 1. Introduction 5 excitation in biological neural systems (Amit 1989). The contributions of this thesis are three methods for approximate solution of the inference and/or normalization problem. The first two methods make specific assumptions about the form of the distribution, whether on the topology of the graph or on the functional form of the potentials. The last method makes no restriction, but attention is restricted to discrete-valued models. In the next section, we will survey some of the chief methods used to solve the problems at hand: exact inference, Variational Methods, and Monte Carlo Methods. 1.2 1.2.1 Inference and Normalization Brief Summary of Exact Inference In this section, we briefly outline the methods used to tackle the problems. We shall go over three principal classes of methods, concentrating on the last since that is the primary focus of this thesis. The immediate aim of inference is the computation of values of the form of Equation (1.5). One special case is evaluating the marginal probabilities of some variables given the evidence: P r(Xi = xi |Y ) (1.7) In principle, one can simply sum out all the remaining variables for each value that Xi can take. This is, of course, impossible in practice; if each variable assumes one of q states then this requires O(q M ) computations. There exist, however, some clever algorithms that exploit the conditional independence structure of the graph to drastically reduce the cost of the marginalization by rewriting the overall summation as a procedure that generates a set of intermediate functions, each of which is used in an “outer” layer. A prototypical example is the case where the variables have a “chain”-like independence structure; in such a case, even though the overall system still has q M states, the factorization of the potentials ensures that summing out variables can be done in O(M q 2 ). The well-used Hidden Markov Models belong to this category. A more general formulation of this principle defines the now-classic algorithm known as belief propagation (also called message passing) (Pearl 1987) that can efficiently solve the inference problem on graphical models of singly-connected topology, i.e. chains or trees. For more general graphs, an algorithm known as variable elimination (VE) (Zhang and Poole 1994) is a generalization of this principle. VE relies on the fact that marginalization is a set of nested summations over all the variables, but depending on the arrangement of the summations, it can happen that only a few terms are involved in a given summation, and so the summation can be thought of as generating an intermediate function over the remaining variables. The ordering of the summations defines an elimination ordering. These functions in turn become terms in external summations. Thus, rather than compute the Chapter 1. Introduction 6 marginals by sheer brute force, given an elimination ordering, we can recursively compute all the intermediate functions; the complexity is then reduced to the size of the state space of the largest such intermediate function. Often, however, one is interested in computing the marginals of several variables in the model. In the plain VE method, one is forced to redo the operation for each such query; a general data-structure known as the Junction tree (Cowell, Dawid, Lauritzen and Spiegelhalter 1999) avoids this by “compiling” the joint distribution; see the reference for details. Both of the previous two algorithms require the specification of an elimination ordering; while any ordering is correct in principle, the most favourable one among all choices generates the intermediate function with the smallest domain size because this dictates the algorithms’ bottleneck. Unfortunately, finding an optimal such ordering is NP-Hard (Arnborg, Corneil and Proskurowski 1987). Additionally, using VE and JT on densely-connected graphs does not give any gain. Consider the case of a complete (fully-connected) pairwise graph; summing out any variable results in a generally non-factorizable function that jointly depends on all the variables. It turns out that exact inference is a very difficult computational problem, and one is invariably forced to use approximations. The two main classes of approximative methods used in the machine learning community are variational and Monte Carlo algorithms. 1.2.2 Complexity Properties of Some Potts/Ising models Before proceeding to approximate methods of inference, however, some properties of the complexity of minimizing the energy of Ising/Potts-type models in their various cases will be mentioned. Energy minimization can be seen as a special case of inference where the parameter β in the form of the distribution (1.3) is very large. It turns out that the problem of minimizing the energy of an Ising model is equivalent to a combinatorial optimization problem known as the Max Cut problem (Picard and Ratliff 1973). The parameters Jij of the graph can be seen as weights defined on edges of an undirected graph; a partitioning of all the vertices is a “labeling” of each vertex to have “colour” +1 or −1. A cut is subset of all the edges such that for each such edge, one vertex lies in each partition. The object is to maximize the sum of the weights across this cut. In general, the Max Cut problem is NP-Hard (Garey and Johnson 1979), but is polynomially solvable in a few special cases. Some non-trivial examples include the case of planar graphs with no external field (Hadlock 1975), where the problem can be cast as a maximum matching problem, and the purely ferromagnetic case (obviously, including an arbitrary external field; the zero-field case is trivial.) (Picard and Ratliff 1975) If the graph has negative interactions and has a field, then even if it is planar, the problem is NP-Hard (Barahona 1982). For the case q > 2, the Potts model, minimizing the energy in the presence of external influence is NP-Hard even in the positively interacting planar case(Veksler 1999); the problem becomes a generalization of the Max Cut problem known as the Max k-Cut problem. Chapter 1. Introduction 7 We next turn to some methods used to perform approximate inference. 1.2.3 Variational Methods Variational methods (Jordan, Ghahramani, Jaakkola and Saul 1999) essentially characterize the joint distribution as the solution to an optimization problem; typically, the optimization is performed over a set of parameters indexing the members of an approximating (and tractable) class of distributions. The simplest such method is a well-known idea from statistical physics known as the mean field approximation (see the previous reference, or (Fischer and Hertz 1991) for a physics treatment.) The approximation involves assuming that each variable in the system is essentially independent of all the others, but has a local “field” specifying its marginal distribution; the field parameters are those that minimize the variational distance between all distributions in this family and the target distribution. The approximation is exact for the fully-connected ferromagnetic Ising model with no external field in the infinite system-size limit, sometimes known as Curie-Weiss model. For the so-called Ising Spin Glass model, where the interactions are sampled from a Gaussian distribution, there is a mean-field theory for such infinitely-large, fully-connected systems averaged over all values of the parameters. See (Fischer and Hertz 1991, Mezard et al. 1987) for an in-depth overview. Physicists, however, recognize the “unphysicality” of such mean-field models due to the fully-connected topology, and so far studies of the models with more realistic nearest-neighbor interactions in the two or three dimensional lattice have been forced to rely on Monte Carlo simulations to examine their properties. In the machine-learning community, variational methods are often used because they can be faster than Monte Carlo simulations. This speed comes at the price of reduced accuracy; typically, variational methods offer bounds on the log of the partition function; the mean field approximation, for example, is a lower bound on this quantity. More sophisticated methods have emerged in recent years (Wainwright, Jaakkola and Willsky 2002) that offer upper bounds as well. The bounds are impressive on ferromagnetic systems, though for frustrated models they are quite loose. As we have seen when we examined the complexity of even finding the global minima of the Potts/Ising models in different regimes, there is something fundamentally “irreducible” about systems with competing interactions. 1.2.4 Monte Carlo Methods Monte Carlo methods have chiefly fallen into one of two classes, though recently there has emerged a new framework which can be seen as a union of the two “traditional” methodologies. The main idea behind the methods is that if we can draw samples from the distributions with respect to which we are trying to integrate, then we can approximate the integrals by averages of the functions of the samples. Unfortunately, it is seldom possible to draw “exact” independent samples from our Chapter 1. Introduction 8 joint target distribution except in the simplest of cases. Importance Sampling One possible Monte Carlo estimator of the integral then uses an alternative distribution known as a proposal from which we can draw exact samples, and which we can evaluate up to a normalization constant. We can then approximate the integral of interest by weighting this set of samples. In the event that we know the normalization constants of both distributions, the resulting estimator is an unbiased; if not we can normalize the estimator by the sum of the weights to yield an asymptotically unbiased estimator. The framework is known as importance sampling. Specifically, suppose we wanted to estimate: I h(x)π(x)dx (1.8) If we have an auxiliary density q(x) which is strictly positive whenever π(x) is, then clearly we can write the integral as: I= h(x) π(x) q(x)dx q(x) The quantity π(x) q(x) is known as the importance weight and is abbreviated by w(x); we can now estimate I by drawing N samples from q and using the following estimator: N 1 ˆ h(X (i) )w(X (i) ) (1.9) I= N i=1 Often, however, the distributions π and q can only be evaluated up to a constant; let π ˜ and q˜ refer to the unnormalized forms of the distributions. Then the following ratio of two estimators is an asymptotically unbiased estimator of I: N h(X (i) )w(X ˜ (i) ) Iˆ = i=1 N (1.10) ˜ (i) ) i=1 w(X Note that the denominator term in Equation (1.10) is an unbiased estimator of (N times) the ratio of the partition functions of π and f . The main challenge of importance sampling is finding a good proposal distribution; while the estimator Iˆ may be unbiased, a poor choice of proposal will result in a large variance; practically, this means that a very large number of samples must be drawn from the proposal before the weighted set is “representative” of the target distribution. This problem becomes more pronounced the larger the dimension of the state space. For directed graphical models, an example of importance sampling is a method known as likelihood weighting (Fung and Chang 1989). In that context, a joint distribution is specifiable in terms of the product of conditionals of the latent variables (the prior ) and the product of likelihood terms, i.e. conditional probabilities of the evidence given the values of the latent variables. In Chapter 1. Introduction 9 a directed model, sampling from the prior is straightforward due to the conditional construction of the distribution. The prior is used as a proposal to the target (full posterior,) and the importance weights reduce to simply being the remaining likelihood terms. The more “deterministic” the likelihood terms, the worse the assumption that the prior approximates the posterior, and consequently, the higher the variance of the estimator. A more serious problem in our context is that it is not trivial in general to draw samples from the prior of an undirected model because the prior is itself in the form of a product of general potentials, not conditionals as it is for directed models. Finding a good importance sampling proposal for a general undirected model is very difficult. Markov Chain Monte Carlo This section will provide a cursory overview of Markov Chain Monte Carlo (MCMC) methods; for an in-depth exposition, please consult, say (Robert and Casella 1999), (Liu 2001), or (Newman and Barkema 1999). MCMC methods, called dynamical Monte Carlo by physicists, have their roots in the seminal paper of Metropolis et al (Metropolis, Rosenbluth, Rosenbluth, Teller and Teller 1953), and are philosophically different from trying to weight independent samples from a known proposal distribution. Instead, a correlated, Markov sequence of random variables is generated that has the target distribution π as its invariant or stationary distribution. This sequence is obtained by proposing a new point according to a Markovian proposal process and accepting the “move” in such as way as to enforce a probabilistic convergence to the target. The general family of methods is known as Metropolis-Hastings (MH) algorithms. Let the proposal process be defined as: q(y|Xt = x) Then the MH algorithm is defined as follows: Initialize: X0 = x0 For t = 1 to t = T : 1. Sample X ∗ ∼ q(x|Xt−1 = xt−1 ) ∗ ∗ π(x )q(xt−1 |Xt−1 =x ) 2. Calculate α = min{1, π(x } ∗ t−1 )q(x |Xt−1 =xt−1 ) 3. Simulate R ∼ U[0, 1] 4. If R ≤ α, Xt = x∗ Else, Xt = xt−1 Figure 1.1: Abstract pseudocode for the Metropolis-Hastings method. See the text for a discussion. Chapter 1. Introduction 10 The acceptance step behaves as a probability “flow-biasing” mechanism on the Markov chain. It enforces a condition known as detailed balance, which states that under the stationary distribution, the probability of being in any state x and subsequently in any state y is the same as the probability of being in x after being in state y. This condition is a sufficient way of ensuring that asymptotically, the frequency with which the process visits states equals their probability of occurrence under the target distribution. The special case of the proposal q being symmetric, i.e. q(x|y) = q(y|x) results in a cancellation of the proposal terms in the acceptance condition in (1.1). This case is generally known as the Metropolis algorithm. We should emphasize that this rejection step serves a different purpose than an apparently analogous operation that takes place in the method known as rejection sampling. In the latter, samples that are rejected from a proposal are discarded; the rejection step serves to ensure that the resultant set of accepted samples is distributed according to the target distribution. In the MH algorithm, rejection means that the chain must remain at its current point in the state space; thus it is in fact essential to not discard the sampled points to ensure that asymptotically, the states are visited with the correct frequencies. A special case, defined for distributions with at least 2 variables, is known as the Gibbs sampler (Geman and Geman 1984); it uses as proposals the conditional distributions of the variables. Typically, for graphical models, the single site conditionals are quite easy to draw samples from, but sometimes, as shown in Chapter 2 with the method known as Tree Sampling, it is possible to efficiently obtain samples from the conditionals of sets of variables. Once again, we refer the reader to (Robert and Casella 1999) for details on the many variations of this algorithm. In MCMC, due to the fact that proposed points depend on their immediate predecessors, the resultant set of samples is correlated. Nonetheless, under conditions met by the MH algorithm and the Gibbs sampler, it is asymptotically correct to estimate integrals of the form (1.5) using the samples as if they were exact samples from π. Of course, the more correlated the samples, the larger the sample size required to have an estimator with the same expected level of accuracy as one using a given number of independent samples from π would. This can easily be shown using the properties of expectations and variance. An interesting aspect of the Gibbs sampler is that, assuming that the relevant conditionals can be sampled from exactly, there is no rejection as there is with the MH method with an arbitrary proposal. While a high rejection in a MH chain means that the samples are highly correlated, the fact that a Gibbs sampler does not reject samples does not mean that it is less so; indeed, the analog of high rejection in MH is the phenomenon of the Gibbs sampler simply proposing the previous state in the chain with very high probability. Typically, MCMC methods are initialized with arbitrary initial conditions. To ensure the asymptotic correctness of the samples, the underlying Markov chain must satisfy the condition of ergodicity; informally, this means that the chain will settle into the stationary distribution irrespective of the distribution of the initial state. The mixing time or correlation time is a way of quantifying how Chapter 1. Introduction 11 quickly this occurs. Unfortunately, for high dimensional problems, application of a naive form of the MH algorithm often results in a very long mixing time. Despite the fact that MCMC methods are asymptotically correct even in their simple forms, it often takes considerable insight into the specific problem under consideration to make them work efficiently. It often appears in practice that when simulating a very “peaked” distribution (i.e. at low temperature,) naive MCMC methods can take an exceedingly long time to propose moves away from the current state. A change in perspective of the steps of the algorithm can suggest a way of running the simulation that can dramatically speed up the efficiency and yet yield a statistically equivalent process as the usual simulation. The N-Fold Way (Bortz, Kalos and Lebowitz 1975) is the prototypical example of an Event-Driven Monte Carlo method. It requires an ability to enumerate the probability of transitioning to each state that neighbors (in the sense defined by the proposal) the current one. The probability that a change happens is the sum of all such change of state probabilities, and a waiting time based on this probability is sampled. Finally, the neighboring state is sampled from the normalized set of transition probabilities. The method is described in detail in Chapter 4, as are the limitations which we try to surmount. A statistical technique that can be used to improve the accuracy of estimators using MCMC is known as Rao-Blackwellisation (Gelfand and Smith 1990). This technique, discussed in more detail in Chapter 2, applies when we can analytically compute expectations of the functions under consideration conditional on the previous sampled values. Proving that the method reduces the variance of estimators using independent samples is straightforward, though doing so for correlated samples is not. Liu et al (Liu, Wong and Kong 1994) have proved that the variance is indeed reduced for Gibbs samplers that sample from the conditionals of two disjoint sets of the variables. We have employed this statistical result and methods of efficiently sampling from regions of the graph to propose the Tree Sampling algorithm of Chapter 2. A more systematic limitation of MCMC is that it does not offer a way to estimate the partition function, even if the samples come from a very efficient process. The next section details a methodology that overcomes this. Sequential Monte Carlo The framework known as Sequential Monte Carlo (SMC) (del Moral, Doucet and Jasra 2006)can be seen as something of a combination of importance sampling with dynamical Monte Carlo. The family of methods appears to have been first suggested by (Jarzynski 1997). Fundamentally, the ideas are those of importance sampling; a population of N “particles” is sampled from a known initial distribution π0 and somehow “moved” via some proposal process to the target distribution π via a sequence of intermediate distributions. The proposal process at each stage is arbitrary as long as it satisfies the support condition of importance sampling, i.e. it must cover the target at that step. One such reasonable process is the MCMC kernel whose invariant distribution is the intermediate Chapter 1. Introduction 12 target at that step. In contrast to MCMC, we do not need to assume that the samples drawn from SMC come from the stationary distribution since the bias for incomplete convergence is corrected for by weighting. In fact this would be straightforward if we could somehow compute the exact temporal marginal of the incompletely converged Markov chains. Unfortunately, this is impossible as it requires an sequence of exponentially complex integrations. The solution adopted by SMC is to perform the importance sampling on a joint state space; the target distribution is augmented so that the marginal of the joint target’s final variable is the target distribution at that step. The cost for this reduction in complexity is an increase in the variance of the importance weights. To be more specific, we will didactically follow a simple case. Suppose we had an initial proposal distribution µ0 from which we could draw samples, and a target distribution π1 . Additionally, we also have a proposal process K1 (x1 |x0 ); we draw N samples from µ0 and subsequently apply the proposal K to each sample. Let the marginal distribution of the samples after the application of K1 be: µ1 (x1 ) = x′0 K1 (x1 |x′0 )µ0 (x′0 )dx′0 If we could calculate this marginal, then we could simply estimate our target integral using the ideas described in the importance sampling section, i.e. by simply weighting the samples from µ1 with their importance weights against π1 . Unfortunately, we cannot in general evaluate this integral. Suppose, however, that we augment the target distribution to: π1 (x0 , x1 ) = L0 (x0 |x1 )π1 (x1 ) Clearly, for whatever choice of conditional distribution L0 , π1 (x0 , x1 ) marginalizes to the target distribution π1 (x1 ). The importance weights are then defined on the joint state space of X0 , X1 as: W1 (x0 , x1 ) = L0 (x0 |x1 )π1 (x1 ) K1 (x1 |x0 )µ0 (x0 ) (1.11) W1 (x0 , x1 ) = L0 (x0 |x1 )π1 (x1 ) H0 (x0 |x1 )µ1 (x1 ) (1.12) H0 (x0 |x1 ) = K1 (x1 |x0 )µ0 (x0 ) µ1 (x1 ) or alternatively as: where: The form of the importance weights in Equation (1.12) shows that if we could somehow calculate H0 (x0 |x1 ) and set L0 (x0 |x1 ) = H0 (x0 |x1 ) then the resulting cancellation in effects means that we are importance sampling only with respect to the last variable. As we said previously, however, calculating H0 is not feasible as it requires knowledge of µ1 . In (del Moral et al. 2006), Chapter 1. Introduction 13 some suggestions (discussed in Chapter 3) are made for possible choices of L0 ; ideally, it should be a good approximation to H0 . Additionally, if the samples are resampled according to the weights in Equation (1.11), then the resulting population is (asymptotically) distributed according to L0 (x0 |x1 )π1 (x1 ). No miracles occur, however; if the joint proposal distribution K1 (x1 |x0 )µ0 (x0 ) is poorly suited to the augmented target L0 (x0 |x1 )π1 (x1 ), then a huge number of samples is required before resampling generates a representative population from the target. This fact includes the case that we can calculate the “optimal” L0 ; in such a case it would mean that the induced marginal µ1 is itself a poor proposal to π1 . The SMC framework proposes to use instead of a single target π1 , a sequence π0 , . . . , πT , where π0 is known and easy to sample from, πT is the target, and the intermediate distributions form a set whose members are typically chosen so as to get progressively “closer” to the target. One example is the set of annealed distributions π γn , where {γn }Tn=1 is a sequence of T numbers from zero to unity. If, given a family of proposal processes, the distributions are close enough together that the processes can reasonably move a set of samples from one distribution to another, then resampling the population is desirable as it ensures that the population at each step is a reasonably good sample from the particular intermediate distribution. After resampling, the weights are reset to unity (as they were in the first step due to the fact that we were using exact samples from π0 ) to reflect this fact, and the algorithm continues with the next proposal process towards the subsequent target distribution. An example of a proposal that in principle satisfies the requirement that samples are moved from target πn to target πn+1 is an MCMC kernel of invariant distribution πn+1 . We remind the reader, though, of the point made previously about SMC methods is general: if the proposal process is too “slow” with respect to the distance between intermediate target distributions, then resampling cannot be expected to yield a set of samples from the correct parts of the state space of πn+1 unless a very large population is drawn. The reason is that the implicit proposal marginal µn+1 does not match πn+1 . This typically occurs when simulating systems at very low temperature, because the movement of the particles due to the proposal process becomes more and more difficult. A significant advantage of this class of methods is that it allows estimation of the partition function. The interested reader is referred to (del Moral et al. 2006) for details and asymptotic properties of the SMC class of algorithms. 1.3 Contributions of This Thesis In this section, the three main Monte Carlo methods which appear in the full subsequent research papers are discussed in an informal and introductory way, in particular how they relate to one another and how they can be improved. Chapter 1. Introduction 1.3.1 14 Tree Sampling The first method we have proposed, discussed in detail in Chapter 2, is specifically tailored to planar undirected graphical models. The method relies on the fact that if we instantiate certain variables in the model, then the remaining variables are such that exact inference is possible on the remaining variables. Indeed, not only can exact inference be performed, but exact samples can be drawn from these variables. The set of conditioning variables is chosen such that the remaining variables conditionally form tree-structured distributions. The algorithm was thus a Gibbs sampler, but one whose conditionals were the tree distributions rather than the single-site updates that are typically done. An analysis of (Liu et al. 1994) has shown that such a “blocking” of variables in a Gibbs sampler combined with Rao-Blackwellisation is guaranteed to reduce the variance of the corresponding MCMC estimator as long as the variables are partitioned into two disjoint sets. The tree sampling method leveraged these results to propose a block MCMC scheme where Rao-Blackwellisation is assured to perform at least as well as the corresponding non Rao-Blackwellised method. Some theoretical results regarding the sampler’s behavior at stationarity were derived, in particular that the maximal correlation between samples is lower than for a block-update scheme which utilizes a the tree conditionals than one which uses a more disconnected set. Additionally, the mutual information between the samples was shown to be lower. Although the method empirically performed considerably better than singlesite Gibbs sampling, it was still not immune to getting trapped in minima once it found them. The reason for this is that despite the process’ sampling from trees instead of single-site conditionals, the fact that some neighboring variables are conditioning each site in the tree means that their influence may nonetheless be strong enough to “pin” variables in the conditionals from changing state so that different states (with, say, equal or lower energy) may be explored. There are several possible ways around this issue. The tree sampling methodology has been applied to inference in Conditional Random Fields for Multiagent Reinforcement Learning (Zhang, Aberdeen and Vishwanathan 2007); in that work, rather than initialize the starting distribution arbitrarily, they sampled a random spanning tree and initialized the sampler with an exact sample from this tree. Intuitively, this can be understood as preliminarily satisfying a much larger set of constraints than a random assignment would. Another possible strategy would be to incorporate the tree sampler into a regime that “heats” and “cools” the system; two such well-known “ensemble” Monte Carlo methods are simulated tempering (Marinari and Parisi 1992) and parallel tempering (Hukushima and Nemoto 1996). Both these methods extend the target distribution to be a product of a set of distributions, one of which is the distribution of interest. The ensemble typically consists of versions of the target at increasing temperatures. MCMC on the high temperature distributions is likely to make comparatively large changes of state, and a further acceptance step is added on swap moves between chains at different temperatures to enforce the correctness of the algorithm. However since the methodology still Chapter 1. Introduction 15 requires MCMC moves to be made within chains each temperature, and this is typically done with single-site moves, the tree sampler can straightforwardly be integrated there. Unfortunately, a more serious limitation of the method is that it does not readily apply to graphs whose topology is not planar; coming up with a good decomposition into tractable conditionals is not trivial. Good progress in this field has been made by (Rivasseau 2005), who has proposed method of partitioning arbitrary pairwise and factor graphs based on a heuristic that seeks to minimize the number of trees within a given partition. The results in that work were quite promising, and a unified treatment was presented in (Hamze, Rivasseau and de Freitas 2006). However the difficulty of applying Tree Sampling to arbitrary graphs inspired the method of the next section, which relies on a sequential construction of the target distributions in a way derived from the graph topology. 1.3.2 Hot Coupling The method described in Chapter 3 belongs to the framework of Sequential Monte Carlo (SMC), where an artificial dynamics is imposed on a joint state space and a sequence of distributions “moving” from some simple, initial distribution, to the final target is used. MCMC steps towards each intermediate distribution are used to perform the moves. The initial distribution was a spanning tree of the model; a population of samples was drawn from this tree. The sequence of distributions consists of graphs with edges slowly being added to the predecessors. This is distinct from an annealed sequence of distributions, where a temperature parameter is decreased from member to succeeding member of the sequence. The experiments demonstrated good performance relative to annealing on small Potts models at moderate-low temperature even when a random ordering of edges was used to grow the graph. The natural direction to take the research, however, was to try to determine a good choice of sequence rather than a random one. We can visualize this as the sequence that would “guide” the set of particles towards the important regions of the target. It is of course, not necessary that a single edge is added at a time. The Greedy Constructive Heuristic Though finding such an “optimal” sequence appears to be extremely difficult, some heuristics appear to be reasonable. One natural such method is a (possibly randomized) greedy heuristic. Suppose our set of target distributions is such that after the full addition of a set of edges, all nodes whose edges have not yet been added “compete” to contribute the next such set. The performance measure is the energy of the current state on the new system once the edges are added. A deterministic choice would be to select the set of edges that yield the largest such gain; one that introduces randomness and variety would be to select the set in a random way, that, say biases the choices to result in lower-energy states Chapter 1. Introduction 16 on the new graph. Once the choice has been made, a Gibbs sampler is run to equilibrate the old state on the new distribution, and the process is repeated until we reach the final graph. A better strategy would be to perform a lookahead simulation, that is, at each decision point, instead of merely making a choice based on the energy changes at that step, we run a “greedy” simulation to the end of the system for each such choice and use the final values thus obtained to make the decision. If, however, each particle in the population chooses its own set of edges to add, then the target distribution in the sequence is not merely the previous graph with a set of edges added; in effect, each sample is choosing its own target. This introduces complexity into the simulation. Our proposal is to use a uniformly weighted convex combination of the graphs at each step to be the joint target at that step. This increases the complexity because now, when we perform single-site Gibbs updates to each of the N samples, the single-site conditionals consist of a sum of up to N distinct terms each. Despite the fact that each particle proposes its own contribution to the joint target, they continue to interact through the fact that the full joint target is a mixture of the results of the individual decisions. Note that since we only have the distributions of the partial graphs in unnormalized form, and that the partition functions of these graphs are not known, a uniform-weight mixture of the graphs does not mean that the probability of selecting a mixture component is uniform; in fact this probability is not known. The goal of this new method was to attempt to perform accurate simulations of large systems at low temperatures. In this regime, a good indicator of performance is how well the sampler finds low energy states. The preliminary experiments with the method suggested that as expected, adding edges by choosing according to the greedy constructive heuristic above performed much better than Gibbs sampling on a random sequence of edges. Better still was the scheme that employed the one-step lookahead in choosing the next set of edges to add. This looked very promising, until an embarrassing fact emerged: the method was soundly beaten by a simple scheme that iteratively ran a plain Gibbs sampler on the full graph until the energy appeared to have converged, and then perturbed some values of the resultant stable state with random noise. The astonishing observation was the speed and consistency with which the minima were found on our test graphs relative to the lookahead scheme. Additionally, combining the two methods yielded little or no gain, in particular when the plain perturbative method was allowed to run for enough iterations to ensure a fair comparison. On complete graphs of size 500, the simpler method yielded states whose energies were around 15% lower, a huge discrepancy for a low-temperature simulation. The author decided that the final direction of the research be to try to develop a methodology that could be used to not only found the global minima of the models, but could also be used for calculations on models at nonzero temperature. The methodology is outlined in the next section. Chapter 1. Introduction 1.3.3 17 Large-Flip Importance Sampling We mentioned in the last section that for low-temperature models, a simple perturbative Gibbs sampler succeeded in finding low-lying states much more quickly and consistently than a greedy edge constructive method using a Gibbs sampler. The reason for this is that the perturbation steps allow some variables that were previously prevented from changing state by the remaining variables of the system to do so. We found, however, that even better results are obtained if we use special Gibbs-like process, that prohibits certain moves from being remade. The motivation for this is as follows: the regular Gibbs sampler, when “stuck” in a local minimum, must wait for a very long time in a given state before one of the neighbors is chosen. This neighbor will tend to be the one such that the least energy is gained. In fact, a quantitative statement of this fact (see Chapter 4) and the fact that the time spent in a given state is on average inverse to the probability that any state transition occurs form the basis of the Monte Carlo method mentioned previously known as the N-Fold Way (NFW.) The problem with the NFW (and consequently, with the plain Gibbs sampler or Metropolis algorithm, even if enough time is devoted to making state transitions) is that the neighbor chosen may itself have no neighbors of descending energy aside from the preceding state. The result is that the state will simply revert to the initial one, and this cycling behaviour will continue for a potentially long time. This is because the Gibbs sampler does not have a memory of the states it has visited. One way around this is to actually store explicit states that have been visited by the Gibbs sampler. Tempting as this may be, it is inefficient because in general, the state space of a large model may have such a huge number of states with very close energies that trying to store them all can overwhelm any reasonable computer memory. Our approach was instead to forbid specific moves from being made; the resulting algorithm at zero temperature was found to have a resemblance to a combinatorial optimization heuristic known as Tabu Search (Glover 1989). Unfortunately, the resulting process is no longer a Markov Chain and does not possess a stationary distribution. To make the algorithm theoretically correct, we apply a selection step followed by an importance sampling step to compensate for the bias. We demonstrated the algorithm on several realisticallysized models. Full details of the method can be found in Chapter 4 of this thesis. Aside from its improved performance, the main advantage of this method is that it is much more generally applicable than the previous two discussed in this thesis. In particular, it applies to directed models (Bayesian Networks,) or models without pairwise potentials. In fact it is technically suitable to any distribution whose local conditionals we can evaluate and sample from, which typically means discrete-valued models. 18 Chapter 2 Tree Sampling1 2.1 Chapter Overview Rao-Blackwellised sampling is a powerful inference tool for probabilistic graphical models (Doucet, de Freitas, Murphy and Russell 2000, Lipson, Kueck, Brochu and de Freitas 2003, Besag 1974). In this chapter, we propose a new RaoBlackwellised MCMC algorithm for MRFs, which is easily expandable to other models, such as conditional random fields (Green and Richardson 2002, Moller, Pettitt, Berthelsen and Reeves 2004). As mentioned in Chapter 1, MRFs play an important role in spatial statistics and computer vision (Li 2001, Besag 1986, Hamze and de Freitas 2004). The algorithm proposed in this chapter exploits the property that MRFs can be split into two disjoint trees. By carrying out exact inference on each tree, it is possible to sample half of the MRF nodes in a single MCMC step. Our theorems will show that this tree sampling method outperforms simpler MCMC schemes. In particular, it exhibits lower correlation between samples and a faster geometric convergence rate. These theoretical results will be mirrored by our numerical examples. 2.2 2.2.1 Tree Sampling for MRFS Model Specification We specify the MRF distribution on a graph G(V, E), with edges E and N nodes V as shown in Figure 2.1 left. The clear nodes correspond to the unknown discrete states x ∈ {1, . . . , nx }, while the attached black nodes represent discrete observations y ∈ {1, . . . , nx } (they could also be Gaussian). According to this graph, the MRF distribution factorizes into a product of local Markov positive potentials: 1 ψ(xi , xj ) φ(xi , yi ) P (x1:n , y1:n ) = Z i∈V (i,j)∈E where Z is the partition function, φ(·) denotes the observation potentials and ψ(·) denotes the pair-wise interaction potentials. Our goal is to estimate the marginal posterior distributions (beliefs) p(xi |y1:N ) and expectations of functions over these distributions. 1 This section has appeared in the Proceedings of the 2004 Conference on Uncertainty in Artificial Intelligence (UAI) Chapter 2. Tree Sampling 19 Figure 2.1: At left, illustration of a partitioned MRF; nodes in the shaded and white regions are ∆1 , ∆2 respectively, with the small black circles representing observations. At right, depiction of the two-stage sampler; sampled values are large black circles. Conditioned on ∆1 , the variables in ∆2 form a tree. Using this two-stage scheme, Rao-Blackwellised estimators are guaranteed to outperform naive ones. As shown in Figure 2.1, an MRF can be partitioned into two disjoint trees. The loops in the MRF model cause it to be analytically intractable. However, belief propagation on each of the two trees is a tractable problem. This idea leads naturally to an algorithm that combines analytical and sampling steps. In particular if we have a sample of one of the trees, we can use belief propagation to compute the exact distribution of the other tree by conditioning on this sample. The algorithm therefore alternates between computing the trees and sampling them, as shown in Figure 2.1. Drawing samples in blocks (trees in this case) is well known to have benefits over algorithms that sample one node at-a-time. In Section 2.3 we will prove the domination of estimators using this tree sampling algorithm in comparison to other sampling schemes. Before this, we present the algorithm in more detail. 2.2.2 Tree Sampling Algorithm Consider the 5x5 MRF graph shown in Figure 2.1 at left. We have partitioned the nodes into two disjoint sets. Denote the set of indices of the shaded nodes as ∆1 and those of the unshaded nodes as ∆2 , where of course, ∆1 ∪ ∆2 = I, △ the set of all node indices. Let the variables indexed by these nodes be X∆1 = △ {Xj |j ∈ ∆1 } and X∆2 = {Xj |j ∈ ∆2 }. If we can sample from the conditional distributions: p(x∆1 |x∆2 , y) (2.1) p(x∆2 |x∆1 , y) then we have a two-stage Gibbs sampler called data augmentation, which has powerful structural properties that the general Gibbs sampler lacks (Liu 2001, Robert and Casella 1999). In particular, the two Markov chains in data augmen(t) (t−1) (t) tation exhibit the interleaving property: x∆2 is independent of x∆2 given x∆1 ; Chapter 2. Tree Sampling (t) (t−1) (t) 20 (t) and (x∆2 , x∆2 ) and (x∆2 , x∆2 ) are identically distributed under stationarity. Conditioned on set X∆2 , the variables X∆1 form an acyclic graph whose marginals are readily computed using belief propagation (Pearl 1987). This enables us to sample efficiently from the joint conditionals in (2.1) using the Forward Filtering/Backward Sampling algorithm (FF/BS) (Carter and Kohn 1994, Wilkinson and Yeung 2002). The sampling cycle is graphically shown in Figure 2.1, which makes it explicit that sampled values act as additional “evidence” to the complementary graph. The algorithm is shown in pseudocode in Figure 2.2. In the pseudocode, we are adopting Rao-Blackwellised estimators (Casella and Robert 1996, Gelfand and Smith 1990): δrb (h(Xi )) = 1 T T (t) E[h(Xi )|x∆1 , y] t=1 where T denotes the number of samples and , in this case, i ∈ ∆2 . The alternative Monte Carlo histogram estimator is: 1 δmc (h(Xi )) = T T h(Xi ) t=1 Both estimators converge to E(h(Xi ), (Liu et al. 1994) have proved that sampling from data augmentation is a sufficient condition for the Rao-Blackwellised estimator to dominate(have lower variance.) To obtain estimates of the node beliefs, we can simply choose h to be the set indicator function, yielding: p(xi |y) = 1 T T (t) t=1 p(xi |x∆k , y) where k = 1 if i ∈ ∆2 or 2 if i ∈ ∆1 . Rao-Blackwellised estimators are more expensive per sample as they require exact computation of the expectations. In practice, the large gain in variance reduction typically offsets this extra cost. Our experiments will show this in our setting. As shown in Figure 2.3, we can partition MRFs into trees in several different ways. (We address the theoretical advantages of these and other partitioning schemes in Section 2.3.) Thus it is possible at each iteration of the MCMC sampler to draw a uniform random variable and choose a particular tree partition. In particular, if we consider two different tree sampling algorithms with Markov transition kernels K1 and K2 and stationary distribution p(x|y), then the mixture kernel Kmix = αK1 + (1 − α)K2 , with 0 ≤ α ≤ 1, also converges to p(x|y) (Tierney 1994). 2.3 Theoretical Justification We now turn to a theoretical justification of our algorithm. It is clear that RaoBlackwellisation reduces the variance of the estimates. However, the question 21 Chapter 2. Tree Sampling Tree sampling Initialize • Set all sums Si = 0 • Partition the set of all nodes I into disjoint, tree-connected sets {∆1 , ∆2 } (0) • for i ∈ I, randomly initialize Xi for t = 1 . . . T • for i ∈ ∆1 – Apply belief propagation to compute the smoothing (t−1) (t−1) p(xi |x∆2 , y), treating the samples ∆2 as evidence. densities – Compute the expectations for the Rao-Blackwellised P (t−1) (t−1) E[h(Xi )|x∆2 , y] = Xi h(Xi )p(xi |x∆2 , y) estimator (t−1) – Set Si ← Si + E[h(Xi )|x∆2 (t) (t−1) – Sample X∆1 ∼ p(x∆1 |x∆2 pling. , y] , y) using Forward filtering / Backward sam- • for i ∈ ∆2 (t) – Apply belief propagation to compute the smoothing densities p(xi |x∆1 , y), (t) treating the samples ∆1 as evidence. – Compute the expectations for the Rao-Blackwellised P (t) (t) E[h(Xi )|x∆ , y] = Xi h(Xi )p(xi |x∆ , y) 1 estimator 1 (t) – Set Si ← Si + E[h(Xi )|x∆1 , y] (t) (t) – Sample X∆2 ∼ p(x∆2 |x∆1 , y) using Forward filtering / Backward sampling. Output Rao-Blackwellised estimates • δrb (h(Xi )) = 1 T Si Figure 2.2: Tree sampling. of what partitions should be adopted must be addressed. As shown in Figure 2.3 one could partition the graph using trees or, as proposed in (Jepson, Fleet and El-Maraghi 2003), using a checker board. In this section, we will show that our tree partitioning scheme results in lower dependency between Markov chain samples and faster geometric convergence. To do this, we need to introduce 22 Chapter 2. Tree Sampling Figure 2.3: Alternative partitions of an MRF corresponding to data augmentation methods. Again, the shaded nodes are called X∆1 . For the leftmost scheme, the elements of ∆1 are separated by ∆2 , and so the conditional p(x∆1 |x∆2 , y) is a product of the conditionals of each node. In the rightmost partition there are no unconnected subregions of either partition. The middle case is intermediate. some ideas from functional analysis. Let x1 , x2 , . . . be consecutive samples of the stationary Markov chain with equilibrium distribution π(x) (for notational simplicity we suppress conditioning on the data) and transition kernel K(xk |xk−1 ). We define the forward F and backward B operators on this chain as follows: F h(x1 ) h(y)K(y|x1 )dy = E(h(x2 )|x1 ) Bh(x2 ) 2 |y)π(y) dy h(y) K(xπ(x 2) = E(h(x1 )|x2 ) When the Markov chain is defined on a finite (discrete) space, F is just a matrix. For more generality, we need to work on the Hilbert space of zero mean and finite variance functions with respect to π, denoted as: L20 (x) = h(x) : h2 (x)π(x)dx< ∞, h(x)π(x)dx = 0 This is a space equipped with the inner product: h(x), g(x) = Eπ (h(x)g(x)) and consequently, the variance is varπ (h(x)) = h(x), h(x) = h(x) 2 and the correlation coefficient is corr(h(x1 ), g(x2 )) = g(x2 ) h(x1 ) , h(x1 ) g(x2 ) The induced operator norm is defined as F = sup h: h =1 Fh 23 Chapter 2. Tree Sampling The forward and backward operators are self adjoint F h(x), g(x) = h(x), Bg(x) . That is, the chain is reversible and satisfies detailed balance (Liu 2001). Iterating the Markov chain, we have F n h(x0 ) = n B h(xn ) = E(h(xn )|x0 ) (2.2) E(h(x0 )|xn ) (2.3) Since var(E(h(x1 )|x0 )) ≤ var(h(x1 )), F is a contraction operator F h ≤ h (its norm is bounded above by 1). Finally, we define the maximal correlation between two random variables as follows: γ(x2 , x1 ) sup corr (h(x2 ), g(x1 )) g: g <∞,h: h <∞ = 1 sup (var [Eπ (h(x2 )|x1 )]]) 2 h: h =1 = sup Eπ (h(x2 )|x1 ) (2.4) h: h =1 Substituting (2.2) and (2.3) into (2.4) results in the following important result (Liu 2001): Lemma 1 γ(x0 , xn ) = F n = B n . Note that the largest eigenvalue of F , when restricted to L20 , is the second largest eigenvalue of the standard Markov chain operator when restricted to L2 . Consequently, the reversibility of the Markov chain implies that F is self-adjoint and hence its norm is equal to the second largest eigenvalue of the standard Markov chain. Since F is also nonnegative, it has a decomposition F = A2 where A is a contracting, self-adjoint operator. We can use this to assert that the covariance between samples of the Markov chain decreases monotonically. That is, cov(h(xn ), h(x0 )) = F n h, h = An h, An h = An h 2 and by contraction, An h decreases monotonically. This implies that algorithms with a lower operator norm will result in less correlated samples. Moreover, they will converge at a faster geometric rate (Liu et al. 1994, Lemma 12.6.3): h(y)K n (y|x)P (x0 )dydx0 − Eπ (h(x)) ≤ c0 F n h Under the choice h(x) = IA (x) − π(A), where IA (x) is the indicator function of the set A, we have: |Pn (A) − π(A)|≤ c F n (2.5) That is, the Markov chain beliefs converge to the true beliefs with rate determined by F . Armed with these functional analysis tools, we attack the partitioning question. In (Liu et al. 1994) this question is posed as: given random variables U, V, W, which of the following sampling schemes yields a better estimate? Chapter 2. Tree Sampling 24 1. (U |V ), (V |U ) 2. (U |V, W ), (V, W |U ) 3. (U |V, W ), (V |U, W ), (W |U, V ) The first two schemes are both data augmentations, but variable W is “integrated out” in scheme 1. The norms of the forward operators of schemes 1 and 3 are smallest and largest respectively, so that the first scheme is favourable if the computational expense is justified. The last scheme is a general Gibbs sampler. We also aim to choose between sampling schemes to judge which MRF partition is better, but with some important differences. In our case, it is not merely a matter of more variables being conditioned on in one case than another. The problem is that the samples are multivariate and the variables are shuffled to different phases of the sampling from one scheme to the next. An example will soon clarify clarify this. Our comparison will focus the checker-board (CB) and two-tree sampling (TS) schemes shown in Figure 2.3. These two cases are extreme situations. Specifically, in the CB case, the node’s neighbors render the rest of the graph conditionally independent. In Figure 2.3, conditioning on the “gray” set means that at a given sampling instant, each “white” node only recieves input from its “gray” neighbors and vice versa for the gray nodes. Of course as sampling proceeds, information is passed globally accross the graph but we show that this performs quite poorly. In the two-tree case, if we condition on the “gray” set, then every node in the “white” set is influenced by the full gray set. To be precise and general, our notation must be expanded. We define the index sets of the fully-connected and checker-board schemes to be (W1 , W2 ), (V1 , V2 ) respectively (again, W1 ∪ W2 = V1 ∪ V2 = I) Our first result shows that the variance of the conditional expectation terms in the Rao-Blackwellised estimator is larger for the checker-board sampler than it is for tree sampling. Specifically, Proposition 1 Given the TS and CB sampling schemes, assume without loss of generality that node i belongs to set W1 in TS and set V1 in CB. Then var(E[h(Xi )|XW2 , y]) ≤ var(E[h(Xi )|XV2 , y]) Proof: See appendix. Proposition 1 is only applicable when a single component of a data augmentation block has information about the rest of the graph conditioned out. If we were sampling independently from the corresponding distributions, this would be a final answer to the variance reduction question, as shown in (Liu et al. 1994). However, in our case, we must consider functions of the full multivariate blocks of nodes not just of the single components Xi . As a simple illustration of this additional difficulty, consider Figure 2.4 showing a very simple 2x2 MRF sampled using the TS and CB schemes. Adjacent to each is a corresponding “unrolled” sampling sequence showing the spatiotemporal relationships of the sampled variables in the usual manner of graphical models. In the ovals are the variables corresponding to the sampling blocks; the 25 Chapter 2. Tree Sampling 1 2 1 2 1 2 3 4 4 3 4 3 V(0) 1 V(0) 2 V(1) 1 V(1) 2 1 2 1 3 1 3 3 4 2 4 2 4 (0) W1 (0) W2 (1) W2 (1) W1 Figure 2.4: Illustration of chessboard (top) and fully-connected (bottom) partitioning of variables for a 2x2 MRF. In the top and bottom schemes, the shaded nodes delineate variable sets V2 and W2 respectively. The chains beside each show the spatiotemporal dependencies of variables in the resulting Markov chains for each scheme. Theory and experiment reveal that the fully-connected scheme yields a superior RB estimator. Chapter 2. Tree Sampling 26 superscripts denote the time indices of the samples. Arrows indicate the sampling direction and reveal the “parent/child” temporal dependencies between variables. The undirected links in the ovals of the TS case reflect the spatial dependence of the variables in the blocks. In this example, our samples are multivariate. Hence, we need to compare functions of all the variables in a block, say h(x1 , x4 ) in TS against h(x1 , x2 ) in CB sampling. Here, there is the additional difficulty that the variables are shuffled to different times in the sampling schemes (e.g. x4 in TS does not match x2 in CB directly). Despite these additional difficulties, we show in the next theorem that the maximal correlation between functions of multivariate samples in CB is greater than the one between functions in TS. Theorem 1 Let the maximal correlations between samples drawn from TS and CB be γ(XW1 , XW2 ), γ(XV1 , XV2 ) respectively. Then under the stationary distribution γ(XW1 , XW2 ) ≤ γ(XV1 , XV2 ) Proof: See appendix. In (Liu et al. 1994) the case is made that conditioning on “more” variables (scheme 2 at the beginning of the section) can be a worse estimator, but the preceeding shows how a sampler can be better than another even though the same number of variables is sampled in both cases. Another interesting observation is that while the covariances of the TS sampler “instantly” involve the entire block of variables in the graph, those from CB have a sort of “propagation delay.” It is easier to appreciate this by picturing a larger graph, say 10x10 or bigger. In CB, the initial conditioning set for a particular node is its 4 neighbors, and the conditioining set for those 4 neighbors will be the original one and the immediately surrounding region, and so on. We observe a wave-like spreading of interaction with a delay of the order of half the size of the graph before the whole block of variables is involved, and many “reflected” covariance terms generated at each iteration. This is an additional cause of higher variance with regions being conditioned out. Finally, we state a result that may be more fundamental than that of Theorem 1. Instead of using correlations as a measure of depedency between samples (the standard approach in statistics), we propose the use of information theory measures to assess this dependency. The following theorem demonstrates how we can use information theory measures of dependency to compare MCMC sampling schemes. Theorem 2 Under the stationary distribution, the mutual information between samples generated from CB is larger than that between samples from TS: (0) (0) (0) (0) I(XV1 ; XV2 ) ≥ I(XW1 ; XW2 ) Proof: See appendix. Chapter 2. Tree Sampling 2.4 27 Numerical Results Our experiments were aimed at demonstrating the estimator dominance result of our tree sampler (TS). We compared TS against a plain Gibbs (PG) sampler and the checker-board (CB) data augmentation mentioned in a previous section. As we predict in the theoretical discussion of Section 2.3, we see that for small graphs, while TS does indeed outperform CB, the gain seems somewhat incremental. For larger graphs the true merit of tree sampling becomes obvious. We ran two sets of experiments where we applied the 3 samplers to a problem and assessed the variance or error. In the first experiment, for a small (10x10) MRF with isotropic potentials and each node having between 10 and 15 possible states, we approximated the expected value of each node using PG, CB and TS with the goal of assessing variance of the estimators. Each sampler was given 500 trials of 1200 iterations. The empirical variance of the mean estimator of each node was calculated over the trials, and records of the estimate against computation were obtained. Figure 2.5 shows a node-by-node comparison of the variances of the mean estimators of the samplers adjusted for computational time. To obtain this measure for “value for money” we penalized slow estimators by multiplying the raw estimator variance by the time factor by which they were greater than the fastest scheme. Thus, the PG was not penalized (fastest), and TS was penalized more than CB. Nonetheless our method is an improvement over the CB despite the penalty factor. Note how the variance of ours seems to “envelop” the chessboard sampler, precisely as predicted by the bounds. Per unit computation time, CB achieved a variance reduction factor of 10.41 over PG, but our sampler reduced it by 17.18. In Figure 2.6, we plot the estimated mean against computation time for a particular node in the MRF; the median estimate over the trials is plotted. A vital aspect to note are the standard deviation error bars accross trials. It is clear that the PG and, to a certain extent CB, are subject to wild swings in estimation. The thick error bars are from TS, showing how much more stable an estimator TS produces, yielding estimates that are more trustworthy as evinced by the narrow range across trials. We also notice in passing that we found that loopy belief propagation failed to converge for several random instance of these small MRFS. Our next experiment set was the classic “reconstruction” of states from noisy observations. We used a 50 × 50 pixel “patch” image (consisting of shaded, rectangular regions) with an isotropic 11-state prior model. Noise was added by randomly flipping states. Each sampler was run for 1000 iterations on each of 50 separate trials. An important aspect to assess is the sensitivity of the estimator, that is, is our good estimate a matter of luck or is it robust over trials? The plot in Figure 2.7 shows the median reconstruction error as a function of computation time showing that the gain is considerable. In fact in this case, the CB sampler is hardly distinguishable from PG, again a predictable consequence of the theoretical considerations. For larger graphs, far from expecting any kind of breakdown of these results, we predict that the difference will become even sharper. The error bars show that the low reconstruction error of our sampler is highly robust across trials compared to that of PG and CB. We also ran loopy 28 Chapter 2. Tree Sampling 0.12 Tree Sampler Plain Gibbs Chessboard Sampler Empirical variance 0.1 0.08 0.06 0.04 0.02 0 0 10 20 30 40 50 60 70 80 90 100 Node number Figure 2.5: Computation time-adjusted variance estimates of a 10x10 MRF as a function of the node for the 3 samplers discussed in the text. Note the upper-bounding effect; the “shapes” of the graphs are similar but the magnitudes are scaled, which is an expected consequence of Rao-Blackwellisation. 29 Chapter 2. Tree Sampling on these images, which took about the same number of iterations (around 30 passes through the MRF) to achieve the same error reduction as our method, suggesting that our method might be computationally comparable to loopy but guaranteed to converge. It is very important to realize that the gain is not merely due to “blocking”; the CB sampler is also a 2-block Rao-Blackwellised scheme, but does not take advantage of RB as well. 2.56 Plain Gibbs Checkerboard Gibbs Our sampler 2.54 2.52 Estimated Mean 2.5 2.48 2.46 2.44 2.42 2.4 100 200 300 400 500 600 Computation Time 700 800 900 1000 Figure 2.6: Convergence to the estimated mean of PG, CB, and our sampler for a specific node of a 10 × 10 MRF, for which LBP failed. The error bars centered on each line represent the standard deviation of the estimates across trials. PG has the widest bars and TS, shown with thick lines, has the narrowest. The plots are the medians of the trials. Note the considerable increase in stability and variance reduction in the case of our sampler, which outperforms the improvements afforded by CB. The error ranges for our sampler and CB always overlap, while PG sometimes fluctuates out of range. 2.5 Chapter Summary We proposed a new tree sampling algorithm for MRFs. We addressed the issue of comparing different partitioning schemes, ranging from checker-board (CB) sampling to two-tree sampling (TS). These two algorithms are extremes in a scale where TS exploits larger analytical coverage of the graph. We proved two fundamental theorems. Theorem 1 showed that the maximal correlation between samples drawn from the CB sampler is larger than the one for TS. This 30 Chapter 2. Tree Sampling Median reconstruction error vs. computations Our sampler Checkerboard Gibbs Plain Gibbs 0.8 0.7 Reconstruction Error 0.6 0.5 0.4 0.3 0.2 0.1 0 0 50 100 150 200 Computation Time 250 300 350 Figure 2.7: Reconstruction error against computation time for a 50×50 pixel 11-state image corrupted with noise. The 3 plots show the median reconstruction error history of PG, CB and our sampler over 50 runs. The bars represent the standard deviation of the error across these runs. Clearly, aside from achieving a lower error, our sampler is more robust, that is, consistently achieves these results. The gain of CB over PG in this case is negligible, again predictable from the theory of Rao-Blackwellisation 31 Chapter 2. Tree Sampling implies that TS exhibits a faster geometric convergence rate and less correlation between samples. Theorem 2 showed that information theory measures can be applied to the analysis of MCMC algorithms. In particular, it showed that he mutual information between samples from CB is higher than the one for TS. Our experimental results confirmed these assertions. As discussed in Chapter 1, the method has been extended to factor graphs by (Rivasseau 2005), who also proposed a useful heuristic for partitioning more general general graphs. The unified treatment has been presented in (Hamze et al. 2006). Additionally, (Zhang et al. 2007) have recently applied the method to inference in CRFs for multiagent reinforcement learning; the heuristic they used a random spanning tree to initialize the sampler rather than start from a random starting configuration. An alternative to avoiding local minima, which TS on its own is still prone to get trapped in, mentioned in Chapter 1, is to use the tree sampler as a component within a statistically-correct ensemble MCMC method such as Parallel Tempering (Hukushima and Nemoto 1996). Appendix: Proofs Proof of Proposition 1: Recall that for any random variables A, B, C, E[A|B] = E[E[A|B, C]|B] so that var(E[A|B]) ≤ var(E[A|B, C]). Our proof follows from this and the spatial Markov properties of the graph. We note that for the TS case, the expectation can be written as E[h(Xi )|xW2 , y] = E[E[h(Xi )|XV2 , XW2 , y]|XW2 , y] But by conditional independence, E[h(Xi )|xV2 , XW2 , y] = E[h(Xi )|XV2 , y] which is precisely the form of the expectation terms of the CB sampler. Thus the TS expectations correspond to an additional conditioning step over the CB expectations, and by the above relation for conditioning, the variance result follows. Proof of Theorem 1: We show this for the small 2x2 MRF shown in Figure 2.4; the extension to larger MRFs is inductive. The proof depends on the decomposition of the transition kernel in both cases; it will turn out that we can do a term-by-tem analysis of the resulting decompositions and show that some of the TS terms are CB terms with additional integration. Let the respective joint/conditional probabilities under CB and TS be pCB and pT S . For the CB sampler, the one-step transition kernel of the joint chain is: (1) (0) (0) (1) (1) (1) KCB (x(0) , x(1) ) = pCB (x1 |x2 , x3 )pCB (x2 |x1 , x4 ) (1) (1) (1) (1) (0) (0) × pCB (x3 |x1 , x4 )pCB (x4 |x2 , x3 ) 32 Chapter 2. Tree Sampling while for the TS sampler, the kernel is: (1) (0) (0) (1) (1) (0) (1) (1) (1) (1) (1) (1) KT S (x(0) , x(1) ) = pT S (x1 |x3 , x4 )pT S (x2 |x1 , x4 ) × pT S (x3 |x1 , x2 )pT S (x4 |x2 , x3 ) Using the reversibility of the augmentation chains, the assumption of stationarity, and the spatial properties of the graph, we see that all of the above conditional distributions are conditionals from π, the stationary distribution. For ex(1) (1) (1) (1) (1) (1) ample pCB (x2 |x1 , x4 ) = π(x2 |x1 , x4 ). Also note that pCB (x2 |x1 , x4 ) = (1) (1) (0) (1) (0) (0) (1) (1) (1) pT S (x2 |x1 , x4 ) = π(x2 |x1 , x4 ) and pCB (x4 |x2 , x3 ) = pT S (x4 |x2 , x3 ) = π(x4 |x2 , x3 ). By applying the spatial Markov property in “reverse,” (1) (0) (0) (1) (0) (0) (0) (1) (1) (1) (1) (1) (1) (1) π(x1 |x2 , x3 ) = π(x1 |x2 , x3 , x4 ) and π(x3 |x1 , x4 ) = π(x3 |x1 , x2 , x4 ) If we define the following functions (for conciseness domain variables are omitted.) j1 = j2 = j3 = j4 = (1) (0) (0) (1) (1) (1) (1) (1) (1) (1) (0) (0) (0) π(x1 |x2 , x3 , x4 ) π(x2 |x1 , x4 ) (1) π(x3 |x1 , x2 , x4 ) π(x4 |x2 , x3 ) and: k1 = k2 = k3 = k4 = (0) (0) (1) (0) (1) (1) (1) (1) (0) (0) (1) (0) j1 π(x2 |x3 , x4 )dx2 (1) π(x2 |x1 , x4 ) j3 π(x4 |x1 , x2 )dx4 (1) π(x4 |x2 , x3 ) then we can write the transition kernel of CB as: KCB (x(0) , x(1) ) = j1 j2 j3 j4 (2.6) KT S (x(0) , x(1) ) = k1 k2 k3 k4 (2.7) and that of TS as: If we compare the conditional expectation operators acting on a given function g(x) (= g(x1 , x2 , x3 , x4 )) under the two schemes, we have: ECB [g(X (1) )|x(0) ] = j1 j2 j3 (0) j4 g(x(1) )dx1:4 33 Chapter 2. Tree Sampling and: ET S [g(X (1) )|x(0) ] = k1 k2 (0) k4 g(x(1) )dx1:4 k3 Note that the expression for the TS is similar to CB’s but the conditional distributions of X1 and X3 (k1 and k3 ) are the integrated expressions given in (2.6). Clearly, as integration proceeds “outward” from X4 to X1 in both cases, the variance of the argument function is reduced by the same amount for the X2 and X4 integrations (it is the same operator in both cases due to the distributions being the same) but is reduced more in TS for the X1 and X3 operators due to the extra integration. It can be verified that this effect becomes progressively more pronounced as the graph gets larger, with a greater proportion of the terms having extra integrations, consistent with our experimental results. Thus the norm of the CB expectation operator is larger than TS’s, and by Theorem 3.2 of (Liu et al. 1994), the result on maximal correlation follows. Proof of Theorem 2: Let the conditional entropies be △ HT S = HCB = (1) (1) (0) (0) (1) (1) (0) (0) H(XW1 , XW2 |XW1 , XW2 ) △ H(XV1 , XV2 |XV1 , XV2 ) Using the same decomposition of the probabilities and stationarity/reversibility arguments, the spatial Markov properties used to prove Theorem 1, and the fact that conditioning reduces entropy, it can be shown that HT S ≥ HCB Using the temporal Markov properties, (1) (0) (0) (0) (0) (0) HCB = H(X1 |X2 , X3 ) + H(X2 |X1 , X4 ) + . . . (0) (0) (0) (1) (0) (0) H(X3 |X1 , X4 ) + H(X4 |X2 , X3 ) with an analogous expansion for HT S . Now since the joint entropy of both schemes is the same, and (0) (0) (0) (0) (0) (1) (0) I(XW1 ; XW2 ) (0) (1) (0) I(XV1 ; XV2 ) I(XW1 ; (XW1 , XW2 )) = I(XV1 ; (XV1 , XV2 )) = by the definition of joint entropy we obtain (0) (0) (0) (0) HT S + I(XW1 ; XW2 ) = HCB + I(XV1 ; XV2 ) By the stationarity of the chain, the mutual information result holds for any two successive samples drawn from TS and CB. 34 Chapter 3 Hot Coupling1 3.1 Chapter Overview Despite the intuitive appeal of the algorithm presented in the last chapter, there are several diverse applications of undirected models such as image analysis (Li 2001, Carbonetto and de Freitas 2003), conditional random fields (Lafferty, McCallum and Pereira 2001), neural models (Rumelhart, Hinton and Williams 1986) and epidemiology (Green and Richardson 2002) where the models are not obviously amenable to a decomposition strategy of the type discussed earlier. Additionally, the state spaces may be so complex that even the tree sampler can get trapped in “local” modes and slow down considerably in its exploration of the state space. While a straightforward conceivable remedy to this latter problem is to employ the tree sampler within an ensemble MCMC algorithm such as parallel tempering (Hukushima and Nemoto 1996), it does not address the issue of non-regular graph topologies. Finally, the tree sampler on its own is unable to yield estimates of the partition function; this is a shortcoming of MCMC methods in general(Green and Richardson 2002, Moller et al. 2004). This chapter introduces a method for approximating expectations of a pairwise graph’s variables (of which beliefs are a special case) and of reasonably estimating the partition function. Formally, given hidden variables x and observations y, the model is specified on a graph G(V, E), with edges E and M nodes V by: 1 ψ(xi , xj ) φ(xi , yi ) π(x, y) = Z i∈V (i,j)∈E where x = {x1 , . . . , xM }, Z is the partition function, φ(·) denotes the observation potentials and ψ(·) denotes the pair-wise interaction potentials, which are strictly positive but otherwise arbitrary. The partition function is: Z= ψ(xi , xj ) φ(xi , yi ) x i∈V (i,j)∈E where the sum is over all possible system states. We make no assumption about the graph’s topology or sparseness, an example is in Figure 3.1. We present experimental results on both fully-connected graphs (cases where each node neighbors every other node) and sparse graphs. 1 This section has appeared in the Proceedings of the 2005 Conference on Neural Information Processing Systems (NIPS) 35 Chapter 3. Hot Coupling xi ψ(xi ,x j ) xj y φ (xj ,y) Figure 3.1: A small example of the type of graphical model treated in this paper. Our approach belongs to the framework of Sequential Monte Carlo (SMC), which has its roots in the seminal paper of (Metropolis and Ulam 1949). Particle filters are a well-known instance of SMC methods (Doucet, de Freitas and Gordon 2001). They apply naturally to dynamic systems like tracking. Our situation is different. We introduce artificial dynamics simply as a constructive strategy for obtaining samples of a sequence of distributions converging to the distribution of interest. That is, initially we sample from and easy-to-sample distribution. This distribution is then used as a proposal mechanism to obtain samples from a slightly more complex distribution that is closer to the target distribution. The process is repeated until the sequence of distributions of increasing complexity reaches the target distribution. Our algorithm has connections to a general annealing strategy proposed in the physics (Jarzynski 1997) and statistics (Neal 1998) literature, known as Annealed Importance Sampling (AIS). AIS is a special case of the general SMC framework (Del Moral, Doucet and Peters 2004). The term annealing refers to the lowering of a “temperature parameter,” the process of which makes the joint distribution more concentrated on its modes, whose number can be massive for difficult problems. The celebrated simulated annealing (SA) (Kirkpatrick, Gelatt and Vecchi 1983) algorithm is an optimization method relying on this phenomenon; presently, however we are interested in integration and so SA does not apply here. Chapter 3. Hot Coupling 36 Our approach does not use a global temperature, but sequentially introduces dependencies among the variables; graphically, this can be understood as “adding edges” to the graph. In this chapter, we restrict ourselves to discrete state-spaces although a Gaussian version is straightforward and one involving arbitrary continuous distributions is conceivable but slightly more challenging. For our initial distribution we choose a spanning tree of the variables, on which analytic marginalization, exact sampling, and computation of the partition function are easily done. After drawing a population of samples (particles) from this distribution, the sequential phase begins: an edge of the desired graph is chosen and gradually added to the current one as shown in Figure 3.2. The particles then follow a trajectory according to some proposal mechanism. The “fitness” of the particles is measured via their importance weights. When the set of samples has become skewed, that is with some containing high weights and many containing low ones, the particles are resampled according to their weights. The sequential structure is thus imposed by the propose-and-resample mechanism rather than by any property of the original system. The algorithm is formally described after an overview of SMC and recent work presenting a unifying framework of the SMC methodology outside the context of Bayesian dynamic filtering(Del Moral et al. 2004). Figure 3.2: A graphical illustration of our algorithm. First we construct a weighted spanning tree, of which a population of iid samples can be easily drawn using the forward filtering/backward sampling algorithm for trees. The tree then becomes the proposal mechanism for generating samples for a graph with an extra potential. The process is repeated until we obtain samples from the target distribution (defined on a fully connected graph in this case). Edges can be added “slowly” using annealing. 3.2 Sequential Monte Carlo As shown in Figure 2, we consider a sequence of auxiliary distributions π1 (x1 ), π2 (x1:2 ), . . . , πn (x1:n ), where π1 (x1 ) is the distribution on the weighted spanning tree. As shown later in Section 2.1, the sequence of distributions can be constructed so that it satisfies πn (x1:n ) = πn (xn )πn (x1:n−1 |x1:n ) Marginalizing over x1:n−1 gives us the target distribution of interest πn (xn ) (the distribution of the graphical model that we want to sample from as illustrated in Figure 2 for n = 4). So we first focus on sampling from the sequence of auxiliary distributions. The joint distribution is only known up to a normalization constant: πn (x1:n ) = Zn−1 fn (x1:n ) 37 Chapter 3. Hot Coupling where Zn fn (x1:n )dx1:n is the partition function. We are often interested in computing this partition function and other expectations, such as I(g(x1:n )) = g(x1:n )πn (x1:n )dx1:n where g is a function of interest (e.g. g(x) = x if we are interested in computing the mean of x). N (i) If we had a set of samples x1:n from π, we could approximate this i=1 integral with the following Monte Carlo estimator π n (dx1:n ) = 1 N N δx(i) (dx1:n ) 1:n i=1 where δx(i) (dx1:n ) denotes the delta Dirac function. and consequently approxi1:n mate the expectations of interest with I(g(x1:n )) = 1 N N (i) g(x1:n ) i=1 This estimate converges almost surely to the true expectation as N goes to infinity. It is typically hard to sample from π directly. Instead, in a generalization of the simple illustrative case discussed in Chapter 1, we sample from a proposal distribution q and weight the samples according to the following importance ratio fn (x1:n ) qn−1 (x1:n−1 ) fn (x1:n ) = wn−1 wn = qn (x1:n ) qn (x1:n ) fn−1 (x1:n−1 ) The proposal distribution is constructed sequentially q(x1:n ) = qn−1 (x1:n−1 )qn (xn |x1:n−1 ) and, hence, the importance weights can be updated recursively wn = fn (x1:n ) wn−1 qn (xn |x1:n−1 )fn−1 (x1:n−1 ) (i) (3.1) (i) Given a set of N particles x1:n−1 , we obtain a set of particles xn by sampling (i) qn (xn |x1:n−1 ) from and applying the weights of equation (3.1). To overcome slow drift in the particle population, a resampling (selection) step chooses the fittest particles (see the introductory chapter in (Doucet et al. 2001) for a more detailed explanation). We use a state-of-the-art minimum variance resampling algorithm (Kitagawa 1996). The generic SMC algorithm is shown in Figure 3.3. Example 1 To make the presentation more accessible, this example reviews the standard use of SMC for optimal filtering. Consider the problem of Bayesian 38 Chapter 3. Hot Coupling Sequential importance sampling step • For i = 1, ..., N , sample from the proposal (i) (i) e1:n and set x en x ” (i) ∼ qn (xn |x1:n−1 ) “ (i) (i) en , x1:n−1 . x • For i = 1, ..., N , evaluate the importance weights (i) fn (e x1:n ) (i) wn = (i) (i) (i) (i) qn (e xn |e x1:n−1 )fn−1 (e x1:n−1 ) (i) • Normalize the importance weights w en = wn−1 (i) wn P (j) j wn Selection step • If the effective sample size has decreased below and acceptable threshold, resample o n (i) (i) N e0:n , w the discrete weighted measure x en to obtain an unweighted measure i=1 n o (i) 1 N x0:n , N of N new particles. i=1 Figure 3.3: Sequential Monte Carlo at iteration n. estimation of a hidden Markov process with states x and observations y. The target distribution is the posterior distribution πn (x1:n ) = p(x1:n |y1:n ) = Zn−1 fn (x1:n ) where Zn = p(y1:n ) and n fn (x1:n ) = p(x1:n , y1:n ) = k=1 p(yn |xn )p(xn |xn−1 ) Substituting these expressions into equation (3.1), we obtain the familiar recursive updates used in particle filtering: wn = p(yn |xn )p(xn |xn−1 ) wn−1 q(xn |x1:n−1 , y1:n ) In particular, if we choose the proposal to be the transition prior p(xn |xn−1 ), then the importance weights are simply the likelihood functions p(yn |xn ). We can therefore think of particle filtering as a process of sampling from a graph that is growing with time. This intuition is key to the method presented here. The ratio of successive partition functions can be easily estimated using this Chapter 3. Hot Coupling 39 algorithm as follows: Zn Zn−1 fn (x1:n )dx1:n Zn−1 f (x1:n ) πn−1 (x1:n−1 )dx1:n fn−1 (x1:n−1 ) = = wn πn−1 (x1:n−1 )qn (xn |x1:n−1 )dx1:n = N ≈ (i) wn(i) wn−1 (3.2) i=1 where wn = fn (x1:n ) qn (xn |x1:n−1 )fn−1 (x1:n−1 ) and Z1 can be easily computed as it is the partition function for a tree. 3.2.1 The general SMC framework We can choose a (non-homogeneous) Markov chain with transition kernel Kn (xn−1 , xn ) as the proposal distribution qn (xn |x1:n−1 ). Hence, given an initial proposal distribution q1 (·), we have joint proposal distribution at step n: n qn (x1:n ) = q1 (x1 ) Kk (xk−1 , xk ) (3.3) k=2 It is convenient to assume that the artificial distribution πn (x1:n−1 |xn ) is also n−1 the product of (backward) Markov kernels: πn (x1:n−1 |xn ) = k=1 Lk (xk+1 , xk ) (Del Moral et al. 2004). Under these choices, the (unnormalized) incremental importance weight becomes: wn ∝ fn (xn )Ln−1 (xn , xn−1 ) fn−1 (xn−1 )Kn (xn−1 , xn ) (3.4) Different choices of the backward Kernel L result in different algorithms (Del Moral et al. 2004). For example, the choice: Ln−1 (xn , xn−1 ) = fn (xn−1 )Kn (xn−1 , xn ) fn (xn ) results in the AIS algorithm, with weights wn ∝ fn (xn−1 ) fn−1 (xn−1 ) However, we should point out that this method is slightly more general as one can carry out resampling. Note that in this case, the importance weights do not Chapter 3. Hot Coupling 40 depend on xn and, hence, it is possible to do resampling before the importance sampling step. This often leads to huge reduction in estimation error (de Freitas, Dearden, Hutter, Morales-Menendez, Mutch and Poole 2004). Using asymptotic variance results, (Del Moral et al. 2004) propose a different choice of backward kernel, which results in the following incremental importance weights: fn (xn ) wn ∝ (3.5) fn−1 (xn )Kn (xn−1 , xn )dxn−1 The integral in the denominator can be evaluated when dealing with Gaussian or reasonable discrete networks, as we will see in the following section. 3.3 The Algorithm We could try to perform traditional importance sampling by seeking some proposal distribution for the entire graph. This is very difficult and performance degrades exponentially in dimension if the proposal is mismatched (Bucklew 1986). One may imagine that in some cases a spanning tree of the variables may constitute a proposal. In fact, in a pairwise graph the importance weight is straightforwardly seen to be the product of potentials of the edges in the graph but not a tree. Clearly then, a spanning tree is a reasonable proposal only to graphs that are not too far from being a spanning tree, which is a very restrictive assumption on the desired graph distribution. We propose, however, to use the samples from the tree distribution as candidates to an intermediate target distribution, consisting of the tree along with a “weak” version of a potential corresponding to some edge of the original graph. Formally, if given an initial set of edges G0 which form a spanning tree, the initial proposal is π0 (x0 ) = 1 Z0 ψ(xi,0 , xj,0 ) φ(xi,0 , yi,0 ) i∈V (3.6) (i,j)∈G0 Using the belief propagation equations (Pearl 1988) and bottom-up propagation, top-down sampling (Wilkinson and Yeung 2002, Carter and Kohn 1994), we draw a set of N independent samples from the tree. Computation of the normalization constant Z0 is also straightforward and efficient in the case of trees using a sum-product recursion. From then on, however, the normalization constants of subsequent target distributions cannot be analytically computed. We then choose a new edge e1 from the set of “unused” edges E − G0 and add it to G0 to form the new edge set G1 = e1 ∪ G0 . Let the vertices of e1 be u1 and v1 . Then, the intermediate target distribution π1 is proportional to π0 (x1 )ψe1 (xu1 , xv1 ). In doing straightforward importance sampling, using π0 as a proposal for π1 , the importance weight is proportional to ψe1 (xu1 , xv1 ). We adopt a slow proposal process to move the population of particles towards π1 (and subsequently to other π). We gradually introduce the potential between Xu1 and Xv1 via a coupling parameter α which increases from 0 to 1 in order 41 Chapter 3. Hot Coupling to “softly” bring the edge’s potential in and allow the particles to adjust to the new environment. Formally, when adding edge e1 to the graph, we introduce a number of coupling steps so that we have the intermediate target distribution: αn π0 (x0 ) [ψe1 (xu1 , xv1 )] where αn is defined to be 0 when a new edge enters the sequence, increases to 1 as the edge is brought in, and drops back to zero when another edge is added at the following edge iteration. (This definition saves us a more cumbersome notation involving an intermediate time index.) At each time step, we want a proposal mechanism that is close to the target distribution. Proposals based on simple perturbations, such as random walks, are easy to implement, but can be inefficient. Metropolis-Hastings proposals are not possible because of the integral in the rejection term. We can, however, employ a single-site Gibbs sampler with random scan whose invariant distribution at each step is the the next target density in the sequence; this kernel is applied to each particle. When an edge has been fully added a new one is chosen and the process is repeated until the final target density is the full graph. (In the experiments presented here, we do not address the issue of edge ordering, but it is an interesting problem which seems to be amenable in many cases to sequential optimization techniques.) An analytic expression for the incremental weights corresponding to Equation (3.5) is given by: (i) wn(i) = (i) ψen (xun , xvn ) M−2 M (i) (i) αn ψen (xun , xvn ) + 1 M αn+1 (i) (i) ψen (xun , xvn ) (3.7) αn+1 Rz where M is the total number of variables in the graph, △ Rz = Zun|N (un ) (αn ) Zvn|N (vn ) (αn ) + Zun|N (un ) (αn+1 ) Zvn|N (vn ) (αn+1 ) and the local normalization constants of the new-edge nodes are simply φv (xv ) Zv|N (v) (α) = xv (i) (i) α ψv,k (xv , xk ) ψv,j (xv , xj ) k∈Nv \j where if v = un then j = vn and vice versa. Equation (3.7) is straightforward to compute from the definition of the local random-scan Gibbs kernel and substitution into the importance weight equation. The derivation is omitted for brevity. The conditional distributions are easily computed for Gibbs sampling and belief propagation. It is theoretically possible to calculate the marginal for several steps ahead, but generally this becomes infeasible past the 1-step case presented here. To alleviate potential confusion with MCMC, while any one particle obviously forms a correlated path, we are using a population and are making no assumption or requirement that the chains have converged as is done in MCMC Chapter 3. Hot Coupling 42 as we are correcting for incomplete convergence with the weights. Application of the Gibbs step results in the population taking a step “toward” the new π. The importance weights quantify how well the population is doing. Unfortunately, we are dealing with a growing state space so the variance is likely to increase. By monitoring the effective sample size (ESS) (Liu 2001), which is inversely proportional to the variance of the estimators, we choose to resample when the particles have become degenerate, say when the ESS has dropped to N/2 (The ESS is at most N and at least 1.) Resampling too frequently is a bad idea and can lead to impoverishment of the particle population diversity. 3.4 Experimental Results Four approximate inference methods were compared: our SMC method with sequential edge addition (Hot Coupling (HC)), a more typical annealing strategy with a global temperature parameter(SMCG), single-site Gibbs sampling with random scan and loopy belief propagation. SMCG can be thought of as related to HC but where all the edges and local evidence are annealed at the same time. The majority of our experiments were performed on graphs that were small enough for exact marginals and partition functions to be exhaustively calculated. However, even in toy cases MCMC and loopy can give unsatisfactory and sometimes disastrous results. We also ran a set of experiments on a relatively large MRF. For the small examples we examined both fully-connected (FC) and square grid (MRF) networks, with 18 and 16 nodes respectively. Each variable could assume one of 3 states. Our pairwise potentials corresponded to the well-known Potts model : 1 1 ψi,j (xi , xj ) = e T Jij δxi ,xj , φi (xi ) = e T Jδxi (yi ) We set T = 0.5 (a low temperature) and tested models with uniform and positive Jij , widely used in image analysis, and models with Jij drawn from a standard Gaussian; the latter is an instance of the much-studied spin-glass models of statistical physics which are known to be notoriously difficult to simulate at low temperatures (Newman and Barkema 1999). Of course fully-connected models are known as Boltzmann machines (Rumelhart et al. 1986) to the neural computation community. The output potentials were randomly selected in both the uniform and random interaction cases. The HC method used a linear coupling schedule for each edge, increasing from α = 0 to α = 1 over 100 iterations; our SMCG implementation used a linear global cooling schedule, whose number of steps depended on the graph in order to match those taken by SMCG. All Monte Carlo algorithms were independently run 50 times each to approximate the variance of the estimates. Our SMC simulations used 1000 particles for each run, while each Gibbs run performed 20000 single-site updates. For these models, this was more than enough steps to settle into local minima; runs of up to 1 million iterations did not yield a difference, which is characteristic of the exponential mixing time of the sampler on these graphs. For our HC method, spanning trees and edges in the sequential construction were randomly 43 Chapter 3. Hot Coupling Method HC SMCG Gibbs MRF Random Ψ Error Var 0.0022 0.012 0.0001 0.03 0.0003 0.014 MRF Homo. Ψ Error Var 0.0251 0.17 0.2789 10.09 0.4928 200.95 FC Random Ψ Error Var 0.0016 0.0522 0.127 0.570 0.02 0.32 FC Homo. Ψ Error Var 0.0036 0.038 0.331 165.61 0.3152 201.08 Figure 3.4: Approximate magnetization for the nodes of the graphs, as defined in the text, calculated using HC, SMCG, and Gibbs sampling and compared to the true value obtained by brute force. Observe the massive variance of Gibbs sampling in some cases. Method HC SMCG loopy MRF Random Ψ Error Var 0.0105 0.002 0.004 0.005 0.005 - MRF Homo. Ψ Error Var 0.0227 0.001 6.47 7.646 0.155 - FC Random Ψ Error Var 0.0043 0.0537 1800 1.24 1 - FC Homo. Ψ Err Var 0.0394 0.001 1 29.99 0.075 - Figure 3.5: Approximate partition function of the graphs discussed in the text calculated using HC, SMCG, and Loopy Belief Propagation (loopy.) For HC and SMCG are shown the error of the sample average of results over 50 independent runs and the variance across those runs. loopy is of course a deterministic algorithm and has no variance. HC maintains a low error and variance in all cases. chosen from the full graph; the rationale for doing so is to allay any criticism that “tweaking” the ordering may have had a crucial effect on the algorithm. The order clearly would matter to some extent, but this will be examined in later work. Also in the tables by “error” we mean the quantity |ˆa−a| where a ˆ is a an estimate of some quantity a obtained exactly (say Z). First, we used HC, SMCG and Gibbs to approximate the expected sum of our graphs’ variables, the so-called magnetization: M xi ] m = E[ i=1 We then approximated the partition functions of the graphs using HC, SMCG, and loopy.2 We note again that there is no obvious way of estimating Z using Gibbs. Finally, we approximated the marginal probabilities using the four approximate methods. For loopy, we only kept the runs where it converged. Figure 3.4 shows the results of the magnetization experiments. On the MRF with random interactions, all three methods gave very accurate answers with small variance, but for the other graphs, the accuracies and variances began to diverge. On both positive-potential graphs, Gibbs sampling gives high error and huge variance; SMCG gives lower variance but is still quite skewed. On the fully-connected random-potential graph the 3 methods give good results but HC has the lowest variance. Our method experiences its worst performance on the homogeneous MRF but it is only 2.5% error! Figure 3.5 tabulates the approximate partition function calculations. Again, 2 Code for Bethe Z approximation kindly provided by Kevin Murphy. 44 Chapter 3. Hot Coupling for the MRF with random interactions, the 3 methods give estimates of Z of comparable quality. This example appeared to work for loopy, Gibbs, and SMCG. For the homogeneous MRF, SMCG degrades rapidly; loopy is still satisfactory at 15% error, but HC is at 2.7% with very low variance. In the fullyconnected case with random potentials, HC’s error is 0.43% while loopy’s error is very high, having underestimated Z by a factor of 105 . SMCG fails completely here as well. On the uniform fully-connected graph, loopy actually gives a reasonable estimate of Z at 7.5%, but is still beaten by HC. Fully−Connected Random Fully−Connected Homogeneous 1.5 Loopy 1 SMCG 0.5 Gibbs Variational distance Variational distance 1.5 SMCG Gibbs 1 0.5 HC 0 0 Grid Model Random Grid Model Homogeneous 1.5 Variational distance Variational distance 1.5 1 0.5 0 Loopy HC HC SMCG Gibbs Loopy Gibbs 1 SMCG 0.5 Loopy HC 0 Figure 3.6: Variational(L1 ) distance between estimated and true marginals for a randomly chosen node in each of the 4 graphs using the four approximate methods (smaller values mean less error.) The MRF-random example was again “easy” for all the methods, but the rest raise problems for all but HC. Figure 3.6 shows the variational (L1 ) distance between the exact marginal for a randomly chosen node in each graph and the approximate marginals of the 4 algorithms, a common measure of the “distance” between 2 distributions. For the Monte Carlo methods (HC, SMCG and Gibbs) the average over 50 independent runs was used to approximate the expected L1 error of the estimate. All 4 methods perform well on the random Ψ MRF. On the MRF with homogeneous Ψ, both loopy and SMCG degrade, but HC maintains a low error. Among the FC graphs, HC performs extremely well on the homogeneous Ψ and surprisingly loopy does well too. In the random Ψ case, loopy’s error increases dramatically. 45 60 60 50 50 40 40 Sample Average Sample Average Chapter 3. Hot Coupling 30 30 20 20 10 10 0 0 100 200 300 400 500 600 0 0 2 4 6 8 10 Figure 3.7: An example of how MCMC can get “stuck:” 3 different runs of a Gibbs sampler estimating the magnetization of FC-Homogeneous graph. At left are shown the first 600 iterations of the runs; after a brief transient behaviour the samplers settled into different minima which persisted for the entire duration (20000 steps) of the runs. Indeed for 1 million steps the local minima persist, as shown at right. Our final set of simulations was the classic Mean Squared reconstruction of a noisy image problem; we used a 100x100 MRF with a noisy “patch” image (consisting of shaded, rectangular regions) with an isotropic 5-state prior model. The object was to calculate the pixels’ posterior marginal expectations. We chose this problem because it is a large model on which loopy is known to do well on, and can hence provide us with a measure of quality of the HC and SMCG results as larger numbers of edges are involved. From the toy examples we infer that the mechanism of HC is quite different from that of loopy as we have seen that it can work when loopy does not. Hence good performance on this problem would suggest that HC would scale well, which is a crucial question as in the large graph the final distribution has many more edges than the initial spanning tree. The results were promising: the mean-squared reconstruction error using loopy and using HC were virtually identical at 9.067 × 10−5 and 9.036 × 10−5 respectively, showing that HC seemed to be robust to the addition of around 9000 edges and many resampling stages. SMCG on the large MRF did not fare as well. It is crucial to realize that MCMC is completely unsuited to some problems; see for example the “convergence” plots of the estimated magnetization of 3 independent Gibbs sampler runs on one of our “toy” graphs shown in Figure 3.7. Such behavior has been studied by Gore and Jerrum (Gore and Jerrum 1996) and others, who discuss pessimistic theoretical results on the mixing properties of both Gibbs sampling and the celebrated Swendsen-Wang algorithm in several cases. To obtain a good estimate, MCMC requires that the process “visit” each of the target distribution’s basins of energy with a frequency representative of Chapter 3. Hot Coupling 46 their probability. Unfortunately, some basins take an exponential amount of time to exit, and so different finite runs of MCMC will give quite different answers, leading to tremendous variance. The methodology presented here is an attempt to sidestep the whole issue of mixing by permitting the independent particles to be stuck in modes, but then considering them jointly when estimating. In other words, instead of using a time average, we estimate using a weighted ensemble average. The object of the sequential phase is to address the difficult problem of constructing a suitable proposal for high-dimensional problems; to this the resampling-based methodology of particle filters was thought to be particularly suited. For the graphs we have considered, the single-edge algorithm we propose seems to be preferable to global annealing. 3.5 Chapter Summary and Outlook The experiments in this chapter revealed that, for the models we considered, an sequence of distributions of growing edges can perform quite well, and it was natural to try to somehow “optimize” the addition of edges in such a way as to ensure a favourable guiding of the population towards the high probability regions of the target. As discussed in Section 1.3.2, a reasonable way of intelligently adding edges was to consider a (possibly randomized) strategy where the edges are chosen based on the resulting energy of the state on the new model. Such a strategy, which resembled so-called constructive search heuristics, did indeed result in considerable gain when considering models at low temperatures. However we were dissuaded from pursuing that approach when we surprisingly discovered that merely perturbing a locally optimal state with random noise was sufficient to enable a Gibbs sampler to find states of significantly lower energy (around 15% on our 500-node, complete test graphs) than those that were obtained with the constructive search heuristic. We decided that it would be more fruitful if we devoted some effort instead to trying to make a sampler that enabled large changes of state that also yielded correct statistical estimates at nonzero temperatures, or in other words, was not merely an optimization algorithm. That is the subject of the next chapter. 47 Chapter 4 Large-Flip Importance Sampling1 4.1 Introduction The shortcomings of the preceding chapters can be generally grouped into either making strong assumptions on the form of the model, for example the pairwise form of the potentials or the regularity of the graph structure, or still being subject to being overwhelmed by very complex state-spaces. This chapter details work that tries to circumvent both of these problems (within reason- many of the tasks turn out to be NP Hard.) The only assumptions we make here is that the model possesses a discrete state-space, and that we can calculate the conditional distribution of each variable with the remaining variables instantiated. Obviously, the pairwise-potential applications mentioned earlier, such as Boltzmann machines (Ackley et al. 1985), densely connected conditional random fields (Lafferty et al. 2001) and densely connected graphs arising in semi-supervised learning (Getz et al. 2005), belong to the class of problems we target. More general discrete-valued models, however, such as Bayesian networks, can now be treated. Another less obvious application domain arises when dealing with models with discrete and continuous variables, where the continuous variables can be integrated out analytically. This domain is very broad; we give a few examples of members of this class subsequently. The first example is mixture models. For example, in a mixture of Gaussians, one can integrate out the mixture proportions, means and variances and end up with an un-normalized “densely connected” distribution, which depends only on the discrete mixture component indicator variables (Liu, Zhang, Palumbo and Lawrence 2002). After sampling these indicators, one can compute the other variables analytically. This is often referred to as Rao-Blackwellization or Collapsed Gibbs sampling. Likewise, in latent Dirichlet allocation (Blei, Ng and Jordan 2003) one also ends up with a distribution in terms of indicators (involving ratios of Gamma functions) after carrying out analytical integration. A final example to illustrate this point is variable selection with kernel regression models (Tham, Doucet and Ramamohanarao 2002). Here, after integrating out the kernel coefficients and noise covariance, one is left with a densely connected distribution over the binary variables indicating the presence of features. 1 This section has appeared in shorter form in the Proceedings of the 2007 Conference on Uncertainty in Artificial Intelligence (UAI) Chapter 4. Large-Flip Importance Sampling 48 In all these “densely connected” models, sampling the discrete variables is well known to be a very demanding task. Variational methods also tend to exhibit estimation problems in these domains (Carbonetto and de Freitas 2006, Hamze and de Freitas 2005). The target distributions in these models tend to have many modes separated by low probability barriers that cannot be crossed by standard moves. MCMC methods often get “stuck” in modes of the target distribution and fail to accurately sample from all relevant regions in the time allotted to the experiment. For example, as reported in (Munoz, Novotny and Mitchell 2003), it could take 1010 minutes for a dynamic Metropolis algorithm to leave a metastable state in an Ising model. To surmount this problem, approaches based on ingenious domain specific proposal distributions have been proposed. We are however interested in developing general methods. With this goal in mind, the most successful methods seem to be cluster techniques, such as the Swendsen Wang algorithm (Swendsen and Wang 1987), population-based methods and multicanonical Monte Carlo (Iba 2001, Newman and Barkema 1999). Theoretical studies have shown that the so-called mixing time diverges exponentially fast in the Swendsen-Wang algorithm (Gore and Jerrum 1996). Naive parallel chain methods are often wasteful and computationally expensive (Iba 2001). Multicanonical methods require the estimation of the density of states and this is no easy task; see (Iba 2001) for an excellent review. Distinct from plain parallel MCMC are the Sequential Monte Carlo, methods (Jarzynski 1997, del Moral et al. 2006) that use importance sampling in a joint, artificially-imposed temporal state-space to compensate for the effect of incomplete convergence of the proposal processes. Unfortunately, such methods are also limited by the proposal processes used, and can only do as well as importance sampling with the (temporal) marginal distribution of the set at the last step of the sequence. In general this marginal is not known, so weighting is done in the joint temporal space, incurring additional variance. While for systems with a relatively simple state-space, say with relatively few but widelyseparated modes, these methods can yield great improvements, when the target space becomes very rough, we expect the situation to degenerate. In the former case, there are few “obstructions” in the way of the particles’ movement towards the high density regions, though once they reach them leaving may take a long time. In contrast in the latter case, which occurs in Bayesian modeling, statistical physics and biopolymer simulation, the number of particles is tiny compared to the number of modes that can trap them, and so it again takes an exponentially long time to reach the “important” regions. In this chapter, we proposed a novel approach to this problem. It is motivated by the N-Fold Way sampler proposed in (Bortz et al. 1975). This sampler has an ingenious way of avoiding the situation of being trapped in any specific state. It can however get easily trapped in cycles. In our approach we surmount this problem partially by forbidding the chain from jumping to recently visited states. This process is no longer Markov and introduces bias. We correct for the bias by adopting a mixture of Gibbs proposal distributions, initialized at our biased process, within an importance sampler with the right target distribution. Chapter 4. Large-Flip Importance Sampling 4.2 49 Preliminaries and Notation We are interested in sampling from the following target distribution: π(x) = e−βE(x) . Z(β) (4.1) In general we can only evaluate π up to a constant and refer to this unnormalized function as π ˜ . Here, E(·) denotes the energy, β the inverse temperature and Z the partition function. We assume the following in the problems we consider. First, that x is a M dimensional vector, each component of which can assume q values in the set D {D1 . . . Dq }, so that the state space Ω of π is of size q M . In the sequel, we often refer to different values of x in a sequence, in which case xt,i means the ith component of xt . Second, we can readily calculate and sample from the conditional distributions of π. If the set of all variables is M {1, 2, . . . , M }, then the conditionals πi (xi |x{M\i} ) are available in that sense. Here, x{M\i} (x1 , . . . , xi−1 , xi+1 , . . . , xM ). In this chapter we assume that readers are familiar with constructing and sampling from the local conditionals of models with distributions of the form 4.1 such as Markov Random Fields (MRFs.) Our tasks are inference, the computation of quantities expressible as Eπ [h(X)] (4.2) and normalization, the estimation of the unknown constant e−βE(x) Z(β) (4.3) x Marginalization is a special case of equation 4.2. Needless to say, obtaining exact solutions to these quantities is out of the question in general for realisticallysized models. 4.3 Event-Driven Monte Carlo We motivate our algorithm by first considering a beautiful restructuring of traditional single-variable MCMC algorithms that can result in dramatic simulation efficiency gains. Algorithms using this methodology comprise a family known as Event-Driven Monte Carlo, of which the first instance, due to (Bortz et al. 1975), is known as the N-Fold Way (NFW). The family is sometimes called Rejectionless Monte Carlo, though we prefer to use the former since, as is clear in this paper, the methodology can apply to MCMC algorithms such as the Gibbs sampler that do not technically have a “rejection.” Although our approach diverges from the spirit of these methods, particularly by forsaking any interest in having a proper MCMC process, it is worth looking at them in some Chapter 4. Large-Flip Importance Sampling 50 detail since the exhibition gives some insight into the problems that arise while running MCMC and how our approach attempts to circumvent them. Traditional single-site MCMC, say using a Gibbs or Metropolis process, proceeds by choosing a site at random and sampling a new value according to a local transition probability. In this work we focus on using the Gibbs sampler without loss of generality; adaptation to the (plain) Metropolis method is straightforward. The states which are the same as a given state x but for one component are called the single-site neighbours of x. Formally, the random-site Gibbs sampler transition kernel KG (Xt = xt |Xt−1 = xt−1 ) is defined as: 1 M M i=1 πi (xt,i |Xt−1,{M\i} = xt−1,{M\i} )δ(xt−1,{M\i} ) (Xt,{M\i} ). (4.4) The kernel is an equal-weight mixture of update densities. The delta functions in each mixture component are to specify that Xt,{M\i} = Xt−1,{M\i} ; Xt,i is sampled from the local conditional. If the current state is such that all potential neighbors are energy-gaining moves, then at low temperatures, Xt will very likely be xt−1 ; the system is forced to “wait” in the same state until the low-probability event that one of the uphill neighbors is transitioned to. This perspective on the single-site MCMC motivates the NFW. If one can calculate the probability that a variable will change, then instead of repeatedly taking MCMC steps, we can sample a waiting time τ from this probability, and then sample from the conditional distribution of the neighboring states given that a change occurred. This gives a statistically equivalent process to the conventional MCMC sampler; the state xt−1 can be “replicated” for τ steps and a new state xt+τ is then sampled from this conditional. We now present the NFW more specifically. First, for notational compactness, define: ζi (x′ ; xt−1 ) 1 πi (x′ |Xt−1,{M\i} = xt−1,{M\i} ). M (4.5) ζi (x′ ; xt−1 ) is simply the probability that variable i is selected uniformly at random and receives sampled value x′ (which could, of course be the same as xt−1,i ); all other variables stay as they were at t − 1. The problem of excessive waiting time referred to previously occurs when for each i ∈ M, ζi (x′ ; xt−1 ) is overwhelmingly peaked on xt−1,i , while the remaining q − 1 values that Xt,i can assume are all unlikely. Let the set of values that variable i can assume if it changes value from xt−1,i be Fi (xt−1 ) {D\xt−1,i }. We will sometimes refer to the operation of changing a variable’s value as flipping it, though the term is best-suited to the binary (q = 2) case. Now let ζi (x′t,i ; xt−1 ). (4.6) αi (xt−1 ) x′t,i =xt−1,i That is, αi (xt−1 ) is the probability that variable i is chosen by the Gibbs sampler and is changed from its value of xt−1,i ; thus for each i ∈ M there are q − 1 Chapter 4. Large-Flip Importance Sampling 51 terms in the summation (4.6). Once again, in a local minimum and at high β, these probabilities are low for all i. Now define the marginal probability that a change occurs during a Gibbs step to be: pf lip (xt−1 ) P r(Xt = xt−1 ) = xt =xt−1 KG (xt |Xt−1 = xt−1 ). (4.7) M It is clear that pf lip (xt−1 ) = i=1 αi (xt−1 ). The event that a change in state occurs can then be seen as the first occurrence of a “success” in a series of Bernoulli trials with success probability pf lip (xt−1 ); the time until such an event, is of course, distributed according to a geometric distribution, from which samples can be drawn very efficiently (Devroye 1986). Finally, let us create a discrete distribution out of the set of (q − 1)M values that the ζi (x′ ; xt−1 ) can assume for i ∈ M, x′ = xt−1,i . Let νˆ(i, x; xt−1 ) and ν(i, x; xt−1 ) ζi (x; xt−1 ) if x ∈ Fi (xt−1 ) 0 otherwise νˆ(i, x; xt−1 ) νˆ(j, x′ ; xt−1 ) j (xt−1 ) (4.8) (4.9) j∈M,x′ ∈F The reader should convince herself that ν(i, x; xt−1 ) is the posterior probability of the joint event “variable i assumes value x” given that a change in state occurred, i.e. when xt = xt−1 . The posterior will favor sampling the “best” of the neighbors of state xt−1 , i.e. those resulting in the smallest energy gain. At high beta, these will be chosen the vast majority of the time. Sampling from this distribution is O(M (q − 1)) in the worst case, though many optimizations are possible in certain cases, such as graphical models of sparse connectivity or when the set of possible moves can be placed into one of a small set of classes, as was done in the original paper of (Bortz et al. 1975). In this work we describe experiments on systems where such optimizations are not possible as far as we could tell, namely densely-connected graphical models with continuous-valued parameters, so the optimizations will not be discussed here. Thus, one possible statement of the NFW is as follows. First, choose a number T − 1 of flip moves to make, so that the process including the initial ˆ0, . . . , X ˆ T , each having the same state has T steps. Let us define a process X domain as the original problem. The indices here correspond to flipping moves and not the “real” time of the simulation. ˆ 0 is the initial (randomly chosen) state, and X ˆ n is sampled from ν(i, x; xn−1 ). X Furthermore, define a sequence of time variables Θ0 , . . . , ΘT . The algorithm is constructed as shown in Figure 4.1. The states in Monte Carlo time can then ˆ i . The effective length of the simbe read off from this data: XΘi ...Θi+1 −1 = X ulation, i.e. that of an equivalent direct Gibbs sampler, is ΘT . The overall complexity of the method without any optimizations is O((q − 1)M ) per step, and is thus O((q − 1)M T ) for all T steps. Chapter 4. Large-Flip Importance Sampling 52 ˆ 0 and Θ0 = 0. 1. Set X0 = X 2. For i = 1 . . . T (a) Sample τ ∼ Geometric(pf lip (xi−1 )) ˆ i ∼ ν(i, x; xi−1 ) (b) Sample X (c) Set Θi = Θi−1 + τ Figure 4.1: N-Fold Way sampler. It should also be mentioned that one can apply this strategy to the simulated annealing (Kirkpatrick et al. 1983) optimization heuristic by simply replacing the local conditionals πi (x|Xn−1,{M\i} = xn−1,{M\i} ) in the preceding calculations with their annealed versions πiγn (x|Xn−1,{M\i} = xn−1,{M\i} ) where γn is the temperature parameter chosen to “cool” according to some schedule. Note that in this event-driven formulation, any annealing schedule is effectively such that in a conventional implementation, the temperature is held constant at γn between the times of the (n − 1)th and nth flips. Also in most optimization contexts, one is not interested in the times Θi of the flips; the method reduces to sampling from annealed versions of (4.9), which effectively performs a stochastic best-improving search that becomes less “soft” as the annealing continues. The need for computing the flip (or equivalently rejection) probabilities in the NFW creates difficulties when considering continuous variables. Often these are not integrable. However for simple continuous distributions, some progress has been made (Munoz et al. 2003). While the NFW is an example of how a clever perspective can yield great practical improvements, if it solved the problem of simulating from complex distributions, then this paper need not have been written. A severe shortcoming might have suggested itself to the reader in the description of sampling from the conditional (4.9). If this conditional favors choosing the best of the (bad) neighbors in a local minimum, then unless that move results in favourable moves other than the original local minimum, the next sample from (4.9) will almost certainly re-select the original local minimum since it is now a favourable point from the new one. Breaking out of this cycle can take an extremely long time. An example of this behavior is shown in Figure 4.2; the simulation shows evolution of the energy of a 25 node fully-connected binary spin glass with Gaussian-generated parameters at the low temperature β = 5. (this model will be detailed in a coming section.) Note how two different runs of the algorithm get stuck in minima of different values, a very undesirable property of a Monte Carlo method. A conceivable remedy to this problem is to include larger than single-flip neighborhoods into the algorithm. Indeed it may be possible that flipping a 53 Chapter 4. Large-Flip Importance Sampling 5 0 Trial 1 Trial 2 Energy Global Minimum −5 −10 −15 −20 0 200 400 600 Iteration 800 1000 Figure 4.2: A binary-valued system evolving under two different runs of the N-Fold Way at β = 5. Once the processes find a local basin, they can take excessively long to escape. Chapter 4. Large-Flip Importance Sampling 54 pair or a triplet can escape the minima. Unfortunately, while this is certainly conceptually correct, the computational costs make it infeasible. Even in the q = 2 case, there are M k sets of variables of size k to flip, and these must all be included in the conditional (4.9) For small k and large M the strategy may not be worth the effort as quite large flips of state are often required; for large k calculating and sampling from (4.9) is intractable. In the next section, we finally come to our contribution to this problem in an attempt to overcome shortcomings of the NFW. 4.4 The Large-Flip Quasi-Gibbs Sampler The low-temperature cycling behavior is a severe drawback of the NFW. As we mentioned, it is possible to approach low-energy states slowly via annealing into the desired distribution, i.e. using a sequence of distributions of the form π γn rather than having the sampler target the desired distribution π from the beginning. Our strategy is philosophically different and involves introducing a “memory” into the process. The resulting sequences of random variables no longer form Markov Chains due to the dependence on the history, and no longer (asymptotically) draw samples from π in the same sense that MCMC methods do. However they do aggressively visit the important regions of π, even at very low temperatures, but not with the correct frequencies. This deficiency is subsequently corrected. First, we drop interest in the “real” time of the process, i.e. the sequence Θi , since this is only useful for inducing a correct MCMC process, which we will not have anyway. Our sequence of samples Xn corresponds to the nth flips from the initial state X0 . However rather than choose flips according to the conditional distribution (4.9), the new mechanism is such that once a variable has been chosen for change, it is prevented from re-assuming its previous value for a certain amount of simulation time. (From now on unless otherwise clarified, “time” refers to the sequence of flips rather than MCMC time.) The reader may sense a connection between this idea and a much-used meta-heuristic for difficult combinatorial problems known as Tabu Search (Glover 1989). In fact we were initially motivated to solve the problems faced by the NFW, but this connection became made clear to us. Our subsequent survey of the tabu search literature has motivated one aspect of the algorithm, namely the random choice of set sizes of variables to update. However strictly speaking, our method has no “tenure” in the same sense, nor any “aspiration” criteria or any of the other features of the heuristic. Define the k th partial Large-Flip (LF) move to be the (ordered) sequence χkm ((i, x)0 , . . . , (i, x)m−1 ) (4.10) The k th LF move is complete when the the sequence χkm is of length Γk , the size of the k th LF move. The elements (i, x)j of the sequence χkΓk specify that variable i has changed value to state x at the j th step of the move, and thus Chapter 4. Large-Flip Importance Sampling 55 implicitly determine the global state trajectory. For precision, we also define the (unordered) set of moves in the k th partial LF move to be: k Cm {(i, x)0 , . . . , (i, x)m−1 } (4.11) If we define Γ−1 = 0, the cumulative number of flips during the mth step of k−1 the k th LF move is n(m, k) = m + i=0 Γk ; whenever we mention the global flip index n this dependence on m, k is to be remembered. The onset time of k−1 the k th LF move is TkC = i=0 Γi with T0C = 0. Our method of escaping from the cycling traps corresponds to forcing the probability of any duplicate elements (i, x)j within each complete LF move to be zero. This requires a modification of equation (4.8); the goal is achieved with νˆ(i, x; xn−1 ) ζi (x; xn−1 ) 0 k if (i, x) ∈ / Cm otherwise (4.12) When normalized as in equation (4.9), this will still result in a biased selection towards the “good” neighbors of state Xn−1 , but such moves are suppressed if they have been visited during the current LF move. The sizes of the LF moves Γk are randomly chosen integers from an interval ˆ min , Γ ˆ max ] prior to the onset of the k th move. These limits are tunable pa[Γ rameters of the system and are chosen a priori; they need not remain constant over the course of a simulation. Our experiments with large models have used ˆ max = ⌊M/6⌋ and Γ ˆ min = ⌊M/8⌋. Alternatively, the ranges can be decreased, Γ by say halving them every so many iterations for a given computational cost, or alternately increased and decreased to force exploration and increased scrutiny of a particular region respectively. The NFW flip selection equation (4.9) is the special case of Γmin = Γmax = 1 for the entire duration of the simulation. It is this random LF move size that was motivated by robust tabu search, a heuristic that has shown promise on the quadratic assignment problem (Taillard 1991) and MaxSAT (Smyth, Hoos and St¨ utzle 2003). While we do not care about the waiting time between samples, we will need to store the entire set of states visited by the process 2 , which can be implicitly done in a memory-efficient way by remembering only the initial state x0 and the entire set of LF moves, which only consist of (variable, value) pairs. The pseudocode to this process, which we call the Large-Flip Quasi-Gibbs Sampler (LFQGS) is shown in Figure 4.3. The pseudocode contains many steps that are purely illustrative and would never be implemented that way; implementation decisions must always be made specifically to the problem at hand. A moment’s reflection will reveal that this algorithm does not produce correct samples from π. The process is not Markov (due to the dependence via the LF memory) nor is it stationary. Indeed, we observe that forcing variables to not take certain values implicitly introduces a change to the model being 2 at least the distinct states, as will be seen shortly Chapter 4. Large-Flip Importance Sampling 56 ˆ min , Γ ˆ max ], χ0−1 = ∅, Initialize: X0 = x0 , n = 0, m = 0, Γ0 ∼ U[Γ 0 C−1 = ∅, choose a number T − 1 of flip moves to make. For n = 0 to T − 1: 1. If (m = Γk ) k (a) m = 0, k = k + 1, χk−1 = ∅, C−1 =∅ ˆ min , Γ ˆ max ] (b) Γk ∼ U[Γ (c) Update νˆ(i, x; xn−1 ) with Equation (4.12) 2. Sample (i, x) ∼ ν(i, x; xn−1 ) 3. Xn,i = x, Xn,{M\i} = xn−1,{M\i} k k = Cm−1 ∪ (i, x) 4. χkm = (χkm−1 , (i, x)), Cm 5. Update νˆ(i, x; xn ) according to Equation (4.12) 6. m = m + 1 and n = n + 1 Figure 4.3: Large-Flip Quasi-Gibbs sampler. simulated; for example by introducing “infinitely strong” (deterministic) local evidence in a graphical model about the values a set of variables can take. Thus at any step, the flip selection corresponds to a legitimate MCMC algorithm for that implicitly defined model but at the next step, the flipped variable’s values will be constrained by virtue of the previous value having been placed in the k set Cm and so the model has changed yet again! On the whole, such a strategy does not seem overly sensible; however the purpose behind the LFQGS is to collect a set of samples that approximate the trajectory of distinct states that true MCMC would visit over the course of the run for use in the next part of the method. The LFQGS states are an approximation because we do not explicitly store moves to individual states, but implicitly to collections of states k corresponding to the moves in Cm . However we can give a brief preview of the remainder of the method that makes it an asymptotically correct algorithm for inference and normalization. The first step is to select a single sample from each sequence; this is done by selecting the sample according to its relative probability to the other samples in the sequence. Unfortunately, as will be explained in the next section, this will not in general result in a sample from π. Several iterations of a Gibbs sampler via the (plain) NFW are applied to the selected samples resulting in a marginal distribution unavailable in closed form; this marginal is approximated using a mixture of Gibbs kernels “centered” at the previous points. From these evaluated marginals, importance weights are used to estimate the desired quantities 57 Chapter 4. Large-Flip Importance Sampling (4.2) and (4.3). Our entire procedure can thus be seen as elaborate “preprocessing” to obtain a good set of samples with which to perform importance weighting. 4.5 Sample Selection and Correcting the Bias with Importance Sampling In this section we show how we can indeed make the samples drawn from LFQGS a component of a statistically correct methodology for inference and normalization, that is approximating integrals under the stationary distribution, but we cannot have the system exhibit “real” dynamics as MCMC does. We call this methodology Large-Flip Importance Sampling(LFIS.) In a single LFQGS simulation of T samples, T − 1 flips, there will be a set S of distinct states visited, with |S| ≤ T . Many of these states have very low probability under π, but are necessarily traversed using the LF moves in order to escape the local optima. The approach we take is to run a set of N such LF processes of T − 1 flips each resulting in random variables {Xnl } for n ∈ {0, . . . , T − 1}, l ∈ {0, . . . , N − 1}. One possibility is to assign each state xi in the set of distinct visited states S l for l ∈ {0, . . . (N − 1)} a weight: qi = e−βE(xi ) −βE(x′ ) x′ ∈S l e (4.13) Selecting a state according to (4.13) corresponds to drawing an exact sample from the conditional distribution π(x|x ∈ S l ), not from the full target π(x). It bears a resemblance to the resampling step performed in Sequential Monte Carlo, though it is not the same as we are not selecting samples from the ratio of π with a known proposal distribution. It should be emphasized right now that this feature plainly distinguishes the method from ones based on annealing; while in annealing the idea is to try to (slowly) move to a state of high probability, so that the final value is the one that is of primary interest, in LFQGS the selection method can conceivably choose a sample visited at any point in the sequence (X0l , . . . , XTl −1 ), and it is indeed often the case that early samples are selected. The selection process is only asymptotically correct in that as the number of samples T → ∞, due to the randomness in the LF move sizes and the noise in the system, an ever-increasing portion of the overall state space is visited, so π(x|x ∈ S l ) essentially becomes π(x). This fact is of little practical consequence; what is desired is to collect a representative set of important states as early as possible. For small systems (and also for large systems at low temperature,) selecting according to (4.13) seems to accomplish this quite well. We refer the reader to Figure 4.4 for a demonstration. The graph shows the exact probability distribution of the energies of a complete 25-node graphical model with binary values and many competing interactions at a temperature of β = 5 (see Section 4.6 for the model description) and the approximation obtained using the LFQGS Chapter 4. Large-Flip Importance Sampling 58 followed by selection using N = 1000, T = 1000. The approximation was calculated by simply histogramming the energies of the N selected states and took about 10 seconds of computer time on a 3GHz Xeon CPU. In contrast the brute-force computation of the exact distribution by summing over 2M states took about 8 minutes on the same machine. The two curves are indeed very difficult to tell apart in Figure 4.4. We remind the reader again that due to the non-stationarity of the LFQGS process, it is not correct to conclude anything −1 about π by histogramming the samples {Xnl }Tn=0 in sequence l as is done in MCMC. Note that approximating the distribution of energies at β = 5 is demonstrating the method in a difficult case: the task is to generate samples with the correct nontrivial frequencies, not merely to find the global optimum, which is the case at very high β, or to in effect sample states uniformly (low β). It would be favourable indeed if we could maintain this fidelity for larger systems. Unfortunately, the number of samples needed to approach π by selecting according to (4.13) can be tremendous, and the bias incurred in sampling from the conditional (4.13) can be very high, oversampling the low-energy states in S l . In fact, even if the set S l was a perfect sample from π, at “moderate” temperatures (typically used in vision restoration problems, for example) there can be a substantial range in energy of the samples; at that temperature, the probability of the lowest energy state in the set can be very large relative to say, those of average value. This may appear paradoxical, but in fact it is a perfectly reasonable consequence of the enormity of the state spaces involved. To illustrate this, suppose we had a very simple system with only two energies, E1 , E2 : E1 < E2 , and while only one state x1 had energy E1 , the remaining states, all assuming energy E2 , comprised a large set R. The ratio of probabilities between x1 and xi ∈ R is: e−β(E1 −E2 ) This ratio may be very large depending on the particular values of E1 , E2 , and β; however if the system is such that e−β(E1 ) ≈ |R|e−β(E2 ) then the probability that x1 occurs is about the same as the probability that some xi inside the set R occurs. For most actual physical models, this phenomenon is ubiquitous. The state-space is such that the vast majority of states are of “moderate” energy. At very low temperatures, the probability will nonetheless concentrate on the very low energy states, but at moderate temperatures, the sheer combinatoric “mass” imposed by the states of moderate energy can overwhelm the low-energy states. Indeed, one of the simplest possible discrete processes, a sequence of independent tossing of a biased coin, exhibits this. Suppose we had an “unfair” coin that landed with heads up with probability p > 0.5. If it was tossed N = 10000 times independently, then despite the fact that the most likely sequence is certainly a run of N heads in a row, this will almost never actually be seen unless p is very close to 1. What “typically” occurs is an outcome from the aptly-named typical set : a sequence with pN heads. Depending on the value of p, there can be a gargantuan number of particular sequences with this property, all of which are equiprobable. The probability of N heads occurring can, however, be Chapter 4. Large-Flip Importance Sampling 59 made arbitrarily likely by making p arbitrarily close to 1. This is the analog of decreasing the temperature. We are thus faced with trying to find a solution to this difficult issue. There are two possible routes we could take. The first one, which is quite challenging, is to find a more sophisticated weighting and selection scheme than that represented in Equation (4.13). The other is to simply select according to (4.13 ) and to subject the selected sample to several iterations of the (plain) NFW, i.e. a proper MCMC in the direction of π. Note that selecting according to (4.13) from the set of states sampled by the LFQGS process at moderate β does not tend to sample the system’s global minimum (as it would for high β) but merely the minimum of the range of energies encountered by the process at that temperature. We have made a useful empirical observation that can justify selecting states from a given process according to the simple scheme above as long as it is followed by a number of NFW iterations. To do this, we ran the following two sets of experiments. First, on a large, fully-connected system at moderate temperature, we ran N = 500 independent LFQGS processes and selected a state from each using (4.13 ). Next, on the same system and β, we ran N independent instances of an annealed NFW process, where the system β was dropped linearly from a low β to the final system β. It was found empirically that the average energy of the selected states (over the independent runs) from the LFQGS was very close to the minimum energy that resulted from the annealed process at that given β . The application of “supplementary” NFW steps is an effective way to enable the samples explore parts of the state space that are in regions such are likely at that temperature. At moderate temperatures, the NFW phase will have no problems fluctuating and exploring many energy levels despite the fact that the selected sample used to initialize it is biased towards the lower energies it may encounter at that temperature. At low temperatures, this strategy will most likely result in the cycling behaviour described in the previous section since the selected sample is most likely a single-flip local minimum. This is also “correct” provided that the selected sample is “close” to the global minimum, which luckily, the LFQGS can reasonably ensure. We now detail the final steps of the algorithm that allow us to estimate the partition function Z(β), and we demonstrate its asymptotic correctness by appealing to standard importance sampling arguments. After selecting a sample from each set S l , we apply say 10M iterations of the NFW to each sample and take the last one; the resulting samples comprise the new population and whose (“time”) marginal distribution we refer to as fT (x). This latter distribution is unknown and most likely not available in closed form; the hope however, is that it is a reasonable approximation to π. However inference and normalization using fT (x) is not asymptotically correct in the sense that as we draw N → ∞ samples Y (i) from it for a fixed T , N estimators of the form N1 i=1 h(Y (i) ) do not converge to Eπ [h(X)] because fT (x) = π(x). Our remedy to this as follows. Suppose we drew N samples from fT (x) using the method. Define the set Chapter 4. Large-Flip Importance Sampling 60 0.5 Approximate 0.4 Probability Exact 0.3 0.2 0.1 0 −16.5 −16 −15.5 Energy −15 −14.5 Figure 4.4: The exact and approximate probability distribution of the energy on a 25 node complete graphical model at β = 5. The approximation was calculated using the LFQGS followed by a selection step, as described in the text. The two plots are very close to each other. Chapter 4. Large-Flip Importance Sampling 61 of N selected variables under the rule (4.13) to be Y (i) . For each Y (i) , perform one sweep of conventional Gibbs Sampling through all the variables in a predetermined, fixed order. This order can be a randomly generated permutation of 1 . . . M and can be different for each i. The Gibbs Sampler will technically “move” the set of samples towards the stationary distribution π, whose unnormalized expression, we remind the reader, is π ˜ . This moving effect is slight and not the object of the exercise; using a Gibbs kernel just ensures that it will not make the population of Y (i) any worse. The application of the Gibbs kernel to the population Y (i) sampled from fT (.) results in samples Y˜ (i) from a distribution we call µ(.). Since we can analytically compute KG (y|Y (i) ) due to the predetermined variable sweep (it is simply the product of the local conditional probabilities of the variables in the sweep order), it is straightforward to approximate µ(Y˜ (i) ) with µ ˆ(Y˜ (i) ) 1 N N −1 j=0 KG (Y˜ (i) |Y (j) ) (4.14) If we desire to approximate expectations under π(x) of the form (4.2), we can simply use the importance sampling estimator (i) N −1 ˜ (i) ) π˜ (Y˜ ) i=0 h(Y µ( ˆ Y˜ (i) ) Iˆ W where N −1 W i=0 π ˜ (Y˜ (i) ) µ ˆ(Y˜ (i) ) (4.15) (4.16) The advantage is that we also have an estimator for the partition function 1 ˆ Z(β): Z(β) N W and not merely the ratio of two constants as often arises. The reason is that by construction, the normalization constant of µ is unity. The estimator for µ is O(N ) for each of the N points; fortunately this overall O(N 2 ) cost is quite reasonable since it is only done once, in the final stage of the algorithm. Note that trying to estimate the “temporal” marginal that is more than one step ahead is practically impossible as an intermediate integration over the entire state space is required. Although the approximation (4.14) to the marginal µ(.) is certainly asymptotically correct, it may fail to give a practically good estimate to the true µ for reasonable sample sizes N . The reason for this is that the “width” of the Gibbs kernel, i.e. the number of states within its domain with non-negligible probability, may be quite narrow relative to the size of the high-probability regions of µ. In such a case, the variance of the estimator (4.14) may be high, which means that for a given point Y˜ = y˜ from µ, a large number of samples {Y (j) }N j=1 is required to ensure that µ ˆ(˜ y ) is a reliable estimate of µ(˜ y ). We must emphasize that this present issue is determined by the variance that the particular kernel induces in the estimator (4.14) and is independent of the fidelity with respect to π of the samples Y (i) . Indeed, it can arise even if the Y (i) are perfect samples from π if the kernels are very narrow. Chapter 4. Large-Flip Importance Sampling 62 The obvious way around this problem is simply to increase the number of samples N . An alternative is to keep the same set of samples from the selectionNFW step, but rather than simply sampling the set of N variables {Y˜ (j) }N j=1 from KG (y|Y (i) ), we introduce for each Y (i) a set of κ “smoothed” versions of the Gibbs kernels centered at each point. An example is a sequence of hightemperature kernels: γ KGj (y|Y (i) ) for j = 1 . . . κ and γj ≥ γj+1 with γj = 1. For each Y (i) , a set of κ points {Y˜ (i,j) }κj=1 are sampled (one from each kernel centered on Y (i) ) and a mixture estimator analogous to the one in (4.14) is used to estimate a marginal. This increases the computational cost: the final step will then be O((κN )2 ) instead of O(N 2 ), but the advantage is that the set of kernels centered at each point now have increased overlap with the target, hopefully resulting in a decrease in variance. This possibility was not explored in implementation, however. Before proceeding to the experiments, a word should be said about the method for very large systems with a rough landscape at high β. In such cases, the correct behaviour of the algorithm would be for almost every sample to belong to the very lowest energy levels. In many cases, such as for most arbitrary graphical models (directed or undirected,) determining the minima is NP-Hard, and while the algorithm seems to perform quite well for the models we tested, it is not without limitations. Indeed, it is well-known (Dagum and Luby 1993) that even approximate inference in Bayesian networks is NP-Hard, and so is the Maximum Cut problem (Papadimitriou and Yannakakis 1991), whose solution coincides with sampling at very low temperatures from the undirected model to be presented in the next section. Thus, one should not expect miracles. 4.6 Experiments We present two sets of numerical results. The first set was primarily to establish that the method seems to work correctly. A 25 node fully-connected, binary valued, undirected graphical model under 6 values of the inverse temperature β was used. The task was to estimate the log(Z(β)) for each of these β. The system is an instance of an Ising Spin Glass, a model that is known as the Boltzmann machine (Ackley et al. 1985) in the machine learning community. Each component of the state xi ∈ {−1, 1}. The energy function is given by: 1 E(x) = − √ M Jij xi xj (4.17) i<j where the Jij are distributed according to the zero-mean, unit variance Gaussian. The fact that the interactions Jij are allowed to be of mixed sign makes this a very difficult and complex model. In fact, merely finding the global energy optimum in (4.17) is an instance of the Unconstrained Binary Quadratic Programming problem, well-known to be NP-Hard (Garey and Johnson 1979) 63 Chapter 4. Large-Flip Importance Sampling β .5 1 2 5 10 20 Exact log Z(β) 18.96 23.69 38.55 90.13 178.34 355.55 LFIS Error 1.3 × 10−3 1.6 × 10−3 2 × 10−3 6 × 10−4 5 × 10−4 5 × 10−4 Variance 1.0 × 10−4 7.9 × 10−4 3.7 × 10−4 1.3 × 10−5 9.0 × 10−6 3.2 × 10−7 SMC Error 8 × 10−4 1.2 × 10−3 1 × 10−3 6.7 × 10−3 1.1 × 10−2 7.6 × 10−2 Variance 2.2 × 10−5 1.6 × 10−4 6.7 × 10−4 3.2 × 10−3 3.5 × 10−2 0.2295 LBP Error 0.005 0.052 4.6 22.75 0.26 12.81 Figure 4.5: The exact values log(Z(β)) and of average error over 50 runs of the ˆ estimates log(Z(β)) for a 25 node complete graphical model using the 3 methods ˆ described in the text. Variances of log(Z(β)) are also included for LFIS and SMC; loopy BP does not have a variance as it is a deterministic method. in all but a small set of situations. The minimization task can also be seen (Hammer 1965) as trying to “colour” each vertex with one of two labels such that the weight of the edges Jij with vertices labeled with distinct colours is minimized; this formulation is known as the Max-Cut problem. If the edge weights were all positive, this would be trivial; it is the mixed sign that changes this fact. For the small model here, log(Z(β)) was calculated by brute force for each β. ˆ The estimates of log(Z(β))from the LFIS methodology were compared against Sequential Monte Carlo (del Moral et al. 2006) (SMC) with annealed distributions and Loopy Belief Propagation, and the results are presented in Figure 4.5. 50 runs of LFIS and SMC were performed at each β; the “error” in Figure 4.5 ˆ is defined as | log(Z(β)) − log(Z(β))|. LFIS drew 1000 samples from sequences of 1000 flips each. For SMC 1000 samples were also used, and a linear cooling schedule was employed such that the number of computational steps matched those taken by the LFIS method (about 5000 annealing steps in our implementation.) SMC gives very good estimates of log(Z(β)) of this system for low β (high temperature,) clearly being the winning method for β = 0.5. For β = 1, 2, LFIS and SMC perform comparably well. As β rises further, LFIS begins to dominate. An interesting trend with SMC is the almost monotonically increasing error and variance with increasing β; in contrast, LFIS exhibits a small rise in error from β = 0.5 to β = 2, but it then declines sharply. Loopy BP gave respectable results for β = 0.5 and β = 1 but very large errors for the remaining β. The previous example is too small to draw any concrete conclusions from, serving mostly as a demonstration of correctness of the LFIS against exact results, but the diverging accuracies of SMC and LFIS methods for decreasing temperatures certainly inspired us to investigate this effect further for large models. Unfortunately it clearly becomes impossible to compare against the exact answers. To address this, we run two sets of experiments using each method: one at a high β and one at “moderate” β. Low β problems do not exhibit prob- Chapter 4. Large-Flip Importance Sampling 64 lems for sampling methods such as MCMC, but they are not particularly useful for practical applications. The results of the two sets are considered differently in each case. If we force β to be high, we know that under the equilibrium distribution, the system will primarily be in its low-energy states. Thus, two reasonable checks for the efficiency of an algorithm at high β are the average energy of the states returned by each run and their variance. It seems reasonable to suspect that in this β regime, a higher average energy suggests a worse algorithm, and if one algorithm generates samples whose energies have a larger variance over the runs than those of the other, it is almost certainly less reliable. At moderate β, the comparison is less straightforward as we expect a much wider range in sampled energies than those of the high β case. There is no reason to expect that the system will be restricted to its low energy states, so merely applying the diagnostic criteria of the previous case is incorrect. However we shall see that the statistics of the states sampled by both annealing and the LFIS method are very similar; if this does not necessarily imply that they are both working correctly then at least it demonstrates their consistency. For this set of experiments we considered it very important to be thoroughly fair in terms of computational parity; we thus implemented an event-driven annealing (EDA) process of the type described previously. The process was such that at each step, one variable was flipped. The final state after the T EDA iterations was considered to be the sample generated by that run. The computational cost per iteration is almost identical to that of the LFQGS; that latter needs to do slightly more bookkeeping in setting and unsetting the allowable moves. In our implementation, this translated to the LFQGS of T steps running about 3% more slowly than the EDA process of the same length, although this can certainly be diminished in a less prototypical implementation. Two large models were considered: the first, called 1000FC, was a 1000 variable fully-connected system whose energy had the form of Equation (4.17). Again, Jij were drawn from the standard Gaussian. The second, called CUBE, was a binary-valued graphical model of three-dimensional lattice topology with dimension 4 × 4 × 16. The interactions can be seen to be a special case of (4.17), where all Jij = 0 except those connecting neighbors on the lattice. Instead of being Gaussian, however, the values of the remaining Jij were 1 or −1 with equal probability. Normalization and optimization on this topology is also N P Hard (Barahona 1982). The low and moderate temperature cases corresponded to β = 20 and 2 respectively. LFQGS and EDA were both run 500 times on each problem at each β. 50000 and 100000 steps were taken by both on CUBE and 1000FC respectively. EDA employed a β schedule which ascended linearly from β = 0.001 to its final value in the allotted number of time steps. For LFQGS, the LF move size range was set to Γmin = ⌊M/8⌋ and Γmax = ⌊M/6⌋. We consider first the case of β = 20. Figure 4.8 displays numerically the results of these simulations; the average energy of the sampled states over the runs and their variance is shown for the two methods on both models. It is clear that on these large models, for the given computational resources, the samples Chapter 4. Large-Flip Importance Sampling 65 generated by our methodology are of far higher quality than those obtained even by using the sophisticated implementation of annealing. We mentioned in the previous section that the probabilities of individual samples in a “correct” population can vary considerably due to the size of the state spaces, but at such a low temperature, it is reasonable to compare the samples based on the relative probabilities. The average sample generated by our method from FC1000 was 2.65 × 10121 times more likely at β than one generated by EDA; for the CUBE model, that ratio is 2.06 × 1069 . Note that the samples generated with EDA also show a very large spread of energy. The SMC procedure that uses these samples is doomed to failure; the reweighting step in it cannot save a particular population that has not visited the right regions, it can only correct one that has visited the right ones with the wrong frequencies. As we mentioned in the introduction, the primary source of variance in Sequential Monte Carlo methods such as the one used here is a failure of the proposal process to effectively sample the target distribution; the other source, due to importance sampling in a growing state space, can be effectively managed using strategies such as resampling (del Moral et al. 2006). To visualize the results of the two methods, in Figure 4.6, we have sorted and plotted the energies of the 500 sampled states of both methods on FC1000. 4.6(b) zooms in on the 330 lowest-lying states generated by LFIS, showing that many states of very low energy that are found using LFIS are not approached by EDA. Out of the 500 runs, the lowest energy state was hit 5 times, suggesting that it is in fact the global optimum; for this size of graph, it is very difficult to ensure that the true minimum is reached. The two curves illustrate graphically via their shapes the discrepancy in the average energy and variance between EDA and LFIS. Needless to say, we are not implying that the samples generated by LFIS at this temperature after the number of flips allotted in this set of experiments is an unbiased population from π. The fact that the very lowest energies are not sampled often is more likely an artifact of the algorithm than a property of the system. However at such a temperature, the selected states are almost always single-flip local minima and thus not subject to any further minimization by the post-selection NFW stage. The calculation of the approximate marginals of the sampled points Y˜ (i) will simply return probabilities very close to 1 for remaining at the final point. The importance weights will thus be mostly allocated to the samples of lowest energy. Of course, we must remember that annealing is an asymptotically-correct method as well; no doubt slowing the cooling rate can improve matters in the large models we have experimented with. Our algorithm has the advantage of on average doing better in a shorter allotted resource time, and that is of great practical importance. Moving now to the case of β = 2, we display the sorted energy profiles of the two methods on FC1000 in Figure 4.7. As we said, we have no way of obtaining exact answers for, say, the marginal probability of some nodes, or of the exact partition function, but since this is a higher temperature, we expect EDA to fare better than it did for the high β case. In fact, the two plots are very Chapter 4. Large-Flip Importance Sampling 66 similar, which provides a sort of cross-validation of our method at moderate temperatures. The average energies of the EDA and LFIS samples were -695.36 and -698.82 respectively; the variance of the energies were 83.53 and 78.3239 respectively. Again, as in the small examples, these experiments suggest that LFIS performs well over a wider range of temperatures, and consequently, “ruggedness” of state-space landscape, than methods based on annealing. Our final experiment was to illustrate the application of the method to the semi-supervised learning problem of machine learning; its purpose was not to propose any new ideas in the area. Many classification tasks can be cast as graph-based “energy minimization” problems (Blum and Chawla 2001, Veksler 1999). The model in such approaches, whose parameters are either learned or set heuristically, reflects the relations between the set of data points, which typically consist largely of points whose labels are unknown and a handful of points with given labels. The task is to classify the unlabeled points with respect to this model and labeled data by minimizing the cost function defined over the model. However an argument has been made (Getz et al. 2005) that a more robust approach is not to consider the minima of such cost functions, but rather to estimate the marginal probabilities of the unlabeled data at nonzero β and to label such points using thresholding criteria on the marginals. To illustrate the applicability of LFIS to such a problem, we construct a simple set of 300 points arranged as shown in Figure 4.9. We will attempt to classify the points into one of two classes based on this data and two differently labeled points, one inside the central cluster, and one in the outlying ring. Intuitively, all points in the outer ring and some of those in the connecting “bridge” should belong to one class, and the remaining points to the other. We make this more precise by defining the model and similarity measure used. Since the LFIS methodology does not merely apply to low-temperature models (whose solution effectively becomes a minimization problem) and since we have already demonstrated the efficacy of the minimization “mode” of the method in the previous set of experiments, here we restrict attention to the nonzero temperature regime suggested in (Getz et al. 2005) and estimate the marginals of the unlabeled points using the importance sampling estimator. Let M be the total number of points, labeled or otherwise, and let L and U be the sets of labeled and unlabeled points respectively. For semi-supervised classification, the number of labeled points L |L| is small compared to U |U|; in our test model, for example, M = 300 while L = 2. We assume the points lie in a Euclidean space; our example’s points lie in the 2D plane. The models used in (Zhu and Ghahramani 2002) and (Getz et al. 2005) are specified by the following Potts-like “conditional” energy function: E(XU |XL ) i<j:i,j∈U Jij [1 − δxi (xj )] + i∈U ,j∈L Jij [1 − δxi (xj )] (4.18) The two summations in the energy defined in (4.18) are the contributions due to interactions among the unlabeled points in that configuration and to 67 Chapter 4. Large-Flip Importance Sampling −725 −730 Energy −735 −740 −745 −750 −755 −760 0 100 200 300 400 500 Run (a) Sorted energies for 500 runs of LFIS (solid) and EDA (dashed) −748 −749 Energy −750 −751 −752 −753 −754 −755 −756 0 50 100 150 200 250 300 Run (b) Same as (a), but zooming in on the best 330 runs of LFIS. Figure 4.6: Sorted energies of 500 runs of LFIS and EDA on a difficult 1000-node fully-connected binary model with Gaussian-valued interaction parameters at the low temperature of β = 20; each run was 1000000 steps. LFIS sampled 31 states with an energy below -754 and 15 with an energy below -755. The efficient annealing method, on the other hand, only sampled one state with an energy below -754 and none below -755. The lowest-lying state sampled by LFIS, with an energy of -755.13, was hit in 5 out of the 500 runs, leading us to believe that it is the actual minimum. The general shapes of the two plots clearly imply that LFIS is superior to even the rejectionless annealing method. 68 Chapter 4. Large-Flip Importance Sampling −660 −670 Energy −680 −690 −700 −710 −720 −730 0 100 200 300 400 500 Run Figure 4.7: Sorted energy profile of 500 runs of LFIS (solid) and EDA (dashed) on the same instance of the model described in the text at β = 2; this is the same model instance whose results are shown in Figure 4.6 at β = 20. The very similar profile exhibited by the two curves suggests that they are both performing correctly at this temperature. The energy statistics of the two methods over the runs are very close: the LFIS samples gave empirical energy mean and variance of -698.82 and 78.3239 respectively; the annealed runs gave values of -695.36 and 83.53 respectively. Method LFQGS EDA CUBE Mean energy Energy Var -401.52 0.899 -393.54 25.281 FC1000 Mean energy Energy var -751.04 9.00 -737.07 46.19 Figure 4.8: Results of the LFIS methodology against an EDA. The mean energies and variances are those obtained from 100 runs of each process. Note that at the system β = 20, the average state generated by LFIS is 2.65 × 10121 more likely than one generated by EDA on FC1000 and 2.06 × 1069 more likely on CUBE. 69 Chapter 4. Large-Flip Importance Sampling 3 2 1 0 −1 −2 −3 −3 −2 −1 0 1 2 3 1 2 3 (a) Before labeling 3 2 1 0 −1 −2 −3 −3 −2 −1 0 (b) Labeled points Figure 4.9: The set of M = 300 points used in the semi-supervised learning experiments discussed in the text. Figure 4.9(a) shows the set of points prior to labeling; all points are unlabeled except for the two bold ones. Figure 4.9(b) shows the result of the classification of the points into two classes. Even the connecting “bridge” of points is properly divided. See the text for details. Chapter 4. Large-Flip Importance Sampling 70 those between the unlabeled points and those clamped at their values in XL ; this formulation is equivalent to that in (Getz et al. 2005) with an “infinite local field” at the observed points. The delta functions in the energy do not depend on how the state variables are represented; the “multiplicative” form of the Ising energy in Equation (4.17), on the other hand, is defined for the variables being 1 or −1. A little bit of algebra will reveal that the “delta” form of the Ising model can be transformed to the multiplicative form but with all the interactions Jij halved. This is of no consequence if we seek the global minimum, but for nonzero temperatures it should be kept in mind as the simulation of the delta form will effectively be taking place at twice the temperature (half β ) if we do not transform the Jij . The Jij are chosen to reflect the similarity between the points; in (Zhu and Ghahramani 2002) the following definition is used: Jij e −dij 2σ2 (4.19) where dij is the Euclidean distance between data points i and j, and σ is a parameter which can be thought to control how close two points must be to be considered “similar.” For the experiments, we used a set of M = 300 points; the set prior to the labeling of the unknown is shown in Figure 4.9(a). The two points in bold in the central and peripheral “clusters” are the labeled points; the remaining points are unlabeled. Given the model and the two labels, the task is to add labels to the remaining points. The parameters σ and β were set to 0.5 and 1.2 respectively. The model was assumed to be in the delta form of the Ising model. For the LFIS step, a total of N = 500 particles each running for a total of T = 8000 steps were used. The results are shown in Figure 4.9(b); the colours depict the labels that were obtained by thresholding the estimated marginal probability of each unlabeled point towards its most likely value. As we can see, the combination of the model and the method succeed in this instance in dividing the set of points into the classes we would expect. In particular, the connecting “bridge” of points connecting the central cluster to the outer ring is suitably divided. The set of simulations took about 5 minutes on our hardware; in contrast, Loopy Belief Propagation failed to converge to any solution on this problem in over 12 hours of run time. 71 Chapter 5 Conclusion This thesis has detailed some ideas in Monte Carlo sampling for graphical models, mostly of the undirected kind. The first method, Tree Sampling, was motivated by some statistical optimality results and common intuition that sampling from connected regions of a regular-structured MRF should do at least as well as the traditional single-site algorithms. The experiments bore these predictions out. However the method was limited by the assumed topology of the graphical model, and by the fact that local minima could hinder the exploration of other parts of the state space once the algorithm appeared to have found a mode. The method known as Hot Coupling was a Sequential Monte Carlo algorithm that utilized partially-constrained versions of the full graphical model to try to guide a population of particles towards the “important” regions of the state space. On small test graphs, the method performed well even though a random sequence of edge additions was used. For large graphs a low temperatures, performance was considerably improved by using a greedy-constructive heuristic for deciding the order of edges to add to the sequence, but this was outperformed by simple perturbative methods. This motivated the final algorithm, which attempted to leverage this observation into a statistically-correct Monte Carlo method. The final method, known as Large-Flip Importance Sampling, was superficially related to a restructuring of single-site MCMC known as the N-Fold Way. The similarity ended quickly, though, because the sampling process we proposed sacrificed its being a valid MCMC method in order to make aggressive jumps in the state-space, which we have called Large-Flip moves. This deficiency was compensated for via importance sampling. The chief advantage of this method is that it applies to any discrete-valued model whose single-variable conditionals are accessible. Current work focuses on more sophisticated methods for calculating the importance weights of the sampler. 72 Bibliography Ackley, D. H., Hinton, G. E. and Sejnowski, T. J. (1985). A learning algorithm for boltzmann machines, Cognitive Science 9: 147–185. Amit, D. A. (1989). Modeling Brain Function, Cambridge University Press. Arnborg, S., Corneil, D. G. and Proskurowski, A. (1987). Complexity of finding embedding in a k-tree, SIAM J. Alg. Disc. Meth. 8(2): 277–284. Barahona, F. (1982). On the computational complexity of ising spin glass models, Journal of Physics A 15(10): 3241–3253. Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems, Journal of the Royal Statistical Society B 36: 192–236. Besag, J. E. (1986). On the statistical analysis of dirty pictures, Journal of the Royal Statistical Society B 48(3): 259–302. Blei, D., Ng, A. Y. and Jordan, M. I. (2003). Latent dirichlet allocation, J. Mach. Learn. Res. 3: 993–1022. Blum, A. and Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts, Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp. 19–26. Bortz, A. B., Kalos, M. H. and Lebowitz, J. L. (1975). A new algorithm for monte carlo simulation of ising spin systems, Jour. Computat. Phys. 17: 10– 18. Bremaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues, Springer. Bucklew, J. A. (1986). Large Deviation Techniques in Decision, Simulation, and Estimation, John Wiley & Sons. Carbonetto, P. and de Freitas, N. (2003). Why can’t Jos´e read? the problem of learning semantic associations in a robot environment, Human Language Technology Conference Workshop on Learning Word Meaning from NonLinguistic Data. Carbonetto, P. and de Freitas, N. (2006). Conditional mean field, Advances in Neural Information Processing Systems. Bibliography 73 Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models, Biometrika 81(3): 541–553. Casella, G. and Robert, C. P. (1996). Rao-Blackwellisation of sampling schemes, Biometrika 83(1): 81–94. Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems, Springer-Verlag, New York. Dagum, P. and Luby, M. (1993). Approximating probabilistic inference in bayesian belief networks is np-hard, Artificial intelligence 60(11): 141–153. de Freitas, N., Dearden, R., Hutter, F., Morales-Menendez, R., Mutch, J. and Poole, D. (2004). Diagnosis by a waiter and a mars explorer, IEEE Proceedings. del Moral, P., Doucet, A. and Jasra, A. (2006). Sequential monte carlo samplers, J. Roy. Statist. Soc., Ser. B 68: 411–436. Del Moral, P., Doucet, A. and Peters, G. W. (2004). Sequential Monte Carlo samplers, Technical Report CUED/F-INFENG/2004, Cambridge University Engineering Department. Devroye, L. (1986). Non-Uniform Random Variate Generation, Springer-Verlag, New York. Doucet, A., de Freitas, N. and Gordon, N. J. (eds) (2001). Sequential Monte Carlo Methods in Practice, Springer-Verlag. Doucet, A., de Freitas, N., Murphy, K. and Russell, S. (2000). Rao-Blackwellised particle filtering for dynamic Bayesian networks, in C. Boutilier and M. Godszmidt (eds), Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, pp. 176–183. Fischer, K. H. and Hertz, J. A. (1991). Spin Glasses, Cambridge University Press. Fung, R. and Chang, K.-C. (1989). Weighing and integrating evidence for stochastic simulation in bayesian networks, Uncertainty in AI, pp. 209– 219. Garey, M. and Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association 85(410): 398–409. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6): 721–741. Bibliography 74 Getz, G., Shental, N. and Domany, E. (2005). Semi-supervised learning – a statistical physics approach, ICML Workshop on Learning with Partially Classified Training Data. Glover, F. (1989). Tabu search - part i., ORSA Journal on Computing 1(3): 190– 206. Gore, V. and Jerrum, M. (1996). The swendsen-wang process does not always mix rapidly, 29th Annual ACM Symposium on Theory of Computing. Green, P. J. and Richardson, S. (2002). Hidden Markov models and disease mapping, Journal of the American Statistical Association 97(460): 1055– 1070. Hadlock, F. O. (1975). Finding a maximum cut of a planar graph in polynomial time, SIAM Journal on Computing 4: 221–225. Hammer, P. L. (1965). Some network flow problems solved with pseudo-boolean programming, Operations Research 13: 388–399. Hammersley, J. M. and Clifford, M. S. (1971). Markov fields on finite graphs and lattices. Hamze, F. and de Freitas, N. (2004). From fields to trees, Uncertainty in Artificial Intelligence. Hamze, F. and de Freitas, N. (2005). Hot coupling: a particle approach to inference and normalization on pairwise undirected graphs, Advances in Neural Information Processing Systems, Vol. 18, pp. 491–498. Hamze, F. and de Freitas, N. (2007). Large-flip importance sampling, Proc. Conf. Uncertainty in Artificial Intelligence. Hamze, F., Rivasseau, J. N. and de Freitas, N. (2006). Information theory tools to rank MCMC algorithms on probabilistic graphical models, Proceedings of the UCSD Workshop on Information Theory and its Applications Invited Paper, San Diego, CA. Hinton, G. (1999). Products of experts, Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), pp. 1–6. Hukushima, K. and Nemoto, K. (1996). Exchange monte carlo method and application to spin glass simulations, Journal of the Physical Society of Japan 65(6): 1604–1608. Iba, Y. (2001). Extended ensemble monte carlo, International Journal of Modern Physics 12: 623–656. Ising, E. (1925). Beitrag zur theorie des ferromagnetismus, Zeitschrift f¨ ur Physik 31: 253–258. Bibliography 75 Istrail, S. (2000). Statistical Mechanics, Three-Dimensionality and NPcompleteness - I. Universality of intractability for the partition function of the Ising model across non-planar lattices (extended abstract), STOC 2000. Jarzynski, C. (1997). Nonequilibrium equality for free energy differences, Phys. Rev. Lett. Jepson, A. D., Fleet, D. J. and El-Maraghi, T. (2003). Robust online appearance models for visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25: 1296–1311. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. (1999). An introduction to variational methods for graphical models, Machine Learning 37: 183–233. Kindermann, R. and Snell, J. (1980). Markov random fields and their applications, American Mathematical Society. Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983). Optimization by simulated annealing, Science 220: 671–680. Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models, Journal of Computational and Graphical Statistics 5: 1–25. Lafferty, J. D., McCallum, A. and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data, International Conference on Machine Learning. Lauritzen, S. (1996). Graphical models, Oxford University Press. Li, S. Z. (2001). Markov random field modeling in image analysis, SpringerVerlag. Lipson, A., Kueck, H., Brochu, E. and de Freitas, N. (2003). Machine learning for computer games, First International Digital Games Research Conference. Liu, J. S. (ed.) (2001). Monte Carlo Strategies in Scientific Computing, SpringerVerlag. Liu, J. S., Zhang, J. L., Palumbo, M. L. and Lawrence, C. E. (2002). Bayesian clustering with variable and transformation selections, in J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. Smith (eds), Bayesian Statistics, Vol. 7, Oxford University Press. Liu, J., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes, Biometrika 81(1): 27–40. Bibliography 76 Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme, Europhysics Letters 19(6): 451–455. Metropolis, N. and Ulam, S. (1949). The Monte Carlo method, Journal of the American Statistical Association 44(247): 335–341. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equations of state calculations by fast computing machines, Journal of Chemical Physics 21: 1087–1091. Mezard, M., Parisi, G. and Virasoro, M. (1987). Spin Glass Theory and Beyond, Vol. 9 of World Scientific Lecture Notes in Physics, World Scientific. Moller, J., Pettitt, A. N., Berthelsen, K. K. and Reeves, R. W. (2004). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants, Technical report, The Danish National Research Foundation: Network in Mathematical Physics and Stochastics. Munoz, J. D., Novotny, M. A. and Mitchell, S. J. (2003). Rejection-free monte carlo algorithms for models with continuous degrees of freedom, Physical Review Letters. Neal, R. M. (1998). Annealed importance sampling, Technical Report No 9805, University of Toronto. Newman, M. E. J. and Barkema, G. T. (1999). Monte Carlo Methods in Statistical Physics, Oxford University Press. Onsager, L. (1944). Crystal statistics. I. A two-dimensional model with a orderdisorder transition, Phys. Rev. 65: 117–149. Papadimitriou, C. H. and Yannakakis, M. (1991). Optimization, approximation, and complexity classes, J. Comput. System Sci. 43: 425–440. Pearl, J. (1987). Evidential reasoning using stochastic simulation, Artificial Intelligence 32: 245–257. Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan-Kaufmann. Picard, J.-C. and Ratliff, H. D. (1973). A graph-theoretic equivalence for integer programs, Operations Research 21(1): 261–269. Picard, J.-C. and Ratliff, H. D. (1975). Minimum cuts and related problems, Networks 5: 357–370. Rivasseau, J. N. (2005). From the jungle to the garden: Growing trees for Markov Chain Monte Carlo inference in undirected graphical models, Master’s thesis, University of British Columbia. Bibliography 77 Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, SpringerVerlag, New York. Roth, S. and Black, M. (2005). Fields of experts: A framework for learning image priors, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 860–867. Rue, H. and Held, L. (2006). Gaussian Markov Random Fields: Theory and Applications, Chapman & Hall/CRC. Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). Learning internal representations by error propagation, in D. E. Rumelhart and J. L. McClelland (eds), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA, pp. 318–362. Smyth, K., Hoos, H. H. and St¨ utzle, T. (2003). Iterated robust tabu search for max-sat, Canadian Conference on AI, pp. 129–144. Swendsen, R. H. and Wang, J. S. (1987). Nonuniversal critical dynamics in Monte Carlo simulations, Physical Review Letters 58(2): 86–88. Taillard, E. D. (1991). Robust taboo search for the quadratic assignment problem, Parallel Computing 17: 443–455. Tham, S. S., Doucet, A. and Ramamohanarao, K. (2002). Sparse bayesian learning for regression and classification using Markov Chain Monte Carlo, International Conference on Machine Learning, pp. 634–641. Tierney, L. (1994). Markov chains for exploring posterior distributions, The Annals of Statistics 22(4): 1701–1762. Veksler, O. (1999). Efficient graph-based energy minimization methods in computer vision, PhD thesis, Department of Computer Science, Cornell University. Wainwright, M. J., Jaakkola, T. S. and Willsky, A. S. (2002). A new class of upper bounds on the log partition function, Proc. Uncertainty in Articial Intelligence, Vol. 18, pp. 536–543. Wilkinson, D. J. and Yeung, S. K. H. (2002). Conditional simulation from highly structured gaussian systems, with application to blocking-MCMC for the Bayesian analysis of very large linear models, Statistics and Computing 12: 287–300. Zhang, N. L. and Poole, D. (1994). A simple approach to Bayesian network computations, Proc. of the Tenth Canadian Conference on Artificial Intelligence, pp. 171–178. Zhang, X., Aberdeen, D. and Vishwanathan, S. V. N. (2007). Conditional random fields for multi-agent reinforcement learning, International Conference on Machine Learning. Bibliography 78 Zhu, X. and Ghahramani, Z. (2002). Towards semi-supervised classification with Markov Random Fields, Technical Report CMU-CALD-02-106, CMU.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Monte Carlo integration in discrete undirected probabilistic...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Monte Carlo integration in discrete undirected probabilistic models Hamze, Firas 2008
pdf
Page Metadata
Item Metadata
Title | Monte Carlo integration in discrete undirected probabilistic models |
Creator |
Hamze, Firas |
Publisher | University of British Columbia |
Date Issued | 2008 |
Description | This thesis contains the author’s work in and contributions to the field of Monte Carlo sampling for undirected graphical models, a class of statistical model commonly used in machine learning, computer vision, and spatial statistics; the aim is to be able to use the methodology and resultant samples to estimate integrals of functions of the variables in the model. Over the course of the study, three different but related methods were proposed and have appeared as research papers. The thesis consists of an introductory chapter discussing the models considered, the problems involved, and a general outline of Monte Carlo methods. The three subsequent chapters contain versions of the published work. The second chapter, which has appeared in (Hamze and de Freitas 2004), is a presentation of new MCMC algorithms for computing the posterior distributions and expectations of the unknown variables in undirected graphical models with regular structure. For demonstration purposes, we focus on Markov Random Fields (MRFs). By partitioning the MRFs into non-overlapping trees, it is possible to compute the posterior distribution of a particular tree exactly by conditioning on the remaining tree. These exact solutions allow us to construct efficient blocked and Rao-Blackwellised MCMC algorithms. We show empirically that tree sampling is considerably more efficient than other partitioned sampling schemes and the naive Gibbs sampler, even in cases where loopy belief propagation fails to converge. We prove that tree sampling exhibits lower variance than the naive Gibbs sampler and other naive partitioning schemes using the theoretical measure of maximal correlation. We also construct new information theory tools for comparing different MCMC schemes and show that, under these, tree sampling is more efficient. Although the work discussed in Chapter 2 exhibited promise on the class of graphs to which it was suited, there are many cases where limiting the topology is quite a handicap. The work in Chapter 3 was an exploration in an alternative methodology for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm, published in (Hamze and de Freitas 2005), fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning treeof the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs. The final contribution of this thesis, presented in Chapter 4 and also in (Hamze and de Freitas 2007), emerged from some empirical observations that were made while trying to optimize the sequence of edges to add to a graph so as to guide the population of samples to the high-probability regions of the model. Most important among these observations was that while several heuristic approaches, discussed in Chapter 1, certainly yielded improvements over edge sequences consisting of random choices, strategies based on forcing the particles to take large, biased random walks in the state-space resulted in a more efficient exploration, particularly at low temperatures. This motivated a new Monte Carlo approach to treating complex discrete distributions. The algorithm is motivated by the N-Fold Way, which is an ingenious event-driven MCMC sampler that avoids rejection moves at any specific state. The N-Fold Way can however get “trapped” in cycles. We surmount this problem by modifying the sampling process to result in biased state-space paths of randomly chosen length. This alteration does introduce bias, but the bias is subsequently corrected with a carefully engineered importance sampler. |
Extent | 750719 bytes |
Subject |
Graphical models Monte Carlo inference Bayesian inference |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-04-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0051447 |
URI | http://hdl.handle.net/2429/744 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2008-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2008_spring_hamze_firas.pdf [ 733.12kB ]
- Metadata
- JSON: 24-1.0051447.json
- JSON-LD: 24-1.0051447-ld.json
- RDF/XML (Pretty): 24-1.0051447-rdf.xml
- RDF/JSON: 24-1.0051447-rdf.json
- Turtle: 24-1.0051447-turtle.txt
- N-Triples: 24-1.0051447-rdf-ntriples.txt
- Original Record: 24-1.0051447-source.json
- Full Text
- 24-1.0051447-fulltext.txt
- Citation
- 24-1.0051447.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0051447/manifest