From the Jungle to the Garden: Growing Trees for Markov Chain Monte Carlo Inference in Undirected Graphical Models

by Jean-Noel Rivasseau, M.Sc., Ecole Polytechnique, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia, October 2005
© Jean-Noel Rivasseau, 2005

Abstract

In machine learning, Markov Chain Monte Carlo (MCMC) strategies such as Gibbs sampling are important approximate inference techniques. They use a Markov chain mechanism to explore and sample the state space of a target distribution. The generated samples are then used to approximate the target distribution. MCMC is mathematically guaranteed to converge given enough samples, yet some complex graphical models can cause it to converge very slowly to the true distribution of interest. Improving the quality and efficiency of MCMC methods is therefore an active research topic in the probabilistic graphical models field. One possible approach is to "block" parts of the graph together, sampling groups of variables instead of single variables.

In this thesis, we concentrate on a particular blocking scheme known as tree sampling. Tree sampling operates on groups of trees, and as such requires that the graph be partitioned in a special way prior to inference. We present new algorithms to find tree partitions on arbitrary graphs. This allows tree sampling to be used on any undirected probabilistic graphical model.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
2 Probability Propagation: Exact Inference
  2.1 Probabilistic Models
    2.1.1 Directed Acyclic Graphs
    2.1.2 Pairwise Markov Random Fields
    2.1.3 Factor Graphs
    2.1.4 Inference on graphical models
  2.2 The Belief Propagation Algorithm
    2.2.1 Belief Propagation for Factor Graphs
      2.2.1.1 Message propagation expressions
      2.2.1.2 Incorporating evidence
      2.2.1.3 Obtaining the marginals of every random variable
      2.2.1.4 Factor Graph BP pseudo-code
    2.2.2 Belief Propagation for Pairwise Markov Random Fields
  2.3 Loopy Belief Propagation (LBP)
3 Monte Carlo Methods
  3.1 Introducing Monte Carlo simulation
  3.2 Markov Chain Monte Carlo methods
    3.2.1 The Gibbs sampler
4 Combining Monte Carlo and Belief Propagation: MCMC Tree Sampling
  4.1 Improving MCMC algorithms
    4.1.1 From single sampling to group sampling
    4.1.2 A special case of group sampling: tree sampling
  4.2 Sampling from Trees: Forward-Filtering Backward-Sampling
    4.2.1 Pairwise MRF Trees
    4.2.2 Factor Trees
  4.3 MCMC Tree Sampling
    4.3.1 Tree partitions for Pairwise Markov Random Fields
    4.3.2 Choosing a tree partition for Factor Graphs
  4.4 MCMC Tree Sampling Implementation
    4.4.1 Rao-Blackwellization
    4.4.2 Pseudo-code and remarks
5 Finding MCMC Tree Partitions
  5.1 Introductory remarks
    5.1.1 Definition of the problem
    5.1.2 Search heuristics to minimize the number of trees in an MCMC tree partition
  5.2 The pairwise case
    5.2.1 Essential choices
    5.2.2 General overview
    5.2.3 Simplifying the initial graph
    5.2.4 Exploring and coloring the graph
    5.2.5 Choosing an ordering on the priority queue
    5.2.6 A useful improvement
    5.2.7 Full pseudo-code for our partitioning implementation
  5.3 The general case
    5.3.1 Changes in the factor graph case
    5.3.2 A Factor Graph partitioning example
6 Experimental Results
  6.1 Experimental Setup
  6.2 Inference on Pairwise Graphs
    6.2.1 Fully Connected Graph
    6.2.2 Square Lattice MRF
    6.2.3 Random Graph
    6.2.4 Remarks
  6.3 Inference on Factor Graphs
    6.3.1 First QMR graph: low leak
    6.3.2 Second QMR graph: medium leak
    6.3.3 Third QMR graph: high leak
    6.3.4 Remarks
  6.4 Tree Partitioning Algorithm Results
    6.4.1 Square Lattice MRFs
    6.4.2 Pairwise Random Graphs
    6.4.3 Random Factor Graphs
7 Conclusion
Bibliography

List of Tables

6.1 Partitioning algorithm applied to pairwise square-lattice MRFs
6.2 Partitioning algorithm applied to pairwise random graphs
6.3 Partitioning algorithm applied to random factor graphs

List of Figures

1.1 Various tree partitions
2.1 A Directed Acyclic Graph
2.2 A Pairwise Markov Random Field
2.3 A square lattice Markov Random Field
2.4 A Factor Graph
2.5 The flow of belief messages on an example tree
2.6 Pseudo-code for Belief Propagation
2.7 Pseudo-code for Loopy Belief Propagation
3.1 Pseudo-code for the Gibbs sampler
4.1 A problematic distribution for MCMC approximation
4.2 Pseudo-code for Forward-Filtering Backward-Sampling, pairwise case
4.3 A Factor Tree
4.4 Sampling from a clique in a Factor Tree
4.5 Pseudo-code for Forward-Filtering Backward-Sampling, factor tree case
4.6 An MCMC Tree Partition
4.7 Two different MCMC tree partitions of a square lattice MRF
4.8 Sampling from one of the trees of an MCMC tree partition
4.9 An illegitimate MCMC tree partition for a Factor Graph
4.10 A corrected Factor Graph MCMC tree partition
4.11 Pseudo-code for the MCMC Tree Sampler algorithm
5.1 Pseudo-code outlining the structure of our partitioning algorithm
5.2 Simplification of a graph prior to partitioning
5.3 Coloring a graph for partitioning, part I
5.4 Coloring a graph for partitioning, part II
5.5 A comparison function defining an ordering on the priority queue
5.6 Partition of a graph, using an ordering on the priority queue
5.7 Suboptimal partitioning of a graph due to a trapped vertex
5.8 Another example of trapped vertices
5.9 Pseudo-code of the partitioning algorithm
5.10 Pseudo-code for the auxiliary functions of the partitioning algorithm
5.11 Partitioning a Factor Graph, part I
5.12 Partitioning a Factor Graph, part II
5.13 The final partition of a Factor Graph
6.1 Inference algorithms comparison: fully connected pairwise graph
6.2 Inference algorithms comparison: square-lattice MRF graph
6.3 Inference algorithms comparison: random pairwise graph
6.4 Inference algorithms comparison: QMR graph with low leak
6.5 Inference algorithms comparison: QMR graph with medium leak
6.6 Inference algorithms comparison: QMR graph with high leak

Acknowledgements

I would like to thank all the people who helped me, in different ways, throughout my degree. Academically, I am first and foremost grateful to my supervisor, Nando de Freitas. It is thanks to his teaching and support that I learnt probabilistic machine learning, and thanks to his suggestions and advice that this thesis - hopefully - contains interesting results! Secondly, I would like to thank the whole Computer Science Department at UBC, and especially Firas Hamze for interesting discussions and for the initial ideas that form the core of this thesis.

I am equally grateful to my whole family, especially my wife Marina, my brother Christian, my sister Marie, my parents Marie-France and Vincent, and my parents-in-law Vera and Sergei. Their love and affection was absolutely essential for this thesis to see the light of day, whether they were nearby or far away...

Finally, I would like to thank all my friends back in France. While I was away for two years, completing my M.Sc. degree in Vancouver, I have not forgotten them nor the times we had together. I sure hope to see you all again!

Chapter 1: Introduction

This thesis is about approximate inference in general discrete probabilistic models. The methods presented here do, however, extend to Gaussian models. They are applicable to general probabilistic graphical models such as directed acyclic graphs (DAGs, also known as Bayesian networks), conditional random fields, Markov random fields (MRFs) and factor graphs.

Performing inference on a large network (more than a thousand nodes) with a reasonable variable size (more than a hundred values) remains a huge computational task. Since the exact solution is impossible to compute, people generally turn to approximations. The Gibbs sampler is one of the most widely used approximate inference methods in this domain. However, correlation between variables in the probabilistic model can lead to a poor convergence rate for a single-site MCMC sampling technique. Various attempts at improving the robustness and efficiency of MCMC methods have thus been developed.

This thesis extends the tree sampling method, proposed in [6] for square-lattice Markov Random Fields, to arbitrary pairwise graphs and factor graphs. We also introduce new automatic partition schemes, obtaining results similar to those presented in Figure 1.1.

The tree sampler combines elements of Monte Carlo simulation as well as belief propagation. It requires that the graph be partitioned into trees first. The tree partitions of Figure 1.1 allow us to perform exact inference on each tree. These exact computations form the basis of a powerful blocked Gibbs sampler.
Figure 1.1: Several tree partitions for different types of graphs. On the left, a square-lattice MRF is partitioned into two trees, as in [6]. The middle figure is an example of a tree partition on an arbitrary pairwise graph, while the right figure corresponds to a factor graph.

This thesis is organized in the following manner. Chapters 2 and 3 are background chapters about probabilistic graphical models, covering belief propagation (Chapter 2) and Monte Carlo methods (Chapter 3). They introduce all the background material necessary for understanding the MCMC Tree Sampling algorithm and framework in Chapter 4. From there we can describe our partitioning algorithm in detail, to which we devote Chapter 5 entirely. Chapter 6 presents our experimental results.

All mathematical notation in this thesis is standard. We designate a random variable with a bold letter $\mathbf{x}_i$, while the use of $x_i$ indicates one of its realizations. If a variable is conditioned on (observed), we refer to its value using $\bar{x}_i$. We also adopt the notation $\mathbf{x}_{i:j} = \{\mathbf{x}_i, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{j-1}, \mathbf{x}_j\}$; the notation $\mathbf{x}_{i:j \setminus k}$ excludes the variable $\mathbf{x}_k$ from the set. Finally, in a graph, we denote by $N(i)$ the neighbors of node $i$, and by $\mathcal{P}(i)$ its parents.

Chapter 2: Probability Propagation: Exact Inference

Probabilistic inference is at the heart of many scientific problems, including medical diagnosis, computer vision, image reconstruction and so on. In this chapter, we introduce probabilistic graphical models, which are a general framework for solving inference problems. We present the belief propagation algorithm, an instance of dynamic programming providing a solution to the exact inference problem.

2.1 Probabilistic Models

This section is a short introduction to probabilistic models. For an exhaustive background on probabilistic graphical models, see chapter 2 of Jordan's book [12], or Yedidia et al. [22].

A probabilistic model contains a set $X$ of random variables $\mathbf{x}_{1:n}$. The set of possible values, for each variable, can be finite or not. Usually, some of these random variables will be observed, which means that their value is fixed and known, while others remain hidden (unknown).

We describe a probabilistic model entirely by specifying its joint probability distribution, which is a probability mass function $p(\mathbf{x}_{1:n})$ defined on the RVs. Using this distribution, one can answer any question about the model. We can, for instance, determine whether a variable is independent of a subset of RVs, or compute conditional probabilities.

However, manipulating the joint distribution in an unfactorized representation is very costly. The graphical model framework proves to be an invaluable tool for solving multivariate statistical problems. Probabilistic graphical models are the result of a marriage between probabilistic modeling and graph theory. They allow us to express joint probability distributions in a factorized way, directly exploiting the relationships between RVs.

In the following paragraphs, we will describe three graphical model representations: directed acyclic graphs [9], pairwise Markov random fields [14], and factor graphs [13]. Each of these models factorizes the joint distribution in a different way. In fact, we can convert any of these three representations into another one. Details on the conversion process can be found in [22].
Since it requires the introduction of additional "artificial" random variables, it is better to represent a probabilistic model in its simplest form.

2.1.1 Directed Acyclic Graphs

Figure 2.1: A directed acyclic graph with 5 random variables. $\mathbf{x}_5$ has 3 parents: $\mathcal{P}(\mathbf{x}_5) = \{\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4\}$. $\mathbf{x}_1$ is the parent of $\mathbf{x}_3$ and $\mathbf{x}_4$.

DAGs are probably the most popular type of graphical model. Each node in a DAG is associated with a single random variable. Each edge in the graph is directed, defining a parent node (by convention, the source of the edge) and a child node (the target of the edge). The joint distribution of a DAG $\mathcal{G}(\mathcal{V}, \mathcal{E})$ factorizes as follows:

$$p(X) = \prod_i p(\mathbf{x}_i \mid \mathcal{P}(\mathbf{x}_i)) \quad (2.1)$$

where the $p(\mathbf{x}_i \mid \mathcal{P}(\mathbf{x}_i))$ directly represent conditional probabilities. They sum to one with respect to $\mathbf{x}_i$.

2.1.2 Pairwise Markov Random Fields

Pairwise Markov Random Fields belong to the group of undirected graphical models. Each node is again associated with a RV. We embed potentials in each variable node (noted $\phi$) and in the edges of the graph (noted $\psi$). These potentials are arbitrary positive functions of one ($\phi$) or two ($\psi$) RVs.

Figure 2.2: An example Pairwise Markov Random Field, with 6 random variables.

The joint probability of a pairwise MRF is written in the following form for a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V}$ and edges $\mathcal{E}$:

$$p(X) = \frac{1}{Z} \prod_{i \in \mathcal{V}} \phi(\mathbf{x}_i) \prod_{(i,j) \in \mathcal{E}} \psi(\mathbf{x}_i, \mathbf{x}_j) \quad (2.2)$$

2.2.1.1 Message propagation expressions

From Variable to Potential:
$$\nu_{i \to j}(x_i) = \phi(x_i) \prod_{k \in N(i) \setminus j} \mu_{k \to i}(x_i) \quad (2.5)$$

From Potential to Variable:
$$\mu_{i \to j}(x_j) = \sum_{\mathbf{x}_{N(i) \setminus j}} \psi(\mathbf{x}_{N(i)}) \prod_{k \in N(i) \setminus j} \nu_{k \to i}(x_k) \quad (2.6)$$

In terms of complexity, the $\mu_{i \to j}$ messages take most of the computational time. Assuming all variables can take the same number of values $m$, we carry out $m^{|N(i)|}$ sums for potential $i$. In each of these sums, we have to compute a product of $|N(i)| - 1$ terms. By comparison, for the $\nu$ messages, we compute $m$ times a product of $|N(i)| - 1$ terms.

2.2.1.2 Incorporating evidence

As in Chapter 4 of [12], we will use "evidence potentials" on the RVs to express conditioning (or observed variables). If RV $\mathbf{x}_i$ is conditioned and equal to $\bar{x}_i$, the evidence potential will be equal to $\delta(x_i, \bar{x}_i)$. The final potential $\phi^E$ on RV $\mathbf{x}_i$ is the product of the evidence potential and the original potential:

$$\phi^E(x_i) = \phi(x_i)\,\delta(x_i, \bar{x}_i) \quad (2.7)$$

If $\mathbf{x}_i$ is not conditioned: $\phi^E(x_i) = \phi(x_i)$.

This convenient transformation will be useful later in Chapter 4 for tree sampling. For simplicity, however, we will not denote the internal potentials explicitly as $\phi^E$. We will keep writing them as $\phi$, assuming the previous conversion has already taken place.

2.2.1.3 Obtaining the marginals of every random variable

Once messages have flowed twice (along each direction of the edges of the graph), we are ready to compute our marginals. We just take the product of all the messages received at every variable node. The marginals are thus given (not normalized) by the following equation:

$$p(x_i) \propto \prod_{j \in N(i)} \mu_{j \to i}(x_i) \quad (2.8)$$

2.2.1.4 Factor Graph BP pseudo-code

We present the complete pseudo-code of the sum-product algorithm in Figure 2.6. The implementation shown here uses two depth-first traversals of the tree. In the initialization step, we need to record the parent of every node (the root will be its own parent). We usually do that with another depth-first traversal of the tree, which we combine with the depth-first traversal of phase 1. In the first phase we send from u to v, while in the second phase we send from v to u. This is of course to reverse the direction of the messages, as shown in Figure 2.5.
Note also that in phase 1, we send a message when a vertex is finished; in phase 2, we send a message when a vertex is discovered.

Figure 2.6: Belief Propagation algorithm for a factor tree $\mathcal{T}(\mathcal{V}, \mathcal{P})$. $\mathcal{V}$ represents the set of random variables, while $\mathcal{P}$ is the set of potentials attached to the graph.

Initialization
- Choose an arbitrary root node $r$ for the tree $\mathcal{T}$. For every node in $\mathcal{T}$, record its parent.

Message propagation
- Do a depth-first traversal of the tree $\mathcal{T}$, starting from the root node $r$. During the course of this traversal, whenever a node $u$ is finished (i.e., when the exploration algorithm has returned from all its children):
  - Obtain $u$'s parent, $v$. If $u = v$ (also meaning $u = r$: we are at the root), do nothing and go on to the next node. Else:
  - If $u \in \mathcal{V}$ ($u$ is a variable node): send the normalized message $\nu_{u \to v}$ from $u$ to $v$, according to Equation (2.5).
  - If $u \in \mathcal{P}$ ($u$ is a potential node): send the normalized message $\mu_{u \to v}$ from $u$ to $v$, according to Equation (2.6).
- Do a second depth-first traversal of the tree $\mathcal{T}$, starting from the root node $r$. During the course of this traversal, whenever a node $u$ is discovered (i.e., when this node is first encountered by the exploration algorithm):
  - Obtain $u$'s parent, $v$. If $u = v$, do nothing and go on to the next node. Else:
  - If $v \in \mathcal{V}$ ($v$ is a variable node): send the normalized message $\nu_{v \to u}$ from $v$ to $u$, according to Equation (2.5).
  - If $v \in \mathcal{P}$ ($v$ is a potential node): send the normalized message $\mu_{v \to u}$ from $v$ to $u$, according to Equation (2.6).

Marginals
- For every variable $\mathbf{x}_i \in \mathcal{V}$: $p(x_i) = \prod_{j \in N(i)} \mu_{j \to i}(x_i)$.
- Normalize the marginals: $p(x_i) \leftarrow p(x_i) / \sum_{x_i} p(x_i)$.

2.2.2 Belief Propagation for Pairwise Markov Random Fields

The BP algorithm for pairwise MRFs is a specialization of the one we just presented for factor graphs. The only changes appear in the expressions of the messages. Since every potential is linked to only one or two variables, we can elegantly combine Equations (2.5) and (2.6) into a single equation. The message $m_{j \to i}$ from a variable $\mathbf{x}_j$ to another one $\mathbf{x}_i$ is given by:

$$m_{j \to i}(x_i) = \sum_{x_j} \phi(x_j)\,\psi(x_i, x_j) \prod_{k \in N(j) \setminus i} m_{k \to j}(x_j) \quad (2.9)$$

The message $m_{j \to i}$ is of the same size as $\mathbf{x}_i$, and not $\mathbf{x}_j$ (i.e., it has the same number of values as $\mathbf{x}_i$ can take). However, the sum and product of messages occur over $\mathbf{x}_j$, the sending variable. The unnormalized marginals are obtained in the same way as for factor graphs. We can write, similarly to Equation (2.8):

$$p(x_i) \propto \phi(x_i) \prod_{j \in N(i)} m_{j \to i}(x_i) \quad (2.10)$$
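As a concrete illustration of the two-pass schedule and of Equations (2.9) and (2.10), here is a minimal Python sketch of sum-product BP on a small pairwise tree. It is our own illustrative code, not the thesis implementation; the graph layout, potential arrays and function names (neighbors, phi, psi, tree_bp) are assumptions of this sketch.

    import numpy as np

    # A pairwise tree over binary variables: adjacency lists, node potentials
    # phi[i] (vectors), edge potentials psi[(i,j)][a,b] = psi(x_i=a, x_j=b).
    neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
    phi = {i: np.ones(2) for i in neighbors}
    phi[1] = np.array([0.9, 0.1])                 # a soft "observation" on x_1
    psi = {(i, j): np.array([[2.0, 1.0], [1.0, 2.0]])
           for i in neighbors for j in neighbors[i] if i < j}

    def edge_pot(i, j):
        # Stored once per edge; transpose when traversing in the other direction.
        return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

    def message(j, i, msgs):
        # Equation (2.9): m_{j->i}(x_i) = sum_{x_j} phi(x_j) psi(x_i,x_j)
        #                                 prod_{k in N(j)\i} m_{k->j}(x_j)
        prod = phi[j].copy()
        for k in neighbors[j]:
            if k != i:
                prod *= msgs[(k, j)]
        m = edge_pot(j, i).T @ prod
        return m / m.sum()                        # normalized, as in Figure 2.6

    def tree_bp(root=0):
        # Phase 1: send when a vertex is finished (leaves to root).
        # Phase 2: send when a vertex is discovered (root to leaves).
        msgs, order = {}, []
        def dfs(u, parent):
            for v in neighbors[u]:
                if v != parent:
                    dfs(v, u)
            if parent is not None:
                msgs[(u, parent)] = message(u, parent, msgs)
            order.append((u, parent))
        dfs(root, None)
        for u, parent in reversed(order):
            if parent is not None:
                msgs[(parent, u)] = message(parent, u, msgs)
        marginals = {}
        for i in neighbors:
            p = phi[i].copy()
            for j in neighbors[i]:
                p *= msgs[(j, i)]
            marginals[i] = p / p.sum()            # Equation (2.10), normalized
        return marginals

    print(tree_bp())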
2.3 Loopy Belief Propagation (LBP)

The sum-product algorithm can only be carried out on a tree. On a graph with loops, it is impossible to respect the message-passing protocol described in Section 2.2 while ensuring that each node sends a message to each of its neighbors. However, Equations (2.5), (2.6) and (2.9) do not make any reference to a particular graph structure. By ignoring the message-passing protocol, we can propagate the messages defined by the expressions of Sections 2.2.1.1 and 2.2.2 on arbitrary graphs.²

² The graphical model must still be discrete or Gaussian, which is the second limitation of exact Belief Propagation.

Figure 2.7: Loopy Belief Propagation algorithm for a pairwise graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ and a predetermined number of steps $T$.

Initialization
- For every $(u, v) \in \mathcal{E}$, set $m^0_{u \to v}(x_v) = 1$ for all possible values $x_v$ of $\mathbf{x}_v$.

Message propagation
- For $t = 1, \ldots, T$:
  - For every $u \in \mathcal{V}$, and for every $v \in N(u)$:
    $$m^t_{u \to v}(x_v) = \sum_{x_u} \phi(x_u)\,\psi(x_u, x_v) \prod_{w \in N(u) \setminus v} m^{t-1}_{w \to u}(x_u)$$

Marginals
- For every variable $\mathbf{x}_u \in \mathcal{V}$: $p(x_u) \propto \phi(x_u) \prod_{v \in N(u)} m^T_{v \to u}(x_u)$

This process is named loopy belief propagation, and is probably the most popular algorithm based on belief propagation. Each of the nodes in the graph will send, in turn, a message to each of its neighbors. This is repeated for a certain number of steps. At step $t$, a node sends a new message $m^t$ to its neighbors based on the messages received at step $t-1$. The pseudo-code is presented in Figure 2.7, with the message equations corresponding to a pairwise MRF.

LBP is an approximate inference algorithm, and there are no theoretical guarantees that it will converge to the true marginals. It can either fail to converge at all (cycling through different beliefs) or predict beliefs that are inaccurate. In Chapter 6, we will use LBP as a benchmark to compare the performance of our own inference algorithm, the tree sampler. LBP effectively fails to converge in some of our experiments. However, the study of the convergence properties of LBP could fill a chapter on its own. The reader can refer to [17] and [22] for a more in-depth presentation.

Chapter 3: Monte Carlo Methods

In the last chapter, we introduced probabilistic models and simple inference algorithms. In the current chapter, we will discuss a whole new class of algorithms: Monte Carlo methods. The first published paper on Monte Carlo, written by Metropolis and Ulam [15], dates back to 1949. The advent of massive cheap computational power contributed to their "rediscovery" in the 1990s.

In essence, Monte Carlo methods work by generating many samples, and then approximating integrals and large sums by sample sums. The approximation gets better as more samples are obtained. If we were to generate an infinite number of samples, in most cases the law of large numbers would guarantee the correct exact results. The approximation obtained after a reasonable finite amount of time is excellent in many cases. For many applications, Monte Carlo methods are indeed the only viable algorithms.

3.1 Introducing Monte Carlo simulation

In Monte Carlo, one draws a large set of $N$ samples $\{X^{(i)}\}_{i=1 \ldots N}$ from a target distribution $p(X)$ defined on a space $\mathcal{X}$. These $N$ samples are used to approximate integrals of functions over the target distribution (a task usually intractable) by finite sums over the samples. The following equation illustrates this idea:

$$\int_{\mathcal{X}} f(X)\,p(X)\,dX \approx \frac{1}{N} \sum_{i=1}^{N} f\big(X^{(i)}\big) \quad (3.1)$$

In particular, we can obtain an approximation of the marginals by counting how many times every RV has been in a given state.

Many different strategies have been developed to explore the state space and generate the samples. Sampling from a high-dimensional space $\mathcal{X}$ is not an easy task. For example, sampling from Equation (2.3) is impossible for a factor graph of a reasonable size. We will review MCMC methods in the next section. This class of Monte Carlo strategies is of special importance to us, since our own inference algorithm, introduced in Chapter 4, is based on MCMC.
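As a minimal illustration of the counting estimate implied by Equation (3.1), the sketch below estimates marginals from a set of joint samples by measuring state occupancies. The sample array, its shape and the function name are hypothetical choices of ours.

    import numpy as np

    def marginal_estimates(samples, num_states):
        """Approximate p(x_i = k) by the fraction of samples with x_i in state k.

        samples: integer array of shape (N, n) -- N joint samples of n variables.
        Returns an (n, num_states) array of estimated marginals.
        """
        N, n = samples.shape
        est = np.zeros((n, num_states))
        for k in range(num_states):
            est[:, k] = (samples == k).sum(axis=0) / N
        return est

    # Example: 1000 fake samples of 4 ternary variables drawn uniformly.
    rng = np.random.default_rng(0)
    print(marginal_estimates(rng.integers(0, 3, size=(1000, 4)), num_states=3))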
3.2 Markov Chain Monte Carlo methods

An introduction to MCMC algorithms can be found in Andrieu et al. [1], and a more extensive treatment in Gilks et al. [7] and Robert and Casella [5].

MCMC strategies use a Markov chain mechanism to explore the state space. This means that a given sample $X^{(i)}$ depends only on the previous sample $X^{(i-1)}$. This is useful when we can easily write an expression for the probability of a new sample given the last one. We use this transition distribution to "jump" from one state (sample) to another, starting from an arbitrary initial state. If this transition distribution obeys some properties¹ (which is usually the case in practice), we can approximate the original distribution by MCMC sampling.

¹ The transition distribution should obey the irreducibility and aperiodicity properties. See [1].

The most popular MCMC algorithm is the Metropolis-Hastings (MH) algorithm. In fact, most other MCMC algorithms are derived from this one, and can be interpreted as special cases of this basic method. We will not cover the MH algorithm but instead focus on the Gibbs sampler, an algorithm particularly suited to approximate inference on probabilistic graphical models.

3.2.1 The Gibbs sampler

If we have an $n$-dimensional vector $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ and the expressions for all the full conditionals $\{p(\mathbf{x}_i \mid \mathbf{x}_1, \ldots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_n)\}_{i=1 \ldots n}$, we can sample in turn from each of these conditionals to generate a valid sample for Monte Carlo approximation.

We present a unified framework for running a Gibbs sampler on MRFs and factor graphs in Figure 3.1. The only differences appear in the equations giving the full conditional distributions. On an MRF, we obtain them with the following derivation:

$$p(\mathbf{x}_i \mid \mathbf{x}_{1:n \setminus i}) = \frac{p(\mathbf{x}_{1:n})}{p(\mathbf{x}_{1:n \setminus i})} = \frac{\prod_j \phi(\mathbf{x}_j) \prod_{(j,k) \in \mathcal{E}} \psi(\mathbf{x}_j, \mathbf{x}_k)}{\sum_{\mathbf{x}_i} \prod_j \phi(\mathbf{x}_j) \prod_{(j,k) \in \mathcal{E}} \psi(\mathbf{x}_j, \mathbf{x}_k)} \quad (3.2)$$

In Equation (3.2), the denominator is a sum, constant with respect to $\mathbf{x}_i$. The product in the numerator contains terms depending on $\mathbf{x}_i$, and terms independent of $\mathbf{x}_i$. Up to a proportionality constant, we can remove the independent terms and write:

MRF full conditionals:
$$p(\mathbf{x}_i \mid \mathbf{x}_{1:n \setminus i}) \propto \phi(\mathbf{x}_i) \prod_{j \in N(i)} \psi(\mathbf{x}_i, \mathbf{x}_j) \quad (3.3)$$

For a factor graph, the derivation is similar and we obtain the following expression.

Factor graph full conditionals:
$$p(\mathbf{x}_i \mid \mathbf{x}_{1:n \setminus i}) \propto \phi(\mathbf{x}_i) \prod_{j \in N(i)} \psi_j(\mathbf{x}_{C_j}) \quad (3.4)$$

Figure 3.1: The Gibbs sampler algorithm.

Initialization
- Initialize $X^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)}\}$ at random.

Gibbs sampling steps
- For $t = 1, \ldots, T$:
  - Sample $x_1^{(t)} \sim p(\mathbf{x}_1 \mid x_2^{(t-1)}, \ldots, x_n^{(t-1)})$ according to Equation (3.5) or (3.6)
  - Sample $x_i^{(t)} \sim p(\mathbf{x}_i \mid x_1^{(t)}, \ldots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \ldots, x_n^{(t-1)})$
  - Sample $x_n^{(t)} \sim p(\mathbf{x}_n \mid x_1^{(t)}, \ldots, x_{n-1}^{(t)})$

Expressions for the full conditionals
- Pairwise MRF case (Equation (3.3)):
  $$p(x_i^{(t)} = x_i \mid x_1^{(t)}, \ldots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \ldots, x_n^{(t-1)}) \propto \phi(x_i) \prod_{j \in N(i)} \psi(x_i, x_j^*) \quad (3.5)$$
- Factor graph case (Equation (3.4)):
  $$p(x_i^{(t)} = x_i \mid x_1^{(t)}, \ldots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \ldots, x_n^{(t-1)}) \propto \phi(x_i) \prod_{j \in N(i)} \psi_j(x_i, x^*_{C_j \setminus i}) \quad (3.6)$$
  In these equations, $x_j^* = x_j^{(t)}$ if $j < i$, and $x_j^* = x_j^{(t-1)}$ if $j > i$.

Marginals
- For every variable $\mathbf{x}_i$: $p(\mathbf{x}_i = x_i) = \frac{1}{T+1} \sum_{t=0}^{T} \delta(x_i^{(t)} = x_i)$
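To make Equation (3.5) concrete, here is a minimal single-site Gibbs sweep on a small discrete pairwise MRF in Python. It is a sketch under our own assumptions (a binary 4-cycle with attractive couplings, no burn-in handling); it is not the implementation used in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # A small pairwise MRF: a 4-cycle of binary variables.
    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    phi = {i: np.ones(2) for i in neighbors}
    psi = {(i, j): np.array([[2.0, 1.0], [1.0, 2.0]])
           for i in neighbors for j in neighbors[i] if i < j}

    def edge_pot(i, j):
        return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

    def gibbs_sweep(x):
        # One pass of single-site Gibbs: resample each x_i from Equation (3.5).
        for i in sorted(neighbors):
            p = phi[i].copy()
            for j in neighbors[i]:
                p *= edge_pot(i, j)[:, x[j]]   # psi(x_i, x_j) with x_j held fixed
            p /= p.sum()
            x[i] = rng.choice(len(p), p=p)
        return x

    x = {i: int(rng.integers(2)) for i in neighbors}
    samples = [dict(gibbs_sweep(x)) for _ in range(1000)]   # ignoring burn-in
    print(np.mean([s[0] for s in samples]))   # Monte Carlo estimate of p(x_0 = 1)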
Chapter 4: Combining Monte Carlo and Belief Propagation: MCMC Tree Sampling

    "And among these I hold trees dear." (J.R.R. Tolkien, The Silmarillion)

The two previous chapters presented different methods for performing inference on probabilistic graphical models. Belief propagation allows for exact inference, but its range of application is limited. On the other hand, Monte Carlo simulation is more practical, but remains an approximation. It fails to fully exploit the structural properties of the underlying graphical model. This chapter builds on Chapters 2 and 3 in order to present a strategy for combining Monte Carlo simulation and exact belief propagation.

4.1 Improving MCMC algorithms

Basic MCMC algorithms such as the naive Gibbs sampler tend to be slow: they need many samples to converge. Over recent years, several different schemes have been developed to try to improve the convergence of MCMC algorithms.

4.1.1 From single sampling to group sampling

An MCMC sample consists of a full instantiation of the random variables in the model. For interesting problems, it is impossible to sample directly from the full joint probability. In the Gibbs sampler, one random variable is sampled at a time. However, this is problematic when it is difficult to move through the target distribution in "one dimension".¹ If we start in a low-probability region of the state space, it will take many samples to get to the interesting region. Figure 4.1 illustrates this idea graphically.

¹ By "moving in one dimension", we mean that we will sample a variable conditioned on all the others (Gibbs). Our sampling distribution is thus effectively one-dimensional.

Figure 4.1: An example distribution in two dimensions ($X = (\mathbf{x}_1, \mathbf{x}_2)$), which is hard to approximate by MCMC methods. If we use Gibbs sampling, we sample $\mathbf{x}_1$ conditioned on $\mathbf{x}_2$, and vice versa; the individual sampling steps are drawn as one-dimensional moves. It is hard to get to the region where most of the distribution weight (density) is, since we start in a remote region.

Note that if we could sample from the full joint distribution from the start, convergence would be much faster, as we would instantly move to the regions of high probability. Said otherwise, "moving in two dimensions" is dramatically better in Figure 4.1. This example demonstrates a weakness of the naive Gibbs sampler, and motivates the search for better solutions.

Generally, we would like to exploit the structural properties of the graphical model. RVs should thus be sampled in blocks², or groups, to avoid the situation of Figure 4.1. However, the trade-off is that sampling in groups is harder. The key is to create specially crafted groups, where sampling can be done efficiently.

² Blocking in MCMC or in other sampling techniques is a well-discussed topic in the machine learning literature. Various blocking schemes have been developed: see for instance [11] and [21] for Gibbs blocking, and [3] for cutsets in Bayesian networks.

4.1.2 A special case of group sampling: tree sampling

There is an efficient way of sampling on trees called forward-filtering backward-sampling (FFBS), described by Carter and Kohn in [4] and Wilkinson and Yeung in [21]. This algorithm allows us to draw independent samples of the whole tree. We describe it in the next section for MRFs and factor graphs. Sampling on a factor tree is an extension of the original algorithm; it is one of the main contributions of this thesis.
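A toy numerical illustration of the Figure 4.1 problem, under assumptions entirely our own (two binary variables whose joint mass sits almost exclusively on two strongly correlated states): single-site Gibbs must pass through near-zero-probability configurations to move between the two modes, so it mixes slowly, while sampling the block $(\mathbf{x}_1, \mathbf{x}_2)$ jointly moves freely.

    import numpy as np

    rng = np.random.default_rng(0)

    # Joint over two binary variables: almost all mass on (0,0) and (1,1).
    p = np.array([[0.4995, 0.0005],
                  [0.0005, 0.4995]])

    def gibbs_chain(steps):
        x = [0, 0]
        in_mode = 0
        for _ in range(steps):
            for i in (0, 1):                  # resample x_i given the other
                cond = p[:, x[1]] if i == 0 else p[x[0], :]
                x[i] = rng.choice(2, p=cond / cond.sum())
            in_mode += (x[0] == 1 and x[1] == 1)
        return in_mode / steps                # time spent in the (1,1) mode

    def block_chain(steps):
        states = rng.choice(4, size=steps, p=p.ravel())   # exact joint sampling
        return np.mean(states == 3)           # index 3 corresponds to (1,1)

    # True occupancy of (1,1) is 0.4995; the single-site chain rarely crosses
    # between modes, so its estimate is far noisier than the blocked one.
    print(gibbs_chain(10000), block_chain(10000))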
4.2 Sampling from Trees: Forward-Filtering Backward-Sampling

4.2.1 Pairwise MRF Trees

For pairwise graphs, the key to tree sampling is that we can decompose the full probability $p(X)$ as follows:

$$p(X) = p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1})\,p(\mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$$

Here $\mathbf{x}_n$ is a leaf in the tree, which means that given its parent $\mathcal{P}(\mathbf{x}_n)$, $\mathbf{x}_n$ does not depend on any other variables of the graph. This is clear by looking at Equation (2.2). Thus we write:

$$p(X) = p(\mathbf{x}_n \mid \mathcal{P}(\mathbf{x}_n))\,p(\mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$$

If there are other leaves in $\mathcal{G}$, we can repeat this procedure for $p(\mathbf{x}_1, \ldots, \mathbf{x}_{n-1})$, yielding:

$$p(X) = p(\mathbf{x}_1) \prod_{i > 1} p(\mathbf{x}_i \mid \mathcal{P}(\mathbf{x}_i)) \quad (4.1)$$

This is in fact the same expression as the factorization of the joint probability distribution of a DAG in Equation (2.1). On a DAG, by choosing a suitable sampling ordering, we can directly generate a sample of the whole tree. On an undirected model, we do not have expressions for $p(\mathbf{x}_1)$ and the conditional probabilities $p(\mathbf{x}_i \mid \mathcal{P}(\mathbf{x}_i))$. We first obtain $p(\mathbf{x}_1)$ by carrying out the first phase of belief propagation, the one where the messages flow from the leaves to the root $\mathbf{x}_1$. To compute the conditional probabilities, note that conditioning the parent on its already-sampled value $\bar{x}_{\mathcal{P}(i)}$ (via an evidence potential) turns the edge potential $\psi(x_i, x_{\mathcal{P}(i)})$ into a function of $x_i$ alone, while the rest of the tree influences $\mathbf{x}_i$ only through the messages coming from its other neighbors. This gives us the following important equation for the conditional probabilities:

$$p(\mathbf{x}_i = x_i \mid \mathcal{P}(\mathbf{x}_i) = \bar{x}_{\mathcal{P}(i)}) \propto \phi(x_i)\,\psi(x_i, \bar{x}_{\mathcal{P}(i)}) \prod_{j \in N(i) \setminus \mathcal{P}(i)} m_{j \to i}(x_i) \quad (4.2)$$

With these expressions, we can obtain a sample of the whole tree. We start by sampling the root $\mathbf{x}_1$. We then continue to sample down the tree, from the root to the leaves, using Equation (4.2). Figure 4.2 describes the complete pseudo-code for FFBS on a pairwise undirected tree.

Figure 4.2: Forward-Filtering Backward-Sampling for a pairwise tree.

Forward filtering
- Run the first pass of belief propagation for a pairwise graph, sending messages from the leaves to the root $\mathbf{x}_1$.

Backward sampling
- Sample the root node $\mathbf{x}_1$ from its marginal $p(x_1) \propto \phi(x_1) \prod_{j \in N(1)} m_{j \to 1}(x_1)$, and record the sample as $\bar{x}_1$.
- Moving down the tree from the root to the leaves, sample every $\mathbf{x}_i$ from $p(x_i \mid \mathcal{P}(\mathbf{x}_i) = \bar{x}_{\mathcal{P}(i)})$ according to Equation (4.2), and record the sample as $\bar{x}_i$.

4.2.2 Factor Trees

On a factor tree, a potential $\psi_i$ can be linked to more than two variables, forming a clique $\mathbf{x}_{C_i}$ (Figure 4.3). The backward sampling phase must therefore sample the variables of a clique one at a time, conditioning on the clique members already sampled (Figure 4.4). Suppose the clique variables are $\mathbf{x}_{1:k}$, that $\mathbf{x}_{1:m-1}$ have already been sampled, and that $\mathbf{x}_1$ is the parent of $C_i$. We compute the message from $\psi_i$ to $\mathbf{x}_m$:

$$\mu_{i \to m}(x_m) \propto \sum_{x_{m+1:k}} \psi_i(\bar{x}_{1:m-1}, x_m, x_{m+1:k}) \prod_{l=m+1:k} \nu_{l \to i}(x_l) \quad (4.6)$$

and sample $\mathbf{x}_m$ from

$$p(\mathbf{x}_m = x_m \mid \bar{x}_{1:m-1}) \propto \mu_{i \to m}(x_m) \prod_{j \in N(m) \setminus i} \mu_{j \to m}(x_m) \quad (4.5)$$

recording the sample as $\bar{x}_m$. Since the RVs $\mathbf{x}_{1:m-1}$ have already been sampled, they enter $\psi_i$ as fixed values, while the messages $\nu_{l \to i}$ for the remaining clique members come from the forward (filtering) pass.

Figure 4.3: A Factor Tree.

Figure 4.4: Sampling from a clique in a Factor Tree.

Figure 4.5: Forward-Filtering Backward-Sampling for a discrete factor tree $\mathcal{T}$.
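To illustrate the pairwise case of Figure 4.2, here is a sketch of FFBS on a simple chain (a tree in which every node has a single child), with a data layout of our own choosing: forward messages as in Equation (2.9), then backward sampling via Equation (4.2). It is illustrative only, not the thesis code.

    import numpy as np

    rng = np.random.default_rng(0)

    # A chain x_0 - x_1 - ... - x_{n-1} of binary variables.
    n = 5
    phi = [np.array([1.0, 2.0])] * n
    psi = [np.array([[3.0, 1.0], [1.0, 3.0]])] * (n - 1)  # psi[i]: x_i, x_{i+1}

    def ffbs_chain():
        # Forward filtering: m[i] = message from x_{i+1} to x_i,
        # computed from the leaf x_{n-1} up to the root x_0 (Equation 2.9).
        m = [np.ones(2) for _ in range(n)]
        for i in range(n - 2, -1, -1):
            m[i] = psi[i] @ (phi[i + 1] * m[i + 1])
            m[i] /= m[i].sum()
        # Backward sampling: root from its marginal, then down the chain.
        x = np.empty(n, dtype=int)
        p = phi[0] * m[0]
        x[0] = rng.choice(2, p=p / p.sum())
        for i in range(1, n):
            # Equation (4.2): phi(x_i) psi(x_{i-1}=sampled, x_i) m_{i+1->i}(x_i)
            p = phi[i] * psi[i - 1][x[i - 1], :] * m[i]
            x[i] = rng.choice(2, p=p / p.sum())
        return x

    print(ffbs_chain())   # one exact joint sample of the whole chain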
4.3 MCMC Tree Sampling

Suppose now that the graph has been partitioned into a set $\mathcal{T}$ of disjoint trees. Conditioned on all the variables outside a tree $T$, that tree can be sampled in one pass with the FFBS algorithms of the previous section. Of course, this is true for every tree $T$ in the partition $\mathcal{T}$. Given all other trees³, we can thus obtain a full joint sample for any single tree $T$. Using this idea, we construct a blocked Gibbs sampler. We don't sample individual variables one at a time, but entire trees. In fact, the "simple" Gibbs sampler can be seen as a special case of our more general tree sampler, where all trees are just single variables.

³ "Given all other trees" actually means "given all the variables in the other trees".

Before describing the MCMC tree sampler implementation, we enumerate the different rules needed to generate correct tree partitions for MCMC sampling.

4.3.1 Tree partitions for Pairwise Markov Random Fields

For a pairwise MRF $\mathcal{G}(\mathcal{V}, \mathcal{E})$, the problem of choosing a tree partition can be reduced to the mathematical problem of finding a set of $l$ disjoint trees $\mathcal{T}_1(\mathcal{V}_1, \mathcal{E}_1), \ldots, \mathcal{T}_l(\mathcal{V}_l, \mathcal{E}_l)$ covering $\mathcal{G}$. The partition must comply with the following rules:

Rule 1. Every vertex and edge in a tree $T \in \mathcal{T}$ must also be present in the original graph $\mathcal{G}$.

Rule 2. Each vertex must be present in one of the trees of $\mathcal{T}$, and only in one tree. That is, the set of trees is disjoint.

Rule 3. For each tree $T$, and each pair of vertices $(i, j)$ in $T$, if there exists an edge between $i$ and $j$ in the original graph, then the same edge must be present in $T$.

We will delete some edges so that there are no edges between different trees. A deleted edge will correspond to a link between a variable belonging to the tree that we are currently sampling and another variable in a different tree. This other RV will be considered observed while we sample the current tree. The last rule can thus be written as:

Rule 4. For each edge $(i, j)$ in the original graph $\mathcal{G}$, if vertices $i$ and $j$ belong to different trees in the tree partition, then $(i, j) \notin \mathcal{E}_k$ for all $k \in \{1, \ldots, l\}$.

These four rules can be combined into the following formal definition.

Definition 4.3.1. Given a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, an MCMC tree partition is a set of $l$ trees $\{\mathcal{T}_1(\mathcal{V}_1, \mathcal{E}_1), \ldots, \mathcal{T}_l(\mathcal{V}_l, \mathcal{E}_l)\}$ satisfying the following conditions:

$$\forall k \in \{1, \ldots, l\},\ \mathcal{V}_k \subset \mathcal{V},\ \mathcal{E}_k \subset \mathcal{E} \quad (4.7)$$

$$\bigcup_{k=1}^{l} \mathcal{V}_k = \mathcal{V} \quad (4.8)$$

$$\forall (i, j) \in \{1, \ldots, l\}^2,\ i \neq j \Rightarrow \mathcal{V}_i \cap \mathcal{V}_j = \emptyset \quad (4.9)$$

$$\forall k \in \{1, \ldots, l\},\ \forall (i, j) \in \mathcal{V}_k^2,\ (i, j) \in \mathcal{E} \Rightarrow (i, j) \in \mathcal{E}_k \quad (4.10)$$

Rule 1 gives (4.7); Rule 2 gives both (4.8) and (4.9). Rule 3 gives (4.10), while Rule 4 does not need to be formulated explicitly, as tree edges can only occur between vertices actually in the tree.

An MCMC tree partition is just a disjoint covering of the graph's vertices by several trees. We are allowed to discard edges between two distinct trees. Figure 4.6 gives an example of a valid MCMC tree partition on a small graph.

Figure 4.6: The original pairwise 14-node graph with several loops (left) has been partitioned into 3 trees (nodes in white, grey, and black) on the right. Edges that link vertices belonging to the same tree appear solid, while edges that are "left out" appear dotted. Note that it is impossible, for this graph, to obtain an MCMC tree partition consisting of only 2 trees.

It is also interesting to note that a square lattice MRF of an arbitrary size can always be partitioned into two trees using a "comb" partition. In fact, it can even be partitioned into two chains using a spiraling pattern. Figure 4.7 shows these partitions on an 8 by 8 square lattice MRF.

Figure 4.7: Two different MCMC tree partitions of a square lattice MRF, using a comb pattern (left) and a spiraling pattern (right).

These tree partitions allow us to sample variables in groups. Given all other trees, we can sample an individual tree in one pass according to Section 4.2.1. Potentials corresponding to edges between different trees originally depend on two RVs; while one tree is conditioned (observed), they become potentials of a single variable. Figure 4.8 illustrates this.

Figure 4.8: The process of sampling from one tree in an MCMC tree partition. The variables in white are currently sampled, while shaded nodes correspond to conditioned variables. The dotted red edges correspond to potentials that are linked both to a white and a shaded variable; conditioned on the shaded variable, they depend on a single variable.
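The conditions of Definition 4.3.1 are mechanical to check. The sketch below verifies them under representation choices of our own (vertices as integers, trees as vertex sets, and the networkx library as an assumed dependency); it is not the thesis code.

    import networkx as nx

    def is_mcmc_tree_partition(G, parts):
        """Check Definition 4.3.1 for a pairwise graph G and a list of vertex sets."""
        nodes = set(G.nodes)
        # (4.8) and (4.9): the vertex sets form a disjoint cover of V.
        if set().union(*parts) != nodes:
            return False
        if sum(len(p) for p in parts) != len(nodes):
            return False
        for part in parts:
            # (4.7) and (4.10): each block keeps every original edge inside it,
            # and the induced subgraph must be a tree (connected and acyclic).
            if not nx.is_tree(G.subgraph(part)):
                return False
        return True

    # The 4-cycle cannot be covered by a single tree, but splitting it works:
    G = nx.cycle_graph(4)
    print(is_mcmc_tree_partition(G, [{0, 1, 2, 3}]))    # False: induced loop
    print(is_mcmc_tree_partition(G, [{0, 1}, {2, 3}]))  # True: two 2-node trees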
For any RV $\mathbf{x}_i$ in the tree $T$ that we are currently sampling, we look at its edges. If an edge goes to a RV $\mathbf{x}_j$ in a different tree, we incorporate the potential $\psi$ associated with this edge into the local potential $\phi$ of $\mathbf{x}_i$. We do so by taking the product of $\psi$, conditioned on $\bar{x}_j$, and $\phi$:

$$\phi_{\mathrm{new}}(x_i) = \phi(x_i) \prod_{j \in N(i) \setminus T} \psi(x_i, \bar{x}_j) \quad (4.11)$$

Once we have incorporated all these potentials into $\phi_{\mathrm{new}}$, we are ready to use the tree sampling method of Section 4.2.1.

4.3.2 Choosing a tree partition for Factor Graphs

Unfortunately, the rules for obtaining an MCMC tree partition for a factor graph differ from those for a pairwise graph. One could naively treat a factor graph as a pairwise graph for the purposes of generating an MCMC tree partition⁴ and apply the rules of Section 4.3.1. This does not work, because the property that the potentials depend on a single variable (when conditioned on all trees except the one we are currently sampling) no longer holds. Figure 4.9 will help the reader understand the issue at hand.

⁴ That is, consider all vertices equal, whether they correspond to RVs or potentials.

Figure 4.9: An incorrect MCMC tree partition for a factor graph. While the given partition into two trees (white and shaded) would have been legitimate for a pairwise graph, it is not valid for a factor graph. The problem arises when sampling the white tree: the two red-dashed edges show that, even after conditioning out the shaded RVs, one of the potentials $\psi$ belonging to the shaded tree depends on two variables $\mathbf{x}_1$ and $\mathbf{x}_2$ in the white tree. Thus $\psi$ cannot be considered a potential depending on a single variable while we sample the white tree. We have a loop, which makes the use of the algorithm of Figure 4.5 impossible.

In order to avoid this case, we need to add another rule to our existing four:

Rule 5. For each tree $T$ in the MCMC tree partition $\mathcal{T}$, and each potential $\psi$ in $T$, if $C = \{\mathbf{x}_1, \ldots, \mathbf{x}_k\}$ is the set of RVs $\psi$ is linked to, at most one variable $\mathbf{x}_i \in C$ may be in each single other tree of $\mathcal{T}$.

This ensures that a potential $\psi$ belonging to a conditioned tree will not pose problems when sampling the current tree. Again, let us write a formal mathematical definition of an MCMC tree partition, in the factor graph case.

Definition 4.3.2. Given a factor graph $\mathcal{G}(\mathcal{V}(X, P), \mathcal{E})$, an MCMC tree partition is a set of $l$ trees $\{\mathcal{T}_1(\mathcal{V}_1(X_1, P_1), \mathcal{E}_1), \ldots, \mathcal{T}_l(\mathcal{V}_l(X_l, P_l), \mathcal{E}_l)\}$ satisfying conditions (4.7)-(4.10) above, together with:

$$\forall k \in \{1, \ldots, l\},\ \forall i \in P_k,\ \forall m \in \{1, \ldots, l\} \setminus k,\ \big|\{\, j \in \mathcal{V}_m : (i, j) \in \mathcal{E} \,\}\big| \leq 1 \quad (4.12)$$

Figure 4.10 gives a corrected tree partition for the factor graph of Figure 4.9.

Figure 4.10: A legitimate MCMC tree partition, with the addition of a third tree (in black). Now the potential linked to four variables has 2 variables in its own tree, and only one variable in each of the other trees. Property (4.12) is respected.

In practice, the property expressed by Equation (4.12) makes it much harder to find factor graph MCMC tree partitions.
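Condition (4.12) is also easy to check directly. A small sketch, with a hypothetical representation of our own (potentials as lists of the variable ids they touch, plus maps from potentials and variables to their tree indices):

    from collections import Counter

    def respects_rule5(pot_scope, pot_tree, var_tree):
        """pot_scope: dict potential -> list of variables it is linked to.
        pot_tree:  dict potential -> index of the tree containing it.
        var_tree:  dict variable  -> index of the tree containing it.
        Checks condition (4.12): every tree other than the potential's own
        tree contains at most one of the potential's variables."""
        for p, scope in pot_scope.items():
            counts = Counter(var_tree[v] for v in scope)
            for tree, c in counts.items():
                if tree != pot_tree[p] and c > 1:
                    return False
        return True

    # Figure 4.9's failure mode: a 4-variable potential in tree 1 that has
    # two of its variables in tree 0 (hypothetical ids).
    pot_scope = {"psi": ["x1", "x2", "x3", "x4"]}
    pot_tree = {"psi": 1}
    var_tree = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
    print(respects_rule5(pot_scope, pot_tree, var_tree))   # False
    var_tree["x2"] = 2      # the Figure 4.10 fix: move x2 to a third tree
    print(respects_rule5(pot_scope, pot_tree, var_tree))   # True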
4.4 MCMC Tree Sampling Implementation

In this last section, we provide the full pseudo-code for the MCMC tree sampler. The overall structure is that of a Gibbs sampler, with variables replaced by trees. To sample from these trees, we use the methods described in Figures 4.2 and 4.5.

We have not discussed so far the output of the results. If we interpreted MCMC tree sampling merely as an improvement over Gibbs sampling, we could just count the samples obtained for each variable, and use the Monte Carlo histogram as the approximation to the marginals. Per sample, MCMC tree sampling would already be more efficient than Gibbs sampling, since we sample from a much larger state space with tree sampling. It turns out that with MCMC tree sampling, we can adopt Rao-Blackwellized estimates instead of the simpler Monte Carlo histogram that corresponds to counting [6].

4.4.1 Rao-Blackwellization

For Rao-Blackwellization, we directly accumulate the conditional probabilities we computed for variable $\mathbf{x}_i$ at each step $t$. These probabilities are conditioned on the RVs not in the tree containing $\mathbf{x}_i$, denoted $T(i)$; we denote this set of RVs $X_\Delta$. These values can be computed by a simple application of belief propagation (see Section 2.2) to the tree $T(i)$, conditioned on all the other trees. For $T$ total samples, the expressions for both estimates are given by Equations (4.13) and (4.14):

Monte Carlo estimates:
$$p(\mathbf{x}_i = x_i) = \frac{1}{T+1} \sum_{t=0}^{T} \delta(x_i^{(t)} = x_i) \quad (4.13)$$

Rao-Blackwellized estimates:
$$p(\mathbf{x}_i = x_i) = \frac{1}{T+1} \sum_{t=0}^{T} p(\mathbf{x}_i = x_i \mid X_\Delta^{(t)}) \quad (4.14)$$

It can be proven [6] that Rao-Blackwellized estimates have lower variance than Monte Carlo estimates. The computation of RB estimates is more expensive, as it requires us to compute the exact conditional marginals via BP at each step. However, since we already have to do half the work of belief propagation to sample from a tree, Rao-Blackwellized estimates only cost the other half of BP's computational work. In practice, the reduction in variance is worth this extra cost.

4.4.2 Pseudo-code and remarks

Figure 4.11 contains the pseudo-code for the MCMC tree sampler; a sketch in code follows it. Note the similarity of this code with the Gibbs sampler of Figure 3.1. The MCMC tree sampler is a general Monte Carlo algorithm that builds on the Gibbs sampler and improves it through intelligent blocking.

We presented this algorithm in its "one-partition" version, but we can also use MCMC tree sampling with several tree partitions, choosing a different tree partition at each sampling step rather than a static one. That is, we use a mixture of MCMC kernels [20]. This variation on the basic MCMC tree sampler once more emphasizes the need for generating many MCMC tree partitions. Obtaining a correct MCMC tree partition is a hard problem; we attack this task in the next chapter.

Figure 4.11: Pseudo-code for the MCMC Tree Sampler algorithm.

Initialization
- Obtain a tree partition $\mathcal{T} = \{\mathcal{T}_1(\mathcal{V}_1, \mathcal{E}_1), \ldots, \mathcal{T}_l(\mathcal{V}_l, \mathcal{E}_l)\}$ from the original graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$. See Chapter 5 for the partition-finding algorithm.
- Initialize $X^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)}\}$ at random.

MCMC tree sampler steps
- For $t = 1, \ldots, T$:
  - For $i = 1, \ldots, l$: sample all the variables of tree $\mathcal{T}_i$ jointly, conditioned on the current values of the variables in all other trees, using the FFBS algorithm of Figure 4.2 (pairwise case) or Figure 4.5 (factor tree case), and accumulate the conditional marginals of Equation (4.14).
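A high-level sketch of the sampler loop in Python, assuming helper routines in the spirit of the earlier sketches: a partition given as a list of trees, and an ffbs routine that returns both a joint tree sample and the tree's conditional marginals (the extra "half pass" of BP). Everything here is our own scaffolding, not the thesis implementation.

    import numpy as np

    def tree_sampler(partition, ffbs, n_vars, n_states, T, rng):
        """partition: list of lists of variable ids (the trees).
        ffbs(tree, x): samples the tree's variables given the rest of x and
        returns (new values for the tree, dict of per-variable conditional
        marginals p(x_v | variables outside the tree)).
        Returns Rao-Blackwellized marginal estimates (Equation 4.14)."""
        x = rng.integers(n_states, size=n_vars)
        rb = np.zeros((n_vars, n_states))
        for _ in range(T):
            for tree in partition:
                values, cond_marginals = ffbs(tree, x)
                for v, val in zip(tree, values):
                    x[v] = val
                    rb[v] += cond_marginals[v]   # accumulate p(x_v | rest)
        return rb / T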
Chapter 6: Experimental Results

6.2 Inference on Pairwise Graphs

Between RVs $\mathbf{x}_i$ and $\mathbf{x}_j$ we used the following interaction potential $\psi(\mathbf{x}_i, \mathbf{x}_j)$:

$$\psi(\mathbf{x}_i = a, \mathbf{x}_j = b) = e^{M_{ab}/T} \quad (6.2)$$

where $M$ is a three-by-three diagonal matrix. The parameters $M_{aa}$ were drawn from a standard Gaussian distribution¹ with mean 0 and variance 1. We used the same matrix $M$ for each interaction potential. Each RV in the graph had a probability $P_{\mathrm{obs}} = 0.2$ of being observed. Between an observed RV $\bar{x}_i$ and a non-observed $\mathbf{x}_j$, there was a different interaction potential $\psi_{\mathrm{obs}}(\bar{x}_i, \mathbf{x}_j)$:

$$\psi_{\mathrm{obs}}(\bar{x}_i = a, \mathbf{x}_j = b) = e^{N_{ab}/T} \quad (6.3)$$

$N$ is different from $M$, but is generated in the same way. $T$ represents the temperature and was set to 0.5 for all our experiments.

¹ The $M_{aa}$ were allowed to be negative.

Ten runs were made for each experiment. We computed the reference marginals with one long run of the Gibbs sampler, and one long run of the tree sampler. If both runs converged to the same marginals (i.e., the error $E$ between the marginals was very small), then these marginals were used as "true marginals". Otherwise, this particular run was discarded along with the generated parameters.

We used three different colors² and outlier symbols in the plots: tree sampling appears in black with '+' outliers, Gibbs sampling in red ('*') and LBP in green ('x').

² If this thesis is printed in grayscale, tree sampling is black, Gibbs sampling is gray, and LBP is very light gray. This thesis should be printed in color, as it is much easier to distinguish the plots.

6.2.1 Fully Connected Graph

Figure 6.1 presents the comparison results on fully connected graphs with 20 RVs. LBP often fails to converge. Tree sampling, despite the fact that a fully connected graph forces it to create trees with only two nodes, is the best inference algorithm here.

Figure 6.1: Performance comparison on a fully connected 20-node graph (error vs. adjusted computational time). The top plot shows that LBP's performance is very poor: the lines of dots correspond to the errors obtained by various LBP runs, which are constant (LBP converges almost instantly) and high. The bottom plot demonstrates the superiority of tree sampling over Gibbs sampling.

6.2.2 Square Lattice MRF

On a 25 by 25 MRF, tree sampling is again the best method. LBP fails to converge on one of the 10 runs. The bottom plots of Figure 6.2 show that the tree sampler has a slight performance advantage over Gibbs, but has an outlier run.

6.2.3 Random Graph

We generated a 1000-node random graph with a density of 0.01. The density corresponds to the probability that any two given RVs are linked together.

Gibbs sampling completely diverged in one of the ten runs (not shown on Figure 6.3), and otherwise obtains correct results, but takes longer than tree sampling. On random graphs, LBP tends to perform well. In our experiments, we found this to be true when LBP converges; but some random graphs still cause LBP to misconverge. On the graph of Figure 6.3, LBP slightly diverged on two runs. While tree sampling also diverged by a moderate amount on one of the runs (and slightly on another one), it is generally more reliable than LBP on random graphs³.

³ Several other random graph experiments, not shown, support this conclusion.

6.2.4 Remarks

On pairwise graphs, tree sampling performs remarkably well. We found Gibbs sampling to be an efficient algorithm too. The true benefits of tree sampling, as compared to simple Gibbs sampling, are probably more visible with much larger graphs. However, conducting experiments with large graphs is challenging, as finding ground truth is impossible. On a 1200-node random graph, the long Gibbs runs used for computing the true marginals diverged by a significant amount in some cases. Gibbs sampling then gives different results than tree sampling (and LBP does not converge at all).
Figure 6.2: Performance comparison on a 25 by 25 square-lattice MRF (error vs. adjusted computational time). LBP obtains better results here, but the top plot still shows the presence of a very bad LBP run (high error). On the bottom plot, it can be seen that tree sampling has a slight advantage over Gibbs sampling; however, one of the tree sampling runs is an outlier.

Figure 6.3: Performance comparison on a 1000-node random graph (error vs. adjusted computational time). Gibbs sampling has a poor performance here (it even totally diverged on one run, not shown in the plots). LBP has two outlier runs. Tree sampling is efficient, except on one run where it takes longer to converge.

6.3 Inference on Factor Graphs

We ran experiments on QMR (Quick Medical Reference) factor graphs. Jaakkola and Jordan [10] describe QMR models in more detail, along with variational inference methods for them. The joint probability of a QMR graph $\mathcal{G}(\mathcal{V}, \mathcal{P})$ can be written as in Equation (6.4):

$$p(X) = \frac{1}{Z} \prod_{i \in \mathcal{P}} \psi_i(\mathbf{x}_{C_i}) \prod_{j \in \mathcal{V}} \phi(\mathbf{x}_j) \quad (6.4)$$

where the $\phi$ represent the priors over the binary random variables (corresponding to diseases), and the $\psi$ represent the potentials associated with positive findings⁴. An expression for $\psi$ is:

$$\psi_i(\mathbf{x}_{C_i}) = 1 - (1 - q_{i0}) \prod_{j \in C_i} (1 - q_{ij})^{x_j} \quad (6.5)$$

$q_{i0}$ and the $q_{ij}$ are the parameters of the model, and need to be chosen between 0 and 1. $q_{i0}$ represents the leak probability for finding $i$: the probability that the finding is caused by diseases other than the ones included in the model. $q_{ij}$ is the probability that disease $j$, if present, will cause a positive finding $i$.

⁴ We only included positive findings in our experiments. Unobserved findings marginalize out and have no effect on the posteriors of the RVs, while negative findings can be factorized into the RV priors. Only positive findings make inference hard (see [10]).

For each experiment, the leak probability $q_{i0}$ was fixed at a different value. This parameter drastically influences the performance of our inference algorithms: a low leak makes approximate inference harder. We set the number of RVs (diseases) to 40 and the number of potentials (findings) to 14. A Bernoulli prior was chosen for every disease, with a parameter of 0.01 for the presence of the disease.

We used 15 runs in each experiment. For each run, we chose the $q_{ij}$ uniformly at random between 0 and 1, and generated the QMR graph randomly. The density $d$, representing the probability that a given RV will be linked to a given potential, was set to 0.15. This is quite low, but necessary for the experiment to be interesting: if $d$ is high, it is simply impossible to create a worthwhile tree partition, and tree sampling degenerates into Gibbs sampling (see Section 6.4.3).

We obtained the true marginals (posteriors) of the variables via the junction tree algorithm in Kevin Murphy's BNT Toolbox [16]. The errors were computed according to Equation (6.1).
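The noisy-OR form of Equation (6.5) is straightforward to evaluate. A small sketch, with made-up parameters of our own (the function name and inputs are illustrative):

    import numpy as np

    def positive_finding_potential(diseases, q_leak, q):
        """Equation (6.5): potential of a positive finding given the binary
        disease vector, with leak q_leak and per-disease strengths q."""
        return 1.0 - (1.0 - q_leak) * np.prod((1.0 - q) ** diseases)

    d = np.array([1, 0, 1])           # two of three parent diseases present
    print(positive_finding_potential(d, q_leak=0.1, q=np.array([0.3, 0.8, 0.5])))
    # 1 - 0.9 * 0.7 * 1.0 * 0.5 = 0.685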
6.3.1 First QMR graph: low leak

A low leak corresponds to the hardest of our QMR experiments. Loopy belief propagation fails to converge, and cycles through false beliefs. As shown in Figure 6.4, both Gibbs sampling and tree sampling converge. Gibbs has the best performance, as it produces many more samples than tree sampling in the same amount of time.

Figure 6.4: Performance comparison on a low-leak QMR network (error vs. adjusted computational time). The top plot shows LBP cycling through different false beliefs. The bottom plot shows the two other algorithms converging; the Gibbs sampler converges faster.

6.3.2 Second QMR graph: medium leak

With a medium leak, all three inference algorithms converge. Figure 6.5 shows that tree sampling achieves the best performance, followed by LBP and Gibbs sampling.

Figure 6.5: Performance comparison on a medium-leak QMR network (error vs. adjusted computational time). All algorithms obtain the correct marginals (even if LBP and Gibbs have much higher errors than tree sampling). The tree sampler is clearly the fastest and most robust method here.

6.3.3 Third QMR graph: high leak

A high leak corresponds to an easy situation. All algorithms converge (Figure 6.6). Tree sampling (thanks to Rao-Blackwellization) and LBP both outperform simple Gibbs sampling. LBP is the best algorithm here.

Figure 6.6: Performance comparison on a high-leak QMR graph (error vs. adjusted computational time). On this easy case, LBP outperforms the MCMC algorithms. The top plot shows that Gibbs's performance is poor compared to the others; the bottom plot demonstrates the superiority of LBP over tree sampling.

6.3.4 Remarks

A low leak corresponds to the most interesting case, as the graph is more constrained. This causes LBP to fail in the first experiment. Tree sampling and Gibbs sampling both converge every time. Gibbs sampling produces many more samples per second than tree sampling, and outperforms our inference algorithm on these small test graphs. As the leak increases, LBP and tree sampling start to perform better: LBP is extremely efficient in the high-leak case, while tree sampling is the best in the medium-leak case.

The QMR experiments demonstrate that tree sampling is a compromise between LBP and Gibbs sampling. Contrary to LBP, it never failed to converge, and it is able to take advantage of belief propagation when it helps; in those cases, it outperforms Gibbs sampling.

6.4 Tree Partitioning Algorithm Results

All reported results count the number of trees in the resulting partition. For each experiment, we ran the partitioning algorithm 20 times. We provide the mean result (rounded), along with the best and worst run results.

6.4.1 Square Lattice MRFs

Square-lattice MRFs are hard graphs to partition automatically, since they contain so many loops. We present the partitioning results for MRFs in Table 6.1.
Table 6.1: Partitioning algorithm applied to pairwise square-lattice MRFs.

    MRF size   Queue ordering   Mean   Best   Worst
    5*5        normal              2      2       2
    5*5        random              3      2       5
    10*10      normal              5      3       7
    10*10      random              8      5      10
    20*20      normal             26     17      36
    20*20      random             25     17      31
    50*50      normal            148    105     241
    50*50      random            152    127     166
    100*100    normal            365    273     639
    100*100    random            584    563     624

Roughly, there are twenty times fewer trees than vertices⁵. We can see that the queue ordering does matter: choosing the "normal" one (defined in Figure 5.5) has clear benefits compared to a totally random one (except, surprisingly, in the 20*20 case, though even there their performance is almost equal).

⁵ This does not mean, however, that the average length of a tree is 20 vertices. In the MRF case, a single tree generally accounts for more than half of the total vertices in the graph.

6.4.2 Pairwise Random Graphs

Table 6.2: Partitioning algorithm applied to pairwise random graphs.

    Graph size   Density   Backtrack   Mean   Best   Worst
    100          0.1       yes            5      5       6
    100          0.1       no             6      5       7
    100          0.5       yes           14     14      15
    100          0.5       no            14     14      15
    1000         0.01      yes            7      6       9
    1000         0.01      no            17     10      22
    1000         0.25      yes           41     40      42
    1000         0.25      no            41     40      42
    10000        0.01      yes           22     21      24
    10000        0.01      no            31     25      37

Our algorithm provides good results for random graphs. Even with a high density (0.25), we find a correct partition of 41 trees for a 1000-node graph, where each vertex is on average linked with 250 others.

It is interesting to note that for random graphs, the variance across different runs is much lower than for square lattices. While for MRFs there was a difference of 366 trees between the best and worst runs at 10000 vertices, here this difference is only 3.

Table 6.2 compares the backtrack-enabled algorithm with the simpler one. It is clear that backtracking is very useful for random graphs with low density; if the graph has a high density, however, it is unable to help. These results confirm the claims of Section 5.2.6: backtracking can only help, and never worsens the results.

6.4.3 Random Factor Graphs

We created random factor graphs by specifying the number of RVs and the number of potentials. Here the density represents the maximal number of RVs a potential can be linked to. We chose the edges randomly.

Table 6.3: Partitioning algorithm applied to random factor graphs.

    Variables   Potentials   Density   Mean   Best   Worst
    50          30           3            6      6       6
    50          30           5           14     13      14
    250         100          4           22     19      27
    250         100          8           42     41      43
    1000        700          4          163    150     179
    1000        1500         4          139    125     150
    4000        1000         5          261    261     262
    4000        1000         10        1073   1060    1075
    100         75           100        100    100     100
    100         75           50          96     95      96
    1000        100          100        865    853     874

As expected from the example of Figure 5.13, the factor graph case is harder to partition: for a given number of random variables, the pairwise case generally produces fewer trees.

For factor graphs with low densities, the algorithm produces partitions with relatively few trees. This is encouraging. Increasing the number of links does not always increase the number of trees, since it also provides more "escape routes" for trapped groups of vertices: raising the number of potentials from 700 to 1500 with 1000 RVs, Table 6.3 shows a decrease in the number of trees found.

In cases of high density, it is impossible to obtain good partitions. The partitions for the last three lines of Table 6.3 contain trees of only one RV. Tree sampling is useless in such situations.
Chapter 7: Conclusion

    "From all its branches there spilled a golden dew upon the barren earth." (J.R.R. Tolkien, The Silmarillion)

This thesis extended the tree sampling framework [6] to factor graphs (Chapter 4) and, with automatic partitioning algorithms (Chapter 5), to arbitrary pairwise graphs. Tree sampling is a hybrid algorithm, using elements of Monte Carlo simulation and belief propagation. The experimental results of Chapter 6 showed this heritage: tree sampling is able to benefit from BP to outperform the simple Gibbs sampler in many test cases, albeit not all. But, contrary to LBP, tree sampling never failed to converge in our experiments.

Many aspects of tree sampling remain to be investigated. First, using multiple tree partitions (a mixture of MCMC kernels) could dramatically improve the performance of tree sampling. Adapting tree sampling to the continuous case is a hard challenge: we tried a variant, called Sequential Monte Carlo (SMC) tree sampling, on an MRF with a non-parametric model, using an SMC technique [8] instead of BP to sample from a tree. The mixing of MCMC and SMC did not work well, and the results were disappointing. However, tree sampling is probably adaptable to Gaussian models.

Finally, inference algorithms such as tree sampling are not yet fully understood from a theoretical point of view. On very large graphs¹, Gibbs sampling usually has poor performance. Our hope is that tree sampling would scale much better.

¹ It is hard to conduct experiments on such graphs, as it is impossible to get ground truth.

Bibliography

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

[2] Jozsef Balogh, Martin Kochol, Andras Pluhar, and Xingxing Yu. Covering planar graphs with forests. Journal of Combinatorial Theory, Series B, 94(1):147-158, 2005.

[3] B. Bidyuk and R. Dechter. Cycle-cutset sampling for Bayesian networks. Sixteenth Canadian Conference on AI, 2003.

[4] C. K. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81(3):541-553, 1994.

[5] G. Casella and C. P. Robert. Monte Carlo Statistical Methods. Springer, 1999.

[6] Nando de Freitas and Firas Hamze. From fields to trees. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 243-250. ACM International Conference Proceeding Series, Vol. 70, 2004.

[7] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.

[8] S. J. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99(465):156-168, March 2004.

[9] D. Heckerman. A tutorial on learning with Bayesian networks. Technical report, Microsoft Research, Redmond, Washington, 1995. Revised June 1996.

[10] Tommi Jaakkola and Michael I. Jordan. Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research, 10:291-322, 1999.

[11] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. International Journal of Human-Computer Studies, 42:647-666, 1995.

[12] M. Jordan. Probabilistic Graphical Models. To be published.

[13] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47, 2001.

[14] S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer-Verlag, 2001.

[15] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335-341, 1949.

[16] Kevin P. Murphy.
The Bayes Net Toolbox for MATLAB. Computing Science and Statistics, 33, 2001.

[17] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of Uncertainty in AI, pages 467-475, 1999.

[18] C. St. J. A. Nash-Williams. Edge-disjoint spanning trees of finite graphs. Journal of the London Mathematical Society, 36:445-450, 1961.

[19] J. Pearl. Evidential reasoning using stochastic simulation. Artificial Intelligence, 32:245-257, 1987.

[20] L. Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 22(4):1701-1762, 1994.

[21] D. J. Wilkinson and S. K. H. Yeung. Conditional simulation from highly structured Gaussian systems, with application to blocking-MCMC for the Bayesian analysis of very large linear models. Statistics and Computing, 12(3):287-300, 2002.

[22] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, pages 239-269, 2003.