Linear and Parallel Learning of Markov Random FieldsbyYariv Dror MizrahiA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Mathematics)The University of British Columbia(Vancouver)November 2014c© Yariv Dror Mizrahi, 2014AbstractIn this thesis, we introduce a new class of embarrassingly parallel parameter learning algorithms for Markovrandom fields (MRFs) with untied parameters, which are efficient for a large class of practical models.The algorithms parallelize naturally over cliques and, for graphs of bounded degree, have complexity thatis linear in the number of cliques. We refer to these algorithms with the acronym LAP, which stands forLinear And Parallel. Unlike their competitors, the marginal versions of the proposed algorithms are fullyparallel and for log-linear models they are also data efficient, requiring only the local sufficient statistics ofthe data to estimate parameters. LAP algorithms are ideal for parameter learning in big graphs and big dataapplications.The correctness of the newly proposed algorithms relies heavily on the existence and uniqueness of thenormalized potential representation of an MRF. We capitalize on this theoretical result to develop a newtheory of correctness and consistency of LAP estimators corresponding to different local graph neighbour-hoods. This theory also establishes a general condition on composite likelihood decompositions of MRFsthat guarantees the global consistency of distributed estimators, provided the local estimators are consistent.We introduce a conditional variant of LAP that enables us to attack parameter estimation of fully-observed models of arbitrary connectivity, including fully connected Boltzmann distributions. Once again,we show consistency for this distributed estimator, and relate it to distributed pseudo-likelihood estimators.Finally, for linear and non-linear inverse problems with a sparse forward operator, we present a newalgorithm, named iLAP, which decomposes the inverse problem into a set of smaller dimensional inverseproblems that can be solved independently. This parallel estimation strategy is also memory efficient.iiPrefaceThis dissertation is original, independent work by the author. A version of Chapter 3 was accepted tothe International Conference on Machine Learning, 2014 (with Misha Denil and Nando de Freitas as coauthors). A version of Chapter 4 was submitted to the Advances in Neural Information Processing Systemsconference, 2014 (with Misha Denil and Nando de Freitas as co authors). Chapter 6 describes joint workwith Fred Roosta and Nando de Freitas.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Contribution list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1 Definition and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.1 Random fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.2 Graphs, neighbourhoods and cliques . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.3 Markov random fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1.4 MRF and Gibbs distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Model specification and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.1 Maximum Likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2.2 Maximum Pseudo-Likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 93 The Marginal LAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.1 Model and data efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 Algorithm description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2.1 Construction of the Auxiliary MRF . . . . . . . . . . . . . . . . . . . . . . . . . . 143.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.4 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16iv3.4.1 The LAP argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.4.2 Consistency of LAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.4.3 Relationship to ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 Strong LAP theorem, conditional LAP and distributed parameter estimation . . . . . . . . 224.1 Centralised estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.1.1 Consensus estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.1.2 Distributed estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.2 Strong LAP argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.2.1 Efficiency and the choice of decomposition . . . . . . . . . . . . . . . . . . . . . . 294.3 Conditional LAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.3.1 Connection to distributed Pseudo-Likelihood and composite likelihood . . . . . . . 315 Applying LAP to Gaussian graphical models and discrete tables . . . . . . . . . . . . . . . . 325.1 Simple example: the Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 325.1.1 Gaussian example: using the first neighbourhood . . . . . . . . . . . . . . . . . . . 335.1.2 Gaussian example: using sub neighbourhood . . . . . . . . . . . . . . . . . . . . . 355.2 LAP for tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375.3 Improving the LAP accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395.4 Memory allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396 iLAP : Applying LAP to inverse problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426.1 Inverse problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426.2 Localizing inverse problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436.2.1 Localizing the conditional distribution . . . . . . . . . . . . . . . . . . . . . . . . 446.2.2 Localizing the prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446.3 The iLAP algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466.4 Image deblurring example using the DCT and wavelet transforms . . . . . . . . . . . . . . 477 Concluding remarks and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507.1 Corrupted data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507.2 Structure learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517.3 Tied parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56A.1 Equivalent definitions of the 1-neighbourhood . . . . . . . . . . . . . . . . . . . . . . . . . 56A.2 Strong LAP condition is sufficient but not necessitate . . . . . . . . . . . . . . . . . . . . . 56A.3 Conditional estimator can not be better than marginal estimator . . . . . . . . . . . . . . . . 57vList of FiguresFigure 3.1 The left column shows several popular MRFs: (a) a restricted Boltzmann machine(RBM), (b) a 2-D Ising model, and (c) a 3-D Ising model. The right hand side showsthe corresponding 1-neighbourhoods. Models (b) and (c) have small 1-neighbourhoodscompared to the full graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13Figure 3.2 Left: Relative error of parameter estimates compared to maximum likelihood for LAPand pseudo-likelihood on a 4×4 Ising grid. Error bars show the standard deviation overseveral runs. Right: Variance of the parameter estimates for each algorithm. . . . . . . . 15Figure 3.3 Left: Relative error of parameter estimates compared to maximum likelihood for LAPand pseudo-likelihood on a 4×4×4 Ising lattice. Error bars show the standard deviationover several runs. Right: Variance of the parameter estimates for each algorithm. . . . . 16Figure 3.4 Left: Relative error of parameter estimates compared to ML for LAP and pseudo-likelihood on a Chimera 3× 3× 3 model. Error bars show the standard deviation overseveral runs. Right: Variance of the parameter estimates for each algorithm. . . . . . . . 17Figure 4.1 Left: A simple 2d-lattice MRF to illustrate our notation. For node j = 7 we haveN (x j) = {x4,x8}. Centre left: The 1-neighbourhood of the clique q = {x7,x8} includ-ing additional edges (dashed lines) present in the marginal over the 1-neighbourhood.Factors of this form are used by the LAP algorithm of Chapter 3 Centre right: TheMRF used by our conditional estimator of Section 4.3 when using the same domain asChapter 3 Right: A smaller neighbourhood which we show is also sufficient to estimatethe clique parameter of q. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23Figure 4.2 Illustrating the concept of relative path connectivity. Here, A = {i, j,k}. While (k, j) arepath connected via {3,4} and (k, i) are path connected via {2,1,5}, the pair (i, j) arepath disconnected with respect to S\A. . . . . . . . . . . . . . . . . . . . . . . . . . . 26Figure 4.3 Figures (a)-(c) Illustrating the difference between LAP and Strong LAP. (a) Shows a stargraph with q highlighted in red. (b) Shows Aq required by LAP. (c) Shows an alternativeneighbourhood allowed by Strong LAP. Thus, if the root node is a response variable andthe leafs are covariates, Strong LAP states we can estimate each parameter separatelyand consistently. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27viFigure 5.1 A simple sparsity pattern for a Gaussian graphical model. The neighbourhood systemdescribed in the graph is compatible with the precision matrix for the multivariate Gaus-sian precision matrix in Figure 5.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34Figure 5.2 The precision matrix associated with the graph of Figure 5.1. The symbol × stands fornon-zero entries in the precision matrix, and Σ−1(i, j) 6= 0 ⇐⇒ (i, j) ∈ E. . . . . . . . . 34Figure 5.3 The first neighbourhood of the clique {9,10} and the auxiliary graph. . . . . . . . . . . 34Figure 5.4 Two different alternative sub neighbourhoods for the clique {9,10}. On the left, we usethe union of the neighbours of node 10 and on the right the neighbours of node 9. . . . 36Figure 5.5 A non trivial Gaussian graphical model (left) and the relative estimation error for LAPand MLE as a function of the number of data (right). . . . . . . . . . . . . . . . . . . . 37Figure 5.6 A discrete probability distribution in table form and in full exponential form for x∈{0,1}3. 38Figure 5.7 On the left, a 5× 5 lattice MRF with vertical and horizontal neighbours. The middlegraph is the first neighbourhood for the unary clique {13}. The figure on the rightshows the second neighbourhood for the unary clique {13}. New edges introduced bymarginalization are depicted with dashed green lines. . . . . . . . . . . . . . . . . . . . 40Figure 5.8 Relative error of parameter estimates compared to maximum likelihood for 1-neighbourhoodLAP and 2-neighbourhood LAP. The full graph contained 300 nodes and the PDF is amultivariable Gaussian. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40Figure 6.1 The graphical representation of the MRF in which both m and b are variables . . . . . . 43Figure 6.2 Construction of the local models. Suppose we are interested in estimating m6. Then,d˜ = {d5,d6,d7} are the 1-hop data of m6. m˜ = {m4,m5,m6,m7,m8} are the componentsof the model that affect d˜ directly. The 1-blanket consisting of d˜ and m˜ constitutes the1-neighbourhood of the parameters θ65, θ66 and θ67. . . . . . . . . . . . . . . . . . . . 45Figure 6.3 The 1-blanket (left) and 2-blanket (right) of m10. . . . . . . . . . . . . . . . . . . . . . 46Figure 6.4 MAP estimation using iLAP (1-blanket) and 4× 4 blocks. Top left: true model, topright: data, middle left: full inverse reconstruction with DCT transform, middle right:iLAP with DCT transform, bottom left: full inverse reconstruction with wavelet trans-form, bottom right: iLAP with wavelet transform. . . . . . . . . . . . . . . . . . . . . 47Figure 6.5 Relative error of global MAP estimate (left) and iLAP distributed estimates (right) usingthe wavelet transform. While the values of the regularization coefficient are different,the relative errors at the optimum values are very close. . . . . . . . . . . . . . . . . . . 49Figure 6.6 Recovery using iLAP with the first-blanket (left) and the second-blanket (right) withblocks of size 4×4 and the DCT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49Figure A.1 A simple graph. Our interest is in the clique {1,2,3} . . . . . . . . . . . . . . . . . . . 57viiAcknowledgmentsI wish to thank Prof. Nando de Freitas, for his supervision, patience and support. I also thank Prof. EldadHaber for his early stage supervision, and Prof. Joel Friedman and Prof. Luis Tenorio for commenting onan early version of this work. Special thanks to Fred Roosta for all his help and many useful discussions.I thank my family for their love and support.viiiChapter 1Introduction1.1 MotivationMarkov random fields (MRFs), also known as undirected probabilistic graphical models, are ubiquitousstructured statistical models that have impacted a significantly large number of fields, including computervision [Li, 2001, Szeliski et al., 2008], computational photography and graphics [Agarwala et al., 2004,Boykov and Veksler, 2006, Chen et al., 2008], computational neuroscience [Ackley et al., 1985, Hopfield,1984], bio-informatics [Yanover et al., 2007], natural language processing [Lafferty et al., 2001, Galley,2006, Sutton and McCallum, 2012] and statistical physics [Marinari et al., 1997, Braunstein et al., 2005].As pointed out in MacKay [2003] and Wainwright and Jordan [2008] there are also many applications inclassical statistics, constraint satisfaction and combinatorial optimization, error-correcting codes and epi-demiology. Not surprisingly, many comprehensive treatments of this important topic have appeared in thelast four decades; see for example [Kindermann and Snell, 1980, Lauritzen, 1996, Li, 2001, Bremaud, 2001,Koller and Friedman, 2009, Murphy, 2012].Despite the huge success and impact of these models, fitting them to data via maximum likelihood(ML) is prohibitively expensive in most practical situations. Although the likelihood is typically convex inthe parameters, each optimization step requires solving inference problems that in the worst case are ]P-hard [Murphy, 2012]. As stated, in bold, in the authoritative book of Koller and Friedman [2009]: “a fullinference step is required at every iteration of the gradient ascent procedure. Because inference is almostalways costly in time and space, the computational cost of parameter estimation in Markov networks isusually high, sometimes prohibitively so.”Ideally, we would like to be able to compute the maximum likelihood estimates as these are consistentand maximally asymptotic efficient [Fisher, 1922]. We remind the reader that an estimator is asymptoticallyconsistent if it converges to the true parameters as the sample size goes to infinity. An asymptoticallyconsistent estimator is maximally efficient if the variance in the estimated parameters attains the minimumpossible value among all consistent estimators as the sample size goes to infinity. Of course, in manyapplications, we are interested in penalized maximum likelihood estimates. That is, the goal is to find the1maximum a posteriori (MAP) estimates after the addition of smoothness or sparsity priors. For presentationsimplicity, we will focus the discussion ML estimates, as our results will follow straightforwardly for typicalMAP estimates.In many cases, maximum likelihood in these models is data efficient in the sense that the data term in thegradient can be easily precomputed, making its evaluation trivial during optimization. The main difficultywith maximum likelihood is that it is not model efficient since evaluating the gradient involves computingexpectations over the model distribution. This requires evaluating a sum with exponentially many terms,which is intractable for even moderately sized models, as pointed out above.If the MRF under study has low tree-width, then the junction-tree algorithm can be adopted as theinference engine [Lauritzen, 1996, Murphy, 2012]. However, for many MRFs of interest, such as squarelattices, the complexity of inference with the junction-tree algorithm grows exponentially with the size ofthe grid. This is also a severe problem when deploying skip-chain conditional random fields (CRFs) innatural language processing applications, such as named entity recognition and co-reference resolution, andcomputer vision tasks, such as dense stereo reconstruction [Galley, 2006, Sutton and McCallum, 2012,Murphy, 2012, Bradley, 2013].The computational difficulties associated with exact inference have motivated researchers to adopt al-ternative estimators even if these are not as statistically efficient as maximum likelihood. Examples of theseestimators include ratio matching, score matching, stochastic maximum likelihood, contrastive divergence,composite likelihood and pseudolikelihood [Besag, 1975, Younes, 1989, Hinton, 2000, Hyvärinen, 2005,2007, Marlin et al., 2010, Varin et al., 2011, Marlin and de Freitas, 2011, Swersky et al., 2011].An important class of approximate methods for this problem are stochastic approximation techniques,which approximate the model term by drawing samples from the model distribution, typically via Markovchain Monte Carlo (MCMC). This simulation is costly and often many samples are required for accurateestimation. Moreover, in settings where the parameters or data must be distributed across many machinessuch simulation poses additional difficulties.Another approach is to approximate the maximum likelihood objective with a factored alternative. Theleading method in this area is pseudo-likelihood. In this approach the joint distribution over all variablesin the MRF is replaced by a product of conditional distributions for each variable. Applying pseudo likeli-hood in a distributed setting is may difficult, because the conditional distributions share parameters. Severalresearchers have addressed this issue by proposing to approximate pseudo-likelihood by disjointly opti-mizing each conditional and combining the parameters using some form of averaging [Ravikumar et al.,2010, Wiesel and Hero III, 2012, Liu and Ihler, 2012]. Yet, as pointed out by its creator, Julian Besag,“My own view is that the technique is really a creature of the 1970s and 1980s and I am surprised to see itrecommended in the computer age” [Besag, 2001].In this thesis, we introduce a new approach to parameter estimation in MRFs with untied parameters,which avoids the model inefficiency of maximum likelihood for an important class of models while preserv-ing its data efficiency. Moreover, the proposed algorithms are naturally parallel and can be implemented in adistributed setting without modification. The algorithms replace the joint maximum likelihood problem with2a collection of much smaller auxiliary maximum likelihood problems which can be solved independently.We prove that if the auxiliary problems satisfy certain conditions, the relevant parameters in the auxiliaryproblems converge to the values of the true parameters in the joint model. The experiments show that goodperformance is achieved in this case and that good performance is still achieved when these conditions arenot satisfied. Violating the conditions for convergence sacrifices theoretical guarantees in exchange for evenfurther computational savings while maintaining good empirical performance.Under a strong assumption, we prove that a proposed Linear And Parallel (LAP) algorithm is exactlyequal to maximum likelihood on the full joint distribution. While not directly applicable, this result providesadditional insight into why our approach is effective.A method similar to the naive LAP algorithm was recently, and independently, introduced in the contextof Gaussian graphical models by Meng et al. [2013]. In that paper, the authors consider local neighbour-hoods of nodes, whereas we consider neighbourhoods of cliques, and they rely on a convex relaxation viathe Schur complement to derive their algorithm for inverse covariance estimation. At the time of writing thisthesis, the same authors have shown that the convergence rate to the true parameters with their method iscomparable to centralized maximum likelihood estimation [Meng et al., 2014].Although our work and that of Meng et al. arrive at distributed learning via different paths, and whiletheirs is restricted to (pair-wise) Gaussian graphical models, both works show that it is possible to capitalizeon graph structures beyond low tree-width to design algorithms that are both data and model efficient andexhibit good empirical performance.In this thesis, we also introduce the Strong LAP Condition, which characterises a large class of compositelikelihood factorisations for which it is possible to obtain global consistency, provided the local estimatorsare consistent. This much stronger sufficiency condition enables us to construct linear and globally consis-tent distributed estimators for a much wider class of models than Mizrahi et al., including fully-connectedBoltzmann machines.Using this framework, we also show how the asymptotic theory of Liu and Ihler [2012] applies moregenerally to distributed composite likelihood estimators. In particular, the Strong LAP Condition providesa sufficient condition to guarantee the validity of a core assumption made in the theory of Liu and Ihler,namely that each local estimate for the parameter of a clique is a consistent estimator of the correspondingclique parameter in the joint distribution. By applying the Strong LAP Condition to verify the assumption ofLiu and Ihler, we are able to import their M-estimation results into the LAP framework directly, bridging thegap between LAP and consensus estimators. In particular we illustrate how LAP can be applied to sparseGaussian graphical models and discrete tables.Finally, this thesis also offer an alternative approach for solving inverse problems by introducing anefficient parallel algorithm, named iLAP, which appropriately divides the large problem into smaller sub-problems of much lower dimension. This process of localization offers substantial advantages in terms ofcomputational efficiency and memory allocation.31.2 Contribution listThe main novel result of this thesis is a mathematical observation that leads to a parameter estimationtechnique, LAP, which can be used, with linear complexity, for many important graphs and parametricfamilies. The algorithm uses a set of local, independent, and low-dimensional estimators.LAP is a naturally parallel algorithm, with significantly reduced complexity, that uses data statistics forefficient online and distributed memory allocation. We prove the consistency of LAP and prove that (undersome conditions) LAP is identical to ML.We further develop the algorithm by proving a strong LAP theorem that significantly reduces its com-plexity. The strong LAP theorem establishes the relationship between the graph topology and the consis-tency of local estimators.We introduce the conditional LAP (CLAP), which is applicable to a larger class of parametric distri-bution families and prove the consistency of CLAP. We link the LAP and CLAP results to the design ofalgorithms for distributed parameter estimation in MRFs by showing how the work of Liu and Ihler [2012]and LAP can both be seen as special cases of distributed composite likelihood. Casting these two works ina common framework allows us to transfer results between them, strengthening the results of both works.We study sparse Gaussian graphical models and discrete tables, and demonstrate the proper usage ofLAP for these models. We investigate the complexity versus accuracy trade-off for these models.We present an efficient algorithm, iLAP, for solving large scale inverse problems with sparse forwardoperators. Using iLAP we reduce the large inverse problem into a set of local inverse problems of smallerdimension. This approach is naturally parallel and significantly reduces the memory requirements of eachsolver. We experiment with iLAP by applying it to an image de-blurring problem and compare the resultsto baseline estimators.1.3 Thesis organizationChapters 1 and 2 provide motivation and background material on the well established theory of MRFs.Chapter 3 introduces the LAP methodology. We define the first neighbourhood of a clique, introduce thefirst LAP algorithms and prove their consistency and relation to ML estimation.Chapter 4 refines the result of Chapter 3. In particular it includes the strong LAP theorem, whichrelates the graph topology to our ability to obtain consistent global estimators from local estimators. In thesame chapter, we introduce a conditional version of LAP (named CLAP) and relate the LAP algorithm toother estimators. In particular, we develop distributed composite likelihood estimators with emphasis onobtaining sufficient conditions for their consistency.In Chapter 5 we explain in detail how to apply LAP to sparse Gaussian graphical models and discretetables. We define higher order neighbourhoods as domains and compare the resulting trade-off of complexityversus accuracy.In Chapter 6 we introduce iLAP, which is a LAP based method for solving inverse problems. Chapter 7suggests directions for future research.4Chapter 2Background2.1 Definitions and notation2.1.1 Random fieldsA real-valued random field defined on a set S is a collection of random variables indexed by the elements inS : {xt}t∈S. Since we will only consider finite sets S, a random field is simply a k×1 random vector x, wherek is the number of elements in S. The correlation structure of x will be defined by the relationships amongthe elements in S. Thus, from now on we will assume that the random field is defined on S = {1, ...,K}.If A = {i1, ..., in} ⊂ S, we define xA = (xi1 , ..,xin), but for simplicity we shall write x instead of xS.The random field is completely defined by the joint distribution function of x. We will assume that thisdistribution has the probability density function (PDF) p. The marginal PDF corresponding to xA can beobtained from p by integration:p(xA) =∫p(x)dxc, c = S\A. (2.1)2.1.2 Graphs, neighbourhoods and cliquesLet S index a discrete set of K sites, where a site may represent a point location in a Euclidian space, a pixellocation in an image, etc. Assume that for each site index i ∈ S, there exists a corresponding subset of S,denoted as Ni. We denote the collection of the corresponding subsets {Ni} as N .Definition 1. N = {Ni} is called a neighbourhood system if and only if1. i ∈N j ⇐⇒ j ∈Ni ∀i, j ∈ S (mutual relationship)2. A site is not a neighbour of itself: i /∈Ni5The finite set S and the neighbourhood system N can be described as an undirected graph, with thesites as the nodes and the neighbourhood relationships as the edges, G = {S,E}, where S is the discrete setof K nodes and E is a list of edges satisfying {i, j} ∈ E ⇐⇒ i ∈N j.Definition 2. A subset c ⊆ S is a clique for neighbourhood system N , if any two different i, j in c areneighbours. That is:∀i, j ∈ c, i 6= j, i ∈N j.Clearly any subset of a clique, cs ⊆ c, is itself a clique, and any singleton {i} is a clique. In the graphicalmodel representation, any fully connected subset is a clique.Definition 3. A clique c is called maximal if c∪{i} is not a clique for any i /∈ c.2.1.3 Markov random fieldsThe random vector x is said to have the Markov property with respect to a neighbourhood system N if itsmarginal and conditional distributions are such that:p(xi|xS\i) = p(xi|xNi).Definition 4. A family of random variables x1, ...,xK is called a Markov Random Field on S with respect toa neighbourhood system N if it satisfies the Markov property.2.1.4 MRFs and Gibbs distributionsA Gibbs random field (GRF) on S with respect to neighbourhood system N , is a set of random variableswith the joint density function of the formp(x) = 1Zexp(−U(x)). (2.2)U is called the energy f unction and Z is a constant called the partition function which plays the role of thenormalization factor. The constant Z can be calculated by integration over the high dimensional space:Z =∫...∫exp(−U(x))dx. (2.3)The Hammersley-Clifford theorem [Hammersley and Clifford, 1971] establishes the equivalence be-tween MRFs and GRFs. The theorem states that a probability distribution that has a positive mass or densitysatisfies the Markov property under neighbourhood system N if and only if it is a Gibbs random field,where the energy function U is a summation of functions called potentials over the set of cliques:U(x) = ∑c∈CEc(xc ). (2.4)6Ec(xc ) is the energy or clique potential associated with the variables in clique c, and it depends only on thevalues at the site inside the clique c. C is the collection of all cliques.Combining Equations (2.2) and (2.4) yieldsp(x) = 1Zexp ∑c∈CEc(xc ). (2.5)From the above expression it is clear that p is determined by local characteristics. The representation of pusing the potentials in Equation (2.5) is not unique, but there is a way to impose such uniqueness using theconcept of normalized potentials.Definition 5. An energy function, Ec(xc ) (or potential), is said to be normalized with respect to the zerovector if Ec(xc ) = 0 whenever there exists t ∈ c such that xt = 0.Theorem 1. (Uniqueness of normalized potential) There exists one and only one normalized potential rep-resentation with respect to zero corresponding to a Gibbs distribution.Proof. See Bremaud [2001], pages 262-265.The normalized potential is also known as the canonical potential [Griffeath, 1976, Kindermann andSnell, 1980]. One can choose normalized potentials with respect to reference values other than the zerovector. However, in order to ease the notation we consider only the normalization with respect to the zerovector.The uniqueness of the normalized potential representation enables us to talk about the uniqueness ofcliques. In particular, each clique has one and only one normalized potential. Henceforth, this thesis willfocus on clique potentials in normalized potential representation and, specifically, normalized with respectto the zero vector.2.2 Model specification and objectivesWe are interested in estimating the parameter vector θ of the positive distribution p(x |θ)> 0 that satisfiesthe Markov properties of an undirected graph G. That is, a distribution that can be represented as a Gibbsdistribution:p(x |θ) = 1Z(θ) exp(−∑cE(xc |θ c)), (2.6)where Z(θ) is the partition function:Z(θ) =∫exp(−∑cE(xc |θ c))dx. (2.7)7We will assume that the energy terms E(xc |θ c) ∈ R are chosen so that the parameters are identifiable. Thatis,θ1 6= θ2 ⇐⇒ E(xc |θ 1) 6= E(xc |θ 2). (2.8)When the energy is a linear function of the parameters,E(xc |θ c) =−θTc φ c(xc),where φ c(xc) is a feature vector derived from the values of the variables xc, we will refer to the model as amaximum entropy representation or log-linear model (Wasserman [2004], Buchman et al. [2012], Murphy[2012]). The features in these models are also referred to as local sufficient statistics.At this stage, we will make an additional remark regarding notation. We will use x to refer to the vectorof all variables (nodes). When needed, we increase the precision in our notation by using S to denote theset of all variables and use xS for the vector of all variables in the MRF. We restrict the symbols n and c sothat xn refers to the n-th observation of all the variables in the MRF, and xc refers to the subset of variablesassociated with clique c. Finally xmn refers to the n-th observation of node m.2.2.1 Maximum Likelihood estimationMaximum Likelihood (ML) is maybe the most natural principle for estimating θ . Denoted by θ̂ml , themaximum likelihood estimator is defined as the θ that maximizes the likelihood function L(·). Specifically,let x1, ..,xN be independent realizations of p(x|θ), then the likelihood function L is given byL (θ ;x1, ..,xN) =N∏n=1p(xn|θ). (2.9)There is (in general) no closed form solution for the maximum likelihood estimate of the parameters of anMRF, so gradient-based optimizers are needed.Consider the fully-observed maximum entropy modelp(x |θ) = 1Z(θ) exp(∑cθTc φ c(x)). (2.10)The scaled log-likelihood is given by`(θ) = 1NN∑n=1log p(xn |θ). (2.11)By substituing Equation 2.10 we get`(θ) = 1NN∑n=1[∑cθTc φ c(xn)− logZ(θ)], (2.12)8which is a convex function of θ . The derivative with respect to the parameters of a particular clique, q, isgiven by∂`∂θ q= 1NN∑n=1[φ q(xn)−∂ logZ(θ)∂θ q], (2.13)where∂ logZ(θ)∂θ q= E[φ q(x) |θ]=∫φ q(x)p(x |θ)dx. (2.14)Equation (2.14) is the expectation of the feature φ q(x) over the model distribution.The full derivative of the log-likelihood contrasts the model expectation against the expected value ofthe feature over the data,∂`∂θ q= 1NN∑n=1φ q(xn)−E[φ q(x) |θ]. (2.15)At the optimum these two terms will be equal and the empirical distribution of the features will match themodel predictions. Asymptotically, ML is the optimal estimator since it reaches the Cramer-Rao lowerbound. Specifically, the central limit theorem tell us that the ML estimate θˆML converges to the true param-eter vector θ true at the following rate (which is the fastest possible asymptotic rate):limN→∞(θ true− θˆML) =N (0,I−1), (2.16)where N is the number of samples and I is the Fisher Information matrix defined byIql =−E[(∂2 log p(x|θ)∂θq∂θl)]. (2.17)This statistical optimality is not the only important aspect of learning. For many models of interest θˆMLis intractable. We must therefore also consider the computational costs associated with learning. In thisthesis, we will seek to design classes of algorithms that are both computationally and statistically efficient.This will not always be possible and in some cases we will have to present trade-offs between efficientcomputation and asymptotic statistical efficiency.2.2.2 Maximum Pseudo-Likelihood estimationTo surmount the intractable problem of computing expectations over the model distribution, the popularpseudo-likelihood estimator considers a simpler factorized objective function,`PL(θ) = 1NN∑n=1M∑m=1log p(xmn |x−mn,θ) (2.18)9where x−mn denotes all the components of the n-th data vector, except for component m. (For models withsparse connectivity, we only need to condition on the neighbours of node m.) In the binary, log-linear case,the gradient of this objective can be expressed in contrastive form,∂`PL∂θ q= 1N ∑n,mp(x¯mmn |x−mn,θ)[φ q(xn)−φ q(x¯mn )],where x¯mn is the data vector x¯n with the m-th bit flipped. That is, x¯imn = 1− xmn if i = m and xmn otherwise(Marlin et al. [2010]). Pseudo-likelihood will appear in most chapters of this thesis, first as a baseline andlater as an application of LAP when considering the distributed pseudo-likelihood setting.10Chapter 3The Marginal LAPIn this chapter, we describe a parameter estimation algorithm named LAP. We prove that LAP is both modelefficient (i.e., estimating the parameters requires low computational complexity) and data efficient (i.e., it issufficient to have data statistics distributed in a network). In other words, LAP avoids the model inefficiencyof maximum likelihood for an important class of models while preserving its data efficiency.LAP is naturally parallel and can be implemented in a distributed setting without modification. LAP re-places the joint maximum likelihood problem with a (linear) collection of much smaller auxiliary maximumlikelihood problems that can be solved independently.We prove that if the auxiliary problems satisfy certain conditions, the relevant parameters in the auxiliaryproblems converge to the values of the true parameters in the joint model. Under a strong assumption, wealso prove that LAP is exactly equal to maximum likelihood on the full joint distribution. While hard toverify in practice, this result provides additional insight into why the LAP approach is effective.3.1 Model and data efficiencyThere are two terms in the gradient of Equation 2.15. The first term is an empirical expectation,1NN∑n=1φ q(xn)and depends only on the data. The value of this term for each clique can be pre-computed before parameteroptimization begins, making this term of the gradient extremely cheap to evaluate during optimization.The data term in the ML gradient is contrasted with an expectation over the model distribution,E[φ q(x) |θ],which is a sum over exponentially many configurations. For large models this term is intractable.We describe this situation by saying that ML estimation is data efficient, since the terms involving onlythe data can be computed efficiently. However, ML is not model efficient, since the model term in the11Algorithm 1 LAPInput: MRF with maximal cliques Cfor q ∈ C doConstruct auxiliary MRF over the variables in Aq.Estimate parameters αˆML of auxiliary MRF.Set θˆ q← αˆMLq .end forgradient is intractable, and the difficulty in evaluating it is the primary motivation for the development ofalternative objectives like pseudo-likelihood.Pseudo-likelihood addresses the model inefficiency of ML by eliminating the model term from the gra-dient, which makes pseudo-likelihood model efficient. However, pseudo-likelihood is not data efficient,since computing the gradient requires access to the full conditional distributions p(x¯mmn |x−mn,θ). Becauseof this the outer sum over data examples must be computed for each gradient evaluation. (Note that forbinary models the full conditionals correspond to logistic regressions, so any advances in scaling logisticregression to massive models and datasets would be of use here.)In the following section we introduce a Linear And Parallel (LAP) algorithm, which uses a particulardecomposition of the graph to avoid the exponential cost in ML, but unlike pseudo-likelihood LAP is fullyparallel and maintains the data efficiency of ML estimation. LAP is therefore both model and data efficient.3.2 Algorithm descriptionThe LAP algorithm operates by splitting the joint parameter estimation problem into several independentsub-problems which can be solved in parallel. Once the sub-problems have been solved, it combines thesolutions to each sub-problem together into a solution to the full problem.Definition 6. For a fixed clique q we define its 1-neighbourhood Aq as the union of all cliques with nonempty intersection with q:Aq =⋃c∩q6= /0c. (3.1)Alternatively, we say that Aq contains all of the variables of q itself as well as the variables with at leastone neighbour in q (See proof in Appendix A.1). LAP creates one sub-problem for each maximal cliquein the original problem by defining an auxiliary MRF over the variables in Aq. Details on how to constructthe auxiliary MRF will be discussed later, for now we assume we have an auxiliary MRF on Aq and that itcontains a clique over the variables in q that is parametrized the same way as q in the original problem.LAP derives the parameter vector θ q for the full problem by estimating parameters in the auxiliary MRFon Aq using maximum likelihood and reading off the parameters for the clique q directly. The steps of thealgorithm are summarized in Algorithm 1.In a log-linear model, when estimating the vector of parameters α of the auxiliary MRF by maximum12(a)(b)(c)Figure 3.1: The left column shows several popular MRFs: (a) a restricted Boltzmann machine (RBM), (b) a2-D Ising model, and (c) a 3-D Ising model. The right hand side shows the corresponding 1-neighbourhoods.Models (b) and (c) have small 1-neighbourhoods compared to the full graph.likelihood, the relevant derivative is∂`Mq∂αq= 1NN∑n=1φ q(xAqn)−E[φ q(xAq)|α]. (3.2)13This approach is data efficient, since the sufficient statistics 1N ∑Nn=1 φ q(xAqn) can be easily pre-computed.Moreover, the data vector xn can be stored in a distributed fashion, with the node estimating the auxiliaryMRF only needing access to the sub-vector xAqn. In addition, LAP is model efficient since the expectationE[φ q(xAq)|α]can be easily computed when the number of variables in Aq is small. To illustrate this point,consider the models shown in Figure 3.1. For dense graphs, such as the restricted Boltzmann machine,the exponential cost of enumerating over all the variables in Aq is prohibitive. However, for other practicalMRFs of interest, including lattices and Chimeras [Denil and de Freitas, 2011], this cost is acceptable.3.2.1 Construction of the auxiliary MRFThe effectiveness of LAP comes from proper construction of the auxiliary MRF. As already mentioned, theauxiliary MRF must contain the clique q, which must be parametrized in the same way as in the joint model.This requirement is clear from the previous section, otherwise the final step in Algorithm 1 would be invalid.We will see in the analysis section that it is desirable for the auxiliary MRF to be as close to the marginaldistribution on xAq as possible. This means we must include all cliques from the original MRF which aresubsets of Aq. Additionally, marginalization may introduce additional cliques not present in the originaljoint distribution. It is clear that these cliques can only involve variables in Aq \ q, but determining theirexact structure in general can be difficult.We consider three strategies for constructing auxiliary MRFs, which are distinguished by how theyinduce clique structures on Aq \q. The three strategies are as follows.Exact: Here we compute the exact structure of the marginal distribution over Aq from the original problem.We have chosen our test models to be ones where the marginal structure is readily computed.Dense: For many classes of model the marginal over Aq involves a fully parametrized clique over Aq \q fornearly every choice of q (for example, this is the case in lattice models). The dense variant assumesthat the marginal always has this structure. Making this choice will sometimes over-parametrize themarginal, but avoids the requirement of explicitly computing its structure.Pairwise: Both the exact and dense strategies create high order terms in the auxiliary MRF. While highorder terms do exist in the marginals of discrete MRFs, it is computationally inconvenient to includethem, since the add many parameters to each sub-problem. In the pairwise variant we use the samegraph structure as in dense, but here we introduce only unary and binary potentials over Aq \q. Thisresults in a significant computational savings for each sub-problem in LAP, but fails to capture thetrue marginal distribution in many cases (including all of the example problems we consider).3.3 ExperimentsIn this section we describe some experiments designed to show that the LAP estimator has good empiricalperformance. We focus on small models where exact maximum likelihood is tractable in order to allow14102 103 104 105 106Number of samples0.00.10.2Relative errorLAP_DLAP_ELAP_PPL102 103 104 105 106Number of samples0.00.71.42.2Estimator variance LAP_DLAP_ELAP_PPLMLFigure 3.2: Left: Relative error of parameter estimates compared to maximum likelihood for LAP andpseudo-likelihood on a 4× 4 Ising grid. Error bars show the standard deviation over several runs. Right:Variance of the parameter estimates for each algorithm.performance to be measured. We chose to focus our experiments on demonstrating accuracy rather thanscalability since the scaling and data efficiency properties of LAP are obvious.The purpose of the experiments in this section is to show two things:1. The accuracy of LAP estimates is not worse than its main competitor, pseudo-likelihood; and2. LAP achieves good performance even when the exact marginal structure is not used.In all of our experiments we compare pseudo-likelihood estimation against LAP using the three differentstrategies for constructing the auxiliary MRF discussed in the previous section. In each plot, lines labeled PLcorrespond to pseudo-likelihood and ML corresponds to maximum likelihood. LAP_E, LAP_D and LAP_Prefer respectively to LAP with the exact, dense and pairwise strategies for constructing the auxiliary MRF.We compare LAP and pseudo-likelihood to maximum likelihood estimation on three different modelclasses. The first is a 4×4 Ising grids with 4-neighbourhoods, and the results are shown in Figure 3.2. Thesecond is a 4× 4× 4 Ising lattice with 6-neighbourhoods, which is shown in Figure 3.3. Finally, we alsoconsider a Chimera 3×3×3 model, with results shown in Figure 3.4.The procedure for all models is the same: we choose the generating parameters uniformly at randomfrom the interval [−1,1] and draw samples approximately from the model. We then fit exact maximumlikelihood parameters based on these samples, and compare the parameters obtained by pseudo-likelihoodand LAP to the maximum likelihood estimates. The left plot in each figure shows the mean relative error ofthe parameter estimates using the maximum likelihood estimates as ground truth. Specifically, we measureerr(θ) = ‖θML‖−1 · ‖θ −θML‖15102 103 104 105 106Number of samples0.00.20.50.7Relative errorLAP_DLAP_ELAP_PPL102 103 104 105 106Number of samples0.06.613.219.7Estimator variance LAP_DLAP_ELAP_PPLMLFigure 3.3: Left: Relative error of parameter estimates compared to maximum likelihood for LAP andpseudo-likelihood on a 4× 4× 4 Ising lattice. Error bars show the standard deviation over several runs.Right: Variance of the parameter estimates for each algorithm.for each estimate on each set of samples and average over several runs. We also measure the variance ofthe estimates produced by each algorithm over several runs. In this case we measure the variance of theestimates of each parameter separately and average these variances over all parameters in the model. Thesemeasurements are shown in the right plot in each figure. For reference we also show the variance of themaximum likelihood estimates in these plots.In all of the experiments we see that the performance of all of the LAP variants is basically indistinguish-able from pseudo-likelihood, except for small numbers of samples. Interestingly, LAP_P does not performnoticeably worse than the other LAP variants on any of the problems we considered here. This is interestingbecause LAP_P approximates the marginal with a pairwise MRF, which is not sufficient to capture the truemarginal structure in any of our examples. LAP_P is also the most efficient LAP variant we tested, sincethe auxiliary MRFs it uses have the fewest number of parameters.3.4 TheoryIn this section show that matching parameters in the joint and the marginal distributions is valid, providedthe parametrizations are chosen correctly. We then prove consistency of the LAP algorithm and illustrate itsconnection to ML.Undirected probabilistic graphical models can be specified, locally, in terms of Markov properties andconditional independence and, globally, in terms of an energy function ∑c E(xc|θ c). As shown in Equations(2.2), (2.3), and (2.4). This is the direct outcome of the Hammersley-Clifford theorem [Hammersley andClifford, 1971] which establishes the equivalence of these two representations.One important fact that is often omitted is that the energy function and the partition function are not16102 103 104 105 106Number of samples0.00.20.30.5Relative errorLAP_DLAP_ELAP_PPL102 103 104 105 106Number of samples0.02.44.97.3Estimator variance LAP_DLAP_ELAP_PPLMLFigure 3.4: Left: Relative error of parameter estimates compared to ML for LAP and pseudo-likelihood ona Chimera 3× 3× 3 model. Error bars show the standard deviation over several runs. Right: Variance ofthe parameter estimates for each algorithm.unique. It is however possible to obtain uniqueness, for both of these functions, by imposing normalizationwith respect to a setting of the random variables of the potential. This gives rise to the concept of normalizedpotential [Bremaud, 2001]:Definition 7. A Gibbs potential {E(xc|θ c)}c∈C is said to be normalized with respect to zero if E(xc|θ c) = 0whenever there exists t ∈ c such that xt = 0.(In this section, we use the term Gibbs potential, or simply potential, to refer to the energy so as tomatch the nomenclature of [Bremaud, 2001].) The following theorem plays a central role in understandingthe LAP algorithm. The proof can be found in [Griffeath, 1976, Bremaud, 2001]:Theorem 8. [Existence and Uniqueness of the normalized potential] There exists one and only one(Gibbs) potential normalized with respect to zero corresponding to a Gibbs distribution.3.4.1 The LAP argumentSuppose we have a Gibbs distribution p(xS |θ) that factors according to the clique system C , and let q ∈ Cbe a clique of interest. Let the auxiliary MRFp(xAq |α) =1Z(α) exp(− ∑c∈CqE(xc |αc)) (3.3)have the same form as the marginal distribution on Aq (with clique system Cq) parametrized so that thepotentials are normalized with respect to zero. We can obtain the marginal from the joint in the following17way:p(xAq |θ) =∫p(xS |θ)dxS\Aq . (3.4)Substituting Equation 3.3 into Equation 3.4, yieldsp(xAq |θ) =1Z(θ)∫exp(−∑c∈CE(xc |θ c))dxS\Aq (3.5)Pulling out from the integral all the clique potentials with empty intersection with Aq yields:p(xAq |θ) =1Z(θ) exp(− ∑c⊆AqE(xc |θ c))∫exp(− ∑c(AqE(xc |θ c))dxS\Aq (3.6)The domain of the function ∫exp(− ∑c(AqE(xc |θ c))dxS\Aqhas empty intersection with the clique of interest q (since Aq is defined as the union of all the cliques withnon empty intersection).Let us defineg(xAq\q) = log(∫exp(− ∑c(AqE(xc |θ c))dxS\Aq).With this definition, the marginal distribution over Aq isp(xAq |θ) =1Z(θ) exp(− ∑c⊆AqE(xc |θ c)+g(xAq\q)). (3.7)The energy function in the Gibbs distribution of Equation 3.7 satsifies the sufficency condition in theHammersley-Clifford theorem, that is p(xAq |θ) is the PDF of a MRF on its own.Proposition 9. If the parametrisations of p(xS |θ) and p(xAq |α) are chosen to be normalized with respectto zero, and if the parameters are identifiable with respect to the potentials, then θ q = αq.Proof. The terms E(xq |θ q) and E(xq |αq) appear as separate factors in p(xAq |θ) in Equation 3.7 and inp(xAq |α) in Equation 3.3 respectively. By existence and uniqueness of the normalized potentials (Theo-rem 8), we haveE(xq |αq) = E(xq |θ q) (3.8)which implies that θ q = αq if the parameters are identifiable.3.4.2 Consistency of LAPLet θ ? be the true vector of parameters taken from the unknown generating distribution p(xS |θ ?) parametrizedsuch that the potentials are normalized with respect to zero. Suppose we have N samples drawn iid from this18distribution. Let θˆML be the ML estimate of θ given the data and let αˆML the corresponding ML estimatefor the auxiliary MRF with true parameters α?.Proposition 10. If the true marginal distributions are contained in the class of auxiliary MRFs, we haveαˆML→ θ ? as N→ ∞.Proof. Let q ∈ C be an arbitrary clique of interest. It is sufficient to show that αˆMLq → θ ?q. By marginaliza-tion, we havep(xAq |θ ?) = ∑S\Aqp(xS |θ ?).By the lap argument (Proposition 3), we know that α?q = θ ?q. Since ML in consistent under smoothness andidentifiability assumptions (for example, see [Fienberg and Rinaldo, 2012]), we also haveαˆML→ α?,so,αˆMLq → θ ?q.Note that in the above proposition, the class of auxiliary MRFs can be more general than the class ofmarginal MRFs, but must contain the latter. Asymptotically, superfluous terms in the auxiliary MRF vanishto zero.3.4.3 Relationship to MLHere we prove that, under certain (strong) assumptions in the discrete case, LAP is exactly equal to ML.The main result here will be that under the required assumptions, estimation by ML and marginalizationcommute. Suppose we have a discrete MRF on xS which factorizes according to the cliques C , and letq ∈ C be a particular clique of interest.We will make use of the following characterization of ML estimates, which is proved in [Jordan, 2002].Lemma 11. If a distribution pˆ(xS) satisfies that for each c ∈ Cpˆ(xc) = p˜(xc)then pˆ(xS) is an ML estimate for the empirical distribution p˜(xS).This characterization allows us to derive an explicit expression for an ML estimate of pˆ(xS).Proposition 12. The distributionpˆ(xS) =p˜(xAq)p˜(xS\q)p˜(xAq\q)19is an ML estimate for p˜(xS).Proof. To see this we compute∑qpˆ(xS) =∑qp˜(xAq)p˜(xS\q)p˜(xAq\q)= p˜(xS\q)and∑S\Aqpˆ(xS) = ∑S\Aqp˜(xAq)p˜(xS\q)p˜(xAq\q)= p˜(xAq)For an arbitrary clique c∈C , either c⊂ S\q or c⊂ Aq, and we see that fˆ (xc) = f˜ (xc) by further marginaliz-ing one of the above expressions. This shows that our expression for pˆ(xS) satisfies the criteria of Lemma 11,and is therefore an ML estimate for p˜(xS).Suppose we have a family of distributions F on xS which satisfy the Markov properties of the MRF,and suppose that pˆ(xS) ∈F where pˆ(xS) is defined as in Proposition 12. Define the auxiliary family Fqassociated with the clique q as follows.Fq = {∑S\Aqp(xS) | p(xS) ∈F}That is, Fq is the family of distributions obtained by marginalizing the family F over S\Aq.Proposition 13. The auxiliary family Fq contains the marginal empirical distribution p˜(xAq). Moreoverpˆ(xAq) = p˜(xAq) is an ML estimate for p˜(xAq) in Fq.Proof. Recall that pˆ(xS) from Proposition 12 is in F by assumption. Thus,∑S\Aqpˆ(xS) = p˜(xAq)is in Fq by definition. That pˆ(xAq) ∈ Fq is an ML estimate follows since the log likelihood gradient inEquation 2.15 is zero when the model and empirical distributions are equal.Suppose we can represent the family F as a Gibbs family, i.e.F =F (Θ) = {p(xS |θ) |θ ∈Θ}for some domain of parameters Θ, wherep(xS|θ) =1Z(θ) exp(−∑c∈CE(xc |θ c)) .20Moreover, suppose we have chosen this parametrisation so that the potential functions are normalized withrespect to zero.Since F is representable as a Gibbs family then the auxiliary family Fq is also representable as a Gibbsfamily withFq =Fq(Ψ) = {p(xAq |α) |α ∈Ψ}for some domain of parameters Ψ. We will again suppose that this parametrization is chosen so that thepotential functions are normalized with respect to zero.We have already shown that ML estimates for p˜(xS) and p˜(xAq) exist in the families F and Fq, re-spectively. Since we have chosen the parametrizations of these families to be normalized we also haveunique ML parameters θˆ ∈ Θ and αˆ ∈ Ψ such that p(xS | θˆ) ∈ F (Θ) is an ML estimate for p˜(xS) andp(xAq | αˆ) ∈F (Ψ) is an ML estimate for p˜(xAq).We can now prove the main result of this chapter.Theorem 14. Under the assumptions used in this section, estimating the joint parameters by ML and in-tegrating the resulting ML distribution gives the same result as integrating the joint family of distributionsand performing ML estimation in the marginal family. Concisely,∑S\Aqp(xS | θˆ) = p(xAq | αˆ)Proof. We have the following sequence of equalities:p(xS | θˆ)(1)= pˆ(xS)(2)=p˜(xAq)p˜(xS\q)p˜(xAq\q)(3)=pˆ(xAq)p˜(xS\q)p˜(xAq\q)(4)=p(xAq | αˆ)p˜(xS\q)p˜(xAq\q)The first equality follows from the parametrisation of F , the second follows from Proposition 12, the thirdfrom Proposition 13 and the fourth follows from the parametrisation of Fq. The theorem is proved bysumming both sides of the equality over S\Aq.Applying the LAP argument to Theorem 14 we see that θˆ q = αˆq.Remark 15. The assumption that pˆ(xS) ∈ F amounts to assuming that the empirical distribution of thedata factors according to the MRF. This is very unlikely to hold in practice for finite data. However, if thetrue model structure is known then this property does hold in the limit of infinite data.21Chapter 4Strong LAP theorem, conditional LAP anddistributed parameter estimationIn this chapter, we advanced the theoretical understanding of LAP in two directions and great practicalconsequence. First, we proved that it is possible to reduce the size of the clique 1-neighbourhoods used inthe construction of auxiliary marginal MRFs in Chapter 3. Second, we extended the LAP argument to theconditional case, thereby enabling the methodology to become applicable to densely connected graphs.Finally, we link it to the design of algorithms for distributed parameter estimation in MRFs by showinghow the work of Liu and Ihler [2012] and Chapter 3 can both be seen as special cases of distributedcomposite likelihood. Casting these two works in a common framework allows us transfer results betweenthem, strengthening the results of both works. It also appears that this thesis is the first work to addressdistributed composite likelihood estimators.In Chapter 3 we introduced a theoretical result to show that it is possible to learn MRFs with untied pa-rameters in a fully-parallel but globally consistent manner. That result lead to the construction of a globallyconsistent estimator, whose cost is linear in the number of cliques as opposed to exponential as in central-ized maximum likelihood estimators. That result applies only to a specific factorization, with the cost oflearning being exponential in the size of the factors. While these factors are small for lattice-MRFs and othermodels of low degree, they can be as large as the original graph for other models, such as fully-observedBoltzmann machines [Ackley et al., 1985]. In this chapter, we introduce the Strong LAP Condition, whichcharacterizes a large class of composite likelihood factorizations for which it is possible to obtain global con-sistency, provided the local estimators are consistent. This much stronger sufficiency condition enables usto construct linear and globally consistent distributed estimators for a much wider class of models, includingfully-connected Boltzmann machines.Using this framework we also show how the asymptotic theory of Liu and Ihler applies more generallyto distributed composite likelihood estimators. In particular, the Strong LAP Condition provides a sufficientcondition to guarantee the validity of a core assumption made in the theory of Liu and Ihler, namely that eachlocal estimate for the parameter of a clique is a consistent estimator of the corresponding clique parameter221 2 34 5 67 8 91 2 34 5 67 8 91 2 34 5 67 8 91 2 34 5 67 8 9Figure 4.1: Left: A simple 2d-lattice MRF to illustrate our notation. For node j = 7 we have N (x j) ={x4,x8}. Centre left: The 1-neighbourhood of the clique q = {x7,x8} including additional edges (dashedlines) present in the marginal over the 1-neighbourhood. Factors of this form are used by the LAP algorithmof Chapter 3 Centre right: The MRF used by our conditional estimator of Section 4.3 when using thesame domain as Chapter 3 Right: A smaller neighbourhood which we show is also sufficient to estimatethe clique parameter of q.in the joint distribution. By applying the Strong LAP Condition to verify the assumption of Liu and Ihler,we are able to import their M-estimation results into the LAP framework directly, bridging the gap betweenLAP and consensus estimators.4.1 Centralised estimationRecall that our goal is to estimate the D-dimensional parameter vector θ of an MRF with the followingGibbs density or mass function:p(x |θ) = 1Z(θ) exp(−∑cE(xc |θ c)). (4.1)As in previous chapters, c ∈ C is an index over the cliques of an undirected graph G = (S,E), E(xc |θ c) isthe energy or Gibbs potential, and Z(θ) is a normalizing term known as the partition function.The standard approach to parameter estimation in statistics is through maximum likelihood, whichchooses parameters θ by maximizingL ML(θ) =N∏n=1p(xn |θ). (4.2)This estimator has played a central role in statistics as it is consistent, asymptotically normal, and efficient,among other desirable properties. However, applying maximum likelihood estimation to an MRF is gen-erally intractable since computing the value of logL ML and its derivative require evaluating the partitionfunction, or an expectation over the model respectively. Both of these values involve a sum over exponen-tially many terms.To surmount this difficulty it is common to approximate p(x |θ) as a product over more tractable terms.23This approach is known as composite likelihood and leads to an objective of the formL CL(θ) =N∏n=1I∏i=1f i(xn,θ i) (4.3)where θ i denote the (possibly shared) parameters of each composite likelihood factor f i. Composite like-lihood estimators are and both well studied and widely applied [Cox, 1988, Mardia et al., 2009, Liang andJordan, 2008, Dillon and Lebanon, 2010, Marlin et al., 2010, Asuncion et al., 2010, Okabayashi et al., 2011,Bradley and Guestrin, 2012, Nowozin, 2013]. In practice the f i terms are chosen to be easy to compute, andare typically local functions, depending only on some local region of the underlying graph G .An early and influential variant of composite likelihood is pseudo-likelihood (PL) [Besag, 1974], wheref i(x,θ i) is chosen to be the conditional distribution of xi given its neighbours,L PL(θ) =N∏n=1M∏m=1p(xmn |xN (xm)n,θm) (4.4)Since the joint distribution has a Markov structure with respect to the graph G , the conditional distributionfor xm depends only on its neighbours, namely xN (xm). In general more efficient composite likelihoodestimators can be obtained by blocking, i.e. choosing the f i(x,θ i) to be conditional or marginal likelihoodsover blocks of variables, which may be allowed to overlap.Composite likelihood estimators are often divided into conditional and marginal variants, dependingon whether the f i(x,θ i) are formed from conditional or marginal likelihoods. In machine learning theconditional variant is quite popular [Liang and Jordan, 2008, Dillon and Lebanon, 2010, Marlin et al.,2010, Marlin and de Freitas, 2011, Bradley and Guestrin, 2012] while the marginal variant has received lessattention. In statistics, both the marginal and conditional variants of composite likelihood are well studied(see the comprehensive review of Varin et al. [2011]).An unfortunate difficulty with composite likelihood is that the estimators cannot be computed in parallel,since elements of θ are often shared between the different factors. For a fixed value of θ the terms of logL CLdecouple over data and over blocks of the decomposition; however, if θ is not fixed then the terms remaincoupled.4.1.1 Consensus estimationSeeking greater parallelism, researchers have investigated methods for decoupling the sub-problems in com-posite likelihood. This leads to the class of consensus estimators, which perform parameter estimation inde-pendently in each composite likelihood factor. This approach results in parameters that are shared betweenfactors being estimated multiple times, and a final consensus step is required to force agreement betweenthe solutions from separate sub-problems [Wiesel and Hero III, 2012, Liu and Ihler, 2012].Centralized estimators enforce sub-problem agreement throughout the estimation process, requiringmany rounds of communication in a distributed setting. Consensus estimators allow sub-problems to dis-24agree during optimization, enforcing agreement as a post-processing step which requires only a single roundof communication.Liu and Ihler [2012] approach distributed composite likelihood by optimizing each term separatelyθˆ iβi = argmaxθβi(N∏n=1f i(xA i,n,θβi)), (4.5)whereA i denotes the group of variables associated with block i, and θβi is the corresponding set of parame-ters. In this setting the sets βi⊆V are allowed to overlap, but the optimisations are carried out independently,so multiple estimates for overlapping parameters are obtained. Following Liu and Ihler we have used thenotation θ i = θβi to make this interdependence between factors explicit.The analysis of this setting proceeds by embedding each local estimator θˆ iβi into a degenerate estimatorθˆ i for the global parameter vector θ by setting θˆ ic = 0 for c /∈ βi. The degenerate estimators are combinedinto a single non-degenerate global estimate using different consensus operators, e.g. weighted averages ofthe θˆ i.The analysis of Liu and Ihler assumes that for each sub-problem i and for each c ∈ βi(θˆ iβi)cp→ θ c (4.6)i.e., that each local estimate for the parameter of clique c is a consistent estimator of the corresponding cliqueparameter in the joint distribution. This assumption does not hold in general, and one of the contributionsof this chapter is to give a general condition under which this assumption holds.The analysis of Liu and Ihler [2012] considers the case where the local estimators in Equation 4.5 arearbitrary M-estimators [van der Vaart, 1998], however their experiments address only the case of pseudo-likelihood. In Section 4.3 we prove that the factorization used by pseudo-likelihood satisfies Equation 4.6,explaining the good results in their experiments.4.1.2 Distributed estimationConsensus estimation dramatically increases the parallelism of composite likelihood estimates by relaxingthe requirements on enforcing agreement between coupled sub-problems. In Chapter 3 we have shown thatif the composite likelihood factorization is constructed correctly then consistent parameter estimates can beobtained without requiring a consensus step.In the LAP algorithm described in Chapter 3, the domain of each composite likelihood factor (whichis called the auxiliary MRF) is constructed by surrounding each maximal clique q with the variables in its1-neighbourhoodAq =⋃c∩q6= /0c25which contains all of the variables of q itself as well as the variables with at least one neighbour in q; seeFigure 4.1 for an example. For MRFs of low degree the sets Aq are small, and consequently maximumlikelihood estimates for parameters of MRFs over these sets can be obtained efficiently. The parametricform of each factor in LAP is chosen to coincide with the marginal distribution over Aq. The factorisationof Chapter 3 is essentially the same as in Equation 4.5, but the domain of each term is carefully selected,and the LAP theorems are proved only for the case wheref i(xAq ,θβq) = p(xAq ,θβq).As in consensus estimation, parameter estimation in LAP is performed separately and in parallel for eachterm; however, agreement between sub-problems is handled differently. Instead of combining parameterestimates from different sub-problems, LAP designates a specific sub-problem as authoritative for eachparameter (in particular the sub-problem with domain Aq is authoritative for the parameter θ q). The globalsolution is constructed by collecting parameters from each sub-problem for which it is authoritative anddiscarding the rest.4.2 Strong LAP argumentIn this section we present the Strong LAP Condition, which provides a general condition under whichthe convergence of Equation 4.6 holds. This turns out to be intimately connected to the structure of theunderlying graph.Definition 16 (Relative Path Connectivity). Let G = (S,E) be an undirected graph, and let A be a givensubset of S. We say that two nodes i, j ∈ A are path connected with respect to S \A if there exists a pathP = {i,s1,s2, . . . ,sn, j} 6= {i, j} with none of the sk ∈ A. Otherwise, we say that i, j are path disconnectedwith respect to S\A.ijk123456Figure 4.2: Illustrating the concept of relative path connectivity. Here, A = {i, j,k}. While (k, j) are pathconnected via {3,4} and (k, i) are path connected via {2,1,5}, the pair (i, j) are path disconnected withrespect to S\A.260 1234 5(b)0 1234 5(a)0 1234 5(c)Figure 4.3: Figures (a)-(c) Illustrating the difference between LAP and Strong LAP. (a) Shows a star graphwith q highlighted in red. (b) Shows Aq required by LAP. (c) Shows an alternative neighbourhood allowedby Strong LAP. Thus, if the root node is a response variable and the leafs are covariates, Strong LAP stateswe can estimate each parameter separately and consistently.For a given A⊆V we partition the clique system of G into two parts, C inA that contains all of the cliquesthat are a subset of A, and C outA = C \C inA that contains the remaining cliques of G. Using this notation wecan write the marginal distribution over xA asp(xA |θ) =1Z(θ) exp(− ∑c∈C inAE(xc |θ c))∫exp(− ∑c∈C outAE(xc |θ c))dxS\A. (4.7)Up to a normalization constant,∫exp(−∑c∈C outA E(xc |θ c))dxS\A induces a Gibbs density (and therefore anMRF) on A, which we refer to as the induced MRF.For example, as illustrated in Figure 4.1 centre-left, the induced MRF involves all the cliques over thenodes 4, 5 and 9. By the Hammersley-Clifford theorem this MRF has a corresponding graph which we referto as the induced graph and denote GA. Note that the induced graph does not have the same structure as themarginal, it contains only edges which are created by summing over xS\A.Remark 17. To work in the general case, we assume throughout that that if an MRF contains the path{i, j,k} then integration over j creates the edge (i,k) in the marginal.In other words, ifg(xi,xk) =∫exp(E1(xi,x j))exp(E1(x j,xk))dx jit can not be factorized into functions of xi and xkg(xi,xk) 6= f1(xi) f2(xk).Proposition 18. Let A be a subset of S, and let i, j ∈ A. The edge (i, j) exists in the induced graph GA if andonly if i and j are path connected with respect to S\A.Proof. If i and j are path connected then there is a path P = {i,s1,s2, . . . ,sn, j} 6= {i, j} with none of thesk ∈ A. Integrating over sk forms an edge (sk−1,sk+1). By induction, integrating over s1, . . . ,sn forms theedge (i, j).27If i and j are path disconnected with respect to S \A then integrating over any s ∈ S \A cannot formthe edge (i, j) or i and j would be path connected through the path {i,s, j}. By induction, if the edge(i, j) is formed by integrating over s1, . . . ,sn this implies that i and j are path connected via {i,s1, . . . ,sn, j},contradicting the assumption.Corollary 19. B ⊆ A is a clique in the induced graph GA if and only if all pairs of nodes in B are pathconnected with respect to S\A.Definition 20 (Strong LAP condition). Let G = (S,E) be an undirected graph and let q ∈ C be a clique ofinterest. We say that a set A such that q⊆ A⊆ S satisfies the strong LAP condition for q if there exist i, j ∈ qsuch that i and j are path-disconnected with respect to S\A.Proposition 21. Let G = (S,E) be an undirected graph and let q ∈ C be a clique of interest. If Aq satisfiesthe strong LAP condition for q then the joint distribution p(xS |θ) and the marginal p(xAq |θ) share thesame normalized potential for q.Proof. If Aq satisfies the Strong LAP Condition for q then by Corollary 19 the induced MRF containsno potential for q. Inspection of Equation 4.7 reveals that the same E(xq |θ q) appears as a potential inboth the marginal and the joint distributions. The result follows by uniqueness of the normalized potentialrepresentation.We now restrict our attention to a set Aq which satisfies the Strong LAP Condition for a clique of interestq. The marginal over p(xAq |θ) can be written as in Equation 4.7 in terms of θ , or in terms of auxiliaryparameters αp(xAq |α) =1Z(α) exp(− ∑c∈CqE(xc |αc)) (4.8)Where Cq is the clique system over the marginal. We will assume both parametrisations are normalised withrespect to zero.Theorem 22 (Strong LAP Argument). Let q be a clique in G and let q ⊆ Aq ⊆ S. Suppose p(xS|θ) andp(xAq |θ) are parametrised so that their potentials are normalised with respect to zero and the parametersare identifiable with respect to the potentials. If Aq satisfies the Strong LAP Condition for q then θ q = αq.Proof. From Proposition 21 we know that p(xS|θ) and p(xAq |θ) share the same clique potential for q.Alternatively we can write the marginal distribution as in Equation 4.8 in terms of auxiliary variables α . Byuniqueness, both parametrizations must have the same normalized potentials. Since the potentials are equal,we can match terms between the two parametrizations. In particular since E(xq |θ q) = E(xq |αq) we seethat θ q = αq by identifiability.Figure 4.3 shows the significant complexity reduction achived by the Strong LAP theorm. However, wenote that the Strong LAP Condition is sufficient but not necessary. For example see Appendix A.2.284.2.1 Efficiency and the choice of decompositionTheorem 22 implies that distributed composite likelihood is consistent for a wide class of decompositionsof the joint distribution; however it does not address the issue of statistical efficiency.This question has been studied empirically in the work of Meng et. al. (Meng et al. [2013, 2014]),who introduce a distributed algorithm for Gaussian random fields and consider neighbourhoods of differentsizes. Meng et. al. find the larger neighbourhoods produce better empirical results and the following theoremconfirms this observation.Theorem 23. Let A be set of nodes which satisfies the Strong LAP Condition for q. Let θˆA be the MLparameter estimate of the marginal over A. If B is a superset of A, and θˆB is the ML parameter estimate ofthe marginal over B. Then (asymptotically):|θq− (θˆB)q| ≤ |θq− (θˆA)q|.Proof. Suppose that |θq− (θˆB)q|> |θq− (θˆA)q|. Then the estimates θˆA over the various subsets A of B im-prove upon the ML estimates of the marginal on B. This contradicts the Cramer-Rao lower bound achievedthe by the ML estimate of the marginal on B.In general the choice of decomposition implies a trade-off in computational and statistical efficiency.Larger factors are preferable from a statistical efficiency standpoint, but increase computation and decreasethe degree of parallelism.4.3 Conditional LAPThe Strong LAP Argument tells us that if we construct composite likelihood factors using marginal distri-butions over domains that satisfy the Strong LAP Condition then the LAP algorithm of Chapter 3 remainsconsistent. In this section we show that more can be achieved.Once we have satisfied the Strong LAP Condition we know it is acceptable to match parameters be-tween the joint distribution p(xS, |θ) and the auxiliary distribution p(xAq , |α). To obtain a consistent LAPalgorithm from this correspondence all that is required is to have a consistent estimate of αq. In Chapter 3w.b.ved this by applying maximum likelihood estimation to p(xAq , |α), but any consistent estimator is valid.We exploit this fact to show how the Strong LAP Argument can be applied to create a consistent con-ditional LAP algorithm, where conditional estimation is performed in each auxiliary MRF. This allows usto apply the LAP methodology to a broader class of models. For some models, such as large densely con-nected graphs, we cannot rely on the marginal LAP algorithm of Chapter 3. For example, for a restrictedBoltzmann machine (RBM) (Smolensky [1986]), the 1-neighbourhood of any pairwise clique includes theentire graph. Hence, the complexity of LAP is exponential in the number of cliques. However, it is linearfor conditional LAP, without sacrificing consistency.29Theorem 24. Let q be a clique in G and let x j ∈ q⊆ Aq ⊆ S. If Aq satisfies the Strong LAP Condition for qthen p(xS|θ) and p(x j |xAq\{x j},α) share the same normalised potential for q.Proof. We can write the conditional distribution of x j given Aq \{x j} asp(x j |xAq\{x j},θ) =p(xAq |θ)∫p(xAq , |θ)dx j(4.9)Both the numerator and the denominator of Equation 4.9 are Gibbs distributions, and can therefore beexpressed in terms of potentials over clique systems.Since Aq satisfies the Strong LAP Condition for q we know that p(xAq |θ) and p(xS|θ) have the samepotential for q. Moreover, the domain of∫p(xAq |θ)dx j does not include q, so it cannot contain a potentialfor q. We conclude that the potential for q in p(x j|xAq\{x j},θ) must be shared with p(xS|θ).Remark 25. There exists a Gibbs representation normalized with respect to zero for p(x j |xAq\{x j},θ).Moreover, the clique potential for q is unique in that representation.The existence in the above remark is an immediate result of the the existence of normalized representa-tion both for the numerator and denominator of Equation 4.9, and the fact that difference of normalized po-tentials is a normalized potential. For uniqueness, first note that p(xAq |θ) = p(x j |xAq\{x j},θ)p(xAq\{x j},θ)The variable x j is not part of p(xAq\{x j},θ) and hence this distribution does not contain the clique q. Supposethere were two different normalized representations with respect to zero for the conditional p(x j |xAq\{x j},θ).This would then imply two normalised representations with respect to zero for the joint, which contradictsthe fact that the joint has a unique normalized representation.We can now proceed as in the original LAP construction from Chapter 3. For a clique of interest qwe find a set Aq which satisfies the Strong LAP Condition for q. However, instead of creating an auxiliaryparametrization of the marginal we create an auxiliary parametrization of the conditional in Equation 4.9.p(x j |xAq\{x j},α) =1Z j(α)exp(− ∑c∈CAqE(xc |αc)) (4.10)From Theorem 24 we know that E(xq |αq) = E(xq |θ q). Equality of the parameters is also obtained, pro-vided they are identifiable.Corollary 26. If Aq satisfies the Strong LAP Condition for q then any consistent estimator of αq in p(x j |xAq\{x j},α)is also a consistent estimator of θ q in p(xS |θ).The Conditional LAP enables the user to estimate the parameters when the parametric family of themarginal is unknown. However, if the parametric family is known, or even if an extended parametric familyis known, the asymptotical estimation of CLAP will not be better then the asymptotical estimation of themarginal (for proof see Appendix A.3).304.3.1 Connection to distributed Pseudo-Likelihood and composite likelihoodTheorem 24 tells us that if Aq satisfies the Strong LAP Condition for q then to estimate θ q in p(xS |θ) it issufficient to have an estimate of αq in p(x j |xAq\{x j},α) for any x j ∈ q. This tells us that it is sufficient to usepseudo-likelihood-like conditional factors, provided that their domains satisfy the Strong LAP Condition.The following remark completes the connection by telling us that the Strong LAP Condition is satisfied bythe specific domains used in the pseudo-likelihood factorisation.Remark 27. Let q = {x1,x2, ..,xm} be a clique of interest, with 1-neighbourhood Fq = q∪{N (xi)}xi∈q.Then for any x j ∈ q, the set q∪N (x j) satisfies the Strong LAP Condition for q. Moreover, q∪N (x j)satisfies the Strong LAP Condition for all cliques in the graph that contain x j.Importantly, to estimate every unary clique potential, we need to visit each node in the graph. However,to estimate pairwise clique potentials, visiting all nodes is redundant because the parameters of each pair-wise clique are estimated twice. This observation is important because it takes us back to the work of Liuand Ihler (Liu and Ihler [2012]). If a parameter is estimated more than once, it is reasonable from a statisticalstandpoint to apply the consensus operators of Liu and Ihler to obtain a single consensus estimate. The the-ory of Liu and Ihler tells us that the consensus estimates are consistent and asymptotically normal, providedEquation 4.6 is satisfied. In turn, the Strong LAP Condition guarantees the convergence of Equation 4.6.The above remark tell us that the convergence in Equation 4.6 is satisfied for the distributed pseudo-likelihood setting of Liu and Ihler. We can go beyond this and consider either marginal or conditionalfactorisations over larger groups of variables. Since the asymptotic results of Liu and Ihler (Liu and Ih-ler [2012]) apply to any distributed composite likelihood estimator where the convergence of Equation 4.6holds, it follows that any distributed composite likelihood estimator where each factor satisfies the StrongLAP Condition (including LAP and the conditional composite likelihood estimator from Section 4.3) im-mediately gains asymptotic normality and variance guarantees as a result of their work and ours.31Chapter 5Applying LAP to Gaussian graphicalmodels and discrete tablesIn this chapter, we describe the application of LAP to sparse Gaussian graphical models. In particular, weapply the LAP algorithm to the problem of estimating the inverse covariance of a Gaussian distribution sub-ject to conditional independence constraints, encoded as zeros in the inverse covariance. This is a problemthat has attracted a great deal of attention in optimization, machine learning and statistics [Dempster, 1972,Songsiri et al., 2009, Ravikumar et al., 2010].Subsequently, we discuss how to parametrize discrete probability functions which are given in tableform, in order to apply LAP to these models.While addressing these two popular models, we investigate the issue of trading-off complexity in favourof accuracy, by generalizing the first neighbourhood concept into higher orders. In addition, we discuss thememory allocation attributes of LAP.5.1 Simple example: The Gaussian distributionLet p(x) be the density of a zero mean Gaussian distribution. In the Gaussian distribution, the non-zeroclique functions are pairwise and unary, and Equations (2.2) and (2.4) take the formp(x;θ) = 1Z ∑i∑jθi jxix j, (5.1)where θi j 6= 0 if (i, j) is an edge. It is common to represent the PDF of Equation (5.1) using the matrix formp(x) = det(Σ)− 12(2pi) d2exp(−12xTΣ−1x). (5.2)Where det(Σ)− 12(2pi)d2is the analytically known partition function (which may not be easy to calculate in practice).32The graph pattern is equivalent to the sparsity pattern of the SPD matrix Σ−1. Σ−1 is known as theprecision matrix or inverse covariance since it can be shown that the precision matrix is the inverse ofthe covariance matrix Σ. It is important to mention that every marginal PDF of a multivariate Gaussian ismultivariate Gaussian on its own, with a smaller dimension.Given a set of realizations x1,x2, ...,xN we denote by D the sample covariance matrix,D = 1NN∑i=1xixiT .Our goal is to estimate the Σ−1 using D.To this end, we first show that p(x) is natively specified in the normalized potentials form. In particular,Equation (5.2) can be written asp(x) = 1Zexp(∑i, j−Σ−1i j xix j2). (5.3)The non-zero clique potentials have either one or two variables. That is, the neighbourhood structure satisfiesNi = { j : Σ−1i j 6= 0}. (5.4)The energy function in Equation (5.3) can be re-written asE(xi,i |θ i,i) = Σ−1iix2i2andE(xi, j |θ i, j) = Σ−1i j xix j.Both satisfy the normalized potential definition since{xi = 0 or x j = 0}=⇒ Σ−1i jxix j2= 0.5.1.1 Gaussian example: Using the first neighbourhoodWe demonstrate the computation of the parameters of a specific clique for the zero-mean Gaussian graphicalmodel depicted in Figure 5.1. In this detailed example, we follow the estimation of Σ−19,10, which is theparameter associated with the clique potential Σ−19,10x9x10. The first neighbourhood of the clique {9,10} is{7,8,9,10}. {Aq \q} is {7,8} and a new edge connecting {8,7} was added to the auxiliary graph, as shownin Figure 5.4. Following the second step of the LAP algorithm we have to estimate the marginal PDF which3312345 678910Figure 5.1: A simple sparsity pattern for a Gaussian graphical model. The neighbourhood system describedin the graph is compatible with the precision matrix for the multivariate Gaussian precision matrix in Figure5.2.xxxxxxxxxxxxxxxxxx xxx xxxxxxxxxxxxxxxxxΣ−1 =Figure 5.2: The precision matrix associated with the graph of Figure 5.1. The symbol× stands for non-zeroentries in the precision matrix, and Σ−1(i, j) 6= 0 ⇐⇒ (i, j) ∈ E.78910Figure 5.3: The first neighbourhood of the clique {9,10} and the auxiliary graph.34is a Gaussian restricted to x7,x8,x9,x10 and given by:fAcq (x7,x8,x9,x10) =|Σmarginal− 12 |(2pi) 42exp(−12xcqTΣ−1marginalxcq). (5.5)That is, we have to estimate the much smaller Σ−1marginal with respect to graph in Figure 5.4 and the samplecovariance matrixD7,7 D7,8 D7,9 D7,10D8,7 D8,8 D8,9 D8,10D9,7 D9,8 D9,9 D9,10D10,7 D10,8 D10,9 D10,10, (5.6)which is a 4×4 projection matrix from the full covariance matrix. (Here, is use the term projection to referto the process of extracting the relevant columns and rows from the original matrix.) We accomplish theprocedure by substituting the value calculated in Σ−1marginal(3,4) into Σ−1(9,10).5.1.2 Gaussian example: Using sub-neighbourhoodIn this section, we follow the estimation of the same parameter Σ−19,10, associated with the energy functionΣ−19,10x9x10. However this time we use a domain smaller than the first neighbourhood.There two valid alternatives for the choice of sub-neighbourhood for the clique {9,10}: either {8,9,10}or {7,9,10}. Both are shown in Figure 5.1.2, with dashed lines representing the new edges added to theauxiliary graphs because of marginalization.Here, {Aq \q} corresponds to {7,8} and a new edge connecting {8,7} was added to the auxiliary graph,as shown in Figure 5.4. The second step of the LAP algorithm is similar to the one presented in Sec-tion 5.1.1.Let us, first, restrict our attention to the domain of x8,x9,x10, with marginal:fAcq (x8,x9,x10) =|Σmarginal− 12 |(2pi) 42exp(−12xcqTΣ−1marginal10xcq). (5.7)The notation marginal10 stands for the sub neighbourhood which includes{9,10}∪N10.Note the domain is formed simply by the non-zero terms in the 10th row or column. Our goal is to estimate3589107 910Figure 5.4: Two different alternative sub neighbourhoods for the clique {9,10}. On the left, we use theunion of the neighbours of node 10 and on the right the neighbours of node 9.Σ−1marginal10 with respect to the graph in Figure 5.4 and the sample covariance matrixD8,8 D8,9 D8,10D9,8 D9,9 D9,10D10,8 D10,9 D10,10 , (5.8)which is a 3×3 projection matrix from the full covariance matrix. We accomplish the procedure by substi-tuting the value calculated in Σ−1marginal(2,3) into Σ−1(9,10).Alternatively, we can chose the domain of x7,x9,x10, with marginal:fAcq (x7,x9,x10) =|Σmarginal− 12 |(2pi) 42exp(−12xcqTΣ−1marginal9xcq). (5.9)That is, our goal is now to estimate the much smaller Σ−1marginal with respect to graph in Figure 5.4 and thesample covariance matrixD7,7 D7,9 D7,10D9,7 D9,9 D9,10D10,7 D10,9 D10,10 , (5.10)which is a different 3× 3 projection matrix from the full covariance matrix. This time we accomplish ourgoal by substituting the value calculated in Σ−1marginal(2,3) into Σ−1(9,10).The two alternative estimators explain the need for averaging in distributed estimators. For big dataapplications, the sub neighbourhoods are preferable. If one is looking to invert an SPD matrix, where thedesired inverse matrix has a known sparse pattern, it is sufficient to use LAP with sub neighbourhoods. Thebenefit not only stems from inverting a 3×3 matrix instead of a 4×4 matrix, but also from the fact that thesame sub neighbourhoods hold for several different parameters. Hence, averaging the estimates of the sameparameter for different sub neighbourhoods improves statistical efficiency.36Figure 5.5: A non trivial Gaussian graphical model (left) and the relative estimation error for LAP and MLEas a function of the number of data (right).5.2 LAP for tablesSection 5.1 may be thought of as “ideal” from some perspectives. The PDF is taken from a well knownparametric family, where every marginal distribution is a multivariate Gaussian. In this section, we assumethe PDF is a discrete distribution with no known parametric family. The graph structure G(S,E) and dataare the only sources of information. However, we also know that the PDF is described as a probability table.Any general discrete distribution may be expressed table form. We will show (by construction) thatif the table contains only positive probabilities (∀x p(x) > 0) then it can be expressed as an exponentialdistribution, in which the energy function consists of polynomials. Since xi ∈ {0,1} then for any k, xki = xi.Hence, the energy function consists multiplicative terms of xi of degree 1, such as αx1x4x5. The terms inthe energy function are linear combinations of the subsets of {x1, ...,xk}. This is reasonable since for everyk, the number of vectors in the discrete distributionf (x); x ∈ {0,1}kis 2k and the number of subsets of {x1, ...,xk} is also 2k. Hence, f (x) is spanned by a Gibbs distribution witha polynomial energy function, made of the subsets of {x1, ...,xk}, with density:f (x) = 1Zexp( ∑s⊂{1..k}θsxi1 ..xis). (5.11)For a given positive table p(x), it is clear that the partition function Z is equal to p(0) (see Figure 5.2),Z = 1p(0) .37x1 x2 x3 p(x)0 0 0 1Z0 0 1 1Z exp(c)0 1 0 1Z exp(b)0 1 1 1Z exp(c+b+ e) ⇐⇒ f (x) = 1Z exp(ax1 +bx2 + cx3 +dx1x2 + ex2x3 + f x1x3 +gx1x2x3)1 0 0 1Z exp(a)1 0 1 1Z exp(a+ c+ f )1 1 0 1Z exp(a+b+d)1 1 1 1Z exp(a+b+ c+d + e+ f +g)Figure 5.6: A discrete probability distribution in table form and in full exponential form for x ∈ {0,1}3.Next we search for vectors x with sum of entries equal to 1, ∑(xi) = 1:θ(1,0,0,..0) = log(p(1,0,0, ...0)− log(Z))θ(0,1,0,..0) = log(p(0,1,0, ...0)− log(Z))..θ(0,0,..0,1) = log(p(0,0, ...0,1)− log(Z)).Then we search for vectors with sum of entries equal to 2, ∑(xi) = 2,θ(1,1,0,..0) = log(p(1,1,0, ...0)− log(p(1,0, ..0))− log(p(0,1,0..0))− log(Z))and so on.Next, we ask "how can we learn the table of probabilities?". The trivial way is to do it by counting overthe realizations:p(xk = b) =1NN∑n=1δ (xn,k = b). (5.12)However, this may yield zero probabilities, which cannot be represented by exponential distributions. Hence,we form the table in a slightly different way.Choose ε > 0 andp(xk = b) =1N + ε (ε+N∑n=1δ (xn,k = b)). (5.13)Note that the estimator is consistent, since the full PDF p(x) is non zero for any x and the weight of any εgoes to zero as N→ ∞.38In summary, for each clique potential, the algorithm for tables proceeds as follows:• Find the first neighbourhood (or any domain that satisfies the Strong LAP condition).• Use the realizations to form the table of probabilities for the marginal neighbourhood.• convert the table into full exponential form, and pull out the terms related to the clique of interest.5.3 Improving the accuracy of LAPLAP is a cost effective alternative to ML estimation, but not identical to ML for finite sample sizes. However,one can increase the computational cost of LAP in order to improve its estimation accuracy.Recall that theorem 22 holds not only for domains that satisfy the Strong LAP condition. Clearly, if Aqsatisfies the condition, then for any larger subset B that contains Aq the condition holds. We saw the thatfirst neighbourhood is not the minimal sub-graph for which the marginal PDF shares the same normalizedpotential with the full PDF, and there is computational benefit for choosing smaller domains.On the other hand, one may take bigger domains than the first neighbourhood. This lead us to the ideaof the second neighbourhood, and, recursively, to higher order neighbourhoods.Let us first generalize the definition of the first neighbourhood, as given in (3.1),Definition 28. The k-neighbourhood of a subset q, denoted Akq, is the first neighbourhood of the subset Ak−1q .Alternatively,Definition 29. The k-neighbourhood of a subset q, denoted Akq, is the collection of nodes with distance of kor less edges from q.As proven in Theorem 23 Estimating the marginal over higher neighbourhoods will yield results closerto ML. Enlarging the marginal domain to the maximal set (i.e., to the entire graph) will simply coincidewith the ML itself. In other words, LAP can be understood as a series of nested estimators of increasingstatistical efficiency and decreasing computational efficiency.To illustrate the effect of neighbourhood size, we consider the Gaussian graphical model of Figure 5.7and two clique neighbourhoods as illustrated in the same figure. Figure 5.8 shows the relative errors obtainedusing these neighbourhoods. Clearly higher order neighbourhoods perform better, but it is not clear howimportant they are as both LAP estimators are very close to ML.5.4 Memory allocation for LAPLAP has low memory requirements. If each node has a bounded number of p neighbours, then the firstneighbourhood involves no more than p2 nodes and if working with the minimal sub neighbourhoods, onlyp nodes are needed in each local estimator.391 2 3 4 56 7 8 9 1011 12 13 14 1516 17 18 19 2021 22 23 24 25812 13 141837 8 911 12 13 14 1517 18 1923Figure 5.7: On the left, a 5×5 lattice MRF with vertical and horizontal neighbours. The middle graph is thefirst neighbourhood for the unary clique {13}. The figure on the right shows the second neighbourhood forthe unary clique {13}. New edges introduced by marginalization are depicted with dashed green lines.Figure 5.8: Relative error of parameter estimates compared to maximum likelihood for 1-neighbourhoodLAP and 2-neighbourhood LAP. The full graph contained 300 nodes and the PDF is a multivariable Gaus-sian.40Consider the sparse inverse covariance estimation problem. If there are n nodes, working with the fullPDF requires allocating memory for n2 cells. The complexity of inverting the covariance matrix is n3. Onecan apply LAP to each row, with each local estimator using the nodes with the non-zero values in the row.Thus the maximal memory allocation needed is of the order p2. This may play a significant role in largesystems where n >> p.41Chapter 6i LAP: applying LAP to inverse problemsLarge-scale inverse problems arise in a multitude of applications, including weather forecasting, medicalimaging, and oil exploration. In these big data domains, the high-dimensionality of the models that must berecovered from data creates substantial computational challenges, see for example [Roosta-Khorasani et al.,2014, Herrmann et al., 2012, Ascher and Haber, 2004, Haber and Ascher, 2001, Scheichl et al., 2013] andthe references therein.The unknown model often represents some physical property of the system under investigation. Therelationship between the model and the data is often assumed known and encoded in the form of a forwardoperator, which predicts data for a given model.The model is typically obtained by computing the Maximum a Posteriori (MAP) estimate. If the modelis high-dimensional, computing this estimate is computationally expensive and demands large amounts ofmemory.In the present chapter, drawing upon the strong LAP argument, we offer an alternative approach forsolving inverse problems. In particular, we introduce an efficient parallel algorithm, named iLAP, whichappropriately divides the large problem into smaller sub-problems of much lower dimension. This processof localization offers substantial advantages in terms of computational efficiency and memory allocation.6.1 Inverse problemsIn inverse problems, the physical system under investigation is modelled as follows:A(m)+ ε = d. (6.1)Here, m ∈ Rn is the unknown model, A : Rn→ Rk is the known (linear or non-linear) forward operator thatis assumed to be sparse in our setting, d ∈ Rk is the data and ε ∈ Rk is the noise, which is assumed to bezero mean Gaussian with known diagonal covariance matrix Σd ,ε ∼ N(0,Σd). (6.2)421234567891012345678910mdFigure 6.1: The graphical representation of the MRF in which both m and b are variablesThe inverse problem of recovering m given the data d and the known parameters θ = (A,Σ−1d ) is ill-posed in most applications of interest [Vogel, 2002, Kaipio and Somersalo, 2005]. This forces us to introducea regularizer or prior pi(m), which by Bayes rule yields the following posterior distribution:p(m|d) ∝ p(d|m)pi(m). (6.3)We refer to p(d|m) as the conditional distribution. Its form follows from the model specification:p(d|m;θ) = 1Zexp(−12(Am−d)>Σ−1d (Am−d))(6.4)Our goal is to compute the MAP estimate of the model:mˆMAP = argmaxm(exp(−12(Am−d)>Σ−1d (Am−d))pi(m)). (6.5)6.2 Localizing inverse problemsSuppose the forward operator is sparse. For example, consider the model illustrated in Figure 6.1. For eachcomponent of the data di, let {m}i denote the group of elements of the model that influence di directly. Inour example, to generate d8, we only need to know the components {m}8 = {m7,m8,m9} of the model.Since we are assuming that A is sparse, we have that |{m}i| << |m|. Moreover since the observationnoise is such that the entries of d are conditionally independent, we can express the joint model as an MRF:p(m,d) = pi(m)p(d1|{m}1)p(d2|{m}2)...p(dk|{m}k). (6.6)Our goal is to replace the global objectivep(m,d;θ) = pi(m)p(d|m;θ)43with a set of local low-dimensional objectives,p(m˜, d˜; θ˜) = pi(m˜)p(d˜|m˜; θ˜)that can be solved easily and independently.The problem with this strategy is that we do not know the values of the local parameters θ˜ or theexpression for the local prior pi(m˜) in general.However, using the Strong LAP results (Theorems 22 and 24), we will be able to construct local ob-jectives for which θ˜i = θi for i associated with mi. Once the local parameters are known, this divide-and-conquer strategy will enable us to compute each of the model components in a fully parallel manner. Usingthis distributed estimation approach, we will be able to recover the global MAP estimates provided the prioralso decomposes.6.2.1 Localizing the conditional distributionLet mi ∈ m be the entry of interest. To appeal to the Strong LAP Condition, we need to construct anappropriate neighbourhood for the local model involving mi. This construction is illustrated in Figure 6.2.Definition 30. The 1-hop data of mi is the set of all entries in d at distance 1 from mi in the graphical modelrepresentation.Definition 31. The 1-blanket of mi is the set of all nodes in the graphical model at distance 1 or less fromthe 1-hop data of mi.Proposition 32. Let {d˜,m˜} be the 1-blanket of mi. Then, p(d˜|m˜; θ˜) inherits θ˜ from θ .Proof. By construction, d˜ is the 1-hop data of mi. By conditional independence, there are no edgesin the joint graphical representation connecting two data nodes. By definition, the 1-blanket is the 1-neighbourhood of the parameters associated with the edges connecting mi and d˜. Hence, the Strong LAPCondition is satisfied for the parameters connecting mi and the 1-hop data. The result follows by the StrongLAP Theorem (Theorem 22).6.2.2 Localizing the priorThe previous subsection provided us with conditions under which the parameters of the local conditionalsare the same as the parameters of the global conditional distribution. In this subsection, we focus on theprior distribution. Let pi(m˜) be the marginal of the global prior pi(m).In some tractable cases, it is feasible to obtain an analytical expression for the local prior pi(m˜) bymarginalization. For example, this is true if the prior is Gaussian.In some applications, the prior is not expressed in terms of a function, but rather in terms of a set ofN samples {m}Ni=1. Here the local prior is straightforwardly obtained by discarding the samples of modelcomponents not associated with the marginal.44m12345678910d123456789106567m6 d˜ 45678567d˜m˜A :Figure 6.2: Construction of the local models. Suppose we are interested in estimating m6. Then, d˜ ={d5,d6,d7} are the 1-hop data of m6. m˜ = {m4,m5,m6,m7,m8} are the components of the model that affectd˜ directly. The 1-blanket consisting of d˜ and m˜ constitutes the 1-neighbourhood of the parameters θ65, θ66and θ67.In the above two cases, if the conditional distributions factorises, the posterior also factorises and we areable to recover the global MAP estimates by computing local MAP estimates.In general, the prior simply encodes sparseness and smoothness assumptions. For example, one mayuse the `1 norm to construct the prior when the model is assumed to be sparse. In this situation, we canno longer localize the prior and hence we cannot guarantee that the local estimates coincide with the globalMAP estimates. However, denoising and deblurring experiments, using the `1 norm on the local model, willshow that the iLAP approach cab be effective even in these situations.458910910m˜d˜67891078910m˜d˜Figure 6.3: The 1-blanket (left) and 2-blanket (right) of m10.6.3 The iLAP algorithmFollowing the construction of the 1-neighbourhood of the parameters of the conditional distribution, theiLAP algorithm is as follows:Algorithm 2 iLAP (1-blanket)Input: Forward operator A and observation dfor mi ∈m doFind d˜⊆ d; the 1-hop data of mi.Find {m˜, d˜}; the 1-blanket of mi.Find the local prior as described in Section 6.2.2.Construct the local objective function: pi(m˜)p(d˜|m˜).Compute the local MAP estimate of mi by maximizing the local objective.end forRemark 33. For any m˘, s.t.m˜⊂ m˘one can define the augmented local inverse problem by pi(m˘)p(d˜|m˜) becausep(d˜|m˜) = p(d˜|m˘).The choice of a correct local objective is not unique and one can consider larger neighbourhoods. Inparticular, we can define a 2-blanket as follows.Definition 34. The 2-hop data of mi is the set of all entries in d with distance 1 from the 1-blanket mi in thegraphical model.Definition 35. The 2-blanket of mi is the set of all nodes in the graphical model at distance 1 or less fromthe 2-hop data of mi.Proceeding in a similar fashion, we can generalize to the k-blanket of mi. Figure 6.3 depicts the 1-blanketand the 2-blanket of m10 for the model shown in Figure 6.1.46As a result of the previous observation, we can select larger subsets of m, as opposed to a single entry,and perform local MAP estimation in blocks. We follow this strategy in the experiments presented in thefollowing section.6.4 Image deblurring example using the DCT and wavelet transformsFigure 6.4: MAP estimation using iLAP (1-blanket) and 4× 4 blocks. Top left: true model, top right:data, middle left: full inverse reconstruction with DCT transform, middle right: iLAP with DCT transform,bottom left: full inverse reconstruction with wavelet transform, bottom right: iLAP with wavelet transform.We consider the problem of recovering a 128×128 image m that has been corrupted by a global blurring47operator and noise.Specifically, we generate the corrupted image by convolving it with a Gaussian blurring kernel withstandard deviation 1.2 in the spatial domain. Subsequently, 2% white Gaussian noise is added to the blurredimage.In order to recover the image, the dense blurring operator is made sparse by by setting to zero all entriesof A that fall below 10−4. The recovery is done by basis pursuit denoising with an `1 regularizer:min‖f‖1 subject to ‖AH−1f−d‖22 ≤ α, (6.7)where f = Hm is the transformed model and H is the transformation (DCT or wavelet bases). Both A andH are of size 16,384×16,384. We refer the reader to Mallat [2009] or Chan and Shen [2005] for a detaileddiscussion regarding DCT and wavelet transforms.The above denoising objective for a single image states that the model is sparse in the transformeddomain (frequency or wavelet domain). This is a reasonable assumption in many applications. We solve theoptimization problem using the spgl1 package [van den Berg and Friedlander, 2009, 2011].For the DCT, we considered 4× 4 blocks of pixels. For this size of local estimates and the truncatedGaussian kernel of radius 5, the 1-blankets and 2-blankets were augment to square patches of size 24×24and 44×44 respectively. They could be made smaller since the kernel is circular, but by making the squarethey become considerably easier to code. For the wavelet transform, we had to consider blocks of size 8×8with 1-blankets of radius 32.The recovery results are shown in Figure 6.4. In this particular example, the local approach appears todo better than the global approach. This however may not necessarily generalize to other images.In Figure 6.5 we show the relative error for the global and iLAP algorithms using the Wavelet transformas a function of the regularization coefficient α . The relative error is calculated as follows:Error = ||m∗−m||||m∗||,where m∗ is the true image and m is the deblurred image. While the optimal range of the coefficient varies,the relative error of iLAP is similar to the one of the global MAP estimator.Finally, Figure 6.5 compares the results for the 1-blanket and 2-blanket estimators using the DCT. Thereappears to be no gain for using larger neighbourhoods in this case. This may be attributed to the sparsityprior.It is both interesting and reassuring that similar ideas, but solely in the context of image deblurring,have been studied since the seminal work in Trussell and Hunt [1978]. Our iLAP approach provides a moregeneral framework for understanding these approximations and for developing more powerful algorithms.If, for example, m contain 106 entries (in the case of 1000×1000 pixels), solving each entry at a timerequires 106 local solvers, while solving in 10×10 blocks requires only 104 solvers.In Figure 6.6 we present the comparison of a 4×4 blocks 1-blanket reconstruction and a 4×4 blocks48Figure 6.5: Relative error of global MAP estimate (left) and iLAP distributed estimates (right) using thewavelet transform. While the values of the regularization coefficient are different, the relative errors at theoptimum values are very close.Figure 6.6: Recovery using iLAP with the first-blanket (left) and the second-blanket (right) with blocks ofsize 4×4 and the DCT.2-blanket reconstruction, for the same problem presented in Section 6.4.49Chapter 7Concluding remarks and future workThe key to this work was the somehow under-appreciated theorem on the existence and uniqueness of thenormalized Gibbs representation of an MRF.We attacked an important question: Under what conditions do marginal MRFs have the same potentialsas the full MRF? The answer to this question is the Strong LAP Theorem.The LAP algorithm for distributed learning in MRFs, both in its marginal and conditional forms, is justa way to exploit this theoretical result. iLAP for inverse problems is another example of the utility of thetheory advanced in this thesis. We believe many more applications abound.However, there remain several open questions, mainly regarding iLAP. For example, should a biggerblanket improve the estimation accuracy? How can we generalize iLAP to the non-independent noise ordense operator scenarios?The following subsections discuss a few additional areas that merit further research.7.1 Corrupted dataWe discussed parameter estimation using data that was sampled from the true PDF. An exciting generaliza-tion for LAP will be to consider cases in which the data is corrupted. Naturally we would like to considerfirst the case where the data is locally corrupted. That is, in each sample noise is added to a small numberof unknown entries. Filtering the noise (that is, for each sample learning for which entries the noise wasadded, and ignoring these) may be a complicated task when handling the entire sample, and one can benefitfrom reducing the dimension.Let x1..xq be q locally corrupted samples taken from the PDF p(x|θ). Using LAP, for each parameterwe find a local domain, project the samples x1..xq on that domain, then estimate the marginal PDF. Filteringthe sample projections may be a more reasonable task, since we can consider each projection as "corrupted"or "not corrupted".507.2 Structure learningAssume the set of samples is taken from the true PDF, but that the graph structure is not known. We wishto estimate the graph structure using the sparsity assumption. In other words, each node has only fewneighbours but these neighbours are not known.The main difficulty lays in the fact that even if the number of direct neighbours for each node is limited,the total number of combinations is intractable. Using LAP, we would like to explore new local approaches,which may simplify the problem. The Basis for such approach have to be a local structure learning. Thatis, for each node we estimate the potentials locally, rating different local structures independently, and thencombine these local structures into a global one.7.3 Tied parametersWe ignored the possibility that the entries of the parameters vector θ may dependent on each other. If, forexample, the same parameter appears in several different clique potentials. In such case, one may naivelytake the following two steps:• Ignore the relation between the entries, by assuming that the PDF is given byf (x;θ) = 1Z(θ)exp(∑c∈CVC(xc;θc)). (7.1)where each entry is independent from the other. Use LAP to find the estimate θˆ .• Estimate each entry θi by averaging the relevant estimates in θˆ .Clearly, this leads to a consistent estimator since it is known thatθˆ n→ ∞−−−→ θtrue (7.2)for each of the local estimators.However, equally simple averaging is not an optimal method. An open question is: What is the bestway to choose averaging weights? Regardless, it is clear that LAP should be applied to domains with tiedparameters, including conditional random fields, to assess its merits. The one thing that LAP has going foritself is that it is data efficient, model efficient, and consistent. Therefore, in the age of big data, it shouldbecome an important player in the field of parameter learning.51BibliographyD. H. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. CognitiveScience, 9:147–169, 1985.A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen.Interactive digital photomontage. In ACM SIGGRAPH, pages 294–302, 2004.U. Ascher and E. Haber. Computational methods for large distributed parameter estimation problems withpossible discontinuities. In Proceedings of the Symposium on Inverse Problems, Design andOptimization, pages 201–208, 2004.A. Asuncion, Q. Liu, A. Ihler, and P. Smyth. Learning with blocks: Composite likelihood and contrastivedivergence. In Artificial Intelligence and Statistics, pages 33–40, 2010.J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal StatisticalSociety, Series B, 36:192–236, 1974.J. Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society. Series D, 24(3):179–195, 1975.J. Besag. Comments on Conditionally Specified Distributions: An Introduction by B. Arnold, E. Castilloand J. Sarabia. Statistical Science, 16(3):249–265, 2001.Y. Boykov and O. Veksler. Graph cuts in vision and graphics: Theories and applications. In N. Paragios,Y. Chen, and O. Faugeras, editors, Handbook of Mathematical Models in Computer Vision, chapter 5,pages 79–96. Springer, 2006.J. Bradley. Learning Large-Scale Conditional Random Fields. PhD thesis, Machine Learning Department,Carnegie-Mellon University, 2013.J. K. Bradley and C. Guestrin. Sample complexity of composite likelihood. In Artificial Intelligence andStatistics, pages 136–160, 2012.A. Braunstein, M. Mezard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. RandomStructures and Algorithms, (2):201–226, 2005.P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 2001.D. Buchman, M. W. Schmidt, S. Mohamed, D. Poole, and N. de Freitas. On sparse, spectral and otherparameterizations of binary probabilistic models. Journal of Machine Learning Research - ProceedingsTrack, 22:173–181, 2012.52T. Chan and J. Shen. Image Processing and Analysis: Variational, PDE, Wavelet and Stochastic Methods.SIAM, 2005.X. Chen, B. Neubert, Y. Xu, O. Deussen, and S. B. Kang. Sketch-based tree modeling using Markovrandom fields. In ACM SIGGRAPH Asia, pages 1–9, 2008.B. Cox. Composite likelihood methods. Contemporary Mathematics, 80:221–239, 1988.A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.M. Denil and N. de Freitas. Toward the implementation of a quantum RBM. In NIPS Deep Learning andUnsupervised Feature Learning Workshop, 2011.J. V. Dillon and G. Lebanon. Stochastic composite likelihood. Journal of Machine Learning Research, 11:2597–2633, 2010.S. E. Fienberg and A. Rinaldo. Maximum likelihood estimation in log-linear models. The Annals ofStatistics, 40(2):996–1023, 2012.R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of theRoyal Society of London Series A, 222:309–368, 1922.M. Galley. A skip-chain conditional random field for ranking meeting utterances by importance. InEmpirical Methods in Natural Language Processing, pages 364–372, 2006.D. Griffeath. Introduction to random fields. In Denumerable Markov Chains, volume 40 of Graduate Textsin Mathematics, pages 425–458. Springer, 1976.E. Haber and U. Ascher. Preconditioned all-at-one methods for large, sparse parameter estimationproblems. Inverse Problems, 17:1847–1864, 2001.J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. 1971.F. J. Herrmann, M. P. Friedlander, and O. Yilmaz. Fighting the curse of dimensionality: compressivesensing in exploration seismology. IEEE Signal Processing Magazine, 29:88–100, 2012.G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2000.J. J. Hopfield. Neurons with graded response have collective computational properties like those oftwo-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984.A. Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of MachineLearning Research, 6:695–709, 2005.A. Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood forcontinuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.M. I. Jordan. An introduction to probabilistic graphical models. 2002.J. Kaipio and E. Somersalo. Statistical and computational inverse problems. Springer, 2005.53R. Kindermann and J. L. Snell. Markov Random Fields and their Applications. American MathematicalSociety, 1980.D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models forsegmenting and labeling sequence data. In International Conference on Machine Learning, pages282–289, 2001.S. Lauritzen. Graphical models. Oxford University Press, 1996.S. Z. Li. Markov random field modeling in image analysis. Springer-Verlag, 2001.P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihoodestimators. In International Conference on Machine Learning, pages 584–591, 2008.Q. Liu and A. Ihler. Distributed parameter estimation via pseudo-likelihood. In International Conferenceon Machine Learning, 2012.D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.S. Mallat. A Wavelet Tour of Signal Processing: the Sparse Way. Academic Press, 2009.K. V. Mardia, J. T. Kent, G. Hughes, and C. C. Taylor. Maximum likelihood estimation using compositelikelihoods for closed exponential families. Biometrika, 96(4):975–982, 2009.E. Marinari, G. Parisi, and J. Ruiz-Lorenzo. Numerical simulations of spin glass systems. Spin Glassesand Random Fields, pages 59–98, 1997.B. Marlin and N. de Freitas. Asymptotic efficiency of deterministic estimators for discrete energy-basedmodels: Ratio matching and pseudolikelihood. In Uncertainty in Artificial Intelligence, pages 497–505,2011.B. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machinelearning. In Artificial Intelligence and Statistics, pages 509–516, 2010.Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Distributed learning of Gaussian graphical models viamarginal likelihoods. In Artificial Intelligence and Statistics, pages 39–47, 2013.Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Marginal likelihoods for distributed parameter estimationof Gaussian graphical models. Technical report, arXiv:1303.4756, 2014.K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.S. Nowozin. Constructing composite likelihoods in general random fields. In ICML Workshop onInferning: Interactions between Inference and Learning, 2013.S. Okabayashi, L. Johnson, and C. Geyer. Extending pseudo-likelihood for Potts models. Statistica Sinica,21(1):331–347, 2011.P. Ravikumar, M. J. Wainwright, and J. D. Lafferty. High-dimensional Ising model selection using`1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.54F. Roosta-Khorasani, K. van den Doel, and U. Ascher. Stochastic algorithms for inverse problemsinvolving PDEs and many measurements. SIAM Journal of Scientific Computing, 2014.R. Scheichl, S. Kindermann, M. A. Freitag, and M. Cullen. Large Scale Inverse Problems: ComputationalMethods and Applications in the Earth Sciences. De Gruyter, 2013.P. Smolensky. Information processing in dynamical systems: foundations of harmony theory. Paralleldistributed processing: explorations in the microstructure of cognition, 1:194–281, 1986.J. Songsiri, J. Dahl, and L. Vandenberghe. Maximum-likelihood estimation of autoregressive models withconditional independence constraints. In IEEE International Conference on Acoustics, Speech andSignal Processing, pages 1701–1704, 2009.C. Sutton and A. McCallum. An introduction to conditional random fields. Foundations and Trends inMachine Learning, 4(4):267–373, 2012.K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. Freitas. On autoencoders and score matching forenergy based models. In International Conference on Machine Learning, pages 1201–1208, 2011.R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother.A comparative study of energy minimization methods for Markov random fields with smoothness-basedpriors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):1068–1080, 2008.H. Trussell and B. Hunt. Sectioned methods for image restoration. IEEE Transactions on Acoustics,Speech and Signal Processing, 26(2):157–164, 1978.E. van den Berg and M. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM Journalon Scientific Computing, 31(2):890–912, 2009.E. van den Berg and M. Friedlander. Sparse optimization with least-squares constraints. SIAM Journal onOptimization, 21(4):1201–1229, 2011.A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998.C. Varin, N. Reid, and D. Firth. An overview of composite likelihood methods. Statistica Sinica, 21:5–42,2011.C. Vogel. Computational methods for inverse problems. SIAM, 2002.M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference.Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.L. Wasserman. All of Statistics. Springer, 2004.A. Wiesel and A. Hero III. Distributed covariance estimation in Gaussian graphical models. IEEETransactions on Signal Processing, 60(1):211–220, 2012.C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chainprediction. In T. Speed and H. Huang, editors, Research in Computational Molecular Biology, volume4453 of Lecture Notes in Computer Science, pages 381–395. Springer, 2007.L. Younes. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and RelatedFields, 82(4):625–645, 1989.55Appendix AA.1 Equivalent definitions of the 1-neighbourhoodFor a given clique, q, define its 1- neighborhood Aq by :Âq = ∪c, ∀c ∈ C s.t. c∩q 6= /0 (A.1)as the union of all cliques with non empty intersection. Or, alternatively, as the union of all nodes withdistance one or less from qA˜q = q∪Ni, ∀i ∈ q. (A.2)We show here the Equivalence of these two definitions.Proof. Clearly q⊆ Âq and q⊆ A˜q.⇒ Let i ∈ A˜q, and i /∈ q. Then, i ∈N j for j ∈ q and the subset {i, j} is a clique with non empty intersectionwith q. Therefore i ∈ Âq.⇐ Let i ∈ Âq. Then i ∈ ci, where ci is a clique with non empty intersection with q,∃ j ∈ q∩ ci,ci is a clique, then i ∈N j and i ∈ A˜q.A.2 Strong LAP condition is sufficient but not necessaryIn this subsection we prove by example that the strong LAP condition is not a necessary condition. Let thegraph be as in Figure A.1. Our interest lays in the parametric estimation for the clique {1,2,3}.The marginal PDF over the domain {1,2,3} is achieved by integrating over the nodes i, j and k andwill not satisfies the Strong lap condition with respect to the clique {1,2,3}, as all the possible edges in theclique are formed in the induced MRF. The edge {1,2} by integration over i, the edge {2,3} by integrationover j and the edge {1,3} by integration over k. However, the induced MRF consist only pairwise cliques,56123i jkFigure A.1: A simple graph. Our interest is in the clique {1,2,3}and tough the clique {1,2,3} is formed in the graph, the clique potential will not be formed. Hence, themarginal over {1,2,3} and the full PDF shares the same potential over the clique of interest.A.3 Conditional estimator can not be better than marginal estimatorLet p(x;θ) be the full MRF, let q be the clique of interest and Aq be the marginal domain. We assume onecan not find the exact marginal parametric family, i.e., the exact solution ofpAq(xAq) =∫p(x);θ)dxS\Aqis not known. In such cases we say that the parametric family is not integrable. In particular we note that themarginal is combined from Energy potentials which appeared originally in the full PDF and are all known,and the induced Energy potentials which are not known.On the other hand, the conditional p(xq|xAq\q) is known and can be directly derived from the full PDF.Moreover, we have shown that the full PDF and the conditional shares the same Energy potential over theclique of interest, by that observation we derived the CLAP algorithm.The marginal PDF can be written aspAq(xAq) = p(xAq\q;α)p(xq|xAq\q).Again, we assume the exact parametric form of p(xAq\q;α) cannot be derived directly from p(x;θ).However, by expanding the parametric family and introducing more parameters one may find largerfamily, p(xAq\q;β ) that will contain p(xAq\q;α).In such a case, the marginal model is taken from the larger family:pAq(xAq) = p(xAq\q;β )p(xq|xAq\q).57Now, there may be two different approaches. One may either use CLAP (which is not data efficient) orestimate the marginal (for which the model is not optimal).Proposition 36. Asymptotically, the estimation of Eq(xq;θq) derived from the conditional estimation cannot be better then the the estimation of Eq(xq;θq) derived from the marginal estimation.Proof. If the estimation of θq will be better the estimation of the marginal, one would be able to improvethe ML estimator for pAq(xAq) = p(xAq\q;β )p(xq|xAq\q). This is in contradiction to the Cramer-rao boundachieved by the MLE.58
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Linear and parallel learning of Markov random fields
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Linear and parallel learning of Markov random fields Dror Mizrahi, Yariv 2014
pdf
Page Metadata
Item Metadata
Title | Linear and parallel learning of Markov random fields |
Creator |
Dror Mizrahi, Yariv |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | In this thesis, we introduce a new class of embarrassingly parallel parameter learning algorithms for Markov random fields (MRFs) with untied parameters, which are efficient for a large class of practical models. The algorithms parallelize naturally over cliques and, for graphs of bounded degree, have complexity that is linear in the number of cliques. We refer to these algorithms with the acronym LAP, which stands for Linear And Parallel. Unlike their competitors, the marginal versions of the proposed algorithms are fully parallel and for log-linear models they are also data efficient, requiring only the local sufficient statistics of the data to estimate parameters. LAP algorithms are ideal for parameter learning in big graphs and big data applications. The correctness of the newly proposed algorithms relies heavily on the existence and uniqueness of the normalized potential representation of an MRF. We capitalize on this theoretical result to develop a new theory of correctness and consistency of LAP estimators corresponding to different local graph neighbourhoods. This theory also establishes a general condition on composite likelihood decompositions of MRFs that guarantees the global consistency of distributed estimators, provided the local estimators are consistent. We introduce a conditional variant of LAP that enables us to attack parameter estimation of fully-observed models of arbitrary connectivity, including fully connected Boltzmann distributions. Once again, we show consistency for this distributed estimator, and relate it to distributed pseudo-likelihood estimators. Finally, for linear and non-linear inverse problems with a sparse forward operator, we present a new algorithm, named iLAP, which decomposes the inverse problem into a set of smaller dimensional inverse problems that can be solved independently. This parallel estimation strategy is also memory efficient. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-12-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0167069 |
URI | http://hdl.handle.net/2429/51458 |
Degree |
Doctor of Philosophy - PhD |
Program |
Mathematics |
Affiliation |
Science, Faculty of Mathematics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_february_drormizrahi_yariv.pdf [ 962.68kB ]
- Metadata
- JSON: 24-1.0167069.json
- JSON-LD: 24-1.0167069-ld.json
- RDF/XML (Pretty): 24-1.0167069-rdf.xml
- RDF/JSON: 24-1.0167069-rdf.json
- Turtle: 24-1.0167069-turtle.txt
- N-Triples: 24-1.0167069-rdf-ntriples.txt
- Original Record: 24-1.0167069-source.json
- Full Text
- 24-1.0167069-fulltext.txt
- Citation
- 24-1.0167069.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0167069/manifest